Prints PDF
Prints PDF
Prints PDF
b) Let X1:n be a random sample from a N(θ, 1) distribution. However, only the
largest value of the sample, Y = max(X1 , . . . , Xn ), is known. Show that the
density of Y is
where Φ(·) is the distribution function and ϕ(·) is the density function of the
standard normal distribution N(0, 1). Derive the distribution function of Y and
the likelihood function L(θ).
2 2 Likelihood 3
follows:
0.8
d
fY (y) = FY (y) 0.6
dy
L(θ)
= n{Φ(y − θ)} n−1
φ(y − θ) 0.4
and the likelihood function L(θ) is exactly this density function, but seen as a 0.2
bution, cf . Appendix A.5.2. Here θ ∈ R denotes the location parameter of the 0.00 0.05 0.10 0.15 0.20
1 (b)
0.6 2. A first-order autoregressive process X0 , X1 , . . . , Xn is specified by the conditional
0.5
distribution
θ
◮ The likelihood is given by
iii. L(θ) in 1c) if the observed sample is x = (0, 5, 9).
◮ Here the vector computations are very useful: L(α) = f (x1 , . . . , xn | x0 ; α)
> ## likelihood for location parameter of Cauchy in = f (xn | xn−1 , . . . , x1 , x0 ; α)f (xn−1 | xn−2 , . . . , x0 ; α) · · · f (x1 | x0 ; α)
> ## random sample
n
Y
> likelihood3 <- function(theta, # location parameter
x) # observed data vector = f (xi | xi−1 , . . . , x0 ; α)
{ i=1
1/pi^3 / prod((1+(x-theta)^2))
1 1
Yn
} = √ exp − (xi − αxi−1 )2
> ## In order to plot the likelihood, the function must be able 2π 2
i=1
> ## to take not only one theta value but a theta vector. We can use the
n
> ## following trick to get a vectorised likelihood function: Y 1
= exp − (xi − αxi−1 )2
> likelihood3vec <- Vectorize(likelihood3,
2
vectorize.args="theta") i=1
> x <- c(0, 5, 9)
( n )
X 1
2
> theta <- seq(from=-5, to=15, = exp − (xi − αxi−1 ) .
length = 1000) 2
i=1
> plot(theta, likelihood3vec(theta, x=x),
type="l", The log-likelihood kernel is thus
xlab=expression(theta), ylab=expression(L(theta)),
n
main="1 (c)") 1X
l(α) = log L(α) = − (xi − αxi−1 )2 .
1 (c)
2
i=1
6e−05 b) Derive the score equation for α, compute α̂ML and verify that it is really the
maximum of l(α).
4e−05 ◮ The score function for α is
L(θ)
dl(α)
S(α) =
2e−05 dα
n
1X
=− 2(xi − αxi−1 ) · (−xi−1 )
0e+00
2
i=1
n
X
−5 0 5 10 15
= xi xi−1 − αx2i−1
θ
i=1
n
X n
X
= xi xi−1 − α x2i−1 ,
i=1 i=1
6 2 Likelihood 7
l(α)
This is really a local maximum of the log-likelihood function, because the latter −8
distribution? a) What is the parameter space of φ? See Table A.3 in the Appendix for details
◮ The log-likelihood function is on the multinomial distribution and the parameter space of π.
◮ The first requirement for the probabilities is satisfied for all φ ∈ R:
l(π) = log f (x; π) = log(π) + (x − 1) log(1 − π),
4
X 1 1
so the score function is πj = 2 + φ + 2(1 − φ) + φ = (4 + 2φ − 2φ) = 1.
4 4
j=1
d 1 x−1
S(π) = l(π) = − . Moreover, each probability πj (j = 1, . . . , 4) must lie in the interval (0, 1). We
dπ π 1−π
thus have
Solving the score equation S(π) = 0 yields the MLE π̂ML = 1/x. The Fisher infor-
2+φ
mation is 0< < 1 ⇐⇒ 0 < 2 + φ < 4 ⇐⇒ −2 < φ < 2,
d 1 x−1 4
I(π) = − S(π) = 2 + ,
dπ π (1 − π)2 1−φ
0< < 1 ⇐⇒ 0 < 1 − φ < 4 ⇐⇒ −3 < φ < 1, (2.1)
4
which is positive for every 0 < π < 1, since x ≥ 1 by definition. Thus, 1/x indeed
φ
maximises the likelihood. 0 < < 1 ⇐⇒ 0 < φ < 4. (2.2)
4
For a realisation x1:n of a random sample from this distribution, the quantities
calculated above become Hence, (2.2) and (2.1) imply the lower and upper bounds, respectively, for the
range 0 < φ < 1. This is the intersection of the sets suitable for the probabilities.
n
X
l(π) = log f (xi ; π) b) Show that the likelihood kernel function for φ, based on the observation x, has
i=1 the form
Xn L(φ) = (2 + φ)m1 (1 − φ)m2 φm3
= log(π) + (xi − 1) log(1 − π)
i=1 and derive expressions for m1 , m2 and m3 depending on x.
= n log(π) + n(x̄ − 1) log(1 − π), ◮ We derive the likelihood kernel function based on the probability mass func-
d tion:
S(π) = l(π)
dπ 4
Y
n!
n n(x̄ − 1) L(φ) = Q4
x
πj j
= − , x !
π 1−π j=1 j j=1
x x x x
d 2+φ 1 1−φ 2 1−φ 3 φ 4
and I(π) = − S(π) ∝
dπ 4 4 4 4
n n(x̄ − 1) x1 +x2 +x3 +x4
= 2+ . 1
π (1 − π)2 = (2 + φ)x1 (1 − φ)x2 +x3 φx4
4
The Fisher information is again positive, thus the solution 1/x̄ of the score equation ∝ (2 + φ)m1 (1 − φ)m2 φm3
is the MLE.
5. A sample of 197 animals has been analysed regarding a specific phenotype. The with m1 = x1 , m2 = x2 + x3 and m3 = x4 .
number of animals with phenotypes AB, Ab, aB and ab, respectively, turned out c) Derive an explicit formula for the MLE φ̂ML , depending on m1 , m2 and m3 .
to be Compute the MLE given the data given above.
x = (x1 , x2 , x3 , x4 )⊤ = (125, 18, 20, 34)⊤ . ◮ The log-likelihood kernel is
A genetic model now assumes that the counts are realizations of a multinomially l(φ) = m1 log(2 + φ) + m2 log(1 − φ) + m3 log(φ),
distributed multivariate random variable X ∼ M4 (n, π) with n = 197 and proba-
bilities π1 = (2 + φ)/4, π2 = π3 = (1 − φ)/4 and π4 = φ/4 (Rao, 1973, p. 368).
10 2 Likelihood 11
S(φ) =
dl(φ) 6. Show that h(X) = maxi (Xi ) is sufficient for θ in Example 2.18.
dφ ◮ From Example 2.18, we know that the likelihood function of θ is
m1 m2 m3
= + (−1) + (Qn
2+φ 1−φ φ 1
i=1 f (xi ; θ) = θn for θ ≥ maxi (xi ),
m1 (1 − φ)φ − m2 (2 + φ)φ + m3 (2 + φ)(1 − φ) L(θ) = .
= . 0 otherwise
(2 + φ)(1 − φ)φ
The score equation S(φ) = 0 is satisfied if and only if the numerator in the We also know that L(θ) = f (x1:n ; θ), the density of the random sample. The density
expression above equals zero, i. e. if can thus be rewritten as
1
0 = m1 (1 − φ)φ − m2 (2 + φ)φ + m3 (2 + φ)(1 − φ) f (x1:n ; θ) = I[0,θ] (max(xi )).
θn i
= m1 φ − m1 φ2 − 2m2 φ − m2 φ2 + m3 (2 − 2φ + φ − φ2 ) 1
Hence, we can apply the Factorisation theorem (Result 2.2) with g1 (t; θ) = θn I[0,θ] (t)
= φ2 (−m1 − m2 − m3 ) + φ(m1 − 2m2 − m3 ) + 2m3 .
and g2 (x1:n ) = 1 to conclude that T = maxi (Xi ) is sufficient for θ.
This is a quadratic equation of the form aφ2 + bφ + c = 0, with a = −(m1 + m2 + 7. a) Let X1:n be a random sample from a distribution with density
m3 ), b = (m1 − 2m2 − m3 ) and c = 2m3 , which has two solutions φ0/1 ∈ R given (
exp(iθ − xi ) xi ≥ iθ
by √ f (xi ; θ) =
−b ± b2 − 4ac 0 xi < iθ
φ0/1 = .
2a
for Xi , i = 1, . . . , n. Show that T = mini (Xi /i) is a sufficient statistic for θ.
There is no hope for simplifying this expression much further, so we just imple-
◮ Since xi ≥ iθ is equivalent to xi /i ≥ θ, we can rewrite the density of the i-th
ment it in R, and check which of φ0/1 is in the parameter range (0, 1):
observation as
> mle.phi <- function(x)
{ f (xi ; θ) = exp(iθ − xi )I[θ,∞) (xi /i).
m <- c(x[1], x[2] + x[3], x[4])
a <- - sum(m) The joint density of the random sample then is
b <- m[1] - 2 * m[2] - m[3]
n
Y
c <- 2 * m[3]
phis <- (- b + c(-1, +1) * sqrt(b^2 - 4 * a * c)) / (2 * a) f (x1:n ; θ) = f (xi )
correct.range <- (phis > 0) & (phis < 1) i=1
return(phis[correct.range])
( n
! ) n
X Y
} = exp θ i − nx̄ I[θ,∞) (xi /i)
> x <- c(125, 18, 20, 34)
i=1 i=1
> (phiHat <- mle.phi(x))
n(n + 1)
[1] 0.6268215
= exp θ I[θ,∞) (min(xi /i)) · exp{−nx̄} .
2 i | {z }
Note that this example is also used in the famous EM algorithm paper (Dempster | {z } =g2 (x1:n )
=g2 (h(x1:n )=mini (xi /i) ;θ)
et al., 1977, p. 2), producing the same result as we obtained by using the EM
algorithm (cf. Table 2.1 in Subsection 2.3.2). The result now follows from the Factorisation Theorem (Result 2.2). The crucial
√ Qn
d) What is the MLE of θ = φ? step is that i=1 I[θ,∞) (xi /i) = I[θ,∞) (mini (xi /i)) = I[θ,∞) (h(x1:n )).
◮ From the invariance property of the MLE we have Now we will show minimal sufficiency (required for next item). Consider the
q likelihood ratio
θ̂ML = φ̂ML ,
Λx1:n (θ1 , θ2 ) = exp (θ1 − θ2 )n(n + 1)/2 I[θ1 ,∞) (h(x1:n ))/I[θ2 ,∞) (h(x1:n )).
which in the example above gives θ̂ML ≈ 0.792:
> (thetaHat <- sqrt(phiHat))
12 2 Likelihood
If Λx1:n (θ1 , θ2 ) = Λx̃1:n (θ1 , θ2 ) for all θ1 , θ2 ∈ R for two realisations x1:n and x̃1:n ,
then necessarily
3 Elements of frequentist inference
I[θ1 ,∞) (h(x1:n )) I[θ ,∞) (h(x̃1:n ))
= 1 . (2.3)
I[θ2 ,∞) (h(x1:n )) I[θ2 ,∞) (h(x̃1:n ))
Now assume that h(x1:n ) 6= h(x̃1:n ), and, without loss of generality, that
h(x1:n ) > h(x̃1:n ). Then for θ1 = {h(x1:n ) + h(x̃1:n )}/2 and θ2 = h(x1:n ), we
obtain 1 on the left-hand side of (2.3) an 0 on the right-hand side. Hence,
h(x1:n ) = h(x̃1:n ) must be satisfied for the equality to hold for all θ1 , θ2 ∈ R, and
so the statistic T = h(X1:n ) is minimal sufficient.
b) Let X1:n denote a random sample from a distribution with density 1. Sketch why the MLE
M ·n
f (x; θ) = exp{−(x − θ)}, θ < x < ∞, −∞ < θ < ∞. N̂ML =
x
Derive a minimal sufficient statistic for θ. in the capture-recapture experiment (cf . Example 2.2) cannot be unbiased. Show
◮ We have a random sample from the distribution of X1 in (7a), hence we that the alternative estimator
proceed in a similar way. First we rewrite the above density as
(M + 1) · (n + 1)
N̂ = −1
f (x; θ) = exp(θ − x)I[θ,∞) (x) (x + 1)
and second we write the joint density as is unbiased if N ≤ M + n.
f (x1:n ; θ) = exp(nθ − nx̄)I[θ,∞) (min(xi )). ◮ If N ≥ n + M , then X can equal zero with positive probability. Hence, the
i
MLE
By the Factorisation Theorem (Result 2.2), the statistic T = mini (Xi ) is suffi- N̂ML =
M ·n
cient for θ. Its minimal sufficiency can be proved in the same way as in (7a). X
8. Let T = h(X1:n ) be a sufficient statistic for θ, g(·) a one-to-one function and T̃ = can be infinite with positive probability. It follows that the expectation of the MLE
h̃(X1:n ) = g{h(X1:n )}. Show that T̃ is sufficient for θ. is infinite if N ≥ M +n and so cannot be equal to the true parameter value N . We
◮ By the Factorisation Theorem (Result 2.2), the sufficiency of T = h(X1:n ) for have thus shown that for some parameter values, the expectation of the estimator
θ implies the existence of functions g1 and g2 such that is not equal to the true parameter value. Hence, the MLE is not unbiased.
f (x1:n ; θ) = g1 {h(x1:n ); θ} · g2 (x1:n ). To show that the alternative estimator is unbiased if N ≤ M + n, we need to
compute its expectation. If N ≤ M + n, the smallest value in the range T of the
If we set g̃1 := g1 ◦ g −1
, we can write possible values for X is max{0, n − (N − M )} = n − (N − M ). The expectation of
f (x1:n ; θ) = g1 (g −1 [g{h(x1:n )}]; θ) · g2 (x1:n ) = g̃1 {h̃(x1:n ); θ} · g2 (x1:n ), the statistic g(X) = (M + 1)(n + 1)/(X + 1) can thus be computed as
X
which shows the sufficiency of T̃ = h̃(X1:n ) for θ. E{g(X)} = g(x) Pr(X = x)
9. Let X1 and X2 denote two independent exponentially Exp(λ) distributed random x∈T
variables with parameter λ > 0. Show that h(X1 , X2 ) = X1 + X2 is sufficient for λ. min{n,M } N −M
(M + 1)(n + 1)
X M
i=1 = (N + 1) x+1
N +1
n−x
.
= λ2 exp{−λ(x1 + x2 )} · |{z}
1 , x=n−(N −M ) n+1
| {z }
g1 {h(x1:2 )=x1 +x2 ;λ} g2 (x1:n )
and the result follows from the Factorisation Theorem (Result 2.2).
14 3 Elements of frequentist inference 15
We may now shift the index in the sum, so that the summands containing x in 3. Let X1:n be a random sample from a normal distribution with mean µ and vari-
the expression above contain x − 1. Of course, we need to change the range of ance σ 2 > 0. Show that the estimator
summation accordingly. By doing so, we obtain that r
n − 1 Γ( n−1
2
)
σ̂ = S
min{n,M }
X M +1 (N +1)−(M +1) 2 Γ( n2 )
x+1 n−x
N +1
x=n−(N −M ) n+1 is unbiased for σ, where S is the square root of the sample variance S 2 in (3.1).
It is well known that for X1 , . . . , Xn ∼ N(µ, σ 2 ),
iid
min{n+1,M +1}
X ◮
= .
(n − 1)S 2
x=(n+1)−((N +1)−(M +1)) Y := ∼ χ2 (n − 1),
σ2
Note that the sum above is a sum of probabilities corresponding to a hypergeometric √
distribution with different parameters, namely HypGeom(n + 1, N + 1, M + 1), i. e. see e. g. Davison (2003, page 75). For the expectation of the statistic g(Y ) = Y
we thus obtain that
min{n+1,M +1} M +1 (N +1)−(M +1)
X (n+1)−x Z∞
=
x
N +1
x=(n+1)−((N +1)−(M +1)) n+1 E{g(Y )} = g(y)fY (y) dy
X 0
= Pr(X ∗ = x) Z∞
( 21 ) 2 n−1 −1
n−1
x∈T ∗ 1
= y 2 exp(−y/2)y 2 dy
= 1, Γ( n−12 )
0
− 12 Z∞ 1 n
where X is a random variable, X ∼ HypGeom(n + 1, N + 1, M + 1). It follows
∗ ∗
1 Γ( n2 ) ( 2 ) 2 n −1
= y 2 exp(−y/2) dy.
that 2 Γ( n−1
2
) Γ( n2 )
0
E(N̂ ) = E{g(X)} − 1 = N + 1 − 1 = N.
Note however that the alternative estimator is also not unbiased for N > M + n. The integral on the most-right-hand side is the integral of the density of the χ2 (n)
Moreover, its values are not necessarily integer and thus not necessarily in the distribution over its support and therefore equals one. It follows that
parameter space. The latter property can be remedied by rounding. This, however, √ √ n n−1
would lead to the loss of unbiasedness even for N ≤ M + n. E( Y ) = 2Γ Γ ,
2 2
2. Let X1:n be a random sample from a distribution with mean µ and variance σ 2 > 0.
Show that and √
σ2 Y Γ( n−1
2
)
E(X̄) = µ and Var(X̄) = . E(σ̂) = E σ √ = σ.
2 Γ( 2 )
n
n
4. Show that the sample variance S 2 can be written as
◮ By linearity of the expectation, we have that
n
1 X
n
X S2 = (Xi − Xj )2 .
E(X̄) = n−1 E(Xi ) = n−1 n · µ = µ. 2n(n − 1)
i,j=1
i=1
Use this representation to show that
Sample mean X̄ is thus unbiased for expectation µ.
1 n − 3
The variance of a sum of uncorrelated random variables is the sum of the respective Var(S 2 ) = c4 − σ4 ,
variances; hence, n n−1
σ2
n
X
Var(X̄) = n−2 Var(Xi ) = n−2 n · σ 2 = .
n
i=1
16 3 Elements of frequentist inference 17
where c4 = E {X − E(X)}4 is the fourth central moment of X. It follows that
◮ We start with showing that the estimator S 2 can be rewritten as T :=
1
Pn 2 Cov (Xi − Xj )2 , (Xi − Xj )2 = Var (Xi − Xj )2 =
2n(n−1) i,j=1 (Xi − Xj ) :
2
= E (Xi − Xj )4 − E (Xi − Xj )2 =
n
1 X 2
(n − 1)T = · (Xi − 2Xi Xj + Xj2 ) = = 2µ4 + 6σ 4 − (2σ 2 )2 = 2µ4 + 2σ 4 .
2n
i,j=1
n n n n
! Note that since (Xi − Xj )2 = (Xj − Xi )2 , there are 2 · n(n − 1) such terms in
1 X X X X
= · n Xi2 − 2 Xi Xj + n Xj2 = the sum (3.1).
2n
i=1 i=1 j=1 j=1 – In an analogous way, we may show that if i, j, k are all different, Cov (Xi −
Xj )2 , (Xk − Xj )2 = µ4 − σ 4 . We can form n(n − 1)(n − 2) different triplets
n
X
= Xi2 − nX̄ 2 = (n − 1)S 2 .
(i, j, k) of i, j, k that are all different elements of {1, . . . , n}. For each of
i=1
these triplets, there are four different terms in the sum (3.1): Cov (Xi −
It follows that we can compute the variance of S 2 from the pairwise correlations Xj ) , (Xk − Xj ) Cov (Xi − Xj ) , (Xj − Xk ) , Cov (Xj − Xi ) , (Xj − Xk )2 ,
2 2 2 2 2
between the terms (Xi − Xj )2 , i, j = 1, . . . , n as and Cov (Xj − Xi )2 , (Xk − Xj )2 , each with the same value. In total, we thus
−2 X have 4 · n(n − 1)(n − 2) terms in (3.1) with the value of µ4 − σ 4 .
Var(S 2Auf :arithmetischesM ittel ) = 2n(n − 1) Var (Xi − Xj )2 =
i,j By combining these intermediate computations, we finally obtain that
−2 X
= 2n(n − 1) Cov (Xi − Xj )2 , (Xk − Xl )2 . 1
Var(S 2 ) = 4 4
2 2n(n − 1)(2µ4 + 2σ ) + 4n(n − 1)(n − 2)(µ4 − σ ) =
i,j,k,l
2n(n − 1)
(3.1)
1
= µ4 + σ 4 + (n − 2)(µ4 − σ 4 ) =
Depending on the combination of indices, the covariances in the sum above take n(n − 1)
n − 3
1
one of the three following values: = µ4 − σ4 .
n n−1
– Cov (Xi − Xj )2 , (Xk − Xl )2 = 0 if i = j and/or k = l (in this case either the
first or the second term is identically zero) or if i, j, k, l are all different (in this 5. Show that the confidence interval defined in Example 3.6 indeed has coverage
case the result follows from the independence between the different Xi ). probability 50% for all values θ ∈ Θ.
– For i 6= j, Cov (Xi − Xj )2 , (Xi − Xj )2 = 2µ4 + 2σ 4 . To show this, we proceed ◮ To prove the statement, we need to show that Pr{min(X1 , X2 ) ≤ θ ≤
in two steps. We denote µ := E(X1 ), and, using the independence of Xi and max(X1 , X2 )} = 0.5 for all θ ∈ Θ. This follows by the simple calculation:
Xj , we obtain that
Pr{min(X1 , X2 ) ≤ θ ≤ max(X1 , X2 )} = Pr(X1 ≤ θ ≤ X2 ) + Pr(X2 ≤ θ ≤ X1 )
E (Xi − Xj )2 = E (Xi − µ)2 + E (Xj − µ)2 − 2 E (Xi − µ)(Xj − µ) = = Pr(X1 ≤ θ) Pr(X2 ≥ θ) + Pr(X2 ≤ θ) Pr(X1 ≥ θ)
= 2σ 2 − 2 E(Xi ) − µ E(Xj ) − µ = 2σ 2 . = 0.5 · 0.5 + 0.5 · 0.5 = 0.5.
In an analogous way, we can show that 6. Consider a random sample X1:n from the uniform model U(0, θ), cf . Example 2.18.
4
4
Let Y = max(X1 , . . . , Xn ) denote the maximum of the random sample X1:n . Show
E (Xi − Xj ) = E (Xi − µ + µ − Xj ) =
that the confidence interval for θ with limits
= E (Xi − µ)4 − 4 E (Xi − µ)3 (Xj − µ) + 6 E (Xi − µ)2 (Xj − µ)2
(1 − γ)−1/n Y
− 4 E (Xi − µ)(Xj − µ)3 + E (Xj − µ)4 = Y and
= µ4 − 4 · 0 + 6 · (σ 2 )2 − 4 · 0 + µ4 = 2µ4 + 6σ 4 .
18 3 Elements of frequentist inference 19
has coverage γ. 7. Consider a population with mean µ and variance σ 2 . Let X1 , . . . , X5 be indepen-
◮ Recall that the density function of the uniform distribution U(0, θ) is f (x) = dent draws from this population. Consider the following estimators for µ:
1
θ I[0,θ) (x). The corresponding distribution function is 1
T1 = (X1 + X2 + X3 + X4 + X5 ),
Zx 5
Fx (x) = f (u) du 1
T2 = (X1 + X2 + X3 ),
3
1 1
−∞
T3 = (X1 + X2 + X3 + X4 ) + X5 ,
0R for x ≤ 0,
8 2
= x 1
du = x
for 0 ≤ x ≤ θ, T4 = X1 + X2
0 θ θ
1 for x ≥ θ. and T5 = X1 .
To prove the coverage of the confidence interval with limits Y and (1 − γ)−1/n Y , a) Which estimators are unbiased for µ?
we need to show that Pr{Y ≤ θ ≤ (1 − γ)−1/n Y } = γ for all θ ∈ Θ. We first derive ◮ The estimators T1 , T2 , and T5 are sample means of sizes 5, 3, and 1,
the distribution of the random variable Y . For its distribution function FY , we respectively, and as such are unbiased for µ, cf. Exercise 2. Further, T3 is also
obtain that unbiased, as
1 1
E(T3 ) = · 4µ + µ = µ.
FY (y) = Pr(Y ≤ y) 8 2
= Pr{max(X1 , . . . , Xn ) ≤ y} On the contrary, T4 is not unbiased, as
= {1 − (1 − γ)} = γ.
20 3 Elements of frequentist inference 21
Here τ is the p-dimensional parameter vector and Ti , ηi , B and c are real-valued c) Show that the density of the normal distribution N(µ, σ 2 ) can be written in
functions. It is assumed that the set {1, η1 (τ ), . . . , ηp (τ )} is linearly independent. the forms (3.2) and (3.3), respectively, where τ = (µ, σ 2 )⊤ . Hence derive a
Then we define the canonical parameters θ1 = η1 (τ1 ), . . . , θp = ηp (τp ). With minimal sufficient statistic for τ .
θ = (θ1 , . . . , θp )⊤ and T (x) = (T1 (x), . . . , Tp (x))⊤ we can write the log density in ◮ For X ∼ N(µ, σ 2 ), we have τ = (µ, σ 2 )⊤ . We can rewrite the log density
canonical form: as
log{f (x; θ)} = θ⊤ T (x) − A(θ) + c(x). (3.3) 1 1 (x − µ)2
log f (x; µ, σ 2 ) = − log(2πσ 2 ) −
Exponential families are interesting because most of the commonly used distribu- 2 2 σ2
1 1 1 x2 − 2xµ + µ2
tions, such as the Poisson, geometric, binomial, normal and gamma distribution, = − log(2π) − log(σ 2 ) −
2 2 2 σ2
are exponential families. Therefore it is worthwhile to derive general results for
1 2 µ µ2 1 1
exponential families, which can then be applied to many distributions at once. = − 2 x + 2 x − 2 − log(σ 2 ) − log(2π)
2σ σ 2σ 2 2
For example, two very useful results for the exponential family of order one in
= η1 (τ )T1 (x) + η2 (τ )T2 (x) − B(τ ) + c(x),
canonical form are E{T (X)} = dA/dθ(θ) and Var{T (X)} = d2 A/dθ 2 (θ).
where
a) Show that T (X) is minimal sufficient for θ.
1
◮ Consider two realisations x and y with corresponding likelihood ratios θ1 = η1 (τ ) = − T1 (x) = x2
2σ 2
Λx (θ 1 , θ 2 ) and Λy (θ1 , θ2 ) being equal, which on the log scale gives the equation µ
θ2 = η2 (τ ) = 2 T2 (x) = x
σ
log f (x; θ1 ) − log f (x; θ2 ) = log f (y; θ1 ) − log f (y; θ 2 ). µ2 1
B(τ ) = + log(σ 2 )
2σ 2 2
Plugging in (3.3) we can simplify it to 1
and c(x) = − log(2π).
2
1 T (x) − A(θ 1 ) − θ 2 T (x) + A(θ 2 ) = θ 1 T (y) − A(θ 1 ) − θ 2 T (y) + A(θ 2 )
θ⊤ ⊤ ⊤ ⊤
so p = 1, θ = η(λ) = log(λ), T (x) = x, B(λ) = λ and c(x) = − log(x!). Finally, from above, we know that T (x) = (x2 , x)⊤ is minimal sufficient for τ .
For the canonical representation, we have A(θ) = B{η −1 (θ)} = B{exp(θ)} = d) Show that for an exponential family of order one, I(τ̂ML ) = J(τ̂ML ). Verify this
exp(θ). Hence, both the expectation E{T (X)} = dA/dθ(θ) and the variance result for the Poisson distribution.
Var{T (X)} = d2 A/dθ 2 (θ) of X are exp(θ) = λ. ◮ Let X be a random variable with density from the exponential family of
order one. By taking the derivative of the log likelihood, we obtain the score
function
dη(τ ) dB(τ )
S(τ ) = T (x) − ,
dτ dτ
22 3 Elements of frequentist inference 23
so that the MLE τ̂ML satisfies the equation The log-likelihood l(θ) of the random sample X1:n is thus
dB(τ̂ML ) n
X n
X n
X
T (x) = dτ
dη(τ̂ML )
. l(θ) = log{f (xi ; θ)} = {θ T (xi ) − A(θ) + c(xi )} ∝ θ T (xi ) − n A(θ).
dτ i=1 i=1 i=1
We obtain the observed Fisher information from the Fisher information 9. Assume that survival times X1:n form a random sample from a gamma distribution
2 2
d B(τ ) d η(τ ) G(α, α/µ) with mean E(Xi ) = µ and shape parameter α.
I(τ ) = − T (x) Pn
dτ 2 dτ 2 a) Show that X̄ = n−1 i=1 Xi is a consistent estimator of the mean survival
by plugging in the MLE: time µ.
◮ The sample mean X̄ is unbiased for µ and has variance Var(X̄) =
dB(τ̂ML )
d2 A(τ̂ML ) d2 η(τ̂ML ) Var(Xi )/n = µ2 /(nα), cf. Exercise 2 and Appendix A.5.2. It follows that
I(τ̂ML ) = − dτ
.
dτ 2 dτ 2 dη(τ̂ML )
its mean squared error MSE = µ2 /(nα) goes to zero as n → ∞. Thus, the
dτ
estimator is consistent in mean square and hence also consistent.
Further, we have that
Note that this holds for all random samples where the individual random vari-
dB(η −1 (θ)) dη −1 (θ)
dB(τ)
d ables have finite expectation and variance.
E{T (X)} = (B ◦ η −1 )(θ) = · = dτ
,
dθ dτ dθ dη(τ) b) Show that Xi /µ ∼ G(α, α).
dτ
◮ From Appendix A.5.2, we know that by multiplying a random variable
where θ = η(τ ) is the canonical parameter. Hence with G(α, α/µ) distribution by µ−1 , we obtain a random variable with G(α, α)
distribution.
d2 B(τ ) d2 η(τ )
dB(τ)
J(τ ) = − dτ
c) Define the approximate pivot from Result 3.1,
dτ 2 dτ 2 dη(τ)
dτ
X̄ − µ
follows. If we now plug in τ̂ML , we obtain the same formula as for I(τ̂ML ). Z= √ ,
S/ n
For the Poisson example, we have I(λ) = x/λ2 and J(λ) = 1/λ. Plugging in
Pn
the MLE λ̂ML = x leads to I(λ̂ML ) = J(λ̂ML ) = 1/x. where S 2 = (n − 1)−1 i=1 (Xi − X̄)2 . Using the result from above, show that
e) Show that for an exponential family of order one in canonical form, I(θ) = J(θ). the distribution of Z does not depend on µ.
Verify this result for the Poisson distribution. ◮ We can rewrite Z as follows:
◮ In the canonical parametrisation (3.3),
X̄ − µ
Z=q Pn
dA(θ) 1
i=1 (Xi − X̄)
2
S(θ) = T (x) − n(n−1)
dθ
d2 A(θ) X̄/µ − 1
and I(θ) = , =q Pn
dθ 2 1
i=1 (Xi /µ − X̄/µ)
2
n(n−1)
where the latter is independent of the observation x, and therefore obviously
Ȳ − 1
I(θ) = J(θ). =q Pn ,
1
i=1 (Yi − Ȳ )
2
For the Poisson example, the canonical parameter is θ = log(λ). Since A(θ) = n(n−1)
exp(θ), also the second derivative equals exp(θ) = I(θ) = J(θ). Pn
f ) Suppose X1:n is a random sample from a one-parameter exponential family where Yi = Xi /µ and Ȳ = n−1 i=1 Yi = X̄/µ . From above, we know that
with canonical parameter θ. Derive an expression for the log-likelihood l(θ). Yi ∼ G(α, α), so its distribution depends only on α and not on µ. Therefore, Z
◮ Using the canonical parametrisation of the density, we can write the log- is a function of random variables whose distributions do not depend on µ. It
likelihood of a single observation as follows that the distribution of Z does not depend on µ either.
d) For n = 10 and α ∈ {1, 2, 5, 10}, simulate 100 000 samples from Z, and com- [3,] -2.841559 1.896299
pare the resulting 2.5% and 97.5% quantiles with those from the asymptotic [4,] -2.657800 2.000258
> ## compare with standard normal ones:
standard normal distribution. Is Z a good approximate pivot? > qnorm(p=c(0.025, 0.975))
◮ [1] -1.959964 1.959964
> ## simulate one realisation n=10 and alpha=1 n=10 and alpha=2
0.4 0.4
> z.sim <- function(n, alpha)
Density
Density
0.3 0.3
{
0.2 0.2
y <- rgamma(n=n, alpha, alpha)
0.1 0.1
0.0 0.0
yq <- mean(y)
−4 −2 0 2 4 −4 −2 0 2 4
sy <- sd(y)
Z Z
n=10 and alpha=5 n=10 and alpha=10
z <- (yq - 1) / (sy / sqrt(n)) 0.4 0.4
return(z)
Density
Density
0.3 0.3
} 0.2 0.2
> ## fix cases: 0.1 0.1
> n <- 10 0.0 0.0
> alphas <- c(1, 2, 5, 10) −4 −2 0 2 4 −4 −2 0 2 4
> ## space for quantile results
> quants <- matrix(nrow=length(alphas), Z Z
ncol=2) We see that the distribution of Z is skewed to the left compared to the standard
> ## set up graphics space
> par(mfrow=c(2, 2)) normal distribution: the 2.5% quantiles are clearly lower than −1.96, and also
> ## treat every case the 97.5% quantiles are slightly lower than 1.96. For increasing α (and also for
> for(i in seq_along(alphas))
increasing n of course), the normal approximation becomes better. Altogether,
{
## draw 100000 samples the normal approximation does not appear too bad, given the fact that n = 10
Z <- replicate(n=100000, expr=z.sim(n=n, alpha=alphas[i])) is a rather small sample size.
## plot histogram e) Show that X̄/µ ∼ G(nα, nα). If α was known, how could you use this quantity
hist(Z, to derive a confidence interval for µ?
prob=TRUE,
◮ We know from above that the summands Xi /µ in X̄/µ are independent and
col="gray", Pn
main=paste("n=", n, " and alpha=", alphas[i], sep=""), have G(α, α) distribution. From Appendix A.5.2, we obtain that i=1 Xi /µ ∼
nclass=50, G(nα, α). From the same appendix, we also have that by multiplying the sum
xlim=c(-4, 4),
ylim=c(0, 0.45)) by n−1 we obtain G(nα, nα) distribution.
If α was known, then X̄/µ would be a pivot for µ and we could derive a 95%
## compare with N(0, 1) density
confidence interval as follows:
curve(dnorm(x),
from=min(Z),
to=max(Z), 0.95 = Pr{q0.025 (nα) ≤ X̄/µ ≤ q0.975 (nα)}
n=201,
add=TRUE, = Pr{1/q0.975 (nα) ≤ µ/X̄ ≤ 1/q0.025 (nα)}
col="red") = Pr{X̄/q0.975 (nα) ≤ µ ≤ X̄/q0.025 (nα)}
## save empirical quantiles
quants[i, ] <- quantile(Z, prob=c(0.025, 0.975)) where qγ (β) denotes the γ quantile of G(β, β). So the confidence interval would
} be
X̄/q0.975 (nα), X̄/q0.025 (nα) .
> ## so the quantiles were:
(3.4)
> quants
[,1] [,2]
f ) Suppose α is unknown, how could you derive a confidence interval for µ?
[1,] -4.095855 1.623285
[2,] -3.326579 1.741014 ◮ If α is unknown, we could estimate it and then use the confidence interval
26 3 Elements of frequentist inference 27
from (3.4). Of course we could also use Z ∼ N(0, 1) and derive the standard Now assume that xn 6= yn . Without loss of generality, let xn < yn . Then we
a
a doctor sees n ≤ N beds, which are a random subset of all beds, with (ordered) fXn (xn ; N ) = n−1
N
I{n,...,N } (xn ).
n
numbers X1 < · · · < Xn . The doctor now wants to estimate the total number of
beds N in the hospital. ◮ For a fixed value Xn = xn of the maximum, there are xn−1 n −1
possibilities
a) Show that the joint probability mass function of X = (X1 , . . . , Xn ) is how to choose the first n − 1 values. Hence, the total number of possible draws
xn −1
−1 giving a maximum of x is N n / n−1 . Considering also the possible range for
N
f (x; N ) = I{n,...,N } (xn ). xn , this leads to the probability mass function
n
xn −1
fXn (xn ; N ) = n−1
N
I{n,...,N } (xn ).
◮ There are N n possibilities to draw n values without replacement out of N n
values. Hence, the probability of one outcome x = (x1 , . . . , xn ) is the inverse,
N −1
d) Show that
. Due to the nature of the problem, the highest number xn cannot be n+1
n
N̂ = Xn − 1
larger than N , nor can it be smaller than n. Altogether, we thus have n
is an unbiased estimator of N .
f (x; N ) = Pr(X1 = x1 , . . . , Xn = xn ; N )
◮ For the expectation of Xn , we have
−1
N −1 X
= I{n,...,N } (xn ). N
N
x−1
n E(Xn ) = x·
n n−1
x=n
b) Show that Xn is minimal sufficient for N . −1 X
N
◮ We can factorise the probability mass function as follows: N x
= n·
−1 n n
N! x=n
f (x; N ) = I{n,...,N } (xn ) −1 X N
(N − n)!n! N x+1−1
= n·
(N − n)! n n+1−1
= |{z}
n! I{n,...,N } (xn ), x=n
| N!
=g2 (x) {z } −1
N
N
X +1
x−1
=g1 {h(x)=xn ;N } = n· .
n (n + 1) − 1
x=n+1
so from the Factorization Theorem (Result 2.2), we have that Xn is sufficient
PN
for N . In order to show the minimal sufficiency, consider two data sets x and Since x−1
x=n n−1 = N
n , we have
y such that for every two parameter values N1 and N2 , the likelihood ratios are −1
N N +1
identical, i. e. E(Xn ) = n·
n n+1
Λx (N1 , N2 ) = Λy (N1 , N2 ).
n!(N − n)! (N + 1)!
= n·
This can be rewritten as N! (n + 1)!(N − n)!
I{n,...,N1 } (xn ) I{n,...,N1 } (yn ) n
= . (3.6) = (N + 1).
I{n,...,N2 } (xn ) I{n,...,N2 } (yn ) n+1
28 3 Elements of frequentist inference
Altogether thus
4 Frequentist properties of the
n+1
E(N̂) =
n
E(Xn ) − 1
n+1 n
likelihood
= (N + 1) − 1
n n+1
= N.
So N̂ is unbiased for N .
e) Study the ratio L(N + 1)/L(N ) and derive the ML estimator of N . Compare
it with N̂ .
◮ The likelihood ratio of N ≥ xn relative to N + 1 with respect to x is
1. Compute an approximate 95% confidence interval for the true correlation ρ based
f (x; N + 1)
N
N +1−n on the MLE r = 0.7, a sample of size of n = 20 and Fisher’s z-transformation.
= Nn+1 = < 1,
f (x; N ) n
N +1 ◮ Using Example 4.16, we obtain the transformed correlation as
1 + 0.7
so N must be as small as possible to maximise the likelihood, i. e. N̂ML = Xn . z = tanh−1 (0.7) = 0.5 log = 0.867.
1 − 0.7
From above, we have E(Xn ) = n+1
n
(N + 1), so the bias of the MLE is
n Using the more accurate approximation 1/(n−3) for the variance of ζ = tanh−1 (ρ),
E(Xn ) − N = (N + 1) − N √
n+1 we obtain the standard error 1/ n − 3 = 0.243. The 95%-Wald confidence interval
n(N + 1) − (n + 1)N for ζ is thus
=
n+1 [z ± 1.96 · se(z)] = [0.392, 1.343].
nN + n − nN − N
= By back-transforming using the inverse Fisher’s z-transformation, we obtain the
n+1
n−N following confidence interval for ρ:
= < 0.
n+1
[tanh(0.392), tanh(1.343)] = [0.373, 0.872].
This means that N̂ML systematically underestimates N , in contrast to N̂.
2. Derive a general formula for the score confidence interval in the Poisson model
based on the Fisher information, cf . Example 4.9.
◮ We consider a random sample X1:n from Poisson distribution Po(ei λ) with
known offsets ei > 0 and unknown rate parameter λ. As in Example 4.8, we can
see that if we base the score statistic for testing the null hypothesis that the true
rate parameter equals λ on the Fisher information I(λ; x1:n ), we obtain
We now determine the values of λ for which the score test based on the asymptotic To obtain an approximate one-sided P -value, we calculate its realisation z(0.8)
distribution of T2 would not reject the null hypothesis at level α. These are the and compare it to the approximate normal distribution of the score statistic
values for which we have |T2 (λ; x1:n )| ≤ q := z1−α/2 : under the null hypothesis that the true proportion is 0.8. Since a more extreme
result in the direction of the alternative H1 corresponds to a larger realisation
√ x̄ − ēλ
n· √ ≤q x and hence a larger observed value of the score statistic, the approximate
x̄
r one-sided P -value is the probability that a standard normal random variable is
x̄
− λ ≤ q x̄ greater than z(0.8):
ē ē n
> ## general settings
" r # > x <- 105
x̄ q x̄ > n <- 117
λ∈ ±
ē ē n > pi0 <- 0.8
> ## the first approximate pivot
> z.pi <- function(x, n, pi)
Note that this score confidence interval is symmetric around the MLE x̄/ē, unlike
{
the one based on the expected Fisher information derived in Example 4.9. sqrt(n) * (x - n * pi) / sqrt( (x * (n - x)) )
3. A study is conducted to quantify the evidence against the null hypothesis that }
> z1 <- z.pi(x, n, pi0)
less than 80 percent of the Swiss population have antibodies against the human > (p1 <- pnorm(z1, lower.tail=FALSE))
herpesvirus. Among a total of 117 persons investigated, 105 had antibodies. [1] 0.0002565128
a) Formulate an appropriate statistical model and the null and alternative hy- Pr{Z(0.8) > z(0.8)} ≈ 1 − Φ{z(0.8)} = 1 − Φ(3.47) ≈ 0.00026.
potheses. Which sort of P -value should be used to quantify the evidence against
c) Use the logit-transformation (compare Example 4.22) and the corresponding
the null hypothesis?
Wald statistic to obtain a P -value.
◮ The researchers are interested in the frequency of herpesvirus antibodies oc-
◮ We can equivalently formulate the testing problem as H0 : φ < φ0 =
currence in the Swiss population, which is very large compared to the n = 117
logit(0.8) versus H1 : φ > φ0 after parametrising the binomial model with φ =
probands. Therefore, the binomial model, actually assuming infinite popula-
logit(π) instead of π. Like in Example 4.22, we obtain the test statistic
tion, is appropriate. Among the total of n = 117 draws, x = 105 “successes”
were obtained, and the proportion π of these successes in the theoretically infi- log{X/(n − X)} − φ
Zφ (φ) = p , (4.2)
nite population is of interest. We can therefore suppose that the observed value 1/X + 1/(n − X)
x = 105 is a realisation of a random variable X ∼ Bin(n, π).
The null hypothesis is H0 : π < 0.8, while the alternative hypothesis is H1 : π ≥ which, by the delta method, is asymptotically normally distributed. To compute
0.8. Since this is a one-sided testing situation, we will need the corresponding the corresponding P -value, we may proceed as follows:
one-sided P -value to quantify the evidence against the null hypothesis. > ## the second approximate pivot
> z.phi <- function(x, n, phi)
b) Use the Wald statistic (4.12) and its approximate normal distribution to obtain {
a P -value. (log(x / (n - x)) - phi) / sqrt(1/x + 1/(n-x))
p }
◮ The Wald statistic is z(π) = I(π̂ML )(π̂ML − π). As in Example 4.10, we > (phi0 <- qlogis(pi0))
have π̂ML = x/n. Further, I(π) = x/π 2 + (n − x)/(1 − π)2 , and so [1] 1.386294
> z2 <- z.phi(x, n, phi0)
x n−x n2 n2 (n − x) n2 n2 n3 > (p2 <- pnorm(z2, lower.tail=FALSE))
I(π̂ML ) = + = + = + = .
(x/n)2 (1 − x/n)2 x (n − x)2 x n−x x(n − x) [1] 0.005103411
We thus have Pr[Zφ {logit(0.8)} > zφ {logit(0.8)}] ≈ 1 − Φ{zφ (1.3863)} = 1 − Φ(2.57) ≈ 0.0051.
s
n3 x √ x − nπ
z(π) = −π = np . (4.1)
x(n − x) n x(n − x)
32 4 Frequentist properties of the likelihood 33
d) Use the score statistic (4.2) to obtain a P -value. Why do we not need to con- Note that the z-statistic on the φ-scale and the score statistic produce P -values
sider parameter transformations when using this statistic? which are closer to the exact P -value than that from the z-statistic on the π-
p
◮ By Result 4.5, the score statistic V (π) = S(π; X1:n )/ J1:n (π) asymp- scale. This is due to the bad quadratic approximation of the likelihood on the
totically follows the standard normal distribution under the Fisher regular- π-scale.
ity assumptions. In our case, we may use that a binomial random variable 4. Suppose X1:n is a random sample from an Exp(λ) distribution.
X ∼ Bin(n, π) can be viewed as the sum of n independent random variables
a) Derive the score function of λ and solve the score equation to get λ̂ML .
with Bernoulli distribution B(π), so the asymptotic results apply to the score
◮ From the log-likelihood
statistic corresponding to X as n → ∞. The score function corresponding to
the binomial variable is S(π; X) = X/π − (n − X)/(1 − π) and the expected n
X
Fisher information is J(π) = n/{π(1 − π)}, cf. Example 4.10. To calculate a l(λ) = log(λ) − λxi
i=1
third approximate P -value, we may therefore proceed as follows:
> ## and the third
= n log(λ) − nλx̄
> v.pi <- function(x, n, pi)
{ we get the score function
n
(x/pi - (n - x)/(1 - pi)) / sqrt(n/pi/(1-pi)) S(λ; x) = − nx̄,
} λ
> v <- v.pi(x, n, pi0) which has the root
> (p3 <- pnorm(v, lower.tail=FALSE))
[1] 0.004209022 λ̂ML = 1/x̄.
Pr{V (0.8) > v(0.8)} ≈ 1 − Φ(2.63) ≈ 0.00421. Since the Fisher information
c) Derive the expected Fisher information J(λ) and the variance stabilizing trans- e) Derive the Cramér-Rao lower bound for the variance of unbiased estimators of
formation φ = h(λ) of λ. λ.
◮ Because the Fisher information does not depend on x in this case, we have ◮ If T = h(X) is an unbiased estimator for λ, then Result 4.8 states that
simply
λ2
J(λ) = E{I(λ; X)} = n/λ2 . Var(T ) ≥ J(λ)−1 = ,
n
Now we can derive the variance stabilising transformation: which is the Cramér-Rao lower bound.
Zλ f ) Compute the expectation of λ̂ML and use this result to construct an unbiased
φ = h(λ) ∝ Jλ (u)1/2 du estimator of λ. Compute its variance and compare it to the Cramér-Rao lower
bound.
Zλ Pn
◮ By the properties of exponential distribution we know that i=1 Xi ∼
∝ u−1 du
G(n, λ), cf. Appendix A.5.2. Next, by the properties of Gamma distribution
Pn
= log(u)|u=λ we that get that X̄ = n1 i=1 Xi ∼ G(n, nλ), and λ̂ML = 1/X̄ ∼ IG(n, nλ), cf.
Appendix A.5.2. It follows that
= log(λ).
nλ
d) Compute the MLE of φ and derive a 95% confidence interval for λ by back- E(λ̂ML ) = > λ,
n−1
transforming the limits of the 95% Wald confidence interval for φ. Compare
with the result from 4b). cf. again Appendix A.5.2. Thus, λ̂ML is a biased estimator of λ. However,
◮ Due to the invariance of ML estimation with respect to one-to-one trans- we can easily correct it by multiplying with the constant (n − 1)/n. This new
formations we have estimator λ̂ = (n − 1)/(nX̄) is obviously unbiased, and has variance
φ̂ML = log λ̂ML = − log x̄
(n − 1)2
Var(λ̂) = Var(1/X̄)
as the MLE of φ = log(λ). Using the delta method we can get the corresponding n2
2
standard error as (n − 1) n2 λ 2
=
d
n 2 (n − 1)2 (n − 2)
se(φ̂ML ) = se(λ̂ML ) h(λ̂ML ) λ 2
dλ = ,
n−2
1
= √ 1/λ̂ML
x̄ n cf. again Appendix A.5.2. This variance only asymptotically reaches the
1 Cramér-Rao lower bound λ2 /n. Theoretically there might be other unbiased
= √ x̄
x̄ n estimators which have a smaller variance than λ̂.
= n−1/2 . 5. An alternative parametrization of the exponential distribution is
So the 95% Wald confidence interval for φ is 1 x
fX (x) = exp − IR+ (x), θ > 0.
θ θ
− log x̄ ± z0.975 · n−1/2 ,
Let X1:n denote a random sample from this density. We want to test the null
and transformed back to the λ-space we have the 95% confidence interval hypothesis H0 : θ = θ0 against the alternative hypothesis H1 : θ 6= θ0 .
a) Calculate both variants T1 and T2 of the score test statistic.
exp(− log x̄ − z0.975 n−1/2 ), exp(− log x̄ + z0.975 n−1/2 ) =
√ √ ◮ Recall from Section 4.1 that
= x̄−1 / exp(z0.975 / n), x̄−1 · exp(z0.975 / n) ,
S(θ0 ; x1:n ) S(θ0 ; x1:n )
T1 (x1:n ) = p and T2 (x1:n ) = p .
which is not centred around the MLE λ̂ML = x̄−1 , unlike the original Wald J1:n (θ0 ) I(θ0 ; x1:n )
confidence interval for λ.
36 4 Frequentist properties of the likelihood 37
Like in the previous exercise, we can compute the log-likelihood a) Derive the probability mass function f (x; π) of Xi .
n ◮ Xi can only take one of the values 1, 2, . . . , so it is a discrete random
X xi
l(θ) = − log(θ) − variable supported on natural numbers N. For a given x ∈ N, the probability
θ
i=1 that Xi equals x is
nx̄
= −n log(θ) −
θ f (x; π) = Pr(First test negative, . . . , (x − 1)-st test negative, x-th test positive)
and derive the score function = (1 − π) · · · (1 − π) ·π
| {z }
x−1 times
1 n(x̄ − θ)
S(θ; x1:n ) = (nθ − nx̄) · − 2 = , = (1 − π)x−1 π,
θ θ2
the Fisher information since the results of the different tests are independent. This is the probability
mass function of the geometric distribution Geom(π) (cf. Appendix A.5.1), i. e.
d 2x̄ − θ
I(θ; x1:n ) = − S(θ; x1:n ) = n , we have Xi ∼ Geom(π) for i = 1, . . . , n.
iid
dθ θ3
b) Write down the log-likelihood function for the random sample X1:n and com-
and the expected Fisher information
pute the MLE π̂ML .
2 E(X̄) − θ n ◮ For a realisation x1:n = (x1 , . . . , xn ), the likelihood is
J1:n (θ) = n = 2.
θ3 θ n
Y
The test statistics can now be written as L(π) = f (xi ; π)
i=1
n(x̄ − θ0 ) θ0 √ x̄ − θ0 Yn
T1 (x1:n ) = ·√ = n
θ02 n θ0 = π(1 − π)xi −1
3/2
r
n(x̄ − θ0 )
i=1
θ0 θ0 Pn
and T2 (x1:n ) = ·p = T1 (θ0 ) . = π n (1 − π) i=1 i
x −n
θ02 n(2x̄ − θ0 ) 2x̄ − θ0
= π n (1 − π)n(x̄−1) ,
b) A sample of size n = 100 gave x̄ = 0.26142. Quantify the evidence against
H0 : θ0 = 0.25 using a suitable significance test. yielding the log-likelihood
◮ By plugging these numbers into the formulas for T1 (x1:n ) and T2 (x1:n ), we
obtain l(π) = n log(π) + n(x̄ − 1) log(1 − π).
T1 (x1:n ) = 0.457 and T2 (x1:n ) = 0.437.
The score function is thus
Under the null hypothesis, both statistics follow asymptotically the standard
d n n(x̄ − 1)
normal distribution. Hence, to test at level α, we need to compare the observed S(π; x1:n ) = l(π) = −
dπ π 1−π
values with the (1 − α/2) · 100% quantile of the standard normal distribution.
For α = 0.05, we compare with z0.975 ≈ 1.96. As neither of the observed values and the solution of the score equation S(π; x1:n ) = 0 is
is larger than the critical value, the null hypothesis cannot be rejected.
π̂ML = 1/x̄.
6. In a study assessing the sensitivity π of a low-budget diagnostic test for asthma,
each of n asthma patients is tested repeatedly until the first positive test result is The Fisher information is
obtained. Let Xi be the number of the first positive test for patient i. All patients
d
and individual tests are independent, and the sensitivity π is equal for all patients I(π) = − S(π)
dπ
and tests. n n(x̄ − 1)
= 2+ ,
π (1 − π)2
38 4 Frequentist properties of the likelihood 39
- 0.5 * se.phi^(-2) * (phi - mle.phi)^2 7. A simple model for the drug concentration in plasma over time after a single
} intravenous injection is c(t) = θ2 exp(−θ1 t), with θ1 , θ2 > 0. For simplicity we
> ## and the plot
> piGrid <- seq(0.01, 0.5, length=201) assume here that θ2 = 1.
> plot(piGrid, rel.loglik.pi(piGrid),
type="l",
a) Assume that n probands had their concentrations ci , i = 1, . . . , n, measured
at the same single time-point t and assume that the model ci ∼ N(c(t), σ 2 ) is
iid
xlab=expression(pi),
ylab = expression(tilde(l)(pi)),
appropriate for the data. Calculate the MLE of θ1 .
lwd=2)
> abline(v=0, col="gray") ◮ The likelihood is
> lines(piGrid, approx.rel.loglik.pi(piGrid), n
Y 1 1 2
L(θ1 ) = exp − 2 ci − exp(−θ1 t)
lty=2,
√ ,
col="blue") 2πσ 2 2σ
> lines(piGrid, approx.rel.loglik.phi(qlogis(piGrid)), i=1
lty=2,
yielding the log-likelihood
col="red")
> abline(v=mle.pi) n
X 1 1 2
> legend("bottomright", l(θ1 ) = − log 2πσ 2 − 2 ci − exp(−θ1 t) .
legend= 2 2σ
i=1
c("relative log-lik.",
"quadratic approx.", For the score function we thus have
"transformed quad. approx."),
n
col= exp(−θ1 t) t X
c("black", S(θ1 ; c1:n ) = − 2
ci − exp(−θ1 t)
"blue", σ
i=1
"red"),
exp(−θ1 t) nt
lty= = {exp(−θ1 t) − c̄},
c(1, σ2
2, and for the Fisher information
2),
lwd= exp(−θ1 t) nt2
c(2, I(θ1 ) = {2 exp(−θ1 t) − c̄}.
1, σ2
1)) The score equation is solved as
0
0 = S(θ1 ; c1:n )
−5 exp(−θ1 t) = c̄
1
−10
θ̂1 = − log(c̄).
~l (π)
t
The observed Fisher information
−15
c̄2 nt2
relative log−lik. I(θ̂1 ) =
−20 quadratic approx. σ2
transformed quad. approx.
is positive; thus, θ̂1 is indeed the MLE.
0.0 0.1 0.2 0.3 0.4 0.5
b) Calculate the asymptotic variance of the MLE.
π
◮ By Result 4.10, the asymptotic variance of θ̂1 is the inverse of the expected
The transformed quadratic approximation qφ (logit(π)) is closer to the true rel- Fisher information
ative log-likelihood l̃(π) than the direct quadratic approximation qπ (π). This
corresponds to a better performance of the second approximate confidence in- J1:n (θ1 ) = E{I(θ1 ; C1:n )}
terval. exp(−2θ1 t) nt2
= .
σ2
44 4 Frequentist properties of the likelihood 45
c) In pharmacokinetic studies one is often interested in the area under the concen- If we plug in topt = 1/θ1 for t, we obtain
R∞
tration curve, α = 0 exp(−θ1 t) dt. Calculate the MLE for α and its variance
2σ 2 2θ 4 σ 2 exp(2)
estimate using the delta theorem. exp(2)(2θ14 − 4θ14 + 3θ14 ) = 1 ,
n n
◮ By the invariance of the MLE with respect to one-to-one transformations,
we obtain that which is positive. Thus, topt indeed minimises the variance.
Z∞ 8. Assume the gamma model G(α, α/µ) for the random sample X1:n with mean
α̂ML = exp(−θ̂1 t) dt E(Xi ) = µ > 0 and shape parameter α > 0.
0 a) First assume that α is known. Derive the MLE µ̂ML and the observed Fisher
1 information I(µ̂ML ).
=
θ̂1 ◮ The log-likelihood kernel for µ is
t
=− . n
log(c̄) αX
l(µ) = −αn log(µ) − xi ,
µ
Further, by the delta method, we obtain that i=1
Thus, the asymptotic variance of α̂ML is The score equation S(µ; x) = 0 can be written as
n
σ2 1X
. n= xi
exp(−2θ̂1 t)nt2 θ14 µ
i=1
d) We now would like to determine the optimal time point for measuring the and is hence solved by µ̂ML = x̄. The ordinary Fisher information is
concentrations ci . Minimise the asymptotic variance of the MLE with respect
d
to t, when θ1 is assumed to be known, to obtain an optimal time point topt . I(µ) = − S(µ; x)
dµ
◮ We take the derivative of the asymptotic variance of θ̂1 with respect to t: n
!
X
= − αnµ −2
− 2αµ −3
xi
d σ 2 exp(2θ1 t) σ 2 2θ1 exp(2θ1 t) 2 exp(2θ1 t)
· 2
= 2
− 3
, i=1
dt n t n t t n
2α X αn
= xi − 2 ,
and find that it is equal zero for topt satisfying that µ3 µ
i=1
b) Use the p∗ formula to derive an asymptotic density of µ̂ML depending on the and the score function reads
true parameter µ. Show that the kernel of this approximate density is exact in
d
this case, i. e. it equals the kernel of the exact density known from Exercise 9 S(α; x) = l(α)
dα
from Chapter 3. n n
α µ1 X 1X
◮ The p∗ formula gives us the following approximate density of the MLE: = n log + αn − nψ(α) + log(xi ) − xi
µ αµ µ
r i=1 i=1
I(µ̂ML ) L(µ) n n
f (µ̂ML ) =
∗ α X 1X
2π L(µ̂ML ) = n log + n − nψ(α) + log(xi ) − xi .
r µ µ
i=1 i=1
I(µ̂ML )
= exp{l(µ) − l(µ̂ML )}
2π Hence, the Fisher information is
r
αn
= exp {−αn log(µ) − α/µ · nµ̂ML + αn log(µ̂ML ) + α/µ̂ML · nµ̂ML } I(α) = −
d
µ̂2ML 2π dα
S(α; x)
r nn o
=
αn −αn
µ exp(αn) · µ̂αn−1 exp −
αn
µ̂ . (4.3) =− − nψ ′ (α)
2π
ML
α
ML
µ
1
Pn = n ψ ′ (α) − .
From Appendix A.5.2 we know that i=1 Xi ∼ G(nα, α/µ), and X̄ = µ̂ML ∼ α
G(nα, nα/µ); cf. Exercise 9 in Chapter 3. The corresponding density function
has the kernel e) Show, by rewriting the score equation, that the MLE α̂ML fulfils
αn
f (µ̂ML ) ∝ µ̂nα−1
ML
exp − µ̂ML , n
X 1X
n
µ −nψ(α̂ML ) + n log(α̂ML ) + n = − log(xi ) + xi + n log(µ). (4.5)
µ
which is the same as the kernel in (4.3). i=1 i=1
c) Stirling’s approximation of the gamma function is Hence show that the log-likelihood kernel can be written as
r
2π xx
Γ(x) ≈ . (4.4) l(α) = n α log(α) − α − log{Γ(α)} + αψ(α̂ML ) − α log(α̂ML ) .
x exp(x)
Show that approximating the normalising constant of the exact density
with (4.4) gives the normalising constant of the approximate p∗ formula den- ◮ The score equation S(α̂ML ; x) = 0 can be written as
sity.
1X
n
X n
◮ The normalising constant of the exact distribution G(nα, nα/µ) is: n log(α̂ML ) − n log(µ) + n − nψ(α̂ML ) + log(xi ) − xi = 0
αn µ
i=1 i=1
αn αn r
µ αn αn exp(αn) n
X 1X
n
Γ(αn)
≈
µ 2π (αn)αn − nψ(α̂ML ) + n log(α̂ML ) + n = − log(xi ) + xi + n log(µ).
µ
r i=1 i=1
αn
= µ−αn exp(αn) ,
2π Hence, we can rewrite the log-likelihood kernel as follows:
Xn n
which equals the normalising constant of the approximate density in (4.3). α αX
l(α) = αn log − n log{Γ(α)} + α log(xi ) − xi
d) Now assume that µ is known. Derive the log-likelihood, score function and µ µ
i=1 i=1
Fisher information of α. Use the digamma function ψ(x) = dx d
log{Γ(x)} and ( )
1X
X n n
the trigamma function ψ (x) = dx ψ(x).
′ d
= αn log(α) − n log{Γ(α)} − α n log(µ) − log(xi ) + xi
µ
◮ The log-likelihood kernel of α is i=1 i=1
Xn n = αn log(α) − n log{Γ(α)} − α {−nψ(α̂ML ) + n log(α̂ML ) + n}
α αX
l(α) = αn log − n log{Γ(α)} + α log(xi ) − xi , = n[α log(α) − α − log{Γ(α)} + αψ(α̂ML ) − α log(α̂ML )].
µ µ
i=1 i=1
48 4 Frequentist properties of the likelihood 49
f ) Implement an R-function of the p∗ formula, taking as arguments the MLE mu) # the mean parameter
value(s) α̂ML at which to evaluate the density, and the true parameter α. For {
## solve the score equation
numerical reasons, first compute the approximate log-density uniroot(f=scoreFun.alpha,
interval=c(1e-10, 1e+10),
1 1
log f ∗ (α̂ML ) = − log(2π) + log{I(α̂ML )} + l(α) − l(α̂ML ), x=x, # pass additional parameters
2 2 mu=mu)$root # to target function
}
and then exponentiate it. The R-functions digamma, trigamma and lgamma can > ## now simulate the datasets and compute the MLE for each
be used to calculate ψ(x), ψ ′ (x) and log{Γ(x)}, respectively. > nSim <- 10000
> alpha <- 2
◮ We first rewrite the relative log-likelihood l(α) − l(α̂ML ) as > mu <- 3
> n <- 10
n[α log(α) − α̂ML log(α̂ML ) − (α − α̂ML ) − log{Γ(α)} + log{Γ(α̂ML )} > alpha.sim.mles <- numeric(nSim)
> set.seed(93)
+ (α − α̂ML )ψ(α̂ML ) − (α − α̂ML ) log(α̂ML )] > for(i in seq_len(nSim))
= n[α{log(α) − log(α̂ML )} − (α − α̂ML ) − log{Γ(α)} + log{Γ(α̂ML )} + (α − α̂ML )ψ(α̂ML )]. {
alpha.sim.mles[i] <- getMle.alpha(x=
rgamma(n=n,
Now we are ready to implement the approximate density of the MLE, as de- alpha,
scribed by the p∗ formula: alpha / mu),
mu=mu)
> approx.mldens <- function(alpha.mle, alpha.true) }
{ > ## compare the histogram with the p* density
relLogLik <- n * (alpha.true * (log(alpha.true) - log(alpha.mle)) - > hist(alpha.sim.mles,
(alpha.true - alpha.mle) - lgamma(alpha.true) + prob=TRUE,
lgamma(alpha.mle) + (alpha.true - alpha.mle) * nclass=50,
digamma(alpha.mle)) ylim=c(0, 0.6),
logObsFisher <- log(n) + log(trigamma(alpha.mle) - 1 / alpha.mle) xlim=c(0, 12))
logret <- - 0.5 * log(2 * pi) + 0.5 * logObsFisher + relLogLik > curve(approx.mldens(x, alpha.true=alpha),
return(exp(logret)) add=TRUE,
} lwd=2,
n=201,
g) In order to illustrate the quality of this approximation, we consider the case
col="red")
with α = 2 and µ = 3. Simulate 10 000 data sets of size n = 10, and compute Histogram of alpha.sim.mles
0.6
the MLE α̂ML for each of them by numerically solving (4.5) using the R-function
uniroot (cf . Appendix C.1.1). Plot a histogram of the resulting 10 000 MLE 0.5
samples (using hist with option prob=TRUE). Add the approximate density 0.4
Density
derived above to compare.
0.3
◮ To illustrate the quality of this approximation by simulation, we may run
the following code. 0.2
1. In a cohort study on the incidence of ischaemic heart disease (IHD) 337 male
probands were enrolled. Each man was categorised as non-exposed (group 1, daily
energy consumption ≥ 2750 kcal) or exposed (group 2, daily energy consumption
< 2750 kcal) to summarise his average level of physical activity. For each group,
the number of person years (Y1 = 2768.9 and Y2 = 1857.5) and the number of IHD
cases (D1 = 17 and D2 = 28) was registered thereafter.
We assume that Di | Yi ∼ Po(λi Yi ), i = 1, 2, where λi > 0 is the group-specific
ind
incidence rate.
a) For each group, derive the MLE λ̂i and a corresponding 95% Wald confidence
interval for log(λi ) with subsequent back-transformation to the λi -scale.
◮ The log-likelihood kernel corresponding to a random variable X with Poisson
distribution Po(θ) is
l(θ) = −θ + x log(θ),
implying the score function
d x
S(θ; x) = l(θ) = −1 +
dθ θ
and the Fisher information
d x
I(θ) = − S(θ; x) = 2 .
dθ θ
The score equation is solved by θ̂ML = x and since the observed Fisher informa-
tion I(θ̂ML ) = 1/x is positive, θ̂ML indeed is the MLE of θ.
The rates of the Poisson distributions for Di can therefore be estimated by the
maximum likelihood estimators θ̂i = Di . Now,
θi
λi = ,
Yi
so, by the invariance of the MLE, we obtain that
θi Di
λ̂i = = .
Yi Yi
52 5 Likelihood inference in multiparameter models 53
With the given data we have the results λ̂1 = D1 /Y1 = 6.14 · 10−3 and λ̂2 = In terms of the new parametrisation,
D2 /Y2 = 1.51 · 10−2 .
λ1 = λ and λ2 = λθ,
We can now use the fact that Poisson distribution Po(θ) with θ a natural number
may be seen as the distribution of a sum of θ independent random variables, each so if, moreover, we denote D = D1 + D2 , we have
with distribution Po(1). As such, Po(θ) for reasonably large θ is approximately l(λ, θ) = D1 log(λ) − λY1 + D2 log(λ) + D2 log(θ) − λθY2
N(θ, θ). In our case, D1 = θ̂1 = 17 and D2 = θ̂2 = 28 might be considered
= D log(λ) + D2 log(θ) − λY1 − λθY2 . (5.1)
approximately normally distributed, θ̂i ∼ N(θi , θi ). It follows that λ̂i are, too,
approximately normal, λ̂i ∼ N(λi , λi /Yi ). The standard errors of λ̂i can be c) Compute the MLE (λ̂, θ̂), the observed Fisher information matrix I(λ̂, θ̂) and
√
estimated as se(λ̂i ) = Di /Yi . derive expressions for both profile log-likelihood functions lp (λ) = l{λ, θ̂(λ)} and
Again by the invariance of the MLE, the MLEs of ψi = log(λi ) = f (λi ) are lp (θ) = l{λ̂(θ), θ}.
◮ The score function is
Di
ψ̂i = log . d
!
D
!
Yi dλ l(λ, θ) λ − Y1 − θY2
S(λ, θ) = = ,
d D2
By the delta method, the standard errors of ψ̂i are dθ l(λ, θ) θ − λY2
and the Fisher information is
se(ψ̂i ) = se(λ̂i ) · f ′ (λ̂i ) ! !
√ d2 d2 l(λ,θ) D
Di 1 dλ2
l(λ, θ) λ2
Y2
= · I(λ, θ) = − d2 l(λ,θ)
dλ dθ
= .
d2
Yi λ̂i
D2
dλ dθ dθ2 l(λ, θ) Y2 θ2
1 The score equation S(λ, θ) = 0 is solved by
=√ .
Di
D1 D2 Y1
Therefore, the back-transformed limits of the 95% Wald confidence intervals with (λ̂, θ̂) = , ,
Y1 D1 Y2
log-transformation for λi equal
h and, as the observed Fisher information
p p i
exp ψ̂i − z0.975 / Di , exp ψ̂i + z0.975 / Di , DY12
2 Y2
I(λ̂, θ̂) = D1
D12 Y22
and with the data we get: Y2 D2 Y12
> (ci1 <- exp(log(d[1]/y[1]) + c(-1, 1) * qnorm(0.975) / sqrt(d[1])))
[1] 0.003816761 0.009876165 is positive definite, (λ̂, θ̂) indeed is the MLE.
> (ci2 <- exp(log(d[2]/y[2]) + c(-1, 1) * qnorm(0.975) / sqrt(d[2])))
[1] 0.01040800 0.02183188 In order to derive the profile log-likelihood functions, one first has to compute
the maxima of the log-likelihood with fixed λ or θ, which we call here θ̂(λ) and
i. e. the confidence interval for λ1 is (0.00382, 0.00988) and for λ2 it is
λ̂(θ). This amounts to solving the score equations dθd
l(λ, θ) = 0 and dλ
d
l(λ, θ) = 0
(0.01041, 0.02183).
separately for θ and λ, respectively. The solutions are
b) In order to analyse whether λ1 = λ2 , we reparametrise the model with λ = λ1
D2 D
and θ = λ2 /λ1 . Show that the joint log-likelihood kernel of λ and θ has the θ̂(λ) = and λ̂(θ) = .
λY2 Y1 + θY2
following form:
The strictly positive diagonal entries of the Fisher information show that the log-
l(λ, θ) = D log(λ) + D2 log(θ) − λY1 − θλY2 , likelihoods are strictly concave, so θ̂(λ) and λ̂(θ) indeed are the maxima. Now
we can obtain the profile log-likelihood functions by plugging in θ̂(λ) and λ̂(θ)
where D = D1 + D2 .
into the log-likelihood (5.1). The results are (after omitting additive constants
◮ First, note that θ now has a different meaning than in the solution of 1a).
not depending on the arguments λ and θ, respectively)
By the independence of D1 and D2 , the joint log-likelihood kernel in the original
parametrisation is lp (λ) = D1 log(λ) − λY1
5 −100
d) Plot both functions lp (λ) and lp (θ), and also create a contour plot of the relative
−400
log-likelihood l̃(λ, θ) using the R-function contour. Add the points {λ, θ̂(λ)} and 4 −120
lp(λ)
lp(θ)
> ## log-likelihood of lambda, theta:
θ
> loglik <- function(param) −0 −550
2 .5 −160
{
−2
−1
0
lambda <- param[1] −600
Since φ is the log relative incidence rate, and zero is not contained in the 95%
confidence interval (0.296, 1.501), the corresponding P-value for testing the null
hypothesis φ = 0, which is equivalent to θ = 1 and λ1 = λ2, must be smaller than
α = 5%.
Note that the use of (asymptotic) results from the likelihood theory can be justified
here by considering the Poisson distribution Po(n) as the distribution of a sum
of n independent Poisson random variables with unit rates, as in the solution
to 1a).

2. Let Z1:n be a random sample from a bivariate normal distribution N2(µ, Σ) with
mean vector µ = 0 and covariance matrix

Σ = σ² [ 1  ρ
         ρ  1 ].

a) Interpret σ² and ρ. Derive the MLE (σ̂²ML, ρ̂ML).
◮ σ² is the variance of each of the components Xi and Yi of the bivariate vector
Zi. The components have correlation ρ.
To derive the MLE, we first compute the log-likelihood kernel

l(Σ) = −(1/2) Σ_{i=1}^n { log|Σ| + (xi, yi) Σ⁻¹ (xi, yi)⊤ }.

In our case, since |Σ| = σ⁴(1 − ρ²) and

Σ⁻¹ = 1/{σ²(1 − ρ²)} [ 1  −ρ
                       −ρ  1 ],

we obtain

l(σ², ρ) = −(n/2) log{σ⁴(1 − ρ²)} − Q(ρ) / {2σ²(1 − ρ²)},

where Q(ρ) = Σ_{i=1}^n (xi² − 2ρxiyi + yi²). The score function thus has the components

d/dσ² l(σ², ρ) = −n/σ² + Q(ρ) / {2σ⁴(1 − ρ²)}

and

d/dρ l(σ², ρ) = nρ/(1 − ρ²) + 1/{σ²(1 − ρ²)} { Σ_{i=1}^n xiyi − ρ/(1 − ρ²) Q(ρ) }.

The first component of the score equation can be rewritten as

σ² = Q(ρ) / {2n(1 − ρ²)},

which, plugged into the second component of the score equation, yields

2 Σ_{i=1}^n xiyi / Q(ρ) = ρ/(1 − ρ²).

The equations are solved by

ρ̂ML = Σ_{i=1}^n xiyi / { (1/2) Σ_{i=1}^n (xi² + yi²) }   and   σ̂²ML = 1/(2n) Σ_{i=1}^n (xi² + yi²).

As the observed Fisher information matrix shown below is positive definite, the
above estimators are indeed the MLEs.

b) Show that the Fisher information matrix is

I(σ̂²ML, ρ̂ML) = [ n/σ̂⁴ML                          −nρ̂ML / {σ̂²ML(1 − ρ̂²ML)}
                 −nρ̂ML / {σ̂²ML(1 − ρ̂²ML)}        n(1 + ρ̂²ML) / (1 − ρ̂²ML)² ].

◮ The components of the Fisher information matrix I(σ², ρ) are computed as

−d²/d(σ²)² l(σ², ρ) = {Q(ρ) − nσ²(1 − ρ²)} / {σ⁶(1 − ρ²)},

−d²/(dσ² dρ) l(σ², ρ) = {Σ_{i=1}^n xiyi (1 − ρ²) − ρQ(ρ)} / {σ⁴(1 − ρ²)²}

and

−d²/dρ² l(σ², ρ) = {(1 − ρ²)Q(ρ) − nσ²(1 − ρ⁴) − 4ρ Σ_{i=1}^n (ρyi − xi)(ρxi − yi)} / {σ²(1 − ρ²)³},

and those of the observed Fisher information matrix I(σ̂²ML, ρ̂ML) are obtained
by plugging in the MLEs. The wished-for expressions can be obtained by simple
algebra, using that

Σ_{i=1}^n {(ρ̂ML yi − xi)(ρ̂ML xi − yi)} = nρ̂ML σ̂²ML (ρ̂²ML − 1).

The computations can also be performed in a suitable software.

c) Show that

se(ρ̂ML) = (1 − ρ̂²ML) / √n.

◮ Using the expression for the inversion of a 2 × 2 matrix, cf. Appendix B.1.1,
we obtain that the element I²² of the inverse of the observed Fisher information matrix
I(σ̂²ML, ρ̂ML)⁻¹ is

[ n²(1 + ρ̂²ML) / {σ̂⁴ML(1 − ρ̂²ML)²} − n²ρ̂²ML / {σ̂⁴ML(1 − ρ̂²ML)²} ]⁻¹ · n/σ̂⁴ML = (1 − ρ̂²ML)² / n.

The standard error of ρ̂ML is the square root of this expression.
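A quick simulation check of the closed-form estimators derived in a); the parameter
values below are arbitrary illustrative choices and not part of the exercise.

> library(MASS)  # for mvrnorm
> set.seed(1)
> n <- 1000; sigma2 <- 2; rho <- 0.6          # illustrative values only
> Sigma <- sigma2 * matrix(c(1, rho, rho, 1), nrow = 2)
> z <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
> x <- z[, 1]; y <- z[, 2]
> ## closed-form MLEs from above
> (rhoHat <- sum(x * y) / (0.5 * sum(x^2 + y^2)))
> (sigma2Hat <- sum(x^2 + y^2) / (2 * n))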
3. Calculate again the Fisher information of the profile log-likelihood in Result 5.2,
but this time without using Result 5.1. Use instead the fact that α̂ML(δ) is a point
where the partial derivative of l(α, δ) with respect to α equals zero.
◮ Suppose again that the data are split in two independent parts (denoted by 0
and 1), and the corresponding likelihoods are parametrised by α and β, respectively.
Then the log-likelihood decomposes as

l(α, β) = l0(α) + l1(β).

We are interested in the difference δ = β − α. Obviously β = α + δ, so the joint
log-likelihood of α and δ is

l(α, δ) = l0(α) + l1(α + δ).

Furthermore,

d/dα l(α, δ) = d/dα l0(α) + d/dα l1(α + δ) = S0(α) + S1(α + δ),

where S0 and S1 are the score functions corresponding to l0 and l1, respectively.
For the profile log-likelihood lp(δ) = l{α̂ML(δ), δ}, we need the value α̂ML(δ) for which
d/dα l{α̂ML(δ), δ} = 0, hence it follows that S1{α̂ML(δ) + δ} = −S0{α̂ML(δ)}. This is
also the derivative of the profile log-likelihood, because

d/dδ lp(δ) = d/dδ l{α̂ML(δ), δ}
           = d/dδ l0{α̂ML(δ)} + d/dδ l1{α̂ML(δ) + δ}
           = S0{α̂ML(δ)} d/dδ α̂ML(δ) + S1{α̂ML(δ) + δ} {d/dδ α̂ML(δ) + 1}
           = [S0{α̂ML(δ)} + S1{α̂ML(δ) + δ}] d/dδ α̂ML(δ) + S1{α̂ML(δ) + δ}
           = S1{α̂ML(δ) + δ}.

The Fisher information (negative curvature) of the profile log-likelihood is given by

Ip(δ) = −d²/dδ² lp(δ) = −d/dδ S1{α̂ML(δ) + δ} = I1{α̂ML(δ) + δ} {d/dδ α̂ML(δ) + 1},   (5.2)

which is equal to

Ip(δ) = d/dδ S0{α̂ML(δ)} = −I0{α̂ML(δ)} d/dδ α̂ML(δ),

as S1{α̂ML(δ) + δ} = −S0{α̂ML(δ)}. Here I0 and I1 denote the Fisher information
corresponding to l0 and l1, respectively. Hence, we can solve

I1{α̂ML(δ) + δ} {d/dδ α̂ML(δ) + 1} = −I0{α̂ML(δ)} d/dδ α̂ML(δ)

for d/dδ α̂ML(δ), and plug the result into (5.2) to finally obtain

1/Ip(δ) = 1/I0{α̂ML(δ)} + 1/I1{α̂ML(δ) + δ}.   (5.3)

4. Let X ∼ Bin(m, πx) and Y ∼ Bin(n, πy) be independent binomial random variables.
In order to analyse the null hypothesis H0: πx = πy one often considers the relative
risk θ = πx/πy or the log relative risk ψ = log(θ).
a) Compute the MLE ψ̂ML and its standard error for the log relative risk estimation.
Proceed as in Example 5.8.
◮ As in Example 5.8, we may use the invariance of the MLEs to conclude that

θ̂ML = π̂x/π̂y = (x1/m)/(x2/n) = nx1/(mx2)   and   ψ̂ML = log{nx1/(mx2)}.

Further, ψ = log(θ) = log(πx) − log(πy), so we can use Result 5.2 to derive
the standard error of ψ̂ML. In Example 2.10, we derived the observed Fisher
information corresponding to the MLE π̂ML = x/n as

I(π̂ML) = n / {π̂ML(1 − π̂ML)}.

Using Result 2.1, we obtain that

I{log(π̂ML)} = I(π̂ML) · (1/π̂ML)⁻² = n / {π̂ML(1 − π̂ML)} · π̂²ML.

By Result 5.2, we thus have that

se(ψ̂ML) = √[ I{log(π̂x)}⁻¹ + I{log(π̂y)}⁻¹ ] = √[ (1 − π̂x)/(mπ̂x) + (1 − π̂y)/(nπ̂y) ].
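The estimator and standard error from a) can be wrapped in a small R function; this is
only a convenience sketch added here, not code from the original solution.

> ## MLE and standard error of the log relative risk psi = log(pi_x/pi_y)
> logRelRisk <- function(x1, m, x2, n) {
      pix <- x1 / m; piy <- x2 / n
      psiHat <- log(pix / piy)
      se <- sqrt((1 - pix) / (m * pix) + (1 - piy) / (n * piy))
      c(estimate = psiHat, se = se)
  }
> logRelRisk(6, 108, 2, 103)   # the counts used in part b) below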
b) Compute a 95% confidence interval for the relative risk θ given the data in
Table 3.1.
◮ The estimated risk for preeclampsia in the Diuretics group is π̂x = 6/108 =
0.056, and in the Placebo group it is π̂y = 2/103 = 0.019. The log relative risk is
thus estimated by

ψ̂ML = log(6/108) − log(2/103) = 1.051,

with standard error se(ψ̂ML) = √{(1 − π̂x)/6 + (1 − π̂y)/2} = 0.805, so the 95%
confidence interval is

[1.051 − 1.96 · 0.805, 1.051 + 1.96 · 0.805] = [−0.526, 2.629].

Back-transformation to the relative risk scale by exponentiating gives the follow-
ing 95% confidence interval for the relative risk θ = πx/πy: [0.591, 13.855].
c) Also compute the profile likelihood and the corresponding 95% profile likelihood
confidence interval for θ.
◮ The joint log-likelihood for θ = πx/πy and πy is

l(θ, πy) = x log(θ) + (m − x) log(1 − θπy) + (x + y) log(πy) + (n − y) log(1 − πy).

In order to compute the profile likelihood, we need to maximise l(θ, πy) with
respect to πy for fixed θ. To do this, we look for the points where

d/dπy l(θ, πy) = θ(m − x)/(θπy − 1) + (x + y)/πy + (n − y)/(πy − 1)

equals zero. To this aim, we need to solve the quadratic equation

πy² θ(m + n) − πy {θ(m + y) + n + x} + x + y = 0

for πy. In this case, it is easier to perform the numerical maximisation.
> ## data
> x <- c(6, 2)
> n <- c(108, 103)
> ## MLE
> piMl <- x/n
> thetaMl <- piMl[1] / piMl[2]
> ## profile log-likelihood of theta:
> profilLoglik <- function(theta)
  {
      res <- theta
      eps <- sqrt(.Machine$double.eps)
      for(i in seq_along(theta)){ # the function can handle vectors
          optimResult <-
              optim(par = 0.5,
                    fn = function(pi2) loglik(theta[i], pi2),
                    gr = function(pi2) grad(theta[i], pi2),
                    method = "L-BFGS-B", lower = eps, upper = 1/theta[i] - eps,
                    control = list(fnscale = -1))
          if(optimResult$convergence == 0){ # has the algorithm converged?
              res[i] <- optimResult$value # only then save the value (not the parameter!)
          } else {
              res[i] <- NA # otherwise return NA
          }
      }
      return(res)
  }
> ## plot the normed profile log-likelihood:
> thetaGrid <- seq(from = 0.1, to = 30, length = 309)
> normProfVals <- profilLoglik(thetaGrid) - profilLoglik(thetaMl)
> plot(thetaGrid, normProfVals,
       type = "l", ylim = c(-5, 0),
       xlab = expression(theta), ylab = expression(tilde(l)[p](theta)))
> ## show the cutpoint:
> abline(h = - 1/2 * qchisq(0.95, 1), lty = 2)

[Figure: the normed profile log-likelihood l̃p(θ) with the 95% cutpoint
−(1/2) χ²0.95(1) indicated by a dashed line.]
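The numerical computation of the profile likelihood confidence interval itself is not shown
above; a sketch using uniroot on the normed profile log-likelihood (with profilLoglik and
thetaMl as defined in the preceding code) could look as follows. The bracketing intervals
are ad hoc choices.

> ## 95% profile likelihood CI: solve normed profile log-likelihood = cutoff
> cutoff <- -0.5 * qchisq(0.95, 1)
> f <- function(theta) profilLoglik(theta) - profilLoglik(thetaMl) - cutoff
> lowerTheta <- uniroot(f, c(0.1, thetaMl))$root
> upperTheta <- uniroot(f, c(thetaMl, 30))$root
> c(lowerTheta, upperTheta)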
a) Use Result 5.2 to compute the standard error of the logarithm of the positive and
negative likelihood ratio, defined as LR+ = πx/(1 − πy) and LR− = (1 − πx)/πy.
Suppose m = n = 100, x = 95 and y = 90. Compute a point estimate and the
limits of a 95% confidence interval for both LR+ and LR−, using the standard
error from above.
◮ As in Example 5.8, we may use the invariance of the MLEs to conclude that

LR̂+ = π̂x/(1 − π̂y) = (x1/m)/(1 − x2/n) = nx1/(nm − mx2)

and

LR̂− = (1 − π̂x)/π̂y = (1 − x1/m)/(x2/n) = (nm − nx1)/(mx2).

Now, log(LR+) = log(πx) − log(1 − πy) and log(LR−) = log(1 − πx) − log(πy), so
Result 5.2 gives the standard errors of the estimated log likelihood ratios; for the
data above they are 0.301 and 0.437. The resulting 95% confidence intervals on
the log scale are

[2.251 − 1.96 · 0.301, 2.251 + 1.96 · 0.301] = [1.662, 2.841]

and

[−2.89 − 1.96 · 0.437, −2.89 + 1.96 · 0.437] = [−3.747, −2.034].

Back-transformation to the positive and negative likelihood ratio scale by ex-
ponentiating gives the following 95% confidence intervals: [5.268, 17.133] and
[0.024, 0.131].
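The point estimates and confidence intervals quoted above can be reproduced with a few
lines of R (m = n = 100, x = 95 and y = 90 as in the exercise); this snippet is an added
illustration, not part of the original solution.

> m <- 100; n <- 100; x <- 95; y <- 90
> pix <- x/m; piy <- y/n
> ## point estimates on the log scale
> (logLRpos <- log(pix / (1 - piy)))
> (logLRneg <- log((1 - pix) / piy))
> ## standard errors via Result 5.2 / the delta method
> seLogLRpos <- sqrt((1 - pix) / (m * pix) + piy / (n * (1 - piy)))
> seLogLRneg <- sqrt(pix / (m * (1 - pix)) + (1 - piy) / (n * piy))
> ## 95% confidence intervals, back-transformed to the LR scale
> exp(logLRpos + c(-1, 1) * qnorm(0.975) * seLogLRpos)
> exp(logLRneg + c(-1, 1) * qnorm(0.975) * seLogLRneg)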
b) The positive predictive value PPV is the probability of disease, given a positive
test result. The equation

PPV/(1 − PPV) = LR+ · ω

relates PPV to LR+ and to the pre-test odds of disease ω. Likewise, the follow-
ing equation holds for the negative predictive value NPV, the probability to be
disease-free, given a negative test result:

(1 − NPV)/NPV = LR− · ω.
Suppose ω = 1/1000. Use the 95% confidence interval for LR+ and LR−, ob-
tained in 5a), to compute the limits of a 95% confidence interval for both PPV
and NPV.
◮ By multiplying the endpoints of the confidence intervals for LR+ and LR−
obtained in 5a) by ω, we obtain confidence intervals for PPV/(1 − PPV) and
(1 − NPV)/NPV, respectively. It remains to transform these intervals to the
scales of PPV and NPV. Now, if the confidence interval for PPV/(1 − PPV) is
[l, u], then the confidence interval for PPV is [l/(1 + l), u/(1 + u)]; and if the confi-
dence interval for (1 − NPV)/NPV is [l, u], then the confidence interval for NPV
is [1/(1 + u), 1/(1 + l)]. With our data, we obtain the following confidence in-
tervals for PPV/(1 − PPV) and (1 − NPV)/NPV, respectively: [0.00527, 0.01713]
and [2.4e−05, 0.000131]; and the following confidence intervals for PPV and
NPV, respectively: [0.00524, 0.01684] and [0.999869, 0.999976].

6. In the placebo-controlled clinical trial of diuretics during pregnancy to prevent
preeclampsia by Fallis et al. (cf. Table 1.1), 6 out of 38 treated women and 18 out
of 40 untreated women got preeclampsia.
a) Formulate a statistical model assuming independent binomial distributions
in the two groups. Translate the null hypothesis "there is no difference in
preeclampsia risk between the two groups" into a statement on the model pa-
rameters.
◮ We consider two independent random variables X1 ∼ Bin(n1, π1) and
X2 ∼ Bin(n2, π2) modelling the number of preeclampsia cases in the treat-
ment and control group, respectively. Here the sample sizes are n1 = 38 and
n2 = 40; and we have observed the realisations x1 = 6 and x2 = 18. No differ-
ence in preeclampsia risk between the two groups is expressed in the hypothesis
H0: π1 = π2.
b) Let θ denote the risk difference between treated and untreated women. Derive
the MLE θ̂ML and a 95% Wald confidence interval for θ. Also give the MLE for
the number needed to treat (NNT) which is defined as 1/θ.
◮ In terms of the model parameters, we have θ = π1 − π2. By the invariance
of the MLEs, we obtain the MLE of θ as

θ̂ML = π̂1 − π̂2 = x1/n1 − x2/n2.

To derive the standard error, we may use Result 5.2 and Example 2.10 to con-
clude that

se(θ̂ML) = √{ π̂1(1 − π̂1)/n1 + π̂2(1 − π̂2)/n2 }.

It follows that a 95% Wald confidence interval for θ is given by

[ π̂1 − π̂2 − z0.975 √{π̂1(1 − π̂1)/n1 + π̂2(1 − π̂2)/n2},
  π̂1 − π̂2 + z0.975 √{π̂1(1 − π̂1)/n1 + π̂2(1 − π̂2)/n2} ].

We can also get the MLE for the number needed to treat as

N̂NT = 1/|θ̂ML| = 1/|x1/n1 − x2/n2|.

For the concrete data example:
> ## the data
> x <- c(6, 18)
> n <- c(38, 40)
> ## MLEs
> pi1Hat <- x[1] / n[1]
> pi2Hat <- x[2] / n[2]
> (thetaHat <- pi1Hat - pi2Hat)
[1] -0.2921053
> (nntHat <- 1 / abs(thetaHat))
[1] 3.423423
> ## Wald CI
> seTheta <- sqrt(pi1Hat * (1 - pi1Hat) / n[1] +
                  pi2Hat * (1 - pi2Hat) / n[2])
> (thetaWald <- thetaHat + c(-1, 1) * qnorm(0.975) * seTheta)
[1] -0.48500548 -0.09920505
c) Write down the joint log-likelihood kernel l(π1, θ) of the risk parameter π1 in
the treatment group and θ. In order to derive the profile log-likelihood of θ,
consider θ as fixed and write down the score function for π1,

Sπ1(π1, θ) = ∂/∂π1 l(π1, θ).

Which values are allowed for π1 when θ is fixed?
◮ The joint likelihood kernel of π1 and π2 is

L(π1, π2) = π1^x1 (1 − π1)^(n1 − x1) · π2^x2 (1 − π2)^(n2 − x2).

We can rewrite the risk in the control group as π2 = π1 − θ. Plugging this into
the log-likelihood kernel of π1 and π2 gives the joint log-likelihood kernel of risk
in the treatment group π1 and the risk difference θ as

l(π1, θ) = x1 log(π1) + (n1 − x1) log(1 − π1) + x2 log(π1 − θ) + (n2 − x2) log(1 − π1 + θ).

The score function for π1 is thus given by

Sπ1(π1, θ) = d/dπ1 l(π1, θ)
           = x1/π1 + (x1 − n1)/(1 − π1) + x2/(π1 − θ) + (x2 − n2)/(1 − π1 + θ).

If we want to solve the score equation Sπ1(π1, θ) = 0, we must think about the
allowed range for π1: Of course, we have 0 < π1 < 1. And we also have this for
the second proportion π2, giving 0 < π1 − θ < 1 or θ < π1 < 1 + θ. Altogether we
have the range max{0, θ} < π1 < min{1, 1 + θ}.
d) Write an R-function which solves Sπ1(π1, θ) = 0 (use uniroot) and thus gives an
estimate π̂1(θ). Hence write an R-function for the profile log-likelihood lp(θ) =
l{π̂1(θ), θ}.
◮
> ## the score function for pi1:
> pi1score <- function(pi1, theta)
  {
      x[1] / pi1 + (x[1] - n[1]) / (1 - pi1) +
          x[2] / (pi1 - theta) + (x[2] - n[2]) / (1 - pi1 + theta)
  }
> ## get the MLE for pi1 given fixed theta:
> getPi1 <- function(theta)
  {
      eps <- 1e-9
      uniroot(pi1score,
              interval=
                  c(max(0, theta) + eps,
                    min(1, 1 + theta) - eps),
              theta=theta)$root
  }
> ## the joint log-likelihood kernel:
> loglik <- function(pi1, theta)
  {
      x[1] * log(pi1) + (n[1] - x[1]) * log(1 - pi1) +
          x[2] * log(pi1 - theta) + (n[2] - x[2]) * log(1 - pi1 + theta)
  }
> ## so we have the profile log-likelihood for theta:
> profLoglik <- function(theta)
  {
      pi1Hat <- getPi1(theta)
      loglik(pi1Hat, theta)
  }
e) Compute a 95% profile likelihood confidence interval for θ using numerical tools.
Compare it with the Wald interval. What can you say about the P-value for
the null hypothesis from 6a)?
◮
> ## now we need the relative profile log-likelihood,
> ## and for that the value at the MLE:
> profLoglikMle <- profLoglik(thetaHat)
> relProfLoglik <- function(theta)
  {
      profLoglik(theta) - profLoglikMle
  }
> ## now compute the profile CI bounds:
> lower <- uniroot(function(theta){relProfLoglik(theta) + 1.92},
                   c(-0.99, thetaHat))
> upper <- uniroot(function(theta){relProfLoglik(theta) + 1.92},
                   c(thetaHat, 0.99))
> (profLogLikCi <- c(lower$root, upper$root))
[1] -0.47766117 -0.09330746
> ## compare with Wald interval
> thetaWald
[1] -0.48500548 -0.09920505
The two 95% confidence intervals are quite close, and neither contains the ref-
erence value zero. Therefore, the P-value for testing the null hypothesis of no
risk difference between the two groups against the two-sided alternative must be
smaller than 5%.

7. The AB0 blood group system was described by Karl Landsteiner in 1901, who
was awarded the Nobel Prize for this discovery in 1930. It is the most important
blood type system in human blood transfusion, and comprises four different groups:
A, B, AB and 0.
Blood groups are inherited from both parents. Blood groups A and B are dominant
over 0 and codominant to each other. Therefore a phenotype blood group A may
have the genotype AA or A0, for phenotype B the genotype is BB or B0, for
phenotype AB there is only the genotype AB, and for phenotype 0 there is only
the genotype 00.
Let p, q and r be the proportions of alleles A, B, and 0 in a population, so p + q + r = 1
and p, q, r > 0. Then the probabilities of the four blood groups for the offspring
generation are given in Table 5.1. Moreover, the realisations in a sample of size
n = 435 are reported.

Tab. 5.1: Probability of offspring's blood group given allele frequencies
in parental generation, and sample realisations.

Blood group        Probability          Observation
A = {AA, A0}       π1 = p² + 2pr        x1 = 182
B = {BB, B0}       π2 = q² + 2qr        x2 = 60
AB = {AB}          π3 = 2pq             x3 = 17
0 = {00}           π4 = r²              x4 = 176

a) Explain how the probabilities in Table 5.1 arise. What assumption is tacitly
made?
◮ The core assumption is random mating, i. e. there are no mating restric-
tions, neither genetic nor behavioural, upon the population, and that therefore all
recombination is possible. We assume that the alleles are independent, so the
probability of the haplotype a1/a2 (i. e. the alleles in the order mother/father)
is given by Pr(a1) Pr(a2), where Pr(ai) is the frequency of allele ai in the popu-
lation. Then we look at the haplotypes which produce the requested phenotype,
and sum their probabilities to get the probability for the requested phenotype. For
example, phenotype A is produced by the haplotypes A/A, A/0 and 0/A, having
probabilities p · p, p · r and r · p, and summing up gives π1.
b) Write down the log-likelihood kernel of θ = (p, q)⊤. To this end, assume that x =
(x1, x2, x3, x4)⊤ is a realisation from a multinomial distribution with parameters
n and the probabilities given in Table 5.1.
◮ Using r = 1 − p − q and the probabilities from Table 5.1, the log-likelihood kernel is

l(p, q) = x1 log(p² + 2pr) + x2 log(q² + 2qr) + x3 log(2pq) + x4 log{(1 − p − q)²}.

Note that we have used here that r = 1 − p − q, so there are only two parameters
in this problem.
c) Compute the MLEs of p and q numerically, using the R function optim. Use
the option hessian = TRUE in optim and process the corresponding output to
receive the standard errors of p̂ML and q̂ML.
◮
> ## observed data:
> data <- c(182, 60, 17, 176)
> n <- sum(data)
> ## the loglikelihood function of theta = (p, q)
> loglik <- function(theta, data) {
      p <- theta[1]
      q <- theta[2]
      r <- 1-p-q
      sum(data * log(c(p^2 + 2*p*r, q^2 + 2*q*r, 2*p*q, r^2)))
  }
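The call to optim and the processing of its Hessian are not reproduced above; the following
is only a sketch of how this step might look. Minimising the negative log-likelihood makes
the returned Hessian an estimate of the observed Fisher information; the names thetaMl
and thetaCov are chosen to match their use further below.

> ## sketch: minimise the negative log-likelihood; out-of-range parameter
> ## values are penalised so that the optimiser stays inside the simplex
> negLoglik <- function(theta) {
      if (any(theta <= 0) || sum(theta) >= 1) return(1e10)
      -loglik(theta, data)
  }
> optimResult <- optim(par = c(0.3, 0.3), fn = negLoglik, hessian = TRUE)
> ## MLE of (p, q); the Hessian of the negative log-likelihood estimates the
> ## observed Fisher information, so its inverse estimates the covariance
> (thetaMl <- optimResult$par)
> (thetaCov <- solve(optimResult$hessian))
> ## standard errors of the MLEs of p and q
> sqrt(diag(thetaCov))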
[Figure: contour plot of the log-likelihood l(p, q) for the blood group data.]

For the standard error of r̂ML = 1 − p̂ML − q̂ML we can apply the multivariate delta
method in the special case of a linear transformation g(θ) = a⊤θ + b, for which
D(θ̂ML) = a⊤. Thus,

se{g(θ̂ML)} = √{ a⊤ I(θ̂ML)⁻¹ a }.

In our case, θ = (p, q)⊤, a⊤ = (−1, −1), and b = 1, so

se(r̂ML) = se{g(p̂ML, q̂ML)} = √{ (−1, −1) I(p̂ML, q̂ML)⁻¹ (−1, −1)⊤ }
        = √{ Σ_{i,j=1}^2 [I(p̂ML, q̂ML)⁻¹]ij }.

> (rMl <- 1 - sum(thetaMl))
[1] 0.6423991
> (rSe <- sqrt(sum(thetaCov)))
[1] 0.01761619

… the likelihood ratio statistic

W = n log{1 + T²/(n − 1)},

see Example 5.15, and compare it graphically with the density function of the
χ²(1) distribution for different values of n.
Using the change-of-variables formula, we obtain the following form for the den-
sity fW of W:

fW(x) = fT²{g⁻¹(x)} |d/dx g⁻¹(x)|
      = fT[ √{(n − 1)(exp(x/n) − 1)} ] / √{(n − 1)(exp(x/n) − 1)} · {(n − 1)/n} exp(x/n).

[Figure: densities f(x) of W and of the χ²(1) distribution for several sample
sizes, including n = 20 and n = 100.]

Indeed we can see that the differences between the densities are small already for
n = 20.
b) Show that for n → ∞, W follows indeed a χ²(1) distribution.
◮ Since T converges in distribution to N(0, 1) as n → ∞, we have that T² con-
verges in distribution to χ²(1) as n → ∞. Further, the transformation g is for
large n close to the identity, as

g(x) = n log{1 + x/(n − 1)} → x   as n → ∞.

Expanding the log-likelihood ratio statistic around the null probabilities gives

Wn = 2n Σ_{i=1}^k (ni/n) log{(ni/n)/pi0}
   = 2n Σ_{i=1}^k [ (ni/n − pi0) + (1/2)(ni/n − pi0)²/pi0 + O{(ni/n − pi0)³} ]
   = Dn + n Σ_{i=1}^k O{(ni/n − pi0)³},

where the last equality follows from the fact that both ni/n and pi0 must sum up to
one.
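Returning to the graphical comparison of fW with the χ²(1) density described above, a
short R sketch is given below; it assumes, as in Example 5.15, that T has a t distribution
with n − 1 degrees of freedom, and is an added illustration rather than the original code.

> ## density of W = n*log(1 + T^2/(n-1)) by the change-of-variables formula
> fW <- function(x, n) {
      u <- (n - 1) * (exp(x / n) - 1)        # g^{-1}(x) = T^2
      dt(sqrt(u), df = n - 1) / sqrt(u) * (n - 1) / n * exp(x / n)
  }
> ## compare with the chi-squared(1) density for two sample sizes
> x <- seq(0.01, 5, length = 200)
> par(mfrow = c(1, 2))
> for (n in c(20, 100)) {
      plot(x, fW(x, n), type = "l", ylab = "f(x)", main = paste("n =", n))
      lines(x, dchisq(x, df = 1), lty = 2)
  }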
Since both (ni /n − πi ) and (pi0 − πi ) converge to zero in probability and both > n <- 100
√ √
n(ni /n − πi ) and n(pi0 − πi ) are bounded in probability as n → ∞, the last > gridSize <- 100
> theta1grid <- seq(0.8,1,length=gridSize)
→ 0 as n → ∞.
P
sum converges to zero in probability as n → ∞, i. e. Wn − Dn − > theta2grid <- seq(0.3,0.6,length=gridSize)
10.In a psychological experiment the forgetfulness of probands is tested with the recog- > loglik <- matrix(NA, nrow=length(theta1grid), ncol=length(theta2grid))
> for(i in 1:length(theta1grid)){
nition of syllable triples. The proband has ten seconds to memorise the triple, af- for(j in 1:length(theta2grid)){
terwards it is covered. After a waiting time of t seconds it is checked whether the loglik[i, j] <-
loglik.fn(theta1=theta1grid[i], theta2=theta2grid[j], n=n, y=y, t=t)
proband remembers the triple. For each waiting time t the experiment is repeated }
n times. }
Let y = (y1 , . . . , ym ) be the relative frequencies of correctly remembered syllable > contour(theta1grid, theta2grid, loglik, nlevels=50, xlab=math (theta[1]),
ylab=math (theta[2]), xaxs = "i", yaxs = "i", labcex=1)
triples for the waiting times of t = 1, . . . , m seconds. The power model now assumes,
0.60
37 34 31 28 −317
that −3 −3 −3 −3
−327
333 30 27
π(t; θ) = θ1 t−θ2 , 0 ≤ θ1 ≤ 1, θ2 > 0, −3 −315
0.55 − −3
29
−3
is the probability to correctly remember a syllable triple after the waiting time 0.50
t ≥ 1. 0.45
−314
θ2
−316
a) Derive an expression for the log-likelihood l(θ) where θ = (θ1 , θ2 ). 0.40 −318
−319
◮ If the relative frequencies of correctly remembered triples are independent for −321
−322
−320
−323
−324
0.35 −325 41
different waiting times, then the likelihood kernel is −328
−326 −327
−3 2
−329 −330 −338
−332 −331 35
m 0.30 −334 −333 −335 −337 −339 −342 −346 −
Y
L(θ1 , θ2 ) = (θ1 t−θ
i ) (1 − θ1 t−θ
2 nyi
i )
2 n−nyi 0.80 0.85 0.90 0.95 1.00
i=1 θ1
Pm m
Y
n yi
= θ1 i=1
ti−θ2 nyi (1 − θ1 t−θ
i )
2 n−nyi
. 11.Often the exponential model is used instead of the power model (described in Ex-
i=1 ercise 10), assuming:
The log-likelihood kernel is thus
π(t; θ) = min{1, θ1 exp(−θ2 t)}, t > 0, θ1 > 0 and θ2 > 0.
m
X m
X m
X
l(θ1 , θ2 ) = n log(θ1 ) yi − nθ2 yi log(ti ) + n (1 − yi ) log(1 − θ1 t−θ
i
2
). a) Create a contour plot of the log-likelihood in the parameter range [0.8, 1.4] ×
i=1 i=1 i=1 [0, 0.4] for the same data as in Exercise 10.
b) Create a contour plot of the log-likelihood in the parameter range [0.8, 1] × ◮
[0.3, 0.6] with n = 100 and > pi.exp <- function(theta1, theta2, t){
return(
pmin( theta1*exp(-theta2*t), 1 )
y = (0.94, 0.77, 0.40, 0.26, 0.24, 0.16), t = (1, 3, 6, 9, 12, 18). )
}
> loglik.fn <- function(theta, n, y, t){
return(
◮ sum( dbinom(x=n*y, size=n,
> loglik.fn <- function(theta1, theta2, n, y, t){ prob=pi.exp(theta1=theta[1], theta2=theta[2], t=t), log=TRUE) )
return( )
n*log(theta1)*sum(y) }
- n*theta2*sum(y*log(t)) > y <- c(0.94,0.77,0.40,0.26,0.24,0.16)
+ n*sum((1-y)*log(1-theta1*t^(-theta2))) > t <- c(1, 3, 6, 9, 12, 18)
) > n <- 100
} > gridSize <- 100
> y <- c(0.94,0.77,0.40,0.26,0.24,0.16) > theta1grid <- seq(0.8,1.4,length=gridSize)
> t <- c(1, 3, 6, 9, 12, 18) > theta2grid <- seq(0,0.4,length=gridSize)
0.4
> loglik <- matrix(NA, nrow=length(theta1grid), ncol=length(theta2grid)) −320 −300 −280 −260 −240 −220 −200
−180
> for(i in 1:length(theta1grid)){ −160
−140
for(j in 1:length(theta2grid)){ 0.3 −120
loglik[i, j] <- −100
−80
loglik.fn(theta=c(theta1grid[i], theta2grid[j]), n=n, y=y, t=t)
−60
} 0.2
θ2
−40
}
> contour(theta1grid, theta2grid, loglik, nlevels=50, xlab=math (theta[1]),
ylab=math (theta[2]), xaxs = "i", yaxs = "i", labcex=1) −20
0.1
0.4 −40 −60
−320 −300 −280 −260 −240 −220 −200 40
−180 −100 −120 −160 −1
−160 0.0 −340 −420 −540 −7
−76400
−140
0.3 −120
0.8 0.9 1.0 1.1 1.2 1.3 1.4
−100
−80 θ1
−60
0.2
c) For 0 ≤ t ≤ 20 create a plot of π(t; θ̂ML ) and add the observations y.
θ2
−40
◮
−20
0.1 > tt <- seq(1, 20, length=101)
−40 −60
> plot(tt, pi.exp(theta1=thetaMl[1], theta2=thetaMl[2], t=tt), type="l",
40
−100 −120 −160 −1 xlab= math (t), ylab= math (pi(t, hat(theta)[ML])) )
0.0 −340 −420 −540 −7400
−76 > points(t, y, pch = 19, col = 2)
0.8 0.9 1.0 1.1 1.2 1.3 1.4
θ1
0.8
Note that we did not evaluate the likelihood values in the areas where π(t; θ) = 1
for some t. These are somewhat particular situations, because when π = 1,
θML)
0.6
the corresponding likelihood contribution is 1, no matter what we observe as the
π(t, ^
corresponding y and no matter what the values of θ1 and θ2 are. 0.4
b) Use the R-function optim to numerically compute the MLE θ̂ ML . Add the MLE
to the contour plot from 11a). 0.2
◮
> ## numerically optimise the log-likelihood
5 10 15 20
> optimResult <- optim(c(1.0, 0.1), loglik.fn,
control = list(fnscale=-1), t
n=n, y=y, t=t)
> ## check convergence: 12.Let X1:n be a random sample from a log-normal LN(µ, σ 2 ) distribution, cf . Ta-
> optimResult[["convergence"]] == 0
ble A.2.
[1] TRUE
> ## extract the MLE a) Derive the MLE of µ and σ 2 . Use the connection between the densities of the
> thetaMl <- optimResult$par
normal distribution and the log-normal distribution. Also compute the corre-
> ## add the MLE to the plot
> contour(theta1grid, theta2grid, loglik, nlevels=50, xlab=math (theta[1]), sponding standard errors.
ylab=math (theta[2]), xaxs = "i", yaxs = "i", labcex=1) ◮ We know that if X is normal, i. e. X ∼ N(µ, σ 2 ), then exp(X) ∼ LN(µ, σ 2 )
> points(thetaMl[1], thetaMl[2], pch = 19, col = 2)
(cf. Table A.2). Thus, if we have a random sample X1:n from log-normal distri-
bution LN(µ, σ 2 ), then Y1:n = {log(X1 ), . . . , log(Xn )} is a random sample from
normal distribution N(µ, σ 2 ). In Example 5.3 we computed the MLEs in the
normal model:
1X
n
2
µ̂ML = ȳ and σ̂ML = (yi − ȳ)2 ,
n
i=1
11.6
and in Section 5.2 we derived the corresponding standard errors se(µ̂ML ) =
√ 2 2
p 15
σ̂ML / n and se(σ̂ML ) = σ̂ML 2/n. 11.4
b) Derive the profile log-likelihood functions of µ and σ 2 and plot them for the
11.2
10
following data:
L p (σ2)
L p (µ)
11.0
x = (225, 171, 198, 189, 189, 135, 162, 136, 117, 162). 5
10.8
Compare the profile log-likelihood functions with their quadratic approxima- 10.6
tions. 0
) expected survival time is 1/η. θ is the multiplicative change of the rate for the
} treatment group. The expected survival time changes to 1/(ηθ).
> approxrelloglik.sigma2 <- function(y, sigma2){
n <- length(y) The likelihood function is, by independence of all patients and the distributional
hatsigma2 <- (n-1)/n*var(y) assumptions (cf. Example 2.8):
return(
-n/4/(hatsigma2)^2*(hatsigma2-sigma2)^2 n
Y m
Y
) L(η, θ) = f (xi ; η)γi {1 − F (xi ; η)}(1−γi ) · f (yi ; η, θ)δi {1 − F (yi ; η, θ)}(1−δi )
} i=1 i=1
> par(mfrow=c(1, 2)) n m
> mu.x <- seq(4.8, 5.4, length=101) Y Y
> plot(mu.x, relloglik.mu(y=y, mu=mu.x), = {η exp(−ηxi )}γi {exp(−ηxi )}(1−γi ) · {ηθ exp(−ηθyi )}δi {exp(−ηθyi )}(1−δi )
type="l", xlab=math(mu), ylab=math(L[p](mu))) i=1 i=1
> lines(mu.x, approxrelloglik.mu(y=y, mu=mu.x), = η nγ̄+mδ̄ θ mδ̄ exp {−η (nx̄ + θmȳ)} .
lty=2, col=2)
> abline(v=mean(y), lty=2)
> sigma2.x <- seq(0.02, 0.05, length=101) Hence the log-likelihood is
> plot(sigma2.x, relloglik.sigma2(y=y, sigma2=sigma2.x),
type="l", xlab=math(sigma^2), ylab=math(L[p](sigma^2))) l(η, θ) = log{L(η, θ)}
> lines(sigma2.x, approxrelloglik.sigma2(y=y, sigma2=sigma2.x),
lty=2, col=2) = (nγ̄ + mδ̄) log(η) + mδ̄ log(θ) − η(nx̄ + θmȳ).
> abline(v=(length(y)-1)/length(y)*var(y), lty=2)
0 0.0
b) Calculate the MLE (η̂ML , θ̂ML ) and the observed Fisher information matrix
I(η̂ML , θ̂ML ).
−1 −0.2 ◮ For calculating the MLE, we need to solve the score equations. The score
−2 function components are
−0.4
L p (σ2)
L p (µ)
−3
d nγ̄ + mδ̄
l(η, θ) = − (nx̄ + θmȳ)
−0.6
−4 dη η
−0.8
−5
and
−1.0 d mδ̄
−6
l(η, θ) = − ηmȳ.
dθ θ
l(η, θ) = 0 we get θ̂ML = δ̄/(η̂ML ȳ). Plugging
4.8 4.9 5.0 5.1 5.2 5.3 5.4 0.020 0.030 0.040 0.050 d
From the second score equation dθ
µ σ2 d
this into the first score equation dη l(η, θ) = 0 we get η̂ML = γ̄/x̄, and hence
13.We assume an exponential model for the survival times in the randomised placebo- θ̂ML = (x̄δ̄)/(ȳγ̄).
controlled trial of Azathioprine for primary biliary cirrhosis (PBC) from Sec- The ordinary Fisher information matrix is
tion 1.1.8. The survival times (in days) of the n = 90 patients in the placebo !
nγ̄+mδ̄
group are denoted by xi with censoring indicators γi (i = 1, . . . , n), while the sur- η2 mȳ
I(η, θ) = ,
vival times of the m = 94 patients in the treatment group are denoted by yi and
mδ̄
mȳ θ2
have censoring indicators δi (i = 1, . . . , m). The (partly unobserved) uncensored
so plugging in the MLEs gives the observed Fisher information matrix
survival times follow exponential models with rates η and θη in the placebo and
!
treatment group, respectively (η, θ > 0). (nγ̄ + mδ̄) · (x̄/γ̄)2 mȳ
I(η̂ML , θ̂ML ) = .
a) Interpret η and θ. Show that their joint log-likelihood is mȳ mδ̄ · {(ȳγ̄)/(x̄δ̄)}2
l(η, θ) = (nγ̄ + mδ̄) log(η) + mδ̄ log(θ) − η(nx̄ + θmȳ),
c) Show that s
where γ̄, δ̄, x̄, ȳ are the averages of the γi , δi , xi and yi , respectively. nγ̄ + mδ̄
se(θ̂ML ) = θ̂ML · ,
◮ η is the rate of the exponential distribution in the placebo group, so the nγ̄ · mδ̄
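As an added illustration of the closed-form estimates η̂ML = γ̄/x̄ and θ̂ML = (x̄δ̄)/(ȳγ̄)
derived in b), the following sketch applies them to simulated censored survival data; the
rates, group sizes and censoring time are made up and are not the PBC data.

> ## sketch with simulated data: eta = baseline rate, theta = rate ratio
> set.seed(1)
> n <- 90; m <- 94
> eta <- 0.001; theta <- 0.6; cens <- 2000          # illustrative values only
> x <- pmin(rexp(n, rate = eta), cens); gamma <- as.numeric(x < cens)
> y <- pmin(rexp(m, rate = eta * theta), cens); delta <- as.numeric(y < cens)
> ## closed-form MLEs from part b)
> (etaHat <- mean(gamma) / mean(x))
> (thetaHat <- (mean(x) * mean(delta)) / (mean(y) * mean(gamma)))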
and derive a general formula for a γ · 100% Wald confidence interval for θ. Use giving the standard error for ψ̂ML as
Appendix B.1.1 to compute the required entry of the inverse observed Fisher
d
information. se(ψ̂ML ) = se(θ̂ML ) log(θ̂ML )
dθ
◮ From Section 5.2 we know that the standard error of θ̂ML is given by the s
square root of the corresponding diagonal element of the inverse observed Fisher nγ̄ + mδ̄ 1
= θ̂ML
information matrix. From Appendix B.1.1 we know how to compute the required nγ̄ · mδ̄ θ̂ML
s
entry:
nγ̄ + mδ̄
h i = .
(nγ̄ + mδ̄) · (x̄/γ̄)2 nγ̄ · mδ̄
I(η̂ML , θ̂ML )−1 =
22 (nγ̄ + mδ̄) · (x̄/γ̄)2 · mδ̄ · (ȳγ̄)2 /(x̄δ̄)2 − (mȳ)2
Thus, a γ · 100% Wald confidence interval for ψ is given by
(nγ̄ + mδ̄) · (x̄/γ̄)2
= s s
(nγ̄ + mδ̄) · m · ȳ 2 /δ̄ − (mȳ)2
log{(x̄δ̄)/(ȳγ̄)} − z(1+γ)/2 nγ̄ + mδ̄ , log{(x̄δ̄)/(ȳγ̄)} + z(1+γ)/2 nγ̄ + mδ̄ .
(nγ̄ + mδ̄) · (x̄/γ̄)2
= nγ̄ · mδ̄ nγ̄ · mδ̄
nγ̄ · m · ȳ 2 /δ̄
2
x̄δ̄ nγ̄ + mδ̄ e) Derive the profile log-likelihood for θ. Implement an R-function which calculates
=
ȳγ̄ nγ̄ · mδ̄ a γ · 100% profile likelihood confidence interval for θ.
nγ̄ + mδ̄ ◮ Solving dη d
l(η, θ) = 0 for η gives
2
= θ̂ML .
nγ̄ · mδ̄ d
l(η, θ) = 0
Hence the standard error is dη
s nγ̄ + mδ̄
nγ̄ + mδ̄ = nx̄ + θmȳ
se(θ̂ML ) = θ̂ML · , η
nγ̄ · mδ̄ nγ̄ + mδ̄
η= ,
nx̄ + θmȳ
and the formula for a γ · 100% Wald confidence interval for θ is
s s hence the profile log-likelihood of θ is
x̄ δ̄ x̄δ̄ nγ̄ + m δ̄ x̄ δ̄ x̄ δ̄ nγ̄ + m δ̄
− z(1+γ)/2 , + z(1+γ)/2 ,. lp (θ) = l(η̂ML (θ), θ)
ȳγ̄ ȳγ̄ nγ̄ · mδ̄ ȳγ̄ ȳγ̄ nγ̄ · mδ̄
nγ̄ + mδ̄
= (nγ̄ + mδ̄) log + mδ̄ log(θ) − (nγ̄ + mδ̄).
d) Consider the transformation ψ = log(θ). Derive a γ · 100% Wald confidence nx̄ + θmȳ
interval for ψ using the delta method.
Dropping the additive terms, we obtain the profile log-likelihood as
◮ Due to the invariance of the MLE we have that
lp (θ) = mδ̄ log(θ) − (nγ̄ + mδ̄) log(nx̄ + θmȳ).
ψ̂ML = log(θ̂ML ) = log{(x̄δ̄)/(ȳγ̄)}.
To obtain a γ · 100% profile likelihood confidence interval, we need to find the
For the application of the delta method, we need the derivative of the transfor-
solutions of
mation, which is
d 1 2{lp (θ̂ML ) − lp (θ)} = χ2γ (1).
log(θ) = ,
dθ θ In R:
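The R code announced by "In R:" is not reproduced here; a sketch of such a computation is
given below. The arguments nGammaBar, mDeltaBar, nXbar and mYbar stand for the summary
statistics nγ̄, mδ̄, nx̄ and mȳ, and the bracketing interval for the upper bound is an ad hoc
choice.

> ## profile log-likelihood of theta (additive constants dropped)
> profLoglikTheta <- function(theta, nGammaBar, mDeltaBar, nXbar, mYbar) {
      mDeltaBar * log(theta) - (nGammaBar + mDeltaBar) * log(nXbar + theta * mYbar)
  }
> ## gamma * 100% profile likelihood confidence interval for theta
> profLikCi <- function(gamma, nGammaBar, mDeltaBar, nXbar, mYbar) {
      thetaMl <- (nXbar * mDeltaBar) / (mYbar * nGammaBar)
      target <- function(theta)
          2 * (profLoglikTheta(thetaMl, nGammaBar, mDeltaBar, nXbar, mYbar) -
               profLoglikTheta(theta, nGammaBar, mDeltaBar, nXbar, mYbar)) -
          qchisq(gamma, df = 1)
      lower <- uniroot(target, c(1e-6, thetaMl))$root
      upper <- uniroot(target, c(thetaMl, 100 * thetaMl))$root
      c(lower, upper)
  }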
the PBC survival times in the treatment group is not different from the placebo All three test statistics say that the evidence against H0 is not sufficient for the
group. rejection.
◮ First we read the data:
14.Let X1:n be a random sample from the N(µ, σ 2 ) distribution.
a) First assume that σ 2 is known. Derive the likelihood ratio statistic for testing implying the standard error of the MLE in the form
specific values of µ. √
◮ If σ 2 is known, the log-likelihood kernel is se(µ̂ML ) = I(µ)−1/2 = σ/ n.
n
1 X Therefore the γ · 100% Wald confidence interval for µ is
l(µ; x) = − (xi − µ)2
2σ 2
√ √
x̄ − σ/ nz(1+γ)/2 , x̄ + σ/ nz(1+γ)/2
i=1
change, so we have µ̂i,0 = x̄i . However, the score function for σ 2 now comprises
~
−6 all groups:
K K ni
−8 1 X 1 XX
Sσ2 (θ) = − n + (xij − µi )2 .
2σ 2 2(σ 2 )2
i
i=1 i=1 j=1
40000 50000 60000 70000 80000
Plugging in the estimates µ̂i,0 = x̄i for µi and solving for σ 2 gives
σ2
ni
K X
1 X
The two confidence intervals are very similar here. σ̂02 = PK (xij − x̄i )2 ,
i=1 ni i=1 j=1
15.Consider K independent groups of normally distributed observations with group-
specific means and variances, i. e. let Xi,1:ni be a random sample from N(µi , σi2 ) which is a pooled variance estimate. So the MLE under the restriction σi2 = σ 2
for group i = 1, . . . , K. We want to test the null hypothesis that the variances are of the null hypothesis is θ̂ 0 = (x̄1 , . . . , x̄K , σ̂02 , . . . , σ̂02 )⊤ .
identical, i. e. H0 : σi2 = σ 2 . c) Show that the generalised likelihood ratio statistic for testing H0 : σi2 = σ 2 is
a) Write down the log-likelihood kernel for the parameter vector K
X
θ = (µ1 , . . . , µK , σ12 , . . . , σK
2 ⊤
) . Derive the MLE θ̂ ML by solving the score equa- W = ni log(σ̂02 /σ̂i2 )
tions Sµi (θ) = 0 and then Sσi2 (θ) = 0, for i = 1, . . . , K. i=1
◮ Since all random variables are independent, the log-likelihood is the sum of where σ̂02 and σ̂i2 are the ML variance estimates for the i-th group with and
all individual log density contributions log f (xij ; θ i ), where θi = (µi , σi2 )⊤ : without the H0 assumption, respectively. What is the approximate distribution
ni
K X
of W under H0 ?
X
l(θ) = log f (xij ; θ). ◮ The generalised likelihood ratio statistic is
W = 2{l(θ̂ML ) − l(θ̂0 )}
i=1 j=1
and solving the corresponding score equation Sµi (θ) = 0 gives µ̂i = x̄i , the
i=1
average of the observations in the i-th group. The score function for the variance Since there are p = 2K free parameters in the unconstrained model, and r = K+1
σi2 of the i-th group is given by free parameters under the H0 restriction, we have
W ∼ χ2 (p − r) = χ2 (K − 1)
a
ni
d ni 1 X
Sσi2 (θ) = l(θ) = − 2 + (xij − µi )2 ,
dσi2 2σi 2(σi )
2 2
under H0 .
j=1
d) Consider the special case with K = 2 groups having equal size n1 = n2 = n. is used because T /C converges more rapidly to the asymptotic χ2 (K − 1) distri-
Show that W is large when the ratio bution than T alone.
Pn 2 Write two R-functions which take the vector of the group sizes (n1 , . . . , nK ) and
j=1 (x1j − x̄1 )
R = Pn the sample variances (s21 , . . . , s2K ) and return the values of the statistics W and
j=1 (x2j − x̄2 )
2
B, respectively.
is large or small. Show that W is minimal if R = 1. Which value has W for ◮
R = 1? > ## general function to compute the likelihood ratio statistic.
> W <- function(ni, # the n_i's
◮ In this special case we have s2i) # the s^2_i's
nP Pn o {
1 2 2
j=1 (x1j − x̄1 ) + j=1 (x2j − x̄2 )
n
2n 1 mleVars <- (ni - 1) * s2i / ni # sigma^2_i estimate
σ̂02 /σ̂12 = Pn = (1 + 1/R) pooledVar <- sum((ni - 1) * s2i) / sum(ni) # sigma^2_0 estimate
1
n
(x
j=1 1j − x̄ 1 )2 2
return(sum(ni * log(pooledVar / mleVars)))
and analogously }
1
σ̂02 /σ̂22 = (1 + R). > ## general function to compute the Bartlett statistic.
2 > B <- function(ni, # the n_i's
Hence the likelihood ratio statistic is s2i) # the s^2_i's
{
W = n log(σ̂02 /σ̂12 ) + n log(σ̂02 /σ̂22 )
pooledSampleVar <- sum((ni - 1) * s2i) / sum(ni - 1)
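    ## (hedged completion: the remainder of this function is not shown in the
    ## original; the classical Bartlett statistic T/C, with correction factor C,
    ## is one natural choice and uses the pooled sample variance computed above)
    K <- length(ni)
    Tstat <- sum((ni - 1) * log(pooledSampleVar / s2i))
    Cfactor <- 1 + (sum(1 / (ni - 1)) - 1 / sum(ni - 1)) / (3 * (K - 1))
    return(Tstat / Cfactor)
}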
For example, a is the number of case-control pairs with positive exposure history
0.4 0.4
of both the case and the control.
0.3 0.3 Let ω1 and ω0 denote the odds for a case and a control, respectively, to be exposed,
Density
Density
such that
ω1 ω0
0.2 0.2
Pr(E | case) = and Pr(E | control) = .
1 + ω1 1 + ω0
0.1 0.1
To derive conditional likelihood estimates of the odds ratio ψ = ω1 /ω0 , we argue
0.0 0.0 conditional on the number NE of exposed individuals in a case-control pair. If
0 5 10 15 20 25 0 5 10 15 20
NE = 2 then both the case and the control are exposed so the corresponding a
case-control pairs do not contribute to the conditional likelihood. This is also the
Wsims Bsims
case for the d case-control pairs where both the case and the control are unexposed
We see that the empirical distribution of B is slightly closer to the χ2 (2) distri-
(NE = 0). In the following we therefore only consider the case NE = 1, in which
bution than that of W , but the discrepancy is not very large.
case either the case or the control is exposed, but not both.
g) Consider the alcohol concentration data from Section 1.1.7. Quantify the evi-
dence against equal variances of the transformation factor between the genders a) Conditional on NE = 1, show that the probability that the case rather than
using P -values based on the test statistics W and B. the control is exposed is ω1 /(ω0 + ω1 ). Show that the corresponding conditional
◮
> ni <- c(33, 152)
> s2i <- c(220.1, 232.5)^2
> p.W <- pchisq(W(ni, s2i),
df=3,
lower.tail=FALSE)
> p.B <- pchisq(B(ni, s2i),
df=3,
odds are equal to the odds ratio ψ. and the observed Fisher information is
◮ For the conditional probability we have
1 b
I(ψ̂ML ) = · · (1 + 2b/c) − c
Pr(case E, control not E) {1 + (b/c)}2 (b/c)2
Pr(case E | NE = 1) =
Pr(case E, control not E) + Pr(case not E, control E) c2 c · (c + b)
ω1
· 1 = ·
1+ω1 1+ω0 (c + b)2 b
= 1 1
1+ω0 + 1+ω1 c3
ω1 ω0
1+ω1 · · 1+ω0
= .
ω1 b(c + b)
= ,
ω1 + ω0
Note that, since the latter is positive, ψ̂ML = b/c indeed maximises the likelihood.
and for the conditional odds we have
By Result 2.1, the observed Fisher information corresponding to log(ψ̂ML ) is
Pr(case E | NE = 1)
ω1
ω1 +ω0 −2
=
1 − Pr(case E | NE = 1) ω0
ω1 +ω0 I{log(ψ̂ML )} =
d
log(ψ̂ML ) · I(ψ̂ML )
ω1 dψ
= .
ω0 b2 c3
= ·
c b(c + b)
2
b) Write down the binomial log-likelihood in terms of ψ and show that the MLE of
p b·c
the odds ratio ψ is ψ̂ML = b/c with standard error se{log(ψ̂ML )} = 1/b + 1/c. = .
c+b
Derive the Wald test statistic for H0 : log(ψ) = 0.
◮ Note that Pr(case E | NE = 1) = ω1 /(ω1 + ω0 ) = 1/(1 + 1/ψ) and so It follows that
Pr(control E | NE = 1) = ω0 /(ω1 +ω0 ) = 1/(1 +ψ). The conditional log-likelihood
se{log(ψ̂ML )} = [I{log(ψ̂ML )}]−1/2
is p
= 1/b + 1/c.
1 1
l(ψ) = b log + c log
1 + 1/ψ 1+ψ
The Wald test statistic for H0 : log(ψ) = 0 is
1
= −b log 1 + − c log(1 + ψ).
ψ log(ψ̂ML ) − 0 log(b/c)
=p .
Hence the score function is se{log(ψ̂ML )} 1/b + 1/c
d c) Derive the standard error se(ψ̂ML ) of ψ̂ML and derive the Wald test statistic for
S(ψ) = l(ψ)
dψ H0 : ψ = 1. Compare your result with the Wald test statistic for H0 : log(ψ) = 0.
b 1 c ◮ Using the observed Fisher information computed above, we obtain that
= · −
1 + 1/ψ ψ 2 1+ψ
b c se(ψ̂ML ) = {I(ψ̂ML )}−1/2
= − , r
ψ(1 + ψ) 1 + ψ b(c + b)
=
and the score equation S(ψ) = 0 is solved by ψ̂ML = b/c. The Fisher information c3
r
is b 1 1
= · + .
d c b c
I(ψ) = − S(ψ)
dψ The Wald test statistic for H0 : ψ = 1 is
b c
= 2 · {(1 + ψ) + ψ} − ψ̂ML − 1 b/c − 1
ψ (1 + ψ)2 (1 + ψ)2 = p = p
b−c
.
1 b se(ψ̂ML ) b/c · 1/b + 1/c b 1/b + 1/c
= · · (1 + 2ψ) − c ,
(1 + ψ)2 ψ2
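As an added numerical illustration of the two Wald statistics derived above, for made-up
discordant-pair counts b and c (not data from the exercise):

> ## hypothetical discordant-pair counts
> b <- 30; c <- 15
> psiHat <- b / c
> ## Wald statistic for H0: log(psi) = 0
> log(psiHat) / sqrt(1/b + 1/c)
> ## Wald statistic for H0: psi = 1, using se(psiHat) = (b/c) * sqrt(1/b + 1/c)
> (psiHat - 1) / (psiHat * sqrt(1/b + 1/c))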
d) Finally compute the score test statistic for H0 : ψ = 1 based on the expected which shows that F = logit−1 . For the derivative:
Fisher information of the conditional likelihood.
d d exp(x)
◮ We first compute the expected Fisher information. F (x) =
dx dx 1 + exp(x)
J(ψ) = E{I(ψ)} exp(x){1 + exp(x)} − exp(x) exp(x)
=
{1 + exp(x)}2
1 b+c 1 + 2ψ b+c
= · · − exp(x) 1
(1 + ψ)2 1 + 1/ψ ψ2 1+ψ =
1 + exp(x) 1 + exp(x)
1 b+c 1+ψ
= · · = F (x){1 − F (x)}.
(1 + ψ)2 1 + ψ ψ
b+c
= . b) Use the results on multivariate derivatives outlined in Appendix B.2.2 to show
ψ(1 + ψ)2
that the log-likelihood, score vector and Fisher information matrix of β, given
p
The score test statistic is S(ψ0 )/ J(ψ0 ), where ψ0 = 1. Using the results derived the realisation y = (y1 , . . . , yn )⊤ , are
above, we obtain the statistic in the form n
X
b c l(β) = yi log(πi ) + (1 − yi ) log(1 − πi ),
1·(1+1)
− 1+1 b−c
q =√ . i=1
b+c b+c Xn
1·(1+1)2
S(β) = (yi − πi )xi = X ⊤ (y − π)
i=1
17.Let Yi ∼ Bin(1, πi ), i = 1, . . . , n, be the binary response variables in a logistic
ind
Xn
regression model , where the probabilities πi = F (x⊤ i β) are parametrised via the
and I(β) = i = X W X,
πi (1 − πi )xi x⊤ ⊤
exp(y) To derive the score vector, we have to take the derivative with respect to βj ,
⇐⇒ x = = F (y),
1 + exp(y) j = 1, . . . , p. Using the chain rule for the partial derivative of a function g of πi ,
we obtain that
d d d d d
g(πi ) = g[F {ηi (β)}] = g(πi ) · F (ηi ) · ηi (β),
dβj dβj dπi dηi dβj
Pp Pn
where ηi (β) = x⊤
i β = j=1 xij βj is the linear predictor for the i-th observation. c) Show that the statistic T (y) = i=1 yi xi is minimal sufficient for β.
In our case, ◮ We can rewrite the log-likelihood as follows:
n n
d X yi yi − 1 X
l(β) = + · πi (1 − πi ) · xij l(β) = yi log(πi ) + (1 − yi ) log(1 − πi )
dβj πi 1 − πi
i=1 i=1
Xn Xn
= (yi − πi )xij . = yi {log(πi ) − log(1 − πi )} + log(1 − πi )
i=1 i=1
Xn n
X
Together we have obtained = i β+
yi x⊤ log{1 − F (x⊤
i β)}
Pn i=1 i=1
i=1 (yi − πi )xi1
d
l(β)
dβ1
X n = T (y)⊤ β − A(β), (5.4)
.. ..
S(β) =
.
=
.
=
(yi − πi )xi . Pn
Pn i=1 where A(β) = − i=1 log{1 − F (x⊤ i β)}. Now, (5.4) is in the form of an expo-
i=1 (yi − πi )xip
d
dβp l(β) nential family of order p in canonical form; cf. Exercise 8 in Chapter 3. In that
We would have obtained the same result using vector differentiation: exercise, we showed that T (y) is minimal-sufficient for β.
d) Implement an R-function which maximises the log-likelihood using the Newton-
∂
S(β) = l(β) Raphson algorithm (see Appendix C.1.2) by iterating
∂β
n
X yi yi − 1 ∂ β(t+1) = β (t) + I(β(t) )−1 S(β (t) ), t = 1, 2, . . .
= + · πi (1 − πi ) · ηi (β)
πi 1 − πi ∂β
i=1 until the new estimate β (t+1) and the old one β (t) are almost identical and
n
X β̂ML = β(t+1) . Start with β (1) = 0.
= (yi − πi )xi ,
i=1
◮ Note that
β(t+1) = β(t) + I(β (t) )−1 S(β (t) )
∂β ηi (β) = = xi . It is easily
∂ ∂ ⊤
because the chain rule also works here, and ∂β xi β
is equivalent to
seen that we can also write this as S(β) = X (y − π). ⊤
I(β (t) )(β (t+1) − β(t) ) = S(β (t) ).
We now use vector differentiation to derive the Fisher information matrix:
} having had an X-ray gives odds ratio of exp(β3 ), so the odds are exp(β3 ) times
> ## here comes the Newton-Raphson algorithm: higher; and the father’s ever having had an X-ray changes the odds by factor
> computeMle <- function(data)
{ exp(β4 ).
## start with the null vector We now consider the data set amlxray and first compute the ML estimates of
p <- ncol(data$X)
beta <- rep(0, p) the regression coefficients:
names(beta) <- colnames(data$X) > library(faraway)
> # formatting the data
## loop only to be left by returning the result > data.subset <- amlxray[, c("disease", "age", "Sex", "Mray", "Fray")]
while(TRUE) > data.subset$Sex <- as.numeric(data.subset$Sex)-1
{ > data.subset$Mray <- as.numeric(data.subset$Mray)-1
## compute increment vector v > data.subset$Fray <- as.numeric(data.subset$Fray)-1
v <- solve(fisherInfo(beta, data), > data <- list(y = data.subset$disease,
scoreVec(beta, data)) X =
cbind(intercept=1,
## update the vector as.matrix(data.subset[, c("age", "Sex", "Mray", "Fray")])))
beta <- beta + v > (myMle <- computeMle(data))
[,1]
## check if we have converged intercept -0.282982807
if(sum(v^2) < 1e-8) age -0.002951869
{ Sex 0.124662103
return(beta) Mray -0.063985047
} Fray 0.394158386
} > (myOr <- exp(myMle))
} [,1]
intercept 0.7535327
e) Consider the data set amlxray on the connection between X-ray usage and acute
age 0.9970525
myeloid leukaemia in childhood, which is available in the R-package faraway. Sex 1.1327656
Here yi = 1 if the disease was diagnosed for the i-th child and yi = 0 otherwise Mray 0.9380190
Fray 1.4831354
(disease). We include an intercept in the regression model, i. e. we set x1 = 1.
We want to analyse the association of the diabetes status with the covariates We see a negative association of the leukaemia risk with age and the mother’s
x2 (age in years), x3 (1 if the child is male and 0 otherwise, Sex), x4 (1 if the ever having had an X-ray, and a positive association of the leukaemia risk with
mother ever have an X-ray and 0 otherwise, Mray) and x5 (1 if the father ever male gender and the father’s ever having had an X-ray, and could now say the
have an X-ray and 0 otherwise, Fray). same as described above, e. g. being a male increases the odds for leukaemia by
Interpret β2 , . . . , β5 by means of odds ratios. Compute the MLE β̂ML = 13.3 %. However, these are only estimates and we need to look at the associated
(β̂1 , . . . , β̂5 )⊤ and standard errors se(β̂i ) for all coefficient estimates β̂i , and con- standard errors before we make conclusions.
struct 95% Wald confidence intervals for βi (i = 1, . . . , 5). Interpret the results, We easily obtain the standard errors by inverting the observed Fisher infor-
and compare them with those from the R-function glm (using the binomial fam- mation and taking the square root of the resulting diagonal. This leads to the
ily). (transformed) Wald confidence intervals:
◮ In order to interpret the coefficients, consider two covariate vectors xi and > (obsFisher <- fisherInfo(myMle, data))
xj . The modelled odds ratio for having leukaemia (y = 1) is then intercept age Sex Mray
intercept 58.698145 414.95349 29.721114 4.907156
πi /(1 − πi ) exp(x⊤
i β)
age 414.953492 4572.35780 226.232641 38.464410
= = exp{(xi − xj )⊤ β}. Sex 29.721114 226.23264 29.721114 1.973848
πj /(1 − πj ) exp(x⊤
j β) Mray 4.907156 38.46441 1.973848 4.907156
Fray 16.638090 128.19588 8.902213 1.497194
If now xik − xjk = 1 for one covariate k and xil = xjl for all other covariates Fray
l 6= k, then this odds ratio is equal to exp(βk ). Thus, we can interpret β2 as the intercept 16.638090
age 128.195880
log odds ratio for person i versus person j who is one year younger than person Sex 8.902213
i; β3 as the log odds ratio for a male versus a female; likewise, the mother’s ever
then Cβ = (β2 , β3 , β4 , β5 ), and with the right hand side δ = 0 we have expressed > ## Second the generalized likelihood ratio statistic:
the null hypothesis H0 : β2 = β3 = β4 = β5 = 0 in the form of a so-called linear >
> ## We have to fit the model under the H0 restriction, so
hypothesis: H0 : Cβ = δ. > ## only with intercept.
From Section 5.4.2 we know that > h0data <- list(y=data$y,
X=data$X[, "intercept", drop=FALSE])
> myH0mle <- computeMle(data=h0data)
β̂ML ∼ Np (β, I(β̂ ML )−1 ),
a
> ## then the statistic is
> (likRatioStat <- 2 * (fullLogLik(beta=myMle,
hence data=data) -
C β̂ML ∼ Nq (Cβ, CI(β̂ ML )−1 C ⊤ ).
a fullLogLik(beta=myH0mle,
data=h0data)))
Now under H0 , Cβ = δ, therefore [1] 2.141096
> (p.likRatio <- pchisq(q=likRatioStat, df=3, lower.tail=FALSE))
[1] 0.5436436
C β̂ML ∼ Nq (δ, CI(β̂ML )−1 C ⊤ )
a
We see that neither of the two statistics finds much evidence against H0 , so the
and association of the oral cancer risk with the set of these four covariates is not
Σ−1/2 (C β̂ML − δ) ∼ Nq (0, I q )
a
significant.
where Σ = CI(β̂ML )−1 C ⊤ . Analogous to Section 5.4, we finally get Note that we can compute the generalised likelihood ratio statistic for the compar-
ison of two models in R using the anova function. For example, we can reproduce
n o⊤ n o
Σ−1/2 (C β̂ML − δ) Σ−1/2 (C β̂ ML − δ) the null deviance from above using the following code:
n o⊤ > anova(update(amlGlm, . ~ 1), amlGlm, test="LRT")
= (C β̂ML − δ)⊤ Σ−1/2 Σ−1/2 (C β̂ ML − δ) Analysis of Deviance Table
Pn i
> plot(x=dose, y=resp[[1]], type="l", ylim=c(0,1.2), xlab="dose", ylab=m.resp) where ỹi = j=1 {yij − µ(di ; θ)} and δi = di + θ3 . The expected Fisher infor-
> legend("bottomright", legend= "theta_1=0.2, theta_2=1, theta_3=0.2") mation matrix is thus
> plot(x=dose, y=resp[[2]], type="l", ylim=c(0,1.2), xlab="dose", ylab=m.resp)
PK PK ni di PK
> legend("bottomright", legend= "theta_1=0.2, theta_2=1, theta_3=0.5")
i=1 ni i=1 δi −θ2 i=1 nδi2di
> plot(x=dose, y=resp[[3]], type="l", ylim=c(0,2.6), xlab="dose", ylab=m.resp)
1 PK ni di PK ni d2i PK n d2 i
> legend("bottomright", legend= "theta_1=0.5, theta_2=2, theta_3=1") J(θ) = 2 i=1 δi2 −θ2 i=1 δi 3 i
.
> plot(x=dose,resp[[4]], type="l", ylim=c(0,2.6), xlab="dose", ylab=m.resp) σ i=1 δi
PK PK n d2 PK n di2
> legend("bottomright", legend= "theta_1=0.5, theta_2=2, theta_3=2") −θ2 i=1 nδi2di −θ2 i=1 δi 3 i θ22 i=1 δi 4 i
i i i
1.2 1.2
1.0 1.0 The following R function uses J(θ) to obtain the approximate covariance matrix
0.8 0.8
of the MLE θ̂ML for a given set of doses d1 , . . . , dK , a total sample size N =
µ(d)
µ(d)
0.6 0.6
0.4 0.4 PK 2
0.2
theta_1=0.2, theta_2=1, theta_3=0.2
0.2
theta_1=0.2, theta_2=1, theta_3=0.5 i=1 ni , allocation weights wi = ni /N , and given error variance σ .
0.0 0.0
0 1 2 3 4 0 1 2 3 4
> ApprVar <- function(doses, N, w, sigma, theta1, theta2, theta3){
delta <- doses + theta3
dose dose
n <- w*N
2.5 2.5
2.0 2.0
V <- matrix(NA, nrow=3, ncol=3)
diag(V) <- c( N, sum(n*doses^2/delta^2), theta2^2*sum(n*doses^2/delta^4) )
µ(d)
b) Compute the expected Fisher information for the parameter vector θ. Using this c) Assume θ1 = 0, θ2 = 1, θ3 = 0.5 and σ 2 = 1. Calculate the approximate
result, implement an R function which calculates the approximate covariance covariance matrix, first for K = 5 doses 0, 1, 2, 3, 4, and second for doses 0,
matrix of the MLE θ̂ ML for a given set of doses d1 , . . . , dK , a total sample size 0.5, 1, 2, 4, both times with balanced allocations wi = 1/5 and total sample size
PK
N = i=1 ni , allocation weights wi = ni /N and given error variance σ 2 . N = 100. Compare the approximate standard deviations of the MLEs of the
◮ The log-likelihood kernel is parameters between the two designs, and also compare the determinants of the
two calculated matrices.
K ni
1 XX ◮
l(θ) = − {yij − µ(dij ; θ)}2 ,
2σ 2 > # with the first set of doses:
i=1 j=1
> V1 <- ApprVar(doses=c(0:4), N=100, w=c(rep(1/5, 5)),
the score function is sigma=1, theta1=0, theta2=1, theta3=0.5)
> # with the second set of doses:
PK > V2 <- ApprVar(doses=c(0, 0.5, 1, 2, 4), N=100, w=c(rep(1/5, 5)),
d
l(θ) i=1 ỹi sigma=1, theta1=0, theta2=1, theta3=0.5)
dθ1 1 PK
S(θ) =
dθ2 l(θ) = σ 2
d di
i=1 ỹi · di +θ3
,
> # standard errors
PK > sqrt(diag(V1))
d di
dθ3 l(θ) −θ 2 i=1 ỹi · (di +θ3 ) 2 [1] 0.2235767 0.4081669 0.9185196
> sqrt(diag(V2))
and the Fisher information matrix is [1] 0.2230535 0.3689985 0.6452322
PK > # determinant
PK ni di PK ni di > det(V1)
i=1 ni i=1 δi −θ2 i=1 δi2
1 PK ni di PK ni d2i PK [1] 0.0007740515
I(θ) = 2 i=1 δi2 i=1 3 (δi ỹi − ni di θ2 )
di , > det(V2)
σ i=1 δ
PK i PK di PK dδii
i=1 δ 3 (δi ỹi − ni di θ2 ) θ2 i=1 δ4 (−2δi ỹi + ni di θ2 ) [1] 0.000421613
−θ2 i=1 nδi2di
i i i
The second design achieves a lower variability of estimation by placing
more doses on the increasing part of the dose-response curve.
d) Using the second design, determine the required total sample size N so that the
standard deviation for estimation of θ2 is 0.35 (so that the half-length of a 95%
confidence interval is about 0.7).
◮
> sample.size <- function(N){
      V2 <- ApprVar(doses=c(0, 0.5, 1, 2, 4), N=N, w=c(rep(1/5, 5)),
                    sigma=1, theta1=0, theta2=1, theta3=0.5)
      return(sqrt(V2[2,2])-0.35)
  }
> # find the root of the function above
> uniroot(sample.size, c(50, 200))
$root
[1] 111.1509

$f.root
[1] -8.309147e-10

$iter
[1] 7

$estim.prec
[1] 6.103516e-05

We need a little more than 111 patients in total, that is a little more than 22
per dose. By taking 23 patients per dose, we obtain the desired length of the
confidence interval:
> V2 <- ApprVar(doses=c(0, 0.5, 1, 2, 4), N=5*23, w=c(rep(1/5, 5)),
                sigma=1, theta1=0, theta2=1, theta3=0.5)
> sqrt(V2[2,2])
[1] 0.3440929

6 Bayesian inference

1. In 1995, O. J. Simpson, a retired American football player and actor, was accused
of the murder of his ex-wife Nicole Simpson and her friend Ronald Goldman. His
lawyer, Alan M. Dershowitz stated on T.V. that only one-tenth of 1% of men who
abuse their wives go on to murder them. He wanted his audience to interpret this
to mean that the evidence of abuse by Simpson would only suggest a 1 in 1000
chance of being guilty of murdering her.
However, Merz and Caulkins (1995) and Good (1995) argue that a different proba-
bility needs to be considered: the probability that the husband is guilty of murdering
his wife given both that he abused his wife and his wife was murdered. Both com-
pute this probability using Bayes theorem, but in two different ways. Define the
following events:
A: “The woman was abused by her husband.”
M: “The woman was murdered by somebody.”
G: “The husband is guilty of murdering his wife.”
a) Merz and Caulkins (1995) write the desired probability in terms of the corre-
sponding odds as:

Pr(G | A, M)/Pr(Gc | A, M) = Pr(A | G, M)/Pr(A | Gc, M) · Pr(G | M)/Pr(Gc | M).   (6.1)

They use the fact that, of the 4936 women who were murdered in 1992, about
1430 were killed by their husband. In a newspaper article, Dershowitz stated
that “It is, of course, true that, among the small number of men who do kill
their present or former mates, a considerable number did first assault them.”
Merz and Caulkins (1995) interpret “a considerable number” to be 1/2. Finally,
they assume that the probability of a wife being abused by her husband, given
that she was murdered by somebody else, is the same as the probability of a
randomly chosen woman being abused, namely 0.05.
Calculate the odds (6.1) based on this information. What is the corresponding
probability of O. J. Simpson being guilty, given that he has abused his wife and
so that Pr(G | A, M ) = 0.5. That means the probability of O.J. Simpson being
The odds (6.1) are therefore guilty, given that he has abused his wife and she has been murdered, is about
50%.
Pr(G | A, M ) Pr(A | G, M ) Pr(G | M )
= · c) Good (1996) revised this calculation, noting that approximately only a quarter
Pr(Gc | A, M ) Pr(A | Gc , M ) Pr(Gc | M )
1430
of murdered victims are female, so Pr(M | Gc , A) reduces to 1/20 000. He also
0.5
= · 4936 corrects Pr(G | A) to 1/2000, when he realised that Dershowitz’s estimate was an
0.05 3506
4936 annual and not a lifetime risk. Calculate the probability of O. J. Simpson being
≈ 4.08, guilty based on this updated information.
◮ The revised calculation is now
so that
4.08 Pr(G | A, M ) Pr(M | G, A) Pr(G | A)
Pr(G | A, M ) = ≈ 0.8. =
1 + 4.08 Pr(Gc | A, M )
·
Pr(M | Gc , A) Pr(Gc | A)
That means the probability of O.J. Simpson being guilty, given that he has abused 1
1 2000
his wife and she has been murdered, is about 80%. ≈ 1
· 1999
20000 2000
b) Good (1995) uses the alternative representation
≈ 10,
Pr(G | A, M ) Pr(M | G, A) Pr(G | A)
= · . (6.2) so that Pr(G | A, M ) ≈ 0.91. Based on this updated information, the probability
Pr(Gc | A, M ) Pr(M | Gc , A) Pr(Gc | A)
of O.J. Simpson being guilty, given that he has abused his wife and she has been
He first needs to estimate Pr(G | A) and starts with Dershowitz’s estimate of
murdered, is about 90%.
1/1000 that the abuser will murder his wife. He assumes the probability is at
least 1/10 that this will happen in the year in question. Thus Pr(G | A) is at 2. Consider Example 6.4. Here we will derive the implied distribution of θ =
least 1/10 000. Obviously Pr(M | Gc , A) = Pr(M | A) ≈ Pr(M ). Since there are Pr(D+ | T +) if the prevalence is π ∼ Be(α̃, β̃).
about 25 000 murders a year in the U.S. population of 250 000 000, Good (1995) a) Deduce with the help of Appendix A.5.2 that
estimates Pr(M | Gc , A) to be 1/10 000.
α̃ 1 − π
Calculate the odds (6.2) based on this information. What is the corresponding γ= ·
β̃ π
probability of O. J. Simpson being guilty, given that he has abused his wife and
she has been murdered? follows an F distribution with parameters 2β̃ and 2α̃, denoted by F(2β̃, 2α̃).
◮ Using the method of Good (1995) it follows: ◮ Following the remark on page 336, we first show 1 − π ∼ Be(β̃, α̃) and then
1 9999 we deduce γ = α̃ · 1−π ∼ F(2β̃, 2α̃).
Pr(G | A) = ⇒ Pr(Gc | A) =
β̃ π
10000 10000 Step 1: To obtain the density of 1 − π, we apply the change-of-variables formula
1 to the density of π.
Pr(M | Gc , A) ≈
10000 The transformation function is
Pr(M | G, A) = 1.
g(π) = 1 − π
dg −1
(y)
= −1. where
dy α̃ Pr(T + | D+)
c= .
This gives β̃{1 − Pr(T − | D−)}
−1
dg (y)
f1−π (y) = fπ g −1 (y) ◮ We first plug in the expression for ω given in Example 6.4 and then we
dy
express the term depending on π as a function of γ:
= fπ (1 − y)
1 θ = (1 + ω −1 )−1
= (1 − y)α̃−1 (1 − (1 − y))β̃−1 −1
B(α̃, β̃) 1 − Pr(T − | D−) 1 − π
1 = 1+ ·
= y β̃−1 (1 − y)α̃−1 , Pr(T + | D+) π
B(β̃, α̃) −1
1 − Pr(T − | D−) β̃γ
= 1+ ·
where we have used that B(α̃, β̃) = B(β̃, α̃), which follows easily from the defini- Pr(T + | D+) α̃
tion of the beta function (see Appendix B.2.1). Thus, 1 − π ∼ Be(β̃, α̃). −1
α̃ Pr(T + | D+)
Step 2: We apply the change-of-variables formula again to obtain the density of = 1 + γ/
β̃{1 − Pr(T − | D−)}
γ from the density of 1 − π.
= (1 + γ/c)
−1
.
We have γ = g(1 − π), where
α̃ x c) Show that
g(x) = · , d 1
β̃ 1 − x g(γ) = −
dγ c(1 + γ/c)2
y β̃y
g −1 (y) = = and
α̃/β̃ + y α̃ + β̃y and that g(γ) is a strictly monotonically decreasing function of γ.
2 ◮ Applying the chain rule to the function g gives
dg −1
(y) β̃(α̃ + β̃y) − β̃ y α̃β̃
= = .
dy (α̃ + β̃y)2 (α̃ + β̃y)2 d d 1
g(γ) = − (1 + γ/c) · (1 + γ/c) = −
−2
.
dγ dγ c(1 + γ/c)2
Hence,
As c > 0, we have
dg −1 (y)
fγ (y) = f1−π g −1 (y) d
dy g(γ) < 0 for all γ ∈ [0, ∞),
dγ
β̃−1 α̃−1
1 β̃y β̃y α̃β̃ which implies that g(γ) is a strictly monotonically decreasing function of γ.
= 1−
B(β̃, α̃) α̃ + β̃y α̃ + β̃y (α̃ + β̃y)2 d) Use the change of variables formula (A.11) to derive the density of θ in (6.13).
1−β̃ 1−α̃ ◮ We derive the density of θ = g(γ) from the density of γ obtained in 2a).
1 α̃ + β̃y α̃ + β̃y α̃β̃
= Since g is strictly monotone by 2c) and hence one-to-one, we can apply the
B(β̃, α̃) β̃y α̃ (α̃ + β̃y)2
−β̃ −α̃ change-of-variables formula to this transformation. We have
1 α̃ + β̃y α̃ + β̃y
=
B(β̃, α̃)y β̃y α̃ θ = (1 + γ/c)−1 by 2b) and
−β̃ −α̃
1 α̃ β̃y γ = g −1 (θ) =
c(1 − θ)
= c(1/θ − 1).
= 1+ 1+ , θ
B(β̃, α̃)y β̃y α̃
a) Verify with R that the parameters of the inverse-gamma distribution lead to a prior probability of approximately 95 % that σ² ∈ [22, 41].
◮ We use the fact that if σ² ∼ IG(38, 1110), then 1/σ² ∼ G(38, 1110) (see Table A.2). We can thus work with the cumulative distribution function of the corresponding gamma distribution in R. We are interested in the probability
Pr(1/σ² ∈ [1/41, 1/22]) = Pr(1/σ² ≤ 1/22) − Pr(1/σ² < 1/41).
> (prior.prob <- pgamma(1/22, shape=38, rate=1110)
                 - pgamma(1/41, shape=38, rate=1110))
[1] 0.9431584
b) Derive and plot the posterior density of σ² corresponding to the following data:
183, 173, 181, 170, 176, 180, 187, 176, 171, 190, 184, 173, 176, 179, 181, 186.
◮ We assume that the observed heights x1 = 183, x2 = 173, …, x16 = 186 are realisations of a random sample X1:n (in particular, X1, …, Xn are independent) for n = 16. A priori we have σ² ∼ IG(38, 1110) and we are interested in the posterior distribution of σ² | x1:n. It can easily be verified, see also the last line in Table 6.2, that
σ² | xi ∼ IG(38 + 1/2, 1110 + (xi − µ)²/2).
As in Example 6.8, this result can easily be extended to a random sample X1:n:
σ² | x1:n ∼ IG(38 + n/2, 1110 + Σ_{i=1}^n (xi − µ)²/2).
> # parameters of the inverse gamma prior
> alpha <- 38
> beta <- 1110
> # prior mean of the normal distribution
> mu <- 180
> # data vector
> heights <- c(183, 173, 181, 170, 176, 180, 187, 176,
               171, 190, 184, 173, 176, 179, 181, 186)
> # number of observations
> n <- length(heights)
> # compute the parameters of the inverse gamma posterior distribution
> (alpha.post <- alpha + n/2)
[1] 46
> (beta.post <- beta + 0.5*sum((heights - mu)^2))
[1] 1380
The posterior distribution of σ² is IG(46, 1380).
> library(MCMCpack)
> # plot the posterior distribution
> curve(dinvgamma(x, shape=alpha.post, scale=beta.post), from=15, to=50,
        xlab=expression(sigma^2), ylab="Density", col=1, lty=1)
> # plot the prior distribution
> curve(dinvgamma(x, shape=alpha, scale=beta), from=15, to=50,
        n=200, add=T, col=1, lty=2)
> legend("topright",
         c("Prior density: IG(38, 1110)",
           "Posterior density: IG(46, 1380)"), bg="white", lty=c(2,1), col=1)
[Figure: prior density IG(38, 1110) (dashed) and posterior density IG(46, 1380) (solid) of σ².]
c) Compute the posterior density of the standard deviation σ.
◮ For notational convenience, let Y = σ². We know that Y ∼ IG(α = 46, β = 1380) and are interested in Z = g(Y) = √Y = σ, where g(y) = √y. Thus, g^{−1}(z) = z² and
dg^{−1}(z)/dz = 2z.
Using the change-of-variables formula, we obtain
f_Z(z) = |2z| · β^α/Γ(α) · (z²)^{−(α+1)} exp(−β/z²)
       = 2β^α/Γ(α) · z^{−(2α+1)} exp(−β/z²).
This is the required posterior density of the standard deviation Z = σ for α = 46 and β = 1380.
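The density of σ just derived can be checked against simulation; a minimal sketch, re-using alpha.post = 46 and beta.post = 1380 from above:
> ## posterior density of the standard deviation sigma
> sdDens <- function(z, alpha, beta)
    2 * beta^alpha / gamma(alpha) * z^(-(2*alpha + 1)) * exp(-beta / z^2)
> ## compare with a sample of sigma = sqrt(sigma^2), where 1/sigma^2 ~ G(46, 1380)
> set.seed(1)
> sigmaSamples <- sqrt(1 / rgamma(1e5, shape = 46, rate = 1380))
> hist(sigmaSamples, prob = TRUE, breaks = 50, xlab = expression(sigma), main = "")
> curve(sdDens(x, alpha = 46, beta = 1380), add = TRUE)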
4. Assume that n throat swabs have been tested for influenza. We denote by X the number of throat swabs which yield a positive result and assume that X is binomially distributed with parameters n and unknown probability π, so that X | π ∼ Bin(n, π).
a) Determine the expected Fisher information and obtain Jeffreys' prior.
◮ As X | π ∼ Bin(n, π), the log-likelihood is given by
l(π) = x log π + (n − x) log(1 − π),
so the score function is
S(π) = dl(π)/dπ = x/π − (n − x)/(1 − π).
The Fisher information turns out to be
I(π) = −dS(π)/dπ = x/π² + (n − x)/(1 − π)².
Thus, the expected Fisher information is
J(π) = E(X)/π² + {n − E(X)}/(1 − π)² = n/{π(1 − π)},
so Jeffreys' prior is f(π) ∝ J(π)^{1/2} ∝ π^{−1/2}(1 − π)^{−1/2}, i.e. a Be(1/2, 1/2) distribution.
[…] so that for the log odds η = log{π/(1 − π)}, we have
f(x | η) = \binom{n}{x} exp(ηx){1 + exp(η)}^{−n}.
The log-likelihood is therefore
l(η) = ηx − n log{1 + exp(η)},
the score function is
S(η) = dl(η)/dη = x − n exp(η)/{1 + exp(η)},
[…]
Applying the change-of-variables formula gives
f(η) = f_π(π) |dg(π)/dπ|^{−1}
     ∝ π(1 − π) · π^{−1/2}(1 − π)^{−1/2}
     = π^{1/2}(1 − π)^{1/2}
     = {exp(η)/(1 + exp(η))}^{1/2} {1/(1 + exp(η))}^{1/2}
     = exp(η)^{1/2}/{1 + exp(η)}.
This density is the same as the one we received in part 4b).
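The agreement of the two derivations can also be verified numerically; the sketch below compares the transformed Be(1/2, 1/2) density with exp(η/2)/{1 + exp(η)} (including the normalising constant B(1/2, 1/2)) on an arbitrary grid:
> ## Jeffreys prior for eta = logit(pi): transformed Be(1/2, 1/2) density
> etaGrid <- seq(-5, 5, length = 101)
> transformed <- dbeta(plogis(etaGrid), 0.5, 0.5) * dlogis(etaGrid)
> ## direct formula exp(eta/2) / {(1 + exp(eta)) B(1/2, 1/2)}
> direct <- exp(etaGrid / 2) / ((1 + exp(etaGrid)) * beta(0.5, 0.5))
> all.equal(transformed, direct)
[1] TRUE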
5. Suppose that the survival times X1:n form a random sample from an exponential distribution with parameter λ.
a) Derive Jeffreys' prior for λ and show that it is improper.
◮ From Exercise 4a) in Chapter 4 we know the score function
S(λ) = n/λ − Σ_{i=1}^n xi.
Viewed as a random variable in X1, …, Xn, its variance is
J(λ) = Var{S(λ)} = n · Var(X1) = n/λ².
Jeffreys' prior therefore is
f(λ) ∝ J(λ)^{1/2} ∝ λ^{−1},
which cannot be normalised, since
∫_0^∞ λ^{−1} dλ = [log(λ)]_0^∞ = log(∞) − log(0) = ∞ − (−∞) = ∞,
so f(λ) is improper.
b) Suppose that the survival times are only partially observed until the r-th death such that n − r observations are actually censored. Write down the corresponding likelihood function and derive the posterior distribution under Jeffreys' prior.
◮ Let x(1), …, x(n) denote the ordered survival times. Only x(1), …, x(r) are observed, the remaining survival times are censored. The corresponding likelihood function can be derived from Example 2.8 with δ(i) = I{1,…,r}(i) and Σ_{i=1}^n δi = r:
L(λ) = λ^r exp[−λ{Σ_{i=1}^r x(i) + (n − r)x(r)}],
because x(r+1), …, x(n) are assumed to be censored at time x(r). The posterior distribution under Jeffreys' prior f(λ) ∝ λ^{−1} is thus
f(λ | x1:n) ∝ f(λ) · L(λ) = λ^{r−1} exp[−λ{Σ_{i=1}^r x(i) + (n − r)x(r)}].
c) Show that the posterior is improper if all observations are censored.
◮ If no death has occurred prior to some time c, the likelihood of λ is
L(λ) = exp(−nλc).
Using Jeffreys' prior f(λ) ∝ λ^{−1}, we obtain the posterior
f(λ | x1:n) ∝ λ^{−1} exp(−ncλ).
This can be identified as the kernel of an improper G(0, nc) distribution.
6. After observing a patient, his/her LDL cholesterol level θ is estimated by a. Due to the increased health risk of high cholesterol levels, the consequences of underestimating a patient's cholesterol level are considered more serious than those of overestimation. That is to say that |a − θ| should be penalised more when a ≤ θ than when a > θ. Consider the following loss function parameterised in terms of c, d > 0:
l(a, θ) = −c(a − θ) if a − θ ≤ 0,   d(a − θ) if a − θ > 0.
a) Plot l(a, θ) as a function of a − θ for c = 1 and d = 3.
◮
> # loss function with argument a-theta
> loss <- function(aMinusTheta, c, d)
  {
    ifelse(aMinusTheta <= 0, - c * aMinusTheta, d * aMinusTheta)
  }
> aMinusTheta <- seq(-3, 3, length = 101)
> plot(aMinusTheta, loss(aMinusTheta, c = 1, d = 3),
       type = "l", xlab = expression(a - theta), ylab = "loss")
[Figure: the loss l(a, θ) plotted against a − θ for c = 1 and d = 3.]
b) Compute the Bayes estimate with respect to the loss function l(a, θ).
◮ The expected posterior loss is
E{l(a, θ) | x} = ∫ l(a, θ) f(θ | x) dθ
              = ∫_{−∞}^a d(a − θ) f(θ | x) dθ + ∫_a^∞ c(θ − a) f(θ | x) dθ.
To compute the Bayes estimate
â = arg min_a E{l(a, θ) | x},
we take the derivative with respect to a, using Leibniz integral rule, see Appendix B.2.4. Using the convention ∞ · 0 = 0 we obtain
d/da E{l(a, θ) | x} = d ∫_{−∞}^a f(θ | x) dθ − c ∫_a^∞ f(θ | x) dθ
                    = d F(a | x) − c{1 − F(a | x)}
                    = (c + d) F(a | x) − c.
The root of this function in a is therefore
â = F^{−1}(c/(c + d) | x),
i.e. the Bayes estimate â is the c/(c + d) · 100 % quantile of the posterior distribution of θ. For c = d we obtain as a special case the posterior median. For c = 1 and d = 3 the Bayes estimate is the 25 %-quantile of the posterior distribution.
Remark: If we choose c = 3 and d = 1 as mentioned in the Errata, then the Bayes estimate is the 75 %-quantile of the posterior distribution.
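As an illustration of this rule (not part of the original exercise), suppose the posterior of θ were N(2, 1); the Bayes estimate under c = 1, d = 3 is then simply the 25 % posterior quantile:
> ## Bayes estimate = c/(c+d) quantile of the posterior, here c = 1, d = 3
> qnorm(1/(1 + 3), mean = 2, sd = 1)
[1] 1.32551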
7. Our goal is to estimate the allele frequency at one bi-allelic marker, which has either allele A or B. DNA sequences for this location are provided for n individuals. We denote the observed number of allele A by X and the underlying (unknown) allele frequency with π. A formal model specification is then a binomial distribution X | π ∼ Bin(n, π) and we assume a beta prior distribution π ∼ Be(α, β) where α, β > 0.
a) Derive the posterior distribution of π and determine the posterior mean and mode.
◮ We know X | π ∼ Bin(n, π) and π ∼ Be(α, β). As in Example 6.3, we are interested in the posterior distribution π | X:
f(π | x) ∝ f(x | π) f(π)
        ∝ π^x (1 − π)^{n−x} π^{α−1}(1 − π)^{β−1}
        = π^{α+x−1}(1 − π)^{β+n−x−1},
i.e. π | x ∼ Be(α + x, β + n − x). Hence, the posterior mean is given by (α + x)/(α + β + n) and the posterior mode by (α + x − 1)/(α + β + n − 2).
b) For some genetic markers the assumption of a beta prior may be restrictive and a bimodal prior density, e.g., might be more appropriate. For example, we can easily generate a bimodal shape by considering a mixture of two beta distributions:
f(π) = w f_Be(π; α1, β1) + (1 − w) f_Be(π; α2, β2)
with mixing weight w ∈ (0, 1).
i. Derive the posterior distribution of π.
◮
f(π | x) ∝ f(x | π) f(π)
        ∝ π^x (1 − π)^{n−x} [ w/B(α1, β1) π^{α1−1}(1 − π)^{β1−1} + (1 − w)/B(α2, β2) π^{α2−1}(1 − π)^{β2−1} ]
        = w/B(α1, β1) π^{α1+x−1}(1 − π)^{β1+n−x−1} + (1 − w)/B(α2, β2) π^{α2+x−1}(1 − π)^{β2+n−x−1}.
ii. The posterior distribution is a mixture of two familiar distributions. Identify these distributions and the corresponding posterior weights.
◮ We have
f(π | x) ∝ w B(α1*, β1*)/B(α1, β1) · 1/B(α1*, β1*) π^{α1*−1}(1 − π)^{β1*−1}
          + (1 − w) B(α2*, β2*)/B(α2, β2) · 1/B(α2*, β2*) π^{α2*−1}(1 − π)^{β2*−1}
for
α1* = α1 + x,  β1* = β1 + n − x,
α2* = α2 + x,  β2* = β2 + n − x.
Hence, the posterior distribution is a mixture of the two beta distributions Be(α1*, β1*) and Be(α2*, β2*). The mixture weights are proportional to
γ1 = w · B(α1*, β1*)/B(α1, β1)   and   γ2 = (1 − w) · B(α2*, β2*)/B(α2, β2).
The normalised weights are γ1* = γ1/(γ1 + γ2) and γ2* = γ2/(γ1 + γ2).
iii. Determine the posterior mean of π.
◮ The posterior distribution is a linear combination of two beta distributions:
f(π | x) = γ1* Be(π | α1*, β1*) + (1 − γ1*) Be(π | α2*, β2*),
so the posterior mean is given by
E(π | x) = γ1* α1*/(α1* + β1*) + (1 − γ1*) α2*/(α2* + β2*).
iv. Write an R-function that numerically computes the limits of an equi-tailed credible interval.
◮ The posterior distribution function is
F(π | x) = γ1* F(π | α1*, β1*) + (1 − γ1*) F(π | α2*, β2*).
The equi-tailed (1 − α)-credible interval is therefore
[F^{−1}(α/2 | x), F^{−1}(1 − α/2 | x)],
i.e. we are looking for the arguments π where the distribution function takes the values α/2 and 1 − α/2, respectively.
> # Distribution function of a mixture of beta distributions with
> # weight gamma1 of the first component and parameter vectors
> # alpha and beta
> pbetamix <- function(pi, gamma1, alpha, beta){
      gamma1 * pbeta(pi, alpha[1], beta[1]) +
        (1 - gamma1) * pbeta(pi, alpha[2], beta[2])
  }
> # corresponding quantile function
> qbetamix <- function(q, gamma1, alpha, beta){
      f <- function(pi){
          pbetamix(pi, gamma1, alpha, beta) - q
      }
      unirootResult <- uniroot(f, lower=0, upper=1)
      if(unirootResult$iter < 0)
          return(NA)
      else
          return(unirootResult$root)
  }
> # credibility interval with level level
> credBetamix <- function(level, gamma1, alpha, beta){
      halfa <- (1 - level)/2
      ret <- c(qbetamix(halfa, gamma1, alpha, beta),
               qbetamix(1 - halfa, gamma1, alpha, beta))
      return(ret)
  }
v. Let n = 10 and x = 3. Assume an even mixture (w = 0.5) of two beta distributions, Be(10, 20) and Be(20, 10). Plot the prior and posterior distributions in one figure.
> # data
> n <- 10
> x <- 3
> #
> # parameters for the beta components
> a1 <- 10
> b1 <- 20
> a2 <- 20
> b2 <- 10
> #
> # weight for the first mixture component
> w <- 0.5
> #
> # define a function that returns the density of a beta mixture
> # with two components
> mixbeta <- function(x, shape1a, shape2a, shape1b, shape2b, weight){
      y <- weight * dbeta(x, shape1=shape1a, shape2=shape2a) +
           (1-weight) * dbeta(x, shape1=shape1b, shape2=shape2b)
      return(y)
  }
> #
> # plot the prior density
> curve(mixbeta(x, shape1a=a1, shape2a=b1, shape1b=a2, shape2b=b2, weight=w),
        from=0, to=1, col=2, ylim=c(0,5), xlab=expression(pi), ylab="Density")
> #
> # parameters of the posterior distribution
> a1star <- a1 + x
> b1star <- b1 + n - x
> a2star <- a2 + x
> b2star <- b2 + n - x
> #
> # the posterior weights are proportional to
> gamma1 <- w*beta(a1star,b1star)/beta(a1,b1)
> gamma2 <- (1-w)*beta(a2star,b2star)/beta(a2,b2)
> #
> # calculate the posterior weight
> wstar <- gamma1/(gamma1 + gamma2)
> #
> # plot the posterior distribution
> curve(mixbeta(x, shape1a=a1star, shape2a=b1star,
                shape1b=a2star, shape2b=b2star, weight=wstar),
        from=0, to=1, col=1, add=T)
> #
> legend("topright", c("Prior density", "Posterior density"),
         col=c(2,1), lty=1, bty="n")
[Figure: prior density (bimodal beta mixture, red) and posterior density of π (black) for n = 10 and x = 3.]
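With the posterior weights and parameters just computed, the credible-interval function from part (iv) can be applied directly; a short usage sketch:
> ## 95% equi-tailed credible interval for pi based on the posterior mixture
> credBetamix(level = 0.95, gamma1 = wstar,
              alpha = c(a1star, a2star), beta = c(b1star, b2star))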
a) Derive the posterior density f(π | x). Which distribution is this and what are its parameters?
◮ We have
f(π | x) ∝ f(x | π) f(π)
        ∝ π^r (1 − π)^{x−r} π^{α−1}(1 − π)^{β−1}
        = π^{α+r−1}(1 − π)^{β+x−r−1}.
This is the kernel of a beta distribution with parameters α̇ = α + r and β̇ = β + x − r, that is π | x ∼ Be(α + r, β + x − r).
b) Define conjugacy and explain why, or why not, the beta prior is conjugate with respect to the negative binomial likelihood.
◮ Definition of conjugacy (Def. 6.5): Let L(θ) = f(x | θ) denote a likelihood function based on the observation X = x. A class G of distributions is called conjugate with respect to L(θ) if the posterior distribution f(θ | x) is in G for all x whenever the prior distribution f(θ) is in G.
The beta prior is conjugate with respect to the negative binomial likelihood since the resulting posterior distribution is also a beta distribution.
9. Let X1:n denote a random sample from a uniform distribution on the interval [0, θ] with unknown upper limit θ. Suppose we select a Pareto distribution Par(α, β) with parameters α > 0 and β > 0 as prior distribution for θ, cf. Table A.2 in Section A.5.2.
a) Show that T(X1:n) = max{X1, …, Xn} is sufficient for θ.
◮ This was already shown in Exercise 6 of Chapter 2 (see the solution there).
b) Derive the posterior distribution of θ and identify the distribution type.
◮ The posterior distribution is also a Pareto distribution since for t = max{x1, …, xn}, we have
f(θ | x1:n) ∝ f(x1:n | θ) f(θ)
           ∝ θ^{−n} I_{[0,θ]}(t) · θ^{−(α+1)} I_{[β,∞)}(θ)
           = θ^{−{(α+n)+1}} I_{[max{β,t},∞)}(θ),
that is θ | x1:n ∼ Par(α + n, max{β, t}). Thus, the Pareto distribution is conjugate with respect to the uniform likelihood function.
c) Determine posterior mode Mod(θ | x1:n), posterior mean E(θ | x1:n), and the general form of the 95 % HPD interval for θ.
◮ The formulas for the mode and mean of the Pareto distribution are listed in Table A.2 in the Appendix. Here, we have
Mod(θ | x1:n) = max{β, t} = max{β, x1, …, xn}
and
E(θ | x1:n) = (α + n) max{β, t}/(α + n − 1),
where the condition α + n > 1 is satisfied for any n ≥ 1 as α > 0.
Since the density f(θ | x1:n) equals 0 for θ < max{β, t} and is strictly monotonically decreasing for θ ≥ max{β, t}, the 95 % HPD interval for θ has the form
[max{β, t}, q],
where q is the 95 %-quantile of the Par(α + n, max{β, t}) distribution.
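The upper limit q of this HPD interval is just a Pareto quantile and can be computed in closed form; a small sketch with illustrative values α = 1, β = 1, n = 5 and observed maximum t = 2.5 (these numbers are not from the exercise):
> ## 95% HPD interval [max(beta, t), q] for the Par(alpha + n, max(beta, t)) posterior
> alpha <- 1; beta <- 1; n <- 5; t <- 2.5
> m <- max(beta, t)
> q <- m * (1 - 0.95)^(-1 / (alpha + n))   # Pareto quantile function
> c(lower = m, upper = q)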
10. We continue Exercise 1 in Chapter 5, so we assume that the number of IHD cases is Di | λi ∼ind Po(λi Yi), i = 1, 2, where λi > 0 is the group-specific incidence rate. […]
a) Derive the posterior distribution of λ1 and λ2. Plot these in R for comparison.
◮ We first derive Jeffreys' prior for λi, i = 1, 2:
f(di | λi Yi) ∝ (λi Yi)^{di} exp(−λi Yi) ∝ λi^{di} exp(−λi Yi),
l(λi) = di log(λi) − λi Yi + const,
l′(λi) = di/λi − Yi,
I(λi) = −l″(λi) = di λi^{−2},
J(λi) = E(Di) λi^{−2} = λi Yi λi^{−2} ∝ λi^{−1}.
Thus, f(λi) ∝ J(λi)^{1/2} ∝ λi^{−1/2}, which corresponds to the improper G(1/2, 0) distribution (compare to Table 6.3 in the book). This implies
f(λi | di) ∝ f(di | λi Yi) f(λi) ∝ λi^{di} exp(−λi Yi) λi^{−1/2} = λi^{di+1/2−1} exp(−λi Yi),
which is the density of the G(di + 1/2, Yi) distribution (compare to Table 6.2). Consequently,
λ1 | D1 = 17 ∼ G(17 + 1/2, Y1) = G(17.5, 2768.9),
λ2 | D2 = 28 ∼ G(28 + 1/2, Y2) = G(28.5, 1857.5).
> # the data is:
> # number of person years in the non-exposed and the exposed group
> y <- c(2768.9, 1857.5)
> # number of cases in the non-exposed and the exposed group
> d <- c(17, 28)
> #
> # plot the gamma densities for the two groups
> curve(dgamma(x, shape=d[1]+0.5, rate=y[1]), from=0, to=0.05,
        ylim=c(0,300), ylab="Posterior density", xlab=expression(lambda))
> curve(dgamma(x, shape=d[2]+0.5, rate=y[2]), col=2, from=0, to=0.05, add=T)
> legend("topright", c("Non-exposed", "Exposed"), col=c(1,2), lty=1, bty="n")
[Figure: posterior gamma densities of the incidence rate λ for the non-exposed and the exposed group.]
b) Derive the posterior distribution of the relative risk θ = λ2/λ1 as follows:
i. Derive the posterior distributions of τ1 = λ1 Y1 and τ2 = λ2 Y2.
◮ From Appendix A.5.2 we know that if X ∼ G(α, β), then c · X ∼ G(α, β/c). Therefore, we have τ1 ∼ G(1/2 + d1, 1) and τ2 ∼ G(1/2 + d2, 1).
ii. An appropriate multivariate transformation of τ = (τ1, τ2)⊤ to work with is […] the marginal density
f(η1) = ∫_0^∞ f(η1, η2) dη2,
which is a beta prime distribution with parameters α2 and α1. […]
[1] 1.357199 4.537326
An equi-tailed credible interval for θ is thus [1.357, 4.537]. In Exercise 1 in Chapter 5, we have obtained the confidence interval [0.296, 1.501] for log(θ). Transforming the limits of this interval with the exponential function gives the confidence interval [1.344, 4.485] for θ, which is quite similar to the credible interval obtained above. The credible interval is slightly wider and shifted towards slightly larger values than the confidence interval.
11. Consider Exercise 10 in Chapter 3. Our goal is now to perform Bayesian inference with an improper discrete uniform prior for the unknown number N of beds:
f(N) ∝ 1 for N = 2, 3, …
a) Why is the posterior mode equal to the MLE?
◮ This is due to Result 6.1: the posterior mode Mod(N | xn) maximizes the posterior distribution, which is proportional to the likelihood function under a uniform prior:
f(N | xn) ∝ f(xn | N) f(N) ∝ f(xn | N).
Hence, the posterior mode must equal the value that maximizes the likelihood function, which is the MLE. In Exercise 10 in Chapter 3, we have obtained N̂ML = Xn.
b) Show that for n > 1 the posterior probability mass function is
f(N | xn) = (n − 1)/xn · \binom{xn}{n} \binom{N}{n}^{−1}   for N ≥ xn.
◮ We have
f(N | xn) = f(xn | N) f(N)/f(xn) ∝ f(xn | N)/f(xn).
From Exercise 10 in Chapter 3, we know that
f(xn | N) = \binom{xn − 1}{n − 1} \binom{N}{n}^{−1}   for N ≥ xn.
Next, we derive the marginal likelihood f(xn):
f(xn) = Σ_{N=1}^∞ f(xn | N)
      = Σ_{N=xn}^∞ \binom{xn − 1}{n − 1} \binom{N}{n}^{−1}
      = \binom{xn − 1}{n − 1} n! Σ_{N=xn}^∞ (N − n)!/N!.
To obtain the expression for f(N | xn) given in the exercise, we thus have to show that
Σ_{N=xn}^∞ (N − n)!/N! = [ (n − 1)/xn · \binom{xn}{n} · n! ]^{−1} = (xn − n)!/{(n − 1)(xn − 1)!}.   (6.8)
To this end, note that
Σ_{N=xn}^∞ (N − n)!/N! = lim_{k→∞} Σ_{N=xn}^k (N − n)!/N!   and
Σ_{N=xn}^k (N − n)!/N! = (xn − n)!/{(n − 1)(xn − 1)!} − (k − (n − 1))!/{k!(n − 1)},   (6.9)
where (6.9) can be shown easily by induction on k ≥ xn (and can be deduced by using the software Maxima, for example). Now, the second term on the right-hand side of (6.9) converges to 0 as k → ∞ since
(k − (n − 1))!/{k!(n − 1)} = 1/{k(k − 1)···(k − (n − 1) + 1)} · 1/(n − 1),
which implies (6.8) and completes the proof.
c) Show that the posterior expectation is
E(N | xn) = (n − 1)/(n − 2) · (xn − 1)   for n > 2.
◮ We have
E(N | xn) = Σ_{N=0}^∞ N f(N | xn)
          = Σ_{N=xn}^∞ (n − 1)/xn · \binom{xn}{n} · n! · (N − n)!/(N − 1)!   (6.10)
and to determine the limit of the involved series, we can use (6.8) again:
Σ_{N=xn}^∞ (N − n)!/(N − 1)! = Σ_{N=xn−1}^∞ (N − (n − 1))!/N! = (xn − n)!/{(n − 2)(xn − 2)!}.
Plugging this result into expression (6.10) yields the claim.
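Identity (6.9), on which the proofs of (b) and (c) rely, is easy to check numerically; a small sketch in R for illustrative values n = 3, xn = 5 and k = 20 (any k ≥ xn can be used):
> ## check identity (6.9): sum_{N=xn}^{k} (N-n)!/N!
> n <- 3; xn <- 5; k <- 20
> lhs <- sum(exp(lfactorial((xn:k) - n) - lfactorial(xn:k)))
> rhs <- exp(lfactorial(xn - n) - lfactorial(xn - 1)) / (n - 1) -
         exp(lfactorial(k - (n - 1)) - lfactorial(k)) / (n - 1)
> all.equal(lhs, rhs)
[1] TRUE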
d) Compare the frequentist estimates from Exercise 10 in Chapter 3 with the posterior mode and mean for n = 48 and xn = 1812. Numerically compute the associated 95 % HPD interval for N.
◮ The unbiased estimator from Exercise 10 is
N̂ = (n + 1)/n · xn − 1 = 1848.75,
which is considerably larger than the MLE and posterior mode xn = 1812. The posterior mean
E(N | xn) = (n − 1)/(n − 2) · (xn − 1) = 1850.37
is even larger than N̂. We compute the 95 % HPD interval for N in R:
> # the data is
> n <- 48
> x_n <- 1812
> #
> # compute the posterior distribution for a large enough interval of N values
> N <- seq(from = x_n, length = 2000)
> posterior <- exp(log(n - 1) - log(x_n) + lchoose(x_n, n) - lchoose(N, n))
> plot(N, posterior, type = "l", col=4,
       ylab = "f(N | x_n)")
> # we see that this interval is large enough
> #
> # the posterior density is monotonically decreasing for values of N >= x_n
> # hence, the mode x_n is the lower limit of the HPD interval
> level <- 0.95
> hpdLower <- x_n
> #
> # we next determine the upper limit
> # since the posterior density is monotonically decreasing
> # the upper limit is the smallest value of N for
> # which the cumulative posterior distribution function is larger or equal to 0.95
> cumulatedPosterior <- cumsum(posterior)
> hpdUpper <- min(N[cumulatedPosterior >= level])
> #
> # add the HPD interval to the figure
> abline(v = c(hpdLower, hpdUpper), col=2)
[Figure: posterior mass function f(N | x_n) with the limits of the 95 % HPD interval marked by vertical lines.]
Thus, the HPD interval is [1812, 1929].
12. Assume that X1, …, Xn are independent samples from the binomial models Bin(m, πi) and assume that πi ∼iid Be(α, β). Compute empirical Bayes estimates π̂i of πi as follows:
a) Show that the marginal distribution of Xi is beta-binomial, see Appendix A.5.1 for details. The first two moments of this distribution are
µ1 = E(Xi) = m α/(α + β),   (6.11)
µ2 = E(Xi²) = m α{m(1 + α) + β}/{(α + β)(1 + α + β)}.   (6.12)
Solve for α and β using the sample moments µ̂1 = n^{−1} Σ_{i=1}^n xi, µ̂2 = n^{−1} Σ_{i=1}^n xi² to obtain estimates of α and β.
◮ The marginal likelihood of Xi can be found by integrating πi out of the joint distribution of Xi and πi:
f(xi) = ∫_0^1 \binom{m}{xi} πi^{xi}(1 − πi)^{m−xi} · 1/B(α, β) πi^{α−1}(1 − πi)^{β−1} dπi
      = \binom{m}{xi} B(α + xi, β + m − xi)/B(α, β) ∫_0^1 1/B(α + xi, β + m − xi) πi^{α+xi−1}(1 − πi)^{β+m−xi−1} dπi
      = \binom{m}{xi} B(α + xi, β + m − xi)/B(α, β),
which is known as a beta-binomial distribution.
To derive estimates for α and β, we first solve the two given equations for α and β and then we replace µ1 and µ2 by the corresponding sample moments µ̂1 and µ̂2.
First, we combine the two equations by replacing part of the expression on the right-hand side of Equation (6.12) by µ1, which gives
µ2 = µ1 · {m(1 + α) + β}/(1 + α + β).
We solve the above equation for β to obtain
β = (1 + α)(mµ1 − µ2)/(µ2 − µ1).   (6.13)
Solving Equation (6.11) for α yields
α = β · µ1/(m − µ1).
Next, we plug this expression for α into Equation (6.13) and solve the resulting equation
β = 1/(µ2 − µ1) · { (mµ1² − µ1µ2)/(m − µ1) · β + mµ1 − µ2 }
for β to obtain
β = (mµ1 − µ2)(m − µ1) / {m(µ2 − µ1(µ1 + 1)) + µ1²}
and consequently
α = β · µ1/(m − µ1) = (mµ1 − µ2)µ1 / {m(µ2 − µ1(µ1 + 1)) + µ1²}.
Replacing µ1 and µ2 by the sample moments gives the estimates
α̂ = (mµ̂1 − µ̂2)µ̂1 / {m(µ̂2 − µ̂1(µ̂1 + 1)) + µ̂1²}   and
β̂ = (mµ̂1 − µ̂2)(m − µ̂1) / {m(µ̂2 − µ̂1(µ̂1 + 1)) + µ̂1²}.
b) Now derive the empirical Bayes estimates π̂i. Compare them with the corresponding MLEs.
◮ By Example 6.3, the posterior πi | xi has a Be(α̂ + xi, β̂ + m − xi) distribution and the empirical Bayes estimate is thus the posterior mean
π̂i = E(πi | xi) = (α̂ + xi)/(α̂ + β̂ + m).
For comparison, the maximum likelihood estimate is π̂i,ML = xi/m. Hence, the Bayes estimate is equal to the MLE if and only if α̂ = β̂ = 0, which corresponds to an improper prior distribution. In general, the Bayes estimate
(α̂ + xi)/(α̂ + β̂ + m) = (α̂ + β̂)/(α̂ + β̂ + m) · α̂/(α̂ + β̂) + m/(α̂ + β̂ + m) · xi/m
is a weighted average of the prior mean α̂/(α̂ + β̂) and the MLE xi/m. The weights are proportional to the prior sample size m0 = α̂ + β̂ and the data sample size m, respectively.
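The moment estimators and the resulting empirical Bayes estimates are straightforward to code; a sketch on simulated data (m, n, α and β chosen arbitrarily for illustration, not taken from the exercise):
> ## empirical Bayes estimates in the beta-binomial model (simulated example)
> set.seed(1)
> m <- 20; n <- 50; alpha <- 2; beta <- 5
> pi <- rbeta(n, alpha, beta)
> x <- rbinom(n, size = m, prob = pi)
> ## method-of-moments estimates of alpha and beta
> mu1 <- mean(x); mu2 <- mean(x^2)
> denom <- m * (mu2 - mu1 * (mu1 + 1)) + mu1^2
> alphaHat <- (m * mu1 - mu2) * mu1 / denom
> betaHat <- (m * mu1 - mu2) * (m - mu1) / denom
> ## empirical Bayes estimates versus the MLEs x/m
> piHat <- (alphaHat + x) / (alphaHat + betaHat + m)
> head(cbind(empBayes = piHat, mle = x / m))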
7 Model selection

1. Derive Equation (7.18).
◮ Since the normal prior is conjugate to the normal likelihood with known variance (compare to Table 7.2), we can avoid integration and use Equation (7.16) instead to compute the marginal distribution:
f(x | M1) = f(x | µ) f(µ) / f(µ | x).
We know that
x | µ ∼ N(µ, κ^{−1}),   µ ∼ N(ν, δ^{−1}),
and in Example 6.8, we have derived the posterior distribution of µ:
µ | x ∼ N( (nκx̄ + δν)/(nκ + δ), (nκ + δ)^{−1} ).
Consequently, we obtain
f(x | M1) = [ (2πκ^{−1})^{−n/2} exp{−κ/2 Σ_{i=1}^n (xi − µ)²} · (2πδ^{−1})^{−1/2} exp{−δ/2 (µ − ν)²} ]
            / [ {2π(nκ + δ)^{−1}}^{−1/2} exp{−(nκ + δ)/2 (µ − (nκx̄ + δν)/(nκ + δ))²} ]
= (κ/2π)^{n/2} {δ/(nκ + δ)}^{1/2}
  · exp[ −1/2 { κ(Σ_{i=1}^n xi² − 2nx̄µ + nµ²) + δ(µ² − 2µν + ν²)
               − ( (nκ + δ)µ² − 2µ(nκx̄ + δν) + (nκx̄ + δν)²/(nκ + δ) ) } ]
= (κ/2π)^{n/2} {δ/(nκ + δ)}^{1/2} exp[ −κ/2 { Σ_{i=1}^n xi² + δν²/κ − (nκx̄ + δν)²/{κ(nκ + δ)} } ]
= (κ/2π)^{n/2} {δ/(nκ + δ)}^{1/2} exp[ −κ/2 { Σ_{i=1}^n (xi − x̄)² + nδ/(nκ + δ) (x̄ − ν)² } ].
Thus,
f(x | θ) = (2π/κ)^{−n/2} exp{−κ/2 Σ_{i=1}^n (xi − µ)²},
f(θ) = {2π(λκ)^{−1}}^{−1/2} β^α/Γ(α) κ^{α−1} exp(−βκ) · exp{−λκ/2 (µ − ν)²},
f(θ | x) = {2π(λ*κ)^{−1}}^{−1/2} (β*)^{α*}/Γ(α*) κ^{α*−1} exp(−β*κ) · exp{−λ*κ/2 (µ − ν*)²}.
Note that it is important to include the normalising constants of the above densities in the following calculation to get
f(x | M1) = f(x | θ) f(θ) / f(θ | x)
= { (2π)^{−n/2} β^α/Γ(α) (2π)^{−1/2} λ^{1/2} } / { (β*)^{α*}/Γ(α*) (2π)^{−1/2} (λ*)^{1/2} }
= (1/2π)^{n/2} {λ/(λ + n)}^{1/2} · Γ(α + n/2) β^α / Γ(α)
  · [ β + {n σ̂²ML + (λ + n)^{−1} n λ (ν − x̄)²}/2 ]^{−(α + n/2)}.   (7.1)
b) Next, calculate explicitly the posterior probabilities of the (a priori equally probable) models M1 to M3 using a NG(2000, 5, 1, 50 000) distribution as prior for κ and µ.
◮ We work with models M1 and M2 only, as there is no model M3 specified in Example 7.7.
We first implement the marginal (log-)likelihood derived in part (a):
> ## marginal (log-)likelihood in the normal-normal-gamma model,
> ## for n realisations with mean "mean" and MLE estimate "var"
> ## for the variance
> marginalLikelihood <- function(n, mean, var, # data
                                 nu, lambda, alpha, beta, # prior parameters
                                 log = FALSE # should log(f(x))
                                             # or f(x) be returned?
                                 )
  {
      betaStar <- beta +
          (n * var + n * lambda * (nu - mean)^2 / (lambda + n)) / 2
      logRet <- - n/2 * log(2 * pi) + 1/2 * log(lambda / (lambda + n)) +
          lgamma(alpha + n/2) - lgamma(alpha) + alpha * log(beta) -
          (alpha + n/2) * log(betaStar)
      if(log)
          return(logRet)
      else
          return(exp(logRet))
  }
Next, we store the alcohol concentration data given in Table 1.3 and use the given prior parameters to compute the marginal likelihoods of the two models M1 and M2:
> ## store the data
> (alcoholdata <-
      data.frame(gender = c("Female", "Male", "Total"),
                 n = c(33, 152, 185),
                 mean = c(2318.5, 2477.5, 2449.2),
                 sd = c(220.1, 232.5, 237.8))
   )
  gender   n   mean    sd
1 Female  33 2318.5 220.1
2   Male 152 2477.5 232.5
3  Total 185 2449.2 237.8
> attach(alcoholdata)
> ##
> ## prior parameters
> nu <- 2000
> lambda <- 5
> alpha <- 1
> beta <- 50000
> ##
> ## vector to store the marginal likelihood values
> logMargLik <- numeric(2)
> ## compute the marginal log-likelihood of model M_1
> ## use the accumulated data for both genders
> logMargLik[1] <-
      marginalLikelihood(n = n[3], mean = mean[3], var = sd[3]^2,
                         nu, lambda, alpha, beta,
                         log = TRUE)
> ##
> ## compute the marginal log-likelihood of model M_2
> ## first compute the marginal log-likelihoods for the
> ## two groups (female and male)
> ## the marginal log-likelihood of model M_2 is the sum
> ## of these two marginal log-likelihoods
> logMargLikFem <-
      marginalLikelihood(n = n[1], mean = mean[1], var = sd[1]^2,
                         nu, lambda, alpha, beta,
                         log = TRUE)
> logMargLikMale <-
      marginalLikelihood(n = n[2], mean = mean[2], var = sd[2]^2,
                         nu, lambda, alpha, beta,
                         log = TRUE)
> logMargLik[2] <- logMargLikFem + logMargLikMale
> logMargLik
[1] -1287.209 -1288.870
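The Bayes factor of M1 versus M2 follows directly from the two log marginal likelihoods just computed; a one-line sketch (from the rounded values above, roughly exp(1.661) ≈ 5.3):
> ## Bayes factor of M1 versus M2
> exp(logMargLik[1] - logMargLik[2])   # approximately 5.3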
Hence, the marginal likelihood of model M1 is larger than the marginal likelihood of model M2.
For equally probable models M1 and M2, we have
Pr(Mi | x) = f(x | Mi) / Σ_{j=1}^2 f(x | Mj) = {f(x | Mi)/c} / Σ_{j=1}^2 {f(x | Mj)/c},
where the expansion with the constant c^{−1} ensures that applying the implemented exponential function to the marginal likelihood values in the range of −1290 does not return the value 0. Here we use log(c) = min{log f(x | M1), log f(x | M2)}:
> const <- min(logMargLik)
> posterioriProb <- exp(logMargLik - const)
> (posterioriProb <- posterioriProb / sum(posterioriProb))
[1] 0.8403929 0.1596071
Thus, given the data above, model M1 is more likely than model M2, i.e. the model using the same transformation factor for both genders is much more likely than the model using different transformation factors for women and men, respectively.
c) Evaluate the behaviour of the posterior probabilities depending on varying parameters of the prior normal-gamma distribution.
◮ The R-code from part (b) to compute the posterior probabilities can be used to define a function modelcomp, which takes the four prior parameters as arguments and returns the posterior probabilities of model M1 and M2 (rounded to three decimals). We will vary one parameter at a time in the following:
ments and returns the posterior probabilities of model M1 and M2 (rounded to i=1
three decimals). We will vary one parameter at a time in the following: The marginal likelihood of model M2 is given in Equation (7.18) in the book.
> ## given parameters The Bayes factor BF12 is thus
> modelcomp(nu, lambda, alpha, beta)
[1] 0.84 0.16 f (x | M1 )
B12 =
> ## vary nu f (x | M2 )
> for(nuNew in c(1900, 1950, 2050, 2100)) 1 ( n n
!)
print(modelcomp(nuNew, lambda, alpha, beta)) nκ + δ 2 κ X
2
X
2 nδ
= exp − (xi − µ0 ) − (xi − x̄) − (x̄ − ν)2 .
[1] 0.987 0.013 δ 2 nκ + δ
[1] 0.95 0.05 i=1 i=1
[1] 0.614 0.386 (7.2)
[1] 0.351 0.649
> ## vary lambda b) As an example, calculate the Bayes factor for the centered alcohol concentration
> for(lambdaNew in c(1, 3, 7, 9)) data using µ0 = 0, ν = 0 and δ = 1/100.
print(modelcomp(nu, lambdaNew, alpha, beta))
[1] 0.166 0.834 ◮ For µ0 = ν = 0, the Bayes factor can be simplified to
[1] 0.519 0.481 1 ( −1 )
nκ + δ 2 nκx̄2 δ
B12 = exp − 1+
[1] 0.954 0.046
.
[1] 0.985 0.015 δ 2 nκ
> ## vary alpha
> for(alphaNew in c(0.2, 0.5, 2, 3))
We choose the parameter κ as the precision estimated from the data: κ = 1/σ̂ 2 .
print(modelcomp(nu, lambda, alphaNew, beta))
b) As an example, calculate the Bayes factor for the centered alcohol concentration data using µ0 = 0, ν = 0 and δ = 1/100.
◮ For µ0 = ν = 0, the Bayes factor can be simplified to
BF12 = {(nκ + δ)/δ}^{1/2} exp[ −nκx̄²/2 · {1 + δ/(nκ)}^{−1} ].
We choose the parameter κ as the precision estimated from the data: κ = 1/σ̂².
> # define the Bayes factor as a function of the data and delta
> bayesFactor <- function(n, mean, var, delta)
  {
      kappa <- 1 / var
      logbayesFactor <- 1/2 * (log(n * kappa + delta) - log(delta)) -
          n * kappa * mean^2 / 2 *
          (1 + delta / (n * kappa))^{-1}
      exp(logbayesFactor)
  }
> # centered alcohol concentration data
> n <- 185
> mean <- 0
> sd <- 237.8
> # compute the Bayes factor for the alcohol data
> bayesFactor(n, mean, sd^2, delta = 1/100)
[1] 1.15202
According to the Bayes factor, model M1 is slightly more likely than model M2, i.e. the mean transformation factor does not differ from 0, as expected.
c) Show that the Bayes factor tends to ∞ for δ → 0 irrespective of the data and the sample size n.
◮ The claim easily follows from Equation (7.2) since for δ → 0, the expression in the exponential converges to
−κ/2 { Σ_{i=1}^n (xi − µ0)² − Σ_{i=1}^n (xi − x̄)² }
and the factor {(nκ + δ)/δ}^{1/2} diverges to ∞.
For the alcohol concentration data, we can obtain a large Bayes factor by using an extremely small δ:
> bayesFactor(n, mean, sd^2,
              delta = 10^{-30})
[1] 5.71971e+13
This is an example of Lindley's paradox.
(One can deduce in a similar way that for µ0 = ν = 0, the Bayes factor converges to 1 as δ → ∞, i.e. the two models become equally likely.)
5. In order to compare the models
M0: X ∼ N(0, σ²)   and   M1: X ∼ N(µ, σ²)
with known σ², we calculate the Bayes factor BF01.
a) Show that
BF01 ≥ exp(−z²/2)
for arbitrary prior distribution on µ, where z = x/σ is standard normal under model M0. The expression exp(−z²/2) is called the minimum Bayes factor (Goodman, 1999).
◮ We denote the unknown prior distribution of µ by f(µ). Then, the Bayes factor BF01 can be expressed as
BF01 = f(x | M0)/f(x | M1) = f(x | M0) / ∫ f(x | µ) f(µ) dµ.   (7.3)
The model M0 has no free parameters, so its marginal likelihood is the usual likelihood
f(x | M0) = (2πσ²)^{−1/2} exp(−z²/2),
for z = x/σ. To find a lower bound for BF01, we have to maximise the integral in the denominator in (7.3). Note that the density f(µ) averages over the values of the likelihood function f(x | µ). Hence, it is intuitively clear that the integral is maximised if we keep the density constant at its maximum value, which is reached at the MLE µ̂ML = x. We thus obtain
f(x | M1) ≤ f(x | µ̂ML) = (2πσ²)^{−1/2} exp[ −1/2 {(x − µ̂ML)/σ}² ] = (2πσ²)^{−1/2},
which implies
BF01 = f(x | M0)/f(x | M1) ≥ exp(−z²/2).
b) Calculate for selected values of z the two-sided P-value 2{1 − Φ(|z|)}, the minimum Bayes factor and the corresponding posterior probability of M0, assuming equal prior probabilities Pr(M0) = Pr(M1) = 1/2. Compare the results.
◮
> ## minimum Bayes factor:
> mbf <- function(z)
      exp(-1/2 * z^2)
> ## use these values for z:
> zgrid <- seq(0, 5, length = 101)
> ##
> ## compute the P-values, the values of the minimum Bayes factor and
> ## the corresponding posterior probability of M_0
> ## note that under equal prior probabilities for the models,
> ## the posterior odds equals the Bayes factor
> pvalues <- 2 * (1 - pnorm(zgrid))
> mbfvalues <- mbf(zgrid)
> postprob.M_0 <- mbfvalues/(1 + mbfvalues)
> ##
> ## plot the obtained values
> matplot(zgrid, cbind(pvalues, mbfvalues, postprob.M_0), type = "l",
          xlab = expression(z), ylab = "values")
> legend("topright", legend = c("P-values from Wald test", "Minimum Bayes factor",
                                "Posterior prob. of M_0"),
         col = 1:3, lty = 1:3, bty = "n")
> ## comparisons:
> all(pvalues <= mbfvalues)
[1] TRUE
> zgrid[pvalues == mbfvalues]
[1] 0
[Figure: P-values from the Wald test, minimum Bayes factors and posterior probabilities of M_0 plotted against z.]
Thus, the P-values from the Wald test are smaller than or equal to the minimum Bayes factors for all considered values of z. Equality holds for z = 0 only and for z > 3, the P-values and the minimum Bayes factors are very similar.
6. Consider the models
M0: p ∼ U(0, 1)   and   M1: p ∼ Be(θ, 1),
where 0 < θ < 1. This scenario aims to reflect the distribution of a two-sided P-value p under the null hypothesis (M0) and some alternative hypothesis (M1), where smaller P-values are more likely (Sellke et al., 2001). This is captured by the decreasing density of the Be(θ, 1) for 0 < θ < 1. Note that the data are now represented by the P-value.
a) Show that the Bayes factor for M0 versus M1 is
BF(p) = { ∫_0^1 θ p^{θ−1} f(θ) dθ }^{−1}
for some prior density f(θ) for θ.
◮ We have
BF(p) = f(p | M0)/f(p | M1)
      = 1 / ∫_0^1 B(θ, 1)^{−1} p^{θ−1} f(θ) dθ
      = { ∫_0^1 θ p^{θ−1} f(θ) dθ }^{−1},
since
B(θ, 1) = Γ(θ)/Γ(θ + 1) = Γ(θ)/{θ Γ(θ)} = 1/θ,
where we have used
Γ(θ + 1) = ∫_0^∞ t^θ exp(−t) dt = [−t^θ exp(−t)]_0^∞ + θ ∫_0^∞ t^{θ−1} exp(−t) dt = θ Γ(θ).
b) Show that the minimum Bayes factor mBF over all prior densities f(θ) has the form
mBF(p) = −e p log p for p < e^{−1},   and 1 otherwise,
where e = exp(1) is Euler's number.
◮ We have
mBF(p) = min_f { ∫_0^1 θ p^{θ−1} f(θ) dθ }^{−1} = { max_{θ∈[0,1]} θ p^{θ−1} }^{−1},
where the last equality is due to the fact that the above integral is maximal if the density f(θ) is chosen as a point mass at the value of θ which maximises θ p^{θ−1}.
We now consider the function g(θ) = θ p^{θ−1} to determine its maximum. For p < 1/e, the function g has a unique maximum in (0, 1). For p ≥ 1/e, the function g is strictly monotonically increasing on [0, 1] and thus attains its maximum at θ = 1 (compare to the figure below).
[Figure: the function g(θ) = θ p^{θ−1} on [0, 1] for p = 0.1 and p = 0.5.]
We now derive the maxima described above analytically:
i. Case p < 1/e: We compute the maximum of the function h(θ) = log(g(θ)):
h(θ) = log(θ) + (θ − 1) log(p),
dh(θ)/dθ = 1/θ + log(p), and hence
dh(θ)/dθ = 0  ⇒  θ = −1/log(p).
It is easy to see that dh(θ)/dθ > 0 for θ < −(log p)^{−1} and dh(θ)/dθ < 0 for θ > −(log p)^{−1}, so that h and hence also g are strictly monotonically increasing for θ < −(log p)^{−1} and strictly monotonically decreasing for θ > −(log p)^{−1}. Consequently, the maximum of g determined above is unique and we have
sup_{θ∈[0,1]} θ p^{θ−1} = g(−1/log p) = −1/log p · exp{ log p (−1/log p − 1) } = −1/{log(p) e p},
which implies mBF(p) = −e p log p.
ii. Case p ≥ 1/e: We have
dg(θ)/dθ = p^{θ−1}(1 + θ log p) ≥ 0
for all θ ∈ [0, 1] since log p ≥ −1 in this case. Thus, g is monotonically increasing on [0, 1] and
mBF(p) = { sup_{θ∈[0,1]} θ p^{θ−1} }^{−1} = g(1)^{−1} = 1.
c) Compute and interpret the minimum Bayes factor for selected values of p (e.g. p = 0.05, p = 0.01, p = 0.001).
◮
> ## minimum Bayes factor:
> mbf <- function(p)
  {   if(p < 1/exp(1))
      { - exp(1)*p*log(p) }
      else
      { 1 }
  }
> ## use these values for p:
> p <- c(0.05, 0.01, 0.001)
> ## compute the corresponding minimum Bayes factors
> minbf <- numeric(length = 3)
> for (i in 1:3)
  {   minbf[i] <- mbf(p[i])
  }
> minbf
[1] 0.40716223 0.12518150 0.01877723
> ratio <- p/minbf
> ratio
[1] 0.12280117 0.07988401 0.05325600
> ## note that the minimum Bayes factors are considerably larger
> ## than the corresponding p-values
For p = 0.05, we obtain a minimum Bayes factor of approximately 0.4. This means that given the data p = 0.05, model M0 is at least 40 % as likely as model M1. If the prior odds of M0 versus M1 is 1, then the posterior odds of M0 versus M1 is at least 0.4. Hence, the data p = 0.05 does not correspond to strong evidence against model M0. The other minimum Bayes factors have an analogous interpretation.
7. Box (1980) suggested a method to investigate the compatibility of a prior with the observed data. The approach is based on computation of a P-value obtained from the prior predictive distribution f(x) and the actually observed datum xo. Small p-values indicate a prior-data conflict and can be used for prior criticism.
Box's p-value is defined as the probability of obtaining a result with prior predictive ordinate f(X) equal to or lower than at the actual observation xo:
Pr{f(X) ≤ f(xo)},
where X is distributed according to the prior predictive distribution f(x), so f(X) is a random variable. Suppose both likelihood and prior are normal, i.e. X | µ ∼ N(µ, σ²) and µ ∼ N(ν, τ²). Show that Box's p-value is the upper tail probability of a χ²(1) distribution evaluated at
(xo − ν)²/(σ² + τ²).
◮ We have already derived the prior predictive density for a normal likelihood with known variance and a normal prior for the mean µ in Exercise 1. By setting κ = 1/σ², δ = 1/τ² and n = 1 in Equation (7.18), we obtain
f(x) = {2π(τ² + σ²)}^{−1/2} exp{ −(x − ν)²/(2(τ² + σ²)) },
so that
(X − ν)/√(σ² + τ²) ∼ N(0, 1)   and   (X − ν)²/(σ² + τ²) ∼ χ²(1)   (7.4)
(see Table A.2 for the latter fact). Thus, Box's p-value is
Pr{f(X) ≤ f(xo)}
= Pr[ {2π(σ² + τ²)}^{−1/2} exp{ −(X − ν)²/(2(σ² + τ²)) } ≤ {2π(σ² + τ²)}^{−1/2} exp{ −(xo − ν)²/(2(σ² + τ²)) } ]
= Pr{ −(X − ν)²/(σ² + τ²) ≤ −(xo − ν)²/(σ² + τ²) }
= Pr{ (X − ν)²/(σ² + τ²) ≥ (xo − ν)²/(σ² + τ²) }.
Due to (7.4), the latter probability equals the upper tail probability of a χ²(1) distribution evaluated at (xo − ν)²/(σ² + τ²).
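Box's p-value is then a one-line computation in R; a small sketch with illustrative values xo = 3, ν = 0, σ² = τ² = 1 (these numbers are not from the exercise):
> ## Box's p-value: upper tail of chi^2(1) at (xo - nu)^2 / (sigma^2 + tau^2)
> xo <- 3; nu <- 0; sigma2 <- 1; tau2 <- 1
> pchisq((xo - nu)^2 / (sigma2 + tau2), df = 1, lower.tail = FALSE)   # about 0.034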
◮ We use approximation (8.6) with g(λ) = λ und n = 1. We have
which yields the following Laplace approximation of the posterior expectation: For a fixed ratio of observed value x and offset e, the Laplace approximation
r n thus improves for larger values of x and e. If we consider the ratio of the
α+x
Ê(λ | x) = exp log(λ̂∗ ) + (α + x − 1) log(λ̂∗ ) − (β + e)λ̂∗ Laplace approximation and the exact value
α+x−1
o Ê(λ | x)
− (α + x − 1) log(λ̂) + (β + e)λ̂ = exp (α + x − 0.5) log(α + x) − log(α + x − 1) − 1
E(λ | x)
r α+x−0.5 .
α+x n o 1
= exp log(λ̂∗ ) + (α + x − 1) log(λ̂∗ /λ̂) + (β + e)(λ̂ − λ̂∗ ) = 1+ exp(1),
α+x−1 α+x−1
r α+x
α+x α+x it is not hard to see that this ratio converges to 1 as x → ∞.
= exp log − (α + x − 1) log −1
α+x−1 e+β α+x−1 d) Now consider θ = log(λ). First derive the posterior density function using the
= exp (α + x + 0.5) log(α + x) − (α + x − 0.5) log(α + x − 1) change of variables formula (A.11). Second, compute the Laplace approxima-
tion of the posterior expectation of θ and compare again with the exact value
− log(β + e) − 1 .
which you have obtained by numerical integration using the R-function inte-
c) For α = 0.5 and β = 0, compare the Laplace approximation with the exact grate.
value, given the observations x = 11 and e = 3.04, or x = 110 and e = 30.4. ◮ The posterior density is
Also compute the relative error of the Laplace approximation. d
fθ (θ | x) = fλ g −1 (θ) | x · g −1 (θ)
◮ We first implement the Laplace approximation and the exact formula: dθ
> ## Laplace approximation of the posterior expectation (β + e)α+x
> ## for data x, offset e und priori parameters alpha, beta: = exp(θ)α+x−1 exp −(β + e) exp(θ) · exp(θ)
> laplaceApprox1 <- function(x, e, alpha, beta) Γ(α + x)
{ (β + e)α+x
= exp (α + x)θ − (β + e) exp(θ) ,
logRet <- (alpha + x + 0.5) * log(alpha + x) -
Γ(α + x)
(alpha + x - 0.5) * log(alpha + x - 1) -
log(beta + e) - 1 which does not correspond to any well-known distribution.
exp(logRet)
} > ## posterior density of theta = log(lambda):
> ## exact calculation of the posterior expectation > thetaDens <- function(theta, x, e, alpha, beta, log = FALSE)
> exact <- function(x, e, alpha, beta) {
(alpha + x) / (beta + e) logRet <- (alpha + x) * (theta + log(beta + e)) -
(beta + e) * exp(theta) - lgamma(alpha + x)
Using the values given above, we obtain if(log)
> (small <- c(exact = exact(11, 3.04, 0.5, 0), return(logRet)
approx = laplaceApprox1(11, 3.04, 0.5, 0))) else
exact approx return(exp(logRet))
3.782895 3.785504 }
> (large <- c(exact = exact(110, 30.4, 0.5, 0), > # check by simulation if the density is correct:
approx = laplaceApprox1(110, 30.4, 0.5, 0))) > x <- 110
> e <- 30.4
exact approx
> alpha <- 0.5
3.634868 3.634893
> beta <- 0
> ## relative errors:
> ## draw histogram of a sample from the distribution of log(theta)
> diff(small) / small["exact"]
> set.seed(59)
approx > thetaSamples <- log(rgamma(1e+5, alpha + x, beta + e))
0.0006897981 > histResult <- hist(thetaSamples, prob= TRUE, breaks = 50,
> diff(large) / large["exact"] xlab = expression(theta), main = "")
approx > ## plot the computed density
6.887162e-06 > thetaGrid <- seq(from = min(histResult$breaks),
to = max(histResult$breaks), length = 101)
> lines(thetaGrid, thetaDens(thetaGrid, x, e, alpha, beta))
> ## looks correct!
[Figure: histogram of the sample of θ = log(λ) with the computed density f_θ(θ | x) overlaid.]
[…] and the second-order derivative is
d²k(θ)/dθ² = (β + e) exp(θ),
which yields the following curvature of k(θ) at its minimum:
κ̂ = d²k(θ̂)/dθ² = α + x.
[…]
> (small <- c(exact = exact(11, 3.04, 0.5, 0),
              approx1 = laplaceApprox1(11, 3.04, 0.5, 0),
              approx2 = laplaceApprox2(11, 3.04, 0.5, 0),
              approx3 = numApprox(11, 3.04, 0.5, 0)))
   exact  approx1  approx2  approx3
3.782895 3.785504 3.785087 3.782895
> (large <- c(exact = exact(110, 30.4, 0.5, 0),
              approx = laplaceApprox1(110, 30.4, 0.5, 0),
              approx2 = laplaceApprox2(110, 30.4, 0.5, 0),
              approx3 = numApprox(110, 30.4, 0.5, 0)))
   exact   approx  approx2  approx3
3.634868 3.634893 3.634893 3.634868
> ## relative errors:
> (small[2:4] - small["exact"]) / small["exact"]
     approx1      approx2      approx3
6.897981e-04 5.794751e-04 -1.643516e-15
> (large[2:4] - large["exact"]) / large["exact"]
      approx      approx2       approx3
6.887162e-06 6.763625e-06 -4.105072e-14
Numerical integration using integrate thus gives even more accurate results than the two Laplace approximations in this setting.
2. In Example 8.3, derive the Laplace approximation (8.9) for the posterior expectation of π using the variance stabilising transformation.
◮ As mentioned in Example 8.3, the variance stabilising transformation is φ = h(π) = arcsin(√π) and its inverse is h^{−1}(φ) = sin²(φ). The relation sin²(φ) + cos²(φ) = 1 will be used several times in the following.
We first reparametrise the likelihood and the prior density:
f(x | h^{−1}(φ)) = \binom{n}{x} sin(φ)^{2x} {1 − sin²(φ)}^{n−x} = \binom{n}{x} sin(φ)^{2x} cos(φ)^{2(n−x)}
and applying the change-of-variables formula gives
f(φ) = B(0.5, 0.5)^{−1} {sin²(φ)(1 − sin²(φ))}^{−1/2} 2 sin(φ) cos(φ) = 2 B(0.5, 0.5)^{−1},
i.e. the transformed density is constant. Thus,
−k(φ) = log f(x | h^{−1}(φ)) + log f(φ) = 2x log{sin(φ)} + 2(n − x) log{cos(φ)} + const   and
−kg(φ) = log h^{−1}(φ) − k(φ) = log sin²(φ) + 2x log{sin(φ)} + 2(n − x) log{cos(φ)} + const,
with g = id. The derivatives are
−dk(φ)/dφ = 2x cos(φ)/sin(φ) − 2(n − x) sin(φ)/cos(φ) = 2{x/tan(φ) − (n − x) tan(φ)} = 2{x − n sin²(φ)}/{sin(φ) cos(φ)}   and
−dkg(φ)/dφ = 2 cos(φ)/sin(φ) − dk(φ)/dφ = 2{(x + 1)/tan(φ) − (n − x) tan(φ)} = 2{x + 1 − (n + 1) sin²(φ)}/{sin(φ) cos(φ)}.
The different expressions for the derivatives will be useful for calculating the roots and the second-order derivatives, respectively. From the last expressions, we easily obtain the roots
φ̂ = arcsin(√(x/n))   and   φ̂g = arcsin(√{(x + 1)/(n + 1)}).
Exploiting the relation cos(arcsin(x)) = (1 − x²)^{1/2} gives
−k(φ̂) = 2[ x log √(x/n) + (n − x) log √{(n − x)/n} ]   and
−kg(φ̂g) = log{(x + 1)/(n + 1)} + 2[ x log √{(x + 1)/(n + 1)} + (n − x) log √{(n − x)/(n + 1)} ].
By using for example
d tan(φ)/dφ = 1/cos²(φ),
we obtain the second-order derivatives
d²k(φ)/dφ² = 2{ x/sin²(φ) + (n − x)/cos²(φ) }   and
d²kg(φ)/dφ² = 2{ (x + 1)/sin²(φ) + (n − x)/cos²(φ) },
which yields the following curvatures of k(φ) and kg(φ) at their minima:
κ̂ = d²k(φ̂)/dφ² = 4n   and
κ̂g = d²kg(φ̂g)/dφ² = 4(n + 1).
… credible intervals. Compare with the Monte Carlo estimates from 3a).
◮ To find the density of the odds ratio
θ = π1/(1 − π1) · (1 − π2)/π2,
we first derive the density of γ = π/(1 − π) for π ∼ Be(a, b): similarly to Exercise 2 in Chapter 6, we apply the change-of-variables formula with transformation g(π) = π/(1 − π), inverse function g^{−1}(γ) = γ/(1 + γ) and derivative
dg^{−1}(γ)/dγ = 1/(1 + γ)²
to get
f_γ(γ) = 1/B(a, b) · {γ/(γ + 1)}^{a−1} {1 − γ/(γ + 1)}^{b−1} · 1/(γ + 1)²
       = 1/B(a, b) · {γ/(γ + 1)}^{a−1} {1/(γ + 1)}^{b+1}.   (8.1)
Since π ∼ Be(a, b) implies 1 − π ∼ Be(b, a), (1 − π)/π also has a density of the form (8.1) with the roles of a and b interchanged. To obtain the density of θ, we use the following result: if X and Y are two independent random variables with densities f_X and f_Y, respectively, then the density of Z = X · Y is given by
f_Z(z) = ∫_{−∞}^{∞} f_X(x) f_Y(z/x) 1/|x| dx.
Setting γ1 = π1/(1 − π1) and γ2* = (1 − π2)/π2, we get
f_θ(θ) = ∫_{−∞}^{∞} f_{γ1}(γ) f_{γ2*}(θ/γ) 1/|γ| dγ
       = 1/{B(a1, b1) B(a2, b2)} ∫_{−∞}^{∞} {γ/(γ + 1)}^{a1−1} {1/(γ + 1)}^{b1+1} {θ/(γ + θ)}^{b2−1} {γ/(γ + θ)}^{a2+1} 1/|γ| dγ,
where
a1 = x1 + 0.5, b1 = n1 − x1 + 0.5, a2 = x2 + 0.5, b2 = n2 − x2 + 0.5.
We use numerical integration to compute the above integral:
> ## density of theta = pi_1/(1-pi_1) * (1-pi_2)/pi_2
> ## for pij ~ Be(a[j], b[j])
> thetaDens <- function(theta, a, b, log = FALSE)
  {
      logRet <- theta
      ## compute the value of the density function
      integrand <- function(gamma, theta)
      {
          ## use built-in distributions if possible
          logRet <-
              lbeta(a[1],b[1]+2) + lbeta(b[2],a[2]+2) -
              lbeta(a[1],b[1]) - lbeta(a[2],b[2]) +
              dbeta(gamma/(gamma+1), a[1], b[1]+2, log=TRUE) +
              dbeta(theta/(gamma+theta), b[2], a[2]+2, log=TRUE) -
              log(abs(gamma))
          exp(logRet)
      }
      for(i in seq_along(theta)){
          ## if the integration worked, save the result
          intRes <- integrate(integrand, lower = 0, upper = 1,
                              theta = theta[i],
                              stop.on.error = FALSE, rel.tol = 1e-6,
                              subdivisions = 200)
          if(intRes$message == "OK")
              logRet[i] <- log(intRes$value)
          else
              logRet[i] <- NA
      }
      ## return the vector of results
      if(log)
          return(logRet)
      else
          return(exp(logRet))
  }
> ## test the function using the simulated data:
> histRes <- hist(theta, prob = TRUE, xlim=c(0,25), breaks=1000,
                  main = "", xlab = expression(theta))
> thetaGrid <- seq(from = 0, to = 25,
                   length = 501)
> lines(thetaGrid, thetaDens(thetaGrid, 0.5 + x, 0.5 + n - x))
[Figure: histogram of the simulated odds ratios θ with the numerically computed density overlaid.]
The log odds ratio ψ is the difference of the two independent log odds φi, i = 1, 2:
ψ = log{ π1/(1 − π1) } − log{ π2/(1 − π2) } = logit(π1) − logit(π2) = φ1 − φ2.
To compute the density of ψ, we therefore first compute the density of φ = g(π) = logit(π) assuming π ∼ Be(a, b). Since
g^{−1}(φ) = exp(φ)/{1 + exp(φ)}   and   dg^{−1}(φ)/dφ = g^{−1}(φ){1 − g^{−1}(φ)},
applying the change-of-variables formula gives
f(φ) = 1/B(a, b) · exp(φ)^a / {1 + exp(φ)}^{a+b}.
Now let φi = logit(πi) for πi ∼ Be(ai, bi). As φ1 and φ2 are independent, the density of ψ = φ1 − φ2 can be calculated by applying the convolution theorem:
f_ψ(ψ) = ∫ f1(ψ + φ2) f2(φ2) dφ2
       = 1/{B(a1, b1) B(a2, b2)} ∫_{−∞}^{∞} exp(ψ + φ2)^{a1}/{1 + exp(ψ + φ2)}^{a1+b1} · exp(φ2)^{a2}/{1 + exp(φ2)}^{a2+b2} dφ2.
By substituting π = g^{−1}(φ2), the above integral can be expressed as
∫_0^1 g^{−1}(g(π) + ψ)^{a1} {1 − g^{−1}(g(π) + ψ)}^{b1} π^{a2−1}(1 − π)^{b2−1} dπ,
[…]
          else
              logRet[i] <- NA
      }
      ## return the vector of results
      if(log)
          return(logRet)
      else
          return(exp(logRet))
  }
> ## test the function using the simulated data:
> histRes <- hist(psi, prob = TRUE, breaks = 50,
                  main = "", xlab = expression(psi))
> psiGrid <- seq(from = min(histRes$breaks), to = max(histRes$breaks),
                 length = 201)
> lines(psiGrid, psiDens(psiGrid, 0.5 + x, 0.5 + n - x))
[Figure: histogram of the simulated log odds ratios ψ with the computed density overlaid.]
> ## the Monte Carlo estimate was:
> psiHpd
     lower      upper
-0.5111745  2.7162842
The Monte Carlo estimates of the HPD intervals are also close to the HPD intervals obtained by numerical methods. The above calculations illustrate that Monte Carlo estimation is considerably easier (e.g. we do not have to calculate any densities!) than the corresponding numerical methods and does not require any tuning of integration routines as the numerical methods do.
4. In this exercise we will estimate a Bayesian hierarchical model with MCMC methods. Consider Example 6.31, where we had the following model:
ψ̂i | ψi ∼ N(ψi, σi²),
ψi | ν, τ ∼ N(ν, τ²),
where we assume that the empirical log odds ratios ψ̂i and corresponding variances σi² := 1/ai + 1/bi + 1/ci + 1/di are known for all studies i = 1, …, n. Instead of empirical Bayes estimation of the hyper-parameters ν and τ², we here proceed in a fully Bayesian way by assuming hyper-priors for them. We choose ν ∼ N(0, 10) and τ² ∼ IG(1, 1).
a) Derive the full conditional distributions of the unknown parameters ψ1, …, ψn, ν and τ².
◮ Let i ∈ {1, …, n = 9}. The conditional density of ψi given all other parameters and the data D (which can be reduced to the empirical log odds ratios {ψ̂i} and the corresponding variances {σi²}) is
f(ψi | {ψj}j≠i, ν, τ², D) ∝ f(ψ, ν, τ², D) ∝ f(ψ̂i | ψi) f(ψi | ν, τ²).
This corresponds to the normal model in Example 6.8: µ is replaced by ψi here, σ² by σi² and x by ψ̂i. We can thus use Equation (6.16) to obtain the full conditional distribution
ψi | {ψj}j≠i, ν, τ², D ∼ N( (1/σi² + 1/τ²)^{−1} (ψ̂i/σi² + ν/τ²), (1/σi² + 1/τ²)^{−1} ).
For the population mean ν, we have
f(ν | ψ, τ², D) ∝ Π_{i=1}^n f(ψi | ν, τ²) · f(ν),
which corresponds to the normal model for a random sample. Using the last equation in Example 6.8 on page 182 with xi replaced by ψi and σ² by τ² yields the full conditional distribution
ν | ψ, τ², D ∼ N( (n/τ² + 1/10)^{−1} (nψ̄/τ² + 0/10), (n/τ² + 1/10)^{−1} ).
We further have
f(τ² | ψ, ν, D) ∝ Π_{i=1}^n f(ψi | ν, τ²) · f(τ²)
              ∝ Π_{i=1}^n (τ²)^{−1/2} exp{ −(ψi − ν)²/(2τ²) } · (τ²)^{−(1+1)} exp(−1/τ²)
              = (τ²)^{−((n+2)/2 + 1)} exp[ −{Σ_{i=1}^n (ψi − ν)² + 2}/2 · 1/τ² ],
that is
τ² | ψ, ν, D ∼ IG( (n + 2)/2, {Σ_{i=1}^n (ψi − ν)² + 2}/2 ).
b) Implement a Gibbs sampler to simulate from the corresponding posterior distributions.
◮ In the following R code, we iteratively simulate from the full conditional distributions of ν, τ², ψ1, …, ψn:
> ## the data is
> preeclampsia <- read.table ("../Daten/preeclampsia.txt", header = TRUE)
> preeclampsia
       Trial Diuretic Control Preeclampsia
1    Weseley       14      14          yes
2    Weseley      117     122           no
3    Flowers       21      17          yes
4    Flowers      364     117           no
5    Menzies       14      24          yes
6    Menzies       43      24           no
7     Fallis        6      18          yes
8     Fallis       32      22           no
9    Cuadros       12      35          yes
10   Cuadros      999     725           no
11 Landesman      138     175          yes
12 Landesman     1232    1161           no
13     Krans       15      20          yes
14     Krans      491     504           no
15   Tervila        6       2          yes
16   Tervila      102     101           no
17  Campbell       65      40          yes
18  Campbell       88      62           no
> ## Gibbs sampler for inference in the fully Bayesian model:
> niter <- 1e+5
> s <- matrix(nrow = 2 + n, ncol = niter)
> rownames(s) <- c("nu", "tau2", paste("psi", 1:n, sep = ""))
> psiIndices <- 3:(n + 2)
> ## set initial values (other values in the domains are also possible)
> s[, 1] <- c(nu = mean(logOddsRatios),
              tau2 = var(logOddsRatios), logOddsRatios)
> set.seed(59)
> ## iteratively update the values
> for(j in 2:niter){
      ## nu first
      nuPrecision <- n / s["tau2",j-1] + 1 / 10
      psiSum <- sum(s[psiIndices,j-1])
      nuMean <- (psiSum / s["tau2",j-1]) / nuPrecision
      s["nu",j] <- rnorm(1, mean = nuMean, sd = 1 / sqrt(nuPrecision))

      ## then tau^2
      sumPsiNuSquared <- sum( (s[psiIndices,j-1] - s["nu",j])^2 )
      tau2a <- (n + 2) / 2
      tau2b <- (sumPsiNuSquared + 2) / 2
      s["tau2",j] <- 1 / rgamma(1, shape = tau2a, rate = tau2b)

      ## finally psi1, ..., psin
      for(i in 1:n){
          psiiPrecision <- 1 / variances[i] + 1 / s["tau2",j]
          psiiMean <- (logOddsRatios[i] / variances[i] + s["nu",j] /
                       s["tau2",j]) / psiiPrecision
          s[psiIndices[i],j] <- rnorm(1, mean = psiiMean,
                                      sd = 1 / sqrt(psiiPrecision))
      }
  }
[Figure: trace plots of ν, τ² and ψ1 over the iterations.]
The generated Markov chain seems to converge quickly so that a burn-in of 1000 iterations seems sufficient.
c) For the data given in Table 1.1, compute 95 % credible intervals for ψ1, …, ψn and ν. Produce a plot similar to Figure 6.15 and compare with the results from the empirical Bayes estimation.
◮ We use the function hpd written in Exercise 3a) to calculate Monte Carlo estimates of 95 % HPD intervals based on samples from the posterior distributions and produce a plot similar to Figure 6.15 as follows:
> ## remove burn-in
> s <- s[, -(1:1000)]
> ## estimate the 95 % HPD credible intervals and the posterior expectations
> (mcmcHpds <- apply(s, 1, hpd))
              nu      tau2       psi1       psi2       psi3
lower -1.1110064 0.1420143 -0.4336897 -1.8715310 -2.0651776
upper  0.1010381 1.5214695  0.5484652 -0.6248588 -0.2488816
            psi4       psi5       psi6       psi7
lower -1.4753737 -0.9334114 -0.5345147 -1.7058202
upper -0.2362761  0.3118408 -0.0660448 -0.2232895
            psi8       psi9
lower -0.9592916 -0.8061968
upper  1.4732138  0.6063260
> (mcmcExpectations <- rowMeans(s))
> ## panel function drawing point estimates and interval whiskers
> ## (function signature reconstructed from the dotplot call below)
> panel.ci <- function(x, y, lx, ux, subscripts, ...)
  {
      lx <- as.numeric(lx[subscripts])
      ux <- as.numeric(ux[subscripts])
      # normal dotplot for point estimates
      panel.dotplot(x, y, lty = 2, ...)
      # draw intervals
      panel.arrows(lx, y, ux, y,
                   length = 0.1, unit = "native",
                   angle = 90,   # deviation from line
                   code = 3,     # left and right whisker
                   ...)
      # reference line
      panel.abline(v = 0, lty = 2)
  }
> ## labels:
> studyNames <- c(names(groups), "mean effect size")
> studyNames <- ordered(studyNames, levels = rev(studyNames))
> # levels important for order!
> indices <- c(psiIndices, nu = 1)
> ## collect the data in a dataframe
> ciData <- data.frame(low = mcmcHpds["lower", indices],
                       up = mcmcHpds["upper", indices],
                       mid = mcmcExpectations[indices],
                       names = studyNames)
> ciData[["signif"]] <- with(ciData,
                             up < 0 | low > 0)
> ciData[["color"]] <- with(ciData,
                            ifelse(signif, "black", "gray"))
> randomEffectsCiPlot <- with(ciData,
                              dotplot(names ~ mid,
                                      panel = panel.ci,
                                      lx = low, ux = up,
                                      pch = 19, col = color,
                                      xlim = c(-1.5, 1.5),
                                      xlab = "Log odds ratio"))

Compared to the results of the empirical Bayes analysis in Example 6.31, the credible
intervals are wider in this fully Bayesian analysis. The point estimate of the mean
effect ν is E(ν | D) = −0.508 here, which is similar to the result ν̂ML = −0.52 obtained
with empirical Bayes. However, the credible interval for ν includes zero here, which is
not the case for the empirical Bayes result. In addition, shrinkage of the Bayesian point
estimates for the single studies towards the mean effect ν is less pronounced here, see
for example the Tervila study.

5. Let Xi, i = 1, . . . , n, denote a random sample from a Po(λ) distribution with gamma
   prior λ ∼ G(α, β) for the mean λ.

   a) Derive closed forms of E(λ | x1:n) and Var(λ | x1:n) by computing the posterior
      distribution of λ | x1:n.
      ◮ We have

          f(λ | x1:n) ∝ f(x1:n | λ) f(λ)
                      ∝ { ∏_{i=1}^n λ^{x_i} exp(−λ) } λ^{α−1} exp(−βλ)
                      = λ^{α + Σ_{i=1}^n x_i − 1} exp{−(β + n)λ},

      that is, λ | x1:n ∼ G(α + nx̄, β + n). Consequently,

          E(λ | x1:n) = (α + nx̄)/(β + n)   and   Var(λ | x1:n) = (α + nx̄)/(β + n)².
   b) Approximate E(λ | x1:n) and Var(λ | x1:n) by exploiting the asymptotic normality of
      the posterior (cf. Section 6.6.2).
      ◮ We use the following result from Section 6.6.2:

          λ | x1:n  ᵃ∼  N(λ̂n, I(λ̂n)⁻¹),

      where λ̂n denotes the MLE and I(λ̂n)⁻¹ the inverse observed Fisher information. We now
      determine these two quantities for the Poisson likelihood: the log-likelihood is

          l(x1:n | λ) = Σ_{i=1}^n x_i log(λ) − nλ,

      which yields the MLE λ̂n = (Σ_{i=1}^n x_i)/n = x̄. We further have

          I(λ) = − d²l(x1:n | λ)/dλ² = (Σ_{i=1}^n x_i)/λ² = nx̄/λ²

      and thus

          I(λ̂n)⁻¹ = x̄²/(nx̄) = x̄/n.

      Consequently,

          E(λ | x1:n) ≈ λ̂n = x̄   and   Var(λ | x1:n) ≈ I(λ̂n)⁻¹ = x̄/n.

   c) Consider now the log mean θ = log(λ). Use the change of variables formula (A.11) to
      compute the posterior density f(θ | x1:n).
      ◮ From part (a), we know that the posterior density of λ is

          f(λ | x1:n) = {(β + n)^{α+nx̄} / Γ(α + nx̄)} λ^{α+nx̄−1} exp{−(β + n)λ}.

      Applying the change of variables formula with transformation function g(y) = log(y)
      gives

          f(θ | x1:n) = {(β + n)^{α+nx̄} / Γ(α + nx̄)} exp(θ)^{α+nx̄−1} exp{−(β + n) exp(θ)} exp(θ)
                      = {(β + n)^{α+nx̄} / Γ(α + nx̄)} exp(θ)^{α+nx̄} exp{−(β + n) exp(θ)}.

   d) Let α = 1, β = 1 and assume that x̄ = 9.9 has been obtained for n = 10 observations
      from the model. Compute approximate values of E(θ | x1:n) and Var(θ | x1:n) via:
      i. the asymptotic normality of the posterior,
      ii. numerical integration (cf. Appendix C.2.1),
      iii. and Monte Carlo integration.
      ◮
      i. Analogously to part (b), we have

             θ | x1:n  ᵃ∼  N(θ̂n, I(θ̂n)⁻¹),

         where θ̂n denotes the MLE and I(θ̂n)⁻¹ the inverse observed Fisher information. We
         exploit the invariance of the MLE with respect to one-to-one transformations to
         obtain

             E(θ | x1:n) ≈ θ̂n = log(λ̂n) = log(x̄) = 2.2925.

         To transform the observed Fisher information obtained in part (b), we apply
         Result 2.1:

             Var(θ | x1:n) ≈ I(θ̂n)⁻¹ = {d exp(θ̂n)/dθ}⁻² I(λ̂n)⁻¹ = 1/(nx̄) = 0.0101.

      ii. For the numerical integration, we work with the density of λ = exp(θ) instead of
          θ to avoid numerical problems. We thus compute the posterior expectation and
          variance of log(λ). We use the R function integrate to compute the integrals
          numerically:

> ## given data
> alpha <- beta <- 1 ## parameters of the prior distribution
> n <- 10            ## number of observed values
> xbar <- 9.9        ## mean of observed values
> ##
> ## function for numerical computation of
> ## posterior expectation and variance of theta=log(lambda)
> numInt <- function(alpha, beta, n, xbar)
  {
      ## posterior density of lambda
      lambdaDens <- function(lambda, alpha, beta, n, xbar, log=FALSE)
      {
          ## parameters of the posterior gamma density
          alphapost <- alpha + n*xbar
          betapost <- beta + n
          logRet <- dgamma(lambda, alphapost, betapost, log=TRUE)
          if(log)
              return(logRet)
          else
              return(exp(logRet))
      }
      # integrand for computation of posterior expectation
      integrand.mean <- function(lambda)
      {
          log(lambda) *
              lambdaDens(lambda, alpha, beta, n, xbar)
      }
      # numerical integration to get posterior expectation
      res.mean <- integrate(integrand.mean, lower = 0, upper = Inf,
                            stop.on.error = FALSE,
                            rel.tol = sqrt(.Machine$double.eps))
      if(res.mean$message == "OK")
          mean <- res.mean$value
      else
          mean <- NA

      # numerical computation of variance
      integrand.square <- function(lambda)
      {
          (log(lambda))^2 *
              lambdaDens(lambda, alpha, beta, n, xbar)
      }
      res.square <- integrate(integrand.square, lower = 0, upper = Inf,
                              stop.on.error = FALSE,
                              rel.tol = sqrt(.Machine$double.eps))
      if(res.square$message == "OK")
          var <- res.square$value - mean^2
      else
          var <- NA

      return(c(mean=mean, var=var))
  }
> # numerical approximation for posterior mean and variance of theta
> numInt(alpha, beta, n, xbar)
      mean        var
2.20226658 0.01005017

      iii. To obtain a random sample from the distribution of θ = log(λ), we first generate
           a random sample from the distribution of λ, which is a gamma distribution, and
           then transform this sample:

> ## Monte Carlo integration
> M <- 10000
> ## parameters of posterior distribution
> alphapost <- alpha + n*xbar
> betapost <- beta + n
> ## sample for lambda
> lambdaSam <- rgamma(M, alphapost, betapost)
> ## sample for theta
> thetaSam <- log(lambdaSam)
> # conditional expectation of theta
> (Etheta <- mean(thetaSam))
[1] 2.201973
> ## Monte Carlo standard error
> (se.Etheta <- sqrt(var(thetaSam)/M))
[1] 0.001001739
> EthetaSq <- mean(thetaSam^2)
> # conditional variance of theta
> (VarTheta <- EthetaSq - Etheta^2)
[1] 0.0100338

The estimates obtained by numerical integration and Monte Carlo integration are similar.
The estimate of the posterior mean obtained from asymptotic normality is larger than the
other two estimates. This difference is not surprising, as the sample size n = 10 is quite
small, so that an asymptotic approximation may be inaccurate.

6. Consider the genetic linkage model from Exercise 5 in Chapter 2. Here we assume a
   uniform prior on the proportion φ, i.e. φ ∼ U(0, 1). We would like to compute the
   posterior mean E(φ | x).

   a) Construct a rejection sampling algorithm to simulate from f(φ | x) using the prior
      density as the proposal density.
      ◮

> ## define the log-likelihood function (up to multiplicative constants)
> ## Comment: on log-scale the numerical calculations are more robust
> loglik <- function(phi, x)
  {
      loglik <- x[1]*log(2+phi)+(x[2]+x[3])*log(1-phi)+x[4]*log(phi)
      return(loglik)
  }
> ## rejection sampler (M: number of samples, x: data vector)
> rej <- function(M, x)
  {
      post.mode <- optimize(loglik, x=x, lower = 0, upper = 1,
                            maximum = TRUE)

      ## determine constant a to be the ordinate at the mode,
      ## i.e. the maximum value of the log-likelihood
      (a <- post.mode$objective)

      ## empty vector of length M
      phi <- double(M)
      ## counter to get M samples
      N <- 1
      while(N <= M)
      {
          while (TRUE)
          {
              ## value from uniform distribution
              u <- runif(1)
              ## proposal for phi
              z <- runif(1)
              ## check for acceptance
              ## exit the loop after acceptance
              if (u <= exp(loglik(phi=z, x=x) - a))
                  break
          }
          ## save the proposed value
          phi[N] <- z
          ## go for the next one
          N <- N + 1
      }
      return(phi)
  }
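Because the proposals are drawn from the prior, the efficiency of rej() is governed by the
average acceptance probability exp{loglik(φ) − a}. It can be estimated directly by a small
Monte Carlo experiment (an illustration only, not part of the original solution; the data
vector is the one used in part b) below):

> ## rough Monte Carlo estimate of the acceptance rate of rej()
> x <- c(125, 18, 20, 34)
> a <- optimize(loglik, x = x, lower = 0, upper = 1, maximum = TRUE)$objective
> mean(exp(loglik(runif(10000), x = x) - a))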
   b) Estimate the posterior mean of φ by Monte Carlo integration using M = 10 000
      samples from f(φ | x). Calculate also the Monte Carlo standard error.
      ◮

> ## load the data
> x <- c(125, 18, 20, 34)
> ## draw values from the posterior
> set.seed(2012)
> M <- 10000
> phipost <- rej(M=M, x=x)
> ## posterior mean by Monte Carlo integration
> (Epost <- mean(phipost))
[1] 0.6227486
> ## compute the Monte Carlo standard error
> (se.Epost <- sqrt(var(phipost)/M))
[1] 0.0005057188
> ## check the posterior mean using numerical integration:
> numerator <- integrate(function(phi) phi * exp(loglik(phi, x)),
                         lower=0, upper=1)$val
> denominator <- integrate(function(phi) exp(loglik(phi, x)),
                           lower=0, upper=1)$val
> numerator/denominator
[1] 0.6228061
> ## draw histogram of sampled values
> hist(phipost, prob=TRUE, nclass=100, main=NULL)
> ## compare with density
> phi.grid <- seq(0, 1, length=1000)
> dpost <- function(phi) exp(loglik(phi, x)) / denominator
> lines(phi.grid, dpost(phi.grid), col=2)
> abline(v=Epost, col="red")

[Figure: histogram of the sampled values phipost with the posterior density (red curve)
and the estimated posterior mean (red vertical line).]

   c) In 6b) we obtained samples of the posterior distribution assuming a uniform prior on
      φ. Suppose we now assume a Be(0.5, 0.5) prior instead of the previous U(0, 1) =
      Be(1, 1). Use the importance sampling weights to estimate the posterior mean and
      Monte Carlo standard error under the new prior based on the old samples from 6b).
      ◮

> ## Importance sampling -- try a new prior, the Be(0.5, 0.5)
> ## posterior density ratio = prior density ratio!
> weights <- dbeta(phipost, .5, .5)/dbeta(phipost, 1, 1)
> (Epost2 <- sum(phipost*weights)/sum(weights))
[1] 0.6240971
> ## or simpler
> (Epost2 <- weighted.mean(phipost, w=weights))
[1] 0.6240971
> ## (note: the delta-method standard error of the weighted estimate would be
> ##  sqrt(sum(weights^2*(phipost - Epost2)^2))/sum(weights))
> (se.Epost2 <- 1/sum(weights)*sum((phipost - Epost2)^2*weights^2))
[1] 0.001721748
> hist(phipost, prob=TRUE, nclass=100, main=NULL)
> abline(v=Epost2, col="blue")

[Figure: histogram of phipost with the importance sampling estimate of the posterior mean
(blue vertical line).]
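A quick way to judge how much precision the reweighting costs (an optional diagnostic,
not part of the original solution) is Kish's effective sample size of the importance
weights; values close to M indicate that little information is lost:

> ## effective sample size of the importance weights
> sum(weights)^2 / sum(weights^2)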
7. As in Exercise 6, we consider the genetic linkage model from Exercise 5 in Chapter 2.
   Now, we would like to sample from the posterior distribution of φ using MCMC. Using the
   Metropolis-Hastings algorithm, an arbitrary proposal distribution can be used and the
   algorithm will always converge to the target distribution. However, the time until
   convergence and the degree of dependence between the samples depend on the chosen
   proposal distribution.

   a) To sample from the posterior distribution, construct an MCMC sampler based on the
      following normal independence proposal (cf. the approximation in Section 6.6.2):

          φ* ∼ N(Mod(φ | x), F² · C⁻¹),

      where Mod(φ | x) denotes the posterior mode, C the negative curvature of the
      log-posterior at the mode and F a factor to blow up the variance.
      ◮

> # define the log-likelihood function
> log.lik <- function(phi, x)
  {
      if((phi < 1) & (phi > 0)) {
          loglik <- x[1]*log(2+phi)+(x[2]+x[3])*log(1-phi)+x[4]*log(phi)
      }
      else {
          # if phi is not in the defined range return NA
          loglik <- NA
      }
      return(loglik)
  }
> # MCMC function with independence proposal
> # M: number of samples, x: data vector, factor: factor to blow up
> # the variance
> mcmc_indep <- function(M, x, factor)
  {
      # store samples here
      xsamples <- rep(NA, M)

      # Idea: normal independence proposal with mean equal to the posterior
      # mode and standard deviation equal to the standard error or to
      # a multiple of the standard error.
      mymean <- optimize(log.lik, x=x, lower = 0, upper = 1,
                         maximum = TRUE)$maximum

      # negative curvature of the log-posterior at the mode
      a <- -1*(-x[1]/(2+mymean)^2 - (x[2]+x[3])/(1-mymean)^2 - x[4]/mymean^2)
      mystd <- sqrt(1/a)

      ##################################################################
      # alternatively, optim could be used:
      # ml <- optim(0.5, log.lik, x=data, control=mycontrol, hessian=T,
      #             method="L-BFGS-B", lower=0+eps, upper=1-eps)
      # mymean <- ml$par
      # mystd <- sqrt(-1/ml$hessian)
      ##################################################################

      # count number of accepted and rejected values
      yes <- 0
      no <- 0

      # use mymean as initial starting value
      xsamples[1] <- mymean

      # Metropolis-Hastings iteration
      for(k in 2:M){
          # value of the past iteration
          old <- xsamples[k-1]

          # propose new value
          # the argument factor blows up the standard deviation
          proposal <- rnorm(1, mean=mymean, sd=mystd*factor)

          # compute acceptance ratio
          # under uniform prior: posterior ratio = likelihood ratio
          posterior.ratio <- exp(log.lik(proposal, x) - log.lik(old, x))
          if(is.na(posterior.ratio)){
              # happens when the proposal is not between 0 and 1
              # => acceptance probability will be 0
              posterior.ratio <- 0
          }
          proposal.ratio <- exp(dnorm(old, mymean, mystd*factor, log=T) -
                                dnorm(proposal, mymean, mystd*factor, log=T))

          # get the acceptance probability
          alpha <- posterior.ratio*proposal.ratio

          # accept-reject step
          if(runif(1) <= alpha){
              # accept the proposed value
              xsamples[k] <- proposal
              # increase counter of accepted values
              yes <- yes + 1
          }
          else{
              # stay with the old value
              xsamples[k] <- old
              no <- no + 1
          }
      }

      # acceptance rate
      cat("The acceptance rate is: ", round(yes/(yes+no)*100,2),
          "%\n", sep="")
      return(xsamples)
  }

   b) Construct an MCMC sampler based on the following random walk proposal:

          φ* ∼ U(φ^(m) − d, φ^(m) + d),

      where φ^(m) denotes the current state of the Markov chain and d is a constant.
      ◮

> # MCMC function with random walk proposal
> # M: number of samples, x: data vector, d: half-width of the uniform proposal
> mcmc_rw <- function(M, x, d){
      # setup (these lines are reconstructed analogously to mcmc_indep)
      xsamples <- rep(NA, M)
      yes <- 0
      no <- 0
      xsamples[1] <- 0.5     # starting value (any value in (0,1) works)

      # Metropolis-Hastings iteration
      for(k in 2:M){
          # value of the past iteration
          old <- xsamples[k-1]

          # propose new value from the uniform random walk (reconstructed)
          proposal <- runif(1, min=old-d, max=old+d)

          # compute acceptance ratio (reconstructed, as in mcmc_indep)
          posterior.ratio <- exp(log.lik(proposal, x) - log.lik(old, x))
          if(is.na(posterior.ratio)){
              posterior.ratio <- 0
          }
          # the uniform random walk proposal is symmetric
          proposal.ratio <- 1
          alpha <- posterior.ratio*proposal.ratio

          # accept-reject step
          if(runif(1) <= alpha){
              # accept the proposed value
              xsamples[k] <- proposal
              # increase counter of accepted values
              yes <- yes + 1
          }
          else{
              # stay with the old value
              xsamples[k] <- old
              no <- no + 1
          }
      }

      # acceptance rate
      cat("The acceptance rate is: ", round(yes/(yes+no)*100,2),
          "%\n", sep="")
      return(xsamples)
  }
   c) Generate M = 10 000 samples from algorithm 7a), setting F = 1 and F = 10, and from
      algorithm 7b) with d = 0.1 and d = 0.2. To check the convergence of the Markov
      chains, inspect trace plots, autocorrelation plots and histograms of the samples.
      ◮

> # data vector (as in Exercise 6; definition assumed)
> data <- c(125, 18, 20, 34)
> # number of iterations
> M <- 10000
> # get samples using Metropolis-Hastings with independence proposal
> indepF1 <- mcmc_indep(M=M, x=data, factor=1)
The acceptance rate is: 96.42%
> indepF10 <- mcmc_indep(M=M, x=data, factor=10)
The acceptance rate is: 12.22%
> # get samples using Metropolis-Hastings with random walk proposal
> RW0.1 <- mcmc_rw(M=M, x=data, d=0.1)
The acceptance rate is: 63.27%
> RW0.2 <- mcmc_rw(M=M, x=data, d=0.2)
The acceptance rate is: 40.33%
> ## some plots
> par(mfrow=c(4,3))
> # independence proposal with F=1
> # traceplot
> plot(indepF1, type="l", xlim=c(2000,3000), xlab="Iteration")
> # autocorrelation plot
> acf(indepF1)
> # histogram
> hist(indepF1, nclass=100, xlim=c(0.4,0.8), prob=T, xlab=expression(phi),
       ylab=expression(hat(f)*(phi ~"|"~ x)), main="")
> # ylab=expression(phi^{(k)})
> # independence proposal with F=10
> plot(indepF10, type="l", xlim=c(2000,3000), xlab="Iteration")
> acf(indepF10)
> hist(indepF10, nclass=100, xlim=c(0.4,0.8), prob=T, xlab=expression(phi),
       ylab=expression(hat(f)*(phi ~"|"~ x)), main="")
> # random walk proposal with d=0.1
> plot(RW0.1, type="l", xlim=c(2000,3000), xlab="Iteration")
> acf(RW0.1)
> hist(RW0.1, nclass=100, xlim=c(0.4,0.8), prob=T, xlab=expression(phi),
       ylab=expression(hat(f)*(phi ~"|"~ x)), main="")
> # random walk proposal with d=0.2
> plot(RW0.2, type="l", xlim=c(2000,3000), xlab="Iteration")
> acf(RW0.2)
> hist(RW0.2, nclass=100, xlim=c(0.4,0.8), prob=T, xlab=expression(phi),
       ylab=expression(hat(f)*(phi ~"|"~ x)), main="")

[Figure: 4 x 3 panel of trace plots (iterations 2000-3000), autocorrelation plots and
histograms of the chains indepF1, indepF10, RW0.1 and RW0.2.]

All four Markov chains converge quickly after a few hundred iterations. The independence
proposal with the original variance (F = 1) performs best: it produces essentially
uncorrelated samples and has a high acceptance rate. In contrast, the independence
proposal with blown-up variance (F = 10) performs worst. It has a low acceptance rate,
so the Markov chain often gets stuck at the same value for several iterations, which
leads to correlated samples. Regarding the random walk proposals, the one with the wider
proposal distribution (d = 0.2) performs better, since it yields less correlated samples
and has a preferable acceptance rate. (For random walk proposals, acceptance rates between
30% and 50% are recommended.)
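The visual impression can be complemented by the effective sample size of each chain (an
optional check, not part of the original solution; it assumes the coda package is
installed):

> ## effective sample sizes of the four chains after discarding a burn-in
> library(coda)
> sapply(list(indepF1 = indepF1, indepF10 = indepF10,
              RW0.1 = RW0.1, RW0.2 = RW0.2),
         function(chain) effectiveSize(mcmc(chain[-(1:1000)])))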
8. Cole et al. (2012) describe a rejection sampling approach to sample from a posterior
   distribution as a simple and efficient alternative to MCMC. They summarise their
   approach as:
   I. Define model with likelihood function L(θ; y) and prior f(θ).
   II. Obtain the maximum likelihood estimate θ̂ML.
   III. To obtain a sample from the posterior:
        i. Draw θ* from the prior distribution (note: this must cover the range of the
           posterior).
        ii. Compute the ratio p = L(θ*; y)/L(θ̂ML; y).
        iii. Draw u from U(0, 1).
        iv. If u ≤ p, then accept θ*. Otherwise reject θ* and repeat.

   a) Using Bayes' rule, write out the posterior density f(θ | y). In the notation of
      Section 8.3.3, what are the functions fX(θ), fZ(θ) and L(θ; y) in the Bayesian
      formulation?
      ◮ By Bayes' rule, we have

          f(θ | y) = f(y | θ) f(θ) / f(y),

      where f(y) = ∫ f(y | θ) f(θ) dθ is the marginal likelihood. The posterior density is
      the target, so f(θ | y) = fX(θ), and the prior density is the proposal, so that
      f(θ) = fZ(θ). As usual, the likelihood is f(y | θ) = L(θ; y). Thus, we can rewrite
      the above equation as

          fX(θ) = fZ(θ) · L(θ; y)/c′                                              (8.2)

      with constant c′ = f(y).

   b) Show that the acceptance probability fX(θ*)/{a fZ(θ*)} is equal to
      L(θ*; y)/L(θ̂ML; y). What is a?
      ◮ Let U denote a random variable with U ∼ U(0, 1). Then, the acceptance probability
      is

          Pr(U ≤ p) = p = L(θ*; y)/L(θ̂ML; y).
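Combining this with (8.2) suggests how the constant a can be identified (a short sketch of
the remaining step, using only the notation introduced above):

    fX(θ*)/{a fZ(θ*)} = L(θ*; y)/(a c′),

which agrees with Pr(U ≤ p) = L(θ*; y)/L(θ̂ML; y) precisely when

    a = L(θ̂ML; y)/c′ = L(θ̂ML; y)/f(y).

Since L(θ; y) ≤ L(θ̂ML; y) for all θ, this choice of a also guarantees fX(θ) ≤ a fZ(θ), as
required for rejection sampling.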
… of φ. We therefore know that the Be(0.5, 0.5) distribution, which has range [0, 1],
covers the range of the posterior distribution, so that the condition in Cole et al.'s
rejection sampling algorithm is satisfied.

> # data
> x <- c(125, 18, 20, 34)
> n <- sum(x)
> ## define the log-likelihood function (up to multiplicative constants)
> loglik <- function(phi, x)
  {
      loglik <- x[1]*log(2+phi)+(x[2]+x[3])*log(1-phi)+x[4]*log(phi)
      return(loglik)
  }
> ## rejection sampler (M: number of samples, x: data vector);
> ## approach by Cole et al.
> rejCole <- function(M, x)
  {
      # determine the MLE for phi
      mle <- optimize(loglik, x=x, lower = 0, upper = 1,
                      maximum = TRUE)$maximum
      ## (the accept-reject loop proceeds as in the rej() function of Exercise 6,
      ##  with proposals z drawn from the U(0,1) prior and acceptance probability
      ##  exp(loglik(z, x) - loglik(mle, x)))
  }

[Figure: histogram of the sampled values phipost.]
    f(y | π) ∝ π^{5ȳ} (1 − π)^{5n̄−5ȳ},

where ȳ = (1/5) Σ_{i=1}^5 y_i is the mean number of successful treatments per physician
and n̄ = (1/5) Σ_{i=1}^5 n_i the mean number of patients treated per study.

   b) Specify a conjugate prior distribution f(π) for π and choose appropriate values for
      its parameters. Using these parameters derive the posterior distribution f(π | n, y).
      ◮ It is easy to see that the beta distribution Be(α, β) with kernel

          f(π) ∝ π^{α−1} (1 − π)^{β−1}

      is conjugate with respect to the above likelihood (or see Example 6.7). We choose the
      non-informative Jeffreys' prior for π, i.e. we choose α = β = 1/2 (see Table 6.3).
      This gives the following posterior distribution for π:

          π | n, y ∼ Be(5ȳ + 1/2, 5n̄ − 5ȳ + 1/2).

   c) A sixth physician wants to participate in the study with n6 = 5 patients. Determine
      the posterior predictive distribution for y6 (the number of patients out of the five
      for which the medication will have a positive effect).
      ◮ The density of the posterior predictive distribution is

          f(y6 | n6, y, n) = ∫_0^1 f(y6 | π, n6) f(π | y, n) dπ
            = ∫_0^1 \binom{n6}{y6} π^{y6} (1 − π)^{n6−y6}
                · {1/B(5ȳ + 1/2, 5n̄ − 5ȳ + 1/2)} π^{5ȳ−1/2} (1 − π)^{5n̄−5ȳ−1/2} dπ
            = \binom{n6}{y6} B(5ȳ + 1/2, 5n̄ − 5ȳ + 1/2)^{−1}
                ∫_0^1 π^{5ȳ+y6−1/2} (1 − π)^{5n̄+n6−5ȳ−y6−1/2} dπ                 (9.1)
            = \binom{n6}{y6} B(5ȳ + y6 + 1/2, 5n̄ + n6 − 5ȳ − y6 + 1/2)
                / B(5ȳ + 1/2, 5n̄ − 5ȳ + 1/2),

      where in (9.1) we have used that the integrand is the kernel of a
      Be(5ȳ + y6 + 1/2, 5n̄ + n6 − 5ȳ − y6 + 1/2) density. The obtained density is the
      density of a beta-binomial distribution (see Table A.1), more precisely

          y6 | n6, y, n ∼ BeB(n6, 5ȳ + 1/2, 5n̄ − 5ȳ + 1/2).
      Addition: Based on this posterior predictive distribution, we now compute a point
      prediction and a prognostic interval for the given data:

> ## given observations
> n <- c(3, 2, 4, 4, 3)
> y <- c(2, 1, 4, 3, 3)
> ## parameters of the beta-binomial posterior predictive distribution
> ## (under Jeffreys' prior)
> alphaStar <- sum(y) + 0.5
> betaStar <- sum(n - y) + 0.5
> nNew <- 5 ## number of patients treated by the additional physician
> ## point prediction: expectation of the post. pred. distr.
> (expectation <- nNew * alphaStar / (alphaStar + betaStar))
[1] 3.970588
> ## compute cumulative distribution function to get a prediction interval
> library(VGAM, warn.conflicts = FALSE)
> rbind(0:5,
        pbetabinom.ab(0:5, size = nNew, alphaStar, betaStar))
            [,1]       [,2]       [,3]      [,4]      [,5] [,6]
[1,] 0.000000000 1.00000000 2.00000000 3.0000000 4.0000000    5
[2,] 0.001729392 0.01729392 0.08673568 0.2824352 0.6412176    1

      Thus, the 2.5% quantile is 2 and the 97.5% quantile is 5, so that the 95% prediction
      interval is [2, 5]. Clearly, this interval does not contain exactly 95% of the
      probability mass of the predictive distribution, since the distribution of Y6 is
      discrete. In fact, the predictive probability for Y6 to fall into [2, 5] is larger:

          1 − Pr(Y6 ≤ 1) = 0.9827.
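The same predictive probabilities can be double-checked without VGAM, directly from the
beta-binomial probability mass function (a small illustration; dbetabinom below is an
ad-hoc helper, not a function used in the original solution):

> ## beta-binomial pmf written out with the beta function
> dbetabinom <- function(k, size, a, b)
      choose(size, k) * beta(a + k, b + size - k) / beta(a, b)
> ## cumulative probabilities, matching pbetabinom.ab(0:5, nNew, alphaStar, betaStar)
> cumsum(dbetabinom(0:5, size = nNew, a = alphaStar, b = betaStar))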
   d) Calculate the likelihood prediction as well.
      ◮ The extended likelihood function is

          L(π, y6) = \binom{n6}{y6} π^{5ȳ+y6} (1 − π)^{5n̄+n6−5ȳ−y6}.

      If y6 had been observed, then the ML estimate of π would be

          π̂(y6) = (5ȳ + y6)/(5n̄ + n6),

      which yields the predictive likelihood

          Lp(y6) = L(π̂(y6), y6)
                 = \binom{n6}{y6} {(5ȳ + y6)/(5n̄ + n6)}^{5ȳ+y6}
                   {(5n̄ + n6 − 5ȳ − y6)/(5n̄ + n6)}^{5n̄+n6−5ȳ−y6}.

      The likelihood prediction

          fp(y6) = Lp(y6) / Σ_{y=0}^{n6} Lp(y)

      can now be calculated numerically:

> ## predictive likelihood
> predLik <- function(yNew, nNew)
  {
      sumY <- sum(y) + yNew
      sumN <- sum(n) + nNew
      pi <- sumY / sumN

      logRet <- lchoose(nNew, yNew) + sumY * log(pi) +
          (sumN - sumY) * log(1 - pi)
      return(exp(logRet))
  }
> ## calculate values of the discrete likelihood prediction:
> predictiveProb <- predLik(0:5, 5)
> (predictiveProb <- predictiveProb / sum(predictiveProb))
[1] 0.004754798 0.041534762 0.155883020 0.312701779
[5] 0.333881997 0.151243644
> ## distribution function
> cumsum(predictiveProb)
[1] 0.004754798 0.046289560 0.202172580 0.514874359
[5] 0.848756356 1.000000000

      The values of the discrete distribution function are similar to the ones obtained
      from the Bayes prediction. The 95% prediction interval is also [2, 5] here. The point
      estimate from the likelihood prediction turns out to be:

> sum((0:5) * predictiveProb)
[1] 3.383152

      This estimate is close to 3.9706 from the Bayes prediction.
2. Let X1:n be a random sample from a N(µ, σ²) distribution from which a further
   observation Y = Xn+1 is to be predicted. Both the expectation µ and the variance σ² are
   unknown.

   a) Start by determining the plug-in predictive distribution.
      ◮ Note that, in contrast to Example 9.2, the variance σ² is unknown here. By
      Example 5.3, the ML estimates are

          µ̂ML = x̄   and   σ̂²ML = (1/n) Σ_{i=1}^n (x_i − x̄)².                     (9.2)

      The plug-in predictive distribution is thus

          Y ∼ N(x̄, (1/n) Σ_{i=1}^n (x_i − x̄)²).

   b) Calculate the likelihood and the bootstrap predictive distributions.
      ◮ The extended likelihood is

          L(µ, σ², y) = f(y | µ, σ²) · L(µ, σ²)
            ∝ (σ²)^{−1/2} exp{−(y − µ)²/(2σ²)} · (σ²)^{−n/2} exp{−(1/(2σ²)) Σ_{i=1}^n (x_i − µ)²}
            = (σ²)^{−(n+1)/2} exp[−(1/(2σ²)){(y − µ)² + n(x̄ − µ)² + Σ_{i=1}^n (x_i − x̄)²}],

      and the ML estimates of the parameters based on the extended data set are

          µ̂(y) = (nx̄ + y)/(n + 1)   and
          σ̂²(y) = (1/(n + 1)) { Σ_{i=1}^n (x_i − µ̂(y))² + (y − µ̂(y))² }.

      This yields the predictive likelihood

          Lp(y) = L(µ̂(y), σ̂²(y), y),

      which can only be normalised numerically for a given data set to obtain the
      likelihood prediction f(y) = Lp(y)/∫ Lp(u) du.

      To determine the bootstrap predictive distribution, we need the distribution of the
      ML estimators in (9.2). In Example 3.5 and Example 3.8, respectively, we have seen
      that

          µ̂ML | µ, σ² ∼ N(µ, σ²/n)   and   Σ_{i=1}^n (X_i − X̄)²/σ² | σ² ∼ χ²(n − 1).

      In addition, the two above random variables are independent. Since
      χ²(d) = G(d/2, 1/2), we can deduce

          σ̂²ML | σ² = (1/n) Σ_{i=1}^n (X_i − X̄)² | σ² ∼ G((n − 1)/2, n/(2σ²))

      by using the fact that the second parameter of the gamma distribution is an inverse
      scale parameter (see Appendix A.5.2).

      The bootstrap predictive distribution of y given θ = (µ, σ²)ᵀ has density

          g(y; θ) = ∫_0^∞ ∫_{−∞}^∞ f(y | µ̂ML, σ̂²ML) f(µ̂ML | µ, σ²) f(σ̂²ML | µ, σ²) dµ̂ML dσ̂²ML
                  = ∫_0^∞ { ∫_{−∞}^∞ f(y | µ̂ML, σ̂²ML) f(µ̂ML | µ, σ²) dµ̂ML }
                          f(σ̂²ML | µ, σ²) dσ̂²ML.                                  (9.3)

      The inner integral in (9.3) corresponds to the marginal likelihood in the
      normal-normal model. From (7.18) we thus obtain

          ∫_{−∞}^∞ f(y | µ̂ML, σ̂²ML) f(µ̂ML | µ, σ²) dµ̂ML
            = (2πσ̂²ML)^{−1/2} { (n/σ²)/(1/σ̂²ML + n/σ²) }^{1/2}
              exp{ −(1/(2σ̂²ML)) · (n/σ²)/(1/σ̂²ML + n/σ²) · (y − µ)² },            (9.4)

      which is the density of a N(µ, σ̂²ML + σ²/n) distribution. Analytical computation of
      the outer integral in (9.3) is, however, difficult. To calculate g(y; θ), we can use
      Monte Carlo integration instead: we draw a large number of random numbers (σ̂²ML)^(i)
      from the G((n − 1)/2, n/(2σ̂²ML)) distribution, plug them into (9.4) and compute the
      mean for the desired values of y. Of course, this only works for a given data set
      x1, . . . , xn.
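A sketch of this Monte Carlo evaluation could look as follows (illustration only;
bootstrapPredictive is an ad-hoc name, and the plug-in of the ML estimates for (µ, σ²)
follows the description above):

> ## Monte Carlo evaluation of the bootstrap predictive density g(y; theta-hat)
> bootstrapPredictive <- function(y, x, B = 10000)
  {
      n <- length(x)
      sigma2hat <- mean((x - mean(x))^2)      # ML estimate of sigma^2
      ## draws from the sampling distribution of the ML variance estimator,
      ## with the ML estimates plugged in for (mu, sigma^2)
      sigma2star <- rgamma(B, shape = (n - 1) / 2, rate = n / (2 * sigma2hat))
      ## inner integral: N(y; mean(x), sigma2star + sigma2hat/n), averaged over draws
      sapply(y, function(yi)
          mean(dnorm(yi, mean = mean(x), sd = sqrt(sigma2star + sigma2hat / n))))
  }
> ## example usage with simulated data:
> ## x <- rnorm(20); yGrid <- seq(-4, 4, length = 101)
> ## plot(yGrid, bootstrapPredictive(yGrid, x), type = "l")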
   c) Derive the Bayesian predictive distribution under the assumption of the reference
      prior f(µ, σ²) ∝ σ⁻².
      ◮ As in Example 6.24, it is convenient to work with the precision κ = (σ²)⁻¹ and the
      corresponding reference prior f(µ, κ) ∝ κ⁻¹, which formally corresponds to the
      normal-gamma distribution NG(0, 0, −1/2, 0). By (6.26), the posterior …

3. Derive Equation (9.11).
   ◮ We proceed analogously as in Example 9.7. By Example 6.8, the posterior distribution
   of µ is

       µ | x1:n ∼ N(µ̄, σ²/(n + δσ²)), …

Let the observations be assigned to J groups and denoted by yji, i = 1, . . . , nj,
j = 1, . . . , J, with group means

    ȳj = (1/nj) Σ_{i=1}^{nj} yji

and representative prediction probabilities πj. In total, there are N = Σ_{j=1}^J nj
observed values with overall mean (or relative frequency) ȳ.

We now calculate the right-hand side of Murphy's decomposition (9.16) and use (9.6):

    ȳ(1 − ȳ) + SC − MR
      = (1/N) Σ_{j=1}^J Σ_{i=1}^{nj} (yji − ȳ)² + (1/N) Σ_{j=1}^J nj (ȳj − πj)²
        − (1/N) Σ_{j=1}^J nj (ȳj − ȳ)²
      = (1/N) { Σ_{j=1}^J Σ_{i=1}^{nj} (yji − ȳ)² + Σ_{j=1}^J nj (ȳj − πj)²
                − Σ_{j=1}^J nj (ȳj − ȳ)² }.

The aim is to obtain the mean Brier score

    BS = (1/N) Σ_{j=1}^J Σ_{i=1}^{nj} (yji − πj)²,

which we can isolate from the first term above since

    Σ_{i=1}^{nj} (yji − ȳ)² = Σ_{i=1}^{nj} (yji − πj)²
      + 2(πj − ȳ) Σ_{i=1}^{nj} (yji − πj) + nj (πj − ȳ)².

Consequently,

    ȳ(1 − ȳ) + SC − MR
      = BS + (1/N) Σ_{j=1}^J { 2(πj − ȳ) nj (ȳj − πj) + nj (πj − ȳ)²
                               + nj (ȳj − πj)² − nj (ȳj − ȳ)² },

and by expanding the quadratic terms on the right-hand side of the equation, we see that
they all cancel. This completes the proof of Murphy's decomposition.
5. Investigate if the scoring rule

       S(f(y), yo) = −f(yo)

   is proper for a binary observation Y.
   ◮ Let B(π0) denote the true distribution of the observation Yo and f the probability
   mass function of the predictive distribution Y ∼ B(π) as introduced in Definition 9.9.
   The expected score under the true distribution is then

       E[S(f(y), Yo)] = −E[f(Yo)] = −f(0)·(1 − π0) − f(1)·π0
                      = −(1 − π)(1 − π0) − π·π0
                      = (1 − 2π0)π + π0 − 1.

   As a function of π, the expected score is thus a line with slope 1 − 2π0. If this slope
   is positive or negative, respectively, then the score is minimised by π = 0 or π = 1,
   respectively (compare to the proof of Result 9.2 for the absolute score). Hence, the
   score is in general not minimised by π = π0, i.e. this scoring rule is not proper.

6. For a normally distributed prediction show that it is possible to write the CRPS as in
   (9.17) using the formula for the expectation of the folded normal distribution in
   Appendix A.5.2.
   ◮ The predictive distribution here is the normal distribution N(µ, σ²). Let Y1 and Y2
   be independent random variables with N(µ, σ²) distribution. From this, we deduce

       Y1 − yo ∼ N(µ − yo, σ²)   and   Y1 − Y2 ∼ N(0, 2σ²),

   where for the latter result we have used Var(Y1 + Y2) = Var(Y1) + Var(Y2) due to
   independence (see Appendix A.3.5). This implies (see Appendix A.5.2)

       |Y1 − yo| ∼ FN(µ − yo, σ²)   and   |Y1 − Y2| ∼ FN(0, 2σ²).

   The CRPS is therefore

       CRPS(f(y), yo) = E{|Y1 − yo|} − (1/2) E{|Y1 − Y2|}
         = 2σϕ((µ − yo)/σ) + (µ − yo){2Φ((µ − yo)/σ) − 1} − (1/2){2√2 σϕ(0) + 0}
         = 2σϕ((yo − µ)/σ) + (µ − yo)[2{1 − Φ((yo − µ)/σ)} − 1] − √2 σ/√(2π)
         = 2σϕ(ỹo) + (µ − yo){1 − 2Φ(ỹo)} − σ/√π
         = σ[ ỹo{2Φ(ỹo) − 1} + 2ϕ(ỹo) − 1/√π ],

   with ỹo = (yo − µ)/σ, which is (9.17).
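The closed form can be checked against a direct Monte Carlo evaluation of the definition
(an illustration only; crpsNormal and the parameter values below are chosen ad hoc):

> ## numerical check of the closed-form CRPS for a normal prediction
> crpsNormal <- function(yo, mu, sigma)
  {
      z <- (yo - mu) / sigma
      sigma * (z * (2 * pnorm(z) - 1) + 2 * dnorm(z) - 1 / sqrt(pi))
  }
> set.seed(1)
> mu <- 1; sigma <- 2; yo <- 0.3
> y1 <- rnorm(100000, mu, sigma)
> y2 <- rnorm(100000, mu, sigma)
> c(closedForm = crpsNormal(yo, mu, sigma),
    monteCarlo = mean(abs(y1 - yo)) - 0.5 * mean(abs(y1 - y2)))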
Bibliography
Bartlett M. S. (1937) Properties of sufficiency and statistical tests. Proceedings of the Royal
Society of London. Series A, Mathematical and Physical Sciences, 160(901):268–282.
Box G. E. P. (1980) Sampling and Bayes’ inference in scientific modelling and robustness (with
discussion). Journal of the Royal Statistical Society, Series A, 143:383–430.
Cole S. R., Chu H., Greenland S., Hamra G. and Richardson D. B. (2012) Bayesian posterior
distributions without Markov chains. American Journal of Epidemiology, 175(5):368–375.
Dempster A. P., Laird N. M. and Rubin D. B. (1977) Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society, Series B
(Methodological), 39(1):1–38.
Goodman S. N. (1999) Towards evidence-based medical statistics. 2.: The Bayes factor. Annals
of Internal Medicine, 130:1005–1013.
Rao C. R. (1973) Linear Statistical Inference and Its Applications. Wiley series in probability
and mathematical statistics. John Wiley & Sons, New York.
Sellke T., Bayarri M. J. and Berger J. O. (2001) Calibration of p values for testing precise null
hypotheses. The American Statistician, 55:62–71.
Index

A
arithmetic mean 14

B
beta distribution 195
beta-binomial distribution 196
binomial distribution 195
bootstrap predictive distribution 199
burn-in 177

C
case-control study
  matched 95
change-of-variables formula 170
convolution theorem 171

E
Emax model 111
examples
  analysis of survival times 80
  blood alcohol concentration 94, 145
  capture-recapture method 7
  prevention of preeclampsia 64, 174
exponential model 75

F
Fisher information 57

H
HPD interval 172

I
inverse gamma distribution 175

J
Jeffreys' prior 195

L
likelihood function
  extended 197, 198
Lindley's paradox 150

M
Mallow's Cp statistic 144
marginal likelihood 199
minimum Bayes factor 151
Monte Carlo estimate 172, 199

N
normal distribution
  folded 203
normal-gamma distribution 200
normal-normal model 199
numerical integration 163, 168, 171, 172

P
P-value 152
Pareto distribution 134
point prediction 196
power model 74
prediction interval 198
predictive likelihood 199
prior
  -data conflict 155
  criticism 155
prior distribution
  non-informative 195
prior predictive distribution see marginal likelihood
profile likelihood confidence interval 61
prognostic interval 196

R
regression model
  logistic 98
  normal 144
risk
  log relative 59
  relative 59

S
score equations 56
significance test 71

T
Taylor approximation 73