Chapter 2: Inference for a Normal Population
This chapter shows how to make inferences for the mean and variance of a normal
population using a conjugate prior distribution. First we need the multi-parameter version
of Bayes Theorem.
Suppose that now the probability (density) function we used to describe the data depends
on many parameters, that is, f (x|θ) where θ = (θ1 , θ2 , . . . , θp )T . After observing the
data, the likelihood function for θ is f (x|θ). Prior beliefs about θ are represented through
a probability (density) function π(θ). Therefore, using Bayes Theorem, the posterior
probability (density) function for θ is
$$\pi(\theta \,|\, x) = \frac{\pi(\theta)\, f(x \,|\, \theta)}{f(x)}$$

where

$$f(x) = \begin{cases} \displaystyle\int_\Theta \pi(\theta)\, f(x \,|\, \theta)\, d\theta & \text{if } \theta \text{ is continuous,} \\[2ex] \displaystyle\sum_\Theta \pi(\theta)\, f(x \,|\, \theta) & \text{if } \theta \text{ is discrete.} \end{cases}$$
Example 2.1
If $X$ has a generalised $t_a(b, c)$ distribution (see page 101) then show that $Y = (X - b)/\sqrt{c} \sim t_a \equiv t_a(0, 1)$.
Recall the general result: if X is a random variable with probability density function fX (x)
and g is a bijective (1–1) function then the random variable Y = g(X) has probability
density function
$$f_Y(y) = f_X\{g^{-1}(y)\} \left| \frac{d}{dy}\, g^{-1}(y) \right|. \tag{2.1}$$
Solution
Here we take $Y = g(X) = (X - b)/\sqrt{c}$, from which we obtain $X = g^{-1}(Y) = b + \sqrt{c}\, Y$. Therefore, using (2.1), we have

$$\begin{aligned}
f_Y(y) = f_X\{g^{-1}(y)\} \left| \frac{d}{dy}\, g^{-1}(y) \right|
&= f_X(b + \sqrt{c}\, y) \times \sqrt{c} \\
&= \frac{\Gamma\left(\frac{a+1}{2}\right)}{\sqrt{ac\pi}\, \Gamma\left(\frac{a}{2}\right)} \left(1 + \frac{y^2}{a}\right)^{-\frac{a+1}{2}} \times \sqrt{c}, \quad y \in \mathbb{R} \\
&= \frac{\Gamma\left(\frac{a+1}{2}\right)}{\sqrt{a\pi}\, \Gamma\left(\frac{a}{2}\right)} \left(1 + \frac{y^2}{a}\right)^{-\frac{a+1}{2}}, \quad y \in \mathbb{R}.
\end{aligned}$$

This is the $t_a$ density and so $Y = (X - b)/\sqrt{c} \sim t_a$.
Comment
Values for the density function fY (y ) and the distribution function FY (y ) can be obtained
by using the R functions dgt and pgt in the package nclbayes.
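Outside R, the same density and distribution function values can be reproduced with scipy's standard t distribution via the location–scale relationship just derived. The sketch below is illustrative only: the function names simply mirror dgt and pgt (whose argument order is taken from how they are used later in these notes) and are not part of nclbayes.

```python
import numpy as np
from scipy import stats

def dgt(x, a, b, c):
    """Density of the generalised t_a(b, c) distribution.

    Uses the location-scale result: if X ~ t_a(b, c) then
    (X - b)/sqrt(c) ~ t_a, so scipy's t with loc=b and
    scale=sqrt(c) matches the generalised density.
    """
    return stats.t.pdf(x, df=a, loc=b, scale=np.sqrt(c))

def pgt(x, a, b, c):
    """Distribution function of the generalised t_a(b, c) distribution."""
    return stats.t.cdf(x, df=a, loc=b, scale=np.sqrt(c))

# The generalised distribution reduces to the standard t_a when b = 0, c = 1
assert np.isclose(dgt(0.7, 5, 0, 1), stats.t.pdf(0.7, df=5))
# By symmetry, half the probability lies below the location parameter b
assert np.isclose(pgt(5.41, 5, 5.41, 0.16), 0.5)
```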
It is clear that ta (0, 1) ≡ ta by examining their densities. Therefore, it makes sense
to think of the ta distribution as the standard ta –distribution and make all calculations
for the generalised ta (b, c) distribution from this standard distribution. The relationship
between this standard and generalised version of the t-distribution is directly analogous
to that between the standard normal N(0, 1) distribution and its more general version:
the N(b, c) distribution. In both cases the relationship is one of location and scale:
$$Y \sim N(b, c) \implies \frac{Y - b}{\sqrt{c}} \sim N(0, 1), \qquad\qquad Y \sim t_a(b, c) \implies \frac{Y - b}{\sqrt{c}} \sim t_a.$$
2.2 Prior to Posterior Analysis
Suppose we have a random sample from a normal distribution in which both the mean µ
and the precision τ are unknown, that is, Xi |µ, τ ∼ N(µ, 1/τ ), i = 1, 2, . . . , n (indepen-
dent). We shall adopt a (joint) prior distribution for µ and τ for which
$$\mu \,|\, \tau \sim N\left(b, \frac{1}{c\tau}\right) \quad \text{and} \quad \tau \sim Ga(g, h)$$

for known values $b$, $c$, $g$ and $h$. This is the normal-gamma $NGa(b, c, g, h)$ distribution, with density function

$$\pi(\mu, \tau) \propto \tau^{g - \frac{1}{2}} \exp\left[ -\frac{\tau}{2} \left\{ c(\mu - b)^2 + 2h \right\} \right], \quad \mu \in \mathbb{R},\ \tau > 0. \tag{2.2}$$
Solution
Notice that this posterior density is of the same form as the prior density (2.2). Therefore,
we can conclude that the posterior distribution is
$$(\mu, \tau)^T \,|\, x \sim NGa(B, C, G, H).$$
Suppose (µ, τ )T ∼ NGa(b, c, g, h). From the definition of the NGa distribution we know
that $\tau \sim Ga(g, h)$. This also means that $\sigma = 1/\sqrt{\tau} \sim \text{Inv-Chi}(g, h)$; see page 101.
The (marginal) density for $\mu$ is, for $\mu \in \mathbb{R}$,

$$\pi(\mu) = \int_0^\infty \pi(\mu, \tau)\, d\tau \propto \int_0^\infty \tau^{g - \frac{1}{2}} \exp\left[ -\frac{\tau}{2} \left\{ c(\mu - b)^2 + 2h \right\} \right] d\tau.$$

Now, as the integral of a gamma density over its entire range is one, we have

$$\int_0^\infty \frac{b^a \theta^{a-1} e^{-b\theta}}{\Gamma(a)}\, d\theta = 1 \implies \int_0^\infty \theta^{a-1} e^{-b\theta}\, d\theta = \frac{\Gamma(a)}{b^a}.$$

Therefore, for $\mu \in \mathbb{R}$,

$$\begin{aligned}
\pi(\mu) &\propto \int_0^\infty \tau^{g + \frac{1}{2} - 1} \exp\left[ -\frac{\tau}{2} \left\{ c(\mu - b)^2 + 2h \right\} \right] d\tau \\
&\propto \frac{\Gamma\left(g + \frac{1}{2}\right)}{\left[ \left\{ c(\mu - b)^2 + 2h \right\}/2 \right]^{g + \frac{1}{2}}} \\
&\propto \left\{ c(\mu - b)^2 + 2h \right\}^{-g - \frac{1}{2}} \\
&\propto \left( 1 + \frac{c(\mu - b)^2}{2h} \right)^{-\frac{2g+1}{2}}.
\end{aligned}$$
Comparing this density with that of the generalised t–distribution (on page 101) gives
$$\mu \sim t_{2g}\left(b, \frac{h}{gc}\right). \tag{2.4}$$
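This marginal result can be checked numerically. The Python sketch below (an independent illustration; nothing here comes from the notes or nclbayes) simulates from the $NGa(b, c, g, h)$ distribution by drawing $\tau \sim Ga(g, h)$ and then $\mu \,|\, \tau \sim N\{b, 1/(c\tau)\}$, and compares the empirical distribution of $\mu$ with the claimed $t_{2g}\{b, h/(gc)\}$ marginal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
b, c, g, h = 5.41, 0.25, 2.5, 0.1   # the prior parameters used in Example 2.2

# Simulate from NGa(b, c, g, h): tau ~ Ga(g, h) (rate h, so scale 1/h),
# then mu | tau ~ N(b, 1/(c*tau))
n = 400_000
tau = rng.gamma(shape=g, scale=1/h, size=n)
mu = rng.normal(loc=b, scale=1/np.sqrt(c*tau))

# Marginally mu ~ t_{2g}{b, h/(gc)}: standardise and compare the
# empirical distribution with the standard t_{2g} cdf
z = (mu - b)/np.sqrt(h/(g*c))
for q in (-1.0, 0.0, 1.0):
    assert abs(np.mean(z <= q) - stats.t.cdf(q, df=2*g)) < 0.01
```

A Monte Carlo check like this is a useful sanity test whenever a marginal distribution has been derived by integrating out a parameter.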
In summary, the prior $(\mu, \tau)^T \sim NGa(b, c, g, h)$ has marginal distributions

• $\mu \sim t_{2g}\{b, h/(gc)\}$
• $\tau \sim Ga(g, h)$

Also $\sigma = 1/\sqrt{\tau} \sim \text{Inv-Chi}(g, h)$.
The posterior $(\mu, \tau)^T \,|\, x \sim NGa(B, C, G, H)$ has marginal distributions

• $\mu \,|\, x \sim t_{2G}\{B, H/(GC)\}$
• $\tau \,|\, x \sim Ga(G, H)$
The relationships between the prior and posterior variance of µ and mean and variance
of τ and of σ are rather more complex.
Example 2.2
Recall Example 1.4 on the earth’s density. Previously we assumed that the measurements
followed a N(µ, 0.22 ) distribution, that is, the standard deviation of the measurements
was known to be 0.2 g/cm3 . Now we consider the case where this standard deviation is
unknown and determine posterior distributions using the theory in section 2.2.
Before we can proceed, we must specify the parameters in the NGa(b, c, g, h) prior distri-
bution for (µ, τ ). In the previous analysis, we assumed that the population measurement
precision was τ = 1/0.22 = 25 and assumed a N(5.41, 0.42 ) prior distribution for the
population mean, that is, µ|τ = 25 ∼ N(5.41, 0.42 ).
Choice of b and c: the conditional prior distribution for µ is µ|τ ∼ N{b, 1/(cτ )} and so
matching the prior distributions for µ (when τ = 25) gives b = 5.41 and c = 0.25.
Choice of g and h: the marginal prior distribution for $\tau$ is $\tau \sim Ga(g, h)$. Previously, we assumed $\tau = 25$ (with $Var(\tau) = 0$) and so take this value as the prior mean: $E(\tau) = 25$. Suppose we also decide that $Var(\tau) = 250$. These two requirements give $g = 2.5$ and $h = 0.1$. Therefore, we will assume the prior distribution
$$(\mu, \tau)^T \sim NGa(5.41, 0.25, 2.5, 0.1).$$
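The moment-matching step for $g$ and $h$ can be written out explicitly: for $\tau \sim Ga(g, h)$ we have $E(\tau) = g/h$ and $Var(\tau) = g/h^2$, so $h = E(\tau)/Var(\tau)$ and $g = E(\tau)^2/Var(\tau)$. A minimal sketch (the function name is mine, not from the notes):

```python
def gamma_params(mean, var):
    """Match a Ga(g, h) distribution to a given mean and variance.

    For tau ~ Ga(g, h): E(tau) = g/h and Var(tau) = g/h^2,
    so h = mean/var and g = mean^2/var.
    """
    h = mean / var
    g = mean**2 / var
    return g, h

# The choices in Example 2.2: E(tau) = 25 and Var(tau) = 250
g, h = gamma_params(25, 250)
assert abs(g - 2.5) < 1e-12 and abs(h - 0.1) < 1e-12
```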
We have seen that if $(\mu, \tau)^T \sim NGa(b, c, g, h)$ then the marginal distribution of $\mu$ is $\mu \sim t_{2g}\{b, h/(gc)\}$. Therefore, with this choice of prior distribution, the marginal prior distribution for $\mu$ is

$$\mu \sim t_5(5.41, 0.16).$$
Figure 2.1 shows the close match between the new (marginal) prior distribution for µ and
that used previously.
Figure 2.1: Marginal prior density for µ: new version (solid) and previous version (dashed)
Determine the posterior distribution for (µ, τ )T . Also determine the marginal prior dis-
tribution for τ and for σ, and the marginal posterior distribution for each of µ, τ and σ.
Solution
We can combine the information in the NGa(5.41, 0.25, 2.5, 0.1) prior distribution
for $(\mu, \tau)^T$ with that in the data ($n = 23$, $\bar{x} = 5.4848$, $s = 0.1882$) using the results in Section 2.2.
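The prior-to-posterior update can be sketched numerically. Two assumptions to flag: the update formulas below are the standard conjugate results for the NGa prior (the notes' equation (2.3), which is not reproduced in this extract), and $s = 0.1882$ is treated as the divisor-$n$ standard deviation, since that choice reproduces the posterior marginal $t_{28}(5.484, 0.001561)$ quoted below.

```python
def nga_update(b, c, g, h, n, xbar, sumsq):
    """Conjugate update for a NGa(b, c, g, h) prior with normal data.

    sumsq is sum((x_i - xbar)^2).  These are the standard textbook
    formulas (equation (2.3) in the notes, not shown in this extract).
    """
    B = (c*b + n*xbar) / (c + n)
    C = c + n
    G = g + n/2
    H = h + sumsq/2 + c*n*(xbar - b)**2 / (2*(c + n))
    return B, C, G, H

# Example 2.2 data; sum of squares taken as n*s**2 (divisor-n assumption)
n, xbar, s = 23, 5.4848, 0.1882
B, C, G, H = nga_update(5.41, 0.25, 2.5, 0.1, n, xbar, n*s**2)

assert abs(B - 5.484) < 1e-3 and G == 14
assert abs(H/(G*C) - 0.001561) < 1e-5   # scale of the marginal t for mu
```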
Plots of the (marginal) prior and posterior distributions of µ, τ and σ are given in Fig-
ure 2.2. Note that the (marginal) prior and posterior distributions for σ can be determined
from that of τ . We can also examine the joint prior and posterior distributions for (µ, τ )T
via the contour plots of their densities to see if there is any change in the dependence
structure; see Figure 2.3. This figure is produced by using the R command NGacontour
in the nclbayes package as follows:
mu=seq(4.5,6.5,len=1000)
tau=seq(0,71,len=1000)
NGacontour(mu,tau,b,c,g,h,lty=3)
NGacontour(mu,tau,B,C,G,H,add=TRUE)
in which the variables b,c,g,h,B,C,G,H have already been set to their prior/posterior
values. A careful look at the values of the contour levels plotted shows that the highest
Figure 2.2: Prior (dashed) and posterior (solid) densities for µ, τ and σ
contour level plotted for the prior density is 0.024 and the lowest level for the posterior
density is 0.05. From this we can conclude that the posterior distribution is far more
concentrated than the prior distribution. Also the contours for the posterior distribution
are much more elliptical than those for the prior distribution. This indicates a change
in the dependence structure. However, the main changes shown by the figure are in the
mean and variability of µ and τ .
Wikipedia tells us that the actual mean density of the earth is 5.515 g/cm3 . We can
determine the (posterior) probability that the mean density is within 0.1 of this value as
follows. We already know that µ|x ∼ t28 (5.484, 0.001561) and so we can calculate
using pgt(5.615,28,5.484,0.001561)-pgt(5.415,28,5.484,0.001561).
Without the data, the only basis for determining the earth’s density is via the prior
distribution. Here the prior distribution is µ ∼ t5 (5.41, 0.16) and so the (prior) probability
that the mean density is within 0.1 of the (now known) true value can be calculated using pgt(5.615,5,5.41,0.16)-pgt(5.415,5,5.41,0.16).
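Both probabilities can be sketched in Python as well (a scipy analogue of the pgt calls above, not the nclbayes functions themselves):

```python
import numpy as np
from scipy import stats

def pgt(x, a, b, c):
    """cdf of the generalised t_a(b, c); (X - b)/sqrt(c) ~ t_a."""
    return stats.t.cdf(x, df=a, loc=b, scale=np.sqrt(c))

# Posterior probability that mu lies within 0.1 of 5.515,
# with mu|x ~ t_28(5.484, 0.001561)
post = pgt(5.615, 28, 5.484, 0.001561) - pgt(5.415, 28, 5.484, 0.001561)

# The corresponding prior probability, with mu ~ t_5(5.41, 0.16)
prior = pgt(5.615, 5, 5.41, 0.16) - pgt(5.415, 5, 5.41, 0.16)

# The data sharpen the inference considerably
assert prior < post
```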
Figure 2.3: Contour plot of the prior (dashed) and posterior (solid) densities for (µ, τ )T .
2.3 Confidence Intervals and Regions

Example 2.3
Determine the 100(1 − α)% highest density interval (HDI) for the population mean µ in
terms of quantiles of the standard t-distribution.
Solution
The marginal posterior distribution is $\mu \,|\, x \sim t_{2G}\{B, H/(GC)\}$. This is a symmetric distribution and so the HDI is an equi-tailed interval. Therefore the HDI $(\ell, u)$ for $\mu$ must satisfy

$$Pr(\mu < \ell \,|\, x) = \alpha/2 \quad \text{and} \quad Pr(\mu > u \,|\, x) = \alpha/2.$$

Also

$$\frac{\mu - B}{\sqrt{H/(GC)}} \sim t_{2G}$$
and so

$$Pr(\mu > u \,|\, x) = \alpha/2 \;\Rightarrow\; Pr\left( \frac{\mu - B}{\sqrt{H/(GC)}} > \frac{u - B}{\sqrt{H/(GC)}} \,\middle|\, x \right) = \alpha/2 \;\Rightarrow\; \frac{u - B}{\sqrt{H/(GC)}} = t_{2G;\alpha/2}$$
where $t_{2G;p}$ is the upper $p$ point of the $t_{2G}$ distribution. Therefore

$$u = B + t_{2G;\alpha/2} \sqrt{\frac{H}{GC}}.$$

Similar calculations give

$$\ell = B + t_{2G;1-\alpha/2} \sqrt{\frac{H}{GC}} = B - t_{2G;\alpha/2} \sqrt{\frac{H}{GC}}$$

since the $t$ distribution is symmetric about zero. Thus the $100(1-\alpha)\%$ HDI for $\mu$ is

$$\left( B - t_{2G;\alpha/2} \sqrt{\frac{H}{GC}},\; B + t_{2G;\alpha/2} \sqrt{\frac{H}{GC}} \right).$$
These intervals can be calculated easily using the R function qgt in the package nclbayes.
For example, the prior and posterior 95% HDIs for µ can be calculated using
c(qgt(0.025,2*g,b,h/(g*c)),qgt(0.975,2*g,b,h/(g*c)))
c(qgt(0.025,2*G,B,H/(G*C)),qgt(0.975,2*G,B,H/(G*C)))
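A Python analogue of these qgt calls (using scipy's t quantiles; the numerical parameters are those of Example 2.2) reproduces the intervals for µ reported in Table 2.1:

```python
import numpy as np
from scipy import stats

def qgt(p, a, b, c):
    """Quantile function of the generalised t_a(b, c) distribution,
    mirroring R's qgt: (X - b)/sqrt(c) ~ t_a."""
    return stats.t.ppf(p, df=a, loc=b, scale=np.sqrt(c))

# Prior: mu ~ t_5(5.41, 0.16); posterior: mu|x ~ t_28(5.484, 0.001561)
prior_hdi = (qgt(0.025, 5, 5.41, 0.16), qgt(0.975, 5, 5.41, 0.16))
post_hdi = (qgt(0.025, 28, 5.484, 0.001561), qgt(0.975, 28, 5.484, 0.001561))

# These reproduce the mu rows of Table 2.1
assert np.allclose(prior_hdi, (4.3818, 6.4382), atol=1e-3)
assert np.allclose(post_hdi, (5.4031, 5.5649), atol=1e-3)
```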
Determining a highest density interval (HDI) for the population precision τ or standard
deviation σ is more complicated as their posterior distributions are not symmetric. The
(marginal) posterior for τ is τ |x ∼ Ga(G, H) and the (marginal) posterior for σ is σ|x ∼
Inv-Chi(G, H). HDIs can be found by using the R functions hdiGamma and hdiInvchi
in the package nclbayes. More standard equi-tailed confidence intervals can be found
using the functions qgamma and qinvchi.
For example, the prior and posterior 95% HDIs for τ can be calculated using R com-
mands hdiGamma(0.95,g,h) and hdiGamma(0.95,G,H), and those for σ using com-
mands hdiInvchi(0.95,g,h) and hdiInvchi(0.95,G,H). The 95% equi-tailed confi-
dence intervals are calculated in a similar way to the HDIs for µ above. So for τ , the
prior and posterior intervals are calculated using
c(qgamma(0.025,g,h),qgamma(0.975,g,h))
c(qgamma(0.025,G,H),qgamma(0.975,G,H))
c(qinvchi(0.025,g,h),qinvchi(0.975,g,h))
c(qinvchi(0.025,G,H),qinvchi(0.975,G,H))
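The gamma-based intervals can likewise be checked with scipy. There is no direct scipy counterpart to qinvchi, so the sketch below converts gamma quantiles for $\tau$ into quantiles for $\sigma = 1/\sqrt{\tau}$; this conversion is my own, assumed from the Inv-Chi relationship stated earlier (the map is decreasing, so the endpoints swap).

```python
from scipy import stats

g, h = 2.5, 0.1   # prior parameters from Example 2.2

# Equi-tailed 95% interval for tau ~ Ga(g, h); scipy parameterises the
# gamma by shape and scale = 1/rate
tau_lo = stats.gamma.ppf(0.025, a=g, scale=1/h)
tau_hi = stats.gamma.ppf(0.975, a=g, scale=1/h)

# sigma = 1/sqrt(tau) ~ Inv-Chi(g, h); decreasing map, endpoints swap
sigma_lo, sigma_hi = 1/tau_hi**0.5, 1/tau_lo**0.5

# These match the prior equi-tailed rows of Table 2.1
assert abs(tau_lo - 4.1561) < 1e-2 and abs(tau_hi - 64.1625) < 1e-2
assert abs(sigma_lo - 0.1248) < 1e-3 and abs(sigma_hi - 0.4905) < 1e-3
```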
                          Prior                 Posterior
µ  HDI (= equi-tailed)    (4.3818, 6.4382)      (5.4031, 5.5649)
τ  HDI                    (1.4812, 55.9573)     (14.0193, 42.2530)
   equi-tailed            (4.1561, 64.1625)     (15.0674, 43.7625)
σ  HDI                    (0.1062, 0.4246)      (0.1466, 0.2505)
   equi-tailed            (0.1248, 0.4905)      (0.1512, 0.2576)
Table 2.1: Prior and posterior 95% intervals for the analysis in Example 2.2
The numerical values for the prior and posterior 95% intervals for the analysis in Exam-
ple 2.2 are given in Table 2.1. Notice that there is little difference between the posterior
HDI and equi-tailed intervals for τ and for σ, whereas the prior intervals are fairly differ-
ent. This is because the prior distributions are quite skewed but the posterior distributions
are fairly symmetric; see Figure 2.2.
In Bayesian inference it can also be useful to determine (joint) confidence regions for several parameters, in this case, for $(\mu, \tau)^T$. In general this is a difficult problem to solve mathematically, and that is true here.
Example 2.4
Solution
mu=seq(3.5,7.5,len=1000)
tau=seq(0,80,len=1000)
NGacontour(mu,tau,b,c,g,h,p=c(0.95,0.9,0.8),lty=3)
NGacontour(mu,tau,B,C,G,H,p=c(0.95,0.9,0.8),add=TRUE)
produces a plot containing the 95%, 90% and 80% prior and posterior confidence regions
for (µ, τ )T for the prior and posterior distributions in Example 2.2; see Figure 2.4. The
upper plot shows contours of both prior and posterior densities. The numbers within
the plot are the contour levels. The largest prior confidence region is the 95% region.
The next largest is the 90% prior confidence region and the smallest is the 80% prior
confidence region. The same ordering holds for the posterior confidence regions. The
posterior contours are so concentrated in the middle of the plot that there is no room to
put in the contour levels. However, these can be seen on the lower plot which also shows
the contours but focuses the parameter range to highlight the contours of the posterior
density. The values of the contours in this lower plot show that the posterior density is
much more peaked, that is, the posterior has a much reduced variability. The location
of the centre of the central contour for both the prior and posterior densities shows that
there has been little change in the mean/mode.
2.4 Predictive Distribution

Suppose we sample another value y randomly from the population. What values is it
likely to take? This is described by its predictive distribution. We can determine this
distribution by using the definition of the predictive density
$$f(y \,|\, x) = \iint f(y \,|\, \mu, \tau)\, \pi(\mu, \tau \,|\, x)\, d\mu\, d\tau$$
or by using Candidate's formula (as this is a conjugate analysis). However, for this model/prior, there is a more straightforward method to determine the predictive distribution.
As Y is a random value from the population, we have that Y |µ, τ ∼ N(µ, 1/τ ). We also
know that the posterior distribution is (µ, τ )T |x ∼ NGa(B, C, G, H). Therefore, we can
write
$$Y = \mu + \varepsilon,$$

where

$$\varepsilon \,|\, \tau \sim N(0, 1/\tau) \quad \text{and} \quad \mu \,|\, x, \tau \sim N\left(B, \frac{1}{C\tau}\right).$$

Hence, given $\tau$, $Y$ is the sum of two independent normal random quantities, and so

$$Y \,|\, x, \tau \sim N\left(B, \frac{1}{\tau} + \frac{1}{C\tau}\right) \equiv N\left(B, \frac{C+1}{C\tau}\right).$$
Figure 2.4: 95%, 90% and 80% prior (dashed) and posterior (solid) confidence regions
for (µ, τ )T
Thus, as $\tau \,|\, x \sim Ga(G, H)$,

$$(Y, \tau)^T \,|\, x \sim NGa\left(B, \frac{C}{C+1}, G, H\right)$$

and so, by the marginal result (2.4), $Y \,|\, x \sim t_{2G}\{B, H(C+1)/(GC)\}$.
We can determine 100(1 − α)% predictive intervals by noting that the predictive distri-
bution is symmetric about its mean and therefore the HDI is
$$\left( B - t_{2G;\alpha/2} \sqrt{\frac{H(C+1)}{GC}},\; B + t_{2G;\alpha/2} \sqrt{\frac{H(C+1)}{GC}} \right).$$
These predictive intervals can be calculated easily using the R function qgt. For example,
in Example 2.2, the prior and posterior predictive HDIs for a new value Y from the
population are (4.2604, 6.5596) and (5.0855, 5.8825) respectively, calculated using
c(qgt(0.025,2*g,b,h*(c+1)/(g*c)),qgt(0.975,2*g,b,h*(c+1)/(g*c)))
c(qgt(0.025,2*G,B,H*(C+1)/(G*C)),qgt(0.975,2*G,B,H*(C+1)/(G*C)))
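A scipy version of the posterior predictive calculation reproduces the quoted HDI. The posterior parameter values used here are the ones implied by Example 2.2 (rounded; they give the quoted marginal $t_{28}(5.484, 0.001561)$), which is an assumption since (2.3) is not shown in this extract.

```python
import numpy as np
from scipy import stats

def qgt(p, a, b, c):
    """Quantile function of the generalised t_a(b, c) distribution,
    mirroring R's qgt: (X - b)/sqrt(c) ~ t_a."""
    return stats.t.ppf(p, df=a, loc=b, scale=np.sqrt(c))

# Posterior parameters for Example 2.2 (rounded values)
B, C, G, H = 5.4844, 23.25, 14, 0.50801
scale = H*(C + 1)/(G*C)   # predictive scale H(C+1)/(GC)

pred_hdi = (qgt(0.025, 2*G, B, scale), qgt(0.975, 2*G, B, scale))
# Matches the quoted posterior predictive HDI (5.0855, 5.8825)
assert np.allclose(pred_hdi, (5.0855, 5.8825), atol=2e-3)
```

Note how much wider the predictive interval is than the posterior HDI for µ itself: the predictive scale includes the extra factor (C + 1), reflecting the sampling variability of the new observation.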
2.5 Summary
(ii) The posterior distribution is (µ, τ )T |x ∼ NGa(B, C, G, H) where the posterior pa-
rameters are given by (2.3).
(iii) The marginal prior distributions are $\mu \sim t_{2g}\{b, h/(gc)\}$, $\tau \sim Ga(g, h)$ and $\sigma = 1/\sqrt{\tau} \sim \text{Inv-Chi}(g, h)$.
(iv) The marginal posterior distributions are µ|x ∼ t2G {B, H/(GC)}, τ |x ∼ Ga(G, H),
σ|x ∼ Inv-Chi(G, H).
(v) Prior and posterior means and standard deviations for µ, τ and σ can be calculated
from the properties of the t, Gamma and Inv-Chi distributions.
(vi) Prior and posterior probabilities and densities for µ, τ and σ can be calculated using
the R functions pgt, dgt, pgamma, dgamma, pinvchi, dinvchi.
(vii) HDIs or equi-tailed CIs for µ, τ and σ can be calculated using qgt, hdiGamma,
hdiInvchi, qgamma, qinvchi.
(viii) Contour plots of the prior and posterior densities for (µ, τ )T can be plotted using
the NGacontour function.
(ix) Prior and posterior confidence regions for (µ, τ )T can be plotted using the NGacontour
function.
(x) The predictive distribution for a new observation Y from the population is Y |x ∼
t2G {B, H(C + 1)/(GC)} and its HDI can be calculated using the qgt function.
2.6 Why Do We Have So Many Different Distributions?
So far we have used many distributions, some you will have met before and some will be
new. After a while the variety and sheer number of different distributions can become
overwhelming. Why do we need so many distributions and why do we name so many of
them?
Statistics studies the random variation in experiments, samples and processes. The variety
of applications leads to their randomness being described by many different distributions.
In many applications, bespoke distributions will need to be formulated. However, some
distributions come up time and time again for modelling random variation in data and
for describing prior beliefs. It is helpful for us to be able to refer to these distributions –
and so we give each one a name – and also to be able to quote known results for these
distributions such as their mean and variance. In this chapter you have been introduced
to a generalisation of the t-distribution and the inverse chi distribution, and we have been
able to use results for their mean and variance to study prior and posterior distributions
and have been able to plot these distributions using functions in the R package nclbayes.
You will meet several other new distributions in the remainder of the module. You won’t
be surprised to hear that it is useful to have a working knowledge of each of these
distributions but perhaps not vital to remember all their properties listed in these notes.
To help in this regard, the exam paper will contain a list of all the distributions used in
the exam, together with their density (or probability function) and any useful results such
as their mean and variance (as needed for the exam); see the specimen exam paper at
the back of this booklet.
• determine the predictive distribution of another value from the population, and its
predictive interval
• determine the predictive distribution of the mean of another random sample from
the population
both in general and for a particular prior and data set. Also you should be able to:
• appreciate the benefit of naming distributions and for having lists of properties for
these distributions