Lecture Material 2.5 - Bayesian Estimation & Concepts
Bayesian Statistics
18.1 Bayesian Concepts
The classical methods of estimation that we have studied in this text are based
solely on information provided by the random sample. These methods essentially
interpret probabilities as relative frequencies. For example, in arriving at a 95%
confidence interval for μ, we interpret the statement

P(−1.96 < Z < 1.96) = 0.95,   where Z = (X̄ − μ)/(σ/√n),

to mean that 95% of the time in repeated experiments Z will fall between −1.96
and 1.96.
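To see this relative-frequency interpretation concretely, here is a minimal Python simulation sketch (our own, not from the text; the true mean, σ, sample size, and number of replications are arbitrary illustrative choices):

```python
import numpy as np

# Illustrative settings (arbitrary): true mean, known sigma, sample size
rng = np.random.default_rng(1)
mu, sigma, n, reps = 50.0, 10.0, 25, 100_000

covered = 0
for _ in range(reps):
    x = rng.normal(mu, sigma, size=n)
    half = 1.96 * sigma / np.sqrt(n)            # half-width of the 95% interval
    covered += (x.mean() - half < mu < x.mean() + half)

print(covered / reps)                            # close to 0.95
```

Roughly 95% of the simulated intervals cover μ, which is exactly what the probability statement above asserts about repeated experiments.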
Subjective Probability
Subjective probability is the foundation of Bayesian concepts. In Chapter 2, we
discussed two possible approaches to probability, namely the relative frequency and
the indifference approaches. The first determines a probability as a consequence
of repeated experiments. For instance, to estimate the free-throw percentage of a
basketball player, we can record the number of shots made and the total number
of attempts the player has taken; the probability that this player hits a free throw
can then be calculated as the ratio of these two numbers. On the other hand,
if we have no knowledge of any bias in a die, the probability that a 3 will appear
in the next throw will be 1/6. Such an approach to probability interpretation is
based on the indifference rule.
Conditional Perspective
Recall that in Chapters 9 through 17, all statistical inferences were based on the
fact that the parameters are unknown but fixed quantities, apart from those in
Section 9.14, in which the parameters were treated as variables and the maximum
likelihood estimates (MLEs) were calculated conditioning on the observed sample
data. In Bayesian statistics, not only are the parameters treated as variables as in
the MLE calculation, but they are also treated as random.
Because the observed data are the only experimental results for the practitioner,
statistical inference is based on the actual observed data from a given experiment.
Such a view is called a conditional perspective. Furthermore, in Bayesian concepts,
since the parameters are treated as random, a probability distribution can be
specified, generally by using the subjective probability for the parameter. Such a
distribution is called a prior distribution and it usually reflects the experimenter’s
prior belief about the parameter. In the Bayesian perspective, once an experiment
is conducted and data are observed, all knowledge about the parameter is contained
in the actual observed data and in the prior information.
Bayesian Applications
Although Bayes’ rule is credited to Thomas Bayes, Bayesian applications were
first introduced by the French scientist Pierre-Simon Laplace, who published a paper
on using Bayesian inference for unknown binomial proportions (for the binomial
distribution, see Section 5.2).
Since the introduction of Markov chain Monte Carlo (MCMC) computational
tools for Bayesian analysis in the early 1990s, Bayesian statistics has become
more and more popular in statistical modeling and data analysis. Meanwhile,
methodological developments using Bayesian concepts have progressed dramatically,
and they are applied in fields such as bioinformatics, biology, business, engineering,
environmental and ecological science, life science and health, medicine, and many
others.
18.2 Bayesian Inferences

Definition 18.1: The distribution of θ, given x, which is called the posterior distribution, is given by

π(θ|x) = f(x|θ)π(θ) / g(x),

where g(x) is the marginal distribution of x.
Example 18.1: Assume that the prior distribution for the proportion of defectives produced by a
machine is
p 0.1 0.2
π(p) 0.6 0.4
Denote by x the number of defectives among a random sample of size 2. Find the
posterior probability distribution of p, given that x is observed.
Solution: The random variable X follows a binomial distribution

f(x|p) = b(x; 2, p) = (2 choose x) p^x q^(2−x),   x = 0, 1, 2.
The marginal distribution of x can be calculated as

g(x) = f(x|0.1)π(0.1) + f(x|0.2)π(0.2)
     = (2 choose x) [(0.1)^x (0.9)^(2−x) (0.6) + (0.2)^x (0.8)^(2−x) (0.4)].
Hence, for x = 0, 1, 2, we obtain the marginal probabilities as
x 0 1 2
g(x) 0.742 0.236 0.022
The posterior probability of p = 0.1, given x, is

π(0.1|x) = f(x|0.1)π(0.1) / g(x)
         = (0.1)^x (0.9)^(2−x) (0.6) / [(0.1)^x (0.9)^(2−x) (0.6) + (0.2)^x (0.8)^(2−x) (0.4)],

and π(0.2|x) = 1 − π(0.1|x).
Suppose that x = 0 is observed. Then

π(0.1|0) = f(0|0.1)π(0.1) / g(0) = (0.1)^0 (0.9)^2 (0.6) / 0.742 = 0.6550,

and π(0.2|0) = 0.3450. If x = 1 is observed, π(0.1|1) = 0.4576 and π(0.2|1) = 0.5424.
Finally, π(0.1|2) = 0.2727 and π(0.2|2) = 0.7273.
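The calculations in Example 18.1 are easy to verify numerically. Below is a minimal Python sketch (our own, assuming SciPy is available) that reproduces the marginal table and the posterior probabilities:

```python
from scipy.stats import binom

p_values = [0.1, 0.2]       # support of the prior
prior    = [0.6, 0.4]       # prior probabilities pi(p)

for x in range(3):          # possible numbers of defectives in n = 2
    likes = [binom.pmf(x, 2, p) for p in p_values]          # f(x|p)
    g_x   = sum(l * pr for l, pr in zip(likes, prior))      # marginal g(x)
    post  = [l * pr / g_x for l, pr in zip(likes, prior)]   # pi(p|x)
    print(f"x={x}: g(x)={g_x:.3f}, posterior={[round(q, 4) for q in post]}")
```

The printed values match the tables above: g = (0.742, 0.236, 0.022) and, for example, π(0.1|0) = 0.6550.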
The prior distribution for Example 18.1 is discrete, although the natural range
of p is from 0 to 1. Consider the following example, where we have a prior
distribution covering the whole space for p.
Example 18.2: Suppose that the prior distribution of p is uniform (i.e., π(p) = 1, for 0 < p <
1). Use the same random variable X as in Example 18.1 to find the posterior
distribution of p.
Solution: As in Example 18.1, we have

f(x|p) = b(x; 2, p) = (2 choose x) p^x q^(2−x),   x = 0, 1, 2.
The marginal distribution of x can be calculated as

g(x) = ∫_0^1 f(x|p)π(p) dp = (2 choose x) ∫_0^1 p^x (1 − p)^(2−x) dp.
The integral above can be evaluated directly for each x, giving g(0) = 1/3,
g(1) = 1/3, and g(2) = 1/3. Therefore, the posterior distribution of p, given x, is

π(p|x) = (2 choose x) p^x (1 − p)^(2−x) / (1/3) = 3 (2 choose x) p^x (1 − p)^(2−x),   0 < p < 1.
The posterior distribution above is actually a beta distribution (see Section 6.8)
with parameters α = x + 1 and β = 3 − x. So, if x = 0 is observed, the posterior
distribution of p is a beta distribution with parameters (1, 3). The posterior mean is

μ = 1/(1 + 3) = 1/4,

and the posterior variance is

σ² = (1)(3) / [(1 + 3)²(1 + 3 + 1)] = 3/80.
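As a quick numerical check (our own sketch), SciPy's beta distribution returns the same posterior mean and variance for x = 0:

```python
from scipy.stats import beta

# Posterior for x = 0 is Beta(alpha = x + 1, beta = 3 - x) = Beta(1, 3)
posterior = beta(1, 3)
print(posterior.mean())   # 0.25   = 1/4
print(posterior.var())    # 0.0375 = 3/80
```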
π(θ|x) ∝ f (x|θ)π(θ),
where the symbol “∝” stands for is proportional to. In the calculation of the
posterior distribution above, we can leave the factors that do not depend on θ out
of the normalization constant, i.e., the marginal density g(x).
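The “proportional to” technique can also be seen numerically (our own sketch): drop all constants, normalize on a grid, and recover the exact posterior. Here we use the x = 1 posterior of Example 18.2, whose exact density at p = 0.5 is 6(0.5)(0.5) = 1.5:

```python
import numpy as np

p = np.linspace(0.0005, 0.9995, 1000)    # grid over (0, 1)
dp = p[1] - p[0]
unnorm = p * (1 - p)                     # f(x=1|p) * pi(p) with all constants dropped
post = unnorm / (unnorm.sum() * dp)      # normalize numerically
idx = np.argmin(np.abs(p - 0.5))
print(post[idx])                         # ~1.5, the exact Beta(2, 2) density at p = 0.5
```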
Example 18.3: Suppose that random variables X1 , . . . , Xn are independent and from a Poisson
distribution with mean λ. Assume that the prior distribution of λ is exponential
with mean 1. Find the posterior distribution of λ when x̄ = 3 with n = 10.
Solution: The density function of X = (X₁, . . . , Xₙ) is

f(x|λ) = ∏_{i=1}^n (e^(−λ) λ^(xᵢ) / xᵢ!) = e^(−nλ) λ^(Σ_{i=1}^n xᵢ) / ∏_{i=1}^n xᵢ!.

Since the prior distribution of λ is π(λ) = e^(−λ), λ > 0, the posterior distribution satisfies

π(λ|x) ∝ f(x|λ)π(λ) ∝ λ^(Σ_{i=1}^n xᵢ) e^(−(n+1)λ).
Referring to the gamma distribution in Section 6.6, we conclude that the posterior
distribution of λ follows a gamma distribution with parameters 1 + Σ_{i=1}^n xᵢ and
1/(n + 1). Hence, the posterior mean and variance of λ are

(Σ_{i=1}^n xᵢ + 1) / (n + 1)   and   (Σ_{i=1}^n xᵢ + 1) / (n + 1)²,

respectively. So, when x̄ = 3 with n = 10, we have Σ_{i=1}^{10} xᵢ = 30. Hence, the posterior
distribution of λ is a gamma distribution with parameters 31 and 1/11.
From Example 18.3 we observe that it is sometimes quite convenient to use
the “proportional to” technique in calculating the posterior distribution, especially
when the result can be recognized as one of the commonly used distributions
described in Chapters 5 and 6.
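A numerical check of Example 18.3 (our own sketch; note that the Exponential(1) prior is the Gamma(1, 1) special case):

```python
from scipy.stats import gamma

n, sum_x = 10, 30                     # n observations with xbar = 3
alpha_post = 1 + sum_x                # shape: prior shape 1 plus the Poisson counts
scale_post = 1 / (n + 1)              # scale: 1/(n + 1) for an Exponential(1) prior
posterior = gamma(a=alpha_post, scale=scale_post)
print(posterior.mean())               # 31/11 ≈ 2.818
print(posterior.var())                # 31/121 ≈ 0.256
```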
Example 18.4: Suppose that x = 1 is observed for Example 18.2. Find the posterior mean and
the posterior mode.
Solution: When x = 1, the posterior distribution of p can be expressed as

π(p|1) = 6p(1 − p),   0 < p < 1,

which is a beta distribution with parameters (2, 2). Hence, the posterior mean is
μ = 2/(2 + 2) = 1/2. To find the posterior mode, we need to obtain the value of p
that maximizes the posterior distribution. Taking the derivative of π(p|1) with
respect to p, we obtain 6 − 12p. Solving 6 − 12p = 0 yields p = 1/2. The second
derivative is −12, which implies that a maximum is attained and the posterior
mode is p = 1/2.
Bayesian methods of estimation concerning the mean μ of a normal population
are based on the following example.
Example 18.5: If x̄ is the mean of a random sample of size n from a normal population with
known variance σ², and the prior distribution of the population mean is a normal
distribution with known mean μ₀ and known variance σ₀², show that the
posterior distribution of the population mean is also a normal distribution, with
mean μ∗ and standard deviation σ∗ as given below.

Solution: The posterior distribution is proportional to the product of the likelihood and the
prior,

π(μ|x̄) ∝ f(x̄|μ)π(μ),

where x̄ has a normal distribution with mean μ and variance σ²/n from Section 8.5.
Completing the squares for μ yields the posterior distribution

π(μ|x̄) ∝ exp[−(1/2)((μ − μ∗)/σ∗)²],

where

μ∗ = (nx̄σ₀² + μ₀σ²) / (nσ₀² + σ²)   and   σ∗ = √(σ₀²σ² / (nσ₀² + σ²)).
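The update formulas for μ∗ and σ∗ are convenient to wrap in a small function. The sketch below is our own (not from the text) and is reused for Example 18.7:

```python
import math

def normal_posterior(xbar, n, sigma, mu0, sigma0):
    """Posterior mean and sd for a normal mean with known variance
    and a normal prior (the mu*, sigma* formulas of Example 18.5)."""
    denom = n * sigma0**2 + sigma**2
    mu_star = (n * xbar * sigma0**2 + mu0 * sigma**2) / denom
    sd_star = math.sqrt(sigma0**2 * sigma**2 / denom)
    return mu_star, sd_star

# Example 18.7's numbers: n = 25 bulbs, xbar = 780, sigma = 100, prior N(800, 10^2)
print(normal_posterior(780, 25, 100, 800, 10))   # (796.0, 8.944...) = (796, sqrt(80))
```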
Definition 18.2: The interval a < θ < b will be called a 100(1 − α)% Bayesian interval for θ if

∫_{−∞}^{a} π(θ|x) dθ = ∫_{b}^{∞} π(θ|x) dθ = α/2.
Example 18.6: Supposing that X ∼ b(x; n, p), with known n = 2, and the prior distribution of p
is uniform, π(p) = 1 for 0 < p < 1, find a 95% Bayesian interval for p when x = 0
is observed.

Solution: From Example 18.2, the posterior distribution of p given x = 0 is a beta
distribution with parameters (1, 3), that is, π(p|0) = 3(1 − p)², 0 < p < 1. Hence,
the endpoints a and b of the Bayesian interval satisfy

0.025 = ∫_0^a 3(1 − p)² dp = 1 − (1 − a)³

and

0.025 = ∫_b^1 3(1 − p)² dp = (1 − b)³.
The solutions to the above equations result in a = 0.0084 and b = 0.7076. Therefore,
the probability that p falls into (0.0084, 0.7076) is 95%.
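Equivalently (our own numerical check), a and b are the 2.5% and 97.5% quantiles of the Beta(1, 3) posterior:

```python
from scipy.stats import beta

posterior = beta(1, 3)                    # posterior of p when x = 0
a, b = posterior.ppf([0.025, 0.975])      # equal-tail 95% Bayesian interval
print(round(a, 4), round(b, 4))           # 0.0084 0.7076
```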
For the normal population and normal prior case described in Example 18.5,
the posterior mean μ∗ is the Bayes estimate of the population mean μ, and a
100(1−α)% Bayesian interval for μ can be constructed by computing the interval
μ∗ − zα/2 σ ∗ < μ < μ∗ + zα/2 σ ∗ ,
which is centered at the posterior mean and contains 100(1 − α)% of the posterior
probability.
Example 18.7: An electrical firm manufactures light bulbs that have a length of life that is
approximately normally distributed with a standard deviation of 100 hours. Prior
experience leads us to believe that μ is a value of a normal random variable with a
mean μ0 = 800 hours and a standard deviation σ0 = 10 hours. If a random sample
of 25 bulbs has an average life of 780 hours, find a 95% Bayesian interval for μ.
Solution: According to Example 18.5, the posterior distribution of the mean is also a normal
distribution with mean

μ∗ = [(25)(780)(10)² + (800)(100)²] / [(25)(10)² + (100)²] = 796

and standard deviation

σ∗ = √{ (10)²(100)² / [(25)(10)² + (100)²] } = √80.
The 95% Bayesian interval for μ is then given by

796 − 1.96√80 < μ < 796 + 1.96√80,

or

778.5 < μ < 813.5.
Hence, we are 95% sure that μ will be between 778.5 and 813.5.
On the other hand, ignoring the prior information about μ, we could proceed
as in Section 9.4 and construct the classical 95% confidence interval
780 − (1.96)(100/√25) < μ < 780 + (1.96)(100/√25),
or 740.8 < μ < 819.2, which is seen to be wider than the corresponding Bayesian
interval.
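Both intervals are reproduced by the short sketch below (our own code, continuing the function from Example 18.5); the Bayesian interval is narrower because the prior contributes information beyond the sample:

```python
import math

mu_star, sd_star = 796.0, math.sqrt(80)               # posterior from Example 18.7
z = 1.96
print(mu_star - z * sd_star, mu_star + z * sd_star)   # 778.5 .. 813.5 (Bayesian)

xbar, sigma, n = 780, 100, 25
half = z * sigma / math.sqrt(n)
print(xbar - half, xbar + half)                       # 740.8 .. 819.2 (classical)
```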
18.3 Bayes Estimates Using Decision Theory Framework
Squared-Error Loss
The squared-error loss function is defined as

L(θ, a) = (θ − a)²,

where θ is the parameter (or state of nature) and a is an action (or estimate).
A Bayes estimate minimizes the posterior expected loss, given the observed
sample data.
Theorem 18.1: The mean of the posterior distribution π(θ|x), denoted by θ∗, is the Bayes
estimate of θ under the squared-error loss function.
Example 18.8: Find the Bayes estimates of p, for all the values of x, for Example 18.1 when the
squared-error loss function is used.
Solution: When x = 0, p∗ = (0.1)(0.6550) + (0.2)(0.3450) = 0.1345.
When x = 1, p∗ = (0.1)(0.4576) + (0.2)(0.5424) = 0.1542.
When x = 2, p∗ = (0.1)(0.2727) + (0.2)(0.7273) = 0.1727.
Note that the classical estimate of p is p̂ = x/n = 0, 1/2, and 1, respectively,
for the x values 0, 1, and 2. These classical estimates are very different from
the corresponding Bayes estimates.
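Each Bayes estimate is simply the posterior mean over the two-point support {0.1, 0.2}; a few lines confirm the values (our own sketch, reusing the posterior probabilities computed in Example 18.1):

```python
# Posterior probabilities of p = 0.1 and p = 0.2 for each observed x
posteriors = {0: [0.6550, 0.3450], 1: [0.4576, 0.5424], 2: [0.2727, 0.7273]}
for x, (q1, q2) in posteriors.items():
    print(x, round(0.1 * q1 + 0.2 * q2, 4))   # posterior mean of p
# 0 0.1345, 1 0.1542, 2 0.1727
```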
Example 18.10: Suppose that the sampling distribution of a random variable, X, is Poisson with
parameter λ. Assume that the prior distribution of λ follows a gamma distribution
with parameters (α, β). Find the Bayes estimate of λ under the squared-error loss
function.
Solution: Arguing as in Example 18.3, we conclude that the posterior distribution of λ
follows a gamma distribution with parameters (x + α, (1 + 1/β)⁻¹). Using
Theorem 6.4, we obtain the posterior mean

λ̂ = (x + α) / (1 + 1/β).

Since the posterior mean is the Bayes estimate under the squared-error loss, λ̂ is
our Bayes estimate.
Absolute-Error Loss
The squared-error loss described above is similar to the least-squares concept we
discussed in connection with regression in Chapters 11 and 12. In this section, we
introduce another loss function, the absolute-error loss, defined as

L(θ, a) = |θ − a|.
Theorem 18.2: The median of the posterior distribution π(θ|x), denoted by θ∗, is the Bayes
estimate of θ under the absolute-error loss function.
Example 18.11: Under the absolute-error loss, find the Bayes estimator for Example 18.9 when
x = 1 is observed.
Solution: Again, the posterior distribution of p is a beta distribution with parameters
(x + 1, 3 − x). When x = 1, it is a beta distribution with density
π(p | x = 1) = 6p(1 − p) for 0 < p < 1 and 0 otherwise. The median of this
distribution is the value p∗ such that

1/2 = ∫_0^{p∗} 6p(1 − p) dp = 3p∗² − 2p∗³,

which yields p∗ = 1/2. Hence, the Bayes estimate in this case is 0.5.
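Numerically (our own check), the posterior median is the 50% quantile of the Beta(2, 2) posterior:

```python
from scipy.stats import beta

posterior = beta(2, 2)          # pi(p | x = 1) = 6p(1 - p)
print(posterior.ppf(0.5))       # 0.5, the Bayes estimate under absolute-error loss
```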
Exercises
18.1 Estimate the proportion of defectives being produced by the machine in
Example 18.1 if the random sample of size 2 yields 2 defectives.

18.2 Let us assume that the prior distribution for the proportion p of drinks from
a vending machine that overflow is

p      0.05   0.10   0.15
π(p)   0.3    0.5    0.2

If 2 of the next 9 drinks from this machine overflow, find
(a) the posterior distribution for the proportion p;
(b) the Bayes estimate of p.

18.3 Repeat Exercise 18.2 when 1 of the next 4 drinks overflows and the uniform
prior distribution is

π(p) = 10,   0.05 < p < 0.15.

18.4 Service calls come to a maintenance center according to a Poisson process
with λ calls per minute. A data set of 20 one-minute periods yields an average of
1.8 calls. If the prior for λ follows an exponential distribution with mean 2,
determine the posterior distribution of λ.

18.5 A previous study indicates that the percentage of chain smokers, p, who have
lung cancer follows a beta distribution (see Section 6.8) with mean 70% and
standard deviation 10%. Suppose a new data set collected shows that 81 out of
120 chain smokers have lung cancer.
(a) Determine the posterior distribution of the percentage of chain smokers who
have lung cancer by combining the new data and the prior information.
(b) What is the posterior probability that p is larger than 50%?

18.6 The developer of a new condominium complex claims that 3 out of 5 buyers
will prefer a two-bedroom unit, while his banker claims that it would be more
correct to say that 7 out of 10 buyers will prefer a two-bedroom unit. In previous
predictions of this type, the banker has been twice as reliable as the developer. If
12 of the next 15 condominiums sold in this complex are two-bedroom units, find
(a) the posterior probabilities associated with the claims of the developer and
banker;
(b) a point estimate of the proportion of buyers who prefer a two-bedroom unit.

18.7 The burn time for the first stage of a rocket is a normal random variable with
a standard deviation of 0.8 minute. Assume a normal prior distribution for μ with
a mean of 8 minutes and a standard deviation of 0.2 minute. If 10 of these rockets
are fired and the first stage has an average burn time of 9 minutes, find a 95%
Bayesian interval for μ.

18.8 The daily profit from a juice vending machine placed in an office building is
a value of a normal random variable with unknown mean μ and variance σ². Of
course, the mean will vary somewhat from building to building, and the
distributor feels that these average daily profits can best be described by a normal
distribution with mean μ₀ = $30.00 and standard deviation σ₀ = $1.75. If one of
these juice machines, placed in a certain building, showed an average daily profit
of x̄ = $24.90 during the first 30 days with a standard deviation of s = $2.10, find
(a) a Bayes estimate of the true average daily profit for this building;
(b) a 95% Bayesian interval of μ for this building;
(c) the probability that the average daily profit from the machine in this building
is between $24.00 and $26.00.

18.9 The mathematics department of a large university is designing a placement
test to be given to incoming freshman classes. Members of the department feel
that the average grade for this test will vary from one freshman class to another.
This variation of the average class grade is expressed subjectively by a normal
distribution with mean μ₀ = 72 and variance σ₀² = 5.76.
(a) What prior probability does the department assign to the actual average
grade being somewhere between 71.8 and 73.4 for next year's freshman class?
(b) If the test is tried on a random sample of 100 students from the next incoming
freshman class, resulting in an average grade of 70 with a variance of 64,
construct a 95% Bayesian interval for μ.
(c) What posterior probability should the department assign to the event of
part (a)?

18.10 Suppose that in Example 18.7 the electrical firm does not have enough
prior information regarding the population mean length of life to be able to
assume a normal distribution for μ. The firm believes, however, that μ is surely
between 770 and 830 hours, and it is thought that a more realistic Bayesian
approach would be to assume the prior distribution

π(μ) = 1/60,   770 < μ < 830.

If a random sample of 25 bulbs gives an average life of 780 hours, follow the steps
of the proof for Example 18.5 to find the posterior distribution

π(μ | x₁, x₂, . . . , x₂₅).

18.11 Suppose that the time to failure T of a certain hinge is an exponential
random variable with probability density

f(t) = θe^(−θt),   t > 0.

From prior experience we are led to believe that θ is a value of an exponential
random variable with probability density

π(θ) = 2e^(−2θ),   θ > 0.

If we have a sample of n observations on T, show that the posterior distribution
of Θ is a gamma distribution.