Lecture 11

Contents

1 Introduction
4 Conjugate Distributions
5 Conclusion
1 Introduction
The change from discrete to continuous priors is quite a large one, requiring new concepts
and numerical techniques. Up to the 1990s it was only possible to use continuous prior
distributions if they met the necessary conjugacy condition, something we will learn about
this week. If the condition is not met we can always resort to using OpenBUGS.
Solving problems with conjugate prior distributions amounts to no more than deriving
simple formulas and applying them to solve decision problems. But the concepts involved
are new and take some getting used to. Let's use the previous machinery example as a base
to illustrate these concepts.
• Before, we were not sure whether the equipment sent to a customer would be perfectly
flat, and it could be tested before shipment with some level of accuracy.
• Suppose now that the customer could return the equipment at no charge if it failed
within the first 30 days of usage.
• The company has no way of testing to determine if it will fail in the first 30 days.
• The DM thinks of this problem as follows: each piece of equipment has some probability
of failing within 30 days, and he has a hard time guessing the true value of that probability.
• The only way DM is going to get a better handle on that probability is to make the
best possible product within reason, and then to monitor the number of returns as
sales proceed.
• DM does remember from statistics that if θ is the probability that a unit will fail within
the first 30 days then the probability of observing y returns after sending N units is a
Binomial Distribution
p(y | θ) = \binom{N}{y} θ^y (1 − θ)^{N−y}

and that Bayes' Rule then gives the posterior

pdf(θ | y) = p(y | θ) pdf(θ) / ∫_Θ p(y | θ) pdf(θ) dθ
• So the DM realizes that if some type of prior distribution pdf(θ) can be elicited then, in
principle, once the number of returns y is observed, a posterior distribution pdf(θ|y) can
be computed to make appropriate business decisions that maximize expected utility.
• He might, for example, stop sales to improve the product if more than a threshold
number of units are returned as faulty, or if his profit margin becomes low enough due
to returns.
• The key problem for the DM is the difficulty of integrating the denominator of Bayes'
Rule for a particular continuous prior, as the sketch below illustrates.
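To make that difficulty concrete, here is a minimal numerical sketch (not from the lecture; the triangular prior and the counts N = 50, y = 4 are hypothetical) of computing the denominator by brute-force quadrature:

```python
# Computing the denominator of Bayes' Rule by numerical integration
# for an arbitrary continuous prior on theta.
from scipy.integrate import quad
from scipy.stats import binom

N, y = 50, 4                      # hypothetical: 4 returns out of 50 units shipped

def prior(theta):
    return 2.0 * (1.0 - theta)    # a triangular prior on [0, 1]; integrates to 1

def joint(theta):
    return binom.pmf(y, N, theta) * prior(theta)

evidence, _ = quad(joint, 0.0, 1.0)   # the integral in the denominator

def posterior(theta):
    return joint(theta) / evidence

print(posterior(0.08))            # posterior density of theta at 0.08
```

This works for a one-dimensional θ, but quadrature becomes slow and fragile in higher dimensions, which is exactly why conjugate priors (and, failing that, tools like OpenBUGS) matter.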
Figure 1 (Pallette.pdf) shows the wide variety of distributions which occur naturally in
various settings.
We will spend some time becoming familiar with some notable examples such as the Beta,
Exponential, Gamma, Normal, and Weibull distributions.
For discrete distributions the height of each of the bars - the reading on the y-axis - is the
probability that the value along the x-axis will materialize.
Is the same true for the continuous distributions? Is the height of the distribution the
probability that the value along the x-axis will materialize? NO
The probability that any specific value on the x-axis will materialize is zero. Just like the
probability that a dart will fall exactly on a single infinitesimally small dot on the line is
zero. Yet the dart falls somewhere! The solution to this paradox is that the point of every
dart and every line is not infinitesimally small. The overlap between dart point and line
has a finite area. And thus we always deal with probabilities that the dart will fall within a
given small part of the line. Thus if the distribution relates to the probability that the dart
falls on various parts of the x-axis then the area under the curve corresponding to a small
interval on the line is the probability that the dart will fall in that interval.
Can the height of a discrete distribution ever be greater than 1? Clearly not.
Can the height of a continuous distribution ever be greater than 1? Yes! How can that be?
Suppose that the interval is really small, say 0.01 cm, and that there is a 5% chance of falling
in that interval. Then the area under the distribution (approximately height × width) must
be 0.05. Thus the height must be 5, so that 5 × 0.01 = 0.05.
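A quick check of this arithmetic, using a uniform density on a hypothetical interval of width 0.01:

```python
# A density concentrated on a narrow interval must have height > 1
# so that its total area still equals 1.
from scipy.stats import uniform

narrow = uniform(loc=0.0, scale=0.01)      # all mass on an interval 0.01 wide
print(narrow.pdf(0.005))                   # height = 1/0.01 = 100, far above 1
print(narrow.cdf(0.01) - narrow.cdf(0.0))  # total probability is still 1.0
```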
Because of this qualitative difference between modeling discrete and continuous uncertainty
it is common not to refer to the graphs in the picture as Probability Distributions, but to
name them instead Probability Density Functions, or pdf.
For continuous distributions the counterpart of the Probability Distribution is the Cumulative
Distribution Function, or cdf, which is defined as the probability that the value on the
x-axis will be less than or equal to a given value.
CDF(x) = ∫_{−∞}^{x} pdf(t) dt
BUT the integral sign is used conceptually to determine what needs to be computed in order
to solve our problems. So we cannot afford to be intimidated by the math!
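As a small illustration of the definition (the standard normal is chosen purely for convenience), the sketch below computes the cdf at x = 1 by numerically integrating the pdf and compares it with the library's closed-form cdf:

```python
# The cdf is the running integral of the pdf.
from scipy.integrate import quad
from scipy.stats import norm

x = 1.0
area, _ = quad(norm.pdf, -10.0, x)  # effectively the integral from minus infinity
print(area)                         # ~0.8413
print(norm.cdf(x))                  # the same value from the built-in cdf
```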
Figure 2 shows discrete distributions which occur naturally in various settings. In this part of
the course we assume that if the data we are collecting is an integer, it is distributed according
to one of these frequently occurring distributions.

Discrete Distributions

Distribution | Use | Range | Parameters | Pdf | Mean | Var
Bernoulli | 2 outcomes | {0, 1} | 0 < p < 1 | p^x (1 − p)^{1−x} | p | p(1 − p)
Binomial | Successes in n Bernoulli trials; defective items in a batch of n | {0, …, n} | n > 0 (int), 0 < p < 1 | \binom{n}{x} p^x (1 − p)^{n−x} | np | np(1 − p)

Similarly, Figure 3 shows continuous distributions which occur naturally in various settings.
In this part of the course we will assume that if the information or data we are collecting is a
real number, then it will be distributed according to either the Exponential, Normal or
Uniform distribution. The Gamma, Beta and Pareto are for use in the coming Conjugate
Distributions section.
Continuous Distributions

Distribution | Use | Range | Parameters | Pdf | Mean | Variance
Exponential | Inter-arrival times; time to failure | [0, ∞) | rate β > 0 | β e^{−βx} | 1/β | 1/β²
Gamma | Conjugate to Exponential, Poisson | [0, ∞) | shape α > 0, rate β > 0 | β^α x^{α−1} e^{−βx} / Γ(α) | α/β | α/β²
Normal | Errors; sums & averages | (−∞, ∞) | location µ, precision τ > 0 (τ = 1/σ²) | √(τ/(2π)) e^{−τ(x−µ)²/2} | µ | 1/τ
Beta | Conjugate to Bernoulli, Binomial, Negative Binomial | [0, 1] | shapes α > 0, β > 0 | Γ(α+β)/(Γ(α)Γ(β)) · p^{α−1} (1 − p)^{β−1} | α/(α+β) | αβ/((α+β)²(α+β+1))
Uniform | First model, no data | [a, b] | location a, scale b − a > 0 | 1/(b − a) | (a + b)/2 | (b − a)²/12
One useful observation to help understand the concept of conjugacy is that many distribu-
tions pdf (x) include multiplication by a constant to ensure the area under the distribution
equals 1. This constant is commonly referred to as a Normalizing Constant.
The Normalizing Constant for the Exponential distribution is β, for the Gamma it is
β^α / Γ(α), for the Normal it is √τ / √(2π), for the Beta it is Γ(α + β) / (Γ(α) Γ(β)), for the
Uniform it is 1/(b − a), and for the Pareto it is α x₀^α.
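These constants are easy to check numerically. A sketch for the Beta (with hypothetical shape parameters α = 3, β = 5): the bare kernel should integrate to Γ(α)Γ(β)/Γ(α + β), the reciprocal of the Normalizing Constant.

```python
# Integrate the un-normalized Beta kernel and compare with the
# reciprocal of its Normalizing Constant.
from math import gamma
from scipy.integrate import quad

a, b = 3.0, 5.0
kernel_area, _ = quad(lambda p: p**(a - 1) * (1 - p)**(b - 1), 0.0, 1.0)
print(kernel_area)                          # ~0.0095238
print(gamma(a) * gamma(b) / gamma(a + b))   # the same value, as claimed
```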
Clearly the earlier material was focused on discrete events, i.e. the probability that a well
defined event A_i will take place, where i = 1 . . . n. Of course, since this requires elicitation
of n − 1 probabilities, it can become an onerous task for large n.
But what if we are trying to elicit probabilities over an infinite number of possibilities, e.g.
income, temperature, interest rates, etc.? One way would be to assume a finite number of
discrete possibilities, and proceed in the same way. But what if the DM is concerned that
different discrete versions of the continuous reality will lead to different decisions, and would
much rather use a continuous variable model?
In that case we would define events in terms of ranges. Typically we define events so that
the resulting elicited probabilities result in Cumulative Probabilities. For example, suppose
we wish to assess the probability distribution of the demand for a certain product. Then we
might start by finding the DM's indifference point between
A: a bet that the demand will exceed 1500m units, and
B: a bet that the spinner falls within a given area of the wheel.
The relative size of the area is then the subjective probability that the demand will exceed
1500m units. Figure 4 below shows how a cumulative curve can be derived based on 16
assessments. Note that assessment 1 is way off; the diagram is marked from repeated
assessments. By measuring the slope of the curve at various points, and using a product
such as BestFit (used in MIE360), a continuous probability density function can be derived.
Experience seems to favour an alternate approach for assessing continuous quantities, through
indifference between the following two bets: that the quantity falls below some value x, and
that it falls above x. The value of x at which the DM is indifferent is the 50% point, i.e.
the median. Next the 50% would be replaced by 25% and 75% and the assessment repeated.
If need be, the process would continue by splitting each of the subintervals into equal halves
until there were sufficient points to draw a smooth graph between the points.
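One way such fractile assessments might be turned into a smooth cdf and pdf is monotone interpolation; a sketch is below, where the demand values are hypothetical (they are not the Figure 4 data):

```python
# Build a cdf from elicited fractiles by monotone interpolation,
# then differentiate it to get a pdf.
import numpy as np
from scipy.interpolate import PchipInterpolator

demand  = np.array([500.0, 900.0, 1200.0, 1500.0, 1900.0, 2600.0])  # assessed values
cumprob = np.array([0.0,   0.25,  0.50,   0.75,   0.95,   1.0])     # elicited fractiles

cdf = PchipInterpolator(demand, cumprob)  # monotone data -> a valid cdf
pdf = cdf.derivative()                    # the slope of the cdf is the pdf

print(cdf(1350.0))   # probability that demand is at most 1350 units
print(pdf(1350.0))   # density of demand near 1350 units
```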
4 Conjugate Distributions
Suppose the data collected is the number of failures x in n trials, which we know is Binomially
distributed. The reason we are collecting x is to give us a better idea about the underlying
uncertainty about the failure rate p.
If we had a prior distribution of p then we would be able to use Bayes' Rule in the form

pdf(p | x) = Pr(x | p) pdf(p) / ∫_p Pr(x | p) pdf(p) dp
           = \binom{n}{x} p^x (1 − p)^{n−x} pdf(p) / ∫_p \binom{n}{x} p^x (1 − p)^{n−x} pdf(p) dp
           = p^x (1 − p)^{n−x} pdf(p) / ∫_p p^x (1 − p)^{n−x} pdf(p) dp
For any arbitrary prior distribution pdf(p) it is very difficult to accurately and efficiently
compute the integral in the denominator of Bayes' Rule. It is no coincidence that cumulative
distributions CDF(x) often appear as tables; it is because they are difficult to compute.
Fortunately, among the well known distributions over [0, 1], the Beta distribution is Conjugate
to the Binomial. So let's assume that the DM believes that the prior distribution of p is Beta
with parameters α⁰ and β⁰. Now we see how the math simplifies.
For starters, we take a surprising step which totally removes the need to take the integral in
the denominator of Bayes Rule. We replace the = sign with a proportionality ∝ sign. For a
given x
pdf(p | x) ∝ p^x (1 − p)^{n−x} pdf(p)
Replacing pdf(p) with the formula for the Beta, with initial parameters α⁰, β⁰, yields

pdf(p | x) ∝ p^x (1 − p)^{n−x} · Γ(α⁰ + β⁰)/(Γ(α⁰) Γ(β⁰)) · p^{α⁰−1} (1 − p)^{β⁰−1}
Removing the Normalization Constant
pdf(p | x) ∝ p^x (1 − p)^{n−x} p^{α⁰−1} (1 − p)^{β⁰−1} = p^{x+α⁰−1} (1 − p)^{n−x+β⁰−1}
We now observe that what remains looks like a Beta Distribution, without its Normalizing
Constant and with revised parameters α⁺, β⁺

pdf(p | x) ∝ p^{(α⁰+x)−1} (1 − p)^{(β⁰+n−x)−1} = p^{α⁺−1} (1 − p)^{β⁺−1}
Although it’s not necessary we can reinsert the Normalization Constant to explicitly state
the Posterior distribution as
pdf(p | x) = Γ(α⁺ + β⁺)/(Γ(α⁺) Γ(β⁺)) · p^{α⁺−1} (1 − p)^{β⁺−1}    Voila!
Thus the posterior distribution pdf(p|x) is again a Beta Distribution with parameters
α⁺ = α⁰ + x, β⁺ = β⁰ + n − x.
Note that if we collect N Binomial observations x₁, . . . , x_N, with the numbers of trials
given by n₁, . . . , n_N, then the posterior distribution will be a Beta distribution with
parameters

α⁺ = α⁰ + Σ_{i=1}^{N} x_i = α⁰ + N x̄,    β⁺ = β⁰ + Σ_{i=1}^{N} n_i − Σ_{i=1}^{N} x_i = β⁰ + Σ_{i=1}^{N} n_i − N x̄
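The whole update therefore reduces to a few additions. A minimal sketch (the observation counts in the example call are hypothetical):

```python
# Beta-Binomial conjugate update: posterior parameters by simple addition.
def beta_binomial_update(alpha0, beta0, xs, ns):
    """Update a Beta(alpha0, beta0) prior with Binomial data:
    xs[i] failures out of ns[i] trials in observation i."""
    alpha_post = alpha0 + sum(xs)
    beta_post = beta0 + sum(ns) - sum(xs)
    return alpha_post, beta_post

print(beta_binomial_update(1, 1, xs=[3, 5], ns=[20, 30]))  # -> (9, 43)
```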
A machinery supplier (the DM) has a new machine which a customer is willing to buy for
$10,000, but pay only if it still works after a month. Otherwise the broken machine will be
sent back, and the supplier will receive nothing. The supplier initially has no idea about the
chances that it will work fine, except he thinks that on average it will be fine 80% of the
time. It costs the supplier $7,000 for each machine, so on average he seems to be doing fine,
but if the expected profit is less than $500 per unit he plans to discontinue the sales.
If we let p be the proportion of machines that fail within 30 days, then the sampling
distribution of the number of returns x_i out of a sales volume of n_i in month i will be
Binomially distributed with unknown parameter p.
Suppose that after 6 months the monthly sales volumes are 98, 132, 121, 117, 108, 143 and
the numbers of returns were 12, 16, 22, 32, 15 and 20. Thus there were a total of 117 returns
out of a total sales of 719. What is the expected profit for the next month?
In order to take advantage of conjugacy, the prior on p has to be Beta distributed. Since the
DM mentioned that on average he believes that 80% of units are good, his prior parameters
α⁰, β⁰ have to be such that

p⁰ = α⁰/(α⁰ + β⁰) = 0.2

Suppose that his prior belief is based on thinking that about 20 units out of 100 will fail;
then α⁰ = 20, β⁰ = 80.
Thus α⁺ = 20 + 117 = 137 and β⁺ = 80 + 719 − 117 = 682, so that the expected proportion
of returns next month is p⁺ = 137/(137 + 682) ≈ 0.167, less than previously believed, and so
the business continues with an expected profit per unit of (1 − p⁺) · 3000 + p⁺ · (−7000) ≈ $1,327,
well above the $500 threshold.
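The sketch below reproduces this calculation from the raw monthly figures (note it uses the corrected return total of 117):

```python
# The machinery example, checked numerically.
alpha0, beta0 = 20, 80
sales   = [98, 132, 121, 117, 108, 143]
returns = [12, 16, 22, 32, 15, 20]

alpha_post = alpha0 + sum(returns)              # 20 + 117 = 137
beta_post  = beta0 + sum(sales) - sum(returns)  # 80 + 719 - 117 = 682

p_fail = alpha_post / (alpha_post + beta_post)  # ~0.167
profit = (1 - p_fail) * 3000 + p_fail * (-7000) # expected profit per unit
print(p_fail, profit)                           # ~0.167, ~$1327: above $500, keep selling
```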
Suppose the data collected is the number of failures x before a success, which we know is
Geometrically distributed. Again, the reason we are collecting x is to give us a better idea
about the underlying uncertainty about the failure rate p.
If we had a prior distribution of p then we would be able to use Bayes' Rule. Again, for any
arbitrary prior distribution pdf(p) it is very difficult to accurately and efficiently compute
the integral in the denominator of Bayes' Rule. BUT, if the prior distribution is forced to be
a "conjugate" to the Geometric distribution then the math simplifies. Again the conjugate
prior for the Geometric sampling distribution is the Beta distribution. Let α⁰, β⁰ be the
DM's initial beliefs. For a given x

pdf(p | x) ∝ p^x (1 − p) pdf(p) ∝ p^{(α⁰+x)−1} (1 − p)^{(β⁰+1)−1}

so the posterior is again a Beta distribution with α⁺ = α⁰ + x, β⁺ = β⁰ + 1.
Suppose the data collected is the number of failures x before r successes, which we know is
Negative Binomially distributed, and which again has a Beta conjugate prior. For a given x

pdf(p | x) ∝ p^x (1 − p)^r pdf(p)
           ∝ p^x (1 − p)^r · Γ(α⁰ + β⁰)/(Γ(α⁰) Γ(β⁰)) · p^{α⁰−1} (1 − p)^{β⁰−1}
           ∝ p^{(α⁰+x)−1} (1 − p)^{(β⁰+r)−1}

Thus the posterior is again a Beta distribution with α⁺ = α⁰ + x, β⁺ = β⁰ + r.
Suppose the data collected is the number of events x per unit time, which we know is Poisson
distributed, for which the Gamma distribution is conjugate. For a given x
pdf(λ | x) ∝ e^{−λ} λ^x pdf(λ)
           ∝ e^{−λ} λ^x · λ^{α⁰−1} e^{−β⁰λ}
           ∝ λ^{(α⁰+x)−1} e^{−(β⁰+1)λ}
Thus the posterior is again a Gamma distribution with α⁺ = α⁰ + x, β⁺ = β⁰ + 1.
Suppose the data collected is a time between (memoryless) events x which we know is
Exponentially distributed, for which the Gamma distribution is again conjugate. For a
given x
pdf(λ | x) ∝ λ e^{−λx} pdf(λ)
           ∝ λ e^{−λx} · λ^{α⁰−1} e^{−β⁰λ}
           ∝ λ^{(α⁰+1)−1} e^{−(β⁰+x)λ}
Thus the posterior is again a Gamma distribution with α⁺ = α⁰ + 1, β⁺ = β⁰ + x.
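Applied repeatedly to independent observations, each update just accumulates additions. A minimal sketch of both Gamma updates (the data in the example calls are hypothetical):

```python
# Gamma conjugate updates for Poisson counts and Exponential times.
def gamma_poisson_update(alpha0, beta0, counts):
    """Each count adds to alpha; each observation period adds 1 to beta."""
    return alpha0 + sum(counts), beta0 + len(counts)

def gamma_exponential_update(alpha0, beta0, times):
    """Each observation adds 1 to alpha; each observed time adds to beta."""
    return alpha0 + len(times), beta0 + sum(times)

print(gamma_poisson_update(2, 1, counts=[3, 0, 4]))      # -> (9, 4)
print(gamma_exponential_update(2, 1, times=[0.5, 2.0]))  # -> (4, 3.5)
```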
Here we deal with a sampling distribution with more than one parameter, i.e. both the
mean and the precision of the sampling distribution are assumed unknown. We will assume
that the DM can specify a prior on the mean µ given the precision τ , and a prior on the
precision τ so that the joint mean-precision prior is given by p(µ, τ ) = p(µ|τ )p(τ ). Thus we
will be dealing with two conjugate distributions for the normal sampling case with unknown
mean and precision. The result without proof is:
Theorem 4.1. For the normal sampling distribution with mean µ and precision τ the
following pair of prior distributions is conjugate:
• Normal prior for the mean, with mean µ⁰ and precision τ⁰ = n⁰τ, where n⁰ is a
measure of strength in prior belief.
• Gamma prior for the precision τ, with parameters α⁰ and β⁰.
Suppose you observe the heights of 6 students entering a classroom¹ (64, 73, 64, 63, 69 and
71 inches), and you are asked to bet $100 on the height of the next student. In exchange for
your $100, you will be given 200 − 20|d − z|, where d is your guess of the height, and z is
the actual height of the student. If the reward is negative you will not have to worry about
it. Assume you wish to maximize your expected returns, should you play? And if so, what
should be your guess d?
• We should play if the expected reward of playing exceeds the expected reward of not
playing, and if we play we should maximize the expected reward of playing. Thus in
either case we need to determine the d∗ that maximizes the expected reward of playing,
and if the expected reward is positive we play.
• If the reward had been 200 − 20(d − z)², then we would need to select d to minimize
the expected squared error. It is well known that the mean value, or expected value,
minimizes the expected squared error; and since the normal distribution is symmetric,
its mean coincides with its median, which minimizes the expected absolute error. So in
either case the best guess is d = µ.
¹ Based on Biostatistics Workshop Notes by Professor Michael Escobar
Since we know neither the mean nor the variance (or precision) of the heights in advance,
we need to specify 4 parameters which reflect our opinion about the uncertainty of the
heights: µ⁰, n⁰, α⁰ and β⁰.
For estimating n⁰ we will use the concept of pseudo-observations. Observe that τ⁺ =
(n⁰ + n)τ. We will interpret n⁰ as the number of representative kids that our 5'6" estimate
is based upon. Suppose that our estimate is based on 4 kids; then n⁰ = 4.
We have decided that d = µ would maximize our expected reward if µ were known. Based on
the prior estimates, prior to collecting data the DM would use d = µ⁰ = 5'6" to maximize his
expected reward. What we need to compute now is the posterior estimate of the mean,

µ⁺ = (n⁰µ⁰ + n x̄)/(n⁰ + n) = (4 · 66 + 6 · 67.33)/(4 + 6) = 66.8

Thus to maximize the expected reward the DM will select d = 66.8 ≈ 5'7".
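A quick numerical check of this posterior mean, written as the pseudo-observation-weighted average of the prior mean and the sample mean:

```python
# Posterior mean of the height, weighting the prior mean by n0
# pseudo-observations and the sample mean by the n real observations.
heights = [64, 73, 64, 63, 69, 71]
mu0, n0 = 66, 4                  # prior mean 5'6", worth about 4 pseudo-observations

n = len(heights)
xbar = sum(heights) / n          # 67.33...
mu_post = (n0 * mu0 + n * xbar) / (n0 + n)
print(mu_post)                   # 66.8, so the DM should guess d of about 5'7"
```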
For this problem we did not need α⁰, β⁰. Generally, for estimating α⁰, β⁰ we could ask the
DM to suggest a 95% confidence interval for the heights before observing the data. Suppose
she settles on a range of 56 to 76 inches, i.e. 2 standard deviations = 10 inches, or one s.d.
of 5 inches, or a variance of 25. Thus her estimate of the precision is 1/25. Now the mean
value of a gamma distributed variable is α/β, and since we are dealing with a gamma
distributed precision then α/β = 1/25. Thus perhaps we might use α⁰ = 1 and β⁰ = 25.
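The elicitation arithmetic in one short sketch:

```python
# From a 95% range of 56-76 inches to a Gamma prior on the precision.
half_range = (76 - 56) / 2        # 10 inches, about 2 standard deviations
sd = half_range / 2               # 5 inches
precision = 1 / sd**2             # 1/25

alpha0, beta0 = 1, 25             # one choice with mean alpha0/beta0 = 1/25
print(precision, alpha0 / beta0)  # both 0.04
```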
5 Conclusion
Here we have covered Conjugate distributions for updating some prior distributions with
data, and note that we have not required any software to solve the problems. Once we
assume conjugate prior distributions, the calculation can most often be performed with a
simple calculator.