In the last step, we used the property Γ(n) = (n − 1)!, which is applicable because
r0 is an integer. This is the NBin(r0 , b0 /(b0 + t)) PMF, so the marginal distribution
of Y is Negative Binomial with parameters r0 and b0 /(b0 + t).
(c) By Bayes' rule, the posterior PDF of λ is given by
f_1(\lambda \mid y) = \frac{P(Y = y \mid \lambda) f_0(\lambda)}{P(Y = y)}.
We found P (Y = y) in the previous part, but since it does not depend on λ, we
can just treat it as part of the normalizing constant. Absorbing this and other
multiplicative factors that don't depend on λ into the normalizing constant,
f_1(\lambda \mid y) \propto e^{-\lambda t} \lambda^y \cdot \lambda^{r_0} e^{-b_0 \lambda} \frac{1}{\lambda} = e^{-(b_0 + t)\lambda} \lambda^{r_0 + y} \frac{1}{\lambda},
which shows that the posterior distribution of λ is Gamma(r0 + y, b0 + t).
When going from prior to posterior, the distribution of λ stays in the Gamma family,
so the Gamma is indeed the conjugate prior for the Poisson.
Now that we have the posterior PDF of λ, we have a more elegant approach to
solving (b). Rearranging Bayes' rule, the marginal PMF of Y is
P(Y = y) = \frac{P(Y = y \mid \lambda) f_0(\lambda)}{f_1(\lambda \mid y)},
where we know the numerator from the statement of the problem and the denomi-
nator from the calculation we just did. Plugging in these ingredients and simplifying
again yields
Y ∼ NBin(r0, b0/(b0 + t)).
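Although the chapter's R examples are collected in Section 8.8, a short simulation sketch is one way to see this marginal result in action. The parameter values below (r0 = 3, b0 = 2, t = 5) are arbitrary illustrative choices; note that R's dnbinom uses the same "number of failures before the r0th success" convention as our NBin.
r0 <- 3
b0 <- 2
t <- 5
nsim <- 10^5
lambda <- rgamma(nsim, shape = r0, rate = b0)   # lambda ~ Gamma(r0, b0)
y <- rpois(nsim, lambda * t)                    # Y | lambda ~ Pois(lambda*t)
# simulated marginal PMF of Y vs. the NBin(r0, b0/(b0+t)) PMF
simulated <- as.vector(table(factor(y, levels = 0:5))) / nsim
exact <- dnbinom(0:5, size = r0, prob = b0 / (b0 + t))
round(cbind(simulated, exact), 3)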
(d) Since conditional PDFs are PDFs, it is perfectly fine to calculate the expectation
and variance of λ with respect to the posterior distribution. The mean and variance
of the Gamma(r0 + y, b0 + t) distribution give us
E(\lambda \mid Y = y) = \frac{r_0 + y}{b_0 + t} \quad \text{and} \quad \mathrm{Var}(\lambda \mid Y = y) = \frac{r_0 + y}{(b_0 + t)^2}.
This example gives another interpretation for the parameters in a Gamma when it is used as the conjugate prior for the Poisson: r0 can be interpreted as the prior number of observed arrivals and b0 as the total waiting time for those prior arrivals.
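Continuing the simulation sketch above (with the same illustrative r0, b0, t, lambda, and y, and a hypothetical observed count of 7), we can check the posterior numerically by keeping only the simulated λ's whose accompanying Y equals the observed value, the same conditioning trick used for Bayes' billiards in Section 8.8.
y_obs <- 7
post <- lambda[y == y_obs]           # approximate draws from the posterior of lambda given Y = 7
mean(post)
(r0 + y_obs) / (b0 + t)              # posterior mean
var(post)
(r0 + y_obs) / (b0 + t)^2            # posterior variance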
8.5 Beta-Gamma connections
In this section, we will unite the Beta and Gamma distributions with a common
story. As an added bonus, the story will give us an expression for the normalizing
constant of the Beta(a, b) PDF in terms of gamma functions, and it will allow us to
easily find the expectation of the Beta(a, b) distribution.
Story 8.5.1 (Bank–post office). While running errands, you need to go to the
bank, then to the post office. Let X ∼ Gamma(a, λ) be your waiting time in line
at the bank, and let Y ∼ Gamma(b, λ) be your waiting time in line at the post
office (with the same λ for both). Assume X and Y are independent. What is the
joint distribution of T = X + Y (your total wait at the bank and post office) and
W = X/(X + Y) (the fraction of your waiting time spent at the bank)?
Solution:
We’ll do a change of variables in two dimensions to get the joint PDF of T and W .
Let t = x + y and w = x/(x + y). Then x = tw, y = t(1 − w), and
\frac{\partial(x, y)}{\partial(t, w)} = \begin{pmatrix} w & t \\ 1 - w & -t \end{pmatrix},
which has determinant −tw − t(1 − w) = −t, with absolute value t. The joint PDF of T and W is therefore
f_{T,W}(t, w) = f_{X,Y}(tw, t(1 - w)) \cdot t
= \frac{1}{\Gamma(a)} (\lambda t w)^a e^{-\lambda t w} \frac{1}{tw} \cdot \frac{1}{\Gamma(b)} (\lambda t (1 - w))^b e^{-\lambda t (1 - w)} \frac{1}{t(1 - w)} \cdot t
= \left( \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \, w^{a-1} (1 - w)^{b-1} \right) \left( \frac{(\lambda t)^{a+b} e^{-\lambda t}}{t \, \Gamma(a + b)} \right),
for 0 < w < 1 and t > 0. The form of the joint PDF, together with Proposition 7.1.21, tells us several things:
1. Since the joint PDF factors into a function of t times a function of w, we have
that T and W are independent: the total waiting time is independent of the fraction
of time spent at the bank.
2. We recognize the marginal PDF of T and deduce that T ∼ Gamma(a + b, λ).
3. The PDF of W is
f_W(w) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \, w^{a-1} (1 - w)^{b-1}, \quad 0 < w < 1,
by Proposition 7.1.21 or just by integrating out T from the joint PDF of T and
W . This PDF is proportional to the Beta(a, b) PDF, so it is the Beta(a, b) PDF!
Note that as a byproduct of the calculation we have just done, we have found the
normalizing constant of the Beta distribution:
\frac{1}{\beta(a, b)} = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}
is the constant that goes in front of the Beta(a, b) PDF. ⌅
To summarize, the bank–post office story tells us that when we add independent
Gamma r.v.s X and Y with the same rate λ, the total X + Y has a Gamma distribu-
tion, the fraction X/(X + Y ) has a Beta distribution, and the total is independent
of the fraction.
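As with other results in this chapter, this can be checked with a short simulation; the sketch below uses arbitrary illustrative values a = 2, b = 5, λ = 0.5, in the style of Section 8.8.
a <- 2
b <- 5
lambda <- 0.5
nsim <- 10^5
x <- rgamma(nsim, shape = a, rate = lambda)
y <- rgamma(nsim, shape = b, rate = lambda)
t <- x + y                    # total waiting time
w <- x / (x + y)              # fraction of time spent at the bank
mean(t)
(a + b) / lambda              # mean of Gamma(a+b, lambda)
mean(w)
a / (a + b)                   # mean of Beta(a, b)
cor(t, w)                     # approximately 0, consistent with independence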
We can use this result to find the mean of W ⇠ Beta(a, b) without the slightest
trace of calculus.
Example 8.5.2 (Beta expectation). With notation as above, note that since T and
W are independent, they are uncorrelated: E(T W ) = E(T )E(W ). Writing this in
terms of X and Y , we have
E\left( (X + Y) \cdot \frac{X}{X + Y} \right) = E(X + Y) \, E\left( \frac{X}{X + Y} \right),
E(X) = E(X + Y) \, E\left( \frac{X}{X + Y} \right),
\frac{E(X)}{E(X + Y)} = E\left( \frac{X}{X + Y} \right).
Ordinarily, the last equality would be a horrendous blunder: faced with an expec-
tation like E(X/(X + Y )), we are not generally permitted to move the E into the
numerator and denominator as we please. In this case, however, the bank–post office
story justifies the move, so finding the expectation of W happily reduces to finding
the expectations of X and X + Y :
E(W) = E\left( \frac{X}{X + Y} \right) = \frac{E(X)}{E(X + Y)} = \frac{a/\lambda}{a/\lambda + b/\lambda} = \frac{a}{a + b}.
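A one-line numerical check of this expectation (a sketch, with arbitrary a = 2 and b = 5):
mean(rbeta(10^5, 2, 5))   # should be close to a/(a+b) = 2/7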
8.6 Order statistics
Like the sample mean X̄n , the order statistics serve as useful summaries
of an experiment, since we can use them to determine the cutoffs for the worst 5%
of observations, the worst 25%, the best 25%, and so forth (such cutoffs are called
the quantiles of the sample).
Definition 8.6.1 (Order statistics). For r.v.s X1 , . . . , Xn , the order statistics are the random variables X(1) , . . . , X(n) , where
X(1) = min(X1 , . . . , Xn ),
X(2) is the second-smallest of X1 , . . . , Xn ,
...
X(n−1) is the second-largest of X1 , . . . , Xn ,
X(n) = max(X1 , . . . , Xn ).
Note that X(1) ≤ X(2) ≤ · · · ≤ X(n) by definition. We call X(j) the jth order
statistic. If n is odd, X((n+1)/2) is called the sample median of X1 , . . . , Xn .
⚠ 8.6.2. The order statistics X(1) , . . . , X(n) are r.v.s, and each X(j) is a function
of X1 , . . . , Xn . Even if the original r.v.s are independent, the order statistics are
dependent: if we know that X(1) = 100, then X(n) is forced to be at least 100.
We will focus our attention on the case where X1 , . . . , Xn are i.i.d. continuous r.v.s.
The reason is that with discrete r.v.s, there is a positive probability of tied values;
with continuous r.v.s, the probability of a tie is exactly 0, which makes matters much
easier. Thus, for the rest of this section, assume X1 , . . . , Xn are i.i.d. and continuous,
with CDF F and PDF f . We will derive the marginal CDF and PDF of each
individual order statistic X(j) , as well as the joint PDF of (X(1) , . . . , X(n) ).
A complication we run into right away is that the transformation to order statistics
is not invertible: starting with min(X, Y ) = 3 and max(X, Y ) = 5, we can’t tell
whether the original values of X and Y were 3 and 5, respectively, or 5 and 3.
Therefore the change of variables formula from Rn to Rn does not apply. Instead
we will take a direct approach, using pictures to guide us when necessary.
Footnote: The term "order statistic" sometimes causes confusion. In statistics (the field of
study), any function of the data is called a statistic. If X1 , . . . , Xn are the data, then
min(X1 , . . . , Xn ) is a statistic, and so is max(X1 , . . . , Xn ). They are called order
statistics because we get them by sorting the data in order (from smallest to largest).
Let's start with the CDF of X(n) = max(X1 , . . . , Xn ). Since X(n) is less than or
equal to x if and only if all of the Xj are less than or equal to x, the CDF of X(n) is
F_{X_{(n)}}(x) = P(\max(X_1, \ldots, X_n) \le x) = P(X_1 \le x, \ldots, X_n \le x) = P(X_1 \le x) \cdots P(X_n \le x) = (F(x))^n,
where F is the CDF of the individual Xi . Similarly, X(1) = min(X1 , . . . , Xn ) exceeds
x if and only if all of the Xj exceed x, so the CDF of X(1) is
F_{X_{(1)}}(x) = 1 - P(\min(X_1, \ldots, X_n) > x) = 1 - P(X_1 > x, \ldots, X_n > x) = 1 - (1 - F(x))^n.
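These formulas are easy to check by simulation for a particular distribution and cutoff; the sketch below uses n = 5 i.i.d. Unif(0, 1) r.v.s and x = 0.7, both arbitrary choices, so that F(x) = x.
n <- 5
x <- 0.7
nsim <- 10^5
maxes <- replicate(nsim, max(runif(n)))
mins <- replicate(nsim, min(runif(n)))
mean(maxes <= x)
x^n                      # (F(x))^n
mean(mins <= x)
1 - (1 - x)^n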
The same logic lets us find the CDF of X(j) . For the event X(j) ≤ x to occur,
we need at least j of the Xi to fall to the left of x. This is illustrated in Figure
8.10.
FIGURE 8.10: The event X(j) ≤ x is equivalent to the event "at least j of the Xi fall to the left of x".
Since it appears that the number of Xi to the left of x will be important to us,
let’s define a new random variable, N , to keep track of just that: define N to
be the number of Xi that land to the left of x. Each Xi lands to the left of x
with probability F (x), independently. If we define success as landing to the left
of x, we have n independent Bernoulli trials with probability F (x) of success, so
N ∼ Bin(n, F(x)). Then, by the Binomial PMF,
P(X_{(j)} \le x) = P(\text{at least } j \text{ of the } X_i \text{ are to the left of } x)
= P(N \ge j)
= \sum_{k=j}^{n} \binom{n}{k} F(x)^k (1 - F(x))^{n-k}.
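Since this CDF is a Binomial tail probability, it is easy to evaluate numerically. A sketch in R, assuming Unif(0, 1) data so that F(x) = x, with arbitrary values n = 5, j = 3, x = 0.7:
n <- 5
j <- 3
x <- 0.7
# P(X_(j) <= x) = P(N >= j) for N ~ Bin(n, F(x))
pbinom(j - 1, n, x, lower.tail = FALSE)
sum(dbinom(j:n, n, x))   # same sum computed term by term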
To get the PDF of X(j) , we can differentiate the CDF with respect to x, but the
resulting expression is ugly (though it can be simplified). Instead we will take a more
direct approach. Consider fX(j) (x)dx, the probability that the jth order statistic falls
into an infinitesimal interval of length dx around x. The only way this can happen
is illustrated in Figure 8.11. We need one of the Xi to fall into the infinitesimal
interval around x, and we need exactly j − 1 of the Xi to fall to the left of x, leaving
the remaining n − j to fall to the right of x.
FIGURE 8.11: In order for X(j) to fall within a small interval around x, we require that one of the Xi fall within the small interval and that exactly j − 1 fall to the left of x.
What is the probability of this extremely specific event? Let’s break up the experi-
ment into stages.
• First, we choose which one of the Xi will fall into the infinitesimal interval around
x. There are n such choices, each of which occurs with probability f (x)dx, where
f is the PDF of the Xi .
• Next, we choose exactly j − 1 out of the remaining n − 1 to fall to the left of x.
There are \binom{n-1}{j-1} such choices, each with probability F(x)^{j-1}(1 - F(x))^{n-j} by the
Bin(n − 1, F(x)) PMF.
We multiply the probabilities of the two stages to get
f_{X_{(j)}}(x)\,dx = n f(x)\,dx \cdot \binom{n-1}{j-1} F(x)^{j-1} (1 - F(x))^{n-j}.
Dropping the dx’s from both sides gives us the PDF we desire.
Theorem 8.6.4 (PDF of order statistic). Let X1 , . . . , Xn be i.i.d. continuous r.v.s
with CDF F and PDF f . Then the marginal PDF of the jth order statistic X(j) is
f_{X_{(j)}}(x) = n \binom{n-1}{j-1} f(x) F(x)^{j-1} (1 - F(x))^{n-j}.
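One way to check Theorem 8.6.4 numerically is to compare a histogram of simulated jth order statistics with this formula. The sketch below uses n = 10 i.i.d. Expo(1) r.v.s and j = 3, arbitrary illustrative choices, so f = dexp and F = pexp.
n <- 10
j <- 3
nsim <- 10^4
xj <- replicate(nsim, sort(rexp(n))[j])   # jth order statistic, simulated nsim times
hist(xj, probability = TRUE)
curve(n * choose(n - 1, j - 1) * dexp(x) * pexp(x)^(j - 1) * (1 - pexp(x))^(n - j),
      add = TRUE)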
The simple case n = 2 is consistent with Example 7.2.2, where we used 2D LOTUS
to show that for i.i.d. U1 , U2 ∼ Unif(0, 1), E(max(U1 , U2 )) = 2/3 and
E(min(U1 , U2 )) = 1/3. For Unif(0, 1) r.v.s we have F(x) = x and f(x) = 1 on (0, 1),
so by Theorem 8.6.4 the jth order statistic of n i.i.d. Unif(0, 1) r.v.s has PDF
f_{U_{(j)}}(x) = n \binom{n-1}{j-1} x^{j-1} (1 - x)^{n-j}, \quad 0 < x < 1,
which is the Beta(j, n − j + 1) PDF. In particular, max(U1 , U2 ) ∼ Beta(2, 1) and
min(U1 , U2 ) ∼ Beta(1, 2).
Now that we know max(U1 , U2 ) and min(U1 , U2 ) follow Beta distributions, the ex-
pectation of the Beta distribution confirms our earlier findings. ⌅
8.7 Recap
FIGURE 8.12: Let (Z, W ) = g(X, Y ), where g is an invertible transformation satisfying certain technical assumptions. The change of variables formula lets us go back and forth between the joint PDFs of (X, Y ) and (Z, W ).
In this chapter, as in many others, we made extensive use of Bayes’ rule and LOTP,
especially in continuous or hybrid forms. And we often used the strategy of integra-
tion by pattern recognition. Since any valid PDF must integrate to 1, and by now
we know lots of valid PDFs, we can often use probability to help us do calculus in
addition to using calculus to help us do probability!
The two new distributions we introduced are the Beta and Gamma, which are laden
with stories and connections to other distributions. The Beta is a generalization of
the Unif(0, 1) distribution, and it has the following stories.
• Order statistics of the Uniform: The jth order statistic of n i.i.d. Unif(0, 1) r.v.s
is distributed Beta(j, n − j + 1).
• Unknown probability, conjugate prior of the Binomial: If p ∼ Beta(a, b) and X|p ∼
Bin(n, p), then p|X = k ∼ Beta(a + k, b + n − k). The posterior distribution of
p stays within the Beta family of distributions after updating based on Binomial
data, a property known as conjugacy. The parameters a and b can be interpreted
as the prior number of successes and failures, respectively.
The Gamma is a generalization of the Exponential distribution, and it has the
following stories.
• Poisson process: In a Poisson process of rate λ, the total waiting time for n arrivals
is distributed Gamma(n, '). Thus the Gamma is the continuous analog of the
Negative Binomial distribution.
• Unknown rate, conjugate prior of the Poisson: If λ ∼ Gamma(r0 , b0 ) and Y |λ ∼
Pois(λt), then λ|Y = y ∼ Gamma(r0 + y, b0 + t). The posterior distribution of λ
stays within the Gamma family of distributions after updating based on Poisson
data. The parameters r0 and b0 can be interpreted as the prior number of observed
successes and the total waiting time for those successes, respectively.
The Beta and Gamma distributions are related by the bank–post office story, which
says that if X ∼ Gamma(a, λ) and Y ∼ Gamma(b, λ) are independent, then X + Y ∼
Gamma(a + b, λ) and X/(X + Y ) ∼ Beta(a, b), with X + Y and X/(X + Y ) independent.
The diagram of connections, which we last saw in Chapter 5, is hereby updated to
include the Beta and Gamma distributions. Distributions listed in parentheses are
special cases of the ones not in parentheses.
[Diagram of connections: HGeom, Bin (Bern), Pois, Beta (Unif), Gamma (Expo), and NBin (Geom), linked by limit, conditioning, conjugacy, Poisson process, and bank–post office relationships.]
8.8 R
Convolution of Uniforms
Using R, we can quickly verify that for X, Y i.i.d. ∼ Unif(0, 1), the distribution of
T = X + Y is triangular in shape:
x <- runif(10^5)
y <- runif(10^5)
hist(x+y)
The histogram looks like an ascending and then descending staircase, a discrete
approximation to a triangle.
Bayes’ billiards
In the Bayes’ billiards story, we have n white balls and 1 gray ball, throw them onto
the unit interval completely at random, and count the number of white balls to the
left of the gray ball. Letting p be the position of the gray ball and X be the number
of white balls to the left of the gray ball, we have
p ∼ Unif(0, 1)
X|p ∼ Bin(n, p).
By performing this experiment a large number of times, we can verify the results
we derived in this chapter about the marginal PMF of X and the posterior PDF of
p given X = x. We’ll let the number of simulations be called nsim, to avoid a name
conflict with the number of white balls, n, which we set equal to 10:
nsim <- 10^5
n <- 10
We simulate 10^5 values of p, then simulate 10^5 values from the conditional distri-
bution of X given p:
p <- runif(nsim)
x <- rbinom(nsim,n,p)
Notice that we feed the entire vector p into rbinom. This means that the first element
of x is generated using the first element of p, the second element of x is generated
using the second element of p, and so forth. Thus, conditional on a particular
element of p, the corresponding element of x is Binomial, but the elements of p are
themselves Uniform, exactly as the model specifies.
According to the Bayes’ billiards argument, the marginal distribution of X should
be Discrete Uniform on the integers 0 through n. Is this in fact the case? We can
make a histogram of x to check! Because the distribution of X is discrete, we tell
R to make the histogram breaks at −0.5, 0.5, 1.5, . . . so that each bar is centered at
an integer value:
hist(x,breaks=seq(-0.5,n+0.5,1))
Indeed, all the histogram bars are approximately equal in height, consistent with a
Discrete Uniform distribution.
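For a numerical check in addition to the histogram (using the same simulated vector x and the value of nsim from above), we can compare the simulated proportions with the Discrete Uniform PMF:
table(x) / nsim     # each entry should be close to 1/(n+1)
1 / (n + 1)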
Now for the posterior distribution of p given X = x. Conditioning is very simple
in R. To consider only the simulated values of p where the value of X was 3, we
use square brackets, like this: p[x==3]. In particular, we can create a histogram of
these values to see what the posterior distribution of p given X = 3 looks like; try
hist(p[x==3]).
According to the Beta-Binomial conjugacy result, the true posterior distribution
is p|X = 3 ∼ Beta(4, 8). We can plot the histogram of p[x==3] next to a his-
togram of simulated values from the Beta(4, 8) distribution to confirm that they
look similar:
par(mfrow=c(1,2))
hist(p[x==3])
hist(rbeta(10^4,4,8))
The first line tells R we want two side-by-side plots, and the second and third lines
create the histograms.
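As a variation, one could instead overlay the exact Beta(4, 8) PDF on a density-scaled histogram of the conditioned draws:
hist(p[x==3], probability = TRUE)
curve(dbeta(x, 4, 8), add = TRUE)   # here x is curve's plotting grid, not our data vector x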
Order statistics
Simulating order statistics in R is easy: we simply simulate i.i.d. r.v.s and use sort
to sort them in increasing order. For example,
sort(rnorm(10))
produces one realization of X(1) , . . . , X(10) , where X1 , . . . , X10 are i.i.d. N (0, 1).
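To look at the distribution of a particular order statistic rather than a single realization, we can repeat the experiment many times; here is a sketch for 10 i.i.d. Unif(0, 1) r.v.s, whose 9th order statistic should be Beta(9, 2) by the results of this chapter.
u <- replicate(10^4, sort(runif(10)))   # each column is one sorted sample of 10 Uniforms
hist(u[9, ])                            # realizations of the 9th order statistic
hist(rbeta(10^4, 9, 2))                 # compare with Beta(9, 2)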