
remembering to multiply by the reciprocal on the outside:


$$
\begin{aligned}
P(Y = y) &= \frac{t^y\, b_0^{r_0}}{y!\,\Gamma(r_0)} \int_0^\infty e^{-(b_0+t)\lambda}\, \lambda^{r_0+y}\, \frac{d\lambda}{\lambda} \\
&= \frac{\Gamma(r_0+y)}{y!\,\Gamma(r_0)}\, \frac{t^y\, b_0^{r_0}}{(b_0+t)^{r_0+y}} \int_0^\infty \frac{1}{\Gamma(r_0+y)}\, e^{-(b_0+t)\lambda}\, \bigl((b_0+t)\lambda\bigr)^{r_0+y}\, \frac{d\lambda}{\lambda} \\
&= \frac{(r_0+y-1)!}{(r_0-1)!\,y!} \left(\frac{t}{b_0+t}\right)^{y} \left(\frac{b_0}{b_0+t}\right)^{r_0}.
\end{aligned}
$$

In the last step, we used the property Γ(n) = (n − 1)!, which is applicable because
r0 is an integer. This is the NBin(r0, b0/(b0 + t)) PMF, so the marginal distribution
of Y is Negative Binomial with parameters r0 and b0/(b0 + t).
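We can also check this result by simulation in R, in the style of Section 8.8. The following sketch uses made-up values of r0, b0, and t (any positive values would do): it draws λ from the Gamma(r0, b0) prior, draws Y given λ from the Pois(λt) distribution, and compares the simulated PMF of Y to the NBin(r0, b0/(b0 + t)) PMF.
r0 <- 5; b0 <- 2; t <- 3                # illustrative values, not from the problem statement
nsim <- 10^5
lambda <- rgamma(nsim, r0, b0)          # lambda ~ Gamma(r0, b0)
y <- rpois(nsim, lambda * t)            # Y | lambda ~ Pois(lambda * t)
# empirical P(Y = k) for k = 0, ..., 10 versus the NBin(r0, b0/(b0+t)) PMF
rbind(simulated = sapply(0:10, function(k) mean(y == k)),
      exact = dnbinom(0:10, r0, b0/(b0 + t)))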
(c) By Bayes’ rule, the posterior PDF of λ is given by

$$
f_1(\lambda \mid y) = \frac{P(Y = y \mid \lambda)\, f_0(\lambda)}{P(Y = y)}.
$$

We found P(Y = y) in the previous part, but since it does not depend on λ, we
can just treat it as part of the normalizing constant. Absorbing this and other
multiplicative factors that don’t depend on λ into the normalizing constant,

$$
f_1(\lambda \mid y) \propto e^{-\lambda t}\, \lambda^y \cdot \lambda^{r_0} e^{-b_0 \lambda}\, \frac{1}{\lambda} = e^{-(b_0+t)\lambda}\, \lambda^{r_0+y}\, \frac{1}{\lambda},
$$
which shows that the posterior distribution of λ is Gamma(r0 + y, b0 + t).
When going from prior to posterior, the distribution of λ stays in the Gamma family,
so the Gamma is indeed the conjugate prior for the Poisson.
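This conjugacy can be checked numerically with the same kind of conditioning used in the Bayes’ billiards simulation of Section 8.8. The sketch below reuses the illustrative values r0 = 5, b0 = 2, t = 3 from above and conditions on the (arbitrary) observation y = 4.
r0 <- 5; b0 <- 2; t <- 3; y <- 4        # illustrative values
nsim <- 10^6
lambda <- rgamma(nsim, r0, b0)
ysim <- rpois(nsim, lambda * t)
# keep the lambdas whose simulated Y equals 4, and compare to the Gamma(r0 + 4, b0 + t) PDF
hist(lambda[ysim == 4], probability = TRUE, breaks = 50)
curve(dgamma(x, r0 + y, b0 + t), add = TRUE)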
Now that we have the posterior PDF of λ, we have a more elegant approach to
solving (b). Rearranging Bayes’ rule, the marginal PMF of Y is

$$
P(Y = y) = \frac{P(Y = y \mid \lambda)\, f_0(\lambda)}{f_1(\lambda \mid y)},
$$
where we know the numerator from the statement of the problem and the denominator
from the calculation we just did. Plugging in these ingredients and simplifying
again yields

Y ∼ NBin(r0, b0/(b0 + t)).

(d) Since conditional PDFs are PDFs, it is perfectly fine to calculate the expectation
and variance of λ with respect to the posterior distribution. The mean and variance
of the Gamma(r0 + y, b0 + t) distribution give us

$$
E(\lambda \mid Y = y) = \frac{r_0+y}{b_0+t} \quad \text{and} \quad \mathrm{Var}(\lambda \mid Y = y) = \frac{r_0+y}{(b_0+t)^2}.
$$

This example gives another interpretation for the parameters in a Gamma when
it is being used as a conjugate prior. Fred’s Gamma(r0, b0) prior got updated to a
Gamma(r0 + y, b0 + t) posterior after observing y arrivals in t hours. We can imagine
that in the past, Fred observed r0 buses arrive in b0 hours; then after the new data,
he has observed r0 + y buses in b0 + t hours. So we can interpret r0 as the number
of prior arrivals and b0 as the total time required for those prior arrivals. □

8.5 Beta-Gamma connections

In this section, we will unite the Beta and Gamma distributions with a common
story. As an added bonus, the story will give us an expression for the normalizing
constant of the Beta(a, b) PDF in terms of gamma functions, and it will allow us to
easily find the expectation of the Beta(a, b) distribution.
Story 8.5.1 (Bank–post office). While running errands, you need to go to the
bank, then to the post office. Let X ∼ Gamma(a, λ) be your waiting time in line
at the bank, and let Y ∼ Gamma(b, λ) be your waiting time in line at the post
office (with the same λ for both). Assume X and Y are independent. What is the
joint distribution of T = X + Y (your total wait at the bank and post office) and
W = X/(X + Y) (the fraction of your waiting time spent at the bank)?
Solution:
We’ll do a change of variables in two dimensions to get the joint PDF of T and W.
Let t = x + y, w = x/(x + y). Then x = tw, y = t(1 − w), and

$$
\frac{\partial(x, y)}{\partial(t, w)} = \begin{pmatrix} w & t \\ 1-w & -t \end{pmatrix},
$$

which has an absolute determinant of t. Therefore


$$
\begin{aligned}
f_{T,W}(t, w) &= f_{X,Y}(x, y) \cdot \left| \frac{\partial(x, y)}{\partial(t, w)} \right| \\
&= f_X(x)\, f_Y(y) \cdot t \\
&= \frac{1}{\Gamma(a)} (\lambda x)^a e^{-\lambda x}\, \frac{1}{x} \cdot \frac{1}{\Gamma(b)} (\lambda y)^b e^{-\lambda y}\, \frac{1}{y} \cdot t \\
&= \frac{1}{\Gamma(a)} (\lambda t w)^a e^{-\lambda t w}\, \frac{1}{tw} \cdot \frac{1}{\Gamma(b)} \bigl(\lambda t(1-w)\bigr)^b e^{-\lambda t(1-w)}\, \frac{1}{t(1-w)} \cdot t.
\end{aligned}
$$
Let’s group all the terms involving w together, and all the terms involving t together:
$$
\begin{aligned}
f_{T,W}(t, w) &= \frac{1}{\Gamma(a)\Gamma(b)}\, w^{a-1}(1-w)^{b-1}\, (\lambda t)^{a+b} e^{-\lambda t}\, \frac{1}{t} \\
&= \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, w^{a-1}(1-w)^{b-1} \cdot \frac{1}{\Gamma(a+b)} (\lambda t)^{a+b} e^{-\lambda t}\, \frac{1}{t},
\end{aligned}
$$
for 0 < w < 1 and t > 0. The form of the joint PDF, together with Proposi-
tion 7.1.21, tells us several things:
1. Since the joint PDF factors into a function of t times a function of w, we have
that T and W are independent: the total waiting time is independent of the fraction
of time spent at the bank.
2. We recognize the marginal PDF of T and deduce that T ∼ Gamma(a + b, λ).
3. The PDF of W is
$$
f_W(w) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, w^{a-1}(1-w)^{b-1}, \quad 0 < w < 1,
$$
by Proposition 7.1.21 or just by integrating out T from the joint PDF of T and
W . This PDF is proportional to the Beta(a, b) PDF, so it is the Beta(a, b) PDF!
Note that as a byproduct of the calculation we have just done, we have found the
normalizing constant of the Beta distribution:
$$
\frac{1}{\beta(a, b)} = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}
$$

is the constant that goes in front of the Beta(a, b) PDF. □
To summarize, the bank–post office story tells us that when we add independent
Gamma r.v.s X and Y with the same rate λ, the total X + Y has a Gamma distribu-
tion, the fraction X/(X + Y ) has a Beta distribution, and the total is independent
of the fraction.
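Here is a quick simulation sketch of the story in R, with arbitrary illustrative parameters a = 3, b = 5, λ = 2: the correlation of T and W should be near 0, the mean and variance of T should be near (a + b)/λ and (a + b)/λ², the mean of W should be near a/(a + b), and the histogram of W should match the Beta(a, b) PDF.
a <- 3; b <- 5; lambda <- 2             # illustrative values
nsim <- 10^5
x <- rgamma(nsim, a, lambda)            # waiting time at the bank
y <- rgamma(nsim, b, lambda)            # waiting time at the post office
total <- x + y
frac <- x/(x + y)
cor(total, frac)                        # should be close to 0
mean(total); var(total)                 # compare to (a+b)/lambda and (a+b)/lambda^2
mean(frac)                              # compare to a/(a+b)
hist(frac, probability = TRUE)
curve(dbeta(x, a, b), add = TRUE)       # overlay the Beta(a, b) PDF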
We can use this result to find the mean of W ∼ Beta(a, b) without the slightest
trace of calculus.
Example 8.5.2 (Beta expectation). With notation as above, note that since T and
W are independent, they are uncorrelated: E(T W ) = E(T )E(W ). Writing this in
terms of X and Y , we have
$$
E\left((X+Y) \cdot \frac{X}{X+Y}\right) = E(X+Y)\, E\left(\frac{X}{X+Y}\right),
$$

$$
E(X) = E(X+Y)\, E\left(\frac{X}{X+Y}\right),
$$

$$
\frac{E(X)}{E(X+Y)} = E\left(\frac{X}{X+Y}\right).
$$
Ordinarily, the last equality would be a horrendous blunder: faced with an expec-
tation like E(X/(X + Y )), we are not generally permitted to move the E into the
numerator and denominator as we please. In this case, however, the bank–post office
story justifies the move, so finding the expectation of W happily reduces to finding
the expectations of X and X + Y :

$$
E(W) = E\left(\frac{X}{X+Y}\right) = \frac{E(X)}{E(X+Y)} = \frac{a/\lambda}{a/\lambda + b/\lambda} = \frac{a}{a+b}.
$$

Another approach is to proceed from the definition of expectation:


$$
E(W) = \int_0^1 \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, w^a (1-w)^{b-1}\, dw.
$$

By pattern recognition, the integrand is a Beta(a + 1, b) PDF, up to a normalizing
constant. After obtaining an exact match for the PDF, we apply properties of the
gamma function:
$$
\begin{aligned}
E(W) &= \frac{\Gamma(a+b)}{\Gamma(a)}\, \frac{\Gamma(a+1)}{\Gamma(a+b+1)} \int_0^1 \frac{\Gamma(a+b+1)}{\Gamma(a+1)\Gamma(b)}\, w^a (1-w)^{b-1}\, dw \\
&= \frac{\Gamma(a+b)}{\Gamma(a)} \cdot \frac{a\,\Gamma(a)}{(a+b)\,\Gamma(a+b)} \\
&= \frac{a}{a+b}.
\end{aligned}
$$
In Exercise 30, you will use this approach to find the variance and the other moments
of the Beta distribution. □
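As a small numerical sanity check in R, we can evaluate the integral defining E(W) with integrate and compare it to a/(a + b), for one arbitrary choice of a and b:
a <- 3; b <- 5                                        # arbitrary illustrative values
integrate(function(w) w * dbeta(w, a, b), 0, 1)$value
a/(a + b)                                             # both should be 0.375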
Now that we know the Beta normalizing constant, we can also quickly obtain the
PMF of the Beta-Binomial distribution.
Example 8.5.3 (Beta-Binomial PMF). Let X|p ∼ Bin(n, p), with p ∼ Beta(a, b).
As mentioned in Story 8.3.3, X has a Beta-Binomial distribution. Find the marginal
distribution of X.
Solution: Let f (p) be the Beta(a, b) PDF. Then
$$
\begin{aligned}
P(X = k) &= \int_0^1 P(X = k \mid p)\, f(p)\, dp \\
&= \frac{1}{\beta(a, b)} \int_0^1 \binom{n}{k} p^k (1-p)^{n-k}\, p^{a-1} (1-p)^{b-1}\, dp \\
&= \frac{\binom{n}{k}}{\beta(a, b)} \int_0^1 p^{a+k-1} (1-p)^{b+n-k-1}\, dp \\
&= \binom{n}{k} \frac{\beta(a+k,\, b+n-k)}{\beta(a, b)},
\end{aligned}
$$

for k = 0, 1, . . . , n. □
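This PMF is easy to check in R, since the beta function β is available as beta. The sketch below uses made-up values n = 10, a = 2, b = 3 and compares the formula to a simulation of the two-stage story.
n <- 10; a <- 2; b <- 3                 # illustrative values
nsim <- 10^5
p <- rbeta(nsim, a, b)
x <- rbinom(nsim, n, p)
k <- 0:n
rbind(simulated = sapply(k, function(j) mean(x == j)),
      exact = choose(n, k) * beta(a + k, b + n - k) / beta(a, b))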

8.6 Order statistics

The final transformation we will consider in this chapter is the transformation
that takes n random variables X1, . . . , Xn and sorts them in order, producing the
transformed r.v.s min(X1, . . . , Xn), . . . , max(X1, . . . , Xn). The transformed r.v.s are
called the order statistics,² and they are often useful when we are concerned with
the distribution of extreme values, as we alluded to earlier.

² This term sometimes causes confusion. In statistics (the field of study), any function of the
data is called a statistic. If X1, . . . , Xn are the data, then min(X1, . . . , Xn) is a statistic, and so
is max(X1, . . . , Xn). They are called order statistics because we get them by sorting the data in
order (from smallest to largest).

Furthermore, like the sample mean X̄n , the order statistics serve as useful summaries
of an experiment, since we can use them to determine the cutoffs for the worst 5%
of observations, the worst 25%, the best 25%, and so forth (such cutoffs are called
the quantiles of the sample).

Definition 8.6.1 (Order statistics). For r.v.s X1, X2, . . . , Xn, the order statistics
are the random variables X(1), X(2), . . . , X(n), where

X(1) = min(X1, . . . , Xn),
X(2) is the second-smallest of X1, . . . , Xn,
. . .
X(n−1) is the second-largest of X1, . . . , Xn,
X(n) = max(X1, . . . , Xn).

Note that X(1) ≤ X(2) ≤ . . . ≤ X(n) by definition. We call X(j) the jth order
statistic. If n is odd, X((n+1)/2) is called the sample median of X1 , . . . , Xn .

⚠ 8.6.2. The order statistics X(1), . . . , X(n) are r.v.s, and each X(j) is a function
of X1 , . . . , Xn . Even if the original r.v.s are independent, the order statistics are
dependent: if we know that X(1) = 100, then X(n) is forced to be at least 100.

We will focus our attention on the case where X1 , . . . , Xn are i.i.d. continuous r.v.s.
The reason is that with discrete r.v.s, there is a positive probability of tied values;
with continuous r.v.s, the probability of a tie is exactly 0, which makes matters much
easier. Thus, for the rest of this section, assume X1 , . . . , Xn are i.i.d. and continuous,
with CDF F and PDF f . We will derive the marginal CDF and PDF of each
individual order statistic X(j) , as well as the joint PDF of (X(1) , . . . , X(n) ).

A complication we run into right away is that the transformation to order statistics
is not invertible: starting with min(X, Y ) = 3 and max(X, Y ) = 5, we can’t tell
whether the original values of X and Y were 3 and 5, respectively, or 5 and 3.
Therefore the change of variables formula from Rn to Rn does not apply. Instead
we will take a direct approach, using pictures to guide us when necessary.

Let’s start with the CDF of X(n) = max(X1, . . . , Xn). Since X(n) is less than x if
and only if all of the Xj are less than x, the CDF of X(n) is

$$
\begin{aligned}
F_{X_{(n)}}(x) &= P(\max(X_1, \ldots, X_n) \le x) \\
&= P(X_1 \le x, \ldots, X_n \le x) \\
&= P(X_1 \le x) \cdots P(X_n \le x) \\
&= (F(x))^n,
\end{aligned}
$$
where F is the CDF of the individual Xi . Similarly, X(1) = min(X1 , . . . , Xn ) exceeds
x if and only if all of the Xj exceed x, so the CDF of X(1) is
$$
\begin{aligned}
F_{X_{(1)}}(x) &= 1 - P(\min(X_1, \ldots, X_n) > x) \\
&= 1 - P(X_1 > x, \ldots, X_n > x) \\
&= 1 - (1 - F(x))^n.
\end{aligned}
$$
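Both formulas are easy to check empirically in R. The following sketch uses i.i.d. Unif(0, 1) r.v.s (so F(x) = x), with the arbitrary choices n = 5 and x = 0.7:
n <- 5; x <- 0.7                        # illustrative values
nsim <- 10^5
sims <- matrix(runif(nsim * n), nrow = nsim)
mean(apply(sims, 1, max) <= x); x^n             # CDF of the max: both near 0.7^5
mean(apply(sims, 1, min) <= x); 1 - (1 - x)^n   # CDF of the min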
The same logic lets us find the CDF of X(j). For the event X(j) ≤ x to occur,
we need at least j of the Xi to fall to the left of x. This is illustrated in Figure
8.10.
FIGURE 8.10
The event X(j) ≤ x is equivalent to the event “at least j Xi’s fall to the left of x”.

Since it appears that the number of Xi to the left of x will be important to us,
let’s define a new random variable, N , to keep track of just that: define N to
be the number of Xi that land to the left of x. Each Xi lands to the left of x
with probability F (x), independently. If we define success as landing to the left
of x, we have n independent Bernoulli trials with probability F (x) of success, so
N ∼ Bin(n, F(x)). Then, by the Binomial PMF,
$$
\begin{aligned}
P(X_{(j)} \le x) &= P(\text{at least } j \text{ of the } X_i \text{ are to the left of } x) \\
&= P(N \ge j) \\
&= \sum_{k=j}^{n} \binom{n}{k} F(x)^k (1 - F(x))^{n-k}.
\end{aligned}
$$

We thus have the following result for the CDF of X(j) .


Theorem 8.6.3 (CDF of order statistic). Let X1 , . . . , Xn be i.i.d. continuous r.v.s
with CDF F . Then the CDF of the jth order statistic X(j) is
$$
P(X_{(j)} \le x) = \sum_{k=j}^{n} \binom{n}{k} F(x)^k (1 - F(x))^{n-k}.
$$
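In R, this sum is a Binomial upper tail probability, so it can be computed as 1 - pbinom(j-1, n, F(x)). Here is a quick sketch with i.i.d. standard Normals (so F = pnorm) and the arbitrary values n = 5, j = 3, x = 0.5:
n <- 5; j <- 3; x <- 0.5                # illustrative values
1 - pbinom(j - 1, n, pnorm(x))          # CDF of the 3rd order statistic at 0.5, from the theorem
sims <- matrix(rnorm(10^5 * n), ncol = n)
mean(apply(sims, 1, sort)[j, ] <= x)    # empirical estimate of the same probability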

To get the PDF of X(j) , we can differentiate the CDF with respect to x, but the
resulting expression is ugly (though it can be simplified). Instead we will take a more
direct approach. Consider fX(j) (x)dx, the probability that the jth order statistic falls
into an infinitesimal interval of length dx around x. The only way this can happen
is illustrated in Figure 8.11. We need one of the Xi to fall into the infinitesimal
interval around x, and we need exactly j − 1 of the Xi to fall to the left of x, leaving
the remaining n − j to fall to the right of x.

FIGURE 8.11
In order for X(j) to fall within a small interval of x, we require that one of the Xi
fall within the small interval and that exactly j − 1 fall to the left of x.

What is the probability of this extremely specific event? Let’s break up the experi-
ment into stages.
• First, we choose which one of the Xi will fall into the infinitesimal interval around
x. There are n such choices, each of which occurs with probability f (x)dx, where
f is the PDF of the Xi .
• Next, we choose exactly j − 1 out of the remaining n − 1 to fall to the left of x.
There are $\binom{n-1}{j-1}$ such choices, each with probability $F(x)^{j-1}(1 - F(x))^{n-j}$ by the
Bin(n − 1, F(x)) PMF.
We multiply the probabilities of the two stages to get

$$
f_{X_{(j)}}(x)\, dx = n f(x)\, dx\, \binom{n-1}{j-1} F(x)^{j-1} (1 - F(x))^{n-j}.
$$

Dropping the dx’s from both sides gives us the PDF we desire.
Theorem 8.6.4 (PDF of order statistic). Let X1 , . . . , Xn be i.i.d. continuous r.v.s
with CDF F and PDF f . Then the marginal PDF of the jth order statistic X(j) is

$$
f_{X_{(j)}}(x) = n \binom{n-1}{j-1} f(x)\, F(x)^{j-1} (1 - F(x))^{n-j}.
$$
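As a sketch of how this formula can be checked in R, we can compare a histogram of simulated jth order statistics to the density above, here with i.i.d. Expo(1) r.v.s and the arbitrary choices n = 5, j = 3:
n <- 5; j <- 3                          # illustrative values
sims <- matrix(rexp(10^5 * n), ncol = n)
xj <- apply(sims, 1, sort)[j, ]         # the jth order statistic of each row
hist(xj, probability = TRUE, breaks = 50)
curve(n * choose(n - 1, j - 1) * dexp(x) * pexp(x)^(j - 1) * (1 - pexp(x))^(n - j),
      add = TRUE)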

In general, the order statistics of X1, . . . , Xn will not follow a named distribution,
but the order statistics of the standard Uniform distribution are an exception.

Example 8.6.5 (Order statistics of Uniforms). Let U1, . . . , Un be i.i.d. Unif(0, 1).
Then for 0 ≤ x ≤ 1, f(x) = 1 and F(x) = x, so the PDF of U(j) is

$$
f_{U_{(j)}}(x) = n \binom{n-1}{j-1} x^{j-1} (1 - x)^{n-j}.
$$

This is the Beta(j, n − j + 1) PDF! So U(j) ∼ Beta(j, n − j + 1), and E(U(j)) = j/(n + 1).

The simple case n = 2 is consistent with Example 7.2.2, where we used 2D LOTUS
to show that for i.i.d. U1, U2 ∼ Unif(0, 1),

E(max(U1 , U2 )) = 2/3, E(min(U1 , U2 )) = 1/3.

Now that we know max(U1 , U2 ) and min(U1 , U2 ) follow Beta distributions, the ex-
pectation of the Beta distribution confirms our earlier findings. □
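A quick simulation of the n = 2 case in R: the max and min of two i.i.d. Unif(0, 1) r.v.s should behave like Beta(2, 1) and Beta(1, 2) r.v.s, with means 2/3 and 1/3.
u1 <- runif(10^5)
u2 <- runif(10^5)
mean(pmax(u1, u2))      # should be close to 2/3, the mean of Beta(2, 1)
mean(pmin(u1, u2))      # should be close to 1/3, the mean of Beta(1, 2)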

8.7 Recap

In this chapter we discussed three broad classes of transformations:


• invertible transformations Y = g(X) of continuous random vectors, which can be
handled with the change of variables formula (under some technical assumptions,
most notably that the partial derivatives ∂xi/∂yj exist and are continuous);
• convolutions, for which we can determine the distribution using (in decreasing
order of preference) stories, MGFs, or convolution sums/integrals;
• the transformation of i.i.d. continuous r.v.s to their order statistics.
Figure 8.12 illustrates connections between the original random vector (X, Y ) and
the transformed random vector (Z, W ) = g(X, Y ), where g is an invertible trans-
formation satisfying certain technical assumptions. The change of variables formula
uses Jacobians to take us back and forth between the joint PDF of (X, Y ) and the
joint PDF of (Z, W ).
Let A be a region in the support of (X, Y), and B = {g(x, y) : (x, y) ∈ A} be the
corresponding region in the support of (Z, W). Then (X, Y) ∈ A is the same event
as (Z, W) ∈ B, so

P((X, Y) ∈ A) = P((Z, W) ∈ B).
To find this probability, we can either integrate the joint PDF of (X, Y ) over A or
integrate the joint PDF of (Z, W ) over B.

FIGURE 8.12
Let (Z, W ) = g(X, Y ), where g is an invertible transformation satisfying certain
technical assumptions. The change of variables formula lets us go back and forth
between the joint PDFs of (X, Y ) and (Z, W ).

In this chapter, as in many others, we made extensive use of Bayes’ rule and LOTP,
especially in continuous or hybrid forms. And we often used the strategy of integra-
tion by pattern recognition. Since any valid PDF must integrate to 1, and by now
we know lots of valid PDFs, we can often use probability to help us do calculus in
addition to using calculus to help us do probability!
The two new distributions we introduced are the Beta and Gamma, which are laden
with stories and connections to other distributions. The Beta is a generalization of
the Unif(0, 1) distribution, and it has the following stories.
• Order statistics of the Uniform: The jth order statistic of n i.i.d. Unif(0, 1) r.v.s
is distributed Beta(j, n − j + 1).
• Unknown probability, conjugate prior of the Binomial: If p ∼ Beta(a, b) and X|p ∼
Bin(n, p), then p|X = k ∼ Beta(a + k, b + n − k). The posterior distribution of
p stays within the Beta family of distributions after updating based on Binomial
data, a property known as conjugacy. The parameters a and b can be interpreted
as the prior number of successes and failures, respectively.
The Gamma is a generalization of the Exponential distribution, and it has the
following stories.
• Poisson process: In a Poisson process of rate λ, the total waiting time for n arrivals
is distributed Gamma(n, λ). Thus the Gamma is the continuous analog of the
Negative Binomial distribution.
• Unknown rate, conjugate prior of the Poisson: If λ ∼ Gamma(r0, b0) and Y |λ ∼
Pois(λt), then λ|Y = y ∼ Gamma(r0 + y, b0 + t). The posterior distribution of λ
stays within the Gamma family of distributions after updating based on Poisson
data. The parameters r0 and b0 can be interpreted as the prior number of observed
successes and the total waiting time for those successes, respectively.
The Beta and Gamma distributions are related by the bank–post office story, which
says that if X ∼ Gamma(a, λ) and Y ∼ Gamma(b, λ) are independent, then X + Y ∼
Gamma(a + b, λ) and X/(X + Y) ∼ Beta(a, b), with X + Y and X/(X + Y) independent.
The diagram of connections, which we last saw in Chapter 5, is hereby updated to
include the Beta and Gamma distributions. Distributions listed in parentheses are
special cases of the ones not in parentheses.

[Diagram of connections: HGeom, Bin (Bern), Pois, Beta (Unif), Gamma (Expo), and
NBin (Geom), linked by Limit, Conditioning, Conjugacy, Poisson process, and
Bank–Post Office relationships.]

8.8 R

Beta and Gamma distributions

The Beta and Gamma distributions are programmed into R.


• dbeta, pbeta, rbeta: To evaluate the Beta(a, b) PDF or CDF at x, we use
dbeta(x,a,b) and pbeta(x,a,b). To generate n realizations from the Beta(a, b)
distribution, we use rbeta(n,a,b).
• dgamma, pgamma, rgamma: To evaluate the Gamma(a, λ) PDF or CDF at x, we use
dgamma(x,a,lambda) or pgamma(x,a,lambda). To generate n realizations from
the Gamma(a, λ) distribution, we use rgamma(n,a,lambda).
For example, we can check that the Gamma(3, 2) distribution has mean 3/2 and
variance 3/4. To do this, we generate a large number of Gamma(3, 2) random vari-
ables using rgamma, then compute their mean and var:
y <- rgamma(10^5,3,2)
mean(y)
var(y)
Did you get values that were close to 1.5 and 0.75, respectively?

Convolution of Uniforms
Using R, we can quickly verify that for i.i.d. X, Y ∼ Unif(0, 1), the distribution of
T = X + Y is triangular in shape:
x <- runif(10^5)
y <- runif(10^5)
hist(x+y)
The histogram looks like an ascending and then descending staircase, a discrete
approximation to a triangle.
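To go a bit further (this overlay is not in the original code), we can reuse the x and y simulated above, put the histogram on the density scale, and compare it to the exact triangular PDF of T = X + Y, which is t for 0 < t ≤ 1 and 2 − t for 1 < t < 2:
hist(x + y, probability = TRUE)                               # histogram on the density scale
curve(ifelse(x < 1, x, 2 - x), from = 0, to = 2, add = TRUE)  # exact triangular density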

Bayes’ billiards

In the Bayes’ billiards story, we have n white balls and 1 gray ball, throw them onto
the unit interval completely at random, and count the number of white balls to the
left of the gray ball. Letting p be the position of the gray ball and X be the number
of white balls to the left of the gray ball, we have

p ∼ Unif(0, 1)
X|p ∼ Bin(n, p).

By performing this experiment a large number of times, we can verify the results
we derived in this chapter about the marginal PMF of X and the posterior PDF of
p given X = x. We’ll let the number of simulations be called nsim, to avoid a name
conflict with the number of white balls, n, which we set equal to 10:
nsim <- 10^5
n <- 10

We simulate 10^5 values of p, then simulate 10^5 values from the conditional
distribution of X given p:
p <- runif(nsim)
x <- rbinom(nsim,n,p)
Notice that we feed the entire vector p into rbinom. This means that the first element
of x is generated using the first element of p, the second element of x is generated
using the second element of p, and so forth. Thus, conditional on a particular
element of p, the corresponding element of x is Binomial, but the elements of p are
themselves Uniform, exactly as the model specifies.
According to the Bayes’ billiards argument, the marginal distribution of X should
be Discrete Uniform on the integers 0 through n. Is this in fact the case? We can
make a histogram of x to check! Because the distribution of X is discrete, we tell
R to make the histogram breaks at −0.5, 0.5, 1.5, . . . so that each bar is centered at
an integer value:
hist(x,breaks=seq(-0.5,n+0.5,1))
Indeed, all the histogram bars are approximately equal in height, consistent with a
Discrete Uniform distribution.
Now for the posterior distribution of p given X = x. Conditioning is very simple
in R. To consider only the simulated values of p where the value of X was 3, we
use square brackets, like this: p[x==3]. In particular, we can create a histogram of
these values to see what the posterior distribution of p given X = 3 looks like; try
hist(p[x==3]).
According to the Beta-Binomial conjugacy result, the true posterior distribution
is p|X = 3 ∼ Beta(4, 8). We can plot the histogram of p[x==3] next to a his-
togram of simulated values from the Beta(4, 8) distribution to confirm that they
look similar:
par(mfrow=c(1,2))
hist(p[x==3])
hist(rbeta(10^4,4,8))
The first line tells R we want two side-by-side plots, and the second and third lines
create the histograms.

Simulating order statistics

Simulating order statistics in R is easy: we simply simulate i.i.d. r.v.s and use sort
to sort them in increasing order. For example,
sort(rnorm(10))
produces one realization of X(1) , . . . , X(10) , where X1 , . . . , X10 are i.i.d. N (0, 1).
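To study the sampling distribution of a single order statistic, we can wrap this in replicate. For example, the following sketch keeps the 9th smallest of 10 i.i.d. N(0, 1) r.v.s in each of many repetitions and compares the histogram to the density from Theorem 8.6.4:
x9 <- replicate(10^4, sort(rnorm(10))[9])         # many realizations of X_(9)
hist(x9, probability = TRUE, breaks = 50)
curve(10 * choose(9, 8) * dnorm(x) * pnorm(x)^8 * (1 - pnorm(x)), add = TRUE)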
