Problem Set 1 Sol
E(X) = E(E(X | Y ))
(Note that a similar proof can be used for the discrete case)
Solution:
E(X) = ∫∫ x p(x, y) dx dy = ∫ ( ∫ x p(x | y) dx ) p(y) dy
     = ∫ E(X | Y = y) p(y) dy = E(E(X | Y))
Solution:
For variety, I will prove this one for the discrete case.
E(var(X | Y)) + var(E(X | Y))
= E(E(X^2 | Y) − (E(X | Y))^2) + E((E(X | Y))^2) − (E(E(X | Y)))^2
= E(X^2) − E((E(X | Y))^2) + E((E(X | Y))^2) − (E(X))^2
= E(X^2) − (E(X))^2 = var(X)
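Since the proof above works in the discrete case, both identities can be verified exactly on a small finite example. The joint pmf below is hypothetical, chosen only for illustration; this is a minimal Python sketch, not part of the original solution.

```python
# Exact check of E(X) = E(E(X|Y)) and var(X) = E(var(X|Y)) + var(E(X|Y))
# on a small hypothetical joint pmf over (X, Y), chosen for illustration.
joint = {(1, 0): 0.2, (2, 0): 0.2, (3, 1): 0.3, (5, 1): 0.3}

def E(f):
    """Expectation of f(x, y) under the joint pmf."""
    return sum(p * f(x, y) for (x, y), p in joint.items())

# Marginal pmf of Y.
p_y = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (0, 1)}

def cond_moment(y, k):
    """E(X^k | Y = y), by restricting and renormalizing the joint pmf."""
    return sum(p * x ** k for (x, yy), p in joint.items() if yy == y) / p_y[y]

EX = E(lambda x, y: x)
tower = sum(p_y[y] * cond_moment(y, 1) for y in p_y)      # E(E(X|Y))

var_X = E(lambda x, y: x ** 2) - EX ** 2
e_cond_var = sum(p_y[y] * (cond_moment(y, 2) - cond_moment(y, 1) ** 2) for y in p_y)
var_cond_mean = sum(p_y[y] * cond_moment(y, 1) ** 2 for y in p_y) - tower ** 2

print(EX, tower)                          # equal (≈ 3.0) up to float error
print(var_X, e_cond_var + var_cond_mean)  # equal (≈ 2.2) up to float error
```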
(c) Let y denote the observed data. We assume y was generated from p(y | θ), where θ, the parameters governing the sampling of y, are random and distributed according to p(θ). Use the above to describe (i.e. understand the equations and then put them into words) the relationship between the mean and variance of the prior p(θ) and the posterior p(θ | y).
Solution:
Plugging θ and y into the first equation gives us
E(θ) = E(E(θ | y)),
which means that the prior mean over the parameters is the average over all possible posterior means, taken under the distribution of possible data. This is in the opposite direction of the way we are used to thinking of priors and posteriors, but in fact matches our intuition of what the prior should capture. The second equation,
var(θ) = E(var(θ | y)) + var(E(θ | y)),
similarly says that the prior variance equals the expected posterior variance plus the variance of the posterior means; in particular, the posterior variance is, on average over the data, smaller than the prior variance.
3. Posterior of a Poisson Distribution. Suppose that X is the number of pregnant women arriving at a particular hospital to deliver babies in a given month. The discrete count nature of the data plus its natural interpretation as an arrival rate suggest adopting a Poisson likelihood

p(x | θ) = e^{−θ} θ^x / x!,   x ∈ {0, 1, 2, . . .}, θ > 0
To provide support on the positive real line and reasonable flexibility, we suggest a Gamma G(α, β) distribution prior

p(θ) = θ^{α−1} e^{−θ/β} / (Γ(α) β^α),   θ > 0, α > 0, β > 0

where Γ(·) is a continuous generalization of the factorial function, satisfying Γ(c) = (c − 1)Γ(c − 1). Here α, β are the parameters of this prior, or the hyperparameters of the model. The Gamma distribution has mean αβ and variance αβ^2.
Show that the posterior distribution p(θ | x) is also Gamma distributed.
Determine its parameters α and β.
Solution:
Although the question was phrased for a univariate x, for generality, in the solution x will be a vector of n observations x_i, each of which follows the Poisson distribution. In this case we have

p(x | θ) ∝ θ^{Σ_i x_i} e^{−nθ}

and multiplying by the prior gives

p(θ | x) ∝ θ^{α−1+Σ_i x_i} e^{−θ(n+1/β)}

which is again Gamma distributed, with updated parameters α′ = α + Σ_i x_i and β′ = β/(nβ + 1).
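One way to sanity-check this conjugate update numerically is to verify that prior × likelihood is proportional to the claimed Gamma posterior, i.e. that their log-difference is constant in θ. The counts and hyperparameters below are hypothetical, chosen only for illustration; this is a sketch, not part of the original solution.

```python
import math

# Hypothetical counts and hyperparameters, for illustration only.
xs = [3, 7, 5]            # observed Poisson counts
n, S = len(xs), sum(xs)
alpha, beta = 2.0, 1.5    # Gamma prior: shape alpha, scale beta

def log_prior(t):
    # log of the Gamma(alpha, beta) density at t (shape-scale form)
    return (alpha - 1) * math.log(t) - t / beta - math.lgamma(alpha) - alpha * math.log(beta)

def log_lik(t):
    # log of prod_i Poisson(x_i | t)
    return S * math.log(t) - n * t - sum(math.lgamma(x + 1) for x in xs)

# Claimed posterior: Gamma(alpha + S, beta / (n*beta + 1))
a_post, b_post = alpha + S, beta / (n * beta + 1)

def log_post(t):
    return (a_post - 1) * math.log(t) - t / b_post - math.lgamma(a_post) - a_post * math.log(b_post)

# If the update is right, log prior + log likelihood differs from the claimed
# log posterior by a constant (the log marginal likelihood).
diffs = [log_prior(t) + log_lik(t) - log_post(t) for t in (0.5, 1.0, 2.0, 4.0)]
print(max(diffs) - min(diffs))  # ~0, up to floating-point error
```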
Solution:
For a fixed α, the larger the value of β the wider the distribution; β thus behaves like a stretching or scale parameter. For a fixed β, α determines the form of the distribution. At α = 1 there is a transition point: for α ≤ 1 the distribution is highest near the origin and decays monotonically, while for α > 1 the distribution is unimodal and resembles a normal distribution more closely as α grows.
(b) Continuing the previous question involving births, assume that in December 2008 we observed x = 42 moms arriving at the hospital to deliver babies, and suppose we adopt a Gamma(5, 6) prior, which has mean 30 and variance 180, reflecting the hospital's totals for the two preceding years. Use Matlab/R to plot the posterior distribution of θ next to its prior. What are your conclusions?
Solution:
In this case we have a single observation x = 42 (so n = 1), and the posterior is Gamma(42 + 5, 6/(1 · 6 + 1)) = Gamma(47, 6/7).
[Figure: prior (Gamma(5, 6)) and posterior (Gamma(47, 6/7)) densities P(θ) plotted over θ ∈ [0, 100]; the posterior is far more concentrated, peaking near θ ≈ 40.]
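The exercise asks for a Matlab/R plot; an equivalent stdlib-only Python sketch (not the original code) computes the prior and posterior moments and a density grid that can be handed to any plotting library.

```python
import math

def gamma_pdf(t, a, b):
    """Gamma density with shape a and scale b (mean a*b), for a theta grid."""
    return math.exp((a - 1) * math.log(t) - t / b - math.lgamma(a) - a * math.log(b))

# Prior Gamma(5, 6) vs posterior Gamma(47, 6/7) from the single observation x = 42.
for name, a, b in [("prior", 5, 6), ("posterior", 47, 6 / 7)]:
    print(f"{name}: mean = {a * b:.2f}, sd = {math.sqrt(a) * b:.2f}")
# prior: mean = 30.00, sd = 13.42
# posterior: mean = 40.29, sd = 5.88

# Densities over theta in [1, 100], ready to plot.
grid = [(t, gamma_pdf(t, 5, 6), gamma_pdf(t, 47, 6 / 7)) for t in range(1, 101)]
```

The numbers already tell the story shown in the figure: the single month of data pulls the mean from 30 toward 42 and cuts the standard deviation by more than half.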
5. Extinction of Species. Paleobotanists estimate the moment in the remote past when a given species became extinct by taking cylindrical, vertical core samples well below the earth's surface and looking for the last occurrence of the species in the fossil record, measured in meters above the point P at which the species was known to have first emerged. Letting {y_1, . . . , y_n} denote a sample of such distances above P at a random set of locations, the model

(y_i | θ) ∼ Unif(0, θ)

emerges from simple and plausible assumptions. In this model the unknown θ > 0 can be used, through carbon dating, to estimate the species extinction time. This problem is about Bayesian inference for θ, and it will be seen that some of our usual intuitions do not quite hold in this case.
(a) Show that the likelihood may be written as
l(θ : y) = θ^{−n} I(θ ≥ max(y_1, . . . , y_n))
where I(A) = 1 if A is true and 0 otherwise.
Solution:
l(θ : y) = Π_i p(y_i | θ) = Π_i (1/θ) I(0 < y_i ≤ θ)
         = θ^{−n} Π_i I(y_i ≤ θ) = θ^{−n} I(θ ≥ max(y_1, . . . , y_n))

where we also used the fact that all y_i, by the experiment design, are guaranteed to be positive.
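The closed form above can be checked numerically against the direct product of Unif(0, θ) densities. The data below are hypothetical, used only to exercise both expressions.

```python
# Numerical check of the closed form l(theta : y) = theta^(-n) I(theta >= max(y_i))
# against the direct product of Unif(0, theta) densities, on hypothetical data.
ys = [1.2, 3.4, 2.8]
n, m = len(ys), max(ys)

def lik_product(theta):
    p = 1.0
    for y in ys:
        p *= (1.0 / theta) if 0 < y <= theta else 0.0
    return p

def lik_closed(theta):
    return theta ** (-n) if theta >= m else 0.0

for t in (2.0, 3.4, 5.0):
    print(t, lik_product(t), lik_closed(t))  # the two columns agree
```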
(b) The Pareto(α, β) distribution has density
p(θ) = α β^α θ^{−(α+1)} for θ ≥ β, and 0 otherwise,

where α, β > 0. The Pareto distribution has mean αβ/(α − 1) for α > 1 and variance αβ^2/((α − 1)^2 (α − 2)) for α > 2.
With the likelihood viewed as a constant multiple of a density for θ, show that the likelihood corresponds to the Pareto(n − 1, m) distribution. Now let the prior for θ be taken to be Pareto(α, β) and derive the posterior distribution p(θ | y). Is the Pareto conjugate to the uniform?
Solution:
We define m = max(y_1, . . . , y_n) (this was supposed to be part of the question but was omitted by mistake). The likelihood can then be written as

l(θ : y) = θ^{−n} I(m ≤ θ) ∝ (n − 1) m^{n−1} θ^{−[(n−1)+1]} I(m ≤ θ)

which is, by definition, a Pareto distribution with parameters n − 1 and m. Together with a Pareto(α, β) prior we have

p(θ | y) ∝ θ^{−n} I(m ≤ θ) · α β^α θ^{−(α+1)} I(β ≤ θ) ∝ θ^{−[(α+n)+1]} I(θ ≥ max(m, β))

which is a Pareto(α + n, max(m, β)) distribution, so the Pareto is indeed conjugate to the uniform.
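The Pareto update for uniform data can be checked the same way as the Gamma-Poisson case: the unnormalized posterior (likelihood times prior) should be a constant multiple of the claimed Pareto density wherever both are positive. The data and prior parameters below are hypothetical, for illustration only.

```python
# Check of the Uniform/Pareto conjugate update on hypothetical data:
# prior Pareto(alpha, beta), y_i ~ Unif(0, theta) -> posterior Pareto(alpha + n, max(beta, m)).
ys = [1.2, 3.4, 2.8]       # hypothetical distances above P
alpha, beta = 2.5, 1.0     # hypothetical prior parameters
n, m = len(ys), max(ys)

a_post, b_post = alpha + n, max(beta, m)

def pareto_pdf(t, a, b):
    """Pareto(a, b) density: a * b^a * t^-(a+1) on t >= b, else 0."""
    return a * b ** a * t ** (-(a + 1)) if t >= b else 0.0

def unnorm_post(t):
    lik = t ** (-n) if t >= m else 0.0
    return lik * pareto_pdf(t, alpha, beta)

# Wherever both are positive, the ratio of the unnormalized posterior to the
# claimed Pareto(a_post, b_post) density should be constant.
ratios = [unnorm_post(t) / pareto_pdf(t, a_post, b_post) for t in (3.5, 4.0, 6.0, 9.0)]
print(a_post, b_post, max(ratios) - min(ratios))  # 5.5, 3.4, then ~0
```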
[Figure: prior, likelihood, and posterior densities p(θ) plotted over θ ∈ [0, 10]; the posterior is the most concentrated of the three.]
(d) Make a table summarizing the mean and standard deviation for the prior, likelihood and posterior distributions, using the (α, β) choices and the data in part (d) above. In Bayesian updating the posterior mean is often a weighted average of the prior mean and the likelihood mean (with positive weights), and the posterior standard deviation is typically smaller than either the prior or likelihood standard deviations. Is each of these behaviors true in this case? Explain briefly.
Solution:
For the prior, likelihood and posterior distributions, the shape parameter (α) of the Pareto distribution is greater than 2, so that both the mean and standard deviation are finite (see the equations above). Specifically we have

        Prior     Likelihood   Posterior
Mean    6.667     5.513        5.326
STD     5.9628    0.6945       0.4649
In this case the posterior standard deviation is, as expected, smaller than both that of the prior and that of the likelihood distribution. However, atypically, the posterior mean is not a weighted average with positive weights: it lies further from the prior mean than the likelihood mean does. This is a result of the unique nature of the Pareto distribution, and can be seen directly by plugging, for a fixed β, the appropriate shape parameters into the mean equation of the Pareto distribution.
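The table entries follow from the Pareto moment formulas quoted in part (b). A small helper sketch (hypothetical function names, illustrative arguments, not the original computation) makes the mean and standard deviation explicit:

```python
import math

def pareto_mean(a, b):
    """Mean a*b/(a-1) of Pareto(a, b); finite only for a > 1."""
    if a <= 1:
        raise ValueError("mean is undefined for a <= 1")
    return a * b / (a - 1)

def pareto_sd(a, b):
    """Standard deviation sqrt(a*b^2 / ((a-1)^2 * (a-2))); finite only for a > 2."""
    if a <= 2:
        raise ValueError("standard deviation is undefined for a <= 2")
    return math.sqrt(a * b * b / ((a - 1) ** 2 * (a - 2)))

# Illustrative call on made-up parameters Pareto(3, 2): mean 3.0, sd sqrt(3).
print(pareto_mean(3, 2), pareto_sd(3, 2))
```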