SampleQs Solutions PDF
As you might have gathered if you attempted these problems, they are quite long relative to the 24
minutes you have available to attempt similar questions in the exam; I am aware of this. However, these
questions were designed to cover as many of the topics we studied in the course as possible.
(a) State Cramer’s result (also known as the Delta Method) on the asymptotic normal distribution
of a (scalar) random variable Y defined in terms of random variable X via the transformation
Y = g(X), where X is asymptotically normally distributed
X ∼ AN(µ, σ²).

Then

Y ∼ AN(g(µ), {ġ(µ)}² σ²).
(b) Suppose that X1 , ..., Xn are independent and identically distributed P oisson (λ) random variables.
Find the maximum likelihood (ML) estimator, and an asymptotic normal distribution for the
estimator, of the following parameters
(i) λ,
(ii) exp {−λ}.
(i) Writing s_n = Σ_{i=1}^n x_i for the observed sum (with S_n the corresponding random variable), the derivative of the log likelihood is

l̇(λ) = −n + s_n/λ,

so equating this to zero yields λ̂ = s_n/n = x̄. Using the Central Limit Theorem,

(S_n − nλ)/√n →_L N(0, λ)

as E[X] = Var[X] = λ. Hence

S_n ∼ AN(nλ, nλ)  ⟹  X̄ = S_n/n ∼ AN(λ, λ/n).
[3 MARKS]
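Not part of the model answer, but the asymptotic approximation X̄ ∼ AN(λ, λ/n) is easy to check numerically. The following is a minimal Python/NumPy sketch; the values of lam, n and reps are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
lam, n, reps = 3.0, 200, 20000                     # arbitrary illustrative values

# replicate X-bar many times and compare with the AN(lambda, lambda/n) approximation
xbar = rng.poisson(lam, size=(reps, n)).mean(axis=1)
print("mean of X-bar:", xbar.mean(), "  target:", lam)
print("var  of X-bar:", xbar.var(), "  target:", lam / n)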
(ii) By invariance of ML estimators to reparameterization, or from first principles, the ML estimator of
φ = exp(−λ) is φ̂ = exp(−X̄) = Tn , say.
For Cramer’s Theorem (Delta Method), let g(t) = exp(−t), so that ġ(t) = − exp(−t). Thus
Tn ∼ AN (exp(−λ), exp(−2λ)λ/n)
[3 MARKS]
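As an informal check of the delta-method variance (again not part of the model answer; lam, n and reps below are arbitrary choices), one can compare the simulated variance of T_n = exp(−X̄) with exp(−2λ)λ/n.

import numpy as np

rng = np.random.default_rng(1)
lam, n, reps = 2.0, 500, 20000                     # arbitrary illustrative values

tn = np.exp(-rng.poisson(lam, size=(reps, n)).mean(axis=1))   # T_n = exp(-X-bar)
print("simulated var of T_n :", tn.var())
print("delta-method variance:", np.exp(-2 * lam) * lam / n)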
(i) We now effectively have a Bernoulli sampling model; let Yi be a random variable taking the value
0 if Xi = 0, and 1 otherwise; note that P [Yi = 0] = P [Xi = 0] = exp(−λ) = θ, say, so that the log
likelihood is
l(θ) = (n − m) log θ + m log(1 − θ)
where m = Σ_{i=1}^n y_i is the number of times that Y_i, and hence X_i, is greater than zero. From this
likelihood, the ML estimate of θ is θ̂ = (n − m)/n, and hence the ML estimate of λ is

λ̂ = −log(θ̂) = −log((n − m)/n),

and the corresponding estimator is T_n = −log((1/n) Σ_{i=1}^n (1 − Y_i)).
[2 MARKS]
(ii) This estimate is not finite if m = n, that is, if we never observe X_i = 0 in the sample, so that
m = Σ_{i=1}^n y_i = n.
[2 MARKS]
(iii) The event of interest from (ii) occurs with the following probability:

P[Σ_{i=1}^n Y_i = n] = Π_{i=1}^n P[Y_i = 1] = Π_{i=1}^n [1 − exp(−λ0)] = (1 − exp(−λ0))^n,

which, unless λ0 is small or n is large, can be appreciable. Thus, for a finite value of n, there is a non-zero probability
that the estimator is not finite.
[3 MARKS]
(iv) Consistency (weak or strong) for λ will follow from the consistency of the estimator of θ together with the continuity of −log(·) on (0, 1], as we have, from the Strong Law,

(1/n) Σ_{i=1}^n (1 − Y_i) →_a.s. θ.
The only slight practical problem is that raised in (ii) and (iii), the finiteness of the estimator. We can
overcome this by defining the estimator as follows; estimate λ by

T′_n = −log((1/n) Σ_{i=1}^n (1 − Y_i))   if min{Y_1, . . . , Y_n} = 0
T′_n = k                                  if min{Y_1, . . . , Y_n} = 1

where k is some constant value. As the event (min{Y_1, . . . , Y_n} = 1), that is, the event that no zero count is observed, occurs with probability (1 − exp(−λ0))^n,
which converges to 0 as n −→ ∞, this adjustment does not disrupt the strong convergence. Note that
we could choose k = 1, or k = 10^10, and consistency would be preserved.
[3 MARKS]
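To see the practical effect of the adjustment, here is a hedged simulation sketch (not part of the model answer; lam0, n, reps and the fallback constant k are arbitrary choices). It estimates the probability that no zero count is observed, and evaluates the adjusted estimator T′_n.

import numpy as np

rng = np.random.default_rng(2)
lam0, n, reps, k = 2.5, 30, 50000, 1.0             # k is the arbitrary fallback value

x = rng.poisson(lam0, size=(reps, n))
zero_frac = (x == 0).mean(axis=1)                  # estimate of theta = exp(-lambda0)

# the unadjusted estimator -log(zero_frac) is infinite whenever no zeros are observed
no_zeros = zero_frac == 0.0
print("P(no zeros observed) ~", no_zeros.mean(), "  theory:", (1 - np.exp(-lam0)) ** n)

# adjusted estimator T'_n: fall back to k when no zeros occur
t_adj = np.where(no_zeros, k, -np.log(np.where(no_zeros, 1.0, zero_frac)))
print("mean of adjusted estimator:", t_adj.mean(), "  true lambda0:", lam0)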
(a) Suppose that X(1) < . . . < X(n) are the order statistics from a random sample of size n from a
distribution FX with continuous density fX on R. Suppose 0 < p1 < p2 < 1, and denote the
quantiles of FX corresponding to p1 and p2 by xp1 and xp2 respectively.
Regarding x_p1 and x_p2 as unknown parameters, natural estimators of these quantities are X_(⌈np1⌉)
and X_(⌈np2⌉) respectively, where ⌈x⌉ is the smallest integer not less than x. Show that
√n (X_(⌈np1⌉) − x_p1, X_(⌈np2⌉) − x_p2)^T →_L N(0, Σ)

where Σ has entries

Σ11 = p1(1 − p1)/{f_X(x_p1)}²,   Σ12 = Σ21 = p1(1 − p2)/{f_X(x_p1) f_X(x_p2)},   Σ22 = p2(1 − p2)/{f_X(x_p2)}².
This is bookwork, from the handout that I gave out in lectures. In solving the problem, it is legitimate
to state without proof some of the elementary parts; in terms of the handout, after describing the set
up, you would be allowed to quote without proof Results 1 through 3, and would only need to give the
full details for the final parts.
For the final result, for a single quantile xp , we have that
√n (X_(⌈np⌉) − x_p) →_L N(0, p(1 − p)/{f_X(x_p)}²).
[10 MARKS]
(i) The sample median estimator of the median of F_X (corresponding to p = 0.5), if F_X is a Normal
distribution with parameters µ and σ².
(ii) The lower and upper quartile estimators (corresponding to p1 = 0.25 and p2 = 0.75) if F_X is
an Exponential distribution with parameter λ.
(i) Here we have p = 0.5, and xp = µ, as the Normal distribution is symmetric about µ.
√n (X_(⌈n/2⌉) − µ) →_L N(0, (1/2)(1/2)/{f_X(µ)}²) ≡ N(0, πσ²/2)

as f_X(µ) = 1/√(2πσ²), and hence X_(⌈n/2⌉) ∼ AN(µ, πσ²/(2n)).
[3 MARKS]
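A quick numerical sanity check of the πσ²/(2n) variance (a sketch only, not part of the solution; mu, sigma, n and reps are illustrative, with n odd so that the sample median is the middle order statistic):

import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, reps = 1.0, 2.0, 401, 20000          # n odd: the median is X_(201) here

med = np.median(rng.normal(mu, sigma, size=(reps, n)), axis=1)
print("simulated var of sample median :", med.var())
print("asymptotic value pi*sigma^2/(2n):", np.pi * sigma**2 / (2 * n))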
(ii) For probability p the corresponding quantile is given by x_p = −log(1 − p)/λ, since F_X(x) = 1 − exp(−λx), and f_X(x_p) = λ exp(−λx_p) = λ(1 − p). Substituting p1 = 0.25 and p2 = 0.75 into the covariance matrix from (a) gives

Σ11 = 1/(3λ²),   Σ12 = Σ21 = 1/(3λ²),   Σ22 = 3/λ²,

so that √n (X_(⌈n/4⌉) − x_0.25, X_(⌈3n/4⌉) − x_0.75)^T →_L N(0, Σ) with this Σ.
(c) The results in (a) and (b) describe convergence in law for the estimators concerned. Show how
the form of convergence may be strengthened using the Strong Law for any specific quantile xp .
The standard Strong Law result says, effectively, that for i.i.d. random variables X_1, X_2, . . ., and for an arbitrary (integrable) function G,

(1/n) Σ_{i=1}^n G(X_i; θ) →_a.s. E_{X|θ}[G(X)].

Taking G(X_i; θ) = 1{X_i ≤ x_p} and writing U_n = (1/n) Σ_{i=1}^n 1{X_i ≤ x_p}, we have strong convergence of the statistic on the left-hand side to E[1{X ≤ x_p}] = F_X(x_p) = p. Now F_X⁻¹ is a continuous,
monotone increasing function, so we can map both sides of the last result by F_X⁻¹ to obtain the result

F_X⁻¹(U_n) →_a.s. F_X⁻¹(p) = x_p.
[4 MARKS]
(a) (i) State (without proof ) Wald’s Theorem on the strong consistency of maximum likelihood (ML)
estimators, listing the five conditions under which this theorem holds.
Bookwork (although we focussed less on strong consistency of the MLE this year, and studied
weak consistency in more detail): Let X1 , . . . , Xn be i.i.d. with pdf fX (x|θ) (with respect
to measure ν), let Θ denote the parameter space, and let θ0 denote the true value of the
parameter θ. Suppose θ is 1-dimensional. Then, if
(1) Θ is compact,
(2) f_X(x|θ) is upper semi-continuous (USC) in θ on Θ for all x, that is, for all θ ∈ Θ and any
sequence {θ_n} such that θ_n −→ θ,

lim sup_n f_X(x|θ_n) ≤ f_X(x|θ)   for all x,

(3) there exists a function M(x) with E_{f_X|θ0}[M(X)] < ∞ such that

f_X(x|θ)/f_X(x|θ0) ≤ M(x)   for all x and all θ ∈ Θ,

(4) for all θ ∈ Θ and all sufficiently small δ > 0,

sup_{|θ′−θ|<δ} f_X(x|θ′)

is measurable (wrt ν) in x,
(5) if f_X(x|θ) = f_X(x|θ0) almost everywhere wrt ν in x, then θ = θ0; this is the identifia-
bility condition,
then any sequence of ML estimators {θ̂_n} of θ is strongly consistent for θ0, that is,

θ̂_n →_a.s. θ0

as n −→ ∞.
[5 MARKS]
(ii) Verify that the conditions of the theorem hold when random variables X1 , . . . , Xn correspond
to independent observations from the Uniform density on (0, θ)
f_X(x|θ) = 1/θ,   0 ≤ x ≤ θ,
and zero otherwise, for parameter θ ∈ Θ ≡ [a, b], where [a, b] is the closed interval from a to
b, 0 < a < b < ∞.
(1) Θ = [a, b] is a closed, bounded interval, and hence compact.
(2) For fixed x, f_X(x|θ) = 0 for θ < x and f_X(x|θ) = 1/θ for θ ≥ x; the only discontinuity in θ is an upward jump at θ = x, so f_X(x|θ) is upper semi-continuous in θ for every x.
(4) |θ′ − θ| < δ defines an interval centred at θ; as a function of θ′ on this interval, the density is zero to the left of x and equals 1/θ′ to the right of x, so sup_{|θ′−θ|<δ} f_X(x|θ′) is a piecewise monotone, and hence measurable, function of x.
(3) If

M(x) = max_{θ∈Θ} f_X(x|θ)/f_X(x|θ0)

then

M(x) = θ0/a,  x ≤ a;    M(x) = θ0/x,  a < x ≤ θ0;    M(x) = ∞,  x > θ0.
The expectation of M (X), when θ = θ0 , is finite as the third case is excluded (P [X > θ0 ] = 0).
(5) Identifiability is assured, as different θ values yield densities with different supports.
[5 MARKS]
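The strong consistency that the theorem guarantees can be illustrated numerically. In this model the ML estimator is the sample maximum clipped to the parameter space, θ̂_n = max(a, X_(n)); the sketch below (not part of the model answer, with a, b and θ0 chosen arbitrarily) shows the estimates settling down at θ0 along a single growing sample.

import numpy as np

rng = np.random.default_rng(4)
a, b, theta0 = 0.5, 5.0, 2.0                       # parameter space [a, b] and true value

x = rng.uniform(0.0, theta0, size=10000)           # one long sample path
for n in (10, 100, 1000, 10000):
    theta_hat = min(b, max(a, x[:n].max()))        # MLE restricted to Theta = [a, b]
    print(n, theta_hat)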
Suppose that random variables X_1, . . . , X_n correspond to independent observations from density (wrt
Lebesgue measure) f_X(x|θ), and that for θ ∈ Θ, this family of densities has common support X. Let
the true value of θ be denoted θ0, and let L_n(θ) denote the likelihood for θ
L_n(θ) = Π_{i=1}^n f_X(x_i|θ).
(i) Using Jensen’s inequality for the function g(x) = −log x, and an appropriate law of large
numbers, show that

P_{θ0}[L_n(θ0) > L_n(θ)] −→ 1 as n −→ ∞

for any fixed θ ≠ θ0, where P_{θ0} denotes probability under the true model, indexed by θ0.
This follows in a similar fashion to the proof of the positivity of the Kullback-Leibler (KL)
divergence;
L_n(θ0) > L_n(θ)  ⇔  L_n(θ0)/L_n(θ) > 1  ⇔  log{L_n(θ0)/L_n(θ)} > 0  ⇔  Σ_{i=1}^n log{f_X(X_i|θ0)/f_X(X_i|θ)} > 0.   (1)
Now, by the weak law of large numbers,

T_n(θ0, θ) = (1/n) Σ_{i=1}^n log{f_X(X_i|θ0)/f_X(X_i|θ)}  →_p  E_{f_X|θ0}[log{f_X(X|θ0)/f_X(X|θ)}] = K(θ0, θ).   (2)
To finish the proof we use the Kullback-Leibler proof method; from Jensen’s inequality,

E_{f_X|θ0}[log{f_X(X|θ0)/f_X(X|θ)}] = −E_{f_X|θ0}[log{f_X(X|θ)/f_X(X|θ0)}]
    ≥ −log E_{f_X|θ0}[f_X(X|θ)/f_X(X|θ0)]
    = −log ∫ {f_X(x|θ)/f_X(x|θ0)} f_X(x|θ0) dν
    = −log ∫ f_X(x|θ) dν ≥ −log 1 = 0,

with strict inequality in the Jensen step unless f_X(x|θ) = f_X(x|θ0) a.e. ν, so that K(θ0, θ) > 0 for θ ≠ θ0.
Since T_n(θ0, θ) →_p K(θ0, θ) > 0, it follows that

P_{θ0}[L_n(θ0) > L_n(θ)] = P_{θ0}[T_n(θ0, θ) > 0] −→ 1

as n −→ ∞.
Which other condition from (a)(i) needs to be assumed in order for the result to hold?
Identifiability; the strictness of the inequality relies on θ 6= θ0 .
[5 MARKS]
Show that, in this case, the ML estimator θ̂_n exists, and is weakly consistent for θ0.
Since Θ is finite, say Θ = {θ0, θ1, . . . , θK}, applying the result in (i) to each θ_j ≠ θ0 in turn and
combining the finitely many events via a union bound gives

P_{θ0}[L_n(θ0) > L_n(θ_j), j = 1, . . . , K] −→ 1

as n −→ ∞; on this event the maximizer θ̂_n of L_n over Θ equals θ0, so for any ε > 0,

P_{θ0}[|θ̂_n − θ0| < ε] ≥ P_{θ0}[θ̂_n = θ0] −→ 1,

which is the definition of weak consistency. Note that existence of the ML estimator (as a
finite value in the parameter space) is guaranteed for every n, as Θ is finite, and uniqueness
of the ML estimator is also guaranteed, with probability tending to 1, as n → ∞.
[5 MARKS]
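A small simulation sketch of this weak consistency argument (not part of the model answer): take a finite grid Θ and a Poisson(θ) model as illustrative choices, and watch P[θ̂_n = θ0] increase with n.

import numpy as np

rng = np.random.default_rng(5)
Theta = np.array([0.5, 1.0, 1.5, 2.0])             # a finite parameter space (illustrative)
theta0, reps = 1.5, 5000

for n in (5, 20, 80):
    s = rng.poisson(theta0, size=(reps, n)).sum(axis=1)        # sufficient statistic
    # log likelihood over the grid, dropping terms that do not involve theta
    loglik = s[:, None] * np.log(Theta)[None, :] - n * Theta[None, :]
    theta_hat = Theta[loglik.argmax(axis=1)]
    print(n, "P(theta_hat == theta0) ~", (theta_hat == theta0).mean())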
(a) (i) Give definitions for the following modes of stochastic convergence, summarizing the relation-
ships between the various modes;
• convergence in law (convergence in distribution)
• convergence almost surely
• convergence in rth mean
Bookwork: For a sequence of rvs X_1, X_2, . . . with distribution functions F_X1, F_X2, . . ., defined with a
common governing probability measure P on space Ω with associated sigma algebra A:

Convergence in law: X_n →_L X if and only if F_Xn(x) −→ F_X(x) as n → ∞ at every x at which F_X is continuous.

Convergence almost surely: X_n →_a.s. X if and only if X_n(ω) −→ X(ω) as n → ∞ almost everywhere with respect to P (that is, for all ω ∈ Ω except in sets A ∈ A such that P(A) = 0). Equivalently,

X_n →_a.s. X ⇐⇒ P[ lim_{n→∞} |X_n(ω) − X(ω)| < ε ] = 1, ∀ε > 0.

Also equivalently,

X_n →_a.s. X ⇐⇒ P[ |X_n(ω) − X(ω)| < ε for all sufficiently large n ] = 1, ∀ε > 0.

Convergence in rth mean: X_n →_r X if and only if E[|X_n − X|^r] −→ 0 as n → ∞.
In summary, convergence in law is implied by both convergence a.s. and convergence in rth
mean, but there are no general relations between the latter two modes.
[6 MARKS]
P [Xn = 0] = exp{−n} → 0
in which case
P [Xn = 0] = P [Z ≤ n] = 1 − exp{−n} → 1
as n → ∞, which makes things more interesting. Direct from the definition, we have X_n →_a.s. 0, as

P[ lim_{n→∞} |X_n| < ε ] = 1,

or equivalently

lim_{n→∞} P[ |X_k| < ε, ∀k ≥ n ] = 1.

To see this, for some n, n0 say, Z ∈ [0, n0), and thus for all k > n0, Z ∈ [0, k) also, so
|X_k| = 0 < ε.
Note that this result follows because we are considering a single Z that is used to define
the sequence {Xn }, so that the {Xn } are dependent random variables. If the {Xn } were
generated independently, using a sequence of independent rvs {Zn }, then assessment of
convergence would need use of, say, the Borel-Cantelli Lemma (b).
For convergence in rth mean for the new variable: note that
(b) Suppose that X1 , X2 , . . . are independent, identically distributed random variables defined on R,
with common distribution function FX for which FX (x) < 1 for all finite x. Let Mn be the
maximum random variable defined for finite n by
Mn = max{X1 , X2 , . . . , Xn }
(i) Show that the sequence of random variables {M_n} converges almost surely to infinity, that is

M_n →_a.s. ∞

as n → ∞.
We must show that, for almost all ω,

lim_{n→∞} M_n(ω) = ∞,

that is, for all x, there exists n0 = n0(ω, x) such that if n ≥ n0 then M_n(ω) ≥ x. Now, let

A′_x = ∪_{n=1}^∞ ∩_{k=n}^∞ (M_k > x),

that is, if ω ∈ A′_x then there exists an n such that for all k ≥ n, M_k(ω) > x. Since the M_k are non-decreasing in k, ∩_{k=n}^∞ (M_k > x) = (M_n > x), and as F_X(x) < 1 for all finite x,

P[M_n > x] = 1 − {F_X(x)}^n −→ 1 as n → ∞,

so P(A′_x) = lim_{n→∞} P[M_n > x] = 1. Thus B′ = ∩_x A′_x, with the intersection taken over x = 1, 2, 3, . . .,
has probability 1 under P, so that for all ω in a set of probability 1,

lim_{n→∞} M_n(ω) = ∞.
[5 MARKS]
(ii) Now suppose instead that F_X(x) = 1 for some finite x, and let x_U = inf{x : F_X(x) = 1} denote the upper support point. We demonstrate that M_n →_a.s. x_U. Fix ε > 0. Let E_n ≡ (M_n < x_U − ε). Then

P[lim sup_{n→∞} E_n] = P(∩_{n=1}^∞ ∪_{k=n}^∞ E_k) = lim_{n→∞} P(∪_{k=n}^∞ E_k) ≤ lim_{n→∞} Σ_{k=n}^∞ P(E_k),

and since

Σ_{n=1}^∞ P(E_n) = Σ_{n=1}^∞ {F_X(x_U − ε)}^n < ∞

(as F_X(x_U − ε) < 1), the tail sums converge to zero, so P[lim sup E_n] = 0 (the first Borel-Cantelli Lemma). Hence, with probability 1, M_n ≥ x_U − ε for all sufficiently large n; since also M_n ≤ x_U with probability 1 for every n, and ε > 0 was arbitrary, M_n →_a.s. x_U.
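A one-path illustration of this convergence (a sketch only, not part of the model answer; the Uniform(0, 1) model, for which x_U = 1, is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(6)
x_u = 1.0                                          # upper support point of Uniform(0, 1)
x = rng.uniform(0.0, x_u, size=100000)
m = np.maximum.accumulate(x)                       # running maximum M_n along one sample path

for n in (10, 100, 1000, 100000):
    print(n, "M_n =", m[n - 1], "  x_U - M_n =", x_u - m[n - 1])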
(a) Suppose that X1 , . . . , Xn are an independent and identically distributed sample from distribution
with density fX (x|θ), for vector parameter θ ∈ Θ ⊆ Rk . Suppose that fX is twice differentiable
with respect to the elements of θ, and let the true value of θ be denoted θ0 .
Define the score statistic S(X; θ) = l̇(X; θ), the unit information I(θ) = E_{X1|θ}[−l̈(X_1; θ)], and its estimator Î_n(θ) = −(1/n) l̈(X; θ).
[I(θ) is sometimes called the unit Fisher Information; Î_n(θ) is the estimator of I(θ)]
Give the asymptotic Normal distribution of the score statistic under standard regularity conditions,
when the data are distributed as a Normal distribution with mean zero and variance 1/θ.
Bookwork: Let X and x denote the vector of random variables/observations, let L denote the likelihood, let
l denote the log likelihood, and let partial differentiation be denoted by dots. The score is S(X; θ) = l̇(X; θ), and the Fisher Information is

I_n(θ) = E_{X|θ}[−l̈(X; θ)],

where twice partial differentiation returns a k × k symmetric matrix. It can be shown that

I(θ) = E_{X1|θ}[S(X_1; θ) S(X_1; θ)^T],

so it follows that

I_n(θ) = n I(θ) = E_{X|θ}[S(X; θ) S(X; θ)^T].
For the Normal model with mean zero and variance 1/θ,

log f_{X|θ}(X_1|θ) = (1/2) log θ − (1/2) log(2π) − θX_1²/2,

∂/∂θ log f_{X|θ}(X_1|θ) = 1/(2θ) − X_1²/2,

∂²/∂θ² log f_{X|θ}(X_1|θ) = −1/(2θ²).
Thus I(θ) = 1/(2θ²), and, as the expectation of the score function is zero,

S(X; θ) ∼ AN(0, I_n(θ)) ≡ AN(0, n/(2θ²)),

where AN means asymptotically normal.
[2 MARKS]
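The information calculation can be checked by simulation; the following sketch (not part of the model answer; theta, n and reps are arbitrary choices) compares the Monte Carlo variance of the score with n/(2θ²).

import numpy as np

rng = np.random.default_rng(7)
theta, n, reps = 2.0, 50, 40000                    # X_i ~ N(0, 1/theta); illustrative values

x = rng.normal(0.0, np.sqrt(1.0 / theta), size=(reps, n))
score = n / (2.0 * theta) - 0.5 * (x**2).sum(axis=1)   # S(X; theta) = sum_i {1/(2 theta) - X_i^2 / 2}
print("simulated var of score       :", score.var())
print("I_n(theta) = n / (2 theta^2) :", n / (2.0 * theta**2))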
(b) One class of estimating procedures for parameter θ involves solution of equations of the form
G_n(θ) = (1/n) Σ_{i=1}^n G(X_i; θ) = 0.   (3)
(i) Show that maximum likelihood (ML) estimation falls into this class of estimating procedures.
For ML estimation, we find estimator θ̂, where

θ̂ = arg max_{θ∈Θ} L(X; θ),
by, typically differentiating l(X; θ) partially in turn with respect to each component of θ, and
then setting the resulting derivative equations equal to zero, that is, we solve the system of
k equations
∂/∂θ l(X; θ) = ∂/∂θ Σ_{i=1}^n log f_{X|θ}(X_i|θ) = Σ_{i=1}^n ∂/∂θ log f_{X|θ}(X_i|θ) = 0,

which, after dividing through by n, is of the form (3) with G(X_i; θ) = ∂/∂θ log f_{X|θ}(X_i|θ), the score contribution of the ith observation.
State precisely the assumptions made in order to obtain the asymptotic Normal distribution.
Apologies, some lax notation here; this is a vector problem, and θ, θ0, θ̂_n and G are conventionally
k × 1 (column) vectors, and Ġ_n is a k × k matrix, so it makes more sense to write

G_n(θ̂_n) = G_n(θ0) + Ġ_n(θ0)(θ̂_n − θ0),   (4)

although working through with the form given, assuming row rather than column vectors, is OK.
Anyway, proceeding with column vectors:
Now, θ̂_n is a solution to equation (3) by definition of the estimator, so rearranging equation (4)
after setting the LHS to zero and multiplying through by √n yields

√n G_n(θ0) = −√n Ġ_n(θ0)(θ̂_n − θ0).   (5)
But also, by the Central Limit Theorem, under the assumption that

E_{X1|θ0}[G(X_1; θ0)] = 0

(that is, the usual “unbiasedness” assumption made for score equations), we have

√n G_n(θ0) →_L Z ∼ N(0, V_G(θ0))

where

V_G(θ0) = Var_{X1|θ0}[G(X_1; θ0)].
But, by analogy with the standard likelihood case, a natural assumption (that can be proved
formally) is that

−Ġ_n(θ0) →_a.s. V_G(θ0),

akin to the likelihood result that says the Fisher Information is minus one times the expectation
of the log likelihood second derivative matrix. Thus, from equation (5), we have by rearrangement
(formally, using Slutsky’s Theorem) that

√n (θ̂_n − θ0) = (−Ġ_n(θ0))⁻¹ √n G_n(θ0) →_L V_G(θ0)⁻¹ Z ∼ N(0, V_G(θ0)⁻¹).
This result follows in the same fashion as in the Cramer’s Theorem from lectures.
[6 MARKS]
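As an illustration of this result (not part of the model answer), take the Poisson score contribution G(x; θ) = x/θ − 1, an arbitrary choice for which the assumption −Ġ_n(θ0) →_a.s. V_G(θ0) does hold; the estimating equation gives θ̂_n = X̄, and the simulated variance of √n(θ̂_n − θ0) should be close to V_G(θ0)⁻¹ = λ0.

import numpy as np

rng = np.random.default_rng(8)
lam0, n, reps = 2.0, 400, 20000                    # illustrative values

# G(x; theta) = x/theta - 1, so G_n(theta) = X-bar/theta - 1 = 0 at theta_hat = X-bar
x = rng.poisson(lam0, size=(reps, n))
theta_hat = x.mean(axis=1)

v_g = 1.0 / lam0                                   # V_G(theta0) = Var[X_1/lambda0 - 1] = 1/lambda0
print("simulated var of sqrt(n)(theta_hat - theta0):", (np.sqrt(n) * (theta_hat - lam0)).var())
print("theoretical value V_G(theta0)^{-1}          :", 1.0 / v_g)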
(a) Suppose that X1 , . . . , Xn are a finitely exchangeable sequence of random variables with (De Finetti)
representation
p(X_1, . . . , X_n) = ∫_{−∞}^{∞} Π_{i=1}^n f_{X|θ}(X_i|θ) p_θ(θ) dθ.
In the following cases, find the joint probability distribution p(X1 , . . . , Xn ), and give an interpre-
tation of the parameter θ in terms of a strong law limiting quantity.
(i) f_{X|θ}(x|θ) = Normal(θ, 1),   p_θ(θ) = Normal(0, τ²).
We have

Π_{i=1}^n f_{X|θ}(X_i|θ) = Π_{i=1}^n (2π)^{−1/2} exp{−(X_i − θ)²/2} = (2π)^{−n/2} exp{−(1/2) Σ_{i=1}^n (X_i − θ)²},
so

Π_{i=1}^n f_{X|θ}(X_i|θ) = (2π)^{−n/2} exp{−(1/2)[n(X̄ − θ)² + Σ_{i=1}^n (X_i − X̄)²]} = K_1(X, n) exp{−(n/2)(X̄ − θ)²},

where

K_1(X, n) = (2π)^{−n/2} exp{−(1/2) Σ_{i=1}^n (X_i − X̄)²}.

Now

p_θ(θ) = (2πτ²)^{−1/2} exp{−θ²/(2τ²)},
so

Π_{i=1}^n f_{X|θ}(X_i|θ) p_θ(θ) = K_1(X, n) exp{−(n/2)(X̄ − θ)²} (2πτ²)^{−1/2} exp{−θ²/(2τ²)},
and combining the terms in the exponents, completing the square, we have

n(X̄ − θ)² + θ²/τ² = (n + 1/τ²) (θ − nX̄/(n + 1/τ²))² + {(n/τ²)/(n + 1/τ²)} X̄²,

using the identity

A(x − a)² + B(x − b)² = (A + B)(x − (Aa + Bb)/(A + B))² + {AB/(A + B)}(a − b)².

Hence

Π_{i=1}^n f_{X|θ}(X_i|θ) p_θ(θ) = K_2(X, n, τ²) exp{−(η_n/2)(θ − µ_n)²},

where

K_2(X, n, τ²) = {K_1(X, n)/(2πτ²)^{1/2}} exp{−[(n/τ²)/(2(n + 1/τ²))] X̄²},

µ_n = nX̄/(n + 1/τ²),    η_n = n + 1/τ²,
and thus

∫_{−∞}^{∞} Π_{i=1}^n f_{X|θ}(X_i|θ) p_θ(θ) dθ = K_2(X, n, τ²) ∫_{−∞}^{∞} exp{−(η_n/2)(θ − µ_n)²} dθ = K_2(X, n, τ²) √(2π/η_n).
The parameter θ in the conditional distribution for the Xi is the expectation. Thus, θ has
the interpretation
X̄ →_a.s. θ

as n → ∞. To see this more formally, we have the posterior distribution for θ from above as

p_{θ|X}(θ|X = x) ∝ Π_{i=1}^n f_{X|θ}(X_i|θ) p_θ(θ) ∝ exp{−(η_n/2)(θ − µ_n)²},

that is, θ|X = x ∼ Normal(µ_n, 1/η_n), with

µ_n = nX̄/(n + 1/τ²) →_a.s. E[X_i]

and 1/η_n → 0.
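A brief numerical illustration of this concentration (a sketch, not part of the model answer; theta_true and tau2 are arbitrary choices, with X_i | θ ∼ N(θ, 1)):

import numpy as np

rng = np.random.default_rng(9)
theta_true, tau2 = 1.5, 4.0                        # illustrative values

x = rng.normal(theta_true, 1.0, size=10000)        # one long sample path
for n in (10, 100, 1000, 10000):
    eta_n = n + 1.0 / tau2
    mu_n = n * x[:n].mean() / eta_n
    print(n, "posterior mean mu_n =", round(mu_n, 4), "  posterior var 1/eta_n =", round(1.0 / eta_n, 6))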
(ii) f_{X|θ}(x|θ) = Exponential(θ),   p_θ(θ) = Gamma(α, β).
Here

Π_{i=1}^n f_{X|θ}(X_i|θ) = Π_{i=1}^n θ exp{−θX_i} = θ^n exp{−θ Σ_{i=1}^n X_i}

and

p_θ(θ) = {β^α/Γ(α)} θ^{α−1} exp{−βθ},
so

Π_{i=1}^n f_{X|θ}(X_i|θ) p_θ(θ) = θ^n exp{−θ Σ_{i=1}^n X_i} {β^α/Γ(α)} θ^{α−1} exp{−βθ} = {β^α/Γ(α)} θ^{n+α−1} exp{−θ(Σ_{i=1}^n X_i + β)},
which yields

∫_0^∞ Π_{i=1}^n f_{X|θ}(X_i|θ) p_θ(θ) dθ = {β^α/Γ(α)} Γ(n + α)/(Σ_{i=1}^n X_i + β)^{n+α}.
Now, as

p_{θ|X}(θ|X = x) ∝ Π_{i=1}^n f_{X|θ}(X_i|θ) p_θ(θ) ∝ θ^{n+α−1} exp{−θ(Σ_{i=1}^n X_i + β)},

we have

p_{θ|X}(θ|X = x) ≡ Gamma(n + α, Σ_{i=1}^n X_i + β).

The posterior mean is (n + α)/(Σ_{i=1}^n X_i + β) = (n + α)/(nX̄ + β) →_a.s. 1/E[X_i] by the Strong Law, and the posterior variance (n + α)/(Σ_{i=1}^n X_i + β)² tends to zero, so θ is interpreted as the strong law limit of 1/X̄, the reciprocal of the sample mean of the X_i.
[5 MARKS]
[5 MARKS each]
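The same kind of check for the Exponential-Gamma case (a sketch only, not part of the model answer; theta_true, alpha and beta are arbitrary choices) shows the Gamma(n + α, ΣX_i + β) posterior concentrating at θ, the reciprocal of the strong-law limit of X̄.

import numpy as np

rng = np.random.default_rng(10)
theta_true, alpha, beta = 0.5, 2.0, 1.0            # X_i | theta ~ Exponential(theta)

x = rng.exponential(1.0 / theta_true, size=10000)  # numpy parameterises by the mean 1/theta
for n in (10, 100, 1000, 10000):
    post_mean = (n + alpha) / (x[:n].sum() + beta)       # mean of Gamma(n + alpha, sum X_i + beta)
    post_var = (n + alpha) / (x[:n].sum() + beta) ** 2   # variance of that Gamma posterior
    print(n, "posterior mean =", round(post_mean, 4), "  posterior var =", round(post_var, 6))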
We compute
p(X_{m+1}, . . . , X_{m+n} | X_1, . . . , X_m) = ∫_{−∞}^{∞} Π_{i=m+1}^{m+n} f_{X|θ}(X_i|θ) p_{θ|X^{(1)}}(θ|X^{(1)} = x^{(1)}) dθ.

For the Normal model in (i),

Π_{i=m+1}^{m+n} f_{X|θ}(X_i|θ) = K_1(X^{(2)}, n) exp{−(n/2)(X̄^{(2)} − θ)²}

and

p_{θ|X^{(1)}}(θ|X^{(1)} = x^{(1)}) = (η^{(1)}/(2π))^{1/2} exp{−(η^{(1)}/2)(θ − µ^{(1)})²},
where µ(1) and η (1) are as defined earlier, computed for X (1) . The posterior predictive is computed in
a fashion similar to earlier, completing the square in θ to facilitate the integral; here we have by the
previous identity
à !2
(2) 2 (1) (1) 2 (1) nX̄ (2) + η (1) µ(1) nη (1)
n(X̄ − θ) + η (θ − µ ) = (n + η ) θ− + (X̄ (2) − µ(1) )2
n + η (1) n+η (1)
Thus, on integrating out θ, and cancelling terms, we obtain the posterior predictive as

K_1(X^{(2)}, n) exp{−[nη^{(1)}/(2(n + η^{(1)}))] (X̄^{(2)} − µ^{(1)})²} (η^{(1)}/(n + η^{(1)}))^{1/2}.
For the Exponential-Gamma model in (ii),

Π_{i=m+1}^{m+n} f_{X|θ}(X_i|θ) = θ^n exp{−θ S^{(2)}},

where

S^{(1)} = Σ_{i=1}^m X_i,    S^{(2)} = Σ_{i=m+1}^{m+n} X_i,
and on integrating out θ, as this form is proportional to a Gamma pdf, we obtain the posterior predictive
as

(S^{(1)} + β)^{m+α} Γ(n + m + α) / [Γ(m + α) (S^{(1)} + S^{(2)} + β)^{n+m+α}].
In both cases, by the general theorem from lecture notes, the limiting posterior predictive when n → ∞
is merely the posterior distribution based on the sample X_1, . . . , X_m.
[5 MARKS each]