
Carnegie Mellon University Department of Statistics & Data Science

36-708 Statistical Machine Learning Homework #4


Solutions

DUE: April 19, 2019

Problem 1 [5 pts.]
Consider the directed graph with vertices V = {X1 , X2 , X3 , X4 , X5 } and edge set E = {(1, 3), (2, 3), (3, 4), (3, 5)}.
(a) [2 pts.] List all the independence statements implied by this graph.
(b) [1 pt.] Find the causal distribution p(x4 ∣ set X3 = s).
(c) [2 pts.] Find the implied undirected graph for these random variables. Which independence
statements get lost in the undirected graph (if any)?

Solution.
The graph can be visualized as follows: X1 → X3 ← X2, with X3 → X4 and X3 → X5; that is, X1 and X2 are the parents of X3, which in turn is the parent of X4 and X5.

(a) The independence statements implied are the following:

– X1 ⊥ X2 ;
– X4 ⊥ {X1 , X2 }∣X3 and X5 ⊥ {X1 , X2 }∣X3 ;
– X5 ⊥ X4 ∣X3 .

(b) Given that we set X3 = s, by the independencies highlighted above X1 and X2 can be dropped from the graph. Writing p∗ for the resulting interventional distribution of (X4, X5), we have

p(x4 ∣ set x3 = s) = ∫ p∗(x4, x5) dx5 = ∫ p(x4 ∣ x3 = s) p(x5 ∣ x3 = s) dx5 = p(x4 ∣ x3 = s).
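As an illustration (not part of the original solution; the Bernoulli conditionals below are hypothetical choices consistent with the DAG), a small R simulation showing that intervening on X3 and conditioning on X3 give the same distribution for X4:

# Hypothetical structural equations consistent with X1 -> X3 <- X2, X3 -> X4, X3 -> X5
set.seed(1)
N  <- 1e5
x1 <- rbinom(N, 1, 0.5)
x2 <- rbinom(N, 1, 0.5)
x3 <- rbinom(N, 1, plogis(x1 + x2 - 1))   # X3 depends on its parents X1, X2
x4 <- rbinom(N, 1, plogis(2 * x3 - 1))    # X4 depends only on X3
# Observational conditional P(X4 = 1 | X3 = 1)
cond <- mean(x4[x3 == 1])
# Interventional P(X4 = 1 | set X3 = 1): force X3 = 1 and regenerate X4
x4_do  <- rbinom(N, 1, plogis(2 * 1 - 1))
interv <- mean(x4_do)
c(conditional = cond, interventional = interv)  # the two estimates agree up to Monte Carlo error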

(c) The moralized (undirected) graph keeps the edges X1–X3, X2–X3, X3–X4, X3–X5 and adds the edge X1–X2, since X1 and X2 are both parents of X3. We lose the (unconditional) independence between X1 and X2 during the moralization process, while all the other independence statements are retained.

Problem 2 [20 pts.]
Let d ≥ 2, and let X1 , . . . , Xn ∼ P where Xi = (Xi (1), . . . , Xi (d)) ∈ Rd . Assume that the coordinates
of Xi are independent. Further, assume that Xi (j) ∼ Bernoulli(pj ) where 0 < c ≤ pj ≤ C < 1. Let P
be all such distributions. Let
Rn = inf_p̂ sup_{P∈P} E_P ∥p̂ − p∥∞.

Find lower and upper bounds on the minimax risk.

Solution.
For the upper bound, consider the estimator p̂ = X̄, the vector of coordinate-wise sample means. Each coordinate of X̄ − p is sub-Gaussian with parameter σ² = 1/(4n). Hence, from Lemma 1 below,

E∥X̄ − p∥∞ = E[ max_{1≤i≤d} ∣X̄_i − p_i∣ ] ≤ (1/(2√n)) √(2 log(2d)).

Since d ≥ 2 we have 2 log(2d) ≤ 4 log d, so

inf_p̂ sup_{P∈P} E_P ∥p̂ − p∥∞ ≤ E∥X̄ − p∥∞ ≤ √( log d / n ).
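As a quick numerical sanity check (not part of the original solution; the value p_j = 0.5 below is an arbitrary choice satisfying c ≤ p_j ≤ C), a short R simulation of the sup-norm error against the √(log d / n) bound:

# Monte Carlo check that E||Xbar - p||_inf stays below sqrt(log(d)/n)
set.seed(1)
n <- 200; d <- 50; reps <- 500
p <- rep(0.5, d)                                   # any p with c <= p_j <= C works
err <- replicate(reps, {
  X <- matrix(rbinom(n * d, 1, 0.5), nrow = n)     # rows = observations, columns = coordinates
  max(abs(colMeans(X) - p))                        # sup-norm error of the sample mean
})
c(mc_estimate = mean(err), bound = sqrt(log(d) / n))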

For the lower bound, let α = √( log d / (16n) ) and let p^(0) = (c, . . . , c) ∈ R^d. For 1 ≤ j ≤ d, construct the d-dimensional parameter

p^(j) ∶= (c, ⋯, c, c + α, c, ⋯, c) ∈ R^d,

where the perturbed entry c + α sits in the j-th coordinate. Then ∥p^(j) − p^(k)∥∞ = α for j ≠ k, so

min_{j≠k} ∥p^(j) − p^(k)∥∞ = α.

Let P^(j) be the multivariate Bernoulli distribution with parameter p^(j); it is a product of univariate Bernoullis, P^(j) = ∏_{i=1}^d P_i^(j), where P_i^(j) is univariate Bernoulli with parameter p_i^(j). Then

KL(P^(j), P^(k)) = ∑_{i=1}^d KL(P_i^(j), P_i^(k)) = KL(P_j^(j), P_j^(k)) + KL(P_k^(j), P_k^(k)),

since p^(j) and p^(k) differ only in the coordinates i = j and i = k. Writing P_c for the univariate Bernoulli with parameter c, this gives

KL(P^(j), P^(k)) = KL(P_{c+α}, P_c) + KL(P_c, P_{c+α})
= ∑_x P_{c+α}(x) log( P_{c+α}(x) / P_c(x) ) + ∑_x P_c(x) log( P_c(x) / P_{c+α}(x) )
= ∑_x ( P_{c+α}(x) − P_c(x) ) log( P_{c+α}(x) / P_c(x) )
= α log[ (c + α)(1 − c) / ((1 − c − α)c) ]
≤ Cα²,

since log[ (c + α)(1 − c) / ((1 − c − α)c) ] = α/(c(1 − c)) + O(α²) by a Taylor expansion, where the constant C depends only on c.

(Fano's method.) Combining the construction with the KL bound above,

max_{j≠k} KL(P^(j), P^(k)) ≤ Cα² = C log d / (16n) ≤ log(d + 1) / (4n),

absorbing the constant from the Taylor bound into the choice of α if necessary. Hence by Corollary 13 in the minimax notes, with N = d + 1 hypotheses,

inf_p̂ sup_{P∈P} E_P ∥p̂ − p∥∞ ≥ α/4 = √( log d / (256n) ).

Note: with the same construction of the p^(j) and the same KL bound, lower bounds for the minimax risk with the same rate in d and n can also be derived from Theorem 12 or Theorem 14.

Lemma 1 (Maximal inequality for sub-Gaussian random variables)
Let X1, . . . , Xn be sub-Gaussian random variables with parameter σ². Then

E[ max_{1≤i≤n} X_i ] ≤ σ √(2 log n)   and   E[ max_{1≤i≤n} ∣X_i∣ ] ≤ σ √(2 log(2n)).

This result is covered in Advanced Statistics; see http://www.stat.cmu.edu/~arinaldo/Teaching/36755/F17/Scribed_Lectures/F17_0911.pdf.
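For completeness, a brief proof sketch of the first bound (a standard argument, not part of the original solution), assuming the sub-Gaussian MGF bound E[e^{λX_i}] ≤ e^{λ²σ²/2}: for any λ > 0, by Jensen's inequality,

exp( λ E[ max_i X_i ] ) ≤ E[ exp( λ max_i X_i ) ] ≤ ∑_{i=1}^n E[ e^{λX_i} ] ≤ n e^{λ²σ²/2},

so E[ max_i X_i ] ≤ log(n)/λ + λσ²/2; optimizing at λ = √(2 log n)/σ gives σ√(2 log n). The bound for max_i ∣X_i∣ follows by applying the same argument to the 2n variables ±X_i.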


Problem 3 [20 pts.]


Let {pθ ∶ θ ∈ Θ} where Θ ⊂ R be a parametric model. Suppose that the model satisfies the usual
regularity conditions. In particular, the Fisher information I(θ) is positive and smooth and the mle
has the usual nice properties. Let the loss function be L(θ̂, θ) = H(pθ̂ , pθ ) where H denotes Hellinger
distance. Find the minimax rate.

Solution.
Assume that the densities in {p_θ ∶ θ ∈ Θ} are differentiable in quadratic mean (QMD), as in equation (27) of the minimax notes, which says that the Hellinger distance between p_{θ+h} and p_θ is governed by a first-order Taylor expansion in h:

H²(p_{θ+h}, p_θ) = (1/8) ∥h∥² I(θ) + o(∥h∥²)

(for a proof see https://www.stat.berkeley.edu/~bartlett/courses/2013spring-stat210b/notes/25notes.pdf). In other words, the Hellinger loss is locally equivalent to the squared loss on θ, so the minimax rate for Hellinger loss is the same as for the squared risk R(θ̂, θ) = ∥θ̂ − θ∥²:

inf_θ̂ sup_{θ∈Θ} H²(p_θ̂, p_θ) = O( inf_θ̂ sup_{θ∈Θ} R(θ̂, θ) ).

Next, we argue that the minimax rate for the squared loss is achieved by the MLE. For a fixed true parameter θ, let θ̂_n^mle denote the MLE based on a sample X1, ⋯, Xn ∼ P_θ. By the asymptotic distribution of the MLE,

√n ( θ̂_n^mle − θ ) ⇝ N( 0, I⁻¹(θ) ),

so the squared risk of the MLE satisfies

R(θ̂_n^mle, θ) = Var(θ̂_n^mle) + bias² ≈ I⁻¹(θ)/n for large n.

For any other estimator T_n, by Theorem 17 in the minimax notes with ψ(x) = x and ℓ the squared loss, under the QMD condition the squared risk is asymptotically lower bounded by that of the MLE:

R(T_n, θ) = Var(T_n) + bias² ≥ Var(T_n) ≥ Var(U) = I⁻¹(θ)/n ≈ R(θ̂_n^mle, θ),

where U is the limiting variable appearing in Theorem 17. This lower bound holds for all θ ∈ Θ, so the minimax risk for the squared loss is achieved by the MLE:

inf_θ̂ sup_{θ∈Θ} R(θ̂, θ) = sup_{θ∈Θ} R(θ̂_n^mle, θ) ≈ (1/n) sup_{θ∈Θ} I⁻¹(θ).

Therefore,

inf_θ̂ sup_{θ∈Θ} H²(p_θ̂, p_θ) = O(1/n),

or equivalently,

inf_θ̂ sup_{θ∈Θ} H(p_θ̂, p_θ) = O(n^{−1/2}).
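As an illustration (not part of the original solution), a short R simulation for the N(θ, 1) location model, where the MLE is the sample mean and H²(N(a,1), N(b,1)) = 1 − exp(−(a − b)²/8) in closed form; the average Hellinger error shrinks roughly like n^{−1/2}:

# Hellinger risk of the MLE (sample mean) in the N(theta, 1) model, for several n
set.seed(1)
theta <- 2
hellinger_risk <- function(n, reps = 2000) {
  mean(replicate(reps, {
    mle <- mean(rnorm(n, mean = theta))        # MLE of theta
    sqrt(1 - exp(-(mle - theta)^2 / 8))        # H(N(mle, 1), N(theta, 1))
  }))
}
ns <- c(50, 200, 800)
risks <- sapply(ns, hellinger_risk)
rbind(n = ns, risk = risks, risk_times_sqrt_n = risks * sqrt(ns))  # last row is roughly constant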


Alternative solution (thanks to Tim Barry for the ideas)


We first derive an upper bound for the KL distance between p_{θ+h} and p_θ for arbitrary small h. Taking p_θ as the reference measure (to second order in h the two orderings of the arguments agree),

KL(p_{θ+h}, p_θ) = ∫ log( p_θ / p_{θ+h} ) p_θ dx
= ∫ ( log p_θ − log p_{θ+h} ) p_θ dx
= ∫ [ log p_θ − ( log p_θ + h ∂log p_θ/∂θ + (h²/2) ∂²log p_θ/∂θ² + o(h²) ) ] p_θ dx
= (h²/2) ( −∫ (∂²log p_θ/∂θ²) p_θ dx ) + o(h²)
≤ C I(θ) h²,

where the first-order term vanishes because we assume enough regularity of the density to swap derivative and integral, so that

∫ (∂log p_θ/∂θ) p_θ dx = ∫ (1/p_θ)(∂p_θ/∂θ) p_θ dx = ∫ (∂p_θ/∂θ) dx = ∂( ∫ p_θ dx )/∂θ = 0.
For the upper bound, consider the MLE. Since H² ≤ KL,

H(p_{θ̂_mle}, p_θ) ≤ √( KL(p_{θ̂_mle}, p_θ) ) ≤ √( C (θ̂_mle − θ)² I(θ) ).

By the asymptotic distribution of the MLE,

θ̂_mle − θ ≈ N( 0, I⁻¹(θ)/n ),

thus, using Jensen's inequality (E√Z ≤ √(E Z)),

inf_θ̂ sup_{θ∈Θ} E[ H(p_θ̂, p_θ) ] ≤ sup_{θ∈Θ} E[ H(p_{θ̂_mle}, p_θ) ] ≤ sup_{θ∈Θ} √( C I(θ) ( Var(θ̂_n^mle) + bias² ) ) = O(n^{−1/2}).

For the lower bound, consider the two distributions p_θ and p_{θ+h} with h = √( log 2 / (C I(θ) n) ). By the QMD condition,

H(p_{θ+h}, p_θ) = √( I(θ)/8 ) h + o(h) = O(n^{−1/2}),

and for the KL distance,

KL(p_{θ+h}, p_θ) ≤ C I(θ) h² = log 2 / n.

Hence by Corollary 5 in the minimax notes,

inf_θ̂ sup_{θ∈Θ} E[ H(p_θ̂, p_θ) ] ≥ c n^{−1/2},

matching the rate of the upper bound.


Problem 4 [15 pts.]


Let Y = (Y1 , . . . , Yd ) ∼ N (θ, I) where θ = (θ1 , . . . , θd ). Assume that θ ∈ Θ = {θ ∈ Rd ∶ ∥θ∥0 ≤ 1}. Let
R_d = inf_θ̂ sup_{θ∈Θ} E_θ ∥θ̂ − θ∥².

Show that c log d ≤ Rd ≤ C log d for some constants c and C.

Solution.

For the upper bound, we first prove a high-probability bound lemma for the maximum of Gaussians
(from 36-705, Lecture 27, Fall 2018):

Lemma 2 Suppose that ε1, . . . , ε_d ∼ N(0, σ²). Then with probability at least 1 − δ,

max_{1≤i≤d} ∣ε_i∣ ≤ σ √( 2 log(2d/δ) ).

Proof. By the Gaussian tail bound, if ε ∼ N(0, σ²):

P(∣ε∣ ≥ t) ≤ 2 exp( −t²/(2σ²) ).

By the union bound,

P( max_i ∣ε_i∣ ≥ t ) ≤ 2d exp( −t²/(2σ²) ).

Setting 2d exp(−t²/(2σ²)) = δ and solving for t gives the lemma.

Now, we assume θ̂ to be the hard-thresholding estimator, defined as:

θ̂i = yi I(∣yi ∣ ≥ t), ∀ i ∈ {1, . . . , d},


We have the following theorem (from 36-705, Lecture 27, Fall 2018):

Theorem 3 Suppose we choose the threshold

t = 2σ √( 2 log(2d/δ) );

then with probability at least 1 − δ,

∥θ̂ − θ∥²₂ ≤ 9 ∑_{i=1}^d min{ θ_i², t²/4 } ≤ C₁ t²

for some constant C₁ > 0 (here we use that ∥θ∥₀ ≤ 1, so at most one term of the sum is nonzero).

Proof. We condition on the event from the previous lemma, i.e. that

max_{1≤i≤d} ∣ε_i∣ ≤ σ √( 2 log(2d/δ) ) ≤ t/2.
Now, observe that,
∥θ̂ − θ∥²₂ = ∑_{i=1}^d (θ̂_i − θ_i)²,

so we can consider each co-ordinate separately. Let us consider some cases:


1. If ∣θ_i∣ ≤ t/2, then ∣y_i∣ ≤ ∣θ_i∣ + ∣ε_i∣ ≤ t, so our estimate is 0 and the risk for that coordinate is simply θ_i².

2. If ∣θ_i∣ ≥ 3t/2, then ∣y_i∣ ≥ ∣θ_i∣ − ∣ε_i∣ ≥ t, so our estimate is simply θ̂_i = y_i and the risk is ε_i² ≤ t²/4.

3. If t/2 ≤ ∣θ_i∣ ≤ 3t/2, then the risk is

(θ̂_i − θ_i)² = ( y_i I(∣y_i∣ ≥ t) − θ_i )² = θ_i² I(∣y_i∣ < t) + ε_i² I(∣y_i∣ ≥ t) ≤ max{ ε_i², θ_i² } ≤ 9t²/4.

Putting these together, and using that ∥θ∥₀ ≤ 1 (so at most one coordinate has θ_i ≠ 0), we see that

∥θ̂ − θ∥²₂ ≤ 9 ∑_{i=1}^d min{ θ_i², t²/4 } = 9 ∑_{θ_i=0} min{ θ_i², t²/4 } + 9 ∑_{θ_i≠0} min{ θ_i², t²/4 } ≤ C₁ t².

We also have, under the same assumptions as in the results above,

∥θ̂ − θ∥²₂ = ∑_{i=1}^d ( y_i I(∣y_i∣ ≥ t) − θ_i )² ≤ C₂ d max_{i=1,…,d} (t − θ_i)² ≤ C₂ d t².

And so, writing the expectation in terms of tail probabilities,

E[ ∥θ̂ − θ∥²₂ ] = ∫₀^∞ P( ∥θ̂ − θ∥²₂ > x ) dx
= ∫₀^{C₁t²} P( ∥θ̂ − θ∥²₂ > x ) dx + ∫_{C₁t²}^{C₂dt²} P( ∥θ̂ − θ∥²₂ > x ) dx
≤ ∫₀^{C₁t²} 1 dx + ∫_{C₁t²}^{C₂dt²} δ dx
≤ C₁ t² + δ (C₂ d − C₁) t² ≤ C₃ t² ≍ C₃ log d,

where the last step takes, e.g., δ = 1/d, so that t² ≍ log d (recall σ = 1 here) and the second term is also O(log d).
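As a numerical illustration (not part of the original solution; the choices δ = 1/d and a spike of height 3 for θ are arbitrary), a short R simulation of the hard-thresholding risk against log d:

# Risk of hard thresholding for 1-sparse normal means, compared with log(d)
set.seed(1)
risk_hard_threshold <- function(d, reps = 2000) {
  delta <- 1 / d
  t <- 2 * sqrt(2 * log(2 * d / delta))          # threshold from Theorem 3 with sigma = 1
  theta <- c(3, rep(0, d - 1))                   # a 1-sparse mean vector
  mean(replicate(reps, {
    y <- theta + rnorm(d)                        # Y ~ N(theta, I)
    theta_hat <- y * (abs(y) >= t)               # hard thresholding
    sum((theta_hat - theta)^2)
  }))
}
ds <- c(50, 200, 1000)
risks <- sapply(ds, risk_hard_threshold)
rbind(d = ds, risk = risks, risk_over_logd = risks / log(ds))  # last row stays bounded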

For the lower bound, let P_j = N(θ_j, I), where θ_0 = (0, . . . , 0) and, for j = 1, . . . , d (with d ≥ 8), θ_j is the d-dimensional vector with

θ_j(k) = √( log d / 32 ) if k = j,   θ_j(k) = 0 if k ≠ j.

P_0 is mutually absolutely continuous with each of the other distributions, and for all j = 1, . . . , d we claim

KL(P_j, P_0) ≤ log d / 16.

The claim in fact holds for any pair P_j, P_k with j ≠ k. Writing P_i = N(µ_i, I),

KL(P_j, P_k) = ∫ (1/2) ( ∥x − µ_k∥²₂ − ∥x − µ_j∥²₂ ) P_j(x) dx = (1/2) ∥µ_j − µ_k∥²₂,

which implies

max_{j≠k} KL(P_j, P_k) = max_{j≠k} (1/2) ∥µ_j − µ_k∥²₂ = (1/2) ⋅ 2 ⋅ ( log d / 32 ) = log d / 32 ≤ log d / 16.


It then follows that

(1/d) ∑_{j=1}^d KL(P_j, P_0) ≤ log d / 16.

Therefore, by Tsybakov's bound,

R_d ≥ s/16, where s = min_{j≠k} ∥θ_j − θ_k∥²₂ = log d / 32

is the smallest squared separation among the hypotheses θ_0, θ_1, . . . , θ_d, so

R_d ≥ log d / 512 ≥ c log d.


Problem 5 [20 pts.]


Let X1 , . . . , Xn ∼ F where F is some distribution on R. Suppose we put a Dirichlet process prior on
F:
F ∼ DP(α, F0 ).

(a) (10 pts.) Recall the stick-breaking construction. Show that E( ∑_{j=1}^∞ W_j ) = 1.

(b) (10 pts.) Simulate n = 10 data points from a N (0, 1). Try three values of α: namely, α = .1,
α = 1 and α = 10. Compute the 95 percent Bayesian confidence band and the 95 percent DKW
band. Plot the results for one example. Now repeat the simulation 1,000 times and report the
coverage probability for each confidence band.

Solution.

(a) We start by showing P( ∑_{j=1}^∞ W_j = 1 ) = 1; the statement about the expectation will then follow easily. First we prove a series of lemmas.

Lemma 4 For all n ∈ N,

1 − ∑_{j=1}^n W_j = ∏_{j=1}^n (1 − V_j).

Proof. (by induction)

Base case (n = 1): 1 − W_1 = 1 − V_1. ✓

Inductive hypothesis: assume that

1 − ∑_{j=1}^{n−1} W_j = ∏_{j=1}^{n−1} (1 − V_j).

Inductive step: using the stick-breaking definition W_n = V_n ∏_{j=1}^{n−1} (1 − V_j),

1 − ∑_{j=1}^n W_j = 1 − ∑_{j=1}^{n−1} W_j − W_n
= ∏_{j=1}^{n−1} (1 − V_j) − V_n ∏_{j=1}^{n−1} (1 − V_j)
= (1 − V_n) ∏_{j=1}^{n−1} (1 − V_j)
= ∏_{j=1}^n (1 − V_j). ✓

Lemma 5 Let v_1, v_2, . . . be a sequence such that 0 < v_j < 1 for all j. Then

∏_{j=1}^∞ (1 − v_j) > 0 if and only if ∑_{j=1}^∞ v_j < ∞.


Proof. First notice that

− log ∏_{j=1}^∞ (1 − v_j) = − ∑_{j=1}^∞ log(1 − v_j),

and thus

∏_{j=1}^∞ (1 − v_j) > 0 ⟺ − ∑_{j=1}^∞ log(1 − v_j) < ∞.

Since −log(1 − v_j) > 0 and v_j > 0 for all j, we can use the limit comparison test to show that

− ∑_{j=1}^∞ log(1 − v_j) < ∞ ⟺ ∑_{j=1}^∞ v_j < ∞.

Both series diverge when v_j does not tend to 0, so it suffices to consider sequences with v_j → 0, in which case

lim_{v_j→0} [ −log(1 − v_j) / v_j ] = lim_{v_j→0} 1/(1 − v_j) = 1

by L'Hôpital's rule. Hence

− ∑_{j=1}^∞ log(1 − v_j) < ∞ if and only if ∑_{j=1}^∞ v_j < ∞,

and thus

∏_{j=1}^∞ (1 − v_j) > 0 if and only if ∑_{j=1}^∞ v_j < ∞.

Corollary 6 Let v_1, v_2, . . . be a sequence such that 0 < v_j < 1 for all j. Then

∏_{j=1}^∞ (1 − v_j) = 0 if and only if ∑_{j=1}^∞ v_j = ∞.

Lemma 7 (Borel–Cantelli) If the events {V_j > ε} are independent and ∑_{j=1}^∞ P(V_j > ε) = ∞ for some ε > 0, then P(V_j > ε i.o.) = 1.

Now, since the V_j are independent with V_j ∼ Beta(1, α) for j = 1, 2, . . . , we have

P(V_j > ε) > 0 for all ε ∈ (0, 1) and all j,

since the beta distribution puts positive mass over its entire support (0, 1). Moreover P(V_j > ε) does not depend on j, so

∑_{j=1}^∞ P(V_j > ε) = ∞ for all ε ∈ (0, 1).


So altogether,

P( ∑_{j=1}^∞ W_j = 1 ) = P( ∏_{j=1}^∞ (1 − V_j) = 0 )     (Lemma 4)
= P( ∑_{j=1}^∞ V_j = ∞ )     (Corollary 6)
≥ P( V_j > ε i.o. )
= 1,     (Lemma 7)

where the inequality comes from the inclusion

{ ∑_{j=1}^∞ V_j = ∞ } ⊃ { V_j > ε i.o. }.

Thus,

P( ∑_{j=1}^∞ W_j = 1 ) = 1. ∎

It follows that

E( ∑_{j=1}^∞ W_j ) = ∫_R ( ∑_{j=1}^∞ W_j ) dF = P( ∑_{j=1}^∞ W_j = 1 ) ⋅ 1 + 0 = 1,

where F is the distribution function of the random variable ∑_{j=1}^∞ W_j, which equals 1 with probability one.
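As a quick numerical illustration (not part of the original solution), a short R sketch of the stick-breaking weights, truncated at a large number of sticks, showing that they sum to essentially 1:

# Stick-breaking weights W_j = V_j * prod_{i<j} (1 - V_i) with V_j ~ Beta(1, alpha)
set.seed(1)
stick_breaking_sum <- function(alpha, J = 5000) {
  v <- rbeta(J, 1, alpha)
  w <- v * c(1, cumprod(1 - v)[-J])   # W_1 = V_1, W_j = V_j * prod_{i<j}(1 - V_i)
  sum(w)
}
sapply(c(0.1, 1, 10), stick_breaking_sum)  # all values are very close to 1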

(b) Two good resources for such a simulation are the distr package in R, which is showcased in this tutorial, and, for Python 3, a tutorial that uses the PyMC3 module to achieve a similar goal.
We include R code for a single simulation, with parts taken from the tutorial indicated above.
library(distr)
library(coda)
library(latex2exp)

# Setup
set.seed(7)
n <- 10
alpha_vec <- c(0.1,1,10)
x_grid <- seq(-3, 3, by=0.05)
signif_level <- 0.05

# Sample observations
x_pts <- rnorm(n)

# Generate DKW Band


x_ecdf <- ecdf(x_pts)
x_ecdf_error <- sqrt(log(2 / signif_level) / (2 * n))

dkw.lb <- pmax(x_ecdf(x_grid) - x_ecdf_error, 0)


dkw.ub <- pmin(x_ecdf(x_grid) + x_ecdf_error, 1)

## BAYESIAN BANDS
# Functions to generate Bayesian Credible Bands
sample_cdf <- function(F_hat, n){
F_hat@r(n) # F_hat is an S4 object from the distr package; its @r slot draws random samples
}

# Sampling from the prior distribution


sample_prior <- function(F0, alpha, n){
cdf_sample <- sample_cdf(F0, n)
v <- rbeta(n, 1, alpha) # See 5a) for definitions
w <- c(v[1], rep(0, n-1))
for(ii in 2:n){
w[ii] <- v[ii]*cumprod(1-v)[ii-1] # See 5a) for definitions
}
function(m){
sample(cdf_sample, m, prob=w, replace=T)
}
}

# Sampling from the posterior distribution


sample_posterior <- function(F0,alpha,data){
n <- length(data)
F_hat <- DiscreteDistribution(data) #distr function for empirical CDF
F_for_post <- n/(n+alpha)*F_hat+alpha/(n+alpha)*F0
sample_prior(F_for_post,alpha+n,n)
}

# Now simulate for all different alphas


list_out_bayes <- list()
for(alpha in alpha_vec){
iters <- 100
m <- 1000
F0 <- DiscreteDistribution(rnorm(m))

y <- matrix(nrow=length(x_grid), ncol=iters)


for(iter in 1:iters){
F_post <- sample_posterior(F0, alpha, x_pts)
y[,iter] <- ecdf(F_post(m))(x_grid)
}

mean_post_sim <- rowMeans(y) #Posterior Mean


cred_int <- apply(y, 1, function(row) HPDinterval(as.mcmc(row), prob = 1 - signif_level))
# obtains the 95% credible interval at each grid point
list_out_bayes[[as.character(alpha)]] <- list('cred_int' = cred_int,
'mean_post_sim' = mean_post_sim)
}

# Plot the results for each of the alpha


for (alpha_val in alpha_vec){
plot(x_ecdf, xlim=c(-3, 3),
main = TeX(sprintf("95%% DKW and Bayesian Credible Band ($\\alpha = %s$)",
as.character(alpha_val)))) #ECDF

lines(x_grid, dkw.lb, col="green") #DKW


lines(x_grid, dkw.ub, col="green")

points(x_grid, list_out_bayes[[as.character(alpha_val)]]$'mean_post_sim',
type="l", col="red")
points(x_grid, list_out_bayes[[as.character(alpha_val)]]$'cred_int'[2,], type="l",
col="blue")
points(x_grid, list_out_bayes[[as.character(alpha_val)]]$'cred_int'[1,], type="l",
col="blue")
curve(pnorm, xlim=c(-3, 3), add=TRUE, col="red", lwd=2)
legend("topleft", lty="solid", legend=c("True", "DKW", "Kolmogorov", "Posterior
Mean"),
col=c("black", "green", "blue", "red"))
}

Figure 1: 95% DKW and Bayesian credible bands at different alpha levels: 0.1, 1, 10 from upper left
to bottom center respectively

For the 1,000 simulations, one should repeat the code above and record, for each replication, whether the true CDF (here the N(0, 1) CDF) lies entirely within the Bayesian credible band and within the DKW band over the grid; the reported coverage probability of each band is the fraction of replications in which this happens.
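A minimal sketch of the coverage loop for the DKW band (not part of the original solution; it is self-contained, and the Bayesian-band check is analogous using sample_posterior and the grid defined above):

# Coverage of the 95% DKW band over 1000 replications with n = 10 draws from N(0,1)
set.seed(7)
n <- 10; n_sim <- 1000; signif_level <- 0.05
x_grid <- seq(-3, 3, by = 0.05)
true_cdf <- pnorm(x_grid)
eps <- sqrt(log(2 / signif_level) / (2 * n))
covered <- replicate(n_sim, {
  Fn <- ecdf(rnorm(n))(x_grid)
  all(true_cdf >= pmax(Fn - eps, 0) & true_cdf <= pmin(Fn + eps, 1))  # true CDF inside the band everywhere?
})
mean(covered)   # empirical coverage of the DKW band (should be at least 0.95)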


Problem 6 [20 pts.]


In this question we consider a nonparametric Bayesian estimator and compare to the minimax esti-
mator. For i = 1, . . . , n and j = 1, 2, . . . let

X_ij = θ_j + ε_ij,

where all the ε_ij's are independent N(0, 1). The parameter is θ = (θ_1, θ_2, . . . ). Assume that ∑_j θ_j² < ∞. Due to sufficiency, we can reduce the problem to the sample means: let Y_j = n⁻¹ ∑_{i=1}^n X_ij, so the model is Y_j ∼ N(θ_j, 1/n) for j = 1, 2, 3, . . . . We put a prior π on θ as follows: the θ_j are independent with θ_j ∼ N(0, τ_j²).

(a) (5 pts.) Find the posterior for θ. Find the posterior mean θ̂.

(b) (7 pts.) Suppose that ∑_j τ_j² < ∞. Show that θ̂ is consistent, that is, ∥θ̂ − θ∥ → 0 in probability.

(c) (8 pts.) Now suppose that θ is in the Sobolev ball

Θ = { θ = (θ_1, θ_2, . . .) ∶ ∑_j j^{2p} θ_j² ≤ C² },

where p > 1/2. The minimax rate (for squared error loss) for this problem is R_n ≍ n^{−2p/(2p+1)}. Let τ_j² = (1/j)^{2r}. Find r so that the posterior mean achieves the minimax rate.

Solution.

(a) By Theorem 8 (in the appendix), applied coordinate-wise with a = 0, b² = τ_j², σ² = 1 and the n observations X_1j, . . . , X_nj, the posterior factorizes over j with

θ_j ∣ Y ∼ N( nY_j τ_j² / (1 + nτ_j²), τ_j² / (1 + nτ_j²) ),

and in particular the posterior mean is

θ̂_j = nY_j τ_j² / (1 + nτ_j²).

(b) For any ε > 0,

P( ∥θ̂ − θ∥ > ε ) ≤ E[ ∥θ̂ − θ∥² ] / ε²     (Markov's inequality)
= (1/ε²) ∑_{j=1}^∞ E[ (θ̂_j − θ_j)² ]
= (1/ε²) [ ∑_{j=1}^∞ ( E[θ̂_j] − θ_j )² + ∑_{j=1}^∞ Var(θ̂_j) ]
≤ (1/ε²) [ ∑_{j=1}^∞ θ_j² / (1 + nτ_j²)² + ∑_{j=1}^∞ τ_j² / (1 + nτ_j²) ]     (using part (a))
→ 0

as n → ∞, since each term tends to 0, the two series are dominated by the summable series ∑_j θ_j² and ∑_j τ_j², and dominated convergence applies.

(c) See Shen and Wasserman (2001).
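A brief sketch of the calculation (an outline under the stated assumptions, not taken from the cited paper): with τ_j² = j^{−2r}, the posterior mean from part (a) is the linear shrinkage estimator θ̂_j = w_j Y_j with w_j = n/(n + j^{2r}), and its squared-error risk over the Sobolev ball splits into a variance term and a bias term,

(1/n) ∑_j w_j² ≍ n^{1/(2r) − 1}   and   sup_Θ ∑_j (1 − w_j)² θ_j² ≍ n^{−p/r}.

Balancing the exponents, 1/(2r) − 1 = −p/r, gives r = p + 1/2, for which both terms are of order n^{−2p/(2p+1)}, the minimax rate.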


Theorem 8 Let X_1, . . . , X_n ∼ N(µ, σ²) where σ² is known. Let µ ∼ N(a, b²). Then

E[µ ∣ X_1, . . . , X_n] = ( aσ² + nX̄b² ) / ( σ² + nb² ).
Proof.

f_{X^n}(x^n ∣ µ) = ∏_{i=1}^n (1/√(2πσ²)) exp{ −(x_i − µ)²/(2σ²) } = (2πσ²)^{−n/2} exp{ −(1/(2σ²)) ∑_{i=1}^n (x_i − µ)² }
= (2πσ²)^{−n/2} exp{ −(1/(2σ²)) [ (n − 1)s² + n(µ − X̄)² ] }
= (2πσ²)^{−n/2} exp{ −(n − 1)s²/(2σ²) } exp{ −n(µ − X̄)²/(2σ²) }
∝ exp{ −n(µ − X̄)²/(2σ²) }.

π(µ) = (1/√(2πb²)) exp{ −(µ − a)²/(2b²) } ∝ exp{ −(µ − a)²/(2b²) }.
Hence,

π(µ ∣ X^n) ∝ f_{X^n}(x^n ∣ µ) π(µ) ∝ exp{ −n(µ − X̄)²/(2σ²) − (µ − a)²/(2b²) }
= exp{ (−nµ² + 2nµX̄ − nX̄²)/(2σ²) + (−µ² + 2µa − a²)/(2b²) }
= exp{ µ² ( −1/(2b²) − n/(2σ²) ) − 2µ ( −a/(2b²) − nX̄/(2σ²) ) − ( a²/(2b²) + nX̄²/(2σ²) ) }.

For simplicity, let

U = ( −1/(2b²) − n/(2σ²) )   and   V = ( −a/(2b²) − nX̄/(2σ²) ).

Then

π(µ ∣ X^n) ∝ exp{ Uµ² − 2Vµ } = exp{ U ( µ² − 2(V/U)µ + (V/U)² ) − V²/U }
∝ exp{ U ( µ − V/U )² } ∝ N( V/U, −1/(2U) ).

Therefore the mean of the posterior is

µ̂ = E[µ ∣ X^n] = V/U = ( −aσ² − nb²X̄ ) / ( −σ² − nb² ) = ( aσ² + nX̄b² ) / ( σ² + nb² ).
