36-708 Statistical Machine Learning Homework #4 Solutions: DUE: April 19, 2019
Problem 1 [5 pts.]
Consider the directed graph with vertices V = {X1 , X2 , X3 , X4 , X5 } and edge set E = {(1, 3), (2, 3), (3, 4), (3, 5)}.
(a)[2 pts.] List all the independence statements implied by this graph.
(b)[1 pts.] Find the causal distribution p(x4 ∣set x3 = s).
(c)[2 pts.] Find the implied undirected graph for these random variables. Which independence
statements get lost in the undirected graph (if any)?
Solution.
The graph has edges X1 → X3, X2 → X3, X3 → X4, and X3 → X5: X1 and X2 are the parents of X3, which in turn is the parent of both X4 and X5.
(a) The graph implies the following independence statements:
– X1 ⊥ X2;
– X4 ⊥ {X1, X2} ∣ X3 and X5 ⊥ {X1, X2} ∣ X3;
– X5 ⊥ X4 ∣ X3.
(b) Given that we set X3 = s, by the independence statements above X1 and X2 can be dropped from the graph, and X4 ⊥ X5 ∣ X3. Hence, writing p∗ for the joint distribution of (X4, X5) after the intervention,
\[ p(x_4 \mid \text{set } x_3 = s) = \int p^*(x_4, x_5)\, dx_5 = \int p(x_4 \mid x_3 = s)\, p(x_5 \mid x_3 = s)\, dx_5 = p(x_4 \mid x_3 = s). \]
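To make the marginalization explicit (a standard derivation added for clarity, not part of the original write-up), the DAG factorization and the definition of the intervention distribution give
\[ p(x_1, \ldots, x_5) = p(x_1)\, p(x_2)\, p(x_3 \mid x_1, x_2)\, p(x_4 \mid x_3)\, p(x_5 \mid x_3), \quad p^*(x_1, x_2, x_4, x_5) = p(x_1)\, p(x_2)\, p(x_4 \mid x_3 = s)\, p(x_5 \mid x_3 = s), \]
so integrating out x_1, x_2, and x_5 leaves p(x_4 ∣ set x_3 = s) = p(x_4 ∣ x_3 = s).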
(c) The implied undirected graph is obtained by moralization: connect the parents X1 and X2 (they share the child X3) and drop the edge directions, giving edges {X1, X2}, {X1, X3}, {X2, X3}, {X3, X4}, {X3, X5}. We lose the (unconditional) independence between X1 and X2 during the moralization process, while all the other independence statements are retained.
Problem 2 [20 pts.]
Let d ≥ 2, and let X1 , . . . , Xn ∼ P where Xi = (Xi (1), . . . , Xi (d)) ∈ Rd . Assume that the coordinates
of Xi are independent. Further, assume that Xi (j) ∼ Bernoulli(pj ) where 0 < c ≤ pj ≤ C < 1. Let P
be all such distributions. Let
\[ R_n = \inf_{\hat p} \sup_{P \in \mathcal{P}} E_P \|\hat p - p\|_\infty. \]
Solution.
For the upper bound, consider the sample mean X̄ as an estimator of p. Each coordinate of X̄ − p is a centered average of n independent Bernoulli variables and is therefore sub-Gaussian with parameter σ² = 1/(4n). Hence, by the maximal inequality for sub-Gaussian random variables (E max_{1≤i≤d} ∣Z_i∣ ≤ σ√(2 log(2d)) for mean-zero sub-Gaussians with parameter σ²),
\[ E\|\bar X - p\|_\infty = E\Big[\max_{1 \le i \le d} |\bar X(i) - p_i|\Big] \le \frac{1}{2\sqrt{n}}\sqrt{2\log(2d)}. \]
And since d ≥ 2, we have 2 log(2d) ≤ 4 log d, so
\[ \inf_{\hat p} \sup_{P \in \mathcal{P}} E_P \|\hat p - p\|_\infty \le E\|\bar X - p\|_\infty \le \sqrt{\frac{\log d}{n}}. \]
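As an illustrative check (our own addition, not part of the graded solution), the following R sketch estimates E∥X̄ − p∥∞ by Monte Carlo and compares it with the bound √(log d / n); the particular n, d, and p below are arbitrary choices:

# Monte Carlo estimate of E||Xbar - p||_inf for independent Bernoulli coordinates
set.seed(1)
n <- 100; d <- 50; B <- 500
p <- runif(d, min = 0.2, max = 0.8)        # coordinate means bounded away from 0 and 1
sup_err <- replicate(B, {
  X <- matrix(rbinom(n * d, size = 1, prob = rep(p, each = n)), nrow = n, ncol = d)
  max(abs(colMeans(X) - p))                # ||Xbar - p||_inf for one sample
})
c(mc_mean = mean(sup_err), bound = sqrt(log(d) / n))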
For the lower bound, let α = √(log d / (16n)) and let p^(0) = (c, . . . , c) ∈ R^d. For 1 ≤ j ≤ d, construct the d-dimensional parameter p^(j) that agrees with p^(0) except in coordinate j, where p^(j)(j) = c + α. Then
\[ \min_{j \ne k} \|p^{(j)} - p^{(k)}\|_\infty = \alpha. \]
Let P^(j) be the multivariate Bernoulli distribution with parameter p^(j); it is a product of univariate Bernoullis, P^(j) = ∏_{i=1}^d P_i^(j), where P_i^(j) is a univariate Bernoulli with parameter p^(j)(i). Then
\[ KL(P^{(j)}, P^{(k)}) = \sum_{i=1}^d KL(P_i^{(j)}, P_i^{(k)}) = KL(P_j^{(j)}, P_j^{(k)}) + KL(P_k^{(j)}, P_k^{(k)}), \]
since P^(j) and P^(k) differ only in the coordinates i = j and i = k. Writing P_c for the univariate Bernoulli with parameter c,
\[
\begin{aligned}
KL(P^{(j)}, P^{(k)}) &= KL(P_j^{(j)}, P_j^{(k)}) + KL(P_k^{(j)}, P_k^{(k)}) \\
&= \sum_x P_{c+\alpha}(x) \log\Big(\frac{P_{c+\alpha}(x)}{P_c(x)}\Big) + \sum_x P_c(x) \log\Big(\frac{P_c(x)}{P_{c+\alpha}(x)}\Big) \\
&= \sum_x \big(P_{c+\alpha}(x) - P_c(x)\big) \log\Big(\frac{P_{c+\alpha}(x)}{P_c(x)}\Big) \\
&= \alpha \log\frac{(\alpha + c)(1 - c)}{(1 - \alpha - c)\, c} \\
&\le C' \alpha^2, \quad \text{by a Taylor expansion, where } C' \text{ is a constant depending only on } c.
\end{aligned}
\]
(Fano's method.) By the KL bound above,
\[ \max_{j \ne k} KL(P^{(j)}, P^{(k)}) \le C' \alpha^2 = \frac{C' \log d}{16 n} \le \frac{\log(d+1)}{4 n}, \]
where the last inequality holds after, if necessary, shrinking the constant in the definition of α (this changes only constants, not the rate). Hence by Corollary 13 in the minimax notes with N = d + 1,
\[ \inf_{\hat p} \sup_{P \in \mathcal{P}} E_P \|\hat p - p\|_\infty \ge \frac{\alpha}{4} = \sqrt{\frac{\log d}{256\, n}}. \]
Note: following the same construction of the p^(j) and the same KL bound, minimax lower bounds of the same rate in d and n can also be derived using Theorem 12 or Theorem 14.
Solution.
Assume that the densities in {p_θ : θ ∈ Θ} are differentiable in quadratic mean (QMD), as in equation (27) of the minimax notes, which says that √(p_{θ+h}) admits a first-order Taylor expansion in h. As a consequence, the Hellinger loss is locally equivalent to the squared loss on θ:
\[ H^2(p_{\theta+h}, p_\theta) = \frac{1}{8}\, \|h\|^2\, I(\theta) + o(\|h\|^2). \]
For a proof, see https://www.stat.berkeley.edu/~bartlett/courses/2013spring-stat210b/notes/25notes.pdf.
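For intuition (an added example, not part of the original write-up, and using the convention H²(P, Q) = 1 − ∫√(pq), which matches the 1/8 constant above), consider the Gaussian location family p_θ = N(θ, 1), for which I(θ) = 1:
\[ H^2(p_{\theta+h}, p_\theta) = 1 - \exp(-h^2/8) = \tfrac{1}{8} h^2 + o(h^2), \]
in agreement with the QMD expansion.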
So the minimax rate for the Hellinger loss is the same as for the squared-error risk R(θ̂, θ) = E∥θ̂ − θ∥².
Next, we prove that the minimax rate for the squared loss is achieved by the MLE. For a fixed true parameter θ, let θ̂_n^mle denote the MLE based on the sample X1, ⋯, Xn ∼ Pθ. By the asymptotic distribution of the MLE,
\[ \sqrt{n}\,\big(\hat\theta_n^{mle} - \theta\big) \to N\big(0,\, I^{-1}(\theta)\big) \text{ in distribution}, \]
so the squared risk of the MLE satisfies n R(θ̂_n^mle, θ) → I^{-1}(θ). For any other estimator θ̂, by Theorem 17 with ψ(x) = x and ℓ the squared loss, under the QMD condition the squared risk is asymptotically bounded below by that of the MLE,
\[ \liminf_{n \to \infty} n\, R(\hat\theta, \theta) \ge I^{-1}(\theta). \]
This lower bound holds for all θ ∈ Θ. So the minimax risk for the squared loss is achieved by the MLE and, for large n,
\[ \inf_{\hat\theta} \sup_{\theta \in \Theta} R(\hat\theta, \theta) \approx \sup_{\theta \in \Theta} R(\hat\theta_n^{mle}, \theta) \approx \frac{1}{n} \sup_{\theta \in \Theta} I^{-1}(\theta). \]
Therefore,
\[ \inf_{\hat\theta} \sup_{\theta \in \Theta} H^2(p_{\hat\theta}, p_\theta) = O\Big(\frac{1}{n}\Big), \]
or equivalently,
\[ \inf_{\hat\theta} \sup_{\theta \in \Theta} H(p_{\hat\theta}, p_\theta) = O(n^{-1/2}). \]
where we assume enough regularity of the density to allow swapping the derivative and the integral, so that the squared Hellinger distance is bounded by a constant times I(θ) times the squared estimation error; thus,
\[ \inf_{\hat\theta} \sup_{\theta \in \Theta} H(p_{\hat\theta}, p_\theta) \le H(p_{\hat\theta^{mle}_n}, p_\theta) \le \sqrt{C\, I(\theta)\,\big(\mathrm{Var}(\hat\theta_n^{mle}) + \mathrm{bias}^2\big)} = O(n^{-1/2}). \]
For the lower bound, consider two distributions p_θ and p_{θ+h}, where h = √(log 2 / (C I(θ) n)). Then by the QMD condition,
\[ H(p_{\theta+h}, p_\theta) \asymp h = O(n^{-1/2}), \]
and the KL divergence satisfies
\[ KL(p_{\theta+h}, p_\theta) \le C\, I(\theta)\, h^2 \le \frac{\log 2}{n}. \]
Hence by Corollary 5 in the minimax notes, the minimax Hellinger risk is bounded below by a constant multiple of H(p_{θ+h}, p_θ), which is of order n^{-1/2}, matching the upper bound.
Solution.
For the upper bound, we first use a high-probability bound for the maximum of Gaussians (from 36-705, Lecture 27, Fall 2018): if ε_1, . . . , ε_d are N(0, σ²), then with probability at least 1 − δ,
\[ \max_{1 \le i \le d} |\epsilon_i| \le \sigma\sqrt{2\log(2d/\delta)}. \]
Proof (of the risk bound for the hard-thresholding estimator θ̂_i = y_i I(∣y_i∣ ≥ t), where y_i = θ_i + ε_i). We condition on the event above, i.e. that
\[ \max_{1 \le i \le d} |\epsilon_i| \le \sigma\sqrt{2\log(2d/\delta)} \le \frac{t}{2}, \]
where the second inequality holds by the choice of the threshold t. Now, observe that
\[ \|\hat\theta - \theta\|_2^2 = \sum_{i=1}^d (\hat\theta_i - \theta_i)^2, \]
and we bound each coordinate's contribution by cases:
1. If for a coordinate ∣θ_i∣ ≤ t/2, our estimate is 0, so our risk for that coordinate is simply θ_i².
2. If ∣θ_i∣ ≥ 3t/2, our estimate is simply θ̂_i = y_i, so our risk is ε_i² ≤ t²/4.
3. If t/2 ≤ ∣θ_i∣ ≤ 3t/2, then our risk satisfies
\[ (\hat\theta_i - \theta_i)^2 = \big(y_i I(|y_i| \ge t) - \theta_i\big)^2 = \theta_i^2\, I(|y_i| < t) + \epsilon_i^2\, I(|y_i| \ge t) \le \max\{\epsilon_i^2, \theta_i^2\} \le \frac{9t^2}{4}. \]
We also have that, under the same assumptions as above, with probability at least 1 − δ,
\[ \|\hat\theta - \theta\|_2^2 = \sum_{i=1}^d \big(y_i\, I(|y_i| \ge t) - \theta_i\big)^2 \le d \max_{1 \le i \le d} (\hat\theta_i - \theta_i)^2 \le C_2\, d\, t^2, \]
since by the case analysis each coordinate contributes at most 9t²/4 (so one may take C₂ = 9/4).
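A brief numerical illustration of the hard-thresholding estimator analyzed above (our own sketch; the signal vector and the threshold choice t = 2σ√(2 log(2d/δ)), taken from the event used in the proof, are assumptions made for the demonstration):

# Hard thresholding in the normal means model y_i = theta_i + eps_i, eps_i ~ N(0, sigma^2)
set.seed(2)
d <- 1000; sigma <- 1; delta <- 0.05
theta <- c(rep(15, 20), rep(0, d - 20))                # a sparse mean vector
y <- theta + rnorm(d, sd = sigma)
t_thresh <- 2 * sigma * sqrt(2 * log(2 * d / delta))   # chosen so that max|eps_i| <= t/2 w.h.p.
theta_hat <- ifelse(abs(y) >= t_thresh, y, 0)          # theta_hat_i = y_i * I(|y_i| >= t)
c(squared_error = sum((theta_hat - theta)^2),
  bound = (9 / 4) * d * t_thresh^2)                    # the crude bound C2 * d * t^2 with C2 = 9/4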
For the lower bound, let P_j = N(θ_j, I), where θ_0 = (0, . . . , 0) and, for j = 1, . . . , d (with d ≥ 8), θ_j is the d-dimensional vector with
\[ \theta_j(k) = \begin{cases} \sqrt{\dfrac{\log d}{32}} & k = j, \\[4pt] 0 & k \ne j. \end{cases} \]
P_0 is absolutely continuous with respect to each other distribution, and for all j = 1, . . . , d we claim that
\[ KL(P_j, P_0) \le \frac{\log d}{16}. \]
The claim in fact holds for any pair P_j, P_k with j ≠ k. Writing P_j = N(µ_j, I),
\[ KL(P_j, P_k) = \int \frac{1}{2}\big(\|x - \mu_k\|_2^2 - \|x - \mu_j\|_2^2\big)\, P_j(x)\, dx = \frac{1}{2}\|\mu_j - \mu_k\|_2^2, \]
which implies
\[ \max_{j \ne k} KL(P_j, P_k) = \max_{j \ne k} \frac{1}{2}\|\mu_j - \mu_k\|_2^2 = \frac{1}{2}\cdot 2\cdot \frac{\log d}{32} = \frac{\log d}{32} \le \frac{\log d}{16}. \]
(b) (10 pts.) Simulate n = 10 data points from a N (0, 1). Try three values of α: namely, α = .1,
α = 1 and α = 10. Compute the 95 percent Bayesian confidence band and the 95 percent DKW
band. Plot the results for one example. Now repeat the simulation 1,000 times and report the
coverage probability for each confidence band.
Solution.
(a) Write W_j = V_j ∏_{i=1}^{j−1}(1 − V_i) for the stick-breaking weights. We show by induction that
\[ 1 - \sum_{j=1}^{n-1} W_j = \prod_{j=1}^{n-1} (1 - V_j). \]
Inductive step.
\[
\begin{aligned}
1 - \sum_{j=1}^{n} W_j &= 1 - \sum_{j=1}^{n-1} W_j - W_n \\
&= \prod_{j=1}^{n-1} (1 - V_j) - V_n \prod_{j=1}^{n-1} (1 - V_j) \\
&= (1 - V_n) \prod_{j=1}^{n-1} (1 - V_j) \\
&= \prod_{j=1}^{n} (1 - V_j). \qquad \checkmark \qquad (1)
\end{aligned}
\]
Lemma 5 Let v_1, v_2, . . . be a sequence such that 0 < v_j < 1 for all j. Then
\[ \prod_{j=1}^\infty (1 - v_j) > 0 \quad \text{if and only if} \quad \sum_{j=1}^\infty v_j < \infty. \]
Proof. Since ∏_{j=1}^∞ (1 − v_j) = exp(∑_{j=1}^∞ log(1 − v_j)), we have
\[ \prod_{j=1}^\infty (1 - v_j) > 0 \iff -\sum_{j=1}^\infty \log(1 - v_j) < \infty. \]
We now have −log(1 − v_j) > 0 and v_j > 0 for all j ∈ N, and by L'Hôpital's rule
\[ \lim_{v \to 0^+} \frac{-\log(1 - v)}{v} = \lim_{v \to 0^+} \frac{1}{1 - v} = 1. \qquad (2) \]
Hence, by the limit comparison test,
\[ -\sum_{j=1}^\infty \log(1 - v_j) < \infty \quad \text{if and only if} \quad \sum_{j=1}^\infty v_j < \infty, \]
and thus
\[ \prod_{j=1}^\infty (1 - v_j) > 0 \quad \text{if and only if} \quad \sum_{j=1}^\infty v_j < \infty. \]
Corollary 6 Let v_1, v_2, . . . be a sequence such that 0 < v_j < 1 for all j. Then
\[ \prod_{j=1}^\infty (1 - v_j) = 0 \quad \text{if and only if} \quad \sum_{j=1}^\infty v_j = \infty. \]
Lemma 7 (Borel–Cantelli) If V_1, V_2, . . . are independent and ∑_{j=1}^∞ P(V_j > ε) = ∞ for some ε > 0, then P(V_j > ε i.o.) = 1.
Now since
\[ V_j \sim \mathrm{Beta}(1, \alpha), \qquad j = 1, 2, \ldots, \]
we have
\[ P(V_j > \epsilon) > 0 \quad \text{for all } \epsilon \in (0, 1) \text{ and } j = 1, 2, \ldots \qquad (4) \]
since the beta distribution puts positive mass over its entire support (0, 1), and now (4) implies
\[ \sum_{j=1}^\infty P(V_j > \epsilon) = \infty \quad \text{for all } \epsilon \in (0, 1). \qquad (5) \]
So altogether,
\[ P\Big(\sum_{j=1}^\infty W_j = 1\Big) \overset{(1)}{=} P\Big(\prod_{j=1}^\infty (1 - V_j) = 0\Big) \overset{\text{Cor. 6}}{=} P\Big(\sum_{j=1}^\infty V_j = \infty\Big) = 1, \]
where the last equality holds by (5) and Lemma 7: the V_j exceed some fixed ε > 0 infinitely often with probability one, so ∑_j V_j = ∞ almost surely. Thus,
\[ P\Big(\sum_{j=1}^\infty W_j = 1\Big) = 1. \qquad \blacksquare \]
It follows that
\[ E\Big(\sum_{j=1}^\infty W_j\Big) = \int \sum_{j=1}^\infty W_j\, dP = P\Big(\sum_{j=1}^\infty W_j = 1\Big)\cdot 1 + 0 = 1, \]
since ∑_{j=1}^∞ W_j = 1 almost surely.
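As an optional numerical illustration (ours, not part of the original solution), the following R snippet draws stick-breaking weights W_j = V_j ∏_{i<j}(1 − V_i) with V_j ∼ Beta(1, α) and checks that the truncated sum is already close to 1:

# Stick-breaking weights: W_j = V_j * prod_{i<j} (1 - V_i), with V_j ~ Beta(1, alpha)
set.seed(3)
alpha <- 1
J <- 200                                  # truncation level
V <- rbeta(J, shape1 = 1, shape2 = alpha)
W <- V * cumprod(c(1, 1 - V[-J]))         # W_1 = V_1, W_j = V_j * prod_{i<j} (1 - V_i)
sum(W)                                    # close to 1; the leftover mass is prod(1 - V)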
(b) Two good resources for such a simulation are the distr package in R, which is showcased by this tutorial, and, for Python 3, this tutorial, which uses the PyMC3 module to achieve a similar goal. We include R code for a single simulation, with parts taken from the tutorial indicated above.
library(distr)
library(coda)
library(latex2exp)
# Setup
set.seed(7)
n <- 10
alpha_vec <- c(0.1,1,10)
x_grid <- seq(-3, 3, by=0.05)
signif_level <- 0.05
# Sample observations
x_pts <- rnorm(n)
## BAYESIAN BANDS
# Functions to generate Bayesian credible bands
sample_cdf <- function(F_hat, n){
  F_hat@r(n)  # F_hat is an S4 object from the distr package; @r is its random-sampling slot
}
  points(x_grid, list_out_bayes[[as.character(alpha_val)]]$'mean_post_sim',
         type="l", col="red")
  points(x_grid, list_out_bayes[[as.character(alpha_val)]]$'cred_int'[2,], type="l", col="blue")
  points(x_grid, list_out_bayes[[as.character(alpha_val)]]$'cred_int'[1,], type="l", col="blue")
  curve(pnorm, xlim=c(-3, 3), add=TRUE, col="black", lwd=2)  # true CDF in black, matching the legend
  legend("topleft", lty="solid",
         legend=c("True", "DKW", "Kolmogorov", "Posterior Mean"),
         col=c("black", "green", "blue", "red"))
}
Figure 1: 95% DKW and Bayesian credible bands at different alpha levels: 0.1, 1, 10 from upper left
to bottom center respectively
For the 1,000 simulation replicates, one replicates the above code and records, for each replicate, whether the true CDF Φ lies entirely between the bands, for both the Bayesian credible band and the DKW band; the proportion of replicates in which this happens is the reported coverage probability. A sketch of this coverage loop for the DKW band is given below.
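A minimal sketch of the coverage loop for the DKW band, assuming the true distribution N(0, 1) and n = 10 as in the problem (checking containment on a finite grid is an approximation, and the Bayesian-band loop would reuse the credible-band functions above in the same way):

# Coverage of the 95% DKW band over 1,000 replications; half-width eps = sqrt(log(2/alpha)/(2n))
set.seed(7)
n <- 10; B <- 1000
eps <- sqrt(log(2 / 0.05) / (2 * n))
x_grid <- seq(-3, 3, by = 0.05)
covered <- replicate(B, {
  x <- rnorm(n)
  F_hat <- ecdf(x)(x_grid)                              # empirical CDF evaluated on the grid
  lower <- pmax(F_hat - eps, 0); upper <- pmin(F_hat + eps, 1)
  all(pnorm(x_grid) >= lower & pnorm(x_grid) <= upper)  # does the true CDF stay inside the band?
})
mean(covered)                                           # empirical coverage probability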
\[ X_{ij} = \theta_j + \epsilon_{ij}, \]
where all the ε_ij's are independent N(0, 1). The parameter is θ = (θ_1, θ_2, . . .). Assume that ∑_j θ_j² < ∞.
Due to sufficiency, we can reduce the problem to the sample means. Thus let Y_j = n^{-1} ∑_{i=1}^n X_{ij}. So
the model is Yj ∼ N (θj , 1/n) for j = 1, 2, 3, . . . . We will put a prior π on θ as follows. We take each
θj to be independent and we take θj ∼ N (0, τj2 ).
(a) (5 pts.) Find the posterior for θ. Find the posterior mean θ̂.
(b) (7 pts.) Suppose that ∑_j τ_j² < ∞. Show that θ̂ is consistent, that is, ∥θ̂ − θ∥₂ → 0 in probability.
where p > 1/2. The minimax rate (for squared error loss) for this problem is R_n ≍ n^{−2p/(2p+1)}. Let τ_j² = (1/j)^{2r}. Find r so that the posterior mean achieves the minimax rate.
Solution.
(a) Each coordinate is a conjugate normal–normal pair: the prior θ_j ∼ N(0, τ_j²) combined with Y_j ∼ N(θ_j, 1/n) gives the posterior
\[ \theta_j \mid Y_j \sim N\Big( \frac{n Y_j \tau_j^2}{1 + n\tau_j^2},\; \frac{\tau_j^2}{1 + n\tau_j^2} \Big), \]
with the coordinates independent, so the posterior mean is
\[ \hat\theta_j = \frac{n Y_j \tau_j^2}{1 + n \tau_j^2}. \]
(b) For any ε > 0, by Markov's inequality applied to ∥θ̂ − θ∥²,
\[
\begin{aligned}
P(\|\hat\theta - \theta\| > \epsilon) &\le \frac{E[\|\hat\theta - \theta\|^2]}{\epsilon^2} \\
&= \frac{1}{\epsilon^2} \sum_{j=1}^\infty E[(\hat\theta_j - \theta_j)^2] \\
&= \frac{1}{\epsilon^2} \Bigg[ \sum_{j=1}^\infty \big(E[\hat\theta_j] - \theta_j\big)^2 + \sum_{j=1}^\infty \mathrm{Var}(\hat\theta_j) \Bigg] \\
&\le \frac{1}{\epsilon^2} \Bigg[ \sum_{j=1}^\infty \frac{\theta_j^2}{(1 + n\tau_j^2)^2} + \sum_{j=1}^\infty \frac{\tau_j^2}{1 + n\tau_j^2} \Bigg] \qquad \text{(Theorem 6)} \\
&\to 0,
\end{aligned}
\]
since each summand tends to 0 as n → ∞ and is dominated by θ_j² and τ_j², respectively, which are summable (∑_j θ_j² < ∞ and ∑_j τ_j² < ∞); hence both sums tend to 0 by dominated convergence.
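A small numerical illustration of this shrinkage estimator (our sketch; the particular choices θ_j = 1/j, τ_j² = 1/j², and the truncation level J are assumptions made for the demonstration):

# Posterior mean theta_hat_j = n * Y_j * tau_j^2 / (1 + n * tau_j^2) on a truncated sequence,
# illustrating that ||theta_hat - theta||^2 shrinks as n grows.
set.seed(4)
J <- 5000                                        # truncation of the infinite sequence
j <- 1:J
theta <- 1 / j                                   # true means, sum(theta^2) < infinity
tau2 <- 1 / j^2                                  # prior variances, sum(tau2) < infinity
sq_err <- sapply(c(10, 100, 1000, 10000), function(n) {
  Y <- theta + rnorm(J, sd = 1 / sqrt(n))        # Y_j ~ N(theta_j, 1/n)
  theta_hat <- n * Y * tau2 / (1 + n * tau2)
  sum((theta_hat - theta)^2)
})
sq_err                                           # decreases toward 0 as n increases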
\[ E[\mu \mid X_1, \ldots, X_n] = \frac{a\sigma^2 + n \bar X b^2}{\sigma^2 + n b^2}. \]
Proof.
\[ f_{X^n}(x^n \mid \mu) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{-\frac{(x_i - \mu)^2}{2\sigma^2}\Big\} = (2\pi\sigma^2)^{-n/2} \exp\Big\{-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2\Big\}. \]
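A sketch of how this computation concludes (added here, assuming the conjugate prior µ ∼ N(a, b²) implied by the posterior-mean formula above): multiplying the likelihood by the prior and completing the square in µ,
\[ \pi(\mu \mid x^n) \propto \exp\Big\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2 - \frac{(\mu - a)^2}{2 b^2}\Big\} \propto \exp\Big\{-\frac{1}{2}\Big(\frac{n}{\sigma^2} + \frac{1}{b^2}\Big)\Big(\mu - \frac{n \bar x / \sigma^2 + a / b^2}{n/\sigma^2 + 1/b^2}\Big)^2\Big\}, \]
so the posterior is normal with mean (n x̄/σ² + a/b²)/(n/σ² + 1/b²) = (aσ² + n x̄ b²)/(σ² + n b²), as stated.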