Stats 231 / CS229T Homework 3 Solutions


Question 1: Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a valid kernel function. Define
\[
k_{\mathrm{norm}}(x, z) := \frac{k(x, z)}{\sqrt{k(x, x)}\,\sqrt{k(z, z)}}.
\]
Is $k_{\mathrm{norm}}$ a valid kernel? Justify your answer.


Answer: Yes, it is. Let $k(x, z) = \langle \phi(x), \phi(z) \rangle$ for some mapping $\phi : \mathcal{X} \to \mathcal{H}$, where $\mathcal{H}$ is a Hilbert space. Then
\[
k_{\mathrm{norm}}(x, z) = \langle \phi(x)/\|\phi(x)\|_2, \ \phi(z)/\|\phi(z)\|_2 \rangle,
\]
so that it is still a valid inner product, where the feature mapping is now $x \mapsto \phi(x)/\|\phi(x)\|_2$ with $\|\phi(x)\|_2^2 = \langle \phi(x), \phi(x) \rangle$.
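
As an illustrative aside, this normalization can be applied directly to a Gram matrix. The following is a minimal NumPy sketch; the polynomial base kernel and the helper names (`gram`, `normalize_gram`) are hypothetical choices made only for concreteness.

```python
import numpy as np

def gram(kernel, X):
    """Gram matrix K[i, j] = kernel(X[i], X[j])."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def normalize_gram(K):
    """Cosine-normalize: K[i, j] / sqrt(K[i, i] * K[j, j])."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

# A polynomial base kernel, chosen only for concreteness.
poly = lambda x, z: (1.0 + np.dot(x, z)) ** 3

X = np.random.randn(6, 2)
K_norm = normalize_gram(gram(poly, X))
# The normalized kernel has unit self-similarity and remains positive semidefinite.
assert np.allclose(np.diag(K_norm), 1.0)
assert np.linalg.eigvalsh(K_norm).min() > -1e-8
```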

Question 2: Consider the class of functions
\[
\mathcal{H} := \left\{ f : f(0) = 0, \ f' \in L^2([0, 1]) \right\},
\]
that is, functions $f : [0, 1] \to \mathbb{R}$ with $f(0) = 0$ that are almost everywhere differentiable, where $\int_0^1 (f'(t))^2 \, dt < \infty$. On this space of functions, we define the inner product by
\[
\langle f, g \rangle = \int_0^1 f'(t) g'(t) \, dt.
\]
Show that $k(x, z) = \min\{x, z\}$ is the reproducing kernel for $\mathcal{H}$, so that it is (i) positive semidefinite and (ii) a valid kernel.
Answer: If we show that $k(x, z) = \min\{x, z\}$ is indeed the reproducing kernel for $\mathcal{H}$, then that suffices to demonstrate that it is a positive definite function. Writing $g_z(t) = k(z, t) = \min\{z, t\}$, we have (almost everywhere) $g_z'(t) = 1\{t \le z\}$, so that
\[
\langle f, k(z, \cdot) \rangle = \int_0^1 f'(t) \, 1\{t \le z\} \, dt = \int_0^z f'(t) \, dt = f(z) - f(0) = f(z).
\]
Thus $k$ is evidently a reproducing kernel, so it must be a positive definite function.

(Another way to see this: we have $\min\{x, z\} = k(x, z) = \int_0^1 1\{t \le x\} \, 1\{t \le z\} \, dt$, so that $\min\{x, z\}$ is evidently an inner product.)
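
As a quick numerical sanity check, one can form the Gram matrix of $k(x, z) = \min\{x, z\}$ on random points in $[0, 1]$ and confirm that its eigenvalues are nonnegative. A minimal sketch, assuming NumPy is available:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=50)

# Gram matrix of k(x, z) = min{x, z}.
K = np.minimum.outer(x, x)

# All eigenvalues should be nonnegative (up to floating-point error),
# consistent with k being positive semidefinite.
eigs = np.linalg.eigvalsh(K)
print(eigs.min())
```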

Question 3: Consider the Sobolev space $\mathcal{F}_k$, which is defined as the set of functions that are $(k - 1)$-times differentiable and have a $k$th derivative almost everywhere on $[0, 1]$, where the $k$th derivative is square-integrable. That is, we define
\[
\mathcal{F}_k := \left\{ f : [0, 1] \to \mathbb{R} \ \big| \ f^{(k)} \in L^2([0, 1]) \right\},
\]
where $f^{(k)}$ denotes the $k$th derivative of $f$. We define the inner product on $\mathcal{F}_k$ by
\[
\langle f, g \rangle = \sum_{i=0}^{k-1} f^{(i)}(0) g^{(i)}(0) + \int_0^1 f^{(k)}(t) g^{(k)}(t) \, dt.
\]

(a) Find the representer of evaluation for this Hilbert space, that is, find a function $r_x : [0, 1] \to \mathbb{R}$ (defined for each $x \in [0, 1]$) such that $r_x \in \mathcal{F}_k$ and
\[
\langle r_x, f \rangle = f(x)
\]
for all $f \in \mathcal{F}_k$.

(b) What is the reproducing kernel $k(x, z)$ associated with this space? (Recall that $k(x, z) = \langle r_x, r_z \rangle$ for an RKHS.)

(c) Show that $\mathcal{F}_k$ is a Hilbert space, meaning that $\|f\|^2 = \langle f, f \rangle$ defines a norm and that $\mathcal{F}_k$ is complete for the norm.

Answer:

(a) By Taylor's theorem, we have
\[
f(x) = f(0) + \sum_{i=1}^{k-1} f^{(i)}(0) \frac{x^i}{i!} + \frac{1}{(k-1)!} \int_0^x f^{(k)}(t) (x - t)^{k-1} \, dt.
\]
Define the function
\[
r_x(t) = \sum_{i=0}^{k-1} \frac{x^i t^i}{i! \, i!} + \frac{(-1)^k}{(2k-1)!} \max\{x - t, 0\}^{2k-1} + \sum_{i=0}^{k-1} (-1)^{k+i+1} \frac{x^{2k-1-i} t^i}{(2k-1-i)! \, i!}.
\]
Then
\[
r_x^{(i)}(0) = \frac{1}{i!} x^i + \frac{(-1)^{k+i}}{(2k-1-i)!} \max\{x, 0\}^{2k-1-i} + \frac{(-1)^{k+i+1}}{(2k-1-i)!} x^{2k-1-i} = \frac{x^i}{i!}
\]
for $i < k$, and
\[
r_x^{(k)}(t) = \frac{1}{(k-1)!} \max\{x - t, 0\}^{k-1}.
\]
Thus we have
\begin{align*}
\langle f, r_x \rangle
&= f(0) + f'(0) x + \frac{1}{2} f''(0) x^2 + \cdots + \frac{1}{(k-1)!} f^{(k-1)}(0) x^{k-1} + \frac{1}{(k-1)!} \int_0^1 f^{(k)}(t) \, [x - t]_+^{k-1} \, dt \\
&= \sum_{i=0}^{k-1} \frac{f^{(i)}(0)}{i!} x^i + \frac{1}{(k-1)!} \int_0^x f^{(k)}(t) (x - t)^{k-1} \, dt \\
&= f(x),
\end{align*}
where the last equality is Taylor's theorem.

(b) For the reproducing kernel, note that
\begin{align*}
k(x, z) = \langle r_x, r_z \rangle
&= \sum_{i=0}^{k-1} \frac{x^i z^i}{i! \, i!} + \frac{1}{(k-1)!(k-1)!} \int_0^1 [x - t]_+^{k-1} [z - t]_+^{k-1} \, dt \\
&= \sum_{i=0}^{k-1} \frac{x^i z^i}{i! \, i!} + \frac{1}{(k-1)!(k-1)!} \int_0^{\min\{x, z\}} (x - t)^{k-1} (z - t)^{k-1} \, dt.
\end{align*}
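
As a numerical sanity check of this formula, one can evaluate the integral by quadrature. The sketch below assumes SciPy is available and uses the fact that for $k = 1$ the formula reduces to $1 + \min\{x, z\}$; the helper name `sobolev_kernel` is made up for the illustration.

```python
from math import factorial
from scipy.integrate import quad

def sobolev_kernel(x, z, k):
    """Evaluate the kernel formula from part (b) by numerical integration."""
    poly = sum((x ** i * z ** i) / factorial(i) ** 2 for i in range(k))
    integrand = lambda t: (x - t) ** (k - 1) * (z - t) ** (k - 1)
    integral, _ = quad(integrand, 0.0, min(x, z))
    return poly + integral / factorial(k - 1) ** 2

# For k = 1 the formula collapses to 1 + min{x, z}, which serves as a quick check.
x, z = 0.3, 0.7
assert abs(sobolev_kernel(x, z, k=1) - (1 + min(x, z))) < 1e-8
print(sobolev_kernel(x, z, k=2))  # kernel value in the k = 2 Sobolev space
```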

(c) To see that $\mathcal{F}_k$ is a Hilbert space, we must show that $\|f\|_{\mathcal{H}}^2 = \langle f, f \rangle$ is a norm and that $\mathcal{F}_k$ is complete for $\|\cdot\|_{\mathcal{H}}$. Non-negativity of $\|\cdot\|_{\mathcal{H}}$ and the triangle inequality are trivial, as it is clear that $\langle \cdot, \cdot \rangle$ is an inner product. Now suppose that $\|f\|_{\mathcal{H}} = 0$. Then $f^{(l)}(0) = 0$ for all $l < k$, and $\int_0^1 f^{(k)}(t)^2 \, dt = 0$, so that $f^{(k)} = 0$ almost everywhere. Of course, this shows that $f^{(k-1)} \equiv 0$ by integration, and so on, so that $f \equiv 0$. To show completeness, let $f_n$ be a Cauchy sequence in $\mathcal{F}_k$. Then since
\[
\|f_n - f_m\|_{\mathcal{H}}^2 = \sum_{l=0}^{k-1} \left( f_n^{(l)}(0) - f_m^{(l)}(0) \right)^2 + \int_0^1 \left( f_n^{(k)}(t) - f_m^{(k)}(t) \right)^2 dt,
\]
it is clear that $f_n^{(l)}(0)$ is a Cauchy sequence in $\mathbb{R}$ and $f_n^{(k)}$ is a Cauchy sequence in $L^2([0, 1])$. Completeness of $\mathbb{R}$ and completeness of $L^2$ then imply the existence of $\lim_n f_n^{(l)}(0)$ for $l < k$ and a $g \in L^2([0, 1])$ such that $f_n^{(k)} \to g$ in $L^2$. Now define the functions $f^{(l)}$ by
\[
f^{(k)}(x) = g(x), \quad f^{(k-1)}(x) = \lim_n f_n^{(k-1)}(0) + \int_0^x g(t) \, dt, \quad \ldots, \quad f(x) = \lim_n f_n(0) + \int_0^x f^{(1)}(t) \, dt.
\]
Since $f^{(k)} \in L^2([0, 1])$, it is clear that each of the $f^{(l)}$ for $l < k$ is absolutely continuous, and the derivative of $f^{(l)}$ is $f^{(l+1)}$. So $f_n$ indeed has a limit $f$.

Question 4: The variation distance between probability distributions $P$ and $Q$ on a space $\mathcal{X}$ is defined by $\|P - Q\|_{\mathrm{TV}} = \sup_{A \subset \mathcal{X}} |P(A) - Q(A)|$.

(a) Show that
\[
2 \|P - Q\|_{\mathrm{TV}} = \sup_{f : \|f\|_\infty \le 1} \left\{ \mathbb{E}_P[f(X)] - \mathbb{E}_Q[f(X)] \right\},
\]
where the supremum is taken over all functions with $f(x) \in [-1, 1]$, and the first expectation is taken with respect to $P$ and the second with respect to $Q$. You may assume that $P$ and $Q$ have densities.

Answer: Using the assumption that we have densities and that $P(A) - Q(A) = 1 - P(A^c) - (1 - Q(A^c)) = Q(A^c) - P(A^c)$, we have
\[
\|P - Q\|_{\mathrm{TV}} = \sup_{A \subset \mathcal{X}} \{P(A) - Q(A)\} = \sup_A \int 1\{x \in A\} (p(x) - q(x)) \, dx = \int 1\{p(x) \ge q(x)\} (p(x) - q(x)) \, dx.
\]
Similarly, we have $\|P - Q\|_{\mathrm{TV}} = \sup_A \{Q(A) - P(A)\}$, and combining these yields
\[
2 \|P - Q\|_{\mathrm{TV}} = \int \left( 1\{p(x) \ge q(x)\} - 1\{p(x) \le q(x)\} \right) (p(x) - q(x)) \, dx.
\]
But of course, $\sup_{a \in [-1, 1]} a (p - q) = (p - q)\left( 1\{p \ge q\} - 1\{p \le q\} \right)$, which proves the result.
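
As an illustrative aside, for discrete distributions both characterizations can be computed exactly. A minimal NumPy sketch with a made-up four-point example follows; the function names are hypothetical.

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance via the maximizing set A = {x : p(x) >= q(x)}."""
    A = p >= q
    return np.sum((p - q)[A])

def tv_via_l1(p, q):
    """Equivalent formula: half the L1 distance, i.e. (1/2) sup over |f| <= 1."""
    return 0.5 * np.sum(np.abs(p - q))

# Two hypothetical distributions on a four-point space, for illustration only.
p = np.array([0.4, 0.3, 0.2, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])
assert np.isclose(tv_distance(p, q), tv_via_l1(p, q))
print(tv_distance(p, q))  # 0.15 + 0.05 = 0.20
```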

Question 5: In a number of experimental situations, it is valuable to determine whether two distributions $P$ and $Q$ are the same or different. For example, $P$ may be the distribution of widgets produced by one machine, $Q$ the distribution of widgets produced by a second machine, and we wish to test whether the two distributions are the same (to within allowable tolerances). Let $\mathcal{H}$ be an RKHS of functions with domain $\mathcal{X}$ and reproducing kernel $k$, and let $P$ and $Q$ be distributions on $\mathcal{X}$.

(a) Let $\|\cdot\|_{\mathcal{H}}$ denote the norm on the Hilbert space $\mathcal{H}$. Show that
\[
D_k(P, Q)^2 := \sup_{f : \|f\|_{\mathcal{H}} \le 1} \left| \mathbb{E}_P[f(X)] - \mathbb{E}_Q[f(Z)] \right|^2 = \mathbb{E}[k(X, X')] + \mathbb{E}[k(Z, Z')] - 2 \mathbb{E}[k(X, Z)],
\]
where $X, X' \stackrel{\mathrm{iid}}{\sim} P$ and $Z, Z' \stackrel{\mathrm{iid}}{\sim} Q$.

(b) A kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called universal if the induced RKHS $\mathcal{H}$ of functions $f : \mathcal{X} \to \mathbb{R}$ can arbitrarily approximate continuous functions. That is, for any continuous $\phi : \mathcal{X} \to \mathbb{R}$ and $\epsilon > 0$, there is some $f \in \mathcal{H}$ such that
\[
\sup_{x \in \mathcal{X}} |f(x) - \phi(x)| \le \epsilon.
\]
Show that if $k$ is universal, then
\[
D_k(P, Q) = 0 \quad \text{if and only if} \quad P = Q.
\]
You may assume $\mathcal{X}$ is a metric space and that $P = Q$ iff $P(A) = Q(A)$ for all compact $A \subset \mathcal{X}$.

(c) You wish to estimate $D_k(P, Q)$ given samples from each of the distributions. Assume that $k(x, z) \in [-B, B]$ for all $x, z \in \mathcal{X}$. Let $X_i \stackrel{\mathrm{iid}}{\sim} P$, $i = 1, \ldots, n_1$, and $Z_i \stackrel{\mathrm{iid}}{\sim} Q$, $i = 1, \ldots, n_2$. Define
\[
\widehat{K}(X_{1:n_1}) := \binom{n_1}{2}^{-1} \sum_{1 \le i < j \le n_1} k(X_i, X_j), \qquad
\widehat{K}(Z_{1:n_2}) := \binom{n_2}{2}^{-1} \sum_{1 \le i < j \le n_2} k(Z_i, Z_j),
\]
and
\[
\widehat{K}(X_{1:n_1}, Z_{1:n_2}) := \frac{1}{n_1 n_2} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} k(X_i, Z_j).
\]

Show that $\mathbb{E}[\widehat{K}(X_{1:n_1})] = \mathbb{E}[k(X, X')]$ and $\mathbb{E}[\widehat{K}(X_{1:n_1}, Z_{1:n_2})] = \mathbb{E}[k(X, Z)]$ for $X, X' \stackrel{\mathrm{iid}}{\sim} P$ and $Z, Z' \stackrel{\mathrm{iid}}{\sim} Q$. Show for some numerical constant $c > 0$ that for all $t \ge 0$,
\[
\mathbb{P}\left( \left| \widehat{K}(X_{1:n}) - \mathbb{E}[k(X, X')] \right| \ge t \right) \le 2 \exp\left( -c \frac{n t^2}{B^2} \right)
\]
and
\[
\mathbb{P}\left( \left| \widehat{K}(X_{1:n_1}, Z_{1:n_2}) - \mathbb{E}[k(X, Z)] \right| \ge t \right) \le 2 \exp\left( -c \frac{n_1 t^2}{B^2} \right) + 2 \exp\left( -c \frac{n_2 t^2}{B^2} \right).
\]

(d) Define the empirical Hilbert distance
\[
\widehat{D}_k^2(P, Q) := \binom{n_1}{2}^{-1} \sum_{1 \le i < j \le n_1} k(X_i, X_j) + \binom{n_2}{2}^{-1} \sum_{1 \le i < j \le n_2} k(Z_i, Z_j) - \frac{2}{n_1 n_2} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} k(X_i, Z_j).
\]
Show that for all $t \ge 0$,
\[
\mathbb{P}\left( \left| \widehat{D}_k^2(P, Q) - D_k^2(P, Q) \right| \ge t \right) \le C \exp\left( -c \frac{\min\{n_1, n_2\} t^2}{B^2} \right),
\]
where $0 < c, C < \infty$ are numerical constants.

Answer:

(a) As $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is the reproducing kernel for $\mathcal{H}$, we have for any $f \in \mathcal{H}$ such that $\|f\|_{\mathcal{H}} \le 1$
\begin{align*}
\mathbb{E}[f(X)] - \mathbb{E}[f(Z)]
&= \mathbb{E}[\langle f, k(X, \cdot) \rangle] - \mathbb{E}[\langle f, k(Z, \cdot) \rangle] \\
&\stackrel{(i)}{=} \langle f, \mathbb{E}[k(X, \cdot) - k(Z, \cdot)] \rangle \\
&\stackrel{(ii)}{\le} \|f\|_{\mathcal{H}} \, \|\mathbb{E}[k(X, \cdot) - k(Z, \cdot)]\|_{\mathcal{H}} \le \|\mathbb{E}[k(X, \cdot) - k(Z, \cdot)]\|_{\mathcal{H}},
\end{align*}
where we have used linearity in (i) and Cauchy-Schwarz in (ii), and that $\|f\|_{\mathcal{H}} \le 1$ in the final inequality. Equality holds in step (ii) if
\[
f(\cdot) = \frac{\mathbb{E}[k(X, \cdot) - k(Z, \cdot)]}{\|\mathbb{E}[k(X, \cdot) - k(Z, \cdot)]\|_{\mathcal{H}}},
\]
and we have
\begin{align*}
\|\mathbb{E}[k(X, \cdot) - k(Z, \cdot)]\|_{\mathcal{H}}^2
&= \left\langle \mathbb{E}[k(X, \cdot) - k(Z, \cdot)], \ \mathbb{E}[k(X', \cdot) - k(Z', \cdot)] \right\rangle \\
&= \left\langle \mathbb{E}[k(X, \cdot)], \mathbb{E}[k(X', \cdot)] \right\rangle + \left\langle \mathbb{E}[k(Z, \cdot)], \mathbb{E}[k(Z', \cdot)] \right\rangle - 2 \left\langle \mathbb{E}[k(X, \cdot)], \mathbb{E}[k(Z, \cdot)] \right\rangle \\
&= \mathbb{E}[k(X, X')] + \mathbb{E}[k(Z, Z')] - 2 \mathbb{E}[k(X, Z)],
\end{align*}
where the final equality uses the linearity of the inner product and independence of $X, X', Z, Z'$.
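
As an illustrative aside, the right-hand side of this identity can be estimated directly from samples. The sketch below assumes NumPy and uses a Gaussian RBF kernel purely as an example of a bounded kernel; the helper names are hypothetical.

```python
import numpy as np

def rbf(x, z, sigma=1.0):
    """Gaussian RBF kernel; values lie in (0, 1], so B = 1 in the notation of part (c)."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def mmd_squared(X, Z, kernel=rbf):
    """Plug-in estimate of E[k(X, X')] + E[k(Z, Z')] - 2 E[k(X, Z)]."""
    kxx = np.mean([kernel(X[i], X[j]) for i in range(len(X)) for j in range(len(X)) if i != j])
    kzz = np.mean([kernel(Z[i], Z[j]) for i in range(len(Z)) for j in range(len(Z)) if i != j])
    kxz = np.mean([kernel(x, z) for x in X for z in Z])
    return kxx + kzz - 2 * kxz

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(100, 1))
Z_same = rng.normal(0.0, 1.0, size=(100, 1))   # same distribution: estimate near 0
Z_diff = rng.normal(1.0, 1.0, size=(100, 1))   # shifted distribution: estimate clearly positive
print(mmd_squared(X, Z_same), mmd_squared(X, Z_diff))
```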
(b) Suppose that $P = Q$. Then certainly $\mathbb{E}_P[f(X)] - \mathbb{E}_Q[f(Z)] = \mathbb{E}_P[f(X)] - \mathbb{E}_P[f(X)] = 0$ for all $f \in \mathcal{H}$. Now suppose $P \neq Q$. Then there exists a compact set $A$ such that $P(A) \neq Q(A)$. For $n \in \mathbb{N}$, define the function
\[
\phi_n(x) = \max\{1 - n \cdot \mathrm{dist}(x, A), 0\} = [1 - n \, \mathrm{dist}(x, A)]_+,
\]
which satisfies $\phi_n(x) = 1$ for $x \in A$, $\phi_n(x) = 0$ for $x$ such that $\mathrm{dist}(x, A) \ge 1/n$, and is Lipschitz continuous. Moreover, we have $\phi_n(x) \downarrow 1\{x \in A\}$ for all $x$ as $n \to \infty$. Thus the monotone convergence theorem gives that
\[
\lim_n \mathbb{E}_P[\phi_n(X)] = P(A) \quad \text{and} \quad \lim_n \mathbb{E}_Q[\phi_n(Z)] = Q(A).
\]
Let $\epsilon > 0$ be such that $|P(A) - Q(A)| \ge 4\epsilon$. Choose $N$ such that $n \ge N$ implies $|\mathbb{E}_P[\phi_n] - P(A)| < \epsilon$ and $|\mathbb{E}_Q[\phi_n] - Q(A)| < \epsilon$, and let $n \ge N$. Choose $f \in \mathcal{H}$ such that $\sup_x |f(x) - \phi_n(x)| \le \epsilon$. Then
\[
|\mathbb{E}_P[f(X)] - \mathbb{E}_Q[f(Z)]| \ge |\mathbb{E}_P[\phi_n(X)] - \mathbb{E}_Q[\phi_n(Z)]| - 2\epsilon > |P(A) - Q(A)| - 4\epsilon \ge 4\epsilon - 4\epsilon = 0.
\]
Dividing by $\|f\|_{\mathcal{H}}$, we have
\[
D_k(P, Q) = \sup_{g : \|g\|_{\mathcal{H}} \le 1} |\mathbb{E}_P[g] - \mathbb{E}_Q[g]| \ge \frac{|\mathbb{E}_P[f(X)] - \mathbb{E}_Q[f(Z)]|}{\|f\|_{\mathcal{H}}} > 0.
\]

(c) The expectation equalities are immediate.

We apply bounded differences for the first statement. We first look at $f(x_{1:n}) = \widehat{K}(x_{1:n})$. As the function is symmetric, we fix index $i = 1$. Then for $x, x' \in \mathcal{X}$, we have
\[
f(x, x_{2:n}) - f(x', x_{2:n}) = \binom{n}{2}^{-1} \sum_{j=2}^n \left( k(x, X_j) - k(x', X_j) \right),
\]
and using that $k(x, x') \in [-B, B]$, the summands are each bounded by $2B$ in magnitude. Thus
\[
|f(x, x_{2:n}) - f(x', x_{2:n})| \le \frac{2}{n(n-1)} \cdot 2B(n - 1) = \frac{4B}{n}.
\]
Bounded differences (McDiarmid's inequality) implies
\[
\mathbb{P}\left( \left| \widehat{K}(X_{1:n}) - \mathbb{E}[\widehat{K}(X_{1:n})] \right| \ge t \right) \le 2 \exp\left( -\frac{n t^2}{8 B^2} \right).
\]

The argument for $\widehat{K}(X_{1:n_1}, Z_{1:n_2})$ is a bit more complex. Define
\[
\widehat{K}(X_{1:n_1}, Q) = \frac{1}{n_1} \sum_{i=1}^{n_1} \mathbb{E}_Q[k(X_i, Z) \mid X_i].
\]
Then we have
\[
\mathbb{E}[\widehat{K}(X_{1:n_1}, Z_{1:n_2}) \mid X_{1:n_1}] = \widehat{K}(X_{1:n_1}, Q)
\]
by the independence of $Z_i, X_j$. Fixing $X_{1:n_1}$, define the function $g(z_{1:n_2} \mid X_{1:n_1})$ by
\[
g(z_{1:n_2} \mid X_{1:n_1}) = \widehat{K}(X_{1:n_1}, z_{1:n_2}).
\]
Then $g$ satisfies bounded differences with parameter $4B/n_2$, as above, and so conditional on $X_{1:n_1}$, we have
\[
\mathbb{P}\left( \left| g(Z_{1:n_2} \mid X_{1:n_1}) - \widehat{K}(X_{1:n_1}, Q) \right| \ge t \ \Big| \ X_{1:n_1} \right) \le 2 \exp\left( -\frac{n_2 t^2}{8 B^2} \right). \tag{1}
\]
Now we argue that
\[
x_{1:n_1} \mapsto \widehat{K}(x_{1:n_1}, Q)
\]
satisfies bounded differences as well. Note that $\mathbb{E}[\widehat{K}(X_{1:n_1}, Q)] = \mathbb{E}[k(X, Z)]$ by construction. Without loss of generality let us fix $x_{2:n_1}$ and modify $x_1 \in \{x, x'\}$. Then
\[
\widehat{K}(x, x_{2:n_1}, Q) - \widehat{K}(x', x_{2:n_1}, Q) = \frac{1}{n_1} \mathbb{E}_Q[k(x, Z) - k(x', Z)] \in \left[ -\frac{2B}{n_1}, \frac{2B}{n_1} \right],
\]
satisfying bounded differences with parameter $2B/n_1$. Thus we have
\[
\mathbb{P}\left( \left| \widehat{K}(X_{1:n_1}, Q) - \mathbb{E}[k(X, Z)] \right| \ge t \right) \le 2 \exp\left( -\frac{n_1 t^2}{2 B^2} \right). \tag{2}
\]
Combining the bounds (1) and (2) and applying the tower property of expectation and the triangle inequality, we have
\begin{align*}
\mathbb{P}\left( \left| \widehat{K}(X_{1:n_1}, Z_{1:n_2}) - \mathbb{E}[k(X, Z)] \right| \ge t \right)
&\le \mathbb{E}\left[ \mathbb{P}\left( \left| g(Z_{1:n_2} \mid X_{1:n_1}) - \widehat{K}(X_{1:n_1}, Q) \right| \ge t/2 \ \Big| \ X_{1:n_1} \right) \right] + \mathbb{P}\left( \left| \widehat{K}(X_{1:n_1}, Q) - \mathbb{E}[k(X, Z)] \right| \ge t/2 \right) \\
&\le 2 \exp\left( -\frac{n_2 t^2}{32 B^2} \right) + 2 \exp\left( -\frac{n_1 t^2}{8 B^2} \right).
\end{align*}
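
As an optional illustration of the concentration just derived, one can simulate $\widehat{K}(X_{1:n})$ for a bounded kernel and check that its fluctuations are on the scale the bound allows. In the sketch below, the kernel $\cos(x - z)$ is chosen only because it is positive semidefinite and bounded in $[-1, 1]$ (so $B = 1$); the setup is hypothetical.

```python
import numpy as np

def k_hat(X):
    """(n choose 2)^{-1} sum_{i<j} k(X_i, X_j) for the bounded kernel k(x, z) = cos(x - z)."""
    # cos(x - z) = <(cos x, sin x), (cos z, sin z)>, so this kernel is PSD with B = 1.
    K = np.cos(np.subtract.outer(X, X))
    iu = np.triu_indices(len(X), k=1)
    return K[iu].mean()

rng = np.random.default_rng(2)
n, reps = 200, 300
estimates = np.array([k_hat(rng.uniform(0.0, np.pi, n)) for _ in range(reps)])
# The bound P(|K_hat - E K_hat| >= t) <= 2 exp(-n t^2 / (8 B^2)) guarantees deviations
# of order at most B / sqrt(n); the empirical spread should respect that scale.
print(estimates.mean(), estimates.std())
```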
