HW 3
A. Kernels
1. Let X be a finite set. Show that the kernel K defined over 2^X, the set of
subsets of X, by
\[
\forall A, B \in 2^X, \quad K(A, B) = \exp\left(-\frac{1}{2}\,|A \Delta B|\right),
\]
where A∆B is the symmetric difference of A and B, is PDS (hint: you could
use the fact that K is the result of the normalization of a kernel function
K'). Note that this could define a similarity measure for documents based
on the set of their common words, or n-grams, or gappy n-grams, or a sim-
ilarity measure for images based on some patterns, or a similarity measure
for graphs based on their common sub-graphs.
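One possible reading of the hint (a sketch only; the particular choice of K' below is an assumption, not part of the problem statement): since |A∆B| = |A| + |B| − 2|A∩B|, the candidate K'(A, B) = exp(|A∩B|) satisfies
\[
K(A, B) = \exp\left(|A \cap B| - \tfrac{1}{2}|A| - \tfrac{1}{2}|B|\right)
= \frac{K'(A, B)}{\sqrt{K'(A, A)\,K'(B, B)}},
\]
so it remains to verify that this K' is itself PDS and that normalization preserves the PDS property.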
2. Let X be a finite set and let K_0 be a PDS kernel over X. Show that K' defined by
\[
\forall A, B \in 2^X, \quad K'(A, B) = \sum_{x \in A,\; x' \in B} K_0(x, x')
\]
is a PDS kernel.
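This is a proof exercise, but a quick numerical sanity check can help build intuition. The sketch below is only an illustration under assumptions not in the problem (a small ground set in R^2, a Gaussian base kernel as K_0, hypothetical variable names): it builds the Gram matrix of K' over all subsets and checks that its smallest eigenvalue is non-negative up to rounding error.

```python
# Numerical sanity check (not a proof): the Gram matrix of
# K'(A, B) = sum_{x in A, x' in B} K0(x, x') over subsets should be PSD.
import itertools
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))          # small finite ground set of points in R^2

def K0(x, xp):
    # Gaussian (RBF) base kernel -- just one illustrative PDS choice.
    return np.exp(-0.5 * np.sum((x - xp) ** 2))

# All non-empty subsets of the ground set, represented by index tuples.
subsets = [s for r in range(1, len(X) + 1)
           for s in itertools.combinations(range(len(X)), r)]

def K_prime(A, B):
    return sum(K0(X[i], X[j]) for i in A for j in B)

G = np.array([[K_prime(A, B) for B in subsets] for A in subsets])
print("smallest eigenvalue:", np.linalg.eigvalsh(G).min())  # expected >= -1e-8
```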
3. Show that K defined by
\[
K(x, x') = \frac{1}{\sqrt{1 - (x \cdot x')}}
\]
for all x, x' ∈ X = {x ∈ R^N : ‖x‖_2 < 1} is a PDS kernel. Bonus point: show that the dimension of
the feature space associated to K is infinite (hint: one method to show that
consists of finding an explicit expression of a feature mapping Φ).
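One possible route for this question and the bonus (a sketch, assuming a power-series argument is acceptable): for |t| < 1,
\[
\frac{1}{\sqrt{1 - t}} = \sum_{n=0}^{\infty} \binom{2n}{n} \frac{t^n}{4^n},
\qquad \text{so} \qquad
K(x, x') = \sum_{n=0}^{\infty} \binom{2n}{n} \frac{(x \cdot x')^n}{4^n}.
\]
Each term (x·x')^n is itself a PDS kernel with a finite-dimensional monomial feature map, and the non-negative coefficients suggest how an explicit infinite-dimensional Φ could be assembled.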
http://www.csie.ntu.edu.tw/˜cjlin/libsvm/ ,
and briefly consult the documentation to become more familiar with the
tools.
2. Consider the splice data set
http://www.cs.toronto.edu/~delve/data/splice/desc.html .
Download the already formatted training and test files of that dataset from
http://www.cs.nyu.edu/~mohri/ml13/splice.train.txt
http://www.cs.nyu.edu/~mohri/ml13/splice.test.txt .
Use the libsvm scaling tool to scale the features of all the data. The scaling
parameters should be computed only on the training data and then applied to
the test data.
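A minimal sketch of this scaling step, assuming the libsvm command-line tools (svm-scale in particular) are on the PATH and the downloaded files are named splice.train.txt and splice.test.txt as above; the output file names are hypothetical. The -s flag saves the scaling parameters fitted on the training set, and -r re-applies those saved parameters to the test set without refitting.

```python
# Sketch: compute scaling parameters on the training data only, then apply
# them to the test data, using libsvm's svm-scale command-line tool.
import subprocess

# Fit scaling parameters on the training set and write the scaled file.
with open("splice.train.scaled.txt", "w") as out:
    subprocess.run(["svm-scale", "-s", "scale.params", "splice.train.txt"],
                   stdout=out, check=True)

# Re-use the saved parameters (no re-fitting) to scale the test set.
with open("splice.test.scaled.txt", "w") as out:
    subprocess.run(["svm-scale", "-r", "scale.params", "splice.test.txt"],
                   stdout=out, check=True)
```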
5. Suppose we replace in the primal optimization problem of SVMs the penalty
term $\sum_{i=1}^{m} \xi_i = \|\xi\|_1$ with $\|\xi\|_2^2$, that is, we use the quadratic hinge loss
instead. Give the associated dual optimization problem and compare it with
the dual optimization problem of SVMs.
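For concreteness, here is one standard way to write the modified primal (a sketch using the notation from class, with slack variables ξ_i and trade-off parameter C; the exact normalization of the objective is assumed to match the formulation given in class):
\[
\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i^2
\quad \text{subject to} \quad
y_i (w \cdot x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \quad i = 1, \ldots, m.
\]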
To do that, you could use instead of the margin loss function Φρ defined in
class the function Ψρ defined by
\[
\Psi_\rho(u) =
\begin{cases}
1 & \text{if } u \le 0, \\[2pt]
\left(\dfrac{u}{\rho} - 1\right)^{2} & \text{if } u \in [0, \rho], \\[2pt]
0 & \text{otherwise,}
\end{cases}
\]
and show that it is a Lipschitz function. Compare the empirical and complexity terms of your generalization bound to those given in class using Φρ.
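A possible observation for the Lipschitz part (a hedged sketch, not the complete argument): Ψρ is continuous and only varies on [0, ρ], where
\[
\Psi_\rho'(u) = \frac{2}{\rho}\left(\frac{u}{\rho} - 1\right),
\qquad |\Psi_\rho'(u)| \le \frac{2}{\rho} \quad \text{for } u \in (0, \rho),
\]
which suggests that Ψρ is (2/ρ)-Lipschitz, i.e., its Lipschitz constant is twice that of the (1/ρ)-Lipschitz function Φρ used in class.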
C. Boosting
2. Implement that algorithm with boosting stumps and apply the algorithm to
the same data set as question B, with the same training and test sets. Plot
the average cross-validation error plus or minus one standard deviation as
a function of the number of rounds of boosting T, by selecting the value
of this parameter out of {10, 10^2, . . . , 10^k} for a suitable value of k, as in
question B. Let T* be the best value found for the parameter. Plot the error
on the training and test sets as a function of the number of rounds of boosting
for t ∈ [1, T*]. Compare your results with those obtained using SVMs in
question B.
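A minimal sketch of boosting with decision stumps, under the assumption that the algorithm referred to (from the earlier boosting question, not shown here) is AdaBoost. The feature matrix X and labels y in {−1, +1} are assumed already loaded (e.g., from the scaled splice files); the small synthetic data at the end is only a stand-in, and stump selection is a plain exhaustive search rather than an optimized implementation.

```python
# Sketch: AdaBoost with decision stumps h(x) = s * sign(x[j] - theta).
# Illustrative implementation, not the official course solution.
import numpy as np

def fit_stump(X, y, w):
    """Exhaustively pick (feature, threshold, sign) minimizing weighted error."""
    m, d = X.shape
    best = (np.inf, 0, 0.0, 1)                       # (error, feature, theta, sign)
    for j in range(d):
        for theta in np.unique(X[:, j]):
            pred = np.where(X[:, j] > theta, 1, -1)
            for s in (1, -1):
                err = np.sum(w[s * pred != y])
                if err < best[0]:
                    best = (err, j, theta, s)
    return best

def adaboost(X, y, T):
    """Run T rounds of AdaBoost; return a list of (alpha, feature, theta, sign)."""
    m = X.shape[0]
    w = np.full(m, 1.0 / m)                          # uniform initial distribution
    ensemble = []
    for _ in range(T):
        err, j, theta, s = fit_stump(X, y, w)
        err = np.clip(err, 1e-12, 1 - 1e-12)         # guard against log(0)
        alpha = 0.5 * np.log((1 - err) / err)
        pred = s * np.where(X[:, j] > theta, 1, -1)
        w = w * np.exp(-alpha * y * pred)            # re-weight and normalize
        w /= w.sum()
        ensemble.append((alpha, j, theta, s))
    return ensemble

def predict(ensemble, X):
    score = sum(alpha * s * np.where(X[:, j] > theta, 1, -1)
                for alpha, j, theta, s in ensemble)
    return np.sign(score)

# Tiny synthetic usage example (stand-in for the splice training data):
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.where(X[:, 0] + 0.5 * X[:, 2] > 0, 1, -1)
clf = adaboost(X, y, T=50)
print("training error:", np.mean(predict(clf, X) != y))
```

The same loop can be rerun for each T in the grid inside a cross-validation split to produce the requested error curves; only the plotting code is omitted here.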