Mehryar Mohri

Foundations of Machine Learning 2015


Courant Institute of Mathematical Sciences
Homework assignment 2
October 23, 2015
Due: November 09, 2015
A. VC-dimension of convex combinations

1. Let H be a family of functions mapping from an input space X to
{−1, +1} and let T be a positive integer. Give an upper bound on the
VC-dimension of the family of functions F_T defined by

F_T = \left\{ \mathrm{sgn}\Big( \sum_{t=1}^{T} \alpha_t h_t \Big) : h_t \in H, \ \alpha_t \geq 0, \ \sum_{t=1}^{T} \alpha_t \leq 1 \right\}.

(Hint: you can use Problem C of Foundations of Machine Learning,
HW2, 2014, http://www.cs.nyu.edu/~mohri/ml14/hw2.pdf, and its
solution.)

B. Growth function

1. A linearly separable labeling of a set X of vectors in R^d is a classification
of X into two sets X^+ and X^− with X^+ = {x ∈ X : w · x > 0} and
X^− = {x ∈ X : w · x < 0} for some w ∈ R^d.
Let X = {x_1, . . . , x_m} be a subset of R^d.

(a) Let {X^+, X^−} be a dichotomy of X and let x_{m+1} ∈ R^d. Show
that {X^+ ∪ {x_{m+1}}, X^−} and {X^+, X^− ∪ {x_{m+1}}} are linearly
separable by a hyperplane going through the origin if and only
if {X^+, X^−} is linearly separable by a hyperplane going through
the origin and x_{m+1}.
(b) Let X = {x_1, . . . , x_m} be a subset of R^d such that any k-element
subset of X with k ≤ d is linearly independent. Then, show that the
number of linearly separable labelings of X is

C(m, d) = 2 \sum_{k=0}^{d-1} \binom{m-1}{k}.

(Hint: prove by induction that C(m + 1, d) = C(m, d) + C(m, d − 1).)
(c) Let f_1, . . . , f_p be p functions mapping R^d to R. Define F as
the family of classifiers based on linear combinations of these
functions:

F = \left\{ x \mapsto \mathrm{sgn}\Big( \sum_{k=1}^{p} a_k f_k(x) \Big) : a_1, \ldots, a_p \in \mathbb{R} \right\}.

Define Ψ by Ψ(x) = (f_1(x), . . . , f_p(x)). Assume that there exist
x_1, . . . , x_m ∈ R^d such that every p-subset of {Ψ(x_1), . . . , Ψ(x_m)}
is linearly independent. Then, show that

\Pi_F(m) = 2 \sum_{i=0}^{p-1} \binom{m-1}{i}.
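
A quick numerical illustration (editor's sketch, not part of the assignment): the snippet below compares the closed-form count 2 \sum_{k=0}^{d-1} \binom{m-1}{k} from parts (b) and (c) with a randomized brute-force enumeration of the homogeneous linearly separable labelings of a small point set in general position. All names are placeholders, and the randomized search only recovers every labeling with high probability.

```python
# Editor's sketch: numerically check C(m, d) = 2 * sum_{k=0}^{d-1} binom(m-1, k)
# against a randomized enumeration of homogeneous linearly separable labelings.
from math import comb

import numpy as np


def closed_form(m: int, d: int) -> int:
    return 2 * sum(comb(m - 1, k) for k in range(d))


def sampled_count(X: np.ndarray, trials: int = 100_000) -> int:
    # Sample random hyperplanes through the origin and collect the distinct
    # sign patterns they induce on the rows of X; with enough samples this
    # finds every linearly separable labeling with high probability.
    rng = np.random.default_rng(0)
    patterns = set()
    for _ in range(trials):
        s = np.sign(X @ rng.standard_normal(X.shape[1]))
        if np.all(s != 0):  # skip hyperplanes passing exactly through a point
            patterns.add(tuple(s.astype(int)))
    return len(patterns)


m, d = 5, 3
X = np.random.default_rng(1).standard_normal((m, d))  # general position (a.s.)
print("closed form:", closed_form(m, d))  # 2 * (1 + 4 + 6) = 22
print("sampled    :", sampled_count(X))   # should also be 22 w.h.p.
```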

C. Support Vector Machines

1. Download and install the libsvm software library from:

http://www.csie.ntu.edu.tw/~cjlin/libsvm/,

and briefly consult the documentation to become more familiar with
the tools.

2. Consider the splice data set

http://www.cs.toronto.edu/~delve/data/splice/desc.html.

Download the already formatted training and test files of a noisy ver-
sion of that dataset from
http://www.cs.nyu.edu/~mohri/ml15/splice_noise_train.txt
http://www.cs.nyu.edu/~mohri/ml15/splice_noise_test.txt.

Use the libsvm scaling tool to scale the features of all the data. The
scaling parameters should be computed only on the training data and
then applied to the test data.
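
A minimal sketch of the scaling step (editor's addition): the assignment's intended tool is libsvm's svm-scale (e.g. svm-scale -l -1 -u 1 -s range splice_noise_train.txt > train.scaled, then svm-scale -r range splice_noise_test.txt > test.scaled); the Python below, using scikit-learn's libsvm-format reader, illustrates the same principle of fitting the scaling parameters on the training set only.

```python
# Editor's sketch: scale features to [-1, 1] using parameters computed on the
# training data only, then apply the same parameters to the test data.
from sklearn.datasets import load_svmlight_file
from sklearn.preprocessing import MinMaxScaler

X_train, y_train = load_svmlight_file("splice_noise_train.txt")
X_test, y_test = load_svmlight_file("splice_noise_test.txt",
                                    n_features=X_train.shape[1])

scaler = MinMaxScaler(feature_range=(-1, 1))        # svm-scale's default range
X_train = scaler.fit_transform(X_train.toarray())   # fit on training data only
X_test = scaler.transform(X_test.toarray())         # reuse the same parameters
```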

3. Consider the corresponding binary classification problem, which consists
of distinguishing two types of splice junctions in DNA sequences using
about 60 features. Use SVMs combined with polynomial kernels to tackle
this problem.
To do that, randomly split the training data into ten equal-sized disjoint
sets. For each value of the polynomial degree, d = 1, 3, 5, plot the average
cross-validation error plus or minus one standard deviation as a function
of C (let other parameters of polynomial kernels in libsvm be equal to
their default values), varying C in powers of 5, starting from a small
value C = 5^{-k} up to C = 5^{k}, for some value of k. The value of k
should be chosen so that you see a significant variation in training error,
starting from a very high training error to a low training error. Expect
longer training times with libsvm as the value of C increases.
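
A possible implementation sketch for this question (editor's addition), using scikit-learn's SVC, which wraps libsvm; the command-line equivalent is svm-train -t 1 -d <degree> -c <C> -v 10 on the scaled training file. The bound k below is a placeholder to enlarge until the training error varies as described, and X_train, y_train are the scaled data from the previous step.

```python
# Editor's sketch: 10-fold cross-validation error (mean +/- one standard
# deviation) as a function of C, for polynomial degrees 1, 3 and 5. Other
# kernel parameters match libsvm's defaults (gamma = 1/num_features, coef0 = 0).
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

k = 5                                     # placeholder; enlarge if needed
Cs = [5.0 ** e for e in range(-k, k + 1)]
cv = KFold(n_splits=10, shuffle=True, random_state=0)

for degree in (1, 3, 5):
    means, stds = [], []
    for C in Cs:
        clf = SVC(kernel="poly", degree=degree, C=C,
                  gamma=1.0 / X_train.shape[1], coef0=0.0)
        errors = 1.0 - cross_val_score(clf, X_train, y_train, cv=cv)
        means.append(errors.mean())
        stds.append(errors.std())
    plt.errorbar(range(-k, k + 1), means, yerr=stds, label=f"degree {degree}")

plt.xlabel("log_5(C)")
plt.ylabel("10-fold cross-validation error")
plt.legend()
plt.show()
```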

4. Let (C∗, d∗) be the best pair found previously. Fix C to be C∗. Plot the
ten-fold cross-validation error and the test errors for the hypotheses
obtained as a function of d. Plot the average number of support vectors
obtained as a function of d. How many of the support vectors lie on
the marginal hyperplanes? Plot the soft margin of the solution as a
function of d.
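
An editor's sketch of how the quantities in this question could be computed with the libsvm wrapper in scikit-learn: the support vectors and the marginal support vectors are read off the dual coefficients (0 < α_i < C puts x_i on a marginal hyperplane), and the soft margin 1/‖w‖ is computed in the kernel-induced feature space. C_star and the list of degrees are placeholders for the values selected in the previous question.

```python
# Editor's sketch: for each degree d, report the number of support vectors,
# how many lie on a marginal hyperplane (0 < alpha_i < C), the soft margin
# 1/||w||, and the test error. C_star is a placeholder.
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel
from sklearn.svm import SVC

C_star = 5.0 ** 2                         # placeholder for the selected C*
gamma = 1.0 / X_train.shape[1]            # libsvm's default gamma

for degree in (1, 3, 5):                  # or a finer range of degrees
    clf = SVC(kernel="poly", degree=degree, gamma=gamma, coef0=0.0, C=C_star)
    clf.fit(X_train, y_train)

    alpha_y = clf.dual_coef_.ravel()      # y_i * alpha_i, one entry per SV
    n_sv = alpha_y.size
    n_marginal = int(np.sum(np.abs(alpha_y) < C_star - 1e-8))

    # ||w||^2 = sum_{i,j} (y_i alpha_i)(y_j alpha_j) K(x_i, x_j) over the SVs.
    K_sv = polynomial_kernel(clf.support_vectors_, degree=degree,
                             gamma=gamma, coef0=0.0)
    margin = 1.0 / np.sqrt(alpha_y @ K_sv @ alpha_y)

    test_error = 1.0 - clf.score(X_test, y_test)
    print(degree, n_sv, n_marginal, margin, test_error)
```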

5. Now, combine SVMs with Gaussian kernels to tackle the same task.
Use cross-validation as before to determine the best value of C and σ,
varying C in powers of 5, and σ in powers of 2 for a reasonable range
so that you see a significant variation in training error, as before. Fix
C and σ to the best values found via cross-validation. How does the
test error of the solution compare to the best result obtained using
polynomial kernels? What is the value of the soft margin?
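
A sketch of the corresponding grid search (editor's addition): libsvm's RBF kernel is exp(−γ‖x−y‖²), so a width σ corresponds to γ = 1/(2σ²); the ranges below are placeholders to widen until the training error varies significantly.

```python
# Editor's sketch: 10-fold cross-validation over C (powers of 5) and sigma
# (powers of 2) for the Gaussian kernel, with gamma = 1 / (2 sigma^2).
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

cv = KFold(n_splits=10, shuffle=True, random_state=0)
best = (None, None, np.inf)
for C in (5.0 ** e for e in range(-3, 4)):            # placeholder range
    for sigma in (2.0 ** e for e in range(-2, 5)):    # placeholder range
        clf = SVC(kernel="rbf", C=C, gamma=1.0 / (2.0 * sigma ** 2))
        err = 1.0 - cross_val_score(clf, X_train, y_train, cv=cv).mean()
        if err < best[2]:
            best = (C, sigma, err)

C_best, sigma_best, _ = best
clf = SVC(kernel="rbf", C=C_best, gamma=1.0 / (2.0 * sigma_best ** 2))
clf.fit(X_train, y_train)
print("test error:", 1.0 - clf.score(X_test, y_test))
```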

6. Here, use as a kernel the sum of the best polynomial kernel (degree
d∗ ) and the Gaussian kernel with the best parameter σ you found in
the previous question. Use cross-validation as before to determine the
best value of C. How does the test error of the solution compare to
the best result obtained in the previous questions?
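
One way to realize the sum kernel (editor's sketch): since a sum of PDS kernels is PDS, the combined Gram matrices can be passed to libsvm as a precomputed kernel (libsvm's -t 4 option; kernel="precomputed" in scikit-learn). d_star and sigma_best below are placeholders for the previously selected values, and C should again be chosen by cross-validation.

```python
# Editor's sketch: use the sum of the best polynomial kernel and the best
# Gaussian kernel as a precomputed kernel. d_star and sigma_best are placeholders.
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel
from sklearn.svm import SVC

d_star, sigma_best = 3, 4.0               # placeholders
gamma_poly = 1.0 / X_train.shape[1]       # libsvm's default gamma
gamma_rbf = 1.0 / (2.0 * sigma_best ** 2)

def sum_kernel(A, B):
    return (polynomial_kernel(A, B, degree=d_star, gamma=gamma_poly, coef0=0.0)
            + rbf_kernel(A, B, gamma=gamma_rbf))

K_train = sum_kernel(X_train, X_train)
K_test = sum_kernel(X_test, X_train)      # rows: test points, columns: training points

clf = SVC(kernel="precomputed", C=1.0)    # cross-validate C as in the earlier questions
clf.fit(K_train, y_train)
print("test error:", 1.0 - clf.score(K_test, y_test))
```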

D. Kernels

Show that the following kernels are PDS.


1. Let n be a positive integer. K is defined by K(x, y) = \sum_{i=1}^{N} \cos^n(x_i^2 - y_i^2)
for all (x, y) ∈ R^N × R^N.
2. Let σ be a positive real number. K is defined by K(x, y) = e^{-\|x - y\| / \sigma} for
all (x, y) ∈ R^N × R^N. (Hint: you could show that K is the normalized
kernel of a kernel K′ and show that K′ is PDS using the following
equality, valid for all x, y:

\|x - y\| = \frac{1}{2\Gamma(\frac{1}{2})} \int_0^{+\infty} \frac{1 - e^{-t\|x - y\|^2}}{t^{3/2}} \, dt.)
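
A numerical sanity check (editor's sketch, not a proof): both kernels should produce positive semidefinite Gram matrices on any finite point set, so their eigenvalues on random points should be nonnegative up to floating-point error. The parameter values below are arbitrary.

```python
# Editor's sketch: build the two Gram matrices of problem D on random points
# and verify that their smallest eigenvalues are >= 0 up to numerical error.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 5))          # 40 random points in R^5
n, sigma = 3, 1.5                         # arbitrary parameter choices

# K1(x, y) = sum_i cos^n(x_i^2 - y_i^2)
D2 = X[:, None, :] ** 2 - X[None, :, :] ** 2
K1 = np.sum(np.cos(D2) ** n, axis=2)

# K2(x, y) = exp(-||x - y|| / sigma)
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
K2 = np.exp(-dist / sigma)

for name, K in (("sum_i cos^n(x_i^2 - y_i^2)", K1), ("exp(-||x-y||/sigma)", K2)):
    print(name, "min eigenvalue:", np.linalg.eigvalsh(K).min())
```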
