10-601 Machine Learning: Homework 7: Instructions
Instructions
• Late homework policy: Homework is worth full credit if submitted before the due date, half credit
during the next 48 hours, and zero credit after that. You must turn in at least n−1 of the n homeworks
to pass the class, even if for zero credit.
• Collaboration policy: Homeworks must be done individually, except where otherwise noted in the
assignments. “Individually” means each student must hand in their own answers, and each student
must write and use their own code in the programming parts of the assignment. It is acceptable for
students to collaborate in figuring out answers and to help each other solve the problems, though you
must in the end write up your own solutions individually, and you must list the names of students you
discussed this with. We will be assuming that, as participants in a graduate course, you will be taking
the responsibility to make sure you personally understand the solution to any work arising from such
collaboration.
• Online submission: You must submit your solutions online on autolab. We recommend that you
use LaTeX to type your solutions to the written questions, but we will accept scanned solutions as well.
On the Homework 7 autolab page, you can download the template, which is a tar archive containing
a blank placeholder pdf for the written questions. Replace each pdf file with one that contains your
solutions to the written questions. When you are ready to submit, create a new tar archive of the
top-level directory and submit your archived solutions online by clicking the “Submit File” button.
You should submit a single tar archive identical to the template, except with the blank pdfs replaced
by your solutions for the written questions. You are free to submit as many times as you like. DO
NOT change the name of any of the files or folders in the submission template. In other words,
your submitted files should have exactly the same names as those in the submission template. Do not
modify the directory structure.
where $n$ is the number of data points. To do so, we iterate between assigning each $x_i$ to the nearest cluster center and updating each cluster center $c_j$ to the mean of all points assigned to the $j$th cluster.
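For concreteness, here is a minimal Python sketch of this alternating procedure (plain Lloyd's algorithm with Euclidean distances); the function name, initialization scheme, and stopping rule are illustrative choices, not part of the assignment.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means: alternate between assigning points to the nearest
    center and recomputing each center as the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Initialize centers with k distinct data points.
    centers = X[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Assignment step: index of the nearest center for each point.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Update step: each center becomes the mean of its assigned points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```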
k-means can be kernelized too!
k-means with the Euclidean distance metric assumes that each pair of clusters is linearly separable. This may not be the case. A classical example is two clusters corresponding to data points on two concentric circles in the $\mathbb{R}^2$ plane. We have seen that we can use kernels to obtain a non-linear version of an algorithm that is linear by nature, and k-means is no exception. Recall that there are two main aspects of kernelized algorithms: (i) the solution is expressed as a linear combination of training examples, and (ii) the algorithm relies only on inner products between data points rather than their explicit representation. We will show that these two aspects can be satisfied in k-means.
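To make this concrete, the sketch below builds the concentric-circles data and a polynomial kernel matrix; the data generator, the kernel choice, and all names are illustrative assumptions. The point is only that the kernel matrix is a function of the inner products X @ X.T, never of an explicit feature representation.

```python
import numpy as np

def concentric_circles(n_per_circle=50, radii=(1.0, 3.0), seed=0):
    """Two clusters lying on concentric circles in R^2 -- not linearly separable."""
    rng = np.random.default_rng(seed)
    points = []
    for r in radii:
        angles = rng.uniform(0.0, 2.0 * np.pi, size=n_per_circle)
        points.append(np.column_stack([r * np.cos(angles), r * np.sin(angles)]))
    return np.vstack(points)

def polynomial_kernel_matrix(X, degree=2, c=1.0):
    """K[i, j] = (<x_i, x_j> + c)^degree: the whole matrix is a function of
    the inner products X @ X.T only, never of the feature map itself."""
    return (X @ X.T + c) ** degree

X = concentric_circles()            # shape (100, 2)
K = polynomial_kernel_matrix(X)     # shape (100, 100), built from inner products
```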
2. [5 pt] Let $z_{ij}$ be an indicator that is equal to 1 if $x_i$ is currently assigned to the $j$th cluster and 0 otherwise ($1 \le i \le n$ and $1 \le j \le k$). Show that the $j$th cluster center $c_j$ can be updated as $c_j = \sum_{i=1}^{n} \alpha_{ij} x_i$. Specifically, show how $\alpha_{ij}$ can be computed given all the $z$'s.
3. [5 pt] Given two data points $x_1$ and $x_2$, show that the squared distance $\|x_1 - x_2\|^2$ can be computed using only (linear combinations of) inner products.
4. [5 pt] Given the results of parts 2 and 3, show how to compute the squared distance $\|x_i - c_j\|^2$ using only (linear combinations of) inner products between the data points $x_1, \ldots, x_n$.
Note: This means that given a kernel $K$, we can run Lloyd's algorithm. We begin with some initial data points as centers and use the answer to part 3 to find the closest center for each data point, giving us the initial $z_{ij}$'s. We then repeatedly use the answers to parts 2 and 4 to reassign the points to centers and update the $z_{ij}$'s. A code sketch of this kernelized procedure appears after part 8.
5. [2 pt] Consider the case where $k = 3$ and we have 4 data points $x_1 = 1$, $x_2 = 2$, $x_3 = 5$, $x_4 = 7$. What is the optimal clustering for this data? What is the corresponding value of the objective (1)?
6. [3 pt] One might be tempted to think that Lloyd’s algorithm is guaranteed to converge to the global
minimum when d = 1. Show that there exists a suboptimal cluster assignment for the data in part 5
that Lloyd’s algorithm will not be able to improve (to get full credit, you need to show the assignment,
show why it is suboptimal and explain why it will not be improved).
7. [10 pt] Assume we sort our data points such that $x_1 \le x_2 \le \cdots \le x_n$. Prove that an optimal cluster assignment has the property that each cluster corresponds to some interval of points; that is, for each cluster $j$ there exist $i_1, i_2$ such that the cluster consists of $\{x_{i_1}, x_{i_1+1}, \ldots, x_{i_2}\}$.
8. (Extra Credit [10 pt]) Develop an O(kn2 ) dynamic programming algorithm for single dimensional
k-means. [Hint: From part 7, what we need to optimize are k − 1 cluster boundaries where the ith
boundary marks the largest data point in the ith cluster.]
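As referenced in the note to part 4, here is a minimal sketch of Lloyd's algorithm driven by a kernel matrix alone. It initializes with a random assignment rather than with data points as centers, and the distance expansion inside the loop is precisely the kind of identity that parts 2-4 ask you to justify; treat it as an illustration, not a model solution.

```python
import numpy as np

def kernel_kmeans(K, k, n_iters=100, seed=0):
    """Lloyd's algorithm given only the n x n kernel matrix K of the data.

    Cluster centers are never formed explicitly; each point's distance to a
    cluster is expressed through entries of K and the current assignments.
    """
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(0, k, size=n)          # random initial assignment
    for _ in range(n_iters):
        dist2 = np.zeros((n, k))
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size == 0:
                dist2[:, j] = np.inf             # empty cluster stays empty
                continue
            # ||phi(x_i) - c_j||^2 written purely in terms of kernel values.
            first = np.diag(K)
            second = K[:, members].sum(axis=1) / members.size
            third = K[np.ix_(members, members)].sum() / members.size ** 2
            dist2[:, j] = first - 2.0 * second + third
        new_labels = np.argmin(dist2, axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```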
Problem 2: Dimensionality Reduction and Representation Learning [30 pt]
In this question, we explore the relation between PCA, kernel PCA, and autoencoder neural networks (trained to output the same vector they receive as input). We will use $n$ and $d$ to denote the number and dimensionality of the given data points, respectively.
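For reference, here is a minimal sketch of plain PCA projection onto the top $k$ principal directions, computed via an SVD of the centered data; the function name and interface are illustrative.

```python
import numpy as np

def pca_project(X, k):
    """Project n x d data onto its top-k principal directions."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions (right singular vectors).
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:k]                 # shape (k, d)
    return X_centered @ components.T    # shape (n, k), the low-dimensional codes
```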
1. [10 pt] Consider an autoencoder with a single hidden layer of $k$ nodes. Let $w_{ij}$ denote the weight of the edge from the $i$th input node to the $j$th hidden node. Similarly, let $v_{ij}$ denote the weight of the edge from the $i$th hidden node to the $j$th output node. Show how you can set the activation functions of the hidden and output nodes, as well as the weights $w_{ij}$ and $v_{ij}$, such that the resulting autoencoder resembles PCA.
2. [10 pt] Kernel PCA is a non-linear dimensionality reduction method in which a principal vector $u_j$ is computed as a linear combination of training examples in the feature space:
$$u_j = \sum_{i=1}^{n} \alpha_{ij} \phi(x_i).$$
Computing the principal component of a new point $x$ can then be done using kernel evaluations:
$$z_j(x) = \langle u_j, \phi(x) \rangle = \sum_{i=1}^{n} \alpha_{ij} \langle \phi(x_i), \phi(x) \rangle = \sum_{i=1}^{n} \alpha_{ij} k(x_i, x).$$
You will show that kernel PCA can be represented by a neural network. First we define a kernel node: a kernel node with a vector $w_i$ of incoming weights and an input vector $x$ computes the output $y = k(x, w_i)$. Show that, given a data set $x_1, \ldots, x_n$, there exists a network with a single hidden layer whose output is the kernel principal components $z_1(x), \ldots, z_k(x)$ for a given input $x$. Specify the number of nodes in the input, output, and hidden layers, the type and activation function of the hidden and output nodes, and the weights of the edges in terms of $\alpha$ and $x_1, \ldots, x_n$. (A short code rendering of the projection formula above appears after part 4.)
3. [5 pt] What is the number of parameters (weights) required to store the network in part 2?
4. [5 pt] Another way to do non-linear dimensionality reduction is to train an autoencoder with non-linear activation functions (e.g., sigmoid) in the hidden layers. State one advantage and one disadvantage of that approach compared to kernel PCA.
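As mentioned in part 2, the projection formula can be read directly as code. In this sketch the coefficient matrix alpha (shape $n \times k$) is assumed to have already been obtained from kernel PCA, and kernel can be any kernel function; all names are illustrative.

```python
import numpy as np

def kernel_pca_project(x_new, X_train, alpha, kernel):
    """z_j(x) = sum_i alpha[i, j] * k(x_i, x): one kernel evaluation per
    training point, followed by a linear combination per principal component."""
    k_vec = np.array([kernel(x_i, x_new) for x_i in X_train])   # shape (n,)
    return alpha.T @ k_vec                                       # shape (k,)

# Example usage with a simple polynomial kernel (illustrative only).
poly = lambda a, b: (np.dot(a, b) + 1.0) ** 2
```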
Problem 3: Co-training Doesn’t Like Groupthink [30 pt]
Consider the data set in Figure 1. Each data point has two features. Circled data points are unlabeled points whose true label (shown inside the circle) is invisible to the learning algorithm. We will use this dataset to co-train two threshold-based classifiers $C_1$ and $C_2$, where $C_i$ is trained using feature $i$ and produces a decision threshold on feature $i$ that maximizes the margin between the training examples and the threshold. The "confidence" of a classifier for a new data point is measured by how far the point is from the threshold: the farther the point is from the decision boundary, the more confident the classifier is. We will run iterative co-training such that, in each iteration, each classifier adds the unlabeled example it is most confident about to the training data. Assume that co-training halts when, for each classifier, the unlabeled point that is farthest from the threshold (i.e., the one it is most confident about) is between the largest known negative example and the smallest known positive example (at that point, the algorithm deems the unlabeled examples too uncertain to label for the other classifier).
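To pin down the procedure, here is a minimal sketch of a single co-training iteration for two 1-D max-margin threshold classifiers. The assumption that positive examples lie above the threshold, the shared labeled pool, and all names are illustrative simplifications and do not describe Figure 1.

```python
import numpy as np

def fit_threshold(values, labels):
    """Max-margin threshold on one feature, assuming positives lie above it."""
    neg_max = values[labels == 0].max()
    pos_min = values[labels == 1].min()
    return 0.5 * (neg_max + pos_min)           # midpoint maximizes the margin

def cotrain_step(X, y, labeled, unlabeled):
    """One co-training iteration: each view labels its most confident
    unlabeled point so it can be handed to the other classifier.

    X: (n, 2) feature matrix; y: labels for the labeled indices;
    labeled, unlabeled: 1-D integer index arrays.
    """
    newly_labeled = []
    for view in (0, 1):                        # C1 uses feature 0, C2 uses feature 1
        thr = fit_threshold(X[labeled, view], y[labeled])
        confidences = np.abs(X[unlabeled, view] - thr)
        pick = unlabeled[np.argmax(confidences)]
        pseudo_label = int(X[pick, view] > thr)
        newly_labeled.append((pick, pseudo_label))
    return newly_labeled
```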
1. [10 pt] Explain what happens in a single iteration of co-training. Specifically, illustrate:
• The initial thresholds produced by $C_1$ and $C_2$ given the labeled examples.
• The new labeled example (coordinates and label) that will be provided to $C_2$ by $C_1$ and vice versa.
• The new thresholds after incorporating the new examples.
• The number of data points misclassified by $C_1$ using the initial and updated thresholds.
2. [15 pt] Now assume that we train both $C_1$ and $C_2$ using feature 1, so they share the same view of the data. What happens if we run co-training to completion? What are the initial thresholds, which unlabeled example will be added in each iteration, and what are the final thresholds?
3. [5 pt] Based on your observations of parts 1 and 2, provide an intuitive explanation (in no more than
two lines) for why having features that satisfy independence given the label helps co-training to be
successful.
1. [4 pt] In the active learning setting, the learning algorithm can query the label of an unlabeled example. Assume that you can query any possible example. Show that, starting with a single positive example, you can exactly learn the true hypothesis $h^*$ using $d$ queries.
2. [3 pt] In the passive learning setting, the examples are drawn i.i.d. from an unknown distribution. According to PAC learning theory, how many examples (in big-O notation) are required to guarantee a generalization error less than $\epsilon$ with probability $1 - \delta$? (Hint: the VC dimension of the class of conjunctions of $d$ binary features is $d$.)
Note: The result of part 1 is much stronger than that of part 2; it guarantees that the classifier will exactly learn the true hypothesis with probability 1. PAC learning guarantees, on the other hand, would require an infinite number of examples as the error $\epsilon$ and the failure probability $\delta$ go to 0. In other words, part 1 is "surely exactly correct" compared to "probably approximately correct".
3. [3 pt] Show that if the training data is not representative of the underlying distribution, a consistent hypothesis can perform poorly. Specifically, assume that the true hypothesis $h^*$ is a conjunction of $k$ out of the $d$ features for some $k > 0$ and that all possible data points are equally likely. Show that there exists a training set of $2^{d-k}$ unique examples and a hypothesis $\hat{h}$ that is consistent with this training set but achieves a classification error $\ge 50\%$ when tested on all possible data points.
Note: The result of part 3 does not contradict that of part 2; the adversarial, unrepresentative sample given in part 3 could still occur with random i.i.d. sampling. The probability of drawing such unrepresentative training sets is included in the failure probability $\delta$.