Assignment 1
Problem 1: Classification
Here are two reviews of Perfect Blue, from Rotten Tomatoes:
Rotten Tomatoes has classified these reviews as “positive” and “negative”, respectively, as indicated by the intact
tomato on the top and the splatter on the bottom. In this problem, you will create a simple text classification system
that can perform this task automatically. We’ll warm up with the following set of four mini-reviews, each labeled
positive (+1) or negative (-1):
Each review x is mapped to a feature vector ϕ(x) that records the number of occurrences of each word in the review. For example, the second review maps to the (sparse) feature vector ϕ(x) = {pretty: 1, bad: 1}.
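A minimal sketch of this featurizer in Python (the function name extract_features is my own; the assignment does not fix one):

```python
from collections import defaultdict

def extract_features(text):
    """Map a review string to a sparse word-count feature vector phi(x)."""
    phi = defaultdict(int)
    for word in text.split():
        phi[word] += 1
    return phi

# e.g. extract_features("pretty bad") == {"pretty": 1, "bad": 1}
```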
In this problem, we will use the hinge loss. Assuming a linear hypothesis space, and using the feature vector ϕ(x) as input, the loss on a single example is

Loss_hinge(x, y, w) = max{0, 1 − (w · ϕ(x)) y},

where x is the review text, y ∈ {+1, −1} is the correct label, and w is the weight vector. For example, with w = 0 the margin (w · ϕ(x)) y is 0, so the loss is max{0, 1 − 0} = 1.
1. Suppose we run stochastic gradient descent once for each of the 4 samples in the order given above, updating
the weights according to
w ← w − η ∇_w Loss_hinge(x, y, w).
After the updates, what are the weights of the six words (“pretty”, “good”, “bad”, “plot”, “not”, “scenery”)
that appear in the above reviews?
• Use η = 0.1 as the step size.
• Initialize w = [0, 0, 0, 0, 0, 0].
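A sketch of this SGD pass, assuming the standard subgradient of the hinge loss (−y ϕ(x) when the margin (w · ϕ(x)) y is below 1, and 0 otherwise). The four reviews listed are placeholders: only the second ("pretty bad") is recoverable from the feature-vector example above, and even its label is assumed.

```python
from collections import defaultdict

# Placeholder dataset over the six words; only review 2's words are
# known from the text, and all labels here are hypothetical.
data = [
    ("good plot", +1),       # hypothetical
    ("pretty bad", -1),      # words from phi(x) above; label assumed
    ("not good", -1),        # hypothetical
    ("pretty scenery", +1),  # hypothetical
]

eta = 0.1                 # step size
w = defaultdict(float)    # all six weights start at 0

for x, y in data:
    phi = defaultdict(int)
    for word in x.split():
        phi[word] += 1
    margin = sum(w[f] * v for f, v in phi.items()) * y
    if margin < 1:  # hinge loss active: subgradient is -y * phi(x)
        for f, v in phi.items():
            w[f] += eta * y * v  # w <- w - eta * (-y * phi(x))

print(dict(w))  # final weights of the words that appeared
```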
3. Prove that no linear classifier using word features (i.e., word counts) can achieve zero error on this dataset. Remember that this is a question about classifiers, not optimization algorithms; your proof should hold for any linear classifier of the form f_w(x) = sign(w · ϕ(x)), regardless of how the weights are learned.
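The problem asks for a proof, but a feasibility check can help confirm the claim numerically: zero training error for f_w(x) = sign(w · ϕ(x)) is achievable iff some w satisfies (w · ϕ(x_i)) y_i ≥ 1 for all i, since any strictly separating w can be rescaled to margin 1. A sketch of that check with scipy.optimize.linprog, on a hypothetical two-review toy set:

```python
import numpy as np
from scipy.optimize import linprog

def linearly_separable(Phi, y):
    """Return True iff some w achieves y_i * (w . phi_i) >= 1 for all i."""
    # Constraints: -y_i * (phi_i . w) <= -1; objective is irrelevant
    # because we only care about feasibility.
    A_ub = -(y[:, None] * Phi)
    b_ub = -np.ones(len(y))
    d = Phi.shape[1]
    res = linprog(c=np.zeros(d), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * d)  # w unconstrained in sign
    return res.status == 0  # 0 = feasible, 2 = infeasible

# Hypothetical check: the same review "good" labeled both +1 and -1
# can never be separated, since one phi would need both signs.
Phi = np.array([[1.0], [1.0]])
y = np.array([1.0, -1.0])
print(linearly_separable(Phi, y))  # False
```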
4. Propose a single additional feature we could add to the feature vector that would fix this problem.
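One commonly proposed fix (offered here as an assumption, since the intended answer depends on the hidden reviews) is a single feature that captures local word order, such as a count of the bigram "not good", so that negation gets its own weight separate from "not" and "good". A sketch:

```python
from collections import defaultdict

def extract_features_v2(text):
    """Word counts plus one extra feature counting the bigram
    "not good" (a hypothetical choice; other single features work too)."""
    words = text.split()
    phi = defaultdict(int)
    for word in words:
        phi[word] += 1
    phi["not good"] = sum(
        1 for a, b in zip(words, words[1:]) if (a, b) == ("not", "good")
    )
    return phi
```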
1. Suppose that we wish to use squared loss. Write out the expression for Loss(x, y, w) for a single datapoint
(x, y).
2. Given Loss(x, y, w) from the previous part, compute the gradient of the loss with respect to w, ∇_w Loss(x, y, w). Write the answer in terms of the predicted value p = σ(w · ϕ(x)).
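For reference, a sketch of the chain-rule computation, under the assumption that the intended loss is Loss(x, y, w) = (σ(w · ϕ(x)) − y)², using the identity σ′(z) = σ(z)(1 − σ(z)):

∇_w Loss(x, y, w) = 2 (p − y) σ′(w · ϕ(x)) ϕ(x) = 2 (p − y) p (1 − p) ϕ(x), where p = σ(w · ϕ(x)).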