Homework 2
Submit your work as a single PDF on Gradescope. Make sure to prepare your solution to each problem on a separate page. (Gradescope will ask you to select the pages which contain the solution to each problem.)
Total: 80 points (+ 10 bonus points)
where each $t_k \le 1/L$. As usual, we will write a generic update as $x^+ = x - t\nabla f(x)$, where $t \le 1/L$.
(b, 2 pts) Use $t \le 1/L$, and rearrange the previous result, to get
\[
\|\nabla f(x)\|_2^2 \le \frac{2}{t}\big(f(x) - f(x^+)\big).
\]
(c, 2 pts) Sum the previous result over all iterations $1, \ldots, k+1$ to establish
\[
\sum_{i=0}^{k} \|\nabla f(x^{(i)})\|_2^2 \le \frac{2}{t}\big(f(x^{(0)}) - f^\star\big).
\]
(d, 2 pts) Lower bound the sum in the previous result to get
\[
\min_{i=0,\ldots,k} \|\nabla f(x^{(i)})\|_2 \le \sqrt{\frac{2}{t(k+1)}\big(f(x^{(0)}) - f^\star\big)}.
\]
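For reference (this restatement is not part of the required proof), rearranging the bound above shows that an $\epsilon$-small gradient is guaranteed after $O(1/\epsilon^2)$ iterations:
\[
\min_{i=0,\ldots,k} \|\nabla f(x^{(i)})\|_2 \le \epsilon \quad \text{whenever} \quad k + 1 \ge \frac{2\big(f(x^{(0)}) - f^\star\big)}{t\,\epsilon^2}.
\]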
(a, 3 pts) Starting with (2), apply the first-order condition for convexity of $f$ to show
\[
f(x^+) \le f^\star + \nabla f(x)^T (x - x^\star) - \frac{t}{2}\|\nabla f(x)\|_2^2.
\]
(c, 2 pts) Sum the previous result over all iterations $1, \ldots, k$ to get
\[
\sum_{i=1}^{k} \big(f(x^{(i)}) - f^\star\big) \le \frac{1}{2t}\|x^{(0)} - x^\star\|_2^2.
\]
(d, 2 pts) Use the fact that gradient descent is a descent method to lower bound the sum above, and conclude
\[
f(x^{(k)}) - f^\star \le \frac{\|x^{(0)} - x^\star\|_2^2}{2tk},
\]
which establishes the desired $O(1/\epsilon)$ rate for achieving $\epsilon$-suboptimality.
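For reference (again, not part of the required proof), this bound can be restated as an explicit iteration count:
\[
f(x^{(k)}) - f^\star \le \epsilon \quad \text{whenever} \quad k \ge \frac{\|x^{(0)} - x^\star\|_2^2}{2t\,\epsilon}.
\]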
2 Properties and examples of subgradients (18 points)
We will examine various properties and examples of subgradients.
(a, 2 pts) Show that $\partial f(x)$ is a closed and convex set for any function $f$ (not necessarily convex) and any point $x$ in its domain.
(b, 2 pts) Show that $g \in \partial f(x)$ if and only if $(g, -1)$ defines a supporting hyperplane to the epigraph of $f$ at $(x, f(x))$ (i.e., $(g, -1)$ is the normal vector to this hyperplane).
(c, 2 pts) For a convex function $f$, show that if $x \in U$, where $U$ is an open neighborhood in its domain, then
\[
f(y) \ge f(x) + g^T (y - x) \ \text{ for all } y \in U \;\;\Longrightarrow\;\; g \in \partial f(x).
\]
In other words, if the tangent line inequality holds in a local open neighborhood of $x$, then it holds globally.
(d, 1 pt) For a convex function $f$ and subgradients $g_x \in \partial f(x)$, $g_y \in \partial f(y)$, prove that
\[
(g_x - g_y)^T (x - y) \ge 0.
\]
(e, 2 pts) For $f(x) = \|x\|_2$, show that all subgradients $g \in \mathbb{R}^n$ at a point $x \in \mathbb{R}^n$ are of the form
\[
g \in \begin{cases} \{x/\|x\|_2\} & x \ne 0 \\ \{v : \|v\|_2 \le 1\} & x = 0. \end{cases}
\]
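If you want a quick numerical sanity check of the $x \ne 0$ case before proving it, the following numpy sketch verifies the subgradient inequality at random points (illustrative only, not part of what you submit):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)
g = x / np.linalg.norm(x)                 # claimed subgradient at x != 0

# Verify f(y) >= f(x) + g^T (y - x) for f = ||.||_2 at many random points y.
for _ in range(1000):
    y = rng.normal(size=5)
    assert np.linalg.norm(y) >= np.linalg.norm(x) + g @ (y - x) - 1e-12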
(f, 3 pts) For $f(x) = \max_{s \in S} f_s(x)$, where each $f_s$ is convex, show that
\[
\partial f(x) \supseteq \mathrm{conv}\Big( \bigcup_{s : f_s(x) = f(x)} \partial f_s(x) \Big).
\]
(g, 6 pts) For $f(X) = \|X\|_{\mathrm{tr}}$, show that the subgradients at $X = U \Sigma V^T$ (this is an SVD of $X$) satisfy
\[
\partial f(X) \supseteq \big\{ U V^T + W : \|W\|_{\mathrm{op}} \le 1,\ U^T W = 0,\ W V = 0 \big\}.
\]
Hint: you may use the fact that $\|\cdot\|_{\mathrm{tr}}$ and $\|\cdot\|_{\mathrm{op}}$ are dual norms, which implies $\langle A, B\rangle \le \|A\|_{\mathrm{tr}} \|B\|_{\mathrm{op}}$ for any matrices $A, B$, where recall $\langle A, B\rangle = \mathrm{tr}(A^T B)$. Bonus (5 pts): prove the other direction.
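Similarly, an optional numpy sketch that checks the simplest member of this set (the $W = 0$ case, i.e., $UV^T$) against the subgradient inequality at random matrices:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
U, S, Vt = np.linalg.svd(X, full_matrices=False)
G = U @ Vt                                # candidate subgradient (the W = 0 case)

# Verify f(Y) >= f(X) + <G, Y - X> for the trace (nuclear) norm at random Y.
for _ in range(1000):
    Y = rng.normal(size=(5, 4))
    lhs = np.linalg.norm(Y, 'nuc')
    rhs = np.linalg.norm(X, 'nuc') + np.sum(G * (Y - X))
    assert lhs >= rhs - 1e-10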
(a, 3 pts) Prove that $\mathrm{prox}_{h,t}$ is a well-defined function on $\mathbb{R}^n$, that is, each point $x \in \mathbb{R}^n$ gets mapped to a unique value $\mathrm{prox}_{h,t}(x)$. Hint: use the previous question, and the monotonicity of subgradients from Q2(d).
(d, 3 pts) The proximal minimization algorithm (a special case of proximal gradient descent) repeats the updates:
\[
x^{(k+1)} = \mathrm{prox}_{h,t}(x^{(k)}), \quad k = 1, 2, 3, \ldots.
\]
Write out these updates when applied to $h(x) = \frac{1}{2} x^T A x - b^T x$, where $A \in \mathbb{S}^n$. Show that this is equivalent to the iterative refinement algorithm for solving the linear system $Ax = b$:
\[
x^{(k+1)} = x^{(k)} + (A + \epsilon I)^{-1} \big(b - A x^{(k)}\big),
\]
where $\epsilon > 0$ is some constant. Bonus (1 pt): assuming that proximal minimization converges to the minimizer of $h(x) = \frac{1}{2} x^T A x - b^T x$ (which it does, under suitable step sizes), what would the iterations of iterative refinement converge to in the case when $A$ is singular, the system $Ax = b$ has a solution, and $x^{(0)} = 0$?
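Before proving the equivalence, you may find it helpful to check it numerically; here is a minimal numpy sketch, using a positive semidefinite $A$ for simplicity and taking $\epsilon = 1/t$ (the variable names are illustrative):

import numpy as np

rng = np.random.default_rng(0)
n = 6
M = rng.normal(size=(n, n))
A = M.T @ M                               # a positive semidefinite A in S^n
b = rng.normal(size=n)
x = rng.normal(size=n)
t = 0.1
eps = 1.0 / t

# Proximal step: argmin_z h(z) + (1/(2t)) ||z - x||_2^2, i.e. (A + eps*I) z = b + eps*x.
z_prox = np.linalg.solve(A + eps * np.eye(n), b + eps * x)
# One iterative refinement update for Ax = b.
z_ir = x + np.linalg.solve(A + eps * np.eye(n), b - A @ x)
print(np.allclose(z_prox, z_ir))          # expect True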
For $h(X) = \|X\|_{\mathrm{tr}}$, show that the proximal operator evaluated at $X = U \Sigma V^T$ (this is an SVD of $X$) is so-called matrix soft-thresholding,
\[
\mathrm{prox}_{h,t}(X) = U \Sigma_t V^T, \quad \text{where } \Sigma_t = \mathrm{diag}\big((\Sigma_{11} - t)_+, \ldots, (\Sigma_{nn} - t)_+\big),
\]
and $x_+ = \max\{x, 0\}$ denotes the positive part of $x$. Hint: start with subgradient optimality as you developed in Q3(b), and use the subgradients of the trace norm from Q2(g).
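A minimal numpy sketch of this operator, with a quick check against the prox definition (the function name and the check are illustrative, not required):

import numpy as np

def prox_trace_norm(X, t):
    # Matrix soft-thresholding: shrink each singular value by t and truncate at zero.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(S - t, 0.0)) @ Vt

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
t = 0.5
Z = prox_trace_norm(X, t)
# Z should do at least as well as X and 0 on the prox objective (1/(2t))||Z - X||_F^2 + ||Z||_tr.
obj = lambda Z: np.linalg.norm(Z - X, 'fro')**2 / (2 * t) + np.linalg.norm(Z, 'nuc')
print(obj(Z) <= obj(X) and obj(Z) <= obj(np.zeros_like(X)))   # expect True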
where $1 = (1, \ldots, 1) \in \mathbb{R}^n$ and each $X_{(j)} \in \mathbb{R}^{n \times p_j}$. To achieve sparsity over groups of features, rather than individual features, we can use a group lasso penalty. Write $\beta = (\beta_0, \beta_{(1)}, \ldots, \beta_{(J)}) \in \mathbb{R}^{p+1}$, where $\beta_0$ is an intercept term and each $\beta_{(j)} \in \mathbb{R}^{p_j}$. Consider the problem
\[
\min_{\beta} \; g(\beta) + \lambda \sum_{j=1}^{J} w_j \|\beta_{(j)}\|_2, \tag{3}
\]
where $g$ is a loss function and $\lambda \ge 0$ is a tuning parameter. The penalty $h(\beta) = \lambda \sum_{j=1}^{J} w_j \|\beta_{(j)}\|_2$ is called the group lasso penalty. A common choice for $w_j$ is $\sqrt{p_j}$, to adjust for the group size.
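For the computational parts below, it may help to have the penalty as code; a small numpy sketch (here `groups` is a hypothetical list of index arrays into $\beta$, one per group, excluding the intercept):

import numpy as np

def group_lasso_penalty(beta, groups, lam, weights):
    # h(beta) = lam * sum_j weights[j] * ||beta_(j)||_2; the intercept beta[0] is not penalized.
    return lam * sum(w * np.linalg.norm(beta[idx]) for idx, w in zip(groups, weights))

# Example usage with two hypothetical groups of sizes 2 and 3 (weights sqrt(p_j)).
beta = np.array([0.5, 1.0, -2.0, 0.0, 3.0, 1.0])   # beta[0] is the intercept
groups = [np.array([1, 2]), np.array([3, 4, 5])]
weights = [np.sqrt(len(idx)) for idx in groups]
print(group_lasso_penalty(beta, groups, lam=5.0, weights=weights))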
(a, 3 pts) Derive the proximal operator $\mathrm{prox}_{h,t}(\beta)$ for the group lasso penalty defined above.
(b, 2 pts) Let $y \in \{0, 1\}^n$ be a vector of binary labels, and let $g$ be the logistic loss
\[
g(\beta) = -\sum_{i=1}^{n} y_i (X\beta)_i + \sum_{i=1}^{n} \log\big(1 + \exp\{(X\beta)_i\}\big).
\]
Write out the steps for proximal gradient descent applied to the logistic group lasso problem (3) in
explicit detail.
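A possible code skeleton for these steps is sketched below in numpy; the function names are illustrative, and `prox_h` is left as a placeholder for the operator you derive in part (a):

import numpy as np

def logistic_loss_grad(beta, X, y):
    # Gradient of g(beta) = -sum_i y_i (X beta)_i + sum_i log(1 + exp((X beta)_i)).
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))    # elementwise sigmoid of X beta
    return X.T @ (p - y)

def proximal_gradient(X, y, prox_h, t=1e-4, iters=1000):
    # Fixed-step proximal gradient descent; prox_h(v, t) should apply the group lasso prox.
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        beta = prox_h(beta - t * logistic_loss_grad(beta, X, y), t)
    return beta

# Usage sketch: beta_hat = proximal_gradient(X, y, prox_h=my_group_prox, t=1e-4, iters=1000)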
(c, 5 pts) Now we'll use the logistic group lasso to classify a person's age group from their movie ratings. The movie ratings can be categorized into groups according to a movie's genre (e.g., all ratings for action movies can be grouped together). Load the training data in trainRatings.txt, trainLabels.txt; the features have already been arranged into groups, and you can find information about this in groupTitles.txt, groupLabelsPerRating.txt. Solve the logistic group lasso problem (3) with regularization parameter $\lambda = 5$ by running proximal gradient descent for 1000 iterations with fixed step size $t = 10^{-4}$. Plot $f^{(k)} - f^\star$ versus $k$, where $f^{(k)}$ denotes the objective value at iteration $k$, and use as an optimal objective value $f^\star = 336.207$. Make sure the plot is on a semi-log scale (where the y-axis is in log scale).
(d, 5 pts) Now implement Nesterov acceleration for the same problem. You should again run accelerated proximal gradient descent for 1000 iterations with fixed step size $t = 10^{-4}$. As before, produce a plot of $f^{(k)} - f^\star$ versus $k$. Describe any differences you see in the criterion convergence curve.
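A sketch of one common form of the accelerated update (reusing `logistic_loss_grad` and a `prox_h` placeholder as in the sketch after part (b)):

def accelerated_proximal_gradient(grad, prox_h, beta0, t=1e-4, iters=1000):
    # Proximal gradient with Nesterov momentum (one standard variant).
    beta = beta_prev = beta0
    for k in range(1, iters + 1):
        v = beta + (k - 2) / (k + 1) * (beta - beta_prev)   # extrapolation (momentum) step
        beta_prev = beta
        beta = prox_h(v - t * grad(v), t)
    return beta

# Usage sketch: beta_hat = accelerated_proximal_gradient(
#     grad=lambda b: logistic_loss_grad(b, X, y), prox_h=my_group_prox, beta0=np.zeros(X.shape[1]))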
(e, 5 pts) Lastly, implement backtracking line search (rather than a fixed step size), and rerun proximal gradient descent for 400 iterations, without acceleration. (Note this means 400 outer iterations; the backtracking loop itself can take several inner iterations.) You should set $\beta = 0.1$ and $\alpha = 0.5$. Produce a plot of $f^{(k)} - f^\star$ versus $i(k)$, where $i(k)$ counts the total number of iterations performed up to and including outer iteration $k$ (total, meaning the sum of the iterations in both the inner and outer loops).
Note: since it makes for an easier comparison, you can draw the convergence curves from (c), (d), (e) on the same plot.
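A sketch of a single backtracking step for proximal gradient descent; this uses one common textbook form of the sufficient-decrease condition with a shrink factor, so adapt it to the exact condition and the roles of the $\alpha$ and $\beta$ parameters specified above:

def backtracking_prox_step(grad, g, prox_h, x, t0=1.0, shrink=0.1):
    # One outer step: shrink the step size t until a sufficient-decrease condition on g holds.
    gx, gval = grad(x), g(x)
    t, inner = t0, 0
    while True:
        x_new = prox_h(x - t * gx, t)
        G = (x - x_new) / t                          # generalized gradient at step size t
        inner += 1
        if g(x_new) <= gval - t * gx @ G + (t / 2) * (G @ G):
            return x_new, t, inner                   # inner counts the backtracking iterations
        t *= shrink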
(f, 2 pts) Finally, use the solution from accelerated proximal gradient descent in part (d) to make predictions on the test set, available in testRatings.txt, testLabels.txt. What is the classification error? Which movie genres are important for classifying whether a viewer is under 40 years old?