Homework 2
Submit your work as a single PDF on Gradescope. Make sure to prepare your solution to each problem on a separate page. (Gradescope will ask you to select the pages which contain the solution to each problem.)
Total: 80 points (+ 10 bonus points)
where each $t_k \le 1/L$. As usual, we will write a generic update as $x^+ = x - t\nabla f(x)$, where $t \le 1/L$.
(b, 2 pts) Use $t \le 1/L$, and rearrange the previous result, to get
\[
\|\nabla f(x)\|_2^2 \le \frac{2}{t}\big(f(x) - f(x^+)\big).
\]
(c, 2 pts) Sum the previous result over all iterations $1, \ldots, k+1$ to establish
\[
\sum_{i=0}^{k} \|\nabla f(x^{(i)})\|_2^2 \le \frac{2}{t}\big(f(x^{(0)}) - f^\star\big).
\]
(d, 2 pts) Lower bound the sum in the previous result to get
\[
\min_{i=0,\ldots,k} \|\nabla f(x^{(i)})\|_2 \le \sqrt{\frac{2}{t(k+1)}\big(f(x^{(0)}) - f^\star\big)}.
\]
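For reference (this restatement is not part of the required proof), rearranging the bound above shows that an $\epsilon$-small gradient is guaranteed after $O(1/\epsilon^2)$ iterations:
\[
\min_{i=0,\ldots,k} \|\nabla f(x^{(i)})\|_2 \le \epsilon \quad \text{whenever} \quad k + 1 \ge \frac{2\big(f(x^{(0)}) - f^\star\big)}{t\,\epsilon^2}.
\]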
(a, 3 pts) Starting with (2), apply the first-order condition for convexity of $f$ to show
\[
f(x^+) \le f^\star + \nabla f(x)^T (x - x^\star) - \frac{t}{2}\|\nabla f(x)\|_2^2.
\]
(c, 2 pts) Sum the previous result over all iterations $1, \ldots, k$ to get
\[
\sum_{i=1}^{k} \big(f(x^{(i)}) - f^\star\big) \le \frac{1}{2t}\|x^{(0)} - x^\star\|_2^2.
\]
(d, 2 pts) Use the fact that gradient descent is a descent method to lower bound the sum above, and conclude
\[
f(x^{(k)}) - f^\star \le \frac{\|x^{(0)} - x^\star\|_2^2}{2tk},
\]
which establishes the desired $O(1/\epsilon)$ rate for achieving $\epsilon$-suboptimality.
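For reference (again, not part of the required proof), this bound can be restated as an explicit iteration count:
\[
f(x^{(k)}) - f^\star \le \epsilon \quad \text{whenever} \quad k \ge \frac{\|x^{(0)} - x^\star\|_2^2}{2t\,\epsilon}.
\]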
2 Properties and examples of subgradients (18 points)
We will examine various properties and examples of subgradients.
(a, 2 pts) Show that $\partial f(x)$ is a closed and convex set for any function $f$ (not necessarily convex) and any point $x$ in its domain.
(b, 2 pts) Show that $g \in \partial f(x)$ if and only if $(g, -1)$ defines a supporting hyperplane to the epigraph of $f$ at $(x, f(x))$ (i.e., $(g, -1)$ is the normal vector to this hyperplane).
(c, 2 pts) For a convex function $f$, show that if $x \in U$, where $U$ is an open neighborhood in its domain, then
\[
f(y) \ge f(x) + g^T (y - x) \ \text{ for all } y \in U \;\;\Longrightarrow\;\; g \in \partial f(x).
\]
In other words, if the tangent line inequality holds in a local open neighborhood of $x$, then it holds globally.
(d, 1 pt) For a convex function $f$ and subgradients $g_x \in \partial f(x)$, $g_y \in \partial f(y)$, prove that
\[
(g_x - g_y)^T (x - y) \ge 0.
\]
(e, 2 pts) For $f(x) = \|x\|_2$, show that all subgradients $g \in \mathbb{R}^n$ at a point $x \in \mathbb{R}^n$ are of the form
\[
g \in \begin{cases} \{x/\|x\|_2\} & x \ne 0 \\ \{v : \|v\|_2 \le 1\} & x = 0. \end{cases}
\]
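If you want a quick numerical sanity check of the $x \ne 0$ case before proving it, the following numpy sketch verifies the subgradient inequality at random points (illustrative only, not part of what you submit):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)
g = x / np.linalg.norm(x)                 # claimed subgradient at x != 0

# Verify f(y) >= f(x) + g^T (y - x) for f = ||.||_2 at many random points y.
for _ in range(1000):
    y = rng.normal(size=5)
    assert np.linalg.norm(y) >= np.linalg.norm(x) + g @ (y - x) - 1e-12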
(f, 3 pts) For $f(x) = \max_{s \in S} f_s(x)$, where each $f_s$ is convex, show that
\[
\partial f(x) \supseteq \mathrm{conv}\Big( \bigcup_{s : f_s(x) = f(x)} \partial f_s(x) \Big).
\]
(g, 6 pts) For $f(X) = \|X\|_{\mathrm{tr}}$, show that the subgradients at $X = U \Sigma V^T$ (this is an SVD of $X$) satisfy
\[
\partial f(X) \supseteq \big\{ U V^T + W : \|W\|_{\mathrm{op}} \le 1,\ U^T W = 0,\ W V = 0 \big\}.
\]
Hint: you may use the fact that $\|\cdot\|_{\mathrm{tr}}$ and $\|\cdot\|_{\mathrm{op}}$ are dual norms, which implies $\langle A, B\rangle \le \|A\|_{\mathrm{tr}} \|B\|_{\mathrm{op}}$ for any matrices $A, B$, where recall $\langle A, B\rangle = \mathrm{tr}(A^T B)$. Bonus (5 pts): prove the other direction.
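Similarly, an optional numpy sketch that checks the simplest member of this set (the $W = 0$ case, i.e., $UV^T$) against the subgradient inequality at random matrices:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
U, S, Vt = np.linalg.svd(X, full_matrices=False)
G = U @ Vt                                # candidate subgradient (the W = 0 case)

# Verify f(Y) >= f(X) + <G, Y - X> for the trace (nuclear) norm at random Y.
for _ in range(1000):
    Y = rng.normal(size=(5, 4))
    lhs = np.linalg.norm(Y, 'nuc')
    rhs = np.linalg.norm(X, 'nuc') + np.sum(G * (Y - X))
    assert lhs >= rhs - 1e-10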
(a, 3 pts) Prove that $\mathrm{prox}_{h,t}$ is a well-defined function on $\mathbb{R}^n$, that is, each point $x \in \mathbb{R}^n$ gets mapped to a unique value $\mathrm{prox}_{h,t}(x)$. Hint: use the previous question, and the monotonicity of subgradients from Q2(d).
(d, 3 pts) The proximal minimization algorithm (a special case of proximal gradient descent) repeats the updates:
\[
x^{(k+1)} = \mathrm{prox}_{h,t}(x^{(k)}), \quad k = 1, 2, 3, \ldots.
\]
Write out these updates when applied to $h(x) = \frac{1}{2} x^T A x - b^T x$, where $A \in \mathbb{S}^n$. Show that this is equivalent to the iterative refinement algorithm for solving the linear system $Ax = b$:
\[
x^{(k+1)} = x^{(k)} + (A + \epsilon I)^{-1} \big(b - A x^{(k)}\big),
\]
where $\epsilon > 0$ is some constant. Bonus (1 pt): assuming that proximal minimization converges to the minimizer of $h(x) = \frac{1}{2} x^T A x - b^T x$ (which it does, under suitable step sizes), what would the iterations of iterative refinement converge to in the case when $A$ is singular, the system $Ax = b$ has a solution, and $x^{(0)} = 0$?
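Before proving the equivalence, you may find it helpful to check it numerically; here is a minimal numpy sketch, using a positive semidefinite $A$ for simplicity and taking $\epsilon = 1/t$ (the variable names are illustrative):

import numpy as np

rng = np.random.default_rng(0)
n = 6
M = rng.normal(size=(n, n))
A = M.T @ M                               # a positive semidefinite A in S^n
b = rng.normal(size=n)
x = rng.normal(size=n)
t = 0.1
eps = 1.0 / t

# Proximal step: argmin_z h(z) + (1/(2t)) ||z - x||_2^2, i.e. (A + eps*I) z = b + eps*x.
z_prox = np.linalg.solve(A + eps * np.eye(n), b + eps * x)
# One iterative refinement update for Ax = b.
z_ir = x + np.linalg.solve(A + eps * np.eye(n), b - A @ x)
print(np.allclose(z_prox, z_ir))          # expect True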
For $h(X) = \|X\|_{\mathrm{tr}}$, show that the proximal operator evaluated at $X = U \Sigma V^T$ (this is an SVD of $X$) is so-called matrix soft-thresholding,
\[
\mathrm{prox}_{h,t}(X) = U \Sigma_t V^T, \quad \text{where } \Sigma_t = \mathrm{diag}\big((\Sigma_{11} - t)_+, \ldots, (\Sigma_{nn} - t)_+\big),
\]
and $x_+ = \max\{x, 0\}$ denotes the positive part of $x$. Hint: start with subgradient optimality as you developed in Q3(b), and use the subgradients of the trace norm from Q2(g).
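A minimal numpy sketch of this operator, with a quick check against the prox definition (the function name and the check are illustrative, not required):

import numpy as np

def prox_trace_norm(X, t):
    # Matrix soft-thresholding: shrink each singular value by t and truncate at zero.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(S - t, 0.0)) @ Vt

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
t = 0.5
Z = prox_trace_norm(X, t)
# Z should do at least as well as X and 0 on the prox objective (1/(2t))||Z - X||_F^2 + ||Z||_tr.
obj = lambda Z: np.linalg.norm(Z - X, 'fro')**2 / (2 * t) + np.linalg.norm(Z, 'nuc')
print(obj(Z) <= obj(X) and obj(Z) <= obj(np.zeros_like(X)))   # expect True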
where $1 = (1, \ldots, 1) \in \mathbb{R}^n$ and each $X_{(j)} \in \mathbb{R}^{n \times p_j}$. To achieve sparsity over groups of features, rather than individual features, we can use a group lasso penalty. Write $\beta = (\beta_0, \beta_{(1)}, \ldots, \beta_{(J)}) \in \mathbb{R}^{p+1}$, where $\beta_0$ is an intercept term and each $\beta_{(j)} \in \mathbb{R}^{p_j}$. Consider the problem
\[
\min_{\beta} \; g(\beta) + \lambda \sum_{j=1}^{J} w_j \|\beta_{(j)}\|_2, \tag{3}
\]
where $g$ is a loss function and $\lambda \ge 0$ is a tuning parameter. The penalty $h(\beta) = \lambda \sum_{j=1}^{J} w_j \|\beta_{(j)}\|_2$ is called the group lasso penalty. A common choice for $w_j$ is $\sqrt{p_j}$, to adjust for the group size.
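For the computational parts below, it may help to have the penalty as code; a small numpy sketch (here `groups` is a hypothetical list of index arrays into $\beta$, one per group, excluding the intercept):

import numpy as np

def group_lasso_penalty(beta, groups, lam, weights):
    # h(beta) = lam * sum_j weights[j] * ||beta_(j)||_2; the intercept beta[0] is not penalized.
    return lam * sum(w * np.linalg.norm(beta[idx]) for idx, w in zip(groups, weights))

# Example usage with two hypothetical groups of sizes 2 and 3 (weights sqrt(p_j)).
beta = np.array([0.5, 1.0, -2.0, 0.0, 3.0, 1.0])   # beta[0] is the intercept
groups = [np.array([1, 2]), np.array([3, 4, 5])]
weights = [np.sqrt(len(idx)) for idx in groups]
print(group_lasso_penalty(beta, groups, lam=5.0, weights=weights))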
(a, 3 pts) Derive the proximal operator $\mathrm{prox}_{h,t}(\beta)$ for the group lasso penalty defined above.
(b, 2 pts) Let $y \in \{0, 1\}^n$ be a vector of binary labels, and let $g$ be the logistic loss
\[
g(\beta) = -\sum_{i=1}^{n} y_i (X\beta)_i + \sum_{i=1}^{n} \log\big(1 + \exp\{(X\beta)_i\}\big).
\]
Write out the steps for proximal gradient descent applied to the logistic group lasso problem (3) in
explicit detail.
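A possible code skeleton for these steps is sketched below in numpy; the function names are illustrative, and `prox_h` is left as a placeholder for the operator you derive in part (a):

import numpy as np

def logistic_loss_grad(beta, X, y):
    # Gradient of g(beta) = -sum_i y_i (X beta)_i + sum_i log(1 + exp((X beta)_i)).
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))    # elementwise sigmoid of X beta
    return X.T @ (p - y)

def proximal_gradient(X, y, prox_h, t=1e-4, iters=1000):
    # Fixed-step proximal gradient descent; prox_h(v, t) should apply the group lasso prox.
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        beta = prox_h(beta - t * logistic_loss_grad(beta, X, y), t)
    return beta

# Usage sketch: beta_hat = proximal_gradient(X, y, prox_h=my_group_prox, t=1e-4, iters=1000)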
(c, 5 pts) Now we'll use the logistic group lasso to classify a person's age group from their movie ratings. The movie ratings can be categorized into groups according to a movie's genre (e.g., all ratings for action movies can be grouped together). Load the training data in trainRatings.txt, trainLabels.txt; the features have already been arranged into groups, and you can find information about this in groupTitles.txt, groupLabelsPerRating.txt. Solve the logistic group lasso problem (3) with regularization parameter $\lambda = 5$ by running proximal gradient descent for 1000 iterations with fixed step size $t = 10^{-4}$. Plot $f^{(k)} - f^\star$ versus $k$, where $f^{(k)}$ denotes the objective value at iteration $k$, and use as an optimal objective value $f^\star = 336.207$. Make sure the plot is on a semi-log scale (where the y-axis is in log scale).
(d, 5 pts) Now implement Nesterov acceleration for the same problem. You should again run accelerated proximal gradient descent for 1000 iterations with fixed step size $t = 10^{-4}$. As before, produce a plot of $f^{(k)} - f^\star$ versus $k$. Describe any differences you see in the criterion convergence curve.
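A sketch of one common form of the accelerated update (reusing `logistic_loss_grad` and a `prox_h` placeholder as in the sketch after part (b)):

def accelerated_proximal_gradient(grad, prox_h, beta0, t=1e-4, iters=1000):
    # Proximal gradient with Nesterov momentum (one standard variant).
    beta = beta_prev = beta0
    for k in range(1, iters + 1):
        v = beta + (k - 2) / (k + 1) * (beta - beta_prev)   # extrapolation (momentum) step
        beta_prev = beta
        beta = prox_h(v - t * grad(v), t)
    return beta

# Usage sketch: beta_hat = accelerated_proximal_gradient(
#     grad=lambda b: logistic_loss_grad(b, X, y), prox_h=my_group_prox, beta0=np.zeros(X.shape[1]))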
(e, 5 pts) Lastly, implement backtracking line search (rather than a fixed step size), and rerun proximal gradient descent for 400 iterations, without acceleration. (Note this means 400 outer iterations; the backtracking loop itself can take several inner iterations.) You should set $\beta = 0.1$ and $\alpha = 0.5$. Produce a plot of $f^{(k)} - f^\star$ versus $i(k)$, where $i(k)$ counts the total number of iterations performed up to and including outer iteration $k$ (total, meaning the sum of the iterations in both the inner and outer loops).
Note: since it makes for an easier comparison, you can draw the convergence curves from (c), (d), (e) on the same plot.
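A sketch of a single backtracking step for proximal gradient descent; this uses one common textbook form of the sufficient-decrease condition with a shrink factor, so adapt it to the exact condition and the roles of the $\alpha$ and $\beta$ parameters specified above:

def backtracking_prox_step(grad, g, prox_h, x, t0=1.0, shrink=0.1):
    # One outer step: shrink the step size t until a sufficient-decrease condition on g holds.
    gx, gval = grad(x), g(x)
    t, inner = t0, 0
    while True:
        x_new = prox_h(x - t * gx, t)
        G = (x - x_new) / t                          # generalized gradient at step size t
        inner += 1
        if g(x_new) <= gval - t * gx @ G + (t / 2) * (G @ G):
            return x_new, t, inner                   # inner counts the backtracking iterations
        t *= shrink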
(f, 2 pts) Finally, use the solution from accelerated proximal gradient descent in part (d) to make predictions on the test set, available in testRatings.txt, testLabels.txt. What is the classification error? Which movie genres are important for classifying whether a viewer is under 40 years old?