Prof. S. Boyd
EE364b Homework 2
1. Subgradient optimality conditions for nondifferentiable inequality constrained optimization. Consider the problem

    minimize    f0(x)
    subject to  fi(x) ≤ 0,   i = 1, . . . , m.

Suppose x̃ and λ̃ ≥ 0 satisfy primal feasibility,

    fi(x̃) ≤ 0,   i = 1, . . . , m,

dual feasibility,

    0 ∈ ∂f0(x̃) + Σ_{i=1}^m λ̃i ∂fi(x̃),

and the complementary slackness condition

    λ̃i fi(x̃) = 0,   i = 1, . . . , m.

Show that x̃ is optimal, using only a simple argument and the definition of subgradient. Recall that we do not assume the functions f0, . . . , fm are convex.
Solution. Let g be defined by g(x) = f0(x) + Σ_{i=1}^m λ̃i fi(x). Then 0 ∈ ∂g(x̃). By the definition of subgradient, this means that for any y,

    g(y) ≥ g(x̃) + 0^T (y − x̃).

Thus, for any y,

    f0(y) − f0(x̃) ≥ − Σ_{i=1}^m λ̃i (fi(y) − fi(x̃)).
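For completeness, the concluding step can be written as a short chain of inequalities: for any feasible y we have fi(y) ≤ 0 and λ̃i ≥ 0, and complementary slackness gives λ̃i fi(x̃) = 0, so

    \[
    f_0(y) - f_0(\tilde{x})
      \ge -\sum_{i=1}^{m} \tilde{\lambda}_i \bigl( f_i(y) - f_i(\tilde{x}) \bigr)
      = -\sum_{i=1}^{m} \tilde{\lambda}_i f_i(y)
      \ge 0.
    \]

Hence f0(y) ≥ f0(x̃) for every feasible y, i.e., x̃ is optimal.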
2. ℓ1-regularized minimization. Consider the problem of minimizing φ(x) = f(x) + λ‖x‖1, where f : R^n → R is convex and differentiable and λ ≥ 0 is a regularization parameter.
(a) Show that x = 0 is optimal for this problem (i.e., minimizes φ) if and only if ‖∇f(0)‖∞ ≤ λ. In particular, for λ ≥ λmax = ‖∇f(0)‖∞, ℓ1 regularization yields the sparsest possible x, the zero vector.
Remark. The value λmax gives a good reference point for choosing a value of the penalty parameter λ in ℓ1-regularized minimization. A common choice is to start with λ = λmax/2, and then adjust λ to achieve the desired sparsity/fit trade-off.
Solution. A necessary and sufficient condition for optimality of x = 0 is that 0 ∈ ∂φ(0). Now ∂φ(0) = ∇f(0) + λ∂‖0‖1 = ∇f(0) + λ[−1, 1]^n. In other words, x = 0 is optimal if and only if −∇f(0) ∈ [−λ, λ]^n. This is equivalent to ‖∇f(0)‖∞ ≤ λ.
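As a small numerical illustration of the λmax threshold (not part of the original solution): the sketch below takes f(x) = ‖Ax − b‖_2^2, so ∇f(0) = −2 A^T b and λmax = ‖2 A^T b‖∞, uses random data, and uses the cvxpy package in place of CVX.

    import numpy as np
    import cvxpy as cp

    np.random.seed(0)
    m, n = 40, 20
    A = np.random.randn(m, n)
    b = np.random.randn(m)

    # for f(x) = ||Ax - b||_2^2, grad f(0) = -2 A^T b
    lam_max = np.linalg.norm(2 * A.T @ b, np.inf)

    for lam in [1.05 * lam_max, 0.5 * lam_max]:
        x = cp.Variable(n)
        cp.Problem(cp.Minimize(cp.sum_squares(A @ x - b) + lam * cp.norm1(x))).solve()
        # for lam >= lam_max the minimizer is (numerically) zero; below it is not
        print(f"lambda = {lam:.3f}, max |x_i| = {np.abs(x.value).max():.2e}")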
(b) Coordinate-wise descent. In the coordinate-wise descent method for minimizing
a convex function g, we first minimize over x1 , keeping all other variables fixed;
then we minimize over x2 , keeping all other variables fixed, and so on. After
minimizing over xn , we go back to x1 and repeat the whole process, repeatedly
cycling over all n variables.
Show that coordinate-wise descent fails for the function
    g(x) = |x1 − x2| + 0.1(x1 + x2).

(In particular, verify that the algorithm terminates after one step at the point (x2^(0), x2^(0)), while inf_x g(x) = −∞.) Thus, coordinate-wise descent need not work, for general convex functions.
Solution. We first minimize over x1, with x2 fixed at x2^(0). The optimal choice is x1 = x2^(0), since the derivative on the left is −0.9, and on the right it is 1.1. We then arrive at the point (x2^(0), x2^(0)). We now optimize over x2. But it is already optimal, with the same left and right derivatives, so x is unchanged. We're now at a fixed point of the coordinate-descent algorithm.
On the other hand, taking x = (t, t) and letting t → −∞, we see that g(x) = 0.2t → −∞.
It's good to visualize coordinate-wise descent for this function, to see why x gets stuck at the crease along x1 = x2. The graph looks like a folded piece of paper, with the crease along the line x1 = x2. The bottom of the crease has a small downward tilt in the direction −(1, 1), so the function is unbounded below. Moving along either axis increases g, so coordinate-wise descent is stuck. But moving in the direction −(1, 1), for example, decreases the function.
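A minimal numerical sketch of this behavior (an added illustration; the starting point and number of sweeps are arbitrary): exact coordinate minimization on g stalls after one sweep, while g decreases without bound along −(1, 1).

    import numpy as np

    def g(x):
        return abs(x[0] - x[1]) + 0.1 * (x[0] + x[1])

    # exact coordinate minimization: with the other coordinate fixed at t,
    # the minimizer over either coordinate is t (left slope -0.9, right slope 1.1)
    x = np.array([1.3, -0.7])
    for sweep in range(3):
        x[0] = x[1]                 # argmin over x1 with x2 fixed
        x[1] = x[0]                 # argmin over x2 with x1 fixed
        print(sweep, x, g(x))       # stuck at (-0.7, -0.7) after the first sweep

    # g is unbounded below along the direction -(1, 1)
    print([g(np.array([-t, -t])) for t in (10.0, 100.0, 1000.0)])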
(c) Now consider coordinate-wise descent for minimizing the specific function φ defined above. Assuming f is strongly convex (say), it can be shown that the iterates converge to a fixed point x̃. Show that x̃ is optimal, i.e., minimizes φ.
Thus, coordinate-wise descent works for ℓ1-regularized minimization of a differentiable function.
Solution. For each i, x̃i minimizes the function φ over xi, with all other variables kept fixed. This means that

    0 ∈ (∂f/∂xi)(x̃) + λ Ii,   i = 1, . . . , n,

where Ii = ∂|x̃i|, i.e., Ii = [−1, 1] if x̃i = 0 and Ii = {sign(x̃i)} otherwise. Since the ℓ1 norm is separable, ∂‖x̃‖1 = I1 × · · · × In, so these conditions together are exactly 0 ∈ ∇f(x̃) + λ ∂‖x̃‖1 = ∂φ(x̃). Hence x̃ minimizes φ.
(d) Apply coordinate-wise descent to the problem of minimizing ‖Ax − b‖_2^2 + λ‖x‖1. You might find the deadzone function

    ψ(u) = u − 1,   u > 1
           0,       |u| ≤ 1
           u + 1,   u < −1

useful. Generate some data and try out the coordinate-wise descent method.
Check the result against the solution found using CVX, and produce a graph showing convergence of your coordinate-wise method.
Solution. At each step we choose an index i, and minimize ‖Ax − b‖_2^2 + λ‖x‖1 over xi, while holding all other xj, with j ≠ i, constant. Selecting the optimal xi for this problem is equivalent to selecting the optimal xi in the problem

    minimize  a xi^2 + c xi + |xi|,

where a = (A^T A)_{ii}/λ and c = (2/λ)( Σ_{j≠i} (A^T A)_{ij} xj − (b^T A)_i ). Using the theory discussed above, any minimizer xi will satisfy 0 ∈ 2a xi + c + ∂|xi|. Now we note that a is positive, so the minimizer of the above problem will have sign opposite to that of c. From there we deduce that the (unique) minimizer xi will be
    xi = 0,                          c ∈ [−1, 1]
         −(1/(2a))(c − sign(c)),     otherwise,

where

    sign(u) = −1,   u < 0
               0,   u = 0
               1,   u > 0.
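As a quick check of the case c > 1 (an added verification): the candidate xi = −(c − 1)/(2a) is negative, so ∂|xi| = {−1}, and

    \[
    2a\,x_i + c + \operatorname{sign}(x_i) = -(c - 1) + c - 1 = 0,
    \]

so 0 ∈ 2a xi + c + ∂|xi| indeed holds; the case c < −1 is symmetric.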
Finally, we make use of the deadzone function ψ defined above and write

    xi = − ψ( (2/λ) ( Σ_{j≠i} (A^T A)_{ij} xj − (b^T A)_i ) ) / ( (2/λ) (A^T A)_{ii} ).
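One possible implementation of this update for the requested experiment (a sketch only: the random data, the helper name deadzone, and cvxpy standing in for CVX are assumptions here, not part of the original write-up):

    import numpy as np
    import cvxpy as cp

    def deadzone(u):
        # psi(u) = u - 1 (u > 1), 0 (|u| <= 1), u + 1 (u < -1)
        return np.sign(u) * np.maximum(np.abs(u) - 1.0, 0.0)

    np.random.seed(0)
    m, n, lam = 100, 30, 1.0
    A = np.random.randn(m, n)
    b = np.random.randn(m)
    G, h = A.T @ A, A.T @ b                      # Gram matrix and A^T b

    x = np.zeros(n)
    obj = []
    for sweep in range(50):                      # repeated passes over all coordinates
        for i in range(n):
            c = (2.0 / lam) * (G[i] @ x - G[i, i] * x[i] - h[i])
            x[i] = -deadzone(c) / ((2.0 / lam) * G[i, i])
        obj.append(np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x)))

    # compare against the solution from a generic convex solver
    z = cp.Variable(n)
    cp.Problem(cp.Minimize(cp.sum_squares(A @ z - b) + lam * cp.norm1(z))).solve()
    print("max coordinate difference:", np.abs(x - z.value).max())

Plotting obj minus its final value (or minus the optimal value found by the solver) on a log scale gives the convergence graph asked for in the problem.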
[Figure: convergence plot for the coordinate-wise descent method.]