This tutorial presents how to derive the proximity operator and subdifferential of the l_2 penalty, and of the l_2 penalty with a nonnegativity constraint.
Let
g:x \mapsto \norm{x}_2 ,
then its Fenchel-Legendre conjugate is
g^{\star}:x \mapsto i_{\norm{x}_2 \leq 1} ,
and for all x \in \mathbb{R}^p
\text{prox}_{g^{\star}}(x) = \text{proj}_{\mathcal{B}_2}(x) = \frac{x}{\max(\norm{x}_2, 1)} .
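This projection is straightforward to implement; below is a minimal NumPy sketch (the helper name ``proj_B2`` is ours, not an existing API):

.. code-block:: python

    import numpy as np

    def proj_B2(x):
        """Projection onto the unit l2 ball, i.e. prox of the indicator i_{||.||_2 <= 1}."""
        return x / max(np.linalg.norm(x), 1.)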
Using the Moreau decomposition, Equations :eq:`fenchel` and :eq:`prox_projection`, one has
\text{prox}_{\lambda g}(x) = x - \lambda \text{prox}_{g^\star/\lambda }(x/\lambda)
= x - \lambda \text{prox}_{g^\star}(x/\lambda)
= x - \lambda \frac{x/\lambda}{\max(\norm{x/\lambda}_2, 1)}
= x - \frac{\lambda x}{\max(\norm{x}_2, \lambda)}
= (1 - \frac{\lambda}{\norm{x}_2})_{+} x .
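The resulting block soft-thresholding operator can be checked numerically against the Moreau decomposition above; here is a minimal sketch (function and variable names are ours):

.. code-block:: python

    import numpy as np

    def prox_group_l2(x, lam):
        """Prox of lam * ||.||_2, i.e. block soft-thresholding."""
        norm_x = np.linalg.norm(x)
        if norm_x == 0:
            return np.zeros_like(x)
        return max(0., 1. - lam / norm_x) * x

    rng = np.random.default_rng(0)
    x, lam = rng.standard_normal(5), 0.7
    # Moreau decomposition: prox_{lam g}(x) = x - lam * proj_{B_2}(x / lam)
    moreau = x - lam * (x / lam) / max(np.linalg.norm(x / lam), 1.)
    np.testing.assert_allclose(prox_group_l2(x, lam), moreau)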
A similar formula can be derived for the group Lasso with nonnegative constraints.
Let
h:x \mapsto \norm{x}_2 + i_{x \geq 0} .
Let x \in \mathbb{R}^p and S = \{ j \in \{1, \ldots, p\} \mid x_j > 0 \}, then
h^{\star} :x \mapsto i_{\norm{x_S}_2 \leq 1} ,
and
\text{prox}_{h^{\star}}(x)_{S^c} = x_{S^c}
\text{prox}_{h^{\star}}(x)_S = \text{proj}_{\mathcal{B}_2}(x_S) = \frac{x_S}{\max(\norm{x_S}_2, 1)} .
As before, using the Moreau decomposition and Equation :eq:`fenchel_nn` yields
\text{prox}_{\lambda h}(x) = x - \lambda \text{prox}_{h^\star / \lambda }(x/\lambda)
= x - \lambda \text{prox}_{h^\star}(x/\lambda) ,
and thus, combined with Equations :eq:`prox_projection_nn_Sc` and :eq:`prox_projection_nn_S`, this yields
\text{prox}_{\lambda h}(x)_{S^c} = 0
\text{prox}_{\lambda h}(x)_{S} = (1 - \frac{\lambda}{\norm{x_S}_2})_{+} x_S .
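These two cases translate directly into code; a minimal sketch could look as follows (``prox_nn_group_l2`` is a name chosen here, not an existing API):

.. code-block:: python

    import numpy as np

    def prox_nn_group_l2(x, lam):
        """Prox of lam * (||.||_2 + i_{. >= 0}): nonpositive coordinates are set
        to zero, and the positive part is block soft-thresholded."""
        prox = np.zeros_like(x)
        S = x > 0                        # support of the positive part
        norm_xS = np.linalg.norm(x[S])
        if norm_xS > lam:
            prox[S] = (1. - lam / norm_xS) * x[S]
        return prox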
For the subdiff_diff
working set strategy, we compute the distance D(v) of a vector v to the subdifferential of the penalty h at a point w.
Since the penalty is group-separable, we reduce to the case where w is a single block of variables in \mathbb{R}^g.
If any component of w is strictly negative, the subdifferential is empty, and the distance is + \infty.
D(v) = + \infty, \quad \forall v \in \mathbb{R}^g .
At w = 0, the subdifferential is:
\lambda \partial || \cdot ||_2 (0) + \partial \iota_{\cdot \geq 0} (0) = \lambda \mathcal{B}_2 + \mathbb{R}_-^g ,
where \mathcal{B}_2 is the unit \ell_2 ball.
Therefore, the distance to the subdifferential is
D(v) = \min_{u \in \lambda \mathcal{B}_2, n \in \mathbb{R}_{-}^g} \ || u + n - v || .
Minimizing over n then over u, thanks to [1], yields
D(v) = \max(0, ||v^+|| - \lambda) ,
where v^+ is v restricted to its positive coordinates. Intuitively, if v_i < 0, we can cancel it exactly in the objective function by taking n_i = - v_i and u_i = 0; on the other hand, if v_i > 0, taking a nonzero n_i only increases the quantity that u_i needs to bring closer to 0.
For a rigorous derivation, introduce the Lagrangian of the squared objective (with dual variables \nu \geq 0 and \mu \in \mathbb{R}_+^g)
\mathcal{L}(u, n, \nu, \mu) = \frac{1}{2}\norm{u + n - v}^2 + \nu(\frac{1}{2} \norm{u}^2 - \lambda^2 / 2) + \langle \mu, n \rangle ,
and write down the optimality conditions with respect to u and n. Treat the case \nu = 0 separately; otherwise, show that v = (1 + \nu) u + n and u = \mu / \nu (hence u is nonnegative), and use complementary slackness to reach the conclusion.
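As a sanity check, the closed form can also be compared to a brute-force minimization of || u + n - v || over u \in \lambda \mathcal{B}_2 and n \leq 0, for instance with scipy (a sketch; variable names are ours):

.. code-block:: python

    import numpy as np
    from scipy.optimize import minimize

    lam, g = 0.5, 4
    rng = np.random.default_rng(0)
    v = rng.standard_normal(g)

    # closed form: distance of v to lam * B_2 + R_-^g
    closed_form = max(0., np.linalg.norm(np.maximum(v, 0.)) - lam)

    # brute force: minimize ||u + n - v|| over ||u||_2 <= lam and n <= 0,
    # with z = [u, n] stacked in a single vector
    objective = lambda z: np.linalg.norm(z[:g] + z[g:] - v)
    constraints = [
        {"type": "ineq", "fun": lambda z: lam - np.linalg.norm(z[:g])},  # ||u||_2 <= lam
        {"type": "ineq", "fun": lambda z: -z[g:]},                       # n <= 0
    ]
    res = minimize(objective, np.zeros(2 * g), constraints=constraints)
    print(closed_form, res.fun)  # both values should match up to solver tolerance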
When w \neq 0 (with w \geq 0), the subdifferential is \lambda w / ||w|| + C_1 \times \ldots \times C_g, where C_j = \{0\} if w_j > 0 and C_j = \mathbb{R}_- otherwise (w_j = 0).
Letting p denote the projection of v onto this set, one has
p_j = \lambda \frac{w_j}{||w||} \text{ if } w_j > 0
and
p_j = \min(v_j, 0) \text{ otherwise}.
The distance to the subdifferential is then:
D(v) = || v - p || = \sqrt{\sum_{j, w_j > 0} (v_j - \lambda \frac{w_j}{||w||})^2 + \sum_{j, w_j=0} \max(0, v_j)^2} ,
since v_j - \min(v_j, 0) = v_j + \max(-v_j, 0) = \max(0, v_j).
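Putting the three cases together (a strictly negative coordinate, w = 0, and w \neq 0), the distance computation for one block might be sketched as follows (``dist_subdiff`` is a name chosen here):

.. code-block:: python

    import numpy as np

    def dist_subdiff(v, w, lam):
        """Distance of v to the subdifferential of lam * (||.||_2 + i_{. >= 0}) at w."""
        if np.any(w < 0):
            return np.inf                      # empty subdifferential
        if not np.any(w):                      # w == 0
            return max(0., np.linalg.norm(np.maximum(v, 0.)) - lam)
        # w != 0: project v onto lam * w / ||w|| + C_1 x ... x C_g
        p = np.where(w > 0, lam * w / np.linalg.norm(w), np.minimum(v, 0.))
        return np.linalg.norm(v - p)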
[1] https://math.stackexchange.com/a/2887332/167258