
Details on the Positive Group Lasso

This tutorial shows how to derive the proximity operator and the subdifferential of the \ell_2 penalty (group Lasso), and of the \ell_2 penalty with nonnegative constraints.

Proximity operator of the group Lasso

Let

g : x \mapsto \norm{x}_2 ,

then its Fenchel-Legendre conjugate is

g^{\star} : x \mapsto i_{\norm{x}_2 \leq 1} ,

and for all x \in \mathbb{R}^p

\text{prox}_{g^{\star}}(x) = \text{proj}_{\mathcal{B}_2}(x) = \frac{x}{\max(\norm{x}_2, 1)} .

Using the Moreau decomposition and Equations :eq:`fenchel` and :eq:`prox_projection`, one has

\text{prox}_{\lambda g}(x)
= x - \lambda \, \text{prox}_{g^\star / \lambda}(x/\lambda)
= x - \lambda \, \text{prox}_{g^\star}(x/\lambda)
= x - \lambda \, \frac{x/\lambda}{\max(\norm{x/\lambda}_2, 1)}
= x - \frac{\lambda x}{\max(\norm{x}_2, \lambda)}
= \Big(1 - \frac{\lambda}{\norm{x}_2}\Big)_{+} x .
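
This closed form is block soft-thresholding: x is shrunk toward 0, and set exactly to 0 when \norm{x}_2 \leq \lambda. A minimal NumPy sketch (the helper names are ours, not part of any library), which also checks the Moreau identity numerically::

    import numpy as np

    def proj_unit_l2_ball(x):
        # projection onto the unit l2 ball: x / max(||x||_2, 1)
        return x / max(np.linalg.norm(x), 1.0)

    def prox_group_lasso(x, lam):
        # prox of lam * ||.||_2: block soft-thresholding, (1 - lam / ||x||_2)_+ x
        norm_x = np.linalg.norm(x)
        if norm_x <= lam:
            return np.zeros_like(x)
        return (1.0 - lam / norm_x) * x

    rng = np.random.default_rng(0)
    x, lam = rng.standard_normal(5), 0.7

    # Moreau decomposition: x = prox_{lam g}(x) + lam * proj_{B_2}(x / lam)
    moreau = prox_group_lasso(x, lam) + lam * proj_unit_l2_ball(x / lam)
    print(np.allclose(moreau, x))  # True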

A similar formula can be derived for the group Lasso with nonnegative constraints.

Proximity operator of the group Lasso with positivity constraints

Let

h : x \mapsto \norm{x}_2 + i_{x \geq 0} .

Let x \in \mathbb{R}^p and let S = \{ j \in \{1, \ldots, p\} : x_j > 0 \} be the set of indices of its positive entries; then

h^{\star} : x \mapsto i_{\norm{x_S}_2 \leq 1} ,

and

\text{prox}_{h^{\star}}(x)_{S^c} = x_{S^c} ,
\quad
\text{prox}_{h^{\star}}(x)_{S} = \text{proj}_{\mathcal{B}_2}(x_S) = \frac{x_S}{\max(\norm{x_S}_2, 1)} .
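
In words, \text{prox}_{h^{\star}} leaves the nonpositive coordinates of x untouched and projects its positive part onto the unit \ell_2 ball. A small NumPy sketch (the function name prox_h_star is ours)::

    import numpy as np

    def prox_h_star(x):
        # keep nonpositive coordinates, project the positive ones onto the unit l2 ball
        out = x.copy()
        S = x > 0
        out[S] = x[S] / max(np.linalg.norm(x[S]), 1.0)
        return out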

As before, using the Moreau decomposition and Equation :eq:`fenchel_nn` yields

\text{prox}_{\lambda h}(x)
= x - \lambda \, \text{prox}_{h^\star / \lambda}(x/\lambda)
= x - \lambda \, \text{prox}_{h^\star}(x/\lambda) ,

and thus, combined with Equations :eq:`prox_projection_nn_Sc` and :eq:`prox_projection_nn_S`, this leads to

\text{prox}_{\lambda h}(x)_{S^c} = 0 ,
\quad
\text{prox}_{\lambda h}(x)_{S} = \Big(1 - \frac{\lambda}{\norm{x_S}_2}\Big)_{+} x_S .
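
A sketch of this closed form (function names are ours), checked numerically against a generic solver of the defining problem \min_{z \geq 0} \frac{1}{2} \norm{z - x}^2 + \lambda \norm{z}_2::

    import numpy as np
    from scipy.optimize import minimize

    def prox_pos_group_lasso(x, lam):
        # zero outside S = {j : x_j > 0}, block soft-thresholding of x_S on S
        out = np.zeros_like(x)
        S = x > 0
        norm_S = np.linalg.norm(x[S])
        if norm_S > lam:
            out[S] = (1.0 - lam / norm_S) * x[S]
        return out

    rng = np.random.default_rng(0)
    x, lam = rng.standard_normal(6), 0.5

    obj = lambda z: 0.5 * np.sum((z - x) ** 2) + lam * np.linalg.norm(z)
    res = minimize(obj, np.maximum(x, 0), method="L-BFGS-B", bounds=[(0, None)] * x.size)
    print(np.allclose(res.x, prox_pos_group_lasso(x, lam), atol=1e-4))  # True, up to solver tolerance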

Subdifferential of the positive Group Lasso penalty

For the subdiff_diff working set strategy, we compute the distance D(v) of a vector v to the subdifferential of the penalty h at a point w. Since the penalty is group-separable, we can restrict ourselves to the case where w is a single block of variables in \mathbb{R}^g.

Case w \notin \mathbb{R}_+^g

If any component of w is strictly negative, the subdifferential is empty, and the distance is + \infty.

D(v) = + \infty, \quad \forall v \in \mathbb{R}^g .

Case w = 0

At w = 0, the subdifferential is:

\lambda \partial || \cdot ||_2 (0) + \partial \iota_{x \geq 0} (0) = \lambda \mathcal{B}_2 + \mathbb{R}_-^g ,

where \mathcal{B}_2 is the unit \ell_2 ball.

Therefore, the distance of v to the subdifferential is

D(v) = \min_{u \in \lambda \mathcal{B}_2, \, n \in \mathbb{R}_{-}^g} || u + n - v || .

Minimizing over n, then over u, thanks to [1], yields

D(v) = \max(0, ||v^+|| - \lambda) ,

where v^+ denotes the positive part of v (its negative coordinates set to 0). Intuitively, if v_i < 0 we can cancel that coordinate exactly by taking n_i = v_i and u_i = 0; on the other hand, if v_i > 0, taking a nonzero n_i (necessarily nonpositive) only increases the quantity v_i - n_i that u_i must bring close to 0.
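
A one-line NumPy sketch of this case (the function name is ours)::

    import numpy as np

    def dist_subdiff_at_zero(v, lam):
        # distance of v to lam * B_2 + R_-^g
        return max(0.0, np.linalg.norm(np.maximum(v, 0.0)) - lam)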

For a rigorous derivation, introduce the Lagrangian of a squared version of the objective,

\mathcal{L}(u, n, \nu, \mu) =
\frac{1}{2} \norm{u + n - v}^2 + \nu \Big(\frac{1}{2} \norm{u}^2 - \frac{\lambda^2}{2}\Big) + \langle \mu, n \rangle ,

and write down the optimality conditions with respect to u and n. Treat the case \nu = 0 separately; in the other case, show that u must be nonnegative and that v = (1 + \nu) u + n; together with u = \mu / \nu and complementary slackness, this yields the conclusion.
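
For reference, here is one way to spell out these steps (a sketch, with the multiplier conventions \nu \geq 0 and \mu \geq 0):

\partial_u \mathcal{L} = u + n - v + \nu u = 0 \;\Rightarrow\; (1 + \nu) u + n = v ,

\partial_n \mathcal{L} = u + n - v + \mu = 0 \;\Rightarrow\; \mu = v - u - n = \nu u .

If \nu > 0, then u = \mu / \nu \geq 0 and ||u|| = \lambda by complementary slackness. For coordinates with n_i < 0, complementary slackness gives \mu_i = \nu u_i = 0, hence u_i = 0 and n_i = v_i; for the remaining coordinates, (1 + \nu) u_i = v_i \geq 0. Therefore u = v^+ / (1 + \nu), the constraint ||u|| = \lambda gives 1 + \nu = ||v^+|| / \lambda, and the optimal value is

|| u + n - v || = || u - (1 + \nu) u || = \nu ||u|| = ||v^+|| - \lambda .

If \nu = 0, taking u = v^+ and n = v - v^+ is feasible as soon as ||v^+|| \leq \lambda and gives a zero objective; hence D(v) = \max(0, ||v^+|| - \lambda) in both cases.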

Case || w || \ne 0

In that case (w \geq 0 with w \ne 0), the subdifferential is \lambda w / ||w|| + C_1 \times \ldots \times C_g, where C_j = \{0\} if w_j > 0 and C_j = \mathbb{R}_- otherwise (w_j = 0).

Letting p denote the projection of v onto this set, one has

p_j = \lambda \frac{w_j}{||w||} \text{ if } w_j > 0 ,
\qquad
p_j = \min(v_j, 0) \text{ otherwise}.

The distance to the subdifferential is then

D(v) = || v - p || = \sqrt{\sum_{j : w_j > 0} \Big(v_j - \lambda \frac{w_j}{||w||}\Big)^2 + \sum_{j : w_j = 0} \max(0, v_j)^2} ,

since v_j - \min(v_j, 0) = v_j + \max(-v_j, 0) = \max(0, v_j).
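
Putting the three cases together, a NumPy sketch of the distance computation (a standalone illustration; the function name subdiff_distance_group and its signature are ours, not the library's API)::

    import numpy as np

    def subdiff_distance_group(w, v, lam):
        # distance of v to the subdifferential of lam * (||.||_2 + indicator(. >= 0)) at w
        if np.any(w < 0):
            # the subdifferential is empty
            return np.inf
        if not np.any(w):
            # w = 0: distance to lam * B_2 + R_-^g
            return max(0.0, np.linalg.norm(np.maximum(v, 0.0)) - lam)
        # w >= 0, w != 0: project v onto lam * w / ||w|| + C_1 x ... x C_g
        p = np.where(w > 0, lam * w / np.linalg.norm(w), np.minimum(v, 0.0))
        return np.linalg.norm(v - p)

    # example usage
    w = np.array([0.0, 1.5, 0.3])
    v = np.array([0.2, -0.4, 0.1])
    print(subdiff_distance_group(w, v, lam=0.1))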

References

[1] https://math.stackexchange.com/a/2887332/167258