Ratio and Difference of l1 and l2 Norms and Sparse Representation with Coherent Dictionaries

Penghang Yin*, Ernie Esser*, and Jack Xin*

Abstract
The ratio of l1 and l2 norms has been used empirically to enforce sparsity of scale invariant solutions in non-
convex blind source separation problems such as nonnegative matrix factorization and blind deblurring. In this
paper, we study the mathematical theory of the sparsity promoting properties of the ratio metric in the context of
basis pursuit via over-complete dictionaries. Due to the coherence in the dictionary elements, convex relaxations
such as l1 minimization or non-negative least squares may not find the sparsest solutions. We find sufficient
conditions on the nonnegative solutions of the basis pursuit problem so that the sparsest solutions can be recovered
exactly by minimizing the nonconvex ratio penalty. Similar results hold for the difference of l1 and l2 norms. In the
unconstrained form of the basis pursuit problem, these penalties are robust and help select sparse, if not the sparsest,
solutions. We give analytical and numerical examples and introduce sequentially convex algorithms to illustrate how
the ratio and difference penalties are computed to produce both stable and sparse solutions.

Keywords: l1 and l2 norms, ratio and difference, coherent dictionary, sparse representation.

AMS Subject Classification: 94A12, 94A15, 90C26, 90C25.

I. INTRODUCTION
The ratio of l1 and l2 norms is a widely used empirical nonconvex scale-invariant penalty for encouraging sparsity
in nonconvex problems such as nonnegative matrix factorization (NMF) and blind deconvolution applications [11],
[12], [13]. A related metric that is homogeneous of degree one is the difference of l1 and l2 norms. Both appear
to encourage sparse solutions to non-negative least squares (NNLS) type problems of the form
$$\min_{x \ge 0} \ \frac{\lambda}{2}\,\|A x - b\|_2^2 + R(x), \qquad (1.1)$$

where $R(x) = \frac{\|x\|_1}{\|x\|_2}$ or $R(x) = \|x\|_1 - \|x\|_2$. If $A$ satisfies certain incoherence properties, then sufficiently sparse
nonnegative solutions to A x = b are unique [1], which is why solving the convex NNLS problem often works well
without the additional sparsity penalty $R(x)$. A coherence measure of the matrix $A$ is $\rho(A)$, defined as the maximum
of the cosine of pairwise angles between any two columns of $A$. Let $t_A = \frac{\rho}{1+\rho}$, as in [1]. A sufficient incoherence
condition [1] for uniqueness of the sparsest solution $x_0 \ge 0$ is that $\|x_0\|_0 < \frac{1}{2 t_A}$. However, such conditions are often
not satisfied in practice, in which case including R(x) can yield much sparser solutions. Likewise, minimizing the
convex metric, the l1 norm, can effectively recover sparse solutions to the underdetermined system A x = b when
columns of the matrix A satisfy certain incoherence conditions [2], [3], [4].
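To make $\rho(A)$ and the bound $\frac{1}{2 t_A}$ concrete, the following small numerical sketch (ours, not from the paper) computes the coherence measure of a toy nonnegative dictionary; the function name and the random dictionary are illustrative assumptions only.

import numpy as np

def coherence_measure(A):
    # rho(A): the largest cosine of the angle between two distinct columns
    # (columns assumed nonzero; for a nonnegative A all cosines are >= 0).
    G = A.T @ A
    norms = np.linalg.norm(A, axis=0)
    C = G / np.outer(norms, norms)
    np.fill_diagonal(C, -np.inf)        # ignore the diagonal (i == j)
    return C.max()

A = np.abs(np.random.randn(20, 50))     # toy over-complete nonnegative dictionary
rho = coherence_measure(A)
t_A = rho / (1.0 + rho)
print(f"rho = {rho:.3f}, uniqueness bound 1/(2 t_A) = {1 / (2 * t_A):.3f}")

For a highly coherent dictionary, rho is close to 1, t_A is close to 1/2, and the bound permits only about one nonzero entry, which is why the sufficient conditions of [1] are often violated in the settings considered here.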
We are interested in understanding the sparsity promoting properties of the ratio and difference metrics theo-
retically and computationally in a highly coherent (over-complete) dictionary. Over-complete dictionaries occur in
human visual and auditory systems [14], [10], [16], and in discretization of continuum imaging problems such as
radar and medical imaging when the grid spacing is below the Rayleigh threshold [7]. Band exclusion and local
optimization techniques are introduced to image objects sufficiently separated with respect to the coherence bands
in [7].
Computationally, the two non-convex penalties can be treated as follows. Since kxk1 − kxk2 is a difference of
convex functions, stationary points of the resulting nonconvex model are computed by difference of convex (DC)
programming [15]. The model with the ratio penalty $\frac{\|x\|_1}{\|x\|_2}$ can be minimized using a related gradient projection

* Department of Mathematics, University of California, Irvine, CA, 92697, USA. E-mail: [email protected], [email protected],
[email protected].

strategy. In the general dictionary case, the exact $l_1$ recovery of sparse solutions is studied in [5], where a main
result is that if $A x_0 = f$ and $\|x_0\|_0 < \frac{1 + M^{-1}}{2}$, with $M$ an upper bound on the off-diagonal entries of the Gram matrix
$A^T A$, then $x_0$ is the unique solution given by $l_1$ norm minimization. The columns of $A$ are more coherent if $M$ is
larger; in this case, $l_1$ minimization is less effective.
The organization of the paper is as follows. In section II, we begin with examples of the basis pursuit problem
of the form $\min_x R(x)$ subject to $A x = b$, to compare $l_1$ or $l_p$ ($p \in (0, 1)$) minimization with that of the ratio
or difference of l1 and l2 norms, and with the ground truth to understand the properties and limitations of each
metric. These analytical examples help to introduce the coherence issues in finding sparse solutions. We leave as a
future work to investigate similar phenomena in physical data sets. In section III, we show that minimizers of the
ratio or difference of l1 and l2 norms must be locally the sparsest feasible solution. We then formulate a uniformity
condition on a particular subset FL of the feasible solutions and prove the exact recovery of the sparsest solution
x0 of A x = b by minimizing the ratio of l1 and l2 norms. The uniformity condition essentially says that the ratio
of the minimum and maximum of the nonzero entries of any solution x from FL is bounded from below by a
constant that depends on kx0 k0 and kxk0 . Interestingly, a similar condition appears in [7] for the band-excluded
orthogonal matching pursuit method to recover the support of the solution up to the coherence band. The ratio of
the maximum and minimum over the support of a vector is referred to as dynamic range in optical imaging [7]. For
the difference of l1 and l2 norms, the exact recovery condition is that the minimum of the nonzero entries of any
solution $x \neq x_0$ from $F_L$ be above $\frac{2(\sqrt{\|x_0\|_0} - 1)}{\|x\|_0 - 1}\,\|x_0\|_2$. Our theoretical results shed light on the sparsity promoting
capability of the ratio and difference penalties. In section IV, we show numerical examples optimizing (1.1) with A
being a coherent dictionary such that the ratio and difference of l1 and l2 norms regularization outperform NNLS.
More comparisons with l1 minimization for imaging data can be found in [6]. Concluding remarks are in section
V.

II. EXAMPLES OF BASIS PURSUIT IN COHERENT DICTIONARIES


The examples below will show a couple of situations where l1 or lp (p ∈ (0, 1)) minimization ceases to be
effective. Though these examples are mathematical in nature, they help to illustrate the coherence induced issues in
finding sparse solutions. Imaging examples can be found in [7], [6] where the sparse solutions are more complicated
and not in closed analytical form.
Example 1: Let $p \in (0, 1]$ and let $b^1, b^2 \in \mathbb{R}^n$ ($n \ge 2$) be two distinct dense vectors such that $b = b^1 + b^2$ is also dense; let $\frac{\|b^i\|_1}{\|b^i\|_2}$ be close to their upper bound $O(\sqrt{n})$, $i = 1, 2$. Set $a = \|(b^1, b^2)\|_p$ and $A = [b^1, b^2, a I_n, a I_n]$, where $I_n$ is the $n \times n$ identity matrix. Consider the linear system $A x = b$, $x \in \mathbb{R}^{2+2n}$, which has a 2-sparse solution
$$x_0 = [1, 1, 0, \cdots, 0]'.$$
The other sparse solutions are $x_1 = [0, 1, \tfrac{(b^1)'}{a}, 0]'$, $x_2 = [1, 0, 0, \tfrac{(b^2)'}{a}]'$, $x_3 = [0, 0, \tfrac{(b^1)'}{a}, \tfrac{(b^2)'}{a}]'$; the first two are at least 3-sparse, the last one is at least 4-sparse. The $l_p$ norms of these solutions are
$$\|x_0\|_p = 2^{1/p},$$
while
$$\|x_1\|_p = \Big(1 + \frac{\|b^1\|_p^p}{a^p}\Big)^{1/p} \in (1, 2^{1/p}), \qquad
\|x_2\|_p = \Big(1 + \frac{\|b^2\|_p^p}{a^p}\Big)^{1/p} \in (1, 2^{1/p}), \qquad
\|x_3\|_p = \frac{\|[(b^1)', (b^2)']'\|_p}{a} = 1.$$
Thus, x0 cannot be recovered by minimizing lp norm subject to A x = b. There are at least three less sparse
solutions with smaller lp norm than x0 .
Now let $p = 1$. The $l_2$ norms of $x_0$ and $x_3$ are
$$\|x_0\|_2 = \sqrt{2}, \qquad \|x_3\|_2 = \frac{\|[(b^1)', (b^2)']'\|_2}{\|[(b^1)', (b^2)']'\|_1}.$$

So the ratios of $l_1$ and $l_2$ norms are
$$\frac{\|x_0\|_1}{\|x_0\|_2} = \sqrt{2}, \qquad
\frac{\|x_1\|_1}{\|x_1\|_2} = \frac{\|[a, (b^1)']'\|_1}{\|[a, (b^1)']'\|_2}, \qquad
\frac{\|x_2\|_1}{\|x_2\|_2} = \frac{\|[a, (b^2)']'\|_1}{\|[a, (b^2)']'\|_2}, \qquad
\frac{\|x_3\|_1}{\|x_3\|_2} = \frac{\|[(b^1)', (b^2)']'\|_1}{\|[(b^1)', (b^2)']'\|_2} \sim \sqrt{n}.$$

We want to have $\frac{\|x_1\|_1}{\|x_1\|_2} > \sqrt{2}$, or
$$\frac{\|[a, (b^1)']'\|_1}{\|[a, (b^1)']'\|_2} = \frac{2\|b^1\|_1 + \|b^2\|_1}{\sqrt{(\|b^1\|_1 + \|b^2\|_1)^2 + \|b^1\|_2^2}} > \sqrt{2},$$
or
$$2\|b^1\|_1^2 > \|b^2\|_1^2 + 2\|b^1\|_2^2.$$
Likewise, $\frac{\|x_2\|_1}{\|x_2\|_2} > \sqrt{2}$ requires
$$2\|b^2\|_1^2 > \|b^1\|_1^2 + 2\|b^2\|_2^2.$$
The above inequalities reduce to
$$\|b^i\|_1 > \sqrt{2}\,\|b^i\|_2, \quad i = 1, 2,$$
if we assume that the first two columns of $A$ satisfy $\|b^1\|_1 = \|b^2\|_1$, $b^1 \neq b^2$. It follows that $x_0$ has the smallest ratio of $l_1$ and $l_2$ norms. So $l_1/l_2$ minimization can recover $x_0$ in this counterexample where $l_1$ minimization could not.
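The comparison in Example 1 is easy to check numerically. The following short script (ours) picks a specific dense pair $b^1, b^2$ with equal $l_1$ norms and $\|b^i\|_1 \gg \|b^i\|_2$ (an illustrative choice, not prescribed by the text) and evaluates the three metrics at $p = 1$:

import numpy as np

rng = np.random.default_rng(0)
n = 100
b1 = rng.random(n) + 0.5                  # dense, positive (our arbitrary choice)
b2 = rng.random(n) + 0.5
b2 *= b1.sum() / b2.sum()                 # enforce ||b1||_1 = ||b2||_1
a = b1.sum() + b2.sum()                   # a = ||(b1, b2)||_1 at p = 1

# Candidate solutions of A x = b, with A = [b1, b2, a I, a I] and b = b1 + b2:
x0 = np.concatenate(([1, 1], np.zeros(2 * n)))      # the 2-sparse solution
x1 = np.concatenate(([0, 1], b1 / a, np.zeros(n)))  # at least 3-sparse
x3 = np.concatenate(([0, 0], b1 / a, b2 / a))       # at least 4-sparse

for name, x in [("x0", x0), ("x1", x1), ("x3", x3)]:
    l1, l2 = np.linalg.norm(x, 1), np.linalg.norm(x, 2)
    print(f"{name}: l1 = {l1:.3f}  l1/l2 = {l1 / l2:.3f}  l1-l2 = {l1 - l2:.3f}")

With this choice, x3 has the smallest l1 norm, x0 has the smallest l1/l2 ratio, and x1 has a smaller l1 - l2 value than x0, matching the discussion of this example.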

Let us look at the difference of $l_1$ and $l_2$ norms at $p = 1$:
$$\|x_0\|_1 - \|x_0\|_2 = 2 - \sqrt{2},$$
$$\|x_3\|_1 - \|x_3\|_2 = 1 - \frac{\|[(b^1)', (b^2)']'\|_2}{\|[(b^1)', (b^2)']'\|_1} = 1 - O(n^{-1/2}) > 2 - \sqrt{2},$$
if $n$ is large enough. However,
$$\|x_1\|_1 - \|x_1\|_2 = 1 + \frac{\|b^1\|_1}{a} - \sqrt{1 + \frac{\|b^1\|_2^2}{a^2}}, \qquad
\|x_2\|_1 - \|x_2\|_2 = 1 + \frac{\|b^2\|_1}{a} - \sqrt{1 + \frac{\|b^2\|_2^2}{a^2}}.$$
If both were above $2 - \sqrt{2}$, so that $x_0$ has the least difference of $l_1$ and $l_2$ norms, we would have, by adding the two expressions,
$$4 - 2\sqrt{2} \le 3 - \sum_{i=1,2} \sqrt{1 + \frac{\|b^i\|_2^2}{\|[(b^1)', (b^2)']'\|_1^2}} \le 3 - \sum_{i=1,2} 1,$$
or
$$4 - 2\sqrt{2} \approx 1.1716 \le 1,$$
which is impossible. Hence minimizing the difference of $l_1$ and $l_2$ norms gives either $x_1$ or $x_2$, the second sparsest solution, but not the sparsest solution $x_0$ in this example. It is still better than minimizing $l_1$, which gives the third sparsest solution $x_3$.

Since the $l_1/l_2$ penalty tends to get larger for denser vectors, it is plausible that $x_0$ is recovered by minimizing $l_1/l_2$ if $n$ is large enough. However, this cannot happen without proper conditions on $b^1, b^2$. We show a counterexample below.

First, we note that
$$\mathrm{Ker}(A) = \mathrm{span}\Big\{ \big[1, 0, -\tfrac{(b^1)'}{a}, 0\big]', \ \big[0, 1, -\tfrac{(b^2)'}{a}, 0\big]', \ \big[0, 0, -c', c'\big]' \Big\}, \quad \forall\, c \in \mathbb{R}^n.$$
Let
$$x_4 = x_0 + \big[1, 0, -\tfrac{(b^1)'}{a}, 0\big]' - \big[0, 1, -\tfrac{(b^2)'}{a}, 0\big]' = \big[2, 0, \tfrac{(b^2 - b^1)'}{a}, 0\big]'.$$
Then
$$\frac{\|x_4\|_1}{\|x_4\|_2} \le \frac{2 + \frac{\|b^2 - b^1\|_1}{a}}{2} < \sqrt{2} = \frac{\|x_0\|_1}{\|x_0\|_2},$$
if
$$\frac{\|b^2 - b^1\|_1}{a} < 2\sqrt{2} - 2 \approx 0.828.$$
Since $\frac{\|b^2 - b^1\|_1}{a} \le \frac{\|b^2\|_1 + \|b^1\|_1}{a} = 1$, this is not a stringent condition. Thus $x_4$ is a less sparse solution than $x_0$ with a smaller $l_1/l_2$ ratio, and minimization of $l_1/l_2$ does not yield $x_0$. On the other hand, $x_4$ contains a large peak (height 2) and many smaller peaks ($\tfrac{(b^2 - b^1)'}{a}$), resembling a perturbation of the 1-sparse solution $[2, 0, \cdots, 0]'$ in the case of $b^1 = b^2$.

In particular, if $b^2$ is a small perturbation of $b^1$, then $\frac{\|b^2 - b^1\|_1}{a} \approx 0$. So $x_4$ is close to the 1-sparse vector $[2, 0, \cdots, 0]'$, with an $l_1/l_2$ ratio slightly above 1, the least possible value of the ratio among all nonzero vectors. We observe here that $\min_{x: A x = b} \frac{\|x\|_1}{\|x\|_2}$ is continuous with respect to perturbations of $A$: the minimizer goes from an exact 1-sparse structure when $b^1 = b^2$ to an approximate 1-sparse structure when $b^1 \approx b^2$. In contrast, the $l_0$ minimizer $x_0$ experiences a jump from $[2, 0, \cdots, 0]'$ to $[1, 1, 0, \cdots, 0]'$. The discrete character of $l_0$ makes it non-trivial to recover the least-$l_0$ solution by minimizing $l_1/l_2$. If we view $b^1$ and $b^2$ as dictionary elements in a group, then minimizing $l_1/l_2$ selects only one of them (intra-sparsity). Similarly, if we view corresponding columns (1st and $(n+1)$-th, 2nd and $(n+2)$-th, etc.) of $[a I_n \ \ a I_n]$ as vectors in a group (of 2 elements), then $x_4$ selects one member out of each group. The examples here show that minimizing $l_1/l_2$ has a tendency to remove redundancies, or to prefer intra-sparsity, in a coherent and over-complete dictionary. The $l_1$ minimization does not do as well in terms of intra-sparsity, using all group elements except for knocking out the $b^1, b^2$ group.

Let us look closer at the solutions in the non-negative orthant. Such vectors are
$$x = \Big[1 + t_1, \ 1 + t_2, \ -\Big(t_1 \frac{b^1}{a} - c\Big)', \ -\Big(t_2 \frac{b^2}{a} + c\Big)'\Big]',$$
satisfying
$$1 + t_1 \ge 0, \quad 1 + t_2 \ge 0, \quad t_1 \frac{b^1}{a} \le c \le -t_2 \frac{b^2}{a},$$
which is valid if
$$b^1 < 0, \quad t_1 \in \Big(0, \frac{2}{3}\Big), \quad b^2 = (1 - \epsilon) b^1, \quad 0 < \epsilon \ll 1, \quad t_2 \approx -t_1. \qquad (2.2)$$
The kernel is an $(n+2)$-dimensional plane which contains a lower dimensional affine subspace parallel to the unit $l_1$ ball if the vectors on the plane
$$v = \Big[t_1, \ t_2, \ -\Big(t_1 \frac{b^1}{a} - c\Big)', \ -\Big(t_2 \frac{b^2}{a} + c\Big)'\Big]'$$
are orthogonal to the all-ones vector $[1, 1, \cdots, 1, 1]' \in \mathbb{R}^{2n+2}$, in other words,
$$\sum_{i=1,2} \Big(1 - \sum_j \frac{b^i_j}{a}\Big)\, t_i = 0, \qquad (2.3)$$
which holds with essentially an $(n+1)$-dimensional free parameter $(t_1, c)$, under the constraints in (2.2). If the minimal $l_1$ ball intersects the kernel at a point $p$, the line at $p$ in the direction $v$ lies on the $l_1$ ball.
Fig. 1. Illustration of the advantage of minimizing $l_1/l_2$ over $l_1$ when data points (on $x_1 + x_2 = 2$ in the first quadrant) lie parallel to the $l_1$ unit ball. Minimization of $l_1/l_2$ is the same as projecting the data points onto the unit $l_2$ ball and then intersecting with the unit $l_1$ ball to select sparse solutions. In contrast, $l_1$ minimization cannot distinguish sparse data points.

Then $l_1$ minimization is not effective: there are infinitely many non-sparse minimizers. An illustration in two dimensions is in Fig. 1, where all points on $x_1 + x_2 = 2$ in the first quadrant are minimizers of the $l_1$ norm. Using the scale invariance of $l_1/l_2$, minimizing $l_1/l_2$ can be viewed as first projecting the data points (feasible vectors) onto the $l_2$ unit ball, then intersecting with the minimal $l_1$ ball, which leads to sparse solutions.

Example 2: Let $p \in (0, 1]$ and let $b \in \mathbb{R}^n$ ($n \ge 2$) be a dense vector, $a = \|(b, b)\|_p = 2^{1/p}\|b\|_p$, $A = [b, b, a I_n, a I_n]$, where $I_n$ is the $n \times n$ identity matrix. The linear system $A x = 2b$, $x \in \mathbb{R}^{2+2n}$, has a 1-sparse solution
$$x_0 = [2, 0, \cdots, 0]'.$$
There is also a 2-sparse solution
$$x_1 = [1, 1, 0, \cdots, 0]'.$$
Some other solutions are $x_2 = [1, 0, \tfrac{b'}{a}, 0]'$ and $x_3 = [0, 0, \tfrac{b'}{a}, \tfrac{b'}{a}]'$. Minimizing $l_1/l_2$ and $l_1 - l_2$ both give the sparsest solution $x_0$, since $\frac{\|x_0\|_1}{\|x_0\|_2}$ and $\|x_0\|_1 - \|x_0\|_2$ attain their lowest possible values, 1 and 0, respectively. However, for $l_p$-norm minimization ($p \in (0, 1]$),
$$\|x_0\|_p = 2, \qquad \|x_1\|_p = 2^{1/p} \ge 2, \qquad
\|x_2\|_p = \Big(1 + \frac{\|b\|_p^p}{a^p}\Big)^{1/p} = \Big(\frac{3}{2}\Big)^{1/p} \ge \frac{3}{2}, \qquad
\|x_3\|_p = \frac{\|[b', b']'\|_p}{a} = 1.$$
For $p = 1$, $x_0$ has the largest $l_1$ norm among these solutions. For $p \in (0, 1]$, $\|x_0\|_p > \|x_3\|_p$. So $l_p$-norm minimization fails to find the sparsest solution.
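A similar quick check (ours, with an arbitrary dense $b$) confirms the claims of Example 2 at $p = 1$: the 1-sparse $x_0$ attains the smallest possible ratio and difference, while the dense $x_3$ has the smallest $l_1$ norm.

import numpy as np

rng = np.random.default_rng(1)
n = 100
b = rng.random(n) + 0.5                   # dense vector (our arbitrary choice)
a = 2.0 * b.sum()                         # a = ||(b, b)||_1 at p = 1

x0 = np.concatenate(([2, 0], np.zeros(2 * n)))   # 1-sparse
x1 = np.concatenate(([1, 1], np.zeros(2 * n)))   # 2-sparse
x3 = np.concatenate(([0, 0], b / a, b / a))      # dense

for name, x in [("x0", x0), ("x1", x1), ("x3", x3)]:
    l1, l2 = np.linalg.norm(x, 1), np.linalg.norm(x, 2)
    print(f"{name}: l1 = {l1:.3f}  l1/l2 = {l1 / l2:.3f}  l1-l2 = {l1 - l2:.3f}")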

Example 3: Let $A = [b^1, b^2, I_n, a I_n]$, with $b^1, b^2$ as in Example 1 (both dense, $b^2 \neq b^1$), and let $b = b^1 + e^1$, where $e^1 = [1, 0, \cdots, 0]'$ and $a = 2^{1/p}\|b^1\|_p$ ($p \in (0, 1]$). The aim is to represent the data $b$ with columns of $A$ so as to have both intra-sparsity and inter-sparsity across the groups.

The 2-sparse solutions with perfect intra- and inter-sparsity (at most 1 in each group and the least number of groups) are
$$x_0^1 = [1, 0, (e^1)', 0]', \qquad x_0^2 = \big[1, 0, 0, \tfrac{(e^1)'}{a}\big]';$$
some much less sparse solutions are (good intra-sparsity, almost no inter-sparsity)
$$x_1 = \big[0, 0, (e^1)', \tfrac{(b^1)'}{a}\big]', \qquad x_2 = \big[0, 0, 0, \tfrac{(e^1 + b^1)'}{a}\big]'.$$
We have
$$\|x_0^1\|_p^p = 2 > 1 + \frac{1}{2} = \|x_1\|_p^p,$$
$$\|x_0^2\|_p^p = 1 + \frac{1}{a^p} > \frac{1}{2} + \frac{1}{a^p} \ge \|x_2\|_p^p,$$
So lp minimization will miss the 2-sparse solutions.
Let $p = 1$. In view of
$$\frac{\|x_0^1\|_1}{\|x_0^1\|_2} = \sqrt{2} \approx 1.414, \qquad
\frac{\|x_0^2\|_1}{\|x_0^2\|_2} = \frac{1 + a^{-1}}{\sqrt{1 + a^{-2}}} \le \sqrt{2}, \qquad
\frac{\|x_1\|_1}{\|x_1\|_2} = \frac{1.5}{\sqrt{1 + \frac{\|b^1\|_2^2}{4\|b^1\|_1^2}}} \approx 1.5^-$$
if $\|b^1\|_1 \gg \|b^1\|_2$ by the assumption, the $l_1/l_2$ ratio of $x_0^1$ or $x_0^2$ can be smaller. If $a$ is large, $l_1/l_2$ minimization prefers $x_0^2$ because it is a small perturbation of a 1-sparse vector $[1, 0, 0, 0]'$. However, minimizing $l_1/l_2$ does not always lead to $x_0^2$ if $a$ is small enough. We show a counterexample below: let $x_3 = \big[0, 0, (b^1)', \tfrac{(e^1)'}{a}\big]'$; then
$$\frac{\|x_3\|_1}{\|x_3\|_2} \le \frac{\frac{a}{2} + a^{-1}}{a^{-1}} < \frac{1 + a^{-1}}{\sqrt{1 + a^{-2}}} = \frac{\|x_0^2\|_1}{\|x_0^2\|_2},$$
if
$$a < 0.908.$$
Notice that $x_3$ has one large peak and many (relatively speaking) smaller peaks, resembling a small perturbation of a 1-sparse vector if $a$ is small enough.

The solutions are of the form
$$x = \big[1, 0, 0, \tfrac{(e^1)'}{a}\big]' + t_1 [1, 0, -(b^1)', 0]' + t_2 [0, 1, -(b^2)', 0]' + [0, 0, a c', c']',$$
with nonnegativity constraints
$$1 + t_1 \ge 0, \quad t_2 \ge 0, \quad -t_1 b^1 - t_2 b^2 + a c \ge 0, \quad \frac{e^1}{a} + c \ge 0. \qquad (2.4)$$
In particular, consider
$$t_2 \ge 0, \quad c \ge 0.$$
At any point $p$ on the plane $A x = b$, we seek a direction
$$v = t_1 [1, 0, -(b^1)', 0]' + t_2 [0, 1, -(b^2)', 0]' + [0, 0, a c', c']',$$
so that $v \cdot [1, \cdots, 1]' = 0$, or
$$t_1 + t_2 + \sum_j \big( -t_1 b^1_j - t_2 b^2_j + (a + 1) c_j \big) = 0,$$
or
$$\Big(1 - \sum_j b^1_j\Big) t_1 + \Big(1 - \sum_j b^2_j\Big) t_2 + (a + 1) \sum_j c_j = 0,$$
which admits nontrivial solutions satisfying (2.4) if
$$c = 0, \quad \sum_j b^1_j = 0, \quad \sum_j b^2_j = 0, \quad t_1 = -t_2, \quad t_2 \in (0, 1), \quad -t_1 b^1 - t_2 b^2 = t_2 (b^1 - b^2) > 0.$$
So the intersection of the $l_1$ minimal ball with the kernel is at least a line segment, rendering the $l_1$ minimizers non-unique and most of the $l_1$ minimizers non-sparse.

In summary, the examples here indicate that minimizing the ratio of $l_1$ and $l_2$ norms is more likely to produce a sparser solution than minimizing $l_p$ ($p \in (0, 1)$) when the column vectors of the matrix $A$ are structured or coherent. The geometric reason is that the $l_1$ unit ball, with its corners and edges, tends to hit the unit sphere on axes or coordinate planes, resulting in sparse solutions, whereas intersecting the $l_1$ unit ball with another high dimensional plane may yield multiple non-sparse minimizers (as shown in Fig. 1). Minimizing the difference of $l_1$ and $l_2$ norms is better than minimizing $l_p$ norms, but appears no better than minimizing the ratio of $l_1$ and $l_2$ norms. Computationally though, the difference has a better analytical structure for algorithm design, as we shall explore later.

III. EXACT RECOVERY THEORY


In this section, we show that it is possible to recover the sparsest solution exactly by minimizing the ratio and
difference of l1 and l2 norms, thereby establishing the origin of their sparsity promoting property.

A. Exact recovery of $l_1/l_2$
Suppose $A \in \mathbb{R}^{m \times n}$ and $x_0 \ge 0 \in \mathbb{R}^n$, where $m < n$. Let $b = A x_0$; we exclude the case $b = 0$ throughout this paper and study the following problems:
$$P_0: \quad \min_{x \ge 0} \|x\|_0 \quad \text{subject to } A x = b$$
$$P_r: \quad \min_{x \ge 0} \frac{\|x\|_1}{\|x\|_2} \quad \text{subject to } A x = b$$
$$P_d: \quad \min_{x \ge 0} \|x\|_1 - \|x\|_2 \quad \text{subject to } A x = b$$
Denote by $F = \{x \in \mathbb{R}^n : A x = b, x \ge 0\}$ the set of feasible solutions, and let $S(x)$ denote the support of $x$.

Definition III.1. $x \in F$ is called locally sparse if there is no $y \in F \setminus \{x\}$ such that $S(y) \subseteq S(x)$. Denote by $F_L = \{x \in F : x \text{ is locally sparse}\}$ the set of locally sparse feasible solutions.

The following lemma says that any locally sparse solution is in essence locally the sparsest solution.

Lemma III.1. $\forall\, x \in F_L$, $\exists\, \delta_x > 0$ such that $\forall\, y \in F$, if $0 < \|y - x\|_2 < \delta_x$, we have $S(x) \subset S(y)$.

Proof: Let $y = x + v$ and choose $\delta_x = \min_{i \in S(x)} \{x_i\}$; then
$$\|v\|_\infty \le \|v\|_2 < \min_{i \in S(x)} \{x_i\}.$$
So
$$y_i \ge x_i - \|v\|_\infty > x_i - \min_{i \in S(x)} \{x_i\} \ge 0, \quad \forall\, i \in S(x),$$
which implies
$$S(x) \subseteq S(y).$$
And $S(x) \neq S(y)$ since $x \in F_L$. Then the claim follows.

The following theorem states that the solutions of $P_r$, $P_d$ and $P_0$ must be locally sparse, thereby being at least locally the sparsest feasible solutions.

Theorem III.1. If $x^*$ solves $P_r$, $P_d$ or $P_0$, then $x^* \in F_L$.

Proof: Suppose $x^*$ solves $P_r$ or $P_d$ and is not locally sparse. Then $\exists\, y^* \in F \setminus \{x^*\}$ such that $S(y^*) \subseteq S(x^*)$. Thus there exists a small enough $\epsilon > 0$ such that $x^* - \epsilon y^* \ge 0$. Let
$$z^* = \frac{x^* - \epsilon y^*}{1 - \epsilon} \ge 0,$$
or equivalently,
$$x^* = \epsilon y^* + (1 - \epsilon) z^*;$$
then $A z^* = b$ and thus $z^* \in F$. By the nonnegativity of $y^*$ and $z^*$,
$$\|x^*\|_1 = \epsilon \|y^*\|_1 + (1 - \epsilon) \|z^*\|_1.$$
Moreover, since $y^* \neq x^*$ and both satisfy $A x = b$, they are linearly independent. So $y^*$ and $z^*$ are also linearly independent, and
$$\|x^*\|_2 < \epsilon \|y^*\|_2 + (1 - \epsilon) \|z^*\|_2.$$
Thus
$$\frac{\|x^*\|_1}{\|x^*\|_2} > \frac{\epsilon \|y^*\|_1 + (1 - \epsilon) \|z^*\|_1}{\epsilon \|y^*\|_2 + (1 - \epsilon) \|z^*\|_2} \ge \min\Big\{ \frac{\|y^*\|_1}{\|y^*\|_2}, \frac{\|z^*\|_1}{\|z^*\|_2} \Big\}$$
and
$$\|x^*\|_1 - \|x^*\|_2 > \epsilon (\|y^*\|_1 - \|y^*\|_2) + (1 - \epsilon)(\|z^*\|_1 - \|z^*\|_2) \ge \min\{ \|y^*\|_1 - \|y^*\|_2, \ \|z^*\|_1 - \|z^*\|_2 \},$$
a contradiction.

Now suppose $x^*$ solves $P_0$ and is not in $F_L$. Then $\exists\, y^* \in F \setminus \{x^*\}$ such that $S(y^*) \subseteq S(x^*)$. Since no nonnegative solution of $A x = b$ is sparser than $x^*$, we have $S(y^*) = S(x^*) = S$. So $\min_{i \in S} \{ \frac{x^*_i}{y^*_i} \} < 1$ or $\min_{i \in S} \{ \frac{y^*_i}{x^*_i} \} < 1$ must be true. Without loss of generality, let $\min_{i \in S} \{ \frac{x^*_i}{y^*_i} \} = \frac{x^*_k}{y^*_k} = r < 1$ for some index $k \in S$. Then
$$z^* = \frac{1}{1-r} x^* - \frac{r}{1-r} y^* \ge 0, \quad \text{since } z^*_i = \frac{x^*_i - r y^*_i}{1-r} \ge 0, \ \forall\, i \in S.$$
Moreover,
$$A z^* = \frac{1}{1-r} A x^* - \frac{r}{1-r} A y^* = \frac{1}{1-r} b - \frac{r}{1-r} b = b,$$
which implies $z^* \in F$. But $S(z^*) \subseteq S$ and $z^*_k = 0$, thus $S(z^*) \subset S$, which contradicts $x^*$ being a solution of $P_0$.

By Theorem III.1, all the minimizers of $P_r$, $P_d$ and $P_0$ are contained in $F_L$. From now on, we therefore need only consider feasible solutions in $F_L$.

For any $x \ge 0 \in \mathbb{R}^n$, let $(S(x), Z(x))$ be the partition of the index set $\{1, 2, \cdots, n\}$ given by $S(x) = \{i : x_i > 0\}$ and $Z(x) = \{i : x_i = 0\}$.

Definition III.2. The uniformity of $x$, $U(x)$, is the ratio between the smallest nonzero entry and the largest one, i.e.,
$$0 < U(x) := \frac{\min_{i \in S(x)} x_i}{\max_{i \in S(x)} x_i} \le 1.$$
Theorem III.2. Suppose $x_0$ uniquely solves $P_0$ and $\|x_0\|_0 = s$. If
$$U(x) > \frac{\sqrt{\|x\|_0} - \sqrt{\|x\|_0 - s}}{\sqrt{\|x\|_0} + \sqrt{\|x\|_0 - s}}, \quad \forall\, x \in F_L \setminus \{x_0\},$$
then $x_0$ also uniquely solves $P_r$. In particular, if every feasible solution $x$ is a binary vector with all entries either 0 or 1, then the above inequality holds, since $U(x) = 1 > \frac{\sqrt{\|x\|_0} - \sqrt{\|x\|_0 - s}}{\sqrt{\|x\|_0} + \sqrt{\|x\|_0 - s}}$. Clearly $P_0$ and $P_r$ are then equivalent, as we note that $\frac{\|x\|_1}{\|x\|_2} = \sqrt{\|x\|_0}$ for binary $x$.

Since $\frac{\|x\|_1}{\|x\|_2}$ is scale-invariant, for any $x \ge 0 \in \mathbb{R}^n$ we may assume without loss of generality that $\max_{i \in S(x)} x_i = 1$ and $0 < \min_{i \in S(x)} x_i = U(x) \le 1$. By the Cauchy-Schwarz inequality, $\frac{\|x\|_1}{\|x\|_2} \le \sqrt{\|x\|_0}$. Starting with the following lemma, we first estimate a lower bound of $\frac{\|x\|_1}{\|x\|_2}$.

Lemma III.2. Let $x = [x_1, \cdots, x_{j-1}, x_j, x_{j+1}, \cdots, x_n]'$, where $U(x) < x_j < 1$. Let $x^- = [x_1, \cdots, x_{j-1}, U(x), x_{j+1}, \cdots, x_n]'$ and $x^+ = [x_1, \cdots, x_{j-1}, 1, x_{j+1}, \cdots, x_n]'$. Then we have
$$\frac{\|x\|_1}{\|x\|_2} > \min\Big\{ \frac{\|x^-\|_1}{\|x^-\|_2}, \frac{\|x^+\|_1}{\|x^+\|_2} \Big\}.$$
Proof: Since $U(x) < x_j < 1$, $\exists\, \lambda \in (0, 1)$ such that $x_j = \lambda U(x) + (1 - \lambda) \cdot 1$ and $x = \lambda x^- + (1 - \lambda) x^+$. Given that $x^-$ and $x^+$ are nonnegative but not linearly dependent, we have
$$\|x\|_1 = \|\lambda x^- + (1 - \lambda) x^+\|_1 = \lambda \|x^-\|_1 + (1 - \lambda) \|x^+\|_1$$
and
$$\|x\|_2 = \|\lambda x^- + (1 - \lambda) x^+\|_2 < \lambda \|x^-\|_2 + (1 - \lambda) \|x^+\|_2.$$
So
$$\frac{\|x\|_1}{\|x\|_2} > \frac{\lambda \|x^-\|_1 + (1 - \lambda) \|x^+\|_1}{\lambda \|x^-\|_2 + (1 - \lambda) \|x^+\|_2} \ge \min\Big\{ \frac{\|x^-\|_1}{\|x^-\|_2}, \frac{\|x^+\|_1}{\|x^+\|_2} \Big\}.$$

By the above lemma, in order for $\frac{\|x\|_1}{\|x\|_2}$ to attain its minimum, every nonzero entry of $x$ should be either $U(x)$ or $1$. Then we have the following lemma:
Lemma III.3. Let $U(x) = U$. Then
$$\frac{2\sqrt{U}}{1+U}\,\sqrt{\|x\|_0} \le \frac{\|x\|_1}{\|x\|_2} \le \sqrt{\|x\|_0}.$$
Proof: To estimate the lower bound, by Lemma III.2 it suffices to assume that the number of entries equal to $1$ in $x$ is $l$ and the number of entries equal to $U$ is $\|x\|_0 - l$. Then
$$g(l) := \frac{\|x\|_1}{\|x\|_2} = \frac{(1 - U) l + U \|x\|_0}{\sqrt{(1 - U^2) l + U^2 \|x\|_0}}.$$
Setting $g'(l) = 0$, we obtain $l = \frac{U}{1+U} \|x\|_0$, and the lower bound follows.
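For completeness, the critical point computation in the proof can be spelled out as follows (our verification, writing $k = \|x\|_0$):
$$g'(l) = 0 \iff 2(1-U)\big[(1-U^2) l + U^2 k\big] = (1-U^2)\big[(1-U) l + U k\big] \iff (1+U)\, l = U k \quad (\text{for } U < 1),$$
i.e., $l = \frac{U}{1+U} k$. At this value, $(1-U^2) l + U^2 k = U k$ and $(1-U) l + U k = \frac{2 U k}{1+U}$, so
$$g\Big(\frac{U k}{1+U}\Big) = \frac{2 U k / (1+U)}{\sqrt{U k}} = \frac{2\sqrt{U}}{1+U}\,\sqrt{k},$$
while $g(0) = g(k) = \sqrt{k}$; since $2\sqrt{U} \le 1 + U$, the interior critical point indeed gives the minimum.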
We now prove Theorem III.2.

Proof: Suppose $x_0$ is the unique solution of $P_0$ with sparsity $s$. First of all, by Theorem III.1, the minimizer of $P_r$ must be in $F_L$. If any other solution $x \in F_L$ satisfies $U(x) > \frac{\sqrt{\|x\|_0} - \sqrt{\|x\|_0 - s}}{\sqrt{\|x\|_0} + \sqrt{\|x\|_0 - s}}$, then solving this inequality for $s$ gives
$$\sqrt{s} < \frac{2\sqrt{U(x)}}{1 + U(x)}\,\sqrt{\|x\|_0}.$$
By Lemma III.3,
$$\frac{\|x_0\|_1}{\|x_0\|_2} \le \sqrt{s} < \frac{2\sqrt{U(x)}}{1 + U(x)}\,\sqrt{\|x\|_0} \le \frac{\|x\|_1}{\|x\|_2}.$$
Hence solving $P_r$ yields the sparsest solution $x_0$.

B. Exact recovery of l1 - l2
In this subsection, we show similar exact recovery results for the difference of l1 and l2 norms.
Lemma III.4. Suppose $x \ge 0 \in \mathbb{R}^n$. Then
$$\frac{\|x\|_0 - 1}{2} \min_{i \in S(x)} \{x_i\} \le \|x\|_1 - \|x\|_2 \le (\sqrt{\|x\|_0} - 1)\,\|x\|_2.$$
Proof: It suffices to show the lower bound, given that the upper bound follows directly from the Cauchy-Schwarz inequality. We have
$$\|x\|_1 - \|x\|_2 = \frac{\|x\|_1^2 - \|x\|_2^2}{\|x\|_1 + \|x\|_2}
= \frac{\sum_{i \neq j} x_i x_j}{\|x\|_1 + \|x\|_2}
= \frac{\sum_{i \neq j \in S(x)} x_i x_j}{\|x\|_1 + \|x\|_2}
\ge \frac{(\|x\|_0 - 1)\,\|x\|_1 \min_{i \in S(x)} \{x_i\}}{\|x\|_1 + \|x\|_2}
= \frac{\|x\|_0 - 1}{1 + \frac{\|x\|_2}{\|x\|_1}} \min_{i \in S(x)} \{x_i\}
\ge \frac{\|x\|_0 - 1}{2} \min_{i \in S(x)} \{x_i\}.$$


Theorem III.3. If $x_0$ uniquely solves $P_0$ with sparsity $s$, and if $\min_{i \in S(x)} \{x_i\} > \frac{2(\sqrt{s} - 1)}{\|x\|_0 - 1}\,\|x_0\|_2$, $\forall\, x \in F_L \setminus \{x_0\}$, then $x_0$ also uniquely solves $P_d$.

Proof: Suppose $\min_{i \in S(x)} \{x_i\} > \frac{2(\sqrt{s} - 1)}{\|x\|_0 - 1}\,\|x_0\|_2$, $\forall\, x \in F_L \setminus \{x_0\}$. Then by Lemma III.4,
$$\|x_0\|_1 - \|x_0\|_2 \le (\sqrt{s} - 1)\,\|x_0\|_2 < \frac{\|x\|_0 - 1}{2} \min_{i \in S(x)} \{x_i\} \le \|x\|_1 - \|x\|_2, \quad \forall\, x \in F_L \setminus \{x_0\}.$$
Moreover, the solution of $P_d$ is contained in $F_L$ by Theorem III.1. Hence $x_0$ is the unique solution of $P_d$.

IV. NUMERICAL APPROACH


In this section, we consider the numerical aspects of minimizing the $l_1/l_2$ and $l_1 - l_2$ penalties for finding sparse solutions. The setting where this is most effective computationally is the unconstrained optimization model
$$\min_{x \in X} F(x) := \frac{1}{2}\,\|A x - b\|_2^2 + R(x), \qquad (4.5)$$
where $R(x) = \gamma \frac{\|x\|_1}{\|x\|_2}$ or $R(x) = \gamma(\|x\|_1 - \|x\|_2)$, and $X = \{x \in \mathbb{R}^N : x \ge 0, \ \sum_i x_i \ge r > 0\}$. Due to the nonnegativity constraint, $R(x)$ in the ratio case simplifies to $\gamma \frac{\langle \mathbf{1}, x \rangle}{\|x\|_2}$, where $\mathbf{1}$ denotes the constant vector in $\mathbb{R}^N$ consisting of all ones. The model (4.5) allows some measurement error in representing $b$ in terms of the coherent dictionary, and helps to regularize the ill-conditioning of $A$.
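For later use in the gradient projection iterations, note that on $X$ (where $x \neq 0$) the ratio penalty is smooth; a direct computation (stated here for convenience, not displayed in the original) gives
$$\nabla R(x) = \gamma\left( \frac{\mathbf{1}}{\|x\|_2} - \frac{\langle \mathbf{1}, x \rangle}{\|x\|_2^3}\, x \right) \quad \text{for } R(x) = \gamma \frac{\langle \mathbf{1}, x \rangle}{\|x\|_2}, \qquad
\nabla F(x) = A^T (A x - b) + \nabla R(x).$$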
Under the nonnegative constraints, it is reasonable to assume that F (x) is coercive on X in the sense that for any
$x^0 \in X$ the set $\{x \in X : F(x) \le F(x^0)\}$ is bounded. This is true if there are no nonzero nonnegative vectors in $\ker(A)$, which follows for example if $A$ has only nonnegative elements and no columns that are identically zero. Let us consider the more challenging ratio penalty first. Since $R$ is differentiable on $X$, it is natural to use a gradient projection approach to solve (4.5). We will use the scaled gradient projection method proposed for a similar class of problems in [6]. The approach is based on the estimate
$$F(y) - F(x) \le \frac{1}{2}(y - x)^T \big( (\lambda_R - \lambda_r) I - C \big)(y - x) + (y - x)^T \Big( \frac{1}{2} A^T A + C \Big)(y - x) + (y - x)^T \nabla F(x),$$
where $\lambda_r$ and $\lambda_R$ are lower and upper bounds, respectively, on the eigenvalues of $\nabla^2 R(x)$ for $x \in X$, and $C$ is any matrix. This leads naturally to the strategy of iterating
$$x^{n+1} = \arg\min_{x \in X} \ (x - x^n)^T \Big( \frac{1}{2} A^T A + c_n I \Big)(x - x^n) + (x - x^n)^T \nabla F(x^n). \qquad (4.6)$$
To ensure convergence and a monotonically decreasing objective F (xn ), it suffices to choose cn > 0 such that
there is a sufficient decrease in F according to
$$F(x^{n+1}) - F(x^n) \le \sigma \Big[ (x^{n+1} - x^n)^T \Big( \frac{1}{2} A^T A + c_n I \Big)(x^{n+1} - x^n) + (x^{n+1} - x^n)^T \nabla F(x^n) \Big] \qquad (4.7)$$
for some $\sigma \in (0, 1]$. To improve the method's overall efficiency, $c_n$ can be adjusted at every iteration to prefer smaller values while still ensuring a sufficient decrease in $F$. The complete algorithm from [6] is shown below for the reader's convenience.
Algorithm 1: A Scaled Gradient Projection Method for Solving (4.5) with $R(x) = \gamma \frac{\|x\|_1}{\|x\|_2}$.

Define $x^0 \in X$, $c_0 > 0$, $\sigma \in (0, 1]$, $\epsilon_1 > 0$, $\rho > 0$, $\xi_1 > 1$, $\xi_2 > 1$ and set $n = 0$.
while $n = 0$ or $\|x^n - x^{n-1}\|_\infty > \epsilon_1$
    $y = \arg\min_{x \in X} \ (x - x^n)^T (\frac{1}{2} A^T A + c_n I)(x - x^n) + (x - x^n)^T \nabla F(x^n)$
    if $F(y) - F(x^n) > \sigma \big[ (y - x^n)^T (\frac{1}{2} A^T A + c_n I)(y - x^n) + (y - x^n)^T \nabla F(x^n) \big]$
        $c_n = \xi_2\, c_n$
    else
        $x^{n+1} = y$
        $c_{n+1} = \begin{cases} c_n / \xi_1 & \text{if the smallest eigenvalue of } \frac{c_n}{\xi_1} I + \frac{1}{2} A^T A \text{ is greater than } \rho \\ c_n & \text{otherwise} \end{cases}$
        $n = n + 1$
    end if
end while
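As a concrete illustration of the outer loop, here is a compact Python sketch (ours, not the authors' code) of Algorithm 1. For brevity, the inner constrained quadratic subproblem over $X = \{x \ge 0, \ \sum_i x_i \ge r\}$ is handed to SciPy's SLSQP solver rather than to the ADMM of Algorithm 2; default parameter values loosely follow those reported later in this section.

import numpy as np
from scipy.optimize import minimize, LinearConstraint

def grad_F(x, A, b, gamma):
    # Gradient of F(x) = 0.5*||Ax - b||^2 + gamma*<1, x>/||x||_2 (valid on X, x != 0).
    nx = np.linalg.norm(x)
    return A.T @ (A @ x - b) + gamma * (np.ones_like(x) / nx - (np.sum(x) / nx**3) * x)

def F_ratio(x, A, b, gamma):
    return 0.5 * np.sum((A @ x - b) ** 2) + gamma * np.sum(x) / np.linalg.norm(x)

def sgp_ratio(A, b, gamma=0.1, r=0.05, c=1e-9, sigma=0.01,
              xi1=2.0, xi2=10.0, rho=1e-12, eps1=1e-8, max_iter=200):
    # Sketch of Algorithm 1 (scaled gradient projection) for R(x) = gamma*||x||_1/||x||_2.
    m, N = A.shape
    Q = 0.5 * A.T @ A
    bounds = [(0.0, None)] * N                       # x >= 0
    cons = LinearConstraint(np.ones((1, N)), lb=r)   # sum(x) >= r
    x = np.full(N, max(r, 1.0) / N)                  # feasible constant starting point
    for n in range(max_iter):
        g = grad_F(x, A, b, gamma)
        while True:
            H = Q + c * np.eye(N)
            surrogate = lambda z: (z - x) @ H @ (z - x) + (z - x) @ g
            y = minimize(surrogate, x, method="SLSQP",
                         bounds=bounds, constraints=[cons]).x
            Fy, Fx = F_ratio(y, A, b, gamma), F_ratio(x, A, b, gamma)
            if Fy - Fx <= sigma * surrogate(y) + 1e-12:  # sufficient decrease (4.7)
                break
            c *= xi2                                     # not enough decrease: increase c
        if np.max(np.abs(y - x)) <= eps1:
            return y
        x = y
        if np.linalg.eigvalsh(Q + (c / xi1) * np.eye(N)).min() > rho:
            c /= xi1                                     # relax c when it is safe to do so
    return x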

Any limit point x∗ of the sequence of iterates {xn } satisfies (y − x∗ )T ∇F (x∗ ) ≥ 0 for all y ∈ X and is therefore
a stationary point of (4.5) [6]. Note that every iteration requires solving the convex problem
$$\min_{x \in X} \ (x - x^n)^T \Big( \frac{1}{2} A^T A + c_n I \Big)(x - x^n) + (x - x^n)^T \nabla F(x^n).$$
As in [6] we can solve this using the Alternating Direction Method of Multipliers (ADMM) [8], [9]. The explicit
iterations are described in the following algorithm.

Algorithm 2: ADMM for solving the convex subproblem.

Define $\delta > 0$, $\epsilon_2 > 0$, $v^0$ and $p^0$ arbitrarily and let $k = 0$.
while $k = 0$ or $\frac{\|v^k - v^{k-1}\|}{\|v^k - x^n\|} > \epsilon_2$ or $\frac{\|v^k - u^k\|}{\|v^k - x^n\|} > \epsilon_2$
    $u^{k+1} = x^n + (A^T A + (2 c_n + \delta) I)^{-1} \big[ \delta (v^k - x^n) - p^k - \nabla F(x^n) \big]$
    $v^{k+1} = \Pi_X \big( u^{k+1} + \frac{p^k}{\delta} \big)$
    $p^{k+1} = p^k + \delta (u^{k+1} - v^{k+1})$
    $k = k + 1$
end while
$x^{n+1} = v^k$.

Here, ΠX denotes the orthogonal projection onto X . Note that if AAT is much smaller in size than AT A we
can use the Woodbury identity to rewrite the inverse that appears in Algorithm 2.
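The inner ADMM solver and the projection $\Pi_X$ can be sketched in Python as follows (ours; the sort-based simplex projection used for $\Pi_X$ is a standard construction that the paper does not spell out, and the Woodbury option is only noted in a comment).

import numpy as np

def project_simplex_sum(u, r):
    # Euclidean projection of u onto {x >= 0 : sum(x) = r} (standard sort-based method).
    s = np.sort(u)[::-1]
    css = np.cumsum(s) - r
    idx = np.arange(1, u.size + 1)
    k = np.nonzero(s - css / idx > 0)[0][-1]
    return np.maximum(u - css[k] / (k + 1.0), 0.0)

def project_X(u, r):
    # Orthogonal projection onto X = {x >= 0 : sum(x) >= r}.
    y = np.maximum(u, 0.0)
    return y if y.sum() >= r else project_simplex_sum(u, r)

def admm_subproblem(A, xn, gradFn, cn, r, delta=1.0, eps2=1e-4, max_iter=5000):
    # Sketch of Algorithm 2: ADMM for
    #   min_{x in X} (x - xn)^T (0.5 A^T A + cn I)(x - xn) + (x - xn)^T gradFn.
    N = A.shape[1]
    M = A.T @ A + (2.0 * cn + delta) * np.eye(N)
    Minv = np.linalg.inv(M)     # invert once per outer iteration (Woodbury if m << N)
    v, p = xn.copy(), np.zeros(N)
    for k in range(max_iter):
        u = xn + Minv @ (delta * (v - xn) - p - gradFn)
        v_new = project_X(u + p / delta, r)
        p = p + delta * (u - v_new)
        denom = np.linalg.norm(v_new - xn) + 1e-15
        if (np.linalg.norm(v_new - v) <= eps2 * denom and
                np.linalg.norm(v_new - u) <= eps2 * denom):
            return v_new
        v = v_new
    return v

In the Algorithm 1 sketch above, the SLSQP call can be replaced by admm_subproblem(A, x, g, c, r) to match the paper's setup more closely.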
As numerical experiments we apply these algorithms to Examples 2 and 3, with both of the A matrices defined
using values of n = 100, p = 0.95 and b a vector of n random numbers uniformly distributed on [0, 1]. The
coherence and ill-conditioning of these matrices make these examples numerically challenging. Non-negative least
squares, often a good method for finding sparse nonnegative solutions when they exist [1], fails to find sparse
solutions for these examples, as shown in Figure 2. Solving the $l_1/l_2$ model (4.5), on the other hand, while it does not
identify the sparsest solutions, does find solutions with much better sparsity properties. The results for Examples 2
and 3 are shown in Figure 3. The model parameters used were $\gamma = 0.1$ and $r = 0.05$. For the algorithm parameters, $\delta = 1$, $c_0 = 10^{-9}$, $\xi_1 = 2$, $\xi_2 = 10$ and $\sigma = 0.01$. The most important of the algorithm parameters is $\delta$, which affects the efficiency of ADMM on the convex subproblem. The tolerances for the stopping conditions were set to $\epsilon_1 = 10^{-8}$ and $\epsilon_2 = 10^{-4}$.
For Example 2, Algorithm 1 recovered the 2-sparse solution $[1, 1, 0, \cdots, 0]'$. For Example 3 it approximately recovered $[0, 0, (e^1)', \tfrac{(b^1)'}{a}]'$, which has the property that one coefficient is much larger than all the others.
These results are initialization dependent. Here we initialized x0 to be a constant vector, which is partly to blame
for finding a 2-sparse solution to Example 2 that is a stationary point but not a local minimum. Instead, consider
initializing x0 to be a small perturbation of a constant vector, for instance x0i = r(100 + 0.01ηi ) with ηi sampled
from a normal distribution with mean zero and standard deviation 1. With such an initialization, we are far more
likely to find one of the 1-sparse solutions [2, 0, 0, · · · , 0]′ or [0, 2, 0, · · · , 0]′ .
Another important numerical consideration is the parameter r that acts as a lower bound on the l1 norms of the
possible solutions. Because of the way the matrices A are scaled for Examples 2 and 3, the sparsest solutions also
have larger l1 norms. In this case, larger values of r promote sparsity. Choosing r = 0.05 is still much less than
the norms of the NNLS solutions, so it is not the case here that those potential solutions were eliminated by the
choice of constraint set.
Minimizing the difference of l1 and l2 norms is easier than minimizing the ratio because the objective becomes a
difference of convex functions. In particular, we can set $c_n = c$ for any $c > 0$ in the iteration (4.6) and be guaranteed
to satisfy the sufficient decrease inequality (4.7) with $\sigma = 1$. Moreover, since the difference penalty is better behaved
at the origin, we could consider simplifying the constraint set X and letting it be the entire nonnegative orthant.
However, we choose to leave X as previously defined since it may be advantageous to disallow solutions whose
l1 norms are below some threshold r . Using a constant c, Algorithm 1 can be simplified to the following.

Algorithm 3: SGP Method for Solving (4.5) with $R(x) = \gamma(\|x\|_1 - \|x\|_2)$.

Define $x^0 \in X$, $c > 0$, $\epsilon_1 > 0$ and set $n = 0$.
while $n = 0$ or $\|x^n - x^{n-1}\|_\infty > \epsilon_1$
    $x^{n+1} = \arg\min_{x \in X} \ (x - x^n)^T (\frac{1}{2} A^T A + c I)(x - x^n) + (x - x^n)^T \nabla F(x^n)$
    $n = n + 1$
end while
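For the difference penalty the same machinery simplifies, since $\nabla R(x) = \gamma(\mathbf{1} - x/\|x\|_2)$ on $X$ and $c$ stays fixed. A short sketch (ours) follows; it assumes the admm_subproblem function from the sketch after Algorithm 2 is in scope.

import numpy as np

def grad_F_diff(x, A, b, gamma):
    # Gradient of F(x) = 0.5*||Ax - b||^2 + gamma*(<1, x> - ||x||_2) on X (x != 0).
    return A.T @ (A @ x - b) + gamma * (np.ones_like(x) - x / np.linalg.norm(x))

def sgp_difference(A, b, gamma=0.1, r=0.5, c=1e-9, delta=1.0,
                   eps1=1e-8, eps2=1e-4, max_iter=500):
    # Sketch of Algorithm 3: constant-c scaled gradient projection for the l1 - l2 penalty.
    # Requires admm_subproblem (from the Algorithm 2 sketch) for the inner solve.
    N = A.shape[1]
    x = np.full(N, max(r, 1.0) / N)          # feasible constant starting point
    for n in range(max_iter):
        y = admm_subproblem(A, x, grad_F_diff(x, A, b, gamma), c, r,
                            delta=delta, eps2=eps2)
        if np.max(np.abs(y - x)) <= eps1:
            return y
        x = y
    return x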
Fig. 2. Estimated $x$ using non-negative least squares (NNLS), for Examples 2 (left) and 3 (right).

Fig. 3. Estimated $x$ using Algorithm 1 ($l_1/l_2$ minimization) on (4.5), for Examples 2 (left) and 3 (right).

Algorithm 2 can again be used to solve the convex subproblem in Algorithm 3.


We repeat the experiments on Examples 2 and 3 using Algorithm 3 to numerically compare how well the $l_1 - l_2$ penalty is able to promote sparsity. We first attempt to use the same parameters as before, setting $\gamma = 0.1$, $r = 0.05$, $\delta = 1$ and $c = 10^{-9}$. We again set the tolerances for the stopping conditions to be $\epsilon_1 = 10^{-8}$ for the outer iterations and $\epsilon_2 = 10^{-4}$ for the inner iterations. Unfortunately, with these parameters $l_1 - l_2$ minimization does not yield
Using large values of γ does yield sparse vectors, but they are highly sensitive to the initialization and are often
not close to the correct sparse solutions. On the other hand, if we keep all the parameters the same but increase r
to r = 0.5, then we are able to get the sparse solutions shown in Figure 4, which are similar to those generated by
Algorithm 1. For Example 2, the l1 norm of the NNLS solution is approximately 0.76, so it is still in our constraint
set. For Example 3, however, the l1 norm of the NNLS solution is approximately 0.39, which falls outside our
constraint set when we set r = 0.5. So the sparse result for Example 3 shown in Figure 4 is special to this problem
and probably has more to do with the constraint set than it does with l1 − l2 minimization. But for Example 2,
l1 − l2 minimization did help find a good sparse solution.

V. DISCUSSION AND CONCLUSION


We studied properties of the ratio and difference of l1 and l2 norms in finding sparse solutions from a represen-
tation with coherent and redundant dictionaries. We presented an exact recovery theory and showed both analytical
Fig. 4. Estimated $x$ using Algorithm 3 ($l_1 - l_2$ minimization) on (4.5), for Examples 2 (left) and 3 (right).

and numerical examples. In future work, we plan to investigate further the mathematical theory and computational
performance of the related algorithms based on these sparsity promoting measures, and to apply them to data in
applications. A work along this line is [6].

ACKNOWLEDGMENTS
The work was partially supported by NSF grants DMS-0911277, DMS-0928427, and DMS-1222507. The authors
would like to thank Professors Russell Caflisch, Ingrid Daubechies, Tom Hou, and Stanley Osher for their interest,
and the opportunity to present some of the results here at the Adaptive Data Analysis and Sparsity Workshop at
the Institute for Pure and Applied Mathematics at UCLA, Jan. 31, 2013.

REFERENCES
[1] A. Bruckstein, M. Elad, and M. Zibulevsky, On the uniqueness of nonnegative sparse solutions to underdetermined systems of equations,
IEEE Transactions on Information Theory, vol. 54, no. 11, pp. 4813–4820, nov. 2008.
[2] E. Candès, J. Romberg, and T. Tao, Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,
IEEE Trans. Info. Theory, 52(2), 489–509, Feb. 2006.
[3] E. Candès and T. Tao, Near-optimal signal recovery from random projections: Universal encoding strategies?, IEEE Trans. Info. Theory,
52(12), 5406–5425, Dec. 2006.
[4] D. Donoho, Compressed sensing, IEEE Trans. Info. Theory, 52(4), 1289-1306, April, 2006.
[5] D. Donoho, M. Elad, Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization, Proc. Nat. Acad.
Scien. USA, vol. 100, pp. 2197–2202, Mar. 2003.
[6] E. Esser, Y. Lou and J. Xin, A Method for Finding Structured Sparse Solutions to Non-negative Least Squares Problems with Applications,
SIAM J. Imaging Sciences, 6(4), pp. 2010–2046, 2013.
[7] A. Fannjiang and W. Liao, Coherence Pattern-Guided Compressive Sensing with Unresolved Grids, SIAM J. Imaging Sciences, 5(1),
pp. 179–202, 2012.
[8] D. Gabay and B. Mercier, A dual algorithm for the solution of nonlinear variational problems via finite-element approximations, Comp.
Math. Appl., vol. 2, pp. 17–40, 1976.
[9] R. Glowinski and A. Marrocco, Sur l'approximation par éléments finis d'ordre un, et la résolution par pénalisation-dualité d'une classe
de problèmes de Dirichlet non linéaires, Rev. Française d'Aut. Inf. Rech. Opér., vol. R-2, pp. 41–76, 1975.
[10] W. Hartmann, “Signals, Sound, and Sensation”, AIP Press, 1998, Chapter 10 (Auditory Filters).
[11] P. Hoyer, Non-negative matrix factorization with sparseness constraints, Journal of Machine Learning Research, vol. 5, no. 12, pp.
1457–1469, 2004.
[12] D. Krishnan, T. Tay, and R. Fergus, Blind deconvolution using a normalized sparsity measure, Proc. of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2011.
[13] H. Ji, J. Li, Z. Shen, and K. Wang, Image deconvolution using a characterization of sharp images in wavelet domain, Applied and
Computational Harmonic Analysis, vol. 32, no. 2, pp. 295–304, 2012.
[14] B. Olshausen and D. Field, Sparse coding with an overcomplete basis set: A strategy employed by V1?, Vision Research, 37, pp.
3311–3325, 1997.
[15] P.D. Tao and L.T.H. An, Convex analysis approach to d.c. programming: Theory, algorithms and applications, Acta Mathematica
Vietnamica, vol. 22, no. 1, pp. 289–355, 1997.
[16] J. Xin, Y-Y. Qi, “Mathematical Modeling and Signal Processing in Speech and Hearing Sciences”, MS&A, Vol. 10, Springer, 2014,
Chapter 4 (Speech Recognition).
