
Multiplicative Updates for NMF with β-Divergences under Disjoint Equality Constraints


Valentin Leplat∗   Nicolas Gillis∗   Jérôme Idier†

arXiv:2010.16223v2 [cs.LG] 28 Apr 2022

April 29, 2022

Abstract
Nonnegative matrix factorization (NMF) is the problem of approximating an input nonnegative
matrix, V , as the product of two smaller nonnegative matrices, W and H. In this paper, we intro-
duce a general framework to design multiplicative updates (MU) for NMF based on β-divergences
(β-NMF) with disjoint equality constraints, and with penalty terms in the objective function. By
disjoint, we mean that each variable appears in at most one equality constraint. Our MU satisfy
the set of constraints after each update of the variables during the optimization process, while
guaranteeing that the objective function decreases monotonically. We showcase this framework on
three NMF models, and show that it competes favorably with the state of the art: (1) β-NMF with
sum-to-one constraints on the columns of H, (2) minimum-volume β-NMF with sum-to-one constraints
on the columns of W, and (3) sparse β-NMF with ℓ2-norm constraints on the columns of W.

Keywords: nonnegative matrix factorization (NMF), β-divergences, disjoint constraints, simplex-structured NMF, minimum-volume NMF, sparsity

1 Introduction
Given a nonnegative matrix $V \in \mathbb{R}^{F \times N}_+$ and a factorization rank $K \ll \min(F,N)$, nonnegative matrix
factorization (NMF) aims to compute two nonnegative matrices, $W$ with $K$ columns and $H$ with $K$
rows, such that $V \approx WH$ [24]. Over the last two decades, NMF has proven to be a powerful tool for the
analysis of high-dimensional data. The main reason is that NMF automatically extracts sparse and
meaningful features from a set of nonnegative data vectors. NMF has been successfully used in many
applications such as image processing, text mining, hyperspectral imaging, blind source separation,
single-channel audio source separation, clustering and music analysis; see [15, 7, 5, 29, 13, 16] and the
references therein.
To compute $W$ and $H$, the most standard approach is to solve the following optimization problem
\[
\min_{W \in \mathbb{R}^{F \times K},\, H \in \mathbb{R}^{K \times N}} D(V|WH) \quad \text{such that } H \ge 0 \text{ and } W \ge 0, \tag{1}
\]

∗Department of Mathematics and Operational Research, Faculté Polytechnique, Université de Mons, Rue de Houdain
9, 7000 Mons, Belgium. The authors acknowledge the support of the Fonds de la Recherche Scientifique - FNRS and the
Fonds Wetenschappelijk Onderzoek - Vlaanderen (FWO) under EOS Project no O005318F-RG47, and of the European
Research Council (ERC Starting Grant no 679515, ERC Consolidator Grant no 681839). E-mails: {valentin.leplat,
nicolas.gillis}@umons.ac.be.

Laboratoire des Sciences du Numérique de Nantes (LS2N, CNRS UMR 6004), Ecole Centrale de Nantes, 44321
Nantes, France. E-mail: [email protected]

where $D(V|WH) = \sum_{f,n} d(V_{fn} \,|\, [WH]_{fn})$, with $d(x|y)$ a measure of distance between two scalars, and
$A \ge 0$ means that the matrix $A$ is component-wise nonnegative. In this paper, we focus on β-NMF,
for which the measure of fit is the β-divergence, denoted $d_\beta(x|y)$ and defined as
\[
d_\beta(x|y) =
\begin{cases}
\frac{1}{\beta(\beta-1)}\left( x^\beta + (\beta-1)\, y^\beta - \beta x y^{\beta-1} \right) & \text{for } \beta \in \mathbb{R}\setminus\{0,1\}, \\
x \log\frac{x}{y} - x + y & \text{for } \beta = 1, \\
\frac{x}{y} - \log\frac{x}{y} - 1 & \text{for } \beta = 0.
\end{cases}
\]

For β = 2, $d_2(x|y) = \frac{1}{2}(x-y)^2$, so $D(V|WH)$ is the halved standard squared Euclidean distance
between $V$ and $WH$, that is, the halved squared Frobenius norm $\frac{1}{2}\|V - WH\|_F^2$. For β = 1 and
β = 0, the β-divergence corresponds to the Kullback-Leibler (KL) divergence and the Itakura-Saito
(IS) divergence, respectively. The error measure should be chosen according to the distribution
of the noise assumed on the data. The Frobenius norm assumes i.i.d. Gaussian noise, KL divergence
assumes a Poisson distribution, and the IS divergence assumes multiplicative gamma noise; see for
example [11, 8, 20] and the references therein. In the NMF literature, β-divergences are the most
widely used objective functions.
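For readers who prefer code, the β-divergence above can be written as a short NumPy function; this is our own illustrative sketch (the function name and the handling of the 0·log 0 convention are ours, not from the paper).

```python
import numpy as np

def beta_divergence(x, y, beta):
    """Sum of element-wise beta-divergences d_beta(x|y) over all entries.

    x, y : nonnegative arrays of the same shape (y > 0 assumed; x > 0 for beta <= 0).
    beta : scalar; beta = 1 gives KL, beta = 0 gives IS, beta = 2 gives half
           the squared Euclidean distance.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if beta == 1:
        # x log(x/y) - x + y, with the convention 0 log 0 = 0
        xlog = np.where(x > 0, x * np.log(np.where(x > 0, x, 1.0) / y), 0.0)
        return np.sum(xlog - x + y)
    if beta == 0:
        return np.sum(x / y - np.log(x / y) - 1.0)
    return np.sum((x**beta + (beta - 1.0) * y**beta
                   - beta * x * y**(beta - 1.0)) / (beta * (beta - 1.0)))
```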
Most NMF algorithms developed to tackle (1) are based on iterative schemes that alternately
update the factors W and H. At each iteration, the minimization over one factor, W or H, is
performed with various optimization methods. For β-divergences, the most popular approach is to
use multiplicative updates (MU), which were introduced for NMF in the seminal papers of Lee and
Seung [24, 23]. In all applications we are aware of, β is always chosen smaller than two. The reason
is that, for β > 2, β-divergences become more and more sensitive to outliers. Already for β = 2, it
is well known that the squared Frobenius norm is sensitive to outliers. However, the case β = 2 is
particular because the subproblems in W and H are nonnegative least squares problems, that is, convex
quadratic problems with a Lipschitz continuous gradient. Therefore, highly efficient schemes exist when
β = 2 that outperform the MU; for example, exact block coordinate descent methods [6, 17, 22, 2] or
fast gradient methods [18, 19]. In this paper, we focus on the case β < 2.
In many applications, on top of the nonnegativity constraints on the variables, additional constraints
are needed to obtain a meaningful solution. An instrumental example is the constraint that the entries
in each column of H sum to one; this is the so-called sum-to-one constraint that is crucial in blind
hyperspectral unmixing; see Section 3. Another example is a sum-to-one constraint on the columns
of W along with a volume regularizer on W. This model leads to identifiability of the factors W and
H under mild conditions; see Section 4. Most algorithms that deal with such equality constraints
do it a posteriori, with a projection onto the feasible set, or with a renormalization of the columns
of W and the rows of H (that is, replace W(:,k) and H(k,:) by α_k W(:,k) and H(k,:)/α_k for some
α_k > 0) so that their product WH, and hence D(V|WH), remains unchanged.
Such approaches are not ideal:

• Projection requires performing a line search to ensure the monotonicity of the algorithm, that is,
to ensure that the objective does not increase after each iteration, which may be computationally
heavy.

• Renormalization of the columns of W and the rows of H is only useful when each constraint
applies to a column of W or a row of H. It is not applicable, for example, to the sum-to-one
constraint on the columns of H mentioned above. Moreover, in the presence of regularization
terms in the objective function, it may destroy the monotonicity of the algorithm.

Another approach is to use parametrization. However, as far as we know, it does not guarantee the
monotonicity of the algorithm; see Section 3 for more details.

Outline and contribution In this paper, we introduce a general framework to design MU for β-
NMF with disjoint linear equality constraints, and with penalty terms in the objective function. By
disjoint, we mean that each variable appears in at most one equality constraint. This framework,
presented in Section 2, does not resort to projection, renormalization, or parametrization. Our MU
satisfy the set of constraints after each update of the variables during the optimization process, while
guaranteeing that the objective function decreases monotonically. This framework works as follows:
• First, as for the standard MU for β-NMF, we majorize the objective function using a separable
majorizer, that is, the majorizer is the sum of functions involving a single variable.

• Second, we construct the Lagrangian function of the majorizer. Because the majorizer is
separable, the problem can be decomposed into independent subproblems, each involving only the
variables that occur in the same equality constraint, since the constraints are disjoint. For a fixed
value of the Lagrange multipliers, we prove that the solution of each subproblem is unique under
mild conditions (Proposition 1). Moreover, it can be written in closed form via MU for specific
values of β, depending on the regularizer used (this is summarized in Table 1).

• Finally, we prove that, under mild conditions, there is a unique solution for the Lagrange mul-
tipliers so that the equality constraints are satisfied (Proposition 2). This allows us to apply
the Newton-Raphson method to compute the Lagrange multipliers while guaranteeing quadratic
convergence (Proposition 3).
We then showcase this framework on two NMF models, and show that it competes favorably with
the state of the art:
1. A β-NMF model with sum-to-one constraints on the columns of H, which we refer to as simplex-
structured β-NMF (Section 3), and

2. A minimum-volume β-NMF model with sum-to-one constraints on the columns of W (Section 4).
Finally, Section 5 shows that the framework can be extended to the case of disjoint quadratic
constraints, which we showcase on sparse β-NMF with ℓ2-norm constraints on the columns of W.

2 General framework to design MU for β-NMF under disjoint linear equality constraints and penalization
In this paper, we introduce a general framework to tackle β-NMF with disjoint linear equality constraints, and with penalty terms in the objective function. Let us first introduce some notation:
given a matrix $A \in \mathbb{R}^{F \times N}$ and a list of indices $\mathcal{K} \subseteq \{(f,n) \mid 1 \le f \le F,\ 1 \le n \le N\}$, we denote
by $A(\mathcal{K})$ the vector of dimension $|\mathcal{K}|$ whose entries are the entries of $A$ corresponding to the indices
within $\mathcal{K}$. Let us introduce $\mathcal{K}_i$ ($1 \le i \le I$) and $\mathcal{B}_j$ ($1 \le j \le J$) to be disjoint sets of indices for the
entries of $W$ and $H$, respectively, that is,

• $\mathcal{K}_i \subseteq \{(f,k) \mid 1 \le f \le F,\ 1 \le k \le K\}$ for $i = 1,2,\dots,I$,

• $\mathcal{B}_j \subseteq \{(k,n) \mid 1 \le k \le K,\ 1 \le n \le N\}$ for $j = 1,2,\dots,J$,

• $\mathcal{K}_u \cap \mathcal{K}_v = \emptyset$ for all $1 \le u, v \le I$ and $u \ne v$,

• $\mathcal{B}_p \cap \mathcal{B}_q = \emptyset$ for all $1 \le p, q \le J$ and $p \ne q$.
We now define penalized β-NMF with disjoint linear equality constraints as follows:
\[
\min_{W \in \mathbb{R}^{F \times K}_+,\ H \in \mathbb{R}^{K \times N}_+} D_\beta(V|WH) + \lambda_1 \Phi_1(W) + \lambda_2 \Phi_2(H)
\]
\[
\text{such that } \alpha_i^T W(\mathcal{K}_i) = b_i \ \text{ for } 1 \le i \le I, \qquad \gamma_j^T H(\mathcal{B}_j) = c_j \ \text{ for } 1 \le j \le J, \tag{2}
\]
where

• the penalty functions $\Phi_1(W)$ and $\Phi_2(H)$ are lower bounded and admit a particular upper approximation; see Assumption 1 below,

• $\lambda_1$ and $\lambda_2$ are the penalty weights (nonnegative scalars),

• $\alpha_i \in \mathbb{R}^{|\mathcal{K}_i|}_{++}$ ($1 \le i \le I$) and $\gamma_j \in \mathbb{R}^{|\mathcal{B}_j|}_{++}$ ($1 \le j \le J$) are vectors with positive entries (note that if $\alpha_i$ or $\gamma_j$ contains zero entries, the corresponding indices can be removed from $\mathcal{K}_i$ and $\mathcal{B}_j$),

• $b_i$ ($1 \le i \le I$) and $c_j$ ($1 \le j \le J$) are positive scalars.
As for most NMF algorithms, we propose to resort to a block coordinate descent (BCD) framework
to solve problem (2): at each iteration, we tackle two subproblems separately, one in $W$ and the
other in $H$. The subproblems in $W$ and $H$ are essentially the same, by symmetry of the model,
since transposing the relation $V \approx WH$ gives $V^T \approx H^T W^T$. Hence, we may focus on solving the
subproblem in $H$ only, namely
\[
\min_{H \in \mathbb{R}^{K \times N}_+} D_\beta(V|WH) + \lambda_2 \Phi_2(H) \quad \text{such that } \gamma_j^T H(\mathcal{B}_j) = c_j \ \text{ for } 1 \le j \le J. \tag{3}
\]

In order to solve (3), we will design MU based on the majorization-minimization (MM) framework [33], which is the standard in the NMF literature; see [12] and the references therein. Let us
briefly recall the high-level ideas to obtain MU via MM. Consider the general problem
\[
\min_{h \in \mathcal{H}} f(h).
\]
Given an initial iterate $\tilde{h} \in \mathcal{H}$, MM generates a new iterate $\hat{h} \in \mathcal{H}$ that is guaranteed to decrease the
objective function, that is, $f(\hat{h}) \le f(\tilde{h})$. To do so, it uses the following two steps:
• Majorization: find a function that is an upper approximation of the objective and is tight at the
current iterate, which is referred to as a majorizer. More precisely, find a function $g(h|\tilde{h})$ such
that
(i) $g(\tilde{h}|\tilde{h}) = f(\tilde{h})$ and (ii) $g(h|\tilde{h}) \ge f(h)$ for all $h \in \mathcal{H}$.

• Minimization: minimize the majorizer, that is, solve $\min_{h \in \mathcal{H}} g(h|\tilde{h})$ approximately or exactly,
to obtain the next iterate $\hat{h} \in \mathcal{H}$, which is such that (iii) $g(\hat{h}|\tilde{h}) \le g(\tilde{h}|\tilde{h})$. This guarantees the
objective function to decrease at each step of this iterative process since
\[
f(\hat{h}) \underset{(ii)}{\le} g(\hat{h}|\tilde{h}) \underset{(iii)}{\le} g(\tilde{h}|\tilde{h}) \underset{(i)}{=} f(\tilde{h}).
\]

The MU for NMF are obtained using MM where the majorizer $g$ is chosen to be separable, that is,
$g(h|\tilde{h}) = \sum_i g_i(h_i|\tilde{h}_i)$ for some well-chosen univariate functions $g_i$; see (4) in the next section. This choice
typically makes the minimization of $g$ admit a closed-form solution which is multiplicative, that is, it
has the form $\hat{h} = \tilde{h} \odot c(\tilde{h})$, where $\odot$ is the component-wise product and $c(\tilde{h})$ is a nonnegative vector
that depends on $\tilde{h}$. We will encounter several examples later in this paper.
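To make the multiplicative form $\hat{h} = \tilde{h} \odot c(\tilde{h})$ concrete, here is a small NumPy sketch of one standard unconstrained MU step on $H$ for β-NMF, as obtained from this MM scheme in [12]; the function names, the exponent helper and the small lower bound eps are our own choices, and this is not yet the constrained update derived in this paper.

```python
import numpy as np

def mu_exponent(beta):
    # Exponent of the multiplicative update, as in [12, Table 2].
    if beta < 1:
        return 1.0 / (2.0 - beta)
    if beta > 2:
        return 1.0 / (beta - 1.0)
    return 1.0  # beta in [1, 2]

def mu_update_H(V, W, H, beta, eps=1e-16):
    """One unconstrained multiplicative update on H for D_beta(V | WH)."""
    WH = np.maximum(W @ H, eps)
    numer = W.T @ (WH ** (beta - 2.0) * V)
    denom = np.maximum(W.T @ (WH ** (beta - 1.0)), eps)
    # Multiplicative form: H_new = H (elementwise) times a nonnegative factor.
    return np.maximum(H * (numer / denom) ** mu_exponent(beta), eps)
```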
In summary, to derive MU for (3), we follow the MM framework. We first provide a majorizer
for the objective of (3) in Section 2.1. This majorizer has the property of being separable in each entry of
H. In order to handle the equality constraints, we introduce Lagrange dual variables in Section 2.2,
and explain how they can be computed efficiently. This allows us to derive general MU in Section 2.3
in the case of non-penalized β-NMF under disjoint linear equality constraints. This is showcased on
simplex-structured β-NMF in Section 3. In Section 4, we illustrate on minimum-volume KL-NMF
how to derive MU in the presence of penalty terms.

2.1 Separable majorizer for the objective function


Let us derive a majorizer for $\Psi(H) := D_\beta(V|WH) + \lambda\Phi(H)$, that is, a function $G(H|\tilde{H})$ satisfying
(i) $G(H|\tilde{H}) \ge \Psi(H)$ for all $H$, and (ii) $G(\tilde{H}|\tilde{H}) = \Psi(\tilde{H})$. Note that, to simplify the presentation, we
denote $\Phi_2(H) = \Phi(H)$ and $\lambda = \lambda_2$. To do so, let us analyze each term of $\Psi(H)$ independently.
Majorizing $D_\beta(V|WH)$  The first term $D_\beta(V|WH)$ can be decoupled into $N$ independent terms,
one for each column $h_n$ of $H$, that is, $D_\beta(V|WH) = \sum_{n=1}^N D_\beta(v_n | W h_n)$, where $v_n$ denotes the $n$th
column of matrix $V$. Let us focus on a specific column of $H$, denoted $h \in \mathbb{R}^K_+$, and the corresponding
column of $V$, denoted $v \in \mathbb{R}^F_+$. We majorize $D_\beta(v|Wh) = \sum_{f=1}^F d_\beta(v_f | (Wh)_f)$ following the methodology introduced in [12], which consists in applying a convex-concave procedure [36] to $d_\beta$, as presented
in Appendix A. The resulting upper bound is given by
\[
d_\beta(v_f|(Wh)_f) \le \sum_{k=1}^K \frac{w_{fk}\tilde{h}_k}{\tilde{v}_f}\, \check{d}\!\left(v_f \,\Big|\, \tilde{v}_f \frac{h_k}{\tilde{h}_k}\right) + \hat{d}'(v_f|\tilde{v}_f) \sum_{k=1}^K w_{fk}\left(h_k - \tilde{h}_k\right) + \hat{d}(v_f|\tilde{v}_f), \tag{4}
\]
where $w_{fk}$ denotes the entry of matrix $W$ at position $(f,k)$, $\tilde{v}_f := (W\tilde{h})_f$ denotes the $f$th entry of $\tilde{v}$,
and $\hat{d}$ and $\check{d}$ are the concave and convex parts of $d$, respectively.

Majorizing $\Phi(H)$  For the second term $\Phi(H)$, we rely on the following assumption on $\Phi$.

Assumption 1. The function $\Phi : \mathbb{R}^{K \times N}_+ \mapsto \mathbb{R}$ is lower bounded, and for any $\tilde{H} \in \mathbb{R}^{K \times N}_+$ there exist
constants $L_{kn}$ ($1 \le k \le K$, $1 \le n \le N$) such that the inequality
\[
\Phi(H) \le \Phi(\tilde{H}) + \left\langle \nabla\Phi(\tilde{H}),\, H - \tilde{H} \right\rangle + \sum_{k,n} \frac{L_{kn}}{2} \left(H_{kn} - \tilde{H}_{kn}\right)^2 \tag{5}
\]
is satisfied for all $H \in \mathbb{R}^{K \times N}_+$. (Note that the constants $L_{kn}$ may depend on $\tilde{H}$; this will be the case
for example in Section 4.)

Let us mention two important classes of functions satisfying Assumption 1.
Let us mention two important classes of functions satisfying Assumption 1.
1. Smooth concave functions that are lower bounded on the nonnegative orthant. For such functions, we can take $L_{kn} = 0$ for all $k,n$ since they are upper approximated by their first-order
Taylor approximation. Note that, in this case,
\[
\nabla\Phi(\tilde{H}) \ge 0, \tag{6}
\]
otherwise we would have $\lim_{y \to \infty} \Phi(H + y\, e_i e_j^T) = -\infty$, where $e_i$ is the $i$th unit vector, and this
would contradict the fact that $\Phi$ is bounded from below. This observation will be useful in the
proof of Proposition 1 and is only valid for the special case $L_{kn} = 0$ for all $k,n$.
Examples of such penalty functions include the sparsity-promoting regularizers $\Phi(H) = \|H\|_p^p = \sum_{k,n} H(k,n)^p$ for $0 < p \le 1$, since $H \ge 0$.

2. Lower-bounded functions with a Lipschitz continuous gradient, for which (5) follows from the
descent lemma [3].
Examples of such penalty functions include any smooth convex function, for example any
quadratic penalty such as $\|AH - B\|_2^2$ for some matrices $A$ and $B$, in which case $L_{kn} = \sigma_1(A)^2$
for all $k,n$. We will encounter another example later in the paper, namely $\operatorname{logdet}(HH^T + \delta I)$
for $\delta > 0$, which allows one to minimize the volume of the rows of $H$; see Section 4 for the details
(note that we will use this regularizer for $W$).

Majorizing $\Psi(H)$  Combining (4) and (5), we can construct a majorizer for $\Psi(H)$. Since both (4)
and (5) are separable in each entry of $H$, their combination is also separable into a sum of $K \times N$
component-wise majorizers, up to an additive constant:
\[
G(H|\tilde{H}) = \sum_{n=1}^N \sum_{k=1}^K g(h_{kn}|\tilde{H}) + C(\tilde{H}), \tag{7}
\]
where
\[
g(h_{kn}|\tilde{H}) = \sum_{f=1}^F \frac{w_{fk}\tilde{h}_{kn}}{\tilde{v}_{fn}}\, \check{d}\!\left(v_{fn} \,\Big|\, \tilde{v}_{fn}\frac{h_{kn}}{\tilde{h}_{kn}}\right) + a_{kn} h_{kn}^2 + p_{kn} h_{kn}, \tag{8}
\]
\[
C(\tilde{H}) = \sum_{n=1}^N \left( \sum_{f=1}^F \left[ \hat{d}(v_{fn}|\tilde{v}_{fn}) - \hat{d}'(v_{fn}|\tilde{v}_{fn}) \sum_{k=1}^K w_{fk}\tilde{h}_{kn} \right] + \sum_{k=1}^K a_{kn}\tilde{h}_{kn}^2 \right),
\]
with $a_{kn} = \lambda \frac{L_{kn}}{2}$, and
\[
p_{kn} = \sum_{f=1}^F w_{fk}\, \hat{d}'(v_{fn}|\tilde{v}_{fn}) + \lambda\left( \frac{\partial\Phi}{\partial h_{kn}}(\tilde{H}) - L_{kn}\tilde{h}_{kn} \right).
\]

2.2 Dealing with equality constraints via Lagrange dual variables


In the previous section, we derived a majorizer $G(H|\tilde{H})$ for $\Psi(H)$ which is separable in each entry
of $H$. Without the equality constraints, we could then compute closed-form solutions to univariate
problems to minimize $G(H|\tilde{H})$ and obtain the standard MU for NMF as in [12].
However, in problem (3), the entries of $H$ in the subsets $\mathcal{B}_j$ are not independent, as they are linked
by the equality constraints $\gamma_j^T H(\mathcal{B}_j) = c_j$ for $j = 1,2,\dots,J$. In fact, to minimize the majorizer
under the equality constraints, we need to solve
\[
\min_{H \in \mathbb{R}^{K \times N}_+} G(H|\tilde{H}) \quad \text{such that } \gamma_j^T H(\mathcal{B}_j) = c_j \ \text{ for } 1 \le j \le J. \tag{9}
\]

The variables in different sets $\mathcal{B}_j$ can be optimized independently, as they do not interact in the
majorizer nor in the constraints. Note that, for the entries of $H$ that do not appear in any constraint,
the standard MU [12] can be used. For simplicity, let us fix $j$ and denote $\mathcal{B} = \mathcal{B}_j$, $Q = |\mathcal{B}|$,
$y = H(\mathcal{B}) \in \mathbb{R}^Q_+$, $\gamma = \gamma_j \in \mathbb{R}^Q_{++}$, and $c = c_j > 0$. The problems we need to solve have the form
` , γ “ γ j P R`` , and c “ cj ą 0. The problems we need to solve have the form
` ˘
min G y|H
r , (10)
yPY
! )
where Y “ y P RQ
` | γ T y “ c and

` ˘ ÿ ` ˘
G y|H
r “ g hkn |H
r , (11)
pk,nqPB

where the component-wise majorizers $g(h_{kn}|\tilde{H})$ are defined by (8). Let us introduce a convenient
notation: for $q = 1,2,\dots,Q$, we denote by $(k(q), n(q))$ the $q$th pair belonging to $\mathcal{B}$. Hence, the
Lagrangian function of (11) can be written as
\[
G^\mu(y|\tilde{H}) = G(y|\tilde{H}) - \mu\left(\gamma^T y - c\right) = \mu c + C(\tilde{H}) + \sum_{q=1}^Q g^\mu(y_q|\tilde{H}), \tag{12}
\]

where
\[
g^\mu(y_q|\tilde{H}) = g(y_q|\tilde{H}) - \mu\gamma_q y_q
= \sum_{f=1}^F \frac{w_{fk(q)}\tilde{y}_q}{\tilde{v}_{fn(q)}}\, \check{d}\!\left(v_{fn(q)} \,\Big|\, \tilde{v}_{fn(q)}\frac{y_q}{\tilde{y}_q}\right) + a_q y_q^2 + (p_q - \mu\gamma_q)\, y_q, \tag{13}
\]
\[
p_q = \sum_{f=1}^F w_{fk(q)}\, \hat{d}'(v_{fn(q)}|\tilde{v}_{fn(q)}) + \lambda\left( \frac{\partial\Phi}{\partial y_q}(\tilde{H}) - L_{k(q)n(q)}\tilde{y}_q \right), \tag{14}
\]

and $\mu \in \mathbb{R}$. Note that $G^\mu$ is separable, as is $G$, because the term $\gamma^T y$ is linear.
Assume for now that the Lagrange multiplier $\mu$ is known, and let us minimize $G^\mu(y|\tilde{H})$ on
$(0,\infty)^Q$. Such a problem is separable into $Q$ subproblems, consisting in minimizing the
univariate functions $g^\mu(\cdot|\tilde{H})$ separately over $(0,\infty)$. We now show in Proposition 1 that, under mild
conditions, each subproblem admits a unique solution over $(0,\infty)$.

Proposition 1. Let $q \in \{1,2,\dots,Q\}$. Assume that $\beta < 2$ and $\tilde{y}_q, v_{fn(q)}, w_{fk(q)} > 0$ for all $f$. Moreover,
when $\beta \le 1$, assume that $\mu < \frac{p_q}{\gamma_q}$ for all $q$ such that $a_q = 0$. Then there exists a unique minimizer
$y_q^\star(\mu)$ of $g^\mu(y_q|\tilde{H})$ in $(0,\infty)$.

Proof. According to Proposition 4 (see Appendix A), each $g^\mu$ is $C^\infty$ and strictly convex on $(0,\infty)$, so
its infimum is uniquely attained in the closure of $(0,\infty)$. We have to prove that it is neither reached
at $0$ nor at $\infty$. On the one hand, from (13), we have
\[
(g^\mu)'(y_q|\tilde{H}) = \sum_{f=1}^F w_{fk(q)}\, \check{d}'\!\left(v_{fn(q)} \,\Big|\, \tilde{v}_{fn(q)}\frac{y_q}{\tilde{y}_q}\right) + 2 a_q y_q + p_q - \gamma_q\mu \tag{15}
\]
and, for any $\beta < 2$ and any $x > 0$,
\[
\lim_{y \to 0^+} \check{d}'(x|y) = -\infty,
\]
so $\lim_{y_q \to 0^+} (g^\mu)'(y_q|\tilde{H}) = -\infty$, which ensures that the infimum is not reached at $0$. On the other
hand,
\[
\lim_{y \to \infty} \check{d}'(x|y) = \begin{cases} 0 & \text{if } \beta \le 1, \\ \infty & \text{otherwise.} \end{cases} \tag{16}
\]
According to (15) and (16), the distinction must be made between two cases:

• If $a_q > 0$ or $\beta \in (1,2)$: $\lim_{y_q \to \infty} (g^\mu)'(y_q|\tilde{H}) = \infty$, so the infimum is reached for a finite $y_q$.

• If $a_q = 0$ and $\beta \le 1$: $\lim_{y_q \to \infty} (g^\mu)'(y_q|\tilde{H}) = p_q - \gamma_q\mu$, so the same conclusion holds if $\mu < \frac{p_q}{\gamma_q}$.

We just proved that, under mild conditions, each $g^\mu$ has a unique minimizer over $(0,\infty)$. However,
we assumed that the value of $\mu$ is fixed. Now, given $y^\star(\mu) = \left[ y_1^\star(\mu), \dots, y_Q^\star(\mu) \right]^T$, let us show that the
solution to $\gamma^T y^\star(\mu) = c$ is unique. The corresponding value of $\mu$, which we denote $\mu^\star$, provides the
minimizer $y^\star(\mu^\star)$ of $G^\mu(y|\tilde{H})$ that satisfies the linear constraint $\gamma^T y^\star(\mu^\star) = c$. Moreover, $\mu^\star$ naturally
fulfills $\mu^\star < \frac{p_q}{\gamma_q}$ for all $q$ when $\beta \le 1$ and $a_q = 0$, as required in Proposition 1.

Proposition 2. Assume that $\beta < 2$ and $\tilde{y}_q, v_{fn(q)}, w_{fk(q)} > 0$ for all $q, f$. Then the scalar equation
$\gamma^T y^\star(\mu) = c$ in the variable $\mu$ admits a unique solution $\mu^\star$ in $(-\infty, t)$, where
\[
t = \min_{1 \le q \le Q} t_q, \quad \text{where } t_q = \begin{cases} \frac{p_q}{\gamma_q} & \text{if } \beta \le 1 \text{ and } a_q = 0, \\ \infty & \text{otherwise,} \end{cases} \tag{17}
\]
so that $y^\star(\mu^\star) \in (0,\infty)^Q$ is the unique solution to problem (10).


Proof. Under the conditions of Proposition 1, $g^\mu(y_q|\tilde{H})$ has a unique minimizer $y_q^\star(\mu)$ for each $q$. By
the first-order optimality condition, $y_q^\star(\mu)$ is a solution of $(g^\mu)'(y_q|\tilde{H}) = 0$ or, equivalently, by (15), a
solution of $\gamma_q^{-1} g'(y_q|\tilde{H}) = \mu$ over $(0,\infty)$, where
\[
\gamma_q^{-1} g'(y_q|\tilde{H}) = \gamma_q^{-1} \sum_{f=1}^F w_{fk(q)}\, \check{d}'\!\left(v_{fn(q)} \,\Big|\, \tilde{v}_{fn(q)}\frac{y_q}{\tilde{y}_q}\right) + 2\frac{a_q}{\gamma_q} y_q + \frac{p_q}{\gamma_q} \tag{18}
\]
is strictly increasing on $(0,\infty)$ (since $g$ is strictly convex) and one-to-one from $(0,\infty)$ to an open interval
$\mathcal{T}_q = (t_q^-, t_q^+)$, where
\[
t_q^- = \lim_{y_q \to 0} \gamma_q^{-1} g'(y_q|\tilde{H}) = -\infty, \tag{19}
\]
\[
t_q^+ = \lim_{y_q \to \infty} \gamma_q^{-1} g'(y_q|\tilde{H}) = t_q. \tag{20}
\]
Moreover, $p_q \ge 0$ if $a_q = 0$ (then $L = 0$) and $\beta \le 1$, according to (6) and (14). As a consequence,
$\gamma_q^{-1} g'(y_q^\star|\tilde{H}) = \mu$ is equivalent to
\[
y_q^\star(\mu) = (g')^{-1}(\gamma_q\mu), \tag{21}
\]
where $\mu \in \mathcal{T}_q$ and $(g')^{-1}$ denotes the inverse function of $g'$.

Coming back to the multivariate problem (10), we must find a value $\mu^\star$ of the Lagrange multiplier
such that the constraint $\gamma^T y^\star(\mu) = c$ is satisfied. Given (21), $\mu^\star$ is a solution of
\[
\sum_{q=1}^Q \gamma_q\, (g')^{-1}(\gamma_q\mu) = c. \tag{22}
\]
Each $g'(y_q|\tilde{H})$ being strictly increasing on $(0,\infty)$, $(g')^{-1}(\gamma_q\mu)$ is also strictly increasing (from $\mathcal{T}_q$ to
$(0,\infty)$); this is a direct consequence of $(f^{-1})' = \frac{1}{f' \circ f^{-1}}$, where $f$ is any strictly increasing function on
some interval. Finally, $\sum_{q=1}^Q \gamma_q (g')^{-1}(\gamma_q\mu)$ is strictly increasing from $\cap_{q=1}^Q \mathcal{T}_q = (-\infty, t)$ to $(0,\infty)$, with
$t \ge 0$. Therefore, the solution $\mu^\star$ is unique.

Proposition 2 shows that the optimal Lagrange multiplier is the unique solution of (22). Finding
the solution of (22) is equivalent to finding the root of a function $r(\mu)$. We propose hereunder to
use a Newton-Raphson method to compute $\mu^\star$, and show that this method generates a sequence of
iterates $\mu_n$ that converges towards $\mu^\star$ at a quadratic speed.

Proposition 3. Assume that $\beta < 2$ and $\tilde{y}_q, v_{fn(q)}, w_{fk(q)} > 0$ for all $q, f$. Let
\[
r(\mu) = \sum_{q=1}^Q \gamma_q\, (g')^{-1}(\gamma_q\mu) - c
\]
for $\mu \in (-\infty, t)$, where $t$ is defined in (17), and denote by $\mu^\star$ the unique solution of $r(\mu) = 0$. From any
initial point $\mu_0 \in (\mu^\star, t)$, the Newton-Raphson iterates
\[
\mu_{n+1} = \mu_n - \frac{r(\mu_n)}{r'(\mu_n)}
\]
decrease towards $\mu^\star$ at a quadratic speed.

Proof. We already know that $r$ is strictly increasing on $(-\infty, t)$. Let us show that $r$
is also strictly convex. According to the third item of Proposition 4 in Appendix A, $\check{d}''(x|y)$ is
completely monotonic, so it is strictly decreasing in $y$. Equivalently, $\check{d}'(x|y)$ is strictly concave in
$y$, and each $g'$ is also strictly concave according to (18). Since the inverse of a strictly increasing,
strictly concave function $f$ is strictly increasing and strictly convex, which is a direct consequence of
$(f^{-1})'' = -\frac{f'' \circ f^{-1}}{(f' \circ f^{-1})^3}$, each $(g')^{-1}$ is strictly convex, and finally, $r$ is strictly convex.
For any $\mu_0 \in (\mu^\star, t)$, we have $r(\mu_0) > 0$, so $\mu_1 = \mu_0 - \frac{r(\mu_0)}{r'(\mu_0)} < \mu_0$. We also have $\mu_1 > \mu^\star$ as a
consequence of the strict convexity of $r$. By immediate recurrence, we obtain that $(\mu_n)$ is a decreasing
sequence that converges towards $\mu^\star$. According to [30], it converges at a quadratic speed since $|r'|$ and
$|r''|$ are bounded away from $0$ on $[\mu^\star, \mu_0]$.

Discussion  At this point, we have derived an optimization framework to tackle problem (10). The
optimal Lagrange multiplier value is determined before each majorization-minimization update using
a Newton-Raphson algorithm. However, such a formal solution is implementable if and only if each
$y_q^\star(\mu)$ can actually be computed as the minimizer of $g^\mu(y_q|\tilde{H})$ in $(0,\infty)$. In some cases, computing
$y_q^\star(\mu)$ is equivalent to extracting the roots of a polynomial of degree smaller than or equal to four, which
is possible in closed form. In other cases, we have to solve a polynomial equation of degree larger
                                           β ∈ (−∞,1)\{0}   β = 0   β = 1   β = 5/4   β = 4/3   β = 3/2   other β ∈ (1,2)
No penalization, or L_kn = 0 for all k,n          1            1       1        4         3         2            ∅
L_kn > 0 for some k,n                             ∅            3       2        ∅         ∅         3            ∅

Table 1: Cases where (21) can be computed in closed form. They are indicated by the degree of the
corresponding polynomial equation; otherwise the symbol ∅ is used. The constants L_kn are the ones
needed in Assumption 1 for the penalty functions Φ1(W) and Φ2(H); see (5).

than four, or even an equation that is not polynomial. Table 1 indicates the cases where a closed-form
solution is available, and hence when our framework can be efficiently implemented. We observe
that, without penalization or with a penalization satisfying L_kn = 0 for all k, n (e.g., smooth concave
functions), the equation is polynomial of degree one for β ≤ 1 and of degree at most four for
β ∈ {5/4, 4/3, 3/2}, and hence always admits a closed-form solution in these cases. The case without
penalization is discussed in the next section, and exemplified in Section 3 with β-NMF with sum-to-one
constraints on the columns of H. In Section 4, we will present an important example with L_kn > 0
for all k, n and β = 1, namely minimum-volume KL-NMF.

2.3 MU for β-NMF with disjoint linear equality constraints without penalization
In this section, we derive an algorithm based on the general framework presented in the previous
section to tackle the β-NMF problem under disjoint linear equality constraints without penalization,
that is, problem (2) with λ1 = λ2 = 0. We consider this simplified case here as it allows us to provide
explicit MU for any β ≤ 1, as well as for specific values of β ∈ (1, 2); see the row "No penalization" of
Table 1. These updates satisfy the constraints after each update of W or H, and monotonically
decrease the objective function Dβ(V|WH).
Let us then consider the subproblem of (2) over $H$ when $W$ is fixed and with $\lambda_2 = 0$, that is,
\[
\min_{H \in \mathbb{R}^{K \times N}_+} D_\beta(V|WH) \quad \text{such that } \gamma_j^T H(\mathcal{B}_j) = c_j \ \text{ for } 1 \le j \le J. \tag{23}
\]

Let us follow the framework presented above. First, an auxiliary function, which we denote $G(H|\tilde{H})$,
is constructed at the current iterate $\tilde{H}$ so that it majorizes the objective for all $H$; it is defined as
follows:
\[
G(H|\tilde{H}) = \sum_{f,n} \left[ \sum_k \frac{w_{fk}\tilde{h}_{kn}}{\tilde{v}_{fn}}\, \check{d}\!\left(v_{fn} \,\Big|\, \tilde{v}_{fn}\frac{h_{kn}}{\tilde{h}_{kn}}\right) + \hat{d}'(v_{fn}|\tilde{v}_{fn}) \sum_k w_{fk}\left(h_{kn} - \tilde{h}_{kn}\right) + \hat{d}(v_{fn}|\tilde{v}_{fn}) \right], \tag{24}
\]
where $\check{d}(\cdot|\cdot)$ and $\hat{d}(\cdot|\cdot)$ are given in Appendix A. Second, we need to minimize $G(H|\tilde{H})$ while imposing
the set of linear constraints $\gamma_j^T H(\mathcal{B}_j) = c_j$. The Lagrangian function of $G$ is given by

\[
G^\mu(H|\tilde{H}) = G(H|\tilde{H}) - \sum_{j=1}^J \mu_j \left[ \gamma_j^T H(\mathcal{B}_j) - c_j \right], \tag{25}
\]

where $\mu_j$ are the Lagrange multipliers associated with each linear constraint $\gamma_j^T H(\mathcal{B}_j) = c_j$. We observe
that $G^\mu$ in (25) is a separable majorizer, in the variables $H$, of the Lagrangian function
$D_\beta(V|WH) - \sum_{j=1}^J \mu_j\left[ \gamma_j^T H(\mathcal{B}_j) - c_j \right]$. Due to the disjointness of the subsets of variables $\mathcal{B}_j$, we only consider
the optimization of (25) over one specific subset $\mathcal{B}_j$. The minimizer (21) of $G^\mu(H(\mathcal{B}_j)|\tilde{H}(\mathcal{B}_j))$ has the
following component-wise expression:
\[
H^\star(\mathcal{B}_j) = \tilde{H}(\mathcal{B}_j) \odot \left( \frac{[C(\mathcal{B}_j)]}{[D(\mathcal{B}_j) - \mu_j \gamma_j]} \right)^{.\eta(\beta)}, \tag{26}
\]
where $C = W^T\left( (WH)^{.(\beta-2)} \odot V \right)$, $D = W^T (WH)^{.(\beta-1)}$, $\eta(\beta) = \frac{1}{2-\beta}$ for $\beta \le 1$ and $\eta(\beta) = \frac{1}{\beta-1}$ for
$\beta \ge 2$ [12, Table 2], $A \odot B$ (resp. $\frac{[A]}{[B]}$) is the Hadamard product (resp. division) between $A$ and $B$, and $A^{.\alpha}$
is the element-wise $\alpha$ exponent of $A$. The case $\beta \in (1,2)$ is more difficult: we need to find a root of a
function of the form $\mu + b x^{\beta-1} - c x^{\beta-2} = 0$. For example, for $\beta = \frac{3}{2}$, we have $\mu + b x^{1/2} - c x^{-1/2} = 0$.
Using $y = \sqrt{x}$, and after simplifications, we obtain $\mu y + b y^2 - c = 0$, leading to the positive root
$x = \left( \frac{\sqrt{\mu^2 + 4bc} - \mu}{2b} \right)^2$.
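As a quick sanity check of (26), which is not spelled out above, consider the KL case β = 1 without penalization: then $d$ is convex, so $\hat{d} = 0$, $p_q = 0$ and $a_q = 0$, and $\check{d}'(x|y) = 1 - x/y$. Writing $y_q = h_{kn}$ for $(k,n) \in \mathcal{B}_j$, the first-order condition (15) becomes
\[
0 = (g^{\mu})'(h_{kn}|\tilde H)
  = \sum_{f=1}^F w_{fk}\left(1 - \frac{v_{fn}\tilde h_{kn}}{\tilde v_{fn} h_{kn}}\right) - \mu_j \gamma_q
  = D_{kn} - \frac{\tilde h_{kn}}{h_{kn}}\, C_{kn} - \mu_j \gamma_q,
\]
with $C_{kn} = \left[W^T\left((W\tilde H)^{.(-1)} \odot V\right)\right]_{kn}$ and $D_{kn} = \left[W^T e_{F,N}\right]_{kn}$, so that $h_{kn} = \tilde h_{kn}\, C_{kn}/(D_{kn} - \mu_j \gamma_q)$, which is exactly (26) with $\eta(1) = 1$.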
According to Proposition 2, (26) is a well-defined update from $(0,\infty)^Q$ to itself, provided that $\mu_j$
is tuned to $\mu_j^\star$. This brings a structural guarantee that $D(\mathcal{B}_j) - \mu_j^\star \gamma_j$ cannot vanish.
Finally, we need to evaluate $\mu_j^\star$, which is uniquely determined on some interval $(-\infty, t)$ according
to Proposition 2. This amounts to solving $\gamma_j^T H^\star(\mathcal{B}_j) = c_j$. When $\beta \notin (1,2)$, this is equivalent to finding
the root of the function
\[
r_j(\mu_j) = \sum_{q=1}^Q \gamma_{j,q} \left[ \tilde{H}(\mathcal{B}_j) \odot \left( \frac{[C(\mathcal{B}_j)]}{[D(\mathcal{B}_j) - \mu_j \gamma_j]} \right)^{.\eta(\beta)} \right]_q - c_j, \tag{27}
\]
where $[A]_q$ denotes the $q$th entry of expression $A$. Indeed, $r_j(\mu_j)$ is a finite sum of elementary
rational functions of $\mu_j$, and each of them is an increasing, convex function of $\mu_j$ over $(-\infty, t_q)$ with
$t_q = \frac{D(\mathcal{B}_j)_q}{\gamma_{j,q}}$ for each $q$. It is even completely monotone for all $\mu$ in $(-\infty, t_q)$ because $\eta(\beta) > 0$ [28].
As a consequence, $r_j(\mu_j)$ is also a completely monotone, convex, increasing function of $\mu_j$ in $(-\infty, t)$,
where $t = \min_q(t_q)$. Finally, we can easily show that the function $r_j(\mu_j)$ changes sign on the interval
$(-\infty, t)$ by computing the two limits at the closure of the interval. As $\mu^\star \in (-\infty, t)$, the update (26) is
nonnegative. To evaluate $\mu^\star$, we use a Newton-Raphson method, with any initial point $\mu_0 \in (\mu^\star, t)$,
with a quadratic rate of convergence as demonstrated in Proposition 3. Algorithm 1 summarizes our
method to tackle (2) for all β-divergences with $\beta \notin (1,2)$, which we refer to as the disjoint-constrained
β-NMF algorithm. The update for matrix $W$ can be derived in the same way, by symmetry of the
problem. For $\beta \in (1,2)$, a case-by-case analysis could be carried out for the values of β for which the
minimizer of (25) takes a closed-form expression.
Remark 1. As noted above, the denominators of (26) and (27) are different from zero. This
follows notably from our assumption that $(W,H) > 0$; see Propositions 1, 2 and 3. This is a standard
assumption in the NMF literature: the entries of $(W,H)$ are initialized with positive values, which
ensures all iterates remain positive. This is important because the MU cannot change an entry
equal to zero [26]; this is the so-called zero-locking phenomenon. This implies that $C$ and $D$ in (26)
and (27) are positive matrices (as long as $V$ has at least one nonzero entry per row and column). In
practice, one should however be careful because some entries of $W$ and $H$ can numerically be set to
zero (because of finite precision). Hence, in our implementation, we use the machine precision as a
lower bound for the entries of $W$ and $H$, as recommended in [17].

Computational cost  The computational cost of Algorithm 1 is asymptotically equivalent to that of the
standard MU for β-NMF, that is, it requires $O(FNK)$ operations per iteration. Indeed, the complexity
is mainly driven by the matrix products required to compute $C$ and $D$; see (26). To compute the roots
of (27) corresponding to $H$ using Newton-Raphson, each iteration requires computing $r_j(\mu_j)/r_j'(\mu_j)$
for all $j$, which requires $O(KN)$ operations (when every entry of $H$ appears in a constraint). Finding
the roots therefore requires $O(KN)$ operations times the number of Newton-Raphson iterations. By
symmetry, it requires $O(KF)$ operations to compute the roots corresponding to $W$. Because of the
quadratic convergence, the number of iterations required for the convergence of the Newton-Raphson
method is typically small, namely between 10 and 100 in our experiments using the stopping criterion
$|r(\mu_j)| \le 10^{-6}$ for all $j$. Therefore, in practice, the overall complexity of Algorithm 1 is dominated by
the matrix products, which require $O(FNK)$ operations. The same conclusions apply to the algorithms
presented in Sections 3, 4 and 5, and this will be confirmed by our numerical experiments.

Algorithm 1 β-NMF with disjoint linear constraints

Input: A matrix V ∈ R^{F×N}, an initialization H ∈ R^{K×N}_+ and W ∈ R^{F×K}_+, a factorization rank K, a
maximum number of iterations maxiter, a value for β ∉ (1,2), and the linear constraints defined
by K_i, α_i and b_i for i = 1, 2, ..., I, and B_j, γ_j and c_j for j = 1, 2, ..., J.
Output: A rank-K NMF (W, H) of V satisfying the constraints in (2).
 1: for it = 1 : maxiter do
 2:   % Update of matrix H
 3:   C ← Wᵀ((WH)^{.(β−2)} ⊙ V)
 4:   D ← Wᵀ(WH)^{.(β−1)}
 5:   for j = 1 : J do
 6:     μ_j ← root(r_j(μ_j))   % see Equation (27)
 7:     H(B_j) ← H(B_j) ⊙ ([C(B_j)] / [D(B_j) − μ_j γ_j])^{.η(β)}
 8:   end for
 9:   B^c ← {(k,n) | 1 ≤ k ≤ K, 1 ≤ n ≤ N} \ (∪_{j=1}^J B_j)   % B^c is the complement of ∪_{j=1}^J B_j
10:   H(B^c) ← H(B^c) ⊙ ([C(B^c)] / [D(B^c)])^{.η(β)}
11:   % Update of matrix W
12:   W is updated in the same way as H, by symmetry of the problem.
13: end for
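As an illustration of the H-update of Algorithm 1 (lines 5-8), here is a NumPy sketch of the constrained update of one block B_j for β ≤ 1, where η(β) = 1/(2−β), using the Newton-Raphson iteration of Proposition 3. The function name, the safeguarded initialization of μ₀ and the tolerance are our own choices; the inputs are assumed positive, as in Remark 1.

```python
import numpy as np

def constrained_block_update(h_tilde, c_B, d_B, gamma, c, beta,
                             tol=1e-6, max_newton=100):
    """Update H(B_j) via (26), with mu_j* computed from (27), for beta <= 1.

    h_tilde : current entries of H~(B_j);  c_B, d_B : C(B_j), D(B_j);
    gamma   : positive weights gamma_j;    c : right-hand side c_j > 0.
    All array inputs are positive 1-D arrays of the same length.
    """
    eta = 1.0 / (2.0 - beta)             # eta(beta) for beta <= 1
    t = np.min(d_B / gamma)              # right end of the interval (-inf, t)

    def r(mu):                           # r_j(mu) of (27)
        return np.sum(gamma * h_tilde * (c_B / (d_B - mu * gamma)) ** eta) - c

    def r_prime(mu):
        return np.sum(gamma**2 * h_tilde * eta * c_B**eta
                      * (d_B - mu * gamma) ** (-eta - 1.0))

    # Pick mu0 in (mu*, t): r increases to +infinity as mu -> t^-, so move
    # mu0 towards t until r(mu0) > 0 (then mu0 > mu*, since r is increasing).
    offset = 0.1 * (abs(t) + 1.0)
    mu = t - offset
    while r(mu) <= 0:
        offset *= 0.5
        mu = t - offset

    # Newton-Raphson iterates of Proposition 3, decreasing monotonically to mu*.
    for _ in range(max_newton):
        mu -= r(mu) / r_prime(mu)
        if abs(r(mu)) <= tol:
            break

    return h_tilde * (c_B / (d_B - mu * gamma)) ** eta
```

In Algorithm 1, such a routine would be called once per block B_j after computing C and D; the returned block satisfies γ_jᵀH(B_j) = c_j up to the Newton tolerance.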

3 Showcase 1: Simplex-structured β-NMF


In this section, we showcase a particularly important example of β-NMF with linear disjoint constraints
and no penalization, namely, the simplex-structured matrix factorization (SSMF) problem. It is
defined as follows: given a data matrix $V \in \mathbb{R}^{F \times N}$ and a factorization rank $K$, SSMF refers to the
problem of computing $W$ and $H$ such that $V \approx WH$ and the columns of $H$ lie on the unit simplex, that
is, the entries of each column of H are nonnegative and sum to one. SSMF is a powerful tool in many
applications such as hyperspectral unmixing in geoscience and remote sensing [4, 27, 1], document
analysis [5], and self-modeling curve resolution [29]. We refer the reader to the recent survey [13] for
more applications and details about SSMF.

To understand the significance of SSMF, it is useful to give more insight into the research topic for
which important SSMF techniques were initially developed, namely blind hyperspectral unmixing (HU),
a central problem in remote sensing. The task of blind HU is to decompose a remotely sensed
hyperspectral image into endmember spectral signatures and the corresponding abundance maps with
limited prior information, usually the only known information being the number of endmembers. In
this context, the columns of W correspond to the endmember spectral signatures and the columns of
H contain the proportions of the endmembers in each column of V, so the column-stochastic
assumption on H naturally holds. The nonnegativity of W follows from the nonnegativity of the
spectral signatures. We refer to the corresponding problem as simplex-structured nonnegative matrix
factorization with the β-divergence (β-SSNMF), which is formulated as follows:

\[
\min_{W \in \mathbb{R}^{F \times K}_+,\ H \in \mathbb{R}^{K \times N}_+} D_\beta(V|WH) \quad \text{such that } e^T h_j = 1 \ \text{ for } 1 \le j \le N, \tag{28}
\]
where $e$ is the vector of all ones of appropriate dimension. This is a particular case of (2) where

• the subsets $\mathcal{B}_j$ correspond to the columns of $H$, and there is no subset $\mathcal{K}_i$ (no constraint on $W$),

• $\gamma_j = e$ and $c_j = 1$ for $j = 1,2,\dots,N$.
Hence Algorithm 1 can be directly applied to (28).
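Concretely, for β ≤ 1, one H-update of β-SSNMF amounts to one call per column of H of the constrained_block_update sketch given after Algorithm 1 (assumed to be in scope here), with γ_j = e and c_j = 1:

```python
import numpy as np

def ssnmf_update_H(V, W, H, beta):
    """One H-update for (28), beta <= 1; reuses the constrained_block_update
    sketch given after Algorithm 1 (one unit-simplex block per column of H)."""
    WH = W @ H
    C = W.T @ (WH ** (beta - 2.0) * V)
    D = W.T @ (WH ** (beta - 1.0))
    K, N = H.shape
    H_new = np.empty_like(H)
    for n in range(N):
        H_new[:, n] = constrained_block_update(
            H[:, n], C[:, n], D[:, n], np.ones(K), 1.0, beta)
    return H_new
```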

Numerical experiments  Let us perform numerical experiments to evaluate the effectiveness of
Algorithm 1 on the simplex-structured β-NMF problem against existing methods. To the best of
our knowledge, the so-called group robust NMF (GR-NMF) algorithm¹ from [14] is the most recent
algorithm that is able to tackle problem (28) for the full range of β-divergences. The approach is not
based on Lagrange multipliers but introduces a change of variables for the matrix H. This approach,
initially used for NMF in [10], does not provide an auxiliary function for the subproblem in H and
resorts to a heuristic commonly used in NMF; see for example [34, 11]. Therefore, there is no guarantee
that the objective function is decreasing at each update of the abundance matrix, unlike with Algorithm 1.
We apply Algorithm 1 and GR-NMF on three widely used real hyperspectral data sets² [37]:

• Samson: 156 spectral bands with 95×95 pixels, containing mostly 3 materials (K = 3), namely
"Soil", "Tree" and "Water".

• Jasper Ridge: 198 spectral bands with 100×100 pixels, containing mostly 4 materials (K = 4),
namely "Road", "Soil", "Water" and "Tree".

• Cuprite: 188 spectral bands with 250×190 pixels, containing mostly 12 types of minerals (K = 12).
β-SSNMF has proven to be a powerful model to tackle blind HU, hence this comparative study
between Algorithm 1 and GR-NMF [14] focuses on convergence aspects, including the evolution of the
objective function and the runtime. The algorithms are compared³ for β ∈ {0, 1/2, 1, 3/2, 2}.

¹ https://www.irit.fr/~Cedric.Fevotte/extras/tip2015/code.zip
² http://lesun.weebly.com/hyperspectral-data-set.html
³ For β = 3/2, we had an error in our derivations, and use (26) with η(β) = 1 for β ∈ (1,2); see the discussion
after (26). However, the corresponding MU always decreases the objective function values (which we were monitoring),
although we do not have a theoretical justification for this. A possible approach to obtain such a result would be to
come up with a majorizer of the majorizer that has a closed-form minimizer given by (26) with η(β) = 1 for β ∈ (1,2).
To report the results, we use the relative objective function, denoted $\bar{F}(W,H)$ and defined as⁴
\[
\bar{F}(W,H) = \frac{D_\beta(V|WH)}{D_\beta(V|\bar{v}\,ee^T)},
\]
where $\bar{v} = \frac{e^T V e}{FN}$ is the average of the entries of $V$. The relative error $\bar{F}$ should be between 0 and 1:
it is equal to 0 for an exact decomposition with $V = WH$, and is equal to 1 for the trivial rank-one
approximation where all entries are equal to the average of the entries of $V$. This allows to meaningfully
interpret the results, especially since we consider multiple values of β in this comparative study. In
fact, the degree of homogeneity of the β-divergence is a function of β. For example, if all the entries
of the input matrix are multiplied by 10 while keeping the same NMF solution properly scaled, the
squared Frobenius error (β = 2) is multiplied by 100 while the IS-divergence (β = 0) is not affected.
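A small sketch of this relative objective, reusing the beta_divergence helper sketched in the introduction (assumed to be in scope); building the rank-one baseline matrix explicitly is a choice made here for clarity only.

```python
import numpy as np

def relative_objective(V, W, H, beta):
    # F_bar(W, H) = D_beta(V | WH) / D_beta(V | v_bar * e e^T),
    # where v_bar is the average entry of V.
    v_bar = np.mean(V)
    baseline = np.full(V.shape, v_bar, dtype=float)
    return (beta_divergence(V, W @ H, beta)
            / beta_divergence(V, baseline, beta))
```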
As for all tests performed in this paper, the algorithms are tested on a desktop computer with an
Intel Core [email protected] CPU and 32GB memory. The codes are written in MATLAB R2018a,
and available from https://sites.google.com/site/nicolasgillis/. For all simulations, the algorithms are
run for 20 random initializations of W and H (each entry sampled from the uniform distribution on
[0,1]). Table 2 reports the average and standard deviation of the runtime (in seconds) as well as the
final value of the relative objective function over these 20 runs for a maximum of 300 iterations.

Table 2: Runtime performance in seconds and final value of the relative objective function F̄_end(W,H)
for Algorithm 1 and GR-NMF, reported for β ∈ {0, 1/2, 1, 3/2, 2}. The table reports the average
and standard deviation over 20 random initializations with a maximum of 300 iterations for three
hyperspectral data sets. A bold entry indicates the best value for each experiment.

Algorithms     Samson                                Jasper Ridge                          Cuprite
               runtime (s.)    F̄_end(W,H)           runtime (s.)    F̄_end(W,H)           runtime (s.)     F̄_end(W,H)
β = 2
Algorithm 1    16.62±0.15      (1.89±0.04)10⁻³       22.86±0.08      (4.68±0.39)10⁻³       121.04±0.62      (0.98±0.06)10⁻³
GR-NMF         18.23±0.29      (1.91±0.05)10⁻³       25.32±0.16      (5.87±1.22)10⁻³       114.27±0.20      (1.29±0.07)10⁻³
β = 3/2
Algorithm 1    63.69±0.40      (2.52±0.78)10⁻³       89.23±0.30      (4.92±0.29)10⁻³       421.49±2.79      (1.54±0.07)10⁻³
GR-NMF         80.09±0.60      (2.60±0.63)10⁻³       112.72±0.67     (6.32±1.37)10⁻³       508.57±3.50      (2.01±0.09)10⁻³
β = 1
Algorithm 1    18.33±0.08      (3.54±0.27)10⁻³       24.82±0.35      (6.07±0.21)10⁻³       182.98±14.14     (2.07±0.09)10⁻³
GR-NMF         44.78±0.18      (3.77±0.38)10⁻³       62.83±0.76      (7.26±1.50)10⁻³       370.25±21.33     (2.67±0.10)10⁻³
β = 1/2
Algorithm 1    89.80±0.65      (7.21±0.75)10⁻³       126.43±0.61     (1.08±0.10)10⁻²       682.80±3.32      (3.13±0.15)10⁻³
GR-NMF         102.21±0.72     (6.93±0.88)10⁻³       141.75±0.69     (1.12±0.13)10⁻²       642.49±1.22      (3.14±0.14)10⁻³
β = 0
Algorithm 1    52.89±0.54      (4.60±0.66)10⁻²       69.59±0.44      (3.76±0.11)10⁻²       479.84±16.02     (4.39±0.31)10⁻³
GR-NMF         55.61±0.47      (4.22±0.79)10⁻²       77.87±0.63      (3.76±0.44)10⁻²       354.65±6.01      (3.35±0.10)10⁻³

We observe that Algorithm 1 outperforms the GR-NMF in terms of runtime and final values for the
relative objective function for all test cases except when β “ 0 for the Samson and Cuprite data sets.
In particular, for β “ 1, Algorithm 1 is up to 2.5 times faster than the GR-NMF. For the Cuprite data
set with β “ 1{2, Algorithm 1 and GR-NMF perform similarly. We also observe that the standard
⁴ For the Frobenius norm, that is, β = 2, the relative error is typically defined as $\frac{D_\beta(V|WH)}{D_\beta(V|0)}$, meaning that the trivial
solution used is the all-zero matrix. However, for other β-divergences, the value of $D_\beta(V|0)$ might not be defined; in
particular, for β ≤ 1 and $v_{fn} > 0$ for some $f,n$.

deviations obtained with Algorithm 1 are in general significantly smaller for all β, except for β = 0
for the Samson and Cuprite data sets.
In the supplementary material S1, we provide figures that show the evolution of the relative
objective function values with respect to iterations, and that confirm the observations above.

4 Showcase 2: minimum-volume KL-NMF


In this section, we showcase another important example of β-NMF with disjoint linear constraints,
namely the minimum-volume NMF with β-divergences (min-vol β-NMF) model. This model
is based on the minimization of a β-divergence including a penalty term promoting solutions with
minimum volume spanned by the columns of the matrix W. It is defined as follows:
\[
\min_{W \in \mathbb{R}^{F \times K}_+,\ H \in \mathbb{R}^{K \times N}_+} D_\beta(V|WH) + \lambda\, \mathrm{vol}(W) \quad \text{such that } W^T e = e, \tag{29}
\]

where λ is a penalty parameter, and vol(W) is a function measuring the volume spanned by the
columns of W. In [25], the authors use vol(W) = logdet(WᵀW + δI), where δ is a small positive
constant that prevents logdet(WᵀW) from going to −∞ when W tends to a rank-deficient matrix (that
is, when rank(W) < K). This model is particularly powerful as it leads to identifiability, which is
crucial in many applications such as hyperspectral imaging or audio source separation [13]. Indeed,
under some mild assumptions and in the exact case, the authors of [25] prove that (29) is able to identify
the ground-truth factors (W#, H#) that generated the input data V, in the absence of noise. In [25],
(29) is used for blind audio source separation. In a nutshell, blind audio source separation consists in
isolating and extracting unknown sources from an observation of their mixture recorded with a single
microphone⁵. Let us also mention that model (29) is well suited for hyperspectral imaging, as
discussed in [16].
In the next subsections, we show that we can tackle the min-vol β-NMF optimization problem
defined in (29) with the general framework presented in Section 2 in the case β = 1.

4.1 Problem formulation and algorithm


As the minimum-volume penalty in model (29) concerns matrix $W$ only, the main challenge is the
update of $W$. Indeed, the update of $H$ is simply the one from [23]. Let us therefore consider the
subproblem in $W$ for $H$ fixed:
\[
\min_{W \in \mathbb{R}^{F \times K}_+} D_\beta(V|WH) + \lambda \operatorname{logdet}(W^T W + \delta I) \quad \text{such that } e^T w_i = 1 \ \text{ for } 1 \le i \le K. \tag{30}
\]
Compared to the general model (2), we have that

• the subsets $\mathcal{K}_i$ correspond to the columns of $W$, and there is no subset $\mathcal{B}_j$,

• $\alpha_i = e$ and $b_i = 1$ for $1 \le i \le K$.

To upper bound logdet(WᵀW + δI) as required by (5) in Assumption 1, we majorize it using a convex
quadratic separable auxiliary function provided in [25, Eq. (3.6)], which is derived as follows.
⁵ We invite the interested reader to watch the video https://www.youtube.com/watch?v=1BrpxvpghKQ to see the
application of min-vol KL-NMF on the decomposition of a famous song from the city of Mons.

First, the concave function $\operatorname{logdet}(Q)$ for $Q \succ 0$ can be upper bounded using the first-order Taylor
approximation: for any $\tilde{Q} \succ 0$,
\[
\operatorname{logdet}(Q) \le \operatorname{logdet}(\tilde{Q}) + \langle \tilde{Q}^{-1}, Q - \tilde{Q} \rangle = \langle \tilde{Q}^{-1}, Q \rangle + \mathrm{cst},
\]
where cst is some constant independent of $Q$. For any $W, \widetilde{W}$, and denoting $\tilde{Q} = \widetilde{W}^T \widetilde{W} + \delta I \succ 0$, we
obtain
\[
\operatorname{logdet}(W^T W + \delta I) \le \langle \tilde{Q}^{-1}, W^T W \rangle + \mathrm{cst} = \operatorname{trace}(W \tilde{Q}^{-1} W^T) + \mathrm{cst},
\]
which is a convex quadratic and Lipschitz-smooth function of $W$. In fact, letting $\tilde{Q}^{-1} = DD^T$ be a
decomposition (such as Cholesky) of $\tilde{Q}^{-1} \succ 0$, we have $\operatorname{trace}(W \tilde{Q}^{-1} W^T) = \|WD\|_F^2$, from which (5)
can be derived easily; see [25] for the details. With this, and following our framework from Section 2,
we obtain the Lagrangian function
we obtain the Lagrangian function
¨ ˛
ÿ ÿ ÿˆ 1
˙
µ ¯ T
` ˘
G W |W “Ă G pwf |wrf q ` λ ˝ rf q ` c ` µ
l pwf |w ‚ wf ´ e , (31)
f f f
F

where wf P denotes the f -th row of W , G is given by (24), ¯l by [25, Eq. (3.6)] and derived as explained
above, and c is a constant. Let µ is the vector Lagrange multipliers of dimension K associated to each
linear constraint eT wi “ 1. Exactly as before (hence we omit the details here), Gµ is separable and,
given µ, one can compute the closed-form solution:
\[
W^\star(\mu) = \widetilde{W} \odot \frac{\left[ \left( \left[C + e\mu^T\right]^{.2} + S \right)^{.\frac{1}{2}} - \left( C + e\mu^T \right) \right]}{[D]}, \tag{32}
\]
where $C = e_{F,N} H^T - 4\lambda\, \widetilde{W} Y^-$, $D = 4\lambda\, \widetilde{W}(Y^+ + Y^-)$, and $S = 8\lambda\, \widetilde{W}(Y^+ + Y^-) \odot \left( \frac{[V]}{[\widetilde{W}H]} H^T \right)$,
with $Y = Y^+ - Y^- = (\widetilde{W}^T \widetilde{W} + \delta I)^{-1}$, $Y^+ = \max(Y,0) \ge 0$, $Y^- = \max(-Y,0) \ge 0$, and $e_{F,N}$ the
$F$-by-$N$ matrix of all ones. As proved in Proposition 2, the constraint $W^\star(\mu)^T e = e$ is satisfied
for a unique $\mu$ in $(-\infty, t)$, where $t = \infty$ in this case. We can therefore use a Newton-Raphson method
to find the $\mu_i$ with a quadratic rate of convergence; see Proposition 3. Algorithm 2 summarizes our
method to tackle (29).
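The following NumPy sketch mirrors the W-update quantities of Algorithm 2 below (lines 5-10) and the closed form (32); the root-finding step for μ (line 11) is not included, and the function names are our own.

```python
import numpy as np

def minvol_W_aux(V, W, H, lam, delta):
    """Auxiliary matrices C, S, D of Algorithm 2 (lines 5-10)."""
    Y = np.linalg.inv(W.T @ W + delta * np.eye(W.shape[1]))
    Yp, Ym = np.maximum(Y, 0.0), np.maximum(-Y, 0.0)   # Y = Yp - Ym
    e_FN = np.ones_like(V)
    C = e_FN @ H.T - 4.0 * lam * (W @ Ym)
    S = 8.0 * lam * (W @ (Yp + Ym)) * ((V / (W @ H)) @ H.T)
    D = 4.0 * lam * (W @ (Yp + Ym))
    return C, S, D

def minvol_W_closed_form(W, C, S, D, mu):
    # Closed form (32), given the vector mu of Lagrange multipliers
    # (one entry per column of W).
    B = C + np.outer(np.ones(C.shape[0]), mu)          # C + e mu^T
    return W * (np.sqrt(B**2 + S) - B) / D
```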

4.2 Numerical experiments


In this section we compare baseline KL-NMF (that is, the standard MU), the min-vol KL-NMF from
[25, Algorithm 1] that solves (29) using MU combined with line search (min-vol KL-NMF LS), and
Algorithm 2 applied to the spectrogram of two monophonic piano sequences considered in [25]. The
first audio sample is the first measure of “Mary had a little lamb”, a popular English song. The second
audio sample corresponds to the first 30 seconds of "Prelude and Fugue No. 1 in C major" by
Jean-Sebastien Bach, played by Glenn Gould⁶. We use the following three setups:

• Setup 1: sample "Mary had a little lamb" with K = 3, 200 iterations.

• Setup 2: sample "Mary had a little lamb" with K = 7, 200 iterations.

⁶ https://www.youtube.com/watch?v=ZlbK5r5mBH4
Algorithm 2 Min-vol KL-NMF

Input: A matrix V ∈ R^{F×N}, an initialization H ∈ R^{K×N}_+, an initialization W ∈ R^{F×K}_+, a factorization
rank K, a maximum number of iterations maxiter, and the parameters δ > 0 and λ > 0.
Output: A min-vol rank-K NMF (W, H) of V satisfying the constraints in (29).
 1: for it = 1 : maxiter do
 2:   % Update of matrix H
 3:   H ← H ⊙ [Wᵀ([V]/[WH])] / [Wᵀ e_{F,N}]
 4:   % Update of matrix W
 5:   Y ← (WᵀW + δI)^{−1}
 6:   Y⁺ ← max(Y, 0)
 7:   Y⁻ ← max(−Y, 0)
 8:   C ← e_{F,N} Hᵀ − 4λ(W Y⁻)
 9:   S ← 8λ W(Y⁺ + Y⁻) ⊙ (([V]/[WH]) Hᵀ)
10:   D ← 4λ W(Y⁺ + Y⁻)
11:   μ ← root(W*(μ)ᵀe = e) over R^K   % see (32) for the expression of W*(μ)
12:   W ← W ⊙ [([C + eμᵀ]^{.2} + S)^{.1/2} − (C + eμᵀ)] / [D]
13: end for

• Setup 3: "Prelude and Fugue No. 1 in C major" with K = 16, 300 iterations.

For each setup, the algorithms are run for the same 20 random initializations of W and H. Table 3
reports the average and standard deviation of the runtime (in seconds) over these 20 runs. Table 4
reports the average and standard deviation of the final values for the β-divergence (data fitting term)
and the objective function of (29) over these 20 runs for min-vol KL-NMF LS and Algorithm 2. For
this last comparison, the value of the penalty weight λ has been chosen so that KL-NMF leads to
reasonable solutions for W and H. More precisely, the values of λ are chosen so that the initial value
of $\frac{\lambda\,|\operatorname{logdet}(W^{(0)T} W^{(0)} + \delta I)|}{D_\beta(V|WH)}$ is equal to 0.1, 0.1 and 0.022 for setup 1, setup 2 and setup 3, respectively.

Table 3: Runtime performance in seconds of baseline KL-NMF, min-vol KL-NMF LS and Algorithm 2.
The table reports the average and standard deviation over 20 random initializations.

Algorithms                   runtime in seconds
                             setup 1       setup 2       setup 3
baseline KL-NMF              0.53±0.03     0.45±0.02     4.32±0.30
min-vol KL-NMF LS [25]       3.79±0.13     2.39±0.30     10.19±1.28
Algorithm 2                  0.58±0.03     0.66±0.03     4.80±0.38

We observe that the runtime of Algorithm 2 is close to that of the baseline KL-NMF algorithm, which
confirms the negligible cost of the Newton-Raphson steps to compute μ*, as discussed in Section 2.3.
On the other hand, since no line search is needed, we have a drastic acceleration, from 2x to 7x,
Table 4: Final values for D_β and the penalized objective Ψ from (29) obtained with min-vol KL-NMF LS
and Algorithm 2. The table reports the average and standard deviation over 20 random
initializations for three experimental setups. A bold entry indicates the best value for each experiment.

                         min-vol KL-NMF LS [25]     Algorithm 2
setup 1    D_{β,end}     (3.52±0.03)10³             (2.31±0.01)10³
           Ψ_end         (4.17±0.03)10³             (3.08±0.01)10³
setup 2    D_{β,end}     (3.54±0.03)10³             (1.77±0.02)10³
           Ψ_end         (4.42±0.04)10³             (2.87±0.02)10³
setup 3    D_{β,end}     (7.77±0.23)10³             (4.67±0.08)10³
           Ψ_end         (9.14±0.20)10³             (6.50±0.06)10³

compared to the backtracking line-search procedure integrated in min-vol KL-NMF LS [25]. Moreover,
we observe in Table 4 that Algorithm 2 outperforms min-vol KL-NMF LS in terms of final values for
the data fitting term and objective function values, with lower standard deviations.

5 Extension to disjoint quadratic constraints


Our general framework presented in Section 2 applies to β-NMF under disjoint linear equality
constraints with penalty terms satisfying Assumption 1; see problem (2). We have showcased our approach
on β-SSNMF in Section 3 and on min-vol KL-NMF under sum-to-one constraints on the columns of
W in Section 4. In this section, we show that the same framework can be extended to other simple
constraints, namely disjoint quadratic constraints.
We consider sparse β-NMF for β = 1 where the rows of H are penalized with the ℓ1 norm and
each column of W has a fixed ℓ2 norm. We show that MU satisfying the set of constraints can be
derived, which we apply to blind HU.

5.1 Problem formulation and algorithm


In this section, we consider the following model involving disjoint quadratic constraints, which we refer
to as hyperspheric-structured sparse β-NMF:
\[
\min_{W \in \mathbb{R}^{F \times K}_+,\ H \in \mathbb{R}^{K \times N}_+} D_\beta(V|WH) + \sum_{k=1}^K \lambda_k \|H(k,:)\|_1 \quad \text{such that } e^T w_i^{.(2)} = \rho \ \text{ for } 1 \le i \le K, \tag{33}
\]
where $\lambda_k$ is a penalty weight to control the sparsity of the $k$th row of $H$, and the quadratic constraints
require the columns of $W$ to lie on the surface of a hypersphere centered at the origin with radius
$\sqrt{\rho} > 0$. Without this normalization, the ℓ1-norm regularization would make $H$ tend to zero and $W$
grow to infinity.
As done before, we update $W$ and $H$ alternately. We tackle the subproblem in $H$ with $W$ fixed
based on the MU developed in [11], which are guaranteed to decrease the objective function:
\[
H = \tilde{H} \odot \frac{\left[ W^T\left( V \odot \left[W\tilde{H}\right]^{.(\beta-2)} \right) \right]}{\left[ W^T \left[W\tilde{H}\right]^{.(\beta-1)} + \lambda e^T \right]}, \tag{34}
\]
where $\lambda \in \mathbb{R}^K_+$ is the vector of penalty weights. It remains to compute an update for $W$. To do so, we
use the convex separable auxiliary function $G$ from [12], constructed at the current iterate $\widetilde{W}$, from
which we obtain, as before, the Lagrangian function
\[
G^\mu(W|\widetilde{W}) = \sum_f G(w^f|\tilde{w}^f) + \sum_k \lambda_k \|H(k,:)\|_1 + \mu^T\left( \sum_{f=1}^F \left( (w^f)^{.(2)} \right)^T - \rho e \right), \tag{35}
\]
where $\mu \in \mathbb{R}^K$ is the vector of Lagrange multipliers associated with the constraint $e^T W^{.(2)} = \rho e^T$.
Exactly as before (hence we omit the details here), given $\mu$, one can obtain a closed-form solution:
\[
W^\star(\mu) = \frac{\left[ \left( [C]^{.2} + 8\left(e\mu^T\right) \odot S \right)^{.\frac{1}{2}} - C \right]}{\left[ 4\, e\mu^T \right]}, \tag{36}
\]
where $C = e_{F,N} H^T$ and $S = \widetilde{W} \odot \left( \frac{[V]}{[\widetilde{W}H]} H^T \right)$. Let us now write the expression of the quadratic
constraint $\sum_f (W^\star(\mu)_{f,i})^2 - \rho = 0$ for one specific column of $W$, say the $i$th:

\[
r_i(\mu_i) := \sum_f \left( W^\star_{f,i}(\mu_i) \right)^2 - \rho = \sum_f \left( \frac{\sqrt{C_{f,i}^2 + 8\mu_i S_{f,i}} - C_{f,i}}{4\mu_i} \right)^2 - \rho = 0. \tag{37}
\]

Computing the Lagrange multiplier $\mu_i$ to satisfy the constraint requires computing the root of the
function $r_i(\mu_i)$. We can show that each $W^\star_{f,i}(\mu_i)$ in (36) is a monotone decreasing, nonnegative, convex
function over $(0,+\infty)$. Therefore, $\sum_f (W^\star_{f,i}(\mu_i))^2$ is also monotone decreasing and convex in $\mu_i$ over
$(0,+\infty)$. Indeed, let $g : \mathbb{R} \to \mathbb{R}_+$ be a monotone decreasing, nonnegative, convex function. If $g$ is
twice differentiable, then $(g^2)'' = 2(g')^2 + 2gg'' \ge 0$ since $g, g'' \ge 0$, and $(g^2)' = 2g'g \le 0$ since $g \ge 0$ and
$g' \le 0$ by hypothesis. We can therefore conclude that $r_i(\mu_i)$ is a monotone decreasing, convex function over
$(0,+\infty)$. Moreover, using l'Hôpital's rule, we have
\[
\lim_{\mu_i \to 0^+} \sum_f \left( W^\star_{f,i}(\mu_i) \right)^2 - \rho = +\infty \quad \text{and} \quad \lim_{\mu_i \to +\infty} \sum_f \left( W^\star_{f,i}(\mu_i) \right)^2 - \rho = -\rho < 0,
\]
since $\rho > 0$. Therefore, the root of $r_i(\mu_i)$ is unique over $(0,+\infty)$. We use a Newton-Raphson method
to solve the problem. Algorithm 3 summarizes our method.
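As a sketch of the root-finding step of Algorithm 3 below (line 8), the following solves $r_i(\mu_i) = 0$ from (37) for one column of W with a simple bracketing bisection instead of Newton-Raphson; the bracket-expansion heuristic is our own and relies on the behaviour of $r_i$ established above (decreasing on $(0,+\infty)$ with a unique root).

```python
import numpy as np

def solve_mu_column(c_col, s_col, rho, tol=1e-10, max_iter=200):
    """Find mu_i > 0 such that the corresponding column of W*(mu) in (36)
    has squared l2-norm rho, i.e., r_i(mu_i) = 0 as in (37)."""
    def r(mu):
        w = (np.sqrt(c_col**2 + 8.0 * mu * s_col) - c_col) / (4.0 * mu)
        return np.sum(w**2) - rho

    # Bracket the root: r_i is decreasing on (0, +inf) and tends to -rho < 0.
    lo, hi = 1e-8, 1.0
    while r(hi) > 0:
        hi *= 10.0
    while r(lo) < 0 and lo > 1e-300:
        lo /= 10.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if r(mid) > 0:
            lo = mid
        else:
            hi = mid
        if hi - lo <= tol * hi:
            break
    return 0.5 * (lo + hi)
```

In Algorithm 3, this would be called once per column i, with c_col = C(:, i) and s_col = S(:, i), and the resulting vector μ plugged into the closed-form W-update of line 10.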

5.2 Numerical experiments


In this section, we perform numerical experiments to evaluate the effectiveness of Algorithm 3 on the
HU problem. To the best of our knowledge, sparse β-NMF⁷ from [32] is the most recent algorithm
that is able to tackle problem (33) for the KL-divergence by integrating the ℓ2-normalization within
each update of matrix W. This approach is similar to that of [14] for β-SSNMF, that is, it uses
parametrization, and resorts to a heuristic with no guarantee on the decrease of the objective function.
We refer to this algorithm as β-SNMF.
We apply Algorithm 3 and β-SNMF [32] to the three real hyperspectral data sets detailed in
Section 3. This comparative study focuses on convergence aspects, including the evolution of the
⁷ http://www.jonathanleroux.org/software/sparseNMF.zip
Algorithm 3 Hyperspheric-structured sparse KL-NMF

Input: A matrix V ∈ R^{F×N}, an initialization H ∈ R^{K×N}_+, an initialization W ∈ R^{F×K}_+, a factorization
rank K, a maximum number of iterations maxiter, and a weight vector λ > 0.
Output: A sparse rank-K NMF (W, H) of V satisfying the constraints in (33).
 1: for it = 1 : maxiter do
 2:   % Update of matrix H
 3:   H ← H ⊙ [Wᵀ(V ⊙ [WH]^{.(β−2)})] / [Wᵀ[WH]^{.(β−1)} + λeᵀ]
 4:   % Update of matrix W
 5:   C ← e_{F,N} Hᵀ
 6:   S ← W ⊙ (([V]/[WH]) Hᵀ)
 7:   for i = 1 : K do
 8:     μ_i ← root(r_i(μ_i)) over (0, +∞)   % see Equation (37)
 9:   end for
10:   W ← [([C]^{.2} + 8(eμᵀ) ⊙ S)^{.1/2} − C] / [4eμᵀ]
11: end for

objective function and the runtime; we refer the interested reader to the Supplementary Material S2
for qualitative results on the ability of sparse β-NMF to decompose such images. For all simulations,
the algorithms are run for 20 random initializations of W and H; the entries of the penalty weight λ
have been set to 0.1, 0.05 and 0.05 for the Samson, Jasper Ridge and Cuprite data sets, respectively. In
order to fairly compare both algorithms, ρ has been set to 1, as β-SNMF considers an ℓ2-normalization
of the columns of W, and the entries of the weight vector λ in Algorithm 3 have the same value, as
β-SNMF requires using the same value for all rows of H. Table 5 reports the average and standard
deviation of the runtime (in seconds) as well as the final value of the objective function over these 20 runs
for a maximum of 300 iterations. Figure 1 displays the objective function values.

Table 5: Runtime performance in seconds and final value of the objective function Φ_end(W,H) for
Algorithm 3 and β-SNMF. The table reports the average and standard deviation over 20 random
initializations with a maximum of 300 iterations for three hyperspectral data sets. A bold entry indicates
the best value for each experiment.

Algorithms       Samson data set                  Jasper Ridge data set            Cuprite data set
                 runtime (s.)   Φ_end(W,H)        runtime (s.)    Φ_end(W,H)       runtime (s.)    Φ_end(W,H)
Algorithm 3      11.07±0.19     (2.68±0.00)10³    15.67±0.17      (4.65±0.00)10³   70.16±0.85      (2.12±0.00)10³
β-SNMF [32]      7.63±0.13      (2.68±0.00)10³    10.98±0.18      (4.71±0.00)10³   51.86±0.74      (2.18±0.00)10³

According to Table 5, we observe that Algorithm 3 outperforms the heuristic from [32]
in terms of the final value of the objective function, while β-SNMF has lower runtimes. Additionally,
based on Figure 1 (top row), we observe that Algorithm 3 converges on average faster than β-SNMF for all
the data sets in terms of iterations. However, β-SNMF has a lower computational cost per iteration.
Thus, we complete the comparison between both algorithms by imposing the same computational
time: we run Algorithm 3 for 300 iterations, record the computational time, and run β-SNMF for the
[Figure 1: six panels showing the averaged objective function for the Samson (left), Jasper Ridge (middle) and Cuprite (right) data sets; top row: with respect to iterations (up to 300); bottom row: with respect to runtime in seconds.]

Figure 1: Averaged objective functions over 20 random initializations obtained for Algorithm 3 with 300 iterations (red line with circle markers), and the heuristic β-SNMF from [32] (black dashed line).

Table 6 reports the average and standard deviation of the final value of the objective function over 20 runs in this setting. Figure 1 (bottom row) displays the objective function with respect to time for the three data sets. In this comparison, Algorithm 3 and the heuristic from [32] perform similarly, although Algorithm 3 reaches slightly better final objective function values. However, keep in mind that only Algorithm 3 is theoretically guaranteed to decrease the objective function.

Table 6: Final value of the objective function Φ_end(W, H) for Algorithm 3 and the heuristic from [32]. The table reports the average and standard deviation over 20 random initializations for an equal computational time that corresponds to 300 iterations of Algorithm 3. A bold entry indicates the best value for each experiment.

 Algorithms      Samson data set     Jasper Ridge data set     Cuprite data set
                 Φ_end(W,H)          Φ_end(W,H)                Φ_end(W,H)
 Algorithm 3     (2.68±0.00)·10³     (4.65±0.00)·10³           (2.12±0.00)·10³
 β-SNMF [32]     (2.68±0.00)·10³     (4.66±0.00)·10³           (2.15±0.00)·10³

6 Conclusion
In this paper we have presented a general framework to solve penalized β-NMF problems that in-
tegrates a set of disjoint constraints on the variables; see the general formulation (2). Using this
framework, we showed that we can derive algorithms that compete favorably with the state of the art
for a wide variety of β-NMF problems, such as the simplex-structured NMF and the minimum-volume

β-NMF with sum-to-one constraints on the columns of W. We have also shown how to extend the framework to non-linear disjoint constraints, with an application to a sparse β-NMF model for β = 1 where each column of W lies on a hypersphere.
Further work will focus on the possible extension of the method to non-disjoint constraints. Non-disjoint constraints lead to root-finding problems for polynomial equations in the Lagrangian multipliers, for which we hope to find conditions that ensure the uniqueness of the solution.
Another interesting direction of research would be to apply our framework to other NMF models. For example, in probabilistic latent semantic analysis/indexing (PLSA/PLSI), the model is the following: given a nonnegative matrix V such that e^T V e = 1 (this can be assumed w.l.o.g. by dividing the input matrix by e^T V e), solve
$$\max_{W \ge 0,\, H \ge 0,\, s \ge 0} \; \sum_{f,n} v_{fn} \log\big(W \,\mathrm{Diag}(s)\, H\big)_{fn} \quad \text{such that} \quad W^T e = e, \quad He = e, \quad s^T e = 1.$$
This model is equivalent to KL-NMF [9], with the additional constraint that e^T W He = e^T V e, and hence our framework is applicable to PLSA/PLSI. Such constraints also have applications in soft clustering contexts; see [35].

Acknowledgment
We would like to thank the Associate Editor and the reviewers for taking the time to carefully read the
paper and for the useful feedback that helped us improve the paper. We also thank Arthur Marmin
for identifying an error in our derivations when β P p1, 2q (indicated in red color in this version of the
manuscript).

A   Convexity, concavity and complete monotonicity for a convex-concave decomposition of the discrete β-divergence

The discrete β-divergence can always be expressed as the sum of convex, concave, and constant terms. In Table 7, we introduce a convex-concave decomposition of the β-divergence which slightly differs from the one given in [12, Table 1], in that ours contains no constant term d̄.

Decomposition d_β = ď + d̂:

 β ∈ (−∞,1)\{0}:   ď(x|y) = (1/(1−β)) x y^{β−1}                      d̂(x|y) = (1/β) y^β − (1/(β(1−β))) x^β
 β = 0:            ď(x|y) = x/y                                       d̂(x|y) = log(y/x) − 1
 β = 1:            ď(x|y) = −x log y                                  d̂(x|y) = y + x log x − x
 β ∈ (1,2):        ď(x|y) = (1/β) y^β − (1/(β−1)) x y^{β−1}           d̂(x|y) = (1/(β(β−1))) x^β
 β ∈ [2,+∞):       ď(x|y) = (1/β) y^β                                 d̂(x|y) = −(1/(β−1)) x y^{β−1} + (1/(β(β−1))) x^β

Table 7: Proposed concave-convex decomposition of the discrete β-divergence.

In Table 7, y ∈ (0, ∞), β is real valued and x ∈ (0, ∞). Further, β and x are considered as parameters, d_β, d̂ and ď being handled as univariate functions of y.
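As a sanity check on Table 7 (not part of the original paper), the following short script numerically verifies that ď + d̂ recovers the usual expression of the β-divergence for a few representative values of β; the function names and test points are ours.

```python
import numpy as np

def d_beta(x, y, beta):
    # Standard expression of the discrete beta-divergence.
    if beta == 0:
        return x / y - np.log(x / y) - 1
    if beta == 1:
        return x * np.log(x / y) - x + y
    return (x ** beta + (beta - 1) * y ** beta
            - beta * x * y ** (beta - 1)) / (beta * (beta - 1))

def d_check(x, y, beta):   # convex part in Table 7
    if beta == 0:
        return x / y
    if beta == 1:
        return -x * np.log(y)
    if beta < 1:
        return x * y ** (beta - 1) / (1 - beta)
    if beta < 2:
        return y ** beta / beta - x * y ** (beta - 1) / (beta - 1)
    return y ** beta / beta

def d_hat(x, y, beta):     # concave part in Table 7
    if beta == 0:
        return np.log(y / x) - 1
    if beta == 1:
        return y + x * np.log(x) - x
    if beta < 1:
        return y ** beta / beta - x ** beta / (beta * (1 - beta))
    if beta < 2:
        return x ** beta / (beta * (beta - 1))
    return -x * y ** (beta - 1) / (beta - 1) + x ** beta / (beta * (beta - 1))

x, y = 0.7, 1.3
for beta in (-0.5, 0.0, 1.0, 1.5, 3.0):
    gap = d_beta(x, y, beta) - (d_check(x, y, beta) + d_hat(x, y, beta))
    print(f"beta = {beta:4.1f}   |d_beta - (d_check + d_hat)| = {abs(gap):.1e}")
```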

Let us now recall the definition of a completely monotonic function f :


Definition 1. A function f is said to be completely monotonic (c.m.) on an interval I if f has
derivatives of all orders on I and p´1qn f pnq pxq ě 0 for x P I and n ě 0.

We can now introduce the properties of concavity, convexity and monotonicity for our convex-
concave formulation of the discrete β-divergence:

Proposition 4. Given ď(·|·) and d̂(·|·) as defined above, we have that

1. ď(x|y) is C^∞ and strictly convex on (0, ∞) for x > 0 and β ∈ ℝ;

2. d̂(x|y) is concave for x > 0 and β ∈ ℝ;

3. for all β < 2, ď′′(x|y) and −d̂′′(x|y) are c.m.

Proof. The proof is straightforward, given that ď(x|y) and d̂(x|y) linearly combine C^∞ functions on (0, ∞), and that on the same interval,

• log y is strictly concave;

• y^ν is strictly convex for all ν ∈ (−∞, 0) ∪ (1, ∞), and strictly concave for all ν ∈ (0, 1);

• y^ν is c.m. for all ν < 0.

According to the first two items of Proposition 4, ď and d̂ indeed yield a convex-concave decomposition of the β-divergence, which is a variant of [12, Table 1]. Let us remark that the successive minimization of an upper approximation of this convex-concave decomposition, following the methodology presented in [12], yields the usual multiplicative update scheme.
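The third item of Proposition 4 can also be spot-checked numerically. The following SymPy snippet (ours, and a spot check rather than a proof) verifies the alternating sign pattern of the successive derivatives of ď′′ and −d̂′′ at a few points, for β = 1/2 and x = 7/10.

```python
import sympy as sp

y = sp.symbols('y', positive=True)

def spot_check_cm(f, max_order=4, points=(0.3, 1.0, 2.7)):
    """Check (-1)^n f^(n)(y) >= 0 for n = 0..max_order at a few sample points."""
    for n in range(max_order + 1):
        dn = sp.diff(f, y, n)
        for p in points:
            if float(((-1) ** n) * dn.subs(y, p)) < -1e-12:
                return False
    return True

# Convex and concave parts from Table 7 for beta in (-inf,1)\{0}.
beta, x = sp.Rational(1, 2), sp.Rational(7, 10)
d_check = x * y ** (beta - 1) / (1 - beta)
d_hat = y ** beta / beta - x ** beta / (beta * (1 - beta))

print(spot_check_cm(sp.diff(d_check, y, 2)))   # True
print(spot_check_cm(-sp.diff(d_hat, y, 2)))    # True
```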

References
[1] M. Abdolali and N. Gillis, Simplex-structured matrix factorization: Sparsity-based identifiability and
provably correct algorithms, arXiv preprint arXiv:2007.11446, (2020).
[2] A. Ang and N. Gillis, Accelerating nonnegative matrix factorization algorithms using extrapolation,
Neural computation, 31 (2019), pp. 417–439.
[3] D. P. Bertsekas, Nonlinear Programming, Athena Scientific, Belmont, MA, second ed., 1999.
[4] J. M. Bioucas-Dias, A. Plaza, N. Dobigeon, M. Parente, Q. Du, P. Gader, and J. Chanussot,
Hyperspectral unmixing overview: Geometrical, statistical, and sparse regression-based approaches, IEEE
Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 5 (2012), pp. 354–379.
[5] E. C. Chi and T. G. Kolda, On tensors, sparsity, and nonnegative factorizations, SIAM Journal on
Matrix Analysis and Applications, 33 (2012), pp. 1272–1299.
[6] A. Cichocki, R. Zdunek, and S.-I. Amari, Hierarchical ALS algorithms for nonnegative matrix and
3D tensor factorization, in Lecture Notes in Computer Science, Vol. 4666, Springer, 2007, pp. 169–176.
[7] A. Cichocki, R. Zdunek, A. H. Phan, and S.-I. Amari, Nonnegative Matrix and Tensor Factoriza-
tions: Applications to Exploratory Multi-Way Data Analysis and Blind Source Separation, John Wiley &
Sons, 2009.
[8] O. Dikmen, Z. Yang, and E. Oja, Learning the information divergence, IEEE Transactions on Pattern
Analysis and Machine Intelligence, 37 (2015), pp. 1442–1454.
[9] C. Ding, T. Li, and W. Peng, On the equivalence between non-negative matrix factorization and prob-
abilistic latent semantic indexing, Computational Statistics & Data Analysis, 52 (2008), pp. 3913–3927.

[10] J. Eggert and E. Korner, Sparse coding and NMF, in IEEE International Joint Conference on Neural
Networks, vol. 4, 2004, pp. 2529–2533 vol.4.
[11] C. Févotte, N. Bertin, and J.-L. Durrieu, Nonnegative matrix factorization with the Itakura-Saito
divergence: With application to music analysis, Neural computation, 21 (2009), pp. 793–830.
[12] C. Févotte and J. Idier, Algorithms for nonnegative matrix factorization with the β-divergence, Neural
computation, 23 (2011), pp. 2421–2456.
[13] X. Fu, K. Huang, N. D. Sidiropoulos, and W.-K. Ma, Nonnegative matrix factorization for signal
and data analytics: Identifiability, algorithms, and applications., IEEE Signal Process. Mag., 36 (2019),
pp. 59–80.
[14] C. Févotte and N. Dobigeon, Nonlinear hyperspectral unmixing with robust nonnegative matrix fac-
torization, IEEE Transactions on Image Processing, 24 (2015), pp. 4810–4819.
[15] N. Gillis, The why and how of nonnegative matrix factorization, in Regularization, Optimization, Kernels,
and Support Vector Machines, J. Suykens, M. Signoretto, and A. Argyriou, eds., Machine Learning and
Pattern Recognition, Chapman & Hall/CRC, Boca Raton, Florida, 2014, ch. 12, pp. 257–291.
[16] N. Gillis, Nonnegative Matrix Factorization, SIAM, Philadelphia, 2020.
[17] N. Gillis and F. Glineur, Accelerated multiplicative updates and hierarchical ALS algorithms for non-
negative matrix factorization, Neural computation, 24 (2012), pp. 1085–1105.
[18] N. Guan, D. Tao, Z. Luo, and B. Yuan, NeNMF: An optimal gradient method for nonnegative matrix
factorization, IEEE Transactions on Signal Processing, 60 (2012), pp. 2882–2898.
[19] L. T. K. Hien, N. Gillis, and P. Patrinos, Inertial block proximal methods for non-convex non-smooth
optimization, in International Conference on Machine Learning, 2020, pp. 5671–5681.
[20] D. Hong, T. G. Kolda, and J. A. Duersch, Generalized canonical polyadic tensor decomposition,
SIAM Review, 62 (2020), pp. 133–163.
[21] P. Hoyer, Non-negative matrix factorization with sparseness constraints, J. Mach. Learn. Res., 5 (2004),
p. 1457–1469.
[22] J. Kim, Y. He, and H. Park, Algorithms for nonnegative matrix and tensor factorizations: A unified
view based on block coordinate descent framework, Journal of Global Optimization, 58 (2014), pp. 285–319.
[23] D. Lee and H. Seung, Algorithms for non-negative matrix factorization, in Proceedings of the 13th
International Conference on Neural Information Processing Systems, NIPS, MIT Press Cambridge, 2000,
pp. 535–541.
[24] D. D. Lee and H. S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature,
401 (1999), p. 788.
[25] V. Leplat, N. Gillis, and A. M. S. Ang, Blind audio source separation with minimum-volume beta-
divergence NMF, IEEE Transactions on Signal Processing, 68 (2020), pp. 3400–3410.
[26] C.-J. Lin, Projected gradient methods for nonnegative matrix factorization, Neural computation, 19 (2007),
pp. 2756–2779.
[27] W.-K. Ma, J. M. Bioucas-Dias, T. Chan, N. Gillis, P. Gader, A. J. Plaza, A. Ambikapathi,
and C. Chi, A signal processing perspective on hyperspectral unmixing: Insights from remote sensing,
IEEE Signal Processing Magazine, 31 (2014), pp. 67–81.
[28] K. Miller and G. Samko, Completely monotonic functions, Integral Transforms and Special Functions,
12 (2001), p. 389–402.
[29] K. Neymeyr and M. Sawall, On the set of solutions of the nonnegative matrix factorization problem,
SIAM Journal on Matrix Analysis and Applications, 39 (2018), pp. 1049–1069.

[30] J. Ortega and W. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables, Aca-
demic Press, New York, NY, 1970.
[31] Y. Qian, S. Jia, J. Zhou, and A. Robles-Kelly, Hyperspectral unmixing via ℓ1/2 sparsity-constrained nonnegative matrix factorization, IEEE Transactions on Geoscience and Remote Sensing, 49 (2011), pp. 4282–4297.
[32] J. L. Roux, F. J. Weninger, and J. R. Hershey, Sparse NMF – half-baked or well done?, tech. rep.,
Mitsubishi Electric Research Laboratories (MERL), 2015.
[33] Y. Sun, P. Babu, and D. Palomar, Majorization-minimization algorithms in signal processing, com-
munications, and machine learning, IEEE Transactions on Signal Processing, 65 (2017), pp. 794–816.
[34] T. Virtanen, Monaural sound source separation by nonnegative matrix factorization with temporal conti-
nuity and sparseness criteria, IEEE Transactions on Audio, Speech, and Language Processing, 15 (2007),
pp. 1066–1074.
[35] Z. Yang, J. Corander, and E. Oja, Low-rank doubly stochastic matrix decomposition for cluster anal-
ysis, The Journal of Machine Learning Research, 17 (2016), pp. 6454–6478.
[36] A. L. Yuille and A. Rangarajan, The concave-convex procedure, Neural Computation, 15 (2003),
pp. 915–936.
[37] F. Zhu, Hyperspectral unmixing: ground truth labeling, datasets, benchmark performances and survey,
arXiv preprint arXiv:1708.05125, (2017).

Supplementary Material
This supplementary material provides additional numerical experiments. In S1, we show the evolution of the error as a function of the iterations for Algorithm 1 and GR-NMF [14] for the tests performed in Section 3. In S2, we provide qualitative results obtained with Algorithm 3 on hyperspectral images.

S1. Evolution of the objective function for β-SSNMF


Figure 2 displays the evolution of the relative objective function values, that is, $\frac{D_{\beta}(V|WH)}{D_{\beta}(V|v\,ee^{T})}$, of β-SSNMF for Algorithm 1 and GR-NMF [14] on the experiments described in Section 3. As mentioned in the paper, Algorithm 1 performs better than GR-NMF [14], except for β = 0.

S2. Qualitative results obtained with Algorithm 3


In the following, we report qualitative results obtained with Algorithm 3 applied to three real HS data sets, namely the Samson, Jasper Ridge and Urban data sets. The first two data sets are detailed in Section 3. The Urban data set contains 162 spectral bands of 307×307 pixels, with mostly six endmembers. Note that the Cuprite data set is replaced by the Urban data set since the endmembers of Cuprite correspond to chemical components, which are more difficult to interpret visually, while the endmembers of the Urban data set are more easily interpretable.
As mentioned earlier, λ_k makes it possible to control the sparsity of the k-th row of H. Given a row H(k,:) ∈ ℝ_+^N of H, a meaningful way to measure its sparsity is to consider the following measure [21]:
$$\mathrm{sp}\big(H(k,:)\big) = \frac{\sqrt{N} - \frac{\|H(k,:)\|_1}{\|H(k,:)\|_2}}{\sqrt{N} - 1} \in [0,1]. \qquad (38)$$
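In code, this measure can be computed directly from (38); the function name and the small constant guarding against an all-zero row are our own additions.

```python
import numpy as np

def row_sparsity(h, eps=1e-12):
    """Sparsity measure (38) of a nonnegative row h of length N: close to 1 for
    a row with a single nonzero entry, close to 0 for a constant row."""
    n = h.size
    ratio = np.linalg.norm(h, 1) / (np.linalg.norm(h, 2) + eps)
    return (np.sqrt(n) - ratio) / (np.sqrt(n) - 1)

print(row_sparsity(np.array([0.0, 0.0, 5.0, 0.0])))  # close to 1
print(row_sparsity(np.ones(4)))                       # close to 0
```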
During the numerical experiments, we observed that Algorithm 3 gives better results when the initial values of λ are low and progressively increased. During a specified interval of iterations [it_min, it_max], the sparsity of the current iterate is measured using equation (38), and the entries of λ are dynamically updated (increased at a rate α > 1) to achieve a desired sparsity level sp. The dynamic update of the weight vector to reach the desired levels of sparsity has been activated in the iteration intervals [1, 150], [1, 150] and [1, 75] for Samson, Jasper and Urban, respectively. We report here the abundance maps of each endmember for two levels of average target sparsity, namely 0.25 and 0.5. For all simulations, the weight vector λ has been initialized to 0.05e, and the algorithm was run for 300 iterations.
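A possible implementation of this dynamic update is sketched below; since the text does not specify the rate α, the update frequency, nor whether the test is applied row-wise or on the average sparsity, those choices are ours.

```python
import numpy as np

def update_penalty_weights(H, lam, target_sp, it, it_min, it_max, alpha=1.05):
    """Increase lam_k by a factor alpha while row k of H is less sparse than the
    target level, only within the iteration window [it_min, it_max]."""
    if it < it_min or it > it_max:
        return lam
    n = H.shape[1]
    ratio = np.linalg.norm(H, ord=1, axis=1) / (np.linalg.norm(H, ord=2, axis=1) + 1e-12)
    sp = (np.sqrt(n) - ratio) / (np.sqrt(n) - 1)   # row-wise measure (38)
    lam = lam.copy()
    lam[sp < target_sp] *= alpha
    return lam
```

In Algorithm 3, such a routine would be called once per outer iteration with the current iterate H.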
We fix the number of endmembers to 3, 4 and 6 for the Samson, Jasper Ridge and Urban data sets, respectively; these values are commonly considered in the HS community [37]. Figures 3 to 5 picture the abundance estimation for the three data sets and the two levels of sparsity.
In order to validate the results obtained for the abundances of the endmembers, we display in Figures 6, 7 and 8 the ground-truth results obtained in [37]. Note that the grayscale used in [37] is the complement of the one used in Figures 3 to 5.
We observe that the abundance estimation gets significantly more accurate when the level of average sparsity is higher. For the Samson and Jasper Ridge data sets, the abundances of the endmembers are nicely estimated, while five endmembers out of six are well estimated for the Urban data set. The "Roof" is divided into "Roof1" and "Roof2/shadow" [31, 37]. In our simulations, it seems that the sixth endmember corresponds to some shadows with a small residual of "Grass", while the "Roof" is not split into two groups.

(a) Samson Data set   (b) Jasper Ridge Data set   (c) Cuprite Data set

Figure 2: Averaged relative objective function values over 20 random initializations obtained for Algorithm 1 (red line with circle markers) and GR-NMF (black dashed line) applied to the three data sets detailed in the text, for 300 iterations. The comparison is performed for different values of β, from top to bottom: β = 2, β = 3/2, and β = 1. Logarithmic scale for the y axis.

(a) Samson Data set

(b) Abundance map with average sparsity level set to 0.25

(c) Abundance map with average sparsity level set to 0.5

Figure 3: Samson data set (a) and results ((b) and (c)) for the abundance maps estimated using Algorithm 3 for the three endmembers: (1) Tree, (2) Soil and (3) Water. Two average sparsity levels are considered: 0.25 (b) and 0.5 (c).

(a) Jasper Ridge Data set

(b) Abundance map with average sparsity level set to 0.25

(c) Abundance map with average sparsity level set to 0.5

Figure 4: Jasper Ridge data set (a) and results ((b) and (c)) for the abundance maps estimated using Algorithm 3 for the four endmembers: (1) Road, (2) Tree, (3) Water and (4) Soil. Two average sparsity levels are considered: 0.25 (b) and 0.5 (c).

(a) Urban Data set

(b) Abundance map with average sparsity level set to 0.25

(c) Abundance map with average sparsity level set to 0.5

Figure 5: Urban data set (a) and results ((b) and (c)) for the abundance maps estimated using Algorithm 3 for the six endmembers: (1) Soil, (2) Tree, (3) Grass, (4) Roof, (5) Road/Asphalt and (6) Roof2/shadows. Two average sparsity levels are considered: 0.25 (b) and 0.5 (c).

Figure 6: Baseline abundances for the endmembers obtained for the Samson data, extracted from [37]: (1) Soil, (2) Tree and (3) Water.

Figure 7: Baseline abundances for the endmembers obtained for the Jasper Ridge data, extracted from [37]: (1) Road, (2) Soil, (3) Water and (4) Tree.

Figure 8: Baseline abundances for the endmembers obtained for the Urban data, extracted from [37]: (1) Asphalt, (2) Grass, (3) Tree, (4) Roof1, (5) Roof2/Shadow and (6) Soil.
