Multiplicative Updates for NMF with β-Divergences under Disjoint Equality Constraints
Abstract
Nonnegative matrix factorization (NMF) is the problem of approximating an input nonnegative
matrix, V , as the product of two smaller nonnegative matrices, W and H. In this paper, we intro-
duce a general framework to design multiplicative updates (MU) for NMF based on β-divergences
(β-NMF) with disjoint equality constraints, and with penalty terms in the objective function. By
disjoint, we mean that each variable appears in at most one equality constraint. Our MU satisfy
the set of constraints after each update of the variables during the optimization process, while
guaranteeing that the objective function decreases monotonically. We showcase this framework on
three NMF models, and show that it competes favorably with the state of the art: (1) β-NMF with
sum-to-one constraints on the columns of H, (2) minimum-volume β-NMF with sum-to-one constraints
on the columns of W, and (3) sparse β-NMF with ℓ2-norm constraints on the columns of W.
1 Introduction
Given a nonnegative matrix $V \in \mathbb{R}^{F\times N}_+$ and a factorization rank $K \ll \min(F,N)$, nonnegative matrix
factorization (NMF) aims to compute two nonnegative matrices, W with K columns and H with K
rows, such that $V \approx WH$ [24]. Over the last two decades, NMF has been shown to be a powerful tool for the
analysis of high-dimensional data. The main reason is that NMF automatically extracts sparse and
meaningful features from a set of nonnegative data vectors. NMF has been successfully used in many
applications such as image processing, text mining, hyperspectral imaging, blind source separation,
single-channel audio source separation, clustering and music analysis; see [15, 7, 5, 29, 13, 16] and the
references therein.
To compute W and H, the most standard approach is to solve the following optimization problem:
$$\min_{W \ge 0,\; H \ge 0}\; D(V|WH), \qquad (1)$$
where $D(V|WH) = \sum_{f,n} d\big(V_{fn}\,\big|\,[WH]_{fn}\big)$ with $d(x|y)$ a measure of distance between two scalars, and
$A \ge 0$ means that the matrix A is component-wise nonnegative. In this paper, we focus on β-NMF,
for which the measure of fit is the discrete β-divergence, denoted $d_\beta(x|y)$ and defined as
$$d_\beta(x|y) = \begin{cases} \dfrac{1}{\beta(\beta-1)}\Big(x^\beta + (\beta-1)\,y^\beta - \beta\, x\, y^{\beta-1}\Big) & \text{for } \beta \in \mathbb{R}\setminus\{0,1\},\\[2mm] x\log\dfrac{x}{y} - x + y & \text{for } \beta = 1,\\[2mm] \dfrac{x}{y} - \log\dfrac{x}{y} - 1 & \text{for } \beta = 0.\end{cases}$$
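For concreteness, here is a minimal NumPy sketch (our own code, not from the paper) of the element-wise β-divergence above, summed over all entries:

```python
import numpy as np

def beta_divergence(x, y, beta):
    """Sum of the element-wise beta-divergence d_beta(x|y) over all entries.

    x, y: nonnegative arrays of the same shape, with y > 0
    (and x > 0 when beta <= 1, as assumed in the paper).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if beta == 1:    # Kullback-Leibler divergence
        return np.sum(x * np.log(x / y) - x + y)
    if beta == 0:    # Itakura-Saito divergence
        return np.sum(x / y - np.log(x / y) - 1)
    # generic case, beta not in {0, 1}
    return np.sum((x ** beta + (beta - 1) * y ** beta
                   - beta * x * y ** (beta - 1)) / (beta * (beta - 1)))
```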
• Projection requires performing a line search to ensure the monotonicity of the algorithm, that is,
to ensure that the objective does not increase after each iteration, which may be computationally
heavy.
• Renormalization of the columns of W and the rows of H is only useful when each constraint
applies to a column of W or a row of H. It is not applicable for example for the sum-to-one
constraint on the columns of H mentioned above. Moreover, in the presence of regularization
terms in the objective function, it may destroy the monotonicity of the algorithm.
Another approach is to use parametrization. However, as far as we know, it does not guarantee the
monotonicity of the algorithm; see Section 3 for more details.
Outline and contribution In this paper, we introduce a general framework to design MU for β-
NMF with disjoint linear equality constraints, and with penalty terms in the objective function. By
disjoint, we mean that each variable appears in at most one equality constraint. This framework,
presented in Section 2, does not resort to projection, renormalization, or parametrization. Our MU
satisfy the set of constraints after each update of the variables during the optimization process, while
guaranteeing that the objective function decreases monotonically. This framework works as follows:
• First, as for the standard MU for β-NMF, we majorize the objective function using a separable
majorizer, that is, the majorizer is the sum of functions involving a single variable.
• Second, we construct the augmented Lagrangian for the majorizer. Because the majorizer is
separable, the problem can be decomposed into independent subproblems involving only variables
that occur in the same equality constraint, since the constraints are disjoint. For a fixed value of the
Lagrange multipliers, we prove that the solutions of these subproblems are unique under mild
conditions (Proposition 1). Moreover, they can be written in closed form via MU for specific values
of β and depending on the regularizer used (this is summarized in Table 1).
• Finally, we prove that, under mild conditions, there is a unique solution for the Lagrange mul-
tipliers so that the equality constraints are satisfied (Proposition 2). This allows us to apply
the Newton-Raphson method to compute the Lagrange multipliers while guaranteeing quadratic
convergence (Proposition 3).
We then showcase this framework on two NMF models, and show that it competes favorably with
the state of the art:
1. A β-NMF model with sum-to-one constraints on the columns of H, which we refer to as simplex-
structured β-NMF (Section 3), and
2. A minimum-volume β-NMF model with sum-to-one constraints on the columns of W (Section 4).
Finally, Section 5 shows that the framework can be extended to the case of quadratic disjoint con-
straints, which we showcase on sparse β-NMF with ℓ2-norm constraints on the columns of W.
• $\mathcal{B}_j \subseteq \{(k,n) \mid 1 \le k \le K,\ 1 \le n \le N\}$ for $j = 1, 2, \ldots, J$,
• $\mathcal{K}_u \cap \mathcal{K}_v = \emptyset$ for all $1 \le u, v \le I$ with $u \ne v$,
• $\mathcal{B}_p \cap \mathcal{B}_q = \emptyset$ for all $1 \le p, q \le J$ with $p \ne q$.
We now define penalized β-NMF with disjoint linear equality constraints as follows:
$$\min_{W\in\mathbb{R}^{F\times K}_+,\ H\in\mathbb{R}^{K\times N}_+}\; D_\beta(V|WH) + \lambda_1\Phi_1(W) + \lambda_2\Phi_2(H) \quad \text{such that} \quad \alpha_i^T W(\mathcal{K}_i) = b_i \text{ for } 1\le i\le I, \ \ \gamma_j^T H(\mathcal{B}_j) = c_j \text{ for } 1\le j\le J, \qquad (2)$$
where
• the penalty functions Φ1(W) and Φ2(H) are lower bounded and admit a particular upper approximation; see Assumption 1 below,
• λ1 and λ2 are the penalty weights (nonnegative scalars),
• $\alpha_i \in \mathbb{R}^{|\mathcal{K}_i|}_{++}$ ($1 \le i \le I$) and $\gamma_j \in \mathbb{R}^{|\mathcal{B}_j|}_{++}$ ($1 \le j \le J$) are vectors with positive entries. Note that if $\alpha_i$ or $\gamma_j$ contains zero entries, the corresponding indices can be removed from $\mathcal{K}_i$ and $\mathcal{B}_j$,
• $b_i$ ($1 \le i \le I$) and $c_j$ ($1 \le j \le J$) are positive scalars.
As for most NMF algorithms, we propose to resort to a block coordinate descent (BCD) framework
to solve problem (2): at each iteration we tackle two subproblems separately, one in W and the
other in H. The subproblems in W and H are essentially the same, by symmetry of the model,
since transposing the relation $X \approx WH$ gives $X^T \approx H^T W^T$. Hence, we may focus on solving the
subproblem in H only, namely
$$\min_{H\in\mathbb{R}^{K\times N}_+}\; D_\beta(V|WH) + \lambda_2\Phi_2(H) \quad \text{such that} \quad \gamma_j^T H(\mathcal{B}_j) = c_j \text{ for } 1\le j\le J. \qquad (3)$$
In order to solve (3), we will design MU based on the majorization-minimization (MM) framework [33], which is the standard in the NMF literature; see [12] and the references therein. Let us
briefly recall the high-level ideas to obtain MU via MM. Let us consider the general problem
$$\min_{h\in\mathcal{H}} f(h).$$
Given an initial iterate $\widetilde h \in \mathcal{H}$, MM generates a new iterate $\hat h \in \mathcal{H}$ that is guaranteed to decrease the
objective function, that is, $f(\hat h) \le f(\widetilde h)$. To do so, it uses the following two steps:
• Majorization: find a function that is an upper approximation of the objective and is tight at the
current iterate, which is referred to as a majorizer. More precisely, find a function $g(h|\widetilde h)$ such
that (i) $g(\widetilde h|\widetilde h) = f(\widetilde h)$ and (ii) $g(h|\widetilde h) \ge f(h)$ for all $h \in \mathcal{H}$.
• Minimization: compute $\hat h \in \arg\min_{h\in\mathcal{H}} g(h|\widetilde h)$, which guarantees $f(\hat h) \le g(\hat h|\widetilde h) \le g(\widetilde h|\widetilde h) = f(\widetilde h)$.
The MU for NMF are obtained using MM where the majorizer g is chosen separable, that is, $g(h|\widetilde h) = \sum_i g_i(h_i|\widetilde h_i)$ for some well-chosen univariate functions $g_i$; see (4) in the next section. This choice
typically makes the minimization of g admit a closed-form solution which is multiplicative, that is, it
has the form $\hat h = \widetilde h \odot c(\widetilde h)$, where $\odot$ is the component-wise product and $c(\widetilde h)$ is a nonnegative vector
that depends on $\widetilde h$. We will encounter several examples later in this paper.
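As an illustration of this multiplicative form, here is a minimal NumPy sketch (our own code, not the authors' implementation) of one such update on H for unpenalized, unconstrained β-NMF; the exponent follows the standard values from [12], with exponent 1 for β ∈ [1, 2]:

```python
import numpy as np

def mu_step_H(V, W, H, beta, eps=1e-16):
    """One multiplicative update of H for unpenalized, unconstrained beta-NMF,
    of the multiplicative form H <- H * c(H) described above."""
    WH = np.maximum(W @ H, eps)          # current approximation W @ H
    num = W.T @ (WH ** (beta - 2) * V)   # plays the role of the matrix C below
    den = W.T @ (WH ** (beta - 1))       # plays the role of the matrix D below
    if beta < 1:
        eta = 1.0 / (2.0 - beta)
    elif beta <= 2:
        eta = 1.0
    else:
        eta = 1.0 / (beta - 1.0)
    return H * (num / np.maximum(den, eps)) ** eta
```

A full β-NMF loop would simply alternate this update with the symmetric one on W.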
In summary, to derive MU for (3), we will follow the MM framework. We first provide a majorizer
for the objective of (3) in Section 2. This majorizer has the property of being separable in each entry of
H. In order to handle the equality constraints, we introduce Lagrange dual variables in Section 2.2,
and explain how they can be computed efficiently. This allows us to derive general MU in Section 2.3
in the case of non-penalized β-NMF under disjoint linear equality constraints. This is showcased on
simplex-structured β-NMF in Section 3. In Section 4, we will illustrate on minimum-volume KL-NMF
how to derive MU in the presence of penalty terms.
Majorizing Φ(H)  For the second term Φ(H), we rely on the following assumption on Φ.
Assumption 1. The function $\Phi : \mathbb{R}^{K\times N}_+ \to \mathbb{R}$ is lower bounded, and for any $\widetilde H \in \mathbb{R}^{K\times N}_+$ there exist
constants $L_{kn}$ ($1 \le k \le K$, $1 \le n \le N$) such that the inequality
$$\Phi(H) \le \Phi\big(\widetilde H\big) + \Big\langle \nabla\Phi\big(\widetilde H\big),\, H - \widetilde H\Big\rangle + \sum_{k,n}\frac{L_{kn}}{2}\big(H_{kn}-\widetilde H_{kn}\big)^2 \qquad (5)$$
holds for all $H \in \mathbb{R}^{K\times N}_+$.
otherwise we would have $\lim_{y\to\infty}\Phi\big(H + y\, e_i e_j^T\big) = -\infty$, where $e_i$ is the i-th unit vector, and this
would contradict the fact that Φ is bounded from below. This observation will be useful in the
proof of Proposition 1 and is only valid for the special case $L_{kn} = 0$ for all k, n.
Examples of such penalty functions include the sparsity-promoting regularizers $\Phi(H) = \|H\|_p^p = \sum_{k,n} H(k,n)^p$ for $0 < p \le 1$, since $H \ge 0$.
2. Lower-bounded functions with Lipschitz continuous gradient, for which (5) follows from the
descent lemma [3].
Examples of such penalty functions include smooth convex functions, for example any
quadratic penalty such as $\|AH - B\|_2^2$ for some matrices A and B, in which case $L_{kn} = \sigma_1(A)^2$
for all k, n. We will encounter another example later in the paper, namely $\operatorname{logdet}\big(HH^T + \delta I\big)$
for δ > 0, which allows one to minimize the volume of the rows of H; see Section 4 for the details
(note that we will use this regularizer for W).
Majorizing Ψ(H)  Combining (4) and (5), we can construct a majorizer for Ψ(H). Since both (4)
and (5) are separable in each entry of H, their combination is also separable into a sum of K × N
component-wise majorizers, up to an additive constant:
$$G\big(H|\widetilde H\big) = \sum_{n=1}^N\sum_{k=1}^K g\big(h_{kn}|\widetilde H\big) + C\big(\widetilde H\big), \qquad (7)$$
where
$$g\big(h_{kn}|\widetilde H\big) = \sum_{f=1}^F \frac{w_{fk}\,\widetilde h_{kn}}{\widetilde v_{fn}}\,\check d\Big(v_{fn}\,\Big|\,\widetilde v_{fn}\frac{h_{kn}}{\widetilde h_{kn}}\Big) + a_{kn}\, h_{kn}^2 + p_{kn}\, h_{kn}, \qquad (8)$$
$$C\big(\widetilde H\big) = \sum_{n=1}^N\sum_{f=1}^F\left(\hat d\big(v_{fn}|\widetilde v_{fn}\big) - \sum_{k=1}^K\Big(\hat d'\big(v_{fn}|\widetilde v_{fn}\big)\, w_{fk}\,\widetilde h_{kn} + a_{kn}\,\widetilde h_{kn}^2\Big)\right).$$
The variables in different sets $\mathcal{B}_j$ can be optimized independently, as they do not interact in the
majorizer nor in the constraints. Note that, for the entries of H that do not appear in any constraint,
the standard MU [12] can be used. For simplicity, let us fix j and denote $\mathcal{B} = \mathcal{B}_j$, $Q = |\mathcal{B}|$, $y = H(\mathcal{B}) \in \mathbb{R}^Q_+$, $\gamma = \gamma_j \in \mathbb{R}^Q_{++}$, and $c = c_j > 0$. The problems we need to solve have the form
$$\min_{y\in\mathcal{Y}}\; G\big(y|\widetilde H\big), \qquad (10)$$
where $\mathcal{Y} = \big\{y \in \mathbb{R}^Q_+ \mid \gamma^T y = c\big\}$ and
$$G\big(y|\widetilde H\big) = \sum_{(k,n)\in\mathcal{B}} g\big(h_{kn}|\widetilde H\big), \qquad (11)$$
where the component-wise majorizers $g(h_{kn}|\widetilde H)$ are defined by (8). Let us introduce a convenient
notation: for $q = 1, 2, \ldots, Q$, we denote by $(k(q), n(q))$ the q-th pair belonging to $\mathcal{B}$. Hence the
Lagrangian function of (11) can be written as
$$G^\mu\big(y|\widetilde H\big) = G\big(y|\widetilde H\big) - \mu\big(\gamma^T y - c\big) = \mu c + C\big(\widetilde H\big) + \sum_{q=1}^Q g^\mu\big(y_q|\widetilde H\big), \qquad (12)$$
where
$$g^\mu\big(y_q|\widetilde H\big) = g\big(y_q|\widetilde H\big) - \mu\gamma_q y_q = \sum_{f=1}^F \frac{w_{fk(q)}\,\widetilde y_q}{\widetilde v_{fn(q)}}\,\check d\Big(v_{fn(q)}\,\Big|\,\widetilde v_{fn(q)}\frac{y_q}{\widetilde y_q}\Big) + a_q\, y_q^2 + \big(p_q - \mu\gamma_q\big)\, y_q, \qquad (13)$$
$$p_q = \sum_{f=1}^F w_{fk(q)}\,\hat d'\big(v_{fn(q)}|\widetilde v_{fn(q)}\big) + \lambda\,\frac{\partial\Phi}{\partial y_q}\big(\widetilde H\big) - L_{k(q)n(q)}\,\widetilde y_q, \qquad (14)$$
Proposition 1. Let $q \in \{1, 2, \ldots, Q\}$. Assume that β < 2 and $\widetilde y_q, v_{fn(q)}, w_{fk(q)} > 0$ for all f. Moreover,
when β ≤ 1, assume that $\mu < \frac{p_q}{\gamma_q}$ for all q such that $a_q = 0$. Then there exists a unique minimizer
$y_q^\star(\mu)$ of $g^\mu\big(y_q|\widetilde H\big)$ in $(0, \infty)$.
Proof. According to Proposition 4 (see Appendix A), each $g^\mu$ is $C^\infty$ and strictly convex on $(0,\infty)$, so
its infimum is uniquely attained in the closure of $(0,\infty)$. We have to prove that it is reached neither
at 0 nor at ∞. On the one hand, from (13), we have
$$\big(g^\mu\big)'\big(y_q|\widetilde H\big) = \sum_{f=1}^F w_{fk(q)}\,\check d'\Big(v_{fn(q)}\,\Big|\,\widetilde v_{fn(q)}\frac{y_q}{\widetilde y_q}\Big) + 2 a_q y_q + p_q - \gamma_q\mu, \qquad (15)$$
so $\lim_{y_q\to 0^+}\big(g^\mu\big)'\big(y_q|\widetilde H\big) = -\infty$, which ensures that the infimum is not reached at 0. On the other
hand,
$$\lim_{y\to\infty}\check d'(x|y) = \begin{cases} 0 & \text{if } \beta \le 1,\\ \infty & \text{otherwise.}\end{cases} \qquad (16)$$
According to (15) and (16), the distinction must be made between two cases:
We just proved that, under mild conditions, each $g^\mu$ has a unique minimizer over $(0,\infty)$. However,
we assumed that the value of µ is fixed. Now, given $y^\star(\mu) = \big[y_1^\star(\mu), \ldots, y_Q^\star(\mu)\big]^T$, let us show that the
solution to $\gamma^T y^\star(\mu) = c$ is unique. The corresponding value of µ, which we denote $\mu^\star$, provides the
minimizer $y^\star(\mu^\star)$ of $G^\mu\big(y|\widetilde H\big)$ that satisfies the linear constraint $\gamma^T y^\star(\mu^\star) = c$. Moreover, $\mu^\star$ naturally
fulfills $\mu^\star < \frac{p_q}{\gamma_q}$ for all q when β ≤ 1 and $a_q = 0$, as required in Proposition 1.
Proposition 2. Assume that β < 2 and $\widetilde y_q, v_{fn(q)}, w_{fk(q)} > 0$ for all q, f. Then the scalar equation
$\gamma^T y^\star(\mu) = c$ in the variable µ admits a unique solution $\mu^\star$ in $(-\infty, t)$, where
$$t = \min_{1\le q\le Q} t_q, \quad \text{where} \quad t_q = \begin{cases} \dfrac{p_q}{\gamma_q} & \text{if } \beta \le 1 \text{ and } a_q = 0,\\[2mm] \infty & \text{otherwise.}\end{cases} \qquad (17)$$
Proof. For each q, the function
$$\gamma_q^{-1}\, g'\big(y_q|\widetilde H\big) = \gamma_q^{-1}\sum_{f=1}^F w_{fk(q)}\,\check d'\Big(v_{fn(q)}\,\Big|\,\widetilde v_{fn(q)}\frac{y_q}{\widetilde y_q}\Big) + \frac{2a_q}{\gamma_q}\, y_q + \frac{p_q}{\gamma_q} \qquad (18)$$
is strictly increasing on $(0,\infty)$ (since g is strictly convex) and one-to-one from $(0,\infty)$ to an open interval
$T_q = (t_q^-, t_q^+)$ where
$$t_q^- = \lim_{y_q\to 0}\gamma_q^{-1}\, g'\big(y_q|\widetilde H\big) = -\infty, \qquad (19)$$
$$t_q^+ = \lim_{y_q\to\infty}\gamma_q^{-1}\, g'\big(y_q|\widetilde H\big) = t_q. \qquad (20)$$
Coming back to the multivariate problem (10), we must find a value $\mu^\star$ of the Lagrange multiplier
such that the constraint $\gamma^T y^\star(\mu) = c$ is satisfied. Given (21), $\mu^\star$ is a solution of
$$\sum_{q=1}^Q \gamma_q\,\big(g'\big)^{-1}\big(\gamma_q\mu\big) = c. \qquad (22)$$
Each $g'\big(y_q|\widetilde H\big)$ being strictly increasing on $(0,\infty)$, $(g')^{-1}(\gamma_q\mu)$ is also strictly increasing (from $T_q$ to
$(0,\infty)$); this is a direct consequence of $(f^{-1})' = \frac{1}{f'\circ f^{-1}}$ where f is any strictly increasing function on
some interval. Finally, $\sum_{q=1}^Q \gamma_q\,(g')^{-1}(\gamma_q\mu)$ is strictly increasing from $\cap_{q=1}^Q T_q = (-\infty, t)$ to $(0,\infty)$, with
$t \ge 0$. Therefore, the solution $\mu^\star$ is unique.
Proposition 2 shows that the optimal Lagrange multiplier is the unique solution of (22). Finding
the solution of (22) is equivalent to finding the root of a function $r(\mu)$. We propose hereunder to
use a Newton-Raphson method to compute $\mu^\star$, and show that this method generates a sequence of
iterates $\mu_n$ that converges towards $\mu^\star$ at a quadratic speed.
Proposition 3. Assume that β < 2 and $\widetilde y_q, v_{fn(q)}, w_{fk(q)} > 0$ for all q, f. Let
$$r(\mu) = \sum_{q=1}^Q \gamma_q\,\big(g'\big)^{-1}\big(\gamma_q\mu\big) - c$$
for $\mu \in (-\infty, t)$, where t is defined in (17), and denote $\mu^\star$ the unique solution of $r(\mu) = 0$. From any
initial point $\mu_0 \in (\mu^\star, t)$, Newton-Raphson's iterates
$$\mu_{n+1} = \mu_n - \frac{r(\mu_n)}{r'(\mu_n)}$$
decrease monotonically and converge to $\mu^\star$ at a quadratic rate.
Proof. We already know that r is strictly increasing from $(-\infty, t)$ to $(0, \infty)$. Let us show that r
is also strictly convex. According to the third item of Proposition 4 in Appendix A, $\check d''(x|y)$ is
completely monotonic, so it is strictly decreasing in y. Equivalently, $\check d'(x|y)$ is strictly concave in
y, and each $g'$ is also strictly concave according to (18). Since the inverse of a strictly increasing,
strictly concave function f is strictly increasing and strictly convex, which is a direct consequence of
$(f^{-1})'' = -\frac{f''\circ f^{-1}}{(f'\circ f^{-1})^3}$, each $(g')^{-1}$ is strictly convex, and finally r is strictly convex.
For any $\mu_0 \in (\mu^\star, t)$, we have $r(\mu_0) > 0$, so $\mu_1 = \mu_0 - \frac{r(\mu_0)}{r'(\mu_0)} < \mu_0$. We also have $\mu_1 > \mu^\star$ as a
consequence of the strict convexity of r: the tangent line at $\mu_0$ lies below the graph of r, so its root $\mu_1$ satisfies $r(\mu_1) \ge 0$, hence $\mu_1 \ge \mu^\star$.
Discussion  At this point, we have derived an optimization framework to tackle problem (10). The
optimal Lagrange multiplier is determined before each majorization-minimization update using
a Newton-Raphson algorithm. However, such a formal solution is implementable if and only if each
$y_q^\star(\mu)$ can actually be computed as the minimizer of $g^\mu\big(y_q|\widetilde H\big)$ in $(0, \infty)$. In some cases, computing
$y_q^\star(\mu)$ is equivalent to extracting the roots of a polynomial of degree smaller than or equal to four, which
is possible in closed form. In other cases, we have to solve a polynomial equation of degree larger
                                             β ∈ (−∞,1)\{0}   β = 0   β = 1   β = 5/4   β = 4/3   β = 3/2   other β ∈ (1,2)
 No penalization, or L_kn = 0 for all k, n          1             1       1        4         3         2           O
 L_kn > 0 for some k, n                             O             3       2        O         O         3           O

Table 1: Cases where (21) can be computed in closed form. They are indicated by the degree of the
corresponding polynomial equation; otherwise the symbol O is used. The constants $L_{kn}$ are the ones
needed in Assumption 1 for the penalization functions Φ1(W) and Φ2(H); see (5).
than four, or even an equation that is not polynomial. Table 1 indicates the cases where a closed-form
solution is available, and hence when our framework can be efficiently implemented. We observe
that, without penalization or with penalization satisfying $L_{kn} = 0$ for all k, n (e.g., smooth concave
functions), the polynomial equation is of degree one for β ≤ 1, and of degree at most four for β ∈ {5/4, 4/3, 3/2}; hence a closed-form solution is always available in these cases. This particular case is discussed in the next section, and will be exemplified in
Section 3 with β-NMF with sum-to-one constraints on the columns of H. In Section 4, we will present
an important example with $L_{kn} > 0$ for all k, n and β = 1, namely minimum-volume KL-NMF.
2.3 MU for β-NMF with disjoint linear equality constraints without penalization
In this section, we derive an algorithm based on the general framework presented in the previous
section to tackle the β-NMF problem under disjoint linear equality constraints without penalization,
that is, problem (2) with λ1 = λ2 = 0. We consider this simplified case here as it allows us to provide
explicit MU for any value of β < 2; see the row 'No penalization' of Table 1. These updates satisfy
the constraints after each update of W or H, and monotonically decrease the objective function
$D_\beta(V|WH)$.
Let us then consider the subproblem of (2) over H when W is fixed and with λ2 = 0, that is,
$$\min_{H\in\mathbb{R}^{K\times N}_+}\; D_\beta(V|WH) \quad \text{such that} \quad \gamma_j^T H(\mathcal{B}_j) = c_j \text{ for } 1\le j\le J. \qquad (23)$$
Let us follow the framework presented above. First, an auxiliary function, which we denote $G(H|\widetilde H)$,
is constructed at the current iterate $\widetilde H$ so that it majorizes the objective for all H; it is defined as
follows:
$$G\big(H|\widetilde H\big) = \sum_{f,n}\left[\sum_k \frac{w_{fk}\,\widetilde h_{kn}}{\widetilde v_{fn}}\,\check d\Big(v_{fn}\,\Big|\,\widetilde v_{fn}\frac{h_{kn}}{\widetilde h_{kn}}\Big)\right] + \sum_{f,n}\left[\hat d'\big(v_{fn}|\widetilde v_{fn}\big)\sum_k w_{fk}\big(h_{kn}-\widetilde h_{kn}\big) + \hat d\big(v_{fn}|\widetilde v_{fn}\big)\right], \qquad (24)$$
where $\check d(\cdot|\cdot)$ and $\hat d(\cdot|\cdot)$ are given in Appendix A. Second, we need to minimize $G(H|\widetilde H)$ while imposing
the set of linear constraints $\gamma_j^T H(\mathcal{B}_j) = c_j$. The Lagrangian function of G is given by
$$G^\mu\big(H|\widetilde H\big) = G\big(H|\widetilde H\big) - \sum_{j=1}^J \mu_j\big[\gamma_j^T H(\mathcal{B}_j) - c_j\big], \qquad (25)$$
where the $\mu_j$ are the Lagrange multipliers associated with each linear constraint $\gamma_j^T H(\mathcal{B}_j) = c_j$. We observe
that $G^\mu$ in (25) is a separable majorizer in the variables H of the Lagrangian function $D_\beta(V|WH) - \sum_{j=1}^J \mu_j\big[\gamma_j^T H(\mathcal{B}_j) - c_j\big]$. Due to the disjointness of the subsets of variables $\mathcal{B}_j$ in (25), we only consider
the optimization over one specific subset $\mathcal{B}_j$. The minimizer (21) of $G^\mu\big(H(\mathcal{B}_j)|\widetilde H(\mathcal{B}_j)\big)$ has the
following component-wise expression:
$$H^\star(\mathcal{B}_j) = \widetilde H(\mathcal{B}_j) \odot \left(\frac{\big[C(\mathcal{B}_j)\big]}{\big[D(\mathcal{B}_j) - \mu_j\gamma_j\big]}\right)^{.\eta(\beta)}, \qquad (26)$$
where $C = W^T\big((WH)^{.(\beta-2)} \odot V\big)$, $D = W^T(WH)^{.(\beta-1)}$, $\eta(\beta) = \frac{1}{2-\beta}$ for β ≤ 1 and $\eta(\beta) = \frac{1}{\beta-1}$ for
β ≥ 2 [12, Table 2], $A \odot B$ (resp. $\frac{[A]}{[B]}$) is the Hadamard product (resp. division) between A and B, and $A^{.\alpha}$
is the element-wise exponentiation of A with exponent α. The case β ∈ (1, 2) is more difficult: we need to find a root of a
function of the form $\mu + b x^{\beta-1} - c x^{\beta-2} = 0$. For example, for $\beta = \frac{3}{2}$, we have $\mu + b x^{1/2} - c x^{-1/2} = 0$.
Using $y = \sqrt{x}$, and after simplifications, we obtain $\mu y + b y^2 - c = 0$, leading to the positive root
$x = \Big(\frac{\sqrt{\mu^2+4bc}-\mu}{2b}\Big)^2$.
According to Proposition 2, (26) is a well-defined update from $(0, \infty)^Q$ to itself, provided that $\mu_j$
is tuned to $\mu_j^\star$. This brings a structural guarantee that $D(\mathcal{B}_j) - \mu_j^\star\gamma_j$ cannot vanish.
Finally, we need to evaluate $\mu_j^\star$, which is uniquely determined on some interval $(-\infty, t)$ according
to Proposition 2. This amounts to solving $\gamma_j^T H^\star(\mathcal{B}_j) = c_j$. When β ∉ (1, 2), this is equivalent to finding
the root of the function
$$r_j(\mu_j) = \sum_{q=1}^{Q} \gamma_{j,q}\left[\widetilde H(\mathcal{B}_j)\odot\left(\frac{\big[C(\mathcal{B}_j)\big]}{\big[D(\mathcal{B}_j)-\mu_j\gamma_j\big]}\right)^{.\eta(\beta)}\right]_q - c_j, \qquad (27)$$
where $[A]_q$ denotes the q-th entry of the expression A. Indeed, $r_j(\mu_j)$ is a finite sum of elementary
rational functions of $\mu_j$, and each of them is an increasing, convex function of $\mu_j$ over $(-\infty, t_q)$ with
$t_q = \frac{D(\mathcal{B}_j)_q}{\gamma_{j,q}}$ for each q. It is even completely monotone for all $\mu_j$ in $(-\infty, t_q)$ because η(β) > 0 [28].
As a consequence, $r_j(\mu_j)$ is also a completely monotone, convex, increasing function of $\mu_j$ on $(-\infty, t)$,
where $t = \min_q(t_q)$. Finally, we can easily show that the function $r_j(\mu_j)$ changes sign on the interval
$(-\infty, t)$ by computing the two limits at the boundary of the interval. As $\mu^\star \in (-\infty, t)$, the update (26) is
nonnegative. To evaluate $\mu^\star$, we use a Newton-Raphson method, with any initial point $\mu_0 \in (\mu^\star, t)$,
with a quadratic rate of convergence as demonstrated in Proposition 3. Algorithm 1 summarizes our
method to tackle (2) for all the β-divergences with β ∉ (1, 2), which we refer to as the disjoint-constrained
β-NMF algorithm. The update for matrix W can be derived in the same way, by symmetry of the
problem. For β ∈ (1, 2), a case-by-case analysis could be carried out for the values of β for which the
minimizer of (25) takes a closed-form expression.
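To make the procedure concrete, here is a minimal NumPy sketch (our own naming and safeguards, not the authors' code) of the update (26) combined with the Newton-Raphson root finding on (27) for a single constraint block, in the non-penalized case with β ≤ 1, so that η(β) = 1/(2−β) and $t = \min_q D_q/\gamma_q$. The initial point is pushed toward t until r(µ) > 0, which places it in (µ*, t) as required by Proposition 3:

```python
import numpy as np

def constrained_mu_block(y_tilde, Cq, Dq, gamma, c, beta, tol=1e-6, maxit=100):
    """Update y = H(B_j) under gamma^T y = c, for beta <= 1 without penalization.

    y_tilde, Cq, Dq, gamma: positive vectors of length Q containing the entries
    of H~(B_j), C(B_j), D(B_j) and gamma_j; c: positive scalar.
    """
    eta = 1.0 / (2.0 - beta)

    def y_of(mu):                        # update (26) for a given multiplier mu
        return y_tilde * (Cq / (Dq - mu * gamma)) ** eta

    def r(mu):                           # constraint residual (27)
        return gamma @ y_of(mu) - c

    def r_prime(mu):                     # derivative of r, used by Newton-Raphson
        return np.sum(eta * gamma ** 2 * y_tilde * Cq ** eta
                      / (Dq - mu * gamma) ** (eta + 1))

    t = np.min(Dq / gamma)               # right end of the admissible interval
    mu = t - max(1.0, abs(t))            # initial guess below t ...
    for _ in range(200):                 # ... pushed toward t until r(mu) > 0,
        if r(mu) > 0:                    # so that mu lies in (mu*, t)
            break
        mu = t - 0.5 * (t - mu)
    for _ in range(maxit):               # Newton-Raphson, decreasing to mu*
        mu -= r(mu) / r_prime(mu)
        if abs(r(mu)) <= tol:
            break
    return y_of(mu), mu
```

The same scheme applies to the blocks of W by symmetry of the problem.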
Remark 1. As noted above, the denominators of (26) and (27) will be different from zero. This
follows notably from our assumption that $(W, H) > 0$; see Propositions 1, 2 and 3. This is a standard
assumption in the NMF literature: the entries of W and H are initialized with positive values, which
ensures that all iterates remain positive. This is important because the MU cannot change an entry
equal to zero [26]; this is the so-called zero-locking phenomenon. This implies that C and D in (26)
and (27) are positive matrices (as long as V has at least one nonzero entry per row and column). In
practice, one should however be careful because some entries of W and H can numerically be set to
zero (because of finite precision). Hence, in our implementation, we use the machine precision as a
lower bound for the entries of W and H, as recommended in [17].
Computational cost  The computational cost of Algorithm 1 is asymptotically equivalent to that of the
standard MU for β-NMF, that is, it requires O(FNK) operations per iteration. Indeed, the complexity
is mainly driven by the matrix products required to compute C and D; see (26). To compute the roots
of (27) corresponding to H using Newton-Raphson, each iteration requires computing $r_j(\mu_j)/r_j'(\mu_j)$
for all j, which requires O(KN) operations (when every entry of H appears in a constraint). Finding
the roots therefore requires O(KN) operations times the number of Newton-Raphson iterations. By
symmetry, it requires O(KF) operations to compute the roots corresponding to W. Because of the
quadratic convergence, the number of iterations required for the convergence of the Newton-Raphson
method is typically small, namely between 10 and 100 in our experiments using the stopping criterion
$|r(\mu_j)| \le 10^{-6}$ for all j. Therefore, in practice, the overall complexity of Algorithm 1 is dominated by
the matrix products that require O(FNK) operations. The same conclusions apply to the algorithms
presented in Sections 3, 4 and 5, and this will be confirmed by our numerical experiments.
To understand the underlying significance of SSMF, it is useful to give more insight into the
research topic for which important SSMF techniques were initially developed, namely blind hyperspectral
unmixing (HU), a main research topic in remote sensing. The task of blind HU is to
decompose a remotely sensed hyperspectral image into endmember spectral signatures and the corresponding
abundance maps with limited prior information, usually the only known information being
the number of endmembers. In this context, the columns of W correspond to the endmember spectral
signatures and the columns of H contain the proportions of the endmembers in each column of V, so
the column-stochastic assumption on H naturally holds. The nonnegativity of W follows from the
nonnegativity of the spectral signatures. We refer to the corresponding problem as simplex-structured
nonnegative matrix factorization with the β-divergence (β-SSNMF), and it is formulated as follows:
$$\min_{W\in\mathbb{R}^{F\times K}_+,\ H\in\mathbb{R}^{K\times N}_+}\; D_\beta(V|WH) \quad \text{such that} \quad e^T H = e^T, \qquad (28)$$
where e is the vector of all ones of appropriate dimension. This is a particular case of (2) where
• the subsets $\mathcal{B}_j$ correspond to the columns of H, and there is no subset $\mathcal{K}_i$ (no constraint on W),
• $\gamma_j = e$ and $c_j = 1$ for $j = 1, 2, \ldots, N$.
Hence Algorithm 1 can be directly applied to (28).
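In the simplex-structured case, each column of H forms its own constraint block with $\gamma_j = e$ and $c_j = 1$, so the block update sketched at the end of Section 2.3 can simply be applied column by column; a minimal usage sketch (reusing that hypothetical constrained_mu_block helper, which is our own code and not the authors' implementation) could look as follows:

```python
import numpy as np

def ssnmf_update_H(V, W, H, beta, eps=1e-16):
    """One MU of H for beta-SSNMF (28): each column of H sums to one."""
    WH = np.maximum(W @ H, eps)
    C = W.T @ (WH ** (beta - 2) * V)
    D = W.T @ (WH ** (beta - 1))
    gamma = np.ones(H.shape[0])
    H_new = np.empty_like(H)
    for n in range(H.shape[1]):          # one disjoint constraint per column
        H_new[:, n], _ = constrained_mu_block(H[:, n], C[:, n], D[:, n],
                                              gamma, 1.0, beta)
    return H_new
```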
• Jasper Ridge: 198 spectral bands with 100×100 pixels, containing mostly 4 materials (K = 4),
namely "Road", "Soil", "Water" and "Tree".
• Cuprite: 188 spectral bands with 250×190 pixels, containing mostly 12 types of minerals (K = 12).
β-SSNMF has shown itself to be a powerful model for blind HU; hence this comparative study
between Algorithm 1 and GR-NMF [14] focuses on the convergence aspects, including the evolution
of the objective function and the runtime. The algorithms are compared³ for β ∈ {0, 1/2, 1, 3/2, 2}. To
¹ https://www.irit.fr/~Cedric.Fevotte/extras/tip2015/code.zip
² http://lesun.weebly.com/hyperspectral-data-set.html
³ For β = 3/2, we had an error in our derivations, and use (26) with η(β) = 1 for β ∈ (1, 2); see the discussion
after (26). However, the corresponding MU always decreases the objective function values (which we were monitoring),
although we do not have a theoretical justification for this. A possible approach to obtain such a result would be to
come up with a majorizer of the majorizer that has a closed-form minimizer given by (26) with η(β) = 1 for β ∈ (1, 2).
report the results, we use the relative objective function, denoted $\bar F(W, H)$ and defined as⁴
$$\bar F(W, H) = \frac{D_\beta(V|WH)}{D_\beta\big(V\,\big|\,v\, ee^T\big)},$$
where $v = \frac{e^T V e}{FN}$ is the average of the entries of V. The relative error $\bar F$ should be between 0 and 1:
it is equal to 0 for an exact decomposition with V = WH, and is equal to 1 for the trivial rank-one
approximation where all entries are equal to the average of the entries of V. This allows us to meaningfully
interpret the results, especially since we consider multiple values of β in this comparative study. In
fact, the degree of homogeneity of the β-divergence is a function of β. For example, if all the entries
of the input matrix are multiplied by 10 while the NMF solution is rescaled accordingly, the
squared Frobenius error (β = 2) is multiplied by 100 while the IS-divergence (β = 0) is not affected.
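For reference, a small sketch of the relative objective (reusing the beta_divergence helper sketched in the introduction; the function name is ours):

```python
import numpy as np

def relative_objective(V, W, H, beta):
    """Relative objective D_beta(V|WH) / D_beta(V| v * e e^T) defined above."""
    F, N = V.shape
    v = V.mean()                         # average of the entries of V
    baseline = np.full((F, N), v)        # trivial rank-one approximation
    return beta_divergence(V, W @ H, beta) / beta_divergence(V, baseline, beta)
```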
As for all tests performed in this paper, the algorithms are tested on a desktop computer with
an Intel Core i7 CPU @ 3.00 GHz and 32GB of memory. The codes are written in MATLAB R2018a,
and available from https://sites.google.com/site/nicolasgillis/. For all simulations, the algorithms are
run for 20 random initializations of W and H (each entry sampled from the uniform distribution in
[0, 1]). Table 2 reports the average and standard deviation of the runtime (in seconds) as well as the final
value of the relative objective function over these 20 runs for a maximum of 300 iterations.
Table 2: Runtime performance in seconds and final value of the relative objective function $\bar F_{\mathrm{end}}(W, H)$
for Algorithm 1 and GR-NMF, reported for β ∈ {0, 1/2, 1, 3/2, 2}. The table reports the average
and standard deviation over 20 random initializations with a maximum of 300 iterations for three
hyperspectral data sets. A bold entry indicates the best value for each experiment.
We observe that Algorithm 1 outperforms the GR-NMF in terms of runtime and final values for the
relative objective function for all test cases except when β “ 0 for the Samson and Cuprite data sets.
In particular, for β “ 1, Algorithm 1 is up to 2.5 times faster than the GR-NMF. For the Cuprite data
set with β “ 1{2, Algorithm 1 and GR-NMF perform similarly. We also observe that the standard
⁴ For the Frobenius norm, that is, β = 2, the relative error is typically defined as $\frac{D_\beta(V|WH)}{D_\beta(V|0)}$, meaning that the trivial
solution used is the all-zero matrix. However, for other β-divergences, the value of $D_\beta(V|0)$ might not be defined; in
particular, for β ≤ 1 and $v_{fn} > 0$ for some f, n.
deviations obtained with Algorithm 1 are in general significantly smaller for all β, except for β “ 0
for the Samson and Cuprite data sets.
In the supplementary material S1, we provide figures that show the evolution of the relative
objective function values with respect to iterations, and that confirm the observations above.
The minimum-volume (min-vol) β-NMF model considered here is
$$\min_{W\in\mathbb{R}^{F\times K}_+,\ H\in\mathbb{R}^{K\times N}_+}\; D_\beta(V|WH) + \lambda\,\mathrm{vol}(W) \quad \text{such that} \quad W^T e = e, \qquad (29)$$
where λ is a penalty parameter, and vol(W) is a function measuring the volume spanned by the
columns of W. In [25], the authors use $\mathrm{vol}(W) = \operatorname{logdet}(W^TW + \delta I)$, where δ is a small positive
constant that prevents $\operatorname{logdet}(W^TW)$ from going to −∞ when W tends to a rank-deficient matrix (that
is, when rank(W) < K). This model is particularly powerful as it leads to identifiability, which is
crucial in many applications such as hyperspectral imaging or audio source separation [13]. Indeed,
under some mild assumptions and in the exact case, the authors of [25] prove that (29) is able to identify
the ground-truth factors $(W^\#, H^\#)$ that generated the input data V, in the absence of noise. In [25],
(29) is used for blind audio source separation. In a nutshell, blind audio source separation consists in
isolating and extracting unknown sources based on an observation of their mix recorded with a single
microphone⁵. Let us mention that model (29) is also well suited for hyperspectral imaging, as
discussed in [16].
In the next subsections, we show that we can tackle the min-vol β-NMF optimization problem
defined in (29) with the general framework presented in Section 2 in the case β = 1.
To upper bound $\operatorname{logdet}(W^TW + \delta I)$ as required by (5) in Assumption 1, we majorize it using a convex
quadratic separable auxiliary function provided in [25, Eq. (3.6)], which is derived as follows.
⁵ We invite the interested reader to watch the video https://www.youtube.com/watch?v=1BrpxvpghKQ to see the
application of min-vol KL-NMF on the decomposition of a famous song from the city of Mons.
First, the concave function $\operatorname{logdet}(Q)$ for $Q \succ 0$ can be upper bounded using the first-order Taylor
approximation: for any $\widetilde Q \succ 0$,
$$\operatorname{logdet}(Q) \le \operatorname{logdet}\big(\widetilde Q\big) + \big\langle \widetilde Q^{-1},\, Q - \widetilde Q\big\rangle = \big\langle \widetilde Q^{-1},\, Q\big\rangle + \mathrm{cst}.$$
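Applying this bound with $Q = W^TW + \delta I$ and $\widetilde Q = \widetilde W^T\widetilde W + \delta I$ (a step we spell out here for completeness; it is implicit in the derivation of [25]) gives
$$\operatorname{logdet}\big(W^TW+\delta I\big) \;\le\; \big\langle Y,\, W^TW\big\rangle + \mathrm{cst}, \qquad \text{with } Y = \big(\widetilde W^T\widetilde W + \delta I\big)^{-1},$$
which is where the matrix Y used in (32) and in Algorithm 2 comes from; the remaining term $\langle Y, W^TW\rangle$ is then majorized by the separable quadratic bound of [25, Eq. (3.6)].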
where $w_f$ denotes the f-th row of W, G is given by (24), $\bar\ell$ by [25, Eq. (3.6)] and derived as explained
above, and c is a constant. Let µ be the vector of Lagrange multipliers of dimension K associated with each
linear constraint $e^T w_i = 1$, where $w_i$ is the i-th column of W. Exactly as before (hence we omit the details here), $G^\mu$ is separable and,
given µ, one can compute the closed-form solution:
$$W^\star(\mu) = \widetilde W \odot \frac{\Big[\big[C + e\mu^T\big]^{.2} + S\Big]^{.\frac{1}{2}} - \big(C + e\mu^T\big)}{[D]}, \qquad (32)$$
where $C = e_{F,N}H^T - 4\lambda\,\widetilde W Y^-$, $D = 4\lambda\,\widetilde W\big(Y^+ + Y^-\big)$, and $S = 8\lambda\,\widetilde W\big(Y^+ + Y^-\big) \odot \Big(\frac{[V]}{[\widetilde W H]}H^T\Big)$,
with $Y = Y^+ - Y^- = \big(\widetilde W^T\widetilde W + \delta I\big)^{-1}$, $Y^+ = \max(Y, 0) \ge 0$ and $Y^- = \max(-Y, 0) \ge 0$, and $e_{F,N}$ is
the F-by-N matrix of all ones. As proved in Proposition 2, the constraint $W^\star(\mu)^T e = e$ is satisfied
for a unique µ in $(-\infty, t)$ where t = ∞ in this case. We can therefore use a Newton-Raphson method
to find each $\mu_i$ with a quadratic rate of convergence; see Proposition 3. Algorithm 2 summarizes our
method to tackle (29).
• Setup #1: sample "Mary had a little lamb" with K = 3, 200 iterations.
• Setup #2: sample "Mary had a little lamb" with K = 7, 200 iterations.
⁶ https://www.youtube.com/watch?v=ZlbK5r5mBH4
Algorithm 2 Min-vol KL-NMF
Input: A matrix $V \in \mathbb{R}^{F\times N}$, an initialization $H \in \mathbb{R}^{K\times N}_+$, an initialization $W \in \mathbb{R}^{F\times K}$, a factorization
rank K, a maximum number of iterations maxiter, and the parameters δ > 0 and λ > 0.
Output: A min-vol rank-K NMF (W, H) of V satisfying the constraints in (29).
1: for it = 1 : maxiter do
2:    % Update of matrix H
3:    $H \leftarrow H \odot \Big[W^T\Big(\frac{[V]}{[WH]}\Big)\Big] \,/\, \big[W^T e_{F,N}\big]$
4:    % Update of matrix W
5:    $Y \leftarrow \big(W^TW + \delta I\big)^{-1}$
6:    $Y^+ \leftarrow \max(Y, 0)$
7:    $Y^- \leftarrow \max(-Y, 0)$
8:    $C \leftarrow e_{F,N}H^T - 4\lambda\,(W Y^-)$
9:    $S \leftarrow 8\lambda\, W\big(Y^+ + Y^-\big) \odot \Big(\frac{[V]}{[WH]}H^T\Big)$
10:   $D \leftarrow 4\lambda\, W\big(Y^+ + Y^-\big)$
11:   $\mu \leftarrow \mathrm{root}\big(W^\star(\mu)^T e = e\big)$ over $\mathbb{R}^K$   % see (32) for the expression of $W^\star(\mu)$
12:   $W \leftarrow W \odot \Big(\Big[\big[C + e\mu^T\big]^{.2} + S\Big]^{.\frac{1}{2}} - \big(C + e\mu^T\big)\Big) \,/\, [D]$
13: end for
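Step 11 of Algorithm 2 amounts to K scalar root-finding problems, one per column of W. Based on (32), the residual $r_i(\mu_i) = \sum_f W^\star_{f,i}(\mu_i) - 1$ is decreasing and convex in $\mu_i$, so a Newton-Raphson iteration started at a point where $r_i > 0$ increases monotonically towards the root; a minimal sketch (our own naming, not the authors' code) is:

```python
import numpy as np

def minvol_column_multiplier(w_tilde, c, s, d, tol=1e-6, maxit=100):
    """Root of r_i(mu) = sum_f W*_{f,i}(mu) - 1 for one column of W, cf. (32),
    with W*_{f,i}(mu) = w~_{f,i} * (sqrt((c_f + mu)^2 + s_f) - (c_f + mu)) / d_f.

    w_tilde, c, s, d: the i-th columns of W~, C, S and D (w_tilde, s, d > 0).
    """
    def w(mu):
        z = c + mu
        return w_tilde * (np.sqrt(z ** 2 + s) - z) / d

    def r(mu):
        return np.sum(w(mu)) - 1.0

    def r_prime(mu):
        z = c + mu
        return np.sum(w_tilde * (z / np.sqrt(z ** 2 + s) - 1.0) / d)

    mu = 0.0
    for _ in range(200):                 # move left until r(mu) > 0
        if r(mu) > 0:                    # (r -> +inf as mu -> -inf)
            break
        mu -= max(1.0, abs(mu))
    for _ in range(maxit):               # Newton-Raphson, increasing to the root
        mu -= r(mu) / r_prime(mu)
        if abs(r(mu)) <= tol:
            break
    return mu
```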
• Setup #3: "Prelude and Fugue No. 1 in C major" with K = 16, 300 iterations.
For each setup, the algorithms are run for the same 20 random initializations of W and H. Table 3
reports the average and standard deviation of the runtime (in seconds) over these 20 runs. Table 4
reports the average and standard deviation of the final values of the β-divergence (data fitting term)
and of the objective function of (29) over these 20 runs for min-vol KL-NMF LS and Algorithm 2. For
this last comparison, the value of the penalty weight λ has been chosen so that min-vol KL-NMF leads to
reasonable solutions for W and H. More precisely, the values of λ are chosen so that the initial value
of $\frac{\lambda\,\big|\operatorname{logdet}\big(W^{(0)T}W^{(0)}+\delta I\big)\big|}{D_\beta(V|W^{(0)}H^{(0)})}$ is equal to 0.1, 0.1 and 0.022 for setup #1, setup #2 and setup #3, respectively.
Table 3: Runtime performance in seconds of baseline KL-NMF, min-vol KL-NMF LS and Algorithm 2.
The table reports the average and standard deviation over 20 random initializations.
We observe that the runtime of Algorithm 2 is close to the baseline KL-NMF algorithm which
confirms the negligible cost of the Newton-Raphson steps to compute µ‹ as discussed in Section 2.3.
On the other hand, since no line search is needed, we have a drastic acceleration from 2x to 7x
Table 4: Final values for Dβ and the penalized objective Ψ from (29) obtained with min-vol KL-
NMF LS and Algorithm 2. The table reports the average and standard deviation over 20 random
initializations for three experimental setups. A bold entry indicates the best value for each experiment.
compared to the backtracking line-search procedure integrated in min-vol KL-NMF LS [25]. Moreover,
we observe in Table 4 that Algorithm 2 outperforms min-vol KL-NMF LS in terms of final values for
the data fitting term and objective function values, with lower standard deviations.
The sparse β-NMF model with ℓ2-norm constraints on the columns of W considered here is
$$\min_{W\in\mathbb{R}^{F\times K}_+,\ H\in\mathbb{R}^{K\times N}_+}\; D_\beta(V|WH) + \sum_{k=1}^K \lambda_k \|H(k,:)\|_1 \quad \text{such that} \quad e^T W^{.(2)} = \rho\, e^T, \qquad (33)$$
where $\lambda_k$ is a penalty weight that controls the sparsity of the k-th row of H, and the quadratic constraints
require the columns of W to lie on the surface of a hyper-sphere centered at the origin with radius
$\sqrt{\rho} > 0$. Without this normalization, the ℓ1-norm regularization would make H tend to zero and W
grow to infinity.
As done before, we update W and H alternately. We tackle the subproblem in H with W fixed
based on the MU developed in [11], which are guaranteed to decrease the objective function:
$$H = \widetilde H \odot \frac{\Big[W^T\big(V \odot [W\widetilde H]^{.(\beta-2)}\big)\Big]}{\Big[W^T[W\widetilde H]^{.(\beta-1)} + \lambda e^T\Big]}, \qquad (34)$$
where $\lambda \in \mathbb{R}^K_+$ is the vector of penalty weights. It remains to compute an update for W. To do so, we
use the convex separable auxiliary function G from [12] constructed at the current iterate $\widetilde W$, from
which we obtain, as before, the Lagrangian function
$$G^\mu\big(W|\widetilde W\big) = \sum_f G\big(w_f|\widetilde w_f\big) + \sum_k \lambda_k\|H(k,:)\|_1 + \sum_f \mu^T\Big(w_f^{.(2)} - \frac{1}{F}\rho e\Big), \qquad (35)$$
where $\mu \in \mathbb{R}^K$ is the vector of Lagrange multipliers associated with the constraint $e^T W^{.(2)} = \rho e^T$.
Exactly as before (hence we omit the details here), given µ, one can obtain a closed-form solution:
$$W^\star(\mu) = \frac{\Big[[C]^{.2} + 8\,\big(e\mu^T\big)\odot S\Big]^{.\frac{1}{2}} - C}{\big[4\, e\mu^T\big]}, \qquad (36)$$
where $C = e_{F,N}H^T$ and $S = \widetilde W \odot \Big(\frac{[V]}{[\widetilde W H]}H^T\Big)$. Let us now write the expression of the quadratic
constraint $\sum_f \big(W^\star(\mu)_{f,i}\big)^2 - \rho = 0$ for one specific column of W, say the i-th:
$$r_i(\mu_i) := \sum_f \big(W^\star_{f,i}(\mu_i)\big)^2 - \rho = \sum_f\left(\frac{\sqrt{(C_{f,i})^2 + 8\mu_i S_{f,i}} - C_{f,i}}{4\mu_i}\right)^2 - \rho = 0. \qquad (37)$$
Computing the Lagrange multiplier $\mu_i$ that satisfies the constraint requires computing the root of the
function $r_i(\mu_i)$. We can show that each $W^\star_{f,i}(\mu_i)$ in (36) is a monotone decreasing, nonnegative, convex
function over $(0, +\infty)$. Therefore, $\sum_f \big(W^\star_{f,i}(\mu_i)\big)^2$ is also monotone decreasing and convex in $\mu_i$ over
$(0, +\infty)$. Indeed, let $g : \mathbb{R} \to \mathbb{R}$ be a monotone decreasing, nonnegative, convex function. If g is
twice differentiable, then $(g^2)'' = 2(g')^2 + 2gg'' \ge 0$ since $g, g'' \ge 0$, and $(g^2)' = 2g'g \le 0$ since $g \ge 0$ and
$g' \le 0$ by hypothesis. We can now conclude that $r_i(\mu_i)$ is a monotone decreasing, convex function over
$(0, +\infty)$. Moreover, using l'Hôpital's rule, we have
$$\lim_{\mu_i\to 0^+}\sum_f\big(W^\star_{f,i}(\mu_i)\big)^2 - \rho = +\infty \quad \text{and} \quad \lim_{\mu_i\to+\infty}\sum_f\big(W^\star_{f,i}(\mu_i)\big)^2 - \rho = -\rho < 0,$$
since ρ > 0. Therefore, the root of $r_i(\mu_i)$ is unique over $(0, +\infty)$. We use a Newton-Raphson method
to solve the problem. Algorithm 3 summarizes our method.
Algorithm 3 Hyperspheric-structured sparse KL-NMF
Input: A matrix $V \in \mathbb{R}^{F\times N}$, an initialization $H \in \mathbb{R}^{K\times N}_+$, an initialization $W \in \mathbb{R}^{F\times K}$, a factorization
rank K, a maximum number of iterations maxiter, a weight vector λ > 0, and the radius ρ > 0.
Output: A sparse rank-K NMF (W, H) of V satisfying the constraints in (33).
1: for it = 1 : maxiter do
2:    % Update of matrix H
3:    $H \leftarrow H \odot \Big[W^T\big(V \odot [WH]^{.(\beta-2)}\big)\Big] \,/\, \Big[W^T[WH]^{.(\beta-1)} + \lambda e^T\Big]$
4:    % Update of matrix W
5:    $C \leftarrow e_{F,N}H^T$
6:    $S \leftarrow W \odot \Big(\frac{[V]}{[WH]}H^T\Big)$
7:    for i = 1 : K do
8:       $\mu_i \leftarrow \mathrm{root}\big(r_i(\mu_i)\big)$ over $(0, +\infty)$   % see Equation (37)
9:    end for
10:   $W \leftarrow \Big(\Big[[C]^{.2} + 8\,\big(e\mu^T\big)\odot S\Big]^{.\frac{1}{2}} - C\Big) \,/\, \big[4\, e\mu^T\big]$
11: end for
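As a complement to step 8 of Algorithm 3, the following sketch (our own naming and safeguards, not the authors' code) finds the per-column multiplier $\mu_i$ as the root of $r_i$ in (37) by Newton-Raphson; since $r_i$ is decreasing and convex on (0, +∞), starting from a point where $r_i > 0$ makes the iterates increase monotonically towards the root:

```python
import numpy as np

def sparse_column_multiplier(c, s, rho, tol=1e-6, maxit=100):
    """Root of r_i(mu) = sum_f w_f(mu)^2 - rho over (0, +inf), cf. (37),
    with w_f(mu) = (sqrt(c_f^2 + 8*mu*s_f) - c_f) / (4*mu).

    c, s: positive vectors (the i-th columns C[:, i] and S[:, i]); rho > 0.
    """
    def w(mu):
        return (np.sqrt(c ** 2 + 8.0 * mu * s) - c) / (4.0 * mu)

    def r(mu):
        return np.sum(w(mu) ** 2) - rho

    def r_prime(mu):
        u = np.sqrt(c ** 2 + 8.0 * mu * s)
        dw = s / (mu * u) - (u - c) / (4.0 * mu ** 2)   # d w_f / d mu
        return np.sum(2.0 * w(mu) * dw)

    mu = 1.0
    for _ in range(200):                 # move left until r(mu) > 0; the text
        if r(mu) > 0:                    # above guarantees a sign change
            break
        mu *= 0.5
    for _ in range(maxit):               # Newton-Raphson, increasing to the root
        mu -= r(mu) / r_prime(mu)
        if abs(r(mu)) <= tol:
            break
    return mu
```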
objective function and the runtime; we refer the interested reader to the Supplementary Material S2
for qualitative results on the ability of sparse β-NMF to decompose such images. For all simulations,
the algorithms are run for 20 random initializations of W and H, and the entries of the penalty weight λ
have been set to 0.1, 0.05 and 0.05 for the Samson, Jasper Ridge and Cuprite data sets, respectively. In
order to fairly compare both algorithms, ρ has been set to 1, as β-SNMF considers an ℓ2-normalization
of the columns of W, and the entries of the weight vector λ in Algorithm 3 are all set to the same value, since
β-SNMF requires using the same value for all rows of H. Table 5 reports the average and standard
deviation of the runtime (in seconds) as well as the final value of the objective function over these 20 runs
for a maximum of 300 iterations. Figure 1 displays the objective function values.
Table 5: Runtime performance in seconds and final value of objective function Φend pW, Hq for Algo-
rithm 3 and β-SNMF. The table reports the average and standard deviation over 20 random initial-
izations with a maximum of 300 iterations for three hyperspectral data sets. A bold entry indicates
the best value for each experiment.
Algorithms      Samson data set                      Jasper Ridge data set                 Cuprite data set
                runtime (s)     Φ_end(W,H)           runtime (s)     Φ_end(W,H)            runtime (s)     Φ_end(W,H)
Algorithm 3     11.07 ± 0.19    (2.68 ± 0.00)·10³    15.67 ± 0.17    (4.65 ± 0.00)·10³     70.16 ± 0.85    (2.12 ± 0.00)·10³
β-SNMF [32]     7.63 ± 0.13     (2.68 ± 0.00)·10³    10.98 ± 0.18    (4.71 ± 0.00)·10³     51.86 ± 0.74    (2.18 ± 0.00)·10³
According to Table 5 (top row), we observe that Algorithm 3 outperforms the heuristic from [32]
in terms of final value of the objective function, while β-SNMF shows lower runtimes. Additionally,
based on Figure 1, we observe that Algorithm 3 converges on average faster than β-SNMF for all
the data sets, in terms of iterations. However, β-SNMF has a lower computational cost per iteration.
Thus, we complete the comparison between both algorithms by imposing the same computational
time: we run Algorithm 3 for 300 iterations, record the computational time and run β-SNMF for the
Figure 1: Averaged objective functions over 20 random initializations obtained for Algorithm 3 with
300 iterations (red line with circle markers), and the heuristic β-SNMF from [32] (black dashed line).
same amount of time. Table 6 reports the average and standard deviation of the final value for the
Table 6: Final value of objective function values Φend pW, Hq for Algorithm 3 and the heuristic from
[32]. The table reports the average and standard deviation over 20 random initializations for an equal
computational time that corresponds to 300 iterations of Algorithm 3. A bold entry indicates the best
value for each experiment.
Algorithms      Samson data set       Jasper Ridge data set    Cuprite data set
                Φ_end(W,H)            Φ_end(W,H)               Φ_end(W,H)
Algorithm 3     (2.68 ± 0.00)·10³     (4.65 ± 0.00)·10³        (2.12 ± 0.00)·10³
β-SNMF [32]     (2.68 ± 0.00)·10³     (4.66 ± 0.00)·10³        (2.15 ± 0.00)·10³
objective function over 20 runs in this setting. Figure 1 (bottom row) displays the objective function
w.r.t. time for the three data sets. In this comparison, Algorithm 3 and the heuristic from [32]
perform similarly, although Algorithm 3 has slightly better final objective function values. However,
keep in mind that only Algorithm 3 is theoretically guaranteed to decrease the objective function.
6 Conclusion
In this paper we have presented a general framework to solve penalized β-NMF problems that in-
tegrates a set of disjoint constraints on the variables; see the general formulation (2). Using this
framework, we showed that we can derive algorithms that compete favorably with the state of the art
for a wide variety of β-NMF problems, such as the simplex-structured NMF and the minimum-volume
β-NMF with sum-to-one constraints on the columns of W. We have also shown how to extend the
framework to non-linear disjoint constraints, with an application to a sparse β-NMF model for β = 1
where each column of W lies on a hyper-sphere.
Further work will focus on the possible extension of the method to non-disjoint constraints. Non-disjoint
constraints lead to root-finding problems for polynomial equations in the Lagrange
multipliers, for which we hope to find conditions that ensure the uniqueness of the solution.
Another interesting direction of research would be to apply our framework to other NMF models.
For example, in probabilistic latent semantic analysis/indexing (PLSA/PLSI), the model is the fol-
lowing: given a nonnegative matrix V such that $e^T V e = 1$ (this can be assumed w.l.o.g. by dividing
the input matrix by $e^T V e$), solve
$$\max_{W\ge 0,\ H\ge 0,\ s\ge 0}\ \sum_{f,n} v_{fn}\log\big(W\,\mathrm{Diag}(s)\,H\big)_{fn} \quad \text{such that} \quad W^T e = e,\ He = e,\ s^T e = 1.$$
This model is equivalent to KL-NMF [9], with the additional constraint that eT W He “ eT Xe, and
hence our framework is applicable to PLSA/PLSI. Such constraints have also applications in soft
clustering contexts; see [35].
Acknowledgment
We would like to thank the Associate Editor and the reviewers for taking the time to carefully read the
paper and for the useful feedback that helped us improve the paper. We also thank Arthur Marmin
for identifying an error in our derivations when β P p1, 2q (indicated in red color in this version of the
manuscript).
Table 7: Decomposition $d_\beta = \check d + \hat d$ of the β-divergence.
$\check d(x|y)$:  $\tfrac{1}{1-\beta}\,x y^{\beta-1}$ for $\beta\in(-\infty,1)\setminus\{0\}$;  $\tfrac{x}{y}$ for $\beta = 0$;  $y - x\log y$ for $\beta = 1$;  $\tfrac{1}{\beta}y^{\beta} - \tfrac{1}{\beta-1}\,x y^{\beta-1}$ for $\beta\in(1,2)$;  $\tfrac{1}{\beta}y^{\beta}$ for $\beta\in[2,+\infty)$.
$\hat d(x|y)$:  $\tfrac{1}{\beta}y^{\beta} - \tfrac{1}{\beta(1-\beta)}x^{\beta}$ for $\beta\in(-\infty,1)\setminus\{0\}$;  $\log\tfrac{y}{x} - 1$ for $\beta = 0$;  $x\log x - x$ for $\beta = 1$;  $\tfrac{1}{\beta(\beta-1)}x^{\beta}$ for $\beta\in(1,2)$;  $-\tfrac{1}{\beta-1}\,x y^{\beta-1} + \tfrac{1}{\beta(\beta-1)}x^{\beta}$ for $\beta\in[2,+\infty)$.
In Table 7, $y \in (0,\infty)$, β is real valued, and $x \in (0,\infty)$. Further, β and x are considered as parameters, $d_\beta$, $\hat d$ and $\check d$ being handled as univariate functions of y.
We can now introduce the properties of concavity, convexity and monotonicity for our convex-concave
formulation of the discrete β-divergence:
1. $\check d(x|y)$ is $C^\infty$ and strictly convex on $(0,\infty)$ for x > 0 and β ∈ ℝ;
2. $\hat d(x|y)$ is concave for x > 0 and β ∈ ℝ;
• $y^\nu$ is strictly convex for all $\nu \in (-\infty,0)\cup(1,\infty)$, and strictly concave for all $\nu \in (0,1)$;
According to the first two items of Proposition 4, $\check d$ and $\hat d$ indeed yield a convex-concave decomposition
of the β-divergence, which is a variant of [12, Table 1]. Let us remark that the successive
minimization of an upper approximation of this convex-concave decomposition, following the methodology
presented in [12], yields the usual multiplicative update scheme.
References
[1] M. Abdolali and N. Gillis, Simplex-structured matrix factorization: Sparsity-based identifiability and
provably correct algorithms, arXiv preprint arXiv:2007.11446, (2020).
[2] A. Ang and N. Gillis, Accelerating nonnegative matrix factorization algorithms using extrapolation,
Neural computation, 31 (2019), pp. 417–439.
[3] D. P. Bertsekas, Nonlinear Programming, Athena Scientific, Belmont, MA, second ed., 1999.
[4] J. M. Bioucas-Dias, A. Plaza, N. Dobigeon, M. Parente, Q. Du, P. Gader, and J. Chanussot,
Hyperspectral unmixing overview: Geometrical, statistical, and sparse regression-based approaches, IEEE
Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 5 (2012), pp. 354–379.
[5] E. C. Chi and T. G. Kolda, On tensors, sparsity, and nonnegative factorizations, SIAM Journal on
Matrix Analysis and Applications, 33 (2012), pp. 1272–1299.
[6] A. Cichocki, R. Zdunek, and S.-I. Amari, Hierarchical ALS algorithms for nonnegative matrix and
3D tensor factorization, in Lecture Notes in Computer Science, Vol. 4666, Springer, 2007, pp. 169–176.
[7] A. Cichocki, R. Zdunek, A. H. Phan, and S.-I. Amari, Nonnegative Matrix and Tensor Factoriza-
tions: Applications to Exploratory Multi-Way Data Analysis and Blind Source Separation, John Wiley &
Sons, 2009.
[8] O. Dikmen, Z. Yang, and E. Oja, Learning the information divergence, IEEE Transactions on Pattern
Analysis and Machine Intelligence, 37 (2015), pp. 1442–1454.
[9] C. Ding, T. Li, and W. Peng, On the equivalence between non-negative matrix factorization and prob-
abilistic latent semantic indexing, Computational Statistics & Data Analysis, 52 (2008), pp. 3913–3927.
[10] J. Eggert and E. Korner, Sparse coding and NMF, in IEEE International Joint Conference on Neural
Networks, vol. 4, 2004, pp. 2529–2533 vol.4.
[11] C. Févotte, N. Bertin, and J.-L. Durrieu, Nonnegative matrix factorization with the Itakura-Saito
divergence: With application to music analysis, Neural computation, 21 (2009), pp. 793–830.
[12] C. Févotte and J. Idier, Algorithms for nonnegative matrix factorization with the β-divergence, Neural
computation, 23 (2011), pp. 2421–2456.
[13] X. Fu, K. Huang, N. D. Sidiropoulos, and W.-K. Ma, Nonnegative matrix factorization for signal
and data analytics: Identifiability, algorithms, and applications., IEEE Signal Process. Mag., 36 (2019),
pp. 59–80.
[14] C. Févotte and N. Dobigeon, Nonlinear hyperspectral unmixing with robust nonnegative matrix fac-
torization, IEEE Transactions on Image Processing, 24 (2015), pp. 4810–4819.
[15] N. Gillis, The why and how of nonnegative matrix factorization, in Regularization, Optimization, Kernels,
and Support Vector Machines, J. Suykens, M. Signoretto, and A. Argyriou, eds., Machine Learning and
Pattern Recognition, Chapman & Hall/CRC, Boca Raton, Florida, 2014, ch. 12, pp. 257–291.
[16] N. Gillis, Nonnegative Matrix Factorization, SIAM, Philadelphia, 2020.
[17] N. Gillis and F. Glineur, Accelerated multiplicative updates and hierarchical ALS algorithms for non-
negative matrix factorization, Neural computation, 24 (2012), pp. 1085–1105.
[18] N. Guan, D. Tao, Z. Luo, and B. Yuan, NeNMF: An optimal gradient method for nonnegative matrix
factorization, IEEE Transactions on Signal Processing, 60 (2012), pp. 2882–2898.
[19] L. T. K. Hien, N. Gillis, and P. Patrinos, Inertial block proximal methods for non-convex non-smooth
optimization, in International Conference on Machine Learning, 2020, pp. 5671–5681.
[20] D. Hong, T. G. Kolda, and J. A. Duersch, Generalized canonical polyadic tensor decomposition,
SIAM Review, 62 (2020), pp. 133–163.
[21] P. Hoyer, Non-negative matrix factorization with sparseness constraints, J. Mach. Learn. Res., 5 (2004),
p. 1457–1469.
[22] J. Kim, Y. He, and H. Park, Algorithms for nonnegative matrix and tensor factorizations: A unified
view based on block coordinate descent framework, Journal of Global Optimization, 58 (2014), pp. 285–319.
[23] D. Lee and H. Seung, Algorithms for non-negative matrix factorization, in Proceedings of the 13th
International Conference on Neural Information Processing Systems, NIPS, MIT Press Cambridge, 2000,
pp. 535–541.
[24] D. D. Lee and H. S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature,
401 (1999), p. 788.
[25] V. Leplat, N. Gillis, and A. M. S. Ang, Blind audio source separation with minimum-volume beta-
divergence NMF, IEEE Transactions on Signal Processing, 68 (2020), pp. 3400–3410.
[26] C.-J. Lin, Projected gradient methods for nonnegative matrix factorization, Neural computation, 19 (2007),
pp. 2756–2779.
[27] W.-K. Ma, J. M. Bioucas-Dias, T. Chan, N. Gillis, P. Gader, A. J. Plaza, A. Ambikapathi,
and C. Chi, A signal processing perspective on hyperspectral unmixing: Insights from remote sensing,
IEEE Signal Processing Magazine, 31 (2014), pp. 67–81.
[28] K. Miller and G. Samko, Completely monotonic functions, Integral Transforms and Special Functions,
12 (2001), p. 389–402.
[29] K. Neymeyr and M. Sawall, On the set of solutions of the nonnegative matrix factorization problem,
SIAM Journal on Matrix Analysis and Applications, 39 (2018), pp. 1049–1069.
[30] J. Ortega and W. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables, Aca-
demic Press, New York, NY, 1970.
[31] Y. Qian, S. Jia, J. Zhou, and A. Robles-Kelly, Hyperspectral unmixing via ℓ1/2 sparsity-constrained
nonnegative matrix factorization, IEEE Transactions on Geoscience and Remote Sensing, 49 (2011),
pp. 4282–4297.
[32] J. L. Roux, F. J. Weninger, and J. R. Hershey, Sparse NMF – half-baked or well done?, tech. rep.,
Mitsubishi Electric Research Laboratories (MERL), 2015.
[33] Y. Sun, P. Babu, and D. Palomar, Majorization-minimization algorithms in signal processing, com-
munications, and machine learning, IEEE Transactions on Signal Processing, 65 (2017), pp. 794–816.
[34] T. Virtanen, Monaural sound source separation by nonnegative matrix factorization with temporal conti-
nuity and sparseness criteria, IEEE Transactions on Audio, Speech, and Language Processing, 15 (2007),
pp. 1066–1074.
[35] Z. Yang, J. Corander, and E. Oja, Low-rank doubly stochastic matrix decomposition for cluster anal-
ysis, The Journal of Machine Learning Research, 17 (2016), pp. 6454–6478.
[36] A. L. Yuille and A. Rangarajan, The concave-convex procedure, Neural Computation, 15 (2003),
pp. 915–936.
[37] F. Zhu, Hyperspectral unmixing: ground truth labeling, datasets, benchmark performances and survey,
arXiv preprint arXiv:1708.05125, (2017).
Supplementary Material
This supplementary material provides additional numerical experiments. In S1, we show the evolution
of the error as a function of the iterations for Algorithm 1 and GR-NMF [14] for the tests performed
in Section 3. In S2, we provide qualitative results obtained with Algorithm 3 on hyperspectral images.
(a) Samson Data set (b) Jasper Ridge Data set (c) Cuprite Data set
Figure 2: Averaged relative objective function values over 20 random initializations obtained for
Algorithm 1 (red line with circle markers) and GR-NMF (black dashed line) applied to the three
data sets detailed in the text for 300 iterations. The comparison is performed for different values of β,
from top to bottom: β = 2, β = 3/2, and β = 1. Logarithmic scale on the y-axis.
(a) Samson Data set
Figure 3: Samson data set (a) and results ((b) and (c)) for the abundance maps estimated using
Algorithm 3 for the three endmembers: #1 Tree, #2 Soil and #3 Water. Two average sparsity levels
are considered: 0.25 (b) and 0.5 (c).
(a) Jasper Ridge Data set
Figure 4: Jasper Ridge data set (a) and results ((b) and (c)) for the abundance maps estimated using
Algorithm 3 for the four endmembers: #1 Road, #2 Tree, #3 Water and #4 Soil. Two average sparsity
levels are considered: 0.25 (b) and 0.5 (c).
(a) Urban Data set
Figure 5: Urban data set (a) and results ((b) and (c)) for the abundance maps estimated using
Algorithm 3 for the six endmembers: #1 Soil, #2 Tree, #3 Grass, #4 Roof, #5 Road/Asphalt and #6
Roof2/Shadows. Two average sparsity levels are considered: 0.25 (b) and 0.5 (c).
Figure 6: Baseline abundances for the endmembers obtained for the Samson data, extracted from [37]: #1
Soil, #2 Tree and #3 Water.
Figure 7: Baseline abundances for the endmembers obtained for the Jasper Ridge data, extracted from
[37]: #1 Road, #2 Soil, #3 Water and #4 Tree.
Figure 8: Baseline abundances for the endmembers obtained for the Urban data, extracted from [37]: #1
Asphalt, #2 Grass, #3 Tree, #4 Roof1, #5 Roof2/Shadow and #6 Soil.