Multiplicative Updates for NMF with β-Divergences under Disjoint Equality Constraints
Abstract
Nonnegative matrix factorization (NMF) is the problem of approximating an input nonnegative
matrix, V , as the product of two smaller nonnegative matrices, W and H. In this paper, we intro-
duce a general framework to design multiplicative updates (MU) for NMF based on β-divergences
(β-NMF) with disjoint equality constraints, and with penalty terms in the objective function. By
disjoint, we mean that each variable appears in at most one equality constraint. Our MU satisfy
the set of constraints after each update of the variables during the optimization process, while
guaranteeing that the objective function decreases monotonically. We showcase this framework on
three NMF models, and show that it competes favorably with the state of the art: (1) β-NMF with
sum-to-one constraints on the columns of H, (2) minimum-volume β-NMF with sum-to-one constraints
on the columns of W, and (3) sparse β-NMF with ℓ2-norm constraints on the columns of W.
1 Introduction
Given a nonnegative matrix $V \in \mathbb{R}^{F\times N}_+$ and a factorization rank $K \ll \min(F,N)$, nonnegative matrix
factorization (NMF) aims to compute two nonnegative matrices, W with K columns and H with K
rows, such that $V \approx WH$ [24]. Over the last two decades, NMF has been shown to be a powerful tool for the
analysis of high-dimensional data. The main reason is that NMF automatically extracts sparse and
meaningful features from a set of nonnegative data vectors. NMF has been successfully used in many
applications such as image processing, text mining, hyperspectral imaging, blind source separation,
single-channel audio source separation, clustering and music analysis; see [15, 7, 5, 29, 13, 16] and the
references therein.
To compute W and H, the most standard approach is to solve the following optimization problem:
$$\min_{W \ge 0,\; H \ge 0}\; D(V|WH), \qquad (1)$$
where $D(V|WH) = \sum_{f,n} d\big(V_{fn}\,\big|\,[WH]_{fn}\big)$ with $d(x|y)$ a measure of distance between two scalars, and
$A \ge 0$ means that the matrix A is component-wise nonnegative. In this paper, we focus on β-NMF,
for which the measure of fit is the discrete β-divergence, denoted $d_\beta(x|y)$ and defined as
$$d_\beta(x|y) = \begin{cases} \dfrac{1}{\beta(\beta-1)}\Big(x^\beta + (\beta-1)\,y^\beta - \beta\, x\, y^{\beta-1}\Big) & \text{for } \beta \in \mathbb{R}\setminus\{0,1\},\\[2mm] x\log\dfrac{x}{y} - x + y & \text{for } \beta = 1,\\[2mm] \dfrac{x}{y} - \log\dfrac{x}{y} - 1 & \text{for } \beta = 0.\end{cases}$$
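For concreteness, here is a minimal NumPy sketch (our own code, not from the paper) of the element-wise β-divergence above, summed over all entries:

```python
import numpy as np

def beta_divergence(x, y, beta):
    """Sum of the element-wise beta-divergence d_beta(x|y) over all entries.

    x, y: nonnegative arrays of the same shape, with y > 0
    (and x > 0 when beta <= 1, as assumed in the paper).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if beta == 1:    # Kullback-Leibler divergence
        return np.sum(x * np.log(x / y) - x + y)
    if beta == 0:    # Itakura-Saito divergence
        return np.sum(x / y - np.log(x / y) - 1)
    # generic case, beta not in {0, 1}
    return np.sum((x ** beta + (beta - 1) * y ** beta
                   - beta * x * y ** (beta - 1)) / (beta * (beta - 1)))
```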
• Projection requires performing a line search to ensure the monotonicity of the algorithm, that is,
to ensure that the objective does not increase after each iteration, which may be computationally
heavy.
• Renormalization of the columns of W and the rows of H is only useful when each constraint
applies to a column of W or a row of H. It is not applicable for example for the sum-to-one
constraint on the columns of H mentioned above. Moreover, in the presence of regularization
terms in the objective function, it may destroy the monotonicity of the algorithm.
Another approach is to use parametrization. However, as far as we know, it does not guarantee the
monotonicity of the algorithm; see Section 3 for more details.
Outline and contribution In this paper, we introduce a general framework to design MU for β-
NMF with disjoint linear equality constraints, and with penalty terms in the objective function. By
disjoint, we mean that each variable appears in at most one equality constraint. This framework,
presented in Section 2, does not resort to projection, renormalization, or parametrization. Our MU
satisfy the set of constraints after each update of the variables during the optimization process, while
guaranteeing that the objective function decreases monotonically. This framework works as follows:
• First, as for the standard MU for β-NMF, we majorize the objective function using a separable
majorizer, that is, the majorizer is the sum of functions involving a single variable.
• Second, we construct the augmented Lagrangian for the majorizer. Because the majorizer is
separable, the problem can be decomposed into independent subproblems involving only variables
that occur in the same equality constraint, since the constraints are disjoint. For a fixed value of the
Lagrange multipliers, we prove that the solutions of these subproblems are unique under mild
conditions (Proposition 1). Moreover, they can be written in closed form via MU for specific values
of β and depending on the regularizer used (this is summarized in Table 1).
• Finally, we prove that, under mild conditions, there is a unique solution for the Lagrange mul-
tipliers so that the equality constraints are satisfied (Proposition 2). This allows us to apply
the Newton-Raphson method to compute the Lagrange multipliers while guaranteeing quadratic
convergence (Proposition 3).
We then showcase this framework on two NMF models, and show that it competes favorably with
the state of the art:
1. A β-NMF model with sum-to-one constraints on the columns of H, which we refer to as simplex-
structured β-NMF (Section 3), and
2. A minimum-volume β-NMF model with sum-to-one constraints on the columns of W (Section 4).
Finally, Section 5 shows that the framework can be extended to the case of quadratic disjoint con-
straints, which we showcase on sparse β-NMF with ℓ2-norm constraints on the columns of W.
• $\mathcal{B}_j \subseteq \{(k,n) \mid 1 \le k \le K,\ 1 \le n \le N\}$ for $j = 1, 2, \ldots, J$,
• $\mathcal{K}_u \cap \mathcal{K}_v = \emptyset$ for all $1 \le u, v \le I$ with $u \ne v$,
• $\mathcal{B}_p \cap \mathcal{B}_q = \emptyset$ for all $1 \le p, q \le J$ with $p \ne q$.
We now define penalized β-NMF with disjoint linear equality constraints as follows:
$$\min_{W\in\mathbb{R}^{F\times K}_+,\ H\in\mathbb{R}^{K\times N}_+}\; D_\beta(V|WH) + \lambda_1\Phi_1(W) + \lambda_2\Phi_2(H) \quad \text{such that} \quad \alpha_i^T W(\mathcal{K}_i) = b_i \text{ for } 1\le i\le I, \ \ \gamma_j^T H(\mathcal{B}_j) = c_j \text{ for } 1\le j\le J, \qquad (2)$$
where
• the penalty functions Φ1(W) and Φ2(H) are lower bounded and admit a particular upper approximation; see Assumption 1 below,
• λ1 and λ2 are the penalty weights (nonnegative scalars),
• $\alpha_i \in \mathbb{R}^{|\mathcal{K}_i|}_{++}$ ($1 \le i \le I$) and $\gamma_j \in \mathbb{R}^{|\mathcal{B}_j|}_{++}$ ($1 \le j \le J$) are vectors with positive entries. Note that if $\alpha_i$ or $\gamma_j$ contains zero entries, the corresponding indices can be removed from $\mathcal{K}_i$ and $\mathcal{B}_j$,
• $b_i$ ($1 \le i \le I$) and $c_j$ ($1 \le j \le J$) are positive scalars.
As for most NMF algorithms, we propose to resort to a block coordinate descent (BCD) framework
to solve problem (2): at each iteration we tackle two subproblems separately, one in W and the
other in H. The subproblems in W and H are essentially the same, by symmetry of the model,
since transposing the relation $X \approx WH$ gives $X^T \approx H^T W^T$. Hence, we may focus on solving the
subproblem in H only, namely
$$\min_{H\in\mathbb{R}^{K\times N}_+}\; D_\beta(V|WH) + \lambda_2\Phi_2(H) \quad \text{such that} \quad \gamma_j^T H(\mathcal{B}_j) = c_j \text{ for } 1\le j\le J. \qquad (3)$$
In order to solve (3), we will design MU based on the majorization-minimization (MM) framework [33], which is the standard in the NMF literature; see [12] and the references therein. Let us
briefly recall the high-level ideas to obtain MU via MM. Let us consider the general problem
$$\min_{h\in\mathcal{H}} f(h).$$
Given an initial iterate $\widetilde h \in \mathcal{H}$, MM generates a new iterate $\hat h \in \mathcal{H}$ that is guaranteed to decrease the
objective function, that is, $f(\hat h) \le f(\widetilde h)$. To do so, it uses the following two steps:
• Majorization: find a function that is an upper approximation of the objective and is tight at the
current iterate, which is referred to as a majorizer. More precisely, find a function $g(h|\widetilde h)$ such
that (i) $g(\widetilde h|\widetilde h) = f(\widetilde h)$ and (ii) $g(h|\widetilde h) \ge f(h)$ for all $h \in \mathcal{H}$.
• Minimization: compute $\hat h \in \arg\min_{h\in\mathcal{H}} g(h|\widetilde h)$, which guarantees $f(\hat h) \le g(\hat h|\widetilde h) \le g(\widetilde h|\widetilde h) = f(\widetilde h)$.
The MU for NMF are obtained using MM where the majorizer g is chosen separable, that is, $g(h|\widetilde h) = \sum_i g_i(h_i|\widetilde h_i)$ for some well-chosen univariate functions $g_i$; see (4) in the next section. This choice
typically makes the minimization of g admit a closed-form solution which is multiplicative, that is, it
has the form $\hat h = \widetilde h \odot c(\widetilde h)$, where $\odot$ is the component-wise product and $c(\widetilde h)$ is a nonnegative vector
that depends on $\widetilde h$. We will encounter several examples later in this paper.
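As an illustration of this multiplicative form, here is a minimal NumPy sketch (our own code, not the authors' implementation) of one such update on H for unpenalized, unconstrained β-NMF; the exponent follows the standard values from [12], with exponent 1 for β ∈ [1, 2]:

```python
import numpy as np

def mu_step_H(V, W, H, beta, eps=1e-16):
    """One multiplicative update of H for unpenalized, unconstrained beta-NMF,
    of the multiplicative form H <- H * c(H) described above."""
    WH = np.maximum(W @ H, eps)          # current approximation W @ H
    num = W.T @ (WH ** (beta - 2) * V)   # plays the role of the matrix C below
    den = W.T @ (WH ** (beta - 1))       # plays the role of the matrix D below
    if beta < 1:
        eta = 1.0 / (2.0 - beta)
    elif beta <= 2:
        eta = 1.0
    else:
        eta = 1.0 / (beta - 1.0)
    return H * (num / np.maximum(den, eps)) ** eta
```

A full β-NMF loop would simply alternate this update with the symmetric one on W.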
In summary, to derive MU for (3), we will follow the MM framework. We first provide a majorizer
for the objective of (3) in Section 2. This majorizer has the property of being separable in each entry of
H. In order to handle the equality constraints, we introduce Lagrange dual variables in Section 2.2,
and explain how they can be computed efficiently. This allows us to derive general MU in Section 2.3
in the case of non-penalized β-NMF under disjoint linear equality constraints. This is showcased on
simplex-structured β-NMF in Section 3. In Section 4, we will illustrate on minimum-volume KL-NMF
how to derive MU in the presence of penalty terms.
Majorizing Φ(H)  For the second term Φ(H), we rely on the following assumption on Φ.
Assumption 1. The function $\Phi : \mathbb{R}^{K\times N}_+ \to \mathbb{R}$ is lower bounded, and for any $\widetilde H \in \mathbb{R}^{K\times N}_+$ there exist
constants $L_{kn}$ ($1 \le k \le K$, $1 \le n \le N$) such that the inequality
$$\Phi(H) \le \Phi\big(\widetilde H\big) + \Big\langle \nabla\Phi\big(\widetilde H\big),\, H - \widetilde H\Big\rangle + \sum_{k,n}\frac{L_{kn}}{2}\big(H_{kn}-\widetilde H_{kn}\big)^2 \qquad (5)$$
holds for all $H \in \mathbb{R}^{K\times N}_+$.
otherwise we would have $\lim_{y\to\infty}\Phi\big(H + y\, e_i e_j^T\big) = -\infty$, where $e_i$ is the i-th unit vector, and this
would contradict the fact that Φ is bounded from below. This observation will be useful in the
proof of Proposition 1 and is only valid for the special case $L_{kn} = 0$ for all k, n.
Examples of such penalty functions include the sparsity-promoting regularizers $\Phi(H) = \|H\|_p^p = \sum_{k,n} H(k,n)^p$ for $0 < p \le 1$, since $H \ge 0$.
2. Lower-bounded functions with Lipschitz continuous gradient, for which (5) follows from the
descent lemma [3].
Examples of such penalty functions include smooth convex functions, for example any
quadratic penalty such as $\|AH - B\|_2^2$ for some matrices A and B, in which case $L_{kn} = \sigma_1(A)^2$
for all k, n. We will encounter another example later in the paper, namely $\operatorname{logdet}\big(HH^T + \delta I\big)$
for δ > 0, which allows one to minimize the volume of the rows of H; see Section 4 for the details
(note that we will use this regularizer for W).
Majorizing Ψ(H)  Combining (4) and (5), we can construct a majorizer for Ψ(H). Since both (4)
and (5) are separable in each entry of H, their combination is also separable into a sum of K × N
component-wise majorizers, up to an additive constant:
$$G\big(H|\widetilde H\big) = \sum_{n=1}^N\sum_{k=1}^K g\big(h_{kn}|\widetilde H\big) + C\big(\widetilde H\big), \qquad (7)$$
where
$$g\big(h_{kn}|\widetilde H\big) = \sum_{f=1}^F \frac{w_{fk}\,\widetilde h_{kn}}{\widetilde v_{fn}}\,\check d\Big(v_{fn}\,\Big|\,\widetilde v_{fn}\frac{h_{kn}}{\widetilde h_{kn}}\Big) + a_{kn}\, h_{kn}^2 + p_{kn}\, h_{kn}, \qquad (8)$$
$$C\big(\widetilde H\big) = \sum_{n=1}^N\sum_{f=1}^F\left(\hat d\big(v_{fn}|\widetilde v_{fn}\big) - \sum_{k=1}^K\Big(\hat d'\big(v_{fn}|\widetilde v_{fn}\big)\, w_{fk}\,\widetilde h_{kn} + a_{kn}\,\widetilde h_{kn}^2\Big)\right).$$
The variables in different sets $\mathcal{B}_j$ can be optimized independently, as they do not interact in the
majorizer nor in the constraints. Note that, for the entries of H that do not appear in any constraint,
the standard MU [12] can be used. For simplicity, let us fix j and denote $\mathcal{B} = \mathcal{B}_j$, $Q = |\mathcal{B}|$, $y = H(\mathcal{B}) \in \mathbb{R}^Q_+$, $\gamma = \gamma_j \in \mathbb{R}^Q_{++}$, and $c = c_j > 0$. The problems we need to solve have the form
$$\min_{y\in\mathcal{Y}}\; G\big(y|\widetilde H\big), \qquad (10)$$
where $\mathcal{Y} = \big\{y \in \mathbb{R}^Q_+ \mid \gamma^T y = c\big\}$ and
$$G\big(y|\widetilde H\big) = \sum_{(k,n)\in\mathcal{B}} g\big(h_{kn}|\widetilde H\big), \qquad (11)$$
where the component-wise majorizers $g(h_{kn}|\widetilde H)$ are defined by (8). Let us introduce a convenient
notation: for $q = 1, 2, \ldots, Q$, we denote by $(k(q), n(q))$ the q-th pair belonging to $\mathcal{B}$. Hence the
Lagrangian function of (11) can be written as
$$G^\mu\big(y|\widetilde H\big) = G\big(y|\widetilde H\big) - \mu\big(\gamma^T y - c\big) = \mu c + C\big(\widetilde H\big) + \sum_{q=1}^Q g^\mu\big(y_q|\widetilde H\big), \qquad (12)$$
where
$$g^\mu\big(y_q|\widetilde H\big) = g\big(y_q|\widetilde H\big) - \mu\gamma_q y_q = \sum_{f=1}^F \frac{w_{fk(q)}\,\widetilde y_q}{\widetilde v_{fn(q)}}\,\check d\Big(v_{fn(q)}\,\Big|\,\widetilde v_{fn(q)}\frac{y_q}{\widetilde y_q}\Big) + a_q\, y_q^2 + \big(p_q - \mu\gamma_q\big)\, y_q, \qquad (13)$$
$$p_q = \sum_{f=1}^F w_{fk(q)}\,\hat d'\big(v_{fn(q)}|\widetilde v_{fn(q)}\big) + \lambda\,\frac{\partial\Phi}{\partial y_q}\big(\widetilde H\big) - L_{k(q)n(q)}\,\widetilde y_q, \qquad (14)$$
Proposition 1. Let $q \in \{1, 2, \ldots, Q\}$. Assume that β < 2 and $\widetilde y_q, v_{fn(q)}, w_{fk(q)} > 0$ for all f. Moreover,
when β ≤ 1, assume that $\mu < \frac{p_q}{\gamma_q}$ for all q such that $a_q = 0$. Then there exists a unique minimizer
$y_q^\star(\mu)$ of $g^\mu\big(y_q|\widetilde H\big)$ in $(0, \infty)$.
Proof. According to Proposition 4 (see Appendix A), each $g^\mu$ is $C^\infty$ and strictly convex on $(0,\infty)$, so
its infimum is uniquely attained in the closure of $(0,\infty)$. We have to prove that it is reached neither
at 0 nor at ∞. On the one hand, from (13), we have
$$\big(g^\mu\big)'\big(y_q|\widetilde H\big) = \sum_{f=1}^F w_{fk(q)}\,\check d'\Big(v_{fn(q)}\,\Big|\,\widetilde v_{fn(q)}\frac{y_q}{\widetilde y_q}\Big) + 2 a_q y_q + p_q - \gamma_q\mu, \qquad (15)$$
so $\lim_{y_q\to 0^+}\big(g^\mu\big)'\big(y_q|\widetilde H\big) = -\infty$, which ensures that the infimum is not reached at 0. On the other
hand,
$$\lim_{y\to\infty}\check d'(x|y) = \begin{cases} 0 & \text{if } \beta \le 1,\\ \infty & \text{otherwise.}\end{cases} \qquad (16)$$
According to (15) and (16), the distinction must be made between two cases:
We just proved that, under mild conditions, each $g^\mu$ has a unique minimizer over $(0,\infty)$. However,
we assumed that the value of µ is fixed. Now, given $y^\star(\mu) = \big[y_1^\star(\mu), \ldots, y_Q^\star(\mu)\big]^T$, let us show that the
solution to $\gamma^T y^\star(\mu) = c$ is unique. The corresponding value of µ, which we denote $\mu^\star$, provides the
minimizer $y^\star(\mu^\star)$ of $G^\mu\big(y|\widetilde H\big)$ that satisfies the linear constraint $\gamma^T y^\star(\mu^\star) = c$. Moreover, $\mu^\star$ naturally
fulfills $\mu^\star < \frac{p_q}{\gamma_q}$ for all q when β ≤ 1 and $a_q = 0$, as required in Proposition 1.
Proposition 2. Assume that β < 2 and $\widetilde y_q, v_{fn(q)}, w_{fk(q)} > 0$ for all q, f. Then the scalar equation
$\gamma^T y^\star(\mu) = c$ in the variable µ admits a unique solution $\mu^\star$ in $(-\infty, t)$, where
$$t = \min_{1\le q\le Q} t_q, \quad \text{where} \quad t_q = \begin{cases} \dfrac{p_q}{\gamma_q} & \text{if } \beta \le 1 \text{ and } a_q = 0,\\[2mm] \infty & \text{otherwise.}\end{cases} \qquad (17)$$
Proof. For each q, the function
$$\gamma_q^{-1}\, g'\big(y_q|\widetilde H\big) = \gamma_q^{-1}\sum_{f=1}^F w_{fk(q)}\,\check d'\Big(v_{fn(q)}\,\Big|\,\widetilde v_{fn(q)}\frac{y_q}{\widetilde y_q}\Big) + \frac{2a_q}{\gamma_q}\, y_q + \frac{p_q}{\gamma_q} \qquad (18)$$
is strictly increasing on $(0,\infty)$ (since g is strictly convex) and one-to-one from $(0,\infty)$ to an open interval
$T_q = (t_q^-, t_q^+)$ where
$$t_q^- = \lim_{y_q\to 0}\gamma_q^{-1}\, g'\big(y_q|\widetilde H\big) = -\infty, \qquad (19)$$
$$t_q^+ = \lim_{y_q\to\infty}\gamma_q^{-1}\, g'\big(y_q|\widetilde H\big) = t_q. \qquad (20)$$
Coming back to the multivariate problem (10), we must find a value $\mu^\star$ of the Lagrange multiplier
such that the constraint $\gamma^T y^\star(\mu) = c$ is satisfied. Given (21), $\mu^\star$ is a solution of
$$\sum_{q=1}^Q \gamma_q\,\big(g'\big)^{-1}\big(\gamma_q\mu\big) = c. \qquad (22)$$
Each $g'\big(y_q|\widetilde H\big)$ being strictly increasing on $(0,\infty)$, $(g')^{-1}(\gamma_q\mu)$ is also strictly increasing (from $T_q$ to
$(0,\infty)$); this is a direct consequence of $(f^{-1})' = \frac{1}{f'\circ f^{-1}}$ where f is any strictly increasing function on
some interval. Finally, $\sum_{q=1}^Q \gamma_q\,(g')^{-1}(\gamma_q\mu)$ is strictly increasing from $\cap_{q=1}^Q T_q = (-\infty, t)$ to $(0,\infty)$, with
$t \ge 0$. Therefore, the solution $\mu^\star$ is unique.
Proposition 2 shows that the optimal Lagrange multiplier is the unique solution of (22). Finding
the solution of (22) is equivalent to finding the root of a function $r(\mu)$. We propose hereunder to
use a Newton-Raphson method to compute $\mu^\star$, and show that this method generates a sequence of
iterates $\mu_n$ that converges towards $\mu^\star$ at a quadratic speed.
Proposition 3. Assume that β < 2 and $\widetilde y_q, v_{fn(q)}, w_{fk(q)} > 0$ for all q, f. Let
$$r(\mu) = \sum_{q=1}^Q \gamma_q\,\big(g'\big)^{-1}\big(\gamma_q\mu\big) - c$$
for $\mu \in (-\infty, t)$, where t is defined in (17), and denote $\mu^\star$ the unique solution of $r(\mu) = 0$. From any
initial point $\mu_0 \in (\mu^\star, t)$, Newton-Raphson's iterates
$$\mu_{n+1} = \mu_n - \frac{r(\mu_n)}{r'(\mu_n)}$$
decrease monotonically and converge to $\mu^\star$ at a quadratic rate.
Proof. We already know that r is strictly increasing from $(-\infty, t)$ to $(0, \infty)$. Let us show that r
is also strictly convex. According to the third item of Proposition 4 in Appendix A, $\check d''(x|y)$ is
completely monotonic, so it is strictly decreasing in y. Equivalently, $\check d'(x|y)$ is strictly concave in
y, and each $g'$ is also strictly concave according to (18). Since the inverse of a strictly increasing,
strictly concave function f is strictly increasing and strictly convex, which is a direct consequence of
$(f^{-1})'' = -\frac{f''\circ f^{-1}}{(f'\circ f^{-1})^3}$, each $(g')^{-1}$ is strictly convex, and finally r is strictly convex.
For any $\mu_0 \in (\mu^\star, t)$, we have $r(\mu_0) > 0$, so $\mu_1 = \mu_0 - \frac{r(\mu_0)}{r'(\mu_0)} < \mu_0$. We also have $\mu_1 > \mu^\star$ as a
consequence of the strict convexity of r: the tangent line at $\mu_0$ lies below the graph of r, so its root $\mu_1$ satisfies $r(\mu_1) \ge 0$, hence $\mu_1 \ge \mu^\star$.
Discussion  At this point, we have derived an optimization framework to tackle problem (10). The
optimal Lagrange multiplier is determined before each majorization-minimization update using
a Newton-Raphson algorithm. However, such a formal solution is implementable if and only if each
$y_q^\star(\mu)$ can actually be computed as the minimizer of $g^\mu\big(y_q|\widetilde H\big)$ in $(0, \infty)$. In some cases, computing
$y_q^\star(\mu)$ is equivalent to extracting the roots of a polynomial of degree smaller than or equal to four, which
is possible in closed form. In other cases, we have to solve a polynomial equation of degree larger
                                             β ∈ (−∞,1)\{0}   β = 0   β = 1   β = 5/4   β = 4/3   β = 3/2   other β ∈ (1,2)
 No penalization, or L_kn = 0 for all k, n          1             1       1        4         3         2           O
 L_kn > 0 for some k, n                             O             3       2        O         O         3           O

Table 1: Cases where (21) can be computed in closed form. They are indicated by the degree of the
corresponding polynomial equation; otherwise the symbol O is used. The constants $L_{kn}$ are the ones
needed in Assumption 1 for the penalization functions Φ1(W) and Φ2(H); see (5).
than four, or even an equation that is not polynomial. Table 1 indicates the cases where a closed-form
solution is available, and hence when our framework can be efficiently implemented. We observe
that, without penalization or with penalization satisfying $L_{kn} = 0$ for all k, n (e.g., smooth concave
functions), the polynomial equation is of degree one for β ≤ 1, and of degree at most four for β ∈ {5/4, 4/3, 3/2}; hence a closed-form solution is always available in these cases. This particular case is discussed in the next section, and will be exemplified in
Section 3 with β-NMF with sum-to-one constraints on the columns of H. In Section 4, we will present
an important example with $L_{kn} > 0$ for all k, n and β = 1, namely minimum-volume KL-NMF.
2.3 MU for β-NMF with disjoint linear equality constraints without penalization
In this section, we derive an algorithm based on the general framework presented in the previous
section to tackle the β-NMF problem under disjoint linear equality constraints without penalization,
that is, problem (2) with λ1 = λ2 = 0. We consider this simplified case here as it allows us to provide
explicit MU for any value of β < 2; see the row 'No penalization' of Table 1. These updates satisfy
the constraints after each update of W or H, and monotonically decrease the objective function
$D_\beta(V|WH)$.
Let us then consider the subproblem of (2) over H when W is fixed and with λ2 = 0, that is,
$$\min_{H\in\mathbb{R}^{K\times N}_+}\; D_\beta(V|WH) \quad \text{such that} \quad \gamma_j^T H(\mathcal{B}_j) = c_j \text{ for } 1\le j\le J. \qquad (23)$$
Let us follow the framework presented above. First, an auxiliary function, which we denote $G(H|\widetilde H)$,
is constructed at the current iterate $\widetilde H$ so that it majorizes the objective for all H; it is defined as
follows:
$$G\big(H|\widetilde H\big) = \sum_{f,n}\left[\sum_k \frac{w_{fk}\,\widetilde h_{kn}}{\widetilde v_{fn}}\,\check d\Big(v_{fn}\,\Big|\,\widetilde v_{fn}\frac{h_{kn}}{\widetilde h_{kn}}\Big)\right] + \sum_{f,n}\left[\hat d'\big(v_{fn}|\widetilde v_{fn}\big)\sum_k w_{fk}\big(h_{kn}-\widetilde h_{kn}\big) + \hat d\big(v_{fn}|\widetilde v_{fn}\big)\right], \qquad (24)$$
where $\check d(\cdot|\cdot)$ and $\hat d(\cdot|\cdot)$ are given in Appendix A. Second, we need to minimize $G(H|\widetilde H)$ while imposing
the set of linear constraints $\gamma_j^T H(\mathcal{B}_j) = c_j$. The Lagrangian function of G is given by
$$G^\mu\big(H|\widetilde H\big) = G\big(H|\widetilde H\big) - \sum_{j=1}^J \mu_j\big[\gamma_j^T H(\mathcal{B}_j) - c_j\big], \qquad (25)$$
where the $\mu_j$ are the Lagrange multipliers associated with each linear constraint $\gamma_j^T H(\mathcal{B}_j) = c_j$. We observe
that $G^\mu$ in (25) is a separable majorizer in the variables H of the Lagrangian function $D_\beta(V|WH) - \sum_{j=1}^J \mu_j\big[\gamma_j^T H(\mathcal{B}_j) - c_j\big]$. Due to the disjointness of the subsets of variables $\mathcal{B}_j$ in (25), we only consider
the optimization over one specific subset $\mathcal{B}_j$. The minimizer (21) of $G^\mu\big(H(\mathcal{B}_j)|\widetilde H(\mathcal{B}_j)\big)$ has the
following component-wise expression:
$$H^\star(\mathcal{B}_j) = \widetilde H(\mathcal{B}_j) \odot \left(\frac{\big[C(\mathcal{B}_j)\big]}{\big[D(\mathcal{B}_j) - \mu_j\gamma_j\big]}\right)^{.\eta(\beta)}, \qquad (26)$$
where $C = W^T\big((WH)^{.(\beta-2)} \odot V\big)$, $D = W^T(WH)^{.(\beta-1)}$, $\eta(\beta) = \frac{1}{2-\beta}$ for β ≤ 1 and $\eta(\beta) = \frac{1}{\beta-1}$ for
β ≥ 2 [12, Table 2], $A \odot B$ (resp. $\frac{[A]}{[B]}$) is the Hadamard product (resp. division) between A and B, and $A^{.\alpha}$
is the element-wise exponentiation of A with exponent α. The case β ∈ (1, 2) is more difficult: we need to find a root of a
function of the form $\mu + b x^{\beta-1} - c x^{\beta-2} = 0$. For example, for $\beta = \frac{3}{2}$, we have $\mu + b x^{1/2} - c x^{-1/2} = 0$.
Using $y = \sqrt{x}$, and after simplifications, we obtain $\mu y + b y^2 - c = 0$, leading to the positive root
$x = \Big(\frac{\sqrt{\mu^2+4bc}-\mu}{2b}\Big)^2$.
According to Proposition 2, (26) is a well-defined update from $(0, \infty)^Q$ to itself, provided that $\mu_j$
is tuned to $\mu_j^\star$. This brings a structural guarantee that $D(\mathcal{B}_j) - \mu_j^\star\gamma_j$ cannot vanish.
Finally, we need to evaluate $\mu_j^\star$, which is uniquely determined on some interval $(-\infty, t)$ according
to Proposition 2. This amounts to solving $\gamma_j^T H^\star(\mathcal{B}_j) = c_j$. When β ∉ (1, 2), this is equivalent to finding
the root of the function
$$r_j(\mu_j) = \sum_{q=1}^{Q} \gamma_{j,q}\left[\widetilde H(\mathcal{B}_j)\odot\left(\frac{\big[C(\mathcal{B}_j)\big]}{\big[D(\mathcal{B}_j)-\mu_j\gamma_j\big]}\right)^{.\eta(\beta)}\right]_q - c_j, \qquad (27)$$
where $[A]_q$ denotes the q-th entry of the expression A. Indeed, $r_j(\mu_j)$ is a finite sum of elementary
rational functions of $\mu_j$, and each of them is an increasing, convex function of $\mu_j$ over $(-\infty, t_q)$ with
$t_q = \frac{D(\mathcal{B}_j)_q}{\gamma_{j,q}}$ for each q. It is even completely monotone for all $\mu_j$ in $(-\infty, t_q)$ because η(β) > 0 [28].
As a consequence, $r_j(\mu_j)$ is also a completely monotone, convex, increasing function of $\mu_j$ on $(-\infty, t)$,
where $t = \min_q(t_q)$. Finally, we can easily show that the function $r_j(\mu_j)$ changes sign on the interval
$(-\infty, t)$ by computing the two limits at the boundary of the interval. As $\mu^\star \in (-\infty, t)$, the update (26) is
nonnegative. To evaluate $\mu^\star$, we use a Newton-Raphson method, with any initial point $\mu_0 \in (\mu^\star, t)$,
with a quadratic rate of convergence as demonstrated in Proposition 3. Algorithm 1 summarizes our
method to tackle (2) for all the β-divergences with β ∉ (1, 2), which we refer to as the disjoint-constrained
β-NMF algorithm. The update for matrix W can be derived in the same way, by symmetry of the
problem. For β ∈ (1, 2), a case-by-case analysis could be carried out for the values of β for which the
minimizer of (25) takes a closed-form expression.
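To make the procedure concrete, here is a minimal NumPy sketch (our own naming and safeguards, not the authors' code) of the update (26) combined with the Newton-Raphson root finding on (27) for a single constraint block, in the non-penalized case with β ≤ 1, so that η(β) = 1/(2−β) and $t = \min_q D_q/\gamma_q$. The initial point is pushed toward t until r(µ) > 0, which places it in (µ*, t) as required by Proposition 3:

```python
import numpy as np

def constrained_mu_block(y_tilde, Cq, Dq, gamma, c, beta, tol=1e-6, maxit=100):
    """Update y = H(B_j) under gamma^T y = c, for beta <= 1 without penalization.

    y_tilde, Cq, Dq, gamma: positive vectors of length Q containing the entries
    of H~(B_j), C(B_j), D(B_j) and gamma_j; c: positive scalar.
    """
    eta = 1.0 / (2.0 - beta)

    def y_of(mu):                        # update (26) for a given multiplier mu
        return y_tilde * (Cq / (Dq - mu * gamma)) ** eta

    def r(mu):                           # constraint residual (27)
        return gamma @ y_of(mu) - c

    def r_prime(mu):                     # derivative of r, used by Newton-Raphson
        return np.sum(eta * gamma ** 2 * y_tilde * Cq ** eta
                      / (Dq - mu * gamma) ** (eta + 1))

    t = np.min(Dq / gamma)               # right end of the admissible interval
    mu = t - max(1.0, abs(t))            # initial guess below t ...
    for _ in range(200):                 # ... pushed toward t until r(mu) > 0,
        if r(mu) > 0:                    # so that mu lies in (mu*, t)
            break
        mu = t - 0.5 * (t - mu)
    for _ in range(maxit):               # Newton-Raphson, decreasing to mu*
        mu -= r(mu) / r_prime(mu)
        if abs(r(mu)) <= tol:
            break
    return y_of(mu), mu
```

The same scheme applies to the blocks of W by symmetry of the problem.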
Remark 1. As noted above, the denominators of (26) and (27) will be different from zero. This
follows notably from our assumption that $(W, H) > 0$; see Propositions 1, 2 and 3. This is a standard
assumption in the NMF literature: the entries of W and H are initialized with positive values, which
ensures that all iterates remain positive. This is important because the MU cannot change an entry
equal to zero [26]; this is the so-called zero-locking phenomenon. This implies that C and D in (26)
and (27) are positive matrices (as long as V has at least one nonzero entry per row and column). In
practice, one should however be careful because some entries of W and H can numerically be set to
zero (because of finite precision). Hence, in our implementation, we use the machine precision as a
lower bound for the entries of W and H, as recommended in [17].
Computational cost  The computational cost of Algorithm 1 is asymptotically equivalent to that of the
standard MU for β-NMF, that is, it requires O(FNK) operations per iteration. Indeed, the complexity
is mainly driven by the matrix products required to compute C and D; see (26). To compute the roots
of (27) corresponding to H using Newton-Raphson, each iteration requires computing $r_j(\mu_j)/r_j'(\mu_j)$
for all j, which requires O(KN) operations (when every entry of H appears in a constraint). Finding
the roots therefore requires O(KN) operations times the number of Newton-Raphson iterations. By
symmetry, it requires O(KF) operations to compute the roots corresponding to W. Because of the
quadratic convergence, the number of iterations required for the convergence of the Newton-Raphson
method is typically small, namely between 10 and 100 in our experiments using the stopping criterion
$|r(\mu_j)| \le 10^{-6}$ for all j. Therefore, in practice, the overall complexity of Algorithm 1 is dominated by
the matrix products that require O(FNK) operations. The same conclusions apply to the algorithms
presented in Sections 3, 4 and 5, and this will be confirmed by our numerical experiments.
To understand the underlying significance of SSMF, it is useful to give more insight into the
research topic for which important SSMF techniques were initially developed, namely blind hyperspectral
unmixing (HU), a main research topic in remote sensing. The task of blind HU is to
decompose a remotely sensed hyperspectral image into endmember spectral signatures and the corresponding
abundance maps with limited prior information, usually the only known information being
the number of endmembers. In this context, the columns of W correspond to the endmember spectral
signatures and the columns of H contain the proportions of the endmembers in each column of V, so
the column-stochastic assumption on H naturally holds. The nonnegativity of W follows from the
nonnegativity of the spectral signatures. We refer to the corresponding problem as simplex-structured
nonnegative matrix factorization with the β-divergence (β-SSNMF), and it is formulated as follows:
$$\min_{W\in\mathbb{R}^{F\times K}_+,\ H\in\mathbb{R}^{K\times N}_+}\; D_\beta(V|WH) \quad \text{such that} \quad e^T H = e^T, \qquad (28)$$
where e is the vector of all ones of appropriate dimension. This is a particular case of (2) where
• the subsets $\mathcal{B}_j$ correspond to the columns of H, and there is no subset $\mathcal{K}_i$ (no constraint on W),
• $\gamma_j = e$ and $c_j = 1$ for $j = 1, 2, \ldots, N$.
Hence Algorithm 1 can be directly applied to (28).
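In the simplex-structured case, each column of H forms its own constraint block with $\gamma_j = e$ and $c_j = 1$, so the block update sketched at the end of Section 2.3 can simply be applied column by column; a minimal usage sketch (reusing that hypothetical constrained_mu_block helper, which is our own code and not the authors' implementation) could look as follows:

```python
import numpy as np

def ssnmf_update_H(V, W, H, beta, eps=1e-16):
    """One MU of H for beta-SSNMF (28): each column of H sums to one."""
    WH = np.maximum(W @ H, eps)
    C = W.T @ (WH ** (beta - 2) * V)
    D = W.T @ (WH ** (beta - 1))
    gamma = np.ones(H.shape[0])
    H_new = np.empty_like(H)
    for n in range(H.shape[1]):          # one disjoint constraint per column
        H_new[:, n], _ = constrained_mu_block(H[:, n], C[:, n], D[:, n],
                                              gamma, 1.0, beta)
    return H_new
```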
• Jasper Ridge: 198 spectral bands with 100×100 pixels, containing mostly 4 materials (K = 4),
namely "Road", "Soil", "Water" and "Tree".
• Cuprite: 188 spectral bands with 250×190 pixels, containing mostly 12 types of minerals (K = 12).
β-SSNMF has shown itself to be a powerful model for blind HU; hence this comparative study
between Algorithm 1 and GR-NMF [14] focuses on the convergence aspects, including the evolution
of the objective function and the runtime. The algorithms are compared³ for β ∈ {0, 1/2, 1, 3/2, 2}. To
¹ https://www.irit.fr/~Cedric.Fevotte/extras/tip2015/code.zip
² http://lesun.weebly.com/hyperspectral-data-set.html
³ For β = 3/2, we had an error in our derivations, and use (26) with η(β) = 1 for β ∈ (1, 2); see the discussion
after (26). However, the corresponding MU always decreases the objective function values (which we were monitoring),
although we do not have a theoretical justification for this. A possible approach to obtain such a result would be to
come up with a majorizer of the majorizer that has a closed-form minimizer given by (26) with η(β) = 1 for β ∈ (1, 2).
report the results, we use the relative objective function, denoted $\bar F(W, H)$ and defined as⁴
$$\bar F(W, H) = \frac{D_\beta(V|WH)}{D_\beta\big(V\,\big|\,v\, ee^T\big)},$$
where $v = \frac{e^T V e}{FN}$ is the average of the entries of V. The relative error $\bar F$ should be between 0 and 1:
it is equal to 0 for an exact decomposition with V = WH, and is equal to 1 for the trivial rank-one
approximation where all entries are equal to the average of the entries of V. This allows us to meaningfully
interpret the results, especially since we consider multiple values of β in this comparative study. In
fact, the degree of homogeneity of the β-divergence is a function of β. For example, if all the entries
of the input matrix are multiplied by 10 while the NMF solution is rescaled accordingly, the
squared Frobenius error (β = 2) is multiplied by 100 while the IS-divergence (β = 0) is not affected.
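For reference, a small sketch of the relative objective (reusing the beta_divergence helper sketched in the introduction; the function name is ours):

```python
import numpy as np

def relative_objective(V, W, H, beta):
    """Relative objective D_beta(V|WH) / D_beta(V| v * e e^T) defined above."""
    F, N = V.shape
    v = V.mean()                         # average of the entries of V
    baseline = np.full((F, N), v)        # trivial rank-one approximation
    return beta_divergence(V, W @ H, beta) / beta_divergence(V, baseline, beta)
```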
As for all tests performed in this paper, the algorithms are tested on a desktop computer with
an Intel Core i7 CPU @ 3.00 GHz and 32GB of memory. The codes are written in MATLAB R2018a,
and available from https://sites.google.com/site/nicolasgillis/. For all simulations, the algorithms are
run for 20 random initializations of W and H (each entry sampled from the uniform distribution in
[0, 1]). Table 2 reports the average and standard deviation of the runtime (in seconds) as well as the final
value of the relative objective function over these 20 runs for a maximum of 300 iterations.
Table 2: Runtime performance in seconds and final value of the relative objective function $\bar F_{\mathrm{end}}(W, H)$
for Algorithm 1 and GR-NMF, reported for β ∈ {0, 1/2, 1, 3/2, 2}. The table reports the average
and standard deviation over 20 random initializations with a maximum of 300 iterations for three
hyperspectral data sets. A bold entry indicates the best value for each experiment.
We observe that Algorithm 1 outperforms the GR-NMF in terms of runtime and final values for the
relative objective function for all test cases except when β “ 0 for the Samson and Cuprite data sets.
In particular, for β “ 1, Algorithm 1 is up to 2.5 times faster than the GR-NMF. For the Cuprite data
set with β “ 1{2, Algorithm 1 and GR-NMF perform similarly. We also observe that the standard
⁴ For the Frobenius norm, that is, β = 2, the relative error is typically defined as $\frac{D_\beta(V|WH)}{D_\beta(V|0)}$, meaning that the trivial
solution used is the all-zero matrix. However, for other β-divergences, the value of $D_\beta(V|0)$ might not be defined; in
particular, for β ≤ 1 and $v_{fn} > 0$ for some f, n.
deviations obtained with Algorithm 1 are in general significantly smaller for all β, except for β “ 0
for the Samson and Cuprite data sets.
In the supplementary material S1, we provide figures that show the evolution of the relative
objective function values with respect to iterations, and that confirm the observations above.
The minimum-volume (min-vol) β-NMF model considered here is
$$\min_{W\in\mathbb{R}^{F\times K}_+,\ H\in\mathbb{R}^{K\times N}_+}\; D_\beta(V|WH) + \lambda\,\mathrm{vol}(W) \quad \text{such that} \quad W^T e = e, \qquad (29)$$
where λ is a penalty parameter, and vol(W) is a function measuring the volume spanned by the
columns of W. In [25], the authors use $\mathrm{vol}(W) = \operatorname{logdet}(W^TW + \delta I)$, where δ is a small positive
constant that prevents $\operatorname{logdet}(W^TW)$ from going to −∞ when W tends to a rank-deficient matrix (that
is, when rank(W) < K). This model is particularly powerful as it leads to identifiability, which is
crucial in many applications such as hyperspectral imaging or audio source separation [13]. Indeed,
under some mild assumptions and in the exact case, the authors of [25] prove that (29) is able to identify
the ground-truth factors $(W^\#, H^\#)$ that generated the input data V, in the absence of noise. In [25],
(29) is used for blind audio source separation. In a nutshell, blind audio source separation consists in
isolating and extracting unknown sources based on an observation of their mix recorded with a single
microphone⁵. Let us mention that model (29) is also well suited for hyperspectral imaging, as
discussed in [16].
In the next subsections, we show that we can tackle the min-vol β-NMF optimization problem
defined in (29) with the general framework presented in Section 2 in the case β = 1.
To upper bound $\operatorname{logdet}(W^TW + \delta I)$ as required by (5) in Assumption 1, we majorize it using a convex
quadratic separable auxiliary function provided in [25, Eq. (3.6)], which is derived as follows.
⁵ We invite the interested reader to watch the video https://www.youtube.com/watch?v=1BrpxvpghKQ to see the
application of min-vol KL-NMF on the decomposition of a famous song from the city of Mons.
First, the concave function $\operatorname{logdet}(Q)$ for $Q \succ 0$ can be upper bounded using the first-order Taylor
approximation: for any $\widetilde Q \succ 0$,
$$\operatorname{logdet}(Q) \le \operatorname{logdet}\big(\widetilde Q\big) + \big\langle \widetilde Q^{-1},\, Q - \widetilde Q\big\rangle = \big\langle \widetilde Q^{-1},\, Q\big\rangle + \mathrm{cst}.$$
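Applying this bound with $Q = W^TW + \delta I$ and $\widetilde Q = \widetilde W^T\widetilde W + \delta I$ (a step we spell out here for completeness; it is implicit in the derivation of [25]) gives
$$\operatorname{logdet}\big(W^TW+\delta I\big) \;\le\; \big\langle Y,\, W^TW\big\rangle + \mathrm{cst}, \qquad \text{with } Y = \big(\widetilde W^T\widetilde W + \delta I\big)^{-1},$$
which is where the matrix Y used in (32) and in Algorithm 2 comes from; the remaining term $\langle Y, W^TW\rangle$ is then majorized by the separable quadratic bound of [25, Eq. (3.6)].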
where $w_f$ denotes the f-th row of W, G is given by (24), $\bar\ell$ by [25, Eq. (3.6)] and derived as explained
above, and c is a constant. Let µ be the vector of Lagrange multipliers of dimension K associated with each
linear constraint $e^T w_i = 1$, where $w_i$ is the i-th column of W. Exactly as before (hence we omit the details here), $G^\mu$ is separable and,
given µ, one can compute the closed-form solution:
$$W^\star(\mu) = \widetilde W \odot \frac{\Big[\big[C + e\mu^T\big]^{.2} + S\Big]^{.\frac{1}{2}} - \big(C + e\mu^T\big)}{[D]}, \qquad (32)$$
where $C = e_{F,N}H^T - 4\lambda\,\widetilde W Y^-$, $D = 4\lambda\,\widetilde W\big(Y^+ + Y^-\big)$, and $S = 8\lambda\,\widetilde W\big(Y^+ + Y^-\big) \odot \Big(\frac{[V]}{[\widetilde W H]}H^T\Big)$,
with $Y = Y^+ - Y^- = \big(\widetilde W^T\widetilde W + \delta I\big)^{-1}$, $Y^+ = \max(Y, 0) \ge 0$ and $Y^- = \max(-Y, 0) \ge 0$, and $e_{F,N}$ is
the F-by-N matrix of all ones. As proved in Proposition 2, the constraint $W^\star(\mu)^T e = e$ is satisfied
for a unique µ in $(-\infty, t)$ where t = ∞ in this case. We can therefore use a Newton-Raphson method
to find each $\mu_i$ with a quadratic rate of convergence; see Proposition 3. Algorithm 2 summarizes our
method to tackle (29).
• Setup #1: sample "Mary had a little lamb" with K = 3, 200 iterations.
• Setup #2: sample "Mary had a little lamb" with K = 7, 200 iterations.
⁶ https://www.youtube.com/watch?v=ZlbK5r5mBH4
Algorithm 2 Min-vol KL-NMF
Input: A matrix $V \in \mathbb{R}^{F\times N}$, an initialization $H \in \mathbb{R}^{K\times N}_+$, an initialization $W \in \mathbb{R}^{F\times K}$, a factorization
rank K, a maximum number of iterations maxiter, and the parameters δ > 0 and λ > 0.
Output: A min-vol rank-K NMF (W, H) of V satisfying the constraints in (29).
1: for it = 1 : maxiter do
2:    % Update of matrix H
3:    $H \leftarrow H \odot \Big[W^T\Big(\frac{[V]}{[WH]}\Big)\Big] \,/\, \big[W^T e_{F,N}\big]$
4:    % Update of matrix W
5:    $Y \leftarrow \big(W^TW + \delta I\big)^{-1}$
6:    $Y^+ \leftarrow \max(Y, 0)$
7:    $Y^- \leftarrow \max(-Y, 0)$
8:    $C \leftarrow e_{F,N}H^T - 4\lambda\,(W Y^-)$
9:    $S \leftarrow 8\lambda\, W\big(Y^+ + Y^-\big) \odot \Big(\frac{[V]}{[WH]}H^T\Big)$
10:   $D \leftarrow 4\lambda\, W\big(Y^+ + Y^-\big)$
11:   $\mu \leftarrow \mathrm{root}\big(W^\star(\mu)^T e = e\big)$ over $\mathbb{R}^K$   % see (32) for the expression of $W^\star(\mu)$
12:   $W \leftarrow W \odot \Big(\Big[\big[C + e\mu^T\big]^{.2} + S\Big]^{.\frac{1}{2}} - \big(C + e\mu^T\big)\Big) \,/\, [D]$
13: end for
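Step 11 of Algorithm 2 amounts to K scalar root-finding problems, one per column of W. Based on (32), the residual $r_i(\mu_i) = \sum_f W^\star_{f,i}(\mu_i) - 1$ is decreasing and convex in $\mu_i$, so a Newton-Raphson iteration started at a point where $r_i > 0$ increases monotonically towards the root; a minimal sketch (our own naming, not the authors' code) is:

```python
import numpy as np

def minvol_column_multiplier(w_tilde, c, s, d, tol=1e-6, maxit=100):
    """Root of r_i(mu) = sum_f W*_{f,i}(mu) - 1 for one column of W, cf. (32),
    with W*_{f,i}(mu) = w~_{f,i} * (sqrt((c_f + mu)^2 + s_f) - (c_f + mu)) / d_f.

    w_tilde, c, s, d: the i-th columns of W~, C, S and D (w_tilde, s, d > 0).
    """
    def w(mu):
        z = c + mu
        return w_tilde * (np.sqrt(z ** 2 + s) - z) / d

    def r(mu):
        return np.sum(w(mu)) - 1.0

    def r_prime(mu):
        z = c + mu
        return np.sum(w_tilde * (z / np.sqrt(z ** 2 + s) - 1.0) / d)

    mu = 0.0
    for _ in range(200):                 # move left until r(mu) > 0
        if r(mu) > 0:                    # (r -> +inf as mu -> -inf)
            break
        mu -= max(1.0, abs(mu))
    for _ in range(maxit):               # Newton-Raphson, increasing to the root
        mu -= r(mu) / r_prime(mu)
        if abs(r(mu)) <= tol:
            break
    return mu
```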
• Setup #3: "Prelude and Fugue No. 1 in C major" with K = 16, 300 iterations.
For each setup, the algorithms are run for the same 20 random initializations of W and H. Table 3
reports the average and standard deviation of the runtime (in seconds) over these 20 runs. Table 4
reports the average and standard deviation of the final values of the β-divergence (data fitting term)
and of the objective function of (29) over these 20 runs for min-vol KL-NMF LS and Algorithm 2. For
this last comparison, the value of the penalty weight λ has been chosen so that min-vol KL-NMF leads to
reasonable solutions for W and H. More precisely, the values of λ are chosen so that the initial value
of $\frac{\lambda\,\big|\operatorname{logdet}\big(W^{(0)T}W^{(0)}+\delta I\big)\big|}{D_\beta(V|W^{(0)}H^{(0)})}$ is equal to 0.1, 0.1 and 0.022 for setup #1, setup #2 and setup #3, respectively.
Table 3: Runtime performance in seconds of baseline KL-NMF, min-vol KL-NMF LS and Algorithm 2.
The table reports the average and standard deviation over 20 random initializations.
We observe that the runtime of Algorithm 2 is close to the baseline KL-NMF algorithm which
confirms the negligible cost of the Newton-Raphson steps to compute µ‹ as discussed in Section 2.3.
On the other hand, since no line search is needed, we have a drastic acceleration from 2x to 7x
Table 4: Final values for Dβ and the penalized objective Ψ from (29) obtained with min-vol KL-
NMF LS and Algorithm 2. The table reports the average and standard deviation over 20 random
initializations for three experimental setups. A bold entry indicates the best value for each experiment.
compared to the backtracking line-search procedure integrated in min-vol KL-NMF LS [25]. Moreover,
we observe in Table 4 that Algorithm 2 outperforms min-vol KL-NMF LS in terms of final values for
the data fitting term and objective function values, with lower standard deviations.
The sparse β-NMF model with ℓ2-norm constraints on the columns of W considered here is
$$\min_{W\in\mathbb{R}^{F\times K}_+,\ H\in\mathbb{R}^{K\times N}_+}\; D_\beta(V|WH) + \sum_{k=1}^K \lambda_k \|H(k,:)\|_1 \quad \text{such that} \quad e^T W^{.(2)} = \rho\, e^T, \qquad (33)$$
where $\lambda_k$ is a penalty weight that controls the sparsity of the k-th row of H, and the quadratic constraints
require the columns of W to lie on the surface of a hyper-sphere centered at the origin with radius
$\sqrt{\rho} > 0$. Without this normalization, the ℓ1-norm regularization would make H tend to zero and W
grow to infinity.
As done before, we update W and H alternately. We tackle the subproblem in H with W fixed
based on the MU developed in [11], which are guaranteed to decrease the objective function:
$$H = \widetilde H \odot \frac{\Big[W^T\big(V \odot [W\widetilde H]^{.(\beta-2)}\big)\Big]}{\Big[W^T[W\widetilde H]^{.(\beta-1)} + \lambda e^T\Big]}, \qquad (34)$$
where $\lambda \in \mathbb{R}^K_+$ is the vector of penalty weights. It remains to compute an update for W. To do so, we
use the convex separable auxiliary function G from [12] constructed at the current iterate $\widetilde W$, from
which we obtain, as before, the Lagrangian function
$$G^\mu\big(W|\widetilde W\big) = \sum_f G\big(w_f|\widetilde w_f\big) + \sum_k \lambda_k\|H(k,:)\|_1 + \sum_f \mu^T\Big(w_f^{.(2)} - \frac{1}{F}\rho e\Big), \qquad (35)$$
where $\mu \in \mathbb{R}^K$ is the vector of Lagrange multipliers associated with the constraint $e^T W^{.(2)} = \rho e^T$.
Exactly as before (hence we omit the details here), given µ, one can obtain a closed-form solution:
$$W^\star(\mu) = \frac{\Big[[C]^{.2} + 8\,\big(e\mu^T\big)\odot S\Big]^{.\frac{1}{2}} - C}{\big[4\, e\mu^T\big]}, \qquad (36)$$
where $C = e_{F,N}H^T$ and $S = \widetilde W \odot \Big(\frac{[V]}{[\widetilde W H]}H^T\Big)$. Let us now write the expression of the quadratic
constraint $\sum_f \big(W^\star(\mu)_{f,i}\big)^2 - \rho = 0$ for one specific column of W, say the i-th:
$$r_i(\mu_i) := \sum_f \big(W^\star_{f,i}(\mu_i)\big)^2 - \rho = \sum_f\left(\frac{\sqrt{(C_{f,i})^2 + 8\mu_i S_{f,i}} - C_{f,i}}{4\mu_i}\right)^2 - \rho = 0. \qquad (37)$$
Computing the Lagrange multiplier $\mu_i$ that satisfies the constraint requires computing the root of the
function $r_i(\mu_i)$. We can show that each $W^\star_{f,i}(\mu_i)$ in (36) is a monotone decreasing, nonnegative, convex
function over $(0, +\infty)$. Therefore, $\sum_f \big(W^\star_{f,i}(\mu_i)\big)^2$ is also monotone decreasing and convex in $\mu_i$ over
$(0, +\infty)$. Indeed, let $g : \mathbb{R} \to \mathbb{R}$ be a monotone decreasing, nonnegative, convex function. If g is
twice differentiable, then $(g^2)'' = 2(g')^2 + 2gg'' \ge 0$ since $g, g'' \ge 0$, and $(g^2)' = 2g'g \le 0$ since $g \ge 0$ and
$g' \le 0$ by hypothesis. We can now conclude that $r_i(\mu_i)$ is a monotone decreasing, convex function over
$(0, +\infty)$. Moreover, using l'Hôpital's rule, we have
$$\lim_{\mu_i\to 0^+}\sum_f\big(W^\star_{f,i}(\mu_i)\big)^2 - \rho = +\infty \quad \text{and} \quad \lim_{\mu_i\to+\infty}\sum_f\big(W^\star_{f,i}(\mu_i)\big)^2 - \rho = -\rho < 0,$$
since ρ > 0. Therefore, the root of $r_i(\mu_i)$ is unique over $(0, +\infty)$. We use a Newton-Raphson method
to solve the problem. Algorithm 3 summarizes our method.
Algorithm 3 Hyperspheric-structured sparse KL-NMF
Input: A matrix $V \in \mathbb{R}^{F\times N}$, an initialization $H \in \mathbb{R}^{K\times N}_+$, an initialization $W \in \mathbb{R}^{F\times K}$, a factorization
rank K, a maximum number of iterations maxiter, a weight vector λ > 0, and the radius ρ > 0.
Output: A sparse rank-K NMF (W, H) of V satisfying the constraints in (33).
1: for it = 1 : maxiter do
2:    % Update of matrix H
3:    $H \leftarrow H \odot \Big[W^T\big(V \odot [WH]^{.(\beta-2)}\big)\Big] \,/\, \Big[W^T[WH]^{.(\beta-1)} + \lambda e^T\Big]$
4:    % Update of matrix W
5:    $C \leftarrow e_{F,N}H^T$
6:    $S \leftarrow W \odot \Big(\frac{[V]}{[WH]}H^T\Big)$
7:    for i = 1 : K do
8:       $\mu_i \leftarrow \mathrm{root}\big(r_i(\mu_i)\big)$ over $(0, +\infty)$   % see Equation (37)
9:    end for
10:   $W \leftarrow \Big(\Big[[C]^{.2} + 8\,\big(e\mu^T\big)\odot S\Big]^{.\frac{1}{2}} - C\Big) \,/\, \big[4\, e\mu^T\big]$
11: end for
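As a complement to step 8 of Algorithm 3, the following sketch (our own naming and safeguards, not the authors' code) finds the per-column multiplier $\mu_i$ as the root of $r_i$ in (37) by Newton-Raphson; since $r_i$ is decreasing and convex on (0, +∞), starting from a point where $r_i > 0$ makes the iterates increase monotonically towards the root:

```python
import numpy as np

def sparse_column_multiplier(c, s, rho, tol=1e-6, maxit=100):
    """Root of r_i(mu) = sum_f w_f(mu)^2 - rho over (0, +inf), cf. (37),
    with w_f(mu) = (sqrt(c_f^2 + 8*mu*s_f) - c_f) / (4*mu).

    c, s: positive vectors (the i-th columns C[:, i] and S[:, i]); rho > 0.
    """
    def w(mu):
        return (np.sqrt(c ** 2 + 8.0 * mu * s) - c) / (4.0 * mu)

    def r(mu):
        return np.sum(w(mu) ** 2) - rho

    def r_prime(mu):
        u = np.sqrt(c ** 2 + 8.0 * mu * s)
        dw = s / (mu * u) - (u - c) / (4.0 * mu ** 2)   # d w_f / d mu
        return np.sum(2.0 * w(mu) * dw)

    mu = 1.0
    for _ in range(200):                 # move left until r(mu) > 0; the text
        if r(mu) > 0:                    # above guarantees a sign change
            break
        mu *= 0.5
    for _ in range(maxit):               # Newton-Raphson, increasing to the root
        mu -= r(mu) / r_prime(mu)
        if abs(r(mu)) <= tol:
            break
    return mu
```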
objective function and the runtime; we refer the interested reader to the Supplementary Material S2
for qualitative results on the ability of sparse β-NMF to decompose such images. For all simulations,
the algorithms are run for 20 random initializations of W and H, and the entries of the penalty weight λ
have been set to 0.1, 0.05 and 0.05 for the Samson, Jasper Ridge and Cuprite data sets, respectively. In
order to fairly compare both algorithms, ρ has been set to 1, as β-SNMF considers an ℓ2-normalization
of the columns of W, and the entries of the weight vector λ in Algorithm 3 are all set to the same value, since
β-SNMF requires using the same value for all rows of H. Table 5 reports the average and standard
deviation of the runtime (in seconds) as well as the final value of the objective function over these 20 runs
for a maximum of 300 iterations. Figure 1 displays the objective function values.
Table 5: Runtime performance in seconds and final value of objective function Φend pW, Hq for Algo-
rithm 3 and β-SNMF. The table reports the average and standard deviation over 20 random initial-
izations with a maximum of 300 iterations for three hyperspectral data sets. A bold entry indicates
the best value for each experiment.
Algorithms      Samson data set                      Jasper Ridge data set                 Cuprite data set
                runtime (s)     Φ_end(W,H)           runtime (s)     Φ_end(W,H)            runtime (s)     Φ_end(W,H)
Algorithm 3     11.07 ± 0.19    (2.68 ± 0.00)·10³    15.67 ± 0.17    (4.65 ± 0.00)·10³     70.16 ± 0.85    (2.12 ± 0.00)·10³
β-SNMF [32]     7.63 ± 0.13     (2.68 ± 0.00)·10³    10.98 ± 0.18    (4.71 ± 0.00)·10³     51.86 ± 0.74    (2.18 ± 0.00)·10³
According to Table 5 (top row), we observe that Algorithm 3 outperforms the heuristic from [32]
in terms of final value of the objective function, while β-SNMF shows lower runtimes. Additionally,
based on Figure 1, we observe that Algorithm 3 converges on average faster than β-SNMF for all
the data sets, in terms of iterations. However, β-SNMF has a lower computational cost per iteration.
Thus, we complete the comparison between both algorithms by imposing the same computational
time: we run Algorithm 3 for 300 iterations, record the computational time and run β-SNMF for the
Figure 1: Averaged objective functions over 20 random initializations obtained for Algorithm 3 with
300 iterations (red line with circle markers), and the heuristic β-SNMF from [32] (black dashed line).
same amount of time. Table 6 reports the average and standard deviation of the final value for the
Table 6: Final value of objective function values Φend pW, Hq for Algorithm 3 and the heuristic from
[32]. The table reports the average and standard deviation over 20 random initializations for an equal
computational time that corresponds to 300 iterations of Algorithm 3. A bold entry indicates the best
value for each experiment.
Algorithms      Samson data set       Jasper Ridge data set    Cuprite data set
                Φ_end(W,H)            Φ_end(W,H)               Φ_end(W,H)
Algorithm 3     (2.68 ± 0.00)·10³     (4.65 ± 0.00)·10³        (2.12 ± 0.00)·10³
β-SNMF [32]     (2.68 ± 0.00)·10³     (4.66 ± 0.00)·10³        (2.15 ± 0.00)·10³
objective function over 20 runs in this setting. Figure 1 (bottom row) displays the objective function
w.r.t. time for the three data sets. In this comparison, Algorithm 3 and the heuristic from [32]
perform similarly, although Algorithm 3 has slightly better final objective function values. However,
keep in mind that only Algorithm 3 is theoretically guaranteed to decrease the objective function.
6 Conclusion
In this paper we have presented a general framework to solve penalized β-NMF problems that in-
tegrates a set of disjoint constraints on the variables; see the general formulation (2). Using this
framework, we showed that we can derive algorithms that compete favorably with the state of the art
for a wide variety of β-NMF problems, such as the simplex-structured NMF and the minimum-volume
β-NMF with sum-to-one constraints on the columns of W. We have also shown how to extend the
framework to non-linear disjoint constraints, with an application to a sparse β-NMF model for β = 1
where each column of W lies on a hyper-sphere.
Further work will focus on the possible extension of the method to non-disjoint constraints. Non-disjoint
constraints lead to root-finding problems for polynomial equations in the Lagrange
multipliers, for which we hope to find conditions that ensure the uniqueness of the solution.
Another interesting direction of research would be to apply our framework to other NMF models.
For example, in probabilistic latent semantic analysis/indexing (PLSA/PLSI), the model is the fol-
lowing: given a nonnegative matrix V such that $e^T V e = 1$ (this can be assumed w.l.o.g. by dividing
the input matrix by $e^T V e$), solve
$$\max_{W\ge 0,\ H\ge 0,\ s\ge 0}\ \sum_{f,n} v_{fn}\log\big(W\,\mathrm{Diag}(s)\,H\big)_{fn} \quad \text{such that} \quad W^T e = e,\ He = e,\ s^T e = 1.$$
This model is equivalent to KL-NMF [9], with the additional constraint that eT W He “ eT Xe, and
hence our framework is applicable to PLSA/PLSI. Such constraints have also applications in soft
clustering contexts; see [35].
Acknowledgment
We would like to thank the Associate Editor and the reviewers for taking the time to carefully read the
paper and for the useful feedback that helped us improve the paper. We also thank Arthur Marmin
for identifying an error in our derivations when β P p1, 2q (indicated in red color in this version of the
manuscript).
Table 7: Decomposition $d_\beta = \check d + \hat d$ of the β-divergence.
$\check d(x|y)$:  $\tfrac{1}{1-\beta}\,x y^{\beta-1}$ for $\beta\in(-\infty,1)\setminus\{0\}$;  $\tfrac{x}{y}$ for $\beta = 0$;  $y - x\log y$ for $\beta = 1$;  $\tfrac{1}{\beta}y^{\beta} - \tfrac{1}{\beta-1}\,x y^{\beta-1}$ for $\beta\in(1,2)$;  $\tfrac{1}{\beta}y^{\beta}$ for $\beta\in[2,+\infty)$.
$\hat d(x|y)$:  $\tfrac{1}{\beta}y^{\beta} - \tfrac{1}{\beta(1-\beta)}x^{\beta}$ for $\beta\in(-\infty,1)\setminus\{0\}$;  $\log\tfrac{y}{x} - 1$ for $\beta = 0$;  $x\log x - x$ for $\beta = 1$;  $\tfrac{1}{\beta(\beta-1)}x^{\beta}$ for $\beta\in(1,2)$;  $-\tfrac{1}{\beta-1}\,x y^{\beta-1} + \tfrac{1}{\beta(\beta-1)}x^{\beta}$ for $\beta\in[2,+\infty)$.
In Table 7, $y \in (0,\infty)$, β is real valued, and $x \in (0,\infty)$. Further, β and x are considered as parameters, $d_\beta$, $\hat d$ and $\check d$ being handled as univariate functions of y.
We can now introduce the properties of concavity, convexity and monotonicity for our convex-concave
formulation of the discrete β-divergence:
1. $\check d(x|y)$ is $C^\infty$ and strictly convex on $(0,\infty)$ for x > 0 and β ∈ ℝ;
2. $\hat d(x|y)$ is concave for x > 0 and β ∈ ℝ;
• $y^\nu$ is strictly convex for all $\nu \in (-\infty,0)\cup(1,\infty)$, and strictly concave for all $\nu \in (0,1)$;
According to the first two items of Proposition 4, $\check d$ and $\hat d$ indeed yield a convex-concave decomposition
of the β-divergence, which is a variant of [12, Table 1]. Let us remark that the successive
minimization of an upper approximation of this convex-concave decomposition, following the methodology
presented in [12], yields the usual multiplicative update scheme.
References
[1] M. Abdolali and N. Gillis, Simplex-structured matrix factorization: Sparsity-based identifiability and
provably correct algorithms, arXiv preprint arXiv:2007.11446, (2020).
[2] A. Ang and N. Gillis, Accelerating nonnegative matrix factorization algorithms using extrapolation,
Neural computation, 31 (2019), pp. 417–439.
[3] D. P. Bertsekas, Nonlinear Programming, Athena Scientific, Belmont, MA, second ed., 1999.
[4] J. M. Bioucas-Dias, A. Plaza, N. Dobigeon, M. Parente, Q. Du, P. Gader, and J. Chanussot,
Hyperspectral unmixing overview: Geometrical, statistical, and sparse regression-based approaches, IEEE
Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 5 (2012), pp. 354–379.
[5] E. C. Chi and T. G. Kolda, On tensors, sparsity, and nonnegative factorizations, SIAM Journal on
Matrix Analysis and Applications, 33 (2012), pp. 1272–1299.
[6] A. Cichocki, R. Zdunek, and S.-I. Amari, Hierarchical ALS algorithms for nonnegative matrix and
3D tensor factorization, in Lecture Notes in Computer Science, Vol. 4666, Springer, 2007, pp. 169–176.
[7] A. Cichocki, R. Zdunek, A. H. Phan, and S.-I. Amari, Nonnegative Matrix and Tensor Factoriza-
tions: Applications to Exploratory Multi-Way Data Analysis and Blind Source Separation, John Wiley &
Sons, 2009.
[8] O. Dikmen, Z. Yang, and E. Oja, Learning the information divergence, IEEE Transactions on Pattern
Analysis and Machine Intelligence, 37 (2015), pp. 1442–1454.
[9] C. Ding, T. Li, and W. Peng, On the equivalence between non-negative matrix factorization and prob-
abilistic latent semantic indexing, Computational Statistics & Data Analysis, 52 (2008), pp. 3913–3927.
[10] J. Eggert and E. Korner, Sparse coding and NMF, in IEEE International Joint Conference on Neural
Networks, vol. 4, 2004, pp. 2529–2533 vol.4.
[11] C. Févotte, N. Bertin, and J.-L. Durrieu, Nonnegative matrix factorization with the Itakura-Saito
divergence: With application to music analysis, Neural computation, 21 (2009), pp. 793–830.
[12] C. Févotte and J. Idier, Algorithms for nonnegative matrix factorization with the β-divergence, Neural
computation, 23 (2011), pp. 2421–2456.
[13] X. Fu, K. Huang, N. D. Sidiropoulos, and W.-K. Ma, Nonnegative matrix factorization for signal
and data analytics: Identifiability, algorithms, and applications., IEEE Signal Process. Mag., 36 (2019),
pp. 59–80.
[14] C. Févotte and N. Dobigeon, Nonlinear hyperspectral unmixing with robust nonnegative matrix fac-
torization, IEEE Transactions on Image Processing, 24 (2015), pp. 4810–4819.
[15] N. Gillis, The why and how of nonnegative matrix factorization, in Regularization, Optimization, Kernels,
and Support Vector Machines, J. Suykens, M. Signoretto, and A. Argyriou, eds., Machine Learning and
Pattern Recognition, Chapman & Hall/CRC, Boca Raton, Florida, 2014, ch. 12, pp. 257–291.
[16] N. Gillis, Nonnegative Matrix Factorization, SIAM, Philadelphia, 2020.
[17] N. Gillis and F. Glineur, Accelerated multiplicative updates and hierarchical ALS algorithms for non-
negative matrix factorization, Neural computation, 24 (2012), pp. 1085–1105.
[18] N. Guan, D. Tao, Z. Luo, and B. Yuan, NeNMF: An optimal gradient method for nonnegative matrix
factorization, IEEE Transactions on Signal Processing, 60 (2012), pp. 2882–2898.
[19] L. T. K. Hien, N. Gillis, and P. Patrinos, Inertial block proximal methods for non-convex non-smooth
optimization, in International Conference on Machine Learning, 2020, pp. 5671–5681.
[20] D. Hong, T. G. Kolda, and J. A. Duersch, Generalized canonical polyadic tensor decomposition,
SIAM Review, 62 (2020), pp. 133–163.
[21] P. Hoyer, Non-negative matrix factorization with sparseness constraints, J. Mach. Learn. Res., 5 (2004),
p. 1457–1469.
[22] J. Kim, Y. He, and H. Park, Algorithms for nonnegative matrix and tensor factorizations: A unified
view based on block coordinate descent framework, Journal of Global Optimization, 58 (2014), pp. 285–319.
[23] D. Lee and H. Seung, Algorithms for non-negative matrix factorization, in Proceedings of the 13th
International Conference on Neural Information Processing Systems, NIPS, MIT Press Cambridge, 2000,
pp. 535–541.
[24] D. D. Lee and H. S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature,
401 (1999), p. 788.
[25] V. Leplat, N. Gillis, and A. M. S. Ang, Blind audio source separation with minimum-volume beta-
divergence NMF, IEEE Transactions on Signal Processing, 68 (2020), pp. 3400–3410.
[26] C.-J. Lin, Projected gradient methods for nonnegative matrix factorization, Neural computation, 19 (2007),
pp. 2756–2779.
[27] W.-K. Ma, J. M. Bioucas-Dias, T. Chan, N. Gillis, P. Gader, A. J. Plaza, A. Ambikapathi,
and C. Chi, A signal processing perspective on hyperspectral unmixing: Insights from remote sensing,
IEEE Signal Processing Magazine, 31 (2014), pp. 67–81.
[28] K. Miller and G. Samko, Completely monotonic functions, Integral Transforms and Special Functions,
12 (2001), p. 389–402.
[29] K. Neymeyr and M. Sawall, On the set of solutions of the nonnegative matrix factorization problem,
SIAM Journal on Matrix Analysis and Applications, 39 (2018), pp. 1049–1069.
[30] J. Ortega and W. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables, Aca-
demic Press, New York, NY, 1970.
[31] Y. Qian, S. Jia, J. Zhou, and A. Robles-Kelly, Hyperspectral unmixing via ℓ1/2 sparsity-constrained
nonnegative matrix factorization, IEEE Transactions on Geoscience and Remote Sensing, 49 (2011),
pp. 4282–4297.
[32] J. L. Roux, F. J. Weninger, and J. R. Hershey, Sparse NMF – half-baked or well done?, tech. rep.,
Mitsubishi Electric Research Laboratories (MERL), 2015.
[33] Y. Sun, P. Babu, and D. Palomar, Majorization-minimization algorithms in signal processing, com-
munications, and machine learning, IEEE Transactions on Signal Processing, 65 (2017), pp. 794–816.
[34] T. Virtanen, Monaural sound source separation by nonnegative matrix factorization with temporal conti-
nuity and sparseness criteria, IEEE Transactions on Audio, Speech, and Language Processing, 15 (2007),
pp. 1066–1074.
[35] Z. Yang, J. Corander, and E. Oja, Low-rank doubly stochastic matrix decomposition for cluster anal-
ysis, The Journal of Machine Learning Research, 17 (2016), pp. 6454–6478.
[36] A. L. Yuille and A. Rangarajan, The concave-convex procedure, Neural Computation, 15 (2003),
pp. 915–936.
[37] F. Zhu, Hyperspectral unmixing: ground truth labeling, datasets, benchmark performances and survey,
arXiv preprint arXiv:1708.05125, (2017).
Supplementary Material
This supplementary material provides additional numerical experiments. In S1, we show the evolution
of the error as a function of the iterations for Algorithm 1 and GR-NMF [14] for the tests performed
in Section 3. In S2, we provide qualitative results obtained with Algorithm 3 on hyperspectral images.
(a) Samson Data set (b) Jasper Ridge Data set (c) Cuprite Data set
Figure 2: Averaged relative objective function values over 20 random initializations obtained for
Algorithm 1 (red line with circle markers) and GR-NMF (black dashed line) applied to the three
data sets detailed in the text for 300 iterations. The comparison is performed for different values of β,
from top to bottom: β = 2, β = 3/2, and β = 1. Logarithmic scale on the y-axis.
(a) Samson Data set
Figure 3: Samson data set (a) and results ((b) and (c)) for the abundance maps estimated using
Algorithm 3 for the three endmembers: #1 Tree, #2 Soil and #3 Water. Two average sparsity levels
are considered: 0.25 (b) and 0.5 (c).
(a) Jasper Ridge Data set
Figure 4: Jasper Ridge data set (a) and results ((b) and (c)) for the abundance maps estimated using
Algorithm 3 for the four endmembers: #1 Road, #2 Tree, #3 Water and #4 Soil. Two average sparsity
levels are considered: 0.25 (b) and 0.5 (c).
(a) Urban Data set
Figure 5: Urban data set (a) and results ((b) and (c)) for the abundance maps estimated using
Algorithm 3 for the six endmembers: #1 Soil, #2 Tree, #3 Grass, #4 Roof, #5 Road/Asphalt and #6
Roof2/Shadows. Two average sparsity levels are considered: 0.25 (b) and 0.5 (c).
Figure 6: Baseline abundances for the endmembers obtained for the Samson data, extracted from [37]: #1
Soil, #2 Tree and #3 Water.
Figure 7: Baseline abundances for the endmembers obtained for the Jasper Ridge data, extracted from
[37]: #1 Road, #2 Soil, #3 Water and #4 Tree.
Figure 8: Baseline abundances for the endmembers obtained for the Urban data, extracted from [37]: #1
Asphalt, #2 Grass, #3 Tree, #4 Roof1, #5 Roof2/Shadow and #6 Soil.