Article
A Block Coordinate Descent-Based Projected Gradient
Algorithm for Orthogonal Non-Negative Matrix Factorization
Soodabeh Asadi 1 and Janez Povh 2,3, *
1 Institute for Data Science, School of Engineering, University of Applied Sciences and Arts Northwestern
Switzerland, 5210 Windisch, Switzerland; [email protected]
2 Faculty of Mechanical Engineering, University of Ljubljana, Aškerčeva ulica 6, SI-1000 Ljubljana, Slovenia
3 Institute of Mathematics, Physics and Mechanics, Jadranska 19, SI-1000 Ljubljana, Slovenia
* Correspondence: [email protected]
Abstract: This article uses the projected gradient method (PG) for a non-negative matrix factoriza-
tion problem (NMF), where one or both matrix factors must have orthonormal columns or rows.
We penalize the orthonormality constraints and apply the PG method via a block coordinate descent
approach. This means that at a certain time one matrix factor is fixed and the other is updated by
moving along the steepest descent direction computed from the penalized objective function and
projecting onto the space of non-negative matrices. Our method is tested on two sets of synthetic
data for various values of penalty parameters. The performance is compared to the well-known
multiplicative update (MU) method from Ding (2006), and with a modified globally convergent variant
of the MU algorithm recently proposed by Mirzal (2014). We provide extensive numerical results
coupled with appropriate visualizations, which demonstrate that our method is very competitive
and usually outperforms the other two methods.
Keywords: non-negative matrix factorization; orthogonality conditions; projected gradient method; multiplicative update algorithm; block coordinate descent

Citation: Asadi, S.; Povh, J. A Block Coordinate Descent-Based Projected Gradient Algorithm for Orthogonal Non-Negative Matrix Factorization. Mathematics 2021, 9, 540. https://doi.org/10.3390/math9050540

Academic Editor: Cornelio Yáñez-Marquez

Received: 8 December 2020; Accepted: 24 February 2021; Published: 4 March 2021

1. Introduction

1.1. Motivation

Many machine learning applications require processing large and high dimensional data. The data could be images, videos, kernel matrices, spectral graphs, etc., represented as an m × n matrix R. The data size and the amount of redundancy increase rapidly when m and n grow. To make the analysis and the interpretation easier, it is favorable to obtain a compact and concise low rank approximation of the original data R. This low-rank approximation is known to be very efficient in a wide range of applications, such as text mining [1–3], document classification [4], clustering [5,6], spectral data analysis [1,7], face recognition [8], and many more.

There exist many different low rank approximation methods. For instance, two well-known strategies, broadly used for data analysis, are singular value decomposition (SVD) [9] and principal component analysis (PCA) [10]. Much real-world data is non-negative, and the related hidden parts express physical features only when the non-negativity holds. The factorizing matrices in SVD or PCA can have negative entries, making it hard or impossible to put a physical interpretation on them. Non-negative matrix factorization was introduced as an attempt to overcome this drawback, i.e., to provide the desired low rank non-negative matrix factors:
R ≈ GH, (1)
where R ∈ R_+^{m×n} usually corresponds to the data matrix, G ∈ R_+^{m×p} represents the basis matrix, and H ∈ R_+^{p×n} is the coefficient matrix. With p we denote the number of factors, for which it is desired that p ≪ min(m, n). If we consider each of the n columns of R
being a sample of m-dimensional vector data, the factorization represents each instance
(column) as a non-negative linear combination of the columns of G, where the coefficients
correspond to the columns of H. The columns of G can be therefore interpreted as the p
pieces that constitute the data R. To compute G and H, condition (1) is usually rewritten as
a minimization problem using the Frobenius norm:
\min_{G,H} f(G,H) = \frac{1}{2}\|R - GH\|_F^2, \quad G \ge 0,\ H \ge 0. \qquad \text{(NMF)}

Requiring, in addition, that the columns of G are orthonormal leads to the orthogonal NMF problem

\min_{G,H} f(G,H) = \frac{1}{2}\|R - GH\|_F^2, \quad \text{s.t. } G \ge 0,\ H \ge 0,\ G^T G = I, \qquad \text{(ONMF)}

while imposing orthonormality on the rows of H as well gives the bi-orthogonal variant

\min_{G,H} f(G,H) = \frac{1}{2}\|R - GH\|_F^2, \quad \text{s.t. } G \ge 0,\ H \ge 0,\ G^T G = I,\ HH^T = I. \qquad \text{(bi-ONMF)}
the orthogonality constraint with incremental learning. This approach adopts batch update
in the process of incremental learning.
\mathrm{RSE} := \frac{\|R - GH\|_F}{1 + \|R\|_F}, \qquad (2)
Please note that we added 1 to the denominator in the formula above to prevent
numerical difficulties when the data matrix R has a very small Frobenius norm.
Deviations from the orthonormality are computed using Formulas (17) and (18) from
Section 4. Our numerical results show that our algorithm is very competitive and
almost always outperforms the MU-based algorithms.
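For concreteness, RSE from (2) is a one-line computation; the following is a minimal NumPy sketch (our experiments themselves were carried out in MATLAB, so the function name here is purely illustrative):

import numpy as np

def rse(R, G, H):
    # RSE from (2): ||R - G H||_F / (1 + ||R||_F).
    return np.linalg.norm(R - G @ H, 'fro') / (1.0 + np.linalg.norm(R, 'fro'))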
1.5. Notations
Some notations used throughout our work are described here. We denote scalars and
indices by lower-case Latin letters, vectors by lowercase boldface Latin letters, and matrices
by capital Latin letters. R^{m×n} denotes the set of m by n real matrices, and I symbolizes
the identity matrix. We use the notation ∇ to show the gradient of a real-valued function.
We define ∇+ and ∇− as the positive and (unsigned) negative parts of ∇, respectively,
i.e., ∇ = ∇+ − ∇−. The symbols ⊙ and ⊘ denote the element-wise multiplication and the element-wise division, respectively.
from the work of Lee and Seung [16]. Various MU variants were later proposed by several
researchers, for an overview see [38]. At each iteration of these methods, the elements of G
and H are multiplied by certain updating factors.
As already mentioned, (ONMF) was proposed by Ding et al. [11] as a tool to improve
the clustering capability of the associated optimization approaches. To adapt the MU
algorithm for this problem, they employed standard Lagrangian techniques: they intro-
duced the Lagrangian multiplier Λ (a symmetric matrix of size p × p) for the orthogonality
constraint, and minimized the Lagrangian function where the orthogonality constraint is
moved to the objective function as the penalty term Trace(Λ(G^T G − I)). The complemen-
tarity conditions from the related KKT conditions can be rewritten as a fixed point relation,
which finally can lead to the following MU rule for (ONMF):
G_{ij} = G_{ij}\sqrt{\frac{(RH^T)_{ij}}{(GG^T RH^T)_{ij}}}, \quad i = 1, \dots, m,\ j = 1, \dots, p,
\qquad (3)
H_{st} = H_{st}\sqrt{\frac{(G^T R)_{st}}{(G^T G H)_{st}}}, \quad s = 1, \dots, p,\ t = 1, \dots, n.
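For illustration, one sweep of the MU rules (3) can be written in a few lines of NumPy; this is a minimal sketch, with a small constant delta added to the denominators (its value here is only illustrative; see the remark on δ below) to avoid division by zero:

import numpy as np

def onmf_mu_step(R, G, H, delta=1e-12):
    # One multiplicative-update sweep following (3); delta guards the denominators.
    G = G * np.sqrt((R @ H.T) / (G @ G.T @ R @ H.T + delta))
    H = H * np.sqrt((G.T @ R) / (G.T @ G @ H + delta))
    return G, H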
They extended this approach to non-negative three-factor factorization with the demand
that two factors satisfy orthogonality conditions, which is a generalization of (bi-ONMF).
The MU rules (28)–(30) from [11], adapted to (bi-ONMF), are the main ingredients of
Algorithm 1, which we will call Ding’s algorithm.
Algorithm 1 converges in the sense that the solution pairs G and H generated by this
algorithm yield a sequence of decreasing RSEs, see [11], Theorems 5 and 7.
If R has zero columns or rows, a division by zero may occur. Moreover, denominators close to zero may still cause numerical problems. To escape this situation, we follow [39] and add a small positive number δ to the denominators of the MU terms (4). Please note that Algorithm 1 can be easily adapted to solve (ONMF) by replacing the second MU rule from (4) with the second MU rule of (3).
the orthogonality constraints are moved into the objective function (the so-called penalty approach), and the importance of the orthogonality constraints is controlled by the penalty parameters α, β:
\min_{G,H} F(G,H) = \frac{1}{2}\|R - GH\|_F^2 + \frac{\alpha}{2}\left\|HH^T - I\right\|_F^2 + \frac{\beta}{2}\left\|G^T G - I\right\|_F^2, \quad \text{s.t. } G \ge 0,\ H \ge 0. \qquad \text{(pen-ONMF)}
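To make the penalized formulation concrete, the objective of (pen-ONMF) can be evaluated as follows; a minimal NumPy sketch, with alpha and beta as the penalty parameters:

import numpy as np

def pen_onmf_objective(R, G, H, alpha, beta):
    # F(G, H) from (pen-ONMF): data-fitting term plus the two orthonormality penalties.
    I = np.eye(G.shape[1])
    fit = 0.5 * np.linalg.norm(R - G @ H, 'fro') ** 2
    pen_H = 0.5 * alpha * np.linalg.norm(H @ H.T - I, 'fro') ** 2
    pen_G = 0.5 * beta * np.linalg.norm(G.T @ G - I, 'fro') ** 2
    return fit + pen_H + pen_G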
For the objective function in (pen-ONMF), Mirzal proposed the MAU rules along with the use of Ḡ = (ḡ_{ij}) and H̄ = (h̄_{st}), instead of G and H, to avoid the zero locking phenomenon [28], Section 2:

\bar{g}_{ij} = \begin{cases} g_{ij}, & \text{if } \nabla_G f(G,H)_{ij} \ge 0,\\ \max\{g_{ij}, \nu\}, & \text{if } \nabla_G f(G,H)_{ij} < 0, \end{cases} \qquad (6)

\bar{h}_{st} = \begin{cases} h_{st}, & \text{if } \nabla_H f(G,H)_{st} \ge 0,\\ \max\{h_{st}, \nu\}, & \text{if } \nabla_H f(G,H)_{st} < 0. \end{cases} \qquad (7)
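Rules (6) and (7) are simple element-wise operations; a minimal NumPy sketch (the default value of nu is a placeholder chosen here for illustration):

import numpy as np

def avoid_zero_locking(X, grad_X, nu=1e-16):
    # Gbar / Hbar from (6)-(7): keep the entry where the gradient is non-negative,
    # otherwise lift it to at least nu so a zero entry can still move.
    return np.where(grad_X >= 0, X, np.maximum(X, nu))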
G^{k+1} := \operatorname*{argmin}_{G \ge 0}\ \frac{1}{2}\left\|R - G H^k\right\|_F^2 + \frac{\alpha}{2}\left\|H^k (H^k)^T - I\right\|_F^2 + \frac{\beta}{2}\left\|G^T G - I\right\|_F^2 \qquad (8)

Fix G := G^{k+1} and compute new H as follows:

H^{k+1} := \operatorname*{argmin}_{H \ge 0}\ \frac{1}{2}\left\|R - G^{k+1} H\right\|_F^2 + \frac{\alpha}{2}\left\|HH^T - I\right\|_F^2 + \frac{\beta}{2}\left\|(G^{k+1})^T G^{k+1} - I\right\|_F^2 \qquad (9)
k := k + 1
3. Until some stopping criterion is satisfied
OUTPUT: G, H.
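The outer loop of this block coordinate descent scheme can be summarized in a short Python sketch; solve_subproblem_G and solve_subproblem_H are hypothetical stand-ins for the projected gradient solver of Algorithm 4 applied to (8) and (9), respectively:

def bcd_outer_loop(R, G, H, alpha, beta, solve_subproblem_G, solve_subproblem_H,
                   max_outer_iter=1000):
    # Alternately solve the two sub-problems until the stopping criterion
    # (here simply a maximum number of outer iterations) is reached.
    for _ in range(max_outer_iter):
        G = solve_subproblem_G(R, G, H, beta)   # G^{k+1} from (8), H fixed
        H = solve_subproblem_H(R, G, H, alpha)  # H^{k+1} from (9), G fixed
    return G, H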
The objective function in (pen-ONMF) is no longer quadratic, so we lose the nice properties of Armijo's rule that Lin exploits to his advantage. We nevertheless managed to use the Armijo rule directly and still obtained good numerical results; see Section 4. Armijo [41] was the first to establish convergence to stationary points of smooth functions using an inexact line search with a simple “sufficient decrease” condition. The Armijo condition ensures that the line search step is not too large.
We refer to (8) or (9) as sub-problems. Obviously, solving these sub-problems in every iteration could be more costly than an iteration of Algorithm 1 or 2. Therefore, we must find effective
methods for solving these sub-problems. Similarly to Lin, we apply the PG method to
solve the sub-problems (8) and (9). Algorithm 4 contains the main steps of the PG method
for solving the latter and can be straightforwardly adapted for the former.
For the sake of simplicity, we denote by F_H the function that we optimize in (8), which is actually a simplified version (pure H terms removed) of the objective function from (pen-ONMF) for H fixed:

F_H(G) := \frac{1}{2}\|R - GH\|_F^2 + \frac{\beta}{2}\left\|G^T G - I\right\|_F^2.

Similarly, for G fixed, the objective function from (9) will be denoted by:

F_G(H) := \frac{1}{2}\|R - GH\|_F^2 + \frac{\alpha}{2}\left\|HH^T - I\right\|_F^2.
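Both sub-problem objectives and their gradients have simple closed forms; the gradients below follow by direct differentiation of F_H and F_G (a minimal NumPy sketch):

import numpy as np

def FH_and_grad(R, G, H, beta):
    # F_H(G) and its gradient with respect to G (H is fixed).
    E = G @ H - R
    M = G.T @ G - np.eye(G.shape[1])
    value = 0.5 * np.linalg.norm(E, 'fro') ** 2 + 0.5 * beta * np.linalg.norm(M, 'fro') ** 2
    grad = E @ H.T + 2.0 * beta * G @ M
    return value, grad

def FG_and_grad(R, G, H, alpha):
    # F_G(H) and its gradient with respect to H (G is fixed).
    E = G @ H - R
    M = H @ H.T - np.eye(H.shape[0])
    value = 0.5 * np.linalg.norm(E, 'fro') ** 2 + 0.5 * alpha * np.linalg.norm(M, 'fro') ** 2
    grad = G.T @ E + 2.0 * alpha * M @ H
    return value, grad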
In Algorithm 4, P is the projection operator which projects the new point (matrix) on
the cone of non-negative matrices (we simply set negative entries to 0).
Inequality (10) shows the Armijo rule to find a suitable step-size guaranteeing a
sufficient decrease. Searching for λk is a time-consuming operation, therefore we strive to
do only a small number of trials for new λ in Step 3.1.
Similarly to Lin [19], we allow λ to take any positive value. More precisely, we start with λ = 1 and, if the Armijo rule (10) is satisfied, we increase the value of λ by dividing it by γ < 1. We repeat this until (10) is no longer satisfied or the same matrix H_λ as in the
previous iteration is obtained. If the starting λ = 1 does not yield Hλ which would satisfy
the Armijo rule (10), then we decrease it by a factor γ and repeat this until (10) is satisfied.
The numerical results obtained using different values of parameters γ (updating factor for
λ) and σ (parameter to check (10)) are reported in the following subsections.
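The step-size search just described can be sketched as follows; since inequality (10) is not reproduced above, the sufficient decrease test below is written in the standard form used by Lin [19], which is an assumption on our part, and the defaults σ = 0.001 and γ = 0.1 are the values reported later in this section:

import numpy as np

def pg_armijo_step(fun_grad, X, gamma=0.1, sigma=1e-3, max_trials=20):
    # One projected gradient step for a sub-problem: start with lambda = 1,
    # grow lambda (divide by gamma < 1) while the sufficient decrease test holds,
    # otherwise shrink it (multiply by gamma) until the test is satisfied.
    value, grad = fun_grad(X)

    def trial(lam):
        X_lam = np.maximum(X - lam * grad, 0.0)   # projection P onto non-negative matrices
        value_lam, _ = fun_grad(X_lam)
        ok = value_lam - value <= sigma * np.sum(grad * (X_lam - X))
        return X_lam, ok

    lam = 1.0
    X_new, ok = trial(lam)
    if ok:
        for _ in range(max_trials):               # increase the step while the test still holds
            X_big, ok_big = trial(lam / gamma)
            if not ok_big or np.array_equal(X_big, X_new):
                break
            lam, X_new = lam / gamma, X_big
    else:
        for _ in range(max_trials):               # decrease the step until the test holds
            lam *= gamma
            X_new, ok = trial(lam)
            if ok:
                break
    return X_new, lam

Here fun_grad can be, e.g., FH_and_grad from the sketch above with R, H, and beta bound in, so that the same routine serves both sub-problems.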
where f is the differentiable function that we try to optimize and ∇^P f(x^k) is the projected gradient defined as

\nabla^P f(x)_i = \begin{cases} \nabla f(x)_i, & \text{if } x_i > 0,\\ \min\{0, \nabla f(x)_i\}, & \text{if } x_i = 0. \end{cases} \qquad (12)
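Element-wise, (12) reads as follows in a minimal NumPy sketch; its Frobenius norm is the quantity that enters the stopping conditions below:

import numpy as np

def projected_gradient(x, grad):
    # (12): keep the gradient entry where x_i > 0; where x_i = 0, keep only its
    # negative part (a non-negative entry cannot yield a feasible descent there).
    return np.where(x > 0, grad, np.minimum(grad, 0.0))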
We impose a time limit in seconds and a maximum number of iterations for Algorithm 4
as well. Following [19], we also define stopping conditions for the sub-problems. The
matrices G^{k+1} and H^{k+1} returned by Algorithm 4, respectively, must satisfy

\left\|\nabla^P_G F\left(G^{k+1}, H^k\right)\right\|_F \le \bar{\varepsilon}_G, \qquad \left\|\nabla^P_H F\left(G^{k+1}, H^{k+1}\right)\right\|_F \le \bar{\varepsilon}_H, \qquad (14)

where

\bar{\varepsilon}_G = \bar{\varepsilon}_H = \max\{10^{-7}, \varepsilon\}\,\left\|\nabla F\left(G^0, H^0\right)\right\|_F, \qquad (15)
and ε is the same tolerance used in (13). If the PG method for solving the sub-problem (8)
or (9) stops after the first iteration, then we decrease the stopping tolerance as follows:
\bar{\varepsilon}_G \leftarrow \tau\,\bar{\varepsilon}_G, \qquad \bar{\varepsilon}_H \leftarrow \tau\,\bar{\varepsilon}_H. \qquad (16)
4. Numerical Results
In this section, we demonstrate how the PG method described in Section 3 performs compared to the MU-based algorithms of Ding and Mirzal, which were described in Sections 2.1 and 2.2, respectively.
Table 1. Pairs (n, k) for which we created the UNION and BION datasets.
where µ is a parameter chosen by us and was set to 10^{-2}, 10^{-4}, 10^{-6}. Using basic properties of the uniform distribution, we can easily derive that µ̄ ≤ µ(1 + ‖GH‖_F)√3/n, where n is the order of the square matrix R. We indeed used the right-hand side of this inequality to generate the noise matrices E.
All computations were done using MATLAB [37] and a high performance computer available at the Faculty of Mechanical Engineering of the University of Ljubljana: an Intel Xeon X5670 (1536 hyper-cores) HPC cluster and an E5-2680 V3 (1008 hyper-cores) DP cluster, with an IB QDR interconnect, 164 TB of LUSTRE storage, 4.6 TB of RAM, and 24 TFlop/s of performance.
UNION data by setting α = 0 in the problem formulation (bi-ONMF) and in all formulas
underlying these two algorithms.
The maximum number of outer iterations for all three algorithms was set to 1000. In
practice, we stop Algorithms 1 and 3 only when the maximum number of iterations is reached, while for Algorithm 2, the stopping condition also involves checking the progress of the RSE. If this progress is too small (below 10^{-5}), we also stop.
Recall that for UNION data we have for each pair n, k from Table 1 five symmetric
matrices R for which we try to solve (ONMF) by Algorithms 1–3. Please note that all these
algorithms demand as input the internal dimension k, i.e., the number of columns of factor
G, which is in general not known in advance. Even though we know this dimension by construction for the UNION data, we tested the algorithms using internal dimensions p equal to 20%, 40%, . . . , 100% of k. For p = k, we know the optimum of the problem, which is 0, so for this case we can also estimate how good the tested algorithms are in terms of finding
the global optimum.
The first question we had to answer was which value of β to use in Mirzal's and the PG algorithm. It is obvious that larger values of β move the focus from optimizing the RSE to guaranteeing the orthonormality, i.e., feasibility for the original problem. We decided not to
fix the value of β but to run both algorithms for β ∈ {1, 10, 100, 1000} and report the results.
For each solution pair G, H returned by the algorithms, the non-negativity constraints hold by the construction of the algorithms, so we only need to consider the deviation of G from orthonormality, which we call infeasibility and define as
\mathrm{infeas}_G := \frac{\left\|G^T G - I\right\|_F}{1 + \|I\|_F}. \qquad (17)
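In code, (17) mirrors the RSE computation above; note that ‖I‖_F = √p for the p × p identity (a minimal NumPy sketch):

import numpy as np

def infeas_G(G):
    # infeas_G from (17): ||G^T G - I||_F / (1 + ||I||_F) with I of size p x p.
    p = G.shape[1]
    return np.linalg.norm(G.T @ G - np.eye(p), 'fro') / (1.0 + np.sqrt(p))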
The computational results that follow in the rest of this subsection were obtained
by setting the tolerance in the stopping criterion to ε = 10−10 , the maximum number of
iterations to 1000 in Algorithm 3 and to 20 in Algorithm 4.
We also set a time limit of 3600 s. Additionally, for σ and γ (the updating parameter for λ in Algorithm 4) we chose 0.001 and 0.1, respectively. Finally, for τ from (16) we set a value of 0.1.
In general, Algorithm 3 converges to a solution in early iterations and the norm of the
projected gradient falls below the tolerance shortly after running the algorithm.
Results in Tables 2 and 3 and their visualizations in Figures 1 and 2 confirm our expectations. More precisely, we can see that the smaller the value of β, the better the RSE. Likewise, the larger the value of β, the smaller the infeasibility infeas_G. In practice, we want to meet both criteria, small RSE and small infeasibility, so some compromise should be made. If RSE is more important than infeasibility, we choose a smaller value of β, and vice versa.
We can also observe that regarding RSE the three compared algorithms do not differ a lot.
However, when the input dimension p approaches the real inner dimension k, Algorithm 3
comes closest to the global optimum RSE = 0. The situation with infeasibility is a bit
different. While Algorithm 1 performs very well in all instances, Algorithm 2 reaches better
feasibility for smaller values of n. Algorithm 3 outperforms the others for β = 1000.
[Figure 1: six panels of RSE (%) vs. p (% of k). Left column: Algorithms 1 (Ding) and 2 (Mirzal); right column: Algorithms 1 (Ding) and 3 (PG); rows: n = 100, 500, 1000; curves for β ∈ {1, 10, 100, 1000}.]
Figure 1. This figure depicts data from Table 2. It contains six plots which illustrate the quality of Algorithms 1–3 regarding RSE
on UNION instances with n = 100, 500, 1000, for β ∈ {1, 10, 100, 1000}. We can see that regarding RSE the performance of these
algorithms on this dataset does not differ a lot. As expected, larger values of β yield larger values of RSE, but the differences are
rather small. However, when p approaches 100% of k, Algorithm 3 comes closest to the global optimum RSE = 0.
[Figure 2: six panels of infeas_G vs. p (% of k). Left column: Algorithms 1 (Ding) and 2 (Mirzal); right column: Algorithms 1 (Ding) and 3 (PG); rows: n = 100, 500, 1000; curves for β ∈ {1, 10, 100, 1000}.]
Figure 2. This figure depicts data from Table 3. It contains six plots which illustrate the quality of Algorithms 1–3 regarding
infeasibility on UNION instances with n = 100, 500, 1000, for β ∈ {1, 10, 100, 1000}. We can see that regarding infeasibility
the performance of these algorithms on this dataset does not differ a lot. As expected, larger values of β yield smaller values
of infeasG , but the differences are rather small.
Table 2. In this table we demonstrate how good an RSE is achieved by Algorithms 1–3 on the UNION dataset. For each
n ∈ {50, 100, 200, 500, 1000} we take all 10 matrices R (five of them corresponding to k = 0.2n and five to k = 0.4n). We
run all three algorithms on these matrices with inner dimensions p ∈ {0.2k, 0.4k, . . . , 1.0k} with all possible values of
β ∈ {1, 10, 100, 1000}. Each row represents the average (arithmetic mean value) RSE obtained on instances corresponding
to given n. For example, the last row shows the average value of RSE in 10 instances of dimension 1000 (five of them
corresponding to k = 200 and five to k = 400) obtained by all three algorithms for all four values of β, which were run with
the input dimension p = k. The bold number is the smallest one in each line.
Table 3. In this table we demonstrate how feasible (orthonormal) the solutions G computed by Algorithms 1–3 on the UNION dataset are, i.e., we report the average infeasibility of the solutions underlying Table 2. The bold number is the smallest one in each line.
Results from Table 3, corresponding to n = 100, 500, 1000 are depicted in Figure 2.
Table 4. RSE obtained by Algorithms 1–3 on the BION data. For the latter two algorithms, we used α = β ∈ {1, 10, 100, 1000}.
For each n ∈ {50, 100, 200, 500, 1000} we take all ten matrices R (five of them corresponding to k = 0.2n and five to k = 0.4n).
We run all three algorithms on these matrices with inner dimensions p ∈ {0.2k, 0.4k, . . . , 1.0k} with all possible values of
α = β. Like before, each row represents the average (arithmetic mean value) of RSE obtained on instances corresponding
to given n and given p as a percentage of k. We can see that the larger the β, the worse the RSE, which is consistent with
expectations. The bold number is the smallest one in each line.
Table 5. In this table we demonstrate how feasible (orthonormal) the solutions G and H computed by Algorithms 1–3 on the BION dataset are, i.e., we report the average infeasibility (18) of the solutions underlying Table 4. We can observe that with these settings all algorithms can very often bring the infeasibility to the order of 10^{-3}, for all values of β.
The bold number is the smallest one in each line.
Figures 3 and 4 depict RSE and infeasibility reached by the three compared algo-
rithms, for n = 100, 500, 1000. We can see that all three algorithms behave well; however,
Algorithm 3 is more stable and less dependent on the choice of β. It is interesting to see that β does not have a big impact on RSE and infeasibility for Algorithm 3; a significant difference can be observed only when the internal dimension is equal to the real internal dimension, i.e., when p = 100%. Based on these numerical results, we can conclude that smaller values of β achieve a better RSE and almost the same infeasibility, so it would make sense to use β = 1.
For Algorithm 2, these differences are bigger, and it is less obvious which β is appropriate. Again, if RSE is more important, then smaller values of β should be taken; otherwise, larger values.
[Figure 3: six panels of RSE (%) vs. p (% of k) on BION data. Left column: Algorithms 1 (Ding) and 2 (Mirzal); right column: Algorithms 1 (Ding) and 3 (PG); rows: n = 100, 500, 1000; curves for β ∈ {1, 10, 100, 1000}.]
Figure 3. This figure contains six plots which illustrate the quality of Algorithms 1–3 regarding RSE on BION instances with n = 100, 500, 1000 and k = 0.2n, 0.4n, for β ∈ {1, 10, 100, 1000}. We can observe that Algorithm 3 is more stable, less dependent on the choice of β, and computes better values of RSE.
[Figure 4: six panels of infeasibility vs. p (% of k) on BION data. Left column: Algorithms 1 (Ding) and 2 (Mirzal); right column: Algorithms 1 (Ding) and 3 (PG); rows: n = 100, 500, 1000; curves for β ∈ {1, 10, 100, 1000}.]
Figure 4. This figure contains six plots which illustrate the quality of Algorithms 1–3 regarding the infeasibility on BION
instances with n = 100, 500, 1000 and k = 0.2n, 0.4n, for β ∈ {1, 10, 100, 1000}. We can observe that Algorithm 3 computes solutions with slightly smaller infeasibility (18) than those computed by Algorithm 2.
computed RSE also increase. However, we can see that all three algorithms are robust to
noise, i.e., the resulting RSE for the noisy and the original BION data are very close. The
same holds for the infeasibility, depicted in Figure 5.
[Figure 5: three rows of panels, one per algorithm (Ding, Mirzal, PG). Left panels: RSE (%) vs. p (% of k); right panels: infeasibility vs. p (% of k); curves for the original BION data and the noisy BION data with µ ∈ {10^{-2}, 10^{-4}, 10^{-6}}.]
Figure 5. On the left, we depict how RSE changes as the inner dimension p increases from 20% of the real inner dimension k to 140% of k. For each algorithm, we depict RSE on the original BION data and on the noisy BION data with µ ∈ {10^{-2}, 10^{-4}, 10^{-6}}. On the right, we demonstrate how (in)feasible the solutions obtained by each algorithm are, for different relative inner dimensions p, on the original and the noisy BION data.
On the noisy dataset, we also demonstrate what happens if the internal dimension
is larger than the true internal dimension (this is demonstrated by p = 120%, 140%).
Algorithm 1 finds a solution that is slightly closer to the optimum compared to the non-noisy data. Algorithm 2 does not improve the RSE; actually, the RSE slightly increases with p. Algorithm 3 has the best performance: it comes with an RSE very close to 0 and stays there with increasing p.
Regarding infeasibility, the situation from Figure 4 can be observed also on the noisy dataset. Figure 5 shows that with p > 100% the infeasibility increases. This is not surprising: the higher the internal dimension, the more difficult it is to achieve orthonormality. However, the resulting values of infeas_{G,H} are still surprisingly small.
We also analyzed how close the matrices G̃ and H̃, computed by all three algorithms, are to the matrices G and H that were used to generate the data matrices R. This comparison is possible only when the inner dimension is equal to the real inner dimension (p = 100%). We found that the Frobenius norms between these pairs of matrices, i.e., ‖G̃ − G‖_F and ‖H̃ − H‖_F, are quite large (of order √k), which means that at first glance the solutions look quite different. However, since for every pair G, H and every permutation matrix Π we have GH = GΠΠ^T H, the differences between the computed pairs of matrices are mainly due to the fact that they have permuted columns (G) or rows (H).
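For instance, the columns of G̃ can be aligned with the columns of G by the permutation that minimizes the sum of column-wise distances; a minimal sketch using SciPy's linear_sum_assignment (an illustrative post-processing step, with hypothetical function names):

import numpy as np
from scipy.optimize import linear_sum_assignment

def align_columns(G_tilde, G):
    # Find a permutation matrix Pi such that G_tilde @ Pi matches G as closely
    # as possible, using squared Euclidean distances between columns as costs.
    cost = ((G_tilde[:, :, None] - G[:, None, :]) ** 2).sum(axis=0)
    rows, cols = linear_sum_assignment(cost)
    Pi = np.zeros((G.shape[1], G.shape[1]))
    Pi[rows, cols] = 1.0
    return Pi

After this alignment, ‖G̃Π − G‖_F (and analogously for the rows of H) gives a more informative measure of how far the computed factors are from the generating ones.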
Table 6. This table contains numerical data obtained by running all three algorithms on the third dataset—BION data with
three levels of noise, represented by µ ∈ {10−2 , 10−4 , 10−6 }. The bold number is the smallest one in each line.
µ = 10^{-2}:
80 200 0.3939 0.0007 0.0007 0.3939 0.0010 0.0010 0.3938 0.0015 0.0015
100 200 0.0072 0.0003 0.0003 0.1327 0.0097 0.0098 0.0039 0.0010 0.0010
120 200 0.0278 0.0483 0.0483 0.1325 0.0283 0.0326 0.0036 0.0465 0.0141
140 200 0.0344 0.0589 0.0590 0.1909 0.0366 0.0363 0.0034 0.0565 0.0188

µ = 10^{-4}:
20 200 0.7884 0.0002 0.0002 0.7884 0.0018 0.0018 0.7884 0.0003 0.0003
40 200 0.6828 0.0001 0.0001 0.6828 0.0009 0.0009 0.6828 0.0001 0.0001
60 200 0.5575 0.0001 0.0001 0.5575 0.0006 0.0006 0.5575 0.0001 0.0001
80 200 0.3942 0.0000 0.0000 0.3942 0.0004 0.0003 0.3942 0.0001 0.0001
100 200 0.0575 0.0086 0.0086 0.1717 0.0089 0.0192 0.0001 0.0004 0.0004
120 200 0.0043 0.0490 0.0489 0.1407 0.0275 0.0321 0.0003 0.0468 0.0159
140 200 0.0049 0.0596 0.0596 0.1743 0.0363 0.0390 0.0003 0.0558 0.0221

µ = 10^{-6}:
20 200 0.7884 0.0001 0.0001 0.7884 0.0017 0.0017 0.7884 0.0002 0.0002
40 200 0.6828 0.0000 0.0000 0.6828 0.0009 0.0008 0.6828 0.0001 0.0001
60 200 0.5575 0.0000 0.0000 0.5575 0.0005 0.0005 0.5575 0.0001 0.0001
80 200 0.3942 0.0000 0.0000 0.3942 0.0003 0.0003 0.3942 0.0001 0.0001
100 200 0.0624 0.0092 0.0091 0.1966 0.0159 0.0167 0.0137 0.0029 0.0006
120 200 0.0031 0.0490 0.0492 0.1250 0.0301 0.0309 0.0003 0.0478 0.0179
140 200 0.0051 0.0595 0.0597 0.1692 0.0367 0.0381 0.0003 0.0562 0.0268
condition also involves checking the progress of the RSE. If this progress is too small (below 10^{-5}), we also stop.
We first demonstrate how the RSE decreases with the iterations. The left plot in Figure 6 shows that Algorithm 3 has the fastest decrease with the number of iterations and needs only a few dozen iterations to reach the optimum RSE. The other two algorithms need many more iterations. However, in each iteration, Algorithm 3 involves solving the two sub-problems (8) and (9), which results in a much higher time per iteration. The right plot of Figure 6 depicts how the RSE decreases with time. We can see that Algorithms 1 and 2 are much faster. We could decrease this difference by using more advanced stopping criteria for Algorithm 3, which will be addressed in our future research.
[Figure 6: left panel, RSE vs. the number of outer iterations; right panel, RSE vs. time (s); curves for Ding, Mirzal, and PG on the noisy BION data.]
Figure 6. This figure depicts how RSE is changing with the number of outer iterations (left) and with time (right), for all
three algorithms. Computations were done on the noisy BION dataset with µ = 10^{-2}, for n = 200, and the inner dimension was equal to the true inner dimension (p = 100%).
Funding: The work of the first author is supported by the Swiss Government Excellence Scholarships
grant number ESKAS-2019.0147. The work of the second author was partially funded by Slovenian
Research Agency under research program P2-0162 and research projects J1-2453, N1-0071, J5-2552,
J2-2512 and J1-1691.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Data is available on git portal: https://github.com/Soodi1/ONMFdata
(accessed on 8 November 2020).
Acknowledgments: The work of the first author is supported by the Swiss Government Excellence
Scholarships grant number ESKAS-2019.0147. This author also thanks the University of Applied
Sciences and Arts, Northwestern Switzerland for supporting the work. The work of the second
author was partially funded by Slovenian Research Agency under research program P2-0162 and
research projects J1-2453, N1-0071, J5-2552, J2-2512 and J1-1691. The authors would also like to thank Andri Mirzal (Faculty of Computing, Universiti Teknologi Malaysia) for providing the code for
his algorithm (Algorithm 2) to solve (ONMF). This code was also adapted by the authors to solve
(bi-ONMF).
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Berry, M.W.; Browne, M.; Langville, A.N.; Pauca, V.P.; Plemmons, R.J. Algorithms and applications for approximate nonnegative
matrix factorization. Comput. Stat. Data Anal. 2007, 52, 155–173. [CrossRef]
2. Pauca, V.P.; Shahnaz, F.; Berry, M.W.; Plemmons, R.J. Text mining using non-negative matrix factorizations. In Proceedings of the
2004 SIAM International Conference on Data Mining, Lake Buena Vista, FL, USA, 22–24 April 2004; pp. 452–456.
3. Shahnaz, F.; Berry, M.W.; Pauca, V.P.; Plemmons, R.J. Document clustering using nonnegative matrix factorization. Inf. Process.
Manag. 2006, 42, 373–386. [CrossRef]
4. Berry, M.W.; Gillis, N.; Glineur, F. Document classification using nonnegative matrix factorization and underapproximation. In
Proceedings of the 2009 IEEE International Symposium on Circuits and Systems, Taipei, Taiwan, 24–27 May 2009; pp. 2782–2785.
5. Li, T.; Ding, C. The relationships among various nonnegative matrix factorization methods for clustering. In Proceedings of the
IEEE Sixth International Conference on Data Mining (ICDM’06), Hong Kong, China, 18–22 December 2006; pp. 362–371.
6. Xu, W.; Liu, X.; Gong, Y. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th
Annual International ACM SIGIR Conference on Research and Development in Informaion Retrieval, Toronto, ON, Canada,
28 July–1 August 2003; pp. 267–273.
7. Kaarna, A. Non-negative matrix factorization features from spectral signatures of AVIRIS images. In Proceedings of the 2006
IEEE International Symposium on Geoscience and Remote Sensing, Denver, CO, USA, 31 July–4 August 2006; pp. 549–552.
8. Zafeiriou, S.; Tefas, A.; Buciu, I.; Pitas, I. Exploiting discriminant information in nonnegative matrix factorization with application
to frontal face verification. IEEE Trans. Neural Netw. 2006, 17, 683–695. [CrossRef]
9. Golub, G.H.; Reinsch, C. Singular value decomposition and least squares solutions. In Linear Algebra; Springer: Berlin, Germany,
1971; pp. 134–151.
10. Jolliffe, I. Principal Component Analysis; Wiley Online Library: Hoboken, NJ, USA, 2005.
11. Ding, C.; Li, T.; Peng, W.; Park, H. Orthogonal nonnegative matrix t-factorizations for clustering. In Proceedings of the 12th
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, 20–23 August 2006;
pp. 126–135.
12. Gillis, N. Nonnegative Matrix Factorization; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2020.
13. Paatero, P.; Tapper, U. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of
data values. Environmetrics 1994, 5, 111–126. [CrossRef]
14. Anttila, P.; Paatero, P.; Tapper, U.; Järvinen, O. Source identification of bulk wet deposition in Finland by positive matrix
factorization. Atmos. Environ. 1995, 29, 1705–1718. [CrossRef]
15. Lee, D.D.; Seung, H.S. Learning the parts of objects by non-negative matrix factorization. Nature 1999, 401, 788. [CrossRef]
[PubMed]
16. Lee, D.D.; Seung, H.S. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems;
Denver, CO, USA, 27 November–2 December 2000; pp. 556–562.
17. Chu, M.; Diele, F.; Plemmons, R.; Ragni, S. Optimality, computation, and interpretation of nonnegative matrix factorizations.
SIAM J. Matrix Anal. 2004, 4, 8030.
18. Kim, H.; Park, H. Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set
method. SIAM J. Matrix Anal. Appl. 2008, 30, 713–730. [CrossRef]
19. Lin, C. Projected gradient methods for nonnegative matrix factorization. Neural Comput. 2007, 19, 2756–2779. [CrossRef]
[PubMed]
20. Cichocki, A.; Zdunek, R.; Amari, S.i. Hierarchical ALS algorithms for nonnegative matrix and 3D tensor factorization. In
International Conference on Independent Component Analysis and Signal Separation; Springer: Berlin, Germany, 2007; pp. 169–176.
21. Halko, N.; Martinsson, P.G.; Tropp, J.A. Finding structure with randomness: Probabilistic algorithms for constructing approximate
matrix decompositions. SIAM Rev. 2011, 53, 217–288. [CrossRef]
22. Yoo, J.; Choi, S. Orthogonal nonnegative matrix factorization: Multiplicative updates on Stiefel manifolds. In International
Conference on Intelligent Data Engineering and Automated Learning; Springer: Berlin, Germany, 2008; pp. 140–147.
23. Yoo, J.; Choi, S. Orthogonal nonnegative matrix tri-factorization for co-clustering: Multiplicative updates on stiefel manifolds.
Inf. Process. Manag. 2010, 46, 559–570. [CrossRef]
24. Choi, S. Algorithms for orthogonal nonnegative matrix factorization. In Proceedings of the 2008 IEEE International Joint
Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008;
pp. 1828–1832.
25. Kim, D.; Sra, S.; Dhillon, I.S. Fast Projection-Based Methods for the Least Squares Nonnegative Matrix Approximation Problem.
Stat. Anal. Data Mining 2008, 1, 38–51. [CrossRef]
26. Kim, D.; Sra, S.; Dhillon, I.S. Fast Newton-type methods for the least squares nonnegative matrix approximation problem. In
Proceedings of the 2007 SIAM International Conference on Data Mining, Minneapolis, MN, USA, 26–28 April 2007; pp. 343–354.
27. Kim, J.; Park, H. Toward faster nonnegative matrix factorization: A new algorithm and comparisons. In Proceedings of the 2008
Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; pp. 353–362.
28. Mirzal, A. A convergent algorithm for orthogonal nonnegative matrix factorization. J. Comput. Appl. Math. 2014, 260, 149–166.
[CrossRef]
29. Lin, C.J. On the convergence of multiplicative update algorithms for nonnegative matrix factorization. IEEE Trans. Neural Netw.
2007, 18, 1589–1596.
30. Esposito, F.; Boccarelli, A.; Del Buono, N. An NMF-Based methodology for selecting biomarkers in the landscape of genes of
heterogeneous cancer-associated fibroblast Populations. Bioinform. Biol. Insights 2020, 14, 1–13. [CrossRef] [PubMed]
31. Peng, S.; Ser, W.; Chen, B.; Lin, Z. Robust orthogonal nonnegative matrix tri-factorization for data representation. Knowl.-Based
Syst. 2020, 201, 106054. [CrossRef]
32. Leplat, V.; Gillis, N.; Ang, A. Blind audio source separation with minimum-volume beta-divergence NMF. IEEE Trans. Signal Process.
2020, 68, 3400–3410. [CrossRef]
33. Casalino, G.; Coluccia, M.; Pati, M.L.; Pannunzio, A.; Vacca, A.; Scilimati, A.; Perrone, M.G. Intelligent microarray data analysis
through non-negative matrix factorization to study human multiple myeloma cell lines. Appl. Sci. 2019, 9, 5552. [CrossRef]
34. Ge, S.; Luo, L.; Li, H. Orthogonal incremental non-negative matrix factorization algorithm and its application in image
classification. Comput. Appl. Math. 2020, 39, 1–16. [CrossRef]
35. Bertsekas, D. Nonlinear Programming; Athena Scientific optimization and Computation Series; Athena Scientific: Nashua, NH,
USA, 2016.
36. Richtárik, P.; Takác, M. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite
function. Math. Program. 2014, 144, 1–38. [CrossRef]
37. The MathWorks. MATLAB Version R2019a; The MathWorks: Natick, MA, USA, 2019.
38. Cichocki, A.; Zdunek, R.; Phan, A.H.; Amari, S.i. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-Way
Data Analysis and Blind Source Separation; John Wiley & Sons: Hoboken, NJ, USA, 2009.
39. Piper, J.; Pauca, V.P.; Plemmons, R.J.; Giffin, M. Object Characterization from Spectral Data Using Nonnegative Factorization
and Information theory. In Proceedings of the AMOS Technical Conference, 2004. Available online: http://users.wfu.edu/
plemmons/papers/Amos2004_2.pdf (accessed on 8 November 2020).
40. Mirzal, A. A Convergent Algorithm for Bi-orthogonal Nonnegative Matrix Tri-Factorization. arXiv 2017, arXiv:1710.11478.
41. Armijo, L. Minimization of functions having Lipschitz continuous first partial derivatives. Pac. J. Math. 1966, 16, 1–3. [CrossRef]
42. Lin, C.J.; Moré, J.J. Newton’s method for large bound-constrained optimization problems. SIAM J. Optim. 1999, 9, 1100–1127.
[CrossRef]