
Approximation Algorithms for Orthogonal Non-negative Matrix Factorization

Moses Charikar (Stanford University)          Lunjia Hu (Stanford University)

Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021, San Diego, California, USA. PMLR: Volume 130. Copyright 2021 by the author(s).
Abstract

In the non-negative matrix factorization (NMF) problem, the input is an m × n matrix M with non-negative entries and the goal is to factorize it as M ≈ AW. The m × k matrix A and the k × n matrix W are both constrained to have non-negative entries. This is in contrast to singular value decomposition, where the matrices A and W can have negative entries but must satisfy the orthogonality constraint: the columns of A are orthogonal and the rows of W are also orthogonal. The orthogonal non-negative matrix factorization (ONMF) problem imposes both the non-negativity and the orthogonality constraints, and previous work showed that it leads to better performance than NMF on many clustering tasks. We give the first constant-factor approximation algorithm for ONMF when one or both of A and W are subject to the orthogonality constraint. We also show an interesting connection to the correlation clustering problem on bipartite graphs. Our experiments on synthetic and real-world data show that our algorithm achieves similar or smaller errors compared to previous ONMF algorithms while ensuring perfect orthogonality (many previous algorithms do not satisfy the hard orthogonality constraint).

1 INTRODUCTION

Low-rank approximation of matrices is a fundamental technique in data analysis. Given a large data matrix M of size m × n, the goal is to approximate it by a low-rank matrix AW where A has size m × k and W has size k × n. Here k is called the inner dimension of the factorization M ≈ AW, controlling the rank of AW. Such a low-rank matrix decomposition enables a succinct and often more interpretable representation of the original data matrix M.

One of the standard approaches to low-rank approximation is singular value decomposition (SVD) (Wold et al., 1987; Alter et al., 2000; Papadimitriou et al., 2000). SVD computes a solution minimizing both the Frobenius norm $\|M - AW\|_F$ and the spectral norm $\sigma_{\max}(M - AW)$ (Eckart and Young, 1936; Mirsky, 1960). In addition, SVD always gives a solution with the orthogonality property: the columns of A are orthogonal and the rows of W are also orthogonal. Orthogonality makes the factors more separable, and thus causes the low-rank representation to have a cleaner structure.

However, in certain cases the data matrix M is inherently non-negative, with entries corresponding to frequencies or probability mass, and in these cases SVD has a serious limitation: the factors A and W computed by SVD often contain negative entries, making the factorization less interpretable. Non-negative matrix factorization (NMF), which constrains A and W to have non-negative entries, is better suited to these cases, and is applied in many domains including computer vision (Lee and Seung, 1999; Li et al., 2001), text mining (Xu et al., 2003; Pauca et al., 2004) and bioinformatics (Brunet et al., 2004; Kim and Park, 2007; Devarajan, 2008).

One drawback of NMF relative to SVD is that it gives less separable factors: the angle between any two columns of A or any two rows of W is at most π/2, simply because the inner product of a pair of vectors with non-negative coordinates is always non-negative. To reap the benefits of non-negativity and orthogonality simultaneously, orthogonal NMF (ONMF) adds orthogonality constraints to NMF on one or both of the factors A and W: the columns of A and/or the rows of W are required to be orthogonal. Indeed, ONMF leads to better empirical performance in many clustering tasks (Ding et al., 2006; Choi, 2008; Yoo and Choi, 2010). While previous works gave ONMF algorithms that converge to local minima (Ding et al., 2006) and an efficient polynomial-time approximation scheme (EPTAS) assuming the inner dimension is a constant (Asteris et al., 2015), a theoretical understanding of the worst-case guarantee one can achieve for ONMF with arbitrary inner dimension is lacking. In this work, we show the first constant-factor approximation algorithm for ONMF with respect to the squared Frobenius error $\|M - AW\|_F^2$ when the orthogonality constraint is imposed on one or both of the factors.

Our Results. We use approximation algorithms for weighted k-means as subroutines, such as the (9 + ε)-approximation local search algorithm by Kanungo et al. (2002). Assuming an r-approximation algorithm for weighted k-means, we show algorithms for ONMF with approximation ratio 2r in the single-factor orthogonality setting where only one of the factors A or W is required to be orthogonal (Theorem 3), and approximation ratio $2r + \frac{8r+8}{\sin^2(\pi/12)}$ in the double-factor orthogonality setting where both A and W are required to be orthogonal (Theorem 8). Here, A (resp. W) being orthogonal means that its columns (resp. rows) are orthogonal but not necessarily of unit length. The approximation ratios are provable upper bounds on the ratio between the error of the output (A, W) of the algorithm and the minimum error over all feasible solutions (A, W), with error measured using the squared Frobenius norm $\|M - AW\|_F^2$. We also demonstrate the superior practical performance of our algorithms by experiments in both the single-factor and the double-factor orthogonality setting on synthetic and real-world datasets (see Section 5 and Appendix G).

Sparse Structure of Solution. When we impose the orthogonality constraint on both the columns of A and the rows of W, the non-negativity and the orthogonality constraints together cause the solution to ONMF to have a very sparse structure. Let $a_i$ denote the i-th column of A and $w_i^T$ denote the i-th row of W. Since $a_i$ and $a_j$ are constrained to have non-negative entries but zero inner product, they have disjoint supports, and this also holds for $w_i$ and $w_j$. As a result, $AW = \sum_{i=1}^k a_i w_i^T$ naturally consists of k disjoint blocks, as shown in Figure 1.

Figure 1: The k columns of A have disjoint supports. The k rows of W also have disjoint supports. The product AW has entries equal to zero outside the k blocks.

If the input M factorizes as M = AW exactly, we can easily recover A and W based on the block-wise structure of M. Therefore, we focus on the agnostic setting where M = AW does not hold exactly, and design approximation algorithms that find solutions comparable to the best possible factorization.

Connection to Bipartite Correlation Clustering. The block-wise structure of AW (Figure 1) relates ONMF to the correlation clustering problem (Bansal et al., 2004) on complete bipartite graphs.

To see the relationship with correlation clustering, let us consider a data matrix M with binary entries and assume k ≥ min{m, n}. Since we can find at most min{m, n} non-zero $a_i w_i^T$ satisfying the orthogonality constraint, all k ≥ min{m, n} give equivalent problems, where any inner dimension is considered feasible. M can be treated as a complete bipartite graph with vertices $\{u_1, \dots, u_m\} \cup \{v_1, \dots, v_n\}$ and edges $(u_i, v_j)$ labeled "+" if $M_{ij} = 1$ or "−" if $M_{ij} = 0$. This edge-labeled complete bipartite graph is exactly an instance of the correlation clustering problem. If the factors A and W also have binary entries and both satisfy the orthogonality constraint, the blocks of $AW = \sum_{i=1}^k a_i w_i^T$ (see Figure 1) are all-ones matrices corresponding to vertex-disjoint complete bipartite subgraphs. This is exactly the form of a solution to the correlation clustering problem, and the objective $\|M - AW\|_F^2$ is exactly the number of disagreements in the correlation clustering problem. Although our algorithm (specifically, the algorithm in Theorem 9) doesn't impose the binary constraint on A and W, we can apply the following lemma to each block of AW to round the solution to binary with only a constant loss in the objective (see Appendix A for proof):

Lemma 1. Let $M \in \{0,1\}^{m \times n}$ be a binary matrix. Let $a \in \mathbb{R}^m_{\ge 0}$ and $w \in \mathbb{R}^n_{\ge 0}$ be two non-negative vectors. Then, there exist binary vectors $\hat a \in \{0,1\}^m$ and $\hat w \in \{0,1\}^n$ such that
\[
\|M - \hat a \hat w^T\|_F^2 \le 8\,\|M - a w^T\|_F^2 .
\]
Moreover, $\hat a$ and $\hat w$ can be computed in poly-time.

Thus, we can obtain an approximation algorithm for minimizing disagreements in complete bipartite graphs via our approximation algorithm for ONMF in Theorem 9. Moreover, without the binary constraint on M, A, W, ONMF with the orthogonality constraint on both A and W can be treated as a soft version of bipartite correlation clustering.
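The correspondence between binary block factorizations and bipartite correlation clustering can be checked numerically. The following sketch is illustrative only (all names are hypothetical and the data is a toy instance); it counts the disagreements of the clustering induced by binary factors and verifies that this count equals $\|M - AW\|_F^2$:

```python
import numpy as np

def disagreements(M, row_label, col_label):
    """Correlation-clustering disagreements on the complete bipartite graph of a
    binary matrix M: a "+" edge (M[i, j] == 1) disagrees if row i and column j are
    in different clusters; a "-" edge (M[i, j] == 0) disagrees if they are in the same one."""
    same = row_label[:, None] == col_label[None, :]
    return int(np.sum((M == 1) & ~same) + np.sum((M == 0) & same))

# Toy instance: two planted blocks plus one flipped entry.
rng = np.random.default_rng(0)
row_label = rng.integers(0, 2, size=8)
col_label = rng.integers(0, 2, size=10)
M = (row_label[:, None] == col_label[None, :]).astype(float)
M[0, 0] = 1.0 - M[0, 0]                       # introduce one disagreement

# Binary factors whose product is the block indicator matrix of the clustering.
A = np.stack([(row_label == i).astype(float) for i in range(2)], axis=1)  # m x k
W = np.stack([(col_label == i).astype(float) for i in range(2)], axis=0)  # k x n

assert np.isclose(np.linalg.norm(M - A @ W, "fro") ** 2,
                  disagreements(M, row_label, col_label))
```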

Open Questions. We used the Frobenius norm as a natural measure of goodness of fit, but it would be interesting to see if one can achieve a constant-factor approximation with respect to other measures, such as the spectral norm, since the two norms can differ by a factor that grows with min{m, n}. It would also be interesting to consider replacing the orthogonality constraint on A and W by a lower bound θ < π/2 on the angles between different columns of A and different rows of W.

Related Work. Non-negative matrix factorization was first proposed by Paatero and Tapper (1994), and was shown to be NP-hard by Vavasis (2010). Algorithmic frameworks for efficiently finding local optima include the multiplicative updating framework (Lee and Seung, 2001) and the alternating non-negative least squares framework (Lin, 2007; Kim and Park, 2011). Under the usually mild separability assumption, Arora et al. (2016) showed an efficient algorithm that computes the global optimum.

Ding et al. (2006) first studied NMF with the orthogonality constraint, and showed its effectiveness in document clustering. After that, algorithms for ONMF using various techniques have been developed for a broad range of applications (Chen et al., 2009; Ma et al., 2010; Kuang et al., 2012; Pompili et al., 2013; Li et al., 2014b; Kim et al., 2015; Qin et al., 2016; Alaudah et al., 2017; Huang et al., 2019). The less restrictive single-factor orthogonality setting has attracted the most attention, and most algorithms for solving it belong to the multiplicative updating framework: iteratively updating A and/or W by taking the element-wise product with other computed non-negative matrices (Yang and Laaksonen, 2007; Choi, 2008; Yoo and Choi, 2008, 2010; Yang and Oja, 2010; Pan and Ng, 2018; He et al., 2020). Other techniques include HALS (hierarchical alternating least squares) (Li et al., 2014a; Kimura et al., 2016) and using a penalty function for the orthogonality constraint (Del Buono, 2009).

While improving the separability of the factors compared to NMF, these algorithms do not guarantee convergence to a solution that has perfect orthogonality (which is also demonstrated in our experiments). There are only a few previous algorithms that have this guarantee, including the EM-ONMF algorithm (Pompili et al., 2014), the ONMFS algorithm (Asteris et al., 2015) and the NRCG-ONMF algorithm of Zhang et al. (2016). ONMFS is the only previous algorithm we know of that has a provable approximation guarantee, but it has a running time exponential in the squared inner dimension. Pompili et al. (2014) give a reduction of ONMF to spherical k-means with a somewhat non-standard objective function: the goal is to minimize the sum of 1 minus the square of the cosine similarity, while the commonly studied objective function for spherical k-means sums up 1 minus the cosine similarity. Our results for ONMF imply a constant-factor approximation for this variant of spherical k-means with the squared cosine similarity in the objective. Many variants of ONMF have also been studied in the literature, including semi-ONMF (Li et al., 2018) and sparse ONMF (Chen et al., 2018; Li et al., 2020).

We would also like to point out that the connection between ONMF and k-means shown in (Ding et al., 2006, Theorems 1 and 2) does not give a reduction in either direction. Their proof shows that the optimization problem associated with k-means is essentially ONMF, but with additional constraints: the matrix G in the ONMF formulation (8) in (Ding et al., 2006) is replaced by the matrix G̃ in the k-means formulation (11) in (Ding et al., 2006). However, G̃ is a "normalized cluster indicator matrix" that is more constrained than the generic matrix G with orthonormal columns, because the entries in every column of G̃ are either zero or take the same non-zero value. This additional constraint makes their argument insufficient to either directly derive an algorithm for ONMF with the same approximation guarantee given one for k-means, or the other way around. Also, later works such as Yoo and Choi (2010) and Asteris et al. (2015) used techniques different from k-means to improve the empirical performance of ONMF.

The correlation clustering problem was proposed by Bansal et al. (2004) on complete graphs; they showed a constant-factor approximation algorithm for the disagreement minimization version and a polynomial-time approximation scheme (PTAS) for the agreement maximization version. Ailon et al. (2008) showed a simple combinatorial algorithm achieving an approximation ratio of 3 in the disagreement minimization version, and Chawla et al. (2015) improved the approximation ratio to the currently best 2.06. Chawla et al. (2015) also showed a 3-approximation algorithm on complete k-partite graphs.

2 WEIGHTED k-MEANS

The k-means problem is a fundamental clustering problem, and we will apply algorithms for its weighted version as subroutines to solve our orthogonal NMF problem. Given points $m_1, \dots, m_n \in \mathbb{R}^m$ and their weights $\ell_1, \dots, \ell_n \in \mathbb{R}_{\ge 0}$, the weighted k-means problem seeks k centroids $c_1, \dots, c_k$ and an assignment mapping $\phi : \{1, \dots, n\} \to \{1, \dots, k\}$ that solve the following optimization problem:
\[
\min_{c_1, \dots, c_k;\ \phi} \ \sum_{i=1}^n \ell_i \, \|m_i - c_{\phi(i)}\|_2^2 .
\]
Even the unweighted ($\ell_i = 1$ for all i) version of this problem is APX-hard, but many constant-factor approximation algorithms are known. Kanungo et al. (2002) showed a local-search algorithm achieving an approximation ratio of 9 + ε,¹ which was improved by Ahmadian et al. (2017) in the unweighted setting to an approximation ratio of 6.357.

¹ The algorithm of Kanungo et al. (2002) was originally designed for the unweighted setting, but it works naturally in the weighted setting if we use the algorithm by Feldman et al. (2007) when computing the (k, ε)-approximate centroid set on which local search is performed.
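In our experiments (Section 5) we use the k-means++ heuristic as the weighted k-means subroutine rather than the local-search algorithm. A minimal weighted Lloyd iteration is sketched below purely to fix notation (all names are hypothetical, and this heuristic carries no worst-case guarantee); off-the-shelf implementations such as scikit-learn's KMeans also accept per-point weights via sample_weight.

```python
import numpy as np

def weighted_lloyd(points, weights, k, iters=50, seed=0):
    """Heuristic for weighted k-means (Lloyd's algorithm): alternately assign every
    point to its nearest centroid and move each centroid to the weighted center of
    mass of its cluster (cf. Fact 6 below). `points` is n x m, `weights` is length n."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # n x k
        phi = np.argmin(d2, axis=1)
        for j in range(k):
            w = weights * (phi == j)
            if w.sum() > 0:
                centers[j] = (w[:, None] * points).sum(axis=0) / w.sum()
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    phi = np.argmin(d2, axis=1)
    cost = float(np.sum(weights * d2[np.arange(len(points)), phi]))
    return centers, phi, cost
```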
3 SINGLE-FACTOR ORTHOGONALITY

In the single-factor orthogonality setting, we impose the orthogonality constraint only on one of the factors A or W. For concreteness, let us assume that the rows of W are required to be orthogonal. Since the rows of W are also non-negative, they must have disjoint supports, or equivalently, each column of W has at most one non-zero entry. This particular structure relates our problem closely to the weighted k-means problem, and it is not hard to apply the approximation algorithms for weighted k-means to our single-factor orthogonality setting. Specifically, assuming there is a poly-time r-approximation algorithm for weighted k-means, we show a poly-time algorithm for the single-factor orthogonality setting with approximation factor 2r (Theorem 3).

To see why k-means plays an important role in our problem, recall that the non-negativity and orthogonality constraints on W simplify each column $w_i$ of W to the form $\theta_i e_{\phi(i)}$, where $\theta_i$ is a non-negative real number, $\phi$ maps $\{1, \dots, n\}$ to $\{1, \dots, k\}$, and $e_{\phi(i)} \in \mathbb{R}^k$ is the unit vector with its $\phi(i)$-th coordinate being one. This means that the i-th column of AW is exactly $\theta_i$ times the $\phi(i)$-th column of A. If we think of the k columns of A as k centroids, and $\phi$ as the assignment mapping that maps every column of M to its closest centroid, (unweighted) k-means is exactly our problem with the additional constraint that $\theta_i = 1$ for all i.

With the freedom of choosing $\theta_i$, it is more convenient to solve our problem by weighted k-means. Assume without loss of generality that every column of A in the optimal solution is the zero vector or has unit length, as we can always scale it back using $\theta_i$. We normalize the columns of M and weight each column proportionally to its initial squared L2 norm. After that, always setting $\theta_i = 1$ only increases the approximation ratio by a factor of 2, as we show in the following lemma proved in Appendix B (think of x as a column of the optimal A and y as a column of M):

Fact 2. Let $x \in \mathbb{R}^m_{\ge 0}$ be a unit vector or the zero vector. For any non-negative vector $y \in \mathbb{R}^m_{\ge 0}$ and any $\theta \ge 0$, we have $\|y - \theta x\|_2^2 \ge \frac{1}{2}\|y\|_2^2 \cdot \|\bar y - x\|_2^2$, where $\bar y = y/\|y\|_2$ if $y \ne 0$, and $\bar y = 0$ if $y = 0$.

Based on this intuition, we obtain the following algorithm. Let $m_1, m_2, \dots, m_n \in \mathbb{R}^m_{\ge 0}$ be the columns of M, and let $\bar m_i$ be the normalized version of $m_i$: $\bar m_i := m_i/\|m_i\|_2$ if $m_i \ne 0$, and $\bar m_i := 0$ if $m_i = 0$. Let $\ell_i := \|m_i\|_2^2$ be the weight of point $\bar m_i \in \mathbb{R}^m_{\ge 0}$. We first compute an r-approximate solution to the following weighted k-means problem:
\[
\min_{c_1, \dots, c_k;\ \phi} \ \sum_{i=1}^n \ell_i \, \|\bar m_i - c_{\phi(i)}\|_2^2 . \qquad (1)
\]
We can assume WLOG that all of the centroids $c_1, \dots, c_k$ have non-negative coordinates, since increasing the negative coordinates to zero never increases the weighted k-means objective. Then we simply output $A = [c_1, \dots, c_k]$ and $W = [\theta_1 e_{\phi(1)}, \dots, \theta_n e_{\phi(n)}]$, where
\[
\theta_i = \begin{cases} \langle m_i, c_{\phi(i)} \rangle / \|c_{\phi(i)}\|_2^2 , & \text{if } c_{\phi(i)} \ne 0 \\ 0 , & \text{if } c_{\phi(i)} = 0 \end{cases}
\ \in\ \arg\min_\theta \|m_i - \theta c_{\phi(i)}\|_2^2 .
\]
We show the approximation guarantee in the following theorem, proved in Appendix C.

Theorem 3. The algorithm above computes a 2r-approximate solution A and W in the single-factor orthogonality setting in time $O(T_{k\text{-means}} + mn)$, where $T_{k\text{-means}}$ is the time needed by the weighted k-means subroutine.
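A minimal sketch of the algorithm above follows. It is illustrative only (not the authors' implementation): it uses scikit-learn's k-means++ heuristic as the weighted k-means subroutine, as in our experiments in Section 5, and all function and variable names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def onmf_single_factor(M, k, seed=0):
    """Single-factor ONMF sketch (rows of W orthogonal): normalize the columns of M,
    run weighted k-means with weights ||m_i||^2, clip centroids to be non-negative,
    and set each theta_i by the closed-form projection onto its assigned centroid."""
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    norms = np.linalg.norm(M, axis=0)                     # ||m_i||_2 per column
    weights = norms ** 2                                  # ell_i = ||m_i||_2^2
    points = np.divide(M, norms, out=np.zeros_like(M), where=norms > 0).T  # normalized columns

    km = KMeans(n_clusters=k, n_init=10, random_state=seed)   # k-means++ heuristic
    phi = km.fit_predict(points, sample_weight=weights)
    A = np.clip(km.cluster_centers_.T, 0.0, None)         # m x k; clipping never hurts

    W = np.zeros((k, n))
    for i in range(n):
        c = A[:, phi[i]]
        denom = c @ c
        W[phi[i], i] = (M[:, i] @ c) / denom if denom > 0 else 0.0  # theta_i
    return A, W
```

Each column of W has at most one non-zero entry, so the rows of W are orthogonal by construction; A is non-negative but its columns need not be orthogonal, which is exactly the single-factor setting.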

4 DOUBLE-FACTOR ORTHOGONALITY

Now we consider the double-factor orthogonality setting, where we require A to have orthogonal columns and W to have orthogonal rows, and show a poly-time constant-factor approximation algorithm in this setting. We first state some basic facts that will be used in the discussion of our algorithms.

Useful Inequalities. The following doubled triangle inequality for the squared L2 distance between vectors x and y is useful when we analyze the approximation ratio of our algorithm:

Fact 4. $\|x - y\|_2^2 \le 2\|x\|_2^2 + 2\|y\|_2^2$.

When both x and y have non-negative coordinates, we have the following stronger fact:

Fact 5. If both x and y have non-negative coordinates, then $\|x - y\|_2^2 \le \|x\|_2^2 + \|y\|_2^2$.

Center of Mass. Given n points $x_1, \dots, x_n \in \mathbb{R}^m$ and their weights $\ell_1, \dots, \ell_n \in \mathbb{R}_{\ge 0}$, the point $y \in \mathbb{R}^m$ minimizing the weighted sum of squared L2 distances $\sum_{i=1}^n \ell_i \|x_i - y\|_2^2$ is the center of mass: $y = \left(\sum_{i=1}^n \ell_i x_i\right) / \left(\sum_{i=1}^n \ell_i\right)$. Moreover, the weighted sum can be decomposed using the following identity (see, for example, Lemma 2.1 in (Kanungo et al., 2002)):

Fact 6. Assume $\ell_1, \dots, \ell_n \ge 0$ and $y = \left(\sum_{i=1}^n \ell_i x_i\right) / \left(\sum_{i=1}^n \ell_i\right)$. Then for any vector b, we have
\[
\sum_{i=1}^n \ell_i \|x_i - b\|_2^2 = \sum_{i=1}^n \ell_i \|x_i - y\|_2^2 + \sum_{i=1}^n \ell_i \|y - b\|_2^2 .
\]

4.1 Intuition

We describe the intuition that leads us to the algorithm. As our first step, we solve the weighted k-means problem as we did in the single-factor orthogonality setting, but we need to additionally ensure that the columns of A are orthogonal. By the doubled triangle inequality (Fact 4) and the property of the center of mass (Fact 6), we can move the n points to their centroids without affecting the approximation ratio too much. Now there are only k distinct points, and it is more convenient to treat these points as vectors, so that we can talk about the angles between them. Our goal is to find k orthogonal centroids that approximate these k vectors. The key challenge is to find the assignment mapping: which vectors are mapped to the same centroid. Once we know the assignment mapping, we can find the best centroids by optimizing each coordinate separately (see (2)). Intuitively, the assignment mapping should respect the angles between the vectors: if a pair of vectors form a "small" angle, they should be mapped to the same centroid, and if they form a "large" angle close to π/2, they should be mapped to different centroids. However, two vectors both forming "small" angles with a third may themselves form a relatively "large" angle. To resolve this lack of transitivity, we need to eliminate angles that are neither very "small" nor very "large". We make the observation that if the angle between two vectors is in the range [π/6, π/3], they cannot be simultaneously close to a set of orthonormal vectors, and thus they cannot both have low cost in the optimal solution, so we can safely "ignore" them by decreasing their weights by the same amount. This weight-reduction procedure eventually makes the angle between any two vectors lie in the range [0, π/6) ∪ (π/3, π/2]. If two vectors both have angles less than π/6 with a third, they themselves cannot form an angle larger than π/3, so now we have the desired transitivity. Our Lemma 10 shows that the assignment mapping computed this way is comparable to the optimal one.

4.2 Algorithm

Our algorithm consists of three major steps. The first step is to apply the weighted k-means algorithm as we did in the single-factor orthogonality setting, and two additional steps are needed to make sure the solution has both factors orthogonal.

Step 1: Weighted k-Means

Let $m_1, m_2, \dots, m_n \in \mathbb{R}^m_{\ge 0}$ be the columns of M and define $\bar m_i$ and $\ell_i$ the same way as in Section 3. Compute an r-approximate solution $c_1, \dots, c_k, \phi$ to the weighted k-means problem (1). Define the weight $q_j$ of a centroid $c_j$ to be the total weight of the points assigned to it: $q_j := \sum_{i \in \phi^{-1}(j)} \ell_i$. By Fact 6, we can always assume WLOG that whenever $q_j > 0$, it holds that $c_j = \left(\sum_{i \in \phi^{-1}(j)} \ell_i \bar m_i\right) / q_j$. Under this assumption, whenever $q_j > 0$, we have $\|c_j\|_2 \le 1$. We also have the following easy fact:

Fact 7. If $q_j > 0$, then $c_j \ne 0$.

Proof. Assume for the sake of contradiction that $c_j = 0$. According to our assumption, we have $0 = c_j = \left(\sum_{i \in \phi^{-1}(j)} \ell_i \bar m_i\right) / q_j$, so for all $i \in \phi^{-1}(j)$, $\ell_i \bar m_i = 0$. If $\bar m_i \ne 0$, we know $\ell_i = 0$; otherwise, we know $m_i = 0$ and thus, again, $\ell_i = \|m_i\|_2^2 = 0$. Now we have our desired contradiction: $q_j = \sum_{i \in \phi^{-1}(j)} \ell_i = 0$.

Step 2: Weight Reduction

Recall that the weight $q_j$ of a centroid $c_j$ was defined to be the total weight of the points assigned to it. The second step of the algorithm is to reduce the weights $q_1, \dots, q_k$ to $q'_1, \dots, q'_k$. To start, all $q'_j$ are initialized to $q_j$. Our algorithm iterates over all pairs $(j_1, j_2)$ satisfying $1 \le j_1 < j_2 \le k$. If $q'_{j_1} > 0$, $q'_{j_2} > 0$ and $\angle(c_{j_1}, c_{j_2}) \in [\pi/6, \pi/3]$, our algorithm decreases both $q'_{j_1}$ and $q'_{j_2}$ by the minimum of the two (thus sending at least one of them to 0). Recall from Fact 7 that $c_{j_1}$ and $c_{j_2}$ are both non-zero, so the angle between them is well-defined.
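A short sketch of the weight-reduction step is given below (illustrative code with hypothetical names; `centers` and `q` are the centroids and weights from Step 1, and by Fact 7 every centroid with positive weight is non-zero, so the angles are well-defined):

```python
import numpy as np

def reduce_weights(centers, q):
    """Step 2: for every pair of positive-weight centroids whose angle lies in
    [pi/6, pi/3], subtract the smaller weight from both, zeroing at least one."""
    qp = np.asarray(q, dtype=float).copy()
    k = len(qp)
    for j1 in range(k):
        for j2 in range(j1 + 1, k):
            if qp[j1] > 0 and qp[j2] > 0:
                c1, c2 = centers[j1], centers[j2]
                cos = c1 @ c2 / (np.linalg.norm(c1) * np.linalg.norm(c2))
                angle = np.arccos(np.clip(cos, -1.0, 1.0))
                if np.pi / 6 <= angle <= np.pi / 3:
                    cut = min(qp[j1], qp[j2])
                    qp[j1] -= cut
                    qp[j2] -= cut
    return qp
```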

Step 3: Finalize the Solution

Now we are most interested in centroids $c_j$ with positive weight ($q'_j > 0$) after the weight-reduction step. For any two centroids $c_{j_1}, c_{j_2}$ with positive weights, we know $\angle(c_{j_1}, c_{j_2}) \in [0, \pi/6) \cup (\pi/3, \pi/2]$. Since the angles between vectors satisfy the triangle inequality, we can group these centroids so that $\angle(c_{j_1}, c_{j_2}) \in [0, \pi/6)$ if $j_1, j_2$ belong to the same group, and $\angle(c_{j_1}, c_{j_2}) \in (\pi/3, \pi/2]$ if $j_1, j_2$ belong to different groups. Suppose $c_j$ belongs to group $\sigma(j) \in \{1, \dots, k\}$.

We claim that we can find an optimal solution to the following optimization problem in poly-time:
\[
\begin{aligned}
\min_{a_1, \dots, a_k} \quad & \sum_{j:\, q'_j > 0} q'_j \, \|c_j - a_{\sigma(j)}\|_2^2 , \\
\text{s.t.} \quad & a_1, \dots, a_k \in \mathbb{R}^m_{\ge 0} , \\
& a_s^T a_t = 0 \quad \text{for all } 1 \le s < t \le k . \qquad (2)
\end{aligned}
\]
To solve the above optimization problem, we decompose it coordinate-wise. Specifically, the constraints on $a_1, \dots, a_k$ translate to the requirement that for every $h \in \{1, \dots, m\}$, the h-th coordinates $a_{1,h}, \dots, a_{k,h}$ are all non-negative and contain at most one positive value. The objective can also be decomposed coordinate-wise:
\[
\sum_{j:\, q'_j > 0} q'_j \|c_j - a_{\sigma(j)}\|_2^2 = \sum_{h=1}^m \sum_{j:\, q'_j > 0} q'_j (c_{j,h} - a_{\sigma(j),h})^2 \ \overset{\text{def}}{=}\ \sum_{h=1}^m O_h .
\]
If we define $q^*_s$ as the total weight of the s-th group, $q^*_s = \sum_{j \in \sigma^{-1}(s)} q'_j$, and when $q^*_s > 0$ define $\mu_{s,h}$ as the weighted average of the h-th coordinate of the centroids in the s-th group, $\mu_{s,h} = (q^*_s)^{-1} \sum_{j \in \sigma^{-1}(s)} q'_j c_{j,h}$, we can further decompose the objective above using Fact 6 as
\[
O_h = \sum_{j:\, q'_j > 0} q'_j (c_{j,h} - \mu_{\sigma(j),h})^2 + \sum_{s:\, q^*_s > 0} q^*_s (\mu_{s,h} - a_{s,h})^2 .
\]
The first term does not depend on $a_1, \dots, a_k$, and the second term is minimized when $a_{s,h} = \mu_{s,h}$ for $s = \arg\max_s q^*_s \mu_{s,h}^2$ and $a_{s,h} = 0$ for all other s. We have thus computed the optimal solution to (2). Since $\|c_j\|_2 \le 1$ whenever $q'_j > 0$, it is straightforward to check that $\|a_s\|_2 \le 1$ for $s = 1, \dots, k$.

We output $A = [a_1, \dots, a_k]$ and $W = [\theta_1 e_{\sigma(\phi(1))}, \dots, \theta_n e_{\sigma(\phi(n))}]$ as the final solution, where
\[
\theta_i = \begin{cases} \langle m_i, a_{\sigma(\phi(i))} \rangle / \|a_{\sigma(\phi(i))}\|_2^2 , & \text{if } a_{\sigma(\phi(i))} \ne 0 \\ 0 , & \text{if } a_{\sigma(\phi(i))} = 0 \end{cases}
\ \in\ \arg\min_\theta \|m_i - \theta a_{\sigma(\phi(i))}\|_2^2 .
\]
Note that $\sigma(j)$ was defined only for j with $q'_j > 0$, but here we extend its definition to all $j \in \{1, \dots, k\}$ by setting the remaining values arbitrarily.
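The coordinate-wise optimum of (2) can be written down directly. A sketch follows (illustrative code with hypothetical names; `centers` is the k × m array from Step 1, `q_prime` the reduced weights from Step 2, and `sigma` the grouping described above):

```python
import numpy as np

def solve_orthogonal_centers(centers, q_prime, sigma, k):
    """Solve problem (2): for each coordinate h, set a_{s,h} = mu_{s,h} for the single
    group s maximizing q*_s * mu_{s,h}^2 and 0 for all others, so the columns of A
    have disjoint supports and are therefore orthogonal."""
    m = centers.shape[1]
    A = np.zeros((m, k))
    q_star = np.zeros(k)                       # total reduced weight per group
    mu = np.zeros((k, m))                      # weighted average centroid per group
    for j in np.flatnonzero(q_prime > 0):
        q_star[sigma[j]] += q_prime[j]
        mu[sigma[j]] += q_prime[j] * centers[j]
    nonempty = q_star > 0
    mu[nonempty] /= q_star[nonempty, None]

    for h in range(m):
        gains = q_star * mu[:, h] ** 2         # loss avoided by keeping coordinate h in group s
        s = int(np.argmax(gains))
        if gains[s] > 0:
            A[h, s] = mu[s, h]
    return A                                   # m x k with mutually orthogonal columns
```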

4.3 Analysis

We show the following two theorems on the approximation guarantee of our algorithm in the double-factor orthogonality setting. Theorem 8 applies to general inner dimensions k, while Theorem 9 gives improved approximation factors when k is large, which is the case when we apply our ONMF algorithm to correlation clustering. Recall that we used an r-approximation algorithm for weighted k-means as a subroutine, and we assume that its running time is $T_{k\text{-means}}$.

Theorem 8. The algorithm in Section 4.2 computes a $\left(2r + \frac{8r+8}{\sin^2(\pi/12)}\right)$-approximate solution A and W in the double-factor orthogonality setting in time $O(T_{k\text{-means}} + mn + mk^2)$.

Theorem 9. When $k \ge \min\{m, n\}$, there exists an algorithm that gives a $\frac{1}{\sin^2(\pi/12)}$ ($\le 15$)-approximate solution in the double-factor orthogonality setting in time $O(mn^2)$.

We prove Theorem 8 based on the following lemma, which we prove in Appendix D. We defer the proof of Theorem 9 to Appendix E.

Lemma 10. Let $z_1, \dots, z_{k_1} \in \mathbb{R}^m_{\ge 0}$ be non-negative unit vectors that are orthogonal to each other. For any $\sigma' : \{1, \dots, k\} \to \{1, \dots, k_1\}$, we have
\[
\sum_{j=1}^k q_j \|c_j - a_{\sigma(j)}\|_2^2 \le \frac{2}{\sin^2(\pi/12)} \sum_{j=1}^k q_j \|c_j - z_{\sigma'(j)}\|_2^2 .
\]

Proof of Theorem 8. We obtain the running time of the algorithm by summing over the three steps. Step 1 requires O(mn) time to create the input to the weighted k-means subroutine, and the subroutine takes $T_{k\text{-means}}$ time. Step 2 takes $O(mk^2)$ time because we use O(m) time to compute the angle between each of the $O(k^2)$ pairs of centroids. In Step 3, it takes O(mk) time to solve the optimization problem (2), and O(mn) time to compute the $\theta_i$'s.

The feasibility of (A, W) is clear from the algorithm. We focus on proving the approximation guarantee. We start by showing an upper bound for the objective $\|M - AW\|_F^2$ achieved by our algorithm. For $i = 1, \dots, n$, the i-th column of AW is $\theta_i a_{\sigma(\phi(i))}$, where $\theta_i \in \arg\min_\theta \|m_i - \theta a_{\sigma(\phi(i))}\|_2^2$. Therefore,
\[
\|M - AW\|_F^2 = \sum_{i=1}^n \|m_i - \theta_i a_{\sigma(\phi(i))}\|_2^2 \le \sum_{i=1}^n \|m_i - \|m_i\|_2\, a_{\sigma(\phi(i))}\|_2^2 = \sum_{i=1}^n \|m_i\|_2^2 \cdot \|\bar m_i - a_{\sigma(\phi(i))}\|_2^2 .
\]
By Fact 6 and $c_j = \left(\sum_{i \in \phi^{-1}(j)} \ell_i \bar m_i\right)/q_j$, we have
\[
\begin{aligned}
\|M - AW\|_F^2 &\le \sum_{i=1}^n \|m_i\|_2^2 \cdot \|\bar m_i - a_{\sigma(\phi(i))}\|_2^2 = \sum_{i=1}^n \ell_i \, \|\bar m_i - a_{\sigma(\phi(i))}\|_2^2 \\
&= \sum_{i=1}^n \ell_i \, \|\bar m_i - c_{\phi(i)}\|_2^2 + \sum_{i=1}^n \ell_i \, \|c_{\phi(i)} - a_{\sigma(\phi(i))}\|_2^2 \\
&= \sum_{i=1}^n \ell_i \, \|\bar m_i - c_{\phi(i)}\|_2^2 + \sum_{j=1}^k q_j \|c_j - a_{\sigma(j)}\|_2^2 . \qquad (3)
\end{aligned}
\]
(3) gives an upper bound for $\|M - AW\|_F^2$. We proceed by giving a lower bound for the objective $\|M - A^{\mathrm{opt}} W^{\mathrm{opt}}\|_F^2$ achieved by the optimal solution $(A^{\mathrm{opt}}, W^{\mathrm{opt}})$. We first remove the columns of $A^{\mathrm{opt}}$ filled with the zero vector and also remove the corresponding rows in $W^{\mathrm{opt}}$. This doesn't change the product $A^{\mathrm{opt}} W^{\mathrm{opt}}$ and doesn't violate the orthogonality requirement either, but the sizes of $A^{\mathrm{opt}}$ and $W^{\mathrm{opt}}$ may now change to $m \times k_1$ and $k_1 \times n$. We can now assume WLOG that every column $a_s^{\mathrm{opt}}$ of $A^{\mathrm{opt}}$ is a unit vector. Note that each column of $W^{\mathrm{opt}}$ contains at most one non-zero entry, so we have
\[
\|M - A^{\mathrm{opt}} W^{\mathrm{opt}}\|_F^2 \ge \sum_{i=1}^n \min_{1 \le s \le k_1,\, \theta \ge 0} \|m_i - \theta a_s^{\mathrm{opt}}\|_2^2 \ge \frac{1}{2} \sum_{i=1}^n \|m_i\|_2^2 \cdot \min_{1 \le s \le k_1} \|\bar m_i - a_s^{\mathrm{opt}}\|_2^2 = \frac{1}{2} \sum_{i=1}^n \ell_i \cdot \min_{1 \le s \le k_1} \|\bar m_i - a_s^{\mathrm{opt}}\|_2^2 , \qquad (4)
\]
where the second inequality follows from Fact 2. By the r-approximate optimality of $c_1, \dots, c_k$, we have
\[
\|M - A^{\mathrm{opt}} W^{\mathrm{opt}}\|_F^2 \ge \frac{1}{2} \sum_{i=1}^n \ell_i \cdot \min_{1 \le s \le k_1} \|\bar m_i - a_s^{\mathrm{opt}}\|_2^2 \ge \frac{1}{2r} \sum_{i=1}^n \ell_i \cdot \|\bar m_i - c_{\phi(i)}\|_2^2 . \qquad (5)
\]
Combining (4) with (5), we have
\[
\begin{aligned}
(4r+4)\,\|M - A^{\mathrm{opt}} W^{\mathrm{opt}}\|_F^2 &\ge \sum_{i=1}^n \ell_i \left( 2\|\bar m_i - c_{\phi(i)}\|_2^2 + 2 \min_{1 \le s \le k_1} \|\bar m_i - a_s^{\mathrm{opt}}\|_2^2 \right) \\
&\ge \sum_{i=1}^n \ell_i \min_{1 \le s \le k_1} \|c_{\phi(i)} - a_s^{\mathrm{opt}}\|_2^2 \qquad (6) \\
&= \sum_{i=1}^n \ell_i \, \|c_{\phi(i)} - a^{\mathrm{opt}}_{\sigma'(\phi(i))}\|_2^2 = \sum_{j=1}^k q_j \|c_j - a^{\mathrm{opt}}_{\sigma'(j)}\|_2^2 ,
\end{aligned}
\]
where (6) is by Fact 4 and $\sigma'(j)$ is defined to be $\arg\min_{1 \le s \le k_1} \|c_j - a_s^{\mathrm{opt}}\|_2$. Applying Lemma 10, we get
\[
(4r+4)\,\|M - A^{\mathrm{opt}} W^{\mathrm{opt}}\|_F^2 \ge \sum_{j=1}^k q_j \|c_j - a^{\mathrm{opt}}_{\sigma'(j)}\|_2^2 \qquad (7)
\ \ge\ \frac{\sin^2(\pi/12)}{2} \sum_{j=1}^k q_j \|c_j - a_{\sigma(j)}\|_2^2 . \qquad (8)
\]
Combining (3) with (5) and (7), we have
\[
\|M - AW\|_F^2 \le \sum_{i=1}^n \ell_i \, \|\bar m_i - c_{\phi(i)}\|_2^2 + \sum_{j=1}^k q_j \|c_j - a_{\sigma(j)}\|_2^2 \le \left( 2r + \frac{8r+8}{\sin^2(\pi/12)} \right) \|M - A^{\mathrm{opt}} W^{\mathrm{opt}}\|_F^2 .
\]

5 EXPERIMENTS

We report on the results of experiments comparing the performance of our algorithm with eight previous algorithms in the literature. For these experiments, we use k-means++ as the subroutine for solving k-means. For the single-factor orthogonality setting, our experiments show that our algorithm ensures perfect orthogonality and gives similar approximation error to six previous algorithms in the literature that do not guarantee orthogonality. For this single-factor setting, we also directly compare to two previous algorithms that do ensure orthogonality and find that the performance of our algorithm is superior. One of these previous algorithms has a runtime that scales very poorly with the inner dimension (and worse error for small inner dimension); the other suffers from poor local minima, leading to large error even with zero noise. For the double-factor orthogonality setting, only two previous algorithms are able to handle this case. Neither of them ensures perfect orthogonality, while our algorithm does. Further, it has lower error than these previous algorithms. Our algorithm runs significantly faster than all these other algorithms in both settings. Thus we achieve the best of both worlds: stronger approximation guarantees as well as superior practical performance for ONMF.

Specifically, we compare our algorithm (ONMF-apx) with previous algorithms in the more well-studied single-factor orthogonality setting on synthetic data, and defer the experiments on real-world data and in the double-factor orthogonality setting to Appendix G. The previous algorithms we compare with include NMF (Lee and Seung, 2001), PNMF (Yuan and Oja, 2005), ONFS-Ding (Ding et al., 2006), NHL (Yang and Laaksonen, 2007), ONMF-A (Choi, 2008), HALS (Li et al., 2014a), EM-ONMF (Pompili et al., 2014), and ONMFS (Asteris et al., 2015).

Figure 2: Results of experiment 1. From left to right, the plots in the first row show the recovery error and the reconstruction error, and the plots in the second row show the non-orthogonality and the running time. The performance of our algorithm is shown in the red line under the label ONMF-apx.

Experimental Setup. We generate the input matrix $M \in \mathbb{R}^{m \times n}$ by adding noise to the product $M_{\mathrm{truth}}$ of random non-negative matrices $A_{\mathrm{truth}} \in \mathbb{R}^{m \times k}$ and $W_{\mathrm{truth}} \in \mathbb{R}^{k \times n}$. We make sure that $W_{\mathrm{truth}}$ has orthogonal rows,² and every non-zero entry of $A_{\mathrm{truth}}$ and $W_{\mathrm{truth}}$ is independently drawn from the exponential distribution with mean 1. We call $M_{\mathrm{truth}} = A_{\mathrm{truth}} W_{\mathrm{truth}}$ the planted solution, and we add iid noise to every entry of $M_{\mathrm{truth}}$ to obtain M. The noise also follows an exponential distribution, and we use the phrase "noise level" to denote the mean of that distribution.

² Due to non-negativity, making the rows of $W_{\mathrm{truth}}$ orthogonal is equivalent to making every column of $W_{\mathrm{truth}}$ contain at most one non-zero entry. Independently for every column, we pick the location of the non-zero entry uniformly at random.
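A sketch of this data-generation procedure is given below (illustrative code with hypothetical names; we take every entry of A_truth to be non-zero, which holds almost surely for exponential draws):

```python
import numpy as np

def make_synthetic(m, n, k, noise_level, seed=0):
    """Planted ONMF instance: A_truth (m x k) with iid exponential(mean 1) entries,
    W_truth (k x n) with one exponential entry per column at a uniformly random row,
    and iid exponential noise of mean `noise_level` added to every entry of M_truth."""
    rng = np.random.default_rng(seed)
    A_truth = rng.exponential(1.0, size=(m, k))
    W_truth = np.zeros((k, n))
    rows = rng.integers(0, k, size=n)                 # location of the non-zero entry per column
    W_truth[rows, np.arange(n)] = rng.exponential(1.0, size=n)
    M_truth = A_truth @ W_truth
    M = M_truth + rng.exponential(noise_level, size=(m, n))
    return M, M_truth, A_truth, W_truth
```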

Figure 3: Results of experiment 2. From left to right, the plots show the recovery error and the reconstruction error. The non-orthogonality (not shown in the figure) is identically zero for both algorithms.

Evaluation. We measure the quality of the matrices A and W output by the algorithms in terms of the approximation error and the orthogonality of W. We measure the approximation error using the Frobenius norm: we compute both the recovery error $\|M_{\mathrm{truth}} - AW\|_F$, which measures how well the output recovers the underlying structure of the input, and the reconstruction error $\|M - AW\|_F$, which measures the approximation error to the input matrix that contains iid noise. We define the reconstruction error of the planted solution $M_{\mathrm{truth}}$ as $\|M - M_{\mathrm{truth}}\|_F$, whose value concentrates well around $\sqrt{2mn}$ times the noise level, as shown in the following easy fact:

Fact 11. The mean (resp. standard deviation) of $\|M - M_{\mathrm{truth}}\|_F^2$ is 2mn (resp. $\sqrt{20mn}$) times the noise level squared.

We measure the non-orthogonality of W by the Frobenius norm of $WW^T - I$ after removing the zero rows of W and normalizing the other rows.

Experiment 1. In the first experiment, we choose m = 100, n = 5000, k = 10, and compare our algorithm with previous ones. We run each algorithm independently 7 times and record the median results in Figure 2. We found that ONMFS could not finish in a reasonable amount of time, so we investigate it separately on smaller matrices in Experiment 2. We also found that there is a high variance in the approximation error of EM-ONMF because it often converges to a bad local optimum, giving the fluctuating black lines in Figure 2.

As shown in Figure 2, our algorithm ensures perfect orthogonality and gives similar approximation error to previous algorithms which do not guarantee orthogonality. Except for EM-ONMF, none of the other previous algorithms in this experiment output a perfectly orthogonal W. Our recovery error is slightly better than previous algorithms, but our reconstruction error is slightly worse. This is because the orthogonality constraint effectively regularizes our solution, making it fit the noise in the input worse but reveal the structure of the input better. It is worth noting that our algorithm achieves lower reconstruction errors than the planted solution $M_{\mathrm{truth}}$, and so do most other algorithms in the experiment (the reconstruction error of $M_{\mathrm{truth}}$ concentrates well around 1000 times the noise level (thick green line in Figure 2) by Fact 11).

We would also like to point out that our algorithm runs significantly faster than all the other algorithms considered in this experiment. The bottom right plot of Figure 2 shows the running time on a machine with a 1.4 GHz Quad-Core Intel Core i5 processor and 8 GB of 2133 MHz LPDDR3 memory (note that the y-axis is on a logarithmic scale). Our algorithm is based on the k-means++ subroutine, which is very efficient. The previous algorithms are based on iterative updates and may take a long time to reach a local optimum.

Experiment 2. We compare our algorithm with ONMFS (Asteris et al., 2015), an algorithm that guarantees perfect orthogonality but runs in time exponential in the squared inner dimension. ONMFS is based on two levels of exhaustive search, which is inefficient when the inner dimension is large. We thus reduce the sizes of the matrices and set m = 10, n = 50, k = 2 in this experiment. Our result shows that our algorithm gives smaller error than ONMFS (Figure 3).
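The three quantities reported in the plots (recovery error, reconstruction error, and the non-orthogonality measure defined in the Evaluation paragraph above) can be computed directly; an illustrative sketch with hypothetical names:

```python
import numpy as np

def evaluate(M, M_truth, A, W):
    """Recovery error ||M_truth - AW||_F, reconstruction error ||M - AW||_F, and
    non-orthogonality ||W_hat W_hat^T - I||_F after dropping zero rows of W and
    normalizing the remaining rows."""
    AW = A @ W
    recovery = np.linalg.norm(M_truth - AW, "fro")
    reconstruction = np.linalg.norm(M - AW, "fro")

    row_norms = np.linalg.norm(W, axis=1)
    W_hat = W[row_norms > 0] / row_norms[row_norms > 0, None]
    gram = W_hat @ W_hat.T
    non_orth = np.linalg.norm(gram - np.eye(len(gram)), "fro")
    return recovery, reconstruction, non_orth
```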

Acknowledgements

We thank Suyash Gupta and anonymous reviewers for helpful comments on earlier versions of this paper. Moses Charikar was supported by a Simons Investigator Award, a Google Faculty Research Award and an Amazon Research Award. Lunjia Hu was supported by NSF Award IIS-1908774 and a VMware fellowship.

References

Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward. Better guarantees for k-means and Euclidean k-median by primal-dual algorithms. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 61–72. IEEE, 2017.

Nir Ailon, Moses Charikar, and Alantha Newman. Aggregating inconsistent information: ranking and clustering. Journal of the ACM (JACM), 55(5):1–27, 2008.

YK Alaudah, Haibin Di, and Ghassan AlRegib. Weakly supervised seismic structure labeling via orthogonal non-negative matrix factorization. In 79th EAGE Conference and Exhibition 2017, volume 2017, pages 1–5. European Association of Geoscientists & Engineers, 2017.

Orly Alter, Patrick O Brown, and David Botstein. Singular value decomposition for genome-wide expression data processing and modeling. Proceedings of the National Academy of Sciences, 97(18):10101–10106, 2000.

Sanjeev Arora, Rong Ge, Ravi Kannan, and Ankur Moitra. Computing a nonnegative matrix factorization—provably. SIAM Journal on Computing, 45(4):1582–1611, 2016.

Megasthenis Asteris, Dimitris Papailiopoulos, and Alexandros G Dimakis. Orthogonal NMF through subspace exploration. In Advances in Neural Information Processing Systems, pages 343–351, 2015.

Nikhil Bansal, Avrim Blum, and Shuchi Chawla. Correlation clustering. Machine Learning, 56(1-3):89–113, 2004.

Jean-Philippe Brunet, Pablo Tamayo, Todd R Golub, and Jill P Mesirov. Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the National Academy of Sciences, 101(12):4164–4169, 2004.

Shuchi Chawla, Konstantin Makarychev, Tselil Schramm, and Grigory Yaroslavtsev. Near optimal LP rounding algorithm for correlation clustering on complete and complete k-partite graphs. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 219–228, 2015.

Gang Chen, Fei Wang, and Changshui Zhang. Collaborative filtering using orthogonal nonnegative matrix tri-factorization. Information Processing & Management, 45(3):368–379, 2009.

Yong Chen, Hui Zhang, Rui Liu, and Zhiwen Ye. Soft orthogonal non-negative matrix factorization with sparse representation: Static and dynamic. Neurocomputing, 310:148–164, 2018.

Seungjin Choi. Algorithms for orthogonal nonnegative matrix factorization. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pages 1828–1832. IEEE, 2008.

Nicoletta Del Buono. A penalty function for computing orthogonal non-negative matrix factorizations. In 2009 Ninth International Conference on Intelligent Systems Design and Applications, pages 1001–1005. IEEE, 2009.

Karthik Devarajan. Nonnegative matrix factorization: an analytical and interpretive tool in computational biology. PLoS Computational Biology, 4(7), 2008.

Chris Ding, Tao Li, Wei Peng, and Haesun Park. Orthogonal nonnegative matrix t-factorizations for clustering. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 126–135, 2006.

Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Carl Eckart and Gale Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936.

Dan Feldman, Morteza Monemizadeh, and Christian Sohler. A PTAS for k-means clustering based on weak coresets. In Proceedings of the Twenty-Third Annual Symposium on Computational Geometry, pages 11–18, 2007.

Ping He, Xiaohua Xu, Jie Ding, and Baichuan Fan. Low-rank nonnegative matrix factorization on Stiefel manifold. Information Sciences, 514:131–148, 2020.

Meng Huang, JiHong OuYang, Chen Wu, and Liu Bo. Collaborative filtering based on orthogonal nonnegative matrix factorization. In Journal of Physics: Conference Series, volume 1345, page 052062. IOP Publishing, 2019.

Tapas Kanungo, David M Mount, Nathan S Netanyahu, Christine D Piatko, Ruth Silverman, and Angela Y Wu. A local search approximation algorithm for k-means clustering. In Proceedings of the Eighteenth Annual Symposium on Computational Geometry, pages 10–18, 2002.

Hyunsoo Kim and Haesun Park. Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics, 23(12):1495–1502, 2007.

Jingu Kim and Haesun Park. Fast nonnegative matrix factorization: An active-set-like method and comparisons. SIAM Journal on Scientific Computing, 33(6):3261–3281, 2011.

Sungchul Kim, Lee Sael, and Hwanjo Yu. A mutation profile for top-k patient search exploiting gene-ontology and orthogonal non-negative matrix factorization. Bioinformatics, 31(22):3653–3659, 2015.

Keigo Kimura, Mineichi Kudo, and Yuzuru Tanaka. A column-wise update algorithm for nonnegative matrix factorization in Bregman divergence with an orthogonal constraint. Machine Learning, 103(2):285–306, 2016.

Da Kuang, Chris Ding, and Haesun Park. Symmetric nonnegative matrix factorization for graph clustering. In Proceedings of the 2012 SIAM International Conference on Data Mining, pages 106–117. SIAM, 2012.

Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.

Daniel D Lee and H Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pages 556–562, 2001.

Bo Li, Guoxu Zhou, and Andrzej Cichocki. Two efficient algorithms for approximately orthogonal nonnegative matrix factorization. IEEE Signal Processing Letters, 22(7):843–846, 2014a.

Jack Yutong Li, Ruoqing Zhu, Annie Qu, Han Ye, and Zhankun Sun. Semi-orthogonal non-negative matrix factorization. arXiv preprint arXiv:1805.02306, 2018.

Ping Li, Jiajun Bu, Yi Yang, Rongrong Ji, Chun Chen, and Deng Cai. Discriminative orthogonal nonnegative matrix factorization with flexibility for data representation. Expert Systems with Applications, 41(4):1283–1293, 2014b.

Stan Z Li, Xin Wen Hou, Hong Jiang Zhang, and Qian Sheng Cheng. Learning spatially localized, parts-based representation. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), volume 1, pages I–I. IEEE, 2001.

Wenbo Li, Jicheng Li, Xuenian Liu, and Liqiang Dong. Two fast vector-wise update algorithms for orthogonal nonnegative matrix factorization with sparsity constraint. Journal of Computational and Applied Mathematics, 375:112785, 2020.

Chih-Jen Lin. Projected gradient methods for nonnegative matrix factorization. Neural Computation, 19(10):2756–2779, 2007.

Huifang Ma, Weizhong Zhao, Qing Tan, and Zhongzhi Shi. Orthogonal nonnegative matrix tri-factorization for semi-supervised document co-clustering. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 189–200. Springer, 2010.

L. Mirsky. Symmetric gauge functions and unitarily invariant norms. Quart. J. Math. Oxford Ser. (2), 11:50–59, 1960. ISSN 0033-5606. doi: 10.1093/qmath/11.1.50. URL https://doi.org/10.1093/qmath/11.1.50.

Pentti Paatero and Unto Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2):111–126, 1994.

Junjun Pan and Michael K Ng. Orthogonal nonnegative matrix factorization by sparsity and nuclear norm optimization. SIAM Journal on Matrix Analysis and Applications, 39(2):856–875, 2018.

Christos H Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, and Santosh Vempala. Latent semantic indexing: A probabilistic analysis. Journal of Computer and System Sciences, 61(2):217–235, 2000.

V Paul Pauca, Farial Shahnaz, Michael W Berry, and Robert J Plemmons. Text mining using non-negative matrix factorizations. In Proceedings of the 2004 SIAM International Conference on Data Mining, pages 452–456. SIAM, 2004.

Filippo Pompili, Nicolas Gillis, François Glineur, and Pierre-Antoine Absil. ONP-MF: An orthogonal nonnegative matrix factorization algorithm with application to clustering. In ESANN. Citeseer, 2013.

Filippo Pompili, Nicolas Gillis, P-A Absil, and François Glineur. Two algorithms for orthogonal nonnegative matrix factorization with application to clustering. Neurocomputing, 141:15–25, 2014.

Yaoyao Qin, Caiyan Jia, and Yafang Li. Community detection using nonnegative matrix factorization with orthogonal constraint. In 2016 Eighth International Conference on Advanced Computational Intelligence (ICACI), pages 49–54. IEEE, 2016.

Stephen A Vavasis. On the complexity of nonnegative matrix factorization. SIAM Journal on Optimization, 20(3):1364–1377, 2010.

Svante Wold, Kim Esbensen, and Paul Geladi. Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3):37–52, 1987.

Wei Xu, Xin Liu, and Yihong Gong. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 267–273, 2003.

Zhirong Yang and Jorma Laaksonen. Multiplicative updates for non-negative projections. Neurocomputing, 71(1-3):363–373, 2007.

Zhirong Yang and Erkki Oja. Linear and nonlinear projective nonnegative matrix factorization. IEEE Transactions on Neural Networks, 21(5):734–749, 2010.

Ji-Ho Yoo and Seung-Jin Choi. Nonnegative matrix factorization with orthogonality constraints. Journal of Computing Science and Engineering, 4(2):97–109, 2010.

Jiho Yoo and Seungjin Choi. Orthogonal nonnegative matrix factorization: Multiplicative updates on Stiefel manifolds. In International Conference on Intelligent Data Engineering and Automated Learning, pages 140–147. Springer, 2008.

Zhijian Yuan and Erkki Oja. Projective nonnegative matrix factorization for image compression and feature extraction. In Scandinavian Conference on Image Analysis, pages 333–342. Springer, 2005.

Wei Emma Zhang, Mingkui Tan, Quan Z Sheng, Lina Yao, and Qingfeng Shi. Efficient orthogonal non-negative matrix factorization over Stiefel manifold. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, pages 1743–1752, 2016.