Algorithms for Non-negative Matrix Factorization
Abstract
Non-negative matrix factorization (NMF) has previously been shown to be a useful decomposition for multivariate data. Two different multiplicative algorithms for NMF are analyzed. They differ only slightly in the multiplicative factor used in the update rules. One algorithm can be shown to minimize the conventional least squares error while the other minimizes the generalized Kullback-Leibler divergence. The monotonic convergence of both algorithms can be proven using an auxiliary function analogous to that used for proving convergence of the Expectation-Maximization algorithm. The algorithms can also be interpreted as diagonally rescaled gradient descent, where the rescaling factor is optimally chosen to ensure convergence.
Introduction
Unsupervised learning algorithms such as principal components analysis and vector quantization can be understood as factorizing a data matrix subject to different constraints. Depending upon the constraints utilized, the resulting factors can be shown to have very different representational properties. Principal components analysis enforces only a weak orthogonality constraint, resulting in a very distributed representation that uses cancellations to generate variability [1, 2]. On the other hand, vector quantization uses a hard winner-take-all constraint that results in clustering the data into mutually exclusive prototypes [3].

We have previously shown that nonnegativity is a useful constraint for matrix factorization that can learn a parts representation of the data [4, 5]. The nonnegative basis vectors that are learned are used in distributed, yet still sparse combinations to generate expressiveness in the reconstructions [6, 7]. In this submission, we analyze in detail two numerical algorithms for learning the optimal nonnegative factors from data.
Cost functions
To find an approximate factorization $V \approx WH$, we first need to define cost functions that quantify the quality of the approximation. Such a cost function can be constructed using some measure of distance between two non-negative matrices $A$ and $B$. One useful measure is simply the square of the Euclidean distance between $A$ and $B$ [12],
$$\|A - B\|^2 = \sum_{ij} (A_{ij} - B_{ij})^2 \qquad (2)$$
This is lower bounded by zero, and clearly vanishes if and only if A = B .
Another useful measure is
$$D(A\|B) = \sum_{ij} \Big( A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \Big) \qquad (3)$$
Like the Euclidean distance this is also lower bounded by zero, and vanishes if and only if $A = B$. But it cannot be called a "distance", because it is not symmetric in $A$ and $B$, so we will refer to it as the "divergence" of $A$ from $B$. It reduces to the Kullback-Leibler divergence, or relative entropy, when $\sum_{ij} A_{ij} = \sum_{ij} B_{ij} = 1$, so that $A$ and $B$ can be regarded as normalized probability distributions.
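For concreteness, both measures can be written directly in NumPy. This is an illustrative sketch, not code from the paper; the small `eps` guarding the logarithm is our addition.

```python
import numpy as np

def euclidean_cost(A, B):
    """Squared Euclidean distance of Eq. (2)."""
    return np.sum((A - B) ** 2)

def divergence_cost(A, B, eps=1e-12):
    """Generalized KL divergence D(A||B) of Eq. (3).

    eps guards the logarithm against zero entries; it is an
    implementation detail, not part of the definition.
    """
    return np.sum(A * np.log((A + eps) / (B + eps)) - A + B)
```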
We now consider two alternative formulations of NMF as optimization problems:
Problem 1 Minimize $\|V - WH\|^2$ with respect to $W$ and $H$, subject to the constraints $W, H \ge 0$.

Problem 2 Minimize $D(V\|WH)$ with respect to $W$ and $H$, subject to the constraints $W, H \ge 0$.
Although the functions $\|V - WH\|^2$ and $D(V\|WH)$ are convex in $W$ only or $H$ only, they are not convex in both variables together. Therefore it is unrealistic to expect an algorithm to solve Problems 1 and 2 in the sense of finding global minima. However, there are many techniques from numerical optimization that can be applied to find local minima.
Gradient descent is perhaps the simplest technique to implement, but convergence can be slow. Other methods such as conjugate gradient have faster convergence, at least in the vicinity of local minima, but are more complicated to implement than gradient descent [8]. Gradient-based methods also have the disadvantage of being very sensitive to the choice of step size, which can be very inconvenient for large applications.
We have found that the following “multiplicative update rules” are a good compromise
between speed and ease of implementation for solving Problems 1 and 2.
Theorem 1 The Euclidean distance $\|V - WH\|$ is nonincreasing under the update rules

$$H_{a\mu} \leftarrow H_{a\mu} \frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}} \qquad W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}} \qquad (4)$$
The Euclidean distance is invariant under these updates if and only if W and H are at a
stationary point of the distance.
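As an illustration, one sweep of Eq. (4) can be written in a few lines of NumPy. This is a sketch under our own conventions: the matrix shapes, the random initialization, and the small `eps` guarding the denominators are all our additions, not part of the theorem.

```python
import numpy as np

def nmf_euclidean_step(V, W, H, eps=1e-12):
    """One sweep of the multiplicative updates of Eq. (4).

    Each factor is multiplied elementwise by a nonnegative ratio,
    so nonnegativity of W and H is preserved automatically.
    """
    H = H * (W.T @ V) / (W.T @ W @ H + eps)
    W = W * (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Usage: Theorem 1 predicts the cost never increases from sweep to sweep.
rng = np.random.default_rng(0)
V = rng.random((20, 30))                        # nonnegative data matrix
W, H = rng.random((20, 5)), rng.random((5, 30))
prev = np.inf
for _ in range(200):
    W, H = nmf_euclidean_step(V, W, H)
    cost = np.sum((V - W @ H) ** 2)
    assert cost <= prev + 1e-9                  # monotone, up to eps rounding
    prev = cost
```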
Theorem 2 The divergence $D(V\|WH)$ is nonincreasing under the update rules

$$H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia} V_{i\mu}/(WH)_{i\mu}}{\sum_k W_{ka}} \qquad W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu} V_{i\mu}/(WH)_{i\mu}}{\sum_\nu H_{a\nu}} \qquad (5)$$
The divergence is invariant under these updates if and only if W and H are at a stationary
point of the divergence.
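A corresponding NumPy sketch of one sweep of Eq. (5) follows; as before, the `eps` guards and broadcasting conventions are our own assumptions, not part of the theorem.

```python
import numpy as np

def nmf_divergence_step(V, W, H, eps=1e-12):
    """One sweep of the multiplicative updates of Eq. (5)."""
    # H update: numerator sum_i W_ia V_iu / (WH)_iu, denominator sum_k W_ka
    H = H * (W.T @ (V / (W @ H + eps))) / (np.sum(W, axis=0)[:, None] + eps)
    # W update: numerator sum_u H_au V_iu / (WH)_iu, denominator sum_v H_av
    W = W * ((V / (W @ H + eps)) @ H.T) / (np.sum(H, axis=1)[None, :] + eps)
    return W, H
```

Note that the normalizations $\sum_k W_{ka}$ and $\sum_\nu H_{a\nu}$ enter as column sums of $W$ and row sums of $H$, which is why the two broadcasts differ.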
Proofs of these theorems are given in a later section. For now, we note that each update
consists of multiplication by a factor. In particular, it is straightforward to see that this
multiplicative factor is unity when V = W H , so that perfect reconstruction is necessarily
a fixed point of the update rules.
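This fixed-point property is easy to check numerically. The following snippet is our own sanity check (the NumPy setup, shapes, and seed are illustrative choices), not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.random((20, 5))
H = rng.random((5, 30))
V = W @ H                       # perfect reconstruction by construction

# The multiplicative factors of Eq. (4) are unity when V = WH ...
factor_H = (W.T @ V) / (W.T @ W @ H)
factor_W = (V @ H.T) / (W @ H @ H.T)
assert np.allclose(factor_H, 1.0) and np.allclose(factor_W, 1.0)

# ... and so is the H-factor of Eq. (5), since V/(WH) is all ones.
factor_H_div = (W.T @ (V / (W @ H))) / np.sum(W, axis=0)[:, None]
assert np.allclose(factor_H_div, 1.0)
```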
It is useful to contrast these multiplicative updates with those arising from gradient descent [13]. In particular, a simple additive update for $H$ that reduces the squared distance can be written as

$$H_{a\mu} \leftarrow H_{a\mu} + \eta_{a\mu} \left[ (W^T V)_{a\mu} - (W^T W H)_{a\mu} \right] \qquad (6)$$

If $\eta_{a\mu}$ are all set equal to some small positive number, this is equivalent to conventional gradient descent. As long as this number is sufficiently small, the update should reduce $\|V - WH\|$.
Now if we diagonally rescale the variables and set

$$\eta_{a\mu} = \frac{H_{a\mu}}{(W^T W H)_{a\mu}} \qquad (7)$$

then we obtain the update rule for $H$ that is given in Theorem 1.
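This equivalence between Eqs. (6)-(7) and the multiplicative rule of Eq. (4) is exact, not approximate, and can be confirmed numerically. The check below is our own illustration (random shapes and seed are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(2)
V, W, H = rng.random((20, 30)), rng.random((20, 5)), rng.random((5, 30))

eta = H / (W.T @ W @ H)                             # Eq. (7)
H_additive = H + eta * ((W.T @ V) - (W.T @ W @ H))  # Eq. (6)
H_multiplicative = H * (W.T @ V) / (W.T @ W @ H)    # Eq. (4)
assert np.allclose(H_additive, H_multiplicative)
```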
For the divergence, diagonally rescaled gradient descent takes the form

$$H_{a\mu} \leftarrow H_{a\mu} + \eta_{a\mu} \left[ \sum_i W_{ia} \frac{V_{i\mu}}{(WH)_{i\mu}} - \sum_i W_{ia} \right] \qquad (8)$$
Again, if the $\eta_{a\mu}$ are small and positive, this update should reduce $D(V\|WH)$. If we now set

$$\eta_{a\mu} = \frac{H_{a\mu}}{\sum_i W_{ia}} \qquad (9)$$

then we obtain the update rule for $H$ that is given in Theorem 2.
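The same exact correspondence holds here: with the step size of Eq. (9), the additive update of Eq. (8) reproduces the multiplicative rule of Eq. (5). Again, this is our own numerical check, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
V, W, H = rng.random((20, 30)), rng.random((20, 5)), rng.random((5, 30))

col_sums = np.sum(W, axis=0)[:, None]          # sum_i W_ia, shape (5, 1)
eta = H / col_sums                             # Eq. (9)
grad_pos = W.T @ (V / (W @ H))                 # first sum in Eq. (8)
H_additive = H + eta * (grad_pos - col_sums)   # Eq. (8); second sum broadcasts
H_multiplicative = H * grad_pos / col_sums     # Eq. (5)
assert np.allclose(H_additive, H_multiplicative)
```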
Since our choices for $\eta_{a\mu}$ are not small, it may seem that there is no guarantee that such a rescaled gradient descent will cause the cost function to decrease. Surprisingly, the decrease is nevertheless guaranteed, as shown in the next section.
Proofs of convergence
To prove Theorems 1 and 2, we will make use of an auxiliary function similar to that used in the Expectation-Maximization algorithm [14, 15].

Definition 1 $G(h, h')$ is an auxiliary function for $F(h)$ if the conditions

$$G(h, h') \ge F(h), \qquad G(h, h) = F(h) \qquad (10)$$

are satisfied.

The auxiliary function is a useful concept because of the following lemma, which is also graphically illustrated in Fig. 1.

Lemma 1 If $G$ is an auxiliary function, then $F$ is nonincreasing under the update

$$h^{t+1} = \arg\min_h G(h, h^t) \qquad (11)$$

Proof: $F(h^{t+1}) \le G(h^{t+1}, h^t) \le G(h^t, h^t) = F(h^t) \qquad (12)$

We will show that by defining the appropriate auxiliary functions $G(h, h^t)$ for both $\|V - WH\|$ and $D(V\|WH)$, the update rules in Theorems 1 and 2 easily follow from Eq. (11).
Lemma 2 If $K(h^t)$ is the diagonal matrix

$$K_{ab}(h^t) = \delta_{ab} \, (W^T W h^t)_a / h^t_a \qquad (13)$$
Figure 1: Minimizing the auxiliary function $G(h, h^t) \ge F(h)$ guarantees that $F(h^{t+1}) \le F(h^t)$ for $h^{t+1} = \arg\min_h G(h, h^t)$.
then

$$G(h, h^t) = F(h^t) + (h - h^t)^T \nabla F(h^t) + \frac{1}{2} (h - h^t)^T K(h^t)(h - h^t) \qquad (14)$$

is an auxiliary function for

$$F(h) = \frac{1}{2} \sum_i \Big( v_i - \sum_a W_{ia} h_a \Big)^2 \qquad (15)$$
Proof: Since $G(h, h) = F(h)$ is obvious, we need only show that $G(h, h^t) \ge F(h)$. To do this, we compare

$$F(h) = F(h^t) + (h - h^t)^T \nabla F(h^t) + \frac{1}{2} (h - h^t)^T (W^T W)(h - h^t)$$

with Eq. (14) to find that $G(h, h^t) \ge F(h)$ is equivalent to

$$0 \le (h - h^t)^T \left[ K(h^t) - W^T W \right] (h - h^t).$$

To prove positive semidefiniteness, consider the matrix $M_{ab}(h^t) = h^t_a \left( K(h^t) - W^T W \right)_{ab} h^t_b$, which is simply a rescaling of the components of $K(h^t) - W^T W$. Then $K(h^t) - W^T W$ is positive semidefinite if and only if $M$ is, and

$$\nu^T M \nu = \sum_{ab} (W^T W)_{ab} \, h^t_a h^t_b \left[ \tfrac{1}{2} \nu_a^2 + \tfrac{1}{2} \nu_b^2 - \nu_a \nu_b \right] = \frac{1}{2} \sum_{ab} (W^T W)_{ab} \, h^t_a h^t_b (\nu_a - \nu_b)^2 \ge 0.$$

Proof of Theorem 1: Replacing $G(h, h^t)$ in Eq. (11) by Eq. (14) results in the update rule

$$h^{t+1} = h^t - K(h^t)^{-1} \nabla F(h^t),$$

which, written in components, is $h^{t+1}_a = h^t_a \, (W^T v)_a / (W^T W h^t)_a$. Since Eq. (14) is an auxiliary function, $F$ is nonincreasing under this update by Lemma 1. Written in matrix form, this is equivalent to the update rule for $H$ in Eq. (4); by reversing the roles of $W$ and $H$, the update rule for $W$ follows similarly.

We now turn to the divergence cost function

$$F(h) = \sum_i \Big( v_i \log \frac{v_i}{\sum_a W_{ia} h_a} - v_i + \sum_a W_{ia} h_a \Big) \qquad (28)$$

Lemma 3 The function

$$G(h, h^t) = \sum_i (v_i \log v_i - v_i) + \sum_{ia} W_{ia} h_a - \sum_{ia} v_i \frac{W_{ia} h^t_a}{\sum_b W_{ib} h^t_b} \Big( \log W_{ia} h_a - \log \frac{W_{ia} h^t_a}{\sum_b W_{ib} h^t_b} \Big)$$

is an auxiliary function for the divergence cost of Eq. (28).
Proof: It is straightforward to verify that $G(h, h) = F(h)$. To show that $G(h, h^t) \ge F(h)$, we use convexity of the log function to derive the inequality

$$-\log \sum_a W_{ia} h_a \le -\sum_a \alpha_a \log \frac{W_{ia} h_a}{\alpha_a} \qquad (29)$$

which holds for all nonnegative $\alpha_a$ that sum to unity. Setting

$$\alpha_a = \frac{W_{ia} h^t_a}{\sum_b W_{ib} h^t_b} \qquad (30)$$

we obtain

$$-\log \sum_a W_{ia} h_a \le -\sum_a \frac{W_{ia} h^t_a}{\sum_b W_{ib} h^t_b} \Big( \log W_{ia} h_a - \log \frac{W_{ia} h^t_a}{\sum_b W_{ib} h^t_b} \Big) \qquad (31)$$

From this inequality it follows that $F(h) \le G(h, h^t)$.
Theorem 2 then follows from the application of Lemma 1:
Proof of Theorem 2: The minimum of G(h; ht ) with respect to h is determined by setting
the gradient to zero:
$$\frac{dG(h, h^t)}{dh_a} = -\sum_i v_i \frac{W_{ia} h^t_a}{\sum_b W_{ib} h^t_b} \frac{1}{h_a} + \sum_i W_{ia} = 0 \qquad (32)$$
Thus, the update rule of Eq. (11) takes the form

$$h^{t+1}_a = \frac{h^t_a}{\sum_k W_{ka}} \sum_i W_{ia} \frac{v_i}{\sum_b W_{ib} h^t_b} \qquad (33)$$

Since $G$ is an auxiliary function, $F$ in Eq. (28) is nonincreasing under this update. Rewritten in matrix form, this is equivalent to the update rule for $H$ in Eq. (5). By reversing the roles of $H$ and $W$, the update rule for $W$ can similarly be shown to be nonincreasing.
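As a final sanity check (ours, not the paper's), the update of Eq. (33) can be iterated on random data to confirm that the divergence cost of Eq. (28) never increases; the shapes and seed below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.random((20, 5))
v = rng.random(20)
h = rng.random(5)

def F(h):
    """Divergence cost of Eq. (28)."""
    Wh = W @ h
    return np.sum(v * np.log(v / Wh) - v + Wh)

prev = F(h)
for _ in range(100):
    h = h * (W.T @ (v / (W @ h))) / np.sum(W, axis=0)   # Eq. (33)
    assert F(h) <= prev + 1e-9                          # nonincreasing
    prev = F(h)
```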
Discussion
We have shown that the update rules in Eqs. (4) and (5) are guaranteed to find at least locally optimal solutions of Problems 1 and 2, respectively. The convergence
proofs rely upon defining an appropriate auxiliary function. We are currently working to
generalize these theorems to more complex constraints. The update rules themselves are
extremely easy to implement computationally, and will hopefully be utilized by others for
a wide variety of applications.
We acknowledge the support of Bell Laboratories. We would also like to thank Carlos
Brody, Ken Clarkson, Corinna Cortes, Roland Freund, Linda Kaufman, Yann Le Cun, Sam
Roweis, Larry Saul, and Margaret Wright for helpful discussions.
References
[1] Jolliffe, IT (1986). Principal Component Analysis. New York: Springer-Verlag.
[2] Turk, M & Pentland, A (1991). Eigenfaces for recognition. J. Cogn. Neurosci. 3, 71–86.
[3] Gersho, A & Gray, RM (1992). Vector Quantization and Signal Compression. Kluwer Acad.
Press.
[4] Lee, DD & Seung, HS (1997). Unsupervised learning by convex and conic coding. Proceedings of the Conference on Neural Information Processing Systems 9, 515–521.
[5] Lee, DD & Seung, HS (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791.
[6] Field, DJ (1994). What is the goal of sensory coding? Neural Comput. 6, 559–601.
[7] Foldiak, P & Young, M (1995). Sparse coding in the primate cortex. The Handbook of Brain
Theory and Neural Networks, 895–898. (MIT Press, Cambridge, MA).
[8] Press, WH, Teukolsky, SA, Vetterling, WT & Flannery, BP (1993). Numerical recipes: the art
of scientific computing. (Cambridge University Press, Cambridge, England).
[9] Shepp, LA & Vardi, Y (1982). Maximum likelihood reconstruction for emission tomography.
IEEE Trans. MI-2, 113–122.
[10] Richardson, WH (1972). Bayesian-based iterative method of image restoration. J. Opt. Soc.
Am. 62, 55–59.
[11] Lucy, LB (1974). An iterative technique for the rectification of observed distributions. Astron. J. 79, 745–754.
[12] Paatero, P & Tapper, U (1997). Least squares formulation of robust non-negative factor analysis. Chemometr. Intell. Lab. 37, 23–35.
[13] Kivinen, J & Warmuth, M (1997). Additive versus exponentiated gradient updates for linear
prediction. Journal of Information and Computation 132, 1–64.
[14] Dempster, AP, Laird, NM & Rubin, DB (1977). Maximum likelihood from incomplete data via
the EM algorithm. J. Royal Stat. Soc. 39, 1–38.
[15] Saul, L & Pereira, F (1997). Aggregate and mixed-order Markov models for statistical language
processing. In C. Cardie and R. Weischedel (eds). Proceedings of the Second Conference on
Empirical Methods in Natural Language Processing, 81–89. ACL Press.