Nonnegative Matrix Factorization
Author: Tran Dat Tin∗
Supervisor: Dr. Le Hai Yen†
Abstract. Because of the extensive applications of nonnegative matrix factorization (NMF), for example in image processing, text mining, spectral data analysis, and speech processing, algorithms for NMF have been studied for years. This report presents the multiplicative updating (MU) [10], alternating gradient (AG), and alternating projected gradient (APG) algorithms [11] for solving the NMF problem. We can therefore take advantage of well-developed theories on majorize-minimization and gradient descent methods to analyze the convergence of these algorithms. Finally, numerical experiments on the AT&T face database show that the APG algorithm performs better than the MU algorithm.
1 Introduction
Let V be an m × n nonnegative matrix (that is, v_{ij} ≥ 0 for all i = 1, 2, . . . , m and j = 1, 2, . . . , n), denoted by V ≥ 0 or V ∈ R^{m×n}_+, and let r ≤ min(m, n) be a fixed integer. The so-called nonnegative matrix factorization (NMF) approximates V by the product of two nonnegative matrices W ∈ R^{m×r}_+ and H ∈ R^{r×n}_+; more precisely, it leads to the approximation
$$V \approx WH.$$
To find an approximate factorization V ≈ WH, we first need to define a cost function that quantifies the quality of the approximation. Such a cost function can be constructed using some measure of distance between the nonnegative matrix V and the product WH. The NMF problem was first stated in the paper by Paatero and Tapper in 1994 [12].
∗ Dalat University, Vietnam ([email protected])
† Institute of Mathematics, Vietnam Academy of Science and Technology ([email protected])
There are many ways to define the cost function for different purposes. Lee and Seung [10] defined cost functions based on the Frobenius norm and the Kullback-Leibler divergence. Févotte, Bertin, and Durrieu [5] defined a cost function using the Itakura-Saito distance. Ke and Kanade [9] used the ℓ1 norm to define another cost function. In this report, we define the cost function based on the Frobenius norm as in [10]. Let us state the main problem:
$$\min_{W \in \mathbb{R}^{m\times r}_+,\; H \in \mathbb{R}^{r\times n}_+} f_V(W, H) := \frac{1}{2}\|V - WH\|_F^2 \tag{1}$$
Remark. For a fixed matrix H ∈ R^{r×n}, the function W ↦ f_V(W, H) is convex, because for all W_1, W_2 ∈ R^{m×r}_+ and λ ∈ [0, 1]:
$$
\begin{aligned}
f_V(\lambda W_1 + (1-\lambda)W_2, H) &= \frac{1}{2}\big\|\lambda(V - W_1H) + (1-\lambda)(V - W_2H)\big\|_F^2 \\
&\le \frac{\lambda}{2}\|V - W_1H\|_F^2 + \frac{1-\lambda}{2}\|V - W_2H\|_F^2 \qquad (\|\cdot\|_F^2 \text{ is a convex function}) \\
&= \lambda f_V(W_1, H) + (1-\lambda) f_V(W_2, H).
\end{aligned}
$$
This implies that W ↦ f_V(W, H) is convex. Similarly, it is easy to show that H ↦ f_V(W, H) is convex for fixed W. However, the objective function f_V in (1) is not convex in both variables W and H jointly. To see this, consider the case m = n = r = 1:
$$f_v(w, h) := \|v - wh\|^2 = v^2 - 2vwh + w^2h^2,$$
which implies
$$\nabla^2 f_v(w, h) = \begin{bmatrix} 2h^2 & 4wh - 2v \\ 4wh - 2v & 2w^2 \end{bmatrix}.$$
For all v > 0 and h > 0 we have 2h^2 > 0 and det(∇²f_v(w, h)) = −12w²h² + 16vwh − 4v², and there exists w > 0 such that det(∇²f_v(w, h)) < 0, so the Hessian is indefinite at such a point and f_v is not convex.
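As a quick numerical check, a short R snippet (with the arbitrary choice v = h = 1 and w = 2) evaluates this Hessian and its determinant:

```r
# Hessian of f_v(w, h) = (v - w*h)^2 at v = h = 1, w = 2 (an arbitrary test point).
v <- 1; h <- 1; w <- 2
hess <- matrix(c(2 * h^2,           4 * w * h - 2 * v,
                 4 * w * h - 2 * v, 2 * w^2), nrow = 2, byrow = TRUE)
det(hess)   # -20 < 0: the Hessian is indefinite, so f_v is not jointly convex
```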
Algorithm 1 Multiplicative Updating Algorithm (Lee and Seung [10])
1. W^0 = rand(m, r); H^0 = rand(r, n); (W^0, H^0 > 0)
2. for k = 0, 1, 2, . . .
$$W^{k+1} = W^k \circ \frac{V(H^k)^T}{W^kH^k(H^k)^T}$$
$$H^{k+1} = H^k \circ \frac{(W^{k+1})^TV}{(W^{k+1})^TW^{k+1}H^k}$$
end
Here ∘ denotes the componentwise (Hadamard) product and the fractions denote componentwise division.
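A minimal R sketch of these multiplicative updates (the function name and the small constant eps, added to the denominators to guard against division by zero, are our own choices and are not part of the algorithm as stated):

```r
# Minimal sketch of Algorithm 1 (MU); '*' and '/' act componentwise on matrices in R.
nmf_mu <- function(V, r, iters = 100, eps = 1e-12) {
  m <- nrow(V); n <- ncol(V)
  W <- matrix(runif(m * r), m, r)   # W^0 > 0
  H <- matrix(runif(r * n), r, n)   # H^0 > 0
  for (k in seq_len(iters)) {
    W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + eps)
    H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + eps)
  }
  list(W = W, H = H, loss = 0.5 * norm(V - W %*% H, "F")^2)
}
```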
In the next theorem, we show that the objective function in (1) is nonincreasing under the update rules of the MU algorithm.
Theorem 1
The objective function in (1) is nonincreasing under the update rules of Algorithm 1.
The proof uses the notion of an auxiliary function [10]: F̄(v, v̄) is an auxiliary function for F(v) if F̄(v, v̄) ≥ F(v) for all v and F̄(v, v) = F(v). The usefulness of the auxiliary function comes from the following lemma, which is illustrated in Figure 1.
Figure 1: Property of the auxiliary function; in this figure, F̄(v) := F̄(v, v̄) [8]
Lemma 3
If F̄ is an auxiliary function of F, then F is nonincreasing under the update
$$v^{(i+1)} = \operatorname*{argmin}_{v} \bar{F}(v, v^{(i)}). \tag{2}$$
Remark. F(v^{(i+1)}) = F(v^{(i)}) if and only if v^{(i)} is a local minimum of F̄(v, v^{(i)}), and F(v^{(i)}) = F̄(v^{(i)}, v^{(i)}). If the derivative of F exists and is continuous in a small neighborhood of v^{(i)}, this also implies that ∇F(v^{(i)}) = 0. If F is a convex function, then the sequence generated by the update (2) converges to a minimum v_min = argmin_{v>0} F(v) of the objective function.
We will show that by defining appropriate auxiliary functions F̄(v, v^{(i)}) for the objective function (1), the update rules in Theorem 1 easily follow from equation (2). First, we need the following lemma.
Lemma 4
Given W ∈ R^{m×r} and v ∈ R^m, let the function f : R^r → R be defined by f(h) = ||Wh − v||². Then ∇_h f(h) = 2W^TWh − 2W^Tv and ∇²_h f(h) = 2W^TW.
Proof. By definition:
$$\nabla_h \|Wh - v\|^2 := \left(\frac{\partial}{\partial h_1}\|Wh - v\|^2, \cdots, \frac{\partial}{\partial h_r}\|Wh - v\|^2\right)^T \tag{3}$$
and
$$
\begin{aligned}
\|Wh - v\|^2 &= \sum_{i=1}^m \big((Wh)_i - v_i\big)^2 = \sum_{i=1}^m \Big(\sum_{k=1}^r w_{ik}h_k - v_i\Big)^2 \\
&= \sum_{i=1}^m \left[\Big(\sum_{k=1}^r w_{ik}h_k\Big)^2 - 2v_i\sum_{k=1}^r w_{ik}h_k + v_i^2\right] \\
&= \sum_{i=1}^m \Big(\sum_{k=1}^r w_{ik}h_k\Big)^2 - 2\sum_{i=1}^m v_i\sum_{k=1}^r w_{ik}h_k + \sum_{i=1}^m v_i^2.
\end{aligned}
$$
For any h_j we have:
$$
\begin{aligned}
\frac{\partial}{\partial h_j}\|Wh - v\|^2 &= \sum_{i=1}^m \frac{\partial}{\partial h_j}\Big(\sum_{k=1}^r w_{ik}h_k\Big)^2 - 2\sum_{i=1}^m v_i\frac{\partial}{\partial h_j}\sum_{k=1}^r w_{ik}h_k + \frac{\partial}{\partial h_j}\sum_{i=1}^m v_i^2 \\
&= \sum_{i=1}^m \frac{\partial}{\partial h_j}\Big(\sum_{k=1}^r w_{ik}h_k\Big)^2 - 2\sum_{i=1}^m v_i w_{ij} \\
&= 2\sum_{i=1}^m w_{ij}\sum_{k=1}^r w_{ik}h_k - 2\sum_{i=1}^m v_i w_{ij} \\
&= 2\sum_{i=1}^m w_{ij}\Big(\sum_{k=1}^r w_{ik}h_k - v_i\Big) = 2\sum_{i=1}^m w_{ij}\big((Wh)_i - v_i\big) = 2\big(W^T(Wh - v)\big)_j.
\end{aligned}
$$
From (3),
$$\nabla_h f(h) = 2W^TWh - 2W^Tv.$$
It is clear that
$$\nabla^2_h f(h) = 2W^TW,$$
which completes the proof.
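A quick finite-difference check of the gradient formula in Lemma 4 (random W, v, and h; the sizes are arbitrary):

```r
# Compare the exact gradient 2 W^T W h - 2 W^T v with central finite differences.
set.seed(1)
m <- 5; r <- 3
W <- matrix(runif(m * r), m, r); v <- runif(m); h <- runif(r)
f <- function(h) sum((W %*% h - v)^2)
grad_exact <- 2 * t(W) %*% W %*% h - 2 * t(W) %*% v
grad_num <- sapply(seq_len(r), function(j) {
  e <- rep(0, r); e[j] <- 1e-6
  (f(h + e) - f(h - e)) / 2e-6
})
max(abs(grad_exact - grad_num))   # of the order of 1e-8 or smaller
```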
We denote by Diag(v) the square diagonal matrix with the entries of the vector v on the main diagonal, and a vector v > 0 means v_i > 0 for all i. The following theorem gives us an auxiliary function for the target function.
Theorem 5
Let
$$\theta(h^t) = \mathrm{Diag}\!\left(\frac{W^TWh^t}{h^t}\right),$$
where the division of vectors is componentwise. Then
$$G(h, h^t) = f(h^t) + (h - h^t)^T\nabla f(h^t) + \frac{1}{2}(h - h^t)^T\theta(h^t)(h - h^t) \tag{4}$$
is an auxiliary function of
$$f(h) = \frac{1}{2}\sum_i \Big(v_i - \sum_k W_{ik}h_k\Big)^2 \qquad \text{for } h > 0.$$
Proof. It is clear that G(h, h) = f(h); we need to show that G(h, h^t) ≥ f(h). By the Taylor expansion of f(h) and Lemma 4 we have
$$f(h) = f(h^t) + (h - h^t)^T\nabla f(h^t) + \frac{1}{2}(h - h^t)^T(W^TW)(h - h^t).$$
From (4), G(h, h^t) ≥ f(h) if and only if
$$(h - h^t)^T\big(\theta(h^t) - W^TW\big)(h - h^t) \ge 0,$$
so it suffices to show that θ(h^t) − W^TW is positive semidefinite. For any vector v ∈ R^r,
$$
\begin{aligned}
v^T\big(\theta(h^t) - W^TW\big)v &= v^T\theta(h^t)v - v^TW^TWv \\
&= \frac{(W^TWh^t)_1}{h^t_1}v_1^2 + \frac{(W^TWh^t)_2}{h^t_2}v_2^2 + \cdots + \frac{(W^TWh^t)_r}{h^t_r}v_r^2 - v^TW^TWv \\
&= \sum_{i=1}^r\sum_{k=1}^r \frac{(W^TW)_{ik}h^t_k}{h^t_i}v_i^2 - \sum_{i=1}^r\sum_{k=1}^r v_i(W^TW)_{ik}v_k \\
&= \sum_{i=1}^r\sum_{k=1}^r (W^TW)_{ik}\left(\frac{h^t_k}{h^t_i}v_i^2 - v_iv_k\right) \\
&= \frac{1}{2}\sum_{i=1}^r\sum_{k=1}^r (W^TW)_{ik}\left(\frac{h^t_k}{h^t_i}v_i^2 - v_iv_k + \frac{h^t_i}{h^t_k}v_k^2 - v_iv_k\right) \\
&= \frac{1}{2}\sum_{i=1}^r\sum_{k=1}^r (W^TW)_{ik}\left(\sqrt{\frac{h^t_k}{h^t_i}}\,v_i - \sqrt{\frac{h^t_i}{h^t_k}}\,v_k\right)^2 \ge 0 \qquad \text{for } h^t > 0.
\end{aligned}
$$
Hence G(h, h^t) ≥ f(h), and G is an auxiliary function of f.
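This positive semidefiniteness can also be checked numerically on random data (a sketch; sizes are arbitrary):

```r
# Smallest eigenvalue of theta(h^t) - W^T W for random W and h^t > 0; it should be >= 0
# up to rounding, in line with the inequality proved above.
set.seed(5)
m <- 7; r <- 4
W <- matrix(runif(m * r), m, r); ht <- runif(r) + 0.1
theta <- diag(as.vector(t(W) %*% W %*% ht) / ht)
min(eigen(theta - t(W) %*% W, symmetric = TRUE)$values)
```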
Proof (Theorem 1). Let v and h be corresponding columns of V and H, respectively, and let
$$F(h) = \frac{1}{2}\|v - Wh\|^2. \tag{5}$$
Using the Taylor expansion of F(h), we have
$$F(h) = F(h^t) + (h - h^t)^T\nabla F(h^t) + \frac{1}{2}(h - h^t)^T\nabla^2 F(h^t)(h - h^t).$$
From Lemma 4,
$$\nabla F(h^t) = W^TWh^t - W^Tv \quad\text{and}\quad \nabla^2 F(h^t) = W^TW, \tag{6}$$
so, by the preceding theorem, G(h, h^t) defined in (4) is an auxiliary function of F.
Let h^{t+1} = argmin_{h>0} G(h, h^t). We can find h^{t+1} by solving the equation
$$\frac{\partial G(h^{t+1}, h^t)}{\partial h^{t+1}} = \nabla F(h^t) + \theta(h^t)(h^{t+1} - h^t) = 0.$$
Thus
$$h^{t+1} = h^t - \big(\theta(h^t)\big)^{-1}\nabla F(h^t). \tag{7}$$
Since G(h, h^t) is an auxiliary function, F is nonincreasing under this update, according to Lemma 3. By rewriting the equation, we obtain
$$h^{t+1} = h^t - \mathrm{Diag}\!\left(\frac{h^t}{W^TWh^t}\right)\big(W^TWh^t - W^Tv\big) = h^t - h^t + h^t \circ \frac{W^Tv}{W^TWh^t} = h^t \circ \frac{W^Tv}{W^TWh^t},$$
where ∘ and the fraction act componentwise.
Since this holds for every pair of corresponding columns v and h, it holds for the matrices V and H, which gives the update rule for H in Algorithm 1. By reversing the roles of W and H in equation (5), one can similarly show that the objective is nonincreasing under the update rule for W.
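The closed form just derived can be checked numerically against the Diag form in (7) on random data (a sketch; sizes are arbitrary):

```r
# h^{t+1} from (7) versus the componentwise closed form h^t * (W^T v) / (W^T W h^t).
set.seed(2)
m <- 6; r <- 4
W <- matrix(runif(m * r), m, r); v <- runif(m); ht <- runif(r)
g  <- t(W) %*% W %*% ht - t(W) %*% v                        # gradient of F at h^t
h1 <- ht - diag(as.vector(ht / (t(W) %*% W %*% ht))) %*% g  # update (7)
h2 <- ht * as.vector(t(W) %*% v) / as.vector(t(W) %*% W %*% ht)
max(abs(h1 - h2))   # ~ 0 up to rounding
```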
The AG and APG algorithms are based on the gradient descent method. Recall that, for the unconstrained problem
$$\min_{x \in \mathbb{R}^n} f(x),$$
the gradient descent method generates the iterates
$$x^{k+1} = x^k - t_k\nabla f(x^k),$$
where t_k is called the stepsize and can be chosen as t_k = argmin_{t≥0} f(x^k − t∇f(x^k)). This rule for choosing the stepsize is called exact line search.
Remark 1. If f is a convex function and the stepsize is chosen in (0, t_k], where t_k = argmin_{t≥0} f(x^k − t∇f(x^k)), then the update rule still makes f(x) nonincreasing. Indeed, the function φ(t) := f(x^k − t∇f(x^k)) is convex in t, satisfies φ(0) = f(x^k), and attains its minimum over t ≥ 0 at t_k; a one-dimensional convex function is nonincreasing on the interval between any point and its minimizer, so φ(t'_k) ≤ φ(0) = f(x^k) for every t'_k ∈ (0, t_k].
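A small R illustration of gradient descent with a (numerically) exact line search on a simple convex quadratic (the quadratic itself is an arbitrary example, not taken from the report):

```r
# Gradient descent with exact line search on f(x) = 0.5 * x' A x - b' x.
A <- matrix(c(3, 1, 1, 2), 2, 2); b <- c(1, 1)
f <- function(x) 0.5 * sum(x * (A %*% x)) - sum(b * x)
x <- c(5, -5)
for (k in 1:20) {
  g  <- as.vector(A %*% x - b)                                 # gradient at x^k
  tk <- optimize(function(t) f(x - t * g), c(0, 10))$minimum   # exact line search
  x  <- x - tk * g
}
f(x) - f(solve(A, b))   # close to 0: the iterates approach the minimizer A^{-1} b
```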
We now apply the idea of this method to the NMF problem. First, we state the following lemma.
Lemma 6
Given the function f(W, H) = ½||V − WH||²_F, with W and H fixed, let
$$\varepsilon^*_H = \operatorname*{argmin}_{\varepsilon \ge 0} f\!\left(W,\, H - \varepsilon\frac{\partial f(W,H)}{\partial H}\right) \quad\text{and}\quad \varepsilon^*_W = \operatorname*{argmin}_{\varepsilon \ge 0} f\!\left(W - \varepsilon\frac{\partial f(W,H)}{\partial W},\, H\right).$$
Then
$$\varepsilon^*_H = \frac{\|W^TWH - W^TV\|_F^2}{\|W(W^TWH - W^TV)\|_F^2} \quad\text{and}\quad \varepsilon^*_W = \frac{\|WHH^T - VH^T\|_F^2}{\|(WHH^T - VH^T)H\|_F^2}.$$
Proof. By Lemma 4, applied columnwise to ½||V − WH||²_F,
$$\frac{\partial f(W, H)}{\partial H} = W^TWH - W^TV.$$
This implies that
$$\varepsilon^*_H = \operatorname*{argmin}_{\varepsilon \ge 0} f\!\left(W,\, H - \varepsilon\frac{\partial f(W,H)}{\partial H}\right) = \operatorname*{argmin}_{\varepsilon \ge 0} f\!\left(W,\, H - \varepsilon(W^TWH - W^TV)\right). \tag{8}$$
Write K := W^TWH − W^TV and g(ε) := f(W, H − εK) = ½||(V − WH) + εWK||²_F. Then
$$
\begin{aligned}
g'(\varepsilon) = 0 \;\Longleftrightarrow\; \varepsilon &= \frac{\langle WH - V,\, WK\rangle}{\|WK\|_F^2} \\
&= \frac{\mathrm{Tr}\big((WK)^T(WH - V)\big)}{\|WK\|_F^2} \qquad \big(\langle A, B\rangle = \mathrm{Tr}(B^TA)\big) \\
&= \frac{\mathrm{Tr}\big(K^TW^T(WH - V)\big)}{\|W(W^TWH - W^TV)\|_F^2} \\
&= \frac{\langle W^TWH - W^TV,\, W^TWH - W^TV\rangle}{\|W(W^TWH - W^TV)\|_F^2} \\
&= \frac{\|W^TWH - W^TV\|_F^2}{\|W(W^TWH - W^TV)\|_F^2}.
\end{aligned}
$$
Since g is convex and this stationary point is nonnegative, it is the minimizer in (8), which gives the formula for ε*_H. The formula for ε*_W follows in the same way by exchanging the roles of W and H.
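A quick numerical check of this closed-form stepsize against a direct one-dimensional minimization (random data; sizes are arbitrary):

```r
# Closed-form eps*_H from Lemma 6 versus numerical minimization of g(eps).
set.seed(3)
m <- 8; n <- 6; r <- 3
V <- matrix(runif(m * n), m, n)
W <- matrix(runif(m * r), m, r); H <- matrix(runif(r * n), r, n)
K <- t(W) %*% W %*% H - t(W) %*% V                  # gradient of f with respect to H
eps_H <- norm(K, "F")^2 / norm(W %*% K, "F")^2      # closed form
g <- function(eps) 0.5 * norm(V - W %*% (H - eps * K), "F")^2
c(eps_H, optimize(g, c(0, 10))$minimum)             # the two values agree closely
```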
Note. The functions in (10) are convex, and the constraint sets are R^{r×n}_+ and R^{m×r}_+.
Let W be fixed, and update H in the direction of −∂f(W, H)/∂H with a stepsize ε_H. By Lemma 6, the exact-line-search stepsize is
$$\varepsilon^*_H = \frac{\|W^TWH - W^TV\|_F^2}{\|W(W^TWH - W^TV)\|_F^2}.$$
Choose
$$\varepsilon_H = \min\!\left(\varepsilon^*_H,\ \min_{\left(\frac{\partial f(W,H)}{\partial H}\right)_{i,j} > 0} H_{i,j}\Big/\Big(\frac{\partial f(W,H)}{\partial H}\Big)_{i,j}\right). \tag{11}$$
Similarly, with H fixed,
$$\varepsilon^*_W = \frac{\|WHH^T - VH^T\|_F^2}{\|(WHH^T - VH^T)H\|_F^2},$$
and then
$$\varepsilon_W = \min\!\left(\varepsilon^*_W,\ \min_{\left(\frac{\partial f(W,H)}{\partial W}\right)_{i,j} > 0} W_{i,j}\Big/\Big(\frac{\partial f(W,H)}{\partial W}\Big)_{i,j}\right). \tag{12}$$
Remark. The stepsizes chosen in (11) and (12) guarantee that the objective functions in (10) are nonincreasing, by Remark 1.
Remark. The AG algorithm updates all entries with the same stepsize. To keep the updates inside the constraint set, this stepsize frequently has to be taken smaller than the best (exact-line-search) stepsize. We will improve the AG algorithm with the idea of updating each entry with its own stepsize.
3.2 Alternating projected gradient algorithm (APG)
Let W be fixed. We update each entry of H with its own stepsize:
$$H_{i,j} = H_{i,j} - \varepsilon^H_{i,j}\left(\frac{\partial f(W,H)}{\partial H}\right)_{i,j}, \qquad \varepsilon^H_{i,j} \ge 0,$$
where
$$\varepsilon^H_{i,j} = \begin{cases} \varepsilon^*_H = \dfrac{\|W^TWH - W^TV\|_F^2}{\|W(W^TWH - W^TV)\|_F^2}, & \text{if } \Big(\dfrac{\partial f(W,H)}{\partial H}\Big)_{i,j} \le 0, \\[2ex] \min\!\left(\varepsilon^*_H,\ H_{i,j}\Big/\Big(\dfrac{\partial f(W,H)}{\partial H}\Big)_{i,j}\right), & \text{otherwise.} \end{cases}$$
Similarly, let H be fixed; we update each entry of W by
$$W_{i,j} = W_{i,j} - \varepsilon^W_{i,j}\left(\frac{\partial f(W,H)}{\partial W}\right)_{i,j}, \qquad \varepsilon^W_{i,j} \ge 0,$$
where
$$\varepsilon^W_{i,j} = \begin{cases} \varepsilon^*_W = \dfrac{\|WHH^T - VH^T\|_F^2}{\|(WHH^T - VH^T)H\|_F^2}, & \text{if } \Big(\dfrac{\partial f(W,H)}{\partial W}\Big)_{i,j} \le 0, \\[2ex] \min\!\left(\varepsilon^*_W,\ W_{i,j}\Big/\Big(\dfrac{\partial f(W,H)}{\partial W}\Big)_{i,j}\right), & \text{otherwise.} \end{cases}$$
In fact, this improvement is equivalent to first updating H (or W) with the stepsize ε*_H (or ε*_W) in the gradient direction of f(W, H) and then setting all negative components of the updated H (or W) to zero. We now present the theory and the algorithm for this idea.
The idea of the alternating projected gradient (APG) algorithm is based on the projected gradient method. Recall that the projected gradient method is designed for solving a constrained optimization problem
$$\min_{x \in C} f(x),$$
where C is a closed convex set and f is a convex function over C. Starting from an initial point x^0 ∈ C, at each step we compute
$$x^{k+1} = P_C\big(x^k - t_k\nabla f(x^k)\big), \qquad\text{where}\quad P_C(x) = \operatorname*{argmin}_{y \in C} \|y - x\|. \tag{13}$$
The orthogonal projection operator P_C with input x returns the vector in C that is closest to x. The optimization problem in (13) has a unique optimal solution [1, p. 157] if C is closed, convex, and nonempty. In the next example, we present the formula for the projection of a matrix in R^{m×n} onto the closed convex set C = R^{m×n}_+.
Example 7 (Projection onto the nonnegative matrices)
Let C = R^{m×n}_+. To compute the orthogonal projection of X ∈ R^{m×n} onto R^{m×n}_+, we need to solve the convex optimization problem
$$\min_{Y} \sum_{k=1}^n \sum_{i=1}^m (Y_{ik} - X_{ik})^2 \quad \text{s.t.}\quad Y_{ik} \ge 0. \tag{14}$$
Since this problem is separable, the entry in the ith row and kth column of the optimal solution Y* of problem (14) is the optimal solution of the univariate problem
$$\min_{Y_{ik} \ge 0}\; (Y_{ik} - X_{ik})^2.$$
Clearly, the unique solution of this problem is given by Y*_{ik} = [X_{ik}]_+, where for a real number α ∈ R, [α]_+ is the nonnegative part of α:
$$[\alpha]_+ = \begin{cases} \alpha, & \alpha \ge 0, \\ 0, & \alpha < 0. \end{cases}$$
We extend the definition of the nonnegative part to matrices: the nonnegative part of a matrix V ∈ R^{m×n} is defined by
$$[V]_+ = \begin{bmatrix} [V_{11}]_+ & \cdots & [V_{1n}]_+ \\ \vdots & \ddots & \vdots \\ [V_{m1}]_+ & \cdots & [V_{mn}]_+ \end{bmatrix}.$$
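In R, this projection is simply a componentwise clamp at zero, for example:

```r
# Orthogonal projection onto the nonnegative orthant: the componentwise nonnegative part.
X <- matrix(c(1.5, -0.3, 0, -2, 4, -1), nrow = 2)
pmax(X, 0)   # [X]_+ : negative entries are replaced by zero
```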
With this projection, the entrywise APG updates described above can be written as
$$H_{i,j} = P_C\!\left(H_{i,j} - \varepsilon^*_H\Big(\frac{\partial f(W,H)}{\partial H}\Big)_{i,j}\right) \quad\text{and}\quad W_{i,j} = P_C\!\left(W_{i,j} - \varepsilon^*_W\Big(\frac{\partial f(W,H)}{\partial W}\Big)_{i,j}\right),$$
where
$$\varepsilon^*_H = \frac{\|W^TWH - W^TV\|_F^2}{\|W(W^TWH - W^TV)\|_F^2}, \qquad \varepsilon^*_W = \frac{\|WHH^T - VH^T\|_F^2}{\|(WHH^T - VH^T)H\|_F^2}, \tag{15}$$
and P_C is the orthogonal projection onto the nonnegative matrices discussed in Example 7.
Therefore, our improvement can be written as the following algorithm:
Algorithm 3 Alternating Projected Gradient Algorithm (Lin and Liu [11])
1. W^0 = rand(m, r); H^0 = rand(r, n); (W^0, H^0 > 0)
2. for k = 0, 1, 2, . . .
$$W^{k+1}_{i,j} = W^k_{i,j} - \varepsilon^*_{W^k}\left(\frac{\partial f(W^k, H^k)}{\partial W^k}\right)_{i,j}$$
   set all negative entries in W^{k+1} to 0
$$H^{k+1}_{i,j} = H^k_{i,j} - \varepsilon^*_{H^k}\left(\frac{\partial f(W^{k+1}, H^k)}{\partial H^k}\right)_{i,j}$$
   set all negative entries in H^{k+1} to 0
end
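A minimal R sketch of Algorithm 3, using the exact-line-search stepsizes of (15) (the function name is our own; no safeguard is included for the degenerate case of a zero gradient):

```r
# Minimal sketch of Algorithm 3 (APG) with the stepsizes of (15).
nmf_apg <- function(V, r, iters = 100) {
  m <- nrow(V); n <- ncol(V)
  W <- matrix(runif(m * r), m, r); H <- matrix(runif(r * n), r, n)
  for (k in seq_len(iters)) {
    GW <- W %*% H %*% t(H) - V %*% t(H)              # gradient with respect to W
    eW <- norm(GW, "F")^2 / norm(GW %*% H, "F")^2    # eps*_W
    W  <- pmax(W - eW * GW, 0)                       # update W, then project onto >= 0
    GH <- t(W) %*% W %*% H - t(W) %*% V              # gradient with respect to H (new W)
    eH <- norm(GH, "F")^2 / norm(W %*% GH, "F")^2    # eps*_H
    H  <- pmax(H - eH * GH, 0)                       # update H, then project onto >= 0
  }
  list(W = W, H = H, loss = 0.5 * norm(V - W %*% H, "F")^2)
}
```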
Remark. Compared to the MU algorithm, no zero entries appear in the denominators in our algorithm, which implies that no breakdown occurs; and even if some entries become zero during the iterations, they can still be improved by subsequent updates.
4 Applications
Determining the structure of a data set and extracting meaningful information from it is a critical problem in data analysis. There are many powerful methods for this purpose, such as linear dimensionality reduction (LDR) techniques, which are equivalent to low-rank matrix approximations (LRMA).
Examples of LDR techniques are principal component analysis (PCA), independent component
analysis, sparse PCA, robust PCA, low-rank matrix completion, and sparse component analysis.
Among LRMA methods, NMF requires the factors of the low-rank approximation to be componentwise nonnegative. That helps us interpret them in a meaningful way, for example when they
correspond to non-negative physical quantities. Another important motivation for NMF is to find
the nonnegative rank of a nonnegative matrix [6].
In this section, we emphasize two important applications of NMF in data analysis: image processing and text mining. Other applications include air emission control [12], music genre classification [2], email surveillance [3], and so on. However, it is important to stress that NMF is not motivated only by its use as an LDR technique for data analysis [6, p. 12].
In the application to facial feature extraction, let each column of X ∈ R^{m×n}_+ be a vectorized gray-level image of a face. Vectorization means that the two-dimensional image is transformed into a long one-dimensional vector, for example, by stacking the columns of the image on top of each other [6]. Then each row of X corresponds to the same pixel location in the original images; that is, the (i, j)th entry of X is the intensity of the ith pixel in the jth face.
NMF generates two factors (W, H) so that each image X(:, j) is approximated using a linear
combination of the columns of W . Since W is nonnegative, the columns of W can be interpreted
as images (that is, vectors of pixel intensities) which we refer to as the basis images. The columns
of W are vectors of pixel intensities whose linear combinations allow us to approximate each input
image. Hence these basis images typically correspond to localized features that are shared among
the input images, and the entries of H indicate which input image contains which feature. This is
illustrated by the following figure
Figure 2: NMF applied to the CBCL face data set¹ with r = 49 (2429 images with 19 × 19 pixels each) by the MU algorithm [6].
On the left is a column of X reshaped as an image. In the middle are the 49 columns of the basis W reshaped as images and displayed in a 7 × 7 grid, together with the corresponding column of H displayed on the same 7 × 7 grid, which shows which features are present in that particular face. On the right is the reshaped approximation (WH):,j of X:,j as an image.
The following example may be more illustrative.
¹ http://cbcl.mit.edu/software-datasets/FaceData2.html
Figure 4: NMF basis images for the swimmer data set; sample images are reconstructed as linear combinations of these basis images with suitable coefficients [6]
Figure 5 illustrates the extraction of topics and the classification of each document with respect to these topics. The columns of X are sets of words, with X_{i,j} being the number of times word i appears in document j. Because H is nonnegative, these sets are approximated as unions of a smaller number of sets defined by the columns of W, which correspond to topics.
Therefore, given a set of documents, NMF identifies topics and simultaneously classifies the documents among these different topics.
5 Numerical experiments
In this section, we will present some experiments using the MU algorithm and the APG algorithm,
which are implemented in R 4.2.11 . We use the AT&T face dataset2 consists of 10 black and
white photos of each member of a group of 40 individuals, 400 images in total and each image is
92 × 112 pixels, with 256 grey levels per pixel.
In Table 1, we record the values of the cost function obtained from the MU and the APG algorithms after the same number of iterations. The column denoted by % is the relative improvement, defined by
$$\frac{f_{MU} - f_{APG}}{f_{MU}},$$
where $f_{MU} = f(W^k_{MU}, H^k_{MU})$, $f_{APG} = f(W^k_{APG}, H^k_{APG})$, and $W^k_{MU}, H^k_{MU}, W^k_{APG}, H^k_{APG}$ denote the corresponding results computed by the MU algorithm and the APG algorithm, respectively.
¹ Detailed code: https://rpubs.com/trandattin/954115
² https://www.kaggle.com/datasets/kasikrit/att-database-of-faces
The relative error for each algorithm is defined by
$$\rho = \frac{f(W^k, H^k)}{f(W^0, H^0)},$$
where W^0 and H^0 are the initial matrices. Table 1 shows that the APG algorithm has an apparent advantage over the MU algorithm at the early stage.
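For illustration only, the relative improvement reported in the % column can be computed from the hypothetical nmf_mu and nmf_apg sketches given earlier (on random data, not the AT&T experiment; a faithful comparison should start both algorithms from the same initial W^0 and H^0, which these simple sketches do not enforce):

```r
# The "%" column of Table 1 is (f_MU - f_APG) / f_MU after the same number of iterations.
set.seed(4)
V <- matrix(runif(100 * 80), 100, 80)
res_mu  <- nmf_mu(V, r = 10, iters = 50)     # hypothetical sketch defined earlier
res_apg <- nmf_apg(V, r = 10, iters = 50)    # hypothetical sketch defined earlier
(res_mu$loss - res_apg$loss) / res_mu$loss   # relative improvement of APG over MU
```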
Figure 7: Comparison of the convergence rates of the algorithms on the AT&T face dataset
6 Conclusion
The results presented in this report are a first step toward studying the MU, AG, and APG algorithms for NMF, the theory used to analyze their convergence, and a comparison of the performance of these algorithms in R. Finally, the author presents some applications of NMF. One question raised during this research is the following:
1. Compare the performance of the APG algorithm when using a constant stepsize versus the stepsize given by exact line search.
Due to the limitations of time and ability, the author has not been able to prove the convergence of the APG algorithm. Hopefully, these results can be developed and extended in the future.
References
[1] A. Beck. Introduction to nonlinear optimization: Theory, algorithms, and applications with
MATLAB. SIAM, 2014.
[2] E. Benetos and C. Kotropoulos. Non-negative tensor factorization applied to music genre
classification. IEEE Transactions on Audio, Speech, and Language Processing, 18(8):1955–
1967, 2010.
[4] D. M. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.
[5] C. Févotte, N. Bertin, and J.-L. Durrieu. Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis. Neural Computation, 21(3):793–830, 2009.
[6] N. Gillis. Nonnegative Matrix Factorization. Society for Industrial and Applied Mathematics,
Philadelphia, PA, 2020.
[7] D. Guillamet and J. Vitria. Non-negative matrix factorization for face recognition. In Catalonian Conference on Artificial Intelligence, pages 336–344. Springer, 2002.
[8] N. D. Ho. Nonnegative matrix factorization algorithms and applications. PhD thesis, Université Catholique de Louvain, 2008.
[9] Q. Ke and T. Kanade. Robust L1 norm factorization in the presence of outliers and
missing data by alternative convex programming. In 2005 IEEE Computer Society Conference
on Computer Vision and Pattern Recognition (CVPR’05), volume 1, pages 739–746. IEEE,
2005.
[10] D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. Advances in neural
information processing systems, 13, 2000.
[11] L. Lin and Z.-Y. Liu. An alternating projected gradient algorithm for nonnegative matrix
factorization. Applied mathematics and computation, 217(24):9997–10002, 2011.
[12] P. Paatero and U. Tapper. Positive matrix factorization: A non-negative factor model with
optimal utilization of error estimates of data values. Environmetrics, 5(2):111–126, 1994.