
Some Algorithms for Nonnegative Matrix Factorization

Author: Tran Dat Tin∗
Supervisor: Dr. Le Hai Yen†

Hanoi - October 31, 2022

Abstract. Because of the extensive applications of nonnegative matrix factorization (NMF), such as in image processing, text mining, spectral data analysis, and speech processing, algorithms for NMF have been studied for years. This report presents the multiplicative updating (MU) [10], alternating gradient (AG), and alternating projected gradient (APG) algorithms [11] for solving the NMF problem. We take advantage of well-developed theories on majorize-minimization and gradient descent methods to analyze the convergence of these algorithms. Finally, numerical experiments on the AT&T face database show that the APG algorithm performs better than the MU algorithm.

1 Introduction
Let $V$ be an $m \times n$ nonnegative matrix (that is, $v_{ij} \ge 0$ for all $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, n$), denoted by $V \ge 0$ or $V \in \mathbb{R}^{m \times n}_+$, and let $r \le \min(m, n)$ be a fixed integer. The so-called nonnegative matrix factorization (NMF) of $V$ approximates $V$ by the product of two nonnegative matrices $W \in \mathbb{R}^{m \times r}_+$ and $H \in \mathbb{R}^{r \times n}_+$:

$$V \approx WH.$$

To find an approximate factorization $V \approx WH$, we first need to define a cost function that quantifies the quality of the approximation. Such a cost function can be constructed using some measure of distance between the nonnegative matrices $V$ and $WH$. The NMF problem was first stated in the paper by Paatero and Tapper in 1994 [12].

∗ Dalat University, Vietnam ([email protected])
† Institute of Mathematics, Vietnam Academy of Science and Technology ([email protected])

There are many ways to define the cost function for different purposes. Lee and Seung [10] defined cost functions based on the Frobenius norm and on the Kullback-Leibler divergence. Févotte, Bertin, and Durrieu [5] defined a cost function using the Itakura-Saito divergence. Ke and Kanade [9] used the ℓ1 norm to define another cost function. In this report, we define the cost function based on the Frobenius norm, as in [10]. Let us state the main problem:

Problem 1 (Nonnegative Matrix Factorization - NMF)

Given a nonnegative matrix $V \in \mathbb{R}^{m \times n}_+$ and an integer $r \le \min(m, n)$, solve
$$\min_{W \in \mathbb{R}^{m \times r}_+,\; H \in \mathbb{R}^{r \times n}_+} f_V(W, H) := \frac{1}{2}\|V - WH\|_F^2 \tag{1}$$

Remark. For a fixed matrix $H \in \mathbb{R}^{r \times n}$, the function $W \mapsto f_V(W, H)$ is convex, because for all $W_1, W_2 \in \mathbb{R}^{m \times r}_+$ and $\lambda \in [0, 1]$:
$$\begin{aligned}
f_V(\lambda W_1 + (1-\lambda) W_2, H) &= \frac{1}{2}\|\lambda(V - W_1 H) + (1-\lambda)(V - W_2 H)\|_F^2 \\
&\le \frac{\lambda}{2}\|V - W_1 H\|_F^2 + \frac{1-\lambda}{2}\|V - W_2 H\|_F^2 \qquad (\|\cdot\|_F^2 \text{ is a convex function}) \\
&= \lambda f_V(W_1, H) + (1-\lambda) f_V(W_2, H).
\end{aligned}$$
Similarly, for fixed $W$, the function $H \mapsto f_V(W, H)$ is convex. However, the objective function $f_V$ in (1) is not jointly convex in $(W, H)$. To see this, consider the case $m = n = 1$:
$$f_v(w, h) := (v - wh)^2 = v^2 - 2vwh + w^2h^2,$$
which implies
$$\nabla^2 f_v(w, h) = \begin{bmatrix} 2h^2 & 4hw - 2v \\ 4hw - 2v & 2w^2 \end{bmatrix}.$$
For $v > 0$ and $h > 0$ we have $2h^2 > 0$ and $\det(\nabla^2 f_v(w, h)) = -12h^2w^2 + 16vhw - 4v^2$, and there exists $w > 0$ such that $\det(\nabla^2 f_v(w, h)) < 0$, so the Hessian is indefinite.
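The following short R snippet is a numerical sketch of this remark (not part of the original analysis; the values of v, w, h are arbitrary): it evaluates the Hessian above at one point and confirms that its determinant can be negative, so $f_V$ is not jointly convex.

```r
# Sketch: evaluate the Hessian of f_v(w, h) = (v - w*h)^2 at an arbitrary point
# and check that its determinant can be negative (so the Hessian is indefinite).
v <- 1; w <- 2; h <- 2
Hess <- matrix(c(2 * h^2,           4 * h * w - 2 * v,
                 4 * h * w - 2 * v, 2 * w^2),
               nrow = 2, byrow = TRUE)
det(Hess)   # equals -12*h^2*w^2 + 16*v*w*h - 4*v^2 = -132 < 0 here
```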

2 Multiplicative updating algorithm


The most long-established algorithm for the NMF problem is the multiplicative updating (MU) algorithm, which was proposed by Lee and Seung [10]; their work made the name NMF standard. From now on, the notations ◦ and / will be used for component-wise multiplication and component-wise division, respectively.

Algorithm 1 Multiplicative Updating Algorithm (Lee and Seung [10])
1. $W^0 = \mathrm{rand}(m, r)$; $H^0 = \mathrm{rand}(r, n)$; $(W^0, H^0 > 0)$
2. for $k = 0, 1, 2, \ldots$
$$W^{k+1} = W^k \circ \frac{V (H^k)^T}{W^k H^k (H^k)^T}$$
$$H^{k+1} = H^k \circ \frac{(W^{k+1})^T V}{(W^{k+1})^T W^{k+1} H^k}$$
end
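A minimal R sketch of Algorithm 1 is given below. It is illustrative only: the function name is ours, and a small constant eps is added to the denominators purely as a numerical safeguard; it is not part of the update rules above.

```r
# Sketch of the multiplicative updating (MU) rule of Algorithm 1.
nmf_mu <- function(V, r, n_iter = 100, eps = 1e-12) {
  m <- nrow(V); n <- ncol(V)
  W <- matrix(runif(m * r), m, r)   # W^0 = rand(m, r), strictly positive
  H <- matrix(runif(r * n), r, n)   # H^0 = rand(r, n), strictly positive
  for (k in seq_len(n_iter)) {
    # componentwise: W <- W o (V H^T) / (W H H^T); eps only guards against 0/0
    W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + eps)
    # componentwise: H <- H o (W^T V) / (W^T W H)
    H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + eps)
  }
  list(W = W, H = H)
}
```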

In the next theorem, we will show that the objective function in (1) is nonincreasing if we use
the update rule of MU algorithm.

Theorem 1 (Lee and Seung [10])


The Frobenius-norm objective $\frac{1}{2}\|V - WH\|_F^2$ is nonincreasing under the updating rules of Algorithm 1.

To prove Theorem 1, we need the following definition.

Definition 2 (Auxiliary function)


Given $F: \mathbb{R}^K \to \mathbb{R}$, the function $\bar{F}: \mathbb{R}^K \times \mathbb{R}^K \to \mathbb{R}$ is said to be an auxiliary function of $F$ if and only if
• $F(\bar{v}) = \bar{F}(\bar{v}, \bar{v})$ for all $\bar{v} \in \mathbb{R}^K$;
• $F(v_*) \le \bar{F}(v_*, \bar{v})$ for all $(v_*, \bar{v}) \in \mathbb{R}^K \times \mathbb{R}^K$.

The usefulness of an auxiliary function comes from the following lemma, which is illustrated in Figure 1.

Figure 1: Property of auxiliary function, in this figure F̄ (v) := F̄ (v, v̄) [8]

Lemma 3
If $\bar{F}$ is an auxiliary function of $F$, then $F$ is nonincreasing under the update
$$v^{(i+1)} = \operatorname*{argmin}_{v \ge 0} \bar{F}(v, v^{(i)}). \tag{2}$$

Proof. $F(v^{(i+1)}) \le \bar{F}(v^{(i+1)}, v^{(i)}) \le \bar{F}(v^{(i)}, v^{(i)}) = F(v^{(i)})$.

Remark. $F(v^{(i+1)}) = F(v^{(i)})$ only if $v^{(i)}$ is a local minimum of $\bar{F}(\cdot, v^{(i)})$, in which case $F(v^{(i)}) = \bar{F}(v^{(i)}, v^{(i)})$. If the derivative of $F$ exists and is continuous in a small neighborhood of $v^{(i)}$, this also implies that
$$\nabla F(v^{(i)}) = \nabla_v \bar{F}(v^{(i)}, v^{(i)}) = 0.$$
If $F$ is a convex function, then the sequence generated by equation (2) converges to a minimum $v_{\min} = \operatorname*{argmin}_{v > 0} F(v)$ of the objective function:
$$F(v_{\min}) \le \cdots \le F(v^{(i+1)}) \le F(v^{(i)}) \le \cdots \le F(v^{(1)}) \le F(v^{(0)}).$$

We will show that, by defining appropriate auxiliary functions $\bar{F}(v, v^{(i)})$ for the objective function (1), the update rules of Algorithm 1 easily follow from equation (2). First, we need the following lemma.

Lemma 4
Given $W \in \mathbb{R}^{m \times r}$ and $v \in \mathbb{R}^m$, define $f: \mathbb{R}^r \to \mathbb{R}$ by $f(h) = \|Wh - v\|^2$. Then $\nabla_h f(h) = 2W^TWh - 2W^Tv$ and $\nabla^2_h f(h) = 2W^TW$.

Proof. By definition,
$$\nabla_h \|Wh - v\|^2 := \left( \frac{\partial}{\partial h_1}\|Wh - v\|^2, \cdots, \frac{\partial}{\partial h_r}\|Wh - v\|^2 \right)^T \tag{3}$$
and
$$\|Wh - v\|^2 = \sum_{i=1}^m \big((Wh)_i - v_i\big)^2 = \sum_{i=1}^m \Big( \sum_{k=1}^r w_{ik} h_k - v_i \Big)^2 = \sum_{i=1}^m \Big( \sum_{k=1}^r w_{ik} h_k \Big)^2 - 2\sum_{i=1}^m v_i \sum_{k=1}^r w_{ik} h_k + \sum_{i=1}^m v_i^2.$$
For any $h_j$ we have
$$\begin{aligned}
\frac{\partial}{\partial h_j}\|Wh - v\|^2 &= \sum_{i=1}^m \frac{\partial}{\partial h_j}\Big( \sum_{k=1}^r w_{ik} h_k \Big)^2 - 2\sum_{i=1}^m v_i w_{ij} \\
&= 2\sum_{i=1}^m w_{ij} \sum_{k=1}^r w_{ik} h_k - 2\sum_{i=1}^m v_i w_{ij} \\
&= 2\sum_{i=1}^m w_{ij} \big((Wh)_i - v_i\big) = 2\big(W^T(Wh - v)\big)_j.
\end{aligned}$$
From (3),
$$\nabla_h f(h) = 2W^TWh - 2W^Tv,$$
and it is clear that
$$\nabla^2_h f(h) = 2W^TW.$$
The proof is complete.

We denote by $\mathrm{Diag}(v)$ the square diagonal matrix with the entries of the vector $v$ on the main diagonal, and a vector $v > 0$ means $v_i > 0$ for all $i$. The following theorem gives us an auxiliary function for the objective function.

Theorem 5 (Auxiliary function of the objective function - Lee and Seung [10])

Let $h^t > 0$, let $V$ be a nonnegative matrix, and let
$$\theta(h^t) = \mathrm{Diag}\!\left( \frac{W^TWh^t}{h^t} \right).$$
Then
$$G(h, h^t) = f(h^t) + (h - h^t)^T \nabla f(h^t) + \frac{1}{2}(h - h^t)^T \theta(h^t)(h - h^t) \tag{4}$$
is an auxiliary function of
$$f(h) = \frac{1}{2}\sum_i \Big( v_i - \sum_k W_{ik} h_k \Big)^2 \quad \text{for } h > 0.$$

Proof. It is clear that $G(h, h) = f(h)$; we need to show that $G(h, h^t) \ge f(h)$. Since $f$ is quadratic, by the Taylor expansion of $f$ and Lemma 4 we have
$$f(h) = f(h^t) + (h - h^t)^T \nabla f(h^t) + \frac{1}{2}(h - h^t)^T (W^TW)(h - h^t).$$
From (4), $G(h, h^t) \ge f(h)$ if and only if
$$(h - h^t)^T\big(\theta(h^t) - W^TW\big)(h - h^t) \ge 0.$$
We need to show that $\theta(h^t) - W^TW$ is a positive semidefinite matrix. Let $v \in \mathbb{R}^r$. Then
$$\begin{aligned}
v^T\big(\theta(h^t) - W^TW\big)v &= v^T\theta(h^t)v - v^TW^TWv \\
&= \frac{(W^TWh^t)_1}{h^t_1}v_1^2 + \frac{(W^TWh^t)_2}{h^t_2}v_2^2 + \cdots + \frac{(W^TWh^t)_r}{h^t_r}v_r^2 - v^TW^TWv \\
&= \sum_{i=1}^r \sum_{k=1}^r \frac{(W^TW)_{ik}\,h^t_k}{h^t_i}v_i^2 - \sum_{i=1}^r \sum_{k=1}^r v_i (W^TW)_{ik} v_k \\
&= \sum_{i=1}^r \sum_{k=1}^r (W^TW)_{ik}\left( \frac{h^t_k}{h^t_i}v_i^2 - v_i v_k \right) \\
&= \frac{1}{2}\sum_{i=1}^r \sum_{k=1}^r (W^TW)_{ik}\left( \frac{h^t_k}{h^t_i}v_i^2 - v_i v_k + \frac{h^t_i}{h^t_k}v_k^2 - v_i v_k \right) \\
&= \frac{1}{2}\sum_{i=1}^r \sum_{k=1}^r (W^TW)_{ik}\left( \sqrt{\frac{h^t_k}{h^t_i}}\,v_i - \sqrt{\frac{h^t_i}{h^t_k}}\,v_k \right)^2 \ge 0, \quad \text{for } h^t > 0.
\end{aligned}$$
That means $\theta(h^t) - W^TW$ is a positive semidefinite matrix.

Now we can prove Theorem 1.

Proof (Theorem 1). Let $v$ and $h$ be corresponding columns of $V$ and $H$, respectively, and let
$$F(h) = \frac{1}{2}\|v - Wh\|^2. \tag{5}$$
Since $F$ is quadratic, its Taylor expansion is exact:
$$F(h) = F(h^t) + (h - h^t)^T\nabla F(h^t) + \frac{1}{2}(h - h^t)^T\nabla^2 F(h^t)(h - h^t),$$
and from Lemma 4,
$$\nabla F(h^t) = W^TWh^t - W^Tv \quad \text{and} \quad \nabla^2 F(h^t) = W^TW.$$
From Theorem 5, the following is an auxiliary function of $F(h)$:
$$G(h, h^t) = F(h^t) + (h - h^t)^T\nabla F(h^t) + \frac{1}{2}(h - h^t)^T\theta(h^t)(h - h^t). \tag{6}$$
Let $h^{t+1} = \operatorname*{argmin}_{h > 0} G(h, h^t)$. We can find $h^{t+1}$ by solving
$$\frac{\partial G(h^{t+1}, h^t)}{\partial h^{t+1}} = 0, \qquad \text{where} \qquad \frac{\partial G(h^{t+1}, h^t)}{\partial h^{t+1}} = \nabla F(h^t) + \theta(h^t)(h^{t+1} - h^t).$$
Thus
$$h^{t+1} = h^t - \big(\theta(h^t)\big)^{-1}\nabla F(h^t). \tag{7}$$
Since $G(h, h^t)$ is an auxiliary function, $F$ is nonincreasing under this update, according to Lemma 3. By rewriting the update, we obtain
$$\begin{aligned}
h^{t+1} &= h^t - \mathrm{Diag}\!\left( \frac{h^t}{W^TWh^t} \right)(W^TWh^t - W^Tv) \\
&= h^t - h^t + h^t \circ \frac{W^Tv}{W^TWh^t} \\
&= h^t \circ \frac{W^Tv}{W^TWh^t}.
\end{aligned}$$
Since this holds for every column $v$ of $V$ and the corresponding column $h$ of $H$, it holds for the matrices $V$ and $H$. By reversing the roles of $W$ and $H$ in equation (5), one can similarly show that $F$ is nonincreasing under the update rule for $W$.

3 Alternating projected gradient algorithms


In this section, we first introduce the alternating gradient (AG) algorithm for the NMF problem and then study the alternating projected gradient (APG) algorithm [11].

3.1 Alternating gradient algorithms (AG)


The alternating gradient algorithm is based on the gradient descent method. Recall that the gradient descent method is designed for solving an unconstrained optimization problem
$$\min_{x \in \mathbb{R}^n} f(x),$$
where $f$ is a continuously differentiable function. Starting from an initial point $x_0 \in \mathbb{R}^n$, at each step we compute
$$x_{k+1} = x_k - t_k \nabla f(x_k),$$
where $t_k$ is called the stepsize and can be chosen as $t_k = \operatorname*{argmin}_{t \ge 0} f(x_k - t\nabla f(x_k))$. This rule of choosing the stepsize is called exact line search.
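As a small illustration (our own example, not taken from [11]), the R sketch below runs gradient descent with exact line search on a convex quadratic, for which the exact stepsize has a closed form.

```r
# Sketch: gradient descent with exact line search on f(x) = 0.5*x'Ax - b'x
# (illustrative A and b). For a quadratic, the exact stepsize is ||g||^2 / (g' A g).
A <- matrix(c(3, 1,
              1, 2), 2, 2, byrow = TRUE)
b <- c(1, 1)
x <- c(0, 0)
for (k in 1:20) {
  g   <- A %*% x - b                                # gradient at x
  t_k <- sum(g^2) / as.numeric(t(g) %*% A %*% g)    # argmin_{t >= 0} f(x - t g)
  x   <- x - t_k * g
}
x   # close to the minimizer solve(A, b)
```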

Remark 1. If $f$ is a convex function and the stepsize is chosen in $(0, t_k]$, where $t_k = \operatorname*{argmin}_{t \ge 0} f(x_k - t\nabla f(x_k))$, then the update rule still makes $f$ nonincreasing. Indeed, the single-variable function $\varphi(t) = f(x_k - t\nabla f(x_k))$ is convex, $\varphi'(0) = -\|\nabla f(x_k)\|^2 \le 0$, and $t_k$ minimizes $\varphi$ over $t \ge 0$; hence $\varphi$ is nonincreasing on $[0, t_k]$, so $f(x_k - t'_k\nabla f(x_k)) \le f(x_k)$ for every $t'_k \in (0, t_k]$.

We now apply the idea of this method to the NMF problem. First, we state the following lemma.

Lemma 6
Given the function $f(W, H) = \frac{1}{2}\|V - WH\|_F^2$ with $W$ and $H$ fixed, let
$$\varepsilon^*_H = \operatorname*{argmin}_{\varepsilon \ge 0} f\!\left(W,\; H - \varepsilon\frac{\partial f(W,H)}{\partial H}\right) \quad \text{and} \quad \varepsilon^*_W = \operatorname*{argmin}_{\varepsilon \ge 0} f\!\left(W - \varepsilon\frac{\partial f(W,H)}{\partial W},\; H\right).$$
Then
$$\varepsilon^*_H = \frac{\|W^TWH - W^TV\|_F^2}{\|W(W^TWH - W^TV)\|_F^2} \quad \text{and} \quad \varepsilon^*_W = \frac{\|WHH^T - VH^T\|_F^2}{\|(WHH^T - VH^T)H\|_F^2}.$$

Proof. Applying Lemma 4 to all columns $V_{:j}$ and $H_{:j}$ gives
$$\frac{\partial f(W, H)}{\partial H} = W^TWH - W^TV.$$
This implies that
$$\varepsilon^*_H = \operatorname*{argmin}_{\varepsilon \ge 0} f\!\left(W,\; H - \varepsilon\frac{\partial f(W,H)}{\partial H}\right) = \operatorname*{argmin}_{\varepsilon \ge 0} f\big(W,\; H - \varepsilon(W^TWH - W^TV)\big). \tag{8}$$
Put $K = W^TWH - W^TV$ and consider the function
$$\begin{aligned}
g(\varepsilon) = f(W, H - \varepsilon K) &= \frac{1}{2}\|V - W(H - \varepsilon K)\|_F^2 \\
&= \frac{1}{2}\big\langle (V - WH) + \varepsilon WK,\; (V - WH) + \varepsilon WK \big\rangle \\
&= \frac{1}{2}\Big( \|V - WH\|_F^2 + 2\varepsilon\langle V - WH, WK \rangle + \varepsilon^2\|WK\|_F^2 \Big).
\end{aligned} \tag{9}$$
This implies that $g(\varepsilon)$ is a single-variable convex quadratic function and
$$g'(\varepsilon) = \langle V - WH, WK \rangle + \varepsilon\|WK\|_F^2.$$
Then
$$\begin{aligned}
g'(\varepsilon) = 0 \iff \varepsilon &= \frac{\langle WH - V, WK \rangle}{\|WK\|_F^2} = \frac{\mathrm{Tr}\big((WK)^T(WH - V)\big)}{\|WK\|_F^2} \qquad (\langle A, B \rangle = \mathrm{Tr}(B^TA)) \\
&= \frac{\mathrm{Tr}\big(K^TW^T(WH - V)\big)}{\|W(W^TWH - W^TV)\|_F^2} = \frac{\langle W^TWH - W^TV,\; W^TWH - W^TV \rangle}{\|W(W^TWH - W^TV)\|_F^2} \\
&= \frac{\|W^TWH - W^TV\|_F^2}{\|W(W^TWH - W^TV)\|_F^2}.
\end{aligned}$$
Because $g(\varepsilon)$ is convex, we conclude that
$$\varepsilon^*_H = \frac{\|W^TWH - W^TV\|_F^2}{\|W(W^TWH - W^TV)\|_F^2}.$$
Similarly, we have
$$\varepsilon^*_W = \frac{\|WHH^T - VH^T\|_F^2}{\|(WHH^T - VH^T)H\|_F^2}.$$
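The two closed-form stepsizes of Lemma 6 translate directly into R; the sketch below (the function names are ours) computes them from the current factors.

```r
# Sketch of the exact line-search stepsizes of Lemma 6 for f(W,H) = 0.5*||V - WH||_F^2.
step_H <- function(V, W, H) {
  G <- t(W) %*% W %*% H - t(W) %*% V   # gradient of f with respect to H
  sum(G^2) / sum((W %*% G)^2)          # ||G||_F^2 / ||W G||_F^2
}
step_W <- function(V, W, H) {
  G <- W %*% H %*% t(H) - V %*% t(H)   # gradient of f with respect to W
  sum(G^2) / sum((G %*% H)^2)          # ||G||_F^2 / ||G H||_F^2
}
```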

We consider two constrained optimization problems, with $W$ and $H$ fixed respectively:
$$\min_{H \in \mathbb{R}^{r \times n}_+} \frac{1}{2}\|V - WH\|_F^2 \quad \text{and} \quad \min_{W \in \mathbb{R}^{m \times r}_+} \frac{1}{2}\|V - WH\|_F^2. \tag{10}$$

Note. The functions in (10) are convex and the constraint sets are $\mathbb{R}^{r \times n}_+$ and $\mathbb{R}^{m \times r}_+$.

We update $H$ (and $W$) with $W$ (and $H$) fixed, respectively, that is,
$$H_{i,j} = H_{i,j} - \varepsilon_H\left( \frac{\partial f(W,H)}{\partial H} \right)_{i,j}, \quad \varepsilon_H \ge 0, \qquad \text{and} \qquad W_{i,j} = W_{i,j} - \varepsilon_W\left( \frac{\partial f(W,H)}{\partial W} \right)_{i,j}, \quad \varepsilon_W \ge 0.$$
In order to ensure that the matrix $H$ stays in $\mathbb{R}^{r \times n}_+$, $\varepsilon_H$ should be chosen such that
$$\varepsilon_H \le H_{i,j} \Big/ \left( \frac{\partial f(W,H)}{\partial H} \right)_{i,j} \quad \text{whenever} \quad \left( \frac{\partial f(W,H)}{\partial H} \right)_{i,j} > 0.$$
By Lemma 6, the exact line-search stepsize is
$$\varepsilon^*_H = \operatorname*{argmin}_{\varepsilon \ge 0} f\!\left(W,\; H - \varepsilon\frac{\partial f(W,H)}{\partial H}\right) = \frac{\|W^TWH - W^TV\|_F^2}{\|W(W^TWH - W^TV)\|_F^2}.$$
Choose
$$\varepsilon_H = \min\left( \varepsilon^*_H,\ \min_{\left( \frac{\partial f(W,H)}{\partial H} \right)_{i,j} > 0} H_{i,j} \Big/ \left( \frac{\partial f(W,H)}{\partial H} \right)_{i,j} \right), \tag{11}$$
and thus $H \in \mathbb{R}^{r \times n}_+$.


When $H$ is fixed, we have the analogous update for $W$:
$$W_{i,j} = W_{i,j} - \varepsilon_W\left( \frac{\partial f(W,H)}{\partial W} \right)_{i,j}, \quad \varepsilon_W \ge 0,$$
with
$$\varepsilon^*_W = \frac{\|WHH^T - VH^T\|_F^2}{\|(WHH^T - VH^T)H\|_F^2},$$
and then
$$\varepsilon_W = \min\left( \varepsilon^*_W,\ \min_{\left( \frac{\partial f(W,H)}{\partial W} \right)_{i,j} > 0} W_{i,j} \Big/ \left( \frac{\partial f(W,H)}{\partial W} \right)_{i,j} \right). \tag{12}$$
In this case, we have $W \in \mathbb{R}^{m \times r}_+$. Based on the above analysis, we have the following Algorithm 2.

Remark. The stepsizes chosen in (11) and (12) guarantee that the objective functions in (10) are nonincreasing, by Remark 1.

Algorithm 2 Alternating Gradient Algorithm (Lin and Liu [11])
1. $W^0 = \mathrm{rand}(m, r)$; $H^0 = \mathrm{rand}(r, n)$; $(W^0, H^0 > 0)$
2. for $k = 0, 1, 2, \ldots$
$$W^{k+1}_{i,j} = W^k_{i,j} - \varepsilon_{W^k}\left( \frac{\partial f(W^k, H^k)}{\partial W^k} \right)_{i,j}$$
$$H^{k+1}_{i,j} = H^k_{i,j} - \varepsilon_{H^k}\left( \frac{\partial f(W^{k+1}, H^k)}{\partial H^k} \right)_{i,j}$$
end

where $\varepsilon_{H^k}$ and $\varepsilon_{W^k}$ are calculated by formulas (11) and (12), respectively.
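To make the capped stepsize of formula (11) concrete, here is a minimal R sketch of one AG update of $H$ with $W$ fixed (the function name is ours); the update of $W$ with formula (12) is analogous.

```r
# Sketch of one AG update of H (W fixed), using the capped stepsize of (11).
ag_update_H <- function(V, W, H) {
  G <- t(W) %*% W %*% H - t(W) %*% V           # gradient with respect to H
  eps_star <- sum(G^2) / sum((W %*% G)^2)      # exact line-search stepsize (Lemma 6)
  pos <- G > 0                                 # entries that move toward zero
  eps <- if (any(pos)) min(eps_star, min(H[pos] / G[pos])) else eps_star
  H - eps * G                                  # remains entrywise nonnegative
}
```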

Remark. The AG algorithm updates all entries with the same stepsize. To ensure the updates stay in the constraint set, the stepsize must often be shrunk below the best (exact line-search) stepsize. We will improve the AG algorithm by updating each entry with its own stepsize.

3.2 Alternating projected gradient algorithm (APG)
Let $W$ be fixed; we update each entry of $H$ with its own stepsize:
$$H_{i,j} = H_{i,j} - \varepsilon^H_{i,j}\left( \frac{\partial f(W,H)}{\partial H} \right)_{i,j}, \quad \varepsilon^H_{i,j} \ge 0,$$
where
$$\varepsilon^H_{i,j} = \begin{cases} \varepsilon^*_H = \dfrac{\|W^TWH - W^TV\|_F^2}{\|W(W^TWH - W^TV)\|_F^2}, & \text{if } \left( \dfrac{\partial f(W,H)}{\partial H} \right)_{i,j} \le 0, \\[2ex] \min\left( \varepsilon^*_H,\ H_{i,j} \Big/ \left( \dfrac{\partial f(W,H)}{\partial H} \right)_{i,j} \right), & \text{otherwise.} \end{cases}$$

Again, assuming that $H$ is fixed, we update each entry of the matrix $W$ by
$$W_{i,j} = W_{i,j} - \varepsilon^W_{i,j}\left( \frac{\partial f(W,H)}{\partial W} \right)_{i,j}, \quad \varepsilon^W_{i,j} \ge 0,$$
where
$$\varepsilon^W_{i,j} = \begin{cases} \varepsilon^*_W = \dfrac{\|WHH^T - VH^T\|_F^2}{\|(WHH^T - VH^T)H\|_F^2}, & \text{if } \left( \dfrac{\partial f(W,H)}{\partial W} \right)_{i,j} \le 0, \\[2ex] \min\left( \varepsilon^*_W,\ W_{i,j} \Big/ \left( \dfrac{\partial f(W,H)}{\partial W} \right)_{i,j} \right), & \text{otherwise.} \end{cases}$$

In fact, this improvement is equivalent to first updating $H$ (or $W$) with $\varepsilon^*_H$ (or $\varepsilon^*_W$) in the gradient direction of $f(W, H)$ and then setting all the negative components of the updated $H$ (or $W$) to zero. We now present the theory and the algorithm for this idea.
The idea of the alternating projected gradient (APG) algorithm is based on the projected gradient method. Recall that the projected gradient method is designed for solving a constrained optimization problem
$$\min_{x \in C} f(x),$$
where $C$ is a closed and convex set and $f$ is a convex function over $C$. Starting from an initial point $x_0 \in C$, at each step we compute
$$x_{k+1} = P_C\big(x_k - t_k\nabla f(x_k)\big),$$
where $P_C: \mathbb{R}^n \to C$ is the orthogonal projection operator, which is itself defined by an optimization problem:
$$P_C(x) = \operatorname{argmin}\{\|y - x\|^2 : y \in C\}. \tag{13}$$
The orthogonal projection operator with input $x$ returns the vector in $C$ that is closest to $x$. The optimization problem in (13) has a unique optimal solution [1, p. 157] if $C$ is closed, convex, and nonempty. In the next example, we present the formula for the projection of a matrix in $\mathbb{R}^{m \times n}$ onto the closed convex set $C = \mathbb{R}^{m \times n}_+$.

Example 7 (Projection onto the nonnegative matrices)
Let $C = \mathbb{R}^{m \times n}_+$. To compute the orthogonal projection of $X \in \mathbb{R}^{m \times n}$ onto $\mathbb{R}^{m \times n}_+$, we need to solve the convex optimization problem
$$\min \sum_{k=1}^n \sum_{i=1}^m (Y_{ik} - X_{ik})^2 \quad \text{s.t.} \quad Y_{ik} \ge 0. \tag{14}$$
Since this problem is separable, the $i$th-row, $k$th-column entry of the optimal solution $Y^*$ of problem (14) is the optimal solution of the univariate problem
$$\min\{(Y_{ik} - X_{ik})^2 : Y_{ik} \ge 0\}.$$
Clearly, the unique solution of the above problem is given by $Y^*_{ik} = [X_{ik}]_+$, where for a real number $\alpha \in \mathbb{R}$, $[\alpha]_+$ is the nonnegative part of $\alpha$:
$$[\alpha]_+ = \begin{cases} \alpha, & \alpha \ge 0, \\ 0, & \alpha < 0. \end{cases}$$
We extend the definition of the nonnegative part to matrices: the nonnegative part of a matrix $V \in \mathbb{R}^{m \times n}$ is defined by
$$[V]_+ = \begin{bmatrix} [V_{11}]_+ & \cdots & [V_{1n}]_+ \\ \vdots & \ddots & \vdots \\ [V_{m1}]_+ & \cdots & [V_{mn}]_+ \end{bmatrix}.$$
To summarize, the orthogonal projection operator onto $\mathbb{R}^{m \times n}_+$ is given by $P_{\mathbb{R}^{m \times n}_+}(V) = [V]_+$.
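In R, this projection is a one-line sketch (the function name is ours):

```r
# Projection onto the nonnegative matrices, P(X) = [X]_+ :
# keep the nonnegative part of every entry.
proj_nonneg <- function(X) pmax(X, 0)
```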

Now we apply this method to improve the AG algorithm:
$$H_{i,j} = P_C\!\left( H_{i,j} - \varepsilon^*_H\left( \frac{\partial f(W,H)}{\partial H} \right)_{i,j} \right)$$
and
$$W_{i,j} = P_C\!\left( W_{i,j} - \varepsilon^*_W\left( \frac{\partial f(W,H)}{\partial W} \right)_{i,j} \right),$$
where
$$\varepsilon^*_H = \frac{\|W^TWH - W^TV\|_F^2}{\|W(W^TWH - W^TV)\|_F^2}, \qquad \varepsilon^*_W = \frac{\|WHH^T - VH^T\|_F^2}{\|(WHH^T - VH^T)H\|_F^2}, \tag{15}$$
and $P_C$ is the orthogonal projection onto the nonnegative matrices discussed in Example 7. Therefore, our improvement can be written as the following algorithm:

Algorithm 3 Alternating Projected Gradient Algorithm (Lin and Liu [11])
1. $W^0 = \mathrm{rand}(m, r)$; $H^0 = \mathrm{rand}(r, n)$; $(W^0, H^0 > 0)$
2. for $k = 0, 1, 2, \ldots$
$$W^{k+1}_{i,j} = W^k_{i,j} - \varepsilon^*_{W^k}\left( \frac{\partial f(W^k, H^k)}{\partial W^k} \right)_{i,j}$$
set all negative entries in $W^{k+1}$ to 0
$$H^{k+1}_{i,j} = H^k_{i,j} - \varepsilon^*_{H^k}\left( \frac{\partial f(W^{k+1}, H^k)}{\partial H^k} \right)_{i,j}$$
set all negative entries in $H^{k+1}$ to 0
end

Remark. Compared to the MU algorithm, no zero entries appear in the denominators of our algorithm, which implies that no breakdown occurs; and even if some zero entries appear in the numerator, the updates can still be improved in our algorithm.
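A minimal R sketch of Algorithm 3 is given below (the function name is ours); each sweep takes a full exact line-search step in $W$, projects by zeroing negative entries, and then does the same for $H$.

```r
# Sketch of the alternating projected gradient (APG) iteration of Algorithm 3.
nmf_apg <- function(V, r, n_iter = 100) {
  m <- nrow(V); n <- ncol(V)
  W <- matrix(runif(m * r), m, r)                     # W^0 > 0
  H <- matrix(runif(r * n), r, n)                     # H^0 > 0
  for (k in seq_len(n_iter)) {
    GW <- W %*% H %*% t(H) - V %*% t(H)               # gradient with respect to W
    W  <- pmax(W - (sum(GW^2) / sum((GW %*% H)^2)) * GW, 0)   # step, then project
    GH <- t(W) %*% W %*% H - t(W) %*% V               # gradient with respect to H (new W)
    H  <- pmax(H - (sum(GH^2) / sum((W %*% GH)^2)) * GH, 0)   # step, then project
  }
  list(W = W, H = H)
}
```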

4 Applications
Determining the structure of a data set and extracting meaningful information from it is a critical problem in data analysis. There are many powerful methods for this purpose, such as linear dimensionality reduction (LDR) techniques, which are equivalent to low-rank matrix approximations (LRMA). Examples of LDR techniques are principal component analysis (PCA), independent component analysis, sparse PCA, robust PCA, low-rank matrix completion, and sparse component analysis. Among LRMA methods, NMF requires the factors of the low-rank approximation to be componentwise nonnegative. This helps us interpret them in a meaningful way, for example when they correspond to nonnegative physical quantities. Another important motivation for NMF is to find the nonnegative rank of a nonnegative matrix [6].
In this section, we emphasize two important applications of NMF in data analysis: image processing and text mining. Other applications include air emission control [12], music genre classification [2], email surveillance [3], and so on. However, it is important to stress that NMF is not motivated only by its use as an LDR technique for data analysis [6, p. 12].

4.1 Image Processing - Facial Feature Extraction


A potential application of NMF is in face recognition. It has, for example, been observed that NMF generates sparser, more localized factors than PCA. In fact, if a new occluded face (e.g., with sunglasses) has to be mapped into the NMF basis, the non-occluded parts (e.g., the mustache or the lips) can still be well approximated [7].
Given a set of gray-scale face images of the same dimensions, let each column of the data matrix $X \in \mathbb{R}^{m \times n}_+$ be a vectorized gray-level image of a face. Vectorization means that each two-dimensional image is transformed into a long one-dimensional vector, for example by stacking the columns of the image on top of each other [6]. Then each row of $X$ corresponds to the same pixel location in the original images; that is, the $(i, j)$th entry of the matrix $X$ is the intensity of the $i$th pixel in the $j$th face.
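As a small sketch (with placeholder random "images"), building X in R amounts to column-stacking each image matrix:

```r
# Sketch: build the data matrix X whose j-th column is the vectorized j-th image.
images <- list(matrix(runif(19 * 19), 19, 19),   # placeholder 19 x 19 "images"
               matrix(runif(19 * 19), 19, 19))
X <- sapply(images, as.vector)   # as.vector stacks the columns of each image
dim(X)                           # (number of pixels) x (number of images)
```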
NMF generates two factors $(W, H)$ so that each image $X_{:,j}$ is approximated by a linear combination of the columns of $W$. Since $W$ is nonnegative, the columns of $W$ can be interpreted as images (that is, vectors of pixel intensities), which we refer to as the basis images. These basis images typically correspond to localized features that are shared among the input images, and the entries of $H$ indicate which input image contains which feature. This is illustrated by the following figure.

Figure 2: NMF applied to the CBCL face data set¹ with r = 49 (2429 images with 19 × 19 pixels each), computed by the MU algorithm [6].

On the left is a column of X reshaped as an image. In the middle are the 49 columns of the basis W reshaped as images and displayed in a 7 × 7 grid, together with the corresponding column of H displayed in the same 7 × 7 grid, which shows which features are present in that particular face. On the right is the approximation (WH)_{:,j} of X_{:,j}, reshaped as an image.
The following example may be more descriptive.

Figure 3: Sample images of the swimmer data set [6]

¹ http://cbcl.mit.edu/software-datasets/FaceData2.html

Figure 4: NMF basis images for the swimmer data set; the sample images are reconstructed as linear combinations of these basis images with suitable coefficients [6].

4.2 Text Mining – Topic Recovery and Document Classification


Let each column of the nonnegative data matrix X correspond to a document and each row to a word. The (i, j)th entry of the matrix X could, for example, be equal to the number of times the ith word appears in the jth document, in which case each column of X is the vector of word counts of a document. In practice, more sophisticated constructions are used, e.g., the term frequency-inverse document frequency (tf-idf) technique. This is the so-called bag-of-words model: each document is associated with a set of words with different weights, while the ordering of the words in the documents is not taken into account; see [4]. Note that such a matrix X is in general rather sparse, as most documents use only a small subset of the dictionary.
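As a toy sketch (the documents are illustrative), a word-count matrix X of this kind can be built in R as follows:

```r
# Sketch: a bag-of-words count matrix X with X[i, j] = count of word i in document j.
docs  <- list(c("matrix", "factorization", "matrix"),
              c("topic", "model", "matrix"))
words <- sort(unique(unlist(docs)))
X <- sapply(docs, function(d) table(factor(d, levels = words)))
rownames(X) <- words
X   # rows = words, columns = documents; mostly zeros for larger corpora
```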

Figure 5: Illustrating NMF for text mining [6]

Figure 5 illustrates the extraction of topics and the classification of each document with respect to these topics. The columns of X are sets of words, with X_{i,j} being the number of times word i appears in document j. Because H is nonnegative, these sets are approximated as the union of a smaller number of sets defined by the columns of W, which correspond to topics.
Therefore, given a set of documents, NMF identifies topics and simultaneously classifies the
documents among these different topics.

5 Numerical experiments
In this section, we present some experiments using the MU algorithm and the APG algorithm, which are implemented in R 4.2.1¹. We use the AT&T face dataset², which consists of 10 black-and-white photos of each member of a group of 40 individuals, 400 images in total; each image is 92 × 112 pixels, with 256 grey levels per pixel.

Figure 6: Plotting some faces in the AT&T face dataset

AT&T face dataset

k      f_MU        f_APG       %       ρ_MU      ρ_APG
25     72970.28    46690.85    36.01   0.455     0.0291
50     57273.47    38820.98    32.21   0.357     0.0242
75     47363.50    34145.80    27.90   0.0295    0.0213
100    40982.22    31264.15    23.71   0.0255    0.0195

Table 1: The values of the cost function for MU and APG

In Table 1, we record the values of the cost function obtained from the MU and the APG algorithms after the same number of iterations $k$. The column denoted by % is the relative improvement (expressed as a percentage), defined by
$$\frac{f_{MU} - f_{APG}}{f_{MU}},$$
where $f_{MU} = f(W^k_{MU}, H^k_{MU})$, $f_{APG} = f(W^k_{APG}, H^k_{APG})$, and $W^k_{MU}, H^k_{MU}, W^k_{APG}, H^k_{APG}$ denote the corresponding results computed by the MU algorithm and the APG algorithm, respectively. The relative error for each algorithm is defined by
$$\rho = \frac{f(W^k, H^k)}{f(W^0, H^0)},$$
where $W^0$ and $H^0$ are the initial matrices. Table 1 shows that the APG algorithm has an apparent advantage at an early stage compared to the MU algorithm.

¹ Detailed code: https://rpubs.com/trandattin/954115
² https://www.kaggle.com/datasets/kasikrit/att-database-of-faces
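For concreteness, the cost and the reported "%" quantity can be computed as in this small R sketch, using the k = 25 values from Table 1:

```r
# Sketch: the cost function and the "%" column of Table 1.
cost <- function(V, W, H) 0.5 * sum((V - W %*% H)^2)   # f_V(W,H) = 0.5*||V - WH||_F^2
f_mu  <- 72970.28                                      # MU cost at k = 25 (Table 1)
f_apg <- 46690.85                                      # APG cost at k = 25 (Table 1)
(f_mu - f_apg) / f_mu                                  # relative improvement, about 0.36
```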

Figure 7: Comparison of the convergence rates of the algorithms on the AT&T face dataset

6 Conclusion
The results presented in this report are a first step toward studying the MU, AG, and APG algorithms for NMF, the theories used to analyze their convergence, and the comparison of their performance in R. The report also presents some applications of NMF. One question posed by the author during this research is the following:

1. Compare the performance of the APG algorithm when using a constant stepsize versus the stepsize given by exact line search.

Due to limitations of time and ability, the author has not been able to prove the convergence of the APG algorithm. Hopefully, these results can be developed and expanded in the future.

References
[1] A. Beck. Introduction to nonlinear optimization: Theory, algorithms, and applications with
MATLAB. SIAM, 2014.

[2] E. Benetos and C. Kotropoulos. Non-negative tensor factorization applied to music genre
classification. IEEE Transactions on Audio, Speech, and Language Processing, 18(8):1955–
1967, 2010.

[3] M. W. Berry, M. Browne, A. N. Langville, V. P. Pauca, and R. J. Plemmons. Algorithms


and applications for approximate nonnegative matrix factorization. Computational statistics
& data analysis, 52(1):155–173, 2007.

[4] D. M. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.

[5] C. Févotte, N. Bertin, and J.-L. Durrieu. Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis. Neural Computation, 21(3):793–830, 2009.

[6] N. Gillis. Nonnegative Matrix Factorization. Society for Industrial and Applied Mathematics,
Philadelphia, PA, 2020.

[7] D. Guillamet and J. Vitria. Non-negative matrix factorization for face recognition. In Catalo-
nian Conference on Artificial Intelligence, pages 336–344. Springer, 2002.

[8] N. D. Ho. Nonnegative matrix factorization algorithms and applications. PhD thesis, Université Catholique de Louvain, 2008.

[9] Q. Ke and T. Kanade. Robust ℓ1 norm factorization in the presence of outliers and missing data by alternative convex programming. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, pages 739–746. IEEE, 2005.

[10] D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. Advances in neural
information processing systems, 13, 2000.

[11] L. Lin and Z.-Y. Liu. An alternating projected gradient algorithm for nonnegative matrix
factorization. Applied mathematics and computation, 217(24):9997–10002, 2011.

[12] P. Paatero and U. Tapper. Positive matrix factorization: A non-negative factor model with
optimal utilization of error estimates of data values. Environmetrics, 5(2):111–126, 1994.
