Nonnegative Matrix Factorization
Author: Tran Dat Tin∗
Supervisor: Dr. Le Hai Yen†
Abstract. Because of the extensive applications of nonnegative matrix factorization (NMF), for example in image processing, text mining, spectral data analysis, and speech processing, algorithms for NMF have been studied for years. This report presents the multiplicative updating (MU) [10], alternating gradient (AG), and alternating projected gradient (APG) algorithms [11] for solving the NMF problem. We can therefore take advantage of well-developed theories on majorize-minimization and gradient descent methods to analyze the convergence of these algorithms. Finally, numerical experiments on the AT&T face database show that the APG algorithm performs better than the MU algorithm.
1 Introduction
Let V be an m × n nonnegative matrix (that is, v_{ij} ≥ 0 for all i = 1, 2, . . . , m and j = 1, 2, . . . , n), denoted by V ≥ 0 or V ∈ R^{m×n}_+, and let r ≤ min(m, n) be a fixed integer. The so-called nonnegative matrix factorization (NMF) approximates V by the product of two nonnegative matrices W ∈ R^{m×r}_+ and H ∈ R^{r×n}_+; more precisely, it leads to the approximation
$$V \approx WH.$$
To find an approximate factorization V ≈ WH, we first need to define a cost function that quantifies the quality of the approximation. Such a cost function can be constructed using some measure of distance between the nonnegative matrix V and the product WH. The NMF problem was first stated in the paper by Paatero and Tapper in 1994 [12].
∗ Dalat University, Vietnam ([email protected])
† Institute of Mathematics, Vietnam Academy of Science and Technology ([email protected])
There are many ways to define the cost function for different purposes. Lee and Seung [10] defined cost functions based on the Frobenius norm and the Kullback-Leibler divergence. Févotte, Bertin, and Durrieu [5] defined a cost function using the Itakura-Saito distance. Ke and Kanade [9] used the ℓ1 norm to define another cost function. In this report, we define the cost function based on the Frobenius norm as in [10]. Let us state the main problem:
$$\min_{W \in \mathbb{R}^{m\times r}_+,\; H \in \mathbb{R}^{r\times n}_+} f_V(W, H) := \frac{1}{2}\|V - WH\|_F^2 \tag{1}$$
Remark. For a fixed matrix H ∈ R^{r×n}, the function W ↦ f_V(W, H) is convex, because for all W_1, W_2 ∈ R^{m×r}_+ and λ ∈ [0, 1]:
$$
\begin{aligned}
f_V(\lambda W_1 + (1-\lambda)W_2, H) &= \frac{1}{2}\big\|\lambda(V - W_1H) + (1-\lambda)(V - W_2H)\big\|_F^2 \\
&\le \frac{\lambda}{2}\|V - W_1H\|_F^2 + \frac{1-\lambda}{2}\|V - W_2H\|_F^2 \qquad (\|\cdot\|_F^2 \text{ is a convex function}) \\
&= \lambda f_V(W_1, H) + (1-\lambda) f_V(W_2, H).
\end{aligned}
$$
This implies that W ↦ f_V(W, H) is convex. Similarly, it is easy to show that H ↦ f_V(W, H) is convex for fixed W. However, the objective function f_V in (1) is not convex in both variables W and H jointly. To see this, consider the case m = n = r = 1:
$$f_v(w, h) := \|v - wh\|^2 = v^2 - 2vwh + w^2h^2,$$
which implies
$$\nabla^2 f_v(w, h) = \begin{bmatrix} 2h^2 & 4wh - 2v \\ 4wh - 2v & 2w^2 \end{bmatrix}.$$
For all v > 0 and h > 0 we have 2h^2 > 0 and det(∇²f_v(w, h)) = −12w²h² + 16vwh − 4v², and there exists w > 0 such that det(∇²f_v(w, h)) < 0, so the Hessian is indefinite at such a point and f_v is not convex.
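As a quick numerical check, a short R snippet (with the arbitrary choice v = h = 1 and w = 2) evaluates this Hessian and its determinant:

```r
# Hessian of f_v(w, h) = (v - w*h)^2 at v = h = 1, w = 2 (an arbitrary test point).
v <- 1; h <- 1; w <- 2
hess <- matrix(c(2 * h^2,           4 * w * h - 2 * v,
                 4 * w * h - 2 * v, 2 * w^2), nrow = 2, byrow = TRUE)
det(hess)   # -20 < 0: the Hessian is indefinite, so f_v is not jointly convex
```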
Algorithm 1 Multiplicative Updating Algorithm (Lee and Seung [10])
1. W^0 = rand(m, r); H^0 = rand(r, n); (W^0, H^0 > 0)
2. for k = 0, 1, 2, . . .
$$W^{k+1} = W^k \circ \frac{V(H^k)^T}{W^kH^k(H^k)^T}$$
$$H^{k+1} = H^k \circ \frac{(W^{k+1})^TV}{(W^{k+1})^TW^{k+1}H^k}$$
end
Here ∘ denotes the componentwise (Hadamard) product and the fractions denote componentwise division.
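A minimal R sketch of these multiplicative updates (the function name and the small constant eps, added to the denominators to guard against division by zero, are our own choices and are not part of the algorithm as stated):

```r
# Minimal sketch of Algorithm 1 (MU); '*' and '/' act componentwise on matrices in R.
nmf_mu <- function(V, r, iters = 100, eps = 1e-12) {
  m <- nrow(V); n <- ncol(V)
  W <- matrix(runif(m * r), m, r)   # W^0 > 0
  H <- matrix(runif(r * n), r, n)   # H^0 > 0
  for (k in seq_len(iters)) {
    W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + eps)
    H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + eps)
  }
  list(W = W, H = H, loss = 0.5 * norm(V - W %*% H, "F")^2)
}
```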
In the next theorem, we show that the objective function in (1) is nonincreasing under the update rules of the MU algorithm.
Theorem 1
The objective function in (1) is nonincreasing under the update rules of Algorithm 1.
The proof uses the notion of an auxiliary function [10]: F̄(v, v̄) is an auxiliary function for F(v) if F̄(v, v̄) ≥ F(v) for all v and F̄(v, v) = F(v). The usefulness of the auxiliary function comes from the following lemma, which is illustrated in Figure 1.
Figure 1: Property of the auxiliary function; in this figure, F̄(v) := F̄(v, v̄) [8]
Lemma 3
If F̄ is an auxiliary function of F, then F is nonincreasing under the update
$$v^{(i+1)} = \operatorname*{argmin}_{v} \bar{F}(v, v^{(i)}). \tag{2}$$
Remark. F(v^{(i+1)}) = F(v^{(i)}) if and only if v^{(i)} is a local minimum of F̄(v, v^{(i)}), and F(v^{(i)}) = F̄(v^{(i)}, v^{(i)}). If the derivative of F exists and is continuous in a small neighborhood of v^{(i)}, this also implies that ∇F(v^{(i)}) = 0. If F is a convex function, then the sequence generated by the update (2) converges to a minimum v_min = argmin_{v>0} F(v) of the objective function.
We will show that by defining appropriate auxiliary functions F̄(v, v^{(i)}) for the objective function (1), the update rules in Theorem 1 easily follow from equation (2). First, we need the following lemma.
Lemma 4
Given W ∈ R^{m×r} and v ∈ R^m, let the function f : R^r → R be defined by f(h) = ||Wh − v||². Then ∇_h f(h) = 2W^TWh − 2W^Tv and ∇²_h f(h) = 2W^TW.
Proof. By definition:
$$\nabla_h \|Wh - v\|^2 := \left(\frac{\partial}{\partial h_1}\|Wh - v\|^2, \cdots, \frac{\partial}{\partial h_r}\|Wh - v\|^2\right)^T \tag{3}$$
and
$$
\begin{aligned}
\|Wh - v\|^2 &= \sum_{i=1}^m \big((Wh)_i - v_i\big)^2 = \sum_{i=1}^m \Big(\sum_{k=1}^r w_{ik}h_k - v_i\Big)^2 \\
&= \sum_{i=1}^m \left[\Big(\sum_{k=1}^r w_{ik}h_k\Big)^2 - 2v_i\sum_{k=1}^r w_{ik}h_k + v_i^2\right] \\
&= \sum_{i=1}^m \Big(\sum_{k=1}^r w_{ik}h_k\Big)^2 - 2\sum_{i=1}^m v_i\sum_{k=1}^r w_{ik}h_k + \sum_{i=1}^m v_i^2.
\end{aligned}
$$
For any h_j we have:
$$
\begin{aligned}
\frac{\partial}{\partial h_j}\|Wh - v\|^2 &= \sum_{i=1}^m \frac{\partial}{\partial h_j}\Big(\sum_{k=1}^r w_{ik}h_k\Big)^2 - 2\sum_{i=1}^m v_i\frac{\partial}{\partial h_j}\sum_{k=1}^r w_{ik}h_k + \frac{\partial}{\partial h_j}\sum_{i=1}^m v_i^2 \\
&= \sum_{i=1}^m \frac{\partial}{\partial h_j}\Big(\sum_{k=1}^r w_{ik}h_k\Big)^2 - 2\sum_{i=1}^m v_i w_{ij} \\
&= 2\sum_{i=1}^m w_{ij}\sum_{k=1}^r w_{ik}h_k - 2\sum_{i=1}^m v_i w_{ij} \\
&= 2\sum_{i=1}^m w_{ij}\Big(\sum_{k=1}^r w_{ik}h_k - v_i\Big) = 2\sum_{i=1}^m w_{ij}\big((Wh)_i - v_i\big) = 2\big(W^T(Wh - v)\big)_j.
\end{aligned}
$$
From (3),
$$\nabla_h f(h) = 2W^TWh - 2W^Tv.$$
It is clear that
$$\nabla^2_h f(h) = 2W^TW,$$
which completes the proof.
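A quick finite-difference check of the gradient formula in Lemma 4 (random W, v, and h; the sizes are arbitrary):

```r
# Compare the exact gradient 2 W^T W h - 2 W^T v with central finite differences.
set.seed(1)
m <- 5; r <- 3
W <- matrix(runif(m * r), m, r); v <- runif(m); h <- runif(r)
f <- function(h) sum((W %*% h - v)^2)
grad_exact <- 2 * t(W) %*% W %*% h - 2 * t(W) %*% v
grad_num <- sapply(seq_len(r), function(j) {
  e <- rep(0, r); e[j] <- 1e-6
  (f(h + e) - f(h - e)) / 2e-6
})
max(abs(grad_exact - grad_num))   # of the order of 1e-8 or smaller
```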
We denote by Diag(v) the square diagonal matrix with the entries of the vector v on the main diagonal, and a vector v > 0 means v_i > 0 for all i. The following theorem gives us an auxiliary function for the target function.
Theorem 5
Let
$$\theta(h^t) = \mathrm{Diag}\!\left(\frac{W^TWh^t}{h^t}\right),$$
where the division of vectors is componentwise. Then
$$G(h, h^t) = f(h^t) + (h - h^t)^T\nabla f(h^t) + \frac{1}{2}(h - h^t)^T\theta(h^t)(h - h^t) \tag{4}$$
is an auxiliary function of
$$f(h) = \frac{1}{2}\sum_i \Big(v_i - \sum_k W_{ik}h_k\Big)^2 \qquad \text{for } h > 0.$$
Proof. It is clear that G(h, h) = f(h); we need to show that G(h, h^t) ≥ f(h). By the Taylor expansion of f(h) and Lemma 4 we have
$$f(h) = f(h^t) + (h - h^t)^T\nabla f(h^t) + \frac{1}{2}(h - h^t)^T(W^TW)(h - h^t).$$
From (4), G(h, h^t) ≥ f(h) if and only if
$$(h - h^t)^T\big(\theta(h^t) - W^TW\big)(h - h^t) \ge 0,$$
so it suffices to show that θ(h^t) − W^TW is positive semidefinite. For any vector v ∈ R^r,
$$
\begin{aligned}
v^T\big(\theta(h^t) - W^TW\big)v &= v^T\theta(h^t)v - v^TW^TWv \\
&= \frac{(W^TWh^t)_1}{h^t_1}v_1^2 + \frac{(W^TWh^t)_2}{h^t_2}v_2^2 + \cdots + \frac{(W^TWh^t)_r}{h^t_r}v_r^2 - v^TW^TWv \\
&= \sum_{i=1}^r\sum_{k=1}^r \frac{(W^TW)_{ik}h^t_k}{h^t_i}v_i^2 - \sum_{i=1}^r\sum_{k=1}^r v_i(W^TW)_{ik}v_k \\
&= \sum_{i=1}^r\sum_{k=1}^r (W^TW)_{ik}\left(\frac{h^t_k}{h^t_i}v_i^2 - v_iv_k\right) \\
&= \frac{1}{2}\sum_{i=1}^r\sum_{k=1}^r (W^TW)_{ik}\left(\frac{h^t_k}{h^t_i}v_i^2 - v_iv_k + \frac{h^t_i}{h^t_k}v_k^2 - v_iv_k\right) \\
&= \frac{1}{2}\sum_{i=1}^r\sum_{k=1}^r (W^TW)_{ik}\left(\sqrt{\frac{h^t_k}{h^t_i}}\,v_i - \sqrt{\frac{h^t_i}{h^t_k}}\,v_k\right)^2 \ge 0 \qquad \text{for } h^t > 0.
\end{aligned}
$$
Hence G(h, h^t) ≥ f(h), and G is an auxiliary function of f.
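This positive semidefiniteness can also be checked numerically on random data (a sketch; sizes are arbitrary):

```r
# Smallest eigenvalue of theta(h^t) - W^T W for random W and h^t > 0; it should be >= 0
# up to rounding, in line with the inequality proved above.
set.seed(5)
m <- 7; r <- 4
W <- matrix(runif(m * r), m, r); ht <- runif(r) + 0.1
theta <- diag(as.vector(t(W) %*% W %*% ht) / ht)
min(eigen(theta - t(W) %*% W, symmetric = TRUE)$values)
```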
Proof (Theorem 1). Let v and h be corresponding columns of V and H, respectively, and let
$$F(h) = \frac{1}{2}\|v - Wh\|^2. \tag{5}$$
Using the Taylor expansion of F(h), we have
$$F(h) = F(h^t) + (h - h^t)^T\nabla F(h^t) + \frac{1}{2}(h - h^t)^T\nabla^2 F(h^t)(h - h^t).$$
From Lemma 4,
$$\nabla F(h^t) = W^TWh^t - W^Tv \quad\text{and}\quad \nabla^2 F(h^t) = W^TW, \tag{6}$$
so, by the preceding theorem, G(h, h^t) defined in (4) is an auxiliary function of F.
Let h^{t+1} = argmin_{h>0} G(h, h^t). We can find h^{t+1} by solving the equation
$$\frac{\partial G(h^{t+1}, h^t)}{\partial h^{t+1}} = \nabla F(h^t) + \theta(h^t)(h^{t+1} - h^t) = 0.$$
Thus
$$h^{t+1} = h^t - \big(\theta(h^t)\big)^{-1}\nabla F(h^t). \tag{7}$$
Since G(h, h^t) is an auxiliary function, F is nonincreasing under this update, according to Lemma 3. By rewriting the equation, we obtain
$$h^{t+1} = h^t - \mathrm{Diag}\!\left(\frac{h^t}{W^TWh^t}\right)\big(W^TWh^t - W^Tv\big) = h^t - h^t + h^t \circ \frac{W^Tv}{W^TWh^t} = h^t \circ \frac{W^Tv}{W^TWh^t},$$
where ∘ and the fraction act componentwise.
Since this holds for every pair of corresponding columns v and h, it holds for the matrices V and H, which gives the update rule for H in Algorithm 1. By reversing the roles of W and H in equation (5), one can similarly show that the objective is nonincreasing under the update rule for W.
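The closed form just derived can be checked numerically against the Diag form in (7) on random data (a sketch; sizes are arbitrary):

```r
# h^{t+1} from (7) versus the componentwise closed form h^t * (W^T v) / (W^T W h^t).
set.seed(2)
m <- 6; r <- 4
W <- matrix(runif(m * r), m, r); v <- runif(m); ht <- runif(r)
g  <- t(W) %*% W %*% ht - t(W) %*% v                        # gradient of F at h^t
h1 <- ht - diag(as.vector(ht / (t(W) %*% W %*% ht))) %*% g  # update (7)
h2 <- ht * as.vector(t(W) %*% v) / as.vector(t(W) %*% W %*% ht)
max(abs(h1 - h2))   # ~ 0 up to rounding
```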
The AG and APG algorithms are based on the gradient descent method. Recall that, for the unconstrained problem
$$\min_{x \in \mathbb{R}^n} f(x),$$
the gradient descent method generates the iterates
$$x^{k+1} = x^k - t_k\nabla f(x^k),$$
where t_k is called the stepsize and can be chosen as t_k = argmin_{t≥0} f(x^k − t∇f(x^k)). This rule for choosing the stepsize is called exact line search.
Remark 1. If f is a convex function and the stepsize is chosen in (0, t_k], where t_k = argmin_{t≥0} f(x^k − t∇f(x^k)), then the update rule still makes f(x) nonincreasing. Indeed, the function φ(t) := f(x^k − t∇f(x^k)) is convex in t, satisfies φ(0) = f(x^k), and attains its minimum over t ≥ 0 at t_k; a one-dimensional convex function is nonincreasing on the interval between any point and its minimizer, so φ(t'_k) ≤ φ(0) = f(x^k) for every t'_k ∈ (0, t_k].
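A small R illustration of gradient descent with a (numerically) exact line search on a simple convex quadratic (the quadratic itself is an arbitrary example, not taken from the report):

```r
# Gradient descent with exact line search on f(x) = 0.5 * x' A x - b' x.
A <- matrix(c(3, 1, 1, 2), 2, 2); b <- c(1, 1)
f <- function(x) 0.5 * sum(x * (A %*% x)) - sum(b * x)
x <- c(5, -5)
for (k in 1:20) {
  g  <- as.vector(A %*% x - b)                                 # gradient at x^k
  tk <- optimize(function(t) f(x - t * g), c(0, 10))$minimum   # exact line search
  x  <- x - tk * g
}
f(x) - f(solve(A, b))   # close to 0: the iterates approach the minimizer A^{-1} b
```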
We now apply the idea of this method to the NMF problem. First, we state the following lemma.
Lemma 6
Given the function f(W, H) = ½||V − WH||²_F, with W and H fixed, let
$$\varepsilon^*_H = \operatorname*{argmin}_{\varepsilon \ge 0} f\!\left(W,\, H - \varepsilon\frac{\partial f(W,H)}{\partial H}\right) \quad\text{and}\quad \varepsilon^*_W = \operatorname*{argmin}_{\varepsilon \ge 0} f\!\left(W - \varepsilon\frac{\partial f(W,H)}{\partial W},\, H\right).$$
Then
$$\varepsilon^*_H = \frac{\|W^TWH - W^TV\|_F^2}{\|W(W^TWH - W^TV)\|_F^2} \quad\text{and}\quad \varepsilon^*_W = \frac{\|WHH^T - VH^T\|_F^2}{\|(WHH^T - VH^T)H\|_F^2}.$$
Proof. By Lemma 4, applied columnwise to ½||V − WH||²_F,
$$\frac{\partial f(W, H)}{\partial H} = W^TWH - W^TV.$$
This implies that
$$\varepsilon^*_H = \operatorname*{argmin}_{\varepsilon \ge 0} f\!\left(W,\, H - \varepsilon\frac{\partial f(W,H)}{\partial H}\right) = \operatorname*{argmin}_{\varepsilon \ge 0} f\!\left(W,\, H - \varepsilon(W^TWH - W^TV)\right). \tag{8}$$
Write K := W^TWH − W^TV and g(ε) := f(W, H − εK) = ½||(V − WH) + εWK||²_F. Then
$$
\begin{aligned}
g'(\varepsilon) = 0 \;\Longleftrightarrow\; \varepsilon &= \frac{\langle WH - V,\, WK\rangle}{\|WK\|_F^2} \\
&= \frac{\mathrm{Tr}\big((WK)^T(WH - V)\big)}{\|WK\|_F^2} \qquad \big(\langle A, B\rangle = \mathrm{Tr}(B^TA)\big) \\
&= \frac{\mathrm{Tr}\big(K^TW^T(WH - V)\big)}{\|W(W^TWH - W^TV)\|_F^2} \\
&= \frac{\langle W^TWH - W^TV,\, W^TWH - W^TV\rangle}{\|W(W^TWH - W^TV)\|_F^2} \\
&= \frac{\|W^TWH - W^TV\|_F^2}{\|W(W^TWH - W^TV)\|_F^2}.
\end{aligned}
$$
Since g is convex and this stationary point is nonnegative, it is the minimizer in (8), which gives the formula for ε*_H. The formula for ε*_W follows in the same way by exchanging the roles of W and H.
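A quick numerical check of this closed-form stepsize against a direct one-dimensional minimization (random data; sizes are arbitrary):

```r
# Closed-form eps*_H from Lemma 6 versus numerical minimization of g(eps).
set.seed(3)
m <- 8; n <- 6; r <- 3
V <- matrix(runif(m * n), m, n)
W <- matrix(runif(m * r), m, r); H <- matrix(runif(r * n), r, n)
K <- t(W) %*% W %*% H - t(W) %*% V                  # gradient of f with respect to H
eps_H <- norm(K, "F")^2 / norm(W %*% K, "F")^2      # closed form
g <- function(eps) 0.5 * norm(V - W %*% (H - eps * K), "F")^2
c(eps_H, optimize(g, c(0, 10))$minimum)             # the two values agree closely
```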
Note. The functions in (10) are convex, and the constraint sets are R^{r×n}_+ and R^{m×r}_+.
Let W be fixed, and update H in the direction of −∂f(W, H)/∂H with a stepsize ε_H. By Lemma 6, the exact-line-search stepsize is
$$\varepsilon^*_H = \frac{\|W^TWH - W^TV\|_F^2}{\|W(W^TWH - W^TV)\|_F^2}.$$
Choose
$$\varepsilon_H = \min\!\left(\varepsilon^*_H,\ \min_{\left(\frac{\partial f(W,H)}{\partial H}\right)_{i,j} > 0} H_{i,j}\Big/\Big(\frac{\partial f(W,H)}{\partial H}\Big)_{i,j}\right). \tag{11}$$
Similarly, with H fixed,
$$\varepsilon^*_W = \frac{\|WHH^T - VH^T\|_F^2}{\|(WHH^T - VH^T)H\|_F^2},$$
and then
$$\varepsilon_W = \min\!\left(\varepsilon^*_W,\ \min_{\left(\frac{\partial f(W,H)}{\partial W}\right)_{i,j} > 0} W_{i,j}\Big/\Big(\frac{\partial f(W,H)}{\partial W}\Big)_{i,j}\right). \tag{12}$$
Remark. The stepsizes chosen in (11) and (12) guarantee that the objective functions in (10) are nonincreasing, by Remark 1.
Remark. The AG algorithm updates all entries with the same stepsize. To keep the updates inside the constraint set, this stepsize frequently has to be taken smaller than the best (exact-line-search) stepsize. We will improve the AG algorithm with the idea of updating each entry with its own stepsize.
3.2 Alternating projected gradient algorithm (APG)
Let W be fixed. We update each entry of H with its own stepsize:
$$H_{i,j} = H_{i,j} - \varepsilon^H_{i,j}\left(\frac{\partial f(W,H)}{\partial H}\right)_{i,j}, \qquad \varepsilon^H_{i,j} \ge 0,$$
where
$$\varepsilon^H_{i,j} = \begin{cases} \varepsilon^*_H = \dfrac{\|W^TWH - W^TV\|_F^2}{\|W(W^TWH - W^TV)\|_F^2}, & \text{if } \Big(\dfrac{\partial f(W,H)}{\partial H}\Big)_{i,j} \le 0, \\[2ex] \min\!\left(\varepsilon^*_H,\ H_{i,j}\Big/\Big(\dfrac{\partial f(W,H)}{\partial H}\Big)_{i,j}\right), & \text{otherwise.} \end{cases}$$
Similarly, let H be fixed; we update each entry of W by
$$W_{i,j} = W_{i,j} - \varepsilon^W_{i,j}\left(\frac{\partial f(W,H)}{\partial W}\right)_{i,j}, \qquad \varepsilon^W_{i,j} \ge 0,$$
where
$$\varepsilon^W_{i,j} = \begin{cases} \varepsilon^*_W = \dfrac{\|WHH^T - VH^T\|_F^2}{\|(WHH^T - VH^T)H\|_F^2}, & \text{if } \Big(\dfrac{\partial f(W,H)}{\partial W}\Big)_{i,j} \le 0, \\[2ex] \min\!\left(\varepsilon^*_W,\ W_{i,j}\Big/\Big(\dfrac{\partial f(W,H)}{\partial W}\Big)_{i,j}\right), & \text{otherwise.} \end{cases}$$
In fact, this improvement is equivalent to first updating H (or W) with the stepsize ε*_H (or ε*_W) in the gradient direction of f(W, H) and then setting all negative components of the updated H (or W) to zero. We now present the theory and the algorithm for this idea.
The idea of the alternating projected gradient (APG) algorithm is based on the projected gradient method. Recall that the projected gradient method is designed for solving a constrained optimization problem
$$\min_{x \in C} f(x),$$
where C is a closed convex set and f is a convex function over C. Starting from an initial point x^0 ∈ C, at each step we compute
$$x^{k+1} = P_C\big(x^k - t_k\nabla f(x^k)\big), \qquad\text{where}\quad P_C(x) = \operatorname*{argmin}_{y \in C} \|y - x\|. \tag{13}$$
The orthogonal projection operator P_C with input x returns the vector in C that is closest to x. The optimization problem in (13) has a unique optimal solution [1, p. 157] if C is closed, convex, and nonempty. In the next example, we present the formula for the projection of a matrix in R^{m×n} onto the closed convex set C = R^{m×n}_+.
Example 7 (Projection onto the nonnegative matrices)
Let C = R^{m×n}_+. To compute the orthogonal projection of X ∈ R^{m×n} onto R^{m×n}_+, we need to solve the convex optimization problem
$$\min_{Y} \sum_{k=1}^n \sum_{i=1}^m (Y_{ik} - X_{ik})^2 \quad \text{s.t.}\quad Y_{ik} \ge 0. \tag{14}$$
Since this problem is separable, the entry in the ith row and kth column of the optimal solution Y* of problem (14) is the optimal solution of the univariate problem
$$\min_{Y_{ik} \ge 0}\; (Y_{ik} - X_{ik})^2.$$
Clearly, the unique solution of this problem is given by Y*_{ik} = [X_{ik}]_+, where for a real number α ∈ R, [α]_+ is the nonnegative part of α:
$$[\alpha]_+ = \begin{cases} \alpha, & \alpha \ge 0, \\ 0, & \alpha < 0. \end{cases}$$
We extend the definition of the nonnegative part to matrices: the nonnegative part of a matrix V ∈ R^{m×n} is defined by
$$[V]_+ = \begin{bmatrix} [V_{11}]_+ & \cdots & [V_{1n}]_+ \\ \vdots & \ddots & \vdots \\ [V_{m1}]_+ & \cdots & [V_{mn}]_+ \end{bmatrix}.$$
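In R, this projection is simply a componentwise clamp at zero, for example:

```r
# Orthogonal projection onto the nonnegative orthant: the componentwise nonnegative part.
X <- matrix(c(1.5, -0.3, 0, -2, 4, -1), nrow = 2)
pmax(X, 0)   # [X]_+ : negative entries are replaced by zero
```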
With this projection, the entrywise APG updates described above can be written as
$$H_{i,j} = P_C\!\left(H_{i,j} - \varepsilon^*_H\Big(\frac{\partial f(W,H)}{\partial H}\Big)_{i,j}\right) \quad\text{and}\quad W_{i,j} = P_C\!\left(W_{i,j} - \varepsilon^*_W\Big(\frac{\partial f(W,H)}{\partial W}\Big)_{i,j}\right),$$
where
$$\varepsilon^*_H = \frac{\|W^TWH - W^TV\|_F^2}{\|W(W^TWH - W^TV)\|_F^2}, \qquad \varepsilon^*_W = \frac{\|WHH^T - VH^T\|_F^2}{\|(WHH^T - VH^T)H\|_F^2}, \tag{15}$$
and P_C is the orthogonal projection onto the nonnegative matrices discussed in Example 7.
Therefore, our improvement can be written as the following algorithm:
Algorithm 3 Alternating Projected Gradient Algorithm (Lin and Liu [11])
1. W^0 = rand(m, r); H^0 = rand(r, n); (W^0, H^0 > 0)
2. for k = 0, 1, 2, . . .
$$W^{k+1}_{i,j} = W^k_{i,j} - \varepsilon^*_{W^k}\left(\frac{\partial f(W^k, H^k)}{\partial W^k}\right)_{i,j}$$
   set all negative entries in W^{k+1} to 0
$$H^{k+1}_{i,j} = H^k_{i,j} - \varepsilon^*_{H^k}\left(\frac{\partial f(W^{k+1}, H^k)}{\partial H^k}\right)_{i,j}$$
   set all negative entries in H^{k+1} to 0
end
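A minimal R sketch of Algorithm 3, using the exact-line-search stepsizes of (15) (the function name is our own; no safeguard is included for the degenerate case of a zero gradient):

```r
# Minimal sketch of Algorithm 3 (APG) with the stepsizes of (15).
nmf_apg <- function(V, r, iters = 100) {
  m <- nrow(V); n <- ncol(V)
  W <- matrix(runif(m * r), m, r); H <- matrix(runif(r * n), r, n)
  for (k in seq_len(iters)) {
    GW <- W %*% H %*% t(H) - V %*% t(H)              # gradient with respect to W
    eW <- norm(GW, "F")^2 / norm(GW %*% H, "F")^2    # eps*_W
    W  <- pmax(W - eW * GW, 0)                       # update W, then project onto >= 0
    GH <- t(W) %*% W %*% H - t(W) %*% V              # gradient with respect to H (new W)
    eH <- norm(GH, "F")^2 / norm(W %*% GH, "F")^2    # eps*_H
    H  <- pmax(H - eH * GH, 0)                       # update H, then project onto >= 0
  }
  list(W = W, H = H, loss = 0.5 * norm(V - W %*% H, "F")^2)
}
```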
Remark. Compared to the MU algorithm, no zero entries appear in the denominators in our algorithm, which implies that no breakdown occurs; and even if some entries become zero during the iterations, they can still be improved by subsequent updates.
4 Applications
Determining the structure of a data set and extracting meaningful information from it is a critical problem in data analysis. There are many powerful methods for this purpose, such as linear dimensionality reduction (LDR) techniques, which are equivalent to low-rank matrix approximations (LRMA).
Examples of LDR techniques are principal component analysis (PCA), independent component
analysis, sparse PCA, robust PCA, low-rank matrix completion, and sparse component analysis.
Among LRMA methods, NMF requires the factors of the low-rank approximation to be componentwise nonnegative. That helps us interpret them in a meaningful way, for example when they
correspond to non-negative physical quantities. Another important motivation for NMF is to find
the nonnegative rank of a nonnegative matrix [6].
In this section, we emphasize two important applications of NMF in data analysis: image processing and text mining. Other applications include air emission control [12], music genre classification [2], email surveillance [3], and so on. However, it is important to stress that NMF is not motivated only by its use as an LDR technique for data analysis [6, p. 12].
In the application to facial feature extraction, let each column of X ∈ R^{m×n}_+ be a vectorized gray-level image of a face. Vectorization means that the two-dimensional image is transformed into a long one-dimensional vector, for example, by stacking the columns of the image on top of each other [6]. Then each row of X corresponds to the same pixel location in the original images; that is, the (i, j)th entry of X is the intensity of the ith pixel in the jth face.
NMF generates two factors (W, H) so that each image X(:, j) is approximated using a linear
combination of the columns of W . Since W is nonnegative, the columns of W can be interpreted
as images (that is, vectors of pixel intensities) which we refer to as the basis images. The columns
of W are vectors of pixel intensities whose linear combinations allow us to approximate each input
image. Hence these basis images typically correspond to localized features that are shared among
the input images, and the entries of H indicate which input image contains which feature. This is
illustrated by the following figure
Figure 2: NMF applied to the CBCL face data set¹ with r = 49 (2429 images with 19 × 19 pixels each) by the MU algorithm [6].
On the left is a column of X reshaped as an image. In the middle are the 49 columns of the basis W reshaped as images and displayed in a 7 × 7 grid, together with the corresponding column of H displayed on the same 7 × 7 grid, which shows which features are present in that particular face. On the right is the reshaped approximation (WH):,j of X:,j as an image.
The following example may be more illustrative.
¹ http://cbcl.mit.edu/software-datasets/FaceData2.html
Figure 4: NMF basis images for the swimmer data set; sample images are reconstructed as linear combinations of these basis images with suitable coefficients [6]
Figure 5 illustrates the extraction of topics and the classification of each document with respect to these topics. The columns of X are sets of words, with X_{i,j} being the number of times word i appears in document j. Because H is nonnegative, these sets are approximated as unions of a smaller number of sets defined by the columns of W, which correspond to topics.
Therefore, given a set of documents, NMF identifies topics and simultaneously classifies the documents among these different topics.
5 Numerical experiments
In this section, we will present some experiments using the MU algorithm and the APG algorithm,
which are implemented in R 4.2.11 . We use the AT&T face dataset2 consists of 10 black and
white photos of each member of a group of 40 individuals, 400 images in total and each image is
92 × 112 pixels, with 256 grey levels per pixel.
In Table 1, we record the values of the cost function obtained from the MU and the APG algorithms after the same number of iterations. The column denoted by % is the relative improvement, defined by
$$\frac{f_{MU} - f_{APG}}{f_{MU}},$$
where $f_{MU} = f(W^k_{MU}, H^k_{MU})$, $f_{APG} = f(W^k_{APG}, H^k_{APG})$, and $W^k_{MU}, H^k_{MU}, W^k_{APG}, H^k_{APG}$ denote the corresponding results computed by the MU algorithm and the APG algorithm, respectively.
¹ Detailed code: https://rpubs.com/trandattin/954115
² https://www.kaggle.com/datasets/kasikrit/att-database-of-faces
The relative error for each algorithm is defined by
$$\rho = \frac{f(W^k, H^k)}{f(W^0, H^0)},$$
where W^0 and H^0 are the initial matrices. Table 1 shows that the APG algorithm has an apparent advantage over the MU algorithm at the early stage.
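For illustration only, the relative improvement reported in the % column can be computed from the hypothetical nmf_mu and nmf_apg sketches given earlier (on random data, not the AT&T experiment; a faithful comparison should start both algorithms from the same initial W^0 and H^0, which these simple sketches do not enforce):

```r
# The "%" column of Table 1 is (f_MU - f_APG) / f_MU after the same number of iterations.
set.seed(4)
V <- matrix(runif(100 * 80), 100, 80)
res_mu  <- nmf_mu(V, r = 10, iters = 50)     # hypothetical sketch defined earlier
res_apg <- nmf_apg(V, r = 10, iters = 50)    # hypothetical sketch defined earlier
(res_mu$loss - res_apg$loss) / res_mu$loss   # relative improvement of APG over MU
```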
Figure 7: Comparison of the convergence rates of the algorithms on the AT&T face dataset
6 Conclusion
The results presented in this report are a first step toward studying the MU, AG, and APG algorithms for NMF, the theory used to analyze their convergence, and a comparison of the performance of these algorithms in R. Finally, the author presents some applications of NMF. One question raised during this research is the following:
1. Compare the performance of the APG algorithm when using a constant stepsize versus the stepsize given by exact line search.
Due to the limitations of time and ability, the author has not been able to prove the convergence of the APG algorithm. Hopefully, these results can be developed and extended in the future.
References
[1] A. Beck. Introduction to nonlinear optimization: Theory, algorithms, and applications with
MATLAB. SIAM, 2014.
[2] E. Benetos and C. Kotropoulos. Non-negative tensor factorization applied to music genre
classification. IEEE Transactions on Audio, Speech, and Language Processing, 18(8):1955–
1967, 2010.
[4] D. M. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.
[5] C. Févotte, N. Bertin, and J.-L. Durrieu. Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis. Neural Computation, 21(3):793–830, 2009.
[6] N. Gillis. Nonnegative Matrix Factorization. Society for Industrial and Applied Mathematics,
Philadelphia, PA, 2020.
[7] D. Guillamet and J. Vitria. Non-negative matrix factorization for face recognition. In Catalonian Conference on Artificial Intelligence, pages 336–344. Springer, 2002.
[8] N. D. Ho. Nonnegative matrix factorization algorithms and applications. PhD thesis, Université Catholique de Louvain, 2008.
[9] Q. Ke and T. Kanade. Robust L1 norm factorization in the presence of outliers and
missing data by alternative convex programming. In 2005 IEEE Computer Society Conference
on Computer Vision and Pattern Recognition (CVPR’05), volume 1, pages 739–746. IEEE,
2005.
[10] D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. Advances in neural
information processing systems, 13, 2000.
[11] L. Lin and Z.-Y. Liu. An alternating projected gradient algorithm for nonnegative matrix
factorization. Applied mathematics and computation, 217(24):9997–10002, 2011.
[12] P. Paatero and U. Tapper. Positive matrix factorization: A non-negative factor model with
optimal utilization of error estimates of data values. Environmetrics, 5(2):111–126, 1994.