
Bindel, Summer 2018 Numerics for Data Science

2018-05-30

1 Non-negative Matrix Factorization (NMF)


In the last lecture, we considered low rank approximations to data matrices.
We started with the “optimal” rank k approximation to A ∈ Rm×n via the
SVD, then moved on to approximations that represent A in terms of the
rows and columns of A rather than in terms of the left and right singular
vectors. We argued that while these latter factorizations may not minimize
the Frobenius norm of the error for a given rank, they are easier to interpret
because they are expressed in terms of the factors in the original data set.
We continue with our theme of finding interpretable factorizations today by
looking at non-negative matrix factorizations (NMF).
Let R_+ denote the non-negative real numbers; for a non-negative data
matrix A ∈ R_+^{m×n}, we seek

    A ≈ W H,   where W ∈ R_+^{m×k} and H ∈ R_+^{k×n}.
Non-negative matrix factorizations are convenient because they express the
columns of A (the data) in terms of positively weighted sums of the columns
of W , which we interpret as “parts.” This type of decomposition into parts
makes sense in many different domains; for example:
Meaning of columns of A             Meaning of columns of W
Word distributions for documents    Word distributions for topics
Pictures of faces                   Pictures of facial features
Connections to friends              Communities
Spectra of chemical mixtures        Spectra of component molecules
Unfortunately, non-negative matrix factorizations are generally much more
difficult to compute than the factorizations we considered in the last lecture.
There are three fundamental difficulties:
• We do not know how big k must be to get a “good” representation.
Compare this to ordinary factorization, where we can hope for error
bounds in terms of σ_{k+1}, …, σ_{min(m,n)}.
• The optimization problem is non-convex, and there may generally be
many local minima. Again, compare this with the optimal approxima-
tion problem solved by singular value decomposition, which has saddle
points, but has no local minimizers that are not also global minimizers.

• NMF is not incremental: the best rank k approximation may have
little to do with the best rank k + 1 approximation. Again, we can
compare with the unconstrained problem, for which the best rank k + 1
approximation is a rank-one update to the best rank k approximation.

Faced with this hard optimization problem, we consider two tactics. First,
we might seek efficient optimization methods that at least converge to a
local minimizer¹; we will spend the first part of the lecture discussing this
approach. Second, we might seek common special cases where we can prove
something about the approximation. In particular, the NMF problem is much
more tractable when we make a separability assumption which is appropriate
in some applications.

2 Going with gradients


2.1 Projected gradient descent
We begin with the projected gradient descent algorithm for minimizing a func-
tion ϕ subject to simple constraints. Let P(x) be a projection function that
maps x to the nearest feasible point; in the case of a simple non-negativity
constraint, P(x) = [x]_+ is the elementwise maximum of x and zero. The
projected gradient descent iteration is then
    x_{k+1} = P(x_k − α_k ∇ϕ(x_k)).

The convergence properties of projected gradient descent are similar to those


of the unprojected version: we can show reliable convergence for convex
(or locally convex) functions and sufficiently short step sizes, though ill-
conditioning may make the convergence slow.
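As a concrete illustration (not part of the original notes), here is a minimal
NumPy sketch of projected gradient descent with a non-negativity projection;
the gradient function grad, starting point x0, and fixed step size alpha are
hypothetical inputs, and a practical code would choose the step size adaptively.

    import numpy as np

    def projected_gradient_descent(grad, x0, alpha, iters=100):
        # Minimal sketch: take a gradient step, then project onto the
        # non-negative orthant via P(x) = [x]_+ = max(x, 0) elementwise.
        x = x0.copy()
        for _ in range(iters):
            x = np.maximum(x - alpha * grad(x), 0.0)
        return x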
In order to write the gradient for the NMF objective without descending
into a morass of indices, it is helpful to introduce the Frobenius inner product:
for matrices X, Y ∈ R^{m×n},

    ⟨X, Y⟩_F = Σ_{i,j} x_{ij} y_{ij} = tr(Y^T X).

¹ In most cases, we can only show convergence to a stationary point, but we are likely
to converge to a minimizer for almost all initial guesses.

The Frobenius inner product is the inner product associated with the Frobenius
norm: ∥X∥_F^2 = ⟨X, X⟩_F, and we can apply the usual product rule for
differentiation to compute directional derivatives of ϕ(W, H) = ∥A − W H∥_F^2 / 2
with respect to W and H:

    δϕ = δ [ (1/2) ⟨A − W H, A − W H⟩_F ]
        = ⟨δ(A − W H), A − W H⟩_F
        = −⟨(δW) H, A − W H⟩_F − ⟨W (δH), A − W H⟩_F.

We let R = A − W H, and use the fact that the trace of a product of matrices
is invariant under cyclic permutations of the matrices:

    ⟨(δW) H, R⟩_F = tr(H^T (δW)^T R) = tr((δW)^T R H^T) = ⟨δW, R H^T⟩_F,
    ⟨W (δH), R⟩_F = tr((δH)^T W^T R) = ⟨δH, W^T R⟩_F.

Therefore, the projected gradient descent iteration for this problem is

    W_new = [W + α R H^T]_+
    H_new = [H + α W^T R]_+,

where in the interest of legibility we have suppressed the iteration index on
the right hand side.
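In NumPy, this iteration might look like the following sketch (not code from
the notes); the fixed step size alpha and the random initialization are
arbitrary, illustrative choices.

    import numpy as np

    def nmf_projected_gradient(A, k, alpha=1e-3, iters=500, seed=0):
        # Sketch of the projected gradient iteration
        #   W <- [W + alpha * R H^T]_+ ,  H <- [H + alpha * W^T R]_+ ,
        # with R = A - W H and a fixed (illustrative) step size alpha.
        rng = np.random.default_rng(seed)
        m, n = A.shape
        W = rng.random((m, k))
        H = rng.random((k, n))
        for _ in range(iters):
            R = A - W @ H
            W, H = (np.maximum(W + alpha * (R @ H.T), 0.0),
                    np.maximum(H + alpha * (W.T @ R), 0.0))
        return W, H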

2.2 Multiplicative updates


One of the earliest and most popular NMF solvers is the multiplicative update
scheme of Lee and Seung. This has the form of a scaled gradient descent
iteration where we replace the uniform step size αk with a different (non-
negative) step size for each entry of W and H:
    W_new = [W + S ⊙ (A H^T − W H H^T)]_+
    H_new = [H + S′ ⊙ (W^T A − W^T W H)]_+,

where ⊙ denotes elementwise multiplication. We similarly use ⊘ to denote
elementwise division to define the non-negative scaling matrices

    S = W ⊘ (W H H^T),   S′ = H ⊘ (W^T W H).

With these choices, two of the terms in the summation cancel, so that

    W_new = S ⊙ (A H^T) = W ⊘ (W H H^T) ⊙ (A H^T)
    H_new = S′ ⊙ (W^T A) = H ⊘ (W^T W H) ⊙ (W^T A).

At each step of the Lee and Seung scheme, we scale the (non-negative) ele-
ments of W and H by non-negative factors, yielding a non-negative result.
There is no need for a non-negative projection because the step sizes are
chosen increasingly conservatively as elements of W and H approach zero.
But because the steps are very conservative, the Lee and Seung algorithm
may require a large number of steps to converge.
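A NumPy sketch of these multiplicative updates (again an illustration, not code
from the notes) follows; the small constant eps is an assumption added to guard
against division by zero.

    import numpy as np

    def nmf_multiplicative(A, k, iters=500, eps=1e-12, seed=0):
        # Lee and Seung multiplicative updates:
        #   W <- W ⊙ (A H^T) ⊘ (W H H^T),  H <- H ⊙ (W^T A) ⊘ (W^T W H),
        # applied elementwise; eps avoids division by zero.
        rng = np.random.default_rng(seed)
        m, n = A.shape
        W = rng.random((m, k))
        H = rng.random((k, n))
        for _ in range(iters):
            W *= (A @ H.T) / (W @ H @ H.T + eps)
            H *= (W.T @ A) / (W.T @ W @ H + eps)
        return W, H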

3 Coordinate descent
The (block) coordinate descent method (also known as block relaxation or
nonlinear Gauss-Seidel) for solving

    minimize ϕ(x_1, x_2, …, x_p) for x_i ∈ Ω_i

involves repeatedly optimizing with respect to one coordinate at a time. In
the basic method, we iterate through each i and compute

    x_i^{k+1} = argmin_ξ ϕ(x_1^{k+1}, …, x_{i−1}^{k+1}, ξ, x_{i+1}^k, …, x_p^k).

The individual subproblems are often simpler than the full problem. If each
subproblem has a unique solution (e.g. if each subproblem is strongly convex),
the iteration converges to a stationary point²; this is the situation for all the
iterations we will discuss.

3.1 Simple coordinate descent


Perhaps the simplest coordinate descent algorithm for NMF sweeps through
all entries of W and H. Let R = A − W H; then for the (i, j) coordinate of
W, we compute the update w_{ij} = w_{ij} + s where s minimizes the quadratic

    (1/2) ∥A − (W + s e_i e_j^T) H∥_F^2
        = (1/2) ∥R∥_F^2 − s ⟨e_i e_j^T, R H^T⟩_F + (s²/2) ∥e_i e_j^T H∥_F^2,

² For non-convex problems, we may converge to a saddle; as an example, consider
simple coordinate descent for ϕ(x_1, x_2) = x_1² + 4 x_1 x_2 + x_2².

subject to the constraint that s ≥ −w_{ij}. The solution to this optimization is

    s = max( −w_{ij}, (R H^T)_{ij} / (H H^T)_{jj} ).

Therefore, the update for w_{ij} is

    s = max( −w_{ij}, (R H^T)_{ij} / (H H^T)_{jj} ),   w_{ij} := w_{ij} + s,   R_{i,:} := R_{i,:} − s H_{j,:}.

A similar computation for the elements of H gives us the update formulas

    s = max( −h_{ij}, (W^T R)_{ij} / (W^T W)_{ii} ),   h_{ij} := h_{ij} + s,   R_{:,j} := R_{:,j} − s W_{:,i}.
Superficially, this looks much like projected gradient descent with scaled step
lengths. However, where in gradient descent (or the multiplicative updates
of Lee and Seung) the updates for all entries of W and H are independent,
in this coordinate descent algorithm we only have independence of updates
for a single column of W or a single row of H. This is a disadvantage for
efficient implementation.
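For concreteness, here is a sketch (not from the notes) of one full sweep of
this scalar coordinate descent; it assumes the diagonals of H H^T and W^T W
are nonzero.

    import numpy as np

    def scalar_cd_sweep(A, W, H):
        # One sweep of scalar coordinate descent over all entries of W, then H,
        # maintaining the residual R = A - W H after every scalar update.
        R = A - W @ H
        m, k = W.shape
        n = H.shape[1]
        HHt = H @ H.T          # fixed while W is updated
        for i in range(m):
            for j in range(k):
                s = max(-W[i, j], (R[i, :] @ H[j, :]) / HHt[j, j])
                W[i, j] += s
                R[i, :] -= s * H[j, :]
        WtW = W.T @ W          # fixed while H is updated
        for i in range(k):
            for j in range(n):
                s = max(-H[i, j], (W[:, i] @ R[:, j]) / WtW[i, i])
                H[i, j] += s
                R[:, j] -= s * W[:, i]
        return W, H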

3.2 HALS/RRI
The simple algorithm in the previous section relaxed on each element of
W and H independently. In the hierarchical alternating least squares or rank-
one residual iteration, we treat the problem as consisting of 2k vector blocks,
one for each column of W and row of H. To update a column W_{:,j} := W_{:,j} + u,
we must solve the least squares problem

    minimize ∥R − u H_{j,:}∥_F^2   s.t. u ≥ −W_{:,j},

which is equivalent to solving the independent single-variable least squares
problems

    minimize ∥R_{i,:} − u_i H_{j,:}∥_2^2   s.t. u_i ≥ −w_{ij}.
Each u_i must satisfy the normal equations unless it hits the bound constraint;
thus,

    u_i = max( −w_{ij}, (R_{i,:} H_{j,:}^T) / (H_{j,:} H_{j,:}^T) ) = max( −w_{ij}, (R H^T)_{ij} / (H H^T)_{jj} ).
Thus, updating a column of W at a time is equivalent to updating each of
the elements in the column in sequence in scalar coordinate descent. The
same is true when we update a row of H.
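A vectorized sketch of one HALS/RRI sweep (again an illustration, not the
notes' own code) updates whole columns of W and rows of H at a time; it
assumes the relevant rows of H and columns of W are nonzero.

    import numpy as np

    def hals_sweep(A, W, H):
        # One HALS/RRI sweep: each column of W and each row of H is updated
        # as a block, maintaining the residual R = A - W H.
        R = A - W @ H
        k = W.shape[1]
        for j in range(k):
            u = np.maximum(-W[:, j], (R @ H[j, :]) / (H[j, :] @ H[j, :]))
            W[:, j] += u
            R -= np.outer(u, H[j, :])
        for i in range(k):
            v = np.maximum(-H[i, :], (W[:, i] @ R) / (W[:, i] @ W[:, i]))
            H[i, :] += v
            R -= np.outer(W[:, i], v)
        return W, H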

3.3 ANLS
The alternating non-negative least squares (ANLS) iteration updates all of
W together, then all of H:

    W := argmin_{W ≥ 0} ∥A − W H∥_F^2
    H := argmin_{H ≥ 0} ∥A − W H∥_F^2

We can solve for each row of W (or column of H) independently by solving a
non-negative least squares problem. Unfortunately, these non-negative least
squares problems cannot be solved in a simple closed form!
The non-negative least squares problem has the general form

    minimize ∥Ax − b∥_2   such that x ≥ 0;

it is a convex optimization problem that can be solved using any constrained


optimization solver. An old class of solvers for this problem is the active set
methods. To derive these methods, we partition the variables into a free set
I and a constrained set J , and rewrite the KKT equations in the form

xI = A†I b xI ≥ 0
ATJ (Ax − b) ≥ 0 xJ = 0.

If the partitioning into I and J is known, we can compute x via an ordinary


least squares solve. The difficult part is to figure out which variables are free!
The simplest approach is to guess I and J and then iteratively improve the
guess by moving one variable at a time between the two sets as follows.
Starting from an initial non-negative guess x, I, J , we

• Compute p = A_I^† b − x.

• Compute a new point x := x + αp, where α ≤ 1 is chosen to be as large
as possible subject to non-negativity of the new point.

• If α < 1, we move the index for whatever component became zero from
the I set to the J set and compute another step.

• If α = 1 and g_J = A_J^T (Ax − b) has any negative elements, we move the
index associated with the most negative element of g_J from the J set
to the I set and compute another step.

• Otherwise, we have α = 1 and g_J ≥ 0. In this case, the KKT conditions
are satisfied, and we may terminate.
The problem with this approach is that we only change our guess at the free
variables by adding or removing one variable per iteration. If our initial guess
is not very good, it may take many iterations to converge. Alternate methods
are more aggressive about changing the free variable set (or, equivalently, the
active constraint set).
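As a sketch of ANLS that sidesteps writing an active set solver by hand, one
can call SciPy's scipy.optimize.nnls (a Lawson-Hanson style active set code)
on each row of W and each column of H; this is an illustration, not the exact
procedure described above.

    import numpy as np
    from scipy.optimize import nnls

    def anls_sweep(A, W, H):
        # One ANLS sweep: each row of W solves min ||H^T w - A[i,:]||_2 over
        # w >= 0, and each column of H solves min ||W h - A[:,j]||_2 over h >= 0.
        m, n = A.shape
        for i in range(m):
            W[i, :], _ = nnls(H.T, A[i, :])
        for j in range(n):
            H[:, j], _ = nnls(W, A[:, j])
        return W, H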

4 Separable NMF
In the general case, non-negative matrix factorization is a hard problem.
However, there are special cases where it becomes easier, and these are worth
exploring. In a separable problem, we can compute

    Π^T A = [ I ; W_2 ] H;
that is, every row of A can be expressed as a positively-weighted combination
of k rows of A. Examples where we might see this include:
• In topic modeling, we might have “anchor words” that are primarily
associated with just one topic.
• In image decomposition, we might have “pure pixels” that are active
for just one part of an image.
• In chemometrics, we might see that a component molecule produces a
spike at a unique frequency that is not present for other components.
Assuming that this separability condition occurs, how are we to find the k
rows of A that go into H? What we will do is to compute the normalized ma-
trix Ā by scaling each row of A so that it sums to 1. With this normalization,
all rows of Ā are positively weighted combinations of the anchor rows where
the weights sum to 1; that is, if we view each row as a point in n-dimensional
space, then the anchor rows are points on the convex hull. As discussed in
the last lecture, we can find the convex hull with the pivoted QR algorithm

    Ā^T Π = QR.
Variants of this approach also work for nearly separable problems.
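A minimal sketch of this recipe (an illustration under the separability
assumption, not code from the notes) normalizes the rows, runs a column-pivoted
QR on Ā^T to pick candidate anchor rows, and then recovers W by non-negative
least squares.

    import numpy as np
    from scipy.linalg import qr
    from scipy.optimize import nnls

    def separable_nmf(A, k):
        # Normalize rows of A to sum to 1, pick k anchor rows via pivoted QR
        # on Abar^T, take H as those rows of A, and fit W row-by-row with NNLS.
        Abar = A / A.sum(axis=1, keepdims=True)
        _, _, piv = qr(Abar.T, pivoting=True)
        anchors = piv[:k]
        H = A[anchors, :]
        W = np.zeros((A.shape[0], k))
        for i in range(A.shape[0]):
            W[i, :], _ = nnls(H.T, A[i, :])
        return W, H, anchors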
