Non-Negative Matrix Factorization (NMF)
2018-05-30
Faced with this hard optimization problem, we consider two tactics. First,
we might seek efficient optimization methods that at least converge to a
local minimizer¹; we will spend the first part of the lecture discussing this
approach. Second, we might seek common special cases where we can prove
something about the approximation. In particular, the NMF problem is much
more tractable when we make a separability assumption which is appropriate
in some applications.
¹ In most cases, we can only show convergence to a stationary point, but we are likely to converge to a minimizer for almost all initial guesses.
The Frobenius inner product is the inner product associated with the Frobenius norm: ∥X∥²_F = ⟨X, X⟩_F, and we can apply the usual product rule for differentiation to compute directional derivatives of ϕ(W, H) = ∥A − WH∥²_F / 2 with respect to W and H:

    δϕ = δ [ ⟨A − WH, A − WH⟩_F / 2 ]
       = ⟨δ(A − WH), A − WH⟩_F
       = −⟨(δW)H, A − WH⟩_F − ⟨W(δH), A − WH⟩_F.
We let R = A − WH, and use the fact that the trace of a product of matrices is invariant under cyclic permutations of the matrices to rewrite this as

    δϕ = −⟨δW, RH^T⟩_F − ⟨δH, W^T R⟩_F,

so that ∇_W ϕ = −RH^T and ∇_H ϕ = −W^T R. The Lee and Seung scheme takes a scaled gradient descent step

    W := W + S ⊙ (RH^T),    H := H + S′ ⊙ (W^T R),

with the elementwise scaling factors

    S = W ⊘ (WHH^T),    S′ = H ⊘ (W^T W H).
With these choices, two of the terms in the summation cancel, so that

    W := W ⊙ (AH^T) ⊘ (WHH^T),    H := H ⊙ (W^T A) ⊘ (W^T W H).
At each step of the Lee and Seung scheme, we scale the (non-negative) ele-
ments of W and H by non-negative factors, yielding a non-negative result.
There is no need for a non-negative projection because the step sizes are
chosen increasingly conservatively as elements of W and H approach zero.
But because the steps are very conservative, the Lee and Seung algorithm
may require a large number of steps to converge.
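To make the update concrete, here is a minimal NumPy sketch of the multiplicative iteration. The small constant eps in the denominators is an assumption added to guard against division by zero, and the random initialization and fixed iteration count are arbitrary choices, not part of the algorithm itself:

    import numpy as np

    def nmf_multiplicative(A, k, iters=500, eps=1e-12, seed=0):
        """Lee-Seung multiplicative updates for min ||A - W H||_F^2 / 2."""
        rng = np.random.default_rng(seed)
        m, n = A.shape
        W = rng.random((m, k))
        H = rng.random((k, n))
        for _ in range(iters):
            H *= (W.T @ A) / (W.T @ W @ H + eps)   # H := H ⊙ (W^T A) ⊘ (W^T W H)
            W *= (A @ H.T) / (W @ H @ H.T + eps)   # W := W ⊙ (A H^T) ⊘ (W H H^T)
        return W, H

Because every factor in the update is non-negative, the iterates stay non-negative automatically, as noted above.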
3 Coordinate descent
The (block) coordinate descent method (also known as block relaxation or nonlinear Gauss-Seidel) for minimizing ϕ(x_1, …, x_p) sweeps through the blocks, minimizing over each block in turn while holding the others fixed:

    x_i^{k+1} = argmin_ξ ϕ(x_1^{k+1}, …, x_{i−1}^{k+1}, ξ, x_{i+1}^k, …, x_p^k).
The individual subproblems are often simpler than the full problem. If each
subproblem has a unique solution (e.g. if each subproblem is strongly convex),
the iteration converges to a stationary point; this is the situation for all the
iterations we will discuss.
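Schematically, one sweep of the method looks like the following sketch, in which the per-block solvers are hypothetical user-supplied routines (each returns the argmin over its block with the other blocks held at their current values):

    def block_coordinate_descent(block_solvers, x_blocks, sweeps=100):
        # block_solvers[i](x_blocks) returns the argmin over block i,
        # holding all other blocks fixed at their current values.
        for _ in range(sweeps):
            for i, solve_i in enumerate(block_solvers):
                x_blocks[i] = solve_i(x_blocks)
        return x_blocks

The iterations below differ only in how the blocks are chosen and how each block subproblem is solved.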
3.2 HALS/RRI
The simple algorithm in the previous section relaxed on each element of W and H independently. In the hierarchical alternating least squares (HALS) or rank-one residual iteration (RRI), we treat the problem as consisting of 2k vector blocks, one for each column of W and row of H. To update a column via W_{:,j} := W_{:,j} + u,
we must solve the least squares problem
    minimize ∥R − u H_{j,:}∥²_F  s.t.  u ≥ −W_{:,j},

which is equivalent to solving the independent single-variable least squares problems

    minimize ∥R_{i,:} − u_i H_{j,:}∥²_2  s.t.  u_i ≥ −w_{ij}.
Each u_i must satisfy the normal equations unless it hits the bound constraint; thus,

    u_i = max( −w_{ij}, (R_{i,:} H_{j,:}^T) / (H_{j,:} H_{j,:}^T) )
        = max( −w_{ij}, (RH^T)_{ij} / (HH^T)_{jj} ).
Thus, updating a column of W at a time is equivalent to updating each of
the elements in the column in sequence in scalar coordinate descent. The
same is true when we update a row of H.
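A minimal NumPy sketch of one HALS sweep follows. It uses (RH^T)_{:,j} = (AH^T)_{:,j} − (WHH^T)_{:,j} so the residual R is never formed explicitly; the constant eps is an assumed guard against division by zero, not part of the method:

    import numpy as np

    def hals_sweep(A, W, H, eps=1e-12):
        """One sweep of HALS/RRI: update each column of W, then each row of H."""
        k = W.shape[1]
        AHt, HHt = A @ H.T, H @ H.T
        for j in range(k):
            # W_{:,j} := max(0, W_{:,j} + (R H^T)_{:,j} / (H H^T)_{jj})
            num = AHt[:, j] - W @ HHt[:, j] + W[:, j] * HHt[j, j]
            W[:, j] = np.maximum(0.0, num / (HHt[j, j] + eps))
        WtA, WtW = W.T @ A, W.T @ W
        for j in range(k):
            # H_{j,:} := max(0, H_{j,:} + (W^T R)_{j,:} / (W^T W)_{jj})
            num = WtA[j, :] - WtW[j, :] @ H + WtW[j, j] * H[j, :]
            H[j, :] = np.maximum(0.0, num / (WtW[j, j] + eps))
        return W, H

Note that each column update sees the already-updated columns, in keeping with the Gauss-Seidel structure of the iteration.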
3.3 ANLS
The alternating non-negative least squares (ANLS) iteration updates all of
W together, then all of H:
    W := argmin_{W ≥ 0} ∥A − WH∥²_F,
    H := argmin_{H ≥ 0} ∥A − WH∥²_F.
Each of these subproblems is a non-negative least squares problem: minimize ∥Ax − b∥² subject to x ≥ 0. At a solution, the variables split into a free set I and an active set J such that

    x_I = A_I† b,  x_I ≥ 0,
    A_J^T (Ax − b) ≥ 0,  x_J = 0.

An active set method guesses the partition and refines it. Given a current guess of the free set I, each step of the iteration does the following:

• Compute p = A_I† b − x (extended by zeros on the J components).
• Take the step x := x + αp, where α ≤ 1 is the largest step that keeps x + αp ≥ 0.
• If α < 1, we move the index for whatever component became zero from the I set to the J set and compute another step.
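In practice one can lean on an existing NNLS solver. The sketch below uses scipy.optimize.nnls (a Lawson-Hanson-style active set solver) to perform one ANLS sweep, solving for W row by row and H column by column; it is meant to show the structure, not to be fast:

    import numpy as np
    from scipy.optimize import nnls

    def anls_sweep(A, W, H):
        """One ANLS sweep via per-row / per-column NNLS solves."""
        m, n = A.shape
        # Row i of W solves min_{w >= 0} ||H^T w - A_{i,:}^T||_2.
        for i in range(m):
            W[i, :], _ = nnls(H.T, A[i, :])
        # Column j of H solves min_{h >= 0} ||W h - A_{:,j}||_2.
        for j in range(n):
            H[:, j], _ = nnls(W, A[:, j])
        return W, H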
4 Separable NMF
In the general case, non-negative matrix factorization is a hard problem.
However, there are special cases where it becomes easier, and these are worth
exploring. In a separable problem, we can write

    Π^T A = [ I ; W₂ ] H

for some permutation matrix Π and non-negative matrix W₂; that is, every row of A can be expressed as a positively-weighted combination of k rows of A. Examples where we might see this include:
• In topic modeling, we might have “anchor words” that are primarily
associated with just one topic.
• In image decomposition, we might have “pure pixels” that are active
for just one part of an image.
• In chemometrics, we might see that a component molecule produces a
spike at a unique frequency that is not present for other components.
Assuming that this separability condition occurs, how are we to find the k
rows of A that go into H? What we will do is to compute the normalized ma-
trix Ā by scaling each row of A so that it sums to 1. With this normalization,
all rows of Ā are positively weighted combinations of the anchor rows where
the weights sum to 1; that is, if we view each row as a point in m-dimensional
space, then the anchor rows are points on the convex hull. As discussed in
the last lecture, we can find the convex hull with the pivoted QR algorithm:

    Ā^T Π = QR.
Variants of this approach also work for nearly separable problems.
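Here is a compact sketch of this anchor-row selection, using scipy.linalg.qr with column pivoting on Ā^T; the function name is hypothetical, and robustness to noise or to zero rows is not addressed:

    import numpy as np
    from scipy.linalg import qr

    def separable_anchors(A, k):
        """Select k anchor rows of A via pivoted QR on the row-normalized matrix."""
        Abar = A / A.sum(axis=1, keepdims=True)  # scale each row to sum to 1
        _, _, piv = qr(Abar.T, pivoting=True)    # computes Abar^T Pi = Q R
        return piv[:k]                           # indices of the anchor rows

Given the anchor indices, H = A[anchors, :], and W can then be recovered with a non-negative least squares solve, as in the ANLS sketch above.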