Deep Orthogonal Matrix Factorization As A Hierarchical Clustering Technique
Abstract—Deep orthogonal nonnegative matrix factorization (deep ONMF) is a constrained deep low-rank matrix approximation model which decomposes a data matrix through several layers of factorizations. Deep ONMF imposes that each data point is assigned to a single cluster at each layer. In this paper, we first explain why deep ONMF can be interpreted as a bottom-up hierarchical clustering technique. Our main contribution is then a simple yet effective greedy initialization strategy for deep ONMF. We show on synthetic data sets that it performs competitively with other initialization strategies, and apply it to the decomposition of a hyperspectral image into its constitutive materials.

I. INTRODUCTION

Given a matrix X ∈ R^{m×n} where each column is a data point lying in an m-dimensional space, a low-rank matrix approximation seeks matrices W ∈ R^{m×r} and H ∈ R^{r×n} such that each data point X(:, j) can be approximated as X(:, j) ≈ Σ_{k=1}^{r} W(:, k) H(k, j) for j = 1, ..., n. This means that each data point is a linear combination of r basis vectors, where r is called the rank of the approximation. In matrix form, this approximation, also called a factorization, is written as X ≈ W H, where each column of W corresponds to a basis vector and each column of H indicates the proportions in which each basis vector appears in each data point. The quality of the approximation is generally measured by the least squares criterion, that is, ||X − W H||_F^2.

To ensure the interpretability and uniqueness of such models, constraints are typically imposed on the factors W and H, such as sparsity [1] and nonnegativity [2], leading to sparse component analysis and nonnegative matrix factorization (NMF), respectively. Adding orthogonality on top of nonnegativity for the factor H, we obtain orthogonal NMF (ONMF) [3], which can be formulated as follows:

    min_{W ∈ R_+^{m×r}, H ∈ R_+^{r×n}} ||X − W H||_F^2   such that   H H^T = I_r,        (1)

where I_r is the identity matrix of size r.

Recently, matrix factorizations (MFs) have been extended to the case where the input matrix is decomposed into more than two factors. More precisely, L layers of successive factorizations of ranks d_l (l = 1, ..., L) are performed on X as follows: X ≈ W_1 H_1, W_1 ≈ W_2 H_2, ..., W_{L−1} ≈ W_L H_L, where W_l ∈ R^{m×d_l} and H_l ∈ R_+^{d_l×d_{l−1}} (l = 1, ..., L) with d_0 = n, so that the matrix X is approximated as X ≈ W_L H_L H_{L−1} ··· H_1. This model is referred to as multilayer MF [4] or deep MF [5], depending on the way the optimization is performed. Multilayer MF performs the decomposition in a purely sequential way, that is, it successively minimizes ||W_{l−1} − W_l H_l||_F^2 for l = 1, 2, ..., L, where W_0 = X. Deep MF considers a further backpropagation step: it minimizes the loss function ||X − W_L H_L ··· H_1||_F^2 across the layers, so that, after a sequential decomposition as in multilayer MF, the factors are iteratively updated in a block-coordinate descent fashion; see [6] and the references therein for more details. As for shallow MFs, additional constraints must be imposed on the factors to render the decomposition meaningful. Imposing nonnegativity and orthogonality on the H_l's leads to deep ONMF [7], the topic of this paper.

Organization of the paper: In Section II, we start by explaining why deep ONMF is a particular hierarchical clustering (HC) model. We then provide a greedy initialization for deep ONMF in Section III. In Section IV-A, we compare several initialization techniques on synthetic data, and in Section IV-B we illustrate the ability of deep ONMF, combined with our greedy initialization, to cluster the pixels of a hyperspectral image in a hierarchical way. Finally, in Section V, we briefly conclude and give perspectives of research.

II. DEEP ONMF IS EQUIVALENT TO HC

It is well known that standard NMF can be interpreted as a soft clustering technique. In particular, when the sum of the entries in each column of H is constrained to be equal to 1, H(k, j) is the proportion in which the data point X(:, j) is associated with the k-th basis vector W(:, k). Due to the row-wise orthogonality constraint, ONMF is more restrictive: nonnegativity together with orthogonality implies that each column of H has at most a single non-zero entry. This follows from the fact that two nonnegative and orthogonal vectors must have disjoint supports. Hence ONMF associates each data point to a single basis vector and performs a hard clustering [3]. In fact, it can be proved that ONMF is equivalent to a weighted variant of spherical k-means [8]. Recall that spherical k-means minimizes the angles between the data points and their associated centroid, as opposed to k-means, which minimizes their Euclidean distances.

Deep ONMF is the extension of ONMF (1) to several layers: for l = 1, 2, ..., L,

    W_{l−1} ≈ W_l H_l   such that   (W_l, H_l) ≥ 0 and H_l H_l^T = I_{d_l},        (2)

where W_0 = X. In deep ONMF, the ranks d_l need to decrease as the factorization unfolds, that is, d_l > d_{l+1} for each l.
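To make the clustering interpretation concrete, here is a minimal Python sketch of the sequential scheme (2), assuming a plain spherical-k-means-style step at each layer; this is an illustrative simplification, not the weighted variant of [8] nor the multiplicative-update algorithm used later in the paper, and the function names are ours. Each column of every H_l built below has a single nonzero entry (disjoint supports), so following the assignments from layer to layer yields a bottom-up hierarchy over the data points.

import numpy as np

def onmf_layer(W_prev, d, n_iter=100, seed=0):
    # One ONMF-like layer: assign each column of W_prev to a single centroid
    # (hard clustering), mimicking the "one nonzero per column of H" structure.
    rng = np.random.default_rng(seed)
    W = W_prev[:, rng.choice(W_prev.shape[1], d, replace=False)].astype(float)
    for _ in range(n_iter):
        Wn = W / (np.linalg.norm(W, axis=0, keepdims=True) + 1e-12)
        labels = np.argmax(Wn.T @ W_prev, axis=0)   # smallest angle, as in spherical k-means
        for k in range(d):
            members = W_prev[:, labels == k]
            if members.size:
                W[:, k] = members.mean(axis=1)
    # H has orthogonal (disjoint-support) rows; the rows are not rescaled to
    # satisfy H H^T = I exactly, which keeps the sketch short.
    H = np.zeros((d, W_prev.shape[1]))
    for j, k in enumerate(labels):
        H[k, j] = W[:, k] @ W_prev[:, j] / (W[:, k] @ W[:, k] + 1e-12)
    return W, H

def deep_onmf(X, ranks):
    # Sequential (multilayer) scheme: W_0 = X, then W_{l-1} ≈ W_l H_l per layer.
    W, factors = X, []
    for d in ranks:              # ranks must be decreasing, e.g. (16, 4)
        W, H = onmf_layer(W, d)
        factors.append((W, H))
    return factors

A data point assigned to the k-th first-layer centroid inherits, at layer 2, the cluster of W_1(:, k), and so on up the layers, which is exactly the bottom-up hierarchical clustering described above.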
the new one, which requires O(nm) operations. Moreover, assuming an efficient sorting strategy for the array of errors e(i, j), finding the pair of indices that generates the smallest e(i, j) is O(n^2 log(n^2)). Hence, SODA requires in total Õ(n^2 m) operations (where the tilde indicates that logarithmic terms are omitted), which is not practical for large data sets. Usually, deep ONMF algorithms run in O(m n d_1) operations, where d_1 ≪ n.

However, for large data sets, such as hyperspectral images where n is the number of pixels (which can be of the order of millions), this greedy idea can still be used. For example, we first compute an ONMF of rank larger than (or equal to) d_1, say d'_1 ≥ d_1, with any standard algorithm faster than SODA, and then "unfold" the remaining d'_1 clusters through SODA. This requires O(m n d'_1) operations for the first-layer ONMF, and then Õ(d'_1^2 m) operations for the next layers computed by SODA. In practice, we recommend choosing d'_1 as a small multiple of d_1.
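The precise merge criterion e(i, j) of SODA is defined earlier in Section III and is not part of the excerpt above, so the following Python sketch only illustrates the two-stage idea just described: a fast first-layer clustering into d'_1 groups (plain k-means from scikit-learn as a stand-in for "any standard algorithm faster than SODA"), followed by a greedy agglomeration of the d'_1 centroids, assuming that the pair whose fusion increases the squared representation error the least is merged at each step. The function names and the merge rule are our assumptions.

import numpy as np
from sklearn.cluster import KMeans

def greedy_merge(C, d1):
    # C: m x d1' matrix of first-layer centroids; merge greedily down to d1
    # groups, always fusing the pair best represented by a single centroid
    # (an assumed stand-in for SODA's e(i, j) criterion).
    groups = [C[:, [k]] for k in range(C.shape[1])]
    while len(groups) > d1:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                merged = np.hstack((groups[i], groups[j]))
                err = np.linalg.norm(merged - merged.mean(axis=1, keepdims=True)) ** 2
                if best is None or err < best[0]:
                    best = (err, i, j)
        _, i, j = best
        groups[i] = np.hstack((groups[i], groups[j]))
        del groups[j]
    return np.column_stack([g.mean(axis=1) for g in groups])

def hybrid_init(X, d1, d1_prime):
    # Stage 1: O(m n d1') clustering of the n data points into d1' groups.
    centroids = KMeans(n_clusters=d1_prime, n_init=10).fit(X.T).cluster_centers_.T
    # Stage 2: greedy merging of the d1' centroids; its cost no longer depends on n.
    return greedy_merge(centroids, d1)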
IV. NUMERICAL EXPERIMENTS

In this section, we evaluate the performance of several initialization techniques for deep ONMF on synthetic data in Section IV-A, and show the hierarchical clustering produced by deep ONMF for a hyperspectral image in Section IV-B.

A. Synthetic data sets

Let us compare the effectiveness of the following initialization methods for the W_l's in deep ONMF:
• Random initialization (RAND): each W_l, l = 1, ..., L, is set up by randomly picking d_l < d_{l−1} columns of W_{l−1}, with W_0 = X. This is the most standard approach in the literature.
• Successive nonnegative projection algorithm (SNPA) [6]: W_l is obtained with SNPA [14] applied on W_{l−1}.
• Our proposed greedy algorithm, SODA.
• RAND+SODA: similarly to what is described at the end of Section III-B, we randomly choose d'_1 ≪ n points, and then apply SODA on this subset of points.

We compare these initializations when combined with the alternating optimization strategy that optimizes the W_l's and H_l's alternately, obtained by extending the multiplicative updates proposed for ONMF in [15] to deep ONMF.

We generate the synthetic data sets as follows, in m = 3 dimensions. We take d_1 = 16 and d_2 = 4 and generate the ground-truth (GT) basis vectors W_1 and W_2, whose columns have unit ℓ_1 norms, in such a way that the 16 first-layer basis vectors are clustered in 4 groups around the 4 second-layer basis vectors; see Fig. 1 for an illustration.

Fig. 1: Geometric illustration of the synthetic data sets.

As shown on Fig. 1, the columns of W_2 are the central basis vectors of 4 columns of W_1: more precisely, each is equal to their average, up to a scaling factor. We then pick n = 1000 points uniformly at random over the GT clusters, that is, each data point is equal to one of the columns of W_1, up to a scaling factor, and we fix d'_1 to 100. Finally, noise is added to the data such that

    X = max( 0, X̃ + ε ||X̃||_F N / ||N||_F ),

where X̃ = W_1 H_1, each entry of N follows a Gaussian distribution of mean 0 and standard deviation 1, and ε is the noise level. To assess the quality of the different initializations, 10 data sets are generated for each noise level, and we report the mean and standard deviation of the clustering accuracy (ACC) at both layers. Given K estimated clusters G_k and K ground-truth clusters H_k, the ACC is defined as

    ACC(G, H) = (1/n) max_P Σ_{k=1}^{K} |G_k ∩ H_{P(k)}|,        (4)

where the maximum is taken over all permutations P of {1, 2, ..., K}.

The results, in terms of both reconstruction error and accuracy, are presented in Table I. More precisely, it reports the relative reconstruction error ||X − W_2 H_2 H_1||_F / ||X||_F, denoted rel_err, and the accuracy at the first and second layers, denoted ACC 1 and ACC 2, respectively.

Clearly, SODA outperforms RAND and SNPA in terms of clustering accuracy. When the noise is small, it always manages to reach a perfect clustering at both layers, contrary to the two other methods. Of course, this comes at the expense of a larger computational cost, from O(mnr) for RAND and SNPA to O(mn^2) for SODA. However, RAND+SODA performs almost as well as SODA at a reduced cost (see the end of Section III-B), showing that using the greedy procedure further in the decomposition is also worthwhile. Note that the accuracy of all algorithms is always a bit higher for the second layer, since there are fewer clusters, which are better separated. The reason SNPA underperforms is that some clusters are contained in the convex cone of the others, while SNPA is designed to identify extreme rays of the cone generated by the columns of X.
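For reference, the data generation above can be reproduced along the following lines in numpy; the way the 16 first-layer vectors are placed around the 4 second-layer vectors (small random perturbations) and the scaling range of the data points are our simplifying assumptions, not the exact construction behind Fig. 1.

import numpy as np

rng = np.random.default_rng(1)
m, d1, d2, n = 3, 16, 4, 1000

# Second-layer basis vectors, with unit l1-norm columns.
W2 = rng.random((m, d2))
W2 /= W2.sum(axis=0, keepdims=True)

# First-layer basis vectors: 4 groups of 4 vectors around each column of W2
# (a simplified stand-in for the construction illustrated in Fig. 1).
W1 = np.repeat(W2, d1 // d2, axis=1) + 0.05 * rng.random((m, d1))
W1 /= W1.sum(axis=0, keepdims=True)

# Each data point is one column of W1, up to a positive scaling factor.
labels = rng.integers(0, d1, size=n)
H1 = np.zeros((d1, n))
H1[labels, np.arange(n)] = rng.uniform(0.5, 1.5, size=n)
X_tilde = W1 @ H1

# Additive Gaussian noise scaled to the noise level eps, as in the text,
# followed by a projection onto the nonnegative orthant.
eps = 1e-2
N = rng.standard_normal(X_tilde.shape)
X = np.maximum(0, X_tilde + eps * np.linalg.norm(X_tilde) * N / np.linalg.norm(N))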
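The maximization over permutations in (4) can be computed exactly with an assignment solver instead of enumerating all K! permutations; a minimal sketch (the function name is ours):

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(pred, truth, K):
    # pred, truth: length-n integer arrays of cluster indices in {0, ..., K-1}.
    # C[k, j] = |G_k ∩ H_j|, the contingency table between the two partitions.
    C = np.zeros((K, K), dtype=int)
    for p, t in zip(pred, truth):
        C[p, t] += 1
    # The best permutation P maximizing sum_k |G_k ∩ H_{P(k)}| is found by the
    # Hungarian algorithm applied to -C.
    rows, cols = linear_sum_assignment(-C)
    return C[rows, cols].sum() / len(pred)

# Example: two relabelings of the same partition give an accuracy of 1.
# clustering_accuracy(np.array([0, 0, 1, 1]), np.array([1, 1, 0, 0]), 2) -> 1.0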
Table I: Comparison of the clustering accuracies at layers 1 (ACC 1) and 2 (ACC 2) and of the final relative error (rel_err, in %) of deep MF applied on synthetic data with several initialization strategies, as a function of the noise level ε. The average and standard deviation (if above 0.01) over 10 data sets are reported.

Noise ε | Method     | ACC 1         | ACC 2         | rel_err (%)
10^-4   | RAND       | 0.54 ± 0.14   | 0.74 ± 0.21   |  9.26 ± 6.23
        | SNPA       | 0.21 ± 0.03   | 0.69 ± 0.17   | 14.84 ± 3.91
        | SODA       | 1             | 1             |  7.49
        | RAND+SODA  | 1             | 1             |  7.50
10^-3   | RAND       | 0.49 ± 0.17   | 0.66 ± 0.19   |  9.41 ± 6.19
        | SNPA       | 0.18 ± 0.02   | 0.67 ± 0.14   | 15.49 ± 4.24
        | SODA       | 1             | 1             |  7.49
        | RAND+SODA  | 1             | 1             |  7.50
10^-2   | RAND       | 0.48 ± 0.17   | 0.76 ± 0.16   | 10.31 ± 6.51
        | SNPA       | 0.42 ± 0.09   | 0.72 ± 0.16   | 10.82 ± 4.21
        | SODA       | 0.97          | 0.99          |  7.51
        | RAND+SODA  | 0.97          | 0.99          |  7.52
10^-1   | RAND       | 0.40 ± 0.07   | 0.70 ± 0.17   | 15.46 ± 4.57
        | SNPA       | 0.38 ± 0.07   | 0.68 ± 0.09   | 14.27 ± 3.20
        | SODA       | 0.69 ± 0.01   | 0.92 ± 0.01   |  9.75
        | RAND+SODA  | 0.57 ± 0.07   | 0.92 ± 0.01   | 10.00 ± 0.67
B. Hyperspectral unmixing

A hyperspectral image (HI) contains the reflectance values of n pixels in m wavelength spectral bands and is generally represented by a matrix X ∈ R^{m×n}, where each column of X is the so-called spectral signature of a pixel. Hyperspectral unmixing (HU) consists in identifying the spectral signatures of r materials and, under the linear mixing assumption, NMF is appropriate to solve HU [16]. Similarly, when deep ONMF is applied, the materials (also called endmembers) are extracted in a hierarchical bottom-up fashion.

The HYDICE Urban HI is made of n = 307 × 307 pixels with m = 162 spectral bands; see Fig. 2. There are several versions of the ground truth depending on the number of materials considered [17].

The abundance maps, that is, the proportions in which every material appears in every pixel, extracted by deep ONMF with L = 6 and d_l = 8 − l for all l, are represented on Fig. 3. To initialize the factors of each layer, we first apply ONMF with d_1 = 7 with SNPA initialization, and then apply SODA on W_1, while [6] used a multilayer ONMF with SNPA initialization of all layers. For conciseness, we gathered the representations of layers 3 and 4, as well as those of layers 5 and 6, in a single level, as distinct clusters were merged at these layers.

Fig. 3: Deep ONMF applied on the Urban HI.

The first layer extracts two types of grass, trees, road, dirt, metal and roof. At layer 2, road and metal, which have similar spectral signatures, are merged into a single cluster. Then, the road/metal and dirt are merged to create a single cluster, while the two kinds of grass are also merged. Finally, the road and roof are merged, while trees and grass are also gathered in a cluster made of vegetation.

Deep MF provides a richer decomposition than single-layer matrix factorization, and the hybrid initialization combining SNPA with SODA is an efficient way to set up the factors.

V. CONCLUSION

In this paper, we explained why deep ONMF is equivalent to a particular bottom-up hierarchical clustering. We then proposed a greedy initialization for deep ONMF, SODA, which was shown to outperform random initialization and SNPA on synthetic data sets, especially in situations with noise or when the clusters are quite close to each other. We emphasized the fact that similar (small) final reconstruction errors can be associated with various clustering accuracies, hence a proper choice of the initialization technique is critical. We also showed that deep ONMF initialized with SODA-based algorithms is able to produce meaningful hierarchical decompositions of a hyperspectral image.

Future directions of research include validating the proposed method on more data sets and other applications, such as topic modeling.
Also, a thorough study of the robustness to noise of SODA would be interesting. In fact, as long as the noise is sufficiently small, SODA provides an optimal clustering. This is obvious in the noiseless case, where all data points in the same cluster are multiples of one another, and should be quantified in noisy situations.

ACKNOWLEDGEMENT

This work was supported by the European Research Council (ERC starting grant no 679515), and by the Fonds de la Recherche Scientifique - FNRS (F.R.S.-FNRS) and the Fonds Wetenschappelijk Onderzoek - Vlaanderen (FWO) under EOS Project no O005318F-RG47. Pierre De Handschutter is a research fellow of the F.R.S.-FNRS.

REFERENCES

[1] Pando Georgiev, Fabian Theis, and Andrzej Cichocki, "Sparse component analysis and blind source separation of underdetermined mixtures," IEEE Transactions on Neural Networks, vol. 16, no. 4, pp. 992–996, 2005.
[2] Daniel D. Lee and H. Sebastian Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[3] Chris H. Q. Ding, Tao Li, Wei Peng, and Haesun Park, "Orthogonal nonnegative matrix t-factorizations for clustering," in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006, pp. 126–135.
[4] Andrzej Cichocki and Rafal Zdunek, "Multilayer nonnegative matrix factorisation," Electronics Letters, vol. 42, no. 16, pp. 947–948, 2006.
[5] George Trigeorgis, Konstantinos Bousmalis, Stefanos Zafeiriou, and Björn Schuller, "A deep matrix factorization method for learning attribute representations," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 3, pp. 417–429, 2016.
[6] Pierre De Handschutter, Nicolas Gillis, and Xavier Siebert, "Deep matrix factorizations," arXiv preprint arXiv:2010.00380, 2020.
[7] Bensheng Lyu, Kan Xie, and Weijun Sun, "A deep orthogonal non-negative matrix factorization method for learning attribute representations," in International Conference on Neural Information Processing. Springer, 2017, pp. 443–452.
[8] Filippo Pompili, Nicolas Gillis, P.-A. Absil, and François Glineur, "Two algorithms for orthogonal nonnegative matrix factorization with application to clustering," Neurocomputing, vol. 141, pp. 15–25, 2014.
[9] Athman Bouguettaya, Qi Yu, Xumin Liu, Xiangmin Zhou, and Andy Song, "Efficient agglomerative hierarchical clustering," Expert Systems with Applications, vol. 42, no. 5, pp. 2785–2797, 2015.
[10] Shudong Huang, Zhao Kang, and Zenglin Xu, "Deep k-means: A simple and effective method for data clustering," in International Conference on Neural Computing for Advanced Applications. Springer, 2020, pp. 272–283.
[11] Da Kuang and Haesun Park, "Fast rank-2 nonnegative matrix factorization for hierarchical document clustering," in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013, pp. 739–747.
[12] Yuqian Li, Diana M. Sima, Sofie Van Cauter, Anca R. Croitor Sava, Uwe Himmelreich, Yiming Pi, and Sabine Van Huffel, "Hierarchical non-negative matrix factorization (hNMF): a tissue pattern differentiation method for glioblastoma multiforme diagnosis using MRSI," NMR in Biomedicine, vol. 26, no. 3, pp. 307–319, 2013.
[13] Nicolas Gillis, Da Kuang, and Haesun Park, "Hierarchical clustering of hyperspectral images using rank-two nonnegative matrix factorization," IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 4, pp. 2066–2078, 2015.
[14] Nicolas Gillis, "Successive nonnegative projection algorithm for robust nonnegative blind source separation," SIAM Journal on Imaging Sciences, vol. 7, no. 2, pp. 1420–1450, 2014.
[15] Seungjin Choi, "Algorithms for orthogonal nonnegative matrix factorization," in 2008 IEEE International Joint Conference on Neural Networks. IEEE, 2008, pp. 1828–1832.
[16] José M. Bioucas-Dias, Antonio Plaza, Nicolas Dobigeon, Mario Parente, Qian Du, Paul Gader, and Jocelyn Chanussot, "Hyperspectral unmixing overview: Geometrical, statistical, and sparse regression-based approaches," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 5, no. 2, pp. 354–379, 2012.
[17] Feiyun Zhu, "Spectral unmixing datasets with ground truths," arXiv preprint arXiv:1708.05125, 2017.