
Improved Deep Embedded Clustering with Local Structure Preservation

Xifeng Guo, Long Gao, Xinwang Liu, Jianping Yin


College of Computer, National University of Defense Technology, Changsha, China
[email protected], [email protected], [email protected], [email protected]

Abstract

Deep clustering learns deep feature representations that favor the clustering task using neural networks. Some pioneering work proposes to simultaneously learn embedded features and perform clustering by explicitly defining a clustering oriented loss. Though promising performance has been demonstrated in various applications, we observe that a vital ingredient has been overlooked by these works: the defined clustering loss may corrupt the feature space, which leads to non-representative, meaningless features and in turn hurts clustering performance. To address this issue, in this paper we propose the Improved Deep Embedded Clustering (IDEC) algorithm to take care of data structure preservation. Specifically, we manipulate the feature space to scatter data points using a clustering loss as guidance. To constrain the manipulation and maintain the local structure of the data generating distribution, an under-complete autoencoder is applied. By integrating the clustering loss and the autoencoder's reconstruction loss, IDEC can jointly optimize cluster label assignment and learn features that are suitable for clustering with local structure preservation. The resultant optimization problem can be effectively solved by mini-batch stochastic gradient descent and backpropagation. Experiments on image and text datasets empirically validate the importance of local structure preservation and the effectiveness of our algorithm.

1 Introduction

Unsupervised clustering is a vital research topic in data science and machine learning. Traditional clustering algorithms like k-means [MacQueen, 1967], Gaussian mixture models [Bishop, 2006] and spectral clustering [Von Luxburg, 2007] group data on handcrafted features according to intrinsic characteristics or similarity. However, when the dimension of the input feature space (data space) is very high, clustering becomes ineffective due to unreliable similarity metrics. Transforming data from the high dimensional feature space to a lower dimensional space in which to perform clustering is an intuitive solution and has been widely studied. This can be done by applying dimension reduction techniques like Principal Component Analysis (PCA), but the representation ability of these shallow models is limited. Thanks to the development of deep learning, such feature transformation can be achieved by using Deep Neural Networks (DNN). We refer to this kind of clustering as deep clustering.

Deep clustering has been proposed only recently and leaves a lot of problems unsolved. For example, what types of neural networks are proper? How to provide guidance information, i.e., how to define a clustering oriented loss function? Which properties of the data should be preserved during transformation? The primitive work in deep clustering focuses on learning features that preserve some properties of the data by adding prior knowledge to the objective [Tian et al., 2014; Peng et al., 2016]. These are two-stage algorithms: feature transformation and then clustering. Later, algorithms that jointly accomplish feature transformation and clustering came into being [Yang et al., 2016; Xie et al., 2016]. The Deep Embedded Clustering (DEC) [Xie et al., 2016] algorithm defines an effective objective in a self-learning manner. The defined clustering loss is used to update the parameters of the transforming network and the cluster centers simultaneously. The cluster assignment is implicitly integrated into soft labels. However, local structure preservation cannot be guaranteed by the clustering loss. Thus the feature transformation may be misguided, leading to corruption of the embedded space.

To deal with this problem, in this paper we assume that both clustering oriented loss guidance and a local structure preservation mechanism are essential for deep clustering. Inspired by [Peng et al., 2016], we use an under-complete autoencoder to learn embedded features and to preserve the local structure of the data generating distribution. We propose to incorporate the autoencoder into the DEC framework. In this way, the proposed framework can jointly perform clustering and learn representative features with local structure preservation. We refer to our algorithm as Improved Deep Embedded Clustering (IDEC). The optimization of IDEC can be directly performed by mini-batch stochastic gradient descent and backpropagation. At last, some experiments are carefully designed and conducted. The results validate our assumption and the effectiveness of our IDEC.

The contributions of this work are summarized below:
• We propose a deep clustering algorithm that can jointly perform clustering and learn representative features with local structure preservation.

• We empirically prove the importance of local structure preservation in deep clustering.

• The proposed IDEC outperforms the newest opponent by a large margin.

2 Related Work

2.1 Deep Clustering

Existing deep clustering algorithms broadly fall into two categories: (i) two-stage work that applies clustering after having learned a representation, and (ii) approaches that jointly optimize feature learning and clustering.

The former category of algorithms directly takes advantage of existing unsupervised deep learning frameworks and techniques. For example, [Tian et al., 2014] uses an autoencoder to learn low dimensional features of the original graph, and then runs the k-means algorithm to get clustering results. [Chen, 2015] layer-wisely trains a Deep Belief Network (DBN) and then applies non-parametric maximum-margin clustering to the learned intermediate representation. [Peng et al., 2016] uses an autoencoder with a sparsity prior to learn representations in a nonlinear latent space that are adaptive to local and global subspace structure simultaneously, and then traditional clustering algorithms are employed to get the label assignment.

The other category of algorithms tries to explicitly define a clustering loss, simulating the classification error in supervised deep learning. [Yang et al., 2016] proposes a recurrent framework for deep representations and image clusters, which integrates the two processes into a single model with a unified weighted triplet loss and optimizes it end-to-end. DEC [Xie et al., 2016] learns a mapping from the observed space to a low-dimensional latent space with deep neural networks, which can obtain feature representations and cluster assignments simultaneously.

The proposed algorithm is intrinsically a modified version of DEC that incorporates an under-complete autoencoder to preserve local structure. It excels [Yang et al., 2016] by its simplicity, without recurrence, and outperforms DEC in terms of clustering accuracy and the representativeness of features. Since IDEC mainly depends on the autoencoder and DEC, we introduce them in more detail in the following sections.

2.2 Autoencoder

An autoencoder is a neural network that is trained to attempt to copy its input to its output. Internally, it has a hidden layer z that describes a code used to represent the input. The network consists of two parts: an encoder function z = f_W(x) and a decoder x' = g_W'(z) that produces a reconstruction. There are two widely used types of autoencoders.

Under-complete autoencoder. It constrains the dimension of the latent code z to be lower than that of the input data x. Learning such under-complete representations forces the autoencoder to capture the most salient features of the data.

Denoising autoencoder. Instead of reconstructing x given x, a denoising autoencoder minimizes the following objective:

L = \|x - g_{W'}(f_W(\tilde{x}))\|_2^2    (1)

where \tilde{x} is a copy of x that is corrupted by some form of noise. Therefore, the denoising autoencoder has to recover x from this corruption rather than simply copying its input. In this way, the denoising autoencoder forces the encoder f_W and decoder g_W' to implicitly capture the structure of the data generating distribution.

In our algorithm, the denoising autoencoder is used for pretraining, and the under-complete autoencoder is added to the DEC framework after initialization.
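As a concrete (and simplified) illustration of this pretraining step, the following Keras sketch builds an under-complete autoencoder and trains it in a denoising fashion; the layer widths, noise level and training schedule here are illustrative assumptions rather than the exact greedy layer-wise procedure of [Xie et al., 2016].

```python
import numpy as np
from keras.layers import Input, Dense
from keras.models import Model

def build_autoencoder(dims):
    """dims = [d, 500, 500, 2000, 10]: input dim, hidden widths, embedding dim.
    Returns (autoencoder, encoder); the decoder mirrors the encoder."""
    x = Input(shape=(dims[0],))
    h = x
    for units in dims[1:-1]:
        h = Dense(units, activation='relu')(h)
    z = Dense(dims[-1], name='embedding')(h)          # latent code z = f_W(x)
    h = z
    for units in reversed(dims[1:-1]):
        h = Dense(units, activation='relu')(h)
    x_rec = Dense(dims[0], name='reconstruction')(h)  # x' = g_W'(z)
    return Model(x, x_rec), Model(x, z)

# Denoising pretraining of Eq. (1): corrupt the input, reconstruct the clean input.
# autoencoder, encoder = build_autoencoder([784, 500, 500, 2000, 10])
# autoencoder.compile(optimizer='adam', loss='mse')
# x_noisy = x + 0.2 * np.random.normal(size=x.shape)   # illustrative noise level
# autoencoder.fit(x_noisy, x, batch_size=256, epochs=50)
```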
2.3 Deep Embedded Clustering

Deep Embedded Clustering (DEC) [Xie et al., 2016] starts by pretraining an autoencoder and then removes the decoder. The remaining encoder is finetuned by optimizing the following objective:

L = KL(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}    (2)

where q_{ij} is the similarity between embedded point z_i and cluster center µ_j measured by Student's t-distribution [Maaten and Hinton, 2008]:

q_{ij} = \frac{(1 + \|z_i - \mu_j\|^2)^{-1}}{\sum_{j'} (1 + \|z_i - \mu_{j'}\|^2)^{-1}}    (3)

and p_{ij} in (2) is the target distribution, defined as

p_{ij} = \frac{q_{ij}^2 / \sum_i q_{ij}}{\sum_{j'} \left( q_{ij'}^2 / \sum_i q_{ij'} \right)}    (4)

As we can see, the target distribution P is defined by Q, so minimizing L is a form of self-training [Nigam and Ghani, 2000].

Let f_W be the encoder mapping, i.e. z_i = f_W(x_i), where x_i is an input example from dataset X. After pretraining, all embedded points {z_i} can be extracted using f_W. Then k-means is employed on {z_i} to get the initial cluster centers {µ_j}. Afterwards, L can be computed according to (2), (3) and (4), and the predicted label of sample x_i is \arg\max_j q_{ij}.

During backpropagation, \partial L / \partial z_i and \partial L / \partial \mu_j can be easily computed. Then \partial L / \partial z_i is passed down to update f_W, and \partial L / \partial \mu_j is used to update cluster center µ_j:

\mu_j = \mu_j - \lambda \frac{\partial L}{\partial \mu_j}    (5)

The biggest contribution of DEC is the clustering loss (or the target distribution P, to be specific). It works by using high confidence samples as supervision and then making the samples in each cluster distribute more densely. However, there is no guarantee of pulling samples near the margins towards the correct cluster. We deal with this problem by explicitly preserving the local structure of the data. Under this condition, the supervision information from high confidence samples can help the marginal samples move to the correct cluster.
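For concreteness, the soft assignment (3) and the target distribution (4) can be computed in a few lines of NumPy. The sketch below assumes arrays z (embedded points) and mu (cluster centers) and is not the authors' released code.

```python
import numpy as np

def soft_assignments(z, mu):
    """Student's t soft assignments q_ij of Eq. (3).
    z: (n, d) embedded points, mu: (K, d) cluster centers."""
    dist2 = np.sum((z[:, None, :] - mu[None, :, :]) ** 2, axis=2)  # (n, K)
    q = 1.0 / (1.0 + dist2)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Self-training target p_ij of Eq. (4)."""
    weight = q ** 2 / q.sum(axis=0)          # q_ij^2 / sum_i q_ij
    return weight / weight.sum(axis=1, keepdims=True)

# Example: predicted labels are the argmax of q, as in DEC.
# z = encoder.predict(x); mu = kmeans.cluster_centers_
# q = soft_assignments(z, mu); labels = q.argmax(axis=1); p = target_distribution(q)
```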
3 Improved Deep Embedded Clustering

Consider a dataset X with n samples, each sample x_i ∈ R^d where d is the dimension. The number of clusters K is a priori knowledge, and the jth cluster center is represented by µ_j ∈ R^d. Let the value of s_i ∈ {1, 2, ..., K} represent the cluster index assigned to sample x_i. Define nonlinear mappings f_W : x_i → z_i and g_W' : z_i → x'_i, where z_i is the embedded point of x_i in the low dimensional feature space and x'_i is the reconstructed sample for x_i.

We aim to find a good f_W which makes the embedded points {z_i}_{i=1}^n more suitable for the clustering task. To this end, two components are essential: the autoencoder and the clustering loss. The autoencoder is used to learn representations in an unsupervised manner, and the learned features can preserve the intrinsic local structure of the data. The clustering loss, borrowed from [Xie et al., 2016], is responsible for manipulating the embedded space in order to scatter the embedded points. The whole network structure is illustrated in Fig. 1, and the objective is defined as

L = L_r + \gamma L_c    (6)

where L_r and L_c are the reconstruction loss and the clustering loss respectively, and γ > 0 is a coefficient that controls the degree of distorting the embedded space. When γ = 1 and L_r ≡ 0, (6) reduces to the objective of DEC [Xie et al., 2016].

[Figure 1: The network structure of IDEC. The encoder and decoder are composed of fully connected layers. The input x is mapped by the encoder to the embedded point z, which the decoder maps back to the reconstruction x' (reconstruction loss); the soft assignment q is computed from z (clustering loss). The clustering loss is used to scatter the embedded points z, and the reconstruction loss makes sure that the embedded space preserves the local structure of the data generating distribution.]

3.1 Clustering loss and Initialization

The clustering loss was proposed by [Xie et al., 2016]. It is defined as the KL divergence between distributions P and Q, where Q is the distribution of soft labels measured by Student's t-distribution and P is the target distribution derived from Q. That is to say, the clustering loss is defined as

L_c = KL(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}    (7)

where KL is the Kullback–Leibler divergence that measures the non-symmetric difference between two probability distributions, and P and Q are defined by (4) and (3). Details can be found in Section 2.3 and [Xie et al., 2016].

Following the suggestions in [Xie et al., 2016], we also pretrain a stacked denoising autoencoder before performing clustering. After pretraining, the embedded points are valid feature representations for the input samples. Then the cluster centers {µ_j}_{j=1}^K can be initialized by employing k-means on {z_i = f_W(x_i)}_{i=1}^n.

3.2 Local structure preservation

The embedded points obtained in Section 3.1 are not necessarily suitable for the clustering task. To this end, DEC [Xie et al., 2016] abandons the decoder and finetunes the encoder using the clustering loss L_c. However, we suppose that this kind of finetuning could distort the embedded space, weaken the representativeness of the embedded features and thereby hurt clustering performance. Therefore, we propose to keep the decoder untouched and directly attach the clustering loss to the embedded space.

To ensure the effectiveness of the clustering loss, the stacked denoising autoencoder used in pretraining is not appropriate any more, because the clustering should be performed on features of clean data instead of the noised data used in the denoising autoencoder. So we directly remove the noise. Then the stacked denoising autoencoder degenerates into an under-complete autoencoder (see Section 2.2). The reconstruction loss is measured by the Mean Squared Error (MSE):

L_r = \sum_{i=1}^{n} \|x_i - g_{W'}(z_i)\|_2^2    (8)

where z_i = f_W(x_i), and f_W and g_W' are the encoder and decoder mappings respectively. As shown in [Peng et al., 2016] and [Goodfellow et al., 2016], autoencoders can preserve the local structure of the data generating distribution. Under this condition, manipulating the embedded space slightly using the clustering loss will not cause corruption. So the coefficient γ should be less than 1, which will be empirically demonstrated in Section 4.3.
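To make the objective (6) concrete, the following sketch (in the spirit of, but not identical to, the released Keras implementation) wraps the soft assignment of Eq. (3) in a custom layer with trainable centers and combines the reconstruction and clustering losses with weight γ. The names autoencoder and encoder refer to the pretraining sketch in Section 2.2 and are assumptions of this sketch.

```python
import keras.backend as K
from keras.layers import Layer
from keras.models import Model

class ClusteringLayer(Layer):
    """Soft assignment q_ij of Eq. (3) with trainable cluster centers."""
    def __init__(self, n_clusters, **kwargs):
        super(ClusteringLayer, self).__init__(**kwargs)
        self.n_clusters = n_clusters

    def build(self, input_shape):
        self.clusters = self.add_weight(shape=(self.n_clusters, input_shape[1]),
                                        initializer='glorot_uniform',
                                        name='clusters')
        super(ClusteringLayer, self).build(input_shape)

    def call(self, z):
        d2 = K.sum(K.square(K.expand_dims(z, axis=1) - self.clusters), axis=2)
        q = 1.0 / (1.0 + d2)
        return q / K.sum(q, axis=1, keepdims=True)

    def compute_output_shape(self, input_shape):
        return input_shape[0], self.n_clusters

# Attach both heads to the pretrained autoencoder (Fig. 1) and weight the
# clustering loss by gamma as in Eq. (6).
# q_out = ClusteringLayer(n_clusters, name='clustering')(encoder.output)
# idec = Model(inputs=encoder.input, outputs=[autoencoder.output, q_out])
# idec.compile(optimizer='adam', loss=['mse', 'kld'], loss_weights=[1.0, gamma])
```

Compiling with loss_weights=[1.0, gamma] realizes the weighting L = L_r + γL_c, and fitting against targets [x, p] trains both heads jointly.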
3.3 Optimization

We optimize (6) using mini-batch stochastic gradient descent (SGD) and backpropagation. To be specific, there are three kinds of parameters to optimize or update: the autoencoder's weights, the cluster centers and the target distribution P.

Update autoencoder's weights and cluster centers. Fix the target distribution P; then the gradients of L_c with respect to embedded point z_i and cluster center µ_j can be computed as:

\frac{\partial L_c}{\partial z_i} = 2 \sum_{j=1}^{K} \left(1 + \|z_i - \mu_j\|^2\right)^{-1} (p_{ij} - q_{ij})(z_i - \mu_j)    (9)

\frac{\partial L_c}{\partial \mu_j} = 2 \sum_{i=1}^{n} \left(1 + \|z_i - \mu_j\|^2\right)^{-1} (q_{ij} - p_{ij})(z_i - \mu_j)    (10)

Note that the above derivations are from [Xie et al., 2016]. Then, given a mini-batch with m samples and learning rate λ, µ_j is updated by

\mu_j = \mu_j - \frac{\lambda}{m} \sum_{i=1}^{m} \frac{\partial L_c}{\partial \mu_j}    (11)

The decoder's weights are updated by

W' = W' - \frac{\lambda}{m} \sum_{i=1}^{m} \frac{\partial L_r}{\partial W'}    (12)
The encoder's weights are updated by

W = W - \frac{\lambda}{m} \sum_{i=1}^{m} \left( \frac{\partial L_r}{\partial W} + \gamma \frac{\partial L_c}{\partial W} \right)    (13)
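For reference, the closed-form gradients (9) and (10) can be written in a few lines of NumPy. This is a sketch only; z, mu, p and q are assumed to be arrays of shape (n, d), (K, d), (n, K) and (n, K) respectively.

```python
import numpy as np

def clustering_loss_grads(z, mu, p, q):
    """Gradients of L_c w.r.t. embedded points (Eq. 9) and centers (Eq. 10)."""
    diff = z[:, None, :] - mu[None, :, :]                   # (n, K, d)
    w = 2.0 * (p - q) / (1.0 + np.sum(diff ** 2, axis=2))   # (n, K)
    grad_z = np.sum(w[:, :, None] * diff, axis=1)           # sum over clusters j
    grad_mu = -np.sum(w[:, :, None] * diff, axis=0)         # sum over samples i
    return grad_z, grad_mu

# One SGD step on the centers for a mini-batch of size m (Eq. 11):
# mu -= lr / m * grad_mu
```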
Update target distribution. The target distribution P serves as a "groundtruth" soft label but also depends on the predicted soft labels. Therefore, to avoid instability, P should not be updated at every iteration using only a batch of data (one update of the autoencoder's weights on a mini-batch of samples is called an iteration). In practice, we update the target distribution using all embedded points every T iterations. See (3) and (4) for the update rules. When the target distribution is updated, the label assigned to x_i is obtained by

s_i = \arg\max_j q_{ij}    (14)

where q_{ij} is computed by (3). We will stop training if the label assignment change (in percentage) between two consecutive updates of the target distribution is less than a threshold δ.

The whole algorithm is summarized in Algorithm 1.

Algorithm 1: Improved Deep Embedded Clustering
Input: Input data X; number of clusters K; target distribution update interval T; stopping threshold δ; maximum iterations MaxIter.
Output: Autoencoder's weights W and W'; cluster centers µ and labels s.
1   Initialize µ, W and W' according to Section 3.1.
2   for iter ∈ {0, 1, ..., MaxIter} do
3       if iter % T == 0 then
4           Compute all embedded points {z_i = f_W(x_i)}_{i=1}^n.
5           Update P using (3), (4) and {z_i}_{i=1}^n.
6           Save the last label assignment: s_old = s.
7           Compute new label assignments s via (14).
8           if sum(s_old ≠ s)/n < δ then
9               Stop training.
10      Choose a batch of samples S ⊂ X.
11      Update µ, W' and W via (11), (12) and (13) on S.
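Putting the pieces together, a minimal training loop in the shape of Algorithm 1 might look as follows. It assumes the two-output Keras model and the target_distribution helper from the earlier sketches, with the k-means centers already loaded into the clustering layer; it is a sketch, not the authors' implementation.

```python
import numpy as np

def train_idec(idec, x, target_distribution, T=140, batch_size=256,
               tol=0.001, max_iter=20000):
    """Sketch of Algorithm 1 for the assumed two-output model
    (reconstruction, soft assignment q)."""
    n, index, s = x.shape[0], 0, None
    for it in range(max_iter):
        if it % T == 0:
            q = idec.predict(x)[1]                      # q_ij over all samples (Eq. 3)
            p = target_distribution(q)                  # refresh target P (Eq. 4)
            s_old, s = s, q.argmax(axis=1)              # labels via Eq. (14)
            if s_old is not None and np.mean(s != s_old) < tol:
                break                                   # label change < delta: stop
        idx = np.arange(index, min(index + batch_size, n))
        idec.train_on_batch(x[idx], [x[idx], p[idx]])   # joint update of W, W', mu
        index = index + batch_size if index + batch_size < n else 0
    return s
```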

It is not difficult to see that the time complexity of the IDEC algorithm is O(nD^2 + ndK), where D, d and K are the maximum number of neurons in the hidden layers, the dimension of the embedding layer and the number of clusters. Generally K ≤ d ≤ D holds, so the time complexity is O(nD^2).

4 Experiments

4.1 Datasets

The proposed IDEC method is evaluated on two image datasets and one text dataset:

• MNIST: The MNIST dataset [LeCun et al., 1998] consists of 70000 handwritten digits of 28x28 pixel size. We reshaped each gray image to a 784 dimensional vector.

• USPS: The USPS dataset contains 9298 gray-scale handwritten digit images with a size of 16x16 pixels. The features are floating point values in [0, 2].

• REUTERS-10K: Reuters contains around 810000 English news stories labeled with a category tree [Lewis et al., 2004]. Following DEC [Xie et al., 2016], we used 4 root categories (corporate/industrial, government/social, markets and economics) as labels and excluded all documents with multiple labels. Restricted by computational resources, we randomly sampled a subset of 10000 examples and computed tf-idf features on the 2000 most frequent words. The sampled dataset is referred to as REUTERS-10K.

Table 1: Dataset statistics

Dataset        # examples   # classes   Dimension
MNIST          70000        10          784
USPS           9298         10          256
REUTERS-10K    10000        4           2000

For all algorithms, we preprocessed the datasets in the same way as DEC, i.e. normalizing each example x_i ∈ X so that \frac{1}{d}\|x_i\|_2^2 ≈ 1.
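Assuming the data sit in a NumPy array x of shape (n, d), one simple reading of this preprocessing is a global rescaling (per-example scaling would be the alternative interpretation); this is a sketch, not the exact preprocessing script.

```python
import numpy as np

def dec_normalize(x):
    """Rescale the dataset so that (1/d) * ||x_i||_2^2 is roughly 1 on average,
    mirroring the DEC-style preprocessing described above (one possible reading)."""
    d = x.shape[1]
    scale = np.sqrt(np.mean(np.sum(x ** 2, axis=1)) / d)
    return x / scale
```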
4.2 Experiment Setup

Comparing methods. We demonstrate the effectiveness of our IDEC algorithm mainly by comparing with DEC [Xie et al., 2016], which can be viewed as a special case of IDEC when the reconstruction term is set to zero. We use the publicly available code released by the authors to report the performance of DEC. The two-stage deep clustering algorithm is denoted as AE+k-means, which means performing the k-means algorithm on the embedded features of the pretrained autoencoder. This is the same as the result of DEC and IDEC before training with the clustering loss. For the sake of completeness, two traditional and classic clustering algorithms, k-means and Spectral Embedded Clustering (SEC) [Nie et al., 2011], are also included in the comparison. k-means is run 20 times with different initializations and the result with the best objective value is chosen. SEC is a variant of spectral clustering with a linearity regularization explicitly added, and it outperforms traditional spectral clustering methods on a wide range of datasets according to [Nie et al., 2011]. The parameters of SEC are fixed to the default values in the code provided by the authors.

Parameters setting. Following the settings in DEC [Xie et al., 2016], the encoder network is set as a fully connected multilayer perceptron (MLP) with dimensions d-500-500-2000-10 for all datasets, where d is the dimension of the input data (features). The decoder network is a mirror of the encoder, i.e. an MLP with dimensions 10-2000-500-500-d. Except for the input, output and embedding layers, all internal layers are activated by the ReLU nonlinearity [Glorot et al., 2011]. The autoencoder pretraining is set exactly the same as in [Xie et al., 2016]; please refer to that paper for more details. After pretraining, the coefficient γ of the clustering loss is set to 0.1 (determined by a grid search in {0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0}) and the batch size to 256 for all datasets. The optimizer Adam [Kingma and Ba, 2014] with initial learning rate λ = 0.001, β1 = 0.9, β2 = 0.999 is applied for the MNIST dataset, and SGD with learning rate λ = 0.1 and momentum β = 0.99 is used for the USPS and REUTERS-10K datasets. The convergence threshold is set to δ = 0.1%, and the update intervals T are 140, 30 and 3 iterations for MNIST, USPS and REUTERS-10K respectively. Our implementation is based on Python and Keras [Chollet, 2015] and is available at https://github.com/XifengGuo/IDEC.

Evaluation Metric. All clustering methods are evaluated by clustering accuracy (ACC) and Normalized Mutual Information (NMI), which are widely used in the unsupervised learning scenario.
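For completeness, a common way to compute these two metrics, with the cluster-to-class mapping found by the Hungarian algorithm, is sketched below using scipy and scikit-learn; this is not necessarily the authors' evaluation script.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Unsupervised ACC: best one-to-one mapping between predicted clusters
    and true labels via the Hungarian algorithm."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    D = max(y_pred.max(), y_true.max()) + 1
    cost = np.zeros((D, D), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1
    row, col = linear_sum_assignment(-cost)   # maximize the matched counts
    return cost[row, col].sum() / y_pred.size

# acc = clustering_accuracy(y, labels)
# nmi = normalized_mutual_info_score(y, labels)
```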
4.3 Results

Table 2: Comparison of clustering performance in terms of accuracy (%) and NMI (%, in brackets).

Methods       MNIST           USPS            REUTERS-10K
k-means       53.24           66.82           51.62
SEC           80.37           N/A             60.08
AE+k-means    81.82 (74.73)   69.31 (66.20)   70.52 (39.79)
DEC           86.55 (83.72)   74.08 (75.29)   73.68 (49.76)
IDEC          88.06 (86.72)   76.05 (78.46)   75.64 (49.81)

We report the results of all comparing algorithms on the 3 datasets in Table 2. As it shows, the deep clustering algorithms AE+k-means, DEC and IDEC outperform the traditional clustering algorithms k-means and Spectral Embedded Clustering (SEC) [Nie et al., 2011] by a large margin, which indicates the fascinating potential of deep learning in the unsupervised clustering field. The performance gap between AE+k-means and DEC reflects the effect of the clustering loss. And the outperformance of IDEC over DEC demonstrates that the autoencoder can help improve clustering performance.
[Figure 2: Accuracies and losses during training on MNIST. Top: clustering accuracy (%) vs. iteration for IDEC and DEC. Bottom: DEC's loss and IDEC's total, clustering and reconstruction losses vs. iteration.]

Figure 2 illustrates the behavior of DEC and IDEC during training on MNIST. We observe the following phenomena. First, the final accuracies comply with the results in Table 2, i.e. IDEC outperforms DEC. Second, IDEC converges more slowly than DEC because of the fluctuation of the reconstruction loss. Third, IDEC has a larger clustering loss and a higher clustering accuracy than DEC. This implies that the objective of DEC may mislead the clustering procedure by distorting the embedded feature space and breaking the intrinsic structure of the data. Finally, the reconstruction losses at the last few iterations approximately equal the loss at the beginning. This implies that the performance improvement from DEC to IDEC is not likely due to any clustering ability of the autoencoder itself. Actually, we did conduct an experiment that finetunes the autoencoder using only the reconstruction loss L_r (by setting the coefficient γ in (6) to 0) via various optimizers, and no improvement in terms of clustering accuracy was observed. So we assume that the autoencoder plays the role of preserving the local structure of the data, and under this condition the clustering loss can manipulate the embedded space to get better clustering accuracy.

[Figure 3: Visualization of clustering results on a subset of MNIST during training (t-SNE). Different colors mark different clusters. The first row is ours (epochs 0, 15, 30) and the second row corresponds to DEC (epochs 0, 5, 10). The proposed IDEC converges more slowly since it optimizes the reconstruction loss as well. Both methods separate clusters well, but the data structure in the first row is preserved better than in DEC. Note the points with red and blue color: they are totally mixed together in DEC while still somehow separable in our IDEC.]

We further prove our assumption about the role the autoencoder plays by visualizing the embedded feature space during training. The t-SNE [Maaten and Hinton, 2008] visualization on a random subset of MNIST with 1000 samples is shown in Fig. 3. From left to right in the top row, i.e. the training process of IDEC, the "shape" of each cluster is almost maintained. On the contrary, the "shape" in the bottom row changes a lot as training proceeds. Furthermore, focusing on the clusters colored red and blue (digits 4 and 9), in the first column they are still separable but become nearly indistinguishable in the last column. This is a loophole of DEC's objective (clustering loss). Our IDEC does not fully overcome this problem, but it does go further than DEC. To validate this, see the figures in the last column: the blue and red clusters of IDEC are still somehow separable, while in DEC they are totally mixed up. This problem was not observed in Figure 5 of [Xie et al., 2016], but it indeed happens when using their released code. This is also pointed out by [Zheng et al., 2016]. It can be concluded that the autoencoder can preserve the intrinsic structure of the data generating distribution and hence help the clustering loss to manipulate the embedded feature space appropriately.
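Such a visualization is easy to reproduce; a small sketch with scikit-learn and matplotlib, assuming the encoder from the earlier sketches and a label vector y for coloring, is:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embedding(encoder, x, y, n=1000):
    """Project a random subset of the embedded points to 2-D with t-SNE and
    color by label, as in Fig. 3."""
    idx = np.random.choice(len(x), n, replace=False)
    z = encoder.predict(x[idx])
    z2 = TSNE(n_components=2).fit_transform(z)
    plt.scatter(z2[:, 0], z2[:, 1], c=y[idx], cmap='tab10', s=5)
    plt.show()
```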
To see how the coefficient γ of the clustering loss in (6) affects the performance of the IDEC algorithm, we conduct an experiment on the MNIST dataset by sampling γ in the range [10^{-2}, 10^{2}]. The optimizer is set as SGD with momentum β = 0.9, the same as DEC's default setting, for a fair comparison. The learning rate λ is set to 0.1, 0.01, 0.001 and 0.0001 successively. As shown in Figure 4, we have the following observations:

[Figure 4: The effect of learning rate λ and clustering coefficient γ in (6) on clustering performance for the MNIST dataset. Left: ACC (%); right: NMI (%); curves for IDEC and DEC at λ ∈ {0.1, 0.01, 0.001, 0.0001}, with γ varied in [10^{-2}, 10^{2}].]

• For the best learning rate, IDEC (λ = 0.1) outperforms DEC (λ = 0.01) when γ ∈ [0.05, 1.0]. A γ that is too small eliminates the positive effect of the clustering loss term, while a large value tends to distort the latent feature space. When γ → 0, the clustering result approaches the result of AE+k-means.

• The learning rate λ and the clustering coefficient γ are coupled. A larger γ requires a smaller λ to maintain performance, but the combination of a small γ and a large λ leads to higher performance. So we recommend γ = 0.1, as we used in all experiments.

5 Conclusion

This paper proposes the Improved Deep Embedded Clustering (IDEC) algorithm, which jointly performs clustering and learns embedded features that are suitable for clustering and preserve the local structure of the data generating distribution. IDEC manipulates the feature space to scatter data by optimizing a KL divergence based clustering loss with a self-training target distribution, and it maintains the local structure by incorporating an autoencoder. Empirical experiments demonstrate that structure preservation is vital to deep clustering algorithms and can favor clustering performance. Future work includes: adding more prior knowledge (e.g. sparsity) to the IDEC framework, and incorporating convolutional layers for image datasets.
Acknowledgments

This work was financially supported by the National Natural Science Foundation of China (Project nos. 60970034, 61170287, 61232016 and 61672528).

References

[Bishop, 2006] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[Chen, 2015] Gang Chen. Deep learning with nonparametric clustering. arXiv preprint arXiv:1501.03084, 2015.

[Chollet, 2015] François Chollet. Keras, 2015.

[Glorot et al., 2011] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. Journal of Machine Learning Research, 15:315–323, 2011.

[Goodfellow et al., 2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

[Kingma and Ba, 2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[LeCun et al., 1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[Lewis et al., 2004] David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5(Apr):361–397, 2004.

[Maaten and Hinton, 2008] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[MacQueen, 1967] James MacQueen. Some methods for classification and analysis of multivariate observations. In Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. Oakland, CA, USA, 1967.

[Nie et al., 2011] Feiping Nie, Zinan Zeng, Ivor W. Tsang, Dong Xu, and Changshui Zhang. Spectral embedded clustering: A framework for in-sample and out-of-sample spectral clustering. IEEE Transactions on Neural Networks, 22(11):1796–1808, 2011.

[Nigam and Ghani, 2000] Kamal Nigam and Rayid Ghani. Analyzing the effectiveness and applicability of co-training. In International Conference on Information and Knowledge Management, pages 86–93. ACM, 2000.

[Peng et al., 2016] Xi Peng, Shijie Xiao, Jiashi Feng, Wei-Yun Yau, and Zhang Yi. Deep subspace clustering with sparsity prior. In International Joint Conference on Artificial Intelligence (IJCAI), 2016.

[Tian et al., 2014] Fei Tian, Bin Gao, Qing Cui, Enhong Chen, and Tie-Yan Liu. Learning deep representations for graph clustering. In AAAI, pages 1293–1299, 2014.

[Von Luxburg, 2007] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.

[Xie et al., 2016] Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning (ICML), 2016.

[Yang et al., 2016] Jianwei Yang, Devi Parikh, and Dhruv Batra. Joint unsupervised learning of deep representations and image clusters. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5147–5156, 2016.

[Zheng et al., 2016] Yin Zheng, Huachun Tan, Bangsheng Tang, Hanning Zhou, et al. Variational deep embedding: A generative approach to clustering. arXiv preprint arXiv:1611.05148, 2016.
