representation learning. However, it has seldom been applied for deep clustering.

In reality, integrating structural information into deep clustering usually needs to address the following two problems. (1) What structural information should be considered in deep clustering? It is well known that structural information indicates the underlying similarity among data samples. However, the structure of data is usually very complex, i.e., there is not only the direct relationship between samples (also known as first-order structure), but also the high-order structure. The high-order structure imposes the similarity constraint through more than one-hop relationships between samples. Taking the second-order structure as an example, it implies that for two samples with no direct relationship, if they have many common neighbor samples, they should still have similar representations. When the structure of data is sparse, which always holds in practice, the high-order structure is of particular importance. Therefore, only utilizing the low-order structure in deep clustering is far from sufficient, and how to effectively consider the higher-order structure is the first problem. (2) What is the relation between the structural information and deep clustering? The basic component of deep clustering is the Deep Neural Network (DNN), e.g., the autoencoder. The network architecture of an autoencoder is very complex, consisting of multiple layers, and each layer captures different latent information. There are also various types of structural information between data. Therefore, what is the relation between different structures and different layers in the autoencoder? On the one hand, one can use the structure to regularize the representation learned by the autoencoder in some way; on the other hand, one can also directly learn the representation from the structure itself. How to elegantly combine the structure of data with the autoencoder structure is another problem.

In order to capture the structural information, we first construct a K-Nearest Neighbor (KNN) graph, which is able to reveal the underlying structure of the data [? ?]. To capture the low-order and high-order structural information from the KNN graph, we propose a GCN module, consisting of multiple graph convolutional layers, to learn the GCN-specific representation.

In order to introduce structural information into deep clustering, we introduce an autoencoder module to learn the autoencoder-specific representation from the raw data, and propose a delivery operator to combine it with the GCN-specific representation. We theoretically prove that the delivery operator is able to better assist the integration between the autoencoder and GCN. In particular, we prove that GCN provides an approximate second-order graph regularization for the representation learned by the autoencoder, and that the representation learned by the autoencoder can alleviate the over-smoothing issue in GCN.

Finally, because both the autoencoder and GCN modules output representations, we propose a dual self-supervised module to uniformly guide these two modules. Through the dual self-supervised module, the whole model can be trained in an end-to-end manner for the clustering task.

In summary, we highlight the main contributions as follows:

• We propose a novel Structural Deep Clustering Network (SDCN) for deep clustering. The proposed SDCN effectively combines the strengths of both autoencoder and GCN with a novel delivery operator and a dual self-supervised module. To the best of our knowledge, this is the first time that structural information is explicitly integrated into deep clustering.
• We give a theoretical analysis of our proposed SDCN and prove that GCN provides an approximate second-order graph regularization for the DNN representations, and that the data representation learned in SDCN is equivalent to the sum of the representations with different-order structural information. Based on our theoretical analysis, the over-smoothing issue of the GCN module in SDCN is effectively alleviated.
• Extensive experiments on six real-world datasets demonstrate the superiority of SDCN in comparison with the state-of-the-art techniques. Specifically, SDCN achieves significant improvements (17% on NMI, 28% on ARI) over the best baseline method.

2 RELATED WORK

In this section, we introduce the most related work: deep clustering and graph clustering with GCN.

Deep clustering methods aim to combine deep representation learning with the clustering objective. For example, [27] proposes the deep clustering network, which uses the loss function of K-means to help the autoencoder learn a "K-means-friendly" data representation. Deep embedded clustering (DEC) [26] designs a KL-divergence loss that pulls the representation learned by the autoencoder closer to the cluster centers, thus improving cluster cohesion. Improved deep embedded clustering (IDEC) [4] adds a reconstruction loss to the objective of DEC as a constraint to help the autoencoder learn a better data representation. Variational deep embedding [9] is able to model the data generation process and the clusters jointly by using a deep variational autoencoder, so as to achieve better clustering results. [8] proposes deep subspace clustering networks, which use a novel self-expressive layer between the encoder and the decoder. It is able to mimic the "self-expressiveness" property in subspace clustering, thus obtaining a more expressive representation. DeepCluster [3] treats the clustering results as pseudo labels so that it can be applied to training deep neural networks on large datasets. However, all of these methods only focus on learning the representation of data from the samples themselves. Another important source of information for representation learning, the structure of data, is largely ignored by these methods.

To cope with the structural information underlying the data, some GCN-based clustering methods have been widely applied. For instance, [10] proposes the graph autoencoder and the graph variational autoencoder, which use GCN as an encoder to integrate the graph structure into the node features to learn the node embeddings. Deep attentional embedded graph clustering [25] uses an attention network to capture the importance of the neighboring nodes and employs the KL-divergence loss from DEC to supervise the training process of graph clustering. All GCN-based clustering methods mentioned above rely on reconstructing the adjacency matrix to update the model, and they can only learn data representations from the graph structure, which ignores the characteristics of the data itself. Moreover, the performance of this type of method might be limited by the overlapping community structure.
Figure 1: The framework of our proposed SDCN. X and X̂ are the input data and the reconstructed data, respectively. H^(ℓ) and Z^(ℓ) are the representations in the ℓ-th layer of the DNN module and the GCN module, respectively. Different colors represent the different representations H^(ℓ) learned by the DNN module. The blue solid line indicates that the target distribution P is calculated from the distribution Q, and the two red dotted lines represent the dual self-supervised mechanism: the target distribution P guides the update of the DNN module and the GCN module at the same time.
3 THE PROPOSED MODEL

In this section, we introduce our proposed structural deep clustering network, whose overall framework is shown in Figure 1. We first construct a KNN graph based on the raw data. Then we input the raw data and the KNN graph into the autoencoder and the GCN, respectively. We connect each layer of the autoencoder with the corresponding layer of the GCN, so that we can integrate the autoencoder-specific representation into the structure-aware representation by a delivery operator. Meanwhile, we propose a dual self-supervised mechanism to supervise the training progress of the autoencoder and the GCN. We describe our proposed model in detail in the following.

3.1 KNN Graph

Assume that we have the raw data X ∈ R^(N×d), where each row x_i represents the i-th sample, N is the number of samples and d is the dimension. For each sample, we first find its top-K similar neighbors and set edges to connect it with its neighbors. There are many ways to calculate the similarity matrix S ∈ R^(N×N) of the samples. Here we list the two popular approaches we used in constructing the KNN graph:

1) Heat Kernel. The similarity between samples i and j is calculated by

S_ij = e^(−‖x_i − x_j‖² / t),   (1)

where t is the time parameter of the heat kernel. This measure is used for continuous data, e.g., images.

2) Dot-product. The similarity between samples i and j is calculated by

S_ij = x_j^T x_i.   (2)

For discrete data, e.g., bag-of-words, we use the dot-product similarity so that the similarity is related only to the number of identical words.

After calculating the similarity matrix S, we select the top-K most similar points of each sample as its neighbors to construct an undirected K-nearest neighbor graph. In this way, we can obtain the adjacency matrix A from the non-graph data.
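To make this construction concrete, the following NumPy sketch builds the KNN adjacency matrix from raw features. It is an illustrative implementation of Eqs. 1-2 and the top-K selection, not the authors' released code; the function name and default arguments are our own.

```python
import numpy as np

def knn_graph(X, k=10, method="heat", t=1.0):
    """Illustrative sketch of the KNN-graph construction described above.
    X: (N, d) raw feature matrix. Returns a symmetric 0/1 adjacency matrix A."""
    if method == "heat":
        # Heat kernel (Eq. 1): S_ij = exp(-||x_i - x_j||^2 / t)
        sq_dist = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
        S = np.exp(-sq_dist / t)
    else:
        # Dot-product similarity (Eq. 2) for discrete features, e.g. bag-of-words
        S = X @ X.T
    np.fill_diagonal(S, -np.inf)            # exclude self-similarity
    N = X.shape[0]
    A = np.zeros((N, N))
    topk = np.argsort(-S, axis=1)[:, :k]    # indices of the K most similar samples
    rows = np.repeat(np.arange(N), k)
    A[rows, topk.ravel()] = 1.0
    return np.maximum(A, A.T)               # make the graph undirected
```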
3.2 DNN Module

As mentioned before, learning an effective data representation is of great importance to deep clustering. There are several alternative unsupervised methods to learn representations for different types of data, for example, the denoising autoencoder [24], the convolutional autoencoder [19], the LSTM encoder-decoder [18] and the adversarial autoencoder [17]. They are all variations of the basic autoencoder [7]. In this paper, for the sake of generality, we employ the basic autoencoder to learn the representations of the raw data in order to accommodate different kinds of data characteristics. We assume that there are L layers in the autoencoder and ℓ represents the layer number. Specifically, the representation learned by the ℓ-th layer of the encoder, H^(ℓ), can be obtained as follows:

H^(ℓ) = ϕ(W_e^(ℓ) H^(ℓ−1) + b_e^(ℓ)).   (3)

Similarly, the decoder part reconstructs the representation layer by layer:

H^(ℓ) = ϕ(W_d^(ℓ) H^(ℓ−1) + b_d^(ℓ)),   (4)

where W_d^(ℓ) and b_d^(ℓ) are the weight matrix and bias of the ℓ-th layer in the decoder, respectively.

The output of the decoder part is the reconstruction of the raw data, X̂ = H^(L), which results in the following objective function:

L_res = (1/2N) Σ_{i=1}^{N} ‖x_i − x̂_i‖₂² = (1/2N) ‖X − X̂‖_F².   (5)
3.3 GCN Module

The autoencoder is able to learn useful representations from the data itself, e.g., H^(1), H^(2), · · · , H^(L), while ignoring the relationship between samples. In this section, we introduce how to use the GCN module to propagate the representations generated by the DNN module. Once all the representations learned by the DNN module are integrated into the GCN, the GCN-learned representation is able to accommodate two different kinds of information, i.e., the data itself and the relationship between data. In particular, with the weight matrix W, the representation learned by the ℓ-th layer of GCN, Z^(ℓ), can be obtained by the following convolutional operation:

Z^(ℓ) = ϕ(D̃^(−1/2) Ã D̃^(−1/2) Z^(ℓ−1) W^(ℓ−1)),   (6)

where Ã = A + I and D̃_ii = Σ_j Ã_ij. Here I is the identity matrix, which adds a self-loop to each node of the adjacency matrix A. As can be seen from Eq. 6, the representation Z^(ℓ−1) is propagated through the normalized adjacency matrix D̃^(−1/2) Ã D̃^(−1/2) to obtain the new representation Z^(ℓ). Considering that the representation H^(ℓ−1) learned by the autoencoder is able to reconstruct the data itself and contains different valuable information, we combine the two representations Z^(ℓ−1) and H^(ℓ−1) together to get a more complete and powerful representation as follows:

Z̃^(ℓ−1) = (1 − ϵ) Z^(ℓ−1) + ϵ H^(ℓ−1),   (7)

where ϵ is a balance coefficient, and we uniformly set it to 0.5 here. In this way, we connect the autoencoder and the GCN layer by layer.

Then we use Z̃^(ℓ−1) as the input of the ℓ-th GCN layer to generate the representation Z^(ℓ):

Z^(ℓ) = ϕ(D̃^(−1/2) Ã D̃^(−1/2) Z̃^(ℓ−1) W^(ℓ−1)).   (8)

As we can see in Eq. 8, the autoencoder-specific representation H^(ℓ−1) is propagated through the normalized adjacency matrix D̃^(−1/2) Ã D̃^(−1/2). Because the representations learned by each DNN layer are different, to preserve as much information as possible, we transfer the representation learned by each DNN layer into the corresponding GCN layer for information propagation, as shown in Figure 1. The delivery operator works L times in the whole model. We will theoretically analyze the advantages of this delivery operator in Section 3.5.

Note that the input of the first GCN layer is the raw data X:

Z^(1) = ϕ(D̃^(−1/2) Ã D̃^(−1/2) X W^(1)).   (9)

The last layer of the GCN module is a multi-class classification layer with a softmax function:

Z = softmax(D̃^(−1/2) Ã D̃^(−1/2) Z^(L) W^(L)).   (10)

The result z_ij ∈ Z indicates the probability that sample i belongs to cluster center j, and we can treat Z as a probability distribution.
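The following PyTorch sketch shows one GCN layer together with the delivery operator of Eq. 7. The normalization helper implements D̃^(−1/2) Ã D̃^(−1/2) with a dense matrix for clarity (Section 3.6 notes that a sparse implementation is used in practice); class and argument names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def normalize_adj(A):
    """Normalized adjacency D^-1/2 (A + I) D^-1/2 used in Eqs. 6-10 (dense sketch)."""
    A_tilde = A + torch.eye(A.size(0))
    d_inv_sqrt = A_tilde.sum(1).pow(-0.5)
    D_inv_sqrt = torch.diag(d_inv_sqrt)
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(in_dim, out_dim))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, z, adj_norm, h=None, eps=0.5, act=F.relu):
        # Delivery operator (Eq. 7): mix the GCN representation with the
        # autoencoder representation of the same layer before propagating.
        if h is not None:
            z = (1 - eps) * z + eps * h
        out = adj_norm @ z @ self.weight       # propagation of Eq. 8 (or Eq. 9 with z = X)
        return act(out) if act is not None else out
```

For the last layer, the activation can be replaced by a row-wise softmax to obtain the distribution Z of Eq. 10.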
3.4 Dual Self-Supervised Module

Now we have connected the autoencoder with the GCN in the neural network architecture. However, neither of them is designed for deep clustering. Basically, the autoencoder is mainly used for data representation learning, which is an unsupervised learning scenario, while the traditional GCN works in a semi-supervised learning scenario. Neither of them can be directly applied to the clustering problem. Here, we propose a dual self-supervised module, which unifies the autoencoder and GCN modules in a uniform framework and effectively trains the two modules end-to-end for clustering.

In particular, for the i-th sample and the j-th cluster, we use the Student's t-distribution [16] as a kernel to measure the similarity between the data representation h_i and the cluster center vector µ_j as follows:

q_ij = (1 + ‖h_i − µ_j‖² / v)^(−(v+1)/2) / Σ_{j′} (1 + ‖h_i − µ_{j′}‖² / v)^(−(v+1)/2),   (11)

where h_i is the i-th row of H^(L), µ_j is initialized by K-means on the representations learned by the pre-trained autoencoder, and v is the degree of freedom of the Student's t-distribution. q_ij can be considered as the probability of assigning sample i to cluster j, i.e., a soft assignment. We treat Q = [q_ij] as the distribution of the assignments of all samples and let v = 1 for all experiments.

After obtaining the clustering result distribution Q, we aim to optimize the data representation by learning from the high-confidence assignments. Specifically, we want to make the data representations closer to the cluster centers, thus improving the cluster cohesion. Hence, we calculate a target distribution P as follows:

p_ij = (q_ij² / f_j) / Σ_{j′} (q_{ij′}² / f_{j′}),   (12)

where f_j = Σ_i q_ij are the soft cluster frequencies. In the target distribution P, each assignment in Q is squared and normalized so that the assignments have higher confidence, leading to the following objective function:

L_clu = KL(P ‖ Q) = Σ_i Σ_j p_ij log(p_ij / q_ij).   (13)

By minimizing the KL divergence loss between the Q and P distributions, the target distribution P can help the DNN module learn a better representation for the clustering task, i.e., making the data representations gather more closely around the cluster centers. This is regarded as a self-supervised mechanism, because the target distribution P is calculated from the distribution Q, and the P distribution in turn supervises the updating of the distribution Q. (Although some previous work tends to call this mechanism self-training, we prefer the term "self-supervised" to be consistent with the GCN training method.)
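The two distributions follow directly from Eqs. 11 and 12; a short PyTorch sketch (with illustrative function names) is:

```python
import torch

def soft_assignment(h, mu, v=1.0):
    """Student's t kernel of Eq. 11: soft assignment Q between the
    representations h (N x d') and the cluster centers mu (K x d')."""
    dist_sq = torch.cdist(h, mu).pow(2)
    q = (1.0 + dist_sq / v).pow(-(v + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """Target distribution P of Eq. 12: square and renormalize Q so that
    high-confidence assignments are emphasized."""
    weight = q.pow(2) / q.sum(dim=0, keepdim=True)   # q_ij^2 / f_j
    return weight / weight.sum(dim=1, keepdim=True)
```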
As for training the GCN module, one possible way is to treat the clustering assignments as ground-truth labels [3]. However, this strategy would bring noise and trivial solutions, and lead to the collapse of the whole model. As mentioned before, the GCN module will also
provide a clustering assignment distribution Z. Therefore, we can use the distribution P to supervise the distribution Z as follows:

L_gcn = KL(P ‖ Z) = Σ_i Σ_j p_ij log(p_ij / z_ij).   (14)
There are two advantages of this objective function: (1) compared with the traditional multi-class classification loss function, the KL divergence updates the entire model in a more "gentle" way, which prevents the data representations from being severely disturbed; (2) both the GCN and the DNN modules are unified in the same optimization target, making their results tend to be consistent during the training process. Because the goal of the DNN module and the GCN module is to approximate the target distribution P, which creates a strong connection between the two modules, we call it a dual self-supervised mechanism.

Through this mechanism, SDCN can directly concentrate two different objectives, i.e., the clustering objective and the classification objective, in one loss function. Thus, the overall loss function of our proposed SDCN is

L = L_res + α L_clu + β L_gcn,   (15)

where α > 0 is a hyper-parameter that balances the clustering optimization and the local structure preservation of the raw data, and β > 0 is a coefficient that controls the disturbance of the GCN module to the embedding space.

In practice, after training for the maximum number of epochs, SDCN obtains a stable result, and we can then assign labels to the samples. We choose the soft assignments in the distribution Z as the final clustering results, because the representations learned by the GCN contain two different kinds of information. The label assigned to sample i is

r_i = argmax_j z_ij,   (16)

where z_ij is calculated in Eq. 10. The training process of the whole model is shown in Algorithm 1.

Algorithm 1: Training process of SDCN
Input: Input data X, graph G, number of clusters K, maximum iterations MaxIter;
Output: Clustering results R;
1  Initialize W_e^(ℓ), b_e^(ℓ), W_d^(ℓ), b_d^(ℓ) with the pre-trained autoencoder;
2  Initialize µ with K-means on the representations learned by the pre-trained autoencoder;
3  Initialize W^(ℓ) randomly;
4  for iter ∈ 0, 1, · · · , MaxIter do
5      Generate the DNN representations H^(1), H^(2), · · · , H^(L);
6      Use H^(L) to compute the distribution Q via Eq. 11;
7      Calculate the target distribution P via Eq. 12;
8      for ℓ ∈ 1, · · · , L do
9          Apply the delivery operator with ϵ = 0.5: Z̃^(ℓ) = ½ Z^(ℓ) + ½ H^(ℓ);
10         Generate the next GCN layer representation Z^(ℓ+1) = ϕ(D̃^(−1/2) Ã D̃^(−1/2) Z̃^(ℓ) W_g^(ℓ));
11     end
12     Calculate the distribution Z via Eq. 10;
13     Feed H^(L) into the decoder to reconstruct the raw data X̂;
14     Calculate L_res, L_clu and L_gcn, respectively;
15     Calculate the loss function via Eq. 15;
16     Back-propagate and update the parameters of SDCN;
17 end
18 Calculate the clustering results R based on the distribution Z;
19 return R;
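A sketch of one training iteration following Algorithm 1 is shown below. It reuses the AutoEncoder, GCNLayer, soft_assignment and target_distribution helpers sketched earlier in this section, and it is illustrative only; in particular, the handling of the first and last GCN layers is a simplification of Eqs. 9-10.

```python
import torch
import torch.nn.functional as F

def sdcn_training_step(x, adj_norm, ae, gcn_layers, mu, alpha=0.1, beta=0.01, v=1.0):
    """One SDCN optimization step (illustrative): returns the loss of Eq. 15
    and the cluster labels of Eq. 16."""
    hs, x_hat = ae(x)                          # DNN representations H^(1..L), reconstruction
    q = soft_assignment(hs[-1], mu, v)         # Eq. 11
    p = target_distribution(q).detach()        # Eq. 12, treated as a fixed target

    z = gcn_layers[0](x, adj_norm)                     # Eq. 9: first layer takes raw data X
    for layer, h in zip(gcn_layers[1:-1], hs[:-1]):    # Eqs. 7-8: delivery + propagation
        z = layer(z, adj_norm, h=h, eps=0.5)
    z = gcn_layers[-1](z, adj_norm, h=hs[-1], eps=0.5, act=None)
    z = F.softmax(z, dim=1)                    # Eq. 10: cluster distribution Z

    loss_res = 0.5 * F.mse_loss(x_hat, x)                    # Eq. 5 (up to a constant scale)
    loss_clu = F.kl_div(q.log(), p, reduction="batchmean")   # Eq. 13: KL(P || Q)
    loss_gcn = F.kl_div(z.log(), p, reduction="batchmean")   # Eq. 14: KL(P || Z)
    loss = loss_res + alpha * loss_clu + beta * loss_gcn     # Eq. 15

    labels = z.argmax(dim=1)                   # Eq. 16: final assignment from Z
    return loss, labels
```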
3.5 Theory Analysis

In this section, we analyze how SDCN introduces structural information into the autoencoder. Before that, we give the definitions of graph regularization and second-order graph regularization.

Definition 1. Graph regularization [2]. Given a weighted graph G, the objective of graph regularization is to minimize the following equation:

Σ_{ij} (1/2) ‖h_i − h_j‖₂² w_ij,   (17)

where w_ij is the weight of the edge between node i and node j, and h_i is the representation of node i.

Based on Definition 1, we can find that graph regularization requires that if there is a larger weight between nodes i and j, their representations should be more similar.

Definition 2. Second-order similarity. We assume that A is the adjacency matrix of graph G and a_i is the i-th column of A. The second-order similarity between node i and node j is

s_ij = a_i^T a_j / (‖a_i‖ ‖a_j‖) = a_i^T a_j / √(d_i d_j) = C / √(d_i d_j),   (18)

where C is the number of common neighbors between node i and node j, and d_i is the degree of node i.

Definition 3. Second-order graph regularization. The objective of second-order graph regularization is to minimize the equation

Σ_{i,j} (1/2) ‖h_i − h_j‖₂² s_ij,   (19)

where s_ij is the second-order similarity.

Compared with Definition 1, Definition 3 imposes a high-order constraint, i.e., if two nodes have many common neighbors, their representations should also be more similar.

Theorem 1. GCN provides an approximate second-order graph regularization for the DNN representations.

Proof. Here we focus on the ℓ-th layer of SDCN. h_i is the i-th row of H^(ℓ), representing the data representation of sample i learned by the autoencoder, and ĥ_i = ϕ(Σ_{j∈N_i} h_j / √(d_i d_j) W) is the representation h_i after passing through the GCN layer. Here we assume that ϕ(x) = x and W = I, so that ĥ_i can be seen as the average of the neighbor representations. Hence we can divide ĥ_i into three parts: the node representation h_i/d_i, the sum of the common neighbor
representations S = Σ_{p ∈ N_i ∩ N_j} h_p / √d_p, and the sum of the non-common neighbor representations D_i = Σ_{q ∈ N_i − N_i ∩ N_j} h_q / √d_q, where N_i denotes the neighbors of node i. The distance between the representations ĥ_i and ĥ_j is:

‖ĥ_i − ĥ_j‖₂
= ‖ (h_i/d_i − h_j/d_j) + (S/√d_i − S/√d_j) + (D_i/√d_i − D_j/√d_j) ‖₂
≤ ‖ h_i/d_i − h_j/d_j ‖₂ + ((√d_i − √d_j) / √(d_i d_j)) ‖S‖₂ + ‖ D_i/√d_i − D_j/√d_j ‖₂
≤ ‖ h_i/d_i − h_j/d_j ‖₂ + ((√d_i − √d_j) / √(d_i d_j)) ‖S‖₂ + ‖D_i‖₂/√d_i + ‖D_j‖₂/√d_j.   (20)

We can find that the first term of Eq. 20 is independent of the second-order similarity. Hence the upper bound of the distance between two node representations is only related to the second and third terms. For the second term of Eq. 20, if d_i ≪ d_j, then w_ij ≤ √(d_i/d_j), which is very small and not consistent with the precondition. If d_i ≈ d_j, the effect of the second term is negligible and can be ignored. For the third term of Eq. 20, if two nodes have many common neighbors, their non-common neighbors will be very few, and the values of ‖D_i‖₂/√d_i and ‖D_j‖₂/√d_j are positively correlated with the number of non-common neighbors. Therefore, if the second-order similarity s_ij is large, the upper bound of ‖ĥ_i − ĥ_j‖₂ drops. In an extreme case, i.e., w_ij = 1, ‖ĥ_i − ĥ_j‖₂ = (1/d)‖h_i − h_j‖₂. □

This shows that after the DNN representations pass through the GCN layer, the representations of nodes with large second-order similarity are forced to be close to each other, which is the same as the idea of second-order graph regularization.
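For reference, the second-order similarity of Definition 2 can be computed for a whole graph in a few lines; this NumPy sketch (illustrative names, assuming a symmetric 0/1 adjacency matrix without isolated nodes) mirrors Eq. 18:

```python
import numpy as np

def second_order_similarity(A):
    """Second-order similarity of Eq. 18: s_ij = C / sqrt(d_i * d_j)."""
    d = A.sum(axis=1)              # node degrees
    C = A @ A.T                    # C[i, j] = number of common neighbors of i and j
    return C / np.sqrt(np.outer(d, d))
```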
Theorem 2. The representation Z^(ℓ) learned by SDCN is equivalent to the sum of the representations with different-order structural information.

Proof. For simplicity of the proof, let us assume that ϕ(x) = x, b_e^(ℓ) = 0 and W_g^(ℓ) = I, ∀ℓ ∈ [1, 2, · · · , L]. We can rewrite Eq. 8 as

Z^(ℓ+1) = Â Z̃^(ℓ) = (1 − ϵ) Â Z^(ℓ) + ϵ Â H^(ℓ),   (21)

where Â = D̃^(−1/2) Ã D̃^(−1/2). After the L-th propagation step, the result is

Z^(L) = (1 − ϵ)^L Â^L X + ϵ Σ_{ℓ=1}^{L} (1 − ϵ)^(ℓ−1) Â^ℓ H^(ℓ).   (22)

Note that Â^L X is the output of a standard GCN, which may suffer from the over-smoothing problem. Moreover, if L → ∞, the left term tends to 0 and the right term dominates the data representation. We can clearly see that the right term is the sum of different representations, i.e., H^(ℓ), with different-order structural information. □

The advantages of the delivery operator in Eq. 7 are two-fold: one is that the data representation Z^(ℓ) learned by SDCN contains different structural information; the other is that it can alleviate the over-smoothing phenomenon in GCN. Because multilayer GCNs focus on higher-order information, the GCN module in SDCN is the sum of the representations with different-order structural information. Similar to [12], our method also uses the fusion of different information to alleviate the over-smoothing phenomenon in GCN. However, different from [12], which treats different-order adjacency matrices with the same representations, our SDCN gives different representations to different-order adjacency matrices. This makes our model incorporate more information.

3.6 Complexity Analysis

In this work, we denote by d the dimension of the input data, and the dimensions of the autoencoder layers are d_1, d_2, · · · , d_L. The size of the weight matrix in the first layer of the encoder is W_e^(1) ∈ R^(d×d_1), and N is the number of input samples. The time complexity of the autoencoder is O(N d² d_1² · · · d_L²). As for the GCN module, because the operation of GCN can be efficiently implemented with sparse matrices, its time complexity is linear in the number of edges |E|, namely O(|E| d d_1 · · · d_L). In addition, we suppose that there are K classes in the clustering task, so the time complexity of Eq. 11 is O(NK + N log N), corresponding to [26]. The overall time complexity of our model is O(N d² d_1² · · · d_L² + |E| d d_1 · · · d_L + NK + N log N), which is linearly related to the numbers of samples and edges.

4 EXPERIMENTS

4.1 Datasets

Table 1: The statistics of the datasets.

Dataset    Type    Samples  Classes  Dimension
USPS       Image   9298     10       256
HHAR       Record  10299    6        561
Reuters    Text    10000    4        2000
ACM        Graph   3025     3        1870
DBLP       Graph   4058     4        334
Citeseer   Graph   3327     6        3703

Our proposed SDCN is evaluated on six datasets. The statistics of these datasets are shown in Table 1 and the detailed descriptions are as follows:

• USPS [13]: The USPS dataset contains 9298 gray-scale handwritten digit images with a size of 16×16 pixels. The features are the gray values of the pixels, and all features are normalized to [0, 2].
• HHAR [23]: The Heterogeneity Human Activity Recognition (HHAR) dataset contains 10299 sensor records from smart phones and smart watches. All samples are partitioned into 6 categories of human activities: biking, sitting, standing, walking, stair up and stair down.
• Reuters [14]: A text dataset containing around 810000 English news stories labeled with a category tree. We use the 4 root categories as labels.
Table 2: Clustering results on six datasets (mean±std). The bold numbers represent the best results and the numbers with
asterisk are the best results of the baselines.
Dataset Metric K-means AE DEC IDEC GAE VGAE DAEGC SDCN-Q SDCN
USPS
ACC 66.82±0.04 71.04±0.03 73.31±0.17 76.22±0.12∗ 63.10±0.33 56.19±0.72 73.55±0.40 77.09±0.21 78.08±0.19
NMI 62.63±0.05 67.53±0.03 70.58±0.25 75.56±0.06∗ 60.69±0.58 51.08±0.37 71.12±0.24 77.71±0.21 79.51±0.27
ARI 54.55±0.06 58.83±0.05 63.70±0.27 67.86±0.12∗ 50.30±0.55 40.96±0.59 63.33±0.34 70.18±0.22 71.84±0.24
F1 64.78±0.03 69.74±0.03 71.82±0.21 74.63±0.10∗ 61.84±0.43 53.63±1.05 72.45±0.49 75.88±0.17 76.98±0.18
HHAR
ACC 59.98±0.02 68.69±0.31 69.39±0.25 71.05±0.36 62.33±1.01 71.30±0.36 76.51±2.19∗ 83.46±0.23 84.26±0.17
NMI 58.86±0.01 71.42±0.97 72.91±0.39 74.19±0.39∗ 55.06±1.39 62.95±0.36 69.10±2.28 78.82±0.28 79.90±0.09
ARI 46.09±0.02 60.36±0.88 61.25±0.51 62.83±0.45∗ 42.63±1.63 51.47±0.73 60.38±2.15 71.75±0.23 72.84±0.09
F1 58.33±0.03 66.36±0.34 67.29±0.29 68.63±0.33 62.64±0.97 71.55±0.29 76.89±2.18∗ 81.45±0.14 82.58±0.08
Reuters
ACC 54.04±0.01 74.90±0.21 73.58±0.13 75.43±0.14∗ 54.40±0.27 60.85±0.23 65.50±0.13 79.30±0.11 77.15±0.21
NMI 41.54±0.51 49.69±0.29 47.50±0.34 50.28±0.17∗ 25.92±0.41 25.51±0.22 30.55±0.29 56.89±0.27 50.82±0.21
ARI 27.95±0.38 49.55±0.37 48.44±0.14 51.26±0.21∗ 19.61±0.22 26.18±0.36 31.12±0.18 59.58±0.32 55.36±0.37
F1 41.28±2.43 60.96±0.22 64.25±0.22∗ 63.21±0.12 43.53±0.42 57.14±0.17 61.82±0.13 66.15±0.15 65.48±0.08
ACM
ACC 67.31±0.71 81.83±0.08 84.33±0.76 85.12±0.52 84.52±1.44 84.13±0.22 86.94±2.83∗ 86.95±0.08 90.45±0.18
NMI 32.44±0.46 49.30±0.16 54.54±1.51 56.61±1.16∗ 55.38±1.92 53.20±0.52 56.18±4.15 58.90±0.17 68.31±0.25
ARI 30.60±0.69 54.64±0.16 60.64±1.87 62.16±1.50∗ 59.46±3.10 57.72±0.67 59.35±3.89 65.25±0.19 73.91±0.40
F1 67.57±0.74 82.01±0.08 84.51±0.74 85.11±0.48 84.65±1.33 84.17±0.23 87.07±2.79∗ 86.84±0.09 90.42±0.19
DBLP
ACC 38.65±0.65 51.43±0.35 58.16±0.56 60.31±0.62 61.21±1.22 58.59±0.06 62.05±0.48∗ 65.74±1.34 68.05±1.81
NMI 11.45±0.38 25.40±0.16 29.51±0.28 31.17±0.50 30.80±0.91 26.92±0.06 32.49±0.45∗ 35.11±1.05 39.50±1.34
ARI 6.97±0.39 12.21±0.43 23.92±0.39 25.37±0.60∗ 22.02±1.40 17.92±0.07 21.03±0.52 34.00±1.76 39.15±2.01
F1 31.92±0.27 52.53±0.36 59.38±0.51 61.33±0.56 61.41±2.23 58.69±0.07 61.75±0.67∗ 65.78±1.22 67.71±1.51
Citeseer
ACC 39.32±3.17 57.08±0.13 55.89±0.20 60.49±1.42 61.35±0.80 60.97±0.36 64.54±1.39∗ 61.67±1.05 65.96±0.31
NMI 16.94±3.22 27.64±0.08 28.34±0.30 27.17±2.40 34.63±0.65 32.69±0.27 36.41±0.86∗ 34.39±1.22 38.71±0.32
ARI 13.43±3.02 29.31±0.14 28.12±0.36 25.70±2.65 33.55±1.18 33.13±0.53 37.78±1.24∗ 35.50±1.49 40.17±0.43
F1 36.08±3.53 53.80±0.11 52.62±0.17 61.62±1.39 57.36±0.82 57.70±0.49 62.20±1.32∗ 57.82±0.98 63.62±0.24
SDCN. We train the autoencoder end-to-end using all data points for 30 epochs with a learning rate of 10^−3. In order to be consistent with previous methods [4, 26], we set the dimensions of the autoencoder to d-500-500-2000-10, where d is the dimension of the input data. The dimensions of the layers in the GCN module are the same as those of the autoencoder. As for the GCN-based methods, we set the dimensions of GAE and VGAE to d-256-16 and train them for 30 epochs on all datasets. For DAEGC, we use the settings of [25]. In the hyperparameter search, we try {1, 3, 5} for the update interval in DEC and IDEC and {1, 0.1, 0.01, 0.001} for the hyperparameter γ in IDEC, and report the best results. For our SDCN, we uniformly set α = 0.1 and β = 0.01 for all the datasets because our method is not sensitive to these hyperparameters. For the non-graph data, we train SDCN for 200 epochs, and for the graph data, we train it for 50 epochs, because a graph structure with prior knowledge, i.e., a citation network, contains more information than the KNN graph, which accelerates convergence. The batch size is set to 256 and the learning rate is set to 10^−3 for USPS, HHAR, ACM, DBLP and 10^−4 for Reuters, Citeseer. For all methods using the K-means algorithm to generate clustering assignments, we initialize K-means 20 times and select the best solution. We run all methods 10 times and report the average results to prevent extreme cases.
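For convenience, the settings described above can be collected in a single configuration object; the field names below are our own and purely illustrative.

```python
# Illustrative summary of the SDCN training settings described in this subsection.
SDCN_CONFIG = {
    "ae_dims": "d-500-500-2000-10",        # autoencoder layer sizes (d = input dimension)
    "ae_pretrain_epochs": 30,              # autoencoder training epochs
    "ae_pretrain_lr": 1e-3,
    "alpha": 0.1,                          # weight of L_clu in Eq. 15
    "beta": 0.01,                          # weight of L_gcn in Eq. 15
    "epochs": {"non_graph": 200, "graph": 50},
    "batch_size": 256,
    "lr": {"USPS": 1e-3, "HHAR": 1e-3, "ACM": 1e-3, "DBLP": 1e-3,
           "Reuters": 1e-4, "Citeseer": 1e-4},
    "kmeans_restarts": 20,                 # K-means initializations, best solution kept
    "runs": 10,                            # independent runs averaged in Table 2
}
```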
4.3 Analysis of Clustering Results

• … methods will decline. Besides, SDCN integrates structural information into deep clustering, so its clustering performance is better than these two methods.
• Comparing the results of AE with DEC and the results of GAE with DAEGC, we can find that the clustering loss function, defined in Eq. 13, plays an important role in improving the deep clustering performance, because IDEC and DAEGC can be seen as the combination of the clustering loss with AE and GAE, respectively. It improves the cluster cohesion by making the data representations closer to the cluster centers, thus improving the clustering results.

4.4 Analysis of Variants

We compare our model with two variants to verify the ability of GCN in learning structural information and the effectiveness of the delivery operator. Specifically, we define the following variants:

[Figure: clustering accuracy (60–90) of SDCN, SDCN-w/o and SDCN-MLP on (a) the datasets with KNN graphs and (b) the datasets with original graphs: USPS, HHAR, Reuters, ACM, DBLP, Citeseer.]
… term in Eq. 22 is not small enough, so that it is still plagued by the over-smoothing problem.

Table 3: Effect of different propagation layers (L).

ACM      ACC    NMI    ARI    F1
SDCN-4   90.45  68.31  73.91  90.42
SDCN-3   89.06  64.86  70.51  89.03
SDCN-2   89.12  66.48  70.94  89.04
SDCN-1   77.69  51.59  50.13  74.62

[Figure 4: accuracy of the SDCN-Q, SDCN-Z and SDCN-P distributions versus training iterations on four datasets.]

Figure 5: Clustering results with different K. [Panels: (a) accuracy on USPS, (b) accuracy on HHAR, (c) accuracy on Reuters, (d) NMI on USPS, (e) NMI on HHAR, (f) NMI on Reuters; methods: GAE, VGAE, DAEGC, SDCN; K ∈ {1, 3, 5, 10}.]

4.8 Analysis of Training Process

In this section, we analyze the training progress on the different datasets. Specifically, we want to explore how the clustering accuracy of the three sample-assignment distributions in SDCN varies with the number of iterations. In Figure 4, the red line SDCN-P, the blue line SDCN-Q and the orange line SDCN-Z represent the accuracy of the target distribution P, the distribution Q and the distribution Z, respectively. In most cases, the accuracy of SDCN-P is higher than that of SDCN-Q, which shows that the target distribution P is able to guide the update of the whole model. At the beginning, the accuracy of all three distributions decreases to different extents. Because the information learned by the autoencoder and the GCN is different, a conflict may arise between the results of the two modules, making the clustering results decline. Then the accuracy of SDCN-Q and SDCN-Z quickly increases to a high level, because the target distribution SDCN-P eases the conflict between the two modules, making their results tend to be consistent. In addition, we can see that with the increase of training epochs, the clustering results of SDCN tend to be stable and there is no significant fluctuation, indicating the good robustness of our proposed model.
REFERENCES
[1] Charu C Aggarwal and ChengXiang Zhai. 2012. A survey of text clustering algorithms. In Mining text data. Springer, 77–128.
[2] Mikhail Belkin and Partha Niyogi. 2003. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation 15, 6 (2003), 1373–1396.
[3] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. 2018. Deep clustering for unsupervised learning of visual features. In ECCV. 132–149.
[4] Xifeng Guo, Long Gao, Xinwang Liu, and Jianping Yin. 2017. Improved deep embedded clustering with local structure preservation. In IJCAI. 1753–1759.
[5] John A Hartigan and Manchek A Wong. 1979. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics) 28, 1 (1979), 100–108.
[6] William Grant Hatcher and Wei Yu. 2018. A survey of deep learning: platforms, applications and emerging research trends. IEEE Access 6 (2018), 24411–24432.
[7] Geoffrey E Hinton and Ruslan R Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science 313, 5786 (2006), 504–507.
[8] Pan Ji, Tong Zhang, Hongdong Li, Mathieu Salzmann, and Ian Reid. 2017. Deep subspace clustering networks. In NIPS. 24–33.
[9] Zhuxi Jiang, Yin Zheng, Huachun Tan, Bangsheng Tang, and Hanning Zhou. 2017. Variational deep embedding: An unsupervised and generative approach to clustering. IJCAI (2017).
[10] Thomas N Kipf and Max Welling. 2016. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308 (2016).
[11] Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. ICLR (2017).
[12] Johannes Klicpera, Aleksandar Bojchevski, and Stephan Günnemann. 2018. Predict then propagate: Graph neural networks meet personalized pagerank. (2018).
[13] Yann Le Cun, Ofer Matan, Bernhard Boser, John S Denker, Don Henderson, Richard E Howard, Wayne Hubbard, LD Jackel, and Henry S Baird. 1990. Handwritten zip code recognition with multilayer networks. In ICPR, Vol. 2. IEEE, 35–40.
[14] David D Lewis, Yiming Yang, Tony G Rose, and Fan Li. 2004. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research 5, Apr (2004), 361–397.
[15] Qimai Li, Zhichao Han, and Xiao-Ming Wu. 2018. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI.
[16] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, Nov (2008), 2579–2605.
[17] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. 2015. Adversarial autoencoders. arXiv preprint arXiv:1511.05644 (2015).
[18] Pankaj Malhotra, Anusha Ramakrishnan, Gaurangi Anand, Lovekesh Vig, Puneet Agarwal, and Gautam Shroff. 2016. LSTM-based encoder-decoder for multi-sensor anomaly detection. arXiv preprint arXiv:1607.00148 (2016).
[19] Jonathan Masci, Ueli Meier, Dan Cireşan, and Jürgen Schmidhuber. 2011. Stacked convolutional auto-encoders for hierarchical feature extraction. In ICANN. Springer, 52–59.
[20] Vinod Nair and Geoffrey E Hinton. 2010. Rectified linear units improve restricted boltzmann machines. In ICML. 807–814.
[21] Andrew Y Ng, Michael I Jordan, and Yair Weiss. 2002. On spectral clustering: Analysis and an algorithm. In NIPS. 849–856.
[22] Peter J Rousseeuw. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20 (1987), 53–65.
[23] Allan Stisen, Henrik Blunck, Sourav Bhattacharya, Thor Siiger Prentow, Mikkel Baun Kjærgaard, Anind Dey, Tobias Sonne, and Mads Møller Jensen. 2015. Smart devices are different: Assessing and mitigating mobile sensing heterogeneities for activity recognition. In SenSys. ACM, 127–140.
[24] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In ICML. ACM, 1096–1103.
[25] Chun Wang, Shirui Pan, Ruiqi Hu, Guodong Long, Jing Jiang, and Chengqi Zhang. 2019. Attributed graph clustering: A deep attentional embedding approach. IJCAI (2019).
[26] Junyuan Xie, Ross Girshick, and Ali Farhadi. 2016. Unsupervised deep embedding for clustering analysis. In ICML. 478–487.
[27] Bo Yang, Xiao Fu, Nicholas D Sidiropoulos, and Mingyi Hong. 2017. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In ICML. 3861–3870.
[28] Yi Yang, Dong Xu, Feiping Nie, Shuicheng Yan, and Yueting Zhuang. 2010. Image clustering using local discriminant models and global integration. IEEE Transactions on Image Processing 19, 10 (2010), 2761–2773.