
Structural Deep Clustering Network

Deyu Bo∗, Xiao Wang∗, Chuan Shi† (Beijing University of Posts and Telecommunications, Beijing, China)
Meiqi Zhu (Beijing University of Posts and Telecommunications, Beijing, China)
Emiao Lu (Tencent, Shenzhen, China)
Peng Cui (Tsinghua University, Beijing, China)

arXiv:2002.01633v3 [cs.LG] 12 Feb 2020

ABSTRACT

Clustering is a fundamental task in data analysis. Recently, deep clustering, which derives its inspiration primarily from deep learning approaches, achieves state-of-the-art performance and has attracted considerable attention. Current deep clustering methods usually boost the clustering results by means of the powerful representation ability of deep learning, e.g., autoencoders, suggesting that learning an effective representation for clustering is a crucial requirement. The strength of deep clustering methods is to extract the useful representations from the data itself, rather than from the structure of data, which receives scarce attention in representation learning. Motivated by the great success of Graph Convolutional Networks (GCN) in encoding the graph structure, we propose a Structural Deep Clustering Network (SDCN) to integrate the structural information into deep clustering. Specifically, we design a delivery operator to transfer the representations learned by the autoencoder to the corresponding GCN layer, and a dual self-supervised mechanism to unify these two different deep neural architectures and guide the update of the whole model. In this way, the multiple structures of data, from low-order to high-order, are naturally combined with the multiple representations learned by the autoencoder. Furthermore, we theoretically analyze the delivery operator, i.e., with the delivery operator, GCN improves the autoencoder-specific representation as a high-order graph regularization constraint, and the autoencoder helps alleviate the over-smoothing problem in GCN. Through comprehensive experiments, we demonstrate that our proposed model can consistently perform better than the state-of-the-art techniques.

KEYWORDS

deep clustering, graph convolutional network, neural network, self-supervised learning

ACM Reference Format:
Deyu Bo, Xiao Wang, Chuan Shi, Meiqi Zhu, Emiao Lu, and Peng Cui. 2020. Structural Deep Clustering Network. In Proceedings of The Web Conference 2020 (WWW '20), April 20–24, 2020, Taipei, Taiwan. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3366423.3380214

∗ Both authors contributed equally to this research.
† Corresponding author.

This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution.
WWW '20, April 20–24, 2020, Taipei, Taiwan
© 2020 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License.
ACM ISBN 978-1-4503-7023-3/20/04.
https://doi.org/10.1145/3366423.3380214

1 INTRODUCTION

Clustering, one of the most fundamental data analysis tasks, is to group similar samples into the same category [5, 21]. Over the past decades, a large family of clustering algorithms has been developed and successfully applied to various real-world applications, such as image clustering [28] and text clustering [1]. Recently, the breakthroughs in deep learning have led to a paradigm shift in artificial intelligence and machine learning, achieving great success on many important tasks, including clustering. Therefore, deep clustering has attracted significant attention [6]. The basic idea of deep clustering is to integrate the objective of clustering into the powerful representation ability of deep learning. Hence learning an effective data representation is a crucial prerequisite for deep clustering. For example, [27] uses the representation learned by an autoencoder in K-means; [4, 26] leverage a clustering loss to help the autoencoder learn a data representation with high cluster cohesion [22]; and [9] uses a variational autoencoder to learn a better data representation for clustering. To date, deep clustering methods have achieved state-of-the-art performance and become the de facto clustering methods.

Despite the success of deep clustering, these methods usually focus on the characteristics of the data itself, and thus seldom take the structure of the data into account when learning the representation. Notably, the importance of considering the relationships among data samples has been well recognized in previous literature and results in the data representation field. Such structure reveals the latent similarity among samples, and therefore provides a valuable guide for learning the representation. One typical method is spectral clustering [21], which treats the samples as nodes in a weighted graph and uses the graph structure of the data for clustering. Recently, the emerging Graph Convolutional Networks (GCN) [11] also encode both the graph structure and node attributes for node representation. In summary, structural information plays a crucial role in data representation learning. However, it has seldom been applied to deep clustering.
In reality, integrating structural information into deep clustering usually needs to address the following two problems. (1) What structural information should be considered in deep clustering? It is well known that structural information indicates the underlying similarity among data samples. However, the structure of data is usually very complex, i.e., there is not only the direct relationship between samples (also known as the first-order structure), but also the high-order structure. The high-order structure imposes the similarity constraint from more than one-hop relationships between samples. Taking the second-order structure as an example, it implies that for two samples with no direct relationship, if they have many common neighbor samples, they should still have similar representations. When the structure of data is sparse, which always holds in practice, the high-order structure is of particular importance. Therefore, only utilizing the low-order structure in deep clustering is far from sufficient, and how to effectively consider the higher-order structure is the first problem. (2) What is the relation between the structural information and deep clustering? The basic component of deep clustering is the Deep Neural Network (DNN), e.g., the autoencoder. The network architecture of an autoencoder is very complex, consisting of multiple layers, and each layer captures different latent information. There are also various types of structural information between data. Therefore, what is the relation between the different structures and the different layers in the autoencoder? One can use the structure to regularize the representation learned by the autoencoder in some way; on the other hand, one can also directly learn the representation from the structure itself. How to elegantly combine the structure of data with the autoencoder is another problem.

In order to capture the structural information, we first construct a K-Nearest Neighbor (KNN) graph, which is able to reveal the underlying structure of the data. To capture the low-order and high-order structural information from the KNN graph, we propose a GCN module, consisting of multiple graph convolutional layers, to learn the GCN-specific representation.

In order to introduce structural information into deep clustering, we introduce an autoencoder module to learn the autoencoder-specific representation from the raw data, and propose a delivery operator to combine it with the GCN-specific representation. We theoretically prove that the delivery operator is able to better assist the integration between the autoencoder and GCN. In particular, we prove that GCN provides an approximate second-order graph regularization for the representation learned by the autoencoder, and that the representation learned by the autoencoder can alleviate the over-smoothing issue in GCN.

Finally, because both the autoencoder and GCN modules output representations, we propose a dual self-supervised module to uniformly guide these two modules. Through the dual self-supervised module, the whole model can be trained in an end-to-end manner for the clustering task.

In summary, we highlight the main contributions as follows:

• We propose a novel Structural Deep Clustering Network (SDCN) for deep clustering. The proposed SDCN effectively combines the strengths of both autoencoder and GCN with a novel delivery operator and a dual self-supervised module. To the best of our knowledge, this is the first time that structural information is explicitly applied to deep clustering.
• We give a theoretical analysis of our proposed SDCN and prove that GCN provides an approximate second-order graph regularization for the DNN representations, and that the data representation learned in SDCN is equivalent to the sum of the representations with different-order structural information. Based on our theoretical analysis, the over-smoothing issue of the GCN module in SDCN will be effectively alleviated.
• Extensive experiments on six real-world datasets demonstrate the superiority of SDCN in comparison with the state-of-the-art techniques. Specifically, SDCN achieves significant improvements (17% on NMI, 28% on ARI) over the best baseline method.

2 RELATED WORK

In this section, we introduce the most related work: deep clustering and graph clustering with GCN.

Deep clustering methods aim to combine deep representation learning with the clustering objective. For example, [27] proposes the deep clustering network, using the loss function of K-means to help the autoencoder learn a "K-means-friendly" data representation. Deep embedding clustering [26] designs a KL-divergence loss to make the representation learned by the autoencoder surround the cluster centers more closely, thus improving the cluster cohesion. Improved deep embedding clustering [4] adds a reconstruction loss to the objective of DEC as a constraint to help the autoencoder learn a better data representation. Variational deep embedding [9] is able to model the data generation process and the clusters jointly by using a deep variational autoencoder, so as to achieve better clustering results. [8] proposes deep subspace clustering networks, which use a novel self-expressive layer between the encoder and the decoder. It is able to mimic the "self-expressiveness" property in subspace clustering, thus obtaining a more expressive representation. DeepCluster [3] treats the clustering results as pseudo labels so that it can be applied to training deep neural networks on large datasets. However, all of these methods only focus on learning the representation of data from the samples themselves. Another important source of information for learning representations, the structure of data, is largely ignored by these methods.

To cope with the structural information underlying the data, some GCN-based clustering methods have been widely applied. For instance, [10] proposes the graph autoencoder and the variational graph autoencoder, which use GCN as an encoder to integrate the graph structure into the node features to learn node embeddings. Deep attentional embedded graph clustering [25] uses an attention network to capture the importance of the neighboring nodes and employs the KL-divergence loss from DEC to supervise the training process of graph clustering. All the GCN-based clustering methods mentioned above rely on reconstructing the adjacency matrix to update the model, and they can only learn data representations from the graph structure, which ignores the characteristics of the data itself. Moreover, the performance of this type of method might be limited by the overlap between community structures.
[Figure 1 shows the DNN module (X → H(1), ..., H(L) → X̂ with reconstruction loss L_res = (1/2N)||X − X̂||_F^2), the GCN module (Z(1), ..., Z(L) → Z with L_gcn = KL(Z||P)), and the dual self-supervised module (L_clu = KL(Q||P)).]

Figure 1: The framework of our proposed SDCN. X and X̂ are the input data and the reconstructed data, respectively. H(ℓ) and Z(ℓ) are the representations in the ℓ-th layer of the DNN and GCN modules, respectively. Different colors represent the different representations H(ℓ) learned by the DNN module. The blue solid line indicates that the target distribution P is calculated from the distribution Q, and the two red dotted lines represent the dual self-supervised mechanism: the target distribution P guides the update of the DNN module and the GCN module at the same time.
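To make the data flow sketched in Figure 1 concrete before the formal description in Section 3, the following is a minimal PyTorch skeleton of the two branches and the cluster-center parameter used by the dual self-supervised module. It is an illustration, not the authors' released code; the class and attribute names are assumptions, and only the d-500-500-2000-10 layer sizes follow the parameter settings reported later in Section 4.2.

```python
# Minimal structural sketch of SDCN (illustrative only): an autoencoder branch,
# a stack of graph convolution layers fed through the delivery operator, and a
# cluster-center parameter for the dual self-supervised module.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GNNLayer(nn.Module):
    """One graph convolution: phi(adj @ x @ W), with adj the normalized adjacency."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(in_dim, out_dim))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, x, adj, act=F.relu):
        return act(adj @ (x @ self.weight))

class SDCN(nn.Module):
    def __init__(self, d, n_clusters, n_z=10):
        super().__init__()
        dims = [d, 500, 500, 2000, n_z]
        self.encoder = nn.ModuleList([nn.Linear(dims[i], dims[i + 1]) for i in range(4)])
        self.decoder = nn.ModuleList([nn.Linear(dims[i + 1], dims[i]) for i in reversed(range(4))])
        self.gcn_layers = nn.ModuleList([GNNLayer(dims[i], dims[i + 1]) for i in range(4)])
        self.gcn_out = GNNLayer(n_z, n_clusters)              # final layer uses softmax (Eq. 10)
        self.mu = nn.Parameter(torch.zeros(n_clusters, n_z))  # cluster centers for Eq. 11
```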

3 THE PROPOSED MODEL

In this section, we introduce our proposed structural deep clustering network, whose overall framework is shown in Figure 1. We first construct a KNN graph based on the raw data. Then we input the raw data and the KNN graph into the autoencoder and GCN, respectively. We connect each layer of the autoencoder with the corresponding layer of GCN, so that we can integrate the autoencoder-specific representation into the structure-aware representation by a delivery operator. Meanwhile, we propose a dual self-supervised mechanism to supervise the training progress of the autoencoder and GCN. We describe our proposed model in detail in the following.

3.1 KNN Graph

Assume that we have the raw data X ∈ R^{N×d}, where each row x_i represents the i-th sample, N is the number of samples and d is the dimension. For each sample, we first find its top-K similar neighbors and set edges to connect it with its neighbors. There are many ways to calculate the similarity matrix S ∈ R^{N×N} of the samples. Here we list the two popular approaches we used in constructing the KNN graph:

1) Heat Kernel. The similarity between samples i and j is calculated by

    S_{ij} = e^{-\frac{\|x_i - x_j\|^2}{t}},   (1)

where t is the time parameter in the heat conduction equation. We use the heat kernel for continuous data, e.g., images.

2) Dot-product. The similarity between samples i and j is calculated by

    S_{ij} = x_j^T x_i.   (2)

For discrete data, e.g., bag-of-words, we use the dot-product similarity so that the similarity is related only to the number of identical words.

After calculating the similarity matrix S, we select the top-K most similar points of each sample as its neighbors to construct an undirected K-nearest neighbor graph. In this way, we can get the adjacency matrix A from the non-graph data.
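As a concrete illustration of the construction above (a sketch under our own parameter names, not the authors' released preprocessing code), the following NumPy function builds the adjacency matrix A from raw features with either the heat kernel of Eq. 1 or the dot product of Eq. 2:

```python
# KNN-graph construction sketch for Section 3.1: compute a similarity matrix S,
# then keep the top-K most similar points of each sample as its neighbors.
import numpy as np

def knn_graph(X, topk=5, method="heat", t=1.0):
    """Return a symmetric 0/1 adjacency matrix A built from raw data X (N x d)."""
    if method == "heat":
        # S_ij = exp(-||x_i - x_j||^2 / t), suited to continuous data such as images (Eq. 1)
        sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        S = np.exp(-sq_dists / t)
    else:
        # S_ij = x_j^T x_i, suited to discrete data such as bag-of-words (Eq. 2)
        S = X @ X.T
    S = S.astype(float)
    np.fill_diagonal(S, -np.inf)                   # never pick a sample as its own neighbor
    neighbors = np.argsort(-S, axis=1)[:, :topk]   # top-K most similar samples per row
    A = np.zeros_like(S)
    rows = np.repeat(np.arange(X.shape[0]), topk)
    A[rows, neighbors.ravel()] = 1.0
    return np.maximum(A, A.T)                      # make the KNN graph undirected

# Example: A = knn_graph(np.random.rand(100, 16), topk=5, method="heat")
```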
3.2 DNN Module

As mentioned before, learning an effective data representation is of great importance to deep clustering. There are several alternative unsupervised methods to learn representations for different types of data, for example, the denoising autoencoder [24], the convolutional autoencoder [19], the LSTM encoder-decoder [18] and the adversarial autoencoder [17]. They are all variations of the basic autoencoder [7]. In this paper, for the sake of generality, we employ the basic autoencoder to learn the representations of the raw data in order to accommodate different kinds of data characteristics. We assume that there are L layers in the autoencoder and ℓ represents the layer number. Specifically, the representation learned by the ℓ-th layer of the encoder, H^{(ℓ)}, can be obtained as follows:

    H^{(\ell)} = \phi(W_e^{(\ell)} H^{(\ell-1)} + b_e^{(\ell)}),   (3)

where \phi is the activation function of the fully connected layers, such as ReLU [20] or the sigmoid function, and W_e^{(\ell)} and b_e^{(\ell)} are the weight matrix and bias of the ℓ-th layer in the encoder, respectively. Besides, we denote H^{(0)} as the raw data X.

The encoder part is followed by the decoder part, which reconstructs the input data through several fully connected layers:

    H^{(\ell)} = \phi(W_d^{(\ell)} H^{(\ell-1)} + b_d^{(\ell)}),   (4)

where W_d^{(\ell)} and b_d^{(\ell)} are the weight matrix and bias of the ℓ-th layer in the decoder, respectively.

The output of the decoder part is the reconstruction of the raw data, X̂ = H^{(L)}, which results in the following objective function:

    L_{res} = \frac{1}{2N} \sum_{i=1}^{N} \|x_i - \hat{x}_i\|_2^2 = \frac{1}{2N} \|X - \hat{X}\|_F^2.   (5)

3.3 GCN Module

The autoencoder is able to learn useful representations from the data itself, e.g., H^{(1)}, H^{(2)}, ..., H^{(L)}, while ignoring the relationship between samples. In this section, we introduce how to use the GCN module to propagate the representations generated by the DNN module. Once all the representations learned by the DNN module are integrated into GCN, the GCN-learnable representation will be able to accommodate two different kinds of information, i.e., the data itself and the relationship between data. In particular, with the weight matrix W, the representation learned by the ℓ-th layer of GCN, Z^{(ℓ)}, can be obtained by the following convolutional operation:

    Z^{(\ell)} = \phi(\widetilde{D}^{-\frac{1}{2}} \widetilde{A} \widetilde{D}^{-\frac{1}{2}} Z^{(\ell-1)} W^{(\ell-1)}),   (6)

where \widetilde{A} = A + I and \widetilde{D}_{ii} = \sum_j \widetilde{A}_{ij}. Here I is the identity matrix added to the adjacency matrix A for the self-loop in each node. As can be seen from Eq. 6, the representation Z^{(\ell-1)} is propagated through the normalized adjacency matrix \widetilde{D}^{-\frac{1}{2}} \widetilde{A} \widetilde{D}^{-\frac{1}{2}} to obtain the new representation Z^{(\ell)}. Considering that the representation H^{(\ell-1)} learned by the autoencoder is able to reconstruct the data itself and contains different valuable information, we combine the two representations Z^{(\ell-1)} and H^{(\ell-1)} together to get a more complete and powerful representation as follows:

    \widetilde{Z}^{(\ell-1)} = (1 - \epsilon) Z^{(\ell-1)} + \epsilon H^{(\ell-1)},   (7)

where ϵ is a balance coefficient, and we uniformly set it to 0.5 here. In this way, we connect the autoencoder and GCN layer by layer.

Then we use \widetilde{Z}^{(\ell-1)} as the input of the ℓ-th layer of GCN to generate the representation Z^{(\ell)}:

    Z^{(\ell)} = \phi(\widetilde{D}^{-\frac{1}{2}} \widetilde{A} \widetilde{D}^{-\frac{1}{2}} \widetilde{Z}^{(\ell-1)} W^{(\ell-1)}).   (8)

As we can see in Eq. 8, the autoencoder-specific representation H^{(\ell-1)} will be propagated through the normalized adjacency matrix \widetilde{D}^{-\frac{1}{2}} \widetilde{A} \widetilde{D}^{-\frac{1}{2}}. Because the representations learned by each DNN layer are different, to preserve as much information as possible, we transfer the representations learned from each DNN layer into the corresponding GCN layer for information propagation, as in Figure 1. The delivery operator works L times in the whole model. We theoretically analyze the advantages of this delivery operator in Section 3.5.

Note that the input of the first GCN layer is the raw data X:

    Z^{(1)} = \phi(\widetilde{D}^{-\frac{1}{2}} \widetilde{A} \widetilde{D}^{-\frac{1}{2}} X W^{(1)}).   (9)

The last layer of the GCN module is a multi-class classification layer with a softmax function:

    Z = \mathrm{softmax}(\widetilde{D}^{-\frac{1}{2}} \widetilde{A} \widetilde{D}^{-\frac{1}{2}} Z^{(L)} W^{(L)}).   (10)

The result z_{ij} ∈ Z indicates the probability that sample i belongs to cluster center j, and we can treat Z as a probability distribution.
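As an illustration of Eqs. 6-9, the small NumPy sketch below builds the normalized adjacency matrix and performs one delivery-operator propagation step. It is a simplification under our own function and variable names (W1, W2 are placeholder weight matrices), not the authors' implementation.

```python
# Normalized adjacency (Eq. 6) and one delivery-operator + propagation step (Eqs. 7-8).
import numpy as np

def normalize_adj(A):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a binary adjacency matrix A."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_delivery_step(Z_prev, H_prev, A_hat, W, eps=0.5, act=lambda x: np.maximum(x, 0)):
    """Eq. 7: mix the GCN and DNN representations; Eq. 8: propagate and transform."""
    Z_mix = (1.0 - eps) * Z_prev + eps * H_prev    # delivery operator, eps = 0.5 in the paper
    return act(A_hat @ Z_mix @ W)                  # phi(D~^-1/2 A~ D~^-1/2 Z~ W)

# Usage sketch with placeholder weights W1, W2:
# A_hat = normalize_adj(A)                          # A from the KNN graph or the citation graph
# Z1 = gcn_delivery_step(X, X, A_hat, W1, eps=0.0)  # first layer uses the raw data X (Eq. 9)
# Z2 = gcn_delivery_step(Z1, H1, A_hat, W2)         # later layers mix in H^(l) from the DNN
```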
3.4 Dual Self-Supervised Module

Now we have connected the autoencoder with GCN in the neural network architecture. However, they are not designed for deep clustering. Basically, the autoencoder is mainly used for data representation learning, which is an unsupervised learning scenario, while the traditional GCN operates in a semi-supervised learning scenario. Neither of them can be directly applied to the clustering problem. Here, we propose a dual self-supervised module, which unifies the autoencoder and GCN modules in a uniform framework and effectively trains the two modules end-to-end for clustering.

In particular, for the i-th sample and the j-th cluster, we use the Student's t-distribution [16] as a kernel to measure the similarity between the data representation h_i and the cluster center vector \mu_j as follows:

    q_{ij} = \frac{(1 + \|h_i - \mu_j\|^2 / v)^{-\frac{v+1}{2}}}{\sum_{j'} (1 + \|h_i - \mu_{j'}\|^2 / v)^{-\frac{v+1}{2}}},   (11)

where h_i is the i-th row of H^{(L)}, \mu_j is initialized by K-means on the representations learned by the pre-trained autoencoder, and v is the degrees of freedom of the Student's t-distribution. q_{ij} can be considered as the probability of assigning sample i to cluster j, i.e., a soft assignment. We treat Q = [q_{ij}] as the distribution of the assignments of all samples and let v = 1 for all experiments.

After obtaining the clustering result distribution Q, we aim to optimize the data representation by learning from the high-confidence assignments. Specifically, we want to make the data representation closer to the cluster centers, thus improving the cluster cohesion. Hence, we calculate a target distribution P as follows:

    p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}},   (12)

where f_j = \sum_i q_{ij} are the soft cluster frequencies. In the target distribution P, each assignment in Q is squared and normalized so that the assignments have higher confidence, leading to the following objective function:

    L_{clu} = KL(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}.   (13)

By minimizing the KL divergence loss between the Q and P distributions, the target distribution P can help the DNN module learn a better representation for the clustering task, i.e., making the data representation surround the cluster centers more closely. This is regarded as a self-supervised mechanism¹, because the target distribution P is calculated from the distribution Q, and the P distribution supervises the updating of the distribution Q in turn.

¹ Although some previous work tends to call this mechanism self-training, we prefer the term "self-supervised" to be consistent with the GCN training method.

As for training the GCN module, one possible way is to treat the clustering assignments as ground-truth labels [3]. However, this strategy will bring noise and trivial solutions, and lead to the collapse of the whole model. As mentioned before, the GCN module also provides a clustering assignment distribution Z. Therefore, we can use the distribution P to supervise the distribution Z as follows:

    L_{gcn} = KL(P \| Z) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{z_{ij}}.   (14)

This objective function has two advantages: (1) compared with the traditional multi-classification loss function, the KL divergence updates the entire model in a more "gentle" way, preventing the data representations from severe disturbances; (2) both the GCN and DNN modules are unified in the same optimization target, making their results tend to be consistent during the training process. Because the goal of the DNN module and the GCN module is to approximate the target distribution P, which forms a strong connection between the two modules, we call it a dual self-supervised mechanism.

Through this mechanism, SDCN can directly concentrate two different objectives, i.e., the clustering objective and the classification objective, in one loss function. Thus, the overall loss function of our proposed SDCN is:

    L = L_{res} + \alpha L_{clu} + \beta L_{gcn},   (15)

where α > 0 is a hyper-parameter that balances the clustering optimization and the local structure preservation of the raw data, and β > 0 is a coefficient that controls the disturbance of the GCN module to the embedding space.

In practice, after training up to the maximum number of epochs, SDCN obtains a stable result. Then we can assign labels to the samples. We choose the soft assignments in distribution Z as the final clustering results, because the representations learned by GCN contain two different kinds of information. The label assigned to sample i is:

    r_i = \arg\max_j z_{ij},   (16)

where z_{ij} is calculated by Eq. 10.

The algorithm of the whole model is shown in Algorithm 1.

Algorithm 1: Training process of SDCN
Input: Input data X, graph G, number of clusters K, maximum iterations MaxIter;
Output: Clustering results R;
1:  Initialize W_e^{(ℓ)}, b_e^{(ℓ)}, W_d^{(ℓ)}, b_d^{(ℓ)} with the pre-trained autoencoder;
2:  Initialize µ with K-means on the representations learned by the pre-trained autoencoder;
3:  Initialize W^{(ℓ)} randomly;
4:  for iter ∈ 0, 1, ..., MaxIter do
5:      Generate the DNN representations H^{(1)}, H^{(2)}, ..., H^{(L)};
6:      Use H^{(L)} to compute the distribution Q via Eq. 11;
7:      Calculate the target distribution P via Eq. 12;
8:      for ℓ ∈ 1, ..., L do
9:          Use the delivery operator with ϵ = 0.5: \widetilde{Z}^{(ℓ)} = ½ Z^{(ℓ)} + ½ H^{(ℓ)};
10:         Generate the next GCN layer representation Z^{(ℓ+1)} = \phi(\widetilde{D}^{-\frac{1}{2}} \widetilde{A} \widetilde{D}^{-\frac{1}{2}} \widetilde{Z}^{(ℓ)} W_g^{(ℓ)});
11:     end
12:     Calculate the distribution Z via Eq. 10;
13:     Feed H^{(L)} to the decoder to reconstruct the raw data X̂;
14:     Calculate L_{res}, L_{clu} and L_{gcn}, respectively;
15:     Calculate the loss function via Eq. 15;
16:     Back-propagate and update the parameters in SDCN;
17: end
18: Calculate the clustering results based on distribution Z;
19: return R;
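The following PyTorch sketch mirrors the loss computation of Eqs. 11-15 and a single update in the spirit of Algorithm 1. It is a hedged illustration rather than the released training code: `model` is assumed to expose the reconstruction, H^{(L)}, the distribution Z and the cluster centers `mu`; the values alpha = 0.1 and beta = 0.01 follow the parameter settings reported in Section 4.2.

```python
# Dual self-supervised objective (Eqs. 11-15) and one training step of Algorithm 1.
import torch
import torch.nn.functional as F

def soft_assignment(h, mu, v=1.0):
    """Eq. 11: Student's t kernel between embeddings h (N x z) and centers mu (K x z)."""
    dist_sq = torch.cdist(h, mu) ** 2
    q = (1.0 + dist_sq / v) ** (-(v + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """Eq. 12: square and renormalize Q to sharpen high-confidence assignments."""
    weight = q ** 2 / q.sum(dim=0)                        # q_ij^2 / f_j
    return weight / weight.sum(dim=1, keepdim=True)

def sdcn_loss(x, x_hat, q, z, alpha=0.1, beta=0.01):
    """Eq. 15: L = L_res + alpha * L_clu + beta * L_gcn."""
    p = target_distribution(q).detach()                   # P supervises both branches
    l_res = F.mse_loss(x_hat, x)                          # reconstruction loss (Eq. 5, up to scaling)
    l_clu = F.kl_div(q.log(), p, reduction="batchmean")   # KL(P || Q), Eq. 13
    l_gcn = F.kl_div(z.log(), p, reduction="batchmean")   # KL(P || Z), Eq. 14
    return l_res + alpha * l_clu + beta * l_gcn

# One iteration, mirroring lines 5-16 of Algorithm 1 (hypothetical model interface):
# x_hat, h, z = model(x, adj)           # DNN forward, delivery operator, GCN forward
# q = soft_assignment(h, model.mu)
# loss = sdcn_loss(x, x_hat, q, z)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```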
3.5 Theory Analysis

In this section, we analyze how SDCN introduces structural information into the autoencoder. Before that, we give the definitions of graph regularization and second-order graph regularization.

Definition 1. Graph regularization [2]. Given a weighted graph G, the objective of graph regularization is to minimize the following equation:

    \sum_{ij} \frac{1}{2} \|h_i - h_j\|_2^2 \, w_{ij},   (17)

where w_{ij} is the weight of the edge between node i and node j, and h_i is the representation of node i.

Based on Definition 1, we can find that graph regularization indicates that if there is a larger weight between nodes i and j, their representations should be more similar.

Definition 2. Second-order similarity. We assume that A is the adjacency matrix of graph G and a_i is the i-th column of A. The second-order similarity between node i and node j is

    s_{ij} = \frac{a_i^T a_j}{\|a_i\| \|a_j\|} = \frac{a_i^T a_j}{\sqrt{d_i} \sqrt{d_j}} = \frac{C}{\sqrt{d_i} \sqrt{d_j}},   (18)

where C is the number of common neighbors between node i and node j, and d_i is the degree of node i.

Definition 3. Second-order graph regularization. The objective of second-order graph regularization is to minimize the equation

    \sum_{i,j} \frac{1}{2} \|h_i - h_j\|_2^2 \, s_{ij},   (19)

where s_{ij} is the second-order similarity.

Compared with Definition 1, Definition 3 imposes a high-order constraint, i.e., if two nodes have many common neighbors, their representations should also be more similar.

Theorem 1. GCN provides an approximate second-order graph regularization for the DNN representations.

Proof. Here we focus on the ℓ-th layer of SDCN. h_i is the i-th row of H^{(ℓ)}, representing the data representation of sample i learned by the autoencoder, and \hat{h}_i = \phi(\sum_{j \in N_i} \frac{h_j}{\sqrt{d_i}\sqrt{d_j}} W) is the representation h_i after passing through the GCN layer. Here we assume that \phi(x) = x and W = I, so \hat{h}_i can be seen as the average of the neighbor representations. Hence we can divide \hat{h}_i into three parts: the node representation \frac{h_i}{d_i}, the sum of common neighbor representations S = \sum_{p \in N_i \cap N_j} \frac{h_p}{\sqrt{d_p}}, and the sum of non-common neighbor representations D_i = \sum_{q \in N_i - N_i \cap N_j} \frac{h_q}{\sqrt{d_q}}, where N_i is the set of neighbors of node i. The distance between the representations \hat{h}_i and \hat{h}_j is:

    \|\hat{h}_i - \hat{h}_j\|_2
    = \Big\| \Big(\frac{h_i}{d_i} - \frac{h_j}{d_j}\Big) + \Big(\frac{S}{\sqrt{d_i}} - \frac{S}{\sqrt{d_j}}\Big) + \Big(\frac{D_i}{\sqrt{d_i}} - \frac{D_j}{\sqrt{d_j}}\Big) \Big\|_2
    \le \Big\|\frac{h_i}{d_i} - \frac{h_j}{d_j}\Big\|_2 + \frac{\sqrt{d_i} - \sqrt{d_j}}{\sqrt{d_i}\sqrt{d_j}} \|S\|_2 + \Big\|\frac{D_i}{\sqrt{d_i}} - \frac{D_j}{\sqrt{d_j}}\Big\|_2
    \le \Big\|\frac{h_i}{d_i} - \frac{h_j}{d_j}\Big\|_2 + \frac{\sqrt{d_i} - \sqrt{d_j}}{\sqrt{d_i}\sqrt{d_j}} \|S\|_2 + \frac{\|D_i\|_2}{\sqrt{d_i}} + \frac{\|D_j\|_2}{\sqrt{d_j}}.   (20)

We can find that the first term of Eq. 20 is independent of the second-order similarity. Hence the upper bound of the distance between two node representations is only related to the second and third terms. For the second term of Eq. 20, if d_i ≪ d_j, then w_{ij} ≤ \sqrt{d_i / d_j}, which is very small and not consistent with the precondition; if d_i ≈ d_j, the effect of the second term is paltry and can be ignored. For the third term of Eq. 20, if two nodes have many common neighbors, their non-common neighbors will be very few, and the values of \frac{\|D_i\|_2}{\sqrt{d_i}} and \frac{\|D_j\|_2}{\sqrt{d_j}} are positively correlated with the non-common neighbors. If the second-order similarity s_{ij} is large, the upper bound of \|\hat{h}_i - \hat{h}_j\|_2 will drop. In an extreme case, i.e., w_{ij} = 1, \|\hat{h}_i - \hat{h}_j\|_2 = \frac{1}{d}\|h_i - h_j\|_2. □

This shows that after the DNN representations pass through the GCN layer, for nodes with a large second-order similarity, GCN will force the representations of the nodes to be close to each other, which is the same as the idea of second-order graph regularization.

Theorem 2. The representation Z^{(ℓ)} learned by SDCN is equivalent to the sum of the representations with different-order structural information.

Proof. For the simplicity of the proof, let us assume that \phi(x) = x, b^{(ℓ)} = 0 and W_g^{(ℓ)} = I, ∀ℓ ∈ [1, 2, ..., L]. We can rewrite Eq. 8 as

    Z^{(\ell+1)} = \hat{A} \widetilde{Z}^{(\ell)} = (1 - \epsilon)\hat{A} Z^{(\ell)} + \epsilon \hat{A} H^{(\ell)},   (21)

where \hat{A} = \widetilde{D}^{-\frac{1}{2}} \widetilde{A} \widetilde{D}^{-\frac{1}{2}}. After the L-th propagation step, the result is

    Z^{(L)} = (1 - \epsilon)^L \hat{A}^L X + \epsilon \sum_{\ell=1}^{L} (1 - \epsilon)^{\ell-1} \hat{A}^{\ell} H^{(\ell)}.   (22)

Note that \hat{A}^L X is the output of the standard GCN, which may suffer from the over-smoothing problem. Moreover, if L → ∞, the left term tends to 0 and the right term dominates the data representation. We can clearly see that the right term is the sum of different representations, i.e., H^{(ℓ)}, with different-order structural information. □

The advantages of the delivery operator in Eq. 7 are two-fold: one is that the data representation Z^{(ℓ)} learned by SDCN contains different structural information; the other is that it can alleviate the over-smoothing phenomenon in GCN. Because multilayer GCNs focus on higher-order information, the GCN module in SDCN is the sum of the representations with different-order structural information. Similar to [12], our method also uses the fusion of different-order information to alleviate the over-smoothing phenomenon in GCN. However, different from [12], which treats different-order adjacency matrices with the same representations, our SDCN gives different representations to different-order adjacency matrices. This makes our model incorporate more information.
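As a small numerical illustration of Definition 2, which underpins the analysis above, the toy NumPy check below verifies that the cosine similarity between two adjacency columns of an unweighted graph equals the common-neighbor count divided by the geometric mean of the degrees (Eq. 18). The graph is an arbitrary example, not one of the datasets used in the paper.

```python
# Toy check of the second-order similarity identity in Eq. 18.
import numpy as np

A = np.array([[0, 1, 1, 1, 0],
              [1, 0, 1, 0, 1],
              [1, 1, 0, 1, 1],
              [1, 0, 1, 0, 0],
              [0, 1, 1, 0, 0]], dtype=float)   # simple undirected graph, no self-loops

i, j = 0, 1
a_i, a_j = A[:, i], A[:, j]
d_i, d_j = a_i.sum(), a_j.sum()                # node degrees
C = a_i @ a_j                                  # number of common neighbors of i and j

s_cosine = a_i @ a_j / (np.linalg.norm(a_i) * np.linalg.norm(a_j))
s_counts = C / (np.sqrt(d_i) * np.sqrt(d_j))
print(s_cosine, s_counts)                      # both equal 1/3 for this toy graph
```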
3.6 Complexity Analysis

In this work, we denote d as the dimension of the input data, and the dimensions of the autoencoder layers are d_1, d_2, ..., d_L. The size of the weight matrix in the first layer of the encoder is W_e^{(1)} ∈ R^{d×d_1}, and N is the number of input samples. The time complexity of the autoencoder is O(N d^2 d_1^2 \cdots d_L^2). As for the GCN module, because the GCN operation can be efficiently implemented with sparse matrices, the time complexity is linear in the number of edges |E|, namely O(|E| d d_1 \cdots d_L). In addition, we suppose that there are K classes in the clustering task, so the time complexity of Eq. 11 is O(NK + N \log N), corresponding to [26]. The overall time complexity of our model is O(N d^2 d_1^2 \cdots d_L^2 + |E| d d_1 \cdots d_L + NK + N \log N), which is linear in the number of samples and edges.

4 EXPERIMENTS

4.1 Datasets

Our proposed SDCN is evaluated on six datasets. The statistics of these datasets are shown in Table 1 and the detailed descriptions are as follows:

Table 1: The statistics of the datasets.

Dataset    Type     Samples   Classes   Dimension
USPS       Image    9298      10        256
HHAR       Record   10299     6         561
Reuters    Text     10000     4         2000
ACM        Graph    3025      3         1870
DBLP       Graph    4058      4         334
Citeseer   Graph    3327      6         3703

• USPS [13]: The USPS dataset contains 9298 gray-scale handwritten digit images with a size of 16×16 pixels. The features are the gray values of the pixels in the images, and all features are normalized to [0, 2].
• HHAR [23]: The Heterogeneity Human Activity Recognition (HHAR) dataset contains 10299 sensor records from smart phones and smart watches. All samples are partitioned into 6 categories of human activities, including biking, sitting, standing, walking, stair up and stair down.
• Reuters [14]: A text dataset containing around 810000 English news stories labeled with a category tree. We use 4 root categories (corporate/industrial, government/social, markets and economics) as labels and sample a random subset of 10000 examples for clustering.
• ACM²: This is a paper network from the ACM dataset. There is an edge between two papers if they are written by the same author. Paper features are the bag-of-words of the keywords. We select papers published in KDD, SIGMOD, SIGCOMM and MobiCOMM and divide the papers into three classes (database, wireless communication, data mining) by their research area.
• DBLP³: This is an author network from the DBLP dataset. There is an edge between two authors if they have a co-author relationship. The authors are divided into four areas: database, data mining, machine learning and information retrieval. We label each author's research area according to the conferences they submitted to. Author features are the elements of a bag-of-words representation of keywords.
• Citeseer⁴: A citation network which contains sparse bag-of-words feature vectors for each document and a list of citation links between the documents. The labels contain six areas: agents, artificial intelligence, database, information retrieval, machine language, and HCI.

² http://dl.acm.org/
³ https://dblp.uni-trier.de
⁴ http://citeseerx.ist.psu.edu/index

Table 2: Clustering results on six datasets (mean±std). The bold numbers represent the best results and the numbers with an asterisk are the best results of the baselines.

Dataset   Metric  K-means      AE           DEC          IDEC          GAE          VGAE         DAEGC         SDCN_Q       SDCN
USPS      ACC     66.82±0.04   71.04±0.03   73.31±0.17   76.22±0.12*   63.10±0.33   56.19±0.72   73.55±0.40    77.09±0.21   78.08±0.19
USPS      NMI     62.63±0.05   67.53±0.03   70.58±0.25   75.56±0.06*   60.69±0.58   51.08±0.37   71.12±0.24    77.71±0.21   79.51±0.27
USPS      ARI     54.55±0.06   58.83±0.05   63.70±0.27   67.86±0.12*   50.30±0.55   40.96±0.59   63.33±0.34    70.18±0.22   71.84±0.24
USPS      F1      64.78±0.03   69.74±0.03   71.82±0.21   74.63±0.10*   61.84±0.43   53.63±1.05   72.45±0.49    75.88±0.17   76.98±0.18
HHAR      ACC     59.98±0.02   68.69±0.31   69.39±0.25   71.05±0.36    62.33±1.01   71.30±0.36   76.51±2.19*   83.46±0.23   84.26±0.17
HHAR      NMI     58.86±0.01   71.42±0.97   72.91±0.39   74.19±0.39*   55.06±1.39   62.95±0.36   69.10±2.28    78.82±0.28   79.90±0.09
HHAR      ARI     46.09±0.02   60.36±0.88   61.25±0.51   62.83±0.45*   42.63±1.63   51.47±0.73   60.38±2.15    71.75±0.23   72.84±0.09
HHAR      F1      58.33±0.03   66.36±0.34   67.29±0.29   68.63±0.33    62.64±0.97   71.55±0.29   76.89±2.18*   81.45±0.14   82.58±0.08
Reuters   ACC     54.04±0.01   74.90±0.21   73.58±0.13   75.43±0.14*   54.40±0.27   60.85±0.23   65.50±0.13    79.30±0.11   77.15±0.21
Reuters   NMI     41.54±0.51   49.69±0.29   47.50±0.34   50.28±0.17*   25.92±0.41   25.51±0.22   30.55±0.29    56.89±0.27   50.82±0.21
Reuters   ARI     27.95±0.38   49.55±0.37   48.44±0.14   51.26±0.21*   19.61±0.22   26.18±0.36   31.12±0.18    59.58±0.32   55.36±0.37
Reuters   F1      41.28±2.43   60.96±0.22   64.25±0.22*  63.21±0.12    43.53±0.42   57.14±0.17   61.82±0.13    66.15±0.15   65.48±0.08
ACM       ACC     67.31±0.71   81.83±0.08   84.33±0.76   85.12±0.52    84.52±1.44   84.13±0.22   86.94±2.83*   86.95±0.08   90.45±0.18
ACM       NMI     32.44±0.46   49.30±0.16   54.54±1.51   56.61±1.16*   55.38±1.92   53.20±0.52   56.18±4.15    58.90±0.17   68.31±0.25
ACM       ARI     30.60±0.69   54.64±0.16   60.64±1.87   62.16±1.50*   59.46±3.10   57.72±0.67   59.35±3.89    65.25±0.19   73.91±0.40
ACM       F1      67.57±0.74   82.01±0.08   84.51±0.74   85.11±0.48    84.65±1.33   84.17±0.23   87.07±2.79*   86.84±0.09   90.42±0.19
DBLP      ACC     38.65±0.65   51.43±0.35   58.16±0.56   60.31±0.62    61.21±1.22   58.59±0.06   62.05±0.48*   65.74±1.34   68.05±1.81
DBLP      NMI     11.45±0.38   25.40±0.16   29.51±0.28   31.17±0.50    30.80±0.91   26.92±0.06   32.49±0.45*   35.11±1.05   39.50±1.34
DBLP      ARI     6.97±0.39    12.21±0.43   23.92±0.39   25.37±0.60*   22.02±1.40   17.92±0.07   21.03±0.52    34.00±1.76   39.15±2.01
DBLP      F1      31.92±0.27   52.53±0.36   59.38±0.51   61.33±0.56    61.41±2.23   58.69±0.07   61.75±0.67*   65.78±1.22   67.71±1.51
Citeseer  ACC     39.32±3.17   57.08±0.13   55.89±0.20   60.49±1.42    61.35±0.80   60.97±0.36   64.54±1.39*   61.67±1.05   65.96±0.31
Citeseer  NMI     16.94±3.22   27.64±0.08   28.34±0.30   27.17±2.40    34.63±0.65   32.69±0.27   36.41±0.86*   34.39±1.22   38.71±0.32
Citeseer  ARI     13.43±3.02   29.31±0.14   28.12±0.36   25.70±2.65    33.55±1.18   33.13±0.53   37.78±1.24*   35.50±1.49   40.17±0.43
Citeseer  F1      36.08±3.53   53.80±0.11   52.62±0.17   61.62±1.39    57.36±0.82   57.70±0.49   62.20±1.32*   57.82±0.98   63.62±0.24

4.2 Baselines

We compare our proposed method SDCN with three types of methods, including clustering methods on the raw data, DNN-based clustering methods and GCN-based graph clustering methods.

• K-means [5]: A classical clustering method applied to the raw data.
• AE [7]: A two-stage deep clustering algorithm which performs K-means on the representations learned by an autoencoder.
• DEC [26]: A deep clustering method which designs a clustering objective to guide the learning of the data representations.
• IDEC [4]: This method adds a reconstruction loss to DEC, so as to learn better representations.
• GAE & VGAE [10]: Unsupervised graph embedding methods using GCN to learn data representations.
• DAEGC [25]: It uses an attention network to learn the node representations and employs a clustering loss to supervise the process of graph clustering.
• SDCN_Q: The variant of SDCN that uses distribution Q for the cluster assignments.
• SDCN: The proposed method.

Metrics. We employ four popular metrics: Accuracy (ACC), Normalized Mutual Information (NMI), Average Rand Index (ARI) and macro F1-score (F1). For each metric, a larger value implies a better clustering result.
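The snippet below sketches how these four metrics can be computed with scikit-learn and SciPy. Note that ACC and macro F1 require mapping predicted cluster IDs to ground-truth labels; the Hungarian-algorithm mapping used here is a common convention for clustering evaluation, not a step specified by the paper.

```python
# Evaluation sketch for ACC, NMI, ARI and macro F1.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score, f1_score

def best_mapping(y_true, y_pred):
    """Map cluster IDs to labels by maximizing the matched count (Hungarian algorithm)."""
    K = int(max(y_true.max(), y_pred.max())) + 1
    cost = np.zeros((K, K), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1
    rows, cols = linear_sum_assignment(-cost)        # maximize total agreement
    mapping = dict(zip(rows, cols))
    return np.array([mapping[p] for p in y_pred])

def evaluate(y_true, y_pred):
    y_mapped = best_mapping(y_true, y_pred)
    return {
        "ACC": float((y_mapped == y_true).mean()),
        "NMI": normalized_mutual_info_score(y_true, y_pred),
        "ARI": adjusted_rand_score(y_true, y_pred),
        "F1":  f1_score(y_true, y_mapped, average="macro"),
    }

# Example: evaluate(np.array([0, 0, 1, 1, 2]), np.array([1, 1, 0, 0, 2]))
```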
Parameter Setting. We use the pre-trained autoencoder for all DNN-based clustering methods (AE + K-means, DEC, IDEC) and SDCN. We train the autoencoder end-to-end using all data points for 30 epochs with a learning rate of 10^-3. In order to be consistent with previous methods [4, 26], we set the dimensions of the autoencoder to d-500-500-2000-10, where d is the dimension of the input data. The dimensions of the layers in the GCN module are the same as those of the autoencoder. As for the GCN-based methods, we set the dimensions of GAE and VGAE to d-256-16 and train them with 30 epochs on all datasets. For DAEGC, we use the settings of [25]. In the hyperparameter search, we try {1, 3, 5} for the update interval in DEC and IDEC and {1, 0.1, 0.01, 0.001} for the hyperparameter γ in IDEC, and report the best results. For our SDCN, we uniformly set α = 0.1 and β = 0.01 for all datasets because our method is not sensitive to these hyperparameters. For the non-graph data we train SDCN for 200 epochs, and for the graph data we train it for 50 epochs, because a graph structure with prior knowledge, i.e., a citation network, contains more information than a KNN graph, which accelerates convergence. The batch size is set to 256, and the learning rate is set to 10^-3 for USPS, HHAR, ACM and DBLP, and 10^-4 for Reuters and Citeseer. For all methods that use the K-means algorithm to generate clustering assignments, we initialize 20 times and select the best solution. We run all methods 10 times and report the average results to prevent extreme cases.

4.3 Analysis of Clustering Results

Table 2 shows the clustering results on the six datasets. Note that for USPS, HHAR and Reuters, we use the KNN graph as the input of the GCN module, while for ACM, DBLP and Citeseer, we use the original graph. We have the following observations:

• For each metric, our methods SDCN and SDCN_Q achieve the best results on all six datasets. In particular, compared with the best results of the baselines, our approach achieves significant improvements of 6% on ACC, 17% on NMI and 28% on ARI on average. The reason is that SDCN successfully integrates the structural information into deep clustering, and the dual self-supervised module guides the updates of the autoencoder and GCN, making them enhance each other.
• SDCN generally achieves better clustering results than SDCN_Q. The reason is that SDCN uses the representations containing the structural information learned by GCN, while SDCN_Q mainly uses the representations learned by the autoencoder. However, on Reuters, the result of SDCN_Q is much better than that of SDCN, because in the KNN graph of Reuters many nodes of different classes are connected together, which contains much wrong structural information. Therefore, an important prerequisite for the application of GCN is to construct a KNN graph with less noise.
• The clustering results of the autoencoder-based methods (AE, DEC, IDEC) are generally better than those of the GCN-based methods (GAE, VGAE, DAEGC) on the data with KNN graphs, while the GCN-based methods usually perform better on the data with a graph structure. The reason is that GCN-based methods only use structural information to learn the data representation. When the structural information in the graph is not clear enough, e.g., in KNN graphs, the performance of the GCN-based methods declines. Besides, SDCN integrates structural information into deep clustering, so its clustering performance is better than both types of methods.
• Comparing the results of AE with DEC and the results of GAE with DAEGC, we can find that the clustering loss function defined in Eq. 13 plays an important role in improving the deep clustering performance, because IDEC and DAEGC can be seen as the combination of the clustering loss with AE and GAE, respectively. It improves the cluster cohesion by making the data representation closer to the cluster centers, thus improving the clustering results.

4.4 Analysis of Variants

We compare our model with two variants to verify the ability of GCN to learn structural information and the effectiveness of the delivery operator. Specifically, we define the following variants:

• SDCN-w/o: This variant is SDCN without the delivery operator, and is used to validate the effectiveness of our proposed delivery operator.
• SDCN-MLP: This variant is SDCN with the GCN module replaced by a multilayer perceptron (MLP) with the same number of layers, and is used to validate the advantage of GCN in learning structural information.

Figure 2: Clustering accuracy with different variants. (a) Datasets with KNN graph (USPS, HHAR, Reuters); (b) datasets with original graph (ACM, DBLP, Citeseer). Each panel compares SDCN, SDCN-w/o and SDCN-MLP.

From Figure 2, we have the following observations:

• In Figure 2(a), we can find that the clustering accuracy of SDCN-MLP is better than that of SDCN-w/o on Reuters and similar on USPS and HHAR. This shows that on the KNN graph, without the delivery operator, the ability of GCN to learn structural information is severely limited. The reason is that a multilayer GCN produces a serious over-smoothing problem, leading to a decrease in the clustering results. On the other hand, SDCN is better than SDCN-MLP. This proves that the delivery operator can help GCN alleviate the over-smoothing problem and learn a better data representation.
• In Figure 2(b), we can find that the clustering accuracy of SDCN-w/o is better than that of SDCN-MLP on all three datasets containing an original graph. This shows that GCN has a powerful ability to learn data representations with structural information. Besides, SDCN performs better than SDCN-w/o on the three datasets. This proves that there still exists an over-smoothing problem in SDCN-w/o, but a good graph structure still allows SDCN-w/o to achieve decent clustering results.
• Comparing the results in Figure 2(a) and Figure 2(b), we can find that no matter which type of dataset is used, SDCN achieves the best performance compared with SDCN-w/o and SDCN-MLP. This proves that both the delivery operator and GCN play an important role in improving the clustering quality.
Table 3: Effect of different propagation layers (L).

Dataset    Method   ACC     NMI     ARI     F1
ACM        SDCN-4   90.45   68.31   73.91   90.42
ACM        SDCN-3   89.06   64.86   70.51   89.03
ACM        SDCN-2   89.12   66.48   70.94   89.04
ACM        SDCN-1   77.69   51.59   50.13   74.62
DBLP       SDCN-4   68.05   39.51   39.15   67.71
DBLP       SDCN-3   65.11   36.81   36.03   64.98
DBLP       SDCN-2   66.72   37.19   37.58   65.37
DBLP       SDCN-1   64.19   30.69   33.62   60.44
Citeseer   SDCN-4   65.96   38.71   40.17   61.62
Citeseer   SDCN-3   59.18   32.11   32.16   55.92
Citeseer   SDCN-2   60.96   33.69   34.49   57.31
Citeseer   SDCN-1   58.58   32.91   32.31   52.38

4.5 Analysis of Different Propagation Layers

To investigate whether SDCN benefits from a multilayer GCN, we vary the depth of the GCN module while keeping the DNN module unchanged. In particular, we search the number of layers in the range {1, 2, 3, 4}. There are a total of four layers in the encoder part of the DNN module in SDCN, generating the representations H(1), H(2), H(3), H(4), respectively. SDCN-L denotes that there are a total of L layers in the GCN module. For example, SDCN-2 means that H(3) and H(4) are transferred to the corresponding GCN layers for propagation. We choose the datasets with an original graph to verify the effect of the number of propagation layers on the clustering results, because they have natural structural information. From Table 3, we have the following observations:

• Increasing the depth of SDCN substantially enhances the clustering performance. It is clear that SDCN-2, SDCN-3 and SDCN-4 achieve consistent improvements over SDCN-1 across the board. Besides, SDCN-4 performs better than the other settings on all three datasets. Because the representations learned by each layer in the autoencoder are different, to preserve as much information as possible, we need to feed all the representations learned by the autoencoder into the corresponding GCN layers.
• There is an interesting phenomenon that the performance of SDCN-3 is not as good as that of SDCN-2 on all the datasets. The reason is that SDCN-3 uses the representation H(2), which comes from a middle layer of the encoder. The representation generated by this layer is in the transitional stage from the raw data to the semantic representation, so it inevitably loses some underlying information while still lacking semantic information. Another reason is that a GCN with two layers does not cause serious over-smoothing problems, as proved in [15], whereas for SDCN-3, because the number of layers is not large enough, the over-smoothing term in Eq. 22 is not small enough, so that it is still plagued by the over-smoothing problem.

Figure 3: Clustering accuracy with different ϵ. Panels (a)-(f) correspond to USPS, HHAR, Reuters, ACM, DBLP and Citeseer.

4.6 Analysis of balance coefficient ϵ

In the previous experiments, in order to reduce the hyperparameter search, we uniformly set the balance coefficient ϵ to 0.5. In this experiment, we explore how SDCN is affected by different ϵ on different datasets. In detail, we set ϵ = {0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0}. Note that ϵ = 0.0 means that the representations in the GCN module do not contain the representation from the autoencoder, and ϵ = 1.0 means that GCN only uses the representation H(L) learned by the DNN. From Figure 3, we can find:

• The clustering accuracy with ϵ = 0.5 achieves the best performance on four datasets (Reuters, ACM, DBLP, Citeseer), which shows that the representations of the GCN module and the DNN module are equally important, and that the improvement of SDCN depends on the mutual enhancement of the two modules.
• The clustering accuracy with ϵ = 0.0 is the worst on all datasets. Clearly, when ϵ = 0.0, the GCN module is equivalent to a standard multilayer GCN, which produces a very serious over-smoothing problem [15], leading to a decline in the clustering quality. Compared with the accuracy at ϵ = 0.1, we can find that even injecting a small amount of the representations learned by the autoencoder into GCN can help alleviate the over-smoothing problem.
• Another interesting observation is that SDCN with ϵ = 1.0 still obtains a relatively high clustering accuracy. The reason is that although SDCN with ϵ = 1.0 only uses the representation H(L), this representation contains the most important information of the raw data. After passing through one GCN layer, it can still acquire some structural information to improve the clustering performance. However, due to the limited number of layers, the results are not the best.
Figure 4: Training process on different datasets. Each panel plots the clustering accuracy of SDCN-Q, SDCN-Z and SDCN-P against the number of iterations on (a) USPS, (b) HHAR, (c) ACM and (d) DBLP.

Figure 5: Clustering results with different K. Panels (a)-(c) show the accuracy and panels (d)-(f) the NMI of GAE, VGAE, DAEGC and SDCN on USPS, HHAR and Reuters for K = {1, 3, 5, 10}.

4.7 K-sensitivity Analysis

Since the number of nearest neighbors K is an important parameter in the construction of the KNN graph, we design a K-sensitivity experiment on the datasets with KNN graphs. This experiment is mainly intended to show that our model is insensitive to K. Hence we compare SDCN with the clustering methods focusing on graph data (GAE, VGAE, DAEGC). From Figure 5, we can find that with K = {1, 3, 5, 10}, our proposed SDCN is much better than GAE, VGAE and DAEGC, which proves that our method can learn useful structural information even from graphs containing noise. Another finding is that these four methods can achieve good performance when K = 3 or K = 5, but in the cases of K = 1 and K = 10 the performance drops significantly. The reason is that when K = 1, the KNN graph contains less structural information, and when K = 10, the communities in the KNN graph are overlapping. In summary, SDCN achieves stable results compared with the other baseline methods on KNN graphs with different numbers of nearest neighbors.

4.8 Analysis of Training Process

In this section, we analyze the training progress on different datasets. Specifically, we want to explore how the clustering accuracy of the three sample assignment distributions in SDCN varies with the number of iterations. In Figure 4, the red line SDCN-P, the blue line SDCN-Q and the orange line SDCN-Z represent the accuracy of the target distribution P, the distribution Q and the distribution Z, respectively. In most cases, the accuracy of SDCN-P is higher than that of SDCN-Q, which shows that the target distribution P is able to guide the update of the whole model. At the beginning, the accuracy of all three distributions decreases to different extents. Because the information learned by the autoencoder and GCN is different, a conflict may arise between the results of the two modules, making the clustering results decline. Then the accuracy of SDCN-Q and SDCN-Z quickly increases to a high level, because the target distribution SDCN-P eases the conflict between the two modules, making their results tend to be consistent. In addition, we can see that with the increase of training epochs, the clustering results of SDCN tend to be stable and there is no significant fluctuation, indicating the good robustness of our proposed model.

5 CONCLUSION

In this paper, we make the first attempt to integrate structural information into deep clustering. We propose a novel structural deep clustering network, consisting of a DNN module, a GCN module, and a dual self-supervised module. Our model is able to effectively combine the autoencoder-specific representation with the GCN-specific representation by a delivery operator. A theoretical analysis is provided to demonstrate the strength of the delivery operator. We show that our proposed model consistently outperforms the state-of-the-art deep clustering methods on various open datasets.

ACKNOWLEDGMENTS

This work is supported by the National Key Research and Development Program of China (2018YFB1402600) and the National Natural Science Foundation of China (No. 61772082, 61702296, 61806020, 61972442, U1936104). It is also supported by the 2018 Tencent Marketing Solution Rhino-Bird Focused Research Program.
REFERENCES

[1] Charu C Aggarwal and ChengXiang Zhai. 2012. A survey of text clustering algorithms. In Mining Text Data. Springer, 77–128.
[2] Mikhail Belkin and Partha Niyogi. 2003. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 15, 6 (2003), 1373–1396.
[3] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. 2018. Deep clustering for unsupervised learning of visual features. In ECCV. 132–149.
[4] Xifeng Guo, Long Gao, Xinwang Liu, and Jianping Yin. 2017. Improved deep embedded clustering with local structure preservation. In IJCAI. 1753–1759.
[5] John A Hartigan and Manchek A Wong. 1979. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics) 28, 1 (1979), 100–108.
[6] William Grant Hatcher and Wei Yu. 2018. A survey of deep learning: platforms, applications and emerging research trends. IEEE Access 6 (2018), 24411–24432.
[7] Geoffrey E Hinton and Ruslan R Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science 313, 5786 (2006), 504–507.
[8] Pan Ji, Tong Zhang, Hongdong Li, Mathieu Salzmann, and Ian Reid. 2017. Deep subspace clustering networks. In NIPS. 24–33.
[9] Zhuxi Jiang, Yin Zheng, Huachun Tan, Bangsheng Tang, and Hanning Zhou. 2017. Variational deep embedding: An unsupervised and generative approach to clustering. IJCAI (2017).
[10] Thomas N Kipf and Max Welling. 2016. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308 (2016).
[11] Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. ICLR (2017).
[12] Johannes Klicpera, Aleksandar Bojchevski, and Stephan Günnemann. 2018. Predict then propagate: Graph neural networks meet personalized PageRank. (2018).
[13] Yann Le Cun, Ofer Matan, Bernhard Boser, John S Denker, Don Henderson, Richard E Howard, Wayne Hubbard, LD Jacket, and Henry S Baird. 1990. Handwritten zip code recognition with multilayer networks. In ICPR, Vol. 2. IEEE, 35–40.
[14] David D Lewis, Yiming Yang, Tony G Rose, and Fan Li. 2004. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research 5, Apr (2004), 361–397.
[15] Qimai Li, Zhichao Han, and Xiao-Ming Wu. 2018. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI.
[16] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, Nov (2008), 2579–2605.
[17] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. 2015. Adversarial autoencoders. arXiv preprint arXiv:1511.05644 (2015).
[18] Pankaj Malhotra, Anusha Ramakrishnan, Gaurangi Anand, Lovekesh Vig, Puneet Agarwal, and Gautam Shroff. 2016. LSTM-based encoder-decoder for multi-sensor anomaly detection. arXiv preprint arXiv:1607.00148 (2016).
[19] Jonathan Masci, Ueli Meier, Dan Cireşan, and Jürgen Schmidhuber. 2011. Stacked convolutional auto-encoders for hierarchical feature extraction. In ICANN. Springer, 52–59.
[20] Vinod Nair and Geoffrey E Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In ICML. 807–814.
[21] Andrew Y Ng, Michael I Jordan, and Yair Weiss. 2002. On spectral clustering: Analysis and an algorithm. In NIPS. 849–856.
[22] Peter J Rousseeuw. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20 (1987), 53–65.
[23] Allan Stisen, Henrik Blunck, Sourav Bhattacharya, Thor Siiger Prentow, Mikkel Baun Kjærgaard, Anind Dey, Tobias Sonne, and Mads Møller Jensen. 2015. Smart devices are different: Assessing and mitigating mobile sensing heterogeneities for activity recognition. In SenSys. ACM, 127–140.
[24] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In ICML. ACM, 1096–1103.
[25] Chun Wang, Shirui Pan, Ruiqi Hu, Guodong Long, Jing Jiang, and Chengqi Zhang. 2019. Attributed graph clustering: A deep attentional embedding approach. IJCAI (2019).
[26] Junyuan Xie, Ross Girshick, and Ali Farhadi. 2016. Unsupervised deep embedding for clustering analysis. In ICML. 478–487.
[27] Bo Yang, Xiao Fu, Nicholas D Sidiropoulos, and Mingyi Hong. 2017. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In ICML. 3861–3870.
[28] Yi Yang, Dong Xu, Feiping Nie, Shuicheng Yan, and Yueting Zhuang. 2010. Image clustering using local discriminant models and global integration. IEEE Transactions on Image Processing 19, 10 (2010), 2761–2773.
