
Learning the Implicit Semantic Representation on Graph-Structured Data

Likang Wu1, Zhi Li1, Hongke Zhao2, Qi Liu1, Jun Wang1, Mengdi Zhang3, and Enhong Chen1(B)

1 Anhui Province Key Laboratory of Big Data Analysis and Application, University of Science and Technology of China, Hefei, China
{wulk,zhili03}@mail.ustc.edu.cn, {qiliuql,cheneh}@ustc.edu.cn
2 Tianjin University, Tianjin, China
[email protected]
3 Meituan-Dianping Group, Beijing, China
[email protected]

arXiv:2101.06471v1 [cs.AI] 16 Jan 2021

Abstract. Existing representation learning methods in graph convolutional networks are mainly designed by describing the neighborhood of each node as a perceptual whole, while the implicit semantic associations behind the highly complex interactions of graphs are largely unexploited. In this paper, we propose Semantic Graph Convolutional Networks (SGCN), which explores the implicit semantics by learning latent semantic-paths in graphs. Previous work has explored graph semantics via meta-paths; however, these methods mainly rely on explicit heterogeneous information that is hard to obtain for a large amount of graph-structured data. SGCN first breaks through this restriction by leveraging semantic-paths dynamically and automatically during the node aggregating process. To evaluate our idea, we conduct sufficient experiments on several standard datasets, and the empirical results show the superior performance of our model.4

Keywords: Graph Neural Networks · Semantic Representation · Network Analysis.

1 Introduction
The representations of objects (nodes) in large graph-structured data, such as
social or biological networks, have been proved extremely effective as feature
inputs for graph analysis tasks. Recently, there have been many attempts in
the literature to extend neural networks to deal with representation learning of
graphs, such as Graph Convolutional Networks (GCN) [15], GraphSAGE [12]
and Graph Attention Networks (GAT) [34].
In spite of enormous success, previous graph neural networks mainly proposed
representation learning methods by describing the neighborhoods as a perceptual
whole, and they have not gone deep into the exploration of semantic information
4 Our code is available online at https://github.com/WLiK/SGCN_SemanticGCN


Fig. 1. Example of implicit semantic-paths in a scholar cooperation network. There are no explicit node (relation) types. Behind the same kind of relation (black solid edge), there are implicit factors (dotted lines; A is the student of B, B is the advisor of C). So, the path A-B-C expresses "Student-Advisor-Student", and A and C are "classmates"; B-C-D expresses "Advisor-Student-Advisor", and B and D are "colleagues".

in graphs. Taking the movie network as an example, the paths based on com-
posite relations of “Movie-Actor-Movie” and “Movie-Director-Movie” may reveal
two different semantic patterns, i.e., the two movies have the same actor (direc-
tor). Here the semantic pattern is defined as a specific knowledge expressed by
the corresponding path. Although several researchers [35,30] attempt to capture
these graph semantics of composite relations between two objects by meta-paths,
existing work relies on the given heterogeneous information such as different
types of objects and distinct object connections. However, in the real world, a great deal of graph-structured data does not have such explicit characteristics. As
shown in Figure 1, in a scholar cooperation network, there are usually no explicit
node (relation) types and all nodes are connected through the same relation, i.e.,
“Co-author”. Fortunately, behind the same relation, there are various implicit
factors which may express different connecting reasons, such as “Classmate”
and “Colleague” for the same relation “Co-author”. These factors can further
compose diverse semantic-paths (e.g. “Student-Advisor-Student” and “Advisor-
Student-Advisor”), which reveal sophisticated semantic associations and help to
generate more informative representations. Then, how to automatically exploit
comprehensive semantic patterns based on the implicit factors behind a general
graph is a non-trivial problem.
In general, there are several challenges in solving this problem. Firstly, adaptively inferring the latent factors behind a graph is an essential part. We notice that several studies have begun to explore the desired latent factors behind a graph via disentangled representations [20,18]. However, they mainly focus on inferring the latent factors through disentangled representation learning while failing to discriminatively model the independent implicit factors behind the same connections. Secondly, after discovering the latent factors, how to select the most meaningful semantics and aggregate the diverse semantic information remains largely unexplored. Last but not least, further exploiting the implicit semantic patterns while remaining capable of conducting inductive learning is quite difficult.
To address the above challenges, in this paper, we propose a novel Semantic
Graph Convolutional Networks (SGCN), which sheds light on the exploration
of implicit semantics in the node aggregating process. Specifically, we first pro-

pose a latent factor routing method with the DisenConv layer [20] to adaptively
infer the probability of each latent factor that may have caused the link from
a given node to one of its neighbors. Then, to further explore the diverse
semantic information, we transfer the probability between every two connected
nodes to the corresponding semantic adjacent matrix, which can present the
semantic-paths in a graph. Afterwards, most semantic-strengthening methods, such as the semantic-level attention module, can be easily integrated into our model to aggregate the diverse semantic information from these semantic-paths. Finally,
to encourage the independence of the implicit semantic factors and conduct the
inductive learning, we design an effective joint loss function to maintain the in-
dependent mapping channels of different factors. This loss function is able to
focus on different semantic characteristics during the training process.
Specifically, the contributions of this paper can be summarized as follows:

– We first break the heterogeneous restriction of semantic representations with an end-to-end framework. It automatically infers the independent factor behind the formation of each edge and explores the semantic associations of latent factors behind a graph.
– We propose a novel Semantic Graph Convolutional Networks (SGCN), to
learn node representations by aggregating the implicit semantics from the
graph-structured data.
– We conduct extensive experiments on various real-world graphs datasets to
evaluate the performance of the proposed model. The results show the supe-
riority of our proposed model by comparing it with many powerful models.

2 Related Works

Graph neural networks (GNNs) [10,26], especially graph convolutional networks


[13], have been proven successful in modeling structured graph data due to their theoretical elegance [5]. They have made new breakthroughs in various tasks,
such as node classification [15] and graph classification [6]. In the early days,
the graph spectral theory [13] was used to derive a graph convolutional layer.
Then, polynomial spectral filters [6] greatly reduced the computational cost. Kipf and Welling [15] further proposed using a linear filter for additional simplification. Along with spectral graph convolution, directly
performing graph convolution in the spatial domain was also investigated by
many researchers [8,12]. Among them, graph attention networks [34] have aroused considerable research interest, since they adaptively assign weights to the neighbors of a node via an attention mechanism [1,37].
For semantic learning research, there have been studies exploring a kind of semantic-path, called the meta-path, in heterogeneous graph embedding to preserve structural information. ESim [28] learned node representations by searching a user-defined embedding space. Based on random walks, metapath2vec [7] utilized skip-gram to learn representations over semantic-paths. HERec [29] proposed a type-constraint strategy to filter node sequences and captured the complex semantics reflected

in heterogeneous graphs. Then, Fan et al. [9] suggested a metagraph2vec model for malware detection, where both the structures and semantics are preserved. Sun et al. [30] proposed meta-graph-based network embedding models, which simultaneously consider the hidden relations of all the meta information of a meta-graph. Meanwhile, there have been other influential semantic learning approaches. For instance, many models [4,17,25] have been applied to various fields because of their latent semantic analysis ability.
In heterogeneous graphs, two objects can be connected via different semantic-paths, which are called meta-paths. This relies on the characteristic that such a graph structure has different types of nodes and relations. A meta-path $\Phi$ is defined as a path of the form $A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1}$ (abbreviated as $A_1 A_2 \cdots A_{l+1}$), which describes a composite relation $R = R_1 \circ R_2 \circ \cdots \circ R_l$, where $\circ$ denotes the composition operator on relations. Actually, in a homogeneous graph, the relationships between nodes are also generated for different reasons (latent factors), so we can implicitly construct various types of relationships to extract various semantic-paths corresponding to different semantic patterns, thereby improving the performance of the GCN model from the perspective of semantic discovery.

3 Semantic Graph Convolutional Networks

In this section, we introduce the Semantic Graph Convolutional Networks (SGCN).


We first present the notations, then describe the overall network progressively.

3.1 Preliminary

We focus primarily on undirected graphs, and it is straightforward to extend our approach to directed graphs. We define $G = (V, E)$ as a graph comprising the node set $V$ and the edge set $E$, with $|V| = N$ denoting the number of nodes. Each node $u \in V$ has a feature vector $x_u \in \mathbb{R}^{d_{in}}$. We use $(u, v) \in E$ to indicate that there is an edge between node $u$ and node $v$. Most graph convolutional networks can be regarded as an aggregation function $f(\cdot)$ that outputs the representations of nodes when given the features of each node and its neighbors:
$$y = f(x_u, x_v : (u, v) \in E \mid u \in V),$$
where the output $y \in \mathbb{R}^{N \times d_{out}}$ denotes the representations of nodes. This means that the neighborhood of a node contains rich information, which can be aggregated to describe the node more comprehensively. Different from previous studies [15,12,34], in our work, the proposed $f(\cdot)$ automatically learns semantic-paths from graph data to explore the corresponding semantic patterns.
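For intuition, the following is a minimal sketch of one such aggregation function $f(\cdot)$: a plain mean-neighborhood aggregator written in PyTorch. It only illustrates the abstract form above (the function and argument names are our own), and is not the SGCN aggregator itself.

```python
import torch

def mean_aggregate(x, edge_index, W):
    """A plain mean-neighbourhood aggregator as one instance of f(.):
    each node's output mixes its own features with the average of its neighbours.

    x:          [N, d_in] node feature matrix
    edge_index: [2, E]    directed edges (u, v), both directions present
    W:          [d_in, d_out] learnable projection
    """
    N = x.size(0)
    u, v = edge_index
    deg = torch.zeros(N).index_add_(0, u, torch.ones(u.size(0))).clamp(min=1)
    neigh_sum = torch.zeros_like(x).index_add_(0, u, x[v])
    return torch.relu((x + neigh_sum / deg.unsqueeze(-1)) @ W)
```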

3.2 Latent Factor Routing

Here we aim to introduce the disentangled algorithm that calculates the latent factors between every two objects. We assume that each node is composed of $K$ independent components, hence there are $K$ latent factors to be disentangled. For a node $u \in V$, the hidden representation of $u$ is $h_u = [e_{u,1}, e_{u,2}, \ldots, e_{u,K}] \in \mathbb{R}^{K \times \frac{d_{out}}{K}}$, where $e_{u,k} \in \mathbb{R}^{\frac{d_{out}}{K}}$ ($k = 1, 2, \ldots, K$) denotes the aspect of node $u$ that is pertinent to the $k$-th disentangled factor.
In the initial stage, we project its feature vector $x_u$ into $K$ different subspaces:
$$z_{u,k} = \frac{\sigma(W_k x_u + b_k)}{\left\| \sigma(W_k x_u + b_k) \right\|_2}, \qquad (1)$$
where $W_k \in \mathbb{R}^{d_{in} \times \frac{d_{out}}{K}}$ and $b_k \in \mathbb{R}^{\frac{d_{out}}{K}}$ are the mapping parameters and bias of the $k$-th subspace, and the nonlinear activation function $\sigma$ is ReLU [23]. To capture aspect $k$ of node $u$ comprehensively, we construct $e_{u,k}$ from both $z_{u,k}$ and $\{z_{v,k} : (u, v) \in E\}$, which can be utilized to identify the latent factors. Here we learn the probability of each factor by leveraging the neighborhood routing mechanism [20,18]; it is a DisenConv layer:
$$e_{u,k}^{t} = \frac{z_{u,k} + \sum_{v:(u,v) \in E} p_{u,v}^{k,t-1} z_{v,k}}{\left\| z_{u,k} + \sum_{v:(u,v) \in E} p_{u,v}^{k,t-1} z_{v,k} \right\|_2}, \qquad (2)$$
$$p_{u,v}^{k,t} = \frac{\exp\!\left(z_{v,k}^{\top} e_{u,k}^{t}\right)}{\sum_{k'=1}^{K} \exp\!\left(z_{v,k'}^{\top} e_{u,k'}^{t}\right)}, \qquad (3)$$

where the iteration index is $t = 1, 2, \ldots, T$, and $p_{u,v}^{k}$ indicates the probability that factor $k$ is the reason why node $u$ reaches neighbor $v$, satisfying $p_{u,v}^{k} \geq 0$ and $\sum_{k=1}^{K} p_{u,v}^{k} = 1$. The neighborhood routing mechanism iteratively infers $p_{u,v}^{k}$ and constructs $e_{u,k}$. Note that there are $L$ DisenConv layers in total, and $z_{u,k}$ is assigned the value of $e_{u,k}^{T}$ at the end of each layer $l \leq L - 1$; more details can be found in Algorithm 1.
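To make the routing step concrete, the following is a minimal PyTorch sketch of Eqs. (1)-(3), assuming dense tensors and a simple directed edge list; the function and tensor names are our own illustration rather than the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def latent_factor_routing(x, edge_index, W, b, K, T=6):
    """Minimal sketch of Eqs. (1)-(3): project features into K subspaces and
    iteratively infer the factor probabilities p for each edge (u, v).

    x:          [N, d_in] node features
    edge_index: [2, E] directed edges (u -> v), both directions present
    W, b:       projection parameters, W: [K, d_in, d_out//K], b: [K, d_out//K]
    """
    # Eq. (1): z[u, k] = normalize(ReLU(W_k x_u + b_k))
    z = torch.einsum('nd,kdh->nkh', x, W) + b           # [N, K, d_out/K]
    z = F.normalize(F.relu(z), p=2, dim=-1)

    u, v = edge_index                                    # source / target nodes
    e = z.clone()                                        # e^{t=1} initialised with z
    for _ in range(T):
        # Eq. (3): p[u, v, k] proportional to exp(z_{v,k}^T e_{u,k}), softmax over K
        logits = (z[v] * e[u]).sum(dim=-1)               # [E, K]
        p = F.softmax(logits, dim=-1)                    # [E, K]
        # Eq. (2): aggregate neighbours weighted by p, then re-normalise
        msg = p.unsqueeze(-1) * z[v]                     # [E, K, d_out/K]
        agg = torch.zeros_like(z)
        agg.index_add_(0, u, msg)                        # sum over neighbours of u
        e = F.normalize(z + agg, p=2, dim=-1)
    return e, p
```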

3.3 Discriminative Semantic Aggregation


For data in which the various relation types between nodes and their neighbors are explicit and fixed, it is easy to construct multiple sub-semantic graphs as the input for multiple GCN models. As shown in Figure 2(a), a heterogeneous graph $G$ contains two different types of meta-paths (meta-path 1, meta-path 2). Then $G$ can be decomposed into multiple graphs $\tilde{G}$ consisting of single-semantic graphs $G_1$ and $G_2$, where each node $u$ and its neighbors are connected by path-relation 1 (2) in $G_1$ ($G_2$).
However, we cannot simply transfer this pre-constructed multi-graph method to all network architectures. In detail, for a graph without different edge types, we have to infer the implicit connecting factors of the edges to find semantic-paths, and the probability of each latent factor is calculated in the iterative routing process mentioned in the last section. To solve this dilemma, we propose a novel algorithm to automatically represent semantic-paths while the model is running. After the latent factor routing process, we obtain the soft probability matrix of node latent factors $p \in \mathbb{R}^{N \times N \times K}$, where $0 \leq p_{i,j}^{k} \leq 1$ denotes the probability that

(a) Multi-graph method (b) Discriminative semantic aggregation method

Fig. 2. A previous meta-path representation on a heterogeneous graph and our discriminative semantic aggregation method.

node $i$ connects to $j$ because of factor $k$. In our model, the latent factor should identify the specific connecting cause of each connected node pair. Here we transfer the probability matrix $p$ to a semantic adjacency matrix $A$, so that each element of $A$ only takes a binary value (0 or 1). In detail, for every node pair $i$ and $j$, $A_{i,j}^{k} = 1$ if $p_{i,j}^{k}$ is the largest value in $p_{i,j}$. As shown in Figure 2(b), each node is represented by $K$ components. In this graph, every node may connect with others by one relationship out of $K$ types; e.g., the relationship between node $u$ and $o$ is $R_2$ (denoted $A_{u,o}^{2} = 1$). For node $u$, we can find that it has two semantic-path-based neighbors $l$ and $v$. The semantic-paths of $(u, l)$ and $(u, v)$ are of two different types, composed by $\Phi_{u,o,l} = (A_{u,o}^{2}, A_{o,l}^{3}) = R_2 \circ R_3$ and $\Phi_{u,o,v} = (A_{u,o}^{2}, A_{o,v}^{1}) = R_2 \circ R_1$, respectively. We define the adjacency matrix $B$ for virtual semantic-path-based edges:
$$B_{u,v} = \sum_{[(u,o),(o,v)] \in E} A_{u,o}^{\top} A_{o,v}, \quad \{u, v\} \subset V, \qquad (4)$$
where $A_{u,o} \in \mathbb{R}^{K}$, $A_{o,v} \in \mathbb{R}^{K}$, and $B_{u,v} \in \mathbb{R}^{K \times K}$. For instance, in Figure 2(b), $A_{u,o} = [0, 1, 0]$, $A_{o,v} = [1, 0, 0]$, and $A_{o,l} = [0, 0, 1]$; in this way the two semantic-paths starting from node $u$ can be expressed as $B_{u,l}^{2,3} = 1$ and $B_{u,v}^{2,1} = 1$.
In the semantic information aggregation process, we aggregate the latent vectors connected by the corresponding semantic-paths as:
$$h_u = [e_{u,1}, e_{u,2}, \ldots, e_{u,K}] \in \mathbb{R}^{K \times \frac{d_{out}}{K}},$$
$$\tilde{h}_v = [z_{v,1}, z_{v,2}, \ldots, z_{v,K}] \in \mathbb{R}^{K \times \frac{d_{out}}{K}}, \qquad (5)$$
$$y_u = h_u + \operatorname{MeanPooling}_{v \in V, v \neq u}\!\left(B_{u,v} \tilde{h}_v\right), \quad u \in V,$$
where we use MeanPooling instead of the $\sum_{v \in V}$ operator to avoid large values, and $h_u, \tilde{h}_v \in \mathbb{R}^{K \times \frac{d_{out}}{K}}$ are both returned from the last DisenConv layer, at which point the factor probabilities have become stable since the representation of each node considers the influence from its neighbors. According

to Eq. (5), the aggregation of the two latent representations (end points) of a certain semantic-path denotes the mining result of that semantic relation; e.g., $\operatorname{Pooling}(e_{u,2}, z_{v,1})$ and $\operatorname{Pooling}(e_{u,2}, z_{l,3})$ express two different kinds of semantic pattern representations in Figure 2(b), $R_2 \circ R_1$ and $R_2 \circ R_3$ respectively. For all types of semantic-paths starting from node $u$, the weight of each type depends on its frequency. Note that, although the semantic adjacency matrix $A$ neglects some low-probability factors, our semantic-paths are integrated with the node states of DisenGCN, so the crucial information captured by the basic GCN model is not lost. The advantage of this aggregation method is that our model can distinguish different semantic relations without adding extra parameters, instead of designing separate graph convolutional networks for different semantic-paths. That is to say, the model does not increase the risk of overfitting after learning the graph semantic-paths. Here we only consider 2-order paths in our model; however, it can be straightforwardly extended to longer path mining.
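To illustrate Eqs. (4)-(5), the sketch below (our own simplification with dense adjacency tensors; the names aggregate_semantic_paths, p, adj, h, z are assumptions) keeps only the most probable factor per edge, builds the 2-hop semantic-path tensor B, and adds the mean-pooled path messages to the DisenConv output.

```python
import torch

def aggregate_semantic_paths(p, adj, h, z):
    """Sketch of Eqs. (4)-(5) with dense tensors (our own simplification).

    p:   [N, N, K] soft factor probabilities from the routing step
    adj: [N, N]    binary adjacency matrix of the original graph
    h:   [N, K, d] node states e_u returned by the last DisenConv layer
    z:   [N, K, d] projected features z_v from Eq. (1)
    """
    N, _, K = p.shape
    # Hard semantic adjacency A: keep only the most probable factor per edge.
    A = torch.zeros_like(p)
    A.scatter_(-1, p.argmax(dim=-1, keepdim=True), 1.0)
    A = A * adj.unsqueeze(-1)                       # zero out non-edges

    # Eq. (4): B[u, v, k1, k2] counts 2-hop paths u -R_{k1}-> o -R_{k2}-> v.
    B = torch.einsum('uok,ovl->uvkl', A, A)
    idx = torch.arange(N)
    B[idx, idx] = 0                                 # drop paths that return to u (v = u)

    # Eq. (5): add the mean-pooled path messages B_{u,v} @ z_v to h_u,
    # averaging over the path-based neighbours v of each node u.
    msg = torch.einsum('uvkl,vld->uvkd', B, z)      # [N, N, K, d]
    n_path_neigh = (B.sum(dim=(-1, -2)) > 0).float().sum(dim=1).clamp(min=1)
    return h + msg.sum(dim=1) / n_path_neigh.view(N, 1, 1)
```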

3.4 Independence Learning for Mapping Subspaces


In fact, one type of edge in a meta-path tries to denote one unique meaning, so
the K latent factors in our work should not overlap. So, the assumption of using
latent factors to construct semantic-paths is that these different factors extracted
by latent factor routing module can focus on different connecting causes. In
other words, we should encourage the representations of different factors to be
of sufficient independence. Before the probability calculating, on our features,
the focused point views of K subspaces in Eq. (1) should keep different. Our
solution considers that the distance between independence factor representations
zi,k , k ≤ K should be sufficient long if they were projected to one subspace.
First, we project the input values z in Eq. (1) into an unified space to get
vectors Q and K as follow:
$$Q = zw, \qquad K = zw, \qquad (6)$$
where $w \in \mathbb{R}^{\frac{d_{out}}{K} \times \frac{d_{out}}{K}}$ is the projection parameter matrix. Then, the independence loss based on the distances between unequal factor representations can be calculated as follows:
$$L_i = \frac{1}{M} \sum \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_{out}/K}}\right) \odot (1 - I), \qquad (7)$$
where $I \in \mathbb{R}^{K \times K}$ denotes an identity matrix, $\odot$ is the element-wise product, and $M = K^2 - K$. Specifically, we learn a lesson from [33] and scale the dot products by $1/\sqrt{d_{out}/K}$ to counteract the vanishing-gradient effect for large values. As long as $L_i$ is minimized in the training process, the distances between different factors tend to become larger; that is, the $K$ subspaces capture sufficiently different information to encourage independence among the learned latent factors.
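A minimal sketch of the independence loss in Eqs. (6)-(7) is given below, assuming the per-factor features are stacked as an [N, K, d] tensor; the naming is our own, and the summation over nodes follows the form of Eq. (8).

```python
import torch
import torch.nn.functional as F

def independence_loss(z, w):
    """Sketch of Eqs. (6)-(7): penalise similarity between the K factor
    representations of each node after projecting them into a shared space.

    z: [N, K, d] per-factor features from Eq. (1), with d = d_out // K
    w: [d, d]    shared projection matrix
    """
    N, K, d = z.shape
    Q = z @ w                                    # [N, K, d]
    Kmat = z @ w                                 # same projection, as in Eq. (6)
    scores = Q @ Kmat.transpose(1, 2) / d**0.5   # [N, K, K], scaled dot products
    attn = F.softmax(scores, dim=-1)
    off_diag = 1.0 - torch.eye(K, device=z.device)
    M = K * K - K                                # number of off-diagonal entries
    return (attn * off_diag).sum() / M
```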
Next, we analyze the validity of this optimization. Latent factor routing aims to utilize the disentangled algorithm to calculate the latent factors between every two objects. However, this approach is a variant of the von Mises-Fisher (vMF) [2] mixture model, and such an EM algorithm cannot optimize the independence of latent factors within the iterative process. Random initialization of the mapping parameters also cannot guarantee that the subspaces attend to different concerns. To address this shortcoming, we make the following assumption:
Assumption 3.1. The features in different subspaces remain sufficiently independent when the margins of their projections in the unified space are sufficiently distinct.
This assumption is inspired by the Latent Semantic Analysis (LSA) algorithm [16], which projects multi-dimensional features of a vector space model into a semantic space with fewer dimensions, keeping the semantic features of the original space in a statistical sense. So, our optimization approach is listed below:

$$\begin{aligned}
w &= \arg\min \sum_{u}^{V} \operatorname{softmax}\!\left(Q K^{\top}\right) \odot (1 - I) \\
  &= \arg\min \sum_{u}^{V} \operatorname{softmax}\!\left((z_u w)(z_u w)^{\top}\right) \odot (1 - I) \\
  &= \arg\min \sum_{u}^{V} \frac{\sum_{k_1 \neq k_2} \exp\!\left(z_{u,k_1} w \cdot z_{u,k_2} w\right)}{\sum_{k_1,k_2} \exp\!\left(z_{u,k_1} w \cdot z_{u,k_2} w\right)} \qquad (8) \\
  &= \arg\max \sum_{u}^{V} \sum_{k_1 \neq k_2} \operatorname{distance}\!\left(z_{u,k_1} w,\ z_{u,k_2} w\right), \\
  &\qquad \text{s.t. } 1 \leq k_1 \leq K,\ 1 \leq k_2 \leq K.
\end{aligned}$$

In the above equation, $w$ denotes the training parameter to be optimized. We ignore the $1/M$ and $1/\sqrt{d_{out}/K}$ terms in Eq. (7) because they do not affect the optimization procedure. With the increase of the inter-distances of the $K$ subspaces, the intra-variance of the factors in each subspace will not become larger than its original level (as at random initialization). The InterVar/IntraVar ratio therefore becomes larger; in other words, we obtain more sufficient independence of the mapping subspaces.

3.5 Algorithm Framework


In this section, we describe the overall algorithm of SGCN for performing node-related tasks. For graph $G$, the ground-truth label of node $u$ is $\dagger_u \in \{0, 1\}^C$, where $C$ is the number of classes. The details of our algorithm are shown in Algorithm 1. First, we calculate the independence loss $L_i$ after the factor channels capture features. Then, $L$ layers of DisenConv operations return the stable probability matrix $p$. After that, the automatic graph semantic-path representation $y$ is learned based on $p$. To apply $y$ to different tasks, we design the final layer as a fully-connected layer $y' = W_y y + b_y$, where $W_y \in \mathbb{R}^{d_{out} \times C}$ and $b_y \in \mathbb{R}^{C}$. For instance, for the semi-supervised node classification task, we implement
$$L_s = -\sum_{u \in V_L} \frac{1}{C} \sum_{c=1}^{C} \dagger_u(c) \ln\!\left(\hat{y}_u(c)\right) + \lambda L_i \qquad (9)$$

Algorithm 1 Semantic Graph Convolutional Networks
Input: the feature vector matrix $x \in \mathbb{R}^{N \times d_{in}}$, the graph $G = (V, E)$, the number of iterations $T$, and the number of disentangle layers $L$.
Output: the representation $y_u \in \mathbb{R}^{d_{out}}$ of each node $u$, $\forall u \in V$
1: for $i \in V$ do
2:   for $k = 1, 2, \ldots, K$ do
3:     $z_{i,k} \leftarrow \sigma(W_k x_i + b_k) / \| \sigma(W_k x_i + b_k) \|_2$
4: $Q \leftarrow z w_q$, $K \leftarrow z w_k$
5: $L_i = \frac{1}{M} \sum \operatorname{softmax}\!\left(Q K^{\top} / \sqrt{d_{out}/K}\right) \odot (1 - I)$
6: for disentangle layer $l = 1, 2, \ldots, L$ do
7:   $e_{u,k}^{t=1} \leftarrow z_{u,k}$, $\forall k = 1, 2, \ldots, K$, $\forall u \in V$
8:   for routing iteration $t = 1, 2, \ldots, T$ do
9:     Get the soft probability matrix $p$, calculating $p_{u,v}^{k,t}$ by Eq. (3)
10:    Update the latent representation $e_{u,k}^{t}$, $\forall u \in V$ by Eq. (2)
11:  $e_u \leftarrow \operatorname{dropout}(\operatorname{ReLU}(e_u))$, $z_{u,k} \leftarrow e_{u,k}^{t=T}$, $\forall k = 1, 2, \ldots, K$, $\forall u \in V$ when $l \leq L - 1$
12: Transfer $p$ to the hard probability matrix $A$
13: $B_{u,v} \leftarrow \sum_{[(u,o),(o,v)] \in E} A_{u,o}^{\top} A_{o,v}$, $\{u, v\} \subset V$
14: Get each aggregation $y_u^k$ of the latent vectors on semantic-paths by Eq. (5)
15: return $\{y_u, \forall u \in V\}$, $L_i$

as the loss function, where $\hat{y}_u = \operatorname{softmax}(y'_u)$, $V_L$ is the set of labeled nodes, and $L_i$ is jointly trained by summing it with the task loss function. For the multi-label classification task, since the label $\dagger_u$ consists of more than one positive bit, we define the multi-label loss function for node $u$ as:
$$L_m = -\frac{1}{C} \sum_{c=1}^{C} \left[ \dagger_u(c) \cdot \operatorname{sigmoid}\!\left(y'_u(c)\right) + (1 - \dagger_u(c)) \cdot \operatorname{sigmoid}\!\left(-y'_u(c)\right) \right] + \lambda L_i. \qquad (10)$$

Moreover, for the node clustering task, $y'$ serves as the input feature of K-Means.
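As an illustration of Eq. (9), the following sketch computes the joint semi-supervised objective from the final-layer outputs; the argument names and the boolean labeled mask are our own assumptions, not the paper's code.

```python
import torch.nn.functional as F

def semi_supervised_loss(y_final, labels, labeled_mask, L_i, lam=1.0):
    """Sketch of Eq. (9): cross-entropy over the labeled set V_L (scaled by 1/C)
    plus the weighted independence loss.

    y_final:      [N, C] outputs of the final fully-connected layer (y')
    labels:       [N]    ground-truth class ids
    labeled_mask: [N]    boolean mask selecting the labeled nodes V_L
    L_i:          scalar independence loss from Eq. (7)
    """
    C = y_final.size(1)
    log_probs = F.log_softmax(y_final[labeled_mask], dim=-1)
    ce = F.nll_loss(log_probs, labels[labeled_mask], reduction='sum')
    return ce / C + lam * L_i
```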

3.6 Time Complexity Analysis and Optimization


We should notice a problem in Section 3.3: the time complexity of Eqs. (4)-(5) by dense matrix calculation is $O\!\left(N(N-1)(N-2)K^2 + N\left((N-1)K^2 \times \frac{d_{out}}{K} + 2K\frac{d_{out}}{K}\right)\right) \approx O(N^3 K^2 + N^2 K^2)$. Such a time complexity brings a heavy computing load, so we optimize this algorithm in the actual implementation. In real-world datasets, one node connects to far fewer neighbors than the total number of nodes in the graph. Therefore, when we create the semantic-path-based adjacency matrix, we define the matrix $\tilde{A} \in \mathbb{R}^{N \times C \times K}$ to denote 1-order neighbor relationships, where $C$ is the maximum number of neighbors that we define, and $\tilde{A}_u^k$ is the id of a neighbor if they are connected by $R_k$, and $\tilde{A}_u^k = 0$ otherwise. Then the semantic-path relations of type $(R_{k_1}, R_{k_2})$ of $u \in V$ are denoted by $\tilde{B}_u^{k_1,k_2} = \tilde{A}[\tilde{A}[u,:,k_1],:,k_2] \in \mathbb{R}^{C \times C}$, and the pooling of this semantic pattern is the mean pooling of $z[\tilde{B}_u^{k_1,k_2}, k_2, :]$. According to the analysis above, the time complexity can be reduced to $O\!\left(K^2\left(N C^2 + N C^2 \frac{d_{out}}{K}\right)\right) \approx O(2 N K^2 C^2)$.
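The sketch below illustrates the padded neighbor-table trick described above; it is our own simplified reading of the construction of the tables $\tilde{A}$ and $\tilde{B}$ (names and padding behavior are assumptions, including reusing node id 0 as padding as in the text).

```python
import torch

def build_neighbor_table(adj_factor, C):
    """Padded neighbour table \tilde{A}: for each node u and factor k it stores up
    to C neighbour ids connected through relation R_k, padded with 0.

    adj_factor: [N, N, K] hard semantic adjacency A (0/1 entries)
    C:          maximum number of neighbours kept per node
    """
    N, _, K = adj_factor.shape
    table = torch.zeros(N, C, K, dtype=torch.long)
    for u in range(N):
        for k in range(K):
            neigh = adj_factor[u, :, k].nonzero(as_tuple=True)[0][:C]
            table[u, :neigh.numel(), k] = neigh
    return table

def two_hop_neighbors(table, u, k1, k2):
    """Gather the 2-hop path neighbours of node u along (R_{k1}, R_{k2}),
    i.e. \tilde{B}_u^{k1,k2} = \tilde{A}[\tilde{A}[u, :, k1], :, k2]."""
    first_hop = table[u, :, k1]          # [C] neighbour ids reached via R_{k1}
    return table[first_hop, :, k2]       # [C, C] neighbour ids reached via R_{k2}
```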

Table 1. The statistics of datasets.

Dataset      Type                Nodes   Edges    Classes  Features  Multi-label
Pubmed       Citation Network    19,717  44,338   3        500       False
Citeseer     Citation Network    3,327   4,732    6        3,703     False
Cora         Citation Network    2,708   5,429    7        1,433     False
Blogcatalog  Social Network      10,312  333,983  39       -         True
POS          Word Co-occurrence  4,777   184,812  40       -         True

4 Experiments
In this section, we empirically assess the efficacy of SGCN on several node-related tasks, including semi-supervised node classification, node clustering and multi-label node classification. We then provide a node visualization analysis and semantic-path sampling experiments to verify the validity of our idea.

4.1 Experimental Setup


Datasets. We conduct our experiments on 5 real-world datasets, Citeseer, Cora,
Pubmed, POS and BlogCatalog [27,11,32], whose statistics are listed in Table 1.
The first three citation networks are benchmark datasets for semi-supervised
node classification and node clustering. For graph content, the nodes, edges, and
labels in these three represent articles, citations, and research areas, respectively.
Their node features correspond to a bag-of-words representation of a document.
POS and BlogCatalog are suitable for multi-label node classification task.
Their labels are part-of-speech tags and user interests, respectively. In detail,
BlogCatalog is a social relationships network of bloggers who post blogs in the
BlogCatalog website. These labels represent the blogger’s interests inferred from
the text information provided by the blogger. POS (Part-of-Speech) is a co-
occurrence network of words appearing in the first million bytes of the Wikipedia
dump. The labels in POS denote the Part-of-Speech tags inferred via the Stan-
ford POS-Tagger. Since these two graphs do not provide node features, we use the rows of their adjacency matrices as node features.

Baselines. To demonstrate the advantages of our model, we compare SGCN


with some representative graph neural networks, including the graph convolu-
tion network (GCN) [15] and the graph attention network (GAT) [34]. In detail,
GCN [15] is a simplified spectral method of node aggregating, while GAT weights
a node’s neighbors by the attention mechanism. GAT achieves state of the art
in many tasks, but it contains far more parameters than GCN and our model.
Besides, ChebNet [6] is a spectral graph convolutional network by means of a
Chebyshev expansion of the graph Laplacian, MoNet [22] extends CNN archi-
tectures by learning local, stationary, and compositional task-specific features.
And IPGDN [18] is the advanced version of DisenGCN. We also implement other non-graph-convolution methods, including the random-walk-based network embedding DeepWalk [24], the link-based classification method ICA [19], the inductive

embedding based approach Planetoid [38], label propagation approach LP [39],


semi-supervised embedding learning model SemiEmb [36] and so on.
In addition, we conduct ablation experiments on node classification and clustering to verify the effectiveness of the main components of SGCN: SGCN-path is our complete model without the independence loss, and SGCN-indep denotes SGCN without the semantic-path representations.
In the multi-label classification experiment, the original implementations of
GCN and GAT do not support multi-label tasks. We therefore modify them
to use the same multi-label loss function as ours for fair comparison in multi-
label tasks. We additionally include three node embedding algorithms, including
DeepWalk [24], LINE [31], and node2vec [11], because they are demonstrated to
perform strongly on the multi-label classification. Besides, we remove IPGDN
since it is not designed for multi-label task.

Implementation Details. We train our models on one machine with 8 NVIDIA


Tesla V100 GPUs. For some experimental settings and common baselines we follow [20,18], and we optimize the parameters of the models with Adam [14]. Besides, we tune the hyper-parameters of both our model and base-
lines using hyperopt [3]. In detail, for semi-supervised classification and node
clustering, we set the number of iterations T = 6, the layers L ∈ {1, 2, ..., 8}, and the number of components K ∈ {1, 2, ..., 7} (the number of mapping channels; therefore, for our model, the dimension of a component in the SGCN model is [dout/K] ∈ {10, 12, ..., 8}). The dropout rate ∈ {0.05, 0.10, ..., 0.95}, the trade-off λ ∈ {0.0, 0.5, ..., 10.0}, the learning rate ∼ loguniform[e−8, 1], and the l2 regular-
ization term ∼ loguniform [e − 10, 1]. Besides, it should be noted that, in the
multi-label node classification, the output dimension dout is set to 128 to achieve
better performance, while setting the dimension of the node embeddings to be
128 as well for other node embedding algorithms. And, when tuning the hyper-
parameters, we set the number of components K ∈ {4, 8, ...28} in the latent
factor routing process. Here K = 8 makes the best result in our experiments.
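For reference, a possible hyperopt search-space definition matching the ranges listed above could look like the sketch below; the exact grids and the stub objective are illustrative assumptions, not the authors' tuning script.

```python
from hyperopt import fmin, tpe, hp

# Search space mirroring the ranges reported above (grids written as choices).
space = {
    'L':       hp.choice('L', list(range(1, 9))),       # number of disentangle layers
    'K':       hp.choice('K', list(range(1, 8))),       # number of components
    'dropout': hp.quniform('dropout', 0.05, 0.95, 0.05),
    'lambda':  hp.quniform('lambda', 0.0, 10.0, 0.5),
    'lr':      hp.loguniform('lr', -8, 0),               # i.e. loguniform[e^-8, 1]
    'l2':      hp.loguniform('l2', -10, 0),               # i.e. loguniform[e^-10, 1]
}

def objective(params):
    # Placeholder objective: in practice, train SGCN with `params` and return
    # the validation error to be minimised. A dummy value keeps the sketch runnable.
    return 1.0 / (1.0 + params['K'])

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
print(best)
```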

4.2 Semi-Supervised Node Classification


For semi-supervised node classification, there are only 20 labeled instances for
each class. It means that the information of neighbors should be leveraged when
predicting the labels of target nodes. Here we follow the experimental settings
of previous works [38,15,34].
We report the classification accuracy (ACC) results in Table 2. The majority
of nodes only connect with those neighbors of the same class. According to Table
2, it is obvious that SGCN achieves the best performance amongst all baselines.
Here SGCN outperforms the most powerful baseline IPGDN with 1.55%, 0.47% and 1.1% relative accuracy improvements on the three datasets; compared with the incremental gains of previous models, our model shows clear improvements in the node classification task. Our proposed model achieves the best ACC of 85.4% on the Cora dataset, which is a great improvement on this dataset. On the other hand, in the ablation experiment (the last three rows of Table 2), the complete

Table 2. Semi-supervised classification.

Models      Cora  Citeseer  Pubmed
MLP         55.1  46.5      71.4
SemiEmb     59.0  59.6      71.1
LP          68.0  45.3      63.0
DeepWalk    67.2  43.2      65.3
ICA         75.1  69.1      73.9
Planetoid   75.7  64.7      77.2
ChebNet     81.2  69.8      74.4
GCN         81.5  70.3      79.0
MoNet       81.7  -         78.8
GAT         83.0  72.5      79.0
DisenGCN    83.7  73.4      80.5
IPGDN       84.1  74.0      81.2
SGCN-indep  84.2  73.7      82.0
SGCN-path   84.6  74.4      81.6
SGCN        85.4  74.2      82.1

Table 3. Node clustering with double metrics.

            Cora        Citeseer    Pubmed
Models      NMI   ARI   NMI   ARI   NMI   ARI
SemiEmb     48.7  41.5  31.2  21.5  27.8  35.2
DeepWalk    50.3  40.8  30.5  20.6  29.6  36.6
Planetoid   52.0  40.5  41.2  22.1  32.5  33.9
ChebNet     49.8  42.4  42.6  41.5  35.6  38.6
GCN         51.7  48.9  42.8  42.8  35.0  40.9
GAT         57.0  54.1  43.1  43.6  35.0  41.4
DisenGCN    58.4  60.4  43.7  42.5  36.1  41.6
IPGDN       59.2  61.0  44.3  43.0  37.0  42.0
SGCN-indep  60.2  59.2  44.7  42.8  37.2  42.3
SGCN-path   60.5  60.7  45.1  44.0  37.3  42.8
SGCN        60.7  61.6  44.9  44.2  37.9  42.5

SGCN model is superior to either algorithm in at least two datasets. Moreover,


we can find that SGCN-indep and SGCN-path both perform better than previous algorithms to some degree. This reveals the effectiveness of our semantic-path mining module and the independence learning for subspaces.

4.3 Multi-label Node Classification


In the multi-label classification experiment, every node is assigned one or more
labels from a finite set L. We follow node2vec [11] and report the performance of
each method while varying the number of nodes labeled for training from 10%
|V | to 90% |V |, where |V | is the total number of nodes. The rest of nodes are
split equally to form a validation set and a test set. Then with the best hyper-
parameters on the validation sets, we report the averaged performance of 30
runs on each multi-label test set. Here we summarize the results of multi-label
node classification by Macro-F1 and Micro-F1 scores in Figure 3. Firstly, there
is an obvious point that proposed SGCN model achieves the best performances
in both two datasets. Compared with DisenGCN model, SGCN combines with
semantic semantic-paths can achieve the biggest improvement of 20.0% when we
set 10% of labeled nodes in POS dataset. The reason may be that the relation
type of POS dataset is Word Co-occurrence, there are lots of regular explicit
or implicit semantics amongst these relationships between different words. In
the other dataset, although SGCN does not show a full lead but achieves the
highest accuracy on both indicators. We find that the GCN-based algorithms are
usually superior to the traditional node embedding algorithms in overall effect.
Although for the Micro-F1 score on Blogcatalog, GCN produces the poor results.
In addition, the SGCN algorithm can make both Macro-F1 and Micro-F2 achieve
good results at the same time, and there will be no bad phenomenon in one of
them. Because this approach would not ignore the information provided by the
classes with few samples but important semantic relationships.

(a) Macro-F1 POS (b) Macro-F1 Blogcatalog (c) Micro-F1 POS (d) Micro-F1 Blogcatalog
Fig. 3. Results of multi-label node classification.

4.4 Node Clustering


To further evaluate the embeddings learned by the above algorithms, we also conduct the clustering task. Following [18], for our model and each baseline, we obtain the node embeddings via a feed-forward pass once the model is trained. Then we input the node embeddings to the K-Means algorithm to cluster nodes. The ground truth is the same as that of the node classification task, and the number of clusters is set to the number of classes. In detail, we employ two metrics, Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI), to validate the clustering results. Since the performance of K-Means is affected by the initial centroids, we repeat the process 20 times and report the average results in Table 3. As can be seen in Table 3, SGCN consistently outperforms all baselines, and GNN-based algorithms usually achieve better performance. Besides, with the semantic-path representation, SGCN and SGCN-path perform significantly better than DisenGCN and IPGDN, and our proposed algorithm gets the best results on both NMI and ARI. This shows that SGCN captures a more meaningful node embedding by learning semantic patterns from the graph.
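A sketch of this clustering protocol using scikit-learn is shown below; the function name and the averaging over runs with different seeds are our own illustration of the procedure described above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def evaluate_clustering(embeddings, labels, n_classes, n_runs=20, seed=0):
    """Run K-Means on the learned node embeddings several times and report
    the averaged NMI / ARI against the ground-truth classes.

    embeddings: [N, d] node embeddings produced by the trained model
    labels:     [N]    ground-truth class ids
    """
    nmis, aris = [], []
    for run in range(n_runs):
        pred = KMeans(n_clusters=n_classes, random_state=seed + run).fit_predict(embeddings)
        nmis.append(normalized_mutual_info_score(labels, pred))
        aris.append(adjusted_rand_score(labels, pred))
    return float(np.mean(nmis)), float(np.mean(aris))
```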

4.5 Visualization Analysis and Semantic-paths Sampling


We try to demonstrate the intuitive changes of node representations after incorporating semantic patterns. Therefore, we utilize t-SNE [21] to transform the feature representations (node embeddings) of SGCN and DisenGCN into a 2-dimensional space for a more intuitive visualization. Here we visualize the node embeddings of Cora (the change in representation visualization is similar on

(a) DisenGCN (b) SGCN
Fig. 4. Node representation visualization of Cora.
Fig. 5. Semantic-paths sampling (accuracy (%) vs. the number of cut C).

other datasets), where different colors denote different research areas. According to Figure 4, the visualization of SGCN is more distinguishable than that of DisenGCN. It demonstrates that the embedding learned by SGCN presents high intra-class similarity and separates papers into different research areas with distinct boundaries. On the contrary, DisenGCN does not perform as well, since the inter-cluster margins are not distinguishable enough; in several clusters, many nodes belonging to different areas are mixed with others.
Then, to explore the influence of different scales of semantic-paths on our model's performance, we implement a semantic-path sampling experiment on Cora. As mentioned in Section 3.6, to capture different numbers of semantic-paths, we change the hyper-parameter cut size C to restrict the sampling size of each node's neighbors. As shown in Figure 5, the SGCN model with the path representation achieves higher performance than the first point (C = 0). From the perspective of the global trend, with the increase of C, the classification accuracy of the SGCN model also improves steadily, although it achieves the highest score when C = 5. This means that a GCN model combined with semantic-paths of sufficient scale can really learn better node representations.

5 Conclusion
In this paper, we proposed a novel framework named Semantic Graph Convolutional Networks, which incorporates semantic-paths automatically during the node aggregating process. Therefore, SGCN provides semantic learning ability to general graph algorithms. We conducted extensive experiments on various real-world datasets to evaluate the superior performance of our proposed model. Moreover, our method has good extensibility: all kinds of path-based algorithms in the graph embedding field can be directly applied in SGCN to adapt to different tasks, and we will make more explorations in future work.

6 Acknowledgements
This research was partially supported by grants from the National Key Research and Development Program of China (No. 2018YFC0832101), and the National Natural Science Foundation of China (Nos. U20A20229 and 61922073). This research was also supported by Meituan-Dianping Group.

References
1. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning
to align and translate. arXiv preprint arXiv:1409.0473 (2014)
2. Banerjee, A., Dhillon, I.S., Ghosh, J., Sra, S.: Clustering on the unit hypersphere
using von mises-fisher distributions. J. Mach. Learn. Res. 6(Sep), 1345–1382 (2005)
3. Bergstra, J., Yamins, D., Cox, D.D.: Hyperopt: A python library for optimizing
the hyperparameters of machine learning algorithms. In: Proceedings of the 12th
Python in science conference. pp. 13–20. Citeseer (2013)
4. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn.
Res. 3, 993–1022 (2003), http://jmlr.org/papers/v3/blei03a.html
5. Bronstein, M.M., Bruna, J., LeCun, Y., Szlam, A., Vandergheynst, P.: Geometric
deep learning: going beyond euclidean data. IEEE Signal Processing Magazine
34(4), 18–42 (2017)
6. Defferrard, M., Bresson, X., Vandergheynst, P.: Convolutional neural networks on
graphs with fast localized spectral filtering. In: Advances in neural information
processing systems. pp. 3844–3852 (2016)
7. Dong, Y., Chawla, N.V., Swami, A.: metapath2vec: Scalable representation learn-
ing for heterogeneous networks. In: Proceedings of the 23rd ACM SIGKDD inter-
national conference on knowledge discovery and data mining. pp. 135–144 (2017)
8. Duvenaud, D.K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-
Guzik, A.: Convolutional networks on graphs for learning molecular fingerprints.
In: Advances in neural information processing systems. pp. 2224–2232 (2015)
9. Fan, Y., Hou, S., Zhang, Y., Ye, Y., Abdulhayoglu, M.: Gotcha-sly malware! scor-
pion a metagraph2vec based malware detection system. In: Proceedings of the 24th
ACM SIGKDD. pp. 253–262 (2018)
10. Gori, M., Monfardini, G., Scarselli, F.: A new model for learning in graph domains.
In: Proceedings. 2005 IEEE International Joint Conference on Neural Networks,
2005. vol. 2, pp. 729–734. IEEE (2005)
11. Grover, A., Leskovec, J.: node2vec: Scalable feature learning for networks. In: Pro-
ceedings of the 22nd ACM SIGKDD. pp. 855–864 (2016)
12. Hamilton, W., Ying, Z., Leskovec, J.: Inductive representation learning on large
graphs. In: NIPS. pp. 1024–1034 (2017)
13. Henaff, M., Bruna, J., LeCun, Y.: Deep convolutional networks on graph-structured
data. arXiv preprint arXiv:1506.05163 (2015)
14. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: 3rd In-
ternational Conference on Learning Representations, ICLR 2015, San Diego, CA,
USA, May 7-9, 2015, Conference Track Proceedings (2015)
15. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional
networks. arXiv preprint arXiv:1609.02907 (2016)
16. Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic anal-
ysis. Discourse processes 25(2-3), 259–284 (1998)
17. Li, Z., Wu, B., Liu, Q., Wu, L., Zhao, H., Mei, T.: Learning the compositional visual
coherence for complementary recommendations. In: IJCAI-20. pp. 3536–3543
18. Liu, Y., Wang, X., Wu, S., Xiao, Z.: Independence promoted graph disentangled
networks. Proceedings of the AAAI Conference on Artificial Intelligence (2020)
16 L. Wu et al.

19. Lu, Q., Getoor, L.: Link-based classification. In: Proceedings of the 20th Interna-
tional Conference on Machine Learning (ICML-03). pp. 496–503 (2003)
20. Ma, J., Cui, P., Kuang, K., Wang, X., Zhu, W.: Disentangled graph convolutional
networks. In: International Conference on Machine Learning. pp. 4212–4221 (2019)
21. Maaten, L.v.d., Hinton, G.: Visualizing data using t-sne. Journal of machine learn-
ing research 9(Nov), 2579–2605 (2008)
22. Monti, F., Boscaini, D., Masci, J., Rodola, E., Svoboda, J., Bronstein, M.M.: Geo-
metric deep learning on graphs and manifolds using mixture model cnns. In: IEEE
Conference on Computer Vision and Pattern Recognition. pp. 5115–5124 (2017)
23. Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann ma-
chines. In: Proceedings of the 27th international conference on machine learning
(ICML-10). pp. 807–814 (2010)
24. Perozzi, B., Al-Rfou, R., Skiena, S.: Deepwalk: Online learning of social represen-
tations. In: Proceedings of the 20th ACM SIGKDD. pp. 701–710 (2014)
25. Qiao, L., Zhao, H., Huang, X., Li, K., Chen, E.: A structure-enriched neural net-
work for network embedding. Expert Systems with Applications pp. 300–311 (2019)
26. Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G.: The graph
neural network model. IEEE Transactions on Neural Networks 20(1), 61–80 (2008)
27. Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., Eliassi-Rad, T.: Collec-
tive classification in network data. AI magazine 29(3), 93–93 (2008)
28. Shang, J., Qu, M., Liu, J., Kaplan, L.M., Han, J., Peng, J.: Meta-path guided
embedding for similarity search in large-scale heterogeneous information networks.
arXiv preprint arXiv:1610.09769 (2016)
29. Shi, C., Hu, B., Zhao, W.X., Philip, S.Y.: Heterogeneous information network
embedding for recommendation. IEEE Transactions on Knowledge and Data En-
gineering 31(2), 357–370 (2018)
30. Sun, L., He, L., Huang, Z., Cao, B., Xia, C., Wei, X., Philip, S.Y.: Joint embedding
of meta-path and meta-graph for heterogeneous information networks. In: 2018
IEEE International Conference on Big Knowledge. pp. 131–138. IEEE (2018)
31. Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., Mei, Q.: Line: Large-scale infor-
mation network embedding. In: Proceedings of the 24th international conference
on world wide web. pp. 1067–1077 (2015)
32. Tang, L., Liu, H.: Leveraging social media networks for classification. Data Mining
and Knowledge Discovery 23(3), 447–478 (2011)
33. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser,
L., Polosukhin, I.: Attention is all you need. In: Advances in neural information
processing systems. pp. 5998–6008 (2017)
34. Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., Bengio, Y.: Graph
attention networks. arXiv preprint arXiv:1710.10903 (2017)
35. Wang, X., Ji, H., Shi, C., Wang, B., Ye, Y., Cui, P., Yu, P.S.: Heterogeneous graph
attention network. In: The World Wide Web Conference. pp. 2022–2032 (2019)
36. Weston, J., Ratle, F., Mobahi, H., Collobert, R.: Deep learning via semi-supervised
embedding. In: Neural networks: Tricks of the trade, pp. 639–655. Springer (2012)
37. Wu, L., Li, Z., Zhao, H., Pan, Z., Liu, Q., Chen, E.: Estimating early fundraising
performance of innovations via graph-based market environment model. In: AAAI.
pp. 6396–6403 (2020)
38. Yang, Z., Cohen, W.W., Salakhutdinov, R.: Revisiting semi-supervised learning
with graph embeddings. arXiv preprint arXiv:1603.08861 (2016)
39. Zhu, X., Ghahramani, Z., Lafferty, J.D.: Semi-supervised learning using gaussian
fields and harmonic functions. In: Proceedings of the 20th International conference
on Machine learning (ICML-03). pp. 912–919 (2003)
