Learning The Implicit Semantic Representation On Graph-Structured Data
1 University of Science and Technology of China
  {wulk,zhili03}@mail.ustc.edu.cn, {qiliuql,cheneh}@ustc.edu.cn
2 Tianjin University, Tianjin, China
  [email protected]
3 Meituan-Dianping Group, Beijing, China
  [email protected]
1 Introduction
The representations of objects (nodes) in large graph-structured data, such as
social or biological networks, have proved extremely effective as feature
inputs for graph analysis tasks. Recently, there have been many attempts in
the literature to extend neural networks to representation learning on graphs,
such as Graph Convolutional Networks (GCN) [15], GraphSAGE [12]
and Graph Attention Networks (GAT) [34].
In spite of enormous success, previous graph neural networks mainly learn
representations by describing the neighborhoods as a perceptual whole, and
they have not gone deep into the exploration of semantic information in graphs.
Our code is available online at https://github.com/WLiK/SGCN_SemanticGCN
(Figure 1: an example of a scholar cooperation network in which all nodes are connected through the same "Co-author" relation.)
Taking the movie network as an example, the paths based on composite relations
of “Movie-Actor-Movie” and “Movie-Director-Movie” may reveal two different
semantic patterns, i.e., the two movies have the same actor (director). Here the
semantic pattern is defined as a specific piece of knowledge expressed by the
corresponding path. Although several researchers [35,30] have attempted to capture
the graph semantics of composite relations between two objects via meta-paths,
existing work relies on given heterogeneous information such as different
types of objects and distinct object connections. However, in the real world,
a large amount of graph-structured data does not have such explicit characteristics.
As shown in Figure 1, in a scholar cooperation network, there are usually no explicit
node (relation) types and all nodes are connected through the same relation, i.e.,
“Co-author”. Fortunately, behind the same relation, there are various implicit
factors which may express different connecting reasons, such as “Classmate”
and “Colleague” for the same relation “Co-author”. These factors can further
compose diverse semantic-paths (e.g. “Student-Advisor-Student” and “Advisor-
Student-Advisor”), which reveal sophisticated semantic associations and help to
generate more informative representations. Then, how to automatically exploit
comprehensive semantic patterns based on the implicit factors behind a general
graph is a non-trivial problem.
In general, there are several challenges in solving this problem. Firstly, it is
essential to adaptively infer the latent factors behind graphs. We notice that
several studies have begun to explore the desired latent factors behind a graph via
disentangled representations [20,18]. However, they mainly focus on inferring the
latent factors through disentangled representation learning while failing to discrim-
inatively model the independent implicit factors behind the same connections.
Secondly, after discovering the latent factors, how to select the most meaningful
semantics and aggregate the diverse semantic information remains largely unex-
plored. Last but not least, it is quite difficult to further exploit the implicit semantic
patterns while remaining capable of conducting inductive learning.
To address the above challenges, in this paper, we propose a novel model, Semantic
Graph Convolutional Networks (SGCN), which sheds light on the exploration
of implicit semantics in the node aggregating process. Specifically, we first propose
a latent factor routing method with the DisenConv layer [20] to adaptively
infer the probability of each latent factor that may have caused the link from
a given node to one of its neighbors. Then, to further explore the diverse
semantic information, we transfer the probabilities between every two connected
nodes into the corresponding semantic adjacency matrix, which represents the
semantic-paths in a graph. Afterwards, most semantic strengthening methods, such as
a semantic-level attention module, can be easily integrated into our model to
aggregate the diverse semantic information from these semantic-paths. Finally,
to encourage the independence of the implicit semantic factors and to conduct
inductive learning, we design an effective joint loss function that maintains the
independent mapping channels of different factors. This loss function is able to
focus on different semantic characteristics during the training process.
Specifically, the contributions of this paper can be summarized as follows:
2 Related Works
3.1 Preliminary
$$y = f\big(x_u, x_v : (u, v) \in E \mid u \in V\big),$$
Here we aim to introduce the disentangled algorithm that calculates the latent
factors between every two objects. We assume that each node is composed of $K$
components, and the probability that factor $k$ explains the connection between node $u$
and its neighbor $v$ at iteration $t$ is computed as
$$p_{u,v}^{k,t} = \frac{\exp\big(z_{v,k}^{\top} e_{u,k}^{t}\big)}{\sum_{k=1}^{K} \exp\big(z_{v,k}^{\top} e_{u,k}^{t}\big)}, \qquad (3)$$
where iteration $t = 1, 2, \ldots, T$, $p^{k}_{u,v}$ indicates the probability that factor $k$ is the
reason why node $u$ reaches neighbor $v$, and satisfies $p^{k}_{u,v} \geq 0$ and $\sum_{k=1}^{K} p^{k}_{u,v} = 1$.
The neighborhood routing mechanism iteratively infers $p^{k}_{u,v}$ and constructs
$e_k$. Note that there are in total $L$ DisenConv layers; in each layer $l \leq L-1$, $z_{u,k}$ is
finally assigned the value of $e^{T}_{u,k}$. More details can be found in Algorithm 1.
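To make this routing step concrete, the snippet below gives a minimal PyTorch sketch of a single routing iteration in the spirit of Eq. (3) and the DisenConv routine of [20]; the tensor layout, function name, and the normalized update of $e_{u,k}$ are our own illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def routing_iteration(z_u, z_neigh, e_u):
    """One neighborhood-routing step, sketched after Eq. (3).

    z_u:     (K, d)    factor-wise features of node u itself.
    z_neigh: (N, K, d) factor-wise features of the N neighbors of u.
    e_u:     (K, d)    current factor representation of u (initialized with z_u).
    Returns the updated e_u and the factor probabilities p of shape (N, K).
    """
    # p[v, k] = softmax over k of z_{v,k}^T e_{u,k}   (Eq. (3))
    logits = torch.einsum('nkd,kd->nk', z_neigh, e_u)
    p = F.softmax(logits, dim=1)                     # rows sum to 1 over the K factors

    # re-estimate e_{u,k} from u itself plus its p-weighted neighbors, then renormalize
    e_new = z_u + torch.einsum('nk,nkd->kd', p, z_neigh)
    return F.normalize(e_new, dim=-1), p

# toy usage: 5 neighbors, K = 4 factors, 16-dimensional factor channels
z_u = F.normalize(torch.randn(4, 16), dim=-1)
z_neigh = F.normalize(torch.randn(5, 4, 16), dim=-1)
e_u, p = routing_iteration(z_u, z_neigh, z_u.clone())
```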
(Figure 2: (a) Multi-graph method; (b) Discriminative semantic aggregation method.)
Recall that $p^{k}_{i,j}$ denotes the probability that node $i$ connects to $j$ because of the factor $k$.
In our model, the latent factor should identify the specific connecting cause of each
connected node pair. Here we transfer the probability matrix $p$ into a semantic
adjacency matrix $A$, so each element of $A$ takes a binary value (0 or 1). In detail, for
every node pair $i$ and $j$, $A^{k}_{i,j} = 1$ if $p^{k}_{i,j}$ is the largest value in $p_{i,j}$. As shown in
Figure 2(b), each node is represented by $K$ components. In this graph, every node may
connect with others through one relationship out of $K$ types, e.g., the relationship between
nodes $u$ and $o$ is $R_2$ (denoted $A^{2}_{u,o} = 1$). For node $u$, we can find that it has two
semantic-path-based neighbors $l$ and $v$, and the semantic-paths of $(u, l)$ and $(u, v)$
are of two different types, composed as $\Phi_{u,o,l} = (A^{2}_{u,o}, A^{3}_{o,l}) = R_2 \circ R_3$
and $\Phi_{u,o,v} = (A^{2}_{u,o}, A^{1}_{o,v}) = R_2 \circ R_1$, respectively. We define the adjacency
matrix $B$ for virtual semantic-path-based edges as
$$B_{u,v} = \sum_{[(u,o),(o,v)] \in E} A_{u,o}^{\top} A_{o,v}, \qquad \{u, v\} \subset V, \qquad (4)$$
where $A_{u,o} \in \mathbb{R}^{K}$, $A_{o,v} \in \mathbb{R}^{K}$, and $B_{u,v} \in \mathbb{R}^{K \times K}$. For instance, in Figure 2(b),
$A_{u,o} = [0, 1, 0]$, $A_{o,v} = [1, 0, 0]$, and $A_{o,l} = [0, 0, 1]$; in this way the two semantic-paths
starting from node $u$ can be expressed as $B^{2,3}_{u,l} = 1$ and $B^{2,1}_{u,v} = 1$.
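To illustrate how Eq. (4) can be computed, the sketch below builds the binary semantic adjacency $A$ by keeping, for each edge, only the factor with the largest probability, and then accumulates $B$ over 2-hop paths $u \to o \to v$; the dense tensor representation and variable names are simplifying assumptions for illustration.

```python
import torch

def semantic_path_matrix(p, adj):
    """Sketch of Eq. (4): derive A from the factor probabilities, then B from A.

    p:   (N, N, K) factor probabilities p[i, j] for each edge (i, j) (zeros elsewhere).
    adj: (N, N)    binary adjacency matrix of the original graph.
    """
    # A[i, j] is one-hot at the factor with the largest probability on edge (i, j)
    A = torch.zeros_like(p)
    A.scatter_(-1, p.argmax(dim=-1, keepdim=True), 1.0)
    A = A * adj.unsqueeze(-1)                    # keep entries only where a real edge exists

    # B[u, v] = sum over intermediate nodes o of the outer product A[u, o]^T A[o, v]
    B = torch.einsum('uok,ovm->uvkm', A, A)      # shape (N, N, K, K)
    return A, B
```

With 1-based factor indices as in the text, this reproduces $B^{2,3}_{u,l} = 1$ and $B^{2,1}_{u,v} = 1$ for the example in Figure 2(b).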
In the semantic information aggregation process, we aggregate the latent
vectors connected by corresponding semantic-path as:
$$
\begin{aligned}
h_u &= [e_{u,1}, e_{u,2}, \ldots, e_{u,K}] \in \mathbb{R}^{K \times \frac{d_{out}}{K}}, \\
\tilde{h}_v &= [z_{v,1}, z_{v,2}, \ldots, z_{v,K}] \in \mathbb{R}^{K \times \frac{d_{out}}{K}}, \\
y_u &= h_u + \operatorname*{MeanPooling}_{v \in V,\, v \neq u}\big(B_{u,v}\,\tilde{h}_v\big), \quad u \in V,
\end{aligned}
\qquad (5)
$$
P
where we just use MeanPooling to avoid large values instead of v∈V oper-
d
K× out
ator, and hu , h̃v ∈ R are both returned from the last layer of Disen-
K
Conv operation, in this time that factor probabilities would be stable since the
representation of each node considers the influence from neighbors. According
Learning the Implicit Semantic Representation on Graph-Structured Data 7
According to Eq. (5), the aggregation of the two latent representations (end points) of one
certain semantic-path denotes the mining result of this semantic relation, e.g.,
$\operatorname{Pooling}(e_{u,2}, z_{v,1})$ and $\operatorname{Pooling}(e_{u,2}, z_{l,3})$ express two different kinds of semantic
pattern representations in Figure 2(b), $R_2 \circ R_1$ and $R_2 \circ R_3$ respectively. And, for
all types of semantic-paths starting from node $u$, the weight of each type depends
on its frequency. Note that, although the semantic adjacency matrix $A$ neglects
some low-probability factors, our semantic-paths are integrated with the node
states of DisenGCN, which would not lose the crucial information captured by the
basic GCN model. The advantage of this aggregation method is that our model
can distinguish different semantic relations without adding extra parameters,
instead of designing various graph convolution networks for different semantic-
paths. That is to say, the model does not increase the risk of overfitting after
the graph semantic-path learning. Here we only consider 2-order paths in our
model; however, it can be straightforwardly extended to longer path mining.
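As a concrete reading of Eq. (5), the following minimal PyTorch sketch performs the discriminative semantic aggregation; the dense tensor shapes and the uniform mean over all $v \neq u$ are our simplifying assumptions rather than the exact implementation.

```python
import torch

def semantic_aggregation(h, h_tilde, B):
    """Sketch of Eq. (5): y_u = h_u + MeanPooling over v != u of (B_{u,v} h~_v).

    h, h_tilde: (N, K, d)    factor-wise node states from the last DisenConv layer.
    B:          (N, N, K, K) semantic-path matrix from Eq. (4).
    """
    N = h.shape[0]
    # message from v to u: B[u, v] selects and recombines the factor channels of h~_v
    messages = torch.einsum('uvkm,vmd->uvkd', B, h_tilde)    # (N, N, K, d)

    # exclude the self term (v == u) before mean-pooling over neighbors
    mask = 1.0 - torch.eye(N).view(N, N, 1, 1)
    return h + (messages * mask).sum(dim=1) / (N - 1)
```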
where $I \in \mathbb{R}^{K \times K}$ denotes an identity matrix, $\odot$ is the element-wise product, and
$M = K^2 - K$. Specifically, we take a lesson from [33] and scale the dot products by
$1/\sqrt{d_{out}/K}$ to counteract the vanishing-gradient effect for large values. As long
as $L_i$ is minimized in the training process, the distances between different factors
tend to become larger; that is, the $K$ subspaces would capture sufficiently different
information to encourage independence among the learned latent factors.
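The displayed definition of $L_i$ is not reproduced here, so the snippet below is only a hedged sketch of our reading of the surrounding description: scale the pairwise dot products of the projected factor features by $1/\sqrt{d_{out}/K}$, apply a softmax, mask the diagonal with $1 - I$, and average over the $M = K^2 - K$ off-diagonal entries. The projection $w$ and all variable names are assumptions.

```python
import torch
import torch.nn.functional as F

def independence_loss(z, w):
    """Hedged sketch of the independence regularizer L_i described in the text.

    z: (N, K, d) factor features of N nodes; w: (d, d_p) shared projection matrix.
    """
    N, K, _ = z.shape
    q = z @ w                                                  # (N, K, d_p) projections
    scale = q.shape[-1] ** 0.5                                 # ~ sqrt(d_out / K) scaling
    sim = torch.einsum('nkd,nmd->nkm', q, q) / scale           # pairwise factor similarities
    attn = F.softmax(sim, dim=-1)                              # softmax(Q K^T)
    off_diag = attn * (1.0 - torch.eye(K))                     # keep the (1 - I) entries
    return off_diag.sum() / (N * (K * K - K))                  # average over M = K^2 - K
```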
Next, we analyze the validity of this optimization. Latent Factor Routing
aims to utilize the disentangled algorithm to calculate the latent factors be-
tween every two objects. However, this approach is a variant of the von Mises-Fisher
(vMF) [2] mixture model, and such an EM algorithm cannot optimize the indepen-
dence of the latent factors within the iterative process. Moreover, random initialization
of the mapping parameters cannot guarantee that the subspaces attend to
different concerns. To address this shortcoming, we give the following assumption:
Assumption 3.1. The features in different subspaces remain sufficiently independent
when the margins of their projections in the unified space are sufficiently distinct.
This assumption is inspired by the Latent Semantic Analysis (LSA) algorithm
[16], which projects the multi-dimensional features of a vector space model into a
semantic space with fewer dimensions while preserving the semantic features of the
original space in a statistical sense. Our optimization approach is therefore listed below:
$$
\begin{aligned}
w &= \arg\min_{w} \sum_{u}^{V} \operatorname{softmax}\big(QK^{\top}\big) \odot (1 - I) \\
  &= \arg\min_{w} \sum_{u}^{V} \operatorname{softmax}\big((z_u w)(z_u w)^{\top}\big) \odot (1 - I) \\
  &= \arg\min_{w} \sum_{u}^{V} \frac{\sum_{k_1 \neq k_2} \exp\big(z_{u,k_1} w \cdot z_{u,k_2} w\big)}{\sum_{k_1, k_2} \exp\big(z_{u,k_1} w \cdot z_{u,k_2} w\big)} \qquad (8) \\
  &= \arg\max_{w} \sum_{u}^{V} \sum_{k_1 \neq k_2} \operatorname{distance}\big(z_{u,k_1} w,\ z_{u,k_2} w\big), \\
  &\qquad \text{s.t. } 1 \leq k_1 \leq K,\ 1 \leq k_2 \leq K.
\end{aligned}
$$
$$
L_s = -\sum_{u \in V^L} \frac{1}{C} \sum_{c=1}^{C} y_u(c)\,\ln\big(\hat{y}_u(c)\big) + \lambda L_i \qquad (9)
$$
as the loss function, where $\hat{y}_u = \operatorname{softmax}(y'_u)$ is the prediction for node $u$, $y_u$ is its
ground-truth label vector, $V^L$ is the set of labeled nodes, and $L_i$ is jointly trained by
summing it with the task loss function. For the multi-label classification task, since the
label $y_u$ consists of more than one positive bit, we define the multi-label loss function
for node $u$ as:
$$
L_m = -\frac{1}{C} \sum_{c=1}^{C} \Big[\, y_u(c) \cdot \operatorname{sigmoid}\big(y'_u(c)\big) + \big(1 - y_u(c)\big) \cdot \operatorname{sigmoid}\big(-y'_u(c)\big) \Big] + \lambda L_i. \qquad (10)
$$
Moreover, for the node clustering task, $y'_u$ denotes the input feature of K-Means.
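Putting the pieces together, the semi-supervised objective of Eq. (9) can be assembled as in the short sketch below, where standard cross-entropy over the labeled nodes replaces the explicit per-class sum and lambda weights the independence term; the function signature is illustrative.

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(y_logits, labels, labeled_mask, L_i, lam=0.1):
    """Sketch of Eq. (9): L_s = cross-entropy over V^L + lambda * L_i.

    y_logits:     (N, C) classifier outputs y'_u for all N nodes.
    labels:       (N,)   ground-truth class indices.
    labeled_mask: (N,)   boolean mask selecting the labeled set V^L.
    L_i:          scalar independence regularizer.
    """
    ce = F.cross_entropy(y_logits[labeled_mask], labels[labeled_mask])
    return ce + lam * L_i
```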
4 Experiments
In this section, we empirically assess the efficacy of SGCN on several node-
related tasks, including semi-supervised node classification, node clustering and
multi-label node classification. We then provide a node visualization analysis and
semantic-path sampling experiments to verify the validity of our idea.
(Figure 3: Macro-F1 and Micro-F1 (%) against the percentage of labeled nodes on POS and Blogcatalog for SGCN, DisenGCN, DeepWalk, LINE, Node2Vec, GCN and GAT: (a) Macro-F1 POS, (b) Macro-F1 Blogcatalog, (c) Micro-F1 POS, (d) Micro-F1 Blogcatalog.)
(Figure 4: node visualization of (a) DisenGCN and (b) SGCN.)
(Figure 5: classification accuracy (%) of SGCN against the number of cut.)
We visualize the learned node representations (similar results can be obtained on
other datasets), where different colors denote different research areas. Accord-
ing to Figure 4, the visualization of SGCN is more distinguishable than that of
DisenGCN. It demonstrates that the embedding learned by SGCN presents a high
intra-class similarity and separates papers into different research areas with distinct
boundaries. On the contrary, DisenGCN does not perform as well, since the margins
between clusters are not distinguishable enough; in several clusters, many nodes
belonging to different areas are mixed with others.
Then, to explore the influence of different scales of semantic-paths on our
model's performance, we implement a semantic-path sampling experiment on
Cora. As mentioned in Section 3.6, to capture different numbers of semantic-
paths, we change the cut-size hyper-parameter $C$ to restrict the sampling
size of each node's neighbors. As shown in Figure 5, the SGCN model with the
path representation achieves higher performance than at the first point ($C = 0$).
From the perspective of the global trend, with the increase of $C$, the classification
accuracy of the SGCN model also improves steadily, reaching its highest
score when $C = 5$. This means that a GCN model combined with a sufficient
scale of semantic-paths can indeed learn better node representations.
5 Conclusion
In this paper, we proposed a novel framework named Semantic Graph Convo-
lutional Networks (SGCN), which incorporates semantic-paths automatically during
the node aggregating process. In this way, SGCN provides semantic learning
ability to general graph algorithms. We conducted extensive experiments on var-
ious real-world datasets to demonstrate the superior performance of our proposed
model. Moreover, our method has good extensibility: all kinds of path-based
algorithms in the graph embedding field can be directly applied in SGCN to
adapt to different tasks, which we will explore further in future work.
6 Acknowledgements
This research was partially supported by grants from the National Key Research
and Development Program of China (No. 2018YFC0832101), and the National
References
1. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning
to align and translate. arXiv preprint arXiv:1409.0473 (2014)
2. Banerjee, A., Dhillon, I.S., Ghosh, J., Sra, S.: Clustering on the unit hypersphere
using von mises-fisher distributions. J. Mach. Learn. Res. 6(Sep), 1345–1382 (2005)
3. Bergstra, J., Yamins, D., Cox, D.D.: Hyperopt: A python library for optimizing
the hyperparameters of machine learning algorithms. In: Proceedings of the 12th
Python in science conference. pp. 13–20. Citeseer (2013)
4. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn.
Res. 3, 993–1022 (2003), http://jmlr.org/papers/v3/blei03a.html
5. Bronstein, M.M., Bruna, J., LeCun, Y., Szlam, A., Vandergheynst, P.: Geometric
deep learning: going beyond euclidean data. IEEE Signal Processing Magazine
34(4), 18–42 (2017)
6. Defferrard, M., Bresson, X., Vandergheynst, P.: Convolutional neural networks on
graphs with fast localized spectral filtering. In: Advances in neural information
processing systems. pp. 3844–3852 (2016)
7. Dong, Y., Chawla, N.V., Swami, A.: metapath2vec: Scalable representation learn-
ing for heterogeneous networks. In: Proceedings of the 23rd ACM SIGKDD inter-
national conference on knowledge discovery and data mining. pp. 135–144 (2017)
8. Duvenaud, D.K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-
Guzik, A.: Convolutional networks on graphs for learning molecular fingerprints.
In: Advances in neural information processing systems. pp. 2224–2232 (2015)
9. Fan, Y., Hou, S., Zhang, Y., Ye, Y., Abdulhayoglu, M.: Gotcha-sly malware! scor-
pion a metagraph2vec based malware detection system. In: Proceedings of the 24th
ACM SIGKDD. pp. 253–262 (2018)
10. Gori, M., Monfardini, G., Scarselli, F.: A new model for learning in graph domains.
In: Proceedings. 2005 IEEE International Joint Conference on Neural Networks,
2005. vol. 2, pp. 729–734. IEEE (2005)
11. Grover, A., Leskovec, J.: node2vec: Scalable feature learning for networks. In: Pro-
ceedings of the 22nd ACM SIGKDD. pp. 855–864 (2016)
12. Hamilton, W., Ying, Z., Leskovec, J.: Inductive representation learning on large
graphs. In: NIPS. pp. 1024–1034 (2017)
13. Henaff, M., Bruna, J., LeCun, Y.: Deep convolutional networks on graph-structured
data. arXiv preprint arXiv:1506.05163 (2015)
14. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: 3rd In-
ternational Conference on Learning Representations, ICLR 2015, San Diego, CA,
USA, May 7-9, 2015, Conference Track Proceedings (2015)
15. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional
networks. arXiv preprint arXiv:1609.02907 (2016)
16. Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic anal-
ysis. Discourse processes 25(2-3), 259–284 (1998)
17. Li, Z., Wu, B., Liu, Q., Wu, L., Zhao, H., Mei, T.: Learning the compositional visual
coherence for complementary recommendations. In: IJCAI-20. pp. 3536–3543
18. Liu, Y., Wang, X., Wu, S., Xiao, Z.: Independence promoted graph disentangled
networks. Proceedings of the AAAI Conference on Artificial Intelligence (2020)
19. Lu, Q., Getoor, L.: Link-based classification. In: Proceedings of the 20th Interna-
tional Conference on Machine Learning (ICML-03). pp. 496–503 (2003)
20. Ma, J., Cui, P., Kuang, K., Wang, X., Zhu, W.: Disentangled graph convolutional
networks. In: International Conference on Machine Learning. pp. 4212–4221 (2019)
21. Maaten, L.v.d., Hinton, G.: Visualizing data using t-sne. Journal of machine learn-
ing research 9(Nov), 2579–2605 (2008)
22. Monti, F., Boscaini, D., Masci, J., Rodola, E., Svoboda, J., Bronstein, M.M.: Geo-
metric deep learning on graphs and manifolds using mixture model cnns. In: IEEE
Conference on Computer Vision and Pattern Recognition. pp. 5115–5124 (2017)
23. Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann ma-
chines. In: Proceedings of the 27th international conference on machine learning
(ICML-10). pp. 807–814 (2010)
24. Perozzi, B., Al-Rfou, R., Skiena, S.: Deepwalk: Online learning of social represen-
tations. In: Proceedings of the 20th ACM SIGKDD. pp. 701–710 (2014)
25. Qiao, L., Zhao, H., Huang, X., Li, K., Chen, E.: A structure-enriched neural net-
work for network embedding. Expert Systems with Applications pp. 300–311 (2019)
26. Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G.: The graph
neural network model. IEEE Transactions on Neural Networks 20(1), 61–80 (2008)
27. Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., Eliassi-Rad, T.: Collec-
tive classification in network data. AI magazine 29(3), 93–93 (2008)
28. Shang, J., Qu, M., Liu, J., Kaplan, L.M., Han, J., Peng, J.: Meta-path guided
embedding for similarity search in large-scale heterogeneous information networks.
arXiv preprint arXiv:1610.09769 (2016)
29. Shi, C., Hu, B., Zhao, W.X., Philip, S.Y.: Heterogeneous information network
embedding for recommendation. IEEE Transactions on Knowledge and Data En-
gineering 31(2), 357–370 (2018)
30. Sun, L., He, L., Huang, Z., Cao, B., Xia, C., Wei, X., Philip, S.Y.: Joint embedding
of meta-path and meta-graph for heterogeneous information networks. In: 2018
IEEE International Conference on Big Knowledge. pp. 131–138. IEEE (2018)
31. Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., Mei, Q.: Line: Large-scale infor-
mation network embedding. In: Proceedings of the 24th international conference
on world wide web. pp. 1067–1077 (2015)
32. Tang, L., Liu, H.: Leveraging social media networks for classification. Data Mining
and Knowledge Discovery 23(3), 447–478 (2011)
33. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser,
L., Polosukhin, I.: Attention is all you need. In: Advances in neural information
processing systems. pp. 5998–6008 (2017)
34. Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., Bengio, Y.: Graph
attention networks. arXiv preprint arXiv:1710.10903 (2017)
35. Wang, X., Ji, H., Shi, C., Wang, B., Ye, Y., Cui, P., Yu, P.S.: Heterogeneous graph
attention network. In: The World Wide Web Conference. pp. 2022–2032 (2019)
36. Weston, J., Ratle, F., Mobahi, H., Collobert, R.: Deep learning via semi-supervised
embedding. In: Neural networks: Tricks of the trade, pp. 639–655. Springer (2012)
37. Wu, L., Li, Z., Zhao, H., Pan, Z., Liu, Q., Chen, E.: Estimating early fundraising
performance of innovations via graph-based market environment model. In: AAAI.
pp. 6396–6403 (2020)
38. Yang, Z., Cohen, W.W., Salakhutdinov, R.: Revisiting semi-supervised learning
with graph embeddings. arXiv preprint arXiv:1603.08861 (2016)
39. Zhu, X., Ghahramani, Z., Lafferty, J.D.: Semi-supervised learning using gaussian
fields and harmonic functions. In: Proceedings of the 20th International conference
on Machine learning (ICML-03). pp. 912–919 (2003)