
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. XX, NO. XX, SEPT 2017

A Comprehensive Survey of Graph Embedding:


Problems, Techniques and Applications
Hongyun Cai, Vincent W. Zheng, and Kevin Chen-Chuan Chang

arXiv:1709.07604v3 [cs.AI] 2 Feb 2018

Abstract—Graphs are an important data representation that appears in a wide diversity of real-world scenarios. Effective graph analytics provides users with a deeper understanding of what is behind the data, and thus can benefit many useful applications such as node classification, node recommendation, link prediction, etc. However, most graph analytics methods suffer from high computation and space costs. Graph embedding is an effective yet efficient way to solve the graph analytics problem. It converts the graph data into a low dimensional space in which the graph structural information and graph properties are maximally preserved. In this survey, we conduct a comprehensive review of the literature in graph embedding. We first introduce the formal definition of graph embedding as well as the related concepts. After that, we propose two taxonomies of graph embedding which correspond to what challenges exist in different graph embedding problem settings and how the existing work addresses these challenges in their solutions. Finally, we summarize the applications that graph embedding enables and suggest four promising future research directions in terms of computation efficiency, problem settings, techniques and application scenarios.

Index Terms—Graph embedding, graph analytics, graph embedding survey, network embedding

1 INTRODUCTION

GRAPHS naturally exist in a wide diversity of real-world scenarios, e.g., social graphs/diffusion graphs in social media networks, citation graphs in research areas, user interest graphs in electronic commerce, knowledge graphs, etc. Analysing these graphs provides insights into how to make good use of the information hidden in graphs, and thus has received significant attention in the last few decades. Effective graph analytics can benefit a lot of applications, such as node classification [1], node clustering [2], node retrieval/recommendation [3], link prediction [4], etc. For example, by analysing the graph constructed from user interactions in a social network (e.g., retweet/comment/follow in Twitter), we can classify users, detect communities, recommend friends, and predict whether an interaction will happen between two users.

Although graph analytics is practical and essential, most existing graph analytics methods suffer from high computation and space costs. A lot of research effort has been devoted to conducting expensive graph analytics efficiently. Examples include distributed graph data processing frameworks (e.g., GraphX [5], GraphLab [6]), new space-efficient graph storage that accelerates I/O and reduces computation cost [7], and so on.

In addition to the above strategies, graph embedding provides an effective yet efficient way to solve the graph analytics problem. Specifically, graph embedding converts a graph into a low dimensional space in which the graph information is preserved. By representing a graph as a (or a set of) low dimensional vector(s), graph algorithms can then be computed efficiently. There are different types of graphs (e.g., homogeneous graph, heterogeneous graph, attribute graph, etc.), so the input of graph embedding varies in different scenarios. The output of graph embedding is a low-dimensional vector representing a part of the graph (or a whole graph). Fig. 1 shows a toy example of embedding a graph into a 2D space at different granularities. I.e., according to different needs, we may represent a node/edge/substructure/whole-graph as a low-dimensional vector. More details about the different types of graph embedding input and output are provided in Sec. 3.

In the early 2000s, graph embedding algorithms were mainly designed to reduce the high dimensionality of non-relational data by assuming the data lie in a low dimensional manifold. Given a set of non-relational high-dimensional data features, a similarity graph is constructed based on the pairwise feature similarity. Then, each node in the graph is embedded into a low-dimensional space where connected nodes are closer to each other. Examples of this line of research are introduced in Sec. 4.1. Since 2010, with the proliferation of graphs in various fields, research in graph embedding started to take a graph as the input and leverage the auxiliary information (if any) to facilitate the embedding. On the one hand, some of the work focuses on representing a part of the graph (e.g., node, edge, substructure) (Figs. 1(b)-1(d)) as one vector. To obtain such embeddings, they either adopt state-of-the-art deep learning techniques (Sec. 4.2) or design an objective function to optimize the edge reconstruction probability (Sec. 4.3). On the other hand, there is also some work concentrating on embedding the whole graph as one vector (Fig. 1(e)) for graph level applications. Graph kernels (Sec. 4.4) are usually designed to meet this need.

The problem of graph embedding is related to two traditional research problems, i.e., graph analytics [8] and representation learning [9].

• H. Cai is with Advanced Digital Sciences Center, Singapore. E-mail: [email protected].
• V. Zheng is with Advanced Digital Sciences Center, Singapore. E-mail: [email protected].
• K. Chang is with University of Illinois at Urbana-Champaign, USA. E-mail: [email protected].
[Fig. 1 shows five panels: (a) Input Graph G1; (b) Node Embedding; (c) Edge Embedding; (d) Substructure Embedding; (e) Whole-Graph Embedding. Each panel plots the corresponding 2D embeddings.]
Fig. 1. A toy example of embedding a graph into 2D space with different granularities. G{1,2,3} denotes the substructure containing nodes v1, v2, v3.
Particularly, graph embedding aims to represent a graph as low dimensional vectors while the graph structures are preserved. On the one hand, graph analytics aims to mine useful information from graph data. On the other hand, representation learning obtains data representations that make it easier to extract useful information when building classifiers or other predictors [9]. Graph embedding lies in the overlap of the two problems and focuses on learning low-dimensional representations. Note that we distinguish graph representation learning and graph embedding in this survey. Graph representation learning does not require the learned representations to be low dimensional. For example, [10] represents each node as a vector with dimensionality equal to the number of nodes in the input graph. Every dimension denotes the geodesic distance of a node to each other node in the graph.

Embedding graphs into low dimensional spaces is not a trivial task. The challenges of graph embedding depend on the problem setting, which consists of the embedding input and the embedding output. In this survey, we divide the input graph into four categories, including homogeneous graph, heterogeneous graph, graph with auxiliary information and graph constructed from non-relational data. Different types of embedding input carry different information to be preserved in the embedded space and thus pose different challenges to the problem of graph embedding. For example, when embedding a graph with structural information only, the connections between nodes are the target to be preserved. However, for a graph with node label or attribute information, the auxiliary information provides graph properties from other perspectives, and thus may also be considered during the embedding. Unlike the embedding input, which is given and fixed, the embedding output is task driven. For example, the most common type of embedding output is node embedding, which represents close nodes as similar vectors. Node embedding can benefit node related tasks such as node classification, node clustering, etc. However, in some cases, the tasks may be related to a higher granularity of a graph, e.g., node pairs, subgraphs, or the whole graph. Hence, the first challenge in terms of embedding output is to find a suitable embedding output type for the application of interest. We categorize four types of graph embedding output, including node embedding, edge embedding, hybrid embedding and whole-graph embedding. Different output granularities have different criteria for a "good" embedding and face different challenges. For example, a good node embedding preserves the similarity to its neighbouring nodes in the embedded space. In contrast, a good whole-graph embedding represents a whole graph as a vector so that the graph-level similarity is preserved.

Observing the challenges faced in different problem settings, we propose two taxonomies of graph embedding work, by categorizing the graph embedding literature based on the problem settings and the embedding techniques. These two taxonomies correspond to what challenges exist in graph embedding and how existing studies address these challenges. In particular, we first introduce the different settings of the graph embedding problem as well as the challenges faced in each setting. Then we describe how existing studies address these challenges in their work, including their insights and their technical solutions.

Note that although a few attempts have been made to survey graph embedding ([11], [12], [13]), they have the following two limitations. First, they usually propose only one taxonomy of graph embedding techniques. None of them analyzed graph embedding work from the perspective of problem setting, nor did they summarize the challenges in each setting. Second, only a limited number of related works are covered in existing graph embedding surveys. E.g., [11] mainly introduces twelve representative graph embedding algorithms, and [13] focuses on knowledge graph embedding only. Moreover, there is no analysis of the insight behind each graph embedding technique. A comprehensive review of existing graph embedding work and a high level abstraction of the insight behind each embedding technique can foster future research in the field.

1.1 Our Contributions
Below, we summarize our major contributions in this survey.
• We propose a taxonomy of graph embedding based on problem settings and summarize the challenges faced in each setting. We are the first to categorize graph embedding work based on problem setting, which brings new perspectives to understanding existing work.
• We provide a detailed analysis of graph embedding techniques. Compared to existing graph embedding surveys, we not only investigate a more comprehensive set of graph embedding work, but also present a summary of the insights behind each technique. In contrast to simply listing how graph embedding was solved in the past, the summarized insights answer the question of why graph embedding can be solved in a certain way. This can serve as an insightful guideline for future research.
• We systematically categorize the applications that graph embedding enables and divide the applications into node related, edge related and graph related. For each category, we present detailed application scenarios as a reference.
• We suggest four promising future research directions in the field of graph embedding in terms of computational efficiency, problem settings, solution techniques and applications. For each direction, we provide a thorough analysis of its deficiencies in current work and propose future research direction(s).
[Fig. 2 summarizes the two taxonomies. Graph Embedding Problem Settings. Graph Embedding Input: Homogeneous Graph; Heterogeneous Graph; Graph with Auxiliary Information; Graph Constructed from Non-relational Data. Graph Embedding Output: Node Embedding; Edge Embedding; Hybrid Embedding; Whole-Graph Embedding. Graph Embedding Techniques. Matrix Factorization: Graph Laplacian Eigenmaps; Node Proximity Matrix Factorization. Deep Learning: With Random Walk; Without Random Walk. Edge Reconstruction: Maximize Edge Reconstruction Probability; Minimize Distance-based Loss; Minimize Margin-based Ranking Loss. Graph Kernel: Based on Graphlet; Based on Subtree Patterns; Based on Random Walks. Generative Model: Embed Graph into Latent Space; Incorporate Semantics for Embedding.]
Fig. 2. Graph embedding taxonomies by problems and techniques.

1.2 Organization of The Survey
The rest of this survey is organized as follows. In Sec. 2, we introduce the definitions of the basic concepts required to understand the graph embedding problem, and then provide a formal problem definition of graph embedding. In the next two sections, we provide two taxonomies of graph embedding, where the taxonomy structures are illustrated in Fig. 2. Sec. 3 compares the related work based on the problem settings and summarizes the challenges faced in each setting. In Sec. 4, we categorize the literature based on the embedding techniques. The insights behind each technique are abstracted, and a detailed comparison of different techniques is provided at the end. After that, we present the applications that graph embedding enables in Sec. 5. We then discuss four potential future research directions in Sec. 6 and conclude this survey in Sec. 7.

2 PROBLEM FORMALIZATION
In this section, we first introduce the definitions of the basic concepts in graph embedding, and then provide a formal definition of the graph embedding problem.

2.1 Notation and Definition
The detailed descriptions of the notations used in this survey can be found in Table 1.

TABLE 1
Notations used in this paper.
  |·|                    The cardinality of a set
  G = (V, E)             Graph G with node set V and edge set E
  Ĝ = (V̂, Ê)             A substructure of graph G, where V̂ ⊆ V, Ê ⊆ E
  v_i, e_ij              A node v_i ∈ V and an edge e_ij ∈ E connecting v_i and v_j
  A                      The adjacency matrix of G
  A_i                    The i-th row vector of matrix A
  A_i,j                  The entry in the i-th row and j-th column of matrix A
  f_v(v_i), f_e(e_ij)    Type of node v_i and type of edge e_ij
  T^v, T^e               The node type set and edge type set
  N_k(v_i)               The k nearest neighbours of node v_i
  X ∈ R^{|V|×N}          A feature matrix; each row X_i is an N-dimensional vector for v_i
  y_i, y_ij, y_Ĝ         The embedding of node v_i, edge e_ij, and structure Ĝ
  d                      The dimensionality of the embedding
  <h, r, t>              A knowledge graph triplet, with head entity h, tail entity t and the relation r between them
  s^(1)_ij, s^(2)_ij     First- and second-order proximity between node v_i and v_j
  c                      An information cascade
  G^c = (V^c, E^c)       A cascade graph which adopts the cascade c

Definition 1. A graph is G = (V, E), where v ∈ V is a node and e ∈ E is an edge. G is associated with a node type mapping function f_v : V → T^v and an edge type mapping function f_e : E → T^e.

T^v and T^e denote the set of node types and edge types, respectively. Each node v_i ∈ V belongs to one particular type, i.e., f_v(v_i) ∈ T^v. Similarly, for e_ij ∈ E, f_e(e_ij) ∈ T^e.

Definition 2. A homogeneous graph G_homo = (V, E) is a graph in which |T^v| = |T^e| = 1. All nodes in G belong to a single type and all edges belong to one single type.

Definition 3. A heterogeneous graph G_hete = (V, E) is a graph in which |T^v| > 1 and/or |T^e| > 1.

Definition 4. A knowledge graph G_know = (V, E) is a directed graph whose nodes are entities and whose edges are subject-property-object triple facts. Each edge of the form (head entity, relation, tail entity) (denoted as <h, r, t>) indicates a relationship r from entity h to entity t.

[Fig. 3: Alice --isFriendOf--> Bob --isSupervisorOf--> Chris]
Fig. 3. A toy example of knowledge graph.

Here h, t ∈ V are entities and r ∈ E is the relation. In this survey, we call <h, r, t> a knowledge graph triplet. For example, in Fig. 3, there are two triplets: <Alice, isFriendOf, Bob> and <Bob, isSupervisorOf, Chris>. Note that the entities and relations in a knowledge graph are usually of different types [14], [15]. Hence, a knowledge graph can be viewed as an instance of the heterogeneous graph.
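For readers who prefer code, the following is a minimal sketch (ours, not from the survey) of the typed graph in Definitions 1-4, instantiated with the two triplets of Fig. 3; the class and method names are illustrative assumptions only.

    # Minimal sketch (ours): a typed graph per Definitions 1-4, with the Fig. 3 triplets.
    class TypedGraph:
        def __init__(self):
            self.node_type = {}                   # f_v : V -> T^v
            self.edges = []                       # list of triplets <h, r, t>
        def add_node(self, v, v_type):
            self.node_type[v] = v_type
        def add_edge(self, head, relation, tail):
            self.edges.append((head, relation, tail))
        def node_types(self):
            return set(self.node_type.values())   # T^v
        def edge_types(self):
            return {r for _, r, _ in self.edges}  # T^e

    g = TypedGraph()
    for person in ["Alice", "Bob", "Chris"]:
        g.add_node(person, "person")
    g.add_edge("Alice", "isFriendOf", "Bob")
    g.add_edge("Bob", "isSupervisorOf", "Chris")

    # |T^v| = 1 but |T^e| = 2, so by Definition 3 this toy knowledge graph is heterogeneous.
    print(len(g.node_types()), len(g.edge_types()))   # 1 2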
The following proximity measures are usually adopted to quantify the graph property to be preserved in the embedded space. The first-order proximity is the local pairwise similarity between only the nodes connected by edges. It compares the direct connection strength between a node pair. Formally,

Definition 5. The first-order proximity between node v_i and node v_j is the weight of the edge e_ij, i.e., A_i,j.

Two nodes are more similar if they are connected by an edge with a larger weight. Denoting the first-order proximity between node v_i and v_j as s^(1)_ij, we have s^(1)_ij = A_i,j. Let s^(1)_i = [s^(1)_i1, s^(1)_i2, ..., s^(1)_i|V|] denote the first-order proximity between v_i and the other nodes. Take the graph in Fig. 1(a) as an example: the first-order proximity between v_1 and v_2 is the weight of edge e_12, denoted as s^(1)_12 = 1.2, and s^(1)_1 records the weights of the edges connecting v_1 and the other nodes in the graph, i.e., s^(1)_1 = [0, 1.2, 1.5, 0, 0, 0, 0, 0, 0].
The second-order proximity compares the similarity of the nodes' neighbourhood structures. The more similar two nodes' neighbourhoods are, the larger the second-order proximity value between them. Formally,

Definition 6. The second-order proximity s^(2)_ij between node v_i and v_j is a similarity between v_i's neighbourhood s^(1)_i and v_j's neighbourhood s^(1)_j.

Again, take Fig. 1(a) as an example: s^(2)_12 is the similarity between s^(1)_1 and s^(1)_2. As introduced before, s^(1)_1 = [0, 1.2, 1.5, 0, 0, 0, 0, 0, 0] and s^(1)_2 = [1.2, 0, 0.8, 0, 0, 0, 0, 0, 0]. Let us consider cosine similarities: s^(2)_12 = cosine(s^(1)_1, s^(1)_2) = 0.43 and s^(2)_15 = cosine(s^(1)_1, s^(1)_5) = 0. We can see that the second-order proximity between v_1 and v_5 equals zero as v_1 and v_5 do not share any common 1-hop neighbour. v_1 and v_2 share a common neighbour v_3, hence their second-order proximity s^(2)_12 is larger than zero.

The higher-order proximity can be defined likewise. For example, the k-th-order proximity between node v_i and v_j is the similarity between s^(k-1)_i and s^(k-1)_j. Note that sometimes the higher-order proximities are also defined using other metrics, e.g., Katz Index, Rooted PageRank, Adamic Adar, etc. [11].

It is worth noting that, in some work, the first-order and second-order proximities are empirically calculated based on the joint probability and conditional probability of two nodes. More details are discussed in Sec. 4.3.2.
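To make the two definitions concrete, here is a small numerical sketch (ours, not from the survey) that recomputes the toy values above; the variable names are illustrative and only the two rows quoted in the text are filled in.

    # Small numerical sketch (ours): first- and second-order proximity for Fig. 1(a).
    import numpy as np

    # First-order proximity vectors s_i^(1) are rows of the weighted adjacency matrix A.
    s1 = np.array([0, 1.2, 1.5, 0, 0, 0, 0, 0, 0])   # s_1^(1)
    s2 = np.array([1.2, 0, 0.8, 0, 0, 0, 0, 0, 0])   # s_2^(1)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Second-order proximity as the cosine similarity of the neighbourhood vectors.
    print(round(cosine(s1, s2), 2))   # 0.43, matching the value quoted in the text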
Problem 1. Graph embedding: Given the input of a graph G = (V, E), and a predefined dimensionality of the embedding d (d ≪ |V|), the problem of graph embedding is to convert G into a d-dimensional space in which the graph property is preserved as much as possible. The graph property can be quantified using proximity measures such as the first- and higher-order proximity. Each graph is represented as either a d-dimensional vector (for a whole graph) or a set of d-dimensional vectors, with each vector representing the embedding of a part of the graph (e.g., node, edge, substructure).

Fig. 1 shows a toy example of graph embedding with d = 2. Given an input graph (Fig. 1(a)), graph embedding algorithms are applied to convert a node (Fig. 1(b)), edge (Fig. 1(c)), substructure (Fig. 1(d)), or whole graph (Fig. 1(e)) into a 2D vector (i.e., a point in a 2D space). In the next two sections, we provide two taxonomies of graph embedding, by categorizing the graph embedding literature based on problem settings and embedding techniques respectively.

3 PROBLEM SETTINGS OF GRAPH EMBEDDING
In this section, we compare existing graph embedding work from the perspective of problem setting, which consists of the embedding input and the embedding output. For each setting, we first introduce the different types of graph embedding input or output, and then summarize the challenges faced in each setting at the end.

We start with graph embedding input. As a graph embedding setting consists of both input and output, we use node embedding as an example embedding output setting during the introduction of the different types of input. The reason is that, although there exist various types of embedding output, the majority of graph embedding studies focus on node embedding, i.e., embedding nodes into a low dimensional space where the node similarity in the input graph is preserved. More details about node embedding and other types of embedding output are presented in Sec. 3.2.

3.1 Graph Embedding Input
The input of graph embedding is a graph. In this survey, we divide graph embedding input into four categories: homogeneous graph, heterogeneous graph, graph with auxiliary information and graph constructed from non-relational data. Each type of graph poses different challenges to graph embedding. Next, we introduce these four types of input graphs and summarize the challenges faced in each input setting.

3.1.1 Homogeneous Graph
The first category of input graph is the homogeneous graph (Def. 2), in which both nodes and edges belong to a single type. A homogeneous graph can be further categorized as weighted (or directed) and unweighted (or undirected), as in the examples shown in Fig. 4.

[Fig. 4: (a) A weighted graph over nodes v2, v1, v3 with edge weights 2.4 and 0.6. (b) A directed graph over nodes v5, v4, v6.]
Fig. 4. Examples of weighted and directed graphs.

The undirected and unweighted homogeneous graph is the most basic graph embedding input setting. A number of studies work under this setting, e.g., [1], [16], [17], [18], [19]. They treat all nodes and edges equally, as only the basic structural information of the input graph is available.

Intuitively, the weights and directions of the edges provide more information about the graph, and may help represent the graph more accurately in the embedded space. For example, in Fig. 4(a), v_1 should be embedded closer to v_2 than to v_3 because the weight of the edge e_12 is higher. Similarly, v_4 in Fig. 4(b) should be embedded closer to v_5 than to v_6 as v_4 and v_5 are connected in both directions. The above information is lost in the unweighted and undirected graph. Noticing the advantages of exploiting the weight and direction properties of the graph edges, the graph embedding community has started to explore the weighted and/or directed graph. Some of this work focuses on only one graph property, i.e., either edge weight or edge direction. On the one hand, the weighted graph is considered in [20], [21], [22], [23], [24], [25]. Nodes connected by higher-weighted edges are embedded closer to each other. However, this work is still limited to undirected graphs. On the other hand, some work distinguishes the directions of the edges during the embedding process and preserves the direction information in the embedded space. One example of the directed graph is the social network graph, e.g., [26]. Each user has both followership and followeeship with other users. However, the weight information is unavailable for the social user links. Recently, more general graph embedding algorithms have been proposed, in which both weight and direction properties are considered. In other words, these algorithms (e.g., [3], [27], [28]) can process directed as well as undirected, and weighted as well as unweighted graphs.
Challenge: How to capture the diversity of connectivity patterns observed in graphs? Since only structural information is available in homogeneous graphs, the challenge of homogeneous graph embedding lies in how to preserve the connectivity patterns observed in the input graphs during embedding.

3.1.2 Heterogeneous Graph
The second category of input is the heterogeneous graph (Def. 3), which mainly exists in the three scenarios below.

Community-based Question Answering (cQA) sites. cQA is an Internet-based crowdsourcing service that enables users to post questions on a website, which are then answered by other users [29]. Intuitively, there are different types of nodes in a cQA graph, e.g., question, answer, user. Existing cQA graph embedding methods are distinguished from each other in terms of the links they exploit, as summarized in Table 2, where (i, j, k, o, p) denotes that the j-th answer provided by user k obtains more votes (i.e., thumbs-up) than the o-th answer of user p for question i.

TABLE 2
Graph Embedding Algorithms for cQA Sites
  GE Algorithm    Links Exploited
  [30]            user-user, user-question
  [31]            user-user, user-question, question-answer
  [29]            user-user, question-answer, user-answer
  [32]            users' asymmetric following links; an ordered tuple (i, j, k, o, p)

Multimedia Networks. A multimedia network is a network containing multimedia data, e.g., image, text, etc. For example, both [33] and [34] embed graphs containing two types of nodes (image and text) and three types of links (the co-occurrence of image-image, text-text and image-text). [35] processes a social curation network with user nodes and image nodes. It exploits user-image links to embed users and images into the same space so that they can be directly compared for image recommendation. In [36], a click graph is considered which contains images and text queries. The image-query edge indicates a click on an image given a query, where the click count serves as the edge weight.

Knowledge Graphs. In a knowledge graph (Def. 4), the entities (nodes) and relations (edges) are usually of different types. For example, in a film-related knowledge graph constructed from Freebase [37], the types of entities can be "director", "actor", "film", etc., and the types of relations can be "produce", "direct", "act in". A lot of effort has been devoted to embedding knowledge graphs (e.g., [38], [39], [40]). We will introduce them in detail in Sec. 4.3.3.

Other heterogeneous graphs also exist. For instance, [41] and [42] work on mobility data graphs, in which the station (s), role (r) and company (c) nodes are connected by three types of links (s-s, s-r, s-c). [43] embeds a Wikipedia graph with three types of nodes (entity (e), category (c) and word (w)) and three types of edges (e-e, e-c, w-w). In addition to the above graphs, there are some general heterogeneous graphs in which the types of nodes and edges are not specifically defined [44], [45], [46].

Challenge: How to explore global consistency between different types of objects, and how to deal with the imbalances of objects belonging to different types, if any? Different types of objects (e.g., nodes, edges) are embedded into the same space in heterogeneous graph embedding. How to explore the global consistency between them is a problem. Moreover, there may exist imbalance between objects of different types. This data skewness should be considered in the embedding.

3.1.3 Graph with Auxiliary Information
The third category of input graph contains auxiliary information about a node/edge/whole-graph in addition to the structural relations of nodes (i.e., E). Generally, there are five different types of auxiliary information, as listed in Table 3.

TABLE 3
Comparison of Different Types of Auxiliary Information in Graphs
  Auxiliary Information      Description
  label                      categorical value of a node/edge, e.g., class information
  attribute                  categorical or continuous value of a node/edge, e.g., property information
  node feature               text or image feature for a node
  information propagation    the paths along which information is propagated in graphs
  knowledge base             text associated with or facts between knowledge concepts

Label: Nodes with different labels should be embedded far away from each other. In order to achieve this, [47] and [48] jointly optimize the embedding objective function together with a classifier function. [49] puts a penalty on the similarity between nodes with different labels. [50] considers node labels and edge labels when calculating different graph kernels. [51] and [52] embed a knowledge graph, in which each entity (node) has a semantic category. [53] embeds a more complicated knowledge graph with the entity categories in a hierarchical structure, e.g., the category "book" has two sub-categories, "author" and "written work".

Attribute: In contrast to a label, an attribute value can be discrete or continuous. For example, [54] embeds a graph with discrete node attribute values (e.g., the atomic number in a molecule). In contrast, [4] represents the node attribute as a continuous high-dimensional vector (e.g., user attribute features in social networks). [55] deals with both discrete and continuous attributes for nodes and edges.

Node feature: Most node features are text, which is provided either as a feature vector for each node [56], [57] or as a document [58], [59], [60], [61]. For the latter, the documents are further processed to extract feature vectors using techniques such as bag-of-words [58], topic modelling [59], [60], or treating "word" as one type of node [61]. Other types of node features, such as image features [33], are also possible. Node features enhance the graph embedding performance by providing rich and unstructured information, which is available in many real-world graphs. Moreover, they make inductive graph embedding possible [62].

Information propagation: An example of information propagation is "retweet" in Twitter. In [63], given a data graph G = (V, E), a cascade graph G^c = (V^c, E^c) is constructed for each cascade c, where V^c are the nodes that have adopted c and E^c are the edges with both ends in V^c. They then embed G^c to predict the increment of the cascade size. Differently, [64] aims to embed the users and content information, such that the similarity between their embeddings indicates a diffusion probability. Topo-LSTM [65] considers a cascade as not merely a sequence of nodes, but a dynamic directed acyclic graph for embedding.

Knowledge base: Popular knowledge bases include Wikipedia [66], Freebase [37], YAGO [67], DBpedia [68], etc.
Taking Wikipedia as an example, the concepts are entities proposed by users and the text is the article associated with the entity. [66] uses a knowledge base to learn a social knowledge graph from a social network by linking each social network user to a given set of knowledge concepts. [69] represents queries and documents in the entity space (provided by a knowledge base) so that the academic search engine can understand the meaning of research concepts in queries.

Other types of auxiliary information include user check-in data (user-location) [70], user-item preference ranking lists [71], etc. Note that the auxiliary information is not limited to one type. For instance, [62] and [72] consider both label and node feature information. [73] utilizes node contents and labels to assist the graph embedding process.

Challenge: How to incorporate the rich and unstructured information so that the learnt embeddings both represent the topological structure and are discriminative in terms of the auxiliary information? The auxiliary information helps to define node similarity in addition to the graph structural information. The challenge of embedding a graph with auxiliary information is how to combine these two information sources to define the node similarity to be preserved.

3.1.4 Graph Constructed from Non-relational Data
The last category of input graph is not provided, but is constructed from non-relational input data by different strategies. This usually happens when the input data are assumed to lie in a low dimensional manifold.

In most cases, the input is a feature matrix X ∈ R^{|V|×N} where each row X_i is an N-dimensional feature vector for the i-th training instance. A similarity matrix S is constructed by calculating S_ij as the similarity between (X_i, X_j). There are usually two ways to construct a graph from S. A straightforward way is to directly treat S as the adjacency matrix A of an invisible graph [74]. However, [74] is based on the Euclidean distance and does not consider the neighbouring nodes when calculating S_ij. If X lies on or near a curved manifold, the distance between X_i and X_j over the manifold is much larger than their Euclidean distance [12]. To address these issues, other methods (e.g., [75], [76], [77]) first construct a K nearest neighbour (KNN) graph from S and estimate the adjacency matrix A based on the KNN graph. For example, Isomap [78] incorporates the geodesic distance in A. It first constructs a KNN graph from S, and then finds the shortest path between two nodes as the geodesic distance between them. To reduce the cost of KNN graph construction (O(|V|^2)), [79] constructs an Anchor graph instead, whose cost is O(|V|) in terms of both time and space consumption. It first obtains a set of clustering centers as virtual anchors and finds the K nearest anchors of each node for anchor graph construction.
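As a concrete illustration of the KNN-based strategy just described, the following is a minimal sketch (ours, not from any of the cited methods): it builds a Gaussian-kernel similarity matrix from a feature matrix X and keeps, for each instance, only its K most similar neighbours; the function names and kernel choice are illustrative assumptions.

    # Minimal sketch (ours): build a KNN similarity graph from a feature matrix X.
    import numpy as np

    def knn_graph(X, k=5, t=1.0):
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
        S = np.exp(-sq / (2 * t ** 2))            # Gaussian-kernel similarity matrix S
        np.fill_diagonal(S, 0.0)                  # no self-loops
        A = np.zeros_like(S)
        for i in range(S.shape[0]):
            nbrs = np.argsort(-S[i])[:k]          # indices of the k most similar instances
            A[i, nbrs] = S[i, nbrs]               # keep only the top-k similarities
        return np.maximum(A, A.T)                 # symmetrize to obtain an undirected graph

    # Example: 100 random instances with 16-dimensional features.
    A = knn_graph(np.random.rand(100, 16), k=5)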
Another way of graph construction is to establish edges between nodes based on the nodes' co-occurrence. For example, to facilitate image related applications (e.g., image segmentation, image classification), researchers (e.g., [80], [81], [82]) construct a graph from each image by treating pixels as nodes and the spatial relations between pixels as edges. [83] extracts three types of nodes (location, time and message) from GTMS records and therefore forms six types of edges between these nodes. [84] generates a graph using entity mentions, target types and text features as nodes, and establishes three kinds of edges: mention-type, mention-feature and type-type.

In addition to the above pairwise similarity based and node co-occurrence based methods, other graph construction strategies have been designed for different purposes. For example, [85] constructs an intrinsic graph to capture the intraclass compactness, and a penalty graph to characterize the interclass separability. The former is constructed by connecting each data point with its neighbours of the same class, while the latter connects the marginal points across different classes. [86] constructs a signed graph to exploit the label information. Two nodes are connected by a positive edge if they belong to the same class, and by a negative edge if they are from two different classes. [87] includes all instances with a common label in one hyperedge to capture their joint similarity. In [88], two feedback graphs are constructed to gather relevant pairs together and keep irrelevant ones apart after embedding. In the positive graph, two nodes are connected if they are both relevant. In the negative graph, two nodes are connected only when one node is relevant and the other is irrelevant.

Challenge: How to construct a graph that encodes the pairwise relations between instances and how to preserve the generated node proximity matrix in the embedded space? The first challenge faced when embedding graphs constructed from non-relational data is how to compute the relations between the non-relational data and construct such a graph. After the graph is constructed, the challenge becomes the same as in other input graphs, i.e., how to preserve the node proximity of the constructed graph in the embedded space.

3.2 Graph Embedding Output
The output of graph embedding is a (set of) low dimensional vector(s) representing (part of) a graph. Based on the output granularity, we divide graph embedding output into four categories, including node embedding, edge embedding, hybrid embedding and whole-graph embedding. Different types of embedding facilitate different applications.

Unlike the embedding input, which is fixed and given, the embedding output is task driven. For example, node embedding can benefit a wide variety of node related graph analysis tasks. By representing each node as a vector, node related tasks such as node clustering and node classification can be performed efficiently in terms of both time and space. However, graph analytics tasks are not always at the node level. In some scenarios, the tasks may be related to a higher granularity of a graph, such as node pairs, a subgraph, or even the whole graph. Hence, the first challenge in terms of embedding output is how to find a suitable type of embedding output which meets the needs of the specific application task.

3.2.1 Node Embedding
As the most common embedding output setting, node embedding represents each node as a vector in a low dimensional space. Nodes that are "close" in the graph are embedded to have similar vector representations. The differences between various graph embedding methods lie in how they define the "closeness" between two nodes.
First-order proximity (Def. 5) and second-order proximity (Def. 6) are two commonly adopted metrics for pairwise node similarity calculation. In some work, higher-order proximity is also explored to a certain extent. For example, [21] captures the k-step (k = 1, 2, 3, ...) neighbour relations in its embedding. Both [1] and [89] consider two nodes belonging to the same community to be embedded closer.

Challenge: How to define the pairwise node proximity in various types of input graph and how to encode the proximity in the learnt embeddings? The challenges of node embedding mainly come from defining the node proximity in the input graph. In Sec. 3.1, we have elaborated the challenges of node embedding with different types of input graphs. Next, we will introduce other types of embedding output as well as the new challenges posed by these outputs.

3.2.2 Edge Embedding
In contrast to node embedding, edge embedding aims to represent an edge as a low-dimensional vector. Edge embedding is useful in the following two scenarios.

Firstly, knowledge graph embedding (e.g., [90], [91], [92]) learns embeddings for both nodes and edges. Each edge is a triplet <h, r, t> (Def. 4). The embedding is learnt to preserve r between h and t in the embedded space, so that a missing entity/relation can be correctly predicted given the other two components of <h, r, t>. Secondly, some work (e.g., [28], [64]) embeds a node pair as a vector feature either to make the node pair comparable to other nodes or to predict the existence of a link between two nodes. For instance, [64] proposes a content-social influential feature to predict the user-user interaction probability given a piece of content. It embeds both the user pairs and the content in the same space. [28] embeds a pair of nodes using a bootstrapping approach over the node embedding, to facilitate the prediction of whether a link exists between two nodes in a graph.

In summary, edge embedding benefits edge (/node pair) related graph analysis, such as link prediction, knowledge graph entity/relation prediction, etc.

Challenge: How to define the edge-level similarity and how to model the asymmetric property of the edges, if any? The edge proximity is different from the node proximity as an edge contains a pair of nodes and usually denotes a pairwise node relation. Moreover, unlike nodes, edges may be directed. This asymmetric property should be encoded in the learnt edge representations.
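To illustrate how learnt triplet embeddings can support the missing-entity prediction mentioned above, here is a small sketch (ours, not the method of [90], [91] or [92]); it assumes a translation-style scoring function f(h, r, t) = -||h + r - t||, which is only one common choice among many.

    # Sketch (ours): ranking candidate tail entities for a query (h, r, ?)
    # under an assumed translation-style score f(h, r, t) = -||h + r - t||.
    import numpy as np

    def predict_tail(h_vec, r_vec, entity_matrix):
        # entity_matrix: one row per candidate entity embedding.
        scores = -np.linalg.norm(h_vec + r_vec - entity_matrix, axis=1)
        return np.argsort(-scores)                # candidate entities, best first

    rng = np.random.default_rng(0)
    entities = rng.normal(size=(1000, 50))        # toy entity embeddings, d = 50
    relations = rng.normal(size=(10, 50))         # toy relation embeddings
    ranking = predict_tail(entities[3], relations[1], entities)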
3.2.3 Hybrid Embedding
Hybrid embedding is the embedding of a combination of different types of graph components, e.g., node + edge (i.e., substructure), or node + community.

Substructure embedding has been studied in a number of works. For example, [44] embeds the graph structure between two possibly distant nodes to support semantic proximity search. [93] learns embeddings for subgraphs (e.g., graphlets) so as to define graph kernels for graph classification. [94] utilizes a knowledge base to enrich the information about the answer. It embeds both the path and the subgraph from the question entity to the answer entity.

Compared to subgraph embedding, community embedding has attracted only limited attention. [1] proposes to consider a community-aware proximity for node embedding, such that a node's embedding is similar to its community's embedding. ComE [89] also jointly solves node embedding, community detection and community embedding together. Rather than representing a community as a vector, it defines each community embedding as a multivariate Gaussian distribution so as to characterize how its member nodes are distributed.

The embedding of a substructure or community can also be derived by aggregating the individual node and edge embeddings inside it. However, such a kind of "indirect" approach is not optimized to represent the structure. Moreover, node embedding and community embedding can reinforce each other. Better node embedding is learnt by incorporating the community-aware high-order proximity, while better communities are detected when more accurate node embedding is generated.
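For completeness, the "indirect" aggregation mentioned above can be as simple as pooling the member nodes' vectors; the sketch below (ours, with illustrative names) uses mean pooling, which is only one possible aggregator and, as the text notes, is not optimized to represent the structure.

    # Sketch (ours): an "indirect" substructure embedding by mean-pooling member nodes.
    import numpy as np

    def substructure_embedding(node_embeddings, member_ids):
        # node_embeddings: dict mapping node id -> d-dimensional vector
        vectors = np.stack([node_embeddings[v] for v in member_ids])
        return vectors.mean(axis=0)               # one vector for the whole substructure

    node_emb = {v: np.random.rand(2) for v in range(1, 10)}    # toy 2D node embeddings
    y_G123 = substructure_embedding(node_emb, [1, 2, 3])       # embedding of G{1,2,3}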
Challenge: How to generate the target substructure and how to embed different types of graph components in one common space? In contrast to other types of embedding output, the target to embed in hybrid embedding (e.g., a subgraph or a community) is not given. Hence the first challenge is how to generate such an embedding target structure. Furthermore, different types of targets (e.g., community, node) may be embedded in one common space simultaneously. How to address the heterogeneity of the embedding target types is a problem.

3.2.4 Whole-Graph Embedding
The last type of output is the embedding of a whole graph, usually for small graphs such as proteins, molecules, etc. In this case, a graph is represented as one vector and two similar graphs are embedded to be closer.

Whole-graph embedding benefits the graph classification task by providing a straightforward and efficient solution for calculating graph similarities [49], [55], [95]. To establish a compromise between the embedding time (efficiency) and the ability to preserve information (expressiveness), [95] designs a hierarchical graph embedding framework. It argues that an accurate understanding of the global graph information requires processing substructures at different scales. A graph pyramid is formed in which each level is a summarized graph at a different scale. The graph is embedded at all levels and the results are then concatenated into one vector. [63] learns the embedding for a whole cascade graph, and then trains a multi-layer perceptron to predict the increment of the size of the cascade graph in the future.

Challenge: How to capture the properties of a whole graph and how to make a trade-off between expressiveness and efficiency? Embedding a whole graph requires capturing the properties of the whole graph and is thus more time consuming compared to other types of embedding. The key challenge of whole-graph embedding is how to make a choice between the expressive power of the learnt embedding and the efficiency of the embedding algorithm.

4 GRAPH EMBEDDING TECHNIQUES
In this section, we categorize graph embedding methods based on the techniques used.
Generally, graph embedding aims to represent a graph in a low dimensional space which preserves as much graph property information as possible. The differences between different graph embedding algorithms lie in how they define the graph property to be preserved. Different algorithms have different insights into the node (/edge/substructure/whole-graph) similarities and how to preserve them in the embedded space. Next, we introduce the insight behind each graph embedding technique, as well as how each quantifies the graph property and solves the graph embedding problem.

4.1 Matrix Factorization
Matrix factorization based graph embedding represents a graph property (e.g., node pairwise similarity) in the form of a matrix and factorizes this matrix to obtain the node embedding [11]. The pioneering studies in graph embedding usually solved graph embedding in this way. In most cases, the input is a graph constructed from non-relational high dimensional data features as introduced in Sec. 3.1.4, and the output is a set of node embeddings (Sec. 3.2.1). The problem of graph embedding can thus be treated as a structure-preserving dimensionality reduction problem which assumes the input data lie in a low dimensional manifold. There are two types of matrix factorization based graph embedding. One is to factorize the graph Laplacian eigenmaps, and the other is to directly factorize the node proximity matrix.

4.1.1 Graph Laplacian Eigenmaps
Insight: The graph property to be preserved can be interpreted as pairwise node similarities. Thus, a larger penalty is imposed if two nodes with larger similarity are embedded far apart.
Based on the above insight, the optimal embedding y can be derived by the objective function below [99]:

  y* = arg min Σ_{i≠j} (‖y_i − y_j‖^2 W_ij) = arg min y^T L y,   (1)

where W_ij is the "defined" similarity between nodes v_i and v_j; L = D − W is the graph Laplacian; and D is the diagonal matrix with D_ii = Σ_{j≠i} W_ij. The bigger the value of D_ii, the more important y_i is [97]. A constraint y^T D y = 1 is usually imposed on Eq. 1 to remove an arbitrary scaling factor in the embedding. Eq. 1 then reduces to:

  y* = arg min_{y^T D y = 1} y^T L y = arg min (y^T L y)/(y^T D y) = arg max (y^T W y)/(y^T D y).   (2)

The optimal y's are the eigenvectors corresponding to the maximum eigenvalues of the eigenproblem W y = λ D y.

The above graph embedding is transductive because it can only embed the nodes that exist in the training set. In practice, one may also need to embed new nodes that have not been seen during training. One solution is to design a linear function y = X^T a so that the embedding can be derived as long as the node feature is provided. Consequently, for inductive graph embedding, Eq. 1 becomes finding the optimal a in the objective function below:

  a* = arg min Σ_{i≠j} ‖a^T X_i − a^T X_j‖^2 W_ij = arg min a^T X L X^T a.   (3)

Similar to Eq. 2, by adding the constraint a^T X D X^T a = 1, the problem in Eq. 3 becomes:

  a* = arg min (a^T X L X^T a)/(a^T X D X^T a) = arg max (a^T X W X^T a)/(a^T X D X^T a).   (4)

The optimal a's are the eigenvectors with the maximum eigenvalues when solving X W X^T a = λ X D X^T a.
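As a minimal computational sketch of Eq. 2 (ours, not from the survey), the embedding can be obtained by solving the generalized eigenproblem W y = λ D y with an off-the-shelf solver and keeping the top-d eigenvectors; the function below assumes a dense symmetric similarity matrix W.

    # Sketch (ours): Laplacian-eigenmaps-style embedding via W y = lambda D y (Eq. 2).
    import numpy as np
    from scipy.linalg import eigh

    def laplacian_eigenmap(W, d=2):
        D = np.diag(W.sum(axis=1))                # degree matrix
        vals, vecs = eigh(W, D)                   # generalized symmetric eigensolver
        order = np.argsort(-vals)                 # sort eigenvalues, largest first
        # In practice the trivial top eigenvector is often discarded; kept here for brevity.
        return vecs[:, order[:d]]                 # |V| x d node embedding matrix

    W = np.array([[0, 1.2, 1.5],
                  [1.2, 0, 0.8],
                  [1.5, 0.8, 0]])                 # toy similarity matrix over v1, v2, v3
    Y = laplacian_eigenmap(W, d=2)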
The differences among existing studies mainly lie in how they calculate the pairwise node similarity W_ij, and whether they use a linear function y = X^T a or not. Some attempts [81], [85] have been made to summarize existing Laplacian eigenmaps based graph embedding methods using a general framework, but their surveys only cover a limited quantity of work. In Table 4, we summarize existing Laplacian eigenmaps based graph embedding studies and compare how they calculate W and what objective function they adopt.

The initial study, MDS [74], directly adopted the Euclidean distance between two feature vectors X_i and X_j as W_ij. Eq. 2 is used to find the optimal embedding y's. MDS does not consider the neighbourhood of nodes, i.e., any pair of training instances are considered as connected. Follow-up studies (e.g., [78], [96], [97], [102]) overcome this problem by first constructing a k nearest neighbour (KNN) graph from the data features. Each node is only connected with its top k similar neighbours. After that, different methods are utilized to calculate the similarity matrix W so as to preserve as much of the desired graph property as possible. Some more advanced models have been designed recently. For example, AgLPP [79] introduces an anchor graph to significantly improve the efficiency of the earlier matrix factorization model LPP. LGRM [98] learns a local regression model to grasp the graph structure and a global regression term for out-of-sample data extrapolation. Finally, different from previous works preserving local geometry, LSE [103] uses local spline regression to preserve global geometry.

When auxiliary information (e.g., label, attribute) is available, the objective function is adjusted to preserve the richer information. E.g., [99] constructs an adjacency graph W and a labelled graph W^SR. The objective function consists of two parts: one focuses on preserving the local geometric structure of the dataset as in LPP [97], and the other tries to obtain the embedding with the best class separability on the labelled training data. Similarly, [88] also constructs two graphs, an adjacency graph W which encodes local geometric structures and a feedback relational graph W^ARE that encodes the pairwise relations in users' relevance feedback. RF-Semi-NMF-PCA [101] simultaneously considers clustering, dimensionality reduction and graph embedding by constructing an objective function that consists of three components: PCA, k-means and graph Laplacian regularization.

Some other work argues that W cannot be constructed by simply enumerating pairwise node relationships. Instead, they adopt semidefinite programming (SDP) to learn W. Specifically, SDP [104] aims to find an inner product matrix that maximizes the pairwise distances between any two inputs which are not connected in the graph while preserving the nearest neighbour distances. MVU [100] constructs such a matrix and then applies MDS [74] on the learned inner product matrix. [2] proves that regularized LPP [97] is equivalent to regularized SR [99] if W is symmetric, doubly stochastic, PSD and of rank p. It constructs such a similarity matrix so as to solve LPP-like problems efficiently.
TABLE 4
Graph Laplacian eigenmaps based graph embedding (GE algorithm; how W is computed; objective function).
  - MDS [74]: W_ij = Euclidean distance(X_i, X_j). Objective: Eq. 2.
  - Isomap [78]: KNN; W_ij is the sum of edge weights along the shortest path between v_i and v_j. Objective: Eq. 2.
  - LE [96]: KNN; W_ij = exp(−‖X_i − X_j‖^2 / (2t^2)). Objective: Eq. 2.
  - LPP [97]: KNN; W_ij = exp(−‖X_i − X_j‖^2 / t). Objective: Eq. 4.
  - AgLPP [79]: anchor graph; W = Z Λ^{-1} Z^T, Λ_kk = Σ_i Z_ik, Z_ik = K_σ(X_i, U_k) / Σ_j K_σ(X_i, U_j). Objective: a* = arg min (a^T U L U^T a)/(a^T U D U^T a).
  - LGRM [98]: KNN; W_ij = exp(−‖X_i − X_j‖^2 / (2t^2)). Objective: y* = arg min (y^T (L_le + µ L_g) y)/(y^T y).
  - ARE [88]: KNN; W_ij = exp(−ρ^2(X_i, X_j)/t); W^ARE_ij = −γ if X_i ∈ F^+ and X_j ∈ F^+, 1 if L(X_i) ≠ L(X_j), 0 otherwise. F^+ denotes the images relevant to a query, γ controls the unbalanced feedback. Objective: a* = arg max (a^T X L^ARE X^T a)/(a^T X L X^T a).
  - SR [99]: KNN; W_ij = 1 if L(X_i) = L(X_j), 0 if L(X_i) ≠ L(X_j), W_ij otherwise; W^SR_ij = 1/l_r if L(X_i) = L(X_j) = C_r, 0 otherwise. C_r is the r-th class, l_r = |{X_i ∈ X : L(X_i) = C_r}|. Objective: a* = arg max (a^T X W^SR X^T a)/(a^T X (D^SR + L) X^T a).
  - HSL [87]: S = I − L, where L is the normalized hypergraph Laplacian. Objective: a* = arg max tr(a^T X S X^T a), s.t. a^T X X^T a = I_k.
  - MVU [100]: KNN; W* = arg max tr(W), s.t. W ≥ 0, Σ_ij W_ij = 0 and, for all i, j, W_ii − 2W_ij + W_jj = ‖X_i − X_j‖^2. Objective: Eq. 2.
  - SLE [86]: KNN; W_ij = 1/l_r if L(X_i) = L(X_j) = C_r, −1 if L(X_i) ≠ L(X_j). C_r is the r-th class, l_r = |{X_i ∈ X : L(X_i) = C_r}|. Objective: Eq. 4.
  - NSHLRR [76]: normal graph: KNN, W_ij = 1; hypergraph: W(e) is the weight of a hyperedge e, h(v, e) = 1 if v ∈ e and 0 otherwise, d(e) = Σ_{v∈V} h(v, e). Objective: y* = arg min Σ ‖y_i − y_j‖^2 W(e)/d(e).
  - [77]: W_ij = (‖X_i − X_{k+1}‖^2 − ‖X_i − X_j‖^2) / (k‖X_i − X_{k+1}‖^2 − Σ_{m=1}^{k} ‖X_i − X_m‖^2) for j ≤ k; 0 for j > k. Objective: y* = arg min_{y^T y = 1} Σ_{i≠j} W_ij min(‖y_i − y_j‖^p, θ).
  - PUFS [75]: KNN; W_ij = exp(−‖X_i − X_j‖^2 / (2t^2)). Objective: Eq. 4 + (must-link and cannot-link constraints).
  - RF-Semi-NMF-PCA [101]: KNN; W_ij = 1. Objective: Eq. 2 + O(PCA) + O(k-means).

4.1.2 Node Proximity Matrix Factorization
In addition to solving the above generalized eigenvalue problem, another line of studies tries to factorize the node proximity matrix directly.
Insight: Node proximity can be approximated in a low-dimensional space using matrix factorization. The objective of preserving the node proximity is to minimize the loss of the approximation.
Given the node proximity matrix W, the objective is:

  min ‖W − Y Y^{cT}‖,   (5)

where Y ∈ R^{|V|×d} is the node embedding, and Y^c ∈ R^{|V|×d} is the embedding for the context nodes [21].

Eq. 5 aims to find an optimal rank-d approximation of the proximity matrix W (d is the dimensionality of the embedding). One popular solution is to apply SVD (Singular Value Decomposition) on W [110]. Formally,

  W = Σ_{i=1}^{|V|} σ_i u_i u_i^{cT} ≈ Σ_{i=1}^{d} σ_i u_i u_i^{cT},   (6)

where {σ_1, σ_2, ..., σ_|V|} are the singular values sorted in descending order, and u_i and u_i^c are the singular vectors of σ_i. The optimal embedding is obtained using the largest d singular values and the corresponding singular vectors as follows:

  Y = [√σ_1 u_1, ..., √σ_d u_d],
  Y^c = [√σ_1 u_1^c, ..., √σ_d u_d^c].   (7)

Depending on whether the asymmetric property is preserved or not, the embedding of node i is either y_i = Y_i [21], [50], or the concatenation of Y_i and Y_i^c, i.e., y_i = [Y_i, Y_i^c] [106]. There exist other solutions for Eq. 5, such as regularized Gaussian matrix factorization [24], low-rank matrix factorization [56], and adding other regularizers to enforce more constraints [48]. We summarize all the node proximity matrix factorization based graph embedding studies in Table 5.
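A minimal sketch of the SVD route in Eqs. 6-7 (ours, not from the survey): compute a truncated SVD of a proximity matrix W and scale the singular vectors by the square roots of the singular values; the proximity matrix used here is just a toy example.

    # Sketch (ours): node embeddings from a truncated SVD of a proximity matrix W (Eqs. 6-7).
    import numpy as np

    def svd_embedding(W, d):
        U, sigma, Vt = np.linalg.svd(W)           # singular values come sorted descending
        Y = U[:, :d] * np.sqrt(sigma[:d])         # Y   = [sqrt(sigma_1) u_1, ..., sqrt(sigma_d) u_d]
        Yc = Vt[:d].T * np.sqrt(sigma[:d])        # Y^c = [sqrt(sigma_1) u_1^c, ..., sqrt(sigma_d) u_d^c]
        return Y, Yc

    W = np.random.rand(9, 9)                      # toy 9-node proximity matrix
    Y, Yc = svd_embedding(W, d=2)
    y_4 = np.concatenate([Y[3], Yc[3]])           # concatenated embedding of node v_4, as in [106]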
Summary: Matrix factorization (MF) is mostly used to embed a graph constructed from non-relational data (Sec. 3.1.4) for node embedding (Sec. 3.2.1), which is the typical setting of graph Laplacian eigenmap problems. MF is also used to embed homogeneous graphs [24], [50] (Sec. 3.1.1).

4.2 Deep Learning
Deep learning (DL) has shown outstanding performance in a wide variety of research fields, such as computer vision, language modeling, etc. DL based graph embedding applies DL models on graphs. These models are either a direct adoption from other fields or a new neural network model specifically designed for embedding graph data. The input is either paths sampled from a graph or the whole graph itself. Consequently, we divide DL based graph embedding into two categories based on whether random walks are adopted to sample paths from the graph.

4.2.1 DL based Graph Embedding with Random Walk
Insight: The second-order proximity in a graph can be preserved in the embedded space by maximizing the probability of observing the neighbourhood of a node conditioned on its embedding.
In the first category of deep learning based graph embedding, a graph is represented as a set of random walk paths sampled from it. The deep learning methods are then applied to the sampled paths for graph embedding, which preserves the graph properties carried by the paths.
In view of the above insight, DeepWalk [17] adopts a neural language model (SkipGram) for graph embedding.

TABLE 5
Node proximity matrix factorization based graph embedding. O(*) denotes an objective function; e.g., O(SVM classifier) denotes the objective function of an SVM classifier. (GE algorithm; W; objective function.)
  - [50]: W_ij = 1 if e_ij ∈ E, 0 otherwise. Objective: Eq. 5.
  - SPE [105]: KNN; W* = arg max tr(W Â), s.t. D_ij > (1 − Â_ij) max_m(Â_im D_im), where Â is a connectivity matrix that describes local pairwise distances. Objective: Eq. 5.
  - HOPE [106]: Katz Index W = (I − βA)^{-1} βA; Personalized PageRank W = (1 − α)(I − αP)^{-1}; Common neighbours W = A^2; Adamic-Adar W = A(1/Σ_j(A_ij + A_ji))A. Objective: Eq. 5.
  - GraRep [21]: W^k_ij = log(Â^k_ij / Σ_t Â^k_tj) − log(λ/|V|), where Â = D^{-1}S and D_ij = Σ_p A_ip if i = j, 0 otherwise. Objective: Eq. 5.
  - CMF [43]: PPMI. Objective: Eq. 5.
  - TADW [56]: PMI. Objective: Eq. 5 with a text feature matrix.
  - [24]: A. Objective: y* = arg min Σ_{e_ij ∈ E} (A_ij − <y_i, y_j>)^2 + λ Σ_i ‖y_i‖^2.
  - MMDW [48]: PMI. Objective: Eq. 5 + O(SVM classifier).
  - HSCA [57]: PMI. Objective: O(MMDW) + (1st-order proximity constraint).
  - MVE [107]: KNN; W* = arg min tr(W(Σ_{i=1}^{d} υ_i υ_i^T + Σ_{i=d+1}^{N} υ_i υ_i^T)), where υ_i is an eigenvector of a pairwise distance matrix and d is the embedding dimensionality. Objective: Eq. 5.
  - M-NMF [1]: W = s^(1) + 5 s^(2). Objective: Eq. 5 + O(community detection) + (community proximity constraint).
  - ULGE [2]: W = Z∆^{-1}Z^T, where Z* = arg min_{z_i^T 1 = 1, z_i ≥ 0} Σ_{j=1}^{m} ‖X_i − u_j‖_2^2 z_ij + γ Σ_{j=1}^{m} z_ij^2. Objective: a* = arg min ‖a^T X − F_p‖_F^2 + α‖a‖_F^2, where F_p is the top p eigenvectors of W.
  - LLE [102]: KNN; W* = arg min Σ_i ‖X_i − Σ_j W_ij X_j‖^2. Objective: y* = arg min Σ_i ‖y_i − Σ_j W_ij y_j‖^2.
  - RESCAL [108]: W_ijk = 1 if (h_i, r_j, t_k) exists, 0 otherwise. Objective: min Σ_k ‖W_k − Y R_k Y^T‖_F^2 + λ(‖Y‖_F^2 + Σ_k ‖R_k‖_F^2).
  - FONPE [109]: KNN; W* = arg min Σ_i ‖X_i − Σ_j W_ij X_j‖^2. Objective: min ‖F − F W^T‖_F^2 + β‖P^T X − F‖_F^2, s.t. P^T P = I.
SkipGram [111] aims to maximize the co-occurrence probability among the words that appear within a window w. DeepWalk first samples a set of paths from the input graph using truncated random walk (i.e., uniformly sample a neighbour of the last visited node until the maximum length is reached). Each path sampled from the graph corresponds to a sentence from the corpus, where a node corresponds to a word. Then SkipGram is applied on the paths to maximize the probability of observing a node's neighbourhood conditioned on its embedding. In this way, nodes with similar neighbourhoods (having large second-order proximity values) share similar embedding. The objective function of DeepWalk is as follows:

\min_{y} \; -\log P(\{v_{i-w}, \cdots, v_{i-1}, v_{i+1}, \cdots, v_{i+w}\} \mid y_i),   (8)

where w is the window size which restricts the size of the random walk context. SkipGram removes the ordering constraint, and Eq. 8 is transformed to:

\min_{y} \; -\sum_{-w \le j \le w} \log P(v_{i+j} \mid y_i),   (9)

where P(v_{i+j} | y_i) is defined using the softmax function:

P(v_{i+j} \mid y_i) = \frac{\exp(y_{i+j}^{\top} y_i)}{\sum_{k=1}^{|V|} \exp(y_k^{\top} y_i)}.   (10)

Note that calculating Eq. 10 is not feasible as the normalization factor (i.e., the summation over all inner products with every node in a graph) is expensive. There are usually two solutions to approximate the full softmax: hierarchical softmax [112] and negative sampling [112].

Hierarchical softmax: To efficiently solve Eq. 10, a binary tree is constructed in which the nodes are assigned to the leaves. Instead of enumerating all nodes as in Eq. 10, only the path from the root to the corresponding leaf needs to be evaluated. The optimization problem becomes maximizing the probability of a specific path in the tree. Suppose the path to leaf v_i is a sequence of nodes (b_0, b_1, \cdots, b_{\log(|V|)}), where b_0 = root and b_{\log(|V|)} = v_i. Eq. 10 then becomes:

P(v_{i+j} \mid y_i) = \prod_{t=1}^{\log(|V|)} P(b_t \mid y_i),   (11)

where P(b_t) is a binary classifier: P(b_t | v_i) = \sigma(y_{b_t}^{\top} y_i). \sigma(\cdot) denotes the sigmoid function. y_{b_t} is the embedding of tree node b_t's parent. The hierarchical softmax reduces the time complexity of SkipGram from O(|V|^2) to O(|V|\log(|V|)).

Negative sampling: The key idea of negative sampling is to distinguish the target node from noise using logistic regression. I.e., for a node v_i, we want to distinguish its neighbour v_{i+j} from other nodes. A noise distribution P_n(v_i) is designed to draw the negative samples for node v_i. Each \log P(v_{i+j} | y_i) in Eq. 9 is then calculated as:

\log \sigma(y_{i+j}^{\top} y_i) + \sum_{t=1}^{K} \mathbb{E}_{v_t \sim P_n}[\log \sigma(-y_{v_t}^{\top} y_i)],   (12)

where K is the number of negative nodes that are sampled. P_n(v_i) is a noise distribution, e.g., a uniform distribution (P_n(v_i) \sim 1/|V|, \forall v_i \in V). The time complexity of SkipGram with negative sampling is O(K|V|).

The success of DeepWalk [17] motivates many subsequent studies which apply deep learning models (e.g., SkipGram or Long-Short Term Memory (LSTM) [115]) on the sampled paths for graph embedding. We summarize them in Table 6. As shown in the table, most studies follow the idea of DeepWalk but change the settings of either the random walk sampling methods ( [25], [28], [62], [62]) or the proximity (Def. 5 and Def. 6) to be preserved ( [34], [47], [62], [66], [73]). [46] designs meta-path-based random walks to deal with heterogeneous graphs and a heterogeneous SkipGram which maximizes the probability of having the heterogeneous context for a given node. Apart from SkipGram, LSTM is another popular deep learning model adopted in graph embedding.
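To make the DeepWalk-style pipeline above concrete, the following is a small illustrative sketch (toy graph, Python/NumPy, not from the surveyed implementations): it samples truncated random walks and then applies one SkipGram update with negative sampling (Eq. 12) per adjacent node pair, using a uniform noise distribution. A real system would iterate over window positions and many epochs, or hand the walks to an off-the-shelf word2vec implementation.

```python
import random
import numpy as np

def truncated_random_walks(adj, num_walks, walk_len):
    """Sample DeepWalk-style truncated random walks; adj maps node -> neighbour list."""
    walks, nodes = [], list(adj)
    for _ in range(num_walks):
        random.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_len and adj[walk[-1]]:
                walk.append(random.choice(adj[walk[-1]]))  # uniformly pick a neighbour
            walks.append(walk)
    return walks

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(Y, Yc, center, context, negatives, lr=0.025):
    """One SkipGram update with negative sampling (Eq. 12) on node embedding Y
    and context embedding Yc (both |V| x d arrays)."""
    grad_center = np.zeros_like(Y[center])
    for node, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        g = lr * (label - sigmoid(Yc[node] @ Y[center]))
        grad_center += g * Yc[node]
        Yc[node] += g * Y[center]
    Y[center] += grad_center

# Toy usage: a 4-node graph, walks of length 5.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walks = truncated_random_walks(adj, num_walks=10, walk_len=5)
Y, Yc = np.random.rand(4, 8), np.random.rand(4, 8)
for walk in walks:
    for i in range(len(walk) - 1):
        negatives = [random.randrange(4) for _ in range(5)]  # uniform noise distribution
        sgns_step(Y, Yc, walk[i], walk[i + 1], negatives)
```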
TABLE 6
Deep learning based graph embedding with random walk paths.

GE Algorithm | Random Walk Method | Preserved Proximity | DL Model
DeepWalk [17] | truncated random walk | 2nd | SkipGram with hierarchical softmax (Eq. 11)
[34] | truncated random walk | 2nd (word-image) | SkipGram with hierarchical softmax (Eq. 11)
GenVector [66] | truncated random walk | 2nd (user-user & concept-concept) | SkipGram with hierarchical softmax (Eq. 11)
Constrained DeepWalk [25] | sampling with edge weight | 2nd | SkipGram with hierarchical softmax (Eq. 11)
DDRW [47] | truncated random walk | 2nd + class identity | SkipGram with hierarchical softmax (Eq. 11)
TriDNR [73] | truncated random walk | 2nd (among node, word & label) | SkipGram with hierarchical softmax (Eq. 11)
node2vec [28] | BFS + DFS | 2nd | SkipGram with negative sampling (Eq. 12)
UPP-SNE [113] | truncated random walk | 2nd (user-user & profile-profile) | SkipGram with negative sampling (Eq. 12)
Planetoid [62] | sampling node pairs by labels and structure | 2nd + label identity | SkipGram with negative sampling (Eq. 12)
NBNE [19] | sampling direct neighbours of a node | 2nd | SkipGram with negative sampling (Eq. 12)
DGK [93] | graphlet kernel: random sampling [114] | 2nd (by graphlet) | SkipGram (Eqs. 11-12)
metapath2vec [46] | meta-path based random walk | 2nd | heterogeneous SkipGram
ProxEmbed [44] | truncated random walk | node ranking tuples | LSTM
HSNL [29] | truncated random walk | 2nd + QA ranking tuples | LSTM
RMNL [30] | truncated random walk | 2nd + user-question quality ranking | LSTM
DeepCas [63] | Markov chain based random walk | information cascade sequence | GRU
MRW-MN [36] | truncated random walk | 2nd + cross-modal feature difference | DCNN + SkipGram

TABLE 7
Deep learning based graph embedding without random walk paths.

GE Algorithm | Deep Learning Model | Model Input
SDNE [20] | autoencoder | A
DNGR [23] | stacked denoising autoencoder | PPMI
SAE [22] | sparse autoencoder | D^{-1}A
[55] | CNN | node sequence
SCNN [118] | Spectral CNN | graph
[119] | Spectral CNN with smooth spectral multipliers | graph
MoNet [80] | mixture model network | graph
ChebNet [82] | graph CNN a.k.a. ChebNet | graph
GCN [72] | Graph Convolutional Network | graph
GNN [120] | Graph Neural Network | graph
[121] | adapted Graph Neural Network | molecule graph
GGS-NNs [122] | adapted Graph Neural Network | graph
HNE [33] | CNN + FC | graph with image and text
DUIF [35] | a hierarchical deep model | social curation network
ProjE [40] | a neural network model | knowledge graph
TIGraNet [123] | Graph Convolutional Network | graph constructed from images
Note that SkipGram can only embed one single node. However, sometimes we may need to embed a sequence of nodes as a fixed length vector, e.g., represent a sentence (i.e., a sequence of words) as one vector. LSTM is then adopted in such scenarios to embed a node sequence. For example, [29] and [30] embed the sentences from questions/answers in cQA sites, and [44] embeds a sequence of nodes between two nodes for proximity embedding. A ranking loss function is optimized in these works to preserve the ranking scores in the training data. In [63], GRU [116] (i.e., a recurrent neural network model similar to LSTM) is used to embed information cascade paths.

4.2.2 DL based Graph Embedding without Random Walk
Insight: The multi-layered learning architecture is a robust and effective solution to encode the graph into a low dimensional space.
The second class of deep learning based graph embedding methods applies deep models on a whole graph (or a proximity matrix of a whole graph) directly. Below are some popular deep learning models used in graph embedding.

Autoencoder: An autoencoder aims to minimize the reconstruction error of the output and input by its encoder and decoder. Both encoder and decoder contain multiple nonlinear functions. The encoder maps input data to a representation space and the decoder maps the representation space to a reconstruction space. The idea of adopting an autoencoder for graph embedding is similar to node proximity matrix factorization (Sec. 4.1.2) in terms of neighbourhood preservation. Specifically, the adjacency matrix captures a node's neighbourhood. If we input the adjacency matrix to an autoencoder, the reconstruction process will make the nodes with similar neighbourhoods have similar embedding.

Deep Neural Network: As a popular deep learning model, the Convolutional Neural Network (CNN) and its variants have been widely adopted in graph embedding. On the one hand, some of them directly use the original CNN model designed for Euclidean domains and reformat input graphs to fit it. E.g., [55] uses graph labelling to select a fixed-length node sequence from a graph and then assembles nodes' neighbourhoods to learn a neighbourhood representation with the CNN model. On the other hand, some other work attempts to generalize the deep neural model to non-Euclidean domains (e.g., graphs). [117] summarizes the representative studies in their survey. Generally, the differences between these approaches lie in the way they formulate a convolution-like operation on graphs. One way is to emulate the Convolution Theorem to define the convolution in the spectral domain [118], [119]. Another is to treat the convolution as neighborhood matching in the spatial domain [72], [82], [120].

Others: There are some other types of deep learning based graph embedding methods. E.g., [35] proposes DUIF, which uses a hierarchical softmax as a forward propagation to maximize the modularity. HNE [33] utilizes deep learning techniques to capture the interactions between heterogeneous components, e.g., CNN for image and FC layers for text. ProjE [40] designs a neural network with a combination layer and a projection layer. It defines a pointwise loss (similar to multi-class classification) and a listwise loss (i.e., softmax regression loss) for knowledge graph embedding. We summarize all deep learning based graph embedding methods (random walk free) in Table 7, and compare the models they use as well as the input for each model.

Summary: Due to its robustness and effectiveness, deep learning has been widely used in graph embedding. Three types of input graphs (except for graph constructed from non-relational data (Sec. 3.1.4)) and all the four types of embedding output have been observed in deep learning based graph embedding methods.
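To illustrate the autoencoder idea described above in its simplest possible form, the following hedged sketch trains a one-hidden-layer network to reconstruct each node's adjacency row and uses the hidden activations as the node embedding. It is a toy stand-in (not SDNE or any surveyed method); scikit-learn's MLPRegressor is used only because fitting an input to itself gives a minimal bottleneck autoencoder.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Adjacency matrix of a small graph; each row is a node's neighbourhood vector.
A = np.array([[0., 1., 1., 0., 0.],
              [1., 0., 1., 0., 0.],
              [1., 1., 0., 1., 0.],
              [0., 0., 1., 0., 1.],
              [0., 0., 0., 1., 0.]])

d = 2  # embedding dimensionality (the bottleneck width)
# Training the network to reproduce its own input rows makes the hidden layer
# act as the encoder: nodes with similar neighbourhoods get similar codes.
ae = MLPRegressor(hidden_layer_sizes=(d,), activation='relu',
                  max_iter=5000, random_state=0)
ae.fit(A, A)

# Encode: a forward pass through the first (encoder) layer gives the embedding.
Z = np.maximum(0.0, A @ ae.coefs_[0] + ae.intercepts_[0])
```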
4.3 Edge Reconstruction based Optimization
Overall Insight: The edges established based on node embedding should be as similar to those in the input graph as possible.
The third category of graph embedding techniques directly optimizes an edge reconstruction based objective function, by either maximizing edge reconstruction probability or minimizing edge reconstruction loss. The latter is further divided into distance-based loss and margin-based ranking loss. Next, we introduce the three types one by one.

TABLE 8
Edge reconstruction based graph embedding. O**(T_{v_i}, T_{v_j}) refers to one of Eq. 14, Eq. 16 ~ Eq. 19; e.g., O_min^(2)(word, label) refers to Eq. 18 with a word node and a label node. T_{v_i} denotes the type of node v_i.

GE Algorithm | Objectives | Order of Proximity
PALE [18] | O_max^(1)(node, node) | 1st
NRCL [4] | O_rank(node, neighbour node) + O(attribute loss) | 1st
PTE [124] | O_min^(2)(word, word) + O_min^(2)(word, document) + O_min^(2)(word, label) | 2nd
APP [3] | O_max^(2)(node, node) | 2nd
GraphEmbed [83] | O_min^(2)(word, word) + O_min^(2)(word, time) + O_min^(2)(word, location) + O_min^(2)(time, location) + O_min^(2)(location, location) + O_min^(2)(time, time) | 2nd
[41], [42] | O_min^(2)(station, company), O_min^(2)(station, role), O_min^(2)(destination, boarding) | 2nd
PLE [84] | O_rank(mention, type) + O_min^(2)(mention, feature) + O_min^(2)(type, type) | 2nd
IONE [26] | O_min^(2)(node, node) + O(anchor align) | 2nd
HEBE [45] | O_min^(2)(node, other nodes in one hyperedge) | 2nd
GAKE [38] | O_max^(2)(node, neighbour context) + O_max^(2)(node, path context) + O_max^(2)(node, edge context) | 2nd
CSIF [64] | O_max^(2)(user pair, diffused content) | 2nd
ESR [69] | O_max^(2)(entity, author) + O_max^(2)(entity, entity) + O_max^(2)(entity, words) + O_max^(2)(entity, venue) | 2nd
LINE [27] | O_min^(1)(node, node) + O_min^(2)(node, node) | 1st + 2nd
EBPR [71] | O(AUC ranking) + O_max^(2)(node, node) + O_max^(2)(node, node context) | 1st + 2nd
[94] | O_rank(question, answer) | 1st + 2nd + higher

4.3.1 Maximizing Edge Reconstruction Probability
Insight: Good node embedding maximizes the probability of generating the observed edges in a graph.
Good node embedding should be able to re-establish edges in the original input graph. This can be realized by maximizing the probability of generating all observed edges (i.e., node pairwise proximity) using the node embedding.
The direct edge between a node pair v_i and v_j indicates their first-order proximity, which can be calculated as the joint probability using the embedding of v_i and v_j:

p^{(1)}(v_i, v_j) = \frac{1}{1 + \exp(-y_i^{\top} y_j)}.   (13)

The above first-order proximity exists between any pair of connected nodes in a graph. To learn the embedding, we maximize the log-likelihood of observing these proximities in a graph. The objective function is then defined as:

O_{\max}^{(1)} = \max \sum_{e_{ij} \in E} \log p^{(1)}(v_i, v_j).   (14)

Similarly, the second-order proximity of v_i and v_j is the conditional probability of v_j generated by v_i using y_i and y_j:

p^{(2)}(v_j \mid v_i) = \frac{\exp(y_j^{\top} y_i)}{\sum_{k=1}^{|V|} \exp(y_k^{\top} y_i)}.   (15)

It can be interpreted as the probability of a random walk in a graph which starts from v_i and ends with v_j. Hence the graph embedding objective function is:

O_{\max}^{(2)} = \max \sum_{\{v_i, v_j\} \in P} \log p^{(2)}(v_j \mid v_i),   (16)

where P is a set of {start node, end node} pairs in the paths sampled from the graph, i.e., the two end nodes from each sampled path. This simulates the second-order proximity as the probability of a random walk starting from the start node and ending with the end node.
4.3.2 Minimizing Distance-based Loss
Insight: The node proximity calculated based on node embedding should be as close to the node proximity calculated based on the observed edges as possible.
Specifically, node proximity can be calculated based on node embedding or empirically calculated based on observed edges. Minimizing the differences between the two types of proximities preserves the corresponding proximity.
For the first-order proximity, it can be computed using node embedding as defined in Eq. 13. The empirical probability is \hat{p}^{(1)}(v_i, v_j) = A_{ij} / \sum_{e_{ij} \in E} A_{ij}, where A_{ij} is the weight of edge e_{ij}. The smaller the distance between p^{(1)} and \hat{p}^{(1)} is, the better the first-order proximity is preserved. Adopting KL-divergence as the distance function to calculate the differences between p^{(1)} and \hat{p}^{(1)} and omitting some constants, the objective function to preserve the first-order proximity in graph embedding is:

O_{\min}^{(1)} = \min -\sum_{e_{ij} \in E} A_{ij} \log p^{(1)}(v_i, v_j).   (17)

Similarly, the second-order proximity of v_i and v_j is the conditional probability of v_j generated by node v_i (Eq. 15). The empirical probability \hat{p}^{(2)}(v_j | v_i) is calculated as \hat{p}^{(2)}(v_j | v_i) = A_{ij} / d_i, where d_i = \sum_{e_{ik} \in E} A_{ik} is the out-degree (or degree in the case of an undirected graph) of node v_i. Similar to Eq. 10, it is very expensive to calculate Eq. 15, and negative sampling is again adopted for approximate computation to improve the efficiency. By minimizing the KL divergence between p^{(2)}(v_j | v_i) and \hat{p}^{(2)}(v_j | v_i), the objective function to preserve the second-order proximity is:

O_{\min}^{(2)} = \min -\sum_{e_{ij} \in E} A_{ij} \log p^{(2)}(v_j \mid v_i).   (18)

4.3.3 Minimizing Margin-based Ranking Loss
In the margin-based ranking loss optimization, the edges of the input graph indicate the relevance between a node pair. Some nodes in the graph are usually associated with a set of relevant nodes. E.g., in a cQA site, a set of answers are marked as relevant to a given question. The insight of the loss is straightforward.
Insight: A node's embedding is more similar to the embedding of relevant nodes than that of any other irrelevant node.
Denote s(v_i, v_j) as the similarity score for nodes v_i and v_j, V_i^+ as the set of nodes relevant to v_i and V_i^- as the irrelevant node set. The margin-based ranking loss is defined as:

O_{rank} = \min \sum_{v_i^+ \in V_i^+} \sum_{v_i^- \in V_i^-} \max\{0, \gamma - s(v_i, v_i^+) + s(v_i, v_i^-)\},   (19)

where \gamma is the margin. Minimizing the loss encourages a large margin between s(v_i, v_i^+) and s(v_i, v_i^-), and thus enforces v_i to be embedded closer to its relevant nodes than to any other irrelevant node.
In Table 8, we summarize existing edge reconstruction based graph embedding methods, based on their objective functions and preserved node proximity. In general, most methods use one of the above objective functions (Eq. 14, Eq. 16 ~ Eq. 19). [71] optimizes an AUC ranking loss, which is a substitution loss for the margin-based ranking loss (Eq. 19).
Note that when another task is simultaneously optimized during graph embedding, that task-specific objective will be incorporated in the overall objective. For instance, [26] aims to align two graphs. Hence a network alignment objective function is optimized together with O_min^(2) (Eq. 18).
It is worth noting that most of the existing knowledge graph embedding methods choose to optimize the margin-based ranking loss. Recall that a knowledge graph G consists of a set of triplets ⟨h, r, t⟩ denoting that the head entity h is linked to a tail entity t by a relation r. Embedding G can be interpreted as preserving the ranking of a true triplet ⟨h, r, t⟩ over a false triplet ⟨h', r, t'⟩ that does not exist in G. Particularly, in knowledge graph embedding, similar to s(v_i, v_i^+) in Eq. 19, an energy function is designed for a triplet ⟨h, r, t⟩ as f_r(h, t). There is a slight difference between these two functions. s(v_i, v_i^+) denotes the similarity score between the embedding of nodes v_i and v_i^+, while f_r(h, t) is the distance score of the embedding of h and t in terms of relation r. One example of f_r(h, t) is ‖h + r − t‖_{l_1}, where relationships are represented as translations in the embedded space [91]. Other options of f_r(h, t) are summarized in Table 9. Consequently, for knowledge graph embedding, Eq. 19 becomes:

O_{rank}^{kg} = \min \sum_{\langle h,r,t \rangle \in S, \; \langle h',r,t' \rangle \notin S} \max\{0, \gamma + f_r(h, t) - f_r(h', t')\},   (20)

where S is the set of triplets in the input knowledge graph. Existing knowledge graph embedding methods mainly optimize Eq. 20 in their work. The difference among them is how they define f_r(h, t), as summarized in Table 9. More details about the related work in knowledge graph embedding have been thoroughly reviewed in [13].

TABLE 9
Knowledge graph embedding using margin-based ranking loss.

GE Algorithm | Energy Function f_r(h, t)
TransE [91] | ‖h + r − t‖_{l_1}
TKRL [53] | ‖M_{rh} h + r − M_{rt} t‖
TransR [15] | ‖h M_r + r − t M_r‖_2^2
CTransR [15] | ‖h M_r + r_c − t M_r‖_2^2 + α‖r_c − r‖_2^2
TransH [14] | ‖(h − w_r^T h w_r) + d_r − (t − w_r^T t w_r)‖_2^2
SePLi [39] | (1/2)‖W_i e_{ih} + b_i − e_{it}‖^2
TransD [125] | ‖M_{rh} h + r − M_{rt} t‖_2^2
TranSparse [126] | ‖M_r^h(θ_r^h) h + r − M_r^t(θ_r^t) t‖_{l_{1/2}}^2
m-TransH [127] | ‖Σ_{ρ∈M(R_r)} a_r(ρ) P_{n_r}(t(ρ)) + b_r‖^2, t ∈ N M(R_r)
DKRL [128] | ‖h_d + r − t_d‖ + ‖h_d + r − t_s‖ + ‖h_s + r − t_d‖
ManifoldE [129] | Sphere: ‖ϕ(h) + ϕ(r) − ϕ(t)‖^2; Hyperplane: (ϕ(h) + ϕ(r_head))^T (ϕ(t) + ϕ(r_tail)); ϕ is the mapping function to Hilbert space
TransA [130] | ‖h + r − t‖
puTransE [43] | ‖h + r − t‖
KGE-LDA [60] | ‖h + r − t‖_{l_1}
SE [90] | ‖R_u h − R_u t‖_{l_1}
SME [92] linear | (W_{u1} r + W_{u2} h + b_u)^T (W_{v1} r + W_{v2} t + b_v)
SME [92] bilinear | (W_{u1} r + W_{u2} h + b_u)^T (W_{v1} r + W_{v2} t + b_v)
SSP [59] | −λ‖e − s^T e s‖_2^2 + ‖e‖_2^2, where s = S(s_h, s_t) = (s_h + s_t)/‖s_h + s_t‖_2^2
NTN [131] | u_r^T tanh(h^T W_r t + W_{rh} h + W_{rt} t + b_r)
HOLE [132] | r^T (h ⋆ t), where ⋆ is circular correlation
MTransE [133] | ‖h + r − t‖_{l_1}
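As a concrete illustration of the translation-based energy and the margin-based ranking loss of Eq. 20, the following is a minimal NumPy sketch with toy random embeddings; it is not an implementation of any particular surveyed system, and the corrupted triplets are assumed to be supplied by the caller.

```python
import numpy as np

def f_r(h, r, t):
    """TransE-style energy f_r(h, t) = ||h + r - t||_{l1} (Table 9, first row)."""
    return np.abs(h + r - t).sum()

def kg_margin_loss(pos, neg, ent, rel, gamma=1.0):
    """Eq. 20: hinge loss over pairs of a true triplet and a corrupted triplet."""
    loss = 0.0
    for (h, r, t), (h2, r2, t2) in zip(pos, neg):
        loss += max(0.0, gamma + f_r(ent[h], rel[r], ent[t])
                         - f_r(ent[h2], rel[r2], ent[t2]))
    return loss

# Toy usage: 3 entities, 1 relation, 4-dimensional embedding.
ent, rel = np.random.rand(3, 4), np.random.rand(1, 4)
pos = [(0, 0, 1)]   # true triplet <h, r, t>
neg = [(0, 0, 2)]   # corrupted triplet with a replaced tail
print(kg_margin_loss(pos, neg, ent, rel))
```

The same pattern applies to the general graph case of Eq. 19, with s(v_i, v_j) replacing the (negated) energy and relevant/irrelevant node pairs replacing true/corrupted triplets.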
Note that some studies jointly optimize the ranking loss (Eq. 20) and other objectives to preserve more information. E.g., SSP [59] optimizes a topic model loss together with Eq. 20 to utilize textual node descriptions for embedding. [133] categorizes monolingual relations and uses linear transformation to learn cross-lingual alignment for entities and relations. There also exists some work which defines a matching degree score rather than an energy function for a triplet ⟨h, r, t⟩. E.g., [134] defines a bilinear score function v_h^T W_r v_t. It adds a normality constraint and a commutativity constraint to impose analogical structures among the embedding. ComplEx [135] extends the embedding to the complex domain and defines the real part of v_h^T W_r v_t as the score.

Summary: Edge reconstruction based optimization is applicable for most graph embedding settings. As far as can be observed, only graph constructed from non-relational data (Sec. 3.1.4) and whole-graph embedding (Sec. 3.2.4) have not been tried. The reason is that reconstructing manually constructed edges is not as intuitive as in other graphs. Moreover, as this technique focuses on the directly observed local edges, it is not suitable for whole-graph embedding.

4.4 Graph Kernel
Insight: The whole graph structure can be represented as a vector containing the counts of elementary substructures that are decomposed from it.
Graph kernel is an instance of R-convolution kernels [136], which is a generic way of defining kernels on discrete compound objects by recursively decomposing structured objects into "atomic" substructures and comparing all pairs of them [93]. The graph kernel represents each graph as a vector, and two graphs are compared using an inner product of the two vectors. There are generally three types of "atomic" substructures defined in graph kernels.

Graphlet. A graphlet is an induced and non-isomorphic subgraph of size-k [93]. Suppose graph G is decomposed into a set of graphlets {G_1, G_2, \cdots, G_d}. Then G is embedded as a d-dimensional vector (denoted as y_G) of normalized counts. The i-th dimension of y_G is the frequency of the graphlet G_i occurring in G.

Subtree Patterns. In this kernel, a graph is decomposed into its subtree patterns. One example is the Weisfeiler-Lehman subtree [49]. In particular, a relabelling iteration process is conducted on a labelled graph (i.e., a graph with discrete node labels). At each iteration, a multiset label is generated based on the label of a node and its neighbours. The newly generated multiset label is a compressed label which denotes the subtree patterns, and it is then used for the next iteration. Based on the Weisfeiler-Lehman test of graph isomorphism, counting the occurrence of labels in a graph is equivalent to counting the corresponding subtree structures. Suppose h iterations of relabelling are performed on graph G. Its embedding y_G contains h blocks. The i-th dimension in the j-th block of y_G is the frequency of the i-th label being assigned to a node in the j-th iteration.
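The Weisfeiler-Lehman relabelling iteration just described can be sketched in a few lines of Python; the version below is a simplified illustration (not the implementation of [49]) that returns one block of label counts per iteration for a single graph with an adjacency-list representation.

```python
from collections import Counter

def wl_subtree_features(adj, labels, iterations):
    """Weisfeiler-Lehman relabelling: returns one Counter of compressed labels per
    iteration; concatenating the counts gives the subtree-pattern blocks of y_G."""
    blocks = []
    compressed = {}                      # multiset label -> compressed label id
    for _ in range(iterations):
        new_labels = {}
        for v in adj:
            # Multiset label: the node's own label plus the sorted neighbour labels.
            multiset = (labels[v], tuple(sorted(labels[u] for u in adj[v])))
            new_labels[v] = compressed.setdefault(multiset, len(compressed))
        labels = new_labels
        blocks.append(Counter(labels.values()))
    return blocks

# Toy usage: a path graph 0-1-2 with two distinct initial labels.
adj = {0: [1], 1: [0, 2], 2: [1]}
print(wl_subtree_features(adj, {0: 'a', 1: 'a', 2: 'b'}, iterations=2))
```

To compare several graphs, the compression dictionary would have to be shared across graphs so that identical subtree patterns map to the same feature dimension.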
Random Walks. In the third type of graph kernels, a graph is decomposed into random walks or paths and represented as the counts of occurrence of random walks [137] or paths [138] in it. Take paths as an example: suppose graph G is decomposed into d shortest paths. Denote the i-th path as a triplet ⟨l_i^s, l_i^e, n_i⟩, where l_i^s and l_i^e are the labels of the starting and ending nodes, and n_i is the length of the path. Then G is represented as a d-dimensional vector y_G where the i-th dimension is the frequency of the i-th triplet occurring in G.
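A small sketch of this shortest-path decomposition is given below (illustrative only, assuming an unweighted graph in adjacency-list form); it counts the ⟨start label, end label, length⟩ triplets using breadth-first search.

```python
from collections import Counter, deque

def shortest_path_features(adj, labels):
    """Count <start label, end label, length> triplets over BFS shortest paths,
    giving the path-based representation y_G sketched above."""
    feats = Counter()
    for s in adj:
        dist, queue = {s: 0}, deque([s])
        while queue:                      # BFS from s
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for t, n in dist.items():
            if t != s:
                feats[(labels[s], labels[t], n)] += 1
    return feats

adj = {0: [1], 1: [0, 2], 2: [1]}
print(shortest_path_features(adj, {0: 'a', 1: 'a', 2: 'b'}))
```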
Summary: A graph kernel is designed for whole-graph embedding (Sec. 3.2.4) only, as it captures the global property of a whole graph. The type of input graph is usually a homogeneous graph (Sec. 3.1.1) [93] or a graph with auxiliary information (Sec. 3.1.3) [49].
4.5 Generative Model
A generative model can be defined by specifying the joint distribution of the input features and the class labels, conditioned on a set of parameters [139]. An example is Latent Dirichlet Allocation (LDA), in which a document is interpreted as a distribution over topics, and a topic is a distribution over words [140]. There are the following two ways to adopt a generative model for graph embedding.

4.5.1 Embed Graph Into The Latent Semantic Space
Insight: Nodes are embedded into a latent semantic space where the distances among nodes explain the observed graph structure.
The first type of generative model based graph embedding methods directly embeds a graph in the latent space. Each node is represented as a vector of the latent variables. In other words, it views the observed graph as generated by a model. E.g., in LDA, documents are embedded in a "topic" space where documents with similar words have similar topic vector representations. [70] designs an LDA-like model to embed a location-based social network (LBSN) graph. Specifically, the input is locations (documents), each of which contains a set of users (words) who visited that location. Users visit the same location (words appearing in the same document) due to some activities (topics). Then a model is designed to represent a location as a distribution over activities, where each activity has an attractiveness distribution over users. Consequently, both user and location are represented as a vector in the "activity" space.
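To give a flavour of this direction, the following is a hedged toy sketch of the LBSN analogy above using a generic LDA implementation; the visit-count matrix is synthetic and the code is not the model of [70], which defines its own generative process.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Rows: locations ("documents"); columns: users ("words"); entries: visit counts.
visits = np.random.randint(0, 5, size=(20, 50))

lda = LatentDirichletAllocation(n_components=4, random_state=0).fit(visits)
location_embedding = lda.transform(visits)   # each location as a distribution over latent "activities"
user_embedding = lda.components_.T           # each user's weight under every "activity"
```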
4.5.2 Incorporate Latent Semantics for Graph Embedding
Insight: Nodes which are close in the graph and have similar semantics should be embedded closer. The node semantics can be detected from node descriptions via a generative model.
In this line of methods, latent semantics are used to leverage auxiliary node information for graph embedding. The embedding is decided not only by the graph structure information but also by the latent semantics discovered from other sources of node information. For example, [58] proposes a unified framework which jointly integrates topic modelling and graph embedding. Its principle is that if two nodes are close in the embedded space, they will also share a similar topic distribution. A mapping function from the embedded space to the topic semantic space is designed so as to correlate the two spaces. [141] proposes a generative model (Bayesian non-parametric infinite mixture embedding model) to address the issue of multiple relation semantics in knowledge graph embedding. It discovers the latent semantics of a relation and leverages a mixture of relation components for embedding. [59] embeds a knowledge graph from both the knowledge graph triplets and the textual descriptions of entities and relations. It learns the semantic representation of text using topic modelling and restricts the triplet embedding in the semantic subspace.
The difference between the above two directions of methods is that the embedded space is the latent space in the first way. In contrast, in the second way, the latent space is used to integrate information from different sources and help to embed a graph into another space.

Summary: Generative models can be used for both node embedding (Sec. 3.2.1) [70] and edge embedding (Sec. 3.2.2) [141]. As it considers node semantics, the input graph is usually a heterogeneous graph (Sec. 3.1.2) [70] or a graph with auxiliary information (Sec. 3.1.3) [59].

4.6 Hybrid Techniques and Others
Sometimes multiple techniques are combined in one study. For example, [4] learns edge-based embedding via minimizing the margin-based ranking loss (Sec. 4.3), and learns attribute-based embedding by matrix factorization (Sec. 4.1). [51] optimizes a margin-based ranking loss (Sec. 4.3) with a matrix factorization based loss (Sec. 4.1) as regularization terms. [32] uses LSTM (Sec. 4.2) to learn sentence embedding in cQAs and a margin-based ranking loss (Sec. 4.3) to incorporate friendship relations. [142] adopts CBOW/SkipGram (Sec. 4.2) for knowledge graph entity embedding, and then fine-tunes the embedding by minimizing a margin-based ranking loss (Sec. 4.3). [61] uses word2vec (Sec. 4.2) to embed the textual context and TransH (Sec. 4.3) to embed the entities/relations so that the rich context information is utilized in knowledge graph embedding. [143] leverages the heterogeneous information in a knowledge base to improve recommendation performance. It uses TransR (Sec. 4.3) for network embedding, and uses autoencoders for textual and visual embedding (Sec. 4.2). Finally, a generative framework (Sec. 4.5) is proposed to integrate collaborative filtering with item semantic representations.
Apart from the introduced five categories of techniques, there exist other approaches. [95] presents embedding of a graph by its distances to prototype graphs. [16] first embeds a few landmark nodes using their pairwise shortest path distances. Then other nodes are embedded so that their distances to a subset of landmarks are as close as possible to the real shortest paths. [4] jointly optimizes a link-based loss (maximizing the likelihood of observing a node's neighbours instead of its non-neighbours) and an attribute-based loss (learning a linear projection based on the link-based representation). KR-EAR [144] distinguishes the relations in a knowledge graph as attribute-based and relation-based. It constructs a relational triple encoder (TransE, TransR) to embed the correlations between entities and relations, and an attributional triple encoder to embed the correlations between entities and attributes. Struct2vec [145] considers the structural identity of nodes by a hierarchical metric for node embedding. [146] provides a fast embedding approach by approximating the higher-order proximity matrices.

4.7 Summary
We now summarize and compare all the five categories of introduced graph embedding techniques in Table 10 in terms of their advantages and disadvantages.
TABLE 10
Comparison of graph embedding techniques.

Category | Subcategory | Advantages | Disadvantages
matrix factorization | graph Laplacian eigenmaps; node proximity matrix factorization | consider global node proximity | large time and space consumption
deep learning | with random walk | effective and robust, no feature engineering | a) only consider local context within a path; b) hard to find optimal sampling strategy
deep learning | without random walk | effective and robust, no feature engineering | high computation cost
edge reconstruction | maximize edge reconstruction probability; minimize distance-based loss; minimize margin-based ranking loss | relatively efficient training | optimization using only observed local information, i.e., edges (1-hop neighbours) or ranked node pairs
graph kernel | based on graphlets; based on subtree patterns; based on random walks | efficient, only counting the desired atomic substructures | a) substructures are not independent; b) embedding dimensionality grows exponentially
generative model | embed graph in the latent space; incorporate latent semantics for graph embedding | interpretable embedding; naturally leverage multiple information sources | a) hard to justify the choice of distribution; b) require a large amount of training data
Matrix factorization based graph embedding learns the representations based on the statistics of global pairwise similarities. Hence it can outperform deep learning based graph embedding (random walk involved) in certain tasks, as the latter relies on separate local context windows [147], [148]. However, either the proximity matrix construction or the eigendecomposition of the matrix is time and space consuming [149], making matrix factorization inefficient and unscalable for large graphs.
Deep Learning (DL) has shown promising results among different graph embedding methods. We consider DL as suitable for graph embedding because of its capability of automatically identifying useful representations from complex graph structures. For example, DL with random walk (e.g., DeepWalk [17], node2vec [28], metapath2vec [46]) can automatically exploit the neighbourhood structure through sampled paths on the graph. DL without random walk can model variable-sized subgraph structures in homogeneous graphs (e.g., GCN [72], struc2vec [145], GraphSAGE [150]), or rich interactions among flexible-typed nodes in heterogeneous graphs (e.g., HNE [33], TransE [91], ProxEmbed [44]), as useful representations. On the other hand, DL also has its limitations. For DL with random walks, it typically considers a node's local neighbours within the same path and thus overlooks the global structure information. Moreover, it is difficult to find an optimal sampling strategy as the embedding and path sampling are not jointly optimized in a unified framework. For DL without random walks, the computational cost is usually high. The traditional deep learning architectures assume the input data on a 1D or 2D grid to take advantage of GPUs [117]. However, graphs do not have such a grid structure, and thus require different solutions to improve the efficiency [117].
Edge reconstruction based graph embedding optimizes an objective function based on the observed edges or ranking triplets. It is more efficient compared to the previous two categories of graph embedding. However, this line of methods is trained using the directly observed local information, and thus the obtained embedding lacks the awareness of the global graph structure.
Graph kernel based graph embedding converts a graph into one single vector to facilitate graph level analytic tasks such as graph classification. It is more efficient than other categories of techniques as it only needs to enumerate the desired atomic substructures in a graph. However, such "bag-of-structure" based methods have two limitations [93]. Firstly, the substructures are not independent. For example, the graphlet of size k+1 can be derived from a size-k graphlet by adding a new node and some edges. This means there exists redundant information in the graph representation. Secondly, the embedding dimensionality usually grows exponentially when the size of the substructure grows, leading to a sparsity problem in the embedding.
Generative model based graph embedding can naturally leverage information from different sources (e.g., graph structure, node attributes) in a unified model. Directly embedding graphs into the latent semantic space generates embedding that can be interpreted using the semantics. But the assumption of modelling the observation using certain distributions is hard to justify. Moreover, the generative method needs a large amount of training data to estimate a proper model which fits the data. Hence it may not work well for small graphs or a small number of graphs.

5 APPLICATIONS
Graph embedding benefits a wide variety of graph analytics applications as the vector representations can be processed efficiently in both time and space. In this section, we categorize the graph embedding enabled applications as node related, edge related and graph related.

5.1 Node Related Applications
5.1.1 Node Classification
Node classification is to assign a class label to each node in a graph based on the rules learnt from the labelled nodes. Intuitively, "similar" nodes have the same labels. It is one of the most common applications discussed in the graph embedding literature. In general, each node is embedded as a low-dimensional vector. Node classification is conducted by applying a classifier on the set of labelled node embedding for training. The example classifiers include SVM ( [1], [20], [33], [34], [41], [42], [45], [56], [57], [60], [73], [75], [81], [87]), logistic regression ( [1], [17], [19], [20], [21], [25], [27], [28], [45], [59], [124]) and k-nearest neighbour classification ( [58], [151]). Then given the embedding of an unlabelled node, the trained classifier can predict its class label.
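The sequential "embed first, then classify" pipeline just described can be sketched as follows; the embedding matrix and labels here are random placeholders standing in for the output of any embedding method above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Y: node embedding matrix (|V| x d); labels are known for the first 60 nodes only.
Y = np.random.rand(100, 16)                      # placeholder embedding
train_idx = np.arange(60)
train_labels = np.random.randint(0, 3, size=60)  # placeholder class labels

clf = LogisticRegression(max_iter=1000)
clf.fit(Y[train_idx], train_labels)              # train on the labelled nodes
predicted = clf.predict(Y[60:])                  # predict labels for unlabelled nodes
```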
In contrast to the above sequential processing of first node embedding then node classification, some other work ( [47], [48], [62], [72], [80]) designs a unified framework to jointly optimize graph embedding and node classification, which learns a classification-specific representation for each node.

5.1.2 Node Clustering
Node clustering aims to group similar nodes together, so that nodes in the same group are more similar to each other than those in other groups. As an unsupervised algorithm, it is applicable when the node labels are unavailable. After representing nodes as vectors, the traditional clustering algorithms can then be applied on the node embedding. Most existing work [1], [2], [21], [22], [23], [33], [81] adopts k-means as the clustering algorithm. In contrast, [4] and [77] jointly optimize clustering and graph embedding in one objective to learn a clustering-specific node representation.
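As with classification, clustering on embeddings reduces to a standard vector-space routine; a minimal sketch with a placeholder embedding matrix is shown below.

```python
import numpy as np
from sklearn.cluster import KMeans

Y = np.random.rand(100, 16)     # placeholder node embedding matrix (|V| x d)
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(Y)
```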

5.1.3 Node Recommendation/Retrieval/Ranking
The task of node recommendation is to recommend the top K nodes of interest to a given node based on certain criteria such as similarity [3], [16], [43], [45], [47], [106]. In real-world scenarios, there are various types of recommended nodes, such as research interests for researchers [66], items for customers [3], [71], images for curation network users [35], friends for social network users [3], and documents for a query [69]. It is also popular in community-based question answering. Given a question, existing methods predict the relative rank of users ( [30], [31]) or answers ( [29], [32]). In proximity search [39], [44], they rank the nodes of a particular type (e.g., "user") for a given query node (e.g., "Bob") and a proximity class (e.g., "schoolmate"), e.g., ranking users who are the schoolmates of Bob. And there is some work focusing on cross-modal retrieval [33], [34], [36], [99], e.g., keyword-based image/video search.
A specific application which is popularly discussed in knowledge graph embedding is entity ranking [51], [52], [53], [59], [61]. Recall that a knowledge graph consists of a set of triplets ⟨h, r, t⟩. Entity ranking aims to rank the correct missing entities given the other two components in a triplet higher than the false entities. E.g., it returns the true h's among all the candidate entities given r and t, or returns the true t's given r and h.

5.2 Edge Related Applications
Next we introduce edge related applications in which an edge or a node pair is involved.

5.2.1 Link Prediction
Graph embedding aims to represent a graph with low-dimensional vectors, but interestingly its output vectors can also help infer the graph structure. In practice, graphs are often incomplete; e.g., in social networks, friendship links can be missing between two users who actually know each other. In graph embedding, the low-dimensional vectors are expected to preserve different orders of network proximity (e.g., DeepWalk [17], LINE [27]), as well as different scales of structural similarity (e.g., GCN [72], struc2vec [145]). Hence, these vectors encode rich information about the network structure, and they can be used to predict missing links in the incomplete graph. Most attempts on graph embedding driven link prediction are on homogeneous graphs [3], [16], [19], [28]. For example, [28] predicts the friendship relation between two users. Relatively fewer graph embedding works deal with heterogeneous graph link prediction. For example, on a heterogeneous social graph, ProxEmbed [44] tries to predict the missing links of certain semantic types (e.g., schoolmates) between two users, based on the embedding of their connecting paths on the graph. D2AGE [152] solves the same problem by embedding the directed acyclic graph structure connecting two users.
5.2.2 Triple Classification
Triple classification [14], [15], [38], [51], [52], [53], [61], [142] is a specific application for knowledge graphs. It aims to classify whether an unseen triplet ⟨h, r, t⟩ is correct or not, i.e., whether the relation between h and t is r.

5.3 Graph Related Applications
5.3.1 Graph Classification
Graph classification assigns a class label to a whole graph. This is important when the graph is the unit of data. For example, in [50], each graph is a chemical compound, an organic molecule or a protein structure. In most cases, whole-graph embedding is applied to calculate graph level similarities [49], [54], [55], [93], [95]. Recently, some work starts to match node embedding for graph similarity [50], [153]. Each graph is represented as a set of node embedding vectors. Graphs are compared based on two sets of node embedding. [93] decomposes a graph into a set of substructures, then embeds each substructure as a vector and compares graphs via substructure similarities.

5.3.2 Visualization
Graph visualization generates visualizations of a graph in a low dimensional space [20], [23], [48], [55], [58], [73]. Usually, for visualization purposes, all nodes are embedded as 2D vectors and then plotted in a 2D space with different colours indicating the nodes' categories. It provides a vivid demonstration of whether nodes belonging to the same category are embedded closer to each other.

5.4 Other Applications
Above are some general applications that are commonly discussed in existing work. Depending on the information carried in the input graph, more specific applications may exist. Below are some example scenarios.
Knowledge graph related: [15] and [14] extract relational facts from large-scale plain text. [62] extracts medical entities from text. [69] links natural language text with entities in a knowledge graph. [92] focuses on de-duplicating entities that are equivalent in a knowledge graph. [84] jointly embeds entity mentions, text and entity types to estimate the true type-path for each mention from its noisy candidate type set. E.g., the candidate types for "Trump" are {person, politician, businessman, artist, actor}. For the mention "Trump" in the sentence "Republican presidential candidate Donald Trump spoke during a campaign event in Rock Hill.", only {person, politician} are correct types.
Multimedia network related: [83] embeds the geo-tagged social media (GTSM) records ⟨time, location, message⟩, which enables them to recover the missing component from a GTSM triplet given the other two. It can also classify the GTSM records, e.g., whether a check-in record is related to "Food" or "Shop". [85] uses graph embedding to reduce the data dimensionality for face recognition. [88] maps images into a semantic manifold that faithfully grasps users' preferences to facilitate content-based image retrieval.
Information propagation related: [63] predicts the increment of a cascade size after a given time interval. [64] predicts the propagation user and identifies the domain expert by embedding the social interaction graph.
Social networks alignment: Both [26] and [18] learn node embedding to align users across different social networks, i.e., to predict whether two user accounts in two different social networks are owned by the same user.
Image related: Some work embeds graphs constructed from images, and then uses the embedding for image classification ( [81], [82]), image clustering [101], image segmentation [154], pattern recognition [80], and so on.

6 FUTURE DIRECTIONS
In this section, we summarize four future directions for the field of graph embedding, including computation efficiency, problem settings, techniques and application scenarios.
Computation. The deep architecture, which takes the geometric input (e.g., graph), suffers from the low efficiency problem. Traditional deep learning models (designed for Euclidean domains) utilize the modern GPU to optimize their efficiency by assuming that the input data are on a 1D or 2D grid. However, graphs do not have such a kind of grid structure, and thus the deep architecture designed for graph embedding needs to seek alternative solutions to improve the model efficiency. [117] suggested that the computational paradigms developed for large-scale graph processing can be adopted to facilitate efficiency improvement in deep learning models for graph embedding.
Problem settings. The dynamic graph is one promising setting for graph embedding. Graphs are not always static, especially in real life scenarios, e.g., social graphs in Twitter, citation graphs in DBLP. Graphs can be dynamic in terms of graph structure or node/edge information. On the one hand, the graph structure may evolve over time, i.e., new nodes/edges appear while some old nodes/edges disappear. On the other hand, the nodes/edges may be described by some time-varying information. Existing graph embedding mainly focuses on embedding the static graph, and the settings of dynamic graph embedding are overlooked. Unlike static graph embedding, the techniques for dynamic graphs need to be scalable and, better, incremental so as to deal with the dynamic changes efficiently. This makes most of the existing graph embedding methods, which suffer from the low efficiency problem, not suitable anymore. How to design effective graph embedding methods in dynamic domains remains an open question.
Techniques. Structure awareness is important for edge reconstruction based graph embedding. Current edge reconstruction based graph embedding methods are mainly based on the edges only, e.g., 1-hop neighbours in a general graph, a ranked triplet ⟨h, r, t⟩ in a knowledge graph, and (v_i, v_i^+, v_i^-) in a cQA graph. Single edges only provide local neighbourhood information to calculate the first- and second-order proximity. The global structure of a graph (e.g., paths, trees, subgraph patterns) is omitted. Intuitively, a substructure contains richer information than one single edge. Some work attempts to explore the path information in knowledge graph embedding ( [38], [39], [40], [142]). However, most of them use deep learning models ( [38], [40], [142]) which suffer the low efficiency issue as discussed earlier. How to design non-deep-learning based methods that can take advantage of the expressive power of graph structure is a question. [39] provides one example solution. It minimizes both pairwise and long-range loss to capture pairwise relations and long-range interactions between entities. Note that in addition to the list/path structure, there are various kinds of substructures which carry different structure information. For example, SPE [155] has tried to introduce a subgraph-augmented path structure for embedding the proximity between two nodes in a heterogeneous graph, and it shows better performance than embedding simple paths for semantic search tasks. In general, an efficient structure-aware graph embedding optimization solution, together with the substructure sampling strategy, is needed.
Applications. Graph embedding has been applied in many different applications. It is an effective way to learn the representations of data with consideration of their relations. Moreover, it can convert data instances from different sources/platforms/views into one common space so that they are directly comparable. For example, [16], [34], [36] use graph embedding for cross-modal retrieval, such as content-based image retrieval and keyword-based image/video search. The advantage of using graph embedding for representation learning is that the graph manifold of the training data instances is preserved in the representations and can further benefit the follow-up applications. Consequently, graph embedding can benefit the tasks which assume the input data instances are correlated with certain relations (i.e., connected by certain links). It is of great importance to explore the application scenarios which benefit from graph embedding, as it provides effective solutions to the conventional problems from a different perspective.

7 CONCLUSIONS
In this survey, we conduct a comprehensive review of the literature in graph embedding. We provide a formal definition of the problem of graph embedding and introduce some basic concepts. More importantly, we propose two taxonomies of graph embedding, categorizing existing work based on problem settings and embedding techniques respectively. In terms of the problem setting taxonomy, we introduce four types of embedding input and four types of embedding output and summarize the challenges faced in each setting. For the embedding technique taxonomy, we introduce the work in each category and compare them in terms of their advantages and disadvantages. After that, we summarize the applications that graph embedding enables. Finally, we suggest four promising future research directions in the field of graph embedding in terms of computation efficiency, problem settings, techniques and application scenarios.
R EFERENCES [28] A. Grover and J. Leskovec, “Node2vec: Scalable feature learning


for networks,” in KDD, 2016, pp. 855–864.
[1] X. Wang, P. Cui, J. Wang, J. Pei, W. Zhu, and S. Yang, “Community [29] H. Fang, F. Wu, Z. Zhao, X. Duan, Y. Zhuang, and M. Ester,
preserving network embedding,” in AAAI, 2017, pp. 203–209. “Community-based question answering via heterogeneous social
[2] F. Nie, W. Zhu, and X. Li, “Unsupervised large graph embed- network learning,” in AAAI, 2016, pp. 122–128.
ding,” in AAAI, 2017, pp. 2422–2428. [30] Z. Zhao, Q. Yang, D. Cai, X. He, and Y. Zhuang, “Expert finding
[3] C. Zhou, Y. Liu, X. Liu, Z. Liu, and J. Gao, “Scalable graph for community-based question answering via ranking metric
embedding for asymmetric proximity,” in AAAI, 2017, pp. 2942– network learning,” in IJCAI, 2016, pp. 3000–3006.
2948.
[31] H. Lu and M. Kong, “Community-based question answering via
[4] X. Wei, L. Xu, B. Cao, and P. S. Yu, “Cross view link prediction
contextual ranking metric network learning,” in AAAI, 2017, pp.
by learning noise-resilient representation consensus,” in WWW,
4963–4964.
2017, pp. 1611–1619.
[5] J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, [32] Z. Zhao, H. Lu, V. W. Zheng, D. Cai, X. He, and Y. Zhuang,
and I. Stoica, “Graphx: Graph processing in a distributed “Community-based question answering via asymmetric multi-
dataflow framework,” in OSDI, 2014, pp. 599–613. faceted ranking network learning,” in AAAI, 2017, pp. 3532–3539.
[6] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. [33] S. Chang, W. Han, J. Tang, G.-J. Qi, C. C. Aggarwal, and T. S.
Hellerstein, “Distributed graphlab: A framework for machine Huang, “Heterogeneous network embedding via deep architec-
learning and data mining in the cloud,” Proc. VLDB Endow., vol. 5, tures,” in KDD, 2015, pp. 119–128.
no. 8, pp. 716–727, 2012. [34] H. Zhang, X. Shang, H. Luan, M. Wang, and T. Chua, “Learning
[7] P. Kumar and H. H. Huang, “G-store: High-performance graph from collective intelligence: Feature learning using social images
store for trillion-edge processing,” in SC, 2016, pp. 71:1–71:12. and tags,” TOMCCAP, vol. 13, no. 1, pp. 1:1–1:23, 2016.
[8] N. Satish, N. Sundaram, M. M. A. Patwary, J. Seo, J. Park, M. A. [35] X. Geng, H. Zhang, J. Bian, and T. Chua, “Learning image and
Hassaan, S. Sengupta, Z. Yin, and P. Dubey, “Navigating the maze user features for recommendation in social networks,” in ICCV,
of graph analytics frameworks using massive graph datasets,” in 2015, pp. 4274–4282.
SIGMOD, 2014, pp. 979–990. [36] F. Wu, X. Lu, J. Song, S. Yan, Z. M. Zhang, Y. Rui, and Y. Zhuang,
[9] Y. Bengio, A. C. Courville, and P. Vincent, “Representation learn- “Learning of multimodal representations with random walks on
ing: A review and new perspectives,” PAMI, vol. 35, no. 8, pp. the click graph,” IEEE Trans. Image Processing, vol. 25, no. 2, pp.
1798–1828, 2013. 630–642, 2016.
[10] A. Mahmood, M. Small, S. A. Al-Máadeed, and N. M. Rajpoot, [37] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor, “Free-
“Using geodesic space density gradients for network community base: A collaboratively created graph database for structuring
detection,” IEEE Trans. Knowl. Data Eng., vol. 29, no. 4, pp. 921– human knowledge,” in SIGMOD, 2008, pp. 1247–1250.
935, 2017. [38] J. Feng, M. Huang, Y. Yang, and X. Zhu, “GAKE: graph aware
[11] P. Goyal and E. Ferrara, “Graph embedding techniques, applica- knowledge embedding,” in COLING, 2016, pp. 641–651.
tions, and performance: A survey,” CoRR, vol. abs/1705.02801, [39] F. Wu, J. Song, Y. Yang, X. Li, Z. M. Zhang, and Y. Zhuang,
2017. “Structured embedding via pairwise relations and long-range
[12] N. S. S and S. Surendran, “Graph embedding and dimensionality interactions in knowledge base,” in AAAI, 2015, pp. 1663–1670.
reduction-a survey,” IJCSET, vol. 4, no. 1, pp. 29–34, 2013. [40] B. Shi and T. Weninger, “Proje: Embedding projection for knowl-
[13] Q. Wang, Z. Mao, B. Wang, and L. Guo, “Knowledge graph edge graph completion,” in AAAI, 2017, pp. 1236–1242.
embedding: A survey of approaches and applications,” IEEE [41] M. Ochi, Y. Nakashio, Y. Yamashita, I. Sakata, K. Asatani, M. Rut-
Trans. Knowl. Data Eng., vol. 29, no. 12, pp. 2724–2743, 2017. tley, and J. Mori, “Representation learning for geospatial areas
[14] Z. Wang, J. Zhang, J. Feng, and Z. Chen, “Knowledge graph using large-scale mobility data from smart card,” in UbiComp,
Hongyun Cai received the Ph.D. degree in Computer Science from the University of Queensland in 2016. She is currently a postdoctoral researcher at the Advanced Digital Sciences Center, Singapore. Her research focuses on graph mining, social data management and analysis.

Vincent W. Zheng received the Ph.D. degree in Computer Science from the Hong Kong University of Science and Technology in 2011. He is a senior research scientist at the Advanced Digital Sciences Center, Singapore, and a research affiliate at the University of Illinois at Urbana-Champaign. His research interests include graph mining, information extraction, ubiquitous computing and machine learning.

Kevin Chen-Chuan Chang is a Professor at the University of Illinois at Urbana-Champaign. His research addresses large-scale information access, for search, mining, and integration across structured and unstructured big data, including Web data and social media. He also co-founded Cazoodle for deepening vertical data-aware search over the Web.
