A Comprehensive Survey of Graph Embedding: Problems, Techniques and Applications
Abstract—Graph is an important data representation which appears in a wide diversity of real-world scenarios. Effective graph analytics provides users with a deeper understanding of what is behind the data, and thus can benefit many useful applications such as node classification, node recommendation, link prediction, etc. However, most graph analytics methods suffer from high computation and space costs. Graph embedding is an effective yet efficient way to solve the graph analytics problem. It converts the graph data into a low dimensional space in which the graph structural information and graph properties are maximally preserved. In this survey, we conduct a comprehensive review of the literature in graph embedding. We first introduce the formal definition of graph embedding as well as the related concepts. After that, we propose two taxonomies of graph embedding which correspond to what challenges exist in
different graph embedding problem settings and how existing works address these challenges in their solutions. Finally, we summarize the applications that graph embedding enables and suggest four promising future research directions in terms of computation efficiency, problem settings, techniques and application scenarios.
Index Terms—Graph embedding, graph analytics, graph embedding survey, network embedding
1 INTRODUCTION
(a) Input Graph G1  (b) Node Embedding  (c) Edge Embedding  (d) Substructure Embedding  (e) Whole-Graph Embedding
Fig. 1. A toy example of embedding a graph into 2D space with different granularities. G{1,2,3} denotes the substructure containing nodes v1, v2, v3.

Graph embedding aims to represent a graph as low dimensional vectors while the graph structures are preserved. On the one hand, graph analytics aims to mine useful information from graph data. On the other hand, representation learning obtains data representations that make it easier to extract useful information when building classifiers or other predictors [9]. Graph embedding lies in the overlap of the two problems and focuses on learning the low-dimensional representations. Note that we distinguish graph representation learning and graph embedding in this survey. Graph representation learning does not require the learned representations to be low dimensional. For example, [10] represents each node as a vector with dimensionality equal to the number of nodes in the input graph. Every dimension denotes the geodesic distance of a node to each other node in the graph.

Embedding graphs into low dimensional spaces is not a trivial task. The challenges of graph embedding depend on the problem setting, which consists of embedding input and embedding output. In this survey, we divide the input graph into four categories, including homogeneous graph, heterogeneous graph, graph with auxiliary information and graph constructed from non-relational data. Different types of embedding input carry different information to be preserved in the embedded space and thus pose different challenges to the problem of graph embedding. For example, when embedding a graph with structural information only, the connections between nodes are the target to be preserved. However, for a graph with node label or attribute information, the auxiliary information provides graph properties from other perspectives, and thus may also be considered during the embedding. Unlike the embedding input, which is given and fixed, the embedding output is task driven. For example, the most common type of embedding output is node embedding, which represents close nodes as similar vectors. Node embedding can benefit node related tasks such as node classification, node clustering, etc. However, in some cases, the tasks may be related to a higher granularity of a graph, e.g., node pairs, subgraph, whole graph. Hence, the first challenge in terms of embedding output is to find a suitable embedding output type for the application of interest. We categorize four types of graph embedding output, including node embedding, edge embedding, hybrid embedding and whole-graph embedding. Different output granularities have different criteria for a "good" embedding and face different challenges. For example, a good node embedding preserves the similarity to its neighbouring nodes in the embedded space. In contrast, a good whole-graph embedding represents a whole graph as a vector so that the graph-level similarity is preserved.

Based on the challenges faced in different problem settings, we propose two taxonomies of graph embedding work, categorizing the graph embedding literature based on the problem settings and the embedding techniques. These two taxonomies correspond to what challenges exist in graph embedding and how existing studies address these challenges. In particular, we first introduce the different settings of the graph embedding problem as well as the challenges faced in each setting. Then we describe how existing studies address these challenges in their work, including their insights and their technical solutions.

Note that although a few attempts have been made to survey graph embedding ([11], [12], [13]), they have the following two limitations. First, they usually propose only one taxonomy of graph embedding techniques. None of them analyzed graph embedding work from the perspective of problem setting, nor did they summarize the challenges in each setting. Second, only a limited number of related works are covered in existing graph embedding surveys. E.g., [11] mainly introduces twelve representative graph embedding algorithms, and [13] focuses on knowledge graph embedding only. Moreover, there is no analysis of the insight behind each graph embedding technique. A comprehensive review of existing graph embedding work and a high-level abstraction of the insight behind each embedding technique can foster future research in the field.

1.1 Our Contributions
Below, we summarize our major contributions in this survey.
• We propose a taxonomy of graph embedding based on problem settings and summarize the challenges faced in each setting. We are the first to categorize graph embedding work based on problem setting, which brings new perspectives to understanding existing work.
• We provide a detailed analysis of graph embedding techniques. Compared to existing graph embedding surveys, we not only investigate a more comprehensive set of graph embedding work, but also present a summary of the insights behind each technique. In contrast to simply listing how graph embedding was solved in the past, the summarized insights answer the question of why graph embedding can be solved in a certain way. This can serve as an insightful guideline for future research.
• We systematically categorize the applications that graph embedding enables and divide the applications as node related, edge related and graph related.
TABLE 2
Graph Embedding Algorithms for cQA sites
GE Algorithm | Links Exploited
[30] | user-user, user-question
[31] | user-user, user-question, question-answer
[29] | user-user, question-answer, user-answer
[32] | users' asymmetric following links, an ordered tuple (i, j, k, o, p)

TABLE 3
Comparison of Different Types of Auxiliary Information in Graphs
Auxiliary Information | Description
label | categorical value of a node/edge, e.g., class information
attribute | categorical or continuous value of a node/edge, e.g., property information
node feature | text or image feature for a node
information propagation | the paths of how the information is propagated in graphs
knowledge base | text associated with or facts between knowledge concepts
Challenge: How to capture the diversity of connectivity patterns observed in graphs? Since only structural information is available in homogeneous graphs, the challenge of homogeneous graph embedding lies in how to preserve these connectivity patterns observed in the input graphs during embedding.

consistency between them is a problem. Moreover, there may exist imbalance between objects of different types. This data skewness should be considered in embedding.

Take Wikipedia as an example: the concepts are entities proposed by users and the text is the article associated with the entity. [66] uses a knowledge base to learn a social knowledge graph from a social network by linking each social network user to a given set of knowledge concepts. [69] represents queries and documents in the entity space (provided by a knowledge base) so that the academic search engine can understand the meaning of research concepts in queries. Other types of auxiliary information include user check-in data (user-location) [70], user item preference ranking lists [71], etc. Note that the auxiliary information is not just limited to one type. For instance, [62] and [72] consider both label and node feature information. [73] utilizes node contents and labels to assist the graph embedding process.
Challenge: How to incorporate the rich and unstructured information so that the learnt embeddings are both representing the topological structure and discriminative in terms of the auxiliary information? The auxiliary information helps to define node similarity in addition to graph structural information. The challenge of embedding a graph with auxiliary information is how to combine these two information sources to define the node similarity to be preserved.

3.1.4 Graph Constructed from Non-relational Data
The last category of input graph is not provided, but constructed from the non-relational input data by different strategies. This usually happens when the input data is assumed to lie in a low dimensional manifold.
In most cases, the input is a feature matrix X ∈ R^{|V|×N} where each row X_i is an N-dimensional feature vector for the i-th training instance. A similarity matrix S is constructed by calculating S_ij using the similarity between (X_i, X_j). There are usually two ways to construct a graph from S. A straightforward way is to directly treat S as the adjacency matrix A of an invisible graph [74]. However, [74] is based on the Euclidean distance and it does not consider the neighbouring nodes when calculating S_ij. If X lies on or near a curved manifold, the distance between X_i and X_j over the manifold is much larger than their Euclidean distance [12]. To address these issues, other methods (e.g., [75], [76], [77]) construct a K nearest neighbour (KNN) graph from S first and estimate the adjacency matrix A based on the KNN graph. For example, Isomap [78] incorporates the geodesic distance in A. It first constructs a KNN graph from S, and then finds the shortest path between two nodes as the geodesic distance between them. To reduce the cost of KNN graph construction (O(|V|^2)), [79] constructs an Anchor graph instead, whose cost is O(|V|) in terms of both time and space consumption. They first obtain a set of clustering centers as virtual anchors and find the K nearest anchors of each node for anchor graph construction.
Another way of graph construction is to establish edges between nodes based on the nodes' co-occurrence. For example, to facilitate image related applications (e.g., image segmentation, image classification), researchers (e.g., [80], [81], [82]) construct a graph from each image by treating pixels as nodes and the spatial relations between pixels as edges. [83] extracts three types of nodes (location, time and message) from the GTMS record and therefore forms six types of edges between these nodes. [84] generates a graph using entity mention, target type and text feature as the nodes, and establishes three kinds of edges: mention-type, mention-feature and type-type.
In addition to the above pairwise similarity based and node co-occurrence based methods, other graph construction strategies have been designed for different purposes. For example, [85] constructs an intrinsic graph to capture the intraclass compactness, and a penalty graph to characterize the interclass separability. The former is constructed by connecting each data point with its neighbours of the same class, while the latter connects the marginal points across different classes. [86] constructs a signed graph to exploit the label information. Two nodes are connected by a positive edge if they belong to the same class, and by a negative edge if they are from two different classes. [87] includes all instances with a common label into one hyperedge to capture their joint similarity. In [88], two feedback graphs are constructed to gather together relevant pairs and keep away irrelevant ones after embedding. In the positive graph, two nodes are connected if they are both relevant. In the negative graph, two nodes are connected only when one node is relevant and the other is irrelevant.
Challenge: How to construct a graph that encodes the pairwise relations between instances and how to preserve the generated node proximity matrix in the embedded space? The first challenge faced by embedding graphs constructed from non-relational data is how to compute the relations between the non-relational data and construct such a graph. After the graph is constructed, the challenge becomes the same as in other input graphs, i.e., how to preserve the node proximity of the constructed graph in the embedded space.

3.2 Graph Embedding Output
The output of graph embedding is a (set of) low dimensional vector(s) representing (part of) a graph. Based on the output granularity, we divide graph embedding output into four categories, including node embedding, edge embedding, hybrid embedding and whole-graph embedding. Different types of embedding facilitate different applications.
Unlike the embedding input, which is fixed and given, the embedding output is task driven. For example, node embedding can benefit a wide variety of node related graph analysis tasks. By representing each node as a vector, node related tasks such as node clustering and node classification can be performed efficiently in terms of both time and space. However, graph analytics tasks are not always at the node level. In some scenarios, the tasks may be related to a higher granularity of a graph, such as node pairs, a subgraph, or even a whole graph. Hence, the first challenge in terms of embedding output is how to find a suitable type of embedding output which meets the needs of the specific application task.

3.2.1 Node Embedding
As the most common embedding output setting, node embedding represents each node as a vector in a low dimensional space. Nodes that are "close" in the graph are embedded to have similar vector representations. The differences between various graph embedding methods lie in how they define the "closeness" between two nodes.
First-order proximity (Def. 5) and second-order proximity (Def. 6) are two commonly adopted metrics for pairwise node similarity calculation. In some work, higher-order proximity is also explored to a certain extent. For example, [21] captures the k-step (k = 1, 2, 3, ...) neighbour relations in their embedding. Both [1] and [89] consider two nodes belonging to the same community as embedded closer.
Challenge: How to define the pairwise node proximity in various types of input graph and how to encode the proximity in the learnt embeddings? The challenges of node embedding mainly come from defining the node proximity in the input graph. In Sec. 3.1, we have elaborated the challenges of node embedding with different types of input graphs. Next, we will introduce other types of embedding output as well as the new challenges posed by these outputs.

3.2.2 Edge Embedding
In contrast to node embedding, edge embedding aims to represent an edge as a low-dimensional vector. Edge embedding is useful in the following two scenarios.
Firstly, knowledge graph embedding (e.g., [90], [91], [92]) learns embedding for both nodes and edges. Each edge is a triplet < h, r, t > (Def. 4). The embedding is learnt to preserve r between h and t in the embedded space, so that a missing entity/relation can be correctly predicted given the other two components in < h, r, t >. Secondly, some work (e.g., [28], [64]) embeds a node pair as a vector feature to either make the node pair comparable to other nodes or predict the existence of a link between two nodes. For instance, [64] proposes a content-social influential feature to predict the user-user interaction probability given a content. It embeds both the user pairs and the content in the same space. [28] embeds a pair of nodes using a bootstrapping approach over the node embedding, to facilitate the prediction of whether a link exists between two nodes in a graph.
In summary, edge embedding benefits edge (/node pair) related graph analysis, such as link prediction, knowledge graph entity/relation prediction, etc.
Challenge: How to define the edge-level similarity and how to model the asymmetric property of the edges, if any? The edge proximity is different from node proximity as an edge contains a pair of nodes and usually denotes the pairwise node relation. Moreover, unlike nodes, edges may be directed. This asymmetric property should be encoded in the learnt edge representations.

3.2.3 Hybrid Embedding
Hybrid embedding is the embedding of a combination of different types of graph components, e.g., node + edge (i.e., substructure), node + community.
Substructure embedding has been studied in a quantity of work. For example, [44] embeds the graph structure between two possibly distant nodes to support semantic proximity search. [93] learns the embedding for subgraphs (e.g., graphlets) so as to define the graph kernels for graph classification. [94] utilizes a knowledge base to enrich the information about the answer. It embeds both the path and the subgraph from the question entity to the answer entity.
Compared to subgraph embedding, community embedding has only attracted limited attention. [1] proposes to consider a community-aware proximity for node embedding, such that a node's embedding is similar to its community's embedding. ComE [89] also jointly solves node embedding, community detection and community embedding together. Rather than representing a community as a vector, it defines each community embedding as a multivariate Gaussian distribution so as to characterize how its member nodes are distributed.
The embedding of a substructure or community can also be derived by aggregating the individual node and edge embeddings inside it. However, such a kind of "indirect" approach is not optimized to represent the structure. Moreover, node embedding and community embedding can reinforce each other. Better node embedding is learnt by incorporating the community-aware high-order proximity, while better communities are detected when more accurate node embedding is generated.
Challenge: How to generate the target substructure and how to embed different types of graph components in one common space? In contrast to other types of embedding output, the target to embed in hybrid embedding (e.g., subgraph, community) is not given. Hence the first challenge is how to generate such a kind of embedding target structure. Furthermore, different types of targets (e.g., community, node) may be embedded in one common space simultaneously. How to address the heterogeneity of the embedding target types is a problem.

3.2.4 Whole-Graph Embedding
The last type of output is the embedding of a whole graph, usually for small graphs such as proteins, molecules, etc. In this case, a graph is represented as one vector and two similar graphs are embedded to be closer.
Whole-graph embedding benefits the graph classification task by providing a straightforward and efficient solution for calculating graph similarities [49], [55], [95]. To establish a compromise between the embedding time (efficiency) and the ability to preserve information (expressiveness), [95] designs a hierarchical graph embedding framework. It argues that an accurate understanding of the global graph information requires processing substructures at different scales. A graph pyramid is formed where each level is a summarized graph at a different scale. The graph is embedded at all levels and then concatenated into one vector. [63] learns the embedding for a whole cascade graph, and then trains a multi-layer perceptron to predict the increment of the size of the cascade graph in the future.
Challenge: How to capture the properties of a whole graph and how to make a trade-off between expressiveness and efficiency? Embedding a whole graph requires capturing the property of the whole graph and is thus more time consuming compared to other types of embedding. The key challenge of whole-graph embedding is how to make a choice between the expressive power of the learnt embedding and the efficiency of the embedding algorithm.

4 GRAPH EMBEDDING TECHNIQUES
In this section, we categorize graph embedding methods based on the techniques used.
Generally, graph embedding aims to represent a graph in a low dimensional space which preserves as much graph property information as possible. The differences between different graph embedding algorithms lie in how they define the graph property to be preserved. Different algorithms have different insights of the node (/edge/substructure/whole-graph) similarities and how to preserve them in the embedded space. Next, we will introduce the insight behind each graph embedding technique, as well as how they quantify the graph property and solve the graph embedding problem.

4.1 Matrix Factorization
Matrix factorization based graph embedding represents the graph property (e.g., node pairwise similarity) in the form of a matrix and factorizes this matrix to obtain the node embedding [11]. The pioneering studies in graph embedding usually solve graph embedding in this way. In most cases, the input is a graph constructed from non-relational high dimensional data features as introduced in Sec. 3.1.4, and the output is a set of node embeddings (Sec. 3.2.1). The problem of graph embedding can thus be treated as a structure-preserving dimensionality reduction problem which assumes the input data lie in a low dimensional manifold. There are two types of matrix factorization based graph embedding. One is to factorize the graph Laplacian eigenmaps, and the other is to directly factorize the node proximity matrix.

4.1.1 Graph Laplacian Eigenmaps
Insight: The graph property to be preserved can be interpreted as pairwise node similarities. Thus, a larger penalty is imposed if two nodes with larger similarity are embedded far apart.
Based on the above insight, the optimal embedding y can be derived by the below objective function [99]:

y* = arg min Σ_{i≠j} ||y_i − y_j||^2 W_ij = arg min y^T L y,   (1)

where W_ij is the "defined" similarity between nodes v_i and v_j; L = D − W is the graph Laplacian. D is the diagonal matrix where D_ii = Σ_{j≠i} W_ij. The bigger the value of D_ii, the more important y_i is [97]. A constraint y^T D y = 1 is usually imposed on Eq. 1 to remove an arbitrary scaling factor in the embedding. Eq. 1 then reduces to:

y* = arg min_{y^T D y = 1} y^T L y = arg min (y^T L y)/(y^T D y) = arg max (y^T W y)/(y^T D y).   (2)

The optimal y's are the eigenvectors corresponding to the maximum eigenvalue of the eigenproblem W y = λ D y.
The above graph embedding is transductive because it can only embed the nodes that exist in the training set. In practice, it might also be necessary to embed new coming nodes that have not been seen in training. One solution is to design a linear function y = X^T a so that the embedding can be derived as long as the node feature is provided. Consequently, for inductive graph embedding, Eq. 1 becomes finding the optimal a in the below objective function:

a* = arg min Σ_{i≠j} ||a^T X_i − a^T X_j||^2 W_ij = arg min a^T X L X^T a.   (3)

Similar to Eq. 2, by adding the constraint a^T X D X^T a = 1, the problem in Eq. 3 becomes:

a* = arg min (a^T X L X^T a)/(a^T X D X^T a) = arg max (a^T X W X^T a)/(a^T X D X^T a).   (4)

The optimal a's are the eigenvectors with the maximum eigenvalues in solving X W X^T a = λ X D X^T a.
The differences of existing studies mainly lie in how they calculate the pairwise node similarity W_ij, and whether they use a linear function y = X^T a or not. Some attempts [81], [85] have been made to summarize existing Laplacian eigenmaps based graph embedding methods using a general framework, but their surveys only cover a limited quantity of work. In Table 4, we summarize existing Laplacian eigenmaps based graph embedding studies and compare how they calculate W and what objective function they adopt.
The initial study MDS [74] directly adopted the Euclidean distance between two feature vectors X_i and X_j as W_ij. Eq. 2 is used to find the optimal embedding y's. MDS does not consider the neighbourhood of nodes, i.e., any pair of training instances are considered as connected. The follow-up studies (e.g., [78], [96], [97], [102]) overcome this problem by first constructing a k nearest neighbour (KNN) graph from the data features. Each node is only connected with its top k similar neighbours. After that, different methods are utilized to calculate the similarity matrix W so as to preserve as much of the desired graph property as possible. Some more advanced models have been designed recently. For example, AgLPP [79] introduces an anchor graph to significantly improve the efficiency of the earlier matrix factorization model LPP. LGRM [98] learns a local regression model to grasp the graph structure and a global regression term for out-of-sample data extrapolation. Finally, different from previous works preserving local geometry, LSE [103] uses local spline regression to preserve global geometry.
When auxiliary information (e.g., label, attribute) is available, the objective function is adjusted to preserve the richer information. E.g., [99] constructs an adjacency graph W and a labelled graph W^SR. The objective function consists of two parts: one focuses on preserving the local geometric structure of the dataset as in LPP [97], and the other tries to get the embedding with the best class separability on the labelled training data. Similarly, [88] also constructs two graphs, an adjacency graph W which encodes local geometric structures and a feedback relational graph W^ARE which encodes the pairwise relations in users' relevance feedback. RF-Semi-NMF-PCA [101] simultaneously considers clustering, dimensionality reduction and graph embedding by constructing an objective function that consists of three components: PCA, k-means and graph Laplacian regularization.
Some other work argues that W cannot be constructed by easily enumerating pairwise node relationships. Instead, they adopt semidefinite programming (SDP) to learn W. Specifically, SDP [104] aims to find an inner product matrix that maximizes the pairwise distances between any two inputs which are not connected in the graph while preserving the nearest neighbour distances. MVU [100] constructs such a matrix and then applies MDS [74] on the learned inner product matrix. [2] proves that regularized LPP [97] is equivalent to regularized SR [99] if W is symmetric, doubly stochastic, PSD and with rank p. It constructs such a kind of similarity matrix so as to solve LPP-like problems efficiently.
TABLE 4
Graph Laplacian eigenmaps based graph embedding.
GE Algorithm | W | Objective Function
MDS [74] | W_ij = Euclidean distance(X_i, X_j) | Eq. 2
Isomap [78] | KNN, W_ij is the sum of edge weights along the shortest path between v_i and v_j | Eq. 2
LE [96] | KNN, W_ij = exp(-||X_i - X_j||^2 / (2t^2)) | Eq. 2
LPP [97] | KNN, W_ij = exp(-||X_i - X_j||^2 / t) | Eq. 4
AgLPP [79] | anchor graph, W = Z Λ^{-1} Z^T, Λ_kk = Σ_i Z_ik, Z_ik = K_σ(X_i, U_k) / Σ_j K_σ(X_i, U_j) | a* = arg min (a^T U L U^T a)/(a^T U D U^T a)
LGRM [98] | KNN, W_ij = exp(-||X_i - X_j||^2 / (2t^2)) | y* = arg min (y^T (L_le + µ L_g) y)/(y^T y)
ARE [88] | KNN, W_ij = exp(-ρ^2(X_i, X_j)/t); W^ARE_ij = -γ if X_i ∈ F^+ and X_j ∈ F^+, 1 if L(X_i) ≠ L(X_j), 0 otherwise; F^+ denotes the images relevant to a query, γ controls the unbalanced feedback | a* = arg max (a^T X L^ARE X^T a)/(a^T X L X^T a)
SR [99] | KNN, W_ij = 1 if L(X_i) = L(X_j), 0 if L(X_i) ≠ L(X_j), W_ij otherwise; W^SR_ij = 1/l_r if L(X_i) = L(X_j) = C_r, 0 otherwise; C_r is the r-th class, l_r = |{X_i ∈ X : L(X_i) = C_r}| | a* = arg max (a^T X W^SR X^T a)/(a^T X (D^SR + L) X^T a)
HSL [87] | S = I - L, where L is the normalized hypergraph Laplacian | a* = arg max tr(a^T X S X^T a), s.t. a^T X X^T a = I_k
MVU [100] | KNN, W* = arg max tr(W), s.t. W ⪰ 0, Σ_ij W_ij = 0 and ∀i, j: W_ii - 2W_ij + W_jj = ||X_i - X_j||^2 | Eq. 2
SLE [86] | KNN, W_ij = 1/l_r if L(X_i) = L(X_j) = C_r, -1 if L(X_i) ≠ L(X_j); C_r is the r-th class, l_r = |{X_i ∈ X : L(X_i) = C_r}| | Eq. 4
NSHLRR [76] | normal graph: KNN, W_ij = 1; hypergraph: W(e) is the weight of a hyperedge e, h(v, e) = 1 if v ∈ e and 0 otherwise, d(e) = Σ_{v∈V} h(v, e) | normal graph: Eq. 2; hypergraph: y* = arg min Σ_e ||y_i - y_j||^2 W(e)/d(e)
[77] | W_ij = (||X_i - X_{k+1}||^2 - ||X_i - X_j||^2) / (k||X_i - X_{k+1}||^2 - Σ_{m=1}^{k} ||X_i - X_m||^2) if j ≤ k, 0 if j > k | y* = arg min_{y^T y = 1} Σ_{i≠j} W_ij min(||y_i - y_j||^p, θ)
PUFS [75] | KNN, W_ij = exp(-||X_i - X_j||^2 / (2t^2)) | Eq. 4 + (must-link and cannot-link constraints)
RF-Semi-NMF-PCA [101] | KNN, W_ij = 1 | Eq. 2 + O(PCA) + O(k-means)
4.1.2 Node Proximity Matrix Factorization
In addition to solving the above generalized eigenvalue problem, another line of studies tries to directly factorize the node proximity matrix.
Insight: Node proximity can be approximated in a low-dimensional space using matrix factorization. The objective of preserving node proximity is to minimize the loss of approximation.
Given the node proximity matrix W, the objective is:

min ||W − Y (Y^c)^T||,   (5)

where Y ∈ R^{|V|×d} is the node embedding, and Y^c ∈ R^{|V|×d} is the embedding for the context nodes [21].
Eq. 5 aims to find an optimal rank-d approximation of the proximity matrix W (d is the dimensionality of the embedding). One popular solution is to apply SVD (Singular Value Decomposition) on W [110]. Formally,

W = Σ_{i=1}^{|V|} σ_i u_i (u_i^c)^T ≈ Σ_{i=1}^{d} σ_i u_i (u_i^c)^T,   (6)

where {σ_1, σ_2, ..., σ_{|V|}} are the singular values sorted in descending order, and u_i and u_i^c are the singular vectors of σ_i. The optimal embedding is obtained using the largest d singular values and the corresponding singular vectors as follows:

Y = [√σ_1 u_1, ..., √σ_d u_d],
Y^c = [√σ_1 u_1^c, ..., √σ_d u_d^c].   (7)

Depending on whether the asymmetric property is preserved or not, the embedding of node i is either y_i = Y_i [21], [50], or the concatenation of Y_i and Y_i^c, i.e., y_i = [Y_i, Y_i^c] [106]. There exist other solutions for Eq. 5, such as regularized Gaussian matrix factorization [24], low-rank matrix factorization [56], and adding other regularizers to enforce more constraints [48]. We summarize all the node proximity matrix factorization based graph embedding methods in Table 5.
Summary: Matrix Factorization (MF) is mostly used to embed a graph constructed from non-relational data (Sec. 3.1.4) for node embedding (Sec. 3.2.1), which is the typical setting of graph Laplacian eigenmap problems. MF is also used to embed homogeneous graphs [24], [50] (Sec. 3.1.1).
TABLE 5
Node proximity matrix factorization based graph embedding. O(∗) denotes the objective function; e.g., O(SVM classifier) denotes the objective function of an SVM classifier.
GE Algorithm | W | Objective Function
[50] | W_ij = 1 if e_ij ∈ E, 0 otherwise | Eq. 5
SPE [105] | KNN, W* = arg max_{W ⪰ 0} tr(W Â), s.t. D_ij > (1 - Â_ij) max_m(Â_im D_im); Â is a connectivity matrix that describes local pairwise distances | Eq. 5
HOPE [106] | Katz Index W = (I - βA)^{-1} βA; Personalized Pagerank W = (1 - α)(I - αP)^{-1}; Common neighbors W = A^2; Adamic-Adar W = A (1/Σ_j (A_ij + A_ji)) A | Eq. 5
GraRep [21] | W^k_ij = log(Â^k_ij / Σ_t Â^k_tj) - log(λ/|V|), where Â = D^{-1} S and D_ij = Σ_p A_ip if i = j, 0 if i ≠ j | Eq. 5
CMF [43] | PPMI | Eq. 5
TADW [56] | PMI | Eq. 5 with text feature matrix
[24] | A | y* = arg min Σ_{e_ij ∈ E} (A_ij - <y_i, y_j>)^2 + λ Σ_i ||y_i||^2
MMDW [48] | PMI | Eq. 5 + O(SVM classifier)
HSCA [57] | PMI | O(MMDW) + (1st order proximity constraint)
MVE [107] | KNN, W* = arg min tr(W (Σ_{i=1}^{d} υ_i υ_i^T + Σ_{i=d+1}^{N} υ_i υ_i^T)), where υ_i is an eigenvector of a pairwise distance matrix and d is the embedding dimensionality | Eq. 5
M-NMF [1] | W = S^(1) + 5 S^(2) | Eq. 5 + O(community detection) + (community proximity constraint)
ULGE [2] | W = Z Δ^{-1} Z^T, where Z* = arg min_{z_i^T 1 = 1, z_i ≥ 0} Σ_{j=1}^{m} ||X_i - u_j||_2^2 z_ij + γ Σ_{j=1}^{m} z_ij^2 | a* = arg min ||a^T X - F_p||_F^2 + α ||a||_F^2, where F_p is the top p eigenvectors of W
LLE [102] | KNN, W* = arg min Σ_i ||X_i - Σ_j W_ij X_j||^2 | y* = arg min Σ_i ||y_i - Σ_j W_ij y_j||^2
RESCAL [108] | W_ijk = 1 if (h_i, r_j, t_k) exists, 0 otherwise | min Σ_k ||W_k - Y R_k Y^T||_F^2 + λ(||Y||_F^2 + Σ_k ||R_k||_F^2)
FONPE [109] | KNN, W* = arg min Σ_i ||X_i - Σ_j W_ij X_j||^2 | min ||F - F W^T||_F^2 + β ||P^T X - F||_F^2, s.t. P^T P = I
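As an illustration of the factorization objective in Eq. 5 and its SVD solution in Eqs. 6–7, the sketch below builds a simple adjacency-based proximity matrix (as in [50]) and obtains node and context embeddings from a truncated SVD; it is a minimal example rather than a re-implementation of any specific method in Table 5.

```python
import numpy as np

def svd_proximity_embedding(W, d):
    # Truncated SVD of the node proximity matrix W (Eq. 6):
    # W ~= sum_{i <= d} sigma_i * u_i * (u_i^c)^T.
    U, sigma, Vt = np.linalg.svd(W)
    # Node and context embeddings (Eq. 7): singular vectors scaled
    # by the square roots of the top-d singular values.
    Y = U[:, :d] * np.sqrt(sigma[:d])
    Yc = Vt[:d, :].T * np.sqrt(sigma[:d])
    return Y, Yc

# Toy proximity matrix: the adjacency matrix of a 4-node path graph.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Y, Yc = svd_proximity_embedding(A, d=2)
# y_i = Y[i] keeps a symmetric proximity; [Y[i], Yc[i]] preserves asymmetry.
```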
4.2 Deep Learning
Deep learning (DL) has shown outstanding performance in a wide variety of research fields, such as computer vision, language modeling, etc. DL based graph embedding applies DL models on graphs. These models are either a direct adoption from other fields or a new neural network model specifically designed for embedding graph data. The input is either paths sampled from a graph or the whole graph itself. Consequently, we divide DL based graph embedding into two categories based on whether random walk is adopted to sample paths from a graph.

4.2.1 DL based Graph Embedding with Random Walk
Insight: The second-order proximity in a graph can be preserved in the embedded space by maximizing the probability of observing the neighbourhood of a node conditioned on its embedding.
In the first category of deep learning based graph embedding, a graph is represented as a set of random walk paths sampled from it. The deep learning methods are then applied to the sampled paths for graph embedding, which preserves the graph properties carried by the paths.
In view of the above insight, DeepWalk [17] adopts a neural language model (SkipGram) for graph embedding. SkipGram [111] aims to maximize the co-occurrence probability among the words that appear within a window w. DeepWalk first samples a set of paths from the input graph using truncated random walk (i.e., uniformly sample a neighbour of the last visited node until the maximum length is reached). Each path sampled from the graph corresponds to a sentence from the corpus, where a node corresponds to a word. Then SkipGram is applied on the paths to maximize the probability of observing a node's neighbourhood conditioned on its embedding. In this way, nodes with similar neighbourhoods (having large second-order proximity values) share similar embeddings. The objective function of DeepWalk is as follows:

min_y − log P({v_{i−w}, ..., v_{i−1}, v_{i+1}, ..., v_{i+w}} | y_i),   (8)

where w is the window size which restricts the size of the random walk context. SkipGram removes the ordering constraint, and Eq. 8 is transformed to:

min_y − Σ_{−w ≤ j ≤ w} log P(v_{i+j} | y_i),   (9)

where P(v_{i+j} | y_i) is defined using the softmax function:

P(v_{i+j} | y_i) = exp(y_{i+j}^T y_i) / Σ_{k=1}^{|V|} exp(y_k^T y_i).   (10)

Note that calculating Eq. 10 is not feasible as the normalization factor (i.e., the summation over the inner products with every node in a graph) is expensive. There are usually two solutions to approximate the full softmax: hierarchical softmax [112] and negative sampling [112].
Hierarchical softmax: To efficiently solve Eq. 10, a binary tree is constructed in which the nodes are assigned to the leaves. Instead of enumerating all nodes as in Eq. 10, only the path from the root to the corresponding leaf needs to be evaluated. The optimization problem becomes maximizing the probability of a specific path in the tree. Suppose the path to leaf v_i is a sequence of nodes (b_0, b_1, ..., b_{log(|V|)}), where b_0 = root and b_{log(|V|)} = v_i. Eq. 10 then becomes:

P(v_{i+j} | y_i) = Π_{t=1}^{log(|V|)} P(b_t | y_i),   (11)

where P(b_t) is a binary classifier: P(b_t | v_i) = σ(y_{b_t}^T y_i). σ(·) denotes the sigmoid function. y_{b_t} is the embedding of tree node b_t's parent. The hierarchical softmax reduces the time complexity of SkipGram from O(|V|^2) to O(|V| log(|V|)).
Negative sampling: The key idea of negative sampling is to distinguish the target node from noise using logistic regression. I.e., for a node v_i, we want to distinguish its neighbour v_{i+j} from other nodes. A noise distribution P_n(v_i) is designed to draw the negative samples for node v_i. Each log P(v_{i+j} | y_i) in Eq. 9 is then calculated as:

log σ(y_{i+j}^T y_i) + Σ_{t=1}^{K} E_{v_t ∼ P_n}[log σ(−y_{v_t}^T y_i)],   (12)

where K is the number of negative nodes that are sampled. P_n(v_i) is a noise distribution, e.g., a uniform distribution (P_n(v_i) ∼ 1/|V|, ∀ v_i ∈ V). The time complexity of SkipGram with negative sampling is O(K|V|).
The success of DeepWalk [17] motivates many subsequent studies which apply deep learning models (e.g., SkipGram or Long Short-Term Memory (LSTM) [115]) on the sampled paths for graph embedding. We summarize them in Table 6. As shown in the table, most studies follow the idea of DeepWalk but change the settings of either the random walk sampling method ([25], [28], [62]) or the proximity (Def. 5 and Def. 6) to be preserved ([34], [47], [62], [66], [73]). [46] designs meta-path-based random walks to deal with heterogeneous graphs and a heterogeneous SkipGram which maximizes the probability of having the heterogeneous context for a given node. Apart from SkipGram, LSTM is another popular deep learning model adopted in graph embedding.
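The following sketch illustrates the random-walk-plus-SkipGram pipeline described above (truncated random walks treated as sentences, trained with negative sampling). It relies on the gensim Word2Vec implementation as a stand-in for SkipGram; the graph, walk length and other parameters are illustrative choices, not the settings of DeepWalk [17] itself.

```python
import random
import networkx as nx
from gensim.models import Word2Vec  # SkipGram with negative sampling

def truncated_random_walks(G, walks_per_node=10, walk_length=40):
    """Uniformly sample a neighbour of the last visited node until the
    maximum length is reached (truncated random walk)."""
    walks = []
    for _ in range(walks_per_node):
        for start in G.nodes():
            walk = [start]
            while len(walk) < walk_length:
                neighbours = list(G.neighbors(walk[-1]))
                if not neighbours:
                    break
                walk.append(random.choice(neighbours))
            walks.append([str(v) for v in walk])  # nodes act as "words"
    return walks

# Toy graph and embedding; parameters are illustrative.
G = nx.karate_club_graph()
walks = truncated_random_walks(G)
model = Word2Vec(walks, vector_size=64, window=5, sg=1,   # sg=1: SkipGram
                 negative=5, min_count=0, workers=1)      # Eq. 12 style
embedding = {v: model.wv[str(v)] for v in G.nodes()}
```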
TABLE 6
Deep Learning based graph embedding with random walk paths.
GE Algorithm | Random Walk Method | Preserved Proximity | DL Model
DeepWalk [17] | truncated random walk | 2nd | SkipGram with hierarchical softmax (Eq. 11)
[34] | truncated random walk | 2nd (word-image) | SkipGram with hierarchical softmax (Eq. 11)
GenVector [66] | truncated random walk | 2nd (user-user & concept-concept) | SkipGram with hierarchical softmax (Eq. 11)
Constrained DeepWalk [25] | sampling with edge weight | 2nd | SkipGram with hierarchical softmax (Eq. 11)
DDRW [47] | truncated random walk | 2nd + class identity | SkipGram with hierarchical softmax (Eq. 11)
TriDNR [73] | truncated random walk | 2nd (among node, word & label) | SkipGram with hierarchical softmax (Eq. 11)
node2vec [28] | BFS + DFS | 2nd | SkipGram with negative sampling (Eq. 12)
UPP-SNE [113] | truncated random walk | 2nd (user-user & profile-profile) | SkipGram with negative sampling (Eq. 12)
Planetoid [62] | sampling node pairs by labels and structure | 2nd + label identity | SkipGram with negative sampling (Eq. 12)
NBNE [19] | sampling direct neighbours of a node | 2nd | SkipGram with negative sampling (Eq. 12)
DGK [93] | graphlet kernel: random sampling [114] | 2nd (by graphlet) | SkipGram (Eqs. 11–12)
metapath2vec [46] | meta-path based random walk | 2nd | heterogeneous SkipGram
ProxEmbed [44] | truncated random walk | node ranking tuples | LSTM
HSNL [29] | truncated random walk | 2nd + QA ranking tuples | LSTM
RMNL [30] | truncated random walk | 2nd + user-question quality ranking | LSTM
DeepCas [63] | Markov chain based random walk | information cascade sequence | GRU
MRW-MN [36] | truncated random walk | 2nd + cross-modal feature difference | DCNN + SkipGram

TABLE 7
Deep learning based graph embedding without random walk paths.
GE Algorithm | Deep Learning Model | Model Input
SDNE [20] | autoencoder | A
DNGR [23] | stacked denoising autoencoder | PPMI
SAE [22] | sparse autoencoder | D^{-1}A
[55] | CNN | node sequence
SCNN [118] | Spectral CNN | graph
[119] | Spectral CNN with smooth spectral multipliers | graph
MoNet [80] | Mixture model network | graph
ChebNet [82] | Graph CNN a.k.a. ChebNet | graph
GCN [72] | Graph Convolutional Network | graph
GNN [120] | Graph Neural Network | graph
[121] | adapted Graph Neural Network | molecules graph
GGS-NNs [122] | adapted Graph Neural Network | graph
HNE [33] | CNN + FC | graph with image and text
DUIF [35] | a hierarchical deep model | social curation network
ProjE [40] | a neural network model | knowledge graph
TIGraNet [123] | Graph Convolutional Network | graph constructed from images

Note that SkipGram can only embed one single node. However, sometimes we may need to embed a sequence of nodes as a fixed-length vector, e.g., represent a sentence (i.e., a sequence of words) as one vector. LSTM is then adopted in such scenarios to embed a node sequence. For example, [29] and [30] embed the sentences from questions/answers in cQA sites, and [44] embeds a sequence of nodes between two nodes for proximity embedding. A ranking loss function is optimized in these works to preserve the ranking scores in the training data. In [63], GRU [116] (i.e., a recurrent neural network model similar to LSTM) is used to embed information cascade paths.

4.2.2 DL based Graph Embedding without Random Walk
Insight: The multi-layered learning architecture is a robust and effective solution to encode the graph into a low dimensional space.
The second class of deep learning based graph embedding methods applies deep models on a whole graph (or a proximity matrix of a whole graph) directly. Below are some popular deep learning models used in graph embedding.
Autoencoder: An autoencoder aims to minimize the reconstruction error between the output and the input by its encoder and decoder. Both the encoder and decoder contain multiple nonlinear functions. The encoder maps the input data to a representation space and the decoder maps the representation space to a reconstruction space. The idea of adopting an autoencoder for graph embedding is similar to node proximity matrix factorization (Sec. 4.1.2) in terms of neighbourhood preservation. Specifically, the adjacency matrix captures a node's neighbourhood. If we input the adjacency matrix to an autoencoder, the reconstruction process will make the nodes with similar neighbourhoods have similar embeddings.
Deep Neural Network: As a popular deep learning model, the Convolutional Neural Network (CNN) and its variants have been widely adopted in graph embedding. On the one hand, some of them directly use the original CNN model designed for Euclidean domains and reformat input graphs to fit it. E.g., [55] uses graph labelling to select a fixed-length node sequence from a graph and then assembles the nodes' neighbourhoods to learn a neighbourhood representation with the CNN model. On the other hand, some other work attempts to generalize the deep neural model to non-Euclidean domains (e.g., graphs). [117] summarizes the representative studies in their survey. Generally, the differences between these approaches lie in the way they formulate a convolution-like operation on graphs. One way is to emulate the Convolution Theorem to define the convolution in the spectral domain [118], [119]. Another is to treat the convolution as neighborhood matching in the spatial domain [72], [82], [120].
Others: There are some other types of deep learning based graph embedding methods. E.g., [35] proposes DUIF, which uses a hierarchical softmax as a forward propagation to maximize the modularity. HNE [33] utilizes deep learning techniques to capture the interactions between heterogeneous components, e.g., CNN for images and FC layers for text. ProjE [40] designs a neural network with a combination layer and a projection layer. It defines a pointwise loss (similar to multi-class classification) and a listwise loss (i.e., softmax regression loss) for knowledge graph embedding. We summarize all deep learning based graph embedding methods (random walk free) in Table 7, and compare the models they use as well as the input for each model.
Summary: Due to its robustness and effectiveness, deep learning has been widely used in graph embedding. Three types of input graphs (all except the graph constructed from non-relational data (Sec. 3.1.4)) and all four types of embedding output have been observed in deep learning based graph embedding methods.

4.3 Edge Reconstruction based Optimization
Overall Insight: The edges established based on node embedding should be as similar to those in the input graph as possible.
The third category of graph embedding techniques directly optimizes an edge reconstruction based objective function, by either maximizing edge reconstruction probability or minimizing edge reconstruction loss. The latter is further divided into minimizing distance-based loss and minimizing margin-based ranking loss.
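As a hedged illustration of the margin-based ranking loss mentioned above, the sketch below scores knowledge graph triplets < h, r, t > with a TransE-style distance (TransE is cited as [91] elsewhere in this survey) and pushes observed triplets to score better than corrupted ones by a margin; the margin value, embedding dimensionality and corruption scheme are illustrative choices, not the setup of any specific cited paper.

```python
import numpy as np

def transe_score(h, r, t):
    # TransE-style energy: an observed triplet should have small ||h + r - t||.
    return np.linalg.norm(h + r - t)

def margin_ranking_loss(pos_triplet, neg_triplet, margin=1.0):
    """Margin-based ranking loss for one (observed, corrupted) pair:
    max(0, margin + score(positive) - score(negative))."""
    return max(0.0, margin + transe_score(*pos_triplet)
                            - transe_score(*neg_triplet))

# Toy entity/relation embeddings; dimensionality 8 is illustrative.
rng = np.random.default_rng(0)
h, r, t = rng.normal(size=(3, 8))
t_corrupt = rng.normal(size=8)          # corrupted tail entity
loss = margin_ranking_loss((h, r, t), (h, r, t_corrupt))
# Training minimizes this loss over all observed edges and their corruptions,
# which is what "minimizing margin-based ranking loss" refers to above.
```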
G is represented as a d-dimensional vector y_G where the i-th dimension is the frequency of the i-th triplet occurring in G.
Summary: A graph kernel is designed for whole-graph embedding (Sec. 3.2.4) only, as it captures the global property of a whole graph. The type of input graph is usually a homogeneous graph (Sec. 3.1.1) [93] or a graph with auxiliary information (Sec. 3.1.3) [49].

4.5 Generative Model
A generative model can be defined by specifying the joint distribution of the input features and the class labels, conditioned on a set of parameters [139]. An example is Latent Dirichlet Allocation (LDA), in which a document is interpreted as a distribution over topics, and a topic is a distribution over words [140]. There are the following two ways to adopt a generative model for graph embedding.

4.5.1 Embed Graph Into The Latent Semantic Space
Insight: Nodes are embedded into a latent semantic space where the distances among nodes explain the observed graph structure.
The first type of generative model based graph embedding methods directly embeds a graph in the latent space. Each node is represented as a vector of the latent variables. In other words, it views the observed graph as generated by a model. E.g., in LDA, documents are embedded in a "topic" space where documents with similar words have similar topic vector representations. [70] designs an LDA-like model to embed a location-based social network (LBSN) graph. Specifically, the input is locations (documents), each of which contains a set of users (words) who visited that location. Users visit the same location (words appearing in the same document) due to some activities (topics). Then a model is designed to represent a location as a distribution over activities, where each activity has an attractiveness distribution over users. Consequently, both user and location are represented as vectors in the "activity" space.

4.5.2 Incorporate Latent Semantics for Graph Embedding
Insight: Nodes which are close in the graph and have similar semantics should be embedded closer. The node semantics can be detected from node descriptions via a generative model.
In this line of methods, latent semantics are used to leverage auxiliary node information for graph embedding. The embedding is decided not only by the graph structure information but also by the latent semantics discovered from other sources of node information. For example, [58] proposes a unified framework which jointly integrates topic modelling and graph embedding. Its principle is that if two nodes are close in the embedded space, they will also share a similar topic distribution. A mapping function from the embedded space to the topic semantic space is designed so as to correlate the two spaces. [141] proposes a generative model (a Bayesian non-parametric infinite mixture embedding model) to address the issue of multiple relation semantics in knowledge graph embedding. It discovers the latent semantics of a relation and leverages a mixture of relation components for embedding. [59] embeds a knowledge graph from both the knowledge graph triplets and the textual descriptions of entities and relations. It learns the semantic representation of the text using topic modelling and restricts the triplet embedding to the semantic subspace.
The difference between the above two directions of methods is that the embedded space is the latent space in the first way. In contrast, in the second way, the latent space is used to integrate information from different sources and helps to embed a graph into another space.
Summary: Generative models can be used for both node embedding (Sec. 3.2.1) [70] and edge embedding (Sec. 3.2.2) [141]. As they consider node semantics, the input graph is usually a heterogeneous graph (Sec. 3.1.2) [70] or a graph with auxiliary information (Sec. 3.1.3) [59].

4.6 Hybrid Techniques and Others
Sometimes multiple techniques are combined in one study. For example, [4] learns edge-based embedding via minimizing the margin-based ranking loss (Sec. 4.3), and learns attribute-based embedding by matrix factorization (Sec. 4.1). [51] optimizes a margin-based ranking loss (Sec. 4.3) with a matrix factorization based loss (Sec. 4.1) as regularization terms. [32] uses LSTM (Sec. 4.2) to learn sentence embeddings in cQAs and a margin-based ranking loss (Sec. 4.3) to incorporate friendship relations. [142] adopts CBOW/SkipGram (Sec. 4.2) for knowledge graph entity embedding, and then fine-tunes the embedding by minimizing a margin-based ranking loss (Sec. 4.3). [61] uses word2vec (Sec. 4.2) to embed the textual context and TransH (Sec. 4.3) to embed the entities/relations so that the rich context information is utilized in knowledge graph embedding. [143] leverages the heterogeneous information in a knowledge base to improve recommendation performance. It uses TransR (Sec. 4.3) for network embedding, and uses autoencoders for textual and visual embedding (Sec. 4.2). Finally, a generative framework (Sec. 4.5) is proposed to integrate collaborative filtering with items' semantic representations.
Apart from the introduced five categories of techniques, there exist other approaches. [95] presents the embedding of a graph by its distances to prototype graphs. [16] first embeds a few landmark nodes using their pairwise shortest path distances. Then the other nodes are embedded so that their distances to a subset of landmarks are as close as possible to the real shortest paths. [4] jointly optimizes a link-based loss (maximizing the likelihood of observing a node's neighbours instead of its non-neighbours) and an attribute-based loss (learning a linear projection based on the link-based representation). KR-EAR [144] distinguishes the relations in a knowledge graph as attribute-based and relation-based. It constructs a relational triple encoder (TransE, TransR) to embed the correlations between entities and relations, and an attributional triple encoder to embed the correlations between entities and attributes. Struct2vec [145] considers the structural identity of nodes via a hierarchical metric for node embedding. [146] provides a fast embedding approach by approximating the higher-order proximity matrices.

4.7 Summary
We now summarize and compare all five categories of the introduced graph embedding techniques in Table 10 in terms of their advantages and disadvantages.
TABLE 10
Comparison of graph embedding techniques.
Category | Subcategory | Advantages | Disadvantages
matrix factorization | graph Laplacian eigenmaps; node proximity matrix factorization | consider global node proximity | large time and space consumption
deep learning | with random walk | effective and robust, no feature engineering | a) only consider local context within a path; b) hard to find optimal sampling strategy
deep learning | without random walk | effective and robust, no feature engineering | high computation cost
edge reconstruction | maximize edge reconstruction probability; minimize distance-based loss; minimize margin-based ranking loss | relatively efficient training | optimization using only observed local information, i.e., edges (1-hop neighbour) or ranked node pairs
graph kernel | based on graphlet; based on subtree patterns; based on random walks | efficient, only counting the desired atomic substructures | a) substructures are not independent; b) embedding dimensionality grows exponentially
generative model | embed graph in the latent space; incorporate latent semantics for graph embedding | interpretable embedding; naturally leverage multiple information sources | a) hard to justify the choice of distribution; b) require a large amount of training data
Matrix factorization based graph embedding learns the “bag-of-structure” based methods have two limitations [93].
representations based on the statistics of global pairwise Firstly, the substructures are not independent. For example,
similarities. Hence it can outperform deep learning based the graphlet of size k +1 can be derived from size k graphlet
graph embedding (random walk involved) in certain tasks by adding a new node and some edges. This means there
as the latter relies on separate local context windows [147], exist redundant information in the graph representation.
[148]. However, either the proximity matrix construction or Secondly, the embedding dimensionality usually grows ex-
the eigendecomposition of the matrix is time and space ponentially when the size of the substructure grows, leading
consuming [149], making matrix factorization inefficient to a sparse problem in the embedding.
and unscalable for large graphs. Generative model based graph embedding can naturally
Deep Learning (DL) has shown promising results among leverage information from different sources (e.g., graph
different graph embedding methods. We consider DL as structure, node attribute) in a unified model. Directly em-
suitable for graph embedding, because of its capability of bedding graphs into the latent semantic space generates the
automatically identifying useful representations from com- embedding that can be interpreted using the semantics. But
plex graph structures. For example, DL with random walk the assumption of modelling the observation using certain
(e.g., DeepWalk [17], node2vec [28], metapath2vec [46]) can distributions is hard to justify. Moreover, the generative
automatically exploit the neighbourhood structure through method needs a large amount of training data to estimate
sampled paths on the graph. DL without random walk can a proper model which fits the data. Hence it may not work
model variable-sized subgraph structures in homogeneous well for small graphs or a small number of graphs.
graphs (e.g., GCN [72], struc2vec [145], GraphSAGE [150]),
or rich interactions among flexible-typed nodes in heteroge-
neous graphs (e.g., HNE [33], TransE [91], ProxEmbed [44]), 5 A PPLICATIONS
as useful representations. On the other hand, DL also has Graph embedding benefits a wide variety of graph analytics
its limitations. For DL with random walks, it typically con- applications as the vector representations can be processed
siders a nodes local neighbours within the same path and efficiently in both time and space. In this section, we cate-
thus overlooks the global structure information. Moreover, gorize the graph embedding enabled applications as node
it is difficult to find an ‘optimal sampling strategy as the related, edge related and graph related.
embedding and path sampling are not jointly optimized in
a unified framework. For DL without random walks, the
5.1 Node Related Applications
computational cost is usually high. The traditional deep
learning architectures assume the input data on 1D or 2D 5.1.1 Node Classification
grid to take advantage of GPU [117]. However, graphs do Node classification is to assign a class label to each node in
not have such a grid structure, and thus require different a graph based on the rules learnt from the labelled nodes.
solutions to improve the efficiency [117]. Intuitively, “similar” nodes have the same labels. It is one
Edge reconstruction based graph embedding optimizes an objective function based on the observed edges or ranking triplets. It is more efficient than the previous two categories of graph embedding. However, this line of methods is trained using directly observed local information, and thus the obtained embedding lacks awareness of the global graph structure.
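As one hedged illustration of such an objective (a generic negative-sampling formulation, not the exact loss of any surveyed method), the sketch below nudges the embeddings of observed edge endpoints together and pushes randomly sampled node pairs apart; all sizes and hyperparameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, dim, lr = 4, 16, 0.05
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]          # observed edges (toy data)
emb = rng.normal(scale=0.1, size=(num_nodes, dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(200):
    for (u, v) in edges:
        n = rng.integers(num_nodes)               # one random negative node
        eu, ev, en = emb[u].copy(), emb[v].copy(), emb[n].copy()
        # Gradient ascent on log sigmoid(eu . ev): pull linked nodes together.
        g_pos = 1.0 - sigmoid(eu @ ev)
        emb[u] += lr * g_pos * ev
        emb[v] += lr * g_pos * eu
        # Gradient ascent on log sigmoid(-eu . en): push the sampled pair apart
        # (collisions of n with u or v are ignored for brevity).
        g_neg = -sigmoid(eu @ en)
        emb[u] += lr * g_neg * en
        emb[n] += lr * g_neg * eu
```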
Graph kernel based graph embedding converts a graph into one single vector to facilitate graph level analytic tasks such as graph classification. It is more efficient than other categories of techniques as it only needs to enumerate the desired atomic substructures in a graph. However, such substructure-based representations have two drawbacks. Firstly, there may exist redundant information in the graph representation. Secondly, the embedding dimensionality usually grows exponentially when the size of the substructure grows, leading to a sparsity problem in the embedding.
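A toy sketch of the substructure-enumeration idea is given below (a hand-rolled count of size-3 patterns, not a published kernel implementation); each graph is summarized by a count vector of edges, open wedges and triangles, and the kernel value is the inner product of two such vectors. It also hints at the sparsity issue, since the vector length would grow rapidly if larger substructures were enumerated.

```python
from itertools import combinations

def graphlet_counts(edges):
    """Summarize a graph by counts of size-3 substructures."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    wedges = triangles = 0
    for a, b, c in combinations(list(adj), 3):
        present = (b in adj[a]) + (c in adj[a]) + (c in adj[b])
        if present == 3:
            triangles += 1
        elif present == 2:
            wedges += 1
    return [len(edges), wedges, triangles]        # whole-graph feature vector

def kernel(edges_g1, edges_g2):
    x, y = graphlet_counts(edges_g1), graphlet_counts(edges_g2)
    return sum(a * b for a, b in zip(x, y))       # inner product of count vectors

print(kernel([(0, 1), (1, 2), (0, 2)], [(0, 1), (1, 2), (2, 3)]))
```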
Generative model based graph embedding can naturally leverage information from different sources (e.g., graph structure, node attributes) in a unified model. Directly embedding graphs into the latent semantic space generates embeddings that can be interpreted using the semantics. But the assumption of modelling the observations using certain distributions is hard to justify. Moreover, the generative method needs a large amount of training data to estimate a proper model which fits the data. Hence it may not work well for small graphs or a small number of graphs.

5 APPLICATIONS
Graph embedding benefits a wide variety of graph analytics applications as the vector representations can be processed efficiently in both time and space. In this section, we categorize the graph embedding enabled applications as node related, edge related and graph related.

5.1 Node Related Applications

5.1.1 Node Classification
Node classification is to assign a class label to each node in a graph based on the rules learnt from the labelled nodes. Intuitively, "similar" nodes have the same labels. It is one of the most common applications discussed in the graph embedding literature. In general, each node is embedded as a low-dimensional vector. Node classification is conducted by applying a classifier on the set of labelled node embeddings for training. Example classifiers include SVM ([1], [20], [33], [34], [41], [42], [45], [56], [57], [60], [73], [75], [81], [87]), logistic regression ([1], [17], [19], [20], [21], [25], [27], [28], [45], [59], [124]) and k-nearest neighbour classification ([58], [151]). Then, given the embedding of an unlabelled node, the trained classifier can predict its class label. In contrast to this sequential processing of first node embedding and then node classification, some other work ([47], [48], [62], [72], [80]) designs a unified framework to jointly optimize graph embedding and node classification, which learns a classification-specific representation for each node.
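The sequential pipeline can be sketched as follows: fit an off-the-shelf classifier on the embeddings of the labelled nodes and apply it to the remaining nodes. The random "embeddings" and labels below merely stand in for the output of an embedding method and a real labelled set, and logistic regression is used purely as one of the classifiers listed above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 32))        # stand-in node embeddings (toy data)
labels = rng.integers(0, 3, size=100)   # stand-in class labels (toy data)

train_idx = np.arange(70)               # labelled nodes used for training
test_idx = np.arange(70, 100)           # "unlabelled" nodes to predict

clf = LogisticRegression(max_iter=1000).fit(emb[train_idx], labels[train_idx])
predicted = clf.predict(emb[test_idx])  # one class label per unlabelled node
```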
5.1.2 Node Clustering
Node clustering aims to group similar nodes together, so that nodes in the same group are more similar to each other than nodes in other groups. As an unsupervised algorithm, it is applicable when node labels are unavailable. After representing nodes as vectors, traditional clustering algorithms can then be applied on the node embeddings. Most existing work ([1], [2], [21], [22], [23], [33], [81]) adopts k-means as the clustering algorithm. In contrast, [4] and [77] jointly optimize clustering and graph embedding in one objective to learn a clustering-specific node representation.
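A corresponding sketch for clustering (again with stand-in embeddings; k-means is only one of the possible algorithms) is:

```python
import numpy as np
from sklearn.cluster import KMeans

emb = np.random.default_rng(0).normal(size=(100, 32))    # stand-in node embeddings
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(emb)
# clusters[i] is the group assigned to node i.
```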
5.2 Edge Related Applications

5.2.1 Link Prediction
Graph embedding has also been applied to predict missing or future links between node pairs. Most of this work focuses on homogeneous graphs, e.g., [19], [28]. For example, [28] predicts the friendship relation between two users. Relatively fewer graph embedding works deal with heterogeneous graph link prediction. For example, on a heterogeneous social graph, ProxEmbed [44] tries to predict the missing links of certain semantic types (e.g., schoolmates) between two users, based on the embedding of their connecting paths on the graph. D2AGE [152] solves the same problem by embedding the directed acyclic graph structure that connects the two users.
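For homogeneous graphs, a common recipe (sketched below under the assumption that an inner-product score is used; other scoring functions are equally possible) is to score every unobserved node pair by the similarity of the two node embeddings and to return the highest-scored pairs as predicted links.

```python
import numpy as np

emb = np.random.default_rng(0).normal(size=(50, 16))     # stand-in node embeddings
observed = {(0, 1), (0, 2), (3, 4)}                      # existing edges (toy data)

def score(u, v):
    """Higher score = more likely to be (or become) a link."""
    return float(emb[u] @ emb[v])

candidates = [(u, v) for u in range(50) for v in range(u + 1, 50)
              if (u, v) not in observed]
top_pairs = sorted(candidates, key=lambda p: score(*p), reverse=True)[:10]
```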
5.2.2 Triple Classification
Triple classification [14], [15], [38], [51], [52], [53], [61], [142] is a specific application for knowledge graphs. It aims to classify whether an unseen triplet < h, r, t > is correct or not, i.e., whether the relation between h and t is r.
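As an illustration only (a TransE-style [91] scoring function with made-up thresholds, not the procedure of any specific paper cited above), a triplet can be classified by comparing its plausibility score against a per-relation threshold that would normally be tuned on validation data.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 32
entity_emb = rng.normal(size=(1000, dim))      # stand-in entity embeddings
relation_emb = rng.normal(size=(20, dim))      # stand-in relation embeddings

def score(h, r, t):
    """TransE-style plausibility: smaller ||h + r - t|| means more plausible."""
    return float(np.linalg.norm(entity_emb[h] + relation_emb[r] - entity_emb[t]))

thresholds = {r: 3.5 for r in range(20)}       # assumed per-relation thresholds;
                                               # in practice tuned on validation triplets

def is_correct(h, r, t):
    return score(h, r, t) < thresholds[r]

print(is_correct(0, 1, 2))
```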
message> which enables them to recover the missing component from a GTSM triplet given the other two. It can also classify the GTSM records, e.g., whether a check-in record is related to "Food" or "Shop". [85] uses graph embedding to reduce data dimensionality for face recognition. [88] maps images into a semantic manifold that faithfully grasps users' preferences to facilitate content-based image retrieval.
Information propagation related: [63] predicts the increment of a cascade size after a given time interval. [64] predicts the propagation users and identifies the domain experts by embedding the social interaction graph.
Social network alignment: Both [26] and [18] learn node embeddings to align users across different social networks, i.e., to predict whether two user accounts in two different social networks are owned by the same user.
Image related: Some work embeds graphs constructed from images, and then uses the embeddings for image classification ([81], [82]), image clustering [101], image segmentation [154], pattern recognition [80], and so on.
6 FUTURE DIRECTIONS
In this section, we summarize four future directions for the field of graph embedding, including computation efficiency, problem settings, techniques and application scenarios.
Computation. The deep architecture, which takes geometric input (e.g., graphs), suffers from the low efficiency problem. Traditional deep learning models (designed for Euclidean domains) utilize the modern GPU to optimize their efficiency by assuming that the input data lie on a 1D or 2D grid. However, graphs do not have such a grid structure, and thus the deep architectures designed for graph embedding need to seek alternative solutions to improve the model efficiency. [117] suggested that the computational paradigms developed for large-scale graph processing can be adopted to facilitate efficiency improvement in deep learning models for graph embedding.
Problem settings. The dynamic graph is one promising setting for graph embedding. Graphs are not always static, especially in real-life scenarios, e.g., social graphs in Twitter and citation graphs in DBLP. Graphs can be dynamic in terms of graph structure or node/edge information. On the one hand, the graph structure may evolve over time, i.e., new nodes/edges appear while some old nodes/edges disappear. On the other hand, the nodes/edges may be described by some time-varying information. Existing graph embedding mainly focuses on embedding static graphs, and the settings of dynamic graph embedding are overlooked. Unlike static graph embedding, the techniques for dynamic graphs need to be scalable and preferably incremental so as to deal with the dynamic changes efficiently. This makes most of the existing graph embedding methods, which suffer from the low efficiency problem, no longer suitable. How to design effective graph embedding methods in dynamic domains remains an open question.
Techniques. Structure awareness is important for edge reconstruction based graph embedding. Current edge reconstruction based graph embedding methods are mainly based on edges only, e.g., 1-hop neighbours in a general graph, and ranking triplets (< h, r, t > in a knowledge graph, (vi, vi+, vi-) in a cQA graph). Single edges only provide local neighbourhood information to calculate the first- and second-order proximity. The global structure of a graph (e.g., paths, trees, subgraph patterns) is omitted. Intuitively, a substructure contains richer information than one single edge. Some work attempts to explore the path information in knowledge graph embedding ([38], [39], [40], [142]). However, most of them use deep learning models ([38], [40], [142]) which suffer from the low efficiency issue as discussed earlier. How to design non-deep-learning methods that can take advantage of the expressive power of graph structures is an open question. [39] provides one example solution; it minimizes both a pairwise and a long-range loss to capture pairwise relations and long-range interactions between entities. Note that in addition to the list/path structure, there are various kinds of substructures which carry different structural information. For example, SPE [155] introduces a subgraph-augmented path structure for embedding the proximity between two nodes in a heterogeneous graph, and it shows better performance than embedding simple paths for semantic search tasks. In general, an efficient structure-aware graph embedding optimization solution, together with a substructure sampling strategy, is needed.
Applications. Graph embedding has been applied in many different applications. It is an effective way to learn representations of data with consideration of their relations. Moreover, it can convert data instances from different sources/platforms/views into one common space so that they are directly comparable. For example, [16], [34], [36] use graph embedding for cross-modal retrieval, such as content-based image retrieval and keyword-based image/video search. The advantage of using graph embedding for representation learning is that the graph manifold of the training data instances is preserved in the representations and can further benefit the follow-up applications. Consequently, graph embedding can benefit the tasks which assume that the input data instances are correlated with certain relations (i.e., connected by certain links). It is of great importance to explore the application scenarios which benefit from graph embedding, as it provides effective solutions to conventional problems from a different perspective.

7 CONCLUSIONS
In this survey, we conduct a comprehensive review of the literature in graph embedding. We provide a formal definition of the graph embedding problem and introduce some basic concepts. More importantly, we propose two taxonomies of graph embedding, categorizing existing work based on problem settings and embedding techniques respectively. In terms of the problem setting taxonomy, we introduce four types of embedding input and four types of embedding output and summarize the challenges faced in each setting. For the embedding technique taxonomy, we introduce the work in each category and compare them in terms of their advantages and disadvantages. After that, we summarize the applications that graph embedding enables. Finally, we suggest four promising future research directions in the field of graph embedding in terms of computation efficiency, problem settings, techniques and application scenarios.
[54] H. Dai, B. Dai, and L. Song, "Discriminative embeddings of latent variable models for structured data," in ICML, 2016, pp. 2702–2711.
[55] M. Niepert, M. Ahmed, and K. Kutzkov, "Learning convolutional neural networks for graphs," in ICML, 2016, pp. 2014–2023.
[56] C. Yang, Z. Liu, D. Zhao, M. Sun, and E. Y. Chang, "Network representation learning with rich text information," in IJCAI, 2015, pp. 2111–2117.
[57] D. Zhang, J. Yin, X. Zhu, and C. Zhang, "Homophily, structure, and content augmented network representation learning," in ICDM, 2016, pp. 609–618.
[58] T. M. V. Le and H. W. Lauw, "Probabilistic latent document network embedding," in ICDM, 2014, pp. 270–279.
[59] H. Xiao, M. Huang, L. Meng, and X. Zhu, "SSP: semantic space projection for knowledge graph embedding with text descriptions," in AAAI, 2017, pp. 3104–3110.
[60] L. Yao, Y. Zhang, B. Wei, Z. Jin, R. Zhang, Y. Zhang, and Q. Chen, "Incorporating knowledge graph embeddings into topic modeling," in AAAI, 2017, pp. 3119–3126.
[61] Z. Wang and J. Li, "Text-enhanced representation learning for knowledge graph," in IJCAI, 2016, pp. 1293–1299.
[62] Z. Yang, W. W. Cohen, and R. Salakhutdinov, "Revisiting semi-supervised learning with graph embeddings," in ICML, 2016, pp. 40–48.
[63] C. Li, J. Ma, X. Guo, and Q. Mei, "Deepcas: An end-to-end predictor of information cascades," in WWW, 2017, pp. 577–586.
[64] N. Zhao, H. Zhang, M. Wang, R. Hong, and T. Chua, "Learning content-social influential features for influence analysis," IJMIR, vol. 5, no. 3, pp. 137–149, 2016.
[65] J. Wang, V. W. Zheng, Z. Liu, and K. C. Chang, "Topological recurrent neural network for diffusion prediction," in ICDM, 2017, pp. 475–484.
[66] Z. Yang, J. Tang, and W. Cohen, "Multi-modal bayesian embeddings for learning social knowledge graphs," in IJCAI, 2016, pp. 2287–2293.
[67] F. M. Suchanek, G. Kasneci, and G. Weikum, "Yago: a core of semantic knowledge," in WWW, 2007, pp. 697–706.
[68] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann, "Dbpedia - A crystallization point for the web of data," J. Web Sem., vol. 7, no. 3, pp. 154–165, 2009.
[69] C. Xiong, R. Power, and J. Callan, "Explicit semantic ranking for academic search via knowledge graph embedding," in WWW, 2017, pp. 1271–1279.
[70] B. Alharbi and X. Zhang, "Learning from your network of friends: A trajectory representation learning model based on online social ties," in ICDM, 2016, pp. 781–786.
[71] Q. Zhang and H. Wang, "Not all links are created equal: An adaptive embedding approach for social personalized ranking," in SIGIR, 2016, pp. 917–920.
[72] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in ICLR, 2017.
[73] S. Pan, J. Wu, X. Zhu, C. Zhang, and Y. Wang, "Tri-party deep network representation," in IJCAI, 2016, pp. 1895–1901.
[74] T. Hofmann and J. M. Buhmann, "Multidimensional scaling and data clustering," in NIPS, 1994, pp. 459–466.
[75] Y. Han and Y. Shen, "Partially supervised graph embedding for positive unlabelled feature selection," in IJCAI, 2016, pp. 1548–1554.
[76] M. Yin, J. Gao, and Z. Lin, "Laplacian regularized low-rank representation and its applications," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 3, pp. 504–517, 2016.
[77] M. Tang, F. Nie, and R. Jain, "Capped lp-norm graph embedding for photo clustering," in MM, 2016, pp. 431–435.
[78] M. Balasubramanian and E. L. Schwartz, "The isomap algorithm and topological stability," Science, vol. 295, no. 5552, pp. 7–7, 2002.
[79] R. Jiang, W. Fu, L. Wen, S. Hao, and R. Hong, "Dimensionality reduction on anchorgraph with an efficient locality preserving projection," Neurocomputing, vol. 187, pp. 109–118, 2016.
[80] F. Monti, D. Boscaini, J. Masci, E. Rodolà, J. Svoboda, and M. M. Bronstein, "Geometric deep learning on graphs and manifolds using mixture model cnns," in CVPR, 2017.
[81] M. Chen, I. W. Tsang, M. Tan, and C. T. Jen, "A unified feature selection framework for graph embedding on high dimensional data," IEEE Trans. Knowl. Data Eng., vol. 27, no. 6, pp. 1465–1477, 2015.
[82] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in NIPS, 2016, pp. 3837–3845.
[83] C. Zhang, K. Zhang, Q. Yuan, H. Peng, Y. Zheng, T. Hanratty, S. Wang, and J. Han, "Regions, periods, activities: Uncovering urban dynamics via cross-modal representation learning," in WWW, 2017, pp. 361–370.
[84] X. Ren, W. He, M. Qu, C. R. Voss, H. Ji, and J. Han, "Label noise reduction in entity typing by heterogeneous partial-label embedding," in KDD, 2016, pp. 1825–1834.
[85] S. Yan, D. Xu, B. Zhang, H. Zhang, Q. Yang, and S. Lin, "Graph embedding and extensions: A general framework for dimensionality reduction," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 1, pp. 40–51, 2007.
[86] C. Gong, D. Tao, J. Yang, and K. Fu, "Signed laplacian embedding for supervised dimension reduction," in AAAI, 2014, pp. 1847–1853.
[87] L. Sun, S. Ji, and J. Ye, "Hypergraph spectral learning for multi-label classification," in KDD, 2008, pp. 668–676.
[88] Y.-Y. Lin, T.-L. Liu, and H.-T. Chen, "Semantic manifold learning for image retrieval," in MM, 2005, pp. 249–258.
[89] S. Cavallari, V. W. Zheng, H. Cai, K. C. Chang, and E. Cambria, "Learning community embedding with community detection and node embedding on graphs," in CIKM, 2017, pp. 377–386.
[90] A. Bordes, J. Weston, R. Collobert, and Y. Bengio, "Learning structured embeddings of knowledge bases," in AAAI, 2011.
[91] A. Bordes, N. Usunier, A. García-Durán, J. Weston, and O. Yakhnenko, "Translating embeddings for modeling multi-relational data," in NIPS, 2013, pp. 2787–2795.
[92] A. Bordes, X. Glorot, J. Weston, and Y. Bengio, "A semantic matching energy function for learning with multi-relational data - application to word-sense disambiguation," Machine Learning, vol. 94, no. 2, pp. 233–259, 2014.
[93] P. Yanardag and S. Vishwanathan, "Deep graph kernels," in KDD, 2015, pp. 1365–1374.
[94] A. Bordes, S. Chopra, and J. Weston, "Question answering with subgraph embeddings," in EMNLP, 2014, pp. 615–620.
[95] S. F. Mousavi, M. Safayani, A. Mirzaei, and H. Bahonar, "Hierarchical graph embedding in vector space by graph pyramid," Pattern Recognition, vol. 61, pp. 245–254, 2017.
[96] W. N. A. Jr. and T. D. Morley, "Eigenvalues of the laplacian of a graph," Linear and Multilinear Algebra, vol. 18, no. 2, pp. 141–145, 1985.
[97] X. He and P. Niyogi, "Locality preserving projections," in NIPS, 2003, pp. 153–160.
[98] Y. Yang, F. Nie, S. Xiang, Y. Zhuang, and W. Wang, "Local and global regressive mapping for manifold learning with out-of-sample extrapolation," in AAAI, 2010.
[99] D. Cai, X. He, and J. Han, "Spectral regression: a unified subspace learning framework for content-based image retrieval," in MM, 2007, pp. 403–412.
[100] K. Q. Weinberger, F. Sha, and L. K. Saul, "Learning a kernel matrix for nonlinear dimensionality reduction," in ICML, 2004.
[101] K. Allab, L. Labiod, and M. Nadif, "A semi-nmf-pca unified framework for data clustering," IEEE Trans. Knowl. Data Eng., vol. 29, no. 1, pp. 2–16, 2017.
[102] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[103] S. Xiang, F. Nie, C. Zhang, and C. Zhang, "Nonlinear dimensionality reduction with local spline embedding," IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1285–1298, 2009.
[104] L. Vandenberghe and S. Boyd, "Semidefinite programming," SIAM Rev., vol. 38, no. 1, pp. 49–95, 1996.
[105] B. Shaw and T. Jebara, "Structure preserving embedding," in ICML, 2009, pp. 937–944.
[106] M. Ou, P. Cui, J. Pei, Z. Zhang, and W. Zhu, "Asymmetric transitivity preserving graph embedding," in KDD, 2016, pp. 1105–1114.
[107] B. Shaw and T. Jebara, "Minimum volume embedding," in AISTATS, 2007, pp. 460–467.
[108] M. Nickel, V. Tresp, and H.-P. Kriegel, "A three-way model for collective learning on multi-relational data," in ICML, 2011, pp. 809–816.
[109] T. Pang, F. Nie, and J. Han, "Flexible orthogonal neighborhood preserving embedding," in IJCAI, 2017, pp. 2592–2598.
[110] G. H. Golub and C. Reinsch, "Singular value decomposition and least squares solutions," Numer. Math., vol. 14, no. 5, pp. 403–420, 1970.
[111] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," CoRR, vol. abs/1301.3781, 2013.
[112] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in NIPS, 2013, pp. 3111–3119.
[113] D. Zhang, J. Yin, X. Zhu, and C. Zhang, "User profile preserving social network embedding," in IJCAI, 2017, pp. 3378–3384.
[114] N. Shervashidze, S. V. N. Vishwanathan, T. Petri, K. Mehlhorn, and K. M. Borgwardt, "Efficient graphlet kernels for large graph comparison," in AISTATS, 2009, pp. 488–495.
[115] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[116] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio, "On the properties of neural machine translation: Encoder-decoder approaches," CoRR, vol. abs/1409.1259, 2014.
[117] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, "Geometric deep learning: going beyond euclidean data," CoRR, vol. abs/1611.08097, 2016.
[118] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, "Spectral networks and locally connected networks on graphs," in ICLR, 2013.
[119] M. Henaff, J. Bruna, and Y. LeCun, "Deep convolutional networks on graph-structured data," CoRR, vol. abs/1506.05163, 2015.
[120] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, "The graph neural network model," IEEE Trans. Neural Networks, vol. 20, no. 1, pp. 61–80, 2009.
[121] D. Duvenaud, D. Maclaurin, J. Aguilera-Iparraguirre, R. Gómez-Bombarelli, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, "Convolutional networks on graphs for learning molecular fingerprints," in NIPS, 2015, pp. 2224–2232.
[122] Y. Li, D. Tarlow, M. Brockschmidt, and R. S. Zemel, "Gated graph sequence neural networks," in ICLR, 2016.
[123] R. Khasanova and P. Frossard, "Graph-based isometry invariant representation learning," in ICML, 2017, pp. 1847–1856.
[124] J. Tang, M. Qu, and Q. Mei, "Pte: Predictive text embedding through large-scale heterogeneous text networks," in KDD, 2015, pp. 1165–1174.
[125] G. Ji, S. He, L. Xu, K. Liu, and J. Zhao, "Knowledge graph embedding via dynamic mapping matrix," in ACL, 2015, pp. 687–696.
[126] G. Ji, K. Liu, S. He, and J. Zhao, "Knowledge graph completion with adaptive sparse transfer matrix," in AAAI, 2016, pp. 985–991.
[127] J. Wen, J. Li, Y. Mao, S. Chen, and R. Zhang, "On the representation and embedding of knowledge bases beyond binary relations," in IJCAI, 2016, pp. 1300–1307.
[128] R. Xie, Z. Liu, J. Jia, H. Luan, and M. Sun, "Representation learning of knowledge graphs with entity descriptions," in AAAI, 2016, pp. 2659–2665.
[129] H. Xiao, M. Huang, and X. Zhu, "From one point to a manifold: Knowledge graph embedding for precise link prediction," in IJCAI, 2016, pp. 1315–1321.
[130] Y. Jia, Y. Wang, H. Lin, X. Jin, and X. Cheng, "Locally adaptive translation for knowledge graph embedding," in AAAI, 2016, pp. 992–998.
[131] R. Socher, D. Chen, C. D. Manning, and A. Y. Ng, "Reasoning with neural tensor networks for knowledge base completion," in NIPS, 2013, pp. 926–934.
[132] M. Nickel, L. Rosasco, and T. Poggio, "Holographic embeddings of knowledge graphs," in AAAI, 2016, pp. 1955–1961.
[133] M. Chen, Y. Tian, M. Yang, and C. Zaniolo, "Multilingual knowledge graph embeddings for cross-lingual knowledge alignment," in IJCAI, 2017, pp. 1511–1517.
[134] H. Liu, Y. Wu, and Y. Yang, "Analogical inference for multi-relational embeddings," in ICML, 2017, pp. 2168–2178.
[135] T. Trouillon, J. Welbl, S. Riedel, E. Gaussier, and G. Bouchard, "Complex embeddings for simple link prediction," in ICML, 2016, pp. 2071–2080.
[136] D. Haussler, "Convolution kernels on discrete structures," Technical Report UCS-CRL-99-10, 1999.
[137] S. V. N. Vishwanathan, N. N. Schraudolph, R. Kondor, and K. M. Borgwardt, "Graph kernels," Journal of Machine Learning Research, vol. 11, pp. 1201–1242, 2010.
[138] K. M. Borgwardt and H. Kriegel, "Shortest-path kernels on graphs," in ICDM, 2005, pp. 74–81.
[139] C. M. Bishop and J. Lasserre, "Generative or discriminative? Getting the best of both worlds," in Bayes. Stat. 8, 2007, pp. 3–24.
[140] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent dirichlet allocation," in NIPS, 2001, pp. 601–608.
[141] H. Xiao, M. Huang, and X. Zhu, "Transg: A generative model for knowledge graph embedding," in ACL, 2016.
[142] Y. Luo, Q. Wang, B. Wang, and L. Guo, "Context-dependent knowledge graph embedding," in EMNLP, 2015, pp. 1656–1661.
[143] F. Zhang, N. J. Yuan, D. Lian, X. Xie, and W. Ma, "Collaborative knowledge base embedding for recommender systems," in KDD, 2016, pp. 353–362.
[144] Y. Lin, Z. Liu, and M. Sun, "Knowledge representation learning with entities, attributes and relations," in IJCAI, 2016, pp. 2866–2872.
[145] L. F. Ribeiro, P. H. Saverese, and D. R. Figueiredo, "Struc2vec: Learning node representations from structural identity," in KDD, 2017, pp. 385–394.
[146] C. Yang, M. Sun, Z. Liu, and C. Tu, "Fast network embedding enhancement via high order proximity approximation," in IJCAI, 2017, pp. 3894–3900.
[147] J. A. Bullinaria and J. P. Levy, "Extracting semantic representations from word co-occurrence statistics: stop-lists, stemming, and svd," Behav. Res. Methods, vol. 44, no. 3, pp. 890–907, 2012.
[148] O. Levy, Y. Goldberg, and I. Dagan, "Improving distributional similarity with lessons learned from word embeddings," TACL, vol. 3, pp. 211–225, 2015.
[149] J. Demmel, I. Dumitriu, and O. Holtz, "Fast linear algebra is stable," Numerische Mathematik, vol. 108, no. 1, pp. 59–91, 2007.
[150] W. L. Hamilton, Z. Ying, and J. Leskovec, "Inductive representation learning on large graphs," in NIPS, 2017, pp. 1025–1035.
[151] R. C. Wilson, E. R. Hancock, E. Pekalska, and R. P. W. Duin, "Spherical and hyperbolic embeddings of data," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 11, pp. 2255–2269, 2014.
[152] Z. Liu, V. W. Zheng, Z. Zhao, F. Zhu, K. C.-C. Chang, M. Wu, and J. Ying, "Distance-aware dag embedding for proximity search on heterogeneous graphs," in AAAI, 2018.
[153] F. D. Johansson and D. P. Dubhashi, "Learning with similarity functions on graphs using matchings of geometric embeddings," in KDD, 2015, pp. 467–476.
[154] Y. Yu, C. Fang, and Z. Liao, "Piecewise flat embedding for image segmentation," in ICCV, 2015, pp. 1368–1376.
[155] Z. Liu, V. W. Zheng, Z. Zhao, H. Yang, K. C.-C. Chang, M. Wu, and J. Ying, "Subgraph-augmented path embedding for semantic user search on heterogeneous social network," in WWW, 2018.

Hongyun Cai received the Ph.D. degree in Computer Science from the University of Queensland in 2016. She is currently a postdoctoral researcher at the Advanced Digital Sciences Center, Singapore. Her research focuses on graph mining, social data management and analysis.

Vincent W. Zheng received the Ph.D. degree in Computer Science from the Hong Kong University of Science and Technology in 2011. He is a senior research scientist at the Advanced Digital Sciences Center, Singapore, and a research affiliate at the University of Illinois at Urbana-Champaign. His research interests include graph mining, information extraction, ubiquitous computing and machine learning.

Kevin Chen-Chuan Chang is a Professor in University of Illinois at Urbana-Champaign. His research addresses large-scale information access, for search, mining, and integration across structured and unstructured big data including Web data and social media. He also co-founded Cazoodle for deepening vertical data-aware search over the Web.