
Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec

Jiezhong Qiu†∗, Yuxiao Dong‡, Hao Ma‡, Jian Li♯, Kuansan Wang‡, and Jie Tang†
† Department of Computer Science and Technology, Tsinghua University
‡ Microsoft Research, Redmond
♯ Institute for Interdisciplinary Information Sciences, Tsinghua University

[email protected],{yuxdong,haoma,kuansanw}@microsoft.com,{lijian83,jietang}@tsinghua.edu.cn

arXiv:1710.02971v4 [cs.SI] 8 Feb 2018

ABSTRACT

Since the invention of word2vec [28, 29], the skip-gram model has significantly advanced the research of network embedding, such as the recent emergence of the DeepWalk, LINE, PTE, and node2vec approaches. In this work, we show that all of the aforementioned models with negative sampling can be unified into the matrix factorization framework with closed forms. Our analysis and proofs reveal that: (1) DeepWalk [31] empirically produces a low-rank transformation of a network's normalized Laplacian matrix; (2) LINE [37], in theory, is a special case of DeepWalk when the size of vertices' context is set to one; (3) as an extension of LINE, PTE [36] can be viewed as the joint factorization of multiple networks' Laplacians; (4) node2vec [16] is factorizing a matrix related to the stationary distribution and transition probability tensor of a 2nd-order random walk. We further provide the theoretical connections between skip-gram based network embedding algorithms and the theory of graph Laplacians. Finally, we present the NetMF method¹ as well as its approximation algorithm for computing network embedding. Our method offers significant improvements over DeepWalk and LINE for conventional network mining tasks. This work lays the theoretical foundation for skip-gram based network embedding methods, leading to a better understanding of latent network representation learning.

Table 1: The matrices that are implicitly approximated and factorized by DeepWalk, LINE, PTE, and node2vec.

Algorithm   Matrix
DeepWalk    $\log\left(\operatorname{vol}(G)\left(\frac{1}{T}\sum_{r=1}^{T}(D^{-1}A)^{r}\right)D^{-1}\right)-\log b$
LINE        $\log\left(\operatorname{vol}(G)\,D^{-1}AD^{-1}\right)-\log b$
PTE         $\log\begin{pmatrix}\alpha\operatorname{vol}(G_{ww})(D_{row}^{ww})^{-1}A_{ww}(D_{col}^{ww})^{-1}\\ \beta\operatorname{vol}(G_{dw})(D_{row}^{dw})^{-1}A_{dw}(D_{col}^{dw})^{-1}\\ \gamma\operatorname{vol}(G_{lw})(D_{row}^{lw})^{-1}A_{lw}(D_{col}^{lw})^{-1}\end{pmatrix}-\log b$
node2vec    $\log\left(\frac{\frac{1}{2T}\sum_{r=1}^{T}\left(\sum_{u}X_{w,u}P^{r}_{c,w,u}+\sum_{u}X_{c,u}P^{r}_{w,c,u}\right)}{\left(\sum_{u}X_{w,u}\right)\left(\sum_{u}X_{c,u}\right)}\right)-\log b$

Notations in DeepWalk and LINE are introduced below. See detailed notations for PTE and node2vec in Section 2.
$A$: $A\in\mathbb{R}_{+}^{|V|\times|V|}$ is $G$'s adjacency matrix, with $A_{i,j}$ the edge weight between vertices $i$ and $j$;
$D_{col}$: $D_{col}=\operatorname{diag}(A^{\top}e)$ is the diagonal matrix of column sums of $A$;
$D_{row}$: $D_{row}=\operatorname{diag}(Ae)$ is the diagonal matrix of row sums of $A$;
$D$: for undirected graphs ($A^{\top}=A$), $D_{col}=D_{row}$; for brevity, $D$ represents both, with $D=\operatorname{diag}(d_{1},\cdots,d_{|V|})$, where $d_{i}$ is the generalized degree of vertex $i$;
$\operatorname{vol}(G)$: $\operatorname{vol}(G)=\sum_{i}\sum_{j}A_{i,j}=\sum_{i}d_{i}$ is the volume of the weighted graph $G$;
$T$ & $b$: the context window size and the number of negative samples in skip-gram, respectively.

ACM Reference Format:
Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. 2018. Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec. In Proceedings of WSDM'18. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3159652.3159706

∗ This work was partially done when Jiezhong was an intern at Microsoft Research.
¹ Code available at github.com/xptree/NetMF.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
WSDM'18, February 5–9, 2018, Marina Del Rey, CA, USA
© 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-5581-0/18/02. $15.00
https://doi.org/10.1145/3159652.3159706

1 INTRODUCTION

The conventional paradigm of mining and learning with networks usually starts from the explicit exploration of their structural properties [13, 32]. But many such properties, such as betweenness centrality, triangle count, and modularity, require extensive domain knowledge and expensive computation to handcraft. In light of these issues, as well as the opportunities offered by the recent emergence of representation learning [2], learning latent representations for networks, a.k.a. network embedding, has been extensively studied in order to automatically discover and map a network's structural properties into a latent space.

Formally, the problem of network embedding is often formalized as follows: given an undirected and weighted graph $G=(V,E,A)$ with $V$ as the node set, $E$ as the edge set, and $A$ as the adjacency matrix, the goal is to learn a function $V\to\mathbb{R}^{d}$ that maps each vertex to a $d$-dimensional ($d\ll|V|$) vector capturing its structural properties. The output representations can be used as the input of mining and learning algorithms for a variety of network science tasks, such as label classification and community detection.

The attempt to address this problem dates back to spectral graph theory [11] and social dimension learning [38]. Its very recent advances have been largely influenced by the skip-gram model originally proposed for word embedding [28, 29], whose input is a text corpus composed of sentences in natural language and whose output is a latent vector representation for each word in the corpus. Notably, inspired by this setting, DeepWalk [31] pioneered network embedding by treating the vertex paths traversed by random walks over networks as sentences and leveraging skip-gram for learning latent vertex representations. With the advent of
DeepWalk, many network embedding models have been developed, such as LINE [37], PTE [36], and node2vec [16].

The above models have thus far been demonstrated to be quite effective empirically. However, the theoretical mechanism behind them is much less well understood. We note that the skip-gram model with negative sampling for word embedding has been shown to be an implicit factorization of a certain word-context matrix [24], and there is recent effort to theoretically explain word embedding models from geometric perspectives [1, 18]. But it is unclear what the relation is between the word-context matrix and the network structure. Moreover, there was an early attempt to theoretically analyze DeepWalk's behavior [47]; however, its main theoretical results are not fully consistent with the setting of the original DeepWalk paper. In addition, despite the superficial similarity among DeepWalk, LINE, PTE, and node2vec, there is a lack of deeper understanding of their underlying connections.

Contributions. In this work, we provide theoretical results concerning several skip-gram powered network embedding methods. More concretely, we first show that the models mentioned above—DeepWalk, LINE, PTE, and node2vec—are in theory performing implicit matrix factorizations. We derive the closed form of the matrix for each model (see Table 1 for a summary). For example, DeepWalk (random walks on a graph + skip-gram) is in essence factorizing a random matrix that converges in probability to our closed-form matrix as the length of random walks goes to infinity. Second, from their matrices' closed forms, we find that, interestingly, LINE can be seen as a special case of DeepWalk when the window size $T$ of contexts is set to 1 in skip-gram. Furthermore, we demonstrate that PTE, as an extension of LINE, is actually an implicit factorization of the joint matrix of multiple networks. Third, we discover a theoretical connection between DeepWalk's implicit matrix and graph Laplacians. Based on this connection, we propose a new algorithm, NetMF, to approximate the closed form of DeepWalk's implicit matrix. By explicitly factorizing this matrix using SVD, our extensive experiments on four networks (used in the DeepWalk and node2vec papers) demonstrate NetMF's outstanding performance (relative improvements of up to 50%) over DeepWalk and LINE.

2 THEORETICAL ANALYSIS AND PROOFS

In this section, we present the detailed theoretical analysis and proofs for four popular network embedding approaches: LINE, PTE, DeepWalk, and node2vec.

2.1 LINE and PTE

LINE [37]. Given an undirected and weighted network $G=(V,E,A)$, LINE with second-order proximity (a.k.a. LINE (2nd)) aims to learn two representation matrices $X,Y\in\mathbb{R}^{|V|\times d}$, whose rows are denoted by $x_i$ and $y_i$, $i=1,\cdots,|V|$, respectively. The objective of LINE (2nd) is to maximize

$$\ell=\sum_{i=1}^{|V|}\sum_{j=1}^{|V|}\left(A_{i,j}\log g\left(x_i^{\top}y_j\right)+b\,\mathbb{E}_{j'\sim P_N}\left[\log g\left(-x_i^{\top}y_{j'}\right)\right]\right),$$

where $g$ is the sigmoid function, $b$ is the parameter for negative sampling, and $P_N$ is the noise distribution that generates negative samples. In LINE, the authors empirically set $P_N(j)\propto d_j^{3/4}$, where $d_j=\sum_{k=1}^{|V|}A_{j,k}$ is the generalized degree of vertex $j$. In our analysis, we assume $P_N(j)\propto d_j$, because the normalization factor then has a closed-form solution in graph theory, i.e., $P_N(j)=d_j/\operatorname{vol}(G)$, where $\operatorname{vol}(G)=\sum_{i=1}^{|V|}\sum_{j=1}^{|V|}A_{i,j}$.²

² A similar result can be achieved if we use $P_N(j)\propto d_j^{3/4}$.

We then rewrite the objective as

$$\ell=\sum_{i=1}^{|V|}\sum_{j=1}^{|V|}A_{i,j}\log g\left(x_i^{\top}y_j\right)+\sum_{i=1}^{|V|}b\,d_i\,\mathbb{E}_{j'\sim P_N}\left[\log g\left(-x_i^{\top}y_{j'}\right)\right],\quad(1)$$

and express the expectation term $\mathbb{E}_{j'\sim P_N}\left[\log g\left(-x_i^{\top}y_{j'}\right)\right]$ as

$$\frac{d_j}{\operatorname{vol}(G)}\log g\left(-x_i^{\top}y_j\right)+\sum_{j'\neq j}\frac{d_{j'}}{\operatorname{vol}(G)}\log g\left(-x_i^{\top}y_{j'}\right).\quad(2)$$

By combining Eq. 1 and Eq. 2, and considering the local objective function for a specific pair of vertices $(i,j)$, we have

$$\ell(i,j)=A_{i,j}\log g\left(x_i^{\top}y_j\right)+b\,\frac{d_i d_j}{\operatorname{vol}(G)}\log g\left(-x_i^{\top}y_j\right).$$

Let us define $z_{i,j}=x_i^{\top}y_j$. Following Levy and Goldberg [24], who argue that for a sufficiently large embedding dimension each individual $z_{i,j}$ can assume a value independently, we take the derivative w.r.t. $z_{i,j}$ and get

$$\frac{\partial\ell}{\partial z_{i,j}}=\frac{\partial\ell(i,j)}{\partial z_{i,j}}=A_{i,j}\,g\left(-z_{i,j}\right)-b\,\frac{d_i d_j}{\operatorname{vol}(G)}\,g\left(z_{i,j}\right).$$

Setting the derivative to zero reveals

$$e^{2z_{i,j}}-\left(\frac{\operatorname{vol}(G)A_{i,j}}{b\,d_i d_j}-1\right)e^{z_{i,j}}-\frac{\operatorname{vol}(G)A_{i,j}}{b\,d_i d_j}=0.$$

This quadratic equation has two solutions: (1) $e^{z_{i,j}}=-1$, which is invalid; and (2) $e^{z_{i,j}}=\frac{\operatorname{vol}(G)A_{i,j}}{b\,d_i d_j}$, i.e.,

$$x_i^{\top}y_j=z_{i,j}=\log\left(\frac{\operatorname{vol}(G)A_{i,j}}{b\,d_i d_j}\right).\quad(3)$$

Writing Eq. 3 in matrix form, LINE (2nd) is factorizing the matrix

$$\log\left(\operatorname{vol}(G)\,D^{-1}AD^{-1}\right)-\log b=XY^{\top},\quad(4)$$

where $\log(\cdot)$ denotes the element-wise matrix logarithm and $D=\operatorname{diag}(d_1,\cdots,d_{|V|})$.

PTE [36]. PTE is an extension of LINE (2nd) to heterogeneous text networks. To examine it, we first adapt our analysis of LINE (2nd) to bipartite networks. Consider a bipartite network $G=(V_1\cup V_2,E,A)$, where $V_1,V_2$ are two disjoint sets of vertices, $E\subseteq V_1\times V_2$ is the edge set, and $A\in\mathbb{R}_{+}^{|V_1|\times|V_2|}$ is the bipartite adjacency matrix. The volume of $G$ is defined to be $\operatorname{vol}(G)=\sum_{i=1}^{|V_1|}\sum_{j=1}^{|V_2|}A_{i,j}$. The goal is to learn a representation $x_i$ for each vertex $v_i\in V_1$ and a representation $y_j$ for each vertex $v_j\in V_2$. The objective function is

$$\ell=\sum_{i=1}^{|V_1|}\sum_{j=1}^{|V_2|}\left(A_{i,j}\log g\left(x_i^{\top}y_j\right)+b\,\mathbb{E}_{j'\sim P_N}\left[\log g\left(-x_i^{\top}y_{j'}\right)\right]\right).$$

Applying the same analysis procedure as for LINE, we can see that maximizing $\ell$ is actually factorizing

$$\log\left(\operatorname{vol}(G)\,D_{row}^{-1}AD_{col}^{-1}\right)-\log b=XY^{\top},$$

where we denote $D_{row}=\operatorname{diag}(Ae)$ and $D_{col}=\operatorname{diag}(A^{\top}e)$.
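To make Eq. 4 concrete, the following is a minimal numeric sketch (our own illustration, not code from the paper; the 4-cycle graph, $b=1$, and rank 2 are arbitrary choices) that builds the LINE (2nd) matrix and factorizes it with SVD. Zero entries are truncated at 1 before the logarithm, in the spirit of the Shifted PPMI treatment used later by NetMF:

```python
import numpy as np

# Toy undirected graph: a 4-cycle with unit edge weights (illustrative only).
A = np.array([[0., 1., 0., 1.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 0., 1., 0.]])
d = A.sum(axis=1)                      # generalized degrees d_i
vol_G = d.sum()                        # vol(G) = sum_i d_i
b = 1                                  # number of negative samples

# LINE (2nd) implicitly factorizes log(vol(G) * D^-1 A D^-1) - log b (Eq. 4).
M = vol_G * np.diag(1 / d) @ A @ np.diag(1 / d)
logM = np.log(np.maximum(M, 1.0)) - np.log(b)   # truncate zeros before the log

# Recover X, Y with X Y^T ~= log M via a low-rank SVD.
U, S, Vt = np.linalg.svd(logM)
dim = 2
X = U[:, :dim] * np.sqrt(S[:dim])
Y = Vt[:dim, :].T * np.sqrt(S[:dim])
print(np.round(X @ Y.T - logM, 8))     # ~zero: rank 2 suffices for this toy graph
```

For this toy graph the truncated matrix happens to have rank 2, so the rank-2 factorization is exact; on real networks the SVD instead gives the best rank-$d$ approximation.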
Given the above discussion, let us consider the heterogeneous text network used in PTE, which is composed of three sub-networks — the word-word network $G_{ww}$, the document-word network $G_{dw}$, and the label-word network $G_{lw}$ — where $G_{dw}$ and $G_{lw}$ are bipartite. Take the document-word network $G_{dw}$ as an example: we use $A_{dw}\in\mathbb{R}^{\#doc\times\#word}$ to denote its adjacency matrix, and use $D_{row}^{dw}$ and $D_{col}^{dw}$ to denote its diagonal matrices of row and column sums, respectively. By leveraging the analysis of LINE and the above notation, we find that PTE is factorizing

$$\log\begin{pmatrix}\alpha\operatorname{vol}(G_{ww})(D_{row}^{ww})^{-1}A_{ww}(D_{col}^{ww})^{-1}\\ \beta\operatorname{vol}(G_{dw})(D_{row}^{dw})^{-1}A_{dw}(D_{col}^{dw})^{-1}\\ \gamma\operatorname{vol}(G_{lw})(D_{row}^{lw})^{-1}A_{lw}(D_{col}^{lw})^{-1}\end{pmatrix}-\log b,\quad(5)$$

where the factorized matrix is of shape $(\#word+\#doc+\#label)\times\#word$, $b$ is the parameter for negative sampling, and $\{\alpha,\beta,\gamma\}$ are non-negative hyper-parameters balancing the weights of the three sub-networks. In PTE, $\{\alpha,\beta,\gamma\}$ satisfy $\alpha\operatorname{vol}(G_{ww})=\beta\operatorname{vol}(G_{dw})=\gamma\operatorname{vol}(G_{lw})$. This is because the authors perform edge sampling during training, wherein edges are sampled from each of the three sub-networks alternately (see Section 4.2 in [36]).

2.2 DeepWalk

In this section, we analyze DeepWalk [31] and show that, in essence, DeepWalk is performing an implicit matrix factorization (see the matrix-form solution in Thm. 2.3).

DeepWalk first generates a "corpus" $\mathcal{D}$ by random walks on graphs [26]. Formally, the corpus $\mathcal{D}$ is a multiset that counts the multiplicities of vertex-context pairs. DeepWalk then trains a skip-gram model on the multiset $\mathcal{D}$. In this work, we focus on skip-gram with negative sampling (SGNS). For clarity, we summarize the DeepWalk method in Algorithm 1.

Algorithm 1: DeepWalk
1  for n = 1, 2, ..., N do
2      Pick $w_1^n$ according to a probability distribution $P(w_1)$;
3      Generate a vertex sequence $(w_1^n,\cdots,w_L^n)$ of length $L$ by a random walk on network $G$;
4      for j = 1, 2, ..., L − T do
5          for r = 1, ..., T do
6              Add vertex-context pair $(w_j^n, w_{j+r}^n)$ to multiset $\mathcal{D}$;
7              Add vertex-context pair $(w_{j+r}^n, w_j^n)$ to multiset $\mathcal{D}$;
8  Run SGNS on $\mathcal{D}$ with $b$ negative samples.

The outer loop (Lines 1–7) specifies the total number of times, $N$, that we run random walks. For each random walk of length $L$, the first vertex is sampled from a prior distribution $P(w)$. The inner loop (Lines 4–7) specifies the construction of the multiset $\mathcal{D}$. Once we have $\mathcal{D}$, we run SGNS on it to obtain the network embedding (Line 8). Next, we introduce the necessary background on the SGNS technique, followed by our analysis of the DeepWalk method.

Preliminary on Skip-gram with Negative Sampling (SGNS). The skip-gram model assumes a corpus of words $w$ and their contexts $c$. More concretely, the words come from a textual corpus of words $w_1,\cdots,w_L$, and the contexts for word $w_i$ are the words surrounding it in a $T$-sized window, $w_{i-T},\cdots,w_{i-1},w_{i+1},\cdots,w_{i+T}$. Following the work by Levy and Goldberg [24], SGNS is implicitly factorizing

$$\log\left(\frac{\#(w,c)\cdot|\mathcal{D}|}{\#(w)\cdot\#(c)}\right)-\log b,\quad(6)$$

where $\#(w,c)$, $\#(w)$, and $\#(c)$ denote the number of times the word-context pair $(w,c)$, the word $w$, and the context $c$ appear in the corpus, respectively; $b$ is the number of negative samples.

Proofs and Analysis. Our analysis of DeepWalk is based on the following key assumptions. First, we assume the graph is undirected, connected, and non-bipartite, which makes $P(w)=d_w/\operatorname{vol}(G)$ the unique stationary distribution of the random walks. Second, we suppose the first vertex of each random walk is sampled from this stationary distribution $P(w)=d_w/\operatorname{vol}(G)$.

To characterize DeepWalk, we want to reinterpret Eq. 6 using graph-theoretic terminology. Our general idea is to partition the multiset $\mathcal{D}$ into several sub-multisets according to the way in which a vertex and its context appear in a random walk sequence. More formally, for $r=1,\cdots,T$, we define

$$\mathcal{D}_{\rightarrow r}=\left\{(w,c):(w,c)\in\mathcal{D},\,w=w_j^n,\,c=w_{j+r}^n\right\},$$
$$\mathcal{D}_{\leftarrow r}=\left\{(w,c):(w,c)\in\mathcal{D},\,w=w_{j+r}^n,\,c=w_j^n\right\}.$$

That is, $\mathcal{D}_{\rightarrow r}$ / $\mathcal{D}_{\leftarrow r}$ is the sub-multiset of $\mathcal{D}$ such that the context $c$ is $r$ steps after/before the vertex $w$ in a random walk. Moreover, we use $\#(w,c)_{\rightarrow r}$ and $\#(w,c)_{\leftarrow r}$ to denote the number of times the vertex-context pair $(w,c)$ appears in $\mathcal{D}_{\rightarrow r}$ and $\mathcal{D}_{\leftarrow r}$, respectively. The following three theorems characterize DeepWalk step by step.

Theorem 2.1. Denote $P=D^{-1}A$. When $L\to\infty$, we have

$$\frac{\#(w,c)_{\rightarrow r}}{|\mathcal{D}_{\rightarrow r}|}\xrightarrow{p}\frac{d_w}{\operatorname{vol}(G)}\left(P^r\right)_{w,c}\quad\text{and}\quad\frac{\#(w,c)_{\leftarrow r}}{|\mathcal{D}_{\leftarrow r}|}\xrightarrow{p}\frac{d_c}{\operatorname{vol}(G)}\left(P^r\right)_{c,w}.$$

Proof. See Appendix. □

Remark 1. What if we start random walks with other distributions (e.g., the uniform distribution in the original DeepWalk work [31])? It turns out that, for a connected, undirected, and non-bipartite graph, $P(w_j=w,w_{j+r}=c)\to\frac{d_w}{\operatorname{vol}(G)}\left(P^r\right)_{w,c}$ as $j\to\infty$, so when the length of the random walks $L\to\infty$, Thm. 2.1 still holds.

Theorem 2.2. When $L\to\infty$, we have

$$\frac{\#(w,c)}{|\mathcal{D}|}\xrightarrow{p}\frac{1}{2T}\sum_{r=1}^{T}\left(\frac{d_w}{\operatorname{vol}(G)}\left(P^r\right)_{w,c}+\frac{d_c}{\operatorname{vol}(G)}\left(P^r\right)_{c,w}\right).$$

Proof. Note the fact that $\frac{|\mathcal{D}_{\rightarrow r}|}{|\mathcal{D}|}=\frac{|\mathcal{D}_{\leftarrow r}|}{|\mathcal{D}|}=\frac{1}{2T}$. By using Thm. 2.1 and the continuous mapping theorem, we get

$$\frac{\#(w,c)}{|\mathcal{D}|}=\frac{\sum_{r=1}^{T}\left(\#(w,c)_{\rightarrow r}+\#(w,c)_{\leftarrow r}\right)}{\sum_{r=1}^{T}\left(|\mathcal{D}_{\rightarrow r}|+|\mathcal{D}_{\leftarrow r}|\right)}=\frac{1}{2T}\sum_{r=1}^{T}\left(\frac{\#(w,c)_{\rightarrow r}}{|\mathcal{D}_{\rightarrow r}|}+\frac{\#(w,c)_{\leftarrow r}}{|\mathcal{D}_{\leftarrow r}|}\right)\xrightarrow{p}\frac{1}{2T}\sum_{r=1}^{T}\left(\frac{d_w}{\operatorname{vol}(G)}\left(P^r\right)_{w,c}+\frac{d_c}{\operatorname{vol}(G)}\left(P^r\right)_{c,w}\right).$$

Further, marginalizing over $w$ and $c$ respectively reveals that $\frac{\#(w)}{|\mathcal{D}|}\xrightarrow{p}\frac{d_w}{\operatorname{vol}(G)}$ and $\frac{\#(c)}{|\mathcal{D}|}\xrightarrow{p}\frac{d_c}{\operatorname{vol}(G)}$ as $L\to\infty$. □
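As a sanity check on Thm. 2.2 (our own experiment, not from the paper; the graph, window size, and walk length are arbitrary choices), one can run a single long walk started from the stationary distribution, build $\mathcal{D}$ exactly as in Algorithm 1, and compare the empirical pair frequencies with the predicted limit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy undirected, connected, non-bipartite graph: a triangle plus a pendant vertex.
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
d = A.sum(axis=1)
vol_G = d.sum()
P = A / d[:, None]                     # P = D^-1 A (row-stochastic)
T, L = 2, 100_000                      # window size, walk length

# One long walk, first vertex drawn from the stationary distribution d/vol(G).
walk = [rng.choice(4, p=d / vol_G)]
for _ in range(L - 1):
    walk.append(rng.choice(4, p=P[walk[-1]]))

# Build the multiset D as in Algorithm 1 (both pair directions).
counts = np.zeros((4, 4))
for j in range(L - T):
    for r in range(1, T + 1):
        counts[walk[j], walk[j + r]] += 1
        counts[walk[j + r], walk[j]] += 1
freq = counts / counts.sum()           # empirical #(w, c) / |D|

# Limit predicted by Thm. 2.2.
limit = np.zeros((4, 4))
for r in range(1, T + 1):
    Q = (d[:, None] / vol_G) * np.linalg.matrix_power(P, r)
    limit += Q + Q.T
limit /= 2 * T
print(np.abs(freq - limit).max())      # shrinks toward 0 as L grows
```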
Theorem 2.3. For DeepWalk, when $L\to\infty$,

$$\frac{\#(w,c)\cdot|\mathcal{D}|}{\#(w)\cdot\#(c)}\xrightarrow{p}\frac{\operatorname{vol}(G)}{2T}\left(\frac{1}{d_c}\sum_{r=1}^{T}\left(P^r\right)_{w,c}+\frac{1}{d_w}\sum_{r=1}^{T}\left(P^r\right)_{c,w}\right).$$

In matrix form, DeepWalk is equivalent to factorizing

$$\log\left(\frac{\operatorname{vol}(G)}{T}\left(\sum_{r=1}^{T}P^r\right)D^{-1}\right)-\log b.\quad(7)$$

Proof. Using the results in Thm. 2.2 and the continuous mapping theorem, we get

$$\frac{\#(w,c)\cdot|\mathcal{D}|}{\#(w)\cdot\#(c)}=\frac{\#(w,c)/|\mathcal{D}|}{\left(\#(w)/|\mathcal{D}|\right)\left(\#(c)/|\mathcal{D}|\right)}\xrightarrow{p}\frac{\frac{1}{2T}\sum_{r=1}^{T}\left(\frac{d_w}{\operatorname{vol}(G)}\left(P^r\right)_{w,c}+\frac{d_c}{\operatorname{vol}(G)}\left(P^r\right)_{c,w}\right)}{\frac{d_w}{\operatorname{vol}(G)}\cdot\frac{d_c}{\operatorname{vol}(G)}}=\frac{\operatorname{vol}(G)}{2T}\left(\frac{1}{d_c}\sum_{r=1}^{T}\left(P^r\right)_{w,c}+\frac{1}{d_w}\sum_{r=1}^{T}\left(P^r\right)_{c,w}\right).$$

Writing this in matrix form,

$$\frac{\operatorname{vol}(G)}{2T}\left(\sum_{r=1}^{T}P^r D^{-1}+\sum_{r=1}^{T}D^{-1}\left(P^r\right)^{\top}\right)=\frac{\operatorname{vol}(G)}{2T}\left(\sum_{r=1}^{T}\underbrace{D^{-1}A\times\cdots\times D^{-1}A}_{r\text{ terms}}D^{-1}+\sum_{r=1}^{T}D^{-1}\underbrace{AD^{-1}\times\cdots\times AD^{-1}}_{r\text{ terms}}\right)$$
$$=\frac{\operatorname{vol}(G)}{T}\sum_{r=1}^{T}\underbrace{D^{-1}A\times\cdots\times D^{-1}A}_{r\text{ terms}}D^{-1}=\operatorname{vol}(G)\left(\frac{1}{T}\sum_{r=1}^{T}P^r\right)D^{-1}.\;\square$$

Corollary 2.4. Comparing Eq. 4 and Eq. 7, we can easily observe that LINE (2nd) is a special case of DeepWalk with $T=1$.

2.3 node2vec

node2vec [16] is a recently proposed network embedding method, briefly summarized in Algorithm 2. First, it defines an unnormalized transition probability tensor $\underline{T}$ with parameters $p$ and $q$, and then normalizes it to obtain the transition probabilities of a 2nd-order random walk (Line 1). Formally,

$$\underline{T}_{u,v,w}=\begin{cases}\frac{1}{p}&(u,v)\in E,\,(v,w)\in E,\,u=w;\\ 1&(u,v)\in E,\,(v,w)\in E,\,u\neq w,\,(w,u)\in E;\\ \frac{1}{q}&(u,v)\in E,\,(v,w)\in E,\,u\neq w,\,(w,u)\notin E;\\ 0&\text{otherwise},\end{cases}$$

$$P_{u,v,w}=\operatorname{Prob}\left(w_{j+1}=u\,|\,w_j=v,\,w_{j-1}=w\right)=\frac{\underline{T}_{u,v,w}}{\sum_{u}\underline{T}_{u,v,w}}.$$

Second, node2vec performs the 2nd-order random walks to generate a multiset $\mathcal{D}$ (Lines 2–8) and then trains an SGNS model on it (Line 9). To facilitate the analysis, we instead record triplets, rather than vertex-context pairs, to form the multiset $\mathcal{D}$ for node2vec. Take a vertex $w=w_j^n$ and its context $c=w_{j+r}^n$ as an example: we denote $u=w_{j-1}^n$ and add the triplet $(w,c,u)$ into $\mathcal{D}$ (Line 7).

Algorithm 2: node2vec
1  Construct transition probability tensor $\underline{P}$;
2  for n = 1, 2, ..., N do
3      Pick $w_1^n, w_2^n$ according to a distribution $P(w_1, w_2)$;
4      Generate a vertex sequence $(w_1^n,\cdots,w_L^n)$ of length $L$ by the 2nd-order random walk on network $G$;
5      for j = 2, 3, ..., L − T do
6          for r = 1, ..., T do
7              Add triplet $(w_j^n, w_{j+r}^n, w_{j-1}^n)$ to multiset $\mathcal{D}$;
8              Add triplet $(w_{j+r}^n, w_j^n, w_{j-1}^n)$ to multiset $\mathcal{D}$;
9  Run SGNS on multiset $\mathcal{D}'=\{(w,c):(w,c,u)\in\mathcal{D}\}$.

Similar to our analysis of DeepWalk, we partition the multiset $\mathcal{D}$ according to the way in which the vertex and its context appear in a random walk sequence. More formally, for $r=1,\cdots,T$, we define

$$\mathcal{D}_{\rightarrow r}=\left\{(w,c,u):(w,c,u)\in\mathcal{D},\,w_j^n=w,\,w_{j+r}^n=c,\,w_{j-1}^n=u\right\},$$
$$\mathcal{D}_{\leftarrow r}=\left\{(w,c,u):(w,c,u)\in\mathcal{D},\,w_{j+r}^n=w,\,w_j^n=c,\,w_{j-1}^n=u\right\}.$$

In addition, for a triplet $(w,c,u)$, we use $\#(w,c,u)_{\rightarrow r}$ and $\#(w,c,u)_{\leftarrow r}$ to denote the number of times it appears in $\mathcal{D}_{\rightarrow r}$ and $\mathcal{D}_{\leftarrow r}$, respectively.

In this analysis, we assume the first two vertices of node2vec's 2nd-order random walk are sampled from its stationary distribution $X$. The stationary distribution $X$ of the 2nd-order random walk satisfies $\sum_{w}P_{u,v,w}X_{v,w}=X_{u,v}$, and the existence of such an $X$ is guaranteed by the Perron-Frobenius theorem [4]. Additionally, the higher-order transition probability tensor is defined to be $\left(P^r\right)_{u,v,w}=\operatorname{Prob}\left(w_{j+r}=u\,|\,w_j=v,\,w_{j-1}=w\right)$.

Main Results. Limited by space, we list the main results for node2vec without proofs; the idea is similar to the analysis of DeepWalk.

• $\frac{\#(w,c,u)_{\rightarrow r}}{|\mathcal{D}_{\rightarrow r}|}\xrightarrow{p}X_{w,u}\left(P^r\right)_{c,w,u}$ and $\frac{\#(w,c,u)_{\leftarrow r}}{|\mathcal{D}_{\leftarrow r}|}\xrightarrow{p}X_{c,u}\left(P^r\right)_{w,c,u}$.
• $\frac{\#(w,c)_{\rightarrow r}}{|\mathcal{D}_{\rightarrow r}|}=\sum_{u}\frac{\#(w,c,u)_{\rightarrow r}}{|\mathcal{D}_{\rightarrow r}|}\xrightarrow{p}\sum_{u}X_{w,u}\left(P^r\right)_{c,w,u}$.
• $\frac{\#(w,c)_{\leftarrow r}}{|\mathcal{D}_{\leftarrow r}|}=\sum_{u}\frac{\#(w,c,u)_{\leftarrow r}}{|\mathcal{D}_{\leftarrow r}|}\xrightarrow{p}\sum_{u}X_{c,u}\left(P^r\right)_{w,c,u}$.
• $\frac{\#(w,c)}{|\mathcal{D}|}\xrightarrow{p}\frac{1}{2T}\sum_{r=1}^{T}\left(\sum_{u}X_{w,u}\left(P^r\right)_{c,w,u}+\sum_{u}X_{c,u}\left(P^r\right)_{w,c,u}\right)$.
• $\frac{\#(w)}{|\mathcal{D}|}\xrightarrow{p}\sum_{u}X_{w,u}$ and $\frac{\#(c)}{|\mathcal{D}|}\xrightarrow{p}\sum_{u}X_{c,u}$.

Combining these together, we conclude that, for node2vec, Eq. 6 has the form

$$\frac{\#(w,c)\cdot|\mathcal{D}|}{\#(w)\cdot\#(c)}\xrightarrow{p}\frac{\frac{1}{2T}\sum_{r=1}^{T}\left(\sum_{u}X_{w,u}\left(P^r\right)_{c,w,u}+\sum_{u}X_{c,u}\left(P^r\right)_{w,c,u}\right)}{\left(\sum_{u}X_{w,u}\right)\left(\sum_{u}X_{c,u}\right)}.\quad(8)$$

Though this closed form for node2vec has been achieved, we leave the formulation of its matrix form for future research. Note that both the computation and storage of the transition probability tensor $P^r$ and its corresponding stationary distribution $X$ are very expensive, making the modeling of the full 2nd-order dynamics difficult. However, we have noticed some recent progress [3, 4, 15] toward understanding or approximating the 2nd-order random walk by assuming a rank-one factorization $X_{u,v}=x_u x_v$ of its stationary distribution $X$. Due to the page limit, in the rest of this paper we mainly focus on the matrix factorization framework based on the 1st-order random walk (DeepWalk).
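The 2nd-order machinery above can be made concrete with a small sketch (our own illustration; the graph and the values of $p$ and $q$ are arbitrary) that materializes the unnormalized tensor and normalizes it over $u$, as in Line 1 of Algorithm 2:

```python
import numpy as np

# Toy undirected graph (a triangle plus a pendant) and node2vec parameters.
edges = {(0, 1), (0, 2), (1, 2), (2, 3)}
E = edges | {(v, u) for (u, v) in edges}   # store both directions
n, p, q = 4, 0.5, 2.0

# Unnormalized tensor T_raw[u, v, w], following the case definition above.
T_raw = np.zeros((n, n, n))
for v in range(n):
    for w in range(n):
        if (v, w) not in E:                # (v, w) must be an edge
            continue
        for u in range(n):
            if (u, v) not in E:            # (u, v) must be an edge
                continue
            if u == w:
                T_raw[u, v, w] = 1 / p     # return to the previous vertex
            elif (w, u) in E:
                T_raw[u, v, w] = 1.0       # u is also a neighbor of w
            else:
                T_raw[u, v, w] = 1 / q     # u is two hops away from w

# Normalize over u: P[u, v, w] = Prob(next = u | current = v, previous = w).
denom = T_raw.sum(axis=0, keepdims=True)
P = np.divide(T_raw, denom, out=np.zeros_like(T_raw), where=denom > 0)

print(P[:, 2, 0])   # distribution over the next vertex, at 2 having come from 0
```

Each valid $(v,w)$ slice of the tensor is a probability distribution over the next vertex $u$; storing the full tensor densely costs $O(|V|^3)$, which is exactly the expense noted above.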
3 NetMF: NETWORK EMBEDDING AS MATRIX FACTORIZATION

Based on the analysis in Section 2, we unify LINE, PTE, DeepWalk, and node2vec in the framework of matrix factorization, where the factorized matrices have the closed forms shown in Eq. 4, Eq. 5, Eq. 7, and Eq. 8, respectively. In this section, we study the DeepWalk matrix (Eq. 7), because it is more general than the LINE matrix and computationally more efficient than the node2vec matrix. We first discuss the connection between the DeepWalk matrix and the graph Laplacian in Section 3.1. Then, in Section 3.2, we present a matrix factorization based framework—NetMF—for network embedding.

3.1 Connection between the DeepWalk Matrix and the Normalized Graph Laplacian

In this section, we show that the DeepWalk matrix has a close relationship with the normalized graph Laplacian. To facilitate our analysis, we introduce the following four theorems.

Theorem 3.1. ([11]) For the normalized graph Laplacian $\mathcal{L}=I-D^{-1/2}AD^{-1/2}\in\mathbb{R}^{n\times n}$, all eigenvalues are real and lie in $[0,2]$, with $\lambda_{\min}(\mathcal{L})=0$. For a connected graph with $n>1$, $\lambda_{\max}(\mathcal{L})\geq n/(n-1)$.

Theorem 3.2. ([41]) The singular values of a real symmetric matrix are the absolute values of its eigenvalues.

Theorem 3.3. ([19]) Let $B,C$ be two $n\times n$ symmetric matrices. Then, for the decreasingly ordered singular values $\sigma$ of $B$, $C$, and $BC$, the inequality $\sigma_{i+j-1}(BC)\leq\sigma_i(B)\,\sigma_j(C)$ holds for any $1\leq i,j\leq n$ with $i+j\leq n+1$.

Theorem 3.4. ([41]) For a real symmetric matrix $A$, its Rayleigh quotient $R(A,x)=(x^{\top}Ax)/(x^{\top}x)$ satisfies $\lambda_{\min}(A)=\min_{x\neq 0}R(A,x)$ and $\lambda_{\max}(A)=\max_{x\neq 0}R(A,x)$.

By ignoring the element-wise matrix logarithm and the constant term in Eq. 7, we focus on the matrix $\left(\frac{1}{T}\sum_{r=1}^{T}P^r\right)D^{-1}$. By Thm. 3.1, $D^{-1/2}AD^{-1/2}=I-\mathcal{L}$ has the eigen-decomposition $U\Lambda U^{\top}$, with $U$ orthonormal and $\Lambda=\operatorname{diag}(\lambda_1,\cdots,\lambda_n)$, where $1=\lambda_1\geq\lambda_2\geq\cdots\geq\lambda_n\geq -1$ and $\lambda_n<0$. Based on this eigen-decomposition, $\left(\frac{1}{T}\sum_{r=1}^{T}P^r\right)D^{-1}$ can be decomposed into the product of three symmetric matrices:

$$\left(\frac{1}{T}\sum_{r=1}^{T}P^r\right)D^{-1}=D^{-1/2}\left(U\left(\frac{1}{T}\sum_{r=1}^{T}\Lambda^r\right)U^{\top}\right)D^{-1/2}.\quad(9)$$

The goal here is to characterize the spectrum of $\left(\frac{1}{T}\sum_{r=1}^{T}P^r\right)D^{-1}$. To achieve this, we first analyze the second matrix on the RHS of Eq. 9, and then extend our analysis to the target matrix on the LHS.

Spectrum of $U\left(\frac{1}{T}\sum_{r=1}^{T}\Lambda^r\right)U^{\top}$. This matrix has eigenvalues $\frac{1}{T}\sum_{r=1}^{T}\lambda_i^r$, $i=1,\cdots,n$, which can be treated as the output of a transformation applied to the eigenvalues $\lambda_i$ of $D^{-1/2}AD^{-1/2}$—that is, a kind of filter! The effect of this transformation (filter) $f(x)=\frac{1}{T}\sum_{r=1}^{T}x^r$ is plotted in Figure 1(a), from which we observe two properties. First, the filter prefers large positive eigenvalues; second, this preference becomes stronger as the window size $T$ increases. In other words, as $T$ grows, the filter approximates a low-rank positive semi-definite matrix by keeping the large positive eigenvalues.

[Figure 1: DeepWalk matrix as filtering. (a) The function $f(x)=\frac{1}{T}\sum_{r=1}^{T}x^r$ with $\operatorname{dom}f=[-1,1]$, for $T\in\{1,2,5,10\}$. (b) Eigenvalues of $D^{-1/2}AD^{-1/2}$, $U\left(\frac{1}{T}\sum_{r=1}^{T}\Lambda^r\right)U^{\top}$, and $\left(\frac{1}{T}\sum_{r=1}^{T}P^r\right)D^{-1}$ for the Cora network ($T=10$).]

Spectrum of $\left(\frac{1}{T}\sum_{r=1}^{T}P^r\right)D^{-1}$. Guided by Thm. 3.2, the decreasingly ordered singular values of the matrix $U\left(\frac{1}{T}\sum_{r=1}^{T}\Lambda^r\right)U^{\top}$ can be constructed by sorting the absolute values of its eigenvalues in non-increasing order such that

$$\left|\frac{1}{T}\sum_{r=1}^{T}\lambda_{p_1}^r\right|\geq\left|\frac{1}{T}\sum_{r=1}^{T}\lambda_{p_2}^r\right|\geq\cdots\geq\left|\frac{1}{T}\sum_{r=1}^{T}\lambda_{p_n}^r\right|,$$

where $\{p_1,p_2,\cdots,p_n\}$ is a permutation of $\{1,2,\cdots,n\}$. Similarly, since every $d_i$ is positive, the decreasingly ordered singular values of the matrix $D^{-1/2}$ can be constructed by sorting $1/\sqrt{d_i}$ in non-increasing order such that $1/\sqrt{d_{q_1}}\geq 1/\sqrt{d_{q_2}}\geq\cdots\geq 1/\sqrt{d_{q_n}}$, where $\{q_1,q_2,\cdots,q_n\}$ is a permutation of $\{1,2,\cdots,n\}$. In particular, $d_{q_1}=d_{\min}$ is the smallest vertex degree. By applying Thm. 3.3 twice, we see that the $s$-th singular value satisfies

$$\sigma_s\left(\left(\frac{1}{T}\sum_{r=1}^{T}P^r\right)D^{-1}\right)\leq\sigma_1\left(D^{-1/2}\right)\sigma_s\left(U\left(\frac{1}{T}\sum_{r=1}^{T}\Lambda^r\right)U^{\top}\right)\sigma_1\left(D^{-1/2}\right)=\frac{1}{\sqrt{d_{q_1}}}\left|\frac{1}{T}\sum_{r=1}^{T}\lambda_{p_s}^r\right|\frac{1}{\sqrt{d_{q_1}}}=\frac{1}{d_{\min}}\left|\frac{1}{T}\sum_{r=1}^{T}\lambda_{p_s}^r\right|,\quad(10)$$

which reveals that the magnitude of the eigenvalues of $\left(\frac{1}{T}\sum_{r=1}^{T}P^r\right)D^{-1}$ is always bounded by the magnitude of the eigenvalues of $U\left(\frac{1}{T}\sum_{r=1}^{T}\Lambda^r\right)U^{\top}$ (up to the factor $1/d_{\min}$). In addition to the magnitudes, we also want to bound the smallest eigenvalue. Observe that the Rayleigh quotient of $\left(\frac{1}{T}\sum_{r=1}^{T}P^r\right)D^{-1}$ has the following lower bound:

$$R\left(\left(\frac{1}{T}\sum_{r=1}^{T}P^r\right)D^{-1},x\right)=R\left(U\left(\frac{1}{T}\sum_{r=1}^{T}\Lambda^r\right)U^{\top},D^{-1/2}x\right)R\left(D^{-1},x\right)\geq\lambda_{\min}\left(U\left(\frac{1}{T}\sum_{r=1}^{T}\Lambda^r\right)U^{\top}\right)\lambda_{\max}\left(D^{-1}\right)=\frac{1}{d_{\min}}\lambda_{\min}\left(U\left(\frac{1}{T}\sum_{r=1}^{T}\Lambda^r\right)U^{\top}\right).$$

By applying Thm. 3.4, we can thus bound the smallest eigenvalue of $\left(\frac{1}{T}\sum_{r=1}^{T}P^r\right)D^{-1}$ by the smallest eigenvalue of $U\left(\frac{1}{T}\sum_{r=1}^{T}\Lambda^r\right)U^{\top}$:

$$\lambda_{\min}\left(\left(\frac{1}{T}\sum_{r=1}^{T}P^r\right)D^{-1}\right)\geq\frac{1}{d_{\min}}\lambda_{\min}\left(U\left(\frac{1}{T}\sum_{r=1}^{T}\Lambda^r\right)U^{\top}\right).$$
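The filtering effect can also be checked numerically on any small graph (our own sketch, not from the paper; the random graph and $T=10$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 30

# Random symmetric 0/1 adjacency matrix (illustrative only).
upper = np.triu(rng.random((n, n)) < 0.2, k=1)
A = (upper | upper.T).astype(float)
A += np.diag((A.sum(axis=1) == 0).astype(float))   # self-loop for isolated vertices

d = A.sum(axis=1)
S = np.diag(d ** -0.5) @ A @ np.diag(d ** -0.5)    # D^-1/2 A D^-1/2 = I - L

lam = np.linalg.eigvalsh(S)                        # real eigenvalues in [-1, 1]
T = 10
filtered = sum(lam ** r for r in range(1, T + 1)) / T   # f applied to each eigenvalue

# The filter keeps large positive eigenvalues and crushes negative
# and small positive ones toward zero:
print(lam.min(), filtered.min())
```

With $T=10$, $f$ is bounded below by roughly $-0.04$ on $[-1,1]$, so the filtered spectrum is nearly positive semi-definite even when the original spectrum reaches far below zero.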
Illustrative Example: Cora. To illustrate the filtering effect discussed above, we analyze a small citation network, Cora [27]. We make the citation links undirected and choose the largest connected component. In Figure 1(b), we plot the decreasingly ordered eigenvalues of the matrices $D^{-1/2}AD^{-1/2}$, $U\left(\frac{1}{T}\sum_{r=1}^{T}\Lambda^r\right)U^{\top}$, and $\left(\frac{1}{T}\sum_{r=1}^{T}P^r\right)D^{-1}$, with $T=10$. For $D^{-1/2}AD^{-1/2}$, the largest eigenvalue is $\lambda_1=1$ and the smallest eigenvalue is $\lambda_n=-0.971$. For $U\left(\frac{1}{T}\sum_{r=1}^{T}\Lambda^r\right)U^{\top}$, we observe that all of its negative eigenvalues and its small positive eigenvalues are "filtered out" of the spectrum. Finally, for the matrix $\left(\frac{1}{T}\sum_{r=1}^{T}P^r\right)D^{-1}$, we observe that both the magnitude of its eigenvalues and its smallest eigenvalue are bounded by those of $U\left(\frac{1}{T}\sum_{r=1}^{T}\Lambda^r\right)U^{\top}$.

3.2 NetMF

Built upon the theoretical analysis above, we propose a matrix factorization framework, NetMF, for empirically understanding and improving DeepWalk and LINE. For simplicity, we denote $M=\frac{\operatorname{vol}(G)}{bT}\sum_{r=1}^{T}P^r D^{-1}$ and refer to $\log M$ as the DeepWalk matrix.

NetMF for a Small Window Size T. NetMF for a small $T$ is quite intuitive: the basic idea is to directly compute and factorize the DeepWalk matrix. The details are listed in Algorithm 3. In the first step (Lines 1–2), we compute the matrix powers $P^1$ through $P^T$ and then obtain $M$. However, the factorization of $\log M$ presents computational challenges due to the element-wise matrix logarithm: the matrix is not only ill-defined (since $\log 0=-\infty$) but also dense. Inspired by the Shifted PPMI approach [24], we define $M'$ such that $M'_{i,j}=\max(M_{i,j},1)$ (Line 3). In this way, $\log M'$ is a sparse and well-defined version of $\log M$. Finally, we factorize $\log M'$ using Singular Value Decomposition (SVD) and construct the network embedding from its top-$d$ singular values/vectors (Lines 4–5).

Algorithm 3: NetMF for a Small Window Size T
1  Compute $P^1,\cdots,P^T$;
2  Compute $M=\frac{\operatorname{vol}(G)}{bT}\left(\sum_{r=1}^{T}P^r\right)D^{-1}$;
3  Compute $M'=\max(M,1)$;
4  Rank-$d$ approximation by SVD: $\log M'=U_d\Sigma_d V_d^{\top}$;
5  return $U_d\sqrt{\Sigma_d}$ as the network embedding.

NetMF for a Large Window Size T. The direct computation of the matrix $M$ presents computational challenges for a large window size $T$, mainly due to its high time complexity. Hereby we propose an approximation algorithm, listed in Algorithm 4. The general idea comes from our analysis in Section 3.1, wherein we reveal $M$'s close relationship with the normalized graph Laplacian and show its low-rank nature theoretically. In our algorithm, we first approximate $D^{-1/2}AD^{-1/2}$ with its top-$h$ eigenpairs $U_h\Lambda_h U_h^{\top}$ (Line 1). Since only the top-$h$ eigenpairs are required and the involved matrix is sparse, we can use the Arnoldi method [22] to achieve a significant reduction in time. In the second step (Line 2), we approximate $M$ with $\hat{M}=\frac{\operatorname{vol}(G)}{b}D^{-1/2}U_h\left(\frac{1}{T}\sum_{r=1}^{T}\Lambda_h^r\right)U_h^{\top}D^{-1/2}$. The final step is the same as in Algorithm 3: we form $\hat{M}'=\max(\hat{M},1)$ (Line 3) and then perform SVD on $\log\hat{M}'$ to get the network embedding (Lines 4–5).

Algorithm 4: NetMF for a Large Window Size T
1  Eigen-decomposition $D^{-1/2}AD^{-1/2}\approx U_h\Lambda_h U_h^{\top}$;
2  Approximate $M$ with $\hat{M}=\frac{\operatorname{vol}(G)}{b}D^{-1/2}U_h\left(\frac{1}{T}\sum_{r=1}^{T}\Lambda_h^r\right)U_h^{\top}D^{-1/2}$;
3  Compute $\hat{M}'=\max(\hat{M},1)$;
4  Rank-$d$ approximation by SVD: $\log\hat{M}'=U_d\Sigma_d V_d^{\top}$;
5  return $U_d\sqrt{\Sigma_d}$ as the network embedding.

For NetMF with large window sizes, we develop the following error-bound theorem for the approximation of $M$ and the approximation of $\log M'$.

Theorem 3.5. Let $\|\cdot\|_F$ be the matrix Frobenius norm. Then

$$\left\|M-\hat{M}\right\|_F\leq\frac{\operatorname{vol}(G)}{b\,d_{\min}}\sqrt{\sum_{j=h+1}^{n}\left(\frac{1}{T}\sum_{r=1}^{T}\lambda_j^r\right)^2};$$
$$\left\|\log M'-\log\hat{M}'\right\|_F\leq\left\|M'-\hat{M}'\right\|_F\leq\left\|M-\hat{M}\right\|_F.$$

Proof. See Appendix. □

Thm. 3.5 reveals that the error in approximating $\log M'$ is bounded by the error bound for the approximation of $M$. Nevertheless, the major drawback of NetMF lies in the element-wise matrix logarithm: since good tools are currently not available for analyzing this operator, we have to compute it explicitly even after we have achieved a good low-rank approximation of $M$.

4 EXPERIMENTS

In this section, we evaluate the proposed NetMF method on the multi-label vertex classification task, which has also been used in the works on DeepWalk, LINE, and node2vec.

Datasets. We employ four widely used datasets for this task. Their statistics are listed in Table 2.

Table 2: Statistics of Datasets.

Dataset    BlogCatalog   PPI      Wikipedia   Flickr
|V|        10,312        3,890    4,777       80,513
|E|        333,983       76,584   184,812     5,899,882
#Labels    39            50       40          195

BlogCatalog [38] is a network of social relationships among online bloggers. The vertex labels represent the interests of the bloggers.

Protein-Protein Interactions (PPI) [35] is a subgraph of the PPI network for Homo sapiens. The labels are obtained from the hallmark gene sets and represent biological states.

Wikipedia³ is a co-occurrence network of words appearing in the first million bytes of the Wikipedia dump. The labels are the Part-of-Speech (POS) tags inferred by the Stanford POS-Tagger [40].

³ http://mattmahoney.net/dc/text.html

Flickr [38] is the user contact network of Flickr. The labels represent the interest groups of the users.

Baseline Methods. We compare our methods NetMF ($T=1$) and NetMF ($T=10$) with LINE (2nd) [37] and DeepWalk [31], which we have introduced in previous sections. For NetMF ($T=10$), we choose $h=16384$ for Flickr, and $h=256$ for BlogCatalog, PPI, and
[Figure 2 plot data omitted: four panels (BlogCatalog, PPI, Wikipedia, Flickr), each with Micro-F1 (top row) and Macro-F1 (bottom row) curves for NetMF (T=1), LINE, NetMF (T=10), and DeepWalk.]

Figure 2: Predictive performance on varying the ratio of training data. The x-axis represents the ratio of labeled data (%), and the y-axes in the top and bottom rows denote the Micro-F1 and Macro-F1 scores, respectively.
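To make the factorization pipeline evaluated above concrete, the following is an illustrative NumPy sketch of the small-window NetMF procedure described in Section 3.2 (Algorithm 3): form the DeepWalk matrix $M$, truncate the element-wise logarithm as $M' = \max(M, 1)$, and factorize $\log M'$ by SVD. This is a toy reimplementation, not the authors' released code; the graph `A` and the parameters `T`, `b`, and `d` below are made-up assumptions for illustration.

```python
import numpy as np

def netmf_small_t(A, T=2, b=1, d=2):
    vol = A.sum()                            # vol(G): sum of vertex degrees
    D_inv = np.diag(1.0 / A.sum(axis=1))
    P = D_inv @ A                            # transition matrix P = D^{-1} A
    # M = vol(G)/(bT) * (sum_{r=1}^{T} P^r) D^{-1}
    S = sum(np.linalg.matrix_power(P, r) for r in range(1, T + 1))
    M = (vol / (b * T)) * S @ D_inv
    log_M = np.log(np.maximum(M, 1.0))       # Shifted-PPMI-style truncation M' = max(M, 1)
    U, sigma, _ = np.linalg.svd(log_M)       # factorize log M'
    return U[:, :d] * np.sqrt(sigma[:d])     # embedding U_d * sqrt(Sigma_d)

# Tiny 4-vertex cycle as a toy graph.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
emb = netmf_small_t(A)
print(emb.shape)  # (4, 2)
```

For the large-window variant, the same truncation and SVD steps would be applied to a low-rank approximation $\hat{M}$ instead of the exact $M$.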

Wikipedia. For DeepWalk, we present its results with the authors' preferred parameters—window size 10, walk length 40, and 80 walks starting from each vertex. Finally, we set the embedding dimension to 128 for all methods.

Prediction Setting Following the same experimental procedure as DeepWalk [31], we randomly sample a portion of labeled vertices for training and use the rest for testing. For the BlogCatalog, PPI, and Wikipedia datasets, the training ratio is varied from 10% to 90%. For Flickr, the training ratio is varied from 1% to 10%. We use the one-vs-rest logistic regression model implemented by LIBLINEAR [14] for the multi-label classification task. In the test phase, the one-vs-rest model yields a ranking of labels rather than an exact label assignment. To avoid the thresholding effect [39], we assume that the number of labels for test data is given [31, 39]. We repeat the prediction procedure 10 times and evaluate the performance in terms of average Micro-F1 and average Macro-F1 [42].

Experimental Results Figure 2 summarizes the prediction performance of all methods on the four datasets, and Table 3 lists the quantitative and relative gaps between our methods and the baselines. Specifically, we show NetMF (T = 1)'s relative performance gain over LINE (2nd) and NetMF (T = 10)'s relative improvement over DeepWalk, as each pair shares the same window size T. We have the following key observations and insights:

(1) On BlogCatalog, PPI, and Flickr, the proposed NetMF method (T = 10) achieves significantly better predictive performance than the baseline approaches as measured by both Micro-F1 and Macro-F1, demonstrating the effectiveness of the theoretical foundation we lay out for network embedding.

(2) On Wikipedia, NetMF (T = 1) shows better performance than the other methods in terms of Micro-F1, while LINE outperforms the other methods regarding Macro-F1. This observation implies that short-term dependence is enough to model Wikipedia's network structure. This is because the Wikipedia network used here is a dense word co-occurrence network with an average degree of 77.11, in which an edge connects a pair of words if they co-occur within a length-two window in the Wikipedia corpus.

(3) As shown in Table 3, the proposed NetMF method (T = 10 and T = 1) outperforms DeepWalk and LINE by large margins in most cases when sparsely labeled vertices are provided. Take the PPI dataset with 10% training data as an example: NetMF (T = 1) achieves relative gains of 46.34% and 33.85% over LINE (2nd) in Micro-F1 and Macro-F1, respectively; more impressively, NetMF (T = 10) outperforms DeepWalk by 50.71% and 39.16% relative on the two metrics.

(4) DeepWalk tries to approximate the exact vertex-context joint distribution with an empirical distribution obtained through random walk sampling. Although convergence is guaranteed by the law of large numbers, gaps still exist between the exact and estimated distributions due to the large size of real-world networks and the relatively limited scale of random walks in practice (e.g., the number of walks and the walk length), negatively affecting DeepWalk's performance.

5 RELATED WORK

The story of network embedding stems from Spectral Clustering [5, 45], a data clustering technique that selects eigenvalues/eigenvectors of a data affinity matrix to obtain representations that can be clustered or embedded in a low-dimensional space. Spectral Clustering has been widely used in fields such as community detection [23] and image segmentation [33]. In recent years, there has been increasing interest in network embedding. Following a few pioneering works such as SocDim [38] and DeepWalk [31], a growing body of literature has tried to address the problem from various perspectives, such as heterogeneous network embedding [8, 12, 20, 36], semi-supervised network embedding [17, 21, 44, 48], network embedding with rich vertex attributes [43, 47, 49], network embedding
Table 3: Micro/Macro-F1 Score (%) for Multilabel Classification on the BlogCatalog, PPI, Wikipedia, and Flickr datasets. In Flickr, 1% of vertices are labeled for training [31]; in the other three datasets, 10% of vertices are labeled for training.

Algorithm                        | BlogCatalog (10%) | PPI (10%)     | Wikipedia (10%) | Flickr (1%)
                                 | Micro    Macro    | Micro  Macro  | Micro   Macro   | Micro   Macro
LINE (2nd)                       | 23.64    13.91    | 10.94  9.04   | 41.77   9.72    | 25.18   9.32
NetMF (T = 1)                    | 33.04    14.86    | 16.01  12.10  | 49.90   9.25    | 23.87   6.44
Relative Gain of NetMF (T = 1)   | 39.76%   6.83%    | 46.34% 33.85% | 19.46%  -4.84%  | -5.20%  -30.90%
DeepWalk                         | 29.32    18.38    | 12.05  10.29  | 36.08   8.38    | 26.21   12.43
NetMF (T = 10)                   | 38.36    22.90    | 18.16  14.32  | 46.21   8.38    | 29.95   13.50
Relative Gain of NetMF (T = 10)  | 30.83%   24.59%   | 50.71% 39.16% | 28.08%  0.00%   | 14.27%  8.93%
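The label-assignment step behind these scores (a one-vs-rest model produces per-label scores, and each test vertex is assigned its top-k labels, with k given as the vertex's true label count [31, 39]) can be sketched as follows. The scores and helper names here are illustrative assumptions, not the paper's code.

```python
import numpy as np

def assign_top_k(scores, label_counts):
    """scores: (n_samples, n_labels) classifier outputs;
    label_counts[i]: number of true labels of sample i (assumed known)."""
    pred = np.zeros_like(scores, dtype=int)
    for i, k in enumerate(label_counts):
        top = np.argsort(scores[i])[::-1][:k]   # indices of the k highest scores
        pred[i, top] = 1
    return pred

def micro_f1(y_true, y_pred):
    # Micro-F1 pools true/false positives and negatives over all labels.
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

# Toy scores standing in for one-vs-rest logistic-regression outputs.
scores = np.array([[0.9, 0.2, 0.7],
                   [0.1, 0.8, 0.3]])
y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
pred = assign_top_k(scores, y_true.sum(axis=1))
print(micro_f1(y_true, pred))  # 1.0 — top-k recovers the true labels here
```

Note that with k fixed to the true label count, false positives and false negatives are always balanced, so Micro-F1 equals both precision and recall under this protocol.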

with high order structure [6, 16], signed network embedding [10], directed network embedding [30], network embedding via deep neural networks [7, 25, 46], etc.

Among the above research, a commonly used technique is to define a "context" for each vertex and then train a predictive model to perform context prediction. For example, DeepWalk [31], node2vec [16], and metapath2vec [12] define a vertex's context by 1st-order, 2nd-order, and meta-path-based random walks, respectively. The idea of leveraging context information is largely motivated by the skip-gram model with negative sampling (SGNS) [29]. Recently, there has been effort in understanding this model. For example, Levy and Goldberg [24] prove that SGNS is actually conducting an implicit matrix factorization, which provides us with a tool to analyze the above network embedding models; Arora et al. [1] propose a generative model, RAND-WALK, to explain word embedding models; and Hashimoto et al. [18] frame word embedding as a metric learning problem. Built upon the work in [24], we theoretically analyze popular skip-gram based network embedding models and connect them with spectral graph theory.

6 CONCLUSION

In this work, we provide a theoretical analysis of four impactful network embedding methods—DeepWalk, LINE, PTE, and node2vec—that were recently proposed between the years 2014 and 2016. We show that all four methods are essentially performing implicit matrix factorizations, and the closed forms of their matrices offer not only the relationships between those methods but also their intrinsic connections with the graph Laplacian. We further propose NetMF—a general framework to explicitly factorize the closed-form matrices that DeepWalk and LINE aim to implicitly approximate and factorize. Our extensive experiments suggest that NetMF's direct factorization achieves consistent performance improvements over the implicit approximation models—DeepWalk and LINE.

In the future, we would like to further explore promising directions to deepen our understanding of network embedding. It would be necessary to investigate whether and how developments in random-walk polynomials [9] can support fast approximations of the closed-form matrices. The computation and approximation of the 2nd-order random walks employed by node2vec is another interesting topic to follow. Finally, it is exciting to study the nature of skip-gram based dynamic and heterogeneous network embedding.

Acknowledgements. We thank Hou Pong Chan for his comments. Jiezhong Qiu and Jie Tang are supported by NSFC 61561130160. Jian Li is supported in part by the National Basic Research Program of China Grant 2015CB358700, and NSFC 61772297 & 61632016.

APPENDIX

Theorem 2.1. Denote $P = D^{-1}A$. When $L \to \infty$, we have

$$\frac{\#(w,c)_{\to r}}{|\mathcal{D}_{\to r}|} \xrightarrow{\;p\;} \frac{d_w}{\operatorname{vol}(G)}\,(P^r)_{w,c} \quad\text{and}\quad \frac{\#(w,c)_{\leftarrow r}}{|\mathcal{D}_{\leftarrow r}|} \xrightarrow{\;p\;} \frac{d_c}{\operatorname{vol}(G)}\,(P^r)_{c,w}.$$

Lemma 6.1 (S. N. Bernstein Law of Large Numbers [34]). Let $Y_1, Y_2, \ldots$ be a sequence of random variables with finite expectation $\operatorname{E}[Y_j]$ and variance $\operatorname{Var}(Y_j) < K$, $j \ge 1$, whose covariances satisfy $\operatorname{Cov}(Y_i, Y_j) \to 0$ as $|i - j| \to \infty$. Then the law of large numbers (LLN) holds.

Proof (of Theorem 2.1). First consider the special case $N = 1$, so we only have one vertex sequence $w_1, \ldots, w_L$ generated by the random walk described in Algorithm 1. Consider one particular vertex-context pair $(w, c)$, and let $Y_j$ ($j = 1, \ldots, L - T$) be the indicator of the event that $w_j = w$ and $w_{j+r} = c$. We have the following two observations:

- The quantity $\#(w,c)_{\to r} / |\mathcal{D}_{\to r}|$ is the sample average of the $Y_j$'s, i.e.,
$$\frac{\#(w,c)_{\to r}}{|\mathcal{D}_{\to r}|} = \frac{1}{L-T}\sum_{j=1}^{L-T} Y_j.$$

- Based on our assumptions about the graph and the random walk, $\operatorname{E}[Y_j] = \frac{d_w}{\operatorname{vol}(G)}(P^r)_{w,c}$, and when $j > i + r$,
$$\operatorname{E}[Y_i Y_j] = \operatorname{Prob}\left(w_i = w,\, w_{i+r} = c,\, w_j = w,\, w_{j+r} = c\right) = \frac{d_w}{\operatorname{vol}(G)}\,(P^r)_{w,c}\left(P^{\,j-i-r}\right)_{c,w}(P^r)_{w,c}.$$

In this way, we can evaluate the covariance $\operatorname{Cov}(Y_i, Y_j)$ for $j > i + r$ and compute its limit as $j - i \to \infty$, using the fact that our random walk converges to its stationary distribution:

$$\operatorname{Cov}(Y_i, Y_j) = \operatorname{E}[Y_i Y_j] - \operatorname{E}[Y_i]\operatorname{E}[Y_j] = \frac{d_w}{\operatorname{vol}(G)}(P^r)_{w,c}\left[\left(P^{\,j-i-r}\right)_{c,w} - \frac{d_w}{\operatorname{vol}(G)}\right](P^r)_{w,c} \to 0,$$

since the bracketed term goes to 0 as $j - i \to \infty$. Then we can apply Lemma 6.1 and conclude that the sample average converges in probability to the expected value, i.e.,

$$\frac{\#(w,c)_{\to r}}{|\mathcal{D}_{\to r}|} = \frac{1}{L-T}\sum_{j=1}^{L-T} Y_j \xrightarrow{\;p\;} \frac{1}{L-T}\sum_{j=1}^{L-T}\operatorname{E}[Y_j] = \frac{d_w}{\operatorname{vol}(G)}(P^r)_{w,c}.$$

Similarly, we also have $\frac{\#(w,c)_{\leftarrow r}}{|\mathcal{D}_{\leftarrow r}|} \xrightarrow{p} \frac{d_c}{\operatorname{vol}(G)}(P^r)_{c,w}$.

For the general case $N > 1$, we define $Y_j^n$ ($n = 1, \ldots, N$, $j = 1, \ldots, L - T$) to be the indicator of the event $w_j^n = w$ and $w_{j+r}^n = c$, and organize the $Y_j^n$'s as $Y_1^1, Y_1^2, \ldots, Y_1^N, Y_2^1, Y_2^2, \ldots, Y_2^N, \ldots$. This sequence of random variables still satisfies the conditions of the S. N. Bernstein LLN, so the above conclusion still holds. □
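As a quick Monte-Carlo sanity check of the convergence stated in Theorem 2.1 (a toy setup of our choosing, not an experiment from the paper): sample one long random walk, estimate the empirical frequency of the event that context $c$ occurs $r$ steps after vertex $w$, and compare it with $\frac{d_w}{\operatorname{vol}(G)}(P^r)_{w,c}$.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy undirected graph on 4 vertices.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
deg = A.sum(axis=1)
P = A / deg[:, None]                      # transition matrix P = D^{-1} A
vol = A.sum()                             # vol(G)

L, r, w, c = 100_000, 2, 0, 3             # walk length, distance, pair (w, c)
walk = [0]
for _ in range(L - 1):
    walk.append(rng.choice(4, p=P[walk[-1]]))
walk = np.array(walk)

# Empirical frequency of (w_j = w, w_{j+r} = c) over all positions j.
emp = np.mean((walk[:-r] == w) & (walk[r:] == c))
theory = deg[w] / vol * np.linalg.matrix_power(P, r)[w, c]
print(abs(emp - theory) < 0.01)  # True for a walk this long
```

Here `theory` equals $0.2 \times (P^2)_{0,3}$; the empirical estimate approaches it as $L$ grows, in line with the Bernstein LLN argument above.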
Theorem 3.5. Let $\|\cdot\|_F$ be the matrix Frobenius norm. Then

$$\left\|M - \hat{M}\right\|_F \le \frac{\operatorname{vol}(G)}{b\, d_{\min}} \sqrt{\sum_{j=k+1}^{n} \left(\frac{1}{T}\sum_{r=1}^{T} \lambda_j^r\right)^2};$$

$$\left\|\log M' - \log \hat{M}'\right\|_F \le \left\|M' - \hat{M}'\right\|_F \le \left\|M - \hat{M}\right\|_F.$$

Proof. The first inequality can be seen by applying the definition of the Frobenius norm and Eq. 10.

For the second inequality, we first show $\|\log M' - \log \hat{M}'\|_F \le \|M' - \hat{M}'\|_F$. According to the definition of the Frobenius norm, it is sufficient to show $|\log M'_{i,j} - \log \hat{M}'_{i,j}| \le |\hat{M}'_{i,j} - M'_{i,j}|$ for any $i, j$. Without loss of generality, assume $M'_{i,j} \le \hat{M}'_{i,j}$. Then

$$\log \hat{M}'_{i,j} - \log M'_{i,j} = \log\!\left(\frac{\hat{M}'_{i,j}}{M'_{i,j}}\right) = \log\!\left(1 + \frac{\hat{M}'_{i,j} - M'_{i,j}}{M'_{i,j}}\right) \le \frac{\hat{M}'_{i,j} - M'_{i,j}}{M'_{i,j}} \le \hat{M}'_{i,j} - M'_{i,j} = \left|\hat{M}'_{i,j} - M'_{i,j}\right|,$$

where the first inequality holds because $\log(1+x) \le x$ for $x \ge 0$, and the second because $M'_{i,j} = \max(M_{i,j}, 1) \ge 1$. Next we show $\|M' - \hat{M}'\|_F \le \|M - \hat{M}\|_F$. It is sufficient to show $|M'_{i,j} - \hat{M}'_{i,j}| \le |M_{i,j} - \hat{M}_{i,j}|$ for any $i, j$. Recalling the definitions of $M'$ and $\hat{M}'$, we get

$$\left|M'_{i,j} - \hat{M}'_{i,j}\right| = \left|\max(M_{i,j}, 1) - \max(\hat{M}_{i,j}, 1)\right| \le \left|M_{i,j} - \hat{M}_{i,j}\right|. \qquad\Box$$

REFERENCES
[1] Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2016. A latent variable model approach to PMI-based word embeddings. TACL 4 (2016), 385–399.
[2] Yoshua Bengio, Aaron Courville, and Pierre Vincent. 2013. Representation learning: A review and new perspectives. IEEE TPAMI 35, 8 (2013), 1798–1828.
[3] Austin R Benson, David F Gleich, and Jure Leskovec. 2015. Tensor spectral clustering for partitioning higher-order network structures. In SDM. SIAM, 118–126.
[4] Austin R Benson, David F Gleich, and Lek-Heng Lim. 2017. The Spacey Random Walk: A Stochastic Process for Higher-Order Data. SIAM Rev. 59, 2 (2017), 321–345.
[5] Matthew Brand and Kun Huang. 2003. A unifying theorem for spectral embedding and clustering. In AISTATS.
[6] Shaosheng Cao, Wei Lu, and Qiongkai Xu. 2015. GraRep: Learning graph representations with global structural information. In CIKM. ACM, 891–900.
[7] Shaosheng Cao, Wei Lu, and Qiongkai Xu. 2016. Deep Neural Networks for Learning Graph Representations. In AAAI. 1145–1152.
[8] Shiyu Chang, Wei Han, Jiliang Tang, Guo-Jun Qi, Charu C Aggarwal, and Thomas S Huang. 2015. Heterogeneous network embedding via deep architectures. In KDD. ACM, 119–128.
[9] Dehua Cheng, Yu Cheng, Yan Liu, Richard Peng, and Shang-Hua Teng. 2015. Efficient sampling for Gaussian graphical models via spectral sparsification. In COLT. 364–390.
[10] Kewei Cheng, Jundong Li, and Huan Liu. 2017. Unsupervised Feature Selection in Signed Social Networks. In KDD. ACM.
[11] Fan RK Chung. 1997. Spectral graph theory. Number 92. American Mathematical Soc.
[12] Yuxiao Dong, Nitesh V Chawla, and Ananthram Swami. 2017. metapath2vec: Scalable Representation Learning for Heterogeneous Networks. In KDD.
[13] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning about a Highly Connected World. Cambridge University Press.
[14] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. JMLR 9, Aug (2008), 1871–1874.
[15] David F Gleich, Lek-Heng Lim, and Yongyang Yu. 2015. Multilinear PageRank. SIAM J. Matrix Anal. Appl. 36, 4 (2015), 1507–1541.
[16] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In KDD. ACM, 855–864.
[17] William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In NIPS.
[18] Tatsunori B Hashimoto, David Alvarez-Melis, and Tommi S Jaakkola. 2016. Word embeddings as metric recovery in semantic spaces. TACL 4 (2016), 273–286.
[19] Roger A. Horn and Charles R. Johnson. 1991. Topics in Matrix Analysis. Cambridge University Press. https://doi.org/10.1017/CBO9780511840371
[20] Yann Jacob, Ludovic Denoyer, and Patrick Gallinari. 2014. Learning latent representations of nodes for classifying in heterogeneous social networks. In WSDM. ACM, 373–382.
[21] Thomas N Kipf and Max Welling. 2016. Semi-Supervised Classification with Graph Convolutional Networks. arXiv preprint arXiv:1609.02907 (2016).
[22] Richard B Lehoucq, Danny C Sorensen, and Chao Yang. 1998. ARPACK users' guide: solution of large-scale eigenvalue problems with implicitly restarted Arnoldi methods. SIAM.
[23] Jure Leskovec, Kevin J Lang, and Michael Mahoney. 2010. Empirical comparison of algorithms for network community detection. In WWW. ACM, 631–640.
[24] Omer Levy and Yoav Goldberg. 2014. Neural Word Embedding as Implicit Matrix Factorization. In NIPS. 2177–2185.
[25] Hang Li, Haozheng Wang, Zhenglu Yang, and Masato Odagaki. 2017. Variation Autoencoder Based Network Representation Learning for Classification. In ACL. 56.
[26] László Lovász. 1993. Random walks on graphs. Combinatorics, Paul Erdős is eighty 2 (1993), 1–46.
[27] Qing Lu and Lise Getoor. 2003. Link-based Classification. In ICML.
[28] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
[29] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In NIPS. 3111–3119.
[30] Mingdong Ou, Peng Cui, Jian Pei, Ziwei Zhang, and Wenwu Zhu. 2016. Asymmetric Transitivity Preserving Graph Embedding. In KDD. 1105–1114.
[31] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online learning of social representations. In KDD. ACM, 701–710.
[32] Philip S Yu, Jiawei Han, and Christos Faloutsos. 2010. Link mining: Models, algorithms, and applications. Springer.
[33] Jianbo Shi and Jitendra Malik. 2000. Normalized cuts and image segmentation. IEEE PAMI 22, 8 (2000), 888–905.
[34] A.N. Shiryaev and A. Lyasoff. 2012. Problems in Probability. Springer New York.
[35] Chris Stark, Bobby-Joe Breitkreutz, Andrew Chatr-Aryamontri, Lorrie Boucher, Rose Oughtred, Michael S Livstone, Julie Nixon, Kimberly Van Auken, Xiaodong Wang, Xiaoqi Shi, et al. 2010. The BioGRID interaction database: 2011 update. Nucleic Acids Research 39, suppl_1 (2010), D698–D704.
[36] Jian Tang, Meng Qu, and Qiaozhu Mei. 2015. PTE: Predictive text embedding through large-scale heterogeneous text networks. In KDD. ACM, 1165–1174.
[37] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. LINE: Large-scale information network embedding. In WWW. 1067–1077.
[38] Lei Tang and Huan Liu. 2009. Relational learning via latent social dimensions. In KDD. ACM, 817–826.
[39] Lei Tang, Suju Rajan, and Vijay K Narayanan. 2009. Large scale multi-label classification via metalabeler. In WWW. ACM, 211–220.
[40] Kristina Toutanova, Dan Klein, Christopher D Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In NAACL. Association for Computational Linguistics, 173–180.
[41] Lloyd N Trefethen and David Bau III. 1997. Numerical linear algebra. Vol. 50. SIAM.
[42] Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. 2009. Mining multi-label data. In Data mining and knowledge discovery handbook. Springer, 667–685.
[43] Cunchao Tu, Han Liu, Zhiyuan Liu, and Maosong Sun. 2017. CANE: Context-aware network embedding for relation modeling. In ACL.
[44] Cunchao Tu, Weicheng Zhang, Zhiyuan Liu, and Maosong Sun. 2016. Max-Margin DeepWalk: Discriminative Learning of Network Representation. In IJCAI. 3889–3895.
[45] Ulrike Von Luxburg. 2007. A tutorial on spectral clustering. Statistics and Computing 17, 4 (2007), 395–416.
[46] Daixin Wang, Peng Cui, and Wenwu Zhu. 2016. Structural deep network embedding. In KDD. ACM, 1225–1234.
[47] Cheng Yang, Zhiyuan Liu, Deli Zhao, Maosong Sun, and Edward Y Chang. 2015. Network Representation Learning with Rich Text Information. In IJCAI. 2111–2117.
[48] Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. 2016. Revisiting Semi-Supervised Learning with Graph Embeddings. In ICML. 40–48.
[49] Zhilin Yang, Jie Tang, and William W Cohen. 2016. Multi-Modal Bayesian Embeddings for Learning Social Knowledge Graphs. In IJCAI. 2287–2293.
