DeepWalk, LINE, PTE, and Node2vec
Jiezhong Qiu†∗, Yuxiao Dong‡, Hao Ma‡, Jian Li♯, Kuansan Wang‡, and Jie Tang†
†Department of Computer Science and Technology, Tsinghua University
‡Microsoft Research, Redmond
♯Institute for Interdisciplinary Information Sciences, Tsinghua University
[email protected],{yuxdong,haoma,kuansanw}@microsoft.com,{lijian83,jietang}@tsinghua.edu.cn
main theoretical results are not fully consistent with the setting of the original DeepWalk paper. In addition, despite the superficial similarity among DeepWalk, LINE, PTE, and node2vec, there is a lack of deeper understanding of their underlying connections.

Contributions In this work, we provide theoretical results concerning several skip-gram powered network embedding methods. More concretely, we first show that the models we mentioned (DeepWalk, LINE, PTE, and node2vec) are in theory performing implicit matrix factorizations. We derive the closed form of the matrix for each model (see Table 1 for a summary). For example, DeepWalk (random walks on a graph + skip-gram) is in essence factorizing a random matrix that converges in probability to our closed-form matrix as the length of random walks goes to infinity. Second, observing the closed forms of their matrices, we find that, interestingly, LINE can be seen as a special case of DeepWalk when the window size T of contexts is set to 1 in skip-gram. Furthermore, we demonstrate that PTE, as an extension of LINE, is actually an implicit factorization of the joint matrix of multiple networks. Third, we discover a theoretical connection between DeepWalk's implicit matrix and graph Laplacians. Based on this connection, we propose a new algorithm, NetMF, to approximate the closed form of DeepWalk's implicit matrix. By explicitly factorizing this matrix using SVD, our extensive experiments on four networks (used in the DeepWalk and node2vec approaches) demonstrate NetMF's outstanding performance (relative improvements by up to 50%) over DeepWalk and LINE.

2 THEORETICAL ANALYSIS AND PROOFS
In this section, we present the detailed theoretical analysis and proofs for four popular network embedding approaches: LINE, PTE, DeepWalk, and node2vec.

2.1 LINE and PTE
LINE [37] Given an undirected and weighted network G = (V, E, A), LINE with the second-order proximity (aka LINE (2nd)) aims to learn two representation matrices X, Y ∈ R^{|V|×d}, whose rows are denoted by x_i and y_i, i = 1, · · · , |V|, respectively. The objective of LINE (2nd) is to maximize

    ℓ = Σ_{i=1}^{|V|} Σ_{j=1}^{|V|} A_{i,j} ( log g(x_i^⊤ y_j) + b E_{j'∼P_N} [ log g(−x_i^⊤ y_{j'}) ] ),    (1)

where g is the sigmoid function; b is the parameter for negative sampling; and P_N is the noise distribution that generates negative samples. In LINE, the authors empirically set P_N(j) ∝ d_j^{3/4}; in our analysis we assume P_N(j) = d_j / vol(G),² so that the expectation in Eq. 1 expands to

    E_{j'∼P_N} [ log g(−x_i^⊤ y_{j'}) ] = Σ_{j'=1}^{|V|} ( d_{j'} / vol(G) ) log g(−x_i^⊤ y_{j'}).    (2)

²A similar result could be achieved if we use P_N(j) ∝ d_j^{3/4}.

By combining Eq. 1 and Eq. 2, and considering the local objective function for a specific pair of vertices (i, j), we have

    ℓ(i, j) = A_{i,j} log g(x_i^⊤ y_j) + ( b d_i d_j / vol(G) ) log g(−x_i^⊤ y_j).

Let us define z_{i,j} = x_i^⊤ y_j. Following Levy and Goldberg [24], who suggested that for a sufficiently large embedding dimension each individual z_{i,j} can assume a value independently, we take the derivative w.r.t. z_{i,j} and get

    ∂ℓ / ∂z_{i,j} = ∂ℓ(i, j) / ∂z_{i,j} = A_{i,j} g(−z_{i,j}) − ( b d_i d_j / vol(G) ) g(z_{i,j}).

Setting the derivative to zero reveals

    e^{2 z_{i,j}} − ( vol(G) A_{i,j} / (b d_i d_j) − 1 ) e^{z_{i,j}} − vol(G) A_{i,j} / (b d_i d_j) = 0.

The above quadratic equation has two solutions: (1) e^{z_{i,j}} = −1, which is invalid; and (2) e^{z_{i,j}} = vol(G) A_{i,j} / (b d_i d_j), i.e.,

    x_i^⊤ y_j = z_{i,j} = log ( vol(G) A_{i,j} / (b d_i d_j) ).    (3)

Writing Eq. 3 in matrix form, LINE (2nd) is factorizing the matrix

    log ( vol(G) D^{−1} A D^{−1} ) − log b = X Y^⊤,    (4)

where log(·) denotes the element-wise matrix logarithm, and D = diag(d_1, · · · , d_{|V|}).
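To make Eq. 4 concrete, here is a minimal NumPy sketch (ours, not the LINE authors' implementation) that builds the matrix log(vol(G) D^{−1} A D^{−1}) − log b from a dense adjacency matrix; zero entries of A map to −∞ under the element-wise logarithm, an issue revisited in Section 3.2.

```python
import numpy as np

def line_matrix(A: np.ndarray, b: float = 1.0) -> np.ndarray:
    """Closed-form matrix implicitly factorized by LINE (2nd), Eq. 4.

    A : symmetric non-negative adjacency matrix of an undirected graph.
    b : number of negative samples.
    Zero entries of A map to -inf under the element-wise logarithm.
    """
    d = A.sum(axis=1)                      # degrees d_1, ..., d_|V|
    vol = d.sum()                          # vol(G) = sum of all degrees
    with np.errstate(divide="ignore"):     # allow log(0) = -inf
        return np.log(vol * A / np.outer(d, d)) - np.log(b)

# Toy example: a 4-vertex path graph.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(line_matrix(A, b=1.0))
```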
PTE [36] PTE is an extension of LINE (2nd) to heterogeneous text networks. To examine it, we first adapt our analysis of LINE (2nd) to bipartite networks. Consider a bipartite network G = (V_1 ∪ V_2, E, A), where V_1, V_2 are two disjoint sets of vertices, E ⊆ V_1 × V_2 is the edge set, and A ∈ R_+^{|V_1|×|V_2|} is the bipartite adjacency matrix. The volume of G is defined to be vol(G) = Σ_{i=1}^{|V_1|} Σ_{j=1}^{|V_2|} A_{i,j}. The goal is to learn a representation x_i for each vertex v_i ∈ V_1 and a representation y_j for each vertex v_j ∈ V_2. The objective function is

    ℓ = Σ_{i=1}^{|V_1|} Σ_{j=1}^{|V_2|} A_{i,j} ( log g(x_i^⊤ y_j) + b E_{j'∼P_N} [ log g(−x_i^⊤ y_{j'}) ] ).

Applying the same analysis procedure as for LINE, we can see that maximizing ℓ is actually factorizing

    log ( vol(G) D_row^{−1} A D_col^{−1} ) − log b = X Y^⊤,

where we denote D_row = diag(Ae) and D_col = diag(A^⊤ e).

Given the above discussion, let us consider the heterogeneous text network used in PTE, which is composed of three sub-networks:
the word-word network G_ww, the document-word network G_dw, and the label-word network G_lw, where G_dw and G_lw are bipartite. Taking the document-word network G_dw as an example, we use A_dw ∈ R^{#doc×#word} to denote its adjacency matrix, and use D_row^dw and D_col^dw to denote its diagonal matrices of row and column sums, respectively. By leveraging the analysis of LINE and the above notations, we find that PTE is factorizing

    log ( [ α vol(G_ww) (D_row^ww)^{−1} A_ww (D_col^ww)^{−1}
            β vol(G_dw) (D_row^dw)^{−1} A_dw (D_col^dw)^{−1}
            γ vol(G_lw) (D_row^lw)^{−1} A_lw (D_col^lw)^{−1} ] ) − log b,    (5)

where the factorized matrix is of shape (#word + #doc + #label) × #word, b is the parameter for negative sampling, and {α, β, γ} are non-negative hyper-parameters that balance the weights of the three sub-networks. In PTE, {α, β, γ} satisfy α vol(G_ww) = β vol(G_dw) = γ vol(G_lw). This is because the authors perform edge sampling during training wherein edges are sampled from each of the three sub-networks alternately (see Section 4.2 in [36]).
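As an illustration of Eq. 5, the sketch below (ours; the array names A_ww, A_dw, A_lw and the dense NumPy setting are assumptions for the example, not PTE's actual implementation) assembles the joint matrix from the three sub-networks, assuming no zero rows or columns.

```python
import numpy as np

def bipartite_block(A: np.ndarray, weight: float) -> np.ndarray:
    """weight * vol(G) * D_row^{-1} A D_col^{-1} for one sub-network."""
    d_row = A.sum(axis=1)                  # row sums (diagonal of D_row)
    d_col = A.sum(axis=0)                  # column sums (diagonal of D_col)
    vol = A.sum()                          # volume of the sub-network
    return weight * vol * A / np.outer(d_row, d_col)

def pte_matrix(A_ww, A_dw, A_lw, alpha, beta, gamma, b=1.0):
    """Joint matrix of Eq. 5 that PTE implicitly factorizes.

    The three blocks share the word dimension (#word columns) and are
    stacked vertically into a (#word + #doc + #label) x #word matrix.
    """
    blocks = [bipartite_block(A_ww, alpha),
              bipartite_block(A_dw, beta),
              bipartite_block(A_lw, gamma)]
    joint = np.vstack(blocks)
    with np.errstate(divide="ignore"):     # zero entries become -inf
        return np.log(joint) - np.log(b)
```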
2.2 DeepWalk
In this section, we analyze DeepWalk [31] and illustrate that the essence of DeepWalk is performing an implicit matrix factorization (see the matrix-form solution in Thm. 2.3).

DeepWalk first generates a “corpus” D by random walks on graphs [26]. To be formal, the corpus D is a multiset that counts the multiplicity of vertex-context pairs. DeepWalk then trains a skip-gram model on the multiset D. In this work, we focus on skip-gram with negative sampling (SGNS). For clarity, we summarize the DeepWalk method in Algorithm 1. The outer loop (Line 1-7) specifies the total number of times, N, for which we should run random walks. For each random walk of length L, the first vertex is sampled from a prior distribution P(w). The inner loop (Line 4-7) specifies the construction of the multiset D. Once we have D, we run SGNS on it to attain the network embedding (Line 8). Next, we introduce some necessary background about the SGNS technique, followed by our analysis of the DeepWalk method.

Algorithm 1: DeepWalk
1 for n = 1, 2, . . . , N do
2     Pick w_1^n according to a probability distribution P(w_1);
3     Generate a vertex sequence (w_1^n, · · · , w_L^n) of length L by a random walk on network G;
4     for j = 1, 2, . . . , L − T do
5         for r = 1, . . . , T do
6             Add vertex-context pair (w_j^n, w_{j+r}^n) to multiset D;
7             Add vertex-context pair (w_{j+r}^n, w_j^n) to multiset D;
8 Run SGNS on D with b negative samples.
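The following minimal Python sketch (ours, not the DeepWalk authors' code) mirrors Lines 1-7 of Algorithm 1 for an unweighted graph given as an adjacency-list dictionary; for simplicity it uses a uniform prior P(w_1), which by Remark 1 below does not affect the limiting behavior.

```python
import random
from collections import Counter

def deepwalk_corpus(adj, N, L, T, rng=random.Random(0)):
    """Multiset D of vertex-context pairs built as in Algorithm 1.

    adj : dict mapping each vertex to the list of its neighbors
          (an unweighted, undirected, connected graph is assumed).
    N, L, T : number of walks, walk length, and window size.
    Returns a Counter whose keys are (vertex, context) pairs and whose
    values are their multiplicities in D.
    """
    vertices = list(adj)
    D = Counter()
    for _ in range(N):
        # Line 2: here the prior P(w_1) is uniform over vertices.
        walk = [rng.choice(vertices)]
        # Line 3: simple first-order random walk of length L.
        while len(walk) < L:
            walk.append(rng.choice(adj[walk[-1]]))
        # Lines 4-7: emit pairs within a window of size T, in both directions.
        for j in range(L - T):
            for r in range(1, T + 1):
                D[(walk[j], walk[j + r])] += 1
                D[(walk[j + r], walk[j])] += 1
    return D   # Line 8 (SGNS training on D) is not shown here.
```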
Preliminary on Skip-gram with Negative Sampling (SGNS) The skip-gram model assumes a corpus of words w and their contexts c. More concretely, the words come from a textual corpus of words w_1, · · · , w_L, and the contexts for word w_i are the words surrounding it in a T-sized window, w_{i−T}, · · · , w_{i−1}, w_{i+1}, · · · , w_{i+T}. Following the work by Levy and Goldberg [24], SGNS is implicitly factorizing

    log ( #(w, c) · |D| / ( #(w) · #(c) ) ) − log b,    (6)

where #(w, c), #(w) and #(c) denote the number of times the word-context pair (w, c), the word w, and the context c appear in the corpus, respectively; b is the number of negative samples.
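For concreteness, here is a small sketch (ours) that materializes the matrix of Eq. 6 from a multiset of vertex-context pairs, such as the one produced by the corpus sketch above; unobserved pairs are left at −∞, mirroring the sparsity issue discussed in Section 3.2.

```python
import numpy as np
from collections import Counter

def sgns_matrix(D: Counter, b: float = 1.0):
    """Entry-wise log(#(w,c)|D| / (#(w)#(c))) - log b over observed pairs.

    D is a multiset of (word, context) pairs with multiplicities.
    Unobserved pairs are left as -inf (or nan for all-zero rows/columns).
    """
    words = sorted({w for w, _ in D} | {c for _, c in D})
    index = {w: i for i, w in enumerate(words)}
    n, total = len(words), sum(D.values())
    counts = np.zeros((n, n))
    for (w, c), k in D.items():
        counts[index[w], index[c]] += k
    w_count = counts.sum(axis=1, keepdims=True)   # #(w)
    c_count = counts.sum(axis=0, keepdims=True)   # #(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        M = np.log(counts * total / (w_count * c_count)) - np.log(b)
    return M, words
```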
Proofs and Analysis Our analysis of DeepWalk is based on the following key assumptions. Firstly, assume the graph is undirected, connected, and non-bipartite, making P(w) = d_w / vol(G) a unique stationary distribution of the random walks. Secondly, suppose the first vertex of a random walk is sampled from the stationary distribution P(w) = d_w / vol(G).

To characterize DeepWalk, we want to reinterpret Eq. 6 using graph-theoretic terminology. Our general idea is to partition the multiset D into several sub-multisets according to the way in which the vertex and its context appear in a random walk sequence. More formally, for r = 1, · · · , T, we define

    D_r→ = { (w, c) : (w, c) ∈ D, w = w_j^n, c = w_{j+r}^n },
    D_r← = { (w, c) : (w, c) ∈ D, w = w_{j+r}^n, c = w_j^n }.

That is, D_r→ / D_r← is the sub-multiset of D such that the context c is r steps after/before the vertex w in a random walk. Moreover, we use #(w, c)_r→ and #(w, c)_r← to denote the number of times the vertex-context pair (w, c) appears in D_r→ and D_r←, respectively. The following three theorems characterize DeepWalk step by step.

Theorem 2.1. Denote P = D^{−1} A. When L → ∞, we have

    #(w, c)_r→ / |D_r→|  →p  (d_w / vol(G)) (P^r)_{w,c}   and   #(w, c)_r← / |D_r←|  →p  (d_c / vol(G)) (P^r)_{c,w}.

Proof. See Appendix. □

Remark 1. What if we start random walks with other distributions (e.g., the uniform distribution in the original DeepWalk work [31])? It turns out that, for a connected, undirected, and non-bipartite graph, P(w_j = w, w_{j+r} = c) → (d_w / vol(G)) (P^r)_{w,c} as j → ∞. So when the length of random walks L → ∞, Thm. 2.1 still holds.

Theorem 2.2. When L → ∞, we have

    #(w, c) / |D|  →p  (1 / 2T) Σ_{r=1}^{T} [ (d_w / vol(G)) (P^r)_{w,c} + (d_c / vol(G)) (P^r)_{c,w} ].

Proof. Note the fact that |D_r→| / |D| = |D_r←| / |D| = 1/(2T). By using Thm. 2.1 and the continuous mapping theorem, we get

    #(w, c) / |D| = ( Σ_{r=1}^{T} ( #(w, c)_r→ + #(w, c)_r← ) ) / ( Σ_{r=1}^{T} ( |D_r→| + |D_r←| ) )
                  = (1 / 2T) Σ_{r=1}^{T} ( #(w, c)_r→ / |D_r→| + #(w, c)_r← / |D_r←| )
                  →p  (1 / 2T) Σ_{r=1}^{T} [ (d_w / vol(G)) (P^r)_{w,c} + (d_c / vol(G)) (P^r)_{c,w} ].

Further, marginalizing over w and c respectively reveals the fact that #(w) / |D| →p d_w / vol(G) and #(c) / |D| →p d_c / vol(G) as L → ∞. □
Theorem 2.3. For DeepWalk, when L → ∞,

    #(w, c) · |D| / ( #(w) · #(c) )  →p  (vol(G) / 2T) ( (1/d_c) Σ_{r=1}^{T} (P^r)_{w,c} + (1/d_w) Σ_{r=1}^{T} (P^r)_{c,w} ).

In matrix form, DeepWalk is equivalent to factorizing

    log ( (vol(G) / T) ( Σ_{r=1}^{T} P^r ) D^{−1} ) − log b.    (7)

Proof. Using the results in Thm. 2.2 and the continuous mapping theorem, we get

    #(w, c) · |D| / ( #(w) · #(c) ) = ( #(w, c) / |D| ) / ( ( #(w) / |D| ) · ( #(c) / |D| ) )
        →p  [ (1/2T) Σ_{r=1}^{T} ( (d_w / vol(G)) (P^r)_{w,c} + (d_c / vol(G)) (P^r)_{c,w} ) ] / [ (d_w / vol(G)) · (d_c / vol(G)) ]
        =  (vol(G) / 2T) ( (1/d_c) Σ_{r=1}^{T} (P^r)_{w,c} + (1/d_w) Σ_{r=1}^{T} (P^r)_{c,w} ).

Write it in matrix form:

    (vol(G) / 2T) ( Σ_{r=1}^{T} P^r D^{−1} + Σ_{r=1}^{T} D^{−1} (P^r)^⊤ )
        = (vol(G) / 2T) Σ_{r=1}^{T} ( D^{−1}A · · · D^{−1}A D^{−1} + D^{−1} AD^{−1} · · · AD^{−1} )   (r factors of D^{−1}A and of AD^{−1}, respectively)
        = (vol(G) / T) Σ_{r=1}^{T} D^{−1}A · · · D^{−1}A D^{−1}
        = vol(G) ( (1/T) Σ_{r=1}^{T} P^r ) D^{−1}.   □

Corollary 2.4. Comparing Eq. 4 and Eq. 7, we can easily observe that LINE (2nd) is a special case of DeepWalk when T = 1.
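The sketch below (ours) forms the matrix of Eq. 7 for a toy graph and checks numerically that T = 1 recovers the LINE matrix of Eq. 4, as Corollary 2.4 states; it assumes a dense adjacency matrix and b = 1.

```python
import numpy as np

def deepwalk_matrix(A: np.ndarray, T: int, b: float = 1.0) -> np.ndarray:
    """Matrix of Eq. 7: log( (vol(G)/T) * sum_{r=1..T} P^r D^{-1} ) - log b."""
    d = A.sum(axis=1)
    vol = d.sum()
    P = A / d[:, None]                        # P = D^{-1} A
    S, P_r = np.zeros_like(A), np.eye(A.shape[0])
    for _ in range(T):                        # accumulate P^1 + ... + P^T
        P_r = P_r @ P
        S += P_r
    with np.errstate(divide="ignore"):
        return np.log((vol / T) * S / d[None, :]) - np.log(b)

# Corollary 2.4, numerically: with T = 1 this coincides with the LINE
# matrix log(vol(G) D^{-1} A D^{-1}) on a small connected, non-bipartite graph.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
d = A.sum(axis=1)
with np.errstate(divide="ignore"):
    line = np.log(d.sum() * A / np.outer(d, d))   # Eq. 4 with b = 1
print(np.allclose(deepwalk_matrix(A, T=1), line))  # True (including -inf entries)
```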
2.3 node2vec
node2vec [16] is a recently proposed network embedding method, which is briefly summarized in Algorithm 2. First, it defines an unnormalized transition probability tensor T with parameters p and q, and then normalizes it to be the transition probability of a 2nd-order random walk (Line 1). Formally,

    T_{u,v,w} =  1/p   if (u, v) ∈ E, (v, w) ∈ E, u = w;
                 1     if (u, v) ∈ E, (v, w) ∈ E, u ≠ w, (w, u) ∈ E;
                 1/q   if (u, v) ∈ E, (v, w) ∈ E, u ≠ w, (w, u) ∉ E;
                 0     otherwise,

    P_{u,v,w} = Prob( w_{j+1} = u | w_j = v, w_{j−1} = w ) = T_{u,v,w} / Σ_u T_{u,v,w}.

Second, node2vec performs the 2nd-order random walks to generate a multiset D (Line 2-8) and then trains an SGNS model (Line 9) on it. To facilitate the analysis, we instead record triplets to form the multiset D for node2vec rather than vertex-context pairs. Taking a vertex w = w_j^n and its context c = w_{j+r}^n as an example, we denote u = w_{j−1}^n and add a triplet (w, c, u) into D (Line 7).

Algorithm 2: node2vec
1 Construct transition probability tensor P;
2 for n = 1, 2, . . . , N do
3     Pick w_1^n, w_2^n according to a distribution P(w_1, w_2);
4     Generate a vertex sequence (w_1^n, · · · , w_L^n) of length L by the 2nd-order random walk on network G;
5     for j = 2, 3, . . . , L − T do
6         for r = 1, . . . , T do
7             Add triplet (w_j^n, w_{j+r}^n, w_{j−1}^n) to multiset D;
8             Add triplet (w_{j+r}^n, w_j^n, w_{j−1}^n) to multiset D;
9 Run SGNS on multiset D′ = {(w, c) : (w, c, u) ∈ D};

Similar to our analysis of DeepWalk, we partition the multiset D according to the way in which the vertex and its context appear in a random walk sequence. More formally, for r = 1, · · · , T, we define

    D_r→ = { (w, c, u) : (w, c, u) ∈ D, w_j^n = w, w_{j+r}^n = c, w_{j−1}^n = u },
    D_r← = { (w, c, u) : (w, c, u) ∈ D, w_{j+r}^n = w, w_j^n = c, w_{j−1}^n = u }.

In addition, for a triplet (w, c, u), we use #(w, c, u)_r→ and #(w, c, u)_r← to denote the number of times it appears in D_r→ and D_r←, respectively.

In this analysis, we assume the first two vertices of node2vec's 2nd-order random walk are sampled from its stationary distribution X. The stationary distribution X of the 2nd-order random walk satisfies Σ_w P_{u,v,w} X_{v,w} = X_{u,v}, and the existence of such an X is guaranteed by the Perron-Frobenius theorem [4]. Additionally, the higher-order transition probability tensor is defined to be (P^r)_{u,v,w} = Prob( w_{j+r} = u | w_j = v, w_{j−1} = w ).

Main Results Limited by space, we list the main results of node2vec without proofs. The idea is similar to the analysis of DeepWalk.

    • #(w, c, u)_r→ / |D_r→|  →p  X_{w,u} (P^r)_{c,w,u}   and   #(w, c, u)_r← / |D_r←|  →p  X_{c,u} (P^r)_{w,c,u}.
    • #(w, c)_r→ / |D_r→| = Σ_u #(w, c, u)_r→ / |D_r→|  →p  Σ_u X_{w,u} (P^r)_{c,w,u}.
    • #(w, c)_r← / |D_r←| = Σ_u #(w, c, u)_r← / |D_r←|  →p  Σ_u X_{c,u} (P^r)_{w,c,u}.
    • #(w, c) / |D|  →p  (1/2T) Σ_{r=1}^{T} [ Σ_u X_{w,u} (P^r)_{c,w,u} + Σ_u X_{c,u} (P^r)_{w,c,u} ].
    • #(w) / |D|  →p  Σ_u X_{w,u}   and   #(c) / |D|  →p  Σ_u X_{c,u}.

Combining all of the above, we conclude that, for node2vec, Eq. 6 has the form

    #(w, c) · |D| / ( #(w) · #(c) )  →p  ( (1/2T) Σ_{r=1}^{T} [ Σ_u X_{w,u} (P^r)_{c,w,u} + Σ_u X_{c,u} (P^r)_{w,c,u} ] ) / ( Σ_u X_{w,u} Σ_u X_{c,u} ).    (8)

Though the closed form for node2vec has been achieved, we leave the formulation of its matrix form for future research.

Note that both the computation and storage of the transition probability tensor P^r and its corresponding stationary distribution X are very expensive, making the modeling of the full 2nd-order dynamics difficult. However, we have noticed some recent progress [3, 4, 15] that tries to understand or approximate the 2nd-order random walk by assuming a rank-one factorization X_{u,v} = x_u x_v of its stationary distribution X. Due to the page limitation, in the rest of this paper, we mainly focus on the matrix factorization framework depending on the 1st-order random walk (DeepWalk).
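To illustrate Line 1 of Algorithm 2, here is a sketch (ours) of the 2nd-order transition tensor for a small graph; the dense n × n × n representation is intentionally naive and only underlines the storage issue noted above. The parameters p and q follow node2vec's return and in-out semantics.

```python
import numpy as np

def node2vec_transitions(A: np.ndarray, p: float, q: float) -> np.ndarray:
    """Dense 2nd-order transition probabilities P[u, v, w].

    P[u, v, w] = Prob(next = u | current = v, previous = w).  The cubic
    tensor is only practical for tiny graphs.
    """
    n = A.shape[0]
    T = np.zeros((n, n, n))
    for w in range(n):
        for v in range(n):
            if A[v, w] == 0:               # (v, w) must be an edge
                continue
            for u in range(n):
                if A[u, v] == 0:           # (u, v) must be an edge
                    continue
                if u == w:
                    T[u, v, w] = 1.0 / p   # return to the previous vertex
                elif A[w, u] > 0:
                    T[u, v, w] = 1.0       # distance 1 from the previous vertex
                else:
                    T[u, v, w] = 1.0 / q   # distance 2 from the previous vertex
    denom = T.sum(axis=0, keepdims=True)   # normalize over the next vertex u
    return np.divide(T, denom, out=np.zeros_like(T), where=denom > 0)
```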
3 NetMF: NETWORK EMBEDDING AS MATRIX FACTORIZATION
Based on the analysis in Section 2, we unify LINE, PTE, DeepWalk, and node2vec in the framework of matrix factorization, where the factorized matrices have the closed forms shown in Eq. 4, Eq. 5, Eq. 7, and Eq. 8, respectively. In this section, we study the DeepWalk matrix (Eq. 7) because it is more general than the LINE matrix (Eq. 4). We first show the connection between the DeepWalk matrix and the normalized graph Laplacian in Section 3.1. Then in Section 3.2, we present a matrix factorization framework, NetMF, for network embedding.

3.1 Connection between the DeepWalk Matrix and the Graph Laplacian
In this section, we show that the DeepWalk matrix has a close relationship with the normalized graph Laplacian. To facilitate our analysis, we introduce the following four theorems.

Theorem 3.1. ([11]) For the normalized graph Laplacian L = I − D^{−1/2} A D^{−1/2} ∈ R^{n×n}, all its eigenvalues are real numbers and lie in [0, 2], with λ_min(L) = 0. For a connected graph with n > 1, λ_max(L) ≥ n/(n − 1).

Theorem 3.2. ([41]) The singular values of a real symmetric matrix are the absolute values of its eigenvalues.

Denote the eigendecomposition of D^{−1/2} A D^{−1/2} as U Λ U^⊤. Observe that (1/T) Σ_{r=1}^{T} P^r D^{−1} can be written as the product of three symmetric matrices:

    ( (1/T) Σ_{r=1}^{T} P^r ) D^{−1} = D^{−1/2} ( U ( (1/T) Σ_{r=1}^{T} Λ^r ) U^⊤ ) D^{−1/2}.    (9)

The goal here is to characterize the spectrum of (1/T) Σ_{r=1}^{T} P^r D^{−1}. To achieve this, we first analyze the second matrix on the RHS of Eq. 9, and then extend our analysis to the targeted matrix on the LHS.
Spectrum of U((1/T) Σ_{r=1}^{T} Λ^r)U^⊤ The matrix U((1/T) Σ_{r=1}^{T} Λ^r)U^⊤ has eigenvalues (1/T) Σ_{r=1}^{T} λ_i^r, i = 1, · · · , n, which can be treated as the output of a transformation applied to D^{−1/2} A D^{−1/2}'s eigenvalues λ_i, i.e., a kind of filter! The effect of this transformation (filter) f(x) = (1/T) Σ_{r=1}^{T} x^r is plotted in Figure 1(a), from which we observe the following two properties of this filter. Firstly, it prefers large positive eigenvalues; secondly, the preference becomes stronger as the window size T increases. In other words, as T grows, this filter tries to approximate a low-rank positive semi-definite matrix by keeping the large positive eigenvalues.

Figure 1: DeepWalk Matrix as Filtering: (a) the function f(x) = (1/T) Σ_{r=1}^{T} x^r with dom f = [−1, 1], where T ∈ {1, 2, 5, 10}; (b) eigenvalues of D^{−1/2} A D^{−1/2}, U((1/T) Σ_{r=1}^{T} Λ^r)U^⊤, and (1/T) Σ_{r=1}^{T} P^r D^{−1} for the Cora network (T = 10).
Spectrum of (1/T) Σ_{r=1}^{T} P^r D^{−1} Guided by Thm. 3.2, the decreasingly ordered singular values of the matrix U((1/T) Σ_{r=1}^{T} Λ^r)U^⊤ can be constructed by sorting the absolute values of its eigenvalues in non-increasing order, which reveals that the magnitude of (1/T) Σ_{r=1}^{T} P^r D^{−1}'s eigenvalues is always bounded by the magnitude of U((1/T) Σ_{r=1}^{T} Λ^r)U^⊤'s eigenvalues. In addition to the magnitude of the eigenvalues, we also want to bound its smallest eigenvalue. Observe that the Rayleigh quotient of (1/T) Σ_{r=1}^{T} P^r D^{−1} has a lower bound as follows:

    R( ((1/T) Σ_{r=1}^{T} P^r) D^{−1}, x ) = R( U((1/T) Σ_{r=1}^{T} Λ^r)U^⊤, D^{−1/2} x ) · R( D^{−1}, x )
        ≥ λ_min( U((1/T) Σ_{r=1}^{T} Λ^r)U^⊤ ) · λ_max( D^{−1} ) = (1/d_min) λ_min( U((1/T) Σ_{r=1}^{T} Λ^r)U^⊤ ).

By applying Thm. 3.4, we can bound the smallest eigenvalue of (1/T) Σ_{r=1}^{T} P^r D^{−1} by the smallest eigenvalue of U((1/T) Σ_{r=1}^{T} Λ^r)U^⊤:

    λ_min( ((1/T) Σ_{r=1}^{T} P^r) D^{−1} ) ≥ (1/d_min) λ_min( U((1/T) Σ_{r=1}^{T} Λ^r)U^⊤ ).

Illustrative Example: Cora In order to illustrate the filtering effect discussed above, we analyze a small citation network, Cora [27]. We make the citation links undirected and choose its largest connected component. In Figure 1(b), we plot the decreasingly ordered eigenvalues of the matrices D^{−1/2} A D^{−1/2}, U((1/T) Σ_{r=1}^{T} Λ^r)U^⊤, and (1/T) Σ_{r=1}^{T} P^r D^{−1}, respectively, with T = 10. For D^{−1/2} A D^{−1/2}, the largest eigenvalue is λ_1 = 1 and the smallest eigenvalue is λ_n = −0.971. For U((1/T) Σ_{r=1}^{T} Λ^r)U^⊤, we observe that all its negative eigenvalues and small positive eigenvalues are "filtered out" in the spectrum. Finally, for the matrix (1/T) Σ_{r=1}^{T} P^r D^{−1}, we observe that both the magnitude of its eigenvalues and its smallest eigenvalue are bounded by those of U((1/T) Σ_{r=1}^{T} Λ^r)U^⊤.
3.2 NetMF
Built upon the theoretical analysis above, we propose a matrix factorization framework, NetMF, for empirically understanding and improving DeepWalk and LINE. For simplicity, we denote M = (vol(G) / (bT)) Σ_{r=1}^{T} P^r D^{−1}, and refer to log M as the DeepWalk matrix.

NetMF for a Small Window Size T NetMF for a small T is quite intuitive. The basic idea is to directly compute and factorize the DeepWalk matrix. The detail is listed in Algorithm 3. In the first step (Line 1-2), we compute the matrix powers from P^1 to P^T and then get M. However, the factorization of log M presents computational challenges due to the element-wise matrix logarithm: the matrix is not only ill-defined (since log 0 = −∞), but also dense. Inspired by the Shifted PPMI approach [24], we define M′ such that M′_{i,j} = max(M_{i,j}, 1) (Line 3). In this way, log M′ is a sparse and consistent version of log M. Finally, we factorize log M′ by using Singular Value Decomposition (SVD) and construct the network embedding by using its top-d singular values/vectors (Line 4-5).

Algorithm 3: NetMF for a Small Window Size T
1 Compute P^1, · · · , P^T;
2 Compute M = (vol(G) / (bT)) ( Σ_{r=1}^{T} P^r ) D^{−1};
3 Compute M′ = max(M, 1);
4 Rank-d approximation by SVD: log M′ = U_d Σ_d V_d^⊤;
5 return U_d √Σ_d as network embedding.
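A compact NumPy sketch of Algorithm 3 (ours, for illustration; it assumes a dense adjacency matrix and is not the released NetMF implementation):

```python
import numpy as np

def netmf_small_T(A: np.ndarray, T: int, b: float, d: int) -> np.ndarray:
    """NetMF for a small window size T (sketch of Algorithm 3)."""
    deg = A.sum(axis=1)
    vol = deg.sum()
    P = A / deg[:, None]                         # P = D^{-1} A
    S, P_r = np.zeros_like(A), np.eye(A.shape[0])
    for _ in range(T):                           # Lines 1-2: P^1 + ... + P^T
        P_r = P_r @ P
        S += P_r
    M = (vol / (b * T)) * S / deg[None, :]       # M = vol(G)/(bT) * (sum P^r) D^{-1}
    M_prime = np.maximum(M, 1.0)                 # Line 3: shifted-PPMI style truncation
    log_M = np.log(M_prime)
    U, sigma, _ = np.linalg.svd(log_M)           # Line 4: rank-d approximation by SVD
    return U[:, :d] * np.sqrt(sigma[:d])         # Line 5: embedding U_d sqrt(Sigma_d)
```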
NetMF for a Large Window Size T The direct computation of the matrix M presents computational challenges for a large window size T, mainly due to its high time complexity. We hereby propose an approximation algorithm, listed in Algorithm 4. The general idea comes from our analysis in Section 3.1, wherein we reveal M's close relationship with the normalized graph Laplacian and show its low-rank nature theoretically. In our algorithm, we first approximate D^{−1/2} A D^{−1/2} with its top-h eigenpairs U_h Λ_h U_h^⊤ (Line 1). Since only the top-h eigenpairs are required and the involved matrix is sparse, we can use the Arnoldi method [22] to achieve a significant time reduction. In step two (Line 2), we approximate M with M̂ = (vol(G)/b) D^{−1/2} U_h ( (1/T) Σ_{r=1}^{T} Λ_h^r ) U_h^⊤ D^{−1/2}. The final step is the same as that in Algorithm 3, in which we form M̂′ = max(M̂, 1) (Line 3) and then perform SVD on log M̂′ to get the network embedding (Line 4-5).

Algorithm 4: NetMF for a Large Window Size T
1 Eigen-decomposition D^{−1/2} A D^{−1/2} ≈ U_h Λ_h U_h^⊤;
2 Approximate M with M̂ = (vol(G)/b) D^{−1/2} U_h ( (1/T) Σ_{r=1}^{T} Λ_h^r ) U_h^⊤ D^{−1/2};
3 Compute M̂′ = max(M̂, 1);
4 Rank-d approximation by SVD: log M̂′ = U_d Σ_d V_d^⊤;
5 return U_d √Σ_d as network embedding.
matrix M presents computing challenges for a large window size
Baseline Methods We compare our methods NetMF (T = 1) and
T , mainly due to its high time complexity. Hereby we propose an
NetMF (T = 10) with LINE (2nd) [37] and DeepWalk [31], which
approximation algorithm as listed in Algorithm 4. The general idea
we have introduced in previous sections. For NetMF (T = 10), we
comes from our analysis in section 3.1, wherein we reveal M’s close
choose h = 16384 for Flickr, and h = 256 for BlogCatelog, PPI, and
relationship with the normalized graph Laplacian and show its low-
rank nature theoretically. In our algorithm, we first approximate 3 http://mattmahoney.net/dc/text.html
Figure 2: Predictive performance on varying the ratio of training data (panels: BlogCatalog, PPI, Wikipedia, Flickr; methods: NetMF (T = 1), LINE, NetMF (T = 10), DeepWalk). The x-axis represents the ratio of labeled data (%), and the y-axes in the top and bottom rows denote the Micro-F1 and Macro-F1 scores, respectively.
Prediction Setting Following the same experimental procedure as in DeepWalk [31], we randomly sample a portion of the labeled vertices for training and use the rest for testing. For the BlogCatalog, PPI, and Wikipedia datasets, the training ratio is varied from 10% to 90%. For Flickr, the training ratio is varied from 1% to 10%. We use the one-vs-rest logistic regression model implemented by LIBLINEAR [14] for the multi-label classification task. In the test phase, the one-vs-rest model yields a ranking of labels rather than an exact label assignment. To avoid the thresholding effect [39], we assume that the number of labels for test data is given [31, 39]. We repeat the prediction procedure 10 times and evaluate the performance in terms of average Micro-F1 and average Macro-F1 [42].
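The evaluation protocol can be sketched with scikit-learn as follows (ours; the paper uses LIBLINEAR directly, and the helper below assumes binary indicator labels and that every label occurs in the training split):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import f1_score

def evaluate(embeddings, labels, train_idx, test_idx):
    """One-vs-rest multi-label evaluation with the known-label-count trick.

    embeddings : (n, d) array of vertex embeddings.
    labels     : (n, k) binary indicator matrix of vertex labels; every
                 label is assumed to appear in the training split.
    """
    clf = OneVsRestClassifier(LogisticRegression(solver="liblinear"))
    clf.fit(embeddings[train_idx], labels[train_idx])
    scores = clf.decision_function(embeddings[test_idx])   # ranking of labels
    preds = np.zeros_like(scores, dtype=int)
    true = labels[test_idx]
    for i, row in enumerate(scores):
        k = int(true[i].sum())                              # number of labels is given
        preds[i, np.argsort(row)[-k:]] = 1                  # keep the k top-ranked labels
    return (f1_score(true, preds, average="micro"),
            f1_score(true, preds, average="macro"))
```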
Experimental Results Figure 2 summarizes the prediction performance of all methods on the four datasets, and Table 3 lists the quantitative and relative gaps between our methods and the baselines. Specifically, we show NetMF (T = 1)'s relative performance gain over LINE (2nd) and NetMF (T = 10)'s relative improvements over DeepWalk, respectively, as each pair of them shares the same window size T. We have the following key observations and insights:

Table 3: Micro/Macro-F1 Score (%) for Multilabel Classification on the BlogCatalog, PPI, Wikipedia, and Flickr datasets. In Flickr, 1% of vertices are labeled for training [31], and in the other three datasets, 10% of vertices are labeled for training.

    Algorithm                        | BlogCatalog (10%)   | PPI (10%)           | Wikipedia (10%)     | Flickr (1%)
                                     | Micro-F1 | Macro-F1 | Micro-F1 | Macro-F1 | Micro-F1 | Macro-F1 | Micro-F1 | Macro-F1
    LINE (2nd)                       | 23.64    | 13.91    | 10.94    | 9.04     | 41.77    | 9.72     | 25.18    | 9.32
    NetMF (T = 1)                    | 33.04    | 14.86    | 16.01    | 12.10    | 49.90    | 9.25     | 23.87    | 6.44
    Relative Gain of NetMF (T = 1)   | 39.76%   | 6.83%    | 46.34%   | 33.85%   | 19.46%   | -4.84%   | -5.20%   | -30.90%
    DeepWalk                         | 29.32    | 18.38    | 12.05    | 10.29    | 36.08    | 8.38     | 26.21    | 12.43
    NetMF (T = 10)                   | 38.36    | 22.90    | 18.16    | 14.32    | 46.21    | 8.38     | 29.95    | 13.50
    Relative Gain of NetMF (T = 10)  | 30.83%   | 24.59%   | 50.71%   | 39.16%   | 28.08%   | 0.00%    | 14.27%   | 8.93%

(1) On BlogCatalog, PPI, and Flickr, the proposed NetMF method (T = 10) achieves significantly better predictive performance than the baseline approaches as measured by both Micro-F1 and Macro-F1, demonstrating the effectiveness of the theoretical foundation we lay out for network embedding.

(2) On Wikipedia, NetMF (T = 1) shows better performance than the other methods in terms of Micro-F1, while LINE outperforms the other methods regarding Macro-F1. This observation implies that short-term dependence is enough to model Wikipedia's network structure. This is because the used Wikipedia network is a dense word co-occurrence network with an average degree of 77.11, in which two words are connected by an edge if they co-occur within a two-word window in the Wikipedia corpus.

(3) As shown in Table 3, the proposed NetMF method (T = 10 and T = 1) outperforms DeepWalk and LINE by large margins in most cases when sparsely labeled vertices are provided. Take the PPI dataset with 10% training data as an example: NetMF (T = 1) achieves relative gains of 46.34% and 33.85% over LINE (2nd) in terms of Micro-F1 and Macro-F1 scores, respectively; more impressively, NetMF (T = 10) outperforms DeepWalk by 50.71% and 39.16%, respectively, as measured by the two metrics.

(4) DeepWalk tries to approximate the exact vertex-context joint distribution with an empirical distribution obtained through random walk sampling. Although convergence is guaranteed by the law of large numbers, there still exist gaps between the exact and estimated distributions due to the large size of real-world networks and the relatively limited scale of random walks in practice (e.g., the number of walks and the walk length), negatively affecting DeepWalk's performance.

5 RELATED WORK
The story of network embedding stems from Spectral Clustering [5, 45], a data clustering technique which selects eigenvalues/eigenvectors of a data affinity matrix to obtain representations that can be clustered or embedded in a low-dimensional space. Spectral Clustering has been widely used in fields such as community detection [23] and image segmentation [33]. In recent years, there has been increasing interest in network embedding. Following a few pioneering works such as SocDim [38] and DeepWalk [31], a growing body of literature has tried to address the problem from various perspectives, such as heterogeneous network embedding [8, 12, 20, 36], semi-supervised network embedding [17, 21, 44, 48], network embedding with rich vertex attributes [43, 47, 49], network embedding with high-order structure [6, 16], signed network embedding [10], directed network embedding [30], network embedding via deep neural networks [7, 25, 46], etc.
Among the above research, a commonly used technique is to define a "context" for each vertex and then to train a predictive model to perform context prediction. For example, DeepWalk [31], node2vec [16], and metapath2vec [12] define vertices' contexts by 1st-order, 2nd-order, and meta-path-based random walks, respectively. The idea of leveraging context information is largely motivated by the skip-gram model with negative sampling (SGNS) [29]. Recently, there has been effort devoted to understanding this model. For example, Levy and Goldberg [24] prove that SGNS is actually conducting an implicit matrix factorization, which provides us with a tool to analyze the above network embedding models; Arora et al. [1] propose a generative model, RAND-WALK, to explain word embedding models; and Hashimoto et al. [18] frame word embedding as a metric learning problem. Built upon the work in [24], we theoretically analyze popular skip-gram based network embedding models and connect them with spectral graph theory.

APPENDIX:
Theorem 2.1. Denote P = D^{−1} A. When L → ∞, we have

    #(w, c)_r→ / |D_r→|  →p  (d_w / vol(G)) (P^r)_{w,c}   and   #(w, c)_r← / |D_r←|  →p  (d_c / vol(G)) (P^r)_{c,w}.

Lemma 6.1. (S. N. Bernstein Law of Large Numbers [34]) Let Y_1, Y_2, · · · be a sequence of random variables with finite expectations E[Y_j] and variances Var(Y_j) < K, j ≥ 1, and covariances such that Cov(Y_i, Y_j) → 0 as |i − j| → ∞. Then the law of large numbers (LLN) holds.

Proof. First consider the special case N = 1, so that we only have one vertex sequence w_1, · · · , w_L generated by a random walk as described in Algorithm 1. Consider a certain vertex-context pair (w, c), and let Y_j (j = 1, · · · , L − T) be the indicator function of the event that w_j = w and w_{j+r} = c. We have the following two observations:

    • The quantity #(w, c)_r→ / |D_r→| is the sample average of the Y_j's, i.e.,

        #(w, c)_r→ / |D_r→| = (1 / (L − T)) Σ_{j=1}^{L−T} Y_j.
−r L − T j=1