
GEMSEC: Graph Embedding with Self Clustering

Benedek Rozemberczki, Ryan Davies, Rik Sarkar and Charles Sutton


School of Informatics, The University of Edinburgh
{benedek.rozemberczki,ryan.davies}@ed.ac.uk,{rsarkar,csutton}@inf.ed.ac.uk

Abstract—Modern graph embedding procedures can efficiently process graphs with millions of nodes. In this paper, we propose GEMSEC – a graph embedding algorithm which learns a clustering of the nodes simultaneously with computing their embedding. GEMSEC is a general extension of earlier work in the domain of sequence-based graph embedding. GEMSEC places nodes in an abstract feature space where the vertex features minimize the negative log-likelihood of preserving sampled vertex neighborhoods, and it incorporates known social network properties through a machine learning regularization. We present two new social network datasets and show that by simultaneously considering the embedding and clustering problems with respect to social properties, GEMSEC extracts high-quality clusters competitive with or superior to other community detection algorithms. In experiments, the method is found to be computationally efficient and robust to the choice of hyperparameters.

Index Terms—community detection, clustering, node embedding, network embedding, feature extraction.

I. INTRODUCTION

Community detection is one of the most important problems in network analysis due to its wide applications ranging from the analysis of collaboration networks to image segmentation, the study of protein-protein interaction networks in biology, and many others [1], [2], [3]. Communities are usually defined as groups of nodes that are connected to each other more densely than to the rest of the network. Classical approaches to community detection depend on properties such as graph metrics, spectral properties and density of shortest paths [4]. Random walks and randomized label propagation [5], [6] have also been investigated.

Embedding the nodes in a low dimensional Euclidean space enables us to apply standard machine learning techniques. This space is sometimes called the feature space – implying that it represents abstract structural features of the network. Embeddings have been used for machine learning tasks such as labeling nodes, regression, link prediction, and graph visualization; see [7] for a survey. Graph embedding processes usually aim to preserve certain predefined differences between nodes encoded in their embedding distances. For social network embedding, a natural priority is to preserve community membership and enable community detection.

Recently, sequence-based methods have been developed as a way to convert complex, non-linear network structures into formats more compatible with vector spaces. These methods sample sequences of nodes from the graph using a randomized mechanism (e.g. random walks), with the idea that nodes that are "close" in the graph connectivity will also frequently appear close in a sampling of random walks. The methods then proceed to use this random-walk-proximity information as a basis to embed nodes such that socially close nodes are placed nearby. In this category, DeepWalk [8] and Node2Vec [9] are two popular methods.

While these methods preserve the proximity of nodes in the graph sense, they do not have an explicit preference for preserving social communities. Thus, in this paper, we develop a machine learning approach that considers clustering when embedding the network and includes a parameter to control the closeness of nodes in the same community. Figure 1(a) shows the embedding obtained by the standard DeepWalk method, where communities are coherent, but not clearly separated in the embedding. The method described in this paper, called GEMSEC, is able to produce clusters that are tightly embedded and separated from each other (Fig. 1(b)).

Fig. 1. Zachary's Karate club graph [10], embedded by (a) DeepWalk and (b) GEMSEC. White nodes: instructor's group; blue nodes: president's group. GEMSEC produces an embedding with more tightly clustered communities.

A. Our Contributions

GEMSEC is an algorithm that considers the two problems of embedding and community detection simultaneously, and as a result, the two solutions of embedding and clustering can inform and improve each other. Through iterations, the embedding converges toward one where nodes are placed close to their neighbors in the network, while at the same time clusters in the embedding space are well separated.

The algorithm is based on the paradigm of sequence-based node embedding procedures that create d-dimensional feature representations of nodes in an abstract feature space. Sequence-based node embeddings embed pairs of nodes close to each other if they occur frequently within a small window of each other in a random walk. This problem can be formulated as minimizing the negative log-likelihood of observed neighborhood samples (Sec. III) and is called the skip-gram optimization [11].
We extend this objective function to include a clustering cost. The formal description is presented in Subsection III-A. The resulting optimization problem is solved with a variant of mini-batch gradient descent [12]. The detailed algorithm is presented in Subsection III-B.

By enforcing clustering on the embedding, GEMSEC reveals the natural community structure (e.g. Figure 1). Our approach improves over existing methods of simultaneous embedding and clustering [13], [14], [15] and shows that community sensitivity can be directly incorporated into the skip-gram style optimization to obtain greater accuracy and efficiency.

In social networks, nodes in the same community tend to have similar groups of friends, which is expressed as high neighborhood overlap. This fact can be leveraged to produce clusters that are better aligned with the underlying communities. We achieve this effect using a regularization procedure – a smoothness regularization added to the basic optimization achieves more coherent community detection. The effect can be seen in Figure 3, where a somewhat uncertain community affiliation suggested by the randomized sampling is sharpened by the smoothness regularization. This technique is described in Subsection III-C.

In experimental evaluation we demonstrate that GEMSEC outperforms – in clustering quality – the state-of-the-art neighborhood based [8], [9], multi-scale [16], [17] and community aware embedding methods [13], [14], [15]. We present new social datasets from the streaming service Deezer and show that the clustering can improve music recommendations. The clustering performance of GEMSEC is found to be robust to hyperparameter changes, and the runtime complexity of our method is linear in the size of the graphs.

To summarize, the main contributions of our work are:
1) GEMSEC: a sequence sampling-based learning model which learns an embedding of the nodes at the same time as it learns a clustering of the nodes.
2) Clustering in GEMSEC can be aligned to network neighborhoods by a smoothness regularization added to the optimization. This enhances the algorithm's sensitivity to natural communities.
3) Two new large social network datasets are introduced – from Facebook and Deezer data.
4) Experimental results show that the embedding process runs linearly in the input size. It generally performs well in quality of embedding and in particular outperforms existing methods on cluster quality measured by modularity and subsequent recommendation tasks.

We start with reviewing related work in the area and its relation to our approach in the next section. A high-performance Tensorflow reference implementation of GEMSEC and the datasets that we collected can be accessed online at https://github.com/benedekrozemberczki/GEMSEC.

II. RELATED WORK

There is a long line of research in metric embedding – for example, embedding discrete metrics into trees [18] and into vector spaces [19]. Optimization-based representation of networks has been used for routing and navigation in domains such as sensor networks and robotics [20], [21]. Representations in hyperbolic spaces have emerged as a technique to preserve richer network structures [22], [23], [24].

Recent advances in node embedding procedures have made it possible to learn vector features for large real-world graphs [8], [16], [9]. Features extracted with these sequence-based node embedding procedures can be used for predicting social network users' missing age [7], the category of scientific papers in citation networks [17] and the function of proteins in protein-protein interaction networks [9]. Besides supervised learning tasks on nodes, the extracted features can be used for graph visualization [7], link prediction [9] and community detection [13].

Sequence based embedding commonly considers variations in the sampling strategy that is used to obtain vertex sequences – truncated random walks being the simplest strategy [8]. More involved methods include second-order random walks [9], skips in random walks [17] and diffusion graphs [25]. It is worth noting that these models implicitly approximate matrix factorizations for different matrices that are expensive to factorize explicitly [26].

Our work extends the literature of node embedding algorithms which are community aware. Earlier works in this category did not directly extend the skip-gram embedding framework. M-NMF [14] applies computationally expensive non-negative matrix factorization with a modularity constraint term. The procedure DANMF [15] uses hierarchical non-negative matrix factorization to create community-aware node embeddings. ComE [13] is a more scalable approach, but it assumes that in the embedding space the communities fit a Gaussian structure, and aims to model them by a mixture of Gaussians. In comparison to these methods, GEMSEC provides greater control over community sensitivity of the embedding process, it is independent of the specific neighborhood sampling methods and is computationally efficient.

III. GRAPH EMBEDDING WITH SELF CLUSTERING

For a graph G = (V, E), a node embedding is a mapping f : V → R^d where d is the dimensionality of the embedding space. For each node v ∈ V we create a d-dimensional representation; alternatively, the embedding f is a |V| × d real-valued matrix. In sequence-based embedding, sequences of neighboring nodes are sampled from the graph. Within a sequence, a node v occurs in the context of a window ω within the sequence. Given a sample S of sequences, we refer to the collection of windows containing v as N_S(v). Earlier works have proposed random walks, second-order random walks or branching processes to obtain N_S(v). In our experiments, we used unweighted first and second-order random walks for node sampling [8], [9].
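As an illustration of this sampling step, the following minimal Python sketch (using networkx; the function name and defaults are ours and not taken from the reference implementation) generates truncated first-order random walks from every source node. The windows of size ω around each occurrence of v in these walks make up N_S(v).

import random
import networkx as nx

def sample_first_order_walks(graph, num_walks=5, walk_length=80, seed=42):
    # Generate `num_walks` truncated first-order random walks starting from every node.
    random.seed(seed)
    walks = []
    for _ in range(num_walks):
        nodes = list(graph.nodes())
        random.shuffle(nodes)            # new traversal order in every sampling repetition
        for source in nodes:
            walk = [source]
            while len(walk) < walk_length:
                neighbors = list(graph.neighbors(walk[-1]))
                if not neighbors:        # dangling node: stop the walk early
                    break
                walk.append(random.choice(neighbors))
            walks.append(walk)
    return walks

walks = sample_first_order_walks(nx.karate_club_graph())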
Our goal is to minimize the negative log-likelihood of observing neighborhoods of source nodes conditional on feature vectors that describe the position of nodes in the embedding space. Formally, the optimization objective is:

\min_{f} \; -\sum_{v \in V} \log P(N_S(v) \mid f(v))   (1)

for a suitable probability function P(·|·). To define this P, we consider two standard properties (see [9]) expected of the embedding f in relation to N_S. First, it should be possible to factorize P(N_S(v)|f(v)) in line with conditional independence with respect to f(v). Formally:

P(N_S(v) \mid f(v)) = \prod_{n_i \in N_S(v)} P(n_i \in N_S(v) \mid f(v), f(n_i)).   (2)

Second, it should satisfy symmetry in the feature space, meaning that source and neighboring nodes have a symmetric effect on each other in the embedding space. A softmax function on the pairwise dot products of node representations with f(v) to get P(n_i ∈ N_S(v) | f(v), f(n_i)) expresses such a property:

P(n_i \in N_S(v) \mid f(v), f(n_i)) = \frac{\exp(f(n_i) \cdot f(v))}{\sum_{u \in V} \exp(f(u) \cdot f(v))}.   (3)

Substituting (2) and (3) into the optimization function, we get:

\min_{f} \sum_{v \in V} \Big( \ln \sum_{u \in V} \exp(f(v) \cdot f(u)) \; - \sum_{n_i \in N_S(v)} f(n_i) \cdot f(v) \Big).   (4)

The partition function in Equation (4) enforces nodes to be embedded in a low volume space around the origin, while the second term forces nodes with similar sampled neighborhoods to be embedded close to each other.

A. Learning to Cluster

Next, we extend the optimization to pay attention to the clusters it forms. We include a clustering cost similar to k-means, measuring the distance from nodes to their cluster centers. This augmented optimization problem is described by minimizing a loss function over the embedding f and the positions of cluster centers µ, that is, \min_{f,\mu} L, where:

L = \sum_{v \in V} \underbrace{\Big( \ln \sum_{u \in V} \exp(f(v) \cdot f(u)) - \sum_{n_i \in N_S(v)} f(n_i) \cdot f(v) \Big)}_{\text{embedding cost}} \; + \; \gamma \cdot \sum_{v \in V} \underbrace{\min_{c \in C} \| f(v) - \mu_c \|_2}_{\text{clustering cost}}   (5)

In Equation (5), C is the set of cluster centers – the c-th cluster mean is denoted by µ_c. Each of these cluster centers is a d-dimensional vector in the embedding space. The idea is to minimize the distance from each node to its nearest cluster center. The weight coefficient of the clustering cost is given by the hyperparameter γ. Evaluating the partition function in the proposed objective function for all of the source nodes has an O(|V|^2) runtime complexity. Because of this, we approximate the partition function term with negative sampling, which is a form of noise contrastive estimation [11], [27].
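To make Equation (5) concrete, here is a small NumPy sketch of the loss with the exact partition function (the O(|V|^2) version; the actual training replaces it with the negative-sampling approximation noted above). The function and variable names are ours, not the reference implementation's.

import numpy as np

def gemsec_loss(F, centers, neighborhoods, gamma):
    # F: |V| x d embedding matrix, centers: |C| x d cluster centers,
    # neighborhoods: dict mapping node index v to the list of context node indices N_S(v).
    scores = F @ F.T                                       # pairwise dot products f(v).f(u)
    log_partition = np.log(np.exp(scores).sum(axis=1))     # ln sum_u exp(f(v).f(u)) for each v
    embedding_cost = sum(log_partition[v] - scores[v, ctx].sum()
                         for v, ctx in neighborhoods.items())
    distances = np.linalg.norm(F[:, None, :] - centers[None, :, :], axis=2)  # |V| x |C|
    clustering_cost = gamma * distances.min(axis=1).sum()  # distance to the closest center
    return embedding_cost + clustering_cost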
Fig. 2. Potential issues with cluster cost weighting and cluster initialization: (a) node capture, (b) empty initialization. Different node colors denote different ground truth community memberships and the computed cluster boundary is denoted by the dashed line. In Subfigure 2a a single white node is captured in a cluster with the blue nodes due to the clustering weight γ being high. In Subfigure 2b an empty cluster is initialized with no nodes in it. It is plausible that the cluster center remains empty throughout the optimization process.

The gradients of the loss function in Equation (5) are important in solving the minimization problem; from it we can obtain the gradients for node representations and cluster centers. Examining in more detail, the gradient of the objective function L with respect to the representation of node v* ∈ V is described by Equation (6) if µ_c is the closest cluster center to f(v*):

\frac{\partial L}{\partial f(v^*)} = \underbrace{\frac{\sum_{u \in V} \exp(f(v^*) \cdot f(u)) \cdot f(u)}{\sum_{u \in V} \exp(f(v^*) \cdot f(u))}}_{\text{partition function gradient}} \; - \; \underbrace{\sum_{n_i \in N_S(v^*)} f(n_i)}_{\text{neighbor direction}} \; + \; \underbrace{\gamma \cdot \frac{f(v^*) - \mu_c}{\| f(v^*) - \mu_c \|_2}}_{\text{closest cluster direction}}   (6)

The gradient of the partition function pulls the representation of v* towards the origin. The second term moves the representation of v* closer to the representations of its neighbors in the embedding space, while the third term moves the node closer to the closest cluster center. If we set a high γ value, the third term dominates the gradient. This will cause the node to gravitate towards the closest cluster center, which might not contain the neighbors of v*. An example is shown in Figure 2a. If the set of nodes that belong to cluster center c is V_c, then the gradient of the objective function with respect to µ_c is described by

\frac{\partial L}{\partial \mu_c} = -\gamma \cdot \sum_{v \in V_c} \frac{f(v) - \mu_c}{\| f(v) - \mu_c \|_2}.   (7)

In Equation (7) we see that the gradient moves the cluster center by the sum of coordinates of nodes in the embedding space that belong to cluster c. Moreover, if a cluster ends up empty it will not be updated, as the elements of the gradient would be zero. Because of this, cluster centers and embedding weights are initialized with the same uniform distribution. A wrong initialization, like the one with an empty cluster in Subfigure 2b, can affect clustering performance considerably.
Data: G = (V, E) – graph to be embedded;
      N – number of sequence samples per node;
      l – length of sequences;
      ω – context size;
      d – number of embedding dimensions;
      |C| – number of clusters;
      k – number of noise samples;
      γ0 – initial clustering weight coefficient;
      α0, αF – initial and final learning rate.
Result: f(v) for every v ∈ V; µc for every c ∈ C.
1  Model ← Initialize Model(|V|, d, |C|)
2  t ← 0
3  for n in 1:N do
4      V̂ ← Shuffle(V)
5      for v in V̂ do
6          t ← t + 1
7          γ ← Update γ(γ0, t, w, l, N, |V|)
8          α ← Update α(α0, αF, t, w, l, N, |V|)
9          Sequence ← Sample Nodes(G, v, l)
10         Features ← Extract Features(Sequence, ω)
11         Update Weights(Model, Features, γ, α, k)
12     end
13 end
Algorithm 1: GEMSEC training procedure
B. GEMSEC algorithm

We propose an efficient learning method to create GEMSEC embeddings, described with pseudo-code in Algorithm 1. The main idea behind our procedure is the following. To avoid the clustering cost overpowering the graph information (as in Fig. 2a), we initialize the system with a low weight γ0 ∈ [0, 1] for clustering, and through iterations anneal it to 1.

The embedding computation proceeds as follows. The weights in the model are initialized based on the number of vertices, embedding dimensions and clusters. After this, the algorithm makes N sampling repetitions in order to generate vertex sequences from every source node. Before starting a sampling epoch, it shuffles the set of vertices. We set the clustering cost coefficient γ (line 7) according to an exponential annealing rule described by Equation (8). The learning rate is set to α (line 8) with a linear annealing rule (Equation (9)).

\gamma = \gamma_0 \cdot 10^{-\frac{t \cdot \log_{10} \gamma_0}{w \cdot l \cdot |V| \cdot N}}   (8)

\alpha = \alpha_0 - (\alpha_0 - \alpha_F) \cdot \frac{t}{w \cdot l \cdot |V| \cdot N}   (9)

The sampling process reads sequences of length l (line 9) and extracts features using the context window size ω (line 10). The extracted features, gradient, current learning rate and clustering cost coefficient determine the update to model weights by the optimizer (line 11). In the implementation we utilized a variant of stochastic gradient descent – the Adam optimizer [12]. We approximate the first cost term with noise contrastive estimation to make the gradient descent tractable, drawing k noise samples for each positive sample. If node sampling is done by first-order random walks, the runtime complexity of this procedure is O((ω · k + |C|) · l · d · |V| · N), while DeepWalk with noise contrastive estimation has an O(ω · k · l · d · |V| · N) runtime complexity.
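The following schematic Python rendering of Algorithm 1 shows how the two annealing rules interact with the training loop. The schedule functions implement Equations (8) and (9); the sampling and weight-update steps are only indicated with comments (see the sketches in the previous sections), and for simplicity the denominator of the schedules is passed in as a single total_steps argument rather than the paper's w · l · |V| · N product. All names are ours, not the reference implementation's.

import math
import random

def clustering_weight(gamma_0, t, total_steps):
    # Eq. (8): equals gamma_0 at t = 0 and is annealed up to 1 at t = total_steps.
    return gamma_0 * 10 ** (-t * math.log10(gamma_0) / total_steps)

def learning_rate(alpha_0, alpha_f, t, total_steps):
    # Eq. (9): linear decay from alpha_0 down to alpha_f.
    return alpha_0 - (alpha_0 - alpha_f) * t / total_steps

def train(nodes, N=5, gamma_0=0.1, alpha_0=1e-2, alpha_f=5e-3):
    total_steps = N * len(nodes)
    t = 0
    for _ in range(N):                     # N sampling repetitions (line 3)
        order = list(nodes)
        random.shuffle(order)              # shuffle the vertex set (line 4)
        for v in order:                    # one source node per step (line 5)
            t += 1
            gamma = clustering_weight(gamma_0, t, total_steps)       # line 7
            alpha = learning_rate(alpha_0, alpha_f, t, total_steps)  # line 8
            # lines 9-11: sample a walk from v, extract windows of size omega,
            # and take an Adam step on the NCE-approximated loss with weights gamma and alpha.
    return t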
C. Smoothness Regularization for coherent community detection

We have seen in Subsection III-A that there is a tension between what the clustering objective considers to be clusters and what the real communities are in the underlying social network. We can incorporate additional knowledge of social network communities using a machine learning technique called regularization.

We observe that social networks have natural local properties such as homophily, strong ties between members of a community, etc. Thus, we can incorporate such social network-specific properties in the form of regularization to find more natural embeddings and clusters.

This regularization effect can be achieved by adding a term Λ to the loss function:

\Lambda = \lambda \cdot \sum_{(v,u) \in E_S} w_{(v,u)} \cdot \| f(v) - f(u) \|_2,   (10)

where the weight function w determines the social network cost of the embedding with respect to properties of the edges traversed in the sampling. We use the neighborhood overlap of an edge – defined as the fraction of neighbors common to the two nodes of the edge relative to the union of the two neighbor sets, i.e., for an edge (a, b) with neighbor sets N(a) and N(b), the Jaccard similarity |N(a) ∩ N(b)| / |N(a) ∪ N(b)|. In experiments on real data, neighborhood overlap is known to be a strong indicator of the strength of relation between members of a social network [28]. Thus, by treating neighborhood overlap as the weight w_{v,u} of edge (v, u), we can get effective social network clustering, which is confirmed by experiments in the next section. The coefficient λ lets us tune the contribution of the social network cost in the embedding process. In experiments, the regularized version of the algorithms is found to be more robust to changes in hyperparameters.
The effect of the regularization can be understood intuitively through an example. For this exposition, let us consider matrix representations of the social network describing closeness of nodes. In fact, other skip-gram style learning processes like [8], [9] are known to approximate the factorization of a similarity matrix M such as [26]:

M_{u,v} = \log\left( \frac{\mathrm{vol}(G)}{\omega} \sum_{r=1}^{\omega} \frac{\sum_{P \in \mathcal{P}^{r}_{v,u}} \prod_{a \in P \setminus \{v\}} \frac{1}{\deg(a)}}{\deg(v)} \right) - \log(k),

where \mathcal{P}^{r}_{v,u} is the set of paths going from v to u with length r. Elements of the target matrix M grow with the number of paths of length at most ω between the corresponding nodes. Thus M is intended to represent the level of connectivity between nodes in terms of a raw graph feature like the number of paths.

The barbell graph in Figure 3a is a typical example with an obvious community structure we can use to analyze the matter. The optimization procedure used by DeepWalk [8] aims to converge to a target matrix M_{u,v} shown in Figure 3b. Observe that this matrix has fuzzy edges around the communities of the graph, showing a degree of uncertainty. An actual approximation obtained by running DeepWalk is shown in Figure 3c, which naturally incorporates further uncertainty due to sampling. A much clearer output with sharp communities can be obtained by applying a regularized optimization. This can be seen in Figure 3d.
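For readers who want to reproduce the barbell illustration, the following sketch builds a path-counting target matrix of this flavor with networkx and NumPy. This is a simplification in the spirit of [26]; the exact normalization of the equation above may differ, and the function name is ours.

import numpy as np
import networkx as nx

def deepwalk_target_matrix(graph, window=3, neg_samples=1):
    # log of a degree-normalized sum of random-walk transition powers up to the window size.
    A = nx.to_numpy_array(graph)
    deg = A.sum(axis=1)
    P = A / deg[:, None]                          # one-step transition matrix D^{-1} A
    S = np.zeros_like(A)
    Pr = np.eye(A.shape[0])
    for _ in range(window):
        Pr = Pr @ P                               # r-step transition probabilities
        S += Pr
    M = (deg.sum() / window) * S / deg[None, :]   # vol(G)/omega prefactor, divide by target degree
    return np.log(np.maximum(M, 1e-12)) - np.log(neg_samples)

M = deepwalk_target_matrix(nx.barbell_graph(10, 0), window=3)   # two dense communities joined by an edge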
TABLE I
STATISTICS OF SOCIAL NETWORKS USED IN THE PAPER.

Source    Dataset      |V|      Density   Transitivity
Facebook  Politicians   5,908   0.0024    0.3011
Facebook  Companies    14,113   0.0005    0.1532
Facebook  Athletes     13,866   0.0009    0.1292
Facebook  Media        27,917   0.0005    0.1140
Facebook  Celebrities  11,565   0.0010    0.1666
Facebook  Artists      50,515   0.0006    0.1140
Facebook  Government    7,057   0.0036    0.2238
Facebook  TV Shows      3,892   0.0023    0.5906
Deezer    Croatia      54,573   0.0004    0.1146
Deezer    Hungary      47,538   0.0002    0.0929
Deezer    Romania      41,773   0.0001    0.0752

IV. EXPERIMENTAL EVALUATION

In this section we evaluate the cluster quality obtained by the GEMSEC variants, their scalability, robustness and predictive performance on a downstream supervised task. Results show that GEMSEC outperforms or is at par with existing methods in all measures.

A. Datasets

For the evaluation of GEMSEC, real-world social network datasets are used which we collected from public APIs specifically for this work. Table I shows these social networks have a variety of size, density, and level of clustering. We used graphs from two sources:
• Facebook page networks: These graphs represent mutual like networks among verified Facebook pages – the types of sites included TV shows, politicians, athletes, and artists among others.
• Deezer user-user friendship networks: We collected friendship networks from the music streaming site Deezer and included 3 European countries (Croatia, Hungary, and Romania). For each user, we curated the list of genres loved based on the songs liked by the user.

B. Standard parameter settings

A fixed standard parameter setting is used in our experiments, and we indicate any deviations. Models using the first-order random walk sampling strategy are referenced as GEMSEC and Smooth GEMSEC; second-order random walk variants are named GEMSEC2 and Smooth GEMSEC2. Random walks of length 80 are used, with 5 truncated random walks per source node. The second-order random walk control hyperparameters [9] return and in-out were chosen from {2^-2, 2^-1, 1, 2, 4}. A window size of 5 is used for features. Each embedding has 16 dimensions and we extract 20 cluster centers. A parameter sweep over hyperparameters was used to obtain the highest average modularity. Initial learning rate values are chosen from {10^-2, 5·10^-3, 10^-3} and the final learning rate is chosen from {10^-3, 5·10^-4, 10^-4}. Noise contrastive estimation uses 10 negative examples. The initial clustering cost coefficient is chosen from {10^-1, 10^-2, 10^-3}. The smoothness regularization term's hyperparameter is 0.0625 and Jaccard's coefficient is the penalty weight.

C. Cluster Quality

Using the Facebook page networks we evaluate the clustering performance. Cluster quality is evaluated by modularity – we assume that a node belongs to a single community. Our results are summarized in Table II based on 10 experimental repetitions, and errors in parentheses correspond to two standard deviations. The baselines use the hyperparameters from the respective papers. We used 16-dimensional embeddings throughout. The embeddings obtained with non-community-aware methods were clustered after the embedding by k-means clustering to extract 20 cluster centers; a sketch of this evaluation protocol is given after the list below. Specifically, comparisons are made with:
1) Overlap Factorization [29]: Factorizes the neighborhood overlap matrix to create features.
2) DeepWalk [8]: Approximates the sum of the adjacency matrix powers with first order random walks and implicitly factorizes it.
3) LINE [16]: Implicitly factorizes the sum of the first two powers of the normalized adjacency matrix, and the resulting node representation vectors are concatenated together to form a multi-scale representation.
4) Node2Vec [9]: Factorizes a neighbourhood matrix obtained with second order random walks. The in-out and return parameters of the second-order random walks were chosen from the {2^-2, 2^-1, 1, 2, 4} set to maximize modularity.
5) Walklets [17]: Approximates each adjacency matrix power individually with first order random walks and implicitly factorizes the target matrix. These embeddings are concatenated to form a multi-scale representation of nodes.
6) ComE [13]: Uses a Gaussian mixture model to learn an embedding and clustering jointly using random walk features.
7) M-NMF [14]: Factorizes a matrix which is a weighted sum of the first two proximity matrices with a modularity-based regularization constraint.
8) DANMF [15]: Decomposes a weighted sum of the first two proximity matrices hierarchically to obtain cluster memberships with an autoencoder-like non-negative matrix factorization model.
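The evaluation protocol for the non-community-aware baselines can be sketched as follows (our own code, using scikit-learn's k-means and networkx's modularity; the embedding is assumed to be a dict mapping nodes to vectors):

import numpy as np
import networkx as nx
from sklearn.cluster import KMeans

def embedding_modularity(graph, embedding, n_clusters=20, seed=0):
    # Cluster the node embeddings with k-means and score the resulting partition by modularity.
    nodes = list(graph.nodes())
    X = np.array([embedding[v] for v in nodes])
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    communities = [{v for v, c in zip(nodes, labels) if c == k} for k in range(n_clusters)]
    communities = [c for c in communities if c]   # modularity requires a partition without empty parts
    return nx.algorithms.community.modularity(graph, communities)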
Fig. 3. An example barbell graph with (a) the graph, (b) the corresponding target matrix factorized by DeepWalk (window size of 3) [26], (c) the approximation obtained with standard DeepWalk, and (d) the regularized approximation obtained with Smooth DeepWalk. Regularized optimization produces more well-defined communities, while the standard DeepWalk model has less well-defined clusters.

TABLE II
MEAN MODULARITY OF CLUSTERINGS ON THE FACEBOOK DATASETS. EACH EMBEDDING EXPERIMENT WAS REPEATED TEN TIMES. ERRORS IN THE PARENTHESES CORRESPOND TO TWO STANDARD DEVIATIONS. IN TERMS OF MODULARITY, Smooth GEMSEC2 OUTPERFORMS THE BASELINES.

Politicians Companies Athletes Media Celebrities Artists Government TV Shows


Overlap Factorization 0.810 0.553 0.601 0.471 0.551 0.474 0.608 0.786
(±0.008) (±0.010) (±0.020) (±0.016) (±0.01) (±0.018) (±0.024) (±0.008)
DeepWalk 0.840 0.637 0.649 0.481 0.631 0.508 0.686 0.811
(±0.015) (±0.012) (±0.012) (±0.022) (±0.011) (±0.029) (±0.024) (±0.005)
LINE 0.841 0.651 0.665 0.558 0.642 0.557 0.690 0.813
(±0.014) (±0.009) (±0.007) (±0.012) (±0.010) (±0.014) (±0.017) (±0.010)
Node2Vec 0.846 0.664 0.669 0.565 0.643 0.560 0.692 0.827
(±0.012) (±0.008) (±0.007) (±0.011) (±0.013) (±0.010) (±0.017) (±0.016)
Walklets 0.843 0.655 0.664 0.562 0.621 0.548 0.689 0.819
(±0.014) (±0.012) (±0.007) (±0.009) (±0.043) (±0.016) (±0.019) (±0.015)
ComE 0.830 0.654 0.665 0.573 0.635 0.560 0.696 0.806
(±0.008) (±0.005) (±0.007) (±0.005) (±0.010) (±0.011) (±0.010) (±0.011)
M-NMF 0.816 0.646 0.655 0.561 0.628 0.535 0.668 0.813
(±0.014) (±0.007) (±0.008) (±0.004) (±0.006) (±0.021) (±0.011) (±0.008)
DANMF 0.810 0.648 0.650 0.560 0.628 0.532 0.673 0.812
(±0.020) (±0.005) (±0.009) (±0.006) (±0.011) (±0.019) (±0.015) (±0.014)
Smooth DeepWalk 0.849 0.667 0.669 0.541 0.643 0.523 0.707 0.835
(±0.017) (±0.007) (±0.007) (±0.006) (±0.008) (±0.020) (±0.008) (±0.008)
GEMSEC 0.851 0.662 0.674 0.536 0.636 0.528 0.705 0.833
(±0.009) (±0.013) (±0.009) (±0.011) (±0.014) (±0.020) (±0.020) (±0.010)
Smooth GEMSEC 0.855 0.683 0.692 0.567 0.649 0.559 0.710 0.841
(±0.006) (±0.009) (±0.009) (±0.009) (±0.008) (±0.011) (±0.008) (±0.004)
GEMSEC2 0.852 0.667 0.683 0.551 0.638 0.562 0.712 0.838
(±0.010) (±0.008) (±0.008) (±0.008) (±0.009) (±0.020) (±0.010) (±0.010)
Smooth GEMSEC2 0.859 0.684 0.692 0.571 0.649 0.562 0.712 0.847
(±0.006) (±0.009) (±0.007) (±0.010) (±0.011) (±0.017) (±0.010) (±0.006)

are concatenated to form a multi-scale representation of positive effect on the clustering performance of Deepwalk,
nodes. GEMSEC and GEMSEC2 .
6) ComE [13]: Uses a Gaussian mixture model to learn
an embedding and clustering jointly using random walk D. Sensitivity Analysis for hyperparameters
features.
We tested the effect of hyperparameter changes to clustering
7) M-NMF [14]: Factorizes a matrix which is a weighted
performance. The Politicians Facebook graph is embedded
sum of the first two proximity matrices with a modular-
with the standard parameter settings while the initial and final
ity based regularization constraint.
learning rates are set to be 10−2 and 5 · 10−3 respectively, the
8) DANMF [15]: Decomposes a weighted sum of the first
clustering cost coefficient is 0.1 and we perturb certain hyper-
two proximity matrices hierarchically to obtain cluster
parameters. The second-order random walks used in-out and
memberships with an autoencoder-like non-negative ma-
return parameters of 4. In Figure 4 each data point represents
trix factorization model.
the mean modularity calculated from 10 experiments. Based
Smooth GEMSEC, GEMSEC2 and Smooth GEMSEC2 consis- on the experimental results we make two observations. First,
tently outperform the neighborhood conserving node embed- GEMSEC model variants give high-quality clusters for a wide
ding methods and the competing community aware methods. range of parameter settings. Second, introducing smoothness
The relative advantage of Smooth GEMSEC2 over the bench- regularization makes GEMSEC models more robust to hyper-
marks is highest on the Athletes dataset as the clustering’s parameter changes. This is particularly apparent across varying
modularity is 3.44% higher than the best performing baseline. the number of clusters. The length of truncated random walks
It is the worst on the Media dataset with a disadvantage of and the number of random walks per source node above a
0.35% compared to the strongest baseline. Use of smoothness certain threshold has only a marginal effect on the community
regularization has sometimes non-significant, but definitely detection performance.
Fig. 4. Sensitivity of cluster quality to parameter changes, measured by modularity. Panels show modularity as a function of the number of clusters, context size, number of dimensions, walk length, cluster cost coefficient and number of random walks, for GEMSEC, Smooth GEMSEC, GEMSEC2 and Smooth GEMSEC2.

Fig. 5. Sensitivity of optimization runtime to graph size, measured in seconds (log2 optimization runtime against log2 vertex number for DeepWalk, Smooth DeepWalk, GEMSEC, Smooth GEMSEC, M-NMF and ComE). The dashed lines are linear references.
E. Music Genre Recommendation

Node embeddings are often used for extracting features of nodes for downstream predictive tasks. In order to investigate this, we use the social networks of Deezer users collected from European countries. We predict the genres (out of 84) of music liked by people. Following the embedding, we used logistic regression with ℓ2 regularization to predict each of the labels, and 90% of the nodes were randomly selected for training. We evaluated the performance on the remaining users. Numbers reported in Table III are F1 scores calculated from 10 experimental repetitions. GEMSEC2 significantly outperforms the other methods on all three countries' datasets. The performance advantage varies between 3.03% and 4.95%. We also see that Smooth GEMSEC2 has lower accuracy, but it is able to outperform DeepWalk, LINE, Node2Vec, Walklets, ComE, M-NMF and DANMF on all datasets.
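A sketch of this prediction pipeline (our own code using scikit-learn; X is assumed to be the |V| × d embedding matrix and Y the |V| × 84 multi-hot genre matrix):

from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def genre_prediction_f1(X, Y, seed=0):
    # One l2-regularized logistic regression per genre, trained on 90% of the nodes.
    X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.1, random_state=seed)
    model = OneVsRestClassifier(LogisticRegression(penalty="l2", max_iter=1000))
    model.fit(X_tr, Y_tr)
    Y_hat = model.predict(X_te)
    return {avg: f1_score(Y_te, Y_hat, average=avg, zero_division=0)
            for avg in ("micro", "macro", "weighted")}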
F. Scalability and computational efficiency

To create graphs of various sizes, we used the Erdős–Rényi model with an average degree of 20. Figure 5 shows the log of mean runtime against the log of the number of nodes. Most importantly, we can conclude that doubling the size of the graph doubles the time needed for optimizing GEMSEC; thus the growth is linear. We also observe that embedding algorithms that incorporate clustering have a higher cost, and regularization also produces a higher cost, but similar growth.
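The scalability experiment can be reproduced with a sketch along these lines (our own code; train_fn stands in for whichever embedding routine is being timed):

import time
import networkx as nx

def runtime_curve(train_fn, exponents=range(7, 16), avg_degree=20, seed=0):
    # Time an embedding routine on Erdos-Renyi style random graphs of doubling size.
    results = []
    for k in exponents:
        n = 2 ** k
        G = nx.gnm_random_graph(n, n * avg_degree // 2, seed=seed)   # mean degree of about 20
        start = time.perf_counter()
        train_fn(G)
        results.append((n, time.perf_counter() - start))
    return results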
V. CONCLUSIONS

We described GEMSEC – a novel algorithm that learns a node embedding and a clustering of nodes jointly. It extends existing embedding models. We showed that smoothness regularization can be used to incorporate social network properties and produce natural embeddings and clusterings. We presented new social datasets, and experimentally, our methods outperform a number of strong community aware node embedding baselines.

VI. ACKNOWLEDGEMENTS

Benedek Rozemberczki and Ryan Davies were supported by the Centre for Doctoral Training in Data Science, funded by EPSRC (grant EP/L016427/1).
TABLE III
MULTI-LABEL NODE CLASSIFICATION PERFORMANCE OF THE EMBEDDING-EXTRACTED FEATURES ON THE DEEZER GENRE LIKES DATASETS. PERFORMANCE IS MEASURED BY AVERAGE F1 SCORE VALUES. MODELS WERE TRAINED ON 90% OF THE DATA AND EVALUATED ON THE REMAINING 10%. ERRORS IN THE PARENTHESES CORRESPOND TO TWO STANDARD DEVIATIONS. GEMSEC MODELS CONSISTENTLY HAVE GOOD PERFORMANCE.

                       Croatia                        Hungary                        Romania
                 Micro  Macro  Weighted         Micro  Macro  Weighted         Micro  Macro  Weighted
Overlap Factorization 0.319 0.026 0.208 0.361 0.029 0.227 0.275 0.020 0.167
(±0.017) (±0.002) (±0.010) (±0.007) (±0.001) (±0.006) (±0.025) (±0.003) (±0.017)
DeepWalk 0.321 0.026 0.207 0.361 0.029 0.228 0.307 0.023 0.186
(±0.006) (±0.002) (±0.004) (±0.004) (±0.002) (±0.002) (±0.008) (±0.002) (±0.006)
LINE 0.331 0.028 0.212 0.374 0.033 0.250 0.332 0.028 0.212
(±0.013) (±0.002) (±0.010) (±0.007) (±0.002) (±0.005) (±0.007) (±0.002) (±0.006)
Node2Vec 0.348 0.032 0.235 0.393 0.037 0.267 0.346 0.031 0.229
(±0.012) (±0.003) (±0.010) (±0.008) (±0.002) (±0.011) (±0.008) (±0.002) (±0.008)
Walklets 0.363 0.043 0.270 0.397 0.051 0.307 0.361 0.050 0.281
(±0.013) (±0.003) (±0.012) (±0.007) (±0.001) (±0.006) (±0.011) (±0.005) (±0.012)
ComE 0.326 0.028 0.217 0.363 0.033 0.246 0.323 0.028 0.212
(±0.012) (±0.002) (±0.009) (±0.010) (±0.001) (±0.007) (±0.008) (±0.001) (±0.006)
M-NMF 0.336 0.028 0.217 0.369 0.032 0.239 0.330 0.028 0.209
(±0.005) (±0.001) (±0.003) (±0.015) (±0.002) (±0.011) (±0.016) (±0.002) (±0.013)
DANMF 0.340 0.027 0.210 0.365 0.031 0.242 0.335 0.029 0.210
(±0.007) (±0.002) (±0.002) (±0.011) (±0.002) (±0.008) (±0.009) (±0.002) (±0.012)
Smooth DeepWalk 0.329 0.028 0.215 0.375 0.032 0.244 0.321 0.026 0.204
(±0.006) (±0.002) (±0.006) (±0.006) (±0.002) (±0.004) (±0.008) (±0.002) (±0.006)
GEMSEC 0.328 0.027 0.212 0.377 0.032 0.244 0.332 0.028 0.213
(±0.006) (±0.002) (±0.004) (±0.004) (±0.002) (±0.004) (±0.008) (±0.002) (±0.006)
Smooth GEMSEC 0.333 0.028 0.215 0.379 0.034 0.250 0.334 0.029 0.215
(±0.006) (±0.002) (±0.004) (±0.006) (±0.002) (±0.004) (±0.008) (±0.002) (±0.006)
GEMSEC2 0.381 0.046 0.287 0.407 0.050 0.310 0.378 0.049 0.289
(±0.007) (±0.003) (±0.005) (±0.005) (±0.003) (±0.007) (±0.009) (±0.003) (±0.007)
Smooth GEMSEC2 0.373 0.044 0.276 0.409 0.053 0.314 0.376 0.049 0.287
(±0.005) (±0.002) (±0.006) (±0.004) (±0.002) (±0.006) (±0.008) (±0.003) (±0.007)

REFERENCES

[1] T. Van Laarhoven and E. Marchiori, "Robust community detection methods with resolution parameter for complex detection in protein protein interaction networks," Pattern Recognition in Bioinformatics, pp. 1–13, 2012.
[2] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan, "Group formation in large social networks: Membership, growth, and evolution," in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2006, pp. 44–54.
[3] S. Papadopoulos, Y. Kompatsiaris, A. Vakali, and P. Spyridonos, "Community detection in social media," Data Mining and Knowledge Discovery, vol. 24, no. 3, pp. 515–554, 2012.
[4] J. Leskovec, A. Rajaraman, and J. D. Ullman, Mining of Massive Datasets. Cambridge University Press, 2014.
[5] P. Pons and M. Latapy, "Computing communities in large networks using random walks," in International Symposium on Computer and Information Sciences. Springer Berlin Heidelberg, 2005, pp. 284–293.
[6] S. Gregory, "Finding overlapping communities in networks by label propagation," New Journal of Physics, vol. 12, no. 10, p. 103018, 2010.
[7] P. Goyal and E. Ferrara, "Graph embedding techniques, applications, and performance: A survey," arXiv preprint arXiv:1705.02801, 2017.
[8] B. Perozzi, R. Al-Rfou, and S. Skiena, "Deepwalk: Online learning of social representations," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014.
[9] A. Grover and J. Leskovec, "Node2vec: Scalable feature learning for networks," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 855–864.
[10] W. W. Zachary, "An information flow model for conflict and fission in small groups," Journal of Anthropological Research, vol. 33, no. 4, pp. 452–473, 1977.
[11] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," 2013.
[12] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.
[13] S. Cavallari, V. W. Zheng, H. Cai, K. C.-C. Chang, and E. Cambria, "Learning community embedding with community detection and node embedding on graphs," in Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 2017, pp. 377–386.
[14] X. Wang, P. Cui, J. Wang, J. Pei, W. Zhu, and S. Yang, "Community preserving network embedding," in AAAI, 2017, pp. 203–209.
[15] F. Ye, C. Chen, and Z. Zheng, "Deep autoencoder-like nonnegative matrix factorization for community detection," in Proceedings of the 27th ACM International Conference on Information and Knowledge Management. ACM, 2018, pp. 1393–1402.
[16] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, "Line: Large-scale information network embedding," in Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 1067–1077.
[17] B. Perozzi, V. Kulkarni, H. Chen, and S. Skiena, "Don't walk, skip!: Online learning of multi-scale network embeddings," in Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017, 2017, pp. 258–265.
[18] J. Fakcharoenphol, S. Rao, and K. Talwar, "A tight bound on approximating arbitrary metrics by tree metrics," Journal of Computer and System Sciences, vol. 69, no. 3, pp. 485–497, 2004.
[19] J. Matoušek, Lectures on Discrete Geometry. Springer, 2002, vol. 108.
[20] X. Yu, X. Ban, W. Zeng, R. Sarkar, X. Gu, and J. Gao, "Spherical representation and polyhedron routing for load balancing in wireless sensor networks," in 2011 Proceedings IEEE INFOCOM. IEEE, 2011, pp. 621–625.
[21] K. Huang, C.-C. Ni, R. Sarkar, J. Gao, and J. S. Mitchell, "Bounded stretch geographic homotopic routing in sensor networks," in IEEE INFOCOM 2014 – IEEE Conference on Computer Communications. IEEE, 2014, pp. 979–987.
[22] R. Sarkar, "Low distortion delaunay embedding of trees in hyperbolic plane," in International Symposium on Graph Drawing. Springer, 2011, pp. 355–366.
[23] C. De Sa, A. Gu, C. Ré, and F. Sala, "Representation tradeoffs for hyperbolic embeddings," Proceedings of Machine Learning Research, vol. 80, p. 4460, 2018.
[24] W. Zeng, R. Sarkar, F. Luo, X. Gu, and J. Gao, "Resilient routing for sensor networks using hyperbolic embedding of universal covering space," in 2010 Proceedings IEEE INFOCOM. IEEE, 2010, pp. 1–9.
[25] B. Rozemberczki and R. Sarkar, "Fast sequence-based embedding with diffusion graphs," in International Workshop on Complex Networks. Springer, 2018, pp. 99–107.
[26] J. Qiu, Y. Dong, H. Ma, J. Li, K. Wang, and J. Tang, "Network embedding as matrix factorization: Unifying deepwalk, line, pte, and node2vec," in Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 2018, pp. 459–467.
[27] M. Gutmann and A. Hyvärinen, "Noise-contrastive estimation: A new estimation principle for unnormalized statistical models," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 297–304.
[28] J.-P. Onnela, J. Saramäki, J. Hyvönen, G. Szabó, D. Lazer, K. Kaski, J. Kertész, and A.-L. Barabási, "Structure and tie strengths in mobile communication networks," Proceedings of the National Academy of Sciences, vol. 104, no. 18, pp. 7332–7336, 2007.
[29] A. Ahmed, N. Shervashidze, S. Narayanamurthy, V. Josifovski, and A. J. Smola, "Distributed large-scale natural graph factorization," in Proceedings of the 22nd International Conference on World Wide Web, 2013, pp. 37–48.
