Predicting Graph Operator Output Over Multiple Graphs
1 Introduction
A huge amount of data originating from the Web can be naturally expressed
as graphs, e.g., product co-purchasing graphs [25], community graphs [28], etc.
Graph analytics is a common tool used to effectively tackle complex tasks such as
social community analysis, recommendations, fraud detection, etc. Many diverse
graph operators are available [12], with functionality including the computation
of centrality measures, clustering metrics or network statistics [8], all regularly
utilized in tasks such as classification, community detection and link prediction.
Yet, as Big Data technologies mature and evolve, emphasis is placed on areas
not solely related to data (i.e., graph) size. A different type of challenge steadily
c Springer Nature Switzerland AG 2019
M. Bakaev et al. (Eds.): ICWE 2019, LNCS 11496, pp. 107–122, 2019.
https://doi.org/10.1007/978-3-030-19274-7_9
108 T. Bakogiannis et al.
To the best of our knowledge, this is the first effort to predict graph operator
output over large datasets. In summary, we make the following contributions:
2 Methodology
In this section, we formulate the problem and describe the methodology along
with different aspects of the proposed solution. We start off with some basic
notation followed throughout the paper and a formal description of our method
and its complexity.
Let a graph G be an ordered pair G = (V, E) with V being the set of vertices
and E the set of edges of G, respectively. The degree of a vertex u ∈ V , denoted
by dG (u), is the number of edges of G incident to u. The degree distribution of a
graph G, denoted by PG (k), expresses the probability that a randomly selected
vertex of G has degree k. A dataset D is a set of N simple, undirected graphs
D = {G1 , G2 , ..., GN }. We define a graph operator to be a function g : D → R,
mapping an element of D to a real number. In order to quantify the similarity
between two graphs Ga , Gb ∈ D we use a graph similarity function s : D×D → R
with range within [0, 1]. For two graphs Ga , Gb ∈ D, a similarity of 1 implies
that they are identical, while a similarity of 0 implies that they are entirely dissimilar.
Consequently, the problem we are addressing can be formally stated as fol-
lows: Given a dataset of graphs D and a graph operator g, without knowledge of
the range of g given D, we wish to infer a function ĝ : D → R that approximates g.¹

¹ https://github.com/giagiannis/data-profiler
Here, wxi = R[x, i] is the similarity score for graphs Gx, Gi, i.e., wxi =
s(Gx, Gi), Γk(x) is the set of the k most similar graphs to Gx for which we
have already calculated g, and g(Gi) is the value of the operator for Gi. Our
approach is formally described in Algorithm 1. The complexity of Algorithm 1 can
be broken down into its three main components: (1) the calculation of the similarity
matrix R in lines 3−4, for a given similarity measure s with complexity S;
(2) the computation of the operator g for pN graphs (lines 5−7),
assuming that g has complexity M; and (3) the approximation of the operator
for the remaining graphs (lines 8−10) using kNN. Thus, the overall complexity
of our method is:
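Equation 2 itself did not survive extraction; assembling the three components just enumerated (similarity matrix, operator runs on the pN sampled graphs, kNN estimation for the rest), its form is presumably the following reconstruction, where the kNN term assumes a linear scan over the pN sampled graphs per prediction:

```latex
C = \underbrace{O(N^2 \cdot S)}_{\text{similarity matrix}}
  + \underbrace{O(pN \cdot M)}_{\text{operator on sample}}
  + \underbrace{O\big((1-p)N \cdot pN\big)}_{\text{kNN approximation}}
\qquad (2)
```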
From Eq. 2, we deduce that the complexity of our method is dominated by its
first two components. Consequently, the lower the computational cost of s, the
more efficient our approach will be. Additionally, we expect our training set to
be much smaller than our original dataset (i.e., p ≪ 1).
To calculate the degrees for a given level, for each vertex we perform a depth-limited
Depth First Search up to level hops away in order to mark the internal edges
of the super-node. We then count the edges of the border vertices (vertices level
hops away from the source) that do not connect to any internal vertices.
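The level-degree computation just described can be sketched as follows. This is a minimal illustration, not the paper's implementation: a BFS marks hop distances (the paper describes a depth-limited DFS, which reaches the same vertices), and the adjacency-list representation is an assumption.

```python
from collections import deque

def level_degree(adj, source, level):
    """Degree of the super-node at the given level around `source`:
    edges from border vertices (exactly `level` hops away) to
    vertices outside the super-node."""
    # BFS to find hop distances up to `level` (marks internal vertices).
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if dist[v] == level:
            continue  # do not expand past the border
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    # Count edges of border vertices that leave the super-node.
    border = [v for v, d in dist.items() if d == level]
    return sum(1 for v in border for w in adj[v] if w not in dist)

# Tiny example: a path 0-1-2-3; around vertex 0 at level 1,
# the super-node {0, 1} has a single outgoing edge (1, 2).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(level_degree(adj, 0, 1))  # 1
print(level_degree(adj, 0, 0))  # 1 (the ordinary degree of vertex 0)
```

Note that level 0 recovers the plain vertex degree dG(u), consistent with the degree distribution being the base case of this family of measures.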
Degree Distribution + Vertex Count: A second extension to our degree
distribution-based similarity measure is based on the ability of our method to
combine similarity matrices. Graph size (vertex count) is another graph attribute
to measure similarity on. We formulate similarity in terms of vertex count
as: s(Gi, Gj) = min(|VGi|, |VGj|) / max(|VGi|, |VGj|). Intuitively, s approaches 1 when |VGi| − |VGj|
approaches 0, i.e., when Gi , Gj have similar vertex counts. To incorporate vertex
count into the graph comparison, we can combine the similarity matrices com-
puted with degree distributions and vertex counts using an arbitrary formula
(e.g., linear composition).
[Figure: Illustration of degree levels. A graph on vertices u0–u8, with the level-0, level-1 and level-2 neighborhoods of vertex u0 and their super-node degrees marked.]
2.2 Discussion
In this section, we consider a series of issues that relate to the configuration and
performance of our method as well as to the relation between modeled operators,
similarity measure and input datasets.
Graph Operators: This work focuses on graph analytics operators, namely
centralities, clustering metrics, network statistics, etc. Research on this area has
resulted in a large collection of operators, also referred to as topology metrics
(e.g., [7,8,12,20]). Topology metrics can be loosely classified into three categories
([7,20,22]): those related to distance, connectivity and spectrum. In the first
class, we find metrics like diameter, average distance or betweenness centrality.
In the second, average degree, degree distribution, etc. Finally, the third class
comes from the spectral analysis of a graph and contains the computation of
eigenvalues, eigenvectors or other spectral-related metrics.
Combining Similarity Measures: We can think of use cases where we
want to quantify the similarity of graphs based on parameters unrelated to each
other. For example, we might want to compare two graphs based on their degree
distributions, but also take into account their vertex counts. This composition
can be naturally implemented in our system by computing independent similarity
matrices and “fusing” those matrices into one using a formula. This technique is
presented in our evaluation and proves effective for a number of operators.
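The matrix-fusion idea above can be sketched in a few lines of pure Python, assuming the linear composition R = w1·Rd + w2·Rn used later in the evaluation; the degree-distribution similarities and vertex counts below are made up for illustration.

```python
def vertex_count_similarity(n_i, n_j):
    """s approaches 1 as the two vertex counts converge (Sect. 2.1)."""
    return min(n_i, n_j) / max(n_i, n_j)

def fuse(R_a, R_b, w_a=0.5, w_b=0.5):
    """Linear composition of two similarity matrices: R = w_a*R_a + w_b*R_b."""
    n = len(R_a)
    return [[w_a * R_a[i][j] + w_b * R_b[i][j] for j in range(n)]
            for i in range(n)]

# Hypothetical degree-distribution similarities for three graphs ...
R_d = [[1.0, 0.8, 0.2],
       [0.8, 1.0, 0.4],
       [0.2, 0.4, 1.0]]
# ... and their (made-up) vertex counts.
counts = [4000, 3000, 1000]
R_n = [[vertex_count_similarity(a, b) for b in counts] for a in counts]
R = fuse(R_d, R_n)
print(R[0][1])  # 0.5*0.8 + 0.5*(3000/4000) = 0.775
```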
Regression Analysis: Although there exist several approaches to statistical
learning [19], we have opted for the kNN method. We choose kNN for its sim-
plicity and because we do not have to calculate distances between points of our
dataset (we already have that information from the similarity matrix). The kNN
algorithm is also suitable for our use case, since it is sensitive to localized data and
insensitive to outliers. This is a desirable property: we expect similar graphs to have
similar operator scores, and such graphs should therefore influence our estimations the most.
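A minimal sketch of the similarity-weighted kNN estimate of Eq. 1: the operator value for an unseen graph is approximated as the similarity-weighted average over its k most similar graphs with known g. The similarity matrix and operator values here are illustrative, not from the paper.

```python
def knn_estimate(x, R, known_g, k):
    """Approximate g(G_x) as the similarity-weighted average of the
    operator values of its k most similar graphs with known g (Eq. 1)."""
    # Gamma_k(x): the k most similar graphs to G_x among those with known g.
    neighbors = sorted(known_g, key=lambda i: R[x][i], reverse=True)[:k]
    total = sum(R[x][i] for i in neighbors)
    return sum(R[x][i] * known_g[i] for i in neighbors) / total

# Similarity matrix for 4 graphs; g has been computed for graphs 0-2.
R = [[1.0, 0.9, 0.1, 0.8],
     [0.9, 1.0, 0.2, 0.7],
     [0.1, 0.2, 1.0, 0.3],
     [0.8, 0.7, 0.3, 1.0]]
known_g = {0: 10.0, 1: 12.0, 2: 50.0}
print(knn_estimate(3, R, known_g, k=2))  # (0.8*10 + 0.7*12)/(0.8+0.7) ≈ 10.93
```

The dissimilar graph 2 (similarity 0.3) is excluded from the estimate, which is the locality property the paragraph above relies on.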
Scaling Similarity Computations: Having to compute all-pairs similarity
scores for a large collection of graphs can be prohibitively expensive. To this end,
we introduce a preprocessing step which, we argue, improves on the existing
computational cost by reducing the number of similarity calculations performed.
Since we employ kNN to approximate a graph operator, for each graph we only
require the similarity scores to its k most similar graphs for which we have the
value of g, i.e., the weights in Eq. 1. Therefore, we
propose to run a clustering algorithm which will produce clusters of graphs with
high similarity. Then, for each cluster, we compute all-pairs similarity scores between
its members, setting inter-cluster similarities to zero. By creating clusters of
size much larger than k, we expect minimal loss in accuracy while avoiding a
considerable number of similarity computations. As a clustering algorithm we
use a simplified version of k-medoids in combination with k-means++, for the
initial seed selection ([2,23]). For an extensive experimental evaluation of this
technique we refer the reader to the extended version of our work in [34] which
we have not included here due to space constraints.
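The preprocessing step above can be sketched as follows. This is a deliberate simplification: cluster medoids are given rather than selected via k-medoids with k-means++ seeding, and the toy "graphs" are stand-ins compared by the vertex-count similarity of Sect. 2.1.

```python
def clustered_similarities(graphs, seeds, s):
    """Compute similarity scores only within clusters, leaving
    inter-cluster similarities at zero (the sparsification idea above)."""
    n = len(graphs)
    # Assign every graph to its most similar seed: n * len(seeds) calls
    # to s instead of the n*(n-1)/2 needed for the full matrix.
    clusters = {m: [] for m in seeds}
    for i in range(n):
        best = max(seeds, key=lambda m: s(graphs[i], graphs[m]))
        clusters[best].append(i)
    # All-pairs similarities inside each cluster only.
    R = [[0.0] * n for _ in range(n)]
    for members in clusters.values():
        for a in members:
            for b in members:
                R[a][b] = 1.0 if a == b else s(graphs[a], graphs[b])
    return R

# Toy example: "graphs" represented by their vertex counts.
graphs = [100, 110, 120, 1000, 1100]
s = lambda a, b: min(a, b) / max(a, b)
R = clustered_similarities(graphs, seeds=[0, 3], s=s)
print(R[0][1] > 0, R[0][3] == 0.0)  # intra-cluster kept, inter-cluster zero
```

With cluster sizes well above k, the zeroed inter-cluster entries rarely belong to any graph's k nearest neighbors, which is why the accuracy loss stays minimal.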
3 Experimental Evaluation
Datasets: For our experimental evaluation, we consider both real and syn-
thetic datasets. The real datasets comprise a set of ego graphs from Twitter
(TW ) which consists of 973 user “circles” as well as a dataset containing 733
snapshots of the graph that is formed by considering the Autonomous Systems
(AS) that comprise the Internet as nodes and adding links between those systems
that communicate with each other. Both datasets are taken from the Stanford
Large Network Dataset Collection [26].
We also experiment with a dataset of synthetic graphs (referred to as the
BA dataset) generated using the SNAP library [27]. We use the GenPrefAttach
generator to create random scale-free graphs with power-law degree distributions
using the Barabasi-Albert model [3]. We keep the vertex count of the graphs
constant at 4K. We introduce randomness to this dataset by having the initial
outdegree of each vertex be a uniformly random number in the range [1, 32]. The
Barabasi-Albert model constructs a graph by adding one vertex at a time. The
initial outdegree of a vertex is the maximum number of vertices it connects to
at the moment it is added to the graph. The graphs of the dataset are simple and
undirected. Further details about the datasets can be found in Table 1.
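The preferential-attachment construction described above can be illustrated with a pure-Python sketch. The paper uses SNAP's GenPrefAttach generator; this stand-in only mirrors the process (vertices arrive one at a time, each drawing an initial outdegree uniformly from [1, 32] and attaching preferentially by degree).

```python
import random

def barabasi_albert(n, max_outdeg=32, seed=42):
    """Sketch of the Barabasi-Albert construction described above:
    each new vertex u draws an initial outdegree m in [1, max_outdeg]
    and attaches to m distinct existing vertices chosen with
    probability proportional to their current degree."""
    rng = random.Random(seed)
    edges = set()
    targets = [0]  # vertex ids repeated proportionally to their degree
    for u in range(1, n):
        # The outdegree cannot exceed the number of existing vertices.
        m = min(rng.randint(1, max_outdeg), u)
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(targets))  # degree-preferential pick
        for v in chosen:
            edges.add((min(u, v), max(u, v)))  # simple, undirected
            targets.extend([u, v])
    return edges

edges = barabasi_albert(200)
print(len(edges))  # roughly the sum of the drawn outdegrees
```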
Similarity Measures: We evaluate all the similarity measures proposed in
Sect. 2.1, namely degree distribution + levels, for levels 0, 1, 2 and degree dis-
tribution + vertex count. When combining vertex count with degree, we use
the following simple formula: R = w1 Rd + w2 Rn , with Rd , Rn the degree dis-
tribution and vertex count similarity matrices respectively. In our evaluation,
w1 = w2 = 0.5. To investigate their strengths and limitations, we compare them
against two measures functioning as our baselines. The first is a sophisticated
similarity measure not based on degree but rather on distance distributions (from
which the degree distribution can be deduced). D-measure [32] is based on the
concept of network node dispersion (NND) which is a measure of the heterogene-
ity of a graph in terms of connectivity distances. It is a state-of-the-art graph
similarity measure with very good experimental results for both real and syn-
thetic graphs. Our second baseline comes from the extensively researched area
of graph kernels. For the purposes of our evaluation, we opted for the geometric
Random Walk Kernel (rw-kernel ) [16] as a widely used representative of this
class of similarity measures. In order to avoid the halting phenomenon due to
the kernel's decay factor (λ^k), we set λ = 0.1 and the number of steps k ≤ 4,
values that are considered reasonable for the general case [33].
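This excerpt does not show the exact degree-distribution similarity of Sect. 2.1. Given the Bhattacharyya references [1,5] in the bibliography, a plausible instantiation, sketched here purely as an assumption, compares degree distributions via the Bhattacharyya coefficient.

```python
from collections import Counter
from math import sqrt

def degree_distribution(adj):
    """P_G(k): probability that a randomly selected vertex has degree k."""
    counts = Counter(len(neighbors) for neighbors in adj.values())
    n = len(adj)
    return {k: c / n for k, c in counts.items()}

def bhattacharyya_similarity(adj_a, adj_b):
    """Bhattacharyya coefficient between two degree distributions:
    sum over k of sqrt(P_a(k) * P_b(k)); it equals 1 iff the
    distributions are identical and 0 iff their supports are disjoint.
    (An assumed stand-in for the measure of Sect. 2.1, following [1,5].)"""
    p, q = degree_distribution(adj_a), degree_distribution(adj_b)
    return sum(sqrt(p[k] * q.get(k, 0.0)) for k in p)

# A 4-cycle (all degrees 2) vs. a 4-vertex star (degrees 3,1,1,1):
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(round(bhattacharyya_similarity(cycle, cycle), 6))  # 1.0
print(bhattacharyya_similarity(cycle, star))             # 0.0 (no common degree)
```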
3.1 Experiments
This can be attributed to the topology of the AS graphs. These graphs display
a linear relationship between vertex and edge counts. Their clustering coefficient
displays very little variance, suggesting that as the graphs grow in size they
keep the same topological structure. This gradual, uniform evolution of the AS
graphs leads to easier modeling of the values of a given graph topology measure.
On the other hand, our approach has better accuracy for degree- than
distance-related metrics in the cases of the TW and BA datasets. The simi-
larity measure we use is based on the degree distribution that is only indirectly
related to vertex distances. This can be seen, for example, in the case of BA if we
compare the modeling error for the betweenness centrality (bc) and PageRank
(pr) measures. Overall, we see that eigenvector and closeness centralities are the
two most accurately approximated metrics across all datasets. Next up, we find
PageRank, spectral radius, betweenness and edge betweenness centralities. To
further examine the connection between modeling accuracy and similarity
measures, we have included D-measure and rw-kernel in our evaluation, as well
as the degree-level similarity measures and the similarity matrix combination
technique.
Execution Speedup: Next, we evaluate the gains our method can provide
in execution time. The similarity matrix computation is a time-consuming step,
yet an advantage of our scheme is that the matrix can be reused for different
graph operators and thus its cost can be amortized. In order to provide a better
insight, we calculate two types of speedups: One that considers the similarity
matrix construction from scratch for each operator separately (provided in the
Speedup column of Table 2) and one that expresses the average speedup for all
six measures for each dataset, where the similarity matrix has been constructed
once (provided in the A. Speedup column of Table 2).
The observed results highlight that our method is not only capable of pro-
viding models of high quality, but also does so in a time-efficient manner. A
closer examination of the Speedup columns shows that our method is particu-
larly efficient for complex metrics that require more computation time (as in the
ebc and cc cases for all datasets). The upper bound of the theoretically anticipated
speedup equals 1/p, p being the sampling ratio. Interestingly, the Amortized
Speedup column indicates that when the procedure of constructing the similar-
ity matrix is amortized to the six operators under consideration, the achieved
speedup is very close to the theoretical one. This is indeed the case for the AS
and BA datasets that comprise the largest graphs in terms of number of vertices:
for all p values, the amortized speedup closely approximates 1/p. In the case
of the TW dataset which consists of much smaller graphs and, hence, the time
dedicated to the similarity matrix estimation is relatively larger than the previ-
ous cases, we observe that the achieved speedup is also sizable. In any case, the
capability of reusing the similarity matrix, which is calculated on a per-dataset
rather than on a per-operator basis, enables our approach to scale and be more
efficient as the number and complexity of graph operators increases.
[Fig. 2: MdAPE vs. sampling ratio (0.05–0.35) for each dataset, comparing the level-0, level-1, level-2, level-0+size, d-sim and rw-kernel similarity measures.]
In Fig. 2e, we observe that adding levels gives better results and that vertex count
contributes to even better modeling, although D-measure gives better approximations.
Yet, our method's errors are already very small (less than 3%) in this case. Considering
the rw-kernel similarity measure, we observe that it performs poorly for
most of the operators. Although its modeling accuracy is comparable to degree
distribution + levels for some operators, we find that for a certain level or in
combination with vertex count a degree distribution-based measure has better
accuracy. Notably, rw-kernel has low accuracy for degree and distance related
operators while performing comparably in the case of spectrum operators.
[Figure: MdAPE vs. sampling ratio (0.05–0.35), comparing the level-0, level-1, level-2, level-0+size and rw-kernel similarity measures.]
We observe that, in the case of AS and BA, it is 19× and 76× faster. The computation
of the D-measure and the rw-kernel, on the other hand, are orders of magnitude
slower. Given the difference in modeling quality between the presented similarity
functions, we observe a clear trade-off between quality of results and execution
time in the context of our method.
4 Related Work
Our work relates to the actively researched areas of graph similarity, graph
analytics and machine learning. The available techniques for quantifying graph
similarity can be classified into three main categories ([24,36]):
Graph Isomorphism - Edit Distance: Two graphs are considered similar
if they are isomorphic. A generalization of the graph isomorphism problem is
expressed through the Edit Distance, i.e., the number of operations that have to
be performed in order to transform one graph to the other [31]. The drawback
of approaches in this category is that graph isomorphism is hard to compute.
Iterative Methods: This category of graph similarity algorithms is based on
the idea that two vertices are similar if their neighborhoods are similar. Applying
this idea iteratively over the entire graph can produce a global similarity score.
Such algorithms compare graphs based on their topology; in contrast, we choose to map
graphs to feature vectors and compare those vectors instead.
Feature Vectors: These approaches are based on the idea that similar graphs
share common properties, such as degree distribution, diameter, etc., and therefore
represent graphs as feature vectors. To assess the degree of similarity between
graphs, statistical tools are used to compare their feature vectors instead. Such
methods are not computationally demanding. Drawing from this category of
measures, we base our graph similarity computations on comparing degree dis-
tributions.
Graph Kernels: A different approach to graph similarity comes from the
area of machine learning where kernel functions can be used to infer knowledge
about samples. Graph kernels are kernel functions constructed on graphs or
graph nodes for comparing graphs or nodes respectively. Extensive research on
this area (e.g., [15,17]) has resulted in many kernels based on walks, paths,
etc. While computationally more expensive, they provide a good baseline for our
modeling accuracy evaluation.
Graph Analytics and Machine Learning: Although graph analytics is a
very thoroughly researched area, there exist few cases where machine learning
techniques are used. On the subject of graph summarization, a new approach is
based on node representations that are learned automatically from the neighborhood
of a vertex [18]. Node representations are also applicable to computing node
or graph similarities, as seen in [18]. However, we find no prior work employing
machine learning techniques in the field of graph mining through graph topology
metric computations.
5 Conclusion
In this work we present an operator-agnostic modeling methodology which lever-
ages similarity between graphs. This knowledge is used by a kNN classifier to
model a given operator allowing scientists to predict operator output for any
graph without having to actually execute the operator. We propose an intuitive,
yet powerful class of similarity measures that efficiently capture graph relations.
Our thorough evaluation indicates that modeling a variety of graph operators is
not only possible, but it can also provide results of high quality at considerable
speedups. Finally, our approach appears to present similar results to state-of-
the-art similarity measures, such as D-measure, in terms of quality, but requires
orders of magnitude less execution time.
References
1. Aherne, F.J., et al.: The Bhattacharyya metric as an absolute similarity measure
for frequency coded data. Kybernetika 34(4), 363–368 (1998)
2. Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In:
Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algo-
rithms, SODA 2007, New Orleans, Louisiana, USA, 7–9 January 2007, pp. 1027–1035
3. Barabási, A.L., Albert, R.: Emergence of scaling in random networks. Science
286(5439), 509–512 (1999)
4. Bentley, J.L.: Multidimensional binary search trees used for associative searching.
Commun. ACM 18(9), 509–517 (1975)
5. Bhattacharyya, A.: On a measure of divergence between two statistical populations
defined by their probability distributions. Bull. Calcutta Math. Soc. 35(1), 99–109
(1943)
6. Bonacich, P.: Power and centrality: a family of measures. Am. J. Sociol. 92(5),
1170–1182 (1987)
7. Bounova, G., de Weck, O.: Overview of metrics and their correlation patterns for
multiple-metric topology analysis on heterogeneous graph ensembles. Phys. Rev.
E 85, 016117 (2012)
8. Brandes, U., Erlebach, T.: Network Analysis: Methodological Foundations.
Springer, New York (2005). https://doi.org/10.1007/b106453
9. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine.
Comput. Netw. 30(1–7), 107–117 (1998)
10. Csardi, G., Nepusz, T.: The Igraph software package for complex network research.
Inter J. Complex Syst. 1695, 1–9 (2006)
11. Eppstein, D., Wang, J.: Fast approximation of centrality. J. Graph Algorithms
Appl. 8, 39–45 (2004)
12. da F. Costa, L., et al.: Characterization of complex networks: a survey of measurements.
Adv. Phys. 56(1), 167–242 (2007)
13. Freeman, L.C.: A set of measures of centrality based on betweenness. Sociometry
40(1), 35–41 (1977)
14. Gandomi, A., Haider, M.: Beyond the hype: big data concepts, methods, and analytics. Int. J. Inf. Manage. 35(2), 137–144 (2015)
15. Gärtner, T.: A survey of kernels for structured data. SIGKDD 5(1), 49–58 (2003)
16. Gärtner, T., Flach, P., Wrobel, S.: On graph kernels: hardness results and efficient
alternatives. In: Schölkopf, B., Warmuth, M.K. (eds.) COLT-Kernel 2003. LNCS
(LNAI), vol. 2777, pp. 129–143. Springer, Heidelberg (2003). https://doi.org/10.
1007/978-3-540-45167-9 11
17. Ghosh, S., Das, N., Gonçalves, T., Quaresma, P., Kundu, M.: The journey of graph
kernels through two decades. Comput. Sci. Rev. 27, 88–111 (2018)
18. Grover, A., Leskovec, J.: node2vec: scalable feature learning for networks. In:
SIGKDD, pp. 855–864. ACM (2016)
19. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data
Mining, Inference, and Prediction, 2nd edn. Springer, New York (2009). https://
doi.org/10.1007/978-0-387-21606-5
20. Hernández, J.M., Mieghem, P.V.: Classification of graph metrics, pp. 1–20 (2011)
21. Jamakovic, A., et al.: Robustness of networks against viruses: the role of the spec-
tral radius. In: Symposium on Communications and Vehicular Technology, pp.
35–38 (2006)
22. Jamakovic, A., Uhlig, S.: On the relationships between topological measures in
real-world networks. NHM 3(2), 345–359 (2008)
23. Kaufmann, L., Rousseeuw, P.: Clustering by means of medoids, pp. 405–416 (1987)
24. Koutra, D., Parikh, A., Ramdas, A., Xiang, J.: Algorithms for graph similarity and
subgraph matching. Technical Report Carnegie-Mellon-University (2011). https://
people.eecs.berkeley.edu/∼aramdas/reports/DBreport.pdf
25. Leskovec, J., Adamic, L.A., Huberman, B.A.: The dynamics of viral marketing.
TWEB 1(1), 5 (2007)
26. Leskovec, J., Krevl, A.: SNAP Datasets: Stanford large network dataset collection,
June 2014. http://snap.stanford.edu/data
27. Leskovec, J., Sosič, R.: SNAP: a general-purpose network analysis and graph-mining
library. ACM Trans. Intell. Syst. Technol. 8(1), 1 (2016)
28. McAuley, J.J., Leskovec, J.: Learning to discover social circles in ego networks. In:
Bartlett, P.L., Pereira, F.C.N., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.)
Advances in Neural Information Processing Systems 25: 26th Annual Conference
on Neural Information Processing Systems 2012. Proceedings of a meeting held
December 3–6, 2012, Lake Tahoe, Nevada, United States, pp. 548–556 (2012)
29. Newman, M.E.J., Girvan, M.: Finding and evaluating community structure in net-
works. Phys. Rev. E 69(2), 026113 (2004)
30. Riondato, M., Kornaropoulos, E.M.: Fast approximation of betweenness centrality
through sampling. Data Min. Knowl. Discov. 30(2), 438–475 (2016)
31. Sanfeliu, A., Fu, K.: A distance measure between attributed relational graphs for
pattern recognition. IEEE Trans. Syst. Man Cybern. 13(3), 353–362 (1983)
32. Schieber, T.A., et al.: Quantification of network structural dissimilarities. Nat.
Commun. 8, 13928 (2017)
33. Sugiyama, M., Borgwardt, K.M.: Halting in random walk kernels. In: Annual Con-
ference on Neural Information Processing Systems, pp. 1639–1647 (2015)
34. Bakogiannis, T., Giannakopoulos, I., Tsoumakos, D., Koziris, N.: Graph operator
modeling over large graph datasets. CoRR abs/1802.05536 (2018). http://arxiv.
org/abs/1802.05536
35. Vishwanathan, S.V.N., Schraudolph, N.N., Kondor, R., Borgwardt, K.M.: Graph
kernels. J. Mach. Learn. Res. 11, 1201–1242 (2010)
36. Zager, L.A., Verghese, G.C.: Graph similarity scoring and matching. Appl. Math.
Lett. 21(1), 86–94 (2008)