IO Efficient Generation of Hyperbolic Random Graphs
Author: Kamil René König
Supervisor: Prof. Dr. Ulrich Meyer
Statutory Declaration
Declaration pursuant to § 24 (12) of the 2008 Master's examination regulations
I hereby confirm that I have written this thesis independently and that I have not used any sources or aids other than those stated in this thesis.
Abstract
Random hyperbolic graphs have been studied by many authors as models for real-world complex networks. Such graphs embed nodes in a hyperbolic plane, where a node pair establishes an edge whenever the distance between the nodes is smaller than a chosen value. Since the efficient generation of these graphs is usually analysed under the assumption of an underlying unit-cost RAM model, scalability regarding memory usage and I/O-efficiency is a problem for many in-memory algorithms.
We present the first parallelisable random hyperbolic graph generator that uses an external memory (EM) approach. The generator empirically runs in sorting time; its algorithm is a modification of a state-of-the-art algorithm by Looz et al. [LLM16]. We prove that the candidate selection per node for edge-creation requires memory sublinear in the number of nodes n for sufficiently small average degrees. Based on that, we show that the generator has an I/O-complexity of sort(n). Since the algorithm is based on a radial subdivision of the hyperbolic plane, we also devise a proxy calculation for the workload per band to aid in finding a radial partitioning with a desired workload distribution. In practical comparisons between the original algorithm and our EM-variant, our generator is able to compete in an internal-memory-focused benchmark setting. Furthermore, when used in an EM setting, we demonstrate runtime improvements over the original of an order of magnitude at sufficiently large graph sizes. In such an EM setting, our generator is able to embed graphs with $10^9$ nodes and establish $5 \cdot 10^{10}$ edges in under an hour.
Contents

Abstract
1 Introduction
1.1 EM Model and RAM Model
1.2 Mathematical Background
1.2.1 Graph Models for the Representation of Complex Networks
1.2.2 Hyperbolic Space and its Relation to Complex Networks
1.2.3 The Generative Model and Native Representation
1.2.4 Poincaré Model
1.3 STXXL
2 Algorithms
2.1 State of the Art: NkGen
2.2 EM-Variant of NkGen
2.2.1 Main Algorithm
2.2.2 2-Sorter-Version
2.2.3 0-Sorter-Version
2.3 Parallelisation
2.3.1 Radial Parallelisation
2.3.2 Angular Parallelisation
2.3.3 Parallelisation of the Generation Phase
2.4 GIRG
2.4.1 The Algorithm
3 Analysis
3.1 NkGen: I/O-Analysis
3.2 EM-Variant of NkGen
3.2.1 I/O-Complexity
3.3 Runtime Analysis
3.4 GIRG: I/O-Analysis
3.5 Comparison
3.6 Radial Partitionings
3.6.1 Overview
3.6.2 Estimating the Expected Workload of a Partitioning
3.6.3 Equalised Workload
3.6.4 Minimised Workload
4 Experimental Evaluation
4.1 Setup of the Computer System and Implementation
4.2 Graph Parameters used in the Benchmarks
4.3 Finding Optimised Parameters for the EM-Variant
4.3.1 Comparison between Sorter-Count-Versions
4.3.2 Benchmark Analysis of the Equalised Workload Partitioning
5 Conclusion
Bibliography
A Figures
A.1 Normal Distributions of Active-Size
A.2 Comparison Run of Best Band and Angular Parallelization Count Settings for the Geometric Workload Partitioning
A.3 Comparison Between the Radial Partitionings
A.4 Internal Memory Runtime Comparison of the Algorithms
A.5 External Memory Runtime Comparison of the Algorithms
B Pseudocode
List of Figures
List of Tables
4.1 Rule Set for angular parallelisation count v and band count l for the equalised workload partitioning.
4.2 Rule Set for angular parallelisation count v and band count l for the minimised workload partitioning.
4.3 Rule set for p depending on the parameters k, α, and l.
4.4 A recording of the runtime jumps seen in figure 4.27.
Chapter 1
Introduction
A random hyperbolic graph (or hyperbolic geometric graph) is, at its most basic, a set of nodes embedded in hyperbolic space, where a pair of nodes shares an edge whenever the distance between them is smaller than a chosen threshold [KPKVB10]. The properties that arise in such graphs exhibit many similarities to real-world relationship networks, such as high clustering and a heavy-tailed degree distribution following a power-law, meaning in this case that the fraction of vertices with a given degree is proportional to some negative power of that degree. For example, Papadopoulos et al. propose that hyperbolic graphs are well suited to simulate social networks, the hyperbolic component acting as a trade-off between popularity and similarity in that case [PBK11].
Most algorithms up until now have focused on the general runtime of the graph generation and have generally assumed a unit-cost RAM model. In such a model, the focus is put on the number of steps our algorithm takes, disregarding the implications and possible constraints that very large data sets bring with them. In today's Big Data world, though, working with extremely large graphs is not out of the ordinary anymore; it might thus be more advantageous to take memory-intensive use cases into account during the algorithm engineering phase.
Every computer system, no matter how large its components may be, will eventually get to a certain point where the data it has to work with is so large that working through it internally in one pass will be impossible. In those cases, data has to be off-loaded down the memory hierarchy, the set of different categories of memory units inside a computer system, as explained by Meyer et al. [MSS03]. The memory hierarchy generally consists of the following categories: three cache levels, with increasing size and access latency; the main memory; and, further down, any external memory unit. The authors mention that a programmer does not need to know the intricacies between the cache levels and the main memory, as the automatic caching processes are good enough for the average case. The difference in data access latency between main memory and hard disks, on the other hand, is large enough to become a concern. A hard disk access, for example, takes up to $10^7$ times as long as a single register access [MSS03, p.3]. Any data too large for the main memory
will be sent to the external memory unit for a temporary or final stay, depending on the algorithm. Since the I/O access times are as large as mentioned, this fact alone makes an analysis of an algorithm's practical runtime on the basis of a unit-cost model alone incomplete if the subject matter revolves around very large data sets.
The External Memory model (EM model) by Aggarwal and Vitter takes a differ-
ent perspective [AV88]. The model describes a computer system with an internal
memory of a size allowing for M objects and an external memory disk of unlimited
size. The system's CPU is only able to work on data that is present in the internal
memory, meaning that each data set that is to be worked with has to be first loaded
into the internal memory from the external memory disk. Since hard disks have a
high latency, but allow for a higher bandwidth, it is advantageous to load a block of
multiple, subsequent data points lying on the hard disk into main memory.
As the time spent waiting in between two I/O-operations is generally larger than
between two computational operations, the model allows for a different approach to
algorithm analysis: The number of I/O-operations is considered in this model to be
more relevant to the runtime than it was in the RAM model, where it was completely
discarded.
To give a working example for the difference between those two models, let us
assume we have the problem of finding a number x in a set of n numbers. Without
any further details, the simplest solution in the RAM model would be to scan the en-
tire array one by one and check whether the current number c equals x. Reading a
number, comparing it to x, and increasing the index to reach the next number are all
operations that cost one work unit, which results in an asymptotic runtime of O(n)
operations to accomplish the task.
Let us now assume, under the EM model, that our problem size of n objects is so big that the entire array does not fit into our main memory of M objects. Since we load multiple, subsequent data objects from the external memory unit, the number of objects that can be transferred simultaneously between the internal and external memory in the EM model is B objects per I/O-operation. In that case, every B-th time we load a number from our array, the system loads our current number c from the hard drive into our main memory, inside a block of B numbers. This is beneficial to us if we regard the workings of hard drives in combination
with the spatial locality principle, which says that, if one needs data that has been put
in a specific place in memory, it is highly likely that one will need the data surround-
ing that place as well [MSS03, p.9]. Since hard drives have a large bandwidth, there
is no practical difference between loading one or B objects in terms of time spent on
the loading either.
Coming back to the example of scanning for an object: in the RAM model, we would say that scanning the array for our number x takes n computational steps, as we have n numbers. In the EM model, though, we will have to think in the number of I/O-operations instead: We have n objects, the bandwidth is large enough for B objects to fit into a block per I/O-operation, thus ⌈n/B⌉ blocks will have to be read from the disk, in the worst as well as in the best case, for our scan operation. This is why one well-known notation in the EM model is the I/O-complexity of scanning, i.e. scan(n) = Θ(n/B).
Another important I/O-complexity for the algorithms in this thesis regards the tight bound on sorting, since two of the presented algorithms necessitate sorted data objects: There are multiple sorting algorithms with a proven upper bound matching this tight bound, such as merge sort and distribution sort [AV88, p.1123-1124]. As such, the I/O-complexity of sorting has a tight bound of $\mathrm{sort}(n) = \Theta\!\left(\frac{n}{B}\log_{M/B}\frac{n}{B}\right)$.
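To get a feeling for the magnitude of these bounds, consider a purely illustrative example with round numbers (they are not taken from our benchmarks): for $n = 10^9$ data objects, a block size of $B = 10^6$ objects and an internal memory of $M = 10^8$ objects, we get

$$\mathrm{scan}(n) = \Theta(n/B) = \Theta(1000) \text{ I/Os}, \qquad \mathrm{sort}(n) = \Theta\!\left(\frac{n}{B}\log_{M/B}\frac{n}{B}\right) = \Theta(1000 \cdot \log_{100} 1000) = \Theta(1500) \text{ I/Os},$$

whereas accessing each object individually would cost on the order of $10^9$ I/Os.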
The former will be introduced in conjunction with the steps involved in the generative model of hyperbolic graphs. We will present the given formulas in regard to the native representation of the hyperbolic space, after which we will shortly introduce the Poincaré model and explain its relevance to the simplification of distance calculations.
After further analysis of complex networks, certain properties have been found to be lacking in the Erdős-Rényi model, such as a heavy-tailed, power-law degree distribution (networks with that property are also called scale-free networks, since their degree distribution has no characteristic scale).
A closer attempt at modeling real, scale-free networks was made by Barabási and Albert [AB01, p.71-76]: Instead of finding a topology that captures real-world networks' properties, their Barabási-Albert model (BA model) emulates real-world networks' dynamics. The assumption is that following those dynamics should deliver a graph with the desired topological properties. The main components of their model are the dynamics of growth and preferential treatment. As an example given by the cited authors, the former can be seen in the real-world example of the world wide web, whose number of web pages increases every second; the latter can be seen in the hyperlinks on web pages, which more likely link to well-known web pages that have been linked to by others, and which in return link to a lot of other web pages as well [AB01, p.71]. As such, their model begins with a starting number of nodes $n_0$, and increases this number each step by introducing a new node with a set number of $x \le n_0$ edges to the network, until we reach the desired node count of n. Edges are established randomly with a probability which depends on each node's degree. In other words, nodes with a higher degree receive preferential treatment and are more likely to establish an edge with the newly arrived node.
An issue is that in this model, the clustering still depends on the number of nodes [AB01, p.75], while the power-law coefficient $\gamma_{BA} = 3$ is constant, as it is independent of anything but the model itself [AB01, p.71].
Opposed to these two models is the random hyperbolic graph model by Krioukov et al. [KPKVB10], which differs in its approach. Instead of generating graphs following a specified topology, it, similar to the BA model, achieves the desired topology through the generative model's properties. Unlike the BA model, though, the generative model involves the embedding of a graph into hyperbolic space, meaning that the topological properties depend on the geometric positioning of the nodes themselves. The generative model's randomisation follows a probability function which depends on a set threshold, more specifically, a set distance between two nodes in hyperbolic space. If the distance between two nodes is small enough, the probability of an edge being established between the two is set to one; otherwise, it is set to zero. An additional temperature T can be set to a number higher than zero to allow for further randomisation, where the higher the temperature, the more likely a random edge will be established between two random nodes independent of distance. Krioukov et al. also notice that, as T approaches infinity, random hyperbolic graphs start resembling Erdős-Rényi graphs, since at infinite temperatures, each node pair has an equal probability of establishing an edge [KPKVB10, p.12].
The result of Krioukov et al.'s generative model is the generation of graphs that exhibit all the desired properties: on one hand, a heavy-tailed degree distribution which follows a power-law whose coefficient is independent of the model, of the number of nodes, and of the number of edges; on the other, a non-vanishing clustering coefficient independent of the network's size.
This thesis' subject concerns random hyperbolic graphs as a generative model, though the focus will be on the threshold model alone; the temperature will be assumed to be zero.
During the generation of a complex network, one can notice that, with the growing number of elements, patterns start to emerge, with nodes of similar properties flocking to one another and creating groups with further subgroups within. This multitude of (sub-)groups can be connected in a dendrogram, a representation of hierarchies as a tree structure (an example of such a dendrogram can be seen in figure 1.2). Krioukov et al. posit that, in the same way that b-ary trees and circles grow exponentially, so do dendrograms in the hyperbolic space, as the circles of influence corresponding to the (sub-)groups representing inner nodes in the dendrogram grow in number of occurrences in the same way. In figure 1.2, those circles, where nodes inside the same circle share similarities, are represented on the flat, Euclidean plane. The dendrogram is shown to rise above this plane into the three-dimensional hyperbolic space, connecting the circles' respective nodes whenever a circle is enveloped by another. There are occurrences, sometimes, where circles might only overlap without one containing the other or vice versa; those relationships are visualised by the dashed lines, creating cycles and thus making the dendrogram not an actual tree. Regardless, the hyperbolic nature of this representation is nonetheless apparent in the exponential increase of nodes per level.
In light of such connections between the two concepts, Krioukov et al. found that the properties of hyperbolic space were fitting for the modeling of a multitude of complex networks seen in practical usage whose degree distribution follows a power-law, social network simulations being an example of those. As such, the generative model embeds the n nodes of a graph in a hyperbolic disk of radius R.
[Figure 1.2: circles of similar nodes in the Euclidean plane R², connected by a dendrogram rising into the hyperbolic space H³.]
The disk radius R, the number of nodes n and the expected average degree $\bar{k}$ are related by:

$$\bar{k} = \frac{2}{\pi}\,\xi^{2}\, n \left(e^{-R/2} + e^{-\alpha R}\left(\frac{\alpha R}{2}\left(\frac{\pi}{4}\,\frac{1}{\alpha^{2}} - \frac{\pi-1}{\alpha} + (\pi-2)\right) - 1\right)\right) \qquad (1.1)$$
The number of nodes is represented by n, while ξ is defined as ξ = α/(α − 1/2). The variable α is an additional parameter that controls the power-law exponent γ = 2α + 1 of the graph's degree distribution and is by definition > 1/2. At last, Bode et al. prove that the curvature parameter ζ of the hyperbolic space "is not necessary, the parameters [$\bar{k}$, the average degree,] and α suffice to yield the same degrees of freedom for the model" [BFM16]. As such, we will do as Looz et al. and set ζ = 1 without any loss of degrees of freedom [LLM16, p.3].
As one can see, the equation is fairly complex and cannot be reduced to a closed-form expression for R. In order to create a random hyperbolic graph with a certain average degree, we have to use numerical approaches. Looz et al., for example, decided to query R by using a binary search with a starting value for R, until equation 1.1's result matches the desired $\bar{k}$ [LLM16, p.3]. We chose to use the same approach.

[Figure 1.3: a random hyperbolic graph with its bands in the native representation; the disk radius R and a query point u are marked.]
Gugelmann et al. take a different approach and directly set

R = 2 log(n) + C

where C is a parameter that one can change to vary the average degree $\bar{k}$ in their model [GPP12].
The average degree itself is derived by the authors as [GPP12, p.6]:

$$\bar{k} = (1 + o(1))\,\frac{2\alpha^{2}\, e^{-C/2}}{\pi\,(\alpha - 1/2)^{2}}$$

Solving for R by setting C = R − 2 log(n) in this equation gives us:

$$\bar{k} = (1 + o(1))\,\frac{2\alpha^{2}\, e^{-R/2 + \log(n)}}{\pi\,(\alpha - 1/2)^{2}} = (1 + o(1))\,\frac{2\alpha^{2}\, n\, e^{-R/2}}{\pi\,(\alpha - 1/2)^{2}}$$
$$\Longrightarrow\; \bar{k}\cdot\frac{\pi\,(\alpha - 1/2)^{2}}{2\alpha^{2}\,(1 + o(1))} = n\, e^{-R/2}
\;\Longrightarrow\; O(\bar{k}) = n\, e^{-R/2}
\;\Longrightarrow\; O(\ln(\bar{k}/n)) = -R/2
\;\Longrightarrow\; R = O(\ln(n/\bar{k}))$$
After choosing an R for the disk radius, we continue with the generation of the points. Their native coordinates follow two separate density functions: the angle φ is uniformly distributed over the range [0, 2π), while the radial distribution is governed by [KPKVB10, p.6, eq. (17)]:

$$f(r) = \frac{\alpha \sinh(\alpha r)}{\cosh(\alpha R) - 1} \qquad (1.2)$$
There are two further equations stemming from 1.2 that are going to be needed later on. The first is an approximation of 1.2 that is going to be useful in proofs (and is also given by Krioukov et al. [KPKVB10, p.6, eq. (17)]):

$$f(r) = \frac{\alpha\,(e^{\alpha r} - e^{-\alpha r})/2}{(e^{\alpha R} + e^{-\alpha R})/2 - 1} \approx \alpha\,\frac{e^{\alpha r}}{e^{\alpha R}} = \alpha\, e^{\alpha (r - R)} =: \tilde f(r) \qquad (1.3)$$
The second is the integral of 1.2, which gives us the probability mass, i.e. the expected fraction of nodes between two radii b and d with 0 ≤ b < d ≤ R:

$$\mathrm{mass}(b, d) = \int_{b}^{d} f(r)\, dr = \frac{\cosh(\alpha d) - \cosh(\alpha b)}{\cosh(\alpha R) - 1} \qquad (1.4)$$
Inserting the definition cosh(x) := (e^x + e^{−x})/2 and approximating analogously to 1.3 yields:

$$\mathrm{mass}(b, d) = \frac{\cosh(\alpha d) - \cosh(\alpha b)}{\cosh(\alpha R) - 1} \approx e^{-\alpha R}\left(e^{\alpha d} - e^{\alpha b}\right) \qquad (1.5)$$
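As an aside, equation 1.4 also shows how the radial coordinate of a node can be sampled directly by inverting the cumulative mass mass(0, r). The following sketch illustrates this for one node in native coordinates; the function name and the use of the C++ standard library generator are our own illustrative choices, not part of any of the discussed implementations.

#include <cmath>
#include <random>
#include <utility>

// Sample one node (phi, r) in native coordinates: phi is uniform on [0, 2*pi),
// and r follows the density of eq. 1.2. Inverting the cumulative mass from
// eq. 1.4, F(r) = (cosh(alpha*r) - 1) / (cosh(alpha*R) - 1), gives
// r = acosh(1 + u * (cosh(alpha*R) - 1)) / alpha for u uniform on [0, 1).
std::pair<double, double> sampleNode(double alpha, double R, std::mt19937_64& gen) {
    const double PI = std::acos(-1.0);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    const double phi = 2.0 * PI * uni(gen);
    const double u = uni(gen);
    const double r = std::acosh(1.0 + u * (std::cosh(alpha * R) - 1.0)) / alpha;
    return {phi, r};
}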
After having generated our nodes, we now only need to establish an edge between any two points u = (φ_u, r_u), v = (φ_v, r_v) whenever the distance between them is less than or equal to R. The hyperbolic distance between those points is given by:

$$\mathrm{dist}_H(u, v) = \mathrm{acosh}\bigl(\cosh(r_u)\cosh(r_v) - \sinh(r_u)\sinh(r_v)\cos(|\varphi_u - \varphi_v|)\bigr) \qquad (1.6)$$
Calculating the distance between all node combinations of our graph would be
computationally very expensive not only because of the quadratic nature of such
an approach, but also because of the extensive usage of cosh- and sinh-functions.
To combat this, the algorithms shown in this thesis have a few ways to reduce the
number of potential neighbour candidates each node has to check. One of those, employed by NkGen and our EM-variant, is to divide the ground plane radially into L slabs b_i with 0 ≤ i < L, where each slab is defined by its inner radius c_i and outer radius c_{i+1}. The specific choice of those radii is important for the runtime; in the case of NkGen, Looz et al. chose a geometric partitioning with p = 0.9, where their chosen number of slabs L was always dependent on the number of nodes n, more specifically L = log(n), calculated as follows [LLM16, p.3]:
$$c_i = \begin{cases} 0 & \text{if } i = 0\\[2pt] \dfrac{(1-p)\,R}{1-p^{L}} & \text{if } i = 1\\[4pt] R & \text{if } i = L\\[2pt] p\,(c_{i-1} - c_{i-2}) + c_{i-1} & \text{otherwise} \end{cases} \qquad (1.7)$$
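Transcribed directly into code, the band borders of equation 1.7 can be computed as follows; the function name is ours and serves purely as an illustration of how the borders are derived from R, p and L.

#include <cmath>
#include <vector>

// Band borders c_0, ..., c_L of the geometric radial partitioning from eq. 1.7:
// the band widths form a geometric sequence with ratio p and sum up to R.
std::vector<double> geometricBandRadii(double R, double p, unsigned L) {
    std::vector<double> c(L + 1, 0.0);
    c[1] = (1.0 - p) * R / (1.0 - std::pow(p, static_cast<double>(L)));
    for (unsigned i = 2; i < L; ++i)
        c[i] = c[i - 1] + p * (c[i - 1] - c[i - 2]);
    c[L] = R;  // the outermost border is the disk radius itself
    return c;
}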
For every point v on slab b_i, we calculate the minimal and maximal angles of any possible neighbour points u, and use those angles to reduce the number of potential neighbour points, as only points in between those two angles can be neighbours (a more detailed explanation of this procedure can be found in chapter 2.1). The calculation of those angles, given by Looz et al. in [LLM16, p.4, eq. (7-10)], is the one used by the function getMinMaxPhi in listing B.1 for the native representation of the hyperbolic plane and is given by:

$$\varphi_{b_i}(v) = \mathrm{acos}\left(\frac{\cosh(r_v)\cosh(c_i) - \cosh(R)}{\sinh(r_v)\sinh(c_i)}\right) \ge |\varphi_u - \varphi_v| \qquad (1.8)$$
$\varphi_{b_i}$ only gives us the maximal angular difference in one direction, though, which is why one will see this term multiplied by two during certain calculations (see the proof of lemma 3 as an example) whenever we are looking for the entire area encompassed by those angular extremes.
The approximation of equation 1.8 is particularly useful during our proofs later on (see proof of lemma 3) and is given by [GPP12, p.7]:

$$\varphi_{b_i}(v) \approx 2\, e^{\frac{R - r_v - c_i}{2}} \qquad (1.9)$$

It is important to mention that 1.9 is only accurate for large r_v, c_i and R. Specifically, the approximation overestimates the actual angle, which is why it can safely be used in an asymptotic proof. To give a short example of this, for any $r_v + c_i < R - 2\ln(\pi/2)$ the approximated angle is larger than π, which should not be possible, as the largest absolute angular difference between any two points can be at most π. This small issue is additionally of note, as we are later going to be using an equation to approximate the workload of our algorithm (see equation 3.2 for more details), whose accuracy suffers if this issue is not taken into account.
A short illustration of the Poincaré disk model was given by Looz et al., who summarised the model thus [LSMP15, p.4]: all points of the hyperbolic plane are mapped into the Euclidean unit disk, and the hyperbolic distance between two mapped points p_E and q_E is given by:

$$\mathrm{dist}_H(p_E, q_E) = \mathrm{acosh}\left(1 + 2\,\frac{\|p_E - q_E\|^2}{(1 - \|p_E\|^2)\,(1 - \|q_E\|^2)}\right) \qquad (1.11)$$
Figure 1.4: The graph in figure 1.3 mapped to the Poincaré disk, whose radius is 1. $E_C$ is the center of point u's query, with $rad_E$ being the query circle's radius, as calculated with 1.14 and 1.15 respectively. All the points and bands are pushed further towards the disk's outer radius. The right part of the figure shows an enlarged view of the area around point u, where one can see that the yellow query area exists here as well.
As one can see, if the problem is calculating the distance and checking whether it is less than or equal to R, one can circumvent the repeated usage of an acosh-operation in 1.11 by changing the equation to:

$$\frac{\cosh(R) - 1}{2} \;\ge\; \frac{\|p_E - q_E\|^2}{(1 - \|p_E\|^2)\,(1 - \|q_E\|^2)} \qquad (1.12)$$

If we now calculate the left-hand side once with R as a parameter beforehand and save that value, we only have to use simple arithmetic for comparisons between the right-hand side for two points and (cosh(R) − 1)/2.
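A minimal sketch of such a comparison, assuming points are stored as Cartesian coordinates inside the unit disk (the struct and function names are ours):

#include <cmath>

struct PoincarePoint { double x, y; };  // Cartesian coordinates inside the unit disk

// Neighbour test from eq. 1.12: instead of evaluating acosh per pair (eq. 1.11),
// precompute (cosh(R) - 1) / 2 once and compare it against the purely arithmetic
// right-hand side.
struct NeighbourCheck {
    double threshold;  // (cosh(R) - 1) / 2
    explicit NeighbourCheck(double R) : threshold((std::cosh(R) - 1.0) / 2.0) {}

    bool operator()(const PoincarePoint& p, const PoincarePoint& q) const {
        const double dx = p.x - q.x, dy = p.y - q.y;
        const double distSq = dx * dx + dy * dy;
        const double pNormSq = p.x * p.x + p.y * p.y;
        const double qNormSq = q.x * q.x + q.y * q.y;
        return distSq <= threshold * (1.0 - pNormSq) * (1.0 - qNormSq);
    }
};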
The only possible downside is that we need the following calculation, which also includes a cosh-operation, to transform a radial coordinate r from the native representation to the Poincaré disk model [LSMP15, p.14]:

$$rg(r) = \sqrt{\frac{\cosh(r) - 1}{\cosh(r) + 1}} \qquad (1.13)$$
Another advantage of using the Poincaré model is that, instead of using equation 1.8 for our φ_min and φ_max, which involves additional usage of cosh and sinh, we can use an easier formula by drawing simple circles around each point. One of the positive properties of the Poincaré disk model is that circles in hyperbolic geometry become simple circles on the now established Euclidean disk, albeit with radii that change depending on the circle's positioning, decreasing the further away the circle's center lies from the Euclidean disk's center, as the following equations taken from [LSMP15, p.5] show:
Let E be the Euclidean circle in the Poincaré disk which corresponds to a hyperbolic circle H with radius R around a point u = (φ_u, r_u) on the hyperbolic plane. Let rg(u) be point u's corresponding radius on the Poincaré disk, calculated by equation 1.13. Let the center of E be E_c and the radius of that circle be rad_E. Let also:

$$a = \cosh(R) - 1, \qquad b = 1 - rg(u)^2$$

The center E_c is then given by:

$$E_c = \left(\varphi_u,\ \frac{2\,rg(u)}{ab + 2}\right) \qquad (1.14)$$

with the radius rad_E given by:

$$rad_E = \sqrt{\left(\frac{2\,rg(u)}{ab + 2}\right)^{2} - \frac{2\,rg(u)^2 - ab}{ab + 2}} \qquad (1.15)$$
At last, we finish the background with the equation used to calculate those intersections, or more accurately, the angles defining those intersections. Let c_{i−1} be the inner radius of band b_i, outlining a circle with its center at the origin. Let v be a point that creates the query circle E with its center E_c = (φ_u, r_E) and its radius rad_E. The angle between the line from the origin towards the center E_c on one hand, and the line from the origin towards one of the intersections of those two circles on the other, is given by equation 1.16, which is used by our method getMinMaxPhi_Poincare in listing B.2. The equation is taken from Paul Bourke [Bou97] and adapted to the respective variables of the Poincaré model:

$$\varphi_{b_i}(v) = \mathrm{acos}\left(\frac{c_{i-1}^{2} - rad_E^{2} + r_E^{2}}{2\, r_E\, c_{i-1}}\right) \qquad (1.16)$$
As one can see by a simple comparison between equations 1.16 and 1.8, the one in usage in the Poincaré disk model is arithmetically simpler. If we take a look at the number of cosh/sinh-operations, we have four distinct applications of those in the native representation and only one distinct application in the Poincaré model (we are excluding cosh(R) from both representations in this comparison, as this can be calculated once beforehand and kept in memory as a constant).
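To illustrate how equations 1.13 to 1.16 work together, the following sketch computes the query circle of a point and the angular half-width of its intersection with a band border. It is only an illustration of the formulas above; the names, the struct layout and the clamping of the acos argument are our own additions and not taken from listing B.2.

#include <algorithm>
#include <cmath>

struct QueryCircle { double r_E, rad_E; };

double toPoincareRadius(double r) {                        // eq. 1.13
    return std::sqrt((std::cosh(r) - 1.0) / (std::cosh(r) + 1.0));
}

QueryCircle queryCircle(double r_native, double R) {       // eqs. 1.14 and 1.15
    const double rg = toPoincareRadius(r_native);
    const double a = std::cosh(R) - 1.0;
    const double b = 1.0 - rg * rg;
    const double r_E = 2.0 * rg / (a * b + 2.0);
    const double rad_E = std::sqrt(r_E * r_E - (2.0 * rg * rg - a * b) / (a * b + 2.0));
    return {r_E, rad_E};
}

double intersectionAngle(const QueryCircle& q, double c) { // eq. 1.16, c = band radius
    const double cosPhi = (c * c - q.rad_E * q.rad_E + q.r_E * q.r_E)
                          / (2.0 * q.r_E * c);
    // Clamp: if the circles do not intersect, the whole band border is covered.
    return std::acos(std::clamp(cosPhi, -1.0, 1.0));
}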
1.3 STXXL
Since the main focus of this thesis remains on large data sets, we decided to use data
constructs from the STXXL library for data management and sorting. The Standard
Template Library for Extra Large Data Sets "enables practice oriented experimentation
with huge data sets [, as it] supports parallel disks, overlapping between I/O and
computation, and pipelining technique that can save more than half of the I/Os"
[DKS08, p. 640]. One of the features of STXXL that is relevant for our application
is the sorting algorithm that has been designed with extremely large data sets,
multiple parallel working disks and I/O-efficiency in mind.
The Asynchronous Parallel Disk Sorting algorithm specifically uses an external memory approach that splits the data akin to a k-way merge sort; the difference here is that the algorithm takes I/O-procedures into account and schedules its workflow around an overlap buffer to allow for "almost perfect overlapping of I/O and computation" [DS03, p. 142], which is something that will be advantageous
to the External-Memory-variant of NkGen. Since a few versions of our algorithm rely
heavily on a pre-sorted data set to work, STXXL and its sorters along with the I/O
performance counters it provides will prove themselves useful in our implementa-
tion of the algorithm and its analysis thereafter.
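As a rough sketch of how such a sorter is used, the following fragment shows the push/sort/stream pattern of an STXXL sorter for a node type similar to ours; the struct layout, the comparator details and the memory budget are illustrative assumptions, and the exact template parameters may differ between STXXL versions.

#include <cstdint>
#include <stxxl/sorter>

struct Node {
    double phi, x, y;     // angular coordinate plus Poincare coordinates
    std::uint64_t id;
};

// Order nodes by their angular coordinate; STXXL additionally requires the
// comparator to provide sentinel values below and above all real keys.
struct NodeByAngle {
    bool operator()(const Node& a, const Node& b) const { return a.phi < b.phi; }
    Node min_value() const { return {-1.0, 0.0, 0.0, 0}; }
    Node max_value() const { return {8.0, 0.0, 0.0, 0}; }   // anything > 2*pi
};

// Typical usage: push all elements, call sort(), then consume the sorted stream.
void fillAndConsume(std::uint64_t n) {
    stxxl::sorter<Node, NodeByAngle> nodeSorter(NodeByAngle(), 512 * 1024 * 1024);
    for (std::uint64_t i = 0; i < n; ++i)
        nodeSorter.push(Node{0.0, 0.0, 0.0, i});   // coordinates omitted in this sketch
    nodeSorter.sort();
    for (; !nodeSorter.empty(); ++nodeSorter) {
        const Node& current = *nodeSorter;         // elements arrive sorted by phi
        (void)current;
    }
}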
Chapter 2
Algorithms
The general outline of the algorithm by Looz et al. consists, in our framework, of four general phases: setup, generation, sorting and edge creation [LLM16, p.3]. Its main idea is that if one partitions the hyperbolic plane into multiple, concentric, ring-shaped bands, one decreases the number of comparisons necessary for the fourth step.
The setup step includes the calculation of the target radius R (see eq. 1.1) of the hy-
perbolic plane, given specified parameters like node count n, average degree k and a
power-law exponent γ = 2α + 1. Additionally to that, we prepare C = (c_0, c_1, ..., c_l), which defines the radii-borders of the l bands, with band b_i = [c_i, c_{i+1}) being defined by its inner radius c_i and outer radius c_{i+1}. How C is chosen has a big influence on the runtime, as the expected number of nodes per vertex that have to be compared for the edge-creation increases not only with the area of the band, but also with its position on the hyperbolic plane, as the density function follows an approximately exponential curve (see eq. 1.2 and 1.3). Looz et al. chose a geometric sequence with
ratio p = 0.9 based on experimental outcomes (see eq. 1.7 and [LLM16, p.3]).
For every vertex, the algorithm goes through every band that includes it or is
further away from the center than the vertex. To decrease the number of potential
neighbours that fit the neighbour criterion of the distance being dist_H(u, v) ≤ R
(see eq. 1.6), the maximum angular difference from a node in either direction on
the current band is calculated (see eq. 1.8). This is possible, since a hypothetical,
hyperbolic circle with radius R around any vertex creates overlapping areas with at
least one band. The intersections between those hyperbolic circles and band radii
lines equal the calculated maximum angular difference on any given band. Thus,
since all neighbours of a node have to be somewhere inside that nodes respective
circle, we can narrow down the area of neighbourhood candidates with those
angular bounds.
Figure 2.1: The red line shows a query circle with radius R (the same as the radius of the entire plane) around a randomly chosen point on the hyperbolic plane. The yellow areas are the ones in between the maximum angular differences for the green point on the respective bands. Because all to-be-established neighbours are inside the query circle, only the yellow areas are searched for them (edges are created outwards, thus only the node's own band and the outer bands are considered).
There are exceptions though: If the circle of a point covers the band on which the
point resides either entirely, or more than once, no such simple bounds can exist. A
potential neighbour could be on the other side of the center of the hyperbolic plane,
meaning the boundary would need to encompass the entire plane. This means that
on the inner most band and possibly further bands, depending on how close to
the center the bands and points are all vertices have to be compared for possible
neighbour properties. In other words, the angle boundaries are set to be 0 and 2.
Depending on , those cases can be rare: With increasing , the exponential nature of
the vertices radius density function (see eq. 1.3) pushes the overwhelming majority
of vertices further towards the border of the hyperbolic plane. Taking for example
equation 1.5 and setting d = R/2, b = 0 and R = O(ln(n/k)), we get:
In other words, the mass of nodes with a radius less then R/2 falls for any increasing
n and , as long as k < n, and is for instance less than 3.2% for any k n/1000
and 1.
Once calculated, we get φ_min and φ_max as our bounds for that particular vertex (φ_min being the green, φ_max the red boundary in figures 2.1 and 2.2). All the points between those two angles are potential neighbours whose distances to our vertex we have to calculate and, depending on the distance, establish an edge with. The algorithm uses a binary search to find the first node with an angle of at least φ_min and the last node with an angle of at most φ_max, which is important to note, as such unstructured access is infeasible in an EM setting.
Figure 2.2: The same graph as seen in figure 2.1, but with a focus on a different
point that is closer to the center. The two innermost bands do not allow simple
boundaries to be created, as the query radius covers too much area of those bands
without simple intersections between band radii and query circle.
For our comparison later on, we will use the implementation by the authors found
in the NetworKit library¹.
Thus, our approach is to circumvent the need for unstructured access in our main
algorithm by not only sorting the nodes, but by also creating an additional data
structure for the angular boundaries of each point. In detail, we create structures
that represent the left angular bound (start of a query) and the right angular bound
(end of a query) once per node x and per band. Those boundaries represent the
query for all neighbours of that point x. Our intention is that by sorting these
angular boundaries as well, we can scan these sorters and work through multiple
queries at once without requiring any unstructured access. In other words, once a
memory block of sorted nodes and boundaries has been loaded into main memory,
we can be sure that all of the loaded data will be of use before being discarded, re-
sulting in a reduced number of I/O-operations compared to the original algorithm.
¹ NetworKit, with a reference implementation of the algorithm, at: https://networkit.iti.kit.edu/
Pseudocode detailing this process can be found in the appendix (see listing B.2).
The EM-variant goes through the same four steps as the original algorithm. We set up the target radius R and choose a fitting radial partitioning for our l bands. Unlike NkGen, though, which uses the hyperbolic geometry's native representation, we are going to use the Poincaré disk model instead. The reason for this is that the equations for the distance and angular boundary calculations are much simpler in this model (compare eq. 1.6 with eq. 1.12, and eq. 1.8 with 1.16), meaning that we can reduce the number of computational operations by using a different representation alone.
To make this possible, we modify NkGen's setup and generation phase: instead of keeping the nodes' and bands' radii as is, we first map them to the Poincaré disk with the function MapToPoincare (see eq. 1.13). Additionally, we also use a different function to calculate φ_min and φ_max with getMinMaxPhi_Poincare (see eq. 1.14, 1.15 and eq. 1.16).
We also have to modify our edge-creation step in a way that makes unstructured access unnecessary, e.g. by avoiding the binary search and using the aforementioned angular boundaries φ_min and φ_max. These boundaries are calculated once per vertex during the generation step, for each band that has an outer band radius larger than the vertex's radius (including the vertex's own band). After sorting the boundaries and nodes according to a lexicographical order beginning with the respective angular component, we can utilise an additional vector to store all ongoing, active queries during our traversal through the sorters. With that active-vector in mind, we can then work through multiple, concurrent queries, point by point, deleting and adding queries to our vector whenever we reach the end of a query or find the start of a new one, respectively. We only stop once we have reached the end of a band, i.e. when there are no points left on the band.
During the generation step, we first create their own data structures for those boundaries: one sorted by φ_min, called StartBounds, and one sorted by φ_max, called StopBounds. Both data structures store the ID of the original query-vertex, while the StartBounds alone store the coordinates of the vertex as well. This way, we can use the StartBounds themselves to calculate distances between the query-vertex and the current vertex being handled in the edge-creation step. Additionally, the IDs will help to manage the bounds during the fourth step.
After generating the points, and with them the Start- and StopBounds, we put them all in three separate sorters per band and sort them.
Thus, the edge-creation step changes in the following way: For each band, we perform a sweep-line algorithm by which we go through all three sorters concurrently, choosing the smallest item according to a lexicographical order (first the angular property, then the data type) as our current work-token. In case a bound-object has the exact same angle as a point, the order is "StartBound < Point < StopBound", to ensure that all neighbours will be found. Depending on the token-type, we act accordingly:

- StartBound ss_y: the query of node y becomes active, so we append it to the active-vector.
- Point x: we compare x against every query currently in the active-vector, calculate the distance to the respective query-vertex, and establish an edge whenever that distance is at most R.
- StopBound st_y: the query of node y has ended, so we remove it from the active-vector.
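A minimal in-memory sketch of this sweep, using plain pre-sorted vectors in place of the STXXL sorters, could look as follows; the struct layouts, names and the neighbour predicate are illustrative assumptions rather than our actual implementation (see listing B.2 for the latter).

#include <cstdint>
#include <utility>
#include <vector>

struct Point      { double phi, x, y; std::uint64_t id; };
struct StartBound { double phi, x, y; std::uint64_t id; };   // phi = phi_min of the query
struct StopBound  { double phi;       std::uint64_t id; };   // phi = phi_max of the query

// One band: all three inputs are pre-sorted by phi; "isNeighbour" stands for the
// comparison of eq. 1.12. Edges are emitted as pairs of node IDs.
void sweepBand(const std::vector<StartBound>& starts,
               const std::vector<Point>& points,
               const std::vector<StopBound>& stops,
               bool (*isNeighbour)(double, double, double, double),
               std::vector<std::pair<std::uint64_t, std::uint64_t>>& edges) {
    std::vector<StartBound> active;   // all currently ongoing queries
    std::size_t s = 0, p = 0, t = 0;
    while (p < points.size()) {
        if (s < starts.size() && starts[s].phi <= points[p].phi) {
            active.push_back(starts[s++]);                 // a query begins
        } else if (t < stops.size() && stops[t].phi < points[p].phi) {
            const std::uint64_t id = stops[t++].id;        // a query ends
            for (std::size_t i = 0; i < active.size(); ++i)
                if (active[i].id == id) { active[i] = active.back(); active.pop_back(); break; }
        } else {
            const Point& cur = points[p++];                // a point is processed
            for (const StartBound& q : active)
                if (q.id != cur.id && isNeighbour(q.x, q.y, cur.x, cur.y))
                    edges.emplace_back(q.id, cur.id);
        }
    }
}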
The result is a linear scan through all three sorters with no unstructured access
necessary. As the following lemma shows, we find all neighbours in our generated
graph that follow the neighbourhood requirements.
Lemma 2. The EM-variant algorithm is correct and finds all neighbours y for all vertices x with dist_H(x, y) ≤ R.

From the fact that x is a neighbour of y, it follows that ss_y ≤ x holds in our lexicographical order, because otherwise x would be outside of y's reach, seeing as φ_x would be smaller than φ_min,y. Accordingly, this means that ss_y has to have been put into the active-vector at some point in time before x became our work-token.
2.2.2 2-Sorter-Version
During our work, we considered optimising our algorithm by decreasing the number of sorters necessary. This should reduce the time spent on sorting, since the overall number of elements will be reduced by at least a third (as there are more Bound-objects than nodes). Instead of using one data structure for each Bound-type respectively, we use only one data structure, called StartStopBound, that keeps track of φ_min, φ_max and the position of its original node.
Regarding the lexicographical order, we still first compare angles and then data types. Analogously to before, the StartStopBound is considered smaller than a Point-object. The algorithm itself only changes in the comparison step, where we now have two possible work-tokens, depending on which object is considered smaller:

- StartStopBound ss_y: the query of node y becomes active, so we append it to the active-vector.
- Point x: we first remove every StartStopBound from the active-vector whose φ_max is smaller than φ_x; afterwards we compare x against the remaining active queries, calculate the distances, and establish edges whenever a distance is at most R.
One possible downside that could arise from the way this version works is that our
average active-size should increase slightly compared to the 3-Sorter-Version. It
is possible for our active-vector to have StartStopBounds whose angle-range does
not cover our current node, since their removal only happens whenever we get a
new point work-token. Asymptotically, this does not matter as it would double the
active-size at most (i.e. we would have an active-vector with queries pertaining to
the current and the previous node only). Computationally, this would also not be a
large disadvantage, as comparing the angles is enough to know whether we should engage in the computationally more expensive distance calculation.
Pseudocode for this version can also be found in the appendix (see listing B.3).
2.2.3 0-Sorter-Version
For our theoretical runtime analysis later on it might be interesting to have a
version of our algorithm that does not need any kind of sorting at all. Seeing as our
2-sorter-version already decreased the number of sorters by one, there are only two
left that we have to work around if we want to still mainly depend on our main
algorithm.
Avoiding the sorting of our nodes requires a change in our generation step: Instead of randomly generating the nodes, we first randomly generate the node count on every band. For this we calculate the probability mass of the radial density function f(r) per band between its inner radius r_inner and outer radius r_outer with the following formula, based on the integral of the density function:
$$\mathrm{mass}(r_{inner}, r_{outer}) = \int_{r_{inner}}^{r_{outer}} f(r)\, dr = \frac{\cosh(\alpha\, r_{outer}) - \cosh(\alpha\, r_{inner})}{\cosh(\alpha R) - 1}$$
With this, we can calculate the probability mass m_j, one for each band b_j. If we now sample n uniformly random numbers p_i between 0 and 1, one for each node, we can derive the random number of nodes per band we would have gotten from a randomly generated batch of n nodes by counting how many p_i fall within $\left(\sum_{k=0}^{j} m_k\right) - m_j \le p_i < \sum_{k=0}^{j} m_k$.
After that, for every band we generate nodes whose radius follows the usual density function, only now restricted to the band's inner and outer radii. The angle, on the other hand, has to be generated in a way that every generated angle is greater than or equal to the last one, while still following a uniform distribution. Bentley and Saxe describe an algorithm that does exactly that [BS80]. This is possible, the authors note, since the values $Y_j = \left[\sum_{1 \le i \le j} X_i\right] / \left[\sum_{1 \le i \le n+1} X_i\right]$ for j = 1, ..., n "are distributed as the order statistics of size n from U[0, 1]" for independent variables $X_1, ..., X_{n+1}$ with an exponential distribution and a fixed mean; they can use this fact to generate a series of sorted, decreasing, uniformly random numbers in sequence [BS80, p.2].
Using this algorithm, we can generate nodes that are sorted first per band, then per
angle, which is exactly what our main algorithm needs.
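The core idea of Bentley and Saxe can be sketched as follows: partial sums of independent exponential variables, divided by the total sum over n + 1 of them, behave like the order statistics of n uniform variables. The function below is an illustrative, non-streaming variant of that idea (Bentley and Saxe additionally show how to emit the values one at a time, in decreasing order, without storing them all).

#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Generate n angles, uniform on [0, 2*pi) and already sorted in increasing
// order: partial sums of i.i.d. exponential variables, divided by the sum of
// n + 1 of them, are distributed like the order statistics of n uniforms.
std::vector<double> sortedUniformAngles(std::size_t n, std::mt19937_64& gen) {
    const double PI = std::acos(-1.0);
    std::exponential_distribution<double> expo(1.0);
    std::vector<double> angles(n);
    double running = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        running += expo(gen);
        angles[i] = running;                    // partial sums S_1 <= ... <= S_n
    }
    const double total = running + expo(gen);   // S_{n+1}
    for (double& a : angles)
        a = 2.0 * PI * (a / total);             // scaled order statistics
    return angles;
}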
Avoiding the sorting of our StartStopBounds can be done by dividing each StartStopBound s_x into two halves, one ranging from [φ_min, φ_x], called sBackward_x, the other from [φ_x, φ_max], called sForward_x. The 2-Sorter-Version requires that all given StartStopBounds are sorted by their φ_min-angle, meaning that if every generated node comes into the StartStopBound-generation phase in a sorted sequence, we can be sure that all StartStopBounds are sorted by their original node's angle φ_x. Following that, if we insert every sBackward_x into one vector and every sForward_x into another one, we would then have two vectors filled with StartStopBounds, one being sorted backwards by its φ_max, the other forwards by its φ_min.
Unfortunately, this process requires that we perform the edge-creation phase once per StartStopBound-vector, as each StartStopBound only covers one half of each query node's reach. The forwards vector can be used in conjunction with our 2-Sorter-Version's comparison step just as it is, giving us approximately half of our desired edges. The backwards vector, on the other hand, has to be used with an altered version of the 2-Sorter's comparison step that goes backwards through both our StartStopBound- and Point-vector, giving us the other half of our edges.
2.3 Parallelisation
2.3.1 Radial Parallelisation
To increase performance, parallelisation seems to be a suitable possibility, consid-
ering the generally independent workload divided between the bands. This means
that we do not need to change our algorithm, if we are aiming for a radial paral-
lelisation on our hyperbolic plane along the bands, i.e. working on the bands in
parallel (see figure 2.3 for a visualisation). The radial partitioning choice might have
Figure 2.3: Each colored band can be given to a different thread independently. For simplicity, we ignore specific scheduling details for now (see for instance the top level of JáJá's Work-Time Framework [JJ92, p.27-32]).
an influence on the algorithm in a parallel setting, though. Since the algorithm can work in parallel, it is not necessarily the best idea to use a radial subdivision that decreases the overall workload over all the bands. It is possible that one partitioning choice delivers us a workload that, while optimal in a sequential setting, would be suboptimal in a parallel setting. This is because the workload per band depends on a multitude of factors - such as the number of nodes between the origin and the current band, the average angle $|\varphi_{st,r,j} - \varphi_{ss,r,j}|$ of the StartStopBound (of any query-point with radius r) on a band b_j, etc. - and thus is not a simple linear division of work among all bands. Possible radial partitioning alternatives and their effects will thus be described in chapter 3.6 and tested later on in chapter 4.3.
Instead of creating sorters for every band, we create them for every segment per band that we have. Nodes will be inserted into the right sorters, according to the segment they reside in, analogously to the way we did before. The bounds are treated differently this time around:
Figure 2.4: Each color represents a different thread that operates on the colored segment. As an exemplary distribution of segments to threads, each segment is given to a thread counterclockwise, starting from the innermost band, with the color order being blue, green, red, purple.
If a StartBound ss_y and its corresponding StopBound st_y belong on the same segment b_{ij}, both will be added in the same manner as before to the correct segment b_{ij}.
If a StartBound ss_y belongs to b_{ij} and its corresponding StopBound st_y belongs to b_{im} with j ≠ m and j, m ∈ {0, 1, ..., v − 1}, we add StartBound ss_y and a new StopBound st_{yj} with angle φ_{j+1} = 2π(j + 1)/v to segment b_{ij}.
Analogously, we add a new StartBound ss_{ym} with angle φ_m = 2πm/v and StopBound st_y to segment b_{im}.
For every other segment b_{in}, we have two further options:
j < m: In that case, to all segments b_{in} with j < n < m we add additional StartBounds ss_{yn} with angles φ_n = 2πn/v and StopBounds st_{yn} with angles φ_{n+1} = 2π(n + 1)/v.
m < j: In that case, the query range covers the 2π threshold. We add bounds analogously to the former case, except that we first add StartBounds and StopBounds to all segments b_{in} with j < n ≤ v − 1, and then to all segments b_{in} with 0 ≤ n < m, each with the same definition as in the former case.
In other words, if a Start- and StopBound pair encompasses multiple segments, we
are going to split the pair up into multiple pairs along the dividing angles, so that
each pair only covers one segment at a time.
At last, our edge-creation step changes only slightly in the sense that, instead of
going through all bands, we are going through all segments. The general algorithm
applied to each part, though, stays the same.
[Plot: generation-phase runtime (log scale) against node count from $10^4$ to $10^7$; the curves compare the SEQ and PAR variants of the 0-, 2- and 3-Sorter versions.]
Figure 2.5: Graph showing the improvement in the generation phase gained by using a parallel generator, once with a sequential generator ("SEQ"), once with a parallel one ("PAR") with 16 threads (the 0-Sorter uses only four threads).
from setting the to-be-generated node count for each of those generators to n/t, if t is the number of threads. The only problem concerns the insertion of objects into their respective sorters, as the STXXL-sorters are not thread-safe. Since we also did not want to increase the memory usage by a large margin, we decided to use one array per thread that was filled with the generated nodes and Bound-objects, and which was emptied out into the respective sorter once an arbitrary threshold was reached (in our case 1000 Bound-objects). The insertion locked the sorters with a mutex for the duration of the insert, creating a thread-safe generation and insertion without much more memory usage than the single-threaded generation. The resulting runtime improvement can be seen in figure 2.5.
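The buffering pattern described above can be sketched generically as follows; the class name, interface and default batch size are our own illustrative choices.

#include <cstddef>
#include <mutex>
#include <vector>

// Per-thread buffering in front of a shared, non-thread-safe sorter:
// each thread collects items locally and only grabs the mutex to flush a
// whole batch, keeping both contention and extra memory usage low.
template <typename Item, typename Sorter>
class BufferedInserter {
public:
    BufferedInserter(Sorter& sorter, std::mutex& mtx, std::size_t batchSize = 1000)
        : sorter_(sorter), mtx_(mtx), batchSize_(batchSize) { buffer_.reserve(batchSize); }

    void push(const Item& item) {
        buffer_.push_back(item);
        if (buffer_.size() >= batchSize_) flush();
    }

    void flush() {
        std::lock_guard<std::mutex> lock(mtx_);
        for (const Item& item : buffer_) sorter_.push(item);
        buffer_.clear();
    }

    ~BufferedInserter() { if (!buffer_.empty()) flush(); }

private:
    Sorter& sorter_;
    std::mutex& mtx_;
    std::size_t batchSize_;
    std::vector<Item> buffer_;
};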
For the 0-Sorter, we had to take a different approach, as there are additional restrictions on the sequencing of the generated nodes: Our idea was to parallelise akin to the angular parallelisation approach, where we divide the plane angularly into multiple pieces, only that each piece now spans multiple bands. We then sample n uniformly random numbers in the range [0, 1), where each piece 0 ≤ i < t of the t pieces has its own range [i/t, (i + 1)/t). This way, we randomise the number of nodes per generator uniformly. We distribute each piece to a thread and calculate the nodes like usual. During the generation, though, we restrict the generated angles to the same angular range as the generator's piece.
The only issue here is that, for maximum performance, we would like to use all t threads available; since the nodes have to be generated in a sorted sequence, though, each piece requires one vector per band to keep the nodes ordered per band. Overall, we need $s = t \cdot l \cdot 3$ STXXL-vectors: one per piece (i.e. per thread), per band, per object-type (one for the nodes, two for the Bound-objects for the traversal in both directions). This is an issue, since the STXXL-vectors have an overhead and a minimum amount of memory required per vector during initialisation. Since the normal sequential generation only requires $s = l \cdot 3$ vectors, we have a t times bigger overhead at worst. Using all 16 threads during the generation phase, for instance, would require so many vectors, and thus so much
[Plot: combined setup and generation runtime (log scale) against node count from $10^4$ to $10^7$ for the 0-Sorter with 8 threads, 4 threads, and the sequential generator.]
Figure 2.6: Graph showing the differences in thread count for the combined runtime of the 0-Sorter's setup and generation phase, once with four threads, once with eight, and once with the sequential generator.
memory, that we quickly reach the RAM's limit. Figure 2.6 shows the 0-Sorter once
with four threads, once with eight threads, and once with the sequential generator
during the setup and generation phase (the latter phases are able to use all threads
regardless of generator). We can see here that a higher t results in a comparatively larger runtime for small graphs, but introduces a slower asymptotic rise. In the end,
the overhead alone is only a constant and thus a bigger problem for small graphs.
Asymptotically, the constant setup phase is dominated by every other phase, which
is why the parallel generator still gives us an overall improvement for large graphs.
2.4 GIRG
An alternative to our previous algorithms is the one introduced in a paper by
Bringmann et al. [BKL15]. The authors propose a more general model in regards
to hyperbolic random graphs, called geometric inhomogeneous random graphs
(GIRG). We paraphrase their summary here briefly: Their model describes GIRGs
by giving each vertex v a weight wv (which follows a power-law) and a uniformly
random position xv in the d-dimensional torus Td (for random hyperbolic graphs,
d = 1). Two vertices u 6= v form an edge with probability puv proportional to
wu wv and inversely proportional to some power of their distance ||xu xv ||. Any
other details as to how those weights and distances are calculated depends on the
properties of a given graph type one is aiming for during generation.
Bringmann et al. prove that their sampling algorithm for GIRGs has an expected
linear runtime of O(n + m), where n is the number of points and m the expected
number of edges. Considering that they show that random hyperbolic graphs are a
special case of GIRGs, the same runtime applies.
We first sample the random positions of our vertices and assign them weights based on their positions. Specifically, in regard to hyperbolic random graphs, a vertex v ∈ V has, according to the above definitions, the positional information

$$x_v = \frac{\varphi_v}{2\pi} \qquad (2.1)$$

and weight

$$w_v = e^{\frac{R - r_v}{2}}. \qquad (2.2)$$
The angle and radius are randomly sampled as in the previous algorithms. Afterwards, we partition all our vertices v ∈ V into different weight layers $V_i$ with 1 ≤ i ≤ L and L = Θ(log(n)), the layers themselves defined as $V_i := \{v \in V \mid w_{i-1} \le w_v \le w_i\}$ with $w_0 := \min\{w_v \mid v \in V\}$ and $w_i := 2 w_{i-1}$ for all i ≥ 1. Considering that the weight layers give a lower and upper bound for all the radii of the vertices therein, they can be understood analogously to the bands used in our previous two algorithms.
For all the weight layers, we will create certain data structures $D_{\mu(i)}(\{x_v \mid v \in V_i\})$ with

$$\mu(i) := \frac{w_i\, w_0}{W} \qquad (2.3)$$

where $W := \sum_{v \in V} w_v$, that ultimately have the following properties:
Starting with the first level l = 0 being made up of only one cell, every other level j divides all cells on level j − 1 into two equally large cells, thus doubling the number of cells with each level.
Because the cells partition the entire ground plane (in our case, all angles between 0 and 2π) and considering that each level is a subdivision of the level above, all cells on the same level l combined cover the entire ground plane. Let us assume a geometric ordering of the cells, with $C_1$ being the one cell on level 0 that covers the entire ground plane on its own and $C_{2^l}, ..., C_{2^{l+1}-1}$ being the cells on every other level l that do the same when combined. In that case, cell $C_x$ on level l with $2^l \le x \le 2^{l+1} - 1$ contains all points p of the weight layer $V_i$ with an angle $2\pi(x - 2^l)/2^l \le \varphi_p < 2\pi(x - 2^l + 1)/2^l$.
[Figure 2.7, left panel: weight layers $L_0$ to $L_3$ plotted over the radius from 0 to R; right panel: cells of layers $L_0$ and $L_1$ over the angle range from 0 to 2π.]
Figure 2.7: A visual aid for the GIRG algorithm showing the weight layers on the left, and an exemplary step of the partitioning process for the partitioning $P_{(0,1)}$ involving layers $L_0$ and $L_1$ on the right.
Every point in any cell $C_i$ can be accessed in constant time by virtue of an additional array A[·] which stores a pointer to the k-th point of cell $C_i$ at $A[s_i + k]$, where, with P being the set of all points, $s_i := \sum_{j<i} |C_j \cap P|$ is the prefix sum of cell $C_i$, stored in the cell's data structure as a parameter.
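To make the cell layout more concrete, the following sketch shows how a point's cell on a given level can be computed from its angle, and how the array A[·] together with the prefix sums $s_i$ yields constant-time access to the points of a cell; the names and the exact layout are our own illustrative assumptions.

#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Within one level, cell x (counted from 0) covers the angular range
// [2*pi*x / 2^level, 2*pi*(x+1) / 2^level).
std::uint64_t cellOnLevel(double phi, unsigned level) {
    const double PI = std::acos(-1.0);
    return static_cast<std::uint64_t>((phi / (2.0 * PI)) * static_cast<double>(1ull << level));
}

// Points grouped by cell index, plus prefix sums s_i over the cell sizes,
// give constant-time access to the k-th point of cell i at A[s_i + k].
struct LayerCells {
    std::vector<std::uint64_t> pointIds;   // the array A[.], grouped by cell
    std::vector<std::size_t> prefix;       // prefix[i] = number of points in cells 0..i-1

    std::uint64_t kthPointOfCell(std::size_t cell, std::size_t k) const {
        return pointIds[prefix[cell] + k];
    }
    std::size_t cellSize(std::size_t cell) const {
        return prefix[cell + 1] - prefix[cell];
    }
};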
After creating such data structures for every weight level, we create for every pair of levels 1 ≤ i ≤ j ≤ L a partitioning $P_{(i,j)} = \{(A_1, B_1), ..., (A_s, B_s)\}$ of cells $A_i, B_i$ with

$$\mu(i, j) := \frac{w_i\, w_j}{W} \qquad (2.5)$$

and $s = O(1/\mu(i,j))$. For our special case without probability edges, there are only two possible ways a pair of cells is added to our partitioning:
Any pair of cells $(A_i, B_i)$ with $VOL(A_i) = VOL(B_i)$ whose boundaries touch is part of $P_{(i,j)}$.
Any pair of cells $(A_i, B_i)$ that don't already fall under the above criterion, but whose parents' boundaries touch, is part of $P_{(i,j)}$; the parent of a cell on level l is defined as the cell on level l − 1 which contains the child in its entirety.
For every pair of cells $(A, B) \in P_{(i,j)}$ we will now calculate the distances between each vertex $u \in V_i^A$ and $v \in V_j^B$ and establish an edge {u, v} if and only if the distance $\mathrm{dist}_H(u, v)$ (see eq. 1.6) is smaller than or equal to R, the radius of our hyperbolic plane. After this step, our graph will have been generated and we are done with the
process.
To make it easier to visualise the procedure, figure 2.7 shows a simplified version
of the process: On the left we have a visual representation of the multiple weight
layers, where we can see how each layer encompasses the entire 2π-span, while each
higher numbered layer has fewer cells with larger volume each (volume in regards
to the angular width, not the radial height). On the right we see how adjacent cells
are added to the partitioning:
First, calculating μ(0, 1) would tell us that $L_0$ would have to be set to the parent cells one layer higher in this particular example, which is why $L_0$ has half the cells with double the volume compared to what was previously seen on the left. Second, with respect to the cell with a yellow border, all its neighbours' borders are colored in red (the leftmost cell's left neighbour is the rightmost cell and vice versa), with arrows pointing from the current cell towards the neighbours with which it will be compared. This will be done for each cell on $L_1$ and each cell on $L_0$, after which we will continue with the next two layers' partitioning.
In regards to the linear runtime, the original paper goes into far more detail, but a
short summary of the differences between NkGen and GIRG, and their respective
runtimes can be understood as follows [BKL15, p.10-11]:
NkGen restricts the number of possible candidates a node v has to query by dividing the hyperbolic plane into multiple bands. Based on those, it calculates angles between which, depending on the radial partitioning chosen, the expected number of candidate nodes is within a constant factor of the number of to-be-established edges (see chapter 3.3 and Penschuck [Pen17]), compared to the number of actual neighbours node v would have in the end. Because we require a binary search per node, and because we are sorting all nodes anyway, the runtime can be given an upper bound of O(m + n log n).
GIRG, on the other hand, while also dividing the plane into multiple bands, creates a partitioning between those bands, where cells from each of those bands are compared to one another. The cell sizes, meaning the number of nodes per cell, are chosen during each partitioning in such a way that, even though one compares each node from one cell with each node from another one in a quadratic fashion, mathematically, this is still bounded by a constant factor compared to the number of edges we would have in between those two cells. As we do not require any kind of sorting or binary search in this endeavor, the runtime stays O(n + m).
LISTING 2.2: Sampling algorithm for GIRG with regards to our special case of no additional random edges, taken from [BKL15]

    E := ∅
    sample the positions x_v, v ∈ V, and determine the weight layers V_i
    for all 1 ≤ i ≤ L do build data structure D_{v(i)}(x_v | v ∈ V_i) with v(i) := w_i·w_0 / W
    for all 1 ≤ i ≤ j ≤ L do
        construct partitioning P(i,j) with μ(i,j) := w_i·w_j / W
        for all (A, B) ∈ P(i,j) do
            for all u ∈ V_i^A and v ∈ V_j^B add edge {u, v} to E if distance d(u, v) ≤ R
        if i = j then remove all edges with u > v sampled in this iteration
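As a concrete illustration of the innermost comparison step for our special case, the following C++ sketch compares one cell pair (A, B): every point of A is checked against every point of B, and an edge is emitted whenever the hyperbolic distance is at most R. The Point struct, hyperbolicDistance and the flat edge vector are assumptions made for this sketch, not the interface of the original GIRG code; the duplicate removal for i = j is omitted.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <utility>
    #include <vector>

    struct Point { std::uint64_t id; double angle, radius; };

    // native hyperbolic distance (cf. eq. 1.6)
    double hyperbolicDistance(const Point& u, const Point& v) {
        const double PI = std::acos(-1.0);
        const double dPhi = PI - std::fabs(PI - std::fabs(u.angle - v.angle));
        const double arg = std::cosh(u.radius) * std::cosh(v.radius)
                         - std::sinh(u.radius) * std::sinh(v.radius) * std::cos(dPhi);
        return std::acosh(std::max(1.0, arg));
    }

    // compare one cell pair (A, B) and collect all edges with distance <= R
    void comparePair(const std::vector<Point>& cellA, const std::vector<Point>& cellB,
                     double R, std::vector<std::pair<std::uint64_t, std::uint64_t>>& edges) {
        for (const Point& u : cellA)
            for (const Point& v : cellB)
                if (u.id != v.id && hyperbolicDistance(u, v) <= R)
                    edges.emplace_back(u.id, v.id);
    }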
Chapter 3
Analysis
In this chapter, we will at first analyse the three algorithms and derive their I/O-complexity; it should be noted, though, that we only consider the algorithm itself and exclude the output in the complexity analysis. In the case of the EM-variant, we will also present a runtime analysis before taking a closer look at the radial partitioning possibilities. As our goal is to compare the three algorithms in a practical benchmark setting, we first have to establish beforehand which parameters are most beneficial to the EM-variant. Considering that the partitioning has a large impact on the runtime in general, this analysis and the subsequent choice of parameters (band counts, radial partitioning and parallelisation count per band, among others) is a necessary step to optimise our algorithm.
The generation and sorting phases take scan(n) and sort(n) I/O-operations respectively. This is because arrays will be filled with n nodes and the algorithm sorts at worst O(n) nodes. The dominating factor comes from the edge-creation phase, where the algorithm goes through all bands and all nodes therein. On its own, this would imply a bound of O(scan(n)). Because the algorithm performs a binary search once per node, though, it implies at least one random access jump per node for sufficiently large n. Assuming the binary search performs O(log(n)) steps, all of which jump all over the array, it will take at best $\Omega(\log(\frac{n}{B}))$ I/Os to find the two nodes that are closest to $\phi_{min}$ and $\phi_{max}$. Doing this once per node results in an I/O-complexity of $O(\mathrm{scan}(n) + n\log(\frac{n}{B})) = O(n\log(\frac{n}{B}))$ for this phase alone. Comparing the respective upper bounds, one can see that the algorithm's I/O-complexity is bound by its very I/O-intensive edge-creation phase.
In other words, if the nodes do not fit into the main memory, we will have to expect Ω(n) I/O-operations (at least one random access per node), resulting in a very unbeneficial practical runtime under certain use case conditions.
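To put the gap between the two bounds into perspective, consider a small worked example with assumed, purely illustrative values of $n = 10^9$ nodes, block size $B = 10^6$ and internal memory $M = 10^8$ items:
$$n \log_2\!\left(\frac{n}{B}\right) = 10^9 \cdot \log_2(10^3) \approx 10^{10} \text{ I/Os}, \qquad \mathrm{sort}(n) = \frac{n}{B}\log_{M/B}\!\left(\frac{n}{B}\right) = 10^3 \cdot \log_{100}(10^3) = 1.5 \cdot 10^3 \text{ I/Os},$$
i.e. in this setting the binary-search-driven edge-creation phase exceeds the cost of externally sorting the nodes by more than six orders of magnitude.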
Since we are working with randomly generated graphs and most of our algorithm is based upon the randomly generated positioning of the nodes, the active-size is a random variable. Let us thus begin with a construction of an estimate for active.
Let $B_x = (b_x, d_x)$ be a band, where $b_x$ is the inner, $d_x$ the outer band radius, and where $B_x$ encompasses the radius range $[b_x, d_x)$. Let node $v = (\phi_v, r_v)$ be a random node on band $B_x$. Let $\phi_{min} \le \phi_v < \phi_{max}$ and $|\phi_{max} - \phi_{min}| = 2\Delta\theta_{B_x}(y)$ (see eq. 1.8) be the maximal angular difference for a different node y's neighbours, where $y = (\phi_y, r_y)$ is an arbitrary node creating a query on band $B_x$. Let also $E[|active_{B_x}|]$ be the expected number of elements in active on band $B_x$. This number is as such the expected number of StartBounds we have on average whenever a node on band $B_x$ is picked to traverse active. In case of a random node v, we are thus looking at the average number of query nodes whose $\phi_{min}$ is smaller than $\phi_v$ and whose $\phi_{max}$ is larger than $\phi_v$.
Since $\phi_v$ is uniformly randomly generated from $[0, 2\pi)$, the problem of finding this number is thus equivalent to the question of how many nodes have created queries whose range $[\phi_{min}, \phi_{max})$ contains $\phi_v$ on band $B_x$.
To simplify this further, let us assume we have only one node y that creates a query with a query range of $0 \le 2\Delta\theta_{B_x}(y) \le 2\pi$. The probability of this point's query to appear in active during point v's turn (v resides on band $B_x$) is $\frac{2\Delta\theta_{B_x}(y)}{2\pi}$, as this is the fraction of the area on a given band that is covered by the query. If v is in that area, active will have y's query.
The next question is how many nodes create such a query. This depends on the probability of nodes being on the exact same radius as y, meaning we can change our question further: For all possible $\Delta\theta_{B_x}(y)$ that can be created on $B_x$, how many nodes can be expected to create each of these queries?
First off, the expected number Q of nodes that can create queries on a band $B_x$ is the number of nodes in total multiplied by the probability mass between 0 and $d_x$. Let $X_i$ be the indicator variable for the event that node i's radius $R_i$ lies inside the range $[0, d_x]$. In that case, the expected number of queries Q on $B_x$ is:
$$E[Q] = \sum_{i=1}^{n} E[X_i] = \sum_{i=1}^{n} P[0 \le R_i \le d_x] = \int_0^{d_x} f(r)\, n\, dr = n \int_0^{d_x} f(r)\, dr$$
To calculate the number of nodes that will be part of v's active, we only need to multiply the number of nodes creating a query of a specified range with the probability of that query appearing in active, which is the aforementioned fraction $2\Delta\theta_{B_x}(y)/2\pi$. As both the number of query nodes and their probability of appearing in active depend on the query node's radius, the easiest solution is to multiply both inside the integral.
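The resulting expectation can also be evaluated numerically. The following C++ sketch is a minimal illustration using the radial density of eq. 1.2 and the approximated query angle of eq. 1.9; the function names and the midpoint-rule integration are our own choices and not part of the generator.

    #include <algorithm>
    #include <cmath>

    // Numerical sketch (ours) of the expected active-size on a band B_x = (b_x, d_x).
    double radialDensity(double r, double alpha, double R) {
        // f(r) = alpha * sinh(alpha r) / (cosh(alpha R) - 1)
        return alpha * std::sinh(alpha * r) / (std::cosh(alpha * R) - 1.0);
    }

    double queryAngle(double r, double bandInner, double R) {
        // approximated query half-width onto a band with inner radius bandInner, capped at pi
        const double PI = std::acos(-1.0);
        return std::min(PI, 2.0 * std::exp((R - r - bandInner) / 2.0));
    }

    double expectedActiveSize(double n, double alpha, double R,
                              double bandInner, double bandOuter, int steps = 100000) {
        const double PI = std::acos(-1.0);
        const double h = bandOuter / steps;             // integrate over [0, d_x]
        double sum = 0.0;
        for (int i = 0; i < steps; ++i) {
            const double r = (i + 0.5) * h;
            sum += (2.0 * queryAngle(r, bandInner, R) / (2.0 * PI))
                   * radialDensity(r, alpha, R) * h;
        }
        return n * sum;                                  // n times the integral
    }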
Lemma 3. The expected size of active is sublinear in n for any average degree k < n, where n is the number of nodes, for any radial partitioning with at least one band with an inner radius $b = yR > R/2$. Thus active can be expected to be held in our main memory of size $M = O(n^{1-\alpha(1-y)}\, k^{\alpha(1-y)})$ for all $1 - \frac{1}{\alpha} < y < 1$, all $\alpha > 1/2$, and all $k < n$.
Proof. Given the equation for $E[|active_{B_x}|]$ with $B_x = (b_x, d_x)$, where $b_x$ is the inner and $d_x$ the outer band radius, we begin by applying the approximations 1.3 and 1.9 and assuming $R = O(\ln(n/k))$ (see lemma 1):
$$E[|active_{B_x}|] = n \int_0^{d_x} \frac{2\,\Delta\theta_{B_x}(\phi_y, r)}{2\pi}\, f(r)\, dr = n \int_0^{d_x} \frac{2 \cdot 2e^{(R-r-b_x)/2}}{2\pi}\, \alpha e^{\alpha(r-R)}\, dr$$
Before we continue, let us note that we can split the active-array into two sets of query nodes: the queries created by nodes whose radius is at least $b_x$, i.e. nodes residing on band $B_x$ itself, and the queries created by nodes on lower bands with radius smaller than $b_x$. Also to remember is the fact that neither of those sets can be larger than all nodes in their respective radii ranges, as there cannot be more queries than double the nodes. I.e., let V be the set of all nodes, then active can be split into the sets $active_{B_x} = Q_{\ge x} \cup Q_{<x}$, where $Q_{\ge x} \cap Q_{<x} = \emptyset$, with $Q_{\ge x}$ holding the queries of the nodes with $r_v \ge b_x$ and $Q_{<x}$ those of the nodes with $r_v < b_x$. We can bound $E[|Q_{\ge x}|]$ by restricting the above integral to $[b_x, d_x]$ and taking into account that the smallest radius r a node contributing to $Q_{\ge x}$ can have is $b_x$.
$$E[|Q_{\ge x}|] = \int_{b_x}^{d_x} \frac{2 \cdot 2e^{(R-r-b_x)/2}}{2\pi}\, n\, \alpha e^{\alpha(r-R)}\, dr \qquad \Big|\ \max_{b_x \le r \le d_x}\!\left(e^{(R-r-b_x)/2}\right) = e^{(R-b_x-b_x)/2}$$
$$\le n\, \frac{2 \cdot 2e^{(R-2b_x)/2}}{2\pi} \int_{b_x}^{d_x} \alpha e^{\alpha(r-R)}\, dr = n\, \frac{2 \cdot 2e^{(R-2b_x)/2}}{2\pi}\, O\!\left(e^{-\alpha R} e^{\alpha d_x}\right) = O\!\left(e^{R/2}\, e^{-b_x}\, e^{-\alpha R}\, e^{\alpha d_x}\, n\right)$$
Let us assume we have two bands, band B0 = (0, yR) and band B1 = (yR, R) with
1/2 < y < 1.
Case $B_0$:
The number of nodes $|V_0|$ on band $B_0$ is, after applying 1.5 to its range, expected to be:
$$E[|V_0|] = n \cdot \mathrm{mass}(0, yR) = O\!\left(n\, e^{\alpha(yR - R)}\right) = O\!\left(n^{1-c_1}\, k^{c_1}\right) \quad \text{with } c_1 := \alpha(1-y)$$
And since $c_1 < 1$ for all $y > 1 - \frac{1}{\alpha}$, it follows that, under those restrictions for y, this is sublinear in n for all $k < n$ (as a side note, this means that for all $\alpha \le 2$ this applies to all $1/2 < y < 1$, which during the later experimental evaluation is always going to be the case for all radial partitionings). As this is in the realm of possibilities, since $y < 1$, this means we can put all $|V_0|$ nodes from band $B_0$ into main memory. Since we cannot have more queries than nodes, i.e. $|Q_{\ge 0}| = O(|V_0|)$, and since there are no other bands below it, i.e. $|Q_{<0}| = 0$, active fits into any memory $M = O(k^{c_1}\, n^{1-c_1})$, which is sublinear in n for all $1 - \frac{1}{\alpha} < y < 1$ and all $k < n$.
Case $B_1$:
Applying the above bound for $E[|Q_{\ge x}|]$ with $b_x = yR$ and $d_x = R$ yields
$$E[|Q_{\ge 1}|] = O\!\left(e^{R/2}\, e^{-yR}\, e^{-\alpha R}\, e^{\alpha R}\, n\right) = O\!\left(e^{(1/2 - y)R}\, n\right) = O\!\left(k^{y-1/2}\, n^{1-(y-1/2)}\right)$$
Note that both exponents are smaller than one and larger than zero. Let $c_2 = y - 1/2$, then $ex_k = c_2$ and $ex_n = 1 - c_2$. In order to see for which k this becomes sublinear, we set the term in an inequality with n:
$$k^{c_2}\, n^{1-c_2} < n \iff k^{c_2} < n^{c_2} \iff k < n$$
Thus it follows that the number of queries from nodes on band $B_1$ only, i.e. $|Q_{\ge 1}| = O(k^{ex_k}\, n^{ex_n}) = O(k^{c_2}\, n^{1-c_2})$, is sublinear in n for any $k < n$, and for any $1/2 < y < 1$, including $1 - \frac{1}{\alpha} < y < 1$.
In regards to the number of queries of nodes from smaller bands, i.e. $|Q_{<1}|$, we can just keep all nodes from all lower bands (in this case specifically $B_0$) inside our main memory, as proven in the former case. All around, the size of active on band $B_1$ is expected to be:
$$E[|active_{B_1}|] = E[|Q_{\ge 1}|] + E[|Q_{<1}|] = O\!\left(k^{c_2}\, n^{1-c_2} + |V_0|\right)$$
Since the upper bound of the second term is the number of nodes on band $B_0$, we apply the same argument to it with $c_1 = \alpha(1-y)$ as we did earlier with $O(|V_0|)$, meaning that active's size on band $B_1$ is sublinear in n for all $k < n$ and all y with $1/2 < y < 1$ and $y > 1 - \frac{1}{\alpha}$.
With this, we have proven that on both bands the active-size is sublinear in n for any $k < n$ and all y with $1/2 < y < 1$ and $y > 1 - \frac{1}{\alpha}$, and that it will thus fit into any memory $M = O(n^{1-\alpha(1-y)}\, k^{\alpha(1-y)})$. Any additional band would follow one of the two cases:
Any additional band $B_i = (b_i, d_i)$ inserted into the radius range of $B_0$, i.e. $b_i < d_i \le yR$, would follow the proof of case $B_0$.
Any additional band $B_i = (b_i, d_i)$ inserted into the radius range of $B_1$, i.e. $yR \le b_i < d_i \le R$, would follow the proof of case $B_1$.
This means that the only condition for this lemma to apply to any radial partitioning is the existence of one band $B_x$ with an inner radius $b_x = yR > (1 - \frac{1}{\alpha})R$ and $k < n$.
At last, we want to show that, with high probability, the actual active-size is not larger than a constant factor of the above expectation.
Given the probability $p = \int_0^{d_x} \frac{2\Delta\theta_{B_x}(\phi_y, r)}{2\pi}\, f(r)\, dr$ (derived from equation 3.1 by multiplying the density with the query angle) of a random node being part of that neighbourhood, we can construct from this problem a binomial distribution $B(n, p)$, where n is the number of nodes we have. Because we are generally working with larger n, we can approximate the binomial distribution $B(n, p)$ with a normal distribution $N(\mu, n p q)$.
Thus, our active-size follows the normal distribution $N(\mu, \sigma^2)$ under the assumption that $\mu = n p = E[|active_{b_i}|]$ and $\sigma = \sqrt{n p (1-p)} = O(\sqrt{n p}) = O(\sqrt{E[|active_{b_i}|]})$. As we are working with a normal distribution, we can use the Empirical Rule to our advantage:
$$\sigma = O\!\left(\sqrt{E[|active_{b_i}|]}\right) \le \frac{1}{3}\, E[|active_{b_i}|]$$
meaning that $\sigma = O(\frac{1}{3} E[|active_{b_i}|])$ for sufficiently large expected sizes. The 3σ-interval encompasses 99.7% of all possible active-sizes, with the largest value being:
$$\mu + 3\sigma = E[|active_{b_i}|] + 3 \cdot \frac{1}{3}\, E[|active_{b_i}|] = 2\, E[|active_{b_i}|]$$
Following that, we can assure with a 99.7% chance that the actual active-size is at most twice as large as the expected size. In other words, the above proven sublinearity holds w.h.p., which finishes our proof.
3.2.1 I/O-Complexity
We have overall three phases to consider for our possible upper bound on the
I/O-complexity: The generation, the sorting and the edge-creation phase.
The generation and sorting phases can be combined into one, as each generated object has to be sorted in the end. For one, we have n nodes to put into our sorters; this alone is an I/O-complexity of O(sort(n)). We also have to sort the bounds, considering each node has at most two bound-pairs on its own band and at most another two pairs on each outer band. Since we thus have more bounds than nodes, we have to assume the bound count is the dominating factor here. Let S be the number of bounds, with $S = \Omega(n)$ and $S = O(l \cdot n)$, l being the number of bands; then our complexity would be $O(\mathrm{sort}(n) + \mathrm{sort}(S)) = O(\mathrm{sort}(S))$.
In the edge-creation phase we go through each band and sorter. Since the data is sorted and our algorithm relies on only one additional array, the active-array, we have only two things to consider here: the sorter sizes and the active size.
Due to lemma 3, we can assume that active fits into our RAM and can thus ignore this data structure entirely for the calculation of our I/O-complexity. The sorters are scanned from the smallest to the largest element in each sorter, and because we know that all node sorters combined have n elements and all bound sorters combined have S elements, the complexity here is $O(\mathrm{scan}(n) + \mathrm{scan}(S)) = O(\mathrm{scan}(S))$.
Overall, the dominating phase in regards to I/O complexity is the sorting step, as
sorting is more complex than scanning alone, meaning our entire algorithm has an
I/O-complexity of O(sort(S)).
The upper bound of S depends entirely on the partitioning chosen, and while we will not use it in our final benchmarks, the following lemma 4 proves that it is possible to choose a radial partitioning under which S stays linear in n.
Lemma 4. There is a radial partitioning under which the total number of Bound-objects S has a tight bound of Θ(n).
Proof. Let l be the number of bands with l > 1. Let $C = (c_0, c_1, \ldots, c_l)$ be a radial partitioning with bands $b_i = (c_i, c_{i+1})$ for all $0 \le i < l$, where band $b_0$ is the innermost band. Let $v(b_i)$ be the fraction of nodes on band $b_i$, where a fraction of one would be the entirety of all nodes, while zero would be no nodes at all. Let the partitioning be chosen in a way that divides the nodes onto the bands in the following fashion:
$$v(b_i) = \begin{cases} 1/2^{l-1}, & \text{if } i = 0 \\ 1/2^{l-i}, & \text{otherwise} \end{cases}$$
Since we are working with a randomly generated graph, the fractions should be considered a probability mass. In other words, the expected node count on band i is defined as $\mu_i = n \cdot v(b_i)$. A node being part of a specific band i with a specified fraction can be considered a Bernoulli experiment, with the probability p being that fraction, the node being part of that band being a positive event (i.e. $X_{ji} = 1$, if $X_{ji}$ is the random event of node j being on band i), and it being part of another band being a negative event (i.e. $X_{ji} = 0$). Under those definitions we get $E[\sum_{j=1}^{n} X_{ji}] = O(n \cdot v(b_i))$, where Chernoff's inequality gives us the same bound with high probability:
$$P\!\left[\sum_{j=1}^{n} X_{ji} > (1+\delta)\, n\, v(b_i)\right] \le \exp\!\left(-\frac{\delta^2}{3}\, n\, v(b_i)\right) \qquad \Big|\ O(v(b_i)) = O(0.5)$$
$$\le \exp\!\left(-\frac{\delta^2}{6}\, n\right) \qquad \Big|\ \delta^2 = 6$$
$$\le \exp(-n) = \frac{1}{\exp(n)}$$
$$\sum_{i=0}^{l-1} n_i = \sum_{i=0}^{l-1} v(b_i)\, n = n \sum_{i=0}^{l-1} v(b_i) = n\left(1/2^{l-1} + \sum_{i=1}^{l-1} 1/2^{l-i}\right)$$
$$= n\left(1/2^{l-1} - 1/2^0 + \sum_{i=0}^{l-1} 1/2^{i}\right) = n\left(1/2^{l-1} - 1/2^0 + \frac{1 - 1/2^{l}}{1 - 1/2}\right)$$
$$= n\left(1/2^{l-1} - 1 + 2\,(1 - 1/2^{l})\right) = n\left(1/2^{l-1} - 1/2^{l-1} - 1 + 2\right) = n$$
Every node on band i will at most create one bound for its query on each outer band, additionally to its own band. Looking at it from the other way, this means every band i will have at most one bound per node that exists on any band j with $j \le i$, i.e. at most $\sum_{j \le i} n_j$ bounds. Since the node fractions $v(b_j)$ of the lower bands form a geometric series, this inner sum is $O(n \cdot v(b_i))$, and summing over all bands yields
$$S \le \sum_{i=0}^{l-1} O\!\left(n \cdot v(b_i)\right) = O(n \cdot 2) = O(n)$$
No matter how large l > 0 is, the upper bound will always be O(n). And since the lower bound will never be less than Ω(n), because every node has at least one bound pair on its own band, we conclude this proof with S having a tight bound of Θ(n) under the above partitioning.
NkGen sorts only the nodes, while the EM-variant has to sort the nodes and all Bound-objects. The number of Bound-objects, though, can be bounded in terms of the number of nodes, either via the radial partitioning scheme (see lemma 4) or by choosing a constant band count l. Thus, the difference is O(n log(n)) vs. O(l·n·log(n)), or, with a constant band count, O(n log(n)) for both.
NkGen has during its edge-creation phase a runtime that depends on two aspects: The number of comparisons between nodes and neighbour candidates, and the number of binary search operations one needs to find those candidates. The former, while not proven, has been empirically analysed by the authors to be linear in the number of edges [LLM16, p.5]. In fact, it has been shown that with a uniform radial partitioning (where, with l being the number of bands, band i's inner radius is $b = R \cdot i/l$ and its outer radius is $d = R \cdot (i+1)/l$) the number of neighbour candidates per node v chosen by the algorithm differs only by a constant factor c from the actual number of neighbours that v has (see chapter 3.3 and Penschuck [Pen17]).
Let O(C) be the upper bound on the number of comparisons. In total, this means that the EM-variant has an overall runtime of $O(C + l \cdot n \log(n))$, compared to NkGen's $O(C + n \log(n))$, where O(C) is empirically linear in m [LLM16, p.5].
What is left is the main loop that includes the construction of partitionings between multiple layers and the subsequent traversal of those. While the number of partitioning combinations is quadratic in the number of layers L, the number of I/O-operations can be lower, depending on how we traverse them. If we traverse all partitioning pairs in a concurrent manner where we always keep the largest cell in memory, the best I/O-complexity we could achieve would be O(scan(n)):
Each layer i contains the nodes whose weight $w_v = e^{(R-r_v)/2}$ lies between $w_{i-1}$ and $w_i$, where each $w_i$ is defined as $w_i := 2 w_{i-1}$, $i \ge 1$. In turn, this means that a node with weight $w_i$ has a radius of $r_i = R - 2\ln(w_i)$. Were we to double the weight to $w_{i+1} = 2 w_i$, a node with this weight would have the radius $r_{i+1} = R - 2\ln(2 w_i)$.
Applying those radii for two subsequent layers i and i+1 with ranges $[r_{i-1}, r_i)$ and $[r_i, r_{i+1})$ respectively to the density mass calculation 1.5, we get:
$$\frac{\mathrm{mass}(layer_i)}{\mathrm{mass}(layer_{i+1})} = O\!\left(\frac{e^{-\alpha R + \alpha r_i}}{e^{-\alpha R + \alpha r_{i+1}}}\right) = O\!\left(e^{\alpha(r_i - r_{i+1})}\right) = O\!\left(e^{\alpha(R - 2\ln(w_i) - (R - 2\ln(2 w_i)))}\right) = O\!\left(e^{2\alpha(\ln(2 w_i) - \ln(w_i))}\right) = O\!\left(e^{2\alpha\ln 2}\right) = O\!\left(4^{\alpha}\right)$$
In other words, there is a geometric distribution of nodes between the layers, where the number of nodes in subsequent layers at least doubles (since α > 1/2). Let us assume that the number doubles exactly, i.e., the node mass per layer resembles a geometric partitioning of $\mathrm{mass}(layer_i) = 1/2^i$ for all $1 \le i \le L$. In that case, the sum over all layers' nodes is just a constant factor of n, specifically O(2n) (see the proof of lemma 4).
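As a short worked illustration of this constant factor under the assumed, exactly geometric split:
$$\sum_{i=1}^{L} n_i = n \sum_{i=1}^{L} \frac{1}{2^i} < n, \qquad \sum_{j=i}^{L} n_j = n \sum_{j=i}^{L} \frac{1}{2^j} < \frac{2n}{2^i} = 2\, n_i,$$
the latter being the geometric-tail bound used for the inner sums of the I/O cost below.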
Coming back to the I/O-complexity: For ease of understanding, let us assume that each cell in the cell partitionings $P(i,j)$, $0 \le i \le j \le L$, fits into a block of B nodes. In order to be as efficient as possible, we would load the first cell/block of B nodes from the largest layer, layer 1, and compare these nodes with the first three adjacent cells/blocks from each layer 1 through L. We would then continue with the second block on layer 1, the third, and so on. After that, we would compare layer 1 with itself, comparing each block with every adjacent block. Let $n_i$ be the expected number of nodes on layer i, with $\sum_{i=1}^{L} n_i = n$. The comparison between layer 1 and all the others would then take $O(L + (n_1 + n_2 + \ldots + n_L)/B) = O(L + n/B)$ I/O-operations, while the later comparison of the layer with itself would take another $O(n_1/B)$; the added L comes from the fact that we have to traverse each layer at least once. This holds for every subsequent layer we are going through. The total I/O-cost I of the entire procedure would thus be:
$$I = \sum_{i=1}^{L}\left(i + \frac{n_i}{B} + \sum_{j=i}^{L} \frac{n_j}{B}\right)$$
Since the summation of each subsequent $n_x$ resembles a geometric sum, we can give an upper bound for the inner sum, and furthermore an upper bound on the outer sum as well, resulting in:
$$I = \sum_{i=1}^{L}\left(i + \frac{n_i}{B} + O\!\left(2\,\frac{n_i}{B}\right)\right) = \sum_{i=1}^{L} O\!\left(i + 3\,\frac{n_i}{B}\right) = O\!\left(L^2 + \frac{6n}{B}\right) = O\!\left(L^2 + \frac{n}{B}\right)$$
In other words, under the condition that we traverse the partitioning in this specific order, we would at worst have an I/O-complexity of $O(L^2 + n/B) = O(\log^2(n) + \mathrm{scan}(n)) = O(\mathrm{scan}(n))$. Since every node has to be compared at least once, this is also our best-case scenario, i.e. $O(\mathrm{scan}(n)) = \Theta(\mathrm{scan}(n))$. If this order is not taken into account, though, we might at worst have to load the first, largest layer multiple times, specifically once per other layer j we are comparing it with. This would result in a final upper bound of $O(L \cdot n_1/B) = O(\log(n) \cdot n/B)$, since we have $L = \log(n)$ layers.
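To get a feeling for the difference between the two traversal orders, consider assumed round values of $n = 10^9$ and $B = 10^6$, so that $L = \log_2(n) \approx 30$:
$$L^2 + \frac{n}{B} \approx 900 + 1000 \approx 1.9 \cdot 10^3 \text{ I/Os} \qquad \text{vs.} \qquad L \cdot \frac{n}{B} \approx 30 \cdot 1000 = 3 \cdot 10^4 \text{ I/Os},$$
i.e. roughly an order-of-magnitude penalty whenever the traversal order forces the largest layer to be reloaded once per partner layer.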
3.5 Comparison
Table 3.1 summarises all findings up until now, where we take the empirically based assumption of NkGen's runtime [LLM16, p.5] as valid. As a reminder, the I/O-complexity does not include any I/O-operations regarding the graph's final output:
From the runtime complexity alone, we can see that the algorithms are very similar to one another. GIRG does not have the problem with sorting, which is why the others have an additional O(n log(n)) added onto the runtime. In the EM-variant's case it is even O(l·n·log(n)), as NkGen only sorts the n nodes: regardless of how many bands one might have, the sum of all nodes will always be n. Not so for the EM-variant, as we do not only sort the n nodes but also all StartStopBounds, the sum of which is at least n and more likely above that, as each band i always receives the bounds from the nodes of all previous bands (i.e. the bounds from all nodes on all bands j with j < i), additionally to the ones created by the nodes on band i itself.
In other words, for small graphs it might be worse for the EM-variant to have a high number of bands, while for large ones it will be irrelevant if m is sufficiently large as well.
In regards to the I/O complexity, the EM-variant has the biggest advantage, as only the sorting phase hinders it here, with a low enough complexity to compete with the other alternatives. GIRG has the problem in its partitioning, as the traversal order of the many cells we are comparing matters. At best, we might have a traversal order that takes I/O-efficiency into account and keeps the temporal locality principle in mind, i.e. that currently used data is more likely to be needed again sooner rather than later [MSS03, p.9]. Using the data loaded into RAM as much as possible according to the traversal order mentioned in chapter 3.4 would result in an I/O-complexity of scan(n). If that is not the case, though, we will have a lot more unstructured accesses between the multiple layers, resulting in a worse I/O-complexity than the EM-variant's. NkGen on the other hand has its own issue with the binary search that is performed once per node, resulting in the factor in front of the logarithm being potentially a magnitude higher than the one for the EM-variant (depending on the block size B).
All in all, we expect from the theoretical analysis to see GIRG and NkGen excel at smaller graphs that still fit into memory, as that is where their runtime complexities should dominate the practical runtime. Once we enter the external memory environment, it should either be the EM-variant or GIRG that has an advantage over the others in regards to the I/O-complexity, depending on the internal traversal order of the GIRG algorithm.
3.6 Radial Partitionings
3.6.1 Overview
As previously mentioned, we used a geometric partitioning for our bands based on Looz et al.'s empirically chosen parameters [LLM16, p.3], so as to even the grounds on which we are comparing the algorithms in general; using different kinds of partitioning methods might induce a bias into the benchmarking setup if we are trying to isolate the effectiveness of an EM-implementation.
On the other hand, our parallelisation scheme differs from the original algorithm: Where in the original each node was given to a different thread, we cannot do so, since the sorters are not thread-safe. And since in our case we distribute band segments to threads instead, the partitioning choice might have a larger impact on the parallelised runtime in a direct comparison. As such, we chose to put some focus on this part of the algorithm as well.
In regards to dividing the hyperbolic plane in the case of our algorithm specifically, there are only two general parameters we are concerned with.
The first parameter is the partitioning C of the band radii $c_0, \ldots, c_l$ for our l bands, where every band i is defined by its inner radius $c_i$ and outer radius $c_{i+1}$ (in particular, $c_0 := 0$).
The second parameter is the number of bands l itself. Every additional band adds an additional query-creation and sorting step, with the number of query bounds we have to sort being bounded in size by O(n·l), as mentioned during our I/O-complexity analysis of the EM-variant. Looz et al. noted that for their benchmark environment it was beneficial to make the number of bands dependent on the number of nodes, specifically l = log(n) [LLM16, p.2-3]. Whether or not this is a sensible choice for the other radial partitionings is something we will look into as well.
Apart from the geometrical approach taken from Looz et al., we came up with two other possible partitioning choices that we will go through and analyse later on, one after the other: a minimised workload partitioning and an equalised workload partitioning.
The reason for both partitioning choices is as follows: The minimised workload approach should yield a partitioning by which, if we were to go through all bands one by one, we would minimise the runtime as well. The equalised workload approach on the other hand was chosen because, while the minimised approach would be faster in a single-threaded environment, it might be more beneficial for a parallel work setting to choose a partitioning that spreads the workload better over multiple threads.
Regardless of the approach, the bigger problem here is the question as to how one
creates a partitioning where the workload subdivision is chosen beforehand. For
that, we devised a way to estimate an expected workload, given a chosen partition-
ing, which will be explained in the next section.
Let us assume, disregarding the sorting step, that we want to predict the amount of work band i between radii $c_i$ and $c_{i+1}$ has. Our main algorithm goes through every node v in band i and checks, for every potential neighbour candidate of each node, whether an edge has to be established. If one recalls, the candidates were located in our algorithm in the active-array, through which we had to go once per node. As such, the size of active is the number of potential neighbours we have to verify for edge-compatibility.
With this in mind, let $n_i$ be the number of nodes on band i and let $E[k_i]$ be the expected number of potential neighbours each of the $E[n_i]$ nodes has to verify. The work $w_i$ on band i would thus be defined as $w_i = E[n_i] \cdot E[k_i]$.
$E[n_i]$ is comparatively straightforward: Using the radial density function used for the nodes' radius distribution, we can calculate its integral and with it the density mass $m_i$ residing on band i, meaning $E[n_i] = n \cdot m_i = n \cdot \mathrm{mass}(c_i, c_{i+1})$ (see eq. 1.4).
$E[k_i]$ on the other hand is more complicated, though we have already given the answer to it during our theoretical active-size analysis in chapter 3.2. As a quick reminder, the active-array holds all potential neighbours against which any current node has to be checked to verify whether or not an edge has to be established. In other words, $E[k_i]$ corresponds to the expected size of the active-array on band $b_i$, meaning that we can use equation 3.1 and the thought processes behind it for our expected workload calculation.
This results in the following formula that should give us an expected size for active, where the point q with radius r is a query point on band $b_i$, $\theta(q, c_{i-1}) = 2\Delta\theta_{b_i}(q)$ is the calculated angle the query point q would have on a band with an inner radius $c_{i-1}$ (see eq. 1.8), and $f(r)$ is the density function used for the radius distribution (see eq. 1.2):
$$E[k_i] = \int_0^{c_i} \frac{\theta(q, c_{i-1})}{2\pi}\, f(r)\, n\, dr \quad (3.2)$$
With this active-size-proxy we can approximate the work on any given band if we
know its boundaries with the following formula:
Given the band radii $C = (c_0, \ldots, c_l)$, with $c_0 := 0$, the entire workload W for our main algorithm can be calculated as follows:
$$W = \sum_{i=0}^{l-1} w_i = \sum_{i=0}^{l-1} n_i \cdot E[k_i] = \sum_{i=0}^{l-1} n \cdot \mathrm{mass}(c_i, c_{i+1}) \cdot \int_0^{c_{i+1}} \frac{\theta(q, c_i)}{2\pi}\, f(r)\, n\, dr$$
$$= \sum_{i=0}^{l-1} n \cdot \mathrm{mass}(c_i, c_{i+1}) \cdot \frac{4\, n\, \alpha}{\pi\,(2\alpha-1)}\left(e^{\frac{1}{2}(2\alpha-1)c_{i+1}} - 1\right) e^{\frac{1}{2}(R - c_i - 2\alpha R)} \quad (3.3)$$
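For completeness, here is a small, self-contained C++ sketch (our own, not part of the generator) that evaluates the workload proxy numerically via eq. 3.2 instead of the closed form; density and mass follow the radial density of eq. 1.2, queryAngle the approximation of eq. 1.9.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    static double density(double r, double alpha, double R) {
        return alpha * std::sinh(alpha * r) / (std::cosh(alpha * R) - 1.0);
    }
    static double mass(double a, double b, double alpha, double R) {
        return (std::cosh(alpha * b) - std::cosh(alpha * a)) / (std::cosh(alpha * R) - 1.0);
    }
    static double queryAngle(double r, double innerRadius, double R) {
        const double PI = std::acos(-1.0);
        return std::min(2.0 * PI, 2.0 * 2.0 * std::exp((R - r - innerRadius) / 2.0));
    }

    // w_i = E[n_i] * E[k_i], with E[k_i] integrated by a simple midpoint rule
    double bandWorkload(double n, double alpha, double R,
                        double ci, double ciPlus1, int steps = 20000) {
        const double PI = std::acos(-1.0);
        const double expectedNodes = n * mass(ci, ciPlus1, alpha, R);
        const double h = ciPlus1 / steps;                // integrate over [0, c_{i+1}]
        double expectedActive = 0.0;
        for (int s = 0; s < steps; ++s) {
            const double r = (s + 0.5) * h;
            expectedActive += (queryAngle(r, ci, R) / (2.0 * PI)) * density(r, alpha, R) * h;
        }
        return expectedNodes * (n * expectedActive);
    }

    // W = sum of all per-band workloads for the radius vector c = (c_0, ..., c_l)
    double totalWorkload(double n, double alpha, double R, const std::vector<double>& c) {
        double W = 0.0;
        for (std::size_t i = 0; i + 1 < c.size(); ++i)
            W += bandWorkload(n, alpha, R, c[i], c[i + 1]);
        return W;
    }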
Since our numerical methods require multiple evaluations of this equation, we chose to use the approximations (see eq. 1.3 and 1.9) instead of the exact equations 1.2 and 1.8 to cut down the runtime of the calculation. We found out during initial testing that our proxy usually delivered inaccurate, too large results on the inner bands and for smaller α, as figure 3.1 shows. This is mostly because, as mentioned in chapter 1.2.3 with the introduced inequality 1.10, for smaller $r_v$ and $c_i$ we get degrees that are too large. Using an integral does not change that fact, especially if we calculate the workload for lower bands (i.e. lower $c_i$) and have a lot more points with lower $r_v$ than larger ones, as is the case with smaller α settings.
To combat this, we changed equation 3.3 to take this discrepancy into account and split the integral along a border s corresponding to inequality 1.10. With this approach, we devised a better approximation of active's size, namely $\bar{\bar{E}}[k_i]$, where the second bar indicates the improved version of our estimate calculation:
$$\overline{W} = \sum_{i=0}^{l-1} n_i \cdot \bar{\bar{E}}[k_i] \quad (3.4)$$
where the part of the split integral above the border s evaluates to
$$\bar{\bar{E}}[k_i]^{(2)} = \frac{4\, n\, \alpha}{\pi\,(2\alpha-1)}\, e^{-c_i/2}\left(e^{\frac{1}{2}(2\alpha-1)(c_{i+1}-R)} - e^{\frac{1}{2}(2\alpha-1)(s-R)}\right) \quad (3.5)$$
In other words, whenever the border s is larger than or equal to our current band's outer radius $c_{i+1}$, we know that the query angle for each node between radii 0 and $c_{i+1}$ is going to encompass the entire band, i.e. it will be 2π. In that case, we can strike the probability $\theta(q, c_i)/2\pi$ out of the integral, as it will practically equal 1 over the entire integral (even if the approximation would have given us a larger number, which would not make much sense here). In that case, the approximated average neighbourhood size of any node on band i (i.e. the expected active-size on band $b_i$), namely $\bar{\bar{E}}[k_i]$, will be equal to the number of all nodes between radii 0 and $c_{i+1}$.
In case $c_{i+1}$ is larger than s, we split the integral into two parts, one part in the range [0, s] and one part in the range [s, $c_{i+1}$]. The first part is calculated similarly to the case where $c_{i+1} \le s$, the only difference being that we are integrating between 0
FIGURE 3.1: Bar graph with the actual compare count, the estimate via equation 3.3, and the estimate via equation 3.4.
and s, not between 0 and $c_{i+1}$. The second part is the integral between s and $c_{i+1}$, where the probability has not been struck out of the integral, resulting in equation 3.5.
Figure 3.1 shows the differences between the estimated workloads of equation 3.3 and of equation 3.4 with 3.5: the problematic lower α- and band-regions are far less susceptible to inaccuracies when equation 3.4 is used, as opposed to the unaltered approximation. Also of note are the negligible differences on higher bands, and how the approximation is close to the actual results on higher bands regardless of the equation chosen.
To visualise the nature of active's size in general, figure 3.2 (and similarly A.1 and A.2 in the appendix) shows the approximated active-size using those same equations, with their respective normal distribution following the logic and definition outlined in chapter 3.2. To quickly summarise those here again: The problem of a query node appearing in active on band $B_x = (c, d)$ can be shown to follow a binomial distribution by the fact of having two possible outcomes during a random node's comparison/edge-establishing phase: a query node is either in active or not, with constant probabilities each. Specifically, the probability of such a query node to appear under those conditions in active is $p = \int_0^{d} \frac{2\Delta\theta_{B_x}(\phi_y, r)}{2\pi}\, f(r)\, dr$, while the probability of this event not occurring is $q = 1 - p$. Because of that, we can approximate this binomial distribution by a normal distribution $N(\mu, \sigma^2)$ with $\mu = E[|active_{b_i}|] = \bar{k}_i = \sigma^2$.
The figures show the actual occurring active-sizes during the comparison phase (i.e., whenever our algorithm works on a node-token and has to traverse active, we record active's size) on the x-axis, with the number of occurrences defining the y-axis. On each figure we can see three normal distributions and three sets of bars: one for a run with the average degree k = 20, one with k = 200, and one with k = 500. The node count is set to $n = 10^6$ on every run, the band count is always 13 with a geometric partitioning with p = 0.9, and the recorded data is always taken from the active-sizes occurring on the 8th band. Figure A.1 has α = 0.51, figure 3.2 has α = 0.75, and figure A.2 has α = 1.1. In other words, we see the active-sizes on band 8 for different α and average degrees, all else being equal.
FIGURE 3.2: Occurrence count of the actual active-sizes on band 8, overlaid with the normal distributions N(42,42), N(299,299) and N(652,652) for the runs with k = 20, 200 and 500.
FIGURE 3.3: Bar graphs with the actual compare count and the estimate via equation 3.4. α = 0.75 (left), α = 1.1 (right).
The bars are overlaid by normal distributions as just defined, with the mean being calculated by equation 3.4 while the variance is set to the same value. As one can see, the lines almost completely overlap the bars in their entirety (with slight discrepancies for lower α), further supporting that our approach of estimating the upper bound of active-sizes by way of normal distributions in chapter 3.2 was indeed correct.
As an overall example of the estimation function, the earlier figure 3.1 and the bar graphs in figure 3.3 together show three runs with $n = 10^6$ nodes, k = 200, and α ∈ {0.51, 0.75, 1.1} respectively. For each band, we see an entry for the actual number of distance calculations/comparisons and for the estimated workload calculated by our equation. As is also visible here, the estimation function works for any α with slightly varying, although overall close enough, accuracy.
FIGURE 3.4: Bar graph with the actual compare count and the estimate via equation 3.4, showcasing the equalised workload partitioning. α = 0.51, n = 10^6, k = 500.
At first, we tried finding a result with the calculation from 3.4 by dividing the plane with radius R into an arbitrarily chosen number of possible band radii.
FIGURE 3.5: Each line visualises the downward trend of its corresponding band's outer radius, calculated by our algorithm for an arbitrary run of n = 10^6, k = 10 and α = 2, dependent on the number of bands. Because the outer radius of band 0 is always set manually to R/2, it is not shown in this figure.
This way, our search space is discrete and has a finite number of combinations. Using an exhaustive search and trying out all combinations for two or three bands to find the one with the smallest workload is still done in a fraction of a second, depending on the accuracy parameter a that is set. Unfortunately, we rather quickly reach a band count for which the search takes multiple seconds, which is too high to be reasonably feasible during the creation of smaller graphs. One aspect of the calculated results that might indicate some kind of dependency, though, was the following: given a result $\{c_0^{(3)}, c_1^{(3)}, \ldots, c_3^{(3)}\}$ for our problem with three bands ($c_i^{(x)}$ being the i-th radius of a solution with x bands), if we were to calculate the result for the same graph with the same properties, only now with four bands, yielding $\{c_0^{(4)}, c_1^{(4)}, \ldots, c_4^{(4)}\}$, it turned out that all radii with the same index from the problem with four bands were always smaller than or equal to the respective radii from the problem with three bands.
In other words, for all $i < j$ it holds that $c_x^{(j)} \le c_x^{(i)}$, and specifically $c_i^{(i)} \le c_j^{(j)} \le R$. This fact held true over multiple benchmarks, and is probably based on the monotonic property of the nodes' radial density function.
Thus, we changed our algorithm to test only combinations that fit the above criteria: After calculating the minimised workload partitioning for four bands with the exhaustive search, we decreased the search area to only include radii smaller than the earlier result for all respective radii of the same index. In order to speed up the algorithm even further and avoid relying on a quadratic runtime, we additionally altered the exhaustive search into a greedy-based one:
Assuming that the answer for a problem with i bands always has every radius lower than the one for i-1 bands, we let the algorithm set all radii at the start of the next iteration to the previous solution and inserted the newest addition $c_i^{(i)}$ at R. Starting
with $c_0^{(i)}$, we decreased the radius incrementally (each increment t being R/a large), as long as the summarised workload was decreasing as well. The moment the workload increased again after s steps, we set $c_0^{(i)} = c_0^{(i)} - (s - 1) \cdot t$ as the new radius, after which we did the same for all the other radii, one after the other. After doing this once for every radius, we start over with $c_0^{(i)}$ and continue the entire process until no further improvement on the workload can be made by decreasing any of the radii. Depending on the accuracy parameters set, this process does not take longer than a second. A pseudocode for this process can be found in the appendix in listing B.6.
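The following is a compact C++ sketch of this greedy descent, in our own rendering rather than the pseudocode of listing B.6; the workload callback stands for the proxy calculation (e.g. the totalWorkload sketch above), and the step width t = R/a follows the description.

    #include <functional>
    #include <vector>

    // Greedy search (sketch) for a minimised-workload partitioning: lower each
    // radius in steps of t = R/a while the total workload decreases, undo the
    // worsening step, and repeat full sweeps until nothing improves anymore.
    std::vector<double> greedyRadii(std::vector<double> radii, double R, int a,
                                    const std::function<double(const std::vector<double>&)>& workload) {
        const double t = R / a;
        bool improved = true;
        while (improved) {
            improved = false;
            for (std::size_t i = 1; i + 1 < radii.size(); ++i) {   // keep c_0 and c_l fixed
                double best = workload(radii);
                while (radii[i] - t > radii[i - 1]) {               // stay above the inner neighbour
                    radii[i] -= t;
                    const double w = workload(radii);
                    if (w < best) { best = w; improved = true; }
                    else { radii[i] += t; break; }                  // undo the worsening step
                }
            }
        }
        return radii;
    }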
Overall, the results in figure 3.5 suggest a roughly logarithmic partitioning in general, with an additionally linear to possibly negatively exponential shift along the radial axis of each band's radius boundaries with every additional band added to our partitioning.
Chapter 4
Experimental Evaluation
In this chapter, we first detail our computer setup and important aspects of the implementation of the three main algorithms. After listing the graph parameters we use during our practical tests, we benchmark the multitude of options available to the EM-variant in order to find parameter settings that optimise our runtime; among others, we regard multiple radial partitionings, as well as the band count and the angular segment count per band. After that, we compare the three main algorithms under different benchmark settings. For more details on those, see the introduction in chapter 4.4.
Every algorithm is written in C++ and built as a release version using the same compiler, GCC version 6.2.1. The parallelism in NkGen and the EM-variant is in both cases based on OpenMP, while the EM-variant additionally uses the STXXL library for data management and sorting (see chapter 1.3).
For the later benchmarks, in which we compare the three main algorithms, we additionally disabled any output in all three algorithms so as to focus on the runtime alone. For that reason, we changed the source code of NetworKit's generator and of GIRG in two ways: For one, we removed any inserts into arrays/vectors or the like that happen whenever an edge has been found or is to be established. For two, we also deleted any lines declaring the data structures in charge of saving those edges. In NkGen's case, we disabled the allocation of the adjacency list and any access to it. Additionally, we let the program calculate a fingerprint based on the node IDs during the establishment of edges, in case the compiler were to optimise away the edge-creation due to no variables being manipulated or saved. In GIRG's case, we also let the program end right after all edges in the hyperbolic graph have been found, as the program that was made available by [BFKL16] also printed the edges to a file, sorted them and calculated additional information afterwards, which would have increased the runtime.
In other words, we only measured the time for the creation of the graph in and of itself, ignoring any additional post processing done to it.
4.2 Graph Parameters used in the Benchmarks
Power-law parameter α: 0.51, 0.75 and 1.1. The reason for this decision is that, for instance according to Friedrich et al. [FK15, p.617], the power-law exponent γ of hyperbolic graphs is usually between 2 and 3 in practical use cases. γ is also defined as γ = 2α + 1, and considering that α has to be above 0.5, we went for those aforementioned values, thus encompassing the respective γ values of 2.02, 2.5 and 3.2.
Average degree k: 10, 50, 500 and 1000. This is in order to cover the range from graphs with small average degrees to graphs with large ones.
Node count n: Exponential increase from $10^4$ up to at most $10^9$, the sequence being specifically $(10^{12/3}, 10^{13/3}, \ldots, 10^{23/3}, 10^{27/3})$, to cover a wide range of small to massive graphs.
Since the search space is enormously large, we decided to analyse each of these parameters step by step and use the respective best options from there on. As such, the following section mostly involves taking multiple choices for a setting, benchmarking them under relevant conditions, and comparing the results in order to decide when to use which option.
Taking a first look at the generation phase (figure 4.1), we can see that the 2-Sorter consistently outperforms both alternatives. Since there is no difference between the 2-Sorter and the 3-Sorter in their implementation with regards to the parallelisation aspect, the difference must come from the latter variant's additional access of a third sorter, into which we put one object more than in the 2-Sorter version. The
FIGURE 4.1: Generation phase runtime per node count for the 3-Sorter, 2-Sorter and 0-Sorter on a logarithmic scale.
0-Sorter on the other hand not only uses a different parallelisation scheme, but also
uses logarithmic operations during the calculation of the angular random variable
in order to deliver a sorted sequence.
For the sorting phase in figure 4.2, we set the memory available to the sorters to the minimum possible (around 44MB per sorter) per object type, which we will increase in later benchmarks to around 2GB in total. The higher the available memory, the fewer merge passes the sorters necessitate. For lower object counts the minimum memory setting is enough to sort with one merge pass; larger graphs on the other hand need more memory to sort efficiently. Regardless, the sorting phase did not take more than half the overall runtime at worst during the benchmarks, as one can see by comparing both graphs in figure 4.2. During the sorting phase, one can also see the sharp jump at $10^7$ nodes, which is where the sorter requires more than one merge pass, as the set memory is not enough for one pass alone. For comparison's sake, we kept it at the minimum memory required for the sorters to work.
The sorting phase shows that for smaller graphs the sorting takes more time for the 2-Sorter, while with increasing numbers of nodes the 2-Sorter starts to amortise the larger object size and runs faster than the 3-Sorter, if only marginally. This presumably comes from the size and number of to-be-sorted objects: The larger object size of the 2-Sorter has an impact, if only for smaller graphs. For larger ones, it is the number of sorters and objects that is more relevant to the runtime, of which the 3-Sorter has more of both.
Interestingly, we can see in figure 4.3 that the edge-creation phases of the 2-Sorter and 3-Sorter versions compete well with one another, meaning that the difference in the overall performance between them stems from the generation phase. Regarding the 0-Sorter, the runtime during the edge-creation phase was, as expected, twice as long as the 2-Sorter's, as we are traversing the band once forward and once backward. Considering that the sorting takes relatively few seconds compared to the edge-creation phase, the lack of sorters did not give us any advantage after all.
FIGURE 4.2: Sorting (left) and overall (right) runtime of the 2- and 3-Sorter for graphs of average degree k = 10 and α = 0.75 on a logarithmic scale.
FIGURE 4.3: Edge-creation phase runtime in seconds per node count for the 3-Sorter, 2-Sorter and 0-Sorter on a logarithmic scale.
In conclusion, the 2-Sorter seems to be the most efficient version out of the three we
devised, and as such we are going to use the 2-Sorter in all forthcoming benchmarks.
From the start, we already have the possibility to divide our ground plane into multiple bands, meaning that (apart from the first band always covering the [0, R/2]-range because of the higher query surface area in that region) the first idea would be to have 16 or 17 bands, as we want to distribute one band to each thread. 17 bands might be worth considering because, depending on the α value, the innermost band could have few enough nodes to be almost no work compared to the other bands, meaning one thread could finish the first band fast enough to start with a second one without too big of a lag. Those ideas are beside the more important point, though, as the bigger problem here arises from the fact that the more bands we have, the longer the generation and sorting phases take.
The generation phase goes through all nodes once and checks, once per node, every higher band for possible Start- and StopBound creation, meaning that the runtime, as previously mentioned, has an upper bound of O(n·l), n being the node count and l being the band count. The data in figure 4.4 shows this linearity. A quick note on the forthcoming figures with band counts on the x-axis: The lines are not supposed to imply a continuous data plot, since there is no such thing as half a band. In order to visualise the general trend better, though, we decided to insert lines between the data points.
Interesting is the fact that the slope is weaker the smaller the α-value is. A possible explanation could be the number of nodes on the outer bands we find with smaller α, given our partitioning and the way the generation step works: Generally speaking, every node checks through every band that is further away from the center than itself, meaning for 8 bands a node on band 0 has to check through all bands, while a node on band 7 only checks the last one. Our partitioning algorithm, though, finds it more beneficial to put more nodes into the last band (i.e. create wider bands on the outside), because for smaller α the radius distribution inherently puts more nodes closer to the center. The more nodes we have in the center, the more nodes we have that would create wider queries. Following that, to compensate for the increased number of wider queries closer to the center, we have to create inner bands with fewer and outer bands with more nodes inside of them, if our goal is to create an equalised workload. This results in lower α values having far more nodes that have to check only a few bands, while larger α values force the node distribution onto the bands in a way that requires checking more bands per node, because in those cases we have far more nodes with similar, smaller query ranges.
Nevertheless, for simplicity's sake, we can still assume our previously established upper bound, considering that the worst case scenario would be one where every node had to check every band.
The sorting step on the other hand would be at best $O(n_i \log(n_i) + s_i \log(s_i))$ per band, $n_i$ being the node count and $s_i$ the StartStopBound count on band i. Having more bands certainly would decrease $n_i$ on some bands, regardless of positioning, but would also increase the overall StartStopBound count. We could not find any kind of closed-form expression for our partitioning, which means that estimating the node count or StartStopBound count per band based on that partitioning was not possible either. Regardless, let us assume a node count $\hat{n} = \max_{i \in [0, l-1]}(n_i)$ and a bound count $\hat{s} = \max_{i \in [0, l-1]}(s_i) = \Theta(\hat{n})$, representing the maximal node and StartStopBound count on a band for a given partitioning; the sorting step would thus be at most $O(l \cdot (\hat{n} \log(\hat{n}) + \hat{s} \log(\hat{s}))) = O(l \cdot \hat{n} \log(\hat{n}))$.
In other words, both the generation step and the sorting step are theoretically linearly dependent on the band count l, meaning that the more bands we have, the higher the cost in those two phases, even if we get a better runtime during our edge-creation phase. This is where the angular parallelisation comes into play: For instance, we could choose a band count which is suboptimal in regards to the generation and sorting steps, while choosing an angular parallelisation count that still allows all threads to have a smaller, equalised workload comparable to one without angular parallelisation entirely.
Taking a look at the time spent on all steps combined, one thing is immediately apparent: If the number of edges is lower, the algorithm takes more time with every additional band; if the number is higher, it takes at first less time with every additional band, until the performance increase plateaus (figure 4.5). The benchmarks with higher edge counts show that our initial consideration of 17 instead of 16 bands is more or less irrelevant: The differences are slight and in the realm of probabilistic variances, as repeated benchmarks have shown. Increasing
the bands further than that either did not change anything or only increased the runtime, most likely because we have fewer threads available than bands in those cases, not to mention that the more bands we have, the longer the generation and sorting steps will take as well.
Analysing the benchmarks further, it seemed that in all runs the sorting phase took only a small fraction of the time spent on the entire algorithm and thus does not seem to be a relevant factor for an optimised setting. In the graphs in figure 4.6, one can see all three steps, once for an average degree of 50 and once for one of 1000. As one can see, even though the change in average degree changes which phase of our algorithm dominates, the sorting phase stays less relevant in comparison, as it is not the dominating factor in either case.
The generation step, though, does matter, more so in lower edge count graphs than in higher ones. Regardless of the α value, the generation phase alone is linearly dependent on the band count. The edge-creation step on the other hand follows a negative exponential curve with each additional band, up until we have no further threads to offset the additional bands' workload onto. The result is the previously mentioned trade-off: For lower edge counts, where the edge-creation step itself does not dominate the runtime, further bands increase the time spent on the generation step so much that the time spent there dominates the runtime. Because the overall runtime thus becomes very similar in shape to the generation phase alone, it might be more beneficial to keep a lower band count while increasing the angular partitioning of the bands themselves, to be able to use all threads in their entirety.
In cases where, on the other hand, our edge-creation phase dominates the runtime, it would seem at first glance more appropriate to use as many bands as possible for our threads, were it not for the fact we see in figure 4.7. Here we can see the maximal workload, or maximal comparison count, per band count. Just like the edge-creation step, it follows a negative exponential slope with increasing band counts.
This means that, while we could lower the maximal comparison count with each and every new band, it becomes less and less beneficial. Taking additionally into consideration the finite number of threads we have at our disposal, it could be better to use a smaller band count with angular parallelisation after all, which is
FIGURE 4.7: Maximum compare count out of all bands, per band count, for α = 0.51, 0.75 and 1.1. Node count is n = 10^7, average degree is k = 1000.
FIGURE 4.8: Edge-creation (left) and overall (right) runtime per angular parallelisation count (3, 4 and 5), per band count. n = 10^7, k = 1000, α = 0.75.
why we investigated this further, with an excerpt of the data gathered from the benchmarks shown in the graphs in figure 4.8.
The above figures show the runtime for the angular parallelisation count v being three, four, and five, once for the edge-creation phase and once for the overall runtime. As one can see, the edge-creation phase alone does not paint the whole picture: going by the first figure, we would have chosen a larger band count for our later benchmarks, seeing as how the graph generally follows a somewhat downward slope with each additional band, interrupted by spikes that stem from a less than optimal band-piece-to-thread distribution. The overall runtime, though, shows that the improvement with higher band counts is mitigated by the aforementioned increasing runtime of the generation phase. Also worth mentioning is the fact that regardless of the band count, the first band has the exact same number of comparisons, as we set the first band's outer radius to always be R/2. Because of this, and because of figure 4.7, it seems reasonable to assume that the spikes are somewhat related to the overall number of threads and angular parallelisation pieces:
If one were to think about when our parallelisation would be optimal, the first thought would be that any band piece count with equal work per piece that is divisible by our thread count should give us a good distribution and thus a good runtime.
FIGURE 4.9: Overall runtime per band count over repeated benchmarks, shown as median, median + MAD and median − MAD.
As one can see in figure 4.7, though, the maximal comparison count rises exponentially with fewer bands, meaning that one should not expect, for example, four bands to deliver a better result than 5 bands, regardless of how well the resulting piece count of 16 (in the case of four pieces per band) is divisible by our thread count. On the other hand, for larger band counts this argument can be made, as the next minimum in the left graph of figure 4.8 for a parallelisation count of four after five bands appears around eight bands (which is 32 pieces at v = 4), the next one at 12 bands (48 pieces), and the next one at 16 bands.
For the other angular segment counts per band, v = 3 and v = 5 for example, where there are barely any band piece counts divisible by 16, it is more difficult to see this pattern. Considering that we have a multitude of factors affecting the runtime (like the comparison count decreasing with every band, for example), and because we are working in a parallel setting, all of these indicators should not be taken as absolutes: While the general positions of those more visible spikes were usually in the same vicinity over multiple benchmarks (see figure 4.9), the variance was high enough that for larger band counts (and thus more pieces) the reasoning could be considered less applicable. Nonetheless, the randomising factor of parallelising experiments in practice occurred during all runs, which is why we still chose to benchmark the four aforementioned minima for four pieces per band.
Figure 4.10 shows that overall, 5 bands and 4 pieces per band were the optimal setting for smaller average degrees in terms of the overall runtime. For larger degrees, the optimum was additionally dependent on the node count, where anything above $10^7$ nodes took less time with 8 bands. Because of that, we will be using those settings (shown again in table 4.1) during the later benchmarks.
FIGURE 4.10: Separate runs for the various minima with v = 4 and l ∈ {5, 8, 12, 16}. k = 50 (left), k = 500 (right).
TABLE 4.1: Rule Set for angular parallelisation count v and band
count l for the equalised workload partitioning.
The generation and sorting steps do not differ from a runtime perspective at all: the graphs show similar properties, with a linear runtime of O(n·l) for the generation step and a sorting step with O(l·n·log(n)). The algorithm's entire runtime is also similar to the equalised workload version: Higher edge counts result in the edge-creation step dominating the runtime, decreasing it with every band, while lower edge counts let the generation step dictate our speed, slowing us down with higher band counts fairly early on. Just as with the equalised workload, this effect appears quicker and more drastically for higher α values than for lower ones (as seen in figure 4.12), the reason being the same as the one explained above (a different node distribution based on band width as a result of our partitioning algorithm for different α).
One difference to the previous radial partitioning that we decided to take into
account for our parallelisation scheme was the varying comparison count per
band. Considering that the equalised workload method had the same amount of work
on every band, there was no requirement to consider the distribution of band pieces
to threads, as every thread would get a piece with the same workload as any
other piece (except for the first band). Because this property is not the case for this
radial partitioning, we decided to create a simple scheduling scheme that would
account for the uneven workload when assigning band pieces to threads.
FIGURE 4.11: Separate runtimes of all three phases, per band count.
n = 10^7, k = 50 (left), k = 1000 (right).
FIGURE 4.12: Overall runtime per alpha, per band count. n = 10^7,
k = 50.
In most cases, as one can see in the graphs in figure 4.13, the outer, last few bands
would have the largest workload, decreasing with every band closer to the center
(except for the first band because, regardless of partitioning, it is always R/2 wide).
Our scheduling for 16 threads thus works as follows (a sketch is given below):

- second, if more than 16 pieces are left, we assign each thread a workload
  in ascending order, giving the thread with the currently largest workload the
  smallest piece
- third, if fewer pieces are left, say x < 16, we assign each thread a workload in
  ascending order, starting with thread t - (x - 1) up to thread t
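
The following is a minimal sketch of such a round-based assignment, not the thesis implementation; the function name, the plain workload array, and the handling of the final partial round are assumptions made for illustration.

    #include <algorithm>
    #include <cstddef>
    #include <numeric>
    #include <vector>

    // Assign band pieces (given by a workload proxy, e.g. estimated comparison
    // counts) to threads in rounds: within a round the currently most loaded
    // thread receives the smallest remaining piece, the next most loaded the
    // next smallest, and so on. A partial final round only serves the least
    // loaded threads.
    std::vector<std::vector<std::size_t>> scheduleBandPieces(
        const std::vector<double>& pieceWork, std::size_t threadCount = 16) {
        std::vector<std::size_t> pieces(pieceWork.size());
        std::iota(pieces.begin(), pieces.end(), std::size_t{0});
        std::sort(pieces.begin(), pieces.end(), [&](std::size_t a, std::size_t b) {
            return pieceWork[a] < pieceWork[b];             // ascending workload
        });

        std::vector<double> load(threadCount, 0.0);
        std::vector<std::vector<std::size_t>> plan(threadCount);

        for (std::size_t next = 0; next < pieces.size();) {
            std::vector<std::size_t> threads(threadCount);
            std::iota(threads.begin(), threads.end(), std::size_t{0});
            std::sort(threads.begin(), threads.end(), [&](std::size_t a, std::size_t b) {
                return load[a] > load[b];                   // descending load
            });
            const std::size_t m = std::min(threadCount, pieces.size() - next);
            for (std::size_t j = 0; j < m; ++j, ++next) {
                const std::size_t t = threads[threadCount - m + j];
                plan[t].push_back(pieces[next]);
                load[t] += pieceWork[pieces[next]];
            }
        }
        return plan;
    }

The returned plan lists, per thread, the indices of the band pieces it should process; the accumulated loads stay close together because within every round the smallest remaining pieces go to the currently most loaded threads.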
FIGURE 4.13: Workload per band on a run with 7 (left) and 16 (right)
bands. n = 10^7, k = 1000.
[Figure 4.14: edge-creation phase runtime in seconds per band count (5 to 12), with and without the scheduler.]
Figure 4.14 compares the edge-creation runtime with the scheduling scheme against the respective parameter
set without a scheduling scheme (in this example 6 bands). While those runtime
improvements would most assuredly be within the error margins during the generation
of smaller graphs, and while the speedup was not always as large as that (sometimes
less, sometimes more, depending on a variety of factors), it nonetheless gave
us a consistent runtime improvement when comparing the respective optimal band
counts under the same angular parallelisation count. Because of that, all further
benchmarks are going to be done with the aforementioned scheduling scheme.
In a visual representation of the workload given to each thread (see figure 4.15) we
show every thread for both runs (one scheduled manually, one directed by the
compiler) with their respective comparison counts. As one can see, in the specific
example of 10 bands, our manually scheduled run even managed to distribute the
workload almost equally, while the unscheduled run seemed to increase the work
put on each thread with every higher thread ID. This follows from the fact that the
lower thread IDs get assigned subsequent pieces from the lower bands, while the
higher IDs are forced to work on the outer bands. Concerning the shortfall at thread
ID 8: the compiler decided at this point to assign only two subsequent pieces
to every following thread, instead of three as was the case for the previous threads.

FIGURE 4.15: Workload per thread for the unscheduled (compiler directed)
and manually scheduled run. The parameters are the ones
used in figure 4.14, except that the band count is set to l = 10 only.

FIGURE 4.16: Overall runtime per alpha, per band count. Node count
is n = 10^7, k = 1000.
Considering that we are working with those 16 threads in parallel, the unscheduled
run is thus at a disadvantage: While the first couple of threads are done fairly
shortly into the edge-creation phase, all threads will have to wait for threads number
14 and 15 in this example, which have more work to do than their respective
counterparts in the scheduled run.
FIGURE 4.17: Overall runtime per percentage, per band count. Node
count is 10^7, k = 10 (left), k = 1000 (right), α = 0.75.
other band or angular parallelisation counts than the ones we are going to choose
during our later comparisons.
TABLE 4.2: Rule Set for angular parallelisation count v and band
count l for the minimised workload partitioning.
Generally speaking, the dependencies in regards to the band count are similar to
before, with the difference being that the percentage variable also has an impact on
how fast the runtime is, and thus also changes at which point the generation and
sorting steps start to overshadow the edge-creation step. The left graph in figure 4.17
shows, for example for smaller k, that p = 0.7 is more quickly affected
by the generation step than the other chosen p. This is not only because the edge-creation
step changes, though: figure 4.18 shows the generation phase only, and as
one can see, different p also change the generation phase's runtime. The reason for
this is the change in the positioning of the band radii: If one recalls equation 1.7, the
innermost band's outer radius c_1 is calculated by the formula

$$ c_1 = \frac{R\,(1 - p)}{p \cdot l} $$

If we keep the band count l the same and only vary the p parameter, we can see
that a decrease in p increases the first band's size, meaning more points on the inner
bands, resulting in more outer bands per node on which we will have to calculate the
StartStopBounds.

FIGURE 4.18: Runtime of the generation step per p value, per band
count. n = 10^7, k = 10, α = 0.75.
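
As a quick numerical illustration of this effect, using the formula above with l = 9 bands (both values purely illustrative):

$$ c_1\big|_{p=0.9} = \frac{0.1\,R}{0.9 \cdot 9} \approx 0.012\,R, \qquad c_1\big|_{p=0.7} = \frac{0.3\,R}{0.7 \cdot 9} \approx 0.048\,R $$

so lowering p from 0.9 to 0.7 makes the innermost band roughly four times as wide.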
What can also be seen in the graphs in figure 4.17 is that the larger the band
count, the better it is to use a higher percentage number.
Depending on α, though, the exact point where one p-value is better than the other
changes as well. We benchmarked this for 10^7 nodes, once for k = 10 and once for
k = 1000, for the three α-settings 0.51, 0.75, and 1.1, and came up with the rule set in
table 4.3. There we can see which p-value we chose depending on the band count
number l, the α-setting, and whether we were creating graphs with smaller (in our
later benchmarks, 10 and 50) or larger (500 and 1000) average degrees:
TABLE 4.3: Rule set for p depending on the parameters k, α, and l.
After that, we benchmarked these settings for angular parallelisation. As one can
see from figure 4.19 for different kinds of percentage values, the distribution of work
per band is again heavily off-loaded to the bands further away from the center, as
was the case with the minimal workload partitioning. Because of that, we chose to
use the same scheduling rule set as outlined in the minimal workload section for
all our benchmarks, where we tested the overall performance in regards to the angular
parallelisation count.
Just like in the previous section, we first benchmarked all v ∈ {2, 3, ..., 8} and chose
for each v the best band count l as a setting. After that, we compared all best settings
for small and large average degrees (the figures A.3 and A.4 for those best settings
can be seen in the appendix), and chose, for both small and large graphs, eleven
bands and a parallelisation count of two.
[Figure 4.19: workload per band number (0 to 9) for p = 0.7, 0.8, and 0.9.]
[Figure 4.20: overall runtime in seconds over node count (10^4 to 10^7) for the GEO, MIN, and EQ partitionings at k = 10.]
Changing the α-value matters slightly for the algorithms' relative ranking: For
instance, the intersection between the equalised partitioning's and the geometric
partitioning's runtime during the run with k = 10 (see figure 4.20) happens around
n = 10^6 nodes for α = 0.51 (and α = 1.1 as well; see figure A.13 in
the appendix), but at n = 2 · 10^6 for α = 0.75. For larger average degrees, for instance
k = 1000 (see figure 4.21), we see similar phenomena, only with the minimised
partitioning instead of the equalised one. The minimised workload seems to work
asymptotically best for large graphs. During the generation of smaller ones, the
roughly 0.3 seconds necessary to calculate the band partitioning are clearly visible,
since at this size they make up more than half of the runtime. The calculation for the
equalised workload takes only a third of that time, which is why it is quickly
amortised, but it is still longer than the calculation for the geometric partitioning
(less than 10^-4 seconds).
[Figure 4.21: overall runtime in seconds over node count (10^4 to 10^7) for the GEO, MIN, and EQ partitionings at k = 1000.]
Overall, the geometric partitioning works best for small graphs, up until a point where
either the equalised or the minimised workload improves on it in the long run. Based
on the figures here and the ones in the appendix, we chose the following settings:
For any node count n ≤ 10^6, regardless of k, we chose the geometric partitioning.
For any other node count, if k ≤ 50, we chose the equalised partitioning; if k > 50,
we chose the minimised one.
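
Written out as a small helper, this selection rule looks as follows; the function name, the enum, and the exact comparison operators (≤ vs. <) are assumptions, while the thresholds are the ones stated above.

    #include <cstdint>

    enum class Partitioning { Geometric, Equalised, Minimised };

    // Radial partitioning chosen for the later benchmarks, per the rule above.
    Partitioning choosePartitioning(std::uint64_t n, double avgDegree) {
        if (n <= 1'000'000ull) return Partitioning::Geometric;   // small graphs
        return (avgDegree <= 50.0) ? Partitioning::Equalised     // small degrees
                                   : Partitioning::Minimised;    // large degrees
    }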
[Figures: overall runtime in seconds over node count (10^4 to 10^8) for the EM-variant with NetworKit settings, the EM-variant, NetworKit, and GIRG.]
For all three algorithms, the runtime rises approximately linearly with the node count and
continues to rise further on larger node counts, as seen for instance in figure 4.24.
Exceptions to this happen during the generation of small graphs (in terms of node
count), where the runtime of the EM-variant for example is dominated by the setup
and sorting phases, until their overhead is amortised by the edge-creation. Once
this happens, the runtime stabilises to a linear scaling for the tested graphs.
For small average degrees and graphs, the overhead of the EM-variant's modifications
is too large to compete with NetworKit's NkGen in an internal memory
setting: Neither the EM-variant set to settings similar to those NetworKit uses (in the
figures the run with a geometric radial partitioning, p = 0.9, and l = log(n)
bands), nor the optimised settings fare better for an average degree of ten, up
until we reach graph sizes in the range of 4 · 10^6 nodes. This threshold decreases
with an increase in the average degree. For instance, setting k to 50, the threshold is
around 2 · 10^6; for a k set to 500 it appears even earlier at around 2 · 10^5.
Interesting is that, for example in figure 4.23, the EM-variant modifications alone
are not good enough to compete with NkGen, as the run with the same settings
as implemented in NetworKit never catches up; the version with the optimised
settings on the other hand can cut the runtime by around one fourth compared to
NetworKit's NkGen (29 vs. 39 seconds at 10^8 nodes).
[Figure: overall runtime in seconds over node count (10^4 to 10^8) for the EM-variant with NetworKit settings, the EM-variant, NetworKit, and GIRG.]
In other words, the advantage of the EM-variant in this particular run might be owed less to the differences
in their algorithms and more to the radial partitioning choices.
Another detail is that the α setting does not change much in terms of the ranking of
the three algorithms overall, but it does change whether the optimised settings we
have chosen were better than the ones chosen for NkGen. Figures 4.24 and 4.23 show
two runs with only the α settings differentiating between them: For α set to 0.51,
there are times where the original settings are still slightly favoured for large
node counts, while for 0.75 the optimised settings' advantages are more apparent.
In summary, NkGen seems to be favoured for graphs with small average degrees
and, regardless of average degree, for graphs with small node counts. The EM-variant
is still able to compete and improves on NkGen's runtime even in an internal
memory setting, under the condition that the graphs to be generated have a large
number of nodes and edges. GIRG on the other hand comparatively takes a lot of time
even in the internal memory setting, on almost every scale, regardless of graph
properties. Another fact to mention is that both NkGen and GIRG required more
memory than was available (64 GB) for graphs with more than 2 · 10^8 nodes, resulting
in both programs crashing, since we kept memory-swapping deactivated during the
internal-memory benchmarks. A more detailed comparison of the memory usage can
be seen in chapter 4.4.4.
Overall, we can establish with these runs that, practically, the EM-variant shares
similar dependencies with NkGen in regards to α and k.
GIRG
Figure 4.27 shows the same runs done with the GIRG algorithm (only with a
smaller node count of 10^6 because of its long runtime), drawing a widely different
picture. Considering that GIRG is proven to have a linear runtime in the edge count, it
is not surprising to see that this is overall the case with a rise in the average degree
as well. The interesting property here is that the rise in the runtime is not
gradual, but happens at exponentially regular occurrences:

[Figure 4.27: GIRG runtime over the average degree (left, with a line connecting the jumps) and over alpha (right).]

Increasing the average degree seems not to affect the runtime, or if it does, only marginally,
until some kind of threshold is overstepped. In that case, the runtime jumps, at first
slowly, but later on dramatically, and after the jump it continues to not change at
all until the next jump occurs. The jumps appear to come at an exponential
rate, with the runtime also increasing exponentially at every spike. To be more precise,
the runtime doubles every time the average degree reaches double the
degree that introduced the last jump in runtime.
TABLE 4.4: A recording of the runtime jumps seen in figure 4.27.

Jump No.              1    2    3    4    5    6
k                    11   22   44   88  170  340
Runtime in seconds    7   14   28   55  109  215
Table 4.4 illustrates this with numbers, detailing each degree where a jump
occurs and its corresponding runtime, which is kept steady during the next
phase until another jump occurs. For example, an average degree k of 44 introduces
the second jump in runtime to around 28 seconds. The previous jump occurred at
a k of 22 with a runtime of 14 seconds, both of these numbers being half of the
corresponding ones we just took a look at. Double the average degree, double the
seconds, and with some leeway of random chance we get to the next jump at a k of
88 with a runtime of 55 seconds. As one can see in that same figure 4.27, this does
not change the overall asymptotic runtime. The jumps occur at a logarithmic rate,
meaning that while each jump increases the runtime exponentially, it stays overall
linear at worst, as the overlaid red line connecting the jumps alone shows.
The explanation for that is the way the cell sizes, and thus the number of cells, in
GIRG are decided upon: The smallest weight layer starts at the minimal weight
possible, which in our case is $w_0 = \exp(R/2) = O(\sqrt{n/k})$, as the weight is defined
as $w_v := \exp((R - r_v)/2)$ (see eq. 2.2) with $R = O(\log(n/k))$ (see lemma 1). The cell
sizes are calculated by multiplying the current layer's weight with the
minimal weight, i.e. $\nu(i) := w_i \cdot w_0 / W$ (see eq. 2.3), and setting the resulting number
to the next largest negative power of 2, i.e. $\lceil \nu(i) \rceil_2 = \min\{2^{-l} \mid l \in \mathbb{N}_0 : 2^{-l} \geq \nu(i)\}$
(see eq. 2.4).
[Figure 4.28: radius R (between 19 and 27) based on the average-degree calculation, plotted over alpha (0.4 to 1.6).]
This rounded-up number is the volume of a cell, which has the
implication that a given layer i has $1/\lceil \nu(i) \rceil_2$ cells.
Let us now assume we followed the entire calculation for a specific k with
an unchanging n, reaching a number $\lceil \nu(0) \rceil_2$ for the lowest layer. This
means we have $1/\lceil \nu(0) \rceil_2$ cells on this particular layer. Let us assume we
have chosen another average degree, $k_2 = 2k$. In this case, our minimal
weight will now be $w_{0,2} = \sqrt{n/(2k)}$. The lowest layer's volume will now equal
$w_{0,2} \cdot w_{0,2} / W = \sqrt{n/(2k)} \cdot \sqrt{n/(2k)} / W = \frac{1}{2}(n/k)/W$, which is half the volume of the
previous example, i.e. $\lceil \nu(0)/2 \rceil_2$, which in turn gives us double the cells,
specifically $1/(\nu(0)/2) = 2/\nu(0)$. Because we round up during the calculations, though,
this only happens once our new average degree is double that of the last example,
which is exactly what is happening in figure 4.27.
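
To make the rounding effect tangible, the following small stand-alone sketch (not taken from any of the benchmarked implementations) evaluates the lowest-layer cell count while the average degree doubles; the total weight is simply assumed to be W ≈ n, so the absolute jump positions are only illustrative, while the doubling pattern is the one described above.

    #include <algorithm>
    #include <cmath>
    #include <cstdio>

    // Cells on the lowest weight layer: nu(0) = w0 * w0 / W with w0 = sqrt(n/k),
    // rounded up to the next negative power of two. W ~ n is an assumption here.
    double lowestLayerCells(double n, double k) {
        const double W = n;
        const double vol = (n / k) / W;                      // = 1/k under W ~ n
        const double l = std::max(0.0, std::floor(-std::log2(vol)));
        return 1.0 / std::pow(2.0, -l);                      // 1 / ceil_2(nu(0))
    }

    int main() {
        for (double k = 10.0; k <= 400.0; k *= 2.0)
            std::printf("k = %5.1f -> cells on layer 0: %.0f\n", k, lowestLayerCells(1e6, k));
        return 0;
    }

Doubling k doubles the cell count on the lowest layer, mirroring the runtime jumps recorded in table 4.4; the exact degrees at which the jumps occur shift with the real total weight W.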
The same figure also shows us that GIRG depends negatively, exponentially on the
α value, like the other algorithms as well. What is different here, though, is similar
to what we have seen in GIRG's dependency on k: We first have an exponential fall
until we come to a point where we have a jump. The jumps' maxima seem to follow
a logarithmic curve, the same way their occurrences follow a logarithmic rate akin
to the one we saw in the k-dependency graph. The reason for that is, as seen in
figure 4.28, that R falls logarithmically with an increase in α. This means that, while
α alone has a negatively exponential effect on the runtime for any given number
of cells, because that number increases again whenever R is lowered enough to be
under the next threshold (as explained for the k variable), the runtime will jump
up again. Because of this logarithmic fall of R with an increase in α, it logically
follows that the jumps happen at a logarithmic rate.
FIGURE 4.29: Runtime per thread count of the three algorithms. The
EM-variant is shown once with angular parallelisation (A) and once
without (NA). Average degree k = 500, while α = 0.75.
In this section we investigate how each algorithm scales with the number of threads
available, and how well it fares in a comparison to the other two algorithms.
In figure 4.29 we see a direct comparison between the three versions: As already
mentioned in the introduction, this implementation of GIRG does not utilise
parallelisation at all, resulting in a completely flat line. Interesting, though, is that
even in a sequential run (where the thread count is set to one), both the EM-variant
and NkGen perform at least an order of magnitude better than GIRG at the
chosen graph settings in figure 4.29.
For the EM-variant, we benchmarked one run with angular parallelisation enabled,
and one without (all other settings being equal). In the same figure one can also see
that the run with angular parallelisation is faster at every thread count higher than
one. What is hardly noticeable here, but more so in figure 4.30, is that the run without
angular parallelisation shows no noticeable speedup at all after eight threads, while
the run with such a parallelisation scheme does so up to around ten or eleven threads,
from where it plateaus as well. The lack of further large increases in speedup,
combined with a higher variance than in the run without angular parallelisation, is
most likely due to an unbalanced workload distribution from band segments to threads.
The variance in the somewhat linear line before the 8-thread threshold for the run
without an angular parallelisation scheme can also be explained by an unbalanced
workload distribution: Considering that each thread gets hold of another band,
smaller thread counts will distribute multiple bands to one thread. This results
in the same run (with otherwise the same parameters) giving one specific thread
sometimes more, sometimes less work to do, depending on the overall thread count.
NkGen on the other hand has a speedup that follows a strictly linear line all throughout
up to eight threads, from which point the line stays linear, only with a smaller slope.
All these phenomena occurring around the eight-thread threshold are a result of
the switch to hyperthreading: In NkGen's case, the switch is advantageous in every
regard, while in the EM-variant's case, only the angular parallelisation seems to
get some improvement out of it; the run without it only stagnates in its speedup
after the threshold, as it cannot take advantage of hyperthreading. The reason for
this might be that our band count in this instance is eleven, as we are using the
geometric partitioning.

FIGURE 4.30: Speedup per thread count of the three algorithms. Average
degree k = 500, while α = 0.75.
With eleven bands (the innermost ones additionally
not having much workload put onto them), it seems reasonable to assume that,
considering each band gets one thread, there just is not a large enough subdivision
of the overall workload onto multiple threads for hyperthreading to even be
considered. While the innermost bands are done quickly in succession by one
CPU core, all the others are handed out to separate cores, meaning that in the end,
waiting for every core to be done with its own band is all the program does at
that point. The angular parallelisation scheme, though, hands over multiple parts
to multiple cores, making use of any hyperthreading that is allowed by the system.
Overall, from the figures it is quite apparent that NetworKit's generator handles
parallelisation better and more cleanly than the EM-variant. The differences between
NkGen and the EM-variant come presumably from the way parallelisation is used
in both algorithms: In NetworKit's generator, not only are the sorting and
generation phases done in parallel, the edge-creation phase is as well,
with one significant difference to our version. Whereas the EM-variant only
parallelises entire areas of the plane (parallelising entire bands or band segments),
NkGen parallelises on a node-to-node basis (see the pseudocode at listing B.1 in
the appendix): Each node is given to a different thread, which performs the binary
search and establishes edges. This in turn results in a workload distribution that
is asymptotically almost completely equal across the board under any number of
available threads:
If we were, for example, to tier each node into a class of nodes with similar work
production (a node on the innermost band will require more work than
a node on the outermost band), we would do so on a band basis. Assuming we
have n_i nodes on band i, all of which are from the same workload-production class,
dividing the number of nodes n_i by the number of threads t will be almost equal,
with at worst one thread getting n_i mod t nodes more to work with. This number,
though, is almost insignificant: considering that n_i is proportional to n, and with
the thread count in our case being at most 16, the number n_i mod t is completely
negligible for any node count n_i higher than 1000. In other words, for not even very
high n, any equal division of nodes to threads will result in an also equal division of
workload to threads.

FIGURE 4.31: Maximum resident set size per run on a linearly scaled
axis. k = 500, while α = 0.75.
The EM-variant does not have that luxury: The algorithm is designed in a way
to circumvent unstructured accesses as much as possible, which is why we are
traversing the multiple sorters concurrently, and which is why we cannot assign
each node to a different thread arbitrarily. The correctness of our algorithm (see
lemma 2) is dependent on this subsequent traversal. Were this not the case, and were
we to assign different nodes to different threads, we would have to either establish
communication processes between the threads to make sure that the active-array
holds the correct query nodes at any point in time, or duplicate the sorters so that
multiple threads working on the same band would have independent access to
the same required data. The former option would hamper parallelisation, while the
latter would increase the memory usage by a manifold, more specifically by the
number of threads working in parallel on the same band.
This means that for our EM-variant, dividing the plane into multiple area segments
allows for situations where the workload is not equally distributed onto the available
threads, either because of a number of segments that is indivisible by the
number of threads, or because of a radial partitioning that might not distribute an
equalised workload onto the bands (one of the reasons why we investigated a
method to do exactly that in chapter 3.6.3).
FIGURE 4.32: The sorting phase of the EM-variant, once with the minimal
amount of memory possible to assign to the sorters (40 sorters,
44 MB per sorter), once with overall 2 GB assigned to the sorters.
Figure 4.31 shows the maximum resident set size in kilobytes of each run, i.e. the
maximum number of kilobytes held in RAM at some point during the algorithm's
run. If we were to lower the RAM capacity to a number smaller than that, the
process would either crash or swap data between the internal and external memory
locations (depending on what the system currently allows to happen).
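
The thesis does not state how this value was obtained; on Linux, one common way to read the peak resident set size at the end of a run is the getrusage interface, shown here purely as an illustration:

    #include <cstdio>
    #include <sys/resource.h>

    int main() {
        // ... run the generator here ...
        struct rusage usage {};
        getrusage(RUSAGE_SELF, &usage);
        // On Linux, ru_maxrss holds the peak resident set size in kilobytes.
        std::printf("maximum resident set size: %ld kB\n", usage.ru_maxrss);
        return 0;
    }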
The first thing to notice is that GIRG and NetworKit's generator both need very
small amounts of memory during the creation of smaller graphs, while the EM-variant
needs more than one GB. This is because we chose in our implementation, regardless
of graph size, to use the sorters made available by the STXXL-library, which are
used once per band segment. From a runtime perspective, the overhead of their
creation did not affect the runtime negatively enough to discourage the use of
parallelisation; regarding the memory usage, though, this is a problem, especially
for smaller graphs. Each STXXL-sorter necessitates a certain amount of memory as
a minimum, which is why we are showcasing two runs of our algorithm in figure
4.31: one with the memory settings we had during our runtime comparisons in
the internal memory environment (which is around 2 GB for all sorters combined),
and one with the minimal amount of memory allowed per sorter (this changes
depending on the computer setup, the set block size, etc.; in our case it was around 44
MB per sorter, with overall 40 sorters). Figure 4.32 shows the difference in runtime
for the sorting step alone (as this is the only step affected by this change), where the
differences are almost negligible for smaller, but clearly apparent for larger
graphs, as less available memory requires more merge passes of the sorting
algorithm (see [DS03] for more details on the STXXL-library's sorter algorithm).
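
For illustration, the minimal and the 2 GB budgets above are simply the memory argument handed to each sorter; the element type and comparator below are generic stand-ins, not the ones used by the generator.

    #include <stxxl/sorter>
    #include <cstdint>
    #include <limits>

    struct Edge { std::uint64_t u, v; };

    // STXXL requires a comparator that also provides sentinel values.
    struct EdgeLess {
        bool operator()(const Edge& a, const Edge& b) const {
            return a.u < b.u || (a.u == b.u && a.v < b.v);
        }
        Edge min_value() const { return {0, 0}; }
        Edge max_value() const {
            const std::uint64_t m = std::numeric_limits<std::uint64_t>::max();
            return {m, m};
        }
    };

    int main() {
        using sorter_type = stxxl::sorter<Edge, EdgeLess>;
        // Second constructor argument: internal memory budget in bytes for this
        // sorter, e.g. roughly the 44 MB minimal setting mentioned above.
        sorter_type edges(EdgeLess(), 44ull << 20);

        edges.push({1, 2});               // fill phase
        edges.push({0, 7});
        edges.sort();                     // switch to the sorted output phase
        for (; !edges.empty(); ++edges) {
            const Edge& e = *edges;       // consume edges in sorted order
            (void)e;
        }
        return 0;
    }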
FIGURE 4.33: Maximum resident set size divided by node count, to
visualise the average amount of memory necessary per node.
To detail this asymptotic property better, figure 4.33 shows the memory-per-node
factor, which can be seen as representative of the memory-per-edge factor due to
the constant average degree during the runs. On this double-logarithmic scale, it
is only the EM-variant that holds a linear downward slope and has not yet reached a
point where the memory required scales proportionally with the number of nodes.
Both other algorithms, on the other hand, reach a minimum at 10^5
nodes during the run where k is set to 50, for example, meaning that, while the
actual memory-per-node limit of the EM-variant is not visible yet, it definitely has
a smaller factor, which puts the other two at a disadvantage in a direct comparison.
In the end, though, which algorithm fares better with the RAM available
depends on how much there is to begin with: If the RAM capacity is small, the EM-variant
will reach the limit fairly quickly. If the capacity is large enough (at least
more than around five to six GB, if one were to assume around one or two GB to be
delegated to the operating system), though, the EM-variant will take much longer
than the other two algorithms to reach it.
[Figure 4.34: overall runtime in seconds over node count (10^3 to 10^10) for NetworKit, the EM-variant, and GIRG.]
Restricting the available RAM in this way is therefore both practical and not unreasonable,
as it puts all three algorithms into external memory territory at some point during their runs.
In figure 4.34 we see the run for an average degree of ten. As the EM-variant
requires more RAM for the angular parallelisation scheme (since every band
segment gets its own sorter, requiring a larger amount of minimum memory),
we see that this scheme reaches the external memory setting earlier than NkGen
or GIRG (for a clearer visualisation of the shift from internal to external memory
for all three algorithms, see figures A.29 through A.31 in the appendix, where
each algorithm is shown with a run with and without the RAM restrictions). At
around 4 · 10^7 nodes, we can see that NetworKit has reached the memory wall
as well, suddenly requiring as much time as the EM-variant. At every node
count after that, though, we can see the relatively I/O-inefficient algorithm show
its handicap: At 10^8 nodes, NkGen requires more than four times as much time as
the EM-variant (230 vs. 74 seconds), at around 2 · 10^8 it is 20 times as much (3,000
vs. 150 seconds). The next node count after that took longer than 10,000 seconds, at
which point we stopped the benchmark.
GIRG's case is a little more difficult to see, as the runtime is already fairly high, but
around 10^8 nodes, doubling the number of nodes increases the runtime threefold, meaning
that somewhere around these graph sizes, GIRG also starts coming to a crawl.
We stopped benchmarking it afterwards as well, as the next higher node count also took
longer than 10,000 seconds on our setup.
In regards to the EM-variant itself, the runtime seems to be less linear than originally
expected, though comparatively it is undoubtedly at a huge advantage: around
the time the other two algorithms stopped working after 10,000 seconds,
the EM-variant, at 2,500 seconds, still stays under an hour of runtime for a graph
with a billion nodes, i.e. approximately five billion edges.
Overall, while having only 4 GB of RAM is not too large an issue for smaller graphs
regardless of the algorithm chosen, once we get to graphs that exceed the amount of
memory available to us, the EM-variant is expectedly better equipped to handle them.
4.5 Summary
All three algorithms seem to have an approximately linear runtime in the number
of edges during runs in an internal memory environment. The dominating factor in
all of them seems to be not the sorting of the respective data objects but either the
generation or the establishing of edges between nodes. In terms of scalability, the
EM-variant is able to keep up with NkGen and even perform better for the creation
of larger graphs, while GIRG underperforms comparatively at every level.
Regarding the multiple graph parameters, all three are also linear in both n and
k alone, although differences are visible in regards to the α variable. A high α
means a higher concentration of nodes towards the edge of the hyperbolic plane,
creating fewer wide-ranging queries for NkGen, and thus also for the EM-variant. On
the other hand, α also influences the radius R. For GIRG, this creates a runtime that,
while rapidly falling as well, also has spikes that make the overall asymptotic runtime
logarithmic in α. In contrast, both NkGen and the EM-variant have an overall
negative slope in regards to the power-law variable.
Once all three algorithms reach the external memory environment, though, the
EM-variant is undoubtedly at an advantage, performing with a better scaling than
the alternatives. Even graphs with five billion edges can be generated in less
than an hour in an external memory setting, while both NkGen and GIRG need
more than 10,000 seconds for graphs one to two orders of magnitude smaller than
that.
Chapter 5
Conclusion
The goal of this thesis was to analyse and demonstrate the advantages and disad-
vantages of an EM-approach towards the generation of hyperbolic random graphs
and compare those insights with current, state-of-the-art algorithms. Apparent
from the results are the large runtime differences for very large graphs between the
EM-approach and its alternatives. This can be seen not only in the internal memory
setting, but even more so in an external memory environment.
While the basic concepts for all three algorithms are the same (the definition of
a hyperbolic ground plane, the subdivision of that plane into multiple band-like
constructs based on radii boundaries, and the further reduction of the number
of possible queries per node through placed boundary structures), they all are
different enough in their approaches to affect the practical usage of those algorithms
to a large degree.
In regards to the goals set in the beginning of this thesis, we can summarise
that, in terms of large graphs with large average degrees, the EM-approach
outweighs its additional overhead disadvantages and even improves on
the alternatives' runtime. In the EM environment, we were even able to create
graphs with 10^9 nodes and 5 · 10^10 edges in under an hour, while both alternatives
breached the 20 GB swap space limit and were not able to finish the embedding.
GIRG is the slowest alternative of them all by a wide margin, even with the
random-hyperbolic-graph-oriented implementation. On the other hand, GIRG is
also designed to fare well in a more general setting. Considering that hyperbolic
graphs are just one of the possible graph types available for creation, GIRG does
allow for a greater variety of graphs compared to the other, faster algorithms
showcased in this thesis.
Another aspect that was not part of this thesis is that GIRG allows its graphs to
be generated with an additional degree of randomness compared to the threshold-model
graphs we were investigating: Apart from nodes having edges between
them in case they are in close vicinity, there is also an option of using a temperature
variable for random edge-creation between nodes that are further away. This is a graph
property that both NkGen and the EM-variant do not account for, though the
NetworKit library has a different algorithm for such use cases [LM16].
The same goes for the movement of nodes inside an already generated graph, which
NkGen does allow for [LLM16, p. 4-6]: Our generator currently has no I/O-efficient
answer to this problem, which might be a subject for further analysis in another thesis.
Bibliography
[LLM16] Moritz von Looz, Mustafa Özdayi, Sören Laue, and Henning Meyerhenke. "Generating massive complex networks with hyperbolic geometry faster in practice". In: CoRR abs/1606.09481 (2016). URL: http://arxiv.org/abs/1606.09481.

[KPKVB10] Dmitri V. Krioukov, Fragkiskos Papadopoulos, Maksim Kitsak, Amin Vahdat, and Marián Boguñá. "Hyperbolic Geometry of Complex Networks". In: CoRR abs/1006.5169 (2010). URL: http://arxiv.org/abs/1006.5169.

[PBK11] Fragkiskos Papadopoulos, Marián Boguñá, and Dmitri V. Krioukov. "Popularity versus Similarity in Growing Networks". In: CoRR abs/1106.0286 (2011). URL: http://arxiv.org/abs/1106.0286.

[GPP12] Luca Gugelmann, Konstantinos Panagiotou, and Ueli Peter. "Random Hyperbolic Graphs: Degree Sequence and Clustering". In: CoRR abs/1205.1470 (2012). URL: http://arxiv.org/abs/1205.1470.

[MSS03] Ulrich Meyer, Peter Sanders, and Jop F. Sibeyn, eds. Algorithms for Memory Hierarchies, Advanced Lectures [Dagstuhl Research Seminar, March 10-14, 2002]. Vol. 2625. Lecture Notes in Computer Science. Springer, 2003. ISBN: 3-540-00883-7. DOI: 10.1007/3-540-36574-5. URL: https://doi.org/10.1007/3-540-36574-5.

[SSM16] Christian L. Staudt, Aleksejs Sazonovs, and Henning Meyerhenke. "NetworKit: A tool suite for large-scale complex network analysis". In: Network Science 4.4 (2016), pp. 508-530. DOI: 10.1017/nws.2016.20.

[BKL15] Karl Bringmann, Ralph Keusch, and Johannes Lengler. "Geometric Inhomogeneous Random Graphs". In: CoRR abs/1511.00576 (2015). URL: http://arxiv.org/abs/1511.00576.

[BFKL16] Thomas Bläsius, Tobias Friedrich, Anton Krohmer, and Sören Laue. "Efficient Embedding of Scale-Free Graphs in the Hyperbolic Plane". In: 24th Annual European Symposium on Algorithms (ESA 2016). Ed. by Piotr Sankowski and Christos Zaroliagis. Vol. 57. Leibniz International Proceedings in Informatics (LIPIcs). Dagstuhl, Germany: Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2016, 16:1-16:18. ISBN: 978-3-95977-015-6. DOI: 10.4230/LIPIcs.ESA.2016.16. URL: http://drops.dagstuhl.de/opus/volltexte/2016/6367.

[Lam17] Sebastian Lamm. "Communication Efficient Algorithms for Generating Massive Networks". MA thesis. Karlsruher Institut für Technologie, 2017. DOI: 10.5445/ir/1000068617.

[AV88] Alok Aggarwal and Jeffrey Scott Vitter. "The Input/Output Complexity of Sorting and Related Problems". In: Commun. ACM 31.9 (1988), pp. 1116-1127. DOI: 10.1145/48529.48535. URL: http://doi.acm.org/10.1145/48529.48535.
[AKL93] Lars Arge, Mikael B. Knudsen, and Kirsten Larsen. "A General Lower Bound on the I/O-Complexity of Comparison-based Algorithms". In: Algorithms and Data Structures, Third Workshop, WADS '93, Montréal, Canada, August 11-13, 1993, Proceedings. Ed. by Frank K. H. A. Dehne, Jörg-Rüdiger Sack, Nicola Santoro, and Sue Whitesides. Vol. 709. Lecture Notes in Computer Science. Springer, 1993, pp. 83-94. ISBN: 3-540-57155-8. DOI: 10.1007/3-540-57155-8_238. URL: https://doi.org/10.1007/3-540-57155-8_238.

[ER59] Paul Erdős and Alfréd Rényi. "On random graphs". In: Publicationes Mathematicae Debrecen 6 (1959), pp. 290-297.

[AB01] Réka Albert and Albert-László Barabási. "Statistical mechanics of complex networks". In: CoRR cond-mat/0106096 (2001). URL: http://arxiv.org/abs/cond-mat/0106096.

[BFM16] Michel Bode, Nikolaos Fountoulakis, and Tobias Müller. "The probability of connectivity in a hyperbolic model of complex networks". In: Random Struct. Algorithms 49.1 (2016), pp. 65-94. DOI: 10.1002/rsa.20626. URL: https://doi.org/10.1002/rsa.20626.

[LSMP15] Moritz von Looz, Christian L. Staudt, Henning Meyerhenke, and Roman Prutkin. "Fast generation of dynamic complex networks with underlying hyperbolic geometry". In: CoRR abs/1501.03545 (2015). URL: http://arxiv.org/abs/1501.03545.

[Bou97] Paul Bourke. Intersection of two circles. (Online. Webpage last accessed: July 3rd, 2017.) Apr. 1997. URL: http://paulbourke.net/geometry/circlesphere/.

[DKS08] Roman Dementiev, Lutz Kettner, and Peter Sanders. "STXXL: standard template library for XXL data sets". In: Softw., Pract. Exper. 38.6 (2008), pp. 589-637. DOI: 10.1002/spe.844. URL: https://doi.org/10.1002/spe.844.

[DS03] Roman Dementiev and Peter Sanders. "Asynchronous parallel disk sorting". In: SPAA 2003: Proceedings of the Fifteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, June 7-9, 2003, San Diego, California, USA (part of FCRC 2003). ACM, 2003, pp. 138-148. ISBN: 1-58113-661-7. DOI: 10.1145/777412.777435. URL: http://doi.acm.org/10.1145/777412.777435.

[BS80] Jon Louis Bentley and James B. Saxe. "Generating Sorted Lists of Random Numbers". In: ACM Trans. Math. Softw. 6.3 (1980), pp. 359-364. DOI: 10.1145/355900.355907. URL: http://doi.acm.org/10.1145/355900.355907.

[JJ92] Joseph JáJá. An Introduction to Parallel Algorithms. Addison-Wesley, 1992. ISBN: 0-201-54856-9.

[Pen17] Manuel Penschuck. "Generating practical random hyperbolic graphs in near-linear time and with sub-linear memory". In: (2017). To appear in SEA 2017 Proceedings.
Appendix A
Figures
[Figure: occurrence count of the active-array size; actual data for k = 20, 200, 500 against normal distributions N(55,55), N(353,353), N(743,743).]

[Figure: occurrence count of the active-array size; actual data for k = 20, 200, 500 against normal distributions N(14,14), N(134,134), N(328,328).]
FIGURE A.3: Comparison of best geometric workload settings for
small graphs. α = 0.75, k = 10. "Set: (a, b)" describes a run with an
angular parallelisation count of a and b bands.
FIGURE A.4: Comparison of best geometric workload settings for
large graphs. α = 0.75, k = 1000.
[Figures: overall runtime in seconds over node count (10^4 to 10^7) for the GEO, MIN, and EQ partitionings, for further parameter combinations.]
[Figures: overall runtime in seconds over node count (10^4 to 10^8) for the EM-variant with NetworKit settings, the EM-variant, NetworKit, and GIRG, for further parameter combinations.]

[Figures: overall runtime in seconds over node count for GIRG and for the EM-variant, each once in the internal memory setting (IM) and once in the restricted, external memory setting (EM).]
Appendix B
Pseudocode
    current_C[i] = curRad;
end

return current_C;