
MASTER'S THESIS

I/O-Efficient Generation of Hyperbolic Random Graphs

Author: Kamil René König
Supervisor: Prof. Dr. Ulrich Meyer

July 3rd, 2017



Statutory Declaration (Eidesstattliche Erklärung)
Declaration in accordance with the 2008 Master's examination regulations, § 24 (12)

I hereby confirm that I have written this thesis independently and that I have used
no sources or aids other than those indicated in this thesis.

Abstract
Random hyperbolic graphs as models for real-world, complex networks have been
the subject of many authors. Such graphs embed nodes in a hyperbolic plane, where
a node pair establishes an edge whenever the distance between the nodes is smaller
than a chosen value. Since the analysis of these graphs' efficient generation is usually
done under the assumption of an underlying unit-cost RAM model, scalability
regarding memory usage and I/O-efficiency is a problem for many in-memory
algorithms.

We create the first parallelizable random hyperbolic graph generator that uses
an external memory (EM) approach. The generator empirically runs in sorting time;
its modified algorithm is based on a state-of-the-art algorithm by Looz et al.
[LÖLM16]. We prove that the candidate selection per node for edge-creation
requires memory sublinear in the number of nodes n for sufficiently small average
degrees. Based on that, we show that the generator has a sort(n) I/O-complexity.
Since the algorithm is based on a radial subdivision of the hyperbolic plane, we also
devise a workload-per-band proxy calculation to aid in finding a radial partitioning
with a desired workload distribution. In practical comparisons between the original
algorithm and our EM-variant, our generator is able to compete in an internal-memory
focused benchmark setting. Furthermore, when used in an EM setting, we
demonstrate runtime improvements over the original of an order of magnitude at
sufficiently large graph sizes. In such an EM setting, our generator is able to embed
graphs with 10^9 nodes and establish 5 · 10^10 edges in under an hour.

Contents

Abstract iii

1 Introduction 1
1.1 EM Model and RAM Model . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Mathematical Background . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Graph Models for the Representation of Complex Networks . . 5
1.2.2 Hyperbolic Space and its Relation to Complex Networks . . . . 7
1.2.3 The Generative Model and Native Representation . . . . . . . . 8
1.2.4 Poincaré Model . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 STXXL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2 Algorithms 15
2.1 State of the Art: NkGen . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 EM-Variant of NkGen . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 Main Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.2 2-Sorter-Version . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.3 0-Sorter-Version . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3 Parallelisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.1 Radial Parallelisation . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.2 Angular Parallelisation . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.3 Parallelisation of the Generation Phase . . . . . . . . . . . . . . 23
2.4 GIRG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4.1 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3 Analysis 29
3.1 NkGen: I/O-Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 EM-Variant of NkGen . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.1 I/O-Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 Runtime Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4 GIRG: I/O-Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.6 Radial Partitionings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.6.2 Estimating the Expected Workload of a Partitioning . . . . . . . 41
3.6.3 Equalised Workload . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.6.4 Minimised Workload . . . . . . . . . . . . . . . . . . . . . . . . . 45

4 Experimental Evaluation 48
4.1 Setup of the Computer System and Implementation . . . . . . . . . . . 48
4.2 Graph Parameters used in the Benchmarks . . . . . . . . . . . . . . . . 49
4.3 Finding Optimised Parameters for the EM-Variant . . . . . . . . . . . . 49
4.3.1 Comparison between Sorter-Count-Versions . . . . . . . . . . . 49
4.3.2 Benchmark Analysis of the Equalised Workload Partitioning . . 51

4.3.3 Benchmark Analysis of the Minimal Workload Partitioning . . 57


4.3.4 Benchmark Analysis of the Geometric Partitioning . . . . . . . 61
4.3.5 Comparison between the Three Radial Partitionings Best Set-
tings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.4 Runtime Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.4.1 Runtime Comparison in Regards to Internal Memory . . . . . . 64
4.4.2 Runtime Analysis in regards to Varying Graph Parameters . . . 66
NetworKit's NkGen and the EM-variant . . . . . . . . . . . . . 67
GIRG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.4.3 Parallelism Efficiency . . . . . . . . . . . . . . . . . . . . . . . . 69
4.4.4 Memory Usage Comparison . . . . . . . . . . . . . . . . . . . . 72
4.4.5 Runtime Performance in an External Memory Setting . . . . . . 74
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

5 Conclusion 77

Bibliography 79

A Figures 82
A.1 Normal Distributions of Active-Size . . . . . . . . . . . . . . . . . . . . . 82
A.2 Comparison Run of Best Band and Angular Parallelisation Count Set-
tings for the Geometric Workload Partitioning . . . . . . . . . . . . . . 83
A.3 Comparison Between the Radial Partitionings . . . . . . . . . . . . . . 84
A.4 Internal Memory Runtime Comparison of the Algorithms . . . . . . . . 88
A.5 External Memory Runtime Comparison of the Algorithms . . . . . . . 92

B Pseudocode 94

List of Figures

1.1 Tessellation of the hyperbolic plane. . . . . . . . . . . . . . . . . . . . . 5
1.2 Dendrogram in hyperbolic space. . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Points scattered in hyperbolic space with circle of radius R. . . . . . . . 9
1.4 Example of the Poincaré model. . . . . . . . . . . . . . . . . . . . . . . . 12

2.1 Example of angular boundaries on the hyperbolic plane. . . . . . . . . 16


2.2 Example of boundaries with a node on the innermost band. . . . . . . 17
2.3 Colorised, parallelisable bands on the hyperbolic plane. . . . . . . . . . 22
2.4 Colorised parallelisable, angular segments on the hyperbolic plane. . . 23
2.5 Parallel and sequential generation of all Sorter-variants compared. . . 24
2.6 Parallel generation of the 0-Sorter under multiple thread counts. . . . . 25
2.7 Visualisation of GIRG's layers and partitioning process. . . . . . . . . . 27

3.1 Bar graph, showing differences between both estimator functions. . . . 43


3.2 Graph showcasing the normal distribution of active-size. α = 0.75 . . . 44
3.3 Bar graphs, showing the effectiveness of our estimator function. . . . . 44
3.4 Bar graph showing an equalised distribution of the workload to the
bands. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.5 Line graph, visualising the band radii's trends for a minimised work-
load partitioning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4.1 Generation phase of the Sorter-variants. . . . . . . . . . . . . . . . . . . 50


4.2 Sorting and overall runtime of Sorter-variants. . . . . . . . . . . . . . . 51
4.3 Edge-creation runtime of Sorter-variants. . . . . . . . . . . . . . . . . . 51
4.4 Generation phase per band count (EQ). k̄ = 1000. . . . . . . . . . . . . 52
4.5 Runtime overall in regards to band count (EQ). . . . . . . . . . . . . . . 53
4.6 Separate runtimes in regards to band count (EQ). . . . . . . . . . . . . . 54
4.7 Maximum compare count out of all bands, per band count (EQ). . . . . 55
4.8 Runtime per angular parallelisation count (EQ) . . . . . . . . . . . . . . 55
4.9 Median Absolute Deviation of ten angularly parallel runs (EQ). . . . . 56
4.10 Separate runs for various minima with four segments per band (EQ). . 57
4.11 Separate runs per band count (MIN). . . . . . . . . . . . . . . . . . . . . 58
4.12 Overall runtime per α (MIN). . . . . . . . . . . . . . . . . . . . . . . . . 58
4.13 Workload per band under different α (MIN). . . . . . . . . . . . . . . . 59
4.14 Comparison between scheduled and unscheduled run (MIN). . . . . . 59
4.15 Bar graph with workload per scheduled/unscheduled run (MIN). . . . 60
4.16 Overall runtime per alpha, per band count. . . . . . . . . . . . . . . . . 60
4.17 Overall runtime per p value, per band count (GEO). . . . . . . . . . . . 61
4.18 Generation phase per p value, per band (GEO). . . . . . . . . . . . . . . 62
4.19 Comparison count per band (GEO). . . . . . . . . . . . . . . . . . . . . 63
4.20 Comparison of the three radial partitionings. α = 0.51 on the left,
α = 0.75 on the right, k̄ is set to 10. . . . . . . . . . . . . . . . . . . . . . 63

4.21 Comparison of the three radial partitionings. α = 0.51 on the left,
α = 0.75 on the right, k̄ = 1000. . . . . . . . . . . . . . . . . . . . . . . . 64
4.22 Comparison of all major algorithms. α = 0.75, k̄ = 50. . . . . . . . . . . 65
4.23 Comparison of all major algorithms. α = 0.75, k̄ = 10. . . . . . . . . . . 65
4.24 Comparison of NetworKit's NkGen and the EM-variant. α = 0.51,
k̄ = 10. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.25 Parameter dependencies of NetworKit's NkGen. . . . . . . . . . . . . . 67
4.26 The EM-variant's parameter dependencies. . . . . . . . . . . . . . . . . 67
4.27 GIRG's parameter dependencies. . . . . . . . . . . . . . . . . . . . . . . 68
4.28 Radius R of the hyperbolic space relative to α. . . . . . . . . . . . . . . 69
4.29 Parallelism efficiency of the NkGen and the EM-variant. . . . . . . . . 70
4.30 Speedup of NkGen and the EM-variant. . . . . . . . . . . . . . . . . . . 71
4.31 Maximum resident set size per run on a linearly scaled axis. k̄ = 500,
while α = 0.75. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.32 The sorting phase of the EM-variant, once with the minimal amount
of memory possible to assign to the sorters (40 sorters, 44MB per
sorter), once with overall 2GB assigned to the sorters. . . . . . . . . . . 73
4.33 Maximum resident set size divided by node count to visualise the av-
erage amount of memory necessary per node. . . . . . . . . . . . . . . . 74
4.34 A benchmark with RAM restricted to around 4GB and memory-swapping
enabled. Average degree k̄ = 10, while α = 0.75. . . . . . . . . . . . . . 75

A.1 Actual occurrence count of active-sizes on the 8th band, overlaid
with the normal distribution one would get by using eq. 3.4. α = 0.51. . 82
A.2 Actual occurrence count of active-sizes on the 8th band, overlaid
with the normal distribution one would get by using eq. 3.4. α = 1.1. . 82
A.3 Comparison of best geometric workload settings for small graphs. . . . 83
A.4 Comparison of best geometric workload settings for large graphs. . . . 83
A.5 Comparison of the three radial partitionings. α = 0.51, k̄ = 10. . . . . . 84
A.6 Comparison of the three radial partitionings. α = 0.51, k̄ = 50. . . . . . 84
A.7 Comparison of the three radial partitionings. α = 0.51, k̄ = 500. . . . . 84
A.8 Comparison of the three radial partitionings. α = 0.51, k̄ = 1000. . . . 85
A.9 Comparison of the three radial partitionings. α = 0.75, k̄ = 10. . . . . . 85
A.10 Comparison of the three radial partitionings. α = 0.75, k̄ = 50. . . . . . 85
A.11 Comparison of the three radial partitionings. α = 0.75, k̄ = 500. . . . . 86
A.12 Comparison of the three radial partitionings. α = 0.75, k̄ = 1000. . . . 86
A.13 Comparison of the three radial partitionings. α = 1.1, k̄ = 10. . . . . . 86
A.14 Comparison of the three radial partitionings. α = 1.1, k̄ = 50. . . . . . 87
A.15 Comparison of the three radial partitionings. α = 1.1, k̄ = 500. . . . . . 87
A.16 Comparison of the three radial partitionings. α = 1.1, k̄ = 1000. . . . . 87
A.17 Comparison of all major algorithms. α = 0.51, k̄ = 10. . . . . . . . . . . 88
A.18 Comparison of all major algorithms. α = 0.51, k̄ = 50. . . . . . . . . . . 88
A.19 Comparison of all major algorithms. α = 0.51, k̄ = 500. . . . . . . . . . 88
A.20 Comparison of all major algorithms. α = 0.51, k̄ = 1000. . . . . . . . . 89
A.21 Comparison of all major algorithms. α = 0.75, k̄ = 10. . . . . . . . . . . 89
A.22 Comparison of all major algorithms. α = 0.75, k̄ = 50. . . . . . . . . . . 89
A.23 Comparison of all major algorithms. α = 0.75, k̄ = 500. . . . . . . . . . 90
A.24 Comparison of all major algorithms. α = 0.75, k̄ = 1000. . . . . . . . . 90
A.25 Comparison of all major algorithms. α = 1.1, k̄ = 10. . . . . . . . . . . 90
A.26 Comparison of all major algorithms. α = 1.1, k̄ = 50. . . . . . . . . . . 91
A.27 Comparison of all major algorithms. α = 1.1, k̄ = 500. . . . . . . . . . . 91

A.28 Comparison of all major algorithms. α = 1.1, k̄ = 1000. . . . . . . . . . 91


A.29 A benchmark of NetworKit's generator without (IM) and with (EM) a
RAM restricted to around 4GB and memory-swapping enabled. Av-
erage degree k̄ = 10, while α = 0.75. . . . . . . . . . . . . . . . . . . . . 92
A.30 A benchmark of GIRG without (IM) and with (EM) a RAM restricted
to around 4GB and memory-swapping enabled. Average degree k̄ =
10, while α = 0.75. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
A.31 A benchmark of the EM-variant without (IM) and with (EM) a RAM
restricted to around 4GB and memory-swapping enabled. Average
degree k̄ = 10, while α = 0.75. . . . . . . . . . . . . . . . . . . . . . . . 93

List of Tables

1.1 EM model specific notation. . . . . . . . . . . . . . . . . . . . . . . . . . 4

3.1 Summary of the Runtime and I/O-Complexity . . . . . . . . . . . . . . 39

4.1 Rule set for angular parallelisation count v and band count l for the
equalised workload partitioning. . . . . . . . . . . . . . . . . . . . . . . 57
4.2 Rule set for angular parallelisation count v and band count l for the
minimised workload partitioning. . . . . . . . . . . . . . . . . . . . . . 61
4.3 Rule set for p depending on the parameters k̄, α, and l. . . . . . . . . . 62
4.4 A recording of the runtime jumps seen in figure 4.27. . . . . . . . . . . 68

Chapter 1

Introduction

A random hyperbolic graph (or hyperbolic geometric graph) is, at its most basic, a set of
nodes embedded in hyperbolic space, where a pair of nodes shares an edge whenever
the distance between them is smaller than a chosen threshold [KPKVB10]. The
properties that arise in such graphs exhibit many similarities to those of real-world
relationship networks, such as high clustering and a heavy-tailed degree distribution
following a power law, i.e. the fraction of vertices with degree d being proportional
to d raised to some negative exponent. For example, Papadopoulos et al. propose
that hyperbolic graphs are well suited to simulate social networks, with the
hyperbolic component acting as a trade-off between popularity and similarity in
that case [PBK11].

As such, the generation of random hyperbolic graphs for benchmark purposes
is important and has been studied by various authors [KPKVB10], [LÖLM16],
[GPP12]. The biggest advantage of generated graphs over real-life data sets (taken
from other sources such as corporations or organisations) is the freedom to choose
and vary the parameters of the generated graph without depending on outside
sources. Privacy concerns that could otherwise arise are an additional reason for
such generators.

Most algorithms up until now have focused on the general runtime of the
graph generation and have generally assumed a unit-cost RAM model. In such a
model, the focus is put on the number of steps an algorithm takes, disregarding the
implications and possible constraints that very large data sets bring with them. In
today's Big Data world, though, working with extremely large graphs is not out of
the ordinary anymore; it may thus be advantageous to take memory-intensive use
cases into account during the algorithm engineering phase.

Every computer system, no matter how large its components may be, will
eventually reach a point where the data it has to work with is so large that working
through it internally in one pass becomes impossible. In those cases, data has to
be off-loaded down the memory hierarchy, the set of different categories of memory
units inside a computer system, as explained by Meyer et al. [MSS03]. The memory
hierarchy generally consists of the following categories: three cache levels (of
increasing size and decreasing speed), the main memory, and beyond that any
external memory unit. The authors mention that a programmer does not need to
know the intricacies between the cache levels and the main memory, as the automatic
mechanisms are good enough for the average case. The difference in data access
latency between main memory and hard disks, on the other hand, is large enough to
become a concern: a hard disk access can take up to 10^7 times as long as a single
register access [MSS03, p.3]. Any data too large for the main memory

will be sent to the external memory unit for a temporary or final stay, depending
on the algorithm. Since the I/O-access times are as large as mentioned, this fact
alone makes an analysis of an algorithm's practical runtime on the basis of a unit-cost
model incomplete if the subject matter revolves around very large data sets.

Thus, depending on the size of the problem, I/O-efficient algorithms might be
preferable for practical applications. For that reason, we will examine how an
External-Memory approach to our graph generation problem compares to two
other algorithms: NkGen by Looz et al. [LÖLM16], implemented in the library
NetworKit [SSM16]¹ as a parallel generator, and the GIRG algorithm by K. Bringmann
et al. [BKL15], in a sequential implementation provided by Bläsius et al. [BFKL16]².
Both were made without such considerations in mind; we will see whether such an
approach yields practical advantages. A parallel generator with similar properties
to the GIRG algorithm has been proposed by Lamm [Lam17]. Since Lamm shows
that his generator is able to compete with NetworKit's NkGen, we assume
NetworKit's implementation to be representative of a parallel GIRG generator's
runtime.

1.1 EM Model and RAM Model


Our analysis of the algorithms presented in this thesis will be based on the External
Memory Model (EM model) in addition to the generally used RAM model.

The RAM model can be understood as a model of computation where we consider
a computer system with a main memory of unlimited size, disregard the actual
presence of data in either cache or disk, and where every simple computational
operation (arithmetic and logical operators) is considered to cost one work unit. The
base assumption here is that during the course of an algorithm, all required data
will always fit into the main memory, resulting in an asymptotic analysis of said
algorithm's runtime that depends only on the number of computational steps. This
simplifies machine-independent analyses, but does not account for the practical
situations where the base assumption is incorrect.

The External Memory model (EM model) by Aggarwal and Vitter takes a different
perspective [AV88]. The model describes a computer system with an internal
memory holding up to M objects and an external memory disk of unlimited
size. The system's CPU is only able to work on data that is present in the internal
memory, meaning that each data set to be worked on must first be loaded into the
internal memory from the external memory disk. Since hard disks have a high
latency but a high bandwidth, it is advantageous to load a block of multiple
subsequent data points lying on the hard disk into main memory at once.

As the time spent waiting between two I/O-operations is generally larger than the
time between two computational operations, the model allows for a different approach
to algorithm analysis: the number of I/O-operations is considered in this model to be
more relevant to the runtime than it was in the RAM model, where it was completely
discarded.

¹ NetworKit, with a reference implementation of the algorithm, at: https://networkit.iti.kit.edu/
² Implementation taken from: https://bitbucket.org/HaiZhung/hyperbolic-embedder/overview

To give a working example of the difference between those two models, let us
assume we have the problem of finding a number x in a set of n numbers. Without
any further details, the simplest solution in the RAM model would be to scan the
entire array one by one and check whether the current number c equals x. Reading a
number, comparing it to x, and increasing the index to reach the next number are all
operations that cost one work unit, which results in an asymptotic runtime of O(n)
operations to accomplish the task.

Let us assume under the EM model that our problem size of n objects is so big
that the entire array does not fit into our main memory of M objects. Since we
load multiple subsequent data objects from the external memory unit, the number
of objects that can be transferred simultaneously between the internal and external
memory in the EM model is B objects per I/O-operation. In that case, every
B-th time we load a number from our array, the system fetches from the hard drive
the block of B numbers containing our current number c into main memory. This
is beneficial if we regard the workings of hard drives in combination with the
spatial locality principle, which says that if one needs data that has been put in a
specific place in memory, it is highly likely that one will need the data surrounding
that place as well [MSS03, p.9]. Since hard drives have a large bandwidth, there is
also no practical difference between loading one or B objects in terms of time spent
on the loading.

Coming back to the example of scanning for an object: in the RAM model, we
would say that scanning the array for our number x takes n computational
steps, as we have n numbers. In the EM model, though, we have to think
in numbers of I/O-operations instead: we have n objects, the bandwidth is large
enough for B objects to fit into a block per I/O-operation, and thus n/B blocks have
to be read from the disk, both at worst and at best, for our scan operation. Hence,
one well-known notation in the EM model is the I/O-complexity of scanning, i.e.
scan(n) = Θ(n/B).

Another important I/O-complexity for the algorithms in this thesis regards the
tight bound on sorting, since two of the presented algorithms necessitate sorted data
objects:

Let us span a binary tree to describe comparisons between n numbers. Each
node describes the comparison between two numbers, where each edge to a child of
such a node represents one of the two possible outcomes of that comparison. Each
leaf, on the other hand, holds a permutation of the numbers that corresponds to the
path one has followed from the root to get to this leaf. Following this well-established
comparison tree model, any sorting algorithm in an internal memory setting has a
lower bound of Ω(n log n) comparison operations. This can be derived from the
number of all possible sorting results, i.e. permutations of n numbers (which is n!),
which is the smallest number of leaves such a comparison tree can have. A binary
tree with n! leaves has a height of at least Ω(log(n!)) = Ω(n log(n)), meaning that any
sorting algorithm that depends on comparisons needs to traverse at least Ω(n log(n))
nodes to arrive at the permutation corresponding to the sorted order of n numbers.

Following Aggarwal and Vitter's I/O-comparison model, a similar argumentation,
counting the number of sorting permutations achievable per step out of all possible
permutations, was used to prove a lower bound for sorting in regards to the I/O-
complexity [AV88, p.1119-1123]. The number of permutations is reduced by the
assumption that during an I/O-operation, the CPU is able to sort the B numbers of a
currently inserted block in internal memory while the system writes an output
block onto the disk. With this assumption, the authors were able to prove a lower
bound of Ω((n/B) log_{M/B}(n/B)) I/O-operations. This has been proven again
later by Arge et al. using modified comparison trees called I/O-trees. These
trees consist of comparison nodes, akin to the nodes used in comparison trees,
as well as I/O-nodes, which account for the number of elements that can be swapped
in one I/O-operation between internal and external memory. Arge et al. showed
with these trees, in a similar fashion to comparison trees, that this lower bound holds
for any comparison-based algorithm with an internal memory runtime complexity
of Ω(n log(n)), which is the case for sorting as well [AKL93].

There are multiple sorting algorithms with a matching upper bound, such as
merge sort and distribution sort [AV88, p.1123-1124]. As such, the I/O-complexity
of sorting has the tight bound sort(n) = Θ((n/B) log_{M/B}(n/B)).

A summary of the aforementioned notations under the EM model is listed in
table 1.1.

TABLE 1.1: EM model specific notation.

B        # of elements per block sent in an I/O-operation
M        # of elements that fit into main memory
scan(n)  Θ(n/B)
sort(n)  Θ((n/B) log_{M/B}(n/B))

FIGURE 1.1: "[A] {7,3}-tessellation of the hyperbolic plane by equilateral
triangles, and the dual {3,7}-tessellation by regular heptagons
are shown. All triangles and heptagons are of the same hyperbolic
size but the size of their Euclidean representations exponentially decreases
as a function of the distance from the center, while their number
exponentially increases." [KPKVB10, p.2]

1.2 Mathematical Background


We will summarise the basic mathematical building blocks for this thesis in this
section. First, we will introduce random hyperbolic graphs as a generative model
by Krioukov et al. [KPKVB10], in comparison to the Erdős-Rényi model and the
Barabási-Albert model. After this short introduction, we will give a summary of
hyperbolic space to the degree necessary to understand the thesis subject and
its connections to complex networks. Following that, we will go further into detail
on the two representations of hyperbolic geometry that are used by the presented
algorithms, namely the native representation and the Poincaré disk model.
The former will be introduced in conjunction with the steps involved in the generative
model of hyperbolic graphs. We will present the given formulas in regards to
the native representation of the hyperbolic space, after which we will shortly introduce
the Poincaré model and explain its relevance for the simplification of distance
calculations.

1.2.1 Graph Models for the Representation of Complex Networks


The first introduction of random graph models came from Paul Erdős and Alfréd
Rényi, who devised such a model to describe complex networks seen in the real
world [ER59]. Such Erdős-Rényi graphs are generated by taking n nodes and
establishing random edges between pairs of nodes. The randomisation in
this model is given by a uniform probability over all node pairs, meaning that any
pair of nodes has the same probability of establishing an edge as any other.

After further analysis of complex networks, certain properties have been found
to be lacking in the Erdős-Rényi model, such as a heavy-tailed, power-law degree
distribution (networks with that property are also called scale-free networks, since

power-laws lack a characteristic scale [AB01, p.63]) and a non-vanishing clustering


coefficient that is independent of the size of the network. The Erdos-Rnyi model
has neither its degree distribution is binomial and its clustering coefficient falls
with an increase in n [AB01, p.57-58].

A closer attempt at modelling real, scale-free networks was made by Barabási and
Albert [AB01, p.71-76]: instead of finding a topology that captures real-world
networks' properties, their Barabási-Albert model (BA model) emulates real-world
networks' dynamics. The assumption is that following those dynamics should
deliver a graph with the desired topological properties. The main components of
their model are the dynamics of growth and preferential attachment. As an example
given by the cited authors, the former can be seen in the world wide web, whose
number of web pages increases every second; the latter can be seen in hyperlinks
on web pages more likely linking to well-known web pages, which have been linked
to by others and in return link to a lot of other web pages as well [AB01, p.71]. As
such, their model begins with a starting number of nodes n₀ and increases this
number each step by introducing a new node with a set number of x ≤ n₀ edges to
the network, until the desired node count of n is reached. Edges are established
randomly with a probability that depends on each node's degree. In other words,
nodes with a higher degree receive preferential attachment and are more likely to
establish an edge with the newly arrived node.

An issue is that in this model, the clustering still depends on the number of
nodes [AB01, p.75], while the power-law coefficient γ_BA = 3 is constant, as it is
independent of anything but the model itself [AB01, p.71].

Opposed to these two models is the random hyperbolic graph model by Krioukov
et al. [KPKVB10], which differs in its approach. Instead of generating graphs
following a specified topology, it, similar to the BA model, achieves the desired
topology through the generative model's properties. Unlike the BA model, though,
the generative model involves the embedding of a graph into hyperbolic space,
meaning that the topological properties depend on the geometric positioning of the
nodes themselves. The generative model's randomisation follows a probability
function which depends on a set threshold, more specifically a set distance between
two nodes in hyperbolic space. If the distance between two nodes is small enough,
the probability of an edge being established between the two is one; otherwise, it
is zero. An additional temperature T can be set to a number higher than zero
to allow for further randomisation: the higher the temperature, the more
likely a random edge will be established between two random nodes independent
of distance. Krioukov et al. also note that as T approaches infinity, random hyperbolic
graphs start resembling Erdős-Rényi graphs, since at infinite temperature, each
node pair has an equal probability of establishing an edge [KPKVB10, p.12].

The result of Krioukov et al.'s generative model is the generation of graphs that
exhibit all the desired properties: on one hand, a heavy-tailed degree distribution
which follows a power law whose coefficient is independent of the model, of the
number of nodes, and of the number of edges; on the other, a non-vanishing
clustering coefficient independent of the network's size.

This thesis' subject concerns random hyperbolic graphs as a generative model, though
the focus will be on the threshold model alone; the temperature will be assumed to be
zero throughout this thesis in all three algorithms.

1.2.2 Hyperbolic Space and its Relation to Complex Networks


Hyperbolic space is a space with negative curvature, as opposed to Euclidean
space, which has no curvature. One of the results of this property is that in
hyperbolic geometry, space itself expands exponentially. This can be seen in the
Euclidean visualisation of a hyperbolic plane in figure 1.1, where each blue triangle
(or each green heptagon), no matter how small and how curved in the projection,
encompasses just as much area as any other in the figure.

To give an example of that property: a circle in Euclidean space with radius r
has the well-known area of πr². This means that with every step that r expands,
the area of the circle grows linearly (as the derivative of that formula is 2πr). In
hyperbolic space with a standard curvature of −1, on the other hand, a circle with
radius r encompasses an area of 2π(cosh(r) − 1) ≈ π e^r. In other words, each step
increases the circle's area exponentially rather than linearly.

This aspect of hyperbolic space is relevant to the generation of complex networks,
as Krioukov et al. note that there are "intrinsic connections between hyperbolic
geometry and the topology of complex networks" [KPKVB10, p.3]. An example that
the authors give are b-ary trees, i.e. trees with a branching factor of b, where
the number of nodes grows exponentially with every further step, i.e. level, akin to
the exponential growth of the area of a circle in hyperbolic space. The example of a
circle is not arbitrary, as the authors connect other properties of complex networks
to circle areas, such as the taxonomy of the elements in such a network:

During the generation of a complex network, one can notice that with the growing
number of elements, patterns start to emerge, with nodes of similar properties
flocking to one another and creating groups with further subgroups within. This
multitude of (sub-)groups can be connected in a dendrogram, a representation
of hierarchies as tree structures (an example of such a dendrogram can be seen in
figure 1.2). Krioukov et al. posit that, in the same way that b-ary trees and circles
grow exponentially, so do dendrograms in hyperbolic space, as the circles of
influence of the (sub-)groups representing inner nodes in the dendrogram
grow in number in the same way. In figure 1.2, those circles,
where nodes inside the same circle share similarities, are represented on
the flat, Euclidean plane. The dendrogram is shown to rise above this plane into
the three-dimensional hyperbolic space, connecting the circles' respective nodes
whenever a circle is enveloped by another. There are occurrences
where circles only overlap without one containing the other or vice versa;
those relationships are visualised by the dashed lines, creating cycles, and thus
making the dendrogram not an actual tree. Regardless, the hyperbolic nature of this
representation is nonetheless apparent from the exponential increase of nodes per
level.

In light of such connections between the two concepts, Krioukov et al. found that
the properties of hyperbolic space are fitting for the modelling of a multitude
of complex networks seen in practical usage whose degree distribution follows
a power law, social network simulations being an example of those. As such,


FIGURE 1.2: A visual representation of a dendrogram in hyperbolic
space (H³), constructed above the hidden hierarchies of a complex
network's nodes in Euclidean space (R²). The circles represent groups
and subgroups of nodes sharing similar properties, while each circle
has a representative node in the dendrogram, which is connected with
a solid edge to the nodes of groups it either contains or that
contain it. The dark circle is a group that overlaps others, representing
possible outlier groups to the overall model, whose edges are dashed
if the other group is not fully overlapped. [KPKVB10, p.4]

Krioukov et al. introduced random hyperbolic graphs as a generative model.

1.2.3 The Generative Model and Native Representation


The basic idea of Krioukov et al.'s generative model is that to generate such a graph,
one first generates a number of randomly distributed points on a disk of radius R in
hyperbolic space, using polar coordinates. R in particular has an important
role, as its size dictates how many edges are going to be constructed once
the nodes have been set in place. One is able to calculate the average degree k̄ of a
hyperbolic graph given its parameters with equation 1.1 [KPKVB10, p.6, eq. (22)]:

\bar{k} = \frac{2}{\pi}\,\xi^2\, n\, e^{-R/2} + \frac{2}{\pi}\,\xi^2\, n\, e^{-\alpha R}\,\alpha\,\frac{R}{2}\left(\frac{\pi}{4}\Big(\frac{1}{\alpha}\Big)^2 - (\pi-2)\frac{1}{\alpha} + (\pi-3)\right)    (1.1)
The number of nodes is represented by n, while ξ is defined as ξ = α/(α − 1/2).
The variable α is an additional parameter that controls the power-law exponent
γ = 2α + 1 of the graph's degree distribution and is by definition α > 1/2. Lastly,
Bode et al. prove that the curvature parameter ζ of the hyperbolic space "is not
necessary, the parameters [k̄, the average degree,] and α suffice to yield the same
degrees of freedom for the model" [BFM16]. As such, we do as Looz et al. and set
ζ = 1 without any loss of degrees of freedom [LÖLM16, p.3].

As one can see, the equation is fairly complex and cannot be reduced to a closed-form
expression for R. In order to create a random hyperbolic graph with a certain
average degree, we have to use numerical approaches. Looz et al., for example,
decided to find R by using a binary search with a starting value for R,


FIGURE 1.3: Randomly generated points of a yet-to-be-established random hyperbolic
graph on the hyperbolic plane, projected onto a 2D disk for visual purposes.
The entire plane's radius is R, as is the radius of point u's red query circle, in which all
its future neighbours exist. The form is unlike a circle as known from Euclidean
geometry only because of the projection combined with the exponential expansion of
space in hyperbolic space. The smaller circles represent the inner and outer radii of
slabs/bands, whose purpose will be briefly mentioned in this chapter and further
detailed in chapter 2.

until equation 1.1's result matches the desired k̄ [LÖLM16, p.3]. We chose to use the
same approach.
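A minimal sketch of that numerical search follows (illustrative code, not NetworKit's
implementation; the helper expectedAvgDegree is our stand-in name and only
evaluates the dominant first-order term k̄ ≈ (2/π) ξ² n e^(−R/2), which appears in
the proof of lemma 1 below, instead of the full equation 1.1):

#include <cmath>
#include <iostream>

// Bisect on R until the predicted average degree matches the target kbar.
// The prediction decreases monotonically in R, so bisection applies.
double findRadius(double n, double alpha, double targetK) {
    const double pi = 3.141592653589793;
    auto expectedAvgDegree = [&](double R) {
        const double xi = alpha / (alpha - 0.5);
        return (2.0 / pi) * xi * xi * n * std::exp(-R / 2.0); // first-order term
    };
    double lo = 0.0, hi = 200.0;              // R certainly lies in this range
    while (hi - lo > 1e-10) {
        const double mid = 0.5 * (lo + hi);
        if (expectedAvgDegree(mid) > targetK) lo = mid; else hi = mid;
    }
    return 0.5 * (lo + hi);
}

int main() {
    // R for n = 10^6, alpha = 0.75, kbar = 10.
    std::cout << findRadius(1e6, 0.75, 10.0) << "\n";
}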

Before we continue, let us quickly derive the asymptotics of R in regards to equation
1.1, as this will also be important later on during most proofs in this thesis:
Lemma 1. R = O(ln(n/k̄)) for any α.
Proof. Gugelmann et al. define the hyperbolic radius R as [GPP12, p.5]:

R = 2 \log(n) + C

C is a parameter that one can change to vary the average degree k̄ in their model.
The average degree itself is derived by the authors as [GPP12, p.6]:

\bar{k} = (1 + o(1))\, \frac{2\alpha^2 e^{-C/2}}{\pi(\alpha - 1/2)^2}

Substituting C = R − 2 log(n) into this equation gives us:

\bar{k} = (1 + o(1))\, \frac{2\alpha^2 e^{-R/2 + \log(n)}}{\pi(\alpha - 1/2)^2} = \Theta\!\left(n\, e^{-R/2}\right)

Rearranging yields e^{R/2} = Θ(n/k̄), and taking logarithms on both sides gives
R = 2 ln(Θ(n/k̄)) = O(ln(n/k̄)).

After choosing an R for the disk radius, we continue with the generation of the
points. Their native coordinates follow two separate density functions: the angle
is uniformly distributed over the range [0, 2π), while the radial distribution is
governed by [KPKVB10, p.6, eq. (17)]:

f(r) = \frac{\alpha \sinh(\alpha r)}{\cosh(\alpha R) - 1}    (1.2)

There are two further equations that are going to be needed later on that stem from
1.2. The first is an approximation of 1.2 that is going to be useful in proofs (which is
also given by Krioukov et al. [KPKVB10, p.6, eq. (17)]):

f(r) = \alpha\, \frac{(e^{\alpha r} - e^{-\alpha r})/2}{(e^{\alpha R} + e^{-\alpha R})/2 - 1} \approx \alpha\, \frac{e^{\alpha r}}{e^{\alpha R}} = \alpha\, e^{\alpha(r - R)} =: \hat{f}(r)    (1.3)

The second is the integral of 1.2, which gives us the probability mass, i.e. the expected
percentage of nodes between two radii b and d, with 0 ≤ b < d ≤ R, given by:

\operatorname{mass}(b, d) = \int_b^d f(r)\, dr = \frac{\cosh(\alpha d)}{\cosh(\alpha R) - 1} - \frac{\cosh(\alpha b)}{\cosh(\alpha R) - 1}    (1.4)

Inserting the definition cosh(x) := (e^x + e^{−x})/2 and approximating analogously to
1.3 yields:

\operatorname{mass}(b, d) = \frac{\cosh(\alpha d) - \cosh(\alpha b)}{\cosh(\alpha R) - 1} \approx e^{-\alpha R}\left(e^{\alpha d} - e^{\alpha b}\right)    (1.5)

After having generated our nodes, we now only need to establish an edge between
any two points u = (φ_u, r_u), v = (φ_v, r_v) whenever the distance between them is
less than or equal to R. The hyperbolic distance between those points is given by:

\operatorname{dist}(u, v) = \operatorname{acosh}\left(\cosh(r_u)\cosh(r_v) - \sinh(r_u)\sinh(r_v)\cos(|\phi_u - \phi_v|)\right)    (1.6)

Calculating the distance between all node combinations of our graph would be
computationally very expensive, not only because of the quadratic nature of such
an approach, but also because of the extensive usage of cosh- and sinh-functions.

To combat this, the algorithms shown in this thesis have a few ways to reduce the
number of potential neighbour candidates each node has to check. One of those,
employed by NkGen and our EM-variant, is to divide the ground plane radially
into L slabs b_i with 0 ≤ i < L, where each slab is defined by its inner radius
c_i and outer radius c_{i+1}. The specific choice of those radii is important with
regards to the runtime; in the case of NkGen, Looz et al. chose a geometric
partitioning with p = 0.9, where the chosen number of slabs L is always dependent
on the number of nodes n, more specifically L = log(n), calculated as
follows [LLM16, p.3]:

c_i = \begin{cases}
0 & \text{if } i = 0 \\
\frac{(1-p)\,R}{1-p^L} & \text{if } i = 1 \\
R & \text{if } i = L \\
p\,(c_{i-1} - c_{i-2}) + c_{i-1} & \text{otherwise}
\end{cases}    (1.7)
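A sketch of this recurrence in code (illustrative; the function name is ours):

#include <cmath>
#include <vector>

// Geometric band radii of eq. 1.7: band b_i spans [c_i, c_{i+1}); each band
// width is p times the previous one, and the widths sum to R by construction.
std::vector<double> geometricBandRadii(double R, std::size_t L, double p = 0.9) {
    std::vector<double> c(L + 1);
    c[0] = 0.0;
    c[1] = (1.0 - p) * R / (1.0 - std::pow(p, static_cast<double>(L)));
    for (std::size_t i = 2; i < L; ++i)
        c[i] = p * (c[i - 1] - c[i - 2]) + c[i - 1]; // width_i = p * width_{i-1}
    c[L] = R;
    return c;
}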

For every point v on slab b_i, we calculate the minimal and maximal angles of any
possible neighbour point u, and use those angles to reduce the number of potential
neighbour points, as only points in between those two angles can be neighbours (a
more detailed explanation of this procedure can be found in chapter 2.1). The
calculation of those angles, given by Looz et al. in [LÖLM16, p.4, eq. (7-10)], is the one
used by the function getMinMaxPhi in listing B.1 for the native representation of
the hyperbolic plane and is given by:

\theta_{b_i}(v) = \operatorname{acos}\left(\frac{\cosh(r_v)\cosh(c_i) - \cosh(R)}{\sinh(r_v)\sinh(c_i)}\right) \ge |\phi_u - \phi_v|    (1.8)

θ_{b_i} only gives us the maximal angular difference in one direction, though, which
is why one will see this term multiplied by two during certain calculations (see the
proof of lemma 3 as an example), whenever we are looking for the entire area
encompassed by those angular extremes.

The approximation of equation 1.8 is particularly useful during our proofs later on
(see the proof of lemma 3) and is given by [GPP12, p.7]:

\theta_{b_i}(v) \le \hat{\theta}_{b_i}(v) = 2\, e^{(R - r_v - c_i)/2}    (1.9)

It is important to mention that 1.9 is only accurate for large r_v, c_i and R. Specifically,
the approximation overestimates the actual angle, which is why it can safely be used
in an asymptotic proof. To give a short example of this: for any

r_v + c_i < R - 2\log(\pi/2)    (1.10)

the approximated angle is larger than π, which should not be possible, as the largest
absolute angular difference between any two points can be at most π. This small issue
is additionally of note, as we are later going to use an equation to approximate the
workload of our algorithm (see equation 3.2 for more details), whose accuracy suffers
if this issue is not taken into account.

1.2.4 Poincaré Model

Since equations 1.6 and 1.8 contain a lot of expensive cosh- and sinh-operations,
working in the native representation of the hyperbolic plane is not necessarily
optimal. An alternative to this representation is the Poincaré model, whose
respective calculations are arithmetically simpler, as will shortly be shown.

A short illustration of the Poincaré disk model was given by Looz et al., who
summarised the model thus [LSMP15, p.4]:

"The Poincaré disk model is one of several representations of hyperbolic space
within Euclidean geometry and maps the hyperbolic plane onto the Euclidean unit
disk D_1(0). The hyperbolic distance between two points p_E, q_E ∈ D_1(0) is then given
by the Poincaré metric:

\operatorname{dist}_H(p_E, q_E) = \operatorname{acosh}\left(1 + 2\,\frac{\lVert p_E - q_E \rVert^2}{(1 - \lVert p_E \rVert^2)\,(1 - \lVert q_E \rVert^2)}\right)"    (1.11)

FIGURE 1.4: The graph in figure 1.3 mapped to the Poincaré disk, whose radius is
1. E_c is the center of point u's query circle, with rad_E being the query circle's radius,
as calculated with 1.14 and 1.15 respectively. All the points and bands are pushed
further towards the disk's outer radius. The right part of the figure shows an enlarged
view of the area around point u, where one can see that the yellow query area exists
here as well.

As one can see, if the problem is calculating the distance and checking whether it is
less than or equal to R, one can circumvent the repeated usage of the acosh-operation
in 1.11 by rearranging the equation to:

\frac{\cosh(\operatorname{dist}_H(p_E, q_E)) - 1}{2} = \frac{\lVert p_E - q_E \rVert^2}{(1 - \lVert p_E \rVert^2)\,(1 - \lVert q_E \rVert^2)}    (1.12)

If we now calculate the left-hand side once beforehand with R as the parameter and
save that value, every comparison only requires simple arithmetic between the
right-hand side for two points and (cosh(R) − 1)/2.

The only possible downside is that we need the following calculation for the
transformation from the native representation to the Poincaré disk model, which
also includes cosh-operations [LSMP15, p.14]:

Given a point p = (φ_n, r_n) in the native representation, g : H² → D_1(0) is the
mapping:

g(\phi_n, r_n) = \left(\phi_n,\; \sqrt{\frac{\cosh(r_n) - 1}{\cosh(r_n) + 1}}\right)    (1.13)

This of course has to be done for every point we generate. Since we usually
have a larger number of distance calculations than nodes to generate, this
is not a problem: the additional work during the generation of nodes is amortised
by the higher number of distance calculations, which do not require cosh-operations
in this model anymore.
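A compact sketch of the transformation and the resulting cheap neighbour test
(illustrative code; a real implementation would cache the Cartesian coordinates once
per point instead of recomputing cos and sin per comparison):

#include <cmath>

struct PointP { double phi, r; }; // Poincaré coordinates (angle, Euclidean radius)

// Eq. 1.13: map a native point (phiN, rN) onto the Poincaré disk.
PointP toPoincare(double phiN, double rN) {
    return {phiN, std::sqrt((std::cosh(rN) - 1.0) / (std::cosh(rN) + 1.0))};
}

// Eq. 1.12: with threshold = (cosh(R) - 1) / 2 computed once, every edge test
// needs only basic arithmetic.
bool isEdge(const PointP& u, const PointP& v, double threshold) {
    const double ux = u.r * std::cos(u.phi), uy = u.r * std::sin(u.phi);
    const double vx = v.r * std::cos(v.phi), vy = v.r * std::sin(v.phi);
    const double distSq = (ux - vx) * (ux - vx) + (uy - vy) * (uy - vy);
    const double rhs = distSq / ((1.0 - u.r * u.r) * (1.0 - v.r * v.r));
    return rhs <= threshold;
}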

Another advantage of using the Poincaré model is that instead of using equation 1.8
for our φ_min and φ_max, which involves additional usage of cosh and sinh, we can use
an easier formula by drawing simple circles around each point. One of the positive
properties of the Poincaré disk model is that circles in hyperbolic geometry
become simple circles on the now established Euclidean disk in the Poincaré model,
albeit with radii that change depending on the circle's position, decreasing
the further away the circle's center lies from the Euclidean disk's center, as the
following equations taken from [LSMP15, p.5] show:

Let E be the Euclidean circle in the Poincaré disk which corresponds to a hyperbolic
circle H with radius R around point u = (φ_u, radius_u) on the hyperbolic plane. Let
r_{g(u)} be point u's corresponding radius on the Poincaré disk, calculated by equation
1.13. Let the center of E be E_c and the radius of that circle be rad_E. Let also:

a = \cosh(R) - 1
b = (1 - r_{g(u)}^2)

The center E_c is then given by:

E_c = \left(\phi_u,\; \frac{2\, r_{g(u)}}{ab + 2}\right)    (1.14)

with the radius rad_E being given by:

rad_E = \sqrt{\left(\frac{2\, r_{g(u)}}{ab + 2}\right)^2 - \frac{2\, r_{g(u)}^2 - ab}{ab + 2}}    (1.15)

With these query circles, it is a simple matter of calculating the intersection points
between two overlapping circles, namely the band radii, which were also converted
into the Poincaré model, and those query circles. Because of symmetry, the angle
between the query circle's center and a line drawn from the origin (i.e. the center of
all band radii) to one of the intersection points on band b_i is exactly
|φ_max − φ_min| / 2 = θ_{b_i}.

At last, we finish the background with the equation used to calculate those
intersections, or more accurately, the angles defining those intersections.

Let c_{i−1} be the inner radius of band b_i, outlining a circle with its center at the origin.
Let v be a point that creates the query circle E with its center E_c = (φ_v, r_E) and
its radius rad_E. The angle between the line from the origin towards the center
E_c on one hand, and the line from the origin towards one of the intersections
of those two circles on the other, is given by equation 1.16, which is used by our
method getMinMaxPhi_Poincare in listing B.2. The equation is taken from Paul
Bourke [Bou97] and modified with the respective variables of the Poincaré model:

\theta_{b_i}(v) = \operatorname{acos}\left(\frac{c_{i-1}^2 - rad_E^2 + r_E^2}{2\, r_E\, c_{i-1}}\right)    (1.16)
As one can see from a simple comparison between equations 1.16 and 1.8, the one in
use in the Poincaré disk model is arithmetically simpler. If we look at the
number of cosh/sinh-operations, we have four distinct applications in
the native representation and only one distinct application in the Poincaré model
(we exclude cosh(R) from both representations in this comparison, as it can
be calculated once beforehand and kept in memory as a constant).
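The two steps translate into code as follows (an illustrative sketch of equations 1.14
to 1.16, not our method's exact implementation; the clamp covers configurations
without a clean intersection):

#include <algorithm>
#include <cmath>

struct QueryCircle { double rE, radE; }; // radial centre coordinate and radius

// Eqs. 1.14/1.15: Euclidean query circle for a point with Poincaré radius rg
// and hyperbolic query radius R.
QueryCircle makeQueryCircle(double rg, double R) {
    const double a = std::cosh(R) - 1.0;
    const double b = 1.0 - rg * rg;
    const double rE = 2.0 * rg / (a * b + 2.0);
    const double radE = std::sqrt(rE * rE - (2.0 * rg * rg - a * b) / (a * b + 2.0));
    return {rE, radE};
}

// Eq. 1.16: half-angle of the intersection with the band circle of radius
// cInner. acos(-1) = pi means the band is fully covered, acos(1) = 0 means
// no candidates lie on this band.
double boundingHalfAngle(const QueryCircle& q, double cInner) {
    const double arg = (cInner * cInner - q.radE * q.radE + q.rE * q.rE) /
                       (2.0 * q.rE * cInner);
    return std::acos(std::clamp(arg, -1.0, 1.0));
}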

1.3 STXXL
Since the main focus of this thesis remains on large data sets, we decided to use data
structures from the STXXL library for data management and sorting. The Standard
Template Library for Extra Large Data Sets "enables practice oriented experimentation
with huge data sets [, as it] supports parallel disks, overlapping between I/O and
computation, and pipelining technique that can save more than half of the I/Os"
[DKS08, p.640]. One of the features of STXXL that is relevant for our application
is the sorting algorithm, which has been designed with extremely large data sets,
multiple disks working in parallel, and I/O-efficiency in mind.

The Asynchronous Parallel Disk Sorting algorithm specifically uses an external memory
approach that splits the data akin to a k-merge-sort, the difference here being
that the algorithm takes I/O-procedures into account and schedules its workflow
around an overlap buffer to allow for "almost perfect overlapping of I/O and
computation" [DS03, p.142], which will be advantageous to the External-Memory
variant of NkGen. Since a few versions of our algorithm rely heavily on a pre-sorted
data set, STXXL and its sorters, along with the I/O performance counters it provides,
will prove useful in our implementation of the algorithm and its subsequent analysis.
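To illustrate the intended usage, the following hedged sketch fills an STXXL sorter
with the points of one band, switches it into its sorted reading phase and consumes
it as a stream (the record type is a simplified stand-in for the ones our implementation
uses):

#include <limits>
#include <stxxl/sorter>

struct Point { double phi, r; };

// STXXL comparators must provide min_value()/max_value() sentinels.
struct PhiComparator {
    bool operator()(const Point& a, const Point& b) const { return a.phi < b.phi; }
    Point min_value() const { return {-std::numeric_limits<double>::max(), 0.0}; }
    Point max_value() const { return { std::numeric_limits<double>::max(), 0.0}; }
};

void sortBandByAngle() {
    // 256 MiB of internal memory for this sorter; overflow spills to disk.
    stxxl::sorter<Point, PhiComparator> sorter(PhiComparator(), 256 * 1024 * 1024);
    sorter.push({1.3, 5.0});
    sorter.push({0.2, 4.1});
    sorter.sort();                      // switch from filling to reading phase
    for (; !sorter.empty(); ++sorter) {
        const Point& p = *sorter;       // points stream out in phi-order
        (void)p;                        // edge creation would consume p here
    }
}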

Chapter 2

Algorithms

2.1 State of the Art: NkGen


For a better understanding of our EM-variant of the generation algorithm, we will
give a short overview of the original algorithm, whose pseudocode from the original
paper can be found in the appendix in listing B.1.

The general outline of the algorithm by Looz et al. consists, in our framework, of
four general phases: setup, generation, sorting and edge creation [LÖLM16, p.3].
Its main idea is that partitioning the hyperbolic plane into multiple concentric,
ring-shaped bands decreases the number of comparisons necessary for the
fourth step.

The setup step includes the calculation of the target radius R (see eq. 1.1) of the
hyperbolic plane, given specified parameters like node count n, average degree k̄ and
power-law exponent γ = 2α + 1. Additionally, we prepare C = (c_0, c_1, ..., c_l),
which defines the radii borders of the l bands, with band b_i = [c_i, c_{i+1}) being defined
by its inner radius c_i and outer radius c_{i+1}. How C is chosen has a big influence on
the runtime, as the expected number of nodes per vertex that have to be compared
for the edge creation increases not only with the area of the band, but also with its
position on the hyperbolic plane, as the density function follows an approximately
exponential curve (see eq. 1.2 and 1.3). Looz et al. chose a geometric sequence with
ratio p = 0.9 based on experimental outcomes (see eq. 1.7 and [LÖLM16, p.3]).

Generation of the vertices is performed in parallel, where all coordinates
are drawn randomly as detailed in section 1.2.3. Every vertex is then put into an
array corresponding to the band the point belongs to (i.e. band b_i = [c_i, c_{i+1}) with
c_i ≤ r < c_{i+1}, where r is the vertex's radius). Afterwards, each band's points are
sorted by their angular coordinates to prepare for the edge-creation step:

For every vertex, the algorithm goes through every band that includes it or is
further away from the center than the vertex. To decrease the number of potential
neighbours that fit the neighbour criterion of dist_H(u, v) ≤ R
(see eq. 1.6), the maximum angular difference from a node in either direction on
the current band is calculated (see eq. 1.8). This is possible since a hypothetical
hyperbolic circle with radius R around any vertex creates overlapping areas with at
least one band. The intersections between those hyperbolic circles and the band radii
equal the calculated maximum angular difference on any given band. Thus,
since all neighbours of a node have to be somewhere inside that node's respective
circle, we can narrow down the area of neighbourhood candidates with those
angular bounds, as sketched below.
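(An illustrative, simplified C++ sketch of the loop just described, not NetworKit's
code; the angular wrap-around at 0/2π is omitted, and the full-band special case is
discussed right after figure 2.1.)

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <utility>
#include <vector>

struct Pt { double phi, r; std::uint64_t id; };

double distH(const Pt& u, const Pt& v) { // eq. 1.6
    return std::acosh(std::cosh(u.r) * std::cosh(v.r) -
                      std::sinh(u.r) * std::sinh(v.r) * std::cos(u.phi - v.phi));
}

// bands[i] is sorted by phi; c[i] is the inner radius of band i (c[0] = 0).
void createEdges(const std::vector<std::vector<Pt>>& bands,
                 const std::vector<double>& c, double R,
                 std::vector<std::pair<std::uint64_t, std::uint64_t>>& edges) {
    const double pi = 3.141592653589793;
    auto byPhi = [](const Pt& p, double x) { return p.phi < x; };
    for (std::size_t i = 0; i < bands.size(); ++i)
        for (const Pt& v : bands[i])
            for (std::size_t j = i; j < bands.size(); ++j) {
                double theta = pi;           // full band unless bounded below
                if (c[j] > 0.0) {
                    const double arg =
                        (std::cosh(v.r) * std::cosh(c[j]) - std::cosh(R)) /
                        (std::sinh(v.r) * std::sinh(c[j]));
                    theta = (arg > 1.0) ? 0.0 : std::acos(std::max(arg, -1.0));
                }
                // These binary searches are exactly the unstructured accesses
                // our EM-variant later avoids.
                auto lo = std::lower_bound(bands[j].begin(), bands[j].end(),
                                           v.phi - theta, byPhi);
                auto hi = std::lower_bound(bands[j].begin(), bands[j].end(),
                                           v.phi + theta, byPhi);
                for (auto it = lo; it != hi; ++it) {
                    // Within v's own band each pair is seen twice; dedup by id.
                    const bool emit = (j == i) ? (v.id < it->id) : (v.id != it->id);
                    if (emit && distH(v, *it) <= R)
                        edges.emplace_back(v.id, it->id);
                }
            }
}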


FIGURE 2.1: The red line shows a query circle with radius R (the same as the radius
of the entire plane) around a randomly chosen point on the hyperbolic plane.
The yellow areas are the ones in between the maximum angular differences for
the green point on the respective bands. Because all to-be-established neighbours
are inside the query circle, only the yellow areas are searched (edges are
created outwards, thus only the node's own and outer bands are considered).

There are exceptions, though: If the circle of a point covers the band on which the point resides either entirely or more than once, no such simple bounds can exist. A potential neighbour could be on the other side of the center of the hyperbolic plane, meaning the boundary would need to encompass the entire plane. This means that on the innermost band (and possibly further bands, depending on how close to the center the bands and points are) all vertices have to be compared for possible neighbour properties. In other words, the angle boundaries are set to 0 and 2π. Depending on α, those cases can be rare: With increasing α, the exponential nature of the vertices' radial density function (see eq. 1.3) pushes the overwhelming majority of vertices further towards the border of the hyperbolic plane. Taking for example equation 1.5 and setting d = R/2, b = 0 and R = O(ln(n/k)), we get:

mass(0, R/2) = e^{-α ln(n/k)} · (e^{αR/2} - e^0) = (k/n)^α · ((n/k)^{α/2} - 1) = O((k/n)^{α/2})

In other words, the mass of nodes with a radius less than R/2 falls for any increasing n and α, as long as k < n, and is for instance less than 3.2% for any k ≤ n/1000 and α ≥ 1.

Once calculated, we get φ_min and φ_max as our bounds for that particular vertex (φ_min being the green, φ_max the red boundaries in figures 2.1 and 2.2). All the points between those two angles are potential neighbours whose distances to our vertex we have to calculate and, depending on the distance, establish an edge with. The algorithm uses a binary search to find the first node with φ ≥ φ_min and the last node with φ ≤ φ_max, which is vitally important, as this is infeasible in an EM-setting:

FIGURE 2.2: The same graph as seen in figure 2.1, but with a focus on a different point that is closer to the center. The two innermost bands do not allow simple boundaries to be created, as the query radius covers too much area of those bands without simple intersections between band radii and query circle.

Each unstructured access in that search could easily trigger an I/O-operation if the data is large enough. Because of that, our later EM-variant will circumvent the binary search by introducing additional data structures and adapting the algorithm to them, thus potentially decreasing the amount of time spent on I/O-operations.

For our comparison later on, we will use the implementation by the authors found in the NetworKit library¹.

2.2 EM-Variant of NkGen


A problem with the aforementioned algorithm is the I/O-inefficient approach of its edge-creation stage. A sufficiently large node set will not fit into main memory, and as such, operations requiring intensive unstructured access, such as a binary search, will incur additional on- and off-loading of data blocks into our main memory that are disregarded immediately thereafter.

Thus, our approach is to circumvent the need for unstructured access in our main
algorithm by not only sorting the nodes, but by also creating an additional data
structure for the angular boundaries of each point. In detail, we create structures
that represent the left angular bound (start of a query) and the right angular bound
(end of a query) once per node x and per band. Those boundaries represent the
query for all neighbours of that point x. Our intention is that by sorting these
angular boundaries as well, we can scan these sorters and work through multiple
queries at once without requiring any unstructured access. In other words, once a
memory block of sorted nodes and boundaries has been loaded into main memory,
we can be sure that all of the loaded data will be of use before being discarded, re-
sulting in a reduced number of I/O-operations compared to the original algorithm.
¹ NetworKit, with a reference implementation of the algorithm, at: https://networkit.iti.kit.edu/

Pseudocode detailing this process can be found in the appendix (see listing B.2).

2.2.1 Main Algorithm


First, let us outline the changes necessary for the EM-variant to work more efficiently:

The EM-variant goes through the same four steps as the original algorithm. We set up the target radius R and choose a fitting radial partitioning for our l bands. Unlike NkGen, though, which uses hyperbolic geometry's native representation, we are going to use the Poincaré disk model instead. The reason for this is that the equations for the distance and angular-boundary calculations are much simpler in this model (compare eq. 1.6 with eq. 1.12, and eq. 1.8 with 1.16), meaning that we can reduce the number of computational operations by using a different representation alone.

To make this possible, we modify NkGen's setup and generation phase: instead of keeping the nodes' and bands' radii as they are, we first map them to the Poincaré disk with the function MapToPoincare (see eq. 1.13). Additionally, we use a different function to calculate φ_min and φ_max, getMinMaxPhi_Poincare (see eq. 1.14, 1.15 and eq. 1.16).
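To illustrate the representation change, here is a minimal sketch of such a mapping. It assumes the standard conversion in which a point at hyperbolic distance r from the origin maps to Euclidean radius tanh(r/2) in the Poincaré disk (the angle stays unchanged); it is meant as an illustration of what MapToPoincare does rather than a reproduction of eq. 1.13:

#include <cmath>

// Illustrative stand-in for MapToPoincare: maps a point from the native
// representation (angle phi, hyperbolic radius r) onto the Poincare disk.
// Assumption: the standard conversion r_p = tanh(r / 2); the angle is kept.
struct PoincarePoint {
    double phi;     // angular coordinate, unchanged by the mapping
    double radius;  // Euclidean radius inside the unit disk
};

PoincarePoint mapToPoincare(double phi, double r) {
    return PoincarePoint{phi, std::tanh(r / 2.0)};
}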

We also have to modify our edge-creation step so that unstructured access becomes unnecessary, i.e. by avoiding the binary search and using the aforementioned angular boundaries for φ_min and φ_max. These boundaries are calculated once per vertex for each band that has an outer band radius larger than the vertex's radius (including the vertex's own band) during the generation step. After sorting the boundaries and nodes according to a lexicographical order beginning with the respective angular component, we can utilise an additional vector to store all ongoing, active queries during our traversal through the sorters. With that active vector in mind, we can then work through multiple concurrent queries, point by point, deleting and adding queries to our vector whenever we reach the end of a query or find the start of a new one, respectively. We only stop once we reach the end of a band, when there are no points on the band left anymore.

We first create for those boundaries their own data structures during the generation step: one sorted by φ_min called StartBounds, one by φ_max called StopBounds. Both data structures store the ID of the original query-vertex, while the StartBounds alone store the coordinates of the vertex as well. This way, we can use the StartBounds themselves to calculate distances between the query-vertex and the current vertex being handled in the edge-creation step. Additionally, the IDs will help to manage the bounds during the fourth step.

After generating the points, and with them the Start- and StopBounds, we put them all in three separate sorters per band and sort them.

Thus, the edge-creation step changes in the following way: For each band, we will perform a sweep-line algorithm by which we go through all three sorters concurrently, choosing the smallest item according to a lexicographical order (first the angular property, then the data type) as our current work-token. In case a bound-object has the exact same angle as a point, the order is "StartBound < Point < StopBound", to ensure that all neighbours will be found. Depending on the token-type, we act accordingly:

StartBound: We add the StartBound to a vector called active.

StopBound: We delete the corresponding StartBound with the same ID as our StopBound from active.

Point: We go through the whole active-vector and calculate distances between our node and our StartBound-queries. If the neighbour requirements are met, we add an edge between the current node and the query-node.
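As a concrete illustration of this token handling, the following is a minimal in-memory sketch of the sweep on one band. It is a simplification under stated assumptions: plain std::vector stands in for the STXXL sorters, the type names are ours, and withinRadius() abbreviates the distance check against R:

#include <cmath>
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative types; the real implementation keeps these in STXXL sorters.
struct Point      { double phi; std::uint64_t id; };
struct StartBound { double phi; std::uint64_t id; Point query; };
struct StopBound  { double phi; std::uint64_t id; };

bool withinRadius(const Point& a, const Point& b);  // distance check against R (assumed given)

// Sweep over one band: all three inputs are assumed sorted by phi.
void sweepBand(const std::vector<StartBound>& starts,
               const std::vector<StopBound>& stops,
               const std::vector<Point>& points,
               std::vector<std::pair<std::uint64_t, std::uint64_t>>& edges) {
    std::vector<StartBound> active;
    std::size_t a = 0, o = 0, p = 0;
    while (a < starts.size() || o < stops.size() || p < points.size()) {
        const double phiA = a < starts.size() ? starts[a].phi : INFINITY;
        const double phiP = p < points.size() ? points[p].phi : INFINITY;
        const double phiO = o < stops.size()  ? stops[o].phi  : INFINITY;
        // Tie-break on equal angles: StartBound < Point < StopBound.
        if (phiA <= phiP && phiA <= phiO) {
            active.push_back(starts[a++]);               // a query begins
        } else if (phiP <= phiO) {
            for (const StartBound& q : active)           // compare with all open queries
                if (q.id != points[p].id && withinRadius(q.query, points[p]))
                    edges.emplace_back(q.id, points[p].id);
            ++p;
        } else {
            for (std::size_t i = 0; i < active.size(); ++i)
                if (active[i].id == stops[o].id) {       // a query ends: drop its StartBound
                    active[i] = active.back();
                    active.pop_back();
                    break;
                }
            ++o;
        }
    }
}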

The result is a linear scan through all three sorters with no unstructured access necessary. As the following lemma shows, we find all neighbours in our generated graph that fulfil the neighbourhood requirement.

Lemma 2. The EM-variant algorithm is correct and finds all neighbours y for all vertices x where dist_H(x, y) ≤ R.

Proof. Let x be a random vertex with angular coordinates (r_x, φ_x). Let y be an arbitrary neighbour of x with angular coordinates (r_y, φ_y). Let ss_y and st_y be the corresponding Start- and StopBounds of y according to our algorithm, with angular coordinates (r_y, φ_{ss_y}) and (r_y, φ_{st_y}), where φ_{ss_y} = φ_{min,y} and φ_{st_y} = φ_{max,y} respectively.

From the fact that x is a neighbour of y it follows that φ_{ss_y} ≤ φ_x holds, because otherwise x would be outside of y's reach, seeing as φ_x would be smaller than φ_{min,y}. Accordingly, this means that ss_y has to have been put into the active-vector at some point in time before x became our work-token.

By the same logic, y being a neighbour of x necessitates that φ_x ≤ φ_{st_y} holds, because if it does not, then φ_x would be larger than φ_{max,y}, meaning x is not a neighbour of y. Accordingly, this means that if the active-vector has ss_y in it by the time x becomes our work-token, it could not have been deleted from active beforehand, considering that the respective StopBound-token has to come after x.

Thus, if a random vertex y is a neighbour of x, y's StartBound will be in active during x's edge-creation turn, meaning any and all neighbours of x will establish an edge with x by necessity at some point in time during our algorithm.

2.2.2 2-Sorter-Version
During our work, we considered optimising our algorithm by decreasing the number of sorters necessary. This should reduce the time spent on sorting, since the overall number of elements will be reduced by at least a third (as there are more Bound-objects than nodes). Instead of using one data structure for each Bound-type respectively, we use only one data structure called StartStopBound that keeps track of both φ_min and φ_max and the position of its original node.

Regarding the lexicographical order, we still first compare angles and then data types. Analogously to before, the StartStopBound is considered smaller than a Point-object. The algorithm itself only changes in the comparison step, where we have two possible work-tokens now, depending on which object is considered smaller:

StartStopBound: We add our StartStopBound to active.

Point: We go through active and delete the current query-StartStopBound if its φ_max is smaller than our point's angle; otherwise we compare and establish an edge if the distance is smaller than R.

One possible downside that could arise from the way this version works is that our average active-size should increase slightly compared to the 3-Sorter-Version. It is possible for our active-vector to contain StartStopBounds whose angle-range does not cover our current node, since their removal only happens whenever we get a new point work-token. Asymptotically, this does not matter, as it would at most double the active-size (i.e. we would have an active-vector with queries pertaining to the current and the previous node only). Computationally, this is also not a large disadvantage, as comparing the angles is enough to know whether we should engage in the computationally more expensive distance-calculation.

Pseudocode for this version can also be found in the appendix (see listing B.3).

2.2.3 0-Sorter-Version
For our theoretical runtime analysis later on, it is interesting to have a version of our algorithm that does not need any kind of sorting at all. Seeing as our 2-Sorter-Version already decreased the number of sorters by one, there are only two left that we have to work around if we still want to depend mainly on our main algorithm.

Avoiding the sorting of our nodes requires a change in our generation step: Instead of randomly generating the nodes, we first randomly generate the node count on every band. For this, we calculate the probability mass of the nodes' radial density function f(r) per band between its inner radius r_inner and outer radius r_outer with the following formula based on the integral of the density function:

mass(r_inner, r_outer) = ∫_{r_inner}^{r_outer} f(r) dr = cosh(r_outer)/(cosh(R) - 1) - cosh(r_inner)/(cosh(R) - 1)
With this, we can calculate the probability mass m_j, one for each band b_j. If we now sample n uniformly random numbers p_i between 0 and 1 (one for each node), we can derive the random number of nodes per band we would have gotten from a randomly generated batch of n nodes by counting how many p_i satisfy (Σ_{k=0}^{j} m_k) - m_j ≤ p_i < Σ_{k=0}^{j} m_k.
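A compact sketch of this counting step might look as follows, assuming the per-band masses m_j have already been computed with the formula above (the function name is ours; a binary search over the prefix sums replaces the explicit interval test):

#include <algorithm>
#include <random>
#include <vector>

// Derive the random number of nodes per band from the per-band probability
// masses m[0..l-1] (assumed to sum to 1): draw n uniform samples and count
// how many fall into each band's prefix-sum interval.
std::vector<std::size_t> nodesPerBand(const std::vector<double>& m, std::size_t n,
                                      std::mt19937_64& rng) {
    std::vector<double> prefix(m.size());
    double sum = 0.0;
    for (std::size_t j = 0; j < m.size(); ++j) prefix[j] = (sum += m[j]);

    std::vector<std::size_t> count(m.size(), 0);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    for (std::size_t i = 0; i < n; ++i) {
        const double p = uni(rng);
        // first band whose prefix sum exceeds p, clamped against rounding error
        std::size_t j = std::upper_bound(prefix.begin(), prefix.end(), p) - prefix.begin();
        if (j == count.size()) --j;
        ++count[j];
    }
    return count;
}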

After that, for every band, we generate nodes whose radii follow the usual density function, only now restricted to the band's inner and outer radii. The angle, on the other hand, has to be generated in a way that every generated angle is greater than or equal to the last one, while still following a uniform distribution. Bentley and Saxe describe an algorithm that does exactly that [BS80]. This is possible, the authors note, since the values Y_j = (Σ_{1≤i≤j} X_i) / (Σ_{1≤i≤n+1} X_i) for j = 1, ..., n "are distributed as the order statistics of size n from U[0, 1]" for independent variables X_1, ..., X_{n+1} with an exponential distribution and a fixed mean; they use this fact to generate a series of sorted, decreasing, uniformly random numbers in sequence [BS80, p.2].

Let N be the number of values to be uniformly randomly generated in a sorted sequence; an algorithm to calculate and print out such a sequence in one pass can be described as follows [BS80, p.6, program 4]:
LISTING 2.1: Sequential generation of presorted, uniformly distributed numbers by Bentley and Saxe.

I = N;
LnCurMax = 0;
while (I > 0)
{
    LnCurMax = LnCurMax + log(GenerateUniformlyRandomNumber()) / I;
    I = I - 1;
    print(1 - exp(LnCurMax)); # the subtraction exists to change the sort order from decreasing to increasing
}

Using this algorithm, we can generate nodes that are sorted first per band, then per
angle, which is exactly what our main algorithm needs.
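For reference, a self-contained C++ rendition of the generator (a sketch; the function name and RNG choice are ours) could look like this:

#include <cmath>
#include <random>
#include <vector>

// Emits n uniformly distributed values from [0, 1) in increasing order,
// in one pass, following Bentley and Saxe [BS80].
std::vector<double> sortedUniform(std::size_t n, std::mt19937_64& rng) {
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    std::vector<double> out;
    out.reserve(n);
    double lnCurMax = 0.0;
    for (std::size_t i = n; i > 0; --i) {
        // 1 - uni() lies in (0, 1], avoiding log(0)
        lnCurMax += std::log(1.0 - uni(rng)) / static_cast<double>(i);
        out.push_back(1.0 - std::exp(lnCurMax));  // flip from decreasing to increasing
    }
    return out;
}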

Avoiding the sorting of our StartStopBounds can be done by dividing each StartStopBound s_x into two halves, one ranging over [φ_min, φ_x] called sBackward_x, the other over [φ_x, φ_max] called sForward_x. The 2-Sorter-Version requires that all given StartStopBounds are sorted by their φ_min-angle, meaning that if every generated node comes into the StartStopBound-generation phase in a sorted sequence, we can be sure that all StartStopBounds are sorted by their original node's angle φ_x. Following that, if we insert every sBackward_x into one vector and every sForward_x into another one, we then have two vectors filled with StartStopBounds, one being sorted backwards by its φ_max, the other forwards by its φ_min.

Unfortunately, this process requires that we perform the edge-creation phase once per StartStopBound-vector, as each StartStopBound only covers one half of each query node's reach. The forwards vector can be used in conjunction with our 2-Sorter-Version's comparison step just as it is, giving us approximately half of our desired edges. The backwards vector, on the other hand, has to be used with an altered version of the 2-Sorter's comparison step that goes backwards through both our StartStopBound- and Point-vector, giving us the other half of our edges.

2.3 Parallelisation
2.3.1 Radial Parallelisation
To increase performance, parallelisation seems to be a suitable possibility, considering the generally independent workload divided between the bands. This means that we do not need to change our algorithm if we are aiming for a radial parallelisation on our hyperbolic plane along the bands, i.e. working on the bands in parallel (see figure 2.3 for a visualisation).

FIGURE 2.3: Each colored band can be given to a different thread independently. For simplicity, we ignore specific scheduling details for now (see for instance the top level of JáJá's Work-Time-Framework [JJ92, p.27-32]).

The radial partitioning choice might have an influence on the algorithm in a parallel setting, though. Since the algorithm can
work in parallel, it is not necessarily the best idea to use a radial subdivision that minimises the overall workload over all the bands. It is possible that one partitioning choice delivers a workload that, while optimal in a sequential setting, would be suboptimal in a parallel setting. This is because the workload per band depends on a multitude of factors (such as the number of nodes between the origin and the current band, the average angle |φ_{st,r,j} - φ_{ss,r,j}| of the StartStopBounds of any query-point with radius r on a band j, etc.) and thus is not a simple linear division of work among all bands. Possible radial partitioning alternatives and their effects will thus be described in chapter 3.6 and tested later on in chapter 4.3.

2.3.2 Angular Parallelisation


Another parallelisation method is the angular approach (see figure 2.4 for a visualisation). Together with the radial workload division, we can easily subdivide each band into equally large parts. Let b_i = (c_i, c_{i+1}) be a band with inner radius c_i and outer radius c_{i+1}. Let v be the number of segments per band. Let thus b_{ij} = ([c_i, c_{i+1}), [φ_{start_j}, φ_{end_j})) be a segment on band b_i encompassing the area between the angles φ_{start_j} and φ_{end_j}, where φ_{start_j} = 2πj/v and φ_{end_j} = 2π(j+1)/v. Because the angular components of our nodes are uniformly distributed on [0, 2π), dividing a band b_i with nodes n_{b_i} ∈ [c_i, c_{i+1}) × [0, 2π) into v segments b_{ij}, with j ∈ {0, 1, ..., v-1} and nodes n_{b_i} ∈ [c_i, c_{i+1}) × [2πj/v, 2π(j+1)/v), should result in segments that divide the workload of a band equally.

Changes necessary to our algorithm involve the borders of our segments, as a node in segment b_{ij} could be a neighbour of a node in segment b_{i,j+1} or even further away. The changes are minimal, though, and only need to be applied to the generation step:

Instead of creating sorters for every band, we create them for every segment of every band that we have. Nodes will be inserted into the right sorters according to the segment they reside in, analogously to the way we did before. The bounds are treated differently this time around:

FIGURE 2.4: Each color represents a different thread that operates on the colored segment. As an exemplary distribution of segments to threads, each segment is given to a thread counterclockwise, starting from the innermost band, with the color order being blue, green, red, purple.

If a StartBound ss_y and its corresponding StopBound st_y belong to the same segment b_{ij}, both will be added to segment b_{ij} in the same manner as before.

If a StartBound ss_y belongs to b_{ij} and its corresponding StopBound st_y belongs to b_{im} with j ≠ m and j, m ∈ {0, 1, ..., v-1}, we add StartBound ss_y and a new StopBound st_{y_j} with angle φ_{j+1} = 2π(j+1)/v to segment b_{ij}. Analogously, we add a new StartBound ss_{y_m} with angle φ_m = 2πm/v and StopBound st_y to segment b_{im}. For every other segment b_{in}, we have two further options:

j < m: In that case, to all segments b_{in} with j < n < m, we add additional StartBounds ss_{y_n} with angles φ_n = 2πn/v and StopBounds st_{y_n} with angles φ_{n+1} = 2π(n+1)/v.

m < j: In that case, the query range covers the 2π threshold. We add bounds analogously to the former case, except that we first add StartBounds and StopBounds to all segments b_{in} with j < n ≤ v-1, and then to all segments b_{in} with 0 ≤ n < m, each with the same definition as in the former case.

In other words, if a Start- and StopBound pair encompasses multiple segments, we are going to split the pair up into multiple pairs along the dividing angles, so that each pair only covers one segment at a time (see the sketch below).
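The following sketch condenses the case distinction above into one splitting routine; the types are ours, angles are assumed to lie in [0, 2π), and the returned pairs correspond to the per-segment Start-/StopBound pairs:

#include <cmath>
#include <vector>

// Split one query range [phiStart, phiStop) on a band with v segments into
// per-segment (start, stop) pairs, clipping at segment borders and handling
// the wrap-around at 2*pi (the case phiStart > phiStop).
struct SegmentQuery { std::size_t segment; double start, stop; };

std::vector<SegmentQuery> splitQuery(double phiStart, double phiStop, std::size_t v) {
    const double twoPi = 2.0 * std::acos(-1.0);
    const double width = twoPi / static_cast<double>(v);
    const std::size_t j = static_cast<std::size_t>(phiStart / width);  // first segment
    const std::size_t m = static_cast<std::size_t>(phiStop / width);   // last segment
    std::vector<SegmentQuery> parts;
    if (j == m && phiStart <= phiStop) {
        parts.push_back({j, phiStart, phiStop});           // query stays in one segment
        return parts;
    }
    parts.push_back({j, phiStart, (j + 1) * width});       // first part, clipped right
    for (std::size_t s = (j + 1) % v; s != m; s = (s + 1) % v)
        parts.push_back({s, s * width, (s + 1) * width});  // fully covered segments
    parts.push_back({m, m * width, phiStop});              // last part, clipped left
    return parts;
}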

At last, our edge-creation step changes only slightly, in the sense that, instead of going through all bands, we go through all segments. The general algorithm applied to each part, though, stays the same.

2.3.3 Parallelisation of the Generation Phase


The generation phase can be easily parallelised in the case of the 2- and 3-Sorter variants: Creating as many generators as we have threads is all that needs to be done, apart

from setting the to-be-generated node count for each of those generators to n/t, where t is the number of threads. The only problem concerns the insertion of objects into their respective sorters, as the STXXL sorters are not thread-safe. Since we also did not want to increase the memory usage by a large margin, we decided to use one array per thread that was filled with the generated nodes and Bound-objects and emptied into the respective sorter once an arbitrary threshold was reached (in our case 1000 Bound-objects). The insertion locked the sorters with a mutex for the duration of the insert, creating a thread-safe generation and insertion without much more memory usage than the single-threaded generation. The resulting improvement in runtime can be seen in figure 2.5.

FIGURE 2.5: Generation-phase runtime in seconds over node count for each variant, once with a sequential generator ("SEQ") and once with a parallel one ("PAR") with 16 threads (the 0-Sorter uses only four threads).
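Returning to the batched insertion just described: a minimal sketch of this buffering scheme, assuming a generic Sorter type in place of the STXXL sorter (the class name and interface are ours):

#include <mutex>
#include <vector>

// Per-thread buffer that flushes into a shared, non-thread-safe sorter under
// a mutex once a threshold is reached (we used 1000 Bound-objects). Callers
// must invoke flush() once at the end to push the remaining items.
template <typename T, typename Sorter>
class BufferedInserter {
public:
    BufferedInserter(Sorter& sorter, std::mutex& lock, std::size_t threshold = 1000)
        : sorter_(sorter), lock_(lock), threshold_(threshold) {}

    void push(const T& item) {
        buffer_.push_back(item);
        if (buffer_.size() >= threshold_) flush();
    }

    void flush() {
        std::lock_guard<std::mutex> guard(lock_);
        for (const T& item : buffer_) sorter_.push(item);
        buffer_.clear();
    }

private:
    Sorter& sorter_;
    std::mutex& lock_;
    std::size_t threshold_;
    std::vector<T> buffer_;
};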

For the 0-Sorter, we had to take a different approach, as there are additional restrictions on the sequencing of the generated nodes: Our idea was to parallelise akin to the angular parallelisation approach, where we divide the plane angularly into multiple pieces, only that each piece now spans multiple bands. We then sample n uniformly random numbers in the range [0, 1), where each piece i with 0 ≤ i < t of the t pieces has its own range [i/t, (i+1)/t). This way, we randomise the number of nodes per generator uniformly. We distribute each piece to a thread and calculate the nodes as usual. During the generation, though, we restrict the generated angles to the angular range of the generator's piece.

The only issue here is that, for maximum performance, we want to use all t threads available. Since the nodes have to be generated in a sorted sequence, though, each piece requires one vector per band to keep the nodes ordered per band. Overall, we need s = t · l · 3 STXXL vectors: one per piece (i.e. per thread), per band, per object-type (one for the nodes, two for the Bound-objects for the traversal in both directions). This is an issue, since the STXXL vectors have an overhead and a minimum amount of memory required per vector during initialisation. Since the normal sequential generation only requires s = l · 3 vectors, we have an up to t times bigger overhead. Using all 16 threads during the generation phase, for instance, would require so many vectors, and thus so much

memory that we quickly reach the RAM's limit. Figure 2.6 shows the 0-Sorter once with four threads, once with eight threads, and once with the sequential generator during the setup and generation phase (the latter phases are able to use all threads regardless of generator). We can see here that a higher t results in a comparatively larger runtime for small graphs, but introduces a slower asymptotic rise. In the end, the overhead alone is only a constant and thus a bigger problem for small graphs. Asymptotically, the constant setup phase is dominated by every other phase, which is why the parallel generator still gives us an overall improvement for large graphs.

FIGURE 2.6: Combined setup- and generation-phase runtime in seconds over node count for the 0-Sorter, once with four threads, once with eight, and once with the sequential generator.

2.4 GIRG
An alternative to our previous algorithms is the one introduced in a paper by Bringmann et al. [BKL15]. The authors propose a more general model with regard to hyperbolic random graphs, called geometric inhomogeneous random graphs (GIRG). We paraphrase their summary here briefly: Their model describes GIRGs by giving each vertex v a weight w_v (which follows a power-law) and a uniformly random position x_v in the d-dimensional torus T^d (for random hyperbolic graphs, d = 1). Two vertices u ≠ v form an edge with probability p_{uv} proportional to w_u · w_v and inversely proportional to some power of their distance ||x_u - x_v||. Any other details as to how those weights and distances are calculated depend on the properties of the graph type one is aiming for during generation.

Bringmann et al. prove that their sampling algorithm for GIRGs has an expected
linear runtime of O(n + m), where n is the number of points and m the expected
number of edges. Considering that they show that random hyperbolic graphs are a
special case of GIRGs, the same runtime applies.

As such, we will later on compare practical runtime results between an implementation of their algorithm and ours, to additionally gauge how our algorithm fares in a practical setting in relation to one proven to have a linear runtime.

2.4.1 The Algorithm


We will summarise the algorithm in the following, but will not go into the mathematical details and proofs, as they are not in the scope of this thesis (for those, we refer to the original paper). We will present only a specialised variant in the context of using it for the generation of random hyperbolic graphs under the threshold model without a temperature parameter (see chapter 1.2.1). As such, certain definitions shown here are simplifications compared to the general case, as they already take into account that we are not concerned with the general case of GIRGs but with random hyperbolic graphs explicitly:

We first sample the random positions of our vertices and assign them weights based on their positions. Specifically, with regard to hyperbolic random graphs, a vertex v ∈ V has, according to the above definitions, the positional information

x_v = φ_v / (2π)    (2.1)

and weight

w_v = e^{(R - r_v)/2}.    (2.2)

The angle and radius are randomly sampled as in the previous algorithms. Afterwards, we partition all our vertices v ∈ V into different weight layers V_i with 1 ≤ i ≤ L and L = Θ(log(n)), the layers themselves defined as V_i := {v ∈ V | w_{i-1} ≤ w_v ≤ w_i}, with w_0 := min{w_v | v ∈ V} and w_i := 2w_{i-1} for all i ≥ 1. Considering that the weight layers give a lower and upper bound for all the radii of the vertices therein, they can be understood analogously to the bands used in our previous two algorithms.

For all the weight layers, we will create certain data structures D_{ν(i)}({x_v | v ∈ V_i}) with

ν(i) := w_i · w_0 / W    (2.3)

where W := Σ_{v∈V} w_v, that ultimately have the following properties:

D_{ν(i)} is a recursive cell structure with a geometric ordering with as many as k = log_2(⌈1/ν(i)⌉_2) levels, where ⌈x⌉_2 is the next larger number that is a power of 2, i.e.:

⌈x⌉_2 = min{2^l | l ∈ N_0 : 2^l ≥ x}    (2.4)

Starting with the first level l = 0 being made up of only one cell, every further level j divides all cells on level j-1 into two equally large cells, thus doubling the number of cells with each level.

Because the cells partition the entire ground plane (in our case all angles between 0 and 2π), and considering that each level is a subdivision of the level above, all cells on the same level l combined cover the entire ground plane. Let us assume a geometric ordering of the cells, with C_1 being the one cell on level 0 that covers the entire ground plane on its own, and C_{2^l}, ..., C_{2^{l+1}-1} being the cells on every other level l that do the same when combined. In that case, cell C_x on level l with 2^l ≤ x ≤ 2^{l+1}-1 contains all points p on weight layer ν(i) with an angle 2π(x - 2^l)/2^l ≤ φ_p < 2π(x - 2^l + 1)/2^l.

FIGURE 2.7: A visual help for the GIRG algorithm (radius plotted over angle from 0 to 2π): the weight layers on the left, and an exemplary step of the partitioning process for partitioning P(0,1), involving layers L0 and L1, on the right.

Every cell C on level l of the data structure has a volume of VOL(C) = 2^{-l}, a volume of 1 being defined as a cell containing the entire ground plane.

Every cell in any D_{ν(i)} can be accessed in constant time.

Every point in any cell C_i can be accessed in constant time by virtue of an additional array A[.] which stores a pointer to the k-th point at A[s_i + k], where, with P being the set of all points, s_i := Σ_{j<i} |C_j ∩ P| is the prefix sum of cell C_i, stored in the cell's data structure as a parameter.
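To make the geometric ordering concrete, here is a small sketch of the index arithmetic described above (the function names are ours):

#include <cmath>
#include <cstdint>

// Cells are numbered C_1, C_2, C_3, ... across levels; on level l the cells
// C_{2^l} .. C_{2^{l+1}-1} jointly cover the angular ground plane [0, 2*pi).
const double kTwoPi = 2.0 * std::acos(-1.0);

// Index of the cell on the given level that contains angle phi.
std::uint64_t cellIndex(double phi, unsigned level) {
    const std::uint64_t cellsOnLevel = std::uint64_t{1} << level;  // 2^level cells
    const std::uint64_t offset =
        static_cast<std::uint64_t>(phi / kTwoPi * static_cast<double>(cellsOnLevel));
    return cellsOnLevel + offset;  // the first cell on this level has index 2^level
}

// Angular range [begin, end) covered by cell x on the given level.
void cellRange(std::uint64_t x, unsigned level, double& begin, double& end) {
    const std::uint64_t cellsOnLevel = std::uint64_t{1} << level;
    begin = kTwoPi * static_cast<double>(x - cellsOnLevel) / static_cast<double>(cellsOnLevel);
    end   = kTwoPi * static_cast<double>(x - cellsOnLevel + 1) / static_cast<double>(cellsOnLevel);
}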

After creating such data structures for every weight layer, we create for every pair of levels 1 ≤ i ≤ j ≤ L a partitioning P(i,j) = {(A_1, B_1), ..., (A_s, B_s)} of cells A_i, B_i with

ν(i, j) := w_i · w_j / W    (2.5)

and s = O(1/ν(i, j)). For our special case without probability edges, there are only two possible ways a pair of cells is added to our partitioning:

Any pair of cells (A_i, B_i) with VOL(A_i) = VOL(B_i) whose boundaries touch is part of P(i,j).

Any pair of cells (A_i, B_i) that don't already fall under the above criterion, but whose parents' boundaries touch, is part of P(i,j), the definition of a parent of a cell on level l being the cell on level l-1 which contains the child in its entirety.

For every pair of cells (A, B) ∈ P(i,j), we will now calculate the distances between each vertex u ∈ V_i^A and v ∈ V_j^B and establish an edge {u, v} if and only if the distance dist_H(u, v) (see eq. 1.6) is smaller than or equal to R, the radius of our hyperbolic plane. After this step, our graph will have been generated and we are done with the process.

To make it easier to visualise the procedure, figure 2.7 shows a simplified version of the process: On the left, we have a visual representation of the multiple weight layers, where we can see how each layer encompasses the entire 2π-span, while each higher-numbered layer has fewer cells with larger volume each (volume with regard to the angular width, not the radial height). On the right, we see how adjacent cells are added to the partitioning:

First, calculating ν(0, 1) would tell us that L0 would have to be set to the parent cells one level higher in this particular example, which is why L0 has half the cells with double the volume compared to what was previously seen on the left. Second, with respect to the cell with a yellow border, all its neighbours' borders are colored in red (the leftmost cell's left neighbour is the rightmost cell and vice versa), with arrows pointing from the current cell towards the neighbours with which the current cell will be compared. This will be done for each cell on L1 and each cell on L0, after which we continue with the next two layers' partitioning.

With regard to the linear runtime, the original paper goes into far more detail, but a short summary of the differences between NkGen and GIRG, and their respective runtimes, can be given as follows [BKL15, p.10-11]:

NkGen restricts the number of possible candidates a node v has to query by dividing the hyperbolic plane into multiple bands. Based on those, it calculates angles between which (depending on the radial partitioning chosen) the number of candidate nodes found is expected to be within a constant factor of the number of to-be-established edges, i.e. of the actual neighbours node v would have in the end (see chapter 3.3 and Penschuck [Pen17]). Because we require a binary search per node, and because we sort all nodes anyway, the runtime can be given an upper bound of O(m + n log n).

GIRG on the other hand, while also dividing the plane into multiple bands, creates a partitioning between those bands, where cells from each of those bands are compared to one another. The cell sizes (meaning the number of nodes per cell) are chosen during each partitioning in a way that, even though one compares each node from one cell with each node from another one in a quadratic fashion, the work is still mathematically bounded by a constant factor of the number of edges we would have in between those two cells. As we do not require any kind of sorting or binary search in this endeavor, the runtime stays O(m + n).
LISTING 2.2: Sampling algorithm for GIRG with regard to our special case of no additional random edges, taken from [BKL15]

1 E := ∅
2 sample the positions x_v, v ∈ V, and determine the weight layers V_i
3 for all 1 ≤ i ≤ L do: build data structure D_{ν(i)}({x_v | v ∈ V_i}) with ν(i) := w_i · w_0 / W
4 for all 1 ≤ i ≤ j ≤ L do:
5     construct partitioning P(i,j) with ν(i,j) := w_i · w_j / W
6     for all (A, B) ∈ P(i,j) do:
7         for all u ∈ V_i^A and v ∈ V_j^B: add edge {u, v} to E if distance d(u, v) ≤ R
8     if i = j then remove all edges with u > v sampled in this iteration

Chapter 3

Analysis

In this chapter, we will first analyse the three algorithms and derive their I/O-complexity (it should be noted, though, that we only consider the algorithm and exclude the output in the complexity analysis). In the case of the EM-variant, we will also present a runtime analysis before taking a closer look at the radial partitioning possibilities. As our goal is to compare the three algorithms in a practical benchmark setting, we first have to establish which parameters are most beneficial to the EM-variant. Considering that the partitioning has a large impact on the runtime in general, this analysis and subsequent choice of parameters (band counts, radial partitioning and parallelisation count per band, among others) is a necessary step to optimise our algorithm.

3.1 NkGen: I/O-Analysis


As previously noted during the introduction of the NkGen algorithm, it can be understood as a 4-phase process: setup, generation, sorting, and edge-creation. For the I/O-bound, only the latter three are relevant, as the setup phase only includes data structure creation and the calculation of target radius R, which should not take any relevant number of I/O-operations, as we have a constant number of data structures.

The generation and sorting phases take scan(n) and sort(n) I/O-operations respectively. This is because arrays will be filled with n nodes and the algorithm sorts at worst O(n) nodes. The dominating factor comes from the edge-creation phase, where the algorithm goes through all bands and all nodes therein. On its own, this would imply a bound of O(scan(n)). Because the algorithm performs a binary search once per node, though, it implies at least one random-access jump per node for sufficiently large n. Assuming the binary search performs O(log(n)) steps, all of which jump all over the array, it will take at best Ω(log(n/B)) I/Os to find the two nodes that are closest to φ_min and φ_max. Doing this once per node results in an I/O-complexity of O(scan(n) + n · log(n/B)) = O(n · log(n/B)) for this phase alone. Comparing the respective upper bounds, one can see that the algorithm's I/O-complexity is dominated by its very I/O-intensive edge-creation phase.

In other words, if the nodes do not fit into the main memory, we will have to expect Ω(n) I/O-operations overall (at least one per node), resulting in a very unbeneficial practical runtime under certain use case conditions.

3.2 EM-Variant of NkGen


Before we can get to the analysis of the I/O- and runtime-complexity, there is one particular issue that has to be resolved first, specifically the size of the active-vector. Active is held in memory during the entire course of the edge-creation phase on a band, updated each time by deleting or inserting StartBounds that are kept for comparisons with the current node. Because of this, it is important to know how the active-size is calculated (for the runtime-complexity) and whether its size fits into main memory. As such, we will first analyse active itself.

Since we are working with randomly generated graphs and most of our algorithm
is based upon the randomly generated positioning of the nodes, the active-size is a
random variable. Let us thus begin with a construction of an estimate for active.

Let B_x = (b_x, d_x) be a band, where b_x is the inner, d_x the outer band radius, and where B_x encompasses the radius range [b_x, d_x). Let node v = (φ_v, r_v) be a random node on band B_x. Let φ_min ≤ φ_v < φ_max and |φ_max - φ_min| = 2Δφ_{B_x}(y) (see eq. 1.8) be the maximal angular difference for a different node y's neighbours, where y = (φ_y, r_y) is an arbitrary node creating a query on band B_x. Let also E[|active_{B_x}|] be the expected number of elements in active on band B_x. This number is as such the expected number of StartBounds we have on average whenever a node on band B_x is being picked to traverse active. In case of a random node v, we are thus looking at the average number of query nodes whose φ_min is smaller than φ_v and whose φ_max is larger than φ_v.

Since φ_v is uniformly randomly generated from [0, 2π), the problem of finding this number is thus equivalent to the question of how many nodes have created queries whose range [φ_min, φ_max) covers φ_v on band B_x.

To simplify this further, let us assume we have only one node y that creates a query with a query range of 0 ≤ 2Δφ_{B_x}(y) ≤ 2π. The probability of this point's query to appear in active during point v's turn (v resides on band B_x) is 2Δφ_{B_x}(y)/(2π), as this is the fraction of the area on a given band that is covered by the query. If v is in that area, active will contain y's query.

The next question is how many nodes create such a query. This depends on the probability of nodes being at the exact same radius as y, meaning we can rephrase our question further: For all possible Δφ_{B_x}(y) that can be created on B_x, how many nodes can be expected to create each of these queries?

First off, the expected number Q of nodes that can create queries on a band B_x is the total number of nodes multiplied by the mass between 0 and d:

Let X_i be a random variable with X_i = 1 if node v_i creates a query on band B_x, and X_i = 0 otherwise. Note that a query is created when node v_i's radius is smaller than or equal to band B_x's outer radius d. Thus, let R_i be a random variable representing the random radius of a node v_i. Let P[0 ≤ R_i ≤ d] = ∫_0^d f(r) dr be the probability that R_i is inside the range [0, d]. In that case, the expected number of queries Q on B_x is:

E[Q] = Σ_{i=1}^{n} E[X_i] = Σ_{i=1}^{n} P[0 ≤ R_i ≤ d] = ∫_0^d f(r) · n dr = n · ∫_0^d f(r) dr

To calculate the number of nodes that will be part of v's active, we only need to multiply the number of nodes creating a query of a specified range with the probability of that query appearing in active, which is the aforementioned fraction 2Δφ_{B_x}(y)/(2π). As both the number of query nodes and their probability of appearing in active depend on the query node's radius, the easiest solution is to multiply both inside the integral.

The resulting equation is the following:

E[|active_{B_x}|] := n · ∫_0^d (2Δφ_{B_x}(φ_y, r))/(2π) · f(r) dr    (3.1)
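As an aside, equation 3.1 can be evaluated numerically for a concrete band. The following sketch does so with a simple midpoint rule, where deltaPhi(r) stands for the query width 2Δφ_{B_x}(φ_y, r) of eq. 1.8 and f(r) for the radial density of eq. 1.3 (both passed in as assumptions rather than reproduced here):

#include <cstddef>
#include <functional>

// Midpoint-rule estimate of E[|active_Bx|] from equation 3.1 for a band with
// outer radius d and n nodes in total.
double expectedActiveSize(double d, std::size_t n,
                          const std::function<double(double)>& deltaPhi,  // 2*dphi_Bx(phi_y, r)
                          const std::function<double(double)>& f,        // radial density
                          std::size_t steps = 1 << 16) {
    const double twoPi = 6.283185307179586;
    const double h = d / static_cast<double>(steps);
    double integral = 0.0;
    for (std::size_t i = 0; i < steps; ++i) {
        const double r = (static_cast<double>(i) + 0.5) * h;  // slice midpoint
        integral += deltaPhi(r) / twoPi * f(r) * h;           // integrand of eq. 3.1
    }
    return static_cast<double>(n) * integral;
}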
With this equation in mind, we follow up with a proof of lemma 3, which uses the
EM conventions for variables, as this lemma is important for the I/O-complexity.

Lemma 3. The expected size of active is sublinear in n for any average degree k < n, where n is the number of nodes, for any radial partitioning with at least one band with an inner radius b = y·R > R/2. Thus, active can be expected to be held in our main memory of size M = O(n^{1-α(1-y)} · k^{α(1-y)}) for all 1 - 1/α < y < 1, all α > 1/2, and all k < n.

Proof. Given the equation for E[|active_{B_x}|] with B_x = (b_x, d_x), where b_x is the inner and d_x the outer band radius, we begin by applying the approximations 1.3 and 1.9 and assuming R = O(ln(n/k)) (see lemma 1):

E[|active_{B_x}|] = n · ∫_0^{d_x} (2Δφ_{B_x}(φ_y, r))/(2π) · f(r) dr

= n · ∫_0^{d_x} (2 · 2e^{(R-r-b_x)/2})/(2π) · α e^{α(r-R)} dr

Before we continue, let us note that we can split the active-array into two sets of query nodes:

Q_x := {queries from nodes from band B_x}

Q_{<x} := {queries from nodes from all lower bands B_j | j < x}

Also to remember is the fact that neither of those sets can be larger than the set of all nodes in their respective radii ranges, as there cannot be more queries than double the nodes. I.e., let V be the set of all nodes; then active can be split into the sets active_{B_x} = Q_x ∪ Q_{<x}, where Q_x ∩ Q_{<x} = ∅, with:

|Q_x| = O(|V_{r_x}|) where V_{r_x} := {v ∈ V | b_x ≤ r_v < d_x}

|Q_{<x}| = O(|V_{r_{<x}}|) where V_{r_{<x}} := {v ∈ V | 0 ≤ r_v < b_x}


Maximizing the angle in our equation, specifically maximizing the term e^{(R-r-b_x)/2}, can be done by minimizing the variable r, as the smaller the radius of the node sending the query is, the wider the range its query covers. Were we to calculate the size of Q_x alone, we could thus change the upper equation by changing the range of the integral and taking into account that the smallest radius r a node in V_{r_x} can have is b_x:

E[|Q_x|] = ∫_{b_x}^{d_x} (2 · 2e^{(R-r-b_x)/2})/(2π) · n α e^{α(r-R)} dr    | max_{b_x ≤ r ≤ d_x} e^{(R-r-b_x)/2} = e^{(R-b_x-b_x)/2}

≤ (2 · 2e^{(R-2b_x)/2})/(2π) · n · ∫_{b_x}^{d_x} α e^{α(r-R)} dr = (2 · 2e^{(R-2b_x)/2})/(2π) · n · O(e^{-αR} e^{αd_x})

= O(e^{R/2} · e^{-b_x} · e^{-αR} · e^{αd_x} · n)

Let us assume we have two bands, band B_0 = (0, yR) and band B_1 = (yR, R) with 1/2 < y < 1.

Case B_0:

The number of nodes |V_0| on band B_0 is, after applying 1.5 to its range, expected to be:

E[|V_0|] = O(e^{-αR} · e^{αyR} · n)    | R = O(ln(n/k))

= O((k/n)^α · (n/k)^{αy} · n) = O(k^{α(1-y)} · n^{1-α(1-y)})

Let c_1 = α(1-y); then it follows that O(k^{α(1-y)} n^{1-α(1-y)}) = O(k^{c_1} n^{1-c_1}). In order to see for which k this becomes sublinear, we set the term in an inequality with n:

k^{c_1} · n^{1-c_1} < n    | (...)/n^{1-c_1}

⟹ k^{c_1} < n^{c_1} ⟹ k < n

And since c_1 < 1 for all y > 1 - 1/α, it follows that, under those restrictions for y, this is sublinear in n for all k < n (as a side note, for all α ≤ 2 this applies to all 1/2 < y < 1, which during the later experimental evaluation is always going to be the case for all radial partitionings). As this is in the realm of possibilities, since y < 1, this means we can put all |V_0| nodes from band B_0 into main memory. Since we cannot have more queries than nodes, i.e. |Q_0| = O(|V_0|), and since there are no other bands below it, i.e. |Q_{<0}| = 0, active fits into any memory M = O(k^{α(1-y)} n^{1-α(1-y)}), which is sublinear in n for all 1 - 1/α < y < 1 and all k < n.

Case B_1:

Calculating E[|Q_1|] first will give us the following:

E[|Q_1|] = O(e^{R/2} · e^{-b_1} · e^{-αR} · e^{αd_1} · n) = O(e^{R/2} · e^{-yR} · e^{-αR} · e^{αR} · n)

= O(e^{R/2-yR} · n) = O((n/k)^{1/2-y} · n) = O((k/n)^{y-1/2} · n)

= O(k^{y-1/2} · n^{1-(y-1/2)})

Since 1/2 < y < 1, it follows for k's exponent ex_k = y - 1/2:

0 < ex_k < 1/2

In the same vein, it follows for n's exponent ex_n = 1 - (y - 1/2) = 3/2 - y:

1/2 < ex_n < 1



Note that both exponents are smaller than one and larger than zero. Let c_2 = y - 1/2; then ex_k = c_2 and ex_n = 1 - c_2. In order to see for which k this becomes sublinear, we set the term in an inequality with n:

k^{c_2} · n^{1-c_2} < n    | (...)/n^{1-c_2}

⟹ k^{c_2} < n^{c_2} ⟹ k < n

Thus, it follows that the number of queries from nodes on band B_1 only, i.e. |Q_1| = O(k^{ex_k} n^{ex_n}) = O(k^{c_2} n^{1-c_2}), is sublinear in n for any k < n and for any 1/2 < y < 1, including 1 - 1/α < y < 1.

With regard to the number of queries of nodes from smaller bands, i.e. |Q_{<1}|, we can just keep all nodes from all lower bands (in this case B_0 specifically) inside our main memory, as proven in the former case. All in all, the size of active on band B_1 is expected to be:

E[|active_{B_1}|] = E[|Q_1|] + E[|Q_{<1}|] = E[|Q_1|] + O(E[|V_0|])

= O(k^{y-1/2} · n^{3/2-y} + n^{1-α(1-y)} · k^{α(1-y)}) = O(n^{1-α(1-y)} · k^{α(1-y)})

Since the upper bound here is the number of nodes on band B_0, we apply the same argument to this upper bound with c_1 = α(1-y) as we did earlier with O(|V_0|), meaning that active's size on band B_1 is sublinear in n for all k < n and all 1/2 < 1 - 1/α < y < 1.

With this, we have proven that on both bands, the active-size is sublinear in n for any k < n and for all 1/2 < 1 - 1/α < y < 1, and will thus fit into any memory M = O(n^{1-α(1-y)} · k^{α(1-y)}). Any additional band would follow one of the two cases:

Any additional band B_i = (b_i, d_i) inserted into the radius range of B_0, i.e. b_i < d_i ≤ yR, would follow the proof of case B_0.

Any additional band B_i = (b_i, d_i) inserted into the radius range of B_1, i.e. yR ≤ b_i < d_i ≤ R, would follow the proof of case B_1.

This means that the only condition for this lemma to apply to any radial partitioning is the existence of one band B_x with an inner radius b_x = yR > (1 - 1/α)R and k < n.
At last, we want to show that, with high probability, the actual size is not larger than a constant factor of the expected size given above. Given the probability p = ∫_0^d (2Δφ_{B_x}(φ_y, r))/(2π) · f(r) dr (derived from equation 3.1 by multiplying the density with the query angle) of a random node being part of that neighbourhood, we can construct from this problem a binomial distribution B(n, p), where n is the number of nodes we have. Because we are generally working with larger n, we can approximate the binomial distribution B(n, p) with a normal distribution N(μ, n·p·q).

Thus, our active-size falls under the normal distribution N(μ, σ²) under the assumption that μ = n·p = E[|active_{b_i}|] and σ = √(n·p·(1-p)) = O(√(n·p)) = O(√(E[|active_{b_i}|])). As we are working with a normal distribution, we can use the Empirical Rule to our advantage:

For all expected active-sizes E[|active_{b_i}|] ≥ 9 it holds that

√(E[|active_{b_i}|]) ≤ (1/3) · E[|active_{b_i}|]

meaning that σ = O((1/3) · E[|active_{b_i}|]). The 3σ-interval encompasses 99.7% of all possible active-sizes, with the largest value being:

μ + 3σ = E[|active_{b_i}|] + 3 · (1/3) · E[|active_{b_i}|] = 2 · E[|active_{b_i}|]

Following that, we can assure with a 99.7% chance that the actual active-size is at most twice as large as the expected size. In other words, the above-proven sublinearity holds w.h.p., which finishes our proof.

For an example of the active-size approximately following a normal distribution, we will demonstrate this fact in chapter 3.6.2 by using an approximated equation to calculate the expected active-size. With this, we will overlay a normal distribution with the above properties over actual recorded data of occurring active-sizes during multiple runs in figures 3.2, A.1, and A.2.

3.2.1 I/O-Complexity
We have overall three phases to consider for our possible upper bound on the I/O-complexity: the generation, the sorting and the edge-creation phase.

The generation and sorting phases can be combined into one, as each generated object has to be sorted in the end. For one, we have n nodes to put into our sorters. This alone is an I/O-complexity of O(sort(n)). We also have to sort the bounds, considering each node has at most two bound-pairs on its own band and at most another two pairs on each outer band. Since we thus have more bounds than nodes, we have to assume the bound count is the dominating factor here. Let S be the number of bounds, with S = Ω(n) and S = O(l · n), where l is the number of bands; then our complexity would be O(sort(n) + sort(S)) = O(sort(S)).

In the edge-creation phase, we go through each band and sorter. Since the data is sorted and our algorithm only relies on one additional array (the active-array), we have only two things to consider here: the sorter sizes and the active-size.

Due to lemma 3, we can assume that active fits into our RAM and can thus ignore the data structure in its entirety for the calculation of our I/O-complexity. The sorters are scanned from the smallest to the largest element in each sorter, and because we know that all node sorters combined have n elements and all bound sorters combined have S elements, the complexity here is O(scan(n) + scan(S)) = O(scan(S)).

Overall, the dominating phase with regard to I/O-complexity is the sorting step, as sorting is more complex than scanning alone, meaning our entire algorithm has an I/O-complexity of O(sort(S)).

The upper bound of S depends entirely on the partitioning chosen, and while we will not use it in our final benchmarks, the following lemma 4 proves that it is

possible to choose a partitioning that gives a linear upper bound on S compared to n.

Lemma 4. There exists a radial partitioning by which S = Θ(n), so that our algorithm's I/O-complexity equals O(sort(n)) under that partitioning, regardless of the number of bands. Remark: This partitioning may yield an increased number of false candidates.

Proof. Let l be the number of bands with l > 1. Let C = (c_0, c_1, ..., c_l) be a radial partitioning with bands b_i = (c_i, c_{i+1}) for all 0 ≤ i < l, where band b_0 is the innermost band. Let v(b_i) be the fraction of nodes on band b_i, where a fraction of one would be the entirety of all nodes, while zero would be no nodes at all. Let the partitioning be chosen in a way that divides the nodes onto the bands in the following fashion:

v(b_i) = 1/2^{l-1} if i = 0, and v(b_i) = 1/2^{l-i} otherwise

Since we are working with a randomly generated graph, the fractions should be considered a probability mass. In other words, the expected node count on band i is defined as n_i = n · v(b_i) = μ_i. A node being part of a specific band i with a specified fraction can be considered a Bernoulli experiment, with the probability p being that fraction, the node being part of that band being a positive event (i.e. X_{ji} = 1, if X_{ji} is the random event of node j being on band i), and it being part of another band being a negative event (i.e. X_{ji} = 0). Under those definitions we get E[Σ_{j=1}^{n} X_{ji}] = O(n · v(b_i)), where Chernoff's inequality gives us the same bound with high probability:

P[Σ_{j=1}^{n} X_{ji} > (1+δ) · n · v(b_i)] ≤ exp(-(δ/3) · n · v(b_i))    | v(b_i) = O(0.5)

≤ exp(-(δ/6) · n)    | δ = 6

≤ exp(-n) = 1/exp(n)

In other words, the probability that we overshoot the expectation by a considerable margin falls exponentially with n. Thus, we assume that the above partitioning will yield a node count per band close to the estimate of v(b_i) · n. Following that, this partitioning distributes all of our n nodes onto the l bands, as the following steps will prove:

Σ_{i=0}^{l-1} n_i = Σ_{i=0}^{l-1} v(b_i) · n = n · Σ_{i=0}^{l-1} v(b_i) = n · (1/2^{l-1} + Σ_{i=1}^{l-1} 1/2^{l-i})

= n · (1/2^{l-1} - 1/2^0 + Σ_{i=0}^{l-1} 1/2^i) = n · (1/2^{l-1} - 1/2^0 + (1 - 1/2^l)/(1 - 1/2))

= n · (1/2^{l-1} - 1 + 2 · (1 - 1/2^l)) = n · (1/2^{l-1} - 1/2^{l-1} - 1 + 2)

= n

Every node on band i will create at most one bound for its query on each outer band, in addition to its own band. Looking at it the other way around, this means every band i will have at most one bound per node that exists on each band j with 0 ≤ j ≤ i. Thus, let s_i be the number of bounds on band i with s_i = Σ_{j=0}^{i} n_j. The sum of all s_i will be bounded by a constant factor of n regardless of l, as the following steps will prove:
S = Σ_{i=0}^{l-1} s_i = Σ_{i=0}^{l-1} Σ_{j=0}^{i} n_j = Σ_{i=0}^{l-1} Σ_{j=0}^{i} n · v(b_j) = n · Σ_{i=0}^{l-1} v(b_i) · (l - i)

Σ_{i=0}^{l-1} v(b_i) · (l - i) = v(b_0) · l + Σ_{i=1}^{l-1} 1/2^{l-i} · (l - i), so:

S = n · (1/2^{l-1} · l + Σ_{i=1}^{l-1} 1/2^{l-i} · (l - i))

= n · (1/2^{l-1} · l + Σ_{i=1}^{l-1} 1/2^i · i) = n · (1/2^{l-1} · l - 1/2^{l-1} · 0 + Σ_{i=0}^{l-1} 1/2^i · i)

= n · (1/2^{l-1} · l + ((l-1) · 1/2^{l+1} - l · 1/2^l + 1/2) / (1/2 - 1)^2)

= n · (l · 1/2^{l-1} + (l-1) · 1/2^{l-1} - l · 1/2^{l-2} + 2)

= n · ((2l - 1) · 1/2^{l-1} - l · 1/2^{l-2} + 2)

= n · ((l - 1/2) · 1/2^{l-2} - l · 1/2^{l-2} + 2)

= n · (2 - (1/2) · 1/2^{l-2}) = n · (2 - 1/2^{l-1})

= O(n · 2) = O(n)

No matter how large l > 1 is, the upper bound will always be O(n). And since the lower bound will never be less than Ω(n) (because every node will have at least one bound pair on its own band), we conclude this proof with S having a tight bound of Θ(n) under the above partitioning.
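A quick numeric check of this partitioning (a sketch; the function names are ours) confirms both identities from the proof, Σ_i v(b_i) = 1 and S/n = 2 - (1/2)^{l-1} < 2:

#include <cmath>
#include <vector>

// Node fractions v(b_i) from lemma 4 for l > 1 bands.
std::vector<double> fractions(unsigned l) {
    std::vector<double> v(l);
    v[0] = std::pow(0.5, l - 1);
    for (unsigned i = 1; i < l; ++i) v[i] = std::pow(0.5, l - i);
    return v;
}

// Evaluates S/n = sum_i v(b_i) * (l - i); approaches 2 as l grows.
double boundsPerNode(const std::vector<double>& v) {
    const unsigned l = static_cast<unsigned>(v.size());
    double s = 0.0;
    for (unsigned i = 0; i < l; ++i) s += v[i] * static_cast<double>(l - i);
    return s;
}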

3.3 Runtime Analysis


As the algorithm is a modification of NkGen, the only differences between the two are as follows:

NkGen sorts only the nodes, while the EM-variant has to sort the nodes and all Bound-objects. The number of Bound-objects, though, can be given an upper bound in terms of the number of nodes, either depending on the radial partitioning scheme (see lemma 4) or by choosing a constant band number l. Thus, the difference is O(n log(n)) vs. O(l · n log(n)), or, with a constant band count, O(n log(n)).

NkGen has, during its edge-creation phase, a runtime that depends on two aspects: the number of comparisons between nodes and neighbour candidates, and the number of binary search operations one needs to find those candidates. The former, while not proven, has been empirically analysed by the authors to be linear in the number of edges [LLM16, p.5]. In fact, it has been shown that with a uniform radial partitioning (where, with l the number of bands, band i's inner radius is b = R · i/l, while its outer radius is d = R · (i+1)/l), the number of neighbour candidates per node v chosen by the algorithm differs only by a constant factor c from the actual number of neighbours that v has (see Penschuck [Pen17]). Since the minimal number of comparisons per node in a graph with an average degree of k is Ω(k), and since the aforementioned fact gives an upper bound of O(c · k) = O(k), such an equidistant radial partitioning has a runtime of O(m). Regardless of this fact, the number of comparisons during the run of the same radial partitioning in NkGen and the EM-variant respectively does not differ, as the calculations are basically the same (since the Poincaré model only improves a constant factor, but does not change the asymptotic runtime). NkGen additionally employs a binary search once per node, which takes O(n log(n)) steps in total. Thus, if we denote the asymptotic number of comparisons by an arbitrary term O(C), then NkGen's runtime during this phase is O(C + n log(n)) vs. O(C) for the EM-variant, as we only traverse the nodes without additional searches.

Let O(C) be the upper bound on the number of comparisons. In total, this means that the EM-variant has an overall runtime of O(C + l · n log(n)), compared to NkGen's O(C + n log(n)), where O(C) is empirically linear in m [LLM16, p.5].

3.4 GIRG: I/O-Analysis


Taking a look at the GIRG algorithm, one can see immediately that the majority of its I/O-operations happen during the main loop beginning at line 4 in listing 2.2. The sampling process for the n nodes alone would imply O(scan(n)); the same goes for the creation of the L = log(n) data structures D_{ν(i)}, which store a pointer per node in an array A[.] as large as the number of nodes we have, i.e. n.

What is left is the main loop, which includes the construction of partitionings between multiple layers and the subsequent traversal of those as well. While the number of partitioning combinations is quadratic in the number of layers L, the number of I/O-operations might have a better I/O-complexity, depending on how we traverse them. If we traverse all partitioning pairs in a concurrent manner where we always keep the largest cell in memory, the best I/O-complexity we could achieve would be O(scan(n)):

Each layer i contains nodes whose weight w_v = e^{(R-r_v)/2} is between w_{i-1} and w_i, and each w_i is defined as w_i := 2w_{i-1}, i ≥ 1. In turn, this means that a node with weight w_i has a radius of r_i = R - 2 ln(w_i). Were we to double the weight to w_{i+1} = 2w_i, a node with this weight would have the radius r_{i+1} = R - 2 ln(2w_i).
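In code, this weight/radius correspondence is just (a trivial sketch, function names ours):

#include <cmath>

// w = e^{(R - r)/2} and, conversely, r = R - 2 ln(w).
double weightFromRadius(double r, double R) { return std::exp((R - r) / 2.0); }
double radiusFromWeight(double w, double R) { return R - 2.0 * std::log(w); }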

Applying those radii for two subsequent layers i and i+1, with ranges [r_{i-1}, r_i) and [r_i, r_{i+1}) respectively, to the density mass calculation 1.5, we would get:

mass(layer_i) = e^{-αR} · (e^{αr_i} - e^{αr_{i-1}}) = O(e^{-αR+αr_i})

mass(layer_{i+1}) = e^{-αR} · (e^{αr_{i+1}} - e^{αr_i}) = O(e^{-αR+αr_{i+1}})

Calculating the ratio of density masses, while inserting the above definitions, we arrive at:

mass(layer_i) / mass(layer_{i+1}) = O(e^{-αR+αr_i} / e^{-αR+αr_{i+1}}) = O(e^{α(r_i - r_{i+1})})

= O(e^{α((R - 2 ln(w_i)) - (R - 2 ln(2w_i)))}) = O(e^{α(2 ln(2w_i) - 2 ln(w_i))})

= O(e^{2α(ln(2w_i) - ln(w_i))}) = O(e^{2α ln(2)}) = O(4^α)

In other words, there is a geometric distribution of nodes between the layers, where the number of nodes at least doubles from each layer to the next-larger one (since α > 1/2). Let us assume that the number doubles exactly, i.e., the node mass per layer resembles a geometric partitioning of mass(layer_i) = 1/2^i, for all 1 ≤ i ≤ L. In that case, the sum over all layers' nodes is just a constant factor of n, specifically O(2n) (see the proof of lemma 4).

Coming back to the I/O-complexity: For ease of understanding, let us assume that each cell in the cell partitionings P(i,j), 0 ≤ i ≤ j ≤ L, fits into a block of B nodes. In order to be as efficient as possible, we would load the first cell/block of B nodes from the largest layer, layer 1, and compare these nodes with the first three adjacent cells/blocks from each layer 1 through L. We would then continue with the second block on layer 1, the third, and so on. After that, we would compare layer 1 with itself, comparing each block with every adjacent block. Let n_i be the expected number of nodes on layer i, with Σ_{i=1}^{L} n_i = n. The comparison between layer 1 and all the others would then take O(L + (n_1 + n_2 + ... + n_L)/B) = O(L + n/B) I/O-operations, while the later comparison of the layer with itself would take another O(n_1/B); the added L comes from the fact that we have to traverse each layer at least once. This holds for every subsequent layer we are going through. The total I/O-cost I of the entire procedure would thus be:
$$I = \sum_{i=1}^{L}\left(i + \frac{n_i}{B} + \sum_{j=i}^{L}\frac{n_j}{B}\right)$$

Since the summation of each subsequent $n_j$ resembles a geometric sum, we can give
an upper bound for the inner sum, and furthermore an upper bound on the outer
sum as well, resulting in:

$$I = \sum_{i=1}^{L}\left(i + \frac{n_i}{B} + O\!\left(2\,\frac{n_i}{B}\right)\right) = \sum_{i=1}^{L} O\!\left(i + 3\,\frac{n_i}{B}\right) = O\!\left(L^2 + 6\,\frac{n}{B}\right) = O\!\left(L^2 + \frac{n}{B}\right)$$

In other words, under the condition that we traverse the partitioning in
this specific order, we would at worst have an I/O-complexity of $O(L^2 + n/B) =
O(\log^2(n) + \mathrm{scan}(n)) = O(\mathrm{scan}(n))$. Since every node has to be compared at least
once, this is also our best-case scenario, i.e. $O(\mathrm{scan}(n)) = \Theta(\mathrm{scan}(n))$. If this order
is not taken into account, though, we might at worst have to load the first, largest
layer multiple times, specifically once per other layer j we are comparing it with.
This would result in a final upper bound of $O(L \cdot n_1/B) = O(\log(n) \cdot n/B)$, since
we have $L = \log(n)$ layers.

3.5 Comparison
Table 3.1 summarises all findings up until now, where we take the empirically based
assumption of NkGen's runtime [LLM16, p.5] as valid. As a reminder, the I/O-
complexity does not include any I/O-operations regarding the graph's final output:

Algorithm                | Runtime Complexity          | I/O Complexity
NkGen                    | O(m + n log(n))             | O(n · log(n/B))
GIRG                     | O(m + n)                    | between Θ(sort(n)) and O(log(n) · sort(n))
EM-variant               | O(m + l · n log(n))         | O(l · sort(n)) = O(l · (n/B) · log(n/B))
EM-variant (constant l)  | O(m + n log(n))             | O(sort(n)) = O((n/B) · log(n/B))

TABLE 3.1: Summary of the Runtime and I/O-Complexity

From the runtime complexity alone, we can see that the algorithms are very similar
amongst one another. GIRG does not have the problem with sorting, which is
why the others have an additional O(n log(n)) added onto the runtime. In the
EM-variant's case it is even $O(l \cdot n \log(n))$, whereas NkGen only sorts the n nodes:
regardless of how many bands one might have, the sum of all nodes will always
be n. Not so for the EM-variant, as we do not only sort the n nodes, but also
all StartStopBounds, the sum of which is at least n and most likely above that, as
each band i always receives the bounds of all nodes from previous bands (i.e. the
bounds from all nodes on all bands j with j < i) in addition to the ones created
by the nodes on band i itself.

In other words, for small graphs it might be worse for the EM-variant to have a high
number of bands, while for large ones it will be irrelevant if m is sufficiently large as
well.

In regards to the I/O-complexity, the EM-variant has the biggest advantage, as only
the sorting phase holds it back, and even that has a low enough complexity to compete
with the other alternatives. GIRG's problem lies in its partitioning, as the traversal
order of the many cells we are comparing matters. At best, we might have a traversal
order that takes I/O-efficiency into account and keeps the temporal locality principle
(i.e., that currently used data is likely to be needed again sooner rather than later)
in mind [MSS03, p.9]. Reusing the data loaded into RAM as much as possible, according
to the traversal order mentioned in chapter 3.4, would result in an I/O-complexity
of O(scan(n)). If that is not the case, though, we will have a lot more unstructured
access between the multiple layers, resulting in a worse I/O-complexity than the
EM-variant's. NkGen on the other hand has its own issue with the binary search
that is performed once per node, resulting in a factor before the logarithm that is
potentially a magnitude higher than the one for the EM-variant (depending on the
block size B).

All in all, from the theoretical analysis we expect GIRG and NkGen to excel at
smaller graphs that still fit into memory, as that is where their runtime complexities
should dominate the practical runtime. Once we enter the external-memory
environment, it should be either the EM-variant or GIRG that has an advantage over
the others in regards to the I/O-complexity, depending on the GIRG algorithm's
internal traversal order.

3.6 Radial Partitionings


Since the radial partitioning has an impact on the runtime of our algorithm, we will
detail the partitionings we have chosen and analyse their efficiency, which is the
main focus of this section.

3.6.1 Overview
As previously mentioned, we used a geometric partitioning for our bands based on
Looz et al.'s empirically chosen parameters [LLM16, p.3], so as to level the grounds
on which we compare the algorithms in general: using different kinds of
partitioning methods might induce a bias in the benchmarking setup if we are
trying to isolate the effectiveness of the EM-implementation alone.

On the other hand, our parallelisation scheme differs from the original algorithm:
Where in the original each node was given to a different thread, we cannot do so
since the sorters are not thread-safe. And since, in our case, we distribute band
segments to threads instead, the partitioning choice might have a larger impact on
the parallelised runtime in a direct comparison as well. As such, we chose to put
some focus on this part of the algorithm too.

In regards to dividing the hyperbolic plane in the case of our algorithm specifically,
there are only two general parameters we are concerned with.

The first parameter is the partitioning C of the band radii $c_0, \ldots, c_l$ for our l
bands, where every band i is defined by its inner radius $c_i$ and outer radius
$c_{i+1}$ (in particular, $c_0 := 0$).

The second parameter is the number of bands l itself. Every additional band adds
an additional query-creation and sorting step, with the number of query bounds
we have to sort being bounded in size by $O(n \cdot l)$, as mentioned during our
I/O-complexity analysis of the EM-variant. Looz et al. noted that for their bench-
mark environment, it was beneficial to make the number of bands dependent on the
number of nodes, specifically l = log(n) [LLM16, p.2-3]. Whether this is a sensible
choice for the other radial partitionings is something we will look into as well.

Apart from the geometrical approach taken from Looz et al., we came up with a
couple of other possible partitioning choices that we will go through and analyse
later on, one after the other:

- An equalised workload approach, by which each band has an approximately
  equal amount of work to go through.

- A minimised workload approach, by which the summed-up workload spread
  across all bands is as minimal as possible for a given number of bands.

The reason for both partitioning choices is as follows: The minimised workload
approach should yield a partitioning by which, if we were to go through all bands
one by one, we would minimise the runtime as well. The equalised workload
approach on the other hand was chosen because, while the minimised approach
would be faster in a single-threaded environment, it might be more beneficial
for a parallel work setting to choose a partitioning that spreads out the workload
better over multiple threads.

Regardless of the approach, the bigger problem is the question of how one
creates a partitioning whose workload subdivision is chosen beforehand. For
that, we devised a way to estimate the expected workload of a given partition-
ing, which will be explained in the next section.

3.6.2 Estimating the Expected Workload of a Partitioning


Both of our approaches are based on a numerical solver using the expected amount
of work on a band as a function of the band's radial boundaries and the properties
of our hyperbolic graph:

Let us assume, disregarding the sorting step, that we want to predict the amount
of work band i between radii $c_i$ and $c_{i+1}$ has. Our main algorithm goes through
every node v in band i and checks, for every potential neighbour candidate of that
node, whether an edge has to be established. If one recalls, the candidates are
located in the active array, through which we have to go once per node. As such,
the size of active is the number of potential neighbours we have to verify for
edge-compatibility.

With this in mind, let $n_i$ be the number of nodes on band i and let $E[k_i]$ be the
expected number of potential neighbours each of the $E[n_i]$ nodes has to verify. The
work $w_i$ on band i would thus be defined as $w_i = E[n_i] \cdot E[k_i]$.

$E[n_i]$ is comparatively straightforward: Using the radial density function for the
nodes' radius distribution, we can calculate its integral and with it the density mass
$m_i$ residing on band i, meaning $E[n_i] = n \cdot m_i = n \cdot \mathrm{mass}(c_i, c_{i+1})$ (see eq. 1.4).

$E[k_i]$ on the other hand is more complicated, though we have already given the
answer during our theoretical active-size analysis in chapter 3.2. As a quick
reminder, the active array holds all potential neighbours that any current node has
to be checked against to verify whether or not an edge has to be established. In
other words, $E[k_i]$ corresponds to the expected size of the active array on band $b_i$,
meaning that we can use equation 3.1 and the thought processes behind it for our
expected workload calculation.

This results in the following formula that should give us an expected size for active,
where the point q with radius r is a query point on band $b_i$, $\Delta\theta(q, c_i)$ is the
calculated angle the query point q would have on a band with inner radius $c_i$
(see eq. 1.8), and $f(r)$ is the density function used for the radius distribution
(see eq. 1.2):

$$E[k_i] = \int_0^{c_{i+1}} \frac{\Delta\theta(q, c_i)}{2\pi} \cdot f(r) \cdot n \; dr \tag{3.2}$$

With this active-size-proxy we can approximate the work on any given band if we
know its boundaries with the following formula:

Given the band radii $C = (c_0, \ldots, c_l)$, with $c_0 := 0$, the entire workload W for our
main algorithm can be calculated as follows:

$$W = \sum_{i=0}^{l-1} w_i = \sum_{i=0}^{l-1} n_i \cdot E[k_i]$$
$$= \sum_{i=0}^{l-1}\left( n \cdot \mathrm{mass}(c_i, c_{i+1}) \cdot \int_0^{c_{i+1}} \frac{\Delta\theta(q, c_i)}{2\pi}\, f(r)\, n \; dr \right)$$
$$= \sum_{i=0}^{l-1} n \cdot \mathrm{mass}(c_i, c_{i+1}) \cdot \frac{2n\alpha}{\pi(2\alpha - 1)}\left(e^{\frac{1}{2}(2\alpha - 1)c_{i+1}} - 1\right) e^{-\frac{1}{2}(c_i + 2\alpha R - R)} \tag{3.3}$$

As one can see, since our numerical methods require multiple evaluations of this equa-
tion, we chose to use the approximations (see eq. 1.3 and 1.9) instead of the actual
equations (eq. 1.2 and 1.8) to cut down the runtime of the calculation. We found out
during initial testing that our proxy usually delivered inaccurate, too-large results
on the inner bands and for smaller $\alpha$, as figure 3.1 shows. This is mostly because, as
mentioned in chapter 1.2.3 with the introduced inequality 1.10, for smaller $r_v$ and
$c_i$ we get degrees that are too large. Using an integral does not change that fact,
especially if we calculate the workload for lower bands (i.e. lower $c_i$) and have a lot
more points with lower $r_v$ than larger ones, as is the case with smaller $\alpha$ settings.

To combat this, we changed equation 3.3 to take this discrepancy into account, and
split the integral along a border corresponding to inequality 1.10. With this ap-
proach, we devised a better approximation of active's size, namely $\bar{\bar{E}}[k_i]$, where the
second bar indicates the improved version of our estimate calculation:
$$W = \sum_{i=0}^{l-1} n_i \cdot \bar{\bar{E}}[k_i]$$

where, with the border $s = R - 2\log(\pi/2) - c_i$ (from inequality 1.10):


(
n eR (e(ci+1 1), if ci+1 s
E[ki ] = R (s (3.4)
ne (e 1) + E[ki2 ], otherwise

where

$$E[k_{i_2}] = \frac{2n\alpha}{\pi(2\alpha - 1)}\, e^{-c_i/2} \left(e^{\frac{1}{2}(2\alpha - 1)(c_{i+1} - R)} - e^{\frac{1}{2}(2\alpha - 1)(s - R)}\right) \tag{3.5}$$
2 1
In other words, whenever the border s is larger than or equal to our current band's
outer radius $c_{i+1}$, we know that the query angle for each node between radii 0 and
$c_{i+1}$ is going to encompass the entire band, i.e. it will be $2\pi$. In that case, we can
strike the probability $\Delta\theta(q, c_i)/2\pi$ out of the integral, as it will equal practically
1 over the entire integral (even if the approximation would have given us a larger
number, which would not make much sense here). In that case, the approximated
average neighbourhood size of any node on band i (i.e. the expected active-size on
band $b_i$), namely $\bar{\bar{E}}[k_i]$, will be equal to the number of all nodes between radii
0 and $c_{i+1}$.

In case $c_{i+1}$ is larger than s, we split the integral into two parts: one part in the
range $[0, s]$, and one part in the range $[s, c_{i+1}]$. The first part is calculated similarly
to the case where $c_{i+1} \leq s$, the only difference being that we are integrating between
0 and s, not 0 and $c_{i+1}$. The second part is the integral between s and $c_{i+1}$, where
the probability has not been struck out of the integral, resulting in equation 3.5.

[FIGURE 3.1: Bar graph with the actual compare count, the estimate via equation 3.3, and the estimate via equation 3.4; comparison count (workload) per band number.]

Figure 3.1 shows the differences between the estimated workload of equation 3.3
and that of equation 3.4 with 3.5: the problems in lower $\alpha$- and band-regions
disappear when equation 3.4 is used, as opposed to the unaltered approximation.
Also of note are the negligible differences on higher bands, where the approximation
is close to the actual results regardless of the equation chosen.
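To make the proxy tangible, here is a minimal C++ sketch of the improved estimate from equations 3.4 and 3.5. All names are our own illustration; it assumes the approximations from chapter 1 (density $f(r) \approx \alpha e^{\alpha(r-R)}$, cumulative mass $\approx e^{\alpha(r-R)}$) and the border s exactly as reconstructed above, not the EM-variant's actual sources:

```cpp
#include <cmath>

// Improved expected active-size on band i (cf. eqs. 3.4/3.5); names are ours.
// c_in/c_out: inner and outer radius of band i, R: disk radius, alpha > 1/2,
// n: node count.
double expectedActiveSize(double c_in, double c_out, double R, double alpha, double n) {
    const double PI = 3.141592653589793;
    // Border s (inequality 1.10): below radius s, a query on this band spans
    // the full 2*pi angle, so the angle term drops out of the integral.
    const double s = R - 2.0 * std::log(PI / 2.0) - c_in;
    // Number of nodes with radius <= x, via the cumulative mass approximation.
    auto nodesBelow = [&](double x) {
        return n * std::exp(-alpha * R) * (std::exp(alpha * x) - 1.0);
    };
    if (c_out <= s)                 // full coverage: every node below c_out counts
        return nodesBelow(c_out);
    // Partial coverage: full count up to s, plus the angular integral (eq. 3.5).
    const double pre = 2.0 * n * alpha / (PI * (2.0 * alpha - 1.0));
    const double tail = pre * std::exp(-c_in / 2.0) *
        (std::exp(0.5 * (2.0 * alpha - 1.0) * (c_out - R)) -
         std::exp(0.5 * (2.0 * alpha - 1.0) * (s - R)));
    return nodesBelow(s) + tail;
}
```

The total workload proxy of a partitioning is then $\sum_i E[n_i] \cdot$ expectedActiveSize$(c_i, c_{i+1}, \ldots)$ with $E[n_i] = n \cdot \mathrm{mass}(c_i, c_{i+1})$, just as in equation 3.3.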

To visualise the nature of active's size in general, figure 3.2 (and similarly figures A.1
and A.2 in the appendix) shows the approximated active-size using those same
equations with their respective normal distribution, following the logic and definition
outlined in chapter 3.2. To quickly summarise those here again: The event of a query
node appearing in active on band $B_x = (c, d)$ can be shown to follow a binomial
distribution by the fact of having two possible outcomes during a random node's
comparison/edge-establishing phase: a query node is either in active or not, with
constant probabilities each. Specifically, the probability of such a query node to
appear under those conditions in active is $p = \int_0^d \frac{\Delta\theta(q, r)}{2\pi}\, f(r)\, dr$, while
the probability of this event not occurring is $q = 1 - p$. Because of that, we can
approximate this binomial distribution by a normal distribution $N(\mu, \sigma^2)$ with
$\mu = E[|\mathrm{active}_{b_i}|] = k_i = \sigma^2$.
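A quick justification (our own added step) for setting the variance equal to the mean; it relies on the inclusion probability p being small on any single band:

$$|\mathrm{active}_{b_i}| \sim \mathrm{Bin}(n, p), \qquad \mu = np = k_i, \qquad \sigma^2 = np(1-p) \approx np = k_i \quad (p \ll 1).$$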

The figures show the actually occurring active-sizes during the comparison phase:
whenever our algorithm works on a node-token and has to traverse active, we
recorded active's size (x-axis) together with the number of occurrences (y-axis).
On each figure we can see three normal distributions and bar sets: one for
a run with average degree k = 20, one with k = 200, and one with k = 500.
The node count is set to $n = 10^6$ on every run, the band count is always 13
with a geometric partitioning with p = 0.9, and the recorded data is always
taken from the active-sizes occurring on the 8th band. Figure A.1 has
$\alpha = 0.51$, figure 3.2 has $\alpha = 0.75$, and figure A.2 has $\alpha = 1.1$. In other words, we
see the active-sizes on band 8 for different $\alpha$ and average degrees, all else being equal.

[FIGURE 3.2: Actual occurrence count of active-sizes on the 8th band, overlayed with the normal distributions one would get by using eq. 3.4; $\alpha$ is set to 0.75. Overlays: N(42, 42) for k = 20, N(299, 299) for k = 200, N(652, 652) for k = 500.]

[FIGURE 3.3: Bar graphs with the actual compare count and the estimate via equation 3.4; $\alpha$ = 0.75 (left), $\alpha$ = 1.1 (right).]

The bars are overlayed by normal distributions as just defined, with the mean
calculated by equation 3.4 and the variance set to the same value. As one can see,
the lines almost completely overlap the bars in their entirety (with slight
discrepancies for lower $\alpha$), further supporting that our assumption in chapter 3.2
of estimating the upper bound of active-sizes by way of normal distributions was
indeed correct.

As an overall example of the estimation function, the earlier figure 3.1 and the
bar graphs in figure 3.3 together show three runs with $n = 10^6$ nodes, k = 200, and
$\alpha \in \{0.51, 0.75, 1.1\}$ respectively. For each band, we see an entry of the actual num-
ber of distance calculations/comparisons and the estimated workload calculated by
our equation. As is also visible here, the estimation function works for any $\alpha$ with
slightly varying, although overall close enough, accuracy.

3.6.3 Equalised Workload

As mentioned before, we used numerical methods for both of our approaches. For
our equalised workload approach, we ignored the first band that involves a clique of
nodes, due to the queries' reach there being $2\pi$ for all queries with radius $r \leq R/2$.

[FIGURE 3.4: Bar graph with the actual compare count and the estimate via equation 3.4, showcasing the equalised workload partitioning; $\alpha$ = 0.51, n = 10^6, k = 500.]

After that, we chose a sufficiently small number of work-units x which we used as a
target: we calculate through all radius possibilities for our second band until we hit
one that is close enough to that number. Repeating this for all l − 1 bands results in
one of two outcomes (a code sketch follows after the list):
- The differences between the work-units on each band are smaller than an ar-
  bitrary threshold t with t > x, meaning the work-distribution was approxi-
  mately the same.

- At least one of the bands had more work-units than our threshold would allow
  for, meaning that an equalised workload was impossible for a target around x.
  As such, x has to be increased by an arbitrary amount, after which we try all
  of this again until we find an equalised workload.
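The authoritative pseudocode is in listings B.4 and B.5; the following C++ sketch only mirrors the loop structure described above under our own simplifications (a hypothetical workOnBand() proxy, a fixed step width, no degenerate-coverage guards):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical proxy: expected comparisons on a band [c_in, c_out), cf. eq. 3.4.
double workOnBand(double c_in, double c_out);

// Sketch of the equalised-workload search (cf. listings B.4/B.5): widen each
// band until it holds roughly `target` work-units; if the split fails, raise
// the target and try again.
std::vector<double> equalisedRadii(int l, double R, double target, double step) {
    while (true) {
        const double threshold = 1.10 * target;   // arbitrary tolerance t > x
        std::vector<double> c{0.0, R / 2.0};      // band 0 is fixed to [0, R/2)
        for (int i = 1; i + 1 < l; ++i) {         // bands 1 .. l-2
            double outer = c.back();
            while (outer < R && workOnBand(c.back(), outer) < target)
                outer += step;                    // widen band i to ~target work
            c.push_back(std::min(outer, R));
        }
        c.push_back(R);                           // band l-1 takes the remainder
        if (workOnBand(c[l - 1], R) <= threshold) // outcome 1: near-equal split
            return c;
        target *= 1.10;                           // outcome 2: raise x, retry
    }
}
```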
Figure 3.4 shows how well a partitioning given by this algorithm works in practice.
Except for band 0, which is always set to be between 0 and R/2 regardless of work-
load, not only are the estimated workloads almost equal on all bands, the actual
comparison count is just as equally distributed amongst them. A pseudocode for
this entire process can be found in the appendix at listings B.4 and B.5.

3.6.4 Minimised Workload

Just as before, we are setting the innermost band's outer radius to R/2. But in
contrast to the previous approach, calculating the minimised workload on the
leftover bands is, strictly speaking, less arbitrary, but more time-consuming from a
problem-complexity viewpoint. The only apparent property we could gather from
the data about partitionings which yield a minimised workload was a seemingly
logarithmic partitioning, an assumption that did not provide sufficient help in
creating a closed-form expression. Thus, in terms of the problem complexity of
finding a partitioning with a minimal workload, we could only work with it as an
$O(a^l)$-complex problem, a being a constant for accuracy (as will be understood,
given the following algorithm), and l being the number of bands.

At first, we tried finding a result with the calculation from 3.4 by dividing the plane
with radius R into an arbitrarily chosen number a of possible band radii.

[FIGURE 3.5: Downward trend of each band's outer radius, calculated by our algorithm for an arbitrary run of n = 10^6, k = 10 and $\alpha$ = 2, dependent on the number of bands. Because the outer radius of band 0 is always set manually to R/2, it is not shown.]

This way, our search space is discrete and has a finite number of combinations.
Using an exhaustive search and trying out all combinations for two or three bands
to find the one with the smallest workload is still done in a fraction of a second,
depending on the accuracy a that is set. Unfortunately, we rather quickly reach a
band count that takes multiple seconds, which is too high for any reasonable
feasibility during smaller graph creation. One aspect of the calculated results
that might indicate some kind of dependency, though, was the following: given a
result $\{c_0^{(3)}, c_1^{(3)}, \ldots, c_3^{(3)}\}$ ($c_i^{(x)}$ being the i-th radius of a solution with x bands)
for our problem with three bands, if we were to calculate the result for the same
graph with the same properties, only now with four bands, with the result being
$\{c_0^{(4)}, c_1^{(4)}, \ldots, c_4^{(4)}\}$, it turned out that all radii with the same index from
the problem with four bands were always smaller than or equal to the radii
from the problem with three bands.

In other words, for all $i < j$ it holds that $c_x^{(j)} \leq c_x^{(i)}$, and specifically $c_i^{(i)} \leq c_j^{(j)} \leq R$.
This fact held true for multiple benchmarks, and is probably based on the monotonic
property of the nodes' radial density function.

Thus, we changed our algorithm to test only combinations that fit the above criteria:
After calculating the minimised workload partitioning for four bands with the
exhaustive search, we decreased the search area to only include radii smaller than
the earlier result for all respective radii of the same index. In order to speed up the
algorithm even further and avoid relying on a quadratic runtime, we additionally
altered the exhaustive search into a greedy-based one:

Assuming that the answer for a problem with i bands always has every radius lower
than for i − 1 bands, we let the algorithm set all radii at the start of our next
iteration to the previous solution and inserted the newest addition $c_i^{(i)}$ at R. Starting
with $c_0^{(i)}$, we decreased the radius incrementally (each increment t being R/a big)
as long as the summarised workload was decreasing as well. The moment the
workload would increase again after s steps, we would set $c_0^{(i)} = c_0^{(i)} - (s - 1) \cdot t$
as the new radius, after which we did the same for all the other radii, one after the
other. After doing this once for every radius, we start over with $c_0^{(i)}$ and continue
the entire process until no further improvements on the workload can be made by
decreasing any of the radii. Depending on the accuracy parameters set, this process
does not take longer than a second. A pseudocode for this process can be found in
the appendix as well, in listing B.6.
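Listing B.6 in the appendix holds the authoritative pseudocode; as an illustration only, the described greedy coordinate descent could look as follows in C++ (workOnBand() and totalWork() are hypothetical helpers wrapping the proxy from equation 3.4):

```cpp
#include <vector>

double workOnBand(double c_in, double c_out);   // hypothetical proxy (eq. 3.4)

// Total estimated workload of a partitioning c = (c_0, ..., c_l).
double totalWork(const std::vector<double>& c) {
    double W = 0.0;
    for (std::size_t i = 0; i + 1 < c.size(); ++i)
        W += workOnBand(c[i], c[i + 1]);
    return W;
}

// Greedy descent (cf. listing B.6): sweep over the free radii, decreasing each
// by t = R/a while the total workload keeps shrinking, and repeat the sweeps
// until a full pass yields no further improvement.
void greedyMinimise(std::vector<double>& c, double R, int a) {
    const double t = R / a;                       // step size from accuracy a
    bool improved = true;
    while (improved) {
        improved = false;
        // c[0] = 0, c[1] = R/2 and c[l] = R stay fixed; all others are free.
        for (std::size_t i = 2; i + 1 < c.size(); ++i) {
            while (c[i] - t > c[i - 1]) {         // keep the radii ordered
                const double before = totalWork(c);
                c[i] -= t;
                if (totalWork(c) >= before) {     // workload rose: undo last step
                    c[i] += t;
                    break;
                }
                improved = true;
            }
        }
    }
}
```

Here c would be initialised with the previous (i − 1)-band solution plus the new outermost radius at R, mirroring the warm start described above.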

Overall, the results in figure 3.5 appear to show a logarithmic partitioning in general,
with an additionally linear to possibly negatively exponential movement along the
radial axis for each band's radius boundaries with each additional band added to our
partitioning.

Chapter 4

Experimental Evaluation

In this chapter, we first detail our computer setup and important aspects of the im-
plementation of the three main algorithms. After listing the graph parameters we use
during our practical tests, we benchmark the multitude of options available
to the EM-variant in order to find parameter settings that optimise our runtime;
amongst others, we regard multiple radial partitionings as well as band count and
angular segment count per band. After that, we compare the three main algorithms
under different benchmark settings. For more details on those, see the introduction
in chapter 4.4.

4.1 Setup of the Computer System and Implementation


The computer system in use during our benchmarks is the same as the one used
by Penschuck [Pen17]: We use an Intel Xeon CPU E5-2630 v3 processor (8 cores, 16
threads, 2.40 GHz) with AVX2/SSE4.2 support for 4-way double-precision vectori-
sation. The main memory consists of 64 GB RAM at 2133 MHz. The operating
system is Linux 4.8.1.

Every algorithm is written in C++ and built with the same compiler (GCC
version 6.2.1) as a release build. The parallelism available in
NkGen and the EM-variant is in both cases based on OpenMP, while the EM-variant addi-
tionally uses the STXXL library for data management and sorting (see chapter 1.3).

During the later benchmarks where we compare the three main algorithms, we
additionally disabled any output in all three algorithms so as to focus on the runtime
alone. For that reason, we changed the source code of NetworKit's generator and
GIRG in two ways: For one, we removed any inserts into arrays/vectors or the like
that happen whenever an edge has been found or is to be established. For two,
we also deleted any lines declaring the data structures in charge of saving those
edges. In NkGen's case, we disabled the allocation of the adjacency list and any
access to it. Additionally, we let the program calculate a fingerprint based on the
node IDs during the establishment of edges, in case the compiler were to remove
the edge-creation due to no variables being manipulated or saved. In GIRG's case,
we also let the program end right after all edges in the hyperbolic graph have been
found, as the program made available by [BFKL16] also printed the edges
to a file, sorted them and calculated additional information afterwards, which would
have increased the runtime.

In other words, we only measured the time for the creation of the graph in and of
itself, ignoring any additional post processing done to it.
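As an illustration of the fingerprint guard (a sketch of the idea, not the modified sources verbatim): an accumulator is updated once per established edge and printed at the end, so the compiler cannot elide the edge-creation path as dead code.

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative stand-in for the benchmark instrumentation: fold both endpoint
// IDs of every established edge into a running fingerprint instead of storing
// the edge, then print it so the work cannot be optimised away.
static std::uint64_t fingerprint = 0;

inline void onEdge(std::uint64_t u, std::uint64_t v) {
    fingerprint ^= u * 0x9e3779b97f4a7c15ULL + v;   // cheap mixing step
}

void finishBenchmark() {
    std::printf("fingerprint: %llu\n",
                static_cast<unsigned long long>(fingerprint));
}
```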

4.2 Graph Parameters used in the Benchmarks


For the graph parameters during our comparison, we chose the following options:

- Power-law parameter $\alpha$: 0.51, 0.75 and 1.1. The reason for this decision is
  that, for instance according to Friedrich et al. [FK15, p.617], the exponent $\gamma$
  of hyperbolic graphs is usually between 2 and 3 in practical use
  cases. $\gamma$ is also defined as $\gamma = 2\alpha + 1$, and considering that $\alpha$ has to be above 0.5,
  we went for those aforementioned values, thus encompassing the respective
  $\gamma$ values of 2.02, 2.5 and 3.2.

- Average degree k: 10, 50, 500 and 1000, in order to cover the range from
  graphs with small average degrees to graphs with large ones.

- Node count n: Exponential increase from $10^4$ up to at most $10^9$, the sequence
  being specifically $(10^{12/3}, 10^{13/3}, \ldots, 10^{23/3}, 10^{27/3})$, to cover a wide range of
  small to massive graphs.

4.3 Finding Optimised Parameters for the EM-Variant


To better compare our algorithm with all alternatives, it was in our best interest to
first analyse which settings and parameters would optimise our generator. In all, we
have the following options in regards to our algorithm's parameters:

- Using the 3-, 2- or 0-Sorter algorithm

- Using one radial partitioning out of the following choices:
  - the geometric partitioning with an additional percentage parameter
  - the equalised workload partitioning
  - the minimised workload partitioning

- The number of bands we use given a partitioning

- The number of angular segments per band for parallelisation

Since the search space is enormously large, we decided to analyse each of these pa-
rameters step by step and use the respective best options further on. As such,
the following sections mostly involve taking multiple choices for a setting,
benchmarking them under relevant conditions, and comparing the results in order
to decide when to use which option.

4.3.1 Comparison between Sorter-Count-Versions


First off, we will compare the 0-, 2- and 3-Sorter options under the settings used
in the original algorithm: a geometric partitioning with p = 0.9 and l = log(n) bands.

Taking first a look at the generation phase (figure 4.1), we can see that the 2-Sorter
consistently outperforms both alternatives. Since there is no difference between
the 2-Sorter and 3-Sorter in their implementation in regards to the parallelisation
aspect, the difference must come from the latter variant's additional access of a
third sorter, into which we put one object more than in the 2-Sorter version.

[FIGURE 4.1: Generation phase of the three sorter-variants for graphs of average degree k = 10 and $\alpha$ = 0.75, on a logarithmic scale.]

The 0-Sorter on the other hand not only uses a different parallelisation scheme, but also
uses logarithmic operations during the calculation of the angular random variable
in order to deliver a sorted sequence.

For the sorting phase in figure 4.2, we set the memory available to the sorters to
the minimum available (around 44 MB per sorter) per object type, which we will
increase in later benchmarks to around 2 GB in total. The higher the available
memory, the fewer merge-passes the sorters necessitate. During lower object
counts the minimum memory setting is enough to sort with one merge-pass; larger
graphs on the other hand need more memory to sort more efficiently. Regardless,
the sorting phase during the benchmarks did not take more than half the overall
runtime at worst, as one can see when comparing both graphs in figure 4.2. During
the sorting phase, one can also see the sharp jump at $10^7$ nodes, which is where the
sorter requires more than one merge-pass, as the set memory is not enough for one
pass alone. For comparison's sake, we kept it at the minimum memory required for
the sorters to work.

The sorting phase shows that for smaller graphs, the sorting takes more time for the
2-Sorter, while with increasing numbers of nodes, the 2-Sorter starts to amortise the
larger object size and runs faster than the 3-Sorter, if only marginally. This presumably
comes from the size and number of to-be-sorted objects: The larger object
size in the 2-Sorter has an impact, if only for smaller graphs. For larger ones, it is
the number of sorters and objects that is more relevant to the runtime, and the
3-Sorter has more of both.

Interestingly, we can see in figure 4.3 that the 2-Sorter's and 3-Sorter's
edge-creation phases compete well with one another, meaning that the difference in
the overall performance between them stems from the generation phase. Regarding
the 0-Sorter, the runtime during the edge-creation phase was, expectedly, twice as
long as the 2-Sorter's, as we are traversing the band once forward, once backward.
Considering that the sorting takes relatively few seconds compared to the edge-
creation phase, the lack of sorters in general did not give us any advantage after all.

[FIGURE 4.2: Sorting (left) and overall (right) runtime phase of the 2- and 3-Sorter for graphs of average degree k = 10 and $\alpha$ = 0.75, on a logarithmic scale.]
[FIGURE 4.3: Edge-creation runtime for graphs of average degree k = 10 and $\alpha$ = 0.75, on a logarithmic scale.]

In conclusion, the 2-Sorter seems to be the most efficient version out of the three we
devised, and as such we are going to use the 2-Sorter in all forthcoming benchmarks.

4.3.2 Benchmark Analysis of the Equalised Workload Partitioning


As we have at most 16 threads at our disposal, the question arises which settings
would be most helpful if we were to parallelise our algorithm towards an equalised
workload.

From the start, we already have the possibility to divide our ground plane into
multiple bands, meaning that (apart from the first band, which always covers the
[0, R/2] range because of the higher query surface area in that region) the
first idea would be to have 16 or 17 bands, as we want to distribute one band
to each thread. 17 bands might be worth a consideration because, de-
pending on the $\alpha$ value, the innermost band could have few enough nodes to
be almost of no work compared to the other bands, meaning one thread could
finish the first band fast enough to start with a second one without too
big of a lag. Those ideas are beside the more important point, though, as the
bigger problem arises from the fact that the more bands we have, the longer the
generation and sorting phases will take.

[FIGURE 4.4: Linear runtime of the generation phase in regards to band count; node count is 10^7, average degree is 1000.]

The generation phase goes through all nodes once and checks, once per node, every
higher band for possible Start- and StopBound creation, meaning that the runtime,
as previously mentioned, has an upper bound of $O(n \cdot l)$, n being the node count and
l the band count. The data in figure 4.4 shows this linearity. A quick note on
the forthcoming figures with band counts on the x-axis: The lines are not supposed
to imply a continuous data plot, since there is no such thing as half a band.
In order to better visualise the general trend, though, we decided to insert lines
between the data points.

Interesting is the fact that the slope is weaker the smaller the $\alpha$-value is. A possible
explanation could be the number of nodes on the outer bands we get for smaller $\alpha$
with our partitioning, together with the way the generation step works: Generally
speaking, every node checks through every band that is further away from the
center than itself, meaning that for 8 bands, a node on band 0 has to check through all
bands, while a node on band 7 only checks the last one. Our partitioning algorithm,
though, finds it more beneficial to put more nodes into the last band (i.e. create
wider bands on the outside), because for smaller $\alpha$ the radius distribution inherently
wants to put more nodes closer to the center. The more nodes we have in the
center, the more nodes we have that would create wider queries. Following that, to
compensate the increase in the number of wider queries closer to the center, we have
to create inner bands with fewer, and outer bands with more, nodes inside of them if
our goal is an equalised workload. This results in lower $\alpha$ values having
far more nodes that have to check only a few bands, while larger $\alpha$ values
force us to divide the node distribution onto the bands in a way that makes them
check more bands per node, because in those cases we have far more nodes
with similar, smaller query ranges.

Nevertheless, for simplicity's sake, we can still assume our previously established
upper bound, considering that the worst-case scenario would be one where every
node had to check every band.

[FIGURE 4.5: Runtime overall increasing (left) and decreasing (right) with more bands; node count is n = 10^7, k = 50 (left), k = 1000 (right).]

The sorting step on the other hand would be at best $O(n_i \log(n_i) + s_i \log(s_i))$ per
band, $n_i$ being the node count and $s_i$ the StartStopBound count on band
i. Having more bands certainly would decrease $n_i$ on some bands, regardless of
positioning, but would also increase the overall StartStopBound count. We could
not find any kind of closed-form expression for our partitioning, which means
that estimating the node count or StartStopBound count per band based on that
partitioning was not possible either. Regardless, if we let $\hat{n} = \max_{i \in [0, l-1]}(n_i)$
and $\hat{s} = \max_{i \in [0, l-1]}(s_i) \leq n$ represent the maximal node
and StartStopBound count on a band for a given partitioning, the sorting step
would be at most $O(l \cdot (\hat{n}\log(\hat{n}) + \hat{s}\log(\hat{s}))) = O(l \cdot n \log(n))$.

In other words, both the generation step and the sorting step are theoretically
linearly dependent on the band count l, meaning that the more bands we have, the
higher the cost would be in those two phases, even if we would get a better runtime
during our edge-creation phase. This is where the angular parallelisation comes into
play: For instance, we could choose a band count which is suboptimal in regards
to the generation and sorting step, while choosing an angular parallelisation count
that would allow all threads to still have a smaller, equalised workload comparable
to one without angular parallelisation entirely.

For example, instead of having no angular parallelisation and 16 parallel bands,
we could be working with 8 parallel bands, all of which are angularly divided
into 2 parts, each of which is worked on by a separate thread. This way we
would, on the one hand, be working with a partitioning that could potentially have a
higher maximum workload (meaning the parallelisation would be worse than the
optimum), but, on the other hand, we would in theory decrease the runtime of the
generation and sorting step.

Taking a look at the time spent on all steps combined, one thing is immediately
apparent: If the number of edges is lower, the algorithm takes more time with every
additional band; if the number is higher, it at first takes less time with every
additional band, until the performance increase plateaus (figure 4.5). The benchmarks
with higher edge counts show that our initial consideration of 17 instead of 16 bands
is more or less irrelevant: The differences are slight and in the realm of probabilistic
variances, as redone benchmarks have shown.

[FIGURE 4.6: Separate runtimes of the three phases in our algorithm, per band count; n = 10^7, k = 50 (left), k = 1000 (right), $\alpha$ is 1.1.]

Increasing the bands further than that either did not change anything or only increased
the runtime, most likely because we have fewer threads available than bands in those
cases, not to mention that the more bands we have, the longer the generation and
sorting steps will take as well.

Analysing the benchmarks further, it seemed that in all runs the sorting phase took
only a small fraction of the time spent on the entire algorithm, and thus it
seems not to be a relevant factor for an optimised setting. In the graphs in figure 4.6,
one can see all three steps, once for an average degree of 50 and once for 1000.
Even though the change in average degree changes which phase of
our algorithm dominates, the sorting phase stays less relevant in comparison, as it is
not the dominating factor in either case.

The generation step, though, does matter more in lower edge-count graphs
than higher ones. Regardless of the $\alpha$ value, the generation phase alone is linearly
dependent on the band count. The edge-creation step on the other hand follows a
negative exponential curve with each additional band, up until we have no further
threads to offset additional bands' workload onto. The result is the previously
mentioned trade-off: For lower edge counts, where the edge-creation step itself does
not weigh much in regards to the runtime, further bands increase the time spent on the
generation step so much that it dominates the runtime. Because
the overall runtime thus becomes very similar in shape to the generation phase
alone, it might be more beneficial to keep a lower band count while increasing the
angular partitioning of the bands itself, to be able to use all threads in their entirety.

In cases where, on the other hand, our edge-creation phase dominates the runtime,
it would seem at first glance more appropriate to use as many bands as possible
for our threads, were it not for the fact we see in figure 4.7. Here
we can see the maximal workload, or maximal comparison count, per band count. Just like
the edge-creation step, it follows a negative exponential slope with increased
band counts.

This means that, while we could lower the maximal comparison count with each
and every new band, it would become less and less beneficial. Taking additionally
into consideration the finite number of threads we have at our disposal, it could be
better to use a smaller band count with angular parallelisation after all, which is
why we investigated this further with an excerpt of the data gathered from the
benchmarks, shown in the graphs in figure 4.8.

[FIGURE 4.7: Maximum compare count out of all bands, per band count; node count is n = 10^7, average degree is k = 1000.]

[FIGURE 4.8: Edge-creation (left) and overall (right) runtime per angular parallelisation, per band count; n = 10^7, k = 1000, $\alpha$ = 0.75.]


The above figures show the runtime for the angular parallelisation count v being
three, four, and five, once for the edge-creation phase and once for the overall runtime.
As one can see, the edge-creation phase alone does not paint the whole picture:
going by the first figure, we would have chosen a larger band count for our later
benchmarks, seeing as the graph generally follows a somewhat downward
slope with each additional band, interrupted by spikes that stem from a less than optimal
band-piece-to-thread distribution. The overall runtime, though, shows that the im-
provement with higher band counts is mitigated by the aforementioned increasing
runtime of the generation phase. Also worth mentioning is the fact that regardless of band
count, the first band has the exact same amount of comparisons, as we set the first
band's outer radius to be always R/2. Because of this, and because of figure 4.7,
it seems reasonable to assume that the spikes are somewhat related to the overall
number of threads and angular parallelisation pieces:

If one were to think as to when our parallelisation would be optimal, the first
thought would be that any band piece count with equal work per piece that is divisi-
ble by our thread count should give us a good distribution and thus a good runtime.

[FIGURE 4.9: Median over the runtime of the edge-creation phase of ten runs with Median Absolute Deviation (MAD), with the angular pieces per band count v = 4.]

As one can see in figure 4.7, though, the maximal comparison count rises exponen-
tially with fewer bands, meaning that one should not expect, for example, four bands
to deliver a better result than 5 bands, regardless of how well the band-piece count
is divisible by our 16 threads (which it is in the case of four pieces per band). On the
other hand, for larger band counts this argument could be made, as the next minimum
in the left graph of figure 4.8 for a parallelisation count of four after five bands appears
around eight bands (which is 32 pieces at v = 4), the next one at 12 (48 pieces), and
the next one at 16.

For the other angular segment counts per band, v of three and five for example,
where there are barely any band-piece counts that are divisible by 16, it is more
difficult to see this pattern. Considering that we have a multitude of factors
affecting the runtime (like the comparison count decreasing with every band, for
example), and because we are working in a parallel setting, all of these indicators
should not be taken for absolutes: While the general positions of those more visible
spikes were usually in the same vicinity after multiple benchmarks (see figure 4.9),
the variance was high enough that for larger band counts (and thus more pieces)
the reasoning could be considered less applicable. Nonetheless, the randomising
factor of parallelised experiments in practice occurred during all runs, which is
why we still chose to benchmark the four aforementioned minima for four
pieces per band.

Figure 4.10 shows that overall, 5 bands and 4 pieces per band were the optimal
setting for smaller average degrees in regards to the overall runtime. For larger degrees, it
was additionally dependent on the node count, where anything above $10^7$ took less
time with 8 bands. Because of that, we will be using those settings (shown again in
table 4.1) during the later benchmarks.

[FIGURE 4.10: Separate runs for the various minima with v = 4 and l ∈ {5, 8, 12, 16}; k = 50 (left), k = 500 (right).]

Average Degree k | Node Count n  | Ang. Para. Count v | Band Count l
k ≤ 50           | any           | 4                  | 5
500 ≤ k          | n ≤ 10^7      | 4                  | 5
500 ≤ k          | 10^7 < n      | 4                  | 8

TABLE 4.1: Rule set for angular parallelisation count v and band count l for the equalised workload partitioning.

4.3.3 Benchmark Analysis of the Minimal Workload Partitioning


Taking a look at the graphs in figure 4.11, one can see certain
similarities to the previous approach. Ignoring all other parts of our algorithm, the
runtime for the edge-creation step alone falls exponentially with each additional
band, plateauing somewhere around 16 or 17 bands - presumably because we have 16
threads available, but perhaps also because of the lower advantage each additional
band's introduction to the algorithm might give.

The generation and sorting steps do not differ from a runtime perspective at all.
Similar properties show themselves in the graphs, with a linear runtime of $O(n \cdot l)$
in regards to the generation step and a similar sorting step with $O(l \cdot n \log(n))$. The
algorithm's entire runtime is also similar to the equalised workload version: Higher
edge counts result in the edge-creation step dominating the runtime, decreasing
it with every band, while lower edge counts let the generation step dictate our
speed, slowing us down with higher band counts fairly early on. Just as with the
equalised workload, this effect appears quicker and more drastically for higher $\alpha$
values than lower ones (as seen in figure 4.12), the reason being the same as the
one explained above (different node distribution based on band wideness as a result
of our partitioning algorithm for different $\alpha$).

One difference to the previous radial partitioning that we decided to take into
account for our parallelisation scheme was the varying comparison count per
band. Considering that the equalised workload method had the same amount of work
on every band, there was no requirement to consider the distribution of band pieces
to threads, as every thread would get a piece with the same workload as any
other piece (except on the first band). Because this property does not hold for this
radial partitioning, we decided to create a simple scheduling scheme that takes the
varying workload sizes into account.

[FIGURE 4.11: Separate runtimes of all three phases, per band count; n = 10^7, k = 50 (left), k = 1000 (right).]

[FIGURE 4.12: Overall runtime per $\alpha$, per band count; n = 10^7, k = 50.]


In most cases, as one can see in the graphs in figure 4.13, the outer, last few bands
would have the largest workload, decreasing with every band closer to the center
(except for the first band because, regardless of partitioning, it is always R/2 wide).
Our scheduling for 16 threads thus works as follows (a code sketch follows after
the list):

- First, we assign each thread a workload in descending order, beginning with
  the last, largest piece and ending with the 15th to last.

- Second, if more than 16 pieces are left, we assign each thread a workload
  in ascending order, giving the thread with the currently largest workload the
  smallest piece.

- Third, if $x < 16$ pieces are left, we assign each thread a workload in
  ascending order, starting with thread $t - (x - 1)$ up to thread $t$.
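Translated into code, the scheme could be sketched as follows; this is our own illustration (pieces are assumed to be indexed from innermost to outermost, i.e. in ascending order of work, and we read "thread t" in the third phase as the last thread), not the benchmark sources verbatim:

```cpp
#include <vector>

// Sketch of the three-phase static assignment described above, for T threads
// and band pieces ordered from innermost (0, least work) to outermost (last,
// most work). Returns for each thread the indices of its assigned pieces.
std::vector<std::vector<int>> schedule(int numPieces, int T = 16) {
    std::vector<std::vector<int>> plan(T);
    int next = numPieces - 1;                 // largest unassigned piece
    int small = 0;                            // smallest unassigned piece
    // Phase 1: one large piece per thread, in descending order of work.
    for (int t = 0; t < T && next >= 0; ++t)
        plan[t].push_back(next--);
    // Phase 2: while >= T pieces remain, assign in ascending order so that
    // the currently most loaded thread (thread 0) gets the smallest piece.
    while (next - small + 1 >= T) {
        for (int t = 0; t < T; ++t)
            plan[t].push_back(small++);
    }
    // Phase 3: x < T pieces remain; give them to the last x threads
    // (threads t-(x-1) .. t), again in ascending order of work.
    const int x = next - small + 1;
    for (int t = T - x; t < T; ++t)
        plan[t].push_back(small++);
    return plan;
}
```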

While this is not an optimal workload scheduling, it is a quickly calculable method
that gave us a large enough improvement in runtime, as one can see from the com-
parison graph in figure 4.14. The difference between the best parameter set with a
scheduler (in this example 10 bands) is around 10% compared to the best parameter
set without a scheduling scheme (in this example 6 bands).

[FIGURE 4.13: Workload per band on a run with 7 (left) and 16 (right) bands; n = 10^7, k = 1000.]
[FIGURE 4.14: Comparison between two runs (runtime per band count), one with and one without scheduled workload distribution; node count is n = 10^7, $\alpha$ = 0.75, k = 1000, angular parallelisation is v = 4 pieces per band.]

While those runtime improvements would most assuredly be within the error margins
during the generation of smaller graphs, and while the speedup was not always as
large as that (sometimes less, sometimes more, depending on a variety of factors),
it nonetheless gave us a consistent improvement in runtime when comparing the
respective optimal band count parameters under the same angular parallelisation
count. Because of that, all further benchmarks are going to be done with the
aforementioned scheduling scheme.

In a visual representation of the workload given to each thread (see figure 4.15), we
show every thread for both runs (one scheduled manually, one directed by the
compiler) with their respective comparison count. As one can see, in the specific
example of 10 bands, our manually scheduled run even managed to distribute the
workload almost equally, while the unscheduled run seemed to increase the work
put on each thread with every higher thread ID. This follows from the fact that the
lower thread IDs get assigned subsequent pieces from the lower bands, while the
higher IDs are forced to work on the outer bands. Concerning the short fall at thread
ID 8: the compiler decided at this point to assign only two subsequent pieces each
to every following thread, instead of three as was the case for the previous threads.

[FIGURE 4.15: Workload per thread for the unscheduled (compiler-directed) and manually scheduled run. The parameters are the ones used in figure 4.14, except that the band count is set to l = 10.]

[FIGURE 4.16: Overall runtime per $\alpha$, per band count; node count is n = 10^7, k = 1000.]


Considering that we are working with those 16 threads in parallel, the unscheduled
run is thus at a disadvantage: While the first couple of threads are done fairly
early into the edge-creation phase, all threads will have to wait for threads number
14 and 15 in this example, which have more work to do than their respective
counterparts in the scheduled run.

Going by figures 4.12 and 4.16, we decided to use a constant number of 9 or 11
bands instead of l = log(n) for ease of comparison, as those were the numbers at
which the overall runtime stopped improving with a further increase in band count.
We then benchmarked multiple $v \in \{2, 3, 4, 5, 6, 7, 8\}$ for the node counts $10^{x/3}$,
$x \in \{15, \ldots, 21\}$, and $k \in \{10, 1000\}$, and came to the decision to use a band count of
11 and v of 3 for wider graphs, and a band count of 9 and v of 2 for narrower graphs.
Those settings (seen again in table 4.2), though, are only to be taken as circumstan-
tially optimal, as a different benchmarking environment would most likely favour
other band or angular parallelisation counts than the ones we are going to choose
during our later comparisons.

[FIGURE 4.17: Overall runtime per percentage, per band count; node count is 10^7, k = 10 (left), k = 1000 (right), $\alpha$ = 0.75.]

TABLE 4.2: Rule set for angular parallelisation count v and band count l for the minimised workload partitioning.

Average Degree k | Ang. Para. Count v | Band Count l
k ≤ 50           | 2                  | 9
500 ≤ k          | 3                  | 11

4.3.4 Benchmark Analysis of the Geometric Partitioning


In regards to the geometric partitioning, there is, additionally to the question of
how many bands one might use, the question of which percentage one might choose
for the partitioning itself. According to Looz et al. [LLM16, p.2-3], their optimised
parameters were p = 0.9 and a band count of l = log(n), which is what we will also
choose for one of the settings during our direct comparison between the EM-version
and the original algorithm. Considering though that we are not using the same
computer system for benchmarking as the original paper did, we decided to check
whether there are parameter settings that give us a better result.

Generally speaking, the dependencies in regards to the band count are similar to
before, with the difference that the percentage variable also has an impact on
how fast the runtime is, and thus also changes at which point the generation and
sorting step start to overshadow the edge-creation step. The left graph in figure 4.17
shows, for example for smaller k, that p = 0.7 is affected by the generation step more
quickly than the other chosen p. This is not only because the edge-creation step
changes, though: figure 4.18 shows the generation phase only, and as
one can see, different p also change the generation phase's runtime. The reason for
this is the change in the band radii's positioning: If one recalls equation 1.7, the
innermost band's outer radius $c_1$ is calculated by the formula:

$$c_1 = \frac{R(1 - p)}{1 - p^l}$$

If we keep the band count l the same and only vary the p parameter, we can see
that a decrease in p increases the first band's size, meaning more points on the inner
bands, resulting in more outer bands per node on which we will have to calculate the
StartStopBounds.

[FIGURE 4.18: Runtime of the generation step per p value, per band count; n = 10^7, k = 10, $\alpha$ = 0.75.]

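For reference, a small sketch of how the full radius sequence follows from this formula, under our reading of eq. 1.7 that the band widths shrink geometrically by the factor p from the centre outwards (the helper name is ours):

```cpp
#include <cmath>
#include <vector>

// Geometric band radii: widths shrink by factor p from the centre outwards,
// so c_i = R * (1 - p^i) / (1 - p^l), with c_0 = 0 and c_l = R.
std::vector<double> geometricRadii(int l, double R, double p) {
    std::vector<double> c(l + 1);
    const double norm = 1.0 - std::pow(p, l);
    for (int i = 0; i <= l; ++i)
        c[i] = R * (1.0 - std::pow(p, i)) / norm;
    return c;
}
```

With l fixed, lowering p visibly enlarges $c_1$ (for l = 10: $c_1 \approx 0.31R$ at p = 0.7 versus $c_1 \approx 0.15R$ at p = 0.9), matching the effect on the generation phase described above.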

What can also be seen in the graphs in figure 4.17 is that the larger the band count,
the better it is to use a higher percentage value. Depending on $\alpha$, though, the exact
point where one p-value is better than the other changes as well. We benchmarked
this for $10^7$ nodes, once for k = 10 and once for k = 1000, for the three $\alpha$-settings
0.51, 0.75 and 1.1, and came up with the rule set in table 4.3. There we can see
which p-value we chose depending on the band count l, the $\alpha$-setting, and whether
we were creating graphs with smaller (in our later benchmarks, 10 and 50) or larger
(500 and 1000) average degrees:
TABLE 4.3: Rule set for p depending on the parameters k, α and l.

    Average Degree    α-value    Case p = 0.7    Case p = 0.8    Case p = 0.9
    Small             0.51       l < 10          10 ≤ l < 12     12 ≤ l
                      0.75       l < 11          11 ≤ l < 14     14 ≤ l
                      1.1        l < 12          12 ≤ l < 17     17 ≤ l
    Large             0.51       -               l < 11          11 ≤ l
                      0.75       l < 11          11 ≤ l          -
                      1.1        l < 15          15 ≤ l          -
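For our benchmarks, this rule set is applied mechanically; a direct transcription into a hypothetical C++ helper (the function name, its signature, and the classification of 10/50 as small and 500/1000 as large are our own choices) could look as follows:

// Transcription of table 4.3: pick the percentage p for the geometric
// partitioning from the band count l, the alpha setting, and whether the
// target average degree counts as "small" (10/50) or "large" (500/1000).
double choose_percentage(int l, double alpha, bool smallDegree) {
    if (smallDegree) {
        if (alpha <= 0.51) return l < 10 ? 0.7 : (l < 12 ? 0.8 : 0.9);
        if (alpha <= 0.75) return l < 11 ? 0.7 : (l < 14 ? 0.8 : 0.9);
        return l < 12 ? 0.7 : (l < 17 ? 0.8 : 0.9); // alpha = 1.1
    }
    if (alpha <= 0.51) return l < 11 ? 0.8 : 0.9;   // p = 0.7 never best
    if (alpha <= 0.75) return l < 11 ? 0.7 : 0.8;   // p = 0.9 never best
    return l < 15 ? 0.7 : 0.8;                      // alpha = 1.1; p = 0.9 never best
}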

After that, we benchmarked these settings for angular parallelisation. As one can see from figure 4.19, for the different percentage values the distribution of work per band is also heavily off-loaded to the bands further away from the center, as was the case with the minimal workload partitioning. Because of that, we chose to use the same scheduling rule set as outlined in the minimal workload section for all our benchmarks in which we tested the overall performance in regards to the angular parallelisation count.

Just like in the previous section, we first benchmarked all v ∈ {2, 3, ..., 8} and chose for each v the best band count l as a setting. After that, we compared all best settings for small and large average degrees (figures A.3 and A.4 for those best settings can be seen in the appendix), and chose, for both small and large graphs, eleven bands and a parallelisation count of two.

FIGURE 4.19: Comparison count per band for p = 0.7, 0.8 and 0.9. k = 50, α = 1.1, node count is n = 10^7 and the band count was set to l = 10.

FIGURE 4.20: Comparison of the three radial partitionings (GEO, MIN, EQ). α = 0.51 on the left, α = 0.75 on the right, k is set to 10.

4.3.5 Comparison between the Three Radial Partitionings' Best Settings


Representative figures 4.20 and 4.21 (for more, see figures A.5 through A.16 in the appendix) show the results of a comparison between each partitioning's most efficient parameter setting:

Changing the α-value matters slightly for the partitionings' relative ranking: For instance, the intersection between the equalised partitioning's and the geometric partitioning's runtime during the run with k = 10 (see figure 4.20) happens around n = 10^6 nodes for α = 0.51 (and for α = 1.1 as well, see figure A.13 in the appendix), but at n = 2 · 10^6 for α = 0.75. For larger graphs, for instance k = 1000 (see figure 4.21), we see similar phenomena, only with the minimised partitioning instead of the equalised one. The minimised workload seems to work asymptotically best for large graphs. During the generation of smaller ones, one can see the 0.3 seconds necessary to calculate the band partitioning, since at this size it is more than half the runtime. The calculation for the equalised workload takes only a third of that time, which is why it is quickly amortised, but it is still longer than the calculation for the geometric partitioning (less than 10^-4 seconds).

Overall, the geometric partitioning works best for small graphs, up until a point where either the equalised or the minimised workload partitioning improves on it in the long run. Based on the figures here and the ones in the appendix, we chose the following settings:

FIGURE 4.21: Comparison of the three radial partitionings (GEO, MIN, EQ). α = 0.51 on the left, α = 0.75 on the right, k = 1000.


For any node count n ≤ 10^6, regardless of k, we chose the geometric partitioning. For any other node counts, if k ≤ 50, we chose the equalised partitioning; if k > 50, we chose the minimised one.
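Expressed as a hypothetical helper (the function name and the enum are our own, mirroring the rule just stated):

enum class Partitioning { GEO, EQ, MIN };

Partitioning choose_partitioning(long long n, double avgDeg) {
    if (n <= 1000000)   return Partitioning::GEO; // small graphs: geometric
    if (avgDeg <= 50.0) return Partitioning::EQ;  // larger graphs, small k
    return Partitioning::MIN;                     // larger graphs, large k
}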

4.4 Runtime Benchmark


In this section, we compare all three algorithms in their entire runtime. First, we compare them in an internal memory focused benchmark setting. We then go into detail as to how our algorithm's runtime is affected by the graph parameters, specifically the exponential factor α and the average degree k. After that, we look into parallelism and the speedup the EM-variant can achieve per thread, compared to NkGen. Since the algorithms reach an EM setting only after exceeding the RAM limit and requiring memory swapping, we also analyse the memory efficiency of each algorithm. At last, we set an artificial limit on our RAM to force memory swapping, to demonstrate the differences between the in-memory algorithms and the EM-variant.

4.4.1 Runtime Comparison in Regards to Internal Memory


In order to see how the algorithms fare using internal memory only, we deactivated memory swapping and kept the available RAM at its maximum of 64GB (minus any main memory needed for the operating system to function). The first thing apparent from all figures 4.22 (see also A.17 through A.28) is that, compared to the other algorithms, GIRG stands out as the one with the worst scaling. The differences are so vast that, for example, for an average degree of 50 and a node count of 10^8, while NkGen and the EM-variant take less than 70 and 40 seconds respectively, GIRG needs more than 20 times as much (around 2,160 seconds). This could come from the lack of parallelism, but even assuming a perfect parallelisation for 16 threads, it would take longer than the alternatives (2,160/16 ≈ 135 seconds).

Overall, GIRG scales asymptotically linearly with n, regardless of the average degree or α setting. The EM-variant and NkGen, while asymptotically slower, appear to have a practically linear runtime for the chosen graph sizes as well, even if NkGen's runtime continues to rise further at larger node counts, as seen for instance in figure 4.24.

FIGURE 4.22: Comparison of all major algorithms (EM-variant with NetworKit settings, EM-variant, NetworKit, GIRG). α = 0.75, k = 50.

FIGURE 4.23: Comparison of all major algorithms. α = 0.75, k = 10.

Exceptions to this happen during the generation of small graphs (in terms of node count), where the runtime of the EM-variant, for example, is dominated by the setup and sorting phases until their overhead is amortised by the edge-creation. Once this happens, the runtime stabilises to a linear scaling for the tested graphs.

For small average degrees and graphs, the overhead of the EM-variant's modifications is too large to compete with NetworKit's NkGen in an internal memory setting: Neither the EM-variant set to settings similar to those NetworKit uses (in the figures, the run with a geometric radial partitioning, p = 0.9, and l = log(n) bands), nor the optimised settings fare better for an average degree of ten, up until we reach graph sizes in the range of 4 · 10^6 nodes. This threshold decreases with an increase in the average degree: for k = 50, the threshold is around 2 · 10^6; for k = 500 it appears even earlier, at around 2 · 10^5.

Interestingly, in figure 4.23 for example, the EM-variant's modifications alone are not good enough to compete with NkGen, as the run with the same settings as implemented in NetworKit never catches up; the version with the optimised settings, on the other hand, can cut the runtime by around one fourth compared to NetworKit's NkGen (29 vs. 39 seconds at 10^8 nodes). In other words, the advantage of the EM-variant in this particular run might stem less from the differences in the algorithms and more from the radial partitioning choices.

FIGURE 4.24: Comparison of NetworKit's NkGen and the EM-variant. α = 0.51, k = 10.


Another detail is that the α setting does not change much in terms of the overall ranking of the three algorithms, but it does change whether the optimised settings we chose were better than the ones chosen for NkGen. Figures 4.24 and 4.23 show two runs with only the α settings differentiating between them: For α set to 0.51, there are times where the original settings are still slightly favoured for large node counts, while for 0.75 the optimised settings' advantages are more apparent.

In summary, NkGen seems to be favoured for graphs with small average degrees and, regardless of average degree, for graphs with small node counts. The EM-variant is still able to compete and improves on NkGen's runtime even in an internal memory setting, under the condition that the graphs to be generated have a large number of nodes and edges. GIRG, on the other hand, comparatively takes a lot of time even in the internal memory setting on almost every scale, regardless of graph properties. Another fact to mention is that both NkGen and GIRG required more memory than was available (64GB) for graphs with more than 2 · 10^8 nodes, resulting in both programs crashing, since we kept memory swapping deactivated during the internal memory benchmarks. A more detailed comparison of the memory usage can be seen in chapter 4.4.4.

4.4.2 Runtime Analysis in Regards to Varying Graph Parameters


To see the effect of α or the average degree k, we chose to benchmark all three algorithms at 10^7 nodes, where we set the respective variable to a low point and incremented it slowly: α in the range [0.51, 1.1] with an incremental step of 0.01 (for NkGen respectively γ in the range [2.02, 3.2] with a step size of 0.02, since α = (γ − 1)/2) with k = 100, and k in the range [10, 100] with a step of 1 with α = 0.75 (γ = 2.5).

FIGURE 4.25: Graph showcasing NetworKit's NkGen's k dependency for k ∈ [10, 100] (left graph, with α = 0.75) and γ dependency for γ ∈ [2.02, 3.2] (right graph, with k = 100). Node count is set to n = 10^7.

FIGURE 4.26: Graph showcasing the EM-variant's k dependency for k ∈ [10, 100] (left graph, with α = 0.75) and α dependency for α ∈ [0.51, 1.1] (right graph, with k = 100), with the runtimes of the generation, sorting and edge-creation phases shown separately. Node count is set to n = 10^8.

NetworKit's NkGen and the EM-variant


In figure 4.25 we can see that NkGen clearly has a linear dependency on k and a negatively exponential dependency on γ. Figure 4.26 shows the same k-dependency benchmark for the EM-variant, only that the figure depicts the three phases our algorithm goes through separately. The trend is overall linearly increasing: Both the generation phase and the edge-creation phase share such a dependency and dominate the sorting phase's runtime. A different picture can be seen in the α-dependency, though: The right graph shows that both the generation and sorting phases take slightly more time with a larger α, while only the edge-creation phase falls. The latter amortises the increase, though, resulting in an overall negatively proportional runtime in regards to α as well.

Overall, we can establish with these runs that, in practice, the EM-variant shares similar dependencies with NkGen in regards to α and k.

GIRG
Figure 4.27 shows the same runs done with the GIRG algorithm (only with a smaller node count of 10^6 because of its long runtime), drawing a widely different picture. Considering that GIRG is proven to have a linear runtime in the edge count, it is not surprising to see that this is overall the case with a rise in the average degree as well.

FIGURE 4.27: GIRG's k dependency for k ∈ [10, 500] (left) and α dependency for α ∈ [0.51, 1.1] (right). The red line in the left graph connects the jumps.

The interesting property here is that the linear rise in the runtime is not gradual, but happens at exponentially regular occurrences:

Increasing the average degree seems not to affect the runtime, or only marginally, until some kind of threshold is overstepped. In that case, the runtime jumps, at first slowly, but later on dramatically, and after the jump it does not change at all until the next jump occurs. The jumps appear to come at an exponential rate, with the runtime also increasing exponentially at every spike. To be more precise, the runtime doubles every time the average degree reaches double the degree that introduced the last jump in runtime.
TABLE 4.4: A recording of the runtime jumps seen in figure 4.27.

    Jump No.              1    2    3    4    5    6
    k                     11   22   44   88   170  340
    Runtime in seconds    7    14   28   55   109  215

Table 4.4 illustrates this with concrete numbers, detailing each degree at which a jump occurs and its corresponding runtime, which is then kept steady until the next jump occurs. For example, an average degree k of 44 introduces a jump in runtime to around 28 seconds. The previous jump occurred at a k of 22 with a runtime of 14 seconds, both of these numbers being half of the corresponding ones we just looked at. Double the average degree, double the seconds, and with some leeway of random chance we get to the next jump at a k of 88 with a runtime of 55 seconds. As one can see in that same figure 4.27, this does not change the overall asymptotic runtime. The jumps occur at a logarithmic rate, meaning that while each jump increases the runtime exponentially, the overall runtime stays at worst linear, as the overlaid red line connecting the jumps shows.

The explanation for that is the way the cell sizes, and thus the number of cells, in GIRG are decided upon: The smallest weight layer starts at the minimal weight possible, which in our case is w_0 = exp(R/2) = O(√(n/k)), as the weight is defined as w_v := exp((R − r_v)/2) (see eq. 2.2) with R = O(log(n/k)) (see lemma 1). The cell sizes are calculated by multiplying the current layer's weight with the minimal weight, i.e. ν(i) := w_i · w_0 / W (see eq. 2.3), and setting the resulting number to the next largest negative power of 2, i.e. ⌈ν(i)⌉_2 = min{2^(−l) | l ∈ N_0 : 2^(−l) ≥ ν(i)} (see eq. 2.4). This rounded-up number is the volume of a cell, which has the implication that a given layer i has 1/⌈ν(i)⌉_2 cells.

FIGURE 4.28: Radius R of the hyperbolic space relative to α, based on the average-degree calculation.


Let us now assume we followed the entire calculation for a specific k with an unchanging n, reaching a number ⌈ν(0)⌉_2 for the lowest layer. This means we have 1/⌈ν(0)⌉_2 cells on this particular layer. Let us assume we have chosen another average degree, k_2 = 2 · k. In this case, our minimal weight will now be w_0' = √(n/(2k)). The lowest layer's volume will now equal w_0' · w_0' / W = √(n/(2k)) · √(n/(2k)) / W = (1/2) · (n/k) / W, which is half the volume of the previous example, i.e. ⌈ν(0)/2⌉_2, which in turn gives us double the cells, specifically 1/(ν(0)/2) = 2/ν(0). Because we round up during the calculations, though, this only happens when our new average degree is double that of the last example, which is exactly what is happening in figure 4.27.
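The following small C++ sketch (ours, not GIRG's code) reproduces this rounding effect; it simplifies the total weight to W = n, which only shifts where the jumps occur, not their doubling pattern, so the printed thresholds will not exactly match table 4.4:

#include <cmath>
#include <cstdio>

// Round x up to the next larger negative power of two: min{2^(-l) >= x}.
double ceil_pow2(double x) {
    double p = 1.0;
    while (p / 2.0 >= x) p /= 2.0;
    return p;
}

int main() {
    const double n = 1e6;
    double lastVol = -1.0;
    for (int k = 10; k <= 500; ++k) {
        const double w0  = std::sqrt(n / k); // minimal weight, w_0 = exp(R/2)
        const double nu0 = w0 * w0 / n;      // lowest layer's volume, eq. 2.3 (W = n)
        const double vol = ceil_pow2(nu0);   // rounded cell volume, eq. 2.4
        if (vol != lastVol) {                // the cell count 1/vol just doubled
            std::printf("jump at k = %3d: %.0f cells on layer 0\n", k, 1.0 / vol);
            lastVol = vol;
        }
    }
}

With W simplified this way the jumps land on powers of two (k = 16, 32, 64, ...); the constant hidden in the real W shifts them to the positions recorded in table 4.4, but every jump still doubles the cell count on the lowest layer, matching the doubling runtime observed above.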

The same figure also shows us that GIRG depends negatively exponentially on the α value, like the other algorithms. What is different here, though, is similar to what we have seen in GIRG's dependency on k: We first have an exponential fall until we come to a point where we have a jump. The jumps' maxima seem to follow a logarithmic curve, the same way their occurrences follow a logarithmic rate akin to the one we saw in the k-dependency graph. The reason for that is, as seen in figure 4.28, that R falls logarithmically with an increase in α. This means that, while α alone has a negatively exponential effect on the runtime for any given number of cells, the runtime will jump up again because that number of cells increases whenever R is lowered enough to be under the next threshold (as explained for the k-variable). Because of this logarithmic fall of R with an increase in α, it logically follows that the jumps happen at a logarithmic rate.

4.4.3 Parallelism Efficiency


The EM-variant relies heavily on parallelisation during all phases: The generation phase consists of as many generators as threads available, all of which push nodes and StartStopBounds in batches into the sorters as they are created. The edge-creation phase also divides the entire plane into multiple bands and band segments, all of which are worked through in parallel as well. As such, it is also worth noting how efficiently the EM-variant works relative to the number of threads available, and how well it fares in a comparison with the other two algorithms.

FIGURE 4.29: Runtime per thread count of the three algorithms. The EM-variant is shown once with angular parallelisation (A) and once without (NA). Average degree k = 500, while α = 0.75.


In figure 4.29 we see a direct comparison between the three versions: As already mentioned in the introduction, this implementation of GIRG does not utilise parallelisation at all, resulting in a completely flat line. Interesting, though, is that even in a sequential run (where the thread count is set to one), both the EM-variant and NkGen perform at least an order of magnitude better than GIRG at the graph settings chosen in figure 4.29.

For the EM-variant, we benchmarked one run with angular parallelisation enabled and one without (all other settings being equal). In the same figure one can also see that the run with angular parallelisation is faster at every thread count higher than one. What is hardly noticeable here, but more so in figure 4.30, is that the run without angular parallelisation shows no noticeable speedup at all after eight threads, while the run with such a parallelisation scheme does, up to around ten or eleven threads, from where it plateaus as well. The lack of further large increases in speedup, combined with a higher variance than in the run without angular parallelisation, is most likely due to an unbalanced workload distribution from band segments to threads. The variance in the somewhat linear line before the 8-thread threshold for the run without an angular parallelisation scheme can also be explained by an unbalanced workload distribution: Considering that each thread gets hold of another band, smaller thread counts will distribute multiple bands to one thread. This results in the same run (with otherwise the same parameters) giving one specific thread sometimes more, sometimes less work to do, depending on the overall thread count.

NkGen, on the other hand, has a speedup that follows a strictly linear line up to eight threads, from which point on the line stays linear, only with a smaller slope. All these phenomena occurring around the eight-thread threshold are a result of the switch to hyperthreading: In NkGen's case, the switch is advantageous in every regard, while in the EM-variant's case, only the angular parallelisation seems to get some improvement out of it; the run without it merely stagnates in its speedup after the threshold, as it cannot take advantage of hyperthreading. The reason for this might be that our band count in this instance is eleven, as we are using the geometric partitioning.

FIGURE 4.30: Speedup per thread count of the three algorithms. Average degree k = 500, while α = 0.75.

With eleven bands (the innermost ones additionally not having much workload put onto them), it seems reasonable to assume that, considering each band gets one thread, there just is not a large enough subdivision of the overall workload onto multiple threads for hyperthreading to even be considered. While the innermost bands are finished quickly in succession by one CPU core, all the others are handed out to separate cores, meaning that in the end, waiting for every core to be done with its own band is all the program does at that point. The angular parallelisation scheme, though, hands multiple parts over to multiple cores, making use of any hyperthreading that is allowed by the system.

Overall, from the figures it is quite apparent that NetworKit's generator handles parallelisation better and more cleanly than the EM-variant. The differences between NkGen and the EM-variant presumably come from the way parallelisation is used in both algorithms: In NetworKit's generator's case, not only are the sorting and generation phases done in parallel, the edge-creation phase is as well, with one significant difference to our version. Whereas the EM-variant only parallelises entire areas of the plane (parallelising entire bands or band segments), NkGen parallelises on a node-to-node basis (see the pseudocode in listing B.1 in the appendix): Each node is given to a different thread, which performs the binary search and establishes edges. This in turn results in a workload distribution that is asymptotically almost completely equal across the board under any number of available threads:

If we were, for example, to tier each node into a class of nodes with similar work production (a node on the innermost band will require more work than a node on the outermost band, for example), we would do so on a band basis. Assuming we have n_i nodes on band i, all of which are from the same workload-production class, dividing the number of nodes n_i by the number of threads t will be almost equal, with at worst one thread getting n_i mod t nodes more to work with. This number, though, is almost insignificant: considering that n_i is proportional to n, and with the thread count in our case being at most 16, the number n_i mod t is completely negligible for any node count n_i higher than 1000. In other words, for not even very high n, any equal division of nodes to threads will result in an equally balanced division of workload to threads.

FIGURE 4.31: Maximum resident set size per run on a linearly scaled axis. k = 500, while α = 0.75.


The EM-variant does not have that luxury: The algorithm is designed to circumvent unstructured accesses as much as possible, which is why we traverse the multiple sorters concurrently, and why we cannot assign each node to a different thread arbitrarily. The correctness of our algorithm (see lemma 2) is dependent on this sequential traversal. Were this not the case, and were we to assign different nodes to different threads, we would have to either establish communication processes between the threads to make sure that the active-array holds the correct query nodes at any point in time, or duplicate the sorters so that multiple threads working on the same band would have independent access to the same data. The former option would hamper parallelisation, while the latter would increase the memory usage manyfold, more specifically by the number of threads working in parallel on the same band.

This means that for our EM-variant, dividing the plane into multiple area segments allows for situations where the workload is not equally distributed onto the available threads, either because the number of segments is indivisible by the number of threads, or because of a radial partitioning that might not distribute an equalised workload onto the bands (one of the reasons why we investigated a method to do exactly that in chapter 3.6.3).

In summary, NkGen allows for a better usage of threading and hyperthreading, while the EM-variant can utilise both less efficiently. Since the GIRG implementation does not utilise parallelism at all, we cannot comment on its efficiency in that regard. Since Lamm's thesis shows his algorithm, which has similar properties to GIRG, to be on par with NetworKit's NkGen, though, we can assume that NkGen's run here is at least an indicator of GIRG's parallelism efficiency [Lam17].

4.4.4 Memory Usage Comparison


Another aspect we need to consider is how much memory each algorithm necessitates in comparison to the size of the generated graphs. The less efficient the memory-per-edge factor is, the quicker the algorithm will reach the RAM limit and thus necessitate external memory to continue working. The consequences of this we will see in the next subsection; for now we will focus only on how fast that limit is reached.

FIGURE 4.32: The sorting phase of the EM-variant, once with the minimal amount of memory possible to assign to the sorters (40 sorters, 44MB per sorter), once with overall 2GB assigned to the sorters.


Figure 4.31 shows the maximum resident set size in kilobytes of each run, i.e. the maximum number of kilobytes held in RAM at some point during the algorithm's run. If we were to lower the RAM capacity to a number smaller than that, the process would either crash or swap data between the internal and external memory locations (depending on what the system currently allows to happen).

The first thing to notice is that GIRG and NetworKit's generator both need very small amounts of memory during smaller graph creation, while the EM-variant needs more than one GB. This is because we chose in our implementation, regardless of size, to use the sorters made available by the STXXL library, which are used once per band segment. From a runtime perspective, the overhead of their creation did not affect the runtime negatively enough to discourage the use of parallelisation; regarding the memory usage, though, this is a problem, especially for smaller graphs. Each STXXL sorter necessitates a certain amount of memory as a minimum, which is why we are showcasing two runs of our algorithm in figure 4.31: one with the memory settings we had during our runtime comparisons in the internal memory environment (which is around 2GB for all sorters combined), and one with the minimal amount of memory allowed per sorter (this changes depending on the computer setup, set block size, etc.; in our case it was around 44MB per sorter, with overall 40 sorters). Figure 4.32 shows the difference in runtime for the sorting step alone (as this is the only step affected by this change), where the differences are almost negligible for smaller, but clearly apparent for larger graphs, as less available memory requires more merge passes of the sorting algorithm (see [DS03] for more details on the STXXL library's sorter algorithm).
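To make the memory constraint concrete, the following hedged sketch shows how such a per-band-segment sorter is set up with the STXXL library; the Point struct and the concrete numbers are illustrative, not our generator's actual code:

#include <stxxl/sorter>
#include <cstddef>
#include <limits>

struct Point { double phi, r; };

// STXXL sorters require a comparator that also provides sentinel values.
struct PhiOrder {
    bool operator()(const Point& a, const Point& b) const { return a.phi < b.phi; }
    Point min_value() const { return { -std::numeric_limits<double>::max(), 0.0 }; }
    Point max_value() const { return {  std::numeric_limits<double>::max(), 0.0 }; }
};

int main() {
    // Each sorter pins its own budget in RAM; 40 band-segment sorters at
    // roughly 44 MB each therefore already hold well over 1 GB resident.
    const size_t memPerSorter = 44ull * 1024 * 1024;
    stxxl::sorter<Point, PhiOrder> bandSorter(PhiOrder(), memPerSorter);

    bandSorter.push({1.23, 0.5}); // fill phase: points arrive unsorted
    bandSorter.sort();            // switch to the merge/read phase
    for (; !bandSorter.empty(); ++bandSorter) {
        const Point& p = *bandSorter; // consume points in phi order
        (void)p;
    }
}

Raising memPerSorter (in our internal memory runs, to a combined 2GB over all sorters) reduces the number of merge passes at the cost of a larger resident set.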

Overall, what can be seen is that, regardless of the sorters' memory settings, the memory usage is, from an asymptotic perspective, heavily in the EM-variant's favor: Around 2 · 10^7 nodes, NkGen and GIRG require more memory than the EM-variant's run with minimal memory settings; around 4 · 10^7 nodes they require more than the usual runs of the EM-variant.

FIGURE 4.33: Maximum resident set size divided by node count, to visualise the average amount of memory necessary per node.


To detail this asymptotic property better, figure 4.33 shows the memory-per-node factor, which can be seen as representative of the memory-per-edge factor due to the constant average degree during the runs. On this double-logarithmic scale, only the EM-variant holds a linear downward slope and has not yet reached a point where the required memory scales proportionally with the number of nodes. The two other algorithms, on the other hand, reach a minimum at 10^5 nodes during the run where k is set to 50, for example; this means that, while the EM-variant's actual memory-per-node limit is not visible yet, it definitely has a smaller factor, thus putting the other two at a disadvantage in a direct comparison.

In the end, though, which algorithm fares better with the available RAM depends on how much there is to begin with: If the RAM capacity is small, the EM-variant will reach the limit fairly quickly. If the capacity is large enough (at least more than around five to six GB, if one were to assume around one or two GB to be delegated to the operating system), though, the EM-variant will take much longer to reach it than the other two algorithms.

4.4.5 Runtime Performance in an External Memory Setting


Up until now, most runs were performed with all 64GB RAM of our system available, which was most of the time more than enough: both GIRG's and NkGen's limit was around 4 · 10^8 nodes, while the EM-variant was still able to perform. In order to focus our benchmarks on an EM setting, we chose to decrease the RAM available to each process to around 4GB by allowing memory swapping on our system and locking 60GB of memory from swapping itself. This causes all three algorithms to reach the RAM limit more quickly, giving us data on their effectiveness in an EM setting. The actual available RAM is lower than that, though, as some of it will be used by the operating system as well, but this number should be low enough and not unreasonable to put all three algorithms into external memory territory at some point during their runs.
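A minimal sketch of one way to reproduce such a setup (ours; the exact tooling used for the benchmarks may differ) is a small helper process that pins 60GB via mlock and stays alive while the benchmark runs:

#include <sys/mman.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main() {
    const size_t lockBytes = 60ull * 1024 * 1024 * 1024; // 60 GB
    void* block = std::malloc(lockBytes);
    if (block == nullptr) { std::perror("malloc"); return 1; }
    std::memset(block, 1, lockBytes);   // fault every page into RAM
    if (mlock(block, lockBytes) != 0) { // forbid swapping these pages out
        std::perror("mlock (raise RLIMIT_MEMLOCK or grant CAP_IPC_LOCK)");
        return 1;
    }
    std::puts("60 GB locked in RAM; start the benchmark process now.");
    pause();                            // hold the lock until this process is killed
    return 0;
}

Any process started afterwards then has only the remaining (roughly 4GB of) physical memory at its disposal before the system begins swapping.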

FIGURE 4.34: A benchmark with RAM restricted to around 4GB and memory swapping enabled. Average degree k = 10, while α = 0.75.


In figure 4.34 we see the run for an average degree of ten. As the EM-variant requires more RAM for the angular parallelisation scheme (since every band segment gets its own sorter, requiring a larger amount of minimum memory), we see that this scheme reaches the external memory setting earlier than NkGen or GIRG (for a clearer visualisation of the shift from internal to external memory for all three algorithms, see figures A.29 through A.31 in the appendix, where each algorithm is shown with a run with and without the RAM restrictions). At around 4 · 10^7 nodes, we can see that NetworKit has reached the memory wall as well, suddenly requiring as much time as the EM-variant. At every node count after that, though, we can see the relatively I/O-inefficient algorithm show its handicap: At 10^8 nodes, NkGen requires more than four times as much time as the EM-variant (230 vs. 74 seconds); at around 2 · 10^8 it is 20 times as much (3,000 vs. 150 seconds). The next node count after that took longer than 10,000 seconds, at which point we stopped the benchmark.

GIRG's case is a little more difficult to assess, as the runtime is already fairly high, but around 10^8 nodes, doubling the number of nodes increases the runtime threefold, meaning that somewhere around these graph sizes, GIRG also starts coming to a crawl. We stopped benchmarking afterwards as well, as the next higher node count took longer than 10,000 seconds for our setup, too.

In regards to the EM-variant itself, the runtime seems to be less linear than originally expected, though comparatively it is undoubtedly at a huge advantage: around the time the other two algorithms were stopped after 10,000 seconds, the EM-variant, at 2,500 seconds, still stays at a runtime under an hour for a graph with a billion nodes, i.e. approximately five billion edges.

Overall, while having only 4GB of RAM is not too large an issue for smaller graphs, regardless of the chosen algorithm, once we get to graphs that exceed the amount of memory available to us, the EM-variant is, as expected, better equipped to handle the increasing number of I/O operations than the other two algorithms. Nonetheless, the RAM limit only became an issue for graphs larger than 10^7 nodes at an average degree of ten, a point where the EM-variant performed better in an internal memory setting either way. In other words, while the external memory setting put the other algorithms at a big disadvantage, it only widened the gap between the algorithms at that point but did not change their ranking in any way.

4.5 Summary
All three algorithms seem to have an approximately linear runtime in the number of edges during runs in an internal memory environment. The dominating factor in all of them seems to be not the sorting of the respective data objects, but either the generation of nodes or the establishing of edges between them. In terms of scalability, the EM-variant is able to keep up with NkGen and even performs better for the creation of larger graphs, while GIRG comparatively underperforms at every level.

Regarding the multiple graph parameters, all three are also linear in both n and k alone, although differences are visible in regards to the α variable. A high α means a higher concentration of nodes towards the edge of the hyperbolic plane, creating, for NkGen and thus also the EM-variant, fewer wide-ranging queries. On the other hand, α also influences the radius R. For GIRG, this creates a runtime that, while rapidly falling as well, also has spikes that make the overall asymptotic runtime logarithmic in α. In contrast, both NkGen and the EM-variant have an overall negative slope in regards to the power-law variable.

In terms of parallelisation, NkGen is able to use all threads in a system to get a speedup linear in the number of threads. The EM-variant is less able to do so, as the algorithm's structured RAM access makes it unable to parallelise uniformly over all threads without possible disadvantages, either in higher memory consumption or a worse parallelised runtime through required communication processes.

In regards to the memory consumption, we can see another advantage of the EM-variant in practice: Since the EM-variant is asymptotically better in terms of memory efficiency, we are able to create graphs that are larger than 4 · 10^8 nodes, which, at 64GB RAM without swap space, neither of the alternatives is capable of without breaching this internal memory limit. On the other hand, because of the larger initial memory consumption, the EM-variant can reach the limit of a system with less RAM earlier than NkGen or GIRG.

Once all three algorithms reach the external memory environment, though, the EM-variant is undoubtedly at an advantage, performing with a better scaling than the alternatives. Even graphs with five billion edges can be generated in less than an hour in an external memory setting, while both NkGen and GIRG need more than 10,000 seconds for graphs one to two orders of magnitude smaller than that.

In summary, the EM-variant can be recommended when the intention is to create large graphs, as it performs better in comparison both with an unlimited and with a limited RAM size. For small hyperbolic graph creation, NkGen is the better choice.

Chapter 5

Conclusion

The goal of this thesis was to analyse and demonstrate the advantages and disadvantages of an EM approach towards the generation of hyperbolic random graphs, and to compare those insights with current state-of-the-art algorithms. Apparent from the results are the large runtime differences for very large graphs between the EM approach and its alternatives. This can be seen not only in the internal memory setting, but even more so in an external memory environment.

While the basic concepts for all three algorithms are the same (the definition of a hyperbolic ground plane, the subdivision of that plane into multiple band-like constructs based on radii boundaries, and the further reduction of the number of possible queries per node through placed boundary structures), they all are different enough in their approaches to affect the practical usage of those algorithms to a large degree.

In regards to the goals set at the beginning of this thesis, we can summarise that, in terms of large graphs with large average degrees, the EM approach outweighs its additional overhead disadvantages and even improves on the alternatives' runtime. In the EM environment, we were even able to create graphs with 10^9 nodes and 5 · 10^10 edges in under an hour, while both alternatives breached the 20GB swap space limit and were not able to finish the embedding.

The EM-variant is theoretically in O(m + n log(n)); the practical runtime, on the other hand, is empirically linear, since the edge-creation or the generation of nodes and Bound-objects is both linear and comparatively the dominating factor. Regarding smaller graphs with lesser average degrees, on the other hand, NetworKit's generator is able to embed graphs faster, as the EM-variant's overhead becomes overbearing at those sizes.

GIRG is the slowest alternative of them all by a wide margin, even with the random-hyperbolic-graph-oriented implementation. On the other hand, GIRG is also designed to fare well in a more general setting. Considering that hyperbolic graphs are just one of the possible graph types available for creation, GIRG does allow for a greater variety of graphs compared to the other, faster algorithms showcased in this thesis.

Another aspect that was not part of this thesis is that GIRG allows its graphs to be generated with an additional degree of randomness compared to the threshold-model graphs we were investigating: Apart from nodes having edges between them in case they are in close vicinity, there is also an option of using a temperature variable for random edge-creation between nodes that are further away. This is a graph property that both NkGen and the EM-variant do not account for, though the NetworKit library has a different algorithm for such use cases [LM16].

The same goes for the movement of nodes inside an already generated graph, which NkGen does allow for [LLM16, p. 4-6]: Our generator currently has no I/O-efficient answer to this problem, which might be a subject that could necessitate further analysis in another thesis.

Bibliography

[LLM16] Moritz von Looz, Mustafa Özdayi, Sören Laue, and Henning Meyerhenke. "Generating massive complex networks with hyperbolic geometry faster in practice". In: CoRR abs/1606.09481 (2016). URL: http://arxiv.org/abs/1606.09481.
[KPKVB10] Dmitri V. Krioukov, Fragkiskos Papadopoulos, Maksim Kitsak, Amin Vahdat, and Marián Boguñá. "Hyperbolic Geometry of Complex Networks". In: CoRR abs/1006.5169 (2010). URL: http://arxiv.org/abs/1006.5169.
[PBK11] Fragkiskos Papadopoulos, Marián Boguñá, and Dmitri V. Krioukov. "Popularity versus Similarity in Growing Networks". In: CoRR abs/1106.0286 (2011). URL: http://arxiv.org/abs/1106.0286.
[GPP12] Luca Gugelmann, Konstantinos Panagiotou, and Ueli Peter. "Random Hyperbolic Graphs: Degree Sequence and Clustering". In: CoRR abs/1205.1470 (2012). URL: http://arxiv.org/abs/1205.1470.
[MSS03] Ulrich Meyer, Peter Sanders, and Jop F. Sibeyn, eds. Algorithms for Memory Hierarchies, Advanced Lectures [Dagstuhl Research Seminar, March 10-14, 2002]. Vol. 2625. Lecture Notes in Computer Science. Springer, 2003. ISBN: 3-540-00883-7. DOI: 10.1007/3-540-36574-5. URL: https://doi.org/10.1007/3-540-36574-5.
[SSM16] Christian L. Staudt, Aleksejs Sazonovs, and Henning Meyerhenke. "NetworKit: A tool suite for large-scale complex network analysis". In: Network Science 4.4 (2016), pp. 508-530. DOI: 10.1017/nws.2016.20.
[BKL15] Karl Bringmann, Ralph Keusch, and Johannes Lengler. "Geometric Inhomogeneous Random Graphs". In: CoRR abs/1511.00576 (2015). URL: http://arxiv.org/abs/1511.00576.
[BFKL16] Thomas Bläsius, Tobias Friedrich, Anton Krohmer, and Sören Laue. "Efficient Embedding of Scale-Free Graphs in the Hyperbolic Plane". In: 24th Annual European Symposium on Algorithms (ESA 2016). Ed. by Piotr Sankowski and Christos Zaroliagis. Vol. 57. Leibniz International Proceedings in Informatics (LIPIcs). Dagstuhl, Germany: Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2016, 16:1-16:18. ISBN: 978-3-95977-015-6. DOI: 10.4230/LIPIcs.ESA.2016.16. URL: http://drops.dagstuhl.de/opus/volltexte/2016/6367.
[Lam17] Sebastian Lamm. "Communication Efficient Algorithms for Generating Massive Networks". MA thesis. Karlsruher Institut für Technologie, 2017. DOI: 10.5445/ir/1000068617.
[AV88] Alok Aggarwal and Jeffrey Scott Vitter. "The Input/Output Complexity of Sorting and Related Problems". In: Commun. ACM 31.9 (1988), pp. 1116-1127. DOI: 10.1145/48529.48535. URL: http://doi.acm.org/10.1145/48529.48535.
[AKL93] Lars Arge, Mikael B. Knudsen, and Kirsten Larsen. "A General Lower Bound on the I/O-Complexity of Comparison-based Algorithms". In: Algorithms and Data Structures, Third Workshop, WADS '93, Montréal, Canada, August 11-13, 1993, Proceedings. Ed. by Frank K. H. A. Dehne, Jörg-Rüdiger Sack, Nicola Santoro, and Sue Whitesides. Vol. 709. Lecture Notes in Computer Science. Springer, 1993, pp. 83-94. ISBN: 3-540-57155-8. DOI: 10.1007/3-540-57155-8_238. URL: https://doi.org/10.1007/3-540-57155-8_238.
[ER59] Paul Erdős and Alfréd Rényi. "On random graphs". In: Publicationes Mathematicae Debrecen 6 (1959), pp. 290-297.
[AB01] Réka Albert and Albert-László Barabási. "Statistical mechanics of complex networks". In: CoRR cond-mat/0106096 (2001). URL: http://arxiv.org/abs/cond-mat/0106096.
[BFM16] Michel Bode, Nikolaos Fountoulakis, and Tobias Müller. "The probability of connectivity in a hyperbolic model of complex networks". In: Random Struct. Algorithms 49.1 (2016), pp. 65-94. DOI: 10.1002/rsa.20626. URL: https://doi.org/10.1002/rsa.20626.
[LSMP15] Moritz von Looz, Christian L. Staudt, Henning Meyerhenke, and Roman Prutkin. "Fast generation of dynamic complex networks with underlying hyperbolic geometry". In: CoRR abs/1501.03545 (2015). URL: http://arxiv.org/abs/1501.03545.
[Bou97] Paul Bourke. Intersection of two circles. (Online; webpage last accessed: July 3rd, 2017.) Apr. 1997. URL: http://paulbourke.net/geometry/circlesphere/.
[DKS08] Roman Dementiev, Lutz Kettner, and Peter Sanders. "STXXL: standard template library for XXL data sets". In: Softw., Pract. Exper. 38.6 (2008), pp. 589-637. DOI: 10.1002/spe.844. URL: https://doi.org/10.1002/spe.844.
[DS03] Roman Dementiev and Peter Sanders. "Asynchronous parallel disk sorting". In: SPAA 2003: Proceedings of the Fifteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, June 7-9, 2003, San Diego, California, USA (part of FCRC 2003). ACM, 2003, pp. 138-148. ISBN: 1-58113-661-7. DOI: 10.1145/777412.777435. URL: http://doi.acm.org/10.1145/777412.777435.
[BS80] Jon Louis Bentley and James B. Saxe. "Generating Sorted Lists of Random Numbers". In: ACM Trans. Math. Softw. 6.3 (1980), pp. 359-364. DOI: 10.1145/355900.355907. URL: http://doi.acm.org/10.1145/355900.355907.
[JJ92] Joseph JáJá. An Introduction to Parallel Algorithms. Addison-Wesley, 1992. ISBN: 0-201-54856-9.
[Pen17] Manuel Penschuck. "Generating practical random hyperbolic graphs in near-linear time and with sub-linear memory". In: (2017). To appear in SEA 2017 Proceedings.
[FK15] Tobias Friedrich and Anton Krohmer. "On the Diameter of Hyperbolic Random Graphs". In: Automata, Languages, and Programming - 42nd International Colloquium, ICALP 2015, Kyoto, Japan, July 6-10, 2015, Proceedings, Part II. Ed. by Magnús M. Halldórsson, Kazuo Iwama, Naoki Kobayashi, and Bettina Speckmann. Vol. 9135. Lecture Notes in Computer Science. Springer, 2015, pp. 614-625. ISBN: 978-3-662-47665-9. DOI: 10.1007/978-3-662-47666-6_49. URL: https://doi.org/10.1007/978-3-662-47666-6_49.
[LM16] Moritz von Looz and Henning Meyerhenke. "Querying Probabilistic Neighborhoods in Spatial Data Sets Efficiently". In: Combinatorial Algorithms - 27th International Workshop, IWOCA 2016, Helsinki, Finland, August 17-19, 2016, Proceedings. Ed. by Veli Mäkinen, Simon J. Puglisi, and Leena Salmela. Vol. 9843. Lecture Notes in Computer Science. Springer, 2016, pp. 449-460. ISBN: 978-3-319-44542-7. DOI: 10.1007/978-3-319-44543-4_35. URL: https://doi.org/10.1007/978-3-319-44543-4_35.

Appendix A

Figures

A.1 Normal Distributions of Active-Size

FIGURE A.1: Actual occurrence count of active-sizes on the 8th band, overlaid with the normal distributions one would get by using eq. 3.4: N(55, 55), N(353, 353) and N(743, 743) for the actual data with k = 20, 200 and 500 respectively. α = 0.51.

FIGURE A.2: Actual occurrence count of active-sizes on the 8th band, overlaid with the normal distributions one would get by using eq. 3.4: N(14, 14), N(134, 134) and N(328, 328) for the actual data with k = 20, 200 and 500 respectively. α = 1.1.

A.2 Comparison Run of Best Band and Angular Parallelization Count Settings for the Geometric Workload Partitioning

FIGURE A.3: Comparison of the best geometric workload settings for small graphs. α = 0.75, k = 10. "Set: (a, b)" describes a run with an angular parallelization count of a and b bands; shown are the sets (5, 9), (2, 11), (8, 8), (3, 9), (6, 7), (7, 7), (4, 9) and (7, 9).

FIGURE A.4: Comparison of the best geometric workload settings for large graphs. α = 0.75, k = 1000.

A.3 Comparison Between the Radial Partitionings

FIGURE A.5: Comparison of the three radial partitionings (GEO, MIN, EQ). α = 0.51, k = 10.

FIGURE A.6: Comparison of the three radial partitionings. α = 0.51, k = 50.

FIGURE A.7: Comparison of the three radial partitionings. α = 0.51, k = 500.

FIGURE A.8: Comparison of the three radial partitionings. α = 0.51, k = 1000.

FIGURE A.9: Comparison of the three radial partitionings. α = 0.75, k = 10.

FIGURE A.10: Comparison of the three radial partitionings. α = 0.75, k = 50.

FIGURE A.11: Comparison of the three radial partitionings. α = 0.75, k = 500.

FIGURE A.12: Comparison of the three radial partitionings. α = 0.75, k = 1000.

FIGURE A.13: Comparison of the three radial partitionings. α = 1.1, k = 10.

FIGURE A.14: Comparison of the three radial partitionings. α = 1.1, k = 50.

FIGURE A.15: Comparison of the three radial partitionings. α = 1.1, k = 500.

FIGURE A.16: Comparison of the three radial partitionings. α = 1.1, k = 1000.

A.4 Internal Memory Runtime Comparison of the Algorithms

FIGURE A.17: Comparison of all major algorithms. α = 0.51, k = 10.

FIGURE A.18: Comparison of all major algorithms. α = 0.51, k = 50.

FIGURE A.19: Comparison of all major algorithms. α = 0.51, k = 500.



FIGURE A.20: Comparison of all major algorithms. α = 0.51, k = 1000.

FIGURE A.21: Comparison of all major algorithms. α = 0.75, k = 10.

FIGURE A.22: Comparison of all major algorithms. α = 0.75, k = 50.



FIGURE A.23: Comparison of all major algorithms. α = 0.75, k = 500.

FIGURE A.24: Comparison of all major algorithms. α = 0.75, k = 1000.

FIGURE A.25: Comparison of all major algorithms. α = 1.1, k = 10.



FIGURE A.26: Comparison of all major algorithms. α = 1.1, k = 50.

FIGURE A.27: Comparison of all major algorithms. α = 1.1, k = 500.

FIGURE A.28: Comparison of all major algorithms. α = 1.1, k = 1000.



A.5 External Memory Runtime Comparison of the Algorithms

FIGURE A.29: A benchmark of NetworKit's generator without (IM) and with (EM) RAM restricted to around 4GB with memory swapping enabled. Average degree k = 10, while α = 0.75.

FIGURE A.30: A benchmark of GIRG without (IM) and with (EM) RAM restricted to around 4GB with memory swapping enabled. Average degree k = 10, while α = 0.75.

FIGURE A.31: A benchmark of the EM-variant without (IM) and with (EM) RAM restricted to around 4GB with memory swapping enabled. Average degree k = 10, while α = 0.75.

Appendix B

Pseudocode

LISTING B.1: Fast Algorithm for Hyperbolic Random Graph Generation [LLM16, p. 3]
Input: number of vertices n, average degree k, power-law exponent γ
Output: G = (V, E)

α = (γ − 1)/2;
R = getTargetRadius(n, k, α);
V = n vertices;
C = {c_0, c_1, ..., c_max} set of log n ordered radial coordinates, with c_0 = 0 and c_max = R;
B = {b_0, b_1, ..., b_max} set of log n empty sets;
for vertex v ∈ V do in parallel
    draw φ[v] from U[0, 2π);
    draw r[v] with density f(r) = α·sinh(αr)/(cosh(αR) − 1);
    insert (φ[v], r[v]) in suitable b_i so that c_i ≤ r[v] < c_{i+1};
end
for b ∈ B do in parallel
    sort points in b by their angular coordinates;
end
for vertex v ∈ V do in parallel
    for band b_i ∈ B, where c_{i+1} > r[v] do
        φ_min, φ_max = getMinMaxPhi(φ[v], r[v], c_i, c_{i+1}, R);
        for vertex w ∈ b_i, where φ_min ≤ φ[w] < φ_max do
            if dist_H(v, w) ≤ R then
                add (v, w) to E;
            end
        end
    end
end
return G;

LISTING B.2: EM-Variant of NkGen with 3 Sorters
Input: number of vertices n, average degree k, power-law exponent γ
Output: G = (V, E)

R = getTargetRadius(n, k, α);
V = n vertices;
C = {c_0, c_1, ..., c_l} set of l + 1 ordered radial coordinates, with c_0 = 0 and c_l = R;
for c_i ∈ C do:
    c_i = MapToPoincare(c_i);
end
B = {b_0, b_1, ..., b_{l−1}} set of l empty band sets;
S_start = {SS_0, SS_1, ..., SS_{l−1}} set of l empty StartBound sets;
S_stop = {ST_0, ST_1, ..., ST_{l−1}} set of l empty StopBound sets;
i = 0; # for ID
for vertex v ∈ V do:
    i++;
    draw φ_v from U[0, 2π);
    draw r_v with density f(r) = α·sinh(αr)/(cosh(αR) − 1);
    r_v = MapToPoincare(r_v);
    insert point v = (φ = φ_v, r = r_v, ID = i) in suitable b_j so that c_j ≤ r_v < c_{j+1};
    for band b_j ∈ B, where c_{j+1} > r_v do
        φmin_v, φmax_v = getMinMaxPhi_Poincare(v, c_j, c_{j+1}, R);
        insert StartBound(u = v, φmin = φmin_v) in SS_j;
        insert StopBound(ID = v.ID, φmax = φmax_v) in ST_j;
    end
end
for i ∈ [0, ..., l − 1] do in parallel:
    sort points in b_i by their angular coordinates;
    sort StartBounds in SS_i by their φmin;
    sort StopBounds in ST_i by their φmax;
end
for band b_i ∈ B do in parallel:
    ss_cur = SS_i.first();
    st_cur = ST_i.first();
    v_cur = b_i.first();
    active = ∅;
    while v_cur != b_i.end() do
        current_token = GetSmallest(ss_cur, st_cur, v_cur);
        switch (type(current_token)) do
            case StartBound do
                add ss_cur to active;
                ss_cur = SS_i.next();
            end
            case StopBound do
                delete ss from active where ss.u.ID = st_cur.ID;
                st_cur = ST_i.next();
            end
            case vertex do
                for ss ∈ active do
                    if dist_H(v_cur, ss.u) ≤ R then
                        add (v_cur, ss.u) to E;
                    end
                end
                v_cur = b_i.next();
            end
        end
    end
end
return G;

LISTING B.3: EM-Variant of NkGen with 2 Sorters
1  Input:  number of vertices n, average degree k, power-law exponent γ
2  Output: G = (V, E)
3
4  R = getTargetRadius(n, k, γ);
5  V = n vertices;
6  C = {c_0, c_1, ..., c_l}: set of l + 1 ordered radial coordinates, with c_0 = 0 and c_l = R;
7  for c_i ∈ C do
8      c_i = MapToPoincare(c_i);
9  end
10 B = {b_0, b_1, ..., b_{l−1}}: set of l empty band sets;
11 S = {S_0, S_1, ..., S_{l−1}}: set of l empty StartStopBound sets;
12 for vertex v ∈ V do in parallel
13     draw φ_v from U[0, 2π);
14     draw r_v with density f(r) = sinh(r)/(cosh(R) − 1);
15     r_v = MapToPoincare(r_v);
16     insert point v = (φ_v, r_v) into the suitable b_i so that c_i ≤ r_v < c_{i+1};
17     for band b_i ∈ B where c_{i+1} > r_v do
18         φ_min_v, φ_max_v = getMinMaxPhi_Poincare(v, c_i, c_{i+1}, R);
19         insert StartStopBound(u = v, φ_min = φ_min_v, φ_max = φ_max_v) into S_i;
20     end
21 end
22 for i ∈ [0, ..., l − 1] do in parallel
23     sort points in b_i by their angular coordinates;
24     sort StartStopBounds in S_i by their φ_min;
25 end
26 for band b_i ∈ B do in parallel
27     s_cur = S_i.first();
28     v_cur = b_i.first();
29     active = ∅;
30     while v_cur != b_i.end() do
31         current_token = GetSmallest(s_cur, v_cur);
32         switch (type(current_token)) do
33             case StartStopBound do
34                 add s_cur to active;
35                 s_cur = S_i.next();
36             end
37             case vertex do
38                 for s ∈ active do
39                     if s.φ_max < v_cur.φ then
40                         delete s from active;  # bound expired: lazy deletion
41                     else if dist_H(v_cur, s.u) ≤ R then
42                         add (v_cur, s.u) to E;
43                     end
44                 end
45                 v_cur = b_i.next();
46             end
47         end
48     end
49 end
50 return G;
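The two-sorter variant drops the StopBound stream entirely: each bound carries its φ_max along, and expired bounds are removed lazily while a vertex scans the active set. A sketch of that sweep, under the same assumptions as the previous one:

import heapq
from collections import namedtuple

SSBound = namedtuple("SSBound", "vid point phi_max")  # point = (phi, r)

def sweep_band_lazy(band_points, bounds, R, dist_h):
    # band_points: (phi, r, vid) tuples sorted by phi
    # bounds:      (phi_min, SSBound) pairs sorted by phi_min
    starts = ((phi, 0, b) for phi, b in bounds)
    verts = ((p[0], 1, p) for p in band_points)
    active = {}  # vid -> SSBound
    edges = []
    for _, kind, tok in heapq.merge(starts, verts):
        if kind == 0:  # StartStopBound: activate the candidate
            active[tok.vid] = tok
        else:          # vertex token
            phi, r, vid = tok
            # lazy deletion: drop bounds whose phi_max lies behind the sweep
            for dead in [b.vid for b in active.values() if b.phi_max < phi]:
                del active[dead]
            for b in active.values():
                if b.vid != vid and dist_h((phi, r), b.point) <= R:
                    edges.append((vid, b.vid))
    return edges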

LISTING B.4: Main loop of the algorithm to determine an equalized-workload radial partitioning.

1  Input:  band count l, node count n, γ, radius R, maximum allowed difference in percent maxDif, maximum allowed iterations maxIter
2
3  Output: band partitioning C = (c_{−1}, c_0, c_1, ..., c_{l−1}) with c_{−1} = 0 and c_0 = R/2
4
5  iter = 0;
6  curW = 1;  # current "workload per band" candidate we are aiming for
7  W_stepsize = 0.05 · n;  # arbitrarily chosen step size
8  while iter < maxIter do
9      iter++;
10     # determine a partitioning in which each band has as close to curW work as possible
11     current_C = CalcRadiiEQ(l, n, γ, R, curW, curW · maxDif, maxIter);
12     # estimate the maximal per-band workload of the test partitioning
13     maxW = 0;
14     for (i = 0; i < l; i++) do
15         est = EstimateWork(current_C[i], current_C[i+1], n, γ, R);  # eq. 3.4
16         if maxW < est then
17             maxW = est;
18         end
19     end
20     # if the maximal workload over all bands exceeds curW by more than the
21     # allowed difference, try again with a larger curW; otherwise accept
22     if maxW > curW + curW · maxDif then
23         curW += W_stepsize;
24     else
25         break;
26     end
27 end
28 return current_C;
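Read as a driver loop, the procedure above repeatedly asks CalcRadiiEQ (Listing B.5) for a partitioning aimed at the candidate per-band workload curW and accepts the first one in which no band's estimate overshoots the tolerance. A compact Python rendering; estimate_work and calc_radii_eq are hypothetical stand-ins for EstimateWork (eq. 3.4) and Listing B.5, passed in as callables:

def equalize_partitioning(l, n, gamma, R, max_dif, max_iter,
                          estimate_work, calc_radii_eq):
    cur_w = 1.0
    step = 0.05 * n  # the arbitrarily chosen step size from the listing
    radii = None
    for _ in range(max_iter):
        # partitioning whose bands each aim for roughly cur_w work
        radii = calc_radii_eq(l, n, gamma, R, cur_w,
                              cur_w * max_dif, max_iter, estimate_work)
        max_w = max(estimate_work(radii[i], radii[i + 1], n, gamma, R)
                    for i in range(l))
        if max_w <= cur_w + cur_w * max_dif:
            break        # every band fits the candidate workload: accept
        cur_w += step    # otherwise retry with a larger target
    return radii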

LISTING B.5: CalcRadiiEQ.


1  Input:  band count l, node count n, γ, radius R, desired workload per band desW, maximal allowed difference from desW named curDif, maximum allowed iterations maxIter
2
3  Output: band partitioning C = (c_{−1}, c_0, c_1, ..., c_{l−1}) with c_{−1} = 0 and c_0 = R/2, whose workload per band is as close to desW as possible
4
5  current_C = [0, R/2, 0, 0, ..., 0, R];
6  for (i = 2; i < l; i++) do
7      minRad = current_C[i−1];
8      maxRad = R;
9      curRad = R;
10     curW = EstimateWork(minRad, maxRad, n, γ, R);
11     iter = 0;
12     # binary search for a radius curRad that gives band i a workload close to the desired one
13     while iter < maxIter do
14         iter++;
15         if |curW − desW| < curDif then break;
16         if curW > desW then
17             maxRad = curRad;  # band too heavy: shrink the outer radius
18         else
19             minRad = curRad;  # band too light: grow the outer radius
20         end
21         curRad = minRad + (maxRad − minRad)/2;
22         curW = EstimateWork(current_C[i−1], curRad, n, γ, R);  # re-evaluate the band's workload
23     end
24     current_C[i] = curRad;
25 end
26 return current_C;
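The same bisection in Python, with estimate_work again standing in for EstimateWork (eq. 3.4). Band i's inner boundary stays fixed at radii[i−1]; only its outer radius is bisected, and the band's workload is re-evaluated after every halving step:

def calc_radii_eq(l, n, gamma, R, des_w, cur_dif, max_iter, estimate_work):
    # radii[0] = 0 (c_{-1}) and radii[1] = R/2 (c_0) are fixed; the outermost
    # entry is R. Each remaining radius is found by bisection.
    radii = [0.0, R / 2.0] + [0.0] * (l - 2) + [R]
    for i in range(2, l):
        lo, hi = radii[i - 1], R
        cur = R
        cur_w = estimate_work(radii[i - 1], cur, n, gamma, R)
        for _ in range(max_iter):
            if abs(cur_w - des_w) < cur_dif:
                break
            if cur_w > des_w:
                hi = cur  # band too heavy: shrink the outer radius
            else:
                lo = cur  # band too light: grow the outer radius
            cur = lo + (hi - lo) / 2.0
            cur_w = estimate_work(radii[i - 1], cur, n, gamma, R)
        radii[i] = cur
    return radii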

LISTING B.6: Algorithm to determine a minimized-workload radial partitioning.

1  Input:  band count l, node count n, γ, radius R
2
3  Output: band partitioning C = (c_{−1}, c_0, c_1, ..., c_{l−1}) with c_{−1} = 0 and c_0 = R/2, whose overall workload is as small as possible
4
5  a = 1000;
6  stepSize = (R/2)/a;
7  # As a starting point, exhaustively search all possible radii combinations for 3 bands
8  # and choose the one with the smallest overall workload. Band 0 is always (0, R/2),
9  # which is why stepSize can be as big as (R/2)/a: we do not need to check the lowest band.
10 resultOfPrevRun = ExhaustiveSearch(3, a, stepSize);
11
12 for (i = 3; i < l; i++) do
13     j = resultOfPrevRun.size();
14     resultOfThisRun = new array of size j + 1;
15     # take over the previous result into this run
16     for (k = 0; k < j; k++) do
17         resultOfThisRun[k] = resultOfPrevRun[k];
18     end
19     # add R as the newest, outermost radius
20     resultOfThisRun[j] = R;
21
22     overallOptWorkCount = ∞;
23     currentOptWorkCount = ∞;
24     curOptRadii = new array of size j + 1;
25     leave = False;
26     # loop as long as we can still improve our partitioning for i bands
27     while !leave do
28         startRadii = resultOfThisRun.copy();
29         curRadii = resultOfThisRun.copy();
30         # curRadii[0] is always 0, curRadii[1] is always R/2, and the outermost entry is
31         # always R; because of that we only traverse the i − 1 = j − 3 movable radii,
32         # from m = 2 to m = i
33         for (m = 2; m <= i; m++) do
34             # go through at most a steps
35             for (s = 0; s < a; s++) do
36                 # decrease the current band's radius by one step
37                 curRadii[m] = startRadii[m] − s · stepSize;
38                 # if we went one step too far, i.e., below the lower band's outer radius,
39                 # continue with the next band's radius
40                 if curRadii[m] < curRadii[m−1] then break;
41                 # estimate the overall workload
42                 sumWork = 0;
43                 for (k = 1; k < j; k++) do
44                     sumWork += EstimateWork(curRadii[k−1], curRadii[k], n, γ, R);
45                 end
46                 # if better than the previous overall workload, save the current partitioning
47                 if sumWork < currentOptWorkCount then
48                     currentOptWorkCount = sumWork;
49                     curOptRadii = curRadii.copy();
50                 end
51             end
52             # for the next m, pin the current band's radius to the optimal one; it will not
53             # change until the for loop ends entirely and begins anew
54             curRadii[m] = curOptRadii[m];
55         end
56         # after having done this for all i bands, check whether the optimal workload is
57         # better than the previously calculated one; if so, save it for the next round
58         if currentOptWorkCount < overallOptWorkCount then
59             overallOptWorkCount = currentOptWorkCount;
60             resultOfThisRun = curOptRadii.copy();
61         # if not, the previous round had the better result, which is currently saved in
62         # resultOfThisRun; leave the while loop and increase i by one
63         else
64             leave = True;
65         end
66     end  # end of while loop
67     resultOfPrevRun = resultOfThisRun.copy();
68 end
69 return resultOfThisRun;
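The body of the while loop above is one coordinate-descent pass: each movable radius is scanned downward in at most a steps while the others stay fixed, and the partitioning with the smallest summed estimate is kept. A sketch of a single pass; estimate_work is the same hypothetical stand-in as before, and the sketch sums the estimate over all bands:

def improve_once(radii, a, step, n, gamma, R, estimate_work):
    # radii[0] = 0, radii[1] = R/2 and radii[-1] = R stay fixed; everything
    # in between is a movable band radius.
    best = list(radii)
    best_work = float("inf")
    cur = list(radii)
    j = len(radii) - 1
    for m in range(2, j):
        for s in range(a):
            cand = radii[m] - s * step
            if cand < cur[m - 1]:  # stepped below the inner neighbour: stop
                break
            cur[m] = cand
            work = sum(estimate_work(cur[k - 1], cur[k], n, gamma, R)
                       for k in range(1, j + 1))
            if work < best_work:
                best_work = work
                best = list(cur)
        cur[m] = best[m]  # pin this radius before moving to the next one
    return best, best_work

Repeating such passes until the total workload no longer decreases reproduces the while loop; the surrounding for loop then grows the band count one by one up to l.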
