Network Science

THE ATLAS FOR THE ASPIRING NETWORK SCIENTIST

IT UNIVERSITY OF COPENHAGEN
Copyright © 2025 Michele Coscia

Michele Coscia is employed by the IT University of Copenhagen, Rued Langgaards Vej 7, 2300 Copenhagen, Denmark.

tufte-latex.googlecode.com

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "as is" basis, without warranties or conditions of any kind, either express or implied. See the License for the specific language governing permissions and limitations under the License.
1 Introduction 7
I Background Knowledge 21
2 Probability Theory 22
3 Statistics 39
4 Machine Learning 55
5 Linear Algebra 70
II Graph Representations 90
6 Basic Graphs 91
8 Matrices 119
9 Degree 134
12 Density 179
IV Centrality 190
20 Epidemics 286
IX Mesoscale 429
30 Homophily 430
32 Core-Periphery 450
33 Hierarchies 462
X Communities 489
54 Glossary 789
Bibliography 803
1
Introduction
This serves me well, because it can capture many key things about
complex systems:
And so on.
The diversity of these examples shows how complexity is all around us – and so are networks. At some level, every aspect of reality seems to be made of interconnected parts. Societies are made by people entertaining multiple different types of relations with each other: artists citing each other's works, people making financial transactions, or developing friendships and enmities. The brains in their skulls are an intertwined web of neurons and synapses, but also machines to make inferences[3], which is another way to say connecting stimuli to each other. They're built with genes connected in a ballet of upregulating and downregulating dynamics. Genes are made with interacting proteins. Chemical compounds are atoms linked by bonds. Feynman's diagrams[4] show elementary particles having all sorts of interesting relations. It's interactions all the way down.

[3] Karl Friston. The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2):127–138, 2010.
[4] Richard Feynman. The theory of positrons. Phys. Rev., 76:749–759, 1949.
If complexity is all around us, why did I say it is difficult to define? Part of the reason is that it is difficult to quantify. We have decided to describe reality with the language of math, to the point of believing that the math is actually the only thing that is real and objective[5] but, try as we might, we haven't been able to quantify complexity. If math is the language of science, then our understanding of complexity is pre-scientific, because we haven't found a way to fit complexity in our language. So we can't find the laws of complexity.

[5] Max Tegmark. The mathematical universe. Foundations of Physics, 38(2):101–150, 2008.
The solution, in my opinion, is a change in perspective. We need
to develop a new language of complexity and go beyond our simple
quantitative approach to science. We need to embrace complexity, not
pigeonhole it into formulas. After all, the math is only useful when it hides the complexity away. One can be allured by the beauty of math and how well it describes reality, but math is only beautiful insofar as it hides complexity, rather than explaining it. Take for instance the Standard Model of physics. This is a model that succinctly describes every elementary interaction (except gravity). However, if you were trying to use it to simulate a single iron atom in isolation, you'd have a computationally intractable problem on your hands.
If you look at the formula behind the Standard Model of physics,
you’ll realize why:
$-\tfrac{1}{2}\partial_\nu g^a_\mu \partial_\nu g^a_\mu - g_s f^{abc}\partial_\mu g^a_\nu g^b_\mu g^c_\nu - \tfrac{1}{4}g_s^2 f^{abc} f^{ade} g^b_\mu g^c_\nu g^d_\mu g^e_\nu + \tfrac{1}{2}ig_s(\bar{q}^\sigma_i \gamma^\mu q^\sigma_j)g^a_\mu + \bar{G}^a \partial^2 G^a + g_s f^{abc}\partial_\mu \bar{G}^a G^b g^c_\mu - \partial_\nu W^+_\mu \partial_\nu W^-_\mu - M^2 W^+_\mu W^-_\mu - \tfrac{1}{2}\partial_\nu Z^0_\mu \partial_\nu Z^0_\mu - \tfrac{1}{2c_w^2}M^2 Z^0_\mu Z^0_\mu - \tfrac{1}{2}\partial_\mu A_\nu \partial_\mu A_\nu - \tfrac{1}{2}\partial_\mu H \partial_\mu H - \ldots$ [dozens of further terms for the electroweak, Higgs, fermion, and ghost sectors] $\ldots + \tfrac{1}{2}igM\left[\bar{X}^+ X^+ \phi^0 - \bar{X}^- X^- \phi^0\right]$
new field called “chemistry”, because the change in scale causes the
emergence of new phenomena.
We have some starting pointers to understand how this compartmentalization of knowledge can help scientific investigation. This is what Hayek called "division of knowledge"[8], which is a much more powerful concept than Smith's classical division of labor[9]. If I specialize as a chemist and hone my skills and tools to that specific task, I can be immensely more productive, because I am outsourcing all other knowledge discovery endeavors to other specialists. This is how societies grow their pool of knowledge efficiently. However, the result is that, now, no individual can really fully grasp a well-rounded picture of reality. The collective society can, but not its individual components. It is all deformed by the lens of their specialization.

[8] Friedrich August Hayek. The use of knowledge in society. The American Economic Review, 35(4):519–530, 1945.
[9] Adam Smith. The Wealth of Nations. 1776.
The resulting irony is that we might not be able to get to the language of complexity because we need it in order to get it. To make an advancement in physics we need teams of tens or hundreds of people. The paper containing the discovery of the Higgs boson[10] has 5,154 authors. The knowledge needed for it was so vast it could not fit in a single brain. Only a collective of interconnected brains could understand it – and that is a complex system.

How do you meaningfully coordinate a complex system to make an even vaster scientific discovery? You can only do it if you understand complexity – which is what you need the collective of brains to do! It's a circular problem. We already know that simple interventions will cause unexpected local optima that are unsatisfactory. Science, with its replication crisis[11], its publish-or-perish misaligned incentives, its difficulty in dealing with misinformation[12], isn't going as smoothly as it could. Because we don't understand complexity. Because we need to understand complexity in order to understand complexity.

[10] Georges Aad, Tatevik Abajyan, Brad Abbott, Jalal Abdallah, S Abdel Khalek, Ahmed Ali Abdelalim, R Aben, B Abi, M Abolins, OS AbouZeid, et al. Observation of a new particle in the search for the standard model Higgs boson with the ATLAS detector at the LHC. Physics Letters B, 716(1):1–29, 2012.
[11] John PA Ioannidis. Why most published research findings are false. PLoS Medicine, 2(8):e124, 2005.
[12] Carl T Bergstrom and Jevin D West. Calling Bullshit: The Art of Skepticism in a Data-Driven World. Random House Trade Paperbacks, 2021.
But I’ll be damned if I don’t give everything I have to make this
understanding of complexity happen. And I think our best shot is
via network science, because it is the field that gives us a way to talk
about emergence. We need to know how the different fields – physics,
chemistry, biology, ... – relate and transform into each other, which is
necessary to reconstruct a picture of reality.
Connecting those fields means finding a shared language that can
describe the relations between the symbols they use – make a mental
note of this, it’ll come back later. A shared language can then be
universal and move across layers and fields. If you understand intelligence as a description of how information is aggregated and manipulated in each part given its interactions, then this description is independent
of what the parts actually are, as long as they can perform the same
function. You can use this network theory of intelligence to describe
not only how individual brains learn, but how collectives made of
brains learn.
In summary, I hope I’m wrong when I say that we need to under-
stand complexity in order to understand complexity. I hope network
science can bootstrap our understanding. By teaching you network
science with this book, I’m trying my darnedest to prove myself
wrong. I want you – the collective of all the 25 people who’ll read
this thing – to understand complexity and save science and society in
the process. No pressure.
This is all fine and dandy but, at the end of the day, what does this
book contain?
At a general level, it contains the widest possible span of all that is related to network science that I know. It is the result of sixteen[44] years of experience that I poured into the field. Virtually any concept that I used or that I simply came to know in these years is represented in at least a sentence in this book.

[44] Boy, writing a version 2 of this book really makes me feel even older.
As you might expect, this is a lot to include and would not fit in a book, not even an ∼800-page one like this. By necessity, many – if not all – of the topics included in this book are treated relatively superficially. I would not say that this book would provide you with what you need to know to be a network scientist. But it would point you to what you need to know[45]. To borrow from Rumsfeld[46]: the book provides little to no known knowns, but it will provide you with all the known unknowns in network science – so that your unknown unknowns are aligned with those of everyone else. After internalizing this book, you will know what you don't know; you will be handed all the tools you need to ask meaningful questions about network science in 2025. You can go to the other books or to any other article, and find the answers. Or you can figure out the answer yourself.

[45] Connecting the symbols of network science, maybe?
[46] https://archive.defense.gov/Transcripts/Transcript.aspx?TranscriptID=2636
That is why I decided to call this book an “Atlas”. It is the map
you need to set foot among networks and start exploring. An atlas
doesn’t do the exploration for you, but you can’t explore without an
atlas. This is the book I wished I had fifteen years ago.
At a more specific level, the book is divided in fourteen parts.
Part II teaches you what a graph is and how many features were added to the simple mathematical model over the years, to empower our symbols to tame more and more complex real-world entities. Finally, it pivots perspectives to show an alternative way of manipulating networks, via matrices.
Part III is a carousel of all the simplest analyses you can perform on a graph. These are either local node properties – how many connections a node has – or global network ones – how many connections on average I have to cross to go from one node to another. We see that some of these are easy to calculate on the graph structure, while others are naturally solved by linear algebra operations. Shifting perspectives is something you need to get used to, if you want to make it as a network scientist.
Part IV uses some of the tools presented in the previous part to build
slightly more advanced analyses. Specifically, it focuses on the
question: which nodes are playing which role in the network?
And: can we say that a node is more important than another? If
you want to answer these questions, you need to relate the entire
network structure to a node, i.e. to use fully what Part III trained
you to do.
Part V teaches you the main approaches for the creation of synthetic
network data. It explores the main reasons why we want to do
it. Sometimes, it is because we need to test an algorithm and we
need a benchmark. Alternatively, we can use these models to
reproduce the properties of real world networks we investigated in
the previous parts, to see whether we understand the mechanisms
that make them emerge.
Part IX opens the Pandora’s Box of the level of analysis that is the
most interesting and probably the one with which you will strug-
gle most of the time: the mesoscale. The mesoscale is what lies
between local node properties and global network statistics. This
includes – but is not limited to – questions such as: does my net-
work have a hierarchical structure? Is there a densely connected
core surrounded by a sparsely connected periphery? Do nodes
consider other nodes’ properties to decide whether to connect to
them?
Part XI takes a steep turn into the realm of computer science. It deals
with graph mining: a collection of techniques that allow you to
discover patterns in your graph structure, even if you are not sure
about what these patterns might look like or hint at. It is what we
would call “bottom-up” discovery. This is where you’ll find a deep
dive on graph neural networks, which is one of the fastest moving
subfields of network science at the moment.
Part XIII includes a few tips and tricks for an aspect of network sci-
ence that is rarely covered in other books: how to browse/explore
your network data and how to communicate your results. Specifi-
cally, I will show you some best practices in visualizing networks.
I am a visual thinker and, sometimes, patterns and ideas about
those patterns emerge much more clearly when you see them,
rather than scouting through summary statistics. Moreover, network science papers thrive on visual communication, and a good-looking network has an amazing chance of ending up on the cover of the journal you're publishing in. It is a mystery to me why you would not spend some time making sure that your network figures are at least of passable quality. Moreover, even if we are all primed to think dots and lines when it comes to visualizing a graph, you should be aware of the situations in which there are different ways to show your network.
This is the second edition of the book. I decided to update the book
because there were many things I found unsatisfactory with the
first edition. Here I’m letting you know what’s new besides making
the introduction slightly less tongue-in-cheek – so that, if you al-
ready read version one, you know where to check for what you have
missed.
The first thing I greatly expanded was the introductory part. In
version one, I only had a chapter on probability theory. The book
1.6 Acknowledgements
Another one who went beyond the call of duty was Andres
Gomez-Lievano. Andres and I shared a desk for years and I cher-
ish those as the most fun I had at work. Andres didn’t stop at the
chapters I asked him to review, but deeply commented on the philos-
ophy and framing of this book. I can see in his comments the spark
of the years we spent together.
My other kind reviewers were, in alphabetical order: Alexey
Medvedev, Andrea Tagarelli, Charlie Brummitt, Ciro Cattuto, Clara
Vandeweerdt, Fred Morstatter, Giulio Rossetti, Gourab Ghoshal,
Isabel Meirelles, Laura Alessandretti, Luca Rossi, Mariano Beguerisse,
Marta Sales-Pardo, Matté Hartog, Petter Holme, Renaud Lambiotte,
Roberta Sinatra, Yong-Yeol Ahn, and Yu-Ru Lin.
For version two, I have recruited the additional help of: Giovanni
Puccetti, Matteo Magnani, Maria Astefanoaei, Daniele Cassese, and
Paul Scherer.
All these people donated hours of their time with no real tangible
reward, just to make sure my book graduated from “incomprehensi-
ble mess” to “almost passable and not misleading”. Thank you.
With their work, some reviewers expressed their intent to support charitable organizations. Specifically, they mentioned TechWomen[47] – to support the careers of women in STEM fields –, Evidence Action[48] – to expand our de-worming efforts, reaping their surprisingly high societal payoff –, and Doctors Without Borders[49]. You should also consider donating to them.

[47] https://www.techwomen.org/
[48] https://www.evidenceaction.org/
[49] https://www.doctorswithoutborders.org/
If there’s any value in this book, it comes from the hard work of
these people. All the mistakes that remain here are exclusively due to
my ineptitude in properly implementing my reviewers’ valuable com-
ments. I expect there must be many such mistakes, ranging from trivial typos and clumsily written sentences to more fundamental issues of misrepresentation. If you find some, feel free to drop me an email at [email protected].
If, for some reason, you only have access to a printed version of this book – or you found the PDF somewhere on the Internet – know that there is a companion website[50] with data for the exercises, their solutions, and – hopefully in the future – interactive visualizations.

[50] https://www.networkatlas.eu/
Part I
Background Knowledge
2
Probability Theory
There are more subtleties to this, but since these are the two main
approaches we will see in this book, there is no reason to make this
picture more complex than it needs to be.
To understand the difference, let’s suppose you have Mrs. Fre-
quent and Mr. Bayes experimenting with coin tosses. They toss a coin
ten times and six out of ten times it turns heads up. Now they ask
themselves the question: what is the probability that, if we toss the
coin, it will turn heads up again?
Mrs. Frequent reasons as follows: “An event’s probability is the
relative frequency after many trials. We had six heads after ten tosses,
thus my best guess about the probability it’ll come out as heads is
60%”. Note that Mrs. Frequent doesn't really believe that ten tosses gave her a perfect understanding of that coin's odds of landing on heads. Mrs. Frequent knows that she will get the answer wrong a certain number of times – that is what confidence intervals are for – but for the sake of this example we need not go there.
“Hold on a second,” Mr. Bayes says, “Before we tossed it, I exam-
ined the coin with my Coin ExaminerTM and it said it was a fair coin.
Of course my Coin ExaminerTM might have malfunctioned, but that
rarely happens. We haven’t performed enough experiments to say
it did, but I admit that the data shows it might have. So I think the
probability we'll get heads again is 51%”. Just like Mrs. Frequent, Mr. Bayes is also uncertain, and he has a different procedure to estimate such uncertainty – in this case dubbed "credible intervals" – which again we leave out for simplicity.
Herein lies the difference between a frequentist and a Bayesian.
For a frequentist only the outcome of the physical experiment mat-
ters. If you toss the coin an infinite number of times, eventually
you’ll find out what the true probability of it landing on heads is.
For a Bayesian it's all about degrees of belief. The Bayesian has a set of opinions about how the world works, which they call "priors". Performing enough new experiments can change these priors, using a standard set of procedures to integrate new data. However, a Bayesian will never take a new surprising event at face value if it is wildly off their priors, because those priors were carefully obtained knowledge coherent with how the world worked thus far.
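To make the contrast concrete, here is a minimal sketch – my own illustration, not the book's material – of how the two could turn the same ten tosses into an estimate. The Beta prior standing in for Mr. Bayes' Coin Examiner is an assumption of mine:

    # Frequentist vs Bayesian estimates of P(heads) after 6 heads in 10 tosses.
    # The Beta(50, 50) prior is a hypothetical stand-in for a strong belief
    # that the coin is fair (it is not taken from the book).
    heads, tosses = 6, 10

    # Mrs. Frequent: relative frequency after the trials.
    p_frequentist = heads / tosses  # 0.6

    # Mr. Bayes: start from a Beta prior and update it with the data.
    # With a Beta(a, b) prior, the posterior mean is (a + heads) / (a + b + tosses).
    a, b = 50, 50
    p_bayesian = (a + heads) / (a + b + tosses)  # ~0.51

    print(f"Frequentist estimate: {p_frequentist:.2f}")
    print(f"Bayesian posterior mean: {p_bayesian:.2f}")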
Figure 2.1 shows the difference between the mental processes
between a frequentist and a Bayesian. The default mode for this book
is taking a frequentist approach. However, here and there, Bayesian
interpretations are going to pop up, thus you have to know why
we’re doing things that way.
Figure 2.1: Schematics of the mental processes used by a frequentist and a Bayesian when presented with the results of an experiment. The frequentist goes from the experiment outcome to a future expectation; the Bayesian combines a prior expectation with the experiment outcome, updating the priors into a future expectation.
2.2 Notation
intuition.
2.3 Axioms
you can figure out what event W did to the coin. If P(H|W) > P(H), it means that adding the weight to the coin made it more likely to land on heads. P(H|W) < P(H) means the opposite: your coin is loaded towards tails. The P(H|W) = P(H) case is equally interesting: it means that you added the weight uniformly and the odds of the coin landing on either side didn't change.

This is a big deal: if you have two events and this equation, then you can conclude that the events are independent – the occurrence of one has no effect on the occurrence of the other[4]. This should be your starting point when testing a hypothesis: the null assumption is that there is no relation between an outcome (landing on heads) and an intervention (adding a weight). "Unless," Mr. Bayes says, "you have a strong prior for that to be the case."

[4] Note that here I'm talking about statistical independence, which is not the same as causal independence. Two events could be statistically dependent without being causally dependent. For instance, the number of US computer science doctorates is statistically dependent with the total revenue of arcades (http://www.tylervigen.com/spurious-correlations). This is what the mantra "correlation does not imply causation" means: correlation is mere statistical dependence, causation is causal dependence, and you shouldn't confuse one with the other. You should check [Pearl and Mackenzie, 2018] to delve deeper into this.

Reasoning with conditional probabilities is trickier than you might expect. The source of the problem is that, typically, P(H|W) ≠ P(W|H), and often dramatically so. Suppose we're tossing a coin to settle a dispute. However, I brought the coin and you think I might be cheating. You know that, if I loaded the coin, the probability of it landing on heads is P(H|W) = 0.9. However, you can't see nor feel the weights: the only thing you can do is toss it and – presto! – it lands on heads. Did I cheat?

Naively you might rush and say yes, there's a 90% chance I cheated. But that'd be wrong, because the coin already had a 50% chance of landing on heads without any cheating. Thus P(H|W) ≠ P(W|H), and what you really want to estimate is the probability I cheated given that the coin landed on heads: P(W|H). How to do so, using what you know about coins (P(H)) and what you know about my integrity (P(W)), is the specialty of Bayes' Theorem:

P(W|H) = P(H|W) P(W) / P(H).
Figure 2.4 shows a graphical proof of the theorem. When trying
to derive P(W | H ) P( H ), we realize that’s identical to P( H |W ) P(W ),
from which Bayes’ theorem follows.
I already told you that I’m a pretty good coin rigger (P( H |W ) =
0.9). For the sake of the argument, let’s assume I’m a very honest
person: the probability I cheat is fairly low (P(W ) = 0.3).
Now, what's the probability of landing on heads (P(H))? P(H) is trickier than it appears, because we're in a world where people might cheat. Thus we can't be naive and say P(H) = 0.5. P(H) is 0.5 only if rigging coins is impossible. It's more correct to say P(H|−W) = 0.5: a non-rigged coin (if W didn't happen, which we refer to as −W) is fair and lands on heads 50% of the time. The real P(H) is P(H|−W)P(−W) + P(H|W)P(W). In other words: the probability of the coin landing on heads is the non-rigged heads probability if I didn't rig it (P(H|−W)P(−W)) plus the rigged heads probability if I rigged it (P(H|W)P(W)).
The probability of not cheating, P(−W), is equal to 1 − P(W). This is because cheating and not cheating are mutually exclusive and either of the two must happen. Thus we have Ω = {W, −W}. Since P(Ω) = 1 and P(W) = 0.3, the only way for P(W ∪ −W) to be equal to 1 is if P(−W) = 0.7.
This leads us to: P( H ) = P( H | − W ) P(−W ) + P( H |W ) P(W ) =
0.5 × 0.7 + 0.9 × 0.3 = 0.62. Shocking.
The aim of Bayes’ theorem is to update your prior about me
cheating (P(W )) given that, suspiciously, the toss went in my favor
(P(W ) → P(W | H )). Plugging in the numbers in the formula:
P(W|H) = (0.9 × 0.3) / 0.62 = 0.43.
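If you want to check these numbers yourself, here is a minimal sketch (mine, not part of the book's material) of the rigged-coin update:

    # Bayes' theorem for the rigged-coin example:
    # W = "I rigged the coin", H = "the coin landed on heads".
    p_w = 0.3              # prior probability that I cheat
    p_h_given_w = 0.9      # heads probability if the coin is rigged
    p_h_given_not_w = 0.5  # heads probability for a fair coin

    # Total probability of heads in a world where cheating is possible.
    p_h = p_h_given_not_w * (1 - p_w) + p_h_given_w * p_w  # 0.62

    # Posterior probability that I cheated, given that heads came up.
    p_w_given_h = p_h_given_w * p_w / p_h  # ~0.43

    print(f"P(H) = {p_h:.2f}, P(W|H) = {p_w_given_h:.2f}")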
P(C|+) = (0.999 × 0.001) / (0.999 × 0.001 + 0.001 × 0.999) = 0.5.
The probability you have cancer is not 99.9%: it's a coin toss! (Still bad, but not that bad.)[5]

[5] Of course, in the real world, if you took the test it means you thought you might have cancer. Thus you were not drawn randomly from the population, meaning that you have a higher prior that you had cancer. Therefore, the test is more likely right than not. Bayes' theorem doesn't endorse carelessness when receiving bad news from a very accurate medical test.

The real world is a large and scary environment. Many different things can alter your priors and have different effects on different events. The way a Bayesian models the world is by means of a Bayesian network: a special type of network connecting events that influence each other. Exploring a Bayesian network allows you to make your inferences by moving from event to event. I talk more about Bayesian networks in Section 6.4.
2.6 Stochasticity
When you observe an actual grain, you obtain only one of those
paths. The observed path is called a realization of the process. Figure
2.5 shows three such realizations, which should help you visualize
Figure 2.6 is a stochastic matrix[8]. The rows tell you your current state and the columns tell you your next state. If you are in the first row, you have a 30% probability of remaining in that state (the value of the cell in the first row and first column is 0.3). You have a 20% probability of transitioning to state two (first row, second column), an 8% probability of transitioning to state three, and so on.

[8] Specifically, it is a right stochastic matrix: the rows sum to one, although there's a bit of rounding going on. In a left stochastic matrix, the columns sum to one.
A bit more formally, let's assume you indicate your state at time t with X_t. You want to know the probability of this state being a specific one, let's say x. x could be the id of the node you visit at the t-th step of your random walk. If your process is a Markov process, the only thing you need to know is the value of X_{t−1} – i.e. the id of the node you visited at t − 1. In other words, the probability of X_t = x is P(X_t = x | X_{t−1} = x_{t−1}). Note how X_{t−2}, X_{t−3}, ..., X_1 aren't part of this estimation. You don't need to know them: all you care about is X_{t−1}.
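As a rough illustration (with a made-up three-state transition matrix, since Figure 2.6's exact values are not reproduced here), a Markov process can be simulated by repeatedly sampling the next state from the row of the current state:

    import random

    # A hypothetical right stochastic matrix: row i holds the probabilities
    # of moving from state i to each possible next state (each row sums to 1).
    transition = [
        [0.3, 0.2, 0.5],
        [0.1, 0.6, 0.3],
        [0.4, 0.4, 0.2],
    ]

    def markov_walk(start, steps):
        """One realization of the Markov process: the next state depends
        only on the current one."""
        state = start
        path = [state]
        for _ in range(steps):
            state = random.choices(range(3), weights=transition[state])[0]
            path.append(state)
        return path

    print(markov_walk(start=0, steps=10))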
On the other hand, a non-Markov process is a process for which
knowing the current state doesn’t tell you anything about the next
possible transitions. For instance, a coin toss is a non-Markov process.
The fact that you toss the coin and it lands on heads tells you nothing
about the result of the next toss – under the absolute certainty that
the coin is fair. The probability of Xt = x is simply P( Xt = x ): there’s
no information you can gather from your previous state.
Finally, we have higher-order Markov processes. Higher-order
means that the Markov process now has a memory. A Markov pro-
cess of order 2 can remember one step further in the past. This means
that, now, P( Xt = x | Xt−1 = xt−1 , Xt−2 = xt−2 ): to know the probabil-
ity of Xt = x, you need to know the state value of Xt−2 as well as of
Xt−1 . More generally, P( Xt = x | Xt−1 = xt−1 , Xt−2 = xt−2 , ..., Xt−m =
xt−m ), with m ≤ t.
The classical network example of a higher-order Markov process is the non-backtracking random walk (Figure 2.8). In a non-
backtracking random walk, once you move from node u to node
v, you are forbidden to move back from v to u. This means that,
once you are in v, you also have to remember that you came from u.
Higher order Markov processes are the bread and butter of higher
order network problems, which is the topic of Chapter 34.
Figure 2.8: A non-backtracking random walk ("No backtracking!").
Starting with beliefs, the first thing you need is a degree of belief, which estimates your ability to prove a set of beliefs[11]. This degree of belief is quantified by a function which is conventionally called its Mass function. Figure 2.9 shows a relatively simple example when tossing a coin – slightly loaded on heads. The distinction between Mass in DST and classical probability is that it considers the case "we don't know whether heads or tails" as distinct from "heads" and "tails". In probability theory, you wouldn't make this distinction, because no other outcome than heads or tails can happen, even if for some reason you don't know the outcome. But in DST you want to model this, because we're talking about the ability of proving our statement, so we need to specifically take into account the situation in which we don't actually know the result – e.g., if the coin rolled under the sofa and we can't see it. In that case, we don't have any evidence to say that the coin landed on heads or tails.

[11] Judea Pearl. Reasoning with belief functions: An analysis of compatibility. International Journal of Approximate Reasoning, 4(5-6):363–389, 1990.
In summary, if Ω = {H, T}, then p(Ω) = p(H) + p(T) = 1, but Mass(Ω) ≠ Mass(H) + Mass(T): Mass(Ω) is less trivial and actually informative – it is the amount of uncertainty we have about the outcome of the event given the imperfection of our evidence. So the Mass function basically gives all the available evidence a probability and obeys the following two rules:

1. Mass(∅) = 0, and
2. ∑_{X ∈ 2^Ω} Mass(X) = 1.
Now that you know the probabilities of all possible outcomes, you
must make a hypothesis which is a set of potential outcomes. To
estimate your ability to prove your hypothesis, you sum all the Mass
values of all the subsets of your hypothesis. This is the Belief func-
tion, and you can see how it works in Figure 2.10. If your hypothesis
has no support in the gathered evidence then its Belief value is zero,
while if it is absolutely certain then Belief evaluates to one. So Belief
tells you how likely your hypothesis – or any of its subsets – is to be
proven given the available evidence.
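As a loose sketch (my own toy numbers, not the ones in Figure 2.9), the Mass and Belief functions for the loaded-coin example could be computed like this:

    from itertools import chain, combinations

    # Hypothetical Mass assignment for a coin slightly loaded on heads,
    # with some evidence left unassigned to the "H or T" case.
    mass = {
        frozenset(["H"]): 0.5,
        frozenset(["T"]): 0.3,
        frozenset(["H", "T"]): 0.2,  # uncertainty: we couldn't see the outcome
    }

    def belief(hypothesis):
        """Sum the Mass of every non-empty subset of the hypothesis."""
        hypothesis = frozenset(hypothesis)
        subsets = chain.from_iterable(
            combinations(hypothesis, r) for r in range(1, len(hypothesis) + 1))
        return sum(mass.get(frozenset(s), 0) for s in subsets)

    print(belief({"H"}))       # 0.5: evidence directly supporting heads
    print(belief({"H", "T"}))  # 1.0: the full sample space is certain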
Fuzzy Logic
In probability theory, you only deal with boolean events, whose truth values can either be zero or one. Either something is false or it is true. The coin either landed on heads or on tails. In fuzzy logic, you work with something different. Things can have degrees of truthiness, which is to say we assign them a truth value between zero and one. If it is zero, we're certain that a statement is false; if it is one, we're certain it is true; and if it is a value in between, then there is some vagueness about whether it is true or false[13].

[13] Petr Hájek. Metamathematics of Fuzzy Logic, volume 4. Springer Science & Business Media, 2013.
For instance, at the moment of writing this paragraph I am 39 Business Media, 2013
years old. Is that young or old? Well, you could line up 100 people
and ask them this question. Maybe 60 will say that I’m young, 39
will say that I’m old (I’m so insecure I am disrespected even in my
thought experiments), and 1 will say something else. In probability
theory you could model this as something like: there’s a 39% chance
a random person will call me old (hey!). But in fuzzy logic you’d
do something different. You could say that I belong to both the sets
of young people and old people, with different strengths. I’m 60%
young and 39% old – I frankly don’t know if that’s an improvement
over the alternative.
The consequences of this difference lead to different outcomes when working with fuzzy logic. We'll see a basic common example[14], but know that there are alternative ways to implement fuzzy logic[15]. Let's assume that the degree to which a person belongs to an age class depends on their age, following the function I draw in Figure 2.12.

[14] Ebrahim H Mamdani. Application of fuzzy algorithms for control of simple dynamic plant. In Proceedings of the Institution of Electrical Engineers, volume 121, pages 1585–1588. IET, 1974.
[15] Tomohiro Takagi and Michio Sugeno. Fuzzy identification of systems and its applications to modeling and control. IEEE Transactions on Systems, Man, and Cybernetics, (1):116–132, 1985.

Figure 2.12: An illustration of fuzzy logic. The degrees of belonging (y axis) to different age classes (line color: young, old, other) for a given age (x axis); the highlighted age is 39.

For the age I highlight, probability theory can say the following things:
change their mind independently from what they said the first time) is P(Y ∩ O) = P(Y)P(O) = 0.234;

• My belonging to the class of people who are both young and old is P(Y ∩ O) = min(P(Y), P(O)) = 0.39 – it can't be any higher than my minimum belonging, because to fully belong to the young-old class I must be fully young and fully old (see the sketch after this list);
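A minimal sketch of the two readings – plain Python, with membership values taken from the example above rather than from Figure 2.12's exact curves:

    # Degrees of belonging for a 39-year-old, following the example above.
    young = 0.60
    old = 0.39

    # Probability-theory reading: treat the two judgments as independent events.
    p_both = young * old           # 0.234

    # Fuzzy-logic reading: membership in the "young AND old" class is capped
    # by the weakest membership.
    fuzzy_both = min(young, old)   # 0.39

    print(p_both, fuzzy_both)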
2.9 Summary
1. Probability theory gives you the tools to make inferences about un-
certain events. We often use a frequentist approach, the idea that
an event’s probability is approximated by the aggregate past tests
of that event. Another important approach is the Bayesian one,
which introduces the concept of priors: additional information that
you should use to adjust your inferences.
2.10 Exercises
1. Suppose you’re tossing two coins at the same time. They’re loaded
in different ways, according to the table below. Calculate the
probability of getting all possible outcomes:
p1 ( H ) p2 ( H ) H-H H-T T-H T-T
0.5 0.5
0.6 0.7
0.4 0.8
0.1 0.2
0.3 0.4
at any data. When data takes center stage, you enter the world of statistics. Statistics covers more than simply describing your data: you should think in statistical terms also when collecting, cleaning, and validating your data. But here we ignore all that and focus on the tools that allow you to say something interesting about your data, assuming you did a good job collecting, cleaning, and validating it.
tributed – meaning that the average value is actually the mean and values farther from the mean are progressively more rare. We'll see what a normal distribution looks like in Section 3.2. The same section will also tell you that not all variables distribute like that: in some cases the mean is actually not a great approximation of your average (or "typical") case. Wealth is like that: a tiny fraction of people own vastly more than the majority. In this case, the median gives you a better idea[4]. The median tells you the value that splits the data in two equally populated halves: 50% of the points are below the median and 50% of the points are above.

[4] David J Sheskin. Handbook of Parametric and Nonparametric Statistical Procedures. Chapman and Hall/CRC, 2003.
Figure 3.1 shows that the mean and the median can be quite differ-
ent. In network science, we use the mean extensively – even though
maybe we shouldn't. When we talk about the number of connec-
tions a node has in a network (Chapter 9), we’ll see we routinely
take its “average” by calculating its mean. But connection counts
in real networks typically do not follow a neat normal distribution.
These distributions tend to look more like Figure 3.1(b) than Figure
3.1(a). So, perhaps the mean count of connections is not the most
meaningful thing you can calculate.
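A quick sketch of how different the two can be on skewed data (made-up wealth numbers, not the ones behind Figure 3.1):

    import statistics

    # A hypothetical, heavily skewed "wealth" sample: many small values, one huge one.
    wealth = [1, 2, 2, 3, 3, 4, 5, 5, 8, 1000]

    print(statistics.mean(wealth))    # 103.3 -- dragged up by the outlier
    print(statistics.median(wealth))  # 3.5   -- closer to the "typical" person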
This disconnect between the mean and the typical case is true for the arithmetic mean I show here, but there are other types of means – such as geometric or harmonic – which can take into account some special properties of the data[5].

[5] Philip J Fleming and John J Wallace. How not to lie with statistics: the correct way to summarize benchmark results. Communications of the ACM, 29(3):218–221, 1986.
Variance & Standard Deviation
When we deal with an average observation, we might want to know
not only its expected value, but also how much we expect it to differ
from the actual average value. Even if the average human height is,
let’s say, 1.75 meters, we could think of two radically different popu-
lations. In the first, almost everyone is more or less 1.75 and heights
don’t vary much. In the other, the opposite is true: the average is
still 1.75, but people could be anything between 1 meter and 2.5 me-
ters. So the heights in this second population vary much more. We
over the average value. However, these outliers are more common
and less extreme. Figure 3.5 shows a graphical example.
If we're talking wealth, a long tail world is a world with a single Jeff Bezos where everybody else works in an Amazon warehouse. In a fat tail world, Jeff might not be quite as rich, but there are a few billionaire friends to keep him company.
Chapter 9 will drill in your head how important distributions are for
network science, so it pays off to become familiar with a few of them.
First, let’s make an important distinction. There are two kinds of
distributions, depending on the kinds of values that their underlying
variables can take. There are discrete distributions – for instance, the
distribution of the number of ice cream cones different people ate on
a given day. And there are continuous distributions – for instance the
distances you rode on your bike on different days. The difference is
that the former has specific values that the underlying variable can
take (you may have eaten two or three ice cream cones, but 2.5 is not
an option), the latter can take any real value as an outcome. In the
first discrete case, we call the distribution a “mass function”. In the
second case, we call it a “density function”.
Figure 3.6 shows some stylized representations of the most impor-
tant distributions you should pay attention to, which are:
Section 9.3.
• Lognormal: a lognormal distribution is the distribution of a
continuous random variable whose logarithm follows a normal
distribution – meaning the logarithm of the random variable, not
of the distribution. This is the typical distribution resulting from
the multiplication of two independent random positive variables.
If you throw a dozen 20-sided dice and multiply the values of their
faces up, you’d get a lognormal distribution. It’s very tricky to tell
this distribution apart from a power law, as we’ll see.
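A small simulation of that dice example (my own sketch, not the book's code) shows why: the log of the product is a sum of independent terms, which is why the products end up roughly lognormal:

    import math
    import random

    # Multiply the faces of a dozen 20-sided dice, many times over.
    # The log of each product is a sum of 12 independent terms, so by the
    # central limit theorem the products are roughly lognormally distributed.
    products = []
    for _ in range(10000):
        product = 1
        for _ in range(12):
            product *= random.randint(1, 20)
        products.append(product)

    logs = [math.log(p) for p in products]
    mean_log = sum(logs) / len(logs)
    print(f"mean of log(product): {mean_log:.2f}")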
3.3 p-Values
One of the key tasks of statistics is figuring out whether what you're observing – a natural phenomenon or the result of an experiment – can tell us something bigger about how the world works. The way this is normally done is by making a hypothesis, for instance that a specific drug will cause weight loss. To figure out whether it is true, we need to prove that taking the drug actually does something rather than nothing. "The drug does nothing" is what we call the null hypothesis, which is what we'd expect – after all, most drugs don't cause weight loss. This is what we colloquially call the "burden of proof": the person making the claim that something exists needs to prove that it does, because if we haven't proven that something exists yet there is no reason to believe it does.

We call what you want to prove – "the drug causes weight loss" – the alternative hypothesis, because it's the alternative to the null hypothesis.
p-values are among the most commonly used tools to deal with this problem[11]. The interpretation of p-values is tricky and it is easy to get it wrong. The "p" stands for "probability". Suppose that you give your drug to a bunch of people and, after a few weeks, you see that their weight decreased by 5kg. The p-value tells you the probability that you would be observing an effect this strong – a loss of 5kg – if the null hypothesis were true – i.e. if the drug actually did nothing. Lower p-values mean there is stronger evidence against the null hypothesis. What the p-value does not tell you (but might trick you into thinking it does) is:

• The p-value does NOT tell you how likely you are to be right;
• The p-value does NOT tell you how strong the effect is;
• The p-value does NOT tell you how much evidence you have against your hypothesis.

[11] Ronald L Wasserstein and Nicole A Lazar. The ASA statement on p-values: context, process, and purpose, 2016.
mean that the null hypothesis is true. A high p-value just means that the observations we have are compatible with a world where the null hypothesis is true. But it could also mean that our sample is not big enough to draw firm conclusions. Given how tricky it is to get them right, some researchers have called for not using p-values altogether[12].

[12] Raymond Hubbard and R Murray Lindsay. Why p values are not a useful measure of evidence in statistical significance testing. Theory & Psychology, 18(1):69–88, 2008.

Figure 3.8 shows a graphical way to understand the p-value. You have a distribution of values that would be produced by the null hypothesis, you pit your measurements against those values, and the more unusual your observations look compared to those that would be produced by the null hypothesis, the lower the p-value. Exactly how to produce this null hypothesis distribution is not something we'll cover here, because it is not so close to the core of network science – although we'll see something similar in Section 19.1.
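To give a rough feeling for the mechanics (this is my own toy sketch, not the procedure behind Figure 3.8), one can build a null distribution by simulation and count how often it produces something at least as extreme as the observation:

    import random

    # Toy null model: a fair coin tossed 100 times. Observation: 62 heads.
    # The (one-sided) p-value is the fraction of null draws at least as extreme.
    observed_heads = 62
    null_draws = []
    for _ in range(10000):
        heads = sum(random.random() < 0.5 for _ in range(100))
        null_draws.append(heads)

    p_value = sum(h >= observed_heads for h in null_draws) / len(null_draws)
    print(f"simulated p-value: {p_value:.3f}")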
Figure 3.8: In red we have the distribution of values produced by the null hypothesis (y axis: number of null observations).
One thing is worth mentioning, though. You will often see some magical p-value thresholds that people use as standards – the most common being p < 0.05 and p < 0.01. These p-value thresholds say something about the strength of the evidence we want to see before we are willing to reject the null hypothesis of no effect. Beware of these. Not only because – as I said before – they don't mean what you think they mean, but also because of Goodhart's Law[13]. If we say

[13] "When a measure becomes a target, it ceases to be a good measure."
So far we have worked with a single variable and we have done our
best to describe how it distributes. However, more often than not,
you have more than one variable and you want to know something
interesting about how one relates to the other. For instance, you
might want to describe how the weight of a person tends to be
related to their height.
In general, the taller a person is, the more we expect them to
weigh. In this sense the two variables vary together. Therefore, we call this concept "covariance"[19]. The formula is pretty simple: it is the mean of the product of the deviations from the mean of both variables, so: cov(H, W) = µ((H − µ(H))(W − µ(W))).

[19] John A Rice. Mathematical Statistics and Data Analysis. Cengage Learning, 2006.
If a person is both a lot taller than the mean and a lot heavier than
the mean, the product of their height and weight deviations will be
big, and they will contribute a large positive value to the covariance
calculation. If a person is both a lot shorter than the mean and a
lot lighter than the mean, the product of their height and weight
deviations, both negative, will again be a large positive number. So, they will
also make the covariance turn out bigger. If someone is both tall and
light, their positive height deviation and negative weight deviation
will be multiplied to become a large negative number, pulling the
covariance down. The covariance will be large in total if we observe
many tall/heavy and short/light people, and not so many tall/light
or short/heavy people (see Figure 3.9, the covariance values are in
the figure’s caption, first row under cov( H, W )).
One problem is that the covariance depends on the units of your variables. That means the size of the covariance can be difficult to interpret – is a covariance of 5.5 high or low? To make sense of it, it helps to normalize it. This is what the Pearson correlation coefficient tries to do[20]. It divides the covariance by the product of the standard deviations of both variables. That means, for example, that if you start measuring

[20] Francis Galton. Typical laws of heredity. Royal Institution of Great Britain, 1877.
Figure 3.9: Height/weight scatter plots of datasets with (a) positive, (b) no, and (c) negative covariance between height and weight (x axis: height, y axis: weight). Covariance (cov(H, W)) and correlation (ρ(H, W)) values below the scatter plots.
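A rough sketch of both quantities on made-up height/weight pairs (not the data behind Figure 3.9):

    import statistics

    # Hypothetical paired observations: heights (cm) and weights (kg).
    heights = [155, 160, 168, 172, 180, 188]
    weights = [52, 58, 63, 70, 79, 88]

    mu_h, mu_w = statistics.mean(heights), statistics.mean(weights)

    # Covariance: mean of the products of the deviations from the means.
    cov = statistics.mean(
        (h - mu_h) * (w - mu_w) for h, w in zip(heights, weights))

    # Pearson correlation: covariance normalized by both standard deviations.
    rho = cov / (statistics.pstdev(heights) * statistics.pstdev(weights))

    print(f"cov = {cov:.1f}, rho = {rho:.2f}")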
Figure 3.11: Data with skewed distributions across various... (y axis: p(Visiting); panels (a), (b)).
does this allow us to make a better guess at their weight? How much better? This "amount of information" is usually measured in bits.

To understand MI, we need to take a quick crash course on information theory[22][23], which starts with the definition of information entropy. It is a lot to take in, but we will extensively use these concepts when it comes to link prediction and community discovery in Parts VII and X, thus it is a worthwhile effort.

Consider Figure 3.13. The figure contains a representation of a vector of six elements that can take three different values. The first thing we want to know is how many bits of information we need to encode its content.

[22] Thomas M Cover. Elements of Information Theory. John Wiley & Sons, 1999.
[23] David JC MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
Figure 3.13: A simple example to understand information entropy. From left to right: the vector x has six elements taking three different values (6 values, 9 bits).
Consider flipping a coin. Once you know the result, you obtain one bit of information. That is because there are two possible events, equally likely with a probability p of 50%. Generalizing to all possible cases, every time an event with probability p occurs, it gives you −log₂(p) bits of information for... reasons[26]. So, the total information of an event is the amount of information you get per occurrence times the probability of occurrence: −p log₂(p). Summed over all possible events i in x: H_x = −∑_i p_i log₂(p_i), which is Shannon's information entropy – how many bits you need to encode the occurrence of all events.

[26] The amount of information of an event is a function that only depends on the probability p of the event to happen, e.g. i_a = f(p_a) for event a. If we have two events, a and b, happening with probability p_a and p_b, the event c defined as a and b happening has probability p_c = p_a p_b. Now, each event also gives you an amount of information, namely i_a and i_b. When c happens, it means that both a and b happened, thus you got both pieces of information, or i_c = i_a + i_b. What we just said can be rewritten as f(p_c) = f(p_a) + f(p_b), given the equation at the beginning. Since p_c = p_a p_b, then we can also rewrite the equation as f(p_a p_b) = f(p_a) + f(p_b). The only function f that we can possibly plug into this equation maintaining it true is the logarithm. Since probabilities are lower than 1, the logarithm would be lower than zero, which would be nonsense – you cannot get negative information. Thus we take the negated logarithm: i_a = −log(p_a).

Mutual information is defined for two variables. As I said, it is the amount of information you gain about one by knowing the other, or how much entropy knowing one saves you about the other. Consider Figure 3.14. It shows the relationship between two vectors, x and y. Note how y has equally likely outcomes: each color appears three times. However, if we observe a green square in x, we know with 100% confidence that the corresponding square in y is going to be purple. This means that knowing x's values gives us information about y's values. Mathematically speaking, mutual information is the amount of information entropy shared by the two vectors.

It would take −log₂(1/3) ∼ 1.58 bits to encode y on its own (it is a random coin with three sides). However, knowing x's values makes you able to use the inference rules we see in Figure 3.14. Those rules are helpful: note how our confidence is almost always higher than 33%, which is the probability of getting y's color right without any further information. The rules will save you around 0.79 bits, which is x and y's mutual information.
The exact formulation of mutual information is similar to the formula of entropy:

MI_xy = ∑_{j∈y} ∑_{i∈x} p_ij log(p_ij / (p_i p_j)),
where pij is the joint probability of i and j. The meat of this equa-
tion is in comparing the joint probability of i and j happening with
what you would expect if i and j were completely independent. If
they are, then pij = pi p j , which means we take the logarithm of one,
which is zero. But any time the happening of i and j is not indepen-
dent, we add something to the mutual information. That something
is the number of bits we save.
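Here is a small sketch (my own toy vectors, not the ones in Figures 3.13 and 3.14) of entropy and mutual information computed from empirical frequencies:

    import math
    from collections import Counter

    def entropy(values):
        """Shannon entropy in bits from the empirical frequencies of a vector."""
        n = len(values)
        return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

    def mutual_information(x, y):
        """MI in bits: compare joint frequencies with the independent product."""
        n = len(x)
        px, py = Counter(x), Counter(y)
        pxy = Counter(zip(x, y))
        return sum((c / n) * math.log2((c / n) / ((px[i] / n) * (py[j] / n)))
                   for (i, j), c in pxy.items())

    # Toy vectors: knowing x tells you quite a lot about y.
    x = ["green", "green", "red", "red", "blue", "blue"]
    y = ["purple", "purple", "orange", "purple", "orange", "orange"]

    print(f"H(y) = {entropy(y):.2f} bits")
    print(f"MI(x, y) = {mutual_information(x, y):.2f} bits")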
3.6 Summary
8. When you have two variables, covariance tells you how much
the two change together. Correlation coefficients are normalized
covariances that do not change depending on the scale of the data.
If your variables have a non-linear relationship, you need to use a
correlation coefficient that can handle it.
3.7 Exercises
4
Machine Learning
• The parameters are the things regulating how the algorithm works.
If you’re trying to predict how weight influences height, your
algorithm could be simply to multiply the height with a given
number. That number is the parameter.
Figure 4.3: Different activations for plain normalization (left column) and softmax (right column); Case #1: sparse data.
Figure 4.4: Different activation functions (line color: logistic, Gaussian, hyperbolic tangent): the activation function value (y axis) for a given input (x axis).
For the logistic, the higher the value going in, the more likely it
is to be an example of a positive class. For the Gaussian, we instead
want to be close to a given value, usually zero – which can be inter-
preted as “the class is positive if its distance from a reference point is
low”. The hyperbolic tangent is not interpretable as a probability, but
you can still normalize it if you want – as I did just to plot this figure.
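A minimal sketch of these activations (plus softmax for comparison), written from their standard definitions rather than from the book's code:

    import math

    def softmax(xs):
        """Softmax: exponentiate, then normalize so the outputs sum to one."""
        exps = [math.exp(x) for x in xs]
        total = sum(exps)
        return [e / total for e in exps]

    def logistic(x):
        return 1 / (1 + math.exp(-x))

    def gaussian(x):
        return math.exp(-x ** 2)

    def tanh(x):
        return math.tanh(x)

    scores = [-2.0, 0.0, 1.0, 3.0]
    print(softmax(scores))
    print([round(logistic(x), 3) for x in scores])
    print([round(gaussian(x), 3) for x in scores])
    print([round(tanh(x), 3) for x in scores])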
(a) Linear, (b) Step, (c) ReLU, (d) ELU, (e) GeLU activation functions: output (y axis) for a given input (x axis).
nothing else than trying something, calculate the loss function, try
something else and, if the loss now is lower, keep the change and
keep going in that direction. It follows that the loss function is as
Figure 4.6: Different loss functions (line color: MAE, MSE, log-cosh): the loss function value (y axis) for a given error (x axis).
Mean Errors
Two classical examples are mean absolute error (MAE) and mean squared error (MSE). In MAE, if y_i is the real outcome and ȳ_i is what your method says, then you average the absolute value of the difference[18]:

MAE(Ȳ) = ∑_i |y_i − ȳ_i| / |Y|,

where Ȳ is your vector with all your answers and Y is the vector with the corresponding correct answers.

[18] Cort J Willmott and Kenji Matsuura. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Climate Research, 30(1):79–82, 2005.
It should be clear why we need the absolute value: if you were to make symmetric errors (for each overestimation you also make, for a different observation, an equal underestimation) your loss would be zero – confusing symmetric errors with no errors at all. In MSE you solve the same issue of symmetric errors by taking not the absolute value, but the squared error[19] (which is always positive, also for underestimates):

MSE(Ȳ) = ∑_i (y_i − ȳ_i)² / |Y|.

[19] Peter J Bickel and Kjell A Doksum. Mathematical Statistics: Basic Ideas and Selected Topics, Volumes I-II Package. Chapman and Hall/CRC, 2015.
If you find it distasteful to take the square because then you have
an error in different units than your observation, you can always take
the square root of the result and you have the root mean squared er-
ror – just like you can take the root of the variance to get the standard
deviation (Section 3.1). Figure 4.7 gives you a graphical representa-
tion of how the errors are calculated.
Figure 4.7: Given a model (blue) for data points (green), we have two ways to interpret the errors (red): MAE segments and MSE squares.
In Figure 4.7, for MAE, only the length of the segment matters. For
MSE, you instead consider the entire area of the square.
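A quick sketch of both losses on made-up predictions (mine, not the book's):

    # Hypothetical true values and model predictions.
    y_true = [3.0, 5.0, 2.5, 7.0]
    y_pred = [2.5, 5.5, 2.0, 9.0]

    n = len(y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

    print(f"MAE = {mae:.3f}")   # average length of the error segments
    print(f"MSE = {mse:.3f}")   # average area of the error squares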
(Log) Likelihood
The likelihood function is a function that tells you the probability of observing an event given a (set of) parameter(s) regulating such an event[20]. Notation-wise, we use L to indicate the likelihood function, x is the outcome of the event, and θ is the set of parameters.

[20] Anthony William Fairbank Edwards. Likelihood. In Time Series and Statistics, pages 126–129. Springer, 1972.
Let’s make a simple example. Suppose that we are tossing a coin
three times. The sides on which it lands, let’s say heads twice and
tails once, is our x = { H, H, T }. θ tells us the parameter regulating
the coin. A coin is a fairly simple system, so it has only one parame-
ter: the probability of the coin landing on heads – which is 50% when
we know nothing about the coin and whether it is fair. So we can say
θ = { p H = 0.5}. At this point we can estimate L(θ, x ) – which, in
our case, is L({ p H = 0.5}, { H, H, T }) – which is the likelihood of the
coin being fair given that we observe that given event. In this case, to
know the L value given the event for any p_H value we need to evaluate the formula p_H²(1 − p_H) – because we got two heads and one tails
Figure 4.8: The likelihood (y axis) of the parameter p_H (x axis) for the {H, H, T} event.
Cross Entropy
For the figure, I set p_H = 0.66. Then our classifier needs to spit out the q_H it believes to be the closest to the real p_H. The closer q_H is to 0.66, the lower the cross entropy is. You can see how cross entropy shoots up the farther we get from our mark, while being tolerant of smaller mistakes – the function is pretty flat around 0.66.
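A small sketch of both ideas (standard formulas, my own code): the likelihood of p_H for the {H, H, T} outcome, and the cross entropy between the true p_H and a guess q_H:

    import math

    def likelihood(p_heads):
        """Likelihood of observing {H, H, T} for a coin with P(heads) = p_heads."""
        return p_heads ** 2 * (1 - p_heads)

    def cross_entropy(p_heads, q_heads):
        """Cross entropy (in bits) between the true coin and the classifier's guess."""
        return -(p_heads * math.log2(q_heads) + (1 - p_heads) * math.log2(1 - q_heads))

    print(max(range(1, 100), key=lambda i: likelihood(i / 100)) / 100)  # ~0.67
    print(cross_entropy(0.66, 0.66))  # the minimum: guessing the true value
    print(cross_entropy(0.66, 0.10))  # much larger: a guess far off the mark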
One last important thing I should point out about machine learning
is that it is normally done on a truckload of data. If you don’t have a
lot of observations, the appeal of machine learning is not that great.
The more data points you have, the more likely it is you’re going to
discover whatever pattern lurks in them. Otherwise, the patterns
might be overpowered by whatever other random fluctuation affects
the observations. This introduces the problem of computational
complexity. If you’re going to do machine learning, you need to do it
quickly, to process large amounts of information in a short time.
This is a problem because, even if the operations you perform are
simple, you need to do them for each observation, potentially billions
of times, which will take time. There are two main solutions to this
issue: sampling and batching.
Sampling
In sampling, you decide not to perform your operation on all data
points, but on a selection of them. That selection is a sample. The
key problem to solve here is to get a representative sample: if all
data points in your sample are “weird” in the same way, you might
discover a pattern that is not present in your overall dataset. There
are a few techniques to ensure that your sample is representative, but
I’m not covering them here in detail. That doesn’t mean that you should avoid learning how to sample, since it’s quite important23.
23. Chao-Yuan Wu, R Manmatha, Alexander J Smola, and Philipp Krahenbuhl. Sampling matters in deep embedding learning. In Proceedings of the IEEE international conference on computer vision, pages 2840–2848, 2017.
What I will say, though, is that sampling complex networks is its own variation of the problem, one we will look at in Chapter 29.
Of particular interest is a specific type of sampling called negative sampling24.
24. Zhen Yang, Ming Ding, Chang Zhou, Hongxia Yang, Jingren Zhou, and Jie Tang. Understanding negative sampling in graph representation learning. In ACM SIGKDD, pages 1666–1676, 2020.
Here a network example is helpful: let’s look together at Figure 4.11. I know that I haven’t formally introduced networks yet, but hopefully you can follow with some intuition.
Suppose we want to figure out what causes connections between
nodes in Figure 4.11. We should collect various features and compare
connections with non-connections. The problem is that the number of
non-connections dwarfs the number of connections. We have 10 con-
nections, but – trust me on this – a whopping 45 pairs of nodes that
are not connected. And this is a super tiny network! Imagine what
Figure 4.11: A network with fewer connections than non-connections.
would happen if you had an actually large one. You won’t be able to
look at all the negative instances – the non-connections. That is why
you need to sample them, usually to have as many negative samples
as observations – so here you’d pick 10 random non-connected node
pairs. This is negative sampling.
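A minimal sketch of negative sampling with networkx – the graph is a stand-in with the same counts as the example, not the one drawn in Figure 4.11:

import random
import networkx as nx

G = nx.gnm_random_graph(11, 10, seed=42)   # 11 nodes, 10 edges, like the example

positives = list(G.edges)                  # the observed connections
candidates = list(nx.non_edges(G))         # all non-connected node pairs
negatives = random.sample(candidates, len(positives))  # as many negatives as positives

print(len(positives), len(candidates), len(negatives))   # 10 45 10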
Batching
Ok, now we’ve cut the potentially gigantic number of negative obser-
vations with negative sampling. Are we good? Not really. What you
have now is twice the number of data points – for each data point you have a companion generated via negative sampling. This can still be pretty huge, if you have a lot of data points – as you should. What now?
Now, batching25.
25. Nils Bjorck, Carla P Gomes, Bart Selman, and Kilian Q Weinberger. Understanding batch normalization. Advances in neural information processing systems, 31, 2018.
In many common machine learning infrastructures, it is much faster – and it requires less memory – to run the training process on a small portion of the data at a time. Once you learn your parameters with this first chunk, you restart the learning process, updating the parameters with another portion of the data. Rinse and repeat until you have used all of your data. These chunks are the batches. Sometimes, it makes sense to use really small batches, and we have a cute pet name for them: minibatches26.
26. Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
The advantages of batching are faster computation and less memory usage.
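A minimal sketch of how one could split a dataset into minibatches (the data, names, and sizes are made up):

import numpy as np

def minibatches(data, batch_size, seed=0):
    # Shuffle once, then yield one chunk of the data at a time.
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(data))
    for start in range(0, len(data), batch_size):
        yield data[indices[start:start + batch_size]]

data = np.arange(1000).reshape(500, 2)     # 500 observations, 2 features
for batch in minibatches(data, batch_size=64):
    pass                                   # here you would update your parameters on this batch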
4.5 Summary
2. In the training phase, you might have the right answers you’re
interested in finding and your algorithm can learn from them. This
is supervised learning. If you don’t have them, you’ll be doing
unsupervised learning.
3. During training you might overfit, i.e. learn the odd peculiarities
of the training set rather than the general patterns in the data.
Having a separate validation set can help you to spot overfitting.
5. Loss functions drive your training phase by telling you how far
from the objective your method is. They should not be confused
with error functions, which are used in the testing phase to tell you
how wrong your final answers are – and, since they are final, they
cannot be changed, unlike training outputs.
4.6 Exercises
Implement the MAE and MSE functions and compare their outputs when applied to the vector, by plotting each of them.
Figure 5.1: (a) Several vectors on a 2D space. (b) Two vectors on a 3D space.
From the figure you can see that it is a bit cumbersome to repre-
sent 3D vectors on a piece of 2D paper, so that’s why I will stick to
2D examples when making figures. But there is no need to stop at 2D.
In fact there is no need to stop at 3D either: all operations in linear
algebra work the same in an arbitrary number of dimensions. They
just become a little harder to picture in your head.
Notation-wise, normally a vector is written as $\vec{v}$, but I’m skipping the arrow and writing them as simple lowercase letters: v. This is a bit ambiguous, but in general it will be clear from the context when I’m talking about vectors or not.
One useful property of a vector is its length. The length of the
vector is literally the length of the arrow we draw: it is the Euclidean
(straight line) distance between the point identified by the vector and
the origin. To calculate the Euclidean distance we can realize that a
vector is nothing more than the hypotenuse of a special right triangle.
Starting from the origin, we can first travel along the x axis until we
get to the first coordinate of the vector, then we walk up parallel to
the y axis to make up the second coordinate of the vector. So the
[3, 4] vector in Figure 5.2 is three steps along the x axis and four steps
parallel to the y axis.
Pythagoras teaches us that the length of the hypotenuse is the square root of the sum of the squares of the catheti lengths5. Putting it into mathematical form, our [3, 4] vector is $\sqrt{3^2 + 4^2} = 5$ long. More generally, any d dimensional vector v is $\sqrt{\sum_{k=1}^{d} v_k^2}$ long.
5. https://en.wikipedia.org/wiki/Pythagorean_theorem, Wikipedia will have to do, since Pythagoras never put a bibtex out...
What operations can you do with vectors? Fundamentally, we
need two: sum and multiplication.
The sum of two vectors is pretty straightforward. Say you have [1, 3] and [2, 1]. What do you think the result of their sum should be? It will be the element-wise sum of their entries. So [1, 3] + [2, 1] = [3, 4]. Using a more abstract notation, “element-wise” means that, if you have two dimensional vectors u and v, then their sum is [u_1 + v_1, u_2 + v_2]. You sum the first element of u with the first element of v, then second with second, and so on (if you have more dimensions).
The result of the sum of two vectors is another vector with the
same dimensions. Visually, this means that we’re taking the second
vector and move its tail to the point of the first vector. The point
where we end up is the new vector, the result of the sum. This is why
it is sometimes easier to think of vectors as arrows rather than points,
because this operation would be a bit less intuitive if we only have
points. But arrows make it easy to see what we’re doing, which is
what Figure 5.3 shows.
For the multiplication, for now we’re only going to deal with the
case of multiplying a number to a vector. As you can guess from the
sum example, this means to multiply all the elements of the vector
with that number: 2v = [2v1 , 2v2 ]. Visually, that means we extend
the vector by that factor – as you can see in Figure 5.4. De facto, that
means stretching the arrow’s length while remaining on the same
line – we have to flip the direction if we’re multiplying by a negative
number.
Figure 5.4: Two examples of vector multiplications by a number (scalar). The light color version is the original vector, and the darker version is the result of the multiplication (the scaling).
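A quick numpy sketch of the vector operations seen so far – length, sum, and multiplication by a scalar (the numbers are the ones from the figures):

import numpy as np

v = np.array([3, 4])
print(np.linalg.norm(v))           # 5.0, the Euclidean length

u = np.array([1, 3])
w = np.array([2, 1])
print(u + w)                       # [3 4], the element-wise sum
print(1.5 * np.array([2, 1]))      # [3.  1.5], stretching by a scalar
print(-1 * np.array([1, 4]))       # [-1 -4], flipping the direction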
5.2 Matrices
Like vectors, matrices too have dimensions. The matrix above has
two rows and two columns, so we say it is a 2 × 2 matrix. Since the
number of rows and columns is the same, the matrix looks like a
square and so we call it a square matrix. But nothing stops matrices
from having a different number of rows and columns. For instance, a
3 × 2 matrix is a non-square (rectangular) matrix, a totally valid object.
Notation-wise, $M_{ij}$ means to look at the value in the cell at the ith row and the jth column.
Before getting into what matrices are and what they are for, I want
to list a few important things about matrices. First, square matrices
have a diagonal, which goes from the top left element down to the
bottom right. Second, there’s the concept of symmetry. A square
matrix is symmetric if you can mirror it along the diagonal and you get the same matrix. Mathematically, this means that it is always the case that $M_{ij} = M_{ji}$.
Note how non-square matrices do not have a diagonal and cannot be symmetric. Transposing a matrix M means that all $M_{ij}$ values become $M_{ji}$ and vice versa. In this book, by convention, $M^T$ will be the transpose of M. Figure 5.5 shows the case of a square, non-symmetric matrix and its transpose. In practice, transposing is like placing a mirror on the diagonal.
Figure 5.5: (a) M. (b) $M^T$.
If your matrix is square and symmetric, transposing has no effect: $M^T = M$. However, for non-symmetric matrices, $M^T \neq M$. Moreover, for non-square matrices, transposing an n × m matrix results in an m × n one: the dimensions flip.
Vector-Matrix Multiplication
To keep our spatial interpretation of linear algebra, in the case of
matrices we need to jump right to the multiplication of a matrix
with a vector. If you have matrix M and vector v, then Mv = w:
multiplying vector v with matrix M results a new vector w. The new
vector w will have different coordinates from v – with few interesting
and useful exceptions we’ll get to later. In practice, M is moving v so
that it ends up in w. Alternatively, you can say that M is changing
the coordinate system of v: w is still the same as v, but in a different
coordinate system.
So far this isn’t helping much; it’s still pretty abstract. What does the matrix I showed you before really mean? Each column of the
matrix tells you how to change each coordinate in isolation. This
is the same as stretching a vector that is equal to one in that given
dimension, and zero everywhere else. So, since the first column of
the matrix is [0, 5] then we know that the unit vector [1, 0] will end up
Figure 5.6: A visualization of a matrix. The light color vectors are the original unit vectors, and their darker version is the result of the coordinate transformation applied by the matrix.
in [0, 5]. The second column tells you that the second unit vector [0, 1]
will end up in [4, 1]. Figure 5.6 shows you this transformation.
Once you know this, you know how to move any two vectors,
because any vector is a combination of these two unit vectors. To
know where any arbitrary vector v = [v1 , v2 ] ends – to calculate w
–, you need to look at the matrix first by rows: the first row tells you
the contribution of the matrix to the first entry of w. Then you look
at columns: the first column tells you the effect of v1 , the second
of $v_2$, and so on. To sum up in general terms, Mv = w means that $w_1 = M_{1,1} v_1 + M_{1,2} v_2$ and $w_2 = M_{2,1} v_1 + M_{2,2} v_2$. You can see an example in Figure 5.7.
Figure 5.7: An example of vector-matrix multiplication, Mv = w, with M = [[0, 4], [5, 1]], v = [2, 3], and w = [12, 13]. Each entry of the result vector depends on the corresponding row of the matrix (a matrix’s row produces a given entry of the output). Each entry of the input vector is handled by the corresponding column of the matrix (a matrix’s column operates on a given entry of the input).
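In numpy this is a plain matrix-vector product; a quick check of the Figure 5.7 numbers:

import numpy as np

M = np.array([[0, 4],
              [5, 1]])
v = np.array([2, 3])
print(M @ v)    # [12 13]: w_1 = 0*2 + 4*3, w_2 = 5*2 + 1*3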
the matrix gives you the coordinates of where [1, 0] lands. If a matrix
is 3 × 2, it means that [1, 0] will land on a 3D space, because it will
have three coordinates. So non-square matrices allow you to move
between spaces with a different number of dimensions.
In our Mv = w, besides $w_1$ and $w_2$, we also have $w_3 = M_{3,1} v_1 + M_{3,2} v_2$. More generally: $w_i = \sum_{k=1}^{d} M_{ik} v_k$. This formula sneakily tells
you one important thing: you cannot multiply any vector-matrix
combination. The matrix must have the same number of columns as
the number of entries in the vector. Otherwise you either have some
entries of v you can’t transform, or portions of M doing nothing.
In general, multiplying a d dimensional vector with a n × d matrix
will result in an n dimensional vector. Figure 5.8 shows you this
operation, and highlights how this coordinate change moves to a
completely new 2D space from the original 3D one.
Figure 5.8: A visualization of a non-square matrix. Each unit vector in the 3D space gets mapped to a new coordinate system in a 2D space.
Possibly the most useful matrix for this book is the identity matrix,
which we call I. This is defined as a matrix that has ones on its main
diagonal – the one going from top left to bottom right, and zero
everywhere else:
$$I = \begin{pmatrix} 1 & 0 & \dots & 0 \\ 0 & 1 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & 1 \end{pmatrix}$$
Can you guess what it does? Each column in this matrix is exactly
the unit vector for that specific dimension. Since it tells you what the
unit vector becomes after the transformation, what this means is that
the unit vector will not change. If it does not change, nothing will!
The identity matrix will preserve the coordinate system exactly as it
is. Mathematically, for any v: Iv = v.
Matrix Multiplication
It is often the case that you might want to multiply two matrices
together, rather than a matrix and a vector. It’s worth explaining
what that operation means intuitively, because that will make it easy
to understand why we achieve it the way we do.
5.3 Tensors
Figure 5.11: An example of sum for two zero dimensional tensors: [2] + [3] = [5].
Note that the dimension of the tensor has no relation with the
dimension of the space in which the tensor lives! A vector is a one
dimensional tensor, but can live in a 3D space, if it has three entries.
Vice versa, a 2D space can contain a three dimensional tensor, for
instance [[[1, 2], [2, 4]], [[3, 1], [2, 5]]] is an example of such a tensor.
Confusing, I know!
Dot Product
The fact that both vectors and matrices are tensors suggests a pro-
found thing: there is no qualitative distinction between a vector and
a matrix. This is indeed true, and one cool repercussion is that vec-
tors, just like matrices, are also coordinate shifts. That is because any
d dimensional vector is also a d × 1 rectangular matrix. So it can
play a role in the vector-matrix and matrix-matrix multiplications I
explained in Section 5.2.
This is the dot product. Let’s say you have two d dimensional
vectors: v and m. We can decide that m is actually a d × 1 matrix and
we use it to perform a vector-matrix multiplication with v. Since m is
a rectangular matrix, by now you know that it will transport you to a
space with a different number of dimensions. In this case, you’ll end
up with only one dimension. A one-dimensional vector is a number.
What this means in our spatial perspective is that you’re projecting
v onto the line defined by the direction pointed by m. This is because
you’re bringing v into the 1D space of m, which is a number line. You
will also have to stretch v’s projection proportionally to the length of
m, just like entries in a matrix M will stretch or squish the space if
they are different from 1. Figure 5.12 shows you what this looks like spatially.
Note how in Figure 5.12 the red vector v lands on the fourth tick-
mark of the number line determined by the direction of vector m.
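A small numpy sketch of the dot product as a projection (the vectors are mine):

import numpy as np

v = np.array([2, 3])
m = np.array([4, 0])                  # m points along the x axis

print(v @ m)                          # 8: treating m as a d x 1 matrix
# The projection of v onto m's direction is 2 (the x component of v),
# stretched by m's length (4): 2 * 4 = 8.
print((v @ m) / np.linalg.norm(m))    # 2.0, the projected length itself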
Outer Product
There’s another type of product that is useful in several parts of this
book. It is the cousin of the dot product: the outer product. When we
decided to multiply two d × 1 vectors, we decided we were collapsing
the d dimension to get a 1 dimensional vector, a number. But that’s
not the only option. We could have instead collapsed the 1 dimension
and the result would be... can you guess it? A d × d matrix! In fact,
if we decide to collapse the 1 dimension, it actually doesn’t matter
whether u and v have the same dimension. Any two vectors can have
an outer product. In this book I’ll write the outer product as u ⊗ v.
Formally, this looks like:
$$u \otimes v = uv^T = \begin{pmatrix} u_1 \\ u_2 \\ u_3 \\ u_4 \end{pmatrix} \begin{pmatrix} v_1 & v_2 & v_3 \end{pmatrix} = \begin{pmatrix} u_1 v_1 & u_1 v_2 & u_1 v_3 \\ u_2 v_1 & u_2 v_2 & u_2 v_3 \\ u_3 v_1 & u_3 v_2 & u_3 v_3 \\ u_4 v_1 & u_4 v_2 & u_4 v_3 \end{pmatrix}$$
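numpy’s np.outer computes exactly this; a quick sketch with vectors of different lengths:

import numpy as np

u = np.array([1, 2, 3, 4])
v = np.array([10, 20, 30])
print(np.outer(u, v))          # a 4 x 3 matrix whose entries are u_i * v_j
print(np.outer(u, v).shape)    # (4, 3)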
Two absolutely key concepts for network science are eigenvalues and
eigenvectors. They are used almost everywhere in network science,
so we need to define them here. Consider Figure 5.13. Given a vector
v, we learned we can apply an arbitrary matrix transformation M to
it – as long as it has the correct dimensions. We then obtain a new
vector w = Mv. Any transformation M has special vectors: M scales
these special vectors without altering their directions. In practice, the
transformation M simply multiplies the elements of such vectors by
the same scalar λ: w = Mv = λv. We have a name for this: v is M’s
eigenvector and λ is its associated eigenvalue. Mathematically, we
represent this relation as Mv = λv.
Figure 5.13: A graphical depiction of an eigenvector.
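A numpy sketch – the example matrix is mine – checking the defining relation Mv = λv:

import numpy as np

M = np.array([[2, 1],
              [1, 2]])
eigenvalues, eigenvectors = np.linalg.eig(M)
print(eigenvalues)              # the eigenvalues of M: 3 and 1 (order may vary)

v = eigenvectors[:, 0]          # the eigenvector associated with eigenvalues[0]
print(M @ v)                    # same direction as v...
print(eigenvalues[0] * v)       # ...just scaled by its eigenvalue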
values, but their corresponding eigenvalues are the same. From now
on, when I mention eigenvectors, I refer to the right eigenvectors.
Right eigenvectors are the default, and when I refer to left eigenvectors
I will explicitly acknowledge it.
A square matrix with n rows and columns also has n eigenvalues.
By convention, we sort eigenvalues by their value, sometimes in
increasing order, sometimes in decreasing order, depending on the
application.
A key term you need to keep in mind is “multiplicity”. The mul-
tiplicity of an eigenvalue is the number of eigenvectors to which it is
associated. If you have an n × n matrix, but only d < n distinct eigen-
values, some eigenvalues are associated to more than one eigenvector.
Thus their multiplicity is higher than one.
Matrix Decomposition
One of the easiest ways to perform matrix factorization is what
we call “eigendecomposition”. A square matrix M can always be
decomposed as M = ΦΛΦ−1 . Rather than being the left and right
eyes of a really pissed frowny face, Φ is the matrix we obtain piling
all eigenvectors next to each other, and Λ is a diagonal matrix with
the eigenvalues on its main diagonal and zeros everywhere else:
$$\Lambda = \begin{pmatrix} \lambda_0 & \dots & 0 \\ 0 & \ddots & 0 \\ 0 & \dots & \lambda_n \end{pmatrix}.$$
We are mostly interested in eigendecomposition as the special
case of the more general Singular Value Decomposition (SVD) –
which can be applied to any matrix, even non-square ones. In SVD,
we simply replace Φ and Λ with generic matrices. In other words,
we say that we can reconstruct M with the following operation:
Day Temp (°C) Wind (km/h) Sunlight (%) Rain (mm) Snow (mm)
1 27 10 80 2 0
2 26 1.2 95 1 0
3 32 7.6 100 0 0
4 12 2.3 12 20 0
5 14 3.8 8 25 0
6 6 0.2 24 40 1
7 4 0.1 2 8 30
8 2 0.9 4 1 40
9 −1 1.1 4 0 80
Figure 5.14: A table recording in a matrix the characteristics of some days.
This is the aim of PCA. Let’s repeat the previous paragraph mathematically: you want to transform your correlated vectors into a set of
uncorrelated, or orthogonal, vectors which we call “principal compo-
nents”. Each component is a vector that explains the largest possible
amount of variance (Section 3.4) in your data, under the condition
of being orthogonal with all the other components. You can have as
many components as you have variables, but usually you want much
fewer – for instance two, so you can plot the data. That is because
the first component explains the most variance in the system, the
second a bit less, and so on, until the last few components which are
practically random. Thus you want to stop collecting components
after you’ve taken the first n, setting n to your delight. In Figure 5.15,
I collect the first two – they’re there for illustrative purposes so don’t
be shocked if you realize they’re not really orthogonal.
nents, besides the fact that all these components must be orthogonal
with each other. This means that you might end up with compo-
nents with negative values, as we do in Figure 5.15. This might not
be ideal. What does it mean for a day to be a “negative rainy day”?
PCA is interpretable, but sometimes the interpretation can be a bit...
confusing.
Non-Negative Matrix Factorization solves this problem. With-
out going into technical details, NMF is PCA with the additional
constraint that no component can have a negative entry – hence
the “Non-Negative” part in the name. At a practical level, if there
were no negative entries in Figure 5.15, then the two components
in that figure could be results of NMF. This additional constraint
comes at the expense of some precision: PCA can fit the data better
because it does not restrict its output space. However, usually, NMF components are easier to interpret.
Given their links to data clustering, both PCA and NMF are exten-
sively used when looking for communities in your networks (Part
X).
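A minimal sklearn sketch of both decompositions, using a small random non-negative matrix as a stand-in for the data in Figure 5.14:

import numpy as np
from sklearn.decomposition import PCA, NMF

rng = np.random.default_rng(0)
X = rng.random((9, 5))                    # 9 days, 5 non-negative features (made up)

pca = PCA(n_components=2)
print(pca.fit_transform(X).shape)         # (9, 2): each day on the first two components
print(pca.explained_variance_ratio_)      # how much variance each component captures

nmf = NMF(n_components=2, init="random", random_state=0)
W = nmf.fit_transform(X)
print(W.min() >= 0)                       # True: no negative entries, by construction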
Tensor Decomposition
Let’s assume you have a three dimensional tensor. You can think of a
tensor as a cuboid and a slice of it is a matrix. If you find it difficult
to picture this in your head, don’t worry: you’re not alone. That is
why many researchers put effort into finding ways to decompose
tensors in lower-dimensional representations that can sum up their
main properties. This process is generally known as “tensor decom-
position”.
Tensor decomposition is a general term encompassing many
techniques to express a tensor as a sequence of elementary operations
(addition, multiplication, etc) on other, simpler tensors. For instance,
you can represent a 3D tensor as a combination of three vectors, one
per dimension. Or as a matrix and a vector. You want to do this to
solve complex network analyses on multilayer networks – whatever
the hell this means, Section 8.1 will enlighten you – by taking the full
dimensionality into account at the same time, rather than performing
the analysis on each layer separately and then merge the results somehow. Examples of applications of tensor decomposition range from node ranking (Chapter 14), to link prediction (Part VII), to community discovery (Part X).
I am going to mention very briefly only two of these techniques: tensor rank decomposition and Tucker decomposition. You should look elsewhere for a more complete treatment of the subject6. There also exists a tensor SVD7, 8, but it is relatively similar to a special case
6. Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM review, 51(3):455–500, 2009.
7. Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. A multilinear singular value decomposition. SIAM journal on Matrix Analysis and Applications, 21(4):1253–1278, 2000.
8. Elina Robeva and Anna Seigal. Singular vectors of orthogonally decomposable tensors. Linear and Multilinear Algebra, 65(12):2457–2471, 2017.
$T \sim \mathcal{T} \times X \times Y \times Z$.
Here, $\mathcal{T}$ is the core tensor, whose dimensions are smaller than T’s. X, Y, and Z are matrices which have one dimension in common with $\mathcal{T}$ and the other in common with T – so that the matrix multiplication of them with $\mathcal{T}$ reconstructs a tensor with T’s dimensions.
Again, for the visual thinkers, Figure 5.18 might come in handy.
In Tucker decomposition you have the freedom to choose the di-
mensions of the core tensor T . Smaller cores tend to be more inter-
pretable, because they defer most of the heavy lifting to X, Y, and Z.
However, they also tend to make the decomposition less precise in
reconstructing T.
5.7 Summary
5.8 Exercises
2. Suppose:
$A = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix} \qquad B = \begin{pmatrix} 3 & 0 \\ 0 & -1 \end{pmatrix}$
Graph Representations
6
Basic Graphs
Every story should start from the beginning and, in this case, in the beginning was the graph1, 2, 3, 4. To explain and decompose the elements of a graph, I’m going to use the recurrent example of social networks. The same graph can represent different networks: power grids, protein interactions, financial transactions. Hopefully, you can effortlessly translate these examples into whatever domain you’re going to work in.
1. John Adrian Bondy, Uppaluri Siva Ramachandra Murty, et al. Graph theory with applications, volume 290. Citeseer, 1976.
2. Douglas Brent West et al. Introduction to graph theory, volume 2. Prentice hall Upper Saddle River, 2001.
3. Reinhard Diestel. Graph theory. Springer Publishing Company, Incorporated, 2018.
4. Jonathan L Gross and Jay Yellen. Graph theory and its applications. CRC press, 2005.
Let’s start by defining the fundamental elements of a social network. In society, the fundamental starting point is you. The person. Following Euler’s logic that I discussed in the introduction, we want
to strip out the internal structure of the person to get to a node. It’s
like a point in geometry: it’s the fundamental concept, one that you
cannot divide up into any sub-parts. Each person in a social network
is a node – or vertex; in the book I’ll treat these two terms as syn-
onyms. We can also call nodes “actors” because they are the ones
interacting and making events happen – or “entities” because some-
times they are not actors: rather than making things happen, things
happen to them. “Actor” is a more specific term which is not an
exact synonym of “node”, but we’ll see the difference between the two once we complicate our network model just a bit5, in Section 7.2.
5. The understatement of the century.
To add some notation, we usually refer to a graph as G. V indi-
cates the set of G’s vertices. Since V is the set of nodes, to refer to the
number of nodes of a graph we use |V | – some books will use n, but
I’ll try to avoid it. Throughout the book, I’ll tend to use u and v to
indicate single nodes.
So far, so good. However, you cannot have a society with only one
individual. You need more than one. And, once you have at least two
people, you need interactions between them. Again, following Euler,
for now we forget about everything that happens in the internal struc-
ture of the communication: we only remember that an interaction is
taking place. We will have plenty of time to make this model more
complicated. The most common terms used to talk about interactions
are “edge”, “link”, “connection” or “arc”. While some texts use them
with specific distinctions, for me they are going to be synonyms, and
my preferred term will always be “edge”. I think it’s clearer if you
always are explicit when you refer to special cases: sure, you can
decide that “arc” means “directed edge”, but the explicit formula “di-
rected edge” is always better than remembering an additional term,
because it contains all the information you need. (What the hell are
“directed edges”? Patience, everything will be clear)
Again, notation. E indicates the set of G’s edges and | E| is the
number of edges – some books will use m as a synonym for | E|.
Usually, when talking about a specific edge one will use the notation
(u, v), because edges are pairs of nodes – unless we complicate the
graph model. Now we have a way to refer to the simplest possible
graph model: G = (V, E), with E ⊆ V × V. A graph is a set of nodes
and a set of edges – i.e. node pairs – established among those nodes.
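In code, this definition maps directly onto networkx objects; a minimal sketch with made-up edges:

import networkx as nx

G = nx.Graph()
G.add_edges_from([(1, 2), (1, 4), (2, 4), (3, 4), (3, 5), (4, 5)])

print(G.number_of_nodes())    # |V| = 5
print(G.number_of_edges())    # |E| = 6
print(list(G.edges))          # the set E as a list of node pairs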
Euler’s first graph wasn’t simple after all. It allowed for parallel
edges: multiple edges between the same two nodes. Euler’s first
graph was a multigraph. That’s so non-standard that we’re not even
going to talk about it in this chapter: you’ll have to wait for the next
one, specifically for Section 7.2.
In our simple graph we also assume there are no self loops, which
are edges connecting a node with itself. Our assumption is that we
aren’t psychopaths: everybody is friends with themselves, so we don’t
need to keep track of those connections.
Figure 6.4: (a) A graph. (b) Its linegraph version.
proximity between the two nodes or their distance. This can and
will influence the results of many algorithms you’ll apply to your
graph, so this semantic distinction matters. For instance, if you’re
looking for the shortest path (see Chapter 13) in a road network, your
edge weight could mean different things. It could be a distance if it
represents the length of the stretch of road: longer stretches will take more time to cross. Or it can be a proximity: it could be the throughput of the stretch of road in number of cars per minute that can pass through
it – or the number of lanes. If the weight is a distance, the shortest
path should avoid high edge weights. If the weight is a proximity, it
should do its best to include them.
To sum up, “proximity” means that a high weight makes the
nodes closer together; e.g. they interact a lot, the edge has a high
capacity. “Distance” means that a high weight makes the nodes
further apart; e.g. it’s harder or costly to make the nodes interact.
Edge weights don’t have to be positive. Nobody says nodes should
be friends! Examples of negative edge weights can be resistances in
electric circuits or genes downregulating other genes. This observa-
tion is the beginning of a slippery slope towards signed networks,
which is a topic for another time (namely, for Section 7.2, if you want
to jump there).
The network in Figure 6.6 has nice integer weights. In this case,
the edge weights are akin to counts. For instance, in a phone call
network, it could be the number of times two people have called each
other. Unfortunately, not all weighted networks look as neat as the
example in Figure 6.6. In fact, most of the weighted networks you
might work with will have continuous edge weights. In that case,
many assumptions you can make for count weights won’t apply – for
instance when filtering connections, as we will see in Chapter 27.
By far, the most common case is the one of correlation networks.
In these networks, the nodes aren’t really interacting directly with
one another. Instead, we are connecting nodes because they are
similar to each other, for some definition of similarity. For instance, we could connect brain areas via cortical thickness correlations11, or currencies according to their exchange rate12, or correlating the taxa presence in different biological communities13.
11. Boris C Bernhardt, Zhang Chen, Yong He, Alan C Evans, and Neda Bernasconi. Graph-theoretical analysis reveals disrupted small-world organization of cortical thickness correlation networks in temporal lobe epilepsy. Cerebral cortex, 21(9):2147–2157, 2011.
12. Takayuki Mizuno, Hideki Takayasu, and Misako Takayasu. Correlation networks among currencies. Physica A: Statistical Mechanics and its Applications, 364:336–342, 2006.
13. Jonathan Friedman and Eric J Alm. Inferring correlation networks from genomic survey data. PLoS computational biology, 8(9):e1002687, 2012.
These cases have more or less the same structure. I provide an example in Figure 6.7. In this case, nodes are numerical vectors, which could represent a set of attributes, for instance. We calculate a correlation between the vectors, or some sort of attribute similarity – for instance mutual information (Section 3.5). We then obtain continuous weights, which typically span from −1 to 1. And, since every pair of nodes has a similarity (because any two vectors can be correlated, minus extremely rare degenerate cases), every node is connected to
every other node. So, when working with similarity networks, you
will have to filter your connections somehow, a process we call “net-
work backboning” which is far less trivial than it might sound. We
will explore it in Chapter 27.
Figure 6.7: (Left to right) from nodes represented as some sort of vectors, to a graph with a similarity measure as edge weight.
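A minimal sketch of how such a network could be built – the node vectors are made up:

import numpy as np
import networkx as nx

# Each node is a numerical vector of attributes (made-up values).
vectors = {"A": [1.0, 2.0, 3.0, 4.0],
           "B": [2.0, 4.1, 5.9, 8.2],
           "C": [4.0, 3.0, 2.0, 1.0]}

G = nx.Graph()
nodes = list(vectors)
for i, u in enumerate(nodes):
    for v in nodes[i + 1:]:
        # Pearson correlation as the similarity, used as the edge weight.
        w = np.corrcoef(vectors[u], vectors[v])[0, 1]
        G.add_edge(u, v, weight=w)

print(G.edges(data=True))    # every pair gets a weight between -1 and 1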
Now that you know more about the various features of different net-
work models, we can start looking at different types of networks. I’m
going to use a taxonomy for this section. I find this way of organizing
networks useful to think about the objects I work with.
Simple Networks
The first important distinction between network types is between
simple and complex networks. A simple network is a network we
can fully describe analytically. Its topological features are exact and
trivial. You can have a simple formula that tells you everything you
need to know about it. In complex networks that is not possible, you
can only use formulas to approximate their salient characteristics.
The difference between a simple network and a complex network
is the same between a sphere and a human being. You can fully
describe the shape of a sphere with a few formulas: its surface is
$4\pi r^2$, its volume is $\frac{4}{3}\pi r^3$. If you know r you know everything you
need to know about the sphere. Try to fully describe the shape of
a human being, internal organs included, starting from a single
number. Go on, I have time.
What do simple networks look like? I think the easiest example
conceivable is a square lattice. This is a regular grid, in which each
node is connected to its four nearest neighbors. Such lattice can either
span indefinitely (Figure 6.8(a)), or it can have a boundary (Figure
6.8(b)). Their fundamental properties are more or less the same.
Knowing this connection rule that I just stated allows you to picture
any lattice ever. That is why this is a simple topology.
Regular lattices can come in many different shapes besides square,
for instance triangular (Figure 6.9(a)) or hexagonal (Figure 6.9(b)).
has favorite graphs) are the lollipop graph14 (a set of n nodes all connected to each other plus a path of m nodes shooting out of it, Figure 6.10(a)), the wheel graph (which has a center connected to a circle of m nodes, Figure 6.10(b)), and the windmill graph (a set of n graphs with m nodes and all connections to each other, also all connected to a central node, Figure 6.10(c)). Once you figure out what rule determines each topology, you can generate an arbitrary number of graphs of arbitrary size that all have the same properties.
14. Maximum hitting time for random walks on graphs. Random Structures & Algorithms, 1(3):263–276, 1990.
Complex Networks
If simple networks were the only game in town, this book would
not exist. That is because, as I said, you can easily understand all
their properties from relatively simple math. That is not the case
when the network you’re analyzing is a complex network. Complex
networks model complex systems: systems that cannot be fully
understood if all you have is a perfect description of all their parts.
The interactions between the parts let global properties emerge
that are not the simple sum of local properties. There isn’t a simple
wiring rule and, even knowing all the wiring, some properties can
still take you by surprise.
Personally, I like to divide complex networks into two further
categories: complex networks with fundamental metadata and without
fundamental metadata. We saw that you can have edge metadata, the
direction and weight. In Chapter 7 we’ll see you can have even more,
attached to both nodes and edges. The difference I’m trying to make
is that, if the metadata are fundamental, they change the way you
interpret some or all the metadata themselves.
To understand non-fundamental metadata, think about the fact that
social networks, infrastructure networks, biological networks, and so
on, model different systems and have different metadata attached to
their nodes and edges. They can be age/gender, activation types, up-
and down-regulation. However, the algorithms and the analyses you
perform on them are the same, regardless of what the networks rep-
resent. They have nodes and edges and you treat them as such. You
perform the Euler operation: you forget about all that is unnecessary
so you can apply standardized analytic steps.
That is emphatically not true for networks with fundamental meta-
data. In that case, you need to be aware of what the metadata repre-
sent, because they change the way you perform the analysis and you
interpret the results. A few examples:
Figure 6.11: (a) A Bayesian network. (b) The conditional probability tables for the node states. The tables are referring to, from top to bottom: Rain, Sprinkler, Wet.
Rain: T 0.20, F 0.80.
Sprinkler | Rain = T: T 0.01, F 0.99; Rain = F: T 0.20, F 0.80.
Wet | Rain = T, Sprinkler = T: T 0.99, F 0.01; Rain = T, Sprinkler = F: T 0.98, F 0.02; Rain = F, Sprinkler = T: T 0.97, F 0.03; Rain = F, Sprinkler = F: T 0.01, F 0.99.
The way they work is that the weight on each node of the out-
put layer is the answer the model is giving. This weight is directly
dependent on a combination of the weights of the nodes in the last
hidden layer. The contribution of each hidden node is proportional to
the weight of the edge connecting it to the output node. Recursively,
the status of each node in the hidden layer is a combination of all
its incoming connections – combining the edge weight to the node
weight at the origin. The first hidden layer will be directly dependent
on the weights of the nodes in the input layer, which are, in turn,
determined by the data.
What the model does is simply find the combination of edge
weights causing the output layer’s node weights to maximize the
desired quality function.
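A toy forward pass capturing that recursion – sizes and weights are made up, and I ignore activation functions for brevity:

import numpy as np

x = np.array([0.5, 1.0, 0.2])      # input layer node weights (the data)
W1 = np.random.rand(4, 3)          # edge weights from the input to the hidden layer
W2 = np.random.rand(2, 4)          # edge weights from the hidden to the output layer

hidden = W1 @ x                    # each hidden node combines its incoming connections
output = W2 @ hidden               # each output node combines the hidden nodes
print(output)                      # the model's answer; training tunes W1 and W2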
6.5 Summary
6.6 Exercises
2. Mr. A considers Ms. B a friend, but she doesn’t like him back. She
has a reciprocal friendship with both C and D, but only C con-
siders D a friend. D has also sent friend requests to E, F, G, and
H but, so far, only G replied. G also has a reciprocal relationship
with A. Draw the corresponding directed graph.
The world of simple graphs is... well... simple. The only thing com-
plicating it a bit so far was adding some information on the edges:
whether they are asymmetric – meaning (u, v) ̸= (v, u) – and whether
they are strong or weak. Unfortunately, that’s not enough to deal
with everything reality can throw your way. In this chapter, I present
even more graph models, which go beyond the simple addition of
edge information.
networks in which nodes must be part of either of two classes ($V_1$ and $V_2$) and edges can only be established between nodes of unlike type1, 2. Formally, we would say that $G = (V_1, V_2, E)$, and that E can only contain edges like $(v_1, v_2)$, with $v_1 \in V_1$ and $v_2 \in V_2$. Figure 7.2(a) depicts an example.
1. Armen S Asratian, Tristan MJ Denley, and Roland Häggkvist. Bipartite graphs and their applications, volume 131. Cambridge University Press, 1998.
2. Jean-Loup Guillaume and Matthieu Latapy. Bipartite structure of all complex networks. Information processing letters, 90:Issue–5, 2004.
Bipartite networks are used for countless things, connecting: countries to the products they export3, hosts to guests in symbiotic relationships4, users to the social media items they tag5, bank-firm relationships in financial networks6, players-bands in jazz7, listener-band in music consumption8, plant-pollinators in ecosystems9, and more. You get the idea. Bipartite networks pop up everywhere.
3. César A Hidalgo and Ricardo Hausmann. The building blocks of economic complexity. Proceedings of the national academy of sciences, 106(26):10570–10575, 2009.
4. Brian D Muegge, Justin Kuczynski, Dan Knights, Jose C Clemente, Antonio González, Luigi Fontana, Bernard Henrissat, Rob Knight, and Jeffrey I Gordon. Diet drives convergence in gut microbiome functions across mammalian phylogeny and within humans. Science, 332(6032):970–974, 2011.
5. Renaud Lambiotte and Marcel Ausloos. Collaborative tagging as a tripartite network. In International Conference on Computational Science, pages 1114–1117. Springer, 2006.
6. Luca Marotta, Salvatore Micciche, Yoshi Fujiwara, Hiroshi Iyetomi, Hideaki Aoyama, Mauro Gallegati, and Rosario N Mantegna. Bank-firm credit network in japan: an analysis of a bipartite network. PloS one, 10(5):e0123079, 2015.
7. Pablo M Gleiser and Leon Danon. Community structure in jazz. Advances in complex systems, 6(04):565–573, 2003.
8. Renaud Lambiotte and Marcel Ausloos. Uncovering collective listening habits and music genres in bipartite networks. Physical Review E, 72(6):066107, 2005.
However, by a curious twist of fate, the algorithms able to work directly on bipartite structures are less studied than their non-bipartite counterparts. For instance, for every community discovery algorithm that works on bipartite networks you have a hundred working on non-bipartite ones. The distinction is important, because the standard assumptions of non-bipartite community discovery do not hold in bipartite networks, as we will see in Part X.
Why would that be the case? Because practically everyone who works on bipartite networks projects them. Most of the time, you are interested only in one of the two node types. So you create a unipartite version of the network connecting all nodes in V1 to each other, using some criteria to make the V2 count. The trivial way is to connect all V1 nodes with at least a common V2 neighbor. This is so widely done and so wrong that I like to call it the Mercator bipartite projection, in honor of the most used and misunderstood map projection of all times. We’ll see in Chapter 26 why that’s not very smart, and the different ways to do a better job.
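A minimal networkx sketch of a bipartite network and of its (naive) projection – the data is made up:

import networkx as nx
from networkx.algorithms import bipartite

B = nx.Graph()
people = ["Alice", "Bob", "Carol"]            # the V1 nodes
items = ["item1", "item2"]                    # the V2 nodes
B.add_nodes_from(people, bipartite=0)
B.add_nodes_from(items, bipartite=1)
B.add_edges_from([("Alice", "item1"), ("Bob", "item1"),
                  ("Bob", "item2"), ("Carol", "item2")])

# The trivial projection: connect V1 nodes sharing at least one V2 neighbor.
P = bipartite.projected_graph(B, people)
print(list(P.edges))    # [('Alice', 'Bob'), ('Bob', 'Carol')]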
Why stop at bipartite? Why not go full n-partite? For instance, a paper I cited before actually builds a tri-partite network (Figure
One-to-One
Traditionally, network scientists try to focus on one thing at a time. If they are interested in analyzing your friendship patterns, they will choose one network that closely approximates your actual social relations and they will study that. For instance, they will download a sample of the Facebook graph. Or they will analyze tweets and retweets.
However, in some cases, that is not enough to really grasp the phenomenon one wants to study. If you want to predict a new connection on Facebook, something happening in another social media might have influenced it. Two people might have started working in the same company and thus first connected on Linkedin, and then became friends and connected on Facebook. Such a scenario could not be captured by simply looking at one of the two networks. Network scientists invented multilayer networks13, 14, 15, 16, 17, 18 to answer this kind of question.
13. Mikko Kivelä, Alex Arenas, Marc Barthelemy, James P Gleeson, Yamir Moreno, and Mason A Porter. Multilayer networks. Journal of complex networks, 2(3):203–271, 2014.
14. Manlio De Domenico, Albert Solé-Ribalta, Emanuele Cozzo, Mikko Kivelä, Yamir Moreno, Mason A Porter, Sergio Gómez, and Alex Arenas. Mathematical formulation of multilayer networks. Physical Review X, 3(4):041022, 2013.
15. Stefano Boccaletti, Ginestra Bianconi, Regino Criado, Charo I Del Genio, Jesús Gómez-Gardenes, Miguel Romance, Irene Sendina-Nadal, Zhen Wang, and Massimiliano Zanin. The structure and dynamics of multilayer networks. Physics Reports, 544(1):1–122, 2014.
There are two ways to represent multilayer networks. The simpler
To fix the insufficient power of multiplex networks to represent true
multilayer systems we need to extend the model. We introduce the
concept of “interlayer coupling”. In this scenario, the node is split
into the different layers to which it belongs. In this case, your identity
includes multiple personas: you are the union of the “Facebook
you”, the “Linkedin you”, the “Twitter you”. Figure 7.4(a) shows
the visual representation of this model: each layer is a slice of the
network. There are two types of edges: the intra-layer connections –
the traditional type: we’re friends on Facebook, Linkedin, Twitter –,
and the inter-layer connections. The inter-layer edges run between
layers, and their function is to establish that the two nodes in the
different layers are really the same node: they are coupled to – or
dependent on – each other.
tral layer. To see what I mean, look at Figure 7.6(c). Using different
coupling flavors can be useful for computational efficiency: when you
start having dozens or even hundreds of layers, creating cliques of
layers can add a significant overhead.
Such many-to-many layer couplings are often referred to in the
literature as “networks of networks”, because each layer can be seen
as a distinct network, and the interlayer couplings are relationships between different networks20, 21, 22.
20. Jacopo Iacovacci, Zhihao Wu, and Ginestra Bianconi. Mesoscopic structures reveal the network between the layers of multiplex data sets. Physical Review E, 92(4):042806, 2015.
21. Gregorio D’Agostino and Antonio Scala. Networks of networks: the last frontier of complexity, volume 340. Springer, 2014.
22. Dror Y Kenett, Matjaž Perc, and Stefano Boccaletti. Networks of networks–an introduction. Chaos, Solitons & Fractals, 80:1–6, 2015.
Aspects
Do you think we can’t make this even more complicated? Think again. These aren’t called “complex networks” by accident. To fully generalize multilayer networks, adding the many-to-many interlayer coupling edges is not enough. To see why that’s the case, consider the fact that, up to this point, I considered the layers in a multilayer
network as interchangeable. Sure, they represent different relation-
ships – Facebook friendship rather than Twitter following – but they
are fundamentally of the same type. That’s not necessarily the case:
the network can have multiple aspects.
For instance, consider time. We might not be Facebook friends
now, but that might change in the future. So we can have our mul-
tilayer network at time t and at time t + 1. These are two aspects of
the same network. All the layers are present in both aspects and the
edges inside them change. Another classical example is a scientific
community. People at a conference interact in different ways – by
attending each other talks, by chatting, or exchanging business cards
– and can do all of those things at different conferences. The type of
interaction is one aspect of the network, the conference in which it
happens is another.
I can’t hope to give you here an overview of how many new things
this introduces to graph theory. So I’m referring you to a specialized book on the subject23.
23. Ginestra Bianconi. Multilayer Networks: Structure and Function. Oxford University Press, 2018.
Signed Networks
Signed networks are a particular case of multilayer networks. Sup-
pose you want to buy a computer, and you go online to read some
reviews. Suppose that you do this often, so you can recognize the
reviewers from past reviews you read from them. This means that
you might realize you do not trust some of them and you trust others.
This information is embedded in the edges of a signed network: there
are positive and negative relationships.
Signed networks are not necessarily restricted to either a single
positive or a single negative relationship – e.g. “I trust this person”
Hypergraphs
In the classical definition, an edge connects two nodes – the gray
lines in Figure 7.7(a). Your friendship relation involves you and your
friend. If you have a second friend, that is a different relationship.
There are some cases in which connections bind together multiple
people at the same time. For instance, consider team building: when
you do your final project with some of your classmates, the same
relationship connects you with all of them. When we allow the
same edge to connect more than two nodes we call it a hyperedge – the gray area in Figure 7.7(b). A collection of hyperedges makes a hypergraph25, 26.
25. Vitaly Ivanovich Voloshin. Introduction to graph and hypergraph theory. Nova Science Publishers Hauppauge, 2009.
26. Alain Bretto. Hypergraph theory: An introduction. Mathematical Engineering. Cham: Springer, 2013.
To make them more manageable, we can put constraints on hy-
peredges. We could force them to always contain the same number
of nodes. In a soccer tournament, the hyperedge representing a
team can only have eleven members: not one more nor one less, because that’s the number of players in the team. In this case, we call the resulting structure a “uniform hypergraph”, which has all sorts of interesting properties27. In general, when simply talking about hypergraphs we have no such constraint.
27. Shenglong Hu and Liqun Qi. Algebraic connectivity of an even uniform hypergraph. Journal of Combinatorial Optimization, 24(4):564–579, 2012.
28. Source: I tried once.
It is difficult to work with hypergraphs28. Specialized algorithms to analyze them exist, but they become complicated very soon. In
the vast majority of cases, we will transform hyperedges into simpler
network forms and then apply the corresponding simpler algorithms.
There are two main strategies to simplify hypergraphs. The first is
to transform the hyperedge into the simple edges it stands for. If the
hyperedge connects three nodes, we can change it into a unipartite
network in which all three nodes are connected to each other. In the
project team example, the new edges simply represent the fact that
the two people are part of the same team. The advantage is a gain
in simplicity, the disadvantage is that we lose the ability to know the
full team composition by looking at its corresponding hyperedge: we
need to explore the newly created structures.
The second strategy is to turn the hypergraph into a bipartite
network. Each hyperedge is converted into a node of type 1, and the
hypergraph nodes are converted into nodes of type 2. If nodes are
connected by the same hyperedge, they all connect to the correspond-
ing node of type 1. In the project team example, the nodes of type
1 represent the teams, and the nodes of type 2 the students. This is
an advantageous representation: it is simpler than the hypergraph,
but it preserves some of its abilities, for instance being able to re-
construct teams by looking at the neighbors of the nodes of type 1.
However, the disadvantage with respect to the previous strategy is
that there are fewer algorithms working for bipartite networks than
with unipartite networks.
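A minimal sketch of both strategies for a single hyperedge (the node names are made up):

import itertools
import networkx as nx

hyperedge = ["Ann", "Bea", "Carl"]    # one team, one hyperedge

# Strategy 1: expand the hyperedge into all the pairwise edges it stands for.
G = nx.Graph()
G.add_edges_from(itertools.combinations(hyperedge, 2))

# Strategy 2: turn the hypergraph into a bipartite network, with the hyperedge
# as a node of type 1 and its members as nodes of type 2.
B = nx.Graph()
B.add_node("team1", bipartite=0)
B.add_edges_from(("team1", member) for member in hyperedge)

print(list(G.edges))    # [('Ann', 'Bea'), ('Ann', 'Carl'), ('Bea', 'Carl')]
print(list(B.edges))    # [('team1', 'Ann'), ('team1', 'Bea'), ('team1', 'Carl')]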
Figure 7.8 provides a simple example on how to perform these two
Simplicial Complexes
Simplicial complexes29, 30, 31 are related to hypergraphs. A simplicial complex is a graph containing simplices. Simplicial complexes are like hyperedges in that they connect multiple nodes, but they have a strong emphasis on geometry. Graphically, we normally represent simplicial complexes as fills in between the sides that compose the simplex – as I show in Figure 7.9.
29. Vsevolod Salnikov, Daniele Cassese, and Renaud Lambiotte. Simplicial complexes and complex systems. European Journal of Physics, 40(1):014001, 2018.
30. Jakob Jonsson. Simplicial complexes of graphs, volume 3. Springer, 2008.
31. Ginestra Bianconi. Higher-order networks. Cambridge University Press, 2021.
what they are. On the other hand, a hyperedge with four nodes
only contains those four nodes, it does not logically contain any
three-node hyperedge – unless that specific three-node hyperedge is
explicitly coded as part of the data, but it is a wholly separate entity.
In the paper writing example, four people writing a paper make a
four-node hyperedge, but only a different paper with three of those
authors will generate a hyperedge contained in it – while a 3-simplex
will naturally contain all 2-simplices with no extra paper.
A simplicial complex – a network with simplices – has a di-
mension: the largest dimension of the simplices it contains. The
simplicial complex in Figure 7.9 has dimension 3. A facet of a sim-
plicial complex is a simplex that is not a face of any other larger
simplex. The facets in the simplicial complex of Figure 7.9 are:
[{1, 2}, {1, 3}, {2, 3, 4}, {4, 5}, {4, 6, 7, 8}]. The sequence of facets of
a simplicial complex fully distinguishes it from any other simplicial
complex. Just like with uniform hypergraphs, we can also have pure
simplicial complexes, which are complexes that contain only facets
of the same dimension. Figure 7.9 is not pure because it has facets of
dimension three, two and one. Figure 7.10(d) is a 3-pure simplicial
complex, because it only contains a simplex of dimension three. If
you were to ignore all simplices and analyze the network without
them, you’d be working with the skeleton of the simplicial complex.
In practice, any network is a skeleton of one or more simplicial com-
plexes.
Simplicial complexes are one of the main ways to analyze high-
order interactions in networks, and so we’re going to look at them
extensively in Chapter 34.
social network. People are connected only when they are actually
interacting with each other. We have four observations, taken at
four different time intervals. Suppose that you want to infer if these
people are part of the same social group – or community. Do they?
Looking at each single observation would lead us to say no. In each
time step there are individuals that have no relationships to the rest
of the group. Adding the observations together, though, would create
a structure in which all nodes are connected to each other. Taking
into account the dynamic information allows us to make the correct
inference. Yes, these nodes form a tightly connected group.
In practice, we can consider a dynamic network as a network with
edge attributes. The attribute tells us when the edge is active – or
inactive, if the connection is considered to be “on” by default, like
the road graph. Figure 7.12 shows a basic example of this, with edges
between three nodes.
Figure 7.12: An example of dynamic edge information. Time flows from left to right. Each row represents a possible potential edge between nodes A, B, and C. The moments in time in which each edge is active are represented by gray bars.
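A minimal sketch of edges carrying their activation times as attributes (the intervals are made up):

import networkx as nx

G = nx.Graph()
G.add_edge("A", "B", active=[(0, 3), (7, 9)])   # the moments in which A-B is on
G.add_edge("B", "C", active=[(2, 5)])
G.add_edge("A", "C", active=[(4, 6)])

def active_at(G, t):
    # The snapshot of the network at time t.
    return [(u, v) for u, v, a in G.edges(data="active")
            if any(start <= t <= end for start, end in a)]

print(active_at(G, 2))    # [('A', 'B'), ('B', 'C')]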
Figure 7.14: (a) A network with qualitative node attributes, represented by node labels and colors. (b) A network with quantitative node attributes, represented by node labels and sizes.
7.6 Summary
1. Bipartite networks are networks with two node types. Edges can
only connect two nodes of different types. You can generalize
them to be n-partite, and have n node types.
7.7 Exercises
Graphs, with their fancy nodes and edges, are not the only way to
represent a network. One can do so also by using matrices. In fact,
ask some people and they will tell you that everything is a matrix.
What’s a number if not a zero-dimensional tensor (Section 5.3)? I
mean, come on!
Unfortunately, I am not one of those people, so this chapter will
contain only the bare minimum for you to smile and nod while
talking to them.
The reason for having this chapter is that sometimes operations are more natural to understand with the graph models, and sometimes they are just matrix operations. Which perspective is more useful – graph vs matrix – often depends on the perspective used by the researcher(s) discovering a given property or developing a given
tool. So in the book I’ll often switch back and forth between these
two representations, and this chapter is your map not to get lost once
I start rambling about the “supra adjacency matrix”, whatever the
hell that means.
In each of the sections of this chapter we will see a different way
of making a matrix that has a meaningful correspondence to more
and more advanced network structures. These will enable interesting
operations via the linear algebra techniques I introduced in Chapter
5.
Basics
The adjacency matrix is a deceptively simple object. Suppose that you
have a group of friends and you want to keep a tally of who’s friend
with whom. You can make a table with one friend per row and one
friend per column. If two people say they know each other, you can
just put a cross in the corresponding cell, as my example shows in
Figure 8.1: (a) A vignette of how one would construct an adjacency matrix. (b) An example graph. (c) The adjacency matrix of (b). Rows and columns are in the same order as the node ids (so the first row/column refers to node 1, the second to node 2, etc).
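A minimal sketch of going from a graph to its adjacency matrix with networkx (the edge list is made up):

import networkx as nx

G = nx.Graph()
G.add_edges_from([(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)])

A = nx.to_numpy_array(G, nodelist=sorted(G.nodes))
print(A)          # a symmetric binary matrix: A[u][v] = 1 if u and v are connected
print(A[0][1])    # 1.0, because nodes 1 and 2 are connected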
Figure 8.2: (a) A non-symmetric adjacency matrix. (b) The corresponding directed graph.
If we want edge weights to exploit the power of linear algebra also
on weighted graphs (Section 6.3), we can allow values different than
one for the cells representing edges. Now we can have an arbitrary
real value in the cells, representing the connection’s strength – see
Figure 8.3.
[Figure 8.3: (a) A non-binary adjacency matrix. (b) The corresponding weighted graph.]
What else? We can make the adjacency matrix not square if we
need to represent a bipartite network. The different numbers of rows
and columns allow us to use one dimension to represent the nodes
in V1 and the other to represent the nodes in V2 . Figure 8.4 depicts
an example. The downside is that we lose the power of the diagonal
we had in the adjacency matrix – which doesn’t seem like a big deal
now, because at the moment I’m being all hush hush about what this
power really is.
[Figure 8.4: (a) A non-square adjacency matrix. (b) The corresponding bipartite graph.]
Multilayer Adjacency
Finally we can – and do – represent even multilayer networks with
matrices. We have two options to do so. We can either use tensors or
the supra adjacency matrix.
I introduced tensors in Section 5.3 as generalized vectors. As a
refresher, a vector can be seen as a monodimensional array: a list
of values. A matrix could be said to be a two-dimensional array. A
tensor is a multidimensional array: we can have as many dimensions
as we want.
A one-to-one coupled multilayer network can be represented with a three-dimensional tensor. The first two dimensions – rows and columns – are the nodes, and the third dimension is the layers. Mathematically, the A_{uvl} entry in the tensor tells you the relationship between nodes u and v in layer l. Figure 8.6 provides an intuitive example. Note that we are assuming that the nodes are sorted in the same way across the third dimension, thus the inter-layer couplings connect each node with its own copy in the other layers.
As you can see, in the supra adjacency matrix there is no problem if layers don't have the same number of nodes, as blocks are allowed to have different sizes and their off-diagonal parts are still able to represent the inter-layer couplings.
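As a rough sketch of the idea – my own toy layers and coupling, not the book's example – one can assemble a supra adjacency matrix with numpy by putting each layer's adjacency matrix in a diagonal block and the inter-layer couplings in the off-diagonal blocks:

import numpy as np

# Two made-up layers over the same three nodes.
A1 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
A2 = np.array([[0, 0, 1], [0, 0, 1], [1, 1, 0]])
# One-to-one coupling: node u in layer 1 is coupled with node u in layer 2.
C = np.eye(3)

# Layers on the diagonal blocks, couplings off-diagonal.
supra = np.block([[A1, C], [C.T, A2]])
print(supra)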
8.2 Stochastic
Adjacency matrices are nice, but I think most of the time you'll see them transformed in various ways to squeeze out all the possible analytic juice. The simplest makeover we can give to the adjacency matrix is to convert it into a stochastic matrix. This means that we normalize it, dividing each entry by the sum of its corresponding row, so that each row sums to one. If nodes u and v are connected, and u has 5 connections, the A_{uv} entry will be 1/5 = 0.2.
Figure 8.8 shows an example of this stochastic transformation.
[Figure 8.8: (a) The original graph. (b) The adjacency matrix of (a). (c) The corresponding stochastic version.]
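A minimal sketch of the normalization, assuming a binary adjacency matrix with no isolated nodes (otherwise a row sum would be zero and the division would fail):

import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Divide each row by its sum, so that every row sums to one.
stochastic = A / A.sum(axis=1, keepdims=True)
print(stochastic)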
We can now write this matrix as A^1, which is the same thing as A. Let's say this again: A^1 is the probability of all transitions for random walks of length 1. Could it be, then, that A^2 is the probability of all transitions for random walks of length 2? And that A^n is the probability of all transitions for random walks of length n? Yes, they are!
From the matrix multiplication crash course I gave you in Section 5.2 you know why: A^2's uv entry is, as the formula I wrote there shows, the sum of the multiplication of probabilities of all nodes k that are connected to both u and v, and thus can be used in a path of length 2. Multiplying A_{uk} by A_{kv} means asking the probability of going from u to k and from k to v. Summing A_{uk_1} A_{k_1 v} to A_{uk_2} A_{k_2 v} means asking the probability of passing through k_1 or k_2. See Chapter 2 for a refresher on what multiplying and summing probabilities mean.
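A quick way to convince yourself of this, sketched with numpy on a made-up matrix: raise a row-stochastic matrix to the n-th power and each row is still a probability distribution over destinations.

import numpy as np

A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
stochastic = A / A.sum(axis=1, keepdims=True)

# n-step transition probabilities for random walks of length n.
two_steps = np.linalg.matrix_power(stochastic, 2)
print(two_steps)
print(two_steps.sum(axis=1))  # every row still sums to one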
8.3 Incidence
8.4 Laplacian
Basic Laplacian
The stochastic adjacency matrix is nice, but the real superstar when
it comes to matrix representations of networks is the Laplacian. To
know what that is, we need to introduce the concept of the degree matrix D – which is a very simple animal. It is what we call a "diagonal"
matrix. A diagonal matrix is a matrix whose nonzero values are ex-
clusively on the main diagonal. The other off-diagonal entries in the
matrix are equal to zero. In D the diagonal entries are the degrees of
the corresponding nodes. Figure 8.11 shows an example of a degree
matrix.
[Figure 8.11: The degree matrix (a) of the sample graph (b).]
The Laplacian is defined as L = D − A: it has the degree of each node on the diagonal, −1 in every cell corresponding to an edge in the network, and zero everywhere else. Figure 8.12 depicts the operation. Why is this matrix interesting? It was originally developed to represent something very physical: it captures the relation between voltages and currents between resistors² – represented by the edges of the graph.
² Gustav Kirchhoff. Ueber die auflösung der gleichungen, auf welche man bei der untersuchung der linearen vertheilung galvanischer ströme geführt wird. Annalen der Physik, 148(12):497–508, 1847.
However, you don't need to care about electric circuits to find a use for the Laplacian. L has a number of properties that are useful in general, regardless of what your graph represents. Some of the most obvious ones are that, since it has the degree of the node on the diagonal and −1 for each of the node's connections, the sums of all rows and columns are equal to zero.
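A small sketch of the construction on a made-up adjacency matrix, checking the zero-sum property just mentioned:

import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

D = np.diag(A.sum(axis=1))  # degree matrix: degrees on the diagonal
L = D - A                   # the Laplacian

print(L.sum(axis=0))  # every column sums to zero
print(L.sum(axis=1))  # every row sums to zero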
Special Laplacians
The way the Laplacian is built makes it non-trivial to generalize it for different types of networks. Here I focus on two cases: directed and signed graphs.
For directed networks the problem comes from the fact that we
have to put the degrees of the nodes in the diagonal. However, in
a directed network, Chapter 9 will teach you that nodes have two
degrees: the number of arrow heads pointing to the node (indegree)
and the number of arrow tails coming out of the node (outdegree).
Which one do we pick? In principle, we could pick either, creating
two Laplacians: the outdegree Laplacian (Figure 8.13(a)) or the
indegree Laplacian (Figure 8.13(b)).
8.5 Summary
8.6 Exercises
3. Calculate the eigenvalues and the right and left eigenvectors of the
stochastic adjacency of the network at http://www.networkatlas.
eu/exercises/8/2/data.txt, using the same procedure applied
in the previous exercise. Make sure to sort the eigenvalues in
descending order (and sort the eigenvectors accordingly). Only
take the real part of eigenvalues and eigenvectors, ignoring the
imaginary part.
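If you want a starting point for this exercise, here is one possible sketch (not the official solution). It assumes the file is a plain whitespace-separated edge list and uses networkx plus scipy.linalg.eig, which returns eigenvalues together with left and right eigenvectors:

import numpy as np
import networkx as nx
from scipy.linalg import eig

G = nx.read_edgelist("data.txt")  # assuming one "u v" edge per line
A = nx.to_numpy_array(G)
stochastic = A / A.sum(axis=1, keepdims=True)

values, left, right = eig(stochastic, left=True, right=True)

order = np.argsort(-values.real)              # descending eigenvalues
values = values.real[order]
left, right = left.real[:, order], right.real[:, order]
print(values[:5])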
Simple Properties
9
Degree
In its simplest form, the degree of a node is the number of edges attached to it – each edge counts as a connection.
The degree is a property of a node. Let's call k_v the degree of node v. We can aggregate the degrees of all nodes in a network to get "global" information about its connectivity. The most common way to do it is by calculating the average degree of a network. This would be k̄ = ∑_{v ∈ V} k_v / |V|, however it's much simpler to remember that k̄ = 2|E|/|V|. The average degree of a network is twice the number
of edges divided by the number of nodes. Why twice? Because each
edge increases by one the degree of the two nodes it connects.
In a social network, this is how many friends people have on average. What would that number be, in your opinion? If we have a social network including two billion people, what's the average degree? It turns out that this number is usually ridiculously lower than one would expect, because – as we'll see in Section 12.1 – real networks are sparse².
² Anna D Broido and Aaron Clauset. Scale-free networks are rare. Nature communications, 10(1):1017, 2019.
We call a node with zero degree – a person without friends – an isolated node, or a singleton. A node with degree one is a "leaf" node: this term comes from hierarchies, where nodes at the bottom – the leaves of the tree – can only have one incoming connection without outgoing ones. The sum of all degrees is 2|E|, which implies that any graph can only have an even number of nodes with odd degree³ – otherwise the sum of degrees would be odd and thus it cannot be two times something.
³ Leonhard Euler. Solutio problematis ad geometriam situs pertinentis. Commentarii academiae scientiarum Petropolitanae, pages 128–140, 1741.
[Figure 9.2: A graph and its degree sequence: [3, 2, 2, 2, 1].]
A degree sequence is the list of degrees of all nodes in the network⁴,⁵. Typically, we sort the nodes in descending degree, so you always start with the node with maximum degree and you go down until you reach the node with the lowest degree. Figure 9.2 shows an example.
⁴ Michael Molloy and Bruce Reed. The size of the giant component of a random graph with a given degree sequence. Combinatorics, probability and computing, 7(3):295–305, 1998.
⁵ Béla Bollobás, Oliver Riordan, Joel Spencer, and Gábor Tusnády. The degree sequence of a scale-free random graph process. Random Structures & Algorithms, 18(3):279–290, 2001.
Note that not all lists of integers are valid degree sequences. Some lists cannot generate a valid graph. The easiest case to grasp is if they contain an odd number of odd numbers. As we just saw, the degree sequence must sum to an even number (2|E|), thus a sequence summing to an odd number cannot describe a simple undirected graph⁶. We call all valid sequences "graphic". We'll see that there are more criteria than this one for a sequence to be graphic.
⁶ Gerard Sierksma and Han Hoogeveen. Seven criteria for integer sequences being graphic. Journal of Graph theory, 15(2):223–231, 1991.
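As a sketch of such a check – my own illustration using the parity condition above plus the Erdős–Gallai inequalities; networkx also ships nx.is_graphical for the same purpose:

def is_graphic(sequence):
    d = sorted(sequence, reverse=True)
    if sum(d) % 2 != 0:          # the sum must be even, i.e. equal to 2|E|
        return False
    n = len(d)
    for k in range(1, n + 1):    # Erdos-Gallai inequalities
        lhs = sum(d[:k])
        rhs = k * (k - 1) + sum(min(x, k) for x in d[k:])
        if lhs > rhs:
            return False
    return True

print(is_graphic([3, 2, 2, 2, 1]))  # True: the sequence from Figure 9.2
print(is_graphic([3, 3, 1]))        # False: it sums to an odd number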
Of course, the degree definition I just gave only makes sense in the
world of undirected, unweighted, unipartite, monolayer networks.
We had two whole chapters detailing when such a simple model
doesn’t work in complex real scenarios. We need to extend the
definition of degree to take into account all different graph models
we might have to deal with.
Directed
As we saw in Section 6.2, edges can have a direction, meaning that
the edge going from u to v doesn’t necessarily point back from v to
u. Such is life. In directed graphs you can keep counting the degree
as simply the number of connections of a node, but there is a more
helpful way to think about it. You might want to distinguish the people who send a lot of connections – but don't necessarily see them reciprocated – from those who are the target of a lot of friend requests – whether they accept them or not.
[Figure 9.3: (a) The in-degrees and (b) the out-degrees of the nodes in an example directed graph.]
So we split the concept in two parts, helpfully named in-degree and out-degree⁷,⁸. As one can expect, the in-degree is the number of incoming connections. If we represent a directed edge as an arrow, the in-degree is the number of arrow heads attached to your node. See Figure 9.3(a) for a helpful representation. The out-degree is the number of outgoing connections, the number of arrow tails attached to your node. I show the out-degree of the nodes in my example in Figure 9.3(b).
⁷ Frank Harary, Robert Zane Norman, and Dorwin Cartwright. Structural models: An introduction to the theory of directed graphs. Wiley, 1965.
⁸ Jørgen Bang-Jensen and Gregory Z Gutin. Digraphs: theory, algorithms and applications. Springer Science & Business Media, 2008.
A directed graph’s degree sequence is now a list of tuples. The
first element of the tuple tells you the indegree, while the second
degree 137
element tells you the outdegree. Or you can have two sequences, but
you need to make sure that the nth positions of the two sequences re-
fer to the same node. If the two sequences are the same, meaning that
every node has the same in- and out-degree, we have a “balanced”
graph.
Weighted
Most of the time, people do not change the definition of degree when
dealing with weighted networks. Many network scientists like how
the standard definition works in weighted graphs, and keep it that
way. The degree is simply the number of connections a node has.
Bipartite
The bipartite case doesn’t need too much treatment: the degree is
still the number of connections of a node. It doesn’t matter much
that for V1 nodes it is gained exclusively via connections to V2 nodes
and viceversa. However, there’s a little change when one uses a
matrix representation that it’s worthwhile to point out. Assuming
A as a binary adjacency matrix (not stochastic), in the regular case
the degree is the sum of the rows: the sum of first row tells you the
degree of the first node, and so on.
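The same logic extends naturally to the non-square case: row sums give the degrees of the V1 nodes and column sums give the degrees of the V2 nodes. A minimal sketch with a made-up matrix:

import numpy as np

# Rows are V1 nodes, columns are V2 nodes.
B = np.array([[1, 0, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 0, 0]])

print(B.sum(axis=1))  # V1 degrees: [2, 3, 0] -- the last node is isolated
print(B.sum(axis=0))  # V2 degrees: [1, 1, 2, 1]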
Multilayer
The multilayer case is possibly the most complex of them all. At first,
it doesn’t look too bad. The degree is still the number of connections
a node has. Then you realize that there are some connections you
shouldn’t count. For instance, no one – that I know of – counts the
interlayer coupling connections as part of the degree. It’s easy to
see why: these are not connections that lead you to a neighbor in a
proper sense. They lead you to... a different version of yourself.
I'm going to give you some examples from a paper of mine¹², with the caveat that the space of actual possibilities is much vaster than this¹³,¹⁴. What follows is also very related to the multilayer concept
¹⁴ Federico Battiston, Vincenzo Nicosia, and Vito Latora. Structural measures for multiplex networks. Physical Review E, 89(3):032804, 2014.
Hyper
As one might expect, allowing edges to connect an arbitrary number of nodes – rather than just two – does unspeakable things to your intuition of the degree. We can still keep our usual definition: the degree in a hypergraph is the number of hyperedges to which a node belongs – or: the number of its hyper-connections¹⁸,¹⁹. However, if you take any step further, all hell breaks loose. The number of neighbors has no relationship whatsoever with the number of connections: with a single hyperedge you can connect a node with the entirety of the network. Also the average degree is something tricky to calculate. Forget about k̄ = 2|E|/|V|: if a single hyperedge can connect the entire network, then |E| = 1, but k̄ = |V|.
¹⁸ Paul Erdős and Miklós Simonovits. Supersaturated graphs and hypergraphs. Combinatorica, 3(2):181–192, 1983.
¹⁹ Alain Bretto. Hypergraph theory: An introduction. Mathematical Engineering. Cham: Springer, 2013.
Things are a bit less crazy for uniform hypergraphs – where we
force hyperedges to always have the same number of nodes. Which
might explain why they’re a much more popular thing to study,
rather than arbitrary hypergraphs. I’ll deal with the generalization of
the degree for simplicial complexes in Section 34.1, because it opens
possibilities much more vast than the space I can allow them to have
here.
The degree of a node only gives you information about that node.
The average degree of a network gives you information about the
whole structure, but it’s only a single bit of data. There are many
ways for a network to have the same average degree. It turns out that
looking at the whole degree distribution can shed light on surprising
properties of the network itself. Since degree distributions can be so
important, generating and looking at them is second nature for a network scientist. As a consequence, there are a lot of standardized
procedures you want to follow, to avoid confusing your reader by
breaking them.
Let’s break down all the components of a good degree distribution
plot. First, the basics. What’s a degree distribution? At its most
simple, it is just a degree scatter plot: the number of nodes with a
particular degree. The degree should be on the x axis and the number
of nodes on the y axis, just as I do in Figure 9.8. Commonly, one
would normalize the y axis by dividing its values by the number of nodes, so that it shows p(k): the fraction of nodes with degree k.
[Figure 9.8: The degree scatter plot (left) of the graph on the right.]
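A minimal sketch of how one could compute such a normalized distribution, using networkx and a random graph as a stand-in for whatever network you are studying:

from collections import Counter
import networkx as nx

G = nx.erdos_renyi_graph(1000, 0.01, seed=42)

degrees = [k for _, k in G.degree()]
counts = Counter(degrees)

# p(k): fraction of nodes with degree k.
for k in sorted(counts):
    print(k, counts[k] / G.number_of_nodes())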
[Figure 9.9: The degree distribution of the protein-protein interaction network. The distributions are the same, but in (a) we have a linear scale for the x and y axes, which is replaced in (b) by a log-log scale.]
[Figure: Degree distributions of real world networks: (a) coauthorship in scientific publication [Leskovec et al., 2007b]; (b) coappearance of characters in the same comic book [Alberich et al., 2002]; (c) interactions of trust between PGP users [Boguñá et al., 2004]; (d) connections through the Slashdot platform [Leskovec et al., 2009].]
[Figure 9.12: The degree scatter plot (left) and its corresponding complement of the cumulative distribution (CCDF).]
[Figure: The degree distribution of the protein-protein interaction network. The distributions are the same and are both in log-log scale, but in (a) we have the degree histogram, and in (b) we show the CCDF version (with the best fit in blue).]
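Here is a small sketch – my own, not the book's code – of how one can compute the CCDF p(k ≥ x) from a degree sequence; plotting it on log-log axes reproduces the kind of panel shown above:

import numpy as np

degrees = np.array([1, 1, 1, 2, 2, 3, 5, 8, 20])  # made-up degree sequence

xs = np.sort(np.unique(degrees))
ccdf = [(degrees >= x).mean() for x in xs]  # fraction of nodes with degree >= x

for x, p in zip(xs, ccdf):
    print(x, p)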
doesn’t really fit. Is this a coincidence, or does it have meaning? To
answer this question we need to enter the wonderful world of power-law degree distributions and scale free networks.
[Figure 9.14: An example of power law, showing how the red line always goes down by …]
You can find power laws in nature in many places: the frequencies of words in written texts, the distribution of earthquake intensities, etc. To grasp the concept you need a visual example, and my favorite is moon craters²⁵. You can see in Figure 9.15 there are a lot of tiny craters caused by small debris and a huge one. This is fractal self-similarity: if the moon were an infinite plane, you could zoom in and out the picture and the size distributions would be the same. This is the scale invariance I'm talking about: no matter the zoom, the
²⁵ Mark EJ Newman. Power laws, pareto distributions and zipf's law. Contemporary physics, 46(5):323–351, 2005b.
p(k) ∼ k^{−α}.
In this formula, we call α the scaling factor. Its value is important, because it determines many properties of the distribution. In general,
[Figure 9.16: An example of power law in a CCDF. The vertical gray bar shows that the point in the distribution is associated with degree equal to two.]
[Figure 9.17: The CCDF degree distributions of two random networks with different α exponents (α = 2 and α = 3).]
This is all well and good, but what does it mean exactly to have
α = 2 or α = 3? How do two networks with these two different coeffi-
cients look like? I provide an example of their degree distributions in
Figure 9.17, and I show two very simple random networks with such
degree distributions in Figure 9.18 – obviously, systems this small
are a very rough approximation. From Figure 9.17 you see that α
determines the slope of the degree distribution, with a steeper slope
for α = 3. This means that the hubs in α = 3 are “smaller”, they do
not have a ridiculously high degree.
Figure 9.18 confirms this: in Figure 9.18(a) you see that, for α = 2, you have only one obvious hub that is head and shoulders above the rest.
Early works have found power law degree distributions in many networks, prompting the belief that scale free networks are ubiquitous. In fact, this seems true. Figure 9.19 shows the CCDFs of many networks: protein interactions, PGP, Slashdot, DBpedia concept network, Gowalla, Internet autonomous system routers.
[Figure 9.19: A showcase of broad degree distributions from the same networks used in the examples in the previous section.]
But we need to be aware of our tendency of seeing patterns when
they aren’t there – after all, as Feynman says, the easiest person you
can fool is yourself. So in the next section I’ll give you an arsenal to
defend yourself from your own eyes and brain.
[Figure 9.20: (a) An example of a CCDF that is most definitely NOT a power law, but that a researcher with a lack of proper training might be fooled into thinking it is. (b) Fitting a power law (blue) and a lognormal (green) on data (red) can yield extremely similar results.]
Often, people will just assume that any degree distribution is a
power law, calling “power laws” things that are not even deceptively
looking like power laws. I've seen distributions like the one in Figure 9.20(a) passed off as power laws and that's just... no. However, I don't
want to pass as one perpetuating the myth that “everything that
looks like a straight line in a log-log space is a power law”. That is
equally wrong, even if more subtle and harder to catch.
Seeing the plot in Figure 9.20(b), you might be tempted to perform a linear fit in the log-log space. This more or less looks like fitting the logged values with log(p(x)) = α log(x) + β. Transforming this back into the real values, the slope α becomes the scaling factor, and β is the intercept. In other words: log(p(x)) = α log(x) + β is equivalent to p(x) = 10^β x^α – assuming you took the logarithms in base ten.
A small aside: if you were to do this on the distributions from Figure 9.17, you would expect to recover α ∼ 2 and α ∼ 3, because I told you I generated the degree distributions with those exponents. Instead, you will obtain α ∼ 1 and α ∼ 2, respectively. That is because, in Figure 9.17, I showed you the CCDF of the degree distribution, not the distribution itself. The CCDF of a power law is also a power law, but with a different exponent³¹. If you're doing the fit on the CCDF, you have to remember to add one to your α to recover the actual exponent of the degree distribution.
³¹ Heiko Bauke. Parameter estimation for power-law distributions by maximum likelihood methods. The European Physical Journal B, 58(2):167–173, 2007.
Back to parameter estimation. If you perform a simple linear regression, you'll get an unbelievably high R² associated with a super-significant p value. Well, of course: you're fitting a straight line over a
straight-ish line. Does that mean you’re looking at a power law? Not
really.
Just because something looks like a straight line in a log-log plot,
it doesn’t mean it’s a power law. You need a proper statistical test to
confirm your hypothesis. The reason is that other data generating
processes, such as the ones behind a lognormal distribution, can
generate plots that are almost indistinguishable from a power law.
Figure 9.20(b) shows an example. You cannot really tell which of the
two functions fits the data better.
What you need to do is to fit both functions and then estimate the likelihood (Section 4.3) of each model to explain the observed data³². This can be done with, for instance, the powerlaw package³³,³⁴ – available for Python. However, be prepared for the fact that finding a significant difference between the power law and the lognormal model is extremely hard.
³² Aaron Clauset, Cosma Rohilla Shalizi, and Mark EJ Newman. Power-law distributions in empirical data. SIAM review, 51(4):661–703, 2009.
³³ Jeff Alstott, Ed Bullmore, and Dietmar Plenz. powerlaw: a python package for analysis of heavy-tailed distributions. PloS one, 9(1), 2014.
³⁴ https://github.com/jeffalstott/powerlaw
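A sketch of how that comparison usually looks with the powerlaw package; the call names below are how I recall its API (double check against the package documentation), and the degree sequence is a made-up stand-in for real data:

import powerlaw

degrees = [1] * 500 + [2] * 200 + [3] * 90 + [5] * 40 + [10] * 15 + [50] * 3

fit = powerlaw.Fit(degrees, discrete=True)
print(fit.power_law.alpha, fit.power_law.xmin)

# R > 0 favors the power law, R < 0 the lognormal; p tells you whether
# the difference is significant at all (often it is not).
R, p = fit.distribution_compare("power_law", "lognormal")
print(R, p)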
In most practical scenarios, you'll have to argue that your network is a power law. How could you do it? Well, in complex networks power law degree distributions can arise by many processes, but one in particular stands out: cumulative
in particular has been observed time and time again: cumulative
advantage. Cumulative advantage in networks says that the more
connections a node has, the more likely it is that the new nodes will
connect to it. For instance, if you write a terrific paper which gathers
lots of citations this year, next year it will likely gain more citations 35
Derek de Solla Price. A general theory
than the less successful papers35 . of bibliometric and other cumulative
This is the same mechanism behind – for instance – Pareto distri- advantage processes. Journal of the
American society for Information science,
butions and the 80/20 rule. Pareto says that 80% of the effects are
27(5):292–306, 1976
generated by 20% of the causes36 . For instance, 20% of people control 36
Vilfredo Pareto. Manuale di economia
80% of the wealth. And, given that it takes money to make money, politica con una introduzione alla scienza
sociale, volume 13. Società editrice
they are likely to hold – or even grow – their share, given their abil- libraria, 1919
ity to unlock better opportunities. In fact, the Pareto distribution is
a power law. Similar to this is Zipf’s Law, the observation that the
second most common word in the English language occurs half of the
time as the most common, the third most common a third of the time, 37
Jean-Baptiste Estoup. Gammes
etc37 , 38 , 39 . In practice, the nth word occurs 1/n as frequently as the sténographiques: méthode et exercices
first, or f (n) = n−1 , which is a power law with α = 1. pour l’acquisition de la vitesse. Institut
sténographique, 1916
This is opposed to the data generating process of a lognormal 38
Felix Auerbach. Das gesetz der
distribution. To generate a lognormal distribution you simply have to bevölkerungskonzentration. Petermanns
multiply many random and independent variables, each of which is Geographische Mitteilungen, 59:74–76,
1913
positive. A lognormal distribution arises if you multiply the results 39
GK Zipf. The psycho-biology of
of many ten-dice rolls. You can see that there is no cumulative advan- language. 1935
tage here: scoring a six on one die doesn’t make a six more likely on
any other die – nor influences subsequent rolls.
So, to sum up, to test for a power law you have to do a few things. First, make sure that your observations cannot be explained with an exponential. It is hard to confuse a power law with some other distribution such as an exponential: if you think a distribution might be an exponential, then it's definitely not a power law. Second, try to see if you can statistically prefer a power law model over a lognormal. In the likely event of you not being able to mathematically do so, you should look at your data generating process. If you have the suspicion that it could be due to random fluctuations, then you might have a lognormal. Otherwise, if you can make a convincing argument of non-random cumulative advantage, go for it.
There are a few more technicalities. Pure power laws in nature are – as I mentioned earlier – rare⁴⁰. Your data might be affected by two impurities. Your power law could be shifted⁴¹, or it could have an exponential cutoff⁴². In a shifted power law, the function holds only on the tail. In an exponential cutoff the power law holds only on the head.
Shifted power laws have an initial regime where the power law doesn't hold. Formally, the power law function needs a slowly growing function on top that will be overwhelmed by the power law for large values of k – as I show in Figure 9.21(a). So we modify our master equation as: p(k) ∼ f(k)k^{−α}, with f(k) being an arbitrary but slowly growing function. Slowly growing means that, for low values of k it will overwhelm the k^{−α} term, but for high values of k, the latter would be almost unaffected. In power law fitting, this means finding the k_min value of k such that, if k < k_min we don't observe a power law, but for k > k_min we do.
⁴⁰ Whether this holds true also for networks is the starting point of a surprisingly hot debate, see for instance [Broido and Clauset, 2019] and [Voitalov et al., 2018].
⁴¹ Gudlaugur Jóhannesson, Gunnlaugur Björnsson, and Einar H Gudmundsson. Afterglow light curves and broken power laws: a statistical study. The Astrophysical Journal Letters, 640(1):L5, 2006.
⁴² Aaron Clauset, Cosma Rohilla Shalizi, and Mark EJ Newman. Power-law distributions in empirical data. SIAM review, 51(4):661–703, 2009.
[Figure 9.21: (a) An example of shifted power law, p(k) ∼ f(k)k^{−α}. The area in which the power law doesn't hold is shaded in blue. (b) An example of truncated power law: a power law with an exponential cutoff, p(k) ∼ k^{−α}e^{−λk}. The area in which the power law doesn't hold is shaded in green.]
Shifted power laws practically mean that "Getting the first k_min
connections is easy”. If you go and sign up for Facebook, you gener-
ally already have a few people you know there. Thus we expect to
find fewer nodes with degree 1, 2, or 3 than a pure power law would
predict. The main takeaway is that, in a shifted power law, we find
fewer nodes with low degrees than we expect in a power law.
Truncated power laws are typical of systems that are not big
enough to show a true scale free behavior. There simply aren’t
enough nodes for the hubs to connect to, or there’s a cost to new
connections that gets prohibitive beyond a certain point. This is
practically a power law excluding its tail, that’s why we call them
"truncated". Mathematically speaking, this is equivalent to having an exponential cutoff term on top of the power law: p(k) ∼ k^{−α}e^{−λk}.
9.5 Summary
distributions. You can describe them with statistics that are easier
to get right than the tricky business of fitting power laws.
9.6 Exercises
1. Write the in- and out-degree sequence for the graph in Figure
9.3(a). Are there isolated nodes? Why? Why not?
2. Calculate the degree of the nodes for both node types in the
bipartite adjacency matrix from Figure 9.5(a). Find the isolated
node(s).
3. Write the degree sequence of the graph in Figure 9.7. First consid-
ering all layers at once, then separately for each layer.
6. Find a way to fit the truncated power law of the network at http:
//www.networkatlas.eu/exercises/9/6/data.net. Hint: use the
scipy.optimize.curve_fit to fit an arbitrary function and use the
functional form I provide in the text.
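One possible sketch for this exercise (not the official solution): write the truncated power law p(k) ∼ k^{−α}e^{−λk} as a Python function and let scipy.optimize.curve_fit estimate its parameters from the empirical p(k). The file reading assumes a Pajek .net file, which may not match the actual data format:

from collections import Counter
import numpy as np
import networkx as nx
from scipy.optimize import curve_fit

def truncated_power_law(k, c, alpha, lam):
    return c * (k ** -alpha) * np.exp(-lam * k)

G = nx.Graph(nx.read_pajek("data.net"))
counts = Counter(k for _, k in G.degree())
ks = np.array([k for k in sorted(counts) if k > 0], dtype=float)
pk = np.array([counts[int(k)] for k in ks]) / G.number_of_nodes()

params, _ = curve_fit(truncated_power_law, ks, pk, p0=(1.0, 2.0, 0.01), maxfev=10000)
print(params)  # fitted c, alpha, lambda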
10
Paths & Walks
[Figure 10.2: (a) A graph. (b-c) Different powers of its binary adjacency matrix A.]
10.2 Cycles
You can make a walk and a path in any graph, no matter its topology.
There is a special path that you cannot always do, though. That is the
cycle. Picking up the social network example as before, now you’re
not happy just by reaching somebody with your message. You want
the message you originally sent to come back to you. Also, you don’t
want anybody to hear it twice. If you manage to do so, then you have
found a cycle in your social network.
A cycle is a path that begins and ends with the same node. Note
that I said “path”, so we don’t have any repeated nodes nor edges
– except the origin, of course. Figure 10.3(a) shows an example of a
cycle in the network. The cycle’s length is the number of edges you
use in your cycle. Given its topological constraints, that is also the
number of its nodes.
Imposing cycles to be paths makes them a non-trivial object to have
in your network. We can easily see why there might be nodes that
participate in no cycles. If a node has degree equal to one, you can
start a path from it, but you can never go back to complete a cycle.
Doing so would force you to re-use the only connection they have.
Thus a cycle is impossible for such nodes.
In fact, we can go further. We can imagine a network structure
that has no cycles at all! I draw one such structure in Figure 10.3(b).
No matter how hard you squint, you’re never going to be able to
draw a cycle there. We have a special name for such structures: trees.
Trees are simple graphs with no cycles. In a tree you cannot get your
message back, unless somebody hears it twice. Given their lack of
cycles, some even call them acyclic graphs.
10.3 Reciprocity
We first count the number of connected pairs in the network: pairs of nodes with at least one edge between them. In
Figure 10.6, we have five connected pairs. Then we count the number
of connected pairs that have both possible edges between them: the
ones reciprocating the connection. In Figure 10.6, we have two of
them. Reciprocity is simply the second count over the first one. So,
for the example in Figure 10.6, we conclude that reciprocity is 2/5, or
that the probability of a connection to be reciprocated is 40%. Sad.
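A minimal sketch of this count on a made-up directed edge list (networkx also has nx.reciprocity, but note that it divides by the number of edges rather than by the number of connected pairs, so the value differs):

edges = {(1, 2), (2, 1), (1, 3), (3, 4), (4, 3), (4, 5), (2, 5)}

pairs = {tuple(sorted(e)) for e in edges}  # connected pairs
mutual = {(u, v) for (u, v) in pairs if (u, v) in edges and (v, u) in edges}

print(len(mutual), "of", len(pairs), "pairs reciprocated:", len(mutual) / len(pairs))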
Walks and paths can help you uncover some interesting properties
in your network. Let’s pick up our game of message-passing. In this
scenario, we might end up in a situation where there is no way for
a message to reach some of the people in the social network. The
people you can reach with your message do not know anybody who
can communicate to your intended targets. In this scenario, it is
natural to divide people into groups that can talk to each other. These
are the network’s “components”.
Unreachable
[Figure 10.9: The stochastic adjacency matrix of a disconnected graph looks like two different adjacency matrices pasted on the diagonal. Thus, they both have a (different) leading eigenvalue equal to one.]
[Figure 10.10: If your graph has two components, the eigenvectors associated with the largest two eigenvalues of the stochastic adjacency matrix will tell you to which component the node belongs, by having a non-zero value.]
you cannot join with a cycle – meaning starting from u and passing
through v makes it impossible to go back to u – then those nodes are
not part of the same strongly connected component. SCCs are impor-
tant: if you are playing a message-passing game where messages can
only go in one direction, you can always hear back from the players
in the same strongly connected component as you.
Popular algorithms to find strongly connected components in a graph are Tarjan's⁵, Nuutila's⁶, and others that exploit parallel computation⁷.
⁵ Robert Tarjan. Depth-first search and linear graph algorithms. SIAM journal on computing, 1(2):146–160, 1972.
⁶ Esko Nuutila and Eljas Soisalon-Soininen. On finding the strongly connected components in a directed graph. Inf. Process. Lett., 49(1):9–14, 1994.
⁷ Sungpack Hong, Nicole C Rodia, and Kunle Olukotun. On fast parallel detection of strongly connected components (scc) in small-world graphs. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pages 1–11, 2013.
The definition of SCC leaves the door open for some confusion. Even by visually inspecting a network that appears to be connected in a single component, you will find multiple different SCCs – as in Figure 10.11(b). In the figure, there is no path that respects the edge directions and leads from node 1 to node 7 and back. The best one could do is 1 → 2 → 3 → 7 → 8 → 6 → 5 → 4.
However, it feels like this network should have one component, because we can see that there are no cuts, no isolated vertices. If
we were to ignore edge directions, Figure 10.11(b) would really
look like a connected component in an undirected network. This
feeling of uneasiness led network scientists to create the concept of
“weakly connected components” (WCC). WCCs are exactly what
I just wrote: take a directed network, ignore edge directions, and
look for connected components in this undirected version of it. Un-
der this definition, Figure 10.11(b) has only one weakly connected
component.
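To see the difference in practice, a short sketch with networkx on a made-up directed graph:

import networkx as nx

G = nx.DiGraph([(1, 2), (2, 3), (3, 1),   # a directed cycle: one SCC
                (3, 4), (4, 5)])          # a directed tail

print(list(nx.strongly_connected_components(G)))  # {1, 2, 3}, {4}, {5}
print(list(nx.weakly_connected_components(G)))    # {1, 2, 3, 4, 5}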
But you’re not part of the core of the office, you are in a weakly
connected component. Your job is simply to receive a document,
stamp it, and pass it to the next desk. Since you are in a WCC, you
know you’re never going to see the same document twice. That
would imply that there is a cycle, and thus that you are in a strongly
connected component with someone. However, what you see in the
document can be radically different. The document might arrive
to you with or without the core’s stamp of approval. These two
scenarios are quite different.
If you are in the first scenario, it means your WCC is positioned
“before” the core. Documents pass through it and they are put in
the core. The flow of information originates from you or from some
other member of the weakly connected component, and it is poured
into the core. This is the scenario in Figure 10.12(a): you are one of
the four leftmost nodes. In this paragraph I highlighted the word in
because we decided to call these special WCCs in-components.
10.5 Summary
2. Cycles are paths which start and end in the same node. Acyclic
graphs are graphs without cycles. An undirected acyclic graph is
called a tree – a graph with |V | nodes and |V | − 1 edges. Otherwise,
you can have directed acyclic graphs which are not trees.
except one node with in-degree zero, then that directed tree is also
an arborescence.
10.6 Exercises
11
Random Walks
Figure 11.2 shows the result. We see now that the columns are
constant vectors. These numbers have a specific meaning. When we
calculate A∞ , what we’re doing is basically asking the probability
of being in a node after a random walk of infinite length. Since
the length is infinite, it does not really matter from which node
you originally started. That’s why all rows of A∞ are the same –
remember that the row indicates the starting point while the column
indicates the ending point.
This row vector – you can pick any of them, since they’re all
the same – is so important that we give it a name. We call it the
“stationary distribution” – or π, for short. π tells us that, if you have
a path of infinite length, the probability of ending up on a destination
is only dependent on the destination’s location and not on your point
of origin. In practice, if you apply the transition probability (A) to the
stationary distribution (π), you still obtain the stationary distribution:
πA = π. Having a high value in the stationary distribution for
a node means that you are likely to visit it often with a random
walker – by the way, this is almost exactly what PageRank estimates,
plus/minus some bells and whistles, see Section 14.4.
Note that it is not necessary to calculate A∞ to know the stationary
distribution. At least for undirected networks, π is quite literally the
normalized degree of the nodes: the degree divided by the sum of all
degrees (2| E|).
But... wait! This stationary distribution formula is oddly familiar:
πA = π. Haven’t we seen something similar to it? This kind of
looks like our eigenvector specification (Av = λv, see Section 5.5),
with a few odd parts. First, where’s the eigenvalue? Well, we can
always multiply a vector to 1 and we won’t change anything in the
equation. So: π A = π1. This is cool, because we already know that
1 is the largest eigenvalue (λ1 ) of a stochastic matrix. Second, the
vector π is on the left, not on the right. Putting these things together:
the stationary distribution π is the vector associated with the largest
eigenvalue, if multiplied on the left of A. Therefore: π is the leading
left eigenvector.
If you’re dealing with an undirected graph, there is a relationship
between right and left eigenvectors. If you were to transpose the
stochastic adjacency matrix, that is making it column-normalized
instead of row-normalized, the left and right eigenvectors would
swap. In different words: the left eigenvectors of A are exactly the
same as the right eigenvectors of A^T. Thus the constant vector and π are the right and left leading eigenvectors of A, and they swap roles in A^T.
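A sketch checking both claims on a standard test graph: the leading left eigenvector of the row-stochastic adjacency matrix, once normalized, matches the degrees divided by 2|E|:

import numpy as np
import networkx as nx

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)
stochastic = A / A.sum(axis=1, keepdims=True)

# Left eigenvectors of a matrix are right eigenvectors of its transpose.
values, vectors = np.linalg.eig(stochastic.T)
leading = vectors[:, np.argmax(values.real)].real
pi = leading / leading.sum()

degrees = A.sum(axis=1)
print(np.allclose(pi, degrees / degrees.sum()))  # True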
What do you do if your graph is not connected? No matter how
many powers of A you take, how infinitely long your walks are,
some destinations are unreachable from some origins. We end up
with two stationary distributions, one for one component, and one
for the other. Figure 11.3 shows an example. These two stationary
distributions are not directly comparable one with the other. They are
Instead of having a row and a column per node, we instead have two rows and columns per edge⁵ – or one row/column per edge direction if we have a directed graph, meaning that we treat an undirected graph as a directed one with perfect reciprocity. Each cell contains a one if we can use the edge direction for our non-backtracking walk, zero otherwise. Formally:

NB_{uv,vz} = 1 if u ≠ z, and 0 otherwise.

graphs⁹ (Chapter 41).
⁵ Ki-ichiro Hashimoto. Zeta functions of finite graphs and representations of p-adic groups. In Automorphic forms and geometry of arithmetic varieties, pages 211–280. Elsevier, 1989.
⁹ Leo Torres, Pablo Suárez-Serrato, and Tina Eliassi-Rad. Non-backtracking cycles: length spectrum theory and graph mining applications. Applied Network Science, 4(1):41, 2019.
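A sketch of this construction on a made-up undirected edge list: every undirected edge becomes two directed ones, and the entry for the pair of directed edges (u→v, v→z) is one unless z is u:

import numpy as np

und_edges = [(1, 2), (2, 3), (3, 1), (3, 4)]
# Perfect reciprocity: each undirected edge counts as two directed ones.
directed = und_edges + [(v, u) for u, v in und_edges]
index = {e: i for i, e in enumerate(directed)}

NB = np.zeros((len(directed), len(directed)))
for (u, v) in directed:
    for (v2, z) in directed:
        if v2 == v and z != u:  # the walk u -> v -> z does not backtrack
            NB[index[(u, v)], index[(v2, z)]] = 1

print(NB.shape)  # one row and column per edge direction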
11.3 Hitting Time
[Figure 11.6: The elements needed to calculate the hitting time of a graph. From left to right: the graph and its degree vector, the eigenvalues and eigenvectors of N, the resulting hitting time matrix H.]

∑_{v ∈ V} π_v H_{u,v} = ∑_{n=2}^{|V|} 1/(1 − λ_n).
Mathematically, C = 2|E|Ω, which means that Ω is a sort of normalized commute time. It is the commute time, ignoring the overall size of the graph – which is given by the 2|E| factor. You can see an example graph and its effective resistance matrix in Figure 11.7. The original definition of Ω is physical¹²: you assume G is an electrical network and each edge is a resistor of one Ohm. Then the effective resistance between nodes u and v, Ω_{u,v}, is the literal electric resistance you'd measure. If you remember, when I introduced the Laplacian (Section 8.4) I said that it was originally used to describe
¹² D Babić, Douglas J Klein, István Lukovits, Sonja Nikolić, and N Trinajstić. Resistance-distance matrix: a computational algorithm and its application. International Journal of Quantum Chemistry, 90(1):166–176, 2002.
Γ = (L + (1/|V|) 1)^†.
Here, L is the Laplacian, |V | is the number of nodes, and 1 is a
|V | × |V | matrix filled with ones. In practice, inside the parentheses
we have a matrix that is L plus 1/|V | in all its entries. The † symbol
means that we want to invert this matrix. If you remember, the Lapla-
cian tells you how electricity flows in the network, so we need to
invert it to estimate the resistances. The problem is that the Laplacian
is a singular matrix, which means it cannot be inverted.
This is why we use † instead of −1: rather than inverting we take the Moore-Penrose pseudoinverse¹⁴,¹⁵,¹⁶, which is basically the inverse, but shhh don't tell the Laplacian or it will get mad. To get the Moore-Penrose pseudoinverse, the first step is to perform the singular value decomposition (SVD) of L. SVD is one of the many ways to perform matrix factorization (Section 5.6). In SVD, we want to find the elements for which this equation holds: Q_1 Σ Q_2^T = L. The important part here is Σ, which is a diagonal matrix containing L's singular values. We can easily build a Σ^{−1} matrix, containing in its diagonal the reciprocals of L's singular values. Then Q_2 Σ^{−1} Q_1^T = L† is L's Moore-Penrose pseudoinverse. It holds that L L† L = L and that L† L L† = L†.
¹⁴ Eliakim H Moore. On the reciprocal of the general algebraic matrix. Bulletin of the american mathematical society, 26:294–295, 1920.
¹⁵ Arne Bjerhammar. Application of calculus of matrices to method of least squares: with special reference to geodetic calculations. 1951.
¹⁶ Roger Penrose. A generalized inverse for matrices. In Mathematical proceedings of the Cambridge philosophical society, volume 51, pages 406–413. Cambridge University Press, 1955.
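In practice you rarely build the SVD by hand: numpy's pinv computes the Moore-Penrose pseudoinverse directly. A sketch on a made-up path graph, also using the standard shortcut Ω_{u,v} = Γ_{u,u} + Γ_{v,v} − 2Γ_{u,v}, an identity that the text above does not derive:

import numpy as np
import networkx as nx

G = nx.path_graph(4)  # a made-up graph: 0 - 1 - 2 - 3
L = nx.laplacian_matrix(G).toarray().astype(float)
n = L.shape[0]

Gamma = np.linalg.pinv(L + np.ones((n, n)) / n)

def resistance(u, v):
    return Gamma[u, u] + Gamma[v, v] - 2 * Gamma[u, v]

print(resistance(0, 1))  # 1.0: a single one-Ohm resistor
print(resistance(0, 3))  # 3.0: three resistors in series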
What makes effective resistance so awesome is that it defines a proper analogue to the Euclidean distance in a graph¹⁷. By far the most popular alternative is using shortest path distances – we'll see how to calculate this distance in Chapter 13. However, differently from shortest paths, Ω is a proper metric, which means you can do a bunch of things (most of which we'll see in Chapter 47 when we'll talk about network distances) that would lead to mathematical nonsense if you were to use shortest paths instead.
¹⁷ Karel Devriendt. Effective resistance is more than distance: Laplacians, simplices and the schur complement. Linear Algebra and its Applications, 639:24–49, 2022.
Moreover, by being the result of a process of diffusion through the
entirety of the graph – current flows as it was presented – Ω is also
more resistant to random fluctuations. The removal or introduction
of a single edge can radically change the shortest path distance
between two nodes, but Ω will change less abruptly. Figure 11.8 shows an example.
[Figure 11.9: (a) The Laplacian matrix of a graph. I show the 2-cut solution for this graph with the red and blue blocks. (b) The second smallest eigenvector of (a). (c) The graph view of (a) – I color the nodes according to the value attached to them in the (b) vector.]
One classical problem in graph theory is to find the minimum
(or normalized) cut of a graph: how to divide nodes in two disjoint
groups such that the number of edges running across groups is
minimized. Turns out that the second smallest eigenvector of the Laplacian is a very good approximation to solve this problem¹⁸.
¹⁸ Miroslav Fiedler. Laplacian of graphs and algebraic connectivity. Banach Center Publications, 25(1):57–70, 1989.
How? Consider Figure 11.9. In Figure 11.9(a) I show the Laplacian
matrix of a graph. I arranged the rows and columns of the matrix
so that the 2-cut solution is evident: by dividing the matrix in two
diagonal blocks there is only one edge outside our block structure
that needs to be cut.
Now, why did I label the two blocks as “+” and “-”? The reason
lies in the magical second smallest eigenvector of the Laplacian – also
known as the Fiedler vector –, which is in Figure 11.9(b). We can see
that the top entries are all positive (in red) and the bottom are all
negative (in blue). This is where L shines: by looking at the sign of
the value of a node in its second smallest eigenvector we know in
which group the node has to be to solve the 2-cut problem!
Not only that, but the values in Figure 11.9(b) are clearly in de-
scending order. If we look at the graph itself – in Figure 11.9(c) – and
we use these values as node colors, we discover that there is much
more information than that in the eigenvector. The absolute value
tells us how embedded the node is in the group, or how far from the
cut it is. Node 5 is right next to it, while node 9 is the farthest away.
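A sketch of this recipe on a made-up graph made of two cliques joined by one edge: take the eigenvector of the second smallest eigenvalue of the Laplacian and split the nodes by sign:

import numpy as np
import networkx as nx

G = nx.barbell_graph(4, 0)  # two 4-cliques joined by a single edge
L = nx.laplacian_matrix(G).toarray().astype(float)

values, vectors = np.linalg.eigh(L)  # eigh: eigenvalues in ascending order
fiedler = vectors[:, 1]              # the Fiedler vector

group_a = [v for v in G.nodes() if fiedler[v] >= 0]
group_b = [v for v in G.nodes() if fiedler[v] < 0]
print(group_a, group_b)              # the two cliques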
[Figure 11.10: (a) A graph. (b) Its nodes plotted by their values in the 2nd (x axis) and 3rd (y axis) smallest eigenvectors of the Laplacian.]
while nodes closer to the cuts are nearby the origin (0, 0). You can
imagine that we could solve the 4-cut problem looking at a 3D space,
and the k-cut problem looking at a (k − 1)D space. I can’t show it
right now because, although it’s truly remarkable, the margin of my
Möbius paper is too small to contain it.
Of course, at the practical level, real world networks are not
amenable to these simple solutions. Most of the time, the best
way to solve the 2-cut problem is to put in one group a node with
degree equal to one and put all other nodes of the network in the
other group. If you want to find non-trivial k-cuts of the network
that are meaningful for humans... well... you have to do community
discovery (and jump to Part X).
One thing that might be left in your head after reading the previous
section is: why? Why do the eigenvectors of the Laplacian help with
the mincut problem? What’s the mechanism? To sketch an answer for
this question we need to look at what we call “consensus dynamics”.
This is a subclass of studying the diffusion of something (a disease,
word-of-mouth, etc) on a network – which we’ll see more in depth
in Part VI. This section is sketched from a paper¹⁹ that you should read to have a more in-depth explanation of the dynamics at hand. Consensus dynamics were originally modeled this way by DeGroot²⁰.
¹⁹ Michael T Schaub, Jean-Charles Delvenne, Renaud Lambiotte, and Mauricio Barahona. Structured networks and coarse-grained descriptions: A dynamical perspective. Advances in Network Clustering and Blockmodeling, pages 333–361, 2019.
²⁰ Morris H DeGroot. Reaching a consensus. Journal of the American Statistical Association, 69(345):118–121, 1974.
In this section I'm going to use the stochastic adjacency matrix of the graph, but what I'm saying also holds for the Laplacian. The difference between the two – as I also mention in Section 34.3 – is that the stochastic adjacency matrix describes the discrete diffusion over a network. In other words, you have a clock ticking and nothing happens between one tick of the clock and the other. The Laplacian, instead, describes continuous diffusion: time flows without ticks in a clock, and you can always ask yourself what happens between two observations. Besides this difference, the two approaches could be considered equivalent for the level of the explanation in this section.
How does the stochastic adjacency help us in studying consensus
over a network? Let’s suppose that each node starts with an opinion,
which is simply a number between 0 and 1. We can describe the
status of a network with a vector x of |V | entries, each corresponding
to the opinion of each node. One valid operation we could do is
multiplying x with A, the stochastic adjacency matrix, since A is a
|V | × |V | matrix.
What does this operation mean? Mathematically, from Section 5.2, the result is a vector x′ of length |V| defined as x′_v = ∑_{u=1}^{|V|} x_u A_{uv}. In practice, the formula tells you that node v is updating its opinion by averaging the opinion of its neighbors. Non-neighbors do not contribute anything because A_{uv} = 0, and this is an average because we know that the rows of the adjacency matrix sum to 1 – thus each x_u is weighted equally and x′_v will still be between 0 and 1.
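A sketch of this update rule iterated a few times on a standard test graph with random starting opinions; at every step each node replaces its opinion with the average of its neighbors' opinions:

import numpy as np
import networkx as nx

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)
stochastic = A / A.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.random(len(G))           # one opinion in [0, 1] per node

for _ in range(50):
    x = stochastic @ x           # every node averages its neighbors' opinions
print(x.min(), x.max())          # the opinions have (almost) converged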
[Figure 11.11: The value of x (y axis) for each node in the graph from Figure 11.10 over time (x axis).]
longer time. Moreover, you can use the sign to know on which side
of the cut you are, because the nodes will first tend to converge to the
value of their own community, which is the opposite of the value in
the other community.
11.7 Summary
3. The hitting time is the number of expected steps you need to take
in a random walk to reach one node starting from another. It is
related to a special eigenvector decomposition of the adjacency
matrix.
11.8 Exercises
12
Density
To quantify the difference between the two cases, network scien-
tists defined the concept of network density. Informally, this is the
probability that a random node pair is connected. Or, the number of
edges in a network over the total possible number of edges that can
exist given the number of nodes. We can estimate this latter quantity
quite easily – from now on, I’ll just assume the network is undirected,
unipartite and monolayer, for simplicity.
Here’s the problem, rephrased as a question: how many edges
do we need to connect |V | nodes? Well, let’s start by connecting one
node, v, to every other node. We will need |V | − 1 edges – we’re
banning self loops. Now we take a second node, u. We need to
connect it to the other nodes in V minus itself – seriously, no self
loops! – and v, because we already added the (u, v) edge at the
previous step. So we add |V | − 2 edges. If you go on and perform
the sum for all nodes in V, you'll obtain that the number of possible
edges connecting |V | nodes is |V |(|V | − 1)/2. In other words, you
need |V | − 1 edges to connect a node in V with all the other nodes,
and you divide by two because each edge connects two nodes. In
fact, the number of possible edges in a directed network is simply
|V |(|V | − 1), because you need an edge for each direction, as (u, v) ̸=
(v, u).
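The quantity is then easy to compute directly; a sketch, assuming an undirected, unipartite, monolayer network as in the text:

def density(n_nodes, n_edges):
    possible = n_nodes * (n_nodes - 1) / 2  # all pairs, no self loops
    return n_edges / possible

print(density(3, 3))      # 1.0: the complete triangle
print(density(650, 650))  # ~0.0031, i.e. roughly 0.31%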
Now we can tell the difference between the networks in Figures
12.1(a) and 12.1(b). The first one has three nodes. We would expect
3 ∗ 2/2 = 3 edges, and that’s exactly what we have. Its density is
the maximum possible: 100%. The network on the right, instead,
contains 650 nodes. Since the average degree of the network is also
two, we know it also contains 650 edges. This is a far cry from the
650 × 649/2 = 210,925 we'd require. Its density is just 0.31%, more
than three hundred times lower than the example on the left!
With all this talk about real world networks having low degree, we
should expect them to be quite sparse. They are, in fact, even sparser
than you think. A few examples – the numbers are a bit dated, they
refer to the moment in which these networks were studied in a paper¹.
¹ Albert-László Barabási et al. Network science. Cambridge university press, 2016.
The network connecting the routers forming the backbone of the Internet? It contains |V| = 192,244 nodes. So the possible number of edges is |V|(|V| − 1)/2 = 18,478,781,646. How many does it have, really? 609,066, which is just 0.003% of the maximum. How about the power grid? The classical studied structure has 4,941 nodes, which means – potentially – 12,204,270 edges. Yet, it only contains 6,594 of them, or 0.05%. Well, compared to the Internet that's quite the density! Final example: scientific paper citations. A dataset from arXiv contains 449,673 papers. The theoretical maximum number of citations is 101,102,678,628. Physicists, however, are quite stingy: they only made 4,689,479 citations, or 0.004% of the theoretical maximum.
You might have spotted a pattern there. The density of a network
seems to go down as you increase the number of nodes. While not an
ironclad rule, you might be onto something. The problem is that the
[Figure 12.2: The red line shows the number of possible edges … (legend: Potential Edges vs Actual Edges; y axis: |E|).]
Global Clustering
Density doesn’t solve all ambiguities you had in the case of the
average degree. Two networks can have the same density and the
same number of nodes, but end up looking quite different from each
other. That is why the ever industrious network scientists created
yet another measure to distinguish between different cases: the
clustering coefficient.
at v’s neighbors. I start by selecting its first neighbor, with the blue
arrow. How many triads do v and the blue neighbor generate? Well,
one for each of the remaining neighbors of v, so I add a blue dot to
each of the neighbors. When I move to the neighbor highlighted by
the green arrow, I perform the same operation adding the green dot.
I don’t have to add one to the neighbor with the blue arrow, because
I already counted that triad.
Sounds familiar? That is because this is the very same process
you apply when you have to count the number of possible edges of
a graph. The number of triads centered on node v is nothing more than the number of possible edges among k_v nodes, with k_v being v's degree. So, if we want to know the number of triads in a graph, we simply need to add k_v(k_v − 1)/2 for every v in the graph.
Note that, as you might expect, the clustering coefficient takes different values for weighted³,⁴ and directed⁵ graphs.
³ Jukka-Pekka Onnela, Jari Saramäki, János Kertész, and Kimmo Kaski. Intensity and coherence of motifs in weighted complex networks. Physical Review E, 71(6):065103, 2005.
⁴ Jari Saramäki, Mikko Kivelä, Jukka-Pekka Onnela, Kimmo Kaski, and Janos Kertesz. Generalizations of the clustering coefficient to weighted complex networks. Physical Review E, 75(2):027105, 2007.
⁵ Giorgio Fagiolo. Clustering in complex directed networks. Physical Review E, 76(2):026107, 2007.
I primed you to expect that many statistical properties can be derived via matrix operations. This is true also for the clustering coefficient. It is done via the powers of the binary adjacency matrix – see Section 10.1. Triangles are closed paths of length 3, while triads are paths of length 2. The number of closed walks of length 3 centered on u is A^3_{uu}, while the number of walks of length 2 passing through u is A^2_{uv}, with u ≠ v, which results in the formula:

CC = ∑_u A^3_{uu} / ∑_{u ≠ v} A^2_{uv}.
So, let’s calculate the global clustering coefficient for the graph
in Figure 12.6(a). We know how many triads there are in the graph.
How many triangles are there? Here I made my life easier, because it
is rather trivial to count the number of triangles in a planar graph –
a graph you can draw on a 2D plane without intersecting any edges.
There are eight triangles and 48 triads in the network. Thus the
global clustering coefficient is 3 × 8 / 48 = 0.5.
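The matrix-power formula can be checked numerically. A minimal sketch with numpy, again on a stand-in graph rather than the one in Figure 12.6:

```python
import networkx as nx
import numpy as np

G = nx.karate_club_graph()            # stand-in example graph
A = nx.to_numpy_array(G)              # binary adjacency matrix

A2 = A @ A
A3 = A2 @ A

closed_walks = np.trace(A3)           # sum of A^3 diagonal = 6 x number of triangles
open_walks = A2.sum() - np.trace(A2)  # off-diagonal sum of A^2 = 2 x number of triads

print(closed_walks / open_walks)      # global clustering coefficient
print(nx.transitivity(G))             # networkx agrees
```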
[Figure 12.8: A network with several structural holes, for instance between nodes 2 and 4, 2 and 5, and 4 and 5.]
12.3 Cliques
12.5 Summary
12.6 Exercises
Centrality
13
Shortest Paths
[Figure 13.5: Finding the shortest path – edges colored in purple – between the start node (in blue) and the target node (in green). (a) Directed network. (b) Weighted network, where edge weights represent the cost of traversal.]
1. You look at all the unvisited neighbors of the current node and
calculate their tentative distances through the current node.
the Dijkstra algorithm I know, because it is an animated GIF⁶.
⁶ https://upload.wikimedia.org/wikipedia/commons/5/57/Dijkstra_Animation.gif
Faster variations of the Dijkstra algorithm⁷,⁸,⁹,¹⁰ use clever data
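To make the procedure concrete, here is a compact sketch of Dijkstra's algorithm over a weighted adjacency dictionary. The toy graph is hypothetical; in practice you would simply call networkx's shortest path functions with weight='weight'.

```python
import heapq

def dijkstra(adj, source):
    """adj maps each node to a list of (neighbor, weight) pairs."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                       # stale heap entry, node already settled
        for v, w in adj.get(u, []):
            nd = d + w                     # tentative distance through u
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

toy = {1: [(2, 1), (3, 4)], 2: [(3, 2), (4, 5)], 3: [(4, 1)], 4: []}
print(dijkstra(toy, 1))  # {1: 0, 2: 1, 3: 3, 4: 4}
```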
Just like with the degree, knowing the length distribution of all
shortest paths in the network conveys a lot of information about its
connectivity. A tight distribution with little deviation implies that all
nodes are more or less at the same distance from the average node in
the network. A spread out distribution implies that some nodes are
in a far out periphery and others are deeply embedded in a core.
[Figure 13.9: The path length distribution (left) of a graph (right). Each bar counts the number of shortest paths of length one, two, and three, which is the maximum length in the network. Axes: Path Length (x), # Paths (y).]
Diameter
The rightmost column of the histogram in Figure 13.9 is important.
It records the number of shortest paths of maximum length. These
are the "longest shortest paths". Since this is an important concept,
such a mouthful of a name won't do. We're busy people and we've
got places to be. So we use a different name for them or, to be more
precise, for their length. We call it the diameter of the network.
Why do we care about the diameter? Because that’s the worst case
• ...
It’s now easy to see that a network with diameter equal to three
is easy to navigate. As the diameter grows, the number of people to
rely on for a full traversal starts becoming unwieldy.
If your network has multiple connected components (Section 10.4),
we have a convention. Nodes in different components are unreach-
able, and thus we say that their shortest path length is infinite. Thus,
a network with more than one connected component has an infinite
diameter. Usually, in these cases, what you want to look at is the
diameter of the giant connected component.
Average
The diameter is the worst case scenario: it finds the two nodes that
are farthest apart in the network. In general, we want to know
the typical case for nodes in the network. What we calculate, then, is
not the longest shortest path, but the typical path length, which is the
average of all shortest path lengths. This is the expected length of the
shortest path between two nodes picked at random in the network.
If P_{uv} is the path to go from u to v and |P_{uv}| is its length, then the
average path length of the network is

APL = \frac{\sum_{u,v \in V} |P_{uv}|}{|V|(|V| - 1)}.

Figure 13.10 shows that, even in a tiny graph, the diameter and the APL
can take different values, with the former being more than twice the
length of the latter.
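Both quantities are one function call away in networkx – here on a stand-in connected graph, since the figures' graphs are not reproduced:

```python
import networkx as nx

G = nx.karate_club_graph()                 # stand-in connected graph

print(nx.diameter(G))                      # the longest shortest path
print(nx.average_shortest_path_length(G))  # the APL, i.e. the typical case
```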
With APL, we can fix the origin node. For instance, in a social
network, you can calculate your average separation from the world.
[Figure 13.10: The diameter and the APL in a graph can be quite different. In the example, the diameter is 3 while the APL is ~1.4.]

[Figure 13.11: The path length distribution for Facebook in 2012. The x axis is the average degree of separation, the y axis the number of Facebook users (in millions); the mean is 3.57.]
This would be an APL_v, the average path length for all paths starting
at v. Then you can generate the distribution of all lengths for all
origins. What does this APL_v distribution look like for a real world
network? One of the most famous examples I know comes from
Facebook¹⁵. I show it in Figure 13.11. The remarkable thing is how
ridiculously short the paths are, even in such a gigantic network.
¹⁵ Sergey Edunov, Carlos Diuk, Ismail Onur Filiz, Smriti Bhagat, and Moira Burke. Three and a half degrees of separation. Research at Facebook, 2016
This is in line with classical results of network science, showing
that the diameter and APL typically grow sublinearly in terms of the
number of nodes in the network¹⁶. In other words, there are dimin-
ishing returns to path lengths: each additional person contributes
¹⁶ Mark EJ Newman. The structure and function of complex networks. SIAM Review, 45(2):167–256, 2003b
origin city, given a list of cities and the distances between each pair of
cities. We can represent the problem as a weighted graph, with city
distances as edge weights. The minimum spanning tree doesn’t solve
the problem: it creates a tree, which has no cycles. Thus, to get back
to the origin city, you have to backtrack all the way through the tree –
not ideal.
biclique. Look at Figure 13.14 and try to draw those graphs in two
dimensions without having any edge crossing another one. You'll
find out that it is not possible.
The second cousin of spanning trees is the triangulated maximally
filtered graph²⁸. This was originally proposed as a more efficient
algorithm to extract planar maximally filtered graphs from larger
graphs. However, it also allows you to specify different topological
constraints, which do not necessarily make the graph planar.
²⁸ Guido Previde Massara, Tiziana Di Matteo, and Tomaso Aste. Network filtering for big data: Triangulated maximally filtered graph. Journal of Complex Networks, 5(2):161–178, 2016
13.6 Summary
2. Shortest paths are the paths connecting two arbitrary nodes in the
network using the minimum possible number of edges. In directed
networks you have to respect the edge’s direction, in weighted
networks you have to minimize (or maximize, depending on the
problem definition) the sum of the edge weights.
13.7 Exercises
14
Node Ranking

The most direct way to find the most important nodes in the network
is to look at the degree. The more friends a person has, the more
important she is. This way of measuring importance works well in
many cases, but can miss important information. What if there is a
person with only a few friends, but placed in different communities –
just like in Figure 14.1? The removal of such a person will create iso-
lated groups, which are now unable to talk to each other. Shouldn’t
this person be considered a key element in the social network, even
with her puny degree?
14.1 Closeness
If we want to know the closeness centrality¹ of a node v, first we
calculate all shortest paths starting from that node to every possible
¹ Alex Bavelas. A mathematical model for group structures. Human Organization, 7(3):16, 1948
14.2 Betweenness

Network scientists developed betweenness centrality²,³ to fix some of
the issues of closeness centrality. Differently from closeness, with be-
tweenness we are not counting distances, but paths. We still calculate
all shortest paths between all possible pairs of origins and destina-
tions. Then, if we want to know the betweenness of node v, we count
the number of paths passing through v – but of which v is neither an
origin nor a destination. In other words, the number of times v is in
between an origin and a destination. If there is an alternative way of
equal distance to get from the origin to the destination that does not
use v, we discount the contribution of the path passing through v to
v's betweenness centrality. I provide an example in Figure 14.4.
² Jac M Anthonisse. The rush in a directed graph. Stichting Mathematisch Centrum. Mathematische Besliskunde, (BN 9/71), 1971
³ Linton C Freeman, Douglas Roeder, and Robert R Mulholland. Centrality in social networks: II. Experimental results. Social Networks, 2(2):119–141, 1979
The total number of paths that can pass through a node – ex-
cluding the ones for which it is the origin or the destination – is
(|V| − 1)(|V| − 2) in a directed network, and (|V| − 1)(|V| − 2)/2 in
an undirected one.
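A hedged sketch of the computation in networkx – the example graph is a stand-in, and normalized=True divides by the pair counts just discussed:

```python
import networkx as nx

G = nx.karate_club_graph()  # stand-in example graph

bc = nx.betweenness_centrality(G, normalized=True)
best = max(bc, key=bc.get)
print(best, bc[best])       # the node most shortest paths depend on
```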
One intuitive way to think about betweenness centrality is to ask
yourself: how many paths would become longer if node v were to
disappear from the network? How much is the network structure
dependent on v’s presence? Since real world networks have hubs
which are closer to most nodes, the shortest paths will use them
often. As a result, betweenness centrality distributes over many
in one clique can reach the nodes in the other one. All those paths
passing through the removed edge are lost forever. The surviving
edge between the two most central ones will have a much reduced
edge betweenness centrality: it cannot be used to move between
cliques any more. This consideration holds true not only for the edge
betweenness, but also for the node betweenness.
A relaxed version of betweenness centrality does not use shortest
paths, but random walks (Chapter 11). This simulates the spreading
of information through the network. The definition is similar: this "flow"
centrality is the number of random walkers passing through the node
during the information spread event⁴,⁵. Just like with the regular
betweenness centrality, also in this case you can take an edge-centric
approach, and count the number of random walks going through a
specific edge. This has been used, for instance, to solve the problem
of community discovery⁶,⁷.
⁴ Mark EJ Newman. A measure of betweenness centrality based on random walks. Social Networks, 27(1):39–54, 2005a
⁵ Ulrik Brandes and Daniel Fleischer. Centrality measures based on current flow. In Annual Symposium on Theoretical Aspects of Computer Science, pages 533–544. Springer, 2005
⁶ Santo Fortunato, Vito Latora, and Massimo Marchiori. Method to find community structures based on information centrality. Physical Review E, 70(5):056104, 2004
⁷ Vito Latora and Massimo Marchiori. Vulnerability and protection of infrastructure networks. Physical Review E, 71(1):015103, 2005

14.3 Reach

Reach centrality is only defined for directed networks. The local
reach centrality of a node v is the fraction of nodes in a network that
you can reach starting from v⁸. From this definition, one can see why
it doesn't make much sense for undirected networks. If your network
has a single connected component, then all nodes have the same
reach centrality, which is equal to one. That is also the case if your
directed network has only one strongly connected component. In a
strongly connected component there are no "sinks" where paths get
trapped, thus every node can reach any other node.
⁸ Enys Mones, Lilla Vicsek, and Tamás Vicsek. Hierarchy measure for complex networks. PloS one, 7(3):e33799, 2012
14.4 Eigenvector
calculate a set of PageRanks, one per topic. There are other possible
ways to define a multilayer PageRank¹⁵.
¹⁵ Arda Halu, Raúl J Mondragón, Pietro Panzarasa, and Ginestra Bianconi. Multiplex pagerank. PloS one, 8(10):e78293, 2013
Katz

Another popular variant of eigenvector centrality is Katz centrality¹⁶.
¹⁶ Leo Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39–43, 1953
At a philosophical level, the difference between the two is that Katz
says that nodes that are farther away from v should count less when
estimating v's importance. So it matters whether v is reached at the
first step of the random walk, rather than at the second, or at the
hundredth. For eigenvector centrality, when you meet v in a random
walk makes no difference; for Katz it does.
If we were to write the eigenvector centrality not as an eigenvector,
but as a sum, we would end up with something that looks a bit like
this:

EC_v = \sum_{k=1}^{\infty} \sum_{u \in V} (A^k)_{uv},

KC_v = \sum_{k=1}^{\infty} \sum_{u \in V} \alpha^k (A^k)_{uv}.
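A minimal sketch comparing the two in networkx. Note that, for the Katz sum to converge, α must be smaller than the reciprocal of the largest eigenvalue of A, so the value below is just a safe illustrative choice on a stand-in graph:

```python
import networkx as nx

G = nx.karate_club_graph()              # stand-in example graph

ec = nx.eigenvector_centrality(G)
kc = nx.katz_centrality(G, alpha=0.05)  # alpha < 1 / largest eigenvalue of A

print(sorted(ec, key=ec.get, reverse=True)[:3])
print(sorted(kc, key=kc.get, reverse=True)[:3])
```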
UBIK

A paper of mine presents UBIK, which is the lovechild between Katz
centrality and the personalized PageRank I presented before¹⁷. The
weird acronym is short for "you (U) know Because I Know" and we
developed it with networks of professionals in mind (like LinkedIn).
¹⁷ Michele Coscia, Giulio Rossetti, Diego Pennacchioli, Damiano Ceccarelli, and Fosca Giannotti. You know because I know: a multidimensional network approach to human resources problem. In Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pages 434–441. ACM, 2013b
Each professional has skills, which allow her to perform her job.
However, sometimes she is required to do something using a skill she
doesn't have. In many cases, she might be able to perform the task
anyway, because she can ask for help in her social network. Think
about any time you asked a friend to fix something in your script, or
scrape some data, or patch a leaking water pipe.
Of course, if the task you need to perform requires only knowl-
edge you have, you can do it quickly. Every level of social interaction
you add will slow you down. If you're a computer scientist you can
EC_v = \sum_{k=1}^{\infty} \sum_{u \in V} (A^k)_{uv},

EC_v = (1 - \alpha) e_v + \sum_{k=1}^{\infty} \sum_{u \in V} \alpha (A^k)_{uv}.
14.5 HITS

HITS²²,²³ is an algorithm designed by Jon Kleinberg and collabora-
tors to estimate a node's centrality in a directed network. It is part
of the class of eigenvector centrality algorithms from Section 14.4,
but it deserves its own section due to its interesting characteristics.
Differently from other centrality measures, HITS assigns two values
to each node. In fact, one can say that HITS assigns nodes to one of
two roles – we will see more node roles in Chapter 15. The two roles
are "hubs" and "authorities".
²² Jon M Kleinberg, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew S Tomkins. The web as a graph: measurements, models, and methods. In International Computing and Combinatorics Conference, pages 1–17. Springer, 1999
²³ Jon M Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5):604–632, 1999
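A quick sketch of what HITS returns in practice – the toy directed graph below is hypothetical, built so that node 0 acts as a hub and node 3 as an authority:

```python
import networkx as nx

# Node 0 points to many pages (hub-like); node 3 is pointed to by many (authority-like).
G = nx.DiGraph([(0, 3), (0, 4), (0, 5), (1, 3), (2, 3)])

hubs, authorities = nx.hits(G)
print(max(hubs, key=hubs.get))                # 0
print(max(authorities, key=authorities.get))  # 3
```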
of saying that a central node in a large component is more impor-
tant than a central node in a smaller component. Lin's centrality²⁶
achieves this by multiplying the closeness centrality of a node by the
size of its connected component, which – incidentally – just means
squaring the numerator:
14.7 k-Core
14.9 Summary
14.10 Exercises
1. Based on the paths you calculated for your answer in the previous
chapter, calculate the closeness centrality of the nodes in Figure
13.12(a).
4. What's the most central node in the network used for the previous
exercise according to PageRank? How does PageRank compare
with the in-degree? (for instance, you could calculate the Spear-
man and/or Pearson correlation between the two)
6. Based on the paths you calculated for your answer in the previous
chapter, calculate the harmonic centrality of the nodes in Figure
13.12(a).
15
Node Roles

Rolx³ is one of the best known machine learning approaches for
the extraction of node roles in complex networks – a topic we'll
greatly expand on in Part XI of the book.
³ Keith Henderson, Brian Gallagher, Tina Eliassi-Rad, Hanghang Tong, Sugato Basu, Leman Akoglu, Danai Koutra, Christos Faloutsos, and Lei Li. Rolx: structural role extraction & mining in large graphs. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1231–1239. ACM, 2012
The way it works is by representing nodes as vectors of attributes.
Attributes can be, for instance, the degree, the local clustering,
betweenness centrality, and so on. In practice, you decide which
node features are relevant to determine the roles your nodes should
be playing in the network. This means that, by selecting the right set
of features, you can recover all
• Bridges (red): these are the hubs keeping the network together;
• Tightly knit (blue): these are the authors who have a reliable group
of co-authors, and are usually embedded in cliques;
Note that, with Rolx, you can also estimate how much each role
tends to connect with nodes in a similar role. For instance, by their
very nature, bridges tend to connect to nodes with different roles,
while tightly knit nodes band together. This is related to the concepts
of homophily and disassortativity, which we’ll explore in Chapter 30.
Structural Equivalence
When two nodes have the same role in a network they are, in a sense,
similar to each other. Researchers have explored this observation
[Figure 15.5: (a) An example of two structurally equivalent nodes (nodes 1 and 2). (b) Here, nodes 1 and 2 are not structurally equivalent, because node 1 has a neighbor that node 2 does not.]
nodes 1 and 2 have the same neighbors and no other additional one.
If I were to flip their IDs, you would not be able to tell. There is
no extra information for you to do so, because all you have is their
neighbor set.
On the other hand, we can tell the difference between nodes 1
and 2 in Figure 15.5(b). That is because we know that node 1 also
connects to node 3, which node 2 does not. So the two nodes are not
structurally equivalent. You can use any vector similarity measure
to estimate structural equivalence. For instance, you can calculate
the number of common neighbors or the Jaccard similarity of the
neighbor sets between u and v. In Figure 15.5(b), nodes 1 and 2 have
three common neighbors out of four possible, thus their structural
equivalence is 0.75.
Alternatively, one could use cosine similarity⁵, Pearson correlation
coefficients, or inverse Euclidean distance. In all these cases, you
have to transform the neighbor set into a numerical vector. If you sort
the nodes consistently, each node can be represented as a vector of
⁵ Gerard Salton. Automatic text processing: The transformation, analysis, and retrieval of. Reading: Addison-Wesley, 169, 1989
[Figure 15.6: (a) A simple graph. (b) The adjacency vector representations of node 1 and node 2.]
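As a minimal sketch of these computations – the small graph below is a hypothetical reconstruction in the spirit of Figure 15.5(b), not the exact one in the figure – the Jaccard version of structural equivalence is a one-liner on the neighbor sets:

```python
import networkx as nx

G = nx.Graph([(1, 3), (1, 4), (1, 5), (1, 6), (2, 4), (2, 5), (2, 6)])

def jaccard_equivalence(G, u, v):
    nu, nv = set(G[u]), set(G[v])
    return len(nu & nv) / len(nu | nv)

print(jaccard_equivalence(G, 1, 2))  # 3 common neighbors out of 4 possible -> 0.75
```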
Automorphic Equivalence
[Figure 15.7: An example of relabeling of the graph to highlight the automorphic equivalence between nodes 1 and 2.]
Regular Equivalence
similarity. If you write this in matrix form, you get that σ = αAσA,
which looks like an eigenvector problem – and, in fact, it is. To get
the real similarity, you need to apply this formula many times, say l
times. That is why you need an α in front: by applying this formula
l times you are using paths of length l to estimate the similarity,
but you care more about nodes that are closer to u and v and not
those that are l hops away. If α < 1, then α^l < α^{l−1}, and you get
smaller contributions from nodes that are farther away. This should
be familiar to you, because that's the same logic as the Katz centrality we
just saw in Section 14.4.
If you want to boost the similarity of a node with itself, you can
always use the identity matrix: σ = αAσA + I. However, you should
be careful because this measure as written here reduces to something
similar to structural equivalence. If you run this formula on the net-
work in Figure 15.8 you get that nodes 1, 4, 5, and 6 are all similar –
in intuitive terms, you’d get that CEOs are similar to interns because
they both connect to middle managers. Figure 15.9 shows you the
similarity matrix you’d get, and you can see node 1 is as similar to
node 4 as nodes 2 and 3 are to each other. So you need to ensure that
your initial classes are respected somehow.
To sum up the difference between structural, automorphic, and
regular equivalence, consider familial bonds. Two women with the
same husband and the same children are structurally equivalent: they
So far we’ve been pretty rigid in the way we wanted to classify nodes
into roles. Either we explicitly defined the roles with strict rules, or
we adopted the similarity approach, finding which node plays a
similar structural role to which other node. This is a sort of “zero-
dimensional” approach, where everything collapses in a single label.
One could instead use node embeddings, which determine the role
of a node with a vector of numbers and then classify the node with
it. Recently, the most common way to discover such roles has become
the use of graph neural networks. I will explain more in detail how
they work much later in the book – if you think what follows is a
bunch of gobbledygook, you should check out Part XI. Here I will
just show the general shape of the problem they are solving and how
it can be used to infer node roles.
When it comes to detecting node roles, graph neural networks use
machine learning techniques (Chapter 4) to learn a function classify-
ing nodes⁸,⁹.
⁸ Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434, 2018
⁹ Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596, 2019
Figure 15.10 shows a general schema for graph neural networks.
We start from some training data. This could be the graph
itself with some nodes labeled and some not, or a graph or a collec-
tion of graphs with labeled nodes. We pass this graph as the input to
the hidden layers. The role of the hidden layers is to learn a function
f that can explain why each node has a specific label/value. Once f
is learned, you will obtain the labels of the non-classified nodes, or
you’ll be able to classify a new, previously unseen, graph.
Typically, f assumes that each node can be represented by a vector,
called state. We’re going to see in Chapter 44 more than you want to
know on how to make smart fs.
But modifying the way you learn f is not the only possible variant.
You could also modify the input and output layers, to change the
task itself. For instance, you could try to predict spatiotemporal
networks¹⁰,¹¹,¹²,¹³: by having as input a dynamic network with
changing node states, you could predict a future node state. Figure
15.11 shows a simplified schema for the task. Think, for instance,
about traffic load: given the way traffic evolves, you want to be able
to predict how many cars will hit a specific road stretch (edge) or
intersection (node).
¹⁰ Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv preprint arXiv:1707.01926, 2017b
¹¹ Bing Yu, Haoteng Yin, and Zhanxing Zhu. Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting. In IJCAI, pages 3634–3640. AAAI Press, 2018
¹² Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018
¹³ Ashesh Jain, Amir R Zamir, Silvio Savarese, and Ashutosh Saxena. Structural-rnn: Deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5308–5317, 2016
Other possible applications are the generation of a realistic net-
work topology (Chapter 18), the prediction of a link (Chapter 23), or
summarizing the graph (Chapter 46). Given that these are not related
to node roles, I'll deal with such applications in the proper chapters.

[Figure 15.11: A simplified schema of the spatiotemporal prediction task: given a node whose state sequence so far is 1,2,1,2,1,?, predict the next state (1,2,1,2,1,2).]
15.4 Summary
2. There are many ways to define roles, some well known are brokers
– in between communities –, gatekeepers – on the border of a
community –, and more. A popular algorithm to detect node roles
is Rolx.
15.5 Exercises
16
Random Graphs

Before diving deep into these more complex models, I need to spend
some time with the grandfather of all graph models. It is the family
of network generating processes created by Paul Erdős and Alfréd
Rényi in their seminal set of papers²,³,⁴ (some credit goes also to
Gilbert⁵ for a few variants of the model).
² P Erdős and A Rényi. On random graphs. Publicationes Mathematicae Debrecen, 6:290–297, 1959
³ Paul Erdős and Alfréd Rényi. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5(1):17–60, 1960
⁴ Paul Erdős and Alfréd Rényi. On random matrices. Magyar Tud. Akad. Mat. Kutató Int. Közl, 8(455-461):1964, 1964
⁵ Edgar N Gilbert. Random graphs. The Annals of Mathematical Statistics, 30(4):1141–1144, 1959
These are simply known colloquially as "Random graphs". I
can divide them fundamentally into two categories: Gn,p and Gn,m
models. The way they work is slightly different, but their results
are mathematically equivalent. The difference is there simply for
convenience in what you want to fix: Gn,p allows you to define the
probability p that two random nodes will connect, while Gn,m allows
you to fix the number of edges in your final network, m.
Ok, but... what is a random graph? We're all familiar with the con-
cept of a random number. You toss a die, the result is random. But
what does "random" mean in the context of a graph? What's ran-
domized here? For this chapter, I will answer these questions assum-
ing uncorrelated random graphs – meaning that you can mentally
replace "random" with "statistical independence". This is not strictly
speaking necessary: in the same way that you can study the statistics
of correlated coins, you can study correlated random graphs. How-
ever, that would make for some nasty math, and it isn't super useful for
the aim of this book.
This means that you don’t know anything about the connections that
are already there.
If you're tired of reading a book, this is the perfect occasion for a
physical exercise. Here's a process you can follow to make your
own random network. First, take a box of buttons and pour it on the
floor. Yup, you heard me: just make a mess. Then take some yarn, cut
some strings, and drop them on the buttons. The buttons are now
your nodes, and the strings the edges. Congratulations! You have a
random network! Time to calculate!
Figure 16.3 depicts the process. Note that throwing coins for each
node pair isn’t exactly the most efficient way to go about generating a
Gn,p – although I invite you to try. A few smart folks determined an
algorithm to generate Gn,p efficiently⁶.
⁶ Vladimir Batagelj and Ulrik Brandes. Efficient generation of large random networks. Physical Review E, 71(3):036113, 2005
Since Gn,m and Gn,p generate graphs with the same properties, it
means that p and m must be related. Since the graphs have n nodes,
we can derive the number of edges (m) from p. p is applied to each
pair of nodes independently. We know how many pairs of nodes the
graph has, which is n(n − 1)/2. Thus we have an easy equation to
derive m from p: the number of edges is the probability of connecting
any random node pair times the number of possible node pairs, or
p \frac{n(n - 1)}{2} = m.
This is useful if you use Gn,p but you want to have, more or less,
control on how many edges you’re going to end up with.
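A hedged sketch of both flavors in networkx, checking that the p you pick lands you close to the m you expect (n and p are arbitrary illustrative values):

```python
import networkx as nx

n, p = 1000, 0.01
expected_m = p * n * (n - 1) / 2              # p times the number of node pairs = 4995

G_np = nx.gnp_random_graph(n, p, seed=42)     # fix p, let m fluctuate
G_nm = nx.gnm_random_graph(n, 4995, seed=42)  # fix m exactly instead

print(G_np.number_of_edges(), expected_m)
print(G_nm.number_of_edges())
```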
By the way, this gives you an idea of the typical density of
a random Gn,p graph. The density (see Section 12.1) is the number
of links over the total possible number of links. We just saw that
the number of links in a random graph is p n(n − 1)/2, and the total
possible number of links is n(n − 1)/2. One divided by the other gives
you p. So, if you want to reproduce the sparseness of real world
networks, you can do that at will. Just tune the p parameter to be
exactly the density you want.
In the following sections we explore each property of interest of
random graphs, to see when they model the properties of real world
networks well, and when they don’t. The latter is the starting point
of practically any subsequent graph model developed after Erdős and
Rényi.
[Figure 16.4: (a) The typical degree distribution of Gn,p networks, peaking at k = p(|V| − 1). (b) Comparing a Gn,p degree distribution with one that you would typically get from a real world network.]
16.5 Clustering
[Figure 16.7: A graphical derivation of the expected clustering of a Gn,p node: the expected number of edges among a node's neighbors is p · k(k − 1)/2, out of k(k − 1)/2 possible ones, so every node has the same expected clustering coefficient, p – which is also the average clustering coefficient, and much lower than the one of real world networks.]
If we say, for simplicity, that k̄(k̄ − 1)/2 = x, then our CCv formula
looks like: CCv = px/x = p. This means that the clustering coeffi-
cient doesn’t depend on any node characteristic, and it’s expected
to be the same – equal to p – for all nodes. Figure 16.7 provides a
graphical version of this derivation.
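You can convince yourself of this with a quick simulation – a sketch with arbitrary n and p values:

```python
import networkx as nx

n, p = 2000, 0.01
G = nx.gnp_random_graph(n, p, seed=42)

print(nx.average_clustering(G))  # hovers around p = 0.01
print(nx.transitivity(G))        # the global version tells the same story
```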
Compared to real world networks, p is usually a very low value
for the clustering coefficient. In real world networks, it’s more likely
to close a triangle than to establish a link with a node without com-
mon neighbors. Thus the clustering coefficient tends to be higher
than simply the probability of connection. This is a second pain point
of Erdős-Rényi graphs when it comes to explaining real world prop-
erties, after the lack of a realistic degree distribution, as we saw in
Section 16.2. Such a low clustering usually implies also the absence of
16.6 Summary
2. The oldest and most venerable random graph model is the Gn,p
(or Gn,m ) model, where we fix the number of nodes n and the
probability p of connecting a random node pair, and we extract
edges uniformly at random.
5. Random graphs have a short average path length just like real
world networks typically have. However, they have a much lower
clustering coefficient than what you find in the wild.
16.7 Exercises
the largest connected component on the y axis. Can you find the
phase transition?
17
Understanding Network Properties

17.1 Clustering
Before the Stone Age, a caveman society was very simple. You had
tribes living in their own caves. The tribes were very small, they were
families. Everybody knew everyone else in their cave, but between
caves there was almost no communication. Maybe there could have
been one weak link if the two caves were close enough.
This metaphor was the starting point for Watts in developing
his "cavemen" model¹.
¹ Duncan J Watts. Networks, dynamics, and the small-world phenomenon. American Journal of Sociology, 105(2):493–527, 1999
The cavemen model is part of the family of
simple networks (see Section 6.4). It takes two parameters: the cave
size (Figure 17.1(a)) and the number of caves (Figure 17.1(b)). The
cave size is the number of people living in each cave. A cave is a
clique: as said, everyone in the cave knows every cavemate (Figure
draws, but this is pretty unlikely, especially for high k values. Any-
how, given how unlikely this is, we can say that the small world
model properly recovers the giant connected component property of
real world networks.
Differently from cavemen, this time we have short paths. They are
regulated by the rewiring probability p. Rewiring creates bridges that
span across the network. Even a tiny bridge probability can connect
parts of the network that are very far away. Thousands of shortest
paths will use it and will be significantly shorter.
Still, the degree distribution of a small world model is very weird.
If p is low, it looks like a cavemen graph, because almost all nodes
will have the same number of neighbors. If p is high it means that
we are practically rewiring every edge in the graph. At that point,
randomness overcomes every other feature of the model, and the
result would be almost indistinguishable from a Gn,p model. We have
high clustering because each connection that we don't rewire will
create triangles with some of the neighbors of the connected nodes⁴.
⁴ Unless you set k = 2, then the clustering would be zero. But why would you set k = 2 in a small world model? People are weird.
However, high clustering does not necessarily mean that you are
going to have communities. In fact, a small world model typically
doesn’t have them. The triangles are distributed everywhere uni-
formly in the network. There are no discontinuities in the density,
no differences between denser and sparser areas. This is especially
evident for high k and low p: as I show in Figure 17.5, small world
networks with such parameter combinations just look like odd
snakes without clear groups of densely connected nodes. This is a
precondition to have communities, and so you cannot find them in a
small world model.
[Figure 17.8: Adding a new node with the link selection model. (a) Pick an existing link at random. (b) Pick one of the two nodes connected by that link. (c) Connect to it. The probabilities of connecting to each node in this process are the ones floating next to it.]

A third alternative to preferential attachment and link selection
is the copying model. Just like in link selection, the newcomer node
has no information about the network, it just picks something uni-
formly at random. Differently from the link selection model, here it
picks another existing node, rather than a link (Figure 17.9(a)). It then
copies one of its connections (Figure 17.9(b)). You can see again how
it's more likely to connect to the hub: the hub has more neighbors,
thus it is more likely to select one of its neighbors. Moreover, the
neighbor of a hub is likely to be low degree, increasing the chances
of selecting the hub in the copying step (Figure 17.9(c)). The copy-
ing model is based on an analogy of how webmasters create new
hyperlinks to pre-existing content on the web¹⁴.
¹⁴ Jon M Kleinberg, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew S Tomkins. The web as a graph: measurements, models, and methods. In International Computing and Combinatorics Conference, pages 1–17. Springer, 1999
[Figure 17.9: Adding a new node with the copying model. (a) Pick an existing node at random. (b) Copy one of its connections. (c) The probabilities of connecting to each node in this process are the ones floating next to it.]
It is easy to see why a network generated with either of these three
models has a single connected component. Since you always connect
a new node with one that was already there, there is no step in which
you have two distinct connected components. Thus, any cumulative
advantage network following any of these models will have all its
nodes in the same giant connected component, as it should.
These networks also have short diameters and average path
lengths. Mechanically it is easy to see why. Hubs with thousands
of connections can be used as shortcuts to traverse the network.
Mathematically, the diameter of a preferential attachment network
grows as log |V| / log log |V|, thus very slowly, slower than a random graph.
Figure 17.10 shows some simulations, comparing the average path
length of a Gn,m and a preferential attachment network with the same
number of nodes and the same number of edges, as their size grows.
[Figure 17.10: The average shortest path length (y axis) for increasing number of nodes (x axis) for Gn,m (blue) and preferential attachment (red) models, with the same average degree.]
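You can reproduce the gist of that simulation with a short sketch – sizes and seeds here are arbitrary, and the APL is measured on the giant component to be safe:

```python
import networkx as nx

def apl_giant(G):
    giant = G.subgraph(max(nx.connected_components(G), key=len))
    return nx.average_shortest_path_length(giant)

for n in (100, 400, 1600):
    ba = nx.barabasi_albert_graph(n, m=2, seed=42)              # preferential attachment
    er = nx.gnm_random_graph(n, ba.number_of_edges(), seed=42)  # same nodes and edges
    print(n, round(apl_giant(ba), 2), round(apl_giant(er), 2))
```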
These models also reproduce power law degree distributions –
that’s what they were developed for. In fact, you can calculate the
exact degree distribution exponent for the standard preferential
attachment model, which is α = 3. This is independent of the
m parameter, meaning that you cannot tune it to obtain different
exponents (obviously, you might get different exponents because of
the randomness of the process, but as |V | → ∞, then α → 3). If you
want to reproduce a real world network with α = 2, you cannot use
the basic preferential attachment model.
However, there is a peculiar aspect about their degree distributions
that is worth considering. As you saw from all examples, the cumu-
lative advantage applies especially to "old" nodes. The earlier the
node entered your structure, the more likely it is to become a hub.
[Figure 17.11: The age effect in the degree distribution of the preferential attachment model, comparing the complementary cumulative degree distributions p(k ≥ x) of "old" and "young" nodes.]
You could set the parameter m > 1 to add two or more edges per
new node, but that helps only to a certain point: it’s not so likely
to strike two already connected nodes thus creating a triangle. The
preferential attachment model has a higher clustering than a random
Gn,p one, but not by much. It is still a far cry from the clustering
levels you see in real world networks. Moreover, this is only true for
m > 1. In that case, you add more than one edge per newcomer node,
which means you end up losing the head of your distribution. The
network will not contain a single node with degree equal to one.
Thus the clustering in these cumulative advantage models is much
lower than real world networks, and there are no communities –
because everything connects to hubs which make up a single core.
There are some extensions of the model which try to include clus-
tering¹⁸. At every step of this model you have a choice. You either
add a node with its links, or you just add links between existing
nodes without adding a new one. The probability of taking that step
regulates the clustering coefficient of the network.
¹⁸ Petter Holme and Beom Jun Kim. Growing scale-free networks with tunable clustering. Physical Review E, 65(2):026107, 2002
However, triangles close randomly, thus we have no communities
just like in the small world model. If we want to look at models
which generate more realistic network data, we have to look at the
ones I discuss in the next chapter.
17.4 Summary
17.5 Exercises
18
Generating Realistic Data

Both the small world and the preferential attachment models are
useful because they give us ideas on how some real world network
properties arise. The small world model tells us that small diameters
happen because a clustered network might have some random short-
cuts. The preferential attachment model tells us that broad degree
distributions arise because of cumulative advantage: having many
links is the best way to attract more links.
Yet, neither of them is able to reproduce all the features of a real
world network. If we want to do so, we have to sacrifice the explana-
tory power of a model. We have to fine tune the model so that we
force it to have the properties we want, regardless of what realistic
process made them emerge in the first place. This is the topic of this
chapter.
The easiest way to ensure that your network will have a broad degree
distribution is to force it to have it. No fancy mechanics, no emerging
properties. You first establish a degree sequence and then you force
each node to pick a value from the sequence as its degree. This
simple idea is at the basis of the configuration model¹. In fact, the
configuration model is more general than this. You can use it to
¹ Mark EJ Newman. The structure and function of complex networks. SIAM Review, 45(2):167–256, 2003b
match the degree sequence of any real world graph, regardless of the
simplicity or complexity of its actual degree distribution.
The configuration model starts from the assumption that, if we
want to preserve the degree distribution, we can take it as an input of
our network generating process. We know exactly how many nodes
have how many edges. So we forget about the actual connections,
and we have a set of nodes with "stubs" that we have to fill in. Figure
18.1 shows an example.
There's a relatively simple algorithm to generate a configuration
model network, the Molloy-Reed approach²,³. First, as we saw, you
² Michael Molloy and Bruce Reed. A critical point for random graphs with a given degree sequence. Random Structures & Algorithms, 6(2-3):161–180, 1995
³ Mark EJ Newman, Steven H Strogatz, and Duncan J Watts. Random graphs with arbitrary degree distributions and their applications. Physical Review E, 64(2):026118, 2001
[Figure 18.1: In a configuration model, you start from the degree histogram (y axis: # nodes) to determine how many nodes have how many open "edge stubs".]
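A hedged sketch of the stub-matching idea in networkx, copying the degree sequence of a stand-in real network:

```python
import networkx as nx

real = nx.karate_club_graph()                  # the network whose degrees we imitate
degrees = [d for _, d in real.degree()]

cm = nx.configuration_model(degrees, seed=42)  # multigraph: stubs matched at random
cm = nx.Graph(cm)                              # collapse parallel edges...
cm.remove_edges_from(nx.selfloop_edges(cm))    # ...and drop self-loops

# Degrees match the input, up to the few parallel edges and self-loops we removed.
print(sorted(degrees)[-5:])
print(sorted(d for _, d in cm.degree())[-5:])
```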
properly high clustering, but these are rare enough that researchers
needed to modify the configuration model to explicitly include the
generation of triangles¹⁰,¹¹.
¹⁰ Mark EJ Newman. Random graphs with clustering. Physical Review Letters, 103(5):058701, 2009
¹¹ Joel C Miller. Percolation and epidemics in random clustered networks. Physical Review E, 80(2):020901, 2009
This is usually achieved by generating a joint degree sequence.
Rather than simply specifying the degree of each node, we now have
to fix two values. The first is the number of triangles to which the
node belongs. The second is the number of remaining edges the
node has that are not part of a triangle. One can see that we're still
not filled, because it’s made by three independent edges, rather than
being a three-way relationship, like the blue-filled simplex.
You don't necessarily have to end up with a simplicial network
or a hypergraph: once you have placed the shapes you can "forget"
that they are higher order structures, and consider your network as a
simple graph.
18.2 Communities
p_in and p_out are the first two parameters of a Stochastic Block
Model¹⁴ (SBM). To fully specify the model you need a few additional
ingredients. First, you have to specify |V|, the number of nodes in
the graph. Then you have to plant a community partition. In other
words, for each node – from one to |V| – you have to specify to which
community it belongs. Otherwise we don't know how to use the p_in
and p_out parameters.
¹⁴ Paul W Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983
It's easy to see that, if p_in = p_out, then the SBM is fully equivalent
to a Gn,p model: each node has the same probability of connecting to
any other node in the network. However, if we respect the constraint
that p_out < p_in, the resulting adjacency matrix will be block diagonal.
Most of the non-zero entries will be close to the diagonal, whose
blocks are exactly the communities we planted in the first place!
One could go even deeper and determine that each pair of nodes
u and v can have its own connection probability. This would generate
an input matrix for the SBM that looks like the one in Figure 18.5(a).
Figure 18.5(b) shows a likely result of the SBM using Figure 18.5(a)
as an input. It’s easy to see why we call this matrix “block diagonal”.
The blocks are.... on the diagonal, man.
In one swoop we obtained what we were looking for: both com-
munities and high clustering. The very dense blocks contribute a lot
to the clustering calculation, more than the sparse areas around the
communities can bring the clustering down. One observation we will
come back to is that, in real world networks, the community sizes dis-
tribute broadly, just like the degree: there are few giant communities
and many small ones. This can be embedded in SBM, since we’re free
to determine the input partition as we please.
If you set p_out to a relatively high value, you might make your
communities harder to find, but you gain something else: smaller
diameters. You're also free to set p_in < p_out, in which case you'd find
a disassortative community structure, where nodes tend to dislike
nodes in the same community. See Section 30.2 to know more about
what disassortativity means.
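A minimal sketch of planting communities with an SBM in networkx – the sizes and probabilities below are arbitrary illustrative choices:

```python
import networkx as nx

sizes = [50, 30, 20]                 # the planted community sizes
p_in, p_out = 0.3, 0.02
probs = [[p_in if i == j else p_out  # dense blocks on the diagonal
          for j in range(len(sizes))]
         for i in range(len(sizes))]

G = nx.stochastic_block_model(sizes, probs, seed=42)
print(G.number_of_nodes(), G.number_of_edges())
print(nx.average_clustering(G))      # the dense blocks push clustering up
```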
GN Benchmark

The GN benchmark is a modification of the cavemen graph and one
of the first network models designed to test community discovery
algorithms¹⁷. The first defining characteristic of this model is set-
ting some of the parameters of the cavemen graph as fixed. In the
benchmark, we have only four caves and each cave contains 32 nodes.
Differently from the caveman graph, the caves are not cliques: each
node has an expected degree of 16, thus it can connect at most to half
of its own cave.
¹⁷ Michelle Girvan and Mark EJ Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821–7826, 2002
LFR Benchmark

The LFR Benchmark was developed to serve as a test case for com-
munity discovery algorithms¹⁸. The objective is to generate a large
number of benchmarks to test a new algorithm such that we know
the "true" allegiance of each node. Once an algorithm returns us a
possible node partition, we can compare its solution with the true
communities.
¹⁸ Andrea Lancichinetti, Santo Fortunato, and Filippo Radicchi. Benchmark graphs for testing community detection algorithms. Physical Review E, 78(4):046110, 2008
Since you want to have networks with lots of realistic properties,
some of which are difficult to reproduce organically, the LFR bench-
mark takes lots of input parameters. If you want an LFR network,
you have to specify:
• The k̄ average degree of the nodes – you can also set k_min and k_max
as the minimum and maximum degree, respectively;
As you can see, the LFR assumes that both the degree distribution
and the size of your communities distribute like a power law. The
latter is regulated by the β parameter, with one gigantic community
including the majority of nodes and many trivial communities of
size s_min. In Figure 18.7 I show a real world network and its three
largest communities, showing how their sizes rapidly decline. This is
a rather realistic assumption, although there are obvious exceptions.
You can take care of such exceptions by forcing the maximum com-
munity size to a known – and lower – s_max size. You can also set the
minimum community size if you don't want to have too many trivial
communities in your network, using the s_min parameter.
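A hedged sketch of generating one LFR network in networkx – the parameter values follow the library's documented example rather than any specific benchmark from the text, and the planted communities come back as a node attribute:

```python
import networkx as nx

G = nx.LFR_benchmark_graph(n=250, tau1=3, tau2=1.5, mu=0.1,
                           average_degree=5, min_community=20, seed=10)

# Each node stores the "true" community it was planted in.
communities = {frozenset(G.nodes[v]["community"]) for v in G}
print(len(communities), sorted(len(c) for c in communities))
```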
The mixing parameter µ regulates how hard it is to find communi-
ties in the network. If µ = 0 all edges run between nodes part of the
same community, i.e. each community becomes a distinct connected
Kronecker Graphs
The idea of a Kronecker graph originates from the Kronecker product
operation. The Kronecker product is the matrix equivalent of the
outer product of two vectors. The outer product of two vectors u
and v is a |u| × |v| matrix, whose i, j entry is the multiplication of u_i
by v_j. The Kronecker product is the same thing, applied to matrices.
Figure 18.10 shows an example. To calculate A ⊗ B, we're basically
multiplying each entry of A with B²⁰.
²⁰ G Zehfuss. Über eine gewisse Determinante. Zeitschrift für Mathematik und Physik, 3(1858):298–301, 1858
When it comes to generating graphs, the matrix we're multiplying
is the adjacency matrix. We usually multiply it with itself. So we're
calculating A ⊗ A, as I show in Figure 18.10(b). This generates a new
[Figure 18.10: An example of Kronecker product. (a) A matrix A. (b) The operation we perform to obtain A ⊗ A: each entry of A is replaced by that entry times the whole matrix A.]
squared matrix, whose size is the square of the previous size. We can
multiply this new adjacency matrix with our original one once more,
for as many times as we want. We stop when we reach the desired
number of nodes²¹,²².
²¹ Jurij Leskovec, Deepayan Chakrabarti, Jon Kleinberg, and Christos Faloutsos. Realistic, mathematically tractable graph generation and evolution, using kronecker multiplication. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 133–145. Springer, 2005b
²² Jure Leskovec and Christos Faloutsos. Scalable modeling of real graphs using kronecker multiplication. In Proceedings of the 24th International Conference on Machine Learning, pages 497–504. ACM, 2007
Figure 18.11 shows the progression of the Kronecker graph. Fig-
ure 18.11(a) is our seed graph, which we multiply by itself (Figure
18.11(b)) twice (Figure 18.11(c)).
One small adjustment that is customary to make when generating
a Kronecker graph is to fill the diagonal with ones instead of zeros.
If you remember my linear algebra primer, this means we consider
every node to have a self-loop. This is because we want
the Kronecker graph to be a block-diagonal matrix, with lots of
connections around the diagonal. This is required if we want them to
A natural question to ask about the Kronecker product is: why? Well, for starters, Kronecker graphs are
fractals. Personally, I don't need any other reason than that. Look
at Figure 18.11(c): if you tell me it doesn't speak to your heart, then
I question whether you're really human. If you're not an incurable
fractal romantic like me, consider that the deceptively simple process
that generates Kronecker graphs solves all the issues we want a graph
generating process to solve. In some cases, it is even better than LFR.
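The generation process itself fits in a few lines of numpy – a sketch with a hypothetical 3 × 3 seed matrix, with ones on the diagonal as suggested above:

```python
import networkx as nx
import numpy as np

A = np.array([[1, 1, 0],      # a small hypothetical seed adjacency matrix,
              [1, 1, 1],      # with self-loops (ones) on the diagonal
              [0, 1, 1]])

K = A.copy()
for _ in range(2):            # two Kronecker products: 3 -> 9 -> 27 nodes
    K = np.kron(K, A)

G = nx.from_numpy_array(K)
G.remove_edges_from(nx.selfloop_edges(G))  # drop the self-loops if undesired
print(G.number_of_nodes(), G.number_of_edges())
```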
One thing that all models discussed so far have in common is that
they are engineered to have specific properties. These are the prop-
erties we think are salient in real world networks: broad degree
distributions, community structures, etc. But what if we are wrong?
Maybe some of these properties are not the most relevant things
about a network we want to model. Moreover: what if there are other
18.6 Exercises
A = \begin{pmatrix} 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 \\ 0 & 0 & 1 & 1 \end{pmatrix}
19
Evaluating Statistical Significance
[Figure 19.2: The edge swap procedure and the resulting histograms over the null models (y axis: # Null Models).]
Pr(A = A') = \frac{1}{B} \exp\left( \sum_{u,v} \beta_{u,v} A'_{u,v} \right).

Pr(A = A') = \frac{1}{B} \exp\left( \sum_{u,v} p A'_{u,v} \right).

Pr(A = A') = \frac{1}{B} \exp\left( p |E'| + p_1 R(A') \right).
Here R(A') is the number of reciprocated ties. You can make
edges dependent on node attributes as researchers do in p2 models⁸,⁹.
⁸ Emmanuel Lazega and Marijtje Van Duijn. Position in formal structure, personal characteristics and choices of advisors in a law firm: A logistic regression model for dyadic network data. Social Networks, 19(4):375–397, 1997
⁹ Marijtje AJ Van Duijn, Tom AB Snijders, and Bonne JH Zijlstra. p2: a random effects model with covariates for directed graphs. Statistica Neerlandica, 58(2):234–254, 2004
Finally, you can also plug higher-order structures in the model, for
instance:

Pr(A = A') = \frac{1}{B} \exp\left( p |E'| + \tau T(A') \right),

where – under the homogeneity assumption – T(A') is the number
of triangles in A'. This way, you can also control the transitivity of
the graph.
These models can be very difficult to solve analytically for all
but the simplest networks. Modern techniques rely on Monte Carlo
maximum likelihood estimation¹⁰,¹¹. We don't need to go too much
into details, but these work similarly to any Markov chain Monte
Carlo method¹². However, if your network is dense, your estimation
might need to take an exponentially large number of samples to
estimate your βs¹³. There are ways to get around this problem by
expanding the ERG model¹⁴, but by now we're already way over my
head and I don't think I can characterize this fairly.
¹⁰ Tom AB Snijders. Markov chain monte carlo estimation of exponential random graph models. Journal of Social Structure, 3(2):1–40, 2002
¹¹ Tom AB Snijders, Philippa E Pattison, Garry L Robins, and Mark S Handcock. New specifications for exponential random graph models. Sociological Methodology, 36(1):99–153, 2006
¹² Walter R Gilks, Sylvia Richardson, and David Spiegelhalter. Markov chain Monte Carlo in practice. Chapman and Hall/CRC, 1995
¹³ Shankar Bhamidi, Guy Bresler, and Allan Sly. Mixing time of exponential random graphs. In 2008 49th Annual IEEE Symposium on Foundations of Computer Science, pages 803–812. IEEE, 2008
¹⁴ Arun G Chandrasekhar and Matthew O Jackson. Tractable and consistent random graph models. Technical report, National Bureau of Economic Research, 2014
What does all of this look like in practice? The result of your model
might look like something from Figure 19.5. Here we decide to have
four parameters: a single edge (this is always going to be present
in any ERGM), a chain of three nodes, a star of four nodes, and
a triangle. Each motif has a likelihood parameter: the higher the
parameter, the more likely the pattern. Negative values mean that the
pattern is less likely than chance to appear.
The negative value for simple edges means that the network is
sparse: two nodes are unlikely to be connected. The positive value
for the triangle means that triangles tend to close: when you have a
triad, it is more likely than chance to have the third edge. The other
two configurations are not significantly different from zero (you can't
[Figure 19.5: On the left, the estimated β parameters from the observation for four patterns (−4.27, 1.09, −0.67, 1.32), with positive values indicating a "more than chance" occurrence of the pattern, and negative values a "less than chance" one. On the right, a likely network extracted from the set of ERGMs with the given parameters.]
tell because I omitted the standard errors, but trust me on that). Thus
we should not emphasize their interpretation too much.
On the right side of the figure you can see a potential network
that is very likely to be extracted by this ERGM. In fact, I cheated a
bit, because that is the network on which I fitted the model. It is the
famous graph mapping the business relationship between Florentine
¹⁵ John F Padgett and Christopher K Ansell. Robust action and the rise of the medici, 1400-1434. American Journal of Sociology, 98(6):1259–1319, 1993
¹⁶ Skyler J Cranmer and Bruce A Desmarais. Inferential network analysis with exponential random graph models. Political Analysis, 19(1):66–86, 2011
19.4 Exercises
Spreading Processes
20
Epidemics
So far, we’ve seen some dynamics you can embed in your network. In
Section 7.4 I showed you how to model graphs whose edges might
appear and disappear, while in the previous book part we’ve seen
models of network growth: nodes arrive steadily into the network
and we determine rules to connect them such as in the preferential
attachment model. This part deals with another type of dynamics on
networks. Here, edges don’t change, but nodes can transition into
different states.
The easiest metaphor to understand these processes is disease.
Normally, people are healthy: their bodies are in a homeostatic state
and they go about their merry day. However, they also constantly
enter into contact with pathogens. Most of the time, their immune
systems are competent enough to fend off the invasion. Sometimes
this does not happen. The person transitions into a different state:
they now are sick. Sickness might be permanent, but also temporary.
People can recover from most diseases. In some cases, recovery is
permanent, in others it isn’t.
These are all different states in which any individual might find
themselves at any given time. Like individuals, nodes too can change
state as time goes on. This book part will teach you the most popular
models we have to study these state transitions. In this chapter
we look at three models we defined to study the progression of
diseases through social networks¹. Note that such models can easily
represent other forms of contagion, for instance the spread of viruses
in computer and mobile networks².
¹ Romualdo Pastor-Satorras, Claudio Castellano, Piet Van Mieghem, and Alessandro Vespignani. Epidemic processes in complex networks. Reviews of Modern Physics, 87(3):925, 2015
² Pu Wang, Marta C González, César A Hidalgo, and Albert-László Barabási. Understanding the spreading patterns of mobile phone viruses. Science, 324(5930):1071–1076, 2009
We're going to complicate these models in Chapter 21, to see how
different criteria for passing the diseases between friends affect the
final results. Then, in Chapter 22, we'll see how the same model
can be adapted to describe other network events, such as infrastruc-
ture failure and word-of-mouth systems to aid a viral marketing
campaign.
Another complication is the one introduced by simplicial spreading.
20.1 SI
The SI model has only two states – Susceptible (S) and Infected (I) – and a single possible transition, from S to I.
This is simply determined by the current fraction of the population in
the infected state. Once the susceptible individual meets an infected,
there is a probability that they will transition into the I state too. This
probability is a parameter of the model, traditionally indicated by β.
If β = 1, any contact with an infected will transmit the disease, while if β = 0.2, you have a 20% chance to contract the disease.
Once you have β you can solve the SI Model. Usually, the way
it’s done is assuming that at the first time step you have a set of one
or more patient zeros scattered randomly into society. Then, you
track the ratio of people in the I status as time goes on, which is
| I |/(| I | + |S|). This usually generates a plot like the one in Figure 20.2.
Figure 20.2: The solution of the SI Model for different β values. The plot reports on the y axis the share of infected individuals (i = |I|/(|I| + |S|)) at a given time step (x axis).
The only thing that changes with β – as you can see from Figure 20.2 – is the speed of the system: when the exponential growth of I starts to kick in and when S gets emptied out.
We can re-tell the story I’ve just exposed in mathematical form.
In our SI model, the probability that an infected individual meets a
susceptible one is simply the number of susceptible individuals over
the total population, because of the homogenous mixing hypothesis:
|S|/|V| (remember |V| is our number of nodes). There are |I| infected individuals, each with $\bar{k}$ meetings (the average degree). Thus the total number of meetings is $\bar{k}\frac{|I||S|}{|V|}$. Since each meeting has a probability β of passing the disease, at each time step there are $\beta\bar{k}\frac{|I||S|}{|V|}$ new infected people in I.
We can simplify the equation a bit, because | I |/|V | and |S|/|V | are
related. They sum to one, since S and I are the only possible states
in which you can have a node. So, if we say i = | I |/|V |, that is, the
fraction of nodes in I, then |S|/|V| = 1 − i. So our formula becomes: $i_{t+1} = \beta\bar{k} i_t (1 - i_t)$,^7 where t is the current time step. If we integrate over time, we can derive the fraction of infected nodes depending solely on the time step^8:

$$i = \frac{i_0 e^{\beta\bar{k}t}}{1 - i_0 + i_0 e^{\beta\bar{k}t}}.$$

This is the mathematical solution to the SI model with homogenous mixing, generating the plot in Figure 20.2. You can see why you have an initial exponential growth at the beginning and a flat growth at the end. If $i_0 \sim 0$, then the denominator is 1 and the numerator is dominated by the $e^{\beta\bar{k}t}$ factor: exponential growth (very slow at the beginning because multiplied with the small $i_0$). When $i_0 e^{\beta\bar{k}t}$ becomes large, both the denominator and the numerator are dominated by that term, which means that, in the end, $i \sim 1$, so there's no more growth.
7 Note that this is a differential equation, so you need to integrate it to actually find the share of infected nodes at t + 1. Also, I'm only including the addition to $i_{t+1}$, not its full composition. So, pedantically, the correct formula should be $i_{t+1} = i_t + \beta\bar{k} i_t (1 - i_t)$, but that would make the discussion harder to follow – and it would not change the results we are interested in here. This warning applies to all formulas with the time subscript.
8 Albert-László Barabási et al. Network Science. Cambridge University Press, 2016.
Why did we go to the trouble of all this math? Because, at this
point, we have to tear down the homogenous mixing hypothesis. The
formulas will allow us to see the difference better.
Homogenous mixing is based on the assumption that the more
people are infected, the more likely you’re going to be infected. In
practice, it assumes everybody is the same. In homogenous mixing,
the global social network is a lattice: a regular grid where each node
is connected only to its immediate neighbors. Figure 20.3 shows an
example of square lattice (Section 6.4): each node connects regularly
to four spatial neighbors. On a lattice, the infection spreads like water filling a surface^9.
9 This is a useful mental image: https://upload.wikimedia.org/wikipedia/commons/a/a6/SIR_model_simulated_using_python.gif
We know that real networks are not neat regular lattices. The degree is distributed unevenly, with hubs having thousands of connections – see Section 9.3. When the infection hits such a hub, it will accelerate through the network. In fact, it is extremely easy to infect a hub early on. Hubs have more connections, thus they are more likely to be connected to one of your patient zeros. Those same connections make them super-spreaders: once infected, the hub will allow the disease to reach the rest of the network quickly. In fact, when searching information in a peer-to-peer network, your best guess is always to ask your neighbor with the highest degree^10.
10 Lada A Adamic, Rajan M Lukose, Amit R Puniyani, and Bernardo A Huberman. Search in power-law networks. Physical Review E, 64(4):046135, 2001.
To treat the SI model mathematically you have to first group nodes by their degree. Rather than solving for i – the fraction of infected
nodes – you solve for $i_k$: the fraction of infected nodes of degree k. The formula for a network-aware SI model is similar to the one we saw for the vanilla SI model: $i_{k,t+1} = \beta k f_k (1 - i_{k,t})$.
The two differences are that: (i) we replace the average degree k̄
with the actual node’s degree k, and (ii) rather than using ik,t we use
f k – a function of the degree k. This is because real world networks
typically have degree correlations: if you have a degree k the degree
of your neighbors is usually not random (see Section 31.1 for more).
If it were random, then we could simply use ik,t , because the number
of infected individuals around you should be proportional to the
current infection rate. But it isn't: in the presence of degree correlations, if you have k neighbors then there exists a function $f_k$ able to predict how many neighbors they have. Thus the likelihood that a node of degree k has infected neighbors is specific to its degree, and not (only) dependent on $i_{k,t}$.
If you do the proper derivations^11, you'll discover that in a Gn,p network the dynamics have the same functional form as the ones of the homogeneous mixing, as Figure 20.4 shows. In Gn,p the exponential rises faster at the beginning – due to the few outliers with high degree – and tails off slower at the end – due to the outliers with low degree – but the rising and falling of the infection rates is still an exponential process.
11 Romualdo Pastor-Satorras and Alessandro Vespignani. Epidemic dynamics and endemic states in complex networks. Physical Review E, 63(6):066117, 2001a.
Figure 20.4: The solution of the SI Model for different β values in homogeneous mixing (reds) and Gn,p graphs (blues). The plot reports on the y axis the share of infected individuals (i = |I|/(|I| + |S|)) at a given time step (x axis).
Figure 20.5: The solution of the SI Model for different β values on networks with a power law degree distribution (the legend shows α = 2 and α = 3).
With a sufficiently skewed power law degree distribution, instead, the infection doesn't really have an exponential warm up any more (in red in Figure 20.5). You know
that, no matter where you started, you’re going to hit the largest hub
of the network at the second time step t = 2, because it is connected
to practically every node. And, since it is connected to practically
every node, at t = 3 you’ll have almost the entire network infected.
In fact, I ran the simulations from Figure 20.5 on imperfect and finite power law models. Theoretically, if you had a perfect infinite power law network, infection would be instantaneous for any non-zero value of β. Meaning that, no matter how weakly infectious a disease is, with α = 2 it will infect the entire network almost immediately. And things get even more complicated when you add to the mix the fact that networks evolve over time^12. Scary thought, isn't it?
12 Eugenio Valdano, Luca Ferreri, Chiara Poletto, and Vittoria Colizza. Analytical computation of the epidemic threshold on temporal networks. Physical Review X, 5(2):021005, 2015.

20.2 SIS
Just like in the SI model, also in the SIS model nodes can only either be Susceptible or Infected^13. However, the SIS model adds a transition. Where in SI you could only get infected without possibility of recovery (S → I), in SIS you can heal (I → S).
13 Herbert W Hethcote. Three basic epidemiological models. In Applied Mathematical Ecology, pages 119-144. Springer, 1989.
Thus the SIS model requires a new parameter. The first one,
shared with SI, is β: the probability that you will contract the disease
after meeting an infected individual. Once you’re infected, you also
have a recovery rate: µ. µ is the probability that you will transition
from I to S at each time step. High values of µ mean that recovering
from the disease is fast and easy. Note that recovery puts you back
to the Susceptible state, thus you can catch the disease again in the
future.
Figure 20.6 shows the schema fully defining the model. In practice,
SIS models disease with recovery and relapse. An example would
be the general umbrella of the flu family. Once you heal from a
particular strain of the flu you’re unlikely to fall ill again under the
same strain. However, you can easily catch a similar strain, thus
cycling each year between the S and I states.
The presence of µ changes the outcome of the model. SI models
always reach full saturation: eventually, every node will end up in
status I. For SIS models that is not true, because a certain fraction
of nodes – µ – heal at each time step. The interplay between the
recovery rate µ, the infection rate β, and the average degree k̄ deter-
mines the asymptotic size of I: the share of infected nodes as time
approaches infinity (t → ∞). To see how, let’s look at the math again.
Figure 20.7: The typical evolution of an endemic SIS model: the equilibrium state is the one in which a constant fraction i < 1 contracted the disease (where exactly it settles depends on µ and β). The rate at which infected people recover and the infection rate are perfectly balanced, as all things should be.
In a power law network, as the network grows, $\bar{k}$ stays roughly constant and $\overline{k^2}$ grows relative to it, so the critical threshold $\bar{k}/\overline{k^2}$ tends to zero. If we say that you have an endemic state if $\lambda > \bar{k}/\overline{k^2}$ and $\bar{k}/\overline{k^2} = 0$, then any disease, no matter β and µ, will be endemic in a network with a power law degree distribution. Oops.
Figure 20.9: The solution of the SIS Model for λ = β/µ. As λ grows (x axis), I show the share of infected individuals i at the endemic state (t → ∞), for a power law network and a random network.
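To see the endemic state appear, you can run the same kind of simulation with recovery. This is a rough sketch (Python with networkx; all parameter values and the λ grid are illustrative assumptions): for each λ = β/µ it reports the infected share after many steps, which should stay near zero below the threshold and settle on a constant fraction above it, with the power law (Barabási-Albert) graph becoming endemic earlier than the Gn,p one.

```python
import random
import networkx as nx

def sis_share(G, beta, mu, steps=300, seeds=10):
    """Discrete-time SIS process; returns the infected share at the last step."""
    infected = set(random.sample(list(G.nodes()), seeds))
    for _ in range(steps):
        new_inf, healed = set(), set()
        for u in infected:
            for v in G.neighbors(u):
                if v not in infected and random.random() < beta:
                    new_inf.add(v)
            if random.random() < mu:
                healed.add(u)
        infected = (infected | new_inf) - healed
    return len(infected) / G.number_of_nodes()

n, k_avg, mu = 3000, 6, 0.1
gnp = nx.gnp_random_graph(n, k_avg / n)
ba = nx.barabasi_albert_graph(n, k_avg // 2)
for lam in (0.05, 0.2, 0.5, 1.0):   # lambda = beta / mu
    print(lam, round(sis_share(gnp, lam * mu, mu), 2), round(sis_share(ba, lam * mu, mu), 2))
```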
20.3 SIR
The SIR model adds a third state, R (Removed), for nodes that had the disease and healed. Figure 20.11 shows such a typical evolu-
tion. At the beginning, everybody is susceptible. Then, people start
getting infected, so I grows. R cannot start growing immediately,
as I is still too small for the recovery parameter µ to significantly
contribute to R size. As I grows, though, there are enough infected
individuals that start being removed. Eventually every I individual
transitions to R.
Figure 20.11: The typical evolution of an SIR model: after an initial exponential growth of the infected, the removed ratio takes over until it occupies the entire network.
least an I neighbor.
4. SIS models are like SI models, but nodes can transition back to S
state with a stochastic probability µ at each time step.
20.5 Exercises
4. Extend your SI model to an SIR. With β = 0.2, run the model for
400 steps with µ values of 0.01, 0.02, and 0.04 and plot the share of
nodes in the Removed state for both the networks used in Q1 and
Q2. How quickly does it converge to a full R state network?
21
Complex Contagion
You may or may not have noticed that, in the previous chapter, all
our models of epidemic contagion shared an assumption. Every time
a susceptible individual comes in contact with an infected individual,
they have a chance to become infected as well. If that doesn’t happen,
the healthy person is still in the susceptible pool. The next time step
represents a new occasion for them to contract the disease. And so
on, ad infinitum.
Without that assumption, the models wouldn’t be mathematically
tractable. For instance, if each node gets only one chance to be in-
fected, you can easily see how it is not given that a SI model would
eventually infect the entire network. In fact, it takes any β < 1 to
make that impossible. The first time you fail to infect somebody you
won’t get the chance to try again.
SI, SIS, and SIR models are useful and generated tons of great
insights. But this limitation allows them to model only rather specific
types of outbreak. We usually consider them models of simple
contagion. There are fundamentally two ways to make such models
more complex and realistic. They involve changing two things: (i) the
triggering mechanism, which is the condition regulating the S → I
transition, and (ii) the assumption that each individual gets infinite
chances to infect their neighbors.
We deal with the triggering mechanisms in Section 21.1 and
infection chances in Section 21.2. We also explore the possibilities of
interfering with the outbreak in Section 21.3, dedicated to epidemic
interventions.
21.1 Triggers
The three triggering mechanisms: classical, threshold, and cascade.
rules and they are just different phenomena. This need not be the case. The separation is mostly done out of a pedagogical need. In
fact, there is a universal model of spreading dynamics concentrating on hubs^6. We don't need to delve deep into the details of how this model works, but it mostly hinges on a parameter: ϕ. ϕ determines the interplay between the degree of a node and its propensity to take part in the epidemics. If ϕ = 0, the likelihood of contagion of the node is independent of its degree. If ϕ > 0 we are in the threshold scenario: hubs have a stronger impact on the network. With ϕ < 0, as you might expect, the opposite holds. Figure 21.4 shows a vignette of the model.
6 Baruch Barzel and Albert-László Barabási. Universality in network dynamics. Nature Physics, 9(10):673, 2013b.
Sprinkling a bit of economics into the mix, you can relate the threshold or the cascade parameter with the utility an actor v gets from playing along or not. Each individual calculates their cost and benefit from undertaking or not undertaking an action. There is a cost in adopting a behavior before it gets popular, and in not doing so after it did^7. Being aware of these effects makes for very effective strategies to make your own decisions while you're in doubt. You can establish a Schelling point which determines whether or not you're going to undertake an action^8, which effectively means you consciously set your own κ_v. However, this is getting dangerously close to a weird blend of economics, philosophy, and game theory. If you're interested in learning more, you'd be best served by closing this book and looking elsewhere^9.
7 Thomas C Schelling. Hockey helmets, concealed weapons, and daylight saving: A study of binary choices with externalities. Journal of Conflict Resolution, 17(3):381-428, 1973.
8 https://www.lesswrong.com/posts/Kbm6QnJv9dgWsPHQP/schelling-fences-on-slippery-slopes
9 Herbert Gintis. The Bounds of Reason: Game Theory and the Unification of the Behavioral Sciences. Princeton University Press, 2014.
21.2 Limited Infection Chances
you want people to talk about it and convince each other. In practice, you want them to infect themselves with the idea that the product is good^10,11.
10 Pedro Domingos and Matt Richardson. Mining the network value of customers. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 57-66. ACM, 2001.
11 Dashun Wang, Zhen Wen, Hanghang Tong, Ching-Yung Lin, Chaoming Song, and Albert-László Barabási. Information spreading in context. In Proceedings of the 20th international conference on World Wide Web, pages 735-744. ACM, 2011b.
The obvious strategy would be to target hubs, since they have more connections. However, this heavily depends on your triggering model, and hubs come with a disadvantage. First, by being prominent, hubs are targeted by many things, thus they have a very high barrier to attention. Second, they have many connections: if the triggering mechanism requires reinforcement, most of their connections might not get it, thinning out the intervention. A third and final problem might be that you have only one shot at convincing a person. If you fail, it's game over forever. If a hub fails, you might not have a second chance.
Figure 21.5: An example of independent cascade. The node's state is encoded by the color: blue = susceptible, red = infected and contagious, green = infected but not contagious.
influence this probability – that's why you should get to hubs when you're the most sure you're going to convince them. So we can modify that probability as $p_{u,v}(S)$, with S being the set of nodes who already tried to influence v^13,14,15. The process ends when all infected nodes exhausted all their chances of convincing people, so no more moves can happen.
13 David Kempe, Jon Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 137-146. ACM, 2003.
14 David Kempe, Jon Kleinberg, and Éva Tardos. Influential nodes in a diffusion model for social networks. In International Colloquium on Automata, Languages, and Programming, pages 1127-1138. Springer, 2005.
So you get the problem: find the set of cascade initiators $I_0$ such that, when the infection process ends at time t, the share of infected nodes in the network $i_t$ is maximized. Kempe et al. solve the problem with a greedy algorithm. We start from an empty $I_0$. Then we calculate for each node its marginal utility to the cascade. We add the node with the largest utility, meaning the number of potential infected nodes, to $I_0$, and we repeat until we reach the size we can afford to infect. Of course, each node we add to $I_0$ changes the expected utility of each other node, because they might have common friends, thus we cannot simply choose the $|I_0|$ nodes with the largest initial utility.
friends, thus we cannot simply choose the | I0 | nodes with the largest introduce pu,v (S) the cascades are not
independent any more. Specifically, for
initial utility.
the paper I’m citing, we have decreasing
There are many improvements for this algorithm, focused on cascades because, the more people
improving time efficiency, lowering the expected error, and integrat- try, the hardest it is to convince v,
i.e. pu,v (S) < pu,v (S ∪ z). If we did
ing different utility functions. However, things get more interesting the opposite, pu,v (S) > pu,v (S ∪ z),
when you start adding metadata to your network. For instance, Gu- then this model would be practically
equivalent to the threshold model: the
rumine16 is a system that lets you create influence graphs, as I show
more infected neighbors you have, the
in Figure 21.6. You start from a social network (Figure 21.6(a)) and a more likely you’re going to turn.
table of actions (Figure 21.6(b)). You know when a node did what. 16
Amit Goyal, Francesco Bonchi, and
Laks VS Lakshmanan. Learning influ-
You can use the data to infer that node v does action a1 regularly
ence probabilities in social networks. In
after node u performed the same action. In the example, for two Proceedings of the third ACM international
actions a1 and a2 you see node 2 repeating immediately after node conference on Web search and data mining,
pages 241–250. ACM, 2010
1. Since these two nodes are connected, maybe node 1 is influencing
node 2. You can use that to infer pu,v = 0.66 (Figure 21.6(c)) – or, if
you’re really gallant, to infer pu,v (S) by looking at all neighbors of v
performing a1 before it.
Note that node 6 performed the same action at the same time as
node 3. Node 6 could only be influenced by node 2. For node 3 we
prefer inferring that node 1 did it, because we know that it influenced
node 2 too, so that’s the most parsimonious hypothesis. The size in
However, the fact that the first cascade happened faster is enough for
us to infer that it’s much more likely to end up being much larger
than the second, slower, one. I’m going to explore more in depth this
idea about memes spreading when talking about classical results in
network analysis in Chapter 52.
You can further complicate models by having competing ideas spreading into the network. There are some people who are complete enthusiasts about iPhones, while others really hate them. The love/hate opinions are both competing to spread through the network. You can see them, for instance, as a physical heating/cooling process which will eventually make nodes converge to a given temperature^21. The classic survey of viral marketing applications of network analysis^22 is a good starting point for diving deeper into the topics only skimmed in this section.
21 Hao Ma, Haixuan Yang, Michael R Lyu, and Irwin King. Mining social networks using heat diffusion processes for marketing candidates selection. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 233-242. ACM, 2008.
22 Jure Leskovec, Lada A Adamic, and Bernardo A Huberman. The dynamics of viral marketing. ACM Transactions on the Web (TWEB), 1(1):5, 2007a.

21.3 Interventions

Once we have a disease spreading through a social network, we might be interested in using our knowledge to prevent people from becoming sick. In practice, if this were a SIR model, we want to flip some people directly from the S to the R state, without passing by I. This is equivalent to vaccinating them and, if done properly, would stop the epidemic in its tracks. You can try an online game with this premise and see how much of a network you can save from an evil disease^23.
23 https://github.com/digitalepidemiologylab/VaxGame
If we were to vaccinate a randomly sampled node, we would have only one chance out of nine to find the hub, given that the example has nine nodes. However, the probability of vaccinating the hub with our strategy – vaccinating a random neighbor of a randomly sampled node – is almost three times as high. This is related to a curious network effect on hubs, known as the "Friendship Paradox", which we'll investigate further in Section 31.2.
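You can verify the effect yourself with a few lines of Python (networkx assumed; the graph and the number of trials are illustrative). The sketch estimates the probability of hitting the highest-degree node when sampling a node uniformly at random versus when sampling a random neighbor of a random node, which is the strategy discussed above.

```python
import random
import networkx as nx

def hub_hit_rates(G, trials=20000):
    """Chance of picking the top hub directly vs. as a random neighbor of a random node."""
    nodes = list(G.nodes())
    hub = max(nodes, key=G.degree)
    direct = sum(random.choice(nodes) == hub for _ in range(trials)) / trials
    neighbor = sum(
        random.choice(list(G.neighbors(random.choice(nodes)))) == hub
        for _ in range(trials)
    ) / trials
    return direct, neighbor

G = nx.barabasi_albert_graph(1000, 2)
print(hub_hit_rates(G))   # the second number should be noticeably larger: the friendship paradox
```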
Of course, this strategy makes a number of assumptions that might not hold in practice. For instance, we only consider a simple SIR model, without looking at the possibility of complex contagion. Luckily, there is a wealth of research relaxing this assumption and proposing ad hoc immunization strategies that can work in realistic scenarios^25. One of the most historically important approaches in this category is Netshield^26.
25 Chen Chen, Hanghang Tong, B Aditya Prakash, Charalampos E Tsourakakis, Tina Eliassi-Rad, Christos Faloutsos, and Duen Horng Chau. Node immunization on large graphs: Theory and algorithms. IEEE Transactions on Knowledge and Data Engineering, 28(1):113-126, 2015.
26 Hanghang Tong, B Aditya Prakash, Charalampos Tsourakakis, Tina Eliassi-Rad, Christos Faloutsos, and Duen Horng Chau. On the vulnerability of large graphs. In 2010 IEEE International Conference on Data Mining, pages 1091-1096. IEEE, 2010.
How do we know if we did a good job? How can we evaluate the impact of an intervention? There are two things we want to look at. First, we look at the size of the final infected set: we simply subtract the predicted infected share with immunization from the one without immunization. The higher the difference, the better. Figure 21.9 gives you a sense of this. An SI model without immunization reaches saturation when all nodes are infected. A smart immunization strategy can make sure that the outbreak stops at a share lower than 100%.
A second criterion might be just delaying the inevitable. Once
immunized, the nodes can revert to the S state, and therefore to I,
after a certain amount of time. This time can be used to develop
a real vaccine, or might be a feature in itself, preventing too many people from transitioning to I at the same time. Figure 21.10
provides an example. In this case, we might want to either calculate
the time t at which the system reaches saturation, or compute the
area between the two curves as a more precise sense of the delay we
imposed.
Figure 21.9: The first criterion of immunization success: the share of infected nodes at the end of the outbreak is lower than 100% in a SI model.
Figure 21.10: The second criterion of immunization success: temporary immunity can delay propagation.
We can combine the two criteria at will. By immunizing nodes, we make the disease unable to reach saturation at 100% infection AND we delay its spread in the network. Thus the two scenarios are not mutually exclusive.
Obviously, here I only assumed the perspective of limiting the
outbreak of a disease. If you’re in the viral marketing case you
can invert the perspective: your intervention wants to favor the spread of the idea in the social network. In this case, the second scenario makes more sense: even if the idea was bound to reach everyone eventually, if it does so faster it can have great repercussions. Think about the scenario of condom use to prevent HIV infections. You want to convince as many people as fast as you can, even if eventually your message was going to reach everybody anyway.

21.4 Controllability

A related problem is the classical scenario of the controllability of complex networks^27,28,29. Here the task is slightly different: nodes can change their state freely and there can be an arbitrary number of states in the network. What we want to ensure is that all – or most – nodes in the network end up in the state we desire. To do so, we
27 Yang-Yu Liu, Jean-Jacques Slotine, and Albert-László Barabási. Controllability of complex networks. Nature, 473(7346):167, 2011.
28 Jianxi Gao, Yang-Yu Liu, Raissa M D'Souza, and Albert-László Barabási. Target control of complex networks. Nature Communications, 5(1):1-8, 2014.
29 Gang Yan, Georgios Tsekenis, Baruch Barzel, Jean-Jacques Slotine, Yang-Yu Liu, and Albert-László Barabási. Spectrum of controlling and observing complex networks. Nature Physics, 11(9):779-786, 2015.
4. You can estimate which type of infection model a real world out-
break follows by estimating the universality class of the spreading,
via its parameter ϕ: ϕ > 0 is similar to a threshold model (positive
correlation between degree and chance of infection), ϕ < 0 is sim-
ilar to a cascade model (negative correlation between degree and
chance of infection).
21.6 Exercises
Figure 22.2: The probability of being part of the largest connected component as a function of the number of failing nodes in a random Gn,p graph.
something like this. It was back in Figure 16.5(b), when we were talk-
ing about the probability of a node being part of the GCC in a Gn,p
model. In that case, the function on the x-axis was the probability
p of establishing an edge between two nodes. In fact, the two are
practically equivalent: if you have a Gn,p graph with failures it is as if
you’re manipulating n and p.
What Figure 22.2 says is that a Gn,p network will withstand small
failures: a few nodes in R will not break the network apart. However,
the failure will start to become serious very quickly, until we reach a
critical value of | R| beyond which the GCC disappears and the net-
work effectively breaks down. Just like the appearance of a GCC for increasing p in a Gn,p model is not a gradual process, neither is its disappearance under random failures. At some point there is a phase transition, from having to not having a GCC.
Ah – you say – but we’re not amateurs at this. Who would engineer
a random power grid network? For sure it won’t be a Gn,p graph. Good
point. In fact that’s true: the power grid’s degree distribution is
skewed. For the sake of the argument – and the simplicity of the
math – let’s check the resilience to random failures of a network with
a power law degree distribution.
Good news everybody! Power law random networks are more
resilient than Gn,p networks to random failures. The typical signature
of a power law network under random node disappearances looks
something like Figure 22.3. In the figure you see no trace of the phase
transition. The critical value under which the GCC disappears is
much higher than in the Gn,p case. Of course the size of the largest
connected component goes down, because you’re removing nodes
from the network. However, the nodes that remain in the network
still tend to be able to communicate to each other, even for very high
| R |.
Figure 22.3: The probability of being part of the largest connected component as a function of the number of failing nodes in a network with a skewed degree distribution.
Why would that be the case? The reason is always the power law degree distribution. If you remember Section 9.3, having a heavy-tailed degree distribution means to have very few gigantic hubs and
a vast majority of nodes of low degree. When you pick a node at
random and you make it fail, you’re overwhelmingly more likely to
pick one of the peripheral low degree ones. Thus its impact on the
network connectivity is low. It is extremely unlikely to pick the hub,
which would be catastrophic for the network’s connectivity.
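The whole discussion of this section fits in a short experiment. The sketch below (Python with networkx; sizes and removal fractions are illustrative choices of mine) removes a growing share of nodes, either uniformly at random or from the highest degree down, and reports the share of nodes left in the largest connected component. The degree-ordered removal anticipates the targeted attacks of Section 22.2.

```python
import random
import networkx as nx

def lcc_share(G, removal_order, fractions=(0.0, 0.1, 0.3, 0.5, 0.7)):
    """Share of all nodes left in the largest connected component after removals."""
    n = G.number_of_nodes()
    out = []
    for f in fractions:
        H = G.copy()
        H.remove_nodes_from(removal_order[: int(f * n)])
        out.append(round(len(max(nx.connected_components(H), key=len)) / n, 2))
    return out

n, k_avg = 2000, 4
for name, G in (("Gn,p", nx.gnp_random_graph(n, k_avg / n)),
                ("power law (BA)", nx.barabasi_albert_graph(n, k_avg // 2))):
    shuffled = list(G.nodes())
    random.shuffle(shuffled)                                   # random failures
    by_degree = sorted(G.nodes(), key=G.degree, reverse=True)  # targeted attacks
    print(name, "random:", lcc_share(G, shuffled), "targeted:", lcc_share(G, by_degree))
```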
Since, by now, you must be a ninja when it comes to predicting the ef-
fect of different degree exponents on the properties of a network with
a power law degree distribution, you might have figured out what’s
next. The exponent α is related to the robustness of the network to
random failures. An α = 2, remember, means that there are fewer
hubs and their degree is higher. If α > 3, the hubs are more common
and less extreme.
Figure 22.4: The probability of being part of the largest connected component as a function of the number of failing nodes in a network for different α exponents (here α = 2 and α = 4) of its power law degree distribution.
Many network properties can be estimated knowing only the degree exponent and relatively simple math. In this part we already saw two: robustness to random failures (here) and outbreak size and speed in SI and SIS models (in Chapter 20).
By the way, so far we’ve been looking at random node failures,
i.e. a generator blowing up in the power grid. Edge failures can be
equally common: think about road blocks. However, the underlying
math is rather similar and the functions describing the failures are not so different than the ones I've been showing you so far^5,6. For this reason we keep looking at node failures.
5 Duncan S Callaway, Mark EJ Newman, Steven H Strogatz, and Duncan J Watts. Network robustness and fragility: Percolation on random graphs. Physical Review Letters, 85(25):5468, 2000.
6 Reuven Cohen, Keren Erez, Daniel Ben-Avraham, and Shlomo Havlin. Resilience of the internet to random breakdowns. Physical Review Letters, 85(21):4626, 2000.

22.2 Targeted Attacks

So far we've assumed the world is a nice place and, when things
break down, they do so randomly. We suspect no foul play. But what
if there was foul play? What if we’re not observing random failures,
but a deliberate attack from a hostile force? In such a scenario, an
attacker would not target nodes at random. They would go after
the nodes allowing them to maximize the amount of damage while
minimizing the effort required.
This translates into prioritizing attacks to the nodes with the
highest degree. Taking down the node with most connections is
guaranteed to cause the maximum possible amount of damage. What
would happen to our network structure?
Figure 22.5: The probability of being part of the largest connected component as a function of the number of failing nodes in a Gn,p network, for random (blue) and targeted (red) failures.
Figure 22.6: The probability of being part of the largest connected component as a function of the number of failing nodes in a degree skewed network, for random (blue) and targeted (red) failures.
Figure 22.7: The probability of being part of the largest connected component as a function of the number of failing nodes in a degree skewed network, for different α and k_min combinations.
the capacity is the maximum amount of cars that can pass before
congestion happens.
At time t = 1 we shut down a node in the network. Maybe the
traffic light failed and so no one can pass through until we repair
it. This means that the node transitions to state I. People still need
to do their errands, so we have to redistribute the load of cars that
wanted to pass through that intersection through alternative routes:
the neighbors of that node. However, that means that their load
will increase. If the new load exceeds the capacity of the node, also
this node shuts down due to congestion. So its load has also to be
redistributed to its neighbors and so on and so forth.
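A bare-bones version of this load redistribution model looks like the sketch below (Python with networkx; the toy graph, loads, and capacities are made-up values, not the ones of Figure 22.8). Each failed node splits its load evenly among its surviving neighbors, and any neighbor pushed above capacity fails in turn.

```python
import networkx as nx

def load_cascade(G, load, capacity, first_failure):
    """Propagate a failure by redistributing load to surviving neighbors."""
    failed = {first_failure}
    queue = [first_failure]
    while queue:
        u = queue.pop()
        alive = [v for v in G.neighbors(u) if v not in failed]
        if not alive:
            continue
        share = load[u] / len(alive)           # split the failed node's load evenly
        for v in alive:
            load[v] += share
            if load[v] > capacity[v]:
                failed.add(v)
                queue.append(v)
    return failed

G = nx.path_graph(5)                            # toy example
load = {v: 4 for v in G}
capacity = {v: 5 for v in G}
print(load_cascade(G, load, capacity, first_failure=2))   # the overload sweeps the whole path
```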
Figure 22.8 shows an example of failure propagation with this
load-capacity feature. You can see that the network was built with
some slack in mind: its normal total load is 37 – the sum of all loads
of all nodes – for a maximum capacity of 90 – the sum of all nodes’
capacities. Yet, shutting down the top node whose load was only 6
and redistributing the loads causes a cascade that, eventually, brings down the entire network.
Using this perspective has its own advantages. It makes the failure
propagation model more amenable to analysis. The final size of the
failure cascade depends on the average degree of nodes in this tree
k̄. The critical value here is k̄ = 1. If, on average, the failure of a node
generates another node failure – or more – the cascade will propagate
indefinitely, until all nodes in the network will fail. If, instead, k̄ < 1,
the failure will die out, often rather quickly.
It’s easy to see why if you have the mental picture of a domino
snake: each domino falling will cause the fall of another domino,
until there’s nothing standing. If, however, there is as much as a
single gap in this chain, the rest of the system will be unaffected.
Quick show of hands: how many of you expect the size of a failure
cascade to be a power law? Good, good: by now you learned that
every goddamn thing in this book distributes broadly. The exponent
of the cascade size is also related to the α exponent of your degree
distribution. With a power law degree exponent α > 3, networks behave like Gn,p graphs, but for α < 3 the cascade size will distribute with exponent α/(α − 1).
Figure 22.12: The relationship between the degree exponent α of coupled power law networks and the fraction of nodes in R state (x axis) needed to destroy the GCC (shades of red). In blue, the equivalent plot for coupled random Gn,p graphs.
graphs, contrarily to what was the case before.
Figure 22.12 shows a schema of fragility for different coupled network topologies. Moving from random failures to targeted attacks still makes interdependency bad: failures will spread more easily than in isolated networks^15.
Fixing this issue is not easy. First, one needs to estimate the propensity of the network to run the risk of failure. This has been done in two-layer multiplex networks^16. One would think that the best thing to do is to create more connections between nodes, to prevent breaking down in multiple components. However that's not a trivial operation, as these networks are embedded in a real geographical space, where creating new power lines might not be possible. However, there are also theoretical concerns that show how more connections could render the network more fragile, as it would give the cascade more possible pathways to generate a critical failure^17. A better strategy involves so-called "damage diversification": mitigating the impact of the failure of a high degree node^18.
Note that we assumed that power law networks are randomly coupled: hubs in one layer will pick a random node to couple to in the other layer. As a consequence, they're likely to pick a low degree node. Other papers study the effect of degree correlations in interlayer coupling^19: what if hubs in one layer tend to connect to hubs in the other layer? If such correlations were perfect, we'd obtain again the robustness of power law networks to random failures. These correlations are luckily observed in real world systems^20,21, showing how they're not as fragile as one might fear. Phew.
15 Xuqing Huang, Jianxi Gao, Sergey V Buldyrev, Shlomo Havlin, and H Eugene Stanley. Robustness of interdependent networks under targeted attack. Physical Review E, 83(6):065101, 2011b.
16 Rebekka Burkholz, Matt V Leduc, Antonios Garas, and Frank Schweitzer. Systemic risk in multiplex networks with asymmetric coupling and threshold feedback. Physica D: Nonlinear Phenomena, 323:64-72, 2016b.
17 Charles D Brummitt, Raissa M D'Souza, and Elizabeth A Leicht. Suppressing cascades of load in interdependent networks. Proceedings of the National Academy of Sciences, 109(12):E680-E689, 2012.
18 Rebekka Burkholz, Antonios Garas, and Frank Schweitzer. How damage diversification can reduce systemic risk. Physical Review E, 93(4):042313, 2016a.
19 Byungjoon Min, Su Do Yi, Kyu-Min Lee, and K-I Goh. Network robustness of multiplex networks with interlayer degree correlations. Physical Review E, 89(4):042811, 2014.
20 Roni Parshani, Celine Rozenblat, Daniele Ietri, Cesar Ducruet, and Shlomo Havlin. Inter-similarity between coupled networks. EPL (Europhysics Letters), 92(6):68002, 2011.
21 Saulo DS Reis, Yanqing Hu, Andrés Babino, José S Andrade Jr, Santiago Canals, Mariano Sigman, and Hernán A Makse. Avoiding catastrophic failure in correlated networks of networks. Nature Physics, 10(10):762, 2014.
22.5 Summary
3. When one node fails, all its load needs to be redistributed to non-
failing nodes. This can and will make the failure propagate on
the network in a cascade event which might end up bringing the
entire network down.
22.6 Exercises
2. Perform the same operation as the one from the previous exercise,
but for the network at http://www.networkatlas.eu/exercises/
22/2/data.txt. Can you tell which is the network with a power
law degree distribution and which is the Gn,p network?
Link prediction
23
For Simple Graphs
Link prediction is the branch of network analysis that deals with the
prediction of new links in a network. In link prediction, you see the
network as fundamentally dynamic: it can change its connections.
Suppose you’re at a party. You came there with your friends, and
you’re talking to each other, using the old connections. At some
point, you want to go and get a drink so you detach from the group.
On the way, you could meet a new person, and start talking to them.
This creates a new link in the social network. Link prediction wants
to find a theory to predict these events – in this case, that alcohol is
the main cause of new friendships at parties, or so I'm told – as I show in the vignette in Figure 23.1.
other. Our best guess is that they will connect soon. In practice, the
probability of connecting two nodes is directly proportional to their
current degree: $score(u, v) = k_u k_v$, where $k_u$ and $k_v$ are u's and v's degrees, respectively.
23.3 Adamic-Adar
$\sum_{z \in N_u \cap N_v} \frac{1}{k_z}$. The only difference with Adamic-Adar is that the scaling
is assumed to be linear rather than logarithmic. Thus, Resource
Allocation punishes the high-degree common neighbors more heavily
than Adamic-Adar. You can see that the difference between k z and
log k z is practically nil for low values of k z , but balloons when k z is
high.
One could make a more complex version of the Resource Alloca-
tion index by assuming that the bandwidth of each node and of each
link is not fixed. Thus the amount of resources u sends can change,
and the amount of resources that can pass through the (u, v) link can
also be different from the one passing through other edges.
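These local indices are easy to compute directly from the neighborhoods. Here is a compact sketch (Python with networkx; the Zachary karate club graph is just a convenient toy example) that scores every unconnected pair with preferential attachment, Adamic-Adar, and Resource Allocation. Note that networkx also ships generators for these indices (nx.preferential_attachment, nx.adamic_adar_index, nx.resource_allocation_index), so in practice you rarely need to write them yourself.

```python
import math
from itertools import combinations
import networkx as nx

def local_scores(G):
    """Score every unconnected pair with a few local link prediction indices."""
    out = {}
    for u, v in combinations(G.nodes(), 2):
        if G.has_edge(u, v):
            continue
        common = set(G.neighbors(u)) & set(G.neighbors(v))
        out[(u, v)] = {
            "preferential_attachment": G.degree(u) * G.degree(v),
            "adamic_adar": sum(1 / math.log(G.degree(z)) for z in common),
            "resource_allocation": sum(1 / G.degree(z) for z in common),
        }
    return out

G = nx.karate_club_graph()
ranked = sorted(local_scores(G).items(), key=lambda kv: -kv[1]["adamic_adar"])
print(ranked[:3])   # the three most likely new edges according to Adamic-Adar
```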
Figure 23.5: An example of a Hierarchical Random Graph link prediction. The hierarchy fits the observed connections, showing that researchers in the same field are more likely to connect. Then HRG looks at pairs of nodes in the same part of the hierarchy that are not yet connected, and gives them a higher score.
In HRG we’re basically saying that communities matter: it is more
likely for nodes in the same community to connect. Thus we fit the
hierarchy and then we say that the likelihood of nodes to connect is
proportional to the edge density of the group in which they both are.
If the nodes are very related, the group containing both nodes might
332 the atlas for the aspiring network scientist
With the power of these two rules we know we can close the two
open triads we have and add a new neighbor to each node in the
original data. The end result is on the right. We now have more open
triads and could apply the rules again, which would in turn create
more square patterns and so on and so forth. In fact, one could use
GERM not only as a link predictor but also as a graph generator (and
put it in Chapter 17).
Note that I made two simplifications to GERM that the original
papers don’t make. First, in Figure 23.7, I assumed that each rule
applies with the same priority. I ignored its frequency and its confi-
dence. Of course, that would be sub-optimal, so the papers describe a
way to rank each candidate new edge according to the frequency and
confidence of each rule that would predict it.
Second, in all my examples I always assumed that the rules add a
new edge in the next time step and that all edges in the antecedent
are present at the same time. In reality, GERM allows more
complex rules spanning multiple time steps. You could have a rule
saying something like: you have a single edge at time t = 0, you add
a second edge at time t = 1 creating an open triad, and then you close
the triangle at time t = 2. This triad closure rule spans three time
steps, rather than only two.
GERM has a final ace up its sleeve. We can classify new links
coming into a network into three groups: old-old, old-new, and new-
new. We base these groups according to the type of node they attach
to. I show an example in Figure 23.8. An “old-old” link appearing
at time t + 1 connected two nodes that were already present in the
network at time t. These are two “old” nodes. You can expect what
an “old-new” link is: a link connecting an old node with a node
that was not present at time t – a “new” node. New nodes can also
connect to each other in a “new-new” link. If the network represents
paper co-authorships, this would be a new paper published by two or
more individuals who have never published before.
encode the fact that the attribute values 1 and 2 should be considered
“more similar” to each other than 1 and 1, 000. Extensions taking care
of these limitations might be possible, but I’m not aware of one.
This is not a function unique to GERM, though. Many link pre-
diction methods can be extended to take into consideration node
attributes as well. In fact, this is also a key ingredient in some net-
work generating processes. Node attributes are used, for instance,
when modeling exponential random graphs, as we saw in Section
19.2. In this case, differently than GERM, quantitative attributes
represent no issue.
$$score(u, v) = \gamma \frac{\sum_{a \in N_u} \sum_{b \in N_v} score(a, b)}{k_u k_v}.$$
γ is a parameter you can tune. This is surprisingly similar to the hitting time approach. The expected value of a SimRank score is $\gamma^l$, where l is the length of an average random walk from u to v.
Vertex similarity^22. The name of this approach should tip you off regarding its relationship with SimRank. However, it's actually much closer to the Jaccard variant of common neighbor. In fact, the only difference with Jaccard is the denominator. While Jaccard normalizes the number of common neighbors by the total possible number of common neighbors – which is the union of the two neighbor sets – this approach builds an expectation using a random configuration graph as a null model. This is a definition in line with the philosophy of the structural equivalence we saw in Section 15.2.
22 Elizabeth A Leicht, Petter Holme, and Mark EJ Newman. Vertex similarity in networks. Physical Review E, 73(2):026120, 2006.
In practice, score(u, v) = | Nu ∩ Nv |/(k u k v ). This is because two
nodes u and v with k u and k v neighbors are expected to have k u k v
common neighbors (multiplied by a constant derived from the aver-
age degree which would not make any difference as it is the same for
all node pairs in the network).
The same authors in the same paper also make a global variant of
this measure. Their inspiration is the Katz link prediction, where they
again provide a correction for a random expectation in a random
graph with the same degree distribution as G. I won’t provide the
full derivation, which you can find in the paper, but their score is:
$$score(u, v) = 2|E|\lambda_1 D^{-1} \left(I - \frac{\phi A}{\lambda_1}\right)^{-1} D^{-1}.$$
The elements in this formula are the usual suspects: |E| is the number of edges, $\lambda_1$ is the leading eigenvalue of the adjacency matrix A (not the stochastic one, as that would be equal to one), D is the degree matrix, and I is the identity matrix. The only odd thing is 0 < ϕ < 1, which is a parameter you can set at will. This is similar to the parameter of Katz: smaller ϕ gives more weight to shorter paths.
Local and superposed random walks^23. These two methods are a close sibling to the hitting time approach. To determine the similarity between u and v, we place a random walker on u and we calculate the probability it will hit node v. Note that, if we were to do infinite length random walks, this would be the stationary distribution π. This would be bad, as you know that this only depends on the degree of v, not on your starting point u. For this reason, the authors limit the length of the random walk, and also add a vector q determining different starting configurations – namely, giving different sources different weights.
23 Weiping Liu and Linyuan Lü. Link prediction based on local random walk. EPL (Europhysics Letters), 89(5):58007, 2010.
To sum up, the local random walk method determines score(u, v) =
qu πu,v + qv πv,u . The superposed variant works in the same way, with
the difference that the random walker is constantly brought back to
its starting point u. This tends to give higher scores to nodes closer in
the network.
Stochastic block models^24. We saw the stochastic block models (SBM) as a way to generate graphs with community partitions (Section 18.2) – and we will see them again as a method to detect communities (Section 35.1). In fact, any link prediction approach, in a sense, is a
24 Roger Guimerà and Marta Sales-Pardo. Missing and spurious interactions and the reconstruction of complex networks. Proceedings of the National Academy of Sciences, 106(52):22073-22078, 2009.
Probabilistic models^26,27,28. In this subsection I group not a single method, but an entire family of approaches to link prediction. They have their differences, but they share a common core. In probabilistic models, you see the graph as a collection of edges and attributes attached to both nodes and edges. The hypothesis is that the presence of an edge is related to the values of the attributes.
In practice, the hypothesis is that there exists a function taking as input the attribute values of each node and edge. This function models the observed graph. Then, depending on the values of the attributes for nodes u and v, the function will output the probability that the (u, v) edge should appear – and with which attributes.
26 Nir Friedman, Lise Getoor, Daphne Koller, and Avi Pfeffer. Learning probabilistic relational models. In IJCAI, volume 99, pages 1300-1309, 1999.
27 David Heckerman, Chris Meek, and Daphne Koller. Probabilistic entity-relationship models, PRMs, and plate models. Introduction to Statistical Relational Learning, pages 201-238, 2007.
28 Kai Yu, Wei Chu, Shipeng Yu, Volker Tresp, and Zhao Xu. Stochastic relational models for discriminative link prediction. In Advances in Neural Information Processing Systems, pages 1553-1560, 2007.
Mutual information^29. In Section 3.5 I introduced the concept of mutual information: the amount of information one random variable gives you over another. This can be exploited for link prediction. If you remember how it works, you'll remember that MI allows you to calculate the relationship between two non-numerical vectors, which is not really possible using other correlation measures – not without doing some non-trivial bending of the input. In link prediction, this advantage is crucial: you can define your function as $score(u, v) = MI_{uv}$, where the "events" that allow you to calculate $MI_{uv}$ are the common neighbors between nodes u and v.
29 Fei Tan, Yongxiang Xia, and Boyao Zhu. Link prediction in complex networks: a mutual information perspective. PloS One, 9(9):e107056, 2014.
CAR Index^30. In this index you favor pairs of nodes that are part of a local community, i.e. they are embedded in many mutual connections. This is a variant of the idea of common neighbor: each shared connection counts not equally, but proportionally more if it also shares neighbors with u and v. This basic idea can be implemented in multiple ways, depending on which of the traditional link prediction methods we want to extend. For instance, if we extend vanilla common neighbors, you'd say that:
$$score(u, v) = \sum_{z \in N_u \cap N_v} \left(1 + \frac{|N_u \cap N_v \cap N_z|}{2}\right).$$
A variant that echoes Resource Allocation instead uses:
$$score(u, v) = \sum_{z \in N_u \cap N_v} \frac{|N_u \cap N_v \cap N_z|}{|N_z|}.$$
30 Carlo Vittorio Cannistraci, Gregorio Alanis-Lobato, and Timothy Ravasi. From link-prediction in brain connectomes and protein interactomes to the local-community-paradigm in complex networks. Scientific Reports, 3:1613, 2013.
5. Nothing stops you from using all the link prediction methods at once and then aggregating their results. Really, it's a free country.
23.9 Exercises
1. What are the ten most likely edges to appear in the network at
http://www.networkatlas.eu/exercises/23/1/data.txt accord-
ing to the preferential attachment index?
2. Compare the top ten edges predicted for the previous question
with the ones predicted by the Jaccard, Adamic-Adar, and resource
allocation indexes.
Figure 24.3: From top to bottom: a signed graph, its signed Laplacian, the sorted eigenvalues of the signed Laplacian. (a) Unbalanced graph. (b) Balanced graph.
triangle to close with a negative edge (Figure 24.1(b)). The case with
an initial condition of two negative edges is more difficult to close,
but we prefer to close it with a positive edge (Figure 24.1(b)) than
with a negative one (Figure 24.1(c)).
So you see that you can perform signed link prediction by first
predicting the pair of nodes that will connect, calculating a score(u, v)
with any of the methods presented in the previous chapter. Then you
will decide the sign of the link, by using social balance theory.
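A tiny sketch of the sign step of this two-part recipe (plain Python; aggregating by majority vote over the common neighbors is a simple choice of mine, not something prescribed by the text): every common neighbor z of u and v votes with the product of the signs of (u, z) and (z, v), since a balanced triangle has an even number of negative edges.

```python
def predict_sign(signs, u, v):
    """Pick the sign of (u, v) that keeps the most triangles balanced.

    `signs` maps a frozenset pair of nodes to +1 or -1."""
    vote = 0
    for pair, s in signs.items():
        if u in pair and v not in pair:
            z = next(iter(pair - {u}))
            other = frozenset({z, v})
            if other in signs:
                vote += s * signs[other]   # product of the two existing signs
    return 1 if vote >= 0 else -1          # ties default to a positive edge

signs = {frozenset({"u", "z"}): +1, frozenset({"z", "v"}): -1,
         frozenset({"u", "w"}): -1, frozenset({"w", "v"}): -1}
print(predict_sign(signs, "u", "v"))   # one triad votes -1, the other +1: tie, so +1
```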
Figure 24.4: The 16 templates of directed signed triangles in social status networks. The color of the edge determines its status: green = positive, red = negative, blue = the edge we are trying to predict – it can be either positive or negative.
Figure 24.4 shows the 16 main configurations of these triangles.
The closing edge connecting u to v can be either positive or negative,
generating 32 final possible configurations. Status theory generates
predictions that are more sophisticated and – sometimes – less imme-
diately obvious. Some cases are easy to parse. For instance, consider
Figure 24.4(a). The objective is to predict the sign of the (u, v) edge.
In the example, u endorses z as higher status. z endorses v. If v is on
a higher level than z, and z is on a higher level than u, then it’s easy
to see how v is also on a higher level than u. Thus the edge will be
of a positive sign. In fact, this specific configuration is grossly overrepresented in real world data.
The situation is not as obvious for other triangles. For instance,
the one in Figure 24.4(i). Here we have the z node endorsing both u
and v. We don’t really know anything about their relative level, only
that they are both on a higher standing with respect to z. The paper
presenting the theory makes a subtle case. The edge connecting u to
v is more likely to be positive than the generative baseline on u, but
less likely to be positive than the receiving baseline of v. So, suppose
that 50% of edges originating from u are positive, while 80% of the
links v receives are positive. The presence of a triangle like the one in
Figure 24.4(i) would tell us that the probability of connecting u to v
with a positive link is higher than 50%, but lower than 80%.
Given this sophistication, and the fact that social status works with
more information than social balance – namely the edge’s direction
–, it is no wonder that there are cases in which social status vastly
outperforms social balance. For instance, the original authors apply
social status to a network of votes in Wikipedia. Here the nodes
are users, who are connected during voting sessions to elect a new
admin. The admin receives the links, positive if the originating
user voted in favor, negative if they voted against. Triangles in this
network connect with the patterns predicted by social status theory.
So far we have only considered the case of two possible edge types.
Moreover, these two types have a clear semantic: one type is positive,
the other is negative. Both assumptions make the link prediction
problem easier: there are few degrees of freedom and we move in
a space constrained by strong priors. It is now time to drop these
assumptions and face the full problem of multilayer link prediction
as the big boys we are.
Generalized multilayer link prediction is the task of estimating
the likelihood of observing a new link in the network, given the
two nodes we want to connect and the layer we want to connect
them through. Nodes u and v might be very likely to connect in the
immediate future, but they might do so in any, some, or even just a
single layer. Thus, we extend our score function to take the layer as
an input: from score(u, v) to score(u, v, l ).
Layer Independence
As you might expect, there are tons of ways to face this problem.
The most trivial way to go about it is to apply any of the single layer link prediction methods to each layer independently.
Figure 24.6 depicts an example for this process. Note that here
I use a rather trivial approach to aggregate, by comparing directly
the various scores. One could also apply to this problem the rank
aggregation measures presented in the previous chapter. In this way,
you could also aggregate different scores using different criteria:
common neighbors, preferential attachment, and so on.
This is practically a baseline: it will work as long as we have an
assumption of independence between the layers. As soon as having a
link in a layer changes the likelihood of connecting into another layer,
we expect to grossly underperform.
Blending Layers
A slightly more sophisticated alternative is to consider the multilayer
network as a single structure and perform the estimations on it. For
instance, consider the hitting time method. This is based on the
estimation of the number of steps required for a random walker
starting on u to visit v. We can allow the random walker to, at any
time, use the inter layer coupling links exactly as if they were normal
edges in the network. At that point, a random walker starting from
u in layer l1 can and will visit node v in layer l2 . The creation of
our connection likelihood score is thus well defined for multilayer
networks. Figure 24.7 depicts an example for this process.
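The following is a rough sketch of the blending idea, under the assumption that the multilayer network can be flattened into a "supra" graph whose nodes are (node, layer) pairs joined by inter-layer coupling edges; the hitting time is then estimated by brute-force simulation. The helper names and the toy layers are mine, and the Monte Carlo estimate is only meant to illustrate the mechanics, not to be efficient.

import random
import networkx as nx

def supra_graph(layers, coupling_weight=1.0):
    """Flatten a multilayer network: nodes become (node, layer) pairs,
    with coupling edges between copies of the same node in consecutive layers."""
    g = nx.Graph()
    names = list(layers)
    for name, layer in layers.items():
        g.add_edges_from(((u, name), (v, name)) for u, v in layer.edges())
    for a, b in zip(names, names[1:]):
        for u in set(layers[a]) & set(layers[b]):
            g.add_edge((u, a), (u, b), weight=coupling_weight)
    return g

def hitting_time(g, source, target, walks=200, max_steps=1000, rng=random.Random(0)):
    """Monte Carlo estimate of the steps a random walker starting at source
    needs to first reach target (capped at max_steps)."""
    totals = []
    for _ in range(walks):
        node, steps = source, 0
        while node != target and steps < max_steps:
            node = rng.choice(list(g.neighbors(node)))
            steps += 1
        totals.append(steps)
    return sum(totals) / len(totals)

layers = {"l1": nx.erdos_renyi_graph(20, 0.2, seed=1),
          "l2": nx.erdos_renyi_graph(20, 0.2, seed=2)}
sg = supra_graph(layers)
print(hitting_time(sg, (0, "l1"), (5, "l2")))  # lower values suggest a likelier link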
These paths crossing layers are often called meta-paths. The information
from these meta-paths can be used directly as we just saw, informing a
multilayer hitting time. Or we can feed them to a classifier, which is
trying to put potential edges in one of two categories: future existing
and future non-existing links. Any classifier can perform this job once
you collect the multilayer information from the meta-path: naive Bayes,
support vector machines (SVM), and others11.

[Figure 24.7: A slightly more sophisticated way to perform multilayer link prediction. Given the input network, perform the link prediction procedure on the full structure. In this case, the gray arrow simulates a random walker going from node g in layer l3 to node a in the same layer, passing through node c in layer l2. The multilayer random walker contributes to the score(g, a, l3).]

11. Mahdi Jalili, Yasin Orouskhani, Milad Asgari, Nazanin Alipourfard, and Matjaž Perc. Link prediction in multiplex online social networks. Royal Society Open Science, 4(2):160863, 2017.

Other extensions to handle multilayer networks have been proposed12.
These studies show that multilayer link prediction is indeed an
interesting task, as there is a correlation between the neighborhoods
of the same nodes in different layers. The classical case involves
the prediction of links in a social media platform using information
about the two users coming from a different platform13. Such layer-layer
correlations are not limited to social media, but can also be found in
infrastructure networks14.

12. Darcy Davis, Ryan Lichtenwalter, and Nitesh V Chawla. Multi-relational link prediction in heterogeneous information networks. In ASONAM, pages 281–288. IEEE, 2011.
13. Desislava Hristova, Anastasios Noulas, Chloë Brown, Mirco Musolesi, and Cecilia Mascolo. A multilayer approach to multiplexity and link prediction in online geo-social networks. EPJ Data Science, 5(1):24, 2016.
14. Kaj-Kolja Kleineberg, Marián Boguná, M Ángeles Serrano, and Fragkiskos Papadopoulos. Hidden geometric correlations in real multiplex networks. Nature Physics, 12(11):1076, 2016.

Multilayer Scores

The last mentioned strategy is better, but it still doesn't consider
all the wealth of information a multilayer network can give you. To
see why, let's dust off the concept of layer relevance we introduced
in Section 9.1. That is a way to tell you that a node u has a strong
tendency of connecting through a specific layer. If a layer exclusively
hosts many neighbors of u, that might mean that it is its preferred
channel of connection.

This suggests that other naive ways to estimate node-node similarity
for our score should be re-weighted using layer relevance15. Consider
Figure 24.8. We see that the two nodes have many common neighbors in the
blue layer. They only have one common neighbor in the red layer. However
the blue layer, for both nodes, has a very low exclusive layer relevance.
There is no neighbor that we can reach using exclusively blue edges. In
fact, in this case, the exclusive layer relevance is zero.

The opposite holds for the layer represented by red edges. There are
many neighbors for which red links are the only possible choice. In this
particular case, we might rank the red layer as more likely to host a
new connection between the two nodes.

15. Giulio Rossetti, Michele Berlingerio, and Fosca Giannotti. Scalable link prediction on multidimensional networks. In 2011 IEEE 11th International Conference on Data Mining Workshops, pages 979–986. IEEE, 2011.
24.4 Exercises
The Basics
When it comes to evaluating your prediction algorithm, you have to
distinguish between the training and the test datasets. The training
dataset is what your model uses to learn the patterns it is supposed
to predict. For instance, if you’re doing a common neighbor link
predictor, the training dataset is what you use to count the number of
shared connections between two nodes. Once you’re done examining
the input data, you have generated the results of the score(u, v)
function for all possible pairs of u, v inputs.
[Figure 25.1: An example of train and test sets for a network. The information (a) we use to build the score table (b), using the common neighbor approach. I highlight the test edges (c) in blue.]

That is why in Figure 25.1(c) the edges that were already in the
training set are gray rather than blue: we won’t make predictions on
those, because we already know they exist.
So now the problem is: how do you do that? If you have a net-
work, how do you divide it into training and test sets?
You have two options, as Figure 25.2 shows. If you have temporal
information on your edges you can use earlier edges to predict the
later ones. Meaning that your train set only contains links up to time
t, and the test set only contains links from time t + 1 on. If you don’t
have the luxury of time data, you have to do k-fold cross validation:
divide your dataset in a train and test set (say 90% of edges in train
and 10% in test) and then perform multiple runs of train-test by
rotating the test set so that each edge appears in it at least once2.

2. Ron Kohavi et al. A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI, volume 14, pages 1137–1145. Montreal, Canada, 1995.

Specific Issues
However that's... kind of not the point? We're in this business because
we want to predict new links. Returning a negative prediction for all
possible cases is not helpful. The usual fix for this problem is building
your test set in a balanced way3, 4. Rather than asking about all possible
new edges, you create a smaller test set. Half of the edges in the test
set are actual new edges, and then you sample an equal number of
non-edges. This would make our Internet test set contain 60k edges, not
18B. We called this sampling technique "negative sampling" in Section 4.4.

3. Ryan N Lichtenwalter, Jake T Lussier, and Nitesh V Chawla. New perspectives and methods in link prediction. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 243–252, 2010.
4. Ryan Lichtenwalter and Nitesh V Chawla. Link prediction: fair and effective evaluation. In 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pages 376–383. IEEE, 2012.
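As a concrete illustration, here is a minimal sketch of a balanced train/test split with negative sampling, in Python with networkx. The function name, the 10% test fraction, and the toy graph are my own choices, not a prescription from the text.

import random
import networkx as nx

def balanced_test_set(g, test_fraction=0.1, seed=0):
    """Hold out a fraction of edges as positives, remove them from the
    training graph, and sample an equal number of non-edges as negatives."""
    rng = random.Random(seed)
    edges = list(g.edges())
    rng.shuffle(edges)
    n_test = int(len(edges) * test_fraction)
    positives = edges[:n_test]
    train = g.copy()
    train.remove_edges_from(positives)
    nodes = list(g.nodes())
    negatives = set()
    while len(negatives) < n_test:            # negative sampling
        u, v = rng.sample(nodes, 2)
        if not g.has_edge(u, v):
            negatives.add((u, v))
    test = [(u, v, 1) for u, v in positives] + [(u, v, 0) for u, v in negatives]
    return train, test

train, test = balanced_test_set(nx.karate_club_graph())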
25.2 Evaluating
Let’s assume that we have competently built our training and test
set. We made our model learn on the former. We now have two
things: prediction – the result of the model – and reality – the test
set. We want to know how much these two sets overlap. Since you
know what edges are in the test set, this is a supervised learning task
(Section 4.1) and we can focus on the loss/quality functions (Section
4.3) specific for this scenario. Even more narrowly, link prediction
is a binary task: you give a yes/no answer that is either completely
right or completely wrong.
Confusion Matrix
Humans like single numbers, because seeing a number going up
tingles our pleasure centers (wait, what? You don’t feel inexplicable
arousal while maximizing scores? I question whether you’re in the
right line of work...). However, we should beware of what we call
“fixed threshold metrics”, i.e. everything that boils down a complex
phenomenon to a single number. Usually, to reduce everything to
a single measure you have to make a number of assumptions and
simplifications that may warp your perception of performance.
That is why one of the first things you should look at is a confusion
matrix. A confusion matrix is simply a grid of four cells, putting
the four counts I just introduced in a nice pattern5. You can see an
example in Figure 25.4. Confusion matrices are nice because they
don't attempt to reduce complexity, but at the same time you see
information in an easy-to-parse pattern.

5. Stephen V Stehman. Selecting and interpreting measures of thematic classification accuracy. Remote Sensing of Environment, 62(1):77–89, 1997.
By looking at two confusion matrices you can say surprisingly
sophisticated things about two different methods. The one in Figure
25.5(a) does a better job in making sure a positive prediction really
corresponds to a new link: there are very few false positives (one)
compared to the true positives (15). The one in Figure 25.5(b) mini-
mizes the number of false negatives, with the downside of having a
lot of false positives.
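If you work in Python, scikit-learn builds the confusion matrix for you. A minimal sketch with made-up predictions follows; note that scikit-learn arranges the cells differently from the clockwise layout of Figure 25.4.

from sklearn.metrics import confusion_matrix

# Hypothetical binary predictions on a balanced test set:
# 1 = "link will appear", 0 = "it will not".
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 1, 0, 0]

# scikit-learn's layout: [[true negatives, false positives],
#                         [false negatives, true positives]]
print(confusion_matrix(y_true, y_pred))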
[Figure 25.4: The schema of a confusion matrix for link prediction. From the top-left corner, clockwise: true positives, false positives, true negatives, false negatives.]

[Figure: ROC curves, plotting the true positive rate (TPR, y axis) against the false positive rate (FPR, x axis), with the random guess diagonal as reference.]
ROC curves are great – you might even say that they ROC – but,
at the end of the day, you might want to know which of the two
classifiers is better on average. ROC curves can be reduced to a single
number, a fixed threshold metric. Since we just said that the higher
the line on the ROC plot the better, one could calculate the Area
Under the Curve (AUC). The more area under that curve, the better
your classifier is, because for each corresponding FPR value, your
TPR is higher – thus encompassing more area.
You don’t need to know calculus to estimate the area under the
curve, because it’s such a standard metric that any machine learning
package will output it for you. The AUC is 0.5 for the random guess:
that’s the area under the 45 degree line. An AUC of 1 – which you’ll
never see and, if you do, it means you did something wrong – means
a perfect classifier.
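A minimal sketch with scikit-learn follows. The y_true and y_score lists stand in for the test labels and the score(u, v) values of a hypothetical predictor; they are made up for illustration, not taken from any real experiment.

from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]                      # made-up test labels
y_score = [0.9, 0.8, 0.75, 0.7, 0.6, 0.4, 0.35, 0.3, 0.2, 0.1]  # made-up scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points of the ROC curve
print(roc_auc_score(y_true, y_score))              # area under the ROC curve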
Note that ROC curves and AUCs are unaffected if you sample
your test set randomly, namely if you only test potential edges at
random from the set of all potential edges – I discussed before how
this is a common thing to do because of the unmanageable size
of the real test set. However, that is not true if you perform a non-
random sampling. This means choosing potential edges according
to a specific criterion. If your criterion is “good”, meaning that your
sampling method is correlated with the actual edge appearance
likelihood, you’re going to see a different – lower – AUC value. That
is because, if you don’t sample, the vast number of easy-to-predict
non-edges increases your classifier's accuracy.
Precision and recall can be combined into a single
score, balancing them out. This is known as the F1-score, which is
their harmonic mean: F1 = 2(Precision × Recall)/(Precision + Recall).
This is a single number, like AUC, capturing both types of errors:
failed predictions and failed non-predictions.
A powerful way to use precision and recall is by using them as an
alternative to ROC curves. The so-called Precision-Recall curves have
the recall on the x-axis and the precision on the y-axis (see Figure
25.9). They tell you how much your precision suffers as you want
to recover more and more of the actual new edges in the network.
Recall basically measures how much of the positive set you recover.
But, as you include more and more links in that set, you’re likely to
start finding lots of false positives. That will make your recall go up,
but precision go down.
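The same kind of toy scores can be fed to scikit-learn to get precision, recall, the F1-score, and the points of a Precision-Recall curve. The 0.5 cutoff and the lists below are arbitrary assumptions made for the sketch.

from sklearn.metrics import precision_score, recall_score, f1_score, precision_recall_curve

y_true  = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]                      # made-up test labels
y_score = [0.9, 0.8, 0.75, 0.7, 0.6, 0.4, 0.35, 0.3, 0.2, 0.1]  # made-up scores
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]               # arbitrary cutoff

print(precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
# The full Precision-Recall curve, one point per possible cutoff:
precision, recall, thresholds = precision_recall_curve(y_true, y_score)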
[Figure 25.9: A Precision-Recall curve, with recall on the x axis and precision on the y axis: including more predicted links recovers more true positives, at the price of having many false ones.]
PP = 10 log10 (P / Pr).
This is a decibel-like logscale: a PP = 1 implies your predictor
is ten times better than random, while PP = 2 means you are one
hundred times better than random. You can also create PP-curves by
having on the x-axis the share of links you remove from your training
set. By definition, the random predictor is an horizontal line at 0. The
more area your PP curve can make over the horizontal zero, the more
precise your predictor is.
In closing, I should also mention another popular measure: accu-
racy. This is simply ( TP + TN )/( TP + TN + FP + FN ): the number
of times you got it right over all the attempts. The lure of accuracy is
its straightforward intuition. However, it hides the difference between
type I and type II errors – false positives and false negatives – and
thus it should be handled with care.
25.3 Summary
3. Since real networks are sparse, there are more non-edges than
edges. Thus a link predictor always predicting non-edges would
have high performance. That is why you should balance your test
sets, having an equal number of edges and non-edges.
25.4 Exercises
2. Draw the ROC curves on the cross validation of the network used
at the previous question, comparing the following link predictors:
preferential attachment, jaccard, Adamic-Adar, and resource alloca-
tion. Which of those has the highest AUC? (Again, scikit-learn
has helper functions for you)
3. Calculate precision, recall, and F1-score for the four link predictors
as used in the previous question. Set up as cutoff point the nineti-
eth percentile, meaning that you predict a link only for the highest
ten percent of the scores in each classifier. Which method performs
best according to these measures? (Note: when scoring with the
scikit-learn function, remember that this is a binary prediction
task)
The Hairball
26
Bipartite Projections
1. Degree distributions;
2. Epidemics spread;
3. Communities.
Many papers have been written on how power law degree distributions
are ubiquitous1, 2, 3, 4. Chances are that any and all the networks
you'll find on your way as a network analyst do not have even a hint
of a power law degree distribution. In the best case scenario you are
going to have shifted power laws, or exponential cutoffs – if you're
lucky – (for a refresher on these terms, see Section 9.4).

1. Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
2. Albert-László Barabási and Eric Bonabeau. Scale-free networks. Scientific American, 288(5):60–69, 2003.
3. Reka Albert. Scale-free networks in cell biology. Journal of Cell Science, 118(21):4947–4957, 2005.
4. Albert-László Barabási. Scale-free networks: a decade and beyond. Science, 325(5939):412–413, 2009.

My second example is epidemics spread – Figure 26.1. As we saw in
Part VI, SIS/SIR models tell us exactly when the next node is going to
be activated. In practice, data about real activation times has (a)
high levels of noise, (b) many exogenous factors that have as much
power in influencing how the infection spreads as the network
connections have.
Third, and more famously, communities. We are not going to dive
deeply into the topic until Part X. But, very superficially, when
There are a few ways in which hairballs arise, which are the focus
of this book part. First, many networks are not observed directly:
they are inferred (Figure 26.3(a)). If the edge inference process you’re
applying does not fit your data, it will generate edges it shouldn’t.
Second, even if you observe the network directly, your observation is
subject to noise (Figure 26.3(b)), connections that do not reflect real
interactions but appear due to some random fluctuations. Finally,
you might have the opposite problem: you’re looking at an incom-
plete sample (Figure 26.3(c)), and thus missing crucial information.
In the chapters of this book part, we tackle each one of these
problems to see some examples in which you can avoid giving birth
to yet another hairball. Chapter 27 deals with network backboning:
how to clear out noise from your edge observations. Chapter 29
deals with network sampling: what to do when your observations are
incomplete.

[Figure 26.3: The typical breeding grounds for hairballs: (a) Indirect observation, (b) Noise in the data, (c) Incomplete samples.]

[Figure 26.4: An example of naive bipartite projection, where we connect nodes of one type if they have a common neighbor: "Connecting movies because the same users watch them".]
are broad. This means that there are going to be some users in your
bipartite user-movie network with a very high degree. These are
power users, people who watched everything. They are a problem:
under the rule we just gave to project the bipartite networks, you’ll
end up with all movies connected to each other. A hairball. The key
lies in recognizing that not all edges have the same importance. Two
movies that are watched by three common users are more related to
each other than two movies that only have one common spectator.
[Figure 26.5: An example of Simple Weight bipartite projection, where we connect nodes of one type with the number of their common neighbors: W_{u,v} = |N_u ∩ N_v|.]
[Figure 26.7: An example of Hyperbolic Weight bipartite projection, where each common neighbor z contributes k_z^{-1} to the sum of the edge weight: W_{u,v} = ∑_{z ∈ N_u ∩ N_v} 1/k_z.]

w_{u,v} = ∑_{z ∈ N_u ∩ N_v} 1/(k_u k_z).
[Figure 26.8: An example of Resource Allocation bipartite projection, where each common neighbor z contributes (k_u k_z)^{-1} to the sum of the edge weight. When connecting node 1 to node 2, from node 1's perspective the edge weight is (1/2 × 1/3) + (1/2 × 1/8), because the two common neighbors have degree of 3 and 8, respectively, and node 1 has degree of two. However, from node 2's perspective, the edge weight is (1/3 × 1/3) + (1/3 × 1/8), because node 2 has three neighbors.]

This strategy also works for weighted bipartite networks. If B is
your weighted bipartite adjacency matrix, the entries of W are:

w_{u,v} = ∑_{z ∈ N_u ∩ N_v} B_{uz} B_{vz} / (k_u k_z).

In practice, you replace the 1 in the numerator with the edge
weights connecting z to v and u. Moreover, we can also have node
weights, noticing that some nodes might have more resources than
others. Suppose that you have a function f giving each node in the
network a resource weight. After you perform the resource allocation
projection, each node will have a new amount of resources f ′ = W f .
Note that, in this case, W is not symmetric: in the scenario with a
single common neighbor z, u’s score for v would be (k u k z )−1 , while
v’s score would be (k v k z )−1 . If k u ̸= k v , then the scores are different.
In many cases, this provides a better representation of the network
than one ignoring asymmetries. You might be the most similar
author to me because I always collaborated with you, but if you also
contributed to many other papers with other people, then I might not
be the author most similar to you.
W has a well-defined diagonal: w_{u,u} = ∑_{z ∈ N_u} 1/(k_u k_z) = (1/|N_u|) ∑_{z ∈ N_u} 1/k_z.
In fact, this diagonal is the maximum possible similarity value of the
row: only a node v with the very same neighbors and nothing else
can have a weight wu,v = wu,u .
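A minimal sketch of the three projection weights discussed so far (simple, hyperbolic, resource allocation), written for a toy user-movie bipartite graph in networkx. The helper `project` and the toy data are mine, not a library function; the resource allocation variant shown here takes u's perspective only, whereas the asymmetric version discussed above would return a directed graph.

from itertools import combinations
import networkx as nx

# A small hypothetical user-movie bipartite graph.
B = nx.Graph()
B.add_edges_from([("u1", "m1"), ("u1", "m2"), ("u2", "m1"),
                  ("u2", "m2"), ("u3", "m2"), ("u3", "m3")])
movies = {n for n in B if n.startswith("m")}

def project(B, targets, contribution):
    """Project onto `targets`, weighting each edge by the sum of
    contribution(u, v, z) over the common neighbors z of u and v."""
    W = nx.Graph()
    W.add_nodes_from(targets)
    for u, v in combinations(targets, 2):
        common = set(B[u]) & set(B[v])
        if common:
            W.add_edge(u, v, weight=sum(contribution(u, v, z) for z in common))
    return W

simple     = project(B, movies, lambda u, v, z: 1)                # |Nu ∩ Nv|
hyperbolic = project(B, movies, lambda u, v, z: 1 / B.degree(z))  # ∑ 1/kz
resource   = project(B, movies,
                     lambda u, v, z: 1 / (B.degree(u) * B.degree(z)))  # ∑ 1/(ku kz), u's view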
[Figure 26.9: An example of Random Walks bipartite projection, where the connection strength between u and v is dependent on the stationary distribution π, telling us the probability of ending in v after a random walk: W_{u,v} = π A_{u,v}.]
[Figure 26.10: The distributions of edge weights in the projected Twitter network for eight different projection methods. The plots report the number of edges (y axis) with a given edge weight (x axis). Panels include (e) Hyperbolic, (f) ProbS, (g) Hybrid (λ = 0.5), (h) Random Walks.]

weight is 2.14. This is very much not the case for other projection
strategies such as Jaccard (Figure 26.10(b)), where there is no trace
of a power law. And, in many cases such as cosine and Pearson
(Figures 26.10(c-d)), the highest edge weight is actually the most
common value, rather than being an outlier such as in the hyperbolic
projection (Figure 26.10(e)).
Is the difference exclusively in the shape of the distribution, or
do these approaches disagree on the weights of specific edges? To
answer this question we have to look at a scattergram comparing the
edge weights for two different projection strategies. This is what I do
in Figure 26.11.
I picked three cases to show the full width of possibilities. In
Figure 26.11(a), I compare the cosine projection against the Jaccard
one. This is the pair of projections that, in this dataset, agree the most.
Their correlation is > 0.94. Looking at the figure, it is easy to see that
there isn’t much difference. You can pick either method and you’re
going to have comparable weights. The opposite case compares the two
methods that are anti-correlated the most. This would be HeatS and
the random walks approach, in Figure 26.11(b). Their correlation in a
log-log space is a staggering −0.7. From the figure you can probably
spot a few patterns, but the lesson learned is that the two methods
build fundamentally different projections.
Ok, but these are extreme cases. What does the average case look
like? To get an idea, I chose a particular pair of measures: HeatS and
ProbS (Figure 26.11(c)). You might expect the two to be more similar
than the average method: after all, one is the transpose of the other.
You'd be very wrong. In this dataset, HeatS and ProbS are actually
anti-correlated, at −0.34 in the log-log space. HeatS and ProbS would
be positively correlated if the nodes of type V1 with similar degrees
26.7 Summary
2. In network projection you pick one of the two node types and you
connect the nodes of that type if they have common neighbors
of the other type. Normally you’d count the number of common
neighbors they have (simple weighted) and then evaluate their
statistical significance.
26.8 Exercises
27.1 Naive
[Figure 27.1: A vignette of the naive thresholding procedure. Each red bar is an edge in the network. The bar's width is proportional to the edge's weight. Here, I sort all edges in decreasing weight order. I then establish a threshold and discard everything to its right.]

There are, however, problems with the naive strategy.
The first problem is that, in real world networks, edge weights dis-
tribute broadly in a fat-tailed, highly skewed fashion, much like the
degree (Section 9.3). Let’s take a quick look again at the edge weight
distribution we got using the simple projection in the previous chap-
ter for our Twitter network. I show the distribution again in Figure
27.2.
[Figure 27.2: The distribution of edge weights in the projected Twitter network using the simple projection: number of edges (y axis) with a given weight (x axis).]
In this network, 82% of the edges have weight equal to one. The
smallest possible hard threshold would remove 82% of the network,
without allowing for any nuance. Moreover, since we have a fat
tailed edge weight distribution, it is hard to motivate the choice
of a threshold. Such a highly skewed distribution lacks a well-
defined average value and has undefined variance (Section 3.4).
You cannot motivate your threshold choice by saying that it is “x
standard deviations from the average” or anything resembling this
formulation.
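For reference, the naive strategy is essentially a one-liner. A minimal sketch assuming a weighted networkx graph; the function name, the toy graph, and the threshold value are made up for illustration.

import networkx as nx

def naive_backbone(g, threshold):
    """Keep only the edges whose weight is at least `threshold`."""
    kept = [(u, v) for u, v, w in g.edges(data="weight") if w >= threshold]
    return g.edge_subgraph(kept).copy()

# Hypothetical weighted graph, e.g. a bipartite projection from Chapter 26.
g = nx.Graph()
g.add_weighted_edges_from([(1, 2, 5.0), (2, 3, 1.0), (3, 4, 0.5), (1, 4, 2.0)])
backbone = naive_backbone(g, threshold=2.0)  # keeps edges 1-2 and 1-4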
[Figure 27.3: The average weight of edges sharing a node with a focus edge (y axis) against the weight of the focus edge (x axis). Thin lines show the standard deviation. One percent sample of the Twitter network.]
One could think that the edges are simply sorted according to their
contributions to all shortest paths in the network, but that is not the
case. By forcing the substructures to be trees, we are counting the
edges that are salient from each node’s local perspective, rather than
the network’s global perspective. The authors in the paper show
the subtle difference between shortest path tree counts and edge
betweenness, also showing how a hypothetical skeleton extracted
using edge betweenness performs more poorly.
The HSS makes a lot of sense for networks in which paths are
meaningful, like infrastructure networks. However, it requires a
lot of shortest path calculations – which makes it computationally
expensive. Moreover, the edges are either part of (almost) all trees or
of (almost) none of them. Figure 27.7 shows an example of this edge
weight distribution, showcasing the typical “horns” shape of the HSS
score attached to the original edges. You can see clearly that there
are two peaks: one at zero – the edge is in no shortest path tree –; the
other at one – the edge is part of all shortest path trees.
This can be nice, because it means HSS can be almost parameter
free: the thresholding operation does not have many degrees of
[Figure 27.7: A typical "horns" plot for the edge weight distribution in HSS. The plot reports how many edges (y axis) are part of a given share of shortest path trees (x axis).]

freedom. On the other hand, when there are few edges with weights
close to one your skeleton might end up being too sparse and it is
difficult to add more edges without lowering the threshold close to
zero.
[Figure 27.8: An example of induced graph. (a) The original graph. I highlight in red the nodes I pick for my induced graph. (b) The induced graph of (a), including only nodes in red and all connections between them.]
A network is convex if all its connected induced subgraphs are
convex. No matter which set of nodes you pick: as long as they are
part of a single connected component, they are all going to be convex.
This might look like a weird and difficult to understand concept, but
you can grasp it with the help of elementary building blocks you
already saw in this book.
Figure 27.9 shows the two basic alternatives for a convex network.
In this and the following section, we’re slightly turning the perspec-
tive on network backboning. You could consider these as a different
subclass of the problem. They all apply a general template to solve
the problem of filtering out connections, which relate to the “noise
reduction” application scenario of network backboning. Up until
now, we adopted a purely structural approach which re-weights
edges according to some topological properties of the graph. Here,
instead, given a weighted graph, we adopt a template composed of
three main steps: (1) define a null model based on node distribution
properties; (2) compute a p-value (Section 3.3) for every edge to de-
termine the statistical significance of properties assigned to edges
from a given distribution; (3) filter out all edges having p-value above
a chosen significance level, i.e. keep all edges that are least likely to
have occurred due to random chance.
We check New York against a small town in the south, for instance
Franklington in Louisiana. Let’s say that New York’s connections, on
average, involve 10k travelers. The traveler traffic with Franklington
involves only 1k. This is way less than New York’s average so, when
we check this edge from New York’s perspective, we mark it for dele-
tion. However, on average, Franklington’s connections involve only
500 travelers. Thus, when we check the edge from Franklington’s
perspective, we will find it significant and so we will keep it.
Since you need one success out of the two attempts to keep the
edge, you end up with strong hubs connected to the entire net-
work, and few peripheral connections (hub-spoke structure, or core-
periphery, with no communities). In other words, the disparity filter
tends to create networks with high centralization (Section 14.8),
broad degree distributions, and weak communities. In many cases,
that is fine. For some other scenarios, we might want to consider an
alternative.
In summary, this means that the disparity filter ignores the weights
of the neighbors of a node when deciding whether to keep an edge or
not. There is a collection of alternatives15, 16 that take this
additional piece of information into account and are thus less biased.

15. Navid Dianati. Unwinding the hairball graph: pruning algorithms for weighted complex networks. Physical Review E, 93(1):012304, 2016.
16. Valerio Gemmetto, Alessio Cardillo, and Diego Garlaschelli. Irreducible network backbones: unbiased graph filtering via maximum entropy. arXiv preprint arXiv:1706.00230, 2017.

In this section I explained the disparity filter only in the case of
undirected networks. You can apply the same technique also for
directed networks. In this case, you need to make sure that you're
properly accounting for direction in your p-value calculation: the edge
must be significant either when compared to the out-connections of
the node sending the edge, or when compared to the in-connection
weights of the node receiving it.
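A minimal sketch of an undirected disparity filter. The null model formula p = (1 − w/s)^(k−1) is not spelled out in the excerpt above, so treat it as an assumption of the sketch; an edge survives if it is significant from at least one endpoint, which is the "one success out of two attempts" rule discussed earlier.

import networkx as nx

def disparity_filter(g, alpha=0.05):
    """Keep an edge if it is statistically significant from the perspective
    of at least one of its endpoints, using p = (1 - w/s)^(k-1)."""
    keep = []
    for u, v, w in g.edges(data="weight"):
        for node in (u, v):
            k = g.degree(node)                       # number of neighbors
            s = g.degree(node, weight="weight")      # node strength
            if k > 1 and (1 - w / s) ** (k - 1) < alpha:
                keep.append((u, v))
                break
    return g.edge_subgraph(keep).copy()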
27.6 Noise-Corrected
The noise-corrected (NC) approach attempts to fix the issues of the
disparity filter17. In spirit, it is very similar to it. However, the
focus is shifted towards an edge-centric approach: each edge has a
different

17. Michele Coscia and Frank MH Neffke. Network backboning with noisy data. In 2017 IEEE 33rd International Conference on Data Engineering (ICDE), pages 425–436. IEEE, 2017.
p_{u,v} = ( ∑_{v′ ∈ N_u} w_{u,v′} × ∑_{u′ ∈ N_v} w_{u′,v} ) / ( ∑_{u′,v′} w_{u′,v′} )².
York voted for deletion. If you attempt to create backbones with the
same number of edges, Franklington will end up connecting to its
local neighborhood, because those edges are more likely to be agreed
upon by all the smaller towns nearby our focus.
27.7 Summary
4. In high-salience skeleton, you calculate the shortest path tree for each
node and you re-weight the edges counting the number of trees
using them. Then you keep the most used edges. This is usually
computationally expensive.
27.8 Exercises
4. How many edges would you keep if you were to return the dou-
bly stochastic backbone including all nodes in the network in a
single (weakly) connected component with the minimum number
of edges?
28
Uncertainty & Measurement Error
To any person who has ever worked with real world data, it should
come as no surprise that datasets are often disappointing. They
contain glaring errors, incomprehensible omissions, and a number
of other issues that make them borderline useless if you don't pour
hours of effort into fixing them. The infamous 80-20 rule1 that holds
for literally everything (including degree distributions, as we saw
in Section 9.4) will tell you that 80% of data science is just about
cleaning data, and only 20% about shiny and fun analysis techniques.

1. Vilfredo Pareto. Manuale di economia politica con una introduzione alla scienza sociale, volume 13. Società editrice libraria, 1919.
This obviously applies to network data as well. You’ll find edges in
your networks with incorrect weights and perhaps they shouldn’t
even be there, and you’ll have plenty of missing or unobserved
connections. And don’t take my word for it: there are some works
outside network science proper that also cite measurement error as
one of the many things you should think about when working with
networked data2.

2. Arun Advani and Bansi Malde. Empirical methods for networks data: Social effects, network formation and measurement error. Technical report, IFS Working Papers, 2014.

Admittedly, techniques to clean network data would deserve an
entire book part but, frankly, there are surprisingly few techniques to
deal with this problem. In fact, a survey paper about measurement
error in network data3 points out that measurement error is routinely
considered only a problem about missing data, rather than the more
general framing as uncertainty. Network data cleaning is thus the
lovechild of network backboning and link prediction, but that's a
rather barren marriage – as far as I know. In fact, one of the few
papers I know4 delivers the truth in a brutal and deadpan way: "[in
network analysis] the practice of ignoring measurement error is still
mainstream". I hope this chapter will contribute to making things
change.

3. Dan J Wang, Xiaolin Shi, Daniel A McFarland, and Jure Leskovec. Measurement error in network data: A re-classification. Social Networks, 34(4):396–409, 2012a.
4. Tiago P Peixoto. Reconstructing networks with unknown and heterogeneous errors. Physical Review X, 8(4):041011, 2018.
change.
To give you an idea of the significance of the measurement error
blind spot in network science, consider the Zachary Karate Club
network. As I'll explain in detail in Section 53.4, everyone in our
field is madly in love with this toy example. The paper presenting
the network5 has been cited more than six thousand times – and not
everybody using this network cites it. The fun thing about this graph
is that we actually don't know whether it has 77 or 78 edges. The
bewildering thing about this graph is that almost no one even mentions
this problem!

5. Wayne W Zachary. An information flow model for conflict and fission in small groups. Journal of Anthropological Research, 33(4):452–473, 1977.
I start the chapter with a few methods that model measurement
error to return a classical adjacency matrix, on which we can operate
with classical algorithms. The rest of the chapter, instead, deals with
probabilistic networks, structures whose edges have a probability of
existing. Then, we want to define techniques to extract various things
we care about – such as the degree distribution or the shortest paths
– from a network containing edges that we're not certain are there.
To wrap up this introduction, let’s mention one obvious thing
someone might want to do when they know their networks are
affected by measurement error: to correct these errors! We’ll see
some techniques in this chapter, but in general one can use generic
algorithms to find missing edges (link prediction, in Chapter 23), to
throw away spurious edges (network backboning, in Chapter 27),
or to find node duplicates. In the latter case, we can use network
techniques to find such nodes, e.g. with node similarity (Section
15.2). However, you could also approach this task with non-network
techniques such as entity resolution6 – e.g. figuring out that two
nodes with different names "Sofia Coppola" and "Sophia Coppola"
are actually referring to the same entity. This is a natural language
processing task and therefore not of interest here.

6. Lise Getoor and Ashwin Machanavajjhala. Entity resolution: theory, practice & open challenges. Proceedings of the VLDB Endowment, 5(12):2018–2019, 2012.
[Figure 28.2: Estimating a true quantity (in green) with different measurements (in blue). In red we have the distribution of the expected errors. The x-axis represents the measurement value, the y-axis the likelihood of observing a given measurement value.]

Figure 28.2 shows the usual scenario: you make a lot of measurements,
none of them is the exact value you look for, but you can assume a
Gaussian error. Once you know what sort of errors you might expect,
you can do what one calls "maximum likelihood11 estimation": given the
measurements I have and the expected distribution of the errors, what
is the most likely true value? One powerful algorithm to deal with
this problem is Expectation Maximization12 (EM) – we already mentioned
it for link prediction (Section 23.7) and

11. The "likelihood" I mention here is the log-likelihood loss function I introduced in Section 4.3.
12. Todd K Moon. The expectation-maximization algorithm. IEEE Signal Processing Magazine, 13(6):47–60, 1996.
[Figure 28.4: A probabilistic network. The edge label indicates the edge's probability of existing.]

[Figure: The sixteen possible worlds of the network in Figure 28.4.]
I will make a small deep dive into the following classical problems
and one of their possible solutions in probabilistic networks:
uncertainty & measurement error 403
Some of these things we’ll only see later in the book, so I’ll give a
brief context for the time being.
Node Degree
If you wanted to know the degree distribution of the network in
Figure 28.4, what would you do? The naive approach would be to
determine the expected degree of each node: go over all the possible
worlds and take the average of the degrees of the node, weighted by
how likely the world is to exist.
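A brute-force sketch of this naive approach, assuming independent edge probabilities and a toy edge list loosely inspired by Figure 28.4 (the numbers and the function name are made up).

from itertools import product

# Hypothetical probabilistic network: edge -> probability of existing.
edges = {(1, 2): 0.5, (1, 3): 0.7, (2, 4): 0.3, (3, 4): 0.9}

def expected_degrees(edges):
    """Expected degree of each node: average the degree over all possible
    worlds, weighted by each world's probability (independent edges)."""
    exp = {}
    items = list(edges.items())
    for world in product([0, 1], repeat=len(items)):
        p_world = 1.0
        deg = {}
        for present, ((u, v), p) in zip(world, items):
            p_world *= p if present else (1 - p)
            if present:
                deg[u] = deg.get(u, 0) + 1
                deg[v] = deg.get(v, 0) + 1
        for node, k in deg.items():
            exp[node] = exp.get(node, 0.0) + p_world * k
    return exp

print(expected_degrees(edges))  # equals, per node, the sum of its edge probabilities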
[Figure: (a) The probability (y axis) for a node of the network in Figure 28.4 to have a given degree (x axis). (b) The probabilistic degree PDF for the entire network.]
Ego Networks
[Figure 28.8: The process of extracting an ego network from a probabilistic network by first realizing the possible worlds and then extracting their ego networks. Ego marked with the "e" label.]

[Figure 28.9: The process of extracting an ego network from a probabilistic network by first extracting a probabilistic ego network and then realizing its possible worlds. Ego marked with the "e" label.]
Betweenness Centrality
Connected Components
In a deterministic network, when two nodes are part of the same con-
nected component they are reachable: there exists a path following
the edges of the network leading you from one node to another (see
Section 10.4). This is not necessarily true in a probabilistic network.
Once you think about this in terms of probabilistic networks, it is not
hard to see why.
[Figure 28.10: (a) A probabilistic network with edge probabilities as the edge's labels. (b) One of the possible worlds of (a).]
Densest Subgraph
In many applications, we might want to extract the densest subgraph
of a network34. As the name implies, the densest subgraph of a
network is a subset of its nodes and the edges between them such
that its density is higher than that of any other subgraph in the
network. You cannot find a subgraph denser than the one you
extracted. Of course there are trivial solutions – for instance any
connected node pair is a densest subgraph because it has density
one – but one can specify a minimum number of nodes they are
interested in, or a different measure to optimize such as the average
degree.

34. Andrew V Goldberg. Finding a maximum density subgraph. 1984.

Of course, there are ways to solve this problem in many different
scenarios, for instance in streaming graphs as the data comes in little
by little35, but here we're interested in probabilistic networks
specifically. The idea is, as usual, to try and define a measure of
expected density for a subgraph, which is its density across all
possible worlds weighted by their probability of existing. There are
many methods to do so, and a few notable ones can also avoid making
the assumption that the edge existence probabilities are independent36.

35. Bahman Bahmani, Ravi Kumar, and Sergei Vassilvitskii. Densest subgraph in streaming and mapreduce. Proceedings of the VLDB Endowment, 5(5), 2012.
36. Zhaonian Zou. Polynomial-time algorithm for finding densest subgraphs in uncertain graphs. In Proceedings of MLG Workshop, 2013.

One related problem is to find not the densest, but any specific
subgraph inside a large graph. This is something we will see in
depth when we talk about subgraph mining (Chapter 41). For now,
let's just say that you might have a specific pattern in mind and you
want to know how many times it appears in the data. When you have
a probabilistic network, you could instead ask for all patterns that
are more likely to exist than a certain probability37, 38.

37. Zhaonian Zou, Hong Gao, and Jianzhong Li. Discovering frequent subgraphs over uncertain graph databases under probabilistic semantics. In SIGKDD, pages 633–642, 2010.
38. Odysseas Papapetrou, Ekaterini Ioannou, and Dimitrios Skoutas. Efficient discovery of frequent subgraph patterns in uncertain graph databases. In EDBT, pages 355–366, 2011.

The potentially huge search space – a large graph has a lot of
potential subgraphs – can be efficiently reduced. Let's take a look at
Figure 28.11(a).

This probabilistic network is extremely simple – it's only a chain
with four nodes and three edges. Even this ridiculously simple net-
[Figure 28.11: (a) A probabilistic network with edge probabilities as the edge's labels. (b-d) Some subgraphs that can be extracted from (a). They exist with probabilities: (b) p = 0.8, (c) p = 0.32, (d) p = 0.192 – the probability of a pattern is the maximum combined probability of the set of edges that can form the pattern.]
0.6 – equal to its Mass since this set has no subsets. The Plausibility
is 1 − Belief(1) = 0.85. Following the literature47, we can define the
edge bounds as follows: 1 + [1 × 0.6, 1 × 0.85] = [1.6, 1.85] – i.e. the
minimum weight we think the edge has is 1.6 and the maximum is
1.85.

47. Gabor Szucs and Gyula Sallai. Route planning with uncertain information using Dempster-Shafer theory. In 2009 International Conference on Management and Service Science, pages 1–4. IEEE, 2009.
28.5 Summary
28.6 Exercises
four columns: the two connected nodes, the probability of the edge
existing and the probability of the edge not existing. Generate all
of its possible worlds, together with their probability of existing.
(Note, you can ignore the fourth column for this exercise)
4. Calculate the length of the shortest path between node 2 and node
4 in the previous network using fuzzy logic, assuming that the
third column reports the probability of the edge to have weight 1
and the fourth column reports the probability of the edge to have
weight 2.
29
Network Sampling
29.1 Induced
Node Induced
If you focus on nodes, it means that you are specifying the IDs of a
set of nodes that must be in your sample. Then, usually, what you do
is collecting all their immediate neighbors. The issue here is clearly
deciding the best set of node IDs from which to start your sampling.
There are a few alternatives you could consider.
The first, obvious, one is to choose your node IDs completely
at random. Random sampling is a standard procedure in many
Edge Induced
Another way to generate induced samples is to focus on edges rather
than nodes. This means selecting edges in a network and then crawl
Snowball
[Figure 29.4: Snowball sampling. Your sampler (blue) starts from a seed (red) and asks for k = 3 connections. Red names their green friends, but not the gray ones. The interviewer then recursively asks the same question to each of the newly sampled green individuals. If no one ever mentions the gray ones, those are not sampled and won't be part of the network.]
Forest Fire
[Figure 29.5: Forest fire sampling. Your sampler (blue) starts from a seed (red) and asks for all the connections of a node. If the probability test succeeds, the neighbor turns green and is also explored. If it fails, the neighbor remains gray and is not explored further.]
The random walk sampling family does exactly what you would
expect it to do given its name: it performs a random walk on the
graph, sampling the nodes it encounters. After all, if random walks
are so powerful and we can use them for ranking nodes (Section
14.4) or projecting bipartite networks (Section 26.5), why can’t we use
them for sampling too? I’ll start by explaining the simplest approach
and its problems, moving into sophisticated variants that address its
downsides.
Vanilla
In Random Walk (RW) sampling, we take an individual and we ask
them to name one of their friends at random. Then we do the same
with them and so on. Figure 29.6 shows the usual vignette applied to
this strategy.
[Figure 29.6: Random walk sampling. Your sampler (blue) starts from a seed (red) and asks for all the connections of a node (green + gray). One of the neighbors is picked at random and becomes the new seed (green) and, when asked, will name another green node to become the new seed.]
Metropolis-Hastings
One way in which we could fix the issues of random walk sampling
is by performing a “random” walk. Meaning that we still pick a
neighbor at random to grow our sample, but we become picky about
whether we really want to sample this new node or not.
In the Metropolis-Hastings Random Walk (MHRW), when we
select a neighbor of the currently visited node, we do not accept
it with probability 1. Instead, we look at its degree. If its degree is
higher than the one of the node we are visiting, we have a chance of
rejecting this neighbor and trying a different one. This probability is
the old node’s degree over the new node’s degree. The exact formula
for this decision is p = k_v/k_u, assuming that we visited v and we're
considering u as a potential next step20, 21.

20. Daniel Stutzbach, Reza Rejaie, Nick Duffield, Subhabrata Sen, and Walter Willinger. On unbiased sampling for unstructured peer-to-peer networks. IEEE/ACM Transactions on Networking (TON), 17(2):377–390, 2009.
21. Balachander Krishnamurthy, Phillipa Gill, and Martin Arlitt. A few chirps about twitter. In Proceedings of the First Workshop on Online Social Networks, pages 19–24. ACM, 2008.

If the current node v has degree of 3, and its u neighbor has degree
of 100, the probability of transitioning to u is only 3% – note that
this is after we selected u as the next step of the random walk, thus
the visit probability is actually lower than 3%: first you have a 1/k_v
probability of being selected and then a k_v/k_u probability of being
accepted. If we were, instead, to transition from u to v, we would
always accept the move, because 100/3 > 1, thus the test always
succeeds. In practice, we might refuse to visit a neighbor if its degree
is higher than the currently visited node. The higher this difference,
the less likely we're going to visit it. A random walk with this rule
counteracts the tendency of a vanilla random walk to oversample
high degree nodes.
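A minimal sketch of the acceptance rule, assuming an undirected networkx graph and ignoring practical concerns such as burn-in; the function name, parameters, and the toy graph are mine, not part of the cited methods.

import random
import networkx as nx

def metropolis_hastings_sample(g, seed_node, n_samples, rng=random.Random(0)):
    """Metropolis-Hastings random walk: propose a random neighbor u of the
    current node v, accept the move with probability min(1, k_v / k_u)."""
    current = seed_node
    sample = [current]
    while len(sample) < n_samples:
        candidate = rng.choice(list(g.neighbors(current)))
        if rng.random() < min(1.0, g.degree(current) / g.degree(candidate)):
            current = candidate
        sample.append(current)  # repeated visits are part of the walk
    return sample

g = nx.barabasi_albert_graph(1000, 3, seed=42)
walk = metropolis_hastings_sample(g, seed_node=0, n_samples=200)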
Re-Weighted
In Re-Weighted Random Walk (RWRW) we take a different approach.
We don’t modify the way the random walk is performed. We extract
the sample using a vanilla random walk. What we modify is the way
we look at it. Once we're done exploring the network, we correct the
result for the property of interest22, 23. Say we are interested in the
degree. We want to know the probability of a node to have degree
equal to i. We correct the observation with the following formula:

p_i = ( ∑_{v ∈ V_i} i^{-1} ) / ( ∑_{v′ ∈ V} x_{v′}^{-1} ).

22. Matthew J Salganik and Douglas D Heckathorn. Sampling and estimation in hidden populations using respondent-driven sampling. Sociological Methodology, 34(1):193–240, 2004.
23. Amir Hassan Rasti, Mojtaba Torkjazi, Reza Rejaie, Nick Duffield, Walter Willinger, and Daniel Stutzbach. Respondent-driven sampling for characterizing unstructured overlays. In IEEE INFOCOM 2009, pages 2701–2705. IEEE, 2009.

The formula tells us the probability of a node to have degree equal
to i (p_i). This is the sum of i^{-1} – the inverse of the value – for
all nodes in the sample with degree i (V_i), over 1/degree (x_{v′}^{-1})
of all nodes in the sample (V). This is also known as Respondent-Driven
Survey24, because it is used in sociology to correct for biases in the
sample when the properties of interest are rare and non-randomly
distributed throughout the population. Figure 29.8 attempts to break
down all parts of the formula.

24. H Russell Bernard and Harvey Russell Bernard. Social research methods: Qualitative and quantitative approaches. Sage, 2013.
Let’s make an example. Suppose you want to estimate the proba-
bility of a node to have degree i = 2. First, you perform your vanilla
random walk sample. Say you extracted 100 nodes. Twenty of those
nodes have degree equal to two. So your numerator in the formula
will be the sum of i−1 = 1/2 for |Vi | = 20 times: 20 ∗ 1/2 = 10. If
we assume that there were 50 nodes of degree 1, 10 of degree 3, 8 of
[Figure 29.8: The Re-Weighted Random Walk formula, estimating the probability p_i of observing the i value in a property of interest, using the set of sampled nodes V_i with that particular value in the total set of v sampled nodes.]
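The correction itself is a couple of lines. A minimal sketch that takes the list of degrees observed along a vanilla random walk; the numbers below are made up, loosely echoing the worked example above.

from collections import Counter

def rwrw_degree_distribution(sampled_degrees):
    """Re-Weighted Random Walk correction for the degree distribution:
    p_i = (sum of 1/i over sampled nodes with degree i) /
          (sum of 1/k_v over all sampled nodes)."""
    denom = sum(1.0 / k for k in sampled_degrees)
    counts = Counter(sampled_degrees)
    return {i: (n * (1.0 / i)) / denom for i, n in counts.items()}

# Hypothetical walk output: 50 nodes of degree 1, 20 of degree 2, and so on.
degrees = [1] * 50 + [2] * 20 + [3] * 10 + [4] * 20
print(rwrw_degree_distribution(degrees))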
[Figure 29.9: Neighbor Reservoir sampling. Nodes in the explored set V′ are in red. Neighbors of V′ – the reservoir – are in green.]
able. Other unlucky u-v draws are forbidden too. For instance, you
cannot perform the swap if you pick nodes 3 and 12.
Figure 29.10 shows an example of how some of these different
strategies would explore a simple tree. I don’t show RWRW, because
the samples it extracts are indistinguishable from the vanilla random
walk ones. I also don’t include NRS, because it’s too subtle to really
be appreciated in a figure like this one.
[Figure 29.11: A power law degree distribution, showing the count of nodes (y axis) with a given degree (x axis). The colors in the plot represent in which cases the first API policy described in the text is faster than the second (purple) and when the second is faster than the first (blue).]
level of pagination and waiting time will always be part of an API
system. Which means that there are going to be trade-offs when
reconstructing the underlying network.
Pagination is often not the only thing you need to worry about.
Other challenges might be sampling a network in the presence of
hostile behavior32. For instance, some hostile nodes will try to lie
about their connections and it's your duty to reconstruct the true
underlying structure. Or not: there are reasonable and legit reasons
to lie about one's connections, for instance to protect one's own
privacy.

32. Edward Bortnikov, Maxim Gurevich, Idit Keidar, Gabriel Kliot, and Alexander Shraer. Brahms: Byzantine resilient random membership sampling. Computer Networks, 53(13):2340–2359, 2009.

In another scenario, you might not be interested in the topological
properties of the full network. What's interesting for you is just
estimating the local properties of one – or more – nodes. In that case,
specialized node-centric strategies can be used33.

33. Manos Papagelis, Gautam Das, and Nick Koudas. Sampling online social networks. IEEE Transactions on Knowledge and Data Engineering, 25(3):662–676, 2013.
29.5 Network Completion
6. When sampling from real API systems one has to be careful that
the throughput in edges per second is not necessarily a good
indicator of how quickly you can gather a representative sample.
Due to pagination, high-throughput sources might return smaller
samples.
29.7 Exercises
2. Compare the CCDF of your sample with the one of the original
network by fitting a log-log regression and comparing the ex-
ponents. You can take multiple samples from different seeds to
ensure the robustness of your result.
Mesoscale
30
Homophily
access the different. In other words, homophily is not only the result
of our preferences, but also of social constructs. That is why the term
"homophily" is problematic, and we use it only for historical reasons.
don't study this fact any more because it's so boringly obvious. In
this, we're truly similar to other animals we often look down to7.
Rather than asking whether romantic ties show homophily, it's more
interesting to use the degree of homophily of romantic ties to compare
societies.

7. Yuexin Jiang, Daniel I Bolnick, and Mark Kirkpatrick. Assortative mating in animals. The American Naturalist, 181(6):E125–E138, 2013.

In Figure 30.3 you see an example of mixed marriage in the United
States. To that diagonally dominated matrix, you have to add the
consideration that the United States is probably one of the most
diverse countries in the world. Imagine how this would look elsewhere!

8. Salvatore Scellato, Anastasios Noulas, Renaud Lambiotte, and Cecilia Mascolo. Socio-spatial properties of online location-based social networks. In ICWSM, 2011.
9. Kerstin Sailer and Ian McCulloh. Social networks and spatial configuration—how office layouts drive social interaction. Social Networks, 34(1):47–58, 2012.
Once you have an ego network, you can start investigating its
“global” properties such as the degree distribution or its homophily,
and these are not properties of the global network as a whole, but of
the local neighborhood of the ego, the ego network, which lives in
the mesoscale. Ego networks are frequently used in social network
analysis18, 19, for instance to estimate a person's social capital20.

18. Stephen P Borgatti, Ajay Mehra, Daniel J Brass, and Giuseppe Labianca. Network analysis in the social sciences. Science, 323(5916):892–895, 2009.
19. Jure Leskovec and Julian J Mcauley. Learning to discover social circles in ego networks. In Advances in Neural Information Processing Systems, pages 539–547, 2012.
20. Stephen P Borgatti, Candace Jones, and Martin G Everett. Network measures of social capital. Connections, 21(2):27–36, 1998.

A consequence of this procedure is that we know that an ego node is
connected to all nodes in its ego network. This is unfortunate in some
cases, depending on our analytic needs. For instance, all ego networks
have a single connected component and will have a diameter of two. If
those forced properties are undesirable, one can extract an ego network
and then remove the ego and all its connections.
30.2 Quantifying Homophily
r = ( ∑_i e_ii − ∑_i a_i b_i ) / ( 1 − ∑_i a_i b_i ),

where e_ii is the proportion of edges connecting two nodes both with
value i, a_i is the probability that an edge has as origin a node with
value i, and b_i is the probability that an edge has as desti-
nation a node with value i. In an undirected network, the latter two
are equal: ai = bi . This formula takes values between −1 (perfect dis-
assortativity) and 1 (perfect assortativity: each attribute is a separate
component of the network).
[Figure 30.6: How to calculate homophily using the formula in the text: ((8/22 + 12/22) − ((10/22)² + (14/22)²)) / (1 − ((10/22)² + (14/22)²)) ≈ 0.766.]
In Figure 30.6 we have two values i: red and green. There are 22
edges in the graph: eight green-green edges – thus the probability
is 8/22 – and 12 red-red edges – thus the corresponding eii value is
12/22. Ten edges originate (or end) in a green node: ai = bi = 10/22;
and 14 originate (or end) in a red node: ai = bi = 14/22. The final
value of homophily is ∼ 0.766. This value is interpretable as a sort of
Pearson correlation coefficient, which means that 0.766 is pretty high.
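networkx ships an implementation of this coefficient, so you rarely need to compute it by hand. A minimal sketch on a toy graph with a made-up categorical attribute; the graph and the attribute values are arbitrary.

import networkx as nx

# Hypothetical network with a categorical "color" attribute on each node.
g = nx.erdos_renyi_graph(100, 0.05, seed=0)
for n in g.nodes():
    g.nodes[n]["color"] = "red" if n < 50 else "green"

# Attribute assortativity coefficient: the r defined above,
# 1 = perfect homophily, -1 = perfect disassortativity.
print(nx.attribute_assortativity_coefficient(g, "color"))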
and absent. The terminology should not fool you. In this case, we
are not referring to the edge’s weight (Section 6.3). This is rather
a categorical difference, more akin to multilayer networks (Section
7.2). A weak tie is established between individuals whose social
circles do not overlap much. A strong tie is the opposite: an edge
between nodes well embedded in the same community. The absent
tie is more of a construct in sociology, which lacks a well-defined
counterpart in network science. It can be considered as a potential
connection lurking in the background. For instance, there is an
absent tie between you and that neighbor you always say “hello” to
but never interact beyond that. You could consider an absent tie as
one of the most likely edges to appear next, if you were to perform a
classical link prediction (Part VII).
You can see now that you can have strong, weak, and absent ties
in an unweighted network. We can, of course, expect a correlation
between being a weak tie and having a low weight. However, we can
construct equally valid scenarios in which there is an anti-correlation
instead. For instance, we could weight the edges by their edge be-
tweenness centrality (Section 14.2). A weak tie must have a high edge
betweenness, because by definition it spans across communities and
thus all the shortest paths going from one community to the other
must pass through it.
Note that, notwithstanding their usefulness in favoring information spread, weak ties are not the only game in town in a society. The competing concept of the "strength of strong ties" shows that strong ties are important as well. They are specifically useful in times of uncertainty: "Strong ties constitute a base of trust that can reduce resistance and provide comfort²⁵".

25 David Krackhardt, N Nohria, and B Eccles. The strength of strong ties. Networks in the knowledge economy, 82, 2003
drinking²⁷.

27 http://ncase.me/crowds-prototype/
Since humans are social animals and tend to succumb to peer pressure, homophily can be a channel for behavioral changes. In a health study, researchers looked at health indicators from thousands of people in a community over 32 years. They saw that behaviors and health risks that should not be contagious actually are. Take obesity: if you have an obese friend, the likelihood of you becoming obese increases by 57% in the short term²⁸. This is like the Susceptible-Infected epidemic models we saw, even if obesity is not a biological virus. It is rather a social type of virus.

The same goes for smoking, although in this case it worked in the opposite direction: people were quitting in droves²⁹. This is due to social pressure and homophily: a behavior you might not adopt by yourself is brokered by your social circle, which you trust because it is made of people like you – it speaks to your identity.

28 Nicholas A Christakis and James H Fowler. The spread of obesity in a large social network over 32 years. New England Journal of Medicine, 357(4):370–379, 2007
29 Nicholas A Christakis and James H Fowler. The collective dynamics of smoking in a large social network. New England Journal of Medicine, 358(21):2249–2258, 2008
Another paper shows strong homophily in political blogs³⁰. In Figure 30.10 we see a visualization of how people writing online about politics connect to each other. A common political vision is the clear driving force behind the creation of a hyperlink from one blog to another.

30 Lada A Adamic and Natalie Glance. The political blogosphere and the 2004 us election: divided they blog. In Proceedings of the 3rd international workshop on Link discovery, pages 36–43. ACM, 2005
ing how this might already have happened³⁹.

39 Alessandro Bessi and Emilio Ferrara. Social bots distort the 2016 us presidential election online discussion. 2016
30.5 Summary
30.6 Exercises
possible explanation.

Figure 31.1: A scatter plot we can use to visualize degree assortativity. For each edge u-v, we have the degree of one node (ku) on the x axis and of the other node (kv) on the y axis. The data point color tells you how many edges have that particular degree combination.
index of degree assortativity. This is the first of two options you have
if you want to quantify the network’s assortativity. You iterate over
all the edges in the network and put into two vectors the degrees of
the nodes at the two endpoints. Note that each edge contributes two
entries to this vector – unless your network is directed. So, if your
network only contains a single edge connecting nodes 1 and 2, your
two vectors are x = [k1 , k2 ] and y = [k2 , k1 ], with k v being the degree
of node v. Then, assortativity is simply the Pearson correlation
coefficient of these two vectors.
There is only one way to achieve perfect degree assortativity. In
such a scenario, each node is connected only to nodes with the exact
same degree. This is true only in a clique. Thus, a perfectly degree
assortative network is one in which each connected component is a
clique.
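As a quick sketch of this first option, assuming an undirected networkx graph; recent networkx versions also ship a built-in `degree_assortativity_coefficient` that should agree with it:

```python
import networkx as nx
import numpy as np

def degree_assortativity(G):
    # Each undirected edge contributes both (ku, kv) and (kv, ku) to the two vectors.
    x, y = [], []
    for u, v in G.edges():
        x += [G.degree(u), G.degree(v)]
        y += [G.degree(v), G.degree(u)]
    return np.corrcoef(x, y)[0, 1]                 # Pearson correlation of the two vectors

G = nx.karate_club_graph()
print(degree_assortativity(G))
print(nx.degree_assortativity_coefficient(G))      # built-in, for comparison
```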
Figure: the average neighbor degree avg(kN_u) as a function of node degree ku in four real world networks: (a) co-authorship in scientific publishing, (b) P2P network, (c) Internet routers, (d) Slashdot social network.
Figure 31.5: The (dis)assortativity inducing model. (a) Select two pairs of connected nodes (in green the edges we select). (b) Assortativity inducing move. (c) Disassortativity inducing move.
Note that this swap doesn't always change the topology nor alter the characteristics of the network. For instance, if all nodes have the same degree, the move would not affect assortativity. But, after enough trials in a large enough network, you'll see that these operations will have the desired effect.
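A rough sketch of one such move, under my own naming: pick two random edges and rewire their four endpoints so that similar degrees pair up (assortative) or dissimilar ones do (disassortative), leaving every node's degree untouched.

```python
import random

def degree_preserving_swap(G, assortative=True):
    # Pick two distinct edges at random.
    (a, b), (c, d) = random.sample(list(G.edges()), 2)
    nodes = sorted({a, b, c, d}, key=G.degree)
    if len(nodes) < 4:
        return                                # the edges share a node: skip this trial
    if assortative:
        new_edges = [(nodes[0], nodes[1]), (nodes[2], nodes[3])]  # low with low, high with high
    else:
        new_edges = [(nodes[0], nodes[3]), (nodes[1], nodes[2])]  # lowest with highest
    if any(G.has_edge(u, v) for u, v in new_edges):
        return                                # would create a parallel edge: skip this trial
    G.remove_edges_from([(a, b), (c, d)])
    G.add_edges_from(new_edges)
```

Each node loses one incident edge and gains one, so the degree sequence never changes; repeating the move many times drifts the network in the desired direction.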
Degree assortativity, as I discussed it so far, is defined for undirected networks. There are straightforward extensions for directed networks¹⁵. The standard strategy is to look at four correlation coeffi-

15 Jacob G Foster, David V Foster, Peter Grassberger, and Maya Paczuski. Edge direction and the structure of networks. Proceedings of the National Academy of Sciences, 107(24):10815–10820, 2010
In other words, the identity line divides the space in two. Above
the identity line we have all the nodes for which, on average, the
neighbor degree is higher than the node’s degree. Below the identity
line it's the opposite: the node's degree is higher than the neighbors' degree.
At first glance, the situation seems balanced. There are as many
points above the identity line as there are below. However, remember
that we’re aggregating all nodes with the same degree value in a sin-
gle point. We know that the degree has a broad distribution, because
we visualized it for the co-authorship network before. Therefore,
there are way more nodes above the identity line than below.
That's the friendship paradox: your friends are, on average, more popular than you¹⁶,¹⁷! This means that, for the average node, its degree is lower than the average degree of its neighbors. This is actually pretty obvious once you think about it: a node with degree k appears in k other nodes' averages, and hence is "over-counted" by an amount equal to how much larger it is than the network's mean degree. A high degree node appears in many more node neighborhoods than does a low degree node, and hence it skews many local averages. The only way to escape this paradox is by having a network whose degree distribution is mostly regular: for instance, small-world networks (Section 17.2) are usually immune to the friendship paradox because most nodes have the same degree (the probability of rewiring is low).

16 Scott L Feld. Why your friends have more friends than you do. American Journal of Sociology, 96(6):1464–1477, 1991
17 Ezra W Zuckerman and John T Jost. What makes you think you're so popular? self-evaluation maintenance and the subjective side of the "friendship paradox". Social Psychology Quarterly, pages 207–223, 2001
The friendship paradox sounds pretty depressing, but we actually
already made use of it in a rather uplifting scenario. The effective
“vaccinate-a-friend” scheme I discussed in Section 21.3 is nothing else
than a practical application of this network property.
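If you want to check the paradox on your own data, a few lines suffice: compare each node's degree with the average degree of its neighbors and count the "losers". The helper name and the synthetic graph below are just for illustration.

```python
import networkx as nx

def friendship_paradox_share(G):
    losers = 0
    for u in G.nodes():
        neighbors = list(G.neighbors(u))
        if not neighbors:
            continue
        avg_neighbor_degree = sum(G.degree(v) for v in neighbors) / len(neighbors)
        if G.degree(u) < avg_neighbor_degree:
            losers += 1
    return losers / G.number_of_nodes()

G = nx.barabasi_albert_graph(10000, 2)       # broad degree distribution
print(friendship_paradox_share(G))           # typically well above one half
```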
Figure 31.7: The data model of the business to business network. The edge color corresponds to the node making the claim about the transaction. For instance, the green node reports selling 90 to the red node and buying 80 back.
This trustworthiness score is a quantitative attribute. It is strongly
correlated with the likelihood that the business was in fact cheating
on their taxes, as I have information on whether the audited businesses
were fined and, if they were, how much they had to pay.
Simulations show that, with this correction, in a randomly wired
Figure 31.8: The average trustworthiness difference with neighbors (x axis: "Avg Neigh Trust Diff") and with non-neighbors (y axis). The blue line shows the identity, and the point color the number of nodes with the given value combination.
income, or tax fraud – will get its own paradox for free.
31.4 Summary
31.5 Exercises
32
Core-Periphery
When you obtain a new network dataset and you plot it for the first time, in the vast majority of cases you will see a blobbed mess. This is because raw network data is usually a hairball, and you need to backbone it, or perform other data cleaning
tasks, as I detailed in Part VIII. However, in some cases, there is an
unobjectionable truth. It might be that, deep down, your network
really is a hairball.
Many large scale networks have a common topology: a very densely connected set of core nodes, and a bunch of casual nodes attaching only to few neighbors. This should not be surprising. If you create a network with a configuration model and you have a broad degree distribution, the high degree nodes have a high probability of connecting to each other – see Section 18.1. The surprising part is that the cores of some empirical networks are even denser than what you'd anticipate by looking at the degree distribution of the network¹!

Since everything that departs from null expectation is interesting, this phenomenon in real world networks has attracted the attention of network scientists. They gave a couple of names to this special meso-scale organization of networks: core-periphery²,³, with the core sometimes dubbed as "rich club"⁴.

1 Shi Zhou and Raúl J Mondragón. The rich-club phenomenon in the internet topology. IEEE Communications Letters, 8(3):180–182, 2004
2 Petter Holme. Core-periphery organization of complex networks. Phys. Rev. E, 72:046111, Oct 2005. doi: 10.1103/PhysRevE.72.046111
32.1 Models
Figure 32.1: The core, highlighted in blue, is a dense area of the network with many connections. In green, a sparser area: the periphery. Connections only go to (or from) a core member, meaning that in the main diagonal in the peripheral area there are no entries larger than zero.

There can be only two types of connections: between core nodes – which is the most common edge type, since the core is densely connected – and between a core-periphery pair. Peripheral nodes do not connect to each other. In the adjacency matrix, which I show in Figure 32.1, there's a big area with no connections. This is known as the "Discrete Model". It is very strict, and real world networks rarely comply with this standard. A perfect discrete model in which the core is composed of a single node is a star.
If you want to detect the core-periphery structure using the discrete model, you have a simple quality measure you want to maximize. This is $\sum_{uv} A_{uv} \Delta_{uv}$, with $A$ being the adjacency matrix, and $\Delta$ a matrix with a value per node pair. An entry in $\Delta$ is equal to one if either of the two nodes is part of the core.
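As a sketch, with the core given as a set of nodes (names mine): the quality is just a count of the edges touching the core, so the interesting part is searching over candidate cores of a fixed size, not evaluating one.

```python
def discrete_cp_quality(G, core):
    # Sum of A_uv * Delta_uv: every edge with at least one endpoint in the core counts.
    # Counting each undirected edge once only rescales the score; the ranking of cores is the same.
    return sum(1 for u, v in G.edges() if u in core or v in core)
```

In practice you would compare this score across candidate core sets of the same size (for instance, built by greedily adding nodes in decreasing degree order), since otherwise putting every node in the core trivially wins.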
Continuous
Reality rarely conforms with strict expectations. Having only two
classes in which to put nodes is exceedingly restrictive. What if nodes
can be sorted in three classes? What if a semi-periphery exists? This
is an enticing opportunity, until you realize that you could also ask:
why three classes? Why not four? Why not five? Why not... you get
the idea.
Other Approaches
The continuous model is powerful, but it doesn’t really tell you much 7
M Puck Rombach, Mason A Porter,
on how you should build your ci values. Rombach et al.7 propose James H Fowler, and Peter J Mucha.
Core-periphery structure in networks.
a way to build such vector, introducing two parameters, α and β. β SIAM Journal on Applied mathematics, 74
determines the size of the core, from the entirety of the network to (1):167–190, 2014
an empty core. α regulates the c score difference between the core
classes. If a node u at a specific core level has a score cu , the node v
at the closest highest core level will have cv = α + cu – or, really, any
function taking α as a parameter.
The higher p, the more weight you’re putting into a classic discrete
model core. Figure 32.3 shows the different effects of different criteria
8
Xiao Zhang, Travis Martin, and
to build ∆.
Mark EJ Newman. Identification of
Other approaches in the literature make use of the Expectation core-periphery structure in networks.
Mazimization or the Belief Propagation algorithms8 . Physical Review E, 91(3):032803, 2015
All methods discussed so far detect the core via statistical inference or the development of a null model. Thus there are issues of scalability when you need to infer the parameters of the model to make the proper inference. Alternative methods exploit the fact that the core is densely connected and its nodes have a high degree. Thus, the expectation is that a random walker would be trapped for a long time in the core⁹. By analyzing the behavior of a random walker, one can detect the boundary of the core.

What about multilayer networks (Section 7.2)? Does it make sense to talk about a core in a network spanning multiple interconnected layers? The analysis of some naturally occurring multilayer networks, for instance the brain connectome¹⁰,¹¹, suggests that it does. However, I would say that at the moment of writing this paragraph, a systematic investigation of core-periphery in multilayer networks, along with a general method to detect them, is an open problem in network science.

9 Athen Ma and Raúl J Mondragón. Rich-cores in networks. PloS one, 10(3):e0119678, 2015
10 Federico Battiston, Jeremy Guillon, Mario Chavez, Vito Latora, and Fabrizio De Vico Fallani. Multiplex core–periphery organization of the human connectome. Journal of the Royal Society Interface, 15(146):20180514, 2018
11 Jeremy Guillon, Mario Chavez, Federico Battiston, Yohan Attal, Valentina La Corte, Michel Thiebaut de Schotten, Bruno Dubois, Denis Schwartz, Olivier Colliot, and Fabrizio De Vico Fallani. Disrupted core-periphery structure of multimodal brain networks in alzheimer's disease. Network Neuroscience, 3(2):635–652, 2019

32.2 Tension with Communities

There is a tension between core-periphery structures (CP) and the classical community discovery (CD) assumption: in CP there isn't space for communities, given that there's only one dense area and everything connects to it. In CD, there's little space for peripheries, and there are multiple cores.
You can see this mathematically. For simplicity, let's consider the discrete model: $\sum_{uv} A_{uv} \Delta_{uv}$. There is a strong correlation between being a high degree node and being in the core: after all, the nodes in the core are highly connected. We'll see that a community is a set of nodes densely connected to each other. Thus, nodes in a community have a relatively high degree and should be considered part of the core. So, for all nodes deeply embedded in a community, $\Delta_{uv} = 1$.
However, a traditional community is also sparsely connected to
nodes outside the community. This means that, if nodes u and v
are in different communities, likely Auv = 0. But we just saw that
their ∆uv should be 1 because they have high degree! All of that
score is wasted! Maximizing the quality function would imply putting all nodes in the same core, sacrificing the defining characteristic
of a core: the fact that nodes in it should tend to connect to each
other. Figure 32.4 shows the difference between the two archetypal
meso-scale organizations.
This is problematic since we have evidence that core-periphery
structures are ubiquitous, and so are communities. There are a
couple of explanations we can use to restore our sanity.
The first explanation is realizing that every network lives on a
and, since the two vendors offer the same quality, the customers
will go to the closest vendor. Therefore, a rational vendor would
move their stand so that it can capture the people in the middle. The
other vendor would do the same. The solution is an equilibrium in
which vendors concentrate in the middle, even though that means
increasing the walk length for every customer.
32.4 Nestedness
picture that looks like Figure 32.9. To highlight the nested pattern, you want to plot the matrix sorting rows and columns by their sum in descending order. If you do so, most of the connections end up in the top-left of the matrix. That is why we often call nested matrices "upper-triangular".
Just like in the discrete model, it's also rare for a real world network to be perfectly nested. Thus, researchers developed methods to calculate the degree of nestedness of a matrix²⁵,²⁶,²⁷. I'm going to give you a highly simplified view of the field.

These measures are usually based on the concept of temperature, using an analogy from physics. A perfectly nested matrix has a temperature of zero, because all the "particles" (the ones in the matrix) are frozen where they should be: in the upper-triangular portion. Every particle shooting outside its designated spot increases the temperature of the matrix, until you reach a fully random matrix which has a temperature of 100 degrees.

25 Wirt Atmar and Bruce D Patterson. The measure of order and disorder in the distribution of species in fragmented habitat. Oecologia, 96(3):373–382, 1993
26 Paulo R Guimaraes Jr and Paulo Guimaraes. Improving the analyses of nestedness for large sets of matrices. Environmental Modelling & Software, 21(10):1512–1513, 2006
27 Samuel Jonhson, Virginia Domínguez-García, and Miguel A Muñoz. Factors determining nestedness in complex networks. PloS one, 8(9):e74025, 2013

A key concept you need to calculate a matrix's temperature is the
isocline. The isocline is the ideal line separating the ones from the
zeroes in the matrix. You should try to draw a line such that most
of the ones are on one side and most of the zeros are on the other.
Usually, there are two ways to go about it.
The parametric way is to figure out which function best describes the curve in your upper triangular matrix: a straight line, a parabola, a hyperbola, the ubiquitous and mighty power-law. Then you fit the parameters so that the isocline hugs said border as closely as possible.
The non-parametric way is to simply create a jagged line following
the row sum or the column sum. If your row (or column) sums to 50,
then you expect a perfectly nested matrix to have 50 ones followed
only by zeros. So your isocline should pass through that point.
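A rough sketch of this non-parametric recipe, on a binary matrix whose rows and columns are already sorted by decreasing sums (names mine; this is only one of many ways to operationalize temperature):

```python
import numpy as np

def nestedness_mistakes(B):
    # B: binary presence/absence matrix, rows and columns sorted by decreasing sum.
    mistakes = 0
    for row in B:
        k = int(row.sum())                       # a perfectly nested row: k ones, then zeros
        mistakes += int((row[:k] == 0).sum())    # zeros sitting on the "ones" side of the isocline
        mistakes += int((row[k:] == 1).sum())    # ones sitting on the "zeros" side
    return mistakes
```

The more mistakes, the "hotter" (less nested) the matrix; a perfectly nested matrix returns zero.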
Once you have your isocline, you can simply calculate how many
mistakes you made: how many ones are on the side of the zeroes,
and vice versa? Figure 32.10 shows an example of the different levels
32.5 Summary
ery).
32.6 Exercises
3. http://www.networkatlas.eu/exercises/32/3/data.txt contains
a nested bipartite network. Draw its adjacency matrix, sorting
rows and columns by their degree.
33
Hierarchies
them in three categories: order, nested, and flow hierarchy⁸. I'll present the three of them in this section, noting how this chapter will then only focus on flow hierarchy. Order and nested hierarchies are covered elsewhere in this book with different names.

8 Enys Mones, Lilla Vicsek, and Tamás Vicsek. Hierarchy measure for complex networks. PloS one, 7(3):e33799, 2012
Order
In an order hierarchy, the objective is to determine the order in which
to sort nodes. We want to place each node to its corresponding level,
according to the topology of its connections. Usually, this is achieved
by calculating some sort of centrality score. The most central nodes
are placed on top and the least central on the bottom.
Figure 33.1: (a) A toy network. (b) Its order hierarchy. I place nodes in descending order of betweenness centrality from top to bottom.
$$H_G = \sum_{u,v \in V} A_{uv} H_{uv} = \frac{1}{2} \sum_{u,v \in V} A_{uv} (h_u - h_v - 1)^2,$$

the $A_{uv}$ term is the adjacency matrix and will cancel to zero the energies of edges that don't exist, because in that case $A_{uv} = 0$. Then SpringRank finds a few clever ways to minimize this quantity, which involve the pseudoinverse of the Laplacian we discussed in Section 11.4.
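To give a flavor of what the minimization looks like, here is my own condensed reading of the unregularized case: setting the gradient of the energy to zero gives a linear system in the h scores, solvable with a pseudoinverse. Treat it as a sketch, not the authors' reference implementation.

```python
import numpy as np

def springrank_scores(A):
    # A: directed adjacency matrix as a numpy array, A[u, v] = 1 if u points to v.
    k_out, k_in = A.sum(axis=1), A.sum(axis=0)
    # Zero gradient of (1/2) sum_uv A_uv (h_u - h_v - 1)^2:
    # [diag(k_out + k_in) - (A + A.T)] h = k_out - k_in
    M = np.diag(k_out + k_in) - (A + A.T)
    return np.linalg.pinv(M) @ (k_out - k_in)    # pseudoinverse: M is singular
```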
In general, solutions to the order hierarchy result in a continuous
value of each node in the network that puts it on the y axis on a line.
This is quite literally equivalent to finding a ranking of nodes in the
network. One can easily see that we already covered this sense of
hierarchical organization of complex networks. The order hierarchy is
nothing more than a different point of view of node centrality. Thus,
I refer to Chapter 14 for a deeper discussion on the topic.
Nested
Nested hierarchy is about finding higher-order structures that fully
contain lower order structures, at different levels ultimately ending
in nodes. In the corporation example, the largest group is the cor-
poration itself, encompassing all workers. We can first subdivide
the corporation into branches; if it is a multinational, these could be regional offices. Each office can be broken down into different
departments, which have teams and, finally, the workers in each
team.
Figure 33.2: (a) A toy network. Colored circles delineate nested substructures. (b) Its nested hierarchy, according to the highlighted substructures. Each node and substructure is connected to the substructure it belongs to.
Flow
In a flow hierarchy, nodes in a higher level connect to nodes at the
level directly beneath it, and can be seen as managers spreading infor-
Figure 33.3: (a) A toy network. (b) Its flow hierarchy.
33.2 Cycles
Figure 33.4: (a) A directed network. In the figure, I highlight in blue the edges partaking in cycles. (b) The condensed version of (a), where all nodes part of a strongly connected component are condensed in a node (colored in blue).
hierarchicalness of a network.
The intuition behind GRC is simple. A network has a strong
hierarchy if there is a node which has an overwhelming reaching
power compared to the average of all other nodes. Or, to put it in
other words, if there is an overseer that sees all and knows all. To
calculate GRC you first estimate the local reach centrality of all nodes
in the network. You then find the maximum value among them, say
LRC MAX . Then:
$$GRC = \frac{1}{|V| - 1} \sum_{v \in V} \left( LRC_{MAX} - LRC_v \right).$$
In practice, you average out its difference with all reach centrality
values in the network. This is an effective way of counteracting the
However, GRC has a blind spot of its own. Since we’re averaging
the differences between the most central node against all others, we
know we will never get a perfect GRC score if there is more than one
node with non-zero local reach centrality. Consider Figure 33.6. I
don’t know about you, but to me it looks like a pretty darn perfect
hierarchy. Yet, we know that the two nodes connected by the root
don’t have a zero local reach centrality. In fact, the GRC for that
network is 0.89.
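If you want to check numbers like the 0.89 above on your own graphs, a direct translation of the definition is enough: compute each node's local reach centrality as the fraction of nodes it can reach, then average the gaps from the maximum (helper names are mine).

```python
import networkx as nx

def global_reach_centrality(G):
    # G: directed graph. Local reach centrality of v: fraction of other nodes reachable from v.
    n = G.number_of_nodes()
    lrc = {v: len(nx.descendants(G, v)) / (n - 1) for v in G.nodes()}
    lrc_max = max(lrc.values())
    return sum(lrc_max - lrc[v] for v in G.nodes()) / (n - 1)

star = nx.DiGraph([(0, v) for v in range(1, 6)])   # a perfect star...
print(global_reach_centrality(star))               # ...scores exactly 1
```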
So, if cycle-based flow hierarchy is too lenient – every directed
acyclic graph is a perfect hierarchy –, GRC is too strict: even flawless
hierarchies might fail to get a 100% score. If for cycle-based flow
hierarchy the perfect hierarchy is a DAG, for GRC the only perfect
hierarchy is a star: a central node connected to everything else, and
no other connections in the network.
33.4 Arborescences
33.5 Agony
$$A(G, l) = \sum_{(u,v) \in E} \max(l_u - l_v + 1, 0).$$
Every time $l_u < l_v$, we contribute zero to the sum. Note that agony requires you to specify the $l_u$ value for all nodes in the network. This is not usually something you know beforehand. So the problem is to find the $l_u$ values that will minimize the agony measure. There are efficient algorithms to estimate the agony of a directed graph¹⁶.

16 Nikolaj Tatti. Hierarchies in directed networks. In 2015 IEEE international conference on data mining, pages 991–996. IEEE, 2015
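Given a candidate level assignment, the measure itself is a one-liner; the hard part, as noted, is finding the levels that minimize it. A sketch of the definition (names mine):

```python
def agony(edges, level):
    # edges: iterable of directed (u, v) pairs; level: dict mapping node -> rank.
    # Edges respecting the hierarchy (level[u] < level[v]) contribute zero.
    return sum(max(level[u] - level[v] + 1, 0) for u, v in edges)
```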
Consider Figure 33.8. In both cases, we have only one edge going
against the flow. Agony, however, ranks these two structures differ-
ently. In Figure 33.8(a), the difference in rank is only of one, thus the
total agony is 2. In Figure 33.8(b), the difference in rank is 3, resulting
in a higher agony. Also a cycle-based flow hierarchy measure takes
different values, as Figure 33.8(b) involves more edges in a cycle (four
edges, versus just two in Figure 33.8(a)).
Ultimately, the resting assumption of agony is the same as that of the cycle-based flow hierarchy. Agony considers any directed acyclic
graph as a perfect hierarchy. Thus it will give perfect scores to the
imperfect hierarchies from Figure 33.5.
To wrap up this chapter, note that all these methods have a graphical flavor to them, to varying degrees, meaning that you can use them to create a picture of your hierarchy, which might help you to navigate
the structure. The most rudimentary method is the cycle-based flow
hierarchy, because it just reduces the graph to a DAG, which doesn’t
help you much.
33.7 Summary
4. We can identify the head of the hierarchy as the node with the
highest reach. Alternatively, arborescences are perfect hierarchies
– directed acyclic graphs with all nodes having in-degree of one,
except the head of the hierarchy having in-degree of zero.
33.8 Exercises
4. Perform the null model test you did for exercise 1 also for global
reach centrality and arborescence. Which method is farther from
the average expected hierarchy value?
34
High-Order Dynamics
Figure 34.3: A simplicial complex. The blue shades represent the two simplices in the complex. A darker shade indicates simplices that overlap.
they are not connected – or 1 – i.e. they only have a single incident
face. Figure 34.4(a) shows a case that is not a manifold, because there
is a face with incidence larger than one, while Figure 34.4(b) shows a
manifold.
book by Bianconi¹⁰ is what you should use to quench your knowledge thirst.

Additionally, there are models that can generate multilayer simplicial complexes¹¹.

10 Ginestra Bianconi. Higher-order networks. Cambridge University Press, 2021
11 Hanlin Sun and Ginestra Bianconi. Higher-order percolation processes on multiplex hypergraphs. Physical Review E, 104(3):034306, 2021

Suppose that you find simplices cool, but you already have your
network – so generating a manifold is not an option. However, your
network is not a simplicial complex. What do you do? Technically,
you can make any arbitrary network into a simplicial complex. You
can consider every clique in the network as a simplex and perform
the analyses I described on that structure.
You should be careful about one thing, though. As I told you
in Section 7.3, you can do the opposite operation: every simplicial
complex can be reduced to a normal complex network by ignoring
the simplices and treating them like cliques. However, these two
operations – simplices to cliques and cliques to simplices – are not
commutative. If you apply them one after the other, you’re not going
to go back to your original simplicial complex – bar weird coinci-
dences. Figure 34.10 shows you a case in which a clique of four
nodes was actually made up by two 2-simplices rather than being
a 3-simplex, and therefore the reconstruction by cliques leads to a
different simplicial complex.
Memory Network

Alternatives to HON exist. For instance, researchers built what they call a "memory network¹³". To model second-order dynamics we can create a line graph. If you recall Section 6.1, a line graph is a

13 Martin Rosvall, Alcides V Esquivel, Andrea Lancichinetti, Jevin D West, and Renaud Lambiotte. Memory in network flows and its effects on spreading dynamics and community detection. Nature communications, 5:4630, 2014
Figure 34.13: (a) A graph. (b) Its line graph version.
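networkx can build the construction for you; iterating it gives the higher orders mentioned above. The edge list below is the toy graph of Figure 34.13(a).

```python
import networkx as nx

G = nx.Graph([(1, 2), (1, 4), (2, 4), (3, 4), (3, 5), (4, 5)])
L = nx.line_graph(G)          # its nodes are the edges of G: second-order dynamics
L2 = nx.line_graph(L)         # line graph of the line graph: third-order dynamics
print(sorted(L.nodes()))
```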
Motif Dictionary
The first approach I consider is the one building motif dictionaries. In
this approach, one realizes that there are different motifs of interest
that have an impact on the analysis. For instance, one could focus
specifically on triangles. Once you specify all the motifs you’re
interested in, you take a traditional network measure and you extend
it to consider these motifs.
This move isn’t particularly difficult once you realize that low-
order network measures still work with the same logic. It’s just that
they exclusively focus on a single motif of the network: the edge.
An edge is a network motif containing two nodes and a connection
between them. Once you realize that a triangle is nothing else but a
motif with three nodes and three edges connecting them, then you’re
in business.
To make this a bit less abstract, let's consider a specific instance of this approach¹⁴,¹⁵,¹⁶. In the paper, the idea is to use motifs to inform community discovery, the topic of Part X. I'm going to delve deeper into the topic in that book part, for now let's just say that we're interested in solving the 2-cut problem (Section 11.5): we want to divide nodes in two groups such that we minimize the number of edges connecting nodes in different groups.

14 Austin R Benson, David F Gleich, and Jure Leskovec. Higher-order organization of complex networks. Science, 353(6295):163–166, 2016
15 Hao Yin, Austin R Benson, Jure Leskovec, and David F Gleich. Local higher-order graph clustering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 555–564. ACM, 2017
16 Charalampos E Tsourakakis, Jakub Pachocki, and Michael Mitzenmacher. Scalable motif-aware graph clustering. In Proceedings of the 26th International Conference on World Wide Web, pages 1451–1460, 2017

In the classical problem we want to make a normalized cut such that the number of edges flowing from one group to the other is minimum, considering some normalization factor. So we have a fraction that looks something like this:

$$\phi(S) = \frac{|E_{S,\bar{S}}|}{\min(|E_S|, |E_{\bar{S}}|)},$$
where S is one of the two groups – i.e. a set of nodes on one side
of the cut –, S̄ is its complement – i.e. S̄ = V − S, the set of nodes
on the other side of the cut. Ex is the set of edges in the set x, and
Ex,y is the set of edges established between a node in x and a node
in y. In practice, let’s find S such that ϕ(S) is minimum. We do so
by minimizing the number of edges between S and its complement,
normalized by their sizes (so that we don’t find trivial solutions by
cutting off simply a dangling leaf node).
But here we say: No! Not the number of edges! We are interested in higher order structures! We want to minimize the number of triangles between groups! What would that look like? Exactly the same. We just count not the number of edges, but the number of arbitrary motifs M spanning between S and non-S:

$$\phi(S, M) = \frac{|M_{S,\bar{S}}|}{\min(|M_S|, |M_{\bar{S}}|)}.$$
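A brute-force sketch of the two ratios, using triangles as the motif (helpers are mine, cubic in the number of nodes, fine only for small graphs, and assuming both sides of the cut contain at least one edge and one triangle; I read E_S as the edges fully inside S):

```python
from itertools import combinations

def edge_conductance(G, S):
    S = set(S)
    cut = sum(1 for u, v in G.edges() if (u in S) != (v in S))
    inside = sum(1 for u, v in G.edges() if u in S and v in S)
    outside = G.number_of_edges() - inside - cut
    return cut / min(inside, outside)

def triangle_conductance(G, S):
    S = set(S)
    cut = inside = outside = 0
    for a, b, c in combinations(G.nodes(), 3):
        if G.has_edge(a, b) and G.has_edge(b, c) and G.has_edge(a, c):
            members = sum(1 for x in (a, b, c) if x in S)
            if members == 3: inside += 1
            elif members == 0: outside += 1
            else: cut += 1                      # the triangle spans the two groups
    return cut / min(inside, outside)
```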
walk at time t: $T_t = e^{-t D^{-1} L}$. If we set t = 1, we recover exactly the transition probabilities of a one-step random walk. But now we're free to change t at will, to get second, third, and any higher order Markov processes.
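A sketch of this operator with numpy and scipy (matrix names mine):

```python
import numpy as np
import networkx as nx
from scipy.linalg import expm

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)
P = A / A.sum(axis=1, keepdims=True)     # one-step random walk transition matrix D^-1 A
walk_laplacian = np.eye(len(A)) - P      # D^-1 L

def transition(t):
    return expm(-t * walk_laplacian)     # T_t = e^(-t D^-1 L)

print(transition(1).sum(axis=1))         # rows sum to one: still a stochastic matrix
```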
Other Approaches
There is a whole bunch of other approaches to introduce high order
dynamics into your network structure. More than I can competently
cover. I provide here some examples with brief, but overly simplistic,
explanations.
One way is to use tensors (Section 5.3). In practice, you create a multidimensional representation of the topological features of the network. Each added dimension of the tensor represents an extra order of relationships between the nodes²⁰,²¹. Then, by operating on this tensor, you can solve any high order problem: in the papers I cite, the problem the authors focus on is graph matching.

Another general category of solutions is a collection of techniques to take high order relation data and transform it into an equivalent first order representation²²,²³. The first order solution in this structure then translates into the high order one, much like in the HON and memory network approach. The cited techniques are also generally applied to the problem of finding a high order cut in the network.

Being able to study high order interactions can help you make sense of many complex systems. For instance, they have been used to explain the remarkable stability of biodiversity in complex ecological systems²⁴. Other application examples include the study of infrastructure networks²⁵ – how to track the high order flow of cars in a road graph – and aiding in solving the problem of controlling complex systems²⁶ – which we introduced in Section 21.4.

20 Michael Chertok and Yosi Keller. Efficient high order matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(12):2205–2215, 2010
21 Olivier Duchenne, Francis Bach, In-So Kweon, and Jean Ponce. A tensor-based algorithm for high-order graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12):2383–2395, 2011
22 Hiroshi Ishikawa. Higher-order clique reduction in binary graph cut. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 2993–3000. IEEE, 2009
23 Alexander Fix, Aritanan Gruber, Endre Boros, and Ramin Zabih. A graph cut algorithm for higher-order markov random fields. In 2011 International Conference on Computer Vision, pages 1020–1027. IEEE, 2011
24 Jacopo Grilli, György Barabás, Matthew J Michalska-Smith, and Stefano Allesina. Higher-order interactions stabilize dynamics in competitive network models. Nature, 548(7666):210, 2017
25 Jan D Wegner, Javier A Montoya-Zegarra, and Konrad Schindler. A higher-order crf model for road network extraction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1698–1705, 2013
26 Abubakr Muhammad and Magnus Egerstedt. Control using higher order laplacians in network topologies. In Proc. of 17th International Symposium on Mathematical Theory of Networks and Systems, pages 1024–1038. Citeseer, 2006

34.4 Summary

1. Classical network analysis is single-order: only the direct connections matter. But many phenomena they represent are high-order: team collaborations are many-to-many relationships or, in a flight passenger network, your next step in the trip is influenced by all the steps you took in the recent past.

2. We can use simplicial complexes to model how many-to-many relationships make synchronization easier, create new potential failure types in infrastructures, and make epidemics more infective than we would normally expect.
5. In High Order Networks you split each node to represent all the
paths that lead to it. In memory networks you instead model
second order dynamics with a line graph, third order dynamics
with the line graph of the line graph, and so on.
34.5 Exercises
1. Calculate the distribution of the k2,0 and k2,1 degrees of the net-
work at http://www.networkatlas.eu/exercises/34/1/data.txt.
Assume every clique of the network to be a simplex.
Communities
35
Graph Partitions
We have reached the part of network analysis that has probably re-
ceived the most attention since the explosion of network science in
the early 90s: community discovery (or detection). To put it bluntly,
community discovery is the subfield of network science that pos-
tulates that the main mesoscale organization of a network is its
partition of nodes into communities. Communities are groups of
nodes that are very related to each other. If two nodes are in the
same community they are more likely to connect than if they are in
two distinct communities. This is an overly simplistic view of the
problem, and we will decompose this assumption when the time
comes, but we need to start from somewhere.
So the community discovery subfield is ginormous. You might
ask: “Why?” Why do we want to find communities? There are many
reasons why community discovery is useful. I can give you a couple
of them. First, this is the equivalent of performing data clustering
in data mining, machine learning, etc. Any reason why you want
to do data clustering also applies to community discovery. Maybe
you want to find similar nodes which would react similarly to your
interventions. Another reason is to condense a complex network into
a simpler view, that could be more amenable to manual analysis or
human understanding.
More generally, decomposing a big messy network into groups
is a useful way to simplify it, making it easier to understand. The
reason why there are so many methods to find communities – which,
as we’ll see, rarely agree with each other – is because there are innu-
merable ways to simplify a network.
It is difficult to give you a perspective of how vast this subfield of
network science is. Probably, one way to do it is by telling you that
there are so many papers proposing a new community discovery
algorithm or discussing some specific aspect of the problem, that
making a review paper is not sufficient any more. We are in need
of making review papers of review papers of community discovery.
This is what I'm going to attempt now.

I think that we can classify review works fundamentally into four categories, depending on what they focus on the most. Review papers on community discovery usually organize community discovery algorithms by:

• Process. In this subtype of review paper, the guiding principle is how an algorithm works¹,²,³,⁴,⁵,⁶,⁷. Does it use random walks rather than eigenvector decomposition? Does it use a propagation dynamic or a Bayesian framework? Does it aim at describing the network as is, or at modeling what underlying latent process could have generated what we observe⁸?

• Performance. Here, all that matters is how well the algorithm works in some test cases⁹,¹⁰,¹¹,¹²,¹³. The typical approach is to find many real world networks, or create a bunch of synthetic benchmark graphs (usually using the LFR benchmark – see Section 18.2), and rank methods on how well they can maximize a quality function.

• Definition. More often than not, the standard community definition I gave you earlier (nodes in the same community connect to each other, nodes in different communities rarely do so) isn't exactly capturing what a researcher wants to find. We'll explore later how this definition fails. Some review works acknowledge this, and classify methods according to their community definition¹⁴,¹⁵,¹⁶. Different processes might be based on different definitions, so there's overlap between this category and the first one I presented, but that is not always the case.

• Similarity. Finally, there are review works using a data-driven approach to figure out which algorithms, on a practical level, return very similar communities for the same networks¹⁷,¹⁸,¹⁹,²⁰. This is similar to the performance category, with the difference that we're not interested in what performs better, only in what performs similarly.

The process approach is the most pedagogical one, because it focuses on us trying to understand how each method works. Thus, it will be the one guiding this book part the most. However, I'll pepper around the other approaches as well, when necessary. This book part will necessarily be more superficial than what one of the excellent surveys out there can do in each specific subtopic of community discovery, so you should check them out.

In this chapter I'll limit myself to discussing the most classical view of community discovery, the one sheepishly following the classical definition. We're going to take a historical approach, exploring how

1 Santo Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174, 2010
2 Santo Fortunato and Darko Hric. Community detection in networks: A user guide. Physics Reports, 659:1–44, 2016
3 Srinivasan Parthasarathy, Yiye Ruan, and Venu Satuluri. Community discovery in social networks: Applications, methods and emerging trends. In Social network data analytics, pages 79–113. Springer, 2011
4 Mason A Porter, Jukka-Pekka Onnela, and Peter J Mucha. Communities in networks. Notices of the AMS, 56(9):1082–1097, 2009
5 Mark EJ Newman. Detecting community structure in networks. The European Physical Journal B, 38(2):321–330, 2004b
6 Leon Danon, Albert Diaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005
7 Natali Gulbahce and Sune Lehmann. The art of community detection. BioEssays, 30(10):934–938, 2008
8 Tiago P Peixoto. Descriptive vs. inferential community detection in networks: Pitfalls, myths and half-truths. Cambridge University Press, 2023
9 Jure Leskovec, Kevin J Lang, and Michael Mahoney. Empirical comparison of algorithms for network community detection. In WWW, pages 631–640. ACM, 2010b
10 Zhao Yang, René Algesheimer, and Claudio J Tessone. A comparative analysis of community detection algorithms on artificial networks. Scientific Reports, 6:30750, 2016a
11 Steve Harenberg, Gonzalo Bello, L Gjeltema, Stephen Ranshous, Jitendra Harlalka, Ramona Seay, Kanchana Padmanabhan, and Nagiza Samatova. Community detection in large-scale networks: a survey and empirical evaluation. Wiley Interdisciplinary Reviews: Computational Statistics, 6(6):426–439, 2014
12 Günce Keziban Orman and Vincent Labatut. A comparison of community detection algorithms on artificial networks. In DS, pages 242–256. Springer, 2009
13 Andrea Lancichinetti and Santo Fortunato. Community detection algorithms: a comparative analysis. Physical Review E, 80(5):056117, 2009
Maximum Likelihood
$$L_{\theta,A} = \sum_{u,v \in A} l_{\theta,A,u,v}.$$
Figure 35.5: A binary node ID schema you'd use to encode a random walk. Each colored arrow points to the ID of the three nodes involved in the orange walk (1001, 101101, 110011); concatenated, the walk reads 1001101101110011.
Infomap saves bits by using community prefixes. Nodes in the
same community get the same prefix. So now we need fewer bits to
uniquely refer to each node, because we can prepend the community
code the first time and then omit it as long as we are in the same
community. Since a community contains, in this case, 9 nodes instead
of 36, we can use shorter codes. We need to add an extra code that
allows us to know we’re jumping out of a community. This is an
overhead, but the assumption here is that a random walker will
spend most of its time in the community, so this community prefix
and jump overhead is rarely used. Figure 35.6 shows this re-labeling
process.
Figure 35.6: The large two-digit codes are the IDs of the communities. Each node gets a new shorter ID, given that IDs need to be community-unique, rather than network-unique. Now the random walk uses the prefix (in red) to indicate in which community it is, then the new shorter node IDs (in blue), and finally adds an extra ID to indicate it's jumping out of a community (in green).
If the partition is good, we can compress the random walk in-
formation by a lot. Consider the example in Figure 35.7. Without
communities we have no overhead, but we need to fully encode our
36 nodes. The path in orange is simply the sequence of node IDs and
can be stored in 72 bits. If we have community partitions, we add
the community prefixes and the jump overhead (for the community
jump in brown), but the node IDs are shorter. The encoding of the
same walk is 56 bits, and we can see that the overhead parts are tiny
compared with the rest.

If my explanation still makes little sense, you can try out an interactive system showing all the mechanics of the map equation approach³⁶. Infomap has been adapted to numerous scenarios. Many involve hierarchical, multilayer and overlapping community detection, which we will explore in later chapters. Other modifications include adding some "memory" to the random walkers³⁷,³⁸ – effectively using higher order networks (Chapter 34).

This means that the walker is not randomly selecting destinations any more, but it follows a certain logic. Consider a network of flight travelers. If you fly from New York to Los Angeles, your next leg of the trip isn't random. You're much more likely, for instance, to come back to New York, since you were in LA just for a vacation or visiting family.

Some approaches do not use vanilla random walks, but also consider the information encoded in node attributes in the map equation³⁹.

36 http://www.mapequation.org/apps/MapDemo.html
37 Michael T Schaub, Renaud Lambiotte, and Mauricio Barahona. Encoding dynamics for multiscale community detection: Markov time sweeping for the map equation. Physical Review E, 86(2):026112, 2012b
38 Martin Rosvall, Alcides V Esquivel, Andrea Lancichinetti, Jevin D West, and Renaud Lambiotte. Memory in network flows and its effects on spreading dynamics and community detection. Nature communications, 5:4630, 2014
39 Laura M Smith, Linhong Zhu, Kristina Lerman, and Allon G Percus. Partitioning networks with node attributes by compressing information flow. ACM Transactions on Knowledge Discovery from Data (TKDD), 11(2):15, 2016
Figure 35.7: The same walk encoded without communities (72 bits) and with communities (56 bits).
Evolutionary Clustering

So far I've framed the community discovery problem as essentially static. You have a network and you want to divide it into densely connected groups. However, we saw in Section 7.4 that many of the graphs we see are views of a specific moment in time. Networks evolve and you might want to take that information into account. A couple of good review works⁴³,⁴⁴ focus on dynamic community discovery and can help you obtain a deeper understanding of this problem. Let's explore what can happen to your communities over time.

One possibility is that the community will grow: it will attract new nodes that were previously unobserved. The other side of the coin is shrinking: nodes that were part of the community disappear from the network. Figure 35.9 shows visual examples of these events.

43 Qing Cai, Lijia Ma, Maoguo Gong, and Dayong Tian. A survey on network community detection based on evolutionary computation. IJBIC, 8(2):84–98, 2016
44 Giulio Rossetti and Rémy Cazabet. Community discovery in dynamic networks: a survey. ACM Computing Surveys (CSUR), 51(2):35, 2018
Figure 35.9: Two things that can happen to your communities in an evolving network: growing and shrinking.
Figure 35.10: Two things that can happen to your communities in an evolving network: merging and splitting.
Figure 35.11: Two things that can happen to your communities in an evolving network: birth and death.
and say that the most recent community is an evolution of the older one. A possible similarity criterion would be calculating the Jaccard coefficient.

A better solution is performing evolutionary clustering⁴⁸. This means that we add a second term to whatever criterion we use to find communities in a snapshot – a procedure sometimes called "smoothing". Suppose you're using Infomap. The aim of the algorithm, as I presented earlier, is to encode random walkers with the lowest number of bits. Let's say this is its quality function – which is known as code length (CL).

In evolutionary clustering you don't just optimize CL. You have CL as a term in your more general quality function Q. The other term in Q is consistency. For simplicity's sake, let's just assume it is some sort of Jaccard coefficient between the partition at time t and the partition at time t − 1. To sum up, a very simple evolutionary clustering evaluates the partition pt at time t as:

$$Q_{p_t} = \alpha CL_{p_t} + (1 - \alpha) J_{p_t, p_{t-1}}.$$

48 Deepayan Chakrabarti, Ravi Kumar, and Andrew Tomkins. Evolutionary clustering. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 554–560. ACM, 2006
Figure 35.12: (a) The community partition of a graph at time t. (b) A partition of the graph at time t + 1 exclusively optimizing the code length, using Infomap. (c) A partition of the graph at time t + 1 balancing a good code length and consistency with the partition at time t.
Here, α is a parameter you can specify which regulates how much
weight you want to give to your previous partitions. For α = 1 you
have standard clustering, while for α = 0 the new information is
discarded and you only use the partition you found at the previous
time step. Figure 35.12 shows you that maximizing CL pt might yield
significantly different results than maximizing a temporally-aware
Q pt function.
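A toy sketch of this blended score. I stand in for the code length with whatever snapshot score you like and, for the consistency term, use the Jaccard coefficient of the two partitions' sets of co-assigned node pairs; both choices and all names are mine.

```python
from itertools import combinations

def co_assigned_pairs(partition):
    # partition: dict node -> community label. Pairs of nodes sharing a community.
    return {frozenset((u, v)) for u, v in combinations(partition, 2)
            if partition[u] == partition[v]}

def evolutionary_quality(snapshot_score, p_now, p_prev, alpha=0.5):
    now, prev = co_assigned_pairs(p_now), co_assigned_pairs(p_prev)
    jaccard = len(now & prev) / len(now | prev) if now | prev else 1.0
    # alpha = 1: pure snapshot clustering; alpha = 0: only consistency with the past matters.
    # If the snapshot score is Infomap's code length, remember it is minimized, so flip one sign.
    return alpha * snapshot_score + (1 - alpha) * jaccard
```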
This is only one – the simplest – of the many ways to perform
smoothing, which the other review works I cited describe in more detail. However, all these methods (and the ones that follow)
have something in common: they are all at odds with the classical
definition of community that I gave you earlier. That is because,
at time t + 1, we’re not simply trying to group nodes in the same
community according to the density of their connections. Eventually,
we’re going to end up with a partition with many edges running
You iterate over all members of U , trying to find the one that
would maximize some community quality function. For instance, it
could be simply the number of edges connecting inside C – with ties
broken randomly. We add to C the v1 node with the most edges to
the local community. Then, U is updated with the new neighbors, the
ones v1 has but v0 did not.
We continue until we have reached our limit: we explored the
number of nodes we wanted to test, or we ran out of time, or we
actually explored all nodes in the component to which v0 belongs.
Figure 35.13 shows some steps of the process. As you can see, we
can terminate after we explore a certain set of nodes. At that point,
we detected the local community of node v0 , without exploring the
entire network. We did not explore the blue nodes – although we
know they exist – and we're absolutely clueless about the existence
of the grey nodes.
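A compact sketch of this greedy expansion, with a deliberately simplistic quality function (edges into the community) and stopping rule (a node budget); all names are mine.

```python
import networkx as nx

def local_community(G, seed, budget=20):
    C = {seed}
    frontier = set(G.neighbors(seed))            # the candidate set U
    while frontier and len(C) < budget:
        # pick the candidate with the most edges into the current community
        best = max(frontier, key=lambda v: sum(1 for w in G.neighbors(v) if w in C))
        C.add(best)
        frontier.discard(best)
        frontier |= set(G.neighbors(best)) - C   # add the newly discovered neighbors
    return C

G = nx.karate_club_graph()
print(local_community(G, seed=0))
```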
The algorithm I just described is one⁶⁵ of the many possible⁶⁶,⁶⁷,⁶⁸. All these algorithms are variations of this exploration approach. More alternatives have been proposed, for instance using random walks like Infomap to explore the network⁶⁹,⁷⁰. You can explore the literature more in depth using one of the survey papers I cited at the beginning of the chapter.

65 Aaron Clauset. Finding local community structure in networks. Physical Review E, 72(2):026132, 2005
66 James P Bagrow. Evaluating local community methods in networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(05):P05001, 2008
67 Feng Luo, James Z Wang, and Eric Promislow. Exploring local community structures in large networks. Web Intelligence and Agent Systems: An International Journal, 6(4):387–400, 2008
68 Symeon Papadopoulos, Andre Skusa, Athena Vakali, Yiannis Kompatsiaris, and Nadine Wagner. Bridge bounding: A local approach for efficient community discovery in complex networks. arXiv preprint arXiv:0902.0871, 2009
69 Lucas GS Jeub, Prakash Balachandran, Mason A Porter, Peter J Mucha, and Michael W Mahoney. Think locally, act locally: Detection of small, medium-sized, and large communities in large networks. Physical Review E, 91(1):012821, 2015
70 Pasquale De Meo, Emilio Ferrara, Giacomo Fiumara, and Alessandro Provetti. Mixing local and global information for community detection in large networks. Journal of Computer and System Sciences, 80(1):72–87, 2014

35.6 Using Clustering Algorithms

I barely started scratching the complex landscape of different approaches to community discovery. We're going to have time in the next chapters to explore even more variations. However, a question might have already dawned on you. If there are such different approaches to detecting communities, how do I find the one that works for me? And, how do I maximize my chances of finding high quality communities? As you might expect, the answers to these questions are difficult and often tend to be subjective.

Let's start from the second one: designing a strategy to ensure you find close-to-optimal communities. In machine learning, we discovered a surprising lesson. If you want to improve accuracy, designing the most sophisticated method in the world usually helps only up to a certain point. Having many simple methods and averaging the results could potentially yield better results.
This observation is at the basis of what we know as "consensus clustering" (or ensemble clustering)⁷¹. This strategy has been applied to detecting communities⁷² in the way you'd expect. Take a network, run many community discovery algorithms on it, average the results. Figure 35.14 shows an example of the procedure. Note how none of the methods (Figure 35.14(a-c)) found the best communities, which is their consensus: Figure 35.14(d). Note also how the third method finds rather absurd and long stretched sub-optimal communities. However, its evident blunders are easily overruled by the consensus between the other two methods, and its tiebreaker improves the overall partition.

71 Alexander Strehl and Joydeep Ghosh. Cluster ensembles—a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3(Dec):583–617, 2002
72 Andrea Lancichinetti and Santo Fortunato. Consensus clustering in complex networks. Scientific Reports, 2:336, 2012
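A sketch of the heart of the procedure: count how often every pair of nodes is grouped together across runs, keep the majority pairs, and read the consensus off the resulting graph. The cited method roughly iterates this until the consensus stops changing; my version stops after one pass and assumes sortable node labels and a recent networkx.

```python
from collections import Counter
import networkx as nx

def consensus_partition(G, algorithms, threshold=0.5):
    # algorithms: callables, each returning a list of node sets (a partition of G).
    together = Counter()
    for algo in algorithms:
        for community in algo(G):
            for u in community:
                for v in community:
                    if u < v:
                        together[(u, v)] += 1
    consensus = nx.Graph()
    consensus.add_nodes_from(G.nodes())
    consensus.add_edges_from(pair for pair, c in together.items()
                             if c / len(algorithms) > threshold)
    return list(nx.connected_components(consensus))

algos = [nx.community.louvain_communities,
         nx.community.greedy_modularity_communities,
         nx.community.label_propagation_communities]
print(consensus_partition(nx.karate_club_graph(), algos))
```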
This is a fine strategy, but you should not apply it blindly. You
have to make sure the ensemble of community detection algorithms
you’re considering is internally consistent. In particular, the methods
should have a coherent and compatible definition of what a commu-
nity is. Limiting ourselves to the small perspective on the problem
from this chapter – ignoring all that is coming next – combining a dy-
namic community discovery with a static local community discovery
will probably not help.
Even mashing together superficially similar algorithms might
result in disaster. For instance, the flow-based communities Infomap
35.7 Summary
35.8 Exercises
2. Find the local communities in the same network using the same
algorithm, by only looking at the 2-step neighborhood of nodes 1,
21, and 181.
36
Community Evaluation
How do you know if you found a good partition of nodes into com-
munities? Or, if you have two competing partitions, how do you
decide which is best? In this chapter, I present to you a battery of
functions you can use to solve this problem. Why a “battery” of func-
tions? Doesn’t “best” imply that there is some sort of ideal partition?
Not really. What’s “best” depends on what you want to use your
communities for. Different functions privilege different applications.
So we need a quality function per application and you need to care-
fully choose your evaluation strategy to match the problem definition
you’re trying to solve with your communities.
Think about “evaluating your communities” more as a data ex-
ploration task than a quest to find the ultimate truth. Since there is
no one True partition – and not even one True definition of commu-
nity as I suggested in the previous chapter –, there also cannot be
one True quality function. You have, instead, multiple ways to see
different kinds of communities, some of which might be more or less
useful given the network you have and the task you want to perform.
In the first two sections, I start by focusing on functions that only
take into account the topological information of your network. In this
case, the only thing that matters are the nodes and edges – at most
we can consider the direction and/or the weight of an edge.
In the latter two sections I move to a different perspective. First,
we consider the network as essentially dynamic and we use commu-
nities as clues as to which links will appear next, under the assump-
tion that communities tend to densify: it is much more likely that a
new link will appear between nodes in the same community. Finally,
we look at metadata that could be attached to nodes, which might be
providing some sort of “ground truth” for the actual communities in
which nodes are grouped into in the real world.
36.1 Modularity
As a Quality Measure
When it comes to functions evaluating the goodness of a commu-
nity partition using exclusively topological information, there is one 1
Mark EJ Newman. Modularity and
undisputed queen: modularity1 . You shouldn’t be fooled by its popu- community structure in networks.
larity: modularity has severe known issues that limit its usefulness. Proceedings of the national academy of
sciences, 103(23):8577–8582, 2006b
We’ll get to those in the second half of this section.
Modularity is a measure following closely the classical definition
of community discovery. It is all about the internal density of your
communities. However, you cannot simply maximize internal density,
as the partition with the highest possible density is a degenerate one,
where you simply have one community per edge – two connected
nodes have, by definition, a density of one.
In modularity, the internal density is instead compared against a null expectation; the Kronecker delta δ(c_v, c_u) restricts the comparison to pairs of nodes in the same community: it is equal to one if c_v = c_u, and zero otherwise.
Modularity’s formula is scary looking, but it ought not to be. In
fact, it’s crystal clear. Let me rewrite it to give you further guidance:
M = \frac{1}{2|E|} \sum_{u,v \in V} \left( A_{uv} - \frac{k_v k_u}{2|E|} \right) \delta(c_v, c_u),
which translates into: for every pair of nodes in the same commu-
nity subtract from their observed relation the expected number of
relations given the degree of the two nodes and the total number of
edges in the network, then normalize so that the maximum is 1.
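As a sanity check, the formula can be computed directly. A minimal Python sketch with networkx, using an arbitrary two-community split of the karate club graph as a placeholder partition:

import networkx as nx

G = nx.karate_club_graph()
communities = [set(range(17)), set(range(17, 34))]  # arbitrary placeholder split
node_comm = {v: i for i, c in enumerate(communities) for v in c}

m = G.number_of_edges()
M = 0.0
for u in G:
    for v in G:
        if node_comm[u] != node_comm[v]:
            continue                                   # delta(c_v, c_u) = 0
        a_uv = 1 if G.has_edge(u, v) else 0
        M += a_uv - G.degree(u) * G.degree(v) / (2 * m)
M /= 2 * m

# Should agree with the built-in (weight=None treats the graph as unweighted).
print(M, nx.community.modularity(G, communities, weight=None))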
Modularity and Stochastic Blockmodels are related. Optimizing
the community partition following modularity is proven to be equiv- 2
Mark EJ Newman. Equivalence
alent to a special restricted version of SBM2 . Specifically, you need between modularity optimization and
to use the degree-corrected SBM – since it fixes the degree distribu- maximum likelihood methods for
community detection. Physical Review E,
tion just like the configuration model does (which is the null model 94(5):052315, 2016a
on which modularity is defined). Then, you must fix p_in and p_out
– the probabilities of connecting to nodes inside and outside their
community – to be the same for all nodes.
In general, you can use both to evaluate the quality of your par-
tition, but there are subtle differences. SBM is by nature generative:
it gives you connection probabilities between your nodes. Modu-
larity doesn’t. On the other hand, modularity has this inherent test
against a null graph which you don’t really have in SBMs. In fact,
you can easily extend modularity in such a way that you can talk about
a statistically significant community partition, one that is sufficiently 3
Brian Karrer, Elizaveta Levina, and
different from chance3 . Mark EJ Newman. Robustness of
community structure in networks.
Physical review E, 77(4):046119, 2008
Figure 36.3: A network with a
community structure. The node
colors represent the community
partition. (a) Optimal partition.
(b) Sub optimal partition. (c)
Partition grouping all nodes in
the same community.
(a) M = 0.723 (b) M = 0.411 (c) M = 0
As a Maximization Target
As I mentioned earlier, modularity can be used in two ways. So far,
we’ve seen the use case of evaluating your partitions. You start from
a graph, you try two algorithms (or the same algorithm twice) and
you get two partitions. The one with the highest modularity is the
preferred one – see Figure 36.4(a).
(Figure 36.4: (a) comparing two partitions: if M(CD1) > M(CD2), then CD1 is better than CD2; (b) iteratively modifying the partition, continuing as long as M increases.)
4
Mark EJ Newman. Fast algorithm
for detecting community structure in
networks. Physical review E, 69(6):066133,
2004c
5
Aaron Clauset, Mark EJ Newman, and
Cristopher Moore. Finding community
structure in very large networks. Physical
The alternative is to directly optimize it: to modify your partition review E, 70(6):066111, 2004
in a smart way so that you’ll get the highest possible modularity 6
Alex Arenas, Jordi Duch, Alberto
score – see Figure 36.4(b). For instance, your algorithm could start Fernández, and Sergio Gómez. Size
reduction of complex networks preserv-
with all nodes in a different community. You identify the node pair ing modularity. New Journal of Physics, 9
which would contribute the most to modularity and you merge it in (6):176, 2007
7
Azadeh Nematzadeh, Emilio Ferrara,
the same community. You repeat the process until you cannot find
Alessandro Flammini, and Yong-Yeol
any community pair whose merging would improve modularity4 , 5 . Ahn. Optimal network modularity for
Most approaches following this strategy return hierarchical commu- information diffusion. Physical review
letters, 113(8):088701, 2014
nities, recursively including low level ones in high level ones, and I 8
Clara Pizzuti. Ga-net: A genetic
cover them in detail in Chapter 37. algorithm for community detection
But there are other ways to optimize modularity. One strategy in social networks. In International
conference on parallel problem solving from
is to progressively condense your network such that you preserve nature, pages 1081–1090. Springer, 2008
its modularity6 . Or using modularity to optimize the encoding of 9
Clara Pizzuti. A multiobjective genetic
information flow in the network, bringing it close to the Infomap algorithm to find communities in
complex networks. IEEE Transactions on
philosophy7 . Another approach is using genetic algorithms8 , 9 or Evolutionary Computation, 16(3):418–430,
extremal optimization10 : an optimization technique similar to genetic 2012
algorithms, which optimizes a single solution rather than having a
10
Jordi Duch and Alex Arenas. Com-
munity detection in complex networks
pool of potential ones. using extremal optimization. Physical
Other approaches include, but are not limited to: review E, 72(2):027104, 2005
Modularity can also be extended beyond undirected, unweighted graphs. For directed graphs, the null model uses in- and out-degrees:

M = \frac{1}{2|E|} \sum_{u,v \in V} \left( A_{uv} - \frac{k_v^{out} k_u^{in}}{|E|} \right) \delta(c_v, c_u).
Since we're here, why stop at directed unweighted graphs?
Let's add weights! Say that the (u, v) edge has weight w_uv, and that
w_u^out is the sum of all edge weights originating from u (with w_u^in
defined similarly for the opposite direction). Then:

M = \frac{1}{2|E|} \sum_{u,v \in V} \left( w_{uv} - \frac{w_v^{out} w_u^{in}}{\sum_{u,v \in V} w_{uv}} \right) \delta(c_v, c_u).
Known Issues
But the issues raised so far are only child’s play. Let’s take a look
at the real problematic stuff when it comes to modularity. There are
three main grievances with modularity. The first is that random
fluctuations in the graph structure and/or in your partition can 21
Roger Guimera, Marta Sales-Pardo,
and Luís A Nunes Amaral. Modularity
make your modularity increase21 . However, I already mentioned that
from fluctuations in random graphs and
modularity can be extended to take care of statistical significance. complex networks. Physical Review E, 70
A harder beast to tame is the infamous resolution limit of modularity. (2):025101, 2004
(Figure 36.6: A ring of cliques, showing another side of the resolution limit problem of modularity. Intuitive partition: M = 0.902; best partition: M = 0.904.)
Internal density
The other side of the conductance coin is the internal density mea-
sure. This is exactly what you’d think it is: how many edges are
inside the community over the total possible number of edges the 38
Filippo Radicchi, Claudio Castellano,
community could host38 . Borrowing EC from the previous section: Federico Cecconi, Vittorio Loreto,
and Domenico Parisi. Defining and
identifying communities in networks.
f(C) = \frac{|E_C|}{|C|(|C| - 1)/2}.
Sciences, 101(9):2658–2663, 2004
So you can see that, in this case, both communities in Figure 36.7
have an internal density of 1, since they’re cliques. Thus, internal
density is unable to distinguish between them, which we would like it to
do, since the community in Figure 36.7(b) is clearly “weaker”, given its high
number of external connections.
You can appreciate the paradoxical result of internal density by
looking at Figure 36.8. Here, one might be tempted to merge the red
and blue communities, since their nodes are so densely connected to
each other and to not much else. Yet, the red nodes are a 5-clique and
the blue nodes are a 4-clique, while the red and blue nodes are not a
9-clique. Thus, the best way to maximize internal density is to split
these clearly very related nodes.
Cut
Originally, we define the cut ratio as the fraction of all possible edges
leaving the community. The worst case scenario is when every node
in C has a link to a node not in C. There are |C | nodes in C and
(|V | − |C |) nodes outside C, so there can be |C |(|V | − |C |) such links.
Thus:
f(C) = \frac{|E_{C,B}|}{|C|(|V| - |C|)}.

f(C) = \frac{|E_{B,C}|}{2|E_C| + |E_{B,C}|} + \frac{|E_{B,C}|}{2(|E| - |E_C|) + |E_{B,C}|}.
The most attentive readers already noticed that the first term in
this equation is conductance. The second term is also a conductance
of sorts. If the first term is the conductance from the community to
the rest of the network, the second term is the conductance from
the rest of the network to the community. The two are not the same,
because the number of edges in C is | EC |, while the number of edges
outside C is | E| − | EC |.
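A minimal Python sketch of the two measures above; G is any networkx graph and C a set of its nodes (both are placeholders):

import networkx as nx

def boundary_edges(G, C):
    # Number of edges with exactly one endpoint inside C, i.e. |E_{C,B}|.
    return sum(1 for u, v in G.edges() if (u in C) != (v in C))

def cut_ratio(G, C):
    e_cb = boundary_edges(G, C)
    return e_cb / (len(C) * (G.number_of_nodes() - len(C)))

def two_sided_conductance(G, C):
    # The sum of the two conductance-like terms discussed above.
    e_cb = boundary_edges(G, C)
    e_c = G.subgraph(C).number_of_edges()     # edges fully inside C
    m = G.number_of_edges()
    return e_cb / (2 * e_c + e_cb) + e_cb / (2 * (m - e_c) + e_cb)

The next measures move the focus from the community as a whole to its individual nodes, looking at their out degree fraction (ODF).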
f(C) = \max_{u \in C} \frac{|(u, v) : v \notin C|}{k_u}.
The idea here is that, in a good community partition, there
shouldn’t be any node with a significant number of edges point-
ing outside the community. We can tolerate if a node has a large
number of edges pointing out, only if the node is a gigantic hub with
a humongous degree k u .
Requiring that there is absolutely no node with a large out degree
fraction might be a bit too much. So we also have a relaxed Average-
ODF:
f(C) = \frac{1}{|C|} \sum_{u \in C} \frac{|(u, v) : v \notin C|}{k_u}.
f(C) = \frac{1}{|C|} \left| \{ u : u \in C, |(u, v) : v \in C| < k_u / 2 \} \right|.
For each node u in C, we count the number of edges pointing
outside the cluster. If it’s more than half of its edges, we mark the
node as “bad”, because it connects more outside the community than
inside. A node shouldn’t do that! The measure tells you the share of
bad nodes in C, which is something you want to minimize.
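A minimal sketch of the three variants (the last one is usually called Flake-ODF in the literature), again with G a networkx graph and C a candidate community; it assumes every node in C has degree at least one:

def odf_scores(G, C):
    out_fractions = []
    bad = 0
    for u in C:
        k_u = G.degree(u)
        out_deg = sum(1 for v in G.neighbors(u) if v not in C)
        out_fractions.append(out_deg / k_u)
        # "Bad" node: fewer than half of its edges stay inside the community.
        if out_deg > k_u / 2:
            bad += 1
    maximum_odf = max(out_fractions)
    average_odf = sum(out_fractions) / len(C)
    flake_odf = bad / len(C)
    return maximum_odf, average_odf, flake_odf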
Figure 36.9 shows an example community, which we can use to
understand the difference between the various ODF variants. In the
Maximum-ODF, we’re looking for the node with the relative highest
out degree. That is node 1 as its degree is just three, and two of
those edges point outside the community. Thus, the Maximum-ODF is 2/3.
If you have a temporal network, you gain a new way to test the
quality of your communities. After all, communities are dense areas
in the network, thus they tell you something about where you expect
to find new links. In a strong assortative community partition, there
are more links between nodes in the same community than between
40
Or so the classical definition of
nodes in different communities. Otherwise, your communities would
community says. I already started
be weak – or there would be no communities at all40 . tearing it apart, and I’ll continue doing
Thus you can use your communities to have a prior about where so, but in this specific test you base
your assumption on this classical
the new links will appear in your dynamic network. This sounds definition. If you have a different
familiar because it is: it is literally the definition of the link prediction definition of community, don’t use this
test.
problem (Part VII). In this approach of community evaluation, you
use the community partition as your input. You use it to estimate the
likelihood of connection between any pair of nodes in the network,
and then you can design the experiment (Chapter 25) and use any
link prediction quality measure as your criterion to decide which
community partition is better. The higher your AUC, the better
looking your ROC curve, the better your partition is.
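A minimal sketch of this evaluation strategy in Python; G_old, new_edges, and the binary same-community score are placeholders for your own data and scoring choices (the labels must contain both appearing and non-appearing pairs for the AUC to be defined):

import itertools
from sklearn.metrics import roc_auc_score

def partition_auc(G_old, new_edges, partition):
    node_comm = {v: i for i, c in enumerate(partition) for v in c}
    appeared = {frozenset(e) for e in new_edges}
    scores, labels = [], []
    for u, v in itertools.combinations(G_old.nodes(), 2):
        if G_old.has_edge(u, v):
            continue                       # only score pairs not yet connected
        # Crude prior: same community -> we expect the link to appear.
        scores.append(1 if node_comm[u] == node_comm[v] else 0)
        labels.append(1 if frozenset((u, v)) in appeared else 0)
    return roc_auc_score(labels, scores)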
Your network might not be temporal, but you could have additional
information about the nodes, besides to which other nodes they
connect (Section 7.5). In this context, node attributes are usually
referred to as “node metadata”. There is a widespread assumption
in community discovery: if you have good node metadata, some of
41
Jaewon Yang and Jure Leskovec.
them have information about the true communities of the network. Defining and evaluating network
Nodes with similar values, following the homophily assumption communities based on ground-truth.
(Chapter 30), will tend to connect to each other. Therefore there Knowledge and Information Systems, 42(1):
181–213, 2015
should be some sort of agreement between the community partition 42
Vincent D Blondel, Jean-Loup Guil-
of the network and the node metadata41 . laume, Renaud Lambiotte, and Etienne
For instance, a classical paper42 analyzed a network whose nodes Lefebvre. Fast unfolding of communities
in large networks. Journal of statistical
were cellphones, connected together if they made a significant num- mechanics: theory and experiment, 2008
ber of calls to each other. The network showed three well-separated communities. (10):P10008, 2008
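A minimal sketch of checking the agreement between a partition and node metadata, using normalized mutual information from scikit-learn on the karate club graph (whose nodes carry a "club" attribute); the choice of Louvain as the partition is just an example:

import networkx as nx
from networkx.algorithms.community import louvain_communities
from sklearn.metrics import normalized_mutual_info_score

G = nx.karate_club_graph()
partition = louvain_communities(G, weight=None, seed=0)
node_comm = {v: i for i, c in enumerate(partition) for v in c}

found = [node_comm[v] for v in G.nodes()]
metadata = [G.nodes[v]["club"] for v in G.nodes()]   # node attribute labels
print(normalized_mutual_info_score(metadata, found))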
36.5 Summary
2. You can also use modularity for something more than evaluating
the communities you found: it can be an optimization target. Your
algorithm will operate on your communities until it cannot find
any additional move that would increase modularity.
36.6 Exercises
approach that was built with this feature in mind from the begin-
ning, rather than adding it as an afterthought. There are two meta-
approaches for baking in hierarchies in your community discovery:
merging and splitting.
Merging
In the merging approach, you start from a condition where all your
nodes are isolated in their own community and you create a criterion
to merge communities. This is a bottom-up approach. It is similar
to the meta-algorithm from earlier, but it’s not really the same. Let’s
take a look at how it works, highlighting where the differences with
the meta-algorithm are.
The template I’m using to describe this approach is the Louvain 5
Vincent D Blondel, Jean-Loup Guil-
algorithm5 . This is one of the many heuristics used to recursively laume, Renaud Lambiotte, and Etienne
Lefebvre. Fast unfolding of communities
merge communities with the aim of maximizing modularity6 , 7 ,
in large networks. Journal of statistical
which happens to be among the fastest and most popular. mechanics: theory and experiment, 2008
The Louvain algorithm starts with each node in its own commu- (10):P10008, 2008
6
Marta Sales-Pardo, Roger Guimera,
nity. It calculates, for each edge, the modularity gain one would get
André A Moreira, and Luís A Nunes
if they were to merge the two nodes in the same community. Then Amaral. Extracting the hierarchical
it merges all edges with a positive modularity gain. Now we have a organization of complex systems.
Proceedings of the National Academy of
different network for which the expensive modularity gains need to Sciences, 104(39):15224–15229, 2007
be recomputed. However, this network is smaller, because of all the 7
Tiago P Peixoto. Hierarchical block
edge merges. You repeat the process until you have all nodes in the structures and high-resolution model
selection in large networks. Physical
same community. Figure 37.3 shows an example of this process. Review X, 4(1):011047, 2014c
(Figure 37.3: An example of the first step of the Louvain algorithm. All in-clique edges (like the representative I highlight in blue, labeled “High Gain”) are merged, while all out-clique edges (like the representative I point to with a gray arrow, labeled “Low Gain”) are ignored.)
moves that can improve modularity, and we can stop. The algorithm 8
Aaron Clauset, Mark EJ Newman, and
inspiring it8 , for instance, only made one merge per modularity gain Cristopher Moore. Finding community
and thus had to perform the expensive modularity gain calculation structure in very large networks. Physical
review E, 70(6):066111, 2004
more often for larger networks.
What the algorithm does, in practice, is build a dendrogram of
communities from the bottom up. Each iteration brings you further
up in the hierarchy. We start with no partition: each node is in its
own community. And then we progressively make larger and larger
communities, until we have only one. Figure 37.4 shows an example
of this approach. This is the crucial difference between the merging
approach and what I discussed previously. In the meta-algorithm, you
don’t perform all the merges, you make lots of them at once when
you run your step 1 to find the initial communities.
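In practice, you rarely implement Louvain yourself. A minimal sketch using networkx's implementation, which also exposes the intermediate levels of the bottom-up dendrogram:

import networkx as nx
from networkx.algorithms.community import louvain_communities, louvain_partitions

G = nx.karate_club_graph()
best = louvain_communities(G, weight=None, seed=0)          # final partition
levels = list(louvain_partitions(G, weight=None, seed=0))   # bottom-up hierarchy
print(len(best), [len(p) for p in levels])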
9
Michelle Girvan and Mark EJ New-
man. Community structure in social
Splitting and biological networks. Proceedings of
the national academy of sciences, 99(12):
In the splitting approach, you do the opposite of what I described 7821–7826, 2002
10
Mark EJ Newman and Michelle
so far. You start with all nodes in the same community and you use Girvan. Finding and evaluating
a criterion to split it up in different communities. For instance by community structure in networks.
identifying edges to cut. This is a top-down approach. Physical review E, 69(2):026113, 2004
11
Filippo Radicchi, Claudio Castellano,
Historically speaking, the first algorithm using this approach used Federico Cecconi, Vittorio Loreto,
edge betweenness as its criterion to split communities9 , 10 . That is and Domenico Parisi. Defining and
not to say there aren’t valid alternatives as your splitting criterion, identifying communities in networks.
Proceedings of the National Academy of
including – but not limited to – edge clustering11 and information Sciences, 101(9):2658–2663, 2004
centrality12 . However, given its historical prominence, I’m going to 12
Santo Fortunato, Vito Latora, and
allow the edge betweenness Girvan-Newman algorithm to have its Massimo Marchiori. Method to
find community structures based on
place under the limelight. information centrality. Physical review E,
The first step of the algorithm is to calculate the edge betweenness of every edge in the network.
The second step of the algorithm is to cut the edge with the high-
est edge betweenness. The final aim is to break the network down
into multiple components. Each component of the network is a com-
munity.
Unfortunately, after each edge deletion you have to recalculate
the betweennesses. Every time you alter the topology of the network
you change the distribution of its shortest paths. This makes edge
betweenness extremely computationally heavy. Calculating the edge
betweenness for all edges takes an operation per node and per edge
(O(|V || E|)) and you have to repeat this for every edge you delete,
resulting in a crazy complexity of O(|V || E|2 ). You cannot apply this
naive algorithm to anything but trivially small networks.
You can now see the parallels with the Louvain method I de-
scribed earlier. The difference is that you are exploring the dendro-
gram of communities from the top down, rather than bottom up.
Each iteration brings you further down in the hierarchy. At the very
top you start with a network with a single connected component. As
you delete edges, you find different connected components. As you
continue, you end up with more and more. At the last iteration, each
node is now isolated.
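A minimal sketch of the splitting approach via networkx: girvan_newman yields one partition per level of the top-down dendrogram, and here modularity is used to pick the cut (feasible only for small graphs, given the complexity discussed above):

import networkx as nx
from networkx.algorithms.community import girvan_newman

G = nx.karate_club_graph()
best_partition, best_m = None, float("-inf")
for partition in girvan_newman(G):
    m = nx.community.modularity(G, partition, weight=None)
    if m > best_m:
        best_partition, best_m = partition, m
print(len(best_partition), best_m)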
Differently from the Louvain algorithm, in the Girvan-Newman
method you do not calculate modularity gains as you explore the dendrogram.
(Figure 37.6: The dendrogram building from the top down typical of a “splitting” approach in hierarchical community discovery. The left panel shows the modularity values of each possible cut, with modularity on the y axis and iterations on the x axis.)
(Figure 37.7: A graph with hierarchical communities (node color according to the community partition at one level of the hierarchy).)
37.5 Summary
(Figure: four panels, (a)–(d), plotting modularity and density against iterations.)
37.6 Exercises
4. Using the algorithm you made for exercise 3, answer these ques-
tions: What is the latest step for which you have the average
internal community edge density equal to 1? What is the modu-
larity at that step? What is the highest modularity you can obtain?
What is the average internal community edge density at that step?
38
Overlapping Coverage
This seems to imply that communities are a clear cut case. Nodes
have a majority of connections to other nodes in their community.
However real world networks do not have to conform to this expecta-
tion, and in fact often they don’t. There are numerous cases in which
nodes belong to multiple communities: to which community does the
center person in Figure 38.1 belong? The red or the blue one?
The classical community definition forces us to make a choice.
Regardless of the choice we make – red or blue – it wouldn’t be a
satisfying solution. The more reasonable answer is “she belongs to
both”. For instance, a person can very well be part of one community
because it is composed of the people they went to school with. And
they can be part of a work community too, made of the people they work with.
Some of these people could be the same, but usually they are not.
The problem is that none of the methods seen so far allow for such
a consideration. For instance, the basic stochastic blockmodel only
allows you to plant a node in a single community, not multiple. Modularity
also has issues, because of the Kronecker delta: since this is going
to be 1 for multiple communities for a node, there will be double-
counting and the formula breaks down.
This is where the concept of overlapping community discovery
was born. We need to explicitly allow for overlapping communities:
communities that can share nodes. There are many ways to do this, 1
Jierui Xie, Stephen Kelley, and
which have been reviewed in several articles1 , 2 dedicated especially Boleslaw K Szymanski. Overlap-
to this sub problem of community detection (itself a sub problem of ping community detection in networks:
The state-of-the-art and comparative
network analysis: it’s communities all the way down).
study. Acm computing surveys (csur), 45
Here we explore a few of the most popular approaches. (4):43, 2013
2
Alessia Amelio and Clara Pizzuti.
Overlapping community discovery
38.1 Evaluating Overlapping Communities methods: A survey. In Social Networks:
Analysis and Case Studies, pages 105–125.
Springer, 2014
Before we delve deep into overlapping community discovery, let’s
amend Chapter 36 to this new scenario. We can have a few options
when we try to evaluate how well we divided the network into
overlapping communities.
Normalized mutual information expects you to put nodes into
a single category. However, there are ways to make it accept an 3
Andrea Lancichinetti, Santo Fortunato,
overlapping coverage3 , 4 . The obstacle is that NMI wants to compare and János Kertész. Detecting the over-
the vector of metadata with the vector containing the community lapping and hierarchical community
structure in complex networks. New
partition. The vector can only have one value per node but, in an Journal of Physics, 11(3):033015, 2009
overlapping coverage, it can have multiple values. Thus we don’t 4
Aaron F McDaid, Derek Greene, and
compare the vectors directly. We compare two bipartite matrices. Neil Hurley. Normalized mutual
information to evaluate overlapping
Suppose you found C communities, and you have A node at- community finding algorithms. arXiv
tributes. You can describe the overlapping coverage in communities preprint arXiv:1110.2515, 2011
with a |V | × C binary matrix, whose u, C entry is equal to 1 if node
u is part of community C. The node attribute matrix is similarly de-
fined. Figure 38.2 shows an example of this procedure. Now you can
calculate the mutual information between the two matrices by pairing
the columns such that we assign to each column on one side the one
on the other side that is the most similar to it.
We can normalize this mutual information in different ways. In
fact, the papers I cited earlier propose six alternatives, providing
different motivations for each of those. These overlapping NMIs
share with their original counterpart the issue of non-zero values for
independent vectors – although they try to mitigate the issue with
different strategies. 5
Alexander J Gates, Ian B Wood,
William P Hetrick, and Yong-Yeol Ahn.
Some researchers have pointed out a few biases in the overlapping
Element-centric clustering comparison
extensions of NMI and similar measures5 . They propose a unified unifies overlaps and hierarchy. Scientific
framework that can evaluate disjoint, overlapping, and even hierarchical partitions. reports, 9(1):8574, 2019
(Figure 38.2: (a) A network with three overlapping communities, encoded by the node’s color. (b) Transforming the overlapping coverage into a binary affiliation matrix, which we can use as input for the overlapping version of NMI.)
(Figure 38.3: Comparing overlapping and fuzzy clustering for node u: the size of the square is proportional to u’s “belonging” coefficient, in share of the number of u’s edges connected to the community.)
(Figure 38.4: The encoding of two random walks in the overlapping version of Infomap. Note that in neither the red nor the blue path we’re crossing community boundaries, so we don’t use the community crossing code.)
Clique Percolation
Clique percolation starts from the observation that communities
should be dense. What is the densest possible subgraph? The clique.
In a clique, all nodes are connected to all other nodes. So the problem
of community discovery more or less reduces to the problem of
finding all cliques in the network. However, this is a bit too strict:
there are subgraphs in the network that, while being very dense and
close to being a clique, are not fully connected. It would be a pity to
split them into many small substructures.
Thus researchers developed the more sophisticated k-clique perco- 16
Imre Derényi, Gergely Palla, and
lation algorithm16 . Clique percolation says that communities must be Tamás Vicsek. Clique percolation in
cliques of at least k nodes, with k being a parameter you can freely random networks. Physical review letters,
94(16):160202, 2005
set. In the first step, the algorithm finds all cliques of size k, whether
they are maximal or not. Then, it attempts to merge two cliques into
the same community if they share at least a (k − 1)-clique.
For instance, consider the example in Figure 38.6, setting the
parameter k = 5. The blue and green 5-cliques only share two nodes,
so they cannot share a 4-clique. But the green and purple do share a 4-
clique, so they are merged (top row). And there is another purple
5-clique that can now be merged with the green community (bottom
row).
17
Tim S Evans. Clique graphs and
This is generally implemented via the creation of a clique graph17 . overlapping communities. Journal
The nodes of a clique graph are the cliques in the original graph. of Statistical Mechanics: Theory and
Experiment, 2010(12):P12037, 2010
We connect two cliques if they share nodes. For instance, if we only
connect cliques sharing k − 1 nodes, then we can efficiently find
all communities by finding all connected components in the clique
graph.
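networkx ships an implementation of k-clique percolation; a minimal sketch (k = 4 is an arbitrary choice, and the returned communities may overlap and may not cover all nodes):

import networkx as nx
from networkx.algorithms.community import k_clique_communities

G = nx.karate_club_graph()
communities = list(k_clique_communities(G, 4))
print([sorted(c) for c in communities])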
This algorithm works well in practice. It has been used to study 18
Marta C González, Hans J Herrmann,
overlapping friendship patterns in school systems18 – due to class- J Kertész, and Tamás Vicsek. Commu-
room being quasi-cliques: pupils have rare but significant friendships nity structure and ethnic preferences in
school friendship networks. Physica A,
across classes –, and in metabolic networks19 . However, it has a
379(1):307–316, 2007
couple of downsides. 19
Shihua Zhang, Xuemei Ning, and
First, finding all cliques in a network is computationally expensive. Xiang-Sun Zhang. Identification of
functional modules in a ppi network
One could fix this problem by setting k to be relatively high. If we by clique percolation clustering. Com-
set k = 5 we know that nodes with degree three or less cannot be putational biology and chemistry, 30(6):
in any community, because they need at least four edges to be part 445–451, 2006
20
Jussi M Kumpula, Mikko Kivelä,
there are developments of this algorithm20 , 21 that are a bit more Kimmo Kaski, and Jari Saramäki.
computationally efficient. Sequential algorithm for fast clique
percolation. Physical Review E, 78(2):
Second, it has limited coverage for sparse networks. That means
026109, 2008
that it might end up being unable to classify nodes in networks 21
Fergal Reid, Aaron McDaid, and Neil
because they are not part of any clique. If you set your k relatively Hurley. Percolation computation in
complex networks. In 2012 IEEE/ACM
low, e.g. k = 4, all nodes with degree equal to one cannot be part International Conference on Advances in
of any community. This is because a node with degree equal to one Social Networks Analysis and Mining,
cannot be part of a 3-clique. Thus it will never be merged into any community. pages 274–281. IEEE, 2012
Node Splitting
Another approach is to simply recognize that a node is part of multi-
ple communities if it has different identities. This is extremely similar
to the approach of overlapping Infomap. In that case we represented
the two identities of the node by giving it two different codes: one
per community to which it belongs. Here we literally split it in two.
We modify the structure of the network in such a way that, when we
are done, by performing a normal non-overlapping community dis-
covery we recover the overlapping clusters. In the resulting structure
we have multiple nodes all referring to the same original one.
If we want to split nodes, we need to answer two questions: which
nodes do we split and how. First we identify the nodes most likely
to be in between communities. If you remember the definition of
22
Steve Gregory. An algorithm to find
betweenness, you’ll recollect that nodes between communities are the overlapping community structure in
gatekeepers of all shortest paths from one community to the other. So networks. In European Conference on
they are the best candidates to split. There are many ways to perform Principles of Data Mining and Knowledge
Discovery, pages 91–102. Springer, 2007
the split, but I’ll focus on the one that involves calculating a special 23
Steve Gregory. Finding overlapping
betweenness: pair betweenness22 , 23 . Pair betweenness is a measure communities using disjoint community
for a pair of edges: the number of shortest paths that use both of detection algorithms. In Complex
networks, pages 47–61. Springer, 2009
them.
For instance, consider the graph in Figure 38.7(a). The most central
node is node 1. To try and split it, we build its split graph. Meaning
that we remove node 1 and we connect all nodes that were connected
by 1. Each edge has a weight: the number of shortest paths in the original graph that use both edges of the corresponding pair, i.e. their pair betweenness.
Recall that, in a stochastic blockmodel, the density of connections is higher between
nodes in the same partition than between nodes in different
partitions. The little problem we need to solve now is how to make
this mathematical machinery work when we want to allow nodes
to be part of multiple communities. That solution constitutes the 24
Edoardo M Airoldi, David M Blei,
Mixed Membership Stochastic Blockmodels24 , an object that I already Stephen E Fienberg, and Eric P Xing.
mentioned in Section 18.2. Mixed membership stochastic block-
models. Journal of Machine Learning
The trick here is that we represent each node’s membership as
Research, 9(Sep):1981–2014, 2008
a vector. The vector tells us how much the node belongs to a given 25
Wenjie Fu, Le Song, and Eric P
community. Then, we also have a community-community matrix, Xing. Dynamic mixed membership
blockmodel for evolving networks. In
that tells us the probability of a node belonging to community c1 Proceedings of the 26th annual interna-
to connect to a node belonging to a community c2 . These are the tional conference on machine learning,
two ingredients that replace the simple community partition in the pages 329–336. ACM, 2009
regular SBM. From this moment on, you attempt to find the set 26
Eric P Xing, Wenjie Fu, Le Song,
et al. A state-space mixed membership
of community affiliation vectors and the community-community blockmodel for dynamic network
probability matrix that are most likely to reproduce your observed tomography. The Annals of Applied
Statistics, 4(2):535–566, 2010
data, exactly as you do in SBM. 27
Qirong Ho, Le Song, and Eric Xing.
Just like we saw in Section 35.4, we can have dynamic MMSB, Evolving cluster mixed-membership
adding time to the mix25 , 26 , 27 , 28 : the community affiliation vectors blockmodel for time-evolving networks.
In Proceedings of the Fourteenth Interna-
and the community-community matrix can change over time. There tional Conference on Artificial Intelligence
is also a hierarchical (Chapter 37) variant of MMSB, allowing a nested and Statistics, pages 342–350, 2011
community structure29 . 28
Kevin S Xu and Alfred O Hero.
Dynamic stochastic blockmodels:
Statistical models for time-evolving
networks. In International conference
on social computing, behavioral-cultural
modeling, and prediction, pages 201–210.
Springer, 2013
Community Affiliation Graph 29
Tracy M Sweet, Andrew C Thomas,
and Brian W Junker. Hierarchical mixed
membership stochastic blockmodels for
Affiliation graphs have often been used to describe the overlapping multiple networks and experimental
interventions. Handbook on mixed
community structure of real world networks30 . In a community
membership models and their applications,
affiliation graph you assume that you can describe your observed pages 463–488, 2014
network with a latent bipartite network. In this bipartite network, 30
Jae Dong Noh, Hyeong-Chai Jeong,
Yong-Yeol Ahn, and Hawoong Jeong.
the nodes of one type are the nodes of your observed network. The
Growing network model for community
other type, the latent nodes, represent your communities. Nodes are with group structure. Physical Review E,
connected to the communities they belong to. This is the community affiliation graph, shown in Figure 38.9. 71(3):036131, 2005
(Figure 38.9: (a) A graph with overlapping communities indicated by the colored outlines. (b) Its corresponding community affiliation graph. The community latent nodes are triangular and their color corresponds to the color used in (a).)
Line Graphs
To cluster the edges rather than the nodes we can transform the 32
TS Evans and Renaud Lambiotte. Line
network into its corresponding line graph32 . In a line graph, as we graphs, link partitions, and overlapping
communities. Physical Review E, 80(1):
saw in Section 6.1, the edges become nodes and they are connected 016105, 2009
if they’re incident on the same node. A way to do so is to generate a
weighted line graph.
(Figure 38.10: (a) A simple graph. (b) A bipartite version of (a) connecting each node to its edges.)
To create a line graph you first transform the network into a bi-
partite structure, connecting each node to the edges incident to it, as
Figure 38.10 shows. Then you project this network over the edges.
The most important thing to define is how to weight the edges in
the line graph. Different weight profiles will steer the community
discovery on the line graph in different directions.
You could use any of the weighting schemes I discussed in Chap-
ter 26, but the researchers proposing this method also have their
suggestions. The reason you might need a special projection is be-
cause you want nodes that are part of an overlap to give their edges
lower weights, because their connections are spread out in different
communities.
For instance, one can measure the similarity of two edges (u, k) and (v, k) sharing node k via the Jaccard coefficient of the neighborhoods N_u and N_v of their other endpoints:

S_{(u,k),(v,k)} = \frac{|N_u \cap N_v|}{|N_u \cup N_v|}.
The edges with the highest S value are merged in the same com-
munity. For instance, in Figure 38.12, edges (1, 2) and (1, 3) have a
high S value: the neighborhoods of nodes 2 and 3 are identical, thus
S(1,2),(1,3) = 1. On the other hand, edges (4, 7) and (7, 8) only have
one node in the numerator, thus: S(4,7),(7,8) = 1/6.
Then, the merging happens recursively for lower and lower S
values, building a full dendrogram, as we saw in Chapter 37 for
hierarchical community discovery. We then need a criterion to cut
the dendrogram. We cannot use modularity, because these are link
communities, not node communities.
(Figure 38.12: A graph and its best link communities. The color of the edge represents its community. Nodes are part of all communities of their links. For instance, node 4 belongs to three communities: red, blue, and purple.)
The criterion used instead is the partition density. For each link community c:

D_c = \frac{|E_c| - (|V_c| - 1)}{|V_c|(|V_c| - 1)/2 - (|V_c| - 1)},
which is the number of links in c, normalized by the maximum
number of links possible between those nodes (|Vc |(|Vc | − 1)/2), and
its minimum |Vc | − 1, since we assume that the subgraph induced by
c is connected. Note that, if |Vc | = 2, we simply take Dc = 0. All Dc
scores for all cs in your link partition are aggregated to find the final
partition density, which is the average of Dc weighted by how many
links are in c: D = \frac{1}{|E|} \sum_c |E_c| D_c.
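A minimal sketch of the partition density; each link community is assumed to be given as a set of edges of G:

def partition_density(G, link_communities):
    m = G.number_of_edges()
    D = 0.0
    for edges in link_communities:
        nodes_c = {u for edge in edges for u in edge}
        n_c, m_c = len(nodes_c), len(edges)
        if n_c <= 2:
            continue                       # D_c = 0 by convention
        d_c = (m_c - (n_c - 1)) / (n_c * (n_c - 1) / 2 - (n_c - 1))
        D += m_c * d_c
    return D / m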
Ego Networks
Assuming that links exist for one primary reason usually works
well, but it is a problematic assumption. Let's look back at the case
of work and school communities. What would happen if you were to
end up working in the same company and playing on the same team as a
former schoolmate? Is it still fair to say that the link between the two
of you exists for only one predominant reason?
Modeling truly overlapping communities can get rid of this prob-
lem. There are many ways to do it, but we’ll focus on one that is
easy to understand. The starting observation is that networks have
large and messy overlaps. However, just like in the assumption of
clustering links, here we realize that the neighbors of a node usually
are easier to analyze. It is easy for a node to look at a neighbor and
35
Michele Coscia, Giulio Rossetti, Fosca
say: “I know this other node for this reason (or set of reasons)”.
Giannotti, and Dino Pedreschi. Demon:
The procedure35 works as follows, and I use Figure 38.13 to guide a local-first discovery method for
you. First, we extract the ego network of a node, removing the ego overlapping communities. In Proceedings
of the 18th ACM SIGKDD international
itself. This creates a simpler network to analyze. In the figure, I conference on Knowledge discovery and
start by looking at node 1 on the top right. This is a graph with two communities. data mining, pages 615–623. ACM, 2012
(Figure 38.13: The process of community discovery via the breaking down of the network into ego networks: extract the ego network of each node, apply community discovery to it, and merge similar local communities.)
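A minimal sketch of this local-first procedure in Python. Label propagation and the Jaccard-based merging threshold are stand-ins for whatever local algorithm and merging criterion you prefer; this is not a faithful reimplementation of the cited method.

import networkx as nx
from networkx.algorithms.community import label_propagation_communities

def ego_communities(G, merge_threshold=0.5):
    local = []
    for ego in G:
        ego_net = G.subgraph(G.neighbors(ego)).copy()  # ego network minus the ego
        for comm in label_propagation_communities(ego_net):
            local.append(set(comm) | {ego})            # add the ego back
    # Naive merging pass: join local communities with high Jaccard overlap.
    merged = []
    for c in local:
        for m in merged:
            if len(c & m) / len(c | m) >= merge_threshold:
                m |= c
                break
        else:
            merged.append(set(c))
    return merged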
38.8 Summary
38.9 Exercises
3. Implement the ego network algorithm: for each node, extract its
ego minus ego network and apply the label propagation algorithm,
Since it has been a looming presence across this entire book part, let’s
start again with modularity, the elephant in the room of community
discovery. Network scientists in the community detection business
love modularity. If there is a scenario in which modularity doesn’t
work, they panic and start amending it to hell, until it works again.
We’ve seen this with directed and overlapping community discovery,
and we’re seeing it again.
There are a couple of alternatives when it comes to defining a modu-
larity that works for bipartite networks. If you remember the original
version of the modularity, it hinges on the fact that we want the parti-
tion to divide the network in communities that are denser than what 2
Michael J Barber. Modularity and
we would expect given a null model – the configuration model. Thus, community detection in bipartite
networks. Physical Review E, 76(6):066102,
extending modularity means to find the right formulation of a null 2007
model for bipartite networks2 , 3 . 3
Roger Guimerà, Marta Sales-Pardo,
This is not that difficult; the only thing to keep in mind is that identification in bipartite and directed
identification in bipartite and directed
the expected number of edges in a bipartite network is different networks. Physical Review E, 76(3):036102,
than in a regular network. So, while in the traditional modularity 2007
the configuration model connection probability was k_u k_v / (2|E|), here it is
instead k_u k_v / |E|, with the added constraint that u and v need to be
nodes of unlike type. The sum of modularity is made only across
pairs of nodes of unlike types, otherwise we would have negative
modularity contributions from nodes that cannot be connected,
which would make the modularity estimation incorrect.
To see why this is the case, suppose that we’re checking u and v
and they are of the same type. Since they are of the same type and
we’re in a bipartite network, they cannot connect to each other, so
A_uv = 0. But they are both part of the network, thus k_u ≠ 0 and
k_v ≠ 0. Thus k_u k_v / |E| > 0, meaning that A_uv − k_u k_v / |E| < 0. Negative
modularity contribution.
Once you have a proper bipartite modularity you can use any of
the modularity maximization algorithms to find modules in your 4
Stephen J Beckett. Improved com-
network, or even specialized ones4 . munity detection in weighted bipartite
networks. Royal Society open science, 3(1):
140536, 2016
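A minimal sketch of a bipartite modularity along these lines; top_nodes identifies one of the two node types, and the 1/|E| prefactor is an assumption of this sketch (the text above does not spell out the normalization):

def bipartite_modularity(G, top_nodes, communities):
    node_comm = {v: i for i, c in enumerate(communities) for v in c}
    m = G.number_of_edges()
    bottom_nodes = set(G) - set(top_nodes)
    M = 0.0
    for u in top_nodes:
        for v in bottom_nodes:
            if node_comm[u] != node_comm[v]:
                continue                               # delta(c_u, c_v) = 0
            a_uv = 1 if G.has_edge(u, v) else 0
            M += a_uv - G.degree(u) * G.degree(v) / m  # null model: k_u k_v / |E|
    return M / m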
39.2 Via Projection
An alternative is to project the bipartite network into its two unipartite versions and then you
analyze them at the same time with specialized techniques. This dual 6
David Melamed. Community struc-
projection approach has been applied to community discovery6 , with tures in bipartite networks: A dual-
encouraging results. projection approach. PloS one, 9(5):
e97823, 2014
Bi-Clique Percolation
The solution is to perform the community discovery directly on the
bipartite structure. Here, we use the concept of bi-clique we saw
earlier. Remember that a clique is a set of nodes in which all possible
edges are present. A bi-clique is the same thing, considering that
some edges in a bipartite network are not possible. For instance, a
5-clique in a unipartite network is a graph with five nodes and ten
edges. In a bipartite network, a 2,3-clique has two nodes of type 1,
three nodes of type 2, and all nodes of type 1 are connected to all nodes of type 2.
(Figure: two 2,4-cliques sharing a 0,1-clique.)
This need not worry us. We can redefine the clustering coeffi-
cient to make sense in a bipartite network. In a unipartite network,
the triangle is the smallest non-trivial cycle, the one that does not
backtrack using the same edge, as you can see in Figure 39.4(a). We
can also have a smallest non-trivial cycle in bipartite networks. It
involves four nodes, as Figure 39.4(b) shows. So we can say that the
local clustering coefficient of a node in a bipartite network is the
number of times such cycles appear in its neighborhood, divided by 14
Peng Zhang, Jinliang Wang, Xiaojia
Li, Menghui Li, Zengru Di, and Ying
the number of times they could appear given its degree14 . Fan. Clustering coefficient and com-
Let us assume that we want to know the local square clustering munity structure of bipartite networks.
coefficient of node z. If we say that nodes u, v, and z are involved Physica A: Statistical Mechanics and its
Applications, 387(27):6869–6875, 2008
in s_uvz squares, then the contribution of nodes u and v to the square Physica A: Statistical Mechanics and its
clustering coefficient of z is: Applications, 387(27):6869–6875, 2008

C^4_{u,v}(z) = \frac{s_{uvz}}{s_{uvz} + (k_u - \eta_{uvz}) + (k_v - \eta_{uvz})},
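networkx provides a square (4-cycle) clustering coefficient, which follows this general idea; whether its exact normalization matches the formula above is not checked here, so treat it as an approximation of the measure being discussed:

import networkx as nx

B = nx.complete_bipartite_graph(2, 3)       # a small bipartite example
print(nx.square_clustering(B))              # dict: node -> square clustering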
(Figure 39.5: An example of a bipartite network on which we can calculate the local clustering coefficient of node z.)
39.5 Summary
39.6 Exercises
The last chapter of community discovery, at least for this book, fo-
cuses on multilayer networks. In multilayer networks, nodes can
belong to different layers and thus they can connect for different
reasons. In multilayer networks we want to find communities that
span across layers. For example, we want to figure out communities
of friends even if your friends are spread across multiple social media
platforms.
There are a few review works you can check out to have a more 1
Jungeun Kim and Jae-Gil Lee. Com-
in-depth exploration of the topic1 , 2 . Here, I go over briefly the main munity detection in multi-layer graphs:
approaches and peculiar problems of community discovery in multi- A survey. ACM SIGMOD Record, 44(3):
37–48, 2015
layer networks. 2
Obaida Hanteer, Roberto Interdonato,
Matteo Magnani, Andrea Tagarelli, and
Luca Rossi. Community detection in
40.1 Flattening multiplex networks, 2019
The simplest strategy is to collapse all layers into a single network, reducing each set of parallel connections between two nodes to one – an edge weight. This assumes that every edge type is equally
important. Then you can perform a normal mono-layer community
discovery. Figure 40.1 shows an example.
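A minimal sketch of flattening, assuming the layers are given as a list of networkx graphs over the same node identifiers and using the simplest weighting discussed next (the number of layers in which a pair of nodes is connected):

import networkx as nx

def flatten(layers):
    flat = nx.Graph()
    for layer in layers:
        flat.add_nodes_from(layer.nodes())
        for u, v in layer.edges():
            if flat.has_edge(u, v):
                flat[u][v]["weight"] += 1    # connected in one more layer
            else:
                flat.add_edge(u, v, weight=1)
    return flat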
There are a few choices for your edge weights. The simplest
one could be to simply count the number of layers in which the
connection between the nodes appear. However, you might want to
take into account some interplay between the layers. For instance,
you can count the number of common neighbors that two nodes 4
Jungeun Kim, Jae-Gil Lee, and Sungsu
have and use that as the weight of the layer, under the assumption Lim. Differential flattening: A novel
that a layer where two nodes have many common neighbors should framework for community detection in
multi-layer graphs. ACM Transactions on
count for more when discovering communities. Or you could use Intelligent Systems and Technology (TIST),
“differential flattening4 ”: flatten the multilayer graph into the single 8(2):27, 2017
A related strategy builds a matrix in which each row is a node and each column is the partition assignment for 6
Lei Tang, Xufei Wang, and Huan Liu.
that node in a specific layer6 . This is then a |V | × |C | matrix. Then one Community detection via heteroge-
could perform kMeans on it, finding clusters of nodes that tend to be neous interaction analysis. Data mining
and knowledge discovery, 25(1):1–33, 2012
clustered in the same communities across layers. 7
Michele Berlingerio, Fabio Pinelli, and
A similar approach7 uses frequent pattern mining, a topic we’ll see Francesco Calabrese. Abacus: frequent
more in depth in Section 41.4. For now, suffice to say that we again pattern mining-based community dis-
covery in multidimensional networks.
perform community discovery on each layer separately. Each node
Data Mining and Knowledge Discovery, 27
can then be represented as a simple list of community affiliations. We (3):294–320, 2013c
then look for sets of communities that are frequently together: these
are communities sharing nodes across layers.
Node  L1    L2    L3
1     C1L1  C1L2  C1L3
2     C1L1  C1L2  C1L3
3     C1L1  C1L2  C1L3
4     C2L1  C1L2  C1L3
5     C2L1  C1L2  C2L3
6     C2L1  C1L2  C3L3
7     C1L1  C2L2  C3L3
8     C2L1  C2L2  C3L3
9     C2L1  C2L2  C3L3
10    C1L1  C2L2  C2L3
(a)

MLComm  SLComms
MLC1    C1L1, C1L2, C1L3
MLC2    C2L1, C1L2
MLC3    C2L2, C3L3
(b)

MLComm  Nodes
MLC1    1, 2, 3
MLC2    4, 5, 6
MLC3    7, 8, 9
(c)
Figure 40.2 shows an example. In Figure 40.2(a) we have the Figure 40.2: (a) The communi-
communities found for each layer for each node. Then we decide that ties found in each layer of each
we want to merge communities if they have at least three nodes in node. (b) The merged multi-
common, i.e. they appear in at least three rows of the table. layer communities. (c) The final
Figure 40.2(b) shows the multilayer communities mapping and node-community affiliation.
there are many interesting things happening. First, we only want
maximal sets, meaning that we aren’t interested in returning C1L1 by
itself if we also find it in a larger set of communities. Second, we are
ok if a community gets merged in different sets – i.e. the multilayer
communities can overlap –: C1L2 is part of two maximal sets, MLC1
and MLC2. Figure 40.2(c) shows the final output: the multilayer com-
munity affiliation. A node is part of a multidimensional community 8
Arlei Silva, Wagner Meira Jr, and
Mohammed J Zaki. Mining attribute-
if it is part of all communities composing it. structure correlated patterns in large
Node 10 is an example of a final interesting thing: it is part of attributed graphs. Proceedings of the
VLDB Endowment, 5(5):466–477, 2012
no multidimensional community because its affiliation is a weird 9
Zhiping Zeng, Jianyong Wang, Lizhu
combination of communities. We can decide to let it be without com- Zhou, and George Karypis. Coherent
munity affiliation, or to allow it to be part only of its non-multilayer closed quasi-clique discovery from large
dense graph databases. In Proceedings
communities.
of the 12th ACM SIGKDD international
There are other algorithms solving the same problem and inspired conference on Knowledge discovery and
by frequent pattern mining8 , 9 . data mining, pages 797–802. ACM, 2006
Multilayer Modularity
13
Peter J Mucha, Thomas Richardson,
I already mentioned how obsessed network scientists are with mod- Kevin Macon, Mason A Porter, and
ularity, so you know what’s coming next: multilayer modularity13 . Jukka-Pekka Onnela. Community
Suppose we’re using the Louvain method, which grows communities structure in time-dependent, multiscale,
and multiplex networks. science, 328
node by node. If we found a triangle in a layer, can we extend it by (5980):876–878, 2010
taking a node in a different layer? Intuitively yes, the edge should
count because the node is the same. However, if we were to represent
this as a flat network, the new node is not densely connected to the
rest of the triangle: a node couples only with itself, not with its com-
munity fellows. So the coupling edges have to count in some special
way.
In practice, standard modularity works well in each layer sepa-
rately. Consider Figure 40.3: in modularity, the part testing for the
density of the community is A_uv − k_u k_v / (2|E|). If we use this same part for
the inter-layer coupling, we would end up with a case in which the
community cannot be expanded across layers, because there are only
sparse connections between layers. A node couples only with itself in
a different layer, not connecting to its community members, making a
multi-layer community sparser than it actually is. So we need to add
something that will allow us to count the coupling links, so that we
don’t end up with the trivial result of all mono-layer communities.
The full formulation of multilayer modularity is the following:
\frac{1}{2(|E| + |C|)} \sum_{vusr} \left[ \left( A_{vus} - \gamma_s \frac{k_{vs} k_{us}}{2|E_s|} \right) \delta_{sr} + C_{vsr} \delta_{uv} \right] \delta(c_{us}, c_{vr}).
Let’s break it down – and you can check Figure 40.4 for a graphical
representation and you can always go back to Section 36.1 to read
more about the notation of classical modularity and compare it with
this formulation. E and C are the sets of (intra-layer) edges and (inter-
layer) couplings. Avus is our multilayer adjacency tensor, it is equal to
1 if nodes u and v are connected in layer s, and it is 0 otherwise. k us
and k vs are the degrees of u and v in layer s, respectively. | Es | is the
number of edges in s.
(Figure 40.4: The adaptation of modularity to the multilayer setting. Each part of the formula is underlined with a color corresponding to its interpretation: u and v are nodes, s and r are layers; when s = r (same layer) the term computes modularity in s, with γ_s weighting the importance of s; when u = v (same node) the coupling strength C_vsr applies; δ(c_us, c_vr) requires node u in layer s to be in the same community as node v in layer r; the whole sum is normalized by the number of links and coupling links.)
If you decide that your inter layer couplings are very strong,
you’ll end up with “pillar communities” where nodes tend to favor
grouping with themselves across layers: the inter layer couplings
trump any intra-layer regular edge. If your inter layer couplings
are weak (low Cvsr ) then you’ll end up with “flat communities” as
nodes prefer to group with other nodes in the same layer. I show an
example in Figure 40.5.
Instead, γ allows you to indicate some layers as more important
than others, as I show in Figure 40.6. If the purple layer is more
important than the green one, multilayer modularity will group in
the community a node that is not connected with the two nodes in
the green layer. If we flip the γ values to make green more important
than purple, the situation is reversed, and modularity will return
different communities.
An alternative way to adapt modularity maximization to multi-
layer networks is to adapt the Louvain algorithm (see Section 37.1) to
Inderjit S Jutla, Lucas GS Jeub, and
14
handle networks with multiple relation types14 . Peter J Mucha. A generalized louvain
method for community detection imple-
mented in matlab. URL http://netwiki.
Other Approaches amath. unc. edu/GenLouvain, 2011
One adaptation redefines the label propagation rules19 , and thus adapts the fast label propagation algorithm to find
multilayer communities. First, we cannot use synchronous label
propagation: like in the bipartite case, also for multilayer networks
we could be stuck with label oscillation (Section 39.3), this time across
layers. Second, the authors define a quality function that regulates
the propagation of labels. This is done because there might be layers
that are relevant for a community and layers that are not. We do not
18
Roberto Interdonato, Andrea Tagarelli,
want a community, which is very strong in some layers, to “evaporate
Dino Ienco, Arnaud Sallaberry, and
away” just because in most layers the nodes are not related. Pascal Poncelet. Local community
Next on the menu is k-clique percolation20 . In this scenario, we detection in multilayer networks.
DMKD, 31(5):1444–1479, 2017
need to redefine a couple of concepts, particularly what a clique is 19
Oualid Boutemine and Mohamed
in a multilayer network, and how we determine when two multiplex Bouguessa. Mining community struc-
cliques are adjacent. For the first case, we need to talk about k-l- tures in multidimensional networks.
TKDD, 11(4):51, 2017
cliques: a set of k nodes all connected through a specific set of l 20
Nazanin Afsarmanesh and Matteo
layers. Moreover, there are two ways for nodes to be all connected Magnani. Finding overlapping com-
via the layers: all pairs of nodes could be connected in all layers at munities in multiplex networks. arXiv
preprint arXiv:1602.03746, 2016
the same time, or they could be connected in only one layer at a time.
The first type of clique is a k-l-AND-clique, the second type is a
k-l-OR-clique. Figure 40.8 shows an example.
It becomes clear that two k-l-cliques might share (k − 1) nodes across different layers. In such a case, we need some care in defining a parameter to regulate percolation. We need a minimum number m of shared layers to allow the percolation. If the two cliques do not share at least m layers, even if they share k − 1 nodes they are not considered adjacent. Figure 40.9 shows an example.
21 Caterina De Bacco, Eleanor A Power, Daniel B Larremore, and Cristopher Moore. Community detection, link prediction, and layer interdependence in multilayer networks. Physical Review E, 95(4):042317, 2017
22 Natalie Stanley, Saray Shai, Dane Taylor, and Peter J Mucha. Clustering network layers with the strata multilayer stochastic block model. IEEE Transactions on Network Science and Engineering, 3(2):95–105, 2016
Figure 40.9: Two 2,2-cliques sharing a 1,2-clique. The edge color represents the edge layer. If m = 1, (a) does NOT percolate, because the rightmost clique does not share a layer with the leftmost clique; (b) DOES percolate, since the two cliques share the blue layer.

The final adaptation we consider is the stochastic blockmodel21, 22. Just like we saw for overlapping and bipartite SBMs, we need to add an additional matrix into our expectation maximization framework.
Figure 40.10: A multilayer network, with the edge color encoding the layer in which it appears.
dense. But the two concepts are not the same. Which of the two are
we looking at?
This is another case when one has to make their own judgment.
The answer depends on the type of analysis, and the type of data,
you are looking at. In some cases, you want connections in all layers.
In some others, you are ok with looking at all layers to find commu-
nities. You cannot rely on a fixed definition of communities based on
density, because it cannot apply to all scenarios.
Thus, you need to have measures to determine when you are in one case and when you are in another. I proposed a couple in a paper of mine26. We decided to call them "redundancy" and "complementarity". Note that this redundancy here has nothing to do with the cousin of the local clustering coefficient we talked about in Section 12.2.
26 Michele Berlingerio, Michele Coscia, and Fosca Giannotti. Finding redundant and complementary communities in multidimensional networks. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 2181–2184, 2011b
Redundancy is the easiest of the two. To consider a set of nodes management, pages 2181–2184, 2011b
Figure 40.11: Two examples of different types of multilayer density. The edge color encodes the layer in which the edge appears. (a) A high redundancy case. (b) A high complementarity case.
pairs in the community, which is 10, since we have 5 nodes. Thus the
redundancy is 18/30 = 0.6.
Figure 40.11(b) is instead a high complementarity case. Variety is 1 by definition, since the community contains all layers of the network. Exclusivity is 9/10, because there is one pair of nodes (nodes 2 and 3) which is connected in two layers, and thus it is not counted. Finally, the standard deviation of the distribution of the edges in c across layers is the standard deviation of the vector [5, 1, 5], since there are five edges each in the red and the green layer, and only one in the blue layer.
This is ∼ 1.88, which is exactly two thirds of the maximum possible,
leaving us with a total homogeneity of 0.33. Thus, complementarity
is 1 × 0.9 × 0.33 = 0.297, penalized by the low representation of the
blue layer in the community.
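To make the worked example concrete, here is a minimal sketch of how I would compute the two measures from a list of (node, node, layer) edges of a community. The exact definitions are in the paper cited above; what follows is my reading of the example (redundancy counts the (pair, layer) connections of pairs connected in more than one layer; complementarity multiplies variety, exclusivity, and homogeneity), and the function names and the homogeneity normalization constant are assumptions of mine.

```python
from collections import defaultdict
from statistics import pstdev

def redundancy(edges, nodes, layers):
    """edges: iterable of (u, v, layer) triples inside the community."""
    pair_layers = defaultdict(set)
    for u, v, l in edges:
        pair_layers[frozenset((u, v))].add(l)
    # Count every (pair, layer) connection of a pair connected in >1 layer
    # (my reading of the 18/30 example above).
    redundant = sum(len(ls) for ls in pair_layers.values() if len(ls) > 1)
    n_pairs = len(nodes) * (len(nodes) - 1) // 2
    return redundant / (n_pairs * len(layers))

def complementarity(edges, nodes, layers, max_std):
    """max_std: normalization constant for homogeneity, as in the paper."""
    pair_layers = defaultdict(set)
    layer_edges = defaultdict(int)
    for u, v, l in edges:
        pair_layers[frozenset((u, v))].add(l)
        layer_edges[l] += 1
    variety = len(layer_edges) / len(layers)              # layers represented
    exclusivity = sum(1 for ls in pair_layers.values()    # pairs in exactly one layer
                      if len(ls) == 1) / len(pair_layers)
    counts = [layer_edges.get(l, 0) for l in layers]
    homogeneity = 1 - pstdev(counts) / max_std            # 1 - normalized std
    return variety * exclusivity * homogeneity
```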
40.6 Summary
40.7 Exercises
Graph Mining
41
Frequent Subgraph Mining
Our experience with modeling real world networks tells us that they
are not random: they are an expression of complex dynamics shaping
their topology. Networks will tend to have overexpressed connection
patterns. Nodes and edges will form different shapes much more
– or less – often than what you’d expect if the connections were
random. For instance, the clustering coefficient analysis tells us that
you’re going to find more triangles than expected given the number
of nodes or edges.
This example is simple enough, but the problem gets very ugly very soon when we start considering non-trivial graphs. Subgraph isomorphism is an NP-complete problem14: a type of problem where a correct solution requires you to try all possible combinations of labeling. This grows exponentially and requires a time longer than the age of the universe even for simple graphs of a few hundred nodes.
14 Scott Aaronson. P=?NP. Electronic Colloquium on Computational Complexity (ECCC), 24:4, 2017. URL https://eccc.weizmann.ac.il/report/2017/004
That is why we’re looking for efficient algorithms to solve graph
isomorphism. Recently, a claim of a quasi-polinomial algorithm 15
László Babai. Graph isomorphism in
shook the world15 – well, parts of it. However, this is more of a quasipolynomial time. In Proceedings of
theoretical find, which cannot be used in practice. Given how hard the forty-eighth annual ACM symposium
on Theory of Computing, pages 684–697.
the problem is, we can either try to solve it quickly by paying the ACM, 2016
price of getting it wrong sometimes, or more slowly but having an
exact solution. I’ll give an example for both strategies, since they are
both quite important for the rest of this book part.
Approximate Solutions

One of the most well-known ways to approximate a solution to graph isomorphism is to use the Weisfeiler-Lehman algorithm16.
16 Boris Weisfeiler and Andrei Leman. The reduction of a graph to canonical form and the algebra which appears therein. NTI, Series, 2(9):12–16, 1968

Figure 41.4: The Weisfeiler-Lehman algorithm. The node colors represent the node's classes, based on the degree. In the first step (top) red means degree 3 and blue means degree 2 for all networks. In the second step each color changes according to a consistent rule within a given network. Finally, we have the histogram of node colors.

Part of the
bottom row of Figure 41.4. Two networks with the same color his-
tograms are isomorphic: they have the same number of nodes with
the same colors. Indeed, the networks on the left and in the middle
are isomorphic to each other, while the one on the right isn’t.
This works well for most networks, and it is guaranteed to finish after at most |V| steps17. However, you're a smart reader and you know you're reading a section titled "approximate solutions", so you know there's a catch. The test sometimes fails. You can see in Figure 41.5 a case in which two non-isomorphic graphs result in the same color histogram. You can tell that the graphs are not isomorphic, because one has two triangles and the other has none.
17 László Babai and Ludik Kucera. Canonical labelling of graphs in linear average time. In 20th Annual Symposium on Foundations of Computer Science (SFCS 1979), pages 39–46. IEEE, 1979
Figure 41.5: The Weisfeiler-Lehman algorithm. The node colors represent the node's classes, based on the degree. In both cases, red means "one red and one blue neighbor" and blue means "two red and one blue neighbor".
There are ways to fix this issue, for instance by aggregating not only the information from direct neighbors, but from nodes at k hops away18. However, as you might expect, this greatly increases the computational complexity of the approach.
18 Jin-Yi Cai, Martin Fürer, and Neil Immerman. An optimal lower bound on the number of variables for graph identification. Combinatorica, 12(4):389–410, 1992

Note that here I use simple colors based on the degree for simplicity. Nothing would change in my explanation if you were to have more complex labels – for instance some sort of node attributes. As long as you hash those attributes into a label uniquely identified by a specific set of attribute values, the algorithm works the same and will produce the same results shown in this section.
Another popular solution in this class is to use DFS codes, which I'm going to treat extensively in Section 41.4 due to their convenient connections to itemset mining. This property of DFS codes makes them more useful in many practical contexts than Weisfeiler-Lehman. Specifically, we'll see how you can save computations with DFS codes if you are growing motifs by adding nodes and edges, while Weisfeiler-Lehman doesn't allow you to do so.
Exact Solutions

If one needs to solve graph isomorphism exactly, to the best of my knowledge, the current practical state of the art is the VF2 algorithm19, 20 – recently evolved to VF321, 22. Since I'm just going to give the general idea of the backtracking process it implements, the sophisticated differences between the various approaches aren't all that important. Most of the power of these algorithms comes from a series of heuristics they can use to make good guesses, but here I'm just interested in talking about the general idea.
19 Luigi Pietro Cordella, Pasquale Foggia, Carlo Sansone, and Mario Vento. An improved algorithm for matching large graphs. In 3rd IAPR-TC15 Workshop on Graph-based Representations in Pattern Recognition, pages 149–159, 2001
20 Luigi P Cordella, Pasquale Foggia, Carlo Sansone, and Mario Vento. A (sub)graph isomorphism algorithm for matching large graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(10):1367–1372, 2004
21 Vincenzo Carletti, Pasquale Foggia, Alessia Saggese, and Mario Vento. Introducing VF3: A new algorithm for subgraph isomorphism. In Graph-Based Representations in Pattern Recognition: 11th IAPR-TC-15 International Workshop, GbRPR 2017, Anacapri, Italy, May 16–18, 2017, Proceedings 11, pages 128–139. Springer, 2017b
22 Vincenzo Carletti, Pasquale Foggia, Alessia Saggese, and Mario Vento. Challenging the time complexity of exact subgraph isomorphism for huge and dense graphs with VF3. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):804–818, 2017a

"Heuristics" means to do your darnedest not to actually run the algorithm itself, or to do it in a way that you expect it to finish more quickly. For instance, the first step of VF2 is to make all easy checks that don't really require much thought. As an example, two graphs cannot be isomorphic if they have a different number of nodes or a different number of edges. If this check fails, you can safely deny the isomorphism. Once all these heuristics have confirmed that the graphs could be isomorphic, then you have to begrudgingly actually look at the network, which is to say that you apply the following procedure:

• Step #1: match one node in G1 with a node in G2;

• Main loop: try to match the n-th node in G1 with the n-th node in G2. If the match is unsuccessful, recursively step back your previous matches and try a new match;

• End loop, case 1: you explored all nodes in G1 and G2, then the graphs are isomorphic;

• End loop, case 2: you have no more candidate match, then the graphs are not isomorphic.
Most of the heavy lifting is done in the main loop, when checking
whether a match is successful or not, but most of the cleverness is in
step #1: prioritizing good matches that are more likely to succeed –
for instance never trying to match two nodes with different degrees.
An illustrated example with a simple graph would probably be
helpful. Consider Figure 41.6. VF2 attempts to explore the tree of
all possible node matching (Figure 41.6(c)). It starts from the empty
match – the root node.
The first attempted match always succeeds, as any node can be
matched to any other node – in this case matching node 1 with node
a. For the second match to succeed, we need that the two matched
nodes are connected to each other. Since node a connects to node b
and node 1 connects to node 2, then the 1 = a and 2 = b match is a
success.
However, attempting to match 3 = c fails, because while node 2
is connected to node 3, node b (matched with 2) isn’t connected to
c (matched to 3). Thus VF2 backtracks: it undoes the last matches and starts from the last successful match – provided that there are possible matches to try. In this case there aren't, so it backtracks again.

Figure 41.6: (a, b) Two graphs, with their nodes labeled with their ids. (c) The inner data structure used by the VF2 algorithm to test for isomorphism. I label each node with the attempted match. The node color tells the result of the match (green = successful, red = unsuccessful). I label the edges to follow the step progression of the algorithm.
Trying to set 2 = c and 3 = b fails again, for the same reason as
before. So VF2 has to give up also on the 1 = a match and start from
scratch. Luckily, there’s another possible move: 1 = b. When we go
down the tree all matches are successful, until we have touched all nodes
in the graph. At that point, we can safely conclude the two graphs
are isomorphic. Note that Figure 41.6(c) doesn’t include the branches
that VF2 never tries in this case, for instance the 1 = c branch.
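For illustration, here is a bare-bones backtracking matcher in the same spirit, with only the degree heuristic. It is not the real VF2 with its full set of feasibility rules, and all names are mine.

```python
def isomorphic(adj1, adj2):
    """adj1, adj2: dict node -> set of neighbors. Exhaustive backtracking."""
    if len(adj1) != len(adj2):           # cheap check before doing any real work
        return False
    nodes1 = list(adj1)

    def extend(mapping):
        if len(mapping) == len(nodes1):  # every node matched: isomorphic
            return True
        v = nodes1[len(mapping)]         # next node of G1 to match
        for w in adj2:
            if w in mapping.values():
                continue
            if len(adj1[v]) != len(adj2[w]):   # degree heuristic: skip bad candidates
                continue
            # The new pair must preserve adjacency with every pair matched so far.
            if all((u in adj1[v]) == (mapping[u] in adj2[w]) for u in mapping):
                mapping[v] = w
                if extend(mapping):
                    return True
                del mapping[v]           # dead end: backtrack and try another match
        return False

    return extend({})
```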
You can see here what I meant by smart heuristics that can help make the algorithm faster. The 1 = a starting move was particularly stupid, because node 1 and node a don't have the same degree, so they could never have matched. With the heuristic of "only try to match nodes with the same degree" you could have saved a lot of computation by starting directly with the 1 = b move – and that's what VF2 does.

As expected, multilayer networks provide another level of difficulty. One can perform graph isomorphism directly on the full multilayer structure23, or give up a bit of the complexity and represent them as labeled multigraphs24, 25.
23 Mikko Kivelä and Mason A Porter. Isomorphisms in multilayer networks. IEEE Transactions on Network Science and Engineering, 5(3):198–211, 2017
24 Vijay Ingalalli, Dino Ienco, and Pascal Poncelet. SuMGra: Querying multigraphs via efficient indexing. In International Conference on Database and Expert Systems Applications, pages 387–401. Springer, 2016
25 Giovanni Micale, Alfredo Pulvirenti, Alfredo Ferro, Rosalba Giugno, and Dennis Shasha. Fast methods for finding significant motifs on labelled multi-relational networks. Journal of Complex Networks, 2019

41.4 Transactional Graph Mining

So far we've been dealing with network motifs with a "top-down" approach. We have some motifs of interest and we ask ourselves whether they are overexpressed or underexpressed. This implies that you have to start with your motifs already in mind. This might not be
The first thing you do is to give up on the idea of finding all subsets. You only want to find the frequent ones. Thus you establish a support threshold: if a subset fails to occur in that many sets, then you don't want to see it. This allows you to prune the search space. If subset A is not frequent, then none of its extensions can be: they have to contain it, so they can be at most as frequent as A is30. Thus, once you rule out subset A, none of its possible extensions should be explored.
30 That is, the support function is anti-monotonic: it can only stay constant or shrink as your set grows in size.
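A toy sketch of this pruning strategy for itemsets (the graph version comes later in this chapter); the function and its arguments are names I made up for illustration.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """transactions: list of sets of items. Returns {frozenset: support}."""
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {i for t in transactions for i in t}
    frequent = {}
    # Level 1: single items passing the support threshold.
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    while level:
        for s in level:
            frequent[s] = support(s)
        # Candidates are generated only by joining itemsets that are themselves
        # frequent; a full Apriori implementation would additionally prune
        # candidates containing any infrequent subset.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        level = [c for c in candidates if support(c) >= min_support]
    return frequent

# Example: support threshold of 2 over four small transactions.
print(apriori([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}], 2))
```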
Some Rules

Figure 41.9: An example of association rule mining. Assume the frequencies of each itemset are the ones from Figure 41.7. We generate rules recording the relative frequency of observing two itemsets. Note that these frequencies are not symmetric! While the green item only occurs 60% of the times a blue item occurs, every time green occurs we also have the blue item.

As a small aside, note that, once you know the frequencies of all sets, you can build what we call "association rules". What you want to do is to find all rules in the form of: "If a set contains the objects in A, then it is likely to also contain object b". Figure 41.9 shows a simple example of the problem we're trying to solve.

Suppose that, in your data, you see 100 instances of sets containing objects a1, a2, and a3. And let's say that, among them, 80 also contain object b. Then you can say, with 80% confidence, that the following rule applies: {a1, a2, a3} → b. The {a1, a2, a3} part is the antecedent of
Figure 41.10: For each of the patterns on the left we check whether a graph in the database (on top) contains it (green checkmark) or not (red cross). The number of graphs in the database containing the motif is its support.
search space like the one Apriori creates (Figure 41.8) is tricky. If you
cannot explore the search space like Apriori does, it’s even harder
to prune it by avoiding exploring patterns you already know are not frequent. gSpan solves the problem by introducing a graph
lexicographic order (which I’m going to dumb down here, the full
details are in the paper).
Figure 41.11: Three possible DFS explorations of the graph on top. Blue arrows show the DFS exploration and purple dashed arrows indicate the backward edges (pointing to a node we already explored). Remember that DFS backtracks to the last explored node, not to where the backward edges point.
Suppose you have a graph, as in Figure 41.11 (top). You can ex-
plore it using a DFS strategy. Actually, you can have many different
DFS paths: you can start from node a, from node b, ..., then you can
move through a different edge any time. We can encode each DFS
Figure 41.13: Two motifs (left) and our graph data (right). Motif m1 appears only once. How many times does motif m2 appear?
Ego Networks
One option is to bring the problem back into familiar territory. One can split the single graph into many different subgraphs and then
apply any transactional graph mining technique. For example, one
could take the ego networks of all nodes in the network. The support
definition would then be the number of nodes seeing the pattern
around them in the network.
Harmful Overlap

Another option starts from recognizing that the entire problem of non-monotonicity is due to the fact that motif m2 appears twice only because we allow the re-use of parts of the data graph when counting the motif's occurrences. In Figure 41.13, we use the red node in the data graph twice to count the support of m2. In practice, the two patterns supporting m2 overlap: they have the single red node in common. We could forbid such overlap: we don't allow the re-use of nodes when counting a motif's occurrences. With such a rule, m2 would appear only once in the data graph. If we applied the rule, we would have an anti-monotone support definition39: larger motifs would only appear fewer times or as many times as the smaller motifs they contain.
39 Michihiro Kuramochi and George Karypis. Finding frequent patterns in a large sparse graph. Data Mining and Knowledge Discovery, 11(3):243–271, 2005
Figure 41.14: From left to right: a pattern, the graph dataset, and its corresponding simple and harmful overlap graphs. I label each occurrence of the motif with a letter, which also labels the corresponding node in the overlap graph.
To see how, consider Figure 41.14. The motif appears four times,
but each of these four occurrences share at least one node. We can
create an “overlap graph” in which each node is an occurrence in
the data graph, and we connect occurrences if they share at least
one node. If we forbid overlaps, we only want to count “complete”
and non-overlapping occurrences. This is equivalent to solving the
maximum independent set problem (see Section 12.3) on the overlap
graph: finding the largest set of nodes which are not connected to
any other member of the set. In this case, we have four independent
sets all including a single node – because the overlap graph is a
clique – and thus the pattern occurs only once.
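As a sketch, the simple overlap rule can be computed like this, assuming we already have the list of occurrences as sets of data-graph nodes. The brute-force maximum independent set search is only meant for a handful of occurrences, and the names are mine.

```python
from itertools import combinations

def overlap_support(occurrences):
    """occurrences: list of node sets, one per embedding of the motif in the data."""
    n = len(occurrences)
    # Overlap graph: two occurrences are adjacent if they share at least one node.
    overlap = {(i, j) for i, j in combinations(range(n), 2)
               if occurrences[i] & occurrences[j]}
    # Support = size of the largest set of pairwise non-overlapping occurrences,
    # i.e. a maximum independent set of the overlap graph (brute force here).
    for k in range(n, 0, -1):
        for subset in combinations(range(n), k):
            if not any((i, j) in overlap for i, j in combinations(subset, 2)):
                return k
    return 0
```

On the example of Figure 41.14, where every pair of the four occurrences overlaps, this returns 1.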
This is the simple overlap rule and it is usually too strict. There are some overlaps between the occurrences that do not "harm" the anti-monotonicity requirement for the support definition40. To find harmful overlaps you need to do two things. First, you look at which nodes are in common between two occurrences. For instance, in Figure 41.14, A ∩ B = {1, 8}, A and B share nodes 1 and 8. Second, you need to make sure that these nodes that are in common between the two occurrences are not required to map the same nodes at the same time. In the case of A and B they are, because the only way to map the top red node in the motif is to use node 8 for A and 1 for B. This is a harmful overlap.
40 Mathias Fiedler and Christian Borgelt. Support computation for mining frequent subgraphs in a single graph. In MLG, 2007

The non-harmful overlaps are the ones in which this doesn't
41.6 Summary
1. Machine learning can give you powerful insights about your net-
works, but first you need to find a way to transform the complex
structure of a network into numerical values that can be handled
by machine learning algorithms. One way to do so is to count
network motifs.
2. Network motifs are small simple graphs that you can use to
describe the topology of a larger network. For instance, you can
count the number of times a triangle or a square appears in your
network.
41.7 Exercises
2. How many times do the motifs from the previous question appear in the network? http://www.networkatlas.eu/exercises/41/1/motif2.txt is included in http://www.networkatlas.eu/exercises/41/1/motif3.txt: is the latter less frequent than the former, as we would require in an anti-monotonic counting function?
In the next chapters we’re going to fully explore the recent and
quickly evolving field of Graph Neural Networks (GNNs). I already
mentioned some basics of neural networks in Section 6.4 and ideas
on what they can do on graphs in Section 15.3. Now it’s time to
explain exactly how you can use neural networks on graphs to
perform those – and more – tasks.
There are a few good books if you want to get into this topic more in depth1, 2, which you should check out. Some of the surveys in the literature3 also come with code4 that you can use to follow along with the explanations.
1 William L Hamilton. Graph representation learning. Morgan & Claypool Publishers, 2020
2 Michael M Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478, 2021
3 Palash Goyal and Emilio Ferrara. Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151:78–94, 2018
4 https://github.com/palash1992/GEM

For the time being, we're focusing on building "embeddings". A small terminology note: building embeddings and using them in GNNs can go under different terms depending on where the focus is. For instance, GNNs can be considered a special case of the more general geometric learning that goes beyond the use of Euclidean space in data analysis5. We'll see more on the use of complex non-Euclidean geometries in Chapter 47. Another term you could see is "collective classification": the attempt of classifying nodes by looking at how they relate to the rest of the network6.
5 Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017
6 Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93–93, 2008

Finally, one thing you need to know to read these chapters critically. In the next couple of chapters, I am adopting a purely structural approach to build node embeddings. This means that they are built exclusively by looking at their connections. It is only in Chapter 44 that I'm going to give you a fuller picture that takes the node attributes into account as well. So don't assume that looking at the connections is the only way to build a node embedding!
Figure 42.1: (a) An example graph. (b) One of the possible embeddings of (a), assigning a two dimensional vector to each node. (c) The scatter plot representation of (a)'s embeddings.
usually the scatter plot is (i) easier to analyze than a graph, and (ii) a
more common data structure than a graph on which you can apply a
more diverse set of algorithms that were not developed with graphs
in mind.
The real meat that makes embeddings interesting is the last con-
straint – which connects similarity of embeddings with the similarity
of the nodes. But before we dive into that, there might be a question
lingering in your mind: why are we bothering with graph embed-
dings? We can already represent easily a node with a vector. In fact,
this is something you taught me since Chapter 8! A node is nothing
more than a row in the adjacency matrix of the graph. Thus it is a
vector. Why can’t we use that as our “embedding”?
The reason is that A fails each of the first three constraints:
I show you an example of the last problem in Figure 42.3. You can
tell that the two graphs defined by the two matrices I show there are
isomorphic, because one matrix is the rotation of the other – quite
literally, I was too lazy to make another figure from scratch, so I just rotated it in LaTeX.
In fact, you want something even stronger than permutation invariance. You want permutation equivariance: not only should the embeddings of a node not be changed by the order in which you visit the graph, but nodes in the exact same position in the graph should always get the same embeddings.
Figure 42.4: (a) A different valid embedding of the graph in Figure 42.1(a), assigning a two dimensional vector to each node. (b) The scatter plot representation of (a)'s embeddings.
vector. In turn, this also means exactly that I embed the network
structure in a 2D space.
It would be ugly if there were edge crossings or the nodes over-
lapped, so the network structure constrains how I build these vectors.
Every time you try to display your network in the clearest form pos-
sible, you are creating 2D (sometimes 3D) node embeddings. In fact,
there are a bunch of specialized algorithms that lay out your networks – we'll see them in Chapter 51. They are a relatively simple way to create node embeddings – but, crucially, a totally valid way to do it.
The main reason people nowadays can’t get enough of making new
ways of building node embeddings is because node embeddings are
a perfect way to boil down the complexity of a graph in a way that
classical neural network architectures can understand. To see what
I mean, we shall take another look at Figure 6.12, which I reproduce
here as Figure 42.6.
From the figure, you can see the general architecture of a neural
network. The input goes into the input layer, it is processed by one
or more hidden layers, until the neural network produces an output
in the output layer. Now, as I mentioned in Chapter 41, the problem
here is that the input layer is a vector of numbers, but networks
aren’t vectors of numbers. If they were, analyzing them would be
easy and this book wouldn’t exist. So we need to find a way to
transform them into something a neural network understands. Which
is embeddings, as I mentioned.
We generally call this approach “shallow learning” because we
use the network only to build the embeddings. However, once we
have the embeddings, we forget about the network and we don’t
use it any more – there are no more edges in the vectors of numbers
we feed into the algorithm. As I already outlined, when we get to
Chapter 44, we’ll deal with “deep learning”. In this case, we use
Figure: the singular values of (b), sorted on the x axis (Rank) in descending order.
∑_u |Z_u − ∑_v Z_v A_uv|²

Here, the straight bars mean that we are taking the length of the vector Z_u − ∑_v Z_v A_uv and then we square it. In practice, your loss is the difference between Z_u and Z_v, weighted by how strongly they are connected in the graph (A_uv). If u and v are connected by a high A_uv link strength, we better have a low Z_u − Z_v difference to cancel it out! It turns out that finding Z, the matrix containing the embeddings Z_u for all our nodes, can be solved as an eigenvector problem. Specifically, you can take the smallest eigenvectors of the matrix (I − A)^T (I − A) – but discarding the actual smallest one – with I being the identity matrix.
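A small numpy sketch of that recipe, assuming A is the (possibly weighted) adjacency matrix stored as a dense array; the function name is mine.

```python
import numpy as np

def factorization_embedding(A, d=2):
    """A: |V| x |V| (weighted) adjacency matrix. Returns a |V| x d embedding."""
    n = A.shape[0]
    M = np.eye(n) - A
    # The loss is minimized by the eigenvectors of (I - A)^T (I - A)
    # associated with the smallest eigenvalues.
    vals, vecs = np.linalg.eigh(M.T @ M)   # symmetric matrix, eigenvalues ascending
    # Discard the very smallest eigenvector, keep the next d as coordinates.
    return vecs[:, 1:d + 1]
```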
Laplacian Eigenmaps

Laplacian Eigenmaps12 change the loss function, which means the objective is still the eigenvector decomposition of a matrix, but the matrix itself is built differently. In this case, the loss function is:
12 Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, pages 585–591, 2002

(1/2) ∑_{u,v} |Z_u − Z_v|² A_uv,
42.4 Pooling
You might have noticed that so far I have exclusively talked about
generating node embeddings. However, in many cases, you don’t
want to have node embeddings. Maybe you want to have edge em-
beddings to do link prediction – trying to figure out if there is a key
function distinguishing the embeddings of edges that exist from
edges that don’t, which will tell you which non-existing links are
likely to exist. Or you might want to have an embedding for the en-
tire network. A classical case is trying to classify molecules to predict
some of their characteristics. A molecule is a network of atoms, so
you want to embed the entire structure.
To do this, you need to perform what we call “pooling”. The
role of a pooling function is to take a bunch of embeddings and
return one that is a good representation of all of them. Pooling for
edges is relatively straightforward, since we only have two node
embeddings to deal with and not much complexity. For edge (u, v),
you have embeddings Zu and Zv . To get to a Zuv embedding you can
average them (Zuv = ( Zu + Zv )/2) or multiply them element-wise
(Zuv = Zu × Zv ), or take the L1 or L2 norm, you get the idea. To give
you an example, I take the node embeddings from Figure 42.1(b) and
I create a bunch of edge embeddings out of them in Figure 42.11.
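These pooling operators are one-liners over numpy vectors; here is a small helper (names mine) with the options just listed.

```python
import numpy as np

def edge_embedding(z_u, z_v, how="average"):
    """Pool two node embedding vectors into one edge embedding."""
    if how == "average":
        return (z_u + z_v) / 2
    if how == "hadamard":
        return z_u * z_v                 # element-wise product
    if how == "l1":
        return np.abs(z_u - z_v)
    if how == "l2":
        return (z_u - z_v) ** 2
    raise ValueError(how)

# For instance, averaging two 2D node embeddings:
print(edge_embedding(np.array([1.0, 1.0]), np.array([0.48, 0.66])))
```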
tion until you get to the desired number of dimensions – solving the issue that networks with a different number of nodes would generate a graph embedding with different dimensions.

Other clever methods involve hashing the node embedding so that it becomes an index and then updating the corresponding index in the graph embedding17. Figure 42.12 shows how this happens in practice. Nodes 1 and 3 happen to get the same hash and so update the same part of the graph embedding. This is fine and expected – if you have a graph embedding with 256 entries, but a graph with two million nodes, you're expecting a ton of hash collisions.
17 David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In NIPS, pages 2224–2232, 2015
Figure 42.12: The pooling approach determining the index of the graph embedding to be updated by hashing the node embedding.
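A sketch of the hashing trick, in the spirit of Figure 42.12. The real molecular fingerprint method differs in its details, and the update rule below is a simplification of mine: hash each node embedding to an index of a fixed-size graph embedding and update that entry, tolerating the expected collisions.

```python
import numpy as np

def pooled_graph_embedding(node_embeddings, size=8):
    """node_embeddings: dict node -> 1D numpy array. Returns a vector of `size` entries."""
    graph_emb = np.zeros(size)
    for z in node_embeddings.values():
        # Hash the (rounded) node embedding to a slot of the graph embedding...
        index = hash(tuple(np.round(z, 6))) % size
        # ...and update that slot. Colliding nodes update the same entry.
        graph_emb[index] += z.sum()
    return graph_emb
```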
You stop when you have the desired number of nodes. Step #2
is usually trivial, you just select the k nodes with the lowest scores.
Most of the work is done in step #1: we need to find good scoring
rules. In Figure 42.15 I show an example of a simple scoring rule, by
dropping nodes with the lowest betweenness centrality.
6. Pooling options for graphs are: flat, where you don't look too hard
at the structure of the graph; node clustering, where you collapse
nodes into virtual nodes in a smart way; and node drop, where
you recursively remove nodes that aren’t salient for the network.
42.6 Exercises
that Word2Vec is a network algorithm, and the NLP people are quiet
about it – just like in Section 42.2 we saw that image convolution is a
network algorithm specifically designed for simple grids. Let’s look
at Figure 43.1.
In Word2Vec you are basically saying that each word is a node
Figure 43.2: (Left) A graph. (Right) The graph's frequency of node co-appearance in a random walk (y axis) per node pair (x axis), sorted in descending order. Node pairs on the left of the frequency plot should get as similar embedding as possible.
p(u|v) = softmax(Z_u, Z_v) = e^{Z_u^T Z_v} / ∑_{w∈V} e^{Z_u^T Z_w}.
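In code, this is just a softmax over dot products of embedding vectors; a minimal numpy version (names mine):

```python
import numpy as np

def p_u_given_v(Z, u, v):
    """Z: |V| x d embedding matrix (one row per node); u, v: row indices."""
    scores = Z @ Z[u]                 # Z_u^T Z_w for every node w
    scores -= scores.max()            # subtract the max for numerical stability
    expd = np.exp(scores)
    return expd[v] / expd.sum()       # e^{Z_u^T Z_v} / sum_w e^{Z_u^T Z_w}
```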
Node2Vec

Node2Vec6 notices that DeepWalk uses uniform random walks, picking the next step of the walk completely at random. However, there is something to gain by performing higher order random walks (Chapter 34). Figure 43.3 shows you an example.
6 Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864. ACM, 2016
Figure 43.3: The biased random walk implemented in Node2Vec. The gray arrow is the step we just took. The colored arrows show the reweighting of the various possibilities, with the p and q parameters. Blue for backtrack, green for common neighbor, purple for not shared neighbors.

In a higher order random walk what matters is not only the node you're currently in, but also the node you come from. Node2Vec uses two parameters to exploit this information, p and q. In practice, it tries to modulate between a BFS-like strategy and a DFS-like one (Section 13.1). In Figure 43.3 you just followed the gray arrow to go from node 2 to node 1. How do you pick the next step?
Parameter p tells us how much we hate to backtrack. The step
bringing you back to node 2 is weighted 1/p. So if p is high, 1/p
is low, and so it is unlikely we backtrack. But if p is below 1, we are more likely to backtrack, which is what a BFS exploration would do.
Parameter q, instead, tells us how much we hate to explore parts of
the network we haven’t seen yet. Among all of node 1’s neighbors,
those that are not also neighbor with node 2 will get a 1/q weight
to be explored. With low q we’re trying to get as far from node 2 as
possible – which is how DFS would behave. With high q we’re more
likely to explore a common neighbor between nodes 1 and 2.
DeepWalk is equal to Node2Vec when we set p = q = 1, so
each neighbor of node 1 is treated equally. But setting p and q with
various combinations leads to different random walks and, as a con-
sequence, different embeddings Z which privilege some information
about the network over some other.
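Here is how the reweighting could look in code for an unweighted graph stored as a dict of neighbor sets. The graph in the example is my reading of Figure 43.3, and the sampling of the actual next step is left out.

```python
def node2vec_weights(adj, prev, curr, p, q):
    """Unnormalized weights for the next step of a walk that just went prev -> curr."""
    weights = {}
    for x in adj[curr]:
        if x == prev:                # backtracking to where we came from
            weights[x] = 1 / p
        elif x in adj[prev]:         # common neighbor of prev and curr: stay close
            weights[x] = 1
        else:                        # a node prev does not see: move away
            weights[x] = 1 / q
    return weights

# We just moved from node 2 to node 1 in a graph like the one in Figure 43.3.
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
print(node2vec_weights(adj, prev=2, curr=1, p=2, q=0.5))
```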
HARP

HARP7 is a meta-strategy that one can apply to improve any other random walk embedding method, including DeepWalk and Node2Vec. The idea is to employ a smarter way to initialize the weights of the function summarizing your nodes – before you start performing your random walks. The way HARP does it is basically a sort of "reverse pooling". If you remember Section 42.4, node clustering pooling takes node embeddings and combines them hierarchically to obtain a graph embedding. Here, instead, we coarsen the graph hierarchically before we calculate the embeddings. We then calculate the embeddings on the coarsened graph. Then we refine the embeddings for each node on the original graph.
7 Haochen Chen, Bryan Perozzi, Yifan Hu, and Steven Skiena. Harp: Hierarchical representation learning for networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018

The coarsening is done with two approaches: edge and star collapsing, which I show in Figure 43.4. The idea is to preserve both first order relationships (edges) and second order relationships (paths of length two).
Figure 43.4: The star and edge collapse processes in HARP. Nodes in the blue outlines merge during the star collapse phase, nodes in the green outlines merge in the edge collapse phase.
LINE

LINE11 is another approach worth mentioning. Just like in HARP and Node2Vec, LINE realizes that one needs to take into account both first and second order relationships between nodes, a feature that is absent from DeepWalk. I depict the general idea of LINE in Figure 43.5.
11 Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077. International World Wide Web Conferences Steering Committee, 2015a
LINE represents a graph with two matrices: one with the first
order and one with the second order proximities. LINE has its own
definition of what these matrices should be, but simplifying a bit one
could think that the first order proximities can be represented with
the stochastic adjacency matrix A, since it gives you the transition
probability via a random walk. Then, the second order proximities
could be the squared stochastic adjacency matrix, A2 , which gives
the probability of moving between the nodes with a random walk of
length two.
Once you have these two matrices, you can use the same loss func-
tion of DeepWalk, twice. In the first pass, you create embeddings
minimizing the loss over the first order proximities, and in the sec-
ond pass you minimize the loss over the second order proximities.
Then, you combine the embeddings.
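Following this simplified reading (mine, not LINE's actual probability model), the two matrices are a couple of numpy operations away, assuming no isolated nodes:

```python
import numpy as np

def line_proximities(A):
    """A: |V| x |V| adjacency matrix. Returns 1st and 2nd order proximity matrices."""
    # Stochastic adjacency matrix: each row sums to 1,
    # i.e. one-step random walk transition probabilities.
    P1 = A / A.sum(axis=1, keepdims=True)
    # Two-step transition probabilities: the squared stochastic matrix.
    P2 = P1 @ P1
    return P1, P2
```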
Struct2Vec

Struct2Vec12 follows the guiding principle of structural similarity. The idea is to guide the random walker to explore nodes that are structurally similar. The key here is that there could be structurally similar nodes that are very far apart in the network – and therefore unlikely to appear together in a naive random walk. How do we fix the problem?
12 Leonardo FR Ribeiro, Pedro HP Saverese, and Daniel R Figueiredo. struc2vec: Learning node representations from structural identity. In SIGKDD, pages 385–394, 2017

The idea here is to build a multilayer graph G′ out of your original G. Figure 43.6 shows what it looks like. This multilayer graph has k layers, where k is the diameter of G. Each layer contains the same nodes as G. In each layer, all pairs of nodes in G are connected
Figure 43.6: (a) The original graph. (b) The multilayer network Struct2Vec builds. I only show one set of interlayer couplings, in green, for node 1. The line thickness is proportional to the edge's (and coupling's) weight.
You might think: what’s the point of doing random walks on a giant
clique? Well, it all comes down to the weights of those edges, which
guide the random walker to explore structurally similar nodes. A u, v
edge in layer i has a weight proportional to the structural similarity
of u and v at distance i. In practice, you take the set of all nodes
at i distance to u and you compare it with the set of all nodes at i
distance to v. Struct2vec compares them using the degrees of these
nodes, enforcing a structural similarity function that says that two
nodes are i-structurally similar if the nodes at distance i from them
have similar degree sequences. Note that, in Figure 43.6, some nodes
appear disconnected: they are actually connected with an edge
weight of zero, because of their low structural similarity.
G ′ is more useful than G because, when the random walk transi-
tions to any layer that is not the first, it can explore nodes that are far
apart more easily. Once the random walker reaches the last layer k, in
one step it can transition between the two farthest apart nodes in G –
because that layer represents the diameter.
The random walker can transition to a different layer via inter
layer couplings – in Figure 43.6 I show only two of them, in green.
These couplings connect node u to itself in the neighboring layers –
so u in layer i connects to itself in layers i + 1 and i − 1. The strength
of the coupling is proportional to u’s average similarity to all other
nodes in layer i. If u is similar to all layer i nodes, then it’s more
informative to go up one layer to i + 1.
multiple different types. For instance, one could adopt the metapath approach14 we saw in Section 24.2 when talking about link prediction in multilayer networks. The problem with heterogeneous networks is that there are some node types that are more dominant – i.e. connected – than others. Their representations would then be extremely noisy. In metapath2vec, the problem is solved by switching one's attention from nodes to metapaths.
14 Yuxiao Dong, Nitesh V Chawla, and Ananthram Swami. metapath2vec: Scalable representation learning for heterogeneous networks. In SIGKDD, pages 135–144, 2017
graphs34, 35, or to solve a specific subproblem. In the next chapter we'll see how deep learning can give us more natural inductive methods that do not suffer from this issue.
34 Seyed Mehran Kazemi, Rishab Goel, Kshitij Jain, Ivan Kobyzev, Akshay Sethi, Peter Forsyth, and Pascal Poupart. Representation learning for dynamic graphs: A survey. Journal of Machine Learning Research, 21(70):1–73, 2020
35 Meng Liu, Yue Liu, Ke Liang, Wenxuan Tu, Siwei Wang, Sihang Zhou, and Xinwang Liu. Deep temporal graph clustering. ICLR, 2024

43.6 Summary

1. The basic idea of random walk embeddings is to perform many
6. Many common node embedding methods can only build and use
embeddings for nodes that are already present in the graph. If you
add new nodes – because of a temporal graph or a train-validation
split – you need to recompute all embeddings from scratch.
43.7 Exercises
3. Compare the NMI score from the previous exercise to the one
you would get from a classical community discovery like label
propagation. Note: both methods are randomized, so you could
perform them multiple times and see the distributions of their
NMIs.
5. Is the AUC you get from the previous question better or worse
than the one you’d get from a classical link prediction like Jaccard,
Resource Allocation, Preferential Attachment, or Adamic-Adar?
44
Message-Passing & Graph Convolution
Before I start throwing math at you, I'll start with a simple graphical example of what a message-passing graph neural network (MPGNN) is. The reason is that this type of architecture is the basis of all the variants that will follow. Understanding this will provide the foundation to figure out what happens in graph convolution networks and in other models. The example I give here is pretty cartoonish, but it should help understanding. It is based on the early attempts to define graph neural networks4, 5, 6, which have been constantly refined over time7.
4 Franco Scarselli, Sweah Liang Yong, Marco Gori, Markus Hagenbuchner, Ah Chung Tsoi, and Marco Maggini. Graph neural networks for ranking web pages. In The 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI'05), pages 666–672. IEEE, 2005
5 Marco Gori, Gabriele Monfardini, and Franco Scarselli. A new model for learning in graph domains. In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, volume 2, pages 729–734. IEEE, 2005
6 Christian Merkwirth and Thomas Lengauer. Automatic generation of complementary descriptors with molecular graph networks. Journal of Chemical Information and Modeling, 45(5):1159–1168, 2005
7 Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In ICML, pages 1263–1272. JMLR.org, 2017

One key component in this framework is that the nodes start with some feature matrix, which we will call H^0. H^0 is a |V| × d matrix, associating each node with a vector of length d of features. If your graph doesn't have node attributes, you can always use some structural properties as attributes, for instance their degree.

The key process of a message-passing graph neural network is... message passing. At each iteration – which we call layer – we do three things:

1. Each node formulates a message based on its current node attributes and passes it to all its edges;

2. Each node receives the messages from its neighbors and aggregates them;

3. Each node updates its own attributes by combining them with the aggregated messages.
Figure 44.1 shows this process graphically for a node, and the final
result for the whole network.
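A cartoonish sketch of one such layer, mirroring the figure (average as the aggregation, sum plus softmax as the update); all names are mine, and every node is assumed to have at least one neighbor.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mpgnn_layer(adj, H):
    """adj: dict node -> set of neighbors; H: dict node -> attribute vector."""
    new_H = {}
    for v in adj:
        # 1. message: each neighbor u sends its current attributes H[u];
        messages = [H[u] for u in adj[v]]
        # 2. aggregate: average the incoming messages;
        aggregated = np.mean(messages, axis=0)
        # 3. update: combine with v's own attributes (sum + softmax here).
        new_H[v] = softmax(H[v] + aggregated)
    return new_H
```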
Note how I emphasized the words "message", "aggregate" and "update". That is because those are the functions that power the

Figure 44.1: A potential message-passing graph neural network architecture. From left to right: a network with 2D node attributes; the message-passing procedure for node 7; the result of the message-passing for all nodes. The different colors in the message-passing procedure highlight the three steps: message composition and passing in purple, aggregation in blue, and update in green.
Figure 44.3: Two non-isomorphic networks with 1D embeddings passing through one layer of a simple MPGNN. (a) Leading to the same result. (b) Leading to different results.
thinks that the two graphs are isomorphic, a simplicial MPGNN does
not, because it sees that one has two simplicial complexes that the
other does not. This approach of using high order structures connects
to a more general “topological” learning, of which graph neural
networks are a special case.
Smoothing Problems
There’s another issue with MPGNNs. As I told you, more layers
allow you to pool information from farther away in the network. One
thing you might be tempted to do is to pool information from the
entire network. That is to say, you could decide to have l layers so
that l is the diameter of the network. This way you know that, in
the end, even the farthest away pair of nodes will exchange some
information. However, you start to run into trouble, a specific kind of trouble we call "smoothing".
To understand smoothing, let’s take the process we started in
Figure 44.2 and let’s keep unfolding it. In Figure 44.2 we stopped
at the second layer, so in Figure 44.5 I start from the third and I
continue for a few layers. What do you observe?
All embeddings are getting more and more similar to each other.
That is, the MPGNN is losing its capability to distinguish the nodes
in the network. A few more layers, and each node will get the same embedding. In an MPGNN, you either stop early or you run long enough that every embedding becomes the same vector of constants. You knew this was going
to happen, because this is exactly the same dynamics I showed you
back in Section 11.6, when we talked about reaching a consensus in
the network. From that section you know that the eigenvalues of the
Laplacian will tell you when you will reach the consensus, so you
should stop far before that point.
Well, actually there are other things you can do to prevent smooth-
H^l = A H^{l−1},

that is to say: to get the values for layer l, you take those from layer l − 1 and you aggregate them by multiplying H^{l−1} by the adjacency matrix A. You know from Section 5.2 that the result of multiplying a |V| × d matrix (H^0) by a |V| × |V| matrix (A) is going to be another |V| × d matrix – so H^1 will be just another node attribute matrix. Figure 44.7 shows an abstract representation of how this happens.
Note how in the figure I use different colors for H 0 and H 1 , stress-
ing that the former is data while the latter is a representation of
what we’ve learned in the first layer. This process – and equation
– describes the aggregation step of the MPGNN I described at the
beginning of Section 44.1 – except that here the aggregation function
is the sum and not the average.
There are some issues with this formulation which you can solve
by implementing the update step of the MPGNN. If H^l = A H^{l−1}, then each node v completely forgets its layer l − 1 features – which is a problem especially for the first layer, since in that case the H^0 features we're forgetting are literally node v's actual real features. To fix this
issue we add self loops in the update step, assuming that each node v
is a neighbor of itself. This is also a matrix operation! It is equivalent
to adding the identity matrix I to A. So now we have:
H^l = (I + A) H^{l−1}.
Using a different perspective, this is equivalent to summing H^{l−1} to A H^{l−1}. This would lead to H^l = H^{l−1} + A H^{l−1}, and we can group the H^{l−1} terms, since I H^{l−1} = H^{l−1} – with I being the identity matrix.
Great, but we’re not done reproducing the basic MPGNN I showed
you at the beginning of the chapter. The first thing we need to solve
is that, in my original MPGNN, the aggregation function was not
the sum, but the average. Why might we prefer the average over the
sum? Well, if you have a lot of layers, constantly summing features
will lead them to eventually blow up in scale. All values will diverge
to infinity. You might want to keep each H l within the same scale.
You can actually see this problem happening in the example from
Figure 44.3, with the resulting embedding being much larger than the
original one.
You might think to solve this problem by using the adjacency
matrix, normalized by the degree – yet another thing you can do
with matrix multiplications: D^{−1} A. However, sometimes a more

H^l = D̂^{−1/2} (I + A) D̂^{−1/2} H^{l−1},

H^l = σ(D̂^{−1/2} (I + A) D̂^{−1/2} H^{l−1}).
This is a big pile of linear algebra already, but we can still de-
mystify it a bit with a graphical representation. That formula is
equivalent to the operation I depict in Figure 44.8.
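The same operation spelled out with numpy, with a row-wise softmax standing in for σ as in the running example; the function name is mine.

```python
import numpy as np

def gcn_layer(A, H):
    """One H^l = sigma(D^{-1/2} (I + A) D^{-1/2} H^{l-1}) step."""
    n = A.shape[0]
    A_hat = np.eye(n) + A                                   # add self loops
    D_inv_sqrt = np.diag(1 / np.sqrt(A_hat.sum(axis=1)))    # D hat ^ {-1/2}
    H_new = D_inv_sqrt @ A_hat @ D_inv_sqrt @ H             # normalized aggregation
    # sigma: here, a row-wise softmax.
    e = np.exp(H_new - H_new.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```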
This is starting to become a handful, isn’t it? Yet, it’s still matrix
multiplications all the way down. And we know what each piece
means, except W and what its role is, so let’s take it step by step.
Fundamentally, W does three things:
Figure: the sizes of the rectangles (the W^l and H^l blocks) are proportional to the dimensions of their matrices.
know the entire topology of the graph to learn a specific node's embedding. There are many popular GCN methods that are spatial31, 32. This, by the way, solves the transductivity problems of shallow learning we saw in Section 43.5. If you get a new node in your network, you can determine its new embeddings by only running the local part of the graph neural network.
31 James Atwood and Don Towsley. Diffusion-convolutional neural networks. In NIPS, pages 1993–2001, 2016
32 Mathias Niepert, Mathias Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In ICML, pages 2014–2023, 2016
However, spatial approaches are not the only game in town, just like they weren't when we dealt with shallow learning. In shallow learning, random walk-based methods are also spatial, and I showed you there are alternatives, namely by using a spectral approach.

In GCNs we can have a spectral approach as well33, 34. If in the spatial approach we learn the embedding of v only by looking at v's neighbors, in the spectral approach v's updated embedding depends on the entire graph as a whole.
33 Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann Lecun. Spectral networks and locally connected networks on graphs. In ICLR, 2014
34 Renjie Liao, Zhizhen Zhao, Raquel Urtasun, and Richard S Zemel. Lanczosnet: Multi-scale deep graph convolutional networks. arXiv preprint arXiv:1901.01484, 2019b

This is because we see the embedding as a signal that is processed by the entire graph. To make sense of this sentence you need to un-
the weight you learned by the Fourier transform, you’ll get the exact
value of the signal. Figure 44.13 shows you how to think about this
visually.
In the graph Fourier transform, you want to do the same, but
your signal is H l −1 . How to do it? Well, one key part of the regular
Fourier transform is to calculate the difference between the value of
the signal at a point and the values of the signal in the neighboring
points. This is done with the Laplace operator and you should have
alarm bells ringing in your head right now. The Laplacian does
exactly the same thing on a graph: the −1 entry for each edge gives
you exactly the difference of a node with its neighbor.
The way to use L in the graph Fourier transform is to exploit the
following equivalence: L = ΦΛΦ−1 – this is the eigendecomposi-
tion we saw in Section 5.6. What are Φ and Λ? Λ is L’s eigenvalue
diagonal matrix:
Λ = ⎛ λ_0  …  0  ⎞
    ⎜  ⋮   ⋱  ⋮  ⎟
    ⎝  0   …  λ_n ⎠
44.5 Summary
2. Limitations of this approach come from the fact that there are
structures MPGNNs aren’t able to distinguish, that piling up
too many layers will usually result in all nodes having the same
embeddings, and that messages from peripheral nodes are usually
lost.
44.6 Exercises
3. Implement an MPGNN as a series of matrix operations, implementing H^l = σ(D̂^{−1/2} (I + A) D̂^{−1/2} H^{l−1}), with σ being
softmax. Apply it to the network at http://www.networkatlas.
eu/exercises/44/3/network.txt, with node features at http:
//www.networkatlas.eu/exercises/44/3/features.txt. Compare
its running time with the MPGNN you implemented in the first ex-
ercise, running each for 20 layers and making several runs noting
down the average running time.
45
Deep Graph Learning Models
45.1 Attention
they can come from multiple fields, and so might confuse you when
trying to predict fields with citations. In those cases, you want to pay
less attention to them. 2
Petar Veličković, Guillem Cucurull,
In Graph Attention Networks2 , 3 (GATs) the way to implement this Arantxa Casanova, Adriana Romero,
is by looking at the D −1 A transformation that we have in MPGNNs. Pietro Lio, and Yoshua Bengio. Graph
attention networks. arXiv preprint
This is sort of an attention mechanism: it is the trivial one where
arXiv:1710.10903, 2017
each neighbor get the same amount of attention. In this simplified 3
Shaked Brody, Uri Alon, and Eran
example, each neighbor of node v gets exactly 1/k v attention, with k v Yahav. How attentive are graph
attention networks? arXiv preprint
being v’s degree. In this framework, v’s attention only depends on v’s arXiv:2105.14491, 2021
characteristics, specifically its degree.
GATs unlock a new degree of freedom allowing v’s attention
to depend also on the characteristics of the neighboring node u 4
Shyam A Tailor, Felix L Opolka,
as well4 . GCNs could sort of do the same, because the symmetric Pietro Lio, and Nicholas D Lane.
D̂ −1/2 ( I + A) D̂ −1/2 normalization also depends on the neighbor Adaptive filters and aggregator fusion
√ √ for efficient graph convolutions. arXiv
u’s degree, but this attention is always fixed to be 1/ kv ku .
preprint arXiv:2104.01481, 2021
Instead, GATs try to learn – again with backpropagation – a different
attention value for each of v’s neighbors. Since we introduced this
learnable function as α, the GAT formula is:
H l = σ H l −1 α l W l .
There is quite some freedom in how to define α for GATs. The key
thing is that α takes as input both H l −1 and W l , which is what allows
it to be so flexible. So, in summary, GCNs are a special type of GATs:
they are GATs that have a fixed non-learnable α. Therefore, GATs are
a generalization of GCNs. In Figure 45.2 I call back to Figure 44.8.
GCNs allow us to go from Figure 45.2(a) to Figure 45.2(b), preventing
some issues that would be caused by using only A. GATs allow us
to go from Figure 45.2(b) to Figure 45.2(c), realizing we’re not forced
to always use the same transformation, but we can actually learn the
Figure 45.3: What happens in the layer l of a graph transformer.
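To make the difference with the fixed GCN weights tangible, here is a minimal sketch of a learnable-attention step, assuming numpy; the scoring function used here (a dot product between transformed embeddings, softmax-normalized over each node's neighbors) is only one illustrative choice of α, not the exact parametrization of the GAT paper.

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
G = nx.karate_club_graph()
A = nx.to_numpy_array(G)
H = rng.random((A.shape[0], 8))   # previous layer embeddings H^(l-1)
W = rng.random((8, 8))            # layer weights W^l (would be learned)

# Attention scores: how much node v listens to each of its neighbors.
# In a GCN this would be the fixed value 1/k_v for every neighbor.
Z = H @ W
scores = np.where(A > 0, Z @ Z.T, -np.inf)        # only neighbors count
alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
alpha = np.where(A > 0, alpha, 0.0)
alpha = alpha / alpha.sum(axis=1, keepdims=True)  # rows sum to one

# One attention-weighted propagation step (tanh standing in for sigma).
H_next = np.tanh(alpha @ Z)
```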
Variational Autoencoders
13
Diederik P Kingma and Max Welling.
To understand variational autoencoders13 we need to start by break- Auto-encoding variational bayes. arXiv
ing down their name into its component parts: we start by under- preprint arXiv:1312.6114, 2013
standing what an “encoder” is, what it means for it to be “auto”, and
what “variational” means.
The encoder part is the easiest. An encoder is a function that takes
an input and produces a code – a more succinct representation of
the same data that went in. You’ve already seen a bunch of encoders
up until now. Every time you produce a node embedding you’re
encoding it into a different representation. An autoencoder aims to
reconstruct the original data by passing it through an encoder and
then recovering it. This means we need a decoder that reconstructs
the original data. Figure 45.4 shows you the general structure of an
autoencoder.
Figure 45.4: The general structure of an autoencoder: (left to right) the input data (blue) is encoded (purple) in an embedding (green), then decoded (purple) to a second layer embedding (blue), which is as similar as possible to the original data.
Both the encoder and the decoder can be learned, so that the loss
in reconstructing the original data is minimal. You might think it’s
pointless to try and reconstruct the original data – after all you have
it, so why bother regenerating it? – but the value is that now your
encoder and decoder have learned the key features of the data. At
this point, you don’t need the original data any more and you can
generate as much data as you want.
Why would we need a variational part? That’s what introduces ran-
domness. If you always reconstruct the input features H 0 , then you’re
going to always get the same embeddings from the autoencoder, as
you can see in Figure 45.5 – where I unpack the values inside H 1 . In
practice, the objective of an autoencoder is to learn directly the em-
beddings and use them to reconstruct the original adjacency matrix
A.
However, what you can do instead is to learn not the embeddings
themselves, but the distribution of their values. For instance, you
can assume the embeddings distribute normally, and then you use
two neural networks to learn the mean and standard deviations
of those distributions. This means that you treat the embeddings
as random variables, which can be more succinctly described by
the parameters of their distribution. Once you have learned such
[Figure: the autoencoder on a graph: the adjacency matrix A (in red) passes through the encoder, which learns the H1 embeddings (in green), and the decoder reconstructs the adjacency matrix.]
[Figure: the variational version: A → Encoder → (Mean, Std Dev) → Decoder → A′.]
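A minimal sketch of the variational idea, assuming numpy: instead of storing the embeddings themselves we keep a mean and a standard deviation per dimension (in a real variational graph autoencoder these would be produced by two learned encoders) and sample fresh embeddings whenever we need them; the inner-product decoder is one common way to reconstruct A, not necessarily the one used here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, dim = 5, 2

# In a real model these would be the outputs of two neural networks;
# here they are just toy parameters of the embedding distributions.
mu = rng.random((n_nodes, dim))
sigma = 0.1 * rng.random((n_nodes, dim))

def sample_embeddings():
    # Reparameterization: embedding = mean + std * gaussian noise.
    return mu + sigma * rng.standard_normal((n_nodes, dim))

H1 = sample_embeddings()
# A simple inner-product decoder: reconstructed adjacency probabilities.
A_reconstructed = 1 / (1 + np.exp(-(H1 @ H1.T)))
```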
Figure 45.7: (a) A network with 2D node embeddings next to their corresponding nodes. (b) The resulting embedding for node v, using average as the aggregation function.
28
Ryan L Murphy, Balasubramaniam
permutation invariant, and then average it over all permutations28 . Srinivasan, Vinayak Rao, and Bruno
Normally, there will be too many permutations to be able to do this Ribeiro. Janossy pooling: Learning
deep permutation-invariant functions
exhaustively. So you can randomly sample some permutations, or
for variable-size inputs. arXiv preprint
deploy a canonical ordering of nodes – so there is only one order and arXiv:1811.01900, 2018
this operation becomes permutation invariant.
29
Ben Chamberlain, James Rowbottom,
The second alternative is to use normal aggregation functions, but Maria I Gorinova, Michael Bronstein,
refusing to run them in a discrete way. You can simulate a continu- Stefan Webb, and Emanuele Rossi.
Grand: Graph neural diffusion. In
ous flow of information and then discretize it at an arbitrary moment.
International Conference on Machine
This is a technique that we call continuous message passing29 , 30 , 31 , 32 Learning, pages 1407–1418. PMLR, 2021a
and has relations with the more generic form of geometric deep 30
Benjamin Chamberlain, James Row-
learning – of which graph neural networks are a special, discrete, bottom, Davide Eynard, Francesco
Di Giovanni, Xiaowen Dong, and
case. Michael Bronstein. Beltrami flow and
To give an oversimplified example, in Figure 45.8(b) I only pass neural diffusion on graphs. Advances in
Neural Information Processing Systems, 34:
half of the messages from the neighbors, which results in node v 1594–1609, 2021b
preserving more of its own original embedding. 31
Cristian Bodnar, Francesco Di Gio-
vanni, Benjamin Chamberlain, Pietro
Lio, and Michael Bronstein. Neural sheaf
diffusion: A topological perspective on
45.5 Practical Considerations heterophily and oversmoothing in gnns.
Advances in Neural Information Processing
Computational Efficiency Systems, 35:18527–18541, 2022
32
Fabrizio Frasca, Emanuele Rossi,
In practically all cases, you don’t want to implement deep graph Davide Eynard, Ben Chamberlain,
Michael Bronstein, and Federico Monti.
neural network architectures by naively applying the operations I Sign: Scalable inception graph neural
explained to you so far. Consider the message-passing model: a node networks. arXiv preprint arXiv:2004.11198,
2020
can only send the same message to all of its neighbors in the MPGNN architecture. If you 33
Rex Ying, Ruining He, Kaifeng
have a hub with thousands of connections, you’re going to repeat the Chen, Pong Eksombatchai, William L
same operation thousands of times, to achieve the same result. In a Hamilton, and Jure Leskovec. Graph
convolutional neural networks for
GCN, since you’re doing a single matrix multiplication, you don’t web-scale recommender systems. In
have this problem, but you have another one: if your graph is huge, Proceedings of the 24th ACM SIGKDD
international conference on knowledge
so are your matrices, and they might not fit in memory. discovery & data mining, pages 974–983,
One thing you can do is to apply the sampling and mini-batching 2018a
strategies33 , 34 , 35 , 36 (Section 4.4). In a graph you need specific ways
34
Will Hamilton, Zhitao Ying, and Jure
Leskovec. Inductive representation
of sampling. You know the intricacies of network sampling from learning on large graphs. In NIPS, pages
Chapter 29. As a refresher, Figure 45.9(a) shows that sampling nodes 1024–1034, 2017
randomly will most likely lead to disconnected samples where any
35
Hanqing Zeng, Hongkuan Zhou,
Ajitesh Srivastava, Rajgopal Kannan,
message-passing strategy won’t have anything to work with. and Viktor Prasanna. Graphsaint: Graph
In the figure we were even lucky that the randomness led to
method. arXiv preprint arXiv:1907.04931,
somewhat connected patches, but each patch cannot communicate 2019
with the others and we have wasted the time to sample and to com- 36
Wei-Lin Chiang, Xuanqing Liu,
pute the messages to essentially receive no information whatsoever. Si Si, Yang Li, Samy Bengio, and Cho-
Jui Hsieh. Cluster-gcn: An efficient
One minibatching strategy – which I show in Figure 45.9(b) – will algorithm for training deep and large
get the messages for node v from a given number of nodes from graph convolutional networks. In
v’s neighborhood, and then recursively sample the neighbors of the Proceedings of the 25th ACM SIGKDD
international conference on knowledge
neighbors and so on, until we reached the desired depth. In the fig- discovery & data mining, pages 257–266,
ure, we want to receive messages from up to two neighbors, up to a 2019
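A minimal sketch of this recursive neighborhood sampling, assuming networkx; the fanout and depth values are illustrative choices, not parameters prescribed by any specific method.

```python
import random
import networkx as nx

def sample_neighborhood(G, v, fanout, depth, seed=0):
    # Recursively sample up to `fanout` neighbors per node, up to `depth`
    # hops away from v, and return the induced subgraph: a mini-batch
    # that still gives v something to receive messages from.
    rng = random.Random(seed)
    nodes, frontier = {v}, [v]
    for _ in range(depth):
        next_frontier = []
        for u in frontier:
            neighbors = list(G.neighbors(u))
            for w in rng.sample(neighbors, min(fanout, len(neighbors))):
                if w not in nodes:
                    nodes.add(w)
                    next_frontier.append(w)
        frontier = next_frontier
    return G.subgraph(nodes)

G = nx.karate_club_graph()
batch = sample_neighborhood(G, v=0, fanout=2, depth=2)
```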
Regularization
As I mentioned in Section 4.1, one of the greatest woes of machine
learning is overfitting. Since you’re learning from a lot of data
bottom-up without imposing a theory top-down, you might end
up just learning the special quirks of the data you use for training.
In neural networks there are a few techniques to avoid overfitting,
which we normally call “regularization”. A classical strategy is called
“dropout”: you take your matrix W and you randomly set to zero 37
Li Wan, Matthew Zeiler, Sixin Zhang,
some entries or entire rows37 , 38 . Yann Le Cun, and Rob Fergus. Reg-
ularization of neural networks using
When working with a graph neural network, you can do a special dropconnect. In International conference
variation of this: edge dropout39 . Since you also have A besides W, on machine learning, pages 1058–1066.
PMLR, 2013
you can do dropout in A as well. This means to randomly remove 38
Nitish Srivastava, Geoffrey Hinton,
a few edges between layers, which will make it harder for your Alex Krizhevsky, Ilya Sutskever, and
framework to overfit to the original A. Ruslan Salakhutdinov. Dropout: a
simple way to prevent neural networks
from overfitting. The journal of machine
learning research, 15(1):1929–1958, 2014

Augmentation
39
Michael Schlichtkrull, Thomas N Kipf,
Augmentation is something you’d do if your input graph is too Peter Bloem, Rianne Van Den Berg, Ivan
Titov, and Max Welling. Modeling re-
sparse, which means information might have a hard time propagat- lational data with graph convolutional
ing through the layers. Curiously, in this case you’d do the exact networks. In European Semantic Web
Conference, pages 593–607. Springer,
opposite of the regularization strategy we just discussed: you’d try 2018
to add virtual edges – or virtual nodes40 . It might be difficult some- 40
Johannes Gasteiger, Aleksandar
times to add proper virtual edges to actually improve the situation Bojchevski, and Stephan Günnemann.
Predict then propagate: Graph neural
rather than adding noise, but in some cases things can be easier. For networks meet personalized pagerank.
instance, if you deal with a bipartite network, it’s kind of natural to arXiv preprint arXiv:1810.05997, 2018
add virtual edges between nodes of the same type, if they have a lot
Applications
45.6 Summary
45.7 Exercises
46.1 Aggregation
51. In that chapter, you’ll see that one of the biggest problems is
that nodes sometimes snuggle together a bit too closely, overlapping
with each other. One could identify such nodes that tend to occupy
the same position in space and simply aggregate them into a super 7
Emden R Gansner and Yifan Hu.
node7 , and use this information to bundle up edges as well8 – edge Efficient node overlap removal using a
bundling is a classic visualization improvement I’ll discuss in Section proximity stress model. In International
Symposium on Graph Drawing, pages
51.3.
206–217. Springer, 2008
Of course, community discovery, graph condensation, or visual- 8
Emden R Gansner, Yifan Hu, Stephen
ization were not originally developed with summarization in mind. North, and Carlos Scheidegger. Multi-
level agglomerative edge bundling for
Thus it is possible to design node aggregation methods that are
visualizing large graphs. In 2011 IEEE
specialized for summarization, even if inspired by other related Pacific Visualization Symposium, pages
approaches. One is Grass9 , 10 . In Grass one performs the node ag- 187–194. IEEE, 2011
9
Kristen LeFevre and Evimaria Terzi.
gregation in such a way that the errors in reconstructing the original
Grass: Graph structure summarization.
adjacency matrix are minimized. In Proceedings of the 2010 SIAM Interna-
Suppose you condensed the graph in Figure 46.2(a) into the graph tional Conference on Data Mining, pages
454–465. SIAM, 2010
in Figure 46.2(b). Now all you have is Figure 46.2(b), but you might 10
Matteo Riondato, David García-
want to know what is the probability that nodes 1 and 4 are con- Soriano, and Francesco Bonchi. Graph
nected. You can reconstruct Figure 46.2(a)’s adjacency matrix via summarization with quality guarantees.
Data mining and knowledge discovery, 31
Figure 46.2(b)’s – and keeping track of the original number of edges (2):314–349, 2017
inside and between each super node.
Figure 46.2: (a) An input graph. (b) Aggregation of (a). Node labels report the nodes collapsed into the super node. Edge labels record the number of edges inside or between super nodes. (c) The adjacency matrix of (a) as reconstructed via (b).
For instance, if two nodes u and v are in the same super node a, then their expected connection probability is |E_a|/(|V_a|(|V_a| − 1)), namely the number of edges collapsed inside a over all edges a could contain. Vice versa, if u and v are in different super nodes a and b, then their connection probability is |E_ab|/(|V_a||V_b|), again: the number of edges between a and b over all the possible edges that there could be. Thus, the reconstructed adjacency matrix of the original graph is the one in Figure 46.2(c). The quality function guiding this process is mutual information: the higher the mutual information between the original and the reconstructed matrix, the better the aggregation is.
Other approaches compress structurally equivalent nodes11 – see Section 15.2 for a refresher on structural equivalence.
11 Hannu Toivonen, Fang Zhou, Aleksi Hartikainen, and Atte Hinkka. Compression of weighted graphs. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 965–973, 2011
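A minimal sketch of the reconstruction just described, assuming networkx, a toy graph, and a hard-coded super node assignment; the probabilities follow the two formulas above rather than the exact bookkeeping of the Grass implementation.

```python
from itertools import combinations
import networkx as nx

G = nx.Graph([(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)])  # toy input graph
supernodes = {"a": {1, 2, 3}, "b": {4, 5}}               # a chosen aggregation
membership = {v: s for s, nodes in supernodes.items() for v in nodes}

def reconstructed_probability(u, v):
    a, b = membership[u], membership[v]
    if a == b:  # same super node: internal edges over |V_a|(|V_a| - 1)
        Va = supernodes[a]
        Ea = sum(1 for x, y in combinations(Va, 2) if G.has_edge(x, y))
        return Ea / (len(Va) * (len(Va) - 1))
    # different super nodes: edges between them over |V_a||V_b|
    Va, Vb = supernodes[a], supernodes[b]
    Eab = sum(1 for x in Va for y in Vb if G.has_edge(x, y))
    return Eab / (len(Va) * len(Vb))

print(reconstructed_probability(1, 2))  # inside super node a
print(reconstructed_probability(1, 4))  # between super nodes a and b
```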
Respecting the adjacency of nodes is not necessarily the only
46.2 Compression
times the character “h” follows specific other characters and so on.
In practice, you’re modeling your text with a model M. Encoding M
takes some bits, but it saves many more. This is following the same
philosophy as the Infomap community discovery approach we saw in
Section 35.2.
Translating this into graph-speak, you want to construct a model
M of your graph G so that the length L describing both is minimal, 13
Saket Navlakha, Rajeev Rastogi,
or: min L( G, M ) = L( M ) + L( G | M ). An example13 creates M using a and Nisheeth Shrivastava. Graph
two-step code: (i) each super node in the summary is connected, in summarization with bounded error. In
Proceedings of the 2008 ACM SIGMOD
the original graph, to all nodes in all super nodes adjacent to it in the
international conference on Management of
summary; and (ii) we correct every edge mistake with an additional data, pages 419–432, 2008
instruction.
Figure 46.4: (a) An input graph. (b) Its summarization via minimization of description length. I label each super node with the list of nodes it contains. On the bottom, the additional rules we need to reconstruct (a): + (1,5), − (4,8).
46.3 Simplification
Figure 46.5: (a) An input graph: directors (in blue) and actors (in red) connected if they collaborated with each other. (b) Its simplification via the selection of only actor-type nodes directly connected to Tim Burton.
Those are not the only valid approaches. If the graph also has
metadata attached to nodes – or edges – we can exploit them. For 18
Zeqian Shen, Kwan-Liu Ma, and
instance, Ontovis18 allows the simplification of the graph via the Tina Eliassi-Rad. Visual analysis of
large heterogeneous social networks
specification of a set of attribute values we’re interested in studying. by semantic and structural abstraction.
For instance, the graph in Figure 46.5(a) can be simplified into the IEEE transactions on visualization and
computer graphics, 12(6):1427–1439, 2006
one in Figure 46.5(b), if we’re only interested in knowing the rela-
tionships between actors (node type value) working with Tim Burton
(topology attribute).
Ontovis finds the best way to simplify the graph, primarily fo-
cusing on its visual characteristics when plotted in 2D: it is first
and foremost a visualization-aiding tool. Ontovis focuses on node 19
Cheng-Te Li and Shou-De Lin. Ego-
attributes, but one could also switch their focus to edges19 . centric information abstraction for
heterogeneous social networks. In 2009
Note, also, that another difference with sampling is that, in graph International Conference on Advances
simplification, we are not really interested in preserving any spe- in Social Network Analysis and Mining,
pages 255–260. IEEE, 2009
cific property of the original graph. This is, instead, a core focus of
network sampling.
[Figure: different aggregations of a purchase graph over the Product, Place, and Time dimensions. Panel titles: which products sold when and where; time and place of all purchases across all products; what sold when and where, for a subset of products, places & times; which products a place sold.]
Visually, this would look like Figure 46.7. The central hubs influ-
ence each other, and each is responsible for influencing their branch.
Thus one could summarize the graph as a clique of interacting tribes.
Of course, you don’t have to use GuruMine for this. In some cases, re-
searchers have used special adaptations to estimate community-level 28
Yasir Mehmood, Nicola Barbieri,
influence28 . Francesco Bonchi, and Antti Ukkonen.
There’s a completely different way to interpret summarization Csi: Community-level social influence
analysis. In Joint European Conference
by influence preservation. We have seen that the spectrum of the on Machine Learning and Knowledge
Laplacian can be used to partition a graph – solving the cut problem Discovery in Databases, pages 48–63.
Springer, 2013
(Section 11.5). This is related to diffusion processes: the reason why
the eigenvectors of the Laplacian help you with cutting is because
they are a sort of simulation of a diffusion process, and the edges to
cut are the bottlenecks through which things cannot flow efficiently.
29
Manish Purohit, B Aditya Prakash,
For now, let’s take this for granted, but we’ll see more about this rela-
Chanhyun Kang, Yao Zhang, and
tionship in Section 47.2, where we’ll talk about using the Laplacian to VS Subrahmanian. Fast influence-based
estimate distances between sets of nodes by releasing a flow from the coarsening for large networks. In
Proceedings of the 20th ACM SIGKDD
nodes in the origin to the nodes in the destination. international conference on Knowledge
If we want to summarize the graph to preserve these diffusion discovery and data mining, pages 1296–
1305, 2014
properties, we can use the Laplacian to guide our process. What 30
Michael Mathioudakis, Francesco
we’re after, in the simplest possible terms, is a smaller Laplacian ma- Bonchi, Carlos Castillo, Aristides Gio-
trix, with fewer rows and columns, that has the same eigenvectors29 . nis, and Antti Ukkonen. Sparsification
of influence networks. In Proceedings
Among other interesting approaches there is SPINE30 . In SPINE,
of the 17th ACM SIGKDD international
one analyzes many influence events in the network. Then, SPINE conference on Knowledge discovery and
only keeps in the network the edges that are able to better explain data mining, pages 529–537, 2011
the paths of influence you observe. You might realize that an edge
is never used to transport information, and thus you could remove
it from the structure without hampering your explanatory power.
Figure 46.8 shows an example of this.
46.5 Summary
46.6 Exercises
points of one to the interest points of the other. Small amounts will
indicate that the images are similar.
3
Ricardo Hausmann, César A Hidalgo,
• In economics3 , 4 , you can represent products as nodes, connected if Sebastián Bustos, Michele Coscia,
there is a significant number of countries that are able to co-export Alexander Simoes, and Muhammed A
Yildirim. The atlas of economic complexity:
significant quantities of them. A country occupies the products in
Mapping paths to prosperity. Mit Press,
this network it can export. From one year to another, the country 2014
will change its export basket, by shifting its industries to different 4
César A Hidalgo, Bailey Klinger, A-
L Barabási, and Ricardo Hausmann.
products. How dynamic is the country’s export basket? The product space conditions the
• In epidemics5 , 6 , 7 , a disease occupies the nodes in a social network development of nations. Science, 317
(5837):482–487, 2007
it has infected. Across time, the disease will move from a set 5
Vittoria Colizza, Alain Barrat, Marc
of infected individuals to another. Similarly, in viral marketing, Barthélemy, and Alessandro Vespignani.
product adoption can be modeled as a disease. The role of the airline transporta-
tion network in the prediction and
predictability of global epidemics.
All these cases can be represented by the same problem formula- Proceedings of the National Academy of
tion. You have a network G. Then you have two vectors: an origin Sciences of the United States of America,
103(7):2015–2020, 2006a
vector p and a destination vector q. Both p and q tell you how much 6
Ayalvadi Ganesh, Laurent Massoulié,
value there is in each node. pu tells you how much value there is and Don Towsley. The effect of network
in node u at the origin, and qv tells you how much value there is in topology on the spread of epidemics.
In INFOCOM 2005. 24th Annual Joint
node v at the destination.
Conference of the IEEE Computer and
All you want to do is to define a δ( p, q, G ) function. Given the Communications Societies. Proceedings
graph and the vectors of origin and destination, the function will tell IEEE, volume 2, pages 1455–1466. IEEE,
2005
you how far these vectors are. There are many ways to do so, which 7
Romualdo Pastor-Satorras and
are organized in a survey paper8 , on which this chapter is based. Alessandro Vespignani. Epidemic
Before we jump into the network distances, it is probably wise to dynamics and endemic states in com-
plex networks. Physical Review E, 63(6):
have a refresher on non-network distances, since it will allow us to 066117, 2001a
introduce concepts that will be helpful later. 8
Michele Coscia, Andres Gomez-
Lievano, James McNerney, and Frank
Neffke. The node vector distance
47.1 Non-Network Distances problem in complex networks. ACM
Computing Surveys, 2020
How to estimate node vector distances on networks is a new and
difficult problem. Let’s take it easy and first have a quick refresher
on the many ways we can estimate distances of vectors without a
network. The easiest way to do it is by assuming a vector of numbers
just represents a set of coordinates in space. If you’re on Earth, with
three numbers you can establish your latitude, longitude and altitude.
That is enough to place you on a position in a three dimensional
space. Another person might be at a different latitude, longitude and
altitude than you. What is the distance between you and your friend?
Easy! You throw a straight rope between you and your friend and its
length is the distance between you. This is the Euclidean distance.
In Section 5.4 I did my very best to connect this intuitive idea
in real life with linear algebra operations. The pain you felt back
then should pay off now. To sum up, if p and q are the vectors
defining your two positions in space, the Euclidean distance is
((p − q)^T (p − q))^{1/2}.
Figure 47.2: Euclidean distance in m = 2 dimensions. We build the special p − q vector to have, at its ith entry, the difference between the ith entries of p and q.
In cosine distance you look at the angle made by the vectors con-
necting the two points, as Figure 47.4 shows with a thick green line.
The distance between them is one minus the cosine of that angle.
This is useful, because the cosine is 1 for angles of zero degrees and 0
for angles at ninety degrees. Two points on the same straight line will
have a distance of zero, even if they’re infinitely farther apart on such
a line. For instance, the two points at the bottom of Figure 47.4 are at
a considerable Euclidean distance, but practically neighbors when it
comes to cosine distance. Sometimes in life it doesn’t matter where
δ(p, q, G) = ((p − q)^T L^† (p − q))^{1/2},
with L† being the pseudoinverse of the graph Laplacian of G.
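A minimal sketch of this distance, assuming numpy and networkx; p and q are arbitrary toy vectors carrying the same total weight.

```python
import numpy as np
import networkx as nx

G = nx.karate_club_graph()
L = nx.laplacian_matrix(G).toarray().astype(float)
L_pinv = np.linalg.pinv(L)  # pseudoinverse of the graph Laplacian

p = np.zeros(G.number_of_nodes()); p[[0, 1]] = 0.5    # origin vector
q = np.zeros(G.number_of_nodes()); q[[32, 33]] = 0.5  # destination vector

diff = p - q
distance = np.sqrt(diff @ L_pinv @ diff)
```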
Markov Chain
Random walks are helpful to estimate node-node distances (Section
14.4). If we are in the situation of Figure 47.6, we could estimate the
distance between the red and blue node by simply asking how long it
will take for a random walker to go from one node to the other. Here,
we generalize this idea to groups of nodes.
In the Markov Chain distance we start from the assumption that,
given a starting point p, by looking at G we can construct an ex-
pected “next step”, which is E(q) = Ap. This expected behavior
follows a simple random walk (which is a Markov process, see Sec-
tion 2.7). In other words, we expect that q should be the result of a
one step diffusion of p via random walks. For this to be the case, A
needs to be the stochastic adjacency matrix. Here, we also set the
diagonal of the adjacency matrix to be equal to one before we trans-
form it into its stochastic version. This is equivalent to adding a self loop
to all nodes in the network: we want to allow the diffusion process to
stand still in the nodes it already occupies.
For each node u in the network, we can calculate the expected
occupancy intensity in the next time step by unrolling the previous
formula: E(q_u) = Σ_{v∈V} A_{u,v} p_v. This is helpful, because it allows us
to calculate the standard deviation of this expectation, σu,v , which
we can do by making a few assumptions on the distribution of this 12
Michele Coscia, Andres Gomez-
expectation (which are spelled out in the original paper12 ). For now, Lievano, James McNerney, and Frank
suffice to say that such deviation is σu,v = ( pv Au,v (1 − Au,v )). Neffke. The node vector distance
problem in complex networks. ACM
Since we now have an expectation and a deviation, we can calcu-
Computing Surveys, 2020
late the z-score of the observation, which is a measure of how many
standard deviations your observation is distant from the expectation.
We calculate a z-score for each node in the network and place it in its
corresponding spot in a diagonal matrix:
[Z]_{u,u} = Σ_{v∈V} σ²_{u,v}.
Now, the problem of this formulation is that it’d make this dis-
tance not symmetric, because to build σu,v we only used p as the
origin of the diffusion. So, if σu,v is the deviation of the diffusion
from v to u, we can also calculate a deviation of the diffusion from u
to v: σ_{v,u}. This is done as above, switching p's and q's places. Then the u, u entry in Z's diagonal is the sum of σ²_{u,v} and σ²_{v,u}.
Finally we can write our distance as:
δ_{p,q,G} = ((p − q)^T Z^{−1} (p − q))^{1/2}.
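A minimal sketch of the whole procedure, assuming numpy and networkx, and following one literal reading of the formulas above (the original paper spells out the exact assumptions); p and q are toy vectors.

```python
import numpy as np
import networkx as nx

G = nx.karate_club_graph()
n = G.number_of_nodes()

# Stochastic adjacency matrix with self loops, as described above.
A = nx.to_numpy_array(G) + np.eye(n)
A = A / A.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
p = rng.random(n); p /= p.sum()   # toy origin vector
q = rng.random(n); q /= q.sum()   # toy destination vector

# sigma_{u,v} = p_v A_{u,v} (1 - A_{u,v}), plus the counterpart obtained
# by switching p and q, as in the text.
S_p = (A * (1 - A)) * p[np.newaxis, :]
S_q = (A * (1 - A)) * q[np.newaxis, :]
Z = np.diag((S_p ** 2).sum(axis=1) + (S_q ** 2).sum(axis=1))

diff = p - q
distance = np.sqrt(diff @ np.linalg.inv(Z) @ diff)
```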
Annihilation
Let’s take that last thought a bit further. If you let your p and q vec-
tors to diffuse via random walks for an infinite amount of time, they
will distribute themselves to all nodes of G proportionally to their
degree, because they will both tend to approximate the stationary
distribution. In the wavy water basin I mentioned before, p and q are
simply two different waves conditions, while the stationary distribu-
tion is... well ... the stationary distribution: a waveless basin where
the water is at the same level everywhere. Mathematically, this means
that Σ_{k=0}^{∞} A^k q and Σ_{k=0}^{∞} A^k p are the same thing, or Σ_{k=0}^{∞} A^k (p − q) = 0.
Now, the interesting bit is for which value of k this is true or, put
in other words, how fast p and q will cancel out. If they cancel
each other out quickly, it means that they were already pretty similar
to begin with. In fact, that equation would be true at k = 0 if p = q.
So we’re interested in the speed of that equation. This is given us by
the following formula:
∞
δp,q,G = (( p − q) T ∑ Ak ( p − q))1/2 .
k =0
You shouldn't be scared of the infinite sum Σ_{k=0}^{∞} A^k: it converges
and there is a proper solution to it, which is in the survey paper I’ve
been citing in this chapter.
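The closed-form solution is in that survey; purely as an illustration, assuming numpy and networkx, one can approximate the infinite sum by truncating it after a fixed number of steps, which conveys the idea even if it is not the actual estimator.

```python
import numpy as np
import networkx as nx

G = nx.karate_club_graph()
n = G.number_of_nodes()
A = nx.to_numpy_array(G)
A = A / A.sum(axis=1, keepdims=True)   # stochastic adjacency matrix

p = np.zeros(n); p[[0, 1, 2]] = 1 / 3
q = np.zeros(n); q[[31, 32, 33]] = 1 / 3
diff = p - q

# Truncated stand-in for sum_k A^k (p - q): the faster this accumulates
# toward zero, the closer p and q are on the graph.
acc = np.zeros(n)
term = diff.copy()
for _ in range(50):
    acc += term
    term = A @ term

distance = np.sqrt(diff @ acc)
```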
The solutions based on shortest paths start from the assumption that
the problem of establishing distances between sets of nodes can be
generalized from solving the problem of finding the distance between
pairs of nodes. This is a well understood and solved problem: using a 13
Edsger W Dijkstra. A note on two
shortest path algorithm – for instance Dijkstra’s13 – one can count the problems in connexion with graphs.
number of edges separating node u from v. Numerische mathematik, 1(1):269–271,
1959
If we take Figure 47.7 as an example, we’d start by collecting a
bunch of distances: [2, 3, 3] when starting from node 5 or node 8, and
Figure 47.7: A network where the red nodes represent the origins and the green nodes represent the destinations.
Non-Optimized
Here we show a set of possible aggregations of shortest path dis-
tances between the nodes in p and q, by taking hierarchical clustering
as an inspiration. There are of course more strategies than the ones
listed here, but I can’t really list them all – and most haven’t really
been researched yet.
When performing hierarchical clustering, there are three common 14
Gabor J Szekely and Maria L Rizzo.
ways to merge clusters according to their distance14 : single, complete, Hierarchical clustering via joint
and average linkage. Single linkage (green in Figure 47.8) means that between-within distances: Extend-
ing ward’s minimum variance method.
the distance between two clusters is the distance between their two
Journal of classification, 22(2):151–183,
closest points. On the other hand, complete linkage (purple in Figure 2005
47.8) considers the distance of the two farthest points as the cluster
distance. In average linkage (orange in Figure 47.8), one calculates
the average distance between all pairs of points in the two clusters as
the distance between the clusters.
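Before moving on, here is a minimal sketch of how these three aggregations translate to node sets on a network, assuming networkx: shortest path lengths play the role of the point-to-point distances, and the origin and destination sets are toy choices.

```python
import numpy as np
import networkx as nx

G = nx.karate_club_graph()
origins = [0, 1, 2]
destinations = [31, 32, 33]

# Pairwise shortest path lengths between the two node sets.
d = np.array([[nx.shortest_path_length(G, u, v) for v in destinations]
              for u in origins])

single_linkage = d.min()     # closest origin-destination pair
complete_linkage = d.max()   # farthest origin-destination pair
average_linkage = d.mean()   # average over all pairs
```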
Similarly, our aim is to reach the destination from the origin in the
minimum distance possible. In the single linkage strategy, the “cost”
of reaching a destination node is the distance of it from the closest
possible origin node. First, we need to make sure that ∑ p = ∑ q. If
that isn’t the case, we rescale up the vector with the smallest sum so
that this equation is satisfied. For instance, if q had a lower sum, we
δ_{p,q,G} = (Σ_{∀v∈q} Σ_{∀u∈p} p_u q_v |P_{u,v}|) / Σ p.
Here it doesn’t matter what we put in the denominator, since we
already ensured that p and q sum to the same value.
vectors, we would again count each path as contributing one third, i.e.
(18/3)/3 = 2.
In complete linkage, we perform a similar operation as in single
linkage, but looking at the farthest destination for each origin. The
farthest destination is 5 → 1, at three steps; then 8 → 4 and 7 → 9 at 15
Andrew McGregor and Daniel Stubbs.
two steps each. Thus, complete linkage will return 3 + 2 + 2 = 7 as Sketching earth-mover distance on
graph metrics. In Approximation, Random-
distance. If we normalized the vectors, we would again count each ization, and Combinatorial Optimization.
path as contributing one third, i.e. 7/3 = 2.3̄. Algorithms and Techniques, pages 274–
286. Springer, 2013
16
Gaspard Monge. Mémoire sur la
théorie des déblais et des remblais.
Histoire de l'Académie Royale des Sciences de Paris, 1781

Optimized
Here we try to be a bit smarter than the aggregation strategies we 17
Frank L Hitchcock. The distribution
saw so far. In this branch of approaches, we try to optimize this of a product from several sources to
numerous localities. Studies in Applied
aggregation such that the number of edge crossing is minimized. Mathematics, 20(1-4):224–230, 1941
If there are no further constraints in this optimization problem, 18
Ira Assent, Andrea Wenning, and
we are in the realm of the Optimal Transportation Problem (OTP) Thomas Seidl. Approximation tech-
niques for indexing the earth mover’s
on graphs15 . In its original formulation16 , OTP focuses on the dis- distance in multimedia databases. In
tance between two probability distributions without an underlying Data Engineering, 2006. ICDE’06. Proceed-
ings of the 22nd International Conference
network. However, it has been observed how this problem can be on, pages 11–11. IEEE, 2006
applied to transportation through an infrastructure, known as the 19
Matthias Erbar, Martin Rumpf,
multi-commodity network flow17 . Specifically, one has to simply Bernhard Schmitzer, and Stefan Simon.
Computation of optimal transport on
specify how distant two dimensions in the vector are. The distance
discrete metric measure spaces. arXiv
needs to be a metric, and the number of edges in the shortest path preprint arXiv:1707.06859, 2017
between two nodes satisfies the requirement. 20
Montacer Essid and Justin Solomon.
Quadratically-regularized optimal
In its most general form, the assumption is that we have a distri-
transport on graphs. arXiv preprint
bution of weights on the network’s nodes, and we want to estimate arXiv:1704.08200, 2017
the minimal number of edge crossings we have to perform to trans- 21
George Karakostas. Faster ap-
proximation schemes for fractional
form the origin distribution into the destination one. This is a high
multicommodity flow problems. ACM
complexity problem, which has lead to an extensive search for effi- Transactions on Algorithms (TALG), 4(1):
cient approximations18 , 19 , 20 , 21 , 22 , 23 , 24 , 25 . For what concerns us, all 13, 2008
22
Jan Maas. Gradient flows of the
these methods are equivalent: they all solve OTP and the difference entropy for finite markov chains. Journal
between them is how they perform the expensive optimization step. of Functional Analysis, 261(8):2250–2292,
Thus, they all return a very similar distance given p, q and G – plus 2011
23
Ofir Pele and Michael Werman.
or minus some approximation due to their optimization strategy –, A linear time histogram metric for
and fall in the same category. improved sift matching. In European
conference on computer vision, pages
More formally, in OTP we want to find a set of movements M such
495–508. Springer, 2008
that:
M = arg min_{m_{p_u,q_v}} Σ_{p_u} Σ_{q_v} m_{p_u,q_v} d_{u,v},
24 Ofir Pele and Michael Werman. Fast and robust earth mover's distances. In Computer vision, 2009 IEEE 12th international conference on, pages 460–467. IEEE, 2009
25 Justin Solomon, Raif Rustamov, Leonidas Guibas, and Adrian Butscher. Continuous-flow graph transportation distances. arXiv preprint arXiv:1603.06927, 2016
where p_u and q_v are the weighted entries of p and q, respectively; m_{p_u,q_v} is the amount of weight from p_u that we transport into q_v; and d_{u,v} is the distance between them. Then:
δ_{p,q,G} = (Σ_{p_u} Σ_{q_v} m_{p_u,q_v} d_{u,v}) / (Σ_{p_u} Σ_{q_v} m_{p_u,q_v}),
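A minimal sketch of this optimization, assuming scipy and networkx: the transportation problem is written as a small linear program over the movements m, with shortest path lengths as the ground distance. Real implementations rely on the far more efficient solvers cited above.

```python
import numpy as np
import networkx as nx
from scipy.optimize import linprog

G = nx.karate_club_graph()
origins, destinations = [0, 1], [32, 33]
p = np.array([0.5, 0.5])   # weights on the origin nodes
q = np.array([0.5, 0.5])   # weights on the destination nodes (same total)

# Ground distance: shortest path length for each origin-destination pair.
d = np.array([[nx.shortest_path_length(G, u, v) for v in destinations]
              for u in origins])

# One variable m_{u,v} per pair; each origin must ship all its weight
# and each destination must receive exactly its weight.
n_o, n_d = len(origins), len(destinations)
A_eq, b_eq = [], []
for i in range(n_o):                      # row sums equal p
    row = np.zeros(n_o * n_d); row[i * n_d:(i + 1) * n_d] = 1
    A_eq.append(row); b_eq.append(p[i])
for j in range(n_d):                      # column sums equal q
    col = np.zeros(n_o * n_d); col[j::n_d] = 1
    A_eq.append(col); b_eq.append(q[j])

res = linprog(c=d.flatten(), A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, None))
m = res.x
distance = (m @ d.flatten()) / m.sum()
```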
29
Julio E Godoy, Ioannis Karamouzas,
Stephen J Guy, and Maria Gini. Adap-
Moreover, since in MAPF robots cannot be in the same node at
tive learning for multi-agent navigation.
the same time, you still have a problem. Say that we assigned a In Int Conf on Autonomous Agents and
robot to go from u to v in our preprocessing. If pu ̸= qv , then either Multiagent Systems, pages 1577–1585. In-
ternational Foundation for Autonomous
u or v has some unallocated weight. Thus we would need to add Agents and Multiagent Systems, 2015
at least a second robot that can either start in u or terminate in v. 30
Jamie Snape, Jur Van Den Berg,
But this violates MAPF. The way we solve the issue is by running Stephen J Guy, and Dinesh Manocha.
The hybrid reciprocal velocity obstacle.
a sequence of MAPF sessions. In each session, we attempt to move IEEE Transactions on Robotics, 27(4):
all the weights that were left over during the previous session. We 696–706, 2011
keep running smaller and smaller sessions until all weights have
31
Glenn Wagner and Howie Choset. Sub-
dimensional expansion for multirobot
been allocated – which we can guarantee by normalizing either p or path planning. Artificial Intelligence, 219:
q so that they sum to the same value, as we did in the non-optimized 1–24, 2015
solutions.
32
Andrew Dobson, Kiril Solovey, Rahul
Shome, Dan Halperin, and Kostas E
There are many algorithms to solve MAPF29 , 30 , 31 , 32 , 33 , 34 , 35 , 36 , 37 , 38 , 39 , Bekris. Scalable asymptotically-optimal
each of them providing a different solution to NVD with our prepro- multi-robot motion planning. In 2017
International Symposium on Multi-Robot
cessing strategy.
and Multi-Agent Systems (MRS), pages
120–127. IEEE, 2017
Figure 47.11: Attempting to find a pursue solution from red nodes to green nodes. Blue arrows show attempted moves.
Network Correlation
We can use a similar trick to extend the familiar Pearson correlation 55
Michele Coscia. Pearson correlations
coefficient (Section 3.4) to variables defined on a network55 . Let’s on complex networks. Journal of Complex
suppose that we have two vectors, x and y, each assigning a value to Networks, 9(6):cnab036, 2021
47.6 Summary
1. The node vector distance problem is the quest for finding a way
to estimate a network distance between two vectors describing
the degree of occupancy of the nodes in the network. If at time t
I occupy nodes 1, 2, and at time t + 1 I occupy nodes 3, 4, 5, how
much did I move in the network?
5. Finally, you can use signal cleaning techniques. You can see
your network as describing sets of sensors that return correlated
results. Thus, two “signals” are far apart if they are reported by
uncorrelated sensors, which are not connected to each other.
47.7 Exercises
2. Calculate the distance using the same data as the previous ques-
tion, this time with the average linkage shortest path approach.
Normalize the vectors so that they both sum to one.
3. Calculate the distance using the same vectors as the previous ques-
tions, this time on the http://www.networkatlas.eu/exercises/
47/3/data.txt network, with both the average linkage shortest
path and the Laplacian approaches. Are these vectors closer or
farther in this network than in the previous one?
By far, the most common and popular way to interpret the term “net-
work distance” is as the opposite of the similarity between two
networks. The term “network similarity” is, unfortunately, rather
ambiguous, and you might find papers dealing with very different
problems but using the same terminology. For instance, one could
interpret “network similarity” as a measure of how similar two nodes
are (see Section 15.2). Or one could be talking about “similarity net-
works”, which are ways to express the similarities between different
entities by connecting the ones that are the most similar to each other
– something you might do via bipartite projections (Chapter 26).
Figure 48.1: On the left we have four graphs, each identified by the color of its nodes. On the right, I make a two dimensional projection by recording each graph's node count (y axis) and edge density (x axis). The similarity between two graphs is the inverse of their distance in this space.
The main issue is that we still don't know which set of network statistics is sufficient to cover the space of all possible networks.
Whatever dimensions you use to organize your networks will col-
lapse many – possibly dissimilar – networks into the same place in
your scatter plot. This happens in Figure 48.1, where a star (in red)
is confused with a set of unconnected cliques (in blue). This is not
necessarily a bad thing! If the summary statistics you chose are mean-
ingful to you in some fundamental way, this is a feature. However, if
you’re hunting for “universal” patterns, this approach could mislead
you.
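A minimal sketch of this projection, assuming networkx: each graph becomes a point described by its node count and edge density, and the similarity is the inverse of the distance between the points (the +1 in the denominator only avoids dividing by zero when two graphs project onto the same point).

```python
import numpy as np
import networkx as nx

graphs = {
    "star": nx.star_graph(10),
    "cliques": nx.disjoint_union(nx.complete_graph(4), nx.complete_graph(4)),
    "ring": nx.cycle_graph(11),
}

# Project each graph onto (node count, edge density).
points = {name: np.array([G.number_of_nodes(), nx.density(G)])
          for name, G in graphs.items()}

def similarity(a, b):
    return 1 / (1 + np.linalg.norm(points[a] - points[b]))

print(similarity("star", "cliques"))
print(similarity("star", "ring"))
```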
Figure 48.4 can help you to visualize the process. Here, we want to
know how many operations we need to go from the graph in Figure
48.4(a) to the graph in Figure 48.4(b). Starting from node 1, we need
to change its label (from red to blue) and to add the edge connecting
it to node 6. Node 2 is fine, but node 3 needs to replace its edge to
node 5 with one labeled in green. There are no more edits we need to
do, so the distance between the two graphs is three.
Of course, the hard part of graph edit distance is finding the
minimum set of edits, so there are a bunch of ways to go about it,
ranging from Expectation Maximization to Self-Organizing Maps, 12
Horst Bunke. On a relation between
to subgraph isomorphism12 . Special network types deserve special graph edit distance and maximum
approaches, for instance in the case of Bayesian networks13 (see common subgraph. Pattern Recognition
Letters, 18(8):689–694, 1997
Section 6.4) and trees14 , 15 . 13
Richard Myers, RC Wison, and
The prototypical graph edit distance metric16 is relatively sim- Edwin R Hancock. Bayesian graph edit
ple to understand. It is based on the maximum common subgraph. distance. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 22(6):
Given two graphs G1 and G2 , first you find the largest common sub- 628–635, 2000
graph Gs : the largest collection of nodes and edges that is isomorphic 14
Davi de Castro Reis, Paulo Braz
in both graphs. Then, the distance between G1 and G2 is simply the Golgher, Altigran Soares Silva, and
AlbertoF Laender. Automatic web
number of nodes and edges that remain outside Gs : news extraction using tree edit distance.
In Proceedings of the 13th international
δ_{G_1,G_2} = |E_1 − E_s| + |E_2 − E_s| + ||V_1| − |V_2||,
502–511, 2004
with Vx and Ex being the set of nodes and edges of graph Gx . 15
Mateusz Pawlik and Nikolaus Aug-
This is actually a metric, as it respects the triangle inequality. An sten. Rted: a robust algorithm for
the tree edit distance. arXiv preprint
evolution of this approach tries to find the maximum common edge arXiv:1201.0230, 2011
subgraph17 , which is found in the line graph representation of G. 16
Vladimír Baláž, Jaroslav Koča,
The problem gets significantly easier when the networks are Vladimír Kvasnička, and Milan Sekan-
ina. A metric for graphs. Časopis pro
aligned. Two networks are aligned if we have a known node corre- pěstování matematiky, 111(4):431–433,
spondence between the two. This means that we know that node u 1986
in one network is the same as node v in the other. How to align two
17
John W Raymond, Eleanor J Gardiner,
and Peter Willett. Rascal: Calculation
networks is an interesting problem in and of itself, and we’re going of graph similarity using maximum
to look at it in Section 48.2. For now, we just take for granted that the common edge subgraphs. The Computer
Journal, 45(6):631–644, 2002
two networks we’re comparing are already aligned.
In this case, we don’t have to go looking for maximum subgraphs.
We can just iterate over all the nodes and edges in the two networks
and note down every time we find an inconsistency: a node or an
edge that is present in one network and absent in the other. Simply
counting won’t do much good, though, because some differences
should count more than others, if they are significantly affecting
the local or global properties of the network. Consider Figure 48.5:
both Figure 48.5(b) and Figure 48.5(c) are just one edge away from
Figure 48.5(a). However, since Figure 48.5(b) breaks down in multiple
18
Danai Koutra, Joshua T Vogelstein,
connected components, its difference should be counted as higher. and Christos Faloutsos. Deltacon: A
There are a few strategies to estimate these differences. Delta- principled massive-graph similarity
function. In Proceedings of the 2013 SIAM
con is based on some sort of node affinity estimation18 . One could
International Conference on Data Mining,
also do vertex rank comparison19 : if the most important nodes in pages 162–170. SIAM, 2013
two networks are the same, then the networks must be similar, to 19
Panagiotis Papadimitriou, Ali Dasdan,
some extent. The same authors propose other ways to estimate net- and Hector Garcia-Molina. Web graph
similarity for anomaly detection. Journal
work similarity, for instance via shingling: reducing the networks of Internet Services and Applications, 1(1):
to sequences and then applying a sequence comparing algorithm. 19–30, 2010
20
Xifeng Yan, Philip S Yu, and Jiawei
Han. Substructure similarity search
in graph databases. In Proceedings of
Substructure Comparison the 2005 ACM SIGMOD international
conference on Management of data, pages
766–777, 2005
Substructure comparison20 , 21 is similar to graph edit distance. In 21
Haichuan Shang, Xuemin Lin, Ying
this class of methods, you describe the network as a dictionary of Zhang, Jeffrey Xu Yu, and Wei Wang.
Connected substructure similarity
motifs and how they connect to each other. Usually, you’d find search. In Proceedings of the 2010 ACM
the motifs by applying frequent subgraph mining (Chapter 41). In SIGMOD International Conference on
practice, graph edit distance is equivalent to a simple substructure Management of data, pages 903–914, 2010
22
Thomas R Hagadone. Molecular
comparison, where the only substructure you’re focusing on is the
substructure similarity searching:
edge. There is not much to say about this class, given its similarity efficient retrieval in two-dimensional
with the previous one: the same considerations and warnings that structure databases. Journal of chemical
information and computer sciences, 32(5):
applied there also apply here. 515–521, 1992
When it comes to applications of substructure similarity, the clas- 23
Liping Wang, Qing Li, Na Li, Guozhu
sical scenario is estimating compound similarities at the molecular Dong, and Yu Yang. Substructure
similarity measurement in chinese
level in a biological database22 . But there are more fun scenarios, recipes. In Proceedings of the 17th
such as an analysis of Chinese recipes23 . international conference on World Wide
Web, pages 979–988, 2008a
Holistic Approaches
Information Theory
A radically different approach works directly with the adjacency
matrix of a graph. The idea here is to generalize the Kullback-Leibler
divergence (KL-divergence) so that it can be applied to determining
the distance between two graphs. The KL-divergence is a cornerstone
of information theory and linked with the concept of information
entropy – see Section 3.5 for a refresher.
The KL-divergence is also known as “relative entropy”. From
Section 3.5, you learned that the information entropy of a vector X is
the number of bits per element you need to encode it. Now, of course
when you try to encode a vector, you try to be as smart as possible.
You create a codebook that is specialized to encode that particular
vector. If there is an element that appears much more often than the
others, you will give it a short code: you will have to use it more
often and, if it is shorter, every time you use it you will save bits. This
is the strategy used by Infomap to solve community discovery – see
Section 35.2.
Now suppose you have another vector, Y. You want to know how
similar Y is to X. One thing you could do is to encode Y using the
code book you optimized to encode X. If X = Y, then the codebook
is as good at encoding X as it is at encoding Y: you need no extra bits. As
soon as there are differences between X and Y, you will start needing
extra bits to encode Y, because X’s codebook is not perfect for Y any
more. The KL-divergence boils down to the number of extra bits you
need to encode Y using X’s codebook.
Figure 48.7: An example of the spirit of KL-divergence. The code we use for X (a) requires additional bits to encode Y (b): X takes 1.5 bits/value (9 bits for 6 values), while encoding Y with X's codebook takes 1.83 bits/value (11 bits for 6 values).
Figure 48.7 presents a rough outline of the idea behind the KL-
divergence – simplified to help intuition. The X vector in Figure
48.7(a) requires 1.5 bits per element. Using its codebook to encode Y
telling you the probability of each node from the first graph to be
the node in the second graph. The way this matrix is built can rely
on any structural similarity measure, we saw a few in different parts
of this book. Then, the idea is to pick the cells in this matrix so that
the sum of the scores is maximized and, at the same time, we match 36
Oleksii Kuchaiev and Nataša Pržulj.
as many nodes as possible36 . If |V1 | ̸= |V2 | you will have to face a Integrative network alignment reveals
choice: either you do not map some nodes or you allow nodes from large regions of global network similar-
ity in yeast and human. Bioinformatics,
one network to map to multiple nodes in the other. This common ap-
27(10):1390–1396, 2011
proach can be extended, for instance, by calculating multiple versions
of this matrix using different measures and then seeking a consensus
matrix which is a combination of all the similarity measures.
Figure 48.8: Two graphs of which we want to discover the alignment. (c) assigns to each node pair from (a) and (b) an alignment probability.
One could also find just a few node mappings with extremely
37
Giorgos Kollias, Shahin Mohammadi,
high confidence and then expand from that seed37 , assuming that and Ananth Grama. Network similarity
the neighborhoods around these high confidence nodes should look decomposition (nsd): A fast and
alike. Of course, a large portion of network alignment solutions rely scalable approach to network alignment.
IEEE Transactions on Knowledge and Data
on solving the maximum common subgraph problem: if you find Engineering, 24(12):2232–2243, 2011
isomorphic subgraphs in both networks, chances are that the nodes
inside these subgraphs are the same, and thus should be aligned
38
Gunnar W Klau. A new graph-based
to each other38 . Other approaches rely on the fact that isomorphic method for pairwise global network
graphs have the same spectrum, thus similar values in the eigenvec- alignment. BMC bioinformatics, 10(1):S59,
tors of the Laplacian imply that the nodes are relatively similar39 . 2009
39
Rob Patro and Carl Kingsford. Global
Another approach uses a dictionary of networks motifs. Each network alignment using multiscale
node is described by counting the number of motifs it is part of. We spectral signatures. Bioinformatics, 28(23):
can then describe the node as a numerical count vector. Two nodes 3105–3114, 2012
40
Tijana Milenković, Weng Leong Ng,
with similar vectors are similar40 . This approach has to solve the Wayne Hayes, and Nataša Pržulj. Opti-
graph isomorphism problem as well, but it needs to do so only for mal network alignment with graphlet
small graph motifs rather than for – supposedly – large common degree vectors. Cancer informatics, 9:
CIN–S4744, 2010
subgraphs. This way, it can be more efficient.
Figure 48.9 shows an example: here I choose a relatively small
set of motifs, which generate a short vector. However, one could
define as many motifs as they are relevant for a specific application,
and obtain much more precise vectors describing the nodes. A final
approach I mention is MAGNA, which uses a genetic algorithm ap-
proach: it tries aligning by exploring the search space of all possible
node mappings, allowing the best matches to survive and evolve, and
41
Vikram Saraph and Tijana Milenković.
dropping the worst matches41 . Magna: maximizing accuracy in global
network alignment. Bioinformatics, 30(20):
2931–2940, 2014
48.3 Network Fusion
Figure 48.10: An example of network fusion: (c) is the result of the fusion of (a) and (b). Edge thickness proportional to its weight.
48.4 Summary
48.5 Exercises
Visualization
49
Node Visual Attributes
[Figure: Anscombe's quartet, four datasets (I–IV) sharing the same summary statistics but with very different relationships between x and y.]
You cannot simply take away the message that any quantitative
measure of node importance is an equally good choice for your node
size. Some of those measures will not highlight what you want to
highlight. For instance, the degree is not always the right choice.
Consider Figure 49.5(a): would you think to use the node’s degree
as a measure of its size? If you do, you end up with Figure 49.5(b)
where the node playing arguably the strongest role in keeping the
network together almost disappears. A much better choice, in this
case, is betweenness centrality (Figure 49.5(c)).
[Figure: the network's degree distribution p(k) as a function of k, on log-log axes.]
tweak it a bit to make the result more pleasing. The idea is to have
diminishing returns to the contribution of the degree to the node
size. The differences in size from the minimum degree, to the average
degree – which can be quite low – are big but, from that point on, the
contribution to the node’s size plateaus.
If you do so, you can find new clusters that were previously clut-
tered by the huge nodes, or that had a low degree and so they did
not pop up. You can compare the two hairballs in Figures 49.7(a) and
49.7(b). Note that the visualization is still truthful: we’re never going
to make nodes with lower degree larger than nodes with higher de-
gree. That would be bad and land you in the naughty corner. We’re
just making the visualization more useful.
This is probably a good place to stop and make a disclaimer. Even
if eyes are the highest bandwidth sensors we have, it doesn’t mean
they are flawless. Nor that our monkey brain is able to use the in-
formation they gather in a perfect way. Human perception is flawed
and you cannot expect that something a computer understands will
appear obvious to your viewers as well. In the case of node size this
takes the form of the confusion between radii and areas.
Unless otherwise specified by the software/program of choice,
you are going to decide the radius of the node when determining its
size. This can be trouble if you don’t handle this choice properly. The
reason is that, when you increase the radius, you are essentially
performing a linear increase: you think that, if the degree increases
by one unit, you should increase the radius by one unit. Unfortu-
nately, what a viewer will perceive is you changing the area of the
circle. The crux of the problem is that a radius is a one dimensional
quantity, and it should never be used for controlling a two dimen-
sional one such as an area – which is what your readers perceive.
You think you’re increasing something linearly, but you’re actually
49.2 Color
The second obvious feature to manage for your nodes is their color. If
we routinely use node sizes for quantitative attributes, we primarily
use node color for qualitative ones. The reason is that, while humans
perceive size as quantitative, color hue is not perceived in the same 11
William S Cleveland and Robert
McGill. Graphical perception: Theory,
way. Cleveland and McGill11 distinguish between different data types
experimentation, and application to the
and how much different graphical features are effective for each data development of graphical methods. Jour-
type. Color is good for nominal attributes – categories that cannot nal of the American statistical association,
79(387):531–554, 1984
be compared/sorted, like “apple” vs “orange”. Color could be used
for ordinal attributes, that are still categories but can be compared –
for instance days of the week, Monday comes before Tuesday. Color
is terrible for quantitative attributes, for which areas are a more
effective tool. As always, what follows is based on my experience
and, if you want or need more in-depth explanations, you should
check out the paper.
When it comes to network visualization, this implies that we put
nodes into classes and we use colors to emphasize that different
nodes are in different classes. Classical examples can be the node’s
community – Part X –, or its role – Chapter 15. I already mentioned
nodes can have metadata, and these metadata could be categorical.
For instance, in a network connecting online shopping products because they are co-purchased together, you could use the color to encode their category (outdoors, kitchen, or electrical appliances).
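A minimal sketch with networkx and matplotlib – the karate club graph ships with a categorical “club” attribute that stands in for your product categories, and “Set2” is one of the Color Brewer qualitative palettes:

import networkx as nx
import matplotlib.pyplot as plt

G = nx.karate_club_graph()  # every node has a categorical "club" attribute

# One color per category, from a qualitative palette (no rainbows).
categories = sorted({G.nodes[n]["club"] for n in G.nodes()})
palette = plt.get_cmap("Set2")
color_of = {cat: palette(i) for i, cat in enumerate(categories)}

node_colors = [color_of[G.nodes[n]["club"]] for n in G.nodes()]
nx.draw_networkx(G, node_color=node_colors, with_labels=False)
plt.show()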
You could still use node color for ordinal attributes, and maybe
for quantities as well, provided that you have clear and intuitive bins.
To wrap up this chapter, let’s see a few more things you can do
to your nodes. They both stem from the same idea: your nodes
represent something, and so you want to communicate this to your
viewers.
The first strategy involves node labels. If you want the audience to
know something, you simply tell them. You plaster some text on top
of your nodes and you call it a day. In my opinion, this is a desperate
move and it should be avoided if possible. Just as in movies, also
in data visualizations it’s better to “show, not tell”. In other words,
nobody wants to read your network. They want it to speak to them.
That is not to say that sometimes a good choice of node labels
can enhance your visualization. You can practically transform your
network into a glorified word cloud. I don’t love it, but I grudgingly
admit that sometimes it works. An example could be the one in Fig-
ure 49.14 – although in this case one should choose a less saturated
color for the nodes, because the current red gets in the way of the readability of the labels. My rule of thumb is that the node label
font size should have a one-to-one correspondence to the node size. It
would look weird to have a gigantic label on top of a tiny node, and
vice versa.
The second visual attribute you could play with is the node’s
(Figure 49.14: A network with node labels conveying information about a node's importance. In this trade network, nodes are countries labeled with three-letter codes.)
Another strategy is more creative – and for this reason you should
apply tons of caution if you want to go this way. It involves xenographics18. This translates to “weird visualizations”: stuff that has very specific and almost unique use cases, and thus is likely to use a style that people haven't seen before. You can be creative with what you put on your nodes, as long as you don't abuse it and

18 https://xeno.graphics/
broken. I present one in Figure 49.17. I’m not arguing that the figure
is a good visualization: what I’m saying is that being able to see the
nodes would not make much of a difference, especially since they do
not have attributes of interest. Rather, the visualization allows you to
see where different types of edges create red and green clumps, and
which edge type keeps the network together in which branches. And
how to deal with edge visual attributes is exactly the topic of the next
chapter.
49.4 Summary
1. The first visual attribute of nodes is their size. Usually, you want
to show quantitative attributes via size – the degree, the capacity,
etc. Be aware that you should always manipulate the area of the
node, which is what your viewer perceives. If your software only
allows you to control a node’s radius, keep in mind that your area
will change quadratically for each linear change of the radius.
2. Second, you can control a node's color. Usually, this is for qualitative attributes, e.g. community affiliation. Use no more than nine distinct colors, from a perceptually-aware space (not RGB rainbows!).

3. If you use node labels, their font size should scale with the node's area size. You can use pie charts and icons to embed additional information on the nodes.
49.5 Exercises
50
Edge Visual Attributes

Size
The equivalent for edges of node size is the thickness. As in the
previous case, this is mostly for quantitative attributes on edges. The
most trivial one is the edge's weight: heavy edges usually appear thicker. Another common use case is to put edge betweenness as the determinant of the edge thickness. This works well when used in conjunction with node sizes following the same semantics. It
gives a sense of balance to the visualization, so you can see which
edges are contributing to the node’s centrality. Figure 50.1 shows an
example.
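A sketch of this pairing, assuming networkx and matplotlib, with arbitrary scaling constants: edge betweenness drives the thickness, node betweenness drives the size, so the two cues share the same semantics.

import networkx as nx
import matplotlib.pyplot as plt

G = nx.karate_club_graph()
pos = nx.spring_layout(G, seed=42)

edge_bet = nx.edge_betweenness_centrality(G)
node_bet = nx.betweenness_centrality(G)

widths = [1 + 20 * edge_bet[e] for e in G.edges()]
sizes = [50 + 2000 * node_bet[n] for n in G.nodes()]

nx.draw_networkx(G, pos, width=widths, node_size=sizes, with_labels=False)
plt.show()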
Color
There are two obvious exceptions. You can have qualitative
information telling you to which layer and to which community an
edge belongs, if you have multilayer networks (Section 7.2) and/or
link communities (Section 38.5). For multilayer networks you can use
edge colors to represent the layer if you adopt a multigraph visual-
ization – as I do in Figure 50.2(a). However, this will get unwieldy
pretty soon, as the number of nodes, edges and layers grows beyond
an elementary size. In fact, it's usually best to use dedicated tools for the visualization of multilayer networks2 – although the field of visualizing multilayer networks is still in its infancy.

2 Manlio De Domenico, Mason A Porter, and Alex Arenas. Muxviz: a tool for multilayer analysis and visualization of networks. Journal of Complex Networks, 3(2):159–176, 2015c

For link communities, keep in mind the same warning I made
can work in concert and reinforce each other. Two imperfect visual
clues can sum and make each other clearer. This is the case for edge
betweenness, determining both color and thickness of the edges in
Figure 50.3(a).
Transparency
Transparency is another aspect in which edges diverge from nodes.
In the previous chapter I mentioned that nodes should be fully
visible, and provided only a single use case in which I believe trans-
parency can add something to the visualization by removing the
nodes from sight. When it comes to edges, I usually abuse transparency
Labels
Just like with nodes, also with edges you can be... edgy in how you
visualize them. There are two fundamental aspects I’m going to
mention here: shapes and bends.
The classical edge visualization is as a straight, solid line. This is
what you should do in 99% of the cases. However, in many cases,
you might want to slightly change this shape. The most common
shape change for edges is when you are working with directed con-
nections. In this case, the convention is to add an arrow that indicates
the direction of the edge. The arrow points from the originator of the
edge to the target.
Directed networks are more challenging to visualize than you
might think. The reason is not only that you’re doubling the possible
number of edges, which is true and it is an issue. But the real trouble
is that now you might have a significant number of double edges be-
tween the same two nodes: u → v and u ← v. This might make your
visualization a real mess. One convention you can implement is not
to actually draw the two edges. What you can do is to draw a single
edge and add to it a second arrow pointing in the opposite direction
if that edge is reciprocal. Figure 50.5 shows what this strategy looks like.
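A sketch of this convention with networkx – the graph and all styling choices are placeholders: reciprocated pairs are drawn once with two arrowheads, one-way edges with a single arrowhead.

import networkx as nx
import matplotlib.pyplot as plt

G = nx.gnp_random_graph(15, 0.15, directed=True, seed=1)
pos = nx.spring_layout(G, seed=1)

# Reciprocated pairs are drawn once, with an arrowhead on both ends.
reciprocal = [(u, v) for u, v in G.edges() if G.has_edge(v, u) and u < v]
one_way = [(u, v) for u, v in G.edges() if not G.has_edge(v, u)]

nx.draw_networkx_nodes(G, pos, node_size=100)
nx.draw_networkx_edges(G, pos, edgelist=one_way, arrowstyle="-|>")
nx.draw_networkx_edges(G, pos, edgelist=reciprocal, arrowstyle="<|-|>")
plt.show()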
The final odd thing you could do to your edges is to bend them,
meaning that the (u, v) edge is not a straight line from u to v any
more, but it takes a “detour”. Why would you want to do this?
There are fundamentally two reasons. Figure 50.5(a) provides an
example: since there are two edges between the nodes, we want the
visualization to be more symmetric and pleasant, and thus we bend
the two edges.
More often, edge bends are used to make your network layout
more clear. You bend edges to bundle together the ones going from
nearby nodes to other nearby nodes. Since this is done mostly to
clean up the visualization after you already decided where the nodes
should be placed, I will deal with this topic in the network layout
chapter (Chapter 51).
Let’s recap all the advice I gave you on node and edge visual at-
tributes and see a case of applying each feature one by one to go
from a meaningless hairball to something that conveys at least a little
bit of information. Our starting point is the smudge of edges you al-
ready saw a couple of times: that’s Figure 50.4(a). Note that this isn’t
really the starting point, because we already settled on a network
layout, but that will be the topic of the next chapter.
The usual order I apply to my networks after I settled on a layout
is the following:
1. Edge transparencies;
2. Edge sizes;
3. Edge colors;
4. Node sizes;
5. Node colors.
So let’s do this.
Edge transparencies. In this network, I do have quantitative information, that is, the edge betweenness of each connection. However, I think it's better to reserve that for the other edge visual features, so I fix the same edge transparency for all links. The result is Figure 50.4(b).
Edge sizes & colors. We now move on to use edge betweenness.
I merge the two steps of edge size and color into one, because using
simply the thickness does not make a significant difference with the
previous visualization. Compare Figure 50.4(b) with Figure 50.6(a)
and see that not much has changed. So I apply a Color Brewer color
gradient to the edges, resulting in Figure 50.6(b). Hopefully now you
can see that there are a few very important long connections keeping
the network together, connecting very central comic book characters
to a sub universe they’re almost exclusively part of.
Node sizes. It’s time to deal with nodes. There’s something to be
said for keeping them almost invisible, but that’s not what we want
to do here. This is a comic book network and so we want to know
which characters are tightly connected to the universe of which other
characters. So first we need to know who the important fellows are.
We use node size for that. As outlined in the previous chapter, this
is a job for the degree. The more connections a node has, the more
important it is, the larger it should be. And that’s what Figure 50.7(a)
does. Note the pseudo-logarithmic node size scaling.
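The whole sequence fits in a few lines of networkx and matplotlib; this is only a toy re-enactment of the steps above, where the karate club graph stands in for the comic book network and YlOrRd is one of the Color Brewer sequential gradients.

import math
import networkx as nx
import matplotlib.pyplot as plt

G = nx.karate_club_graph()
pos = nx.spring_layout(G, seed=7)  # the layout is assumed settled already

# Steps 1-3: fixed transparency, then width and color from edge betweenness.
edge_bet = nx.edge_betweenness_centrality(G)
widths = [1 + 15 * edge_bet[e] for e in G.edges()]
colors = [edge_bet[e] for e in G.edges()]
nx.draw_networkx_edges(G, pos, alpha=0.4, width=widths,
                       edge_color=colors, edge_cmap=plt.cm.YlOrRd)

# Step 4: pseudo-logarithmic node sizes from the degree.
sizes = [30 * math.log1p(deg) for _, deg in G.degree()]
nx.draw_networkx_nodes(G, pos, node_size=sizes)
plt.show()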
50.4 Summary
2. Differently from node colors, edge colors are used mostly for
quantitative attributes. Usually, there are more edges than nodes
in a network, thus it is harder to limit the number of edge colors.
Anyhow, you should not have more than nine different colors in
total, whether they are node or edge colors. Classical qualitative edge color choices are layers or link communities.
50.5 Exercises
51
Network Layouts

Hierarchical
If force directed and its variants are a good default choice, they are
not the only way to display your networks. As I concluded in the
previous section, they have some pretty limited use cases. What
happens when we go out of those use cases?
some extreme cases they can work, if you want to highlight some
specific messages. For instance, I would not use it for the network
in Figure 51.5(a), but I can see how it can communicate something
about the clusters of the network. The layout manages to put nodes
in the same community in the same column, showing how some
communities have stronger – darker, thicker – connections than
others. However, a non-trivial number of nodes and edges would make your network visualization completely unintelligible, as it
happens in Figure 51.5(b).
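If you do have a tree-like network, one standard way to get a hierarchical layout in Python is Graphviz's “dot” engine; a minimal sketch, assuming pygraphviz is installed:

import networkx as nx

G = nx.balanced_tree(r=2, h=4)  # a tree-like network, the layout's sweet spot

# Graphviz's "dot" program places nodes on discrete hierarchical levels.
pos = nx.nx_agraph.graphviz_layout(G, prog="dot")
nx.draw_networkx(G, pos, node_size=50, with_labels=False)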
Circular
A second scenario to consider is the case of an extremely dense
network. The network could be so dense, that the force directed is
not able to pull nodes apart and show structure. In this case, the first
step of the solution involves considering a layout that might not seem
the best for the job, but has a few tricks up its sleeve: the circular
layout. Which is exactly what it sounds like: it places nodes on a circle,
equidistant from one another.
The first part of the trick in using circular layouts is not to display
the nodes in a random order, but choosing an appropriate one. Ide-
ally, you want to place nodes in bunches such that most connections
happen across neighbors. Usually this is achieved by identifying the
nodes’ attribute which groups them best. You can also run a custom
algorithm deciding the order and then provide that as the attribute
for the circular layout.
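A sketch of the ordering trick with networkx – the toy graph and the community detection method are arbitrary choices: nx.circular_layout accepts a list of nodes, and the order of that list is the order around the circle.

import networkx as nx
import matplotlib.pyplot as plt
from networkx.algorithms import community

G = nx.connected_caveman_graph(4, 8)   # a toy network with obvious communities

# Group nodes by community so most edges stay between neighboring
# positions and the few inter-community edges stand out.
groups = community.greedy_modularity_communities(G)
order = [n for group in groups for n in sorted(group)]

pos = nx.circular_layout(order)
nx.draw_networkx(G, pos, node_size=60, with_labels=False, edge_color="gray")
plt.show()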
Figure 51.6 shows an example. In the figure you can see that the
layout still works: it shows how most connections remain within
the communities, and clearly points at how many and where the
inter-community connections are. In a force directed layout, these
connections would stretch long and be forced in the background of
denser areas, with the effect of being difficult to appreciate. However,
the real kicker for circular layouts happens when you consider the
use of an additional visual feature: edge bends.
In this section, we break the assumption that nodes are circles and
edges are lines. We try to find weird ways to summarize the network
topology in a way that is more compact and compelling.
Matrix Layouts
What’s the last resource for networks in which the density is too high
even for a circular layout plus edge bundles? If your network is so
dense, then you don’t have a network: you have a matrix and you
should visualize it as such. In these cases, what matters more is not
really which area is denser than which other, but which blocks of nodes have connections with higher and lower weights.19

19 If your network is this dense and also unweighted, consider changing job.
That is not to say that this is the only use case of a matrix visual-
ization. Even for sparser networks, sometimes a matrix is worth a
thousand nodes. Consider the case of nestedness (Section 32.4), a par-
ticular core-periphery structure for bipartite networks where you can
sort nodes from most to least connected. The most connected node
connects to every node in the network, while the least connected
nodes only connect to the nodes that everyone connects to. This
sort of linear ordering naturally lends itself to a matrix visualization.
While a node-link diagram would make a mess of such a core, ren-
dering the message difficult to perceive, a matrix view is deceptively
simple, as I show in Figure 51.12.
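A sketch of the matrix view with networkx and matplotlib, on a placeholder graph: the one decision that matters is the row/column order, here simply the degree, which is the right choice for nestedness.

import networkx as nx
import matplotlib.pyplot as plt

G = nx.karate_club_graph()

# Sort rows and columns by degree: a nested core-periphery structure
# shows up as a filled corner of the matrix.
order = sorted(G.nodes(), key=G.degree, reverse=True)
A = nx.to_numpy_array(G, nodelist=order)

plt.imshow(A, cmap="Greys", interpolation="nearest")
plt.xlabel("nodes, sorted by degree")
plt.ylabel("nodes, sorted by degree")
plt.show()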
This case is a good example of the main problem in visualizing
networks as matrices: the order of the rows/columns you choose
is the most important thing. You should put your nodes in the se-
quence that highlights the crucial structural characteristics the best.
In the nestedness case, nodes are sorted by degree (or total incom-
ing weight sum). If your network has communities, you want to
have nodes next to their community mates. This creates the classical
block diagonal matrices. There are other criteria you might want to consider20.

20 Michael Behrisch, Benjamin Bach, Nathalie Henry Riche, Tobias Schreck, and Jean-Daniel Fekete. Matrix reordering methods for table and network visualization. In Computer Graphics Forum, volume 35, pages 693–716. Wiley Online Library, 2016

Hive Plots
The issue with all traditional node-link diagrams is that the posi-
tion in space of a node is arbitrary: it does not reflect its properties,
but it is just relative to its connections. As such, layouts are not re-
producible, because a small change in the initial conditions in the
Graph Thumbnails
Just like hive plots, graph thumbnails22 also aim to provide a deterministic layout, where two isomorphic graphs result in the same visualization. In graph thumbnails, we give up the ability to analyze local structures. We are not seeing each individual node: the visualization is a summary of the graph's global structure. The

22 Vahan Yoghourdjian, Tim Dwyer, Karsten Klein, Kim Marriott, and Michael Wybrow. Graph thumbnails: Identifying and comparing multiple graphs at a glance. IEEE Transactions on Visualization and Computer Graphics, 24(12):3081–3095, 2018
Probabilistic Layout
Following the same “we can't visualize all nodes” philosophy of graph thumbnails, we have probabilistic layouts24. This technique is handy when you have a generic guess of where the nodes should be, but you cannot draw them all. You should use such layouts especially for very large graphs that you couldn't visualize otherwise, because they have too many nodes and/or edges.

24 Christoph Schulz, Arlind Nocaj, Jochen Goertler, Oliver Deussen, Ulrik Brandes, and Daniel Weiskopf. Probabilistic graph layout for uncertain network visualization. IEEE Transactions on Visualization and Computer Graphics, 23(1):531–540, 2016
The idea is as follows. First, you sample the nodes in your net-
work, taking only a few of them. Then you calculate their positions
using a deterministic force directed layout. You repeat the procedure
multiple times, obtaining, for each node, a good approximation of
where it should be. If you have nodes that you never sampled, you
can reasonably assume that they are going to be in the area surround-
ing their neighbors. Since you’re applying the algorithm to a sample,
this won’t take much time even if the original network was too large
to be analyzed in its entirety.
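What follows is a loose sketch of the sampling idea in networkx and numpy, not the method from the paper: it naively averages independent spring layouts, which only makes sense if the layout is deterministic and the samples are aligned, so treat the function and its parameters as purely illustrative.

import random
import numpy as np
import networkx as nx

def sampled_positions(G, sample_size=200, rounds=20, seed=0):
    """Estimate node positions by laying out repeated node samples."""
    rng = random.Random(seed)
    seen = {n: [] for n in G.nodes()}
    for r in range(rounds):
        nodes = rng.sample(list(G.nodes()), min(sample_size, G.number_of_nodes()))
        pos = nx.spring_layout(G.subgraph(nodes), seed=r)  # one layout per sample
        for n, p in pos.items():
            seen[n].append(p)
    # Sampled nodes: average of the positions observed across rounds.
    est = {n: np.mean(ps, axis=0) for n, ps in seen.items() if ps}
    # Never-sampled nodes: assume they sit near their placed neighbors.
    for n in G.nodes():
        if n not in est:
            anchors = [est[m] for m in G[n] if m in est]
            if anchors:
                est[n] = np.mean(anchors, axis=0)
    return est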
Now each node is associated to a spatial probability distribution,
much like elementary particles in quantum physics. You can assume
that, if the node is anywhere, it’ll be somewhere in the area where
its probability is nonzero. At this point, you can merge nodes whose
spatial probabilities overlap, by detecting and drawing a contour
containing them. You should then bend edges and smudge them as
well, to reflect the uncertainty of where their endpoints are.
Revealing Matrices
One key visualization technique is scatterplot matrices or SPLOMs.
When you have multiple variables in your dataset, you might be
interested in knowing which one correlates with which other. So you
can create a matrix where each row/column is a variable, and each
cell contains the scatter plot of the row variable against the column
variable. Figure 51.16 shows an example.
The same visualization technique can be applied to networks. In revealing matrices25, each row/column of your matrix is an entity. Then, each cell of the matrix contains a bipartite network, where the nodes of one type are the row entity and the nodes of the other type are the column entity.

25 Maximilian Schich. Revealing matrices. 2010
One defect of SPLOMs is that the main diagonal of the matrix is a
bit awkward. In it, the row variable and the column variable are the
same. Thus the scatter plot is meaningless, as it is the same variable
on the x and y axes: a straight line. One could modify it by showing
some sort of statistical distribution of the variable, but that would
mean breaking the axis consistency of the SPLOM. For this reason,
the main diagonal of a SPLOM is often omitted.
This defect does not apply to the revealing matrices visualization.
The main diagonal in this case is well defined: it is simply the direct
relationship between entities of the same type. Thus, it contains a
unipartite network per node type in your database.
Timelines
an overview of the dynamics all at once; you need to wait for the
animation to play out and you might have forgotten what was in the
first frame by the time you get to the last.

There is one visualization technique I first saw in 201226 (but it could be older) that fundamentally changes how we visualize a network to show time in a more natural way. Figure 51.17 shows an example.

26 Petter Holme and Jari Saramäki. Temporal networks. Physics Reports, 519(3):97–125, 2012
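Because the figure is nothing but horizontal and vertical lines on a time axis, the technique is easy to sketch with plain matplotlib; the toy events below are invented for illustration, each tuple being a timestamped edge.

import matplotlib.pyplot as plt

# Toy temporal network: (time, node u, node v).
events = [(1, 1, 2), (1, 3, 4), (2, 2, 3), (3, 1, 4), (3, 2, 5), (4, 4, 5)]
nodes = sorted({n for _, u, v in events for n in (u, v)})

fig, ax = plt.subplots()
for n in nodes:                        # every node is a horizontal line...
    ax.hlines(n, xmin=0.5, xmax=4.5, color="lightgray")
for t, u, v in events:                 # ...every edge a vertical segment.
    ax.vlines(t, ymin=min(u, v), ymax=max(u, v), color="black")

ax.set_yticks(nodes)
ax.set_xlabel("time")
ax.set_ylabel("node")
plt.show()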
(Figure 51.17: A network timeline visualization: nodes as horizontal lines, edges as vertical lines. Each shaded area shows one snapshot and the corresponding classical node-link diagram visualization below it.)

(Figure 51.18: Network timeline visualization of an SIS process. The node's line is red when the node is infected (I) and blue when it is susceptible (S). A contagion happens probabilistically when a red line joins a blue line; when it happens, the connection is red. Note node 4, which gets reinfected after recovering.)

custom visualization, bending and breaking rules along the way. No one should really follow your workflow, because it applies only to
your specific aim with your specific data. However, seeing some of
these examples could be helpful in making you realize that you are
not mad: sometimes you really do know better than everybody else.
The aim of this section is to empower you in being daring: try to look
at your data and your communication objective, and create your way
to bringing them together.
I touch on two examples I worked on. These are custom ways of
displaying a node-link diagram that I found useful. These node-link
diagrams have special configurations given the need to highlight
specific features of the networks they represent. Of course, there’s
much more out there, but these are two cases I’m familiar with.
Product Space
The Product Space27, 28 is a popular example. The Product Space is a network in which each node is a product that is traded among countries in the global market. Two products are connected if the sets of countries exporting them have a large overlap. The idea of this visualization is to show you which products are similar to each other, because if your country can make a given set of products, via the Product Space it can figure out which are the most similar products it should consider trying to export.

27 César A Hidalgo, Bailey Klinger, A-L Barabási, and Ricardo Hausmann. The product space conditions the development of nations. Science, 317(5837):482–487, 2007
28 Ricardo Hausmann, César A Hidalgo, Sebastián Bustos, Michele Coscia, Alexander Simoes, and Muhammed A Yildirim. The atlas of economic complexity: Mapping paths to prosperity. MIT Press, 2014
The original way to try and visualize the Product Space was a
simple force directed layout, as I show in Figure 51.19(a). However, as
I mentioned previously, the force directed layouts have this tendency
of forcing your networks on a sphere. This happens to work really
poorly in the case of the Product Space. The reason is that not all
products are the same. Some products are harder to export than
others. This is a key concept in the original research, known as
Economic Complexity.
This means that the Product Space has an inherent “direction”.
Countries want to move from simple to more complex products, as
the latter is a more rewarding category to be able to export. However,
the circle has no direction. It is a loop: you always get back to where you started.
Cathedral
In another paper of mine, I analyze government networks30. My nodes are government agencies and I establish edges between them if the website of an agency has a hyperlink pointing to the website of another agency. One key question is verifying if this network has a hierarchical organization – see Chapter 33. One obvious way to explore this question is visualizing the network and seeing if it looks like a hierarchy. Unfortunately, the network is relatively large and dense. So I need to come up with a custom layout. Such a layout is useful to visualize dense hierarchical networks, and thus can be considered an enhancement of the classical hierarchical layout presented earlier, which works only for tree-like structures.

30 Stephen Kosack, Michele Coscia, Evann Smith, Kim Albrecht, Albert-László Barabási, and Ricardo Hausmann. Functional structures of us state governments. Proceedings of the National Academy of Sciences, 115(46):11748–11753, 2018
The first step is to group nodes into a 2-level functional classifica-
tion. This means assigning to each agency the function it performs
in the government. For instance, a school is part of the education
system (level 1 function) and of the primary & secondary education
(level 2 function). Or: a city government is part of general adminis-
tration (level 1 function) and of the municipal administration (level 2
function).
There are not many level 2 functions so I can collapse all agencies
51.6 Summary
2. The most common principle is that of the force directed layout.
Nodes are charges of the same sign repelling each other and edges
are springs trying to keep connected nodes together. This layout
works for sparse networks with communities, whose topology fits
on a circle.
6. In many cases, your network will have a clear and unique message
that has never been visualized before. In those cases, you need to
bend rules and create a unique visualization serving your specific
communication objective.
51.7 Exercises
Useful Resources
52
Network Science Applications
(Figure 52.1: (a) Total wage sum (y axis) as a function of a city's [...]; panel (b) y axis: # Gas Stations.)
new factor you can use and recombine with all the factors that were already present so far. This is easy to see especially in patents data7. Every time someone makes a new invention, that new invention can be combined with all the previous inventions to create a new one, and so on ad infinitum. Thus, the knowledge added by a new invention potentially multiplies itself with the previously accumulated knowledge, rather than just adding to it.

7 Hyejin Youn, Deborah Strumsky, Luis MA Bettencourt, and José Lobo. Invention as a combinatorial process: evidence from us patents. Journal of The Royal Society Interface, 12(106):20150272, 2015
De-anonymizing social networks13 is feasible, under a wide array of different scenarios – whether the attacker is a government agency, a marketing campaign, or an individual stalker. This is usually done by creating a certain amount of auxiliary information that can then be used to recursively de-anonymize more and more nodes in the network. Counter-measures usually adopt the k-anonymity style: making sure that no individual can be identified, by obfuscating enough data to make at least k − 1 other individuals identical to her in some respect. For instance, a network is k-degree anonymous if there are at least k nodes with any given degree value14.

13 Arvind Narayanan and Vitaly Shmatikov. De-anonymizing social networks. In 2009 30th IEEE Symposium on Security and Privacy, pages 173–187. IEEE, 2009
14 Kun Liu and Evimaria Terzi. Towards identity anonymization on graphs. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 93–106, 2008
Sometimes, the focus is preventing the disclosure of information about a relationship, i.e. to combat link re-identification15. You might not want Facebook to know you are friends with someone, which they could do by performing some relatively trivial link prediction – see Part VII. In those cases, you might want to hide some of your relationships, and/or add a few fake connections, to throw off the score function of the link you want to hide.

15 Elena Zheleva and Lise Getoor. Preserving the privacy of sensitive relationships in graph data. In International Workshop on Privacy, Security, and Trust in KDD, pages 153–171. Springer, 2007

52.3 Human Connectome

Quite likely, the most famous and studied network in human history is the brain. We have been studying neural networks of many animals, due to their limited size and ease of analysis: cats16, mice17, and, of course, the superstar C. Elegans worm18. However, most of this is done with the big prize as the ultimate objective: the human brain. You might have heard of the Human Connectome Project. Proposed in 200519, its objective was to create a low-level network map of the human brain: a network where nodes are individual neurons and connections are the synapses between them.

The idea was that applying all the network science artillery to such a network would help us understand better how our brains work20 – or don't, sometimes. In fact, one of the major lines of research is comparing the brain connection patterns between healthy

16 JW Scannell, GAPC Burns, CC Hilgetag, MA O'Neil, and Malcolm P Young. The connectional organization of the cortico-thalamic system of the cat. Cerebral Cortex, 9(3):277–299, 1999
17 Quanxin Wang, Olaf Sporns, and Andreas Burkhalter. Network analysis of corticocortical connections reveals ventral and dorsal processing streams in mouse visual cortex. Journal of Neuroscience, 32(13):4386–4399, 2012b
18 Siming Li, Christopher M Armstrong, Nicolas Bertin, Hui Ge, Stuart Milstein, Mike Boxem, Pierre-Olivier Vidalain, Jing-Dong J Han, Alban Chesneau, Tong Hao, et al. A map of the interactome network of the metazoan c. elegans. Science, 303(5657):540–543, 2004
19 Olaf Sporns, Giulio Tononi, and Rolf Kötter. The human connectome: a structural description of the human brain. PLoS Computational Biology, 1(4), 2005
20 Ed Bullmore and Olaf Sporns. Complex brain networks: graph theoretical analysis of structural and functional systems. Nature Reviews Neuroscience, 10(3):186–198, 2009
a hierarchical fashion: neurons are part of modules25, and there are modules of modules, and so on – check out Chapters 33 and 37 for a few refreshers on hierarchies. In fact, one of the most appropriate models of the brain is multilayer networks26.

25 Paolo Bonifazi, Miri Goldin, Michel A Picardo, Isabel Jorquera, A Cattani, Gregory Bianconi, Alfonso Represa, Yehezkel Ben-Ari, and Rosa Cossart. Gabaergic hub neurons orchestrate synchrony in developing hippocampal networks. Science, 326(5958):1419–1424, 2009
26 Manlio De Domenico. Multilayer modeling and analysis of human brain networks. Giga Science, 6(5):gix004, 2017

52.4 Science of Science and of Success

Unsurprisingly, one of the things that interests scientists the most is... scientists. Network scientists are no exception to this rule. There
is a large and healthy literature in analyzing networks of scientists.
We already saw many examples of two types of science networks:
co-authorship networks, where scientists are connected to each other
if they collaborate on the same paper/project; and citation networks,
connecting papers if one cites another.
The two can be combined to try and gather a general picture of
how science gets done. Science is one of the most important human activities27, because we rely on it to develop new and better ways to improve our everyday life. It's better to understand how it works, so that we can do it better. This is fundamentally the mission statement of the science of science field28, 29, 30, kickstarted by network scientists and making extensive use of network analysis tools.

One of the most peculiar findings is that the occurrence of the highest impact work of a scientist's career will happen at a random point in time31. In other words, there is no way to predict which of your papers will earn you a Nobel prize: it could be your first, it could be your last, or any in between. This is bad news if we want to predict the success of some research, but it's great news for me. The fact that I haven't come even close to making a groundbreaking discovery doesn't mean it won't happen eventually. I simply won't see it coming if it does (it won't).

Figure 52.5 shows an example of this concept. In both cases, the breakout paper arrived early, but there is no pattern in how citations come. Moreover, the red scientist (Figure 52.5(a)) is a better scientist on average than the blue one (Figure 52.5(b)), having 23.5 citations per paper against blue's 16.2. They're also more productive (24 vs 18 papers). And yet, it is the blue scientist who published the best paper – with 223 citations, while red's best paper only has 115 citations. Life is unfair this way.

27 According to scientists.
28 Albert-László Barabási, Chaoming Song, and Dashun Wang. Publishing: Handful of papers dominates citation. Nature, 491(7422):40, 2012
29 Dashun Wang, Chaoming Song, and Albert-László Barabási. Quantifying long-term scientific impact. Science, 342(6154):127–132, 2013
30 Santo Fortunato, Carl T Bergstrom, Katy Börner, James A Evans, Dirk Helbing, Staša Milojević, Alexander M Petersen, Filippo Radicchi, Roberta Sinatra, Brian Uzzi, et al. Science of science. Science, 359(6379):eaao0185, 2018
31 Roberta Sinatra, Dashun Wang, Pierre Deville, Chaoming Song, and Albert-László Barabási. Quantifying the evolution of individual scientific impact. Science, 354(6312):aaf5239, 2016
(Figure 52.5: Two examples of career paths of scientists, showing [...]; y axis: # Citations. Panels (a) and (b).)
player at the top of the world ranking the one gathering the most
Wikipedia page views – or news articles about them, for that matter.
In fact, the performance-success disconnect can and should be
applied to science as well. In this section, I equated “success” with
citations: a successful paper gathers tons of citations. But is it the
best (read: highest performing) paper? Not at all! Citations and
grant awards correlate with things that are independent of the science/performance itself (e.g., gender35, race36 and how junior a person is37). The world isn't a perfect meritocracy. Cumulative advantage is not just the pretty story of how you model broad degree distributions in networks (Section 17.3): it is the real unfairness in front of everybody who does not start in the advantaged place/time/gender/race. We should investigate the performance-success disconnect in order to make the world suck a little less. One way to do it is to model science as the interaction between individual characteristics and systemic structures38.

35 Jonathan R Cole. Fair science: Women in the scientific community. 1979
36 Donna K Ginther, Walter T Schaffer, Joshua Schnell, Beth Masimore, Faye Liu, Laurel L Haak, and Raynard Kington. Race, ethnicity, and nih research awards. Science, 333(6045):1015–1019, 2011
37 Robert T Blackburn, Charles E Behymer, and David E Hall. Research note: Correlates of faculty publications. Sociology of Education, pages 132–141, 1978
38 Samuel F Way, Allison C Morgan, Daniel B Larremore, and Aaron Clauset. Productivity, prominence, and the effects of academic environment. Proceedings of the National Academy of Sciences, 116(22):10729–10733, 2019

52.5 Human Mobility
A significant portion of network scientists have also worked on issues of human mobility: describing and predicting how individuals and collectives move in the urban and global landscape39, 40. There are a few reasons for this. First, there is a strong connection between human mobility and many networked phenomena that network scientists investigate. Just to highlight the example from the previous sections: one can use the “mobility” of scientists between affiliations to predict their success41. Alternatively, one can use mobility data to augment the de-anonymization process of people in social settings42, or to better predict the spread of infectious diseases43, 44, 45.

Second, complex networks are themselves useful tools to model and analyze mobility patterns. For instance, one can create a better synthetic model of human mobility by using an underlying social network to create realistic motivations for the simulated agents to move in space46.

Classically, to predict the number of people moving from area A to area B, one would use a “gravity model”. This works just like Newton's gravity law: the mobility relation between two areas is directly proportional to how many people live in them (their “mass”) and inversely proportional to their distance47. In other words, there can be many people moving between New York and Chicago because they are huge cities, but Boston might attract more New Yorkers despite being less populous, simply because it's closer. The gravity model is overly simplistic: it's deterministic, it requires previous mobility data to fit parameters, it lacks theoretical grounding, and it simply doesn't predict observations that well. Network scientists have then developed a radiation model to fix these shortcomings48.

What do we find? At a collective level, human mobility patterns are surprisingly universal, but not when it comes to the covered distance49. Figure 52.6(a) shows that the probability of making a trip is only mildly related to distance: there is no function properly approximating the likelihood of you visiting a place given its distance to you, and different cities have different scaling and cutoffs. In other words, it is not true that the farther apart a pizza place is, the less you go to eat there.

39 Marta C Gonzalez, Cesar A Hidalgo, and Albert-Laszlo Barabasi. Understanding individual human mobility patterns. Nature, 453(7196):779–782, 2008
40 Julián Candia, Marta C González, Pu Wang, Timothy Schoenharl, Greg Madey, and Albert-László Barabási. Uncovering individual and collective human dynamics from mobile phone records. Journal of Physics A: Mathematical and Theoretical, 41(22):224015, 2008
41 Pierre Deville, Dashun Wang, Roberta Sinatra, Chaoming Song, Vincent D Blondel, and Albert-László Barabási. Career on the move: Geography, stratification, and scientific impact. Scientific Reports, 4:4770, 2014
42 Yves-Alexandre De Montjoye, César A Hidalgo, Michel Verleysen, and Vincent D Blondel. Unique in the crowd: The privacy bounds of human mobility. Scientific Reports, 3:1376, 2013
43 Vittoria Colizza, Alain Barrat, Marc Barthelemy, Alain-Jacques Valleron, and Alessandro Vespignani. Modeling the worldwide spread of pandemic influenza: baseline case and containment interventions. PLoS Medicine, 4(1), 2007
44 Duygu Balcan, Vittoria Colizza, Bruno Gonçalves, Hao Hu, José J Ramasco, and Alessandro Vespignani. Multiscale mobility networks and the spatial spreading of infectious diseases. PNAS, 106(51):21484–21489, 2009
45 Michele Tizzoni, Paolo Bajardi, Adeline Decuyper, Guillaume Kon Kam King, Christian M Schneider, Vincent Blondel, Zbigniew Smoreda, Marta C González, and Vittoria Colizza. On the use of human mobility proxies for modeling epidemics. PLoS Computational Biology, 10(7), 2014
46 Mirco Musolesi and Cecilia Mascolo. A community based mobility model for ad hoc network research. In MOBIHOC, pages 31–38, 2006
47 Dirk Brockmann, Lars Hufnagel, and Theo Geisel. The scaling laws of human travel. Nature, 439(7075):462–465, 2006
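In symbols, a common textbook formulation of the gravity model – not necessarily the exact variant used in the works cited above – reads:

T_{AB} \propto \frac{m_A \, m_B}{d_{AB}^{\beta}}

where T_{AB} is the expected flow of people between areas A and B, m_A and m_B are their populations, d_{AB} is their distance, and \beta is an exponent fitted on previously observed mobility – the data dependence mentioned above.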
work. We’re studying memes, you know? This is for science. There
are a few angles with which network scientists attack the study of 54
Maziar Nekovee, Yamir Moreno,
memes. Some of those already found space elsewhere in the book – Ginestra Bianconi, and Matteo Marsili.
e.g. in Section 21.2. Theory of rumour spreading in complex
social networks. Physica A: Statistical
The first is the relationship between the network structure and Mechanics and its Applications, 374(1):
the probability of a rumor to spread – or the fraction of nodes who 457–470, 2007
will end up hearing a rumor. Theoretical calculations54 show the 55
Nathan Oken Hodas and Kristina
Lerman. How visibility and divided
impact of the network’s topology: in a random graph, initial spread attention constrain social contagion. In
is slow but it will relentlessly cover the entire network; while for SocialCom, pages 249–257. IEEE, 2012
scale free networks the initial speed is fast but, in presence of degree 56
Lilian Weng, Alessandro Flammini,
Alessandro Vespignani, and Fillipo
correlations, it might fail to cover the entire network. All of this is Menczer. Competition among memes in
very similar to simulations of diseases spreading on a network (see a world with limited attention. Scientific
Chapter 20). reports, 2:335, 2012
Other studies show how the large diversity in the meme success
distribution – few memes spread globally while most are immedi-
ately forgot – are due to the limited capacity of brains to process
information55 , 56 , 57 . Another key question is whether memes spread
following simple or complex contagion: is a single exposure sufficient
or does reinforcement play a significant role? It seems that memes
indeed obey the complex contagion rules58 . For a refresher on the
concepts, see Chapter 21. 57
James P Gleeson, Jonathan A Ward,
More complex topological features, such as communities, are Kevin P O’sullivan, and William T Lee.
difficult to treat mathematically, but their impact can be studied Competition-induced criticality in a
model of meme popularity. Physical
using real world data. Figure 52.7 shows a toy example of the role of review letters, 112(4):048701, 2014
communities in meme propagation. Memes originating in the overlap 58
Bjarke Mønsted, Piotr Sapieżyński,
between different communities – in red in Figure 52.7 – have a better Emilio Ferrara, and Sune Lehmann.
Evidence of complex contagion of infor-
chance to go viral59 , 60 . Being born well embedded in a community mation in social media: An experiment
– blue in Figure 52.7 – is bad for propagation, because there are not using twitter bots. PloS one, 12(9), 2017
many paths leading the meme outside of the community.
59
Lilian Weng, Filippo Menczer, and
Yong-Yeol Ahn. Virality prediction and
In general, there are many empirical studies investigating how community structure in social networks.
information propagates through a social network61 , be it memes, Scientific reports, 3:2522, 2013
rumors, news62 , videos63 , 64 , or photographs65 , 66 . Lilian Weng, Filippo Menczer, and
60
point is a city, and for each city we count the number of famous people who were born and who died in that city. In red we can see death attractors: cities that had more famous deaths than births. In blue we have the emitting cities. By analyzing the historical trajectories of cities, we can see how the cultural center of the world moved from Rome to Paris and then to New York, because more and more people die in the city where they work – and notable people work where most notable people are. There are other interesting patterns, for instance the fact that the median distance between the birth and death place is increasing, reflecting technological advancements.

With a similar dataset, researchers built the “notable people portfolio” of cities and nations72. The idea is to classify all famous people in the area they contributed the most to humanity. Then, one can visualize in which areas places specialize73. For instance, the largest profession represented in the United States is actors, while it is politicians for Greece. But one could explore other dimensions. For instance, professions that are over-expressed in a country against the rest of the world, like chess players in Armenia (6% of all famous people!). Or explore the gender divide: in Canada 27.4% of male famous people were actors, against 55.4% of female famous people. Finally, you can explore time as well. Before 1700 AD, the most common way to become famous in Italy was to have a career in politics (30.7% of famous people did). Afterward? You're better off trying as a soccer player (21.6%).

Other digital humanities applications of network science involve archaeology. This mostly involves the use of network visualization techniques to make sense of a complex, interconnected, and often largely incomplete set of evidence74. However, it is not necessary to limit ourselves to this: network analysis can be used as a tool to explore evidence. For instance, there are studies of social networks in classical Rome75. Other examples of prehistoric social network archaeology focus on pre-hispanic North America76. Departing from archaeology, the field of social network analysis in a historic77, religious78, or anthropological79 setting is alive and well.

And since network analysis endows us with powerful tools to study hidden preferences – such as homophily and segregation, see Chapter 30 – it is a natural instrument to use in other humanities fields, such as gender studies. In particular, there are studies showing unequal gender dynamics when it comes to power relations in online collaborative tools such as, e.g., Wikipedia. As you might expect, the majority of Wikipedia contributors are white men.

When it comes to female representation in the content80, 81, this gender gap shows. The researchers find that women are equally represented in article numbers – at least in the main six language

70 Michele Coscia. Popularity spikes hurt future chances for viral propagation of protomemes. Communications of the ACM, 61(1):70–77, 2017
71 Maximilian Schich, Chaoming Song, Yong-Yeol Ahn, Alexander Mirsky, Mauro Martino, Albert-László Barabási, and Dirk Helbing. A network framework of cultural history. Science, 345(6196):558–562, 2014
72 Amy Zhao Yu, Shahar Ronen, Kevin Hu, Tiffany Lu, and César A Hidalgo. Pantheon 1.0, a manually verified dataset of globally famous biographies. Scientific Data, 3:150075, 2016
73 https://pantheon.world/
74 Tom Brughmans. Thinking through networks: a review of formal network methods in archaeology. Journal of Archaeological Method and Theory, 20(4):623–662, 2013
75 Tom Brughmans. Connecting the dots: towards archaeological network analysis. Oxford Journal of Archaeology, 29(3):277–303, 2010
76 Barbara J Mills, Jeffery J Clark, Matthew A Peeples, W Randall Haas, John M Roberts, J Brett Hill, Deborah L Huntley, Lewis Borck, Ronald L Breiger, Aaron Clauset, et al. Transformation of social networks in the late pre-hispanic us southwest. Proceedings of the National Academy of Sciences, 110(15):5785–5790, 2013
77 Claire Lemercier. Formal network methods in history: why and how? In Social networks, political institutions, and rural societies, pages 281–310. 2015
78 Eleanor A Power. Discerning devotion: Testing the signaling theory of religion. Evolution and Human Behavior, 38(1):82–91, 2017
79 Jessica C Flack, Michelle Girvan, Frans BM De Waal, and David C Krakauer. Policing stabilizes construction of social niches in primates. Nature, 439(7075):426–429, 2006
80 Claudia Wagner, David Garcia, Mohsen Jadidi, and Markus Strohmaier. It's a man's wikipedia? assessing gender inequality in an online encyclopedia. In Ninth International AAAI Conference on Web and Social Media, 2015
81 Claudia Wagner, Eduardo Graells-Garrido, David Garcia, and Filippo Menczer. Women through the glass ceiling: gender asymmetries in wikipedia. EPJ Data Science, 5(1):5, 2016
Polarization
(Figure 52.9: The three factors we can consider for ideological polarization. From top to bottom: opinion extremity, echo chambers, and opinion homophily across echo chambers.)

this partition93. I personally like to calculate the network distance between differing opinions94 using the Generalized Euclidean measure I described in Section 47.2.

Finally, researchers have also created agent based models to see how polarization could arise – and be counteracted by good policies95.

93 Kiran Garimella, Gianmarco De Francisci Morales, Aristides Gionis, and Michael Mathioudakis. Quantifying controversy on social media. ACM Transactions on Social Computing, 1(1):1–27, 2018
94 Marilena Hohmann, Karel Devriendt, and Michele Coscia. Quantifying ideological polarization on a network using generalized euclidean distance. Science Advances, 9(9):eabq2044, 2023
95 Henrique Ferraz de Arruda, Felipe Maciel Cardoso, Guilherme Ferraz de Arruda, Alexis R Hernández, Luciano da Fontoura Costa, and Yamir Moreno. Modelling how social network algorithms can influence opinion polarization. Information Sciences, 588:265–278, 2022
53
Data & Tools
53.1 Libraries
Networkx
I start by dealing with Networkx1, 2. Networkx is a Python library implementing a vast array of network algorithms and analyses. Networkx is – as far as I can tell – the most popular choice for students approaching network analysis tasks. I think it's a generalist tool that is not the best at anything specifically, but good enough at everything.

1 Aric Hagberg, Pieter Swart, and Daniel S Chult. Exploring network structure, dynamics, and function using networkx. Technical report, Los Alamos National Lab. (LANL), Los Alamos, NM (United States), 2008
2 https://networkx.github.io/
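A minimal taste of that generalist interface, on a toy graph with arbitrary measures:

import networkx as nx

G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")])

print(nx.density(G))                     # basic network statistics...
print(nx.betweenness_centrality(G))      # ...node-level measures...
print(list(nx.connected_components(G)))  # ...and structural queries.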
(Figure 53.1: The running times for different implementations of community discovery algorithms in Networkx (red) and iGraph (blue). Y axis: Running Time (s); x axis: Algorithm (LPA, FG).)
graph-tool
ones that are there benefit from having a single mind behind them.
The aforementioned issue I had with the bugs in labeled multigraph
isomorphism was solved by simply using graph-tool.
Moreover, getting into Tiago’s frame of mind is necessary to use
graph-tool. You need to understand the way he does things in order
to be able to do them as well. Things like function naming, object
types, parameter passing – what you call the interface of the library –
are not as pythonic and intuitive as in Networkx.
iGraph
Among all the alternatives, iGraph5, 6 is certainly the most versatile tool. It combines the strengths – and weaknesses – of Networkx and graph-tool. On the one hand, it is a surprisingly complete tool with lots of implemented functions – just like Networkx – and it is pretty efficiently written – like graph-tool. Other advantages reside in the fact that the library is available on a vast array of platforms: you can use it both in Python and in R. You can even import it directly as a C library. Thus, if you are capable of writing in C, you can probably cook up a customized analysis using the power of iGraph that cannot

5 Gabor Csardi and Tamas Nepusz. The igraph software package for complex network research. InterJournal, Complex Systems, 1695(5):1–9, 2006
6 https://igraph.org/
Gee, thank you, it’s refreshing to see that not even who developed
this function knows what the function is doing. Continuing my
movie directors analogy, iGraph is David Lynch: probably the only
one able to do what he is doing, but good luck knowing what’s going
on when you look at something made by him.
The fact that I’m badmouthing iGraph so hard and yet I am includ-
ing it in the book and I use it should really convince you that it is a
fundamental tool. If I could live without it – trust me – I would. But I
can’t, because sometimes it is the only thing that will save you.
Julia
Julia is a more recent alternative to Python. Like Python, it is a
general purpose programming language, but it is particularly geared
towards numerical analysis and so it is useful for data science. I
call this section “Julia” rather than using the specific name of the
library because Julia’s library design is minimalist and modular. Each
library will implement only a very specific set of things and you will
find yourself having to import several libraries to make a complete
network analysis pipeline. The libraries you’d find yourself using
most often for your tasks are:
• Graphs.jl (https://juliagraphs.org/Graphs.jl/stable/) for
basic graph models and operations;
• GraphIO.jl (https://github.com/JuliaGraphs/GraphIO.jl) to
read/write graphs to/from memory;
• Laplacians.jl (https://danspielman.github.io/Laplacians.jl/
dev/) for some specific advanced linear algebra operations;
• And others that you can find in the JuliaGraphs GitHub repository
collection (https://github.com/orgs/JuliaGraphs/repositories).
The advantages of Julia are that, in general, code will run faster
than Python, all things being equal – I’ll show an example later.
Also, some of the code that has been implemented in Julia cannot be
found anywhere else – at least that I know of. For instance, I know
of no other way to get Laplacian solvers than using Laplacians.jl
(unless I were to implement them myself, obviously). Julia comes with the disadvantage that it compiles on the fly, so the first time you
run a piece of code it might take a long time, while you wait for
the compilation. Also, Julia’s programming logic is different than
Python. If you come from Python you will sometimes be surprised
by the behavior of Julia, finding unexpected values in your variables because of how and where they were initialized. Julia for me is like
Ari Aster: the hot new kid on the block who’s doing amazing stuff,
but you wonder whether you’ve become too old for that.
Torch Geometric
One thing you should always remember is that networks and graphs
are, at the end of the day, matrices – remember Chapter 8. I try
to ignore this fact as much as I can, but it is an undeniable truth.
Sometimes, the best thing you can do is to treat them as such, and
to start doing some good old linear algebra. Which means that your
toolbox can include specialized software like Matlab or Octave. There are, in fact, network scientists who are able to do everything they need to do exclusively in these programming environments. I always look at them in awe, not knowing if I do so out of being fascinated or terrified by them.

If you are using Python, you cannot live without learning at least the basics of Numpy and Scipy7, 8, especially when it comes to using sparse matrices. Pandas9 is a good tool as well, because you can

7 Pauli Virtanen, Ralf Gommers, Travis E Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, et al. Scipy 1.0: fundamental algorithms for scientific computing in python. Nature Methods, pages 1–12, 2020
8 https://www.scipy.org/
9 https://pandas.pydata.org/
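A small sketch of that matrix mindset, assuming a recent networkx together with numpy and scipy (the measures computed are arbitrary examples):

import numpy as np
import networkx as nx

G = nx.karate_club_graph()

# The network as a sparse adjacency matrix (scipy CSR format).
A = nx.to_scipy_sparse_array(G)

# Plain linear algebra on the adjacency matrix: entry (i, j) of A @ A
# counts the walks of length two between nodes i and j.
two_step_walks = (A @ A).toarray()

# The leading eigenvector is proportional to the eigenvector centrality.
eigenvalues, eigenvectors = np.linalg.eig(A.toarray())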
(Figure 53.3: The running times [...]; y axis: Running Time (s), x axis: Approach.)
Other
For Visualization
By far, the software I use the most is Cytoscape22, 23. Mostly, I use it for network visualization. The visual style of Cytoscape is based on the Protovis JavaScript library24, which is one of the ancestors of D325, 26 – and it shows. The visual style of Cytoscape is really good, and you can customize a large quantity of visual attributes relatively easily.

Cytoscape supports some basic network analysis. You can calculate a bunch of node, edge, and network statistics, the ones you'd come up with first in your exploratory data analysis phase – nothing too fancy. This analytic capability is mostly there only to allow you to use node and edge statistical properties to augment your visualization.

Since version 3.8, Cytoscape shows fewer plots. For instance, you cannot see any more the distribution of shortest path lengths. Moreover, I can't seem to be able to show them in a log-log scale – it was possible before. However, the graphical quality of the plots greatly

22 Paul Shannon, Andrew Markiel, Owen Ozier, Nitin S Baliga, Jonathan T Wang, Daniel Ramage, Nada Amin, Benno Schwikowski, and Trey Ideker. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Research, 13(11):2498–2504, 2003
23 https://cytoscape.org/
24 Michael Bostock and Jeffrey Heer. Protovis: A graphical toolkit for visualization. IEEE Transactions on Visualization and Computer Graphics, 15(6):1121–1128, 2009
25 Michael Bostock, Vadim Ogievetsky, and Jeffrey Heer. D3 data-driven documents. IEEE Transactions on Visualization and Computer Graphics, 17(12):2301–2309, 2011
26 https://d3js.org/
sions for other major OS platforms. In any case, things have greatly improved over the years, so I like the direction in which it's going. Cytoscape is, in other words, Peter Jackson: he might not be perfect, he might have defects, but boy are his movies nice to look at!

The main alternative to Cytoscape is Gephi[29, 30]. In fact, calling it an "alternative" might even be unfair: my sense is that Gephi is actually more popular than Cytoscape among network scientists. However, it is not what I started using during my PhD, and so I never ended up installing it. So I don't have any specific way to compare their relative strengths and weaknesses. Chances are that, for 99% of the visualization tasks you will find yourself doing, the two programs can be considered equivalent. Gephi is like Guillermo Del Toro: I am unable to tell him apart from Peter Jackson. Regarding the topic of file formats, Networkx can read the GEXF file format, which is the one Gephi uses to save your network visualizations.

[29] Mathieu Bastian, Sebastien Heymann, Mathieu Jacomy, et al. Gephi: an open source software for exploring and manipulating networks. Icwsm, 8(2009):361–362, 2009.
[30] https://gephi.org/
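As a concrete, hedged illustration of that last point – assuming a hypothetical file mygraph.gexf that you exported from Gephi – reading it back into Python is a one-liner:

import networkx as nx

# "mygraph.gexf" stands in for any visualization you saved from Gephi
G = nx.read_gexf("mygraph.gexf")
print(G.number_of_nodes(), G.number_of_edges())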
Muxviz[31, 32] is another great piece of software. Muxviz covers a slightly different angle from Cytoscape or Gephi. First, even if I classify it in the visualization subsection, it is much more analysis-oriented. In fact, it requires a lot of analytical power installed on your machine (Octave and R, for instance). And you might find yourself using the command line interface more than the graphical interface. In this sense, it could have been listed as a library in the previous section.

[31] Manlio De Domenico, Mason A Porter, and Alex Arenas. Muxviz: a tool for multilayer analysis and visualization of networks. Journal of Complex Networks, 3(2):159–176, 2015c.
[32] http://muxviz.net/
More importantly, Muxviz is much more specialized. It has a
specific focus on multilayer networks (Section 7.2). Which is a good
thing, because Cytoscape is not very good for visualizing them.
As far as I know, with Cytoscape the only choice you have is to either visualize them as a multigraph, or visualize one layer at a time and then use a lot of elbow grease to piece the layers together.
Muxviz, instead, supports them natively, and it is thus a very comple-
mentary choice if you want to be prepared for all the layers life will
throw at you.

Finally, there's NetLogo[33, 34]. I frankly don't know where to classify it, because it is a weird mix of everything. First and foremost, NetLogo is a programming language. It is explicitly designed to facilitate the simulation of agent-based models. This includes all sorts of models, not necessarily the ones involving a network. Thus it is a more general tool, which allows you to do more than what this book focuses on.

[33] Seth Tisue and Uri Wilensky. Netlogo: A simple environment for modeling complexity. In International conference on complex systems, volume 21, pages 16–21. Boston, MA, 2004.
[34] https://ccl.northwestern.edu/netlogo/
Secondarily, you can use NetLogo for visualizing the effects of
specific network processes. If you follow the link I provide, you can
access NetLogo Web, a collection of simulations programmed in
NetLogo that allows you to play with a bunch of different models.
For instance, you can run a SIR model (Section 20.3), modifying its
For Analysis
There are many pieces of software out there that will allow you to perform network analysis and that are commonly used by network professionals. There are far more of them than I can include here, so I will limit myself to those with which I have had some personal experience.
The programs I talk about here are the ones that primarily pro-
vide analytic power. You can visualize networks with them, but
you should not do that. Their visualization capabilities are not the
main focus of the software, and are there mostly for you to get a
quick sense of what sort of analyses you should ask the program to
perform.
I think the program for network analysis I stumble upon the most in the literature is Pajek[35, 36]. Pajek allows you to perform a vast array of network analyses, ranging from classical social science ones to more computer science-y ones – like community discovery. Pajek comes in different versions: Pajek, Pajek XXL, and Pajek 3XL. The main difference between the versions is the capability of handling larger and larger networks. The idea is that you would perform the memory-intense analyses on the XL versions of Pajek and then import the results for further investigation in the standard version of the program.

[35] Wouter De Nooy, Andrej Mrvar, and Vladimir Batagelj. Exploratory social network analysis with Pajek. Cambridge University Press, 2018.
[36] http://mrvar.fdv.uni-lj.si/pajek/
Pajek is such a popular program that its own specific file format
is compatible with most of the software libraries I mentioned earlier.
Both Networkx and iGraph have functions that will allow you to
import networks saved in Pajek’s file format. Pajek is Lars von Trier:
perfect for geeking out over every possible detail, but not the prettiest
thing to look at.
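A minimal sketch of that interoperability on the Networkx side, assuming a hypothetical Pajek file called mygraph.net:

import networkx as nx

G = nx.read_pajek("mygraph.net")   # Networkx returns a MultiGraph for Pajek files
G = nx.Graph(G)                    # collapse parallel edges if you do not need them
print(G.number_of_nodes(), G.number_of_edges())

iGraph ships an analogous reader, so moving data between Pajek and the scripting libraries is painless.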
A popular alternative to Pajek is UCINET[37, 38]. UCINET's strength is its deep dive into the social branch of social network analysis. It is possibly the most comprehensive tool for social scientists to use.

As a result, its coverage of the more computer science and physics branches is less than optimal. UCINET works best with small networks: it is not particularly well optimized for large-scale analysis, and it will lack some of the typical algorithms you might expect to find after reading this book. However, my biggest gripe with it is probably the fact that – unlike almost everything I mentioned so far – UCINET is not a free program. If you're a full-time student, it will cost $40. UCINET is George Méliès: an immortal classic, but

[37] Stephen P Borgatti, Martin G Everett, and Linton C Freeman. Ucinet for windows: Software for social network analysis. 2002.
[38] https://sites.google.com/site/ucinetsoftware/home
Community Discovery
I create this special subsection to focus exclusively on implementations of algorithms solving the community discovery problem. This is
easily the largest subfield of network analysis. Thus this subsection
satisfies two needs. First, it gives you an idea about the immense
wealth of code that cannot find space in generic libraries/software.
Second, it contains the necessary references to the algorithms I con-
sider in my algorithm similarity network that was included in Section
35.6.
The way this subsection works is as follows. Now I will list a
bunch of labels that are consistent with Figure 35.15. For each label,
I tell you where to find the implementation I used to build that
figure. The general disclaimer is that, of course, some of these links
are bound to break in the future. I last accessed them around November 2018, so the Internet Archive could help.
• mcl: https://www.micans.org/mcl/#source.
• ganxis: https://sites.google.com/site/communitydetectionslpa/.
• conclude: http://www.emilio.ferrara.name/code/conclude/.
• mlrmcl: https://sites.google.com/site/stochasticflowclustering/.
• metis: http://glaros.dtc.umn.edu/gkhome/metis/hmetis/download.
• pmm: http://leitang.net/heterogeneous_network.html.
• crossass: https://faculty.mccombs.utexas.edu/deepayan.chakrabarti/software.html.
• demon: http://www.michelecoscia.com/?page_id=42.
• hlc: http://barabasilab.neu.edu/projects/linkcommunities/.
• tiles: https://github.com/GiulioRossetti/TILES.
• oslom: https://sites.google.com/site/andrealancichinetti/software.
• code-dense: https://link.springer.com/article/10.1007/s10618-014-0373-y.
• gce: https://sites.google.com/site/greedycliqueexpansion/.
• ilcd: http://cazabetremy.fr/rRessources/iLCD.html.
• bnmtf: http://www.cse.ust.hk/~dyyeung/code/BNMTF.zip.
• rmcl: https://rdrr.io/github/DavidGilgien/ML.RMCL/man/ML_RMCL.html.
• OLC: http://www-personal.umich.edu/~mejn/OverlappingLinkCommunities.zip.
• cme-td, cme-bu: https://github.com/linhongseba/ContentMapEquation.
• edgeclust: http://homes.sice.indiana.edu/filiradi/Data/radetal_algorithm.tgz.

[41] Santo Fortunato, Vito Latora, and Massimo Marchiori. Method to find community structures based on information centrality. Physical review E, 70(5):056104, 2004.
any network you can put your hands on that fulfills some specific constraints. This section should help you with this task.

[53] Dale F Lott. Dominance relations and breeding rate in mature male american bison. Zeitschrift für Tierpsychologie, 49(4):418–432, 1979.
[54] Jermain Kaminski, Michael Schober, Raymond Albaladejo, Oleksandr Zastupailo, and Cesar Hidalgo. Moviegalaxies - social networks in movies. 2018.
[55] http://vlado.fmf.uni-lj.si/pub/networks/data/bio/foodweb/foodweb.htm
[56] http://wwwlovre.appspot.com/support.jsp

There are many places where you can find networks directly available for download, but I start with an index: the Colorado Index of Complex Networks[57, 58] (ICON). This is quite possibly the most comprehensive index of network datasets from all domains of network science. Chances are that, if the network data is available somewhere, you can find it via ICON.

[57] A Clauset, E Tucker, and M Sainz. The colorado index of complex networks, 2016.
[58] https://icon.colorado.edu/

However, this is an index of network datasets, not a dataset repository, like the ones that will follow. This means that ICON is not hosting any network data itself. It rather contains the links to those datasets. This has advantages and disadvantages. The advantage
is completeness: not all datasets can be moved from their original
source and hosted somewhere else. ICON can include those datasets,
while the other repositories cannot. The other side of the coin is the
dynamism of the Internet. Resources get moved all the time, and not
everybody does it properly via HTTP redirects – actually almost no
one does it. Thus it is possible to find dead links in ICON, because
the managers of the website cannot possibly constantly check that all
links are working.
ICON will point you to tons of resources from which you can
actually download your data, for instance Pajek’s and UCINET’s
websites. There you can find a collection of network datasets you can
download, which is a nice additional resource to the software. One
issue you might have with this solution is that they distribute data in
their own file formats, so you might need to convert them before you
can use them with other software.
SNAP also provides network data. While there is a large overlap with what you can find in Pajek and UCINET, SNAP's focus goes decidedly more towards computer science. You will find very large datasets there, sometimes larger than what you can handle – at the time of writing this chapter, I believe the largest network is from Friendster, which contains more than 1.8 billion edges. Just like the functions SNAP implements, the datasets are very much focused on the ones the Stanford research group used for their publications.
Another interesting resource is Konect[59, 60]. Konect is also a Matlab package for network analysis. Since I do not use Matlab unless someone is pointing a gun at me, I have no experience with it as an analysis tool. However, I used to browse Konect daily to find and download some interesting network data. The list of available datasets, as far as I can tell, is a superset of what you can find on the websites of Pajek and UCINET, and more. The web interface is also well done, and you will be able to tell what are the main char-

[59] Jérôme Kunegis. Konect: the koblenz network collection. In Proceedings of the 22nd International Conference on World Wide Web, pages 1343–1350, 2013.
[60] http://konect.cc/
There are some graphs that are so widely used that you don’t really
need to look for them in an online repository. These are the pillars
on which the entire cathedral of network science is founded. They
are often directly included in software and libraries, and all self-respecting online network data repositories have one or multiple copies
of them. I include a few here.
The first – and by far the most popular – of these legendary graphs is the Zachary Karate Club[64]. This is a network of members of a karate club, connecting two members if they sparred against each other. It is often used because the network focuses on two main nodes: the coach and the president of the club. The club eventually split due to a disagreement between the two, and one can reconstruct on which side each member went by analyzing with whom they sparred. It is a classical example of community discovery. Figure 53.4 shows this beauty in all of its glory.

[64] Wayne W Zachary. An information flow model for conflict and fission in small groups. Journal of anthropological research, 33(4):452–473, 1977.
Aaron Clauset told me a fun fact about this network. Zachary’s
original paper contains a figure showing the undirected adjacency
matrix of the Karate Club network, except that it’s not fully undi-
rected! One edge appears in one direction, but not in the other. This
means that there are technically two Karate Club graphs, depending
on whether this edge is a typo or not, one with |E| = 77 edges and one with |E| = 78 edges. The latter is the most common you'll find
around, because it is the one that Mark Newman and Michelle Girvan used for their paper, which arguably launched the Karate Club
network in the Olympus of network science.
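If you want to check the edge count yourself – a quick sketch, assuming Networkx, which bundles the 78-edge variant:

import networkx as nx

G = nx.karate_club_graph()
print(G.number_of_nodes(), G.number_of_edges())  # 34 78
# each node records the faction it joined after the split
print(G.nodes[0]["club"], G.nodes[33]["club"])   # Mr. Hi Officer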
Network scientists are obsessed with this network. It has its own t-shirt[65]. They even created the Zachary Karate Club Club[66]: the club of network scientists who are the first to use the Zachary network as an example in their presentation at a network science conference. If you do so, you become the current holder of the Zachary Karate Club Trophy, and you are responsible for handing it over at the next conference you attend. This is fiercely competitive, and often you'll see this prize awarded at satellite events happening before the conference itself, because people will use the network as an example as soon as they can, to get their hands on the trophy.

[65] https://www.zazzle.co.uk/zachary_karate_club_with_label_t_shirt-235415254499870147; the label says: "If your method doesn't work on this network, then go home".
[66] https://networkkarate.tumblr.com/
The Network Science Society hands out many prestigious awards: the Erdős-Rényi prize[67], for the career of the most outstanding network scientist under the age of forty; or the Euler award[68], for the authors of paradigm-changing publications in network science. But don't get fooled. The Zachary Karate Club Trophy is where it's at.

[67] https://netscisociety.net/award-prizes/er-prize
[68] https://netscisociety.net/award-prizes/euler-award
Another commonly used network is the one obtained from Victor Hugo's novel Les Miserables[69]. In the network, each node is a character, and two characters are connected together if they appear in the same chapter. Also in this case the classical application is community discovery, given that there are sets of characters closely interacting with each other that never appear in chapters with other groups of characters. Figure 53.5 shows an example. This is one of those graphs that even non-network scientists would use for examples related to other fields, for instance data visualization[70]. The likely reason is the inclusion of this network in Knuth's popular book.

[69] Donald Ervin Knuth. The Stanford GraphBase: a platform for combinatorial computing. ACM Press, New York, 1993.
[70] https://bost.ocks.org/mike/miserables/
[71] https://figshare.com/articles/American_College_Football_Network_Files/93179

The college football network[71] is another network commonly used for community discovery – I'm sensing a pattern here. Figure 53.6 shows it. The reason it works well is due to the way sports
about ERGMs);

• Davis Southern women social network[73]: a bipartite network, connecting 18 women to 14 informal social events they attended;

• C. Elegans: this is not a single graph, it is actually multiple. We have extracted all possible ways to represent this poor little worm in network form, from a neural network to protein-protein interaction networks.

[73] Allison Davis, Burleigh Bradford Gardner, and Mary R Gardner. Deep South: A social anthropological study of caste and class. Univ of South Carolina Press, 1941.
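Several of these classics are bundled directly with Networkx – a hedged sketch, assuming a reasonably recent version of the library – so you can load them without downloading anything:

import networkx as nx

karate = nx.karate_club_graph()          # Zachary Karate Club
lesmis = nx.les_miserables_graph()       # Les Miserables character co-appearances
women = nx.davis_southern_women_graph()  # Davis Southern women bipartite network
for G in (karate, lesmis, women):
    print(G.number_of_nodes(), G.number_of_edges())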
54
Glossary
Average Path Length: The sum of the lengths of all shortest paths in a
network over the total number of such paths.
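One way to write this – my notation, not necessarily the book's, assuming an undirected, connected network and writing $d(u, v)$ for the shortest path length between nodes $u$ and $v$:

$\ell = \frac{1}{|V|\,(|V|-1)} \sum_{u \neq v} d(u, v)$

where the sum runs over ordered pairs of distinct nodes.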
Chain: A set of nodes that can be ordered, and each node is con-
nected only to its predecessor – except the first node – and its succes-
sor – except the last node.
Clique: A set of nodes where all possible edges are present, i.e. each
node in a clique is connected with each other node in the same clique.
Cycle: A path in which the starting and ending node is the same.
Hyperedges: Edges that can connect more than two nodes at the
same time.
Identity Matrix: A matrix with ones on the diagonal and zeros every-
where else.
k-core: Set of nodes that have a minimum degree of k, once you recur-
sively remove from the network all nodes that have k − 1 connections
or fewer.
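A minimal sketch of that recursive pruning, assuming Networkx, whose k_core routine performs exactly this peeling:

import networkx as nx

G = nx.karate_club_graph()
core4 = nx.k_core(G, k=4)   # repeatedly drop nodes with fewer than 4 remaining links
print(sorted(core4.nodes()))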
Line Graph: The graph that represents the adjacencies between edges
of an undirected graph: each edge of the original graph is a node in
the line graph, and two nodes in the line graph connect if they have a
node in common in the original graph.
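A tiny sketch of the construction, assuming Networkx:

import networkx as nx

G = nx.path_graph(4)     # a chain with edges (0,1), (1,2), (2,3)
L = nx.line_graph(G)
print(list(L.nodes()))   # each node of L is an edge of G
print(list(L.edges()))   # two nodes of L are linked if the original edges share an endpoint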
Parallel edges: Two (or more) edges established between the same
pair of nodes.
Star: A set of nodes with one acting as a center connected to all other
nodes in the star. All other nodes have only one connection, to the
star’s center.
A: Adjacency matrix.
C: The commute time matrix, $C_{u,v} = H_{u,v} + H_{v,u}$, with H being the hitting time matrix.
D: Degree matrix.
E: Set of edges.
H: The hitting time matrix, telling you how long it’ll take for a ran-
dom walker to visit one node when starting from another.
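For reference, a standard way to pin H down – my notation, not necessarily the book's, assuming a random walk with transition matrix $P = D^{-1}A$: $H_{v,v} = 0$, and $H_{u,v} = 1 + \sum_{w} P_{u,w} H_{w,v}$ for $u \neq v$.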
L: Laplacian matrix.
V: Set of nodes.
Other Symbols