How Small Are Building Blocks of Complex Networks
1
TNO Information and Communication Technology,
Netherlands Organisation for Applied Scientific Research,
P.O. Box 5050, 2600 GB Delft, The Netherlands
2
HP Labs, 1501 Page Mill Rd, Palo Alto, California, 94304, USA
3
Department of Computer Science and Engineering, University of California,
San Diego (UCSD), 9500 Gilman Drive, La Jolla, California 92093, USA
4
Departament de Física Fonamental, Universitat de Barcelona, Martí i Franquès 1, 08028 Barcelona, Spain
5
Cooperative Association for Internet Data Analysis (CAIDA), University of California,
San Diego (UCSD), 9500 Gilman Drive, La Jolla, California 92093, USA
Network motifs are small building blocks of complex networks, such as gene regulatory networks.
The frequent appearance of a motif may be an indication of some network-specific utility for that
motif, such as speeding up the response times of gene circuits. However, the precise nature of
the connection between motifs and the global structure and function of networks remains unclear.
Here we show that the global structure of some real networks is statistically determined by the
distributions of local motifs of size at most 3, once we augment motifs to include node degree
information. That is, remarkably, the global properties of these networks are fixed by the probability
of the presence of links between node triples, once this probability accounts for the degree of the
individual nodes. We consider a social web of trust, protein interactions, scientific collaborations,
air transportation, the Internet, and a power grid. In all cases except the power grid, random
networks that maintain the degree-enriched connectivity profiles for node triples in the original
network reproduce all its local and global properties. This finding provides an alternative statistical
explanation for motif significance. It also impacts research on network topology modeling and
generation. Such models and generators are guaranteed to reproduce essential local and global
network properties as soon as they reproduce their 3-node connectivity statistics.
I. INTRODUCTION
FIG. 1: The dK-randomization null models for d = 0, 1, 2, 3. a) Illustration of dK-randomizing rewiring. All nodes
are labeled by their degrees. A dK-rewiring preserves the graph's dK-distribution, and consequently its d'K-distributions
for all d' < d, but randomizes the d''K-distributions for d'' > d. The 0K-randomization rewires a link to any pair
of disconnected nodes, which preserves the average degree only. The 1K-randomization also preserves the degree distribution
by rewiring a pair of links as shown. The 2K-randomization preserves the joint degree distribution as well, because at least two
nodes adjacent to the rewired links are of the same degree. The 3K-randomization preserves the number of degree-labeled
wedges and triangles. As d increases, the rewiring becomes increasingly constrained, since fewer links can be rewired
without altering the dK-distribution. To dK-randomize a network, we randomly select a pair of links and rewire them if they
can be dK-rewired; if they cannot, we select another random pair. This process is repeated for a sufficient number
of successful rewirings, i.e., until all network properties stop changing, at which point we say that the graph has converged to
its dK-randomization. b) Visualization of the social web of trust (PGP network [15]) and its dK-randomizations.
We use the LaNet-vi tool [16] for visualization, which encodes node coreness in color; see the legends on the right. Coreness is
a measure of node centrality, i.e., how deeply in the network core the node lies [16]. Nodes with larger coreness are also placed
closer to the circle centers. The quick convergence of the dK-randomizations to the original PGP network, and the similarity
between it and its 3K-randomization, are remarkable.
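The rewiring procedure described in the caption of Fig. 1 can be illustrated for d = 1 and d = 2 with a minimal Python sketch. The function names (`degrees`, `joint_degrees`, `dk_rewire`), the attempt cap, and the edge-list representation are assumptions made for this illustration, not the authors' implementation; the 0K and 3K cases are omitted, and the sketch runs a fixed number of swaps rather than testing for convergence.

```python
import random
from collections import Counter

def degrees(edges):
    """Map each node to its degree in an undirected simple graph."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def joint_degrees(edges):
    """Count links by the (sorted) degrees of their two end nodes."""
    deg = degrees(edges)
    return Counter(tuple(sorted((deg[u], deg[v]))) for u, v in edges)

def dk_rewire(edges, d, n_swaps, seed=0):
    """Attempt n_swaps successful dK-preserving link swaps for d = 1 or 2."""
    rng = random.Random(seed)
    deg = degrees(edges)
    edge_list = [tuple(sorted(e)) for e in edges]
    present = set(edge_list)
    done, attempts = 0, 0
    while done < n_swaps and attempts < 1000 * n_swaps:  # hypothetical cap
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, e = edge_list[j]
        if rng.random() < 0.5:           # pick one of the two possible pairings
            c, e = e, c
        # Proposed move: (a, b), (c, e) -> (a, e), (c, b).
        # For d = 2, two swapped end nodes must have equal degree.
        if d == 2 and deg[b] != deg[e] and deg[a] != deg[c]:
            continue
        new1, new2 = tuple(sorted((a, e))), tuple(sorted((c, b)))
        if a == e or c == b or new1 in present or new2 in present:
            continue                     # reject self-loops and multi-links
        present -= {edge_list[i], edge_list[j]}
        present |= {new1, new2}
        edge_list[i], edge_list[j] = new1, new2
        done += 1
    return edge_list
```

Every accepted swap leaves each node's degree unchanged, and with `d=2` it also leaves the counts returned by `joint_degrees` unchanged, which can be checked directly on a small graph.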
Fig. 1(b) visualizes the PGP network and its dK-randomizations. We observe that the dK-series converges
at d = 3. While the 0K-random network has little in
common with the real network, the 1K-random one is
somewhat more similar, even more so for 2K, and there
is very little difference between the real PGP network
and its 3K-random counterpart.
To provide a more detailed and insightful comparison
between the real network and its dK-randomizations, we
compute a variety of metrics for each. Some popular metrics, such as degree distribution, average nearest neighbor
connectivity, clustering, etc., are functions, sometimes
peculiar, of dK-distributions, and therefore it is not surprising that they are properly captured by dK-series, as
confirmed in Section IV A. We classify metrics that do
not explicitly depend on dK-distributions as microscopic,
mesoscopic, and macroscopic. We choose them to probe
[Figure plots: community size versus rank for the PGP network and its 0K- to 3K-randomizations; mean distance, standard deviation of distance, and diameter for the real-world network and its dK-randomizations, d = 0, ..., 3.]
FIG. 4: Macroscopic scale: the distance and betweenness distributions. The top plot shows the metrics related to the hop length of shortest paths, or distances, between nodes in the PGP network and its dK-randomizations.
These metrics are the average and maximum distance between
nodes, the latter called the network diameter, and the standard deviation of the distance distribution. The bottom plot
shows the average betweenness and the standard deviation of
the betweenness distribution of nodes in the PGP network
and its dK-randomizations. The betweenness of a node is a
measure of its communication centrality [21]. It is equal to the
number of shortest paths passing through the node, divided
by the total number of shortest paths between the same source
and destination, summed over all source-destination pairs.
In both plots the values for dK-randomizations are averaged,
for each d, over several realizations of the dK-randomized
network.
II.
FIG. 5: The dK-series illustrated: a) dK-distributions for a graph of size 4; b) convergence and inclusiveness of dK-series.
III.
1.
[Table I residue; network abbreviations: PGP, Collab., Protein, Air, Internet, Power.]
IV.
\[ P(k) = \frac{N(k)}{N}, \tag{1} \]
\[ P(k, k') = \mu(k, k')\,\frac{N(k, k')}{2M}, \tag{3} \]
\[ P(k) = \frac{\bar{k}}{k} \sum_{k'} P(k, k'), \tag{5} \]
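As a hedged illustration of how the 1K- and 2K-distributions relate, the following Python sketch computes P(k) as the fraction of nodes of degree k and P(k, k') by giving each link a weight 1/(2M) in both orientations, and then checks that P(k) is recovered as (k̄/k) Σ_{k'} P(k, k'). The function name `dk_distributions`, the edge-list input, and the omission of isolated nodes are assumptions of this sketch.

```python
from collections import Counter
from fractions import Fraction

def dk_distributions(edges):
    """Return (P1, P2, kbar) for an undirected simple graph given as an edge list.

    P1[k] is the fraction of nodes of degree k; P2[(k, k')] is the probability
    that a randomly chosen link end has degree k and its other end degree k'.
    Isolated nodes do not appear in an edge list and are ignored here.
    """
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    n, m = len(deg), len(edges)
    p1 = Counter()
    for k in deg.values():
        p1[k] += Fraction(1, n)
    p2 = Counter()
    for u, v in edges:
        # Counting both orientations doubles the weight of links with k = k'.
        p2[(deg[u], deg[v])] += Fraction(1, 2 * m)
        p2[(deg[v], deg[u])] += Fraction(1, 2 * m)
    return p1, p2, Fraction(2 * m, n)

# Usage on a 4-node path, whose node degrees are 1, 2, 2, 1.
p1, p2, kbar = dk_distributions([(0, 1), (1, 2), (2, 3)])
for k in p1:
    marginal = sum(p2[(k, k2)] for k2 in p1)
    assert kbar / k * marginal == p1[k]
```

Exact fractions are used so that the marginal identity can be checked without floating-point tolerance.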
A.
We first consider the most basic metrics, which are defined by the appropriate dK-distributions. Therefore it
is not surprising that dK-random graphs with appropriate d have the values of these metrics equal exactly to
those in the real networks. Nevertheless, we report these
results for consistency and illustration purposes.
but not vice versa. The average neighbor degree $k_{nn}(k)$
is a projection of the 2K-distribution $P(k, k')$ via
\[ k_{nn}(k) = \frac{\bar{k}}{k P(k)} \sum_{k'} k' P(k, k') = \frac{\sum_{k'} k' P(k, k')}{\sum_{k'} P(k, k')}. \tag{6} \]
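The average neighbor degree can also be computed directly from an edge list, which is equivalent to taking the ratio of sums over the 2K-distribution. The sketch below is illustrative only; the function name `average_neighbor_degree` and the edge-list input are assumptions.

```python
from collections import Counter, defaultdict

def average_neighbor_degree(edges):
    """Return {k: knn(k)}: mean degree of the neighbors of degree-k nodes."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    total = defaultdict(int)   # sum of neighbor degrees over link ends at degree k
    ends = defaultdict(int)    # number of link ends at degree-k nodes, k * N(k)
    for u, v in edges:
        for a, b in ((u, v), (v, u)):
            total[deg[a]] += deg[b]
            ends[deg[a]] += 1
    return {k: total[k] / ends[k] for k in ends}

# Usage: in a 4-node path, end nodes (k = 1) see only degree-2 neighbors.
print(average_neighbor_degree([(0, 1), (1, 2), (2, 3)]))
```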
[Figure 6 plot: panels PGP, Collab., Internet, Protein, Air, Power; curves for the real network and its 0K- and 1K-randomizations; axes: degree distribution versus degree.]
FIG. 6: The degree distribution in the real networks and their dK-randomizations.
[Figure 7 plot: panels PGP, Protein, Collab., Air, Internet, Power; curves for the real network and its 0K-, 1K-, and 2K-randomizations; horizontal axis: degree.]
FIG. 7: The average neighbour degree in the real networks and their dK-randomizations.
[Figure 8 plot: panels PGP, Protein, Collab., Internet, Air, Power; curves for the real network and its 0K- to 3K-randomizations; horizontal axis: degree.]
FIG. 8: The degree-dependent clustering in the real networks and their dK-randomizations.
real networks, whereas the 2K-randomizations have exactly the same average neighbor degrees as the real networks, which is again by definition: 2K-randomization
does not change $P(k, k')$. In the Internet case, even 1K-randomization does not noticeably affect $k_{nn}(k)$. The
dK-randomizations with d > 2 do not alter $P(k, k')$, and
consequently $k_{nn}(k)$, at all; therefore they also reproduce the
latter exactly for all the networks (not shown).
3. 3K: clustering
Fig. 8 shows degree-dependent clustering $c(k)$. Clustering of node $i$ is the number of triangles $\triangle_i$ it forms, or
equivalently the number of links among its neighbors, divided by the maximum such number, which is $k(k-1)/2$,
where $k$ is its degree, $\deg(i) = k$. Averaging over all
nodes of degree $k$, the degree-dependent clustering is
\[ c(k) = \frac{2\triangle(k)}{k(k-1)N(k)}, \quad \text{where} \quad \triangle(k) = \sum_{i:\ \deg(i)=k} \triangle_i. \tag{7} \]
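The definition of degree-dependent clustering can be sketched in a few lines of Python. The function name `degree_dependent_clustering` is hypothetical, and nodes of degree 1, for which k(k-1) vanishes, are skipped as an assumption of this sketch.

```python
from collections import defaultdict
from itertools import combinations

def degree_dependent_clustering(edges):
    """Return {k: c(k)} for an undirected simple graph given as an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    tri_sum = defaultdict(int)   # sum of triangle counts over nodes of degree k
    count = defaultdict(int)     # N(k), the number of nodes of degree k
    for nbrs in adj.values():
        k = len(nbrs)
        count[k] += 1
        # Triangles at a node = linked pairs among its neighbors.
        tri_sum[k] += sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    # Clustering is undefined for k = 1, so those nodes are left out.
    return {k: 2 * tri_sum[k] / (k * (k - 1) * count[k]) for k in count if k > 1}

# Usage: a triangle 0-1-2 with a pendant node 3 attached to node 2.
print(degree_dependent_clustering([(0, 1), (1, 2), (0, 2), (2, 3)]))
```

In this example the two degree-2 nodes are fully clustered, while the degree-3 node has one linked neighbor pair out of three.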
\[ \frac{2\,(3T + W + M)}{N} = \overline{k^2}. \tag{12} \]
B.
Let $N_\wedge(k', k, k'') = N_\wedge(k'', k, k')$ be the number of wedges
involving nodes of degrees $k$, $k'$, and $k''$, where $k$ is the
central node degree, and let $N_\triangle(k, k', k'')$ be the number
of triangles consisting of nodes of degrees $k$, $k'$, and $k''$,
where $N_\triangle(k, k', k'')$ is assumed to be symmetric with
respect to all permutations of its arguments. Then the
two components of the 3K-distribution are
\[ P_\wedge(k', k, k'') = \mu(k', k'')\,\frac{N_\wedge(k', k, k'')}{2W}, \tag{8} \]
\[ P_\triangle(k, k', k'') = \mu(k, k', k'')\,\frac{N_\triangle(k, k', k'')}{6T}, \tag{9} \]
where $T$ and $W$ are the total numbers of triangles and
wedges in the network, and
\[ \mu(k, k', k'') = \begin{cases} 6 & \text{if } k = k' = k'', \\ 1 & \text{if } k \neq k' \neq k'', \\ 2 & \text{otherwise}, \end{cases} \tag{10} \]
so that both $P_\wedge(k', k, k'')$ and $P_\triangle(k, k', k'')$ are normalized,
$\sum_{k,k',k''} P_\wedge(k', k, k'') = \sum_{k,k',k''} P_\triangle(k, k', k'') = 1$.
The 3K-distribution defines the 2K-distribution (but not
vice versa), by
\[ P(k, k') = \frac{1}{k + k' - 2} \sum_{k''} \left\{ \frac{6T}{M} P_\triangle(k, k', k'') + \frac{W}{M} \left[ P_\wedge(k', k, k'') + P_\wedge(k, k', k'') \right] \right\}. \tag{11} \]
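A short Python sketch can make the wedge and triangle totals concrete. It assumes, as the normalization of the 3K-distribution suggests, that W counts only open wedges (paths of length two whose end nodes are not linked) and that each triangle is counted once in T; the function name `wedges_and_triangles` is hypothetical. Under that assumption, counting all paths of length two around each node gives the identity 2(3T + W + M)/N = mean of k^2, which the usage lines check on a small graph.

```python
from collections import defaultdict
from itertools import combinations

def wedges_and_triangles(edges):
    """Return (W, T): open wedges and triangles of an undirected simple graph."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    open_wedges = closed = 0
    for nbrs in adj.values():
        for a, b in combinations(nbrs, 2):
            if b in adj[a]:
                closed += 1        # a triangle, seen from one of its 3 corners
            else:
                open_wedges += 1   # a wedge centered at this node
    return open_wedges, closed // 3

# Usage: a triangle 0-1-2 with a pendant node 3 attached to node 2.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
W, T = wedges_and_triangles(edges)
N, M = 4, len(edges)
second_moment = (2**2 + 2**2 + 3**2 + 1**2) / N   # degrees are 2, 2, 3, 1
assert 2 * (3 * T + W + M) / N == second_moment
```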
C.
Fig. 11 shows the distance distribution in the real networks and in their dK-randomizations. The distance distribution is the distribution of hop-lengths of shortest
paths between nodes in a network. Formally, if $N(h)$ is
the number of node pairs located at hop distance $h$ from
each other, then the distance distribution $\delta(h)$ is
\[ \delta(h) = \frac{2N(h)}{N(N-1)}, \tag{14} \]
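The distance distribution and its scalar summaries (mean, standard deviation, diameter) can be sketched with a breadth-first search from every node. The function names `distance_distribution` and `distance_summary` are assumptions of this illustration, and the graph is assumed to be connected so that the distribution sums to one.

```python
from collections import Counter, defaultdict, deque

def distance_distribution(edges):
    """Return {h: fraction of node pairs at hop distance h}."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    n = len(adj)
    ordered_pairs = Counter()        # counts each pair twice, i.e. 2 N(h)
    for s in list(adj):
        dist = {s: 0}
        queue = deque([s])
        while queue:                 # plain breadth-first search from s
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        for h in dist.values():
            if h > 0:
                ordered_pairs[h] += 1
    return {h: c / (n * (n - 1)) for h, c in sorted(ordered_pairs.items())}

def distance_summary(delta):
    """Return (mean distance, standard deviation, diameter)."""
    mean = sum(h * p for h, p in delta.items())
    var = sum((h - mean) ** 2 * p for h, p in delta.items())
    return mean, var ** 0.5, max(delta)

# Usage on a 4-node path.
delta = distance_distribution([(0, 1), (1, 2), (2, 3)])
print(delta, distance_summary(delta))
```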
[Figure 9 plot: panels Air Transportation, Internet AS-level, Scientific Collaborations, Power Grid, Protein Interactions; curves for the 0K- to 3K-randomizations.]
FIG. 9: The motif distributions in the real networks and their dK-randomizations.
[Figure 10 plot: motif Z-scores on logarithmic axes; curves for the 0K- to 3K-randomizations.]
FIG. 10: The motif Z-scores in the real networks and their dK-randomizations.
(15)
To provide a clearer view of how close the distance distributions in dK-randomizations are to the real networks,
we show in Fig. 12 some scalar summary statistics of the
distance distribution as functions of d. These summary
[Figure 11 plot: panels Collab., Internet, Air, Protein, PGP, Power; curves for the real network and its dK-randomizations; axes: distance distribution versus distance.]
FIG. 11: The distance distribution in the real networks and their dK-randomizations.
[Figure 12 plot: mean distance, standard deviation of distance, and diameter on logarithmic axes for the 0K- to 3K-random networks and the real-world network.]
FIG. 12: The average distance, the standard deviation of the distance distribution, and the network diameter as functions of
d for dK-randomisations of the real networks. The corresponding values for the real networks are shown by dashed lines.
[Figure 13 plot: panels Collab., Internet, Power, Air, Protein, PGP; curves for the real network and its 0K- to 3K-randomizations; axes: average betweenness versus degree.]
FIG. 13: The average betweenness of nodes of a given degree in the real networks and their dK-randomizations.
[Figure 14 plot: betweenness statistics on logarithmic axes for the 0K- to 3K-random networks.]
FIG. 14: The average betweenness and the standard deviation of the betweenness distribution as functions of d for dK-randomisations of the real networks. The corresponding values for the real networks are shown by dashed lines.
TABLE II: The scalar topological metrics of the real networks
and the minimum value of d such that the networks' dK-randomizations approximately preserve all the metrics.

Metrics      PGP               Collab.           Protein           Air               Internet          Power
$\bar{k}$    4.6               6.4               6.4               11.9              6.3               4.7
$r$          0.238             0.157             -0.137            -0.268            -0.236            -0.273
$\bar{c}$    0.27              0.65              0.09              0.62              0.46              0.68
$\bar{h}$    7.5               6.6               4.2               3.0               3.1               2.0
$\bar{b}$    $6\times10^{4}$   $4\times10^{4}$   $7\times10^{4}$   $4\times10^{3}$   $2\times10^{4}$   $2\times10^{4}$
dK           3K                3K                3K                2K                1K                ?
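\[ r = \frac{\langle k \rangle^2 \sum_{k,k'} k k' P(k, k') - \langle k^2 \rangle^2}{\langle k^3 \rangle \langle k \rangle - \langle k^2 \rangle^2}, \tag{18} \]
which is nothing but the Pearson correlation coefficient of the 2K-distribution $P(k, k')$;
$\bar{c}$ is the average clustering
\[ \bar{c} = \sum_{k} c(k) P(k), \tag{19} \]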
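The Pearson correlation coefficient of the 2K-distribution can be evaluated from an edge list using only the first three moments of the degree distribution and the average product of degrees at the two ends of a link. The following Python sketch is illustrative; the function name `assortativity` is an assumption, and a regular graph, whose degree variance along links is zero, raises a division error in this simple version.

```python
from collections import Counter

def assortativity(edges):
    """Pearson correlation of the degrees at the two ends of a link."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    n, m = len(deg), len(edges)
    k1 = sum(deg.values()) / n                    # <k>
    k2 = sum(k ** 2 for k in deg.values()) / n    # <k^2>
    k3 = sum(k ** 3 for k in deg.values()) / n    # <k^3>
    # Sum over k, k' of k k' P(k, k') equals the mean of k_u * k_v over links.
    kk = sum(deg[u] * deg[v] for u, v in edges) / m
    return (k1 ** 2 * kk - k2 ** 2) / (k1 * k3 - k2 ** 2)

# Usage: a star is maximally disassortative.
print(assortativity([(0, 1), (0, 2), (0, 3)]))
```

For a 4-node path the same function gives -0.5, which agrees with computing the correlation directly over both orientations of every link.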
tance distribution.
Fig. 13 shows degree-dependent betweenness centrality
$b(k)$ in the real networks and their dK-randomizations.
Betweenness $b(i)$ of node $i$ is a measure of how important $i$ is in terms of the number of shortest paths passing
through it. Formally, if $\sigma_{st}(i)$ is the number of shortest
paths between nodes $s \neq i$ and $t \neq i$ that pass through
$i$, and $\sigma_{st}$ is the total number of shortest paths between
the two nodes $s \neq t$, then the betweenness of $i$ is
\[ b(i) = \sum_{s,t} \frac{\sigma_{st}(i)}{\sigma_{st}}. \tag{16} \]
Averaging over all nodes of degree $k$, the degree-dependent betweenness is
\[ b(k) = \frac{1}{N(k)} \sum_{i:\ \deg(i)=k} b(i). \tag{17} \]
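For small graphs, betweenness can be computed directly from its definition by counting shortest paths with breadth-first searches. The sketch below is a naive cubic-time illustration rather than an efficient algorithm; the function names are hypothetical, and summing over ordered source-destination pairs is an assumption of this sketch (summing over unordered pairs would halve every value).

```python
from collections import defaultdict, deque

def bfs_counts(adj, s):
    """Return (dist, sigma): hop distances and shortest-path counts from s."""
    dist, sigma = {s: 0}, defaultdict(int)
    sigma[s] = 1
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]   # every shortest path to u extends to w
    return dist, sigma

def betweenness(edges):
    """Return {node: b(node)}, summed over ordered pairs (s, t)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    adj = dict(adj)
    info = {s: bfs_counts(adj, s) for s in adj}
    b = dict.fromkeys(adj, 0.0)
    for s in adj:
        dist_s, sig_s = info[s]
        for t in adj:
            if t == s or t not in dist_s:
                continue
            for i in adj:
                if i in (s, t) or i not in dist_s:
                    continue
                dist_i, sig_i = info[i]
                # i lies on a shortest s-t path iff the distances add up.
                if t in dist_i and dist_s[i] + dist_i[t] == dist_s[t]:
                    b[i] += sig_s[i] * sig_i[t] / sig_s[t]
    return b

def betweenness_by_degree(edges):
    """Return {k: b(k)}, the mean betweenness of nodes of degree k."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    b = betweenness(edges)
    total, count = defaultdict(float), defaultdict(int)
    for i, k in deg.items():
        total[k] += b[i]
        count[k] += 1
    return {k: total[k] / count[k] for k in count}
```

For networks of realistic size a linear-memory accumulation scheme would be preferable; the triple loop here is kept only because it mirrors the definition term by term.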
(20)
TABLE III: dK-series vs. d-series
d dK-statistics d-statistics
k
0
1
2
N (k)
N (k, k0 )
N (k, k0 , k00 )
3
N4 (k, k0 , k00 )
V.
N
M
W
T
N-1
1
1
1
Acknowledgments
17940 (2004).
[13] Y. Artzy-Randrup, S. Fleishman, N. Ben-Tal, and L. Stone, Science 305, 1107 (2004).
[14] P. Mahadevan, D. Krioukov, K. Fall, and A. Vahdat, Comput Commun Rev 36, 135 (2006).
[15] M. Boguñá, R. Pastor-Satorras, A. Díaz-Guilera, and A. Arenas, Phys Rev E 70, 056122 (2004).
[16] J. I. Alvarez-Hamelin, L. Dall'Asta, A. Barrat, and A. Vespignani, in Advances in Neural Information Processing Systems 18, edited by Y. Weiss, B. Schölkopf, and J. Platt (MIT Press, Boston, 2006), pp. 41-50.
[17] S. Maslov and K. Sneppen, Science 296, 910 (2002).
[18] S. Maslov, K. Sneppen, and U. Alon, Handbook of Graphs and Networks (Wiley-VCH, Berlin, 2003), chap. Correlation Profiles and Motifs in Complex Networks.
[19] X. Dimitropoulos, D. Krioukov, G. Riley, and A. Vahdat, ACM Transactions on Modeling and Computer Simulation (to appear) (2009), arXiv:0708.3879.
[20] J. Duch and A. Arenas, Phys Rev E 72, 027104 (2005).
[21] L. Freeman, Sociometry 40, 35 (1977).
[22] M. Boguñá, D. Krioukov, and kc claffy, Nature Physics 5, 74 (2009).
[23] M. Á. Serrano, D. Krioukov, and M. Boguñá, Phys Rev Lett 100, 078701 (2008).
[24] T. V. den Bulcke, K. V. Leemput, B. Naudts, P. van Remortel, H. Ma, A. Verschoren, B. D. Moor, and K. Marchal, BMC Bioinformatics 7 (2006).
[25] J. Knabe, C. Nehaniv, and M. Schilstra, Artif Life 14, 135-148 (2008).
[26] S. Roy, M. Werner-Washburne, and T. Lane, Bioinformatics 24, 1318-1320 (2008).
[27] B. M. Waxman, IEEE J Sel Area Comm 6, 1617 (1988).
[28] E. Zegura, K. Calvert, and S. Bhattacharjee, in Proc INFOCOM (1996), vol. 2, pp. 594-602.
[29] A. Medina, A. Lakhina, I. Matta, and J. Byers, in Proc MASCOTS (2001), pp. 346-353.
[30] J. Winick and S. Jamin, Technical Report UM-CSE-TR-456-02, University of Michigan (2002).
[31] M. E. J. Newman, Proc Natl Acad Sci USA 98, 404 (2001).
[32] V. Colizza, A. Flammini, M. Á. Serrano, and A. Vespignani, Nat Phys 2, 110 (2006).
[33] V. Colizza, R. Pastor-Satorras, and A. Vespignani, Nat Phys 3, 276 (2007).
[34] P. Mahadevan, D. Krioukov, M. Fomenkov, B. Huffaker, X. Dimitropoulos, kc claffy, and A. Vahdat, Comput Commun Rev 36, 17 (2006).
[35] D. J. Watts and S. H. Strogatz, Nature 393, 440 (1998).
[36] M. Á. Serrano and M. Boguñá, Phys Rev E 74, 056114 (2006).
[37] M. Á. Serrano and M. Boguñá, Phys Rev E 74, 056115 (2006).
[38] See [36, 37] for an alternative formalism involving three-point correlations