Complex Networks and Decentralized Search Algorithms
Jon Kleinberg
Abstract. The study of complex networks has emerged over the past several years as
a theme spanning many disciplines, ranging from mathematics and computer science to
the social and biological sciences. A significant amount of recent work in this area has
focused on the development of random graph models that capture some of the qualitative
properties observed in large-scale network data; such models have the potential to help
us reason, at a general level, about the ways in which real-world networks are organized.
We survey one particular line of network research, concerned with small-world phe-
nomena and decentralized search algorithms, that illustrates this style of analysis. We
begin by describing a well-known experiment that provided the first empirical basis for
the “six degrees of separation” phenomenon in social networks; we then discuss some
probabilistic network models motivated by this work, illustrating how these models lead
to novel algorithmic and graph-theoretic questions, and how they are supported by recent
empirical studies of large social networks.
Keywords. Random graphs, complex networks, search algorithms, social network anal-
ysis.
1. Introduction
Over the past decade, the study of complex networks has emerged as a theme
running through research in a wide range of areas. The growth of the Internet
and the World Wide Web has led computer scientists to seek ways to manage the
complexity of these networks, and to help users navigate their vast information
content. Social scientists have been confronted by social network data on a scale
previously unimagined: datasets on communication within organizations, on col-
laboration in professional communities, and on relationships in financial domains.
Biologists have delved into the interactions that define the pathways of a cell’s
metabolism, discovering that the network structure of these interactions can pro-
vide insight into fundamental biological processes. The drive to understand all
these issues has resulted in what some have called a “new science of networks” —
a phenomenological study of networks as they arise in the physical world, in the
virtual world, and in society.
At a mathematical level, much of this work has been rooted in the study of ran-
dom graphs [14], an area at the intersection of combinatorics and discrete probabil-
ity that is concerned with the properties of graphs generated by random processes.
While this has been an active topic of study since the work of Erdős and Rényi
in the 1950s [26], the appearance of rich, large-scale network data in the 1990s
stimulated a tremendous influx of researchers from many different communities.
Much of this recent cross-disciplinary work has sought to develop random graph
models that more tightly capture the qualitative properties found in large social,
technological, and information networks; in many cases, these models are closely
related to earlier work in the random graphs literature, but the issues arising in
the motivating applications lead to new types of mathematical questions. For sur-
veys covering different aspects of this general area, and in particular reflecting the
various techniques of some of the different disciplines that have contributed to it,
we refer the reader to recent review papers by Albert and Barabási [4], Bollobás
[15], Kleinberg and Lawrence [39], Newman [52], and Strogatz [60], the volume
of articles edited by Ben-Naim et al. [10], and the monographs by Dorogovtsev
and Mendes [23] and Durrett [25], as well as books by Barabási [8] and Watts [62]
aimed at more general audiences.
What does one hope to achieve from a probabilistic model of a complex network
arising in the natural or social world? A basic strategy pursued in much of this
research is to define a stylized network model, produced by a random mechanism
that reflects the processes shaping the real network, and to show that this stylized
model reproduces properties observed in the real network. Clearly the full range
of factors that contribute to the observed structure will be too intricate to be fully
captured by any simple model. But a finding based on a random graph formulation
can help argue that the observed properties may have a simple underlying basis,
even if their specifics are very complex. While it is crucial to realize the limitations
of this type of activity — and not to read too much into the detailed conclusions
drawn from a simple model — the development of such models has been a valuable
means of proposing concrete, mathematically precise hypotheses about network
structure and evolution that can then serve as starting points for further empirical
investigation. And at its most effective, this process of modeling via random graphs
can suggest novel types of qualitative network features — structures that people
had not thought to define previously, and which become patterns to look for in
new network datasets.
In the remainder of the present paper, we survey one line of work, motivated
by the “small-world phenomenon” and some related search problems, that illus-
trates this style of analysis. We begin with a striking experiment by the social
psychologist Stanley Milgram that frames the empirical issues very clearly [50, 61];
we describe a sequence of models based on random graphs that capture aspects
of this phenomenon [64, 36, 37, 38, 63]; and we then discuss recent work that has
identified some of the qualitative aspects of these models in large-scale network
data [1, 43, 49]. We conclude with some further extensions to these random graph
models, discussing the results and questions that they lead to.
(Following standard notation and terminology, we say that the degree of a node is
the number of edges incident to it. We say that a function is O(f(n)) if there is
a constant c so that for all sufficiently large n, the function is bounded by cf(n).)
In fact, [17] states a much more detailed result concerning the dependence on n,
but this will not be crucial for our purposes here.
Path lengths that are logarithmic in n — or more generally polylogarithmic,
bounded by a polynomial function of log n — will be our “gold standard” in most
of this discussion. We will keep the term “small world” itself informal; but we will
consider a graph to be a small world, roughly, when all (or most) pairs of nodes
are connected by paths of length polylogarithmic in n, since in such a case the
path lengths are exponentially smaller than the number of nodes.
Watts and Strogatz [64] argued that there was something crucial missing from
the picture provided by Theorem 3.1. A standard random graph (for example, as
in Theorem 3.1) is locally very sparse; with reasonably high probability, none of
the neighbors of a given node v are themselves neighbors of one another. But this
is far from true in most naturally occurring networks: in real network data, many
of a node’s neighbors are joined to each other by edges. (For example, in a social
network, many of our friends know each other.) Indeed, at an implicit level, this is
a large part of what makes the small-world phenomenon surprising to many people
when they first hear it: the social network appears from the local perspective of
any one node to be highly “clustered,” rather than the kind of branching tree-like
structure that would more obviously reach many nodes along very short paths.
Thus, Watts and Strogatz proposed thinking about small-world networks as
a kind of superposition: a structured, high-diameter network with a relatively
small number of “random” links added in. As a model for social networks, the
structured underlying network represents the “typical” social links that we form
with the people who live near us, or who work with us; the additional random
links are the chance, long-range connections that play a large role in creating short
paths through the network as a whole.
This kind of hybrid random graph model had been studied earlier by Bollobás
and Chung [16]; they showed that a small density of random links can indeed
produce short paths very effectively. In particular they proved the following, among
other results.
this is something that we can try to model; the goal would be to see whether
decentralized routing can be proved to work in a simple random-graph model, and
if so, to try extracting from this model some qualitative properties that distinguish
networks in which this type of routing can succeed. It is worth noting that these
issues reach far beyond the Milgram experiment or even social networks; routing
with limited information is something that takes place in communication networks,
in browsing behavior on the World Wide Web, in neurological networks, and in a
number of other settings — so an understanding of the structural underpinnings
of efficient decentralized routing is a question that spans all these domains.
To begin with, we need to be precise about what we mean by a decentralized
algorithm. In the context of the grid-based model in the previous section, we will
consider algorithms that seek to pass a message from a starting node s to a target
node t, by advancing the message along edges. In each step of this process, the
current message-holder v has knowledge of the underlying grid structure, the loca-
tion of the target t on the grid, and its own long-range contact. The crucial point
is that it does not know the long-range contacts of any other nodes. (Optionally,
we can choose to have v know the path taken by the message thus far, but this
will not be crucial in any of the results to follow.) Using this information, v must
choose one of its network neighbors w to pass the message to; the process then
continues from w. We will evaluate decentralized algorithms according to their
delivery time — the expected number of steps required to reach the target, over
a randomly generated set of long-range contacts, and randomly chosen starting
and target nodes. Our goal will be to find algorithms with delivery times that are
polylogarithmic in n.
It is interesting that while Watts and Strogatz proposed their model without
the algorithmic aspect in mind, it is remarkably effective as a simple system in
which to study the effectiveness of decentralized routing. Indeed, to be able to
pose the question in a non-trivial way, one wants a network that is partially known
to the algorithm and partially unknown — clearly in the Milgram experiment, as
well as in other settings, individual nodes use knowledge not just of their own local
connections, but also of certain global “reference frames” (comparable to the grid
structure in our setting) in which the network is embedded. Furthermore, for the
problem to be interesting, the “known” part of the network should be likely to
contain no short path from the source to the target, but there should be a short
path in the full network. The Watts-Strogatz model combines all these features in
a minimal way, and thus allows us to consider how nodes can use what they know
about the network structure to construct short paths.
Despite all this, the first result here is negative.
Theorem 4.1 ([36, 37]). The delivery time of any decentralized algorithm in the
grid-based model is Ω(n^{2/3}).
(We say that a function is Ω(f(n)) if there is a constant c so that for infinitely
many n, the function is at least cf(n).)
This shows that there are simple models in which there can be an exponential
separation between the lengths of paths and the delivery times of decentralized
algorithms to find these paths. However, it is clearly not the end of the story;
rather, it says that the random links in the Watts-Strogatz model are somehow
too “unstructured” to support the kind of decentralized routing that one found in
the Milgram experiment. It also raises the question of finding a simple extension
of the model in which efficient decentralized routing becomes possible.
To extend the model, we introduce one additional parameter α ≥ 0 that controls
the extent to which the long-range links are correlated with the geometry of the
underlying grid. First, for two nodes v and w, we define their grid distance ρ(v, w)
to be the number of edges in a shortest path between them on the grid. The
idea behind the extended model is to have the long-range contacts favor nodes at
smaller grid distance, where the bias is determined by α. Specifically, we define the
grid-based model with exponent α as follows. We start with a two-dimensional n×n
grid graph, and then for each node v, we add one extra directed edge to some other
long-range contact; we choose w as the long-range contact for v with probability
proportional to ρ(v, w)^{−α}. Note that α = 0 corresponds to the original Watts-
Strogatz model, while large values of α produce networks in which essentially no
edges span long distances on the grid.
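To make the construction concrete, here is a small brute-force sketch in Python (the function and variable names are ours, not from the original formulation; the grid edges to the four nearest lattice neighbors are left implicit). It samples each node's single long-range contact with probability proportional to ρ(v, w)^{−α}:

# Illustrative sketch of the grid-based model with exponent alpha: an n-by-n grid
# in which each node receives one long-range contact, chosen with probability
# proportional to (grid distance)^(-alpha).  Brute-force sampling, meant only to
# make the definition concrete for small n; the local grid edges are implicit.
import random

def grid_distance(v, w):
    """Lattice (L1) distance between grid points v = (x1, y1) and w = (x2, y2)."""
    return abs(v[0] - w[0]) + abs(v[1] - w[1])

def long_range_contacts(n, alpha, rng=random):
    """Map each node of the n x n grid to its one long-range contact."""
    nodes = [(x, y) for x in range(n) for y in range(n)]
    contacts = {}
    for v in nodes:
        others = [w for w in nodes if w != v]
        weights = [grid_distance(v, w) ** (-alpha) for w in others]
        contacts[v] = rng.choices(others, weights=weights, k=1)[0]
    return contacts

if __name__ == "__main__":
    contacts = long_range_contacts(20, alpha=2.0)   # alpha = 2: the searchable exponent
    print(contacts[(0, 0)])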
We now have a continuum of models that can be studied, parameterized by
α. When α is very small, the long-range links are “too random,” and can’t be
used effectively by a decentralized algorithm; when α is large, the long-range links
appear to be “not random enough,” since they simply don’t provide enough of
the long-distance jumps that are needed to create a small world. Is there an
optimal operating point for the network, where the distribution of long-range links
is sufficiently balanced between these extremes to be of use to a decentralized
routing algorithm?
In fact there is; as the following theorem shows, there is a unique value of α in
the grid-based model for which a polylogarithmic delivery time is achievable.
Theorem 4.2 ([36, 37]). (a) For 0 ≤ α < 2, the delivery time of any decentralized
algorithm in the grid-based model is Ω(n^{(2−α)/3}).
(b) For α = 2, there is a decentralized algorithm with delivery time O(log^2 n).
(c) For α > 2, the delivery time of any decentralized algorithm in the grid-based
model is Ω(n^{(α−2)/(α−1)}).
(We note that the lower bounds in (a) and (c) hold even if each node has an
arbitrary constant number of long-range contacts, rather than just one.)
The decentralized algorithm achieving the bound in (b) is very simple: each
node simply forwards the message to a neighbor — long-range or local — whose
grid distance to the target is as small as possible. (In other words, each node uses
its long-range contact if this gets the message closer to the target on the grid; oth-
erwise, it uses a local contact in the direction of the target.) The analysis of this
algorithm proceeds by showing that, for a constant ε > 0, there is a probability of
at least ε/ log n in every step that the grid distance to the target will be halved.
It is also worth noting that the proof can be directly adapted to a grid in any con-
stant number of dimensions; an analogous trichotomy arises, with polylogarithmic
delivery time achievable only when α is equal to the dimension.
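The following self-contained sketch simulates this greedy rule on small instances. It is purely illustrative (the sampling is brute force, and long-range contacts are generated lazily as nodes are visited), but it makes the information constraint explicit: the current holder consults only the grid, the target's position, and its own long-range contact.

# Illustrative simulation of greedy decentralized routing in the grid-based model.
import random

def dist(v, w):
    return abs(v[0] - w[0]) + abs(v[1] - w[1])

def sample_contact(v, n, alpha, rng):
    nodes = [(x, y) for x in range(n) for y in range(n) if (x, y) != v]
    weights = [dist(v, w) ** (-alpha) for w in nodes]
    return rng.choices(nodes, weights=weights, k=1)[0]

def greedy_route(s, t, n, alpha, rng):
    """Route from s to t; each node's one long-range contact is sampled lazily."""
    v, steps, contacts = s, 0, {}
    while v != t:
        if v not in contacts:
            contacts[v] = sample_contact(v, n, alpha, rng)
        x, y = v
        local = [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                 if 0 <= x + dx < n and 0 <= y + dy < n]
        # Forward to whichever neighbor (local or long-range) is closest to the target.
        v = min(local + [contacts[v]], key=lambda w: dist(w, t))
        steps += 1
    return steps

if __name__ == "__main__":
    rng = random.Random(0)
    n = 50
    trials = [greedy_route((rng.randrange(n), rng.randrange(n)),
                           (rng.randrange(n), rng.randrange(n)), n, alpha=2.0, rng=rng)
              for _ in range(20)]
    print(sum(trials) / len(trials))   # average number of steps over 20 random pairs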
At a more general level, the proof of Theorem 4.2(b) shows that the crucial
property of exponent α = 2 is the following: rather than producing long-range
contacts that are uniformly distributed over the grid (as one gets from exponent
α = 0), it produces long-range contacts that are approximately uniformly dis-
tributed over “distance scales”: the probability that the long-range contact of v is
at a grid distance between 2^{j−1} and 2^j away from v is approximately the same for
all values of j from 1 to (1/2) log n.
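To see where this uniformity comes from, ignore boundary effects and note that in two dimensions the number of nodes at grid distance exactly d from v is Θ(d). For α = 2 this gives

Pr[ 2^{j−1} ≤ ρ(v, w) < 2^j ]  =  (1/Z) · Σ_{d = 2^{j−1}}^{2^j − 1} #{w : ρ(v, w) = d} · d^{−2}  =  (1/Z) · Σ_{d = 2^{j−1}}^{2^j − 1} Θ(d) · d^{−2}  =  Θ(1)/Z,

where Z = Σ_d Θ(d) · d^{−2} = Θ(log n) is the normalizing constant; each distance scale therefore receives probability Θ(1/log n), independent of j.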
From this property, one sees that there is a reasonable chance of halving the
message’s grid distance to the target, independent of how far away it currently
is. The property also has an intuitively natural meaning in the context of the
original Milgram experiment; subject to all the other simplifications made in the
grid model, it says very roughly that decentralized routing can be effective when
people have approximately the same density of acquaintances at many different
levels of distance resolution. And finally, this approximate uniformity over distance
scales is the type of qualitative property that we mentioned as a goal at the outset.
It is something that we can search for in other models and in real network data —
tasks that we undertake in the next two sections.
out of v, choosing w as the endpoint of the ith edge independently with probability
proportional to b^{−β h(v,w)}. (We will refer to k as the out-degree of the model.)
Thus, β works much like α did in the grid-based model; when β = 0, we get
uniform random selection, while larger values of β bias the selection more toward
“nearby” nodes. Now, in this case, a decentralized search algorithm is given the
locations of a starting node s and a target node t in the hierarchy, and it must
construct a path from s to t, knowing only the edges out of nodes that it explicitly
visits. Note that in defining the performance metric for a decentralized search
algorithm in this model, we face a problem that we didn’t encounter in the grid-
based model: the graph G may not contain a path from s to t. Thus, we say that
a decentralized algorithm here has delivery time f (n) if, on a randomly generated
n-node network, and with s and t chosen uniformly at random, the algorithm
produces a path of length O(f (n)) with probability at least 1 − ε(n), where ε(·) is
a function going to 0 as n increases.
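To make the set-up concrete, the following sketch reflects our reading of the construction, with h(v, w) taken to be the height of the least common ancestor of the leaves v and w in a complete b-ary tree; the names and parameter values are ours. It generates the k out-edges of each node with probability proportional to b^{−β h(v, w)}.

# Illustrative construction of the hierarchical model (our reading of the set-up):
# network nodes are the leaves of a complete b-ary tree, h(v, w) is the height of
# the least common ancestor of leaves v and w, and each node receives k out-edges,
# each endpoint chosen independently with probability proportional to b^(-beta*h).
import random

def lca_height(v, w, b):
    """Height of the least common ancestor of leaves v, w (leaves indexed 0..b**depth - 1)."""
    h = 0
    while v != w:
        v, w, h = v // b, w // b, h + 1
    return h

def hierarchical_graph(b, depth, beta, k, rng=random):
    n = b ** depth
    out_edges = {}
    for v in range(n):
        others = [w for w in range(n) if w != v]
        weights = [b ** (-beta * lca_height(v, w, b)) for w in others]
        out_edges[v] = rng.choices(others, weights=weights, k=k)   # k independent draws
    return out_edges

if __name__ == "__main__":
    g = hierarchical_graph(b=2, depth=8, beta=1.0, k=16)   # 256 leaves
    print(g[0])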
We now have the following analogue of Theorem 4.2, establishing that there is
a unique value of β for which polylogarithmic delivery time can be achieved when
the network has polylogarithmic out-degree. This is achieved at β = 1, when the
probability that v links to a node at tree distance h is almost uniform over choices
of h. Also by analogy with the grid-based model, it suffices to use the simple
“greedy” algorithm that always seeks to reduce the tree distance to the target by
as much as possible.
Theorem 5.1 ([38]). (a) In the hierarchical model with exponent β = 1 and out-
degree k = c log^2 n, for a sufficiently large constant c, there is a decentralized
algorithm with polylogarithmic delivery time.
(b) For every β ≠ 1 and every polylogarithmic function k(n), there is no decen-
tralized algorithm in the hierarchical model with exponent β and out-degree k(n)
that achieves polylogarithmic delivery time.
Watts, Dodds, and Newman [63] independently proposed a model in which each
node resides in several distinct hierarchies, reflecting the notion that participants
in the small-world experiment were simultaneously taking into account several
different notions of “proximity” to the target. Concretely, their model constructs
a random graph G as follows. We begin with q distinct complete b-ary trees,
for a constant q, and in each of these trees, we independently choose a random
one-to-one mapping of the nodes onto the leaves. We then apply a version of the
hierarchical model above, separately in each of the trees; the result is that each
node of G acquires edges independently through its participation in each tree.
(There are a few minor differences between their procedure within each hierarchy
and the hierarchical model described above; in particular, they map multiple nodes
to the same leaf in each hierarchy, and they generate each edge by choosing the
tail uniformly at random, and then the head according to the hierarchical model.
The result is that nodes will not in general all have the same degree.)
Precisely characterizing the power of decentralized search in this model, at an
analytical level, is an open question, but Watts et al. describe a number of inter-
esting findings obtained through simulation [63]. They study what is perhaps the
most natural search algorithm, in which the current message-holder forwards the
message to its neighbor who is closest (in the sense of tree distance) to the target
in any of the hierarchies. Using an empirical definition of efficiency on networks
of several hundred thousand nodes, they examined the set of (β, q) pairs for which
the search algorithm was efficient; they found that this “searchable region” was
centered around values of β ≥ 1 (but relatively close to 1), and on small constant
values of q. (Setting q equal to 2 or 3 yielded the widest range of β for which
efficient search was possible.) The resulting claim, at a qualitative level, is that
efficient search is facilitated by having a small number of different ways to mea-
sure proximity of nodes, and by having a small bias toward nearby nodes in the
construction of random edges.
Models based on Set Systems. One can imagine many other ways to construct
networks in this general style — for example, placing nodes on both a hierarchy
and a lattice simultaneously — and so it becomes natural to consider more general
frameworks in which a range of these bounds on searchability might follow simulta-
neously from a single result. One such approach is based on constructing a random
graph from an underlying set system, following the intuition that individuals in a
social network often form connections because they are both members of the same
small group [38]. In other words, two people might be more likely to form a link
because they live in the same town, work in the same profession, have the same
religious affiliation, or follow the work of the same obscure novelist.
Concretely, we start with a set of nodes V , and a collection of subsets S =
{S1 , S2 , . . . , Sm } of V , which we will call the set of groups. It is hard to say much
of interest for arbitrary set systems, but we would like our framework to include
at least the collection of balls or subsquares in a grid, and the collection of rooted
sub-trees in a hierarchy. Thus we consider set systems that satisfy some simple
combinatorial properties shared by these two types of collections. Specifically, for
constants λ < 1 and κ > 1, we impose the following three properties.
(i) The full set V is one of the groups.
(ii) If Si is a group of size g ≥ 2 containing a node v, then there is a group
Sj ⊆ Si containing v that is strictly smaller than Si , but has size at least
min(λg, g − 1).
(iii) If Si1 , Si2 , Si3 , . . . are groups that all have size at most g and all contain a
common node v, then their union has size at most κg.
The most interesting property here is (iii), which can be viewed as a type of “bounded
growth” requirement; one can easily verify that it (along with (i) and (ii)) holds
for the set of balls in a grid and the set of rooted sub-trees in a hierarchy.
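As a sanity check, these properties can also be verified by brute force on small instances. The following illustrative sketch enumerates all balls of a small grid and tests (i), (ii), and (iii); the particular constants λ and κ used here are ours, chosen simply so that the assertions pass on this instance.

# Brute-force check of properties (i)-(iii) for the set system of all balls of a
# small two-dimensional grid (all centers, all radii).  Purely a sanity check on a
# tiny instance; the constants lam and kappa below were picked for this example.
from itertools import product

def ball_groups(n):
    nodes = list(product(range(n), repeat=2))
    groups = set()
    for c in nodes:
        for r in range(2 * n - 1):
            groups.add(frozenset(w for w in nodes
                                 if abs(w[0] - c[0]) + abs(w[1] - c[1]) <= r))
    return nodes, [set(g) for g in groups]

def check(nodes, groups, lam, kappa):
    V = set(nodes)
    assert any(g == V for g in groups)                               # property (i)
    for Si in groups:                                                # property (ii)
        g = len(Si)
        if g < 2:
            continue
        for v in Si:
            assert any(v in Sj and Sj < Si and len(Sj) >= min(lam * g, g - 1)
                       for Sj in groups)
    for v in nodes:                                                  # property (iii)
        for g in sorted({len(S) for S in groups}):
            union = set().union(*(S for S in groups if len(S) <= g and v in S))
            assert len(union) <= kappa * g
    return True

nodes, groups = ball_groups(5)
print(check(nodes, groups, lam=0.1, kappa=5.0))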
Given a collection of groups, we construct a random graph as follows. For
nodes v and w, we define g(v, w) to be the size of the smallest group containing
both of them — this will serve as a notion of “distance” between v and w. For a
fixed exponent γ and out-degree value k, we construct k edges out of each node v,
choosing w as the endpoint of the ith edge from v independently with probability
proportional to g(v, w)^{−γ}. We will refer to this as the group-based model with set
system S, exponent γ, and out-degree k. A decentralized search algorithm in such
a random graph is given knowledge of the full set system, and the identity of a
target node; but it only learns the links out of a node v when it reaches v. We
now have the following theorem.
Theorem 5.2 ([38]). (a) Given an arbitrary set system S satisfying properties
(i), (ii), and (iii), there is a decentralized algorithm with polylogarithmic delivery
time in the group-based model with set system S, exponent γ = 1, and out-degree
k = c log^2 n, for a sufficiently large constant c.
(b) For every set system S satisfying properties (i), (ii), and (iii), every γ <
1, and every polylogarithmic function k(n), there is no decentralized algorithm
achieving polylogarithmic delivery time in the group-based model with set system
S, exponent γ and out-degree k(n).
In other words, efficient decentralized search is possible when nodes link to
each other with probability inversely proportional to the size of the smallest group
containing both of them. As a simple concrete example, if the groups are the
balls in a two-dimensional grid, then the size of the smallest group containing two
nodes at distance ρ is proportional to ρ^2, and so the link probability indicated by
Theorem 5.2(a) is proportional to ρ^{−2}; this yields an analogue of Theorem 4.2(b),
the inverse-square result for grids. (The present setting is not exactly the same
as the one there; here, we do not automatically include the edges of the original
grid when constructing the graph, but we construct a larger number of edges out
of each node.)
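The following illustrative sketch (names and parameters are ours) generates edges in the group-based model for exactly this example, using the balls of a small grid as the set system and sampling each endpoint with probability proportional to g(v, w)^{−γ}:

# Illustrative sampler for the group-based model: link probability proportional to
# g(v, w)^(-gamma), where g(v, w) is the size of the smallest group containing both
# v and w.  The groups here are the balls of a small 2D grid, so g(v, w) grows like
# (grid distance)^2 and gamma = 1 mimics the inverse-square rule of Section 4.
import random
from itertools import product

def ball_groups(n):
    nodes = list(product(range(n), repeat=2))
    groups = [{w for w in nodes if abs(w[0] - c[0]) + abs(w[1] - c[1]) <= r}
              for c in nodes for r in range(2 * n - 1)]
    return nodes, groups

def smallest_group_size(v, w, groups):
    return min(len(S) for S in groups if v in S and w in S)

def group_based_out_edges(nodes, groups, gamma, k, rng=random):
    out = {}
    for v in nodes:
        others = [w for w in nodes if w != v]
        weights = [smallest_group_size(v, w, groups) ** (-gamma) for w in others]
        out[v] = rng.choices(others, weights=weights, k=k)   # k independent draws
    return out

nodes, groups = ball_groups(6)
edges = group_based_out_edges(nodes, groups, gamma=1.0, k=4)
print(edges[(0, 0)])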
Simple examples show that one cannot directly formulate a general negative
result in this model for the case of exponents γ > 1 [38]. At a higher level, the
group-based model is clearly not the only way to generalize the results thus far; in
the next section we will discuss one other recent approach, and the development
of other general models is a natural direction for further research.
of Napster and music file-sharing in 1999. The goal of such applications was
to allow a large collection of users to share the content residing on their personal
computers, and in their initial conception, the systems supporting them were based
on a centralized index that simply stored, in a single place, the files that all users
possessed. This way, queries for a particular piece of content could be checked
against this index, and routed to the computer containing the appropriate file.
The music-sharing application of these systems, of course, ran into significant
legal difficulties; but independent of the economic and intellectual property issues
raised by this particular application, it is clear that systems allowing large user
communities to share content have a much broader range of potential, less contro-
versial uses, provided they can be structured in a robust and efficient way. This
has stimulated much subsequent study in the research community, focusing on de-
centralized approaches in which one seeks file-sharing solutions that do not rely on
a single centralized index of all the content.
In this decentralized version of the problem, the crux of the challenge is clear:
each user has certain files on his or her own computer, but there is no single place
that contains a global list of all these files; if someone poses a query looking for
a specific piece of content, how can we efficiently determine which user (if any)
possesses a copy of it? Without a central index, we are in a setting very much like
that of the Milgram experiment: users must pose the query to a subset of their
immediate network neighbors, who in turn can forward the query to some of their
neighbors, and so forth. And this is where small-world models have played a role:
a number of approaches to this problem have tried to explicitly set up the network
on which the protocol operates so that its structure makes efficient decentralized
search possible. We refer the reader to the surveys by Aspnes and Shah [6] and
Lua et al. [44] for general reviews of this body of work, and the work of Clarke et
al. (as described in [32]), Zhang et al. [67], Malkhi et al. [45], and Manku et al.
[46] for more specific discussions of the relationship to small-world networks.
A related set of issues comes up in the design of focused Web crawlers. Whereas
standard Web search engines first compile an enormous index of Web pages, and
then answer queries by referring to this index, a focused crawler attempts to lo-
cate pages on a specific topic by following hyperlinks from one page to another,
without first compiling an index. Again, the underlying issue here is the design
of decentralized search algorithms, in this case for the setting of the Web: when
searching for relevant pages without global knowledge of the network, what are
the most effective rules for deciding which links to follow? Motivated by these
issues, Menczer [49] studied the extent to which the hierarchical model described
in the previous section captures the patterns of linkage in large-scale Web data,
using the hierarchical organization of topics provided by the Open Directory.
on human social networks. In other words, these small-world models make very
concrete claims about the ways in which networks should be organized to support
efficient search, but it is not a priori clear whether or not real networks are orga-
nized in such ways. Two recent studies of this flavor have both focused on social
networks that exist in on-line environments — as with the previous applications,
we again see an intertwining of social and technological networks, but in these cases
the emphasis is on the social component, with the on-line aspect mainly providing
an opportune means of performing fine-grained analysis on a large scale.
In one study of this flavor, Adamic and Adar [1] considered the e-mail net-
work of a corporate research lab: they collected data over a period of time, and
defined an edge between two people who exchanged at least a certain number of
messages during this period. They overlaid the resulting network on a set sys-
tem representing the organizational structure, with a set for each subgroup of the
lab’s organizational hierarchy. Among other findings, they showed that the prob-
ability of a link between individuals v and w scaled approximately proportional
to g(v, w)^{−3/4}, compared with the value g(v, w)^{−1} for efficient search from Theo-
rem 5.2(a). (As above, g(v, w) denotes the size of the smallest group containing
both v and w.) Thus, interactions in their data spanned large groups at a slightly
higher frequency than the optimum for decentralized search. Of course, the e-mail
network was not explicitly designed to support decentralized search, although one
can speculate about whether there were implicit factors shaping the network into
a structure that was easy to search; in any case, it is interesting that the behavior
of the links with respect to the collection of groups is approximately aligned with
the form predicted by the earlier theorems.
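As an illustration of how a scaling exponent of this kind can be read off from data (this is only a schematic stand-in, not the estimation procedure used in [1]), one can bucket node pairs by the value of g(v, w), compute the empirical link frequency in each bucket, and fit a line to log-frequency against log-size. The data in the example below is synthetic.

# Schematic exponent estimation (not the procedure of [1]): bucket observed pairs by
# g(v, w), compute the empirical link frequency per bucket, and fit a least-squares
# line to log-frequency versus log-size.  The toy data below is made up.
import math

def fit_exponent(samples):
    """samples: list of (group_size, linked) pairs with linked in {0, 1}."""
    by_size = {}
    for g, linked in samples:
        tot, pos = by_size.get(g, (0, 0))
        by_size[g] = (tot + 1, pos + linked)
    pts = [(math.log(g), math.log(pos / tot))
           for g, (tot, pos) in by_size.items() if pos > 0]
    xbar = sum(x for x, _ in pts) / len(pts)
    ybar = sum(y for _, y in pts) / len(pts)
    num = sum((x - xbar) * (y - ybar) for x, y in pts)
    den = sum((x - xbar) ** 2 for x, _ in pts)
    return num / den

# Toy data in which link frequency decays roughly like g^(-3/4).
toy = [(g, 1 if i < int(1000 * g ** -0.75) else 0)
       for g in (4, 16, 64, 256) for i in range(1000)]
print(fit_exponent(toy))   # should be close to -0.75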
An even closer correlation with the structure predicted for efficient search was
found in a large-scale study by Liben-Nowell et al. [43]. They considered Live-
Journal, a highly active on-line community with several million participants, in
which members communicate with one another, update personal on-line diaries,
and post messages to community discussions. LiveJournal is a particularly ap-
pealing domain for studying the geographic distribution of links, because members
provide explicit links to their friends in the system, and a large subset (roughly
half a million at the time of the study in [43]) also provide a hometown in the
continental U.S. As a result, one has the opportunity to investigate, over a very
large population, how the density of social network links decays with distance.
A non-trivial technical challenge that must be overcome in order to relate this
data to the earlier models is that the population density of the U.S. is extremely
non-uniform, and this makes it difficult to interpret predictions based on a model
in which nodes are distributed uniformly over a grid. The generalization to group
structures in the previous section is one way to handle non-uniformity; Liben-
Nowell et al. propose an alternative generalization, rank-based friendships, that
they argue may be more suitable to the geographic data here [43]. In the rank-
based friendship model, one has a set of n people assigned to locations on a two-
dimensional grid, where each grid node may have an arbitrary positive number of
people assigned to it. By analogy with the grid-based model from Section 4, each
person v chooses a local contact arbitrarily in each of the four neighboring grid
nodes, and then chooses an additional long-range contact as follows. First, v ranks
all other people in order of their distance to herself (breaking ties in some canonical
way); we let rank_v(w) denote the position of w in v’s ordered list, and say that w
is at rank r with respect to v. v then chooses w as her long-range contact with
probability proportional to 1/rank_v(w).
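In code, the sampling step might look as follows; this is an illustrative sketch with names of our own choosing, breaking ties by a fixed ordering on identifiers:

# Illustrative sampler for rank-based friendship: each person v ranks everyone else
# by grid distance to v (ties broken canonically) and chooses a long-range contact
# with probability proportional to 1 / rank_v(w).
import random

def rank_based_contact(v, people, positions, rng=random):
    """people: list of person ids; positions: dict id -> (x, y) grid location."""
    def dist(a, b):
        (x1, y1), (x2, y2) = positions[a], positions[b]
        return abs(x1 - x2) + abs(y1 - y2)
    others = sorted((w for w in people if w != v),
                    key=lambda w: (dist(v, w), w))          # canonical tie-breaking
    weights = [1.0 / (r + 1) for r in range(len(others))]   # rank r+1 gets weight 1/(r+1)
    return rng.choices(others, weights=weights, k=1)[0]

# Toy example with a non-uniform population: several people may share a grid node.
positions = {i: (random.randrange(10), random.randrange(10)) for i in range(200)}
print(rank_based_contact(0, list(positions.keys()), positions))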
Note that this model generalizes the grid-based model of Section 4, in the sense
that the grid-based model with the inverse-square distribution corresponds to rank-
based friendship in which there is one person resident at each grid node. However,
the rank-based friendship construction is well-defined for any population density,
and Liben-Nowell et al. prove that it supports efficient decentralized search in
general. They analyze a decentralized greedy algorithm that always forwards the
message to a grid node as close as possible to the target’s; and they define the
delivery time in this case to be the expected number of steps needed to reach the
grid node containing the target. (So we can imagine that the task here is to route
the message to the hometown of the target, rather than the target himself; this
is also consistent with the data available from LiveJournal, which only localizes
people to the level of towns.)
Theorem 6.1 ([43]). For an arbitrary population density on a grid, the expected
delivery time of the decentralized greedy algorithm in the rank-based friendship
model is O(log^3 n).
On the LiveJournal data, Liben-Nowell et al. examine the fraction of friend-
ships (v, w) where w is at rank r with respect to v. They find that this fraction
is very close to inverse linear in r, in close alignment with the predictions of the
rank-based friendship model.
This finding is notable for several reasons. First, as with the e-mail network
considered by Adamic and Adar, there is no a priori reason to believe that a large,
apparently amorphous social network should correspond so closely to a distribution
predicted by a simple model for efficient decentralized search. Second, geography is
playing a strong role here despite the fact that LiveJournal is an on-line system in
which there are no explicit limitations on forming links with people arbitrarily far
away; as a result, one might have (incorrectly) conjectured that it would be difficult
to detect the traces of geographic proximity in such data. And more generally,
the analytical results of this section and the previous ones have been based on
highly stylized models that nonetheless make very specific predictions about the
theoretical “optimum” for search; to see these concrete predictions approximately
borne out on real social network data is striking, and it suggests that there may
be deeper phenomena yet to be discovered here.
α = 2d.)
For the specific grid-based model described in Section 4, Martel and Nguyen
showed that with high probability the diameter is proportional to log n for α ≤ d, in
the d-dimensional case [48]. They also identified transitions at α = d and α = 2d
analogous to the case of long-range percolation [53]. In particular, their results
show that while decentralized search can construct a path of length O(log^2 n) when
α = d, there in fact exist paths that are shorter by a logarithmic factor. (Note also
the contrast with the corresponding results for the long-range percolation model
when α ≤ d; in the grid-based model, the out-degree of each node is bounded by
a constant, so a diameter proportional to log n is the smallest one could hope for;
in the case of long-range percolation, on the other hand, the node degrees will be
unbounded, allowing for smaller diameters.)
8. Conclusion
We have followed a particular strand of research running through the topic of
complex networks, concerned with short paths and the ability of decentralized
algorithms to find them. As suggested initially, the sequence of ideas here is
characteristic of the flavor of research in this area: an experiment in the social sci-
ences that highlights a fundamental and non-obvious property of networks (efficient
searchability, in this case); a sequence of random graph models and accompany-
ing analysis that seeks to capture this notion in a simple and stylized form; a set
of measurements on large-scale network data that parallels the properties of the
models, in some cases to a surprising extent; and a range of connections to further
results and questions in algorithms, graph theory, and discrete probability.
To indicate some of the further directions in which research on this topic could
proceed, we conclude with a list of open questions and issues related to small-world
networks and decentralized search. Some of these questions have already come up
implicitly in the discussion thus far, so one goal of this list is to collect a number
of these questions in a single place. Other questions here, however, bring in issues
that reach beyond the context of the earlier sections. And as with any list of
open questions, we must mention a few caveats: the questions here take different
forms, since some are concretely specified while others are designed more to suggest
problems in need of a precise formulation; the questions are not independent, in
that the answer to one might well suggest ways of approaching others; and several
of the questions may well become more interesting if the underlying model or
formulation is slightly varied or tweaked.
with some kind of spatial embedding is an interesting issue that is not well under-
stood. Simsek and Jensen’s study [58] of this issue left open the question of proving
bounds on the efficiency of decentralized algorithms. For example, consider the
d-dimensional grid-based model with exponent α, and suppose that rather than
constructing a fixed number of long-range contacts for each node, we draw the
number of long-range contacts for each node v independently from a given proba-
bility distribution. To be concrete, we could consider a distribution in which one
selects k long-range contacts with probability proportional to k^{−δ} for a constant
δ.
We now have a family of grid-based models parameterized by α and δ, and we
can study the performance of decentralized search algorithms that know not only
the long-range contacts out of the current node, but also the degrees of the neigh-
boring nodes. Decentralized selection of a neighbor for forwarding the message has
a stochastic optimization aspect here, balancing the goal of forwarding the message to a
node close to the target with the goal of forwarding it to a high-degree node. We can now ask
the general question of how the delivery time of decentralized algorithms varies in
both α and δ. Note that it is quite possible this question becomes more interesting
if we vary the model so that long-range links are undirected; this way, a node
with a large degree is both easy to find and also very useful once it is found. (In
a directed version, a node with large out-degree may be relatively useless simply
because it has low in-degree and so is unlikely to be found.)
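One concrete way to instantiate this family of models, purely for illustration, is sketched below: each node draws its number of long-range contacts from a distribution proportional to k^{−δ} (truncated at some maximum), and then samples each contact with probability proportional to ρ(v, w)^{−α}. The names, truncation, and parameter values are our own choices.

# One illustrative instantiation of the proposed family: each node draws its number
# of long-range contacts from a truncated power law with exponent delta, and each
# contact is chosen with probability proportional to (grid distance)^(-alpha).
import random

def dist(v, w):
    return abs(v[0] - w[0]) + abs(v[1] - w[1])

def variable_degree_contacts(n, alpha, delta, k_max=20, rng=random):
    nodes = [(x, y) for x in range(n) for y in range(n)]
    degrees = list(range(1, k_max + 1))
    degree_weights = [k ** (-delta) for k in degrees]
    contacts = {}
    for v in nodes:
        k = rng.choices(degrees, weights=degree_weights, k=1)[0]
        others = [w for w in nodes if w != v]
        weights = [dist(v, w) ** (-alpha) for w in others]
        contacts[v] = rng.choices(others, weights=weights, k=k)
    return contacts

contacts = variable_degree_contacts(15, alpha=2.0, delta=2.5)
print(max(len(c) for c in contacts.values()))   # largest realized out-degree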
2. The case of α = 2d. In both the grid-based model and the related long-
range percolation models, very little is known about the diameter of the graph
when α is equal to twice the dimension. (It appears that a similar question arises
in other versions of the group-based models from Section 5, when nodes form links
with probability inversely proportional to the square of the size of the smallest
group containing both of them.) Resolving the behavior of the diameter would
shed light on this transitional point, which lies at the juncture between “small
worlds” and “large worlds.” This open question also manifests itself in the gossip
problem discussed in Section 7, where we noted that the transitional value α = 2d
arises in distributed computing applications (see the discussion in [34, 54]).
9. Reconstruction. The networks we have considered here have all been em-
bedded in some underlying “reference frame” — grids, hierarchies, or set systems
— and most of our analysis has been predicated on a model in which the network
is presented together with this embedding. This makes sense in many contexts;
recall, for example, the discussion from Section 6 of network data explicitly em-
bedded in Web topic directories [49], corporate hierarchies [1], or the geography
of the U.S. [43]. In some cases, however, we may be presented with just the net-
work itself, and the goal is to determine whether it has a natural embedding into
a spatial or hierarchical structure, and to recover this embedding if it exists. For
example, we may have data on communication within an organization, and the
goal is to reconstruct the hierarchical structure under the assumption that the
frequency of communication decreases according to a hierarchical model — or to
reconstruct the positions of the nodes under the assumption that the frequency
of communication decreases with distance according to a grid-based or rank-based
model.
One can formulate many specific questions of this flavor. For example, given a
network known to be generated by the grid-based model with a given exponent α,
can we approximately reconstruct the positions of the nodes on the grid? What
if we are not told the exponent? Can we determine whether a given network was
more likely to have been generated from a grid-based model with exponent α or
α′ ? Or what if there are multiple long-range contacts per node, and we are only
shown the long-range edges, not the local edges? A parallel set of questions can
be asked for the hierarchical model.
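For the simplest of these questions, deciding between two candidate exponents when the node positions are known and the long-range edges are observed, a direct likelihood comparison is already well defined. The following sketch (ours, with brute-force normalization suitable only for small grids) computes the log-likelihood of the observed edges under a given exponent and compares two candidates:

# Illustrative log-likelihood comparison for the simplest reconstruction question:
# node positions on the grid are known, only the long-range edges (v, w) are
# observed, and we ask whether exponent alpha or alpha_prime explains them better.
import math

def dist(v, w):
    return abs(v[0] - w[0]) + abs(v[1] - w[1])

def log_likelihood(edges, n, alpha):
    nodes = [(x, y) for x in range(n) for y in range(n)]
    total = 0.0
    for v, w in edges:
        z = sum(dist(v, u) ** (-alpha) for u in nodes if u != v)   # per-node normalizer
        total += -alpha * math.log(dist(v, w)) - math.log(z)
    return total

def more_likely_exponent(edges, n, alpha, alpha_prime):
    better = log_likelihood(edges, n, alpha) >= log_likelihood(edges, n, alpha_prime)
    return alpha if better else alpha_prime

if __name__ == "__main__":
    import random
    n, alpha_true = 12, 2.0
    nodes = [(x, y) for x in range(n) for y in range(n)]
    edges = []
    for v in nodes:
        others = [w for w in nodes if w != v]
        weights = [dist(v, w) ** (-alpha_true) for w in others]
        edges.append((v, random.choices(others, weights=weights, k=1)[0]))
    print(more_likely_exponent(edges, n, 2.0, 0.5))   # typically prints 2.0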
Questions of this type have been considered by Sandberg [55], who reports on
the results of computational experiments but leaves open the problem of obtaining
provable guarantees. Benjamini and Berger [11] pose related questions, includ-
ing the problem of reconstructing the dimension d of the underlying lattice when
presented with a graph generated by long-range percolation on a finite piece of Z^d.
References
[1] L. Adamic, E. Adar. How to search a social network. Social Networks, 27(3):187-203,
July 2005.
[2] L. A. Adamic, R. M. Lukose, A. R. Puniyani, B. A. Huberman. Search in Power-Law
Networks. Phys. Rev. E, 64 46135 (2001).
[3] M. Aizenman, C.M. Newman. Discontinuity of the Percolation Density in One-
Dimensional 1/|x − y|2 Percolation Models. Commun. Math. Phys. 107(1986).
[4] R. Albert, A.-L. Barabási. Statistical mechanics of complex networks. Review of Mod-
ern Physics 74, 47-97 (2002)
[5] E. Anshelevich. Network Design and Management with Strategic Agents. Ph.D. thesis,
Cornell University, 2005.
[6] J. Aspnes, G. Shah. Distributed data structures for P2P systems. in Theoretical and
Algorithmic Aspects of Sensor, Ad Hoc Wireless and Peer-to-Peer Networks (Jie Wu,
ed.), CRC Press, 2005.
[7] P. Assouad. Plongements lipschitziens dans Rn . Bull. Soc. Math. France 111(1983).
[8] A.-L. Barabási. Linked. Perseus 2002.
[9] L. Barrière, P. Fraigniaud, E. Kranakis, D. Krizanc. Efficient Routing in Networks
with Long Range Contacts. Proceedings of DISC 2001.
[10] Eli Ben-Naim, Hans Frauenfelder, Zoltan Toroczkai, eds. Complex Networks. Springer
Lecture Notes in Physics (vol. 650), 2004.
[52] M.E.J. Newman. The structure and function of complex networks. SIAM Review,
45:167–256, 2003.
[53] V. Nguyen and C. Martel. Analyzing and characterizing small-world graphs. Pro-
ceedings of ACM-SIAM symposium on Discrete Algorithms, 2005.
[54] R. van Renesse, K. P. Birman, W. Vogels. Astrolabe: A robust and scalable technol-
ogy for distributed system monitoring, management, and data mining. ACM Trans.
Computer Sys. 21(2003).
[55] O. Sandberg. Distributed Routing in Small-World Networks. Algorithm Engineering
and Experiments (ALENEX), 2006.
[56] O. Sandberg. Searching a Small World. Licentiate thesis, Chalmers University, 2005.
[57] L.S. Schulman. Long-range percolation in one dimension. J. Phys. A 16, no. 17, 1986.
[58] O. Simsek and D. Jensen. Decentralized search in networks using homophily and
degree disparity. Proc. 19th International Joint Conference on Artificial Intelligence,
2005.
[59] A. Slivkins. Distance Estimation and Object Location via Rings of Neighbors. Pro-
ceedings of 24th Annual Symposium on Principles of Distributed Computing, 2005.
[60] S. Strogatz. Exploring complex networks. Nature 410(2001), 268.
[61] J. Travers and S. Milgram. An experimental study of the small world problem.
Sociometry 32(1969).
[62] Duncan J. Watts. Six Degrees: The Science of a Connected Age, W. W. Norton,
2003.
[63] D. J. Watts, P. S. Dodds, M. E. J. Newman. Identity and Search in Social Networks.
Science, 296, 1302-1305, 2002.
[64] D. J. Watts, S. H. Strogatz. Collective dynamics of ’small-world’ networks.
Nature 393(1998).
[65] T. Wexler. Pricing Games with Selfish Users. Ph.D. thesis, Cornell University, 2005.
[66] J. Zeng, W.-J. Hsu, J. Wang. Near Optimal Routing in a Small-World Network with
Augmented Local Awareness. Parallel and Distributed Processing and Applications:
Third International Symposium (ISPA), 2005.
[67] H. Zhang, A. Goel, R. Govindan. Using the Small-World Model to Improve Freenet
Performance. Proc. IEEE Infocom, 2002.
Jon Kleinberg
Department of Computer Science
Cornell University
Ithaca NY 14853 USA