
Orion: Shortest Path Estimation For Large Social Graphs

This document proposes Orion, a novel graph coordinate system that estimates node distances in large social graphs. Orion maps nodes to positions in low-dimensional Euclidean space, allowing constant-time distance estimates between any two nodes. This improves upon traditional algorithms that require prohibitive computation time on graphs with millions of nodes. The document outlines key differences between graph coordinates and prior network coordinate systems, and describes Orion's design choices for accurately estimating distances on social graphs.

Uploaded by

xkjon
Copyright
© © All Rights Reserved

Orion: Shortest Path Estimation for Large Social Graphs

Xiaohan Zhao, Alessandra Sala, Christo Wilson, Haitao Zheng and Ben Y. Zhao
Department of Computer Science, UC Santa Barbara, USA
{xiaohanzhao, alessandra, bowlin, htzheng, ravenben}@cs.ucsb.edu

Abstract

Through measurements, researchers continue to produce large social graphs that capture relationships, transactions, and social interactions between users. Efficient analysis of these graphs requires algorithms that scale well with graph size. We examine node distance computation, a critical primitive in graph problems such as computing node separation, centrality computation, mutual friend detection, and community detection. For large million-node social graphs, computing even a single shortest path using traditional breadth-first-search can take several seconds.

In this paper, we propose a novel node distance estimation mechanism that effectively maps nodes in high dimensional graphs to positions in low-dimension Euclidean coordinate spaces, thus allowing constant time node distance computation. We describe Orion, a prototype graph coordinate system, and explore critical decisions in its design. Finally, we evaluate the accuracy of Orion’s node distance estimates, and show that it can produce accurate results in applications such as node separation, node centrality, and ranked social search.

1 Introduction

Analysis of graph properties is critical to understanding the mechanisms underlying the formation and evolution of complex networks, and is of particular importance in the study of online social networks. In recent years, the research community has seen a rise in large-scale measurement studies of deployed social networks [2, 18] and interaction networks [15, 32], some producing graphs of up to tens of millions of nodes. The size of these massive graphs makes their analysis extremely challenging, as even efficient algorithms can become time-consuming.

Computing node distance, or the shortest-path distance between two nodes, is a primitive that lies at the core of both graph analysis algorithms and social network applications. For example, in a network with n nodes, computing exact values for node separation metrics like graph radius, graph diameter, and average path length, requires calculating $O(n^2)$ node distances. In deployed social networks, LinkedIn users can use node distance to filter out query results in their neighborhood, and social e-commerce sites can use node distance to identify more trustworthy sellers [27]. Node distance is also the determining factor for other common graph problems like centrality and mutual friend detection.

Current methods for computing node distance do not scale with graph size. For a graph with n nodes and m edges, efficient implementations of traditional algorithms including breadth-first-search (BFS), Dijkstra and Floyd-Warshall can produce shortest paths for each node pair in $O(n \log n + m)$ time, and all pairs shortest-paths in $\Theta(n^3)$ [6]. Tolerable for small graphs, the computation required for a single node distance computation on a large million-node graph can take up to a minute on modern computers [23]. Given the prohibitively high costs of storing precomputed distances, researchers have little choice but to sample portions of the graph or seek approximate results.

In this paper, we propose a novel approach to approximating node distance measurements we call Graph Coordinate Systems. A graph coordinate system maps nodes in high dimensional graphs to positions in a fixed-dimension Euclidean coordinate space. Using the coordinates associated with each graph node, we can use a simple Euclidean distance computation to estimate, in constant time, its distance to any other node in the graph.

Our work is inspired by the prior success of using virtual network coordinate systems [7, 8, 20] to predict latencies between Internet hosts. Studies show that integrating network coordinates into applications such as web caches and peer-to-peer systems significantly improved their performance. Unlike latencies between Internet hosts, however, shortest path values on a graph, by definition, will never violate the triangle inequality [17]. Since triangle inequality violations are often cited as a

key source of error in network coordinate systems, graph coordinates could potentially be even more accurate.

We make three key contributions in this paper. First, we propose the use of graph coordinate systems to simplify node distance computation on large graphs. While similar in fundamental methodology to network coordinates, several critical differences force a ground-up redesign of graph coordinate systems. For example, while network coordinates can be easily tuned using fast latency measurements (e.g. via Internet ping), measuring actual distances between graph nodes can be very expensive. We describe Orion, a prototype graph coordinate system, and explore critical decisions in its design. Second, we perform extensive validation of Orion’s node distance estimates using several real social graphs. Finally, we explore the utility of graph coordinate systems in graph analysis and social applications, and show that Orion produces effective results on large graphs for applications such as node separation metrics, centrality computation, and ranked social search.

Roadmap. We begin in Section 2 by defining our goals and assumptions, and describing key differences from prior work on network coordinate systems. We then describe the Orion graph coordinate system and explain key design decisions in Section 3. Next, we present accuracy measurements of Orion in Section 4, and show the effectiveness of Orion in computing graph metrics and graph applications in Section 5. Finally, we discuss future directions and conclude in Section 6.

2 Virtual Coordinates and Large Graphs

The goal of our work is to find a compact representation of distances between nodes in a graph, such that we can quickly and easily compute estimates of shortest path distances between any two nodes. We are inspired by the significant volume of prior work on the topic of network coordinate systems, much of which mapped distances between Internet hosts to distances in a Euclidean space. In this section, we briefly summarize prior work in network coordinates, and use it as context to identify key differences and challenges in the design of graph coordinate systems. Finally, we briefly discuss related projects as context for our work.

2.1 Background: Network Coordinates

Network coordinate (NC) systems [7, 8, 17, 20, 21, 22, 29] were designed as efficient and scalable mechanisms to estimate distances or latencies between Internet hosts. Such distance estimation mechanisms can prove critical to large-scale distributed systems that use approximate distance values for performance optimization. Applications that benefit from these systems include content distribution networks [24], multicast systems [3], distributed file systems [26] and file-sharing networks [1, 5].

The majority of network coordinate systems work by mapping an Internet host to a specific position in a Euclidean space based on round-trip measurements to other hosts. Depending on the protocol, a node’s coordinates can be continually refined as additional measurement results are added to the system. Once a pair of nodes has converged to their positions in the coordinate space, their distance in the Internet (usually a round-trip-time or RTT value) can be predicted by computing the Euclidean distance between their coordinate values.

Based on the way coordinates are computed for new nodes, NC systems can be generally categorized into “landmark-based” and “decentralized” systems. Landmark-based systems such as GNP [20] first compute coordinates for an initial set of well-known landmark nodes using pair-wise measurements, where errors between virtual and measured distances are minimized using a non-linear optimization algorithm such as Simplex Downhill [19]. The NC then uses these nodes as fixed points to calibrate coordinate values for the rest of the network. Landmark-based systems [17, 20, 21, 22, 29] have fast convergence properties, since all nodes rely on the same fixed nodes for their coordinate calculations. However, the accuracy of these systems can suffer if the choice of landmark nodes is suboptimal, i.e. they do not sufficiently cover the network.

In contrast, decentralized NCs such as PIC [7] and Vivaldi [8] allow incoming nodes to orient themselves in the coordinate space using any nodes already positioned in the space. While these systems avoid dependence on well-known landmarks, new nodes can force already calibrated nodes to adjust their coordinates, potentially increasing convergence time and propagating errors. For further details on NC systems, we refer the reader to a recent survey [9].

Successes and Limitations. NC systems have been shown to be highly effective at improving performance of large distributed systems [12, 1]. However, more recent work has questioned the validity of using Euclidean spaces to approximate Internet latencies, which have been shown to violate the Triangle Inequality [13, 33].

2.2 Graph Coordinates: Challenges

Our goal is to investigate the feasibility of using a Euclidean coordinate space to capture node distances on large graphs. Upon consideration, we find that three key differences separate the problems of estimating shortest paths on graphs and host latencies on the Internet. As a result, we cannot simply apply techniques from NC systems, but must instead carefully reevaluate them in the context of graph distances.

Triangle Inequality. First, we note that while the presence of triangle inequality violations (TIV) is often identified as a barrier to accuracy in network coordinate systems, shortest path computation on graphs is guaranteed to be TIV free. This is inherent in the definition of the shortest path metric. The proof is straightforward by contradiction. Assume a triangle inequality violation for three nodes a, b, c, i.e. $d(a, b) + d(a, c) < d(b, c)$, where $d(a, b)$ represents the shortest path distance between nodes a and b. This scenario is impossible, because one can construct a “shorter” shortest path between b and c that is the concatenation of the shortest paths between (b, a) and (a, c). At minimum, the sum of lengths of two shortest paths in the triangle is equal to the length of the third. This property means a graph coordinate system does not have to support TIVs by resorting to complex algorithms such as matrix factorization [17].

Cost of Measurements. The second and most critical difference between these two problems is the cost of obtaining ground truth distance values between two nodes. In Internet latency estimation, a running system can perform a latency measurement with minimal cost via Internet Ping. In contrast, measuring the shortest path between graph nodes is expensive, and can take at worst time $O(n+m)$. In addition, computing the distance from a to b using BFS effectively computes the shortest path between a and all other nodes in the graph. With these factors in mind, we must carefully consider how graph coordinates obtain real node distances for node calibration. We must minimize the number of overall BFS operations, while reusing the results from each BFS operation as much as possible.

Error Sensitivity. Finally, graph coordinate systems face an additional challenge of higher error sensitivity. While latency between Internet nodes can vary from sub-milliseconds to hundreds of milliseconds, node distances on small-world graphs tend to have much smaller variance. For example, diameters of recently measured Facebook graphs are less than 20 [32]. Additionally, all node distance values are integers. This means node distance values across different paths in a graph are significantly more clustered across a small number of possible values, and any estimation errors can be rounded up. Thus, a graph coordinate system must provide reasonably high accuracy in order to be useful in graph applications.

2.3 Related work

Shortest Path Methods. Shortest path computations are extremely costly on large graphs. Rattigan et al. propose to compute node positions in a graph by exploiting a coordinate-like approach, called network structure index (NSI) [25]. Compared to Orion, NSI is more expensive in both time and space complexity. The space complexity of NSI is $O(nkD)$, where k is the number of zones and D is the number of dimensions, which is k times higher than Orion’s. On the other hand, NSI’s time complexity, $O(mkD)$, is proportional to the number of edges m while Orion takes only $O(nkD)$ time, where n is the number of nodes. This also represents a significant decrease in time complexity, since m is much larger than n in online social graphs. Furthermore, unlike our work, annotation distances computed by NSI are not the number of hops between node pairs.

Recent work by Potamias et al. [23] proposes a landmark scheme for approximating shortest path distances. The approach is similar in spirit, but stores for each node its distance to every landmark. In contrast, Orion is more compact. It stores for each node a coordinate address of e.g. 10 values, independent of the number of landmarks used. In addition, our work considers the broader problem of embedding large graphs into known coordinate spaces, and evaluates our work using a broad array of applications.

Social Networks. A significant amount of research effort has been invested to understand OSNs such as MySpace, Orkut [2], Flickr, LiveJournal [18], Facebook [32], and Twitter [10]. Social networks are characterized by graph properties like power-law degree distribution, small-world clustering, and scale-free behavior [16]. A necessary precondition for quantifying some of these characteristics is calculating node separation metrics (i.e. radius, diameter and average path length) that are based on all-pairs shortest paths. Some social applications also leverage shortest path computations, such as distance-based community detection [11]. Unfortunately, computing all-pairs shortest paths on today’s social graphs is infeasible, since they often have millions of nodes and hundreds of millions of edges. Existing studies sidestep this issue by using sampling techniques to estimate the graph’s true values [18, 32]. In contrast, our solution estimates shortest path distances between node pairs in about 0.2 microseconds, making it a scalable solution for computing all-pairs shortest paths on massive social graphs.

3 Designing Orion

In this section, we present the Orion graph coordinate system and explain our design decisions in detail. Similar to network coordinate systems, graph coordinate systems work in two phases. First, nodes in the graph are iteratively added to the coordinate space, the position of each node being calibrated by ground truth

node-distance measurements. This “calibration phase” is where a graph coordinate system incurs its one-time computational overhead. Once all nodes in the graph have been added, the resulting system can be integrated with graph applications to answer node distance queries with estimates.

Since the per-query computation cost is $O(1)$, the focus of our design is to ensure the calibration phase is computationally efficient, and the results are as accurate as possible. More specifically, our goals are three-fold:

• Scalability. The computational cost of the calibration phase must scale linearly with the number of nodes, i.e. $O(n)$.
• Accuracy. While individual node distance predictions might incur reasonable errors, predictions should approximate ground truth at the large scale.
• Fast convergence. Impact of individual node calibrations should be localized, i.e. should not trigger significant new adjustments to their neighbors.

Based on these goals, we now describe the Orion design and explain key decisions.

3.1 A Landmark-based Approach

Figure 1 illustrates how Orion maps nodes in a graph to positions in a D-dimension Euclidean coordinate space. The goal is to accurately translate pairwise hop-count distances in the graph into Euclidean distances in the coordinate space. To do this, Orion uses a landmark approach, where the positions of all nodes are calibrated with their relative distances to a fixed number (k) of chosen landmark nodes. Landmark nodes are initially chosen from the entire graph based on their position and degree of connectivity.

[Figure 1: Mapping graph nodes into Euclidean coordinate space. For most node pairs, the Euclidean distance exactly matches the hop-count separating them in the original graph. The figure shows a six-node graph (a to f) next to the same nodes placed in an x-y coordinate space.]

Why Landmarks? We use a landmark-based scheme in Orion for two main reasons. First and foremost, we wish to minimize the number of shortest path computations needed to establish ground truth on the actual graph, since each computation can, in the worst case, require a full traversal of the graph. Using a landmark approach, we limit the total number of Breadth-First-Search operations to k, the number of landmarks. Each BFS computes the shortest path distance from a landmark to all other nodes. Computing BFS for all landmarks essentially precomputes all values needed to calibrate all nodes in the graph. In contrast, a decentralized approach such as the physical springs model used by Vivaldi [8] requires shortest path computations between random node pairs, thus drastically increasing the number of BFS operations.

The second advantage of a landmark-based scheme is that the positions of incoming nodes depend only on the landmark nodes. This bounds the number of operations required to compute a node’s position, guaranteeing fast convergence. In contrast, in decentralized models adding a new node will often force its nearby neighbors to make adjustments on their position, a process that can propagate adjustments iteratively throughout the entire space.

Finally, we note that the challenges that make Landmark systems undesirable in Internet systems do not apply in our context. In network coordinate systems, landmarks are physical machines that must remain available at all times, and processing load from other applications (e.g. web traffic) can affect the accuracy of latency measurements to other machines in the network [21]. Compromised landmarks can also significantly impact the entire system [9]. Those issues do not exist for graph coordinates, where nodes are just graph vertices and all computation can be performed on a centralized server.

3.2 Scalable Landmark Coordinates

Intuitively, the number of landmarks used to calibrate a graph should have a direct impact on the accuracy of the Euclidean mapping. Similar correlation between landmarks and accuracy has been observed in the context of network coordinate systems [20]. The highly connected and complex nature of social graphs leads us to believe that an accurate graph coordinate system requires a significant number of landmarks. The challenge is to find a way to accurately and quickly compute the coordinates for a large number of landmarks.

Traditional network coordinates determine a node’s D-dimension coordinates by minimizing the sum of squares of prediction errors using the Simplex Downhill algorithm [19], a nonlinear optimization algorithm. The algorithm runs in $O(k^2 \cdot D)$ time to compute coordinates of k landmarks.

Since running Simplex Downhill on our desired number of landmarks (up to 100 in our study) is computationally expensive, we propose a new approach, where we separate our landmarks into two groups, a small initial group of 16 landmarks, and a larger secondary group composed of the remaining landmarks.

We leverage the Simplex Downhill algorithm to compute the coordinates for the initial ($k_I = 16$) landmarks, thus its asymptotic complexity is $O(k_I^2 \cdot D)$. The secondary group of landmarks calibrate their positions using the initial $k_I$ landmarks as anchors, contributing to a computational complexity of only $O(k_I \cdot D)$ each. Thus, the total time required to compute landmark coordinates is $O(k_I^2 \cdot D) + (k - k_I) \times O(k_I \cdot D)$, where k is the total number of landmarks.

Furthermore, we describe two ways to compute the coordinates of the secondary group of landmarks, while maintaining the same computational complexity. In the global approach, we compute the coordinates of each node in the secondary group relying only on the initial group as anchors. In the incremental landmarks approach, nodes in the secondary group are added one by one. Once a node receives its coordinate values, it becomes an anchor for all remaining nodes. To compute its coordinates, any remaining node in the secondary group can choose any $k_I$ nodes from all embedded nodes to be its landmarks.

3.3 Landmark Selection

Finally, we consider the problem of choosing landmark nodes to produce the most accurate graph-to-Euclidean coordinate mapping. Prior work by Potamias et al. considered the problem of choosing landmarks, and concluded experimentally that choosing nodes with high centrality performed significantly better than random choice [23]. Given the complexity of computing node centrality, we consider two groups of alternative landmark selection strategies as possible approximations of centrality-based selection: Random and High-degree.

• Random. This is the basic landmark selection strategy. Landmarks are chosen uniformly at random from all nodes in the graph.
• High-degree. Prior measurements on social networks [18, 32] show that social graphs exhibit a power-law-like degree distribution. Intuitively, high degree nodes reside at the core of social graphs, effectively approximating central nodes. This strategy chooses nodes with the highest degree.
• Landmark separation. Closely positioned landmarks are less effective at “covering” the graph as anchors. Therefore, we add variants to the two basic strategies, where we select the landmarks one by one, ignore any potential landmarks that are too close in the graph to existing landmarks, and continue selecting landmarks until the desired number has been met.

We consider these strategies as approximations of the high-centrality strategy, and evaluate their effectiveness empirically in Section 4.

Summary. Orion works as a landmark-based scheme, where an initial core of 16 landmarks is first fixed in the space using Simplex Downhill optimization. A secondary group of landmarks position themselves based on the original landmarks. Finally, all remaining graph nodes calibrate their positions based on node distances obtained from computing BFS from all landmarks.

4 Experimental Results

In this section we analyze the accuracy of Orion’s node distance estimates. We study the impact on accuracy of three key factors: landmark selection strategy, cardinality of the landmark set, and dimensionality of node coordinates. We preface our core discussion with an overview of the experimental environment and evaluation metrics.

4.1 Experimental Setup

We evaluate Orion accuracy using four anonymized datasets (Egypt, India, Los Angeles and Norway) gathered from Facebook regional networks [32]. These graphs were chosen because they are large, but not too large to make graph analysis intractable. Their statistical properties are consistent with other OSN datasets [2, 30]. Table 1 reports their basic properties.

Table 1: Properties of Social Graphs

Network       Nodes   Edges    Avg. Path Len.
Norway        293K    5,589K   4.2
Egypt         246K    1,618K   5.0
Los Angeles   275K    2,115K   5.1
India         363K    1,556K   6.1

All experiments were run on 2.4 GHz, dual core Xeon servers with 32GB of RAM. All machines ran Fedora Core, kernel version 2.6.x.

Evaluation Metrics. We use two key metrics to evaluate Orion accuracy. The first is Relative Error. This metric is widely used in the study of Network Coordinate Systems, although it must be modified slightly in order to evaluate graph coordinate systems. Let a and b be two nodes in the graph. Let $d^{m}_{a,b}$ be the measured distance between a and b on the real graph using the BFS algorithm, and let $d^{P}_{a,b}$ be the estimated distance computed using a and b’s coordinates from Orion. In our context, the relative error is:

$$R_e = \frac{|d^{m}_{a,b} - d^{P}_{a,b}|}{d^{m}_{a,b}} \qquad (1)$$
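As a minimal sketch of how Equation (1) could be evaluated for one node pair, the Python snippet below measures the hop count with a plain BFS and compares it with the Euclidean distance between two coordinate vectors. The function names (`bfs_hops`, `predicted_distance`, `relative_error`), the three-node toy graph, and the 2-D coordinates are all hypothetical illustrations; they are not taken from the paper and do not reproduce how Orion computes coordinates.

```python
import math
from collections import deque


def bfs_hops(adj, source):
    """Return hop counts from `source` to every reachable node (plain BFS)."""
    hops = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in hops:
                hops[v] = hops[u] + 1
                queue.append(v)
    return hops


def predicted_distance(coords, a, b):
    """Euclidean distance between the coordinate vectors of nodes a and b."""
    return math.dist(coords[a], coords[b])


def relative_error(measured, predicted):
    """Equation (1): |d_m - d_P| / d_m, defined only for measured > 0."""
    if measured <= 0:
        raise ValueError("relative error needs two distinct, connected nodes")
    return abs(measured - predicted) / measured


# Toy path graph a - b - c (hypothetical data, not from the paper).
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
# Made-up 2-D coordinates standing in for an embedding.
coords = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (1.5, 0.0)}

measured = bfs_hops(adj, "a")["c"]                # hop count from a to c
predicted = predicted_distance(coords, "a", "c")  # Euclidean estimate
error = relative_error(measured, predicted)
print(f"measured={measured} predicted={predicted:.2f} relative error={error:.2f}")
```

One BFS from a node yields its distance to every other node, so a single call to `bfs_hops` can supply the measured value for many pairs that share the same source.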

The second metric is Average Relative Error (ARE) of predicted distances. Small ARE values indicate that the majority of node pairs in Orion have realistic predicted distances. Finally, we also use Computation time to investigate Orion’s efficiency.

4.2 Estimation Accuracy

We examine Orion’s estimation accuracy under the influence of three different factors: landmark selection strategy, cardinality of the landmark set, and dimensionality of node coordinates.

Landmark Selection Strategies. We begin by analyzing the impact of landmark selection strategies on accuracy. In Section 3.3, we describe two selection strategies (random and high-degree) and variants based on landmark separation. Figure 2 plots AREs for a variety of landmark selection strategies using the India graph. We evaluate the accuracy of each different strategy on all four datasets. These results are similar for all our graphs, and we only show India here for brevity.

[Figure 2: ARE of node distances for different combinations of landmark selection and computation strategies on the India graph. x-axis: minimum distance between landmarks (Original, 2-hop, 3-hop, 4-hop); y-axis: average relative error; series: Random Strategy (Global), High-degree Strategy (Global), Random Strategy (Incremental), High-degree Strategy (Incremental).]

Each evaluation is performed by selecting 1000 random nodes in the graph and computing pairwise distances between them, for a total of ≈ 500K distances. These results form the control sample against which Orion’s relative error is calculated. Each value reported in Figure 2 is the average over 5 randomly selected groups of 1000 nodes.

In general, Figure 2 shows that Orion provides low relative errors compared to actual path lengths for different landmark selection strategies. Among the considered strategies, Figure 2 shows that high-degree strategies can produce lower errors. Furthermore, the impact of landmark separation on the accuracy of shortest path length estimation is fairly small. Taking a close look, the high-degree incremental landmark selection strategy with 3-hop separation provides the most accurate result among all the considered strategies. As a result, all remaining experiments run with this approach.

Cardinality of Landmark Set. In this section we explore the variation in accuracy when we initialize Orion using different cardinalities of the landmark set. The intuition behind this experiment is that having more landmarks spread across the graph gives better coverage of the space, which should allow higher precision when placing nodes into this space.

Figure 3 depicts the cumulative distribution function of the relative error for landmark sets of cardinality 30 and 100. Figure 3 shows that there is a small increase in precision with larger landmark set sizes. In general, almost 70% of the computed distances have a relative error less than 0.2 and more than 90% are less than 0.4. This indicates satisfactory accuracy in computing node distances with a relatively small number of landmarks (100 landmarks are fewer than one in a thousand of the nodes in our graphs).

[Figure 3: CDF of relative error on node distances on India. x-axis: relative error; y-axis: CDF; series: 100 landmarks, 30 landmarks.]

Dimensionality of Coordinates. Nodes are mapped into geometric space based on the coordinates they acquire during the initialization phase. Intuitively, calibrating node positions using a larger coordinate vector should have a direct impact on the precision of the estimated distances between nodes.

We compute coordinates as dimensionality varies between 2 and 14. Figure 4 shows that increasing the coordinate dimension also increases the accuracy of the predicted distances between nodes, confirming our intuition. Although higher dimensions produce smaller errors, as the dimension increases the time for coordinate and distance computation increases as well. We explore the trade-off between prediction precision and efficiency and conclude that using 10-dimensional coordinates is the best compromise. In particular, as shown in Figure 4, the accuracy gain diminishes beyond 10 dimensions.

4.3 Computational Complexity

In this section we investigate Orion efficiency by analyzing Orion bootstrap and pair distance computation time versus BFS.

Orion bootstrap involves two main operations: (i)

measure distances from each landmark to all the nodes using BFS, and (ii) compute coordinates using Simplex Downhill. We record the time for bootstrapping Orion on our four social graphs and show that Orion bootstrap time is roughly 2 to 3 hours (as shown in Table 2). These times are acceptable since bootstrapping is a one-time cost.

[Figure 4: ARE of different coordinate dimensions. x-axis: dimension (2 to 14); y-axis: average relative error; series: Norway, Egypt, L.A., India.]

Table 2: Computation times for Orion and BFS.

Time              India    Egypt    L.A.     Norway
Orion Bootstrap   9499s    7852s    8856s    9383s
Orion Response    0.2µs    0.2µs    0.18µs   0.19µs
BFS Response      1.028s   0.75s    1.027s   1.44s

Response time is the average time to compute pairwise node distances using Orion. As shown in Table 2, Orion is nearly 7 orders of magnitude faster than BFS. This result confirms the huge gain a graph coordinate system like Orion is able to achieve compared to traditional methods.

Note that Orion bootstrap and response times are functions of the number of nodes in the graph. Conversely, BFS computation time is a function of the number of edges. Thus Orion is likely to provide better scalability than BFS because, as social networks expand, the growth in edges far surpasses the growth in nodes.

5 Using Orion in Graph Applications

5.1 Node Separation Metrics

Node separation metrics are commonly used to characterize overall graph structure. The common node separation metrics include graph radius, graph diameter and average path length. The eccentricity of a node is defined as the longest hop distance from it to all other nodes in a graph. Graph radius is defined as the minimum eccentricity across all nodes, while graph diameter is defined as the maximum eccentricity across all nodes. Average path length is the mean of all shortest path lengths.

Computation Time. Given their intensive use of shortest path computations, node separation metrics are an ideal application for Orion. We would like to quantify Orion’s accuracy in this context by computing these metrics using Orion and comparing them directly to those from BFS. Given the large sizes of our graphs, however, it was not possible for us to compute eccentricity for all the nodes by BFS for direct comparison. From our time measurements of single node full BFS we estimate a full computation of the Los Angeles network (275K nodes) would take roughly 152 hours or more than 6 days of computation. In contrast, embedding the LA network into Orion takes less than 2 hours, and querying for all pairwise paths takes roughly 7000 seconds, for a total process time of less than 4 hours.

Table 3: Comparing Node Separation Metrics for a 1000-node sample in each of our four graphs. Orion’s approximations are compared to results computed via BFS.

Metric             Method   India   Egypt   L.A.   Norway
Radius             Orion    11.7    9.5     10.8   8.1
Radius             Actual   11      9       11     8
Diameter           Orion    17.9    13.9    17.8   12.1
Diameter           Actual   17      13      17     12
Avg. Path Length   Orion    5.8     4.8     4.9    4.1
Avg. Path Length   Actual   6.1     5.0     5.1    4.2


To demonstrate Orion’s utility and accuracy in an opera- Accuracy Results. For a scalable side-by-side com-
tional setting, we integrate Orion into several graph anal- parison, we randomly sample 1000 nodes from each of
ysis and social applications that make extensive use of the graphs, and compute graph radius, diameter and av-
shortest path computations. Under normal conditions, erage path length based on BFS from those nodes to
these graph metrics and applications can be computa- all other nodes in the graph. We compare those results
tionally intractable for large graphs. We show that we to those generated using node distance estimations from
can use Orion to scalably obtain answers that reasonably Orion, and show the results in Table 3. We find that Orion
approximate answers obtained from deterministic meth- performs very well in predicting these metrics. For graph
ods. Specifically, we look at three common operations: radius and diameter, it always provides a result that is less
computing node separation metrics such as graph radius, than 1 hop from the BFS answer. In the case of average
diameter and average path length, locating central nodes path length, Orion is even more accurate, and provides
in a graph, and ranked social search. results that never deviate more than 0.3 from BFS.
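As a minimal sketch of how the node separation metrics above could be derived from coordinates, the following Python computes radius, diameter, and average path length twice: once from exact BFS hop counts and once from Euclidean distances between coordinates. The 5-node path graph, its hand-assigned 1-D coordinates, and the helper names (`bfs_hops`, `separation_metrics`) are illustrative assumptions, not part of Orion; in practice the coordinates would come from Orion's embedding.

```python
import math
from collections import deque


def bfs_hops(adj, src):
    """Exact hop distance from src to every reachable node."""
    hops = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in hops:
                hops[w] = hops[u] + 1
                queue.append(w)
    return hops


def separation_metrics(dist, sources, targets):
    """Radius, diameter, and average path length over the sampled sources.

    dist(a, b) returns the exact or estimated distance between a and b.
    """
    eccentricities = []
    total, count = 0.0, 0
    for a in sources:
        lengths = [dist(a, b) for b in targets if b != a]
        eccentricities.append(max(lengths))  # eccentricity of node a
        total += sum(lengths)
        count += len(lengths)
    return min(eccentricities), max(eccentricities), total / count


# Toy input (assumption): the path graph 0-1-2-3-4, which embeds exactly
# in one dimension, so coordinate distances equal hop counts here.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
coords = {v: (float(v),) for v in adj}

nodes = list(adj)
exact = {s: bfs_hops(adj, s) for s in nodes}
bfs_metrics = separation_metrics(lambda a, b: exact[a][b], nodes, nodes)
orion_metrics = separation_metrics(
    lambda a, b: math.dist(coords[a], coords[b]), nodes, nodes)

print("BFS   (radius, diameter, avg path):", bfs_metrics)
print("Coord (radius, diameter, avg path):", orion_metrics)
```

On this toy graph both methods agree because the embedding is exact. On real social graphs the coordinate-based values are approximations, as Table 3 shows. Passing a 1000-node sample as `sources` and the full node set as `targets` would mirror the sampling used in our comparison.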

[Figure 5 plot: accuracy versus top # of 1000 nodes (50, 100, 200), one curve each for India, Egypt, Los Angeles, and Norway.]

Figure 5: Accuracy of top k high centrality nodes.

[Figure 6 plot: accuracy versus top # of 100 responses (5, 10, 20, 50), one curve each for Norway, Egypt, Los Angeles, and India.]

Figure 6: Accuracy of top k ranked nodes.
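The accuracy plotted in Figures 5 and 6 is an overlap between two top-k sets, one ranked by BFS and one ranked by Orion. A minimal sketch of that measure follows. The function name `top_k_overlap`, the five sample scores, and the choice of k as the denominator are assumptions made for illustration; smaller scores rank first, as with average path length or distance from a query origin.

```python
def top_k_overlap(true_scores, est_scores, k):
    """Fraction of the true top-k nodes that also appear in the estimated top-k.

    Both arguments map node -> score, and smaller scores rank higher.
    """
    true_top = set(sorted(true_scores, key=true_scores.get)[:k])
    est_top = set(sorted(est_scores, key=est_scores.get)[:k])
    return len(true_top & est_top) / k


# Hypothetical scores for five nodes: exact values from BFS and
# estimates from coordinates (made-up numbers, not measured data).
bfs_avg = {"a": 2.0, "b": 2.5, "c": 3.0, "d": 3.5, "e": 4.0}
orion_avg = {"a": 2.1, "b": 3.2, "c": 2.9, "d": 3.4, "e": 4.2}

for k in (2, 3):
    print(k, top_k_overlap(bfs_avg, orion_avg, k))
```

With these sample scores the estimate swaps the order of `b` and `c`, so the overlap is 0.5 at k = 2 and 1.0 at k = 3, which illustrates why overlap tends to rise as k grows.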

5.2 Computing Node Centrality

Information dissemination is an active research area in social networks. Viral spread [31], influence campaigns [4, 14], and breaking-news coverage [10] are all examples of information dissemination problems on social graphs. A critical, but computationally expensive, metric necessary for these applications is node centrality. We leverage Orion coordinates to compute a node's centrality in order to compare its speed and accuracy with centrality calculations performed using traditional shortest-path algorithms.

Centrality is defined as the average shortest path length from a node to every other node in the graph. The smaller the average path length for a node is, the higher its centrality is. Using Orion, a node can quickly estimate its centrality by computing its average Euclidean distance to all other nodes in the graph.

We estimate the precision of computing node centrality via Orion by comparing its results to actual results computed using BFS. Computationally, node centrality also requires all-pairs shortest path computation, and our time estimates from node separation metrics also apply here (152 hours for our LA graph).

Accuracy Results. To keep computation time manageable, we again sample 1000 random nodes from each graph, and compute node centrality values for each node using both Orion and BFS. We sort nodes based on their average shortest path length to every other node in the network, in increasing order. Then we select the top k nodes from each resulting group, and count the number of top k central nodes (according to BFS) that also appear in Orion's results. We repeat this for 5 sets of 1000 random nodes and average the result.

Figure 5 shows the percentage of top-k nodes that are correctly found by Orion, for different values of k: 50, 100, and 200. The overlap between Orion's and BFS's results increases with k. As with results in Section 4.3, centrality results for India are more accurate because it has the longest average path lengths of our sample graphs. The results are generally good across the board, with Orion giving correct estimates more than 50% of the time when selecting the top 50 highest centrality nodes out of 1000.

5.3 Ranked Social Search

Online social networks often need to rank their query results by proximity in the social graph to the query owner. For example, searches for specific names on Facebook and LinkedIn will only return the top results that are closest in social distance to the user. Social distance is used to rank query results because users generally care about people close to their social circles.

We implement a ranked social search application. In each graph, we randomly select 100 nodes to represent the total set of results for each query. We run the simulation 5000 times, each time with a randomly chosen node as the point of origin for the query.

Accuracy Results. We sort the randomly selected 100 nodes in increasing order of distance from the query origin and choose the top k nodes. Then we count the amount of overlap in the two sets of top k nodes computed by Orion and the BFS-based approach. We define the accuracy of the ranked social search in Orion as the ratio of the number of overlapping nodes to the total number of all considered nodes. Figure 6 plots the accuracy values over different values of k, averaged across the 5000 runs. Again, Orion's social search produces fairly good results, with more than 60% overlap when choosing the top 20 responses.

6 Conclusions and Future Directions

Shortest path computation is one of the most critical and computationally intensive primitives for both graph analysis and social networking applications. We propose graph coordinate systems, a new approach to dramatically reduce the complexity of shortest path computation by mapping the entire graph into a multi-dimensional Euclidean coordinate space. We describe the design of Orion, an efficient graph coordinate prototype. Mapping a graph of n nodes takes time O(k I ·D·n) (roughly 2-3 hours for a 275K node graph), after which each node distance estimation takes roughly 0.2 microseconds. Our experiments show Orion can provide accurate results both for graph metrics such as graph radius and node centrality, as well as graph-based applications such as ranked social search.

Future Directions. We believe graph coordinate systems are a promising new research direction for scalable graph analysis. While our work here is preliminary, we see three immediate areas for future work. First, we would like to explore the efficacy of mapping graphs to non-Euclidean coordinate systems such as spherical and hypercube. Second, we will examine the impact of graph coordinates on weighted graphs, e.g. geographical graphs or temporal distance metrics for social graphs [28]. Finally, Orion is designed for static graphs. Adding new nodes to the graph after the initial mapping can change shortest path values for portions of the graph and force a re-mapping of the graph. We will investigate mechanisms and heuristics to allow run-time modifications to graphs already mapped to the coordinate space.

Acknowledgments

This material is based in part upon work supported by the National Science Foundation under grants IIS-847925, CNS-0916307, and CAREER CNS-0546216.

References

[1] Azureus-vuze. http://sourceforge.net/projects/azureus/.
[2] Ahn, Y.-Y., et al. Analysis of topological characteristics of huge online social networking services. In Proc. of WWW (2007).
[3] Castro, M., et al. Scribe: A large-scale and decentralized application-level multicast infrastructure. IEEE JSAC 20, 8 (2002).
[4] Chen, W., Wang, Y., and Yang, S. Efficient influence maximization in social networks. In Proc. of ACM KDD (2009).
[5] Cohen, B. Incentives build robustness in bittorrent. In Proc. of P2P-Econ (June 2003).
[6] Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. Introduction to Algorithms, 3 ed. MIT Press, 2009.
[7] Costa, M., Castro, M., Rowstron, A., and Key, P. Pic: Practical internet coordinates for distance estimation. In Proc. of ICDCS (Tokyo, Japan, March 2004).
[8] Dabek, F., Cox, R., Kaashoek, F., and Morris, R. Vivaldi: A decentralized network coordinate system. In Proc. of SIGCOMM (Portland, OR, August 2004).
[9] Donnet, B., Gueye, B., and Kaafar, M. A. A survey on network coordinates systems, design, and security. IEEE Communication Surveys & Tutorials (2009).
[10] Kwak, H., Lee, C., Park, H., and Moon, S. What is twitter, a social network or a news media? In Proc. of WWW (2010).
[11] Lancichinetti, A., and Fortunato, S. Community detection algorithms: A comparative analysis. Phys. Rev. E 80, 5 (Nov 2009).
[12] Ledlie, J., Gardner, P., and Seltzer, M. I. Network coordinates in the wild. In Proc. of NSDI (April 2007).
[13] Lee, S., Zhang, Z., Sahu, S., and Saha, D. On suitability of euclidean embedding of internet hosts. In Proc. of SIGMETRICS (June 2006).
[14] Leskovec, J., et al. Cost-effective outbreak detection in networks. In Proc. of KDD (2007).
[15] Leskovec, J., and Horvitz, E. Planetary-scale views on a large instant-messaging network. In Proc. of WWW (2008).
[16] Li, L., et al. Towards a theory of scale-free graphs: Definition, properties, and implications. Internet Math 2, 4 (2005), 431–523.
[17] Mao, Y., Saul, L., and Smith, J. M. Ides: An internet distance estimation service for large networks. IEEE JSAC 24, 12 (Dec. 2006), 2273–2284.
[18] Mislove, A., et al. Measurement and analysis of online social networks. In Proc. of IMC (Oct 2007).
[19] Nelder, J. A., and Mead, R. A simplex method for function minimization. The Computer Journal 7, 4 (Jan. 1965), 308–313.
[20] Ng, T. S. E., and Zhang, H. Predicting internet network distance with coordinates-based approaches. In Proc. of INFOCOM (New York, NY, June 2002).
[21] Ng, T. S. E., and Zhang, H. A network positioning system for the internet. In Proc. of USENIX ATC (June 2004).
[22] Pias, M., et al. Lighthouses for scalable distributed location. In Proc. of IPTPS (Feb. 2003).
[23] Potamias, M., et al. Fast shortest path distance estimation in large networks. In Proc. of CIKM (Hong Kong, Nov. 2009).
[24] Ratnasamy, S., Handley, M., Karp, R., and Schenker, S. Topologically-aware overlay construction and server selection. In Proc. of INFOCOM (2002), IEEE.
[25] Rattigan, M. J., Maier, M., and Jensen, D. Using structure indices for efficient approximation of network properties. In Proc. of ACM SIGKDD (Philadelphia, USA, 2006).
[26] Rhea, S., et al. Pond: The OceanStore prototype. In Proc. of FAST (April 2003).
[27] Swamynathan, G., et al. Do social networks improve e-commerce: a study on social marketplaces. In Proc. of SIGCOMM WOSN (August 2008).
[28] Tang, J., et al. Temporal distance metrics for social network analysis. In Proc. of WOSN (Barcelona, Spain, August 2009).
[29] Tang, L., and Crovella, M. Virtual landmarks for the internet. In Proc. of IMC (Oct. 2003).
[30] Viswanath, B., et al. On the evolution of user interaction in facebook. In Proc. of SIGCOMM WOSN (2009).
[31] Wang, C., Knight, J. C., and Elder, M. C. On computer viral infection and the effect of immunization. In Proc. of ACSAC (2000).
[32] Wilson, C., Boe, B., Sala, A., Puttaswamy, K. P. N., and Zhao, B. Y. User interactions in social networks and their implications. In Proc. of EuroSys (April 2009).
[33] Zheng, H., et al. Internet routing policies and round-trip times. In Proc. of PAM (April 2005).
