TR-07-54
Abstract
We study the impact of different priority queues on the performance of Dijkstra's SSSP
algorithm. We consider only general priority queues that can handle any type of keys (integer,
floating point, etc.); the only exception is that we use as a benchmark the DIMACS Challenge
SSSP code [1], which can handle only integer distance values.
Our experiments focused on the following:
1. We study the performance of two variants of Dijkstra’s algorithm: the well-known version
that uses a priority queue that supports the Decrease-Key operation, and another that
uses a basic priority queue that supports only Insert and Delete-Min. For the latter type
of priority queue we include several for which high-performance code is available, such as the
bottom-up binary heap, aligned 4-ary heap, and sequence heap [33].
2. We study the performance of Dijkstra’s algorithm designed for flat memory relative to
versions that try to be cache-efficient. To this end, we mainly study how the performance
of Dijkstra's algorithm varies with the cache-efficiency of the priority queue used, both
in-core and out-of-core. We also study the performance of an implementation
of Dijkstra’s algorithm that achieves a modest amount of additional cache-efficiency in
undirected graphs through the use of two cache-efficient priority queues [25, 12]. This
is theoretically the most cache-efficient implementation of Dijkstra’s algorithm currently
known.
Overall, our results show that using a standard priority queue without the decrease-key op-
eration results in better performance than using one with the decrease-key operation in most
cases; that cache-efficient priority queues improve the performance of Dijkstra’s algorithm, both
in-core and out-of-core on current processors; and that the dual priority queue version of Di-
jkstra’s algorithm has a significant overhead in the constant factor, and hence is quite slow in
in-core execution, though it performs by far the best on sparse graphs out-of-core.
∗ Supported in part by NSF Grant CCF-0514876 and NSF CISE Research Infrastructure Grant EIA-0303609.
† National Instruments Corporation, Austin, TX.
‡ Department of Computer Sciences, University of Texas, Austin, TX. Email: [email protected].
§ Department of Computer Sciences, University of Texas, Austin, TX. Email: [email protected].
¶ Google Inc., Mountain View, CA.
‖ Microsoft Corporation, Redmond, WA.
1 Introduction
Dijkstra’s single-source shortest path (SSSP) algorithm is the most widely used algorithm for com-
puting shortest paths in a graph with non-negative edge-weights. A key ingredient in the efficient
execution of Dijkstra’s algorithm is the efficiency of the heap (i.e., priority queue) it uses. In this
paper we present an experimental study on how the heap affects performance in Dijkstra’s algo-
rithm. We mainly consider heaps that are able to support arbitrary nonnegative key values so that
the algorithm can be executed on graphs with general nonnegative edge-weights.
We consider the following two issues:
1. We study the relative performance of two variants of Dijkstra’s algorithm: the traditional
implementation that uses a heap with Decrease-Key operations (see Dijkstra-Dec in Section
1.1), and another simple variant that uses a basic heap with only Insert and Delete-Min
operations (see Dijkstra-NoDec in Section 1.1).
2. We study how the performance of Dijkstra’s algorithm varies with the cache-efficiency of the
heap used, both in-core and out-of-core. We also study the performance of the theoretically
best cache-efficient implementation of Dijkstra’s algorithm for undirected graphs that uses
two cache-efficient heaps (see Dijkstra-Ext in Section 1.1).
1.1 Dijkstra’s SSSP Algorithm
We consider the following three implementations of Dijkstra’s algorithm.
Dijkstra-Dec. This is the standard implementation of Dijkstra’s algorithm that uses a heap that
supports the Decrease-Key operation (see Function B.1 in the Appendix).
The heaps we use with Dijkstra-Dec are the standard binary heap [40], which is perhaps the most
widely used heap, the pairing heap [19], and the cache-oblivious buffer heap [12], which is the only
nontrivial cache-oblivious heap that supports the Decrease-Key operation. We implemented all three in-house.
The pairing heap was chosen as the best representative of the class of heaps that support
Decrease-Key in amortized sub-logarithmic time. Though the Fibonacci heap [20] and others support
it in O(1) amortized time while the pairing heap does not, an experimental study of the Prim-Dijkstra
MST algorithm [29] found the pairing heap to be superior to the other heaps considered (while the
standard binary heap performed better than the rest in most cases). In our preliminary experiments
we used several other heaps, including the Fibonacci heap, and similar to [29] we found the pairing
heap and the standard binary heap to be the fastest among the traditional (flat-memory) heaps.
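To make the structure of this variant concrete, here is a minimal sketch in C++. The handle-based heap interface assumed below is our own illustrative convention (mirroring what handle-based heaps such as the binary, pairing and buffer heaps provide); it is not the actual interface of our in-house code.

```cpp
#include <limits>
#include <utility>
#include <vector>

// Minimal sketch of Dijkstra-Dec (illustrative only).  The Heap concept
// assumed here mirrors a handle-based heap with Decrease-Key:
//   Handle insert(int v, double key);
//   void   decrease_key(Handle h, double key);
//   std::pair<int, double> delete_min();
//   bool   empty() const;
template <class Heap>
std::vector<double> dijkstra_dec(
    const std::vector<std::vector<std::pair<int, double>>>& adj, int s) {
    const double INF = std::numeric_limits<double>::infinity();
    std::vector<double> d(adj.size(), INF);
    std::vector<typename Heap::Handle> h(adj.size());
    std::vector<char> in_heap(adj.size(), 0), settled(adj.size(), 0);
    Heap Q;
    d[s] = 0.0;
    h[s] = Q.insert(s, 0.0);
    in_heap[s] = 1;
    while (!Q.empty()) {
        auto [u, key] = Q.delete_min();   // u is settled at distance key
        settled[u] = 1;
        for (auto [v, w] : adj[u])
            if (!settled[v] && key + w < d[v]) {   // relax edge (u, v)
                d[v] = key + w;
                if (in_heap[v]) Q.decrease_key(h[v], d[v]);  // one heap entry per vertex
                else { h[v] = Q.insert(v, d[v]); in_heap[v] = 1; }
            }
    }
    return d;
}
```

Note that with Decrease-Key each vertex occupies at most one heap entry at any time, which is exactly the property Dijkstra-NoDec gives up.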
Dijkstra-NoDec. This is an implementation that uses a heap with only Insert and Delete-Min
operations (see Function B.2 in the Appendix). This implementation performs more heap
operations and more accesses to the graph data structure than Dijkstra-Dec. Further, its
asymptotic running time is theoretically inferior to that of Dijkstra-Dec when the latter is
used with a Fibonacci heap or another heap with amortized sub-logarithmic support for Decrease-Key.
However, since Dijkstra-NoDec can use a streamlined heap without the heavy machinery
needed to support Decrease-Key (e.g., pointers in the binary heap), its heap operations are likely
to be more efficient than those in Dijkstra-Dec (see Appendix B.3.1 for more details).
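For concreteness, the following is a minimal sketch of the NoDec variant, using std::priority_queue with lazy deletion of stale entries; the adjacency-list type is an assumption for illustration, not the streamlined heaps used in our experiments.

```cpp
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

// Minimal sketch of Dijkstra-NoDec.  The heap supports only Insert (push)
// and Delete-Min (pop): instead of Decrease-Key, a vertex is simply
// re-inserted with its improved distance, and stale (already-settled)
// entries are discarded lazily on extraction.
std::vector<double> dijkstra_nodec(
    const std::vector<std::vector<std::pair<int, double>>>& adj, int s) {
    const double INF = std::numeric_limits<double>::infinity();
    std::vector<double> d(adj.size(), INF);
    using Entry = std::pair<double, int>;   // (distance, vertex)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> Q;
    d[s] = 0.0;
    Q.push({0.0, s});
    while (!Q.empty()) {
        auto [dist, u] = Q.top();
        Q.pop();
        if (dist > d[u]) continue;          // stale entry: u was settled earlier
        for (auto [v, w] : adj[u])
            if (dist + w < d[v]) {          // relax edge (u, v)
                d[v] = dist + w;
                Q.push({d[v], v});          // re-insert instead of Decrease-Key
            }
    }
    return d;
}
```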
The heaps we use with Dijkstra-NoDec are the bottom-up binary heap, aligned 4-ary heap and
sequence heap, which are three highly optimized heaps implemented by Peter Sanders [33], and two
heaps coded in-house: the standard binary heap without support for Decrease-Key, and the auxiliary
buffer heap, which is the buffer heap without support for Decrease-Key.
Dijkstra-Ext. This is an external-memory implementation for undirected graphs that uses two
heaps: one with the Decrease-Key operation, and the other supporting only Insert and Delete-Min
operations [25, 12] (see Function B.3 in the Appendix). This algorithm is asymptotically more
cache-efficient (by a modest amount) than the two mentioned above; this is achieved by reducing
the number of I/Os for accessing the graph data structure at the cost of considerably increasing the
number of heap operations (though only by a constant factor) relative to Dijkstra-NoDec. As
a result, Dijkstra-Ext is expected to outperform the other implementations only in out-of-core
computations, i.e., when the cost of accessing data in external memory becomes significant (see
Section B.3.1 in the Appendix for more details). We note that there are undirected SSSP algorithms
for graphs with bounded edge-weights [28, 3] that are more cache-efficient than Dijkstra-Ext, but
these algorithms are not direct implementations of Dijkstra's algorithm as they use a hierarchical
decomposition technique (similar to those in some flat-memory SSSP algorithms [37, 32]), and are
thus out of the scope of this paper.
The theoretical I/O complexities of all heaps used in our experiments are listed in Tables 1 and 2. In
Table 3 we list the I/O complexities of the Dijkstra implementations we evaluated.
For the purpose of comparison we include the following Dijkstra implementation which was used
as the benchmark solver for the “9th DIMACS Implementation Challenge – Shortest Paths” [1].
Dijkstra-Buckets. An implementation of Dijkstra’s algorithm with a heap based on a bucketing
structure. This algorithm works only on graphs with integer edge-weights.
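For illustration, a textbook bucket-based variant (Dial's algorithm) is sketched below; the actual DIMACS benchmark solver is a more sophisticated implementation that also employs a caliber heuristic, so this sketch only conveys the bucketing idea.

```cpp
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Textbook bucket-based Dijkstra (Dial's algorithm) for integer edge
// weights in [1, C].  Distances are kept in buckets indexed modulo C+1;
// since every weight is >= 1, relaxations never land in the bucket
// currently being drained.
std::vector<long> dijkstra_buckets(
    const std::vector<std::vector<std::pair<int, int>>>& adj, int s, int C) {
    const long INF = std::numeric_limits<long>::max();
    std::vector<long> d(adj.size(), INF);
    std::vector<std::vector<int>> bucket(C + 1);
    long remaining = 1;                 // entries currently sitting in buckets
    d[s] = 0;
    bucket[0].push_back(s);
    for (long t = 0; remaining > 0; ++t) {        // t sweeps distances upward
        auto& B = bucket[t % (C + 1)];
        for (std::size_t i = 0; i < B.size(); ++i) {
            int u = B[i];
            --remaining;
            if (d[u] != t) continue;              // stale entry: u improved later
            for (auto [v, w] : adj[u])
                if (d[u] + w < d[v]) {            // relax edge (u, v)
                    d[v] = d[u] + w;
                    bucket[d[v] % (C + 1)].push_back(v);
                    ++remaining;
                }
        }
        B.clear();
    }
    return d;
}
```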
We performed our experiments on undirected Gn,m and directed power-law graphs, as well as on some
real-world graphs from the benchmark instances of the 9th DIMACS Implementation Challenge [1].
Priority Queue | Insert/Decrease-Key | Delete | Delete-Min
Standard Binary Heap [40] (worst-case bounds) | O(log N) | O(log N) | O(log N)
Two-Pass Pairing Heap [19, 30] | O(2^{2√(log log N)}) | O(log N) | O(log N)
Buffer Heap [12] (cache-oblivious) | O((1/B)·log(N/M)) | O((1/B)·log(N/M)) | O((1/B)·log(N/M))

Table 1: Amortized I/O bounds for heaps with Decrease-Keys (N = # items in queue, B = block size,
M = size of the cache/internal-memory).
Priority Queue | Insert/Delete-Min
Bottom-up Binary Heap [39] (worst-case bounds) | O(log_2 N)
Aligned 4-ary Heap [26] (worst-case bounds) | O(log_4 N)
Sequence Heap [33] (cache-aware) | O((1/B)·log_k(N/l) + 1/k + (log k)/l)
Auxiliary Buffer Heap (cache-oblivious, this paper) | O((1/B)·log_{M/B}(N/B))

Table 2: Amortized I/O bounds for heaps without Decrease-Keys (N = # items in queue, B =
block size, M = size of the cache/internal-memory, k = Θ(M/B) and l = Θ(M)).
− When the computation was out-of-core, and the graph was not too dense, the external-
memory implementation of Dijkstra’s algorithm (Dijkstra-Ext) with cache-oblivious buffer
heap and auxiliary buffer heap performed the fewest block transfers.
− For high-diameter graphs of geometric nature such as real-world road networks, Dijkstra’s
algorithm with traditional heaps performed almost as well as any cache-efficient heap when
the computation was in-core.
Organization of the Paper. In Section 2 we give an overview of the heaps we used. We discuss
our experimental setup in Section 3, and in Section 4 we discuss our experimental results.
Sequence Heap: The sequence heap is a cache-aware heap developed in [33]. It is based on k-way
merging for some appropriate k. When the cache is fully associative, k is chosen to be Θ(M/B), and
for some l = Θ(M) and R = log_k(N/l), it can perform N Insert and up to N Delete-Min operations
in 2R/B + O(1/k + (log k)/l) amortized cache-misses and O(log N + log R + log l + 1) amortized time each.
For α-way set-associative caches, k is reduced by O(B^{1/α}). We used a highly optimized version of
the sequence heap implemented by Peter Sanders [33], and used k = 128 as suggested in [33].
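For a rough sense of scale (with illustrative numbers of our own choosing, not measurements), the number of merge levels R stays tiny:

```latex
% Illustration only: assume k = 128, N = 2^{22} items and l = 2^{17}.
R = \left\lceil \log_k \frac{N}{l} \right\rceil
  = \left\lceil \log_{128} 2^{5} \right\rceil
  = \left\lceil \tfrac{5}{7} \right\rceil = 1,
% i.e., a single level of 128-way merging suffices for an N of this size.
```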
Implementation | Base Routine | Priority Queue(s) | I/O Complexity
Bin-Dij | Dijkstra-Dec | Standard Binary Heap | O(m + (n + D)·log n)
Pair-Dij | Dijkstra-Dec | Two-Pass Pairing Heap | O(m + n·log n + D·2^{2√(log log n)})
BH-Dij | Dijkstra-Dec | Buffer Heap | O(m + ((n + D)/B)·log(n + D))
SBin-Dij | Dijkstra-NoDec | Binary Heap w/o Dec-Key | O(m + (n + D)·log(n + D))
FBin-Dij | Dijkstra-NoDec | Bottom-up Binary Heap | O(m + (n + D)·log(n + D))
Al4-Dij | Dijkstra-NoDec | Aligned 4-ary Heap | O(m + (n + D)·log(n + D))
Seq-Dij | Dijkstra-NoDec | Sequence Heap | O(m + ((n + D)/B)·log_k((n + D)/l) + (n + D)·(1/k + (log k)/l))
AH-Dij | Dijkstra-NoDec | Auxiliary Buffer Heap | O(m + ((n + D)/B)·log(n + D))
Dual-Dij (undirected graphs only) | Dijkstra-Ext | Buffer Heap & Auxiliary Buffer Heap | O(n + ((n + m)/B)·log m)
DIMACS-Dij (integer edge-weights) | Dijkstra-Buckets | Buckets + Caliber Heuristic | O(m + n) (expected)

Table 3: Different implementations of Dijkstra's algorithm evaluated in this paper, where D (≤ m) is the number
of Decrease-Keys performed by Dijkstra-Dec, B is the block size, k = Θ(M/B) and l = Θ(M).
Auxiliary Buffer Heap: The auxiliary buffer heap is a streamlined version of the buffer heap
that supports only Insert and Delete-Min. First, we use this implementation in experiments that
attempt to measure the improvement (if any) in the performance of Dijkstra's SSSP algorithm if we
remove the overhead of implementing Decrease-Key from a given priority queue and use it in
Dijkstra-NoDec. Second, this version is easier to implement and is likely to have much lower
overhead in practice than the optimal version. Other optimal cache-oblivious priority queues
supporting only Insert and Delete-Min [4, 6] also appear to be more complicated to implement
than the streamlined non-optimal auxiliary buffer heap.
Major features of the streamlined auxiliary buffer heap we implemented are as follows.
No Selection. There are log N levels, and the contents of the buffers in each level are kept sorted by key
value instead of by element id and time stamp as in the buffer heap. As a result, during the redistribution
step we do not need a selection algorithm.
Insertion and Delete-Min Buffers. Newly inserted elements are collected in a small insertion
buffer. A small delete-min buffer holds the smallest few elements in the heap (excluding the insertion
buffer) in sorted order. Whenever the insertion buffer overflows or a Delete-Min operation needs
to be performed, appropriate elements from the buffer are moved to the heap. An underflowing
delete-min buffer is filled up with the smallest elements from the heap.
Efficient Merge. We use the optimized 3-way merge technique described in [33] for merging an
update buffer with the (at most two) segments of the update buffer in the next higher level.
Less Space. It uses less space than the buffer heap since it does not store timestamps with each element.
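The following much-simplified C++ sketch shows only the mechanics of the insertion and delete-min buffers described above; the multi-level sorted buffers behind them are collapsed into a single sorted array, and all names are illustrative rather than taken from our implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Much-simplified sketch of the insertion and delete-min buffer mechanics.
// The log N levels of sorted update buffers are collapsed here into one
// sorted array ('rest'), so only the buffering idea is illustrated.
class AuxBufferHeapSketch {
    static constexpr std::size_t CAP = 64;  // illustrative buffer capacity
    std::vector<double> ins_buf;  // unsorted, newly inserted keys
    std::vector<double> rest;     // stand-in for the sorted levels below

    void flush_insertions() {
        // Real structure: sort ins_buf and merge it into level 0.
        rest.insert(rest.end(), ins_buf.begin(), ins_buf.end());
        std::sort(rest.begin(), rest.end());
        ins_buf.clear();
    }
public:
    void insert(double key) {
        ins_buf.push_back(key);
        if (ins_buf.size() >= CAP) flush_insertions();  // overflow: push down
    }
    double delete_min() {  // precondition: heap is non-empty
        flush_insertions();       // the minimum may sit in the insertion buffer
        // Real structure: an underflowing delete-min buffer is refilled
        // with the smallest elements from the levels below.
        double m = rest.front();
        rest.erase(rest.begin());
        return m;
    }
};
```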
Table 3 lists the I/O complexities of the different implementations of Dijkstra's algorithm that we ran.
3 Experimental Set-up
We ran our experiments on the following two machines.
The seek times for reads and writes were 4.5 ms and 5.0 ms, respectively. The maximum data transfer rate
(to/from media) was 106.9 MB/s. All experiments were run on a single processor.
We used the Cachegrind profiler [35] for simulating cache effects.
We implemented all algorithms in C++ using a uniform programming style, and compiled using
the g++ 3.3.4 compiler with optimization level -O3.
For out-of-core experiments we used STXXL library version 0.9. The STXXL library [15, 16] is
an implementation of the C++ standard template library STL for external-memory computations,
and is used primarily for experimentation with huge data sets. The STXXL library maintains its
own fully associative cache in RAM with pages from the disk. We compiled STXXL with direct
I/O turned on, which ensures that the OS does not cache the data read from or written to the hard
disk. We also configured STXXL (more specifically, the STXXL vectors) to use LRU paging.
We store the entire graph in a single vector so that the total amount of internal-memory available
to the graph during out-of-core computations can be regulated by changing the STXXL parameters
of the vector. The initial portion of the vector stores information on the vertices in increasing order
of vertex id (each vertex is assumed to have a unique integer id from 1 to n) and the remaining
portion stores the adjacency lists of the vertices in the same order. We store two pieces of information
for each vertex: its distance value from the source vertex and a pointer to its adjacency list. For
SSSP algorithms based on internal-memory heaps with Decrease-Keys (e.g., standard binary heap
and pairing heap) we also store the pointer returned by the heap when the vertex is inserted into
it for the first time. This pointer is used by all subsequent Decrease-Key operations performed on
the vertex. For each edge in the adjacency list of a vertex we store the other endpoint of the edge
and the edge-weight. Each undirected edge (u, v) is stored twice: once in the adjacency list of u
and again in the adjacency list of v. For each graph we use a one-time preprocessing step that puts
the graph in the format described above.
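A sketch of this layout is given below. The record types are illustrative (the field names are ours and the actual code differs in details); STXXL's VECTOR_GENERATOR is its documented way of instantiating an external vector.

```cpp
#include <stxxl/vector>

// Illustrative sketch (ours, not the actual code) of the single STXXL
// vector holding the graph.  The first n cells are vertex records in
// increasing id order; the remaining cells hold the adjacency lists in
// the same order, with each undirected edge stored twice.
struct Cell {
    union {
        struct {                   // vertex record
            double    dist;        // tentative distance from the source
            long long adj_start;   // index of this vertex's adjacency list
        } vertex;
        struct {                   // adjacency-list entry
            long long other_end;   // the other endpoint of the edge
            double    weight;      // edge weight
        } edge;
    };
};

// The internal memory available to the graph is regulated through the
// vector's paging parameters (page size, number of cached pages, pager).
typedef stxxl::VECTOR_GENERATOR<Cell>::result GraphVector;
```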
Graph Classes. We ran our experiments on three graph classes obtained through the 9th DIMACS
Implementation Challenge [1]: Two synthetic classes, undirected Gn,m (PR) [31] and directed power-
law graphs (GT) [5], and the real-world class of undirected U.S. road networks [34]. In Appendices
D and E we present experimental results on the following additional (undirected) graph classes:
regular [31], grid [31], geometric [31] and layered [2]. A more detailed description of these graphs is
included in Appendix C.
4 Experimental Results
We present a detailed description of our experimental results on Intel Xeon, and summarize our
results on AMD Opteron in Section 4.6. Unless specified otherwise, all experimental results pre-
sented in this section are averages of three independent runs from three random source vertices on
a randomly chosen graph from the graph class under consideration, and they do not include the
cost of the one-time preprocessing step that puts the graph in the format mentioned in Section 3.
Figure 1: In-core performance of algorithms on Gn,m with fixed average degree 8 (on Intel P4 Xeon).
Figure 2: In-core performance of algorithms on Gn,m with m fixed to 4 million (on Intel P4 Xeon).
4.1.1 Gn,m with Fixed Average Degree
Running Times. Figures 1(a) and 1(b) show that as n was varied from 2^15 to 2^22, all Dijkstra-
NoDec implementations (i.e., AH-Dij, FBin-Dij, SBin-Dij, Al4-Dij and Seq-Dij) ran at least 1.4
times faster than any Dijkstra-Dec implementation (i.e., BH-Dij, Bin-Dij and Pair-Dij). We
investigate this observation in more detail in Section 4.2.
Among all implementations, Seq-Dij consistently ran the fastest, while AH-Dij was consistently
faster than the remaining implementations. Seq-Dij ran around 25% faster than AH-Dij. FBin-Dij,
SBin-Dij, Al4-Dij and DIMACS-Dij ran at almost the same speed, and were consistently 25% slower
than AH-Dij. BH-Dij was the fastest among Dijkstra-Dec for n ≥ 128, and ran up to 25% faster
than the remaining two. The slowest of all implementations was Bin-Dij.
Cache Performance. Figures 1(c) and 1(d) plot the L2 cache misses incurred by different imple-
mentations (except Dual-Dij). As expected, cache-aware Seq-Dij incurred the fewest cache-misses
followed by cache-oblivious AH-Dij. The cache-oblivious BH-Dij incurred more cache-misses than
AH-Dij, but fewer than any Dijkstra-Dec implementation.
As n grows larger, the cache performance of BH-Dij degrades with respect to Bin-Dij and Pair-Dij,
which can be explained as follows. All Dijkstra-Dec implementations perform exactly the same
number of Decrease-Key operations (our experimental results suggest that this number is ≈ 0.8n
for Gn,m with average degree 8). The flat-memory priority queues we used support Decrease-Key
operations more efficiently on average than in the worst case, which is not the case with the buffer
heap. Hence the cache performance of BH-Dij degrades as a whole with respect to that of Bin-Dij
and Pair-Dij as n increases. Surprisingly, however, Figure 1(b) shows that the running time of BH-
Dij improves with respect to most other implementations as n increases. We believe this happens
because of the prefetchers in the Intel Xeon. As the operations of the buffer heap involve only
sequential scans, it benefits more from the prefetchers than the internal-memory heaps do. The
Cachegrind profiler does not take hardware prefetching into account, and as a result Figures 1(c)
and 1(d) fail to reveal its impact. The same phenomenon is seen to a lesser extent for AH-Dij.
4.1.2 Gn,m with Fixed Number of Edges
Figures 2(a) and 2(b) plot running times as the number of vertices is increased from 2500 to 1
million while keeping m fixed at 4 million (i.e., the average degree decreases from 1600 down to
4). As before, cache-aware Seq-Dij consistently ran the fastest, followed by the cache-oblivious AH-
Dij. When the graph was sparse, all implementations based on Dijkstra-NoDec ran significantly
faster than any implementation based on Dijkstra-Dec, but this performance gap narrowed as
the graph became denser. As in Section 4.1.1, for Dijkstra-Dec the cache-oblivious BH-Dij ran
faster than the flat-memory implementations (i.e., Bin-Dij & Pair-Dij). As the average degree of the
graph decreased to 160, the performance of the DIMACS-Dij solver degraded significantly, but past
that point its performance improved dramatically.
Remarks on Dijkstra-Ext. Dual-Dij ran considerably slower than other implementations as it
performs significantly more heap operations compared to any of them (see Table 7 in the Appendix).
For example, on Gn,m with average degree 8, Dual-Dij consistently performed at least 6 times more
heap operations and ran at least 6 times slower than any other implementation.
4.2 Comparison between Dijkstra-Dec and Dijkstra-NoDec Implementations

[Figure 3 appears here; its panels are described in the caption below.]
Figure 3: Plots (a) and (b) compare runtimes of Bin-Dij & BH-Dij, and SBin-Dij & AH-Dij, respectively,
on Gn,1000000 as n varies. Plots (c) and (d) show the avg. number of cycles per heap operation performed
by Bin-Dij and SBin-Dij on G250000,1000000 and G2500,1000000 , respectively. Plot (e) shows the ratio of the
number of Decrease-Key and Insert operations performed by Dijkstra-Dec on Gn,1000000 as n varies.
We now compare the total heap-operation cost incurred by Dijkstra-Dec and Dijkstra-NoDec.
Let the heap used by Dijkstra-Dec support each Insert, Delete-Min and Decrease-Key operation in
c_ins, c_del and c_dec clock cycles (avg.), respectively, and let the heap for Dijkstra-NoDec support
each Insert and Delete-Min operation in c′_ins and c′_del clock cycles (avg.), respectively. Then, if ∆
is the number of extra clock cycles Dijkstra-Dec requires compared to Dijkstra-NoDec to perform
all heap operations, ∆ = n(c_ins − c′_ins) + n(c_del − c′_del) + D(c_dec − c′_ins − c′_del).
In Figures 3(c) and 3(d), we compare empirically obtained (using Callgrind [35]) values of c_ins,
c_del and c_dec with those of c′_ins and c′_del for G250000,1000000 and G2500,1000000, respectively. In Figure
3(e), we plot the empirical ratio of the number of Decrease-Key operations (= D) to the number of
Insert operations (≈ n) performed by Dijkstra-Dec on Gn,1000000 for different values of n. Now for
G250000,1000000, from Figure 3(e) we obtain D ≈ 0.32n, and using this value we obtain from Figure
3(c), ∆ = 990n − 350D ≈ 220 × 10^6. Therefore, SBin-Dij indeed spends fewer clock cycles on
heap operations than Bin-Dij in this case, and Figures 3(a) and 3(b) show that this translates into
a better overall running time for SBin-Dij. Similarly, for G2500,1000000, we obtain ∆ = 330n − 540D
from Figure 3(d), and D ≈ 4.3n from Figure 3(e). Thus ∆ ≈ −5 × 10^6, and hence Bin-Dij will
perform better than SBin-Dij in this case. Figures 3(a) and 3(b) confirm this conclusion.
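The arithmetic behind these two conclusions, written out explicitly:

```latex
% Worked instances of the formula for \Delta, using values read off
% Figures 3(c)-(e).
% G_{250000,1000000}:  n = 250{,}000 and D \approx 0.32n, so
\Delta = 990n - 350D \approx (990 - 350 \cdot 0.32)\, n = 878n
       \approx 2.2 \times 10^{8} > 0 \quad\text{(SBin-Dij wins)};
% G_{2500,1000000}:  n = 2500 and D \approx 4.3n, so
\Delta = 330n - 540D \approx (330 - 540 \cdot 4.3)\, n = -1992n
       \approx -5 \times 10^{6} < 0 \quad\text{(Bin-Dij wins)}.
```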
More experimental results on the relative performance of Bin-Dij and SBin-Dij can be found in
the undergraduate honors thesis of Mo Chen [10].
The relative performance of BH-Dij and AH-Dij follows a trend similar to that of Bin-Dij and
SBin-Dij (see Figures 3(a) and 3(b)), which can be explained similarly.
[Figure: In-core performance of algorithms on power-law graphs (on Intel P4 Xeon); panels: (a) absolute runtimes with m/n = 4; (b) runtimes (w.r.t. SBin-Dij) with m/n = 4; (c) absolute runtimes with m = 1 million; (d) runtimes (w.r.t. SBin-Dij) with m = 1 million.]
[Figure: Out-of-core performance of algorithms on Gn,m with m = 2 million (B = 4 KB and M = 4 MB).]
[Table: Running time in milliseconds (on Intel P4 Xeon) per region for Bin-Dij, SBin-Dij, FBin-Dij, Pair-Dij, Al4-Dij, Seq-Dij, BH-Dij, AH-Dij and DIMACS-Dij.]
References
[1] 9th DIMACS Implementation Challenge — Shortest Paths, 2006.
http://www.dis.uniroma1.it/~challenge9/.
[2] D. Ajwani, R. Dementiev, and U. Meyer. Graph generators for external memory BFS algorithms.
url: http://www.mpi-sb.mpg.de/~ajwani/graph_gen/.
[3] L. Allulli, P. Lichodzijewski, and N. Zeh. A faster cache-oblivious shortest-path algorithm for
undirected graphs with bounded edge lengths. In Proceedings of the 18th Annual ACM-SIAM
Symposium on Discrete Algorithms, pages 910–919, New Orleans, Louisiana, 2007.
[4] L. Arge, M. A. Bender, E. D. Demaine, B. Holland-Minkley, and J. I. Munro. Cache-oblivious
priority queue and graph algorithm applications. In Proceedings of the 34th ACM Symposium
on Theory of Computing, 2002.
[5] D. Bader and K. Madduri. GTgraph: A suite of synthetic graph generators. url: http://www-
static.cc.gatech.edu/~kamesh/GTgraph/.
[6] G. S. Brodal and R. Fagerberg. Funnel heap – a cache oblivious priority queue. In Proceedings
of the 13th Annual International Symposium on Algorithms and Computation, LNCS 2518,
Vancouver, BC, Canada. Springer-Verlag.
[7] G.S. Brodal, R. Fagerberg, U. Meyer, and N. Zeh. Cache-oblivious data structures and al-
gorithms for undirected breadth-first search and shortest paths. In Proceedings of the 3rd
Scandinavian Workshop on Algorithm Theory, pages 480–492, Humlebæk, Denmark, July 2004.
[8] A. L. Buchsbaum, M. Goldwasser, S. Venkatasubramanian, and J. R. Westbrook. On external
memory graph traversal. In Proc. 11th ACM-SIAM SODA, pages 859–860, 2000.
[9] D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-MAT: A recursive model for graph mining. In
Proc. 4th SIAM Intl Conf on Data Mining, Orlando, Florida, 2004.
[10] M. Chen. Measuring and improving the performance of cache-efficient priority queues in
Dijkstra's algorithm, 2007. Undergraduate Honors Thesis, CS-HR-07-36, Univ of Texas, Austin,
Dept of Comp Sci.
[11] R. A. Chowdhury and V. Ramachandran. Cache-oblivious dynamic programming. In Proc.
17th ACM-SIAM SODA.
[12] R. A. Chowdhury and V. Ramachandran. Cache-oblivious shortest paths in graphs using buffer
heap. In Proc. 16th ACM SPAA.
[14] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein. Introduction to Algorithms. The MIT
Press, second edition, 2001.
[16] R. Dementiev, L. Kettner, and P. Sanders. STXXL: Standard template library for XXL data
sets. In Proc. 13th ESA, LNCS 1004, pages 640–651. Springer-Verlag, 2005.
[17] E.W. Dijkstra. A note on two problems in connection with graphs. Numerische Mathematik,
1:269–271, 1959.
[18] P. Erdös and A. Rényi. On the evolution of random graphs. Mat. Kutató Int. Közl., 5:17–60,
1960.
[19] M. L. Fredman, R. Sedgewick, D. D. Sleator, and R. E. Tarjan. The pairing heap: A new form
of self-adjusting heap. Algorithmica, 1:111–129, 1986.
[20] M.L. Fredman and R.E. Tarjan. Fibonacci heaps and their use in improved network optimiza-
tion algorithms. Journal of the ACM, 34:596–615, 1987.
[22] C. A. R. Hoare. Algorithm 63 (PARTITION) and algorithm 65 (FIND). Comm. ACM, 4(7):321–
322, 1961.
[24] I. Katriel and U. Meyer. Elementary graph algorithms in external memory. In U. Meyer, P.
Sanders, and J.F. Sibeyn, editors, Alg. for Mem. Hierarchies, LNCS 2625. Springer-Verlag.
[25] V. Kumar and E. Schwabe. Improved algorithms and data structures for solving graph problems
in external memory. In Proc. 8th IEEE SPDP, pages 169–177, 1996.
[26] A. LaMarca and R. Ladner. The influence of caches on the performance of heaps. Journal of
Experimental Algorithmics, 1:4, 1996.
[27] D. Lan Roche. Experimental study of high performance priority queues, 2007. Undergraduate
Honors Thesis, CS-TR-07-34, The University of Texas at Austin, Department of Computer
Sciences.
[28] U. Meyer and N. Zeh. I/O-efficient undirected shortest paths. In Proceedings of the 11th
European Symposium on Algorithms, LNCS 2832, pages 434–445. Springer-Verlag, 2003.
[29] B. M. E. Moret and H. D. Shapiro. An empirical assessment of algorithms for constructing a
minimum spanning tree. In DIMACS Series Discrete Math and Theor Comp Sci. 1994.
[30] S. Pettie. Towards a final analysis for pairing heaps. In Proc. 46th FOCS, pages 174–183, 2005.
[31] S. Pettie and V. Ramachandran. Command line tools generating various families of random
graphs. url: http://www.dis.uniroma1.it/~challenge9/code/Randgraph.tar.gz.
[32] S. Pettie and V. Ramachandran. A shortest path algorithm for real-weighted undirected graphs.
SIAM Jour. on Comput, 34:1398–1431, 2005.
[33] P. Sanders. Fast priority queues for cached memory. Jour. Exp. Algorithmics, 5:1–25, 2000.
[34] P. Sanders and D. Schultes. United states road networks (tiger/line). Data Source: U.S. Census
Bureau, Washington, DC, url: http://www.dis.uniroma1.it/~challenge9/data/tiger/.
[35] J. Seward and N. Nethercote. Valgrind (debugging and profiling tool for x86-Linux programs).
url: http://valgrind.kde.org/index.html.
[36] J. T. Stasko and J. S. Vitter. Pairing heaps: experiments and analysis. Comm. ACM, 30:234–
249, 1987.
[37] M. Thorup. Undirected single-source shortest paths with positive integer weights in linear time.
Journal of the ACM, 46:362–394, 1999.
[38] L. Tong. Implementation and experimental evaluation of the cache-oblivious buffer heap, 2006.
Undergraduate Honors Thesis, CS-TR-06-21, Univ of Texas, Austin, Dept of Comp Sci.
Appendix
Function B.1. Dijkstra-Dec( G, w, s, d )
{Dijkstra's SSSP algorithm [17] with a heap that supports Decrease-Keys}
1. perform the following initializations:
   (i) Q ← ∅
   (ii) for each v ∈ V[G] do d[v] ← +∞
   (iii) Insert(Q) ( s, 0 )
2. while Q ≠ ∅ do …

Function B.2. Dijkstra-NoDec( G, w, s, d )
{Dijkstra's SSSP algorithm [17] with a heap that does not support Decrease-Keys}
1. perform the following initializations:
   (i) Q ← ∅
   (ii) for each v ∈ V[G] do d[v] ← +∞
   (iii) Insert(Q) ( s, 0 )
2. while Q ≠ ∅ do …
This is a factor of Θ(log n) improvement over the I/O complexity of Dijkstra's algorithm with a Fibonacci
heap, provided the graph is very sparse (i.e., m = O(n)) and B ≫ log n, which typically holds for
memory levels deeper in the hierarchy such as the disk.
Function B.3. Dijkstra-Ext( G, w, s, d )
{Kumar & Schwabe's external-memory implementation of Dijkstra's algorithm [25]}
1. …
2. while Q ≠ ∅ do …
Dijkstra-Ext Ends
Figure 7: Given an undirected graph G with vertex set V [G] (each vertex is identified with a unique
integer in [1, |V [G]|]), edge set E[G], a weight function w : E[G] → ℜ and a source vertex s ∈ V [G],
this function computes the shortest distance from s to each vertex v ∈ V [G] and stores it in d[v].
Dijkstra-Ext performs Decrease-Key operations even for already-settled vertices
on the primary heap Q, but uses a mechanism to remove those spurious operations from Q using an auxiliary
heap Q′ (see Figure 7). This mechanism eliminates the need for identifying settled vertices directly
and thus saves Θ(m) cache misses. The auxiliary heap only needs to support Insert and Delete-Min
operations. The algorithm performs m Decrease-Key operations and about n + m Delete operations
on Q, and about m Insert and Delete-Min operations each on Q′.
The algorithm can be used to solve the SSSP problem on undirected graphs cache-obliviously in
O(n + (m/B)·log m) I/O operations by replacing Q with a buffer heap and Q′ with an auxiliary buffer
heap [12].
In [12] we show how to implement Dijkstra's SSSP algorithm for directed graphs cache-obliviously
in O((n + m/B)·log(n/B)) I/Os under the tall-cache assumption. The implementation requires one
additional data structure, the buffered repository tree [8], for remembering settled vertices. Since
we performed our experiments mainly on sparse graphs, for which the implementations in Sections B.1
and B.2 give better bounds, we have not considered this implementation.
Number of Priority Queue Operations Performed
Implementation | Insert | Decrease-Key | Delete/Delete-Min | Total
Dijkstra-Dec | n | D | n | 2n + D
Dijkstra-NoDec | n + D | none | n + D | 2n + 2D
Dijkstra-Ext, Primary (Q) | none | 2m | n + 2m | n + 4m
Dijkstra-Ext, Auxiliary (Q′) | 2m | none | 2m | 4m
Dijkstra-Ext, Total (Q & Q′) | 2m | 2m | n + 4m | n + 8m

Table 7: Number of heap operations performed by different implementations of
Dijkstra's algorithm, where D (≤ m) is the number of Decrease-Keys performed by
Dijkstra-Dec.
Directed Power-Law Graphs (GT [5]). These graphs are produced by the GT-generator, which is
based on the R-MAT graph model [9]. The model has four non-zero parameters a, b, c and d with a + b + c + d = 1.
Given the number of vertices n and the number of edges m, the GT-generator starts off with an
empty n × n adjacency matrix for the graph, recursively divides the matrix into four quadrants,
and distributes the edges to the top-left, top-right, bottom-left and bottom-right quadrants with
probabilities a, b, c and d, respectively. It has been conjectured in [9] that many real-world graphs
have a : b ≈ 3 : 1, a : c ≈ 3 : 1 and a ≥ d, and accordingly we have used a = 0.45, b = c = 0.15
and d = 0.25, which are also the default values used by the GT-generator. The resulting graph is
directed.
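A compact sketch of this recursive quadrant selection follows (assuming n is a power of two; duplicate edges and self-loops are not filtered, and the actual GT-generator differs in details).

```cpp
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

// Sketch of R-MAT edge placement with the quadrant probabilities used
// above (a = 0.45, b = 0.15, c = 0.15, d = 0.25).  Assumes n is a power
// of two; duplicate edges and self-loops are not filtered.
std::vector<std::pair<std::uint64_t, std::uint64_t>>
rmat_edges(std::uint64_t n, std::uint64_t m, std::mt19937_64& rng) {
    const double a = 0.45, b = 0.15, c = 0.15;       // d = 0.25 is implied
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    std::vector<std::pair<std::uint64_t, std::uint64_t>> edges;
    for (std::uint64_t e = 0; e < m; ++e) {
        std::uint64_t u = 0, v = 0;
        // Repeatedly pick a quadrant of the (sub)matrix until one cell is left.
        for (std::uint64_t half = n >> 1; half > 0; half >>= 1) {
            double p = coin(rng);
            if (p < a)              { /* top-left: keep both offsets */ }
            else if (p < a + b)     { v += half; }               // top-right
            else if (p < a + b + c) { u += half; }               // bottom-left
            else                    { u += half; v += half; }    // bottom-right
        }
        edges.emplace_back(u, v);   // directed edge u -> v
    }
    return edges;
}
```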
Undirected U.S. Road Networks ([34]). These are undirected weighted graphs representing
the road networks of 50 U.S. states and the District of Columbia. Edge weights are given both
as the spatial distance between the endpoints (i.e., the great circle distance in meters between the
endpoints) and as the travel time between them (i.e., spatial distance divided by some average speed
that depends on the road category). Merging all networks produces a graph containing about 24
million nodes and 29 million edges.
In Appendices D and E we present experimental results on the following additional graph classes.
All the following graphs are undirected.
Regular Graphs (PR [31]). A graph is d-regular if all its vertices have the same degree d.
Grid Graphs (PR [31]). We used √n × √n grid graphs with uniformly distributed edge-weights.
Geometric Graphs (PR [31]). This graph class is a natural alternative to the classical Gn,m
class. In contrast to Gn,m, where each edge is chosen independently and with equal probability, a
random geometric graph is constructed by randomly distributing a set of vertices over some metric
space and then connecting two vertices with an edge if the distance between them is sufficiently
small. Given the number of vertices n, we used the PR-generator to generate a random geometric
graph by distributing these n points randomly in the unit square and connecting two vertices if
they are within distance 1.5/√n. The average degree of the resulting graph is about 7. The weight of
an edge connecting two vertices corresponds to the distance between them.
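A naive sketch of this construction (the PR-generator is far more efficient; the O(n^2) pair scan below is only for illustration):

```cpp
#include <cmath>
#include <random>
#include <utility>
#include <vector>

// Naive sketch of the geometric graph construction: n points uniform in
// the unit square; an undirected edge (stored twice) joins two points at
// Euclidean distance below 1.5/sqrt(n), weighted by that distance.
std::vector<std::vector<std::pair<int, double>>>
random_geometric(int n, std::mt19937_64& rng) {
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    std::vector<std::pair<double, double>> pt(n);
    for (auto& p : pt) p = {unit(rng), unit(rng)};
    std::vector<std::vector<std::pair<int, double>>> adj(n);
    const double r = 1.5 / std::sqrt(static_cast<double>(n));
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j) {
            double dx = pt[i].first - pt[j].first;
            double dy = pt[i].second - pt[j].second;
            double dist = std::sqrt(dx * dx + dy * dy);
            if (dist < r) {
                adj[i].push_back({j, dist});
                adj[j].push_back({i, dist});
            }
        }
    return adj;
}
```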
Layered Graphs (MPI [2]). A d-layer random graph on n nodes consists of d + 1 levels, with level
0 containing a single node (called the source node) and each of the remaining d levels containing
exactly (n − 1)/d nodes (assuming d divides n − 1) [2]. The source node at level 0 is connected to every
node at level 1, and for 1 ≤ i < d each node at level i is connected to 3 random nodes at level
i + 1. The nodes in each level are connected with a Hamiltonian cycle. All edges are directed and
inter-level edges are directed from lower to higher levels.
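A sketch of this construction (edge list only, weights omitted; q = (n − 1)/d and q ≥ 2 are assumed):

```cpp
#include <random>
#include <utility>
#include <vector>

// Sketch of the d-layer random graph: vertex 0 is the source; level i
// (1 <= i <= d) holds the q = (n-1)/d vertices [1+(i-1)q, iq].
std::vector<std::pair<int, int>> layered_edges(int n, int d,
                                               std::mt19937_64& rng) {
    const int q = (n - 1) / d;
    auto first = [q](int lvl) { return 1 + (lvl - 1) * q; };
    std::uniform_int_distribution<int> pick(0, q - 1);
    std::vector<std::pair<int, int>> E;
    for (int v = first(1); v < first(1) + q; ++v)     // source -> all of level 1
        E.push_back({0, v});
    for (int lvl = 1; lvl < d; ++lvl)                 // 3 random next-level targets
        for (int v = first(lvl); v < first(lvl) + q; ++v)
            for (int j = 0; j < 3; ++j)
                E.push_back({v, first(lvl + 1) + pick(rng)});
    for (int lvl = 1; lvl <= d; ++lvl)                // Hamiltonian cycle per level
        for (int v = first(lvl); v < first(lvl) + q; ++v)
            E.push_back({v, v + 1 < first(lvl) + q ? v + 1 : first(lvl)});
    return E;
}
```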
D Additional Experimental Results on Intel P4 Xeon
[Figure 8: In-core performance of algorithms on d-regular graphs (on Intel P4 Xeon); panels show absolute runtimes and runtimes w.r.t. SBin-Dij, including n = 2^20 ≈ 1 million.]
[Figure 9: In-core performance of algorithms on √n × √n grid graphs (on Intel P4 Xeon).]
Figure 10: In-core performance of algorithms on geometric graphs in a unit square connecting all nodes
within distance 1.5/√n (on Intel P4 Xeon).
[Figure 11 shows absolute runtimes and runtimes w.r.t. SBin-Dij on layered graphs, with log2 n layers and with n = 2^20 ≈ 1 million.]
Figure 11: In-core performance of algorithms on layered graphs (on Intel P4 Xeon).
E Experimental Results on AMD Opteron
[Figure 12 shows absolute runtimes and runtimes w.r.t. SBin-Dij on Gn,m, with m/n = 8 and with m = 4 million.]
Figure 12: In-core performance of algorithms on Gn,m (on AMD Opteron 250).
[Figure 13 shows absolute runtimes and runtimes w.r.t. SBin-Dij on power-law graphs, with m/n = 4 and with m = 1 million.]
Figure 13: In-core performance of algorithms on power-law graphs (on AMD Opteron 250).
[Figure 14 shows absolute runtimes and runtimes w.r.t. SBin-Dij on d-regular graphs, including n = 2^20 ≈ 1 million.]
Figure 14: In-core performance of algorithms on d-regular graphs (on AMD Opteron 250).
Figure 15: In-core performance of algorithms on √n × √n grid graphs (on AMD Opteron 250).
Figure 16: In-core performance of algorithms on geometric graphs in a unit square connecting all nodes
within distance 1.5/√n (on AMD Opteron 250).
[Figure 17 shows absolute runtimes and runtimes w.r.t. SBin-Dij on layered graphs, with log2 n layers and with n = 2^20 ≈ 1 million.]
Figure 17: In-core performance of algorithms on layered graphs (on AMD Opteron 250).