Early Experiences On Accelerating Dijkstra's Algorithm Using Transactional Memory
Dijkstra's algorithm using traditional synchronization primitives (locks and barriers), TM and HT, and evaluated them using Simics [15] and GEMS [1, 16], which allow the simulation of multicore systems and provide support for TM. Our results demonstrate that the combination of TM and HT achieves significant speedup on a hard-to-accelerate application, while requiring only a few extensions to the original source code.

The rest of the paper is organized as follows. Section 2 presents the basics of Dijkstra's algorithm and the details of the various multithreaded implementations. Section 3 demonstrates simulation results comparing the performance of the versions under consideration. Related work is presented in Section 4, while Section 5 summarizes the paper and discusses directions for future work.

Algorithm 1: Dijkstra's algorithm.
Input : Directed graph G = (V, E), weight function w : E → R+, source vertex s, min-priority queue Q
Output: shortest distance array d, predecessor array π
   /* Initialization phase */
 1 foreach v ∈ V do
 2     d[v] ← INF;
 3     π[v] ← NIL;
 4     Insert(Q, v);
 5 end
 6 d[s] ← 0;
   /* Main body of the algorithm */
 7 while Q ≠ ∅ do
 8     u ← ExtractMin(Q);
 9     foreach v adjacent to u do
10         sum ← d[u] + w(u, v);
11         if d[v] > sum then
12             DecreaseKey(Q, v, sum);
13             d[v] ← sum;
14             π[v] ← u;
15         end
16     end
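The queue operations of Algorithm 1 (Insert, ExtractMin, DecreaseKey) are typically served by a binary min-heap. Purely for reference in the discussion that follows, a minimal C sketch of DecreaseKey on an array-based heap is given below; the names and layout are illustrative, not the paper's implementation:

#define NMAX 16384            /* illustrative capacity */

/* Array-based binary min-heap keyed by tentative distance.
 * pos[v] records the slot of vertex v, so DecreaseKey finds it in O(1). */
typedef struct {
    int   vert[NMAX];         /* slot -> vertex id  */
    float key [NMAX];         /* slot -> distance   */
    int   pos [NMAX];         /* vertex id -> slot  */
    int   size;
} minheap_t;

static void swap_slots(minheap_t *h, int i, int j)
{
    int vi = h->vert[i], vj = h->vert[j];
    float ki = h->key[i];
    h->vert[i] = vj; h->key[i] = h->key[j]; h->pos[vj] = i;
    h->vert[j] = vi; h->key[j] = ki;        h->pos[vi] = j;
}

/* Lower the key of vertex v and sift it up towards the root. */
void heap_decrease_key(minheap_t *h, int v, float newkey)
{
    int i = h->pos[v];
    h->key[i] = newkey;
    while (i > 0 && h->key[(i - 1) / 2] > h->key[i]) {
        swap_slots(h, i, (i - 1) / 2);
        i = (i - 1) / 2;
    }
}

The sift-up walks a root-ward path of the heap array; two concurrent relaxations whose paths cross the same slots touch common memory, which is exactly the kind of conflict the TM-based versions discussed below rely on detecting.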
2 Parallelizing Dijkstra's algorithm
Figure 1: Execution patterns of serial and multithreaded Dijkstra's algorithm (timeline of extract-min and relaxation of outgoing edges over steps k, k+1, k+2, for Threads 1-4).

Figure 2: Speedups of lock-based parallel versions with real and perfect barriers (cgs-lock, perfbar+cgs-lock, perfbar+fgs-lock; multithreaded speedup vs. number of threads, 2-16).
vertices are relaxed in parallel, the paths of these heap traversals might share one or more common nodes. The TM system will detect the conflict, only one of the transactions will be allowed to commit, and only one vertex will be relaxed. The other conflicting threads will have to pause or repeat their work, depending on the implementation of the TM system and its conflict detection and resolution policy. This scheme is not as fine-
the next iteration. It is possible that at this time a helper thread might have updated only some of the neighbors of its vertex x_k, leaving the other ones with their old, possibly suboptimal, distances. As explained above, however, this is not a problem, since all neighbors of x_k with suboptimal distances will be correctly updated when x_k reaches the top of the priority queue.

The code executed by the main and helper threads is shown in Alg. 4 and Alg. 5, respectively. In the beginning of each iteration, the main thread extracts the top vertex from the queue. At the same time, the helper threads spin-wait until the main thread has finished the extraction, and then each one reads, without extracting, one of the top k vertices in the queue (this is what the ReadMin function does). Next, all threads relax the outgoing edges of the vertices they have undertaken in parallel. Compared to the original algorithm, a performance improvement is expected since, due to the helper threads, the main thread will evaluate the condition of line 7 as true fewer times and thus will not need to execute the operations of lines 8-9.

The proposed HT scheme is largely based on TM. Updates to the heap via the DecreaseKey function, as well as updates to the tentative distance and predecessor arrays, are enclosed within a single transaction for both the main and helper threads. This ensures atomicity of these updates, i.e., that they will be performed in an all-or-none manner. Furthermore, it guarantees that in case of conflict only one thread will be allowed to commit its transaction and perform the neighbor update. A conflict can arise when two or more threads update the same neighbor simultaneously, or when they update different neighbors but change the same part of the heap. The interruption of helper threads is implemented using transactions as well. Specifically, when the main thread has completed all the relaxations for its vertex, it sets the notification variable done to 1 within a separate transaction. This value denotes a state where the main thread proceeds to the next iteration and requires all helper threads to stop and follow, terminating any operations that they were performing on the heap. All helper threads executing transactions at this point will abort, since done is in their read sets as well. The helper threads will immediately retry their transactions, but there is a good chance that they will find done set to 1, stop examining the remaining neighbors in the inner loop, and continue with the next iteration of the outer loop. In the opposite case, where the main thread performs the ExtractMin operation too quickly, done will be set back to 0 and the helper threads will miss the last notification, continuing from the point where they had stopped. This might yield suboptimal updates to the distances of the neighbors, but as explained above, these will be overwritten once the vertices examined by the helper threads reach the top of the queue. So, correctness is guaranteed.

Figure 4: Execution pattern of the helper threads version (timeline of extract-min, read tid-th min and relaxation of outgoing edges over steps k, k+1, k+2; in-flight helper transactions are killed when the main thread advances).

Algorithm 4: Main thread's code.
Input : Directed graph G = (V, E), weight function w : E → R+, source vertex s, min-priority queue Q
Output: shortest distance array d, predecessor array π
   /* Initialization phase: same as the serial code */
 1 while Q ≠ ∅ do
 2     u ← ExtractMin(Q);
 3     done ← 0;
 4     foreach v adjacent to u do
 5         sum ← d[u] + w(u, v);
 6         Begin-Transaction
 7         if d[v] > sum then
 8             DecreaseKey(Q, v, sum);
 9             d[v] ← sum;
10             π[v] ← u;
11         End-Transaction
12     end
13     Begin-Transaction
14     done ← 1;
15     End-Transaction
16 end

Algorithm 5: Helper threads' code.
 1 while Q ≠ ∅ do
 2     while done = 1 do ;
 3     x ← ReadMin(Q, tid);
 4     stop ← 0;
 5     foreach y adjacent to x and while stop = 0 do
 6         Begin-Transaction
 7         if done = 0 then
 8             sum ← d[x] + w(x, y);
 9             if d[y] > sum then
10                 DecreaseKey(Q, y, sum);
11                 d[y] ← sum;
12                 π[y] ← x;
13         else
14             stop ← 1;
15         End-Transaction
16     end
17 end

Summarizing, the main concept of our implementation is to decouple the main thread as much as possible from the execution of the helper threads, minimizing the time that it has to spend on synchronization events or transaction aborts. The helper threads are allowed to execute in an aggressive manner, while at the same time being as unintrusive to the main thread as possible, even if they perform a notable amount of useless work. The semantics of the algorithm guarantee that any intermediate relaxations made by the helper threads are not irreversible. Finally, by using a TM system with a conflict resolution policy that favors the main thread, transaction abort overheads are mainly suffered by the helper threads.
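For concreteness, the structure of Alg. 4 and Alg. 5 can be sketched in C. The paper's experiments run on hardware TM (LogTM-SE) inside a simulator; the sketch below instead uses GCC's software TM extension (compiled with -fgnu-tm) purely as a structural analogue, and the graph and heap helpers (graph_t, edge_t, heap_extract_min, heap_read_min, heap_decrease_key) are hypothetical stand-ins, not the paper's code:

/* Structural sketch of Alg. 4 and Alg. 5 with GCC software TM (-fgnu-tm).
 * Simplifies the memory model (e.g., the spin-wait on done); all
 * graph/heap helpers are assumed, hypothetical APIs. */
typedef struct edge { int to; float w; struct edge *next; } edge_t;
typedef struct { edge_t **adj; } graph_t;

extern int   done;   /* notification flag: 1 = main thread moved on */
extern float d[];    /* tentative distances */
extern int   pred[]; /* predecessor array   */

extern int  heap_empty(void);        /* assumed min-heap API */
extern int  heap_extract_min(void);
extern int  heap_read_min(int tid);  /* read tid-th min, no extraction */
extern void heap_decrease_key(int v, float key)
            __attribute__((transaction_safe));

void main_loop(graph_t *g)                    /* Alg. 4 */
{
    while (!heap_empty()) {
        int u = heap_extract_min();
        __transaction_atomic { done = 0; }    /* release the helpers */
        for (edge_t *e = g->adj[u]; e != NULL; e = e->next) {
            float sum = d[u] + e->w;
            __transaction_atomic {            /* lines 6-11 of Alg. 4 */
                if (d[e->to] > sum) {
                    heap_decrease_key(e->to, sum);
                    d[e->to] = sum;
                    pred[e->to] = u;
                }
            }
        }
        __transaction_atomic { done = 1; }    /* aborts conflicting helpers */
    }
}

void helper_loop(graph_t *g, int tid)         /* Alg. 5 */
{
    while (!heap_empty()) {
        while (__atomic_load_n(&done, __ATOMIC_ACQUIRE) == 1)
            ;                                 /* spin until main extracts */
        int x = heap_read_min(tid);
        int stop = 0;
        for (edge_t *e = g->adj[x]; e != NULL && !stop; e = e->next) {
            __transaction_atomic {
                if (done == 0) {              /* done enters the read set */
                    float sum = d[x] + e->w;
                    if (d[e->to] > sum) {
                        heap_decrease_key(e->to, sum);
                        d[e->to] = sum;
                        pred[e->to] = x;
                    }
                } else {
                    stop = 1;                 /* main thread moved on */
                }
            }
        }
    }
}

With a conflict resolution policy that favors the main thread, the aborts triggered by the write of done = 1 land on the helpers, mirroring the behavior described above.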
3 Experimental Evaluation

3.1 Experimental setup

We evaluated the performance of the various implementations of Dijkstra's algorithm through full-system simulation, using the Wisconsin GEMS toolset v2.1 [1, 16] in conjunction with the Simics v3.0.31 [15] simulator. Simics provides functional simulation of a SPARC chip multiprocessor (CMP) that boots unmodified Solaris 10. The GEMS Ruby module provides detailed memory system simulation and, for non-memory instructions, behaves as an in-order single-issue processor, executing one instruction per simulated cycle.

Hardware TM is supported in GEMS through the LogTM-SE subsystem [20]. It is built upon a single-chip CMP system with private per-processor L1 caches and a shared L2 cache. It features eager version management, where transactions write the new memory values in place, after saving the old values in a log. It also supports eager conflict detection: conflicts, i.e., overlaps between the write set of one transaction and the write or read set of other concurrent transactions, are detected at the very moment they happen. On a conflict, the offending transaction stalls and either retries its request, hoping that the other transaction has finished, or aborts if LogTM detects a potential deadlock. The aborting processor uses its log to undo the changes it has made and then retries the transaction. In our experiments we used the HYBRID conflict resolution policy, which tends to favor older transactions over younger ones. Table 1 shows the configuration of the simulation framework.

Processor (Simics)   UltraSPARC III Cu (III+), configurations up to 32 cores
L1 caches (Ruby)     Private, 64KB, 4-way set-associative, 64B line size, 4-cycle hit latency
L2 cache (Ruby)      Unified and shared, 8 banks, 2MB, 4-way set-associative, 64B line size, 10-cycle hit latency
Memory               160-cycle access latency
TM system            HYBRID resolution policy, 2Kb HW signatures

Table 1: Simulation framework.
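As a conceptual illustration of the eager version management described above (a toy software analogue, not the GEMS/LogTM-SE interface): each transactional write first saves the old value to a log and then updates memory in place, so an abort simply replays the log backwards.

#include <stddef.h>

#define LOG_MAX 256

typedef struct {
    int   *addr[LOG_MAX];     /* logged locations      */
    int    old [LOG_MAX];     /* their previous values */
    size_t n;
} undo_log_t;

static void tx_write(undo_log_t *log, int *addr, int val)
{
    log->addr[log->n] = addr; /* save the old value first ...    */
    log->old [log->n] = *addr;
    log->n++;
    *addr = val;              /* ... then write in place (eager) */
}

static void tx_abort(undo_log_t *log)
{
    while (log->n > 0) {      /* undo in reverse order */
        log->n--;
        *log->addr[log->n] = log->old[log->n];
    }
}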
Our programs use the Pthreads library for thread creation and for synchronization operations such as spin-locking and barriers. For our perfect barriers, we encoded a global barrier as a single assembly instruction, exploiting the functionality offered by Simics magic instructions. Thus the synchronization of threads is handled by the simulator rather than the operating system, providing instant suspension/resumption of the arriving/departing threads. To avoid resource conflicts between our programs and the operating system's processes, we used CMP configurations with more processor cores than the number of threads we required. Thus, the experiments for 2 and 4 threads were performed on an 8-core CMP, while the 8-thread experiments were done on a 16-core CMP. To schedule a thread on a particular processor and avoid migrations, we used the pset_bind system call of Solaris. Finally, all codes were compiled with Sun's Studio 12 C compiler (at the -O3 optimization level).
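For reference, the thread-pinning step can be expressed with the Solaris processor-set API; a minimal sketch under the assumption of one dedicated processor set per thread (CPU ids and error handling are illustrative):

#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>
#include <sys/pset.h>
#include <stdio.h>

/* Confine the calling thread (LWP) to a single CPU via a processor set,
 * as done in the experiments to avoid thread migrations. */
int pin_current_thread(processorid_t cpu)
{
    psetid_t pset;

    if (pset_create(&pset) != 0) {            /* create an empty set */
        perror("pset_create");
        return -1;
    }
    if (pset_assign(pset, cpu, NULL) != 0) {  /* add one CPU to it   */
        perror("pset_assign");
        return -1;
    }
    /* Bind the calling LWP to the set; it will only run on that CPU. */
    if (pset_bind(pset, P_LWPID, P_MYID, NULL) != 0) {
        perror("pset_bind");
        return -1;
    }
    return 0;
}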
3.2 Reference graphs

To evaluate the different schemes we strove to work on graphs that vary in terms of density and structure. To that end, we used the GTgraph graph generator [4] to construct graphs with 10K vertices from the following families:
Random: Their m edges are constructed by choosing random pairs among the n vertices.
R-MAT: Constructed using the Recursive Matrix (R-MAT) graph model [7].
SSCA#2: Used in the DARPA HPCS SSCA#2 graph analysis benchmark [3].
Table 2 summarizes the characteristics of the graphs used. To obtain an estimate of possible speedups, we profiled the serial execution of Dijkstra's algorithm on each graph in order to calculate the distribution of the sequential (ExtractMin) and the parallelizable (DecreaseKey) parts. In the ideal case where parallel execution would manage to zero out the time spent on edge relaxations, the speedup would be 100% / (% sequential part); for example, rand2, whose sequential part is 38.7%, has an ideal speedup of 100/38.7 ≈ 2.58. This is presented in the sixth column of Table 2 and constitutes a theoretical upper bound for any performance improvement.

Graph   Edges   Parameters                   Sequential part (%)   Parallel part (%)   Ideal speedup
rand1   10K     -                            97.8                  2.2                 1.02
rand2   100K    -                            38.7                  61.3                2.58
rand3   200K    -                            26.2                  73.8                3.81
rmat1   10K     a = 0.45, b = c = 0.15,      78.3                  21.7                1.27
rmat2   100K    d = 0.25                     39.3                  60.7                2.54
rmat3   200K    (shared by all rmat graphs)  26.8                  73.2                3.73
ssca1   28K     (P, C) = (0.25, 5)           94.1                  5.9                 1.06
ssca2   118K    (P, C) = (0.5, 20)           35.2                  64.8                2.84
ssca3   177K    (P, C) = (0.5, 30)           28.9                  71.1                3.46

Table 2: Graphs used for experiments.

3.3 Results

Figure 5 shows the speedups of all evaluated schemes. Consistently with the discussion in Section 2, the concurrent thread accesses to shared data implemented with the aid of TM (cgs-tm and fgs-tm) clearly outperform the ones using traditional synchronization primitives (cgs-lock and fgs-lock). This fact reveals the existence of fine-grain parallelism in the updates of the priority queue of the algorithm, in the sense that, statistically, it is highly probable that the paths of various concurrent updates do not overlap. Thus, optimistic parallelism seems a good approach for Dijkstra's algorithm.

Nevertheless, only by employing perfect barriers can the TM-based schemes outperform the serial case. Therefore, the fine-grain parallelism exposed by the inner loop of the algorithm is insufficient to achieve significant speedups. On the contrary, the HT scheme (helper), which exploits parallelism at a coarser granularity, is able to achieve significant speedups in the majority of the cases (6 out of 9 experiments). The maximum speedup achieved is 1.46, as shown in Figure 5c.

For denser graphs, the performance improvements are greater, since more parallelism can be exposed in the inner loop of the algorithm. These are the cases where helper achieves the best speedups and scalability (Figures 5c, 5f and 5i). Conversely, sparse graphs leave limited room for parallelism, leading to low performance. They can therefore serve as test cases for the overhead of the co-existence of numerous threads.
Figure 5: Multithreaded speedups of all evaluated schemes (perfbar+cgs-lock, perfbar+cgs-tm, perfbar+fgs-lock, perfbar+fgs-tm, helper) versus number of threads (2-16), one panel per test graph (panels a-i), including rand-10000x10000, rand-10000x100000 and rand-10000x200000.
The results for these cases are shown in Figures 5a, 5d and 5g. It is obvious that the helper scheme is the most robust one, as it exhibits the smallest slowdown; in the worst case, the performance of the main thread is degraded by around 10%.

A closer look at the results reveals that the main thread suffers a very low number of aborts (less than 1% of the total aborts). This means that even when the helper threads are not contributing any useful work, they still do not obstruct the main thread's progress. Therefore, the main thread is allowed to run almost at the speed of the serial execution, which explains the robustness of the scheme. The low overhead of the helper scheme is also illustrated by the fact that the addition of more threads does not lead to performance drops in any case.
4 Related Work

A significant part of Dijkstra's execution time is spent on updates to the priority queue. Therefore, enabling concurrent accesses to this structure seems a good approach to increasing performance. Brodal et al. [5] utilize a number of processors to accelerate the DecreaseKey operation and discuss the applicability of their approach to Dijkstra's algorithm. However, this work is evaluated on a theoretical Parallel Random Access Machine (PRAM) execution model. Hunt et al. [13] implement a concurrent priority queue which is based on binary heaps and supports parallel insertions and deletions using fine-grain locking on the nodes of the binary heap. Since these operations do not traverse the entire data structure, local locking leads to performance gains. However, in the case of DecreaseKey, which performs wide traversals of the data structure, it degrades performance greatly, unless special hardware synchronization is supported by the underlying platform.

To expose more parallelism, it would be beneficial to concurrently extract a large number of nodes from the priority queue. This can be achieved if several nodes have equal distances from the set S of visited nodes. Thus, if the priority queue is organized into buckets of nodes with equal distances, then the extraction and neighbor updates can be done
in parallel per bucket (Dial's algorithm [10]). A generalization of Dial's algorithm, called Δ-stepping, is proposed by Meyer and Sanders [17]. Madduri et al. [14] use Δ-stepping as the base algorithm on the Cray MTA-2, an architecture that exploits fine-grain parallelism using hardware synchronization primitives, and achieve significant speedups. In the Parallel Boost Graph Library [11], Dijkstra's algorithm is parallelized for a distributed-memory machine. The priority queue is distributed across the local memories of the system nodes and the algorithm is divided into supersteps, in which each processor extracts a node from its local priority queue. The aforementioned approaches rely on significant modifications to Dijkstra's algorithm to enable coarse-grain parallelism and lead to promising parallel implementations. In this paper we adhere to the pure Dijkstra's algorithm in order to face the challenges of its parallelization and test the applicability of TM and HT.

TM has attracted extensive scientific research during the last few years, focusing mainly on its design and implementation details. Nevertheless, its efficacy on a wide set of real, non-trivial applications is only now starting to be explored. Scott et al. [18] use TM to parallelize Delaunay triangulation and Watson et al. [19] exploit it to parallelize Lee's routing algorithm. Moreover, a set of TM applications is offered in STAMP [6].

5 Conclusions - Future Work

This work applies several parallelization techniques to Dijkstra's algorithm, which is known to be hard to parallelize. The schemes that parallelize each serial step by incorporating traditional synchronization primitives (locks and barriers) fail to outperform the serial algorithm. In fact, they exhibit low performance even when the necessary barriers are replaced by ideal ones. To deal with this we employ Transactional Memory (TM), which reduces synchronization overheads but still fails to provide a meaningful overall performance improvement, as speedups are achieved in some test cases only when ideal barriers are used. To improve performance further, we propose an implementation based on TM and Helper Threading, which is able to provide significant speedups (reaching up to 1.46) in the majority of the simulated cases.

As future work, we will investigate the application of these techniques to other algorithms that solve the SSSP problem, such as Δ-stepping [17] and Bellman-Ford [9]. We also aim to explore the impact of various TM characteristics on the behavior of the presented schemes, such as the conflict resolution policy, version management and conflict detection. Finally, preliminary results demonstrated interesting variations in the available parallelism between different execution phases, motivating us to explore schemes that are more adaptive in terms of the number of parallel threads and the tasks assigned to them.

References

[1] Wisconsin Multifacet GEMS simulator. http://www.cs.wisc.edu/gems/.
[2] A.-R. Adl-Tabatabai, C. Kozyrakis, and B. Saha. Unlocking concurrency: Multicore programming with transactional memory. ACM Queue, 4(10):24-33, 2006.
[3] D.A. Bader and K. Madduri. Design and implementation of the HPCS graph analysis benchmark on symmetric multiprocessors. In HiPC, 2005.
[4] D.A. Bader and K. Madduri. GTgraph: A suite of synthetic graph generators. 2006. http://www.cc.gatech.edu/kamesh/GTgraph/.
[5] G.S. Brodal, J.L. Traff, C.D. Zaroliagis, and I. Stadtwald. A parallel priority queue with constant time operations. Journal of Parallel and Distributed Computing, 49:4-21, 1998.
[6] C. Cao Minh, J. Chung, C. Kozyrakis, and K. Olukotun. STAMP: Stanford Transactional Applications for Multi-Processing. In IISWC, 2008.
[7] D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-MAT: A recursive model for graph mining. In ICDM, 2004.
[8] J.D. Collins, H. Wang, D.M. Tullsen, C. Hughes, Y. Lee, D. Lavery, and J.P. Shen. Speculative precomputation: Long-range prefetching of delinquent loads. In ISCA, 2001.
[9] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, 2001.
[10] R. Dial. Algorithm 360: Shortest path forest with topological ordering. Communications of the ACM, 12:632-633, 1969.
[11] N. Edmonds, A. Breuer, D. Gregor, and A. Lumsdaine. Single-source shortest paths with the Parallel Boost Graph Library. In 9th DIMACS Implementation Challenge, 2006.
[12] M. Herlihy and E. Moss. Transactional memory: Architectural support for lock-free data structures. In ISCA, 1993.
[13] G.C. Hunt, M.M. Michael, S. Parthasarathy, and M.L. Scott. An efficient algorithm for concurrent priority queue heaps. Information Processing Letters, 60:151-157, 1996.
[14] K. Madduri, D.A. Bader, J.W. Berry, and J.R. Crobak. Parallel shortest path algorithms for solving large-scale instances. In 9th DIMACS Implementation Challenge, 2006.
[15] P.S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A full system simulation platform. Computer, 35(2):50-58, 2002.
[16] M. Martin, D. Sorin, B. Beckmann, M. Marty, M. Xu, A. Alameldeen, K. Moore, M. Hill, and D. Wood. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset. SIGARCH Comput. Archit. News, 2005.
[17] U. Meyer and P. Sanders. Delta-stepping: A parallel single source shortest path algorithm. In ESA, 1998.
[18] M.L. Scott, M.F. Spear, L. Dalessandro, and V.J. Marathe. Delaunay triangulation with transactions and barriers. In IISWC, 2007.
[19] I. Watson, C. Kirkham, and M. Lujan. A study of a transactional parallel routing algorithm. In PACT, 2007.
[20] L. Yen, J. Bobba, M.R. Marty, K.E. Moore, H. Volos, M.D. Hill, M.M. Swift, and D.A. Wood. LogTM-SE: Decoupling hardware transactional memory from caches. In HPCA, 2007.
[21] W. Zhang, B. Calder, and D.M. Tullsen. An event-driven multithreaded dynamic optimization framework. In PACT, 2005.