Early Experiences On Accelerating Dijkstra's Algorithm Using Transactional Memory


Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris

National Technical University of Athens


School of Electrical and Computer Engineering
Computing Systems Laboratory
{anastop,knikas,goumas,nkoziris}@cslab.ece.ntua.gr

Abstract

In this paper we use Dijkstra's algorithm as a challenging, hard to parallelize paradigm to test the efficacy of several parallelization techniques in a multicore architecture. We consider the application of Transactional Memory (TM) as a means of concurrent accesses to shared data and compare its performance with straightforward parallel versions of the algorithm based on traditional synchronization primitives. To increase the granularity of parallelism and avoid excessive synchronization, we combine TM with Helper Threading (HT). Our simulation results demonstrate that the straightforward parallelization of Dijkstra's algorithm with traditional locks and barriers has, as expected, disappointing performance. On the other hand, TM by itself is able to provide some performance improvement in several cases, while the version based on TM and HT exhibits a significant performance improvement that can reach up to a speedup of 1.46.

This research is supported by PENED 2003 Project (EPAN/GSRT), co-funded by the European Social Fund (80%) and National Resources (20%).

1 Introduction

Dijkstra's algorithm [9] is a fundamental graph algorithm used to compute single source shortest paths (SSSP) for graphs with non-negative edges. SSSP is a classic combinatorial optimization problem used in a variety of applications such as network routing or VLSI design. The algorithm maintains a set S of visited nodes, whose shortest path has already been calculated. In each iteration, the unvisited node with the shortest distance from S is selected, it is inserted into S and the distances of its neighbors are updated. The set of unvisited nodes is implemented as a priority queue. This serializes a large part of the algorithm's operations, thus making Dijkstra a hard to parallelize graph algorithm [5, 17].

Delving into implementation details, the algorithm involves a two-level nested loop: the outer loop iterates over all the nodes of the graph, selecting in each step the one closest to set S, while the inner loop updates the distances from S of all the neighbors of the extracted node. To implement parallel versions, researchers follow two general strategies. The first strategy attempts to relax the sequential nature of Dijkstra by creating more parallelism in the outer loop. This leads to alternative algorithms like Δ-stepping [14, 17] that enable concurrent extraction of multiple nodes from the unvisited set. The second strategy works on pure Dijkstra and seeks parallelism in the inner loop, by enabling concurrent accesses to the priority queue. However, practical implementations of concurrent binary heaps as priority queues [13] are based on unavoidable fine-grain locking of the binary heap, which is expected to kill the performance of such a scheme.

In this paper we face the challenges of parallelizing Dijkstra's algorithm for a multicore architecture. To decrease the synchronization cost we employ Transactional Memory (TM) [2, 12] as a means of efficient concurrent thread accesses to shared data. TM is a novel programming model for multicore architectures that allows concurrency control over multiple threads. The programmer is able to envelop parts of the code within a transaction, indicating that within this section exist accesses to memory locations that may be performed by other threads as well. The TM system monitors the transactions of the threads and, if two or more of them perform conflicting memory accesses, it resolves the conflict. TM seems a promising approach for dynamic data structures and applications with independent threads. It remains though to be investigated how TM can speed up a single application.

The parallelization of the inner loop does not exploit a significant amount of parallelism. Therefore, we choose to coarsen the granularity of parallelism by employing the idea of Helper Threads (HT) [8, 21]. Dijkstra's algorithm spends a large part of its execution in the relaxations of the nodes of the priority queue. Parallel threads can update (relax) the distances of several nodes' neighbors without changing the semantics of the algorithm. Thus, while the main thread extracts and updates the neighbors of the head of the priority queue, k helper threads update the neighbors of the next k nodes in the priority queue. This approach exploits parallelism in the outer loop, without changing the semantics of the algorithm.

We have implemented several versions of the multithreaded
Dijkstra algorithm using traditional synchronization primitives (locks and barriers), TM and HT and evaluated them using Simics [15] and GEMS [1, 16], which allow the simulation of multicore systems and provide support for TM. Our results demonstrate that the combination of TM and HT achieves significant speedup on a hard to accelerate application, while requiring only a few extensions to the original source code.

The rest of the paper is organized as follows. Section 2 presents the basics of Dijkstra's algorithm and the details of the various multithreaded implementations. Section 3 demonstrates simulation results comparing the performance of the versions under consideration. Related work is presented in Section 4, while Section 5 summarizes the paper and discusses directions for future work.

2 Parallelizing Dijkstra's algorithm

Algorithm 1: Dijkstra's algorithm.
Input : Directed graph G = (V, E), weight function w : E → R+, source vertex s, min-priority queue Q
Output : shortest distance array d, predecessor array π
    /* Initialization phase */
 1  foreach v ∈ V do
 2      d[v] ← INF;
 3      π[v] ← NIL;
 4      Insert(Q, v);
 5  end
 6  d[s] ← 0;
    /* Main body of the algorithm */
 7  while Q ≠ ∅ do
 8      u ← ExtractMin(Q);
 9      foreach v adjacent to u do
10          sum ← d[u] + w(u, v);
11          if d[v] > sum then
12              DecreaseKey(Q, v, sum);
13              d[v] ← sum;
14              π[v] ← u;
15      end
16  end
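As an illustration, Alg. 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation evaluated in this paper: the class name IndexedMinHeap, the function name dijkstra and the adjacency-list representation (a dictionary mapping each vertex to a list of (neighbor, weight) pairs) are all hypothetical choices made for the example. The heap is an array-based binary min-heap that records the slot of every vertex, so that DecreaseKey can locate a vertex and move it towards the root.

```python
import math


class IndexedMinHeap:
    """Array-based binary min-heap that tracks each vertex's slot,
    so that decrease_key can find a vertex and sift it up."""

    def __init__(self, keys):
        self.key = dict(keys)                 # vertex -> current key
        self.heap = list(self.key)            # heap[i] is the vertex in slot i
        self.pos = {v: i for i, v in enumerate(self.heap)}
        for i in range(len(self.heap) // 2 - 1, -1, -1):
            self._sift_down(i)                # bottom-up heap construction

    def __len__(self):
        return len(self.heap)

    def _swap(self, i, j):
        h = self.heap
        h[i], h[j] = h[j], h[i]
        self.pos[h[i]] = i
        self.pos[h[j]] = j

    def _sift_up(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self.key[self.heap[parent]] <= self.key[self.heap[i]]:
                break
            self._swap(i, parent)
            i = parent

    def _sift_down(self, i):
        n = len(self.heap)
        while True:
            smallest = i
            for c in (2 * i + 1, 2 * i + 2):
                if c < n and self.key[self.heap[c]] < self.key[self.heap[smallest]]:
                    smallest = c
            if smallest == i:
                return
            self._swap(i, smallest)
            i = smallest

    def extract_min(self):
        top = self.heap[0]
        self._swap(0, len(self.heap) - 1)     # move the minimum to the last slot
        self.heap.pop()
        del self.pos[top]
        if self.heap:
            self._sift_down(0)
        return top

    def decrease_key(self, v, new_key):
        self.key[v] = new_key
        self._sift_up(self.pos[v])


def dijkstra(graph, source):
    """graph: dict vertex -> list of (neighbor, weight), weights >= 0."""
    d = {v: math.inf for v in graph}
    pred = {v: None for v in graph}
    d[source] = 0                             # set before the heap is built
    queue = IndexedMinHeap(d)
    while queue:                              # while Q is not empty
        u = queue.extract_min()
        for v, w in graph[u]:
            total = d[u] + w
            if d[v] > total:                  # relaxation succeeds
                queue.decrease_key(v, total)
                d[v] = total
                pred[v] = u
    return d, pred
```

Calling dijkstra(graph, s) returns the distance and predecessor dictionaries. One deliberate difference from Alg. 1 is that the source key is set to 0 before the heap is built, so the heap is consistent from the first ExtractMin.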

2.1 Dijkstra's algorithm

Dijkstra's algorithm solves the SSSP problem for a directed graph with non-negative edge weights. Specifically, let G = (V, E) be a directed graph with n = |V| vertices, m = |E| edges, and w : E → R+ a weight function assigning non-negative real-valued weights to the edges of G. For each vertex v, the SSSP problem computes δ(v), the weight of the shortest path from a source vertex s to v. For each vertex v, Dijkstra's algorithm maintains a shortest-path estimate (or tentative distance) d(v), which is an upper bound for the actual weight of the shortest path from s to v, δ(v). Initially, d(v) is set to ∞ and through successive edge relaxations it is gradually decreased, converging to δ(v). The relaxation of an edge (v, w) sets d(w) to min{d(w), d(v) + w(v, w)}, which means that the algorithm tests whether it can decrease the weight of the shortest path from s to w by going through v.

Algorithm 2: Fine-grain parallel implementation of Dijkstra's algorithm.
Input : Directed graph G = (V, E), weight function w : E → R+, source vertex s, min-priority queue Q
Output : shortest distance array d, predecessor array π
    /* Initialization phase same as in the serial code */
    /* Main body of the algorithm */
 1  while Q ≠ ∅ do
 2      Barrier
 3      if tid = 0 then
 4          u ← ExtractMin(Q);
 5      Barrier
 6      foreach v adjacent to u do in parallel
 7          sum ← d[u] + w(u, v);
 8          if d[v] > sum then
 9              Begin-Atomic
10              DecreaseKey(Q, v, sum);
11              End-Atomic
12              d[v] ← sum;
13              π[v] ← u;
14      end
15  end
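The structure of Alg. 2 (a serial ExtractMin between two barriers, followed by relaxations shared among the threads) can be sketched with Python threads as follows. This is only an illustration under several assumptions that are not part of the paper: the function name parallel_dijkstra and the dictionary-based graph are hypothetical, a single global lock plays the role of the Begin-Atomic/End-Atomic pair, and Python's heapq with lazy deletion (pushing a new entry and skipping stale ones) stands in for a true DecreaseKey. Unlike Alg. 2, the test d[v] > sum is also performed inside the critical section, which keeps the sketch safe even when two edges lead to the same neighbor. In standard CPython builds the global interpreter lock prevents the relaxations from running truly in parallel, so the sketch shows the synchronization pattern rather than any speedup.

```python
import heapq
import math
import threading


def parallel_dijkstra(graph, source, num_threads=4):
    """graph: dict vertex -> list of (neighbor, weight), weights >= 0."""
    d = {v: math.inf for v in graph}
    pred = {v: None for v in graph}
    d[source] = 0
    heap = [(0, source)]                  # entries are (tentative distance, vertex)
    settled = set()
    heap_lock = threading.Lock()          # one global lock guarding heap and d
    barrier = threading.Barrier(num_threads)
    current = {"u": None}                 # vertex extracted in the current step

    def worker(tid):
        while True:
            barrier.wait()                # all relaxations of the previous step are done
            if tid == 0:
                current["u"] = None
                while heap:
                    _, u = heapq.heappop(heap)
                    if u not in settled:  # skip stale (superseded) entries
                        settled.add(u)
                        current["u"] = u
                        break
            barrier.wait()                # the extracted vertex is visible to everyone
            u = current["u"]
            if u is None:                 # queue exhausted: every thread leaves
                return
            # Cyclic assignment of the outgoing edges of u to the threads.
            for v, w in graph[u][tid::num_threads]:
                total = d[u] + w
                with heap_lock:           # Begin-Atomic ... End-Atomic
                    if d[v] > total:
                        heapq.heappush(heap, (total, v))
                        d[v] = total
                        pred[v] = u

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return d, pred
```

Every thread passes through exactly two barriers per step, so the serial extraction is always separated from the relaxations that precede and follow it.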
The algorithm maintains a partition of V into settled (visited), queued and unreached vertices (the latter two representing unvisited nodes). Settled vertices have d(v) = δ(v); queued have d(v) > δ(v) and d(v) ≠ ∞; unreached have d(v) = ∞. Initially, only s is queued, d(s) = 0 and all other vertices are unreached. In each iteration of the algorithm, the vertex with the smallest shortest-path estimate is selected, its state is permanently changed to settled and all its outgoing edges are relaxed, causing any of its neighbors that were unreached by the source vertex until this point to become queued. The algorithm is presented in more detail in Alg. 1.

The basic data structure lying at the heart of Dijkstra's algorithm is a min-priority queue of vertices, keyed by their d() values. The queue is used to maintain all but the settled vertices of the graph. At each iteration, the vertex with the smallest key is removed from the queue (ExtractMin operation) and its outgoing edges are relaxed, which could result in reductions of the keys of the corresponding neighbors (DecreaseKey operation). To amortize the cost of the multiple ExtractMin and DecreaseKey operations, especially for realistic, sparse graphs, the min-priority queue is implemented as a binary heap.

2.2 Lock-based parallel implementation

An intuitive choice for parallelizing Dijkstra's algorithm is to exploit parallelism at the inner loop by relaxing all outgoing edges of vertex u in parallel. This is a fine-grain parallelization scheme. In each step, one thread extracts u from the heap and then its outgoing edges are assigned (e.g. via cyclic assignment) to parallel threads for relaxation. This idea is depicted in Figure 1 while a generic implementation is shown in Alg. 2.

A number of observations can be made concerning this parallelization scheme. First, the speedup is bounded by the average out-degree of the vertices, i.e. the density of the graph. Clearly, if vertices have a small number of neighbors on average, then the parallel segment of the algorithm (lines 6-14) will consume a small fraction of the total execution time, making the sequential part (lines 3-4, ExtractMin) dominant. The second observation concerns the concurrent accesses to the binary heap by the parallel DecreaseKey operations. The binary heap is implemented as a linear array and can be considered as a nearly complete binary tree. The smallest element in the heap is stored at the root and the subtree rooted at a node contains values no smaller than the value of the node. During
a DecreaseKey operation, a vertex obtains a smaller value as its new shortest path estimate. If this new value is smaller than that of its parent, the vertex has to move up the tree until it is placed in a location that satisfies the min-heap property. During this traversal, the node is repeatedly compared to its parent and if its value is smaller, the nodes are swapped.

Figure 1: Execution patterns of serial and multithreaded Dijkstra's algorithm. [Timeline diagram. Legend: extract-min, relax outgoing edges. Time axis: step k, step k+1, step k+2, shown once for the serial pattern and once for Threads 1 to 4.]

The first, and rather naive, approach to enable parallelization of the relaxation phase, is to use a global mutex to lock the entire heap during each DecreaseKey operation. This constitutes a conservative, coarse-grain synchronization scheme that permits only one DecreaseKey operation at a time and obviously limits concurrency. We refer to this scheme as cgs-lock. The alternative, more optimistic approach is to allow multiple sequences of node swaps to execute in parallel as long as they access different parts of the heap. More specifically, instead of using one lock for the entire heap, one can utilize separate locks for each parent-child pair of nodes. Whenever a thread executes a DecreaseKey operation and a node swap is required, it must first acquire the appropriate lock that guards this specific pair of nodes (a scheme similar to [13]). In this way atomicity is guaranteed and the algorithm can be executed safely in parallel. We refer to this scheme as fgs-lock.

To obtain a first picture of the efficiency of these schemes, we evaluated them with a random graph with 10K vertices and 100K edges. A detailed description of the simulation framework can be found in Section 3.1. Figure 2 demonstrates the speedup of the two schemes for 2 to 16 threads. The speedup is calculated as the ratio of the execution time of the serial to the parallel scheme in each case. The performance of the cgs-lock scheme is disappointing. Although the limited parallelism of the scheme explains the lack of speedup, a more detailed execution profiling revealed that the vast performance drop is attributed to the overhead of barriers that surround the ExtractMin operation and decouple the serial phases from the parallel ones. More specifically, for 2 threads the time spent in barriers accounts for 71% of the total execution time. This percentage rises up to 88% when using 8 threads, explaining why the performance degrades when more threads are used. We used the barriers provided by the Pthreads library, yet we argue that this should not be a problem of the specific barrier implementation, since alternative software-based implementations are expected to provide similar results.

Figure 2: Speedups of lock-based parallel versions with real and perfect barriers. [Plot of multithreaded speedup (0 to 1.3) against number of threads (2 to 16). Series: cgs-lock, perfbar+cgs-lock, perfbar+fgs-lock.]

In an attempt to isolate the effect of the barriers, we implemented a version of idealized, zero-latency barriers that rely solely on hardware in our simulated environment. This scheme is named perfbar+cgs-lock. It is clear from Figure 2 that the replacement of barriers with perfect ones deals with the poor scalability problem of the cgs-lock scheme. Nevertheless, the scheme still performs worse than the serial execution of the algorithm, revealing that this coarse-grain synchronization is too conservative and cannot expose enough parallelism.

Finally, the fgs-lock scheme combined with perfect barriers fails to outperform the serial execution, despite being more optimistic than the cgs-lock scheme. As the number of threads increases, its performance improves slightly, indicating that there does exist an amount of parallelism. However, the fgs-lock scheme fails to exploit it efficiently and there are two possible reasons for this failure. The first reason is that in order to allow concurrent accesses to the heap, a pair of spin-locks is used for each pair of nodes, causing the total overhead to be high and lowering the performance gains from the exploited parallelism. The second reason is that the fgs-lock scheme allows concurrent accesses to the binary heap only when the threads access different parts of the heap. The probability of threads touching the same nodes of the binary heap depends on the structure of the graph as well as on the order by which the neighbors of a vertex are examined during the DecreaseKey operations. Whenever this occurs, the threads are serialized, thus limiting the total available parallelism.

2.2.1 TM-based parallel implementation

The fgs-lock scheme described in Section 2.2 allows concurrent accesses to the binary heap used in the implementation of Dijkstra's algorithm. Unfortunately, it has a high overhead due to the numerous locks needed, limiting its efficiency severely. Looking for alternatives, we test the efficacy of TM, as a means of concurrent accesses to the shared binary heap. The first approach is to enclose each DecreaseKey operation within a transaction, and rely on the underlying system to ensure atomicity. When concurrent transactions access the same elements in the heap and at least one of these accesses is a write operation, a conflict arises and the system needs to resolve it, deciding which transaction succeeds. The DecreaseKey operation includes a series of swaps, as a node traverses the heap until it is placed in the final correct position. When two or more
vertices are relaxed in parallel, the paths of these heap traversals might share one or more common nodes. The TM system will detect the conflict, only one of the transactions will be allowed to commit and only one vertex will be relaxed. The other conflicting threads will have to pause or repeat their work, depending on the implementation of the TM system and its conflict detection and resolution policy. This scheme is not as fine-grain as the fgs-lock scheme, where atomicity is enforced at the level of a single swap and not for a series of swaps. We will refer to this scheme as cgs-tm. It is implemented as shown in Alg. 2 by replacing the Begin-Atomic and End-Atomic operations with the appropriate Begin-Transaction and End-Transaction primitives.

Figure 3: Speedups of lock-based and TM-based versions. [Plot of multithreaded speedup (0.5 to 1.2) against number of threads (2 to 16). Series: perfbar+cgs-lock, perfbar+fgs-lock, perfbar+cgs-tm, perfbar+fgs-tm.]
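The conflict condition described for the cgs-tm scheme (two DecreaseKey transactions touch a common heap element and at least one of them writes it) can be illustrated with a small model. The sketch below is not a TM system and does not reproduce the hardware mechanism used in our simulations; it only assumes, for the sake of the example, that conflicts are tracked per heap slot and that the heap is stored in an array where the parent of slot j is slot (j - 1) // 2. The function names decrease_key_footprint and transactions_conflict are hypothetical.

```python
def decrease_key_footprint(keys, i, new_key):
    """Simulate the sift-up of a DecreaseKey on a copy of the array
    min-heap `keys` and return (read_set, write_set): the slots that
    are read and the slots that are written when slot i gets new_key."""
    keys = list(keys)                     # work on a copy, the input stays intact
    reads, writes = set(), {i}
    keys[i] = new_key
    while i > 0:
        parent = (i - 1) // 2
        reads.add(parent)                 # the parent key is always inspected
        if keys[parent] <= keys[i]:
            break                         # min-heap property holds, stop here
        keys[i], keys[parent] = keys[parent], keys[i]
        writes.update((i, parent))        # a swap writes both slots
        i = parent
    return reads, writes


def transactions_conflict(a, b):
    """Footprints conflict when some slot is accessed by both
    transactions and written by at least one of them."""
    reads_a, writes_a = a
    reads_b, writes_b = b
    return bool(writes_a & (reads_b | writes_b)) or bool(writes_b & reads_a)


# A valid min-heap: slot 0 is the root, slots 1-2 its children, and so on.
heap_keys = [1, 5, 3, 9, 8, 7, 4]
left = decrease_key_footprint(heap_keys, 3, 6)    # stays in the left subtree
right = decrease_key_footprint(heap_keys, 5, 2)   # swaps with slot 2
other = decrease_key_footprint(heap_keys, 6, 2)   # also swaps with slot 2
```

In this example the footprints left and right are disjoint, so the two operations could commit concurrently, whereas right and other both write slot 2 and would be reported as conflicting.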
A second alternative is to implement a scheme as fine-grain as the fgs-lock scheme using TM. To accomplish this, each swap executed by the DecreaseKey operation is enclosed into a transaction. This means that the transactions will be shorter than those of the cgs-tm scheme resulting, hopefully, in fewer conflicts and thus more parallelism. We will refer to this scheme as fgs-tm. For the implementation of the fgs-tm scheme the DecreaseKey presented in Alg. 3 is used. Similarly to the lock-based schemes, both these TM-based schemes require the incorporation of barriers to decouple the serial from the parallel phases. Having observed the disastrous effect of the barriers on the performance of the lock-based schemes, we employ the perfect, zero-latency barriers in the evaluation of our TM-based implementations as well.

To obtain a first insight on the efficiency of the TM-based schemes we used the same graph as in Section 2.2 and present speedups in Figure 3. In contrast to the lock-based schemes, the TM-based ones outperform the serial implementation for more than 4 threads. For 2 threads the overhead of the TM scheme seems to be too high, canceling out any performance gains from the exploitation of parallelism. As more threads are used though, the performance is improved, providing a speedup of up to almost 1.1. More detailed results are presented in Section 3.3. Thus, it seems that TM is a promising mechanism to exploit the available parallelism of the DecreaseKey operations on the binary heap. However, to obtain these performance improvements ideal barriers are employed.

Algorithm 3: DecreaseKey
Input : min-priority queue Q, vertex u, new key value for vertex u
 1  Q[u] ← value;
 2  i ← u;
 3  while (parent(i).key > value) do
 4      Begin-Transaction
 5      swap(u, i);
 6      End-Transaction
 7      i ← parent(i);
 8  end

2.3 A multithreaded version based on HT

In this section we present an alternative multithreaded version of Dijkstra's algorithm based on HT. The motivation arises from the poor performance of all aforementioned versions, which is due to their limited parallelism and excessive synchronization. Our goal is to coarsen the granularity of parallelism, as in [11, 14, 17], without changing the algorithm itself. Thus, instead of partitioning the inner loop and assigning only a few neighbors to each thread, we seek to assign the relaxation of a complete set of neighbors to each thread. To accomplish this, we take advantage of a basic property of Dijkstra's algorithm: the relaxations (lines 11, 13-14 in Alg. 1) lead to monotonically decreasing values for the distances of unvisited nodes until each distance reaches its final minimum value. As long as a graph node is inserted in the queued set (i.e. the node's distance from S is not infinite) its neighbors could also be relaxed to newer updated values. This property is not utilized by the original serial algorithm, as all the updates occur for the neighbors of the extracted node. Practically, the algorithm avoids calculating intermediate distances that will eventually be overwritten. Our key idea is that parallel threads can serve as helper threads and perform relaxations for neighbors of nodes belonging in the queued set. Optimistically, some of these relaxations will be utilized and offloaded by the main thread.

In our implementation the main thread operates as in the sequential version, extracting in each iteration the minimum vertex from the priority queue and relaxing its outgoing edges. At the same time, the k-th helper thread reads the tentative distance of the k-th vertex in the queue (let us call it x_k for short) and relaxes all its outgoing edges based on this value. When the main thread accomplishes its relaxations, it notifies the helper threads to stop their relaxations, and they all proceed to the next iteration. This scheme is demonstrated in Figure 4. The rationale behind it is that vertices occupying the top k positions in the queue might already be settled with some probability, so that when the helper threads read their distances and relax their outgoing edges, they will make their corresponding neighbors settled, as well. As a result, when the main thread checks these vertices later, it will not have to perform any relaxations. On the other hand, if the k-th thread reads x_k, it is possible that x_k might not have been settled yet and thus have a suboptimal tentative distance. The thread would then update the neighbors according to a new tentative value, which will eventually be set to the appropriate minimum value, when x_k is examined by the main thread later on. At that moment, all its outgoing edges will be re-relaxed using the correct final distance. A significant aspect of this multithreaded scheme is that the main thread stops all helper threads after finishing each iteration of the outer loop. At this time, the helper threads stop their computations and proceed with the main thread to
extract-min read tidth-min relax outgoing edges tion, continuing from the point where they had stopped. This
Time might yield suboptimal updates to the distances of the neigh-
step k step k+1 step k+2

Thread 1
bors, but as explained above, these will be overwritten once
the vertices examined by the helper threads reach the top of the

kill

kill
kill
Thread 2 queue. So, correctness is guaranteed.

kill
Algorithm 4: Main threads code.
Thread 3

kill
Input : Directed graph G = (V, E), weight function w : E R+ ,
source vertex s, min-priority queue Q
Thread 4 Output : shortest distance array d, predecessor array
/* Initialization phase same to the serial code */
Figure 4: Execution pattern of the helper threads version. 1 while Q 6= do
2 u ExtractMin(Q);
3 done 0;
4 foreach v adjacent to u do
the next iteration. It is possible that at this time a helper thread 5 sum d[u] + w(u, v);
might have updated only some of the neighbors of its vertex 6 Begin-Transaction
7 if d[v] > sum then
xk , leaving the other ones with their old, possibly suboptimal, 8 DecreaseKey(Q, v, sum);
distances. As explained above, however, this is not a problem 9 d[v] sum;
10 [v] u;
since all neighbors of xk with suboptimal distances will be cor-
11 End-Transaction
rectly updated when xk reaches the top of the priority queue. 12 end
The code executed by the main and helper threads is shown 13 Begin-Transaction
14 done 1;
in Alg. 4 and Alg. 5, respectively. In the beginning of each it- 15 End-Transaction
eration, the main thread extracts the top vertex from the queue. 16 end
At the same time, the helper threads spin-wait until the main Algorithm 5: Helper threads code.
thread has finished the extraction, and then each one reads 1 while Q 6= do
without extracting one of the top k vertices in the queue (this 2 while done = 1 do ;
3 x ReadMin(Q, tid);
is what ReadMin function does). Next, all threads relax all the 4 stop 0;
outgoing edges of the vertices they have undertaken in parallel. 5 foreach y adjacent to x and while stop = 0 do
Compared to the original algorithm, a performance improve- 6 Begin-Transaction
7 if done = 0 then
ment is expected, since, due to the helper threads, the main 8 sum d[x] + w(x, y);
thread will evaluate the expression of line 7 as true fewer times 9 if d[y] > sum then
10 DecreaseKey(Q, y, sum);
and thus, will not need to execute the operations of lines 89. 11 d[y] sum;
12 [y] x;
The proposed HT scheme is largely based on TM. Updates
13 else
to the heap via the DecreaseKey function, as well as updates 14 stop 1;
to the tentative distances and predecessor arrays are enclosed 15 End-Transaction
16 end
within a single transaction for both the main and helper threads. 17 end
This ensures atomicity of these updates, i.e. that they will be
performed in an all-or-none manner. Furthermore, it guar- Summarizing, the main concept of our implementation is
antees that in case of conflict only one thread will be allowed to decouple as much as possible the main thread from the ex-
to commit its transaction and perform the neighbor update. A ecution of the helper threads, minimizing the time that it has
conflict can arise when two or more threads update simultane- to spend on synchronization events or transaction aborts. The
ously the same neighbor, or when they update different neigh- helper threads are allowed to execute in an aggressive man-
bors but change the same part of the heap. The interruption ner, being at the same time as less intrusive to the main thread
of helper threads is implemented using transactions as well. as possible, even if they perform a notable amount of useless
Specifically, when the main thread has completed all the relax- work. The semantics of the algorithm guarantee that any in-
ations for its vertex, it sets the notification variable done to 1 termediate relaxations made by the helper threads are not ir-
within a separate transaction. This value denotes a state where reversible. Finally, by using a TM with a conflict resolution
the main thread proceeds to the next iteration and requires all policy that favors the main thread, transaction abort overheads
helper threads to stop and follow, terminating any operations are mainly suffered by the helper threads.
that they were performing on the heap. All helper threads ex-
ecuting transactions at this point will abort, since done is in
3 Experimental Evaluation
their read sets as well. The helper threads will immediately
retry their transactions, but there is a good chance that they
will find done set to 1, stop examining the remaining neigh- 3.1 Experimental setup
bors in the inner loop and continue with the next iteration of
the outer loop. In the opposite case that the main thread per- We evaluated the performance of the various implementa-
forms the ExtractMin operation too quickly, done will be tions of Dijkstras algorithm through full-system simulation,
set back to 0 and the helper threads will miss the last notifica- using the Wisconsin GEMS toolset v.2.1 [1, 16] in conjunc-

5
tion with the Simics v.3.0.31 [15] simulator. Simics provides Simics Processor
configurations up to 32 cores
functional simulation of a SPARC chip multiprocessor system UltraSPARC III Cu (III+)
(CMP) that boots unmodified Solaris 10. The GEMS Ruby L1 caches
Private, 64KB, 4-way set-associative,
module provides detailed memory system simulation and for 64B line size, 4 cycle hit latency
non-memory instructions behaves as an in-order single-issue Unified and shared, 8 banks, 2MB, 4-way set-
Ruby L2 cache
associative, 64B line size, 10 cycle hit latency
processor, executing one instruction per simulated cycle.
Hardware TM is supported in GEMS through the LogTM- Memory 160 cycle access latency
SE subsystem [20]. It is built upon a single-chip CMP system TM System HYBRID resol. policy, 2Kb HW signatures
with private per-processor L1 caches and a shared L2 cache. It
features eager version management, where transactions write Table 1: Simulation framework.
the new memory values in-place, after saving the old values Sequential Parallel. Ideal
in a log. It also supports eager conflict detection, as conflicts, Graph Edges Parameters
part (%) part (%) speedup
i.e. overlaps between the write set of one transaction and the rand1 10K 97.8 2.2 1.02
write or read set of other concurrent transactions, are detected rand2 100K 38.7 61.3 2.58
at the very moment they happen. On a conflict, the offending rand3 200K 26.2 73.8 3.81
transaction stalls and either retries its request hoping that the
rmat1 10K a = 0.45 78.3 21.7 1.27
other transaction has finished, or aborts if LogTM detects a
rmat2 100K b, c = 0.15 39.3 60.7 2.54
potential deadlock. The aborting processor uses its log to undo
rmat3 200K d = 0.25 26.8 73.2 3.73
the changes it has made and then retries the transaction. In our
ssca1 28K (P, C) = (0.25, 5) 94.1 5.9 1.06
experiments we used the HYBRID conflict resolution policy,
which tends to favor older transactions against younger ones. ssca2 118K (P, C) = (0.5, 20) 35.2 64.8 2.84
Table 1 shows the configuration of the simulation framework. ssca3 177K (P, C) = (0.5, 30) 28.9 71.1 3.46
Our programs use the Pthreads library for thread creation
and synchronization operations such as spin-locking and barri- Table 2: Graphs used for experiments.
ers. For our perfect barriers, we encoded a global barrier as where parallel execution would manage to zero out the time
a single assembly instruction, exploiting the functionality of- 100%
spent for edge relaxations, the speedup would be %Serialpart .
fered by Simics magic instructions. Thus the synchronization This is presented in the sixth column of Table 2 and constitutes
of threads is handled by the simulator and not the operating a theoretical upper bound for any performance improvement.
system, providing instant suspension/resumption of the arriv-
ing/departing threads. To avoid resource conflicts between our 3.3 Results
programs and the operating systems processes, we used CMP
configurations with more processor cores than the number of
threads we required. So, the experiments for 2 and 4 threads were performed on an 8-core CMP, while the 8-thread experiments were done on a 16-core CMP. To schedule a thread on a particular processor and avoid migrations, we used the pset_bind system call of Solaris. Finally, all codes were compiled with Sun's Studio 12 C compiler (O3 level).

3.2 Reference graphs

To evaluate the different schemes we strove to work on graphs that vary in terms of density and structure. To that end, we used the GTgraph graph generator [4] to construct graphs with 10K vertices from the following families:
Random: Their m edges are constructed by choosing a random pair among the n vertices.
R-MAT: Constructed using the Recursive Matrix (R-MAT) graph model [7].
SSCA#2: Used in the DARPA HPCS SSCA#2 graph analysis benchmark [3].
Table 2 summarizes the characteristics of the graphs used. To obtain an estimate of possible speedups, we profiled the serial execution of Dijkstra's algorithm on each graph in order to calculate the distribution of the sequential (ExtractMin) and the parallelizable (DecreaseKey) parts. In the ideal case

Figure 5 shows the speedups of all evaluated schemes. Consistent with the discussion in Section 2, the concurrent thread accesses to shared data implemented with the aid of TM (cgs-tm and fgs-tm) clearly outperform the ones using traditional synchronization primitives (cgs-lock and fgs-lock). This fact reveals the existence of fine-grain parallelism in the updates of the priority queue of the algorithm, in the sense that, statistically, it is highly probable that the paths of various concurrent updates do not overlap. Thus, optimistic parallelism seems a good approach for Dijkstra's algorithm.

Nevertheless, only by employing perfect barriers can the TM-based schemes outperform the serial case. Therefore, the fine-grain parallelism exposed by the inner loop of the algorithm is insufficient to achieve significant speedups. In contrast, the HT scheme (helper), which exploits parallelism at a coarser granularity, is able to achieve significant speedups in the majority of the cases (6 out of 9 experiments). The maximum speedup achieved is 1.46, as shown in Figure 5c.

For denser graphs, the performance improvements are greater, since more parallelism can be exposed in the inner loop of the algorithm. These are the cases where helper achieves the best speedups and scalability (Figures 5c, 5f and 5i). Conversely, sparse graphs leave limited room for parallelism, leading to low performance. Therefore they can serve as test cases for the overhead of the co-existence of numerous threads. The

Figure 5: Multithreaded speedups for the graphs tested. Panels: (a) rand-10000x10000, (b) rand-10000x100000, (c) rand-10000x200000, (d) rmat-10000x10000, (e) rmat-10000x100000, (f) rmat-10000x200000, (g) ssca2-10000x28351, (h) ssca2-10000x118853, (i) ssca2-10000x177425. Each panel plots the multithreaded speedup (y-axis) against the number of threads (x-axis) for the schemes perfbar+cgs-lock, perfbar+cgs-tm, perfbar+fgs-lock, perfbar+fgs-tm and helper.

results for these cases are shown in Figures 5a, 5d and 5g. It is obvious that the helper scheme is the most robust one, as it exhibits the smallest slowdown. In the worst case, the performance of the main thread is degraded by around 10%.

A closer look at the results reveals that the main thread suffers a very low number of aborts (less than 1% of the total aborts). This means that even when the helper threads are not contributing any useful work, they still do not obstruct the main thread's progress. Therefore, the main thread is allowed to run almost at the speed of the serial execution, which explains the robustness of the scheme. The low overhead of the helper scheme is also illustrated by the fact that the addition of more threads does not lead to performance drops in any case.

4 Related Work

A significant part of Dijkstra's execution is spent in updates in the priority queue. Therefore, enabling concurrent accesses to this structure seems a good approach to increase performance. Brodal et al. [5] utilize a number of processors to accelerate the DecreaseKey operation and discuss the applicability of their approach to Dijkstra's algorithm. However, this work is evaluated on a theoretical Parallel Random Access Machine (PRAM) execution model. Hunt et al. [13] implement a concurrent priority queue which is based on binary heaps and supports parallel Insertions and Deletions using fine-grain locking on the nodes of the binary heap. Since these operations do not traverse the entire data structure, local locking leads to performance gains. However, in the case of DecreaseKey, which performs wide traversals of the data structure, it degrades performance greatly, unless special hardware synchronization is supported by the underlying platform.

To expose more parallelism, it would be beneficial to concurrently extract a large number of nodes from the priority queue. This can be achieved if several nodes have equal distances from the set S of visited nodes. Thus, if the priority queue is organized into buckets of nodes with equal distances, then the extraction and neighbor updates can be done

in parallel per bucket (Dial's algorithm [10]). A generalization of Dial's algorithm called Δ-stepping is proposed by Meyer and Sanders [17]. Madduri et al. [14] use Δ-stepping as the base algorithm on Cray MTA-2, an architecture that exploits fine-grain parallelism using hardware synchronization primitives, and achieve significant speedups. In the Parallel Boost Graph Library [11] Dijkstra's algorithm is parallelized for a distributed memory machine. The priority queue is distributed in the local memories of the system nodes and the algorithm is divided into supersteps, in which each processor extracts a node from its local priority queue. The aforementioned approaches are based on significant modifications to Dijkstra's algorithm to enable coarse-grain parallelism and lead to promising parallel implementations. In this paper we adhere to the pure Dijkstra's algorithm to face the challenges of its parallelization and test the applicability of TM and HT.

TM has attracted extensive scientific research during the last few years, focusing mainly on its design and implementation details. Nevertheless, its efficacy on a wide set of real, non-trivial applications is only now starting to be explored. Scott et al. [18] use TM to parallelize Delaunay triangulation and Watson et al. [19] exploit it to parallelize Lee's routing algorithm. Moreover, a set of TM applications is offered in STAMP [6].

5 Conclusions & Future work

This work applies several parallelization techniques to Dijkstra's algorithm, which is known to be hard to parallelize. The schemes that parallelize each serial step by incorporating traditional synchronization primitives (locks and barriers) fail to outperform the serial algorithm. In fact, they exhibit low performance even if the necessary barriers are replaced by ideal ones. To deal with this we employ Transactional Memory (TM), which reduces synchronization overheads, but still fails to provide meaningful overall performance improvement, as speedups can be achieved in some test cases only by using ideal barriers. To improve the performance further, we propose an implementation based on TM and Helper Threading that is able to provide significant speedups (reaching up to 1.46) in the majority of the simulated cases.

As future work, we will investigate the application of these techniques to other algorithms solving the SSSP problem, such as Δ-stepping [17] and Bellman-Ford [9]. We also aim to explore the impact of various TM characteristics on the behavior of the presented schemes, such as the resolution policy, version management and conflict detection. Finally, preliminary results demonstrated interesting variations in the available parallelism between different execution phases, motivating us to explore more adaptive schemes in terms of the number of parallel threads and the tasks assigned to them.

References

[1] Wisconsin multifacet gems simulator. http://www.cs.wisc.edu/gems/.
[2] A.-R. Adl-Tabatabai, C. Kozyrakis, and B.E. Saha. Unlocking concurrency: Multicore programming with transactional memory. ACM Queue, 4(10):24-33, 2006.
[3] D.A. Bader and K. Madduri. Design and implementation of the hpcs graph analysis benchmark on symmetric multiprocessors. In HiPC, 2005.
[4] D.A. Bader and K. Madduri. Gtgraph: A suite of synthetic graph generators. 2006. http://www.cc.gatech.edu/~kamesh/GTgraph/.
[5] G.S. Brodal, J.L. Traff, C.D. Zaroliagis, and I. Stadtwald. A parallel priority queue with constant time operations. Journal of Parallel and Distributed Computing, 49:4-21, 1998.
[6] C. Cao Minh, J. Chung, C. Kozyrakis, and K. Olukotun. STAMP: Stanford Transactional Applications for Multi-Processing. In IISWC, 2008.
[7] D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-mat: A recursive model for graph mining. In ICDM, 2004.
[8] J. D. Collins, H. Wang, D. M. Tullsen, C. Hughes, Y. Lee, D. Lavery, and J. P. Shen. Speculative precomputation: Long-range prefetching of delinquent loads. In ISCA, 2001.
[9] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, 2001.
[10] R. Dial. Algorithm 360: Shortest path forest with topological ordering. Communications of the ACM, 12:632-633, 1969.
[11] N. Edmonds, A. Breuer, D. Gregor, and A. Lumsdaine. Single-source shortest paths with the parallel boost graph library. In 9th DIMACS Implementation Challenge, 2006.
[12] M. Herlihy and E. Moss. Transactional memory: Architectural support for lock-free data structures. In ISCA, 1993.
[13] G.C. Hunt, M.M. Michael, S. Parthasarathy, and M.L. Scott. An efficient algorithm for concurrent priority queue heaps. Inf. Proc. Letters, 60:151-157, 1996.
[14] K. Madduri, D.A. Bader, J.W. Berry, and J.R. Crobak. Parallel shortest path algorithms for solving large-scale instances. In 9th DIMACS Implementation Challenge, 2006.
[15] P.S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A full system simulation platform. Computer, 35(2):50-58, 2002.
[16] M. Martin, D. Sorin, B. Beckmann, M. Marty, M. Xu, A. Alameldeen, K. Moore, M. Hill, and D. Wood. Multifacet's general execution-driven multiprocessor simulator (gems) toolset. SIGARCH Comput. Archit. News, 2005.
[17] U. Meyer and P. Sanders. Delta-stepping: A parallel single source shortest path algorithm. In ESA, 1998.
[18] M. L. Scott, M. F. Spear, L. Daless, and V. J. Marathe. Delaunay triangulation with transactions and barriers. In IISWC, 2007.
[19] I. Watson, C. Kirkham, and M. Lujan. A study of a transactional parallel routing algorithm. In PACT, 2007.
[20] L. Yen, J. Bobba, M.R. Marty, K.E. Moore, H. Volos, M.D. Hill, M.M. Swift, and D.A. Wood. LogTM-SE: Decoupling hardware transactional memory from caches. In HPCA, 2007.
[21] W. Zhang, B. Calder, and D.M. Tullsen. An event-driven multithreaded dynamic optimization framework. In PACT, 2005.
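
As an illustration of the bucket-based extraction discussed in Section 4 (Dial's algorithm [10]), the following is a minimal sequential sketch in Python. It is not the implementation evaluated in this paper. The names dial_sssp, adj and max_weight are hypothetical, and the sketch assumes non-negative integer edge weights and that every node appears as a key of adj. All nodes in the same bucket share one tentative distance, which is the property a parallel variant could exploit to settle them concurrently.

```python
from collections import defaultdict


def dial_sssp(adj, source, max_weight):
    """Single-source shortest paths using a bucket queue (Dial's algorithm).

    adj: dict mapping node -> list of (neighbor, weight) pairs; weights are
         assumed to be integers in [0, max_weight] (illustrative assumption).
    Returns a dict with the shortest distance of every reachable node.
    """
    dist = {source: 0}
    buckets = defaultdict(set)  # tentative distance -> nodes at that distance
    buckets[0].add(source)
    # A shortest path uses at most n - 1 edges, so no distance exceeds this.
    limit = max_weight * max(len(adj) - 1, 0)

    for d in range(limit + 1):
        # Every node in bucket d has final distance d. A parallel version
        # could extract and relax all of them at once.
        while buckets[d]:
            u = buckets[d].pop()
            for v, w in adj.get(u, ()):
                nd = d + w
                if v not in dist or nd < dist[v]:
                    if v in dist:
                        # Move v out of its old bucket (a DecreaseKey).
                        buckets[dist[v]].discard(v)
                    dist[v] = nd
                    buckets[nd].add(v)
    return dist


example_graph = {
    0: [(1, 2), (2, 5)],
    1: [(2, 1), (3, 4)],
    2: [(3, 1)],
    3: [],
    4: [],  # unreachable from node 0
}
example_dist = dial_sssp(example_graph, 0, max_weight=5)
```

In this sketch the DecreaseKey step is a constant-time move between buckets, in contrast to the heap traversal that the schemes in this paper have to synchronize.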
