Early Experiences On Accelerating Dijkstra's Algorithm Using Transactional Memory
Dijkstra's algorithm using traditional synchronization primitives (locks and barriers), TM and HT, and evaluated them using Simics [15] and GEMS [1, 16], which allow the simulation of multicore systems and provide support for TM. Our results demonstrate that the combination of TM and HT achieves significant speedup on a hard-to-accelerate application, while requiring only a few extensions to the original source code.

The rest of the paper is organized as follows. Section 2 presents the basics of Dijkstra's algorithm and the details of the various multithreaded implementations. Section 3 demonstrates simulation results comparing the performance of the versions under consideration. Related work is presented in Section 4, while Section 5 summarizes the paper and discusses directions for future work.

Algorithm 1: Dijkstra's algorithm.
Input : Directed graph G = (V, E), weight function w : E → R+, source vertex s, min-priority queue Q
Output: shortest distance array d, predecessor array π
   /* Initialization phase */
 1 foreach v ∈ V do
 2     d[v] ← INF;
 3     π[v] ← NIL;
 4     Insert(Q, v);
 5 end
 6 d[s] ← 0;
   /* Main body of the algorithm */
 7 while Q ≠ ∅ do
 8     u ← ExtractMin(Q);
 9     foreach v adjacent to u do
10         sum ← d[u] + w(u, v);
11         if d[v] > sum then
12             DecreaseKey(Q, v, sum);
13             d[v] ← sum;
14             π[v] ← u;
15         end
16     end
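The queue operations of Algorithm 1 (Insert, ExtractMin, DecreaseKey) are typically served by a binary min-heap. Purely for reference in the discussion that follows, a minimal C sketch of DecreaseKey on an array-based heap is given below; the names and layout are illustrative, not the paper's implementation:

#define NMAX 16384            /* illustrative capacity */

/* Array-based binary min-heap keyed by tentative distance.
 * pos[v] records the slot of vertex v, so DecreaseKey finds it in O(1). */
typedef struct {
    int   vert[NMAX];         /* slot -> vertex id  */
    float key [NMAX];         /* slot -> distance   */
    int   pos [NMAX];         /* vertex id -> slot  */
    int   size;
} minheap_t;

static void swap_slots(minheap_t *h, int i, int j)
{
    int vi = h->vert[i], vj = h->vert[j];
    float ki = h->key[i];
    h->vert[i] = vj; h->key[i] = h->key[j]; h->pos[vj] = i;
    h->vert[j] = vi; h->key[j] = ki;        h->pos[vi] = j;
}

/* Lower the key of vertex v and sift it up towards the root. */
void heap_decrease_key(minheap_t *h, int v, float newkey)
{
    int i = h->pos[v];
    h->key[i] = newkey;
    while (i > 0 && h->key[(i - 1) / 2] > h->key[i]) {
        swap_slots(h, i, (i - 1) / 2);
        i = (i - 1) / 2;
    }
}

The sift-up walks a root-ward path of the heap array; two concurrent relaxations whose paths cross the same slots touch common memory, which is exactly the kind of conflict the TM-based versions discussed below rely on detecting.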
2 Parallelizing Dijkstra's algorithm
Figure 1: Execution patterns of serial and multithreaded Dijkstra's algorithm (timeline of extract-min and relaxation of outgoing edges over steps k, k+1, k+2, for Threads 1-4).

Figure 2: Speedups of lock-based parallel versions with real and perfect barriers (cgs-lock, perfbar+cgs-lock, perfbar+fgs-lock; multithreaded speedup vs. number of threads, 2-16).
vertices are relaxed in parallel, the paths of these heap traversals might share one or more common nodes. The TM system will detect the conflict, only one of the transactions will be allowed to commit, and only one vertex will be relaxed. The other conflicting threads will have to pause or repeat their work, depending on the implementation of the TM system and its conflict detection and resolution policy. This scheme is not as fine-
the next iteration. It is possible that at this time a helper thread might have updated only some of the neighbors of its vertex x_k, leaving the other ones with their old, possibly suboptimal, distances. As explained above, however, this is not a problem, since all neighbors of x_k with suboptimal distances will be correctly updated when x_k reaches the top of the priority queue.

The code executed by the main and helper threads is shown in Alg. 4 and Alg. 5, respectively. In the beginning of each iteration, the main thread extracts the top vertex from the queue. At the same time, the helper threads spin-wait until the main thread has finished the extraction, and then each one reads, without extracting, one of the top k vertices in the queue (this is what the ReadMin function does). Next, all threads relax the outgoing edges of the vertices they have undertaken in parallel. Compared to the original algorithm, a performance improvement is expected since, due to the helper threads, the main thread will evaluate the condition of line 7 as true fewer times and thus will not need to execute the operations of lines 8-9.

The proposed HT scheme is largely based on TM. Updates to the heap via the DecreaseKey function, as well as updates to the tentative distance and predecessor arrays, are enclosed within a single transaction for both the main and helper threads. This ensures atomicity of these updates, i.e., that they will be performed in an all-or-none manner. Furthermore, it guarantees that in case of conflict only one thread will be allowed to commit its transaction and perform the neighbor update. A conflict can arise when two or more threads update the same neighbor simultaneously, or when they update different neighbors but change the same part of the heap. The interruption of helper threads is implemented using transactions as well. Specifically, when the main thread has completed all the relaxations for its vertex, it sets the notification variable done to 1 within a separate transaction. This value denotes a state where the main thread proceeds to the next iteration and requires all helper threads to stop and follow, terminating any operations that they were performing on the heap. All helper threads executing transactions at this point will abort, since done is in their read sets as well. The helper threads will immediately retry their transactions, but there is a good chance that they will find done set to 1, stop examining the remaining neighbors in the inner loop, and continue with the next iteration of the outer loop. In the opposite case, where the main thread performs the ExtractMin operation too quickly, done will be set back to 0 and the helper threads will miss the last notification, continuing from the point where they had stopped. This might yield suboptimal updates to the distances of the neighbors, but as explained above, these will be overwritten once the vertices examined by the helper threads reach the top of the queue. So, correctness is guaranteed.

Figure 4: Execution pattern of the helper threads version (timeline of extract-min, read tid-th min and relaxation of outgoing edges over steps k, k+1, k+2; in-flight helper transactions are killed when the main thread advances).

Algorithm 4: Main thread's code.
Input : Directed graph G = (V, E), weight function w : E → R+, source vertex s, min-priority queue Q
Output: shortest distance array d, predecessor array π
   /* Initialization phase: same as the serial code */
 1 while Q ≠ ∅ do
 2     u ← ExtractMin(Q);
 3     done ← 0;
 4     foreach v adjacent to u do
 5         sum ← d[u] + w(u, v);
 6         Begin-Transaction
 7         if d[v] > sum then
 8             DecreaseKey(Q, v, sum);
 9             d[v] ← sum;
10             π[v] ← u;
11         End-Transaction
12     end
13     Begin-Transaction
14     done ← 1;
15     End-Transaction
16 end

Algorithm 5: Helper threads' code.
 1 while Q ≠ ∅ do
 2     while done = 1 do ;
 3     x ← ReadMin(Q, tid);
 4     stop ← 0;
 5     foreach y adjacent to x and while stop = 0 do
 6         Begin-Transaction
 7         if done = 0 then
 8             sum ← d[x] + w(x, y);
 9             if d[y] > sum then
10                 DecreaseKey(Q, y, sum);
11                 d[y] ← sum;
12                 π[y] ← x;
13         else
14             stop ← 1;
15         End-Transaction
16     end
17 end

Summarizing, the main concept of our implementation is to decouple the main thread as much as possible from the execution of the helper threads, minimizing the time that it has to spend on synchronization events or transaction aborts. The helper threads are allowed to execute in an aggressive manner, while at the same time being as unintrusive to the main thread as possible, even if they perform a notable amount of useless work. The semantics of the algorithm guarantee that any intermediate relaxations made by the helper threads are not irreversible. Finally, by using a TM system with a conflict resolution policy that favors the main thread, transaction abort overheads are mainly suffered by the helper threads.
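For concreteness, the structure of Alg. 4 and Alg. 5 can be sketched in C. The paper's experiments run on hardware TM (LogTM-SE) inside a simulator; the sketch below instead uses GCC's software TM extension (compiled with -fgnu-tm) purely as a structural analogue, and the graph and heap helpers (graph_t, edge_t, heap_extract_min, heap_read_min, heap_decrease_key) are hypothetical stand-ins, not the paper's code:

/* Structural sketch of Alg. 4 and Alg. 5 with GCC software TM (-fgnu-tm).
 * Simplifies the memory model (e.g., the spin-wait on done); all
 * graph/heap helpers are assumed, hypothetical APIs. */
typedef struct edge { int to; float w; struct edge *next; } edge_t;
typedef struct { edge_t **adj; } graph_t;

extern int   done;   /* notification flag: 1 = main thread moved on */
extern float d[];    /* tentative distances */
extern int   pred[]; /* predecessor array   */

extern int  heap_empty(void);        /* assumed min-heap API */
extern int  heap_extract_min(void);
extern int  heap_read_min(int tid);  /* read tid-th min, no extraction */
extern void heap_decrease_key(int v, float key)
            __attribute__((transaction_safe));

void main_loop(graph_t *g)                    /* Alg. 4 */
{
    while (!heap_empty()) {
        int u = heap_extract_min();
        __transaction_atomic { done = 0; }    /* release the helpers */
        for (edge_t *e = g->adj[u]; e != NULL; e = e->next) {
            float sum = d[u] + e->w;
            __transaction_atomic {            /* lines 6-11 of Alg. 4 */
                if (d[e->to] > sum) {
                    heap_decrease_key(e->to, sum);
                    d[e->to] = sum;
                    pred[e->to] = u;
                }
            }
        }
        __transaction_atomic { done = 1; }    /* aborts conflicting helpers */
    }
}

void helper_loop(graph_t *g, int tid)         /* Alg. 5 */
{
    while (!heap_empty()) {
        while (__atomic_load_n(&done, __ATOMIC_ACQUIRE) == 1)
            ;                                 /* spin until main extracts */
        int x = heap_read_min(tid);
        int stop = 0;
        for (edge_t *e = g->adj[x]; e != NULL && !stop; e = e->next) {
            __transaction_atomic {
                if (done == 0) {              /* done enters the read set */
                    float sum = d[x] + e->w;
                    if (d[e->to] > sum) {
                        heap_decrease_key(e->to, sum);
                        d[e->to] = sum;
                        pred[e->to] = x;
                    }
                } else {
                    stop = 1;                 /* main thread moved on */
                }
            }
        }
    }
}

With a conflict resolution policy that favors the main thread, the aborts triggered by the write of done = 1 land on the helpers, mirroring the behavior described above.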
3 Experimental Evaluation

3.1 Experimental setup

We evaluated the performance of the various implementations of Dijkstra's algorithm through full-system simulation, using the Wisconsin GEMS toolset v2.1 [1, 16] in conjunction with the Simics v3.0.31 [15] simulator. Simics provides functional simulation of a SPARC chip multiprocessor (CMP) that boots unmodified Solaris 10. The GEMS Ruby module provides detailed memory system simulation and, for non-memory instructions, behaves as an in-order single-issue processor, executing one instruction per simulated cycle.

Hardware TM is supported in GEMS through the LogTM-SE subsystem [20]. It is built upon a single-chip CMP system with private per-processor L1 caches and a shared L2 cache. It features eager version management, where transactions write the new memory values in place, after saving the old values in a log. It also supports eager conflict detection: conflicts, i.e., overlaps between the write set of one transaction and the write or read set of other concurrent transactions, are detected at the very moment they happen. On a conflict, the offending transaction stalls and either retries its request, hoping that the other transaction has finished, or aborts if LogTM detects a potential deadlock. The aborting processor uses its log to undo the changes it has made and then retries the transaction. In our experiments we used the HYBRID conflict resolution policy, which tends to favor older transactions over younger ones. Table 1 shows the configuration of the simulation framework.

Processor (Simics)   UltraSPARC III Cu (III+), configurations up to 32 cores
L1 caches (Ruby)     Private, 64KB, 4-way set-associative, 64B line size, 4-cycle hit latency
L2 cache (Ruby)      Unified and shared, 8 banks, 2MB, 4-way set-associative, 64B line size, 10-cycle hit latency
Memory               160-cycle access latency
TM system            HYBRID resolution policy, 2Kb HW signatures

Table 1: Simulation framework.
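As a conceptual illustration of the eager version management described above (a toy software analogue, not the GEMS/LogTM-SE interface): each transactional write first saves the old value to a log and then updates memory in place, so an abort simply replays the log backwards.

#include <stddef.h>

#define LOG_MAX 256

typedef struct {
    int   *addr[LOG_MAX];     /* logged locations      */
    int    old [LOG_MAX];     /* their previous values */
    size_t n;
} undo_log_t;

static void tx_write(undo_log_t *log, int *addr, int val)
{
    log->addr[log->n] = addr; /* save the old value first ...    */
    log->old [log->n] = *addr;
    log->n++;
    *addr = val;              /* ... then write in place (eager) */
}

static void tx_abort(undo_log_t *log)
{
    while (log->n > 0) {      /* undo in reverse order */
        log->n--;
        *log->addr[log->n] = log->old[log->n];
    }
}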
Our programs use the Pthreads library for thread creation and for synchronization operations such as spin-locking and barriers. For our perfect barriers, we encoded a global barrier as a single assembly instruction, exploiting the functionality offered by Simics magic instructions. Thus the synchronization of threads is handled by the simulator rather than the operating system, providing instant suspension/resumption of the arriving/departing threads. To avoid resource conflicts between our programs and the operating system's processes, we used CMP configurations with more processor cores than the number of threads we required. Thus, the experiments for 2 and 4 threads were performed on an 8-core CMP, while the 8-thread experiments were done on a 16-core CMP. To schedule a thread on a particular processor and avoid migrations, we used the pset_bind system call of Solaris. Finally, all codes were compiled with Sun's Studio 12 C compiler (at the -O3 optimization level).
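For reference, the thread-pinning step can be expressed with the Solaris processor-set API; a minimal sketch under the assumption of one dedicated processor set per thread (CPU ids and error handling are illustrative):

#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>
#include <sys/pset.h>
#include <stdio.h>

/* Confine the calling thread (LWP) to a single CPU via a processor set,
 * as done in the experiments to avoid thread migrations. */
int pin_current_thread(processorid_t cpu)
{
    psetid_t pset;

    if (pset_create(&pset) != 0) {            /* create an empty set */
        perror("pset_create");
        return -1;
    }
    if (pset_assign(pset, cpu, NULL) != 0) {  /* add one CPU to it   */
        perror("pset_assign");
        return -1;
    }
    /* Bind the calling LWP to the set; it will only run on that CPU. */
    if (pset_bind(pset, P_LWPID, P_MYID, NULL) != 0) {
        perror("pset_bind");
        return -1;
    }
    return 0;
}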
3.2 Reference graphs

To evaluate the different schemes we strove to work on graphs that vary in terms of density and structure. To that end, we used the GTgraph graph generator [4] to construct graphs with 10K vertices from the following families:
Random: Their m edges are constructed by choosing random pairs among the n vertices.
R-MAT: Constructed using the Recursive Matrix (R-MAT) graph model [7].
SSCA#2: Used in the DARPA HPCS SSCA#2 graph analysis benchmark [3].
Table 2 summarizes the characteristics of the graphs used. To obtain an estimate of possible speedups, we profiled the serial execution of Dijkstra's algorithm on each graph in order to calculate the distribution of the sequential (ExtractMin) and the parallelizable (DecreaseKey) parts. In the ideal case where parallel execution would manage to zero out the time spent on edge relaxations, the speedup would be 100% / (% sequential part); for example, rand2, whose sequential part is 38.7%, has an ideal speedup of 100/38.7 ≈ 2.58. This is presented in the sixth column of Table 2 and constitutes a theoretical upper bound for any performance improvement.

Graph   Edges   Parameters                   Sequential part (%)   Parallel part (%)   Ideal speedup
rand1   10K     -                            97.8                  2.2                 1.02
rand2   100K    -                            38.7                  61.3                2.58
rand3   200K    -                            26.2                  73.8                3.81
rmat1   10K     a = 0.45, b = c = 0.15,      78.3                  21.7                1.27
rmat2   100K    d = 0.25                     39.3                  60.7                2.54
rmat3   200K    (shared by all rmat graphs)  26.8                  73.2                3.73
ssca1   28K     (P, C) = (0.25, 5)           94.1                  5.9                 1.06
ssca2   118K    (P, C) = (0.5, 20)           35.2                  64.8                2.84
ssca3   177K    (P, C) = (0.5, 30)           28.9                  71.1                3.46

Table 2: Graphs used for experiments.

3.3 Results

Figure 5 shows the speedups of all evaluated schemes. Consistently with the discussion in Section 2, the concurrent thread accesses to shared data implemented with the aid of TM (cgs-tm and fgs-tm) clearly outperform the ones using traditional synchronization primitives (cgs-lock and fgs-lock). This fact reveals the existence of fine-grain parallelism in the updates of the priority queue of the algorithm, in the sense that, statistically, it is highly probable that the paths of various concurrent updates do not overlap. Thus, optimistic parallelism seems a good approach for Dijkstra's algorithm.

Nevertheless, only by employing perfect barriers can the TM-based schemes outperform the serial case. Therefore, the fine-grain parallelism exposed by the inner loop of the algorithm is insufficient to achieve significant speedups. On the contrary, the HT scheme (helper), which exploits parallelism at a coarser granularity, is able to achieve significant speedups in the majority of the cases (6 out of 9 experiments). The maximum speedup achieved is 1.46, as shown in Figure 5c.

For denser graphs, the performance improvements are greater, since more parallelism can be exposed in the inner loop of the algorithm. These are the cases where helper achieves the best speedups and scalability (Figures 5c, 5f and 5i). Conversely, sparse graphs leave limited room for parallelism, leading to low performance. They can therefore serve as test cases for the overhead of the co-existence of numerous threads.
Figure 5: Multithreaded speedups of all evaluated schemes (perfbar+cgs-lock, perfbar+cgs-tm, perfbar+fgs-lock, perfbar+fgs-tm, helper) versus number of threads (2-16), one panel per test graph (panels a-i), including rand-10000x10000, rand-10000x100000 and rand-10000x200000.
The results for these cases are shown in Figures 5a, 5d and 5g. It is obvious that the helper scheme is the most robust one, as it exhibits the smallest slowdown; in the worst case, the performance of the main thread is degraded by around 10%.

A closer look at the results reveals that the main thread suffers a very low number of aborts (less than 1% of the total aborts). This means that even when the helper threads are not contributing any useful work, they still do not obstruct the main thread's progress. Therefore, the main thread is allowed to run almost at the speed of the serial execution, which explains the robustness of the scheme. The low overhead of the helper scheme is also illustrated by the fact that the addition of more threads does not lead to performance drops in any case.
4 Related Work

A significant part of Dijkstra's execution time is spent on updates to the priority queue. Therefore, enabling concurrent accesses to this structure seems a good approach to increasing performance. Brodal et al. [5] utilize a number of processors to accelerate the DecreaseKey operation and discuss the applicability of their approach to Dijkstra's algorithm. However, this work is evaluated on a theoretical Parallel Random Access Machine (PRAM) execution model. Hunt et al. [13] implement a concurrent priority queue which is based on binary heaps and supports parallel insertions and deletions using fine-grain locking on the nodes of the binary heap. Since these operations do not traverse the entire data structure, local locking leads to performance gains. However, in the case of DecreaseKey, which performs wide traversals of the data structure, it degrades performance greatly, unless special hardware synchronization is supported by the underlying platform.

To expose more parallelism, it would be beneficial to concurrently extract a large number of nodes from the priority queue. This can be achieved if several nodes have equal distances from the set S of visited nodes. Thus, if the priority queue is organized into buckets of nodes with equal distances, then the extraction and neighbor updates can be done
in parallel per bucket (Dial's algorithm [10]). A generalization of Dial's algorithm, called Δ-stepping, is proposed by Meyer and Sanders [17]. Madduri et al. [14] use Δ-stepping as the base algorithm on the Cray MTA-2, an architecture that exploits fine-grain parallelism using hardware synchronization primitives, and achieve significant speedups. In the Parallel Boost Graph Library [11], Dijkstra's algorithm is parallelized for a distributed-memory machine. The priority queue is distributed across the local memories of the system nodes and the algorithm is divided into supersteps, in which each processor extracts a node from its local priority queue. The aforementioned approaches rely on significant modifications to Dijkstra's algorithm to enable coarse-grain parallelism and lead to promising parallel implementations. In this paper we adhere to the pure Dijkstra's algorithm in order to face the challenges of its parallelization and test the applicability of TM and HT.

TM has attracted extensive scientific research during the last few years, focusing mainly on its design and implementation details. Nevertheless, its efficacy on a wide set of real, non-trivial applications is only now starting to be explored. Scott et al. [18] use TM to parallelize Delaunay triangulation and Watson et al. [19] exploit it to parallelize Lee's routing algorithm. Moreover, a set of TM applications is offered in STAMP [6].

5 Conclusions - Future Work

This work applies several parallelization techniques to Dijkstra's algorithm, which is known to be hard to parallelize. The schemes that parallelize each serial step by incorporating traditional synchronization primitives (locks and barriers) fail to outperform the serial algorithm. In fact, they exhibit low performance even when the necessary barriers are replaced by ideal ones. To deal with this we employ Transactional Memory (TM), which reduces synchronization overheads but still fails to provide a meaningful overall performance improvement, as speedups are achieved in some test cases only when ideal barriers are used. To improve performance further, we propose an implementation based on TM and Helper Threading, which is able to provide significant speedups (reaching up to 1.46) in the majority of the simulated cases.

As future work, we will investigate the application of these techniques to other algorithms that solve the SSSP problem, such as Δ-stepping [17] and Bellman-Ford [9]. We also aim to explore the impact of various TM characteristics on the behavior of the presented schemes, such as the conflict resolution policy, version management and conflict detection. Finally, preliminary results demonstrated interesting variations in the available parallelism between different execution phases, motivating us to explore schemes that are more adaptive in terms of the number of parallel threads and the tasks assigned to them.

References

[1] Wisconsin Multifacet GEMS simulator. http://www.cs.wisc.edu/gems/.
[2] A.-R. Adl-Tabatabai, C. Kozyrakis, and B. Saha. Unlocking concurrency: Multicore programming with transactional memory. ACM Queue, 4(10):24-33, 2006.
[3] D.A. Bader and K. Madduri. Design and implementation of the HPCS graph analysis benchmark on symmetric multiprocessors. In HiPC, 2005.
[4] D.A. Bader and K. Madduri. GTgraph: A suite of synthetic graph generators. 2006. http://www.cc.gatech.edu/kamesh/GTgraph/.
[5] G.S. Brodal, J.L. Traff, C.D. Zaroliagis, and I. Stadtwald. A parallel priority queue with constant time operations. Journal of Parallel and Distributed Computing, 49:4-21, 1998.
[6] C. Cao Minh, J. Chung, C. Kozyrakis, and K. Olukotun. STAMP: Stanford Transactional Applications for Multi-Processing. In IISWC, 2008.
[7] D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-MAT: A recursive model for graph mining. In ICDM, 2004.
[8] J.D. Collins, H. Wang, D.M. Tullsen, C. Hughes, Y. Lee, D. Lavery, and J.P. Shen. Speculative precomputation: Long-range prefetching of delinquent loads. In ISCA, 2001.
[9] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, 2001.
[10] R. Dial. Algorithm 360: Shortest path forest with topological ordering. Communications of the ACM, 12:632-633, 1969.
[11] N. Edmonds, A. Breuer, D. Gregor, and A. Lumsdaine. Single-source shortest paths with the Parallel Boost Graph Library. In 9th DIMACS Implementation Challenge, 2006.
[12] M. Herlihy and E. Moss. Transactional memory: Architectural support for lock-free data structures. In ISCA, 1993.
[13] G.C. Hunt, M.M. Michael, S. Parthasarathy, and M.L. Scott. An efficient algorithm for concurrent priority queue heaps. Information Processing Letters, 60:151-157, 1996.
[14] K. Madduri, D.A. Bader, J.W. Berry, and J.R. Crobak. Parallel shortest path algorithms for solving large-scale instances. In 9th DIMACS Implementation Challenge, 2006.
[15] P.S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A full system simulation platform. Computer, 35(2):50-58, 2002.
[16] M. Martin, D. Sorin, B. Beckmann, M. Marty, M. Xu, A. Alameldeen, K. Moore, M. Hill, and D. Wood. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset. SIGARCH Comput. Archit. News, 2005.
[17] U. Meyer and P. Sanders. Delta-stepping: A parallel single source shortest path algorithm. In ESA, 1998.
[18] M.L. Scott, M.F. Spear, L. Dalessandro, and V.J. Marathe. Delaunay triangulation with transactions and barriers. In IISWC, 2007.
[19] I. Watson, C. Kirkham, and M. Lujan. A study of a transactional parallel routing algorithm. In PACT, 2007.
[20] L. Yen, J. Bobba, M.R. Marty, K.E. Moore, H. Volos, M.D. Hill, M.M. Swift, and D.A. Wood. LogTM-SE: Decoupling hardware transactional memory from caches. In HPCA, 2007.
[21] W. Zhang, B. Calder, and D.M. Tullsen. An event-driven multithreaded dynamic optimization framework. In PACT, 2005.