SPARTA: High-Level Synthesis of Parallel Multi-Threaded Accelerators
GIOVANNI GOZZI, MICHELE FIORITO, and SERENA CURZEL, Politecnico di Milano,
Milan, Italy
CLAUDIO BARONE, VITO GIOVANNI CASTELLANA, MARCO MINUTOLI,
and ANTONINO TUMEO, Pacific Northwest National Laboratory, Richland, WA, USA
FABRIZIO FERRANDI, Politecnico di Milano, Milan, Italy
This article presents a methodology for the Synthesis of PARallel multi-Threaded Accelerators (SPARTA) from
OpenMP annotated C/C++ specifications. SPARTA extends an open-source HLS tool, enabling the generation
of accelerators that provide latency tolerance for irregular memory accesses through multithreading, support
fine-grained memory-level parallelism through a hot-potato deflection-based network-on-chip (NoC), support
synchronization constructs, and can instantiate memory-side caches. Our approach is based on a custom
runtime OpenMP library, providing flexibility and extensibility. Experimental results show high scalability
when synthesizing irregular graph kernels. The accelerators generated with our approach are, on average,
2.29× faster than those generated by state-of-the-art HLS methodologies.
CCS Concepts: • Hardware → High-level and register-transfer level synthesis; Hardware accelerators;
Additional Key Words and Phrases: Design automation, FPGA architecture, graph algorithms, parallelism
This work has been partially supported by the Spoke 1 (FutureHPC & BigData) of the Italian Research Center on
High-Performance Computing, Big Data and Quantum Computing (ICSC) funded by MUR Mission 4 - Next Generation EU,
by the Compiler Frameworks and Hardware Generators to Support Innovative US Government (IUSG) Designs project at
Pacific Northwest National Laboratory (PNNL), and by the Adaptive Tunability for Synthesis and Control via Autonomous
Learning on Edge (AT SCALE) Laboratory Directed Research and Development (LDRD) Initiative.
Authors’ Contact Information: Giovanni Gozzi (corresponding author), Politecnico di Milano, Milan, Italy; e-mail:
[email protected]; Michele Fiorito, Politecnico di Milano, Milan, Italy; e-mail: [email protected]; Serena
Curzel, Politecnico di Milano, Milan, Italy; e-mail: [email protected]; Claudio Barone, Pacific Northwest National
Laboratory, Richland, WA, USA; e-mail: [email protected]; Vito Giovanni Castellana, Pacific Northwest National
Laboratory, Richland, WA, USA; e-mail: [email protected]; Marco Minutoli, Pacific Northwest National
Laboratory, Richland, WA, USA; e-mail: [email protected]; Antonino Tumeo, Pacific Northwest National
Laboratory, Richland, WA, USA; e-mail: [email protected]; Fabrizio Ferrandi, Politecnico di Milano, Milan, Italy;
e-mail: [email protected].
This work is licensed under a Creative Commons Attribution International 4.0 License.
© 2024 Copyright held by the owner/author(s).
ACM 1936-7414/2024/12-ART9
https://doi.org/10.1145/3677035
ACM Transactions on Reconfigurable Technology and Systems, Vol. 18, No. 1, Article 9. Publication date: December 2024.
1 Introduction
Field Programmable Gate Arrays (FPGAs) are reconfigurable devices that can be programmed
using a hardware description language (HDL), such as Verilog or VHDL, to implement spe-
cialized accelerators. FPGAs have become a key component of heterogeneous high-performance
computing systems, often coupled to other types of accelerators, such as GPUs, to accelerate
data-intensive tasks or parts of applications that require low latency processing. High-level
synthesis (HLS) tools simplify the process of developing FPGA-based accelerators by automati-
cally translating software descriptions (typically C or C++ code) into HDL. These tools promise to
hide hardware-specific details from the user while still obtaining high quality of results (QoR) in
terms of latency, area usage, and clock frequency [42].
However, existing HLS tools mainly target regular, arithmetic-intensive workloads operating on
dense data structures typical of digital signal processing (DSP) applications, which present an
abundance of instruction-level parallelism. Solutions that generate accelerators from parallel high-
level descriptions leverage data-parallel programming models that rely on regular computation
and easily partitionable data sets (e.g., OpenCL), as they mainly replicate identical computational
units and do not support synchronization-intensive programs well [14]. Driven by the interest in
big data and graph analytics, several approaches attempt to support the acceleration of parallel
applications with irregular behaviour. These include solutions that target specific graph algorithms,
approaches that try to optimize linear algebra operations (e.g., sparse matrix-vector and sparse
matrix-matrix multiplication [17]), and frameworks based on the Gather-Apply-Scatter (GAS)
paradigm [33]. These solutions are highly specialized: they focus on specific graph algorithms, or
they only use HLS to generate and optimize computational engines that need to be supported
by a software framework; the actual execution of graph algorithms mapped on such accelera-
tors might thus increase in computational complexity with respect to vertex- or edge-centered
formulations.
The OpenMP programming model for shared memory systems has often been successfully used
to design parallel irregular algorithms [7, 36, 38] thanks to its task-level and loop-level parallel
constructs. Research on HLS previously explored the use of OpenMP annotations to synthesize
applications with coarse-grained task-level parallelism [15, 47]; however, these approaches employ
multi-level designs driven by a microcontroller and cannot manage complex synchronization con-
structs. Furthermore, they still focus on generating parallel accelerators from programs with regular
behaviours. The Svelto methodology [39] uses OpenMP annotations as a starting point to synthesize
parallel multi-threaded accelerators, targeting the generation of accelerators starting from specific
patterns of parallel graph algorithms. In this article, we propose a more general methodology
for the Synthesis of PARallel multi-threaded Accelerators (SPARTA), providing automated
translation of parallel specifications written in OpenMP into efficient parallel accelerators.
SPARTA adopts a fundamentally different and more flexible approach than previous method-
ologies generating accelerators from OpenMP code. Instead of mapping OpenMP parallel regions
marked by compiler directives (pragmas) to a pre-defined parallel architecture template, it trans-
lates pragmas into calls to OpenMP runtime primitives and implements corresponding low-level
hardware components to be used in the synthesis backend. Accelerators generated with SPARTA
can exploit both spatial and temporal parallelism, hiding the latency of external memory accesses
through context switching (CS) across multiple execution threads. Moreover, SPARTA can in-
stantiate an advanced hierarchical memory subsystem composed of a custom Network-on-Chip
(NoC) that connects multiple external memory channels to each accelerator, memory-side caching,
and on-chip private memories for each accelerator. These features allow SPARTA to deal with both
regular and irregular behaviours in parallel algorithms and to generate efficient accelerators in
terms of performance per area.
We evaluated SPARTA by synthesizing typical parallel graph algorithms, verifying the scalability
and the QoR of the generated designs as the numbers of parallel accelerators and contexts change.
We also compared accelerators for graph database queries generated with SPARTA to the ones
generated with Svelto, obtaining, on average, a 2.29× improvement in performance with similar
architectural parameters (number of accelerators, contexts, and memory channels).
In summary, this article makes the following contributions:
— A synthesis methodology to generate multi-threaded parallel accelerators (SPARTA) support-
ing OpenMP work-sharing and synchronization constructs through the implementation of
low-level primitives from the OpenMP runtime system;
— Support for both spatial and temporal parallelism, with multiple threads that can pause and
resume execution on a core when needed;
— A highly configurable memory system that mitigates the latency of external memory accesses.
The rest of the article is organized as follows: Section 2 reviews related work on the synthesis of
FPGA accelerators from parallel programming languages. Sections 3 and 3.1 describe the SPARTA
architecture and synthesis flow. Section 4 presents experimental results and Section 5 draws
conclusions for the article.
2 Related Work
SPARTA generates efficient, high-throughput parallel accelerators starting from OpenMP code.
This approach is particularly suited to irregular parallel applications such as graph algorithms,
thanks to the shared memory abstraction, the enhanced memory subsystem, and the combined use
of spatial and temporal multithreading. The work touches on several areas related to the use of
reconfigurable architectures and HLS methodologies for parallel applications.
Table 1. Qualitative Comparison between SPARTA and State-of-the-Art Methodologies for the Synthesis of
Parallel Specifications
Asterisks represent how much the input code needs to be modified with respect to a software implementation.
through a multi-ported memory controller, and it supports synchronization through atomic mem-
ory operations. The main limitation of Svelto is that it can only map a small subset of OpenMP
constructs to an architectural template, limiting the set of supported OpenMP pragmas (e.g., atomic
memory operations are supported, but critical sections and barrier synchronization are not). More-
over, supported parallel constructs become kernel functions that do not allow inter-functional
optimizations, local memories within each parallel core are not supported, and the multi-ported
memory controller in the template has limited scalability.
Table 1 summarizes the considerations made in this section to highlight the differences between
SPARTA and previous works. Looking at the supported input languages, we tried to identify how
much a user would have to modify an input software implementation to use a certain tool: for
example, FCUDA requires the code to be heavily reorganized into data management and compute
sections, Svelto requires the body of the OpenMP for loop to be outlined in a separate function, and
SPARTA can process standard, unmodified OpenMP code. Methodologies that target a host+FPGA
system use the processor to coordinate the execution of the accelerators, while SPARTA implements
scheduling and synchronization within the accelerator itself. Finally, SPARTA supports different
work-sharing and synchronization constructs, and nested parallelism; the addition of new constructs
is simplified by using the low-level primitives of the OpenMP runtime.
2.2 Accelerator Designs for Graph Algorithms
Graph processing is an important element of data analytics, and it presents unique challenges (large
datasets, unpredictable fine-grained memory accesses, unbalanced task-level parallelism) that are
not well addressed by commodity processors relying on advanced cache architectures, data locality,
and regular computation patterns, and might thus benefit from domain-specific accelerators [22].
Alongside proposals for processing-in-memory [3], flash-based [29], ASIC [25], and ReRAM-based
[52] accelerators, multiple designs have targeted FPGAs to implement graph processing primitives.
The FPGA co-processors on the hybrid Convey HC systems have been used to accelerate breadth-
first search (BFS) [8], while [58] exploits an FPGA-Hybrid Memory Cube platform. Optimized
implementations of PageRank (PR) and random walk are proposed in [59] and [20], respectively.
GraphOps [45] provides a series of building blocks to implement graph analytics applications
following a dataflow approach. Several designs are based on arrays of parallel processing engines
[8, 13, 55]; some of them support the implementation of vertex-centric [44] or edge-centric [60, 61]
graph algorithms. ScalaGraph [56] proposes a tile-based architecture that improves the scalability
of the system as the number of processing elements increases. GraphLily [26] supports graph
algorithms by accelerating their linear algebra formulation (i.e., by implementing custom Sparse-
Matrix-Vector and Sparse-Matrix-Matrix multiplications units) and [2] provides a template for a
GAS accelerator, requiring users to rewrite their applications and pre-partition the data.
All these works propose fixed or templated designs available as register-transfer level compo-
nents, manually developed either with HDL or through HLS. Specializing the HLS process for
the acceleration of graphs is not considered; thus, such approaches can only provide a limited
library of primitives, and input applications may need to be restructured to allow mapping the
computation onto the accelerators.
Fig. 2. Translation of OpenMP pragmas into calls to the SPARTA library functions.
require a different datapath and controller. We implemented OpenMP constructs that can be replaced
with arithmetic and logic operations (e.g., the ones that compute the range of loop iterations belonging
to a thread) using standard functional units, as in the sketch below.
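For illustration, a minimal C sketch of such a construct (static partitioning of loop iterations; function and variable names are purely illustrative):

    /* Sketch: static partitioning of n loop iterations across num_threads threads,
       reducible to a few arithmetic/logic operations in hardware. */
    void iter_range(int tid, int num_threads, int n, int *lb, int *ub) {
        int chunk = (n + num_threads - 1) / num_threads;  /* ceiling division */
        *lb = tid * chunk;                                 /* first iteration of this thread */
        *ub = (*lb + chunk < n) ? *lb + chunk : n;         /* one past the last iteration */
    }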
Figure 2 shows how the SPARTA front end translates an OpenMP annotated code (Figure 2(a))
into an IR with calls to the SPARTA library functions (Figure 2(b)). The code in Figure 2(a) denotes
a parallel region with M threads where each thread executes an iteration of a for loop. The IR
in Figure 2(b) shows how the transformations outline the loop body into a separate function
(omp_outlined) and modify the OpenMP runtime library calls. The new function signature contains
all parameters to execute an iteration of the loop: the thread ID and a reference to the arrays A, B,
and res. The function body is instrumented with API calls to the SPARTA Bambu library functions
that retrieve the thread ID and synchronize with other threads.
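A minimal sketch of this transformation follows; the __sparta_* names are illustrative placeholders rather than the actual SPARTA/Bambu library API:

    /* (a) Input, in the style of Figure 2(a): a parallel region with M threads,
       each executing one iteration of the loop. */
    #define M 8
    void vector_add(int *A, int *B, int *res) {
        #pragma omp parallel for num_threads(M)
        for (int i = 0; i < M; ++i)
            res[i] = A[i] + B[i];
    }

    /* (b) IR after outlining, in the style of Figure 2(b). */
    extern int __sparta_get_thread_id(void);   /* assumed call: retrieve the hardware thread ID */
    extern void __sparta_barrier(void);        /* assumed call: synchronize with the other threads */

    void omp_outlined(int tid, int *A, int *B, int *res) {
        int t = __sparta_get_thread_id();
        res[t] = A[t] + B[t];                  /* outlined loop body, one iteration per thread */
        __sparta_barrier();
    }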
The instrumented IR containing the API calls that control the parallel execution is subsequently
optimized with hardware-oriented code transformations that facilitate the generation of the parallel
architecture in the HLS backend.
3.1 The SPARTA Architecture
SPARTA generates a modular architecture following the content of the input OpenMP specification
that can be customized through user-defined parameters. The main components are depicted in
Figure 3: sequential parts of the computation go through the standard HLS flow, while parallel
constructs are translated into a set of parallel processing units called cores. Each core contains a
finite state machine with datapath (FSMD) implementing the computation described within
an OpenMP parallel region (e.g., the loop body in an omp parallel for loop), a CS manager
in case multiple threads are assigned to the same core, and local memory for data belonging to
those threads. The number of cores and the number of threads (contexts) assigned to each core are
configurable. The parallel region of the accelerator can also contain synchronization components,
arbiters, and data shared among threads if needed. The accelerator is connected to a NoC that
relays messages to an external memory (with support for multi-banked or pipelined memories);
the number of channels between the accelerator and the NoC, as well as between the NoC and the
external memory, is configurable. The user encodes information about parallelism, work-sharing
policies, and synchronization requirements through the OpenMP annotations in the code and
configures the rest of the architecture through command-line options.
3.1.1 Spatial and Temporal Multithreading. Cores in a SPARTA accelerator execute concurrently,
using the FPGA resources to exploit coarse-grained parallelism; instruction-level parallelism is
extracted by the HLS tool during the implementation of the FSMD within each core. SPARTA accel-
erators can also support temporal multithreading by interleaving the execution of multiple threads
on the same core. Instead of stalling execution and leaving the core underutilized when a thread
needs to perform long-latency operations (typically external memory accesses or synchronization,
which might take tens of clock cycles), the core can perform a CS and execute a different thread
with the same datapath, e.g., a different iteration of the same omp for loop.
Each core with CS enabled features a CS manager module to handle the active context and switch
to a new one when a memory or synchronization operation starts. For example, on a read/write
request to the external memory, the CS manager saves the active context (i.e., the thread state)
and starts the execution of a different context in the next clock cycle. Once the core receives the
memory response, the CS manager flags the previously paused context as ready and eventually
resumes its execution when the currently active context is paused or completed. The CS operation
has no overhead: in the clock cycle before the switch, the core executes normally, and at the start
of the next clock cycle, the new context restarts from the state it would have reached after the
operation that suspended it. The CS manager always performs a switch when the active context starts a
wait synchronization operation. This avoids deadlocks when the waited thread executes on the
same core as the waiting thread, as the CS manager can flag the waiting thread as ready when the
response from the waited thread is received. The CS manager also has a set of registers for each
context to store the signals received while threads are not ready. When responses arrive at a core,
the thread in execution is generally not the one that performed the request: the value is saved and
forwarded to the actual thread originating the request when its execution is resumed. In this way,
the CS functionality is entirely transparent to the rest of the accelerator and does not force other
components to send signals multiple times or stall their execution.
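A purely illustrative software model of this policy (the actual CS manager is a hardware module) could be:

    /* Illustrative C model of the CS manager behaviour described above. */
    enum ctx_state { READY, WAITING, DONE };

    typedef struct {
        enum ctx_state state;
        unsigned saved_response;   /* response received while the context was not running */
    } context_t;

    /* A long-latency operation suspends the active context... */
    void on_long_latency_op(context_t *ctx, int active) {
        ctx[active].state = WAITING;
    }

    /* ...and the response flags the originating context as ready again,
       storing the value until that context resumes. */
    void on_response(context_t *ctx, int origin, unsigned value) {
        ctx[origin].saved_response = value;
        ctx[origin].state = READY;
    }

    /* The next context to run is the ready one with the lowest ID. */
    int pick_next(context_t *ctx, int num_ctx) {
        for (int c = 0; c < num_ctx; ++c)
            if (ctx[c].state == READY)
                return c;
        return -1;   /* no ready context: the core idles until a response arrives */
    }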
In the core datapath, registers store intermediate signals between combinational regions; when
many contexts share the same core, it is necessary to store a signal for each context. Each context
thus needs its own set of registers and a control signal to select the active set of input and output
registers; the new register set with the context selector control signal is commonly referred to as a
register file. Since register files can consume considerably more resources than standard registers,
it is fundamental to avoid them when possible. In the SPARTA architecture, CS can happen only
during memory accesses or synchronization calls, which are known at compile time: the synthesis
flow exploits this information to replace registers with register files only within control states
where CS can happen, while other registers are left unchanged. As an example, let us take into
account a core executing a floating-point add operation: two values are loaded from memory,
the sum of the two values is computed, and the result is stored back in memory. Load and store
operations can trigger a CS to hide the memory access latency; thus, registers connected to load
and store operations in this core must be replaced with register files to save signal states and later
restore them; instead, the floating-point sum can never be interrupted by a CS and thus none of the
registers present in the floating-point adder logic needs to be replaced with a register file.
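In C terms, the example above could be written as follows (a sketch); only the values that are live across the loads and the store, where a CS may occur, end up in register files:

    /* Sketch of the floating-point add example. */
    void fadd_kernel(volatile float *mem, unsigned i, unsigned j, unsigned k) {
        float a = mem[i];   /* external load: a CS may occur here, so a is kept in a register file */
        float b = mem[j];   /* external load: a CS may occur here, so b is kept in a register file */
        float c = a + b;    /* the FP adder is never interrupted: its internal registers stay plain */
        mem[k] = c;         /* external store: a CS may occur here */
    }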
Fig. 4. Paths of memory requests from Core 0 to (1) the local memory of Core N, (2) on-chip shared memory,
and (3) external memory.
logic. However, a linear reduction operation often requires five or six cycles per thread to be
merged, which is negligible compared with the duration of the computations. For this
reason, when multiple contexts can be activated, the policy of the CS manager is to choose the
thread with the lower thread ID: this gives priority to thread 0, which is the one that could perform
a reduction operation, and then to the other threads in order since the reduction proceeds following
the thread IDs.
In the OpenMP paradigm, the critical pragma is used to identify a set of instructions that can
only be executed by a single thread at a time. SPARTA adds a critical manager, which behaves like
a mutex, and a local component for each thread that encounters the critical section. When a thread
reaches the critical section, the local component sends a signal to request control of the lock, and
the critical manager grants it only if it is not currently assigned to another thread. If the lock is
not acquired by the current thread, the CS manager performs a CS. The critical manager stores the
state of the requests and the current lock owner and, like the CS manager, prioritizes the thread
with a lower ID when multiple threads require access. When the thread finishes the execution of
the critical section, it sends a signal to release the acquired lock to avoid deadlock and allow other
threads to resume their execution.
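Such a section is written in standard OpenMP; for example (an illustrative kernel, not taken from the benchmarks):

    /* Each thread serializes its updates to the shared array through the critical section. */
    void histogram(const int *keys, int n, int *hist) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            #pragma omp critical
            {
                hist[keys[i]]++;
            }
        }
    }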
Fig. 5. Memory address format for an architecture with P parallel regions, T threads per region, X bytes
of local memory per thread, and Y bits of external memory. The external bit is set if the addressed item is
allocated outside the accelerator.
one channel and connects cores within the same parallel region, while the main accelerator bus
connects parallel regions with the on-chip shared memory and with the NoC toward the external
memory. Each memory request originating from one of the parallel cores goes through a filter;
the filter interprets its address and properly routes the request to internal BRAMs of other cores
through the local bus or outside the parallel region through the main bus. Since BRAMs have two
channels, local memories can support concurrent access from both the owner core and the local
bus. SPARTA can connect the computation units directly to the local memory without using the
local bus if it can infer at compile time the address of the requests and the variable is allocated
inside the same core.
SPARTA memory addresses encode the physical location of the memory module where data is
allocated; Figure 5 shows the structure of external and internal memory addresses, which have
the same length because they travel on the same local bus. The size of the address depends on
the external memory size (which is likely considerably larger than internal memory) plus one bit
that is added to differentiate between internal and external addresses. This allows a very simple
decoding logic: if the external bit is equal to 1, the request is propagated to the external memory.
For internal memory, a dedicated pagination algorithm allocates each memory variable considering
the parallel region and thread it belongs to. The address of internal variables is composed of three
parts, representing the address of the variable in the memory module, the thread that owns the
variable, and the parallel region in which the memory module is allocated, respectively. To build
the address, SPARTA first finds the largest internal memory module and computes the number of
bits needed to represent it, so this number of bits can be reserved to uniquely identify a variable
inside each memory module. Similarly, the size of the second part of the address is set by looking
at the largest number of threads assigned to the different parallel regions. Finally, a unique code is
assigned to each parallel region, which is used by the filters to determine whether an address is
local to the parallel region. Looking at the appropriate bits of the address, filters and arbiters in
the memory interconnection system can properly forward a memory request to a different bus or
component using combinatorial logic without adding delays to the requests. For example, if the
parallel region ID in an address matches the one stored in the filter, the request is forwarded to the
local bus, where the arbiter will find the correct core by looking at the thread ID. Instead, if the
parallel region ID does not match or if the address has the external bit set to one, the request is
sent to the main bus.
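A behavioural sketch of this decoding is shown below; the field widths are parameters derived as described above, and the bit ordering is illustrative:

    #include <stdint.h>

    /* Illustrative decoding of a SPARTA memory address (Figure 5). */
    typedef struct { unsigned external, region, thread, offset; } sparta_addr_t;

    sparta_addr_t decode_addr(uint64_t addr,
                              unsigned off_bits,     /* bits for the largest local memory module */
                              unsigned thread_bits,  /* bits for the largest thread count per region */
                              unsigned region_bits)  /* bits for the parallel region code */
    {
        sparta_addr_t a;
        a.offset   =  addr & (((uint64_t)1 << off_bits) - 1);
        a.thread   = (addr >> off_bits) & (((uint64_t)1 << thread_bits) - 1);
        a.region   = (addr >> (off_bits + thread_bits)) & (((uint64_t)1 << region_bits) - 1);
        a.external = (addr >> (off_bits + thread_bits + region_bits)) & 1;
        /* external == 1: forward to the external memory; otherwise compare region with the
           filter's own ID to route the request to the local bus or to the main bus. */
        return a;
    }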
If two requests arrive in the same cycle, the bus arbiter can only accept one of them, and the
other must be stalled; in this case, SPARTA generates a back pressure signal between arbiters and
cores to resend requests that could not be accepted in the current cycle. With only one context in
execution on the core, execution has to be stalled as it needs the result of the memory operation to
proceed, but when multiple contexts are assigned to the same core, another context can be executed
Fig. 6. The architecture of a unidirectional butterfly NoC component with 24 nodes organized in 3 levels.
in the meantime. A First-In-First-Out (FIFO) queue holds requests before the memory filter so
that the CS manager can switch the running context at the start of each memory request instead of
when an arbiter accepts it; this prevents the effects of the arbitration from having repercussions on
the core execution, as the FIFO is responsible for resending the requests to the arbiter while the
core can continue its operation with another context.
Table 2. Signals of the minimal memory interface.

Signal                     Meaning
Mout_oe_ram                Read operation (1 bit)
Mout_we_ram                Write operation (1 bit)
Mout_addr_ram              Address of the memory operation
Mout_Wdata_ram             Value to write
Mout_data_ram_size         Size in bits of the value to write
Mout_accelerator_context   Tag of the memory operation (parallel region ID and context ID)
Mout_back_pressure         The source has to resend the request (1 bit)
in the first two levels. The NoC is resilient to congestion: if two requests want to go to the same
node, there is no need to stall, as the node can redirect one of them on another route to the same
destination. The NoC can thus accept multiple requests: each node has two internal inputs and two
internal output connections, and an I/O node can take a request if one of its internal inputs is not
active in the current cycle. Registered nodes can also store a request, while combinatorial nodes
forward them immediately and do not store information. If both input internal connections are
active, the I/O node of the NoC sends a back pressure signal to resend the request in the next cycle.
When using multiple contexts, back pressure signals do not interfere with the execution of the
cores as they are intercepted in the FIFO modules, which take the role of resending the requests, as
described in Section 3.3.
The SPARTA architecture allows the accelerator to connect to different types of external memory,
with simple or pipelined memory access, and for both unified and banked multi-channel memories.
The accelerator does not assume that the memory is immediately ready to accept a request and
uses a protocol with back pressure that can be used to synchronize with it. Furthermore, SPARTA
cores work under the assumption that answers from the memory could come out of order, as traffic
in the NoC or the external memory could change the relative order of requests. Each request is
tagged with an ID used to identify the thread that performed it and correctly forward the answer.
If the external memory is banked, it is possible to add one cache in front of each bank to further
reduce the average latency of the memory accesses. (There is no need to introduce a cache consis-
tency mechanism, as each cache has its own set of addresses associated with a different memory
bank.)
We introduced a set of custom pragmas and parameters to specify information about the accel-
erator interfaces for variables that will be stored in external memory and model their behaviour
during the simulation of the generated accelerator (Table 3). Specific variables can be allocated to
specific banks to mimic access patterns of real memories. For example, #pragma HLS_interface
a banked bank_number = 16 bank_chunk = 256 specifies that the a variable needs to be stored
in a multi-banked external memory that has 16 banks of 256 bytes each.
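For instance, the pragma above would accompany the declaration of the variable it refers to (a sketch; the array size, kernel, and placement of the pragma are illustrative):

    int a[1 << 16];   /* allocated in external memory */
    #pragma HLS_interface a banked bank_number = 16 bank_chunk = 256

    void scale(int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            a[i] *= 2;
    }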
SPARTA accelerators, by default, exchange data with other accelerators and the external memory
using a simple and efficient protocol called minimal memory interface, described in Table 2. The
protocol requires a limited number of signals and logic elements to perform memory operations,
reducing the size of internal components (particularly the NoC). However, SPARTA also supports
more complex protocols such as AXI4 [4] by adding a bridge component before each memory bank
that translates each request and response from the minimal protocol to AXI4. The translation is
completely transparent to the accelerator, correctly propagating back pressure signals from and to
the external memory.
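An illustrative software view of one request on this interface follows; the signal names mirror Table 2, and the widths are examples:

    #include <stdint.h>

    /* Illustrative model of a request on the minimal memory interface. */
    typedef struct {
        unsigned oe : 1;     /* Mout_oe_ram: read operation */
        unsigned we : 1;     /* Mout_we_ram: write operation */
        uint64_t addr;       /* Mout_addr_ram: address of the memory operation */
        uint64_t wdata;      /* Mout_Wdata_ram: value to write */
        unsigned size;       /* Mout_data_ram_size: size in bits of the value to write */
        unsigned tag;        /* Mout_accelerator_context: parallel region ID and context ID */
    } min_mem_request_t;

    /* Mout_back_pressure travels in the opposite direction: when asserted, the source
       (or the FIFO in front of the core) resends the request in the next cycle. */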
Table 3. Parameters describing the external memory interface.

Parameter         Usage
Banked            Specify that the memory is banked
Bank number       Specify the number of banks
Bank allocation   Force a variable to be allocated on a specified group of banks
Bank chunk        Specify the size in bytes of the bank blocks
Cache             Adds caches to the accelerator
Cache size        Specify the size of the cache word
Cache depth       Specify the number of cache lines
AXI               Specify that the memory protocol is AXI4
4 Experimental Evaluation
We showcase the results that can be achieved with SPARTA through four experiments: the
first consists of synthesizing three graph processing benchmarks to show speedup and scalability
while varying the number of cores and contexts (Section 4.1); the second is a direct comparison between
SPARTA and Svelto [39] on parallel database queries (Section 4.3); the third is a comparison between
SPARTA accelerators and kernels from the graph acceleration library of the commercial tool Vitis
HLS (Section 4.4); and the last one (Section 4.5) is a comparison with LegUp [15]. We also include in
Section 4.2 a focus on the effect of different NoC configurations and back pressure conditions on
the performance of the Graph Algorithm Platform Benchmark Suite (GAPBS) accelerators.
We use a custom version of Bambu HLS to run the experiments, which is available as an executable
binary.1 The following considerations are common to all tests: we assume that the accelerators are
connected to an external memory with 16 banks and one channel per bank, and that the memory delay
is constant. We indicate with core the number of parallel accelerator cores and with C the number of
contexts on each core. We report area utilization in terms of lookup tables (LUTs) and registers.
SPARTA components also require one DSP block for each core and one BRAM for every two
contexts if reductions are present; Table 4 reports the area utilization of the main architecture
components to isolate the contribution of components that do not depend on the input application.
Most of the designs meet the imposed timing constraint after place-and-route except for the largest
configurations, where Vivado (the backend tool for AMD/Xilinx FPGAs) was not able to achieve
the requested clock period after the placement phase due to high utilization of the device resources.
the increase in performance over a sequential version of the application (one core, one context)
grows linearly with the number of cores, keeping the number of contexts constant; in fact, having
more cores that perform memory requests increases the utilization of the memory channels and
the overall performance. We can see the same effect with more contexts on the same core to hide
the memory latency: the configuration with eight contexts on one core has the same speedup as
the one with eight cores and a single context. The configuration with eight cores and eight contexts
is 36 times faster than the sequential version; this is slower than the theoretical speedup because
the memory access pattern is irregular, bringing multiple requests on the same memory bank at
the same time and thus producing delays and congestion on the NoC. LUTs and registers scale
differently with the number of cores/contexts, as can be seen in Figure 7(b): the number of LUTs
grows with the number of OpenMP threads (e.g., 7× increase with respect to the sequential version
with eight cores, eight contexts), while liveness analysis during the synthesis flow identifies the
minimum number of registers to be replicated for CS, so the number of registers is mostly linked
to the number of cores (4× more than in the sequential version).
The second GAPBS kernel we evaluated is PR, which assigns a value to each graph node based
on its relative importance compared to the other nodes. Figure 8(a) shows that the speedup over a
sequential baseline proportionally increases with the number of cores, similar to the CC benchmark.
However, in this case, increasing the number of contexts per core from four to eight does not lead
to the same performance improvement because some sections of PR are computation-bound rather
than memory-bound, leading to competition for parallel cores rather than stalls due to memory
access operations. With the largest configuration, we achieve a 30× speedup with 6.5× the area
occupation (considering LUTs; register occupation only increases 4.5 times).
The last GAPBS benchmark we considered is Triangle Count (TC), which counts the number
of triangles in the input graph. After assessing how performance and resource consumption scale
with varying numbers of cores and contexts (Figure 9) on the Alveo target, we used this benchmark
to perform a first comparison against the Svelto synthesis methodology [39]. There are a few
differences between the two methods that need to be considered. Svelto is based on a gcc/g++
frontend, while SPARTA uses Clang; the crossbar used in Svelto has a different behaviour with
respect to the NoC used in SPARTA, and the comparison is further complicated by the fact that
Svelto uses atomic-based synchronization while SPARTA implements OpenMP reductions. They
also operate under different assumptions regarding the behaviour of the memory (i.e., SPARTA
can use asynchronous external memories that accept multiple requests at the same time,
while Svelto only sends one request at a time), and Svelto dynamically assigns loop iterations to
the cores while SPARTA uses static scheduling. Nevertheless, Svelto remains the state-of-the-art
methodology closest to SPARTA, as described in Section 2.1, and it is worthwhile to examine
the QoR of accelerators generated with the two methods. Furthermore, since both use Bambu
as HLS backend, we can safely state that the comparison only refers to the differences in the
implementation of parallel accelerators rather than being influenced by different choices in the
scheduling, allocation, and code generation steps.
The results in Figure 10(a) show the relative performance of SPARTA against the Svelto baseline
with both tools targeting a Virtex 7 FPGA at 100 MHz; bars higher than one thus represent
configurations where SPARTA outperforms Svelto. While Svelto is faster on smaller configurations,
SPARTA can achieve a higher level of parallelization as the number of cores and contexts grows.
There are two main reasons for this: the first one is that the NoC in SPARTA is designed to be
effective when there are many requests from different sources, which is achieved when the number
of cores is higher; the second reason lies in the ability of SPARTA to allow multiple memory
requests simultaneously, which is critical to exploit memory-level parallelism when multiple
memory channels are available. In fact, SPARTA adds a tag to external memory requests, allowing
it to send multiple pipelined, asynchronous requests, while Svelto must wait for the memory to
answer before performing the next request. With a number of OpenMP threads greater than the
number of banks in the external memory, SPARTA can execute more requests in parallel than Svelto,
resulting in a higher speedup. Regarding area usage, as can be seen in the relative comparisons of
Figure 10(b) and (c), SPARTA consumes more resources than Svelto on the TC benchmark. However,
the area-delay product (ADP) of SPARTA scales better with the number of cores/contexts: using
the formula
\[
\frac{ADP_{SPARTA}}{ADP_{Svelto}} = \frac{LUTs_{SPARTA} \cdot Cycles_{SPARTA}}{LUTs_{Svelto} \cdot Cycles_{Svelto}},
\]
Fig. 10. Comparison between accelerators generated through SPARTA and Svelto on the TC benchmark.
we can see that, for example, the ADP of a SPARTA accelerator with 1 core and 1 context is 1.14
times the ADP of a Svelto accelerator with the same configuration, and the ratio drops to 0.47 with
32 cores and 8 contexts (SPARTA is 3.8 times faster using only 1.7 times the LUTs and 2.4 times
the registers). One reason for the higher resource consumption is that SPARTA is more general
than Svelto: SPARTA explicitly implements multiple OpenMP primitives and synchronization
mechanisms, while Svelto only supports a few patterns, and thus, it can make assumptions on the
NoC
Benchmark     Cycles    BP        #Mem. Operations   Avg. BP Delay (cycles)   Avg. NoC Latency (cycles)
TC_32Co_8cs   329159    964697    759812             1.27                     7.29
PR_32Co_8cs   129064    168259    467939             0.36                     4.42
CC_32Co_8cs   246424    2281643   1172090            1.95                     7.67

NoC Registered
Benchmark     Cycles    BP        #Mem. Operations   Avg. BP Delay (cycles)   Avg. NoC Latency (cycles)
TC_32Co_8cs   338701    508814    759812             0.67                     12.00
PR_32Co_8cs   132046    148133    467939             0.32                     8.03
CC_32Co_8cs   264787    1338848   1172090            1.14                     12.26

BP indicates the total number of back pressure signals received by the FIFO in front of the cores.
behaviour of the input program that simplifies the underlying architecture. Moreover, SPARTA
implements reduction operations that require fewer cycles and memory operations but more area
than the atomic operations implemented in Svelto. Finally, Svelto has a centralized dispatcher that
distributes loop iterations to the cores, while each core in SPARTA contains dedicated logic to
compute the current iteration index.
Fig. 11. Speedup of accelerators generated by SPARTA with respect to Svelto on the parallel queries.
with OpenMP. SPARQL is a query language for graph databases in the Resource Description
Framework (RDF). RDF databases are naturally described as single labelled graphs, and SPARQL
queries are defined as (sub)graph matching operations. The resulting code features nested for loops
with different behaviours (e.g., number of OpenMP threads, number of loop iterations assigned
to each thread) and irregular memory accesses. We compared results obtained with SPARTA and
Svelto on several different parallel queries and also ran the same experiments, adding a cache in the
SPARTA accelerators in front of each memory bank (512 lines, 512 bits per line). Figure 11 shows
the relative performance of SPARTA against the Svelto baseline, with and without caches; as before,
bars higher than one represent configurations where SPARTA outperforms Svelto.
The results we obtained are different depending on the specific query considered and the charac-
teristics of its workload. In Q1, SPARTA without caches is always slower than Svelto, as it contains a
small group of OpenMP threads, each performing a large number of operations, and we have shown
how SPARTA performs best with memory-bound workloads. Q4 and Q5 terminate in a few tens of
thousands of cycles and are not very indicative of the achievable speedup, as most cycles are spent
in synchronization operations. Data for Q5 with 8 cores and 8 contexts is missing as Svelto does not
terminate in that configuration. The results observed align with the ones from TC: SPARTA without
caches has worse performance when using few threads but improves when the number of cores
Fig. 12. Relative comparison of resource consumption between SPARTA and Svelto for query Q2. Results for
the other queries are omitted as they have very similar behaviour.
and contexts grows. In Q2, Q3, and Q7, SPARTA drastically outperforms Svelto, even considering
the higher resource utilization (Figure 12), with between 6 and 11.5 times faster execution. In Q1
and Q6 instead, Svelto is faster, and this can be explained by a lack of parallelism in the query: a lot
of the work has to be executed by a small subset of the threads and the dynamic assignment of
iterations in Svelto is better suited for this than the static division of work in SPARTA. However,
even in these cases, with enough cores and contexts, SPARTA can recover the disadvantage and at
least reach the same performance, although using a larger area.
When using caches, SPARTA is almost always faster than Svelto, and this is particularly effective
in the test cases where previously SPARTA was struggling: in both Q1 and Q6, the introduction of
caches in SPARTA makes the generated accelerators more than twice as fast as before. With the
largest configurations in Q2, Q3, and Q7, adding the cache reduces the speedup compared with
the base version of SPARTA: these tests are the ones in which static scheduling can divide the
computations between multiple cores and keep all of them working. Under these circumstances, the
main advantage of SPARTA is the ability to parallelize multiple requests running different contexts
and hide the memory latency while advancing the state of all the contexts simultaneously; adding
caches reduces the average memory latency, and it can cause a loss of parallelization when using
CS. In fact, if the memory requests are answered faster, the execution of the different contexts can
become serialized and, depending on the data access pattern, lead to worse results.
In Figure 13, we investigated the influence of CS across various configurations. Due to space
constraints, we focused on analyzing the speedup resulting from increasing the number of contexts
from 1 to 2, 4, and 8 for query Q7. The speedup was visualized using bar charts and compared
against different numbers of cores and tools (Svelto, SPARTA, and SPARTA with caches). When the
number of contexts was fixed at 1, we noticed that Svelto and SPARTA exhibited similar execution
times, except in the case of 32 cores (see Figure 11). As expected, introducing caches improved
SPARTA’s performance across all scenarios. Transitioning from spatial to temporal parallelism,
the SPARTA and SPARTA-with-caches approaches demonstrated superior scalability compared to Svelto.
In fact, in Figure 13, the maximum speedup achieved by increasing the number of cores tended to
decrease as expected, but the rate of decrease was lower for SPARTA and SPARTA with caches
compared to Svelto. This limited scalability of Svelto is due to its interconnection, which scales less
efficiently than the NoC used by SPARTA. Furthermore, SPARTA initiates a new memory operation
before completing the previous one, enhancing channel utilization. Finally, with caches enabled in
SPARTA, the impact of spatial parallelism is heightened while the impact of temporal parallelism is
Fig. 13. Impact on the speedup considering various core counts and tools (e.g., Svelto, SPARTA, and SPARTA
with cache) with respect to benchmark query Q7. The baseline refers to the execution time of query Q7
when a single context is utilized for a specific number of cores. The bars represent the speedup achieved by
expanding the context to 2, 4, and 8.
reduced. In Figure 13, the maximum speedup with Svelto and SPARTA is around 6–7, while with
SPARTA with caches, it is around 4.
2 https://docs.xilinx.com/r/en-US/Vitis_Libraries/graph/index.html
required, thus only the lower triangular part of the matrix is utilized. The SPARTA configurations
for TC, CC, and PR are optimized to fully utilize spatial and temporal parallelism, using 32 cores
and eight contexts each. However, due to issues in the place-and-route phase, SPARTA cannot
achieve the same frequency as Vitis with such large designs. As a result, the post-routing frequency
achieved is 200/220 MHz instead of the 250 MHz obtained with smaller configurations.
In the Vitis benchmarks, the delay of the HBM interface between the accelerator and the memory
ranges from 100 to 140 ns, depending on the address of the request. To approximate this behaviour
in the SPARTA benchmarks, we increased the memory delay to 32 cycles, which at 200/220 MHz
is equivalent to 160/145 ns. Additionally, we utilized the AXI4 protocol instead of the minimal
interface to exchange data between the accelerators and the external memory.
It is worth noting that both SPARTA and Vitis target an Alveo U250 FPGA, part of the Ultrascale
family. The results for each algorithm are reported separately in Tables 7, 8, and 9.
SPARTA and Vitis can achieve different maximum frequencies. To account for this difference, we
calculated the kernel execution time as the achieved clock period times the number of cycles (Time
in the Tables). In addition, we computed the AreaXTime metric to normalize the performance of
the different designs.
In the CC benchmark, SPARTA outperforms Vitis in terms of execution time by a factor of 10,
with only a 30% greater area occupation. Additionally, the AreaXTime metric favours SPARTA by
almost an order of magnitude.
In the TC benchmark, SPARTA achieves, on average, a faster execution time. It requires 14%
less time to complete the computation of the four graphs. However, it occupies four times the area
of the Vitis design.
In the PR benchmark, SPARTA achieves a speedup between four and six times compared to Vitis,
occupying only twice the area. Consequently, it achieves an AreaXTime metric that is at most half that of Vitis.
As performance results for the BFS graph library kernel are not available, we also used Vitis HLS
to synthesize a C++ implementation of the BFS algorithm loosely based on the one available in
MachSuite [48]. Note that we are using the common frontier-based implementation rather than
the direction-optimizing implementation [6] (also provided in GAPBS). BFS is the prototypical
irregular kernel [1] where the number of iterations for each thread can be highly unbalanced and
requires critical sections to synchronize the different threads.
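For reference, one level of a frontier-based BFS over a CSR graph can be sketched in OpenMP as follows (illustrative code, not the exact MachSuite/SPARTA implementation); the data-dependent inner loop causes the load imbalance, and the critical section provides the required synchronization:

    /* One BFS level: expand every vertex of the current frontier. */
    int bfs_level(const int *row_ptr, const int *col_idx, int *dist,
                  const int *frontier, int frontier_size, int *next_frontier) {
        int next_size = 0;
        #pragma omp parallel for
        for (int f = 0; f < frontier_size; ++f) {
            int u = frontier[f];
            for (int e = row_ptr[u]; e < row_ptr[u + 1]; ++e) {  /* data-dependent trip count */
                int v = col_idx[e];
                #pragma omp critical
                {
                    if (dist[v] == -1) {                  /* not visited yet */
                        dist[v] = dist[u] + 1;
                        next_frontier[next_size++] = v;
                    }
                }
            }
        }
        return next_size;
    }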
We tried different configurations for SPARTA, varying the number of cores and contexts using
the accelerator produced by Vitis HLS as a baseline. Table 10 compares the accelerators produced
by Vitis HLS and SPARTA without parallelization (one core and one context). There is only a
small difference between the number of cycles for the two tools, while the area is more difficult to
compare: Bambu does not require BRAMs for this design as it maps all logic on LUTs and registers,
and almost half of the LUTs and registers used are dedicated to the 16-node NoC (as reported in
Table 4).
Figure 14(a) displays the speedup achieved using different configurations of SPARTA with respect
to the number of cycles required by Vitis HLS. In the sequential version, most of the time is spent
waiting for the external memory to provide data in both the SPARTA and Vitis implementations. In
the SPARTA parallel accelerators, a similar pattern arises when the number of contexts per core
is small: each core spends most of the time waiting for new data to process. As we increase the
number of contexts, each core spends a greater fraction of time doing useful computations. Even
when using multiple contexts, BFS does not achieve the same speedup as more regular applications
such as the ones described in Section 4.1: during the exploration of the graph, each thread can
find a different number of elements, leading to a high load imbalance. Furthermore, all threads
must be periodically synchronized to avoid exploring the same node of the graph multiple times
and to compute the correct length of the shortest path from the origin. Despite this, SPARTA can
Fig. 14. Benchmark synthesized varying the number of cores and contexts.
achieve good speedup when using four contexts per core: with four cores, it achieves a speedup of
almost five times, and with two cores, a speedup of three times compared to the Vitis HLS baseline.
Figure 14(b) compares the area utilization of SPARTA and Vitis HLS. When adding a core, SPARTA
replicates almost the entire design and adds necessary communication and synchronization logic.
Hence, resource utilization increases a little more than linearly: with four cores and one context,
the area occupied is six times more than the baseline. Note that with 4 cores, the NoC is larger than
the entire design generated by Vitis HLS. However, its size does not increase when more contexts
per core are added.
4.5 LegUp
We also compare SPARTA with the published results obtained by LegUp [15]. Specifically, we
use the Mandelbrot kernel from the latest open-source version of LegUp. This kernel is a version
of the Mandelbrot iterative mathematical benchmark, which produces a fractal image, parallelized
with OpenMP for pragmas. The results obtained by LegUp on this
benchmark are reported in Table 11, as published in [15]. SPARTA achieves better performance
across all metrics when using the same Mandelbrot kernel on a Stratix IV device (see Table 12). This
performance advantage can be attributed to several factors, including SPARTA’s use of external
memory for storing the Mandelbrot picture, its independence from a soft processor for OpenMP
parallelization, and its simpler architecture.
Results from the other benchmarks in [15] would only further widen the performance gap between
SPARTA and LegUp. These other benchmarks rely on Pthread primitives in addition to the OpenMP
pragmas. Pthread primitives are partially implemented in software to control hardware threads,
increasing the overhead introduced by the soft processor and affecting the overall execution.
5 Conclusion
Automating the design of complex parallel architectures can greatly improve developers’ productiv-
ity and lower the access barrier to FPGA acceleration for non-experts. In this article we presented
SPARTA, a methodology to support the synthesis of parallel applications written in OpenMP
through HLS. SPARTA exploits a compiler frontend to translate OpenMP pragmas into low-level
calls to primitive functions and a library of corresponding custom hardware components that are
allocated in the HLS tool backend. With SPARTA, it is possible to synthesize a variety of OpenMP
work-sharing and synchronization constructs and to assign multiple threads to a single accelerator
core to hide the latency of slower operations. The SPARTA architecture is modular and flexible,
featuring multiple parallel cores, a hierarchical memory system, and an efficient NoC interface
toward external memories. We tested our methodology on parallel graph applications, showing
linear scaling in both performance and area consumption when the number of parallel cores in-
creases, and we compared our results against a state-of-the-art methodology for the synthesis of
parallel accelerators, obtaining on average a 2.29× speedup. Future developments could include
support for OpenMP tasks, dynamic scheduling, atomic memory operations, and the introduction
of a broadcasting mechanism for memory responses; the latter would greatly reduce latency for
patterns such as matrix-vector multiplication, where multiple cores require the same information
stored in external memory.
References
[1] Graph500. 2023. Retrieved from https://graph500.org
[2] Hamzeh Ahangari, Muhammet Mustafa Özdal, and Özcan Öztürk. 2023. HLS-Based High-Throughput and Work-
Efficient Synthesizable Graph Processing Template Pipeline. ACM Transactions on Embedded Computing Systems 22, 2
(2023), 1–24.
[3] Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2015. A Scalable Processing-in-Memory
Accelerator for Parallel Graph Processing. In Proceedings of the 42nd Annual International Symposium on Computer
Architecture (ISCA). 105–117.
[4] ARM Developers. 2020. AMBA AXI and ACE Protocol Specification. Retrieved from https://developer.arm.com/
documentation/ihi0022/e/AMBA-AXI3-and-AXI4-Protocol-Specification
[5] Abanti Basak, Shuangchen Li, Xing Hu, Sang Min Oh, Xinfeng Xie, Li Zhao, Xiaowei Jiang, and Yuan Xie. 2019.
Analysis and Optimization of the Memory Hierarchy for Graph Processing Workloads. In Proceedings of the IEEE
International Symposium on High Performance Computer Architecture (HPCA). 373–386.
[6] Scott Beamer, Krste Asanovic, and David A. Patterson. 2012. Direction-Optimizing Breadth-First Search. In Proceedings
of the SC Conference on High Performance Computing Networking, Storage and Analysis (SC ’12). 12.
[7] Scott Beamer, Krste Asanovic, and David A. Patterson. 2015. The GAP Benchmark Suite. Retrieved from http://arxiv.org/abs/1508.03619
[8] B. Betkaoui, Yu Wang, D.B. Thomas, and W. Luk. 2012. A Reconfigurable Computing Approach for Efficient and
Scalable Parallel Graph Exploration. In Proceedings of the IEEE 23rd International Conference on Application-Specific
Systems, Architectures and Processors (ASAP ’12). 8–15.
[9] Nicolas Bohm Agostini, Serena Curzel, Vinay Amatya, Cheng Tan, Marco Minutoli, Vito Giovanni Castellana,
Joseph Manzano, David Kaeli, and Antonino Tumeo. 2022. An MLIR-based Compiler Flow for System-Level Design
and Hardware Acceleration. In Proceedings of the IEEE/ACM International Conference on Computer Aided Design
(ICCAD). 1–9.
[10] Andrew Canis, Jongsok Choi, Blair Fort, Ruolong Lian, Qijing Huang, Nazanin Calagar, Marcel Gort, Jia Jun Qin, Mark
Aldham, Tomasz Czajkowski, Stephen Brown, and Jason Anderson. 2013. From Software to Accelerators with LegUp
High-Level Synthesis. In Proceedings of the 2013 International Conference on Compilers, Architecture and Synthesis for
Embedded Systems (CASES). 1–9.
[11] Vito Giovanni Castellana, Marco Minutoli, Alessandro Morari, Antonino Tumeo, Marco Lattuada, and Fabrizio
Ferrandi. 2015. High Level Synthesis of RDF Queries for Graph Analytics. In Proceedings of the IEEE/ACM International
Conference on Computer-Aided Design (ICCAD). 323–330.
[12] Rohit Chandra, Leo Dagum, Ramesh Menon, David Kohr, Dror Maydan, and Jeff McDonald. 2001. Parallel Programming
in OpenMP. Morgan Kaufmann.
[13] Xinyu Chen, Hongshi Tan, Yao Chen, Bingsheng He, Weng-Fai Wong, and Deming Chen. 2021. ThunderGP: HLS-
Based Graph Processing Framework on FPGAs. In Proceedings of the ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays (FPGA). 69–80.
[14] Yuze Chi, Licheng Guo, Jason Lau, Young-kyu Choi, Jie Wang, and Jason Cong. 2021. Extending High-Level Synthesis
for Task-Parallel Programs. In Proceedings of the IEEE 29th Annual International Symposium on Field-Programmable
Custom Computing Machines (FCCM). 204–213.
[15] Jongsok Choi, Stephen Brown, and Jason Anderson. 2013. From Software Threads to Parallel Hardware in High-Level
Synthesis for FPGAs. In Proceedings of the 2013 International Conference on Field-Programmable Technology (FPT).
270–277.
[16] Jason Cong, Jason Lau, Gai Liu, Stephen Neuendorffer, Peichen Pan, Kees Vissers, and Zhiru Zhang. 2022. FPGA HLS
Today: Successes, Challenges, and Opportunities. ACM Transactions on Reconfigurable Technology and Systems 15, 4
(2022), 1–42.
[17] Yixiao Du, Yuwei Hu, Zhongchun Zhou, and Zhiru Zhang. 2022. High-Performance Sparse Linear Algebra on HBM-
Equipped FPGAs Using HLS: A Case Study on SpMV. In Proceedings of the ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays (FPGA). 54–64.
[18] José Duato, Sudhakar Yalamanchili, and Lionel Ni. 2003. Interconnection Networks: An Engineering Approach. Morgan
Kaufmann, 359–444.
[19] F. Ferrandi, V. G. Castellana, S. Curzel, P. Fezzardi, M. Fiorito, M. Lattuada, M. Minutoli, C. Pilato, and A. Tumeo. 2021.
Bambu: An Open-Source Research Framework for the High-Level Synthesis of Complex Applications. In Proceedings
of the 58th ACM/IEEE Design Automation Conference (DAC). 1327–1330.
[20] Yingxue Gao, Teng Wang, Lei Gong, Chao Wang, Xi Li, and Xuehai Zhou. 2023. FastRW: A Dataflow-Efficient and
Memory-Aware Accelerator for Graph Random Walk on FPGAs. In Proceedings of the Design, Automation & Test in
Europe Conference & Exhibition (DATE). 1–6.
[21] Roberto Gioiosa, Antonino Tumeo, Jian Yin, Thomas Warfel, David Haglin, and Santiago Betelu. 2017. Exploring
DataVortex Systems for Irregular Applications. In Proceedings of the 2017 IEEE International Parallel and Distributed
Processing Symposium (IPDPS). 409–418. DOI: https://doi.org/10.1109/IPDPS.2017.121
[22] Chuang-Yi Gui, Long Zheng, Bingsheng He, Cheng Liu, Xin-Yu Chen, Xiao-Fei Liao, and Hai Jin. 2019. A Survey on
Graph Processing Accelerators: Challenges and Opportunities. Journal of Computer Science and Technology 34 (2019),
339–371.
[23] Yuanbo Guo, Zhengxiang Pan, and Jeff Heflin. 2005. LUBM: A Benchmark for OWL Knowledge Base Systems. Journal
of Web Semantics 3, 2–3 (2005), 158–182.
[24] R. J. Halstead and W. Najjar. 2013. Compiled Multithreaded Data Paths on FPGAs for Dynamic Workloads. In
Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems
(CASES). 1–10.
[25] T. J. Ham, L. Wu, N. Sundaram, N. Satish, and M. Martonosi. 2016. Graphicionado: A High-Performance and Energy-
Efficient Accelerator for Graph Analytics. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture
(MICRO). 1–13.
[26] Yuwei Hu, Yixiao Du, Ecenur Ustun, and Zhiru Zhang. 2021. GraphLily: Accelerating Graph Linear Algebra
on HBM-Equipped FPGAs. In Proceedings of the IEEE/ACM International Conference on Computer Aided Design
(ICCAD). 1–9.
[27] J. Huthmann, J. Oppermann, and A. Koch. 2014. Automatic High-Level Synthesis of Multi-Threaded Hardware
Accelerators. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL). 1–4.
[28] Data Vortex Inc. 2023. Data Vortex Network on Chip IP Block. Retrieved from https://www.datavortex.com/dv-noc-
brief
[29] Sang-Woo Jun, Andy Wright, Sizhuo Zhang, Shuotao Xu, and Arvind. 2018. GraFBoost: Using Accelerated Flash
Storage for External Graph Analytics. In Proceedings of the ACM/IEEE 45th Annual International Symposium on
Computer Architecture (ISCA). 411–424.
[30] Nachiket Kapre and Jan Gray. 2015. Hoplite: Building Austere Overlay NoCs for FPGAs. In Proceedings of the 25th
International Conference on Field Programmable Logic and Applications (FPL). 1–8.
[31] Nachiket Kapre and Jan Gray. 2017. Hoplite: A Deflection-Routed Directional Torus NoC for FPGAs. ACM Transactions
on Reconfigurable Technology and Systems 10, 2 (Mar 2017), 1–24.
[32] Khronos Group. [n. d.]. OpenCL - Open Standard for Parallel Programming of Heterogeneous Systems. Retrieved
from https://www.khronos.org/opencl/
[33] Kartik Lakhotia, Rajgopal Kannan, Sourav Pati, and Viktor Prasanna. 2019. GPOP: A Cache and Memory-Efficient
Framework for Graph Processing over Partitions. In Proceedings of the 24th Symposium on Principles and Practice of
Parallel Programming. 393–394.
[34] Odile Liboiron-Ladouceur, Assaf Shacham, Benjamin A. Small, Benjamin G. Lee, Howard Wang, Caroline P. Lai,
Aleksandr Biberman, and Keren Bergman. 2008. The Data Vortex Optical Packet Switched Interconnection Network.
Journal of Lightwave Technology 26, 13 (2008), 1777–1789. DOI: https://doi.org/10.1109/JLT.2007.913739
[35] LLVM developers. [n. d.]. LLVM OpenMP Runtime Library. Retrieved from https://raw.githubusercontent.com/llvm/
llvm-project/main/openmp/runtime/doc/Reference.pdf
[36] Hao Lu, Mahantesh Halappanavar, and Ananth Kalyanaraman. 2015. Parallel Heuristics for Scalable Community
Detection. Parallel Computing 47, C (Aug. 2015), 19–37.
[37] Gurshaant Singh Malik and Nachiket Kapre. 2019. Enhancing Butterfly Fat Tree NoCs for FPGAs with Lightweight
Flow Control. In Proceedings of the 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom
Computing Machines (FCCM). 154–162. DOI: https://doi.org/10.1109/FCCM.2019.00030
[38] Fredrik Manne and Mahantesh Halappanavar. 2014. New Effective Multithreaded Matching Algorithms. In Proceedings
of the IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS). 519–528.
[39] Marco Minutoli, Vito Giovanni Castellana, Nicola Saporetti, Stefano Devecchi, Marco Lattuada, Pietro Fezzardi,
Antonino Tumeo, and Fabrizio Ferrandi. 2022. Svelto: High-Level Synthesis of Multi-Threaded Accelerators for Graph
Analytics. IEEE Transactions on Computers 71, 3 (2022), 520–533.
[40] Marco Minutoli, Vito Giovanni Castellana, Antonino Tumeo, Marco Lattuada, and Fabrizio Ferrandi. 2016. Efficient
Synthesis of Graph Methods: A Dynamically Scheduled Architecture. In Proceedings of the 35th International Conference
on Computer-Aided Design (ICCAD).
[41] Thomas Moscibroda and Onur Mutlu. 2009. A Case for Bufferless Routing in On-Chip Networks. In Proceedings of the
36th Annual International Symposium on Computer Architecture (ISCA). 196–207.
[42] Razvan Nane, Vlad Mihai Sima, Christian Pilato, Jongsok Choi, Blair Fort, Andrew Canis, Yu Ting Chen, Hsuan Hsiao,
Stephen Dean Brown, Fabrizio Ferrandi, Jason Helge Anderson, and Koen Bertels. 2016. A Survey and Evaluation of
FPGA High-Level Synthesis Tools. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 35,
10 (2016), 1591–1604.
[43] Tan Nguyen, Yao Chen, Kyle Rupnow, Swathi Gurumani, and Deming Chen. 2016. SoC, NoC and Hierarchical Bus
Implementations of Applications on FPGAs Using the FCUDA Flow. In Proceedings of the IEEE Computer Society
Annual Symposium on VLSI (ISVLSI). 661–666.
[44] E. Nurvitadhi, G. Weisz, Y. Wang, S. Hurkat, M. Nguyen, J. C. Hoe, J. F. Martínez, and C. Guestrin. 2014. GraphGen:
An FPGA Framework for Vertex-Centric Graph Computation. In Proceedings of the IEEE International Symposium on
Field-Programmable Custom Computing Machines (FCCM). 25–28.
[45] Tayo Oguntebi and Kunle Olukotun. 2016. GraphOps: A Dataflow Library for Graph Analytics Acceleration. In
Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA). 111–117.
[46] Debjit Pal, Yi-Hsiang Lai, Shaojie Xiang, Niansong Zhang, Hongzheng Chen, Jeremy Casas, Pasquale Cocchini,
Zhenkun Yang, Jin Yang, Louis-Noël Pouchet, and Zhiru Zhang. 2022. Accelerator Design with Decoupled Hardware
Customizations: Benefits and Challenges. In Proceedings of the 59th ACM/IEEE Design Automation Conference (DAC).
1351–1354.
[47] Artur Podobas and Mats Brorsson. 2016. Empowering OpenMP with Automatically Generated Hardware. In Proceedings
of the International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS). IEEE.
[48] Brandon Reagen, Robert Adolf, Yakun Sophia Shao, Gu-Yeon Wei, and David Brooks. 2014. MachSuite: Benchmarks
for Accelerator Design and Customized Architectures. In Proceedings of the 2014 IEEE International Symposium on
Workload Characterization (IISWC). IEEE, 110–119.
[49] Yousef Saad. 2003. Iterative Methods for Sparse Linear Systems (2nd ed.). SIAM.
[50] Atefeh Sohrabizadeh, Yunsheng Bai, Yizhou Sun, and Jason Cong. 2022. Automated Accelerator Optimization Aided
by Graph Neural Networks. In Proceedings of the 59th ACM/IEEE Design Automation Conference (DAC). 55–60.
[51] L. Sommer, J. Oppermann, J. Hofmann, and A. Koch. 2017. Synthesis of Interleaved Multithreaded Accelerators
from OpenMP Loops. In Proceedings of the International Conference on ReConFigurable Computing and FPGAs
(ReConFig). 1–7.
[52] Linghao Song, Youwei Zhuo, Xuehai Qian, Hai Li, and Yiran Chen. 2018. GraphR: Accelerating Graph Processing
Using ReRAM. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA).
531–543.
[53] Mingxing Tan, Gai Liu, Ritchie Zhao, Steve Dai, and Zhiru Zhang. 2015. ElasticFlow: A Complexity-Effective Approach
for Pipelining Irregular Loop Nests. In Proceedings of the IEEE/ACM International Conference on Computer-Aided
Design (ICCAD). 78–85.
[54] J. Villarreal, A. Park, W. Najjar, and R. Halstead. 2010. Designing Modular Hardware Accelerators in C with ROCCC 2.0.
In Proceedings of the 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines
(FCCM). 127–134.
[55] S. Windh, P. Budhkar, and W. A. Najjar. 2015. CAMs as Synchronizing Caches for Multithreaded Irregular Applications
on FPGAs. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD). 331–336.
[56] Pengcheng Yao, Long Zheng, Yu Huang, Qinggang Wang, Chuangyi Gui, Zhen Zeng, Xiaofei Liao, Hai Jin, and
Jingling Xue. 2022. ScalaGraph: A Scalable Accelerator for Massively Parallel Graph Processing. In Proceedings of the
IEEE International Symposium on High-Performance Computer Architecture (HPCA). 199–212.
[57] Hanchen Ye, Cong Hao, Jianyi Cheng, Hyunmin Jeong, Jack Huang, Stephen Neuendorffer, and Deming Chen.
2022. ScaleHLS: A New Scalable High-Level Synthesis Framework on Multi-Level Intermediate Representation. In
Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA). 741–755.
[58] Jialiang Zhang, Soroosh Khoram, and Jing Li. 2017. Boosting the Performance of FPGA-Based Graph Processor Using
Hybrid Memory Cube: A Case for Breadth First Search. In Proceedings of the 2017 ACM/SIGDA International Symposium
on Field-Programmable Gate Arrays (FPGA). 207–216.
[59] Shijie Zhou, Charalampos Chelmis, and Viktor K. Prasanna. 2015. Optimizing Memory Performance for FPGA
Implementation of PageRank. In Proceedings of the International Conference on ReConFigurable Computing and FPGAs
(ReConFig). 1–6.
[60] Shijie Zhou, Rajgopal Kannan, Viktor K. Prasanna, Guna Seetharaman, and Qing Wu. 2019. HitGraph: High-Throughput
Graph Processing Framework on FPGA. IEEE Transactions on Parallel and Distributed Systems 30, 10 (2019), 2249–2264.
[61] Shijie Zhou, Rajgopal Kannan, Hanqing Zeng, and Viktor K. Prasanna. 2018. An FPGA Framework for Edge-Centric
Graph Processing. In Proceedings of the ACM International Conference on Computing Frontiers (CF). 69–77.