
SPARTA: High-Level Synthesis of Parallel Multi-Threaded Accelerators
GIOVANNI GOZZI, MICHELE FIORITO, and SERENA CURZEL, Politecnico di Milano,
Milan, Italy
CLAUDIO BARONE, VITO GIOVANNI CASTELLANA, MARCO MINUTOLI,
and ANTONINO TUMEO, Pacific Northwest National Laboratory, Richland, WA, USA
FABRIZIO FERRANDI, Politecnico di Milano, Milan, Italy

This article presents a methodology for the Synthesis of PARallel multi-Threaded Accelerators (SPARTA) from
OpenMP annotated C/C++ specifications. SPARTA extends an open-source HLS tool, enabling the generation
of accelerators that provide latency tolerance for irregular memory accesses through multithreading, support
fine-grained memory-level parallelism through a hot-potato deflection-based network-on-chip (NoC), support
synchronization constructs, and can instantiate memory-side caches. Our approach is based on a custom
runtime OpenMP library, providing flexibility and extensibility. Experimental results show high scalability
when synthesizing irregular graph kernels. The accelerators generated with our approach are, on average,
2.29× faster than state-of-the-art HLS methodologies.

CCS Concepts: • Hardware → High-level and register-transfer level synthesis; Hardware accelerators;

Additional Key Words and Phrases: Design automation, FPGA architecture, graph algorithms, parallelism

ACM Reference format:


Giovanni Gozzi, Michele Fiorito, Serena Curzel, Claudio Barone, Vito Giovanni Castellana, Marco Minutoli,
Antonino Tumeo, and Fabrizio Ferrandi. 2024. SPARTA: High-Level Synthesis of Parallel Multi-Threaded
Accelerators. ACM Trans. Reconfig. Technol. Syst. 18, 1, Article 9 (December 2024), 30 pages.
https://doi.org/10.1145/3677035

This work has been partially supported by the Spoke 1 (FutureHPC & BigData) of the Italian Research Center on
High-Performance Computing, Big Data and Quantum Computing (ICSC) funded by MUR Mission 4 - Next Generation EU,
by the Compiler Frameworks and Hardware Generators to Support Innovative US Government (IUSG) Designs project at
Pacific Northwest National Laboratory (PNNL), and by the Adaptive Tunability for Synthesis and Control via Autonomous
Learning on Edge (AT SCALE) Laboratory Directed Research and Development (LDRD) Initiative.
Authors’ Contact Information: Giovanni Gozzi (corresponding author), Politecnico di Milano, Milan, Italy; e-mail:
[email protected]; Michele Fiorito, Politecnico di Milano, Milan, Italy; e-mail: [email protected]; Serena
Curzel, Politecnico di Milano, Milan, Italy; e-mail: [email protected]; Claudio Barone, Pacific Northwest National
Laboratory, Richland, WA, USA; e-mail: [email protected]; Vito Giovanni Castellana, Pacific Northwest National
Laboratory, Richland, WA, USA; e-mail: [email protected]; Marco Minutoli, Pacific Northwest National
Laboratory, Richland, WA, USA; e-mail: [email protected]; Antonino Tumeo, Pacific Northwest National
Laboratory, Richland, WA, USA; e-mail: [email protected]; Fabrizio Ferrandi, Politecnico di Milano, Milan, Italy;
e-mail: [email protected].

This work is licensed under a Creative Commons Attribution International 4.0 License.
© 2024 Copyright held by the owner/author(s).
ACM 1936-7414/2024/12-ART9
https://doi.org/10.1145/3677035


1 Introduction
Field Programmable Gate Arrays (FPGAs) are reconfigurable devices that can be programmed
using a hardware description language (HDL), such as Verilog or VHDL, to implement spe-
cialized accelerators. FPGAs have become a key component of heterogeneous high-performance
computing systems, often coupled to other types of accelerators, such as GPUs, to accelerate
data-intensive tasks or parts of applications that require low latency processing. High-level
synthesis (HLS) tools simplify the process of developing FPGA-based accelerators by automati-
cally translating software descriptions (typically C or C++ code) into HDL. These tools promise to
hide hardware-specific details from the user while still obtaining high quality of results (QoR) in
terms of latency, area usage, and clock frequency [42].
However, existing HLS tools mainly target regular, arithmetic-intensive workloads operating on
dense data structures typical of digital signal processing (DSP) applications, which present an
abundance of instruction-level parallelism. Solutions that generate accelerators from parallel high-
level descriptions leverage data-parallel programming models that rely on regular computation
and easily partitionable data sets (e.g., OpenCL), as they mainly replicate identical computational
units and do not support synchronization-intensive programs well [14]. Driven by the interest in
big data and graph analytics, several approaches attempt to support the acceleration of parallel
applications with irregular behaviour. These include solutions that target specific graph algorithms,
approaches that try to optimize linear algebra operations (e.g., sparse matrix-vector and sparse
matrix-matrix multiplication [17]), and frameworks based on the Gather-Apply-Scatter (GAS)
paradigm [33]. These solutions are highly specialized: they focus on specific graph algorithms, or
they only try to generate and optimize with HLS computational engines that need to be supported
by a software framework; the actual execution of graph algorithms mapped on such accelera-
tors might thus increase in computational complexity with respect to vertex- or edge-centered
formulations.
The OpenMP programming model for shared memory systems has often been successfully used
to design parallel irregular algorithms [7, 36, 38] thanks to its task-level and loop-level parallel
constructs. Research on HLS previously explored the use of OpenMP annotations to synthesize
applications with coarse-grained task-level parallelism [15, 47]; however, these approaches employ
multi-level designs driven by a microcontroller and cannot manage complex synchronization con-
structs. Furthermore, they still focus on generating parallel accelerators from programs with regular
behaviours. The Svelto methodology [39] uses OpenMP annotation as a starting point to synthesize
parallel multi-threaded accelerators, targeting the generation of accelerators starting from specific
patterns of parallel graph algorithms. In this article, we propose a more general methodology
for the Synthesis of PARallel multi-threaded Accelerators (SPARTA), providing automated
translation of parallel specifications written in OpenMP into efficient parallel accelerators.
SPARTA adopts a fundamentally different and more flexible approach than previous method-
ologies generating accelerators from OpenMP code. Instead of mapping OpenMP parallel regions
marked by compiler directives (pragmas) to a pre-defined parallel architecture template, it trans-
lates pragmas into calls to OpenMP runtime primitives and implements corresponding low-level
hardware components to be used in the synthesis backend. Accelerators generated with SPARTA
can exploit both spatial and temporal parallelism, hiding the latency of external memory accesses
through context switching (CS) across multiple execution threads. Moreover, SPARTA can in-
stantiate an advanced hierarchical memory subsystem composed of a custom Network-on-Chip
(NoC) that connects multiple external memory channels to each accelerator, memory-side caching,
and on-chip private memories for each accelerator. These features allow SPARTA to deal with both
regular and irregular behaviours in parallel algorithms and to generate efficient accelerators in
terms of performance per area.

We evaluated SPARTA by synthesizing typical parallel graph algorithms, verifying the scalability
and the QoR of the generated designs as the numbers of parallel accelerators and contexts change.
We also compared accelerators for graph database queries generated with SPARTA to the ones
generated with Svelto, obtaining, on average, a 2.29× improvement in performance with similar
architectural parameters (number of accelerators, contexts, and memory channels).
In summary, this article makes the following contributions:
— A synthesis methodology to generate multi-threaded parallel accelerators (SPARTA) support-
ing OpenMP work-sharing and synchronization constructs through the implementation of
low-level primitives from the OpenMP runtime system;
— Support for both spatial and temporal parallelism, with multiple threads that can pause and
resume execution on a core when needed;
— A highly configurable memory system that mitigates the latency of external memory accesses.
The rest of the article is organized as follows: Section 2 reviews related work on the synthesis of
FPGA accelerators from parallel programming languages. Section 3 describes the SPARTA synthesis flow and architecture. Section 4 presents experimental results, and Section 5 draws
conclusions for the article.

2 Related Work
SPARTA generates efficient, high-throughput parallel accelerators starting from OpenMP code.
This approach is particularly suited to irregular parallel applications such as graph algorithms,
thanks to the shared memory abstraction, the enhanced memory subsystem, and the combined use
of spatial and temporal multithreading. The work touches on several areas related to the use of
reconfigurable architectures and HLS methodologies for parallel applications.

2.1 HLS and Parallel Specifications


HLS is an established paradigm and an active area of research [16]; significant effort is dedicated to
improving the QoR of generated accelerators, supporting flexible optimization and design space
exploration, and introducing more and more system-level design features [9, 46, 50, 57]. Another
important research topic is the synthesis of parallel accelerator designs starting from specifications
written in user-friendly parallel programming application interfaces (APIs) such as OpenCL,
CUDA, OpenMP, or Pthread.
Hardware vendors lead OpenCL efforts [32], but they mainly target data-parallel workloads with
regular computation and often require significant restructuring of the code to match one of the
supported parallel patterns. FCUDA [43] starts from kernels written using the CUDA API, designed
targeting NVIDIA GPUs, and presents similar issues. The LegUP HLS tool [10] exploits OpenMP and
Pthread specifications to generate architectures built of custom accelerators driven by a general-
purpose microcontroller, only supporting two levels of nested parallelism (the external level through
Pthread and the internal one through OpenMP) [15]. BŁYSK [47] extracts OpenMP tasks and maps
them onto parallel accelerators that can share resources, also leveraging a microcontroller to
perform the actual task scheduling. The CHAT methodology [24] extends the ROCCC HLS tool
[54] to generate temporally multithreaded accelerators starting from loop constructs described in C
without any support for synchronization. The Nymble HLS compiler supports multiple execution
contexts [27] in spatially and temporally multithreaded accelerators generated from OpenMP loops
[51]; the accelerators, however, can only access private memory and thus cannot share data across
threads nor synchronize.
Svelto [39] is the synthesis methodology nearest to SPARTA: it synthesizes spatially and temporally multithreaded accelerators starting from OpenMP code; parallel cores access a shared memory through a multi-ported memory controller, and synchronization is supported through atomic memory operations. The main limitation of Svelto is that it can only map a small subset of OpenMP constructs to an architectural template, limiting the set of supported OpenMP pragmas (e.g., atomic memory operations are supported, but critical sections and barrier synchronization are not). Moreover, supported parallel constructs become kernel functions that do not allow inter-functional optimizations, local memories within each parallel core are not supported, and the multi-ported memory controller in the template has limited scalability.

Table 1 summarizes the considerations made in this section to highlight the differences between SPARTA and previous works. Looking at the supported input languages, we tried to identify how much a user would have to modify an input software implementation to use a certain tool: for example, FCUDA requires heavily reorganizing the code into data management and compute sections, Svelto requires the body of the OpenMP for loop to be outlined in a separate function, and SPARTA can process standard, unmodified OpenMP code. Methodologies that target a host+FPGA system use the processor to coordinate the execution of the accelerators, while SPARTA implements scheduling and synchronization within the accelerator itself. Finally, SPARTA supports different work-sharing and synchronization constructs, as well as nested parallelism; the addition of new constructs is simplified by using the low-level primitives of the OpenMP runtime.

Table 1. Qualitative Comparison between SPARTA and State-of-the-Art Methodologies for the Synthesis of Parallel Specifications

Tool | Input language | Target architecture | Work-sharing | Synchronization | Nested parallelism
OpenCL [32] | OpenCL* | host + GPU/FPGA | - | - | -
FCUDA [43] | CUDA*** | FPGA | - | - | -
LegUp [15] | Pthreads + OpenMP** | host + FPGA | for, master | critical, atomic | no
BŁYSK [47] | OpenMP | host + FPGA | task | taskwait | no
CHAT [24] | C loops** | host + FPGA | - | - | -
Nymble [51] | OpenMP | host + FPGA | for | no | no
Svelto [39] | OpenMP* | FPGA | for | atomic | no
SPARTA (this work) | OpenMP | FPGA | for, sections | critical, barrier | yes

Asterisks represent how much the input code needs to be modified with respect to a software implementation.
2.2 Accelerator Designs for Graph Algorithms
Graph processing is an important element of data analytics, and it presents unique challenges (large
datasets, unpredictable fine-grained memory accesses, unbalanced task-level parallelism) that are
not well addressed by commodity processors relying on advanced cache architectures, data locality,
and regular computation patterns, and might thus benefit from domain-specific accelerators [22].
Alongside proposals for processing-in-memory [3], flash-based [29], ASIC [25], and ReRAM-based
[52] accelerators, multiple designs have targeted FPGAs to implement graph processing primitives.
The FPGA co-processors on the hybrid Convey HC systems have been used to accelerate breadth-
first search (BFS) [8], while [58] exploits an FPGA-Hybrid Memory Cube platform. Optimized
implementations of PageRank (PR) and random walk are proposed in [59] and [20], respectively.
GraphOps [45] provides a series of building blocks to implement graph analytics applications
following a dataflow approach. Several designs are based on arrays of parallel processing engines
[8, 13, 55]; some of them support the implementation of vertex-centric [44] or edge-centric [60, 61]
graph algorithms. ScalaGraph [56] proposes a tile-based architecture that improves the scalability
of the system as the number of processing elements increases. GraphLily [26] supports graph
algorithms by accelerating their linear algebra formulation (i.e., by implementing custom Sparse-
Matrix-Vector and Sparse-Matrix-Matrix multiplications units) and [2] provides a template for a
GAS accelerator, requiring users to rewrite their applications and pre-partition the data.


All these works propose fixed or templated designs available as register-transfer level compo-
nents, manually developed either with HDL or through HLS. Specializing the HLS process for
the acceleration of graphs is not considered, thus similar approaches can only provide a limited
library of primitives, and input applications may need to be restructured to allow mapping the
computation onto the accelerators.

2.3 Synthesis Methodologies for Graph Algorithms


ElasticFlow [53] extends HLS with a methodology for the synthesis of irregular loop nests into a
pipelined architecture. The approach is thus well suited to graph kernels that often feature irregular
memory accesses and imbalanced computation; however, it only uses small on-chip memories to
store the graphs, and it does not support atomic memory operations, relying on a reorder buffer to
provide consistency. Castellana et al. [11] start from OpenMP-annotated nested loops to synthesize
graph database queries expressed as graph pattern matching routines. The methodology leverages
a distributed controller to map loop iterations onto a parallel array of accelerators, and it was
later extended with dynamic scheduling [40]. All these solutions exploit task parallelism only
by replicating the accelerator datapaths with limited resource reuse. The aforementioned Svelto
[39] methodology has been designed to support the synthesis of irregular parallel applications, in
particular graph kernels and graph database queries.

2.4 Architectural Features


The unique features of graph applications require architectural innovations to allow efficient
execution under latency, throughput, or power consumption constraints. Particular care must be
dedicated to the design of memory hierarchies [5] and to the interconnections between memory
and computational elements, as graphs are large data structures that cannot be easily partitioned
in a way that balances the work for each partition and require fine-grained, unpredictable, data
accesses with poor spatial and temporal locality.
In SPARTA, we use a unidirectional, bufferless, deflection-routed NoC to connect the different
parallel components. The idea of bufferless deflection-routed NoCs (BLESS) was proposed in [41]
to reduce energy consumption and chip area by eliminating the need for routing or flow control
buffers. While the deflection-based routing lowers the area and energy costs of the NoC, it might
increase latency under high injection rates; however, SPARTA accelerator cores can tolerate a higher
data access latency through asynchronous memory operations and CS. SPARTA uses a butterfly
topology [18], where the last stage is connected to the first one (see also Figure 6 later). While
adopting a different topology, the bufferless, deflection-based approach of SPARTA’s NoC is also
similar to the one used in the Data Vortex network [34]. The Data Vortex network was originally
devised for optical network interconnects due to its bufferless nature and was later implemented
as a conventional network interconnect for high-performance computing systems. Its ability to
support high injection rates of fine-grained transactions (memory word size) makes it effective for
irregular applications [21]. The design has been later proposed as an intellectual property for a
NoC [28], validated on FPGAs. Hoplite [30] is another noteworthy example of bufferless deflection
routing NoC. Hoplite uses a torus topology with a very simple node switch implementation. The
implementation of Hoplite's node switch is similar to SPARTA's; however, SPARTA's topology allows logarithmic hop counts. Kapre and Gray [31] thoroughly examine the cost of Hoplite's switches, showing that the deflection routing algorithm can reach high clock frequencies while utilizing a minimal amount of FPGA fabric resources. Malik and Kapre [37] discuss an enhanced version of Hoplite with a butterfly topology; in their setup, the parallel components are connected to the lower level of the network, while SPARTA can inject/receive requests through any of the network switches.


Fig. 1. Enhanced SPARTA-Bambu HLS flow.

3 SPARTA Synthesis Flow


The SPARTA design flow automatically generates parallel multi-threaded accelerators starting from
OpenMP code through an enhanced HLS flow that we implemented within the open-source Bambu
tool [19] (Figure 1). The input of the SPARTA synthesis methodology is a C/C++ description of
a parallel application annotated with OpenMP directives, such as the one shown in Figure 2(a).
SPARTA includes a frontend derived from Clang, which parses the OpenMP annotations and
performs initial HLS optimizations. OpenMP directives can be used to define parallel regions
and shared data access policies, with many different work-sharing constructs (e.g., parallel
for, sections) and synchronization mechanisms [12]. It is also possible to directly pass to our
enhanced Bambu flow the LLVM IR exposing the OpenMP runtime calls, allowing SPARTA to
process specifications generated through other frontends that support OpenMP (e.g., Flang and
other domain-specific languages that leverage the OpenMP MLIR dialect).
A conventional compiler flow for parallel architectures translates an OpenMP-annotated de-
scription into a parallel application through several transformation passes. We will illustrate the
transformations performed within the Clang LLVM compiler since we employ it as a frontend
for the SPARTA methodology starting from C/C++ code. However, as previously highlighted, our
approach is adaptable to other compiler infrastructures that support an OpenMP runtime library
[35]. Using a conventional OpenMP parallelization flow through Clang/LLVM greatly simplifies
introducing support for new constructs that might be added to the OpenMP standard.
We can split the parallelization process into three phases: pre-processing of compiler directives,
code transformations, and parallel API library linking. Pre-processing of compiler directives allows
gathering information on how the input program should be parallelized, i.e., which code sections
can be executed in parallel, how to distribute work across threads, data allocation policies, and more.
After pre-processing, the compiler generates an intermediate representation (IR), instruments it
with calls to OpenMP runtime functions, and links the OpenMP library, connecting each function
call to its target-specific implementation.
In the SPARTA synthesis flow, the Clang-based Bambu frontend compiles the input description
following the same steps, except that in the final phase, it maps calls to the OpenMP runtime onto
corresponding calls to our custom HLS library. The SPARTA library in Bambu contains optimized
Verilog primitives to handle synchronization, concurrency, and memory operations in a parallel
architecture, functionally equivalent to the OpenMP runtime library primitives. Starting from the
LLVM OpenMP runtime library, we streamlined and reduced it to essential functionality to generate
small and fast components suitable to be synthesized on FPGA. For example, we removed logic to
support the reuse of threads because hardware threads cannot be reassigned to other cores that require a different datapath and controller.

Fig. 2. Translation of OpenMP pragmas into calls to the SPARTA library functions.

We implemented OpenMP constructs replaceable with
arithmetic and logic operations (e.g., the ones that compute the range of loop iterations belonging
to a thread) with standard functional units.
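As a rough illustration of the arithmetic involved, the following C sketch shows how a thread's iteration range can be derived from its thread ID with a handful of integer operations; the block partitioning used here is an illustrative assumption, not necessarily the exact schedule produced by the SPARTA frontend.

/* Hypothetical sketch of the iteration-range computation for a statically
 * scheduled loop: each thread derives its own [lower, upper) bounds from its
 * thread ID using only integer arithmetic, which maps to standard functional
 * units. The block partitioning shown here is an illustrative assumption. */
#include <stdio.h>

static void static_bounds(int tid, int num_threads, int trip_count,
                          int *lower, int *upper) {
    int chunk = (trip_count + num_threads - 1) / num_threads; /* ceiling division */
    *lower = tid * chunk;
    *upper = (*lower + chunk < trip_count) ? *lower + chunk : trip_count;
}

int main(void) {
    int lo, hi;
    for (int tid = 0; tid < 4; tid++) {
        static_bounds(tid, 4, 10, &lo, &hi);
        printf("thread %d -> iterations [%d, %d)\n", tid, lo, hi);
    }
    return 0;
}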
Figure 2 shows how the SPARTA front end translates an OpenMP annotated code (Figure 2(a))
into an IR with calls to the SPARTA library functions (Figure 2(b)). The code in Figure 2(a) denotes
a parallel region with M threads where each thread executes an iteration of a for loop. The IR
in Figure 2(b) shows how the transformations outline the loop body into a separate function
(omp_outlined) and modify the OpenMP runtime library calls. The new function signature contains
all parameters to execute an iteration of the loop: the thread ID and a reference to the arrays A, B,
and res. The function body is instrumented with API calls to the SPARTA Bambu library functions
that retrieve the thread ID and synchronize with other threads.
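Since Figure 2 is not reproduced here, the following C sketch gives the flavor of the transformation; sparta_barrier is a hypothetical stand-in for the SPARTA library primitives, and the cyclic distribution of iterations is an illustrative assumption rather than the exact schedule emitted by the frontend.

#define M 4        /* threads in the parallel region */
#define N 1024     /* loop trip count                */

/* (a) Input specification: a parallel for loop executed by M threads. */
void vector_add(const int *A, const int *B, int *res) {
    #pragma omp parallel for num_threads(M)
    for (int i = 0; i < N; i++)
        res[i] = A[i] + B[i];
}

/* (b) Shape of the outlined function after the frontend transformations: the
 * signature carries the thread ID and references to A, B, and res, and the
 * body is instrumented with calls to the SPARTA library (hypothetical name)
 * that synchronize with the other threads. */
extern void sparta_barrier(void);

void omp_outlined(int tid, const int *A, const int *B, int *res) {
    for (int i = tid; i < N; i += M)   /* iterations assigned to this thread */
        res[i] = A[i] + B[i];
    sparta_barrier();                  /* implicit barrier closing the region */
}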
The instrumented IR containing the API calls that control the parallel execution is subsequently
optimized with hardware-oriented code transformations that facilitate the generation of the parallel
architecture in the HLS backend.
3.1 The SPARTA Architecture
SPARTA generates a modular architecture following the content of the input OpenMP specification
that can be customized through user-defined parameters. The main components are depicted in
Figure 3: sequential parts of the computation go through the standard HLS flow, while parallel
constructs are translated into a set of parallel processing units called cores. Each core contains a
finite state machine with datapath (FSMD) implementing the computation described within
an OpenMP parallel region (e.g., the loop body in an omp parallel for loop), a CS manager
in case multiple threads are assigned to the same core, and local memory for data belonging to those threads.

Fig. 3. SPARTA accelerator architecture.

The number of cores and the number of threads (contexts) assigned to each core are
configurable. The parallel region of the accelerator can also contain synchronization components,
arbiters, and data shared among threads if needed. The accelerator is connected to a NoC that
relays messages to an external memory (with support for multi-banked or pipelined memories);
the number of channels between the accelerator and the NoC, as well as between the NoC and the
external memory, is configurable. The user encodes information about parallelism, work-sharing
policies, and synchronization requirements through the OpenMP annotations in the code and
configures the rest of the architecture through command-line options.
3.1.1 Spatial and Temporal Multithreading. Cores in a SPARTA accelerator execute concurrently,
using the FPGA resources to exploit coarse-grained parallelism; instruction-level parallelism is
extracted by the HLS tool during the implementation of the FSMD within each core. SPARTA accel-
erators can also support temporal multithreading by interleaving the execution of multiple threads
on the same core instead of stalling execution and leaving the core underutilized when a thread
needs to perform long latency operations (typically external memory accesses or synchronization,
which might take tens of clock cycles), it is possible to perform a CS and execute a different thread
with the same datapath, e.g., a different iteration of the same omp for loop.
Each core with CS enabled features a CS manager module to handle the active context and switch
to a new one when a memory or synchronization operation starts. For example, on a read/write
request to the external memory, the CS manager saves the active context (i.e., the thread state)
and starts the execution of a different context in the next clock cycle. Once the core receives the
memory response, the CS manager flags the previously paused context as ready and eventually
resumes its execution when the currently active context is paused or completed. The CS operation
has no overhead: in the clock cycle before the switch, the core executes normally, and at the start
of the next clock cycle, the new context restarts from the state where it should have been after
it was suspended. The CS manager always performs a switch when the active context starts a
wait synchronization operation. This avoids deadlocks when the waited thread executes on the
same core of the waiting thread, as the CS manager can flag the waiting thread as ready when the
response from the waited thread is received. The CS manager also has a set of registers for each
context to store the signals received while threads are not ready. When responses arrive at a core,
the thread in execution is generally not the one that performed the request: the value is saved and
forwarded to the actual thread originating the request when its execution is resumed. In this way,


the CS functionality is entirely transparent to the rest of the accelerator and does not force other
components to send signals multiple times or stall their execution.
In the core datapath, registers store intermediate signals between combinational regions; when
many contexts share the same core, it is necessary to store a signal for each context. Each context
thus needs its own set of registers and a control signal to select the active set of input and output
registers; the new register set with the context selector control signal is commonly referred to as a
register file. Since register files can consume considerably more resources than standard registers,
it is fundamental to avoid them when possible. In the SPARTA architecture, CS can happen only
during memory accesses or synchronization calls, which are known at compile time: the synthesis
flow exploits this information to replace registers with register files only within control states
where CS can happen, while other registers are left unchanged. As an example, let us take into
account a core executing a floating-point add operation: two values are loaded from memory,
the sum of the two values is computed, and the result is stored back in memory. Load and store
operations can trigger a CS to hide the memory access latency; thus, registers connected to load
and store operations in this core must be replaced with register files to save signal states and later
restore them; instead, the floating-point sum can never be interrupted by a CS and thus none of the
registers present in the floating-point adder logic needs to be replaced with a register file.

3.2 Thread Synchronization


Three synchronization modules represent a key component of the SPARTA architecture: the barrier
manager, the reduction manager, and the critical manager. They are only included in the parallel
region if they are needed, and their size is proportional to the number of cores that use them.
Barriers are added by the OpenMP runtime at the end of a parallel region to ensure that the
execution can proceed safely if the threads have different execution times. In the OpenMP paradigm,
when a thread encounters a barrier, it notifies all other threads in the same team that it has
reached the barrier and waits until all other threads also signal that they have reached the end of
their execution. To implement barrier synchronization, the SPARTA architecture includes local
components for each core that generate completion signals and a centralized barrier manager that
collects completion signals, updates a synchronization register, and, when all required threads have
reached the barrier, sends them a command to resume execution. Sharing a unique barrier manager
among all parallel cores ensures that synchronization happens safely without the possibility of race
conditions. Furthermore, the same barrier manager component can also be used to implement the
wait_thread OpenMP operation, which waits for the completion of one specific thread without
adding more logic or duplicating the synchronization state.
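The following C fragment is a purely behavioral model (not the generated Verilog) of the barrier manager logic described above, assuming one arrival bit per thread in the synchronization register.

#include <stdbool.h>
#include <stdint.h>

/* Behavioral model of the centralized barrier manager: each core raises a
 * completion bit in the synchronization register; when all expected threads
 * have arrived, the register is cleared and a resume command would be
 * broadcast to the waiting cores. */
typedef struct {
    uint32_t arrived;   /* one bit per thread that reached the barrier */
    uint32_t expected;  /* mask of all threads in the team             */
} barrier_manager_t;

/* Called when thread `tid` signals the barrier; returns true when every
 * thread has arrived and execution can resume. */
static bool barrier_arrive(barrier_manager_t *bm, int tid) {
    bm->arrived |= 1u << tid;
    if (bm->arrived == bm->expected) {
        bm->arrived = 0;   /* reset for the next barrier               */
        return true;       /* resume command sent to all cores         */
    }
    return false;          /* calling context is switched out and waits */
}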
In OpenMP reduction operations, the thread with ID 0 needs to retrieve partial reduction variables
from all other threads and combine them. Each thread on a SPARTA core participating in a reduction
communicates with the reduction manager to indicate the address of its reduction variable in its
local memory; thread 0 then accesses the reduction manager to retrieve the addresses of all partial
results it needs to merge. In fact, as will be explained in Section 3.3, a thread can access the local
memory of another thread if it knows the correct address. Such a scheme could have also been
implemented by storing the addresses of partial reductions in the external shared memory but with
the considerable latency overhead of additional memory requests. Instead, the SPARTA reduction
manager implemented inside the accelerator allows one-cycle access to the information required to
perform the reduction.
We chose to implement the reduction following a linear pattern: after thread 0 finishes its
part of the computation, it collects all the values computed by the others and merges the results.
Other reduction schemes may result in a different tradeoff between area occupation and speed.
For example, using more threads to merge the results is possible at the cost of duplicating more logic.

Fig. 4. Paths of memory requests from Core 0 to (1) the local memory of Core N, (2) on-chip shared memory, and (3) external memory.

However, a linear reduction operation often requires five or six cycles per merged thread, which is negligible compared with the duration of the computation. For this
reason, when multiple contexts can be activated, the policy of the CS manager is to choose the
thread with the lowest thread ID: this gives priority to thread 0, which is the one that could perform
a reduction operation, and then to the other threads in order since the reduction proceeds following
the thread IDs.
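At the source level, the construct handled by this machinery is a standard OpenMP reduction; a minimal example (illustrative, not taken from the evaluated benchmarks) is shown below. Each thread accumulates a private partial sum in its local memory, and thread 0 performs the final linear merge through the reduction manager in the generated accelerator.

#define N 1024

/* Each thread accumulates a private copy of `sum`; at the end of the loop,
 * thread 0 retrieves the partial results (via the reduction manager in the
 * generated accelerator) and merges them linearly. */
double dot(const double *a, const double *b) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < N; i++)
        sum += a[i] * b[i];
    return sum;
}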
In the OpenMP paradigm, the critical pragma is used to identify a set of instructions that can
only be executed by a single thread at once. SPARTA adds a critical manager that behaves like
a mutex and a local component for each thread that encounters the critical section. When a thread
reaches the critical section, the local component sends a signal to request control of the lock, and
the critical manager grants it only if it is not currently assigned to another thread. If the lock is
not acquired by the current thread, the CS manager performs a CS. The critical manager stores the
state of the requests and the current lock owner and, like the CS manager, prioritizes the thread
with a lower ID when multiple threads require access. When the thread finishes the execution of
the critical section, it sends a signal to release the acquired lock to avoid deadlock and allow other
threads to resume their execution.
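A minimal source-level example (again illustrative) is a critical section guarding a shared update; in the generated accelerator, the lock-request and lock-release signals described above are exchanged with the critical manager around the guarded block.

#define CAP 4096

int stack[CAP];
int top = 0;   /* shared: updated by at most one thread at a time */

void collect_positive(const int *values, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (values[i] > 0) {
            #pragma omp critical
            {
                /* lock acquired from the critical manager; a thread that
                 * fails to acquire it is context-switched out until release */
                if (top < CAP)
                    stack[top++] = values[i];
            }
        }
    }
}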

3.3 Memory Hierarchy


SPARTA exploits a hierarchical memory model that allows reducing the number of requests to
the external memory and keeping data inside the accelerator as much as possible. There are three
possibilities to store variables, visible in Figure 3: external memory outside the accelerator (high-
latency memory), block RAM (BRAM) modules in the FPGA fabric inside the accelerator that are
accessible by all cores, and private BRAM modules inside parallel cores. External memory is usually
suitable for input and output parameters that cannot fit into on-chip BRAMs, while variables that
are defined and used during the computation are better allocated inside the accelerator, whether in
shared or private memory blocks, to lower access latency and improve performance.
The various levels of the memory hierarchy are connected through an ad hoc bus infrastructure,
depicted in Figure 4, that can handle multiple concurrent memory requests. The local bus has

ACM Transactions on Reconfigurable Technology and Systems, Vol. 18, No. 1, Article 9. Publication date: December 2024.
SPARTA: High-Level Synthesis of Parallel Multi-Threaded Accelerators 9:11

Fig. 5. Memory address format for an architecture with P parallel regions, T threads per region, X bytes
of local memory per thread, and Y bits of external memory. The external bit is set if the addressed item is
allocated outside the accelerator.

one channel and connects cores within the same parallel region, while the main accelerator bus
connects parallel regions with the on-chip shared memory and with the NoC toward the external
memory. Each memory request originating from one of the parallel cores goes through a filter;
the filter interprets its address and properly routes the request to internal BRAMs of other cores
through the local bus or outside the parallel region through the main bus. Since BRAMs have two
channels, local memories can support concurrent access from both the owner core and the local
bus. SPARTA can connect the computation units directly to the local memory without using the
local bus if it can infer at compile time the address of the requests and the variable is allocated
inside the same core.
SPARTA memory addresses encode the physical location of the memory module where data is
allocated; Figure 5 shows the structure of external and internal memory addresses, which have
the same length because they travel on the same local bus. The size of the address depends on
the external memory size (which is likely considerably larger than internal memory) plus one bit
that is added to differentiate between internal and external addresses. This allows a very simple
decoding logic: if the external bit is equal to 1, the request is propagated to the external memory.
For internal memory, a dedicated paging algorithm allocates each memory variable considering
the parallel region and thread it belongs to. The address of internal variables is composed of three
parts, representing the address of the variable in the memory module, the thread that owns the
variable, and the parallel region in which the memory module is allocated, respectively. To build
the address, SPARTA first finds the largest internal memory module and computes the number of
bits needed to represent it, so this number of bits can be reserved to univocally identify a variable
inside each memory module. Similarly, the size of the second part of the address is set by looking
at the largest number of threads assigned to the different parallel regions. Finally, a unique code is
assigned to each parallel region, which is used by the filters to determine whether an address is
local to the parallel region. Looking at the appropriate bits of the address, filters and arbiters in
the memory interconnection system can properly forward a memory request to a different bus or
component using combinatorial logic without adding delays to the requests. For example, if the
parallel region ID in an address matches the one stored in the filter, the request is forwarded to the
local bus, where the arbiter will find the correct core by looking at the thread ID. Instead, if the
parallel region ID does not match or if the address has the external bit set to one, the request is
sent to the main bus.
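The following C model illustrates the decoding scheme with example field widths; the constants below are assumptions for illustration only (the real widths are derived from the design as explained above), and the payload of external addresses is not modeled.

#include <stdbool.h>
#include <stdint.h>

/* Example field widths (assumptions for illustration only): */
#define OFFSET_BITS 10   /* largest local memory module: 1 KiB        */
#define THREAD_BITS 3    /* up to 8 threads per parallel region       */
#define REGION_BITS 2    /* up to 4 parallel regions                  */
#define EXT_BIT     (OFFSET_BITS + THREAD_BITS + REGION_BITS)

/* Internal address layout: | region ID | thread ID | offset in module |. */
uint32_t make_internal_addr(uint32_t region, uint32_t tid, uint32_t offset) {
    return (region << (THREAD_BITS + OFFSET_BITS)) | (tid << OFFSET_BITS) | offset;
}

/* The checks performed by filters and arbiters are simple bit tests,
 * implementable with combinatorial logic only. */
bool     is_external(uint32_t addr) { return (addr >> EXT_BIT) & 1u; }
uint32_t region_of(uint32_t addr)   { return (addr >> (THREAD_BITS + OFFSET_BITS)) & ((1u << REGION_BITS) - 1u); }
uint32_t thread_of(uint32_t addr)   { return (addr >> OFFSET_BITS) & ((1u << THREAD_BITS) - 1u); }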
If two requests arrive in the same cycle, the bus arbiter can only accept one of them, and the
other must be stalled; in this case, SPARTA generates a back pressure signal between arbiters and
cores to resend requests that could not be accepted in the current cycle. With only one context in
execution on the core, execution has to be stalled as it needs the result of the memory operation to
proceed, but when multiple contexts are assigned to the same core, another context can be executed in the meantime.

Fig. 6. The architecture of a unidirectional butterfly NoC component with 24 nodes organized in 3 levels.

A First-In-First-Out (FIFO) queue holds requests before the memory filter so
that the CS manager can switch the running context at the start of each memory request instead of
when an arbiter accepts it; this prevents the effects of the arbitration from having repercussions on
the core execution, as the FIFO is responsible for resending the requests to the arbiter while the
core can continue its operation with another context.

3.4 External Memory Interface


The parallel accelerator generated by SPARTA is connected to an external memory module with a
fixed number of memory channels. Thus, memory requests need arbitration as they can originate
from multiple sources (i.e., the parallel cores) and be directed to multiple destinations. Instead of
using dedicated hardware modules or another arbiter, we introduced a custom NoC interconnect that
allows maintaining a short clock period. The NoC interconnect is composed of two unidirectional
NoC components (Figure 6), one for the request from the cores to the memory and the other for
the opposite path.
The proposed NoC consists of a group of interconnected nodes following a butterfly structure,
organized in n levels, where each level has 2^n nodes, for a total of n ∗ 2^n nodes. The number of nodes in the NoC must be at least equal to the maximum
between the number of banks in the external memory and the number of cores in the parallel
region (destination and source of the NoC, respectively). Each node in the NoC can be configured
to be registered or combinatorial, providing a tradeoff between the frequency of the NoC and the
number of cycles needed to traverse it: in fact, combinatorial nodes produce output values in the
same cycle, increasing the critical path, while registered nodes introduce a delay. Nodes connected
to external components are called I/O nodes and must be registered to avoid latches; for the rest of
the nodes, the choice to use combinatorial nodes or have only registered ones is left to the user.
The number of cycles needed to traverse the NoC is equal to the number of registered nodes
on the path of the request and, at most, is equal to the number of levels. To minimize the average travel time inside the NoC, SPARTA organizes the registered nodes by filling entire levels before moving to the next one. For example, consider a NoC connecting 16 memory banks to 16 parallel cores: the number of levels must be 3 (as n = 2 would provide only 2 ∗ 2^2 = 8 nodes, fewer than 16), but if we choose to use combinatorial nodes, it can be traversed in 2 cycles, as only the 16 I/O nodes must be registered and they can be arranged in the first two levels.


Table 2. Minimal Memory Interface Protocol

Signal Meaning
Mout_oe_ram Read operation (1 bit)
Mout_we_ram Write operation (1 bit)
Mout_addr_ram Address of the memory operation
Mout_Wdata_ram Value to write
Mout_data_ram_size Size in bits of the value to write
Mout_accelerator_context Tag of the memory operation (parallel region ID and context ID)
Mout_back_pressure The source has to resend the request (1 bit)

The NoC is resilient to congestion: if two requests want to go to the same
node, there is no need to stall, as the node can redirect one of them on another route to the same
destination. The NoC can thus accept multiple requests: each node has two internal input and two internal output connections, and an I/O node can accept a request if one of its internal inputs is not active in the current cycle. Registered nodes can also store a request, while combinatorial nodes forward requests immediately and do not store information. If both internal input connections are active, the I/O node sends a back pressure signal so that the request is resent in the next cycle.
When using multiple contexts, back pressure signals do not interfere with the execution of the
cores as they are intercepted in the FIFO modules, which take the role of resending the requests, as
described in Section 3.3.
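A small C sketch of the sizing rule and the worked example above, under the assumption that the rule is applied exactly as stated (the total node count n ∗ 2^n must cover both endpoints):

#include <stdio.h>

/* Smallest number of levels n such that the NoC, with 2^n nodes per level
 * and n * 2^n nodes in total, has at least as many nodes as the maximum
 * between external memory banks and parallel cores. */
static int noc_levels(int banks, int cores) {
    int need = (banks > cores) ? banks : cores;
    int n = 1;
    while (n * (1 << n) < need)
        n++;
    return n;
}

int main(void) {
    /* Worked example from the text: 16 banks and 16 cores require 3 levels,
     * since n = 2 would only provide 2 * 2^2 = 8 nodes. */
    printf("levels = %d\n", noc_levels(16, 16));
    return 0;
}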
The SPARTA architecture allows the accelerator to connect to different types of external memory,
with simple or pipelined memory access, and for both unified and banked multi-channel memories.
The accelerator does not assume that the memory is immediately ready to accept a request and
uses a protocol with back pressure that can be used to synchronize with it. Furthermore, SPARTA
cores work under the assumption that answers from the memory could come out of order, as traffic
in the NoC or the external memory could change the relative order of requests. Each request is
tagged with an ID used to identify the thread that performed it and correctly forward the answer.
If the external memory is banked, it is possible to add one cache in front of each bank to further
reduce the average latency of the memory accesses. (There is no need to introduce a cache consis-
tency mechanism, as each cache has its own set of addresses associated with a different memory
bank.)
We introduced a set of custom pragmas and parameters to specify information about the accel-
erator interfaces for variables that will be stored in external memory and model their behaviour
during the simulation of the generated accelerator (Table 3). Specific variables can be allocated to
specific banks to mimic access patterns of real memories. For example, #pragma HLS_interface
a banked bank_number = 16 bank_chunk = 256 specifies that variable a needs to be stored in a multi-banked external memory with 16 banks of 256 bytes each.
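A usage sketch is shown below; the pragma string is the one given above, while its placement next to the kernel signature and the second annotated variable are assumptions made for illustration.

#define NV 4096

/* `a` and `res` are mapped to a multi-banked external memory with 16 banks
 * of 256 bytes each; the placement of the pragmas is illustrative. */
#pragma HLS_interface a banked bank_number = 16 bank_chunk = 256
#pragma HLS_interface res banked bank_number = 16 bank_chunk = 256
void scale_kernel(const int *a, int *res) {
    #pragma omp parallel for
    for (int i = 0; i < NV; i++)
        res[i] = 2 * a[i];
}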
SPARTA accelerators, by default, exchange data with other accelerators and the external memory
using a simple and efficient protocol called minimal memory interface, described in Table 2. The
protocol requires a limited number of signals and logic elements to perform memory operations,
reducing the size of internal components (particularly the NoC). However, SPARTA also supports
more complex protocols such as AXI4 [4] by adding a bridge component before each memory bank
that translates each request and response from the minimal protocol to AXI4. The translation is
completely transparent to the accelerator, correctly propagating back pressure signals from and to
the external memory.


Table 3. SPARTA Parameters Used to Describe External Memory Interfaces

Parameter Usage
Banked Specify that the memory is banked
Bank number Specify the number of banks
Bank allocation Force a variable to be allocated on a specified group of banks
Bank chunk Specify the size in bytes of the bank blocks
Cache Adds caches to the accelerator
Cache size Specify the size of the cache word
Cache depth Specify the number of cache lines
AXI Specify that the memory protocol is AXI4

3.5 Hardware-Oriented Transformations


We describe here two key transformations of the SPARTA methodology that reduce the im-
pact of parallelization on resource utilization: thread ID propagation and parallel API pointer
promotion.
Thread ID propagation. SPARTA generates an array of cores by replicating the datapaths of
omp_outlined functions corresponding to the content of an OpenMP parallel region. While simply
replicating the whole datapath provides a functional design, considering which parts are effectively
used at runtime can reduce resource utilization. Because thread IDs are statically assigned to each
core, our approach can use them to determine the resources of the datapath that each thread actually
uses. In fact, in the OpenMP work-sharing model, it is possible to differentiate the computation
assigned to threads in the same group, but only depending on the thread ID. Bambu features several
code analyses and transformations that SPARTA can exploit to statically propagate thread IDs
through the replicated datapaths and to automatically perform dead code elimination, leading to
the generation of the minimum set of resources required for each of the cores. Static thread ID
propagation also avoids the replication of sections of the datapath only used by one thread, such as
resources that execute reduction operations.
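A small example of the effect (illustrative code, not one of the benchmarks): once the constant thread ID is propagated into each replicated datapath, the reduction branch below is kept only in the core hosting thread 0 and eliminated as dead code in all the others.

#define NUM_THREADS 4

double partial[NUM_THREADS];   /* per-thread partial results */

double omp_outlined_body(int tid, const double *data, int n) {
    double local = 0.0;
    for (int i = tid; i < n; i += NUM_THREADS)
        local += data[i];
    partial[tid] = local;

    if (tid == 0) {            /* after thread-ID propagation, this branch   */
        double sum = 0.0;      /* survives only in core 0; the other cores   */
        for (int t = 0; t < NUM_THREADS; t++)   /* drop it as dead code      */
            sum += partial[t];
        return sum;
    }
    return 0.0;
}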
Pointer promotion. The OpenMP API makes extensive use of pointers and multiple levels of
indirection to exchange references to internal data structures and parameters between threads;
such a programming model is generic and versatile, but it determines an overhead for memory
accesses and pointer resolution. In fact, memory operations on FPGA typically need to access
external memories and are thus significantly more expensive in terms of latency and energy with
respect to arithmetic operations. Whenever possible, SPARTA leverages static resolution of multiple
indirections to reduce the number of memory operations. Because the input application and the
architectural parameters (e.g., number of contexts per core and memory architecture) are fixed, the
SPARTA frontend can apply function versioning for each call to functions of the parallel API. After
versioning, each function represents an instance of a hardware component, and static analysis
can be performed on the IR to infer dynamically allocated data structures and memory accesses:
simple and multiple indirections are solved with pointer promotion, and statically allocated data
structures enable further optimizations during the HLS flow. Once parallel APIs versioning is
completed, it is also possible to perform a more detailed alias analysis on pointers to understand
what memory structures are accessed by each operation. In this way, it is possible to pinpoint
shared memory variables whose location will be shared at runtime between threads and allocate
them in shared BRAMs while the remaining private memory variables are allocated in the local
memories.
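A before/after sketch of the idea (the names and the argument struct are illustrative, not the actual OpenMP runtime layout):

/* Before promotion: arguments reach the outlined function through a struct of
 * pointers, so every access pays an extra indirection that would become an
 * additional memory operation in hardware. */
struct outlined_args { int *n; double **data; };

double sum_indirect(struct outlined_args *args) {
    double s = 0.0;
    for (int i = 0; i < *args->n; i++)
        s += (*args->data)[i];
    return s;
}

/* After function versioning and pointer promotion: the indirections are
 * resolved statically, so the HLS flow sees plain accesses that can target
 * local or shared BRAMs directly. */
double sum_promoted(int n, const double *data) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += data[i];
    return s;
}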


Table 4. Resource Consumption of the Main SPARTA Architecture Components

Component name LUTs Registers BRAMs


NoC (16/24 nodes) 6,543 6,274 0
NoC (32/64 nodes) 12,051 10,048 0
CS manager (2 C) 122 109 0
CS manager (4 C) 254 182 0
CS manager (8 C) 527 327 0
Cache (32 kB) 2,844 755 15
MinimalAXI4Adapter 46 161 0
For the NoC, we report the number of I/O nodes and the total number
of nodes.

4 Experimental Evaluation
We showcase the results that can be achieved with SPARTA through four experiments: the
first consists of synthesizing three graph processing benchmarks to show speedup and scalability
varying the number of cores and contexts (Section 4.1), the second is a direct comparison between
SPARTA and Svelto [39] on parallel database queries (Section 4.3), the third is a comparison between
SPARTA accelerators and kernels from the graph acceleration library of the commercial tool Vitis
HLS (Section 4.4), and the last one (Section 4.5) is a comparison with LegUp [15]. We also include in
Section 4.2 a focus on the effect of different NoC configurations and back pressure conditions on
the performance of the Graph Algorithm Platform Benchmark Suite (GAPBS) accelerators.
We use a custom version of Bambu HLS to run the experiments, which is available as an executable
binary.1 The following considerations are common to all tests: we assume to connect the accelerators
to an external memory with 16 banks and one channel per bank; the memory delay is assumed to
be constant. We indicate with core the number of parallel accelerator cores, and C the number of
contexts on each core. We report area utilization in terms of lookup tables (LUTs) and registers.
SPARTA components also require one DSP block for each core and one BRAM for every two
contexts if reductions are present; Table 4 reports the area utilization of the main architecture
components to isolate the contribution of components that do not depend on the input application.
Most of the designs meet the imposed timing constraint after place-and-route except for the largest
configurations, where Vivado (the backend tool for AMD/Xilinx FPGAs) was not able to achieve
the requested clock period after the placement phase due to high utilization of the device resources.

4.1 GAP Benchmark Suite


The GAPBS [7] is a collection of benchmarks intended to standardize evaluation for graph research,
and it uses OpenMP to parallelize execution. We tested SPARTA on a subset of GAPBS benchmarks,
excluding the ones that use OpenMP primitives that have not yet been implemented in SPARTA. The
target is an AMD/Xilinx Alveo U280 board with a clock frequency of 250 MHz. In all experiments,
the input graph, in Compressed Sparse Row format [49], is assumed to be stored outside the FPGA
in an external DRAM memory.
The connected components (CC) benchmark finds the largest connected subgraph in the
undirected input graph; we synthesized it with configurations that vary from one to eight cores
and from one to eight contexts per core. With a larger number of cores/contexts, the memory
channel utilization saturates, and there can be no further improvement, as the primary source of
latency for the CC application is the access to the external memory.

1 https://release.bambuhls.eu

Fig. 7. CC benchmark synthesized varying the number of cores and contexts.

As can be seen in Figure 7(a), the increase in performance over a sequential version of the application (one core, one context)
grows linearly with the number of cores, keeping the number of contexts constant; in fact, having
more cores that perform memory requests increases the utilization of the memory channels and
the overall performance. We can see the same effect with more contexts on the same core to hide
the memory latency: the configuration with eight contexts on one core has the same speedup as
the one with eight cores and a single context. The configuration with eight cores and eight contexts
is 36 times faster than the sequential version; this is slower than the theoretical speedup because
the memory access pattern is irregular, bringing multiple requests on the same memory bank at
the same time and thus producing delays and congestion on the NoC. LUTs and registers scale
differently with the number of cores/contexts, as can be seen in Figure 7(b): the number of LUTs
grows with the number of OpenMP threads (e.g., 7× increase with respect to the sequential version
with eight cores, eight contexts), while liveness analysis during the synthesis flow identifies the
minimum number of registers to be replicated for CS, so the number of registers is mostly linked
to the number of cores (4× more than in the sequential version).
The second GAPBS kernel we evaluated is PR, which assigns a value to each graph node based
on its relative importance compared to the other nodes. Figure 8(a) shows that the speedup over a
sequential baseline proportionally increases with the number of cores, similar to the CC benchmark.
However, in this case, increasing the number of contexts per core from four to eight does not lead
to the same performance improvement because some sections of PR are computation-bound rather


Fig. 8. PR benchmark synthesized varying the number of cores and contexts.

than memory-bound, leading to competition for parallel cores rather than stalls due to memory
access operations. With the largest configuration, we achieve a 30× speedup with 6.5× the area occupation in terms of LUTs (register occupation only increases 4.5 times).
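Similarly, a minimal OpenMP sketch of a pull-style PageRank iteration of the kind evaluated here (again an illustrative simplification, assuming a precomputed contribution array and reusing the hypothetical CSRGraph structure):

#include <cstdint>
#include <vector>

// One PageRank iteration in pull style: every vertex accumulates the
// contributions of its neighbors. The accumulation is the computation-bound
// part; the indexed reads of 'contrib' are the irregular memory accesses.
void pagerank_iteration(const CSRGraph &g, const std::vector<double> &contrib,
                        std::vector<double> &score, double damping) {
  const double base = (1.0 - damping) / g.num_vertices;
  #pragma omp parallel for schedule(static)
  for (uint64_t v = 0; v < g.num_vertices; ++v) {
    double sum = 0.0;
    for (uint64_t e = g.row_ptr[v]; e < g.row_ptr[v + 1]; ++e)
      sum += contrib[g.col_idx[e]];   // assumed: contrib[u] = score[u] / out_degree(u)
    score[v] = base + damping * sum;
  }
}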
The last GAPBS benchmark we considered is Triangle Count (TC), which counts the number
of triangles in the input graph. After assessing how performance and resource consumption scale
with varying numbers of cores and contexts (Figure 9) on the Alveo target, we used this benchmark
to perform a first comparison against the Svelto synthesis methodology [39]. There are a few
differences between the two methods that need to be considered. Svelto is based on a gcc/g++
frontend, while SPARTA uses Clang; the crossbar used in Svelto behaves differently from the NoC used in SPARTA, and the comparison is further complicated by the fact that
Svelto uses atomic-based synchronization while SPARTA implements OpenMP reductions. They
also operate under different assumptions regarding the behaviour of the memory (i.e., SPARTA
allows the use of asynchronous external memories that can accept multiple requests at the same time
while Svelto only sends one request at a time), and Svelto dynamically assigns loop iterations to
the cores while SPARTA uses static scheduling. Nevertheless, Svelto remains the state-of-the-art
methodology closest to SPARTA, as described in Section 2.1, and it is worthwhile to examine
the QoR of accelerators generated with the two methods. Furthermore, since both use Bambu
as HLS backend, we can safely state that the comparison only refers to the differences in the


Fig. 9. TC benchmark synthesized varying the number of cores and contexts.

implementation of parallel accelerators rather than being influenced by different choices in the
scheduling, allocation, and code generation steps.
The results in Figure 10(a) show the relative performance of SPARTA against the Svelto baseline
with both tools targeting a Virtex-7 FPGA at 100 MHz; bars higher than one thus represent
configurations where SPARTA outperforms Svelto. While Svelto is faster on smaller configurations,
SPARTA can achieve a higher level of parallelization as the number of cores and contexts grows.
There are two main reasons for this: the first one is that the NoC in SPARTA is designed to be
effective when there are many requests from different sources, which is achieved when the number
of cores is higher; the second reason lies in the ability of SPARTA to allow multiple memory
requests simultaneously, which is critical to exploit memory-level parallelism when multiple
memory channels are available. In fact, SPARTA adds a tag to external memory requests, allowing
it to send multiple pipelined, asynchronous requests, while Svelto must wait for the memory to
answer before performing the next request. With a number of OpenMP threads greater than the
number of banks in the external memory, SPARTA can execute more requests in parallel than Svelto,
resulting in a higher speedup. Regarding area usage, as can be seen in the relative comparisons of
Figure 10(b) and (c), SPARTA consumes more resources than Svelto on the TC benchmark. However,
the area-delay product (ADP) of SPARTA scales better with the number of cores/contexts: using
the formula
\frac{ADP_{SPARTA}}{ADP_{Svelto}} = \frac{LUTs_{SPARTA} \times Cycles_{SPARTA}}{LUTs_{Svelto} \times Cycles_{Svelto}},


Fig. 10. Comparison between accelerators generated through SPARTA and Svelto on the TC benchmark.

we can see that, for example, the ADP of a SPARTA accelerator with 1 core and 1 context is 1.14
times the ADP of a Svelto accelerator with the same configuration, and the ratio drops to 0.47 with
32 cores and 8 contexts (SPARTA is 3.8 times faster using only 1.7 times the LUTs and 2.4 times
the registers). One reason for the higher resource consumption is that SPARTA is more general
than Svelto: SPARTA explicitly implements multiple OpenMP primitives and synchronization
mechanisms, while Svelto only supports a few patterns, and thus, it can make assumptions on the


Table 5. Congestion Overhead of the NoC on GAPBS Accelerators in Two Configurations.

NoC
Benchmark Cycles BP #Mem. Operations Avg. BP Delay (cycles) Avg. NoC Latency (cycles)
TC_32Co_8cs 329159 964697 759812 1.27 7.29
PR_32Co_8cs 129064 168259 467939 0.36 4.42
CC_32Co_8cs 246424 2281643 1172090 1.95 7.67
NoC Registered
Benchmark Cycles BP #Mem. Operations Avg. BP Delay (cycles) Avg. NoC Latency (cycles)
TC_32Co_8cs 338701 508814 759812 0.67 12.00
PR_32Co_8cs 132046 148133 467939 0.32 8.03
CC_32Co_8cs 264787 1338848 1172090 1.14 12.26
BP indicates the total number of back pressure signals received by the FIFO in front of the cores.

behaviour of the input program that simplifies the underlying architecture. Moreover, SPARTA
implements reduction operations that require fewer cycles and memory operations but more area
than the atomic operations implemented in Svelto. Finally, Svelto has a centralized dispatcher that
distributes loop iterations to the cores, while each core in SPARTA contains dedicated logic to
compute the current iteration index.
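As a quick cross-check of the ADP ratio reported above for the largest configuration, using the rounded 3.8× and 1.7× factors (so the result is slightly below the 0.47 obtained from the exact LUT and cycle counts):

\frac{ADP_{SPARTA}}{ADP_{Svelto}} \approx \frac{1.7 \cdot LUTs_{Svelto} \cdot (Cycles_{Svelto}/3.8)}{LUTs_{Svelto} \cdot Cycles_{Svelto}} = \frac{1.7}{3.8} \approx 0.45.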

4.2 NoC Evaluation


In this section, we evaluate the performance cost of the NoC and the effect of back pressure on
the latency of memory operations. Table 5 shows data collected synthesizing and running the
TC, PR and CC benchmarks with 32 cores and eight contexts, as it is the configuration with the
largest occurrence of back pressure signals. We tested two different configurations of the NoC:
one that registers only the 32 I/O nodes and leaves the other 32 combinational, and one that only
uses registered nodes. The difference between the two configurations of the NoC is the number of
requests it can store, which is equal to the number of registered nodes, and the maximum latency
of the NoC, which is two clock cycles in the first case and four in the second. The actual number of
cycles needed to traverse the NoC equals the number of registered nodes on the request’s path plus
the effect of congestion (NoC latency).
BP represents the total number of times the NoC could not accept a memory request because of
congestion, measured through the number of back pressure signals received by the FIFO modules
in front of all cores, while the number of memory operations is the sum of the requests from all
cores on all memory channels. We can compute the average delay spent by a single request in the
FIFO modules before entering the NoC by dividing the total number of back pressure signals by the
number of memory operations; this is the effect of the arbitration on the main memory bus, and it
is very small when compared with the latency of the memory (20 cycles) or the latency of the NoC.
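For example, for the TC_32Co_8cs configuration in Table 5:

\text{Avg. BP Delay} = \frac{BP}{\#\text{Mem. Operations}} = \frac{964{,}697}{759{,}812} \approx 1.27 \text{ cycles}.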
The average NoC latency is measured as the average time that it takes for a request to traverse it.
Using a fully registered NoC, which can store more requests and thus reduces the effect of the arbitration delay, nevertheless has a negative effect on the total number of cycles for two reasons: although it reduces the number of back pressure signals, it increases the minimum time needed to traverse the NoC from two cycles to four, and it increases the internal congestion (visible in the higher average
NoC latency). For these reasons, we ran all the experiments in Section 4.1 with a NoC with only
the I/O nodes registered.

4.3 Parallel Queries


In this section, we generate accelerators for the same parallel database queries used in [39], which
are SPARQL queries from the LUBM benchmark and dataset [23] translated into C++ and parallelized


Fig. 11. Speedup of accelerators generated by SPARTA with respect to Svelto on the parallel queries.

with OpenMP. SPARQL is a query language for graph databases in the Resource Description Framework (RDF). RDF databases are naturally described as single labelled graphs, and SPARQL
queries are defined as (sub)graph matching operations. The resulting code features nested for loops
with different behaviours (e.g., number of OpenMP threads, number of loop iterations assigned
to each thread) and irregular memory accesses. We compared results obtained with SPARTA and
Svelto on several different parallel queries and also ran the same experiments, adding a cache in the
SPARTA accelerators in front of each memory bank (512 lines, 512 bits per line). Figure 11 shows
the relative performance of SPARTA against the Svelto baseline, with and without caches; as before,
bars higher than one represent configurations where SPARTA outperforms Svelto.
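To illustrate the code structure being synthesized, the following is a minimal OpenMP sketch of a translated query kernel of this kind; the label array, the matched pattern, and the function name are purely illustrative and do not correspond to any specific LUBM query. It reuses the hypothetical CSRGraph structure sketched earlier.

#include <cstdint>
#include <cstddef>

// Nested graph-pattern matching loop typical of the translated SPARQL
// queries: an outer parallel loop over candidate vertices and an inner loop
// that follows edges with irregular, data-dependent memory accesses.
// Matches are counted with an OpenMP reduction, one of the synchronization
// constructs supported by SPARTA.
uint64_t count_matches(const CSRGraph &g, const uint32_t *label,
                       uint32_t l0, uint32_t l1) {
  uint64_t matches = 0;
  #pragma omp parallel for schedule(static) reduction(+: matches)
  for (uint64_t v = 0; v < g.num_vertices; ++v) {
    if (label[v] != l0) continue;
    for (uint64_t e = g.row_ptr[v]; e < g.row_ptr[v + 1]; ++e) {
      uint64_t u = g.col_idx[e];        // irregular, data-dependent access
      if (label[u] == l1) ++matches;
    }
  }
  return matches;
}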
The results we obtained are different depending on the specific query considered and the characteristics of its workload. In Q1, SPARTA without caches is always slower than Svelto, as it contains a
small group of OpenMP threads, each performing a large number of operations, and we have shown
how SPARTA performs best with memory-bound workloads. Q4 and Q5 terminate in a few tens of
thousands of cycles and are not very indicative of the achievable speedup, as most cycles are spent
in synchronization operations. Data for Q5 with 8 cores and 8 contexts is missing as Svelto does not
terminate in that configuration. The results observed align with the ones from TC: SPARTA without
caches has worse performance when using few threads but improves as the number of cores


Fig. 12. Relative comparison of resource consumption between SPARTA and Svelto for query Q2. Results for
the other queries are omitted as they have very similar behaviour.

and contexts grows. In Q2, Q3, and Q7, SPARTA drastically outperforms Svelto, even considering
the higher resource utilization (Figure 12), with between 6 and 11.5 times faster execution. In Q1
and Q6 instead, Svelto is faster, and this can be explained by a lack of parallelism in the query: a lot
of the work has to be executed by a small subset of the threads and the dynamic assignment of
iterations in Svelto is better suited for this than the static division of work in SPARTA. However,
even in these cases, with enough cores and contexts, SPARTA can recover the disadvantage and at
least reach the same performance, although using a larger area.
When using caches, SPARTA is almost always faster than Svelto, and this is particularly effective
in the test cases where previously SPARTA was struggling: in both Q1 and Q6, the introduction of
caches in SPARTA makes the generated accelerators more than twice as fast as before. With the
largest configurations in Q2, Q3, and Q7, adding the cache reduces the speedup compared with
the base version of SPARTA: these tests are the ones in which static scheduling can divide the
computations between multiple cores and keep all of them working. Under this circumstance, the
main advantage of SPARTA is the ability to issue multiple requests in parallel from different contexts
and hide the memory latency while advancing the state of all the contexts simultaneously; adding
caches reduces the average memory latency, and it can cause a loss of parallelization when using
CS. In fact, if the memory requests are answered faster, the execution of the different contexts can
become serialized and, depending on the data access pattern, lead to worse results.
In Figure 13, we investigated the influence of CS across various configurations. Due to space
constraints, we focused on analyzing the speedup resulting from increasing the number of contexts
from 1 to 2, 4, and 8 for query Q7. The speedup was visualized using bar charts and compared
against different numbers of cores and tools (Svelto, SPARTA, and SPARTA with caches). When the
number of contexts was fixed at 1, we noticed that Svelto and SPARTA exhibited similar execution
times, except in the case of 32 cores (see Figure 11). As expected, introducing caches improved
SPARTA’s performance across all scenarios. Transitioning from spatial to temporal parallelism,
the SPARTA and SPARTA-with-caches approaches demonstrated superior scalability compared to Svelto.
In fact, in Figure 13, the maximum speedup achieved by increasing the number of cores tended to
decrease as expected, but the rate of decrease was lower for SPARTA and SPARTA with caches
compared to Svelto. This limited scalability of Svelto is due to its interconnection, which scales less
efficiently than the NoC used by SPARTA. Furthermore, SPARTA initiates a new memory operation
before completing the previous one, enhancing channel utilization. Finally, with caches enabled in
SPARTA, the impact of spatial parallelism is heightened while the impact of temporal parallelism is


Fig. 13. Impact on the speedup considering various core counts and tools (e.g., Svelto, SPARTA, and SPARTA
with cache) with respect to benchmark query Q7. The baseline refers to the execution time of query Q7
when a single context is utilized for a specific number of cores. The bars represent the speedup achieved by
expanding the context to 2, 4, and 8.

Table 6. Characteristics of the Graphs Used to Compare Vitis and SPARTA

Name Vertices Edges


Graph_1 4,096 65,390
Graph_2 8,192 98,230
Graph_3 8,192 131,036
Graph_4 16,384 131,036

reduced. In Figure 13, the maximum speedup with Svelto and SPARTA is around 6–7, while with
SPARTA with caches, it is around 4.

4.4 Vitis HLS Graph Library


The commercial tool Vitis HLS from AMD/Xilinx provides a library of pre-optimized C++ kernels to
implement graph processing accelerators.2 The kernels from the Vitis library have been optimized
for HLS through parallelization, loop pipelining, and the introduction of caches to reduce memory
access times.
We used the same input graphs to compare the results obtained through the Vitis Graph library with the SPARTA approach. Although our graphs are smaller in scale than those used with the Vitis Graph Library, they keep simulation and analysis times reasonable for conducting
design space exploration. The graph characteristics are detailed in Table 6. The graphs are stored
in the Compressed Sparse Row matrix format as undirected. For TC and PR, directed graphs are

2 https://docs.xilinx.com/r/en-US/Vitis_Libraries/graph/index.html


Table 7. Comparison between SPARTA and Vitis on the CC Benchmark

Graph Tool Cycles Frequency Time (ms) LUTs AreaXTime (LUTs × ms)


Graph_1 SPARTA 148,599 220 MHz 0.675 137,407 92,700
Graph_1 Vitis 2,358,543 280 MHz 8.423 103,923 875,031
Graph_2 SPARTA 255,000 220 MHz 1.158 137,407 159,076
Graph_2 Vitis 3,960,784 280 MHz 14.14 103,923 1,469,471
Graph_3 SPARTA 322,363 220 MHz 1.464 137,407 201,098
Graph_3 Vitis 4,661,624 280 MHz 16.64 103,923 1,729,486
Graph_4 SPARTA 422,074 220 MHz 1.912 137,407 263,301
Graph_4 Vitis 6,456,582 280 MHz 23.05 103,923 2,395,425

Table 8. Comparison between SPARTA and Vitis on the TC Benchmark

Graph Tool Cycles Frequency Time (ms) LUTs AreaXTime (LUTs × ms)


Graph_1 SPARTA 359,552 220 MHz 1.632 91,449 149,278
Graph_1 Vitis 456,756 300 MHz 1.521 21,001 31,942
Graph_2 SPARTA 378,962 220 MHz 1.720 91,449 157,336
Graph_2 Vitis 560,060 300 MHz 1.865 21,001 39,166
Graph_3 SPARTA 552,599 220 MHz 2.509 91,449 201,098
Graph_3 Vitis 916,216 300 MHz 3.051 21,001 64,074
Graph_4 SPARTA 292,759 220 MHz 1.329 91,449 263,301
Graph_4 Vitis 590,690 300 MHz 1.967 21,001 41,308

required, thus only the lower triangular part of the matrix is utilized. The SPARTA configurations
for TC, CC, and PR are optimized to fully utilize spatial and temporal parallelism, using 32 cores
and eight contexts each. However, due to issues in the place and route phase, SPARTA cannot
achieve the same frequency as Vitis with such large designs. As a result, the post-routing frequency
achieved is 200/220 MHz instead of the 250 MHz obtained with smaller configurations.
In the Vitis benchmarks, the delay of the HBM interface between the accelerator and the memory
ranges from 100 to 140 ns, depending on the address of the request. To approximate this behaviour
in the SPARTA benchmarks, we increased the memory delay to 32 cycles, which at 200/220 MHz
is equivalent to 160/145 ns. Additionally, we utilized the AXI4 protocol instead of the minimal
interface to exchange data between the accelerators and the external memory.
It is worth noting that both SPARTA and Vitis target an Alveo U250 FPGA, part of the UltraScale+
family. The results for each algorithm are reported separately in Tables 7, 8, and 9.
SPARTA and Vitis can achieve different maximum frequencies. To account for this difference, we
calculated the kernel execution time as the achieved clock period times the number of cycles (Time
in the Tables). In addition, we computed the AreaXTime metric to normalize the performance of
the different designs.
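For example, for the Graph_1 SPARTA row of Table 7 (the tabulated AreaXTime differs slightly because of rounding in the reported execution time):

Time = \frac{Cycles}{f} = \frac{148{,}599}{220\,\text{MHz}} \approx 0.675\,\text{ms}, \qquad AreaXTime = LUTs \times Time \approx 137{,}407 \times 0.675 \approx 92{,}750.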
In the CC benchmark, SPARTA outperforms Vitis in terms of execution time by a factor of 10,
with only a 30% greater area occupation. Additionally, the AreaXTime metric favours SPARTA by
almost an order of magnitude.
In the TC benchmark, SPARTA achieves, on average, a faster execution time. It requires 14%
less time to complete the computation of the four graphs. However, it occupies more than four times the area of the Vitis design.
In the PR benchmark, SPARTA achieves a speedup between four and six times compared to Vitis,
occupying only twice the area. Consequently, it achieves an AreaXTime metric that is at most half that of Vitis.


Table 9. Comparison between SPARTA and Vitis on the PR Benchmark

Graph Tool Cycles Frequency Time (ms) LUTs AreaXTime (LUTs × ms)


Graph_1 SPARTA 692,441 200 MHz 3.462 205,444 711,289
Graph_1 Vitis 3,703,603 300 MHz 12.333 88,372 1,089,891
Graph_2 SPARTA 783,614 200 MHz 3.918 205,444 804,943
Graph_2 Vitis 5,659,159 300 MHz 18.845 88,372 1,665,370
Graph_3 SPARTA 971,440 200 MHz 4.857 205,444 997,882
Graph_3 Vitis 7,509,609 300 MHz 25.007 88,372 2,209,918
Graph_4 SPARTA 1,088,871 200 MHz 5.444 205,444 1,118,518
Graph_4 Vitis 7,614,714 300 MHz 25.375 88,372 2,240,848

Table 10. Comparison between SPARTA and Vitis HLS Accelerators for the Sequential BFS Algorithm

Results SPARTA Vitis HLS


Cycles 32,622,452 33,287,532
LUTs 13,163 4,852
Registers 13,944 8,164
BRAM 0 263
SPARTA results include the cost of the NoC and
the adapter from the minimal protocol to AXI4.

As performance results for the BFS graph library kernel are not available, we also used Vitis HLS
to synthesize a C++ implementation of the BFS algorithm loosely based on the one available in
MachSuite [48]. Note that we are using the common frontier-based implementation rather than
the direction-optimizing implementation [6] (also provided in GAPBS). BFS is the prototypical
irregular kernel [1] where the number of iterations for each thread can be highly unbalanced and
requires critical sections to synchronize the different threads.
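For context, the following is a minimal OpenMP sketch of the frontier-based BFS structure being synthesized; it is an illustrative simplification of a MachSuite-style kernel, not the exact source, and uses a critical section to update the shared next frontier. It reuses the hypothetical CSRGraph structure sketched earlier.

#include <cstdint>
#include <vector>

// Frontier-based BFS: each level expands the current frontier in parallel.
// The work per thread depends on the degrees of the assigned frontier
// vertices, so the load can be highly unbalanced, and a critical section
// serializes updates to the shared next frontier.
std::vector<int64_t> bfs_levels(const CSRGraph &g, uint64_t source) {
  std::vector<int64_t> level(g.num_vertices, -1);
  std::vector<uint64_t> frontier{source}, next;
  level[source] = 0;

  for (int64_t depth = 1; !frontier.empty(); ++depth) {
    next.clear();
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < frontier.size(); ++i) {
      uint64_t v = frontier[i];
      for (uint64_t e = g.row_ptr[v]; e < g.row_ptr[v + 1]; ++e) {
        uint64_t u = g.col_idx[e];
        if (level[u] == -1) {          // unsynchronized pre-check
          #pragma omp critical
          {
            if (level[u] == -1) {      // re-check inside the critical section
              level[u] = depth;
              next.push_back(u);
            }
          }
        }
      }
    }
    frontier.swap(next);
  }
  return level;
}

The unsynchronized pre-check of level[u] before the critical section is a common benign-race shortcut in such benchmark codes; correctness is preserved by re-checking inside the critical section.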
We tried different configurations for SPARTA, varying the number of cores and contexts using
the accelerator produced by Vitis HLS as a baseline. Table 10 compares the accelerators produced
by Vitis HLS and SPARTA without parallelization (one core and one context). There is only a
small difference between the number of cycles for the two tools, while the area is more difficult to
compare: Bambu does not require BRAMs for this design as it maps all logic on LUTs and registers,
and almost half of the LUTs and registers used are dedicated to the 16-node NoC (as reported in
Table 4).
Figure 14(a) displays the speedup achieved using different configurations of SPARTA with respect
to the number of cycles required by Vitis HLS. In the sequential version, most of the time is spent
waiting for the external memory to provide data in both the SPARTA and Vitis implementations. In
the SPARTA parallel accelerators, a similar pattern arises when the number of contexts per core
is small: each core spends most of the time waiting for new data to process. As we increase the
number of contexts, each core spends a greater fraction of time doing useful computations. Even
when using multiple contexts, BFS does not achieve the same speedup as more regular applications
such as the ones described in Section 4.1: during the exploration of the graph, each thread can
find a different number of elements, leading to a high load imbalance. Furthermore, all threads
must be periodically synchronized to avoid exploring the same node of the graph multiple times
and to compute the correct length of the shortest path from the origin. Despite this, SPARTA can


Fig. 14. BFS benchmark synthesized varying the number of cores and contexts.

achieve good speedup when using four contexts per core: with four cores, it achieves a speedup of
almost five times, and with two cores, a speedup of three times compared to the Vitis HLS baseline.
Figure 14(b) compares the area utilization of SPARTA and Vitis HLS. When adding a core, SPARTA
replicates almost the entire design and adds necessary communication and synchronization logic.
Hence, resource utilization increases a little more than linearly: with four cores and one context,
the area occupied is six times more than the baseline. Note that with 4 cores, the NoC is larger than
the entire design generated by Vitis HLS. However, its size does not increase when more contexts
per core are added.

4.5 LegUp
We also compare SPARTA with the published results obtained by LegUp [15]. Specifically, we use the Mandelbrot kernel from the latest open-source version of LegUp: an implementation of the Mandelbrot iterative mathematical benchmark that produces a fractal image and is parallelized with OpenMP for pragmas. The results obtained by LegUp on this
benchmark are reported in Table 11, as published in [15]. SPARTA achieves better performance
across all metrics when using the same Mandelbrot kernel on a Stratix IV device (see Table 12). This


Table 11. The LegUp Results for the Mandelbrot Kernel

Accelerator Cycles Frequency LUTs M9Ks DSPs


1 2,523,455 135 MHz 22,111 134 28
4 627,059 127 MHz 26,869 134 88
8 355,224 124 MHz 34,187 158 168
12 241,805 127 MHz 37,476 158 248
16 185,863 127 MHz 40,890 158 328
32 126,828 112 MHz 61,251 182 648
48 113,593 104 MHz 81,224 206 968
Accelerator indicates the number of accelerators used, and M9K is the
number of memory components.

Table 12. The SPARTA Results for the Mandelbrot Kernel

Accelerator Cycles Frequency LUTs M9Ks DSPs


1 2,294,063 200 MHz 4,878 1 12
2 1,151,243 200 MHz 5,198 2 24
4 579,830 200 MHz 5,928 4 48
8 291,057 197 MHz 7,553 8 96
16 139,607 196 MHz 9,939 16 192
32 73,039 163 MHz 19,793 32 384
Accelerator indicates the number of accelerators used, and M9K is the number of
memory components.

performance advantage can be attributed to several factors, including SPARTA’s use of external
memory for storing the Mandelbrot picture, its independence from a soft processor for OpenMP
parallelization, and its simpler architecture.
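For reference, a minimal sketch of an OpenMP-parallelized Mandelbrot kernel of the kind used in this comparison; the image dimensions, the pixel-to-plane mapping, the iteration bound, and the output layout are assumptions and do not reproduce the exact LegUp benchmark code.

#include <cstdint>

// Each pixel is an independent escape-time computation, so rows of the
// fractal image can be distributed across OpenMP threads with a simple
// parallel-for pragma; the resulting image is written to external memory.
void mandelbrot(uint8_t *image, int width, int height, int max_iter) {
  #pragma omp parallel for schedule(static)
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x) {
      double cr = -2.0 + 3.0 * x / width;   // map pixel to the complex plane
      double ci = -1.5 + 3.0 * y / height;
      double zr = 0.0, zi = 0.0;
      int it = 0;
      while (zr * zr + zi * zi < 4.0 && it < max_iter) {
        double tmp = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = tmp;
        ++it;
      }
      image[y * width + x] = static_cast<uint8_t>(it & 0xFF);
    }
  }
}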
Results from the other benchmarks in [15] would only further widen the performance gap between SPARTA and LegUp. These other benchmarks rely on Pthread primitives in addition to the OpenMP pragmas. In LegUp, Pthread primitives are partially implemented in software to control hardware threads,
increasing the overhead introduced by the soft processor and affecting the overall execution.

5 Conclusion
Automating the design of complex parallel architectures can greatly improve developers’ productivity and lower the access barrier to FPGA acceleration for non-experts. In this article, we presented
SPARTA, a methodology to support the synthesis of parallel applications written in OpenMP
through HLS. SPARTA exploits a compiler frontend to translate OpenMP pragmas into low-level
calls to primitive functions and a library of corresponding custom hardware components that are
allocated in the HLS tool backend. With SPARTA, it is possible to synthesize a variety of OpenMP
work-sharing and synchronization constructs and to assign multiple threads to a single accelerator
core to hide the latency of slower operations. The SPARTA architecture is modular and flexible,
featuring multiple parallel cores, a hierarchical memory system, and an efficient NoC interface
toward external memories. We tested our methodology on parallel graph applications, showing
linear scaling in both performance and area consumption when the number of parallel cores increases, and we compared our results against a state-of-the-art methodology for the synthesis of
parallel accelerators, obtaining on average a 2.29× speedup. Future developments could include
support for OpenMP tasks, dynamic scheduling, atomic memory operations, and the introduction


of a broadcasting mechanism for memory responses; the latter would greatly reduce latency for
patterns such as matrix-vector multiplication, where multiple cores require the same information
stored in external memory.

References
[1] Graph500. 2023. Retrieved from https://graph500.org
[2] Hamzeh Ahangari, Muhammet Mustafa Özdal, and Özcan Öztürk. 2023. HLS-Based High-Throughput and Work-
Efficient Synthesizable Graph Processing Template Pipeline. ACM Transactions on Embedded Computing Systems 22, 2
(2023), 1–24.
[3] Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2015. A Scalable Processing-in-Memory
Accelerator for Parallel Graph Processing. In Proceedings of the 42nd Annual International Symposium on Computer
Architecture (ISCA). 105–117.
[4] ARM Developers. 2020. AMBA AXI and ACE Protocol Specification. Retrieved from https://developer.arm.com/documentation/ihi0022/e/AMBA-AXI3-and-AXI4-Protocol-Specification
[5] Abanti Basak, Shuangchen Li, Xing Hu, Sang Min Oh, Xinfeng Xie, Li Zhao, Xiaowei Jiang, and Yuan Xie. 2019.
Analysis and Optimization of the Memory Hierarchy for Graph Processing Workloads. In Proceedings of the IEEE
International Symposium on High Performance Computer Architecture (HPCA). 373–386.
[6] Scott Beamer, Krste Asanovic, and David A. Patterson. 2012. Direction-Optimizing Breadth-First Search. In Proceedings
of the SC Conference on High Performance Computing Networking, Storage and Analysis (SC ’12). 12.
[7] Scott Beamer, Krste Asanovic, and David A. Patterson. 2015. The GAP Benchmark Suite. Retrieved from http://arxiv.org/abs/1508.03619
[8] B. Betkaoui, Yu Wang, D.B. Thomas, and W. Luk. 2012. A Reconfigurable Computing Approach for Efficient and
Scalable Parallel Graph Exploration. In Proceedings of the IEEE 23rd International Conference on Application-Specific
Systems, Architectures and Processors (ASAP ’12). 8–15.
[9] Nicolas Bohm Agostini, Serena Curzel, Vinay Amatya, Cheng Tan, Marco Minutoli, Vito Giovanni Castellana,
Joseph Manzano, David Kaeli, and Antonino Tumeo. 2022. An MLIR-based Compiler Flow for System-Level Design
and Hardware Acceleration. In Proceedings of the IEEE/ACM International Conference on Computer Aided Design
(ICCAD). 1–9.
[10] Andrew Canis, Jongsok Choi, Blair Fort, Ruolong Lian, Qijing Huang, Nazanin Calagar, Marcel Gort, Jia Jun Qin, Mark
Aldham, Tomasz Czajkowski, Stephen Brown, and Jason Anderson. 2013. From Software to Accelerators with LegUp
High-Level Synthesis. In Proceedings of the 2013 International Conference on Compilers, Architecture and Synthesis for
Embedded Systems (CASES). 1–9.
[11] Vito Giovanni Castellana, Marco Minutoli, Alessandro Morari, Antonino Tumeo, Marco Lattuada, and Fabrizio
Ferrandi. 2015. High Level Synthesis of RDF Queries for Graph Analytics. In Proceedings of the IEEE/ACM International
Conference on Computer-Aided Design (ICCAD). 323–330.
[12] Rohit Chandra, Leo Dagum, Ramesh Menon, David Kohr, Dror Maydan, and Jeff McDonald. 2001. Parallel Programming
in OpenMP. Morgan Kaufmann.
[13] Xinyu Chen, Hongshi Tan, Yao Chen, Bingsheng He, Weng-Fai Wong, and Deming Chen. 2021. ThunderGP: HLS-
Based Graph Processing Framework on FPGAs. In Proceedings of the ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays (FPGA). 69–80.
[14] Yuze Chi, Licheng Guo, Jason Lau, Young-kyu Choi, Jie Wang, and Jason Cong. 2021. Extending High-Level Synthesis
for Task-Parallel Programs. In Proceedings of the IEEE 29th Annual International Symposium on Field-Programmable
Custom Computing Machines (FCCM). 204–213.
[15] Jongsok Choi, Stephen Brown, and Jason Anderson. 2013. From Software Threads to Parallel Hardware in High-Level
Synthesis for FPGAs. In Proceedings of the 2013 International Conference on Field-Programmable Technology (FPT).
270–277.
[16] Jason Cong, Jason Lau, Gai Liu, Stephen Neuendorffer, Peichen Pan, Kees Vissers, and Zhiru Zhang. 2022. FPGA HLS
Today: Successes, Challenges, and Opportunities. ACM Transactions on Reconfigurable Technology and Systems 15, 4
(2022), 1–42.
[17] Yixiao Du, Yuwei Hu, Zhongchun Zhou, and Zhiru Zhang. 2022. High-Performance Sparse Linear Algebra on HBM-
Equipped FPGAs Using HLS: A Case Study on SpMV. In Proceedings of the ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays (FPGA). 54–64.
[18] José Duato, Sudhakar Yalamanchili, and Lionel Ni. 2003. Interconnection Networks: An Engineering Approach. Morgan
Kaufmann, 359–444.
[19] F. Ferrandi, V. G. Castellana, S. Curzel, P. Fezzardi, M. Fiorito, M. Lattuada, M. Minutoli, C. Pilato, and A. Tumeo. 2021.
Bambu: An Open-Source Research Framework for the High-Level Synthesis of Complex Applications. In Proceedings
of the 58th ACM/IEEE Design Automation Conference (DAC). 1327–1330.


[20] Yingxue Gao, Teng Wang, Lei Gong, Chao Wang, Xi Li, and Xuehai Zhou. 2023. FastRW: A Dataflow-Efficient and
Memory-Aware Accelerator for Graph Random Walk on FPGAs. In Proceedings of the Design, Automation & Test in
Europe Conference & Exhibition (DATE). 1–6.
[21] Roberto Gioiosa, Antonino Tumeo, Jian Yin, Thomas Warfel, David Haglin, and Santiago Betelu. 2017. Exploring
DataVortex Systems for Irregular Applications. In Proceedings of the 2017 IEEE International Parallel and Distributed
Processing Symposium (IPDPS). 409–418. DOI: https://doi.org/10.1109/IPDPS.2017.121
[22] Chuang-Yi Gui, Long Zheng, Bingsheng He, Cheng Liu, Xin-Yu Chen, Xiao-Fei Liao, and Hai Jin. 2019. A Survey on
Graph Processing Accelerators: Challenges and Opportunities. Journal of Computer Science and Technology 34 (2019),
339–371.
[23] Yuanbo Guo, Zhengxiang Pan, and Jeff Heflin. 2005. LUBM: A Benchmark for OWL Knowledge Base Systems. Journal
of Web Semantics 3, 2–3 (2005), 158–182.
[24] R. J. Halstead and W. Najjar. 2013. Compiled Multithreaded Data Paths on FPGAs for Dynamic Workloads. In
Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems
(CASES). 1–10.
[25] T. J. Ham, L. Wu, N. Sundaram, N. Satish, and M. Martonosi. 2016. Graphicionado: A high-Performance and Energy-
Efficient Accelerator for Graph Analytics. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture
(MICRO). 1–13.
[26] Yuwei Hu, Yixiao Du, Ecenur Ustun, and Zhiru Zhang. 2021. GraphLily: Accelerating Graph Linear Algebra
on HBM-Equipped FPGAs. In Proceedings of the IEEE/ACM International Conference on Computer Aided Design
(ICCAD). 1–9.
[27] J. Huthmann, J. Oppermann, and A. Koch. 2014. Automatic High-Level Synthesis of Multi-Threaded Hardware
Accelerators. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL). 1–4.
[28] Data Vortex Inc. 2023. Data Vortex Network on Chip IP Block. Retrieved from https://www.datavortex.com/dv-noc-brief
[29] Sang-Woo Jun, Andy Wright, Sizhuo Zhang, Shuotao Xu, and Arvind. 2018. GraFBoost: Using Accelerated Flash
Storage for External Graph Analytics. In Proceedings of the ACM/IEEE 45th Annual International Symposium on
Computer Architecture (ISCA). 411–424.
[30] Nachiket Kapre and Jan Gray. 2015. Hoplite: Building Austere Overlay NoCs for FPGAs. In Proceedings of the 25th
International Conference on Field Programmable Logic and Applications (FPL). 1–8.
[31] Nachiket Kapre and Jan Gray. 2017. Hoplite: A Deflection-Routed Directional Torus NoC for FPGAs. ACM Transactions
on Reconfigurable Technology and Systems 10, 2 (Mar 2017), 1–24.
[32] Khronos Group. [n. d.]. OpenCL - Open Standard for Parallel Programming of Heterogeneous Systems. Retrieved from https://www.khronos.org/opencl/
[33] Kartik Lakhotia, Rajgopal Kannan, Sourav Pati, and Viktor Prasanna. 2019. GPOP: A Cache and Memory-Efficient
Framework for Graph Processing over Partitions. In Proceedings of the 24th Symposium on Principles and Practice of
Parallel Programming. 393–394.
[34] Odile Liboiron-Ladouceur, Assaf Shacham, Benjamin A. Small, Benjamin G. Lee, Howard Wang, Caroline P. Lai,
Aleksandr Biberman, and Keren Bergman. 2008. The Data Vortex Optical Packet Switched Interconnection Network.
Journal of Lightwave Technology 26, 13 (2008), 1777–1789. DOI: https://doi.org/10.1109/JLT.2007.913739
[35] LLVM developers. [n. d.]. LLVM OpenMP Runtime Library. Retrieved from https://raw.githubusercontent.com/llvm/llvm-project/main/openmp/runtime/doc/Reference.pdf
[36] Hao Lu, Mahantesh Halappanavar, and Ananth Kalyanaraman. 2015. Parallel Heuristics for Scalable Community
Detection. Parallel Computing 47, C (Aug. 2015), 19–37.
[37] Gurshaant Singh Malik and Nachiket Kapre. 2019. Enhancing Butterfly Fat Tree NoCs for FPGAs with Lightweight
Flow Control. In Proceedings of the 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom
Computing Machines (FCCM). 154–162. DOI: https://doi.org/10.1109/FCCM.2019.00030
[38] Fredrik Manne and Mahantesh Halappanavar. 2014. New Effective Multithreaded Matching Algorithms. In Proceedings
of the IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS). 519–528.
[39] Marco Minutoli, Vito Giovanni Castellana, Nicola Saporetti, Stefano Devecchi, Marco Lattuada, Pietro Fezzardi,
Antonino Tumeo, and Fabrizio Ferrandi. 2022. Svelto: High-Level Synthesis of Multi-Threaded Accelerators for Graph
Analytics. IEEE Transactions on Computers 71, 3 (2022), 520–533.
[40] Marco Minutoli, Vito Giovanni Castellana, Antonino Tumeo, Marco Lattuada, and Fabrizio Ferrandi. 2016. Efficient
Synthesis Of Graph Methods: A Dynamically Scheduled Architecture. In Proceedings of the 35th International Conference
on Computer-Aided Design (ICCAD).
[41] Thomas Moscibroda and Onur Mutlu. 2009. A Case for Bufferless Routing in On-Chip Networks. In Proceedings of the
36th Annual International Symposium on Computer Architecture (ISCA). 196–207.


[42] Razvan Nane, Vlad Mihai Sima, Christian Pilato, Jongsok Choi, Blair Fort, Andrew Canis, Yu Ting Chen, Hsuan Hsiao,
Stephen Dean Brown, Fabrizio Ferrandi, Jason Helge Anderson, and Koen Bertels. 2016. A Survey and Evaluation of
FPGA High-Level Synthesis Tools. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 35,
10 (2016), 1591–1604.
[43] Tan Nguyen, Yao Chen, Kyle Rupnow, Swathi Gurumani, and Deming Chen. 2016. SoC, NoC and Hierarchical Bus
Implementations of Applications on FPGAs Using the FCUDA Flow. In Proceedings of the IEEE Computer Society
Annual Symposium on VLSI (ISVLSI). 661–666.
[44] E. Nurvitadhi, G. Weisz, Y. Wang, S. Hurkat, M. Nguyen, J. C. Hoe, J. F. Martínez, and C. Guestrin. 2014. GraphGen:
An FPGA Framework for Vertex-Centric Graph Computation. In Proceedings of the IEEE International Symposium on
Field-Programmable Custom Computing Machines (FCCM). 25–28.
[45] Tayo Oguntebi and Kunle Olukotun. 2016. GraphOps: A Dataflow Library for Graph Analytics Acceleration. In
Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA). 111–117.
[46] Debjit Pal, Yi-Hsiang Lai, Shaojie Xiang, Niansong Zhang, Hongzheng Chen, Jeremy Casas, Pasquale Cocchini,
Zhenkun Yang, Jin Yang, Louis-Noël Pouchet, and Zhiru Zhang. 2022. Accelerator Design with Decoupled Hardware
Customizations: Benefits and Challenges. In Proceedings of the 59th ACM/IEEE Design Automation Conference (DAC).
1351–1354.
[47] Artur Podobas and Mats Brorsson. 2016. Empowering OpenMP with Automatically Generated Hardware. In Proceedings
of the International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS). IEEE.
[48] Brandon Reagen, Robert Adolf, Yakun Sophia Shao, Gu-Yeon Wei, and David Brooks. 2014. MachSuite: Benchmarks
for Accelerator Design and Customized architectures. In Proceedings of the 2014 IEEE International Symposium on
Workload Characterization (IISWC). IEEE, 110–119.
[49] Yousef Saad. 2003. Iterative Methods for Sparse Linear Systems (2nd ed.). SIAM.
[50] Atefeh Sohrabizadeh, Yunsheng Bai, Yizhou Sun, and Jason Cong. 2022. Automated Accelerator Optimization Aided
by Graph Neural Networks. In Proceedings of the 59th ACM/IEEE Design Automation Conference (DAC). 55–60.
[51] L. Sommer, J. Oppermann, J. Hofmann, and A. Koch. 2017. Synthesis of Interleaved Multithreaded Accelerators
from OpenMP Loops. In Proceedings of the International Conference on ReConFigurable Computing and FPGAs
(ReConFig). 1–7.
[52] Linghao Song, Youwei Zhuo, Xuehai Qian, Hai Li, and Yiran Chen. 2018. GraphR: Accelerating Graph Processing
Using ReRAM. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA).
531–543.
[53] Mingxing Tan, Gai Liu, Ritchie Zhao, Steve Dai, and Zhiru Zhang. 2015. ElasticFlow: A Complexity-Effective Approach
for Pipelining Irregular Loop Nests. In Proceedings of the IEEE/ACM International Conference on Computer-Aided
Design (ICCAD). 78–85.
[54] J. Villarreal, A. Park, W. Najjar, and R. Halstead. 2010. Designing Modular Hardware Accelerators in C with ROCCC 2.0.
In Proceedings of the 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines
(FCCM). 127–134.
[55] S. Windh, P. Budhkar, and W. A. Najjar. 2015. CAMs as Synchronizing Caches for Multithreaded Irregular Applications
on FPGAs. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD). 331–336.
[56] Pengcheng Yao, Long Zheng, Yu Huang, Qinggang Wang, Chuangyi Gui, Zhen Zeng, Xiaofei Liao, Hai Jin, and
Jingling Xue. 2022. ScalaGraph: A Scalable Accelerator for Massively Parallel Graph Processing. In Proceedings of the
IEEE International Symposium on High-Performance Computer Architecture (HPCA). 199–212.
[57] Hanchen Ye, Cong Hao, Jianyi Cheng, Hyunmin Jeong, Jack Huang, Stephen Neuendorffer, and Deming Chen.
2022. ScaleHLS: A New Scalable High-Level Synthesis Framework on Multi-Level Intermediate Representation. In
Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA). 741–755.
[58] Jialiang Zhang, Soroosh Khoram, and Jing Li. 2017. Boosting the Performance of FPGA-Based Graph Processor Using
Hybrid Memory Cube: A Case for Breadth First Search. In Proceedings of the 2017 ACM/SIGDA International Symposium
on Field-Programmable Gate Arrays (FPGA). 207–216.
[59] Shijie Zhou, Charalampos Chelmis, and Viktor K. Prasanna. 2015. Optimizing Memory Performance for FPGA
Implementation of Pagerank. In Proceedings of the International Conference on ReConFigurable Computing and FPGAs
(ReConFig). 1–6.
[60] Shijie Zhou, Rajgopal Kannan, Viktor K. Prasanna, Guna Seetharaman, and Qing Wu. 2019. HitGraph: High-Throughput
Graph Processing Framework on FPGA. IEEE Transactions on Parallel and Distributed Systems 30, 10 (2019), 2249–2264.
[61] Shijie Zhou, Rajgopal Kannan, Hanqing Zeng, and Viktor K. Prasanna. 2018. An FPGA Framework for Edge-Centric
Graph Processing. In Proceedings of the ACM International Conference on Computing Frontiers (CF). 69–77.

Received 31 December 2023; revised 28 May 2024; accepted 3 July 2024

