Customized accelerators have revolutionized modern computing, driving significant improvements in energy efficiency and performance
through hardware specialization. Field-Programmable Gate Arrays (FPGAs) play a crucial role in this paradigm, offering unparalleled
flexibility and high-performance potential. High-Level Synthesis (HLS) and source-to-source compilers have simplified FPGA devel-
opment by translating high-level programming languages into hardware descriptions enriched with directives. However, achieving
high Quality of Results (QoR) remains a significant challenge, requiring intricate code transformations, strategic directive placement,
and optimized data communication.
This paper presents Prometheus, a holistic optimization framework that integrates key optimizations—including task fusion,
tiling, loop permutation, computation-communication overlap, and concurrent task execution—into a unified design space. By leveraging
Non-Linear Programming (NLP) methodologies, Prometheus explores the optimization space under strict resource constraints, enabling
automatic bitstream generation. Unlike existing frameworks, Prometheus considers interdependent transformations and dynamically
balances computation and memory access.
We evaluate Prometheus across multiple benchmarks, demonstrating its ability to maximize parallelism, minimize execution stalls,
and optimize data movement. The results showcase its superior performance compared to state-of-the-art FPGA optimization frame-
works, highlighting its effectiveness in delivering high QoR while reducing manual tuning efforts.
CCS Concepts: • Hardware → High-level and register-transfer level synthesis; • Software and its engineering → Compilers.
Additional Key Words and Phrases: High-Level Synthesis, NonLinear Programming, Compiler
1 Introduction
The rise of customized accelerators has revolutionized modern computing, driving significant improvements in energy
efficiency and performance through hardware specialization. Among these accelerators, Field-Programmable Gate Ar-
rays (FPGAs) have gained widespread adoption due to their flexibility, high performance, and adaptability across di-
verse domains, including machine learning, scientific computing, financial modeling, and embedded systems. Unlike
fixed-function hardware such as Application-Specific Integrated Circuits (ASICs), FPGAs offer reconfigurability, en-
abling hardware adaptation to workload-specific requirements. However, achieving a good Quality of Result (QoR) on
FPGAs remains a complex and resource-intensive process, requiring careful tuning of computation, memory access,
and parallelism strategies.
To address this complexity, High-Level Synthesis (HLS) tools and source-to-source compilers [8, 19, 29, 30, 36, 39, 46,
58, 60, 75, 76, 78, 79, 81] have emerged as critical enablers of FPGA adoption, allowing developers to write hardware-
accelerated programs in high-level languages such as C++ and Python. These tools generate synthesizable hardware
descriptions, leveraging pragmas (hardware directives) and code transformations to optimize execution. Despite their
advantages, HLS-generated designs still require substantial manual tuning to achieve high QoR, as the compiler’s
ability to optimize performance remains highly dependent on the structure and directives applied to the input code.
Authors' Contact Information: Stéphane Pouget, University of California, Los Angeles; Michael Lo, University of California, Los Angeles; Louis-Noël Pouchet, Colorado State University; Jason Cong, University of California, Los Angeles.
This paper makes the following contributions:
• Unified FPGA optimization framework that jointly optimizes loop transformations, pragma insertion, task concurrency, and computation-communication overlap while considering interdependencies between computation
and data movement.
• Hybrid execution model that dynamically selects between shared buffering and dataflow streaming to maxi-
mize parallelism and efficient memory access.
• NLP-based design space exploration that automates the selection of loop tiling, scheduling, and memory strategies toward theoretically optimal global performance.
• SLR-aware scheduling and multi-SLR partitioning to balance routing complexity and FPGA resource utilization,
overcoming single-SLR limitations in prior works.
• End-to-end compilation and automation that generates optimized HLS-C++ code, OpenCL host code, and FPGA
bitstreams with minimal manual intervention.
• Comprehensive performance evaluation, demonstrating superior QoR compared to AutoDSE [65], Sisyphus [58],
ScaleHLS [78], Stream-HLS [8] and Allo [14].
quality and model calibration. Conversely, NLP-DSE explores the entire design space efficiently by formulating the
optimization as a Non-Linear Programming problem. This method allows for rapid exploration and selection of theoretically optimal configurations. However, it relies on a theoretical cost model, which may introduce inaccuracies if the
compiler behaves unpredictably or if certain optimizations are not accurately captured by the model. Consequently,
while NLP-DSE provides a fast and scalable solution, its reliability depends on the fidelity of the cost model to actual
HLS compilation behavior.
While pragma insertion is a powerful optimization technique, its effectiveness is inherently dependent on the in-
teraction between memory hierarchy, computational resources, and FPGA-specific constraints. Moreover, its impact is
tightly coupled with the program’s execution schedule, meaning that without appropriate code transformations, the
benefits of pragma insertion remain limited. Effective optimization requires a synergy between pragma directives and
structural modifications to the code to fully exploit FPGA parallelism and resource utilization.
2.1.2 Code Transformation. Code transformations are fundamental for optimizing execution on FPGAs. While pragma
insertion plays a crucial role in enhancing performance, it must be complemented by effective code transformations to
fully exploit FPGA resources. Techniques such as loop reordering (permutation), task fusion, and data tiling enhance
data locality and increase parallelism, thereby improving both computational efficiency and memory access patterns.
Various code transformation strategies have been developed specifically for FPGAs [17, 40, 42–44, 55, 82, 83]. While
transformations originally designed for CPUs and GPUs achieve substantial performance gains by optimizing for their
respective architectures, they do not inherently align with FPGA requirements, which prioritize fine-grained paral-
lelism and efficient resource utilization. Several studies have utilized Pluto [13], a leading compiler framework origi-
nally designed for CPU optimizations, to transform FPGA kernels [82, 83]. While Pluto excels at tiling and minimizing
dependencies to enhance memory reuse, its direct application to FPGA optimization is constrained due to the funda-
mental differences in optimization strategies required for CPUs and FPGAs. Unlike CPUs, where memory hierarchy and
cache locality are primary concerns, FPGA optimization focuses on maximizing parallelism, minimizing resource con-
tention, and efficiently utilizing on-chip memory, making Pluto’s conventional transformation techniques less effective
in this context. Conversely, studies such as [17, 40, 42–44, 55] focus on code transformations tailored to specific FPGA
performance goals. The work in [55] aims to reduce communication overhead between off-chip and on-chip memory,
achieving superior Quality of Results (QoR) for memory-bound kernels. Meanwhile, research efforts in [17, 40, 42–
44] concentrate on optimizing pipelining strategies to maximize instruction-level parallelism and resource utilization.
Sisyphus [58] introduces a unified approach by integrating code transformation and pragma insertion into a single
optimization problem. By formulating this as a Non-Linear Programming (NLP) problem, Sisyphus efficiently explores
the design space to identify theoretically optimal configurations, streamlining FPGA acceleration while maintaining a
balance between computation and memory access efficiency.
2.1.3 Task Concurrency. HLS tools like Vitis HLS support the dataflow pragma, which structures computations into
actors that communicate through FIFO queues. This approach allows overlapping execution of multiple tasks, signifi-
cantly reducing overall latency. By enforcing a producer-consumer model, dataflow scheduling enables each task to
process data as soon as it becomes available rather than waiting for an entire stage to complete, which is particularly
beneficial for reducing idle time.
For computational kernels such as 3mm (Listing 3), the dataflow paradigm allows the first two matrix multiplications
to execute in parallel while the third begins as soon as its required inputs are produced. This overlapping of execution
helps maximize throughput and minimize idle time for computing units. Additionally, dataflow optimizations facilitate
task-level parallelism, allowing independent tasks to run concurrently across multiple compute resources, such as DSP
blocks and BRAM, ensuring efficient utilization of available FPGA resources.
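A minimal sketch of this dataflow organization for 3mm is shown below, using toy dimensions and assumed task and stream names rather than the paper's generated code:

```cpp
// Toy 3mm dataflow sketch (assumed names and sizes; not the paper's generated code).
#include <hls_stream.h>

const int N = 8;  // toy dimension; the real kernels use PolyBench sizes

static void mm_to_stream(const float A[N][N], const float B[N][N],
                         hls::stream<float> &out) {
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j) {
#pragma HLS PIPELINE
      float acc = 0;
      for (int k = 0; k < N; ++k) acc += A[i][k] * B[k][j];
      out.write(acc);  // producer: emits one result element per iteration
    }
}

static void mm_from_streams(hls::stream<float> &E_in, hls::stream<float> &F_in,
                            float G[N][N]) {
  float E[N][N], F[N][N];
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j) {  // consume E and F as they are produced
      E[i][j] = E_in.read();
      F[i][j] = F_in.read();
    }
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j) {
#pragma HLS PIPELINE
      float acc = 0;
      for (int k = 0; k < N; ++k) acc += E[i][k] * F[k][j];
      G[i][j] = acc;
    }
}

void kernel_3mm(const float A[N][N], const float B[N][N],
                const float C[N][N], const float D[N][N], float G[N][N]) {
#pragma HLS DATAFLOW
  hls::stream<float> E_s("E"), F_s("F");
  mm_to_stream(A, B, E_s);       // E = A x B
  mm_to_stream(C, D, F_s);       // F = C x D, runs concurrently with the first task
  mm_from_streams(E_s, F_s, G);  // G = E x F, starts reading as soon as data arrives
}
```

In this sketch, the two producer tasks execute concurrently under the dataflow region, and the consumer begins reading from the streams as soon as elements arrive.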
However, despite its advantages, pure dataflow parallelism presents several challenges. One major limitation is intra-
task parallelism, which is constrained by the reliance on FIFO-based communication. Since each FIFO can only transfer
up to 512 bits per cycle (as this is the maximum off-chip memory bitwidth supported), this inherently restricts the
amount of data that can be processed concurrently within a task. To further enhance intra-task parallelism, alternative
approaches must be explored.
Stream-HLS [8] attempts to address this limitation by increasing the number of FIFOs connecting two tasks, as-
suming that all data resides on-chip. While this approach simplifies certain aspects of the design space, it is neither
scalable nor generalizable for transferring data from off-chip to on-chip memory. By using 𝑛 FIFOs to transfer 𝑛 data
elements in parallel between two tasks, the method significantly increases resource consumption without proportion-
ally improving efficiency. The additional FIFOs complicate the design, potentially leading to routing congestion and
excessive hardware overhead.
Moreover, modern FPGAs feature up to 32 off-chip memory banks, making the multi-FIFO approach impractical
for handling off-chip data transfers. A more effective and scalable strategy must be developed to optimize intra-task
parallelism while ensuring efficient communication between off-chip and on-chip memory, without introducing un-
necessary complexity or resource constraints.
2.1.4 Shared Buffering. Shared buffering is a critical technique in FPGA memory optimization, enabling efficient data
access and reuse across multiple computational units. It can be employed at a global level, as seen in frameworks like
AutoDSE [65], Sisyphus [58], and ScaleHLS [78], or within individual dataflow tasks to enhance execution efficiency.
This approach involves preloading data buffers into on-chip memory, allowing multiple computations to access
shared data without redundant transfers. By reducing memory access latency and improving bandwidth utilization,
shared buffering facilitates high parallelism and optimizes overall performance. It is particularly beneficial in applications that are computation-bound.
However, shared buffering presents challenges in maintaining concurrency. While it enables efficient data reuse, it
can introduce synchronization overhead, especially when multiple tasks attempt to access the same memory region
simultaneously. Managing concurrent access requires arbitration mechanisms, which can lead to increased latency and
potential bottlenecks. Additionally, routing congestion can occur due to high interconnect demands, limiting scalability
in complex FPGA designs.
One limitation of shared buffering is its impact on initiation interval (II) in pipelined architectures. Unlike dataflow-
based designs that rely on FIFO queues for seamless data streaming, shared buffering requires explicit read and write
coordination, which may introduce stalls if not carefully managed. Moreover, FPGA resource constraints, such as
limited BRAM and URAM availability, impose restrictions on buffer size and allocation strategies.
To address these challenges, advanced techniques such as double-buffering, memory partitioning, and adaptive
scheduling have been explored. Double-buffering enables overlapping computation with data transfer, reducing idle
cycles. Memory partitioning distributes data across multiple banks to alleviate contention, while adaptive scheduling
dynamically assigns buffer access based on task priority and workload demands.
By effectively integrating shared buffering with intelligent memory management strategies, FPGA designs can
achieve a balance between computational parallelism and efficient memory access. Future advancements should focus
on automated buffer allocation techniques and dynamic access pattern optimization to further enhance performance
in high-level synthesis (HLS) workflows.
2.1.5 Computation-Communication Overlap. Overlapping computation and communication is crucial for high-performance
FPGA designs. Techniques such as double buffering (ping-pong buffering) and advanced data tiling help mask com-
munication latency while keeping computational units busy. Managing data transfers efficiently between on-chip and
off-chip memory ensures that processing units remain active without waiting for data.
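A minimal double-buffering skeleton is sketched below; the tile size, names, and placeholder computation are assumptions, and the actual overlap relies on the HLS scheduler recognizing that the load of tile t+1 and the computation of tile t are independent:

```cpp
// Minimal ping-pong (double-buffering) skeleton with assumed tile size and names.
const int TILE = 256;

static void load_tile(const float *in, float buf[TILE], int t) {
  for (int i = 0; i < TILE; ++i) {
#pragma HLS PIPELINE
    buf[i] = in[t * TILE + i];
  }
}

static void compute_tile(const float buf[TILE], float *out, int t) {
  for (int i = 0; i < TILE; ++i) {
#pragma HLS PIPELINE
    out[t * TILE + i] = buf[i] * 2.0f;  // placeholder computation
  }
}

void kernel(const float *in, float *out, int num_tiles) {
  float ping[TILE], pong[TILE];
  load_tile(in, ping, 0);                                  // prologue: fetch the first tile
  for (int t = 0; t < num_tiles; ++t) {
    if (t % 2 == 0) {
      if (t + 1 < num_tiles) load_tile(in, pong, t + 1);   // fill the other buffer
      compute_tile(ping, out, t);                          // compute the current tile
    } else {
      if (t + 1 < num_tiles) load_tile(in, ping, t + 1);
      compute_tile(pong, out, t);
    }
  }
}
```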
2.1.6 Data Packing and Padding. Modern FPGA architectures support high-bandwidth data transfers (up to 512-bit
wide on AMD/Xilinx FPGAs). Data packing and padding optimize memory alignment, reducing the number of required
memory cycles. For instance, transferring 216 floating-point values using a 256-bit width (8 floats per transfer) requires
27 cycles, making efficient packing crucial for minimizing overhead.
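The sketch below illustrates such packed transfers for the 216-float example over a 256-bit interface; the function and variable names are assumptions, not code from the paper:

```cpp
// Hedged sketch (assumed names): unpacking 216 floats received over a 256-bit
// memory interface, i.e., 8 floats per beat and 216 / 8 = 27 beats in total.
#include <ap_int.h>
#include <cstdint>
#include <cstring>

void load_packed(const ap_uint<256> *gmem, float local[216]) {
  for (int i = 0; i < 27; ++i) {                 // 27 memory beats
#pragma HLS PIPELINE
    ap_uint<256> beat = gmem[i];
    for (int j = 0; j < 8; ++j) {                // 8 floats per 256-bit beat
#pragma HLS UNROLL
      ap_uint<32> word = beat.range(32 * j + 31, 32 * j);
      uint32_t bits = word.to_uint();
      float f;
      std::memcpy(&f, &bits, sizeof(f));         // reinterpret the 32-bit word as a float
      local[i * 8 + j] = f;
    }
  }
}
```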
To fully exploit data packing, padding must be considered to enable even faster data transfers. Additionally, padding
is valuable for achieving finer control over parallelism and resource utilization by expanding the available design space
for loop unrolling, thereby enhancing computational throughput and efficiency.
Padding for Communication. Padding must be considered to increase the flexibility of data packing. The original size
of the data may impose restrictions on packing efficiency, but by introducing padding, we create a larger space that
may allow for a more efficient transfer. However, padding is not a free optimization, as increasing it also increases the
amount of data that needs to be transferred.
For example, in the code presented in Listings 1 and 2, where 𝐽 = 190, transferring all elements along the second
dimension (iterated by loop 𝑗) of array 𝐵 onto the chip is constrained by the available memory bandwidth. Without
padding, the maximum transfer rate is 64 bits per cycle, as 190 × 32 is divisible by 64 but not by 128. However, by
introducing padding with 𝑃 = 2 and adjusting 𝐽 to 192 (190 + 2), the data alignment enables a significantly higher
transfer rate of 512 bits per cycle, as 192 × 32 is now perfectly divisible by 512. This optimization maximizes memory
bandwidth utilization, reducing transfer latency and improving overall data throughput.
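The arithmetic of this example can be summarized in a small compile-time sketch (names are illustrative):

```cpp
// Compile-time illustration of the padding example above (assumed names).
constexpr int J               = 190;                       // original trip count of loop j
constexpr int J_PAD           = 192;                       // J plus a padding of 2
constexpr int FLOATS_PER_BEAT = 512 / 32;                  // 16 floats per 512-bit beat
constexpr int BEATS_PER_ROW   = J_PAD / FLOATS_PER_BEAT;   // 12 full-width beats per row
static_assert(J * 32 % 64 == 0 && J * 32 % 128 != 0,
              "unpadded rows allow at most 64-bit transfers");
static_assert(J_PAD * 32 % 512 == 0, "padded rows are a multiple of 512 bits");
```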
Padding for Computation. Padding can also be used for computation. Following a method similar to Sisyphus [58], which tiles and fully unrolls the intra-tile loops (the transformation from Listing 1 to Listing 2), the unroll factor corresponds to 𝐼1 × 𝐽1 × 𝐾1. We want 𝐼1 to divide 𝐼, 𝐽1 to divide 𝐽, and 𝐾1 to divide 𝐾, because we do not want to spend extra resources computing only a partial tile, i.e., the last, incomplete tile. For example, with a trip count of 190 and an unroll factor of 8, we would execute 23 full tiles of 8 iterations (184 iterations) followed by a partial tile of 6 iterations.
However, restricting the unroll factor to divisors of the trip count considerably shrinks the space of possibilities. The solution is to pad, which considerably enlarges that space. With a trip count of 190, the possible unroll factors are 𝑈𝐹 ∈ {1, 2, 5, 10, 19, 38, 95, 190}, but with a padding of 2 (a trip count of 192) the space becomes 𝑈𝐹 ∈ {1, 2, 3, 4, 6, 8, 12, 16, 24, 32, 48, 64, 96, 192}; and, of course, any amount of padding can be considered. Padding thus enables finer control over unrolling and resource utilization, allowing for more flexible and efficient hardware implementations while optimizing computational parallelism.
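A hypothetical helper illustrates how padding enlarges the set of legal unroll factors:

```cpp
// Hypothetical helper for illustration only: legal unroll factors are the
// divisors of the (possibly padded) trip count. legal_unroll_factors(190)
// yields {1, 2, 5, 10, 19, 38, 95, 190}; legal_unroll_factors(192) yields
// {1, 2, 3, 4, 6, 8, 12, 16, 24, 32, 48, 64, 96, 192}.
#include <vector>

std::vector<int> legal_unroll_factors(int trip_count) {
  std::vector<int> ufs;
  for (int uf = 1; uf <= trip_count; ++uf)
    if (trip_count % uf == 0) ufs.push_back(uf);
  return ufs;
}
```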
Listing 1. Baseline Implementation of Matrix Multiplication in C. This code depicts a naive triple-nested loop structure used for matrix-
matrix multiplication, serving as the unoptimized reference for further transformations.
Listing 2. Matrix Multiplication with Loop Tiling and Fully Unrolled Intra-Tile Computation. This implementation showcases a
performance-optimized matrix multiplication using loop tiling to enhance data locality and fully unrolled intra-tile loops to expose
fine-grained parallelism.
effective memory bandwidth. It also reveals practical challenges, including timing closure issues, placement
inefficiencies, and resource contention that cannot be detected in earlier stages. This level of evaluation is
essential for validating Quality of Results (QoR) and ensuring the design meets real-world constraints.
The transition from RTL simulation to FPGA implementation often introduces additional challenges, such as in-
creased routing complexity, unexpected timing violations, and suboptimal resource utilization. These discrepancies
arise due to the abstraction gap between HLS-generated code and final FPGA hardware constraints. To mitigate these
issues, it is essential to develop robust tools that enable rapid design regeneration by modifying specific configurations
within a well-defined portion of the design space. Such tools would enable engineers to efficiently iterate on design
parameters without requiring a complete code regeneration, which can be time-consuming and may result in a signif-
icantly different design that still fails to generate a valid bitstream. Having the ability to selectively modify only the
congested parts of the design while preserving the rest of the configuration would be highly valuable, ensuring faster
convergence toward an optimized and feasible FPGA implementation.
2.2.2 Hardware Awareness and Resource Constraints. Many existing studies conclude their evaluations at the Vitis
HLS report or RTL simulation stage, overlooking the critical impact of placement and routing constraints. However,
as the design progresses toward bitstream generation, the available design space becomes increasingly constrained.
Addressing hardware limitations early in the HLS process is essential to avoid costly design iterations and ensure
convergence toward a feasible and deployable FPGA implementation.
Additionally, Super Logic Region (SLR)-aware optimizations are crucial for multi-SLR FPGA architectures. A Super
Logic Region (SLR) is a physically distinct section of an FPGA die in Stacked Silicon Interconnect (SSI) technology.
Each SLR contains a subset of the device’s computational resources, including DSP, LUT and FF, similar to those in
monolithic FPGA devices.
Effectively mapping tasks across SLRs is critical for balancing resource utilization and minimizing routing conges-
tion, as inter-SLR communication introduces additional latency and can become a performance bottleneck. By inte-
grating SLR-aware task partitioning, optimized scheduling, and efficient data movement strategies, FPGA designs can
achieve higher scalability, improved parallelism, and enhanced synthesis feasibility. Ensuring that computation and
memory access patterns align with the physical structure of multi-SLR FPGAs is essential for achieving high QoR while
maintaining timing closure and efficient resource allocation.
Some recent frameworks, such as RapidStream-TAPA [25, 26] and PASTA [33], aim to improve the design scalability
and performance of HLS programs on modern multi-die FPGAs through hardware-aware co-optimization of HLS and
physical design. While RapidStream-TAPA focuses on latency-insensitive, FIFO-based communication and exploits
coarse-grained floorplanning and pipelining for timing closure and parallel compilation, PASTA extends this model to
include buffer-based inter-task communication using a generalized channel abstraction. Despite their advances, both
frameworks still require a manually optimized task-parallel input code, which demands substantial expertise in HLS
and hardware design, limiting their accessibility for non-expert users.
2.3 Challenges
While existing HLS optimizations provide substantial performance improvements, several challenges remain. One of
the primary limitations is that many prior works restrict their optimization space to specific techniques or separate
their optimization processes into multiple independent steps. This fragmented approach can lead to incoherent de-
sign decisions, where optimizations applied at one stage may not align with or may even counteract those applied at
subsequent stages. Refer to Table 1 for a detailed comparison of these approaches.
A common oversimplification in many frameworks is the assumption that all data resides on-chip. While this simpli-
fies memory access patterns in design space exploration, it often results in low Quality of Results (QoR) when deployed
on real hardware. This assumption prevents frameworks from incorporating crucial tiling strategies, which are essen-
tial for optimizing data locality, reducing memory access overhead, and ensuring efficient off-chip communication.
When off-chip memory management is treated as a separate optimization step, decisions made in earlier phases may
restrict the effectiveness of later memory optimizations, leading to suboptimal performance and resource utilization.
The limitations of current HLS optimization frameworks that rely on shared buffering are exemplified through the
3mm kernel. It consists of two independent matrix multiplications, whose outputs serve as inputs for a final matrix
multiplication.
As illustrated in Listing 3, each matrix multiplication consists of deeply nested loops that are well-suited to paral-
lelization techniques such as loop unrolling, pipelining, and computation-communication overlap. However, conven-
tional shared buffering strategies often fall short in effectively leveraging concurrency, primarily due to their limited
exploitation of dataflow principles and insufficient overlap between computation and communication.
Another major challenge is balancing parallelism and routing complexity. Excessive parallelization can lead to rout-
ing congestion, increased place-and-route time, and ultimately, failure to generate a valid bitstream. Current method-
ologies lack adaptive mechanisms to dynamically adjust parallelism levels based on available routing resources and
FPGA architecture constraints. Without such mechanisms, achieving high-performance designs often requires manual
intervention and iterative fine-tuning.
While pragma-based optimizations such as unroll and pipeline significantly improve performance, their effectiveness
is inherently tied to the underlying code structure. Without proper transformations to expose parallelism and optimize
memory access patterns, pragma insertion alone cannot fully unlock the potential of FPGA acceleration.
Addressing these challenges requires a holistic approach that integrates memory-aware optimizations, automated
design space exploration, and adaptive scheduling techniques. Future research should focus on frameworks that jointly
optimize data placement, computational parallelism, and resource utilization while considering real hardware con-
straints. By developing more cohesive and intelligent HLS optimization strategies, FPGA-based acceleration can achieve
greater efficiency, scalability, and deployment feasibility.
Model-free approaches, such as AutoDSE [65] and AI-based frameworks like HARP [64, 71, 72], rely on either HLS
compilers or AI models to predict performance and resource utilization. However, these methods are currently limited
to pragma insertion and space enumeration. Even though AI-based models provide fast estimations (on the order
of milliseconds), exhaustively exploring large design spaces remains infeasible. Our previous work, NLP-DSE [57],
addresses this limitation by leveraging NLP techniques to efficiently explore pragma configurations. However, like the
other approaches, it remains restricted to pragma insertion.
PolyOptHLS [54] focuses on optimizing designs by minimizing memory transfers. While reducing communication
latency can significantly improve performance, it does not always lead to optimal results. In many cases, the primary
bottleneck may shift from communication to computation, limiting overall efficiency and potentially degrading the
quality of results (QoR). This approach, therefore, lacks a balanced optimization strategy that considers both compu-
tation and communication trade-offs.
ScaleHLS [78] and POM [80] extend optimization beyond pragma insertion by incorporating code transformations.
However, their exploration space is constrained by the assumption that data is already on-chip. Additionally, their trans-
formations are based on heuristics, such as permuting the reduction loop to the outermost level. While they employ a
cost model to minimize computation latency, they still rely on exhaustive enumeration of possible configurations.
Allo [14] aims to reduce development time and enhance design quality. However, it still requires significant manual
intervention from an expert, limiting its usability for non-specialists.
Our previous work, Sisyphus [58], enables both code transformations and pragma insertion but is restricted to
optimizing a single task. It lacks support for optimizations that dataflow techniques can provide, which would be
necessary for broader efficiency improvements.
Stream-HLS [8] effectively leverages dataflow pragmas by selecting a good loop ordering strategy to maximize
streaming efficiency. However, it imposes constraints on the design space by assuming that data is on-chip. Moreover,
its approach to increasing parallelism relies on multiple FIFOs, which is not generalizable when off-chip memory access
is required.
A significant limitation of all these frameworks is that none of them account for hardware constraints in their explo-
ration space. While this simplifies the exploration process, it reduces real-world applicability. Additionally, frameworks
such as Merlin-based tools (AutoDSE, HARP, NLP-DSE) and Sisyphus can generate all necessary components, includ-
ing off-chip memory, for bitstream generation. However, they are constrained to a single Super Logic Region (SLR),
leading to under-utilization of the FPGA board. Although generating a design that spans multiple SLRs is theoreti-
cally possible, it significantly increases the risk of bitstream generation failures. Even if the bitstream is successfully
generated, timing violations often degrade performance.
Prometheus integrates all the optimization techniques discussed in the previous sections, ensuring a comprehensive and cohesive approach to performance enhancement. Additionally, it is SLR-aware, enabling intelligent
task distribution across Super Logic Regions (SLRs). This capability not only simplifies bitstream generation but also
optimizes the trade-off between computational performance and resource utilization.
[Figure 1: Prometheus workflow — (1) C++ code and PoCC-extracted intermediate representation (dependencies, trip counts); (2) task-flow graph; (3) Non-Linear Problem encoding the full optimization design space; (4) NLP solution yielding all parameters of the space (e.g., the data-tile of array 𝐴 is 32 × 64); generated HLS-C++, OpenCL, and config files are compiled by the HLS compiler into the bitstream.]
Fig. 1. Overview of the Prometheus Framework Workflow. This diagram illustrates the end-to-end pipeline of the Prometheus opti-
mization framework, starting from C++ source code and proceeding through intermediate representation extraction, task-flow graph
generation, Non-Linear Programming (NLP)-based design space exploration, and final compilation into FPGA bitstreams using HLS
compilers.
Figure 1 illustrates the workflow of Prometheus. The process begins with an input C++ code, from which we extract
an intermediate representation containing dependencies, trip counts, schedules, and other relevant information. This
extraction is performed using PoCC [51], which provides all necessary details for analysis. Next, we construct the
task-flow graph based on this extracted information and formulate the corresponding Non-Linear Programming (NLP)
problem. Solving this NLP problem determines the theoretically optimal parameters for the design space. Once the NLP
solver provides these parameters, we have all the required information to proceed. At this stage, we automatically
generate the HLS-C++ code along with all the necessary files to produce the FPGA bitstream, ensuring an efficient and
fully automated compilation process.
To support efficient exploration of the design space, Prometheus defines a set of tunable parameters and architectural
constraints that capture the key aspects of FPGA acceleration. Table 2 provides a detailed overview of these parame-
ters, including memory-related configurations (e.g., bitwidth, tiling, and reuse levels), parallelism controls (e.g., loop
unrolling and array partitioning), and structural transformations (e.g., loop permutations and SLR assignments). In ad-
dition, the table outlines hardware constraints such as DSP budgets, on-chip memory limits, and maximum legal array
partitioning. These variables and constraints jointly define the feasible design space that the NLP-based optimization
engine systematically explores to generate high-performance, hardware-aware FPGA implementations.
The pseudo-code in Listing 4 demonstrates how Prometheus processes the 3mm kernel (Listing 3). Load and store
operations manage data transfers to and from off-chip memory, while send and receive operations handle inter-task
communication using FIFOs.
Each task𝑖 in the pseudo-code corresponds to the fully unrolled intra-tile computation of statement S𝑖 in Listing 3. An example of such a task is illustrated in Listing 5.
Arrays 𝐸, 𝐹 , and 𝐺 are initialized to zero within their respective tasks (e.g., S0, S2, and S4) and are not preloaded,
reducing unnecessary memory overhead.
Prometheus enhances computation efficiency by fusing statements that produce the same outputs (e.g., in Listing
3), and it automatically formulates an NLP problem to determine the theoretically optimal parameters, such as loop
schedules, array bit widths (e.g., 512 bits), tile sizes, reuse buffer sizes, transfer locations, and padding. The framework
dynamically adjusts bit widths for arrays like 𝐹 , 𝐷, and 𝐺 to achieve a better balance between parallelism and resource
usage.
Efficient computation-communication overlap is achieved through ping-pong buffering, while concurrent task exe-
cution and FIFO-triggered dependent tasks maximize overall performance. Fused Tasks 0 and 1 execute concurrently,
while Fused Task 2 begins as soon as the data tiles for 𝐹 and 𝐸 become available.
To assess the effectiveness of our proposed framework, we evaluate the performance of the 3mm kernel using sev-
eral state-of-the-art HLS-based FPGA optimization tools. As shown in Table 3, Prometheus significantly outperforms
existing frameworks in terms of throughput, achieving 368.36 GF/s. This represents a substantial improvement over
Sisyphus (178.97 GF/s) and Stream-HLS (174.00 GF/s), and far exceeds the results of Allo, ScaleHLS, and AutoDSE.
The superior performance of Prometheus stems from its holistic design space exploration strategy, which integrates
loop transformations, memory tiling, concurrent task execution, and SLR-aware scheduling to deliver optimized and
hardware-feasible designs.
3 Code Transformation
In our dataflow model, we adopt a synchronous dataflow where the sizes of the arrays are known during compile time.
This compile-time awareness enables us to construct a precise model that facilitates rigorous optimizations. To leverage
FPGA parallelism, we implement an acyclic dataflow graph, ensuring parent nodes do not receive data from their
children. While this constraint limits graph configurations and may increase resource usage, it reduces overall latency.
To support this structure, we inline [15] each function in the input code to generate the required acyclic graph. Our
primary objective is to minimize latency within resource constraints by overlapping communication and computation
within tasks and executing independent tasks concurrently. To achieve this, we apply various transformations and
optimizations, explored in this section, and navigate the design space using an NLP-based approach, detailed in Section
4.
To optimize memory transfers, padding is applied to arrays in two ways. Simple padding increases the bit width
(𝐵𝑊𝑎 ) for efficient transfers while maintaining the original loop trip count. Composite padding adjusts both 𝐵𝑊𝑎 and
the loop trip count (𝑇𝐶 𝑙 ) to support unroll factors that do not evenly divide the original trip count, allowing irregular
tile sizes and expanding the design space. Tile sizes are consistent within a fused task but vary between tasks. In Listing
4, the tile size of array F is 19 × 32 in Fused Task 1 and 192 × 32 in Fused Task 2.
4 NLP Formulation
To determine the tile size, loop order, bit width, and memory transfers, we formulate a cost model as an NLP problem
aimed at minimizing overall latency. This approach builds upon the methodology proposed in our previous work [56–
58], which we have further adapted to meet the specific requirements and constraints of our current framework. In our
work, we incorporate dataflow considerations along with all the optimizations detailed in Section 3. We employ PoCC
[50] to extract compile-time information such as schedules, loop trip counts, dependencies, and operation counts per
statement. ISCC [67] then generates all legal permutations for each loop body.
Table 4 delineates the sets, variables, and constants utilized in our NLP formulation.
4.1 Constraints
We now describe the constraints by using the code of Listings 3, 4 and 5.
4.1.1 Data-tiling and Unroll Factor. The intra-tile transformation, as explained in Section 3, can divide either the
original loop trip count or the original trip count with padding, thereby increasing the range of possibilities. Equation
1 ensures that the trip count of the intra-tile is a divisor of one of these two possibilities. The user has the option to
constrain the padding using Equation 2, which simplifies the solution space for the NLP solver. For instance, in Listing
5, for the array 𝐹, the loop 𝑗 (line 7 in Listing 3) has been split into 𝑗0 (line 14 in Listing 4) and 𝑗1 (intra-tile). The trip count of the intra-tile loop 𝑗1, denoted $TC^{j1}_{intra} = 32$, does not evenly divide the original trip count $TC^{j}_{ori} = 210$, but it does divide the trip count of the padded loop $TC^{j} = 224$.

$$\forall l \in \mathcal{L},\quad TC^{l}_{intra} \,\%\, TC^{l}_{ori} == 0 \;\;\lor\;\; TC^{l}_{intra} \,\%\, TC^{l} == 0 \tag{1}$$

$$(\mathit{opt})\ \forall l \in \mathcal{L},\ \exists\, n \in \mathbb{N},\ n \le N,\ \text{s.t.}\ TC^{l} = TC^{l}_{ori} + n \tag{2}$$
4.1.2 Bit Width. B denotes the number of elements that can be transferred simultaneously, determined by the bit width and the data type. For instance, with a bit width of up to 512 bits and float data, the set is {1, 2, 4, 8, 16}. Equation 3 computes the bit width for each array based on the last dimension of the data-tile transferred on-chip.
| Constants | Description |
|---|---|
| $II_l$ | II of the loop $l$ |
| $IL_{par}$, $IL_{red}$ | Iteration latency of the operations without ($par$) and with ($red$) dependencies of the statement $s$ |
| $TC^{l}_{ori}$ | Original trip count of the loop $l$ |
| $f_{a,l}$ | Footprint of the array $a$ if transferred on-chip after the loop $l$ |
| $NFT$ | Number of fused tasks |
| $DSP^{op}_{s}$ | Number of DSPs used by the statement $s$ for the operation $op$ |
| $DSP$ | Number of DSPs available on the target FPGA |
| $max_{part}$ | Maximum array partitioning |
| $SLR$ | Number of SLRs available on the target FPGA |

| Variables | Description |
|---|---|
| $TC^{l}_{intra}$, $TC^{l}_{inter}$ | Trip count of the loop $l$ for the intra- and inter-tile |
| $TC^{l}$ | Trip count of the loop $l$ after padding |
| $S^{last}_{a}$ | Size of the last dimension of the array $a$ transferred on-chip |
| $BW_a$ | Bit width of the array $a$ |
| $t_{a,l}$, $d_{a,l}$ | Booleans indicating whether the array $a$ is transferred and defined (respectively) on-chip after the loop $l$ in the inter-tile |
| $p^{l}_{i}$ | Position of the loop $l$ under the $i$-th permutation |
| $slr_t$ | ID of the SLR used by the task $t$ |

| Sets | Description |
|---|---|
| $\mathcal{L}$, $\mathcal{A}$, $\mathcal{S}$ | The sets of loops, arrays, and statements |
| $\mathcal{L}_s$ | The set of loops which iterate the statement $s$ |
| $\mathcal{L}^{red}_{s}$ | The set of reduction loops which iterate the statement $s$ |
| $\mathcal{L}^{last}_{a}$ | The set of loops which iterate the last dimension of the array $a$ |
| $\mathcal{L}_{inter}$, $\mathcal{L}_{intra}$ | The sets of loops which belong to the inter-tile and intra-tile, respectively |
| $\mathcal{B}$ | Set of possible burst sizes for the data type |
| $\mathcal{C}^{d}_{a}$ | The set of loops which iterate the array $a$ at the dimension $d$ |
| $AP_{a,d}$ | Array partitioning for the array $a$ in dimension $d$ |
| $\mathcal{P}_s$ | All permutations of the loops which iterate the statement $s$ |
| $\mathcal{F}_i$ | The set of statements which belong to the fused task $i$ |
| $\mathcal{T}$ | The set of tasks in the dataflow |

Table 4. Mathematical Notation for Constants, Variables, and Sets in the NLP-Based Optimization Model. This table defines the formal notation used in Prometheus' Non-Linear Programming (NLP) model for design space exploration, including loop trip counts, bitwidths, SLR mappings, resource limits, and legal transformations.
4.1.3 Permutation. Equation 4 requires the NLP to choose identical permutations for loops that are shared by state-
ments fused within the same task. For example, in Fused task 1, the loops iterating statement S2 can be permuted as
(𝑖0, 𝑗0) or (𝑗0, 𝑖0), and similarly, loops iterating statement S3 can be permuted as (𝑖0, 𝑗0) or (𝑗0, 𝑖0). However, since the
loops 𝑖0 and 𝑗0 iterate both S2 and S3, they must use the same permutation; either (𝑖0, 𝑗0) for both or (𝑗0, 𝑖0) for both.
4.1.4 Transfer and Reuse. Equation 5 permits the selection of a single level where each array can be defined (and
reused) and transferred. Equation 6 constrains that the definition of the array must occur lexicographically before or
at the same time as the transfer.
The array 𝐸, defined on line 24 in Listing 4, is defined before any loops, so 𝑑 𝐸,0 = 1 (0 indicating it is defined before
any loops). However, it is transferred under the loop 𝑖0, so 𝑡𝐸,𝑖0 = 1. Equation 6 simply means that the definition of
array E should occur before or at the same level as the transfer. We cannot transfer E under loop 𝑖0 if E is defined under
𝑘0. Similarly, in Listing 4, the array 𝐴 is defined and transferred in line 4, with 𝑑𝐴,𝑖0 = 1 and 𝑡𝐴,𝑖0 = 1.
$$\forall a \in \mathcal{A},\quad \sum_{l \in \mathcal{L}_{inter}} t_{a,l} = 1, \qquad \sum_{l \in \mathcal{L}_{inter}} d_{a,l} = 1 \tag{5}$$

$$\forall a \in \mathcal{A},\ \forall (l_0, l_1) \in \mathcal{L}_{inter}^{2}\ \text{with}\ d_{a,l_0} = t_{a,l_1} = 1:\quad l_0 \preceq l_1 \tag{6}$$
4.1.5 On-chip Memory. Equation 7 constrains the footprint of the array to be within the available resources, based on
where the array is defined, the number of double buffers used and the footprint of the array transferred at this level.
$$\sum_{a \in \mathcal{A}} \sum_{l \in \mathcal{L}} d_{a,l} \times f_{a,l} \times N_a \le Mem \tag{7}$$
4.1.6 Array Partitioning. Equation 8 limits the maximum partitioning of each array. This partitioning is crucial as it
impacts the maximum unroll factor, necessitating the distribution of data across different BRAM banks under fully
unrolled loops, thereby influencing the utilization of BRAMs. Equation 9 computes the array partitioning needed for
each array based on the trip count of the fully unrolled intra-tile loops.
Array D in Listing 5 is traversed by two unrolled loops: 𝑘1, which iterates 3 times (𝐴𝑃𝐷,0 = 3), and 𝑗1, which iterates
32 times (𝐴𝑃𝐷,1 = 32). Therefore, the total number of partitions needed is 3 × 32 = 96. Consequently, these 96 values of 𝐷 are stored in different banks, allowing all of them to be accessed in parallel. However, this value must be less than
or equal to 𝑚𝑎𝑥𝑝𝑎𝑟𝑡 .
$$\forall a \in \mathcal{A},\quad \prod_{d \in \mathbb{N}} AP_{a,d} \le max_{part} \tag{8}$$

$$\forall a \in \mathcal{A},\ \forall d \in \mathbb{N},\ \forall l \in \mathcal{C}^{d}_{a},\quad AP_{a,d} = TC^{l}_{intra} \tag{9}$$
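At the HLS-C++ level, this constraint corresponds to fully partitioning the arrays accessed by the unrolled intra-tile loops; the sketch below uses the 3 × 32 dimensions of the example above, with assumed names rather than the paper's generated code:

```cpp
// Hedged sketch: array D is read by two fully unrolled intra-tile loops with
// trip counts 3 and 32, so it is partitioned completely in both dimensions,
// yielding 3 x 32 = 96 banks that can all be accessed in the same cycle.
void intra_tile(const float D[3][32], const float E[3][32], float out[3][32]) {
#pragma HLS ARRAY_PARTITION variable=D complete dim=0
#pragma HLS ARRAY_PARTITION variable=E complete dim=0
#pragma HLS ARRAY_PARTITION variable=out complete dim=0
  for (int k1 = 0; k1 < 3; ++k1) {
#pragma HLS UNROLL
    for (int j1 = 0; j1 < 32; ++j1) {
#pragma HLS UNROLL
      out[k1][j1] = D[k1][j1] * E[k1][j1];  // 96 independent parallel reads of D
    }
  }
}
```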
4.1.7 DSP Utilization. Equation 10 constrains the number of DSPs used based on the available DSP resources. In
contrast to [56–58], we use a pessimistic DSP-utilization estimate. This is necessary because resources cannot be reused across tasks that may execute concurrently. Given 𝐷𝑆𝑃+ = 2, 𝐷𝑆𝑃∗ = 3,
and 𝐼 𝐼𝑆3 = 3, the DSP usage for Task 3 is calculated as (2 + 3) × 1824, accounting for the unroll factor. However, since
the loop is pipelined with 𝐼 𝐼 = 3, the HLS compiler optimizes resource usage, effectively reducing the DSP count by
approximately dividing by 𝐼 𝐼 .
$$\sum_{op \in \{+,-,\ast,/\}} \sum_{s \in \mathcal{S}} \left(DSP^{op}_{s} / II_s\right) \times \prod_{l \in \mathcal{L}_{s}} TC^{l}_{intra} \le DSP \tag{10}$$
4.1.8 SLR Selection. For each task, the NLP determines the SLR on which the task will be implemented (Equation 11).
Additionally, Equations 7 and 10 are applied to each SLR to manage resource allocation per SLR effectively.
$$Lat(T_0, T_1) = \max\bigl(Lat(T_0),\, Lat(T_1) + shift_{T_0,T_1}\bigr)$$
$$Lat(T_2, T_3) = \max\bigl(Lat(T_2),\, Lat(T_3) + shift_{T_2,T_3}\bigr)$$
$$Lat(T_0,\dots,T_4) = \max\bigl(Lat(T_0,T_1),\, Lat(T_2,T_3),\, Lat(T_4)\bigr)$$
$$Lat = \max\bigl(Lat(T_0,\dots,T_4),\, Lat(T_5) + \max(shift_{T_1,T_5},\, shift_{T_3,T_5},\, shift_{T_4,T_5})\bigr)$$
Fig. 2. Maximally distributed dataflow graph of the 3mm kernel, where each node represents a task and each edge indicates a data
dependency between tasks.
The latency of the intra-task for a statement 𝑠 is determined similarly to [56–58] with Equation 12. If the intra-task is iterated by reduction loops in the inter-tile, its latency is given by Equation 13.
$$Lat_{intra} = IL_{par} + IL_{seq} \times \log_2\Bigl(\prod_{l \in \mathcal{L}^{red}_{s}} TC^{l}_{intra}\Bigr) \tag{12}$$

$$Lat = Lat_{intra} + II \times \Bigl(\prod_{l \in \mathcal{L}^{red}_{s}} TC^{l}_{inter} - 1\Bigr) \tag{13}$$
For each level 𝑛 ≥ 0 (with the outermost level being 0) of the inter-tile, we consider the shifting of load, compute,
and store operations to enable overlapping of computation and communication. The overlap factor 𝛼 is set to 1 for
read-only operations and 2 for read and store operations:
$$Lat_n = \max\Bigl(Lat_{n+1},\ \frac{f_{a,l}}{BS_a}\Bigr) + Lat_{n+1} + \frac{f_{a,l}}{BS_a} \times \alpha \tag{14}$$
The final objective function is then computed using the dataflow graph, where the latency of each task is given by the latency of its outermost level, denoted $Lat_0$.
5 Code Generation
Prometheus takes as input affine C/C++ code and automatically produces an HLS-C++ file, OpenCL host files, and all
necessary files for code verification, RTL simulation, and bitstream generation. The NLP described in Section 4 provides the design-space parameters, such as loop order and tiling factors. However, to produce efficient designs, the code generation must follow specific rules.
5.3 Intra-Task
Each intra-task, corresponding to an intra-tile selected by the NLP, is implemented as a fully unrolled, independent
function without communication with off-chip memory. Data for these tasks resides entirely on-chip.
If the task involves reduction loops, inter-tile reduction loops are integrated into the function and pipelined. While
the pipeline initiation interval (II) is greater than 1 due to reduction dependencies, other operations in the statement
are pipelined efficiently.
Padding is handled at the intra-tile level. Non-reduction loops remain unchanged, allowing computation of padding
values without excessive resource usage. For reduction loops, full tiles are computed first, and the intra-tile loop is
adjusted to handle padding for partial tiles accurately.
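One possible realization of this scheme for a padded reduction loop is sketched below, using the earlier example of a trip count of 190 padded to 192 with an intra-tile of 8; the names and structure are assumptions, not the generated code:

```cpp
// Hedged sketch: 23 full tiles are computed with the fully unrolled intra-tile
// loop, and the last (partial) tile masks out the padded iterations so they
// do not contribute to the reduction.
void reduce_padded(const float x[192], float &sum) {
#pragma HLS ARRAY_PARTITION variable=x cyclic factor=8 dim=1
  const int TC_ORI = 190, UF = 8, FULL_TILES = TC_ORI / UF;  // 23 full tiles
  float acc = 0.0f;
  for (int k0 = 0; k0 < FULL_TILES + 1; ++k0) {   // inter-tile reduction loop, pipelined
#pragma HLS PIPELINE
    float partial = 0.0f;
    for (int k1 = 0; k1 < UF; ++k1) {             // intra-tile, fully unrolled
#pragma HLS UNROLL
      int k = k0 * UF + k1;
      partial += (k < TC_ORI) ? x[k] : 0.0f;      // padded elements ignored in the partial tile
    }
    acc += partial;
  }
  sum = acc;
}
```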
level of granularity. Once we identify the data being transferred, we implement double buffering if only reads or writes
are involved, or triple buffering if both reads and writes are required. This strategy allows us to overlap the operations
of reading, storing, and computing at the innermost level, optimizing both communication and computation overlap.
6 Evaluation
6.1 Setup
We evaluated our method using kernels from Polybench/C 4.2.1 [53] with medium-sized datasets and single-precision
floating-point computations. The selected kernels represent both memory-intensive and computation-intensive scenar-
ios. Medium problem sizes were chosen to balance demonstration of efficacy and the feasibility of time-consuming RTL
evaluations. Due to Allo’s lack of automatic code generation, we limited our experiments to the subset of PolyBench
kernels provided in its artifact evaluation package [1], as results for other kernels were not available. NLP problems
were solved using the AMPL description language and the Gurobi solver (version 11.0.0) with the qp:nonconvex=2
option for non-convex quadratic objectives and constraints. Evaluations included RTL simulation and on-board exe-
cution using the Alveo U55C FPGA, with a targeted frequency of 220 MHz. RTL simulation provided accurate latency
estimates, contrasting with the overly optimistic Vitis HLS reports that assume perfect task overlapping with dataflow
pragma. The generated code from each framework is compiled using AMD/Xilinx Vitis HLS 2023.2 with the Vitis flow
[3]. This flow assumes that data initially resides off-chip and has a default latency of 64 cycles to bring onto on-chip
memory. All frameworks utilize "unsafe math" optimizations, enabling commutative/associative reduction operations
at the expense of precision.
6.3 Comparison
Table 5 presents the RTL simulation results for Prometheus, Sisyphus [58], AutoDSE [65], ScaleHLS [78], and Allo [14].
The last lines show the average and geometric mean performance improvement (PI) of Prometheus across evaluated
kernels.
Results marked with ∗ are from the Vitis HLS report, showing minimum latency as an optimistic estimate. The RTL
simulation for 3mm with Allo was incomplete after two days. The N/A values in the table for Scale-HLS correspond
to kernels that have at least one loop with a non-constant trip count. Stream-HLS does not handle these cases, so we
cannot make a comparison.
Prometheus consistently achieves superior QoR across all evaluated kernels compared to other frameworks. While
Sisyphus [58], designed primarily for computation-bound kernels, demonstrates competitive results for these kernels,
it shows a weakness for 2mm and 3mm. This disparity is due to Sisyphus lacking concurrent execution capabilities
for independent matrix multiplications. The performance advantage of Prometheus stems from both concurrent task execution and efficient overlapping of computation and communication.

Table 5. Throughput Comparison (in GF/s) of PolyBench Kernels Across Frameworks Using RTL Simulation
Although for bicg, Allo and Prometheus do not use the same code transformation, the results are similar. Allo retains
the original code structure by permuting the reduction loop outermost. The non-reduction loop is fully unrolled, and
the reduction loop is pipelined. Prometheus partially unrolls both loops and pipelines the reduction loop.
On Board Evaluation Table 6 presents the results for the on-board evaluation. The column 𝑇 (𝑚𝑠) indicates the
kernel execution time in milliseconds, GF/s represents the throughput in Giga Floating Operations per second, and the
resource usage is detailed in thousands for both LUTs (L) and FFs. URAM utilization is excluded as no kernels use
it. Column F (MHz) shows the frequency achieved by each design, with a target frequency of 220 MHz for all designs.
For Bicg and Atax, targeting 60% of resources on one SLR caused congestion, requiring regeneration with a 55%
constraint. The 3mm bitstream from AutoDSE succeeded only with a 15% constraint.
Prometheus achieves a remarkable 77.16x performance improvement over AutoDSE. This gain is accompanied by
an average increase in resource utilization: 2.38x more DSPs, 1.08x more BRAM, 1.36x more LUTs, and 1.42x more FFs,
reflecting the trade-offs made to achieve such high efficiency. When compared to Sisyphus, Prometheus demonstrates
a notable 2.59x performance boost, leveraging an average of 3.88x more DSPs, 1.04x more BRAM, 1.31x more LUTs,
and 1.33x more FFs, illustrating its ability to deliver substantial improvements while effectively utilizing additional
resources.
Table 7 shows the parameters found by the NLP. We use the same iterator names as the PolyBench 4.2.1 [53] code. S𝑖 denotes the statement in position 𝑖 in the code. The second column gives the statements fused inside the same task, the third column gives the fused-task order and the loop order found by the NLP for each fused task, and the last column gives the data-tile size found by the NLP; if an array is also present in a different fused task, the corresponding fused task (as defined in the second column) is specified.
Permutations are evident in the implementations of 3mm, Atax, and Bicg. For 2mm and 3mm, the NLP opted to
fully transfer array B instead of overlapping computation and load, as the load operation had higher latency than
computation. In 2mm, the NLP retains the original loop order. Array tmp is transferred from the first task to the second, with both tasks iterating over the first dimension using loop 𝑖. This enables a 4 × 32 data tile to be sent to the second task, which starts computation once a 4 × 192 tile of tmp(FT1) is filled.

Table 6. On-Board Evaluation Results for 1 SLR (excerpt)

| Framework | Kernel | T (ms) | GF/s | DSP | BRAM | LUT (K) | FF (K) | F (MHz) |
|---|---|---|---|---|---|---|---|---|
| Sisyphus (1 SLR) | 3mm | 1.52 | 29.89 | 984 | 611 | 230 | 300 | 220 |
| Sisyphus (1 SLR) | Atax | 0.62 | 1.03 | 173 | 450 | 240 | 250 | 220 |
| Sisyphus (1 SLR) | Bicg | 0.63 | 1.02 | 173 | 451 | 238 | 265 | 217 |
| AutoDSE (1 SLR) | 2mm | 92.25 | 0.40 | 963 | 353.5 | 287 | 292 | 205 |
| AutoDSE (1 SLR) | 3mm | 110.34 | 0.41 | 1117 | 470 | 278 | 306 | 220 |
| AutoDSE (1 SLR) | Atax | 2.88 | 0.22 | 452 | 630.5 | 170 | 212 | 220 |
| AutoDSE (1 SLR) | Bicg | 1.13 | 0.56 | 196 | 867.5 | 168 | 217 | 214 |
| Ours (1 SLR) | 2mm | 0.56 | 65.13 | 1941 | 635.5 | 371 | 454 | 216 |
| Ours (1 SLR) | 3mm | 0.87 | 51.95 | 1551 | 635.5 | 342 | 423 | 220 |
Table 7. Fusion, loop order and data-tile size found by the NLP for the kernel evaluated on Table 6 for 1 SLR
As other frameworks cannot generate bitstreams for 3 SLRs without human intervention, we compare results for 1
and 3 SLRs using our framework. For 2mm and 3mm, performance improves due to increased resource utilization. How-
ever, for atax and bicg, where the bottleneck is memory transfer between off-chip and on-chip rather than parallelism,
the improvement is negligible.
6.4 Scalability
Our NLP solver includes the option to set a timeout, returning the best design found so far without guaranteeing
optimality. This feature enables efficient exploration of the solution space while ensuring adherence to time constraints
when necessary.
When analyzing the time required to solve the Non-Linear Programming (NLP) problem in Sisyphus, we observe
that our framework, Prometheus, achieves similar solution times for most benchmarks. However, there is a notable
exception: the 3mm benchmark, which times out after 4 hours in Sisyphus, while Prometheus successfully finds a
solution.
This efficiency stems from Prometheus’ ability to explore a larger optimization space while still maintaining fast
solution times. The key reason is that all optimization techniques are seamlessly integrated within the design space.
This structured integration ensures that when a decision is made for one optimization parameter, it naturally constrains
the possible choices for others, thereby reducing the search complexity. As a result, even with a broader search space,
Prometheus efficiently converges to a solution in 21.37 seconds.
A detailed comparison of solution times can be found in Table 8, which highlights the impact of our approach in
managing the design space effectively.
7 Related Work
Dataflow Dataflow principles have been extensively studied in models such as Kahn Process Networks [32], Den-
nis Dataflow [22], synchronous dataflow languages [11, 38], and for FPGA applications [2, 4, 10, 12, 14, 24, 49, 74].
DaCe [10] introduces Stateful DataFlow multiGraphs to separate program definition from optimization, enabling
transformations like tiling and double-buffering, though requiring user intervention. Stream-HLS [8] automatically
generates dataflow; however, it assumes that data are already on-chip, thereby overlooking communication with off-
chip memory. Additionally, it offers very limited opportunities for parallelism. Flower [4] automates FPGA dataflow
development but limits parallelism. Frameworks like [14, 74], built on HeteroCL [36], optimize data placement and
compute scheduling for heterogeneous systems, maximizing data reuse and bandwidth utilization. Systolic arrays
[9, 20, 31, 37, 69] offer efficient computation for specific patterns but lack generalization. Application-specific frame-
works [5, 16, 21, 23, 27, 34, 47, 61] demonstrate dataflow advantages but do not generalize across domains. RapidStream-
TAPA [25, 26] enhances the performance of dataflow designs and automates SLR placement. However, it requires an
optimized kernel with a dataflow structure as input.
Shared Buffer Shared buffer utilization through HLS has been extensively explored using methods such as NLP-based
pragma insertion [56–58], bottleneck DSE [65], and GNN-based latency and resource estimation [7, 62–64, 66, 71, 73].
However, these approaches lack effective integration of dataflow optimization techniques.
Code Transformation Code transformation has been explored for CPUs [6, 13, 35, 52], GPUs [68], and FPGAs [17, 40,
42–44, 55, 82, 83]. Pluto [13] minimizes communication and improves locality but can limit parallelism. FPGA-specific
adaptations [55] leverage FIFOs for overlapping communication and computation but are restricted in parallelism.
Recent works [82, 83] selectively use Pluto for latency minimization while avoiding non-HLS-friendly code. While
[17, 40, 42–44] focus on optimizing pipelining techniques, they do not address parallelism or the coordination of
computation and communication overlap, which are crucial for our objectives. The compilers in [14, 29, 36, 77, 78] perform code transformations and pragma insertion, but their modifications are primarily heuristic and based on loop properties.
The paper [58] described in Section 2 generates designs with high QoR, but the absence of dataflow utilization hinders
concurrent task execution. Additionally, their approach avoids padding, limiting the unroll factor to divisors of the
loop’s trip count and constraining tiling space.
Tiling and Padding Tiling is essential for balancing computation and communication. While prior works [41, 45]
use cost models to minimize communication, our approach extends this to reduce overall latency. Techniques like
NLP-based tiling [41] focus on CPUs, while Wedler [59] optimizes DNNs on GPUs by fusing operators, enhancing data
reuse, and employing padding to prevent bank conflicts. Padding is well-studied for reducing cache misses [28, 48] and
improving memory transfers [70], but its use for varying unroll factors on FPGAs remains underexplored.
8 Conclusion
In this work, we introduced Prometheus, a holistic optimization framework that unifies loop transformations, task
concurrency, computation-communication overlap, and hardware-aware scheduling for FPGA accelerators. By formu-
lating the optimization process as a Non-Linear Programming (NLP) problem, Prometheus enables a structured explo-
ration of the vast design space while considering hardware constraints, memory bandwidth, and task parallelism.
Our framework addresses key limitations in existing methodologies, which often optimize only isolated aspects of
FPGA acceleration. Prometheus surpasses these approaches by integrating SLR-aware scheduling, dynamic memory
management, and hybrid execution models that effectively balance dataflow streaming and shared buffering. This
enables more efficient utilization of FPGA resources, leading to significant improvements in latency, throughput, and
scalability.
Through extensive performance evaluations on computation-bound kernels, we demonstrated that Prometheus out-
performs state-of-the-art frameworks.
Furthermore, Prometheus’ automatic design space exploration significantly reduces the manual effort required in
FPGA development. By automatically generating HLS-C++ code, OpenCL host code, and FPGA bitstreams, the frame-
work streamlines the deployment process, making FPGA acceleration more accessible to a broader range of applica-
tions.
In summary, Prometheus provides an innovative and effective approach to FPGA optimization, delivering high-performance
solutions while minimizing the complexity of hardware design.
Acknowledgments
This work was supported by the NSF award #CCF-2211557. It is also supported by CDSC industrial partners and the
AMD/HACC Program.
References
[1] 2024. Allo Artifact. https://github.com/cornell-zhang/allo-pldi24-artifact
[2] Mariem Abid, Khaled Jerbi, Mickaël Raulet, Olivier Déforges, and Mohamed Abid. 2013. System level synthesis of dataflow programs: HEVC
decoder case study. In Proceedings of the 2013 Electronic System Level Synthesis Conference (ESLsyn). 1–6.
[3] AMD/Xilinx. 2024. AMD/Xilinx Vitis 2023.2 Documentation. https://docs.amd.com/r/en-US/ug1399-vitis-hls/Target-Flow-Overview Accessed:
2025-01-06.
[4] Puya Amiri, Arsène Pérard-Gayot, Richard Membarth, Philipp Slusallek, Roland Leißa, and Sebastian Hack. 2021. FLOWER: A com-
prehensive dataflow compiler for high-level synthesis. In 2021 International Conference on Field-Programmable Technology (ICFPT). 1–9.
doi:10.1109/ICFPT52863.2021.9609930
[5] Marco Bacis, Giuseppe Natale, Emanuele Del Sozzo, and Marco Domenico Santambrogio. 2017. A pipelined and scalable dataflow implementation
of convolutional neural networks on FPGA. In 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). 90–97.
doi:10.1109/IPDPSW.2017.44
[6] Riyadh Baghdadi, Jessica Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Abdurrahman Akkas, Yunming Zhang, Patricia Suriana, Shoaib Kamil,
and Saman Amarasinghe. 2019. Tiramisu: A polyhedral compiler for expressing fast and portable code. In 2019 IEEE/ACM International Symposium
on Code Generation and Optimization (CGO). IEEE, 193–205.
[7] Yunsheng Bai, Atefeh Sohrabizadeh, Yizhou Sun, and Jason Cong. 2022. Improving GNN-based accelerator design automation with meta learning.
In Proceedings of the 59th ACM/IEEE Design Automation Conference (San Francisco, California) (DAC ’22). Association for Computing Machinery,
New York, NY, USA, 1347–1350. doi:10.1145/3489517.3530629
[8] Suhail Basalama and Jason Cong. 2025. Stream-HLS: Towards Automatic Dataflow Acceleration. In Proceedings of the 2025 ACM/SIGDA International
Symposium on Field Programmable Gate Arrays (Monterey, CA, USA) (FPGA ’25). Association for Computing Machinery, New York, NY, USA,
103–114. doi:10.1145/3706628.3708878
[9] Suhail Basalama, Atefeh Sohrabizadeh, Jie Wang, Licheng Guo, and Jason Cong. 2023. FlexCNN: An End-to-end Framework for Composing CNN
Accelerators on FPGA. ACM Trans. Reconfigurable Technol. Syst. 16, 2, Article 23 (mar 2023), 32 pages. doi:10.1145/3570928
[10] Tal Ben-Nun, Johannes de Fine Licht, Alexandros N. Ziogas, Timo Schneider, and Torsten Hoefler. 2019. Stateful dataflow multigraphs: a data-
centric model for performance portability on heterogeneous architectures. In Proceedings of the International Conference for High Performance
Computing, Networking, Storage and Analysis (Denver, Colorado) (SC ’19). Association for Computing Machinery, New York, NY, USA, Article 81,
14 pages. doi:10.1145/3295500.3356173
[11] Albert Benveniste, Paul Caspi, Paul Le Guernic, and Nicolas Halbwachs. 1994. Data-flow synchronous languages. In A Decade of Concurrency
Reflections and Perspectives: REX School/Symposium Noordwijkerhout, The Netherlands June 1–4, 1993 Proceedings. Springer, 1–45.
[12] Shuvra S. Bhattacharyya, Gordon Brebner, Jörn W. Janneck, Johan Eker, Carl von Platen, Marco Mattavelli, and Mickaël Raulet. 2009.
OpenDF: a dataflow toolset for reconfigurable hardware and multicore systems. SIGARCH Comput. Archit. News 36, 5 (jun 2009), 29–35.
doi:10.1145/1556444.1556449
[13] Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. 2008. A Practical Automatic Polyhedral Parallelizer and Locality Opti-
mizer. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (Tucson, AZ, USA) (PLDI ’08).
Association for Computing Machinery, New York, NY, USA, 101–113. doi:10.1145/1375581.1375595
[14] Hongzheng Chen, Niansong Zhang, Shaojie Xiang, Zhichen Zeng, Mengjia Dai, and Zhiru Zhang. 2024. Allo: A Programming Model for Compos-
able Accelerator Design. Proc. ACM Program. Lang. 8, PLDI, Article 171 (jun 2024). doi:10.1145/3656401
[15] W.Y. Chen, P.P. Chang, T.M. Conte, and W.W. Hwu. 1993. The effect of code expanding optimizations on instruction cache design. IEEE Trans.
Comput. 42, 9 (1993), 1045–1057. doi:10.1109/12.241594
[16] Yuze Chi, Jason Cong, Peng Wei, and Peipei Zhou. 2018. SODA: Stencil with Optimized Dataflow Architecture. In 2018 IEEE/ACM International
Conference on Computer-Aided Design (ICCAD). 1–8. doi:10.1145/3240765.3240850
[17] Young-kyu Choi and Jason Cong. 2018. HLS-Based Optimization and Design Space Exploration for Applications with Variable Loop Bounds. In
2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). 1–8. doi:10.1145/3240765.3240815
[18] Jason Cong, Zhenman Fang, Yuchen Hao, Peng Wei, Cody Hao Yu, Chen Zhang, and Peipei Zhou. 2018. Best-Effort FPGA Programming: A Few
Steps Can Go a Long Way. CoRR abs/1807.01340 (2018). arXiv:1807.01340 http://arxiv.org/abs/1807.01340
[19] Jason Cong, Bin Liu, Stephen Neuendorffer, Juanjo Noguera, Kees Vissers, and Zhiru Zhang. 2011. High-Level Synthesis for FPGAs:
From Prototyping to Deployment. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 30, 4 (2011), 473–491.
doi:10.1109/TCAD.2011.2110592
[20] Jason Cong and Jie Wang. 2018. PolySA: Polyhedral-Based Systolic Array Auto-Compilation. In 2018 IEEE/ACM International Conference on
Computer-Aided Design (ICCAD). 1–8. doi:10.1145/3240765.3240838
[21] Johannes de Fine Licht, Andreas Kuster, Tiziano De Matteis, Tal Ben-Nun, Dominic Hofer, and Torsten Hoefler. 2021. StencilFlow: Mapping Large
Stencil Programs to Distributed Spatial Computing Systems. In 2021 IEEE/ACM International Symposium on Code Generation and Optimization
(CGO). 315–326. doi:10.1109/CGO51591.2021.9370315
[22] J. B. Dennis. 1974. First version of a data flow procedure language. In Programming Symposium, Proceedings Colloque Sur La Programmation.
Springer-Verlag, Berlin, Heidelberg, 362–376.
[23] Alain Denzler, Geraldo F. Oliveira, Nastaran Hajinazar, Rahul Bera, Gagandeep Singh, Juan Gómez-Luna, and Onur Mutlu. 2023. Casper: Acceler-
ating Stencil Computations Using Near-Cache Processing. IEEE Access 11 (2023), 22136–22154. doi:10.1109/ACCESS.2023.3252002
[24] Paul Grigoraş, Xinyu Niu, Jose G. F. Coutinho, Wayne Luk, Jacob Bower, and Oliver Pell. 2013. Aspect driven compilation for dataflow designs. In
2013 IEEE 24th International Conference on Application-Specific Systems, Architectures and Processors. 18–25. doi:10.1109/ASAP.2013.6567545
[25] Licheng Guo, Yuze Chi, Jason Lau, Linghao Song, Xingyu Tian, Moazin Khatti, Weikang Qiao, Jie Wang, Ecenur Ustun, Zhenman Fang, Zhiru
Zhang, and Jason Cong. 2023. TAPA: A Scalable Task-parallel Dataflow Programming Framework for Modern FPGAs with Co-optimization of
HLS and Physical Design. ACM Trans. Reconfigurable Technol. Syst. 16, 4, Article 63 (Dec. 2023), 31 pages. doi:10.1145/3609335
[26] Licheng Guo, Pongstorn Maidee, Yun Zhou, Chris Lavin, Eddie Hung, Wuxi Li, Jason Lau, Weikang Qiao, Yuze Chi, Linghao Song, Yuanlong Xiao,
Alireza Kaviani, Zhiru Zhang, and Jason Cong. 2023. RapidStream 2.0: Automated Parallel Implementation of Latency–Insensitive FPGA Designs
Through Partial Reconfiguration. ACM Trans. Reconfigurable Technol. Syst. 16, 4, Article 59 (Sept. 2023), 30 pages. doi:10.1145/3593025
[27] James Hegarty, John Brunhaver, Zachary DeVito, Jonathan Ragan-Kelley, Noy Cohen, Steven Bell, Artem Vasilyev, Mark Horowitz, and Pat Han-
rahan. 2014. Darkroom: compiling high-level image processing code into hardware pipelines. ACM Trans. Graph. 33, 4, Article 144 (jul 2014),
11 pages. doi:10.1145/2601097.2601174
[28] Changwan Hong, Wenlei Bao, Albert Cohen, Sriram Krishnamoorthy, Louis-Noël Pouchet, Fabrice Rastello, J. Ramanujam, and P. Sadayappan. 2016.
Effective padding of multidimensional arrays to avoid cache conflict misses. SIGPLAN Not. 51, 6 (jun 2016), 129–144. doi:10.1145/2980983.2908123
[29] Sitao Huang, Kun Wu, Hyunmin Jeong, Chengyue Wang, Deming Chen, and Wen-Mei Hwu. 2021. PyLog: An Algorithm-Centric Python-Based
FPGA Programming and Synthesis Flow. IEEE Trans. Comput. 70, 12 (2021), 2015–2028. doi:10.1109/TC.2021.3123465
[30] Intel. 2024. Intel HLS Compiler. https://www.intel.com/content/www/us/en/software/programmable/quartus-prime/hls-compiler.html
[31] Liancheng Jia, Liqiang Lu, Xuechao Wei, and Yun Liang. 2020. Generating Systolic Array Accelerators With Reusable Blocks. IEEE Micro 40, 4
(2020), 85–92. doi:10.1109/MM.2020.2997611
[32] Gilles Kahn. 1974. The Semantics of a Simple Language for Parallel Programming. In Information Processing, Proceedings of the 6th IFIP Congress
1974, Stockholm, Sweden, August 5-10, 1974, Jack L. Rosenfeld (Ed.). North-Holland, 471–475.
[33] Moazin Khatti, Xingyu Tian, Yuze Chi, Licheng Guo, Jason Cong, and Zhenman Fang. 2023. PASTA: Programming and Automation Support
for Scalable Task-Parallel HLS Programs on Modern Multi-Die FPGAs. In 2023 IEEE 31st Annual International Symposium on Field-Programmable
Custom Computing Machines (FCCM). 12–22. doi:10.1109/FCCM57271.2023.00011
[34] Guilherme Korol, Michael Guilherme Jordan, Mateus Beck Rutzig, and Antonio Carlos Schneider Beck. 2022. AdaFlow: A Framework for
Adaptive Dataflow CNN Acceleration on FPGAs. In 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE). 244–249.
doi:10.23919/DATE54114.2022.9774727
[35] Michael Kruse, Hal Finkel, and Xingfu Wu. 2020. Autotuning search space for loop transformations. In 2020 IEEE/ACM 6th Workshop on the LLVM
Compiler Infrastructure in HPC (LLVM-HPC) and Workshop on Hierarchical Parallelism for Exascale Computing (HiPar). IEEE, 12–22.
[36] Yi-Hsiang Lai, Yuze Chi, Yuwei Hu, Jie Wang, Cody Hao Yu, Yuan Zhou, Jason Cong, and Zhiru Zhang. 2019. HeteroCL: A Multi-Paradigm
Programming Infrastructure for Software-Defined Reconfigurable Computing. In Proceedings of the 2019 ACM/SIGDA International Sympo-
sium on Field-Programmable Gate Arrays (Seaside, CA, USA) (FPGA ’19). Association for Computing Machinery, New York, NY, USA, 242–251.
doi:10.1145/3289602.3293910
[37] Yi-Hsiang Lai, Hongbo Rong, Size Zheng, Weihao Zhang, Xiuping Cui, Yunshan Jia, Jie Wang, Brendan Sullivan, Zhiru Zhang, Yun Liang, Youhui
Zhang, Jason Cong, Nithin George, Jose Alvarez, Christopher Hughes, and Pradeep Dubey. 2020. SuSy: a programming model for productive
construction of high-performance systolic arrays on FPGAs. In Proceedings of the 39th International Conference on Computer-Aided Design (Virtual
Event, USA) (ICCAD ’20). Association for Computing Machinery, New York, NY, USA, Article 73, 9 pages. doi:10.1145/3400302.3415644
[38] Edward Ashford Lee and David G. Messerschmitt. 1987. Static Scheduling of Synchronous Data Flow Programs for Digital Signal Processing. IEEE
Trans. Comput. C-36, 1 (1987), 24–35. doi:10.1109/TC.1987.5009446
[39] Jiajie Li, Yuze Chi, and Jason Cong. 2020. HeteroHalide: From Image Processing DSL to Efficient FPGA Acceleration. In Proceedings of the 2020
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (Seaside, CA, USA) (FPGA ’20). Association for Computing Machinery,
New York, NY, USA, 51–57. doi:10.1145/3373087.3375320
[40] Peng Li, Louis-Noël Pouchet, and Jason Cong. 2014. Throughput optimization for high-level synthesis using resource constraints. In Int. Workshop
on Polyhedral Compilation Techniques (IMPACT’14).
[41] Rui Li, Yufan Xu, Aravind Sukumaran-Rajam, Atanas Rountev, and P. Sadayappan. 2021. Analytical characterization and design space explo-
ration for optimization of CNNs. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and
Operating Systems (Virtual, USA) (ASPLOS ’21). Association for Computing Machinery, New York, NY, USA, 928–942. doi:10.1145/3445814.3446759
[42] Junyi Liu, Samuel Bayliss, and George A. Constantinides. 2015. Offline Synthesis of Online Dependence Testing: Parametric Loop Pipelining for
HLS. In 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines. 159–162. doi:10.1109/FCCM.2015.31
[43] Junyi Liu, John Wickerson, Samuel Bayliss, and George A Constantinides. 2017. Polyhedral-based dynamic loop pipelining for high-level synthesis.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37, 9 (2017), 1802–1815.
[44] Junyi Liu, John Wickerson, and George A Constantinides. 2016. Loop splitting for efficient pipelining in high-level synthesis. In 2016 IEEE 24th
Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 72–79.
[45] Junyi Liu, John Wickerson, and George A. Constantinides. 2017. Tile size selection for optimized memory reuse in high-level synthesis. In 2017
27th International Conference on Field Programmable Logic and Applications (FPL). 1–8. doi:10.23919/FPL.2017.8056810
[46] Microchip. 2023. SmartHLS Compiler Software. https://www.microchip.com/en-us/products/fpgas-and-plds/fpga-and-soc-design-tools/smarthls-compiler
[47] Giuseppe Natale, Giulio Stramondo, Pietro Bressana, Riccardo Cattaneo, Donatella Sciuto, and Marco D. Santambrogio. 2016. A polyhedral model-
based framework for dataflow implementation on FPGA devices of Iterative Stencil Loops. In 2016 IEEE/ACM International Conference on Computer-
Aided Design (ICCAD). 1–8. doi:10.1145/2966986.2966995
[48] P.R. Panda, H. Nakamura, N.D. Dutt, and A. Nicolau. 1999. Augmenting loop tiling with data alignment for improved cache performance. IEEE
Trans. Comput. 48, 2 (1999), 142–149. doi:10.1109/12.752655
[49] Francesco Peverelli, Marco Rabozzi, Emanuele Del Sozzo, and Marco D. Santambrogio. 2018. OXiGen: A Tool for Automatic Acceleration of C
Functions Into Dataflow FPGA-Based Kernels. In 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). 91–98.
doi:10.1109/IPDPSW.2018.00023
[50] PoCC [n. d.]. PoCC, the Polyhedral Compiler Collection 1.3. http://pocc.sourceforge.net
[51] Louis-Noel Pouchet. 2023. PoCC. https://sourceforge.net/projects/pocc/ Ver 1.6.
[52] Louis-Noël Pouchet, Uday Bondhugula, Cédric Bastoul, Albert Cohen, J. Ramanujam, P. Sadayappan, and Nicolas Vasilache. 2011. Loop Transfor-
mations: Convexity, Pruning and Optimization. In Proceedings of the 38th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming
Languages (Austin, Texas, USA) (POPL ’11). ACM, New York, NY, USA, 549–562. doi:10.1145/1926385.1926449
[53] Louis-Noel Pouchet and Tomofumi Yuki. Retrieved 2024. Polybench: The polyhedral benchmark suite. http://polybench.sourceforge.net
[54] Louis-Noël Pouchet, Peng Zhang, P. Sadayappan, and Jason Cong. 2013. Polyhedral-Based Data Reuse Optimization for Configurable Computing.
In 21st ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA’13). ACM Press, Monterey, California.
[55] Louis-Noel Pouchet, Peng Zhang, P. Sadayappan, and Jason Cong. 2013. Polyhedral-Based Data Reuse Optimization for Configurable Computing.
In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (Monterey, California, USA) (FPGA ’13). Association
for Computing Machinery, New York, NY, USA, 29–38. doi:10.1145/2435264.2435273
[56] Stéphane Pouget, Louis-Noël Pouchet, and Jason Cong. 2024. Automatic Hardware Pragma Insertion in High-Level Synthesis: A Non-Linear
Programming Approach. In Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (Monterey, CA, USA)
(FPGA ’24). Association for Computing Machinery, New York, NY, USA, 184. doi:10.1145/3626202.3637593
[57] Stéphane Pouget, Louis-Noël Pouchet, and Jason Cong. 2025. Automatic Hardware Pragma Insertion in High-Level Synthesis: A Non-Linear
Programming Approach. ACM Trans. Des. Autom. Electron. Syst. 30, 2, Article 26 (Feb. 2025), 44 pages. doi:10.1145/3711847
[58] Stéphane Pouget, Louis-Noël Pouchet, and Jason Cong. 2025. A Unified Framework for Automated Code Transformation and Pragma Insertion. In
Proceedings of the 2025 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (Monterey, CA, USA) (FPGA ’25). Association for
Computing Machinery, New York, NY, USA, 187–198. doi:10.1145/3706628.3708873
[59] Yining Shi, Zhi Yang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Ziming Miao, Yuxiao Guo, Fan Yang, and Lidong Zhou. 2023. Welder: Scheduling
Deep Learning Memory Access via Tile-graph. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). USENIX
Association, Boston, MA, 701–718. https://www.usenix.org/conference/osdi23/presentation/shi
[60] Siemens. 2023. Catapult High-Level Synthesis. https://eda.sw.siemens.com/en-US/ic/catapult-high-level-synthesis/
[61] Gagandeep Singh, Dionysios Diamantopoulos, Christoph Hagleitner, Juan Gomez-Luna, Sander Stuijk, Onur Mutlu, and Henk Corporaal. 2020.
NERO: A Near High-Bandwidth Memory Stencil Accelerator for Weather Prediction Modeling. In 2020 30th International Conference on Field-
Programmable Logic and Applications (FPL). 9–17. doi:10.1109/FPL50879.2020.00014
[62] Atefeh Sohrabizadeh, Yunsheng Bai, Yizhou Sun, and Jason Cong. 2022. Automated Accelerator Optimization Aided by Graph Neural Networks.
In 2022 59th ACM/IEEE Design Automation Conference (DAC).
[63] Atefeh Sohrabizadeh, Yunsheng Bai, Yizhou Sun, and Jason Cong. 2022. Automated accelerator optimization aided by graph neural networks. In
Proceedings of the 59th ACM/IEEE Design Automation Conference (San Francisco, California) (DAC ’22). Association for Computing Machinery, New
York, NY, USA, 55–60. doi:10.1145/3489517.3530409
[64] Atefeh Sohrabizadeh, Yunsheng Bai, Yizhou Sun, and Jason Cong. 2023. Robust GNN-Based Representation Learning for HLS. In 2023 IEEE/ACM
International Conference on Computer Aided Design (ICCAD). 1–9. doi:10.1109/ICCAD57390.2023.10323853
[65] Atefeh Sohrabizadeh, Cody Hao Yu, Min Gao, and Jason Cong. 2021. AutoDSE: Enabling Software Programmers Design Efficient FPGA Accelerators.
In The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (Virtual Event, USA) (FPGA ’21). Association for Computing
Machinery, New York, NY, USA, 147. doi:10.1145/3431920.3439464
[66] Ecenur Ustun, Chenhui Deng, Debjit Pal, Zhijing Li, and Zhiru Zhang. 2020. Accurate Operation Delay Prediction for FPGA HLS Using Graph
Neural Networks. In 2020 IEEE/ACM International Conference On Computer Aided Design (ICCAD). 1–9.
[67] Sven Verdoolaege. 2011. Counting affine calculator and applications. In First International Workshop on Polyhedral Compilation Techniques (IM-
PACT’11), Chamonix, France.
[68] Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, José Ignacio Gómez, Christian Tenllado, and Francky Catthoor. 2013. Polyhedral parallel code
generation for CUDA. ACM Trans. Archit. Code Optim. 9, 4, Article 54 (jan 2013), 23 pages. doi:10.1145/2400682.2400713
[69] Jie Wang, Licheng Guo, and Jason Cong. 2021. AutoSA: A Polyhedral Compiler for High-Performance Systolic Arrays on FPGA. In The 2021
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (Virtual Event, USA) (FPGA ’21). Association for Computing Machinery,
New York, NY, USA, 93–104. doi:10.1145/3431920.3439292
[70] Yuxin Wang, Peng Zhang, Xu Cheng, and Jason Cong. 2012. An integrated and automated memory optimization flow for FPGA behavioral
synthesis. In 17th Asia and South Pacific Design Automation Conference. 257–262. doi:10.1109/ASPDAC.2012.6164955
[71] Nan Wu, Yuan Xie, and Cong Hao. 2022. IronMan-Pro: Multi-objective Design Space Exploration in HLS via Reinforcement Learning
and Graph Neural Network based Modeling. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2022), 1–1.
doi:10.1109/TCAD.2022.3185540
[72] Nan Wu, Yuan Xie, and Cong Hao. 2023. IronMan-Pro: Multiobjective Design Space Exploration in HLS via Reinforcement Learning and
Graph Neural Network-Based Modeling. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 42, 3 (2023), 900–913.
doi:10.1109/TCAD.2022.3185540
[73] Nan Wu, Hang Yang, Yuan Xie, Pan Li, and Cong Hao. 2022. High-level synthesis performance prediction using GNNs: benchmarking, modeling,
and advancing. In Proceedings of the 59th ACM/IEEE Design Automation Conference (San Francisco, California) (DAC ’22). Association for Computing
Machinery, New York, NY, USA, 49–54. doi:10.1145/3489517.3530408
[74] Shaojie Xiang, Yi-Hsiang Lai, Yuan Zhou, Hongzheng Chen, Niansong Zhang, Debjit Pal, and Zhiru Zhang. 2022. HeteroFlow: An Accelerator
Programming Model with Decoupled Data Placement for Software-Defined FPGAs. In Proceedings of the 2022 ACM/SIGDA International Sympo-
sium on Field-Programmable Gate Arrays (Virtual Event, USA) (FPGA ’22). Association for Computing Machinery, New York, NY, USA, 78–88.
doi:10.1145/3490422.3502369
[75] AMD Xilinx. 2023. Vitis 2023.2. https://www.xilinx.com/products/design-tools/vitis.html
[76] AMD Xilinx. 2024. Merlin Compiler. https://github.com/Xilinx/merlin-compiler
[77] Hanchen Ye, Hyegang Jun, and Deming Chen. 2024. HIDA: A Hierarchical Dataflow Compiler for High-Level Synthesis. In Proceedings of the 29th
ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (La Jolla, CA, USA) (ASPLOS
’24). Association for Computing Machinery, New York, NY, USA, 215–230. doi:10.1145/3617232.3624850
[78] Hanchen Ye, HyeGang Jun, Hyunmin Jeong, Stephen Neuendorffer, and Deming Chen. 2022. ScaleHLS: A Scalable High-Level Synthesis Framework
with Multi-Level Transformations and Optimizations: Invited. In Proceedings of the 59th ACM/IEEE Design Automation Conference (San Francisco,
California) (DAC ’22). Association for Computing Machinery, New York, NY, USA, 1355–1358. doi:10.1145/3489517.3530631
[79] Cody Hao Yu, Peng Wei, Max Grossman, Peng Zhang, Vivek Sarkar, and Jason Cong. 2018. S2FA: An Accelerator Automation Framework for
Heterogeneous Computing in Datacenters. In 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC). 1–6. doi:10.1109/DAC.2018.8465827
[80] Weichuang Zhang, Jieru Zhao, Guan Shen, Quan Chen, Chen Chen, and Minyi Guo. 2024. An Optimizing Framework on MLIR for Efficient
FPGA-based Accelerator Generation. In 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE Computer
Society, Los Alamitos, CA, USA, 75–90. doi:10.1109/HPCA57654.2024.00017
[81] Zhiru Zhang, Yiping Fan, Wei Jiang, Guoling Han, Changqi Yang, and Jason Cong. 2008. AutoPilot: A Platform-Based ESL Synthesis System. Springer
Netherlands, Dordrecht, 99–112. doi:10.1007/978-1-4020-8588-8_6
[82] Ruizhe Zhao and Jianyi Cheng. 2021. Phism: Polyhedral High-Level Synthesis in MLIR. arXiv preprint arXiv:2103.15103 (2021).
[83] Ruizhe Zhao, Jianyi Cheng, Wayne Luk, and George A Constantinides. 2022. POLSCA: Polyhedral High-Level Synthesis with Compiler Transfor-
mations. arXiv (2022).