COSMOS - Coordination of High-Level Synthesis and Memory Optimization for Hardware Accelerators
1 INTRODUCTION
High-performance systems-on-chip (SoCs) are increasingly based on heterogeneous architectures
that combine general-purpose processor cores and specialized hardware accelerators [4, 8, 22]. Accelerators
are hardware devices designed to perform specific functions. They have become popular
because they guarantee considerable gains in both performance and energy efficiency
with respect to the corresponding software executions [9–11, 20, 23, 29, 41, 48]. However, the
This article was presented in the International Conference on Hardware/Software Codesign and System Synthesis
(CODES+ISSS) 2017 and appears as part of the ESWEEK-TECS special issue.
Authors addresses: The authors are within the Department of Computer Science, Columbia University, New York, NY,
USA (Luca Piccolboni: [email protected], Paolo Mantovani: [email protected], Giuseppe Di Guglielmo:
[email protected], and Luca P. Carloni: [email protected]).
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2017 ACM 1539-9087/2017/09-ART150 $15.00
https://doi.org/10.1145/3126566
ACM Transactions on Embedded Computing Systems, Vol. 16, No. 5s, Article 150. Publication date: September 2017.
integration of several specialized hardware blocks into a complex accelerator is a difficult design
and verification task. In response to this challenge, we advocate the application of two key prin-
ciples. First, to cope with the increasing complexity of SoCs and accelerators, most of the design
effort should move away from the familiar register-transfer level (RTL) by embracing system-level
design (SLD) [18, 42] with high-level synthesis (HLS) [32, 39]. Second, it is necessary to create
reusable and flexible components, also known as intellectual property (IP) blocks, which can be
easily (re)used across a variety of architectures with different performance targets and cost metrics.
Fig. 1. COSMOS: a methodology to coordinate HLS and memory optimization for the DSE of hardware
accelerators.
replication), rather than in time. The application of this knob generally leads to a faster, but larger,
implementation of the initial specification.
Despite the advantages of HLS, performing this design-space exploration (DSE) is still a compli-
cated task, especially for complex hardware accelerators. First, the support for memory generation
and optimization is limited in current HLS tools. Some HLS tools still require third-party gener-
ators to provide a description of the memory organization and automate the DSE process [36,
37]. Several studies, however, highlight the importance of private memories to sustain the parallel
datapath of accelerators: on a typical accelerator design, memory takes from 40% to 90% of the
area [16, 30]; hence, its optimization cannot be an independent task. Second, HLS tools are based
on heuristics, whose behavior is not robust and often hard to predict [24]. Small changes to the
knobs, e.g., changing the number of iterations unrolled in a loop, can cause significant and un-
expected modifications at the implementation level. This increases the DSE effort because small
changes to the knobs can take the exploration far from Pareto optimality.
1.3 Contributions
To address these limitations, we present COSMOS1: an automatic methodology for the DSE of
complex hardware accelerators, which are composed of several components. COSMOS is based on
a compositional approach that coordinates both HLS tools and memory generators. First, thanks to
the datapath and memory co-design, COSMOS produces a large set of Pareto-optimal implemen-
tations for each component, thus increasing both performance and cost spans. These spans are
defined as the ratios between the maximum value and the minimum value for performance and
cost, respectively. Second, COSMOS leverages compositional design techniques to significantly re-
duce the number of invocations to the HLS tool and the memory generator. In this way, COSMOS
focuses on the most critical components of the accelerator and quickly converges to the desired
trade-off point between cost and performance for the entire accelerator. The COSMOS methodol-
ogy consists of two main steps (Figure 1). First, COSMOS uses an algorithm to characterize each
component of the accelerator individually by efficiently coordinating multiple runs of the HLS and
memory generator tools. This algorithm finds the regions in the design space of the components
that include the Pareto-optimal implementations (Component Characterization in Figure 1). Sec-
ond, COSMOS performs a DSE to identify the Pareto-optimal solutions for the entire accelerator
by efficiently solving a linear programming (LP) problem instance (Design-Space Exploration).
We evaluate the effectiveness and efficiency of the COSMOS methodology on a complex accel-
erator for wide-area motion imagery (WAMI) [3, 38], which consists of approximately 7000 lines
of SystemC code. While exploring the design space of WAMI, COSMOS returns an average perfor-
mance span of 4.1× and an average area span of 2.6×, as opposed to 1.7× and 1.2× when memory
1 COSMOS stands for “COordination of high-level Synthesis and Memory Optimization for hardware acceleratorS”. We also
adopt the name COSMOS for our methodology since it is the opposite of CHAOS (in the Greek creation myths). In our
analogy, CHAOS corresponds to the complexity of the DSE process.
optimization is not considered and only standard dual-port memories are used. Further, COSMOS
achieves the target data-processing throughput for the WAMI accelerator while reducing the
number of invocations to the HLS tool per component by up to 14.6×, with respect to an
exhaustive exploration approach.
1.4 Organization
The paper is organized as follows. Section 2 provides the necessary background for the rest of the
paper. Section 3 describes a few examples to show the effort required in the DSE process. Section 4
gives an overview of the COSMOS methodology, which is then detailed in Sections 5 (Component
Characterization) and 6 (Design-Space Exploration). Section 7 presents the experimental results.
Section 8 discusses the related work. Finally, Section 9 concludes the paper.
2 PRELIMINARIES
This section provides the necessary background concepts. We first describe the main characteris-
tics of the accelerators targeted by COSMOS in Section 2.1. Then, we present the computational
model we adopt for the DSE in Section 2.2.
by exchanging the data through an on-chip interconnect network that implements transaction-
level modeling (TLM) [19] channels. These channels synchronize the components by absorbing
the potential differences in their computational latencies with a latency-insensitive communica-
tion protocol [7]. This ensures that the components of an accelerator can always be replaced with
different Pareto-optimal implementations without affecting the correctness of the accelerator im-
plementation. COSMOS employs channels with a fixed bitwidth (256 bits) and does not explore
different design alternatives to implement the communication among the components. It can be
extended, however, to support this type of DSE by using, for example, the XKnobs [35] or buffer-
restructuring techniques [13]. Each component includes a datapath, which is organized in a set of
loops, to read and store input and output data and to compute the required functionality. There
are also private local memories (PLMs), or scratchpads, where data resides during the computation.
PLMs are multi-bank memory architectures that provide multiple read and write ports to allow
accelerators to perform parallel accesses. We generate optimized memories for our accelerators
by using the Mnemosyne memory generator [37]. Several analyses highlight the importance of
the PLMs in sustaining the parallel datapath of accelerators [16, 30]. PLMs play a key role in
the performance of accelerators [25], and they occupy from 40% to 90% of the entire area of the
components of a given accelerator [30].
behaviors, they are a practical model to analyze stream processing accelerators for many classes
of applications, e.g., image and signal processing applications. A PN is a bipartite graph defined as
a tuple (P, T, F, w, M0), where P is a set of m places, T is a set of n transitions, F ⊆ (P × T ) ∪ (T × P )
is the set of arcs, w : F → N+ is an arc weighting function, and M0 ∈ Nm is the initial marking, i.e.,
the number of tokens at each p ∈ P. A PN is strongly-connected if for every pair of places pi and
pj there exists a sequence of transitions and places such that pi and pj are mutually reachable in
the net. A PN can be organized in a set of strongly-connected components, i.e., the maximal sets of
places that are strongly-connected. A TMG is a PN such that (i) each place has exactly one input
and one output transition, and (ii) w : F → {1}, i.e., every arc has a weight equal to 1. To measure
performance, TMGs are extended with a transition firing-delay vector τ ∈ Rn , which represents
the duration of each particular firing.
The minimum cycle time of a strongly-connected TMG is defined as max{Dk/Nk | k ∈ K},
where K is the set of cycles of the TMG, Dk is the sum of the transition firing delays in cycle
k, and Nk is the number of tokens in cycle k [40]. In this paper, we use the TMG model to formally
describe the accelerators. We use the term system to indicate a complex accelerator that is made of
multiple components. Each component of the system is represented with a transition in the TMG
whose firing delay is equal to its effective latency. The effective latency λ of a component is defined
as the product of its clock cycle count and target clock period. The maximum sustainable effective
throughput θ of the system is then the reciprocal of the minimum cycle time of its TMG, if the TMG
is strongly connected. Otherwise, it is the minimum θ among its strongly-connected components.
We use λ and θ as performance figures for the single components and the system, respectively. We
use the area α as the cost metric for both the components and the system.
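As a concrete illustration of these definitions, the minimum cycle time and the effective throughput θ of a strongly-connected TMG can be computed once its cycles are known. The following Python sketch is not part of the COSMOS tool flow; the cycle delays and token counts are made up for the example:

```python
def min_cycle_time(cycles):
    """Minimum cycle time of a strongly-connected TMG:
    max over all cycles k of D_k / N_k, where D_k is the sum of the
    transition firing delays in cycle k and N_k its token count."""
    return max(d / n for d, n in cycles)

def effective_throughput(cycles):
    # theta is the reciprocal of the minimum cycle time
    return 1.0 / min_cycle_time(cycles)

# Hypothetical TMG with two cycles: (D_1, N_1) = (8, 2), (D_2, N_2) = (6, 1)
cycles = [(8.0, 2), (6.0, 1)]
print(min_cycle_time(cycles))        # 6.0 (the second cycle is the bottleneck)
print(effective_throughput(cycles))  # 0.1666...
```

For a non-strongly-connected net, the same computation would be applied per strongly-connected component and the minimum θ taken, as stated above.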
3 MOTIVATIONAL EXAMPLES
Performing a DSE that is both accurate and as exhaustive as possible for a complex hardware accelerator is
a difficult task for three main reasons: (i) HLS tools do not always support PLM generation and
optimization (Section 3.1), (ii) HLS tools are based on heuristics that make it difficult to configure
the knobs (Section 3.2), and (iii) HLS tools do not handle the simultaneous optimization of multiple
components (Section 3.3). Next, we detail these issues with some examples.
3.1 Memories
The joint optimization of the accelerator datapath and PLM architecture is critical for an effective
DSE. Figure 4 depicts the design space of Gradient, a component we designed for WAMI. The
graph reports different design points, each characterized in terms of area (mm2 ) and effective
latency (milliseconds), synthesized for an industrial 32nm ASIC technology library. The points
with the same color (shape) are obtained by partially unrolling the loops for different numbers
of iterations. The different colors (shapes) indicate different numbers of ports for the PLM2 . By
increasing the number of ports, we notice a significant impact on both latency and area. In fact,
multiple ports allow the component to read and write more data in the same clock cycle, thus
increasing the hardware parallelism. Multi-port memories, however, require much more area
since more banks may be used depending on the given memory-access pattern. Note that ignoring
the role of the PLM considerably limits the design space. By changing the number of ports of
the PLM, we obtain a latency span of 7.9× and an area span of 3.7×. By using standard dual-port
memories, we have only a latency span of 1.4× and an area span of 1.2×. This motivates the need
2 Here and in the rest of the paper, the number of ports indicates the number of read ports to the memories containing the
input data of the component and the number of write ports to the memories containing the output data, i.e., the ports
that allow parallelism in the compute phase of the component.
Fig. 4. Example of application of two HLS knobs (number of ports, number of unrolls) to Gradient, a com-
ponent of WAMI. The nested graph magnifies the design points with 2 read and 2 write ports. The numbers
indicate the numbers of iterations unrolled.
of considering the optimization of PLMs in the DSE process. COSMOS takes into consideration
the PLMs by generating optimized memories with Mnemosyne [37].
3.3 Compositionality
Complex accelerators need to be partitioned into multiple components to be efficiently synthesized
by current HLS tools. This reduces the synthesis time and improves the quality of results, but sig-
nificantly increases the DSE effort. Figure 5 reports a simple example to illustrate this problem.
On the top, the figure reports two graphs representing a small subset of Pareto-optimal points for
Gradient and Grayscale, two components of WAMI. Assuming that they are executed sequen-
tially in a loop, their aggregate throughput is the reciprocal of the sum of their latencies. On the
bottom, the figure reports all the possible combinations of the design points of the two components,
differentiating the Pareto-optimal combinations from the Pareto-dominated combinations. These
design points are characterized in terms of area (mm2 ) and effective throughput (1/milliseconds).
In order to find the Pareto-optimal combinations at the system level, an exhaustive search method
Fig. 5. Example of composition for Gradient and Grayscale, two components of WAMI. The graphs on
the top report some Pareto-optimal points for the two components. The graph on the bottom shows all the
possible combinations of these components, assuming they are executed sequentially in a loop. In the graph
of the composition, the effective throughput is used as the performance metric.
would apply the following steps: (i) synthesize different points for each component by varying
the settings of the knobs, (ii) find the Pareto-optimal points for each component, and (iii) find the
Pareto-optimal combinations of the components at the system level. This approach is impractical
for complex accelerators. First, step (i) requires trying all the combinations of the knob settings (e.g.,
different numbers of ports and numbers of unrolls). Second, step (iii) requires evaluating an expo-
nential number of combinations at the system level to find those that are Pareto-optimal. In fact,
if we have n components with k Pareto-optimal points each, then the number of combinations to
check is O(k^n). This example motivates the need for a smart compositional method that identifies
the most critical components of an accelerator and minimizes the invocations to the HLS tool. In
order to do that, COSMOS reduces the number of combinations of knob settings that are used for
synthesis and prioritizes the synthesis of the components depending on their level of contribution
to the effective throughput of the entire accelerator.
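The blow-up of step (iii) can be made concrete with a small Python sketch. The (latency, area) points below are hypothetical, not the measured WAMI values; as stated above, the aggregate throughput of two components executed sequentially in a loop is the reciprocal of the sum of their latencies:

```python
from itertools import product

def pareto(points):
    """Keep the Pareto-optimal (area, throughput) pairs:
    minimize area while maximizing throughput."""
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] >= p[1] and q != p
                       for q in points)]

# Hypothetical (latency, area) points for two components
gradient  = [(2.0, 3.0), (4.0, 1.5)]
grayscale = [(1.0, 2.0), (3.0, 1.0)]

# Step (iii): evaluate every combination at the system level
combos = [(a1 + a2, 1.0 / (l1 + l2))          # (area, throughput)
          for (l1, a1), (l2, a2) in product(gradient, grayscale)]
print(len(combos))            # 4 combinations for k = 2, n = 2
print(sorted(pareto(combos))) # 3 of them are Pareto-optimal
```

With n components and k points each, `combos` grows as k^n, which is exactly what makes the exhaustive approach impractical.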
(1) Component Characterization (Section 5): in this step COSMOS analyzes each component
of the system individually; for each component it identifies the boundaries of the regions
that include the Pareto-optimal designs; starting from the HLS-ready implementation of
each component (in SystemC), COSMOS applies an algorithm that generates knob and
memory configurations to automatically coordinate the HLS and memory generator tools;
the algorithm takes into account the memories of the accelerators and tries to deal with
the unpredictability of HLS tools;
(2) Design-Space Exploration (Section 6): in this step COSMOS analyzes the design space of
the entire system; the system is modeled with a TMG to find the most critical components
for the system throughput; then, COSMOS:
5 COMPONENT CHARACTERIZATION
Algorithm 1 reports the pseudocode used for the component characterization. The designer pro-
vides the clock period, the maximum number of ports for the PLMs (mainly constrained by the
target technology and the memory generator) and the maximum number of loop unrolls. In order
to keep the delay of the logic for selecting the memory banks negligible, the number of ports should
be a power of two. Note that this constraint can be partially relaxed without requiring Euclidean
division for the selection logic [46]. The number of unrolls depends on the loop complexity. Loops
with few iterations can be completely unrolled, while more complex loops can be only partially
unrolled. In fact, unrolling loops replicates the hardware resources, thus making the scheduling
more complex for the HLS tool. The algorithm identifies regions in the design space of the com-
ponent. A region includes design points that have the same number of ports. They are bounded
by an upper-left (λmin , αmax ) and a lower-right (λmax , αmin ) point. These regions represent the
design space of the component that will be used for the DSE at the system level, as explained in
Section 6.
ALGORITHM 1: Component Characterization
Input: clock, max_ports, max_unrolls
Output: set of regions (λmax , αmin , λmin , αmax )
1 for ports = 1 up to max_ports do
2 // Identification of max-λ min-α point
3 (λmax , αmin ) = hls_tool(ports, ports, clock);
4 // Identification of min-λ max-α point
5 for unrolls = max_unrolls down to ports + 1 do
6 (λmin , αmax ) = hls_tool(unrolls, ports, clock);
7 if λ_constraintports (unrolls) is sat then break;
8 // Generation of the PLM of the component
9 αplm = memory_generator(ports);
10 αmin += αplm ; αmax += αplm ;
11 // Save the region of the design space
12 save(ports, unrolls, λmax , αmin , λmin , αmax );
Tool parameters: hls_tool(unrolls, ports, clock); memory_generator(ports).
Algorithm 1 starts by identifying the lower-right point of the region. To identify this design
point, it sets the number of unrolls equal to the current number of ports (line 3). This ensures that
all the ports of the PLM are exploited and the obtained point is not redundant. In fact, this point
cannot be obtained by using a lower number of ports. On the other hand, finding the upper-left
point is more challenging. A complete unroll (which could lead to the point with the minimum
latency) is infeasible in the case of complex loops. Indeed, it is not always guaranteed that, by increasing
the number of unrolls, the HLS tool returns an implementation of the component that gives
lower latency in exchange for higher area occupation. To overcome these problems, Algorithm 1
introduces a constraint, called the λ-constraint in the rest of the paper, that defines the maximum number
of states that the HLS tool can insert in the body of a loop. This helps in constraining the behavior
of the HLS tool to be more deterministic and in removing some of the Pareto-dominated points.
Thus, Algorithm 1 uses the following function to estimate the number of states that should be
sufficient to schedule one iteration of the loop that includes read and write operations:
hports(unrolls) = ⌈(γr × unrolls) / ports⌉ + ⌈γw / ports⌉ + η        (1)
where γr is the maximum number of read accesses to the same array per loop iteration, γw is
the maximum number of write accesses to the same array per loop iteration and, η accounts for
the latency required to perform the operations that do not access the PLM. These parameters are
inferred by traversing the control data flow graph (CDFG) created by the HLS tool for scheduling
the lower-right point. This function is used as an upper bound of the number of states that the
HLS tool can insert. If this upper bound is not sufficient, then the synthesis fails and the point is
discarded. A synthesis run with a lower number of unrolls is performed to find another point to
be used as the upper-left extreme (lines 5-7).
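Algorithm 1 can be transliterated into executable form as follows. This is only a sketch: `hls_tool`, `memory_generator`, and the λ-constraint check `constraint_sat` are stand-ins for the real tools, and the cost model in the stubs is invented purely for illustration:

```python
def characterize(clock, max_ports, max_unrolls,
                 hls_tool, memory_generator, constraint_sat):
    """Sketch of Algorithm 1: one design-space region per port count."""
    regions = []
    ports = 1
    while ports <= max_ports:          # ports restricted to powers of two
        # Lower-right extreme: unrolls == ports (line 3)
        lam_max, area_min = hls_tool(ports, ports, clock)
        lam_min, area_max = lam_max, area_min
        # Upper-left extreme: largest unroll count whose schedule
        # satisfies the lambda-constraint (lines 5-7)
        for unrolls in range(max_unrolls, ports, -1):
            lam_min, area_max = hls_tool(unrolls, ports, clock)
            if constraint_sat(ports, unrolls):
                break
        # Add the PLM area to both extremes (lines 9-10)
        a_plm = memory_generator(ports)
        regions.append((ports, lam_max, area_min + a_plm,
                        lam_min, area_max + a_plm))
        ports *= 2
    return regions

# Stub tools with a made-up cost model: latency shrinks and area grows
# linearly with the unroll factor; PLM area grows with the port count.
hls = lambda unrolls, ports, clock: (100.0 / unrolls, 10.0 * unrolls)
mem = lambda ports: 5.0 * ports
sat = lambda ports, unrolls: True

print(characterize(clock=10, max_ports=2, max_unrolls=4,
                   hls_tool=hls, memory_generator=mem, constraint_sat=sat))
# [(1, 100.0, 15.0, 25.0, 45.0), (2, 50.0, 30.0, 25.0, 50.0)]
```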
Example 1. Figure 6 shows an example of using the λ-constraint. The loop (reported on the left)
contains two read operations to two distinct arrays, i.e., γr = 1, and one write operation, i.e., γw = 1.
We assume that all the operations that are neither read nor write operations can be performed in
one clock cycle, i.e., η = 1. The two diagrams (on the right) show the results of the scheduling by
using two ports for the PLM and by unrolling the loop two or three times, respectively. In the first
case (unrolls = 2), the HLS tool can schedule all the operations in a maximum of h2(2) = 3 clock
cycles. Thus, this point would be chosen by Algorithm 1 to be used as upper-left extreme. In the
second case (unrolls = 3), the HLS tool is not able to complete the schedule within h2(3) = 4 clock
cycles (it needs at least 5 clock cycles). Thus, this point is discarded.
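Reading the two fractions in Equation (1) as rounded up (which is what the numbers of Example 1 imply), the estimate can be reproduced directly:

```python
from math import ceil

def h(ports, unrolls, gr=1, gw=1, eta=1):
    """Estimated number of states per (unrolled) loop iteration,
    Equation (1): gr/gw are the maximum read/write accesses to the
    same array per iteration, eta covers non-memory operations."""
    return ceil(gr * unrolls / ports) + ceil(gw / ports) + eta

print(h(2, 2))  # 3 states: the schedule with unrolls = 2 fits
print(h(2, 3))  # 4 states: not enough for unrolls = 3, point discarded
```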
Note that the λ-constraint is not guaranteed to obtain a Pareto-optimal point due to the intrinsic
variability of the HLS results. Still, this point can serve as an upper bound of the region in the
design space. Note also that the λ-constraint cannot be applied to loops that (i) require data from
sub-components through blocking interfaces or (ii) do not present memory accesses to the PLM.
In these cases, in fact, it is necessary to extend the definition of the estimation function given in
Equation (1) to handle such situations. Alternatively, COSMOS can optionally run a few syntheses
in the neighbourhood of the maximum number of unrolls and use a local Pareto-optimal point as
the upper-left extreme.
6 DESIGN-SPACE EXPLORATION
After the characterization of the single components of a given accelerator, COSMOS uses a LP
formulation to find the Pareto-optimal design points at the system level. The DSE problem at the
system level can be formulated as follows:
Problem 1. Given a TMG model of the system where each component has been characterized, an
HLS tool, and a target granularity δ > 0, find a Pareto curve α versus θ of the system, such that:
(i) given two consecutive points d and d′ on the Pareto curve, they have to satisfy:
max{dα/d′α − 1, dθ/d′θ − 1} < δ; this ensures a maximum distance between two
design points on the curve;
(ii) the HLS tool must be invoked as few times as possible.
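Condition (i) can be checked pairwise along a candidate curve. The (α, θ) points below are hypothetical:

```python
def fine_enough(curve, delta):
    """Check condition (i) of Problem 1: for consecutive points on the
    (area, throughput) Pareto curve, the larger point d must satisfy
    max(d_alpha/d'_alpha - 1, d_theta/d'_theta - 1) < delta."""
    ordered = sorted(curve)              # increasing area and throughput
    return all(max(b[0] / a[0] - 1, b[1] / a[1] - 1) < delta
               for a, b in zip(ordered, ordered[1:]))

curve = [(1.0, 0.10), (1.2, 0.115), (1.4, 0.13)]
print(fine_enough(curve, delta=0.25))  # True: gaps are small enough
print(fine_enough(curve, delta=0.10))  # False: the first gap is too wide
```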
This formulation is borrowed from [28], where the authors propose a solution that requires the
manual effort of the designers to characterize the components. In contrast, COSMOS solves this
problem by leveraging the automatic characterization method in Section 5 and by dividing it into
two steps: Synthesis Planning and Synthesis Mapping.
where the function fi returns the implementation cost (α) of the i-th component given the firing-
delay τi of transition ti , σ ∈ Rn is the transition-firing initiation-time vector, M 0 ∈ Nm is the initial
marking, τ− ∈ Rm is the input-transition firing-delay vector, i.e., τi− is the firing-delay of the transition tk entering in place pi (note that τmin− and τmax− correspond to the extreme λmin and λmax
          ⎧ +1  if tj is an output transition of pi,
A[i, j] = ⎨ −1  if tj is an input transition of pi,        (3)
          ⎩  0  otherwise.
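Equation (3) is the usual place-by-transition incidence construction; a minimal sketch on a hypothetical two-place ring net:

```python
def incidence(places, transitions, arcs):
    """Equation (3): A[i][j] = +1 if t_j is an output transition of
    place p_i (arc p_i -> t_j), -1 if t_j is an input transition of
    p_i (arc t_j -> p_i), 0 otherwise."""
    A = [[0] * len(transitions) for _ in places]
    for src, dst in arcs:
        if src in places:                      # p_i -> t_j
            A[places.index(src)][transitions.index(dst)] = +1
        else:                                  # t_j -> p_i
            A[places.index(dst)][transitions.index(src)] = -1
    return A

# Ring: p0 -> t0 -> p1 -> t1 -> p0
A = incidence(['p0', 'p1'], ['t0', 't1'],
              [('p0', 't0'), ('t0', 'p1'), ('p1', 't1'), ('t1', 'p0')])
print(A)  # [[1, -1], [-1, 1]]
```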
The objective function minimizes the implementation costs of the components, while satisfying
the system throughput requirements. Given the component extreme latencies λmin and λmax , it is
possible to determine the values of θmin and θmax by labeling the transitions of the TMG of the
system with such latencies. By iterating from θmin to θmax with a ratio of (1 + δ ), we can then find
the optimal values of λ for the components that solve Problem 1. This formulation guarantees that
the components that are not critical for the system throughput are selected to minimize their cost.
The cost functions fi in Equation (2) are unknown a priori, but they can be approximated with
convex piecewise-linear functions. This LP formulation can be solved in polynomial time [5], and
it can be extended to the case of non-strongly-connected TMGs.
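Iterating from θmin to θmax with a ratio of (1 + δ) amounts to generating a geometric sequence of throughput targets, one LP instance per target. A sketch with made-up bounds:

```python
def throughput_targets(theta_min, theta_max, delta):
    """Geometric sweep: theta_min, theta_min*(1+delta), ...,
    ending with theta_max itself."""
    targets, t = [], theta_min
    while t < theta_max:
        targets.append(t)
        t *= 1 + delta
    targets.append(theta_max)
    return targets

print(throughput_targets(0.1, 0.2, 0.25))
# [0.1, 0.125, 0.15625, 0.1953125, 0.2]
```

Each target would then constrain one LP instance of Equation (2), so the number of HLS invocations is bounded by the number of targets rather than by the size of the full design space.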
mapping function that returns the number of unrolls that should be applied, given a specific value
for the latency (we apply the ceiling function to get an integer value). For instance, if a point with
latency of 20 s is required, the mapping function returns 11 as the number of unrolls. Note that by
specifying the maximum latency, the function returns the minimum number of unrolls, while by
specifying the minimum latency, it returns the maximum number of unrolls.
It is possible that the mapping may fail by choosing a value for μtarget that does not satisfy the λ-
constraint (Section 5). In this case, COSMOS tries to increase the number of unrolls to preserve the
throughput. Further, if λtarget is not included in any region, COSMOS uses the slowest point of the
next region that has a larger number of ports. This does not require a synthesis run (because that
point has been synthesized during the characterization), and it is a conservative solution because,
as in the case of failure of the λ-constraint, we are willing to trade area to preserve the throughput.
7 EXPERIMENTAL RESULTS
We implement the COSMOS methodology with a set of tools and scripts to automate the DSE.
Specifically, COSMOS includes: (i) Mnemosyne [37] to generate multi-bank memory architectures
as described in Section 5, (ii) a tool to extract the information required by Mnemosyne from the
database of the HLS tool, (iii) a script to run the synthesis and the memory generator according
to Algorithm 1, (iv) a program that creates and solves the LP model by using the GLPK Library3
(Section 6.1), and (v) a tool that maps the LP solutions to the HLS knobs and runs the synthesis
(Section 6.2).
We evaluate the effectiveness and efficiency of COSMOS by considering the WAMI applica-
tion [38] as a case study. The original specification of the WAMI application is available in C in
the PERFECT Benchmark Suite [3]. Starting from this specification, we design a SystemC acceler-
ator to be synthesized with a commercial HLS tool, i.e., Cadence C-to-Silicon. We use an industrial
32nm ASIC technology as target library4 . We choose the WAMI application as our case study due
to (i) the different types of computational blocks it includes and (ii) its complexity. The hetero-
geneity of its computational blocks allows us to develop different components for each block and
show the vast applicability of COSMOS. The C specification is roughly 1000 lines of code. The
specification of our accelerator design is roughly 7000 lines of SystemC code.
COSMOS No Memory
Component reg λspan αspan λspan αspan
Debayer 3 2.89× 1.99× 1.04× 1.36×
Grayscale 4 6.91× 3.41× 2.75× 1.14×
Gradient 4 7.89× 3.65× 1.39× 1.22×
Hessian 4 7.70× 7.30× 1.44× 1.30×
SD-Update 4 9.87× 2.01× 2.78× 1.79×
Matrix-Sub 4 2.75× 3.98× 1.88× 1.05×
Matrix-Add 3 1.53× 1.01× 1.26× 1.01×
Matrix-Mul 3 2.88× 3.05× 1.92× 1.14×
Matrix-Resh 1 1.02× 1.04× 1.02× 1.04×
Steep.-Descent 1 1.95× 1.46× 1.95× 1.46×
Change-Det. 1 2.21× 1.04× 2.21× 1.04×
Warp 1 1.09× 1.03× 1.09× 1.03×
Average - 4.06× 2.58× 1.73× 1.22×
overall a richer DSE, as evidenced by the average results. For some components the algorithm extracts
only one region because multiple ports can incur additional area for no latency gains.
This happens when (i) the algorithm cannot exploit multiple accesses in memory, or (ii) the data
is cached into local registers which can be accessed in parallel in the same clock cycle, e.g., for
Change-Detection. On the other hand, in most cases COSMOS provides significant gains in
terms of area and latency spans compared to a DSE that does not consider the memories.
Figure 9 shows the design space of four representative components of WAMI. The rectangles in
the figures are the regions found by Algorithm 1. For completeness, in addition to the design points
corresponding to the extreme points of the regions, the graphs show also the intermediate points
that could be selected by the mapping function. The small graphs on the right magnify the cor-
responding regions reported on the left. As in the examples discussed in Section 3, increasing the
number of ports has a significant impact on the DSE, while loop unrolling has a local effect within
each region. Another aspect that is common among many components is that the regions become
smaller as we keep increasing the number of ports. For example, for Grayscale in Figure 9(c), we
note that by increasing the number of ports, we reach a point where the gain in latency is not
significant. This effect, called diminishing returns [1], is the same effect that can be observed in the
parallelization of software algorithms. In some cases, changing the ports increases only the area
with no latency gains as discussed in the previous paragraph. This is highlighted in Figure 9(d),
where for Change-Detection we report two additional regions with respect to those specified
in Table 1. The diminishing-return effect can also be observed by increasing the number of unrolls
inside a region, e.g., Figure 9(b). This is why COSMOS exploits Amdahl’s Law (Section 6.2). On the
other hand, we notice some discontinuities of the Pareto-optimal points within some regions, e.g.,
the region in the bottom-right corner of Figure 9(a). Even by applying the λ-constraint (Section 5)
it is not possible to completely discard the Pareto-dominated implementations. In fact,
by further restricting the imposed constraints, i.e., by reducing the number of states that the
HLS tool can insert in each loop, we observe that the Pareto-optimal implementations are also
discarded. Thus, it is not always possible to obtain a curve composed only of Pareto-optimal points
within a certain region. Finally, the Pareto-optimal points outside the regions are not discarded by
COSMOS. They can be chosen when it is necessary to perform the mapping (Section 6.2).
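The diminishing-returns effect invoked above can be captured with Amdahl's Law: if only a fraction of a component's latency scales with the number of parallel memory accesses, the achievable speedup saturates as ports (or unrolls) increase. The sketch below is purely illustrative; the 90% memory-bound fraction is a hypothetical value, not one measured on WAMI:

```python
def amdahl_speedup(parallel_fraction, ways):
    """Overall speedup when only `parallel_fraction` of the latency
    scales with the number of parallel memory accesses (`ways`)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / ways)

# Hypothetical component where 90% of the latency is memory-bound:
# the speedup can never exceed 1 / (1 - 0.9) = 10, no matter how
# many ports are added, and efficiency (speedup per port) drops.
for ports in (1, 2, 4, 8, 16):
    print(ports, round(amdahl_speedup(0.9, ports), 2))
```

With these numbers the speedup grows from 1.0 (one port) toward the asymptote of 10, while the speedup obtained per added port shrinks steadily, which is exactly the pattern visible in the smaller regions of Figure 9.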
σ(dp, dm) = |dm − dp| / dp
where dp is the area of a planned point p, while dm is the area of the corresponding mapped
point m. Each planned point in Figure 10 is labeled with its corresponding σ% value. Note that the
curve obtained with LP is a theoretical curve because the points found at the system level do not
guarantee the existence of a corresponding set of implementations for the components. The error
is mainly due to the impact of the memory, which determines a significant distance between two
consecutive regions (e.g., the points with more than 10% mismatch in Figure 10). In fact, if a point
is mapped between two regions it must be approximated with the lower-right point of the next
region with lower effective latency. This choice satisfies the throughput requirements in almost
all cases, but at the expense of additional area. In fact, even if Equation (2) is constrained by
the system throughput, it is not always guaranteed to obtain the same throughput because it is not
always the case that there exists a mapped point with exactly the same latency as a planned
point. To solve this issue, one could try to reduce the clock period to satisfy the throughput
requirements.
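The mismatch metric σ and the approximation rule described above can be sketched as follows. The region corners and the planned point are hypothetical placeholders, not values taken from the WAMI experiments:

```python
def mismatch(d_p, d_m):
    """Relative area mismatch between a planned point (area d_p)
    and the corresponding mapped point (area d_m)."""
    return abs(d_m - d_p) / d_p

def map_point(planned_latency, regions):
    """Approximate a planned point with the lower-right corner
    (highest latency, smallest area) of the next region whose
    effective latency does not exceed the planned one, so that
    the throughput requirement is still met.  `regions` is a
    list of (effective_latency, area) lower-right corners."""
    candidates = [r for r in regions if r[0] <= planned_latency]
    # among the feasible regions, pick the slowest (cheapest) corner
    return max(candidates, key=lambda r: r[0])

# Hypothetical lower-right corners of three regions (latency, area):
regions = [(1000, 5.0), (600, 8.0), (350, 13.0)]
lat, area = map_point(700, regions)   # planned point falls between regions
print(lat, area, round(100 * mismatch(7.0, area), 1))  # → 600 8.0 14.3
```

A planned point with latency 700 and area 7.0 has no exact counterpart, so it is approximated by the (600, 8.0) corner: the throughput requirement is met, at the cost of the extra area measured by σ.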
Fig. 11. Number of invocations of the HLS tool for an exhaustive exploration (bars on the left) and COSMOS
(on the right).
Finally, to demonstrate the efficiency of COSMOS, Figure 11 shows the number of invocations
to the HLS tool. For each component of WAMI, the right bars report the breakdown of the syn-
thesis calls performed in each phase of the algorithm. At least two invocations are necessary for
each region to characterize a component. Then, we have to consider the invocations that fail due
to the λ-constraints and, finally, the invocations required at the system level on the most critical
components (mapping). Some components do not play any role in the efficiency of the system.
For example, for Matrix-Mul, there are no invocations after the characterization because only the
slowest version has been requested by Equation (2) (to save area). This component is not important
to guarantee a high throughput for the entire system. Moreover, some synthesized points belong
to multiple solutions of the LP problem, as in the case of Debayer. Therefore, COSMOS avoids
performing an invocation of the HLS tool with the same knobs more than once. On the other hand, the
left bars in Figure 11 report the number of invocations required for an exhaustive exploration. Such
an exploration requires one to (i) synthesize all the possible configurations of unrolls and memory ports
for each component, (ii) find the Pareto-optimal design points for each component, and (iii) com-
pose all the Pareto-optimal designs to find the Pareto curve at the system level (Section 3). The left
bars in Figure 11 show the number of invocations to the HLS tool required in step (i). COSMOS
reduces the total number of invocations for WAMI by 6.7× on average and up to 14.6× for the
single components, compared to the exhaustive exploration. Further, while COSMOS returns the
Pareto-optimal implementations at the system level, to find the combinations of the components
that are Pareto optimal with an exhaustive search method, one has to combine the huge number
of solutions for the single components. In the case of WAMI, the number of combinations, i.e.,
the product of the number of Pareto-optimal points of each component, is greater than 9 × 10^12.
This motivates the need for a compositional method like COSMOS for the DSE of complex
accelerators.
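To see why composing exhaustive per-component results is intractable, note that the number of system-level combinations is the product of the per-component Pareto-set sizes. The sizes below are hypothetical placeholders (the article reports only that the product for WAMI exceeds 9 × 10^12), but they show how quickly the product explodes:

```python
from math import prod

# Hypothetical per-component Pareto-set sizes (NOT the actual WAMI
# numbers): nine components with a few dozen Pareto points each.
pareto_set_sizes = [40, 35, 50, 30, 45, 38, 42, 33, 36]

# Every system-level design pairs one implementation per component,
# so the combinations multiply.
combinations = prod(pareto_set_sizes)
print(f"{combinations:.2e} system-level combinations")
```

Even with only a few dozen Pareto-optimal points per component, the product reaches the order of 10^14, which is why COSMOS restricts the composition to the small set of candidates selected by the LP formulation.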
7.4 Summary
We report a brief summary of the achieved results:
• COSMOS guarantees a richer DSE with respect to approaches that do not consider the
memory as an integral part of the DSE: for WAMI, COSMOS guarantees an average perfor-
mance span of 4.06× and an average area span of 2.58× as opposed to 1.73× and 1.22×,
respectively, when only standard dual-port memories are used; COSMOS obtains a richer
set of Pareto-optimal implementations thanks to memory generation and optimization;
• COSMOS guarantees a faster DSE compared to exhaustive search methods: for WAMI,
COSMOS reduces the number of invocations to the HLS tool by 6.7× on average and by up
to 14.6× for the single components; COSMOS is able to reduce the number of invocations
thanks to the compositional approach discussed in Section 6;
• COSMOS is an automatic and scalable methodology for DSE: the approach is intrinsi-
cally compositional, and thus with larger designs the performance gains are expected to be
at least as good as those obtained with smaller ones. While an exhaustive method has to explore all the
alternatives, COSMOS focuses on the most critical components.
8 RELATED WORK
This section describes the methods most closely related to ours for performing DSE. We distinguish the
methods that explore single-component designs (reported in Section 8.1) from those that are composi-
tional like COSMOS (in Section 8.2).
to account for the high variability and partial unpredictability of the HLS tools. Such constraints
consider both the dependency graph of the specification and the memory references in each loop.
Thus, COSMOS identifies larger regions of Pareto-optimal implementations.
Other methods, such as Aladdin [47], perform a DSE without using HLS tools and without gener-
ating the RTL implementations, estimating the performance and costs of high-level specifications
(C code for Aladdin). COSMOS differs from these methods because it aims at generating efficient
RTL implementations by using HLS and memory generator tools. Indeed, such methods can be
used before applying COSMOS to pre-characterize the different components of an accelerator that
is not ready to be synthesized with HLS tools. Since the design of HLS-ready specifications requires
significant effort [39], this can help designers focus only on the most critical components,
i.e., those that are expected to return good performance gains over software executions. After this
pre-characterization, COSMOS can be used to perform a DSE of such components and obtain the
Pareto-optimal combinations of their RTL implementations.
9 CONCLUDING REMARKS
We presented COSMOS, an automatic methodology for compositional DSE that coordinates both
HLS and memory generator tools. COSMOS takes into account the unpredictability of the current
HLS tools and considers the PLMs of the components as an essential part of the DSE. The method-
ology of COSMOS is intrinsically compositional. First, it characterizes the components to define
the regions of the design space that contain Pareto-optimal implementations. Then, it exploits an
LP formulation to find the Pareto-optimal solutions at the system level. Finally, it identifies the
knobs for each component that can be used to obtain the corresponding implementations at RTL.
We showed the effectiveness and efficiency of COSMOS by considering the WAMI accelerator as
a case study. Compared to methods that do not consider the PLMs, COSMOS finds a larger set of
Pareto-optimal implementations. Additionally, compared to exhaustive search methods, COSMOS
reduces the number of invocations to the HLS tool by up to one order of magnitude.
ACKNOWLEDGMENTS
The authors would like to thank the anonymous reviewers for their valuable comments and help-
ful suggestions that helped us improve the paper considerably. This work was supported in part
by DARPA PERFECT (C#: R0011-13-C-0003), the National Science Foundation (A#: 1527821), and
C-FAR (C#: 2013-MA-2384), one of the six centers of STARnet, a Semiconductor Research Corpo-
ration program sponsored by MARCO and DARPA.
REFERENCES
[1] G. M. Amdahl. 1967. Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities. In
Proc. of the ACM Spring Joint Computer Conference (AFIPS).
[2] N. Baradaran and P. C. Diniz. 2008. A Compiler Approach to Managing Storage and Memory Bandwidth in Config-
urable Architectures. ACM Transaction on Design Automation of Electronic Systems (2008).
[3] K. Barker, T. Benson, D. Campbell, D. Ediger, R. Gioiosa, A. Hoisie, D. Kerbyson, J. Manzano, A. Marquez,
L. Song, N. Tallent, and A. Tumeo. 2013. PERFECT (Power Efficiency Revolution For Embedded Computing Tech-
nologies) Benchmark Suite Manual. Pacific Northwest National Laboratory and Georgia Tech Research Institute.
http://hpc.pnl.gov/PERFECT/.
[4] S. Borkar and A. Chien. 2011. The Future of Microprocessors. Communication of the ACM (2011).
[5] S. Boyd and L. Vandenberghe. 2004. Convex Optimization. Cambridge University Press.
[6] J. Campos, G. Chiola, J. M. Colom, and M. Silva. 1992. Properties and Performance Bounds for Timed Marked Graphs.
IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications (1992).
[7] L. P. Carloni. 2015. From Latency-Insensitive Design to Communication-Based System-Level Design. Proc. of the IEEE
(2015).
[8] L. P. Carloni. 2016. The Case for Embedded Scalable Platforms. In Proc. of the ACM/IEEE Design Automation Conference
(DAC). (Invited).
[9] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam. 2014. DaDianNao:
A Machine-Learning Supercomputer. In Proc. of the Annual ACM/IEEE International Symposium on Microarchitecture
(MICRO).
[10] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze. 2017. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep
Convolutional Neural Networks. IEEE Journal of Solid-State Circuits (2017).
[11] J. Cong, M. A. Ghodrat, M. Gill, B. Grigorian, K. Gururaj, and G. Reinman. 2014. Accelerator-Rich Architectures:
Opportunities and Progresses. In Proc. of the ACM/IEEE Design Automation Conference (DAC).
[12] J. Cong, P. Li, B. Xiao, and P. Zhang. 2016. An Optimal Microarchitecture for Stencil Computation Acceleration Based
on Nonuniform Partitioning of Data Reuse Buffers. IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems (2016).
[13] J. Cong, P. Wei, C. H. Yu, and P. Zhou. 2017. Bandwidth Optimization Through On-Chip Memory Restructuring for
HLS. In Proc. of the Annual Design Automation Conference (DAC).
[14] J. Cong, P. Zhang, and Y. Zou. 2011. Combined Loop Transformation and Hierarchy Allocation for Data Reuse Opti-
mization. In Proc. of the ACM/IEEE International Conference on Computer-Aided Design (ICCAD).
[15] J. Cong, P. Zhang, and Y. Zou. 2012. Optimizing Memory Hierarchy Allocation with Loop Transformations for High-
Level Synthesis. In Proc. of the ACM/IEEE Design Automation Conference (DAC).
[16] E. G. Cota, P. Mantovani, G. Di Guglielmo, and L. P. Carloni. 2015. An Analysis of Accelerator Coupling in Hetero-
geneous Architectures. In Proc. of the ACM/IEEE Design Automation Conference (DAC).
[17] F. Ferrandi, P. L. Lanzi, D. Loiacono, C. Pilato, and D. Sciuto. 2008. A Multi-objective Genetic Algorithm for Design
Space Exploration in High-Level Synthesis. In Proc. of the IEEE Computer Society Annual Symposium on VLSI.
[18] A. Gerstlauer, C. Haubelt, A. D. Pimentel, T. P. Stefanov, D. D. Gajski, and J. Teich. 2009. Electronic System-level
Synthesis Methodologies. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2009).
[19] F. Ghenassia. 2006. Transaction-Level Modeling with SystemC. Springer-Verlag.
[20] T. J. Ham, L. Wu, N. Sundaram, N. Satish, and M. Martonosi. 2016. Graphicionado: A High-Performance and Energy-
Efficient Accelerator for Graph Analytics. In Proc. of the Annual IEEE/ACM International Symposium on Microarchi-
tecture (MICRO).
[21] C. Haubelt and J. Teich. 2003. Accelerating Design Space Exploration Using Pareto-Front Arithmetics [SoC design].
In Proc. of the ACM/IEEE Asia and South Pacific Design Automation Conference (ASP-DAC).
[22] M. Horowitz. 2014. Computing’s energy problem (and what we can do about it). In Proc. of the IEEE International
Solid-State Circuits Conference (ISSCC).
[23] L. W. Kim. 2017. DeepX: Deep Learning Accelerator for Restricted Boltzmann Machine Artificial Neural Networks.
IEEE Transactions on Neural Networks and Learning Systems (2017).
[24] S. Kurra, N. K. Singh, and P. R. Panda. 2007. The Impact of Loop Unrolling on Controller Delay in High Level Synthesis.
In Proc. of the ACM/IEEE Conference on Design, Automation and Test in Europe (DATE).
[25] B. Li, Z. Fang, and R. Iyer. 2011. Template-based Memory Access Engine for Accelerators in SoCs. In Proc. of the
ACM/IEEE Asia and South Pacific Design Automation Conference (ASP-DAC).
[26] H. Y. Liu and L. P. Carloni. 2013. On Learning-Based Methods for Design-Space Exploration with High-Level Synthe-
sis. In Proc. of the ACM/IEEE Design Automation Conference (DAC).
[27] H. Y. Liu, I. Diakonikolas, M. Petracca, and L. P. Carloni. 2011. Supervised Design Space Exploration by Compositional
Approximation of Pareto Sets. In Proc. of the ACM/IEEE Design Automation Conference (DAC).
[28] H. Y. Liu, M. Petracca, and L. P. Carloni. 2012. Compositional System-Level Design Exploration with Planning of
High-Level Synthesis. In Proc. of the ACM/IEEE Conference on Design, Automation, and Test in Europe (DATE).
[29] X. Liu, Y. Chen, T. Nguyen, S. Gurumani, K. Rupnow, and D. Chen. 2016. High Level Synthesis of Complex Appli-
cations: An H.264 Video Decoder. In Proc. of the ACM/SIGDA International Symposium on Field-Programmable Gate
Arrays (FPGA).
[30] M. J. Lyons, M. Hempstead, G. Y. Wei, and D. Brooks. 2012. The Accelerator Store: A Shared Memory Framework for
Accelerator-based Systems. ACM Transactions on Architecture and Code Optimization (2012).
[31] A. Mahapatra and B. Carrion Schafer. 2014. Machine-learning based Simulated Annealer Method for High Level
Synthesis Design Space Exploration. In Proc. of the Electronic System Level Synthesis Conference (ESLsyn).
[32] W. Meeus, K. Van Beeck, T. Goedemé, J. Meel, and D. Stroobandt. 2012. An Overview of Today’s High-Level Synthesis
Tools. Design Automation for Embedded Systems (2012).
[33] V. K. Mishra and A. Sengupta. 2014. PSDSE: Particle Swarm Driven Design Space Exploration of Architecture and
Unrolling Factors for Nested Loops in High Level Synthesis. In Proc. of the IEEE International Symposium on Electronic
System Design (ISED).
[34] T. Murata. 1989. Petri Nets: Properties, Analysis and Applications. Proc. of the IEEE (1989).
[35] L. Piccolboni, P. Mantovani, G. Di Guglielmo, and L. P. Carloni. 2017. Broadening the Exploration of the Accelerator
Design Space in Embedded Scalable Platforms. In Proc. of the IEEE High Performance Extreme Computing Conference
(HPEC).
[36] C. Pilato, P. Mantovani, G. Di Guglielmo, and L. P. Carloni. 2014. System-level Memory Optimization for High-level
Synthesis of Component-based SoCs. In Proc. of the ACM/IEEE International Conference on Hardware/Software Code-
sign and System Synthesis (CODES+ISSS).
[37] C. Pilato, P. Mantovani, G. Di Guglielmo, and L. P. Carloni. 2017. System-Level Optimization of Accelerator Local
Memory for Heterogeneous Systems-on-Chip. IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems (2017).
[38] R. Porter, A. M. Fraser, and D. Hush. 2010. Wide-Area Motion Imagery. IEEE Signal Processing Magazine (2010).
[39] A. Qamar, F. B. Muslim, F. Gregoretti, L. Lavagno, and M. T. Lazarescu. 2017. High-Level Synthesis for Semi-Global
Matching: Is the Juice Worth the Squeeze? IEEE Access (2017).
[40] C. V. Ramamoorthy and G. S. Ho. 1980. Performance Evaluation of Asynchronous Concurrent Systems Using Petri
Nets. IEEE Transaction on Software Engineering (1980).
[41] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernández-Lobato, G. Y. Wei, and D. Brooks.
2016. Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators. In Proc. of the ACM/IEEE
Annual International Symposium on Computer Architecture (ISCA).
[42] A. Sangiovanni-Vincentelli. 2007. Quo Vadis, SLD? Reasoning About the Trends and Challenges of System Level
Design. Proc. of the IEEE (2007).
[43] B. Carrion Schafer. 2016. Probabilistic Multiknob High-Level Synthesis Design Space Exploration Acceleration. IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems (2016).
[44] B. Carrion Schafer, T. Takenaka, and K. Wakabayashi. 2009. Adaptive Simulated Annealer for High Level Synthe-
sis Design Space Exploration. In Proc. of the IEEE International Symposium on VLSI Design, Automation and Test
(VLSI-DAT).
[45] B. Carrion Schafer and K. Wakabayashi. 2012. Machine Learning Predictive Modelling High-Level Synthesis Design
Space Exploration. IET Computers Digital Techniques (2012).
[46] A. Seznec. 2015. Bank-interleaved Cache or Memory Indexing Does Not Require Euclidean Division. In Proc. of the
Annual Workshop on Duplicating, Deconstructing, and Debunking (WDDD).
[47] Y. S. Shao, B. Reagen, G. Y. Wei, and D. Brooks. 2014. Aladdin: A Pre-RTL, Power-performance Accelerator Simulator
Enabling Large Design Space Exploration of Customized Architectures. In Proc. of the ACM/IEEE Annual International
Symposium on Computer Architecture (ISCA).
[48] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong. 2015. Optimizing FPGA-based Accelerator Design for Deep
Convolutional Neural Networks. In Proc. of the ACM/SIGDA International Symposium on Field-Programmable Gate
Arrays (FPGA).