Transformations of High-Level Synthesis Codes For Higher-Performance Computing
Transformations of High-Level Synthesis Codes For Higher-Performance Computing
deep pipelines, distributed memory resources, and scalable routing. To alleviate this, we present a collection of optimizing
transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. We
systematically identify classes of transformations (pipelining, scalability, and memory), the characteristics of their effect on the HLS
code and the resulting hardware (e.g., increasing data reuse or resource consumption), and the objectives that each transformation
can target (e.g., resolve interface contention, or increase parallelism). We show how these can be used to efficiently exploit pipelining,
on-chip distributed fast memory, and on-chip dataflow, allowing for massively parallel architectures. To quantify the effect of various
transformations, we cover the optimization process of a sample set of HPC kernels, provided as open source reference codes. We aim
to establish a common toolbox to guide both performance engineers and compiler engineers in tapping into the performance potential
offered by spatial computing architectures using HLS.
Mem. buffering §4.2 – – – - – – – – – – – – registers including the logic between them (i.e., the critical
Mem. striping §4.3 – – – - – – – – – – –
Type demotion §4.4 – – – - - - – - – – – – – path of the circuit), will determine the maximum obtainable
TABLE 1: Overview of transformations, the characteristics of their effect on the frequency. ¹ Bitstream generation translates the final circuit
HLS code and the resulting hardware, and the objectives that they can target. The description to a binary format used to configure the device.
center group of column marks the following transformation characteristics: (PL)
enables pipelining; (RE) increases data reuse, i.e., increases the arithmetic intensity of Most effort invested by an HLS programmer lies in guiding
the code; (PA) increases or exposes more parallelism; (ME) optimizes memory accesses; the scheduling process in ¶ to implement deep, efficient
(RS) does not significantly increase resource consumption; (RT) does not significantly
impair routing, i.e., does not potentially reduce maximum frequency or prevent pipelines, but · is considered when choosing data types and
the design from being routed altogether; (SC) does not change the schedule of loop buffer sizes, and ¸ can ultimately bottleneck applications
nests, e.g., by introducing more loops; and (CC) does not significantly increase
code complexity. The symbols have the following meaning: “–”: no effect, “-”: once the desired parallelism has been achieved, requiring
positive effect, “-!”: very positive effect, “(-)”: small or situational positive the developer to adapt their code to aid this process.
effect, “”: negative effect, “!”: very negative effect, “()”: small or situational
negative effect, “∼”: positive or negative effect can occur, depending on the
context. The right group of columns marks the following objectives that 1.2 Key Transformations for High-Level Synthesis
can be targeted by transformations: (LD) resolve loop-carried dependencies, due This work identifies a set of optimizing transformations that
to inter-iteration dependencies or resource contention; (RE) increase data reuse;
(CU) increase parallelism; (BW) increase memory bandwidth utilization; (PL) reduce are essential to designing scalable and efficient hardware
pipelining overhead; (RT) improve routing results; (RS) reduce resource utilization. kernels in HLS. An overview given in Tab. 1. We divide
the transformations into three major classes: pipelining
In addition to identifying previous work that apply one or transformations, that enable or improve the potential for
more of the transformations defined here, we describe and pipelining computations; scaling transformations that in-
publish a set of end-to-end “hands-on” examples, optimized crease or expose additional parallelism; and memory en-
from naive HLS codes into high performance implementa- hancing transformations, which increase memory utilization
tions. This includes a stencil code, matrix multiplication, and efficiency. Each transformation is further classified ac-
and the N-body problem, all available on github. The cording to a number of characteristic effects on the HLS
optimized codes exhibit dramatic cumulative speedups of source code, and on the resulting hardware architecture
up to 29,950× relative to their respective naive starting (central columns). To serve as a cheat sheet, the table further-
points, showing the crucial necessity of hardware-aware more lists common objectives targeted by HLS programmers,
transformations, which are not performed automatically by and maps them to relevant HLS transformations (rightmost
today’s HLS compilers. As FPGAs are currently the only columns). Characteristics and objectives are discussed in
platforms commonly targeted by HLS tools in the HPC detail in relevant transformation sections.
domain, transformations are discussed and evaluated in this Throughout this work, we will show how each transfor-
context. Evaluating FPGA performance in comparison to mation is applied manually by a performance engineer by
other platforms is out of scope of this work. Our work pro- directly modifying the source code, giving examples before
vides a set of guidelines and a cheat sheet for optimizing and after it is applied. However, many transformations are
high-performance codes for reconfigurable architectures, also amenable to automation in an optimizing compiler.
guiding both performance engineers and compiler devel-
1.3 The Importance of Pipelining
opers to efficiently exploit these devices.
Pipelining is essential to efficient hardware architectures,
as expensive instruction decoding and data movement be-
1.1 From Imperative Code to Hardware tween memory, caches and registers can be avoided, by
Before diving into transformations, it is useful to form an sending data directly from one computational unit to the
intuition of the major stages of the source-to-hardware stack, next. We attribute two primary characteristics to pipelines:
to understand how they are influenced by the HLS code: • Latency (L): the number of cycles it takes for an input to
¶ High-level synthesis converts a pragma-assisted proce- propagate through the pipeline and arrive at the exit, i.e.,
dural description (C++, OpenCL) to a functionally equiva- the number of pipeline stages.
lent behavioral description (Verilog, VHDL). This requires • Initiation interval or gap (I ): the number of cycles that
mapping variables and operations to corresponding con- must pass before a new input can be accepted to the
structs, then scheduling operations according to their inter- pipeline. A perfect pipeline has I=1 cycle, as this is re-
dependencies. The dependency analysis is concerned with quired to keep all pipeline stages busy. Consequently,
creating a hardware mapping such that the throughput the initiation interval can often be considered the inverse
requirements are satisfied, which for pipelined sections re- throughput of the pipeline; e.g., I=2 cycles implies that the
quire the circuit to accept a new input every cycle. Coarse- pipeline stalls every second cycle, reducing the through-
grained control flow is implemented with state machines, put of all pipelines stages by a factor of 12 .
3
To quantify the importance of pipelining in HLS, we con- previous iteration, which takes multiple cycles to com-
sider the number of cycles C it takes to execute a pipeline plete (i.e., has multiple internal pipeline stages). If the
with latency L (both in [cycles]), taking N inputs, with an latency of the operations producing this result is L, the
initiation interval of I [cycles]. Assuming a reliable producer minimum initiation interval of the pipeline will be L.
and consumer at either end, we have: This is a common scenario when accumulating into a sin-
gle register (see Fig. 2), in cases where the accumulation
C = L + I · (N − 1) [cycles]. (1)
operation takes Lacc >1 cycles.
This is shown in Fig. 1. The time to execute all N iterations 2) Interface contention (intra-iteration): a hardware re-
with clock rate f [cycles/s] of this pipeline is thus C/f . source with limited ports is accessed multiple times in
the same iteration of the loop. This could be a FIFO
queue or RAM that only allows a single read and write
I
per cycle, or an interface to external memory, which only
N
p
oo
1 for (int n = 0; n < N; ++n) 2 double t[16];
rl
ne
2 for (int m = 0; m < M; ++m) { K 3 #pragma PIPELINE
in
3 double acc = C[n][m]; 4 for (int i = 0; i < N; ++i) { // P0
+ +
4 #pragma PIPELINE 5 auto prev = (i < 16) ? 0 : t[i%16];
C [N×M]
5 for (int k = 0; k < K; ++k) 6 t[i%16] = prev + arr[i]; }
]
[K ×K
6 acc += A[n][k] * B[k][m];
]
double res = 0;
M
[N
7
N
×
A
7 C[n][m] = acc; } M 8 for (int i = 0; i < 16; ++i) // P1
B
(a) Naive implementation of general matrix multiplication C=AB+C . 9 res += t[i]; // Not pipelined
10 return res; }
1 for (int n = 0; n < N; ++n) {
2 double acc[M]; // Uninitialized Listing 2: Two stages required for single loop accumulation.
inner acc[ 0] , acc[ 1] , . . . , acc[ M-1]
3 for (int k = 0; k < K; ++k) loop
4 double a = A[n][k]; // Only read once K
5 #pragma PIPELINE
6 for (int m = 0; m < M; ++m) {
result buffers, the second phase collapses the partial results
N
7 double prev = (k == 0) ? C[n][m] into the final output. This is shown in Lst. 2 for K=16.
× ]
[K ×K
8 : acc[m]; Optionally, the two stages can be implemented to run in
]
M
[N
acc[m] = prev + a * B[k][m]; }
A
9
M a coarse-grained pipelined fashion, such that the first stage
B
10 for (int m = 0; m < M; ++m) // Write
11 C[n][m] = acc[m]; } // out begins computing new partial results while the second stage
(b) Transposed iteration space, same location written every M cycles. is collapsing the previous results (by exploiting dataflow
1 for (int n = 0; n < N; ++n)
between modules, see Sec. 3.3).
2 for (int m = 0; m < M/T; ++m) {
3 double acc[T]; // Tiles of size T inner
loop
acc[ 0] , . . ., acc[ B-1] 2.1.4 Batched Accumulation Interleaving
4 for (int k = 0; k < K; ++k) K For algorithms with loop-carried dependencies that cannot
5 double a = A[n][k]; // M/T reads
6 #pragma PIPELINE N be solved by either method above (e.g., due to a non-
7 for (int t = 0; t < T; ++t) { C [N×M] commutative accumulation operator), we can still pipeline
double prev = (k == 0) ?
]
8
K
]
M
9 C[n][m*T+t] : acc[t];
×
A
[K
10 acc[t] = prev + a * B[k][m*T+t]; } M additional loop nested in the accumulation loop. This pro-
11 for (int t = 0; t < T; ++t) // Write cedure is similar to Sec. 2.1.2, but only applies to programs
12 C[n][m*T+t] = acc[t]; } // out
where it is relevant to compute the accumulation for multi-
(c) Tiled iteration space, same location written every T cycles.
ple data streams, and requires altering the interface and data
Listing 1: Interleave accumulations to remove loop-carried dependency.
movement of the program to interleave inputs in batches.
The code in Lst. 3a shows an iterative solver code with
an inherent loop-carried dependency on state, with a min-
• The loop-carried dependency is resolved: each location is
imum initiation interval corresponding to the latency LStep
only updated every M cycles (with M ≥Lacc in Fig. 3).
of the (inlined) function Step. There are no loops to inter-
• A, B , and C are all read in a contiguous fashion, achiev-
change, and we cannot change the order of loop iterations.
ing perfect spatial locality (we assume row-major memory
While there is no way to improve the latency of producing
layout. For column-major we would interchange the K -
a single result, we can improve the overall throughput by a
loop and N -loop).
factor of LStep by pipelining across N ≥LStep different inputs
• Each element of A is read exactly once.
(e.g., overlap solving for different starting conditions). We
The modified code is shown in Lst. 1b. We leave the accumu- effectively inject another loop over inputs, then perform
lation buffer defined on line 2 uninitialized, and implicitly transposition or tiled accumulation interleaving with this
reset it on line 8, avoiding M extra cycles to reset (this is a loop. The result of this transformation is shown in Lst. 3b,
form of pipelined loop fusion, covered in Sec. 2.4). for a variable number of interleaved inputs N.
2.1.2 Tiled Accumulation Interleaving 1 Vec<double> IterSolver(Vec<double> state, int T) {
For accumulations done in a nested loop, it can be sufficient 2 #pragma PIPELINE // Will fail to pipeline with I=1
3 for (int t = 0; t < T; ++t)
to interleave across a tile of an outer loop to resolve a 4 state = Step(state);
loop-carried dependency, using a limited size buffer to store 5 return state; }
intermediate results. This tile only needs to be of size ≥Lacc , (a) Solver executed for T steps with a loop-carried dependency on state.
where Lacc is the latency of the accumulation operation. 1 template <int N> inner loop T
2 void MultiSolver(Vec<double> *in,
This is shown in Lst. 1c, for the transposed matrix mul-
3 Vec<double> *out, int T) { b[0]
tiplication example from Lst. 1b, where the accumulation 4 Vec<double> b[N]; // Partial results
array has been reduced to tiles of size T (which should be 5 for (int t = 0; t < T; ++t) b[1]
6 #pragma PIPELINE
≥Lacc , see Fig. 3), by adding an additional inner loop over 7 for (int i = 0; i < N; ++i) {
the tile, and cutting the outer loop by a factor of B . 8 auto read = (t == 0) ? in[i] : b[i]; ...
9 auto next = Step(read);
10 if (t < T-1) b[i] = next;
2.1.3 Single-Loop Accumulation Interleaving b[N-1]
11 else out[i] = next; }} // Write out
If no outer loop is present, we have to perform the ac- (b) Pipeline across N ≥Lstep inputs to achieve I=1 cycle.
cumulation in two separate stages, at the cost of extra Listing 3: Pipeline across multiple inputs to avoid loop-carried dependency.
resources. For the first stage, we perform a transformation
similar to the nested accumulation interleaving, but strip- 2.2 Delay Buffering
mine the inner (and only) loop into blocks of size K ≥ Lacc , When iterating over regular domains in a pipelined fashion,
accumulating partial results into a buffer of size K . Once it is often sufficient to express buffering using delay buffers,
all incoming values have been accumulated into the partial expressed either with cyclically indexed arrays, or with
5
constant offset delay buffers, also known from the Intel Lst. 4b demonstrates the shift register pattern used to
ecosystem as shift registers. These buffers are only accessed in express the stencil buffering scheme, which is supported
a FIFO manner, with the additional constraint that elements by the Intel OpenCL toolflow. Rather than creating each
are only be popped once they have fully traversed the depth individual delay buffer required to propagate values, a
of the buffer (or when they pass compile-time fixed access single array is used, which is “shifted” every cycle using
points, called “taps”, in Intel OpenCL). Despite the “shift unrolling (lines 6-7). The computation accesses elements of
register” name, these buffers do not need to be implemented this array using constant indices only (line 10), relying on the
in registers, and are frequently implemented in on-chip tool to infer the partitioning into individual buffers (akin
RAM when large capacity is needed, where values are not to loop idiom recognition [25]) that we did explicitly in
physically shifted. Lst. 4a. The implicit nature of this pattern requires the tool
A common set of applications that adhere to the delay to specifically support it. For more detail on buffering stencil
buffer pattern are stencil applications such as partial dif- codes we refer to other works on the subject [44], [39].
ferential equation solvers [27], [28], [29], image processing Opportunities for delay buffering often arise naturally in
pipelines [30], [31], and convolutions in deep neural net- pipelined programs. If we consider the transposed matrix
works [32], [33], [34], [35], [36], all of which are typically multiplication code in Lst. 1b, we notice that the read from
traversed using a sliding window buffer, implemented in acc on line 8 and the write on line 9 are both sequential, and
terms of multiple delay buffers (or, in Intel terminology, a cyclical with a period of M cycles. We could therefore also
shift register with multiple taps). These applications have use the shift register abstraction for this array. The same is
been shown to be a good fit to spatial computing architec- true for the accumulation code in Lst. 3b.
tures [37], [38], [39], [40], [41], [42], [43], as delay buffering
is cheap to implement in hardware, either as shift registers
Seq.
+ + + ×
DRAM
south
in general purpose logic, or in RAM blocks. east west north
13
14 west = center; center = east; } } // Propagate registers 4 int bin = CalculateBin(memory[i]);
5 hist[bin] += 1; // Single cycle access Seq.
(a) Delay buffering using cyclically indexed line buffers.
6 } // ...write result out to memory...
1 // Pipelined loops executed sequentially 1 for (int i = 0; i < N0+N1; ++i) { 1 for (int i = 0; i < max(N0, N1); ++i) {
2 for (int i = 0; i < N0; ++i) Foo(i, /*...*/); 2 if (i < N0) Foo(i, /*...*/); 2 if (i < N0) Foo(i, /*...*/); // Omit ifs
3 for (int i = 0; i < N1; ++i) Bar(i, /*...*/); 3 else Bar(i - N0, /*...*/); } 3 if (i < N1) Bar(i, /*...*/); } // for N0==N1
(a) (L0 + I0 (N0 −1)) + (L1 + I1 (N1 −1)) cycles. (b) L2 + I(N0 + N1 −1) cycles. (c) L3 + I · (max(N0 , N1 )−1) cycles.
Listing 5: Two subsequent pipelined loops fused sequentially (Lst. 5b) or concurrently (Lst. 5c). Assume that all loops are pipelined (pragmas omitted for brevity).
For two consecutive loops with latencies/bounds/initi- Lst. 7b). There can be a (tool-dependent) benefit from saving
ation intervals {L0 , N0 , I0 } and {L1 , N1 , I1 } (Lst. 5a), re- overhead logic by only implementing the orchestration and
spectively, the total runtime according to Eq. 1 is (L0 + interfaces of a single pipeline, at the (typically minor) cost
I0 (N0 −1)) + (L1 + I1 (N1 −1)). Depending on which con- of the corresponding predication logic. More importantly,
dition(s) are met, we can distinguish between three levels of eliminating the coarse-grained control can enable other
pipelined loop fusion, with increasing performance benefits: transformations that significantly benefit performance, such
1) I=I0 =I1 (true in most cases): Loops can be fused by as fusion [§2.4] with adjacent pipelined loops, flattening
summing the loop bounds, using loop guards to sequen- nested loops [§2.6], and on-chip dataflow [§3.3].
tialize them within the same pipeline (Lst. 5b).
2.6 Pipelined Loop Flattening/Coalescing
2) Condition 1 is met, and only fine-grained or no dependencies
exist between the two loops: Loops can be fused by To minimize the number of cycles spent in filling/draining
iterating to the maximum loop bound, and loop guards pipelines (where the circuit is not streaming at full through-
are placed as necessary to predicate each section (Lst. 5c). put), we can flatten nested loops to move the fill/drain
3) Conditions 1 and 2 are met, and N =N0 =N1 (same loop phases to the outermost loop, fusing/absorbing code that
bounds): Loops bodies can be trivially fused (Lst. 5c, but is not in the innermost loop if necessary.
with no loop guards necessary). Lst. 8a shows a code with two nested loops, and gives the
An alternative way of performing pipeline fusion is to total number of cycles required to execute the program. The
instantiate each stage as a separate processing element, and latency of the drain phase of the inner loop and the latency
stream fine-grained dependencies between them (Sec. 3.3). of Bar outside the inner loop must be paid at every iteration
of the outer loop. If N0 L0 , the cycle count becomes just
2.5 Pipelined Loop Switching L1 + N0 N1 , but for applications where N0 is comparable to
L0 , draining the inner pipeline can significantly impact the
The benefits of pipelined loop fusion can be extended to
runtime (even if N1 is large). By transforming the code such
coarse-grained control flow by using loop switching (as op-
that all loops are perfectly nested (see Lst. 8b), the HLS tool
posed to loop unswitching, which is a common transforma-
can effectively coalesce the loops into a single pipeline, where
tion [25] on load/store architectures). Whereas instruction-
next iteration of the outer loop can be executed immediately
based architectures attempt to only execute one branch of
after the previous finishes.
a conditional jump (via branch prediction on out-of-order
processors), a conditional in a pipelined scenario will result 1 for (int i = 0; i < N1; ++i) { 1 for (int i = 0; i < N1; ++i) {
in both branches being instantiated in hardware, regardless 2 #pragma PIPELINE 2 #pragma PIPELINE
3 for (int j = 0; j < N0; ++i) 3 for (int j = 0; j < N0; ++i)
of whether/how often it is executed. The transformation of 4 Foo(i, j); 4 Foo(i, j);
coarse-grained control flow into fine-grained control flow is 5 Bar(i); } 5 if (j == N0 - 1) Bar(i); }
implemented by the HLS tool by introducing predication to (a) L1 + N1 · (L0 + N0 −1) cycles. (b) L2 + N0 N1 −1 cycles.
the pipeline, at no significant runtime penalty.
Lst. 7 shows a simple example of how the transformation
Inner state 0 Inner state 1
fuses two pipelined loops in different branches into a single Outer state Single state
loop switching pipeline. The transformation applies to any Listing 8: Before and after coalescing loop nest to avoid inner pipeline drains.
pipelined code in either branch, following the principles
described for pipelined loop fusion (§2.4 and Lst. 5). To perform the transformation in Lst. 8, we had to absorb
The implications of pipelined loop switching are more Bar into the inner loop, adding a loop guard (line 5 in
subtle than the pure fusion examples in Lst. 5, as the total Lst. 8b), analogous to pipelined loop fusion (§2.4), where
number of loop iterations is not affected (assuming the fused the second pipelined “loop” consists of a single iteration.
loop bound is set according to the condition, see line 1 in This contrasts the loop peeling transformation, which is
used by CPU compilers to regularize loops to avoid branch
mispredictions and increasing amenability to vectorization.
1 if (condition) 1 auto N = condition ? N0 : N1;
2 #pragma HLS PIPELINE 2 #pragma HLS PIPELINE While loop peeling can also be beneficial in hardware, e.g.,
3 for (int i = 0; i < N0; ++i) 3 for (int i = 0; i < N; ++i) { to avoid deep conditional logic in a pipeline, small inner
4 y[i] = Foo(x[i]); 4 if (condition)
5 else 5 y[i] = Foo(x[i]); loops can see a significant performance improvement by
6 #pragma HLS PIPELINE 6 else eliminating the draining phase.
7 for (int i = 0; i < N1; ++i) 7 y[i] = Bar(x[i]);
8 y[i] = Bar(x[i]); 8} 2.7 Inlining
(a) Coarse-grained control flow. (b) Control flow absorbed into pipeline. In order to successfully pipeline a scope, all function calls
Listing 7: Pipelined loop switching absorbs coarse-grained control flow. within the code section must be pipelineable. This typically
7
ditional resources consumed for every additional callsite 4 C[i*W + w] = A[i*W + w]*B[i*W + w]; 4 C[i] = A[i] * B[i];
after the first. This replication is done automatically by HLS (a) Using strip-mining. (b) Using partial unrolling.
compilers on demand, but an additional inline pragma Listing 9: Two variants of vectorization by factor W using loop unrolling.
can be specified to directly “paste” the function body into
the callsite during preprocessing, removing the function
boundary during optimization and scheduling. compute jlogic k (e.g., from off-chip memory), according to
Wmax = fBS , where f [cycle/s] is the clock frequency of
3 S CALABILITY T RANSFORMATIONS
the unrolled logic, and S [Byte/operand] is the operand
Parallelism in HLS revolves around the folding of loops, size in bytes. Horizontal unrolling is usually not sufficient to
achieved through unrolling. In Sec. 2.1 we used strip- achieve high logic utilization on large chips, where the avail-
mining and reordering to avoid loop-carried dependencies able memory bandwidth is low compared to the available
by changing the schedule of computations in the pipelined amount of compute logic. Furthermore, because the energy
loop nest. In this section, we similarly strip-mine and re- cost of I/O is orders of magnitude higher than moving data
order loops, but with additional unrolling of the strip-mined on the chip, it is desirable to exploit on-chip memory and
chunks. Pipelined loops constitute the iteration space; the pipeline parallelism instead (this follows in Sec. 3.2 and 3.3).
size of which determines the number of cycles it takes
to execute the program. Unrolled loops, in a pipelined 3.2 Vertical Unrolling
program, correspond to the degree of parallelism in the We can achieve scalable parallelism in HLS without relying
architecture, as every expression in an unrolled statement on external memory bandwidth by exploiting data reuse,
is required to exist as hardware. Parallelizing a code thus distributing input elements to multiple computational units
means turning sequential/pipelined loops fully or partially replicated “vertically” through unrolling [49], [38], [50]. This
into parallel/unrolled loops. This corresponds to folding the is the most potent source of parallelism on hardware architectures,
sequential iteration space, as the number of cycles taken to as it can conceptually scale indefinitely with available silicon
execute the program are effectively reduced by the inverse when enough reuse is possible. Viewed from the paradigm
of the unrolling factor. of cached architectures, the opportunity for this transforma-
tion arises from temporal locality in loops. Vertical unrolling
a b a0 b0 a1 b1 a2 b2 a3 b3 b a1 b a2 b a3 b a0 a1 a2 a3
a0
CU CU
draws on bandwidth from on-chip fast memory by storing
CU CU CU CU CU CU CU CU CU CU CU
b b b b
more elements temporally, combining them with new data
(a) Before. (b) Horizontal unroll. (c) Vertical unroll. (d) Dataflow. streamed in from external memory to increase parallelism,
allowing more computational units to run in parallel at the
Fig. 5: Horizontal unrolling, vertical unrolling, and dataflow, as means to increase
parallelism. Rectangles represent buffer space, such as registers or on-chip RAM. expense of buffer space. In comparison, horizontal unrolling
Horizontal: four independent inputs processed in parallel. Vertical: one input is requires us to widen the data path that passes through the
combined with multiple buffered values. Dataflow: similar to vertical, but input
or partial results are streamed through a pipeline rather than broadcast.
processing elements (compare Fig. 5b and 5c).
When attempting to parallelize a new algorithm, iden-
3.1 Horizontal Unrolling (Vectorization) tifying a source of temporal parallelism to feed vertical
We implement vectorization-style parallelism with HLS by unrolling is essential to whether the design will scale. Pro-
“horizontally” unrolling loops in pipelined sections, or by grammers should consider this carefully before designing
introducing vector types, folding the sequential iteration the hardware architecture. From a reference software code,
space accordingly. This is the most straightforward way of the programmer can identify scenarios where reuse occurs,
adding parallelism, as it can often be applied directly to an then extract and explicitly express the temporal access pattern
inner loop without further reordering or drastic changes to in hardware, using a delay buffering [§2.2] or random-access
the nested loop structure. Vectorization is more powerful [§2.3] buffering scheme. Then, if additional reuse is possible,
in HLS than SIMD operations on load/store architectures, vertically unroll the circuit to scale up performance.
as the unrolled compute units are not required to be ho- As an example, we return to the matrix multiplication
mogeneous, and the number of units are not constrained code from Lst. 1c. In Sec. 2.1.2, we saw that strip-mining
to fixed sizes. Horizontal unrolling increases bandwidth
utilization by explicitly exploiting spatial locality, allowing
more efficient accesses to off-chip memory such as DRAM. 1 for (int n = 0; n < N / P; ++n) { // Folded by unrolling factor P
2 for (int m = 0; m < M / T; ++m) { // Tiling
Lst. 9 shows two functionally equivalent ways of vec- 3 double acc[T][P]; // Is now 2D
torizing a loop over N elements by a horizontal unrolling 4 // ...initialize acc from C...
5 for (int k = 0; k < K; ++k) {
factor of W . Lst. 9a strip-mines a loop into chunks of W 6 double a_buffer[P]; // Buffer multiple elements to combine with
and unrolls the inner loop fully, while Lst. 9b uses partial 7 #pragma PIPELINE // incoming values of B in parallel
8 for (int p = 0; p < P; ++p)
unrolling by specifying an unroll factor in the pragma. As 9 a_buffer[p] = A[n*P + p][k];
a third option, explicit vector types can be used, such as 10 #pragma PIPELINE
11 for (int t = 0; t < T; ++t) // Stream tile of B
those built into OpenCL (e.g., float4 or int16), or custom 12 #pragma UNROLL
vector classes [48]. These provide less flexibility, but are 13 for (int p = 0; p < P; ++p) // P-fold vertical unrolling
more concise and are sufficient for most applications. 14 acc[t][p] += a_buffer[p] * B[k][m*T+t];
15 } /* ...write back 2D tile of C... */ } }
In practice, the unrolling factor W [operand/cycle] is con-
strained by the bandwidth B [Byte/s] available to the Listing 10: P -fold vertical unrolling of matrix multiplication.
8
and reordering loops allowed us to move reads from matrix To see how streaming can be an important tool to express
A out of the inner loop, re-using the loaded value across scalable hardware, we apply it in conjunction with vertical
T different entries of matrix B streamed in while keeping unrolling (Sec. 3.2) to implement an iterative version of the
the element of A in a register. Since every loaded value stencil example from Lst. 4. Unlike the matrix multiplication
of B eventually needs to be combined with all N rows of code, the stencil code has no scalable source of parallelism
A, we realize that we can perform more computations in in the spatial dimension. Instead, we can achieve reuse by
parallel by keeping multiple values of A in local registers. folding the outer time-loop to treat P consecutive timesteps
The result of this transformation is shown in Lst. 10. By in a pipeline parallel fashion, each computed by a distinct
buffering P elements (where P was 1 in Lst. 1c) of A prior PE, connected in a chain via channels [37], [51], [38]. We
to streaming in the tile of B -matrix (lines 8-9), we can fold replace the memory interfaces to the PE with channels, such
the outer loop over rows by a factor of P , using unrolling that the memory read and write become Pop and Push oper-
to multiply parallelism (as well as buffer space required for ations, respectively. The resulting code is shown in Lst. 11a.
the partial sums) by a factor of P (lines 12-14). We then vertically unroll to generate P instances of the PE
(shown in Lst. 11b), effectively increasing the throughput
3.3 Dataflow
of the kernel by a factor of P , and consequently reducing
For complex codes it is common to partition functionality the runtime by folding the outermost loop by a factor of P
into multiple modules, or processing elements (PEs), stream- (line 3 in Lst. 11a). Such architectures are sometimes referred
ing data between them through explicit interfaces. In con- to as systolic arrays [52], [53].
trast to conventional pipelining, PEs arranged in a dataflow For architectures/HLS tools where large fan-out is an is-
architecture are scheduled separately when synthesized by sue for compilation or routing, an already replicated design
the HLS tool. There are multiple benefits to this: can be transformed to a dataflow architecture. For example,
• Different functionality runs at different schedules. For exam- in the matrix multiplication example in Lst. 10, we can move
ple, issuing memory requests, servicing memory requests, the P -fold unroll out of the inner loop, and replicate the
and receiving requested memory can all require different entire PE instead, replacing reads and writes with channel
pipelines, state machines, and even clock rates. accesses [50]. B is then streamed into the first PE, and
• Smaller components are more modular, making them eas- passed downstream every cycle. A and C should no longer
ier to reuse, debug and verify. be accessed by every PE, but rather be handed downstream
• The effort required by the HLS tool to schedule code similar to B , requiring a careful implementation of the start
sections increases dramatically with the number of opera- and drain phases, where the behavior of each PE will vary
tions that need to be considered for the dependency and slightly according to its depth in the sequence.
pipelining analysis. Scheduling logic in smaller chunks is
thus beneficial for compilation time. 3.4 Tiling
• Large fan-out/fan-in is challenging to route on real hard- Loop tiling in HLS is commonly used to fold large problem
ware, (i.e., 1-to-N or N -to-1 connections for large N ). This sizes into manageable chunks that fit into fast on-chip
is mitigated by partitioning components into smaller parts memory, in an already pipelined program [38]. Rather than
and adding more pipeline stages. making the program faster, this lets the already fast archi-
• The fan-in and fan-out of control signals (i.e., stall, reset) tecture support arbitrarily large problem sizes. This is in
within each module is reduced, reducing the risk of these contrast to loop tiling on CPU and GPU, where tiling is used
signals constraining the maximum achievable frequency. to increase performance. Common to both paradigms is that
they fundamentally aim to meet fast memory constraints. As
To move data between PEs, communication channels with
with horizontal and vertical unrolling, tiling relies on strip-
a handshake mechanism are used. These channels double
mining loops to alter the iteration space.
as synchronization points, as they imply a consensus on
Tiling was already shown in Sec. 2.1.2, when the accu-
the program state. In practice, channels are always FIFO
mulation buffer in Lst. 1b was reduced to a tile buffer in
interfaces, and support standard queue operations Push,
Pop, and sometimes Empty, Full, and Size operations. They
1 void PE(FIFO<float> &in, FIFO<float> &out, int T) {
occupy the same register or block memory resources as 2 // ..initialization...
other buffers (Sec. 2.2/Sec. 2.3). 3 for (int t = 0; t < T / P; ++t) // Fold timesteps T by factor P
4 #pragma PIPELINE
The mapping from source code to PEs differs between 5 for (/* loops over spatial dimensions */) {
HLS tools, but is manifested when functions are connected 6 auto south = in.Pop(); // Value for t-1 from previous PE
using channels. In the following example, we will use the 7 // ...load values from delay buffers...
8 auto next = 0.25*(north + west + east + south);
syntax from Xilinx Vivado HLS to instantiate PEs, where 9 out.Push(next); }} // Value for t sent to PE computing t+1
each non-inlined function correspond to a PE, and these (a) Processing element for a single timestep. Will be replicated P times.
are connected by channels that are passed as arguments
1 #pragma DATAFLOW // Schedule nested functions as parallel modules
to the functions from a top-level entry function. Note that 2 void SystolicStencil(const float in[], float out[], int T) {
this functionally diverges from C++ semantics without 3 FIFO<float> pipes[P + 1]; // Assume P is given at compile time
4 ReadMemory(in, pipes[0]); // Head
additional abstraction [48], as each function in the dataflow 5 #pragma UNROLL // Replicate PEs
scope is executed in parallel in hardware, rather than in the 6 for (int p = 0; p < P; ++p)
sequence specified in the imperative code. In Intel OpenCL, 7 PE(pipe[p], pipe[p + 1], T); // Forms a chain
8 WriteMemory(pipes[P], out); } // Tail
dataflow semantics are instead expressed with multiple
(b) Instantiate and connect P consecutive and parallel PEs.
kernel functions each defining a PE, which are connected by
global channel objects prefixed with the channel keyword. Listing 11: Dataflow between replicated PEs to compute P timesteps in parallel.
9
Lst. 1c, such that the required buffer space used for partial one cache line. If we instead read the two sections of A
results became a constant, rather than being dependent on sequentially (or in larger chunks), the HLS tool can infer
the input size. This transformation is also relevant to the two bursts accesses to A of length N/2, shown in Lst. 12c.
stencil codes in Lst. 4, where it can be used similarly to Since the schedules of memory and computational modules
restrict the size of the line buffers or shift register, so they a are independent, ReadA can run ahead of PE, ensuring that
no longer proportional to the problem size. memory is always read at the maximum bandwidth of the
interface (Sec. 4.2 and Sec. 4.3 will cover how to increase this
4 M EMORY ACCESS T RANSFORMATIONS bandwidth). From the point of view of the computational
When an HLS design has been pipelined, scheduled, and PE, both A0 and A1 are read in parallel, as shown on
unrolled as desired, the memory access pattern has been line 5 in Lst. 12b, hiding initialization time and inconsistent
established. In the following, we describe transformations memory producers in the synchronization implied by the
that optimize the efficiency of off-chip memory accesses in data streams.
the HLS code. For memory bound codes in particular, this is An important use case of memory extraction appears in
critical for performance after the design has been pipelined. the stencil code in Lst. 11, where it is necessary to separate
the memory accesses such that the PEs are agnostic of
4.1 Memory Access Extraction whether data is produced/consumed by a neighboring PE
By extracting accesses to external memory from the compu- or by a memory module. Memory access extraction is also
tational logic, we enable compute and memory accesses to useful for performing data layout transformations in fast
be pipelined and optimized separately. Accessing the same on-chip memory. For example, we can change the schedule
interface multiple times within the same pipelined section of reads from A in Lst. 10 to a more efficient scheme by
is a common cause for poor memory bandwidth utilization buffering values in on-chip memory, while streaming them
and increased initiation interval due to interface contention, to the kernel according to the original schedule.
since the interface can only service a single request per
cycle. In the Intel OpenCL flow, memory extraction is done 4.2 Memory Buffering
automatically by the tool, but since this process must be When dealing with memory interfaces with an inconsistent
conservative due to limited information, it is often still data rate, such as DRAM, it can be beneficial to request
beneficial to do the extraction explicitly in the code [54]. In and buffer accesses earlier and/or at a more aggressive pace
many cases, such as for independent reads, this is not an in- than what is consumed or produced by the computational
herent memory bandwidth or latency constraint, but arises elements. For memory reads, this can be done by reading
from the tool scheduling iterations according to program ahead of the kernel into a deep buffer instantiated between
order. This can be relaxed when allowed by inter-iteration memory and computations, by either 1) accessing wider vec-
dependencies (which can in many cases be determined tors from memory than required by the kernel, narrowing or
automatically, e.g., using polyhedral analysis [55]). widening data paths (aka. “gearboxing”) when piping to or
In Lst. 12a, the same memory (i.e., hardware memory from computational elements, respectively, or 2) increasing
interface) is accessed twice in the inner loop. In the worst the clock rate of modules accessing memory with respect to
case, the program will issue two 4 Byte memory requests the computational elements.
every iteration, resulting in poor memory performance, and The memory access function Lst. 12c allows long bursts
preventing pipelining of the loop. In software, this problem to the interface of A, but receives the data on a narrow bus
is typically mitigated by caches, always fetching at least at W · Sint = (1 · 4) Byte/cycle. In general, this limits the
bandwidth consumption to f ·W Sint at frequency f , which is
likely to be less than what the external memory can provide.
1 void PE(const int A[N], int B[N/2]) { 1 elem./burst To better exploit available bandwidth, we can either read
#pragma PIPELINE // Achieves I=2
DRAM
2 1 burst A[i]
Compute
respective interfaces, pushing to FIFO buffers that are read CPU-Oriented Transformations and how they apply to HLS codes.
in parallel and combined by another module (for writing: in Loop interchange [57], [47] is used to resolve loop-carried dependencies [§2].
Strip-mining [58], loop tiling [59], [47], and cycle shrinking [60] are central compo-
reverse), exposing a single data stream to the computational nents of many HLS transformations [§2.1, §3.1, §3.2, §2.1.2].
kernel. This is illustrated in Fig. 6, where the unlabeled Loop distribution and loop fission [61], [47] are used to separate differently scheduled
DDR1
DDR2
DDR1
DDR2
DDR0
DDR3
DDR0
function calls.
moved to a type that is natively supported by the target I/O format compilation: No I/O supported directly in HLS.
Supercompiling: is infeasible for HLS due to long synthesis times.
architecture, such as single precision floating point on Loop pushing/embedding: Inlining completely is favored to allow pipelining.
Intel’s Arria 10 and Stratix 10 devices [56]. Automatic decomposition and alignment, scalar privatization, array privatization,
cache alignment, and false sharing are not relevant for HLS, as there is no (implicit)
• Bandwidth bound architectures, where performance can cache coherency protocol in hardware.
be improved by up to the same factor that the size of the Procedure call parallelization and split do not apply, as there are no forks in hardware.
Graph partitioning only applies to explicit dataflow languages.
data type can be reduced by. There are no instruction sets in hardware, so VLIW transformations do not apply.
• Latency bound architectures where the data type can be TABLE 2: The relation of traditional CPU-oriented transformations to HLS codes.
reduced to a lower latency operation, e.g., from floating
point to integer. It is interesting to note that the majority of well-known
In the most extreme case, it has been shown that collapsing transformations from software apply to HLS. This implies
the data type of weights and activations in deep neural that we can leverage much of decades of research into high-
networks to binary [34] can provide sufficient speedup for performance computing transformations to also optimize
inference that the increased number of weights makes up hardware programs, including many that can be applied
for the loss of precision per weight. directly (i.e., without further adaptation to HLS) to the im-
perative source code or intermediate representation before
5 S OFTWARE T RANSFORMATIONS IN HLS synthesizing for hardware. We stress the importance of sup-
In addition to the transformations described in the sections port for these pre-hardware generation transformations in
above, we include an overview of how well-known CPU- HLS compilers, as they lay the foundation for the hardware-
oriented transformations apply to HLS, based on the com- specific transformations proposed here.
piler transformations compiled by Bacon et al. [25]. These
transformations are included in Tab. 2, and are partitioned 6 E ND - TO -E ND E XAMPLES
into three categories: To showcase the transformations presented in this work and
provide a “hands-on” opportunity for seeing HLS optimiza-
• Transformations directly relevant to the HLS transforma-
tions applied in practice, we will describe the optimization
tions already presented here.
process on a sample set of classical HPC kernels, available
• Transformations that are the same or similar to their
as open source repositories on github1 . These kernels are
software counterparts.
• Transformations with little or no relevance to HLS. 1. https://github.com/spcl?q=hls
11
written in C++ for Xilinx Vivado HLS [12] with hlslib [48] [GOp/s] Stencil Matrix Multiplication N-Body
extensions, and are built and run using the Xilinx Vitis envi- 103 (25×/18270×)409.3 (52×/29950×)497.0 (42×/167×)270.7
ronment. For each example, we will describe the sequence of (14×/720×) 16.1 (16×/578×) 9.6 (4×) 6.4
101 1.6
transformations applied, and give the resulting performance (53×) 1.2 (36×) 0.6
at each major stage. 10−1 <0.1 <0.1
The included benchmarks were run on an Alveo 10 −3
Nai Pip Vec S Nai Pip Vec S I P S
U250 board, which houses a Xilinx UltraScale+ XCU250- ve elin tor ystoli ve elin tor ystoli nitial ipelin ystoli
ed ized c ed ized c ed c
FIGD2104-2L-E FPGA and four 2400 MT/s DDR4 banks (we
Fig. 7: Performance progression of kernels when applying transformations. Paren-
utilize 1-2 banks for the examples here). The chip consists theses show speedup over previous version, and cumulative speedup.
of four almost identical chiplets with limited interconnect [Utilization] LUTs DSPs BRAM
between them, where each chiplet is connected to one of 100%
Stencil Matrix Multiplication N-Body
the DDR4 pinouts. This multi-chiplet design allows more 10%
resources (1728K LUTs and 12,288 DSPs), but poses chal- 1%
lenges for the routing process, which impedes the achiev-
0.1%
able clock rate and resource utilization for a monolithic ker-
nel attempting to span the full chip. Kernels were compiled 0.01%
Nai Pip Vec S Nai Pip Vec S Init Pip S
for the xilinx u250 xdma 201830 2 shell with Vitis 2019.2 ve elin tor ystoli ve elin tor ystoli ial elin ystoli
ed ized c ed ized c ed c
and executed with version 2.3.1301 of the Xilinx Runtime Fig. 8: Resource usage of kernels from Fig. 7 as fractions of available resources.
(XRT). All benchmarks are included in Fig. 7, and the The maxima are taken as 1728K LUTs, 12,288 DSPs, and 2688 BRAM.
into P parallel processing element arranged in a systolic Lee et al. [90] present an OpenACC to OpenCL com-
array. Each element holds T resident particles, and parti- piler, using Intel OpenCL as a backend. The authors im-
cles are streamed [§3.3] through the PEs. plement horizontal and vertical unrolling, pipelining and
The second stage gains a factor of 4× corresponding to the dataflow by introducing new OpenACC clauses. Papakon-
latency of the interleaved accumulation, followed by a factor stantinou et al. [91] generate HLS code for FPGA from
of 42× from unrolling units across the chip. T ≥L+ can be directive-annotated CUDA code.
used to regulate the arithmetic intensity of the kernel. The Optimizing HLS compilers. Mainstream HLS compil-
bandwidth requirements can be reduced further by storing ers automatically apply many of the well-known software
more resident particles on the chip, scaling up to the full transformations in Tab. 2 [22], [92], [93], but can also employ
fast memory usage of the FPGA. The tiled accumulation in- more advanced FPGA transformations. Intel OpenCL [19]
terleaving transformation thus enables not just pipelining of performs memory access extraction into “load store units”
the compute, but also minimization of I/O. The optimized (LSUs), does memory striping between DRAM banks, and
implementation is available on github4 . detects and auto-resolves some buffering and accumulation
These examples demonstrate the impact of different patterns. The proprietary Merlin Compiler [94] uses high-
transformations on a reconfigurable hardware platform. In level acceleration directives to automatically perform some
particular, enabling pipelining, regularizing memory ac- of the transformations described here, as source-to-source
cesses, and vertical unrolling are shown to be central com- transformations to underlying HLS code. Polyhedral compi-
ponents of scalable hardware architectures. The dramatic lation is a popular framework for optimizing CPU and GPU
speedups over naive codes also emphasize that HLS tools do loop nests [55], and has also been applied to HLS for FPGA
not yield competitive performance out of the box, making it for optimizing data reuse [95]. Such techniques may prove
critical to perform further transformations. For additional valuable in automating, e.g., memory extraction and tiling
examples of optimizing HLS codes, we refer to the numer- transformations. While most HLS compilers rely strictly
ous works applying HLS optimizations listed below. on static scheduling, Dynamatic [68] considers dynamically
scheduling state machines and pipelines to allow reducing
7 R ELATED W ORK the number of stages executed at runtime.
Optimized applications. Much work has been done in
Domain-specific frameworks. Implementing programs
optimizing C/C++/OpenCL HLS codes for FPGA, such as
in domain specific languages (DSLs) can make it easier
stencils [38], [39], [40], [74], [75], deep neural networks [76],
to detect and exploit opportunities for advanced trans-
[77], [35], [36], [34], matrix multiplication [78], [75], [50], [79],
formations. Darkroom [30] generates optimized HDL for
graph processing [80], [81], networking [82], light propaga-
image processing codes, and the popular image process-
tion for cancer treatment [46], and protein sequencing [49],
ing framework Halide [31] has been extended to support
[83]. These works optimize the respective applications using
FPGAs [96], [97]. Luzhou et al. [53] and StencilFlow [44]
transformations described here, such as delay buffering,
propose frameworks for generating stencil codes for FP-
random access buffering, vectorization, vertical unrolling,
GAs. These frameworks rely on optimizations such as delay
tiling for on-chip memory, and dataflow.
buffering, dataflow, and vertical unrolling, which we cover
Transformations. Zohouri et al. [84] use the Rodinia
here. Using DSLs to compile to structured HLS code can
benchmark to evaluate the performance of OpenCL codes
be a viable approach to automating a wide range of trans-
targeting FPGAs, employing optimizations such as SIMD
formations, as proposed by Koeplinger et al. [98], and the
vectorization, sliding-window buffering, accumulation in-
FROST [99] DSL framework.
terleaving, and compute unit replication across multiple
Other approaches. There are other approaches than
kernels. We present a generalized description of a superset
C/C++/OpenCL-based HLS languages to addressing the
of these transformations, along with concrete code examples
productivity issues of hardware design: Chisel/FIR-
that show how they are applied in practice. The DaCe frame-
RTL [100], [101] maintains the paradigm of behavioral pro-
work [85] exploits information on explicit dataflow and
gramming known from RTL, but provides modern language
control flow to perform a wide range of transformations,
and compiler features. This caters to developers who are
and code generates efficient HLS code using vendor-specific
already familiar with hardware design, but wish to use a
pragmas and primitives. Kastner et al. [86] go through the
more expressive language. In the Maxeler ecosystem [102],
implementation of many HLS codes in Vivado HLS, focus-
kernels are described using a Java-based language, but
ing on algorithmic optimizations. da Silva et al. [87] explore
rather than transforming imperative code into a behavioral
using modern C++ features to capture HLS concepts in a
equivalent, the language provides a DSL of hardware con-
high-level fashion. Lloyd et al. [88] describe optimizations
cepts that are instantiated using object-oriented interfaces.
specific to Intel OpenCL, and include a variant of memory
By constraining the input, this encourages developers to
access extraction, as well as the single-loop accumulation
write code that maps well to hardware, but requires learning
variant of accumulation interleaving.
a new language exclusive to the Maxeler ecosystem.
Directive-based frameworks. High-level, directive-based
frameworks such as OpenMP and OpenACC have been 8 TOOLFLOW OF X ILINX VS . I NTEL
proposed as alternative abstractions for generating FPGA
When choosing a toolflow to start designing hardware with
kernels. Leow et al. [89] implement an FPGA code gen-
HLS, it is useful to understand two distinct approaches
erator from OpenMP pragmas, primarily focusing on cor-
by the two major vendors: Intel OpenCL wishes to en-
rectness in implementing a range of OpenMP pragmas.
able writing accelerators using software, making an effort to
4. https://github.com/spcl/nbody hls abstract away low-level details about the hardware, and
13
present a high-level view to the programmer; whereas Xil- [6] D. B. Thomas et al., “A comparison of CPUs, GPUs, FPGAs,
inx’ Vivado HLS provides a more productive way of writing and massively parallel processor arrays for random number
generation,” in FPGA, 2009.
hardware, by means of a familiar software language. Xilinx [7] D. Bacon et al., “FPGA programming for the masses,” CACM,
uses OpenCL as a vehicle to interface between FPGA and 2013.
host, but implements the OpenCL compiler itself as a thin [8] G. Martin and G. Smith, “High-level synthesis: Past, present, and
wrapper around the C++ compiler, whereas Intel embraces future,” D&T, 2009.
[9] J. Cong et al., “High-level synthesis for FPGAs: From prototyping
the OpenCL paradigm as their frontend (although they to deployment,” TCAD, 2011.
encourage writing single work item kernels [103], effectively [10] R. Nane et al., “A survey and evaluation of FPGA high-level
preventing reuse of OpenCL kernels written for GPU). synthesis tools,” TCAD, 2016.
Vivado HLS has a stronger coupling between the HLS [11] W. Meeus et al., “An overview of today’s high-level synthesis
tools,” DAEM, 2012.
source code and the generated hardware. This requires [12] Z. Zhang et al., “AutoPilot: A platform-based ESL synthesis
the programmer to write more annotations and boilerplate system,” in High-Level Synthesis, 2008.
code, but can also give them stronger feeling of control. [13] Intel High-Level Synthesis (HLS) Compiler. https://www.
intel.com/content/www/us/en/software/programmable/
Conversely, the Intel OpenCL compiler presents convenient quartus-prime/hls-compiler.html. Accessed May 15, 2020.
abstracted views, saves boilerplate code (e.g., by automat- [14] A. Canis et al., “LegUp: High-level synthesis for FPGA-based
ically pipelining sections), and implements efficient substi- processor/accelerator systems,” in FPGA, 2011.
tutions by detecting common patterns in the source code [15] Mentor Graphics. Catapult high-level synthesis. https:
//www.mentor.com/hls-lp/catapult-high-level-synthesis/
(e.g., to automatically perform memory extraction [§4.1]). c-systemc-hls. Accessed May 15, 2020.
The downside is that developers end up struggling to write [16] C. Pilato et al., “Bambu: A modular framework for the high level
or generate code in a way that is recognized by the tool’s synthesis of memory-intensive applications,” in FPL, 2013.
“black magic”, in order to achieve the desired result. Finally, [17] R. Nane et al., “DWARV 2.0: A CoSy-based C-to-VHDL hardware
compiler,” in FPL, 2012.
Xilinx’ choice to allow C++ gives Vivado HLS an edge in [18] M. Owaida et al., “Synthesis of platform architectures from
expressibility, as (non-virtual) objects and templating turns OpenCL programs,” in FCCM, 2011.
out to be a useful tool for abstracting and extending the [19] T. Czajkowski et al., “From OpenCL to high-performance hard-
language [48]. Intel offers a C++-based HLS compiler, but ware on FPGAs,” in FPL, 2012.
[20] R. Nikhil, “Bluespec system Verilog: efficient, correct RTL from
does not (as of writing) support direct interoperability with high level specifications,” in MEMOCODE, 2004.
the OpenCL-driven accelerator flow. [21] J. Auerbach et al., “Lime: A Java-compatible and synthesizable
language for heterogeneous architectures,” in OOPSLA, 2010.
9 CONCLUSION

The transformations known from software are insufficient to optimize HPC kernels targeting spatial computing systems. We have proposed a new set of optimizing transformations that enable efficient and scalable hardware architectures, and can be applied directly to the source code by a performance engineer, or automatically by an optimizing compiler. Performance and compiler engineers can benefit from these guidelines, transformations, and the presented cheat sheet as a common toolbox for developing high-performance hardware using HLS.

ACKNOWLEDGEMENTS

This work was supported by the European Research Council under the European Union's Horizon 2020 programme (grant agreement DAPP, No. 678880). The authors wish to thank Xilinx and Intel for helpful discussions; Xilinx for generous donations of software, hardware, and access to the Xilinx Adaptive Compute Cluster (XACC) at ETH Zurich; the Swiss National Supercomputing Center (CSCS) for providing computing infrastructure; and Tal Ben-Nun for valuable feedback on iterations of this manuscript.
REFERENCES

[1] W. A. Wulf and S. A. McKee, "Hitting the memory wall: Implications of the obvious," SIGARCH, 1995.
[2] M. Horowitz, "Computing's energy problem (and what we can do about it)," in ISSCC, 2014.
[3] D. D. Gajski et al., "A second opinion on data flow machines and languages," Computer, 1982.
[4] S. Sirowy and A. Forin, "Where's the beef? Why FPGAs are so fast," MS Research, 2008.
[5] A. R. Brodtkorb et al., "State-of-the-art in heterogeneous computing," Scientific Programming, 2010.
[6] D. B. Thomas et al., "A comparison of CPUs, GPUs, FPGAs, and massively parallel processor arrays for random number generation," in FPGA, 2009.
[7] D. Bacon et al., "FPGA programming for the masses," CACM, 2013.
[8] G. Martin and G. Smith, "High-level synthesis: Past, present, and future," D&T, 2009.
[9] J. Cong et al., "High-level synthesis for FPGAs: From prototyping to deployment," TCAD, 2011.
[10] R. Nane et al., "A survey and evaluation of FPGA high-level synthesis tools," TCAD, 2016.
[11] W. Meeus et al., "An overview of today's high-level synthesis tools," DAEM, 2012.
[12] Z. Zhang et al., "AutoPilot: A platform-based ESL synthesis system," in High-Level Synthesis, 2008.
[13] Intel High-Level Synthesis (HLS) Compiler. https://www.intel.com/content/www/us/en/software/programmable/quartus-prime/hls-compiler.html. Accessed May 15, 2020.
[14] A. Canis et al., "LegUp: High-level synthesis for FPGA-based processor/accelerator systems," in FPGA, 2011.
[15] Mentor Graphics. Catapult High-Level Synthesis. https://www.mentor.com/hls-lp/catapult-high-level-synthesis/c-systemc-hls. Accessed May 15, 2020.
[16] C. Pilato et al., "Bambu: A modular framework for the high level synthesis of memory-intensive applications," in FPL, 2013.
[17] R. Nane et al., "DWARV 2.0: A CoSy-based C-to-VHDL hardware compiler," in FPL, 2012.
[18] M. Owaida et al., "Synthesis of platform architectures from OpenCL programs," in FCCM, 2011.
[19] T. Czajkowski et al., "From OpenCL to high-performance hardware on FPGAs," in FPL, 2012.
[20] R. Nikhil, "Bluespec System Verilog: Efficient, correct RTL from high level specifications," in MEMOCODE, 2004.
[21] J. Auerbach et al., "Lime: A Java-compatible and synthesizable language for heterogeneous architectures," in OOPSLA, 2010.
[22] ——, "A compiler and runtime for heterogeneous computing," in DAC, 2012.
[23] J. Hammarberg and S. Nadjm-Tehrani, "Development of safety-critical reconfigurable hardware with Esterel," FMICS, 2003.
[24] M. B. Gokhale et al., "Stream-oriented FPGA computing in the Streams-C high level language," in FCCM, 2000.
[25] D. F. Bacon et al., "Compiler transformations for high-performance computing," CSUR, 1994.
[26] S. Ryoo et al., "Optimization principles and application performance evaluation of a multithreaded GPU using CUDA," in PPoPP, 2008.
[27] G. D. Smith, Numerical Solution of Partial Differential Equations: Finite Difference Methods, 1985.
[28] A. Taflove and S. C. Hagness, Computational Electrodynamics: The Finite-Difference Time-Domain Method, 1995.
[29] C. A. Fletcher, Computational Techniques for Fluid Dynamics 2, 1988.
[30] J. Hegarty et al., "Darkroom: Compiling high-level image processing code into hardware pipelines," TOG, 2014.
[31] J. Ragan-Kelley et al., "Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines," in PLDI, 2013.
[32] T. Ben-Nun and T. Hoefler, "Demystifying parallel and distributed deep learning: An in-depth concurrency analysis," CSUR, 2019.
[33] G. Lacey et al., "Deep learning on FPGAs: Past, present, and future," arXiv:1602.04283, 2016.
[34] M. Courbariaux et al., "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1," arXiv:1602.02830, 2016.
[35] Y. Umuroglu et al., "FINN: A framework for fast, scalable binarized neural network inference," in FPGA, 2017.
[36] M. Blott et al., "FINN-R: An end-to-end deep-learning framework for fast exploration of quantized neural networks," TRETS, 2018.
[37] H. Fu and R. G. Clapp, "Eliminating the memory bottleneck: An FPGA-based solution for 3D reverse time migration," in FPGA, 2011.
[38] H. R. Zohouri et al., "Combined spatial and temporal blocking for high-performance stencil computation on FPGAs using OpenCL," in FPGA, 2018.
[39] H. M. Waidyasooriya et al., "OpenCL-based FPGA-platform for stencil computation and its optimization methodology," TPDS, May 2017.
[40] Q. Jia and H. Zhou, "Tuning stencil codes in OpenCL for FPGAs," in ICCD, 2016.
[41] X. Niu et al., "Exploiting run-time reconfiguration in stencil computation," in FPL, 2012.
[42] ——, "Dynamic stencil: Effective exploitation of run-time resources in reconfigurable clusters," in FPT, 2013.
[43] J. Fowers et al., "A performance and energy comparison of FPGAs, GPUs, and multicores for sliding-window applications," in FPGA, 2012.
[44] J. de Fine Licht et al., "StencilFlow: Mapping large stencil programs to distributed spatial computing systems," in CGO, 2021.
[45] X. Chen et al., "On-the-fly parallel data shuffling for graph processing on OpenCL-based FPGAs," in FPL, 2019.
[46] T. Young-Schultz et al., "Using OpenCL to enable software-like development of an FPGA-accelerated biophotonic cancer treatment simulator," in FPGA, 2020.
[47] D. J. Kuck et al., "Dependence graphs and compiler optimizations," in POPL, 1981.
[48] J. de Fine Licht and T. Hoefler, "hlslib: Software engineering for hardware design," arXiv:1910.04436, 2019.
[49] S. O. Settle, "High-performance dynamic programming on FPGAs with OpenCL," in HPEC, 2013.
[50] J. de Fine Licht et al., "Flexible communication avoiding matrix multiplication on FPGA with high-level synthesis," in FPGA, 2020.
[51] K. Sano et al., "Multi-FPGA accelerator for scalable stencil computation with constant memory bandwidth," TPDS, 2014.
[52] H. Kung and C. E. Leiserson, "Systolic arrays (for VLSI)," in Sparse Matrix Proceedings, 1978.
[53] W. Luzhou et al., "Domain-specific language and compiler for stencil computation on FPGA-based systolic computational-memory array," in ARC, 2012.
[54] T. Kenter et al., "OpenCL-based FPGA design to accelerate the nodal discontinuous Galerkin method for unstructured meshes," in FCCM, 2018.
[55] T. Grosser et al., "Polly – performing polyhedral optimizations on a low-level intermediate representation," PPL, 2012.
[56] U. Sinha, "Enabling impactful DSP designs on FPGAs with hardened floating-point implementation," Altera White Paper, 2014.
[57] J. R. Allen and K. Kennedy, "Automatic loop interchange," in SIGPLAN, 1984.
[58] M. Weiss, "Strip mining on SIMD architectures," in ICS, 1991.
[59] M. D. Lam et al., "The cache performance and optimizations of blocked algorithms," 1991.
[60] C. D. Polychronopoulos, "Advanced loop optimizations for parallel computers," in ICS, 1988.
[61] D. J. Kuck, "A survey of parallel machine organization and programming," CSUR, Mar. 1977.
[62] A. P. Yershov, "ALPHA – an automatic programming system of high efficiency," J. ACM, 1966.
[63] M. J. Wolfe, "Optimizing supercompilers for supercomputers," Ph.D. dissertation, 1982.
[64] J. J. Dongarra and A. R. Hinds, "Unrolling loops in Fortran," Software: Practice and Experience, 1979.
[65] M. Lam, "Software pipelining: An effective scheduling technique for VLIW machines," in PLDI, 1988.
[66] C. D. Polychronopoulos, "Loop coalescing: A compiler transformation for parallel machines," Tech. Rep., 1987.
[67] F. E. Allen and J. Cocke, A Catalogue of Optimizing Transformations, 1971.
[68] L. Josipović et al., "Dynamically scheduled high-level synthesis," in FPGA, 2018.
[69] J. Cocke and K. Kennedy, "An algorithm for reduction of operator strength," CACM, 1977.
[70] R. Bernstein, "Multiplication by integer constants," Softw. Pract. Exper., 1986.
[71] G. L. Steele, "Arithmetic shifting considered harmful," ACM SIGPLAN Notices, 1977.
[72] A. V. Aho et al., "Compilers, principles, techniques," Addison Wesley, 1986.
[73] T. De Matteis et al., "Streaming message interface: High-performance distributed memory programming on reconfigurable hardware," in SC, 2019.
[74] D. Weller et al., "Energy efficient scientific computing on FPGAs using OpenCL," in FPGA, 2017.
[75] A. Verma et al., "Accelerating workloads on FPGAs via OpenCL: A case study with OpenDwarfs," Tech. Rep., 2016.
[76] N. Suda et al., "Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks," in FPGA, 2016.
[77] J. Zhang and J. Li, "Improving the performance of OpenCL-based FPGA accelerator for convolutional neural network," in FPGA, 2017.
[78] E. H. D'Hollander, "High-level synthesis optimization for blocked floating-point matrix multiplication," SIGARCH, 2017.
[79] P. Gorlani et al., "OpenCL implementation of Cannon's matrix multiplication algorithm on Intel Stratix 10 FPGAs," in ICFPT, 2019.
[80] M. Besta et al., "Graph processing on FPGAs: Taxonomy, survey, challenges," arXiv:1903.06697, 2019.
[81] ——, "Substream-centric maximum matchings on FPGA," in FPGA, 2019.
[82] H. Eran et al., "Design patterns for code reuse in HLS packet processing pipelines," in FCCM, 2019.
[83] E. Rucci et al., "Smith-Waterman protein search with OpenCL on an FPGA," in Trustcom/BigDataSE/ISPA, 2015.
[84] H. R. Zohouri et al., "Evaluating and optimizing OpenCL kernels for high performance computing with FPGAs," in SC, 2016.
[85] T. Ben-Nun et al., "Stateful dataflow multigraphs: A data-centric model for performance portability on heterogeneous architectures," in SC, 2019.
[86] R. Kastner et al., "Parallel programming for FPGAs," arXiv:1805.03648, 2018.
[87] J. S. da Silva et al., "Module-per-Object: A human-driven methodology for C++-based high-level synthesis design," in FCCM, 2019.
[88] T. Lloyd et al., "A case for better integration of host and target compilation when using OpenCL for FPGAs," in FSP, 2017.
[89] Y. Y. Leow et al., "Generating hardware from OpenMP programs," in FPT, 2006.
[90] S. Lee et al., "OpenACC to FPGA: A framework for directive-based high-performance reconfigurable computing," in IPDPS, 2016.
[91] A. Papakonstantinou et al., "FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs," in SASP, 2009.
[92] S. Gupta et al., "SPARK: A high-level synthesis framework for applying parallelizing compiler transformations," in VLSID, 2003.
[93] ——, "Coordinated parallelizing compiler optimizations and high-level synthesis," TODAES, 2004.
[94] J. Cong et al., "Source-to-source optimization for HLS," in FPGAs for Software Programmers, 2016.
[95] L.-N. Pouchet et al., "Polyhedral-based data reuse optimization for configurable computing," in FPGA, 2013.
[96] J. Pu et al., "Programming heterogeneous systems from an image processing DSL," TACO, 2017.
[97] J. Li et al., "HeteroHalide: From image processing DSL to efficient FPGA acceleration," in FPGA, 2020.
[98] D. Koeplinger et al., "Automatic generation of efficient accelerators for reconfigurable hardware," in ISCA, 2016.
[99] E. D. Sozzo et al., "A common backend for hardware acceleration on FPGA," in ICCD, 2017.
[100] J. Bachrach et al., "Chisel: Constructing hardware in a Scala embedded language," in DAC, 2012.
[101] A. Izraelevitz et al., "Reusability is FIRRTL ground: Hardware construction languages, compiler frameworks, and transformations," in ICCAD, 2017.
[102] Maxeler Technologies, "Programming MPC systems (white paper)," 2013.
[103] Intel FPGA SDK for OpenCL Pro Edition Best Practices Guide, UG-OCL003, revision 2020.04.1. Accessed May 15, 2020.

Johannes de Fine Licht is a PhD student at ETH Zurich. His research topics revolve around spatial computing systems in HPC, and include programming models, applications, libraries, and enhancing programmer productivity.

Maciej Besta is a PhD student at ETH Zurich. His research focuses on understanding and accelerating large-scale irregular graph processing in any type of setting and workload.

Simon Meierhans is studying for his MSc degree at ETH Zurich. His interests include randomized and deterministic algorithm and data structure design.

Torsten Hoefler is a professor at ETH Zurich, where he leads the Scalable Parallel Computing Lab. His research aims at understanding performance of parallel computing systems ranging from parallel computer architecture through parallel programming to parallel algorithms.