Transformations of High-Level Synthesis Codes for High-Performance Computing

Johannes de Fine Licht, Maciej Besta, Simon Meierhans, Torsten Hoefler
Department of Computer Science, ETH Zurich
Abstract—Spatial computing architectures promise a major stride in performance and energy efficiency over the traditional load/store
devices currently employed in large scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C++
and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider
audience to target spatial computing architectures, the optimization principles known from traditional software design are no longer
sufficient to implement high-performance codes, due to fundamentally distinct aspects of hardware design, such as programming for
deep pipelines, distributed memory resources, and scalable routing. To alleviate this, we present a collection of optimizing
transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. We
systematically identify classes of transformations (pipelining, scalability, and memory), the characteristics of their effect on the HLS
code and the resulting hardware (e.g., increasing data reuse or resource consumption), and the objectives that each transformation
can target (e.g., resolve interface contention, or increase parallelism). We show how these can be used to efficiently exploit pipelining,
on-chip distributed fast memory, and on-chip dataflow, allowing for massively parallel architectures. To quantify the effect of various
transformations, we cover the optimization process of a sample set of HPC kernels, provided as open source reference codes. We aim
to establish a common toolbox to guide both performance engineers and compiler engineers in tapping into the performance potential
offered by spatial computing architectures using HLS.

1 Introduction

Since the end of Dennard scaling, when the power consumption of digital circuits stopped scaling with their size, compute devices have become increasingly limited by their power consumption [1]. In fact, shrinking the feature size even increases the loss in the metal layers of modern microchips. Today's load/store architectures suffer mostly from the cost of data movement and of addressing general purpose registers and cache [2]. Other approaches, such as dataflow architectures, have not been widely successful, due to the varying granularity of applications [3]. However, application-specific dataflow can be laid out as registers and on-chip memory to fit the specific structure of the computation, and thereby minimize data movement.

Reconfigurable architectures, such as FPGAs, can be used to implement application-specific dataflow [4], [5], [6], but are hard to program [7], as traditional hardware design languages, such as VHDL or Verilog, do not benefit from the rich set of software engineering techniques that improve programmer productivity and code reliability. For these reasons, both hardware and software communities are embracing high-level synthesis (HLS) tools [8], [9], enabling hardware development using procedural languages.

HLS bridges the gap between hardware and software development, and enables basic performance portability implemented in the compilation system. For example, HLS programmers do not have to worry about how exactly a floating point operation, a bus protocol, or a DRAM controller is implemented on the target hardware. Numerous HLS systems [10], [11] synthesize hardware designs from C/C++ [12], [13], [14], [15], [16], [17], OpenCL [18], [19], and other high-level languages [20], [21], [22], [23], [24], providing a viable path for software and hardware communities to meet and address each other's concerns.

For many applications, computational performance is a primary goal, achieved through careful tuning by specialized performance engineers using well-understood optimizing transformations when targeting CPU [25] and GPU [26] architectures. For HLS, a comparable collection of guidelines and principles for code optimization is yet to be established. Optimizing codes for hardware is drastically different from optimizing codes for software. In fact, the optimization space is larger, as it contains most known software optimizations, in addition to HLS-specific transformations that let programmers manipulate the underlying hardware architecture. To make matters worse, the low clock frequency, lack of cache, and fine-grained configurability mean that naive HLS codes typically perform poorly compared to naive software codes, and must be transformed considerably before the advantages of specialized hardware can be exploited. Thus, the established set of traditional transformations is insufficient, as it does not consider aspects of optimized hardware design, such as pipelining and decentralized fast memory.

In this work, we survey and define a set of key transformations that optimizing compilers or performance engineers can apply to improve the performance of hardware layouts generated from HLS codes. This set combines transformations extracted from previous work, where they were applied either explicitly or implicitly, with additional techniques that fill in the gaps to maximize completeness. We characterize and categorize the transformations, allowing performance engineers to easily look up those relevant to improving their HLS code, based on the problems and bottlenecks currently present. The transformations have been verified to apply to both the Intel OpenCL and Xilinx Vivado HLS toolflows, but are expected to translate to any pragma-based imperative HLS tool.
In addition to identifying previous work that applies one or more of the transformations defined here, we describe and publish a set of end-to-end "hands-on" examples, optimized from naive HLS codes into high-performance implementations. This includes a stencil code, matrix multiplication, and the N-body problem, all available on GitHub. The optimized codes exhibit dramatic cumulative speedups of up to 29,950× relative to their respective naive starting points, showing the crucial necessity of hardware-aware transformations, which are not performed automatically by today's HLS compilers. As FPGAs are currently the only platforms commonly targeted by HLS tools in the HPC domain, transformations are discussed and evaluated in this context. Evaluating FPGA performance in comparison to other platforms is out of scope of this work. Our work provides a set of guidelines and a cheat sheet for optimizing high-performance codes for reconfigurable architectures, guiding both performance engineers and compiler developers to efficiently exploit these devices.

1.1 From Imperative Code to Hardware

Before diving into transformations, it is useful to form an intuition of the major stages of the source-to-hardware stack, to understand how they are influenced by the HLS code:
① High-level synthesis converts a pragma-assisted procedural description (C++, OpenCL) to a functionally equivalent behavioral description (Verilog, VHDL). This requires mapping variables and operations to corresponding constructs, then scheduling operations according to their interdependencies. The dependency analysis is concerned with creating a hardware mapping such that the throughput requirements are satisfied, which for pipelined sections requires the circuit to accept a new input every cycle. Coarse-grained control flow is implemented with state machines, while computations and fine-grained control flow are organized in (predicated) pipelines.
② Hardware synthesis maps the register-level circuit description to components and wires present on the specific target architecture. At this stage and onwards, the procedure is both vendor and architecture specific.
③ Place and route maps the logical circuit description to concrete locations on the target device, by performing a lengthy heuristic-based optimization that attempts to minimize the length of the longest wire and the total wire length. The longest propagation time between two registers, including the logic between them (i.e., the critical path of the circuit), determines the maximum obtainable frequency.
④ Bitstream generation translates the final circuit description to a binary format used to configure the device.
Most effort invested by an HLS programmer lies in guiding the scheduling process in ① to implement deep, efficient pipelines, but ② is considered when choosing data types and buffer sizes, and ③ can ultimately bottleneck applications once the desired parallelism has been achieved, requiring the developer to adapt their code to aid this process.

1.2 Key Transformations for High-Level Synthesis

This work identifies a set of optimizing transformations that are essential to designing scalable and efficient hardware kernels in HLS. An overview is given in Tab. 1. We divide the transformations into three major classes: pipelining transformations, which enable or improve the potential for pipelining computations; scaling transformations, which increase or expose additional parallelism; and memory enhancing transformations, which increase memory utilization and efficiency. Each transformation is further classified according to a number of characteristic effects on the HLS source code and on the resulting hardware architecture. To serve as a cheat sheet, Tab. 1 furthermore lists common objectives targeted by HLS programmers, and maps them to relevant HLS transformations. Characteristics and objectives are discussed in detail in the relevant transformation sections.

TABLE 1: Overview of transformations, the characteristics of their effect on the HLS code and the resulting hardware, and the objectives that they can target.
  Pipelining: Accumulation interleaving (§2.1), Delay buffering (§2.2), Random access buffering (§2.3), Pipelined loop fusion (§2.4), Pipelined loop switching (§2.5), Pipelined loop flattening (§2.6), Inlining (§2.7).
  Scaling: Horizontal unrolling (§3.1), Vertical unrolling (§3.2), Dataflow (§3.3), Tiling (§3.4).
  Memory: Memory access extraction (§4.1), Memory buffering (§4.2), Memory striping (§4.3), Type demotion (§4.4).
  Characteristics: (PL) enables pipelining; (RE) increases data reuse, i.e., increases the arithmetic intensity of the code; (PA) increases or exposes more parallelism; (ME) optimizes memory accesses; (RS) does not significantly increase resource consumption; (RT) does not significantly impair routing, i.e., does not potentially reduce maximum frequency or prevent the design from being routed altogether; (SC) does not change the schedule of loop nests, e.g., by introducing more loops; (CC) does not significantly increase code complexity.
  Objectives: (LD) resolve loop-carried dependencies, due to inter-iteration dependencies or resource contention; (RE) increase data reuse; (CU) increase parallelism; (BW) increase memory bandwidth utilization; (PL) reduce pipelining overhead; (RT) improve routing results; (RS) reduce resource utilization.

Throughout this work, we will show how each transformation is applied manually by a performance engineer by directly modifying the source code, giving examples before and after it is applied. However, many transformations are also amenable to automation in an optimizing compiler.

1.3 The Importance of Pipelining

Pipelining is essential to efficient hardware architectures, as expensive instruction decoding and data movement between memory, caches, and registers can be avoided by sending data directly from one computational unit to the next. We attribute two primary characteristics to pipelines:
• Latency (L): the number of cycles it takes for an input to propagate through the pipeline and arrive at the exit, i.e., the number of pipeline stages.
• Initiation interval or gap (I): the number of cycles that must pass before a new input can be accepted to the pipeline. A perfect pipeline has I=1 cycle, as this is required to keep all pipeline stages busy. Consequently, the initiation interval can often be considered the inverse throughput of the pipeline; e.g., I=2 cycles implies that the pipeline stalls every second cycle, reducing the throughput of all pipeline stages by a factor of 1/2.
To quantify the importance of pipelining in HLS, we consider the number of cycles C it takes to execute a pipeline with latency L (both in [cycles]), taking N inputs, with an initiation interval of I [cycles]. Assuming a reliable producer and consumer at either end, we have:

  C = L + I · (N − 1)  [cycles].    (1)

This is shown in Fig. 1. The time to execute all N iterations with clock rate f [cycles/s] of this pipeline is thus C/f. For example, a pipeline with L=100 cycles and I=1 cycle processing N=10,000 inputs finishes in 10,099 cycles, so the fill latency accounts for less than 1% of the runtime.

Fig. 1: Pipeline characteristics.

For two pipelines in sequence that both consume and produce N elements, the latency is additive, while the initiation interval is decided by the "slowest" actor:

  C0 + C1 = (L0 + L1) + max(I0, I1) · (N − 1)

When I0=I1 this corresponds to a single, deeper pipeline. For large N, the latencies are negligible, so this deeper pipeline increases pipeline parallelism by adding more computations without increasing the runtime, and without introducing additional off-chip memory traffic. We are thus interested in building deep, perfect pipelines to maximize performance and minimize off-chip data movement.

1.4 Optimization Goals

We organize the remainder of this work according to three overarching optimization goals, corresponding to the three categories marked in Tab. 1:
• Enable pipelining (Sec. 2): For compute bound codes, achieve I=1 cycle for all essential compute components, to ensure that all pipelines run at maximum throughput. For memory bound codes, guarantee that memory is always consumed at line rate.
• Scaling/folding (Sec. 3): Fold the total number of iterations N by scaling up the parallelism of the design to consume more elements per cycle, thus cutting the total number of cycles required to execute the program.
• Memory efficiency (Sec. 4): Saturate pipelines with data from memory to avoid stalls in compute logic. For memory bound codes, maximize bandwidth utilization.

Sec. 5 covers the relationship between well-known software optimizations and HLS, and accounts for which of these apply directly to HLS code. Sec. 6 shows the effect of transformations on a selection of kernels, Sec. 7 presents related work, and we conclude in Sec. 9.

2 Pipeline-Enabling Transformations

As a crucial first step for any HLS code, we cover detecting and resolving issues that prevent pipelining of computations. When analyzing a basic block of a program, the HLS tool determines the dependencies between computations, and pipelines operations accordingly to achieve the target initiation interval. There are two classes of problems that hinder pipelining of a given loop:
1) Loop-carried dependencies (inter-iteration): an iteration of a pipelined loop depends on a result produced by a previous iteration, which takes multiple cycles to complete (i.e., has multiple internal pipeline stages). If the latency of the operations producing this result is L, the minimum initiation interval of the pipeline will be L. This is a common scenario when accumulating into a single register (see Fig. 2), in cases where the accumulation operation takes Lacc > 1 cycles.
2) Interface contention (intra-iteration): a hardware resource with limited ports is accessed multiple times in the same iteration of the loop. This could be a FIFO queue or RAM that only allows a single read and write per cycle, or an interface to external memory, which only supports sending/serving one request per cycle.

For each of the following transformations, we will give examples of programs exhibiting properties that prevent them from being pipelined, and show how the transformation can resolve this. All examples use C++ syntax, which allows classes (e.g., "FIFO" buffer objects) and templating. We perform pipelining and unrolling using pragma directives, where loop-oriented pragmas always refer to the following loop/scope, which is the convention used by Intel/Altera HLS tools (as opposed to applying to the current scope, which is the convention for Xilinx HLS tools).

Fig. 2: Loop-carried dependency. Fig. 3: Buffered accumulation, where the loop-carried dependency is resolved because each buffer entry is only updated every M ≥ Lacc cycles.

2.1 Accumulation Interleaving

For multi-dimensional iteration spaces, loop-carried dependencies can often be resolved by reordering and/or interleaving nested loops, keeping state for multiple concurrent accumulations. We distinguish between four approaches to interleaving accumulation, covered below.

2.1.1 Full Transposition

When a loop-carried dependency is encountered in a loop nest, it can be beneficial to reorder the loops, thereby fully transposing the iteration space. This typically also has a significant impact on the program's memory access pattern, which can benefit or impair the program beyond resolving the loop-carried dependency.

Consider the matrix multiplication code in Lst. 1a, computing C = A · B + C, with matrix dimensions N, K, and M. The inner loop k ∈ K accumulates into a temporary register, which is written back to C at the end of each iteration m ∈ M. The multiplication of elements of A and B can be pipelined, but the addition on line 6 requires the result of the addition in the previous iteration of the loop. This is a loop-carried dependency, and results in an initiation interval of L+, where L+ is the latency of a 64 bit floating point addition (for integers L+,int = 1 cycle, and the loop can be pipelined without further modifications). To avoid this, we can transpose the iteration space, swapping the K-loop with the M-loop, with the following consequences:
• Rather than a single register, we now implement an accumulation buffer of depth M and width 1 (line 2).
• The loop-carried dependency is resolved: each location is only updated every M cycles (with M ≥ Lacc in Fig. 3).
• A, B, and C are all read in a contiguous fashion, achieving perfect spatial locality (we assume row-major memory layout; for column-major we would interchange the K-loop and N-loop).
• Each element of A is read exactly once.

The modified code is shown in Lst. 1b. We leave the accumulation buffer defined on line 2 uninitialized, and implicitly reset it on line 8, avoiding M extra cycles to reset (this is a form of pipelined loop fusion, covered in Sec. 2.4).

1 for (int n = 0; n < N; ++n)
2   for (int m = 0; m < M; ++m) {
3     double acc = C[n][m];
4     #pragma PIPELINE
5     for (int k = 0; k < K; ++k)
6       acc += A[n][k] * B[k][m];
7     C[n][m] = acc; }
(a) Naive implementation of general matrix multiplication C = AB + C.

1 for (int n = 0; n < N; ++n) {
2   double acc[M]; // Uninitialized
3   for (int k = 0; k < K; ++k) {
4     double a = A[n][k]; // Only read once
5     #pragma PIPELINE
6     for (int m = 0; m < M; ++m) {
7       double prev = (k == 0) ? C[n][m]
8                              : acc[m];
9       acc[m] = prev + a * B[k][m]; } }
10  for (int m = 0; m < M; ++m) // Write
11    C[n][m] = acc[m]; }       // out
(b) Transposed iteration space, same location written every M cycles.

1 for (int n = 0; n < N; ++n)
2   for (int m = 0; m < M/T; ++m) {
3     double acc[T]; // Tiles of size T
4     for (int k = 0; k < K; ++k) {
5       double a = A[n][k]; // M/T reads
6       #pragma PIPELINE
7       for (int t = 0; t < T; ++t) {
8         double prev = (k == 0) ?
9             C[n][m*T+t] : acc[t];
10        acc[t] = prev + a * B[k][m*T+t]; } }
11    for (int t = 0; t < T; ++t) // Write
12      C[n][m*T+t] = acc[t]; }   // out
(c) Tiled iteration space, same location written every T cycles.
Listing 1: Interleave accumulations to remove loop-carried dependency.

2.1.2 Tiled Accumulation Interleaving

For accumulations done in a nested loop, it can be sufficient to interleave across a tile of an outer loop to resolve a loop-carried dependency, using a limited size buffer to store intermediate results. This tile only needs to be of size ≥ Lacc, where Lacc is the latency of the accumulation operation. This is shown in Lst. 1c for the transposed matrix multiplication example from Lst. 1b, where the accumulation array has been reduced to tiles of size T (which should be ≥ Lacc, see Fig. 3), by adding an additional inner loop over the tile, and cutting the outer loop by a factor of T.

2.1.3 Single-Loop Accumulation Interleaving

If no outer loop is present, we have to perform the accumulation in two separate stages, at the cost of extra resources. For the first stage, we perform a transformation similar to the nested accumulation interleaving, but strip-mine the inner (and only) loop into blocks of size K ≥ Lacc, accumulating partial results into a buffer of size K. Once all incoming values have been accumulated into the partial result buffers, the second phase collapses the partial results into the final output. This is shown in Lst. 2 for K=16. Optionally, the two stages can be implemented to run in a coarse-grained pipelined fashion, such that the first stage begins computing new partial results while the second stage is collapsing the previous results (by exploiting dataflow between modules, see Sec. 3.3).

1 double Acc(double arr[], int N) {
2   double t[16];
3   #pragma PIPELINE
4   for (int i = 0; i < N; ++i) { // P0
5     auto prev = (i < 16) ? 0 : t[i%16];
6     t[i%16] = prev + arr[i]; }
7   double res = 0;
8   for (int i = 0; i < 16; ++i) // P1
9     res += t[i]; // Not pipelined
10  return res; }
Listing 2: Two stages required for single loop accumulation.

2.1.4 Batched Accumulation Interleaving

For algorithms with loop-carried dependencies that cannot be solved by either method above (e.g., due to a non-commutative accumulation operator), we can still pipeline the design by processing batches of inputs, introducing an additional loop nested in the accumulation loop. This procedure is similar to Sec. 2.1.2, but only applies to programs where it is relevant to compute the accumulation for multiple data streams, and requires altering the interface and data movement of the program to interleave inputs in batches.

The code in Lst. 3a shows an iterative solver with an inherent loop-carried dependency on state, with a minimum initiation interval corresponding to the latency LStep of the (inlined) function Step. There are no loops to interchange, and we cannot change the order of loop iterations. While there is no way to improve the latency of producing a single result, we can improve the overall throughput by a factor of LStep by pipelining across N ≥ LStep different inputs (e.g., overlap solving for different starting conditions). We effectively inject another loop over inputs, then perform transposition or tiled accumulation interleaving with this loop. The result of this transformation is shown in Lst. 3b, for a variable number of interleaved inputs N.

1 Vec<double> IterSolver(Vec<double> state, int T) {
2   #pragma PIPELINE // Will fail to pipeline with I=1
3   for (int t = 0; t < T; ++t)
4     state = Step(state);
5   return state; }
(a) Solver executed for T steps with a loop-carried dependency on state.

1 template <int N>
2 void MultiSolver(Vec<double> *in,
3                  Vec<double> *out, int T) {
4   Vec<double> b[N]; // Partial results
5   for (int t = 0; t < T; ++t)
6     #pragma PIPELINE
7     for (int i = 0; i < N; ++i) {
8       auto read = (t == 0) ? in[i] : b[i];
9       auto next = Step(read);
10      if (t < T-1) b[i] = next;
11      else out[i] = next; }} // Write out
(b) Pipeline across N ≥ LStep inputs to achieve I=1 cycle.
Listing 3: Pipeline across multiple inputs to avoid loop-carried dependency.
2.2 Delay Buffering

When iterating over regular domains in a pipelined fashion, it is often sufficient to express buffering using delay buffers, expressed either with cyclically indexed arrays, or with constant offset delay buffers, also known from the Intel ecosystem as shift registers. These buffers are only accessed in a FIFO manner, with the additional constraint that elements are only popped once they have fully traversed the depth of the buffer (or when they pass compile-time fixed access points, called "taps", in Intel OpenCL). Despite the "shift register" name, these buffers do not need to be implemented in registers, and are frequently implemented in on-chip RAM when large capacity is needed, where values are not physically shifted.

A common set of applications that adhere to the delay buffer pattern are stencil applications such as partial differential equation solvers [27], [28], [29], image processing pipelines [30], [31], and convolutions in deep neural networks [32], [33], [34], [35], [36], all of which are typically traversed using a sliding window buffer, implemented in terms of multiple delay buffers (or, in Intel terminology, a shift register with multiple taps). These applications have been shown to be a good fit to spatial computing architectures [37], [38], [39], [40], [41], [42], [43], as delay buffering is cheap to implement in hardware, either as shift registers in general purpose logic, or in RAM blocks.

Lst. 4 shows two ways of applying delay buffering to a stencil code, namely a 4-point stencil in 2D, which updates each point on a 2D grid to the average of its north, west, east, and south neighbors. To achieve perfect data reuse, we buffer every element read in sequential order from memory until it has been used for the last time – after two rows, when the same value has been used as all four neighbors.

In Lst. 4a we use cyclically indexed line buffers to implement the delay buffering pattern, instantiated as arrays on lines 1-2. We only read the south element from memory each iteration (line 7), which we store in the center line buffer (line 13). This element is then reused after M cycles (i.e., "delayed" for M cycles), when it is used as the east value (line 9), propagated to the north buffer (line 12), shifted in registers for two cycles until it is used as the west value (line 14), and reused for the last time after M cycles on line 8. The resulting circuit is illustrated in Fig. 4.

1 float north_buffer[M]; // Line
2 float center_buffer[M]; // buffers
3 float west, center; // Registers
4 for (int i = 0; i < N; ++i) {
5   #pragma PIPELINE
6   for (int j = 0; j < M; ++j) {
7     auto south = memory[i][j]; // Single memory read
8     auto north = north_buffer[j]; // Read line buffers
9     auto east = center_buffer[(j + 1)%M]; // (with wrap around)
10    if (i > 1 && j > 0 && j < M - 1) // Assume padding of 1
11      result[i - 1][j] = 0.25*(north + west + south + east);
12    north_buffer[j] = center; // Update both
13    center_buffer[j] = south; // line buffers
14    west = center; center = east; } } // Propagate registers
(a) Delay buffering using cyclically indexed line buffers.

1 float sr[2*M + 1]; // Shift register buffer
2 for (int i = 0; i < N; ++i) {
3   #pragma PIPELINE
4   for (int j = 0; j < M; ++j) {
5     #pragma UNROLL
6     for (int k = 0; k < 2*M; ++k)
7       sr[k] = sr[k + 1]; // Shift the array left
8     sr[2*M] = memory[i][j]; // Append to the front
9     if (i > 1 && j > 0 && j < M - 1) // Initialize/drain
10      result[i-1][j] = 0.25*(sr[0] + sr[M-1] + sr[M+1] + sr[2*M]); } }
(b) Delay buffering using an Intel-style shift register.
Listing 4: Two ways of implementing delay buffering on an N×M grid.

Lst. 4b demonstrates the shift register pattern used to express the stencil buffering scheme, which is supported by the Intel OpenCL toolflow. Rather than creating each individual delay buffer required to propagate values, a single array is used, which is "shifted" every cycle using unrolling (lines 6-7). The computation accesses elements of this array using constant indices only (line 10), relying on the tool to infer the partitioning into individual buffers (akin to loop idiom recognition [25]) that we did explicitly in Lst. 4a. The implicit nature of this pattern requires the tool to specifically support it. For more detail on buffering stencil codes we refer to other works on the subject [44], [39].

Opportunities for delay buffering often arise naturally in pipelined programs. If we consider the transposed matrix multiplication code in Lst. 1b, we notice that the read from acc on line 8 and the write on line 9 are both sequential, and cyclical with a period of M cycles. We could therefore also use the shift register abstraction for this array. The same is true for the accumulation code in Lst. 3b.

Fig. 4: A delay buffer for a 4-point stencil with three taps.

2.3 Random Access Buffering

When a program unavoidably needs to perform random accesses, we can buffer data in on-chip memory and perform random accesses to this fast memory instead of to slow off-chip memory. A random access buffer implemented with a general purpose replacement strategy will emulate a CPU-style cache; but to benefit from targeting a spatial system, it is usually more desirable to specialize the buffering strategy to the target application [45], [46]. This can enable off-chip memory accesses to be made contiguous by loading and storing data in stages (i.e., tiles), then exclusively performing random accesses to fast on-chip memory.

Lst. 6 outlines a histogram implementation that uses an on-chip buffer (line 1) to perform fast random access reads and writes (line 5) to the bins computed from incoming data, illustrated in Fig. 6. Note that the random access results in a loop-carried dependency on the histogram buffer, as there is a potential for subsequent iterations to read and write the same bin. This can be solved with one of the interleaving techniques described in Sec. 2.1, by maintaining multiple partial result buffers.

1 unsigned hist[256] = {0}; // Array of bins
2 #pragma PIPELINE // Will have II=2
3 for (int i = 0; i < N; ++i) {
4   int bin = CalculateBin(memory[i]);
5   hist[bin] += 1; // Single cycle access
6 } // ...write result out to memory...
Listing 6: Random access to on-chip histogram buffer.

2.4 Pipelined Loop Fusion

When two pipelined loops appear sequentially, we can fuse them into a single pipeline, while using loop guards to enforce any dependencies that might exist between them. This can result in a significant reduction in runtime, at little to no resource overhead. This transformation is closely related to loop fusion [47] from traditional software optimization.
1 // Pipelined loops executed sequentially
2 for (int i = 0; i < N0; ++i) Foo(i, /*...*/);
3 for (int i = 0; i < N1; ++i) Bar(i, /*...*/);
(a) (L0 + I0(N0 − 1)) + (L1 + I1(N1 − 1)) cycles.

1 for (int i = 0; i < N0+N1; ++i) {
2   if (i < N0) Foo(i, /*...*/);
3   else Bar(i - N0, /*...*/); }
(b) L2 + I·(N0 + N1 − 1) cycles.

1 for (int i = 0; i < max(N0, N1); ++i) {
2   if (i < N0) Foo(i, /*...*/); // Omit ifs
3   if (i < N1) Bar(i, /*...*/); } // for N0==N1
(c) L3 + I·(max(N0, N1) − 1) cycles.
Listing 5: Two subsequent pipelined loops fused sequentially (Lst. 5b) or concurrently (Lst. 5c). Assume that all loops are pipelined (pragmas omitted for brevity).

For two consecutive loops with latencies/bounds/initiation intervals {L0, N0, I0} and {L1, N1, I1} (Lst. 5a), respectively, the total runtime according to Eq. 1 is (L0 + I0(N0 − 1)) + (L1 + I1(N1 − 1)). Depending on which condition(s) are met, we can distinguish between three levels of pipelined loop fusion, with increasing performance benefits:
1) I=I0=I1 (true in most cases): Loops can be fused by summing the loop bounds, using loop guards to sequentialize them within the same pipeline (Lst. 5b).
2) Condition 1 is met, and only fine-grained or no dependencies exist between the two loops: Loops can be fused by iterating to the maximum loop bound, and loop guards are placed as necessary to predicate each section (Lst. 5c).
3) Conditions 1 and 2 are met, and N=N0=N1 (same loop bounds): Loop bodies can be trivially fused (Lst. 5c, but with no loop guards necessary).
An alternative way of performing pipeline fusion is to instantiate each stage as a separate processing element, and stream fine-grained dependencies between them (Sec. 3.3).

2.5 Pipelined Loop Switching

The benefits of pipelined loop fusion can be extended to coarse-grained control flow by using loop switching (as opposed to loop unswitching, which is a common transformation [25] on load/store architectures). Whereas instruction-based architectures attempt to only execute one branch of a conditional jump (via branch prediction on out-of-order processors), a conditional in a pipelined scenario will result in both branches being instantiated in hardware, regardless of whether/how often each is executed. The transformation of coarse-grained control flow into fine-grained control flow is implemented by the HLS tool by introducing predication to the pipeline, at no significant runtime penalty.

Lst. 7 shows a simple example of how the transformation fuses two pipelined loops in different branches into a single loop switching pipeline. The transformation applies to any pipelined code in either branch, following the principles described for pipelined loop fusion (§2.4 and Lst. 5).

1 if (condition)
2   #pragma HLS PIPELINE
3   for (int i = 0; i < N0; ++i)
4     y[i] = Foo(x[i]);
5 else
6   #pragma HLS PIPELINE
7   for (int i = 0; i < N1; ++i)
8     y[i] = Bar(x[i]);
(a) Coarse-grained control flow.

1 auto N = condition ? N0 : N1;
2 #pragma HLS PIPELINE
3 for (int i = 0; i < N; ++i) {
4   if (condition)
5     y[i] = Foo(x[i]);
6   else
7     y[i] = Bar(x[i]);
8 }
(b) Control flow absorbed into pipeline.
Listing 7: Pipelined loop switching absorbs coarse-grained control flow.

The implications of pipelined loop switching are more subtle than the pure fusion examples in Lst. 5, as the total number of loop iterations is not affected (assuming the fused loop bound is set according to the condition, see line 1 in Lst. 7b). There can be a (tool-dependent) benefit from saving overhead logic by only implementing the orchestration and interfaces of a single pipeline, at the (typically minor) cost of the corresponding predication logic. More importantly, eliminating the coarse-grained control can enable other transformations that significantly benefit performance, such as fusion [§2.4] with adjacent pipelined loops, flattening nested loops [§2.6], and on-chip dataflow [§3.3].

2.6 Pipelined Loop Flattening/Coalescing

To minimize the number of cycles spent filling and draining pipelines (where the circuit is not streaming at full throughput), we can flatten nested loops to move the fill/drain phases to the outermost loop, fusing/absorbing code that is not in the innermost loop if necessary.

Lst. 8a shows a code with two nested loops, and gives the total number of cycles required to execute the program. The latency of the drain phase of the inner loop and the latency of Bar outside the inner loop must be paid at every iteration of the outer loop. If N0 ≫ L0, the cycle count becomes just L1 + N0·N1, but for applications where N0 is comparable to L0, draining the inner pipeline can significantly impact the runtime (even if N1 is large). By transforming the code such that all loops are perfectly nested (see Lst. 8b), the HLS tool can effectively coalesce the loops into a single pipeline, where the next iteration of the outer loop can be executed immediately after the previous finishes.

1 for (int i = 0; i < N1; ++i) {
2   #pragma PIPELINE
3   for (int j = 0; j < N0; ++j)
4     Foo(i, j);
5   Bar(i); }
(a) L1 + N1·(L0 + N0 − 1) cycles.

1 for (int i = 0; i < N1; ++i) {
2   #pragma PIPELINE
3   for (int j = 0; j < N0; ++j) {
4     Foo(i, j);
5     if (j == N0 - 1) Bar(i); } }
(b) L2 + N0·N1 − 1 cycles.
Listing 8: Before and after coalescing a loop nest to avoid inner pipeline drains.

To perform the transformation in Lst. 8, we had to absorb Bar into the inner loop, adding a loop guard (line 5 in Lst. 8b), analogous to pipelined loop fusion (§2.4), where the second pipelined "loop" consists of a single iteration. This contrasts with the loop peeling transformation, which is used by CPU compilers to regularize loops to avoid branch mispredictions and increase amenability to vectorization. While loop peeling can also be beneficial in hardware, e.g., to avoid deep conditional logic in a pipeline, small inner loops can see a significant performance improvement from eliminating the draining phase.

2.7 Inlining

In order to successfully pipeline a scope, all function calls within the code section must be pipelineable. This typically requires "inlining" functions into each call site, creating dedicated hardware for each invocation, resulting in additional resources consumed for every additional callsite after the first. This replication is done automatically by HLS compilers on demand, but an additional inline pragma can be specified to directly "paste" the function body into the callsite during preprocessing, removing the function boundary during optimization and scheduling.
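As a minimal sketch of this, assuming the generic pragma spelling used throughout this work (the surrounding arrays, the function name, and the INLINE directive are illustrative, not taken from a specific vendor toolflow), a small function called from a pipelined loop could be annotated as follows:

float Saxpy(float a, float x, float y) {
  #pragma INLINE // Paste the body into every call site before scheduling
  return a * x + y;
}

#pragma PIPELINE
for (int i = 0; i < N; ++i)
  z[i] = Saxpy(alpha, x[i], y[i]); // No function boundary remains; the loop can pipeline with I=1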
3 Scalability Transformations

Parallelism in HLS revolves around the folding of loops, achieved through unrolling. In Sec. 2.1 we used strip-mining and reordering to avoid loop-carried dependencies by changing the schedule of computations in the pipelined loop nest. In this section, we similarly strip-mine and reorder loops, but with additional unrolling of the strip-mined chunks. Pipelined loops constitute the iteration space, the size of which determines the number of cycles it takes to execute the program. Unrolled loops, in a pipelined program, correspond to the degree of parallelism in the architecture, as every expression in an unrolled statement is required to exist as hardware. Parallelizing a code thus means turning sequential/pipelined loops fully or partially into parallel/unrolled loops. This corresponds to folding the sequential iteration space, as the number of cycles taken to execute the program is effectively reduced by the inverse of the unrolling factor.

Fig. 5: Horizontal unrolling, vertical unrolling, and dataflow, as means to increase parallelism. Rectangles represent buffer space, such as registers or on-chip RAM. Horizontal: four independent inputs processed in parallel. Vertical: one input is combined with multiple buffered values. Dataflow: similar to vertical, but input or partial results are streamed through a pipeline rather than broadcast.

3.1 Horizontal Unrolling (Vectorization)

We implement vectorization-style parallelism with HLS by "horizontally" unrolling loops in pipelined sections, or by introducing vector types, folding the sequential iteration space accordingly. This is the most straightforward way of adding parallelism, as it can often be applied directly to an inner loop without further reordering or drastic changes to the nested loop structure. Vectorization is more powerful in HLS than SIMD operations on load/store architectures, as the unrolled compute units are not required to be homogeneous, and the number of units is not constrained to fixed sizes. Horizontal unrolling increases bandwidth utilization by explicitly exploiting spatial locality, allowing more efficient accesses to off-chip memory such as DRAM.

Lst. 9 shows two functionally equivalent ways of vectorizing a loop over N elements by a horizontal unrolling factor of W. Lst. 9a strip-mines the loop into chunks of W and unrolls the inner loop fully, while Lst. 9b uses partial unrolling by specifying an unroll factor in the pragma. As a third option, explicit vector types can be used, such as those built into OpenCL (e.g., float4 or int16), or custom vector classes [48]. These provide less flexibility, but are more concise and are sufficient for most applications.

1 for (int i = 0; i < N / W; ++i)
2   #pragma UNROLL // Fully unroll inner
3   for (int w = 0; w < W; ++w) // loop
4     C[i*W + w] = A[i*W + w]*B[i*W + w];
(a) Using strip-mining.

1 // Unroll outer loop by W
2 #pragma UNROLL W
3 for (int i = 0; i < N; ++i)
4   C[i] = A[i] * B[i];
(b) Using partial unrolling.
Listing 9: Two variants of vectorization by factor W using loop unrolling.

In practice, the unrolling factor W [operand/cycle] is constrained by the bandwidth B [Byte/s] available to the compute logic (e.g., from off-chip memory), according to W_max = ⌊B/(f·S)⌋, where f [cycle/s] is the clock frequency of the unrolled logic, and S [Byte/operand] is the operand size in bytes. Horizontal unrolling is usually not sufficient to achieve high logic utilization on large chips, where the available memory bandwidth is low compared to the available amount of compute logic. Furthermore, because the energy cost of I/O is orders of magnitude higher than that of moving data on the chip, it is desirable to exploit on-chip memory and pipeline parallelism instead (this follows in Sec. 3.2 and 3.3).
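The custom vector classes mentioned above can be expressed directly in HLS C++. The following is a minimal sketch under the pragma conventions of this work; the class name Vec and its members are illustrative assumptions, not the API of any specific vendor library:

template <typename T, int W>
struct Vec {
  T data[W];
  // Element-wise addition: the unrolled loop instantiates W parallel adders.
  Vec operator+(const Vec &rhs) const {
    Vec res;
    #pragma UNROLL
    for (int w = 0; w < W; ++w)
      res.data[w] = data[w] + rhs.data[w];
    return res;
  }
  T &operator[](int i) { return data[i]; }
  const T &operator[](int i) const { return data[i]; }
};

Widening the data path then amounts to changing W, while the element-wise operators keep the kernel code itself scalar-looking.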
3.2 Vertical Unrolling

We can achieve scalable parallelism in HLS without relying on external memory bandwidth by exploiting data reuse, distributing input elements to multiple computational units replicated "vertically" through unrolling [49], [38], [50]. This is the most potent source of parallelism on hardware architectures, as it can conceptually scale indefinitely with available silicon when enough reuse is possible. Viewed from the paradigm of cached architectures, the opportunity for this transformation arises from temporal locality in loops. Vertical unrolling draws on bandwidth from on-chip fast memory by storing more elements temporally, combining them with new data streamed in from external memory to increase parallelism, allowing more computational units to run in parallel at the expense of buffer space. In comparison, horizontal unrolling requires us to widen the data path that passes through the processing elements (compare Fig. 5b and 5c).

When attempting to parallelize a new algorithm, identifying a source of temporal parallelism to feed vertical unrolling is essential to whether the design will scale. Programmers should consider this carefully before designing the hardware architecture. From a reference software code, the programmer can identify scenarios where reuse occurs, then extract and explicitly express the temporal access pattern in hardware, using a delay buffering [§2.2] or random-access [§2.3] buffering scheme. Then, if additional reuse is possible, vertically unroll the circuit to scale up performance.

As an example, we return to the matrix multiplication code from Lst. 1c. In Sec. 2.1.2, we saw that strip-mining and reordering loops allowed us to move reads from matrix A out of the inner loop, re-using the loaded value across T different entries of matrix B streamed in while keeping the element of A in a register. Since every loaded value of B eventually needs to be combined with all N rows of A, we realize that we can perform more computations in parallel by keeping multiple values of A in local registers. The result of this transformation is shown in Lst. 10. By buffering P elements (where P was 1 in Lst. 1c) of A prior to streaming in the tile of the B-matrix (lines 8-9), we can fold the outer loop over rows by a factor of P, using unrolling to multiply parallelism (as well as the buffer space required for the partial sums) by a factor of P (lines 12-14).

1 for (int n = 0; n < N / P; ++n) { // Folded by unrolling factor P
2   for (int m = 0; m < M / T; ++m) { // Tiling
3     double acc[T][P]; // Is now 2D
4     // ...initialize acc from C...
5     for (int k = 0; k < K; ++k) {
6       double a_buffer[P]; // Buffer multiple elements to combine with
7       #pragma PIPELINE   // incoming values of B in parallel
8       for (int p = 0; p < P; ++p)
9         a_buffer[p] = A[n*P + p][k];
10      #pragma PIPELINE
11      for (int t = 0; t < T; ++t) // Stream tile of B
12        #pragma UNROLL
13        for (int p = 0; p < P; ++p) // P-fold vertical unrolling
14          acc[t][p] += a_buffer[p] * B[k][m*T+t];
15    } /* ...write back 2D tile of C... */ } }
Listing 10: P-fold vertical unrolling of matrix multiplication.

3.3 Dataflow

For complex codes it is common to partition functionality into multiple modules, or processing elements (PEs), streaming data between them through explicit interfaces. In contrast to conventional pipelining, PEs arranged in a dataflow architecture are scheduled separately when synthesized by the HLS tool. There are multiple benefits to this:
• Different functionality runs at different schedules. For example, issuing memory requests, servicing memory requests, and receiving requested memory can all require different pipelines, state machines, and even clock rates.
• Smaller components are more modular, making them easier to reuse, debug and verify.
• The effort required by the HLS tool to schedule code sections increases dramatically with the number of operations that need to be considered for the dependency and pipelining analysis. Scheduling logic in smaller chunks is thus beneficial for compilation time.
• Large fan-out/fan-in is challenging to route on real hardware (i.e., 1-to-N or N-to-1 connections for large N). This is mitigated by partitioning components into smaller parts and adding more pipeline stages.
• The fan-in and fan-out of control signals (i.e., stall, reset) within each module is reduced, reducing the risk of these signals constraining the maximum achievable frequency.

To move data between PEs, communication channels with a handshake mechanism are used. These channels double as synchronization points, as they imply a consensus on the program state. In practice, channels are always FIFO interfaces, and support the standard queue operations Push and Pop, and sometimes Empty, Full, and Size. They occupy the same register or block memory resources as other buffers (Sec. 2.2/Sec. 2.3).

The mapping from source code to PEs differs between HLS tools, but is manifested when functions are connected using channels. In the following example, we will use the syntax from Xilinx Vivado HLS to instantiate PEs, where each non-inlined function corresponds to a PE, and these are connected by channels that are passed as arguments to the functions from a top-level entry function. Note that this functionally diverges from C++ semantics without additional abstraction [48], as each function in the dataflow scope is executed in parallel in hardware, rather than in the sequence specified in the imperative code. In Intel OpenCL, dataflow semantics are instead expressed with multiple kernel functions, each defining a PE, which are connected by global channel objects prefixed with the channel keyword.

To see how streaming can be an important tool to express scalable hardware, we apply it in conjunction with vertical unrolling (Sec. 3.2) to implement an iterative version of the stencil example from Lst. 4. Unlike the matrix multiplication code, the stencil code has no scalable source of parallelism in the spatial dimension. Instead, we can achieve reuse by folding the outer time-loop to treat P consecutive timesteps in a pipeline parallel fashion, each computed by a distinct PE, connected in a chain via channels [37], [51], [38]. We replace the memory interfaces to the PE with channels, such that the memory read and write become Pop and Push operations, respectively. The resulting code is shown in Lst. 11a. We then vertically unroll to generate P instances of the PE (shown in Lst. 11b), effectively increasing the throughput of the kernel by a factor of P, and consequently reducing the runtime by folding the outermost loop by a factor of P (line 3 in Lst. 11a). Such architectures are sometimes referred to as systolic arrays [52], [53].

1 void PE(FIFO<float> &in, FIFO<float> &out, int T) {
2   // ...initialization...
3   for (int t = 0; t < T / P; ++t) // Fold timesteps T by factor P
4     #pragma PIPELINE
5     for (/* loops over spatial dimensions */) {
6       auto south = in.Pop(); // Value for t-1 from previous PE
7       // ...load values from delay buffers...
8       auto next = 0.25*(north + west + east + south);
9       out.Push(next); }} // Value for t sent to PE computing t+1
(a) Processing element for a single timestep. Will be replicated P times.

1 #pragma DATAFLOW // Schedule nested functions as parallel modules
2 void SystolicStencil(const float in[], float out[], int T) {
3   FIFO<float> pipes[P + 1]; // Assume P is given at compile time
4   ReadMemory(in, pipes[0]); // Head
5   #pragma UNROLL // Replicate PEs
6   for (int p = 0; p < P; ++p)
7     PE(pipes[p], pipes[p + 1], T); // Forms a chain
8   WriteMemory(pipes[P], out); } // Tail
(b) Instantiate and connect P consecutive and parallel PEs.
Listing 11: Dataflow between replicated PEs to compute P timesteps in parallel.

For architectures/HLS tools where large fan-out is an issue for compilation or routing, an already replicated design can be transformed to a dataflow architecture. For example, in the matrix multiplication example in Lst. 10, we can move the P-fold unroll out of the inner loop, and replicate the entire PE instead, replacing reads and writes with channel accesses [50]. B is then streamed into the first PE, and passed downstream every cycle. A and C should no longer be accessed by every PE, but rather be handed downstream similar to B, requiring a careful implementation of the start and drain phases, where the behavior of each PE will vary slightly according to its depth in the sequence.
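The FIFO type used in Lst. 11 (and in Lst. 12 below) is not part of standard C++; it stands in for the channel abstraction provided by the respective toolflow (a stream or channel object). As a minimal software-only sketch of the assumed interface, useful for functional simulation only (depth limits and blocking behavior of the real hardware channel are not modeled):

#include <cstddef>
#include <queue>

// Software-emulation sketch of the FIFO channel interface assumed by the listings.
template <typename T>
class FIFO {
 public:
  void Push(const T &val) { data_.push(val); }
  T Pop() {
    T front = data_.front(); // Assumes the channel is non-empty
    data_.pop();
    return front;
  }
  bool Empty() const { return data_.empty(); }
  std::size_t Size() const { return data_.size(); }
 private:
  std::queue<T> data_;
};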
9

Lst. 1c, such that the required buffer space used for partial one cache line. If we instead read the two sections of A
results became a constant, rather than being dependent on sequentially (or in larger chunks), the HLS tool can infer
the input size. This transformation is also relevant to the two bursts accesses to A of length N/2, shown in Lst. 12c.
stencil codes in Lst. 4, where it can be used similarly to Since the schedules of memory and computational modules
restrict the size of the line buffers or shift register, so they a are independent, ReadA can run ahead of PE, ensuring that
no longer proportional to the problem size. memory is always read at the maximum bandwidth of the
interface (Sec. 4.2 and Sec. 4.3 will cover how to increase this
4 M EMORY ACCESS T RANSFORMATIONS bandwidth). From the point of view of the computational
When an HLS design has been pipelined, scheduled, and PE, both A0 and A1 are read in parallel, as shown on
unrolled as desired, the memory access pattern has been line 5 in Lst. 12b, hiding initialization time and inconsistent
established. In the following, we describe transformations memory producers in the synchronization implied by the
that optimize the efficiency of off-chip memory accesses in data streams.
the HLS code. For memory bound codes in particular, this is An important use case of memory extraction appears in
critical for performance after the design has been pipelined. the stencil code in Lst. 11, where it is necessary to separate
the memory accesses such that the PEs are agnostic of
4.1 Memory Access Extraction whether data is produced/consumed by a neighboring PE
By extracting accesses to external memory from the compu- or by a memory module. Memory access extraction is also
tational logic, we enable compute and memory accesses to useful for performing data layout transformations in fast
be pipelined and optimized separately. Accessing the same on-chip memory. For example, we can change the schedule
interface multiple times within the same pipelined section of reads from A in Lst. 10 to a more efficient scheme by
is a common cause for poor memory bandwidth utilization buffering values in on-chip memory, while streaming them
and increased initiation interval due to interface contention, to the kernel according to the original schedule.
since the interface can only service a single request per
cycle. In the Intel OpenCL flow, memory extraction is done 4.2 Memory Buffering
automatically by the tool, but since this process must be When dealing with memory interfaces with an inconsistent
conservative due to limited information, it is often still data rate, such as DRAM, it can be beneficial to request
beneficial to do the extraction explicitly in the code [54]. In and buffer accesses earlier and/or at a more aggressive pace
many cases, such as for independent reads, this is not an in- than what is consumed or produced by the computational
herent memory bandwidth or latency constraint, but arises elements. For memory reads, this can be done by reading
from the tool scheduling iterations according to program ahead of the kernel into a deep buffer instantiated between
order. This can be relaxed when allowed by inter-iteration memory and computations, by either 1) accessing wider vec-
dependencies (which can in many cases be determined tors from memory than required by the kernel, narrowing or
automatically, e.g., using polyhedral analysis [55]). widening data paths (aka. “gearboxing”) when piping to or
In Lst. 12a, the same memory (i.e., hardware memory from computational elements, respectively, or 2) increasing
interface) is accessed twice in the inner loop. In the worst the clock rate of modules accessing memory with respect to
case, the program will issue two 4 Byte memory requests the computational elements.
every iteration, resulting in poor memory performance, and The memory access function Lst. 12c allows long bursts
preventing pipelining of the loop. In software, this problem to the interface of A, but receives the data on a narrow bus
is typically mitigated by caches, always fetching at least at W · Sint = (1 · 4) Byte/cycle. In general, this limits the
bandwidth consumption to f ·W Sint at frequency f , which is
likely to be less than what the external memory can provide.
void PE(const int A[N], int B[N/2]) {
  #pragma PIPELINE // Achieves I=2
  for (int i = 0; i < N/2; ++i)
    // Issues N/2 memory requests of size 1
    B[i] = A[i] + A[N/2 + i];
}

(a) Multiple accesses to A cause inefficient memory accesses: the PE issues N/2 single-element bursts for each of A[i] and A[N/2+i], requiring N/2 state transitions.

void PE(FIFO<int> &A0, FIFO<int> &A1, int B[N/2]) {
  #pragma PIPELINE // Achieves I=1
  for (int i = 0; i < N/2; ++i)
    B[i] = A0.Pop() + A1.Pop();
}

(b) Move memory accesses out of the computational code: the compute pipeline consumes two streams and requires only a single state transition.

void ReadA(const int A[N], FIFO<int> &A0, FIFO<int> &A1) {
  int buffer[N/2];
  #pragma PIPELINE
  for (int i = 0; i < N/2; ++i)
    buffer[i] = A[i];      // Issues 1 memory request of size N/2
  #pragma PIPELINE
  for (int i = 0; i < N/2; ++i) {
    A0.Push(buffer[i]);    // Sends to PE
    A1.Push(A[N/2 + i]);   // Issues 1 memory request of size N/2
  }
}

(c) Read A in long bursts (one burst of N/2 elements per half of A) and stream them to the PE.

Listing 12: Separate memory accesses from computational logic.
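To make the composition of Lst. 12 explicit, the following sketch (our illustration, not part of the original listing) shows a top-level function that instantiates the memory module and the PE as concurrently executing processes connected by FIFOs. The FIFO type and abbreviated pragma spelling follow the conventions of the listings above; DATAFLOW stands in for the tool's dataflow directive (e.g., the Vivado HLS dataflow pragma), and the buffer depth of 16 is an arbitrary choice for illustration.

void Top(const int A[N], int B[N/2]) {
  #pragma DATAFLOW    // Run ReadA and PE as concurrent processes
  FIFO<int> A0(16);   // Stream carrying the first half of A
  FIFO<int> A1(16);   // Stream carrying the second half of A
  ReadA(A, A0, A1);   // Memory module: issues long bursts (Lst. 12c)
  PE(A0, A1, B);      // Compute module: one element per cycle (Lst. 12b)
}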
4.2 Memory Buffering

When dealing with memory interfaces with an inconsistent data rate, such as DRAM, it can be beneficial to request and buffer accesses earlier and/or at a more aggressive pace than what is consumed or produced by the computational elements. For memory reads, this can be done by reading ahead of the kernel into a deep buffer instantiated between memory and computations, by either 1) accessing wider vectors from memory than required by the kernel, narrowing or widening data paths (aka. "gearboxing") when piping to or from computational elements, respectively, or 2) increasing the clock rate of modules accessing memory with respect to the computational elements.

The memory access function in Lst. 12c allows long bursts to the interface of A, but receives the data on a narrow bus at W · S_int = (1 · 4) Byte/cycle. In general, this limits the bandwidth consumption to f · W · S_int at frequency f, which is likely to be less than what the external memory can provide. To better exploit the available bandwidth, we can either read wider vectors (increase W) or clock the circuit at a higher rate (increase f). The former consumes more resources, as additional logic is required to widen and narrow the data path, but the latter is more likely to be limited by timing constraints on the device.
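As an illustration of option 1), the sketch below (our example, not taken from the reference codes) reads 512 bit vectors from memory and narrows them to the 32 bit data path consumed by the PE. The Vec struct, the FIFO interface, and the abbreviated pragma follow the conventions of Lst. 12; kWidth = 16 is an assumed bus width.

constexpr int kWidth = 16;        // 16 x 32 bit = 512 bit memory bus

struct Vec { int data[kWidth]; }; // One wide memory word

// Memory side: one wide request per kWidth elements, filling a deep FIFO.
void ReadWide(const Vec A[N / kWidth], FIFO<Vec> &wide) {
  #pragma PIPELINE
  for (int i = 0; i < N / kWidth; ++i)
    wide.Push(A[i]);
}

// Gearbox: narrow the 512 bit stream to the 32 bit kernel data path.
void Narrow(FIFO<Vec> &wide, FIFO<int> &narrow) {
  Vec v;
  #pragma PIPELINE
  for (int i = 0; i < N; ++i) {
    if (i % kWidth == 0)
      v = wide.Pop();             // One wide word every kWidth cycles
    narrow.Push(v.data[i % kWidth]);
  }
}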
4.3 Memory Striping

When multiple memory banks with dedicated channels (e.g., multiple DRAM modules or HBM lanes) are available, the bandwidth at which a single array is accessed can be increased by a factor corresponding to the number of available interfaces by striping it across the memory banks. This optimization is employed transparently by most CPUs, which stripe across multi-channel memory, and is commonly known from RAID 0 configurations of disks.

We can perform striping explicitly in HLS by inserting modules that join or split data streams from two or more memory interfaces. Reading can be implemented with two or more memory modules requesting memory from their respective interfaces, pushing to FIFO buffers that are read in parallel and combined by another module (for writing: in reverse), exposing a single data stream to the computational kernel. This is illustrated in Fig. 6, where the unlabeled dark boxes in Fig. 6b represent PEs reading and combining data from the four DRAM modules. The Intel OpenCL compiler [19] applies this transformation by default.

(a) Memory stored in a single bank. (b) Memory striped across four banks (DDR0-DDR3), read and combined on the FPGA fabric before reaching the compute kernel.
Fig. 6: Striping memory across memory banks increases available bandwidth.
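The joining module can be as simple as the following sketch (our illustration, assuming the array is striped element-wise across two DDR banks, and reusing the FIFO conventions of Lst. 12): one reader per bank issues long bursts to its own interface, and a combiner interleaves the two streams into the single stream seen by the kernel.

// Each instance reads the half of the array mapped to its own bank.
void ReadBank(const int bank[N / 2], FIFO<int> &s) {
  #pragma PIPELINE
  for (int i = 0; i < N / 2; ++i)
    s.Push(bank[i]);
}

// Interleave the two per-bank streams into one stream for the kernel.
void Combine(FIFO<int> &s0, FIFO<int> &s1, FIFO<int> &to_kernel) {
  #pragma PIPELINE
  for (int i = 0; i < N; ++i)
    to_kernel.Push((i % 2 == 0) ? s0.Pop() : s1.Pop());
}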
4.4 Type Demotion

We can reduce resource and energy consumption, bandwidth requirements, and operation latency by demoting data types to less expensive alternatives that still meet precision requirements. This can lead to significant improvements on architectures that are specialized for certain types, and perform poorly on others. On traditional FPGAs there is limited native support for floating point units. Since integer/fixed point and floating point computations on these architectures compete for the same reconfigurable logic, using a data type with lower resource requirements increases the total number of arithmetic operations that can potentially be instantiated on the device. The largest benefits of type demotion are seen in the following scenarios:
• Compute bound architectures where the data type can be changed to a type that occupies less of the same resources (e.g., from 64 bit integers to 48 bit integers).
• Compute bound architectures where the data type can be moved to a type that is natively supported by the target architecture, such as single precision floating point on Intel's Arria 10 and Stratix 10 devices [56].
• Bandwidth bound architectures, where performance can be improved by up to the same factor that the size of the data type can be reduced by.
• Latency bound architectures where the data type can be reduced to a lower latency operation, e.g., from floating point to integer.

In the most extreme case, it has been shown that collapsing the data type of weights and activations in deep neural networks to binary [34] can provide sufficient speedup for inference that the increased number of weights makes up for the loss of precision per weight.
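Returning to the first scenario above (demoting to a narrower type of the same resource), the following sketch (ours, not taken from the reference codes) uses the arbitrary precision types provided by the Vivado HLS headers ap_int.h and ap_fixed.h to demote an accumulator from a native 64 bit integer to 48 bits, matching the accumulator width of a single DSP block; ap_fixed can similarly replace floating point where the dynamic range permits.

#include <ap_fixed.h>
#include <ap_int.h>

// 48 bit accumulator instead of a 64 bit integer: fits the accumulator
// width of a single DSP block and shrinks the surrounding logic.
ap_int<48> Sum(const ap_int<48> x[N]) {
  ap_int<48> acc = 0;
  #pragma PIPELINE
  for (int i = 0; i < N; ++i)
    acc += x[i];
  return acc;
}

// Fixed point alternative to float: 16 bits total, 6 of them integer bits.
using Fixed = ap_fixed<16, 6>;
Fixed Scale(Fixed v) { return Fixed(0.5) * v; }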
5 SOFTWARE TRANSFORMATIONS IN HLS

In addition to the transformations described in the sections above, we include an overview of how well-known CPU-oriented transformations apply to HLS, based on the compiler transformations compiled by Bacon et al. [25]. These transformations are included in Tab. 2, and are partitioned into three categories:
• Transformations directly relevant to the HLS transformations already presented here.
• Transformations that are the same or similar to their software counterparts.
• Transformations with little or no relevance to HLS.

TABLE 2: The relation of traditional CPU-oriented transformations to HLS codes. CPU-oriented transformations and how they apply to HLS codes:

Directly related to HLS transformations
• Loop interchange [57], [47] is used to resolve loop-carried dependencies [§2].
• Strip-mining [58], loop tiling [59], [47], and cycle shrinking [60] are central components of many HLS transformations [§2.1, §3.1, §3.2, §2.1.2].
• Loop distribution and loop fission [61], [47] are used to separate differently scheduled computations to allow pipelining [§3.3].
• Loop fusion [62], [47], [63] is used for merging pipelines [§2.4].
• Loop unrolling [64] is used to generate parallel hardware [§3.1, §3.2].
• Software pipelining [65] is used by HLS tools to schedule code sections according to operation interdependencies to form hardware pipelines.
• Loop coalescing/flattening/collapsing [66] saves pipeline drains in nested loops [§2.6].
• Reduction recognition prevents loop-carried dependencies when accumulating [§2.1].
• Loop idiom recognition is relevant for HLS backends, for example to recognize shift registers [§2.2] in the Intel OpenCL compiler [19].
• Procedure inlining is used to remove function call boundaries [§2.7].
• Procedure cloning is frequently used by HLS tools when inlining [§2.7] to specialize each function "call" with values that are known at compile-time.
• Loop unswitching [67] is rarely advantageous; its opposite is beneficial [§2.6, §2.4].
• Loop peeling is rarely advantageous; its opposite is beneficial to allow coalescing [§2.6].
• SIMD transformations are done in HLS via horizontal unrolling [§3.1].
• Short-circuiting: while the logic for both boolean operands must always be instantiated in hardware, dynamically scheduling branches [68] can effectively "short-circuit" otherwise deep, static pipelines.

Same or similar in HLS
• Loop-based strength reduction [69], [70], [71], Induction variable elimination [72], Unreachable code elimination [72], Useless-code elimination [72], Dead-variable elimination [72], Common-subexpression elimination [72], Constant propagation [72], Constant folding [72], Copy propagation [72], Forwarding substitution [72], Reassociation, Algebraic simplification, Strength reduction, Bounds reduction, and Redundant guard elimination are all transformations that eliminate code, which is a useful step for HLS codes to avoid generating unnecessary hardware.
• Loop-invariant code motion (hoisting) [72] does not save hardware in itself, but can save memory operations.
• Loop normalization can be useful as an intermediate transformation.
• Loop reversal [72], array padding and contraction, scalar expansion, and scalar replacement yield the same benefits as in software.
• Loop skewing [72] can be used in multi-dimensional wavefront codes.
• Function memoization can be applied to HLS, using explicit fast memory.
• Tail recursion elimination may be useful if eliminating dynamic recursion can enable a code to be implemented in hardware.
• Regular array decomposition applies to partitioning of both on-chip/off-chip memory.
• We do not consider transformations that apply only in a distributed setting (message vectorization, message coalescing, message aggregation, collective communication, message pipelining, guard introduction, redundant communication), but they should be implemented in dedicated message passing hardware when relevant [73].

Do not apply to HLS
• No use case found for loop spreading and parameter promotion.
• Array statement scalarization: no built-in vector notation in C/C++/OpenCL.
• Code colocation, displacement minimization, leaf procedure optimization, and cross-call register allocation are not relevant for HLS, as there are no runtime function calls.
• I/O format compilation: no I/O supported directly in HLS.
• Supercompiling is infeasible for HLS due to long synthesis times.
• Loop pushing/embedding: inlining completely is favored to allow pipelining.
• Automatic decomposition and alignment, scalar privatization, array privatization, cache alignment, and false sharing are not relevant for HLS, as there is no (implicit) cache coherency protocol in hardware.
• Procedure call parallelization and split do not apply, as there are no forks in hardware.
• Graph partitioning only applies to explicit dataflow languages.
• There are no instruction sets in hardware, so VLIW transformations do not apply.

It is interesting to note that the majority of well-known transformations from software apply to HLS. This implies that we can leverage decades of research into high-performance computing transformations to also optimize hardware programs, including many that can be applied directly (i.e., without further adaptation to HLS) to the imperative source code or intermediate representation before synthesizing for hardware. We stress the importance of support for these pre-hardware generation transformations in HLS compilers, as they lay the foundation for the hardware-specific transformations proposed here.
6 END-TO-END EXAMPLES

To showcase the transformations presented in this work and provide a "hands-on" opportunity for seeing HLS optimizations applied in practice, we will describe the optimization process on a sample set of classical HPC kernels, available as open source repositories on github¹. These kernels are written in C++ for Xilinx Vivado HLS [12] with hlslib [48] extensions, and are built and run using the Xilinx Vitis environment. For each example, we will describe the sequence of transformations applied, and give the resulting performance at each major stage.

1. https://github.com/spcl?q=hls

The included benchmarks were run on an Alveo U250 board, which houses a Xilinx UltraScale+ XCU250-FIGD2104-2L-E FPGA and four 2400 MT/s DDR4 banks (we utilize 1-2 banks for the examples here). The chip consists of four almost identical chiplets with limited interconnect between them, where each chiplet is connected to one of the DDR4 pinouts. This multi-chiplet design allows more resources (1728K LUTs and 12,288 DSPs), but poses challenges for the routing process, which impedes the achievable clock rate and resource utilization for a monolithic kernel attempting to span the full chip. Kernels were compiled for the xilinx_u250_xdma_201830_2 shell with Vitis 2019.2 and executed with version 2.3.1301 of the Xilinx Runtime (XRT). All benchmarks are included in Fig. 7, and the resource utilization of each kernel is shown in Fig. 8.

Fig. 7: Performance progression (in GOp/s) of the stencil, matrix multiplication, and N-body kernels when applying transformations. Parentheses show speedup over the previous version, and cumulative speedup.
Fig. 8: Resource usage (LUTs, DSPs, BRAM) of the kernels from Fig. 7 as fractions of available resources. The maxima are taken as 1728K LUTs, 12,288 DSPs, and 2688 BRAM.
6.1 Stencil Code

Stencil codes are a popular target for FPGA acceleration in HPC, due to their regular access pattern, intuitive buffering scheme, and potential for creating large systolic array designs [38]. We show the optimization of a 4-point 2D stencil based on Lst. 4. Benchmarks are shown in Fig. 7, and use single precision floating point, iterating over an 8192×8192 domain. We first measure a naive implementation, where all neighboring cells are accessed directly from the input array, which results in no data reuse and heavy interface contention on the input array. We then apply the following optimization steps (a minimal sketch of the delay buffer from step 1 is shown at the end of this section):
1) Delay buffers [§2.2] are added to store two rows of the domain (see Lst. 4a), removing interface contention on the memory bus and achieving perfect spatial data reuse.
2) Spatial locality is exploited by introducing vectorization [§3.1]. To efficiently use memory bandwidth, we use memory extraction [§4.1], buffering [§4.2], and striping [§4.3] from two DDR banks.
3) To exploit temporal locality, we replicate the vectorized PE by vertical unrolling [§3.2] and stream [§3.3] between them (Lst. 11). The domain is tiled [§3.4] to limit fast memory usage.

Enabling pipelining with delay buffers allows the kernel to sustain a throughput of ∼1 cell per cycle. Improving the memory performance and adding vectorization (using W = 16 operands/cycle for the kernel) exploits spatial locality through additional bandwidth usage. The vertical unrolling and dataflow step scales the design to exploit the available hardware resources on the chip, until limited by placement and routing. The final implementation is available on github².

2. https://github.com/spcl/stencil_hls/
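The following is a minimal sketch of the delay buffer from step 1 (our illustration, not the repository code): a cyclic buffer holds the last two rows of the domain, so all four neighbors of a cell are served from on-chip memory while a single element is read from the input stream per cycle. Boundary handling is omitted, and a full implementation splits the buffer into per-row buffers and registers to satisfy on-chip memory port constraints (cf. Lst. 4a); M and kRows are the assumed domain dimensions, and the FIFO type follows the conventions of Lst. 12.

constexpr int M = 8192;     // Width of the 2D domain
constexpr int kRows = 8192; // Height of the 2D domain

void Stencil(FIFO<float> &in, FIFO<float> &out) {
  float buffer[2 * M];                 // Last two rows of the domain
  #pragma PIPELINE
  for (long i = 0; i < long(kRows) * M; ++i) {
    float south = in.Pop();            // Cell (r, c) arriving this cycle
    int head = i % (2 * M);            // Slot written 2*M reads ago
    float north = buffer[head];                     // Cell (r-2, c)
    float west = buffer[(head + M - 1) % (2 * M)];  // Cell (r-1, c-1)
    float east = buffer[(head + M + 1) % (2 * M)];  // Cell (r-1, c+1)
    buffer[head] = south;              // Overwrite the oldest value
    if (i >= 2 * M)                    // Skip warm-up (boundaries omitted)
      out.Push(0.25f * (north + west + east + south));
  }
}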
6.2 Matrix Multiplication Code

We consider the optimization process of a matrix multiplication kernel using the transformations presented here. Benchmarks for 8192×8192 matrices across stages of optimization are shown in Fig. 7. Starting from a naive implementation (Lst. 1a), the following optimization stages were applied (a sketch of the transposed iteration space from stage 1 is given at the end of this section):
1) We transpose the iteration space [§2.1.1], removing the loop-carried dependency on the accumulation register, and extract the memory accesses [§4.1], vastly improving spatial locality. The buffering, streaming and writing phases are fused [§2.4], allowing us to coalesce the three nested loops [§2.6].
2) In order to increase spatial parallelism, we vectorize accesses to B and C [§3.1].
3) To scale up the design, we vertically unroll by buffering multiple values of A, applying them to streamed-in values of B in parallel [§3.2]. To avoid high fan-out, we partition the buffered elements of A into processing elements [§3.3] arranged in a systolic array architecture. Finally, the horizontal domain is tiled to accommodate arbitrarily large matrices with finite buffer space.

Allowing pipelining and regularizing the memory access pattern results in a throughput of ∼1 cell per cycle. Vectorization multiplies the performance by W, set to 16 in the benchmarked kernel. The performance of the vertically unrolled dataflow kernel is only limited by placement and routing due to high resource usage on the chip. The final implementation achieves state-of-the-art performance on the target architecture [50], and is available on github³.

3. https://github.com/spcl/gemm_hls
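The core of stage 1 can be sketched as follows (our illustration; the repository kernel additionally extracts memory accesses, vectorizes, and tiles, and N is the assumed matrix dimension). The naive loop order accumulates into a single register across the reduction dimension, creating a loop-carried dependency on the floating point adder; moving the reduction loop outward and accumulating into a full row of C gives the adder independent accumulations to interleave, allowing an initiation interval of 1.

void MatMulTransposed(const float A[N][N], const float B[N][N],
                      float C[N][N]) {
  for (int n = 0; n < N; ++n) {
    float acc[N];               // One partial sum per output column
    for (int k = 0; k < N; ++k) {
      const float a = A[n][k];  // Reused across the entire inner loop
      #pragma PIPELINE          // Consecutive iterations are independent
      for (int m = 0; m < N; ++m)
        acc[m] = (k == 0 ? 0.f : acc[m]) + a * B[k][m];
    }
    for (int m = 0; m < N; ++m) // Write back the finished row of C
      C[n][m] = acc[m];
  }
}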


6.3 N-Body Code

Finally, we show an N-body code in 3 dimensions, using single precision floating point types and iterating over 16,128 bodies. Since Vivado HLS does not allow memory accesses of a width that is not a power of two, memory extraction [§4.1] and buffering [§4.2] were included in the first stage, to support 3-vectors of velocity. We then performed the following transformations (a sketch of the interleaving from step 1 is given at the end of this section):
1) The loop-carried dependency on the acceleration accumulation is resolved by applying tiled accumulation interleaving [§2.1.2], pipelining across T ≥ L+ different resident particles applied to particles streamed in.
2) To scale up the performance, we further multiply the number of resident particles, this time replicating compute through vertical unrolling [§3.2] of the outer loop into P parallel processing elements arranged in a systolic array. Each element holds T resident particles, and particles are streamed [§3.3] through the PEs.

The second stage gains a factor of 4× corresponding to the latency of the interleaved accumulation, followed by a factor of 42× from unrolling units across the chip. The parameter T ≥ L+ can be used to regulate the arithmetic intensity of the kernel. The bandwidth requirements can be reduced further by storing more resident particles on the chip, scaling up to the full fast memory usage of the FPGA. The tiled accumulation interleaving transformation thus enables not just pipelining of the compute, but also minimization of I/O. The optimized implementation is available on github⁴.

4. https://github.com/spcl/nbody_hls
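The following sketch illustrates the tiled accumulation interleaving from step 1 (our illustration; the force computation, self-interaction test, and velocity update are omitted, and T = 32 is an assumed tile size). T resident particles are held in on-chip buffers, and every streamed-in particle is applied to all of them in the inner loop. Because consecutive inner iterations update different accumulators, the latency of the floating point additions no longer limits the initiation interval, as long as T is at least as large as that latency.

constexpr int kN = 16128;  // Number of bodies
constexpr int T = 32;      // Resident particles per tile (T >= adder latency)

struct Vec3 { float x, y, z; };

Vec3 Accumulate(Vec3 acc, Vec3 pi, Vec3 pj) {
  // Placeholder for the pairwise interaction term.
  acc.x += pj.x - pi.x;
  acc.y += pj.y - pi.y;
  acc.z += pj.z - pi.z;
  return acc;
}

void Interact(const Vec3 pos[kN], Vec3 acc[kN]) {
  for (int tile = 0; tile < kN; tile += T) {
    Vec3 resident_pos[T], resident_acc[T];
    for (int t = 0; t < T; ++t) {   // Load the resident particles
      resident_pos[t] = pos[tile + t];
      resident_acc[t] = {0.f, 0.f, 0.f};
    }
    for (int j = 0; j < kN; ++j) {  // Stream all particles past the tile
      const Vec3 pj = pos[j];
      #pragma PIPELINE              // T independent accumulations in flight
      for (int t = 0; t < T; ++t)
        resident_acc[t] = Accumulate(resident_acc[t], resident_pos[t], pj);
    }
    for (int t = 0; t < T; ++t)     // Write back the accumulated values
      acc[tile + t] = resident_acc[t];
  }
}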
These examples demonstrate the impact of different transformations on a reconfigurable hardware platform. In particular, enabling pipelining, regularizing memory accesses, and vertical unrolling are shown to be central components of scalable hardware architectures. The dramatic speedups over naive codes also emphasize that HLS tools do not yield competitive performance out of the box, making it critical to perform further transformations. For additional examples of optimizing HLS codes, we refer to the numerous works applying HLS optimizations listed below.

7 RELATED WORK

Optimized applications. Much work has been done in optimizing C/C++/OpenCL HLS codes for FPGA, such as stencils [38], [39], [40], [74], [75], deep neural networks [76], [77], [35], [36], [34], matrix multiplication [78], [75], [50], [79], graph processing [80], [81], networking [82], light propagation for cancer treatment [46], and protein sequencing [49], [83]. These works optimize the respective applications using transformations described here, such as delay buffering, random access buffering, vectorization, vertical unrolling, tiling for on-chip memory, and dataflow.

Transformations. Zohouri et al. [84] use the Rodinia benchmark to evaluate the performance of OpenCL codes targeting FPGAs, employing optimizations such as SIMD vectorization, sliding-window buffering, accumulation interleaving, and compute unit replication across multiple kernels. We present a generalized description of a superset of these transformations, along with concrete code examples that show how they are applied in practice. The DaCe framework [85] exploits information on explicit dataflow and control flow to perform a wide range of transformations, and generates efficient HLS code using vendor-specific pragmas and primitives. Kastner et al. [86] go through the implementation of many HLS codes in Vivado HLS, focusing on algorithmic optimizations. da Silva et al. [87] explore using modern C++ features to capture HLS concepts in a high-level fashion. Lloyd et al. [88] describe optimizations specific to Intel OpenCL, and include a variant of memory access extraction, as well as the single-loop accumulation variant of accumulation interleaving.

Directive-based frameworks. High-level, directive-based frameworks such as OpenMP and OpenACC have been proposed as alternative abstractions for generating FPGA kernels. Leow et al. [89] implement an FPGA code generator from OpenMP pragmas, primarily focusing on correctness in implementing a range of OpenMP pragmas. Lee et al. [90] present an OpenACC to OpenCL compiler, using Intel OpenCL as a backend. The authors implement horizontal and vertical unrolling, pipelining and dataflow by introducing new OpenACC clauses. Papakonstantinou et al. [91] generate HLS code for FPGA from directive-annotated CUDA code.

Optimizing HLS compilers. Mainstream HLS compilers automatically apply many of the well-known software transformations in Tab. 2 [22], [92], [93], but can also employ more advanced FPGA transformations. Intel OpenCL [19] performs memory access extraction into "load store units" (LSUs), does memory striping between DRAM banks, and detects and auto-resolves some buffering and accumulation patterns. The proprietary Merlin Compiler [94] uses high-level acceleration directives to automatically perform some of the transformations described here, as source-to-source transformations to underlying HLS code. Polyhedral compilation is a popular framework for optimizing CPU and GPU loop nests [55], and has also been applied to HLS for FPGA for optimizing data reuse [95]. Such techniques may prove valuable in automating, e.g., memory extraction and tiling transformations. While most HLS compilers rely strictly on static scheduling, Dynamatic [68] considers dynamically scheduling state machines and pipelines to allow reducing the number of stages executed at runtime.

Domain-specific frameworks. Implementing programs in domain specific languages (DSLs) can make it easier to detect and exploit opportunities for advanced transformations. Darkroom [30] generates optimized HDL for image processing codes, and the popular image processing framework Halide [31] has been extended to support FPGAs [96], [97]. Luzhou et al. [53] and StencilFlow [44] propose frameworks for generating stencil codes for FPGAs. These frameworks rely on optimizations such as delay buffering, dataflow, and vertical unrolling, which we cover here. Using DSLs to compile to structured HLS code can be a viable approach to automating a wide range of transformations, as proposed by Koeplinger et al. [98], and the FROST [99] DSL framework.

Other approaches. There are approaches other than C/C++/OpenCL-based HLS to addressing the productivity issues of hardware design: Chisel/FIRRTL [100], [101] maintains the paradigm of behavioral programming known from RTL, but provides modern language and compiler features. This caters to developers who are already familiar with hardware design, but wish to use a more expressive language. In the Maxeler ecosystem [102], kernels are described using a Java-based language, but rather than transforming imperative code into a behavioral equivalent, the language provides a DSL of hardware concepts that are instantiated using object-oriented interfaces. By constraining the input, this encourages developers to write code that maps well to hardware, but requires learning a new language exclusive to the Maxeler ecosystem.

8 TOOLFLOW OF XILINX VS. INTEL

When choosing a toolflow to start designing hardware with HLS, it is useful to understand two distinct approaches by the two major vendors: Intel OpenCL wishes to enable writing accelerators using software, making an effort to abstract away low-level details about the hardware, and
present a high-level view to the programmer; whereas Xilinx' Vivado HLS provides a more productive way of writing hardware, by means of a familiar software language. Xilinx uses OpenCL as a vehicle to interface between FPGA and host, but implements the OpenCL compiler itself as a thin wrapper around the C++ compiler, whereas Intel embraces the OpenCL paradigm as their frontend (although they encourage writing single work item kernels [103], effectively preventing reuse of OpenCL kernels written for GPU).

Vivado HLS has a stronger coupling between the HLS source code and the generated hardware. This requires the programmer to write more annotations and boilerplate code, but can also give them a stronger feeling of control. Conversely, the Intel OpenCL compiler presents convenient abstracted views, saves boilerplate code (e.g., by automatically pipelining sections), and implements efficient substitutions by detecting common patterns in the source code (e.g., to automatically perform memory extraction [§4.1]). The downside is that developers end up struggling to write or generate code in a way that is recognized by the tool's "black magic", in order to achieve the desired result. Finally, Xilinx' choice to allow C++ gives Vivado HLS an edge in expressibility, as (non-virtual) objects and templating turn out to be useful tools for abstracting and extending the language [48]. Intel offers a C++-based HLS compiler, but does not (as of writing) support direct interoperability with the OpenCL-driven accelerator flow.

9 CONCLUSION

The transformations known from software are insufficient to optimize HPC kernels targeting spatial computing systems. We have proposed a new set of optimizing transformations that enable efficient and scalable hardware architectures, and can be applied directly to the source code by a performance engineer, or automatically by an optimizing compiler. Performance and compiler engineers can benefit from these guidelines, transformations, and the presented cheat sheet as a common toolbox for developing high performance hardware using HLS.

ACKNOWLEDGEMENTS

This work was supported by the European Research Council under the European Union's Horizon 2020 programme (grant agreement DAPP, No. 678880). The authors wish to thank Xilinx and Intel for helpful discussions; Xilinx for generous donations of software, hardware, and access to the Xilinx Adaptive Compute Cluster (XACC) at ETH Zurich; the Swiss National Supercomputing Center (CSCS) for providing computing infrastructure; and Tal Ben-Nun for valuable feedback on iterations of this manuscript.

REFERENCES

[1] W. A. Wulf and S. A. McKee, "Hitting the memory wall: implications of the obvious," SIGARCH, 1995.
[2] M. Horowitz, "Computing's energy problem (and what we can do about it)," in ISSCC, 2014.
[3] D. D. Gajski et al., "A second opinion on data flow machines and languages," Computer, 1982.
[4] S. Sirowy and A. Forin, "Where's the beef? why FPGAs are so fast," MS Research, 2008.
[5] A. R. Brodtkorb et al., "State-of-the-art in heterogeneous computing," Scientific Programming, 2010.
[6] D. B. Thomas et al., "A comparison of CPUs, GPUs, FPGAs, and massively parallel processor arrays for random number generation," in FPGA, 2009.
[7] D. Bacon et al., "FPGA programming for the masses," CACM, 2013.
[8] G. Martin and G. Smith, "High-level synthesis: Past, present, and future," D&T, 2009.
[9] J. Cong et al., "High-level synthesis for FPGAs: From prototyping to deployment," TCAD, 2011.
[10] R. Nane et al., "A survey and evaluation of FPGA high-level synthesis tools," TCAD, 2016.
[11] W. Meeus et al., "An overview of today's high-level synthesis tools," DAEM, 2012.
[12] Z. Zhang et al., "AutoPilot: A platform-based ESL synthesis system," in High-Level Synthesis, 2008.
[13] Intel High-Level Synthesis (HLS) Compiler. https://www.intel.com/content/www/us/en/software/programmable/quartus-prime/hls-compiler.html. Accessed May 15, 2020.
[14] A. Canis et al., "LegUp: High-level synthesis for FPGA-based processor/accelerator systems," in FPGA, 2011.
[15] Mentor Graphics. Catapult high-level synthesis. https://www.mentor.com/hls-lp/catapult-high-level-synthesis/c-systemc-hls. Accessed May 15, 2020.
[16] C. Pilato et al., "Bambu: A modular framework for the high level synthesis of memory-intensive applications," in FPL, 2013.
[17] R. Nane et al., "DWARV 2.0: A CoSy-based C-to-VHDL hardware compiler," in FPL, 2012.
[18] M. Owaida et al., "Synthesis of platform architectures from OpenCL programs," in FCCM, 2011.
[19] T. Czajkowski et al., "From OpenCL to high-performance hardware on FPGAs," in FPL, 2012.
[20] R. Nikhil, "Bluespec System Verilog: efficient, correct RTL from high level specifications," in MEMOCODE, 2004.
[21] J. Auerbach et al., "Lime: A Java-compatible and synthesizable language for heterogeneous architectures," in OOPSLA, 2010.
[22] ——, "A compiler and runtime for heterogeneous computing," in DAC, 2012.
[23] J. Hammarberg and S. Nadjm-Tehrani, "Development of safety-critical reconfigurable hardware with Esterel," FMICS, 2003.
[24] M. B. Gokhale et al., "Stream-oriented FPGA computing in the Streams-C high level language," in FCCM, 2000.
[25] D. F. Bacon et al., "Compiler transformations for high-performance computing," CSUR, 1994.
[26] S. Ryoo et al., "Optimization principles and application performance evaluation of a multithreaded GPU using CUDA," in PPoPP, 2008.
[27] G. D. Smith, Numerical solution of partial differential equations: finite difference methods, 1985.
[28] A. Taflove and S. C. Hagness, "Computational electrodynamics: The finite-difference time-domain method," 1995.
[29] C. A. Fletcher, Computational Techniques for Fluid Dynamics 2, 1988.
[30] J. Hegarty et al., "Darkroom: compiling high-level image processing code into hardware pipelines," TOG, 2014.
[31] J. Ragan-Kelley et al., "Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines," in PLDI, 2013.
[32] T. Ben-Nun and T. Hoefler, "Demystifying parallel and distributed deep learning: An in-depth concurrency analysis," CSUR, 2019.
[33] G. Lacey et al., "Deep learning on FPGAs: Past, present, and future," arXiv:1602.04283, 2016.
[34] M. Courbariaux et al., "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1," arXiv:1602.02830, 2016.
[35] Y. Umuroglu et al., "FINN: A framework for fast, scalable binarized neural network inference," in FPGA, 2017.
[36] M. Blott et al., "FINN-R: An end-to-end deep-learning framework for fast exploration of quantized neural networks," TRETS, 2018.
[37] H. Fu and R. G. Clapp, "Eliminating the memory bottleneck: An FPGA-based solution for 3D reverse time migration," in FPGA, 2011.
[38] H. R. Zohouri et al., "Combined spatial and temporal blocking for high-performance stencil computation on FPGAs using OpenCL," in FPGA, 2018.
[39] H. M. Waidyasooriya et al., "OpenCL-based FPGA-platform for stencil computation and its optimization methodology," TPDS, May 2017.
[40] Q. Jia and H. Zhou, "Tuning stencil codes in OpenCL for FPGAs," in ICCD, 2016.
[41] X. Niu et al., "Exploiting run-time reconfiguration in stencil computation," in FPL, 2012.
[42] ——, "Dynamic stencil: Effective exploitation of run-time resources in reconfigurable clusters," in FPT, 2013.
[43] J. Fowers et al., "A performance and energy comparison of FPGAs, GPUs, and multicores for sliding-window applications," in FPGA, 2012.
[44] J. de Fine Licht et al., "StencilFlow: Mapping large stencil programs to distributed spatial computing systems," in CGO, 2021.
[45] X. Chen et al., "On-the-fly parallel data shuffling for graph processing on OpenCL-based FPGAs," in FPL, 2019.
[46] T. Young-Schultz et al., "Using OpenCL to enable software-like development of an FPGA-accelerated biophotonic cancer treatment simulator," in FPGA, 2020.
[47] D. J. Kuck et al., "Dependence graphs and compiler optimizations," in POPL, 1981.
[48] J. de Fine Licht and T. Hoefler, "hlslib: Software engineering for hardware design," arXiv:1910.04436, 2019.
[49] S. O. Settle, "High-performance dynamic programming on FPGAs with OpenCL," in HPEC, 2013.
[50] J. de Fine Licht et al., "Flexible communication avoiding matrix multiplication on FPGA with high-level synthesis," in FPGA, 2020.
[51] K. Sano et al., "Multi-FPGA accelerator for scalable stencil computation with constant memory bandwidth," TPDS, 2014.
[52] H. Kung and C. E. Leiserson, "Systolic arrays (for VLSI)," in Sparse Matrix Proceedings, 1978.
[53] W. Luzhou et al., "Domain-specific language and compiler for stencil computation on FPGA-based systolic computational-memory array," in ARC, 2012.
[54] T. Kenter et al., "OpenCL-based FPGA design to accelerate the nodal discontinuous Galerkin method for unstructured meshes," in FCCM, 2018.
[55] T. Grosser et al., "Polly – performing polyhedral optimizations on a low-level intermediate representation," PPL, 2012.
[56] U. Sinha, "Enabling impactful DSP designs on FPGAs with hardened floating-point implementation," Altera White Paper, 2014.
[57] J. R. Allen and K. Kennedy, "Automatic loop interchange," in SIGPLAN, 1984.
[58] M. Weiss, "Strip mining on SIMD architectures," in ICS, 1991.
[59] M. D. Lam et al., "The cache performance and optimizations of blocked algorithms," 1991.
[60] C. D. Polychronopoulos, "Advanced loop optimizations for parallel computers," in ICS, 1988.
[61] D. J. Kuck, "A survey of parallel machine organization and programming," CSUR, Mar. 1977.
[62] A. P. Yershov, "ALPHA – an automatic programming system of high efficiency," J. ACM, 1966.
[63] M. J. Wolfe, "Optimizing supercompilers for supercomputers," Ph.D. dissertation, 1982.
[64] J. J. Dongarra and A. R. Hinds, "Unrolling loops in Fortran," Software: Practice and Experience, 1979.
[65] M. Lam, "Software pipelining: An effective scheduling technique for VLIW machines," in PLDI, 1988.
[66] C. D. Polychronopoulos, "Loop coalescing: A compiler transformation for parallel machines," Tech. Rep., 1987.
[67] F. E. Allen and J. Cocke, A catalogue of optimizing transformations, 1971.
[68] L. Josipović et al., "Dynamically scheduled high-level synthesis," in FPGA, 2018.
[69] J. Cocke and K. Kennedy, "An algorithm for reduction of operator strength," CACM, 1977.
[70] R. Bernstein, "Multiplication by integer constants," Softw. Pract. Exper., 1986.
[71] G. L. Steele, "Arithmetic shifting considered harmful," ACM SIGPLAN Notices, 1977.
[72] A. V. Aho et al., "Compilers, principles, techniques," Addison Wesley, 1986.
[73] T. De Matteis et al., "Streaming message interface: High-performance distributed memory programming on reconfigurable hardware," in SC, 2019.
[74] D. Weller et al., "Energy efficient scientific computing on FPGAs using OpenCL," in FPGA, 2017.
[75] A. Verma et al., "Accelerating workloads on FPGAs via OpenCL: A case study with OpenDwarfs," Tech. Rep., 2016.
[76] N. Suda et al., "Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks," in FPGA, 2016.
[77] J. Zhang and J. Li, "Improving the performance of OpenCL-based FPGA accelerator for convolutional neural network," in FPGA, 2017.
[78] E. H. D'Hollander, "High-level synthesis optimization for blocked floating-point matrix multiplication," SIGARCH, 2017.
[79] P. Gorlani et al., "OpenCL implementation of Cannon's matrix multiplication algorithm on Intel Stratix 10 FPGAs," in ICFPT, 2019.
[80] M. Besta et al., "Graph processing on FPGAs: Taxonomy, survey, challenges," arXiv:1903.06697, 2019.
[81] ——, "Substream-centric maximum matchings on FPGA," in FPGA, 2019.
[82] H. Eran et al., "Design patterns for code reuse in HLS packet processing pipelines," in FCCM, 2019.
[83] E. Rucci et al., "Smith-Waterman protein search with OpenCL on an FPGA," in Trustcom/BigDataSE/ISPA, 2015.
[84] H. R. Zohouri et al., "Evaluating and optimizing OpenCL kernels for high performance computing with FPGAs," in SC, 2016.
[85] T. Ben-Nun et al., "Stateful dataflow multigraphs: A data-centric model for performance portability on heterogeneous architectures," in SC, 2019.
[86] R. Kastner et al., "Parallel programming for FPGAs," arXiv:1805.03648, 2018.
[87] J. S. da Silva et al., "Module-per-Object: a human-driven methodology for C++-based high-level synthesis design," in FCCM, 2019.
[88] T. Lloyd et al., "A case for better integration of host and target compilation when using OpenCL for FPGAs," in FSP, 2017.
[89] Y. Y. Leow et al., "Generating hardware from OpenMP programs," in FPT, 2006.
[90] S. Lee et al., "OpenACC to FPGA: A framework for directive-based high-performance reconfigurable computing," in IPDPS, 2016.
[91] A. Papakonstantinou et al., "FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs," in SASP, 2009.
[92] S. Gupta et al., "SPARK: a high-level synthesis framework for applying parallelizing compiler transformations," in VLSID, 2003.
[93] ——, "Coordinated parallelizing compiler optimizations and high-level synthesis," TODAES, 2004.
[94] J. Cong et al., "Source-to-source optimization for HLS," in FPGAs for Software Programmers, 2016.
[95] L.-N. Pouchet et al., "Polyhedral-based data reuse optimization for configurable computing," in FPGA, 2013.
[96] J. Pu et al., "Programming heterogeneous systems from an image processing DSL," TACO, 2017.
[97] J. Li et al., "HeteroHalide: From image processing DSL to efficient FPGA acceleration," in FPGA, 2020.
[98] D. Koeplinger et al., "Automatic generation of efficient accelerators for reconfigurable hardware," in ISCA, 2016.
[99] E. D. Sozzo et al., "A common backend for hardware acceleration on FPGA," in ICCD, 2017.
[100] J. Bachrach et al., "Chisel: constructing hardware in a Scala embedded language," in DAC, 2012.
[101] A. Izraelevitz et al., "Reusability is FIRRTL ground: Hardware construction languages, compiler frameworks, and transformations," in ICCAD, 2017.
[102] Maxeler Technologies, "Programming MPC systems (white paper)," 2013.
[103] Intel FPGA SDK for OpenCL Pro Edition Best Practices Guide, UG-OCL003, revision 2020.04.1. Accessed May 15, 2020.

Johannes de Fine Licht is a PhD student at ETH Zurich. His research topics revolve around spatial computing systems in HPC, and include programming models, applications, libraries, and enhancing programmer productivity.

Maciej Besta is a PhD student at ETH Zurich. His research focuses on understanding and accelerating large-scale irregular graph processing in any type of setting and workload.

Simon Meierhans is studying for his MSc degree at ETH Zurich. His interests include randomized and deterministic algorithm and data structure design.

Torsten Hoefler is a professor at ETH Zurich, where he leads the Scalable Parallel Computing Lab. His research aims at understanding performance of parallel computing systems ranging from parallel computer architecture through parallel programming to parallel algorithms.
