
ON THE RTL IMPLEMENTATION OF FINN MATRIX VECTOR COMPUTE UNIT

A PREPRINT

arXiv:2201.11409v2 [cs.AR] 11 Apr 2022

Syed Asad Alam and David Gregg1 ({syed.asad.alam,david.gregg}@tcd.ie), Giulio Gambardella and Michaela Blott2 ({giuliog,mblott}@xilinx.com), and Thomas Preusser3 ([email protected])

1 School of Computer Science and Statistics, The University of Dublin, Trinity College, Dublin, Ireland
2 Xilinx Research Labs, Ireland
3 Xilinx Research Labs, Germany

April 12, 2022

ABSTRACT
FPGA-based accelerators are becoming increasingly popular for deep neural network inference due
to their ability to scale performance with increasing degree of specialization with dataflow archi-
tectures or custom data type precision. In order to reduce the barrier for software engineers and
data scientists to adopt FPGAs, C++- and OpenCL-based design entries with high-level synthesis
(HLS) have been introduced. They provide higher abstraction compared to register-transfer level
(RTL)-based design. HLS offers faster development time, better maintainability and more flexibility
in code exploration, when evaluating several options for multi-dimension tensors, convolutional lay-
ers or different degrees of parallelism. For this reason, HLS has been adopted by DNN accelerator
generation frameworks such as FINN and hls4ml.
In this paper, we present an alternative backend library for FINN, leveraging RTL. We investigate
and evaluate, across a spectrum of design dimensions, the pros and cons of an RTL-based implemen-
tation versus the original HLS variant. We show that for smaller design parameters, RTL produces
significantly smaller circuits as compared to HLS. For larger circuits, however, the look-up table
(LUT) count of RTL-based design is slightly higher, up to around 15%. On the other hand, HLS
consistently requires more flip-flops (FFs) (with an orders-of-magnitude difference for smaller de-
signs) and block RAMs (BRAMs) (2× more). This also impacts the critical path delay, with RTL
producing significantly faster circuits, up to around 80%. Furthermore, RTL also benefits from
at least a 10× reduction in synthesis time. Finally, the results were practically validated using a
real-world use case of a multi-layer perceptron (MLP) network used in network intrusion detection.
Overall, since HLS frameworks code-generate the hardware design, the ease of design entry matters
less. As such, the gains in synthesis time, together with some design-dependent resource benefits,
may make the RTL abstraction an attractive alternative.

Keywords FINN, Convolutional neural network, HLS, RTL, FPGA

1 Introduction
Deep convolutional neural networks, referred to simply as CNNs, have grown tremendously in the past years, with
networks now having millions of parameters. For example, AlexNet, VGGNet and ResNet-152 have 62M, 138M
and 60.3M parameters, respectively [1, 2, 3]. The computational cost of a CNN is primarily derived from its convolution
layers. These convolution operations are typically performed between input matrices, which can be as large as 224 ×
224 with multiple channels, as in the popular ImageNet dataset [4], and filter kernels, which are typically small, on
the order of 5 × 5 [5] or 3 × 3 [2]. The implied convolutional compute amounts to many dot products between the
filter kernels of all output channels and correspondingly sized, overlapping tiles of the input across the full depth of
input channels. Many networks conclude the convolutions with a small number of fully connected layers. These
layers, indeed, compute one large dot product for each output channel but without any stencil tiling. In consequence,
convolutional layers dominate the total execution time of modern CNNs. Altogether, these compute requirements make
CNN deployment challenging, especially on resource-constrained devices, e.g., for enabling Internet-of-Things
(IoT) use cases both embedded and at the edge. The deployment of such networks on edge devices is critical to enable
technologies like smart cities, driver-less cars and other autonomous systems [6] as it is neither energy-efficient to
submit data from an edge device for remote cloud processing nor conducive for the real-time processing required by
such state-of-the-art applications.
The challenging demands of the desired CNNs must be mitigated on embedded devices with limited processing and
memory capacity. It is important to reduce the number of layers or parameters, and the memory required to store them.
There are various ways in which these aspects of a network can be reduced. One of the ways is referred to as pruning.
It involves replacing insignificant parameters in weight tensors by zero. Pruning can be performed at different levels
of granularity ranging from pruning individual weights to complete channels [7, 8].
Another orthogonal measure of reducing the memory footprint of such networks is quantization. It means that the data
inside a network (i.e., input and output activations, weights and biases) are represented using shorter word lengths.
Admissible word lengths have been explored thoroughly for fixed- and floating-point number systems. Capotondi
et al. [9] propose a CNN inference library, called CMix-NN, for integer-only mixed low-precision quantization of
weights and activations at (8, 4, 2)-bits. The library is optimized for the ARM instruction set with vector arithmetic
extensions and tested for the deployment on microcontroller targets like the STM32H7. Garofalo et al. [10] also
propose a library of kernels for quantized neural network (QNN) inference, called PULP-NN, using a homogeneous
quantization scheme with (8, 4, 2, 1)-bits. They optimize for a cluster of low-power RISC-V processors by exploiting
4 × 8-bit SIMD MAC instructions and bitwise extension operations in order to benefit from precisions below 8 bits.
Bruschi et al. [11] extend PULP-NN by targeting the acceleration of mixed-precision DNNs. Hubara et al. [12] present
the quantization with various low-bit representations as small as 1 bit per weight and activation to allow for the use of
bitwise operations during the forward pass. A further contribution in the field of these binary neural networks (BNNs)
is reported by Rastegari et al. [13]. They present two approximations of DNNs by firstly representing all weights with
binary values and secondly approximating both weights and inputs to the convolutional and fully connected layers by
binary values. Their interpretation of the two values as ±1 motivates their term XNOR network as inspired by the
resulting elementary multiplication.
Field-programmable gate arrays (FPGAs) are a very suitable platform to implement fixed-point systems due to the
presence of fast multipliers and adders. Their ability to customize datapaths down to the bit level has made them an
attractive target and the technology of choice for very-low-precision QNNs and, of course, BNNs. One framework
exploiting this technological match is FINN, which is able to build FPGA-based accelerators for QNNs. While initially
limited to BNNs [14], FINN was later extended to non-binary word lengths [15]. The prototypes designed using
this framework were demonstrated to improve performance measures, such as latency, throughput, Top-1, Top-5 and
mean average precision (mAP) accuracies, with respect to the state of the art. The FINN framework uses C++-based
high-level synthesis (HLS). HLS allows software engineers and scientists, not well versed with hardware description
languages (HDLs) like Verilog and VHDL, to describe digital systems. They can, thus, target hardware platforms like
FPGAs and ASICs using familiar languages like C++ or OpenCL [16, 17, 18]. Designs described in these high-level
languages are translated to traditional HDLs using dedicated compilers. The time required to develop and design a
digital system using HLS is considerably less than what is needed for a genuine HDL implementation. The general
reduction of code complexity is a benefit for design space exploration, optimization, and maintenance. Also, the
debugging of a design can ideally be performed on a higher functional abstraction level, which is more accessible
to the application engineer and considerably faster than a simulation on the register-transfer level (RTL). Finally,
HLS offers more flexibility in code exploration, which is powerful for example when evaluating the many options for
ordering different dimensions in tensors, convolutional layers or different degrees of parallelism.
On the other side, adding an extra layer of abstraction and translation also incurs costs in terms of design imple-
mentation times, hardware resource cost and the predictability of design quality. In this work, we analyse HLS- and
RTL-based implementations of the core compute unit of the accelerator designed for the FINN framework to evaluate
the implied trade-offs. This core unit performs matrix-vector multiplications using different types of compute engines
and uses AXI-Stream-based interfaces for communication. We use Xilinx’s Vivado and Vivado HLS for RTL and HLS
implementations, respectively.
The main contributions of this work are:


Listing 1: Simple D-Flip Flop in SystemVerilog


always_ff @(posedge clk) begin
  if (!rst_n)
    Q <= 0;
  else if (en)
    Q <= d;
end

• RTL implementation of the key FINN compute component, the MVU, available in open source at
https://github.com/asadalam/FINN_MatrixVector_RTL.
• Systematic measurement and analysis of resource utilization, critical path delay and synthesis time for designs
realized using HLS and RTL. We show that:
– HLS uses more FPGA resources for smaller designs and suffers from complex multiplexer structures
when processing large input streams using a limited number of compute units.
– The LUT usage converges between HLS and RTL as the core compute unit increases in size but HLS
consistently uses more flip-flops and block RAMs (BRAMs).
– The RTL implementation results in a significantly reduced critical path delay across different parameters
and for the three different types of compute units as compared to HLS.
– HLS takes at least 10× more synthesis time as compared to RTL synthesis and is the limiting factor
towards synthesizing and analyzing designs with large numbers of compute units.
• Implementation and analysis of a multi-layer perceptron (MLP) network, consisting of 4 fully connected
layers, for network intrusion detection with the UNSW-NB15 dataset [19].

This paper is organized as follows: Section 2 highlights the differences between HLS and RTL and the implied design
and workflow trade-offs before Section 3 provides a literature review of related work. Section 4 briefly describes the
FINN platform for generating HLS architectures for QNN inference. The architecture of the matrix vector compute
unit is detailed in Section 4.1.1 and its RTL refinement in Section 5. The results of our HLS vs. RTL comparison are
finally presented in Section 6 before Section 7 concludes the paper.

2 High-Level Synthesis (HLS) vs. Register-Transfer Level (RTL) Design


The design of digital microelectronic systems has evolved tremendously over the years. Starting with the manual
placement of transistors, it eventually embraced hardware description languages (HDLs) as a major milestone boosting
reuse and productivity. HDLs can describe a digital design at various abstraction levels, commonly identified as the
behavioral, register-transfer, gate and switch levels. Typically, a combination of behavioral and register-transfer level
is used for design entry with few instances of gate level design where detailed control is necessary. Conventionally,
the term RTL design is used as an umbrella subsuming the adjacent abstraction levels as used for typical design entry.
RTL design entry was an important but by far not conclusive step towards a functional, software-like specification
of hardware systems. Behavioral control structures, like if - and case-statements, and high-level logic and arithmetic
operators raise the abstraction level significantly. However, designers must still schedule their computation into clock
cycles explicitly and deal with low-level control details, such as the synchronicity of clock domains and the polarity
of reset signals. Listing 1 gives a small impression of this style of design entry modelling a D-Flip Flop with a
synchronous active-low reset and an active-high enable.
Three HDLs are in widespread use today: VHDL, Verilog, and SystemVerilog. SystemVerilog is the youngest among
them and is closely related to Verilog. All of them require special skills in RTL design and are exclusively used by
hardware engineers to implement designs on hardware platforms like application-specific integrated circuits (ASICs)
and field-programmable gate arrays (FPGAs) and to build testbenches for these designs.
In order to enable software engineers, research scientists and others to deploy and accelerate their applications on
FPGAs without the need for writing RTL, high-level synthesis (HLS) tools were envisioned and created [20]. These
efforts yielded, for example, Maxeler’s OpenSPL [21], Intel’s OpenCL [22] and Xilinx’ Vitis unified software platform
[23]. While Intel’s HLS [24] uses OpenCL, the Xilinx Vitis platform uses C++ as its design entry language. Pragma
annotations, comparable to those used by OpenMP, guide the RTL synthesis by the HLS compiler. These pragma
recipes and language restrictions, particularly the lack of dynamically allocated memory and late virtual function


dispatch, still pose a learning and implementation challenge. This challenge can, however, be approached within
familiar C++ terrain.
HLS technology has, indeed, demonstrated its ability to make FPGA acceleration accessible to a wider audience. It has
triggered a number of contributions using HLS in diverse fields, e.g., scientific high-performance computing (HPC)
[25, 26] as well as machine learning applications like routing congestion prediction [27] and deep neural networks
[28]. FINN [14, 15] is also an example of realizing accelerators for neural networks on FPGAs using C++ HLS.
The design entry in HLS comes with significant benefits. The time and effort to get an initial design up and running
is significantly lower. Its functional validation can be conducted efficiently in software. The higher-level description
is much more flexible and, hence, more amenable for a thorough design space exploration. Re-arranging loop nests
and tuning the degree of parallelism are small changes that can be evaluated quickly. Finally, the quality of the
hardware designs generated from HLS has been improving significantly and constantly over recent years, which has
meanwhile led to the widespread adoption of this form of design entry.
Designing on a higher functional abstraction level does, however, also incur costs. First of all, HLS produces some
wrapping overhead for standardizing all control and communication interfaces of a design. Internally, HLS tools can
brilliantly apply formalized knowledge about transforming and optimizing standard code structures. However, they
have no intuition about exploitable application constraints and the consequences of custom control structures. They
do not automatically push into optimizing congested parts of the design and regularly fail in scaling a design up or
in meeting the expected timing. When a critical generated kernel violates physical design constraints, the available
means and levers are much less clear. Cutting one or two LUT levels from a critical path that did not meet timing
typically turns into a massive investigative endeavor, which cannot even be expected to be approached by the high-
level audience of HLS users. Last but not least, current HLS compilers perform an additional compilation step with
synthesis times that clearly grow superlinearly. The compilation of designs that somewhat fill modern high-end devices
may take days. This may squander the gains of faster turnaround times in earlier design phases. In the end, the system
implementation cost of an HLS application may suddenly surge when operating the tools close to the limits of the
target platform. Particularly for such challenging and for settled use cases, RTL remains a worthy option.

3 Related Work

As HLS design entry has become mainstream, a number of research contributions for its analysis and evaluation
have been made. Nane et al. [29] present a survey and evaluation of FPGA HLS tools. In addition to evaluating
various tools, they also describe different optimizations and problems being addressed by researchers. They identified
benchmark-specific optimization and constraints that can enable HLS tools to significantly improve performance while
also highlighting the differences between optimizations needed for hardware designs as compared to software designs.
Furthermore, they also present an analysis of academic HLS tools against commercial ones.
An overview of HLS techniques and tools has also been compiled by Coussy et al. [30] who outline the steps involved
in an HLS design flow and compare it against an RTL design flow. A similar overview of a much larger set of tools was
contributed by Meeus et al. [16]. They outline a number of challenges faced by HLS, such as application-specific tool
flows, more optimization options for improving HLS designs, and design entry standardization asking for additional
training for designers to adapt their applications for hardware platforms. Martin et al. [17] divide HLS tools into
generations, highlighting both their key features and their shortcomings. According to the authors, the HLS tools of
the early to late 2000s constitute the third generation, comprising tools like Xilinx's AccelDSP and Synopsys/Synplicity
Synplify DSP, among others. Although not documented there, we are arguably seeing a new generation of tools such as
Xilinx's Vivado HLS and Vitis HLS and Intel's OpenCL-based HLS tool.
In addition to these contributions towards the analysis of HLS, specific algorithms have been realized in HLS and
compared against RTL counterparts. Homsirikamil et al. [31] present such an analysis for various cryptography
algorithms. They use Xilinx’s Vivado HLS and benchmark the algorithms on throughput and throughput/area metrics.
Another relevant work is presented by Winterstein et al. [32]. They analyze HLS for two types of implementations of
the K-means algorithm, a dataflow and a recursive tree implementation. They demonstrate that dataflow architectures
described using HLS achieve near RTL performance but recursive architectures do not.
It is to be noted that FINN [14, 15], the base for the HLS vs. RTL comparison in this paper, in fact generates HLS
dataflow architectures. This work performs an analysis on a large scale sweeping through the design space of a critical
component of FINN networks. We implement the main compute unit of a neural network layer with varying design
parameters, such as the input feature map size, kernel dimension, output feature map size, input and kernel word length,
both from FINN’s native HLS output and from a drop-in RTL description. We determine the relationship between the


chosen design parameters and achieved quality metrics including throughput, resource utilization, end-to-end delay,
and tool execution time.

4 FINN

Xilinx provides an open-source framework called FINN [14, 15] to generate highly specialized accelerators on FPGAs
leveraging reduced-precision datatypes and streaming dataflow (DF) architectures. FINN customizes the hardware
architectures to the specifics of a DNN topology and the exact datatypes used. Each layer is instantiated with its
designated compute units in hardware. On-chip data streams interconnect the compute units to form the desired
network topology. The small and compact size of reduced-precision DNNs allows all parameters to be stored on chip;
off-chip memory accesses, and with them potential memory bottlenecks, are thus avoided. FINN produces synthesizable C++-
based HLS code for describing the generated QNN accelerators. HLS was initially chosen for the hardware description
of these highly customized architectures, as C++ templates can be used to parameterize canned designs for different
datatypes, different layer types and different degrees of parallelism. Furthermore, it enables rapid implementation in
comparison with handwritten RTL. In the following, we discuss the hardware architecture and the tool flow of FINN
in separate sections.

4.1 Hardware Architecture

As mentioned above, the overall architecture realized by the FINN framework consists of multiple layers connected
using a data-flow model. Popular layers such as convolutions are lowered to a matrix-matrix multiplication between
the filter kernel and an activation matrix resulting from the expansion of the input feature map by the im2col operation.
FINN performs this expansion on the fly using a sliding window moving across the input. As a result, each layer is
a dedicated composition of sliding window unit (SWU) and matrix vector threshold units (MVU), with the MVU
being the central compute block for convolutional layers, and similarly also for fully connected layers. This compute
block, which represents the main compute engine in FINN designs, is the focus of this work. It is parameterized in
terms of number of input and output feature map channels, kernel dimensions, input feature map dimensions and input
and kernel weight precision. Furthermore, the degree of parallelism of an MVU can be specified through two parameters,
namely the number of processing elements (PEs) and the width of the single-instruction multiple-data (SIMD) lanes,
which are discussed in more detail in the next subsection on the MVU architecture.

4.1.1 Matrix Vector Compute Unit Architecture


The matrix vector compute unit (MVU) is the main computation unit of each neural network layer’s implementation
in the FINN framework. While it also subsumes the output thresholding in this framework, this part of its functionality
is not considered in this work as it only requires a few look-up tables (LUTs) for its realization [15].
As previously mentioned, in FINN, convolutions are lowered to matrix-matrix multiplications where the weights
defining the convolution filters are packed into a matrix and the input image is also converted to a matrix by moving
a sliding window across it [15]. This is known as general matrix multiply (GEMM) and quantized versions of it have
been used to realize neural networks [33, 34].
The image matrix is generated by the sliding window unit and then processed by the MVU along with the weight filter
matrix. The image matrix has dimensions K_d^2 · I_c × O_d^2, where K_d, I_c and O_d are the weight filter kernel dimension,
the number of input feature map channels and the dimension of the output feature map, respectively. Similarly, the weight
filter matrix has dimensions O_c × K_d^2 · I_c, where O_c is the number of channels in the output feature map. The
output matrix obtained after multiplying these two matrices has dimensions O_d^2 × O_c. This is also shown
graphically in Fig. 1.
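As a purely illustrative example (the layer parameters here are hypothetical and not taken from a particular network), for K_d = 3, I_c = 64, O_c = 128 and O_d = 32 the three matrices have dimensions

(3^2 \cdot 64) \times 32^2 = 576 \times 1024 \ \text{(image)}, \qquad 128 \times (3^2 \cdot 64) = 128 \times 576 \ \text{(weights)}, \qquad 32^2 \times 128 = 1024 \times 128 \ \text{(output)}.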
Each vector of the input image matrix is streamed into the MVU along with the weight matrix to produce one output
vector which is streamed to the next layer in the network. The AXI-Stream protocol is used for the communication
between layers.

MVU: Achieving Parallelism FPGAs allow for massive data parallelism, and the MVU can likewise be parallelized along
various dimensions within the bounds of the resources available in the target device. To achieve this, the MVU is divided
into P parallel processing elements (PEs) with S single-instruction, multiple-data (SIMD) input lanes each, as
shown in Fig. 2. Here, each PE corresponds to a hardware neuron and each SIMD lane to a hardware synapse [14].
For a fully parallel architecture, the number of PEs should equal the number of rows of the weight matrix. This
means that each PE reads in a row of the weight matrix and multiplies it with the input image vector to produce an


Figure 1: Input, weight and output matrix of convolution implemented as GEMM.

Figure 2: Processing elements (PEs) and SIMDs arrangement.

output vector. Within each PE, the number of SIMD lanes equals the number of columns in the weight matrix.
This arrangement is graphically shown in Fig. 2.
However, resources in an FPGA are limited, and a fully parallel implementation of the MVU may not be possible;
conversely, the throughput requirements may not be high. In such cases, it is important to time-multiplex, or fold, the QNN
onto fewer hardware resources. Time multiplexing is a general technique in which only part of the input is processed on a
given hardware resource in a given time instant, and the hardware resource is re-used in other time instants
[35].
Consider an example where the weight matrix has a 4 × 4 dimension while the input image is a 4 × 1 vector. Let
the number of PEs equal two with two SIMD lanes in each PE. Assume that the four elements of the input vector are
[x0 , x1 , x2 , x3 ] and those of the weight matrix are:

Y = \begin{bmatrix} y_{00} & y_{01} & y_{02} & y_{03} \\ y_{10} & y_{11} & y_{12} & y_{13} \\ y_{20} & y_{21} & y_{22} & y_{23} \\ y_{30} & y_{31} & y_{32} & y_{33} \end{bmatrix} \qquad (1)
The mapping of the input vector and weight matrix to each PE/SIMD on each clock cycle is shown in Fig. 3. It
can be seen that elements of the input vector are re-used, thus necessitating the use of an input buffer. This results in a
folded or time-multiplexed architecture, whereas setting both PE and SIMD to one results in a fully folded, serial
architecture.
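For illustration, one schedule consistent with this folding (the exact assignment shown in Fig. 3 may order the folds differently) lets each PE accumulate two outputs over two cycles each:

\begin{aligned}
\text{cycle 1:}\quad & \mathrm{PE}_0 \mathrel{+}= y_{00}x_0 + y_{01}x_1, \qquad & \mathrm{PE}_1 \mathrel{+}= y_{10}x_0 + y_{11}x_1 \\
\text{cycle 2:}\quad & \mathrm{PE}_0 \mathrel{+}= y_{02}x_2 + y_{03}x_3, \qquad & \mathrm{PE}_1 \mathrel{+}= y_{12}x_2 + y_{13}x_3 \\
\text{cycle 3:}\quad & \mathrm{PE}_0 \mathrel{+}= y_{20}x_0 + y_{21}x_1, \qquad & \mathrm{PE}_1 \mathrel{+}= y_{30}x_0 + y_{31}x_1 \\
\text{cycle 4:}\quad & \mathrm{PE}_0 \mathrel{+}= y_{22}x_2 + y_{23}x_3, \qquad & \mathrm{PE}_1 \mathrel{+}= y_{32}x_2 + y_{33}x_3
\end{aligned}

so that x_0, ..., x_3 have to be read again in cycles 3 and 4, which is precisely why the input buffer is required.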

Processing Element and SIMD Lanes Each processing element consists of several SIMD lanes. This arrangement
is shown in Fig. 2. The accumulator is only needed in the case of folded architectures, to accumulate the outputs of each
clock cycle until the final output is computed.
Each SIMD lane essentially computes the product of an input vector element and a weight matrix element, as also
shown in Fig. 3. FINN initially supported only BNNs [14] and was later extended to binary weights and arbitrary-precision
input vectors in FINN-R [15]. Thus, the architecture of the MVU supports three different types of SIMD lanes/elements for the
different datatypes. The implementations are listed below and shown in Fig. 4 along with the associated logic for adding
the SIMD outputs and the accumulator.

• XNOR followed by popcount


• Binary weights, interpreted as {±1}, followed by an adder tree



Figure 3: Mapping of input vector/weight matrix to each PE/SIMD.


Figure 4: Types of SIMD components used. (a) XNOR, (b) multiplexer-based for binary inputs, where 0 corresponds
to −1 and 1 to +1, and (c) standard multiplier for arbitrary precision inputs.

• Arbitrary precision weights and inputs, followed by an adder tree

The outputs of the SIMD lanes in a PE can be added using a popcount in the case of binary inputs, a simple adder tree,
as used in this work, or even advanced compressor trees as proposed by Preußer [36] or Kumm and Kappauf [37].
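To make the first of these variants concrete, the following SystemVerilog sketch shows one way an XNOR-popcount PE with a fold accumulator could be written. It is a minimal illustration under assumed names and widths (xnor_popcount_pe, en, clr and ACC_W are hypothetical) and is not the module shipped in the FINN_MatrixVector_RTL repository.

module xnor_popcount_pe #(
  parameter int SIMD  = 32,           // number of 1-bit SIMD lanes
  parameter int ACC_W = 16            // accumulator width
)(
  input  logic             clk,
  input  logic             rst_n,
  input  logic             en,        // process one fold of SIMD inputs
  input  logic             clr,       // restart the accumulation for a new output
  input  logic [SIMD-1:0]  act,       // 1-bit activations
  input  logic [SIMD-1:0]  wgt,       // 1-bit weights
  output logic [ACC_W-1:0] acc        // running dot product in the popcount domain
);
  // XNOR realizes the {-1,+1} product: a '1' marks a matching activation/weight pair.
  logic [SIMD-1:0]           prod;
  logic [$clog2(SIMD+1)-1:0] popcnt;

  assign prod = ~(act ^ wgt);

  // Popcount over the SIMD products of the current fold.
  always_comb begin
    popcnt = '0;
    for (int i = 0; i < SIMD; i++)
      popcnt += prod[i];
  end

  // Accumulate across folds until the final output of this PE is complete.
  always_ff @(posedge clk) begin
    if (!rst_n)  acc <= '0;
    else if (en) acc <= (clr ? '0 : acc) + popcnt;
  end
endmodule

Asserting clr on the first fold of every output value restarts the accumulation.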

Figure 5: FINN compiler flow.


4.2 Tool flow

The FINN tool flow has a highly modular structure as shown in Figure 5, which allows the user to interactively generate
a specialized architecture for a specific DNN. The framework provides a frontend, transformation and analysis passes
and backends to explore the design space in terms of resource and throughput constraints.

Frontend and Intermediate Representation The training frontend used is Brevitas [38], a PyTorch library
for quantization-aware training that enables training DNNs with weights and activations quantized down to a few bits.
After training, the DNN model must first be converted into the intermediate representation (IR) of the FINN compiler.
The frontend stage takes care of this by converting the PyTorch description into the IR, called FINN-ONNX. This IR
is based on ONNX (https://onnx.ai), an open-source interchange format which uses a protobuf description to represent
DNNs. The IR forms the input to the transformation and analysis passes.

Transformation and Analysis Passes The transformation and analysis passes help to generate an efficient repre-
sentation of the DNN. For this, the FINN compiler performs graph transformation and analysis passes, which analyze
and change the IR of the model. In this part of the compiler flow, synthesizable HLS descriptions are generated for the
various layers (see Lowering and Conversion to HLS Layers) and the degree of parallelization is determined, which is
essential to meet specific resource or throughput constraints (see Folding and Resource Estimation below).

Lowering and Conversion to HLS Layers High-level operations in the graph are lowered to simpler operators
implemented by the FINN hardware library. As mentioned before, convolutions are lowered to a sliding window node
followed by a MVU node. In the resulting graph, each node corresponds to a Vivado HLS C++ function call, for which
an IP block can be generated using Vivado. The resources utilized by each hardware building block can be controlled
through specific attributes passed from FINN to Vivado. For example, multiplications can be performed using LUTs or
DSP blocks, and parameters can be stored in distributed, Block, or Ultra RAM. The main resource adjustment happens
in the Folding and Resource Estimation pass.

Folding and Resource Estimation The folding process assigns compute resources to each layer to obtain the desired
throughput within a balanced pipeline. This process in essence determines values for PE and SIMD. Once the folding
is specified, resource estimates can be produced for each node.
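As a rough guide (an assumption of this description, consistent with the folded example in Section 4.1.1 and a fully pipelined MVU at an initiation interval of one), the number of clock cycles an MVU needs per output vector scales with the total fold

\text{fold} = \frac{\text{rows of the weight matrix}}{\mathrm{PE}} \times \frac{\text{columns of the weight matrix}}{\mathrm{SIMD}},

so balancing the pipeline amounts to choosing PE and SIMD such that this fold is similar across all layers.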
There are several ways to estimate the resources. Even before IP blocks are generated from the HLS layers, an estimate
of the resources per layer can be made by using analytical models based on the concepts from the FINN-R paper [15].
Estimations can also be extracted from Vivado HLS after IP generation, though these results are still estimations that
may differ from the resource usage of the final implementation due to synthesis optimizations.

Backends Finally, backends are responsible for consuming the IR graph and translating it into RTL descriptions
and backend-specific information to create a deployment package. FINN supports implementation as a standalone Vivado
IP core or integration into various shells, such as the ones available for Xilinx Alveo boards and PYNQ embedded
platforms.

5 The MVU RTL Implementation

5.1 Overall Architecture

Realizing the MVU in HLS only requires defining the matrix vector multiplication, partitioning the matrices and
internal arrays, and defining the pipelining. The control logic required to synchronize all these operations, including
the AXI stream I/O protocol, is handled by the HLS framework. For RTL, however, one needs to define the complete
control logic manually.
For the RTL implementation, the overall architecture of the MVU, as shown in Fig. 6, was divided into two modules,
one containing the other. The top-level module is referred to as MVU batch and incorporates the burned-in weight memory,
a control unit responsible for sequencing the stream of weights into the compute of the stream unit, and
the stream unit MVU stream itself, which is the second module. The latter implements the main computation,
partitioned along the two dimensions of PEs and SIMD accumulations. Slices of both weights and input data are
streamed in parallel into the unit. This partitioning is inherited from FINN [39].


Figure 6: Overall architecture of the MVU batch and stream units.

In order to understand the sequencing and dimensions to parallelize, it is important to consider the layout of the weight
memory. The depth of each weight memory is given by:
D_{mem} = \frac{K_d^2 \times I_c \times O_c}{\mathrm{SIMD} \times \mathrm{PE}} \qquad (2)

where K_d, I_c and O_c have their usual meanings as defined earlier. The word size of the data stored in these memories
is SIMD × B_w, where B_w is the weight precision. Since each PE is responsible for one or more rows of the weight
matrix, it is served by a dedicated weight memory, resulting in PE memory instances.
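For example, taking the constant values of Table 2 (K_d = 4, I_c = O_c = 64) with PE = SIMD = 32 and 4-bit weights (the standard implementation), Eq. (2) yields

D_{mem} = \frac{4^2 \times 64 \times 64}{32 \times 32} = 64,

i.e., 64 words per weight memory, each SIMD × B_w = 32 × 4 = 128 bits wide, replicated in PE = 32 memory instances.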

5.2 MVU Batch Unit

The MVU batch unit, as shown in Fig. 6, comprises a small control unit that manages the reads from the weight
memories, whose contents are initialized offline and which are also part of the MVU batch unit. The MVU batch unit then
contains the main compute unit, shown as the matrix vector stream unit in Fig. 6.

5.3 MVU Stream Unit


The stream unit performs the main computations. It encapsulates the PEs and SIMDs of Fig. 2. The stream unit also
has a separate control unit which deals with the AXI stream protocol and handles the reading and writing of the input
buffer. The design is generic both in terms of PE and SIMD partitioning and in terms of data precision by leveraging
appropriate SystemVerilog generate loops and conditional instantiations to select from the implementation alternatives
introduced by Fig. 4.
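The following sketch conveys the flavour of such a generic description: a generate loop over the PEs and a conditional instantiation that selects between 1-bit XNOR-popcount lanes and standard multipliers with an adder tree (the binary-weight multiplexer variant is omitted for brevity). Module, port and parameter names are again assumptions of this sketch, not those of the published code.

module mvu_simd_select_sketch #(
  parameter int PE    = 4,
  parameter int SIMD  = 8,
  parameter int ACT_W = 4,   // activation precision; 1 selects the XNOR variant
  parameter int WGT_W = 4,   // weight precision
  parameter int ACC_W = 24
)(
  input  logic                                clk, rst_n, en, clr,
  input  logic [SIMD-1:0][ACT_W-1:0]          act,
  input  logic [PE-1:0][SIMD-1:0][WGT_W-1:0]  wgt,
  output wire  [PE-1:0][ACC_W-1:0]            acc
);
  for (genvar p = 0; p < PE; p++) begin : g_pe
    logic [ACC_W-1:0] acc_r;          // per-PE accumulator register
    assign acc[p] = acc_r;

    if (ACT_W == 1 && WGT_W == 1) begin : g_xnor
      // 1-bit lanes: XNOR products summed by a popcount.
      always_ff @(posedge clk)
        if (!rst_n)  acc_r <= '0;
        else if (en) acc_r <= (clr ? '0 : acc_r) + $countones(~(act ^ wgt[p]));
    end else begin : g_std
      // Arbitrary precision: multipliers followed by an adder tree
      // (signedness of activations is an assumption of this sketch).
      logic signed [ACC_W-1:0] sum;
      always_comb begin
        sum = '0;
        for (int s = 0; s < SIMD; s++)
          sum += $signed(act[s]) * $signed(wgt[p][s]);
      end
      always_ff @(posedge clk)
        if (!rst_n)  acc_r <= '0;
        else if (en) acc_r <= (clr ? '0 : acc_r) + sum;
    end
  end
endmodule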

5.3.1 AXI Stream Protocol


The AXI stream protocol [40] defines a standard interface for exchanging data between connected components. It
follows a master/slave design with the slave accepting data from the master. From all its interface signals, this work
only uses the five tabulated and described in Tab. 1.
Data is only transmitted when both handshaking signals, i.e. TVALID and TREADY, are asserted. By deasserting
TREADY, the receiving slave is able to assert backpressure as a means of flow control. In the case of FINN, this
enables a slower layer that is not ready to accept new data to pace its upstream data source. Other flow scenarios
include the intermittent availability of data from the preceding layer and the intermittent assertion of the ready signal
by the succeeding layer. All this necessitated the use of a finite state machine to synchronize the data transfer to and
from the stream unit. Since data coming from a preceding layer is written to the input buffer and backpressure requires
either to stop writing to the input buffer or stop reading from it (in case of a folded architecture), access to the input
buffer is also handled by the state machine.
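As a minimal illustration of this handshake (not FINN's actual interface logic; DATA_W and the buf_full signal are assumptions of this sketch), a slave that applies backpressure when its buffer is full could look as follows.

module axis_sink_sketch #(
  parameter int DATA_W = 32
)(
  input  logic              ACLK,
  input  logic              ARESETn,   // active-low asynchronous reset
  input  logic              TVALID,
  output logic              TREADY,
  input  logic [DATA_W-1:0] TDATA,
  input  logic              buf_full   // e.g. the input buffer cannot take more data
);
  logic [DATA_W-1:0] captured;

  // Deasserting TREADY applies backpressure to the upstream master.
  assign TREADY = !buf_full;

  always_ff @(posedge ACLK or negedge ARESETn) begin
    if (!ARESETn)              captured <= '0;
    else if (TVALID && TREADY) captured <= TDATA;   // one beat transferred
  end
endmodule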


Table 1: AXI interface signals used in this work.

ACLK The global clock signal


ARESETn The global active low asynchronous reset signal
TVALID Signal to indicate master is driving a valid output
TREADY Signal to indicate the slave can accept a valid input
TDATA The main data signal being exchanged

(States: Idle, Write, Read; transitions governed by ARESETn, TVALID, TREADY, INP BUF FULL and COMP DONE.)

Figure 7: State diagram of the finite state machine for controlling the MVU stream unit.

5.3.2 Finite State Machine


The control logic that is part of the stream unit implements a three-state Mealy machine to synchronize
the data movement between adjacent layers and the internal data processing. Its abstract representation is shown in
Fig. 7, which identifies the states and their transitions. The idle state is the default state assumed at the start, during
back-pressure and whenever no input is available from the preceding layer. The internal signals COMP DONE and
INP BUF FULL indicate the completion of the computation for one streamed input sequence and the filling up of the
input buffer, respectively. As shown, the state machine transitions from the idle state to the write state when new
input data is available. It remains there while data is available and the input buffer is not yet filled up. Already while
the input buffer is being filled, the written input data is presented to the PEs so that they can start or continue their operation.
When the input buffer is filled, the state machine transitions to the read state, where the buffered data is re-used for
processing with the other rows of the weight matrix as illustrated by Fig. 3. During both reading and writing, the
computation can be stalled for a number of reasons, which then cause the state machine to transition back to the
idle state.
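A much simplified SystemVerilog sketch of such a three-state controller is given below. The transition conditions are one plausible reading of the description above; the signal names are hypothetical, and the released FINN_MatrixVector_RTL state machine contains further qualifiers.

module mvu_stream_fsm_sketch (
  input  logic clk, rst_n,
  input  logic tvalid,         // preceding layer offers valid data
  input  logic out_ready,      // succeeding layer / output buffer can accept data
  input  logic inp_buf_full,   // input buffer has been filled completely
  input  logic comp_done,      // computation for one input sequence has finished
  output logic wr_en,          // write incoming data into the input buffer
  output logic rd_en           // re-read buffered data for the remaining rows
);
  typedef enum logic [1:0] {IDLE, WRITE, READ} state_t;
  state_t state, nxt;

  always_ff @(posedge clk) begin
    if (!rst_n) state <= IDLE;
    else        state <= nxt;
  end

  // Mealy outputs: wr_en/rd_en also depend on the current inputs.
  always_comb begin
    nxt   = state;
    wr_en = 1'b0;
    rd_en = 1'b0;
    case (state)
      IDLE: if (tvalid && out_ready)   nxt = WRITE;  // new input available
      WRITE: begin
        wr_en = tvalid && out_ready;                 // fill buffer, feed PEs
        if (!tvalid || !out_ready)     nxt = IDLE;   // stall / back-pressure
        else if (inp_buf_full)         nxt = READ;   // buffer completely filled
      end
      READ: begin
        rd_en = out_ready;                           // re-use buffered inputs
        if (!out_ready)                nxt = IDLE;   // back-pressure
        else if (comp_done)            nxt = IDLE;   // this input fully processed
      end
      default:                         nxt = IDLE;
    endcase
  end
endmodule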
Instead of halting the computation immediately upon back-pressure, the computation is allowed to proceed for a few
cycles while a small temporary FIFO buffer captures the produced output. This decouples the production of outputs in
bursts by the parallel PEs from the consumption of inputs in bursts by the parallel SIMD units of the next layer so that
a smoother dataflow with more effective parallelism is achieved. In the case of unavailable data from the preceding
layer, the computation must be halted.

6 Results

In this section, we evaluate the RTL and HLS implementation alternatives of the MVU with respect to different
performance metrics. In Section 6.2, we look at the resource utilization in terms of look-up table (LUT) and flip-flop
(FF) counts and in terms of block RAM (BRAM) usage. We also report the total number of execution cycles, which


Table 2: Layer and implementation parameters for analysis.

Configuration 1 2 3 4 5 6
Num. of input feature map (IFM) channels * 64 64 64 64 64
IFM dimensions 32 * 32 32 8 8
Num. of output feature map (OFM) channels 64 64 * 64 64 64
Kernel dimensions 4 4 4 * 4 4
Num. of processing elements (PEs) 2 32 2 32 * 64
Num. of SIMD elements per PE (SIMDs) 2 32 2 32 64 *

are a measure of how long each implementation takes to process the same number of inputs. In Section 6.3, we analyze
the critical path delay while in Section 6.4, we report the impact on synthesis time. Finally, in Section 6.5, we look at
a complete application of network intrusion detection. It uses a multi-layer perceptron (MLP) network composed of
four fully connected layers [41]. We start with describing the experimental setup.

6.1 Experimental Setup

We analyse both RTL and HLS designs based on the network and design parameters. A list of these parameters
is given in Table 2. In addition, the MVU was synthesized with the three different types of
SIMD elements shown in Fig. 4, which are referred to in this section as the XNOR, binary weights and standard
implementations. In the case of the standard implementation, we use four bits as the precision for inputs and weights, which is
in line with the word sizes typically used in the FINN framework [15].
For analysis, we sweep through each of these parameters while keeping the others constant. Different configurations
for the analysis are shown in Table 2. A star (“*”) in the place of a configuration parameter identifies the one being
varied. For example, in configuration 1, the number of IFM channels is modified while keeping all other parameters
at the named constants.
All the presented results were generated using the Xilinx Vivado and Vivado HLS tools for RTL and HLS, respectively.
The targeted FPGA was a Xilinx Zynq-7000 SoC, the XC7Z020-1CLG400C device of the Pynq-Z1 board. A number
of designs, both HLS- and RTL-based, were tested on the board to allow the designs to go through the complete
implementation and validation cycle. However, the reported results are the estimates obtained directly after the out-
of-context (OOC) synthesis of the corresponding design since the MVU implementations are meant to be part of an
enclosing top-level design. The OOC design flow allows units to be synthesized and analyzed independently of a
concrete top-level design [42]. All inputs, outputs and clocks were properly constrained for the synthesis. The clock
was typically constrained to a period of 5 ns, only to be increased to 10 ns in case either HLS or RTL fail to meet the
tighter target constraint. Finally, the reported synthesis times measure the complete processing of the design sources
all the way to obtaining the synthesized netlist. In the case of HLS, this comprises both HLS and RTL synthesis. The
number of PEs and SIMDs/PE is kept low when we modify the IFM and OFM channels so that we can realize circuits
with values ranging from 2 to 64 for these two parameters.

6.2 Resource Utilization

6.2.1 Look-up tables (LUTs) and Flip-flops (FFs)


Look-up tables (LUTs) form the core of the general FPGA fabric used to implement digital logic. They can also serve
as small distributed memories, which is very useful for implementing shallow and narrow memory efficiently. While
synthesizing the MVU architecture, the choice whether to utilize block RAMs (BRAMs) or LUTs for memory was
left to the synthesizer. The flip-flops are used as pipelining registers.
To analyse LUT and FF utilization, a number of parameters were modified as shown in Table 2, one at a time, while
others were held constant when measuring their impact on resource utilization. The number of input feature map chan-
nels (IFM channels) indicates the number of channels of the input image where each image has a certain dimension,
in this case 32 × 32. The resource count in terms of LUTs and flip-flops is shown in Fig. 8 for the three types of SIMD
elements shown earlier.
As can be observed, the number of IFM channels does not impact the core architecture of the MVU, as that is determined
by the number of PEs and SIMDs. This is reflected in the RTL resource count remaining constant
as we vary the number of IFM channels from 2 to 64, and in the increase in the total number of execution cycles,
indicating that the same design is being re-used for more inputs.



Figure 8: Resource utilization and latency, in terms of look-up tables (LUTs) and flip-flops (FFs), for HLS and RTL
when varying the number of input feature map channels with (a) 1-bit precision, (b) binary weights and (c) 4-bit precision.

However, the depth of the input buffer, which is given by K_d^2 × I_c / SIMD, increases with a growing number of IFM channels.
Input data to the PEs and SIMDs either comes directly from the input stream or is provided by the input buffer; with an
increase in the depth of the input buffer, HLS designs result in a more complex multiplexer architecture than RTL,
thus affecting the total resource usage significantly. A
similar effect is observed in Fig. 9 when we explore different values of the kernel dimension, in the range 3 – 9, which
also affects the buffer length.
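For instance, with the constants of configuration 1 in Table 2 (K_d = 4, SIMD = 2), the input buffer depth grows from

\frac{K_d^2 \times I_c}{\mathrm{SIMD}} = \frac{4^2 \times 2}{2} = 16 \quad (I_c = 2) \qquad\longrightarrow\qquad \frac{4^2 \times 64}{2} = 512 \quad (I_c = 64),

and it is the multiplexing required to read from this deeper buffer that grows the HLS LUT count.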
However, when sweeping through the number of output channels (number of channels in the output activation), this
increase is not seen since the buffer length and other associated logic remains the same. The only thing affected is the
number of clock cycles needed to process all the inputs which is manifested in execution time. This is visualized in
Fig. 10.
The core MVU used for Figs. 8, 9 and 10 is small, consisting of only two PEs and two SIMD lanes. For small designs,
HLS tends to use more resources than RTL, indicating that the basic control logic generated by HLS is already large in the
implemented design. However, the amortization of this logic improves as the core MVU design is expanded by increasing
the number of PEs and SIMDs. This is highlighted in Fig. 11, where the IFM dimensions are swept from 4
to 16 in steps of powers of 2. Here, the number of PEs and SIMDs is 32, indicating a large core MVU design. There
is little difference in the resource usage of HLS and RTL, and since the IFM dimensions do not impact the design
complexity (they only increase the execution cycles), the resource usage remains fairly consistent across the
range of values.
In order to see how HLS and RTL resource usage converge, we increase the number of PEs and SIMDs separately,
again keeping the other dimensions constant. Fig. 12 shows how the difference between HLS and RTL shrinks with
increasing number of PEs. A similar effect is shown in Fig. 13 for increasing the number of SIMDs. In order to
properly illustrate this relationship, Fig. 14 shows a heat map of the difference between resource utilization of HLS
and RTL as PEs and SIMDs/PE are increased, with positive values showing RTL using fewer resources and negative
ones otherwise.
The heat map for the LUTs in Fig. 14(a) shows that RTL uses fewer LUTs than HLS for smaller designs, while, as
designs are made larger by increasing the number of PEs and SIMDs, the LUT usage of the HLS design becomes lower
than that of the RTL design. However, the flip-flop usage of HLS is always higher than that of RTL for all designs.
The convergence between HLS and RTL resource usage is further emphasized by Table 4 for the configurations
given in Table 3, where we increase the number of IFM channels for a larger design with PE = 16 and SIMD = 16. The
effect of an increase in the input buffer depth and base control logic is no longer significant, and HLS and RTL use
similar amounts of resources, with a trend of HLS eventually even outperforming RTL for larger designs in terms of
LUTs. However, HLS always consumes more flip-flops (FFs). This is primarily due to HLS pipelining the generated
design aggressively in pursuit of meeting the timing constraints while also attaining an initiation interval (II) of one.



Figure 9: Resource utilization and latency, in terms of look-up tables (LUTs) and flip-flops (FFs), for HLS and RTL
when varying the kernel dimension with (a) 1-bit precision, (b) binary weights and (c) 4-bit precision.


Figure 10: Resource utilization and latency, in terms of look-up tables (LUTs) and flip-flops (FFs), for HLS and
RTL when varying the number of output feature map channels with (a) 1-bit precision, (b) binary weights and (c) 4-bit
precision.
Table 3: Configuration parameters for larger designs with an increasing number of IFM channels.

Configuration 0 1 2
IFM channels 16 32 64
IFM dimensions 16 16 16
OFM channels 16 16 16
Kernel dimensions 4 4 4
Weight precision 4 4 4
Input precision 4 4 4
PE 16 16 16
SIMD 16 16 16



Figure 11: Resource utilization and latency, in terms of look-up tables (LUTs) and flip-flops (FFs), for HLS and RTL
when varying the input feature map dimension with (a) 1-bit precision, (b) binary weights and (c) 4-bit precision.


Figure 12: Resource utilization and latency, in terms of look-up tables (LUTs) and flip-flops (FFs), for HLS and
RTL when varying the number of processing elements (PEs) with (a) 1-bit precision, (b) binary weights and (c) 4-bit
precision.



Figure 13: Resource utilization and latency, in terms of look-up tables (LUTs) and flip-flops (FFs), for HLS and RTL
when varying the number of SIMD elements per PE with (a) 1-bit precision, (b) binary weights and (c) 4-bit precision.

Figure 14: Heat map of difference in resource utilization between HLS and RTL in terms of (a) LUTs and (b) FFs
when changing the number of processing elements (PEs) and SIMD lanes (SIMDs), for 4-bit inputs.

Table 4: Resource utilization for configurations given in Table 3

LUTs FFs
Config.
HLS RTL HLS RTL
Config. #0 7528 7572 8400 5838
Config. #1 7354 7599 7560 5857
Config. #2 7919 8102 9634 5659



Figure 15: Number of block RAMs (BRAMs) for HLS and RTL when varying different layer and implementation
parameters with 1-bit precision for input feature maps and kernel weights.


The RTL implementation was designed with an II of one to begin with. As it has an explicit cycle-accurate compute
schedule by design, its only possible failure mode is the violation of the set clock period constraint.

6.2.2 Block RAMs (BRAMs)


Block RAMs are dedicated on-chip memory blocks available in FPGAs to facilitate high-speed designs. They come in
fixed sizes, e.g., the size of one BRAM tile in the target FPGA of this work is 36 Kb with a total of 4.9 Mb of BRAMs
on the device. One 36 Kb BRAM can be partitioned into two 18 Kb memories. As mentioned earlier, the choice of
whether to use BRAMs or LUTs for implementing memories was left to the synthesizer. With default settings, HLS is
found to use more BRAMs than RTL. In fact, in some cases RTL does not use any BRAM while HLS
ends up using a significant number of them. This is illustrated in Fig. 15(e). The high usage for HLS is due to
the unfavourable choice of implementing the weight memories in BRAMs, which leaves them under-utilized. A better
alternative would have been to use distributed memory built from LUTs.

6.2.3 Resource Utilization: Summary


Summarizing the resource utilization by HLS and RTL, it is clear that for smaller designs, HLS uses significantly
more resources than equivalent RTL designs. In particular, as the number of IFM channels was increased, resulting
in an increase in the depth of the input buffer, the HLS resource count increased significantly as it realized a complex
multiplexer network for the buffer access. At the same time, RTL resource utilization remained fairly constant. This
was consistent across all data types. However, as the core architecture of the MVU grows in terms of PEs and SIMDs,
the resource utilization of the two converges. In fact, HLS attains a lower LUT count for a number
of designs. With regard to flip-flops, though, HLS uses more for all types of designs, which is due to HLS trying
to attain an II of one while also meeting the timing constraints. Finally, with regard to block RAMs (BRAMs), HLS
mostly uses at least 2× more of these resources than RTL.

6.3 Critical Path Delay

The critical path delay of a circuit determines the maximum clock frequency it can run at. All the designs were
properly constrained in terms of clock period as well as signal input and output delays as described in Section 6.1. All
the results shown in this section correspond to those shown in Section 6.2 for the resource usage.
Table 5 gives the critical path delay, in terms of minimum, maximum and mean delay, of the HLS and RTL designs when
exploring a range of values of the different parameters. When changing the number of IFM and OFM channels, the critical
path delays of both RTL and HLS remain consistent, as indicated by similar minimum, maximum and mean delays. This
is because, for HLS, the critical path typically lies either in the adder tree required to add the SIMD outputs or in the
SIMD elements of each PE, which remain the same since the number of PEs and SIMDs does not change. For RTL, the critical
path lies in the control logic, which also does not change apart from adjusting to the increased input buffer length with
increasing IFM channels.
Furthermore, the RTL designs are consistently faster than HLS. They are around 45% faster for designs with 1-bit precision
inputs and binary weights. For 4-bit precision inputs, the HLS designs become significantly slower, by around
80%, compared to RTL.
As the number of PEs and SIMDs is increased, the critical path of RTL also lies either in the SIMD elements or in the
adder tree, like that of the HLS designs, and the critical path delay is directly proportional to the number of PEs and SIMDs.
Thus, the delay increases for both HLS and RTL, as indicated by the increase in the maximum and mean critical path delay
as these two design parameters are increased. In all cases, the RTL designs are 45% – 75% faster than the HLS designs.

6.3.1 Critical Path Delay: Summary


To summarize, the RTL designs are faster than the HLS designs for all types of networks. The critical path is proportional
to the number of PEs and SIMDs. Thus, changes in the number of IFM and OFM channels do not affect the critical path
delay: apart from a change in the input buffer depth, the overall design complexity remains the same. However,
increasing the design complexity by increasing either PE or SIMD, or both, leads to an increase in the critical path delay.

6.4 Synthesis Times

The overall design time needed for an RTL implementation is significantly higher than that needed for HLS. However,
it is important to analyze how long it takes to synthesize the two design approaches and to see whether the benefits
achieved with a shorter design time for HLS can be carried over to actual synthesis time.


Table 5: Critical path delay (ns) for HLS- and RTL-based designs.

HLS RTL
Parameter SIMD type
Min. Max. Mean Min. Max. Mean
XNOR 2.427 2.636 2.549 1.4 1.423 1.412
IFM channels Bin. weights 2.445 2.641 2.567 1.4 1.424 1.413
Standard 7.357 7.441 7.409 1.406 1.609 1.526
XNOR 2.458 2.715 2.570 1.333 1.42 1.394
OFM channels Bin. weights 2.476 2.651 2.560 1.421 1.424 1.422
Standard 7.384 7.384 7.384 1.529 1.609 1.545
XNOR 2.47 2.747 2.552 1.613 2.644 1.992
PEs Bin. weights 3.842 4.009 3.917 1.833 3.327 2.598
Standard 8.087 8.977 8.617 2.0 4.884 2.898
XNOR 2.244 2.936 2.667 1.437 2.644 1.859
SIMDs Bin. weights 2.698 4.531 3.387 1.684 3.231 2.336
Standard 7.063 9.374 8.037 1.814 3.072 2.453


Figure 16: Total synthesis time for HLS and RTL when varying the number of PEs and SIMDs/PE.

For the results shown earlier, the tool execution time was measured for both HLS and RTL while modifying the various network and design parameters. Only compilation and synthesis times were considered; the results are shown in Fig. 16.
It can be clearly seen that the synthesis time of HLS is at least 10× higher than that of RTL, with superlinear growth for larger designs. For the other network parameters, such as the number of IFM and OFM channels and the dimensions of the IFM and the filter kernel, HLS also takes significantly longer to synthesize, although these results are not shown here. The HLS synthesis time does not grow significantly as these parameters are modified, mainly because they either only affect the depth of the input buffer or have no impact at all on the overall design complexity of the MVU. In addition, the HLS synthesis time was a limiting factor towards synthesizing designs larger than those presented in this work.
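The measurement itself is straightforward: wall-clock time is taken around each batch-mode tool invocation. The sketch below illustrates one way to do this; the Tcl script names are hypothetical placeholders, and the commands assume the standard Vitis HLS and Vivado command-line front ends.

#include <chrono>
#include <cstdlib>
#include <iostream>

// Measure the wall-clock time of one batch-mode synthesis run.
static double time_command(const char *cmd) {
    const auto start = std::chrono::steady_clock::now();
    const int rc = std::system(cmd);                 // run the tool in batch mode
    const auto stop = std::chrono::steady_clock::now();
    if (rc != 0)
        std::cerr << "command failed: " << cmd << "\n";
    return std::chrono::duration<double>(stop - start).count();
}

int main() {
    // HLS flow: C++ synthesis plus logic synthesis, scripted in run_hls.tcl (placeholder name).
    const double t_hls = time_command("vitis_hls -f run_hls.tcl");
    // RTL flow: logic synthesis of the hand-written RTL, scripted in run_rtl.tcl (placeholder name).
    const double t_rtl = time_command("vivado -mode batch -source run_rtl.tcl");
    std::cout << "HLS: " << t_hls << " s, RTL: " << t_rtl << " s\n";
    return 0;
}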

6.5 Full Example: Network Intrusion Detection (NID)

High-performance packet-processing systems, such as those for network intrusion detection (NID), are typically implemented on FPGAs because of the inherent parallelism of the FPGA architecture and the streaming nature of the accelerators, which can deliver far lower latency. Such systems enable increased network security and may be implemented using deep neural networks (DNNs) [41]. The DNN used for this NID is a multi-layer perceptron (MLP) network of four layers, the parameters of which are given in Table 6. Depending on the throughput requirements, different configurations of PEs and SIMDs/PE can be used; the ones used in this work are also given in Table 6.
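The PE and SIMD counts in Table 6 determine how each layer's matrix-vector product is folded in time: every cycle processes SIMD inputs for PE outputs in parallel, so the number of compute folds is (IFM channels / SIMD) × (OFM channels / PE). The sketch below evaluates this relation for the configurations of Table 6; it is illustrative only, and the execution cycles reported in Table 7 are slightly higher because of pipeline fill and flush overhead.

#include <cstdio>

// Compute folds of one MVU call: each fold consumes SIMD inputs
// and produces partial results for PE outputs in parallel.
static int folds(int ifm_ch, int ofm_ch, int simd, int pe) {
    return (ifm_ch / simd) * (ofm_ch / pe);
}

int main() {
    // Layer configurations taken from Table 6: {IFM ch, OFM ch, SIMD, PE}.
    const int cfg[4][4] = {
        {600, 64, 50, 64},   // layer 0 -> 12 folds
        { 64, 64, 32, 16},   // layer 1 ->  8 folds
        { 64, 64, 32, 16},   // layer 2 ->  8 folds
        { 64,  1,  8,  1},   // layer 3 ->  8 folds
    };
    for (int l = 0; l < 4; l++)
        std::printf("layer %d: %d folds\n",
                    l, folds(cfg[l][0], cfg[l][1], cfg[l][2], cfg[l][3]));
    return 0;
}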
The results of synthesizing the four layers with the network and hardware parameters of Table 6 are presented in Table 7. We observe the same behavior as before: RTL produces smaller circuits than HLS for smaller designs, but not for larger ones. The RTL designs, however, have a shorter critical path delay and a significantly reduced synthesis time, although the synthesis time needs to be seen in the context of the larger design effort required to write RTL describing the architecture. The execution cycle counts of RTL and HLS are very similar, and both achieve an initiation interval (II) of one.
Thus, the findings of the earlier experiments are re-affirmed by this real-world use case of network intrusion detection using an MLP network.


Table 6: Configuration parameters for multi-layer perceptron (MLP) network of 4 layers for intrusion detection.

Layer               0     1     2     3
IFM channels        600   64    64    64
IFM dimensions      1     1     1     1
OFM channels        64    64    64    1
Kernel dimensions   1     1     1     1
Weight precision    2     2     2     2
Input precision     2     2     2     2
PE                  64    16    16    1
SIMD                50    32    32    8

Table 7: NID synthesis results for HLS and RTL.

              LUTs            FFs             BRAM        Delay (ns)      Synth. time       Exec. cycles
Layer         HLS     RTL     HLS     RTL     HLS   RTL   HLS     RTL     HLS      RTL      HLS   RTL
Layer #0      30744   43894   21159   12965   0     0     7.081   5.292   38’45”   5’21”    17    17
Layer #1/2    4653    5454    3276    4970    0     0     7.453   4.959   17’48”   3’59”    13    13
Layer #3      248     133     364     158     0     0     7.132   4.959   16’28”   1’43”    12    13

7 Conclusion
High-level synthesis (HLS) has become a popular alternative to the design of digital systems on the register-transfer
level (RTL) using hardware description languages (HDL). HLS provides flexibility to designers, especially to software
engineers, and allows them to describe their designs in a high-level language like C++. It reduces the design time and
requires less specialized training as compared to RTL design entry. Over the years, HLS has improved significantly,
and the performance gap between HLS and RTL has reduced.
HLS has been used successfully to describe designs of various applications like high-performance computing and
neural networks. The FINN framework is one such example. It generates DNN accelerators targeting FPGAs by
producing synthesizable C++ HLS code. These accelerators are constructed as streaming dataflow (DF) architectures
and leverage reduced-precision data types. The main compute units in FINN accelerators are instances of the matrix-
vector unit (MVU). They multiply input feature vectors with the kernel weight matrices of a neural network layer. This
work presented an alternative RTL description of this MVU and evaluated the attained design trade-offs in comparison
to the baseline HLS implementation.
The RTL description is tailored to the requirements of the MVU and is designed to achieve an initiation interval (II)
of one. By implementing an AXI-stream input/output interface, it is a drop-in replacement for the generated HLS
cores. The design comparison, which was conducted for a Zynq-7000 FPGA target platform, analyzed the resource
utilization, the critical path delay and the needed synthesis time. The resource utilization was further detailed into
look-up table (LUT), flip-flop (FF) and block RAM (BRAM) usage.
The analysis was carried out by sweeping through a number of network and design parameters, namely, the number
of input feature map (IFM) and output feature map (OFM) channels, the dimensions of the IFM and the kernel as well
as the numbers of processing elements (PEs) and SIMD slices per PE. It was shown that HLS requires significantly
more FPGA resources as compared to RTL for smaller designs due to a large base implementation to cater for I/O
protocols, buffers and overall architecture. Increasing the design complexity, i.e. the number of PEs and SIMDs,
lets the resource utilization figures of HLS and RTL designs, particularly the LUT usage, converge. However, HLS
consistently consumes orders of magnitude more flip-flops and at least 2× more BRAMs than the RTL design. This
is caused by the HLS synthesis aggressively pipelining the generated RTL code as a proactive measure for reducing
the risk of later timing violations. Finally, it was demonstrated that the resource consumption by the HLS design is
particularly sensitive to the number of IFM channels. The increase of this parameter mandates larger input buffers,
which the HLS design accesses through an unfavorably growing multiplexer structure.
The critical path of the MVU lies in the control logic for RTL and in the SIMD elements or the adder tree for HLS when the numbers of PEs and SIMDs are small. As the design grows, the critical path of the RTL designs also moves into the SIMD elements or the adder tree, and the delay becomes directly proportional to the number of PEs and SIMDs. This results in a similar delay through the circuit when the number of IFM and OFM channels is changed while the number of PEs and SIMDs is kept constant.
As the design becomes larger with an increase in the number of PEs and/or SIMDs, the critical path delay increases for both the RTL and HLS designs. In all cases, however, RTL produces designs that are 45%–80% faster than HLS.
HLS also suffers from significantly longer synthesis times than RTL, which was a limiting factor on the size of the designs that could be synthesized: it took at least 10× more time to synthesize an HLS design than its RTL counterpart. This has the potential to undo any gains from the reduced initial design-entry time whenever design space exploration or debugging demands additional complete design cycles.
To conclude, the RTL abstraction is an attractive alternative for code-generated hardware designs of frameworks
such as FINN and hls4ml, given the gains achieved in synthesis time together with some potential resource benefits
depending on the design size.

Acknowledgements
This material is based upon work supported, in part, by Science Foundation Ireland under Grant No. 13/RC/2094 P2
and, in part, by the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-
Curie grant agreement and Grant No. 754489. Any opinions, findings, and conclusions or recommendations expressed
in this material are those of the author and do not necessarily reflect the views of the Science Foundation Ireland and
European Union’s Horizon 2020 programme.

References
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,”
Commun. ACM, vol. 60, no. 6, pp. 84–90, May 2017.
[2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014.
[3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf.
Comput. Vision Pattern Recog., 2015.
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image
database,” in Proc. IEEE Conf. Comput. Vision Pattern Recog., 2009, pp. 248–255.
[5] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich,
“Going deeper with convolutions,” in Proc. IEEE Conf. Comput. Vision Pattern Recog., 2015, pp. 1–9.
[6] M. Satyanarayanan, “The emergence of edge computing,” Computer, vol. 50, no. 1, pp. 30–39, Jan. 2017.
[7] T.-J. Yang, Y.-H. Chen, and V. Sze, “Designing energy-efficient convolutional neural networks using energy-
aware pruning,” 2016.
[8] K. Persand, A. Anderson, and D. Gregg, “Taxonomy of saliency metrics for channel pruning,” IEEE Access,
vol. 9, pp. 120 110–120 126, 2021.
[9] A. Capotondi, M. Rusci, M. Fariselli, and L. Benini, “Cmix-nn: Mixed low-precision cnn library for memory-
constrained edge devices,” IEEE Trans. Circuits Syst. II, vol. 67, no. 5, pp. 871–875, 2020.
[10] A. Garofalo, M. Rusci, F. Conti, D. Rossi, and L. Benini, “Pulp-nn: accelerating quantized neural networks on
parallel ultra-low-power risc-v processors,” Philosophical Trans. Royal Society A: Mathematical, Physical and
Eng. Sci., vol. 378, Dec. 2019.
[11] N. Bruschi, A. Garofalo, F. Conti, G. Tagliavini, and D. Rossi, “Enabling mixed-precision quantized neural
networks in extreme-edge devices,” in Proc. ACM Int. Conf. on Computing Frontiers, 2020, pp. 217–220.
[12] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Quantized neural networks: Training neural
networks with low precision weights and activations,” ACM J. Mach. Learn. Res., vol. 18, no. 1, pp. 6869–6898,
Jan. 2017.
[13] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet classification using binary convolu-
tional neural networks,” Mar. 2016.
[14] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers, “FINN: A framework
for fast, scalable binarized neural network inference,” in Proc. ACM/SIGDA International Symposium on Field-
Programmable Gate Arrays, 2017, pp. 65–74.
[15] M. Blott, T. B. Preußer, N. J. Fraser, G. Gambardella, K. O’Brien, Y. Umuroglu, M. Leeser, and K. Vissers,
“FINN-R: An end-to-end deep-learning framework for fast exploration of quantized neural networks,” ACM
Trans. Reconfigurable Technol. Syst., vol. 11, no. 3, Dec. 2018.


[16] W. Meeus, K. V. Beeck, T. Goedemé, J. Meel, and D. Stroobandt, “An overview of today’s high-level synthesis
tools,” Design Automation for Embedded Systems, pp. 31–51, 2012.
[17] G. Martin and G. Smith, “High-level synthesis: Past, present, and future,” IEEE Des. Test. Comput., vol. 26,
no. 4, pp. 18–25, 2009.
[18] S. Sarkar, S. Dabral, P. K. Tiwari, and R. S. Mitra, “Lessons and experiences with high-level synthesis,” IEEE
Des. Test. Comput., vol. 26, no. 4, pp. 34–45, 2009.
[19] N. Moustafa and J. Slay, “Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-
nb15 network data set),” in 2015 Military Communications and Information Systems Conference (MilCIS), 2015,
pp. 1–6.
[20] S. W. Nabi and W. Vanderbauwhede, “FPGA design space exploration for scientific HPC applications using a
fast and accurate cost model based on roofline analysis,” Journal of Parallel and Distributed Computing, vol.
133, pp. 407–419, 2017.
[21] O. Pell and V. Averbukh, “Maximum performance computing with dataflow engines,” Computing in Science
Engineering, vol. 14, no. 4, pp. 98–103, 2012.
[22] T. S. Czajkowski, U. Aydonat, D. Denisenko, J. Freeman, M. Kinsner, D. Neto, J. Wong, P. Yiannacouras, and
D. P. Singh, “From OpenCL to high-performance hardware on FPGAs,” in Proc. Int. Conf. Field-Programmable
Logic Applicat., 2012, pp. 531–534.
[23] Xilinx, “Xilinx unified software development platform.” [Online]. Available:
https://www.xilinx.com/html docs/xilinx2020 1/vitis doc/irn1582730075765.html
[24] Intel®, “High level synthesis compiler.” [Online]. Available:
https://www.intel.com/content/www/us/en/software/programmable/quartus-prime/hls-compiler.html
[25] S. W. Nabi and W. Vanderbauwhede, “Automatic pipelining and vectorization of scientific code for fpgas,” Inter-
national Journal of Reconfigurable Computing, vol. 2019, no. 7348013, p. 12, 2019.
[26] F. B. Muslim, L. Ma, M. Roozmeh, and L. Lavagno, “Efficient FPGA implementation of OpenCL high-
performance computing applications via high-level synthesis,” IEEE Access, vol. 5, pp. 2747–2762, 2017.
[27] J. Zhao, T. Liang, S. Sinha, and W. Zhang, “Machine learning based routing congestion prediction in FPGA
high-level synthesis,” in Proc. Design, Automation Test in Europe (DATE), 2019, pp. 1130–1135.
[28] D. H. Noronha, B. Salehpour, and S. J. E. Wilton, “LeFlow: Enabling flexible FPGA high-level synthesis of
tensorflow deep neural networks,” in Proc. International Workshop on FPGAs for Software Programmers, 2018,
pp. 1–8.
[29] R. Nane, V.-M. Sima, C. Pilato, J. Choi, B. Fort, A. Canis, Y. T. Chen, H. Hsiao, S. Brown, F. Ferrandi, J. Ander-
son, and K. Bertels, “A survey and evaluation of FPGA high-level synthesis tools,” IEEE Trans. Comput.-Aided
Design Integr. Circuits Syst., vol. 35, no. 10, pp. 1591–1604, 2016.
[30] P. Coussy, D. D. Gajski, M. Meredith, and A. Takach, “An introduction to high-level synthesis,” IEEE Des. Test.
Comput., vol. 26, no. 4, pp. 8–17, 2009.
[31] E. Homsirikamol and K. G. George, “Toward a new HLS-based methodology for FPGA benchmarking of can-
didates in cryptographic competitions: The CAESAR contest case study,” in Proc. IEEE Int. Conf. Field Pro-
grammable Technology., 2017, pp. 120–127.
[32] F. Winterstein, S. Bayliss, and G. A. Constantinides, “High-level synthesis of dynamic data structures: A case
study using vivado HLS,” in Proc. IEEE Int. Conf. Field Programmable Technology., 2013, pp. 362–365.
[33] Y. Umuroglu and M. Jahre, “Streamlined deployment for quantized neural networks,” 2017.
[34] B. Jacob et al., “gemmlowp: a small self-contained low precision GEMM library,” 2017. [Online]. Available:
https://github.com/google/gemmlowp
[35] S. A. Alam and O. Gustafsson, “On the implementation of time-multiplexed frequency-response masking filters,”
IEEE Trans. Signal Process., vol. 64, no. 15, pp. 3933–3944, Aug. 2016.
[36] T. B. Preußer, “Generic and universal parallel matrix summation with a flexible compression goal for Xilinx
FPGAs,” in International Conference on Field Programmable Logic and Applications (FPL 2017), Sep 2017.
[37] M. Kumm and J. Kappauf, “Advanced compressor tree synthesis for FPGAs,” IEEE Transactions on Computers,
vol. PP, no. 99, pp. 1–1, 2018.
[38] A. Pappalardo, “Xilinx/brevitas,” 2021. [Online]. Available: https://doi.org/10.5281/zenodo.3333552


[39] M. B. et al., “FINN: Dataflow compiler for QNN inference on FPGAs,” 2021. [Online]. Available:
https://github.com/xilinx/finn
[40] “AMBA 4 AXI4-stream protocol specification,” 2010.
[41] Y. Umuroglu, Y. Akhauri, N. J. Fraser, and M. Blott, “LogicNets: Co-designed neural networks and circuits for
extreme-throughput applications,” in Proc. Int. Conf. Field-Programmable Logic Applicat., 2020, pp. 291–297.
[42] Xilinx, Jul. 2020. [Online]. Available: https://www.xilinx.com/support/documentation/sw manuals/xilinx2020 1/ug892-vivado-des

