
A scalable and efficient convolutional neural network accelerator using HLS for a System on Chip design

Kim Bjerge^{a,∗}, Jonathan Horsted Schougaard^{b}, Daniel Ejnar Larsen^{b}

^{a} School of Engineering, Aarhus University, Finlandsgade 22, 8200 Aarhus N, Denmark
^{b} Department of Engineering, Aarhus University, Finlandsgade 22, 8200 Aarhus N, Denmark

arXiv:2004.13075v2 [cs.CV] 7 Oct 2020

Abstract

This paper presents a configurable Convolutional Neural Network Accelerator (CNNA) for a System on
Chip design (SoC). The goal was to accelerate inference of different deep learning networks on an embedded
SoC platform. The presented CNNA has a scalable architecture which uses High Level Synthesis (HLS) and
SystemC for the hardware accelerator. It is able to accelerate any Convolutional Neural Network (CNN)
exported from Python and supports a combination of convolutional, max-pooling, and fully connected layers.
A training method with fixed-point quantized weights is proposed and presented in the paper. The CNNA is
template-based, enabling it to scale for different targets of the Xilinx Zynq platform. This approach enables
design space exploration, which makes it possible to explore several configurations of the CNNA during C-
and RTL-simulation, fitting it to the desired platform and model. The CNN VGG16 was used to test the
solution on a Xilinx Ultra96 board using PYNQ. The result gave a high level of accuracy in training with
an auto-scaled fixed-point Q2.14 format compared to a similar floating-point model. It was able to perform
inference in 2.0 seconds, while having an average power consumption of 2.63 W, which corresponds to a
power efficiency of 6.0 GOPS/W.
Keywords: System On Chip, FPGA, High Level Synthesis, Convolutional Neural Network, PYNQ

1. Introduction

In recent years, deep learning with Convolutional Neural Networks (CNNs) has been applied in many different fields such as image classification [1],[2], object detection [3],[4] and recognition [5]. In most cases, state-of-the-art CNN models run on a server in the cloud. However, with the increase of Internet of Things (IoT), there is a demand for embedding the deep neural networks into mobile edge computing. This is especially true for computer vision systems, where the amount of collected data is high and analyses of images must be carried out in real-time.

As CNNs continue to be applied to increasingly complex problems, low throughput, latency and energy efficiency present challenges on embedded devices with Central Processing Units (CPUs) or Graphical Processing Units (GPUs). Due to several attractive features, Field Programmable Gate Arrays (FPGAs) present promising platforms for Hardware (HW) acceleration of CNNs as reported in [6],[7],[8],[9]. CNNs that are optimized for fixed-point data or use binary neural networks achieve even better performance [10],[11],[12],[13]. In general, FPGAs provide higher performance than CPUs and have a better energy efficiency than both CPUs and GPUs.

Historically, the long design time and need for HW experts have limited the use of FPGAs. Here, high-level synthesis tools have enabled automatic compilation from imperative high-level programs to low-level specifications in a Hardware Description Language (HDL) [14]. It is, however, still a challenge to accelerate large-scale CNNs [15] on a FPGA, since model parameters typically require far more memory than the on-chip capacity of the FPGAs. Another challenge is to find an optimal configuration for a given HW accelerator design due to the long design time.

∗ Corresponding author. Email address: [email protected] (Kim Bjerge)

Preprint submitted to Journal of Systems Architecture, October 8, 2020
The scope of our work is to develop a generic and flexible architecture, which can accelerate the inference of CNNs on a Multi-Processor System on Chip design (MPSoC). It presents the design of the HW/SW architecture, i.e. the programmable logic that will reside in the FPGA fabric and the design of the software. The architecture is generic so that it can accept major CNNs such as AlexNet [16] and VGG16 [2], which can be exported from a deep learning framework such as Keras [17]. It is developed in the PYNQ [18] framework using Python and SystemC [19] in order to create a generic template-based HW accelerator. To find the optimal design, this study uses a SystemC-based simulation to explore the design space of the optimal configuration parameters of the Convolutional Neural Network Accelerator (CNNA). The design model is translated to a HDL specification using High Level Synthesis (HLS). Our paper discusses the precision, speed and power consumption of the accelerator as well as the fixed-point retraining of the CNN.

1.1. Related work

In this section, the current state-of-the-art HW-based CNNAs that inspired the architecture presented in this paper will be discussed.

The Microsoft model [20] is an architecture developed by Microsoft for accelerating CNNs for a cloud server solution with several FPGA cards. The architecture uses a top-level controller to control the data-flow with a PCI memory interface. It has multiple input buffers, one kernel weight buffer, a large array of Processing Element Arrays (PEAs) and, lastly, a data redistribution block. It uses a Direct Memory Access (DMA) channel to load data in from PC memory to the buffers. On the FPGA it uses PEA blocks to perform dot product calculations of the values in the input buffer and the weight buffer. The result of the dot product is saved into the next input buffer.

ZynqNet [21] is based on the architecture of the Microsoft model. However, it focuses on making it work for both training and inference. It is built for a System on Chip (SoC) design instead of a server solution. The proposed solution seems promising, although it appears to have a few bottlenecks due to a purely C-based HLS implementation of the solution. It uses a Circular Line Buffer (CLB) for input data handling and a memory-mapped master interface to get data from the main memory, i.e. weights and input data are transferred using the memory-mapped interface.

FINN-R [11] is an end-to-end deep-learning framework for fast exploration of Quantized Neural Networks (QNNs). It is a framework built upon the FINN accelerator [22], which is a QNN built for FPGA. FINN-R consists of a cascade of multiple layer accelerators that are optimized for a pipelined architecture. This design reduces the transfer of data between the main memory and the accelerators. The difficult part is to balance the layered accelerators in order to prevent bottlenecks or resource waste. However, the framework does not solve the problem of different throughput for each layer. FINN-R optimizes the generated HW using HLS, allowing fast exploration of QNNs to create the perfect accelerator for a specific FPGA target.

To accelerate and develop CNNs on reconfigurable HW (FPGAs), a survey of the current state-of-the-art toolflows was published in 2018 by Venieris et al. [23]. The survey compares the performance of fpgaConvNet [24, 25], DNNWEAVER [26], Angel-Eye [27], DeepBurning [28] and Caffeine [29]. The listed toolflows cover mapping of the classic AlexNet and VGG16 to the Xilinx Zynq or UltraScale platforms. The deep learning framework Caffe by Berkeley AI Research is the most widely supported front end for these state-of-the-art toolflows.

The above-mentioned toolflows can be divided into two main categories of architectures: streaming architecture and single computation engine. FINN-R, fpgaConvNet and DeepBurning are in the category of streaming architectures. This chained and pipelined architecture can achieve high performance, but the optimal HW design needs to be found and synthesized for each specific CNN. The single computation engine, on the other hand, executes CNN layers sequentially, which means that the same HW engine is able to handle many different CNNs. The engine is controlled from SW, and data must be moved from CPU memory to the on-chip FPGA memory during processing of the CNN layers. The advantage of this approach is that the same HW can be used for several CNN architectures without reconfiguration.

Tunable parameters must be optimized for the available resources of the FPGA devices, which is the case for DNNWEAVER, Angel-Eye and Caffeine. Angel-Eye uses a compiler to translate the input CNN to custom instructions. A similar approach is used by DNNWEAVER, which utilizes a macro dataflow instruction set architecture and supports FPGAs from both Xilinx and Altera. Of the above-mentioned toolflows,
only fpgaConvNet supports special layers with irregular dataflow, including the inception, residual and dense blocks that are required for the newest deep neural networks such as ResNet [30], DenseNet [31], InceptionNet and GoogLeNet [32].

The single computation engines build their architectures around a PEA with a buffer for handling input data, which could be a CLB or a row buffer. A streaming design, such as FINN-R, uses the output to feed the next accelerator. Consequently, the streaming architecture has a large memory to cache layered outputs directly to the next input buffer so that the data are ready for the next CNN layer. However, due to limited internal memory, this approach is not feasible for all FPGAs. Therefore, there is a need for reloading the input data from the main memory. An example of this is the Microsoft model. Other architectures use the main memory to cache the data between layers. FINN-R and fpgaConvNet do this for each block of layers.

The CNN developed in this work has some elements in common with the solutions presented above. It uses the main memory to store data between layers and uses the single computation engine approach. In addition, the architecture is built around a PEA with two buffering systems: one for the weights and one for input image data, the latter of which uses a CLB. The above architectures are very similar, but the major difference lies in the details of the CLB, which enables efficient pipelining and data alignment.

The CNN architecture in our work supports any input size and layer depth, stride, zero padding and window size. It makes the accelerator more flexible and enables it to run nearly any CNN model that uses convolution, pooling and fully connected layers. It can be used with most CNNs during run-time inference without the need for recompiling. The accelerator is developed to work with PYNQ [33],[18] and uses an Application Programming Interface (API) similar to Keras [17]. In summary, this paper makes the following contributions.

• We present a generic CNN architecture consisting of a single computation engine with five core elements (weight buffer, data buffer, PEA, pooling block and output handler) to perform FPGA acceleration of CNN models.

• Stitching is used for convolutional layers that are too large to execute in a single processing pass and is used to split complex convolutions into sub-convolutions.

• Dynamic auto-scaling is used during training to minimize the accuracy loss between the floating-point and the quantized fixed-point accelerator.

• A template-based SystemC design with an executable model is proposed for design space exploration. The template model is synthesized to a Xilinx IP core with HLS and controlled from the host CPU using the PYNQ framework and Python.

2. Design methods

In this section, we will briefly describe the design methods and concepts used as a basis for designing and implementing the architecture for the CNNA. In our work, SystemC is used with the design flow described in [34, ch. 1],[35]. It is an efficient way in which an IP can be written and verified using HLS. SystemC is able to model the HW structure and concurrency in a more optimal manner than pure C and C++. It is an IEEE standard (IEEE Std 1666-2011) [19], based on a C++ template library made for HW/SW co-design.

The use of templates makes the IP core configurable and portable to explore different solutions and platforms, whereas custom designs are less flexible. It is much faster to recompile and simulate a template-based IP core than to write a custom IP that may be more optimal. By use of SystemC the desired HW architecture can be controlled and designed via modules with parallel clocked threads. With HLS directives it is possible to control the synthesized threads and achieve a desired unroll, pipeline and iteration interval of the synthesized RTL code.

PYNQ (Productivity for Zynq) [36] is an open-source framework for creating applications on a ZYNQ MPSoC. The system design in this work is based on PYNQ for controlling the IP directly from Python. This framework is divided into three layers: application, SW and HW.

The application layer, which hosts the user-code, is described in [36, ch. 22]. This is usually Python code in the form of a Jupyter notebook that runs on the ARM CPU inside the Zynq MPSoC. The middle layer in the PYNQ framework is the SW layer.
This layer contains the Python libraries and the interaction with the IP inside the FPGA through the OS drivers. Several drivers are provided through the PYNQ libraries for interacting with the IP. The interface is called an overlay and is used to program the FPGA and manage the IP. The last HW layer in the PYNQ framework is the bit-file programmed into the FPGA. The interaction between the SW layer and the HW layer is done using DMA or memory-mapped interfaces.

3. System architecture

The SoC design consists of three main elements: the FPGA (i.e. the Programmable Logic (PL)), the dual-core CPU and memory (i.e. Dynamic Random Access Memory (DRAM)). The goal is to run a CNN consisting of convolutional, max-pooling and fully connected layers computed in the same IP core inside the FPGA logic. The responsibility of the CPU is to let the user control the HW acceleration so that the IP core is processing CNN layers in the correct sequential order. Figure 1 shows that the system uses DMA to transfer data and weights between the CPU and the IP core accelerator. The CPU controls the DMA data block transfer and converts the memory interface to the streaming interface of the IP core.

Figure 1: Block diagram of the system architecture covering CPU (Zynq UltraScale+ MPSoC), memory (DDR RAM), HW IP core accelerator (CNNA), five DMAs for inputs (X), outputs (Y), weights (W), control (CTRL) and splits (XBUF).

The system interacts in different manners depending on which scenario it needs to execute. There are three main scenarios: preprocessing, initialization and inference.

Preprocessing. The first scenario converts the weights to fixed-point and realigns and scales the weights so that they are ready for the system to use. Preprocessing also calculates parameters such as layer output size and layers to be split, which can be done offline on any computer. The weights are transformed from floating-point to fixed-point representation in the chosen format, and aligned and rounded off correctly, as described later. Finally, the weights are saved in an h5-file, which is the standard format for storing weights in Keras [17], and can be transferred to the HW target.

Initialization. The HW target needs to be configured and initialized for a particular fixed-point resolution by using the synthesized bit-file of the optimized CNNA. The bit-file contains the CNNA IP core and the interconnection setup to the CPU for the specified HW target. This is done by using a specification of the model in the form of a JSON-file and an h5-file containing the weights, which are already realigned and quantized in the preprocessing scenario. It starts by calculating the buffer size and getting the properties of the loaded CNNA. When this is done, the SW allocates the needed resources and readies the SW for inference by allocating the buffers for each layer in the CNN.

Inference. When using the system, predicting an image will be the most commonly used task. This task is shown in the sequence diagram in figure 2. Here, the user calls the method predict, which returns the predicted class of the image when the inference is done. The image, which is the parameter to the method predict, is stored internally in a contiguous array, i.e. an array which can be used by the PYNQ DMA library. Depending on the CNN, several layers are executed in the correct order, i.e. convolution, pooling or fully connected layer. All parameters controlling the CNN execution are sent at the start of the predict method.

The convolution is done by the CPU initiating four different tasks in parallel. It sets up the data transfer for the input control data CTRL, the input data X, the output Y and XBUF. Each of these data transfers is handled by the DMA, which streams the content of the buffer from DRAM to the CNNA. The fully connected layer is executed similarly to both pooling and convolution. It starts four different DMAs: one for each of the input data X, the weights W, the output Y and the configuration through the CTRL.
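To make the three scenarios concrete, the following is a minimal user-level sketch of how the inference scenario could look from Python. The class name, file names and the predict() flow are illustrative assumptions based on the description above; only Overlay and allocate are actual PYNQ calls, and the per-layer DMA transfers are left as comments.

```python
# Illustrative sketch of the SW control flow described above (not the authors' code).
# Overlay and allocate are PYNQ APIs; the class, file names and JSON layout are assumed.
import json
import numpy as np
from pynq import Overlay, allocate

class CNNAccelerator:
    """Keras-like wrapper around the CNNA IP core."""

    def __init__(self, bitfile, model_json, weights_h5):
        self.overlay = Overlay(bitfile)         # initialization: program the FPGA
        with open(model_json) as f:
            self.layers = json.load(f)          # per-layer parameters and split counts
        self.weights_h5 = weights_h5            # pre-quantized weights from preprocessing

    def _run_layer(self, layer, x_buf):
        """Start the CTRL/W/X (and XBUF) DMAs for one layer and collect Y."""
        y_buf = allocate(shape=(layer["output_size"],), dtype=np.int16)
        # dma.sendchannel.transfer(...) / dma.recvchannel.transfer(...) would go here
        return y_buf

    def predict(self, image):
        x_buf = allocate(shape=(image.size,), dtype=np.int16)   # contiguous DMA array
        x_buf[:] = image.ravel()
        for layer in self.layers:               # conv, pool and FC layers in order
            x_buf = self._run_layer(layer, x_buf)
        return int(np.argmax(x_buf))            # index of the predicted class
```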
Figure 2: Sequence diagram of the interaction between SW control (PYNQ) and HW accelerator (FPGA) during inference.

Two interfaces are used. The streaming interface used for the DMA is implemented with a functional deterministic guarantee so that no race condition can happen, which makes the entire IP very stable. The AXI streaming interface is used for transmitting the data between the CPU and IP. The other, AXI-Lite [37], interface, which is a memory-map, is only used for reading status registers.

3.1. Software control and stitching

Some convolutional layers are too large to be processed as a single CNNA iteration. This means that they are split into several sub-convolutions. However, the result is returned from the IP core with the depth first and thus needs to be stitched together with the later result. This is done through the IP core, which has a DMA channel (XBUF) for this purpose, as shown in figure 1. An example of stitching can be seen in figure 3. The shown example illustrates the output of two pixels, a and b, from a convolution with a depth that needs to be split.

Figure 3: Example of buffer stitching with a split of three shown for a single pixel.

The stitching is done by using two equally sized buffers, which both have a size equal to the expected output size. The size in this example is six. The first convolution only uses the first buffer as the output, and pixels a and b are both updated by the DMA. However, only the first third of the depth is calculated in the first convolution. The second convolution calculates the next third of the output. However, these outputs need to be stitched in between the previous outputs. This is done by using the output of the first convolution as a stitch buffer. The IP core is informed to use the first part of each pixel and appends the result to each pixel depth-wise. The result of this stitching is sent to the output buffer[1]. The third convolution takes two thirds from the stitch buffer and the last third from the output for each pixel. The output of the stitched convolution is in the buffer, which is the one used as output in the last stitching.

However, the fully connected layers can also be too large to be processed at once, in which case the splits are handled differently. A fully connected layer generates a single value per output. The buffer will be filled with values from the first split when it runs the first time. The second time it runs, it will get the next outputs, which must then be placed after the first split in the buffer. All the splits need to have an adjusted number of bytes to ensure that the correct amount of data is received. When all the splits have been processed, the result is in the same buffer. This means that the fully connected layer only needs a single buffer, contrary to the convolution layers, which need two.

3.2. CNN hardware accelerator

Figure 4 shows the architecture overview of the main elements of the CNNA. The CNNA works as an accelerator for a single layer at a time.
Figure 4: Illustration of the CNNA architecture with five streaming buffer inputs for control, stitching, weights and data as well as one result output buffer. A number of Processing Elements (PE) are used to accelerate the pooling and multiplications for each neuron in the network. The CNNA will be executed several times, typically once for each layer during inference.

This means that the accelerator needs to be reconfigured for each layer, which is done through the streaming interface CTRL.

The streaming interface W is used to load the weights, which can consist of multiple kernels, and cache them in the weight buffer. This means that they can be used multiple times without reloading from DRAM. The streaming interface X is used to stream in the input data, which can either be an image or the output from the previous layer. X_buf is an interface that is used when a convolutional layer is split into several splits, which need to be stitched together correctly. The last streaming interface is Y, which streams out the output values of the operation.

The accelerator is built for three different operations: convolution, pooling and fully connected layers. During convolution acceleration, it uses the weight buffer, the data buffer and the PEA. The pooling operation is done by using only the data buffer and the pooling block. When executing the fully connected layers, the weight buffer and the data buffer are simply set to forward the data directly to the PEA, thus generating a dot product of the two. The following sections describe the five core elements comprising the design of the CNNA.

3.2.1. Weight buffer

The weight buffer is used for caching the weight kernels. This caching is necessary, because the convolution requires the kernels to be used once for each output pixel.

An illustration of the weight buffer module can be seen in figure 5, which shows the modules inside of it. The Iteration Interval (II) and Bandwidth (BW) of the weight input package are changed during resizing, as illustrated in the figure. It shows how the resize module changes the BW from the input BW, BW(in), by a factor of resize_factor. It also splits the raw package into smaller packages. The realign module splits the raw package into smaller packages. The splitter separates the data stream into N different stream buffers, each of which has the same BW as the resized BW, BW(resize). Each stream buffer sends the kernel to a Processing Element (PE) X times.

The realignment in the weight buffer is complicated. Firstly, this is due to the bias value, which uses a complete package. Secondly, it is complicated because the kernels need to match the order in which the three-dimensional window comes from the data buffer, i.e. have the same positions and depths as the data buffer. In figure 5, the first package contains the bias values transferred to the weight buffer. It shows that this single value uses a complete resized package. This is followed by N other bias packages. After all bias packages are sent, the weight packages are sent. In this example, the weight packages contain a 3 × 3 × 4 window. The stream buffers receive the values one after the other.

Figure 5: Weight buffer with alignment, illustrating how kernel weight data are aligned so that a specific kernel is placed in the correct spot. The illustration shows how the weight data of kernel 0 are sent to stream buffer 1. The II and BW of the weight input package are changed by a factor of resize_factor. The yellow part of the image cube shows which part of the kernel is sent.

3.2.2. Data buffer

The data buffer is used to handle the input data stream and create the windows on which the convolution or pooling is carried out.
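Before going into the buffer hardware, the following behavioural sketch shows, in plain NumPy, what the data buffer is expected to produce: window_size × window_size × depth windows from a raster-order image, given the row size, depth, stride, window size and zero-pad parameters described below. It is a software reference only, not the line/shift-buffer implementation; the function name and defaults are illustrative.

```python
# Software reference for the windows the data buffer must produce (not the HW design).
import numpy as np

def clb_windows(image, window=3, stride=1, zero_pad=1):
    """Yield window x window x depth slices in the order the PEA consumes them."""
    rows, cols, depth = image.shape
    padded = np.zeros((rows + 2 * zero_pad, cols + 2 * zero_pad, depth), image.dtype)
    padded[zero_pad:zero_pad + rows, zero_pad:zero_pad + cols, :] = image
    for y in range(0, padded.shape[0] - window + 1, stride):
        for x in range(0, padded.shape[1] - window + 1, stride):
            yield padded[y:y + window, x:x + window, :]

image = np.arange(5 * 5 * 3).reshape(5, 5, 3)     # a small quadratic RGB-like cube
print(sum(1 for _ in clb_windows(image)))         # 25 windows with stride 1 and pad 1
```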
An image typically consists of three channels, RGB, which can be visualized as a three-dimensional cube. Such a three-dimensional image is illustrated in figure 6. The image is stored in raster order, i.e. first pixel (0,0) channel 0, then channel 1 of the same pixel, followed by the last channel. This is followed by the same three channels for the pixel one row down, which means that the Z-axis is processed first, then the Y-axis second and the X-axis last. Raster order is the order in which the image data is streamed to the CNNA.

The CLB can be considered the brain of the CNNA, because it allows it to increase the BW and removes the need to realign the input data for each layer. The parameters that the actor can set through the control interface are:

• Row size — The row size of the input image, i.e. the Y-axis length. It is assumed that the image is quadratic.

• Depth — The depth is the number of channels of the input image, i.e. the Z-axis of the image. This should be dividable by the BW.

• Stride — Stride is a convolution parameter.

• Window size — The size of the window. If the window size is 3, the real window size would be 3 × 3 × depth. This is also a convolution parameter.

• Zero pad — A convolution parameter setting the size of the zero padding around the image.

• Replay — How many times the CLB should resend a single window.

After setting up the CLB with the parameters through the control interface, the image data can flow into the CLB. The CLB consists of two parts: a line buffer for storing the N_lines previous lines and a shift buffer for storing the N_pixels previous pixels for each line. These parts are explained in detail below.

Line buffers. The first module in the CLB, where the image data is ordered and stored, is the line buffer. This module streams one row of the image with all channels at the same time. The number of line buffers is equal to the maximum window size minus one, N_line buffer = window_size − 1. This is because only the N previous lines are needed to construct a window. The illustration of the data flow of the line buffer in figure 6 shows that the N − 1 previous lines are stored inside the line buffer and sent out individually. This means that the BW increases by a factor of window_size. It is also indicated that the buffer is stored circularly. This is handled by the pointer, which can be seen in figure 6. This pointer will increase each time a new input is received. After receiving a whole line, the line buffers will rotate, i.e. the first line will be moved to the back, the second line will be pushed forward and the pointer will be reset. This is done by multiplexing logic in the implemented design.

Shift buffers. After the line buffers, the data reaches the shift buffers. These buffers are used for getting the N previous pixels from each line, i.e. having all the pixels needed for a convolution window, as shown in figure 6. The shift buffers have another important function as well. They replay the window for the convolution if there are not enough PEs to run all the dot products in the convolution operation at once. The shift buffers are on-chip RAM-based shift buffers and consist of two pointers. The write pointer is essentially controlled by counting up whenever data is written and moving it back to the start of the shift buffer when the end has been reached. The read pointer, however, is controlled by logic, which tells the shift buffer that it needs the N previous samples. This will be handled by the shift buffer, which also calculates its new positions.

Figure 6: An illustration of the flow of data through the CLB. It consists of two parts: a line buffer for storing the N_lines previous lines and a shift buffer for storing the N_pixels previous pixels for each line. The leftmost image cube illustrates a single pixel 202, which is written to the line buffer. The middle cube illustrates which data is saved in which line buffer and how the new line replaces the first line. The rightmost cube illustrates what data is in the shift buffers. The missing part illustrates how much more data is needed from the line buffers before it has a complete window in the shift buffer. The read pointer on the shift buffers is used for getting the N previous samples and for generating the output from the shift buffers.

3.2.3. Processing element array

The heart of the CNNA is the PEA. Each PE performs HW acceleration of a dot product with a small range of activation functions, e.g. linear or ReLU [38]. The PE operation can be written as shown in equation 1, which is a dot product of the two equal-length vectors $\vec{x}$ and $\vec{w}$:

$$ PE(\vec{x}, \vec{w}) = f\left(\sum_{i=0}^{N-1} x_i \cdot w_i\right) \tag{1} $$

Each PE receives data frames in pairs from the weight buffer and the data buffer, i.e. one from each. The acceleration of the PE is done by running the multiplications in parallel and totaling the results afterward, as illustrated in figure 7. This data is dotted together and followed by the activation function. Figure 7 shows how the PE has two inputs: W, the weight input, and X, the data input.
When the PE has received a frame on both W and X, the data frames are dotted together and the bias is added to the result. The result is forwarded to the next part, which is the pipelined PE summer. This part accumulates the result, which it has received from the PE dot product. It will keep on accumulating until it receives the last flag. When this happens, it will multiply the accumulated value by a factor set by the actor, i.e. the control interface, and apply the activation, which is also set by the actor through the control interface. Lastly, it is streamed out through the port Y, and the accumulated result is reset. The whole PE is synthesized with an II of one, which means that new inputs can be processed in each clock cycle. HLS tries to solve this problem with a summer tree or a cascade of Multiply-Accumulates (MACs) processed in a long pipeline.

Figure 7: Illustration of the PE design. It consists of a parallel multiplier array followed by a summing tree and lastly an accumulator, scale and the activation function logic. The PE dot product and summer are executed in a parallel and pipelined order.
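As a behavioural reference for equation 1 and the description above, the sketch below models one PE in NumPy: frames from the weight and data buffers are dotted pairwise, accumulated until the last frame, then scaled and passed through the activation. Bias handling and pipelining are simplified; the function signature is illustrative.

```python
# Behavioural model of a single PE (equation 1); pipelining and bias packaging simplified.
import numpy as np

def pe(x_frames, w_frames, bias=0.0, scale=1.0, activation="relu"):
    """x_frames/w_frames: equal-length vectors streamed in pairs from the two buffers."""
    acc = 0.0
    for x, w in zip(x_frames, w_frames):       # one dot product per received frame pair
        acc += float(np.dot(x, w))             # parallel multipliers + summing tree
    acc = (acc + bias) * scale                 # on the last flag: add bias, apply scale
    return max(acc, 0.0) if activation == "relu" else acc

x = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])]
w = [np.array([0.1, 0.1, 0.1]), np.array([0.2, 0.2, 0.2])]
print(pe(x, w, bias=0.5))                      # (0.6 + 3.0 + 0.5) * 1.0 = 4.1
```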
3.2.4. Pooling

The pooling element is used to accelerate the
pooling operation. It gets its input from the data
buffer and sends the output to the output handling
part, thus bypassing the PEA, which is not used in
pooling.
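A minimal sketch of the pooling behaviour on one CLB window is shown below (max-pooling, with min- and average-pooling as the obvious variants); it is a software reference, not the streaming RAM-based implementation described next.

```python
# Software reference for the pooling block: reduce one CLB window to one output pixel.
import numpy as np

def pool_window(window, mode="max"):
    """window: window_size x window_size x depth; returns depth values for one pixel."""
    if mode == "max":
        return window.max(axis=(0, 1))
    if mode == "min":
        return window.min(axis=(0, 1))
    return window.mean(axis=(0, 1))            # average-pooling

w = np.arange(2 * 2 * 3).reshape(2, 2, 3)       # a 2 x 2 x depth window from the CLB
print(pool_window(w))                           # [ 9 10 11]
```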
The reason for placing the pooling operator inside the CNNA is reuse of the CLB hardware. The pooling accelerator receives its input directly from the CLB, and the output from the pooling goes directly to the output.

When looking at figure 8, it can be seen that the pooling block consists of logic for handling the pooling operation, e.g. max-pooling, and RAM for buffering a single pixel. The pooling logic is controlled by the actor and is used for setting the depth of the current image and the size of the window, e.g. 2 × 2 × depth or 3 × 3 × depth. The last parameter controls what type of pooling operator should be run, i.e. max-, min- or average-pooling.

Figure 8: Illustration of the logic of the pooling accelerator. The data is received as a small slice (the purple cube), which is the output of the CLB. It compares the input data with the current pooling. If it is the first value, it saves it. After that, the pooling logic is executed. If it is max-pooling, it checks if the input is larger than the stored value, and stores the largest one. When a whole window has been run through the accelerator, it will start streaming out the calculated pixels. These steps are repeated for all the windows created by the CLB.

3.2.5. Output handler

The output handling element plays a major role in getting the output of the CNNA into the correct shape and alignment before streaming it out through interface Y. It merges the results from the PEA when it is used, and if the data needs to be stitched with X_buf, i.e. if a convolution operation has been split into more than one convolution and needs to have the old output interleaved into the new output. Splitting the convolution happens when too many kernels need to be stored in the weight buffer. It also handles the output of a pooling operation, which simply means forwarding the output of the pooling element.

4. Training for fixed-point

To overcome the challenge of the CNNA using fixed-point values, an emulation of fixed-point needs to be made in order for the CNN to be trained and calculated correctly. This is mostly due to the large dynamic range of the weights.

This emulation is shown in equation 2, where $Q_{[I.F]}(x)$ is the fixed-point representation of x in the fixed-point format Q[I].[F] [39]. Here, I is the number of integer bits and F is the number of fractional bits. First, the number x is scaled up by $2^F$ and then rounded off to resolve the limited resolution of fixed-point numbers. This is followed by what is essentially a saturation of the number to the range of the fixed-point number, i.e. between $-2^{I+F-1}$ and $2^{I+F-1}-1$. Lastly, the number is scaled down by the same factor it was scaled up by. This results in a value that can be interpreted correctly by the CNNA.

$$ Q_{[I.F]}(x) = \max\left(-2^{I+F-1}, \min\left(2^{I+F-1}-1, \operatorname{round}(x \cdot 2^F)\right)\right) \cdot 2^{-F} \tag{2} $$

4.1. Quantized weights

The weights are quantized as a constraint to the optimizer, which executes the backpropagation [40]. This constraint is set to quantize all weights after each update using equation 2. This results in the Stochastic Gradient Descent (SGD) update formula shown in equation 3, where $Q_{[I.F]}(x)$ is the quantization function shown in equation 2, $W_{ij}^{(l,t-1)}$ is the previous weight, $W_{ij}^{(l,t)}$ is the new weight, and $\alpha$ is the learning rate.

$$ W_{ij}^{(l,t)} = Q_{[I.F]}\left(W_{ij}^{(l,t-1)} - \alpha \nabla W_{ij}^{(l,t-1)}\right) \tag{3} $$
However, this introduces a problem that makes the training freeze. The cause of the problem is that the size of the update to the weights is too small to move from one quantized value to another. The effect of a too-small update change can be seen in the example shown in equation 4. It is not possible to update a single weight in Q2.6 with a value smaller than the smallest quantized value, in this case $2^{-6} = 0.015625$. The example shows a weight with value 1.671875 being updated by a too-small value: 0.0015624. Updating the quantized weight value does not result in a change, which causes the training to freeze.

$$ W_{ij} = Q_{[2.6]}(1.671875 - 0.0015624) = 1.671875 \tag{4} $$

To solve this issue, an extra copy of the weights W is saved so that the forward pass, i.e. inference, is calculated using the quantized weights, and the SGD is calculated using unquantized weights. This means that the weights do not get stuck between quantization steps. This is also known as lazy update SGD [41]. In this way, the weights W are saved and the quantized weights WQ are used for the forward pass, which can be seen in equations 5 and 6.

$$ W_{ij}^{(l),t=\tau} = W_{ij}^{(l),t=\tau-1} - \alpha (\nabla W)_{ij}^{(l),t=\tau-1} \tag{5} $$

$$ WQ_{ij}^{(l),t=\tau} = Q_{[I.F]}\left(W_{ij}^{(l),t=\tau}\right) \tag{6} $$

By using these equations, the optimizer can train the CNN even though the changes are too small to be significant when quantized.

4.2. Dynamic range scaling

The small kernels in the first convolutional layers of the CNN VGG16 have large weights, i.e. close to 1 or −1, but the fully connected layers have very small weights that only use the lowest bits, even in Q2.14. This means that the CNN needs more fractional bits. However, this is possible to solve by dynamically scaling the weights and the output. This is carried out with integers in [42].

The following will show how this can be carried out on fixed-point values as well. It has been found that the dynamic range of each kernel is almost the same for each layer. This knowledge can be used to add scaling to each layer in order to change the dynamic range of the weights. For example, based on the given weights

$$ W = \begin{bmatrix} 0.11 & 0.024 & -0.30 \\ -0.05 & 0.002 & 0.1 \end{bmatrix} $$

and a fixed-point format Q[I].[F], which, for simplicity, is able to store a maximum value of 1, denoted $Q^{MAX}_{[I.F]}$, a scaling can be found. To find the scaling needed for a better dynamic range, equation 7 can be used. This equation takes the maximum absolute value of the weights and divides it by the maximum value of the fixed-point format.

$$ scale^{(l)} = \frac{\max_i |W_i^{(l)}|}{Q^{MAX}_{[I.F]}} = |-0.30| = 0.30 \tag{7} $$

The scaled value of the weights can now be calculated as shown in equation 8, which divides the weights by $scale^{(l)}$. This shows that the maximal absolute value is now −1.

$$ W^{(l)}_{scale} = \frac{W^{(l)}}{scale^{(l)}} = \begin{bmatrix} 0.367 & 0.08 & -1 \\ -0.167 & 0.00667 & 0.333 \end{bmatrix} \tag{8} $$

Using this scale factor, the output of a layer is calculated as shown in equation 9, which has an added multiplication of the quantized value of the scale factor, where $z^{(l)}_{scale}$ is the scaled output of layer l, $W^{(l)}_{scale}$ are the scaled weights, $\vec{a}^{\,l-1}$ is the output from the previous layer and $scale^{(l)}$ is the scale factor of the layer l.

$$ z^{(l)}_{scale} = Q_{[I.F]}\left(W^{(l)}_{scale} \cdot \vec{a}^{\,l-1}\right) \cdot Q_{[I_{scale}].[F_{scale}]}\left(scale^{(l)}\right) \tag{9} $$

Because of the quantization, it cannot be guaranteed that the outputs are the same, but they should be very similar, i.e. $z^{(l)} \simeq z^{(l)}_{scale}$. The main difference between the scaled and unscaled version is that $z^{(l)}_{scale}$ is better suited for the bit range of the fixed-point format than $z^{(l)}$.
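Putting the pieces of this section together, the sketch below shows one lazy-update training step (equations 5 and 6) and the auto-scaled forward pass (equations 7–9) in NumPy. q_fixed is the quantization function of equation 2, repeated for self-containment; the shapes and learning rate are illustrative, and the scale factor is quantized with the same format for simplicity.

```python
# Lazy-update SGD (eq. 5-6) with per-layer auto-scaling (eq. 7-9); a reference sketch only.
import numpy as np

def q_fixed(x, i_bits=2, f_bits=14):
    lo, hi = -2 ** (i_bits + f_bits - 1), 2 ** (i_bits + f_bits - 1) - 1
    return np.clip(np.round(np.asarray(x) * 2.0 ** f_bits), lo, hi) * 2.0 ** -f_bits

def lazy_update_step(w_float, grad, lr=0.01):
    """Eq. 5: update the unquantized master copy; eq. 6: quantize only the forward copy."""
    w_float = w_float - lr * grad
    return w_float, q_fixed(w_float)

def scaled_forward(w_float, a_prev, q_max=1.0):
    """Eq. 7-9: scale the layer weights into the usable fixed-point range."""
    scale = np.max(np.abs(w_float)) / q_max                 # eq. 7
    w_scaled = w_float / scale                              # eq. 8
    return q_fixed(w_scaled @ a_prev) * q_fixed(scale)      # eq. 9

W = np.array([[0.11, 0.024, -0.30], [-0.05, 0.002, 0.1]])   # the example weights above
a = np.ones(3)
print(scaled_forward(W, a))    # close to the floating-point result W @ a = [-0.166, 0.052]
```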
5. Design space exploration

The template-based IP core written in SystemC has a number of parameters that must be selected in order to achieve an optimal solution. The executable model of the IP core gives an approximate estimate of the time performance and FPGA resource usage. When optimizing the IP core for a FPGA, it can be an enormous task to generate a design and find the optimal design parameters. It takes approximately one hour to synthesize the HLS code to RTL code. If this was to be carried out for all possible combinations of parameters, it would take weeks, months or even years, since the architecture has such a large number of parameters, e.g. BW between modules, FIFO depth, the number of PEs, etc. The high-level model of the IP in SystemC can be simulated faster than the RTL code by a factor of 50–200 times, depending on the size of the accelerator. It is possible to use a heuristic approach to find the optimal solutions for a certain fixed-point resolution constrained by the given target device by evaluating several simulated solutions.

The design parameters are used for tuning the CNNA design in order to find a balance between precision, speed and resources. The CNNA tuning parameters used are as follows:

datasize^(W). The word length of the fixed-point data format in bits, i.e. I+F. Has an impact on precision.

PE_BW{×}. The internal BW with an element size of datasize^(W) used by the CNNA.

PE_N. The number of PEs; the PEA is limited by the size of the FPGA fabric.

DB_BW{×}^(output). The output BW multiplier after the CLB. Normally this will be set at a value equal to CLB_N^(rows), but it can be set to a lower number in order to allow the PE to run with lower BW and potentially have a bigger PEA. The internal BW in the PE will be DB_BW{×}^(output) · PE_BW{×}, with an element size of datasize^(W). The BW used inside the weight buffer is also equal to DB_BW{×}^(output) · PE_BW{×}.

kernels_N^(R^{3×3×512+1}). Used to calculate the weight buffer size

$$ WB^{(buffer)}_{size} = \left(\frac{3 \times 3 \times 512}{DB^{(output)}_{BW\{\times\}} \cdot PE_{BW\{\times\}}} + bias_{size}\right) \cdot kernels_N^{(\mathbb{R}^{3\times3\times512+1})} $$

Here (3 × 3 × 512) is chosen from the largest layer in the CNN, which, in this case, is the VGG16 [2].

The tuning of the CNNA can be expressed as a vector, $\vec{\beta}$, with 5 hyper-parameters, as shown in the equation below:

$$ \vec{\beta} = \left(datasize^{(W)},\ PE_{BW\{\times\}},\ PE_N,\ DB^{(output)}_{BW\{\times\}},\ kernels_N^{(\mathbb{R}^{3\times3\times512+1})}\right) $$

To measure the performance of the different CNNA configurations, a simulation was made. It consisted of five different elements: two pooling operations, two convolution operations and a single fully connected operation. They were executed individually but evaluated together.

Figure 9: Results of the C-simulation of the combined test of all fixed-point candidates, showing the average resource usage of BRAM and DSP versus latency. The PE_BW{×} is set to 128 for all solutions. The plotted text for a candidate is in the format (PE_N, DB_BW{×}^(output), kernels_N^(R^{3×3×512+1})). The candidates are split up in three groups of the word lengths (datasize^(W)): 8-bit, 16-bit and 32-bit versions. It took 2 minutes to create the estimate for one candidate solution.

When looking at the latency for the combined simulation test, i.e. the five simulations carried out consecutively after each other, the dominant candidates all have DB_BW{×}^(output) = 1 regardless of word length (see figure 9). The figure shows that the faster the accelerator, the higher the number of PEs.

Two models were created of each configuration, one of which was done using C-simulation, i.e. a simulation that used the SystemC HLS code directly. The other was an RTL-simulation, which used the RTL code generated from the SystemC model for the most optimal solutions. The latter was clock-cycle accurate and the execution time was precise.

Several candidates were identified and are shown in greater detail in table 1. The table shows the number of Digital Signal Processing slices (DSPs) and BRAMs used, as well as the total latency for C- and RTL-simulation. Some candidates marked with a "-" used more resources than were available on the tested target platform.
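The exploration loop itself is simple once the C-simulation estimates are available. The sketch below ranks candidate β vectors by their estimated latency and flags those that do not fit an assumed resource budget; the estimates are C-simulation numbers from table 1 below, while the DSP/BRAM limits are illustrative placeholders for the Ultra96 target.

```python
# Heuristic ranking of beta = (datasize_W, PE_BW, PE_N, DB_BW_out, kernels_N) candidates
# using the C-simulation estimates listed in table 1 (a subset shown here).
C_SIM = {                                   # beta: (DSPs, BRAMs, latency [ms])
    (8, 128, 8, 3, 42):   (384, 144, 4.60),
    (8, 128, 8, 1, 42):   (128, 137, 4.95),
    (8, 128, 16, 1, 42):  (256, 139, 4.03),
    (16, 128, 8, 3, 32):  (192, 233, 7.47),
    (16, 128, 16, 1, 32): (128, 229, 6.27),
    (32, 128, 16, 1, 20): (256, 353, 10.73),
}

MAX_DSP, MAX_BRAM = 360, 432                # assumed budget used only for this sketch

for beta, (dsp, bram, lat) in sorted(C_SIM.items(), key=lambda kv: kv[1][2]):
    fits = dsp <= MAX_DSP and bram <= MAX_BRAM
    print(f"{list(beta)}  {lat:5.2f} ms  {dsp:3d} DSPs  {bram:3d} BRAMs  fits={fits}")
```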
Table 1: Design space exploration of resource usage and latency of possible CNNA candidates using C- and RTL-simulation. A "÷" means RTL-simulation was performed, but there was insufficient space on the target platform (Ultra96). A "-" means RTL-simulation was not performed.

β~                    resource    DSPs   DSPs    BRAMs   BRAMs   latency   latency
                      average %          (RTL)           (RTL)   [ms]      (RTL) [ms]
[8, 128, 8, 3, 42]    70          384    359     144     249     4.60      6.66
[8, 128, 8, 1, 42]    34          128    125     137     185     4.95      7.28
[8, 128, 16, 3, 42]   124         768    -       152     -       4.26      -
[8, 128, 16, 1, 42]   52          256    245     139     193     4.03      6.66
[16, 128, 8, 3, 32]   54          192    360     233     377     7.47      8.40
[16, 128, 8, 1, 32]   35          64     -       227     -       7.61      -
[16, 128, 16, 3, 32]  81          384    ÷       239     ÷       6.98      ÷
[16, 128, 16, 1, 32]  44          128    293     229     349     6.27      8.52
[32, 128, 8, 3, 20]   94          384    -       355     -       13.19     -
[32, 128, 8, 1, 20]   58          128    165     351     336     12.92     14.53
[32, 128, 16, 3, 20]  148         768    -       359     -       12.42     -
[32, 128, 16, 1, 20]  76          256    325     353     408     10.73     12.81

The candidates with the lowest latency were synthesized and tested using RTL-simulation, which simulates the real HDL code generated. This also gives a more precise resource usage, which only differs slightly from the one estimated using C-simulation. The execution time is also shown and is slightly higher (approx. 2 ms) than the estimated value. On average, the compilation and C-simulation took 2 minutes for each solution. The HLS synthesis and RTL-simulation took 1–7 hours.

The optimal parameters were found using two different fixed-point formats (datasize^(W)): Q2.14 and Q2.6, i.e. a word length of 16 bits and 8 bits, respectively. These were chosen because of the area constraints of the FPGA on the Xilinx Ultra96 board [43]. However, 32 bits would have been possible with a larger FPGA.

Finally, three different configurations of the CNNA were chosen for the final test of the system, one of which used the 16-bit fixed-point format Q2.14, while the two others used the 8-bit fixed-point format Q2.6.

6. Results and discussion

The dataset DETECT [44] was used to verify the system. This dataset consisted of 29 classes of micro-invertebrates suspended in alcohol. Only the first five classes were used in the first test, while the second test used all 29 classes. Cifar-100 [45] and ImageNet [46] were used for comparison with other common datasets and to validate the results.

Figure 10: Training with five classes. Gray: floating-point, orange: fixed-point Q2.14 with auto-scaling, blue: fixed-point Q2.14 without auto-scaling, red at the bottom: fixed-point Q2.6 with and without auto-scale.

The training was carried out over the span of 100 epochs. The CNN used was VGG16 [2]. The convolutional blocks of this CNN are followed by two dense fully connected layers with either 4096 or 1024 neurons. Its final fully connected layer has either five or 29 neurons, depending on the number of classes. The training was performed with two fixed-point formats, Q2.14 and Q2.6, and tested on three configurations, which will be denoted CNNA16, CNNA18 and CNNA28. CNNA16 uses the tuning parameters β~ = [16, 128, 8, 3, 32], CNNA18 uses β~ = [8, 128, 16, 1, 42] and CNNA28 uses β~ = [8, 128, 8, 3, 42].
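For readability, the three configurations can be written out with the β fields named (the field names follow section 5; the values are exactly those quoted above):

```python
# The three tested configurations, with the beta hyper-parameters spelled out.
CONFIGS = {
    "CNNA16": dict(datasize_W=16, PE_BW=128, PE_N=8,  DB_BW_out=3, kernels_N=32),
    "CNNA18": dict(datasize_W=8,  PE_BW=128, PE_N=16, DB_BW_out=1, kernels_N=42),
    "CNNA28": dict(datasize_W=8,  PE_BW=128, PE_N=8,  DB_BW_out=3, kernels_N=42),
}
print(CONFIGS["CNNA18"])   # the 8-bit configuration with 16 PEs
```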
The accuracy, performance and power consumption of the proposed system will be presented and discussed in this section.

6.1. Accuracy

The CNN was trained using the small dataset in order to find suitable candidates faster, since it is faster to train for five classes than for 29 classes. If the accuracy of a fixed-point format is poor on five classes, it will likely be as poor, or worse, when training on 29 classes. Therefore, initial training was carried out on the small dataset.

Figure 10 shows that most of the trained models faced issues and obtained low accuracy when using a fixed-point format. The only quantized version that obtained a high level of accuracy was the one using fixed-point format Q2.14. It is unknown why the training with fixed-point format Q2.14 and no auto-scaling makes a sudden dive after 10 epochs. However, it could be caused by the learning rate being too high or too low, or too few neurons in the fully connected layers. The best results were achieved with fixed-point format Q2.14 and auto-scaling, which converges towards an accuracy of almost 100%. All fixed-point Q2.6 versions did not manage to be trained or achieve any useful results.

Table 2 shows the results of the training with five classes. Only the training that used Q2.14 with no auto-scaling performed well with 4096 neurons and reached approximately 83%. The table shows the number of neurons in the fully connected layers, N_neurons, as well as the training and validation accuracy. Validation was performed on a dataset not used for training.

Table 2: Training results for the training of VGG16 on five classes.
Type    auto-scale   N_neurons   validate   train
float   n/a          1024        97.5       100.0
Q2.6    yes          1024        20.3       20.9
Q2.14   no           1024        24.3       23.6
Q2.14   yes          1024        94.2       98.8
float   n/a          4096        97.9       99.5
Q2.14   no           4096        83.2       83.6
Q2.14   yes          4096        91.7       97.6

A final test was performed on all 29 classes of DETECT with the candidates that performed well in the previous test. The best candidates were the floating-point version for reference and the versions that used fixed-point format Q2.14, both with and without auto-scaling. As is evident from figure 11, only the training that used fixed-point format Q2.14 and auto-scaling achieved promising results. It shows that it is much more difficult to train the CNNA when using quantization, because details are lost due to the limited range of the fixed-point numbers. However, it takes many more iterations for the training to reach the same accuracy level as the floating-point format.

Figure 11: Training with 29 classes. Blue: floating-point, red: fixed-point Q2.14 with auto-scale, light blue: fixed-point Q2.14 without auto-scaling.

The first 29 classes from ImageNet and Cifar-100 were also used for training. The validation results in table 3 show that, comparing the Q2.14 format with floating-point, the accuracy drops by 3.5% and 3.2%. For DETECT the drop is 4%, which is higher compared to training with ImageNet and Cifar-100.

Table 3: Results for the training of the 16-bit fixed-point VGG16 on 29 classes from the datasets DETECT (DET), ImageNet (Image) and Cifar-100 (Cifar).
Type    data    auto   N_neurons   val.   train
float   DET     n/a    1024        88.0   100.0
Q2.14   DET     no     1024        5.0    5.0
Q2.14   DET     yes    1024        86.4   94.4
float   DET     n/a    4096        84.0   99.4
Q2.14   DET     no     4096        5.0    5.1
Q2.14   DET     yes    4096        86.5   92.9
float   Image   n/a    4096        83.0   99.2
Q2.14   Image   yes    4096        79.5   88.1
float   Cifar   n/a    4096        80.5   99.5
Q2.14   Cifar   yes    4096        77.3   89.3
6.2. Performance

The Xilinx Ultra96 board [43] was used to evaluate the performance of the system using a HW clock of 100 MHz and 172.22 MHz for the CNNA IP core. The inference time was measured for the different configurations CNNA16, CNNA18 and CNNA28, and the inference times are shown in table 4. The timing performance was measured on the Ultra96 board during inference of the quantized and trained VGG16 model with five classes. The mean time and variance are an average of 30 measurements. The fastest model, CNNA18, took 1.22 sec per image, while the slowest, CNNA16 at 100 MHz, took 2.20 sec per image.

Table 4: Average inference time and variance using VGG16 for five classes using four different IP cores.
              CNNA16    CNNA16    CNNA18    CNNA28
              100MHz    172MHz    172MHz    172MHz
avg [sec]     2.20      1.96      1.22      1.49
var [·10^-3]  0.25      0.30      0.20      0.11

The different layers have different execution times, as shown in table 5. As expected, the execution time in the convolutional layers depended on the number of bits in the fixed-point format. However, pooling took approximately the same time for all tested IPs, since pooling is independent of the fixed-point format. The table shows that the IP CNNA18 obtained the best performance due to the larger number of PEs (16). Note that CNNA28 was slightly faster than CNNA18 in the convolutional layers, even with fewer PEs, due to the higher bandwidth of the output multiplier. There is a large number of splits (512) in the dense 1 and dense 2 layers, and they consume more than half of the total execution time for all three CNNA configurations. On average, 32% of the time is used to set up the DMAs from PYNQ, which could be optimized with a scatter-gather DMA. In such a solution the DMA would initiate the transfer for the next location of DRAM data without involving the CPU. A larger FPGA with more on-chip memory could also be a solution to lower the number of splits and optimize the performance further.

Table 5: Time of execution of each VGG16 layer in [ms] using four different IP cores.
layer      CNNA16    CNNA16    CNNA18    CNNA28
           100MHz    172MHz    172MHz    172MHz
l1 conv1   19.3      17.0      19.9      21.4
l1 conv2   111       84.1      61.3      60.1
l1 pool    18.1      13.7      12.9      17.1
l2 conv1   55.4      42.9      31.3      30.5
l2 conv2   108       81.3      60.2      56.3
l2 pool    8.98      6.87      6.36      8.40
l3 conv1   56.0      43.5      33.1      29.9
l3 conv2   112       84.2      64.2      59.1
l3 conv3   110       85.5      63.0      57.8
l3 pool    4.51      3.48      3.25      4.23
l4 conv1   64.6      51.5      37.9      35.3
l4 conv2   126       97.9      76.0      70.5
l4 conv3   123       102.0     73.5      67.8
l4 pool    2.32      1.83      1.71      2.19
l5 conv1   46.6      41.0      30.7      29.3
l5 conv2   49.1      39.7      29.7      28.1
l5 conv3   45.8      39.5      29.6      27.8
l5 pool    0.74      0.62      0.59      0.69
dense 1    767       737       364       509
dense 2    393       397       197       362
dense 3    1.55      1.62      1.18      1.50

6.3. Power consumption

The power consumption of the design with CNNA16, CNNA18 and CNNA28 was measured on the Ultra96 board during inference of the trained VGG16 model with five classes. The measured voltage of the power supply to the board was multiplied with the measured current to compute the power consumption. The mean and maximum power during inference are calculated as a mean of 10 inferences. The power consumption of the IP core is defined as the difference between the power of the idling Ultra96 board and the power during inference. The idle power consumption was measured at P_idle = 3.055 Watt over a five-minute period.

Table 6: Average and peak power consumption in watt of the Ultra96 board and the IP core during inference.
           CNNA16    CNNA16    CNNA18    CNNA28
           100MHz    172MHz    172MHz    172MHz
P_avg      5.28      5.68      4.71      4.80
P_peak     6.60      7.14      5.76      6.35
P_IPavg    2.23      2.63      1.66      1.74
P_IPpeak   3.55      4.09      2.71      3.30
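The IP-core rows of table 6 follow directly from the board measurements by subtracting the idle power; the short computation below reproduces them from the published values (small deviations are due to rounding of the published numbers).

```python
# P_IP = P_board - P_idle, using the averages from table 6 and the measured idle power.
P_IDLE = 3.055                                           # W, measured over five minutes
P_BOARD_AVG = {"CNNA16-100MHz": 5.28, "CNNA16-172MHz": 5.68,
               "CNNA18-172MHz": 4.71, "CNNA28-172MHz": 4.80}
for name, p_board in P_BOARD_AVG.items():
    print(f"{name}: P_IPavg ~ {p_board - P_IDLE:.2f} W")  # matches table 6 up to rounding
```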
Table 6 shows that the mean power consumption of the Ultra96 board for all tests was between 4.7 and 5.7 W, out of which the IP core only consumes approximately 2 W. This means that running the IP did not affect the average power consumption. However, because they run for a shorter amount of time, the fixed-point IPs with a low number of bits used less energy per inference. The CNNA16 with a 100 MHz clock was 0.24 sec slower but consumed less power than the version with a 172 MHz clock. Table 6 shows that the peak power consumption was almost the same for all tested IPs, in the range from 2.7 W to 4.1 W.

Figure 12: Power consumption of the tested solutions CNNA18, CNNA28 and CNNA16 during inference at 172 MHz.

Figure 12 shows that the power consumption is largest in the beginning of the inference, i.e. in the convolution blocks of the CNN. The power consumption dropped during execution of the fully connected layers. This indicates that most of the FPGA logic was in action during convolution, while less logic was used during computing of the fully connected layers and pooling. Pooling activity corresponds to the big dips in power consumption in the first half of the inference.

7. Comparison with state-of-the-art CNNs

We have chosen to evaluate our work with the current state-of-the-art toolflows presented in [23], which use a fixed-point resolution of 8 or 16 bits to perform FPGA acceleration of the VGG16 network by targeting the Xilinx Zynq and UltraScale platforms. The purpose of this is to compare our work with other tools that have mapped the same CNN on similar FPGA devices from the same vendor.

Accuracy. Our fixed-point training method only performed well for 16-bit quantization. DoReFa-Net [47] proposes a method for training CNNs with low-bit quantization. The method demonstrates a high accuracy by using AlexNet with only 1-bit weights. FINN-R [11] uses quantized neural networks for low-bit quantization of different CNN topologies with high accuracy. Angel-Eye [27] also proposes a dynamic quantization strategy, where the network is initially trained with a floating-point format. The radix position of the fixed-point data is chosen differently for each layer based on statistics, and an optimal radix point is chosen. The network is converted back to floating-point and fine-tuned. This method achieves a high level of accuracy for both 16 and 8-bit formats with VGG16.

Performance. To compare the different solutions, the performance needs to be expressed in Giga Operations Per Second (GOPS). The performance result is normalized relative to the number of Look Up Tables (LUTs) and DSPs as a measure of the available resources on the target device. This performance density measure is used to compare the VGG16 mapped to different FPGA devices. The throughput performance is calculated as the number of Giga Operations (GOP) performed by the CNN relative to the inference time in seconds. In the case of the VGG16 network, the total number is 30.76 GOP, out of which 30.7 GOP is performed in the convolutional (CONV) layers. We have presented the performance for the CONV layers and all layers of the VGG16 model, since some solutions do not accelerate the FC layers.
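The throughput and power-efficiency figures reported for our accelerator in table 7 can be reproduced from numbers already given: 30.76 GOP for the full VGG16, the inference times of table 4 and the mean IP-core power of table 6 (small deviations from the table are rounding).

```python
# GOPS = GOP / inference time; GOPS/W = GOPS / mean IP-core power (values from tables 4 and 6).
GOP_TOTAL = 30.76
runs = {                       # name: (inference time [s], P_IPavg [W])
    "CNNA16-100": (2.20, 2.23),
    "CNNA16-172": (1.96, 2.63),
    "CNNA18-172": (1.22, 1.66),
    "CNNA28-172": (1.49, 1.74),
}
for name, (t, p_ip) in runs.items():
    gops = GOP_TOTAL / t
    print(f"{name}: {gops:5.1f} GOPS total, {gops / p_ip:5.2f} GOPS/W")
```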
Table 7: Comparison of the CNNA with state-of-the-art CNN accelerators: DnnWeaver [26], fpgaConvNet [24], Angel-Eye [27] and Caffeine [29]. All solutions target Xilinx Zynq devices except for Caffeine, which uses the Kintex UltraScale FPGA. The power efficiency, performance density and throughput performance are listed for the different solutions. The performance density is only shown for the CONV layers.

Technique-MHz    Power Eff.  Power Eff. (Conv)  Density     Density      Perf. (Conv)  Perf. (Total)  Xilinx   Fix.
                 [GOPS/W]    [GOPS/W]           [GOPS/DSP]  [GOPS/kLUT]  [GOPS]        [GOPS]         Device   [bits]
DnnWeaver-150    n/a         n/a                0.143       0.59         31.4          n/a            ZC7Z020  16
fpgaConvNet-125  n/a         7.27               0.221       0.91         48.5          12.7           ZC7Z020  16
fpgaConvNet-125  n/a         n/a                0.173       0.71         156           n/a            ZC7Z045  16
Angel-Eye-214    n/a         24.1               n/a         n/a          85.3          n/a            ZC7Z020  8
Angel-Eye-150    14.2        n/a                0.209       0.86         188           137            ZC7Z045  16
Caffeine-200     10.64       12.4               0.187       1.55         310           266            KU060    16
CNNA16-100       6.28        11.86              0.078       0.62         26.4          14.0           ZU3EG    16
CNNA16-172       5.99        11.08              0.081       0.52         29.1          15.7           ZU3EG    16
CNNA18-172       15.22       22.94              0.155       0.66         38.0          25.2           ZU3EG    8
CNNA28-172       11.83       22.53              0.110       0.61         39.5          20.7           ZU3EG    8
Power efficiency. The power efficiency depends on the efficiency of both the data communication and the computation.

The SmartShuttle [49] solution optimizes CNN off-chip memory access. Observing that over 80% of energy is consumed by DRAM accesses, they benchmark the data volume of DMA requests during inference of the 13 CONV layers in VGG16. Our CNNA16 measures a data volume of 211.7 MB transferred for the same feature layers including pooling. However, we use more on-chip memory for weight and data buffers than SmartShuttle. As a benchmark, SmartShuttle measures 221.3 MB. Simulated with an on-chip buffer of 512 KB, however, they can lower the DRAM access volume to 160 MB. The design of the CLB in our CNNA ensures that weights are only transferred once from DRAM, which is similar to what SmartShuttle achieves with the weight reuse oriented scheme (WRO) they propose. The last three FC layers of the CNNA16 transfer a volume of 273.8 MB, which is not considered by SmartShuttle and accounts for most of the data communication.
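Most of that FC-layer volume can be attributed to the VGG16 fully connected weights themselves. The sketch below gives a rough lower bound assuming the standard VGG16 FC dimensions, 16-bit weights and each weight transferred exactly once; activations and any transfer overhead are ignored, so the real volume is somewhat larger.

    # Rough lower bound on the FC weight traffic for VGG16 with 16-bit weights.
    fc_shapes = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]    # standard VGG16 FC layers
    n_weights = sum(i * o for i, o in fc_shapes)                     # ~123.6 million weights
    megabytes = n_weights * 2 / 1e6                                  # 2 bytes per 16-bit weight
    print(f"{n_weights / 1e6:.1f} M weights -> {megabytes:.1f} MB")  # ~247 MB

Around 247 MB of pure weight data is therefore unavoidable for the FC layers when every weight is fetched once, which is broadly consistent with the 273.8 MB measured above.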
The computation power efficiency is calculated as the number of operations per second relative to the mean power consumption of the CNNA, which we measured earlier (GOPS/W). Compared to many of the current state-of-the-art accelerators, the CNN accelerator in this work performs quite well in terms of power efficiency. When using 16-bit fixed-point weights at 100 MHz, its total power efficiency is 0.44x that of Angel-Eye and 0.59x that of Caffeine. With nearly the same efficiency of 12 GOPS/W, the power efficiency of the CONV layers is considered comparable with Caffeine. The performance bottleneck in our CNN accelerator is the fully connected layers, where splits are performed 512 times, resulting in a high volume of DRAM accesses. The fpgaConvNet on the Zynq XC7Z020 has a worse efficiency of 7.3 GOPS/W compared to the CNNA16 with 11.9 GOPS/W. While Angel-Eye's 8-bit fixed-point solution, at 24.1 GOPS/W, is the best of all the compared state-of-the-art solutions in terms of efficiency, the 8-bit CNNA is a close second with 23.0 GOPS/W.
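These ratios follow directly from the total GOPS/W column of Table 7, as the following small Python check shows (only numbers from the table are used):

    # Total power efficiency [GOPS/W] from Table 7.
    eff_cnna16_100 = 6.28     # CNNA16 at 100 MHz
    eff_angel_eye  = 14.2     # Angel-Eye-150
    eff_caffeine   = 10.64    # Caffeine-200
    print(f"vs Angel-Eye: {eff_cnna16_100 / eff_angel_eye:.2f}x")   # 0.44x
    print(f"vs Caffeine:  {eff_cnna16_100 / eff_caffeine:.2f}x")    # 0.59x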
8. Conclusion

In this paper, an architecture for a SoC design was presented. The presented architecture implements the different operations necessary for a deep neural network to perform close to real-time inference. The architecture was implemented using Python and HLS for the IP core and was able to run on the Ultra96 board using PYNQ. The interface for the system is similar to Keras and should be familiar to most engineers working in the field of machine learning.

The CNNA is able to accelerate deep learning algorithms that use any sequence of convolutional, max-pooling and fully connected layers. The layer operations can support many different parameters and will be able to perform inference using most modern CNNs.
The network weights can use any 8-, 16- or 32-bit fixed-point format when exported from Keras, with the weights auto-scaled correctly. A training method was proposed which achieved high levels of inference accuracy, both when using fixed-point and floating-point weights. The VGG16 architecture chosen for testing in this paper was able to perform inference in 2.0 sec per image when using the fixed-point format Q2.14 and in 1.2 sec when using the fixed-point format Q2.6. The IP core alone consumes a peak power of 4.1 W with a mean power between 1.5 W and 2.7 W, and has a power efficiency between 6.0 and 15.2 GOPS/W depending on the fixed-point format.
tween 6.0 − 15.2 GOPS/W depending of the fixed- [11] M. Blott, T. B. Preuber, N. J. Fraser, G. Gambardella,
K. O’Brien, Y. Umuroglu, M. Leeser, K. Vissers, FinN-
point format.
R: An end-to-end deep-learning framework for fast ex-
Compared to similar state-of-the-art solutions ploration of quantized neural networks, ACM Transac-
for mapping the VGG16 network to Xilinx plat- tions on Reconfigurable Technology and SystemsarXiv:
forms, our solution demonstrates a comparable en- 1809.04570, doi:10.1145/3242897.
[12] E. Nurvitadhi, D. Sheffield, J. Sim, A. Mishra,
ergy efficiency, especially for the convolutional lay- G. Venkatesh, D. Marr, Accelerating binarized neu-
ers. In future work, the CNNA needs be extended ral networks: Comparison of FPGA, CPU, GPU, and
to support special layers to support deep neural ASIC, in: Proceedings of the 2016 International Con-
networks such as ResNet, DenseNet, InceptionNet ference on Field-Programmable Technology, FPT 2016,
2017. doi:10.1109/FPT.2016.7929192.
and GooglLeNet. The special layers with irregular [13] R. Zhao, W. Song, W. Zhang, T. Xing, J.-H.
dataflow will be implemented in the SW controlling Lin, M. Srivastava, R. Gupta, Z. Zhang, Acceler-
part of the proposed architecture. ating binarized convolutional neural networks with
software-programmable fpgas, in: Proceedings of the
2017 ACM/SIGDA International Symposium on Field-
Acknowledgments Programmable Gate Arrays, FPGA ’17, ACM, New
York, NY, USA, 2017, pp. 15–24. doi:10.1145/
We would like to thank Freia Martensen for lan- 3020078.3021741.
URL http://doi.acm.org/10.1145/3020078.3021741
guage and proof reading the article. [14] R. Nane, V. M. Sima, C. Pilato, J. Choi, B. Fort,
A. Canis, Y. T. Chen, H. Hsiao, S. Brown, F. Ferrandi,
J. Anderson, K. Bertels, A Survey and Evaluation of
References FPGA High-Level Synthesis Tools, IEEE Transactions
on Computer-Aided Design of Integrated Circuits and
[1] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet
Systemsdoi:10.1109/TCAD.2015.2513673.
classification with deep convolutional neural networks,
[15] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, L. Wang,
Communications of the ACMdoi:10.1145/3065386.
A high performance FPGA-based accelerator for large-
[2] K. Simonyan, A. Zisserman, Very deep convolutional
scale convolutional neural networks, in: FPL 2016 -
networks for large-scale image recognition, in: 3rd In-
26th International Conference on Field-Programmable
ternational Conference on Learning Representations,
Logic and Applications, 2016. doi:10.1109/FPL.2016.
ICLR 2015 - Conference Track Proceedings, 2015.
7577308.
arXiv:1409.1556.
[16] A. Krizhevsky, I. Sutskever, G. E. Hinton, 2012
[3] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You
AlexNet, Advances In Neural Information Pro-
only look once: Unified, real-time object detection, in:
cessing SystemsarXiv:1102.0183, doi:http://dx.doi.
Proceedings of the IEEE Computer Society Conference
org/10.1016/j.protcy.2014.09.007.
on Computer Vision and Pattern Recognition, 2016.
[17] F. Chollet, et al., Keras, https://keras.io (2015).
arXiv:1506.02640, doi:10.1109/CVPR.2016.91.
[18] L. Stornaiuolo, M. Santambrogio, D. Sciuto, On how
[4] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: To-
to efficiently implement deep learning algorithms on
wards Real-Time Object Detection with Region Pro-
PYNQ Platform, in: Proceedings of IEEE Computer
posal Networks, IEEE Transactions on Pattern Anal-
Society Annual Symposium on VLSI, ISVLSI, 2018.
ysis and Machine IntelligencearXiv:1506.01497, doi:
doi:10.1109/ISVLSI.2018.00112.
10.1109/TPAMI.2016.2577031.
[19] Accellera Systems Initiative, Ieee standard for standard
[5] K. He, G. Gkioxari, P. Dollar, R. Girshick, Mask R-
systemc language reference manual, IEEE Std 1666-
CNN, IEEE Transactions on Pattern Analysis and Ma-
2011 (Revision of IEEE Std 1666-2005).
chine Intelligencedoi:10.1109/TPAMI.2018.2844175.
[20] K. Ovtcharov, O. Ruwase, J.-y. Kim, J. Fowers,
[6] A. Shawahna, S. M. Sait, A. El-Maleh, FPGA-Based
K. Strauss, E. S. Chung, Accelerating Deep Convo-
accelerators of deep learning networks for learning and
lutional Neural Networks Using Specialized Hardware,
classification: A review (2019). arXiv:1901.00121,
Microsoft Research Whitepaper.
doi:10.1109/ACCESS.2018.2890150.

[21] D. Gschwend, ZynqNet: An FPGA-accelerated embedded convolutional neural network. URL https://github.com/dgschwend/zynqnet
[22] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, K. Vissers, FINN: A framework for fast, scalable binarized neural network inference, in: FPGA 2017 - Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2017. arXiv:1612.07119, doi:10.1145/3020078.3021744.
[23] S. I. Venieris, A. Kouris, C. S. Bouganis, Toolflows for mapping convolutional neural networks on FPGAs: A survey and future directions (2018). arXiv:1803.05900, doi:10.1145/3186332.
[24] S. I. Venieris, C. S. Bouganis, FpgaConvNet: A framework for mapping convolutional neural networks on FPGAs, in: Proceedings - 24th IEEE International Symposium on Field-Programmable Custom Computing Machines, FCCM 2016, 2016. doi:10.1109/FCCM.2016.22.
[25] S. I. Venieris, C. S. Bouganis, FpgaConvNet: Mapping regular and irregular convolutional neural networks on FPGAs, IEEE Transactions on Neural Networks and Learning Systems. doi:10.1109/TNNLS.2018.2844093.
[26] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, H. Esmaeilzadeh, From high-level deep neural models to FPGAs, in: Proceedings of the Annual International Symposium on Microarchitecture, MICRO, 2016. doi:10.1109/MICRO.2016.7783720.
[27] K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han, Y. Wang, H. Yang, Angel-Eye: A complete design flow for mapping CNN onto embedded FPGA, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. doi:10.1109/TCAD.2017.2705069.
[28] Y. Wang, J. Xu, Y. Han, H. Li, X. Li, DeepBurning: Automatic generation of FPGA-based learning accelerators for the neural network family, in: Proceedings - Design Automation Conference, 2016. doi:10.1145/2897937.2898003.
[29] C. Zhang, Z. Fang, P. Zhou, P. Pan, J. Cong, Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks, in: IEEE/ACM International Conference on Computer-Aided Design, Digest of Technical Papers, ICCAD, 2016. doi:10.1145/2966986.2967011.
[30] Y. Y. Huang, W. Y. Wang, Deep residual learning for weakly-supervised relation extraction, in: EMNLP 2017 - Conference on Empirical Methods in Natural Language Processing, Proceedings, 2017. arXiv:1707.08866, doi:10.18653/v1/d17-1191.
[31] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in: Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, 2017. arXiv:1608.06993, doi:10.1109/CVPR.2017.243.
[32] G. Zeng, Y. He, Z. Yu, X. Yang, R. Yang, L. Zhang, InceptionNet/GoogLeNet - Going deeper with convolutions, CVPR. arXiv:1409.4842, doi:10.1002/jctb.4820.
[33] Xilinx, PYNQ: Python productivity for Zynq. URL http://www.pynq.io/
[34] Xilinx, UG902 - Vivado Design Suite User Guide - High-Level Synthesis, 2019th Edition (07 2019).
[35] Accellera Systems Initiative, SystemC Synthesizable Subsets, 1st Edition (January 2015).
[36] L. H. Crockett, D. Northcote, C. Ramsay, F. D. Robinson, R. W. Stewart, Exploring Zynq® MPSoC With PYNQ and Machine Learning Applications, Strathclyde Academic Media, 2019.
[37] Xilinx, UG761 - AXI Reference Guide, v13.1 Edition (March 2011).
[38] B. Xu, R. Huang, M. Li, Revise saturated activation functions, CoRR abs/1602.05980. arXiv:1602.05980. URL http://arxiv.org/abs/1602.05980
[39] E. Oberstar, Fixed-Point Representation & Fractional Math, Revision 1.2 (08 2007). doi:10.13140/RG.2.1.3602.8242.
[40] B. J. Wythoff, Backpropagation neural networks: A tutorial, Chemometrics and Intelligent Laboratory Systems 18 (2) (1993) 115-155. doi:10.1016/0169-7439(93)80052-J. URL http://www.sciencedirect.com/science/article/pii/016974399380052J
[41] H. Park, J. H. Lee, Y. Oh, S. Ha, S. Lee, Training deep neural network in limited precision, CoRR abs/1810.05486. arXiv:1810.05486. URL http://arxiv.org/abs/1810.05486
[42] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, D. Kalenichenko, Quantization and training of neural networks for efficient integer-arithmetic-only inference, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[43] 96 Boards, Ultra96-V2 developer board. URL https://www.96boards.org/product/ultra96/
[44] Detect [online] (June 2017) [cited 7/5-2018].
[45] A. Krizhevsky, V. Nair, G. Hinton, CIFAR-10 and CIFAR-100 datasets (2009).
[46] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision (IJCV) 115 (3) (2015) 211-252. doi:10.1007/s11263-015-0816-y.
[47] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, Y. Zou, DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients, arXiv:1606.06160v3 [cs.NE] (2018).
[48] Xilinx, Zynq UltraScale+ MPSoC ZCU104 evaluation kit. URL https://www.xilinx.com/products/boards-and-kits/zcu104.html
[49] J. Li, G. Yan, W. Lu, S. Jiang, S. Gong, J. Wu, X. Li, SmartShuttle: Optimizing off-chip memory accesses for deep learning accelerators, in: Proceedings of the 2018 Design, Automation and Test in Europe Conference and Exhibition, DATE 2018, 2018. doi:10.23919/DATE.2018.8342033.