Article
CNN2Gate: An Implementation of Convolutional
Neural Networks Inference on FPGAs with
Automated Design Space Exploration
Alireza Ghaffari * and Yvon Savaria
Electrical Engineering Department, Polytechnique Montreal, 2500 Edouard Montpetit Blvd, Montreal,
QC H3T 1J4, Canada; [email protected]
* Correspondence: [email protected]
Received: 23 November 2020; Accepted: 16 December 2020; Published: 21 December 2020
Abstract: Convolutional Neural Networks (CNNs) have a major impact on our society because of the
numerous services they provide. These services include, but are not limited to, image classification,
video analysis, and speech recognition. Recently, the number of research efforts that use FPGAs to
implement CNNs has been increasing rapidly, owing to the lower power consumption and easy
reconfigurability offered by these platforms. As research progresses on topics
such as architecture, synthesis, and optimization, new challenges arise in integrating
suitable hardware solutions with high-level machine learning software libraries. This paper introduces
an integrated framework (CNN2Gate), which supports compilation of a CNN model for an FPGA
target. CNN2Gate is capable of parsing CNN models from several popular high-level machine
learning libraries, such as Keras, Pytorch, and Caffe2. CNN2Gate extracts the computation flow of the layers,
as well as their weights and biases, and applies a "given" fixed-point quantization. Furthermore,
it writes this information in the proper format for the FPGA vendor’s OpenCL synthesis tools that
are then used to build and run the project on an FPGA. CNN2Gate automatically performs design-space
exploration and fits the design on different FPGAs with limited logic resources. This paper reports
results of automatic synthesis and design-space exploration of AlexNet and VGG-16 on various Intel
FPGA platforms.
1. Introduction
The impact of machine learning and deep learning is rapidly growing in our society, due to
their diverse technological advantages. Convolutional neural networks (CNNs) are among the most
notable architectures that provide a very powerful tool for many applications, such as video and
image analysis, speech recognition, and recommender systems [1]. On the other hand, CNNs require
considerable computing power. In order to better satisfy some given requirements, it is possible to
use high-performance processors, such as graphics processing units (GPUs) [2]. However, GPUs have
some shortcomings that limit their usability and suitability in day-to-day mission-critical and real-time
scenarios. The first downside of using GPUs is their high power consumption. This makes GPUs
hard to use in robotics, drones, self-driving cars, and Internet of Things (IoT) devices, even though these fields
could benefit greatly from deep learning algorithms. The second downside is the lack of external
Inputs/Outputs (I/Os). GPUs are typically accessible through some PCI-express bus on their host
computer. The absence of direct I/Os makes it hard to use them in mission-critical scenarios that need
prompt control actions.
In particular, the floating-point number representation of CPUs and GPUs is used for most deep
learning algorithms. However, it is possible to benefit from custom fixed-point numbers (quantized
numbers) in order to reduce the power consumption and circuit footprint, and to increase the number
of compute engines [3]. It has been shown that many convolutional neural networks can work with
eight-bit quantized numbers or less [4]. Hence, GPUs waste significant computing resources and electrical
power when performing inference for most deep learning algorithms. A better balance
can be achieved by designing application-specific computing units that use quantized arithmetic.
Quantized numbers provide other benefits, such as memory access performance improvements, as they
allocate less space in the external or internal memory of compute devices. Memory access performance
and memory footprint are particularly important for hardware designers in many low-power devices.
Most portable hardware devices, such as cell-phones, drones, and Internet of Things (IoT) sensors,
require memory optimization due to the limited resources available on these devices.
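To give a concrete sense of the memory savings, the sketch below (our own illustration; CNN2Gate itself applies an externally supplied fixed-point format) quantizes a floating-point weight tensor to 8-bit fixed-point values:

```python
import numpy as np

def to_fixed_point(weights, frac_bits=6, total_bits=8):
    """Quantize a float array to signed fixed-point with `frac_bits` fractional bits.

    Illustrative sketch only; the actual quantization applied by CNN2Gate is
    supplied externally as a "given" fixed-point format.
    """
    scale = 2 ** frac_bits
    qmin, qmax = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    return np.clip(np.round(weights * scale), qmin, qmax).astype(np.int8)

w = np.random.randn(64, 3, 3, 3).astype(np.float32)   # float32: 4 bytes per weight
q = to_fixed_point(w)                                  # int8:    1 byte per weight
print(w.nbytes, q.nbytes)                              # 4x smaller memory footprint
```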
Field Programmable Gate Arrays (FPGA) can be used in these scenarios in order to tackle
the problems that are caused by limitations of GPUs without compromising the accuracy of the
algorithm. Using quantized deep learning algorithms can solve the power consumption and memory
efficiency issues on FPGAs as the size of the implemented circuits shrinks with quantization. In [5],
the authors reported that the theoretical peak performance of six-bit integer matrix multiplication
(GEMM) in the Titan X GPU is almost 180 GOP/s/Watt, while it can be as high as 380 GOP/s/Watt
for Stratix 10, and 200 GOP/s/Watt for Arria 10 FPGAs. This means that high-end FPGAs are able
to provide comparable or even better performance per Watt than modern GPUs. FPGAs are
scalable and configurable. Thus, a deep convolutional neural network can be configured and scaled
to be used in a much smaller FPGA in order to reduce the power consumption for mobile devices.
The present paper investigates various design-space exploration methods that can fit a desired CNN
to a selected FPGA with limited resources. Having massive connectivity through I/Os is natural in
FPGAs. Furthermore, FPGAs can be flexibly optimized and tailored to perform various applications
in industrial and real-time mobile workloads [6]. FPGA vendors, such as Intel, offer a range of
commercialized System-on-Chip (SoC) devices that integrate both processor and FPGA fabric into a
single chip (see, for example, [7]). These SoCs are widely used in mobile devices, and they make the
use of application-specific processors obsolete in many cases.
Several research contributions address the architecture, synthesis, and optimization of deep
learning algorithms on FPGAs [8–12]. Nonetheless, little work has been done on integrating these
results into a single tool. The in-depth literature review in Section 2 shows that tools such as [13,14]
provide neither integrated design-space exploration nor a generic CNN model analyzer. In [11,12,15],
the authors specifically discuss the hardware design and not the automation of the synthesis. In [9],
fitting the design automatically in different FPGA targets is not discussed. This presents a major
challenge, as there are many degrees of freedom in designing a CNN. There is no fixed architecture for
a CNN, as a designer can choose as many convolution, pooling, and fully connected layers as needed
to reach the desired accuracy. In addition, there is a need for design-space exploration methods that
optimize resource utilization on FPGA.
A number of development environments exist for the description and architectural synthesis of
digital systems [16,17]. However, they are designed for general purpose applications. This means
that there is a great opportunity to make a library that uses these hardware design environments
to implement deep learning algorithms. Alternately, there are also many other development
frameworks/libraries for deep learning in Python language. Some notable libraries for deep learning
are Pytorch, Keras, Tensorflow, and Caffe. Another challenge would be designing a synthesis tool for
FPGA that can accept models that are produced by all of the aforementioned Python libraries and
many more available for machine learning developers.
The present paper proposes methods for tackling research challenges that are related
to the integration of high-level synthesis tools for convolutional neural networks on FPGAs.
The contributions of the present paper are:
Note that CNN2Gate emphasizes software/hardware co-design methods. The goal of this paper
is not to compare the performance of FPGAs and GPUs, but to explore the possibility of designing
end-to-end generic frameworks that can leverage high-level model descriptions in order to implement
FPGA-based CNN accelerators without human intervention. As shown in Figure 1, CNN2Gate is a
Python library that can be used in order to perform inference of CNNs on FPGAs that is capable of:
[Figure 1: ONNX model → CNN2Gate front-end parser → automated high-level design-space exploration → RTL generation.]
Figure 1. The overall CNN2Gate architecture, comprising a front-end parser (Open Neural Network
eXchange (ONNX) parser), a design-space exploration module, and automated high-level synthesis.
The rest of the paper is organized as follows. Section 2 reviews related work. Section 3
reviews the most relevant background on convolutional neural networks. Section 4
elaborates on how CNN2Gate extracts the computation flow, configures the kernels, and performs
design-space exploration. Finally, Section 5 reports results and compares them to other
existing implementations.
2. Related Works
A great deal of research has been conducted on implementing deep neural networks on FPGAs.
Among these works, hls4ml [14], fpgaConvNet [9], and Caffeine [15] are the most similar to the
present paper. hls4ml is a companion compiler package for machine learning inference on FPGA.
It translates open-source machine learning models into high-level synthesizable (HLS) descriptions.
hls4ml was specifically developed for an application in particle physics, with the purpose of reducing
the development time on FPGA. However, as stated in the status page of the project [19], the package
only supports Keras and Tensorflow for CNNs and the support for Pytorch is in development.
In addition, to the best of our knowledge, hls4ml does not offer design-space exploration. FpgaConvNet
is also an end-to-end framework for the optimized mapping of CNNs on FPGAs. FpgaConvNet uses a
symmetric multi-objective algorithm in order to optimize the generated design for either throughput,
latency, or multi-objective criteria (e.g., throughput and latency). The front-end parser of fpgaConvNet
can analyze models that are expressed in the Torch and Caffe machine-learning libraries. Caffeine
is also a software-hardware co-design library that directly synthesizes Caffe models comprising
convolutional layers and fully connected layers for FPGAs. The main differences between CNN2Gate
and other cited works are in three key features. First, as explained in Section 4.1, CNN2Gate leverages
a model transfer layer (ONNX), which automatically brings support for most machine-learning Python
libraries without bounding the user to a specific machine-learning library. Second, CNN2Gate is based
on OpenCL, unlike hls4ml and fpgaConvNet, which are based on C++. Third, CNN2Gate proposes
an FPGA fitter algorithm that is based on reinforcement learning.
Using existing high-level synthesis technologies, it is possible to synthesize OpenCL Single
Instruction Multiple Thread (SIMT) algorithms to RTL. It is worth mentioning some notable research
efforts in that direction. In [20], the authors provided an OpenCL-based deep learning accelerator targeting
Intel's FPGA devices. This architecture was capable of maximizing data-reuse and
minimizing memory accesses. The authors of [21] presented a systematic design-space exploration
methodology in order to maximize the throughput of an OpenCL-based FPGA accelerator for a given
CNN model. They used synthesis results to empirically model the FPGA resource utilization. Similarly,
in [22], the authors analyzed the throughput and memory bandwidth quantitatively in order to tackle
the problem of design-space exploration of a CNN design targeting FPGAs. They also applied various
optimization methods, such as loop-tiling, to reach the best performance. A CNN RTL compiler is
proposed in [23]. This compiler automatically generates sets of scalable computing primitives to form
an end-to-end CNN accelerator.
Another remarkable OpenCL-based implementation of CNNs is PipeCNN [8]. PipeCNN builds
mainly on the ability of currently available design tools to synthesize OpenCL kernels with
high-level synthesis. PipeCNN consists of a set of configurable OpenCL kernels to accelerate CNNs
and optimize memory bandwidth utilization. Data reuse and task mapping techniques have also been
used in that design. Recently, in [24], the authors added a sparse convolution scheme to PipeCNN
in order to further improve its throughput. Our work (CNN2Gate) follows the spirit of PipeCNN.
CNN2Gate is built on top of a modified version of PipeCNN. CNN2Gate is capable of exploiting
a library of primitive kernels needed to perform inference of a CNN. In addition, CNN2Gate also
includes means to perform automated design-space exploration and can automatically translate CNN
models that are provided by a wide variety of machine learning libraries. It should be noted that our
research goal is introducing a methodology to design an end-to-end framework to implement CNN
models on FPGA targets. This means, without a loss of generality, that CNN2Gate can be modified to
support other hardware implementations or technologies ranging from RTL to high-level synthesis
with two conditions. First, CNN layers can be expressed as pre-defined templates. Second, the amount
of parallelism in the hardware templates can be controlled.
There are several reasons why we chose PipeCNN as the foundation of CNN2Gate. First,
using the OpenCL model, it is possible to fine-tune the amount of parallelism in the algorithm,
as explained in Section 4.3. Second, it supports a deeply pipelined design,
as explained in Section 3.2.2. Third, the library of primitive kernels can be easily adapted and
re-configured based on the information that was extracted from a CNN model (Section 4.1).
Lately, reinforcement learning has been used in order to perform automated quantization for
neural networks. HAQ [25] and ReLeQ [26] are examples of research efforts exploiting this technique.
ReLeQ proposes a systematic approach for solving the problem of deep quantization automatically.
This provides a general solution for the quantization of a large variety of neural networks. Likewise,
HAQ suggests a hardware-aware automated quantization method for deep neural networks
using an actor-critic reinforcement learning method [27]. Inspired by these two papers, we used a
reinforcement learning algorithm to control the level of parallelism in CNN2Gate OpenCL kernels.
Finally, it is worth mentioning the challenges of implementing digital vision systems in hardware.
In [28], the authors proposed a hardware implementation of a haze removal method exploiting adaptive
filtering. Additionally, in [29], the authors provided an FPGA implementation of a novel method to
recover clear images from degraded ones.
3. Background
This section provides the background knowledge that is required to understand CNNs and
OpenCL-based high-level synthesis workflows.
The output feature map for feature $k$ of a convolutional layer can be written as

$F_k = f(I \ast W_k + b_k), \quad (1)$

in which $I$ is the input from the preceding layer (or the input image), $W_k$ is the convolution kernel for
feature $k$, and $b_k$ is the bias vector. The non-linear activation function is denoted $f(\cdot)$ in (1).
Max-pooling can similarly be written as

$MPF_k(m,n) = \max_{(i,j) \in R_{m,n}} F_k(i,j), \quad (2)$

where $MPF_k$ denotes the $k$'th max-pooling feature and $R_{m,n}$ is the region of interest around
point $(m, n)$.
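For concreteness, a naive NumPy sketch of Equations (1) and (2) for a single feature map is given below; it is purely illustrative and unrelated to the OpenCL kernels discussed later:

```python
import numpy as np

def conv2d_feature(I, W_k, b_k, f=lambda x: np.maximum(x, 0)):
    """Naive 2D convolution for one feature k: F_k = f(I * W_k + b_k), Equation (1)."""
    h, w = I.shape
    kh, kw = W_k.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for m in range(out.shape[0]):
        for n in range(out.shape[1]):
            out[m, n] = np.sum(I[m:m + kh, n:n + kw] * W_k) + b_k
    return f(out)

def max_pool(F_k, size=2, stride=2):
    """Max-pooling, Equation (2): take the maximum over each region R_{m,n}."""
    h, w = F_k.shape
    return np.array([[F_k[m:m + size, n:n + size].max()
                      for n in range(0, w - size + 1, stride)]
                     for m in range(0, h - size + 1, stride)])

I = np.random.randn(8, 8)
F = conv2d_feature(I, np.random.randn(3, 3), 0.1)
P = max_pool(F)
print(F.shape, P.shape)   # (6, 6) (3, 3)
```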
Figure 3. OpenCL high-level synthesis for Field Programmable Gate Arrays (FPGA).
Figure 4. (a) Memory access pattern in OpenCL standard 1.x. (b) OpenCL pipes; in FPGAs, pipes are
implemented as FIFOs. (c) Deeply pipelined CNN network architecture.
4. Proposed Architecture
• Convolution: for the convolution operator, CNN2Gate parses dilations, pads, kernel shape,
and stride. The reader can refer to [31] for more information regarding these variables and how
they affect the computation. It also extracts the learned weights and biases for convolutional
kernels. CNN2Gate also computes the output tensor size of the layer using Equations (3) and (4).
Let us assume the input of a two-dimensional convolutional kernel is of size $(c_{in}, h_{in}, w_{in})$, where $c$
denotes the number of features, $h$ denotes the height, and $w$ denotes the width. For a layer with $k$
convolution kernels, the output tensor size $(c_{out}, h_{out}, w_{out})$ is given by $c_{out} = k$ and

$h_{out} = \lfloor (h_{in} + 2p - d \cdot (ks - 1) - 1) / st \rfloor + 1 \quad (3)$

$w_{out} = \lfloor (w_{in} + 2p - d \cdot (ks - 1) - 1) / st \rfloor + 1 \quad (4)$

where $ks$ is the kernel size, $st$ is the stride, while $p$ and $d$ are the padding and dilation
parameters, respectively (a small helper computing these sizes is sketched after this list).
• Max-pooling: similar to the convolution, CNN2Gate parses dilations, pads, kernel size, and strides.
However, as max-pooling is a down-sampling kernel, it does not have weights and biases.
The output tensor size of a max-pooling node with input size of (cin , hin , win ) is identical to
Equations (3) and (4).
• ReLU: CNN2Gate detects the presence of an activation function such as "Relu" after a convolutional
or max-pooling layer.
• General Matrix Multiplication (“GEMM”): a fully connected layer appears as a GEMM operator
in ONNX dataflow graph. In CNN2Gate, there is no specific kernel for the fully connected layer.
• Softmax: CNN2Gate detects the presence of the softmax operator after a fully connected layer.
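The output-size computation of Equations (3) and (4) referenced in the Convolution item above can be captured in a few lines; the function name and defaults below are our own, not part of CNN2Gate:

```python
import math

def conv_output_size(c_in, h_in, w_in, n_kernels, ks, st=1, p=0, d=1):
    """Output tensor size of a conv/max-pool node, per Equations (3) and (4).

    ks: kernel size, st: stride, p: padding, d: dilation.
    For max-pooling, the number of output features equals c_in.
    """
    h_out = math.floor((h_in + 2 * p - d * (ks - 1) - 1) / st) + 1
    w_out = math.floor((w_in + 2 * p - d * (ks - 1) - 1) / st) + 1
    return n_kernels, h_out, w_out

# First AlexNet convolution: 3x227x227 input, 96 kernels of size 11, stride 4
print(conv_output_size(3, 227, 227, 96, ks=11, st=4))   # (96, 55, 55)
```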
The front-end parser saves the information that specifies each layer’s data in a linked structure in
order to preserve the order. Later, this data structure is used by a high-level hardware synthesis tool.
The preserved order serves as a guideline for the synthesizer to configure hardware pipelines.
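As a hedged illustration of what such a front-end parser does (the model file name, the returned structure, and the field names below are assumptions, not CNN2Gate internals), the ONNX Python API can be used to walk a model's dataflow graph in order and collect the operators and attributes listed above:

```python
import onnx

def parse_onnx(path):
    """Walk an ONNX graph in execution order and record layer information.

    Illustrative sketch only; the returned structure is an assumption.
    """
    model = onnx.load(path)
    initializers = {t.name: t for t in model.graph.initializer}  # weights and biases
    layers = []
    for node in model.graph.node:                 # nodes are stored in topological order
        if node.op_type in ("Conv", "MaxPool", "Gemm", "Relu", "Softmax"):
            attrs = {a.name: onnx.helper.get_attribute_value(a) for a in node.attribute}
            params = [initializers[i] for i in node.input if i in initializers]
            layers.append({"op": node.op_type, "attrs": attrs, "params": params})
    return layers

# "alexnet.onnx" is a hypothetical exported model file
for layer in parse_onnx("alexnet.onnx"):
    print(layer["op"], layer["attrs"].get("kernel_shape"), layer["attrs"].get("strides"))
```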
Note that CNN2Gate can automatically configure several memory buffers, depending on the
layer operation type. For instance, if the next layer in the neural network is a fully connected layer,
it writes the data to the memory buffer associated with fully connected layers; similarly,
if the layer is convolutional, it writes the data to the convolution buffer.
CNN2Gate is also capable of building and running the CNN model in both emulation and full
flow mode. For the emulation mode, CNN2Gate compiles the project for a CPU. In some cases, the user
needs to verify whether the CNN model performs correctly in terms of computational accuracy before
committing to long hours of synthesis. The compilation for the emulation mode is significantly
faster (on the order of seconds) as compared to synthesizing the full flow for the FPGA, which takes
several hours. This feature makes the workflow more versatile for designers who want to iterate
between an FPGA design and a CNN model to reach the best quantization parameters and accuracy.
In order to synthesize the full flow on FPGA, CNN2Gate accepts the name of the FPGA board to
perform design-space exploration (Section 4.3) and generates the RTL accordingly. In order to validate
CNN2Gate, we tested this process by targeting three different Intel FPGA boards [33–35] and report
the results later in this paper. In addition, we used Intel OpenCL SDK 16.1 to synthesize our designs.
Taking advantage of OpenCL flexibility, it is possible to design an architecture with several
degrees of freedom. The first degree of freedom is the size of the pipes. Higher pipe throughput means
fewer congestion points when moving data from one kernel to another. The second degree
of freedom is the bandwidth of data that memory write/read kernels provide. The third degree of
freedom is the number of parallel convolutional (CONV) and RELU units that are used to implement
convolution kernels. Figure 6 shows this concept in a simple example. These degrees of freedom for
deeply pipelined kernels are adopted from what was proposed in [8]. The memory read kernel fetches Nl
vectors of size Ni for features and weights. Tuning Ni and Nl can provide a better throughput for the data
write/read kernels. Note that the memory access schedule, that is, where and when to read the features and
weights, is derived by the host program. The memory access schedule is configured by the front-end
parser based on the CNN model. The number of computation lanes (Nl) determines the level of
parallelism of the algorithm. The number of CONVs in a convolution kernel, the size of data pipes and
the number of max-pool operators in the max-pool kernel are tuned according to Nl . Changing Nl and
Ni can result in different utilization ratios of FPGA resources. For instance, when increasing
Nl, (1) more on-chip memory in the read/write kernels, (2) more registers for the pipe FIFOs, (3) more DSP
slices for the CONVs, and (4) more LUTs for the max-pooling kernels are needed in order to accommodate the
design on FPGA.
Figure 6. Detailed demonstration of vectorized input data and weights and the concept of computation
lanes in pipelined kernels.
Architectural Limitations
The architecture shown in Figure 6 is used to perform the calculation of all layers. In order to
obtain practical implementations and allow targeting FPGAs of various sizes, it is necessary to fold
(or time multiplex) the layers onto fewer hardware lanes and vectors [9,10]. Therefore, arbitrary choices
for Nl and Ni are not always possible. Ni should be a divisor of the features’ width for all layers in
order to avoid padding. Likewise, Nl should be a divisor of the number of features for all layers to
avoid idle lanes in some layers.
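These divisibility constraints can be expressed as a simple filter over candidate hardware options; the sketch below (our own illustration, with parameter names taken from the text rather than from CNN2Gate's code) enumerates the (Ni, Nl) pairs a DSE algorithm would be allowed to consider:

```python
def valid_hardware_options(layer_widths, layer_features, max_ni=16, max_nl=16):
    """Return (Ni, Nl) pairs compatible with all layers.

    Ni must divide every layer's feature width (to avoid padding) and
    Nl must divide every layer's number of features (to avoid idle lanes).
    """
    candidates = []
    for ni in range(1, max_ni + 1):
        if any(w % ni for w in layer_widths):
            continue
        for nl in range(1, max_nl + 1):
            if any(c % nl for c in layer_features):
                continue
            candidates.append((ni, nl))
    return candidates

# Hypothetical small network: per-layer feature widths and feature counts
print(valid_hardware_options([224, 112, 56], [64, 128, 256]))
```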
learning [36], with which the number of iterations in each episode is limited. Finally, in our RL-DSE,
a scaling factor β = 0.01 is applied to Favg in order to form the final reward function, converting it from
a percentage scale to a number between 0 and 1.
Note that a discount factor γ = 0.1 is used in our RL agent design. The discount factor specifies
that the agent does not have unlimited time to find the best option. Thus, the agent receives a
down-scaled reward as it spends more time in the environment. The discount factor urges the agent to
optimize the total discounted reward [37]:
$r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{i=0}^{n} \gamma^i r_{t+i} \quad (6)$
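A minimal sketch of this discounted return, using the discount factor γ = 0.1 of our RL agent, is shown below (illustrative only):

```python
gamma = 0.1   # discount factor used by the RL agent

def discounted_return(rewards):
    """Total discounted reward of Equation (6) for a finite list of per-step rewards."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))

# The same raw reward is worth much less when received later in the episode:
print(discounted_return([0.8]))             # 0.8
print(discounted_return([0.0, 0.0, 0.8]))   # 0.008
```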
Figure 7. The hill-climber in a two-dimensional (2D) design-space. Optimizer O moves in the
direction that optimizes the cost function ∆.
Algorithm 2: Hill-climbing
  Input:  Hstart = (Ni0, Nl0)
  Output: Hbest
  Hcurrent := Hstart
  compute Favg(current)
  while true do
      Hnext := null
      Favg(next) := −inf
      for i in Neighbours(Hcurrent) do
          if (Plut, Pdsp, Pmem, Preg)i < (Tlut, Tdsp, Tmem, Treg) then
              if Favg(i) > Favg(next) then
                  Favg(next) := Favg(i)
                  Hnext := Hi
              end
          end
      end
      if Favg(next) ≤ Favg(current) then
          Hbest := Hcurrent
          return Hbest
      end
      Hcurrent := Hnext; Favg(current) := Favg(next)
  end
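A compact Python rendering of Algorithm 2 follows; the synthesizer resource estimates, the throughput score, and the neighbourhood function are passed in as placeholders for the feedback that CNN2Gate obtains from the Intel OpenCL compiler:

```python
def hill_climb(h_start, f_avg, estimate_resources, thresholds, neighbours):
    """Hill-climbing DSE (Algorithm 2).

    h_start:            starting hardware option (Ni0, Nl0)
    f_avg(h):           performance score of option h (higher is better)
    estimate_resources: returns (P_lut, P_dsp, P_mem, P_reg) for option h
    thresholds:         (T_lut, T_dsp, T_mem, T_reg) resource budget of the FPGA
    neighbours(h):      candidate options adjacent to h in the design-space
    All callables are placeholders for the synthesizer feedback used by CNN2Gate.
    """
    h_current = h_start
    score_current = f_avg(h_current)
    while True:
        h_next, score_next = None, float("-inf")
        for h in neighbours(h_current):
            # Keep only options that fit within the FPGA's resource budget
            if all(p <= t for p, t in zip(estimate_resources(h), thresholds)):
                if f_avg(h) > score_next:
                    h_next, score_next = h, f_avg(h)
        if score_next <= score_current:       # no improving neighbour: local optimum
            return h_current
        h_current, score_current = h_next, score_next
```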
5. Results
Table 1 shows the execution times of AlexNet [38] and VGG-16 [39] on three platforms using
CNN2Gate. As mentioned before, the user can verify the CNN model on a CPU using the CNN2Gate
emulation mode in order to confirm the resulting numerical accuracy. Even if the execution time
is rather large, this is a very useful feature to let the developer verify the validity of the CNN design
on the target hardware before going forward for synthesis, which is a very time-consuming process.
Note that the emulation’s execution time cannot be a reference for the throughput performance of
a core-i7 processor. The emulation mode only serves the purpose of verifying the OpenCL kernels
operations. In [40], the authors described the execution of AlexNet on desktop CPUs and reported an
execution time as low as 2.15 s. The reported results also show the scalability of this design. Indeed,
the results are reported for both the low cost Cyclone V SoC and the much more expensive Arria 10
FPGA that has much more resources. The exploited level of parallelism was automatically extended
by the design-space exploration algorithm in order to obtain better results for execution times that are
commensurate with the capacity of the hardware platform.
Table 1. Execution times for Alexnet and VGG (batch size = 1).
Platform            | Resource Type           | Execution Time (AlexNet) | Execution Time (VGG-16) | fmax
Core-i7 (Emulation) | CPU                     | 13 s                     | 148 s                   | N/A
Cyclone V 5CSEMA5   | System-on-Chip          | 153 ms                   | 4.26 s                  | 131 MHz
Arria 10 GX 1150    | CPU+FPGA with PCIe link | 18 ms                    | 205 ms                  | 199 MHz
In order to maintain the scalability of the algorithm, it is not always possible to use arbitrary
choices for the (Ni, Nl) parameters. These parameters must be chosen such that the kernels can be used as
the building blocks of all the layers. This leaves limited options to increase the level of parallelism
that can be exploited with a given network. Relaxing this limitation (i.e., manually designing
kernels based on each layer's computation flow) could lead to better resource consumption and higher
throughput, at the expense of losing the scalability needed to automate this process.
The maximum operating frequency ( f max ) varies for different FPGAs. Indeed, it depends on the
underlying technology and FPGA family. The Intel OpenCL compiler (synthesizer) automatically adjusts
the PLLs on the board to use the maximum frequency for the kernels. It is of interest that the operating
frequency of the kernels supporting AlexNet and VGG-16 were the same, as they had essentially the
same critical path, even though VGG-16 is a lot more complex. The larger complexity of VGG was
handled by synthesizing a core (i.e., Figure 6) that executes a greater number of cycles if the number of
layers in the network increases.
Table 2 gives more details regarding the design-space exploration algorithms, which are coupled
with the synthesis tool. All three algorithms use the resource utilization estimation of the synthesizer to
fit the design on the FPGA. This is important as the time consumed for design-space exploration is
normally under 5 min., while the synthesis time for larger FPGAs, such as the Arria 10, can be close
to 10 h.
Table 2. Design-space exploration (DSE) times, synthesis time, resources, and selected hardware options.

Platform              | HC-DSE Time | RL-DSE Time | BF-DSE Time | Synthesis Time | Resources Available                                        | Resources Consumed                                      | Hardware Options (Ni, Nl)
Cyclone V SoC 5CSEMA4 | 20 s        | 2.5 min     | 3.5 min     | N/A            | ALM: 15 K; DSP: 83; RAM blocks: 321                        | Does not fit                                            | N/A
Cyclone V SoC 5CSEMA5 | 1 min       | 2.5 min     | 3.5 min     | 46 min         | ALM: 32 K; DSP: 87; RAM blocks: 397; Mem. bits: 4 M        | ALM: 26 K; DSP: 72; RAM blocks: 397; Mem. bits: 2 M     | (8, 8)
Arria 10 GX 1150      | 2 min       | 3 min       | 4 min       | 8.5 hrs        | ALM: 427 K; DSP: 1516; RAM blocks: 2713; Mem. bits: 55.5 M | ALM: 129 K; DSP: 300; RAM blocks: 1091; Mem. bits: 16 M | (16, 32)
Experimenting with various DSE algorithms confirms that smart algorithms (such as
reinforcement learning and hill-climbing) can provide advantages for design-space exploration when
compared to brute-force enumeration. The first goal of suggesting the RL-DSE and HC-DSE methods
is to demonstrate that it is possible to further decrease the exploration time by using some better search
methods. Analyzing the execution times shows that the reinforcement learning algorithm is almost
25 percent faster, and the hill-climbing algorithm almost 50 percent faster, than the brute-force algorithm when optimizing the
design for the Arria 10 FPGA. Note that these exploration times are significantly less than the synthesis
time, since we use the estimates of resource usage provided by the synthesizer. In contrast,
performing a full synthesis as part of the exploration process is not advisable, because the exploration
time would take weeks. Because HC-DSE follows the gradient of resource usage, it is always faster
than BF-DSE. This method is the best when the design-space is convex. However, if the search space is
not convex (e.g., having several local maxima), HC-DSE might choose a wrong solution. In contrast,
it is less probable for RL-DSE to be trapped in a local maximum due to the random nature of the
reinforcement learning algorithm. Moreover, the RL-DSE algorithm would be more valuable if it could
be exploited in conjunction with reinforcement learning quantization algorithms, such as ReLeQ [26].
The goal of considering various DSE algorithms in our work is to provide a versatile tool for the
user under different conditions, rather than one limited to a specific case. We included the brute-force algorithm
in CNN2Gate in order to guarantee a successful exploration for small design-spaces. However,
presuming that the design space is small is not always a correct assumption. For large convex
design-spaces, we added the hill-climbing algorithm, which finds the best solution significantly
faster than brute force. For non-convex and large design-spaces, reinforcement learning was found to
work well [41]. In Table 2, the design-space is small. This is why the difference in execution
time between the various exploration algorithms is small. However, it was shown that HC-DSE can be
twice as fast as brute force when optimizing the design for the Arria 10 FPGA.
There are other model-based design-space exploration algorithms that are dedicated to a specific
implementation of a library of primitives. For instance, in [24], the authors proposed a performance
model for their design, and they can predict the performance of the hardware implementation based
on the resource requirements. The advantage of our proposed DSE algorithms is that our algorithms
are model agnostic. This means that the CNN2Gate framework tunes the parallelism parameters of
the design and directly queries the performance feedback from the synthesizer.
We tried CNN2Gate on three platforms. The first one is a very small Cyclone V device with 15 K
adaptive logic modules (ALMs) and 83 DSPs. The fitter could not fit either AlexNet or VGG on this
device. Clearly, the minimum space required for fitting this design on an FPGA is fairly large due to
the complexity of the control logic. CNN2Gate did not experience any difficulty fitting the design on
bigger FPGAs, as demonstrated in Table 2. Resource utilization and hardware options (Ni, Nl) are also
provided, which correspond to the execution times shown in Table 1.
CNN2Gate resource consumption is very similar for AlexNet and VGG-16. For identical
hardware options, CNN2Gate's synthesized core is nearly identical for all CNN architectures,
as shown in Figure 6. The only difference is the size of the internal buffers that hold the data in the
computation flow. More quantitatively, in our implementation, VGG-16 uses 8% more of the Arria
10 FPGA block RAMs than what is shown in Table 2 for AlexNet.
Revisiting Figure 6, pipelined kernels are capable of reading data from global memory and
processing the convolution and pooling kernel at the same time. In addition, for fully connected layers
in CNNs, the convolution kernel acts as the main data process unit and the pooling kernel is configured
as a pass-through. When considering this hardware configuration, we can merge convolution and
pooling layers as one layer. In the case of AlexNet, this leads to five fused convolution/pooling
and three fully-connected layers. Figure 8 reports the detailed execution time of these eight layers,
including the memory read and write kernels for each layer. Thus, as the algorithm proceeds through the
layers, the dimensions of the data (features) are reduced and the execution time decreases. Note that,
in this figure, Layer 1 to Layer 5 are fused layers comprising a convolution, a pooling, and a ReLU
layer. Additionally, note that, although convolutional layers have fewer parameters than fully
connected layers, their memory consumption is far greater than that of fully connected layers.
Therefore, the performance of the first convolutional layers can be affected in a system accessing
external memory (RAM).
[Figure 8: bar chart of the execution time (ms) of Layer-1 to Layer-8 on the Cyclone V SoC 5CSEMA5 and the Arria 10 GX1150.]
Figure 8. Detailed breakdown of the execution time for each layer of AlexNet. A layer
here means the execution of one round of pipelined kernels, as shown in Figure 6. In the case
of AlexNet, Layer-1 to Layer-5 are combinations of memory read/write, convolution, and
pooling kernels, while Layer-6 to Layer-8 are combinations of memory read/write and fully
connected kernels.
Table 3 shows a detailed comparison of CNN2Gate to other existing works for AlexNet.
CNN2Gate is faster than [21,22] in terms of latency and throughput. However, CNN2Gate uses
more FPGA resources than [21]. To make a fair comparison, we can measure the relative performance
density per DSP or per ALM. In this case, the CNN2Gate performance density (GOp/s/DSP) is higher
(0.266) when compared to 0.234 for [21]. There are other designs, such as [9,23], which are faster than
our design in terms of pure performance. Nevertheless, the CNN2Gate method outlined above
significantly improves on existing methods, as it is more scalable and automatable than the methods
presented in [9,23], which are limited in these regards, as they require human intervention in order
to reach high levels of performance.
Table 3. Comparison to existing works for AlexNet with hardware options (Ni, Nl) = (16, 32).

                    | AlexNet [22]    | AlexNet [23]   | AlexNet [9] | AlexNet [21]    | AlexNet [This Work] *
FPGA                | Virtex-7 VX485T | Stratix-V GXA7 | Zynq 7045   | Stratix-V GX-D8 | Arria 10 GX1150
Synthesis method    | C/C++           | RTL            | C/C++       | OpenCL          | OpenCL
Frequency (MHz)     | 100             | 100            | 125         | -               | 199
Logic Utilization   | 186K (61%)      | 121K (52%)     | -           | 120K (17%)      | 129K (30%)
DSP Utilization     | 2240 (80%)      | 256 (100%)     | 897 (99.5%) | 665 (34%)       | 300 (20%)
Latency (ms)        | 21.61           | 12.75          | 8.22        | 20.1            | 18.24
Precision (bits)    | 32 float        | 8–16 fixed     | 16 fixed    | 8–16 fixed      | 8 fixed
Performance (GOp/s) | 61.62           | 114.5          | 161.98      | 72.4            | 80.04
* batch size = 1.
Second, the design entries of [9,23] are C and RTL, respectively, while our designs were
automatically synthesized from ONNX using OpenCL. Thus, not surprisingly, our work does
not achieve the maximum reported performance. This is partly due to the use of ONNX as a starting
point, and to keeping the algorithm scalable for both large and small FPGAs. This imposes
some limitations, such as the maximum number of utilized CONV units per layer. There are also
other latency reports in the literature, such as [8]. However, those latencies are measured with a
favorable batch size (e.g., 16). Increasing the batch size makes more parallelism available to the
algorithm, which can lead to higher throughput. Thus, for clarity, we limited the comparisons in Tables 3
and 4 to batch size = 1.
Table 4. Comparison to existing works for VGG-16 with hardware options (Ni, Nl) = (16, 32).

                    | VGG-16 [42] | VGG-16 [11]     | VGG-16 [9] | VGG-16 [21]     | VGG-16 [This Work] *
FPGA                | Zynq 7045   | Arria 10 GX1150 | Zynq 7045  | Stratix-V GX-D8 | Arria 10 GX1150
Synthesis method    | -           | RTL             | C/C++      | OpenCL          | OpenCL
Frequency (MHz)     | 150         | 150             | 125        | 120             | 199
Logic Utilization   | 182 (83.5%) | 161K (38%)      | -          | -               | 129K (30%)
DSP Utilization     | 780 (89.2%) | 1518 (100%)     | 855 (95%)  | -               | 300 (20%)
Latency (ms)        | -           | 47.97           | 249.5      | 262.9           | 205
Precision (bits)    | 16 fixed    | 8–16 fixed      | 16 fixed   | 8–16 fixed      | 8 fixed
Performance (GOp/s) | 136.91      | 645.25          | 161.98     | 117.8           | 151.7
* batch size = 1.
Table 4 shows a detailed comparison of CNN2Gate to other existing works for VGG-16. It is
observable that CNN2Gate performs better for larger neural networks, such as VGG. CNN2Gate
achieves an 18% lower latency than [9], despite the fact that CNN2Gate uses fewer DSPs, whereas,
for AlexNet, [9] was more than 50% faster than CNN2Gate. Finally, for VGG-16, we did find some
hand-tailored RTL custom designs, such as [11], which are faster than CNN2Gate.
1. Arranging the basic hardware building blocks or library of primitives: CNN2Gate provides the
framework in which the primitive blocks are connected to form the specific architecture of a
convolutional neural network. This architecture is based on the model that is parsed from an
ONNX representation of the neural network. As shown in Figure 9, CNN2Gate chooses and
connects the building blocks of a convolutional neural network from a library of primitives.
This library of primitives can be designed by the user or it can be adapted from other existing
open-source designs. We assume that the level of parallelism is controllable in the library
of primitives.
2. Providing parallelism control for algorithm or design-space exploration: after providing the
layout of building blocks, CNN2Gate makes several queries to the hardware synthesizer and
determines the best parameters, particularly regarding the level of parallelism, in order to obtain
the best performance or performance/accuracy/cost trade-off of the hardware implementation.
The performance characteristics of the hardware architecture can be defined by the user as the
total logic utilization, the throughput, or a combination of them (a simple example of such a scalarized score is sketched below).
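As a hedged example (the weights and normalization below are our own choices, not CNN2Gate defaults), such a user-defined performance characteristic could be a simple scalarized score combining throughput and logic utilization:

```python
def design_score(throughput_gops, logic_util_percent,
                 w_throughput=1.0, w_logic=0.5, throughput_ref=100.0):
    """Scalarized performance characteristic for design-space exploration.

    Combines normalized throughput (higher is better) with logic utilization
    (lower is better); the weights and reference throughput are illustrative.
    """
    return (w_throughput * throughput_gops / throughput_ref
            - w_logic * logic_util_percent / 100.0)

# A faster but larger design vs. a smaller but slower one:
print(design_score(80.0, 30.0))   # 0.65
print(design_score(60.0, 17.0))   # 0.515
```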
In our case, we adapted the library of primitives from PipeCNN [8]. The three DSE algorithms
introduced in the previous sections are the control algorithms that work in conjunction
with the synthesizer in order to fit the design to the desired FPGA target.
[Figure 9: the model parser determines how the primitive blocks (CONV, POL, ReLU, FC) are connected and quantized, while the parallelism control (DSE) algorithm sets the level of parallelism of the CNN using performance feedback.]
Figure 10. Reinforcement Learning: a software agent takes actions in the environment in order to
maximize the reward.
To set up the RL agent to explore the N-dimensional design-space, we need to take into
consideration the following steps:
1. Defining the states: in the context of a hardware optimizer, states can be defined as a set of all
possible hardware options. Particularly, states represent the whole design-space.
2. Defining the actions: actions are the act of moving from one state to another. In the
N-dimensional case, actions can be defined as unit vectors along each dimension.
3. Forming a reward function: executing an action a in state s provides the agent with a numerical
reward score r. The agent tries to maximize this reward. In the context of this article, the reward
can be throughput.
4. Considering a living penalty or discount factor for the agent: the agent should not have
an infinite amount of time to explore the environment. The longer the agent stays in the
environment, the less reward it should get. This trains the agent to find the best hardware option
as fast as possible.
Having defined all of the previous steps, we are able to train our agent to find the best hardware
option while using common reinforcement learning techniques, such as Q-Learning [44].
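Under the four assumptions above, a tabular Q-learning loop over the (Ni, Nl) design-space could look like the following sketch; the state and action encodings, the environment step function, and the reward (scaled synthesizer feedback) are our own illustrative choices:

```python
import random
from collections import defaultdict

def q_learning_dse(states, actions, step, reward, episodes=50, horizon=10,
                   alpha=0.5, gamma=0.1, epsilon=0.2):
    """Tabular Q-learning over a discrete hardware design-space (illustrative sketch).

    states:     all hardware options, e.g. [(Ni, Nl), ...]
    actions:    unit moves per dimension, e.g. [(+1, 0), (-1, 0), (0, +1), (0, -1)]
    step(s, a): next state after taking action a in state s (clipped to the design-space)
    reward(s):  scaled synthesizer feedback for state s (0 if the design does not fit)
    """
    Q = defaultdict(float)
    for _ in range(episodes):
        s = random.choice(states)
        for _ in range(horizon):                       # limited iterations per episode
            a = (random.choice(actions) if random.random() < epsilon
                 else max(actions, key=lambda act: Q[(s, act)]))   # epsilon-greedy policy
            s_next = step(s, a)
            r = reward(s_next)
            # Standard Q-learning update with discount factor gamma
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions)
                                  - Q[(s, a)])
            s = s_next
    # Best hardware option: the state with the highest learned action value
    return max(states, key=lambda st: max(Q[(st, a)] for a in actions))
```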
7. Conclusions
This paper described the design and implementation of a general framework for developing
convolutional neural networks on FPGAs. This framework takes the form of a Python library that
can be integrated with a wide range of popular machine learning frameworks. CNN2Gate makes
it easy for machine learning developers to program and use FPGAs in order to perform inference.
CNN2Gate exploits the OpenCL synthesis workflow for FPGAs that are offered by commercial vendors.
CNN2Gate is capable of parsing the data-flow of CNN models expressed in ONNX. This framework
also has an integrated design-space exploration tool that helps developers to find the best hardware
option for synthesizing a given CNN model on an FPGA. CNN2Gate achieves a classification latency
of 205 ms for VGG-16 and 18 ms for AlexNet on an Intel Arria 10 FPGA. These results are excellent
when considering that they were obtained by an automated design-space exploration and synthesis
process, which does not rely on an expert's low-level hardware knowledge.
Author Contributions: All authors contributed substantially to methodology and validation of this manuscript.
All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by an IVADO (Institut de Valorisation des Données) grant through its
fundamental research program and the work was conducted with tools provided by CMC Microsystems.
Acknowledgments: The authors would like to thank the Institut de Valorisation des Données (IVADO) and CMC
Microsystems for supporting this research.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
BF Brute Force
CNN Convolutional Neural Network
CNN2Gate Convolutional Neural Network to Field Programmable Gate Arrays
DSE Design Space Exploration
FPGA Field Programmable Gate Array
FIFO First In First Out
HC Hill Climbing
RL Reinforcement Learning
ONNX Open Neural Network Exchange
References
1. Rawat, W.; Wang, Z. Deep convolutional neural networks for image classification: A comprehensive review.
Neural Comput. 2017, 29, 2352–2449. [CrossRef] [PubMed]
2. Strigl, D.; Kofler, K.; Podlipnig, S. Performance and scalability of GPU-based convolutional neural
networks. In Proceedings of the 2010 18th Euromicro Conference on Parallel, Distributed and Network-based
Processing, Pisa, Italy, 17–19 February 2010; pp. 317–324.
22. Zhang, C.; Li, P.; Sun, G.; Guan, Y.; Xiao, B.; Cong, J. Optimizing fpga-based accelerator design for deep
convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2015; pp. 161–170.
23. Ma, Y.; Suda, N.; Cao, Y.; Seo, J.S.; Vrudhula, S. Scalable and modularized RTL compilation of convolutional
neural networks onto FPGA. In Proceedings of the 2016 26th International Conference on Field Programmable
Logic and Applications (FPL), Lausanne, Switzerland, 29 August–2 September 2016; pp. 1–8.
24. Wang, D.; Xu, K.; Jia, Q.; Ghiasi, S. ABM-SpConv: A Novel Approach to FPGA-Based Acceleration of
Convolutional Neural Network Inference. In Proceedings of the 56th Annual Design Automation Conference
2019, Las Vegas, NV, USA, 2–6 June 2019; p. 87.
25. Wang, K.; Liu, Z.; Lin, Y.; Lin, J.; Han, S. HAQ: Hardware-Aware Automated Quantization with Mixed
Precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach,
CA, USA, 15–20 June 2019; pp. 8612–8620.
26. Yazdanbakhsh, A.; Elthakeb, A.T.; Pilligundla, P.; Esmaeilzadeh, F.M.H. ReLeQ: An Automatic Reinforcement
Learning Approach for Deep Quantization of Neural Networks. arXiv 2018, arXiv:1811.01704.
27. Grondman, I.; Busoniu, L.; Lopes, G.A.; Babuska, R. A survey of actor-critic reinforcement learning:
Standard and natural policy gradients. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 2012, 42, 1291–1307.
[CrossRef]
28. Zhang, B.; Wei, J. Hardware Implementation for Haze Removal With Adaptive Filtering. IEEE Access 2019,
7, 142498–142506. [CrossRef]
29. Ngo, D.; Lee, S.; Lee, G.D.; Kang, B. Single-Image Visibility Restoration: A Machine Learning Approach and
Its 4K-Capable Hardware Accelerator. Sensors 2020, 20, 5795. [CrossRef] [PubMed]
30. Véstias, M.P. A survey of convolutional neural networks on edge with reconfigurable computing. Algorithms
2019, 12, 154. [CrossRef]
31. Dumoulin, V.; Visin, F. A guide to convolution arithmetic for deep learning. arXiv 2016, arXiv:1603.07285.
32. Gajski, D.D.; Kuhn, R.H. New VLSI tools. Computer 1983, 11–14. [CrossRef]
33. Terasic. DE0-Nano-SoC Kit/Atlas-SoC Kit. 2019. Available online: de0-nano-soc.terasic.com (accessed on
19 December 2020).
34. Terasic. DE1-SoC Board. 2019. Available online: de1-soc.terasic.com (accessed on 19 December 2020).
35. Nallatech. Nallatech 510 Acceleration Board. 2019. Available online: https://www.bittware.com/fpga/510t/
(accessed on 19 December 2020).
36. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous
methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine
Learning, New York, NY, USA, 19–24 June 2016; pp. 1928–1937.
37. Van Hasselt, H.; Wiering, M.A. Reinforcement learning in continuous action spaces. In Proceedings of the
2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning,
Honolulu, HI, USA, 1–4 April 2007; pp. 272–279.
38. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks.
In Advances in Neural Information Processing Systems; ACM: New York, NY, USA, 2012; pp. 1097–1105.
[CrossRef]
39. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv
2014, arXiv:1409.1556.
40. Shi, S.; Wang, Q.; Xu, P.; Chu, X. Benchmarking state-of-the-art deep learning software tools. In Proceedings
of the 2016 7th International Conference on Cloud Computing and Big Data (CCBD), Macau, China,
16–18 November 2016; pp. 99–104.
41. Van Moffaert, K.; Drugan, M.M.; Nowé, A. Scalarized multi-objective reinforcement learning: Novel design
techniques. In Proceedings of the 2013 IEEE Symposium on Adaptive Dynamic Programming and
Reinforcement Learning (ADPRL), Singapore, 16–19 April 2013; pp. 191–199.
42. Qiu, J.; Wang, J.; Yao, S.; Guo, K.; Li, B.; Zhou, E.; Yu, J.; Tang, T.; Xu, N.; Song, S.; et al. Going deeper
with embedded fpga platform for convolutional neural network. In Proceedings of the 2016 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 21–23 February 2016;
pp. 26–35.
43. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
44. Watkins, C.J.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [CrossRef]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional
affiliations.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).