A Scalable and Efficient Convolutional Neural Network Accelerator Using HLS For A System-On-Chip Design
Abstract
This paper presents a configurable Convolutional Neural Network Accelerator (CNNA) for a System on Chip (SoC) design. The goal was to accelerate inference of different deep learning networks on an embedded SoC platform. The presented CNNA has a scalable architecture that uses High Level Synthesis (HLS) and SystemC for the hardware accelerator. It is able to accelerate any Convolutional Neural Network (CNN) exported from Python and supports a combination of convolutional, max-pooling and fully connected layers. A training method with fixed-point quantized weights is proposed and presented in the paper. The CNNA is template-based, enabling it to scale to different targets of the Xilinx Zynq platform. This approach enables design space exploration, which makes it possible to explore several configurations of the CNNA during C- and RTL-simulation, fitting it to the desired platform and model. The CNN VGG16 was used to test the solution on a Xilinx Ultra96 board using PYNQ. Training with an auto-scaled fixed-point Q2.14 format achieved an accuracy close to that of a similar floating-point model. The solution performed inference in 2.0 seconds with an average power consumption of 2.63 W, which corresponds to a power efficiency of 6.0 GOPS/W.
Keywords: System On Chip, FPGA, High Level Synthesis, Convolutional Neural Network, PYNQ
Figure 9: Results of the C-simulation of the combined test of all fixed-point candidates, showing the average resource usage of BRAM and DSP versus latency. PE_BW× is set to 128 for all solutions. The plotted text for a candidate is in the format PE_N, DB_BW×^(output), kernels_N^(R^(3×3×512+1)). The candidates are split into three groups according to word length (data_size^(W)): 8-bit, 16-bit and 32-bit versions. It took 2 minutes to create the estimate for one candidate solution.

...all possible combinations of parameters, it would take weeks, months or even years, since the architecture has such a large number of parameters, e.g. the BW between modules, the FIFO depth, the number of PEs, etc. The high-level model of the IP in SystemC can be simulated faster than the RTL code by a factor of 50-200, depending on the size of the accelerator. It is therefore possible to use a heuristic approach to find the optimal solutions for a certain fixed-point resolution, constrained by the given target device, by evaluating several simulated solutions.

The design parameters are used for tuning the CNNA design in order to find a balance between precision, speed and resources. The CNNA tuning parameters are as follows:

data_size^(W): The word length of the fixed-point data format in bits, i.e. I+F. It has an impact on precision.

PE_BW×: The internal BW, with an element size of data_size^(W), used by the CNNA.

PE_N: The number of PEs; the size of the PEA is limited by the FPGA fabric.

DB_BW×^(output): The output BW multiplier after the CLB. Normally this is set equal to CLB_N^(rows), but it can be set to a lower number in order to allow the PE to run with a lower BW.

kernels_N^(R^(3×3×512+1)): The number of kernels held in the weight buffer. Used to calculate

    WB_size^(buffer) = ( (3 × 3 × 512) / (DB_BW×^(output) · PE_BW×) + bias_size ) · kernels_N^(R^(3×3×512+1))

Here (3 × 3 × 512) is chosen from the largest layer in the CNN, which in this case is VGG16 [2].

The tuning of the CNNA can be expressed as a vector, β, with five hyper-parameters:

    β = [ data_size^(W), PE_BW×, PE_N, DB_BW×^(output), kernels_N^(R^(3×3×512+1)) ]

To measure the performance of the different CNNA configurations, a simulation was made. It consisted of five different elements: two pooling operations, two convolution operations and a single fully connected operation. They were executed individually but evaluated together.

When looking at the latency of the combined simulation test, i.e. the five simulations carried out consecutively, the dominant candidates all have DB_BW×^(output) = 1 regardless of word length (see figure 9). The figure also shows that the faster the accelerator, the higher the number of PEs.

Two models were created of each configuration. One used C-simulation, i.e. a simulation of the SystemC HLS code directly. The other was an RTL-simulation, which used the RTL code generated from the SystemC model for the best solutions. The latter is clock-cycle accurate, so its execution time is precise.

Several candidates were identified and are shown in greater detail in table 1. The table shows the number of Digital Signal Processing slices (DSPs) and BRAMs used, as well as the total latency for C- and RTL-simulation. Some candidates, marked with a "-", used more resources than were available on the tested target platform.
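As a small illustration of how the weight-buffer size follows from a candidate vector β, the sketch below evaluates the expression above (as reconstructed here) for a few of the candidates from table 1. The bias term is assumed to occupy one buffer element; the resource and latency figures themselves come from C-/RTL-simulation and are not modelled here.

```python
# Illustrative sketch (not the authors' tool): weight-buffer size implied by a
# candidate beta = [data_size_W, PE_BWx, PE_N, DB_BWx, kernels_N].
LARGEST_LAYER = 3 * 3 * 512   # largest VGG16 kernel volume, as used in the text
BIAS_SIZE = 1                 # assumption: one bias element per kernel

def weight_buffer_words(beta):
    """Number of buffer words needed to hold the kernels of one candidate."""
    data_size_w, pe_bw, pe_n, db_bw, kernels_n = beta
    words_per_kernel = LARGEST_LAYER / (db_bw * pe_bw) + BIAS_SIZE
    return words_per_kernel * kernels_n

# Candidate vectors taken from table 1
candidates = [
    [8, 128, 8, 3, 42],
    [8, 128, 16, 1, 42],
    [16, 128, 16, 1, 32],
    [32, 128, 16, 1, 20],
]

for beta in candidates:
    print(beta, "->", weight_buffer_words(beta), "buffer words")
```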
Table 1: Design space exploration of resource usage and latency of possible CNNA candidates using C- and RTL-simulation. A "÷" means RTL-simulation was performed, but there was insufficient space on the target platform (Ultra96). A "-" means RTL-simulation was not performed.

β                     resource avg [%]  DSPs  DSPs (RTL)  BRAMs  BRAMs (RTL)  latency [ms]  latency (RTL) [ms]
[8, 128, 8, 3, 42]    70                384   359         144    249          4.60          6.66
[8, 128, 8, 1, 42]    34                128   125         137    185          4.95          7.28
[8, 128, 16, 3, 42]   124               768   -           152    -            4.26          -
[8, 128, 16, 1, 42]   52                256   245         139    193          4.03          6.66
[16, 128, 8, 3, 32]   54                192   360         233    377          7.47          8.40
[16, 128, 8, 1, 32]   35                64    -           227    -            7.61          -
[16, 128, 16, 3, 32]  81                384   ÷           239    ÷            6.98          ÷
[16, 128, 16, 1, 32]  44                128   293         229    349          6.27          8.52
[32, 128, 8, 3, 20]   94                384   -           355    -            13.19         -
[32, 128, 8, 1, 20]   58                128   165         351    336          12.92         14.53
[32, 128, 16, 3, 20]  148               768   -           359    -            12.42         -
[32, 128, 16, 1, 20]  76                256   325         353    408          10.73         12.81
The accuracy, performance and power consumption of the proposed system will be presented and discussed in this section.

6.1. Accuracy

The CNN was trained using the small dataset in order to find suitable candidates faster, since it is faster to train for five classes than for 29 classes. If the accuracy of a fixed-point format is poor on five classes, it will likely be as poor, or worse, when training on 29 classes. Therefore, initial training was carried out on the small dataset.

Figure 10 shows that most of the trained models faced issues and obtained low accuracy when using a fixed-point format. The only quantized version that obtained a high level of accuracy was the one using fixed-point format Q2.14. It is unknown why the training with fixed-point format Q2.14 and no auto-scaling makes a sudden dive after 10 epochs. However, it could be caused by the learning rate being too high or too low, or by too few neurons in the fully connected layers. The best results were achieved with fixed-point format Q2.14 and auto-scaling, which converges towards an accuracy of almost 100%. None of the fixed-point Q2.6 versions could be trained to achieve any useful results.

Table 2 shows the results of the training with five classes. Only the training that used Q2.14 with no auto-scaling performed well with 4096 neurons and reached approximately 83%. The table shows the number of neurons in the fully connected layers, N_neurons, as well as the training and validation accuracy. Validation was performed on a dataset not used for training.

Figure 11: Training with 29 classes. Blue: floating-point, red: fixed-point Q2.14 with auto-scaling, light blue: fixed-point Q2.14 without auto-scaling.

A final test was performed on all 29 classes of DETECT with the candidates that performed well in the previous test. The best candidates were the floating-point version, used for reference, and the versions that used fixed-point format Q2.14, both with and without auto-scaling. As is evident from figure 11, only the training that used fixed-point format Q2.14 and auto-scaling achieved promising results. It shows that it is much more difficult to train the CNNA when using quantization, because details are lost due to the limited range of the fixed-point numbers. It also takes many more iterations for the training to reach the same accuracy level as the floating-point format.

The first 29 classes from ImageNet and Cifar-100 were also used for training. The validation results in table 3 show that, compared with floating-point, the accuracy of the Q2.14 format drops by 3.5% and 3.2%. For DETECT the drop is 4%, which is higher than for training with ImageNet and Cifar-100.

Table 3: Results for the training of the 16-bit fixed-point VGG16 on 29 classes from the datasets DETECT (DET), ImageNet (Image) and Cifar-100 (Cifar).

Type   data   auto  N_neurons  val. [%]  train [%]
float  DET    n/a   1024       88.0      100.0
Q2.14  DET    no    1024       5.0       5.0
Q2.14  DET    yes   1024       86.4      94.4
float  DET    n/a   4096       84.0      99.4
Q2.14  DET    no    4096       5.0       5.1
Q2.14  DET    yes   4096       86.5      92.9
float  Image  n/a   4096       83.0      99.2
Q2.14  Image  yes   4096       79.5      88.1
float  Cifar  n/a   4096       80.5      99.5
Q2.14  Cifar  yes   4096       77.3      89.3
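As a rough illustration of quantizing weights to a Qm.n fixed-point format with auto-scaling, the sketch below snaps a weight tensor onto a 16-bit fixed-point grid and, in the auto-scaled variant, picks the radix point per layer from the largest absolute weight. This is a minimal sketch for illustration and not the authors' training pipeline; the convention that the sign bit is counted in the integer bits is an assumption.

```python
import numpy as np

def quantize_fixed(w, int_bits=2, frac_bits=14):
    """Quantize to a signed Qm.n grid; the sign bit is counted in the m integer
    bits, so Q2.14 spans [-2, 2 - 2**-14] with a step of 2**-14."""
    scale = 2.0 ** frac_bits
    lo = -(2.0 ** (int_bits - 1))
    hi = (2.0 ** (int_bits - 1)) - 1.0 / scale
    return np.clip(np.round(w * scale) / scale, lo, hi)

def auto_scaled_quantize(w, word_length=16):
    """Pick the radix point per layer so the largest |weight| still fits, then quantize."""
    max_abs = float(np.max(np.abs(w))) + 1e-12
    int_bits = max(1, int(np.ceil(np.log2(max_abs))) + 1)  # +1 for the sign bit
    frac_bits = word_length - int_bits
    return quantize_fixed(w, int_bits, frac_bits), (int_bits, frac_bits)

# Example: a layer whose weights fit comfortably in Q2.14
layer_weights = np.random.uniform(-1.5, 1.5, size=(3, 3, 64, 64))
q_fixed = quantize_fixed(layer_weights)            # fixed Q2.14, no auto-scaling
q_auto, fmt = auto_scaled_quantize(layer_weights)  # per-layer radix point
print("auto-scaled format: Q%d.%d" % fmt)
```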
6.2. Performance

The Xilinx Ultra96 board [43] was used to evaluate the performance of the system, using HW clocks of 100 MHz and 172.22 MHz for the CNNA IP core. The inference time was measured for the different configurations CNNA16, CNNA8^1 and CNNA8^2, and the resulting inference times are shown in table 4. The timing performance was measured on the Ultra96 board during inference of the quantized and trained VGG16 model with five classes. The mean time and variance are averages of 30 measurements. The fastest model, CNNA8^1, took 1.22 sec per image, while the slowest, CNNA16 at 100 MHz, took 2.20 sec per image.

Table 4: Average inference time and variance using VGG16 for five classes using four different IP cores.

              CNNA16 @ 100 MHz  CNNA16 @ 172 MHz  CNNA8^1 @ 172 MHz  CNNA8^2 @ 172 MHz
avg [sec]     2.20              1.96              1.22               1.49
var [·10^-3]  0.25              0.30              0.20               0.11

The different layers have different execution times, as shown in table 5. As expected, the execution time of the convolutional layers depends on the number of bits in the fixed-point format. Pooling, however, took approximately the same time for all tested IPs, since pooling is independent of the fixed-point format. The table shows that the IP CNNA8^1 obtained the best performance due to its larger number of PEs (16). Note that CNNA8^2 was slightly faster than CNNA8^1 in the convolutional layers, even with fewer PEs, due to the higher bandwidth of the output multiplier. There is a large number of splits (512) in the dense 1 and dense 2 layers, and they consume more than half of the total execution time for all three CNNA configurations. On average, 32% of the time is used to set up the DMAs from PYNQ, which could be optimized with a scatter-gather DMA. In such a solution the DMA would initiate the transfer for the next location of DRAM data without involving the CPU. A larger FPGA with more on-chip memory could also be a solution to lower the number of splits and optimize the performance further.

Table 5: Time of execution of each VGG16 layer in [ms] using four different IP cores.

layer     CNNA16 @ 100 MHz  CNNA16 @ 172 MHz  CNNA8^1 @ 172 MHz  CNNA8^2 @ 172 MHz
l1 conv1  19.3              17.0              19.9               21.4
l1 conv2  111               84.1              61.3               60.1
l1 pool   18.1              13.7              12.9               17.1
l2 conv1  55.4              42.9              31.3               30.5
l2 conv2  108               81.3              60.2               56.3
l2 pool   8.98              6.87              6.36               8.40
l3 conv1  56.0              43.5              33.1               29.9
l3 conv2  112               84.2              64.2               59.1
l3 conv3  110               85.5              63.0               57.8
l3 pool   4.51              3.48              3.25               4.23
l4 conv1  64.6              51.5              37.9               35.3
l4 conv2  126               97.9              76.0               70.5
l4 conv3  123               102.0             73.5               67.8
l4 pool   2.32              1.83              1.71               2.19
l5 conv1  46.6              41.0              30.7               29.3
l5 conv2  49.1              39.7              29.7               28.1
l5 conv3  45.8              39.5              29.6               27.8
l5 pool   0.74              0.62              0.59               0.69
dense 1   767               737               364                509
dense 2   393               397               197                362
dense 3   1.55              1.62              1.18               1.50

6.3. Power consumption

The power consumption of the design with CNNA16, CNNA8^1 and CNNA8^2 was measured on the Ultra96 board during inference of the trained VGG16 model with five classes. The measured voltage of the power supply to the board was multiplied by the measured current to compute the power consumption. The mean and maximum power during inference are calculated as means over 10 inferences. The power consumption of the IP core is defined as the difference between the power during inference and the idle power of the Ultra96 board. The idle power consumption was measured at P_idle = 3.055 W over a five-minute period.

Table 6: Average and peak power consumption in watt of the Ultra96 board and the IP core during inference.

           CNNA16 @ 100 MHz  CNNA16 @ 172 MHz  CNNA8^1 @ 172 MHz  CNNA8^2 @ 172 MHz
P_avg      5.28              5.68              4.71               4.80
P_peak     6.60              7.14              5.76               6.35
P_IP,avg   2.23              2.63              1.66               1.74
P_IP,peak  3.55              4.09              2.71               3.30
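To make the measurement procedure concrete, the sketch below computes board power from synchronized voltage and current samples, subtracts the reported idle power to isolate the IP core, and derives the energy spent per inference. The function and variable names are illustrative and are not the authors' measurement scripts.

```python
# Illustrative only: board power from V*I samples, IP-core power as the
# difference to the measured idle power, and energy per inference.
P_IDLE = 3.055  # W, idle power of the Ultra96 board as reported above

def ip_core_power(voltage_samples, current_samples, p_idle=P_IDLE):
    """Mean and peak power drawn by the IP core during one inference,
    given synchronized supply-voltage and current samples."""
    powers = [v * i for v, i in zip(voltage_samples, current_samples)]
    p_avg = sum(powers) / len(powers)
    p_peak = max(powers)
    return p_avg - p_idle, p_peak - p_idle

def energy_per_inference(p_ip_avg, inference_time_s):
    """Energy in joules spent by the IP core for one inference."""
    return p_ip_avg * inference_time_s

# Example numbers from tables 4 and 6 (CNNA16 at 172 MHz): roughly 5.2 J per image.
print(energy_per_inference(2.63, 1.96))
```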
Figure 12: Power consumption of the tested solutions CNNA8^1, CNNA8^2 and CNNA16 during inference at 172 MHz.

Table 6 shows that the mean power consumption of the Ultra96 board for all tests was between 4.7 and 5.7 W, out of which the IP core only consumes approximately 2 W. This means that running the IP did not affect the average power consumption. However, because they run for a shorter amount of time, the fixed-point IPs with a low number of bits used less energy per inference. The CNNA16 with a 100 MHz clock was 0.24 sec slower but consumed less power than the version with a 172 MHz clock. Table 6 also shows that the peak power consumption was almost the same for all tested IPs, in the range from 2.7 W to 4.1 W.

Figure 12 shows that the power consumption is largest in the beginning of the inference, i.e. in the convolution blocks of the CNN. The power consumption dropped during execution of the fully connected layers. This indicates that most of the FPGA logic was active during convolution, while less logic was used when computing the fully connected layers and pooling. The pooling activity corresponds to the big dips in power consumption in the first half of the inference.

7. Comparison with state-of-the-art CNNs

We have chosen to evaluate our work against the current state-of-the-art toolflows presented in [23], which use a fixed-point resolution of 8 or 16 bits to perform FPGA acceleration of the VGG16 network by targeting the Xilinx Zynq and UltraScale platforms. The purpose of this is to compare our work with other tools that have mapped the same CNN to similar FPGA devices from the same vendor.

Accuracy. Our fixed-point training method only performed well for 16-bit quantization. DoReFa-Net [47] proposes a method for training CNNs with low-bit quantization and demonstrates high accuracy using AlexNet with only 1-bit weights. FINN-R [11] uses quantized neural networks for low-bit quantization of different CNN topologies with high accuracy. Angel-Eye [27] also proposes a dynamic quantization strategy, where the network is initially trained with a floating-point format. The radix position of the fixed-point data is then chosen differently for each layer based on statistics, and an optimal radix point is selected. The network is converted back to floating-point and fine-tuned. This method achieves a high level of accuracy for both 16- and 8-bit formats with VGG16.

Performance. To compare the different solutions, the performance needs to be expressed in Giga Operations Per Second (GOPS). The performance result is normalized relative to the number of Look Up Tables (LUTs) and DSPs as a measure of the available resources on the target device. This performance density measure is used to compare the VGG16 mapped to different FPGA devices. The throughput performance is calculated as the number of Giga Operations (GOP) performed by the CNN relative to the inference time in seconds. In the case of the VGG16 network, the total number is 30.76 GOP, out of which 30.7 GOP is performed in the convolutional (CONV) layers. We present the performance for both the CONV layers and all layers of the VGG16 model, since some solutions do not accelerate the FC layers.

The results are shown in table 7, which indicates that the CNNA performance (Total) is lower than the comparable state-of-the-art architectures. The best performance of our 16-bit solution is 29.1 GOPS. This is lower than the 31.4 GOPS of DNNWEAVER, which has the worst performance of the state-of-the-art solutions.

The Ultra96 target used in our evaluation is small and low-cost compared to the ones used in some of the examples, e.g. the Zynq XC7Z045 and UltraScale KU060. If a larger and more expensive target, such as the Xilinx ZCU104 evaluation kit [48], were used, it would be possible to increase the number of PEs, thereby achieving a higher throughput and performance.

The performance density measure is also lower than most of the other architectures and only similar to DNNWEAVER.
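As a concrete example of how these metrics are derived, the snippet below computes the total (all-layer) throughput and power efficiency of CNNA16 at 172 MHz from the numbers reported in this paper: 30.76 GOP per VGG16 inference, 1.96 s per image (table 4) and 2.63 W average IP-core power (table 6).

```python
# Throughput and power efficiency for CNNA16 at 172 MHz, from the reported numbers.
GOP_VGG16 = 30.76        # total giga-operations for one VGG16 inference
t_inf = 1.96             # s per image (table 4)
p_ip_avg = 2.63          # W, average IP-core power (table 6)

gops = GOP_VGG16 / t_inf             # ~15.7 GOPS total throughput
gops_per_watt = gops / p_ip_avg      # ~6.0 GOPS/W power efficiency
print(round(gops, 1), round(gops_per_watt, 1))
```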
Table 7: Comparison of the CNNA with state-of-the-art CNN accelerators: DnnWeaver [26], fpgaConvNet [24], Angel-Eye [27] and Caffeine [29]. All solutions target Xilinx Zynq devices except Caffeine, which uses the Kintex UltraScale FPGA. The power efficiency, performance density and throughput performance are listed for the different solutions. The performance density is only shown for the CONV layers.

Technique-MHz | Xilinx Device | Fix. [bits] | Power Efficiency [GOPS/W] | Power E. (Conv) [GOPS/W] | Density [GOPS/DSP] | Density [GOPS/kLUT] | Perfor. (Conv) [GOPS] | Perfor. (Total) [GOPS]

Angel-Eye and Caffeine both have a much higher performance density relative to their usage of LUT and DSP resources on the FPGA.

Power efficiency. The power efficiency depends on the efficiency of both data communication and computation.

The SmartShuttle [49] solution optimizes CNN off-chip memory access. Observing that over 80% of energy is consumed by DRAM accesses, they benchmark the data volume of DMA requests during inference of the 13 CONV layers in VGG16. Our CNNA16 measures a data volume of 211.7 MB transferred for the same feature layers, including pooling. However, we use more on-chip memory for weight and data buffers than SmartShuttle. As a benchmark, SmartShuttle measures 221.3 MB. Simulated with an on-chip buffer of 512 KB, however, they can lower the DRAM access volume to 160 MB. The design of the CLB in our CNNA ensures that weights are only transferred once from DRAM, which is similar to what SmartShuttle achieves with the weight-reuse-oriented scheme (WRO) they propose. The last three FC layers of the CNNA16 transfer a volume of 273.8 MB, which is not considered by SmartShuttle and accounts for most of the data communication.

The computation power efficiency is calculated as the number of operations per second relative to the mean power consumption of the CNNA, which we measured earlier (GOPS/W). Compared to many of the current state-of-the-art accelerators, the CNN accelerator in this work performs quite well in terms of power efficiency. When using 16-bit fixed-point weights at 100 MHz, its total power efficiency is 0.44x lower than Angel-Eye's and 0.59x lower than Caffeine's. With nearly the same efficiency of 12 GOPS/W, the power efficiency of the CONV layers is considered comparable with Caffeine. The performance bottleneck in our CNN accelerator is the fully connected layers, where splits are performed 512 times with a high DRAM access. The fpgaConvNet on the Zynq XC7Z020 has a worse efficiency of 7.3 GOPS/W compared to the CNNA16 with 11.9 GOPS/W. While Angel-Eye's fixed-point 8-bit solution with 24.1 GOPS/W is the best of all the compared state-of-the-art solutions in terms of efficiency, the 8-bit CNNA with 23.0 GOPS/W is a close second.

8. Conclusion

In this paper, an architecture for a SoC design was presented. The presented architecture implements the different operations necessary for a deep neural network to perform close to real-time inference. The architecture was implemented using Python and HLS for the IP core and was able to run on the Ultra96 board using PYNQ. The interface for the system is similar to Keras and should be familiar to most engineers working in the field of machine learning.

The CNN is able to accelerate deep learning algorithms that use any sequence of convolutional, max-pooling and fully connected layers. The layer operations can support many different parameters and will be able to perform inferences using most modern CNNs.
The network weights can use any 8-, 16- or 32-bit fixed-point format when exported from Keras, with the weights auto-scaled correctly. A training method was proposed which achieved high levels of inference accuracy, both when using fixed-point and floating-point weights. The VGG16 architecture chosen for testing in this paper was able to perform inference in 2.0 sec per image when using the fixed-point format Q2.14 and 1.2 sec when using the fixed-point format Q2.6. The IP core alone consumes a peak power of 4.1 W with a mean power between 1.5 and 2.7 W, and it has a power efficiency between 6.0 and 15.2 GOPS/W depending on the fixed-point format.

Compared to similar state-of-the-art solutions for mapping the VGG16 network to Xilinx platforms, our solution demonstrates a comparable energy efficiency, especially for the convolutional layers. In future work, the CNNA needs to be extended with special layers to support deep neural networks such as ResNet, DenseNet, InceptionNet and GoogLeNet. The special layers with irregular dataflow will be implemented in the SW-controlling part of the proposed architecture.

Acknowledgments

We would like to thank Freia Martensen for language and proof reading of the article.
References

[1] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, Communications of the ACM. doi:10.1145/3065386.
[2] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings, 2015. arXiv:1409.1556.
[3] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2016. arXiv:1506.02640, doi:10.1109/CVPR.2016.91.
[4] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Transactions on Pattern Analysis and Machine Intelligence. arXiv:1506.01497, doi:10.1109/TPAMI.2016.2577031.
[5] K. He, G. Gkioxari, P. Dollar, R. Girshick, Mask R-CNN, IEEE Transactions on Pattern Analysis and Machine Intelligence. doi:10.1109/TPAMI.2018.2844175.
[6] A. Shawahna, S. M. Sait, A. El-Maleh, FPGA-based accelerators of deep learning networks for learning and classification: A review (2019). arXiv:1901.00121, doi:10.1109/ACCESS.2018.2890150.
[7] S. Mittal, J. S. Vetter, A survey of methods for analyzing and improving GPU energy efficiency (2014). arXiv:1404.4629, doi:10.1145/2636342.
[8] S. Mittal, A survey of FPGA-based accelerators for convolutional neural networks (2018). doi:10.1007/s00521-018-3761-1.
[9] W. Ding, Z. Huang, Z. Huang, L. Tian, H. Wang, S. Feng, Designing efficient accelerator of depthwise separable convolutional neural network on FPGA, Journal of Systems Architecture. doi:10.1016/j.sysarc.2018.12.008.
[10] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, Y. Bengio, Binarized neural networks, in: Advances in Neural Information Processing Systems, 2016. arXiv:1602.02505.
[11] M. Blott, T. B. Preußer, N. J. Fraser, G. Gambardella, K. O'Brien, Y. Umuroglu, M. Leeser, K. Vissers, FINN-R: An end-to-end deep-learning framework for fast exploration of quantized neural networks, ACM Transactions on Reconfigurable Technology and Systems. arXiv:1809.04570, doi:10.1145/3242897.
[12] E. Nurvitadhi, D. Sheffield, J. Sim, A. Mishra, G. Venkatesh, D. Marr, Accelerating binarized neural networks: Comparison of FPGA, CPU, GPU, and ASIC, in: Proceedings of the 2016 International Conference on Field-Programmable Technology, FPT 2016, 2017. doi:10.1109/FPT.2016.7929192.
[13] R. Zhao, W. Song, W. Zhang, T. Xing, J.-H. Lin, M. Srivastava, R. Gupta, Z. Zhang, Accelerating binarized convolutional neural networks with software-programmable FPGAs, in: Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA '17, ACM, New York, NY, USA, 2017, pp. 15-24. doi:10.1145/3020078.3021741. URL http://doi.acm.org/10.1145/3020078.3021741
[14] R. Nane, V. M. Sima, C. Pilato, J. Choi, B. Fort, A. Canis, Y. T. Chen, H. Hsiao, S. Brown, F. Ferrandi, J. Anderson, K. Bertels, A survey and evaluation of FPGA high-level synthesis tools, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. doi:10.1109/TCAD.2015.2513673.
[15] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, L. Wang, A high performance FPGA-based accelerator for large-scale convolutional neural networks, in: FPL 2016 - 26th International Conference on Field-Programmable Logic and Applications, 2016. doi:10.1109/FPL.2016.7577308.
[16] A. Krizhevsky, I. Sutskever, G. E. Hinton, 2012 AlexNet, Advances in Neural Information Processing Systems. arXiv:1102.0183, doi:10.1016/j.protcy.2014.09.007.
[17] F. Chollet, et al., Keras, https://keras.io (2015).
[18] L. Stornaiuolo, M. Santambrogio, D. Sciuto, On how to efficiently implement deep learning algorithms on PYNQ platform, in: Proceedings of IEEE Computer Society Annual Symposium on VLSI, ISVLSI, 2018. doi:10.1109/ISVLSI.2018.00112.
[19] Accellera Systems Initiative, IEEE standard for standard SystemC language reference manual, IEEE Std 1666-2011 (Revision of IEEE Std 1666-2005).
[20] K. Ovtcharov, O. Ruwase, J.-y. Kim, J. Fowers, K. Strauss, E. S. Chung, Accelerating deep convolutional neural networks using specialized hardware, Microsoft Research Whitepaper.
[21] D. Gschwend, ZynqNet: An FPGA-accelerated embedded convolutional neural network. URL https://github.com/dgschwend/zynqnet
[22] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, K. Vissers, FINN: A framework for fast, scalable binarized neural network inference, in: FPGA 2017 - Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2017. arXiv:1612.07119, doi:10.1145/3020078.3021744.
[23] S. I. Venieris, A. Kouris, C. S. Bouganis, Toolflows for mapping convolutional neural networks on FPGAs: A survey and future directions (2018). arXiv:1803.05900, doi:10.1145/3186332.
[24] S. I. Venieris, C. S. Bouganis, fpgaConvNet: A framework for mapping convolutional neural networks on FPGAs, in: Proceedings - 24th IEEE International Symposium on Field-Programmable Custom Computing Machines, FCCM 2016, 2016. doi:10.1109/FCCM.2016.22.
[25] S. I. Venieris, C. S. Bouganis, fpgaConvNet: Mapping regular and irregular convolutional neural networks on FPGAs, IEEE Transactions on Neural Networks and Learning Systems. doi:10.1109/TNNLS.2018.2844093.
[26] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, H. Esmaeilzadeh, From high-level deep neural models to FPGAs, in: Proceedings of the Annual International Symposium on Microarchitecture, MICRO, 2016. doi:10.1109/MICRO.2016.7783720.
[27] K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han, Y. Wang, H. Yang, Angel-Eye: A complete design flow for mapping CNN onto embedded FPGA, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. doi:10.1109/TCAD.2017.2705069.
[28] Y. Wang, J. Xu, Y. Han, H. Li, X. Li, DeepBurning: Automatic generation of FPGA-based learning accelerators for the neural network family, in: Proceedings - Design Automation Conference, 2016. doi:10.1145/2897937.2898003.
[29] C. Zhang, Z. Fang, P. Zhou, P. Pan, J. Cong, Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks, in: IEEE/ACM International Conference on Computer-Aided Design, Digest of Technical Papers, ICCAD, 2016. doi:10.1145/2966986.2967011.
[30] Y. Y. Huang, W. Y. Wang, Deep residual learning for weakly-supervised relation extraction, in: EMNLP 2017 - Conference on Empirical Methods in Natural Language Processing, Proceedings, 2017. arXiv:1707.08866, doi:10.18653/v1/d17-1191.
[31] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in: Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, 2017. arXiv:1608.06993, doi:10.1109/CVPR.2017.243.
[32] G. Zeng, Y. He, Z. Yu, X. Yang, R. Yang, L. Zhang, InceptionNet/GoogLeNet - Going deeper with convolutions, CVPR. arXiv:1409.4842, doi:10.1002/jctb.4820.
[33] Xilinx, PYNQ: Python productivity for Zynq. URL http://www.pynq.io/
[34] Xilinx, UG902 - Vivado Design Suite User Guide - High-Level Synthesis, 2019th Edition (07 2019).
[35] Accellera Systems Initiative, SystemC Synthesizable Subsets, 1st Edition (January 2015).
[36] L. H. Crockett, D. Northcote, C. Ramsay, F. D. Robinson, R. W. Stewart, Exploring Zynq MPSoC With PYNQ and Machine Learning Applications, Strathclyde Academic Media, 2019.
[37] Xilinx, UG761 - AXI Reference Guide, v13.1 Edition (March 2011).
[38] B. Xu, R. Huang, M. Li, Revise saturated activation functions, CoRR abs/1602.05980. arXiv:1602.05980. URL http://arxiv.org/abs/1602.05980
[39] E. Oberstar, Fixed-Point Representation & Fractional Math, Revision 1.2 (08 2007). doi:10.13140/RG.2.1.3602.8242.
[40] B. J. Wythoff, Backpropagation neural networks: A tutorial, Chemometrics and Intelligent Laboratory Systems 18 (2) (1993) 115-155. doi:10.1016/0169-7439(93)80052-J. URL http://www.sciencedirect.com/science/article/pii/016974399380052J
[41] H. Park, J. H. Lee, Y. Oh, S. Ha, S. Lee, Training deep neural network in limited precision, CoRR abs/1810.05486. arXiv:1810.05486. URL http://arxiv.org/abs/1810.05486
[42] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, D. Kalenichenko, Quantization and training of neural networks for efficient integer-arithmetic-only inference, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[43] 96 Boards, Ultra96-V2 developer board. URL https://www.96boards.org/product/ultra96/
[44] Detect [online] (June 2017) [cited 7/5-2018].
[45] A. Krizhevsky, V. Nair, G. Hinton, CIFAR-10 and CIFAR-100 datasets (2009).
[46] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision (IJCV) 115 (3) (2015) 211-252. doi:10.1007/s11263-015-0816-y.
[47] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, Y. Zou, DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients, arXiv:1606.06160v3 [cs.NE] (2018).
[48] Xilinx, Zynq UltraScale+ MPSoC ZCU104 evaluation kit. URL https://www.xilinx.com/products/boards-and-kits/zcu104.html
[49] J. Li, G. Yan, W. Lu, S. Jiang, S. Gong, J. Wu, X. Li, SmartShuttle: Optimizing off-chip memory accesses for deep learning accelerators, in: Proceedings of the 2018 Design, Automation and Test in Europe Conference and Exhibition, DATE 2018, 2018. doi:10.23919/DATE.2018.8342033.