Fully Convolutional
Abstract—Convolutional Neural Networks (CNNs) allow fast and precise image recognition. Nowadays this capability is highly requested in the embedded system domain for video processing applications such as video surveillance and homeland security. Moreover, with the increasing demand for portable and ubiquitous processing, power consumption is a key issue to be accounted for.
In this paper, we present an FPGA implementation of a CNN designed to address portability and power efficiency. Performance characterization results show that the proposed implementation is as efficient as a general purpose 16-core CPU, and almost 15 times faster than a SoC GPU for mobile applications. Moreover, the external memory footprint is reduced by 84% with respect to a standard CNN software application.

I. INTRODUCTION

In this paper, we propose the design of a hardware architecture implementing a customizable Convolutional Neural Network (CNN) framework where several CNN schemas can be configured and executed. We analyzed the CNN computational flow to identify the most critical points to be parallelized in the FPGA implementation. We described the CNN framework architecture using a High Level Synthesis (HLS) language and tested the new HW-CNN module on an Altera Stratix V FPGA embedded in a Terasic DE5-Net board.

The CNN algorithm performs fast and precise image recognition, which is a highly requested feature in the context of embedded systems. The widest adoption of this type of algorithm can be found in the Artificial Intelligence field, contributing to numerous applications, such as fire detection in forests [1], robotics [2], autonomous driving [3] and mobile applications [4]. In this latter domain, battery lifetime and memory resources are a serious concern for CNN implementations.

Several CNN models are available in the literature for general purpose applications. In 2012, the Alex-Net model [5] was the first efficient application of the CNN, and more accurate models have since been proposed, such as GoogLeNet [6], featuring the Inception concept, and Faster R-CNN [7], with advanced capabilities for detecting the position of the subject in the picture.

To teach the CNN to recognize defined objects, the network needs to be trained. During the Training Phase, a set of labeled images is used to generate the set of parameters to be applied in the neural network. By means of the Test Phase, the capability of the network to identify and classify the pictures is evaluated.

The chosen model defines how to perform the training and the computation involved during the test, specifying the parameters and functions to be used for CNN recognition. The CNN flow is summarized as follows:

• Convolution: The matrix convolution operator is applied over the feature maps of the input image, such as the RGB color channels. The computation is shown in Equation 1, where O is the number of output feature maps of size H × W, I is the number of input feature maps, and K × K is the size of the kernel, which is the convolution operand obtained from the training.

Out[O][H][W] = Σ_{i=0}^{I} Σ_{kh=0}^{K} Σ_{kw=0}^{K} In[i][H + kh][W + kw] × Kernel[O][i][kh][kw]   (1)

• Activation: A threshold function applied on the convolution output. The ReLU(x) = max(0, x) function is widely adopted, but others are common as well, such as Tanh and Sigmoid.

• Pooling: The average or maximum value over an input region is evaluated, generating a resized image representative of the pooled values. Equation 2 shows the Pooling by Average operation, where Ph × Pw is the pooling window size.

Out[O][H][W] = (Σ_{ph=0}^{Ph} Σ_{pw=0}^{Pw} In[O][H + ph][W + pw]) / (Ph × Pw)   (2)

• Fully-Connected (FC): Implemented at the end of a CNN, the FC layers provide the classification of the features extracted by the convolutions. The FC layers are implemented as in Equation 3:

Out[O] = Σ_{i=0}^{I} In[i] × Kernel[O][i]   (3)

Fig. 1: Convolution process representation. The magnifications are representative of the CNN edge-detection.
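The four layer operations above can be sketched in plain Python as a minimal illustration of Equations 1–3. This is not the HLS code of the HW-CNN; function names, array shapes, and the use of NumPy are assumptions made for clarity only.

```python
import numpy as np

def conv(inp, kernel):
    """Equation 1: inp has shape (I, H+K-1, W+K-1), kernel (O, I, K, K)."""
    O, I, K, _ = kernel.shape
    H = inp.shape[1] - K + 1
    W = inp.shape[2] - K + 1
    out = np.zeros((O, H, W))
    for o in range(O):          # for each output feature map
        for i in range(I):      # accumulate over input feature maps
            for kh in range(K):
                for kw in range(K):
                    # shifted window of the input, weighted by one kernel tap
                    out[o] += inp[i, kh:kh + H, kw:kw + W] * kernel[o, i, kh, kw]
    return out

def relu(x):
    """Activation: ReLU(x) = max(0, x)."""
    return np.maximum(0, x)

def avg_pool(inp, ph, pw):
    """Equation 2: average over non-overlapping ph x pw windows."""
    O, H, W = inp.shape
    return inp.reshape(O, H // ph, ph, W // pw, pw).mean(axis=(2, 4))

def fully_connected(inp, kernel):
    """Equation 3: inp has shape (I,), kernel (O, I)."""
    return kernel @ inp
```

A full forward pass then chains these steps, e.g. `fully_connected(avg_pool(relu(conv(img, k1)), 2, 2).ravel(), k2)`, mirroring the Convolution → Activation → Pooling → FC flow described above.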
III. RESULTS AND DISCUSSION

We configured the HW-CNN with the Alex-Net model structure and parameters used in the work of Zhang et al. [9]. The following evaluation is compared against software implementations, since the FPGA-based implementation [9] found in the literature reports only the performance of the CNN steps computed on the FPGA, without considering all the other steps executed on the host; this makes a direct comparison between the two architectures impractical. We tested the performance by comparing different CNN software implementations: an optimized C code designed by Zhang et al. [9], the Caffe Python library working in CPU-only mode (without GPU parallelization) [12] and, for the mobile GPU comparison, the clBLAS OpenCL library, which has been evaluated by Lokhmotov et al. [13].
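The efficiency metric used in this comparison, performance per watt, is simply throughput divided by power draw. The snippet below recomputes the Perf.-per-Watt column of Table I from its OP/s and Power columns; it is purely an illustrative sanity check, with the values copied from the table.

```python
# Perf. per Watt [GOP/s/W] = throughput [GOP/s] / power [W], as in Table I.
measurements = {
    "GPU Mobile OpenCL [13]":  (0.02,  3.0),
    "CPU single thread [9]":   (3.54,  95.0),
    "CPU 16-threads [9]":      (12.87, 95.0),
    "FPGA HW-CNN [This work]": (0.75,  5.54),
}

for device, (gops, watts) in measurements.items():
    print(f"{device}: {gops / watts:.3f} GOP/s/W")
```

Note that the FPGA matches the 16-thread CPU at 0.135 GOP/s/W while drawing roughly 17 times less power.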
TABLE I: Power performance.

Device                     OP/s [GOP/s]   Power [W]   Perf. per Watt [GOP/s/W]
GPU Mobile OpenCL [13]     0.02           3           0.007
CPU single thread [9]      3.54           95          0.037
CPU 16-threads [9]         12.87          95          0.135
FPGA HW-CNN [This work]    0.75           5.54        0.135

hardware shows that the current mobile implementations of CNN are outperformed by the FPGA by more than a 19× factor. This demonstrates the effectiveness of the low-power FPGA computation, and the possibility to adopt similar CNN implementations for mobile applications where battery life is the major constraint.

B. Resource usage

The percentage of required FPGA resources is reported in Table II. The synthesis report shows that few resources have been used, allowing the architecture to fit more compact FPGAs, such as the Xilinx Zynq. The most demanded resource is the on-chip memory, which has been exploited for the main purpose of caching the intermediate CNN results on the FPGA.

TABLE II: FPGA resources.

Resource   Stratix V - FPGA Chip Usage
Logic      65,463 ALMs (28%)
Register   3.5kB (3%)
DSP        104 blocks (41%)
Memory     4,752kB (73%)

The memory necessary on the on-board DDR RAM is greatly reduced, as shown in Figure 5b. The chart reports the comparison of RAM memory requirements for running the Alex-Net model on both FPGA and software. The values are explained by the fact that, during the CNN execution, the intermediate results are stored in the PP buffers along the pipeline, rather than transferred to the RAM memory.

The reduced amounts of external memory and FPGA resources are significant figures that encourage the adoption of HW-CNN-like architectures in mobility applications. While this already allows the implementation to be placed on smaller FPGAs or silicon components, the architecture can be easily adapted to fit the smallest FPGA devices, exploiting the versatility of the HLS synthesis coupled with the modular programming technique.

IV. CONCLUSION

In this paper we propose a Convolutional Neural Network (CNN) fully implemented in FPGA that enables image recognition in low-power embedded systems with limited resources. These features have been made feasible by extending state-of-the-art implementations, where only the convolution step is accelerated on FPGA, with the modular addition of extra functionalities. This modularity guarantees compliance with existing CNN models, but also the possibility to easily introduce new functionalities.

The experiments show that our HW-CNN can quickly perform image recognition, outperforming the reference SoC GPU in both terms of timing (15×) and power (16×). The proposed implementation is even 3 times more power efficient with respect to the reference CPU, and as efficient as the same CPU parallelized over 16 threads. Moreover, the requirement of an external memory has been reduced by 83% when compared to the software version of the CNN.

Finally, this architecture has been designed to allow software reconfiguration, which allows the user to apply various CNN models and to efficiently test the same picture against different trained data and recognized classes.

ACKNOWLEDGMENTS

The HLS compiler and the technical support were provided by NEC Corporation, Japan.

REFERENCES

[1] Qingjie Zhang et al. "Deep Convolutional Neural Networks for Forest Fire Detection". In: 2016 International Forum on Management, Education and Information Technology Application. Atlantis Press, 2016.
[2] Lei Tai and Ming Liu. "Deep-learning in Mobile Robotics - from Perception to Control Systems: A Survey on Why and Why not". In: arXiv preprint arXiv:1612.07139 (2016).
[3] Mariusz Bojarski et al. "End to end learning for self-driving cars". In: arXiv preprint arXiv:1604.07316 (2016).
[4] Ryosuke Tanno and Keiji Yanai. "Caffe2C: A Framework for Easy Implementation of CNN-based Mobile Applications". In: Adjunct Proceedings of the 13th International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services. ACM, 2016, pp. 159–164.
[5] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks". In: Advances in Neural Information Processing Systems. 2012, pp. 1097–1105.
[6] Christian Szegedy et al. "Going deeper with convolutions". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 1–9.
[7] Shaoqing Ren et al. "Faster R-CNN: Towards real-time object detection with region proposal networks". In: Advances in Neural Information Processing Systems. 2015, pp. 91–99.
[8] NVIDIA. GPU-Based Deep Learning Inference. URL: https://www.nvidia.com/content/tegra/embedded-systems/pdf/jetson_tx1_whitepaper.pdf (visited on 01/18/2017).
[9] Chen Zhang et al. "Optimizing FPGA-based accelerator design for deep convolutional neural networks". In: Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2015, pp. 161–170.
[10] Kazutoshi Wakabayashi. "CyberWorkBench: Integrated design environment based on C-based behavior synthesis and verification". In: VLSI Design, Automation and Test (VLSI-TSA-DAT), 2005 IEEE VLSI-TSA International Symposium on. IEEE, 2005, pp. 173–176.
[11] Wikipedia. Multiple buffering. URL: https://en.wikipedia.org/wiki/Multiple_buffering (visited on 02/28/2017).
[12] Yangqing Jia and Evan Shelhamer. Caffe - Deep learning framework by the BVLC. URL: http://caffe.berkeleyvision.org (visited on 01/20/2017).
[13] Anton Lokhmotov and Grigori Fursin. "Optimizing convolutional neural networks on embedded platforms with OpenCL". In: Proceedings of the 4th International Workshop on OpenCL. ACM, 2016, p. 10.