FFCNN: Fast FPGA Based Acceleration For Convolution Neural Network Inference
Abstract
We present a new efficient OpenCL-based accelerator for large-scale Convolutional Neural Networks, called Fast Inference on FPGAs for Convolution Neural Network (FFCNN). FFCNN is based on a deeply pipelined OpenCL kernel architecture. High-level synthesis tools such as the OpenCL framework make it easy to port codes originally designed for CPUs/GPUs to FPGAs, but it remains difficult to make OpenCL codes run efficiently on FPGAs. This work therefore proposes an efficient FPGA implementation of OpenCL high-performance computing applications. Data reuse and task mapping techniques are also presented to improve design efficiency. In addition, the following motivations guided the development of FFCNN:
• FFCNN is designed to be easily implemented with the Intel FPGA SDK for OpenCL design flow.
• In FFCNN, several techniques are integrated to improve memory bandwidth and throughput.
A performance analysis is conducted on two deep CNNs for large-scale image classification. The obtained results, and the comparison with other works designed to accelerate the same types of architectures, show the efficiency and competitiveness of the proposed accelerator design, with significantly improved performance and resource utilization.
1 Introduction
Recent work on neural networks has shown great improvements over traditional machine learning algorithms, especially in computer vision, where a high adaptive capacity to a wide range of pattern recognition problems has been demonstrated. The convolutional neural network AlexNet [11] improved the TOP-5 image classification accuracy on the ImageNet [12] dataset from 73.8% to 84.7% and, through its ability to extract features, helped improve performance on various computer vision problems [13]. However, its computation and storage complexity is high, and according to current research the size of NN models continues to increase. In Table 1, we list the number of operations (additions or multiplications), the number of parameters, and the top-1 accuracy on the ImageNet dataset [12] of the Convolutional Neural Network (CNN) models found in the literature for image classification, object detection, and image segmentation.
For instance, one of the largest and most widely used CNNs, VGG [14], requires 39 billion floating-point operations (FLOPs) for an input image of size 224 × 224 and has about 500 MB of model parameters. Since the computational complexity is proportional to the input size, processing high-resolution images requires more than 100 billion operations.
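As a rough worked example of this scaling (assuming, as stated above, that the convolutional workload grows linearly with the number of input pixels; the 448 × 448 resolution is only an illustration):

\[
\mathrm{Ops}(H, W) \;\approx\; \mathrm{Ops}(224, 224) \times \frac{H \times W}{224^2},
\qquad 39\ \mathrm{GFLOPs} \times \frac{448^2}{224^2} \;=\; 156\ \mathrm{GFLOPs},
\]

so a single doubling of each input dimension already pushes VGG past the 100-billion-operation mark.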
Therefore, selecting a suitable computing architecture is important for any CNN-based solution. A typical CPU delivers 10 to 100 GFLOP/s, with an energy efficiency that is often less than 1 GOP/J. CPUs are thus hard to apply both to cloud applications, which require high FLOP throughput, and to mobile applications, which require low power consumption. GPUs, on the other hand, offer high performance of up to 10 TOP/s.
Hardware accelerators are usually based on ASICs [12] or FPGAs [13, 14]. ASIC-based accelerators offer the highest performance and energy efficiency, but incur considerable development costs. Because of their reconfigurable nature, FPGA-based accelerators are more economical in terms of development cost.
For years, FPGA developers have had to work with difficult Register Transfer Level (RTL) languages such as VHDL and Verilog HDL, which makes programmability a major issue for FPGAs. FPGA vendors therefore now provide high-level synthesis tools, such as the OpenCL framework [15], that enable FPGA programming in high-level languages. Although developers can easily port codes originally designed for CPUs/GPUs to FPGAs with the OpenCL framework, it is still difficult to make OpenCL codes run efficiently on FPGAs: the same code may perform differently on different platforms because of architecture-dependent execution. Developers must therefore take the FPGA architecture into account when optimizing OpenCL code.
The main contributions of this work are as follows: (1) an OpenCL-based FPGA accelerator with an efficient pipelined kernel structure is proposed for large-scale Convolutional Neural Network (CNN) inference; (2) the design space of the proposed architecture is fully explored on the Arria 10 and Stratix 10 FPGAs, on which two large-scale CNN models are implemented and tested. The results show that the proposed scheme improves performance and resource utilization compared to previous work.
The rest of the paper is organized as follows: the next section recalls the definition of a CNN; Section 3 presents the proposed implementation; Section 4 reports the obtained results; the conclusion ends the paper.
2 Convolution Neural Network
In this section, we present the basic functions of a neural network, focusing only on the inference procedure; that is, the Neural Network model has already been trained and validated to predict or classify new data.
The basic architectural ideas of a Convolution Neural Network (CNN) [5] are local receptive fields, realized by the convolution operation, and spatial sub-sampling, realized by the pooling operation.
The Convolution operation can be formally written as:
f_{x,y,h}^{C,l} = \left(w_h^l\right)^T f_{x,y}^{Op,l-1} + b_h^l    (1)
where w_h^l and b_h^l are the weights and bias of the h-th feature map, f_{x,y}^{Op,l-1} and f_{x,y,h}^{C,l} are the input and output feature maps, l denotes the layer, and (x, y) is the spatial image coordinate. The superscript C denotes convolution and Op represents various operations, e.g., input (when l = 1), convolution, pooling, activation, etc.
Pooling applies a local operation, e.g., computing the maximum within a local neighborhood, which has the following form:
f_{x,y,h}^{P_{\max},l} = \max_{(m,n) \in N_{x,y}} f_{m,n,h}^{Op,l-1}    (2)
where N_{x,y} denotes the local spatial neighborhood and P_{\max} denotes max pooling. Often a spatial resolution reduction is applied after the max-pooling operation. Besides the two above-mentioned operations, several other strategies are applied within CNN models, such as non-linear activation (e.g., the Rectified Linear Unit (ReLU) [6]), dropout [7], and batch normalization [8]. A Fully Connected (FC) layer can be added at the end of the concatenated layers; it takes all nodes (neurons) of the feature maps of the previous layer as input and connects them to every node (neuron) of the output feature map. At the last layer of the CNN model, called the dense layer (referred to as the prediction layer), it is common to use the Softmax activation function.
Convolution (CONV) layers and Fully Connected (FC) layers are thus the two types of layers common to most architectures. CONV layers conduct two-dimensional (2D) convolutions on a set of input feature maps and add the results to obtain output feature maps. FC layers receive a feature vector as input and conduct matrix-vector multiplications.
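To make the two computation patterns concrete, the following plain-C reference sketch (illustrative only: the array layouts, unit stride, and absence of padding are our assumptions, not the accelerator's actual data structures) shows a CONV layer accumulated over input feature maps and an FC matrix-vector product:

/* Illustrative CPU reference, not the FPGA kernel code. */
/* CONV: sum K x K 2-D convolutions over C_in input feature maps. */
void conv_layer(const float *in,  /* [C_in][H][W]          */
                const float *w,   /* [C_out][C_in][K][K]   */
                float *out,       /* [C_out][H-K+1][W-K+1] */
                int C_in, int C_out, int H, int W, int K) {
    int Ho = H - K + 1, Wo = W - K + 1;
    for (int fo = 0; fo < C_out; fo++)
        for (int y = 0; y < Ho; y++)
            for (int x = 0; x < Wo; x++) {
                float acc = 0.0f;
                for (int fi = 0; fi < C_in; fi++)
                    for (int ky = 0; ky < K; ky++)
                        for (int kx = 0; kx < K; kx++)
                            acc += w[((fo * C_in + fi) * K + ky) * K + kx]
                                 * in[(fi * H + y + ky) * W + x + kx];
                out[(fo * Ho + y) * Wo + x] = acc;
            }
}

/* FC: one matrix-vector multiplication. */
void fc_layer(const float *in, const float *w, float *out,
              int N_in, int N_out) {
    for (int o = 0; o < N_out; o++) {
        float acc = 0.0f;
        for (int i = 0; i < N_in; i++)
            acc += w[o * N_in + i] * in[i];
        out[o] = acc;
    }
}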
Besides CONV and FC layers, NN models also contain pooling, ReLU, concat [9], elementwise [10], and other types of layers, but these contribute little to the computation and storage requirements of a neural network model. Figure 1 shows the distribution of weights and operations in the VGG-11 model: CONV and FC layers together contribute more than 99% of the network's weights and operations, which is typical of most CNN models. Neural network acceleration systems should therefore focus on these two types of layers.
Fig. 1. Distribution of the parameters and the operations in a chain-based architecture: the example of VGG with 11 layers.
3 Proposed implementation
In this work, we used an Altera FPGA Development Kit to build our CNN accelerator. The off-chip memory controller is a DDR3/DDR4 controller, the link controller is a PCIe controller, and the host computer is an x86-based desktop PC.
Figure 2 illustrates the proposed architecture, which consists of four kernels connected using the Altera OpenCL channel/pipe extension.
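As a minimal sketch of this structure (the channel names, FIFO depths, kernel names, and scalar float transfers are illustrative assumptions; the production kernels transfer vectorized data), the four kernels can be chained with Intel FPGA OpenCL channels as follows. Note that SDK v16.0 uses the cl_altera_channels extension and read/write_channel_altera calls instead:

// Intel FPGA OpenCL: on-chip FIFO channels connecting the kernel pipeline.
#pragma OPENCL EXTENSION cl_intel_channels : enable

channel float data_ch __attribute__((depth(64)));  // DataIn -> Convolution
channel float conv_ch __attribute__((depth(64)));  // Convolution -> Pooling
channel float pool_ch __attribute__((depth(64)));  // Pooling -> DataOut

// DataIn: NDRange kernel streaming features from global memory into the pipeline.
__kernel void data_in(__global const float *restrict src) {
    write_channel_intel(data_ch, src[get_global_id(0)]);
}

// DataOut: NDRange kernel draining results back to global memory.
__kernel void data_out(__global float *restrict dst) {
    dst[get_global_id(0)] = read_channel_intel(pool_ch);
}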
The single-threaded convolution kernel is designed to implement the 3-D multiply-accumulate operation, defined by:
D_o(f_o, y, x) = \sum_{f_i=1}^{C_l} \sum_{k_y=0}^{K-1} \sum_{k_x=0}^{K-1} W_l(f_o, f_i, k_y, k_x) \, D_i(f_i, y + k_y, x + k_x)    (3)
where D_i(f_i, y, x) and D_o(f_o, y, x) denote the neurons located at position (x, y) in the input feature map f_i and the output feature map f_o, respectively. W_l(f_o, f_i, y, x) represents the corresponding weights in the l-th layer that are convolved with f_i. The size of the convolution filters is K × K, while the total number of input feature maps is C_l. In this paper, we propose to implement Eq. (3) using a 1-D convolution structure that flattens the 3-D convolution as follows:
D_o(f_o) = \sum_{x_i=1}^{C_l \times K \times K} W_l(f_o, x_i) \, D_i(x_i)    (4)
where x_i is the flattened index of the parameters of layer l. Local response normalization (LRN) layers, which normalize each input neuron value by a factor that depends on the neighboring neurons, are also used after the pooling layer. In this way, we avoid a five-level nested loop and obtain a two-level nested loop structure, so the multiplier-adder tree structure with a buffer can be efficiently pipelined by the OpenCL compiler.
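As an illustration, a minimal sketch under our own assumptions (the filter-size bound, kernel name, scalar channel interface, and omission of bias, stride handling, and data reuse are ours, not the paper's exact design), the flattened structure of Eq. (4) maps to a single work-item kernel with the two-level loop nest described above, reusing the channels declared in the previous sketch:

#define MAX_FILTER (512 * 3 * 3)  /* assumed upper bound on C_l x K x K */

// Single work-item kernel implementing Eq. (4): the flattened 1-D
// convolution reduces to a two-level loop nest that the offline
// compiler can pipeline into a multiplier-adder tree.
__attribute__((max_global_work_dim(0)))
__kernel void conv_1d(int n_out, int vec_len,   /* vec_len <= MAX_FILTER */
                      __global const float *restrict weights) {
    for (int fo = 0; fo < n_out; fo++) {        // one pass per output neuron
        float w_buf[MAX_FILTER];                // on-chip filter buffer
        for (int i = 0; i < vec_len; i++)       // preload this filter
            w_buf[i] = weights[fo * vec_len + i];

        float acc = 0.0f;
        for (int xi = 0; xi < vec_len; xi++)    // C_l x K x K flattened MACs
            acc += w_buf[xi] * read_channel_intel(data_ch);
        write_channel_intel(conv_ch, acc);
    }
}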
Two data transfer kernels, DataIn and DataOut, inspired by the work of [2], are NDRange 3-D kernels that transfer feature and weight data from/to the global memory in multiple modes.
In addition to the most compute-intensive convolution kernel, we designed new OpenCL kernels to speed up the other layer operations widely used in CNNs, such as pooling. Our proposed design can therefore handle the CNN forward compute stream with very little host-CPU involvement, resulting in high throughput and low latency.
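For instance, a max-pooling stage (Eq. (2)) can be sketched as one more channel-fed single work-item kernel, reusing the conv_ch and pool_ch channels declared earlier; the streaming order of the pooling window is an assumption of this sketch, not the exact FFCNN implementation:

// Max pooling over a streamed pool_size-element window, consuming the
// convolution results from conv_ch and forwarding maxima to pool_ch,
// with no global-memory round trip in between.
__attribute__((max_global_work_dim(0)))
__kernel void max_pool(int n_windows, int pool_size) {
    for (int w = 0; w < n_windows; w++) {
        float m = -INFINITY;
        for (int p = 0; p < pool_size; p++)
            m = fmax(m, read_channel_intel(conv_ch));
        write_channel_intel(pool_ch, m);
    }
}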
The cascaded kernels form a deep compute pipeline that implements a series of basic CNN operations without storing the inter-layer data in global memory, which greatly reduces the bandwidth requirements.
4 Performance analysis
The Arria 10 FPGA includes 660K logic elements (LEs), 1687 DSP blocks, and 42 Mb of M20K memory, while the Stratix 10 FPGA includes 2753K logic elements (LEs), 5760 DSP blocks, and 229 Mb of M20K memory.
It should be noted that the Alaric board has 2 GB of DDR3 DRAM connected to the FPGA, which serves as global memory, while the Nallatech board has 32 GB of DDR4. The OpenCL kernel codes are compiled using the Altera OpenCL SDK v16.0 (Alaric) and v18.0 (Nallatech). The host computer is equipped with an Intel Core i5-4590 processor and runs Ubuntu Linux 14.04.3. We followed the same methodology as described in [11] and implemented the baseline design on the same Arria 10 platform. We also use the Caffe deep learning framework [6] as our CPU baseline: we extract the input image, the pre-trained weights, and the output feature maps from Caffe, and compare the output of our implementation with that of Caffe to verify functional correctness.
Two large-scale CNN models, AlexNet (8 layers) and ResNet-50 (50 layers), were used as benchmarks to measure performance.
Since CNN inference is dominated by floating-point multiplications, the number of DSP blocks consumed is used as a metric for evaluating performance. As in [2], the proposed CNN design implements full-precision direct computation (32-bit floating-point format), which also makes it favorable for implementing the back-propagation flow in the learning phase of the model. To make the comparison fair, we report the normalized performance as the "performance density" in Table 1. It can be noticed that the proposed implementation profits efficiently from the DSPs, and the classification time is better than in all other implementations.

Table 1. Comparison with other works: FPGA2016a is in [3], FPGA2015 is in [4], and FPGA2016b is in [2].
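For clarity, the performance density used in Table 1 can be read as throughput normalized by DSP usage; the definition below is our summary of this common normalization, not a formula quoted from the cited works:

\[
\text{performance density} \;=\; \frac{\text{throughput (GOP/s)}}{\#\,\text{DSP blocks used}}
\]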
References
1. K. Guo, S. Zeng, J. Yu, et al., "[DL] A Survey of FPGA-based Neural Network Inference Accelerators," ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 12, no. 1, pp. 2:1–2:23, 2019.
2. D. Wang, J. An, and K. Xu, "PipeCNN: An OpenCL-based FPGA accelerator for large-scale convolution neuron networks," arXiv preprint arXiv:1611.02450, 2016.
3. C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in Proc. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '15), 2015.
4. N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. F. Ma, S. Vrudhula, J. S. Seo, and Y. Cao, "Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks," in Proc. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '16), 2016.
5. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998. doi:10.1109/5.726791.
6. K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proc. IEEE International Conference on Computer Vision (ICCV '15), 2015.
7. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
8. S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. 32nd International Conference on Machine Learning (ICML '15), vol. 37, JMLR.org, 2015, pp. 448–456.
9. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR '15), 2015.
10. K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR '16), 2016, pp. 770–778.
11. A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
12. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015. https://doi.org/10.1007/s11263-015-0816-y
13. R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR '14), 2014, pp. 580–587.
14. K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
15. K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR '16), 2016, pp. 770–778.