Going Deeper with Embedded FPGA Platform for Convolutional Neural Network
Jiantao Qiu1,2, Jie Wang1, Song Yao1,2, Kaiyuan Guo1,2, Boxun Li1,2, Erjin Zhou1,
Jincheng Yu1,2, Tianqi Tang1,2, Ningyi Xu3, Sen Song2,4, Yu Wang1,2, and Huazhong Yang1,2

1 Department of Electronic Engineering, Tsinghua University; Tsinghua National Laboratory for Information Science and Technology
2 Center for Brain-Inspired Computing Research, Tsinghua University
3 Hardware Computing Group, Microsoft Research Asia
4 School of Medicine, Tsinghua University
{songyao, yu-wang}@mail.tsinghua.edu.cn
ABSTRACT
In recent years, Convolutional Neural Network (CNN) based methods have achieved great success in a large number of applications and have been among the most powerful and widely used techniques in computer vision. However, CNN-based methods are computation-intensive and resource-consuming, and thus are hard to integrate into embedded systems such as smart phones, smart glasses, and robots. FPGA is one of the most promising platforms for accelerating CNN, but the limited bandwidth and on-chip memory size limit the performance of FPGA accelerators for CNN.
In this paper, we go deeper with the embedded FPGA platform on accelerating CNNs and propose a CNN accelerator design on embedded FPGA for Image-Net large-scale image classification. We first present an in-depth analysis of state-of-the-art CNN models and show that Convolutional layers are computation-centric and Fully-Connected layers are memory-centric. Then the dynamic-precision data quantization method and a convolver design that is efficient for all layer types in CNN are proposed to improve the bandwidth and resource utilization. Results show that only 0.4% accuracy loss is introduced by our data quantization flow for the very deep VGG16 model when 8/4-bit quantization is used. A data arrangement method is proposed to further ensure a high utilization of the external memory bandwidth. Finally, a state-of-the-art CNN, VGG16-SVD, is implemented on an embedded FPGA platform as a case study. VGG16-SVD is the largest and most accurate network that has been implemented on FPGA end-to-end so far. The system on the Xilinx Zynq ZC706 board achieves a frame rate of 4.45 fps with a top-5 accuracy of 86.66% using 16-bit quantization. The average performance of Convolutional layers and the full CNN is 187.8 GOP/s and 137.0 GOP/s under 150 MHz working frequency, which outperforms previous approaches significantly.

This work was supported by 973 project 2013CB329000, National Natural Science Foundation of China (No. 61373026, 61261160501), the Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions, Microsoft, Xilinx University Program, and Tsinghua University Initiative Scientific Research Program.

Keywords
Embedded FPGA; Convolutional Neural Network (CNN); Dynamic-precision data quantization; Bandwidth utilization

1. INTRODUCTION
Image classification is a basic problem in computer vision (CV). In recent years, Convolutional Neural Network (CNN) has led to great advances in image classification accuracy. In the Image-Net Large-Scale Vision Recognition Challenge (ILSVRC) 2012 [1], Krizhevsky et al. showed the power of CNN by achieving a top-5 accuracy of 84.7% in the classification task [2], which was significantly higher than that of traditional image classification methods. In the following years, the accuracy has been improved to 88.8% [3], 93.3% [4], and 96.4% [5] in ILSVRC 2013, 2014, and 2015.
While achieving state-of-the-art performance, CNN-based methods demand much more computation and memory resources than traditional methods, and thus most CNN-based methods have to depend on large servers. However, there is a non-negligible market for embedded systems that demand high-accuracy, real-time object recognition, such as auto-piloted cars and robots; for these systems, the limited battery and resources are serious problems.
To address this problem, many researchers have proposed various CNN acceleration techniques from either the computing or the memory access aspect [6, 7, 8, 9, 10, 11, 12, 13]. However, most previous techniques only considered small CNN models such as the 5-layer LeNet for simple tasks such as MNIST handwritten digit recognition [14]. State-of-the-art CNN models for large-scale image classification have extremely high complexity, and thus can only be stored in external memory. In this manner, memory bandwidth becomes a serious problem for accelerating CNNs, especially for embedded systems. Besides, previous research focused on accelerating Convolutional (CONV) layers, while the Fully-Connected (FC) layers were not well studied. Consequently, we need to go deeper with the embedded FPGA platform to address these problems.
In this paper, we make a deep investigation on how to deploy full CNNs to accelerators on the embedded FPGA platform. A CNN accelerator for Image-Net large-scale classification is proposed, which can execute the very deep VGG16-SVD model at a speed of 4.45 fps. Specifically, this paper makes the following contributions.
• We present an in-depth analysis of state-of-the-art CNN models for large-scale image classification. We show that state-of-the-art CNN models are extremely complex (for example, the VGG16 model has 138 million weights and needs over 30 GOPs), CONV layers are computation-centric, and FC layers are memory-centric.
• For the first time, we present an automatic flow for dynamic-precision data quantization and explore various data quantization configurations. Results show that only a 0.4% accuracy loss is introduced with the VGG16 model under 8/4-bit dynamic-precision quantization. Specific hardware is also designed to support dynamic-precision data quantization.
• We show that the performance of FC layers is mainly limited by the memory bandwidth on the embedded FPGA platform, which is different from CONV layers. In this manner, we apply SVD to the weight matrix of the first FC layer, which reduces the memory footprint of this layer by 85.8%, design convolvers that can also compute FC layers to reduce resource consumption, and propose a data arrangement scheme to accelerate FC layers.
• We propose a CNN accelerator design on an embedded FPGA platform for Image-Net large-scale classification. On the Xilinx Zynq platform, our system achieves a performance of 187.8 GOP/s and 137.0 GOP/s for CONV layers and the full CNN under 150 MHz frequency respectively. With the VGG16-SVD network, our implementation achieves a top-5 accuracy of 86.66% at a speed of 4.45 fps.

The rest of the paper is organized as follows. In Section 2, the background of CNN is presented. In Section 3, the related work is introduced and discussed. We analyze the complexity distribution of state-of-the-art CNN models in Section 4. In Section 5, the dynamic-precision data quantization flow is proposed. The proposed image classification system design and implementation details are introduced in Section 6. The memory system and data arrangement method for FC layers are introduced in Section 7. The performance of the proposed system is evaluated and discussed in Section 8. We finally conclude this paper in Section 9.

2. BACKGROUND
Deep CNN achieves the state-of-the-art performance on a wide range of vision-related tasks. To help understand the CNN-based image classification algorithms analyzed in this paper, in this section we introduce the basics of CNN. An introduction to the Image-Net dataset and state-of-the-art CNN models is also presented.

2.1 Primer on CNN
A typical CNN consists of a number of layers that run in sequence. The parameters of a CNN model are called "weights". The first layer of a CNN reads an input image and outputs a series of feature maps. The following layers read the feature maps generated by previous layers and output new feature maps. Finally a classifier outputs the probability of each category that the input image might belong to. CONV layer and FC layer are two essential types of layer in CNN. After CONV layers, there are usually pooling layers. A typical CNN example is shown in Figure 1.

Figure 1: A typical CNN structure from the feature map perspective. (An input image passes through CONV + Pooling stages and FC layers, producing feature maps and finally the probabilities of categories such as "cat", "dog", and "tree".)

In this paper, for a CNN layer, f_j^in denotes its j-th input feature map, f_i^out denotes the i-th output feature map, and b_i denotes the bias term of the i-th output map. For CONV layers, n_in and n_out represent the number of input and output feature maps respectively. For FC layers, n_in and n_out are the lengths of the input and output feature vectors.

CONV layer takes a series of feature maps as input and convolves them with convolutional kernels to obtain the output feature maps. A nonlinear layer, which applies a nonlinear activation function to each element in the output feature maps, is often attached to CONV layers. The CONV layer can be expressed with Equation 1:

$$f_i^{out} = \sum_{j=1}^{n_{in}} f_j^{in} \otimes g_{i,j} + b_i \quad (1 \le i \le n_{out}), \qquad (1)$$

where g_{i,j} is the convolutional kernel applied to the j-th input feature map and the i-th output feature map.

FC layer applies a linear transformation to the input feature vector:

$$f^{out} = W f^{in} + b, \qquad (2)$$

where W is an n_out × n_in transformation matrix and b is the bias term. It should be noted that, for the FC layer, the input is not a combination of several 2-D feature maps but just a feature vector. Consequently, in Equation 2, the parameters n_in and n_out actually correspond to the lengths of the input and output feature vectors.

Pooling layer, which outputs the maximum or average value of each subarea in each feature map, is often attached to the CONV layer. Max-pooling can be expressed as Equation 3:

$$f_{i,j}^{out} = \max_{p \times p} \begin{pmatrix} f_{m,n}^{in} & \cdots & f_{m,n+p-1}^{in} \\ \vdots & \ddots & \vdots \\ f_{m+p-1,n}^{in} & \cdots & f_{m+p-1,n+p-1}^{in} \end{pmatrix}, \qquad (3)$$

where p is the pooling kernel size. This non-linear "down sampling" not only reduces the feature map size and the computation for later layers, but also provides a form of translation invariance.

CNN can be used to classify images in a forward inference process. But before using the CNN for any task, one should first train the CNN on a dataset. Recent research [15] showed that a CNN model pre-trained on a large dataset for a given task can be used for other tasks and achieve high accuracy with minor adjustment of the network weights. This minor adjustment is called "fine-tuning". The training of a CNN is mostly implemented on large servers. For the embedded FPGA platform, we only focus on accelerating the inference process of a CNN.
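To make Equations (1)–(3) concrete, the following Python sketch computes the three layer types for a single input exactly as written above. It is only an illustration of the math, not the accelerator's implementation; the array shapes, the "valid" (no-padding) convolution, and the non-overlapping pooling windows are our assumptions, and ⊗ is implemented as the usual sliding-window cross-correlation.

```python
import numpy as np

def conv_layer(f_in, g, b):
    """Eq. (1): f_out[i] = sum_j (f_in[j] (x) g[i, j]) + b[i], valid convolution."""
    n_out, n_in, k, _ = g.shape
    h, w = f_in.shape[1] - k + 1, f_in.shape[2] - k + 1
    f_out = np.zeros((n_out, h, w))
    for i in range(n_out):
        for j in range(n_in):
            for y in range(h):
                for x in range(w):
                    # slide the k x k kernel over the j-th input feature map
                    f_out[i, y, x] += np.sum(f_in[j, y:y + k, x:x + k] * g[i, j])
        f_out[i] += b[i]
    return f_out

def fc_layer(f_in, W, b):
    """Eq. (2): f_out = W f_in + b, with W of shape (n_out, n_in)."""
    return W @ f_in + b

def max_pool(f_in, p):
    """Eq. (3): non-overlapping p x p max-pooling of each feature map."""
    n, h, w = f_in.shape
    cropped = f_in[:, :h - h % p, :w - w % p]
    return cropped.reshape(n, h // p, p, w // p, p).max(axis=(2, 4))
```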
2.2 Image-Net Dataset
The Image-Net [1] dataset is regarded as the standard benchmark to evaluate the performance of image classification and object detection algorithms. So far the Image-Net dataset has collected more than 14 million images within more than 21 thousand categories. Image-Net releases a subset with 1.2 million images in 1000 categories for the ILSVRC classification task, which has significantly promoted the development of CV techniques. In this paper, all the CNN models are trained with the ILSVRC 2014 training dataset and evaluated with the ILSVRC 2014 validation set.

2.3 State-of-the-Art CNN Models
In ILSVRC 2012, the SuperVision team won the first place in the image classification task using AlexNet, achieving 84.7% top-5 accuracy [2]. CaffeNet is a replication of AlexNet with minor changes. Both AlexNet and CaffeNet consist of 5 CONV layers and 3 FC layers.
The Zeiler-and-Fergus (ZF) network achieved 88.8% top-5 accuracy and won the first place in the image classification task of ILSVRC 2013 [3]. The ZF network also has 5 CONV layers and 3 FC layers.
The VGG model achieved a top-5 accuracy of 92.6% and won the second place in the image classification task of ILSVRC 2014 [16]. The VGG model consists of 5 CONV layer groups and 3 FC layers. According to the exact number of layers, there are several versions of the VGG model, including VGG11, VGG16, and VGG19, as listed in Table 1.

Table 1: # of layers in VGG models.
Model   CONV Group 1  CONV Group 2  CONV Group 3  CONV Group 4  CONV Group 5  FC  Total
VGG11   1             1             2             2             2             3   11
VGG16   2             2             3             3             3             3   16
VGG19   2             2             4             4             4             3   19
Figure 2: The complexity distribution of state-of-the-art CNN models: (a) distribution of operations by theoretical estimation; (b) distribution of weight number. (Bar charts of the operations demanded in different layers (GOP) and the number of weights in different layers (million) for CaffeNet, ZF, VGG11, VGG16, and VGG19, broken down into CONV1–CONV5 and FC6–FC8.)

3. RELATED WORK
To accelerate CNN, a set of techniques from both the software and hardware perspectives have been studied. From the software perspective, the target is compressing CNN models in order to reduce the memory footprint and the number of operations while minimizing accuracy loss. From the hardware perspective, specific architectures and modules are designed to reuse data, enhance the "locality" of data, and accelerate convolution operations. To deploy CNN models on embedded systems, the bit widths of operators and weights are often reduced compared with those on CPU or GPU platforms.

3.1 Model Compression
Network pruning and decomposition were widely used to compress CNN models. In early work, network pruning proved to be a valid way to reduce the network complexity and over-fitting [17, 18, 19]. In [20], Han et al. pruned less influential connections in neural networks, and achieved 9× and 13× compression for the CaffeNet and VGG16 models without accuracy loss. Singular Value Decomposition (SVD) [21] is frequently used to reduce the memory footprint. In [22], Denton et al. used SVD and filter clustering to speed up the first two FC layers of CNNs. Zhang et al. [23] proposed a method that was tested on a deeper model, which used low-rank decomposition on network parameters and took nonlinear units into consideration. Jaderberg et al. [24] used rank-1 filters to approximate the original ones.

3.2 Data Quantization
Implementing fixed-point arithmetic units on ASIC and FPGA is much more efficient than implementing floating-point ones. Consequently, most previous CNN accelerators used fixed-point numbers instead of floating-point numbers [7, 25, 26, 6]. Shorter fixed-point representations of weights and data can also significantly reduce the memory footprint and computation resources. For example, Chen et al. showed that the area and power of a 16-bit multiplier are 0.164× and 0.136× those of a 32-bit multiplier under 65nm fabrication technology [7].
Most previous work adopted the 16-bit quantization strategy [27, 25, 7, 8]. In [7], Chen et al. showed that using 16-bit numbers instead of 32-bit ones only introduced 0.26% more error rate on the MNIST dataset. In [8], 16-bit numbers were used in the inference process while 32-bit numbers were used in the training process, and results on the MNIST dataset showed that there was only 0.01% accuracy reduction.
To accelerate large CNN models on the embedded FPGA platform, data quantization is rather important, and a shorter representation that introduces negligible accuracy loss is always expected. However, though previous work used data quantization, there is no comprehensive analysis of different quantization strategies.

3.3 CNN Accelerator
Previous CNN accelerator designs can be generally classified into two groups: the first group focuses on the computing engine and the second group aims to optimize the memory system.
CNNs are extremely computation-intensive, and thus powerful computing engines are necessary to accelerate them. Chakradhar et al. in [11] proposed a dynamically configurable architecture for CNN. They added dedicated switches between the computing modules to enable design space exploration for dynamic configuration across different CNN layers. An associated compiler was also proposed to fully exploit the parallelism among the CNN workloads.
The weights in CONV layers of a CNN are used multiple times in computation, and thus the overall performance can be significantly degraded by frequent memory access. In [7], Chen et al. used a tiling strategy and dedicated buffers for data reuse to reduce the total communication traffic. In their further study [8], a multi-chip supercomputer was proposed which offered sufficient memory capacity to store all the weights of the CNN on chip. In [10], all the weights of one CNN layer were also stored in on-chip memory. In this manner, the data traffic between on-chip and off-chip memory could be minimized.

3.4 Motivation
State-of-the-art CNN models for large-scale visual recognition are much larger and deeper than early small CNN models. In this case, CNN accelerators such as ShiDianNao [10], which store weights on chip, are hard pressed to support those large CNN models. Consequently, state-of-the-art CNN models can only be stored in external memory, and the bandwidth problem needs to be considered.
Most previous studies focused on accelerating only the CONV layers of CNN. For example, in [6], the accelerator design was only applied to several CONV layers rather than the full CNN. In [26] and [11], the authors only used models with a few CONV layers and without any FC layer. In this manner, those accelerators are hard to use for accelerating full CNNs.
A full CNN model consists of both CONV layers and FC layers, and thus an efficient CNN accelerator for real-life applications needs to consider both of them. For CONV layers and FC layers, the encountered problems are rather different. CONV layers are computation-centric: they contain few parameters but need a great deal of operations; FC layers are memory-centric: they usually contain hundreds of millions of weights, and each weight is used only once. Consequently, loading weights from the external memory significantly degrades the performance of FC layers. In other words, the bandwidth limits the performance of FC layers. Considering this, we go deeper with the embedded FPGA platform on alleviating the bandwidth problem.
Table 2: The memory footprint, computation complexities, and performance of the VGG16 model and its SVD version.
Network      FC6                     # of total weights  # of operations  Top-5 accuracy
VGG16        25088×4096              138.36M             30.94G           88.00%
VGG16-SVD    25088×500 + 500×4096    50.18M              30.76G           87.96%

(Figure: the data quantization flow. A weight quantization phase performs weight dynamic range analysis and sets the weight quantization configuration; a data quantization phase then runs the fixed-point and floating-point CNN models layer by layer (Layer 1 … Layer N) on input images, analyzing the dynamic range of each layer's feature maps to find the optimal quantization strategy.)

4. COMPLEXITY ANALYSIS OF CNN
The time complexity of a layer in CNN can be evaluated by the number of multiplication operations in the inference process. In a CONV layer, each convolutional kernel is a k × k filter applied to an r × c dimension input feature map. The number of kernels equals n_in × n_out. Consequently, according to Equation 1, the complexity of this CONV layer is

$$C_{CONV}^{Time} = O(n_{in} \cdot n_{out} \cdot k^2 \cdot r \cdot c). \qquad (4)$$

For pooling layers and FC layers, the time complexities are

$$C_{Pooling}^{Time} = O(n_{in} \cdot r \cdot c), \qquad (5)$$

$$C_{FC}^{Time} = O(n_{in} \cdot n_{out}). \qquad (6)$$

For pooling layers, n_out equals n_in since each input feature map is pooled to a corresponding output feature map, and thus the complexity is linear in either the input or the output feature map number.
Space complexity refers to the memory footprint. For a CONV layer, there are n_in × n_out convolution kernels, and each kernel has k × k weights.
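The per-layer counts behind Figure 2 and Table 2 can be reproduced directly from Equations (4) and (6). The short sketch below is only a back-of-the-envelope check (multiplications counted once per output, biases ignored); the FC6 shape 25088×4096 is taken from Table 2.

```python
def conv_layer_cost(n_in, n_out, k, r, c):
    """Eq. (4): multiplications of a CONV layer, plus its n_in*n_out*k*k weights."""
    return n_in * n_out * k * k * r * c, n_in * n_out * k * k

def fc_layer_cost(n_in, n_out):
    """Eq. (6): an FC layer does one multiplication per weight."""
    return n_in * n_out, n_in * n_out

# FC6 of VGG16 (Table 2): a 25088 x 4096 weight matrix
ops, weights = fc_layer_cost(25088, 4096)
print(weights / 1e6)   # ~102.76 million weights, the tallest FC bar in Figure 2(b)
```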
Table 3: Exploration of different data quantization strategies with state-of-the-art CNNs.
Network CaffeNet VGG16 VGG16-SVD
Experiment Exp 1 Exp 2 Exp 3 Exp 4 Exp 5 Exp 6 Exp 7 Exp 8 Exp 9 Exp 10 Exp 11 Exp 12 Exp 13
Data Bits Single-float 16 8 Single-float 16 16 8 8 8 8 Single-float 16 8
Weight Bits Single-float 16 8 Single-float 16 8 8 8 8 8 or 4 Single-float 16 8 or 4
Data Precision N/A Dynamic Dynamic N/A 2^-2 2^-2 Not available 2^-5 or 2^-1 Dynamic Dynamic N/A Dynamic Dynamic
Weight Precision N/A Dynamic Dynamic N/A 2^-15 2^-7 Not available 2^-7 Dynamic Dynamic N/A Dynamic Dynamic
Top 1 Accuracy 53.90% 53.90% 53.02% 68.10% 68.02% 62.26% Not available 28.24% 66.58% 66.96% 68.02% 64.64% 64.14%
Top 5 Accuracy 77.70% 77.12% 76.64% 88.00% 87.94% 85.18% Not available 49.66% 87.38% 87.60% 87.96% 86.66% 86.30%
1. The weight bits "8 or 4" in Exp 10 and Exp 13 means 8 bits for CONV layers and 4 bits for FC layers.
2. The data precision "2^-5 or 2^-1" in Exp 8 means 2^-5 for feature maps between CONV layers and 2^-1 for feature maps between FC layers.
5.2 Analysis of Different Strategies
We explore different data quantization strategies with the CaffeNet, VGG16, and VGG16-SVD networks, and the results are shown in Table 3. All results are obtained under the Caffe framework [28].
• For CaffeNet, as shown in Exp 1, the top-5 accuracy is 77.70% when 32-bit floating-point numbers are used. When employing static-precision 16-bit quantization and 8/4-bit dynamic-precision quantization, the top-5 accuracy results are 77.12% and 76.64% respectively.
• The VGG16 network with static-precision quantization strategies is tested in Exp 4 to Exp 8. As shown in Exp 4, the single-float VGG16 network achieves 88.00% top-5 accuracy. When using the 16-bit quantization configuration, only 0.06% accuracy loss is introduced. However, when employing 8-bit static-precision quantization, no configuration is available since the feature maps between FC layers are quantized to 0. As shown in Exp 8, at least two precisions are needed when using 8-bit quantization, and the accuracy degrades greatly in this case.
• Results of the VGG16 network with dynamic-precision quantization are shown in Exp 9 and Exp 10. When 8-bit dynamic-precision quantization is used for both data and weights, the top-5 accuracy is 87.38%. Using 8/4-bit dynamic-precision quantization for weights in CONV layers and FC layers respectively achieves even higher accuracy: as shown in Exp 10, the top-5 accuracy in this case is 87.60%.
• The results of the VGG16-SVD network are shown in Exp 11 to Exp 13. Compared with the floating-point VGG16 model, floating-point VGG16-SVD only introduces 0.04% accuracy loss. However, when 16-bit dynamic-precision quantization is adopted, the top-5 accuracy is down to 86.66%. With 8/4-bit dynamic-precision quantization, the top-5 accuracy further drops to 86.30%.
The results show that dynamic-precision quantization is much more favorable than static-precision quantization. With dynamic-precision quantization, we can use much shorter representations of operands while still achieving comparable accuracy. For example, compared with 16-bit quantization, 8/4-bit quantization halves the storage space for intermediate data and reduces the memory footprint of CNN models by three fourths. Besides, the utilization of bandwidth can also be significantly increased.
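The dynamic-precision strategy in Table 3 fixes the word length (e.g., 16, 8, or 4 bits) but lets each layer choose its own fractional length (radix-point position). Since the flow of Section 5.1 is not reproduced above, the sketch below only illustrates that idea under our own assumptions: two's-complement fixed-point values and the sum of absolute quantization errors as the per-layer selection criterion.

```python
import numpy as np

def quantize(x, bits, frac):
    """Round x onto a signed fixed-point grid: `bits` total bits, `frac` fractional bits."""
    step = 2.0 ** (-frac)
    lo = -(2 ** (bits - 1)) * step
    hi = (2 ** (bits - 1) - 1) * step
    return np.clip(np.round(x / step) * step, lo, hi)

def choose_frac_len(x, bits, candidates=range(-8, 16)):
    """Per-layer dynamic precision: pick the fractional length with the smallest error."""
    return min(candidates, key=lambda f: np.abs(quantize(x, bits, f) - x).sum())

# e.g. 8-bit CONV weights and 4-bit FC weights, each with its own fractional length
w_conv = np.random.randn(64, 3, 3, 3) * 0.1
w_fc = np.random.randn(4096, 256) * 0.01
q_conv = quantize(w_conv, 8, choose_frac_len(w_conv, 8))
q_fc = quantize(w_fc, 4, choose_frac_len(w_fc, 4))
```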
6. SYSTEM DESIGN
In this section, we introduce the design of our CNN accelerator. First, the overall architecture is presented. After that, the designs of the major modules are introduced. Finally, the implementation details are presented.

6.1 Overall Architecture
In this work, we propose a CPU+FPGA heterogeneous architecture to accelerate CNNs. Figure 4 (a) shows an overview of the proposed system architecture. The whole system can be divided into two parts: the Programmable Logic (PL) and the Processing System (PS).
PL is the FPGA chip, on which we place the Computing Complex, On-chip Buffers, Controller, and DMAs. The Computing Complex consists of Processing Elements (PEs) which take charge of the majority of computation tasks in CNN, including CONV layers, Pooling layers, and FC layers. On-chip buffers, including the input buffer and output buffer, prepare data to be used by the PEs and store the results. The Controller fetches instructions from the external memory and decodes them to orchestrate all the modules on the PL except the DMAs. The DMAs transfer data and instructions between the external memory on the PS side and the On-chip Buffers on the PL side.
PS consists of general-purpose processors and the external memory. All the CNN model parameters, data, and instructions are stored in the external memory. The processors run bare-metal programs and help to orchestrate the whole inference phase by configuring the DMAs. We also realize the Softmax function on the CPU, considering that its FPGA implementation would bring inevitable design overhead with little performance improvement, since this function is called only in the last layer of the whole CNN.
The complete inference process of an image with the proposed CNN accelerator consists of three steps that are executed in sequence: data preparation, data processing, and result output.
Data Preparation. In this phase, all the data needed in the computation, including image data, model data, and control data, are stored in the external memory. Control data includes the Buffer Descriptors (BDs) used by the DMAs and the instructions used by the Controller. So far the image data is not obtained from a camera.
Data Processing. When all the data are prepared, the CPU host starts to configure the DMAs with the BDs that are pre-stored in the external memory. The configured DMA loads data and instructions to the Controller, which triggers a computation process on the PL. Each time a DMA interrupt is asserted, the CPU host adds up the self-maintained pointer address for each DMA's BD list and configures them with new BDs. This phase works until the last BD has been transferred.
Result Output. After receiving the interrupt of the last BD from the DMA, the processor host applies the Softmax function to the final results from the PEs and outputs the results to the UART port.

6.2 PE Architecture
Figure 4 (b) shows the architecture of the PE and the other modules involved. A PE consists of five parts: the Convolver Complex, the Adder Tree, the Non-Linearity module, the Max-Pooling module, and the Bias Shift.
• For the Convolver Complex, we employ the classical line buffer design [29], as shown in Figure 4 (c). When Input Data goes through the buffer in row-major layout, the line buffer releases a window selection function on the input image. The selected window, followed by multipliers and an adder tree, computes the convolution result, one datum per cycle. Since the bottleneck of FC layers appears at the bandwidth, we use this module to compute the matrix-vector multiplication for FC layers even though the efficiency is not good. To realize this function, we set the delay of each line of the line buffer to the kernel size by using a MUX at the end of each line. In the proposed implementation, the kernel size is 3. When Input Data goes through the buffer, we get a totally new vector every 9 cycles in the selected window and do a vector inner product. Thus a convolver can multiply a matrix by a vector of size 9. (A behavioral sketch of this line-buffer window is given after this list of modules.)
Figure 4: The design of our image classification system: (a) the overall architecture; (b) the processing element; (c) the convolver in the processing element.
Figure 5: Workload schedule for CONV layers and FC layers: (a) tiling and reuse of feature maps in CONV layers; (b) two phases in the execution of CONV layers; (c) workload schedule in FC layers.
• The Adder Tree (AD) sums all the results from the convolvers. It can add the intermediate data from the Output Buffer or bias data from the Input Buffer if needed.
• The Non-Linearity (NL) module applies the non-linear activation function to the input data stream.
• The Max-Pooling module utilizes the line buffers to apply the …

(Figure 6: schedule of Data In/Out, Weight In/Out, and Result In/Out across Phase 1 … Phase n_in/Ti of a CONV layer.)
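As noted in the Convolver Complex description above, here is a behavioral (not RTL) Python model of the 3×3 line-buffer window: pixels stream in row-major order, two row-deep delay lines plus a 3-column shift window expose one 3×3 neighborhood per cycle, and an "adder tree" reduces the nine products. The zero-initialized buffers and skipped borders are our simplifications; the FC-mode reuse of the same multipliers is not modeled.

```python
from collections import deque

def conv3x3_stream(pixels, width, kernel):
    """Behavioral sketch of the line-buffer convolver (kernel size 3)."""
    line1 = deque([0.0] * width, maxlen=width)  # delays the stream by one image row
    line2 = deque([0.0] * width, maxlen=width)  # delays the stream by two image rows
    win = [[0.0] * 3 for _ in range(3)]         # 3x3 window registers
    out = []
    for n, px in enumerate(pixels):
        taps = (line2[0], line1[0], px)         # same column of rows y-2, y-1, y
        line2.append(line1[0])                  # pixel leaving line1 enters line2
        line1.append(px)
        for r in range(3):                      # shift the window left by one column
            win[r] = [win[r][1], win[r][2], taps[r]]
        row, col = divmod(n, width)
        if row >= 2 and col >= 2:               # a full 3x3 window is available
            out.append(sum(win[r][c] * kernel[r][c]
                           for r in range(3) for c in range(3)))
    return out

img = [float(v) for v in range(25)]             # a 5x5 test image in row-major order
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
print(conv3x3_stream(img, 5, identity))         # window centers: 6, 7, 8, 11, ..., 18
```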
(Figure: data arrangement in the external memory, showing addresses and data of the weight blocks for the 64 convolver buffers; referenced as Figure 9 in Section 7.)

… instructions. The compiler takes the fixed-point CNN model as …
Table 4: Instructions for one CONV layer generated by the compiler.
Index | Pool Bypass | NL Bypass | Zero Switch | Result Shift | Bias Shift | Write En | PE En | Phase Type | Pic Num | Tile Size | Layer Type
1     | X           | X         | X           | X            | X          | No       | 2     | First      | 2       | Tr        | CONV
2     | Yes         | Yes       | Bias        | X            | BS         | No       | 2     | Cal        | 2       | Tr        | CONV
3     | No          | No        | Zero        | X            | X          | PE       | 2     | Cal        | 2       | Tr        | CONV
4     | X           | X         | X           | RS           | X          | DDR      | 2     | Last       | 2       | Tr        | CONV
Table 5: Parameter configuration and resource utilization.
Param.       tile_size        convolver_num      PE_num            reuse_times
Config.      28               64                 2                 16
Param.       datain_port_num  weightin_port_num  dataout_port_num
Config.      8                4                  2
Resource     FF               LUT                DSP               BRAM
Utilization  127653           182616             780               486
Percent (%)  29.2             83.5               89.2              86.7

Figure 10: Testing platform. We use the Xilinx Zynq ZC706 for on-board testing. A power meter is used for power analysis.

… FC layers is not effective. Considering this, the proposed system uses the Convolver Complex in one of the PEs to do the computation for FC layers. In this case, we need to fully utilize the bandwidth of the external memory with the current PL structure.
In our system, we assign a buffer of length 900, the same as Tr × Tr, to each of the 64 convolvers in one PE. The buffers are filled one by one when computing CONV layers. To reduce the extra data routing logic for filling buffers while keeping a long burst length when fetching data for computing FC layers, we arrange the weight matrix in the external memory. We first divide the whole matrix into blocks of 64×9 columns and 100 rows, such that one block can be processed in a phase. In each block, the data is arranged as shown in Figure 9 (b). Without data arrangement for FC layers, as shown in Figure 9 (a), we would need 64×100 DMA transactions to load one block, while the burst length is just 9. By arranging the data following Figure 9 (b), we need just one DMA transaction to load the whole block, and the long burst length ensures a high utilization of the external memory bandwidth.
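The exact layout of Figure 9 (b) is not preserved in this copy, so the sketch below shows one plausible rearrangement consistent with the text above: each 100-row × (64×9)-column block of the FC weight matrix is copied into one contiguous chunk, so a single long-burst DMA transaction can fetch it instead of 64×100 transactions of burst length 9. The within-block ordering and the NumPy representation are our assumptions.

```python
import numpy as np

CONVOLVERS, KERNEL_LEN, BLOCK_ROWS = 64, 9, 100     # 64 x 9 columns, 100 rows per block

def arrange_fc_weights(W):
    """Reorder an FC weight matrix (n_out x n_in) into contiguous per-block chunks."""
    n_out, n_in = W.shape
    block_cols = CONVOLVERS * KERNEL_LEN             # 576 columns per block
    chunks = []
    for r0 in range(0, n_out, BLOCK_ROWS):
        for c0 in range(0, n_in, block_cols):
            block = W[r0:r0 + BLOCK_ROWS, c0:c0 + block_cols]
            # row by row, the 9-weight groups of the 64 convolvers stay adjacent,
            # so the whole block becomes one contiguous burst in external memory
            chunks.append(np.ascontiguousarray(block).reshape(-1))
    return np.concatenate(chunks)

# e.g. the 500 x 25088 factor of the SVD-decomposed FC6 layer
arranged = arrange_fc_weights(np.random.randn(500, 25088).astype(np.float32))
```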
The CPU platform is an Intel Xeon E5-2690 v2 @ 3.00 GHz. The GPU platform is an Nvidia K40 GPU (2880 CUDA cores with 12GB GDDR5 384-bit memory), and the mGPU platform is the Nvidia TK1 mobile GPU development kit (192 CUDA cores). For experiments on CPU, GPU, and mGPU, the operating system is Ubuntu 14.04 and the deep learning software framework is Caffe [28].

8.1 Theoretical Estimation
For a CONV layer, the number of phases needed when tiling is adopted can be calculated by the following formula:

$$N_{phase}^{CONV} = \left\lceil \frac{n_{in}}{T_i} \right\rceil \times \left\lceil \frac{n_{out}}{T_o} \right\rceil \times \left\lceil \frac{row}{T_r} \right\rceil^2,$$

where $T_o = reuse\_times \times PE\_num$ and $T_c = T_r$. The times for computation and for loading data in each phase are:

$$t_{compute\_data}^{CONV} \approx T_r^2 \times reuse\_times,$$

and

$$t_{load\_data}^{CONV} = \frac{T_r^2 \times T_i}{datain\_port\_num}.$$

CONV layers are usually computation-intensive. Consequently, in order to keep the ping-pong mechanism working, typically $t_{load}$ is smaller than $t_{compute}$, and thus there should be:

$$datain\_port\_num^{CONV} \ge \frac{T_i}{reuse\_times}.$$

In each phase, data will be reused reuse_times times, each time accompanied by a new group of weights (9 weights for each kernel, as in the model we use). Therefore, for weights, we have:

$$t_{compute\_weight}^{CONV} \approx T_r^2,$$

$$t_{load\_weight}^{CONV} = \frac{9 \times T_i \times PE\_num}{weightin\_port\_num},$$

$$weightin\_port\_num^{CONV} \ge \frac{9 \times T_i \times PE\_num}{T_r^2}.$$

According to the workload schedule shown in Figure 6, the constraint for dataout_port_num is: …
33
Table 6: Performance of different platforms with VGG16-SVD network.
Platform Embedded FPGA CPU GPU mGPU CPU GPU mGPU
Layer Theoretical Computation Real Computation Total Operations Real Performance Real Computation Real Performance
(Group) Time (ms) Time (ms) (GOP) (GOP/s) Time (ms) (GOP/s)
CONV1 21.41 31.29 3.87 123.76 83.43 2.45 59.45 46.42 1578.8 65.15
CONV2 16.06 23.58 5.55 235.29 68.99 3.31 79.73 80.44 1675.5 69.60
CONV3 26.76 39.29 9.25 235.38 76.08 4.25 89.35 151.57 2177.1 103.51
CONV4 26.76 36.30 9.25 254.81 62.53 3.31 107.49 147.91 2791.6 86.04
CONV5 32.11 32.95 2.31 70.16 12.36 2.30 63.75 186.99 1003.5 36.27
CONV Total 123.10 163.42 30.69 187.80 312.36 15.45 399.77 98.26 1986.0 76.77
FC6-1 10.45 20.17 0.025 1.24 1.69 0.445 29.35 14.87 56.404 0.86
FC6-2 1.71 3.75 0.0041 1.09 0.26 0.031 5.26 15.65 132.26 0.78
FC7 13.98 30.02 0.034 1.12 1.86 0.19 14.74 18.04 177.78 2.28
FC8 3.413 7.244 0.0082 1.13 0.46 0.96 4.58 17.75 8.56 1.79
FC Total 29.55 61.18 0.073 1.20 4.28 1.79 53.93 17.17 40.98 1.36
Total 152.65 224.60 30.76 136.97 316.64 17.25 453.70 97.16 1783.9 67.81
34
… GOP/s, which is several times higher than previous designs. The performance of our system with the full VGG16-SVD network is 136.97 GOP/s.

8.4 Discussion
At present, our implementation uses 16-bit fixed-point numbers and the Zynq board. The projected results with different quantization strategies and platforms are shown in Table 8. Theoretically, when using 8-bit quantization, 2× PEs can be placed on the FPGA and thus the performance on CONV layers doubles. Besides, 2× weights can be loaded into the system with the same bandwidth compared with 16-bit quantization, and thus the performance on FC layers also doubles. Furthermore, when deploying to the VC707 board, one more PE can be placed, and thus the processing capability on CONV layers is expected to be 1.5× higher than that of the Zynq platform. For the VGG16-SVD network on VC707 with 8-bit dynamic-precision quantization, it is expected to achieve a frame rate of 11.76 fps.
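The 11.76 fps projection can be reproduced from the measured layer times in Table 6 under the scaling assumptions stated above (8-bit quantization halves both the CONV and FC computation times; a third PE on VC707 gives 1.5× CONV throughput). The arithmetic below is only this back-of-the-envelope check, not a new measurement.

```python
conv_ms, fc_ms = 163.42, 61.18                      # real computation times from Table 6

fps_zc706_16bit = 1000.0 / (conv_ms + fc_ms)                  # ~4.45 fps (measured setup)
fps_zc706_8bit  = 1000.0 / (conv_ms / 2 + fc_ms / 2)          # ~8.9 fps (projected)
fps_vc707_8bit  = 1000.0 / (conv_ms / 2 / 1.5 + fc_ms / 2)    # ~11.76 fps (projected)

print(round(fps_zc706_16bit, 2), round(fps_zc706_8bit, 2), round(fps_vc707_8bit, 2))
```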
features off-the-shelf: an astounding baseline for recognition,” in
9. CONCLUSION CVPRW. IEEE, 2014, pp. 512–519.
The limited bandwidth is one of the bottlenecks of accelerat- [16] K. Simonyan and A. Zisserman, “Very deep convolutional networks
ing deep CNN models on embedded systems. In this paper, we for large-scale image recognition,” arXiv preprint arXiv:1409.1556,
make an in-depth investigation of the memory footprint and band- 2014.
width problem in order to accelerate state-of-the-art CNN model- [17] S. J. Hanson and L. Y. Pratt, “Comparing biases for minimal network
s for Image-Net classification on the embedded FPGA platform. construction with back-propagation,” in NIPS, 1989, pp. 177–185.
We show that CONV layers are computation-centric and FC layers [18] Y. LeCun, J. S. Denker, S. A. Solla, R. E. Howard, and L. D. Jackel,
are memory-centric. A dynamic-precision data quantization flow “Optimal brain damage,” in NIPS, vol. 89, 1989.
is proposed to reduce memory footprint and bandwidth require- [19] B. Hassibi and D. G. Stork, Second order derivatives for network
ments while maintaining comparable accuracy. Convolver that can pruning: Optimal brain surgeon. Morgan Kaufmann, 1993.
be used for both CONV layers and FC layers is designed to save [20] S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both weights and
the resource. A data arrangement scheme for FC layers is also pro- connections for efficient neural networks,” arXiv preprint
posed to ensure high bandwidth utilization. Our implementation arXiv:1506.02626, 2015.
on Xilinx Zynq with very deep VGG16-SVD model for Image-Net [21] G. H. Golub and C. F. Van Loan, “Matrix computations. 1996,” Johns
classification achieves a frame rate at 4.45 fps with 86.66% top-5 Hopkins University, Press, Baltimore, MD, USA, pp. 374–426, 1996.
accuracy with 16-bit quantization. The average performances of [22] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus,
“Exploiting linear structure within convolutional networks for
CONV layers and the full CNN are 187.8 GOP/s and 137.0 GOP/s
efficient evaluation,” in NIPS, 2014, pp. 1269–1277.
under 150MHz working frequency.
[23] X. Zhang, J. Zou, X. Ming, K. He, and J. Sun, “Efficient and accurate
approximations of nonlinear convolutional networks,” arXiv preprint
10. REFERENCES arXiv:1411.4229, 2014.
[1] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, [24] M. Jaderberg, A. Vedaldi, and A. Zisserman, “Speeding up
Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and convolutional neural networks with low rank expansions,” arXiv
L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” preprint arXiv:1405.3866, 2014.
pp. 211–252, 2015. [25] C. Farabet, Y. LeCun, K. Kavukcuoglu, E. Culurciello, B. Martini,
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet P. Akselrod, and S. Talay, “Large-scale fpga-based convolutional
classification with deep convolutional neural networks,” in NIPS, networks,” Machine Learning on Very Large Data Sets, vol. 1, 2011.
2012, pp. 1097–1105. [26] M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar,
[3] M. D. Zeiler and R. Fergus, “Visualizing and understanding I. Durdanovic, E. Cosatto, and H. Graf, “A massively parallel
convolutional networks,” in ECCV, 2014, pp. 818–833. coprocessor for convolutional neural networks,” in ASAP, July 2009,
[4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, pp. 53–60.
D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with [27] D. Larkin, A. Kinane, and N. O’Connor, “Towards hardware
convolutions,” arXiv preprint arXiv:1409.4842, 2014. acceleration of neuroevolution for multimedia processing
[5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for applications on mobile devices,” in Neural Information Processing.
image recognition,” arXiv preprint arXiv:1512.03385, 2015. Springer, 2006, pp. 1178–1188.
[6] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing [28] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,
fpga-based accelerator design for deep convolutional neural S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for
networks,” in Proceedings of ISFPGA. ACM, 2015, pp. 161–170. fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.
[7] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, [29] B. Bosi, G. Bois, and Y. Savaria, “Reconfigurable pipelined 2-d
“Diannao: A small-footprint high-throughput accelerator for convolvers for fast digital signal processing,” VLSI, vol. 7, no. 3, pp.
ubiquitous machine-learning,” in ASPLOS, vol. 49, no. 4. ACM, 299–308, 1999.
2014, pp. 269–284. [30] V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello, “A 240
[8] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, g-ops/s mobile coprocessor for deep neural networks,” in CVPRW.
Z. Xu, N. Sun et al., “Dadiannao: A machine-learning IEEE, 2014, pp. 696–701.
supercomputer,” in MICRO. IEEE, 2014, pp. 609–622.
35