0% found this document useful (0 votes)
2 views

GoingDeeperwithEmbeddedFPGAPlatformforConvolutionalNeuralNetwork

The paper discusses the challenges of integrating Convolutional Neural Networks (CNNs) into embedded systems due to their computational intensity and resource demands. It presents a CNN accelerator design on an embedded FPGA platform, specifically targeting the VGG16-SVD model, achieving significant performance improvements with minimal accuracy loss through dynamic-precision data quantization. The proposed system demonstrates a frame rate of 4.45 fps with a top-5 accuracy of 86.66%, addressing the limitations of memory bandwidth and resource utilization in embedded applications.

Uploaded by

kysn2026
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
2 views

GoingDeeperwithEmbeddedFPGAPlatformforConvolutionalNeuralNetwork

The paper discusses the challenges of integrating Convolutional Neural Networks (CNNs) into embedded systems due to their computational intensity and resource demands. It presents a CNN accelerator design on an embedded FPGA platform, specifically targeting the VGG16-SVD model, achieving significant performance improvements with minimal accuracy loss through dynamic-precision data quantization. The proposed system demonstrates a frame rate of 4.45 fps with a top-5 accuracy of 86.66%, addressing the limitations of memory bandwidth and resource utilization in embedded applications.

Uploaded by

kysn2026
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 11

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/311491247

Going Deeper with Embedded FPGA Platform for Convolutional Neural Network

Conference Paper · February 2016


DOI: 10.1145/2847263.2847265

CITATIONS READS
1,265 2,409

12 authors, including:

Jiantao Qiu Sen Song


Tsinghua University Tsinghua University
20 PUBLICATIONS 2,217 CITATIONS 121 PUBLICATIONS 11,864 CITATIONS

SEE PROFILE SEE PROFILE

Yu Wang Huazhong Yang


Tsinghua University Tsinghua University
308 PUBLICATIONS 8,340 CITATIONS 816 PUBLICATIONS 15,075 CITATIONS

SEE PROFILE SEE PROFILE

All content following this page was uploaded by Yu Wang on 04 December 2022.

The user has requested enhancement of the downloaded file.


Going Deeper with Embedded FPGA Platform for
Convolutional Neural Network

Jiantao Qiu1,2 , Jie Wang1 , Song Yao1,2 , Kaiyuan Guo1,2 , Boxun Li1,2 ,Erjin Zhou1 ,
Jincheng Yu1,2 , Tianqi Tang1,2 , Ningyi Xu3 , Sen Song2,4 , Yu Wang1,2 ,
and Huazhong Yang1,2
1
Department of Electronic Engineering, Tsinghua University
1
Tsinghua National Laboratory for Information Science and Technology
2
Center for Brain-Inspired Computing Research, Tsinghua University
3
Hardware Computing Group, Microsoft Research Asia 4 School of Medicine, Tsinghua University
{songyao, yu-wang}@mail.tsinghua.edu.cn

ABSTRACT Keywords
In recent years, Convolutional Neural Network (CNN) based Embedded FPGA; Convolutional Neural Network (CNN); Dyna-
methods have achieved great success in a large number of appli- mic-precision data quantization; Bandwidth utilization
cations and have been among the most powerful and widely used
techniques in computer vision. However, CNN-based methods are 1. INTRODUCTION
computational-intensive and resource-consuming, and thus are hard Image classification is a basic problem in computer vision (CV).
to be integrated into embedded systems such as smart phones, smart In recent years, Convolutional Neural Network (CNN) has led to
glasses, and robots. FPGA is one of the most promising platforms great advances in image classification accuracy. In Image-Net Large-
for accelerating CNN, but the limited bandwidth and on-chip mem- Scale Vision Recognition Challenge (ILSVRC) 2012 [1], Krizhevsky
ory size limit the performance of FPGA accelerator for CNN. et al. showed that CNN had great power by achieving the top-5
In this paper, we go deeper with the embedded FPGA platfor- accuracy of 84.7% in classification task [2], which was significant-
m on accelerating CNNs and propose a CNN accelerator design ly higher than other traditional image classification methods. In
on embedded FPGA for Image-Net large-scale image classifica- the following years, the accuracy has been improved to 88.8% [3],
tion. We first present an in-depth analysis of state-of-the-art C- 93.3% [4], and 96.4% [5] in ILSVRC 2013, 2014, and 2015.
NN models and show that Convolutional layers are computational- While achieving state-of-the-art performance, CNN-based meth-
centric and Fully-Connected layers are memory-centric. Then the ods demand much more computations and memory resources com-
dynamic-precision data quantization method and a convolver de- pared with traditional methods. In this manner, most CNN-based
sign that is efficient for all layer types in CNN are proposed to methods have to depend on large servers. However, there has been
improve the bandwidth and resource utilization. Results show that a non-negligible market for embedded systems which demands ca-
only 0.4% accuracy loss is introduced by our data quantization flow pabilities of high-accuracy and real-time object recognition, such
for the very deep VGG16 model when 8/4-bit quantization is used. as auto-piloted car and robots. But for embedded systems, the lim-
A data arrangement method is proposed to further ensure a high uti- ited battery and resources are serious problems.
lization of the external memory bandwidth. Finally, a state-of-the- To address this problem, many researchers have proposed vari-
art CNN, VGG16-SVD, is implemented on an embedded FPGA ous CNN acceleration techniques from either computing or memo-
platform as a case study. VGG16-SVD is the largest and most ac- ry access aspects [6, 7, 8, 9, 10, 11, 12, 13]. However, most of pre-
curate network that has been implemented on FPGA end-to-end so vious techniques only considered small CNN models such as the
far. The system on Xilinx Zynq ZC706 board achieves a frame rate 5-layer LeNet for simple tasks such as MNIST handwritten dig-
at 4.45 fps with the top-5 accuracy of 86.66% using 16-bit quanti- its recognition [14]. State-of-the-art CNN models for large-scale
zation. The average performance of Convolutional layers and the image classification have extremely high complexity, and thus can
full CNN is 187.8 GOP/s and 137.0 GOP/s under 150MHz working only be stored in external memory. In this manner, memory band-
frequency, which outperforms previous approaches significantly. width becomes a serious problem for accelerating CNNs especially
for embedded systems. Besides, previous research focused on ac-
celerating Convolutional (CONV) layers, while the Fully-Connected
This work was supported by 973 project 2013CB329000, National Natural Science (FC) layers were not well studied. Consequently, we need to go
Foundation of China (No. 61373026, 61261160501), the Importation and Develop- deeper with the embedded FPGA platform to address these prob-
ment of High-Caliber Talents Project of Beijing Municipal Institutions, Microsoft, X- lems.
ilinx University Program, and Tsinghua University Initiative Scientific Research Pro- In this paper, we make a deep investigation on how to deploy
gram. full CNNs to accelerators on embedded FPGA platform. A CN-
N accelerator for Image-Net large-scale classification is proposed,
which can execute the very deep VGG16-SVD model at a speed of
4.45 fps. Specifically, this paper makes the following contributions.
• We present an in-depth analysis of state-of-the-art CNN mod-
els for large-scale image classification. We show that state-
of-the-art CNN models are extremely complex (for example,
the VGG16 model has 138 million weights and needs over
30 GOPs), CONV layers are computational-centric, and FC
layers are memory-centric.
• For the first time, we present an automatic flow for dynamic-
precision data quantization and explore various data quan-

26
Input Image Feature Maps Probability in “cat”
Probability in “tree”
Table 1: # of layers in VGG models.
CONV CONV CONV CONV CONV
Model FC Total
Group 1 Group 2 Group 3 Group 4 Group 5
Probability in “dog” VGG11 1 1 2 2 2 3 11
VGG16 2 2 3 3 3 3 16
CONV + Pooling CONV + Pooling FC FC
VGG19 2 2 4 4 4 3 19

Figure 1: A typical CNN structure from the feature map per- each element in the output feature maps is often attached to CONV
spective. layers. The CONV layer can be expressed with Equation 1:

nin
tization configurations. Results show that only a 0.4% ac- fiout = fjin ⊗ gi,j + bi (1 ≤ i ≤ nout ), (1)
curacy loss is introduced with VGG16 model under 8/4 bit j=1
dynamic-precision quantization. Specific hardware is also
designed to support dynamic-precision data quantization. where gi,j is the convolutional kernel applied to j-th input feature
map and i-th output feature map.
• We show that the performance of FC layers is mainly limit- FC layer applies a linear transformation on the input feature
ed by the memory bandwidth on embedded FPGA platform, vector:
which is different from CONV layers. In this manner, we ap- f out = W f in + b, (2)
ply SVD to the weight matrix of the first FC layer, which
reduces 85.8% memory footprint of this layer, design the where W is an nout × nin transformation matrix and b is the bias
convolvers that can compute FC layers to reduce resource term. It should be noted, for the FC layer, the input is not a combi-
consumption, and propose a data arrangement scheme to ac- nation of several 2-D feature maps but just a feature vector. Conse-
celerate FC layers. quently, in Equation 2, the parameter nin and nout actually corre-
sponds to the lengths of the input and output feature vector.
• We propose a CNN accelerator design on an embedded FP- Pooling layer, which outputs the maximum or average value of
GA platform for Image-Net large-scale classification. On the each subarea in each feature maps, is often attached to the CONV
Xilinx Zynq platform, our system achieves the performance layer. Max-pooling can be expressed as Equation 3:
at 187.8 GOP/s and 137.0 GOP/s for CONV layers and full  
CNN under 150 MHz frequency respectively. With VGG16- in
fm,n ··· in
fm,n+p−1
 
SVD network, our implementation achieves a top-5 accuracy out
fi,j = max 

.. .. ,
 (3)
of 86.66% at a 4.45 fps speed. p×p . .
in
fm+p−1,n ··· in
fm+p−1,n+p−1
The rest of paper is organized as follows. In Section 2, the
background of CNN is presented. In Section 3, the related work where p is the pooling kernel size. This non-linear "down sam-
is introduced and discussed. We analyze the complexity distribu- pling" not only reduces the feature map size and the computation
tion of state-of-the-art CNN models in Section 4. In Section 5, the for later layers, but also provides a form of translation invariance.
dynamic-precision data quantization flow is proposed. The pro- CNN can be used to classify images in a forward inference pro-
posed image classification system design and implementation de- cess. But before using the CNN for any task, one should first train
tails are introduced in Section 6. The memory system and data the CNN on a dataset. Recent research [15] showed that, a CNN
arrangement method for FC layers are introduced in Section 7. The model pre-trained on a large dataset for a given task can be used
performance of the proposed system is evaluated and discussed in for other tasks and achieved high accuracy with minor adjustment
Section 8. We finally conclude this paper in Section 9. in network weights. This minor adjustment is called "fine-tune".
The training of the CNN is mostly implemented on large servers.
2. BACKGROUND For embedded FPGA platform, we only focus on accelerating the
inference process of a CNN.
Deep CNN achieves the state-of-the-art performance on a wide
range of vision-related tasks. To help understand the CNN-based 2.2 Image-Net Dataset
image classification algorithms analyzed in this paper, in this sec-
tion, we introduce the basics of CNN. An introduction to the Image- Image-Net [1] dataset is regarded as the standard benchmark to
Net dataset and state-of-the-art CNN models is also presented. evaluate the performance of image classification and object detec-
tion algorithms. So far Image-Net dataset has collected more than
14 million images within more than 21 thousand categories. Image-
2.1 Primer on CNN Net releases a subset with 1.2 million images in 1000 categories for
A typical CNN consists of a number of layers that run in se- the ILSVRC classification task, which has significantly promoted
quence. The parameters of a CNN model are called "weights". The the development of CV techniques. In this paper, all the CNN mod-
first layer of a CNN reads an input image and outputs a series of els are trained with ILSVRC 2014 training dataset and evaluated
feature maps. The following layers read the feature maps generated with ILSVRC 2014 validation set.
by previous layers and output new feature maps. Finally a classifier
outputs the probability of each category that the input image might 2.3 State-of-the-Art CNN Models
belong to. CONV layer and FC layer are two essential types of lay- In ILSVRC 2012, the SuperVision team won the first place in
er in CNN. After CONV layers, there are usually pooling layers. A image classification task using AlexNet by achieving 84.7% top-
typical CNN example is shown in Figure 1. 5 accuracy [2]. CaffeNet is a replication of AlexNet with minor
In this paper, for a CNN layer, fjin denotes its j-th input feature changes. Both of AlexNet and CaffeNet consist of 5 CONV layers
map, fiout denotes the i-th output feature map, and bi denotes the and 3 FC layers.
bias term to the i-th output map. For CONV layers, nin and nout The Zeiler-and-Fergus (ZF) network achieved 88.8% top-5 accu-
represent the number of input and output feature maps respectively. racy and won the first place in image classification task of ILSVRC
For FC layers, nin and nout are the length of the input and output 2013 [3]. The ZF network also has 5 CONV layers and 3 FC layers.
feature vector. The VGG model achieved a top-5 accuracy of 92.6% and won
CONV layer takes a series of feature maps as input and con- the second place in image classification task of ILSVRC 2014 [16].
volves with convolutional kernels to obtain the output feature map- VGG model consists of 5 CONV layer groups and 3 FC layers.
s. A nonlinear layer, which applies nonlinear activation function to According to the exact number of layers, there are several versions

27
12.95
12.95
of the VGG model including VGG11, VGG16, and VGG19, as CONV1 CONV2 CONV3 CONV4
CONV5 FC6 FC7 FC8
listed in Table 1.

9.25
9.25
3. RELATED WORK

5.55

5.55

5.55
5.55
To accelerate CNN, a set of techniques from both software and

3.87

3.87

3.70
hardware perspectives have been studied. From software perspec-

2.31
1.85

1.85
tive, the target is compressing CNN models in order to reduce the

0.90

0.83
0.45

0.45
0.34
0.30

0.30

0.30

0.30
0.21

0.21

0.21

0.21
0.17
0.10
0.08
0.03

0.03

0.03

0.03

0.03
0.01

0.01

0.01

0.01

0.01
memory footprint and the number of operations while minimizing
accuracy loss. From the hardware perspective, specific architecture CAFFENET ZF VGG11 VGG16 VGG19

and modules are designed to reuse data, enhance "locality" of da- (a) Operations demanded in different layers (GOP)
ta, and accelerate convolution operations. To deploy CNN models CONV1 CONV2 CONV3 CONV4

102.76

102.76

102.76
on embedded systems, the bit widths of operators and weights are CONV5 FC6 FC7 FC8
often reduced compared to that on CPU or GPU platform.

52.43
3.1 Model Compression

37.75
Network pruning and decomposition were widely used to com-

16.78

16.78

16.78

16.78

16.78
press CNN models. In early work, network pruning proved to be

9.44
8.26
7.08
5.90
4.72
4.10

4.10

4.10

4.10

4.10
3.54

2.06
1.47
1.33
0.88

0.88
0.88

0.88
0.66

0.61
0.44
0.31

0.22

0.22
0.07
0.03

0.04

0.04
0.01

0.00
a valid way to reduce the network complexity and over-fitting [17,
18, 19]. In [20], Han et al. pruned less influential connection- CAFFENET ZF VGG11 VGG16 VGG19
s in neural networks, and achieved 9× and 13× compression for (b) Number of weights in different layers (Million)
CaffeNet and VGG16 model without accuracy loss. The Singu- Figure 2: The complexity distribution of state-of-the-art CNN
lar Value Decomposition (SVD) [21] is frequently used to reduce
models: (a) Distribution of operations by theoretical estima-
memory footprint. In [22], Denton et al. used SVD and filters clus-
tering to speedup the first two FC layers of CNNs. Zhang et al. [23] tion; (b) Distribution of weight number.
proposed a method that was tested on a deeper model, which used
low rank decomposition on network parameters and took nonlinear multi-chip supercomputer was proposed which offered sufficien-
units into consideration. Jaderberg et al. [24] used rank-1 filters to t memory capacity to store all the weights in the CNN on chip.
approximate the original ones. In [10], all the weights of one CNN layer were also stored in on-
chip memory. In this manner, the data traffic between on-chip and
3.2 Data Quantization off-chip memory could be minimized.
Implementing fixed-point arithmetic units on ASIC and FPGA
is much more efficient compared with floating-point ones. Conse- 3.4 Motivation
quently, most of previous CNN accelerators used fixed-point num- State-of-the-art CNN models for large-scale visual recognition
bers instead of floating-point numbers [7, 25, 26, 6]. Shorter fixed- are much larger and deeper than early small CNN models. In
point representation of weights and data can also significantly re- this case, CNN accelerators such as ShiDianNao [10] which store
duce memory footprint and computation resources. For example, weights on chip are hard to support those large CNN models. Con-
Chen et al. showed that the area and power of a 16-bit multiplier is sequently, state-of-the-art CNN models can only be stored in exter-
0.164× and 0.136× compared with that of 32-bit multiplier under nal memory and the bandwidth problem needs to be considered.
65nm fabrication technology [7]. Most of previous studies focused on only accelerating the CON-
Most of previous work adopted the 16-bit quantization strate- V layers of CNN. For example, in [6], the accelerator design was
gy [27, 25, 7, 8]. In [7], Chen et al. showed that using 16-bit num- only applied to several CONV layers rather than the full CNN.
bers instead of 32-bit ones only introduced 0.26% more error rate In [26] and [11], authors only used models with few CONV lay-
on MNIST dataset. In [8], 16-bit numbers were used in the infer- ers without any FC layer. In this manner, those accelerators were
ence process while 32-bit numbers were used in training process, hard to be used for accelerating full CNNs.
and results on MNIST dataset showed that there was only 0.01% A full CNN model consists of both CONV layers and FC layer-
accuracy reduction. s, and thus an efficient CNN accelerator for real-life applications
To accelerate large CNN models on the embedded FPGA plat- need to consider both of them. For CONV layers and FC lay-
form, data quantization is rather important and a shorter represen- ers, the encountered problems are rather different. CONV layers
tation that introducing negligible accuracy loss is always expected. are computation-centric: they contain few parameters but need a
However, though previous work used data quantization, there is no great deal of operations; FC layers are memory-centric: they usual-
comprehensive analysis of different quantization strategies. ly contain hundreds of million weights, and each weight is used for
only once. Consequently, loading weights from the external mem-
3.3 CNN Accelerator ory significantly degrades the performance of FC layers. In other
Previous CNN accelerator designs can be generally classified in- words, the bandwidth limits the performance of FC layers. Con-
to two groups: the first group focuses on the computing engine and sidering this, we go deeper with the embedded FPGA platform on
the second group aims to optimize the memory system. alleviating the bandwidth problem.
CNNs are extremely computational-intensive, and thus powerful
computing engines are necessary to accelerate them. Chakaradhar
et al. in [11] proposed a dynamically configurable architecture for 4. COMPLEXITY ANALYSIS OF CNN
CNN. They added dedicated switches between the computing mod- Time complexity of a layer in CNN can be evaluated by the
ules to enable design space exploration for dynamic configuration number of multiplication operations in the inference process. In a
across different CNN layers. An associate compiler was also pro- CONV layer, each convolutional kernel is a k × k filter applied to a
posed to fully exploit the parallelism among the CNN workloads. r×c dimension input feature map. The number of kernels equals to
The weights in CONV layers of CNN are used for multiple times nin ×nout . Consequently, according to Equation 1, the complexity
in computation, and thus the overall performance can be signifi- of this CONV layer is
cantly degraded by frequent memory access. In [7], Chen et al.
used the tiling strategy and dedicated buffers for data reuse to re-
V = O(nin · nout · k · r · c).
T ime 2
duce the total communication traffic. In their further study [8], a CCON (4)

28
Input images CNN model
Table 2: The Memory footprint, Computation Complexities,
and Performance of the VGG16 model and its SVD version. Weight quantization phase
Weight dynamic
namic range analysis
# of total # of Top-5
Network FC6
weights operations accuracy
VGG16 25088×4096 138.36M 30.94G 88.00%
VGG16-SVD 25088×500 + 500×4096 50.18M 30.76G 87.96%
Weight quantization
zation configuration
For pooling layers and FC layers, the time complexities are
CPT ooling
ime
= O(nin · r · c), (5) Data quantization phase
Fixed-point CNN model Floating-point CNN model

CFT C
ime
= O(nin · nout ). (6) Layer 1
Laye Layer
yer 1

For pooling layers, nout equals to nin since each input feature Feature
ure
e maps
ma Feature
ature
e maps
ma

map is pooled to a corresponding output feature map, and thus the Dynamic range analysis and finding


optimal quantization strategy
complexity is linear to either input or output feature map number. Feature maps Feature maps
Space complexity refers to the memory footprint. For a CONV
layer, there are nin ×nout convolution kernels, and each kernel has Layer N Layer N

k2 weights. Consequently, the space complexity for a CONV layer


is
Space
V = O(nin · nout · k ).
2 Weight and data quantization
uantiz configuration
CCON (7)
FC layer actually applies a multiplication to the input feature vec-
Figure 3: The dynamic-precision data quantization flow.
tor, and thus the complexity for FC layer is measure by the size for
the parameter matrix, which is shown in Equation 8:
on different quantization strategies and the trade-off between the bit
Space
CF C = O(nin · nout ) (8) length of fixed-point numbers and the accuracy. In this section, we
No space is needed for pooling layers since it has no weight. propose a dynamic-precision data quantization flow and compare it
The distribution of demanded operations and weight numbers in with widely used static-precision quantization strategies.
the inference process of state-of-the-art CNN models are shown in
Figure 2. The measured operations consist of multiplications, adds, 5.1 Quantization Flow
and non-linear functions. For a fixed-point number, its value can be expressed as
As shown in Figure 2 (a), the operations of CONV layers com- ∑
bw−1
pose most of the total operations of CNN models, and thus the n= Bi · 2−fl · 2i , (9)
time complexity of CONV layers is much higher than that of FC i=0
layers. Consequently, for CONV layers, more attention should be where bw is the bit width and fl is the fractional length which
paid to accelerate convolution operations. can be negative. To convert floating-point numbers into fixed-point
For space complexity, the situation is quite different. As shown ones while achieving the highest accuracy, we propose a dynamic-
in Figure 2 (b), FC layers contribute to most of the weights. S- precision data quantization strategy and an automatic workflow, as
ince each weight in FC layers is used only once in one inference shown in Figure 3. Unlike previous static-precision quantization s-
process, leaves no chance for reuse, the limited bandwidth can trategies, in the proposed data quantization flow, fl is dynamic for
significantly degrade the performance since loading those weights different layers and feature map sets while static in one layer
may take quite long time. to minimize the truncation error of each layer. The proposed
Since FC layers contribute to most of memory footprint, it is quantization flow mainly consists of two phases: the weight quan-
necessary to reduce weights of FC layers while maintaining com- tization phase and the data quantization phase.
parable accuracy. In this paper, SVD is adopted for accelerating The weight quantization phase aims to find the optimal fl for
FC layers. Considering an FC layer f out = W f in + b, the weight weights in one layer, as shown in Equation 10:
matrix W can be decomposed as W ≈ Ud Sd Vd = W1 W2 , in ∑
which Sd is a diagonal matrix. By choosing the first d singular fl = argmin |Wf loat − W (bw, fl )|, (10)
fl
values in SVD, i.e. the rank of matrix Ud , Sd , and Vd , both time
and space complexity can be reduced to O(d · nin + d · nout ) from where W is a weight and W (bw, fl ) represents the fixed-point for-
O(nin · nout ). Since accuracy loss may be minute even when d mat of W under the given bw and fl . In this phase, the dynamic
is much smaller than nin and nout , considerable reduction of time ranges of weights in each layer is analyzed first. After that, the fl
consumption and memory footprint can be achieved. is initialized to avoid data overflow. Furthermore, we search for the
The effectiveness of SVD is proved by the results in Table 2. optimal fl in the adjacent domains of the initial fl .
By applying SVD to the parameter matrix of the FC6 layer and The data quantization phase aims to find the optimal fl for a
choosing first 500 singular values, the number of weights in FC6 set of feature maps between two layers. In this phase, the inter-
layers is reduced to 14.6 million from 103 million, which achieves mediate data of the fixed-point CNN model and the floating-point
a compression rate at 7.04×. However, the number of operations CNN model are compared layer by layer using a greedy algorithm
does not decrease much since the FC layer contributes little to total to reduce the accuracy loss. For each layer, the optimization target
operations. The SVD only introduces 0.04% accuracy loss. is shown in Equation 11:
∑ +
fl = argmin |xf loat − x+ (bw, fl )|. (11)
fl
5. DATA QUANTIZATION
Using short fixed-point numbers instead of long floating-point In Equation 11, x+ represents the result of a layer when we denote
numbers is efficient for implementations on the FPGA and can sig- the computation of a layer as x+ = A · x. It should be noted, for
nificantly reduce memory footprint and bandwidth requirements. A either CONV layer or FC layer, the direct result x+ has longer bit
shorter bit width is always wanted, but it may lead to a severe ac- width than the given standard. Consequently, truncation is needed
curacy loss. Though fixed-point numbers have been widely used in when optimizing fl selection. Finally, the entire data quantization
CNN accelerator designs, there is no comprehensive investigation configuration is generated.

29
Table 3: Exploration of different data quantization strategies with state-of-the-art CNNs.
Network CaffeNet VGG16 VGG16-SVD
Experiment Exp 1 Exp 2 Exp 3 Exp 4 Exp 5 Exp 6 Exp 7 Exp 8 Exp 9 Exp 10 Exp 11 Exp 12 Exp 13
Data Bits Single-float 16 8 Single-float 16 16 8 8 8 8 Single-float 16 8
Weight Bits Single-float 16 8 Single-float 16 8 8 8 8 8 or 4 Single-float 16 8 or 4
Data Precision N/A Dynamic Dynamic N/A 2−2 2−2 Not available 2−5 or 2−1 Dynamic Dynamic N/A Dynamic Dynamic
Weight Precision N/A Dynamic Dynamic N/A 2−15 2−7 Not available 2−7 Dynamic Dynamic N/A Dynamic Dynamic
Top 1 Accuracy 53.90% 53.90% 53.02% 68.10% 68.02% 62.26% Not available 28.24% 66.58% 66.96% 68.02% 64.64% 64.14%
Top 5 Accuracy 77.70% 77.12% 76.64% 88.00% 87.94% 85.18% Not available 49.66% 87.38% 87.60% 87.96% 86.66% 86.30%
1
The weight bits "8 or 4" in Exp10 and Exp13 means 8 bits for CONV layers and 4 bits for FC layers.
2
The data precision "2−5 or 2−1 " in Exp8 means 2−5 for feature maps between CONV layers and 2−1 for feature maps between FC layers.

5.2 Analysis of Different Strategies the majority of computation tasks in CNN, including CONV layer-
We explore different data quantization strategies with CaffeNet, s, Pooling layers, and FC layers. On-chip buffers, including input
VGG16, and VGG16-SVD networks and the results are shown in buffer and output buffer, prepare data to be used by PEs and store
Table 3. All results are obtained under Caffe framework [28]. the results. Controller fetches instructions from the external mem-
• For CaffeNet, as shown in Exp 1, the top-5 accuracy is 77.70% ory and decodes them to orchestrate all the modules except DMAs
when 32-bit floating-point numbers are used. When employ- on the PL. DMAs are working for transferring data and instructions
ing static-precision 16-bit quantization and 8/4-bit dynamic- between the external memory on the PS side and On-chip Buffers
precision quantization, the top-5 accuracy results are 77.12% on the PL side.
and 76.64% respectively. PS consists of general-purpose processors and the external mem-
• VGG16 network with static-precision quantization strategies ory. All the CNN model parameters, data, and instructions are s-
are tested in Exp 4 to Exp 8. As shown in Exp 4, single-float tored in the external memory. Processors run bare-metal programs
VGG16 network 88.00% top-5 accuracy. When using the 16- and help to orchestrate the whole inference phase by configuring
bit quantization configuration, only 0.06% accuracy loss is the DMAs. We also realize Softmax function on CPU considering
introduced. However, when employing 8-bit static-precision that its FPGA implementation will bring inevitable design overhead
quantization, no configuration is available since the feature with little performance improvement since this function is called
maps between FC layers are quantized to 0. As shown in only in the last layer of the whole CNN.
Exp 8, at least two precisions are needed when using 8-bit The complete inference process of an image with the proposed
quantization and the accuracy degrades greatly in this case. CNN accelerator consists of three steps that are executed in se-
• Results of VGG16 network with dynamic-precision quanti- quence: data preparation, data processing, and result output.
zation are shown in Exp 9 and Exp 10. When 8-bit dynamic- Data Preparation. In this phase, all the data needed in the com-
precision quantization is used for both data and weights, the putation including image data, model data, and control data are s-
top-5 accuracy is 87.38%. Using 8/4-bit dynamic-precision tored in the external memory. Control data includes the Buffer De-
quantization for weights in CONV layers and FC layers re- scriptors (BD) used by DMAs and instructions used by Controller.
spectively even achieves higher accuracy. As shown in Exp So far the image data is not obtained from the camera.
10, in this case, the top-5 accuracy is 87.60%. Data Processing. When all the data are prepared, CPU host
• The results of VGG16-SVD network are shown in Exp 11 starts to configure DMAs with the BDs that are pre-stored in the
to Exp 13. Compared with the floating-point VGG16 model, external memory. The configured DMA loads data and instructions
floating-point VGG16-SVD only introduces 0.04% accuracy to the controller, triggers a computation process on PL. Each time
loss. However, when 16-bit dynamic-precision quantization a DMA interrupt is asserted, CPU host adds up the self-maintained
is adopted, the top-5 accuracy is down to 86.66%. With 8/4- pointer address for each DMA’s BD list and configures them with
bit dynamic-precision quantization, the top-5 accuracy fur- new BDs. This phase works until the last BD has been transferred.
ther drops to 86.30%. Result Output. After receiving the interrupt of the last BD from
The results show that dynamic-precision quantization is much DMA, the processor host applies Softmax function to the final re-
more favorable compared with static-precision quantization. With sults from PEs, and output the results to UART port.
dynamic-precision quantization, we can use much shorter repre-
sentations of operations while still achieving comparable accuracy.
For example, compared with 16-bit quantization, 8/4-bit quantiza- 6.2 PE Architecture
tion halves the storage space for intermediate data and reduce three- Figure 4 (b) shows the architecture of the PE and other modules
fourths memory footprint of CNN models. Besides, the utilization involved. A PE consists of five parts, including the Convolver Com-
of bandwidth can also be significantly increased. plex, the Adder Tree, the Non-Linearity module, the Max-Pooling
module, and the Bias Shift.
6. SYSTEM DESIGN • For Convolver Complex, we employ the classical line buffer
In this section, we introduce the design of our CNN accelerator. design [29] as shown in Figure 4 (c). When Input Data goes
First, the overall architecture is presented. After that, the designs of through the buffer in row-major layout, the line buffer releas-
major modules are introduced. Finally, the implementation details es a window selection function on the input image. Thus the
are presented. selected window followed by multipliers and an adder tree
will compute the convolution result, one data per cycle. S-
6.1 Overall Architecture ince the bottleneck of FC layers appears at the bandwidth,
In this work, we propose a CPU+FPGA heterogeneous archi- we use this module to compute matrix-vector multiplication
tecture to accelerate CNNs. Figure 4 (a) shows an overview of the for FC layers even the efficiency is not good. To realize this
proposed system architecture. The whole system can be divided function, we set the delay of each line of the line buffer the
into two parts: the Programmable Logic (PL) and the Processing same as the kernel size by using a MUX at the end of each
System (PS). line. In the proposed implementation, the kernel size is 3.
PL is the FPGA chip, on which we place the Computing Com- When Input Data goes through the buffer, we get a totally
plex, On-chip Buffers, Controller, and DMAs. The Computing Com- new vector every 9 cycles in the selected window and do a
plex consists of Processing Elements (PEs) which take charge of vector inner product. Thus a convolver can do a matrix mul-
tiplied by a vector of size 9.

30
n Delays

Processing System

MU
Ĺ

X
ĸ

‫ڮ‬
External Input ‫ڮڮ‬ ‫ڮڮ‬
CPU
Memory Data ķ
Bias Intermediate Data
Bias Shift
Ĺ
ĸ

MU
‫ڮ‬
Config. PE

X
Data & Inst. Bus ‫ڮڮ‬ ‫ڮڮ‬

Bus Data + ķ
C


+
Input
C
Output Data ݉Delays


+ NL Pool
Buffer Buffer shift


DMA Weights
+ 9 Data Inputs
Data buffer


FIFO C
Programmable Logic

+ Input
Input Output Convolver Adder Weight
9 Weight
+
Buffer Buffer Complex Tree
Inputs X X X + Output
X X X +

‫ڮ‬
Controller

‫ڮ‬
Data
Computing Complex X X X +
Controller
+
Multipliers
PE PE PE Adder Tree
Weight buffer

(a) (b) (c)

Figure 4: The design of our image classification system: (a) the overall architecture; (b) the processing element; (c) the convolver in
the processing element.
Input Buffer Input Buffer
In4 ࢔࢏࢔ In1 In2 In3 In4
In3 ࢀ࢏ Out4 ࢔࢕࢛࢚ ܶ‫݋‬
In2 Out3
In1 Out2
In4
ࢀ࢘ Out
Out1 ܶ݅ ܶ݅ A B
1
reuse_times PE1 PE2 PE1 PE2 ൌ
ࢀࢉ In1 In2 ൈ Out1 Out2
row ݊௜௡ ݊௢௨௧
C D

Output Buffer Output Buffer


Out1 Out2 Out3 Out4 Out1 Out2 Out3 Out4 ݊௢௨௧

col
Phase1 (Intermediate Results) Phase2 (Final Results)
(a) (b) (c)

Figure 5: Workload schedule for CONV layers and FC layers: (a) Tiling and reuse of feature maps in CONV layers; (b) two phases
in the execution of CONV layers; (c) workload schedule in FC layers.

• Adder Tree (AD) sums all the results from convolvers. It Phase 1 Phase nin / Ti

can add the intermediate data from Output Buffer or bias data Data In Data In
Data Out Data Out ‫ڮڮ‬ Data Out Data Out Data Out ‫ڮڮ‬ Data Out
from Input Buffer if needed. Weight In Weight In ‫ڮڮ‬ Weight In ‫ڮڮ‬ Weight In Weight In ‫ڮڮ‬ Weight In
• Non-Linearity (NL) module applies non-linear activation func- Weight Out Weight Out ‫ڮڮ‬ Weight Out Weight Out Weight Out ‫ڮڮ‬ Weight Out
tion to the input data stream. Result In Result In
• Max-Pooling module utilizes the line buffers to apply the Result Out Result Out

specific 2 × 2 window to the input data stream, and outputs


the maximum among them. Figure 6: Timing graph. There are totally nin /T i phases to
• Bias Shift module and Data Shift module are designed to generate the reuse_times × P E_num tiles in the output layer.
support dynamic quantization. Input bias will be shifted by In each phase, the next group of data is loaded and the data pre-
Bias Shift according to the layer’s quantization result. For loaded in the last phase are output and reused for reuse_times
a 16-bit implementation, the bias is extended to 32-bit to be times. Meanwhile, accompanied weights are loaded and output
added with convolution result. The output data will be shifted for reuse_times times with no reuse. The output buffer works
by Data Shift and cut back to the original width. on collecting data in the entire phase, while outputting inter-
The size of convolutional kernel usually has only several options mediate data and final data to PEs or the external memory.
such as 3×3, 5×5, and 7×7. All the convolutional kernels in the
VGG16 model are in 3×3 dimension, and thus in the Convolver
Complex, the 2D convolvers are designed for convolution operation of each input tiled block (vector) to be reused is reuse_times. We
only over a 3×3 window. show how this workload schedule mechanism applies to CONV
layers in Figure 5 (a) (b) and FC layers in Figure 5 (c).
6.3 Implementation Details
6.3.2 Controller System
6.3.1 Workloads Schedule In each computation phase, the Controller decodes a 16-bit in-
Parallelism. Chakradhar et al. pointed out that there are mainly struction to generate control signals for on-chip buffers and PEs.
three types of parallelism in CNN workloads: operator-level (fine- One instruction is composed with the following signals.
grained) parallelism, intra-output parallelism (multiple input fea- • Pool Bypass and NL Bypass are used to bypass the Pool and
tures are combined to create a single output), and inter-output paral- NL module if needed.
lelism (multiple independent features are computed simultaneous- • Zero Switch is used to select either zero or bias data into
ly) [11]. In our implementation, all the three types of parallelism added to the result of adder tree, since usually more than
are considered. The operator-level parallelism is realized with 2D one phase is needed to calculate the final result and the bias
convolvers. The intra-output parallelism is realized with multiple should be added only once.
convolvers working simultaneously in each PE. The inter-output • Result Shift and Bias Shift describe the number of bits and
parallelism is realized by placing multiple PEs. direction for data shifting, for dynamic data quantization.
Tiling and Reuse. Due to limited on-chip memory, tiling is • Write En is used to switch the data from the Output Buffer
necessary for CNNs. For tiling in CONV layers, we tile each input either to the external memory or to the PEs to be reused.
image by the factor T r (T c) in row (column). And we tile the input
(output) feature maps nin (nout ) by the factor T i (T o). For FC • PE En offers us the flexibility to set several PEs as idle if
layers, we tile each matrix into tiles of T i×T o. For reuse, the times needed. This can help save energy when computation capac-
ity meet the demand.

31
݀ܽ‫݉ݑ̴݊ݐݎ݋݌̴݊݅ܽݐ‬ ‫݉ݑ̴݊ݐݎ݋݌̴݊݅ݐ݄݃݅݁ݓ‬ ݀ܽ‫݉ݑ̴݊ݐݎ݋݌̴ݐݑ݋ܽݐ‬
‫ڮڮ‬
Addr. Data Addr. Data
In4 In8

‫̴ܹܦ‬ ‫̴ܹܦ‬ In3 In7 1 In1(1,1) 1 Out1(1,1)


‫̴ܹܦ‬ ‫ڮڮ‬ ‫ڮڮ‬ In2 In6
Out4 Out8
Out3 Out7

In1 In5 Out2 Out6


2 In1(1,2) 2 Out1(1,2)
ࢀ࢘ ࢀࢉ Out1 Out5 ‫ڮ‬ ‫ڮڮ‬ 3 Out5(1,1)
Data Weight Data ࢀࢉ ࢀ࢘
ࢀࢉ 16 In1(4,4) 4 Out5(1,2)
Buffer Buffer Buffer ࢀࢉ
ࢀࢉ
17 In2(1,1) 5 Out1(2,1)
Input Buffer Output Buffer ‫ڮ‬ ‫ڮڮ‬ 6 Out1(2,2)
32 In2(4,4) 7 Out5(2,1)
‫ڮ‬ ‫ڮڮ‬ 8 Out5(2,2)
Figure 7: Buffer structure. Image data and weights are s- 65 In5(1,1) ‫ڮڮ‬
tored separately inside Input Buffer, and bias are stored in Data In1
(1,1)
In1
(1,2)
In1
(1,3)
In1
(1,4)
‫ڮ‬ 17 Out2(1,1)
Buffer. The total bandwidth of each buffer is defined by corre- Out1
(1,1)
Out1
(1,2)
80 In5(4,4) 18 Out2(1,2)
‫ڮ‬ ‫ڮڮ‬ 19 Out6(1,1)
sponding port numbers multiplied by data width (D_W ). Out1 Out1
(2,1) (2,2)
113 In8(1,1) ‫ڮڮ‬
‫ڮ‬ ‫ڮڮ‬
• Phase Type helps the Controller to distinguish these phases 128 In8(4,4)
24 Out6(2,2)
‫ڮڮ‬
and send out the corresponding signals.helps the Controller
to distinguish these phases and send out the corresponding
signals. Several phases need to be specifically taken care of. Figure 8: Storage pattern for one CONV layer with max-
For example, for the last phase in the last layer an the last pooling when the parameter group <T i, T o, reuse_times,
output image, no more weights or data should be loaded in, P E_num> is set to <2, 4, 2, 2>.
and the input buffers should be configured differently com-
pared to previous phases. Input Buffer 64h9 Input Buffer 64h9
• Pic Num and Tile Size/Layer Type help the Controller to …
configure the Input Buffer and Output Buffer.
… … …
A compiler is developed on Matlab to automatically generate

100

100
instructions. The compiler takes the fixed-point CNN model as 1 2 … 64 1 2 … 64

the input and generates instructions as output. Table 4 shows the


generated the instructions with the example in Figure 5 (a).
• Instruction 1 commands Input Buffer to load all the needed
data, which is distinguished by the Phase Type signal. PE En Convolver Convolver
(a) (b)
enables two PEs working in parallel. As T i = 2, Pic Num
is set as 2. Tile Size is set as the defined T r. Layer Type Figure 9: Data arrangement in external memory: (a) Linear
defines the layer type as CONV layer. All the other signals arrangement; (b) DMA-oriented arrangement.
are useless in this phase.
• Instruction 2 starts calculating the four tiled blocks in the • dataout_port_num. The maximum amount of results that
output layer. Since they are all intermediate results, Pool and can be transferred by DMA each cycle.
NL modules are bypassed. Bias will be added in this phase In CONV layers, the total amount of weights needed in each
only once. And Bias Shift specifies the shift configuration phase is far less than that of image data, while in FC layers, the
for bias data. Output Buffer will only collect the intermediate amount of weights is far more than the amount of data in input
data and not write to anywhere. vectors. Therefore, we save the weights of FC layers in data buffer
• In instruction 3, Write En is set as "PE" to command Output whose capability is larger than weight buffer, and save the input
Buffer to send the intermediate results back to the PEs. Bias data vector in the weight buffer.
is no longer added, and thus Zero Switch is set to zero. Since
all the data generated in this phase is the final results, Pool 7.2 Data Arrangement for CONV layers
and NL Bypass are disabled to let data from AD enter these In order to reduce the unnecessary access latency of external
two modules in sequence. memory, we optimize the storage pattern of data in the memory
• In the last instruction, supposing this CONV layer is the last space. The principle is to maximize the burst length of each DMA
layer, then no module is working in PE. Write EN is set as transaction. Figure 8 shows a brief example of how we organize
"DDR" to command the Output Buffer to write results back the input and output data in one CONV layer with max-pooling.
to the external memory. Result Shift is set to shift the results We store the tiles which are at the same relative locations in each
data as we want. This phase is distinguished by Controller picture continuously. Therefore, in each phase, we can load all the
by setting Phase Type as last. input tiles for computation continuously. The output feature maps
7. MEMORY SYSTEM will be the input feature maps of the next layer, therefore, the same
storage pattern applies as well.
In this section, we introduce the memory system design which There is a slight difference between CONV layers with Pooling
aims to feed the PEs with data efficiently. First the designs of and other layers. After a 2 × 2 pooling, the result is only a quar-
buffers are introduced. After that, the data arrangement mecha- ter of a tile. In Figure 8, Out(2, 1), instead of Out(1,2), will be
nisms for CONV and FC layers are presented. calculated after Out(1,1). This means adjacent result tiles are not
stored continuously in external memory. If we write each result tile
7.1 Buffer Design as soon as it is generated, the burst length will be only T r/2. This
As shown in Figure 4 (a), there are two on-chip buffers on the will significantly degrade the utilization of the external memory.
PL side, the Input Buffer and the Output Buffer. The Input Buffer To solve this problem, we increase the memory budget on chip. We
stores the bias, image data, and weights. The Output Buffer saves buffer Out(1,1) to Out(4,1) before generating Out(1,2), then write
the results generated from PE and offers intermediate results to the Out(1,1) and Out(1,2) together. This strategy increases the burst
PEs at proper time. For simplicity of illustration, we define three length to T r × T c/2.
parameters as shown in Figure 7
• datain_port_num. The maximum amount of data that can 7.3 Data Arrangement for FC Layers
be transferred by DMA each cycle.
The speed of computing FC layers is mainly restricted by the
• weightin_port_num. The maximum amount of weights bandwidth. In this manner, using specific hardware to accelerate
that can be transferred by DMA each cycle.

32
Table 4: Instructions for One CONV layer generated by the compiler.
Index Pool Bypass NL Bypass Zero Switch Result Shift Bias Shift Write En PE En Phase Type Pic Num Tile Size Layer Type
1 X X X X X No 2 First 2 Tr CONV
2 Yes Yes Bias X BS No 2 Cal 2 Tr CONV
3 No No Zero X X PE 2 Cal 2 Tr CONV
4 X X X RS X DDR 2 Last 2 Tr CONV

Table 5: Parameter configuration and resource utilization. The CPU platform is Intel Xeon E5-2690 [email protected]. The
Param. tile_size convolver_num PE_num reuse_times GPU platform is Nvidia K40 GPU (2880 CUDA cores with 12GB
Config. 28 64 2 16 GDDR5 384-bit memory), and the mGPU platform is the Nvidia
Param. datain_port_num weightin_port_num dataout_port_num TK1 Mobile GPU development kit (192 CUDA cores). For exper-
Config. 8 4 2 iments on CPU, GPU, and mGPU, the operating system is Ubuntu
Resource FF LUT DSP BRAM 14.04 and the deep learning software framework is Caffe [28].
Utilization 127653 182616 780 486
Percent(%) 29.2 83.5 89.2 86.7 8.1 Theoretical Estimation
For CONV layer, the number of phases needed in one CON-
V layer when tiling is adopted can be calculated by the following
formula:
nin nout row 2
CON V
Nphase =⌈ ⌉×⌈ ⌉×⌈ ⌉ ,
Ti To Tr
where T o = reuse_times × P E_num and T c = T r. The time
of computation and loading data in each phase are:

compute_data ≈ T r × reuse_times,
tCON V 2

and
Figure 10: Testing platform. We use Xilinx Zynq ZC706 for T r2 × T i
on-board testing. A power meter is used for power analysis. tCON V
load_data = .
datain_port_num
FC layers is not effective. Considering this, the proposed system CONV layers are usually computation-intensive. Consequently,
uses the Convolver Complex in one of the PEs to do the computa- in order to keep ping-pong mechanism working, typically tload is
tion for FC layers. In this case, we need to fully utilize the band- smaller than tcompute , and thus there should be:
width of the external memory with the current PL structure.
Ti
In our system, we assign a buffer of length 900, the same as T r× datain_port_numCON V ≥ .
T r to each of the 64 Compute Complex in one PE. The buffers are reuse_times
filled one by one when computing CONV layers. To reduce extra In each phase, data will be reused for reuse_times times, each
data routing logic for filling buffers while keep a long burst length accompanied with a new group of weights (9 weights for each ker-
when fetching data for computing FC layers, we arrange the weight nel as for the model we use). Therefore, for weights, we have:
matrix in the external memory. We first divide the whole matrix
with blocks of 64×9 columns and 100 rows such that one block compute_weight ≈ T r
tCON V 2
can be processed in a phase. In each block, the data is arranged as
shown in Figure 9 (b). Without data arrangement for FC layers, as 9 × T i × P E_num
tCON V
load_weight =
shown in Figure 9 (a), we need 64×100 DMA transactions to load weightin_port_num
one block while the burst length is just 9. By arranging the data 9 × T i × P E_num
following Figure 9 (b), we need just one DMA transaction to load weightin_port_numCON V ≥ .
T r2
the whole block and the long burst length ensures a high utilization
of the bandwidth of external memory. According to the workloads schedule shown in Figure 6, the con-
straint for dataout_port_num is:

8. SYSTEM EVALUATION dataout_port_numCON V ≥ P E _num.


In this section, the performance of the implemented system is In order to minimize the bandwidth consumption, we consider
evaluated. First, we analyze the performance of our system archi- to choose weightin_port_num and datain_port_num as few-
tecture under given design constraints. After that, the performance er as possible. Under the above constraints, we can estimate the
of the proposed system is presented and compared with other plat- computation time for one CONV layer:
forms. Finally we compare our system with previous FPGA-based
CNN accelerators. tCON V = NP hase × tcompute
We use 16-bit dynamic-precision quantization and Xilinx Zynq nin nout row 2
=⌈ ⌉×⌈ ⌉×⌈ ⌉ × T r 2 × reuse_times.
ZC706 for the implementation. Xilinx Zynq platform consists of a Ti To Tr
Xilinx Kintex-7 FPGA, dual ARM Cortex-A9 Processor, and 1 GB
Considering that T i = convolver_num, T r = tile_size and
DDR3 memory. It offers a bandwidth of up to 4.2GB/s. All the
T o = reuse_times × P E_num in CONV layers, we further get:
synthesis results are obtained from Xilinx Vivado 2014.4. We first
synthesize each module in Vivado to figure out the resource utiliza- nin nout
tCON V = ⌈ ⌉×⌈ ⌉
tion. Then we choose the optimal parameter group to maximize convolver_num reuse_times × P E_num
the throughput with the resource and bandwidth constraints. The row
×⌈ ⌉2 × tile_size2 × resue_times.
parameters and resource utilization are shown in Table 5. We can tile_size
see that our parameter configuration helps to maximize the resource nin × nout × row2
utilization. Figure 10 shows the hardware platform. ≈
convolver_num × P E_num

33
Table 6: Performance of different platforms with VGG16-SVD network.
Platform Embedded FPGA CPU GPU mGPU CPU GPU mGPU
Layer Theoretical Computation Real Computation Total Operations Real Performance Real Computation Real Performance
(Group) Time (ms) Time (ms) (GOP) (GOP/s) Time (ms) (GOP/s)
CONV1 21.41 31.29 3.87 123.76 83.43 2.45 59.45 46.42 1578.8 65.15
CONV2 16.06 23.58 5.55 235.29 68.99 3.31 79.73 80.44 1675.5 69.60
CONV3 26.76 39.29 9.25 235.38 76.08 4.25 89.35 151.57 2177.1 103.51
CONV4 26.76 36.30 9.25 254.81 62.53 3.31 107.49 147.91 2791.6 86.04
CONV5 32.11 32.95 2.31 70.16 12.36 2.30 63.75 186.99 1003.5 36.27
CONV Total 123.10 163.42 30.69 187.80 312.36 15.45 399.77 98.26 1986.0 76.77
FC6-1 10.45 20.17 0.025 1.24 1.69 0.445 29.35 14.87 56.404 0.86
FC6-2 1.71 3.75 0.0041 1.09 0.26 0.031 5.26 15.65 132.26 0.78
FC7 13.98 30.02 0.034 1.12 1.86 0.19 14.74 18.04 177.78 2.28
FC8 3.413 7.244 0.0082 1.13 0.46 0.96 4.58 17.75 8.56 1.79
FC Total 29.55 61.18 0.073 1.20 4.28 1.79 53.93 17.17 40.98 1.36
Total 152.65 224.60 30.76 136.97 316.64 17.25 453.70 97.16 1783.9 67.81

For FC layers, the number of phases and the time of different


tasks can be estimated with the following equations: Table 7: Comparison with other FPGA accelerators.
nin nout [11] [30] [6] Ours
FC
Nphase =⌈ ⌉×⌈ ⌉ Year 2010 2014 2015 2015
Ti T o × P E_num Virtex5 Zynq Virtex7 Zynq
P E_num × T i × T o Platform
SX240t XC7Z045 VX485t XC7Z045
tF C
load_data = Clock(MHz) 120 150 100 150
datain_port_num
Bandwidth (GB/s) – 4.2 12.8 4.2
FC Ti Quantization
tload_weight = 48-bit fixed 16-bit fixed 32-bit float 16-bit fixed
weightin_port_num Strategy
Power (W) 14 8 18.61 9.63
Ti × To
tF C FC
compute_data = tcompute_weight = . Problem
0.52 0.552 1.33 30.76
convolver_num Complexity (GOP)
Performance 187.80 (CONV)
Typically, for FC layers, tF C FC 16 23.18 61.62
compute is much smaller than tload , (GOP/s) 136.97 (Overall)
and thus the total cycles needed by one FC layer can be estimated Resource −3
−4 −4 3.58×10 (CONV)
as: Efficiency 4.30×10 – 8.12×10
2.61×10−3 (Overall)
(GOP/s/Slices)
tF C = NP hase × tload Power Efficiency 19.50 (CONV)
1.14 2.90 3.31
nin nout P E_num × T i × T o (GOP/s/W) 14.22 (Overall)
=⌈ ⌉×⌈ ⌉×
Ti T o × P E_num datain_port_num
nin × nout Table 8: Projected frame rates on Zynq/VC707 board using 16-
≈ .
datain_port_num bit and 8/4-bit quantization with VGG16-SVD network.
In summary, under the given constraints, the runtime of a CONV Total Resources 16-bit Quantization 8-bit Quantization
Platform
layer and an FC layer can be estimated through Equation 13 and LUT FF Bandwidth # of PE FPS # of PE FPS
Equation 12: Zynq 218600 437200 4.2GBps 2 4.45 4 8.9
nin × nout VC707 303600 607200 4.2GBps 3 5.88 6 11.76
tF C = , (12)
datain_port_num
the number of operations needed by FC layers is only 0.0024× of
nin · nout · row2 CONV layers, the runtime of FC layers is 0.374× of CONV lay-
tCON V = . (13)
convolver_num2 × P E_num er. The mGPU platform suffers from the same problem due to the
As shown in Equation 13, CONV layers are bounded both by limited bandwidth.
bandwidth and computation resources. For FC layers, as shown in Compared with theoretical estimation, there is around 47% per-
Equation 12, it is bandwidth-bounded only. Consequently, higher formance degradation for on-board test, as shown in the 2nd colum-
bandwidth can help reduce the runtime of FC layers. n and the 3rd column of Table 6. One possible reason is the DDR
access latency. The other possible reason is that different DMAs are
8.2 Performance Analysis working asynchronously, since different DMA transactions may af-
Though the performance of FC layers on FPGA is limited by the fect each other and reduce the total efficiency of bandwidth usage.
bandwidth, it is still higher than ARM processors. Consequently, in
our implementation, the FC layer workloads are placed on FPGA. 8.3 Design Comparison
The performance of our system, CPU, GPU, and mGPU is shown As shown in Table 7, we compare our CNN accelerator with pre-
in Table 6. The VGG16-SVD network needs 30.764 GOPs includ- vious work. In [30], the design was verified on 3 models, including
ing multiplications, adds, and non-linear functions. Our system a single CONV layer, a model consisting of 2 CONV layers for face
achieves an average performance of 187.80 GOP/s for CONV lay- recognition, and a model for street parsing without structure detail-
ers and 136.97 GOP/s for the whole network. The frame rate of our s. Since the first model lacks generality and structure details of
system is 4.45 fps, which is 1.4× and 2.0× faster than the CPU and the third model were not provided, results of a 2-layer CNN model
mGPU platform (the power of CPU and mGPU are 135W and 9W is adopted for comparison. We also transform the unit GF LOP
respectively). The overall performance of GPU is 13.0× higher in [6] into GOP for comparison.
than our implementation, but it consumes 26.0× more power com- In summary, our accelerator achieved the highest performance,
pared with embedded FPGA (250W versus 9.63W). resource efficiency, and power efficiency compared with previous
The performance of our system on FC layers is much lower designs. It should be noted, all the performance results of previous
than that of CONV layers even though data arrangement method designs were obtained from CONV layers only. If we only consider
is adopted due to the limited bandwidth. Consequently, though CONV layers, the average performance of our system is 187.80

34
GOP/s, which is several times higher than previous designs. The [9] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Teman, X. Feng,
performance of our system with the full VGG16-SVD network is X. Zhou, and Y. Chen, “Pudiannao: A polyvalent machine learning
136.97 GOP/s. accelerator,” in ASPLOS. ACM, 2015, pp. 369–381.
[10] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng,
8.4 Discussion Y. Chen, and O. Temam, “Shidiannao: shifting vision processing
At present, our implementation uses 16-bit fixed-point numbers closer to the sensor,” in ISCA. ACM, 2015, pp. 92–104.
and Zynq board. The projected results with different quantization [11] S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi, “A
strategies and platforms are shown in Table 8. Theoretically, when dynamically configurable coprocessor for convolutional neural
using the 8-bit quantization, 2× PEs can be placed on the FPGA networks,” in ISCA, vol. 38, no. 3. ACM, 2010, pp. 247–257.
and thus the performance on CONV layers doubles. Besides, 2× [12] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and
weights can be loaded to the system with the same bandwidth com- Y. LeCun, “Neuflow: A runtime reconfigurable dataflow processor
for vision,” in CVPRW. IEEE, 2011, pp. 109–116.
pared with the 16-bit quantization, and thus the performance on FC
layers also doubles. Further more, when deploying to the VC707 [13] C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun, “Cnp: An fpga-based
processor for convolutional networks,” in FPL. IEEE, 2009, pp.
board, one more PE can be placed, and thus the processing capa- 32–37.
bility on CONV layers is expected to be 1.5× higher than that of
[14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based
Zynq platform. For VGG16-SVD network on VC707 with 8-bit learning applied to document recognition,” Proceedings of the IEEE,
dynamic-precision quantization, it is expected to achieve a frame vol. 86, no. 11, pp. 2278–2324, 1998.
rate at 11.76 fps. [15] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “Cnn
features off-the-shelf: an astounding baseline for recognition,” in
9. CONCLUSION CVPRW. IEEE, 2014, pp. 512–519.
The limited bandwidth is one of the bottlenecks of accelerat- [16] K. Simonyan and A. Zisserman, “Very deep convolutional networks
ing deep CNN models on embedded systems. In this paper, we for large-scale image recognition,” arXiv preprint arXiv:1409.1556,
make an in-depth investigation of the memory footprint and band- 2014.
width problem in order to accelerate state-of-the-art CNN model- [17] S. J. Hanson and L. Y. Pratt, “Comparing biases for minimal network
s for Image-Net classification on the embedded FPGA platform. construction with back-propagation,” in NIPS, 1989, pp. 177–185.
We show that CONV layers are computation-centric and FC layers [18] Y. LeCun, J. S. Denker, S. A. Solla, R. E. Howard, and L. D. Jackel,
are memory-centric. A dynamic-precision data quantization flow “Optimal brain damage,” in NIPS, vol. 89, 1989.
is proposed to reduce memory footprint and bandwidth require- [19] B. Hassibi and D. G. Stork, Second order derivatives for network
ments while maintaining comparable accuracy. Convolver that can pruning: Optimal brain surgeon. Morgan Kaufmann, 1993.
be used for both CONV layers and FC layers is designed to save [20] S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both weights and
the resource. A data arrangement scheme for FC layers is also pro- connections for efficient neural networks,” arXiv preprint
posed to ensure high bandwidth utilization. Our implementation arXiv:1506.02626, 2015.
on Xilinx Zynq with very deep VGG16-SVD model for Image-Net [21] G. H. Golub and C. F. Van Loan, “Matrix computations. 1996,” Johns
classification achieves a frame rate at 4.45 fps with 86.66% top-5 Hopkins University, Press, Baltimore, MD, USA, pp. 374–426, 1996.
accuracy with 16-bit quantization. The average performances of [22] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus,
“Exploiting linear structure within convolutional networks for
CONV layers and the full CNN are 187.8 GOP/s and 137.0 GOP/s
efficient evaluation,” in NIPS, 2014, pp. 1269–1277.
under 150MHz working frequency.
[23] X. Zhang, J. Zou, X. Ming, K. He, and J. Sun, “Efficient and accurate
approximations of nonlinear convolutional networks,” arXiv preprint
10. REFERENCES arXiv:1411.4229, 2014.
[1] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, [24] M. Jaderberg, A. Vedaldi, and A. Zisserman, “Speeding up
Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and convolutional neural networks with low rank expansions,” arXiv
L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” preprint arXiv:1405.3866, 2014.
pp. 211–252, 2015. [25] C. Farabet, Y. LeCun, K. Kavukcuoglu, E. Culurciello, B. Martini,
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet P. Akselrod, and S. Talay, “Large-scale fpga-based convolutional
classification with deep convolutional neural networks,” in NIPS, networks,” Machine Learning on Very Large Data Sets, vol. 1, 2011.
2012, pp. 1097–1105. [26] M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar,
[3] M. D. Zeiler and R. Fergus, “Visualizing and understanding I. Durdanovic, E. Cosatto, and H. Graf, “A massively parallel
convolutional networks,” in ECCV, 2014, pp. 818–833. coprocessor for convolutional neural networks,” in ASAP, July 2009,
[4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, pp. 53–60.
D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with [27] D. Larkin, A. Kinane, and N. O’Connor, “Towards hardware
convolutions,” arXiv preprint arXiv:1409.4842, 2014. acceleration of neuroevolution for multimedia processing
[5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for applications on mobile devices,” in Neural Information Processing.
image recognition,” arXiv preprint arXiv:1512.03385, 2015. Springer, 2006, pp. 1178–1188.
[6] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing [28] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,
fpga-based accelerator design for deep convolutional neural S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for
networks,” in Proceedings of ISFPGA. ACM, 2015, pp. 161–170. fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.
[7] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, [29] B. Bosi, G. Bois, and Y. Savaria, “Reconfigurable pipelined 2-d
“Diannao: A small-footprint high-throughput accelerator for convolvers for fast digital signal processing,” VLSI, vol. 7, no. 3, pp.
ubiquitous machine-learning,” in ASPLOS, vol. 49, no. 4. ACM, 299–308, 1999.
2014, pp. 269–284. [30] V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello, “A 240
[8] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, g-ops/s mobile coprocessor for deep neural networks,” in CVPRW.
Z. Xu, N. Sun et al., “Dadiannao: A machine-learning IEEE, 2014, pp. 696–701.
supercomputer,” in MICRO. IEEE, 2014, pp. 609–622.

35

View publication stats

You might also like