An FPGA-Based Reconfigurable CNN Accelerator For YOLO
Shiguang Zhang1, Jian Cao1*, Quan Zhang1, Qi Zhang1, Ying Zhang1, Yuan Wang2
1. School of Software and Microelectronics, Peking University
2. School of Electronics Engineering and Computer Science, Peking University
Beijing, China
*Corresponding e-mail: [email protected]
Abstract—Convolutional neural networks (CNNs) are widely used in image processing. CNN-based object detection models such as YOLO and SSD have proved to be state-of-the-art in many applications. CNNs place extremely high demands on computing power and memory bandwidth, so they usually need to be deployed on dedicated hardware platforms. The FPGA offers great advantages in reconfigurability and performance-per-watt, which makes it a suitable choice for deploying CNNs. In this paper, we propose a reconfigurable CNN accelerator with an AXI bus based on an ARM + FPGA architecture. The accelerator receives configuration signals sent by the ARM and performs the inference computation of different CNN layers in a time-shared manner. By combining the convolution and pooling operations, the data movement between the convolutional and pooling layers is reduced, which lowers the number of off-chip memory accesses. Floating-point numbers are converted into a 16-bit dynamic fixed-point format, which improves computation performance. We implemented the proposed architecture on the Xilinx ZCU102 FPGA for the YOLOv2 and YOLOv2-Tiny models on COCO and VOC 2007 respectively, with a peak performance of 289 GOPs at a 300 MHz clock frequency.

Keywords-YOLO; FPGA; reconfigurable; dynamic fixed-point

I. INTRODUCTION

With the development of convolutional neural network (CNN) research, new network structures are continually being proposed. From LeNet-5 and AlexNet to R-CNN, YOLO and SSD [1]-[5], the functionality of CNNs has evolved from simple classification to object detection, and CNNs are now widely applied in computer vision. With this growth in functionality, the depth and scale of CNNs are also increasing rapidly, and deploying CNNs efficiently on hardware has become a very active research topic.

Compared with the central processing unit (CPU) and the graphics processing unit (GPU), the FPGA offers high energy efficiency, low power consumption and hardware programmability [6]. Compared with an ASIC, the FPGA offers a short development cycle, high flexibility and low development cost. The FPGA has therefore become a very suitable hardware platform for CNN deployment. CNNs are both compute- and memory-intensive, which poses many challenges for FPGA and embedded deployment. The time cost of the large number of arithmetic operations and the time cost caused by data movement and memory accesses must all be considered carefully in a hardware deployment. To address these problems, a large body of work has studied model compression (including pruning and quantization) [7][8], efficient model design [9][10] and efficient hardware architecture design [11]-[15].

In this paper, we propose a reconfigurable CNN accelerator that can adapt to different networks through configuration signals. The main contributions of this paper are summarized as follows:
• We design a reconfigurable CNN accelerator based on an ARM + FPGA architecture using High-Level Synthesis (HLS), which fully pipelines every stage.
• Based on block-based computation, we propose a method for combining convolutional (Conv) and pooling layers.
• A careful dynamic fixed-point strategy is used to transform the parameters from floating point into 16-bit fixed point with multiple scaling factors.
• A line buffer is designed so that the data encoding and decoding time overlaps the off-chip memory access time.

The rest of this article is organized as follows. Section II introduces the research background and discusses related work. Section III describes the reconfigurable CNN accelerator architecture and its internal details. Section IV introduces the CNN accelerator system design based on the ARM + FPGA architecture. Section V reports the experimental results of implementing the accelerator on the ZCU102 platform. Section VI concludes the article.

II. BACKGROUND

CNN-based object detection systems can be divided into two types. Two-stage models include R-CNN, Fast R-CNN and Faster R-CNN [3][17][18]; one-stage models include YOLO and SSD. A one-stage model does not need a region-proposal stage and directly generates the category probabilities and position coordinates of objects, so it is faster. A two-stage model first extracts region proposals through a Region Proposal Network (RPN) and then classifies them, so it is slower than a one-stage model but has higher detection accuracy.
As research progresses, the relatively low detection accuracy of one-stage models keeps improving, which makes them suitable for deployment on embedded devices. Among the one-stage models, SSD does not support multi-scale training and its generalization ability is worse than that of YOLOv2 and YOLOv3 [19]; YOLOv3 has a more complex network structure and is slower than YOLOv2. Therefore, YOLOv2 and YOLOv2-Tiny [20] are selected for deployment on the FPGA platform.

TABLE I. YOLOV2-TINY NETWORK STRUCTURE

Layer   Filters   Size / Stride / Padding   Input      Output
Conv1   16        3×3 / 1 / 1               416×416    416×416
Pool1   -         2×2 / 2 / 0               416×416    208×208
Conv2   32        3×3 / 1 / 1               208×208    208×208
Pool2   -         2×2 / 2 / 0               208×208    104×104
Conv3   64        3×3 / 1 / 1               104×104    104×104
Pool3   -         2×2 / 2 / 0               104×104    52×52
Conv4   128       3×3 / 1 / 1               52×52      52×52
Pool4   -         2×2 / 2 / 0               52×52      26×26
Conv5   256       3×3 / 1 / 1               26×26      26×26
Pool5   -         2×2 / 2 / 0               26×26      13×13
Conv6   512       3×3 / 1 / 1               13×13      13×13
Pool6   -         2×2 / 1 / 0               13×13      13×13
Conv7   1024      3×3 / 1 / 1               13×13      13×13
Conv8   512       3×3 / 1 / 1               13×13      13×13
Conv9   425       1×1 / 1 / 0               13×13      13×13

A. YOLOv2 and YOLOv2-Tiny

YOLOv2 was proposed by Joseph Redmon et al. [4]. It removes the fully connected layers of YOLO and uses the anchor boxes of Faster R-CNN to predict bounding boxes [17]. YOLOv2 is based on Darknet-19 (19 Conv layers and 5 max-pooling layers), adds three 3×3 Conv layers and a 1×1 Conv layer, and introduces a route (passthrough) layer similar to the shortcut connections of ResNet [20]. The input image is divided into S×S cells; each cell predicts 5 bounding boxes, and each bounding box predicts (C + 5) numbers: the probabilities of the C classes, the position and size of the bounding box, and the probability that the box contains a real object. The network output size is therefore S×S×5×(5+C).

YOLOv2-Tiny differs from YOLOv2 only in that the network is more streamlined; the pre-processing and post-processing are the same. For reasons of space, we only show the network structure of YOLOv2-Tiny in Table I. YOLOv2-Tiny has 9 Conv layers and 6 max-pooling layers, and each of the first six Conv layers is followed by a max-pooling layer.

B. Basics of CNN

1) Conv layer: In a Conv layer, multiple convolution kernels extract higher-level features from the input feature maps. The number of convolution kernels determines the number of output feature maps. The kernel size K×K is generally 1×1 or 3×3. To keep the sizes of the input and output feature maps consistent, a padding operation is often performed around the input feature maps, with padding (K-1)/2. The Conv layer can be expressed as

y_{f,h,w} = \sum_{c=0}^{C-1} \sum_{j=0}^{K-1} \sum_{i=0}^{K-1} w_{f,c,j,i} \, x_{c,\, h \times S + j,\, w \times S + i} + b_f ,    (1)

where y_{f,h,w} is the value at row h and column w of the f-th output feature map, C is the number of input feature maps, K is the convolution kernel size, and S is the stride. The convolution result usually needs to be activated; the activation function of YOLOv2 and YOLOv2-Tiny is Leaky ReLU.
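For reference, the following C sketch is a direct (software) implementation of (1) with the Leaky ReLU activation applied afterwards. It assumes the input has already been zero-padded and stored in row-major order; the array layout and the 0.1 slope are illustrative assumptions rather than details fixed by the accelerator.

/* Reference Conv layer following (1): x is C x H_in x W_in (already padded),
 * w is F x C x K x K, b has F entries, y is F x H_out x W_out. */
static float leaky_relu(float v) { return v > 0.0f ? v : 0.1f * v; }

void conv_layer(const float *x, const float *w, const float *b, float *y,
                int C, int F, int K, int S,
                int H_in, int W_in, int H_out, int W_out)
{
    for (int f = 0; f < F; f++)                    /* output feature maps */
        for (int h = 0; h < H_out; h++)            /* output rows         */
            for (int wo = 0; wo < W_out; wo++) {   /* output columns      */
                float acc = b[f];
                for (int c = 0; c < C; c++)        /* input feature maps  */
                    for (int j = 0; j < K; j++)
                        for (int i = 0; i < K; i++)
                            acc += w[((f * C + c) * K + j) * K + i] *
                                   x[(c * H_in + h * S + j) * W_in + wo * S + i];
                y[(f * H_out + h) * W_out + wo] = leaky_relu(acc);
            }
}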
2) Pooling layer: Max pooling is used to down-sample the feature maps, which improves the robustness of feature extraction. A K×K sampling window is applied and the maximum value inside the window is kept. The sampling window is generally 2×2, and the stride is generally 1 or 2.

3) Batch Normalization (BN) layer: The BN layer significantly reduces the training time of deep CNNs [13] and counteracts the shift of the data distribution in the intermediate layers during training. For a mini-batch B = {x_1, x_2, x_3, ..., x_m}, the mean \mu_B and variance \sigma_B^2 of all input samples are computed. To recover the feature distribution of the original network, the learnable parameters \gamma and \beta are introduced, and the output is

y_b = \gamma \, (x_i - \mu_B) / \sqrt{\sigma_B^2 + \varepsilon} + \beta ,    (2)

where \varepsilon is a constant added to the mini-batch variance for numerical stability [22]. The BN layer is located in front of the activation layer. When deployed in hardware, the BN layer can generally be fused into the weights or bias [13].
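As an illustration of this fusion, the sketch below folds the BN parameters of one output channel into the preceding Conv layer; the per-channel argument layout is only an assumption made for the example.

#include <math.h>

/* Fold BN (2) into the Conv layer for one output channel f:
 *   gamma * (conv(x) + b - mu) / sqrt(var + eps) + beta  ==  conv'(x) + b'
 * with w' = w * s and b' = (b - mu) * s + beta, where s = gamma / sqrt(var + eps). */
void fold_bn_channel(float *w_f, float *b_f, int weights_per_filter,
                     float gamma, float beta, float mu, float var, float eps)
{
    float s = gamma / sqrtf(var + eps);
    for (int i = 0; i < weights_per_filter; i++)
        w_f[i] *= s;                       /* scale the C*K*K weights of filter f        */
    *b_f = (*b_f - mu) * s + beta;         /* fold the mean shift and beta into the bias */
}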
C. Related Work

In recent years, much progress has been made in deploying CNN-based object detection models on hardware. Through model compression, SqueezeNet keeps AlexNet-level accuracy with a model of less than 0.5 MB [10]. MobileNet adopts depthwise separable convolutions, and ShiftNet replaces spatial convolution with a simple shift operation [9][23]; these efficient models reduce both the computation and the model size. The works in [11]-[16] explore the parallelism space of convolution; in particular, [11] proposes dynamically combining intra-convolution, intra-output and inter-output parallelism, which greatly improves computing performance. In this paper, we focus on the following three points: (1) exploring the hardware architecture and the parallelism space; (2) quantizing the model while maintaining accuracy; and (3) reducing the data bandwidth, i.e., reducing the bit widths of the weights and intermediate data so as to lower the number of off-chip memory accesses.

III. ACCELERATOR DESIGN

A. Data Access Pattern

More than 90% of the computing time of a CNN is spent in the Conv layers, which consist of multiple nested loops.
Each layer of a large-scale CNN involves a large amount of computation and a large number of parameters, and it is impossible for an embedded device such as an FPGA to process a whole Conv layer at once. As shown in Fig. 1, an output sliding cube is therefore defined based on block-based computation; it slides over the output feature maps [12]. The accelerator reads the corresponding input sliding cube and weights, computes one output sliding cube, and writes it back to off-chip memory.

The accelerator produces the output sliding cubes in turn, following the loop order of Fig. 1, until the whole layer is completed. For a Conv layer that is followed by a pooling layer, max pooling is performed before the output sliding cube is written back to off-chip memory. For example, for Conv2 and Pool2 in Table I, the accelerator reads the m-th input sliding cube of Conv2 and directly obtains the m-th output sliding cube of Pool2, which effectively reduces the number of off-chip memory accesses.

Here W/k, H/k and F are the width, height and number of the output feature maps respectively, and C is the number of input feature maps. The size of the input buffer is T_ih × T_iw × T_c, the size of the weight buffer is T_f × T_c × K × K, and the size of the output buffer is T_oh × T_ow × T_f. When the Conv layer is followed by a pooling layer, k = 2; otherwise k = 1. To complete one Conv layer, the accelerator needs to read the input feature maps F/T_f times and the weights (W/T_w × H/T_h) times from off-chip memory.
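The combination of Conv and pooling described above can be sketched as follows: once the convolution results of one output sliding cube are available on chip, a 2×2/2 max pooling is applied and only the pooled (four times smaller) cube is written back, so the following pooling layer never touches off-chip memory. The tile constants follow Section III-C; the function and array names are illustrative.

#define TF  64   /* Tf: output-feature-map tile */
#define TOH 26   /* Toh: output-cube height     */
#define TOW 52   /* Tow: output-cube width      */

/* Fuse 2x2/2 max pooling into the write-back of one output sliding cube:
 * conv_cube holds the on-chip convolution results, pooled_cube is the cube
 * that is actually transferred back to off-chip memory. */
void fuse_pool_before_writeback(const short conv_cube[TF][TOH][TOW],
                                short pooled_cube[TF][TOH / 2][TOW / 2])
{
    for (int f = 0; f < TF; f++)
        for (int h = 0; h < TOH; h += 2)
            for (int w = 0; w < TOW; w += 2) {
                short m = conv_cube[f][h][w];
                if (conv_cube[f][h][w + 1]     > m) m = conv_cube[f][h][w + 1];
                if (conv_cube[f][h + 1][w]     > m) m = conv_cube[f][h + 1][w];
                if (conv_cube[f][h + 1][w + 1] > m) m = conv_cube[f][h + 1][w + 1];
                pooled_cube[f][h / 2][w / 2] = m;
            }
}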
B. Overall Structure of the FPGA Accelerator

The overall architecture of the accelerator is shown in Fig. 2. It mainly consists of the processing element (PE) module, the on-chip buffers and the DMA module, and it completes the computation of one sliding cube at a time. The DMA module moves data between off-chip memory and the on-chip buffers and performs data scheduling and format conversion. The on-chip buffers include the input buffer, the output buffer and the weight buffer. The PE module performs the convolution, pooling and activation operations.

C. PE Module

Under the premise of minimizing data movement and off-chip memory accesses and improving data reuse, we explore the design space. The size of the sliding cube is determined as T_oh = 26, T_ow = 52, T_c = 8 and T_f = 64. For the exploration of the parallelism space, the intra-output and inter-output parallelism of [12] are applied. As shown in Fig. 3, the PE module contains T_f PE units; each PE unit processes T_c input feature maps and produces one output feature map, so for one sliding cube the PE module processes T_c input feature maps and T_f output feature maps at the same time.

1) Convolution Units: The parallelism of the convolution units is T_c × T_f, consisting of T_c × T_f multiplier-and-shifter groups and T_f adder trees. This corresponds to the Conv layer computation shown in Fig. 3.

for (h = 0, Th_min = H; h < H/Th; h++, Th_min -= Th)
  for (w = 0, Tw_min = W; w < W/Tw; w++, Tw_min -= Tw)
    for (f = 0; f < F/Tf; f++)
      for (c = 0; c < C/Tc; c++) {
        read_data();
        compute();
        write_data();
      }

compute():
  for (i = 0; i < K; i++)
    for (j = 0; j < K; j++) {
      // load weights
      for (th = 0; th < min(Th, Th_min); th++)
        for (tw = 0; tw < min(Tw, Tw_min); tw++) {
          #pragma HLS PIPELINE
          for (tf = 0; tf < Tf; tf++) {
            #pragma HLS UNROLL
            for (tc = 0; tc < Tc; tc++) {
              #pragma HLS UNROLL
              // Initialization: output() = (i, j, c == 0) ? bias() : output();
              output(tf, th, tw) += input(tc, th + i, tw + j) * weight(tf, tc, i, j);
            }
          }
        }
    }

Figure 3. Pseudo-code for the convolution units.

2) Pooling and Activation Units: The pooling layers of YOLO use max pooling, and the activation layers use the Leaky ReLU function.
For the pooling, we add two line buffers of size T_ow to each PE unit. The input of the line buffer is the output of the convolution; the output and the input of the line buffer are fed to a comparator to complete the pooling. For the Leaky ReLU function, we use a multiplication and a shift instead of a fractional multiplication.
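The sketch below illustrates both ideas on 16-bit data. It uses a single line buffer for brevity (the design uses two per PE unit), and the constants 26 and 8, which approximate the 0.1 slope as 26/256, are an assumption; the paper only states that the fractional multiplication is replaced by a multiplication and a shift.

#include <stdint.h>

/* Shift-based Leaky ReLU: 0.1 * x is approximated by (x * 26) >> 8. */
static int16_t leaky_relu_fixed(int16_t x)
{
    return (x >= 0) ? x : (int16_t)(((int32_t)x * 26) >> 8);
}

#define TOW 52
/* 2x2/2 max pooling with a line buffer of width Tow: an even row of the
 * convolution output is only buffered; an odd row is compared against the
 * buffered row and one pooled value is emitted per 2x2 window. */
void pool_row(const int16_t conv_row[TOW], int row,
              int16_t line_buf[TOW], int16_t pooled_row[TOW / 2])
{
    if ((row & 1) == 0) {
        for (int w = 0; w < TOW; w++)
            line_buf[w] = conv_row[w];
    } else {
        for (int w = 0; w < TOW; w += 2) {
            int16_t m = line_buf[w];
            if (line_buf[w + 1] > m) m = line_buf[w + 1];
            if (conv_row[w]     > m) m = conv_row[w];
            if (conv_row[w + 1] > m) m = conv_row[w + 1];
            pooled_row[w / 2] = m;
        }
    }
}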
Figure 4. Execution time when all stages work in a pipelined style.

D. Ping-Pong Buffer Design

The input buffer and weight buffer used for computation and the output buffer used for storing intermediate results are organized as ping-pong pairs. In addition, the line buffer is designed so that the format conversion time overlaps the access time of off-chip memory. As shown in Fig. 4, with every level of buffer organized as a ping-pong pair, all stages are fully pipelined. Here t2 is the total time in which the accelerator acquires the data of one output sliding cube, and t3 is the time to write the output sliding cube back to off-chip memory. These two processes complete all the operations needed to produce one output sliding cube; t2 consists of N t1 phases, where N = C_o/T_o + 1.
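A minimal sketch of the ping-pong scheme is given below, assuming hypothetical load_cube/compute_cube/store_cube helpers for the DMA-in, PE and DMA-out stages. In the HLS implementation the three stages would sit in a dataflow region so that, while one half of each buffer pair is being computed on, the other half is being filled or drained; the plain-C version here only illustrates the buffer alternation.

#define CUBE_WORDS 4096   /* illustrative on-chip buffer depth, not a value from the paper */

/* Assumed helpers standing for the DMA-in, PE and DMA-out stages. */
extern void load_cube(short in_buf[CUBE_WORDS], short w_buf[CUBE_WORDS], int n);
extern void compute_cube(const short in_buf[CUBE_WORDS], const short w_buf[CUBE_WORDS],
                         short out_buf[CUBE_WORDS]);
extern void store_cube(const short out_buf[CUBE_WORDS], int n);

void run_layer(int num_cubes)
{
    static short in_buf[2][CUBE_WORDS], w_buf[2][CUBE_WORDS], out_buf[2][CUBE_WORDS];

    load_cube(in_buf[0], w_buf[0], 0);                  /* prologue: fill the "ping" half */
    for (int n = 0; n < num_cubes; n++) {
        int cur = n & 1, nxt = (n + 1) & 1;
        if (n + 1 < num_cubes)
            load_cube(in_buf[nxt], w_buf[nxt], n + 1);  /* prefetch into the idle half    */
        compute_cube(in_buf[cur], w_buf[cur], out_buf[cur]);
        store_cube(out_buf[cur], n);                    /* drain while the next load runs */
    }
}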
E. Dynamic Fixed-Point Quantization

To improve the efficiency of the convolution and reduce the off-chip memory access time, we use a dynamic fixed-point quantization strategy [24], which transforms the weights and feature values from floating-point numbers into 16-bit fixed-point numbers with shared scaling factors. Unlike plain fixed point, dynamic fixed point converts floating-point numbers into fixed-point numbers with multiple scaling factors, which improves the representation accuracy of the fixed-point numbers. To minimize the loss, every feature map shares one scaling factor, and the scaling factors of different feature maps may differ; similarly, for the weights, each convolution kernel shares one scaling factor. Besides the basic multiply-and-add operations, only shift operations are needed to incorporate the scaling factors into the dynamic fixed-point convolution, and the process can be expressed as (3)-(6):

y'^{\,n}_{f,h,w} = x_{c,\, h \times S + j,\, w \times S + i} \times w_{f,c,j,i}    (3)

y^{n}_{f,h,w} = y^{n}_{f,h,w,\mathrm{on}} >> (o^{n}_{p} - Q^{n}_{out}(f))    (5)

Q^{n}_{out}(C) = Q^{n+1}_{in}(C)    (6)

where o^n_p denotes the shared scaling factor of the intermediate results, which prevents data overflow, and Q^n denotes the scaling factors of the input feature maps, output feature maps and weights of the n-th layer.
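The conversion itself can be illustrated with the following sketch, which quantizes one group of values (one feature map or one convolution kernel) to 16-bit dynamic fixed point. The group shares a power-of-two scaling factor Q (the number of fractional bits), and different groups may use different Q; the helper names are illustrative.

#include <math.h>
#include <stdint.h>

/* Pick the largest number of fractional bits Q such that every value of the
 * group still fits into a signed 16-bit word after scaling by 2^Q. */
int choose_q(const float *v, int n)
{
    float max_abs = 0.0f;
    for (int i = 0; i < n; i++)
        if (fabsf(v[i]) > max_abs) max_abs = fabsf(v[i]);
    int q = 15;
    while (q > 0 && max_abs * (float)(1 << q) > 32767.0f)
        q--;
    return q;
}

/* Quantize the group with the shared factor Q, rounding and saturating. */
void quantize_group(const float *v, int16_t *out, int n, int q)
{
    for (int i = 0; i < n; i++) {
        float scaled = roundf(v[i] * (float)(1 << q));
        if (scaled >  32767.0f) scaled =  32767.0f;
        if (scaled < -32768.0f) scaled = -32768.0f;
        out[i] = (int16_t)scaled;
    }
}

During inference, the product of an input value (factor Q_in) and a weight (factor Q_w) carries the factor Q_in + Q_w, so, as in (5), bringing the accumulated result to the output factor Q_out only requires a right shift rather than a floating-point multiplication.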
IV. SYSTEM DESIGN

As illustrated in Fig. 5, the system, which is based on the ARM + FPGA architecture, mainly consists of the external Double Data Rate (DDR) memory, the processing system (PS), the on-chip buffers, the accelerator in the programmable logic (PL), and the on-chip and off-chip bus interconnect. The ARM performs the task scheduling of the whole system. The initial image data and the weights are pre-stored in the external DDR memory, and the intermediate feature maps are read from and written back to the DDR. The accelerator is connected to the ARM through the AXI4 bus: it receives the configuration signals through the AXI4-Lite bus, and the DMA module exchanges data with the DDR and converts the data format through the AXI4-HP bus.
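One layer of the time-shared schedule is thus fully described by a small configuration record written over AXI4-Lite before the layer starts. A possible shape of such a record is sketched below; every field name and width is hypothetical, since the paper does not publish the register map.

#include <stdint.h>

/* Hypothetical per-layer entry of the configuration table. */
struct layer_config {
    uint32_t in_addr;          /* DDR address of the input feature maps            */
    uint32_t w_addr;           /* DDR address of the weights and fused bias        */
    uint32_t out_addr;         /* DDR address for the output sliding cubes         */
    uint16_t in_h, in_w;       /* input feature-map height and width               */
    uint16_t in_c, out_f;      /* number of input / output feature maps            */
    uint8_t  kernel;           /* K: 1 or 3                                        */
    uint8_t  stride;           /* S                                                */
    uint8_t  pad;              /* (K - 1) / 2                                      */
    uint8_t  pool_enable;      /* 1 if a 2x2 max pooling is fused into the layer   */
    uint8_t  q_in, q_w, q_out; /* dynamic fixed-point scaling factors (Sec. III-E) */
};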
V. EVALUATION

A. Model Preparation

Based on the Darknet deep learning framework, we train YOLOv2 and YOLOv2-Tiny on COCO and PASCAL VOC 2007 respectively, with an input resolution of 416×416. After training, the dynamic fixed-point strategy is used to transform the parameters from floating point into 16-bit fixed point with multiple scaling factors.

B. Model Deployment

The accelerator IP is packaged with Vivado HLS, and the system architecture is built in Vivado. We implement the system on the Xilinx ZCU102 with a driving clock of 300 MHz. The larger the network, the better the accelerator performance. The system performance and resource utilization are shown in Table II.
TABLE II. IMPLEMENTATION RESULTS OF THE PROPOSED DESIGNS

                            YOLOv2               YOLOv2-Tiny
Features                    VOC       COCO       VOC       COCO
Device                      Xilinx ZCU102 FPGA
Frequency                   300 MHz
BRAM (18Kb)                 491
DSPs                        609
LUTs                        95136
FFs                         90589
DNN size (GOP)              29.42     29.47      6.18      5.4
Peak throughput (GOPs)      289.1     289.1      237.6     237.6
Overall throughput (GOPs)   102.2     102.5      85.8      87.0
Inference time              288 ms    288 ms     72 ms     62 ms

The comparison between the proposed design and previous hardware implementations is presented in Table III.

TABLE III. COMPARISON WITH PREVIOUS WORK

                    [13]               [14]              [15]               This work
Platform            Artix-7 TSBG484    Virtex-7 VX485T   Stratix-V          Zynq UltraScale+
Frequency (MHz)     100                100               120                300
Precision           16-bit fixed       32-bit float      (8-16)-bit fixed   16-bit fixed
Throughput (GOPs)   22                 61.62             136.5              289 (peak) / 102 (overall)
Power (W)           7.53               3.0               19.1               11.8

VI. CONCLUSION

In this paper, we propose a reconfigurable CNN accelerator for YOLO. By combining the convolution and pooling operations, the off-chip memory access time of the accelerator is reduced. A configuration table is designed so that the accelerator can adapt to different models, and the peak performance of the accelerator is 289 GOPs at a 300 MHz clock frequency. In future work, we will further improve the generality of the accelerator so that it can also run YOLOv3, and we will deploy the post-processing on the FPGA.

ACKNOWLEDGMENT

This work was supported by the National Key Research and Development Program of China (Grant No. 2018YFE0203801).

REFERENCES

[1] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
[2] A. Krizhevsky and I. Sutskever, "ImageNet classification with deep convolutional neural networks," in Proc. Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.
[3] R. Girshick, J. Donahue, T. Darrell and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," pp. 580-587, 2013.
[4] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," arXiv preprint arXiv:1612.08242, 2016.
[5] W. Liu, D. Anguelov, D. Erhan, C. Szegedy and S. E. Reed, "SSD: Single shot multibox detector," arXiv preprint arXiv:1512.02325, 2015.
[6] Y. Ma, N. Suda, Y. Cao, J. Seo and S. Vrudhula, "Scalable and modularized RTL compilation of convolutional neural networks onto FPGA," in Proc. 26th Int. Conf. on Field Programmable Logic and Applications (FPL), Lausanne, 2016, pp. 1-8.
[7] S. Han, H. Mao and W. Dally, "Deep compression: Compressing DNNs with pruning, trained quantization and Huffman coding," arXiv preprint arXiv:1510.00149, 2015.
[8] A. Zhou, A. Yao and Y. Guo, "Incremental network quantization: Towards lossless CNNs with low-precision weights," arXiv preprint arXiv:1702.03044, 2017.
[9] B. Wu, A. Wan and X. Yue, "Shift: A zero FLOP, zero parameter alternative to spatial convolutions," arXiv preprint arXiv:1711.08141, 2017.
[10] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," arXiv preprint arXiv:1602.07360, 2016.
[11] S. Chakradhar, M. Sankaradas, V. Jakkula and S. Cadambi, "A dynamically configurable coprocessor for convolutional neural networks," in Proc. ISCA, 2010.
[12] D. T. Nguyen, T. N. Nguyen, H. Kim and H. Lee, "A high-throughput and power-efficient FPGA implementation of YOLO CNN for object detection," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 27, no. 8, pp. 1861-1873, Aug. 2019.
[13] Q. Zhang, J. Cao, Y. Zhang, S. Zhang, Q. Zhang and D. Yu, "FPGA implementation of quantized convolutional neural networks," in Proc. IEEE 19th Int. Conf. on Communication Technology (ICCT), Xi'an, China, 2019, pp. 1605-1610.
[14] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in Proc. ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays, 2015.
[15] N. Suda et al., "Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks," in Proc. FPGA, 2016, pp. 16-25.
[16] J. Qiu, J. Wang et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proc. ACM Int. Symp. on Field-Programmable Gate Arrays, 2016.
[17] R. Girshick, "Fast R-CNN," arXiv preprint arXiv:1504.08083, 2015.
[18] S. Ren, K. He, R. Girshick and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," arXiv preprint arXiv:1506.01497, 2015.
[19] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[20] Trieu, "Darknet," https://github.com/AlexeyAB/darknet, 2016.
[21] C. Szegedy, S. Ioffe and V. Vanhoucke, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," arXiv preprint arXiv:1602.07261, 2016.
[22] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. Int. Conf. on Machine Learning, 2015, pp. 448-456.
[23] A. G. Howard et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[24] R. Ding, G. Su, G. Bai, W. Xu, N. Su and X. Wu, "A FPGA-based accelerator of convolutional neural network for face feature extraction," in Proc. IEEE Int. Conf. on Electron Devices and Solid-State Circuits (EDSSC), Xi'an, China, 2019, pp. 1-3.