FPGA Implementation of Object Detection Accelerator Based on Vitis-AI

School of Mechatronic Engineering and Automation
Shanghai University
Shanghai, China
[email protected], [email protected]
Abstract—The emergence of YOLOv3 makes it possible to detect small targets. Due to the characteristics of the YOLO network itself, YOLOv3 has exceptionally high requirements for computing power and memory bandwidth, so it usually needs to be deployed on a dedicated hardware acceleration platform. The FPGA is a logically reconfigurable hardware chip with substantial advantages in performance and power consumption, which makes it a good choice for deploying deep convolutional networks. In this paper, we propose a reconfigurable YOLOv3 FPGA hardware accelerator based on the AXI bus and an ARM+FPGA architecture. The YOLOv3 network is quantized through Vitis AI, and a series of operations such as model compression and data pre-processing reduces the accelerator's on-chip resource usage and the access time of external storage. Pipelined operation enables the FPGA to achieve higher throughput. Compared with the GPU implementation of the YOLOv3 model, the FPGA-based YOLOv3 accelerator achieves lower energy consumption and higher throughput.

Index Terms—FPGA, Vitis AI, Object Detection, YOLOv3, Parallel Processing

I. INTRODUCTION

With the development of Convolutional Neural Networks (CNN), new network structures are proposed one after another to overcome the bottlenecks of existing networks in terms of performance and computing power. Significant progress has recently been made in convolutional neural network research within the field of traditional machine learning [1]. Object detection is a challenging task in computer vision. Owing to the emergence of powerful computing devices such as GPUs, deep learning has been widely applied to object detection, and many deep-learning-based detection algorithms have been proposed, such as the Single Shot MultiBox Detector (SSD), the Faster Region-based Convolutional Neural Network (Faster R-CNN), and You-Only-Look-Once (YOLO). When YOLO was proposed, two labels were attached to it: 1) it is very fast; 2) it is not good at detecting small objects [2]. The latter became the reason many people stayed away from it. Due to a limitation of its principle, YOLO only performs detection on the last output layer, and small objects occupy few pixels; after layer-by-layer convolution, their information is hardly reflected in that layer, which makes them difficult to identify. YOLOv3 improves on this noticeably [3].

Although GPUs have been widely used for deep learning algorithms such as YOLO, they are not efficient enough to achieve low power consumption together with high throughput. At present, in addition to improving the algorithm logic itself, hardware acceleration methods are also prevalent for optimizing neural networks. The FPGA is a high-performance, low-power chip with unique advantages for accelerating neural network algorithms. On the one hand, an FPGA can achieve high computing performance and a high energy efficiency ratio; on the other hand, an FPGA is highly flexible and can be reconfigured [4] [5]. It integrates a large number of digital logic units and memories, and developers can burn a configuration file that customizes the wiring between the logic units and the memories to realize different arithmetic logic. This configuration file is not one-time: the circuit logic inside the chip can be modified at any time.

In this paper, we propose a YOLOv3 accelerator that can be applied to different network structures. The main contributions of this article are listed below:
• We design a reconfigurable YOLOv3 accelerator that pipelines tasks at each stage.
• Based on Vitis AI, we perform optimization operations such as data quantization, pruning, and model compression to reduce model complexity and improve accelerator performance.
• Compared with a GPU running the standard model, the accelerator obtains greater throughput and lower energy consumption.

II. BACKGROUND

The current mainstream object detection algorithms are mainly based on deep learning models, which can be divided into two categories: two-stage detection algorithms, such as R-CNN, Fast R-CNN, and Faster R-CNN, and one-stage detection algorithms, of which YOLO and SSD are typical examples. The leading performance indicators of an object detection model are detection accuracy and speed. For accuracy, object detection should consider the positioning accuracy of the object, not just the classification accuracy [6]. In general, the two-stage algorithms have an advantage in accuracy, while the one-stage algorithms have an advantage in speed.

§ Shenshen Gu is the corresponding author.
Vitis AI is Xilinx's unified development stack for AI inference on Xilinx hardware platforms (Fig. 3). It consists of the following components:
• AI Model Zoo: a comprehensive set of pre-optimized models that are ready to deploy on Xilinx devices.
• AI Optimizer: an optional model optimizer that can prune a model by up to 90%.
• AI Quantizer: a powerful quantizer that supports model quantization, calibration, and fine-tuning.
• AI Compiler: compiles the quantized model into a high-efficiency instruction set and data flow.
• AI Profiler: performs an in-depth analysis of the efficiency and utilization of an AI inference implementation.
• AI Library: provides high-level, optimized C++ APIs for AI applications from the edge to the cloud.
• DPU: efficient and scalable IP cores that can be customized to meet the needs of many different applications.

Fig. 3. Vitis AI: Unified AI Inference Solution Stack

Vitis AI allows software engineers and application developers to use popular frameworks such as TensorFlow, Caffe, and PyTorch while exploiting hardware acceleration at the same time. It provides three principal layers from top to bottom. First, the AI Model Zoo provides standard CNN networks; developers can choose from more than 60 optimized reference models for a quick proof of concept or for production. The Vitis AI development kit in the middle, including the Xilinx IR, the Xilinx Compiler, and Xilinx Embedded Software, provides five tools for deploying machine learning networks. The AI Optimizer is a license-based tool that can significantly reduce the number of neural network computations without affecting accuracy; depending on the network, it can increase inference speed by 5 to 20 times. The AI Quantizer converts FP32 models into INT8 or even lower-precision models for FPGA deployment. The Vitis AI software library, which supports all models in the Model Zoo as well as custom models, provides open-source, pre-optimized software APIs and functions that implement machine learning pre-processing, invoke the DPU for inference, and process the results to obtain the final output. The DPU at the bottom is an efficient hardware overlay for machine inference; Vitis AI provides different types of DPUs for different AI workloads, such as CNN, LSTM, BERT, and MLP tasks.

III. ACCELERATOR DESIGN

Generally, the design of a neural network inference accelerator should consider two indicators: high speed (high throughput and low latency) and high energy efficiency [14].

The throughput of a neural network accelerator can be expressed by Equation (1). If the selected FPGA chip has limited resources, we can increase the number of computing units by reducing the size of each computing unit, or raise the operating frequency of the accelerator, to approach peak performance. However, reducing the size of the computing units usually sacrifices model accuracy, and the operating frequency is largely fixed by the hardware. Therefore, to improve throughput and reduce latency, a reasonable pipelined parallel implementation and an effective memory system are particularly important.

$$IPS = \frac{OPS_{act}}{W} = \frac{OPS_{peak} \times \eta}{W} = \frac{f \times P \times \eta}{W} \quad (1)$$

where $IPS$ represents the throughput of the system, measured as the number of inference processes per second; its unit is $\mathrm{s}^{-1}$. $W$ is the workload of each inference, in GOP. $OPS_{peak}$ represents the peak performance of the accelerator and $OPS_{act}$ its run-time performance. $\eta$ is the utilization rate of the computing units, measured as the average ratio of working computing units to all computing units in each inference cycle. $f$ is the working frequency of the computation units, and $P$ is the number of computation units in the hardware design.

The FPGA-based hardware implementation of the YOLOv3 accelerator processes different inputs in parallel. The latency of the accelerator can be expressed by Equation (2). Standard parallel designs include pipelining and batch processing, which are usually considered together with loop unrolling.

$$L = \frac{C}{IPS} \quad (2)$$

where $L$ represents the latency of each inference, in seconds, and $C$ represents the concurrency of the accelerator.

Energy efficiency is the other goal of an FPGA-based neural network hardware accelerator. For a neural network hardware accelerator, energy efficiency is defined as in Equation (3), and the total energy consumption of the accelerator is shown in Equation (4). We can reduce the overall energy consumption of the FPGA by reducing the bit width used for computation and by reducing the dynamic energy of memory accesses.

$$Eff = \frac{W}{E_{total}} \quad (3)$$

$$E_{total} \approx W \times E_{op} + N_{SRAM\_acc} \times E_{SRAM\_acc} + N_{DRAM\_acc} \times E_{DRAM\_acc} + E_{static} \quad (4)$$
where $Eff$ represents the energy efficiency of the system. $E_{total}$ is the total system energy cost of each inference, $E_{static}$ is the static energy cost of the system per inference, and $E_{op}$ is the average energy per operation in each inference. $N_{x\_acc}$ is the number of bytes accessed from memory ($x$ can be SRAM or DRAM), while $E_{x\_acc}$ is the energy for accessing each byte of that memory.
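To make the model in Equations (1)-(4) concrete, the following Python sketch evaluates throughput, latency, and energy efficiency for a hypothetical accelerator configuration; every parameter value below is an illustrative assumption, not a measurement from our design.

```python
# Sketch of the performance/energy model in Equations (1)-(4).
# All numbers below are illustrative assumptions, not measured values.

GOP = 1e9  # operations per GOP

# Equation (1): IPS = f * P * eta / W
f = 300e6        # working frequency of the computation units (Hz)
P = 2048         # number of computation units (assumed)
eta = 0.6        # average utilization of the computing units (assumed)
W = 65.4 * GOP   # workload per inference (GOP), YOLOv3-scale (assumed)

ips = f * P * eta / W           # inferences per second
print(f"IPS = {ips:.2f} 1/s")

# Equation (2): L = C / IPS
C = 4                           # concurrency (images in flight, assumed)
latency = C / ips
print(f"L   = {latency * 1e3:.2f} ms")

# Equations (3)-(4): Eff = W / E_total
E_op = 5e-12                    # average energy per operation (J), assumed
N_sram, E_sram = 2e9, 1e-12     # SRAM bytes accessed / energy per byte
N_dram, E_dram = 1e8, 100e-12   # DRAM bytes accessed / energy per byte
E_static = 0.05                 # static energy per inference (J), assumed

E_total = W * E_op + N_sram * E_sram + N_dram * E_dram + E_static
print(f"Eff = {W / E_total / GOP:.2f} GOP/J")
```

The sketch also makes the design trade-offs visible: lowering the DRAM traffic term or the per-operation energy (e.g., by narrowing the bit width) directly raises $Eff$.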
A. Data Access Pattern

For the FPGA implementation of YOLOv3, the hardware accelerator first reads the corresponding input sliding cube and the weights for input images of different sizes, then computes the output sliding cube and writes it back to the off-chip memory [15].
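The sketch below models this read-compute-write-back loop in Python for one convolutional layer processed sliding cube by sliding cube; the compute_tile helper and the tile size are hypothetical, and on the real accelerator each step maps to a DMA transfer or a PE operation.

```python
import numpy as np

# Hypothetical tile-by-tile data access pattern for one conv layer.
# "Off-chip memory" is modeled by plain numpy arrays.

def compute_tile(in_tile, weights):
    """Placeholder for the PE computation on one input sliding cube."""
    # Stand-in for convolution: one output value per filter.
    return np.einsum("chw,ochw->o", in_tile, weights)

def run_layer(ifmap, weights, tile=3):
    C, H, W = ifmap.shape
    O = weights.shape[0]
    out = np.zeros((O, H - tile + 1, W - tile + 1), dtype=ifmap.dtype)
    for y in range(out.shape[1]):
        for x in range(out.shape[2]):
            # 1) read the input sliding cube (off-chip -> on-chip buffer)
            in_tile = ifmap[:, y:y + tile, x:x + tile]
            # 2) compute the output sliding cube in the PE array
            out_cube = compute_tile(in_tile, weights)
            # 3) write the result back to off-chip memory
            out[:, y, x] = out_cube
    return out

ifmap = np.random.rand(16, 8, 8).astype(np.float32)
weights = np.random.rand(32, 16, 3, 3).astype(np.float32)
print(run_layer(ifmap, weights).shape)  # (32, 6, 6)
```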
In Vitis AI version 1.3, Yaml files were introduced to maintain all models: each model has its own Yaml file. For the YOLOv3 model used in this paper, as shown in Fig. 4, this file describes comprehensive information such as the training data set, input size, complexity, framework, and download links.
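For illustration, the following Python snippet parses a model-description Yaml of this kind; the field names and values are hypothetical stand-ins for the real Model Zoo entries, chosen only to mirror the kind of information the file carries.

```python
import yaml  # PyYAML

# Hypothetical model-description Yaml in the spirit of the Vitis AI
# Model Zoo entries; field names and values are illustrative only.
MODEL_YAML = """
name: yolov3_voc
framework: caffe
input_size: 416x416
float_ops: 65.4G
train_dataset: voc2007+voc2012
download_link: https://example.com/yolov3_voc.tar.gz
"""

info = yaml.safe_load(MODEL_YAML)
print(f"{info['name']}: {info['framework']}, "
      f"input {info['input_size']}, workload {info['float_ops']}OP")
```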
B. Processing Flow

When an image or video stream is input from the sensor, it undergoes a series of system-level pre-processing steps. As shown in Fig. 5, the block in the middle right is the backbone of the model, after which we need to perform some post-processing, such as drawing bounding boxes. As shown in Fig. 6, the part in the large pink area is the functionality provided by the Vitis AI library for performance optimization. Because pre-processing and post-processing are usually done on the ARM processor, which is not very powerful, the Vitis AI library optimizes and improves their performance algorithmically.

Fig. 5. Processing Flow

Fig. 6. Vitis AI Provides

C. Pipeline Design

Pipeline design is a method of systematically dividing combinational logic, inserting registers between the various parts (stages), and temporarily storing intermediate data. The purpose is to decompose a large operation into several small operations. Each small operation takes less time, so the clock frequency can be increased; and since the small operations execute in parallel, the data throughput rate is improved [16]. In this paper, the pipeline design shortens the path a given signal travels in one clock cycle, increases throughput, and reduces latency.
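The following toy timing model illustrates the arithmetic behind this claim: splitting one long combinational operation into k registered stages raises the achievable clock frequency and, once the pipeline is full, yields one result per cycle. The delay numbers are assumptions for illustration only.

```python
# Toy timing model of pipelining combinational logic.
# Assumed numbers: a 12 ns combinational path split into stages,
# with 0.5 ns of register overhead per stage.

def pipeline_stats(total_delay_ns, stages, reg_overhead_ns, n_items):
    stage_delay = total_delay_ns / stages + reg_overhead_ns
    f_mhz = 1e3 / stage_delay                  # achievable clock (MHz)
    cycles = stages + (n_items - 1)            # fill + one result per cycle
    time_s = cycles * stage_delay * 1e-9
    return f_mhz, n_items / time_s             # frequency, results/second

for stages in (1, 4):
    f, thr = pipeline_stats(12.0, stages, 0.5, n_items=1000)
    print(f"{stages} stage(s): f = {f:6.1f} MHz, "
          f"throughput = {thr / 1e6:.1f} M results/s")
```

With these assumed delays, four stages raise the clock from 80 MHz to about 286 MHz and improve steady-state throughput by roughly the same factor, at the cost of a longer fill latency for the first result.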
D. Data Quantization

One reason for data quantization is to express the entire network with low-bit values, which compresses the model and reduces the demand for storage space. The other is that the DSP units in the FPGA are better at handling fixed-point operations [17]. Quantization of weights and activations is one of the most commonly used methods of model compression.

Fig. 7. AI Quantizer

In the Vitis AI workflow, the AI Quantizer is an essential component. Because the FPGA only supports INT8 calculations, the AI Quantizer converts 32-bit floating-point values to 8-bit fixed-point values while minimizing the loss of precision during the conversion [18]. In this paper, FP32 calculations are converted to INT8 calculations. The AI Quantizer can reduce computational complexity without loss of prediction accuracy; compared with the floating-point model, the fixed-point network model requires less memory bandwidth and provides faster speed and higher power efficiency.
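As a minimal sketch of what such a float-to-fixed conversion involves (the actual Vitis AI calibration and fine-tuning flow is more sophisticated), the following Python code performs symmetric per-tensor INT8 quantization of a weight tensor and measures the round-trip error:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization: x ~ scale * q."""
    scale = np.abs(x).max() / 127.0          # map largest magnitude to 127
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 32, 3, 3).astype(np.float32)  # FP32 weights
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# 4x smaller storage, small reconstruction error
print(f"scale = {scale:.6f}")
print(f"mean abs error = {np.abs(w - w_hat).mean():.6f}")
print(f"bytes: fp32 = {w.nbytes}, int8 = {q.nbytes}")
```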
E. Module Optimize

There are two ways to achieve neural network model compression. One is to reduce the bit width of the activations and weights. Pruning is the other, and also the most commonly used, method of model compression. The goal is to obtain smaller and more efficient neural networks: a compressed network runs faster and reduces the computational cost of training.
Fig. 8. AI Optimizer

As shown in Fig. 8, the AI Optimizer component of Vitis AI integrates leading model compression technology, which can reduce model complexity by 5 to 50 times with minimal impact on accuracy. Deep compression lifts the performance of AI inference to a new level [18]. In this paper, we use this component to prune the model and thereby compress it.
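A minimal sketch of the idea behind such pruning, assuming simple global magnitude pruning (the AI Optimizer's actual algorithm is proprietary and more involved, and preserves accuracy via fine-tuning): weights whose magnitudes fall below a sparsity-determined threshold are zeroed out.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    """Zero out the smallest-magnitude weights until `sparsity` is reached.

    A toy stand-in for model pruning, not the Vitis AI Optimizer's
    actual method.
    """
    flat = np.abs(weights).ravel()
    threshold = np.sort(flat)[int(sparsity * flat.size)]
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

w = np.random.randn(256, 128).astype(np.float32)
w_pruned, mask = magnitude_prune(w, sparsity=0.9)
print(f"kept {mask.mean() * 100:.1f}% of weights")  # ~10%
```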
IV. SYSTEM DESIGN

As shown in Fig. 9, the FPGA hardware accelerator system designed in this paper is based on an ARM + FPGA architecture. It mainly comprises the external double-data-rate memory (DDR), the YOLO network direct memory access engine (YOLO DMA), on-chip caches, and the programmable-logic accelerator (PL). The accelerator itself consists of processing element (PE) modules, on-chip buffers, and programmable network logic modules. The design of the PE module is conducive to reducing data movement, reducing the number of off-chip memory accesses, and improving data reuse; a PE module can perform convolution, pooling, and activation operations [19]. The on-chip and off-chip domains are interconnected via the AXI bus. The host PC is responsible for task scheduling in the whole system: it issues workloads and instructions and monitors the working status. For each input image, the system reads the input and the weights of the initial layer and stores them in the external DDR; the accelerator then reads and writes the corresponding data from the DDR. Through the AXI bus, the hardware accelerator communicates with the ARM processor and receives configuration signals.

Fig. 9. System architecture
training results, so the detection results are also different. As
shown in Fig.11, when we run the detection function, we will
get four output lines. The four lines of the output indicate that
four cars have been detected. In Fig. 11, RESULT: 6 shows
the category of the vehicle, followed by the coordinates of the
bounding box, and the last number (such as the last number
0.939764 in the first line of output) is the confidence level,
which indicates the confidence that the object is recognized
as a car.
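For illustration, the small Python parser below turns output lines of this form into structured records; the exact field layout (label, four box coordinates, confidence) is inferred from the description above, so treat it as an assumption about the format.

```python
import re

# Assumed output format, inferred from Fig. 11:
#   "RESULT: 6   x_min y_min x_max y_max   confidence"
LINE = re.compile(
    r"RESULT:\s*(\d+)\s+"
    r"([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)")

def parse_detections(text):
    dets = []
    for m in LINE.finditer(text):
        label, x1, y1, x2, y2, conf = m.groups()
        dets.append({"label": int(label),
                     "box": tuple(map(float, (x1, y1, x2, y2))),
                     "confidence": float(conf)})
    return dets

sample = "RESULT: 6  120.5 88.0 310.2 240.7 0.939764\n"
print(parse_detections(sample))
```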
C. Performance Comparison

In this paper, we conduct comparative experiments between the hardware implementation and the same model implemented on the GPU.
Fig. 12. Experimental Device
TABLE I
PERFORMANCE OF FPGA VS GPU

                           FPGA (Xilinx ZCU104)         GPU
Platform                   ZYNQ UltraScale+             GeForce GTX1080
Frequency (MHz)            300                          10240
Precision                  INT8                         FP32
GOPs                       5.5                          2.45
FPS                        84.5518 (single thread)      31.7
                           206.701 (multiple threads)
Power (W)                  25                           126
Energy Efficiency (FPS/W)  3.38                         0.26

Compared with other data sets, the performance of our FPGA YOLOv3 implementation improves by at least a factor of 2. Compared with the network running on the GPU, and while ensuring that accuracy is not greatly affected, the key to the FPGA's improved performance and power consumption is that the pruning and quantization operations reduce computational complexity. At the same time, we should point out that the inference latency of the FPGA increases.

VI. CONCLUSION

In this paper, we designed and implemented a hardware accelerator for the YOLOv3 network model on a reconfigurable FPGA. A series of operations such as task pipelining, data quantization, model compression, and data pre-processing using Vitis AI reduces the network size and the accelerator's off-chip memory access time. Where the GPU reaches 31.7 FPS, the FPGA hardware implementation obtains 84.5 FPS single-threaded and 206.7 FPS multi-threaded. In follow-up research, we will continue to study the versatility of FPGA hardware accelerators and simplify the model deployment process, so that a variety of neural network models can be deployed quickly and easily while achieving high-performance, low-power execution.

ACKNOWLEDGEMENT

The work described in this paper was supported by the National Science Foundation of China under Grant 61876105 and the Ministry of Education Industry-University Cooperation and Collaborative Education Project under Grant 201902097014.

REFERENCES

[1] Y. Lecun, L. Bottou, Y. Bengio and P. Haffner, Gradient-based learning applied to document recognition, in Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov 1998.
[2] A. Krizhevsky and I. Sutskever, ImageNet classification with deep convolutional neural networks, in Proc. Advances in Neural Inf. Process. Syst., pp. 1097-1105, 2012.
[3] J. Redmon and A. Farhadi, YOLO9000: Better, faster, stronger. [Online]. Available: https://arxiv.org/abs/1612.08242, Dec 2016.
[4] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. Kyung Kim, Chenkai Shao, From high-level deep neural models to FPGAs, in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchitecture, Art. no. 17, Oct 2016.
[5] L. Lu, Y. Liang, Q. Xiao, S. Yan, Evaluating fast algorithms for convolutional neural networks on FPGAs, in Proc. IEEE 25th Annu. Int. Symp. Field-Program. Custom Comput. Mach. (FCCM), pp. 101-108, Apr 2017.
[6] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, pp. 580-587, Nov 2013.
[7] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyramid networks for object detection, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117-2125, Feb 2017.
[8] J. Redmon, A. Farhadi, YOLOv3: An Incremental Improvement, IEEE Trans. Pattern Anal., pp. 1125-1131, Apr 2018.
[9] Zhong-Qiu Zhao, Peng Zheng, Shou-tao Xu, Xindong Wu, Object Detection with Deep Learning: A Review, IEEE Trans. Neural Netw. Learn. Syst., pp. 3212-3232, Apr 2019.
[10] Yi Zhang, Yongliang Shen, Jun Zhang, An improved tiny-YOLOv3 pedestrian detection algorithm, Optik - International Journal for Light and Electron Optics 183 (2019), pp. 17-23, Feb 2019.
[11] Wang, Qiwei; Bi, Shusheng; Sun, Minglei; Wang, Yuliang; Wang, Di; Yang, Shaobao (2019): YOLOv3 architecture. PLOS ONE. Figure. https://doi.org/10.1371/journal.pone.0218808.g005
[12] Yunong Tian, Guodong Yang, Zhe Wang, Hao Wang, En Li, Zize Liang, Apple detection during different growth stages in orchards using the improved YOLO-V3 model, Computers and Electronics in Agriculture (2019), pp. 417-426, Jan 2019.
[13] Xilinx/Vitis-AI/README, https://github.com/Xilinx/Vitis-AI
[14] Kaiyuan Guo, Shulin Zeng, Jincheng Yu, Yu Wang, and Huazhong Yang, A Survey of FPGA-based Neural Network Inference Accelerators, ACM Trans. Reconfigurable Technol. Syst., Article 2, 26 pages, Mar 2019.
[15] D. T. Nguyen, T. N. Nguyen, H. Kim and H. Lee, "A High-Throughput and Power-Efficient FPGA Implementation of YOLO CNN for Object Detection," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 27, no. 8, pp. 1861-1873, Aug 2019.
[16] Sathaporn Visakhasart, Orachat Chitsobhuk, Multi-pipeline Architecture for Face Recognition on FPGA, 2009 International Conference on Digital Image Processing, Aug 2009.
[17] R. Ding, G. Su, G. Bai, W. Xu, N. Su and X. Wu, A FPGA-based Accelerator of Convolutional Neural Network for Face Feature Extraction, 2019 IEEE International Conference on Electron Devices and Solid-State Circuits (EDSSC), pp. 1-3, Xi'an, China, Jul 2019.
[18] Xilinx Inc, Adaptable and Real-Time AI Inference Acceleration, https://www.xilinx.com/products/design-tools/vitis/vitis-ai.html
[19] Jialiang Zhang, Jing Li, Improving the performance of OpenCL-based FPGA accelerator for convolutional neural network, in Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '17), pp. 25-34, Feb 2017.
[20] Xilinx Inc, ZCU104 Board User Guide, UG1267 (v1.1), Oct 2018.