
11th International Conference on Information Science and Technology (ICIST)
Chengdu, China, May 21-23, 2021

FPGA Implementation of Object Detection Accelerator Based on Vitis-AI

DOI: 10.1109/ICIST52614.2021.9440554

Jin Wang and Shenshen Gu§
School of Mechatronic Engineering and Automation, Shanghai University, Shanghai, China
[email protected], [email protected]
§ Shenshen Gu is the corresponding author.

Abstract—The emergence of YOLOv3 makes it possible to detect small targets. Owing to the characteristics of the YOLO network itself, the YOLOv3 network has exceptionally high requirements for computing power and memory bandwidth, and it usually needs to be deployed on a dedicated hardware acceleration platform. The FPGA is a logically reconfigurable hardware chip with substantial advantages in terms of performance and power consumption, so it is a good choice for deploying a deep convolutional network. In this paper, we propose a reconfigurable YOLOv3 FPGA hardware accelerator based on the AXI-bus ARM+FPGA architecture. The YOLOv3 network is quantized through Vitis AI, and a series of operations such as model compression and data pre-processing saves accelerator chip resources and external-memory access time. Pipelined operation enables the FPGA to achieve higher throughput. Compared with the GPU implementation of the YOLOv3 model, we find that the FPGA-based hardware implementation of the YOLOv3 accelerator consumes less energy and achieves higher throughput.

Index Terms—FPGA, Vitis AI, Object Detection, YOLOv3, Parallel Processing

I. INTRODUCTION

With the development of Convolutional Neural Networks (CNN), new network structures are proposed one after another to overcome the bottlenecks of existing networks in terms of performance and computing power, and significant progress has recently been made in convolutional neural network research within the field of traditional machine learning algorithms [1]. Object detection is a challenging task in computer vision. With the emergence of powerful computing devices such as GPUs, deep learning has been widely applied to object detection, and many deep-learning-based detection algorithms have been proposed, such as the Single Shot MultiBox Detector (SSD), the Faster Region-based Convolutional Neural Network (Faster R-CNN), and You-Only-Look-Once (YOLO). When YOLO was proposed, two labels were attached to it: 1) it is very fast; 2) it is not good at detecting small objects [2]. The latter became the reason many people stayed away from it. Owing to a limitation of its principle, YOLO only performs detection on the last output layer, and small objects occupy few pixels; after layer-by-layer convolution, their information is hardly reflected in that layer, which makes them difficult to identify. YOLOv3 improves on this noticeably [3].

Although GPUs have been widely used for deep learning algorithms such as YOLO, they are not efficient enough to achieve low power consumption together with high throughput. At present, in addition to improving the algorithm logic itself, hardware acceleration methods are also prevalent for optimizing neural networks. The FPGA is a high-performance, low-power chip with unique advantages for accelerating neural network algorithms: on the one hand, an FPGA can achieve high computing performance and a high energy-efficiency ratio; on the other hand, it is highly flexible and can be reconfigured [4] [5]. It integrates a large number of digital-circuit logic units and memories. By burning a configuration file, developers customize the wiring between the logic units and the memories to realize different arithmetic logic, and this configuration file is not one-time: the circuit logic inside the chip can be modified at any time.

In this paper, we propose a YOLOv3 accelerator that can be applied to different network structures. The main contributions of this article are listed below:
• We design a reconfigurable YOLOv3 accelerator that can pipeline tasks in each stage.
• Based on Vitis AI, we perform optimization operations such as data quantization, pruning, and model compression to reduce model complexity and improve accelerator performance.
• Compared with a GPU running the standard model, we obtain greater throughput and lower energy consumption.

II. BACKGROUND

The current mainstream object detection algorithms are mainly based on deep learning models and can be divided into two categories: two-stage detection algorithms, such as R-CNN, Fast R-CNN, and Faster R-CNN; and one-stage detection algorithms, of which YOLO and SSD are typical examples. The leading performance indicators of an object detection model are detection accuracy and speed. For accuracy, object detection should consider the positioning accuracy of the object, not just the classification accuracy [6]. In general, the two-stage algorithm has advantages in accuracy, while the one-stage algorithm has advantages in

speed. A two-stage network generates positions and categories from candidate regions, while a one-stage network does not require the region-proposal stage: position coordinates and category probabilities are generated directly from the picture, and the final detection result is obtained after a single pass. A one-stage detection algorithm uses a neural network as a feature extractor, followed by a convolutional layer, then adds custom convolutional layers (sized according to the number of output categories and anchors), and performs detection directly with convolutions at the end [7]. YOLOv2 introduced a training model and anchor mechanism on the basis of YOLOv1, replaced the backbone network with Darknet-19 (only convolutional and pooling layers), and removed the dropout after each convolutional layer in favor of a Batch Normalization layer. SSD uses multi-scale features for detection, which mitigates the problems that multiple targets cannot all be detected and that detection is difficult when both large and small objects are present. YOLOv3 adds further improvements on top of YOLOv2 to improve model performance. Therefore, we chose to apply the YOLOv3 model to the FPGA platform in this article [8].

A. YOLOv3

YOLOv3 is the third version of the YOLO (You Only Look Once) series of object detection algorithms. Compared with the previous versions, its accuracy is significantly improved, especially for small targets [8] [9]. The basic idea of the YOLOv3 algorithm can be divided into two parts. First, a series of candidate areas is generated on the picture according to specific rules, and the candidate areas are then labeled according to their positional relationship with the ground-truth boxes of the objects in the picture: candidate regions that are sufficiently close to a ground-truth box are marked as positive samples, with the position of the ground-truth box used as the position target of the positive sample, while candidate regions that deviate significantly from every ground-truth box are marked as negative samples, which do not need to predict position or category. Second, a convolutional neural network is used to extract image features and to predict the location and category of the candidate regions [10]. In this way, each prediction box can be regarded as a sample: its label is obtained from the position and category of the real box relative to it, the network predicts the box's position and category, the prediction is compared with the label, and a loss function is then established.
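To make the labeling rule concrete, the sketch below assigns candidate boxes to positive or negative samples by their overlap (IoU) with a ground-truth box. It is a minimal illustration of the idea described above; the thresholds and the ignore band are illustrative assumptions, not values taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def assign_samples(candidates, gt_box, pos_thr=0.7, neg_thr=0.3):
    """Label each candidate region against one ground-truth box."""
    labels = []
    for box in candidates:
        overlap = iou(box, gt_box)
        if overlap >= pos_thr:
            labels.append("positive")   # regress toward the ground-truth position
        elif overlap <= neg_thr:
            labels.append("negative")   # no position or category target needed
        else:
            labels.append("ignore")     # too ambiguous to train on
    return labels
```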
YOLOv3 has three basic components, which are shown in Fig. 1:
• DBL: the basic component of Darknet-53 and the smallest component of the YOLOv3 network structure. Each DBL unit is composed of a convolutional (Conv2D) layer, a batch normalization (BN) layer, and an activation function (Leaky ReLU).
• Resn: n denotes how many res units are contained in the res block, which is a large component of YOLOv3. YOLOv3 borrows the residual structure of ResNet; using this structure allows a deeper network. In the lower right corner of Fig. 1, we can see that the basic component of the res block is also the DBL.
• Concat: tensor splicing. It concatenates an upsampled Darknet middle layer with a later layer.

Fig. 1. YOLOv3 Architecture [11]
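As a sketch of what one DBL unit amounts to, the following PyTorch module chains the convolution, BN, and Leaky ReLU layers described above. The 0.1 negative slope and the bias-free convolution follow common Darknet practice and are our assumptions, not details given in the paper.

```python
import torch.nn as nn

class DBL(nn.Module):
    """Conv2D + Batch Normalization + Leaky ReLU, the smallest YOLOv3 block."""
    def __init__(self, in_ch, out_ch, kernel_size, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                              padding=kernel_size // 2, bias=False)  # BN supplies the offset
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```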
As shown in Fig. 2, YOLOv3 introduces a deeper, more robust feature extractor, Darknet-53. Drawing on the structure of ResNet, it uses a large number of residual skip connections [12]. To reduce the negative effect on gradients caused by pooling, pooling is abandoned altogether and strided convolutions are used for downsampling. Each unit of each feature map corresponds to a different receptive field and predicts three bounding boxes, so the anchor sizes also differ. YOLOv3 uses multi-label classification to adapt to more complex data sets containing many overlapping labels; that is, logistic classifiers are used instead of softmax.

Fig. 2. Darknet-53

B. Vitis AI

Vitis AI is Xilinx's development stack for AI inference on Xilinx hardware platforms: an advanced acceleration library and design tool for algorithm developers conducting deep-learning-related research and development. It is composed of optimized IP, tools, libraries, models, and example designs. It is designed with high efficiency and ease of use in mind and exploits the full potential of AI acceleration on Xilinx FPGAs [13]. As shown in Fig. 3, Vitis AI consists of the following key components:

• AI Model Zoo: A comprehensive set of pre-optimized models that are ready to deploy on Xilinx devices.
• AI Optimizer: An optional model optimizer that can prune a model by up to 90%.
• AI Quantizer: A powerful quantizer that supports model quantization, calibration, and fine-tuning.
• AI Compiler: Compiles the quantized model into a highly efficient instruction set and data flow.
• AI Profiler: Performs in-depth analysis of the efficiency and utilization of an AI inference implementation.
• AI Library: Provides advanced and optimized C++ APIs for AI applications from the edge to the cloud.
• DPU: Efficient and scalable IP cores that can be customized to meet the needs of many different applications.

Fig. 3. Vitis AI: Unified AI Inference Solution Stack

Vitis AI enables software engineers and application developers to use popular frameworks such as TensorFlow, Caffe, and PyTorch while taking advantage of hardware acceleration. Vitis AI provides three principal layers from top to bottom. First, the AI Model Zoo provides standard CNN networks; developers can choose from more than 60 optimized reference models for quick proof of concept or production. The Vitis AI development kit in the middle layer, including Xilinx IR, the Xilinx Compiler, and Xilinx Embedded Software, provides five tools for deploying machine learning networks. AI Optimizer is a license-based tool that can significantly reduce the number of neural network calculations without affecting accuracy; depending on the network, it can increase inference speed by 5 to 20 times. AI Quantizer converts FP32 models into INT8 or even lower-precision models for FPGA deployment. The Vitis AI software library, which supports all models in the Model Zoo as well as custom models, provides open-source, pre-optimized software APIs and functions that implement machine learning pre-processing, call the DPU for inference, and process the results to obtain the final output. The DPU at the bottom is an efficient hardware overlay for machine inference; Vitis AI provides different types of DPUs for different AI workloads, such as CNN, LSTM, BERT, and MLP tasks.

III. ACCELERATOR DESIGN

Generally, the design goals of a neural network inference accelerator should consider two indicators: high speed (high throughput and low latency) and high energy efficiency [14].

The throughput of the neural network accelerator can be expressed by Equation (1). If the selected FPGA chip has limited resources, we can increase the number of computing units by reducing the size of each computing unit, or increase the operating frequency of the accelerator to approach peak performance. However, reducing the size of the computing unit usually sacrifices model accuracy, and the operating frequency is largely fixed by the hardware. Therefore, to improve throughput and reduce latency, a reasonable pipelined parallel implementation and an effective memory system are particularly important.

IPS = OPS_act / W = (OPS_peak × η) / W = (f × P × η) / W    (1)

where IPS represents the throughput of the system, measured as the number of inference processes per second (its unit is s^-1); W is the workload for each inference, in GOP; OPS_peak represents the peak performance of the accelerator and OPS_act its run-time performance; η is the utilization rate of the computing units, measured as the average ratio of working computing units to all computing units in each inference cycle; f is the working frequency of the computation units; and P is the number of computation units in the hardware design.

The hardware implementation of the FPGA-based YOLOv3 accelerator realizes parallel processing across different inputs. The latency of the accelerator can be expressed by Equation (2). Standard parallel designs include pipelining and batch processing, which are usually considered together with loop unrolling.

L = C / IPS    (2)

where L represents the latency for processing each inference, in seconds, and C represents the concurrency of the accelerator.

Energy efficiency is another goal of an FPGA-based neural network hardware accelerator. For neural network hardware accelerators, energy efficiency is defined as in Equation (3), and the total energy consumption of the accelerator is shown in Equation (4). We can reduce the overall energy consumption of the FPGA by reducing the bit width used for calculation and by reducing the dynamic energy consumption of memory accesses.

Eff = W / E_total    (3)

E_total ≈ W × E_op + N_SRAM_acc × E_SRAM_acc + N_DRAM_acc × E_DRAM_acc + E_static    (4)

where Eff represents the energy efficiency of the system; E_total represents the total system energy cost for each inference; E_static is the static energy cost of the system for each inference; and E_op is the average energy cost of a single operation in each inference. N_x_acc is the number of bytes accessed from memory (x can be SRAM or DRAM), while E_x_acc is the energy for accessing each byte of that memory.
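To see how Equations (1)-(3) combine, the short sketch below turns the symbols into functions. The assumption that every computing unit contributes one operation per clock cycle, and the example numbers, are ours, not measurements from the paper.

```python
def ips(f_hz, p_units, utilization, workload_gop):
    """Eq. (1): throughput in inferences per second."""
    ops_act = f_hz * p_units * utilization   # OPS_act = OPS_peak * eta, OPS_peak = f * P
    return ops_act / (workload_gop * 1e9)    # workload W converted from GOP to operations

def latency(concurrency, ips_value):
    """Eq. (2): seconds per inference, L = C / IPS."""
    return concurrency / ips_value

def energy_efficiency(workload_gop, e_total_joules):
    """Eq. (3): workload per joule, Eff = W / E_total (GOP/J)."""
    return workload_gop / e_total_joules

# A hypothetical 300 MHz design with 2048 units at 80% utilization on a
# 5.5 GOP inference would reach roughly 89 inferences per second.
print(ips(300e6, 2048, 0.8, 5.5))
```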
A. Data Access Pattern

For the FPGA implementation of YOLOv3, the hardware accelerator first reads the corresponding input sliding cube and the weights for input images of different sizes, then calculates the output sliding cube and writes it back to the off-chip memory [15].

In Vitis AI 1.3, a Yaml file was introduced to maintain all models; each model has its own Yaml file. For the YOLOv3 model used in this paper, as shown in Fig. 4, the file describes comprehensive information such as the training data set, input size, complexity, framework, and download links.

Fig. 4. Yaml File of YOLOv3

B. Processing Flow

When the image or video stream is input from the sensor, it undergoes a series of system-level pre-processing steps. As shown in Fig. 5, the right-hand block in the middle is the backbone of the model, after which some post-processing, such as drawing bounding boxes, must be performed. As shown in Fig. 6, the part in the large pink area is the functionality provided by the Vitis AI library for performance optimization. Because pre-processing and post-processing are usually done on the ARM processor, which is not very powerful, the Vitis AI library optimizes and improves their performance algorithmically.

Fig. 5. Processing Flow

Fig. 6. Vitis AI Provides

C. Pipeline Design

Pipeline design is a method of systematically dividing combinational logic, inserting registers between the various parts (stages), and temporarily storing intermediate data. The purpose is to decompose a large operation into several small operations. Each small operation takes less time, so the clock frequency can be increased, and the small operations can be executed in parallel, so the data throughput rate can be improved [16]. In this paper, the pipeline design shortens the path a given signal traverses within one clock cycle, increases throughput, and reduces latency.
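The arithmetic behind this claim fits in a few lines: with stage delays t_1, ..., t_k, a non-pipelined design needs n × (t_1 + ... + t_k) for n items, while a pipeline needs (t_1 + ... + t_k) + (n - 1) × max(t_i), so the slowest stage sets the steady-state rate. The sketch below uses hypothetical stage delays and is a timing model only, not a model of the actual accelerator.

```python
def total_time(n_items, stage_times, pipelined=True):
    """Time for n_items to traverse a chain of stages (delays in seconds)."""
    if pipelined:
        # Stages overlap: after the first item fills the pipe, one result
        # emerges per max(stage_times).
        return sum(stage_times) + (n_items - 1) * max(stage_times)
    # Sequential: each item passes every stage before the next one starts.
    return n_items * sum(stage_times)

stages = [2e-9, 3e-9, 2e-9]               # hypothetical register-to-register delays
print(total_time(1000, stages, False))    # 7.0e-06 s sequential
print(total_time(1000, stages, True))     # ~3.0e-06 s, bounded by the 3 ns stage
```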

D. Data Quantization

One motivation for data quantization is to express the entire network with low-bit values, which achieves compression and reduces the demand for storage space. The other is that the DSP units in the FPGA are better at handling fixed-point operations [17]. One of the most commonly used methods of model compression is the quantization of weights and activations.

Fig. 7. AI Quantizer

In the Vitis AI workflow, the AI Quantizer is an essential component. Because the FPGA only supports INT8 calculations, the AI Quantizer converts 32-bit floating-point values to 8-bit fixed-point values while minimizing the loss of precision during the conversion [18]. In this paper, FP32 calculations are converted to INT8 calculations. The AI Quantizer can reduce computational complexity without loss of prediction accuracy. Compared with the floating-point model, the fixed-point network model requires less memory bandwidth and provides faster speed and higher power efficiency.
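To show the kind of mapping involved, here is a generic symmetric linear INT8 quantization in NumPy. The Vitis AI Quantizer additionally performs calibration and fine-tuning, so this is only the basic idea, not its implementation.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric linear quantization of an FP32 tensor to INT8."""
    scale = np.abs(x).max() / 127.0                  # largest magnitude maps to 127
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

weights = np.random.randn(64, 3, 3, 3).astype(np.float32)
q, scale = quantize_int8(weights)
max_err = np.abs(weights - dequantize(q, scale)).max()   # worst case is about scale / 2
```

The INT8 tensor needs a quarter of the storage and bandwidth of FP32, which is where the memory-bandwidth saving mentioned above comes from.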

E. Module Optimize

There are two ways to achieve neural network model compression. One is to reduce the bit width of activations and weights. Pruning is the other, and also the most commonly used, method of model compression. The goal is to obtain smaller and more efficient neural networks: a compressed neural network runs faster and reduces the computational cost of training the network.

As shown in Fig. 8, the AI Optimizer component in Vitis AI integrates leading model compression technology, which can reduce model complexity by 5 to 50 times with minimal impact on accuracy. Deep compression lifts the performance of AI inference to a new level [18]. In this paper, we use this component to prune the model and thereby compress it.

Fig. 8. AI Optimizer

IV. SYSTEM DESIGN

As shown in Fig. 9, the FPGA hardware accelerator system designed in this paper is based on an ARM + FPGA architecture, which mainly includes external double-data-rate memory (DDR), YOLO network direct memory access (YOLO DMA), on-chip caches, and the programmable logic (PL) accelerator. The accelerator mainly comprises processing element (PE) modules, on-chip buffers, and programmable network logic modules. The design of the PE module is conducive to reducing data movement, reducing the number of off-chip memory accesses, and improving data reuse; the PE module can perform convolution, pooling, and activation operations [19]. On-chip and off-chip components are interconnected via the AXI bus. The host PC is responsible for task scheduling in the entire system: it issues workloads or instructions and monitors the working status. For each input image, the system reads the input and the weights and stores them in the external DDR memory; the accelerator then reads and writes the corresponding input data from the DDR. Through the AXI bus, the hardware accelerator communicates with the ARM core and receives configuration signals.

Fig. 9. System architecture
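The data-reuse idea behind the PE and buffer design can be imitated in software: fetch a tile from external memory once, then let the computation consume it entirely from the local buffer, so off-chip traffic scales with the data size rather than with the number of operations touching it. The sketch below is only an analogy for this access pattern, with an arbitrary tile size; it does not describe the actual RTL.

```python
import numpy as np

def tiled_sum(feature, tile=32):
    """Walk a feature map tile by tile, mimicking DDR -> on-chip buffer -> PE."""
    h, w = feature.shape
    acc = 0.0
    for i in range(0, h, tile):
        for j in range(0, w, tile):
            on_chip = feature[i:i + tile, j:j + tile]  # one buffer fill from DDR
            acc += float(on_chip.sum())                # the PE consumes it in place
    return acc

feature = np.random.rand(224, 224).astype(np.float32)
print(tiled_sum(feature))
```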
V. EVALUATION

In this paper, we implement the YOLOv3 network model on a Xilinx ZCU104. First, we compile and generate the executable files for the ZCU104 on an Ubuntu system with the following commands: test_jpeg_yolov3, test_video_yolov3, test_performance_yolov3, and test_accuracy_yolov3. The test_jpeg function takes an image as input and returns the resulting image with bounding boxes, or the segmented image with dots; almost all image file formats are supported. The test_video function works similarly but takes a video clip as input. The test_performance function takes an image as input and calculates the average execution time over 100 or 1000 running cycles; the command-line option -t selects single-threaded or multi-threaded operation. The test_accuracy function does not output the accuracy value directly but writes the result to a text file [20]. In the AI Library, the YOLOv3 network has four trained models for vehicle detection: yolov3_voc_tf, yolov3_voc, yolov3_bdd, and yolov3_adas_pruned_0_9. The following experiments are performed on these four models.

A. Training of YOLOv3

We use the object detection algorithm based on YOLOv3 and evaluate it on the PASCAL VOC2007 and PASCAL VOC2012 data sets. The ideal output of YOLO is a bounding box for each object. During training, we use random data augmentation, color cast, and other standard transformations. In the quantization calibration process of the experiment, we analyzed the activation distribution with a small set of unlabeled images. After processing by each component in the Vitis AI library, a small loss of accuracy was introduced while a large compression rate was maintained.

B. Test Result

The weights for FPGA board detection are obtained by GPU training and conversion, so there is little difference in detection accuracy between the FPGA and the GPU. As shown in Fig. 10, (a), (b), (c), and (d) are the result graphs for the input image under the different YOLOv3 training models yolov3_voc_tf, yolov3_voc, yolov3_bdd, and yolov3_adas_pruned_0_9; the four models show 4, 5, 8, and 11 detection boxes respectively. Each model has different training results, so the detection results also differ. As shown in Fig. 11, when we run the detection function we get four output lines, indicating that four cars have been detected. In Fig. 11, RESULT: 6 is the category of the vehicle, followed by the coordinates of the bounding box, and the last number (for example, 0.939764 in the first line of output) is the confidence that the object is recognized as a car.

Fig. 10. Output Image

Fig. 11. Output Result
C. Performance Comparison

In this paper, we conduct comparative experiments between the hardware implementation and the same model implemented on a GPU, comparing mainly performance and power consumption.

The Xilinx Zynq ZCU104 is a development platform based on an Embedded Vision Low Cost (EVLC) SoC that targets machine vision, autonomous driving, AR/VR, and similar applications. A Zynq UltraScale+ XCZU7EV-2FFVC1156 MPSoC is installed on the ZCU104 board, integrating a powerful processing system (PS) and programmable logic (PL) in the same device. The ZU7EV core integrates a 1.5 GHz quad-core ARM Cortex-A53 (PS) and a 600 MHz dual-core ARM Cortex-R5 (RPU), and the drive clock is 300 MHz. This kind of computing power on the FPGA greatly promotes heterogeneous multiprocessing in complex computer vision applications [20]. The Xilinx ZCU104 and its experimental setup are shown in Fig. 12. The GPU used in this comparative experiment is a GeForce GTX1080 with a memory frequency of 10 GHz, 2560 CUDA cores, 8 GB of GDDR5X video memory, a 256-bit memory interface, and 320 GB/s of bandwidth.

Fig. 12. Experimental Device

We run the performance function loaded on the ZCU104 board from the command line and obtain the performance figures for single-threaded and multi-threaded execution. As shown in Fig. 13, -t on the command line specifies the number of threads. The throughput of the hardware accelerator is measured as the number of images processed per second. Comparing the single-threaded and multi-threaded runs of the performance function, we find FPS = 84.56 in single-threaded operation and FPS = 206.70 in eight-threaded operation. The performance comparison between the FPGA hardware implementation of YOLOv3 and the GPU platform (taking the GeForce GTX1080 as an example) is shown in Table I. To factor out differences in memory reads across machines, we first read a large image into memory and then split it into sliding windows that are sent to the network for testing.

Fig. 13. Performance of YOLOv3
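A host-side throughput measurement of this kind can be sketched as follows. It mirrors the images-per-second definition used here, but it is our own illustration, not the code of test_performance_yolov3; dpu_run in the usage line is a hypothetical inference callable.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def measure_fps(run_inference, n_images=1000, threads=8):
    """Images per second over n_images with `threads` concurrent workers."""
    start = time.time()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Several host threads keep the accelerator busy, which is why the
        # multi-threaded FPS above is far higher than the single-threaded one.
        list(pool.map(lambda _: run_inference(), range(n_images)))
    return n_images / (time.time() - start)

# fps = measure_fps(dpu_run, n_images=1000, threads=8)
```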

TABLE I
PERFORMANCE OF FPGA VS GPU

                            FPGA (Xilinx ZCU104)          GPU
Platform                    Zynq UltraScale+              GeForce GTX1080
Frequency (MHz)             300                           10240
Precision                   INT8                          FP32
GOPs                        5.5                           2.45
FPS                         84.5518 (single thread)       31.7
                            206.701 (multiple threads)
Power (W)                   25                            126
Energy Efficiency (FPS/W)   3.38                          0.26

Table I lists the comparison between the hardware acceleration platform designed here and the GPU. Compared with the GeForce GTX1080, our Vitis-AI-based YOLOv3 accelerator achieves higher performance, throughput, and power efficiency. The performance and energy efficiency results of the FPGA-based YOLOv3 implementation and the GPU implementation are both presented in Table I. Because heterogeneous quantization makes full use of hardware resources and design parallelism, performance is improved by 2.4 times and energy efficiency by 13 times. On the PASCAL VOC data set, compared with other data sets, the performance of our FPGA YOLOv3 is improved by at least 2 times. Compared with the network structure on the GPU, while ensuring that the accuracy is not greatly affected, the key to improving FPGA performance and power consumption is that pruning and quantization reduce computational complexity. At the same time, we should point out that the inference latency of the FPGA has increased.
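The Energy Efficiency row is simply FPS divided by power, which is easy to check from the table's own numbers (the GPU entry rounds 31.7/126 ≈ 0.25 up to 0.26):

```python
fpga_fps, fpga_power = 84.5518, 25.0   # single-thread FPS and power from Table I
gpu_fps, gpu_power = 31.7, 126.0

fpga_eff = fpga_fps / fpga_power       # 3.38 FPS/W
gpu_eff = gpu_fps / gpu_power          # ~0.25 FPS/W
print(fpga_eff / gpu_eff)              # ~13, the energy-efficiency gain quoted above
```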
VI. CONCLUSION

In this paper, we designed and implemented a reconfigurable FPGA-based hardware accelerator for the YOLOv3 network model. At the same time, a series of operations performed with Vitis AI, such as task pipelining, data quantization, model compression, and data pre-processing, reduces the network scale and the accelerator's off-chip memory access time. Where the GPU reaches 31.7 FPS, the FPGA hardware implementation of the accelerator obtains 84.5 FPS single-threaded and 206.70 FPS multi-threaded. In follow-up research, we will continue to study the versatility of FPGA hardware accelerators, simplify the model deployment process so that a variety of neural network models can be deployed quickly and easily, and achieve high-performance, low-power execution of neural network models.

ACKNOWLEDGEMENT

The work described in this paper was supported by the National Science Foundation of China under Grant 61876105 and the Ministry of Education Industry-University Cooperation and Collaborative Education Project under Grant 201902097014.

REFERENCES

[1] Y. Lecun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov 1998.
[2] A. Krizhevsky and I. Sutskever, "ImageNet classification with deep convolutional neural networks," in Proc. Advances in Neural Inf. Process. Syst., pp. 1097-1105, 2012.
[3] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger." [Online]. Available: https://arxiv.org/abs/1612.08242, Dec 2016.
[4] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. Kyung Kim and Chenkai Shao, "From high-level deep neural models to FPGAs," in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchitecture, Art. no. 17, Oct 2016.
[5] L. Lu, Y. Liang, Q. Xiao and S. Yan, "Evaluating fast algorithms for convolutional neural networks on FPGAs," in Proc. IEEE 25th Annu. Int. Symp. Field-Program. Custom Comput. Mach. (FCCM), pp. 101-108, Apr 2017.
[6] R. Girshick, J. Donahue, T. Darrell and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," pp. 580-587, Nov 2013.
[7] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan and S. Belongie, "Feature pyramid networks for object detection," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 2117-2125, Feb 2017.
[8] J. Redmon and A. Farhadi, "YOLOv3: An Incremental Improvement," IEEE Trans. Pattern Anal., pp. 1125-1131, Apr 2018.
[9] Zhong-Qiu Zhao, Peng Zheng, Shou-tao Xu and Xindong Wu, "Object Detection with Deep Learning: A Review," IEEE Trans. Neural Netw. Learn. Syst., pp. 3212-3232, Apr 2019.
[10] Yi Zhang, Yongliang Shen and Jun Zhang, "An improved tiny-YOLOv3 pedestrian detection algorithm," Optik - International Journal for Light and Electron Optics, vol. 183, pp. 17-23, Feb 2019.
[11] Qiwei Wang, Shusheng Bi, Minglei Sun, Yuliang Wang, Di Wang and Shaobao Yang, "YOLOv3 architecture," PLOS ONE, Figure, 2019. https://doi.org/10.1371/journal.pone.0218808.g005
[12] Yunong Tian, Guodong Yang, Zhe Wang, Hao Wang, En Li and Zize Liang, "Apple detection during different growth stages in orchards using the improved YOLO-V3 model," Computers and Electronics in Agriculture, pp. 417-426, Jan 2019.
[13] Xilinx, Vitis-AI README. https://github.com/Xilinx/Vitis-AI
[14] Kaiyuan Guo, Shulin Zeng, Jincheng Yu, Yu Wang and Huazhong Yang, "A Survey of FPGA-based Neural Network Inference Accelerators," ACM Trans. Reconfigurable Technol. Syst., Article 2, 26 pages, Mar 2019.
[15] D. T. Nguyen, T. N. Nguyen, H. Kim and H. Lee, "A High-Throughput and Power-Efficient FPGA Implementation of YOLO CNN for Object Detection," IEEE Trans. Very Large Scale Integration (VLSI) Systems, vol. 27, no. 8, pp. 1861-1873, Aug 2019.
[16] Sathaporn Visakhasart and Orachat Chitsobhuk, "Multi-pipeline Architecture for Face Recognition on FPGA," 2009 International Conference on Digital Image Processing, Aug 2009.
[17] R. Ding, G. Su, G. Bai, W. Xu, N. Su and X. Wu, "A FPGA-based Accelerator of Convolutional Neural Network for Face Feature Extraction," 2019 IEEE International Conference on Electron Devices and Solid-State Circuits (EDSSC), pp. 1-3, Xi'an, China, Jul 2019.
[18] Xilinx Inc., Adaptable and Real-Time AI Inference Acceleration. https://www.xilinx.com/products/design-tools/vitis/vitis-ai.html
[19] Jialiang Zhang and Jing Li, "Improving the performance of OpenCL-based FPGA accelerator for convolutional neural network," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA '17), pp. 25-34, Feb 2017.
[20] Xilinx Inc., ZCU104 Board User Guide, UG1267 (v1.1), Oct 2018.
