An FPGA-Based Reconfigurable CNN Accelerator For YOLO
Shiguang Zhang1, Jian Cao1*, Quan Zhang1, Qi Zhang1, Ying Zhang1, Yuan Wang2
1. School of Software and Microelectronics, Peking University
2. School of Electronics Engineering and Computer Science, Peking University
Beijing, China
*Corresponding e-mail: [email protected]
Abstract—Convolutional neural networks (CNNs) are widely used in image processing. CNN-based object detection models such as YOLO and SSD have proved to be state-of-the-art in many applications. CNNs place extremely high demands on computing power and memory bandwidth, so they usually need to be deployed on dedicated hardware platforms. The FPGA offers great advantages in reconfigurability and performance-per-watt, which makes it a suitable choice for deploying CNNs. In this paper, we propose a reconfigurable CNN accelerator with an AXI bus based on an ARM + FPGA architecture. The accelerator receives configuration signals sent by the ARM and performs the inference computation of different CNN layers in a time-shared manner. By combining the convolution and pooling operations, the data movement between the convolutional and pooling layers is reduced, which lowers the number of off-chip memory accesses. Floating-point numbers are converted into a 16-bit dynamic fixed-point format, which improves computation performance. We implemented the proposed architecture on the Xilinx ZCU102 FPGA for the YOLOv2 and YOLOv2-Tiny models on COCO and VOC 2007 respectively, with a peak performance of 289 GOPs at a 300 MHz clock frequency.

Keywords-YOLO; FPGA; reconfigurable; dynamic fixed-point

I. INTRODUCTION

With the development of convolutional neural network (CNN) research, new network structures are continually being proposed. From LeNet-5 and AlexNet to R-CNN, YOLO and SSD [1]-[5], the functionality of CNNs has evolved from simple classification to object detection, and CNNs are now widely applied in computer vision. With this growth in functionality, the depth and scale of CNNs are also increasing rapidly, and deploying CNNs efficiently on hardware has become a very active research topic.

Compared with the central processing unit (CPU) and the graphics processing unit (GPU), the FPGA offers high energy efficiency, low power consumption and hardware programmability [6]. Compared with an ASIC, the FPGA offers a short development cycle, high flexibility and low development cost. The FPGA has therefore become a very suitable hardware platform for CNN deployment. CNNs are both compute- and memory-intensive, which poses many challenges for FPGA and embedded deployment. The time cost of the large number of arithmetic operations and the time cost caused by data movement and memory accesses must all be considered carefully in a hardware deployment. To address these problems, a large body of work has studied model compression (including pruning and quantization) [7][8], efficient model design [9][10] and efficient hardware architecture design [11]-[15].

In this paper, we propose a reconfigurable CNN accelerator that can adapt to different networks through configuration signals. The main contributions of this paper are summarized as follows:
• We design a reconfigurable CNN accelerator based on an ARM + FPGA architecture using High-Level Synthesis (HLS), which fully pipelines every stage.
• Based on block-based computation, we propose a method for combining convolutional (Conv) and pooling layers.
• A careful dynamic fixed-point strategy is used to transform the parameters from floating point into 16-bit fixed point with multiple scaling factors.
• A line buffer is designed so that the data encoding and decoding time overlaps the off-chip memory access time.

The rest of this article is organized as follows. Section II introduces the research background and discusses related work. Section III describes the reconfigurable CNN accelerator architecture and its internal details. Section IV introduces the CNN accelerator system design based on the ARM + FPGA architecture. Section V reports the experimental results of implementing the accelerator on the ZCU102 platform. Section VI concludes the article.

II. BACKGROUND

CNN-based object detection systems can be divided into two types. Two-stage models include R-CNN, Fast R-CNN and Faster R-CNN [3][17][18]; one-stage models include YOLO and SSD. A one-stage model does not need a region-proposal stage and directly generates the category probabilities and position coordinates of objects, so it is faster. A two-stage model first extracts region proposals through a Region Proposal Network (RPN) and then classifies them, so it is slower than a one-stage model but has higher detection accuracy.
As research progresses, the relatively low detection accuracy of one-stage models keeps improving, which makes them suitable for deployment on embedded devices. Among the one-stage models, SSD does not support multi-scale training and its generalization ability is worse than that of YOLOv2 and YOLOv3 [19]; YOLOv3 has a more complex network structure and is slower than YOLOv2. Therefore, YOLOv2 and YOLOv2-Tiny [20] are selected for deployment on the FPGA platform.

TABLE I. YOLOV2-TINY NETWORK STRUCTURE

Layer   Filters   Size / Stride / Padding   Input      Output
Conv1   16        3×3 / 1 / 1               416×416    416×416
Pool1   -         2×2 / 2 / 0               416×416    208×208
Conv2   32        3×3 / 1 / 1               208×208    208×208
Pool2   -         2×2 / 2 / 0               208×208    104×104
Conv3   64        3×3 / 1 / 1               104×104    104×104
Pool3   -         2×2 / 2 / 0               104×104    52×52
Conv4   128       3×3 / 1 / 1               52×52      52×52
Pool4   -         2×2 / 2 / 0               52×52      26×26
Conv5   256       3×3 / 1 / 1               26×26      26×26
Pool5   -         2×2 / 2 / 0               26×26      13×13
Conv6   512       3×3 / 1 / 1               13×13      13×13
Pool6   -         2×2 / 1 / 0               13×13      13×13
Conv7   1024      3×3 / 1 / 1               13×13      13×13
Conv8   512       3×3 / 1 / 1               13×13      13×13
Conv9   425       1×1 / 1 / 0               13×13      13×13

A. YOLOv2 and YOLOv2-Tiny

YOLOv2 was proposed by Joseph Redmon et al. [4]. It removes the fully connected layers of YOLO and uses the anchor boxes of Faster R-CNN to predict bounding boxes [17]. YOLOv2 is based on Darknet-19 (19 Conv layers and 5 max-pooling layers), adds three 3×3 Conv layers and a 1×1 Conv layer, and introduces a route (passthrough) layer similar to the shortcut connections of ResNet [20]. The input image is divided into S×S cells; each cell predicts 5 bounding boxes, and each bounding box predicts (C + 5) numbers: the probabilities of the C classes, the position and size of the bounding box, and the probability that the box contains a real object. The network output size is therefore S×S×5×(5+C).

YOLOv2-Tiny differs from YOLOv2 only in that the network is more streamlined; the pre-processing and post-processing are the same. For reasons of space, we only show the network structure of YOLOv2-Tiny in Table I. YOLOv2-Tiny has 9 Conv layers and 6 max-pooling layers, and each of the first six Conv layers is followed by a max-pooling layer.

B. Basics of CNN

1) Conv layer: In a Conv layer, multiple convolution kernels extract higher-level features from the input feature maps. The number of convolution kernels determines the number of output feature maps. The kernel size K×K is generally 1×1 or 3×3. To keep the sizes of the input and output feature maps consistent, a padding operation is often performed around the input feature maps, with padding (K-1)/2. The Conv layer can be expressed as

y_{f,h,w} = \sum_{c=0}^{C-1} \sum_{j=0}^{K-1} \sum_{i=0}^{K-1} w_{f,c,j,i} \, x_{c,\, h \times S + j,\, w \times S + i} + b_f ,    (1)

where y_{f,h,w} is the value at row h and column w of the f-th output feature map, C is the number of input feature maps, K is the convolution kernel size, and S is the stride. The convolution result usually needs to be activated; the activation function of YOLOv2 and YOLOv2-Tiny is Leaky ReLU.
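For reference, the following C sketch is a direct (software) implementation of (1) with the Leaky ReLU activation applied afterwards. It assumes the input has already been zero-padded and stored in row-major order; the array layout and the 0.1 slope are illustrative assumptions rather than details fixed by the accelerator.

/* Reference Conv layer following (1): x is C x H_in x W_in (already padded),
 * w is F x C x K x K, b has F entries, y is F x H_out x W_out. */
static float leaky_relu(float v) { return v > 0.0f ? v : 0.1f * v; }

void conv_layer(const float *x, const float *w, const float *b, float *y,
                int C, int F, int K, int S,
                int H_in, int W_in, int H_out, int W_out)
{
    for (int f = 0; f < F; f++)                    /* output feature maps */
        for (int h = 0; h < H_out; h++)            /* output rows         */
            for (int wo = 0; wo < W_out; wo++) {   /* output columns      */
                float acc = b[f];
                for (int c = 0; c < C; c++)        /* input feature maps  */
                    for (int j = 0; j < K; j++)
                        for (int i = 0; i < K; i++)
                            acc += w[((f * C + c) * K + j) * K + i] *
                                   x[(c * H_in + h * S + j) * W_in + wo * S + i];
                y[(f * H_out + h) * W_out + wo] = leaky_relu(acc);
            }
}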
2) Pooling layer: Max pooling is used to down-sample the feature maps, which improves the robustness of feature extraction. A K×K sampling window is applied and the maximum value inside the window is kept. The sampling window is generally 2×2, and the stride is generally 1 or 2.

3) Batch Normalization (BN) layer: The BN layer significantly reduces the training time of deep CNNs [13] and counteracts the shift of the data distribution in the intermediate layers during training. For a mini-batch B = {x_1, x_2, x_3, ..., x_m}, the mean \mu_B and variance \sigma_B^2 of all input samples are computed. To recover the feature distribution of the original network, the learnable parameters \gamma and \beta are introduced, and the output is

y_b = \gamma \, (x_i - \mu_B) / \sqrt{\sigma_B^2 + \varepsilon} + \beta ,    (2)

where \varepsilon is a constant added to the mini-batch variance for numerical stability [22]. The BN layer is located in front of the activation layer. When deployed in hardware, the BN layer can generally be fused into the weights or bias [13].
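As an illustration of this fusion, the sketch below folds the BN parameters of one output channel into the preceding Conv layer; the per-channel argument layout is only an assumption made for the example.

#include <math.h>

/* Fold BN (2) into the Conv layer for one output channel f:
 *   gamma * (conv(x) + b - mu) / sqrt(var + eps) + beta  ==  conv'(x) + b'
 * with w' = w * s and b' = (b - mu) * s + beta, where s = gamma / sqrt(var + eps). */
void fold_bn_channel(float *w_f, float *b_f, int weights_per_filter,
                     float gamma, float beta, float mu, float var, float eps)
{
    float s = gamma / sqrtf(var + eps);
    for (int i = 0; i < weights_per_filter; i++)
        w_f[i] *= s;                       /* scale the C*K*K weights of filter f        */
    *b_f = (*b_f - mu) * s + beta;         /* fold the mean shift and beta into the bias */
}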
C. Related Work

In recent years, much progress has been made in deploying CNN-based object detection models on hardware. Through model compression, SqueezeNet keeps AlexNet-level accuracy with a model of less than 0.5 MB [10]. MobileNet adopts depthwise separable convolutions, and ShiftNet replaces spatial convolution with a simple shift operation [9][23]; these efficient models reduce both the computation and the model size. The works in [11]-[16] explore the parallelism space of convolution; in particular, [11] proposes dynamically combining intra-convolution, intra-output and inter-output parallelism, which greatly improves computing performance. In this paper, we focus on the following three points: (1) exploring the hardware architecture and the parallelism space; (2) quantizing the model while maintaining accuracy; and (3) reducing the data bandwidth, i.e., reducing the bit widths of the weights and intermediate data so as to lower the number of off-chip memory accesses.

III. ACCELERATOR DESIGN

A. Data Access Pattern

More than 90% of the computing time of a CNN is spent in the Conv layers, which consist of multiple nested loops.
Each layer of a large-scale CNN involves a large amount of computation and a large number of parameters, and it is impossible for an embedded device such as an FPGA to process a whole Conv layer at once. As shown in Fig. 1, an output sliding cube is therefore defined based on block-based computation; it slides over the output feature maps [12]. The accelerator reads the corresponding input sliding cube and weights, computes one output sliding cube, and writes it back to off-chip memory.

The accelerator produces the output sliding cubes in turn, following the loop order of Fig. 1, until the whole layer is completed. For a Conv layer that is followed by a pooling layer, max pooling is performed before the output sliding cube is written back to off-chip memory. For example, for Conv2 and Pool2 in Table I, the accelerator reads the m-th input sliding cube of Conv2 and directly obtains the m-th output sliding cube of Pool2, which effectively reduces the number of off-chip memory accesses.

Here W/k, H/k and F are the width, height and number of the output feature maps respectively, and C is the number of input feature maps. The size of the input buffer is T_ih × T_iw × T_c, the size of the weight buffer is T_f × T_c × K × K, and the size of the output buffer is T_oh × T_ow × T_f. When the Conv layer is followed by a pooling layer, k = 2; otherwise k = 1. To complete one Conv layer, the accelerator needs to read the input feature maps F/T_f times and the weights (W/T_w × H/T_h) times from off-chip memory.
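The combination of Conv and pooling described above can be sketched as follows: once the convolution results of one output sliding cube are available on chip, a 2×2/2 max pooling is applied and only the pooled (four times smaller) cube is written back, so the following pooling layer never touches off-chip memory. The tile constants follow Section III-C; the function and array names are illustrative.

#define TF  64   /* Tf: output-feature-map tile */
#define TOH 26   /* Toh: output-cube height     */
#define TOW 52   /* Tow: output-cube width      */

/* Fuse 2x2/2 max pooling into the write-back of one output sliding cube:
 * conv_cube holds the on-chip convolution results, pooled_cube is the cube
 * that is actually transferred back to off-chip memory. */
void fuse_pool_before_writeback(const short conv_cube[TF][TOH][TOW],
                                short pooled_cube[TF][TOH / 2][TOW / 2])
{
    for (int f = 0; f < TF; f++)
        for (int h = 0; h < TOH; h += 2)
            for (int w = 0; w < TOW; w += 2) {
                short m = conv_cube[f][h][w];
                if (conv_cube[f][h][w + 1]     > m) m = conv_cube[f][h][w + 1];
                if (conv_cube[f][h + 1][w]     > m) m = conv_cube[f][h + 1][w];
                if (conv_cube[f][h + 1][w + 1] > m) m = conv_cube[f][h + 1][w + 1];
                pooled_cube[f][h / 2][w / 2] = m;
            }
}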
B. Overall Structure of the FPGA Accelerator

The overall architecture of the accelerator is shown in Fig. 2. It mainly consists of the processing element (PE) module, the on-chip buffers and the DMA module, and it completes the computation of one sliding cube at a time. The DMA module moves data between off-chip memory and the on-chip buffers and performs data scheduling and format conversion. The on-chip buffers include the input buffer, the output buffer and the weight buffer. The PE module performs the convolution, pooling and activation operations.

C. PE Module

Under the premise of minimizing data movement and off-chip memory accesses and improving data reuse, we explore the design space. The size of the sliding cube is determined as T_oh = 26, T_ow = 52, T_c = 8 and T_f = 64. For the exploration of the parallelism space, the intra-output and inter-output parallelism of [12] are applied. As shown in Fig. 3, the PE module contains T_f PE units; each PE unit processes T_c input feature maps and produces one output feature map, so for one sliding cube the PE module processes T_c input feature maps and T_f output feature maps at the same time.

1) Convolution Units: The parallelism of the convolution units is T_c × T_f, consisting of T_c × T_f multiplier-and-shifter groups and T_f adder trees. This corresponds to the Conv layer computation shown in Fig. 3.

for (h = 0, Th_min = H; h < H/Th; h++, Th_min -= Th)
  for (w = 0, Tw_min = W; w < W/Tw; w++, Tw_min -= Tw)
    for (f = 0; f < F/Tf; f++)
      for (c = 0; c < C/Tc; c++) {
        read_data();
        compute();
        write_data();
      }

compute():
  for (i = 0; i < K; i++)
    for (j = 0; j < K; j++) {
      // load weights
      for (th = 0; th < min(Th, Th_min); th++)
        for (tw = 0; tw < min(Tw, Tw_min); tw++) {
          #pragma HLS PIPELINE
          for (tf = 0; tf < Tf; tf++) {
            #pragma HLS UNROLL
            for (tc = 0; tc < Tc; tc++) {
              #pragma HLS UNROLL
              // Initialization: output() = (i, j, c == 0) ? bias() : output();
              output(tf, th, tw) += input(tc, th + i, tw + j) * weight(tf, tc, i, j);
            }
          }
        }
    }

Figure 3. Pseudo-code for the convolution units.

2) Pooling and Activation Units: The pooling layers of YOLO use max pooling, and the activation layers use the Leaky ReLU function.
For the pooling, we add two line buffers of size T_ow to each PE unit. The input of the line buffer is the output of the convolution; the output and the input of the line buffer are fed to a comparator to complete the pooling. For the Leaky ReLU function, we use a multiplication and a shift instead of a fractional multiplication.
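The sketch below illustrates both ideas on 16-bit data. It uses a single line buffer for brevity (the design uses two per PE unit), and the constants 26 and 8, which approximate the 0.1 slope as 26/256, are an assumption; the paper only states that the fractional multiplication is replaced by a multiplication and a shift.

#include <stdint.h>

/* Shift-based Leaky ReLU: 0.1 * x is approximated by (x * 26) >> 8. */
static int16_t leaky_relu_fixed(int16_t x)
{
    return (x >= 0) ? x : (int16_t)(((int32_t)x * 26) >> 8);
}

#define TOW 52
/* 2x2/2 max pooling with a line buffer of width Tow: an even row of the
 * convolution output is only buffered; an odd row is compared against the
 * buffered row and one pooled value is emitted per 2x2 window. */
void pool_row(const int16_t conv_row[TOW], int row,
              int16_t line_buf[TOW], int16_t pooled_row[TOW / 2])
{
    if ((row & 1) == 0) {
        for (int w = 0; w < TOW; w++)
            line_buf[w] = conv_row[w];
    } else {
        for (int w = 0; w < TOW; w += 2) {
            int16_t m = line_buf[w];
            if (line_buf[w + 1] > m) m = line_buf[w + 1];
            if (conv_row[w]     > m) m = conv_row[w];
            if (conv_row[w + 1] > m) m = conv_row[w + 1];
            pooled_row[w / 2] = m;
        }
    }
}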
Figure 4. Execution time when all stages work in a pipelined style.

D. Ping-Pong Buffer Design

The input buffer and weight buffer used for computation and the output buffer used for storing intermediate results are organized as ping-pong pairs. In addition, the line buffer is designed so that the format conversion time overlaps the access time of off-chip memory. As shown in Fig. 4, with every level of buffer organized as a ping-pong pair, all stages are fully pipelined. Here t2 is the total time in which the accelerator acquires the data of one output sliding cube, and t3 is the time to write the output sliding cube back to off-chip memory. These two processes complete all the operations needed to produce one output sliding cube; t2 consists of N t1 phases, where N = C_o/T_o + 1.
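A minimal sketch of the ping-pong scheme is given below, assuming hypothetical load_cube/compute_cube/store_cube helpers for the DMA-in, PE and DMA-out stages. In the HLS implementation the three stages would sit in a dataflow region so that, while one half of each buffer pair is being computed on, the other half is being filled or drained; the plain-C version here only illustrates the buffer alternation.

#define CUBE_WORDS 4096   /* illustrative on-chip buffer depth, not a value from the paper */

/* Assumed helpers standing for the DMA-in, PE and DMA-out stages. */
extern void load_cube(short in_buf[CUBE_WORDS], short w_buf[CUBE_WORDS], int n);
extern void compute_cube(const short in_buf[CUBE_WORDS], const short w_buf[CUBE_WORDS],
                         short out_buf[CUBE_WORDS]);
extern void store_cube(const short out_buf[CUBE_WORDS], int n);

void run_layer(int num_cubes)
{
    static short in_buf[2][CUBE_WORDS], w_buf[2][CUBE_WORDS], out_buf[2][CUBE_WORDS];

    load_cube(in_buf[0], w_buf[0], 0);                  /* prologue: fill the "ping" half */
    for (int n = 0; n < num_cubes; n++) {
        int cur = n & 1, nxt = (n + 1) & 1;
        if (n + 1 < num_cubes)
            load_cube(in_buf[nxt], w_buf[nxt], n + 1);  /* prefetch into the idle half    */
        compute_cube(in_buf[cur], w_buf[cur], out_buf[cur]);
        store_cube(out_buf[cur], n);                    /* drain while the next load runs */
    }
}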
E. Dynamic Fixed-Point Quantization

To improve the efficiency of the convolution and reduce the off-chip memory access time, we use a dynamic fixed-point quantization strategy [24], which transforms the weights and feature values from floating-point numbers into 16-bit fixed-point numbers with shared scaling factors. Unlike plain fixed point, dynamic fixed point converts floating-point numbers into fixed-point numbers with multiple scaling factors, which improves the representation accuracy of the fixed-point numbers. To minimize the loss, every feature map shares one scaling factor, and the scaling factors of different feature maps may differ; similarly, for the weights, each convolution kernel shares one scaling factor. Besides the basic multiply-and-add operations, only shift operations are needed to incorporate the scaling factors into the dynamic fixed-point convolution, and the process can be expressed as (3)-(6):

y'^{\,n}_{f,h,w} = x_{c,\, h \times S + j,\, w \times S + i} \times w_{f,c,j,i}    (3)

y^{n}_{f,h,w} = y^{n}_{f,h,w,\mathrm{on}} >> (o^{n}_{p} - Q^{n}_{out}(f))    (5)

Q^{n}_{out}(C) = Q^{n+1}_{in}(C)    (6)

where o^n_p denotes the shared scaling factor of the intermediate results, which prevents data overflow, and Q^n denotes the scaling factors of the input feature maps, output feature maps and weights of the n-th layer.
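The conversion itself can be illustrated with the following sketch, which quantizes one group of values (one feature map or one convolution kernel) to 16-bit dynamic fixed point. The group shares a power-of-two scaling factor Q (the number of fractional bits), and different groups may use different Q; the helper names are illustrative.

#include <math.h>
#include <stdint.h>

/* Pick the largest number of fractional bits Q such that every value of the
 * group still fits into a signed 16-bit word after scaling by 2^Q. */
int choose_q(const float *v, int n)
{
    float max_abs = 0.0f;
    for (int i = 0; i < n; i++)
        if (fabsf(v[i]) > max_abs) max_abs = fabsf(v[i]);
    int q = 15;
    while (q > 0 && max_abs * (float)(1 << q) > 32767.0f)
        q--;
    return q;
}

/* Quantize the group with the shared factor Q, rounding and saturating. */
void quantize_group(const float *v, int16_t *out, int n, int q)
{
    for (int i = 0; i < n; i++) {
        float scaled = roundf(v[i] * (float)(1 << q));
        if (scaled >  32767.0f) scaled =  32767.0f;
        if (scaled < -32768.0f) scaled = -32768.0f;
        out[i] = (int16_t)scaled;
    }
}

During inference, the product of an input value (factor Q_in) and a weight (factor Q_w) carries the factor Q_in + Q_w, so, as in (5), bringing the accumulated result to the output factor Q_out only requires a right shift rather than a floating-point multiplication.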
IV. SYSTEM DESIGN

As illustrated in Fig. 5, the system, which is based on the ARM + FPGA architecture, mainly consists of the external Double Data Rate (DDR) memory, the processing system (PS), the on-chip buffers, the accelerator in the programmable logic (PL), and the on-chip and off-chip bus interconnect. The ARM performs the task scheduling of the whole system. The initial image data and the weights are pre-stored in the external DDR memory, and the intermediate feature maps are read from and written back to the DDR. The accelerator is connected to the ARM through the AXI4 bus: it receives the configuration signals through the AXI4-Lite bus, and the DMA module exchanges data with the DDR and converts the data format through the AXI4-HP bus.
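One layer of the time-shared schedule is thus fully described by a small configuration record written over AXI4-Lite before the layer starts. A possible shape of such a record is sketched below; every field name and width is hypothetical, since the paper does not publish the register map.

#include <stdint.h>

/* Hypothetical per-layer entry of the configuration table. */
struct layer_config {
    uint32_t in_addr;          /* DDR address of the input feature maps            */
    uint32_t w_addr;           /* DDR address of the weights and fused bias        */
    uint32_t out_addr;         /* DDR address for the output sliding cubes         */
    uint16_t in_h, in_w;       /* input feature-map height and width               */
    uint16_t in_c, out_f;      /* number of input / output feature maps            */
    uint8_t  kernel;           /* K: 1 or 3                                        */
    uint8_t  stride;           /* S                                                */
    uint8_t  pad;              /* (K - 1) / 2                                      */
    uint8_t  pool_enable;      /* 1 if a 2x2 max pooling is fused into the layer   */
    uint8_t  q_in, q_w, q_out; /* dynamic fixed-point scaling factors (Sec. III-E) */
};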
V. EVALUATION

A. Model Preparation

Based on the Darknet deep learning framework, we train YOLOv2 and YOLOv2-Tiny on COCO and PASCAL VOC 2007 respectively, with an input resolution of 416×416. After training, the dynamic fixed-point strategy is used to transform the parameters from floating point into 16-bit fixed point with multiple scaling factors.

B. Model Deployment

The accelerator IP is packaged with Vivado HLS, and the system architecture is built in Vivado. We implement the system on the Xilinx ZCU102 with a driving clock of 300 MHz. The larger the network, the better the accelerator performance. The system performance and resource utilization are shown in Table II.
TABLE II. IMPLEMENTATION RESULTS OF THE PROPOSED DESIGNS

                            YOLOv2               YOLOv2-Tiny
Features                    VOC       COCO       VOC       COCO
Device                      Xilinx ZCU102 FPGA
Frequency                   300 MHz
BRAM (18Kb)                 491
DSPs                        609
LUTs                        95136
FFs                         90589
DNN size (GOP)              29.42     29.47      6.18      5.4
Peak throughput (GOPs)      289.1     289.1      237.6     237.6
Overall throughput (GOPs)   102.2     102.5      85.8      87.0
Inference time              288 ms    288 ms     72 ms     62 ms

The comparison between the proposed design and previous hardware implementations is presented in Table III.

TABLE III. COMPARISON WITH PREVIOUS WORK

                    [13]               [14]              [15]               This work
Platform            Artix-7 TSBG484    Virtex-7 VX485T   Stratix-V          Zynq UltraScale+
Frequency (MHz)     100                100               120                300
Precision           16-bit fixed       32-bit float      (8-16)-bit fixed   16-bit fixed
Throughput (GOPs)   22                 61.62             136.5              289 (peak) / 102 (overall)
Power (W)           7.53               3.0               19.1               11.8

VI. CONCLUSION

In this paper, we propose a reconfigurable CNN accelerator for YOLO. By combining the convolution and pooling operations, the off-chip memory access time of the accelerator is reduced. A configuration table is designed so that the accelerator can adapt to different models, and the peak performance of the accelerator is 289 GOPs at a 300 MHz clock frequency. In future work, we will further improve the generality of the accelerator so that it can also run YOLOv3, and we will deploy the post-processing on the FPGA.

ACKNOWLEDGMENT

This work was supported by the National Key Research and Development Program of China (Grant No. 2018YFE0203801).

REFERENCES

[1] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
[2] A. Krizhevsky and I. Sutskever, "ImageNet classification with deep convolutional neural networks," in Proc. Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.
[3] R. Girshick, J. Donahue, T. Darrell and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," pp. 580-587, 2013.
[4] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," arXiv preprint arXiv:1612.08242, 2016.
[5] W. Liu, D. Anguelov, D. Erhan, C. Szegedy and S. E. Reed, "SSD: Single shot multibox detector," arXiv preprint arXiv:1512.02325, 2015.
[6] Y. Ma, N. Suda, Y. Cao, J. Seo and S. Vrudhula, "Scalable and modularized RTL compilation of convolutional neural networks onto FPGA," in Proc. 26th Int. Conf. on Field Programmable Logic and Applications (FPL), Lausanne, 2016, pp. 1-8.
[7] S. Han, H. Mao and W. Dally, "Deep compression: Compressing DNNs with pruning, trained quantization and Huffman coding," arXiv preprint arXiv:1510.00149, 2015.
[8] A. Zhou, A. Yao and Y. Guo, "Incremental network quantization: Towards lossless CNNs with low-precision weights," arXiv preprint arXiv:1702.03044, 2017.
[9] B. Wu, A. Wan and X. Yue, "Shift: A zero FLOP, zero parameter alternative to spatial convolutions," arXiv preprint arXiv:1711.08141, 2017.
[10] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," arXiv preprint arXiv:1602.07360, 2016.
[11] S. Chakradhar, M. Sankaradas, V. Jakkula and S. Cadambi, "A dynamically configurable coprocessor for convolutional neural networks," in Proc. ISCA, 2010.
[12] D. T. Nguyen, T. N. Nguyen, H. Kim and H. Lee, "A high-throughput and power-efficient FPGA implementation of YOLO CNN for object detection," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 27, no. 8, pp. 1861-1873, Aug. 2019.
[13] Q. Zhang, J. Cao, Y. Zhang, S. Zhang, Q. Zhang and D. Yu, "FPGA implementation of quantized convolutional neural networks," in Proc. IEEE 19th Int. Conf. on Communication Technology (ICCT), Xi'an, China, 2019, pp. 1605-1610.
[14] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in Proc. ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays, 2015.
[15] N. Suda et al., "Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks," in Proc. FPGA, 2016, pp. 16-25.
[16] J. Qiu, J. Wang et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proc. ACM Int. Symp. on Field-Programmable Gate Arrays, 2016.
[17] R. Girshick, "Fast R-CNN," arXiv preprint arXiv:1504.08083, 2015.
[18] S. Ren, K. He, R. Girshick and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," arXiv preprint arXiv:1506.01497, 2015.
[19] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[20] Trieu, "Darknet," https://github.com/AlexeyAB/darknet, 2016.
[21] C. Szegedy, S. Ioffe and V. Vanhoucke, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," arXiv preprint arXiv:1602.07261, 2016.
[22] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. Int. Conf. on Machine Learning, 2015, pp. 448-456.
[23] A. G. Howard et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[24] R. Ding, G. Su, G. Bai, W. Xu, N. Su and X. Wu, "A FPGA-based accelerator of convolutional neural network for face feature extraction," in Proc. IEEE Int. Conf. on Electron Devices and Solid-State Circuits (EDSSC), Xi'an, China, 2019, pp. 1-3.