An Efficient CNN Accelerator Using Inter-Frame Data Reuse of Videos on FPGAs

Abstract— Convolutional neural networks (CNNs) have had great success when applied to computer vision technology, and many application-specific integrated circuit (ASIC) and field-programmable gate array (FPGA) CNN accelerators have been proposed. These accelerators primarily focus on the acceleration of a single input, and they are not particularly optimized for video applications. In this article, we focus on the similarities between continuous inputs in video, and we propose a YOLOv3-tiny CNN FPGA accelerator using incremental operation. The accelerator can skip the convolution operations on data that are similar between continuous inputs. We also use the Winograd algorithm to optimize the conv3 × 3 operator in the YOLOv3-tiny network to further improve the accelerator's efficiency. Experimental results show that our accelerator achieves 74.2 frames/s on ImageNet ILSVRC2015. Compared to the original network without the Winograd algorithm and incremental operation, our design provides a 4.10× speedup. Compared with other YOLO network FPGA accelerators applied to video applications, our design provides 3.13×–18.34× higher normalized digital signal processor (DSP) efficiency and 1.10×–14.2× higher energy efficiency.

Index Terms— Convolutional neural network (CNN), field-programmable gate array (FPGA) accelerator, incremental operation, input similarity, video applications, Winograd algorithm.

Manuscript received 11 October 2021; revised 20 December 2021 and 27 January 2022; accepted 12 February 2022. Date of publication 26 September 2022; date of current version 24 October 2022. This work was supported by the National Key Research and Development Program of China under Grant 2018YFA0701500. (Corresponding author: Qin Wang.)
Shengzhao Li, Jianfei Jiang, Weiguang Sheng, Naifeng Jing, and Zhigang Mao are with the Department of Micro/Nano Electronics, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: [email protected]).
Qin Wang is with the National Key Laboratory of Science and Technology on Micro/Nano Fabrication, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: [email protected]).
Digital Object Identifier 10.1109/TVLSI.2022.3151788

I. INTRODUCTION

IN RECENT years, deep learning methods have played an important role in computer vision for areas such as image recognition, target tracking, and classification. Many studies have proposed high-performance application-specific integrated circuit (ASIC) [1]–[4] and field-programmable gate array (FPGA) [5]–[9] neural network accelerators. Due to the rapid development of computer vision algorithms, the depth and scale of deep neural networks (DNNs) are also increasing. Meanwhile, there is an increasingly larger need for video processing versus traditional static image processing due to emerging applications such as autonomous driving and monitoring. Advancements in algorithms and changes in requirements have led to critical challenges for the data processing capabilities of current hardware. This article proposes a convolutional neural network (CNN) accelerator for video target recognition on an FPGA that achieves higher resource efficiency and higher speed than other FPGA CNN implementations when processing video frames.

The difference between processing a single image and processing the sequential frames of an input video is minimal; therefore, the computation and memory accesses performed on sequential frames can be redundant. Traditional strategies perform a complete computation for each frame, which ignores the similarity present in live vision. Our work attempts to produce a CNN accelerator that skips redundant operations between adjacent frames of a video to achieve better throughput and resource efficiency on an FPGA.

Recently, some articles [10]–[13] have designed neural network accelerators based on the data similarity between video frames. Previous work has primarily used the similarity of data between sequential frames in two ways.
1) With the exception of the first frame, the accelerator only processes the differences between sequential frames. The convolution of the differential input is superimposed on the result of the previous frame [10], [11].
2) By capturing visual motion in sequential frames, the accelerator can use the results obtained from key frames to predict the results of other frames, thereby avoiding redundant operations on nonkey frames [12], [13].

However, most of the works related to video CNN accelerators concern algorithms and modeling analysis. There are some designs devoted to high-performance, high-resource-utilization CNN accelerators on FPGAs [14]–[17]. These works primarily use pruning and quantization methods to improve resource efficiency for static image tasks. There are no accelerator designs that take advantage of the data similarity between video frames on an FPGA.

Capturing visual motion in sequential frames for real-time video applications requires additional processing modules, which is not conducive to improving resource efficiency on FPGAs. The design in this article uses the YOLOv3-tiny network with 8-bit quantization, and it adopts the incremental operation of interframe differential input.
As a result, our design achieves higher digital signal processor (DSP) efficiency and performance than other high-performance accelerators [14]–[17] and YOLO accelerators [18]–[20] in video application scenarios while running at a reasonable image processing frame rate.

To improve the DSP utilization rate under different CNN operators, our design uses the Winograd algorithm to increase throughput with the same DSP resources and to improve the multiplexing efficiency of the processing element (PE) array.

In summary, we make the following contributions.
1) Our work uses the incremental operation of interframe differential input to take advantage of the interframe similarity of video input, and it improves processing speed and resource utilization. In addition, we use different thresholds for ignoring interframe differences in different layers to increase speed while keeping the network accuracy.
2) A corresponding dataflow is designed for the incremental operation method, which balances the sparse irregularity of the activation values in the incremental operation and improves the utilization rate of the DSPs on the FPGA.
3) Using the Winograd algorithm improves throughput under the same DSP consumption and simultaneously increases the multiplexing of the PE array across the YOLOv3-tiny operators conv1 × 1 and conv3 × 3. This greatly improves DSP efficiency.

Experimental results show that our accelerator achieves 74.2 frames/s on ImageNet ILSVRC2015 [29]. Compared to the original network without the Winograd algorithm and incremental operation, our design provides a 4.10× speedup. Compared with other YOLO network FPGA accelerators applied to video applications, our design provides 3.13×–18.34× higher normalized DSP efficiency and 1.10×–14.2× higher energy efficiency.

II. ALGORITHM

This section introduces both the incremental operation and the Winograd algorithm used in neural networks. Our work uses incremental operation strategies to reduce redundant operations in real-time visual processing, and it uses the Winograd algorithm to improve the utilization efficiency of the DSPs. We also implement a high-efficiency accelerator for real-time video processing on an FPGA.

The DNN in a universal neural network accelerator repeats a large number of calculations for each video frame, which wastes computing resources. Some articles [21], [22] proposed a new DNN calculation method to eliminate these wasteful calculations when processing video. The key idea of incremental operation is to find an algorithm that can update previously saved results. Incremental operation only needs to find the differences between frames and then correct the saved results with the differing parts. The corrected result approximates the result of the original calculation using the complete CNN. This type of algorithm is much cheaper than the complete convolution of each frame, and it eliminates many of the redundant operations and memory accesses of the original network.

Incremental operation can be applied to any pretrained DNN, so most common networks can be optimized. Therefore, it is feasible to use incremental operation in hardware, which can reduce the computational cost of processing video.

In recent years, the Winograd algorithm has been widely used to accelerate DNNs [23]–[25]. It is an effective method that can significantly reduce the computational complexity of convolution calculations. Compared with the original convolution calculation, the F(2 × 2, 3 × 3) form of the Winograd algorithm reduces the convolutional arithmetic complexity by 2.25 times. The YOLOv3-tiny network includes both conv3 × 3 and conv1 × 1 operators, with the majority being conv3 × 3, and the Winograd algorithm can reduce the arithmetic complexity required by the conv3 × 3 operators. When the conv3 × 3 operator is implemented using the Winograd algorithm, four 3 × 3 convolution kernel operations are converted into one 4 × 4 matrix elementwise product. The operation after the conversion is very similar to that of the conv1 × 1 operators, which reduces the complexity of the module and improves the utilization efficiency of the PE array. In our design, the Winograd algorithm allows us to further reduce the number of calculations required for convolution operations and to improve the efficiency of the DSPs on the FPGA.

A. CNN Incremental Operation

We define the input similarity as the ratio of the convolutional layer input of the current frame that is identical to the corresponding part of the convolutional layer input of the saved frame, which is the frame before the current frame. In video applications, most of our inputs use 32-bit floating-point numbers with low input similarity. Although many parts of the current frame are like those of the previous frame, there are still small differences between them. Due to the high precision of 32-bit floating-point numbers, any small change is reflected in the difference. However, small changes in the input have a minimal impact on the convolution results, and we can ignore these small changes in the incremental operation.

A common network quantization algorithm can significantly increase the similarity between inputs, which is conducive to our design for reducing redundant operations with minimal impact on recognition accuracy.

For pretrained CNNs, linear quantization is a popular quantization method [26]. We linearly quantize the activations and weights in each layer. After the quantization is completed, we evaluate the accuracy of the quantized network. The basic formula for linear quantization used for each layer is as follows:

$$\mathrm{data_{quant}} = \mathrm{round}\!\left(\frac{\mathrm{data_{in}}}{\mathrm{step}_l}\right) \times \mathrm{step}_l. \quad (1)$$

In our design, the steps used to quantize the weights and activations are different for different layers. The step is calculated according to the weights of each layer and the range of the activation values. The weight range is determined by analyzing the weights of the pretrained network, and the activation range is determined by analyzing the training set; data_quant is the cluster centroid that is closest to the value of the input data. Our work uses 8-bit quantization, and the step calculation formula for each layer is given as follows:

$$\mathrm{step}_l = 2^{\,\mathrm{floor}\left(\max\left(\left|\mathrm{data_{in}}\right|\right)\right) - 7}. \quad (2)$$
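As an illustration, (1) and (2) can be expressed as a short NumPy routine. This is a sketch only: the exponent rule follows (2) as printed here and may differ in detail from the authors' implementation, the per-layer calibration over the training set is reduced to a single sample array, and the function names are ours.

```python
import numpy as np

def layer_step(samples, n_bits=8):
    # Power-of-two quantization step per layer, following (2) as printed;
    # the exact exponent rule used by the accelerator may differ.
    return 2.0 ** (np.floor(np.max(np.abs(samples))) - (n_bits - 1))

def linear_quantize(x, step):
    # (1): snap every weight/activation to the nearest multiple of the step.
    return np.round(x / step) * step

# Example: calibrate one layer's step from sample activations, then quantize.
acts = np.random.default_rng(0).standard_normal((16, 16)).astype(np.float32)
step = layer_step(acts)
acts_q = linear_quantize(acts, step)
```

Because quantized values snap to a shared grid of step multiples, small interframe fluctuations often map to identical codes, which is the increase in input similarity that the incremental operation exploits.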
Fig. 1. (a) Original neural network calculation process. (b) Neural network calculation process with incremental operation.
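The incremental path of Fig. 1(b) relies only on the linearity of convolution: conv(x_t) = conv(x_{t-1}) + conv(x_t − x_{t-1}), so only the differential input has to be processed. A minimal single-channel NumPy/SciPy sketch is shown below; the threshold argument for ignoring small changes is illustrative.

```python
import numpy as np
from scipy.signal import correlate2d

def incremental_conv(prev_in, prev_out, cur_in, kernel, thresh=0.0):
    # Correct the saved result with the convolution of the frame difference;
    # differences below the threshold are treated as zero and skipped.
    diff = cur_in - prev_in
    diff[np.abs(diff) < thresh] = 0.0
    return prev_out + correlate2d(diff, kernel, mode='same')

# The first frame of a package is convolved completely; later frames reuse it.
rng = np.random.default_rng(1)
k = rng.standard_normal((3, 3))
f0 = rng.standard_normal((8, 8))
f1 = f0.copy()
f1[2:4, 2:4] += 1.0                      # only a small region changes
y0 = correlate2d(f0, k, mode='same')     # complete convolution of frame 0
y1 = incremental_conv(f0, y0, f1, k)     # incremental result for frame 1
assert np.allclose(y1, correlate2d(f1, k, mode='same'))
```

With a zero threshold the incremental result is exact; a nonzero threshold trades a small approximation error for more skipped work.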
TABLE I
ACCURACY OF THE NETWORK BEFORE AND AFTER QUANTIZATION
III. DATAFLOW
There are efficient dataflow designs for traditional CNN
accelerators [16], [27], but they are only efficient in dense
networks. These dataflows cannot take advantage of the video
application characteristics, and they are not suitable for use
with the Winograd algorithm. For our work, there are two
important issues. First, the dataflow should be able to use
the similarity between input frames to get better performance,
i.e., the dataflow can skip the part of the input that does not
need to be calculated after the difference. Note that dataflows
with this characteristic are similar to the dataflows used for sparse CNNs.
Second, since the Winograd algorithm changes the calculation unit of the neural network, the dataflow needs to be updated to accommodate the change in the calculation method.

There are also dataflows of CNNs for video applications [10], [11]. These dataflows can be expanded to make use of the similarity between frames, as shown in Fig. 5, and they provide a basis for our design. With the characteristics of the Winograd algorithm in mind, we have designed a dataflow that is more suitable for our accelerator. This dataflow helps to ensure the efficiency of our computing module.

The YOLOv3-tiny network includes two operators: conv3 × 3 and conv1 × 1. When the Winograd algorithm is used to complete the conv3 × 3 operation, conv3 × 3 becomes a process with 4 × 4 matrix elementwise multiplication as the basic operation unit. As a result, there is no concept of a convolution sliding window for conv3 × 3. The kernel size of conv1 × 1 is 1 × 1, which can alternatively be viewed as matrix elementwise multiplication. This means that conv1 × 1 can also be converted to a process with 4 × 4 matrix elementwise multiplication as the basic unit of operation. After this conversion, the dataflow of conv1 × 1 is like that of the conv3 × 3 operation under the Winograd algorithm. To summarize, the loops in our dataflow include the frame, output channel, and input channel loops, along with a loop over the position of the 4 × 4 block in the image.
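For reference, the F(2 × 2, 3 × 3) form used here can be written in a few lines. The matrices below are the standard ones from Lavin and Gray [23], which we take to be the B, G, and A referred to in this article; this is a sketch for checking the arithmetic, not the hardware datapath.

```python
import numpy as np

# F(2x2, 3x3) transform matrices (Lavin and Gray [23]).
BT = np.array([[1, 0, -1, 0],
               [0, 1,  1, 0],
               [0, -1, 1, 0],
               [0, 1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_2x2_3x3(d, g):
    # One tile: a 2x2 output block from a 4x4 input tile d and a 3x3 kernel g,
    # computed through a 4x4 elementwise product (16 multiplications, not 36).
    U = G @ g @ G.T      # transformed kernel
    V = BT @ d @ BT.T    # transformed input tile
    return AT @ (U * V) @ AT.T

# Check one tile against the direct sliding-window result.
rng = np.random.default_rng(2)
d, g = rng.standard_normal((4, 4)), rng.standard_normal((3, 3))
direct = np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)]
                   for i in range(2)])
assert np.allclose(winograd_2x2_3x3(d, g), direct)
```

The 36/16 = 2.25 ratio is the arithmetic-complexity reduction quoted in Section II.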
In the dataflow for the conv3 × 3 convolutional layer (see Algorithm 1), the input channel index is ic, and the output channel index is oc. W_{oc,ic} is a 3 × 3 weight kernel corresponding to one pair of input and output channels. The input X^{ic}_{h,w} is a 4 × 4 activation block reflecting the difference between two frames; h and w represent the position of the top-left pixel of the 4 × 4 activation block in the image. B, G, and A are the conversion matrices of the Winograd algorithm; they are detailed in Section II-B. M[H/2][W/2] stores the number of channels with nonzero 4 × 4 activation blocks at each position. Incremental operation needs to be computed on each layer, so we process the data from all images in one image group at a time. IG is the batch size of a frame package. Since the first frame of each package requires a complete convolution operation, the performance differs for different IG values. This will be discussed further in the following.

In order to facilitate the parallel hardware operation of the output channels, we put the loop over N output channels in the inner layer. For each channel, the saved reference convolution result Y^{oc+s_oc}_{h,w} is set to zero for the first frame of the frame package. For all other frames, the result is computed by adding the current convolution to the reference convolution. The innermost part of the dataflow loop is the matrix operation of the Winograd algorithm, including both the weight and activation value matrix conversions along with the convolution result calculation. Each Winograd operation calculates the result of a 4 × 4 image block for one input channel, and the results from different input channels are added to the output result Y^{oc+s_oc}_{h,w} until all N output channels have been processed. The saved reference convolution result Y^{oc+s_oc}_{h,w} is then updated to the Y^{oc+s_oc}_{h,w} calculated in the current iteration. The convolution of the next frame is performed, followed by the convolution of the image block at the next position, until all N output channels are completed. Finally, the calculation of the next N output channels is performed until all results are calculated. Once the convolution calculation is complete, nonlinear calculations, such as ReLU and Maxpooling, are performed.

Algorithm 2 Pseudocode of Dataflow for conv1 × 1

We changed the traditional conv1 × 1 dataflow (see Algorithm 2) to make it more like the conv3 × 3 dataflow used with the Winograd algorithm. Since the conv1 × 1 convolution does not use the Winograd algorithm, the input weight is expanded from the 1 × 1 kernel to a 4 × 4 matrix in which all elements are the same. This is done in order to maintain the 4 × 4 matrix elementwise multiplication. Without the Winograd matrix conversion, the convolution result is obtained by directly performing the matrix elementwise multiplication. Therefore, the dataflow for conv1 × 1 is like the one for conv3 × 3 using the Winograd algorithm with some operations removed. This similarity between the convolutional layers is beneficial for improving the operating efficiency of the hardware.
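The loop structure described above can be modeled in software roughly as follows. This is a simplified sketch of the conv3 × 3 dataflow, not a reproduction of Algorithm 1: the N-way output-channel parallelism, the M[H/2][W/2] bookkeeping, the exact loop ordering, and the conv1 × 1 variant of Algorithm 2 are omitted, and the per-layer skip threshold is illustrative.

```python
import numpy as np

# Same F(2x2, 3x3) matrices as in the earlier sketch.
BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], float)
G = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], float)
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], float)

def conv3x3_package(frames, weights, thresh):
    """Incremental Winograd conv3x3 over one frame package.
    frames: list of [Cin, H, W] activations (IG frames, H and W even);
    weights: [Cout, Cin, 3, 3]. Returns the pre-activation output per frame."""
    Cin, H, W = frames[0].shape
    Cout = weights.shape[0]
    y_ref = np.zeros((Cout, H, W))            # saved reference results
    prev = np.zeros_like(frames[0])           # saved reference input
    outputs = []
    for t, frame in enumerate(frames):        # frame loop
        diff = frame - prev                   # differential input (frame 0: full input)
        dpad = np.pad(diff, ((0, 0), (1, 1), (1, 1)))
        for h in range(0, H, 2):              # position of the 4x4 block
            for w in range(0, W, 2):
                for oc in range(Cout):        # output channels (parallel in hardware)
                    acc = np.zeros((2, 2))
                    for ic in range(Cin):     # input channel loop
                        d = dpad[ic, h:h + 4, w:w + 4]
                        if t > 0 and np.abs(d).sum() < thresh:
                            continue          # skip a similar 4x4 block
                        U = G @ weights[oc, ic] @ G.T
                        V = BT @ d @ BT.T
                        acc += AT @ (U * V) @ AT.T
                    y_ref[oc, h:h + 2, w:w + 2] += acc   # correct the saved result
        prev = frame
        outputs.append(y_ref.copy())
    return outputs
```

In the accelerator, ReLU and Maxpooling are applied after the corrected results are produced, and the reference input and results are kept in on-chip buffers rather than recomputed as here.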
The ping–pong structure enables the input and output stages to be executed in parallel with the data processing stage. One buffer accepts the data from the I/O module while the other is used for processing, which improves the efficiency of the hardware by producing a pipeline. The pipeline of our design is shown in Fig. 12.

The RRB is 6912 bytes, and it stores the convolution of the previous frame, which is added to the differential convolution to obtain the convolution of the current frame. The frames are compared to determine the output index. After the current frame is completely processed, the data in the RRB are updated to the convolution result of the current frame.

V. EXPERIMENT

In this article, we used a Xilinx ZCU104 to evaluate our design. Vivado HLS 2019.2 was used to implement the accelerator and to generate the bitstream. The accelerator runs at a frequency of 200 MHz. We completed the performance evaluation of the YOLOv3-tiny network accelerator on the FPGA, and the power consumption was evaluated using the Xilinx Power Estimator.

A. Resource Utilization

The resource consumption of our accelerator on the FPGA development board is shown in Table III. The most important values are the utilization of the BRAM_18K and DSP resources. BRAM provides caching and storage in the accelerator. The memory size is determined by the frame package size and the number of stored input and output data channels. Increasing the frame package size provides a higher speedup, and using more stored input and output data channels increases the multiplexing of input data, which reduces data exchanges between the accelerator and the DDR. The DSPs are primarily used for the multiply–add operations in the accelerator. The PU array performs 256 multiplication and addition operations per clock cycle. Since YOLOv3-tiny is 8-bit quantized, each of the 128 DSP resources implements two multiplication operations. Later, we will discuss the throughput and energy consumption provided by each DSP.

TABLE III
RESOURCE UTILIZATION

B. Arithmetic Complexity

1) Similarity Between Frames and Arithmetic Complexity: When determining whether a 4 × 4 activation value block needs to be processed, we calculate its L1 norm and omit the block if the norm is less than our threshold. We set different thresholds in different layers, corresponding to Table II.

The similarity between frames reflects the proportion of operations that can be skipped during incremental operation. The higher the similarity between frames, the more convolution operations can be skipped and the better the acceleration effect. The accuracy of the network is determined by the threshold. To examine how the thresholds change the similarity between frames, we calculated the similarity between the input frames of each convolutional layer under the thresholds of Table II and with no threshold on the ImageNet ILSVRC2015 video dataset. This statistic reflects the average interframe similarity of all frames in the dataset, each relative to its previous frame. The results are shown in Fig. 13.

When we use incremental operation with no threshold, the average interframe similarity of the network is 0.2998. When we use incremental operation with the thresholds of Table II, the average interframe similarity of the network improves to 0.6297. The interframe similarity reflects the proportion of the convolution calculation that can be skipped for each frame in an ideal situation. Ideally, the arithmetic complexity of incremental operation with no threshold is 70.02%, and the speedup is 1.428×. With the thresholds of Table II, the arithmetic complexity is 37.03%, and the speedup is 2.701×. Incremental operation with thresholds reduces the arithmetic complexity and improves the speedup without accuracy loss. All the incremental operations mentioned later use the thresholds of Table II.

Recall that, due to the limitation of hardware resources, the size of the frame package cannot be expanded indefinitely. The first frame of the frame package performs a complete convolution operation instead of an incremental operation. Therefore, the size of the frame package affects the actual algorithm complexity. Besides, in actual hardware acceleration, part of the data input and output time cannot be covered by the calculation time, so the actual speedup will be lower than the ideal speedup.
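The complexity and speedup figures above follow directly from the measured average similarity; the small helper below (the function name and rounding are ours) reproduces them.

```python
def ideal_complexity_and_speedup(similarity):
    # Ideal case: a fraction `similarity` of the convolution work is skipped.
    complexity = 1.0 - similarity
    return complexity, 1.0 / complexity

print(ideal_complexity_and_speedup(0.2998))  # complexity 0.7002, speedup ~1.428 (no threshold)
print(ideal_complexity_and_speedup(0.6297))  # complexity 0.3703, speedup ~2.70  (Table II thresholds)
```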
2) Frame Package Size and Arithmetic Complexity: The first frame of each frame package must do a complete convolution
TABLE IV
NETWORK PROCESSING TIME

TABLE V
PERFORMANCE OF ACCELERATOR IN THREE CASES
TABLE VI
PERFORMANCE COMPARISON WITH PREVIOUS IMPLEMENTATION
of the similarity between frames greatly reduce the network processing time of each frame package.

The Winograd algorithm reduces the number of multiplications in the conv3 × 3 layers, making the processing time of the conv3 × 3 layers shorter. Compared to the original network, the network processing time with the Winograd algorithm shows a speedup of 2.06×. The incremental operation allows each layer to skip part of the convolution operation, providing a 1.99× speedup. The network using both the Winograd algorithm and incremental operation achieves a 4.10× speedup compared to the original network.

From the processing time and the GFLOPs of the YOLOv3-tiny network, we calculated the performance of the accelerator in the three cases, as shown in Table V. In general, our design provides a 4.1× speedup with no loss of accuracy.

The design of our algorithm is not affected by the hardware structure. Regardless of the size of the hardware structure, the same proportion of the convolution can still be skipped. Therefore, architectures of different sizes can still obtain the same relative speedup.

D. Result Comparison

We recorded the performance of our design on general video applications (the ImageNet ILSVRC2015 dataset). We then compared this performance with a 3-D CNN accelerator [28] and other YOLO accelerators [18]–[20]. The comparison data are shown in Table VI.

One 25 bit × 18 bit DSP can be decomposed to handle two 8 bit × 8 bit multiplications, while it can only handle one 16 bit × 16 bit or 18 bit × 18 bit multiplication. Thus, an 8-bit accelerator has twice the overall DSP efficiency of a 16-bit accelerator. Also, the working frequency is in part determined by the occupation of the FPGA device. Thus, the normalized DSP efficiency of different designs can be computed as follows:

$$\text{Normalized DSP efficiency} = \frac{\text{DSP efficiency}}{\alpha \times \left(f / 100\ \text{MHz}\right)}. \quad (9)$$

Here, α = 1 for a 16-bit or 18-bit accelerator, and α = 2 for an 8-bit accelerator.

The DSP efficiency of our design is 2.384 GOPS/DSP, and the normalized DSP efficiency is 0.596. Compared to the 3-D CNN accelerator, our design provides 9.17×–9.77× higher normalized DSP efficiency. Compared to the other YOLO accelerators, our design provides 3.13×–18.34× higher normalized DSP efficiency.

Our design has an energy efficiency of 105.4 GOPS/W. Compared to the other designs, our design provides 1.10×–14.2× higher energy efficiency.

In general, due to the video optimizations, the CNN accelerator we designed has significant advantages over other neural network accelerators in video applications.
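As a worked check of (9), the 0.596 figure quoted above follows from our design's 2.384 GOPS/DSP, 8-bit precision, and 200-MHz clock; the helper function name is ours.

```python
def normalized_dsp_efficiency(gops_per_dsp, bits, freq_mhz):
    # Eq. (9): scale out the precision advantage (alpha) and the clock
    # frequency so designs with different word lengths can be compared.
    alpha = 2 if bits == 8 else 1
    return gops_per_dsp / (alpha * (freq_mhz / 100.0))

print(normalized_dsp_efficiency(2.384, 8, 200))  # -> 0.596
```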
VI. CONCLUSION

In this article, we proposed a CNN FPGA accelerator that takes advantage of the similarity between frames in video applications. The accelerator skips the calculation of similar data between frames by using incremental operation. In addition, we used the Winograd algorithm to improve the utilization of the DSP resources on the FPGA while increasing accelerator throughput. We designed a dataflow based on the characteristics of the incremental operation and the Winograd algorithm, which increased the efficiency of data reuse and array processing. Experimental results show that our accelerator achieves 74.2 frames/s on ImageNet ILSVRC2015. Compared to the original network without the Winograd algorithm and incremental operation, our design provides a 4.10× speedup. Compared with other YOLO network FPGA accelerators applied to video applications, our design provides 3.13×–18.34× higher normalized DSP efficiency and 1.10×–14.2× higher energy efficiency.

ACKNOWLEDGMENT

The authors would like to thank LetPub (www.letpub.com) for its linguistic assistance during the preparation of this manuscript.

REFERENCES

[1] C.-H. Lin et al., "7.1 A 3.4-to-13.3TOPS/W 3.6TOPS dual-core deep-learning accelerator for versatile AI applications in 7 nm 5G smartphone SoC," in IEEE ISSCC Dig. Tech. Papers, San Francisco, CA, USA, Feb. 2020, pp. 134–136, doi: 10.1109/ISSCC19947.2020.9063111.
[2] Y. Jiao et al., "7.2 A 12 nm programmable convolution-efficient neural-processing-unit chip achieving 825TOPS," in IEEE ISSCC Dig. Tech. Papers, San Francisco, CA, USA, Feb. 2020, pp. 136–140, doi: 10.1109/ISSCC19947.2020.9062984.
[3] W. Shan et al., "14.1 A 510 nW 0.41 V low-memory low-computation keyword-spotting chip using serial FFT-based MFCC and binarized depthwise separable convolutional neural network in 28 nm CMOS," in IEEE ISSCC Dig. Tech. Papers, San Francisco, CA, USA, Feb. 2020, pp. 230–232, doi: 10.1109/ISSCC19947.2020.9063000.
[4] P. C. Knag et al., "A 617 TOPS/W all digital binary neural network accelerator in 10 nm FinFET CMOS," in Proc. IEEE Symp. VLSI Circuits, Honolulu, HI, USA, Jun. 2020, pp. 1–2, doi: 10.1109/VLSICircuits18222.2020.9162949.
[5] J.-W. Chang, K.-W. Kang, and S.-J. Kang, "SDCNN: An efficient sparse deconvolutional neural network accelerator on FPGA," in Proc. Design, Automat. Test Eur. Conf. Exhib. (DATE), Florence, Italy, Mar. 2019, pp. 968–971, doi: 10.23919/DATE.2019.8715055.
[6] J.-W. Chang, K.-W. Kang, and S.-J. Kang, "An energy-efficient FPGA-based deconvolutional neural networks accelerator for single image super-resolution," IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 1, pp. 281–295, Jan. 2020, doi: 10.1109/TCSVT.2018.2888898.
[7] L. Lu, J. Xie, R. Huang, J. Zhang, W. Lin, and Y. Liang, "An efficient hardware accelerator for sparse convolutional neural networks on FPGAs," in Proc. IEEE 27th Annu. Int. Symp. Field-Program. Custom Comput. Mach. (FCCM), San Diego, CA, USA, Apr. 2019, pp. 17–25, doi: 10.1109/FCCM.2019.00013.
[8] Z. Wang, K. Xu, S. Wu, L. Liu, L. Liu, and D. Wang, "Sparse-YOLO: Hardware/software co-design of an FPGA accelerator for YOLOv2," IEEE Access, vol. 8, pp. 116569–116585, 2020, doi: 10.1109/ACCESS.2020.3004198.
[9] H. Kim and K. Choi, "Low power FPGA-SoC design techniques for CNN-based object detection accelerator," in Proc. IEEE 10th Annu. Ubiquitous Comput., Electron. Mobile Commun. Conf. (UEMCON), New York, NY, USA, Oct. 2019, pp. 1130–1134, doi: 10.1109/UEMCON47517.2019.8992929.
[10] Z. Yuan et al., "14.2 A 65 nm 24.7 μJ/frame 12.3 mW activation-similarity-aware convolutional neural network video processor using hybrid precision, inter-frame data reuse and mixed-bit-width difference-frame data codec," in IEEE ISSCC Dig. Tech. Papers, San Francisco, CA, USA, Feb. 2020, pp. 232–234, doi: 10.1109/ISSCC19947.2020.9063155.
[11] M. Riera, J.-M. Arnau, and A. Gonzalez, "Computation reuse in DNNs by exploiting input similarity," in Proc. ACM/IEEE 45th Annu. Int. Symp. Comput. Archit. (ISCA), Los Angeles, CA, USA, Jun. 2018, pp. 57–68, doi: 10.1109/ISCA.2018.00016.
[12] M. Buckler, P. Bedoukian, S. Jayasuriya, and A. Sampson, "EVA²: Exploiting temporal redundancy in live computer vision," in Proc. ACM/IEEE 45th Annu. Int. Symp. Comput. Archit. (ISCA), Los Angeles, CA, USA, Jun. 2018, pp. 533–546, doi: 10.1109/ISCA.2018.00051.
[13] Y. Wang, Y. Wang, H. Li, Y. Han, and X. Li, "An efficient deep learning accelerator for compressed video analysis," in Proc. 57th ACM/IEEE Design Automat. Conf. (DAC), San Francisco, CA, USA, Jul. 2020, pp. 1–6, doi: 10.1109/DAC18072.2020.9218743.
[14] C. Zhu, K. Huang, S. Yang, Z. Zhu, H. Zhang, and H. Shen, "An efficient hardware accelerator for structured sparse convolutional neural networks on FPGAs," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 28, no. 9, pp. 1953–1965, Sep. 2020.
[15] S. Zhang, J. Cao, Q. Zhang, Q. Zhang, Y. Zhang, and Y. Wang, "An FPGA-based reconfigurable CNN accelerator for Yolo," in Proc. IEEE 3rd Int. Conf. Electron. Technol. (ICET), Chengdu, China, May 2020, pp. 74–78, doi: 10.1109/ICET49382.2020.9119500.
[16] Q. Yin et al., "FPGA-based high-performance CNN accelerator architecture with high DSP utilization and efficient scheduling mode," in Proc. Int. Conf. High Perform. Big Data Intell. Syst. (HPBD&IS), Shenzhen, China, May 2020, pp. 1–7, doi: 10.1109/HPBDIS49115.2020.9130576.
[17] D. T. Nguyen, T. N. Nguyen, H. Kim, and H. J. Lee, "A high-throughput and power-efficient FPGA implementation of YOLO CNN for object detection," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 8, pp. 1861–1873, Aug. 2019.
[18] A. Ahmad, M. A. Pasha, and G. J. Raza, "Accelerating tiny YOLOv3 using FPGA-based hardware/software co-design," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Seville, Spain, Oct. 2020, pp. 1–5, doi: 10.1109/ISCAS45731.2020.9180843.
[19] J. Zhang et al., "A low-latency FPGA implementation for real-time object detection," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Daegu, South Korea, May 2021, pp. 1–5, doi: 10.1109/ISCAS51556.2021.9401577.
[20] T. Adiono, A. Putra, N. Sutisna, I. Syafalni, and R. Mulyawan, "Low latency YOLOv3-tiny accelerator for low-cost FPGA using general matrix multiplication principle," IEEE Access, vol. 9, pp. 141890–141913, 2021, doi: 10.1109/ACCESS.2021.3120629.
[21] P. O'Connor and M. Welling, "Sigma delta quantized networks," 2016, arXiv:1611.02024.
[22] D. Neil, J. H. Lee, T. Delbruck, and S.-C. Liu, "Delta networks for optimized recurrent network computation," in Proc. 34th Int. Conf. Mach. Learn. (ICML), Sydney, NSW, Australia, 2017, pp. 2584–2593.
[23] A. Lavin and S. Gray, "Fast algorithms for convolutional neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, Jun. 2016, pp. 4013–4021, doi: 10.1109/CVPR.2016.435.
[24] X. Wang, C. Wang, J. Cao, L. Gong, and X. Zhou, "WinoNN: Optimizing FPGA-based convolutional neural network accelerators using sparse Winograd algorithm," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 39, no. 11, pp. 4290–4302, Nov. 2020, doi: 10.1109/TCAD.2020.3012323.
[25] J. Yepez and S.-B. Ko, "Stride 2 1-D, 2-D, and 3-D Winograd for convolutional neural networks," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 28, no. 4, pp. 853–863, Apr. 2020, doi: 10.1109/TVLSI.2019.2961602.
[26] B. Widrow, I. Kollar, and M.-C. Liu, "Statistical theory of quantization," IEEE Trans. Instrum. Meas., vol. 45, no. 2, pp. 353–361, Apr. 1996, doi: 10.1109/19.492748.
[27] Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit. (ISCA), Seoul, South Korea, Jun. 2016, pp. 367–379, doi: 10.1109/ISCA.2016.40.
[28] M. Sun, P. Zhao, M. Gungor, M. Pedram, M. Leeser, and X. Lin, "3D CNN acceleration on FPGA using hardware-aware pruning," in Proc. 57th ACM/IEEE Design Automat. Conf. (DAC), San Francisco, CA, USA, Jul. 2020, pp. 1–6, doi: 10.1109/DAC18072.2020.9218571.
[29] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015, doi: 10.1007/s11263-015-0816-y.

Shengzhao Li (Graduate Student Member, IEEE) received the B.S. degree from Shanghai Jiao Tong University, Shanghai, China, in 2019, where he is currently working toward the master's degree at the College of Electronic Engineering.
His research interest includes hardware acceleration of neural networks.

Qin Wang (Member, IEEE) received the B.S. degree from the University of Electronic Science and Technology of China, Chengdu, China, in 1997, and the Ph.D. degree from Shanghai Jiao Tong University, Shanghai, China, in 2004.
She is currently a Professor with the Department of Micro/Nano Electronics, Shanghai Jiao Tong University. Her research interests include high-performance processors and in-memory computing.
Jianfei Jiang (Member, IEEE) received the B.S. degree from Zhejiang University, Hangzhou, China, in 2000, and the M.S. and Ph.D. degrees from Shanghai Jiao Tong University, Shanghai, China, in 2007 and 2017, respectively.
He is currently an Assistant Professor with the Department of Microelectronics and Nanoscience, Shanghai Jiao Tong University. His current research interests include high-speed on-chip interconnect, low-power circuit design, and high-speed circuit design.

Naifeng Jing (Senior Member, IEEE) received the Ph.D. degree from Shanghai Jiao Tong University, Shanghai, China, in 2012.
He is currently an Associate Professor with the Department of Micro/Nano Electronics, Shanghai Jiao Tong University. His research interests include high-performance and high-reliability computing architecture and systems, in-memory computing architecture, and computer-aided VLSI design.