
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 30, NO. 11, NOVEMBER 2022

An Efficient CNN Accelerator Using Inter-Frame Data Reuse of Videos on FPGAs
Shengzhao Li , Graduate Student Member, IEEE, Qin Wang , Member, IEEE, Jianfei Jiang , Member, IEEE,
Weiguang Sheng , Member, IEEE, Naifeng Jing , Senior Member, IEEE, and Zhigang Mao, Member, IEEE

Abstract— Convolutional neural networks (CNNs) have had great success when applied to computer vision technology, and many application-specific integrated circuit (ASIC) and field-programmable gate array (FPGA) CNN accelerators have been proposed. These accelerators primarily focus on the acceleration of a single input, and they are not particularly optimized for video applications. In this article, we focus on the similarities between continuous inputs in video, and we propose a YOLOv3-tiny CNN FPGA accelerator using incremental operation. The accelerator can skip the convolution operation of similar data between continuous inputs. We also use the Winograd algorithm to optimize the conv3 × 3 operator in the YOLOv3-tiny network to further improve the accelerator's efficiency. Experimental results show that our accelerator achieved 74.2 frames/s on ImageNet ILSVRC2015. Compared to the original network without the Winograd algorithm and incremental operation, our design provides a 4.10× speedup. When compared with other YOLO network FPGA accelerators applied to video applications, our design provided a 3.13×–18.34× normalized digital signal processor (DSP) efficiency and 1.10×–14.2× energy efficiency.

Index Terms— Convolutional neural network (CNN), field-programmable gate array (FPGA) accelerator, incremental operation, input similarity, video applications, Winograd algorithm.

Manuscript received 11 October 2021; revised 20 December 2021 and 27 January 2022; accepted 12 February 2022. Date of publication 26 September 2022; date of current version 24 October 2022. This work was supported by the National Key Research and Development Program of China under Grant 2018YFA0701500. (Corresponding author: Qin Wang.)
Shengzhao Li, Jianfei Jiang, Weiguang Sheng, Naifeng Jing, and Zhigang Mao are with the Department of Micro/Nano Electronics, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: [email protected]).
Qin Wang is with the National Key Laboratory of Science and Technology on Micro/Nano Fabrication, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TVLSI.2022.3151788.
Digital Object Identifier 10.1109/TVLSI.2022.3151788

I. INTRODUCTION

IN RECENT years, deep learning methods have played an important role in computer vision for areas such as image recognition, target tracking, and classification. Many studies have proposed high-performance application-specific integrated circuit (ASIC) [1]–[4] and field-programmable gate array (FPGA) [5]–[9] neural network accelerators. Due to the rapid development of computer vision algorithms, the depth and scale of deep neural networks (DNNs) are also increasing. Meanwhile, there is an increasingly larger need for video processing versus traditional static image processing due to emerging applications such as autonomous driving and monitoring. Advancements in algorithms and changes in requirements have led to critical challenges to the data processing capabilities of current hardware. This article proposes a convolutional neural network (CNN) accelerator for video target recognition using an FPGA that achieves higher resource efficiency and higher speed than other FPGA CNN implementations when processing the images of videos.

The difference between the sequential frames of an input video is minimal compared to the difference between independently processed single images. Therefore, the computation and memory accesses performed on sequential frames can be redundant. Traditional strategies perform a complete computation for each frame, which ignores the similarity in live vision. Our work attempts to produce a CNN that skips redundant operations between adjacent frames in video to achieve better throughput and resource efficiency on an FPGA.

Recently, some articles [10]–[13] have designed neural network accelerators based on data similarity between video frames. Previous work has primarily used the similarity of data between sequential frames in two ways.

1) With the exception of the first frame, the accelerator only processes the differences in sequential frames. The resulting convolution of the differential input is superimposed on the result from the previous frame operation [10], [11].
2) By capturing visual motion in sequential frames, the accelerator can use the results obtained from key frames to predict the results of other frames, thereby avoiding redundant operations on nonkey frames [12], [13].

However, most of the works related to video CNN accelerators are about algorithms and modeling analysis. There are some designs devoted to high-performance and high-resource-utilization CNN accelerators on FPGAs [14]–[17]. These works primarily use pruning and quantization methods to improve resource efficiency for static image tasks. There are no accelerator designs that take advantage of the data similarity between video frames on an FPGA.

Capturing visual motion in sequential frames for real-time video applications requires additional processing modules, which is not conducive to improving resource efficiency on FPGAs. The design in this article uses the YOLOv3-tiny network with 8-bit quantization, and it adopts the incremental operation of interframe differential input. As a result, our design achieves higher digital signal processor (DSP) efficiency and performance

compared to other high-performance [14]–[17] and YOLO accelerators [18]–[20] in video application scenarios while running at a reasonable image processing frame rate.

To improve the DSP utilization rate under different CNN operators, our design uses the Winograd algorithm to increase throughput while using the same DSP resources and to increase the multiplexing efficiency of the processing element (PE) array. In conclusion, we make the following contributions.

1) Our work uses the incremental operation of interframe differential input to take advantage of the interframe similarity of video input, and it improves processing speed and resource utilization. In addition, we use different frame-difference ignore thresholds in different layers to increase speed while keeping the network accuracy.
2) A corresponding dataflow is designed for the incremental operation method, which balances the sparse irregularity of the activation values in the incremental operation and improves the utilization rate of the DSPs on the FPGA.
3) Using the Winograd algorithm improves throughput under the same DSP consumption and simultaneously increases the multiplexing of the PE array across the YOLOv3-tiny operators conv1 × 1 and conv3 × 3. This greatly improves DSP efficiency.

Experimental results show that our accelerator achieved 74.2 frames/s on ImageNet ILSVRC2015 [29]. Compared to the original network without the Winograd algorithm and incremental operation, our design provides a 4.10× speedup. When compared with other YOLO network FPGA accelerators applied to video applications, our design provided a 3.13×–18.34× normalized DSP efficiency and 1.10×–14.2× energy efficiency.

II. ALGORITHM

This section introduces both the incremental operation and the Winograd algorithm used in neural networks. Our work uses incremental operation strategies to reduce redundant operations in real-time visual processing, and it uses the Winograd algorithm to improve the utilization efficiency of the DSPs. We also implement a high-efficiency accelerator for real-time video processing on an FPGA.

The DNN in a universal neural network accelerator repeats a large number of calculations for each video frame, which wastes computing resources. Some articles [21], [22] proposed a new DNN calculation method to eliminate the wasteful calculations when processing video. The key idea of incremental operation is to find an algorithm that can update previously saved results. Incremental operation only needs to find the differences between frames and then correct the saved results throughout the differing parts. The corrected result can be approximated to the original calculation using the complete CNN. This type of algorithm is much cheaper than the complete convolution of each frame. It can reduce many of the redundant operations and memory accesses compared with the original network.

Incremental operation can be applied to any pretrained DNN, so most common networks can be optimized. Therefore, it is feasible to use incremental operation in hardware, which can reduce the computational cost of processing video.

In recent years, the Winograd algorithm has been widely used to accelerate DNNs [23]–[25]. It is an effective method that can significantly reduce the computational complexity of convolution calculations. Compared with the original convolution calculation, F(2 × 2, 3 × 3) of the Winograd algorithm reduces the convolutional arithmetic complexity by 2.25 times. The YOLOv3-tiny network includes both conv3 × 3 and conv1 × 1 operators, with the majority being conv3 × 3, and the Winograd algorithm can reduce the arithmetic complexity required by the conv3 × 3 operators. When the conv3 × 3 operator is implemented using the Winograd algorithm, four 3 × 3 convolution kernel operations are converted into 4 × 4 matrix elementwise products. The operation after the conversion is very similar to that of the conv1 × 1 operator, which can reduce the complexity of the module and improve the utilization efficiency of the PE array. In our design, the Winograd algorithm allows us to further reduce the number of calculations required for convolution operations and to improve the efficiency of the DSPs on the FPGA.

A. CNN Incremental Operation

We define the input similarity as the proportion of a convolutional layer's input for the current frame that is the same as the corresponding part of the convolutional layer input of the saved frame, which is the frame before the current frame. In video applications, most of our inputs use 32-bit floating-point numbers with low input similarity. Although many parts of the current frame are like those of the previous frame, there are still small differences between them. Due to the high precision of 32-bit floating-point numbers, any small changes are reflected in the difference. However, small changes in the input have a minimal impact on the convolution results, and we can ignore these small changes in the incremental operation.

A common network quantization algorithm can significantly increase the similarity between inputs, which is conducive to our design for reducing redundant operations with minimal impact on recognition accuracy.

For pretrained CNNs, linear quantization is a popular quantization method [26]. We linearly quantize the activations and weights in each layer. After the quantization is completed, we evaluate the accuracy of the quantized network. The basic formula for linear quantization used for each layer is as follows:

$$\mathrm{data}_{\mathrm{quant}} = \mathrm{round}\!\left(\frac{\mathrm{data}_{\mathrm{in}}}{\mathrm{step}_l}\right) * \mathrm{step}_l. \tag{1}$$

In our design, the steps used to quantize the weights and activations are different for different layers. The step is calculated according to the weights of each layer and the range of the activation values. The weight range is determined by analyzing the weights of the pretrained network, and the activation range is determined by analyzing the training set; data_quant is the cluster centroid that is closest to the value of the input data. Our work uses 8-bit quantization, and the step calculation formula for each layer is given as follows:

$$\mathrm{step}_l = 2^{\lfloor \log_2(\max(|\mathrm{data}_{\mathrm{in}}|)) \rfloor - 7}. \tag{2}$$
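To make (1) and (2) concrete, the following Python sketch quantizes one layer's data with a per-layer power-of-two step. It is a minimal illustration under our reading of (2) as a power-of-two step (the floor of log2 of the maximum magnitude, minus 7 for 8-bit data); the function names are ours, not the paper's.

```python
import numpy as np

def layer_step(data_in, n_bits=8):
    # Power-of-two step per (2): step_l = 2^(floor(log2(max|data_in|)) - 7)
    # for 8-bit data (saturation handling is omitted in this sketch).
    max_abs = float(np.max(np.abs(data_in)))
    return 2.0 ** (np.floor(np.log2(max_abs)) - (n_bits - 1))

def linear_quantize(data_in, step):
    # Linear quantization per (1): round to the nearest multiple of the step.
    return np.round(data_in / step) * step

acts = np.random.randn(16, 14, 14).astype(np.float32)
step = layer_step(acts)
q = linear_quantize(acts, step)
assert np.max(np.abs(q - acts)) <= step / 2  # quantization error is bounded
```

Quantizing consecutive frames onto the same grid makes small interframe fluctuations collapse to exactly zero difference, which is what raises the input similarity.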
Fig. 1. (a) Original neural network calculation process. (b) Neural network calculation process with incremental operation.

TABLE I
ACCURACY OF THE NETWORK BEFORE AND AFTER QUANTIZATION

Table I compares the accuracy of the quantized network on the ImageNet ILSVRC2015 dataset with the accuracy before quantization. We observe that the quantized network still has good accuracy.

The core idea of incremental operation using input similarity is that the convolution result of the previous frame is saved and can be corrected to the result of the current frame. Considering that the convolution operation is a linear operation, the convolution of the difference between the current frame input and the previous frame input is equal to the difference between the current frame convolution and the previous frame convolution. Therefore, we do not need to completely calculate the convolution of all frames. We only need to perform the convolution operation on the difference between the input of the current frame and the saved frame and then correct the convolution result of the previous frame to obtain the convolution of the current frame. We can skip zeros and small values in the input difference because the input of the current frame is very similar to the input of the saved frame, and there are few values in their difference that influence the result. We define the sensitivity threshold as the value below which the input can be ignored. For the incremental operation, the calculation is performed using the following formula:

$$\mathrm{Result}_o' = \mathrm{Result}_o + \sum_{i=1}^{N} \left(c_i' - c_i\right) * W_{io}. \tag{3}$$

Here, c_i and c_i' are the inputs of the saved and current frames on input channel i, W_io is the corresponding weight kernel, and N is the number of input channels.
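A minimal numpy sketch of the update in (3), shown for a single input channel (the sum over i is dropped for brevity): only the input difference is convolved, and the saved result is corrected. The direct-convolution helper is our own illustration, not the paper's code.

```python
import numpy as np

def conv2d_valid(x, w):
    # Direct "valid" 2-D convolution (as cross-correlation), used here only
    # to demonstrate the linearity that the incremental update relies on.
    H, W = x.shape
    k = w.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + k, j:j + k] * w)
    return out

rng = np.random.default_rng(0)
w = rng.standard_normal((3, 3))                # W_io
c_prev = rng.standard_normal((8, 8))           # saved frame input c_i
c_cur = c_prev.copy()
c_cur[2:4, 2:4] += 0.5                         # current frame c'_i: small change
result_prev = conv2d_valid(c_prev, w)          # saved Result_o
result_cur = result_prev + conv2d_valid(c_cur - c_prev, w)   # eq. (3)
assert np.allclose(result_cur, conv2d_valid(c_cur, w))
```

Because most of `c_cur - c_prev` is zero, the differential convolution touches far fewer nonzero values than a full recomputation.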
In a complete network, there are nonlinear operations, such as the ReLU and Maxpool operations, between the convolutional layers. Nonlinear operations cannot use incremental operation, so we need to perform ReLU and Maxpool after correcting the convolution results in each layer. Therefore, we need to insert the difference and result processing operations along with threshold processing in each layer. The network calculation process using incremental operation and the original neural network calculation process are shown in Fig. 1.

B. Winograd Algorithm

Fig. 2. One-dimensional Winograd algorithm.

Our design uses the F(2 × 2, 3 × 3) Winograd algorithm [23], which means that each Winograd processing unit calculates a 2 × 2 output tile of a 3 × 3 convolution kernel. To illustrate how the 2-D Winograd algorithm works, we first consider a 1-D F(2, 3) convolution operation. As shown in Fig. 2, the activation value is 1 × 4, and the convolution kernel is 1 × 3. The operation can be regarded as a 2 × 3 matrix multiplied by a 3 × 1 matrix, as shown in the following formula:

$$\begin{bmatrix} d_0 & d_1 & d_2 \\ d_1 & d_2 & d_3 \end{bmatrix} \begin{bmatrix} g_0 \\ g_1 \\ g_2 \end{bmatrix} = \begin{bmatrix} m_0 + m_1 + m_2 \\ m_1 - m_2 - m_3 \end{bmatrix} \tag{4}$$

where

$$m_0 = (d_0 - d_2)\,g_0, \qquad m_1 = (d_1 + d_2)\,\frac{g_0 + g_1 + g_2}{2},$$
$$m_2 = (d_2 - d_1)\,\frac{g_0 - g_1 + g_2}{2}, \qquad m_3 = (d_1 - d_3)\,g_2. \tag{5}$$

When the Winograd algorithm is applied to a 2-D convolution operation, it can be written in the following matrix operation form:

$$Y = A^T\!\left[\left(G g G^T\right) \odot \left(B^T d B\right)\right] A. \tag{6}$$

For the F(2 × 2, 3 × 3) used in this article, the matrices in formula (6) are given as follows:

$$B^T = \begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}, \quad G = \begin{bmatrix} 1 & 0 & 0 \\ 0.5 & 0.5 & 0.5 \\ 0.5 & -0.5 & 0.5 \\ 0 & 0 & 1 \end{bmatrix}, \quad A^T = \begin{bmatrix} 1 & 1 & 1 & 0 \\ 0 & 1 & -1 & -1 \end{bmatrix}. \tag{7}$$

In the 2-D Winograd algorithm, g in F(2 × 2, 3 × 3) is a 3 × 3 filter, and d is a 4 × 4 image tile. In the original convolution operation, a 4 × 4 image tile requires a total of 2 × 2 × 3 × 3 = 36 multiplication operations. Through the Winograd algorithm, we can reduce the number of multiplications to 4 × 4 = 16 for a 4 × 4 image tile, thus reducing the arithmetic complexity by 36/16 = 2.25 times. The cost of the algorithm is that the convolution operation has longer latency and more additional operations.
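The arithmetic above can be checked numerically. The sketch below applies the matrices of (7) to one 4 × 4 tile per (6) and compares the 2 × 2 output with direct convolution; it is a software illustration of the textbook F(2 × 2, 3 × 3) transform, not the accelerator's fixed-point implementation.

```python
import numpy as np

B_T = np.array([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]], dtype=float)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=float)

def winograd_tile(d, g):
    # Y = A^T [ (G g G^T) ⊙ (B^T d B) ] A, eq. (6): 16 multiplications
    # in U * V instead of 36 for a 2x2 output of a 3x3 kernel.
    U = G @ g @ G.T          # transformed weights, 4x4
    V = B_T @ d @ B_T.T      # transformed activations, 4x4
    return A_T @ (U * V) @ A_T.T

rng = np.random.default_rng(1)
d = rng.standard_normal((4, 4))   # image tile
g = rng.standard_normal((3, 3))   # filter
direct = np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)]
                   for i in range(2)])
assert np.allclose(winograd_tile(d, g), direct)
```

The 16 elementwise multiplications in `U * V` correspond to what the PE array of Section IV executes in one clock cycle.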
C. Incremental Operation With Winograd Algorithm

The incremental operation in Section II-A is optimized for the original CNN. For a CNN optimized with the Winograd algorithm, the incremental operation can still be used, since the matrix operation is a linear operation. Because the Winograd algorithm is based on a matrix operation over 4 × 4 image tiles, the difference and threshold processing in the incremental operation flow need to be changed, as shown in Fig. 3. The smallest unit of operation is a 4 × 4 image tile, and the smallest unit of difference in the conv3 × 3 layer is also a 4 × 4 image tile. In our work, the L1 norm of the difference between the image blocks of the previous and current frames is used to evaluate their similarity. The calculation process is given as follows:

$$T = \sum_{i=n}^{n+3} \sum_{j=m}^{m+3} \left| d'_{ij} - d_{ij} \right|. \tag{8}$$

Here, d_ij represents the activation in row i and column j of the stored frame, and d'_ij represents the activation in row i and column j of the current frame.

Fig. 3. Incremental operation process flow with Winograd algorithm.

After the difference process is performed, threshold processing determines the similarity between the blocks. When the L1 norm of the difference is less than a certain threshold, the current frame image block is considered to be the same as the previous frame image block, and the Winograd operation can be skipped. When a higher L1 norm threshold is selected, our CNN will experience a drop in accuracy; simultaneously, fewer image blocks need to be calculated. In order to make use of the similarity between consecutive inputs without accuracy loss, we use different thresholds in different layers. The accuracy loss of each layer with different thresholds is shown in Fig. 4.

Fig. 4. Accuracy loss of conv layers with different thresholds.

Fig. 4 shows the accuracy loss of each layer when the threshold size is 1 step, 2 step, and 3 step, where the step is the quantization step. When the accuracy loss is less than 0.05%, we consider that the threshold causes no accuracy loss, and such cases are not shown in Fig. 4. Also, there is no accuracy loss for a 3-step threshold in the conv5 layer, and when the threshold of the conv5 layer is 4 step, the accuracy loss is about 0.3%, so the threshold setting of the conv5 layer is 3 step.

TABLE II
THRESHOLD SETTING OF EACH LAYER, WHERE THE VALUE IN BRACKETS REPRESENTS THE VALUE OF A SINGLE QUANTIZATION STEP

According to the accuracy loss of each layer, we set a threshold for each layer that will not cause accuracy loss. The threshold setting is shown in Table II. The value in the brackets of Table II is the actual value of the quantization step. The accuracy of the network under the above threshold setting is 0.731, which is the same as the accuracy of the quantized YOLOv3-tiny without incremental calculation.
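The per-layer skip decision can be written compactly. The following Python sketch evaluates (8) for every 4 × 4 block of a layer and keeps only the blocks whose L1 difference reaches the layer's threshold; the threshold here is expressed as a multiple of the quantization step in the spirit of Table II, with placeholder values rather than the paper's exact settings.

```python
import numpy as np

def nonzero_block_index(cur, ref, step, n_steps):
    # cur, ref: one layer's activations, shape (C, H, W), H and W multiples
    # of 4. A 4x4 block is kept only if the L1 norm of its difference,
    # eq. (8), reaches the layer threshold n_steps * step.
    diff = np.abs(cur - ref)
    C, H, W = diff.shape
    l1 = diff.reshape(C, H // 4, 4, W // 4, 4).sum(axis=(2, 4))
    return l1 >= n_steps * step   # True: process block; False: skip it

rng = np.random.default_rng(2)
ref = rng.standard_normal((16, 8, 8)).astype(np.float32)
cur = ref + 0.01 * rng.standard_normal((16, 8, 8)).astype(np.float32)
keep = nonzero_block_index(cur, ref, step=0.25, n_steps=3)  # placeholder values
print("blocks to process:", int(keep.sum()), "of", keep.size)
```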

III. DATAFLOW
There are efficient dataflow designs for traditional CNN
accelerators [16], [27], but they are only efficient in dense
networks. These dataflows cannot take advantage of the video
application characteristics, and they are not suitable for use
with the Winograd algorithm. For our work, there are two
important issues. First, the dataflow should be able to use
the similarity between input frames to get better performance,
i.e., the dataflow can skip the part of the input that does not
need to be calculated after the difference. Note that dataflows with this characteristic are similar to dataflows for sparse CNNs.
Second, since the Winograd algorithm changes the calculation unit of the neural network, the dataflow needs to be updated to accommodate the changes in the calculation method.

There are also dataflows of CNNs for video applications [10], [11]. These dataflows can be expanded to make use of the similarity between frames, as shown in Fig. 5, and they provide a basis for our design. With the characteristics of the Winograd algorithm in mind, we have designed a dataflow that is more suitable for our accelerator. This dataflow helps to ensure the efficiency of our computing module.

The YOLOv3-tiny network includes two operators: conv3 × 3 and conv1 × 1. When the Winograd algorithm is used to complete the operation of conv3 × 3, conv3 × 3 becomes a process with 4 × 4 matrix elementwise multiplication as the basic operation unit. As a result, there is no concept of a convolution sliding window in conv3 × 3. The kernel size of conv1 × 1 is 1 × 1, which can be alternatively viewed as matrix elementwise multiplication. This means that conv1 × 1 can also be converted to a process with 4 × 4 matrix elementwise multiplication as the basic unit of operation. After this conversion, the dataflow of conv1 × 1 is like that of the conv3 × 3 operation under the Winograd algorithm. To summarize, the loops in our dataflow are the frame, output channel, and input channel loops, along with a loop over the matrix position in the image.

Algorithm 1 Pseudocode of Dataflow for conv3 × 3

In the dataflow for the conv3 × 3 convolutional layer (see Algorithm 1), the input channel index is ic, and the output channel index is oc. W_{oc,ic} is a 3 × 3 weight kernel unit corresponding to different input and output channels. The input X_{h,w}^{ic} is a 4 × 4 activation block reflecting the difference between two frames; h and w represent the position of the top-left pixel of the 4 × 4 activation value matrix on the image. B, G, and A are the conversion matrices of the Winograd algorithm introduced in Section II-B. M[H/2][W/2] stores the number of channels of nonzero 4 × 4 activation blocks at each position.

Fig. 5. Dataflow process with incremental operation.

Incremental operation needs to be computed on each layer,
so we process the data from all images in one image group at
a time. The IG is the batch size of a frame package. Since the
first frame of each package requires a complete convolution
operation, the performance is different for different IGs. This
will be discussed further in the following.
In order to facilitate the parallel hardware operation of the
output channels, we put the loop over N output channels in the
inner layer.

For each channel, the saved reference convolution result Y'_{h,w}^{oc+s_oc} is set to zero for the first frame of the frame package. For all other frames, the result is computed by adding the current convolution to the reference convolution. The innermost part of the dataflow loop is the matrix operation of the Winograd algorithm, including both the weight and activation value matrix conversions along with the convolution result calculation. Each Winograd operation calculates the result of a 4 × 4 image block for one input channel, and the results from different input channels are added to the output result Y_{h,w}^{oc+s_oc} until all N output channels have been processed. The saved reference convolution result Y'_{h,w}^{oc+s_oc} is then updated to the Y_{h,w}^{oc+s_oc} calculated in the current iteration. The convolution of the next frame is performed, followed by the convolution of the image block at the next position, until the current N output channels are completed. Finally, the calculation of the next N output channels is performed until all results are calculated. Once the convolution calculation is complete, nonlinear calculations, such as ReLU and Maxpooling, are performed.
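The pseudocode of Algorithm 1 did not survive the scan of this copy, so the following Python sketch is our reconstruction of the described loop nest (output-channel groups of N, the frames of one package, nonzero block positions from M, Winograd tile operations, and correction of the reference result). The loop ordering and the names follow the text above, but the paper's exact pseudocode may differ.

```python
# Reconstructed sketch of the conv3x3 dataflow (not the paper's exact
# pseudocode). frames[f][ic][(h, w)] holds a 4x4 difference tile;
# M[f][(h, w)] lists the nonzero input channels at that tile position
# (the paper's M stores their count; a list serves the same role here);
# Y_ref[oc][(h, w)] stores the reference results corrected per eq. (3).
def conv3x3_dataflow(frames, weights, M, Y_ref, N, winograd_tile):
    OC = len(weights)
    for s in range(0, OC, N):                  # groups of N output channels
        for f, X in enumerate(frames):         # frames of one frame package
            for (h, w), channels in M[f].items():
                for oc in range(s, min(s + N, OC)):   # parallel in hardware
                    acc = 0.0
                    for ic in channels:        # only nonzero 4x4 blocks
                        acc = acc + winograd_tile(X[ic][(h, w)],
                                                  weights[oc][ic])
                    # correct the saved reference result (zero for frame 0)
                    Y_ref[oc][(h, w)] = Y_ref[oc][(h, w)] + acc
    return Y_ref
```

The `winograd_tile` helper is the one sketched in Section II-B; in hardware, the N inner output channels map onto the N processing units described below.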
Algorithm 2 Pseudocode of Dataflow for conv1 × 1

We changed the traditional conv1 × 1 dataflow (see Algorithm 2) to make it more like the conv3 × 3 dataflow used with the Winograd algorithm. Since the conv1 × 1 convolution does not use the Winograd algorithm, the input weight is expanded from the 1 × 1 kernel to a 4 × 4 matrix in which all elements are the same. This is done in order to maintain the 4 × 4 matrix elementwise multiplication. Without the Winograd matrix conversion, the convolution result can be obtained by directly doing the matrix elementwise multiplication. Therefore, the dataflow for conv1 × 1 is like the conv3 × 3 dataflow under the Winograd algorithm with some operations removed. This similarity between the convolutional layers is beneficial for improving the operating efficiency of the hardware.

The dataflow parameters to be chosen include the number of frames in a frame package and N, the number of parallel output channels. Based on the algorithm, the frame package size will have little effect on the accuracy of the convolution result. Since the first frame of each frame package performs a complete convolution operation, larger frame package sizes lead to higher utilization of the interframe similarity and improved computational efficiency. However, the frame package size is limited by the available BRAM resources, since the activation value data of each frame package must be partially stored there. Based on the BRAM resources on the ZCU104 board, we chose a frame package size of 8. The effect of the similarity between frames on utilization and the acceleration efficiency of different frame package sizes will be discussed in detail in the following. The parallel efficiency of the accelerator is directly proportional to the number of parallel output channels, and the number of parallel output channels needs to match the off-chip bandwidth. Considering the bandwidth and on-chip resources, the number of parallel output channels is set to 16 in our design. The PE array of our design can perform 256 multiplications per clock cycle.

The dataflow and parallel mode designs are closely related to the hardware design. To improve the parallel efficiency of the hardware, we perform parallel operations over N output channels and over the 4 × 4 elementwise matrix multiplication. We attempted to ensure the use of differential input sparsity to achieve the best possible acceleration effect. A single processing unit can complete a 4 × 4 elementwise matrix multiplication operation, and our calculation array has N such units. Each unit is composed of 16 PEs, thus allowing the completion of a 4 × 4 elementwise matrix multiplication in one clock cycle. To meet the requirements of the Winograd algorithm, we also designed modules for the Winograd matrix conversion. These modules are mainly composed of shifters and adders, which consume very few hardware resources. Internal BRAM stores some of the weights and activation value data along with the convolution of the reference frame. The specific hardware structure will be introduced in detail in the following.

IV. ARCHITECTURE OF ACCELERATOR

This section introduces the accelerator architecture in detail. We explain how we used the interframe similarity in hardware and how we deployed the Winograd algorithm.

Fig. 6 shows the overall structure of the CNN accelerator, which includes I/O, storage, control, and processing modules. The I/O module exchanges data with the external DRAM using the AXI4 protocol, and the storage module includes the input, output, and reference result buffers (RRBs). The processing module includes the index and matrix transform modules, the process unit (PU) array, and the correction, ReLU and maxpooling, and difference modules. The accelerator uses the PU array to perform elementwise multiplication for several output channels. The control module receives parameters regarding the current convolutional layer from the I/O module and then determines whether the PUs should perform a conv1 × 1 or conv3 × 3 operation. It also controls the processing loops based on the parameters of the layer.

Fig. 6. Accelerator architecture.

A. Design for Skipping Similarity Between Frames

We designed several modules to use the similarity between frames in video applications. The modules related to incremental operation are the index, correction, index compute, and difference modules.

As mentioned earlier, we leverage the similarity between frames by subtracting the reference frame from the current frame. Upon the completion of the convolution operation, the reference frame is corrected to obtain the current frame. In order to skip input that does not need to be processed, the index module reads the nonzero activation block index from the input buffer and outputs the activations and weights of the channels that need to be processed to the subsequent modules. The correction module is used to get the convolution result of the current frame from the reference result. The module retrieves reference data from the RRB, produces the current frame by accumulating it with the differential convolution result, and updates the reference data in the RRB. In addition, we designed the index compute module, which calculates the nonzero activation block index of the next layer's input. The difference module calculates the difference between the previous and current frames.

1) Index Module: In order to skip activation blocks that do not need to be calculated, we must process the data before performing the convolution operation. Since our differential input is densely stored, information that does not contribute to the result would still consume computing resources. To address this issue, the index module selects the data to be calculated using the input nonzero activation block index, and it outputs the selected data to the convolution processing module.

As shown in Fig. 7, 16 input channels of 4 × 4 activation value blocks are input every clock cycle. The nonzero activation block index stores the indexes of the N channels to be processed (N < 16), and the activations and corresponding weights of the N channels are selected in the next cycle. Using this approach, the convolution processing module only receives data that need to be calculated. Channels that should be ignored are discarded in the index module, thus allowing our calculation to skip the similar data between frames.

When processing the first frame of each frame package, the input is the complete frame since the initial saved frame is zero. To ensure consistent operation between the first and

subsequent frames, we set the first frame index to be dense, so no channels are discarded. This ensures the consistency of the calculation process and reduces the complexity of the hardware.

Fig. 7. Index processing.

2) Correct Module: The convolution of the differential input is not the final result, but the difference between the current frame and the reference frame. The neural network contains both linear and nonlinear operations. Nonlinear operations cannot be performed as incremental operations, so we need a complete convolution result to run them. In the correct module, each time we get a set of convolution results, the reference frame convolution of the corresponding channel is combined with the result to get the current frame. The current frame is then stored back into the RRB. Since this process is pipelined, we can process the current set of data while calculating the next set of data.

The first frame contains the complete convolution result, so we directly overwrite the corresponding position in the RRB.

3) Index Compute Module: The index compute module accepts data from the difference module and calculates the nonzero index of that data. As shown in Fig. 8, the index compute module calculates the L1 norm of the 4 × 4 activation blocks for all 16 output channels. When the L1 norm of the differenced data is greater than our threshold, the activation value block needs to be processed, and its channel number is added to the nonzero activation block index. Conversely, if the L1 norm is less than the threshold, we conclude that the activation value block does not contribute to the convolution result, and the channel number of the activation value block is omitted from the nonzero activation block index. This results in the activation block being skipped during the calculation of the next layer.

We do not perform the threshold process for the first frame of each frame package, so all activation value blocks are processed.

B. Design for Winograd Algorithm

The architecture of the Winograd convolution is shown in Fig. 9. The conv3 × 3 convolutional layer in the neural network uses the Winograd algorithm, and we adapted the convolution operation module accordingly. The convolution processing module for the Winograd algorithm includes a matrix transform module and the PU array. The matrix transform module is used in the conv3 × 3 convolutional layer to convert the input activations, input weights, and elementwise multiplication results. The PU array performs the elementwise matrix multiplication operations. There are a total of 16 PU groups.

1) Matrix Transform Module: The matrix transform module includes the activation, weight, and elementwise multiplication result transform submodules. The weight transform computes GgG^T, the activation transform computes B^T dB, and the elementwise multiplication result transform computes A^T R A, where R is the elementwise product of GgG^T and B^T dB. It is worth noting that although each transform is a matrix multiplication, the A, G, and B matrix elements only take the values 0, ±1, and ±0.5. This means that we do not need a hardware multiplication unit to perform the transformations, and we can use shifters and adders instead. Since there is no need to perform a matrix transform when computing conv1 × 1 convolutional layers, the matrix transform module is not active during this phase. The input data width is 8 bit, and the transformed data width is 10 bit. Before performing the elementwise matrix multiplication, we truncate the results. The output data width is 8 bit.
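Because the transform matrices contain only 0, ±1, and ±0.5, each transform reduces to additions and shifts. The sketch below illustrates this for one column of the weight transform Gg, scaling the result by 2 so that the 0.5 factors stay integer; this is our illustration of the idea, not the RTL.

```python
import numpy as np

G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])

def g_transform_shift_add(g):
    # Computes 2 * (G @ g) for one 3-element column using shifts and adds
    # only; hardware effectively keeps the extra fractional bit instead of
    # multiplying by 0.5, consistent with the widened transformed word.
    g0, g1, g2 = int(g[0]), int(g[1]), int(g[2])
    s = g0 + g2
    return np.array([g0 << 1, s + g1, s - g1, g2 << 1])

g = np.array([3, -5, 7])
assert np.array_equal(g_transform_shift_add(g), (2 * (G @ g)).astype(int))
```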
2) PU Array: In our accelerator, the PU array is responsible for performing the elementwise matrix multiplication operations, and it calculates the results of 16 output channels every clock cycle. Each PU receives activations and weights from either the matrix transform module or the index module; all PUs receive the same activations but the weights of different output channels. Sixteen PEs are included in each PU to ensure that 16 multiply–accumulate operations can be completed in one clock cycle. A PE performs multiplication and accumulation operations, and it outputs the results after all the required input channels have been calculated. A PE has an 8 bit × 8 bit multiplier. The structure of the PU array is shown in Fig. 10.

C. Storage

As shown in Fig. 11, storage is divided into input, output, and reference result buffers. The input and output buffers can be further subdivided into the input index buffer, the weight buffer, the input activation buffers (IMBs), the output index buffer, and the output activation buffers (OMBs). Each buffer has a ping–pong structure to improve processing efficiency, which is described later in this article. All activations and weights have been quantized to 8-bit fixed-point numbers.

The IMB includes ping and pong buffers. The width of each buffer is 128 bits, the depth is 196, and the size of each ping–pong buffer is 3136 bytes. Each IMB can store 16 channels of 14 × 14 activation blocks. For the conv3 × 3 convolutional layer, the IMB is fully utilized, and the 14 × 14 activation block contains a 12 × 12 result after convolution. In order to maintain the consistency of the output activation with the conv1 × 1 layer, the IMB may only use 12 × 12 × 128 bit of data. For conv3 × 3, if the complete input activation is greater than 14 × 14, it is divided into several 14 × 14 activation blocks, where each activation block has a 2 × 14 block overlap. Likewise, for conv1 × 1, if the complete input activation size is greater than 12 × 12, the input

activation is divided into several 12 × 12 activation blocks.

The input weight buffer (WB) also contains ping and pong buffers. The bit width of each buffer is 128 bits, the depth is 144, and the size of each ping–pong buffer is 2304 bytes. Each WB can store up to 16 input channels corresponding to the weights in the 3 × 3 convolution kernels of 16 output channels. For conv1 × 1, we only use 256 bytes.

The input index buffer (IIDB) stores the indexes of the nonzero 4 × 4 activation blocks; its size is 576 bytes, and the ping–pong structure is also used. The index at each position stores the serial numbers of the nonzero channels in the block.

The OMB includes ping and pong buffers. Each activation is 24 bits wide, so the width of each buffer is 24 bits × 16 = 384 bits, the depth is 144, and the size of each buffer is 6912 bytes. Each block of the OMB can store 12 × 12 activation blocks of 16 output channels. The maxpooling operation only uses 384 bits × 6 × 6, and it stores 6 × 6 activation blocks of 16 output channels. When the maxpooling operation is not required, the output buffer is fully utilized. The I/O module shifts and truncates the 24-bit activations to transform them into 8-bit activations and outputs them through the AXI bus.

The output index buffer (OIDB) is the same size as the IIDB, and it stores the nonzero 4 × 4 activation block index of the difference between the output of the current layer and the saved result.

Fig. 8. Index compute process.
Fig. 9. Architecture of the Winograd convolution.
Fig. 10. Structure of PU array.
Fig. 11. Structure of storage.

The ping–pong structure enables the input and output stages to be executed in parallel with the data processing stage. One buffer accepts the data from the I/O module, while the other is used for processing, which improves the efficiency of the hardware since a pipeline is produced. The pipeline of our design is shown in Fig. 12.

Fig. 12. Processing pipeline.

The RRB is 6912 bytes, and it stores the convolution of the previous frame, which is added to the differential convolution to obtain the convolution of the current frame. The frames are compared to determine the output index. After the current frame is completely processed, the data in the RRB are updated to the convolution result of the current frame.

V. EXPERIMENT

In this article, we used a Xilinx ZCU104 to evaluate our design. Vivado HLS 2019.2 was used to implement the accelerator and to generate the bitstream. The accelerator runs at a frequency of 200 MHz. We completed the performance evaluation of the YOLOv3-tiny network accelerator using an FPGA, and the power consumption was evaluated using the Xilinx Power Estimator.

A. Resource Utilization

TABLE III
RESOURCE UTILIZATION

The resource consumption of our accelerator on the FPGA development board is shown in Table III. The most important values are the utilization of the BRAM_18K and DSP resources. BRAM provides caching and storage in the accelerator. The memory size is determined by the frame package and the stored input and output data channel sizes. Increasing the frame package size provides higher speedup, and using more stored input and output data channels increases the multiplexing of input data. This reduces data exchanges between the accelerator and the DDR. The DSPs are primarily used for multiply–add operations in the accelerator. The PU array performs 256 multiplication and addition operations per clock cycle. Since YOLOv3-tiny is 8-bit quantized, each of the 128 DSP resources implements two multiplication operations. Later, we will discuss the throughput and energy consumption provided by each DSP.

B. Arithmetic Complexity

1) Similarity Between Frames and Arithmetic Complexity: When determining whether a 4 × 4 activation value block needs to be processed, we calculate the L1 norm and omit the block if the norm is less than our threshold. We set different thresholds in different layers, corresponding to Table II.

The similarity between frames reflects the proportion of operations that can be skipped during incremental operation. The higher the similarity between frames, the more convolution operations can be skipped, and the better the acceleration effect. The accuracy of the network is determined by the threshold. In order to discuss how the similarity between frames changes with and without thresholds, we calculated the similarity between the input frames of each convolutional layer under the thresholds of Table II and under no threshold on the ImageNet ILSVRC2015 video dataset. This statistic reflects the average interframe similarity of all frames in the dataset relative to the previous frame. The results are shown in Fig. 13.

Fig. 13. Similarity between frames of different thresholds.

When we use incremental operation with no threshold, the average interframe similarity of the network is 0.2998. When we use incremental operation with the thresholds of Table II, the average interframe similarity of the network improves to 0.6297. The interframe similarity reflects the proportion of the convolution calculation that can be skipped for each frame in an ideal situation: the ideal arithmetic complexity is 1 minus the average interframe similarity, and the ideal speedup is its reciprocal. Ideally, the arithmetic complexity of incremental operation with no threshold is 70.02%, and the speedup is 1.428×. With the thresholds of Table II, the arithmetic complexity is 37.03%, and the speedup is 2.701×. The incremental operation with thresholds reduces the arithmetic complexity and improves the speedup without accuracy loss. All the incremental operations we mention later use the thresholds of Table II.
memory size is determined by the frame package and stored Recall that due to the limitation of hardware resources, the
input and output data channel sizes. Increasing frame package size of the frame package cannot be expanded indefinitely.
size provides higher speedup, and using more stored input The first frame of the frame package will perform a complete
and output data channels increases the multiplexing of input convolution operation instead of an incremental operation.
data. This reduces data exchanges between the accelerator and Therefore, the size of the frame packet will affect the actual
DDR. DSP is primarily used for multiply–add operations in algorithm complexity. Besides, in actual hardware accelera-
the accelerator. The PU array performs 256 multiplication and tion, a part of the data input and output time cannot be covered
addition operations per clock cycle. Since YOLOv3-tiny is by the calculation time, so the actual speedup will be lower
8-bit quantized, each of the 128 DSP resources implements than the ideal speedup.
two multiplication operations. Later, we will discuss the 2) Frame Package Size and Arithmetic Complexity: The first
throughput and energy consumption provided by each DSP. frame of each frame package must do a complete convolution

operation, which reduces the utilization of the similarity between frames. We define the size of each frame package as the frame batch size. The larger the frame package is, the fewer frames need to do the complete convolution operation, and the higher the utilization of the similarity between frames.

The utilization of interframe similarity is reflected in the arithmetic complexity of the convolution operations. The arithmetic complexity is the ratio of the number of convolution operations performed to the number of original convolution operations. We evaluated the arithmetic complexity of different frame batch sizes on the ImageNet ILSVRC2015 dataset with the thresholds of Table II. The frame batch size was set to 2, 4, and 8. The result is shown in Fig. 14.

Fig. 14. Arithmetic complexity of different batch sizes.

As the frame batch size increases, the arithmetic complexity of the network approaches the ideal arithmetic complexity. Under the thresholds of Table II, the ideal network arithmetic complexity is 37.03% of the original convolution. The network arithmetic complexity is 68.51%, 52.76%, and 44.89% for batch sizes 2, 4, and 8, respectively. This indicates that we should set the frame batch size as large as possible.
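The measured batch-size numbers are consistent with a simple amortization model, in which one frame per package pays the full convolution cost and the remaining frames pay the ideal 37.03%. The formula below is our inference from the reported values, not one stated in the paper:

```python
def package_complexity(batch_size, ideal=0.3703):
    # One full-cost frame plus (batch_size - 1) incremental frames, averaged.
    return (1.0 + (batch_size - 1) * ideal) / batch_size

for ig in (2, 4, 8):
    print(ig, package_complexity(ig))
# -> approximately 0.685, 0.528, 0.449, matching the reported
#    68.51%, 52.76%, and 44.89% up to rounding
```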
TABLE IV
NETWORK PROCESSING TIME

TABLE V
PERFORMANCE OF ACCELERATOR IN THREE CASES

C. Performance

The accelerator frequency we designed is 200 MHz, and the frame batch size is 8. On the ImageNet ILSVRC2015 dataset, we evaluated the accelerator's processing time for the YOLOv3-tiny network in the original setup, using the Winograd algorithm, and using the Winograd algorithm together with incremental operation. We observed the processing time of each layer, as well as the total processing time. The time required to perform the ReLU and maxpooling processing after a convolutional layer was included in the processing time for that layer. The processing time for each layer and the total include the network processing time to complete all eight frames of a frame package. The average network processing time over all frame packages in the dataset is shown in Table IV.

The processing time of the original network is about 429.1 ms. With the Winograd algorithm, the processing time of the network is roughly 207.9 ms. With the Winograd algorithm and incremental operation, the processing time of the network is about 107.8 ms. The Winograd algorithm and the utilization of the similarity between frames greatly reduce the network

processing time of each frame package.

The Winograd algorithm reduces the number of multiplications in the conv3 × 3 layers, making the processing time of the conv3 × 3 layers shorter. Compared to the original network, the network processing time with the Winograd algorithm has a speedup of 2.06×. The incremental operation allows each layer to skip part of the convolution operation, providing a 1.99× speedup. The network using the Winograd algorithm and incremental operation has a 4.10× speedup compared to the original network.

Through the processing time and the GFLOPs of the YOLOv3-tiny network, we calculated the performance of the accelerator in the three cases, as shown in Table V. In general, our design provides a 4.1× speedup with no loss of accuracy.

The design of our algorithm is not affected by the hardware structure. Regardless of the size of the hardware structure, the same proportion of convolutions can still be skipped. Therefore, architectures of different sizes can still obtain the same relative speedup.

D. Result Comparison

We recorded the performance of our design on general video applications (the ImageNet ILSVRC2015 dataset). We then compared the performance with a 3-D CNN accelerator [28] and other YOLO accelerators [18]–[20]. The comparison data are shown in Table VI.

TABLE VI
PERFORMANCE COMPARISON WITH PREVIOUS IMPLEMENTATIONS

One 25 bit × 18 bit DSP can be decomposed to handle two 8 bit × 8 bit multiplications, while it can only handle one 16 bit × 16 bit or 18 bit × 18 bit multiplication. Thus, an 8-bit accelerator has twice the overall DSP efficiency compared with a 16-bit accelerator. Also, the working frequency is in part determined by the occupation of the FPGA device. Thus, the normalized DSP efficiency of different designs can be computed as follows:

$$\text{Normalized DSP efficiency} = \frac{\text{DSP efficiency}}{\alpha \times (f \div 100\ \text{MHz})}. \tag{9}$$

Here, α = 1 for a 16-bit or 18-bit accelerator, and α = 2 for an 8-bit accelerator.
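As a worked example of (9), our own design (8-bit, so α = 2, at f = 200 MHz) normalizes as follows:

```python
def normalized_dsp_efficiency(gops_per_dsp, alpha, f_mhz):
    # Eq. (9): discount raw GOPS/DSP by word width (alpha)
    # and by the clock ratio f / 100 MHz.
    return gops_per_dsp / (alpha * (f_mhz / 100.0))

print(normalized_dsp_efficiency(2.384, alpha=2, f_mhz=200))  # -> 0.596
```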
The DSP efficiency of our design is 2.384 GOPS/DSP. The normalized DSP efficiency is 0.596. Compared to the 3-D CNN accelerator, our design provides a 9.17×–9.77× normalized DSP efficiency. Compared to other YOLO accelerators, our design provides a 3.13×–18.34× normalized DSP efficiency.

Our design has an energy efficiency of 105.4 GOPS/W. Compared to other designs, our design provides 1.10×–14.2× energy efficiency.

In general, due to the video optimizations, the CNN accelerator we designed has significant advantages over other neural network accelerators in video applications.

VI. CONCLUSION

In this article, we proposed a CNN FPGA accelerator that takes advantage of the similarity between frames in video applications. The accelerator skips the calculation of similar data between frames by using incremental operations. In addition, we used the Winograd algorithm to improve the utilization of DSP resources on the FPGA while increasing accelerator throughput. We designed a dataflow based on the characteristics of the incremental operation and Winograd algorithms, which increased the efficiency of data reuse and array processing. Experimental results show that our accelerator achieved 74.2 frames/s on ImageNet ILSVRC2015. Compared to the original network without the Winograd algorithm and incremental operation, our design provides a 4.10× speedup. When compared with other YOLO network FPGA accelerators applied to video applications, our design provided
a 3.13×–18.34× normalized DSP efficiency and 1.10×–14.2× energy efficiency.

ACKNOWLEDGMENT

The authors would like to thank LetPub (www.letpub.com) for its linguistic assistance during the preparation of this manuscript.

REFERENCES

[1] C.-H. Lin et al., "7.1 A 3.4-to-13.3TOPS/W 3.6TOPS dual-core deep-learning accelerator for versatile AI applications in 7 nm 5G smartphone SoC," in IEEE ISSCC Dig. Tech. Papers, San Francisco, CA, USA, Feb. 2020, pp. 134–136, doi: 10.1109/ISSCC19947.2020.9063111.
[2] Y. Jiao et al., "7.2 A 12 nm programmable convolution-efficient neural-processing-unit chip achieving 825TOPS," in IEEE ISSCC Dig. Tech. Papers, San Francisco, CA, USA, Feb. 2020, pp. 136–140, doi: 10.1109/ISSCC19947.2020.9062984.
[3] W. Shan et al., "14.1 A 510 nW 0.41 V low-memory low-computation keyword-spotting chip using serial FFT-based MFCC and binarized depthwise separable convolutional neural network in 28 nm CMOS," in IEEE ISSCC Dig. Tech. Papers, San Francisco, CA, USA, Feb. 2020, pp. 230–232, doi: 10.1109/ISSCC19947.2020.9063000.
[4] P. C. Knag et al., "A 617 TOPS/W all digital binary neural network accelerator in 10 nm FinFET CMOS," in Proc. IEEE Symp. VLSI Circuits, Honolulu, HI, USA, Jun. 2020, pp. 1–2, doi: 10.1109/VLSICircuits18222.2020.9162949.
[5] J.-W. Chang, K.-W. Kang, and S.-J. Kang, "SDCNN: An efficient sparse deconvolutional neural network accelerator on FPGA," in Proc. Design, Automat. Test Eur. Conf. Exhib. (DATE), Florence, Italy, Mar. 2019, pp. 968–971, doi: 10.23919/DATE.2019.8715055.
[6] J.-W. Chang, K.-W. Kang, and S.-J. Kang, "An energy-efficient FPGA-based deconvolutional neural networks accelerator for single image super-resolution," IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 1, pp. 281–295, Jan. 2020, doi: 10.1109/TCSVT.2018.2888898.
[7] L. Lu, J. Xie, R. Huang, J. Zhang, W. Lin, and Y. Liang, "An efficient hardware accelerator for sparse convolutional neural networks on FPGAs," in Proc. IEEE 27th Annu. Int. Symp. Field-Program. Custom Comput. Mach. (FCCM), San Diego, CA, USA, Apr. 2019, pp. 17–25, doi: 10.1109/FCCM.2019.00013.
[8] Z. Wang, K. Xu, S. Wu, L. Liu, L. Liu, and D. Wang, "Sparse-YOLO: Hardware/software co-design of an FPGA accelerator for YOLOv2," IEEE Access, vol. 8, pp. 116569–116585, 2020, doi: 10.1109/ACCESS.2020.3004198.
[9] H. Kim and K. Choi, "Low power FPGA-SoC design techniques for CNN-based object detection accelerator," in Proc. IEEE 10th Annu. Ubiquitous Comput., Electron. Mobile Commun. Conf. (UEMCON), New York, NY, USA, Oct. 2019, pp. 1130–1134, doi: 10.1109/UEMCON47517.2019.8992929.
[10] Z. Yuan et al., "14.2 A 65 nm 24.7 μJ/frame 12.3 mW activation-similarity-aware convolutional neural network video processor using hybrid precision, inter-frame data reuse and mixed-bit-width difference-frame data codec," in IEEE ISSCC Dig. Tech. Papers, San Francisco, CA, USA, Feb. 2020, pp. 232–234, doi: 10.1109/ISSCC19947.2020.9063155.
[11] M. Riera, J.-M. Arnau, and A. Gonzalez, "Computation reuse in DNNs by exploiting input similarity," in Proc. ACM/IEEE 45th Annu. Int. Symp. Comput. Archit. (ISCA), Los Angeles, CA, USA, Jun. 2018, pp. 57–68, doi: 10.1109/ISCA.2018.00016.
[12] M. Buckler, P. Bedoukian, S. Jayasuriya, and A. Sampson, "EVA²: Exploiting temporal redundancy in live computer vision," in Proc. ACM/IEEE 45th Annu. Int. Symp. Comput. Archit. (ISCA), Los Angeles, CA, USA, Jun. 2018, pp. 533–546, doi: 10.1109/ISCA.2018.00051.
[13] Y. Wang, Y. Wang, H. Li, Y. Han, and X. Li, "An efficient deep learning accelerator for compressed video analysis," in Proc. 57th ACM/IEEE Design Automat. Conf. (DAC), San Francisco, CA, USA, Jul. 2020, pp. 1–6, doi: 10.1109/DAC18072.2020.9218743.
[14] C. Zhu, K. Huang, S. Yang, Z. Zhu, H. Zhang, and H. Shen, "An efficient hardware accelerator for structured sparse convolutional neural networks on FPGAs," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 28, no. 9, pp. 1953–1965, Sep. 2020.
[15] S. Zhang, J. Cao, Q. Zhang, Q. Zhang, Y. Zhang, and Y. Wang, "An FPGA-based reconfigurable CNN accelerator for YOLO," in Proc. IEEE 3rd Int. Conf. Electron. Technol. (ICET), Chengdu, China, May 2020, pp. 74–78, doi: 10.1109/ICET49382.2020.9119500.
[16] Q. Yin et al., "FPGA-based high-performance CNN accelerator architecture with high DSP utilization and efficient scheduling mode," in Proc. Int. Conf. High Perform. Big Data Intell. Syst. (HPBD&IS), Shenzhen, China, May 2020, pp. 1–7, doi: 10.1109/HPBDIS49115.2020.9130576.
[17] D. T. Nguyen, T. N. Nguyen, H. Kim, and H. J. Lee, "A high-throughput and power-efficient FPGA implementation of YOLO CNN for object detection," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 8, pp. 1861–1873, Aug. 2019.
[18] A. Ahmad, M. A. Pasha, and G. J. Raza, "Accelerating tiny YOLOv3 using FPGA-based hardware/software co-design," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Seville, Spain, Oct. 2020, pp. 1–5, doi: 10.1109/ISCAS45731.2020.9180843.
[19] J. Zhang et al., "A low-latency FPGA implementation for real-time object detection," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Daegu, South Korea, May 2021, pp. 1–5, doi: 10.1109/ISCAS51556.2021.9401577.
[20] T. Adiono, A. Putra, N. Sutisna, I. Syafalni, and R. Mulyawan, "Low latency YOLOv3-tiny accelerator for low-cost FPGA using general matrix multiplication principle," IEEE Access, vol. 9, pp. 141890–141913, 2021, doi: 10.1109/ACCESS.2021.3120629.
[21] P. O'Connor and M. Welling, "Sigma delta quantized networks," 2016, arXiv:1611.02024.
[22] D. Neil, J. H. Lee, T. Delbruck, and S.-C. Liu, "Delta networks for optimized recurrent network computation," in Proc. 34th Int. Conf. Mach. Learn. (ICML), Sydney, NSW, Australia, 2017, pp. 2584–2593.
[23] A. Lavin and S. Gray, "Fast algorithms for convolutional neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, Jun. 2016, pp. 4013–4021, doi: 10.1109/CVPR.2016.435.
[24] X. Wang, C. Wang, J. Cao, L. Gong, and X. Zhou, "WinoNN: Optimizing FPGA-based convolutional neural network accelerators using sparse Winograd algorithm," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 39, no. 11, pp. 4290–4302, Nov. 2020, doi: 10.1109/TCAD.2020.3012323.
[25] J. Yepez and S.-B. Ko, "Stride 2 1-D, 2-D, and 3-D Winograd for convolutional neural networks," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 28, no. 4, pp. 853–863, Apr. 2020, doi: 10.1109/TVLSI.2019.2961602.
[26] B. Widrow, I. Kollar, and M.-C. Liu, "Statistical theory of quantization," IEEE Trans. Instrum. Meas., vol. 45, no. 2, pp. 353–361, Apr. 1996, doi: 10.1109/19.492748.
[27] Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit. (ISCA), Seoul, South Korea, Jun. 2016, pp. 367–379, doi: 10.1109/ISCA.2016.40.
[28] M. Sun, P. Zhao, M. Gungor, M. Pedram, M. Leeser, and X. Lin, "3D CNN acceleration on FPGA using hardware-aware pruning," in Proc. 57th ACM/IEEE Design Automat. Conf. (DAC), San Francisco, CA, USA, Jul. 2020, pp. 1–6, doi: 10.1109/DAC18072.2020.9218571.
[29] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015, doi: 10.1007/s11263-015-0816-y.

Shengzhao Li (Graduate Student Member, IEEE) received the B.S. degree from Shanghai Jiao Tong University, Shanghai, China, in 2019, where he is currently working toward the master's degree at the College of Electronic Engineering. His research interest includes hardware acceleration of neural networks.

Qin Wang (Member, IEEE) received the B.S. degree from the University of Electronics Science and Technology of China, Chengdu, China, in 1997, and the Ph.D. degree from Shanghai Jiao Tong University, Shanghai, China, in 2004. She is currently a Professor with the Department of Micro/Nano Electronics, Shanghai Jiao Tong University. Her research interests include high-performance processor and in-memory computing.

Jianfei Jiang (Member, IEEE) received the B.S. degree from Zhejiang University, Hangzhou, China, in 2000, and the M.S. and Ph.D. degrees from Shanghai Jiao Tong University, Shanghai, China, in 2007 and 2017, respectively. He is currently an Assistant Professor with the Department of Microelectronics and Nanoscience, Shanghai Jiao Tong University. His current research interests include high-speed on-chip interconnect, low-power circuit design, and high-speed circuit design.

Weiguang Sheng (Member, IEEE) received the bachelor's, master's, and Ph.D. degrees from the Harbin Institute of Technology, Harbin, China, in 1999, 2004, and 2009, respectively. He is currently a Research Assistant Professor with the Department of Micro/Nano Electronics, Shanghai Jiao Tong University, Shanghai, China. His research interests include reconfigurable architectures and compiling techniques, soft error analysis, and optimization.

Naifeng Jing (Senior Member, IEEE) received the Ph.D. degree from Shanghai Jiao Tong University, Shanghai, China, in 2012. He is currently an Associate Professor with the Department of Micro/Nano Electronics, Shanghai Jiao Tong University. His research interests include high-performance and high-reliability computing architecture and systems, in-memory computing architecture, and computer-aided VLSI design.

Zhigang Mao (Member, IEEE) received the B.S. degree from Tsinghua University, Beijing, China, in 1986, and the Ph.D. degree from the University of Rennes 1, Rennes, France, in 1992. From 1992 to 2006, he was with the Microelectronics Center, Harbin Institute of Technology, Harbin, China. In 2006, he joined the Department of Micro/Nano Electronics, Shanghai Jiao Tong University, Shanghai, China, where he is currently a Professor. His current research interests include digital signal processor (DSP) architecture design, video processor design, and reconfigurable processor architecture.