Article
FPGA Implementation of a Deep Learning Acceleration Core
Architecture for Image Target Detection
Xu Yang 1, Chen Zhuang 2,*, Wenquan Feng 1, Zhe Yang 1 and Qiang Wang 1
1 School of Electronic & Information Engineering, Beihang University, Beijing 100080, China
2 Hefei Innovation Research Institute of Beihang University, Hefei 230012, China
* Correspondence: zhuangchen0214@buaa.edu.cn
Abstract: Due to the flexibility and ease of deployment of Field Programmable Gate Arrays (FPGAs), more and more studies have developed and optimized target detection algorithms based on Convolutional Neural Network (CNN) models using FPGAs. However, these studies focus on improving the performance of the core algorithm and optimizing the hardware structure, and few address a unified architecture design and the corresponding optimization techniques for the algorithm model, resulting in inefficient overall model performance. The essential reason is that these studies do not treat arithmetic power, speed, and resources consistently. To solve this problem, we propose an FPGA-based deep learning acceleration core architecture designed for target detection algorithms with CNN models. It uses multi-channel parallelization of the CNN network model to improve arithmetic power, uses scheduling tasks and intensive computation pipelining to meet the algorithm's data bandwidth requirements, and unifies the speed and area of the orchestrated computation matrix to save hardware resources. The proposed framework achieves 14 Frames Per Second (FPS) inference of the TinyYolo model, whose workload is about 5 giga operations (GOP) per frame, with a 30% higher running clock frequency, 2–4 times higher arithmetic power, and 28% higher Digital Signal Processing (DSP) resource utilization efficiency, while using less than 25% of the FPGA resources.
Keywords: target detection; TinyYolo; FPGA; acceleration core; parallel acceleration; pipeline; resource optimization

1. Introduction
Target detection is a popular research field in computer vision, widely used in aerial photography, intelligent surveillance, industrial inspection, and other fields. Compared with traditional algorithms, deep learning methods offer higher accuracy and robustness for target detection in complex scenarios. Deep learning detection algorithms such as You Only Look Once (YOLO) [1] and Faster Region Convolutional Neural Networks (Faster R-CNN) [2] have shown higher accuracy and robustness than traditional algorithms for target detection tasks in visible and Synthetic Aperture Radar (SAR) images. YOLO treats the object detection task as a regression problem by taking the entire image as input to the network and using a CNN structure to achieve end-to-end target detection. The YOLO network consists of 24 convolutional layers followed by two fully connected layers. The convolutional layers extract the features of the original image, and the fully connected layers predict the output probabilities and coordinates. Among them, alternating 1 × 1 convolutional layers are used to reduce the feature map size of the preceding layers. The final output of the YOLO network is a 7 × 7 × 30 tensor.
Sun et al. propose an "Auto-T-YOLO" network model based on YOLOv4, which improves the detection accuracy of ship targets [3]. Sun et al. propose a novel YOLO-based arbitrarily-oriented SAR ship detector with Bi-directional Feature Fusion and Angle Classification
(BiFA-YOLO). Comparative experiments show that the method has better robustness and
adaptability [4]. Hu et al. propose a novel method for small ship detection based on the
basic YOLO network structure, which achieves state-of-the-art performance [5]. Li et al.
present a complete YOLO-based ship detection method using an improved YOLOv5s
model, providing a practical reference for large-scale ship detection [6]. Ye et al. propose a
Combined Attention Augmented YOLO (CAA-YOLO) algorithm to alleviate the recogni-
tion challenges of extremely multi-scale ships due to the severe lack of texture details [7].
Lu et al. propose an aerial image vehicle detection method based on the YOLO algorithm.
Experiments show that the training model performs well on unknown aerial images, es-
pecially for small objects and rotating objects [8]. Al-Batat et al. utilize a YOLO-based
end-to-end generic pipeline for vehicle detection without prior knowledge or additional
steps in inference and achieves average recognition accuracy of 90.3% [9]. Zhang et al.
propose a method for vehicle detection in different traffic scenarios based on an improved
YOLOv5 network to reduce the false detection rate of vehicle targets [10]. Liu et al. de-
velop a unique detection method based on YOLOv3 for small objects in the Unmanned
Aerial Vehicle (UAV) view. Experiments demonstrate that the performance of small object
detection is significantly improved [11]. Li et al. propose an improved Residual YOLO
(RES-YOLO) detection algorithm to solve the difficulties of automatic vehicle recognition.
The experimental results show that the proposed algorithm can automatically recognize
multiple vehicle targets and significantly reduce the missing and error rates [12]. Chen et al.
improve the Faster R-CNN for the bridge detection tasks of SAR images by combining
a multi-resolution attention network and region-binding network [13]. Li et al. combine
YOLOv4 with a point cloud algorithm to determine concrete cracks in bridges [14]. Du et al.
propose a target detection algorithm BE-YOLOv5S based on YOLO, which meets the needs
of bridge structure damage detection [15]. Lin et al. apply the YOLOv3 algorithm to the
spacecraft inspection task and improve the detection accuracy [16]. The network inference
capability of the above methods is based on a desktop Graphics Processing Unit (GPU),
which cannot guarantee real-time performance on embedded platforms. Therefore, these
methods often need to be accelerated by FPGAs, Digital Signal Processors (DSPs), and
Neural Processor Units (NPUs).
Madasamy et al. propose a deep YOLOv3 approach based on an embedded multi-
object detection and tracking system [17]. Jiang et al. designed a UAV Thermal Infrared
(TIR) object detection framework for images and videos based on a YOLO model with CNN
architecture. The highest Mean of Average Precision (mAP) is 88.69% [18]. Artamonov
et al. propose a YOLO approach to solve the traffic sign classification problem on the
mobile platform NVIDIA Jetson, which allows high-performance computing at low power
consumption [19]. Emin et al. propose a portable Advanced Driver Assistance System
(ADAS) based on the YOLOv5 algorithm for real-time traffic signs, vehicles, and pedestrians
detection. The system has excellent detection speed and accuracy for real-time road object
detection on a mobile platform [20]. Feng et al. propose a novel embedded YOLO model
to obtain real-time and high-accuracy performance on embedded devices. The embedded
YOLO model has only 3.53 M parameters and can reach an average processing speed of
155.1 FPS [21]. The network inference of the above approaches is based on mobile GPUs,
such as Nvidia’s TX2 with Nvidia Pascal architecture GPU. Although these GPUs are
robust and easy to deploy, they still face challenges in power, cost, and optimization.
In contrast, FPGAs offer many advantages for the unified design and optimization of arithmetic power, speed, and resources for CNNs. An FPGA can allocate and optimize hardware resources at the flip-flop level, allowing the algorithm structure to be precisely adjusted at the flip-flop and logic-gate level to ensure precise control of arithmetic power and resources. The rich clock networks and routing resources inside the FPGA can be designed for speed and hardware resource consistency to ensure timing convergence and resource coordination.
Currently, FPGAs are increasingly being used to implement CNN acceleration. Zhang
et al. propose an ARM+FPGA architecture on Xilinx ZCU102 FPGA for YOLOv2 and
TinyYOLOv2 on Microsoft Common Objects in Context (COCO) and Visual Object Classes
(VOC) 2007, respectively, [22]. To address the YOLO algorithms’ high processing accuracy
and speed requirements, Babu et al. propose an algorithm of YOLOv4-based on a Xilinx
ZYNQ-7000 system for real-time object detection [23]. Using a hybrid architecture of ARM
and FPGA, Xiong et al. deploy the YOLO model on FPGA to improve the efficiency of
target identification and detection with low resource and power consumption [24]. Chen et
al. propose RS (Row Stationary) data streams to reduce memory bandwidth requirements
by improving local storage utilization. AlexNet using RS data streams improves convo-
lutional layer performance by 1.4 times compared to existing data streams [25]. Liu et al.
propose a parallel framework including task, loop, and operation layers. The speedups of
AlexNet and Visual Geometry Group (VGG) were 6.96 times and 4.79 times, respectively,
compared with Intel i7-4790K CPU on Xilinx VC709 platform [26]. Peeman et al. propose a
memory-driven acceleration core with a hierarchical memory design without increasing
the bandwidth requirement, which reduces the resource consumption of FPGAs by 13
times [27]. Zhang et al. propose the Caffeine architecture, which achieves 365 Giga Operations Per Second (GOPS) on the
Xilinx KU060 platform [28]. Shen et al. divide FPGA resources into multiple subprocessors
according to the convolutional structure to improve the computational speed of CNN by
optimizing resource allocation and parallelism [29].
TinyYOLO is a lightweight, simplified version of YOLOv2. It is widely used in real-time target detection due to its fast speed and low memory consumption. On the VOC 2007 dataset, the mAP of TinyYOLO is 57.1, less accurate than YOLOv2, but its frame rate reaches 207 FPS, about three times that of YOLOv2, while its number of weight parameters is about one third of YOLOv2's. Because the application scenario of our proposed architecture is mainly real-time detection of bridges in UAV aerial images, real-time performance and power consumption are the main directions of our design and optimization. TinyYOLO achieves several times the detection speed of other lightweight algorithms, which makes it suitable for real-time detection scenarios that do not require very high accuracy. At the same time, since TinyYOLO has the network structure of a typical CNN algorithm, the proposed architecture can be conveniently fine-tuned to apply to other CNN algorithms. We therefore choose the TinyYOLO algorithm as the prototype for the framework implementation.

In this paper, we propose an FPGA-based deep learning acceleration core architecture to overcome the large computational and parameter volumes and low real-time performance of deep learning-based detection algorithms deployed on embedded devices.
As shown in Figure 1, the architecture consists of three parts. The first part is the video capture channel, which pre-processes the input video stream (e.g., decoding and de-framing) and stores the raw video data in off-chip storage. The second part is the deep learning acceleration core, which performs neural network inference and consists of a data loading kernel, a computational kernel, and a data unloading kernel. The third part is the scheduling system, which schedules the components within the acceleration core and also runs the embedded operating system. The second and third parts use the Advanced eXtensible Interface Memory Map (AXI-MM) bus for data interaction, while the components within the acceleration core cache small amounts of data in on-chip Block Random Access Memory (BRAM) and use the BRAM bus or registers for internal data interaction. Data entering the acceleration core is first loaded into the data load buffer through the data loading kernel. The scheduling state machine in the computation kernel then transfers the data to the computation matrix. After a round of computation, the data unloading kernel writes the data in the data storage buffer back to off-chip storage through the AXI-MM bus.
Figure 1. The architecture consists of three parts: the video capture channel, the deep learning acceleration core, and the scheduling system.
2. Proposed Method
In order to solve the problem of consistency of arithmetic power, speed, and resources, we make full use of the characteristics of FPGAs in the proposed architecture and adopt a comprehensive optimization method covering computation, timing, and resources for the CNN model: multi-channel parallelization improves the arithmetic power, scheduling tasks and intensive computation pipelining meet the data bandwidth requirements of the algorithm, and unified scheduling of the speed and area of the calculation matrix saves hardware resources. The full view of the study is shown in Figure 2.
Figure 3. The parallel acceleration of the convolution layer is divided into three main dimensions:
input channel parallelism, output channel parallelism, and pixel parallelism.
Input channel parallelism is the parallelization of the product of the input tensor and the convolution kernel, with the partial results added together to obtain one pixel of an output channel. Each cycle, input channel parallelism computes a dot product between the channel vector of one pixel of the input tensor and one pixel of the convolution kernel and finally accumulates the results. The hardware implementation of this operation requires the logic to access all channels of the input vector and the convolution kernel simultaneously, so the bit width of the computational kernel must be widened accordingly. For example, for a kernel with eight input channels in parallel, a single data transfer requires 8× the data bit width. If the number of input channels is less than eight, the remaining bit widths are padded with zeros. If the number of input channels exceeds eight, only eight are processed per computation; the computation is divided into multiple rounds, with the last round zero-padded.
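As an illustration of the zero-padding rule, one cycle of eight-way input channel parallelism can be modeled as a padded dot product. This is a minimal C++ sketch with illustrative names, not the authors' HLS code:

```cpp
#include <cstdint>

constexpr int IC_PAR = 8;  // input channel parallelism

// One cycle's worth of work: the dot product of the 8 input-channel lanes of
// one input pixel with the matching 8 lanes of one kernel tap. Lanes beyond
// `valid_ch` contribute zero, mirroring the zero-padded bit widths.
int32_t dot8(const int16_t in[IC_PAR], const int16_t wt[IC_PAR], int valid_ch) {
    int32_t acc = 0;
    for (int c = 0; c < IC_PAR; ++c) {          // fully unrolled in hardware
        int16_t a = (c < valid_ch) ? in[c] : int16_t(0);
        int16_t b = (c < valid_ch) ? wt[c] : int16_t(0);
        acc += int32_t(a) * int32_t(b);
    }
    return acc;
}
```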
Output channel parallelism refers to the simultaneous convolution of an input tensor
with multiple convolution kernels to obtain multiple channels of the output tensor. This
operation does not increase the bandwidth requirement of the computational kernel for the
input data. However, it requires the computational kernel to be able to access the values of
the convolution kernel on multiple output channels at the same time.
Pixel parallelism means that elements of the input tensor at different positions are convolved with the convolution kernel simultaneously, so that elements of the output tensor at different positions are obtained in parallel. Take the example of a 4 × 4 × 3 input tensor and a 2 × 2 × 3 × 2 convolution kernel. Numbering the top-left position of the input tensor as 1, the pixel values in the first row of the output tensor are obtained by simultaneously computing the convolutions of the pixels at positions 1, 2, and 3 with the convolution kernel. This operation requires simultaneous access to the element values of the input channel at different pixel positions, which requires a hardware design that increases the access bandwidth of the computational kernel. For example, in a parallel scheme with a pixel parallelism of 4, the input tensor is divided into different data buffer groups by column, and columns that do not fill a group of four are zero-padded. As a result, the pixel values read by the computational kernel are located in different buffer groups throughout the convolution operation, thus increasing the access bandwidth of the computational kernel.
As shown in Figure 4, the scheme used in this paper is eight-way input channel parallelism, eight-way output channel parallelism, and four-way pixel parallelism. The computational kernel can therefore perform 256 multiply–accumulate operations per clock, i.e., 512 operations. The computational kernel is designed to run at a minimum of 300 MHz, giving a computational power of 153.6 GOPS, which allows the accelerator to run the TinyYOLO detection model (about 2.5 G multiply–add operations per inference) in real time.
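As a consistency check, the quoted peak arithmetic power follows directly from the three parallelism factors and the clock frequency:

```latex
\[
8 \times 8 \times 4 = 256 \ \text{MACs/clock} = 512 \ \text{ops/clock},
\qquad 512 \times 300\ \text{MHz} = 153.6\ \text{GOPS}.
\]
```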
Figure 4. Internal data flow diagram of the computational kernel. Input channels: 3; input parallelism: 8; output parallelism: 8; pixel parallelism: 4.
The input data of the fully connected layer is 1 × 1 in length and width, but the number of input channels is generally large. The number of parameters is the number of input channels multiplied by the number of output channels, making the parameter count much larger than that of a convolutional layer. In this paper, we adopt an approach similar to the partial loading of convolutional layers: the parameters of the fully connected layer are partially loaded, with only one or a few output channels loaded at a time, to reduce the demand on on-chip BRAM. The computational scheduling kernel transfers the input data and weights to the external computation matrix in parallel; the FPGA's on-chip DSPs perform the multiplications between them and return the products to the computational scheduling kernel for accumulation.
In this paper, eight 16-bit values are combined into one word (WORD), and data and parameters are temporarily stored in BRAMs with a matching bit width and word length of 128 bits, so that the computational kernel can access eight channel values in parallel at the same time. If the number of input channels is greater than eight, the data of subsequent input channels are stored at incrementally higher addresses. If the number of channels is not a multiple of eight, the high bits are zero-padded to simplify the data reading operation. For each pixel of the convolution kernel, all input channel data for output channel 1 are stored first, then all input channel data for output channel 2, and so on, so that the computational kernel can access the parameter data of eight output channels simultaneously. The pixel-parallel access to the input data is implemented as in Figure 7.
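The channel packing can be summarized by the following minimal C++ sketch (a host-side model of the layout only; the struct and function names are illustrative):

```cpp
#include <cstdint>
#include <vector>

// Pack one pixel's input channels into 128-bit words of eight 16-bit lanes;
// lanes that do not fill the last word stay zero (the zero-padding rule above).
struct Word128 { int16_t lane[8]; };

std::vector<Word128> pack_pixel(const std::vector<int16_t>& channels) {
    std::vector<Word128> words((channels.size() + 7) / 8, Word128{});
    for (size_t c = 0; c < channels.size(); ++c)
        words[c / 8].lane[c % 8] = channels[c];   // word index = channel / 8
    return words;  // consecutive words occupy consecutive BRAM addresses
}
```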
In addition to channel remapping, the input data is also remapped within rows according to the pixel parallelism and stored in different BRAMs. In this paper, for example, the data in each row are grouped by column index modulo 4: columns 0, 4, 8, and so on form one group; columns 1, 5, 9, and so on the next; and so forth. If the number of columns is not a multiple of 4, the row is zero-padded up to a multiple of 4 to simplify the data reading operation. By storing groups 1 to 4 in separate BRAMs, the convolutional computational kernel has four access ports and can access the data in four BRAMs simultaneously.
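The column grouping can likewise be modeled in a few lines (again an illustrative host-side sketch, not the hardware description):

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Distribute one row of 128-bit words into 4 BRAM banks by column index
// modulo 4, so that 4 adjacent pixels can be read in the same cycle.
struct Word128 { int16_t lane[8]; };

std::array<std::vector<Word128>, 4>
split_row_by_mod4(const std::vector<Word128>& row_words) {
    std::array<std::vector<Word128>, 4> banks;
    size_t padded = (row_words.size() + 3) / 4 * 4;     // pad to multiple of 4
    for (size_t col = 0; col < padded; ++col) {
        Word128 w = (col < row_words.size()) ? row_words[col] : Word128{};
        banks[col % 4].push_back(w);                    // bank = column mod 4
    }
    return banks;
}
```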
Within a single convolutional layer calculation, the computation unit must perform a Width × Height × Depth × Channel set of multiply–accumulate operations for each input pixel. In this paper, the computational kernel takes the data of 8 input channels of 4 pixels and multiplies and accumulates them with the 8 input channels of the first pixel of the convolution kernel, then iterates through the input channels; it then shifts the whole input data window right by one pixel and multiplies and accumulates with the second pixel of the convolution kernel, until the result of one output channel is obtained; finally, it iterates through the output channels. After this round of computation, the convolution results of the 4 pixels in all output channels are obtained.
Considering that in the TinyYolo model the batch normalization layer and the Rectified Linear Unit (ReLU) layer both follow a convolutional layer, this paper integrates the batch normalization and activation layers with the convolutional layer in the processing flow, so that in practice the results only need to be adjusted just before the convolutional kernel writes out its output data, avoiding reloading data and parameters and reducing data access requirements.
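A common way to realize such a fusion is to fold the batch-normalization scale and shift, followed by ReLU, into a small post-processing step applied to each convolution accumulator before write-back. The sketch below is only illustrative: the paper does not give the exact formula, and the parameter names are not from the paper.

```cpp
#include <algorithm>
#include <cmath>

// Fused batch-norm + ReLU applied to a convolution accumulator before
// write-back (illustrative, floating point for clarity).
// gamma, beta, mean, var are the per-output-channel batch-norm parameters.
float bn_relu(float conv_acc, float gamma, float beta,
              float mean, float var, float eps = 1e-5f) {
    float normalized = (conv_acc - mean) / std::sqrt(var + eps);
    float activated  = gamma * normalized + beta;
    return std::max(activated, 0.0f);   // ReLU
}
```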
Due to the large number of parameters and the large amount of data in a single convolutional layer, the BRAM capacity may be exceeded, and directly reading Dynamic Random Access Memory (DRAM) would cause high read latency and reduce computational efficiency. Therefore, only part of the data or parameters is loaded at a time; another part is loaded after one round of computation, until the convolutional layer is completed. The data within a layer is partitioned according to pixel location, with each partition sized as a divisor of the total length and width to reduce the number of loading cycles.
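The partial loading strategy amounts to a simple tiling loop around the computation. The skeleton below is an illustrative sketch (the helper functions are placeholders, not the authors' API), with tile sizes assumed to divide the layer height and width as described above.

```cpp
// Illustrative tiling skeleton for partial loading (not the authors' code).
// The four helpers are placeholders for the DMA and compute steps.
void load_tile_from_dram(int th, int tw, int h, int w) { /* DMA tile in */ }
void load_params_partial() { /* load only the weights needed for this tile */ }
void compute_tile(int th, int tw, int h, int w) { /* drive the PE matrix */ }
void store_tile_to_dram(int th, int tw, int h, int w) { /* DMA tile out */ }

// TILE_H and TILE_W are assumed to divide the layer height H and width W,
// matching the rule that each partition is a divisor of the total length/width.
void conv_layer_tiled(int H, int W, int TILE_H, int TILE_W) {
    for (int th = 0; th < H; th += TILE_H)
        for (int tw = 0; tw < W; tw += TILE_W) {
            load_tile_from_dram(th, tw, TILE_H, TILE_W);
            load_params_partial();
            compute_tile(th, tw, TILE_H, TILE_W);
            store_tile_to_dram(th, tw, TILE_H, TILE_W);
        }
}
```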
Parallel acceleration of the computational layers in a CNN decomposes the computational process of the CNN model along three dimensions: input channel, output channel, and pixel. Parallel acceleration of these dimensions significantly improves the efficiency of the CNN algorithm, whereas traditional methods focus on the computational level of convolutional operations and offer limited improvement at the structural level of the algorithm.

The loop logic of the convolutional layer is the most complex of the three computation types, and its loop nesting is shown in Algorithm 1.
Loop level 1 represents the traversal of the number of rows of the input tensor. Loop
level 2 represents the traversal of the number of columns of the input tensor. For example,
suppose the number of parallel pixels is 4. In that case, the loop level means that after
each round of convolution, the pointer is shifted 4 pixels to the right until the traversal of a
row is completed. Loop level 3 represents the traversal of the number of output channels.
Loop levels 4 and 5 represent the traversal of the convolution kernel tensor rows and
columns, respectively. Loop level 6 represents the traversal of the convolution kernel input
channels. Loops 7, 8, and 9 are loops of pixel parallelism, input channel parallelism, and
output channel parallelism, respectively. Loops 1–6 use the pipeline constraint, which
means that these loops will be combined into one big loop, and the upper limit of the big
loop is the product of the upper limit of all small loops. Loops 7–9 below the pipeline
constraint are constrained to be unrolled loops. Within a convolutional operation, the
computation scheduling kernel iterates through all the elements of the output tensor by
loops 7–9, computes one of the element values by loops 4–6, and unrolls the convolutional
operation in parallel by loops 1–3 to improve the computational efficiency and finally
obtain the result of a whole convolutional layer.
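To make the nesting and constraint placement concrete, below is a compact HLS-style C++ sketch of this loop structure. It is an illustrative reconstruction, not the authors' code: the array layouts, bounds, and the Xilinx HLS pragmas (PIPELINE/UNROLL) are assumptions, and the address arithmetic is simplified to a stride-1, un-padded convolution.

```cpp
#include <cstdint>
#include <vector>

constexpr int PIX_PAR = 4, IC_PAR = 8, OC_PAR = 8;   // parallelism factors

// Illustrative stride-1 "valid" convolution mirroring the nine loop levels of
// Algorithm 1. Row-major layouts: in[r][c][ic], wt[oc][kh][kw][ic], out[r][c][oc].
// IC, OC and COLS_OUT are assumed to be multiples of the parallelism factors
// (zero-padded otherwise, as described in the text); out must be zero-initialized.
void conv_layer(const std::vector<int16_t>& in, const std::vector<int16_t>& wt,
                std::vector<int32_t>& out,
                int ROWS_OUT, int COLS_OUT, int IC, int OC, int KH, int KW) {
    const int COLS_IN = COLS_OUT + KW - 1;
    for (int r = 0; r < ROWS_OUT; ++r)                    // loop 1: output rows
     for (int c = 0; c < COLS_OUT; c += PIX_PAR)          // loop 2: output cols, 4 at a time
      for (int oc = 0; oc < OC; oc += OC_PAR)             // loop 3: output channel groups
       for (int kh = 0; kh < KH; ++kh)                    // loop 4: kernel rows
        for (int kw = 0; kw < KW; ++kw)                   // loop 5: kernel cols
         for (int ic = 0; ic < IC; ic += IC_PAR) {        // loop 6: input channel groups
#pragma HLS PIPELINE II=1
          // loops 7-9: the parallel dimensions, fully unrolled in hardware
          for (int p = 0; p < PIX_PAR; ++p) {
#pragma HLS UNROLL
           for (int i = 0; i < IC_PAR; ++i) {
#pragma HLS UNROLL
            for (int o = 0; o < OC_PAR; ++o) {
#pragma HLS UNROLL
             int in_idx  = ((r + kh) * COLS_IN + (c + p + kw)) * IC + (ic + i);
             int wt_idx  = (((oc + o) * KH + kh) * KW + kw) * IC + (ic + i);
             int out_idx = (r * COLS_OUT + (c + p)) * OC + (oc + o);
             out[out_idx] += int32_t(in[in_idx]) * int32_t(wt[wt_idx]);
            }
           }
          }
         }
}
```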
As shown in Algorithm 2, compared to the convolutional layer, the pooling layer does not need to consider input channel parallelism, since its number of input channels equals its number of output channels; only the output channels need to be parallelized.
Loop level 1 represents the traversal of the input tensor row direction. Loop level 2
represents the traversal of the input tensor column direction. Loop level 3 is the traversal
of the output channels. Loop level 4 is the traversal of the pooling range in the row
direction. Loop 5 is the traversal of the pooling range in the column direction. Loops 6 and
7 are for pixel and output channel parallelism, respectively. Loops 1 and 2 in the pooling
layer traverse the output tensor, and each element of the output tensor corresponds to the
maximum value of the pooling range in the input tensor; loops 4 and 5 are the traversal of
the pooling range, and loops 6 and 7 are parallel acceleration.
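For comparison, a much shorter illustrative sketch of the max-pooling nest (window size equal to stride, parallelism loops omitted; layouts and names are assumptions):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative max-pooling loop nest mirroring Algorithm 2 (window = stride = K).
// Row-major layouts: in[r][c][ch], out[r][c][ch].
void pool_layer(const std::vector<int16_t>& in, std::vector<int16_t>& out,
                int ROWS_OUT, int COLS_OUT, int CH, int K) {
    const int COLS_IN = COLS_OUT * K;
    for (int r = 0; r < ROWS_OUT; ++r)                 // loop 1: output rows
      for (int c = 0; c < COLS_OUT; ++c)               // loop 2: output cols
        for (int ch = 0; ch < CH; ++ch) {              // loop 3: output channels
          int16_t best = INT16_MIN;
          for (int kr = 0; kr < K; ++kr)               // loop 4: pooling rows
            for (int kc = 0; kc < K; ++kc) {           // loop 5: pooling cols
              int idx = ((r * K + kr) * COLS_IN + (c * K + kc)) * CH + ch;
              best = std::max(best, in[idx]);
            }
          out[(r * COLS_OUT + c) * CH + ch] = best;    // loops 6-7 (parallelism) omitted
        }
}
```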
As shown in Algorithm 3, consider TinyYolo's 17th layer, a fully connected layer with an input tensor of 1 × 1 × 50,176 and an output tensor of 1 × 1 × 256; its parameter tensor of 1 × 1 × 50,176 × 256 is the largest in the entire network. If the output channels were parallelized as in the convolutional layer, the fully connected layer would have to load at least 50,176 × (output channel parallelism) 16-bit fixed-point parameters at a time, which would occupy a large amount of on-chip BRAM storage. Therefore, in this paper, the fully connected layer is parallelized in only two dimensions: input channel and pixel parallelism. Loop level 1 of the fully connected layer traverses the input tensor channels, loop level 2 traverses the output tensor channels, and loop levels 3 and 4 are pixel parallelism and input channel parallelism, respectively.
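A correspondingly brief illustrative sketch of the fully connected nest, with the same caveats, keeping the loop order described above (loops 1–4):

```cpp
#include <cstdint>
#include <vector>

constexpr int PIX_PAR = 4, IC_PAR = 8;   // unrolled dimensions for the FC layer

// Illustrative fully connected loop nest mirroring Algorithm 3. IC is assumed
// to be a multiple of PIX_PAR * IC_PAR (zero-padded otherwise), and out must
// be zero-initialized, since partial sums accumulate across input chunks.
void fc_layer(const std::vector<int16_t>& in, const std::vector<int16_t>& wt,
              std::vector<int32_t>& out, int IC, int OC) {
    for (int ic = 0; ic < IC; ic += PIX_PAR * IC_PAR)       // loop 1: input channels, 32 at a time
        for (int oc = 0; oc < OC; ++oc)                     // loop 2: output channels
            for (int p = 0; p < PIX_PAR; ++p)               // loop 3: pixel parallelism (unrolled)
                for (int i = 0; i < IC_PAR; ++i) {          // loop 4: input channel parallelism (unrolled)
                    int k = ic + p * IC_PAR + i;
                    out[oc] += int32_t(in[k]) * int32_t(wt[oc * IC + k]);
                }
}
```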
A BRAM capable of storing nine words is instantiated as three BRAMs each storing three words after a memory decomposition constraint of factor 3, which expands the access bandwidth of the memory to three times the original. In this paper, since the convolutional scheduling kernel needs to access input channel parallelism × output channel parallelism parameters per cycle and the BRAM bit width is 16 bits × input channel parallelism, the storage decomposition factor is set equal to the output channel parallelism of 8; one BRAM is thus decomposed into 8 parallel BRAMs along the output channel dimension so that the scheduling kernel can access all the required parameters simultaneously.
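In Xilinx HLS, this kind of storage decomposition is typically expressed with an array-partition directive. The following is a minimal sketch under that assumption (the array name, depth, and access pattern are illustrative, not taken from the paper):

```cpp
#include <cstdint>

constexpr int IC_PAR = 8, OC_PAR = 8, DEPTH = 2048;

// Each element packs the 8 input-channel lanes (128 bits); the buffer is then
// split into 8 physical BRAMs so the weights of all 8 output channels can be
// read in the same cycle.
struct Word128 { int16_t lane[IC_PAR]; };

void read_weights(int addr, Word128 w_out[OC_PAR]) {
    static Word128 params[OC_PAR * DEPTH];
#pragma HLS ARRAY_PARTITION variable=params cyclic factor=8 dim=1
    for (int o = 0; o < OC_PAR; ++o) {
#pragma HLS UNROLL
        w_out[o] = params[addr * OC_PAR + o];   // 8 banks -> 8 parallel reads
    }
}
```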
Such a situation would increase the bandwidth requirements of the acceleration core and add unnecessary parameter loading time. Considering that the input tensor of the FC17 layer in the TinyYolo network is 1 × 1 × 50,176, which requires the largest data load buffer, with a depth of 6272 at a 128-bit width, this paper sets the data loading buffer to a depth of 2048 × 4. In addition, the data storage buffer is 128 bits wide and 8192 words deep, and the parameter storage is 128 bits wide and 2048 words deep. The partial loading strategy for each layer of the TinyYolo network inference is shown in Table 1.
Table 1. Partial loading strategy for each layer of the TinyYolo network inference.

| Layer | Input Tensor / Output Tensor | Partial Loading Input Tensor / Output Tensor | Data Loading Area Depth Occupied | Data Storage Area Depth Occupied | Parameter Storage Area Depth Occupied |
|---|---|---|---|---|---|
| CONV1 | 448 × 448 × 3 / 448 × 448 × 16 | 56 × 56 × 3 / 56 × 56 × 16 | 784 | 6272 | 18 |
| POOL2 | 448 × 448 × 16 / 224 × 224 × 16 | 56 × 56 × 16 / 28 × 28 × 16 | 1568 | 1568 | 0 |
| CONV3 | 224 × 224 × 16 / 224 × 224 × 32 | 28 × 28 × 32 / 28 × 28 × 32 | 784 | 3136 | 72 |
| POOL4 | 224 × 224 × 32 / 112 × 112 × 32 | 28 × 28 × 32 / 14 × 14 × 32 | 784 | 784 | 0 |
| CONV5 | 112 × 112 × 32 / 112 × 112 × 64 | 28 × 28 × 32 / 28 × 28 × 64 | 784 | 6272 | 288 |
| POOL6 | 112 × 112 × 64 / 56 × 56 × 64 | 28 × 28 × 64 / 14 × 14 × 64 | 1568 | 1568 | 0 |
| CONV7 | 56 × 56 × 64 / 56 × 56 × 128 | 14 × 14 × 64 / 14 × 14 × 128 | 392 | 3136 | 1152 |
| POOL8 | 56 × 56 × 128 / 28 × 28 × 128 | 14 × 14 × 128 / 7 × 7 × 128 | 784 | 784 | 0 |
| CONV9 | 28 × 28 × 128 / 28 × 28 × 256 | 14 × 14 × 128 / 14 × 14 × 64 | 784 | 1568 | 1152 |
| POOL10 | 28 × 28 × 256 / 14 × 14 × 256 | 14 × 14 × 256 / 7 × 7 × 256 | 1568 | 1568 | 0 |
| CONV11 | 14 × 14 × 256 / 14 × 14 × 512 | 14 × 14 × 256 / 14 × 14 × 32 | 1568 | 784 | 1152 |
| POOL12 | 14 × 14 × 512 / 7 × 7 × 512 | 2 × 2 × 512 / 1 × 1 × 512 | 128 | 64 | 0 |
| CONV13 | 7 × 7 × 512 / 7 × 7 × 1024 | 7 × 7 × 512 / 7 × 7 × 16 | 896 | 98 | 1152 |
| CONV14 | 7 × 7 × 1024 / 7 × 7 × 1024 | 7 × 7 × 1024 / 7 × 7 × 8 | 1792 | 49 | 1152 |
| CONV15 | 7 × 7 × 1024 / 7 × 7 × 1024 | 7 × 7 × 1024 / 7 × 7 × 8 | 1792 | 49 | 1152 |
| FC17 | 1 × 1 × 50,176 / 1 × 1 × 256 | 1 × 1 × 50,176 / 1 × 1 × 2 | 1568 | 1 | 1568 |
| FC18 | 1 × 1 × 256 / 1 × 1 × 4096 | 1 × 1 × 256 / 1 × 1 × 512 | 8 | 64 | 2048 |
| FC19 | 1 × 1 × 4096 / 1 × 1 × 1470 | 1 × 1 × 4096 / 1 × 1 × 32 | 128 | 4 | 2048 |
As seen from Table 1, with the partial loading and partial computing strategy, the parameter buffers, the data loading buffers, and the data storage buffers did not overflow during the inference of the whole TinyYolo network, and the BRAM resource consumption of the whole acceleration core is kept within a reasonable range.

In order for the data access bandwidth to meet the arithmetic power requirements of the accelerated CNN algorithm, the data access method must be specially designed. For this we propose three methods: data streaming, BRAM expansion, and partial calculation, which significantly improve the matching of data bandwidth and arithmetic power so that the CNN algorithm can execute at close to full capacity. Traditional methods have not addressed this area.
[Figures: AXI-MM read-channel timing waveforms (ACLK, ARADDR, ARLEN, ARSIZE, ARBURST, ARVALID/ARREADY, RDATA, RRESP, RLAST, RVALID/RREADY) illustrating burst read transfers on the AXI bus.]
In the AXI-MM bus access, the slave often needs several or even dozens of clocks to
respond to the host’s request. If the length of the pipeline is insufficient, it will reduce the
bus utilization and the overall operational efficiency of the module. Therefore, in this paper,
the modules interacting with the AXI-MM bus have extended the length of the pipeline by
128 clocks in order to wait for the AXI-MM slave to return data and avoid the host entering
a blocking state.
By combining the above three strategies for optimizing AXI bus utilization, the acceleration core designed in this paper achieves more than 85% utilization of the AXI bus. Taking FC17 as an example, the number of parameters to be loaded in this layer is 50,176 × 256; the theoretical lower bound on the loading time is 5.35 ms, and the actual time is 5.90 ms, compared with more than 10 ms before optimization. The AXI bus therefore runs much more efficiently and avoids memory access becoming the bottleneck of the computation.
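The 5.35 ms theoretical limit is consistent with transferring the FC17 weights over a 128-bit bus at the 300 MHz kernel clock (our reading of the bus width and frequency stated elsewhere in the paper):

```latex
\[
50{,}176 \times 256 \times 2\ \text{B} \approx 25.69\ \text{MB},
\qquad
\frac{25.69\ \text{MB}}{16\ \text{B/cycle} \times 300\ \text{MHz}}
= \frac{25.69\ \text{MB}}{4.8\ \text{GB/s}} \approx 5.35\ \text{ms}.
\]
```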
The four operands to be multiplied are first buffered in the 300 MHz clock domain using D flip-flops and then selected by a 600 MHz clock, with the first two operands input on the odd 600 MHz clock edges and the last two operands on the even edges. The output of the DSP48E is buffered by two D flip-flops, and the outputs of the two buffer flip-flops are sampled in the 300 MHz clock domain to obtain a stable output that is free of metastability and convergent in timing. The above architecture works as a pipeline with an initiation interval of 1. In each fundamental clock cycle, one DSP48E can compute two fixed-point multiplications, which realizes time-division multiplexing of the DSP48E resources and reduces resource consumption. The cross-clock-domain DSP48E unit is shown in Figure 13.
The select signal in the above cross-clock-domain design is identical in waveform to the 300 MHz clock. However, the clock signal is buffered by a Global Buffer (BUFG) into the clock network within the FPGA, which provides a stable, low-jitter, low-skew clock signal for the global flip-flops. Since there is no direct path from the clock network to logic inputs, the 300 MHz clock cannot be used directly as the select signal, so we use a clock follower to generate it. The diagram of the clock follower and the output waveform at each point are shown in Figure 14.
Figure 14. Diagram of clock follower and the output waveform of each point of the clock follower.
The DSP resources inside the FPGA are limited. Our proposed method doubles computational efficiency in the high-speed clock domain through a time-for-space strategy, achieving the same computational throughput with fewer DSPs, and specifically optimizes the access timing of the AXI bus so that the bus can work at close to full capacity. Compared with traditional methods, we use fewer DSP resources to do the same work, or the same DSP resources to do twice the work.
Therefore, the on-board DDR bandwidth is sufficient to support the data transfer bandwidth required for neural network computing. In addition, the input data in this paper comes from a real-time video stream, so the hardware platform is equipped with a Camera Link interface chip for capturing the real-time video stream. The input video is processed by the interface chip and buffered until the acceleration core performs inference. The detection results are sent from the PS to the PL side, the detection boxes are overlaid on the output video, and the result is displayed through the video interface. The hardware we use is shown in Figure 15. We have used this hardware platform to implement and validate the deep learning acceleration core.
In order to objectively evaluate the ability of our proposed method to handle the
consistency problem of arithmetic power, speed, and resources, we make the following
assumptions.
1. Our proposed method will not be interrupted by unordered scheduling instructions
during network inference tasks, which include parameter updates, recovery after
video stream interruptions, and interruption exception handling.
2. The automatic optimization function of the synthesis tool is turned off since we have
already performed manual optimization specifically for the proposed method. The
secondary optimization of the automatic tool will affect the results.
3. When selecting other methods proposed in the literature for comparison, we try to
select FPGA chips of the same architecture system because the internal structure of
chips of different architectures is different, which will affect the evaluation of the
effect of resource optimization.
In Section 3.1, we illustrate the hardware implementation results and timing of the
convolutional and fully connected layer pipelines in the computational scheduling kernel,
list the hardware resources consumed by the computational scheduling kernel, and discuss
the resource usage. In Section 3.2, we list the hardware resources consumed by the deep
learning acceleration core and discuss the resource usage. In Section 3.3, we illustrate the
actual performance of the deep learning acceleration core, compare it with other similar lightweight acceleration cores, and discuss our method's advantages.
Pipeline 4 is the main pipeline, which reads the data stored on chip and pushes it to the computation matrix to obtain the result of the convolutional layer. All four pipelines have an initiation interval of 1 to maximize computational efficiency and data throughput. The pipeline length of pipeline 2 is 173 because, when the number of output channels of the partial computation is not equal to the number of output channels of the output tensor, the parameter load addresses are not contiguous and must be cut into several burst transfers; the pipeline is therefore lengthened to avoid blocking caused by late-returning AXI-MM data, which would reduce the efficiency of data reading. The pooling layer has the most straightforward pipeline, since it does not require loading parameters, with only one main pipeline for computation and the same initiation interval of 1. The fully connected layer contains three pipelines: pipeline 1 is the multiplication parameter loading pipeline, pipeline 2 is the bias parameter loading pipeline, and pipeline 3 is the main pipeline for computing the fully connected layer. Pipeline experiment results are shown in Figure 16.
Figure 16. Pipeline experiment results: (a) Convolutional layer pipeline synthesis result. (b) Fully
connected layer pipeline synthesis results. (c) Computational scheduling kernel pipeline diagram.
The computational scheduling kernel runs at 300 MHz, i.e., 3.33 ns per clock cycle. With the HLS pipeline design, the longest single-cycle path takes 2.91 ns, leaving a timing margin of 0.42 ns, which meets the setup and hold requirements for FPGA operation. The overall resource consumption of the computational scheduling kernel is shown in Table 2.
The convolutional layer consumes the most lookup table and flip-flop resources, mainly because it needs to compute the accumulation over three dimensions in parallel, whereas the fully connected layer only needs to compute the accumulation over two dimensions, input channel parallelism and pixel parallelism. At the same time, the convolutional layer computes data access addresses and loop counts most frequently, which requires some multiplications, such as calculating the data address of a pixel in the input tensor or the total number of elapsed loop iterations; hence, the convolutional layer consumes the most DSP resources of the three. Finally, the computational scheduling kernel stores the multiplication and bias parameters for the convolutional and fully connected layers in internal BRAM, with an 8-channel, 2048-deep, 128-bit-wide memory for the multiplication parameters and a 256-deep, 128-bit-wide memory for the bias parameters, which can support the computation of a tensor with up to 2048 output channels. The parameters are stored in BRAM36k, a 36-bit-wide, 1024-deep memory cell, consuming 69 BRAM36k and 1 BRAM18k.
Regarding lookup tables, the computation scheduling kernel consumes a large number, mainly because its function is to remap the data addresses, output the data to the computation matrix, and compute the accumulations of the convolutional layers. The convolutional accumulation has the largest concurrency, requiring the summation of 256 fixed-point values in one pipeline, which consumes many lookup tables. Regarding flip-flops, on the one hand, the computation scheduling kernel needs to store many intermediate results; on the other hand, the cross-clock-domain processing of the DSP computation matrix needs flip-flops in different clock domains to avoid timing problems, which requires many flip-flops to cache data. In addition, the DSP consumption of this part is fixed and does not increase with the parallelism of the data computation. Besides intermediate data storage, the burst and outstanding modes of the AXI-MM bus require on-chip storage space to cache the data to be sent or received on the bus, which consumes some of the BRAM. In general, the overall resource consumption of the acceleration core architecture is less than 25% of the ZU15EG platform used. In addition to the deep learning acceleration core, the hardware platform also requires a video input and output path. With the addition of these external modules, the overall resource consumption of the hardware platform is shown in Figure 17.
The per-layer parameter counts, data volumes, and calculated volumes of the TinyYolo network are as follows:

| Layer | Input Tensor | Output Tensor | Number of Parameters | Data Volume | Calculated Volume |
|---|---|---|---|---|---|
| Input0 | 448 × 448 × 3 | — | 0 | 1605632 | 0 |
| CONV1 | 448 × 448 × 3 | 448 × 448 × 16 | 1152 | 3211264 | 231211008 |
| POOL2 | 448 × 448 × 16 | 224 × 224 × 16 | 0 | 802816 | 3211264 |
| CONV3 | 224 × 224 × 16 | 224 × 224 × 32 | 4608 | 1605632 | 231211008 |
| POOL4 | 224 × 224 × 32 | 112 × 112 × 32 | 0 | 401408 | 1605632 |
| CONV5 | 112 × 112 × 32 | 112 × 112 × 64 | 18432 | 802816 | 231211008 |
| POOL6 | 112 × 112 × 64 | 56 × 56 × 64 | 0 | 200704 | 802816 |
| CONV7 | 56 × 56 × 64 | 56 × 56 × 128 | 73728 | 401408 | 231211008 |
| POOL8 | 56 × 56 × 128 | 28 × 28 × 128 | 0 | 100352 | 401408 |
| CONV9 | 28 × 28 × 128 | 28 × 28 × 256 | 294912 | 200704 | 231211008 |
| POOL10 | 28 × 28 × 256 | 14 × 14 × 256 | 0 | 50176 | 200704 |
| CONV11 | 14 × 14 × 256 | 14 × 14 × 512 | 1179648 | 100352 | 231211008 |
| POOL12 | 14 × 14 × 512 | 7 × 7 × 512 | 0 | 25088 | 100352 |
| CONV13 | 7 × 7 × 512 | 7 × 7 × 1024 | 4718592 | 50176 | 231211008 |
| CONV14 | 7 × 7 × 1024 | 7 × 7 × 1024 | 9437182 | 50176 | 462422016 |
| CONV15 | 7 × 7 × 1024 | 7 × 7 × 1024 | 9437182 | 50176 | 462422016 |
| FLAT16 | 7 × 7 × 1024 | 1 × 1 × 50,176 | 0 | 50176 | 0 |
| FC17 | 1 × 1 × 50,176 | 1 × 1 × 256 | 12845056 | 256 | 12845056 |
| FC18 | 1 × 1 × 256 | 1 × 1 × 4096 | 1048576 | 4096 | 1048576 |
| FC19 | 1 × 1 × 4096 | 1 × 1 × 1470 | 6021120 | 1470 | 6021120 |
| Total | | | 45080192 | 9664702 | 2569558016 |
However, the parameter loading logic is inside the computation scheduling kernel, and the scheduling kernel cannot compute while loading parameters, so the actual computation time of a convolutional layer equals the sum of its parameter loading time and its computation time. The other computational layer in TinyYolo that takes longer to infer, the fully connected layer, has a larger number of parameters, which makes the parameter loading time longer. The computational time required to infer TinyYolo once is shown in Table 6.
Per-layer inference time of the convolutional layers:

| Layer | Data Loading Time | Data Storage Time | Parameter Loading Time | Calculation Time | Actual Time |
|---|---|---|---|---|---|
| CONV1 | 0.67 ms | 1.33 ms | 0.00 ms | 3.01 ms | 3.01 ms |
| CONV3 | 0.33 ms | 0.67 ms | 0.00 ms | 3.01 ms | 3.01 ms |
| CONV5 | 0.16 ms | 0.33 ms | 0.00 ms | 3.01 ms | 3.01 ms |
| CONV7 | 0.08 ms | 0.16 ms | 0.03 ms | 3.01 ms | 3.14 ms |
| CONV9 | 0.04 ms | 0.08 ms | 0.12 ms | 3.01 ms | 3.13 ms |
| CONV11 | 0.02 ms | 0.04 ms | 0.49 ms | 3.01 ms | 3.50 ms |
| CONV13 | 0.01 ms | 0.02 ms | 1.96 ms | 3.01 ms | 4.97 ms |
| CONV14 | 0.02 ms | 0.02 ms | 3.93 ms | 6.02 ms | 9.95 ms |
| CONV15 | 0.02 ms | 0.02 ms | 3.93 ms | 6.02 ms | 9.95 ms |
| Total | 1.35 ms | 2.67 ms | 10.46 ms | 33.11 ms | 43.57 ms |
Per-layer inference time of the fully connected layers:

| Layer | Data Loading Time | Data Storage Time | Parameter Loading Time | Calculation Time | Actual Time |
|---|---|---|---|---|---|
| FC17 | 0.02 ms | 0.00 ms | 5.35 ms | 1.33 ms | 6.68 ms |
| FC18 | 0.00 ms | 0.00 ms | 0.44 ms | 0.11 ms | 0.55 ms |
| FC19 | 0.00 ms | 0.00 ms | 2.50 ms | 0.63 ms | 3.13 ms |
| Total | 0.02 ms | 0.00 ms | 8.29 ms | 2.07 ms | 10.36 ms |
In addition to the inference time consumed by the convolutional and fully connected layers, the total inference time also includes the data access time of the pooling layers and one input image normalization operation. The acceleration core designed in this paper takes 71 ms to infer the whole network once, i.e., about 14 FPS, which reaches the real-time standard. Considering that tracking is generally added after detection to stabilize the target detection box, a detection rate of 14 FPS is sufficient to provide the tracker with real-time detection results as tracking targets. The theoretical arithmetic power of the acceleration core in this paper is 150 GOPS, and about 2.5 G multiply–add operations, i.e., roughly 5 giga operations (GOP), are required to infer TinyYolo once, giving a theoretical maximum frame rate of 30 FPS and an actual computational resource utilization of about 47%. Compared to other designs that use FPGAs to build deep learning acceleration cores, the performance comparison of the acceleration cores is shown in Table 7.
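Reading the per-inference workload as roughly 5 giga operations, the theoretical frame rate and the quoted utilization follow directly:

```latex
\[
\text{FPS}_{\max} = \frac{150\ \text{GOPS}}{5\ \text{GOP/frame}} = 30\ \text{FPS},
\qquad
\text{utilization} \approx \frac{14\ \text{FPS}}{30\ \text{FPS}} \approx 47\%.
\]
```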
In this paper, the DSP doubling strategy is adopted, and the bus frequency and DSP running frequency are several times higher than in other designs, so the acceleration core designed in this paper can guarantee higher computational performance while minimizing DSP consumption and significantly improving the utilization of DSP resources. In Table 7, although different works use different hardware platforms and implement different algorithmic models, the Energy Efficiency metric normalizes the performance of different algorithms within the same evaluation system. Our proposed architecture provides an arithmetic performance of 28.98 GOPS per unit power, which is 20.34% higher than the second-best design.
Here we must point out that, since our proposed method uses many hardware design techniques to optimize specifically for CNN-based models and the FPGA structure, it migrates well to other CNN-based models such as VGG, Multi-scale Residual Aggregation Network (MSRANet), GoogleNet, Inception, and Faster R-CNN. For other models, such as Recurrent Neural Networks (RNN) or Generative Adversarial Networks (GAN), only the FPGA-structure-level optimizations still apply. Furthermore, since the gains in arithmetic power and speed of our proposed deep learning acceleration core rely on highly pipelined and parallelized hardware processing, the inference network must behave predictably: if a random operation interrupts this predictable behavior, such as a parameter update or a bus transfer failure, all pipelined work has to be restarted, which often results in massive latency. For real-time tasks, this latency is often intolerable.
The test result is shown in Figure 18.
Figure 18. The actual output of the bridge detection in the aerial video using TinyYolo inference with
the acceleration core.
4. Conclusions
This paper proposes an FPGA-based deep learning acceleration core architecture for image target detection and designs a parallel acceleration scheme to address the consistency of arithmetic power, speed, and resources. The computational scheduling kernel is fully pipelined so that the computation unit can perform one parallel computation per clock without waiting for data pre-processing. In order to provide sufficient data access bandwidth for the parallel computing units, this paper also designs and implements a three-level data cache architecture of off-chip storage, on-chip storage, and registers, which provides high-bandwidth data streams for the parallel computing units by slicing the on-chip storage, so that data movement does not degrade the computational efficiency of the parallel acceleration core. Bus access and DSP resource optimization strategies improve bus bandwidth utilization and save computational resources, reducing DSP usage to half of the original. This paper uses the HLS high-level synthesis tool to develop the deep learning acceleration core on FPGAs and achieves 14 FPS inference for the TinyYolo model, whose workload is about 5 giga operations (GOP) per frame, using less than 25% of the FPGA resources. The acceleration core runs at a 30% higher clock frequency, delivers 2–4 times higher arithmetic power, and uses DSP resources 28% more efficiently than other methods. The limitation of this paper is that the proposed parallel acceleration scheme is only suitable for CNN-based models; acceleration schemes for RNN-based or GAN-based models should be investigated in the future.
Author Contributions: Conceptualization, X.Y. and C.Z.; methodology, W.F.; software, X.Y.; valida-
tion, X.Y., Z.Y. and Q.W.; formal analysis, C.Z.; investigation, X.Y.; resources, X.Y.; data curation, X.Y.;
writing—original draft preparation, X.Y.; writing—review and editing, X.Y. and C.Z.; visualization,
X.Y.; supervision, C.Z.; project administration, W.F. All authors have read and agreed to the published
version of the manuscript.
Funding: This research was funded by National Natural Science Foundation of China under grant
number 61901015.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Acknowledgments: The authors acknowledge graduate student Xu Yang for his contribution to
literature search and collation.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
References
1. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
2. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural
Inf. Process. Syst. 2015, 28, 91. [CrossRef] [PubMed]
3. Sun, B.; Wang, X.; Oad, A.; Pervez, A.; Dong, F. Automatic Ship Object Detection Model Based on YOLOv4 with Transformer
Mechanism in Remote Sensing Images. Appl. Sci. 2023, 13, 2488. [CrossRef]
4. Sun, Z.; Leng, X.; Lei, Y.; Xiong, B.; Ji, K.; Kuang, G. BiFA-YOLO: A novel YOLO-based method for arbitrary-oriented ship
detection in high-resolution SAR images. Remote Sens. 2021, 13, 4209. [CrossRef]
5. Hu, J.; Zhi, X.; Shi, T.; Zhang, W.; Cui, Y.; Zhao, S. PAG-YOLO: A portable attention-guided YOLO network for small ship
detection. Remote Sens. 2021, 13, 3059. [CrossRef]
6. Li, L.; Jiang, L.; Zhang, J.; Wang, S.; Chen, F. A complete YOLO-based ship detection method for thermal infrared remote sensing
images under complex backgrounds. Remote Sens. 2022, 14, 1534. [CrossRef]
7. Ye, J.; Yuan, Z.; Qian, C.; Li, X. Caa-yolo: Combined-attention-augmented yolo for infrared ocean ships detection. Sensors 2022,
22, 3782. [CrossRef] [PubMed]
8. Lu, J.; Ma, C.; Li, L.; Xing, X.; Zhang, Y.; Wang, Z.; Xu, J. A vehicle detection method for aerial image based on YOLO. J. Comput.
Commun. 2018, 6, 98–107. [CrossRef]
9. Al-Batat, R.; Angelopoulou, A.; Premkumar, S.; Hemanth, J.; Kapetanios, E. An end-to-end automated license plate recognition
system using YOLO based vehicle and license plate detection with vehicle classification. Sensors 2022, 22, 9477. [CrossRef]
[PubMed]
10. Zhang, Y.; Guo, Z.; Wu, J.; Tian, Y.; Tang, H.; Guo, X. Real-Time Vehicle Detection Based on Improved YOLO v5. Sustainability
2022, 14, 12274. [CrossRef]
11. Liu, M.; Wang, X.; Zhou, A.; Fu, X.; Ma, Y.; Piao, C. Uav-yolo: Small object detection on unmanned aerial vehicle perspective.
Sensors 2020, 20, 2238. [CrossRef]
12. Li, Y.; Wang, J.; Huang, J.; Li, Y. Research on Deep Learning Automatic Vehicle Recognition Algorithm Based on RES-YOLO
Model. Sensors 2022, 22, 3783. [CrossRef] [PubMed]
13. Chen, L.; Weng, T.; Xing, J.; Pan, Z.; Yuan, Z.; Xing, X.; Zhang, P. A new deep learning network for automatic bridge detection
from SAR images based on balanced and attention mechanism. Remote Sens. 2020, 12, 441. [CrossRef]
14. Li, X.; Meng, Q.; Wei, M.; Sun, H.; Zhang, T.; Su, R. Identification of Underwater Structural Bridge Damage and BIM-Based
Bridge Damage Management. Appl. Sci. 2023, 13, 1348. [CrossRef]
15. Du, F.; Jiao, S.; Chu, K. Application research of bridge damage detection based on the improved lightweight convolutional neural
network model. Appl. Sci. 2022, 12, 6225. [CrossRef]
16. Lin, Y.C.; Chen, W.D. Automatic aircraft detection in very-high-resolution satellite imagery using a YOLOv3-based process. J.
Appl. Remote Sens. 2021, 15, 018502. [CrossRef]
17. Madasamy, K.; Shanmuganathan, V.; Kandasamy, V.; Lee, M.Y.; Thangadurai, M. OSDDY: Embedded system-based object
surveillance detection system with small drone using deep YOLO. EURASIP J. Image Video Process. 2021, 2021, 1–14. [CrossRef]
18. Jiang, C.; Ren, H.; Ye, X.; Zhu, J.; Zeng, H.; Nan, Y.; Sun, M.; Ren, X.; Huo, H. Object detection from UAV thermal infrared images
and videos using YOLO models. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102912. [CrossRef]
19. Artamonov, N.; Yakimov, P. Towards real-time traffic sign recognition via YOLO on a mobile GPU. J. Phys. Conf. Ser. 2018, 1096,
012086. [CrossRef]
20. Güney, E.; Bayilmiş, C.; Cakan, B. An implementation of real-time traffic signs and road objects detection based on mobile GPU
platforms. IEEE Access 2022, 10, 86191–86203. [CrossRef]
21. Feng, W.; Zhu, Y.; Zheng, J.; Wang, H. Embedded YOLO: A real-time object detector for small intelligent trajectory cars. Math.
Probl. Eng. 2021, 2021, 6555513. [CrossRef]
22. Zhang, S.; Cao, J.; Zhang, Q.; Zhang, Q.; Zhang, Y.; Wang, Y. An fpga-based reconfigurable cnn accelerator for yolo. In Proceedings
of the 2020 IEEE 3rd International Conference on Electronics Technology (ICET), Chengdu, China, 8–11 May 2020; pp. 74–78.
23. Babu, P.; Parthasarathy, E. Hardware acceleration for object detection using YOLOv4 algorithm on Xilinx Zynq platform. J.
Real-Time Image Process. 2022, 19, 931–940. [CrossRef]
24. Xiong, Q.; Liao, C.; Yang, Z.; Gao, W. A Method for Accelerating YOLO by Hybrid Computing Based on ARM and FPGA. In
Proceedings of the 2021 4th International Conference on Algorithms, Computing and Artificial Intelligence, Sanya, China, 22–24
December 2021; pp. 1–7.
25. Chen, Y.H.; Emer, J.; Sze, V. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. ACM
SIGARCH Comput. Archit. News 2016, 44, 367–379. [CrossRef]
26. Liu, Z.; Dou, Y.; Jiang, J.; Xu, J.; Li, S.; Zhou, Y.; Xu, Y. Throughput-optimized FPGA accelerator for deep convolutional neural
networks. ACM Trans. Reconfigurable Technol. Syst. 2017, 10, 1–23. [CrossRef]
27. Peemen, M.; Setio, A.A.; Mesman, B.; Corporaal, H. Memory-centric accelerator design for convolutional neural networks. In
Proceedings of the 2013 IEEE 31st International Conference on Computer Design (ICCD), Asheville, NC, USA, 6–9 October 2013;
pp. 13–19.
28. Zhang, C.; Sun, G.; Fang, Z.; Zhou, P.; Pan, P.; Cong, J. Caffeine: Toward uniformed representation and acceleration for deep
convolutional neural networks. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2018, 38, 2072–2085. [CrossRef]
29. Shen, Y.; Ferdman, M.; Milder, P. Maximizing CNN accelerator efficiency through resource partitioning. ACM SIGARCH Comput.
Archit. News 2017, 45, 535–547. [CrossRef]
30. Peng, H.; Chen, S.; Wang, Z.; Yang, J.; Weitze, S.A.; Geng, T.; Li, A.; Bi, J.; Song, M.; Jiang, W.; et al. Optimizing fpga-based
accelerator design for large-scale molecular similarity search (special session paper). In Proceedings of the 2021 IEEE/ACM
International Conference On Computer Aided Design (ICCAD), Munich, Germany, 1–4 November 2021; pp. 1–7.
31. Azari, E.; Vrudhula, S. ELSA: A throughput-optimized design of an LSTM accelerator for energy-constrained devices. ACM
Trans. Embed. Comput. Syst. 2020, 19, 1–21. [CrossRef]
32. Gong, H.J. Research and Implementation of FPGA-Based Acceleration Method for Convolutional Neural Networks. Master’s
Thesis, University of Chinese Academy of Sciences, National Space Science Center, Chinese Academy of Sciences, Beijing, China,
2021.
33. Guo, K.; Sui, L.; Qiu, J.; Yu, J.; Wang, J.; Yao, S.; Han, S.; Wang, Y.; Yang, H. Angel-eye: A complete design flow for mapping CNN
onto embedded FPGA. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2017, 37, 35–47. [CrossRef]
34. Liu, S.; Fan, H.; Niu, X.; Ng, H.C.; Chu, Y.; Luk, W. Optimizing CNN-based segmentation with deeply customized convolutional
and deconvolutional architectures on FPGA. ACM Trans. Reconfigurable Technol. Syst. 2018, 11, 1–22. [CrossRef]
35. Venieris, S.I.; Bouganis, C.S. fpgaConvNet: Mapping regular and irregular convolutional neural networks on FPGAs. IEEE Trans.
Neural Netw. Learn. Syst. 2018, 30, 326–342. [CrossRef] [PubMed]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.