0% found this document useful (0 votes)
160 views42 pages

Efficient Hardware Architectures For Accelerating Deep Neural Networks Survey

Uploaded by

Jayasuryaa G R
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
160 views42 pages

Efficient Hardware Architectures For Accelerating Deep Neural Networks Survey

Uploaded by

Jayasuryaa G R
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 42

This article has been accepted for publication in IEEE Access.

This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2017.DOI

Efficient Hardware Architectures for


Accelerating Deep Neural Networks:
Survey
PUDI DHILLESWARARAO1 ,(Student Member, IEEE), SRINIVAS BOPPU1 ,(Member, IEEE), M.
SABARIMALAI MANIKANDAN1 ,(Senior Member, IEEE), LINGA REDDY
CENKERAMADDI1 ,(Senior Member, IEEE),
1
School of Electrical Sciences, Indian Institute of Technology Bhubaneswar, India (e-mail:{pd30,srinivas,msm}@iitbbs.ac.in)
2
Department of ICT, University of Agder, Grimstad, Norway (e-mail: [email protected])
Corresponding author: Linga Reddy Cenkeramaddi (e-mail:[email protected]).
This work was supported by the Indo-Norwegian Collaboration in Autonomous Cyber-Physical Systems (INCAPS) project: 287918 of the
International Partnerships for Excellent Education, Research and Innovation (INTPART) program from the Research Council of Norway.

ABSTRACT In the modern-day era of technology, a paradigm shift has been witnessed in the areas
involving applications of Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL).
Specifically, Deep Neural Networks (DNNs) have emerged as a popular field of interest in most AI
applications such as computer vision, image and video processing, robotics, etc. In the context of
developed digital technologies and the availability of authentic data and data handling infrastructure, DNNs
have been a credible choice for solving more complex real-life problems. The performance and accuracy
of a DNN is a way better than human intelligence in certain situations. However, it is noteworthy that the
DNN is computationally too cumbersome in terms of the resources and time to handle these computations.
Furthermore, general-purpose architectures like CPUs have issues in handling such computationally
intensive algorithms. Therefore, a lot of interest and efforts have been invested by the research fraternity
in specialized hardware architectures such as Graphics Processing Unit (GPU), Field Programmable Gate
Array (FPGA), Application Specific Integrated Circuit (ASIC), and Coarse Grained Reconfigurable Array
(CGRA) in the context of effective implementation of computationally intensive algorithms. This paper
brings forward the various research works carried out on the development and deployment of DNNs
using the aforementioned specialized hardware architectures and embedded AI accelerators. The review
discusses the detailed description of the specialized hardware-based accelerators used in the training and/or
inference of DNN. A comparative study based on factors like power, area, and throughput, is also made
on the various accelerators discussed. Finally, future research and development directions are discussed,
such as future trends in DNN implementation on specialized hardware accelerators. This review article is
intended to serve as a guide for hardware architectures for accelerating and improving the effectiveness
of deep learning research.

INDEX TERMS Machine Learning, Field Programmable Gate Array (FPGA), Deep Neural Networks
(DNN), Deep Learning (DL), Application Specific Integrated Circuits (ASIC), Artificial Intelligence (AI),
Central Processing Unit (CPU), Graphics Processing unit (GPU), Hardware Accelerators

I. INTRODUCTION as the study of how computers may learn without being


Deep neural networks (DNNs), also known as deep learning, explicitly programmed. Machine Learning uses traditional
are a subset of the Artificial Intelligence (AI) discipline. The techniques to perform tasks like classification, regression,
term AI was coined in 1956 by John McCarthy, who defined and clustering. Deep learning is a subfield of machine learn-
it as “the science and engineering of making intelligent ing that makes use of a multi-layered structure of algorithms
machines”. Machine learning is a broad topic of artificial known as a neural network, which was developed mostly
intelligence that was first defined by Arthur Samuel in 1959 between 2006 and 2010. The relationship between deep

VOLUME 4, 2016 i

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

learning, machine learning, and AI is illustrated in Fig. 1. to its capacity to extract more complex high-level features,
such as objects and facial structures, from raw input data.
DNNs are computationally expensive and need lots of
Artificial Intelligence computational resources and memory for training and infer-
The science and engineering of ence. CPUs inherently support a limited number of parallel
making intelligent systems
workloads, though they can context switch with hyper-
threading. They are not sequential in nature. CPUs may have
Machine Learning more available resources than their counter architectures
The field of study that gives
computers the ability to learn
(like GPUs or FPGAs). CPUs have a limited number of
without being explicitly programmed registers to support concurrent threads. But they may have
higher cache sizes, larger branch control logic, and higher
on-chip bandwidth than GPUs. However, the limited number
of cores available on the CPU limits its ability to process
Deep Learning large amounts of data in parallel, which is required for DNN
A technique to perform machine acceleration. Although CPUs dominate the IoT industry in
learning algorithms inspired by DNN inference on low-power edge devices, they struggle
human brain’s own network of to realize complex DNNs. Therefore, specialized hardware
neurons.
designs are required for the acceleration of DNNs. DNNs
can be implemented using customized hardware accelerators
instead of a CPU. The heterogeneous computing platforms
viz. Field Programmable Gate Array (FPGA), Application-
Fig. 1. AI vs. Machine Learning vs. Deep Learning Specific Integrated Circuits (ASIC), and Graphical Pro-
cessing Units (GPU) are widely used to accelerate DNNs.
Nowadays, DNNs are used in many modern AI ap- The specialized hardware based DNN accelerators can be
plications, including bioinformatics [60], natural language categorized into two classes: the first class of accelerators
processing [148], image restoration [186], speech recogni- efficiently implements the computational primitives such as
tion [34], computer vision [195], machine translation [36], convolutional operations, fully connected operations, etc. for
healthcare [43], finance [223], robotics [95], visual art the DNNs [86], [176] and the second class of DNN accel-
processing [194], etc. Furthermore, the recent applications erators efficiently optimize the data movement and memory
of DNN include aerospace and defence, automated driving, access [56], [178]. These two generations of specialized
recommendation systems, and industrial automation [71], hardware based DNN accelerators improve the speed and
[87], [102], [217]. DNNs are also useful in a variety of appli- energy efficiency of running DNNs. There are two ways to
cations, such as news aggregation and fraud detection [125], improve the performance of the DNN acceleration. The first
virtual assistants [61], chatbots [35], and customer relation- method is to optimize the DNN algorithm, and the second
ship management systems [205]. In addition, DNNs have method is to optimize the hardware architecture. Therefore,
also been used to diagnose covid-19 by classifying it based we need to co-design the algorithm and the hardware to
on different lung and chest imaging modalities [40]. achieve the superior performance.
DNNs contain many layers, and each layer is capable Because of their high throughput and memory bandwidth,
of detecting features at different levels. For instance, in GPUs are one of the most often employed hardware ac-
pattern recognition, where the input is available in pixel celerators for improving inference and training processes
form, the first layer of DNN extracts minor details of the in DNNs [220]. In floating-point matrix-based calcula-
image, such as curves and edges. The outputs of this first tions, GPU-based hardware accelerators are extremely effi-
layer act as inputs to the second layer. The image’s primary cient [207]. GPU-based hardware accelerators, on the other
details, such as squares and semi-circles, are extracted by the hand, consume a lot of power. ASIC and FPGA-based hard-
second layer. The outputs of the second layer act as inputs ware accelerators have limited computational and memory
to the third layer. The third layer extracts the part of objects. resources when compared to GPU-based accelerators. They
Furthermore, the subsequent layer uses the previous layer’s can, nevertheless, achieve a moderate level of performance
output and extracts more aspects of the objects. As the while using less energy [154]. ASIC-based DNN accelera-
number of layers increases, the DNN extracts increasingly tors provides superior performance compared to GPU and
complicated features and complete objects [73]. DNNs pro- FPGA counterparts at the cost of reconfigurability. However,
vide superior accuracy and performance at the cost of high ASIC-based accelerators have some limitations, including
computational complexity. For instance, AlexNet [131] takes high cost of development, long time to market, inflexibility,
1.4 Giga Operations Per Second (GOPS) to process a single etc [77], [104]. FPGA-based accelerators can be used as
image of size 224×224 with a top-1 accuracy of 61%, while an alternative to ASIC-based accelerators, and they can
ResNet-152 [109] takes 22.6 GOPS with a top-1 accuracy of provide superior performance at an affordable cost with
79.3%. DNN’s superior accuracy and performance are due reconfigurability and low power dissipation [215]. FPGA,
ii VOLUME 4, 2016

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

ASIC, and GPU-based AI accelerators have been the subject in the following ways. Few studies [44], [98], [127], [155],
of numerous research [98], [151], [155], [156], [159], [212]. [212] focused only on the developments of FPGA-based
This survey, however, also looks at various embedded AI accelerators. Whereas few other studies [55], [137], [151],
accelerators for DNN acceleration. [159] have presented the details of ASIC-based accelerators.
This survey supplements the existing work and con- Some research reviews [48], [202], [203] have explored both
tributes towards providing the complete background on FPGA as well as ASIC-based accelerators. Very limited
DNN acceleration using various specialized hardware archi- studies [179], [203] have dealt with the progress of GPU-
tectures. The contributions of this survey can be summarized based accelerators. On the other hand, studies on embedded
as follows: AI accelerators haven’t been explored much. Many of these
1) The survey discusses the various research works car- reviews do not mention the compiler/mapping frameworks
ried out on the development and deployment of DNN and SDKs available for these accelerators, making it difficult
using FPGA-based accelerators. for someone to choose the appropriate accelerator. This
2) The survey covers the work done in ASIC-based AI review, therefore, aims to bring a comprehensive study of all
accelerators in the last decade, from 2012 to 2022. the aforementioned hardware accelerators in the context of
3) The survey describes the various GPU-based DNN implementation of DNNs. Furthermore, this survey classifies
accelerators. the FPGA-based accelerators and ASIC-based accelerators
4) The survey provides a comprehensive overview of in a unique way, briefly discusses the key architectural
CGRA-based accelerators for DNN implementation. features and the compiler or mapping frameworks avail-
5) The survey covers the research works carried out able. Accelerators for each category are summarized and
on the implementation of DNNs on the edge using compared. A comprehensive survey of GPU-based accel-
embedded AI accelerators. erators by Nvidia is also presented. The need for edge
6) The survey provides a comparative study of existing AI computing is emphasized and state-of-the-art embedded
hardware architectures: FPGAs, GPUs, ASICs, and AI accelerators, including Arm-based accelerators, are also
embedded AI accelerators. discussed and compared. This survey also briefly discusses
7) The survey highlights the future research trends in the recent developments in tinyML. Table 1 shows the
DNN acceleration on specialized hardware architec- comparison of this survey paper with recently published
tures, including FPGA, ASIC, GPU, CGRA, and Edge review articles on DNN implementation using specialized
AI accelerators. hardware architectures. Researchers in the fields of artificial
intelligence, system design, and hardware architecture are
A. SCOPE OF THE SURVEY expected to benefit from this survey.
This paper lays the focus on research trends in FPGA,
ASIC, and GPU-based accelerators for implementing DNNs. B. ORGANIZATION
We have also briefly discussed the current trends in Arm- This paper is organized as follows: Section II provides a
based machine learning processors and embedded edge brief overview of neural networks and DNNs, including
AI accelerators. The review categorizes the FPGA-based the basic architecture of hardware for the DNN accelera-
accelerator into three categories and briefly discusses the tion. Section III describes various architectures implemented
key features of the accelerators, including the frameworks on the FPGA platform for DNN acceleration. Section IV
available. The three categories include accelerators for a describes various ASIC-based accelerator architectures for
specific application such as speech recognition, object de- DNN acceleration. Section V shows a detailed review of
tection, natural language processing, etc., accelerators for GPU-based accelerators for the acceleration of DNN. Sec-
a specific algorithm such as CNN, RNN, etc., and accel- tion VI discusses various CGRA-based accelerator architec-
erator frameworks with hardware templates. Furthermore, tures for DNN acceleration. Section VII discusses in detail
ASIC-based accelerators are categorized into three types: the embedded edge AI accelerators for DNN acceleration.
ALU-based accelerators, dataflow-based accelerators, and Section VIII provides the comparisons between the various
sparsity-based accelerators. A comparative study of these hardware architectures used for the DNN acceleration. Sec-
hardware accelerators based on performance metrics like tion IX provides the future research directions of various
power, throughput, and the area has been presented. The hardware architectures for DNN acceleration. Finally, the
review also focuses on the mapping frameworks avail- conclusion of this review is presented in Section X.
able for each these accelerators and briefly discuss the
implementation details. In addition, the recent research II. BACKGROUND
contributions in Arm-based machine learning processors A. NEURAL NETWORKS
and a few embedded AI hardware accelerators are dis- A Neural Network (NN) is a computational model inspired
cussed and compared in terms of their cores, performance, by biological neural networks. It is also known as an Artifi-
power, availability of Software Development Kits (SDKs), cial Neural Network (ANN). An ANN comprises hundreds
and supported frameworks. This survey is different and or thousands of interconnected artificial neurons, also called
unique with respect to many existing papers in this area processing units. Three or more interconnected layers are
VOLUME 4, 2016 iii

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

TABLE 1: Comparison among state-of-the-art surveys


Scope
Paper Year Summary FPGA-based ASIC-based GPU-based Embedded AI CGRA-based Sparce
Frameworks Dataflow Tiny ML Applications
accelerators accelerators accelerators accelerators accelerators DNN
Provided a comprehensive survey of various techniques to
Sze et al. [202] 2017 ✓ ✓ ✗ ✗ ✗ ✓ ✓ ✓ ✗ ✓
efficiently process deep neural networks on hardware.
Provided a comprehensive survey for the acceleration of neural
Wang et al. [212] 2018 networks on FPGA and also discussed about advantages and ✓ ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✓
disadvantages of FPGA-based accelerators
Provided a comprehensive review of FPGA-based neural
Guo et al. [98] 2019 ✓ ✗ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✓
network inference accelerators.
Reviewed the techniques and frameworks for the acceleration
Blaiech et al. [44] 2019 ✓ ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✓
of deep learning algorithms on FPGA.
Provided detailed survey of techniques for implementing and
Mittal et al. [155] 2020 ✓ ✗ ✗ ✗ ✗ ✓ ✓ ✓ ✗ ✓
optimizing CNN algorithms on FPGA.
Summarized the recent advances in DNN accelerator, and also
Chen et al. [55] 2020 ✗ ✓ ✗ ✗ ✗ ✗ ✓ ✓ ✗ ✓
provided the future trends of DNN accelerators.
Provided a comprehensive survey of state-of-the-art
Capra et al. [48] 2020 ✓ ✓ ✓ ✗ ✗ ✓ ✓ ✓ ✗ ✗
architectures for DNN acceleration.
Provided a detailed survey of energy-efficient DNN processing
Lee et al. [137] 2021 on hardware and also summarized the key techniques of ✗ ✓ ✗ ✗ ✗ ✗ ✓ ✓ ✗ ✗
energy-efficient DNN training on ASICs
Provided a systematic review of ASIC, FPGA, and GPU-based
Talib et al. [203] 2021 ✓ ✓ ✓ ✗ ✗ ✓ ✗ ✗ ✗ ✗
accelerators for AI and ML tools.
Summarized the current commercial accelerators that have been
Reuther et al. [179] 2021 ✗ ✗ ✓ ✗ ✗ ✗ ✓ ✗ ✗ ✓
publicly announced with peak performance and power numbers.
Provided a comprehensive review of optimization techniques
Machupalli et al. [151] 2022 ✗ ✓ ✗ ✗ ✗ ✓ ✓ ✓ ✗ ✓
used in the existing DNN accelerators
Provided a comprehensive survey of CNN accelerator
Moolchandani et al. [159] 2022 ✗ ✓ ✗ ✗ ✗ ✗ ✓ ✓ ✗ ✗
architectures on custom hardware.
Provided a detailed review on recent advancements in the area
of DNN acceleration on specialized hardware architectures such
This surey 2022 ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
as FPGA, ASIC, and GPU. Discussed the recent developments
in embedded AI accelerators for edge environment and tinyML.

formed by these neurons. The input neurons are in the all n inputs from the input layer and generates the output
first layer. The input neurons receive external signals and y. These inputs are multiplied by the weight coefficients
pass them on to the subsequent layers, which eventually (w1 , w2 , . . . , wn ) and combined together with a bias value
provide the final output data to the final output layer. The b for each neuron. A non-linear function σ(.) , also called as
intermediate layers in the ANN are called as hidden layers. an activation function, is then used to calculate the neuron’s
Fig. 2 depicts the architecture of a typical NN, which output, see Eq. (1). In this scenario, the activation function
includes an input layer, an output layer, and two hidden causes a neuron to produce an output only if the input to
layers. it exceeds a specified threshold value. Common non-linear
functions used in NN are Sigmoid, Rectified Linear Unit
(ReLU), and Hyperbolic tangent. The graphical model and
mathematical representation of artificial neuron is shown
in Fig. 3 and Eq. (1), respectively.

∑ σ
Output layer

Fig. 3. A single ANN neuron with its elements (inputs,


Input layer Hidden layers weights, bias, summer, activation function, and output)
Fig. 2. An architecture of NN
N
In NN shown in Fig. 3, the input layer contains n inputs X
y = σ( x[n]w[n] + b) (1)
(x1 , x2 , . . . , xn ). The following layer (hidden layer) gets
n=1
iv VOLUME 4, 2016

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

In neural networks, weights are initialized with some learning. The labeled data is used in supervised learning to
random values. However, during the training process, all train or model the network. The labeled data indicates that
these weights get updated iteratively to predict the correct some input data has already been matched to the correct
output. The weights are updated using the cost function, output. Unsupervised learning is another learning technique
which is nothing more than the mean square error. The in which the network/model is trained using the unlabeled
mathematical representation of mean square error is shown data. The trained network generates the clusters or structures
in Eq. (2). Here, MSE is mean squared error, n represents in the unlabeled data. Semi-supervised learning uses the
the number of input data points, yi and yˆi are true and partially labeled data sets and it falls in between supervised
predicted outputs, respectively. Once the neural network is and unsupervised learning approaches. Finally, reinforce-
trained, it may be used for classification problems. ment learning is a type of training that rewards positive
behaviours while punishes undesirable ones. Reinforcement
1X
M SE = (yi − yˆi )2 (2) learning is bound to learn from its previous experience. The
n i=1 pictorial representation of the aforementioned deep learning
approaches is shown in Fig. 5.
B. DEEP NEURAL NETWORK (DNN)
The Deep Neural Network (DNN) is a type of neural Input data Input data Input data
(labeled) (unlabeled) (states & actions)
network that has more than three hidden layers and is well-
suited to complicated tasks [37]. In today’s DNN, the typical
number of layers used ranges from five to over a thousand. A
Supervised Unsupervised Reinforcement
DNN with N hidden layers is shown in Fig. 4. In DNNs, the Learning Learning Learning

Error
model and its parameters are learned through an extensive Error
Reinforcement
training process. signal

Critic output output output Critic


Neuron (mapping) (classes) (state/action)

Fig. 5. Deep learning approaches

C. CONVOLUTIONAL NEURAL NETWORK (CNN)


outputs Convolutional neural networks (CNNs) are a type of neural
Input Data network which have been widely used for image recognition
tasks. CNN is made up of several stages, each of which
is referred to as a layer. Each layer extracts a feature
Output layer
from the data it receives. The identifying features get more
sophisticated or complex as we proceed. CNN structure
Input layer
was first proposed by Fukushima in 1988 [88]. As shown
Layer 1 Layer N in Fig. 6, a CNN consists of four layers: convolution,
Hidden layers
fully connected layer, pooling layer, and Rectified Linear
Fig. 4. DNN with N hidden layers [158] Unit (ReLU) layer. Optionally, CNN might also have non-
traditional layers such as dilated convolution layer [115] and
Training and inference are the two critical phases in accel- deconvolution layer [52]. The CNN’s overall design may be
erating a task using DNN. The specific tasks such as object divided into two sections: feature learning and classification.
detection, pattern recognition etc. are part of the training, Each layer of the CNN gets data from the layer before it as
in which the DNN is taught to perform such specific tasks input and delivers its output to the following layer as input
using available data. The known data is supplied to DNN during the feature learning phase. The feature learning phase
throughout the training process, allowing the network to includes three types of layers: convolution, RELU, and
predict what the data represents. As a result, the prediction pooling. At each node of the convolution layer, convolution
error is used to adjust the weights of the neurons. The operations on the input nodes detect features from the input
weights are adjusted till the predictions are made with feature maps. The output of the feature extraction phase’s
a considerable degree of accuracy. Back propagation is a final layer is delivered to a fully connected network known
popular method for updating weights, as mentioned before as the classification layer. The following sections discuss
in the training phase. DNN is ready to make predictions each type of layer in the CNN in brief.
on fresh and unknown data once it has been fully trained.
This stage is known as inference, and it involves testing the 1) Convolution Layer
trained model with completely new and unknown data. The convolution layer is also known as the feature extraction
There are four types of deep learning approaches: su- layer since it extracts the features of images. The inputs and
pervised, unsupervised, reinforcement, and semi-supervised outputs of the convolution layer are defined as feature maps
VOLUME 4, 2016 v

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

cat
dog 2 × 2 pooling, stride 2

4 1 2 4 MAX pooling AVG pooling


rabbit

Input Convolution+RELU Pooling


5 6 6 8 6 8 4 5
Convolution+RELU Pooling
Fully
Flatten Softmax
connected
Feature learning
Classification

2 3 21 9 3 21 2 12
Fig. 6. CNN architecture (adopted from [182])
1 2 11 7
(FMs) which are organized in two-dimensional grids. The Fig. 7. Various forms of pooling
FM from the previous layers of the convolution layer is
convolved with the filter coefficients. More than one input
feature map can be paired with each of the output feature layers in the CNN network . By substituting all the negative
maps. In a 2-D convolution operation between an input valued outputs with 0, it introduces non-linearity into the
image matrix x (size R × C) and a filter f (size W × L), CNN. Because of its computational simplicity, sparsity, and
the convolution layer performs point-wise multiplication and ability to converge faster than other activation functions like
addition of the corresponding pixels. The filter size is often hyperbolic tangent and sigmoid [72], [199], ReLU [161] has
smaller than the input matrix size. The filter multiplies the gained a lot of traction in recent years. The mathematical
input matrix with the W × L sized-block, accumulates the representation of ReLU is shown in Eq. (4). Some popular
result, slides to the next block of the input matrix, and extensions of ReLU, for instance, exponential LU [64], para-
repeats the operation. The input matrix is processed one metric ReLU [108], and leaky ReLU [150] are also being
block at a time until it has processed all of the image’s R×C used in CNNs for improved performance and accuracy.
elements. The 2-D convolution operation is given in Eq. (3)
where y(r, c) signifies one output pixel in the output matrix f (x) = max(0, x) (4)
y, with each pixel’s coordinates expressed as (r, c). The
iterators over the filter’s length (L) and width (W ) are l 4) Fully Connected Layer
and w, respectively, in Eq. (3). Finally, the resulting feature Fully connected layers do the final classification in the CNN
maps apply non-linear activation functions such as sigmoid, network after multiple convolutions, ReLU, and pooling
hyperbolic tangent, or rectified linear units. layers. Weights, biases, and neurons are all part of the
W −1 L−1     fully connected layer. All input and output neurons are
X X W L connected in the fully connected layer. A CNN network
y(r, c) = f (w, l)x(r + w − ,c + l − )
2 2 typically has one or more fully connected layers. The final
w=0 l=0
(3) output of CNN comes from the last fully connected layer,
often known as the classification layer. The fully connected
2) Pooling Layer layer in the CNN contains a large number of inputs and
The pooling layer shrinks the spatial dimensions of the input outputs. Therefore, it is challenging to implement the fully
image after convolution, thereby reducing the computation connected layer operations on hardware platforms with
and number of parameters in the network. Pooling layers limited resources.
are also known as subsampling layers. In CNN, the pooling
layer is used between two convolution layers. The MAX 5) Deconvolution Layer
operation is used to resize each slice of the input image To increase the size of the feature map, a deconvolution
spatially, on which the pooling layers operate individually. A layer, also known as a transposed convolution layer, is
pooling layer with filters of size 2×2 is found in many CNN employed [52]. Upsampling (inserting zeros in the feature
topologies. Over the four samples in the filter, the pooling map) and then convolving the upsampled feature maps with
operation, which is nothing but the MAX operation, is the kernel coefficients are used to accomplish this.
done. The operation yielding the maximum value is retained
while discarding the other values [124]. It is noteworthy that 6) Dilated convolution layer
additional operations like MIN operation and AVG operation The filter coefficients are up-sampled and convolved with
can also be used in the pooling layer, particularly in some the input image in a dilated convolution layer to capture
CNNs [198]. The MAX and AVG pooling operations for a broader receptive field [115]. Image segmentation, for
the filters of size 2 × 2 is shown in Fig. 7. example, uses it to capture the larger global context in each
output pixel.
3) Rectified Linear Unit (ReLU) Layer With millions of weight coefficients, CNNs are extremely
In a CNN network, the ReLU layer is usually employed after complex. They are computationally expensive and neces-
the convolution and fully connected layers. The ReLU layer sitate a significant amount of memory to store the input,
is generally used after the convolution and fully connected output feature maps, and weight coefficients, causing CPUs
vi VOLUME 4, 2016

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

to under perform. To boost the performance of the CNNs, register file) and control are shared by all ALUs in the
specific hardware accelerators are used. As a result, different temporal architecture. In temporal architectures like CPUs
techniques for implementing CNNs efficiently on hardware or GPUs, all the convolution or fully connected operations
platforms must be explored in order to reduce resource and are mapped to matrix multiplication. CPU cores are the
memory requirements. least employed among the several temporal architectures for
DNN training and inference. CPUs contain a small number
D. HARDWARE ARCHITECTURES FOR DNN of processing cores, ranging from one to ten. As a result,
ACCELERATION only a small number of processes can be performed in
DNNs have been increasingly popular in recent years, parallel, limiting throughput. GPUs are commonly used to
allowing for their development and deployment on a variety train and infer DNNs. They have thousands of cores to run
of hardware platforms. These hardware platforms are of highly parallel algorithms efficiently, for instance, matrix
various types, right from general-purpose architectures such multiplication. Throughput is enhanced by lowering the
as CPUs and GPUs, programmable architectures (FPGAs) number of multiplications in both CPUs and GPUs. There
to special-purpose chips (ASICs). In many DNN models, are software libraries that optimize matrix multiplication
multiply-accumulate (MAC) operations are the most impor- for GPUs (e. g., cuBLAS, cuDNN [59], etc.) and CPUs
tant computations, and they can be easily parallelized. Since (e. g., Intel MKL [2], OpenBLAS, etc.). Another well known
these MAC operations can be executed in parallel, hardware technique to reduce the matrix multiplications is Fast Fourier
architectures that enable parallel operations are required to Transform (FFT) [80], [152]. Furthermore, several tech-
process DNNs. To achieve superior performance, highly niques, such as Winograd’s algorithm [133] and Strassen’s
parallel computing models, encompassing both spatial and algorithm [67], are used to reduce the matrix multiplications
temporal computing architectures, are often employed for and thereby reduces the resource and memory requirements.
DNN acceleration. The spatial and temporal architectures
have a similar computational structure, with a set of Pro- 2) Spatial Architectures
cessing Elements (PEs). However, processing units can have In spatial architectures, each ALU can have its own local
internal control in a spatial architecture, whereas control in memory and control logic. The local memory is also referred
a temporal architecture is centralized, as shown in Fig. 8. to as the register file. The development and deployment
Each PE can have a register file (RF) to store data in of DNNs on Field-Programmable-Gate-Arrays (FPGA) and
spatial architecture; however, PEs do not have the memory Application-Specific-Integrated-Circuits (ASIC) comes un-
capacity in a temporal architecture. The PEs can also be der the category of spatial architectures. FPGAs are less
connected to exchange data in spatial computing designs. expensive and have a faster time to market than ASICs, and
To summarize, the PEs in the temporal architectures contain the design flow is simpler. However, FPGAs are less energy-
only Arithmetic and Logic Units (ALUs). The PEs consist efficient and consume more power than ASICs since FPGAs,
of ALU as a computation unit, RF to store the data, and a unlike ASICs, contain a significant chip area dedicated to
control unit in spatial architectures. reconfigurability. ASICs, on the other hand, are mainly
designed for a particular application and cannot support
Memory Hierarchy Memory Hierarchy reconfigurability. The design flow of ASICs is more complex
than FPGAs [46]. ASIC chips are expensive, but they are
PE PE PE PE PE PE PE PE
highly optimized and energy-efficient and provide superior
performance than FPGAs. Memory accesses are the real
Register File (RF)
bottleneck in DNN computations; therefore, off-chip DRAM
Control Unit

PE PE PE PE PE PE PE PE
Computation (ALU)

PE PE PE PE
Control
PE PE PE PE Computation (ALU)
accesses must be minimized, as they have a high energy
cost and delay. The memory accesses (off-chip) can be
PE PE PE PE PE PE PE PE reduced by reusing data stored in smaller, quicker, and
PE: Processing Element
low-energy memories. In spatial computing architectures,
Spatial Architecture Temporal Architecture
weight stationary, row stationary, output stationary, and
Fig. 8. Spatial and temporal architectures other specialized processing dataflows can be designed to
improve data reuse from memories in the memory hierarchy
and reduce energy dissipation. At each level of the memory
1) Temporal Architctures hierarchy, the dataflow defines what data is read and when
The temporal architectures exploits the parallelism by sup- it is processed. In spatial architectures, dataflows can be
porting a variety of techniques, such as Single Instruction classified as follows:
Multiple Threads (SIMT) or Single Instruction Multiple
Data (SIMD). The temporal computing architectures appears Weight Stationary (WS)
mostly in CPUs and GPUs. In temporal designs, ALUs can In weight stationary dataflow, the weights are kept fixed and
only access data from the memory hierarchy and cannot are stored in the register files of the PEs, whereas the inputs
communicate directly with one another. The memory (i.e., and partial sums are distributed across the PEs. Weight
VOLUME 4, 2016 vii

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

stationary dataflow maximizes filter and convolutional reuse


of weights. Weight stationary dataflow examples are found (
Peak Floating Point Performance
in [50], [169], [183], [196]. Attainable Performance = min
Peak Memory Bandwidth × CTC ratio
(5)
Output Stationary (OS) The roofline model is illustrated in Fig. 9. Algorithm 2
in Fig. 9 has a better CTC ratio than Algorithm 1. As
Each partial sum is held fixed in a PE in the output stationary a result, Algorithm 2 performs better than Algorithm 1
dataflow, and accumulation is done until the final total is because it effectively utilizes all of the hardware com-
obtained. In the meantime, the PEs’ weights and inputs are putation resources. In contrast, Algorithm 1 under-utilizes
dispersed in a variety of ways. The convolutional reuse is hardware computation resources due to inefficient off-chip
maximized with output stationary dataflow. This dataflow communication.
reduces the amount of energy used while writing and
reading partial sums. Output stationary dataflow examples
are found in [99], [170].

Attinable performance
Row Stationary (RS) peak floating-point performance
(GFLOPS)
The operations of a row of convolution are mapped to the

s)
(GFLOPS)

B/
same PE in row stationary dataflow, and the weights are kept

(G
th
stationary inside the register file of the PEs. Row stationary

id
w
nd
dataflow maximizes the convolutional reuse of input feature

Algorithm 2

Algorithm 1
ba
y
maps, weights, and partial sums. Row stationary dataflow

or
em
examples are found in [53], [57].
m
ak
pe

No Local Reuse (NLR)


In no local reuse dataflow, nothing is stationary inside CTC ratio (FLOP/B)
the PEs, and it is used to reduce the accelerator area
by eliminating the register file from PEs. No local reuse Fig. 9. Roofline model, adopted from [225]
dataflow examples are found in [56], [224].
All PEs in the spatial architectures can be connected in
III. FPGA-BASED ACCELERATORS
one of two ways: 1-D systolic or 2-D systolic. The PEs in
a 1-D systolic architecture are arranged in one dimension, The FPGA-based neural network accelerators are increas-
allowing systolic data flow, but the PEs in a 2-D systolic ingly favored over CPUs because of their higher effi-
architecture are arranged in two dimensions and can receive ciency [165]. FPGA supports parallelism and accelerates
data from both vertical and horizontal directions. Similarly, the computations by mapping them to the parallel hardware;
all PEs can be connected in temporal architectures in one of i. e., multiple DNN structures are executing in parallel on
two ways: 1-D array or 2-D array. Data is received from the FPGA. FPGA-based accelerators deliver up to several orders
global buffer by the PEs in a 1-D array architecture, which of magnitude speedup compared to the baseline CPU [85].
are arrayed in one dimension. A 2-D array architecture has FPGAs give designers the freedom to implement only the
PEs that are arrayed in two dimensions and receive data required logic in the hardware based on the target appli-
only from the global buffer. cation. FPGA-based DNN accelerator architectures mainly
contain a host computer and an FPGA part to implement
DNN algorithms.
E. ROOFLINE MODEL In this section, we would like to review FPGA-based
The roofline model is basically a visual performance model DNN accelerators, which can be broadly categorized into
intended for floating point computations and multicore three types: accelerators for a specific application, such as
architectures [214]. The roofline model relates peak per- speech recognition, object detection, natural language pro-
formance provided by the hardware platform and off- cessing, etc., accelerators for a specific algorithm, such as
chip memory traffic with system performance. For a given CNN, RNN, etc., and accelerator frameworks with hardware
compute-to-communication (CTC) ratio, the maximum at- templates. For the first two categories, the design complexity
tainable performance is the minimum of (1) peak compu- of the accelerator is low, whereas the design complexity is
tational performance and (2) peak memory performance. relatively high for the final category.
Here, the CTC ratio, also called operational intensity, means
operations per byte of DRAM traffic. Eq. (5) formulates A. ACCELERATORS FOR A SPECIFIC APPLICATION
the attainable performance of an application on a specific There exists many FPGA-hardware accelerators for specific
hardware platform. applications. Designing a custom accelerator for a given
viii VOLUME 4, 2016

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

application is a good fit for the problem and has a low design tively. Cloutier et al. [65] proposed a hardware accelerator,
complexity. Han et al. [103] proposed the FPGA-based referred to as Virtual Image Processor (VIP) to implement
accelerator named efficient speech recognition engine (ESE) the CNNs. The Altera EPF81500 FPGA platform is used to
to implement the LSTM algorithm for speech recognition. implement the proposed design. VIP primarily consists of
Load-balanced sensing pruning method is used in the pro- Processing Elements (PEs) connected by a 2-D systolic ar-
posed design to compress the LSTM model. The proposed chitecture and supports the SIMD paradigm. VIP is designed
accelerator uses a framework named Kaldi to implement to perform the following vector and matrix operations:
LSTM algorithm for speech recognition. The ESE has a matrix multiplication, matrix-vector multiplication, scalar
performance of 282 GOPS and is implemented in a Xilinx multiplication, matrix addition, matrix-vector addition, vec-
XCKU060 FPGA running at 200 MHz. The implementation tor addition, 1-D convolution, 2-D convolution, etc. The
of speech recognition algorithms using FPGA-based accel- host computer is used to provide the configuration data to
erators is also presented in several earlier studies [62], [113], the FPGA board, which is connected through Peripheral
[138], [201]. Component Interconnect (PCI) interface. VIP uses the low
Wang et al. [211] proposed a reconfigurable YOLOv3 accuracy arithmetic because of the limitations of resources
FPGA hardware accelerator for object detection. In this on Altera EPF81500 FPGA. Fortunately, recent FPGAs
context, YOLOv3 (You Only Look Once, Version 3) is a contains large numbers of computing units and memory
real-time object detection algorithm that detects specific ob- resources and allow fast CNN implementations. FPGA
jects in images or videos. The proposed accelerator is built implementations of DNNs mainly focused on accelerating
using the ARM + FPGA architecture. Experiment results the convolution operations, which are reported in [38] and
show that the FPGA-based YOLOv3 accelerator consumes [49].
less energy and achieves higher throughput than the GPU Farabet et al. [86] presented ConvNet Processor (CNP):
counterpart. The proposed accelerator is compatible with an FPGA-based accelerator to implement the CNNs. CNP
several frameworks, such as Tensorflow, Caffe, PyTorch, uses dedicated hardware convolver for the data processing
etc. The proposed accelerator is implemented on Xilinx and also uses soft-processor for controlling. CNP is designed
ZCU104 running at a frequency of 300 MHz. Several on the Virtex4 SX35 FPGA and also equipped with external
previous works [82], [149], [162] also used the FPGA to memory to store the input and filter coefficients. CNP
implement object detection algorithms. consists of Vector Arithmetic and Logic Units (VALU), one
Hamza et al. [126] proposed the FPGA-based acceler- of the main components in the architecture that implements
ator named NPE to efficiently implement various Natu- the CNN operations viz. 2-D convolutions, sub-sampling,
ral Language Processing (NLP) models. NPE provides a and non-linear activation functions. The implementation
single framework for processing arbitrarily complex non- of 2-D convolution, represented using Eq. (6), is shown
linear functions with software-like programmability. NPE in Fig. 10 for K = 3, i. e. 3 × 3 kernel. In Eq. (6), xij
consumes 4× and 6× less power than CPU and GPU. NPE is the data in the input plane, wmn is the weight value in
is implemented on the Xilinx Zynq Z-7100 FPGA running K × K kernel, yij is the partial sum, zij is the result in
at a frequency of 200 MHz. the output plane, and W is the width of the input image.
Serkan et al. [185] developed an FPGA-based CNN ac- At each clock cycle, the convolution module performs k 2
celerator to classify malaria disease cells. The proposed multiply-accumulate operations simultaneously. CNP uses
accelerator is implemented on Xilinx Zynq-7000 FPGA run- the First In First Out (FIFO) buffers between the external
ning at a frequency of 168 MHz. The proposed accelerator memory and FPGA to provides the continuous flow of
achieves an accuracy of 94.76%. Zhu et al. [231] proposed data in both directions. CNP uses the 32-bit soft processor
an FPGA-based accelerator to recognize liver dynamic that provides the macro instructions, generally higher level
CT images. Xiong et al. [219] developed an FPGA-based instructions than most traditional processors, to the VALU
CNN accelerator to improve the automatic segmentation for implementing the basic CNN operations. CNP has a
of 3D brain tumors. FPGA-based accelerators are also compiler that converts network implementations with Torch
used to implement various applications such as autonomous directly into CNP instructions. The proposed architecture
driving [106], [130], image classification [45], [70], fraud has been used to implement the face detection system.
detection [129], cancer detection [187], etc. Table 2 sum- K−1
X K−1
marizes the reviewed FPGA-based accelerators for specific
X
zij = yij + xi+m,j+n · wmn (6)
application. m=0 n=0

B. ACCELERATORS FOR A SPECIFIC ALGORITHM Sankaradas et al. [183] presented a massively parallel
A prominent topic of research in the realm of accelerators is co-processor for accelerating CNNs. This co-processor is
the use of FPGA-based accelerators for a particular neural designed using the Virtex5 LX330T FPGA platform and
network algorithm. Since the accelerator is intended to four DDR2 (Double Data Rate 2) memory banks totalling
address a specific problem, its operation typically requires 1 GB. The proposed co-processor mainly consists of clusters
minimal adjustments to a few parameters to operate effec- of Vector Processing Elements (VPE) connected in parallel.
VOLUME 4, 2016 ix

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

TABLE 2: Summary of FPGA-based accelerators for specific application


DNN Frequency
Application FPGA Device Year
Type (MHz)
Speech recognition [113] CNN Fudan Micro - 2022
Object detection [211] CNN Xilinx ZCU104 300 2021
Natural Language Processing [126] CNN Xilinx Zynq Z-7100 200 2021
Liver CT image recognition [231] CNN - - 2021
Image classification [45] CNN Xilinx xc7vx980t 225 2021
Fraud detection [129] CNN Xilinx Zynq 7000 - 2021
Brain tumor segmentation [219] CNN Xilinx Alveo U280 - 2021
Object detection [149] CNN Intel Arria 10 GX1150 240 2020
Object detection [162] CNN Xilinx VC707 200 2019
Malaria disease cell detection [185] CNN Xilinx Zynq-7000 168 2019
Autonomous driving [106] DNN Xilinx Ultra96 - 2019
Object detection [82] CNN Xilinx ZC706 100 2018
Autonomous driving [130] CNN Xilinx Zynq 7020 100 2018
Image classification [70] CNN Xilinx ZC702 100 2018
Speech recognition [103] LSTM Xilinx XCKU060 200 2017
Cancer detection [187] MLPNN Xilinx XC5VLX50TFFT1136 - 2016
Speech recognition [138] LSTM Xilinx XC7Z045 100 2016
Speech recognition [201] - Altera DE2-115 50 2015
Speech recognition [62] - Xilinx ML402 100 2010

Pix In: x
approach to accelerate the Support Vector Machines (SVM)
2-D CONVULTION
and the proposed design contains VPEs instead of VPE
w00 × w01 × w02 × [W-K] delays clusters. But the accelerator proposed in [94] provides low
performance while accelerating DNNs compared to the co-
0 + + + line m
processor proposed in [183].
w10 × w11 × w12 × [W-K] delays
image
line m + + + line m+1 row
h W1K W1K-1 W11
X X X

w20 × w21 × w22 ×


+ + + +
line m+1 + + + + Pix Out: z Delay registers

image
Pix In: y row
h+1 W2K W2K-1 W21
Internal storage + hard-wired data stream kernel loaded X X X
( dual ported BRAMs)

= 1 register × operations from / to memory from CPU MAC unit


K FIFOs

+ + + +
Fig. 10. 2-D convolution module for 3 × 3 kernel, adopted
from [86] VPE

image
Each cluster consists of 2-D convolver units, sub-samplers, row
h+k
Look-Up Tables (LUT) and performs convolution, pooling, X
WKK
X
WKK-1
X
WK1

and non-linear functions. The co-processor is coupled with load from


off-chip + + + +
DDR2 memory banks to store the intermediate data. Each
OUT
VPE in the proposed co-processor exploits parallelism by
supporting SIMD stream. The primitive 2-D convolver of K2 + K VPEs chained

the proposed design is shown in Fig. 11. It contains k × k Fig. 11. 2-D convolver unit of CNN co-processor, adopted
convolution units along with k 2 +k VPEs, and the final col- from [183]
umn of VPEs is used to add partial results. The coprocessor
operates in collaboration with a host, which can control the A programmable parallel accelerator called MAPLE is
coprocessor through an Application Programming Interface presented in [47] to accelerate the several learning and
(API). The proposed design uses low precision data repre- classification algorithms such as Support Vector Machine
sentation to improve the throughput and memory bandwidth. (SVM), K_means, CNN, etc. MAPLE contains hundreds
The proposed architecture has been used to implement of simple PEs arranged in a 2-D grid fashion as shown
the full face recognition application using CNN with four in Fig. 12 MAPLE can be used to perform vector and
convolution layers. The proposed accelerator can not be matrix operations in parallel. In MAPLE, each PE has
used to realize the full CNNs, which contain convolution local storage to perform the computations efficiently. Each
and fully connected layers. Graf et al. [94] used a similar PE has two operands; one operand comes from its local
x VOLUME 4, 2016

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

storage, and another operand comes from the PE on its left, as shown in Fig. 13. The co-processor uses the three
see Fig. 12. Furthermore, the output of each PE is connected bank memory sub-system to store input images, kernels,
to the PE on its right. The PEs are arranged as clusters, and intermediate data. The DC-CNN uses the Torch7 [66]
where each cluster has a separate off-chip memory block software for CNN implementation. The proposed dynami-
that creates independent data streams for memory-processor cally reconfigurable architecture supports “inter-output” and
computations. MAPLE processing core can be organized as “intra-output” parallelism. The performance of the proposed
H clusters, and each cluster contains M PEs. So, the total dynamically reconfigurable architecture with 20 convolvers,
number of PEs in MAPLE core equals H × M. MAPLE 128-bit memory port width is 4 to 8 times faster than CNP
also uses smart memory banks to process the intermediate presented in [86]. The proposed architecture can be used
data and to perform secondary reduction operations such to accelerate CNN with only three convolutional layers.
as, aggregation, finding minimum or maximum, and array The proposed accelerator is not capable of realizing the
ranking. Authors developed a tool to map the applications full CNNs, which contain convolution and fully connected
on the MAPLE. For the given input matrices and reduction layers.
functions, the tool generates the assembly code needed to
program MAPLE. The authors created a C++ simulator that Data Memory
Bank Bank Bank
1 2 3
estimates how long MAPLE will take to execute from the
input assembly code and an architectural configuration file VLIW
DC-CNN
Host
memory
that details the processor layout and off-chip memory archi- Hardware
Controller

tecture. MAPLE processor is connected to the host computer


1 DC-CNN
through Peripheral Component Interconnect (PCI). MAPLE 1 1 1
Y1 C
is implemented on Xilinx Virtex 5 SX240T FPGA with 512 2 + NL S1 NL X1
Y2 C n
PEs organized as 2 cores, 32 chains per core, and 8 PEs
n
per chain. The experimental results show that MAPLE with Yn
C

512 PEs is 1.5 to 10 times faster than a quad-core Xeon 2


processor with a clock frequency of 2.5 GHz despite running 1 1 2
C

Output Switch
at 125 MHz. Input Switch
2
C n + NL S1 NL X2

n C

from off-chip Input Local PE PE PE


bank store local local local
store store store

pipelined
smart m
memory PE PE inter-chain PE
1 m
block interconnect 1
C
PE PE PE 2
C n + NL S1 NL Xn
local local local
store store store
n C
C: Convolvers
pipelined NL: Non-linearity
smart S1: Sub-sampling
to off-chip
REDUCE memory PE PE inter-chain PE
bank
block interconnect

pipelined
Fig. 13. DC-CNN co-processor architecture, adopted
input
broadcast PE PE PE
from [51]
local local local
store store store

smart pipelined A CNN accelerator, referred to as NeuFlow is proposed


memory PE PE inter-chain PE
block interconnect in [84]. NeuFlow is implemented on a Xilinx Virtex 6
FPGA platform. NeuFlow contains a 2-D grid of Processing
Fig. 12. MAPLE’s processing core architecture, adopted
Tiles (PTs), as shown in Fig. 14. A PT contains a bank
from [47]
of processing operators where an operator can be anything
from memory (FIFO) to an arithmetic operator. All the
Chakradhar et al. [51] presented a dynamically reconfig- operators are connected to a local data line using recon-
urable architecture for CNNs on Virtex 5 FPGA platform. figurable routes. A multiplexer connects the local data line
The proposed system consists of a dynamically configurable to a global data line by which a PT is connected to the
CNN (DC-CNN) processing core and three bank memory four neighbouring PTs. Data is transferred from the off-chip
sub-system. The DC-CNN processing core continuously memory to the tiles using a Smart Direct Memory Access
communicates with the host computer that executes the Module (Smart DMA). The control unit configures each tile
main application. In the proposed accelerator, the host for the computation and connections between the tiles. Data
computer transfers the complete CNN structure and input streams from the Smart DMA are processed in tiles, and the
images to the co-processor. The DC-CNN processing core, results are passed to the neighbouring tiles, or back to the
responsible for executing CNN applications, mainly con- Smart DMA. This 2-D grid can be used to perform arbitrary
tains computational units (2-D convolvers), subsampling, computations on streams of data and plain unary operations
and non-linearity units, adders, input and output switches, to complex nested operations. Using FIFOs input/output
VOLUME 4, 2016 xi

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

flow can be managed, and operators can easily be cascaded time division multiplexing (TDM) processing scheme, and
and connected across tiles. The NeuFlow accelerator uses page-mirror algorithm. In the proposed accelerator, NPUs
a compiler named luaFlow to process CNNs. The luaFlow get the inputs from the host computer through the Ethernet
compiler converts high-level data flow graph representations interface, and weight coefficients are fetched from page
of deep learning algorithms in the Torch5 environment into mirror memory. The serializer sends the output of NPUs
machine code for NeuFlow. The proposed accelerator has to the activation function blocks. For each sample, the
been used to implement a real-time street scene parser. proposed accelerator requires a long time to transfer the
appropriate weight coefficient from the host computer to
PT
MUX
PT
MUX
PT
MUX
the accelerator core.
Off-chip
Memory
× + × + × +
% ∑∏ Mem % ∑∏ Mem % ∑∏ Mem

PT PT PT Smart
MUX MUX MUX
DMA

× + × + × +
% ∑∏ % ∑∏ % ∑∏

Serializer
Mem Mem Mem

Page-Mirror
Communication

Memory
Page-Mirror
Memory

signals
PT PT PT

Interface
MUX MUX MUX

× + × + × + Control &
Config
% ∑∏ Mem % ∑∏ Mem % ∑∏ Mem

weights
Configurable Route Global Data Lines Local Data Lines Runtime Config Bus

Fig. 14. 2-D grid of Processing Tiles (PTs) in NeuFlow


architecture, adopted from [84]
Fig. 15. NPU-based neural network accelerator architecture,
Peeman et al. [170] presented a memory-centric design adopted from [172]
method for CNN accelerator. The proposed memory-centric
accelerator is implemented on a Virtex 6 FPGA board. A scalable and low power accelerator referred to as
This accelerator minimizes the bandwidth requirements by neural network next (nn-X) is presented in [92] to accelerate
exploiting the data reuse in complex access patterns. The the DNNs. The nn-X accelerator mainly contains a co-
memory-centric accelerator uses the loop transformation and processor, a host processor, and external memory as shown
Block RAM (BRAM)-based multi-bank on-chip buffers to in Fig. 16. The host processor controls the input and con-
maximize the efficiency of on-chip memories for better data figuration data transfer to the coprocessor, parses the DNN,
locality. The memory-centric accelerator uses SIMD type and converts it into instructions for the coprocessor. The co-
of PEs to accelerate the convolutional layers. The proposed processor mainly contains an array of processing elements
accelerator design mainly focused on the maximization of called collections, configuration bus, and memory router.
the reuse of on-chip data. The proposed accelerator is con- The collections in the nn-X accelerator are mainly composed
nected to a MicroBlaze host processor and is communicated of convolution engines, pooling modules, and non-linear
through Fast Simplex Link (FSL) connections. Vivado HLS operators and are used to perform the most common CNN
tool is used to map the CNNs on the proposed accelerator, operations, such as convolution, sub-sampling, and activa-
which enables the user to use the high-level accelerator tion functions. The memory router in the nn-X accelerator
description in C and to use HLS directives to specify the is used to transfer the data between the processing elements
hardware configuration. The performance of the proposed and the external memory, which provides the independent
accelerator will be improved with the use of the DMA data streams. The proposed architecture uses the weight
controller. stationary dataflow to improve the energy efficiency. The
In [172], an accelerator for DNNs is introduced, and it nn-X accelerator is implemented using the Xilinx ZC706
is implemented on the Xilinx Kintex 7 FPGA platform. It platform, which has a dual ARM Cortex-A9 processor,
is built using a set of Neural Processing Units (NPUs), Xilinx Zynq XC7Z045 chip, and 1 GB DDR3 memory. The
see Fig. 15. The number of NPUs in the proposed design experimental results show that the nn-X can achieve a peak
depends on the available FPGA resources. NPUs are mainly performance of 240 GOPS.
used to compute the majority of operations (multiplica- Zhang et.al. [225] proposed a roofline-based model [214]
tions and additions) in parallel. A multiply and accumulate to implement CNNs on FPGAs. The authors analyzed
(MAC) unit and control logic are the essential components the throughput and required bandwidth for a given CNN
of each NPU. The proposed accelerator utilizes the available design using various optimization techniques, such as loop
FPGA resources efficiently by using pipelined architecture, tiling and loop transformation. With the help of roofline
xii VOLUME 4, 2016

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

memory
Memory interconnect
programmable logic
is carried out using the max-pooling module. CNN’s non-
controller linearity function is calculated using the non-linearity mod-
memory router
ule. The convolver complex module generates partial sums,
which are added by the adder tree. Finally, for dynamic
quantization, bias shift and data shift modules are used.
router router router
external The proposed accelerator supports the Caffe deep learning
memory
convolution
engine
convolution
engine
convolution
engine
framework. The proposed accelerator has been implemented
on Xilinx Zynq platform.
collection

pooler pooler pooler


ARM
processors programmable programmable programmable
f(x) f(x) f(x) Bias Shift Intermediate Data
Bias

C
+

Input Buffer
coprocessor bus 32-bit off-chip bus 64-bit on-chip bus 32/64 bit on-chip bus 32-bit config bus

Output Buffer
Fig. 16. Architecture of nn-x system, adopted from [92]
Data + Data
C
+ NL Pool
Shift

Weights +
C
model they identified the solutions with best performance
Convolver
+
and lowest FPGA resource requirement. This roofline-based Complex Adder Tree
model optimizes both the memory accesses as well as com-
Controller
putations in the convolutional layers. The accelerator design
is implemented with the Vivado HLS tool, which enables the Fig. 17. Convolver architecture, adopted from [178]
accelerator implementation in C language. The proposed ac-
celerator achieves maximum throughput of 61.62 GFLOPS Wang et al. [210] proposed a scalable design called Deep
(Giga Floating-point Operations Per Second). Learning Accelerator Unit (DLAU) for accelerating deep
Implementing DNN in embedded devices is tough due learning algorithms. DLAU utilizes the tiling technique to
to resource and power constraints. In this regard, authors produce a scalable architecture. The proposed accelerator
in [173] have developed a novel FPGA-based accelerators mainly contains modules such as DMA, embedded pro-
for implementing trained and fully connected DNNs. Since cessor, DLAU, and DDR3 memory controller as shown
it is difficult to map a DNN with large number of neurons in Fig. 18. The DLAU module mainly contains three pro-
and corresponding weights, directly onto an FPGA, the cessing units, viz. Partial Sum Accumulation Unit (PSAU),
authors in [173] used a time division multiplexing scheme. Tiled Matrix Multiplication Unit (TMMU), and Activation
Batch processing is used in the proposed architecture, which Function Acceleration Unit (AFAU). TMMU is used to per-
distributes different weights over many input samples. In form the multiplication operations and also generate partial
addition, the suggested accelerator employs a pipelined sums. PSAU is used to add the partial sums derived from
architecture to make the most of the FPGA resources while TMMU. Finally, AFAU is used to perform the non-linear
staying within power and resource limits. The concept of activation functions, for instance, sigmoid function. The
pruning has also been incorporated into the proposed archi- DLAU module reads the tiled input data through the DDR3
tecture to reduce data transfer from the external memory memory. The embedded processor provides the program-
to the accelerator [174]. Both Batch processing and weight ming interface to the users and communicates with DLAU
pruning can enhance the throughput of DNN accelerators. via JTAG-UART. The proposed architecture is implemented
Qiu et al. [178] proposed FPGA based CNN accelerator, on Xilinx Zynq Zedboard with ARM Cortex-A9 processors
which will efficiently accelerate all the layers of CNN, in- operating at 667 MHz.
cluding the fully connected layers. The proposed accelerator Lian et al. [141] proposed a block-floating-point (BFP)
improves bandwidth and resource usage by employing a arithmetic-based CNN accelerator for DNN inference. The
dynamic-precision data quantization method and a unique proposed accelerator mainly contains three elements: Pro-
design of the convolver hardware module. The proposed cessing Array (PEA), on-chip buffer, and external memory,
accelerator applies singular value decomposition (SVD) on as shown in Fig. 19. The onboard DDR3 modules receive
weight coefficients to minimize the memory footprint at input data and network parameters from the host computer
the fully connected layer. The convolver hardware module via PCIe3.0x8. Conv PEA performs the convolutional op-
can be used for both convolutional and fully connected erations, and FC PEA performs the fully connected layer
layers to reduce resource consumption. The adder tree, operations. The proposed accelerator uses 8-bit and 16-bit
convolver complex, non-linearity, max-pooling, bias shift, formats to represent the feature maps and modal parameters
and data shift are the main elements of the convolver (activations and weights), which can reduce off-chip band-
hardware module, as shown in Fig. 17. Convolutions and width and memory compared to the 32-bit floating point
fully connected layer operations are both performed using counterpart with only a tiny accuracy loss. The accelerator
the convolver complex module. The max pooling action design is implemented with the Vivado HLS tool, and the
VOLUME 4, 2016 xiii

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

DDR3 directly in the compressed domain without decompressing,


Memory Processor UART which leads to improvement in efficiency and performance.
DDR3 Controller
The circuit-level process scheme used in this architecture is
dataflow independent, and thus, applying to both CNN and
Data Bus (AXI-Stream)
fully connected layers. In this architecture, a new dataflow is
proposed to facilitate the reuse of input activations across the
DMA fully connected layers, that leads to exploits parallelism and
maximizes the utilization of PEs. In this work, the Xilinx
Vivado HLS tool chain is used to convert C code to RTL
DLAU
implementation, and then Xilinx SDSoC is used to compile
TMMU PSAU AFAU the source code to generate the bit stream. The proposed
architecture is implemented on the Xilinx Virtex-7 FPGA
platform and achieves the performance of 1.34 GOP/s .

Control Bus (AXI-Lite) DNN Accelerator


Filters
Fig. 18. DLAU accelerator architecture, adopted from [210] Top Controller

Iact/Ofmap PE Array
Buffer 1
proposed BFP arithmetic is conducted on the Caffe [120] Off Chip PE0
scheme. The proposed accelerator is implemented on the DRAM
Xilinx VC709 evaluation board, running at a frequency of RLC PE1
Encoder
200 MHz, and achieves a throughput of 760.83 GOP/s.
Iact/Ofmap
PE63
Processor Buffer 2
North
System

CPU
Bridge
DRAM
Fig. 20. DNN accelerator architecture proposed in [218],
PCIe 3.0x8 adopted from [218].
PCIe Interface Ahmed et al. [81] proposed an FPGA-based Low Power
CNN (LP-CNN) accelerator based on GoogLeNet CNN.
DDR3 M0 DDR3 M1 The proposed accelerator uses quantization and weight
pruning techniques to reduce memory size. The LP-CNN
Memory Interface
accelerator is a time-sharing processor designed to process
the CNN model layer by layer, and it enables pipelining.
The proposed accelerator only uses the on-chip memory to
Conv output
Conv input

FC buffer#0 FC buffer#1 store the activations and weights instead of offline DRAM
buffer buffer
memory. Moreover, the proposed architecture replaces mul-
tiplication operations with shifting operations and uses no
FC PEA Conv PEA
DSP units. The LP-CNN accelerator is implemented in
CNN accelerator Verilog RTL, and the Vivado power analyzer has been
used to calculate the power. The experimental results show
Fig. 19. Block diagram of BFP arithmetic-based CNN that the LP-CNN accelerator provides 49.5 and 7.8 times
accelerator, adopted from [141] power improvement over the Intel Core-i7 and NVidia GTX
1080Ti, respectively. The proposed accelerator has been
Xiao et al. [218] presented the DNN accelerator architec- implemented on the Virtex-7 FPGA running at a frequency
ture specially designed for the sparse and compressed DNN of 200 MHz.
models. The proposed DNN accelerator mainly contains a The low-power, energy-efficient FPGA-based accelerator
PE array, RLC encoder, controller, and on-chip buffers, as is presented in [117] to accelerate the LeNet CNNs. The
shown in Fig. 20. In the proposed DNN accelerator, all the proposed accelerator uses 8-bit, 16-bit, and 32-bit fixed point
weights and non-linear activation functions are kept in Run- formats to represent the weights, activations, and biases,
length Coding (RLC) compressed form and are stored in respectively. The proposed accelerator supports pipelining
off-chip DRAM memory. The PE array contains 64 PEs and and implements LeNet with the minimal resources possible
performs the multiply-accumulate (MAC) operations of the without affecting the throughput. This work uses the Xilinx
fully connected layer. The proposed accelerator uses a novel Vitis HLS tool to convert the C++ code to RTL implementa-
circuit-level processing scheme to process the sparse data tion. The proposed accelerator is implemented on the Nexys
xiv VOLUME 4, 2016

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

DDR 4 FPGA evaluation board and achieves a throughput proposed accelerator is implemented on Xilinx Zynq 7020
of 14K images/sec while using just 628 mW of power. FPGA board.
An FPGA-based dynamically reconfigurable architecture
is presented in [118] to accelerate neural networks. Dynamic DDR
Partial Reconfiguration (DPR) is used in the proposed
accelerator to realize different types of neural network ar- Memory Interface
chitectures. Dynamic Partial Reconfiguration (DPR) allows
the proposed architecture to switch between networks and
Image Buffer
applications without sacrificing precision or throughput.
The proposed accelerator mainly contains a PE array and Image Data Transmitter 
configurable switches, as shown in Fig. 21. PE is a high-
level generic block that can implement the layers of neural PE PE PE PE

Configuration

Weight Buffer
network accelerator and has three predefined interfaces:

Transmitter
Registers

PE Array

Weight 

data interface, I/O interface, and memory interface. DPR PE PE PE PE


allows each PE to implement many functionalities with
the same hardware. Any PE can communicate with any PE PE PE PE
other PE, CPU, or I/O port of the FPGA through config-
urable switches. The hard/soft processor controls all PE
connections using the memory interface. This work uses
the Xilinx Vitis HLS tool to convert the C++ code to RTL
implementation. The proposed accelerator is implemented Pooling
on the Xilinx Zynq 7020 FPGA board.

Controller
BN

Activation
AXI STREAM (DATA INTERFACE) Function
AXI4 (MEMORY INTERFACE)
GPIO (I/O INTERFACE) Special Function Buffer
AXI STREAM SWITCH

PE-1 PE-2 PE-N


Data Transmitter

AXI
HARD/SOFT
LITE

CROSSBAR
PROCESSOR
GPIOs I/O MUX
Fig. 22. RCNN accelerator architecture, adopted from [93].
FPGA

Table 3 summarizes the reviewed FPGA-based accel-


Fig. 21. Accelerator architecture proposed in [118], adopted erators for a specific algorithm. The year the accelerator
from [118]. was introduced, the deep learning model used, the FPGA
platform used, the precision used for input feature maps
Gowda et al. [93] proposed an FPGA-based reconfig- and weights, the clock frequency, the number of resources
urable convolutional neural network (RCNN) accelerator. available in terms of DSPs, LUTs, BRAMs, and FFs, the
Unlike the existing structures, the RCNN accelerator con- percentage of resources utilized, the performance in GOPS,
tains configuration registers to reconfigure the architecture and finally, the power efficiency (GOPS/W) are all listed
according to the configuration instructions stored in the for each accelerator. Fig. 23 shows the power efficiency
Double Data Rate (DDR), as shown in Fig. 22. The image and throughput of various FPGA-based accelerators listed
and weight buffers are updated with input feature maps and in Table 3.
weights. The PE arrays perform the convolution operations,
whereas the special function buffer performs pooling, Batch C. ACCELERATOR FRAMEWORKS WITH HARDWARE
Normalization (BN), and activation functions. The proposed TEMPLATES
accelerator uses the SOW (sparse optimization of weight) Several frameworks for mapping AI models onto FPGAs
and CO (convolutional optimization) optimizations to reduce have been developed in recent years. Venieris et al. [208]
the sizes of weights and feature maps, respectively, which developed a framework called fpgaConvNet to map CNNs
also minimizes the number of hardware resources needed. on FPGAs. The fpgaConvNet framework employs the syn-
The proposed accelerator uses 16-bit, 8-bit, and 4-bit fixed chronous dataflow (SDF) paradigm to capture the CNN
point formats to represent the feature maps, convolution workloads. The processing flow of fpgaConvNet is shown
(CONV) layer weights, and fully connected (FC) layer in Fig. 24. Firstly, the Deep Learning expert uses a domain-
weights, respectively. This work uses the Xilinx Vivado HLS specific language to provide a high-level description of a
toolchain to convert C++ code to RTL implementation. The ConvNet architecture as well as information on the target
VOLUME 4, 2016 xv

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

TABLE 3: Summary of FPGA-based accelerators for specific algorithm


Frequency LUT Type Resources Resource Utilization Performance Power Efficiency
Accelerator Name Year DNN Type FPGA Platform Precision
(MHz) (# inputs) BRAMs LUTs FFs DSPs BRAMs LUTs FFs DSPs (GOPS) (GOPS/W)

VIP [65] 1996 CNN Altera EPF81500 16 fixed point 4 N/A 1500 1,500 N/A N/A N/A N/A

CNP [85] 2009 LeNet-5 Virtex4 SX35 200 16-bit fixed point 4 192 30720 30720 192 N/A 90% 90% 28% 5.25 0.35

Parallel coprocessor for CNN [183] 2009 CNN Virtex5 LX330T 115 16-bit fixed point 6 324 207360 207360 192 0.93% 17% 19.05% 55.73% 6.74 0.61

MAPLE [47] 2010 CNN Virtex5 SX240T 125 fixed point 6 516 149760 149760 1056 N/A 7 N/A

DC-CNN [51] 2010 CNN Virtex5 SX240T 120 48-bit fixed point 6 516 149760 149760 1056 N/A 16 1.14

NeuFlow [84] 2011 CNN Virtex6 VLX240T 200 16-bit fixed point 6 416 150720 301440 768 N/A 147 14.7

Memory- Centric Accelerator [170] 2013 CNN Virtex6 VLX240T 150 fixed point 6 416 150720 301440 768 45.50% 1.10% N/A 6% 17 N/A

NPU based Accelerator [172] 2014 DNN Xilinx Kintex 7 N/A float point 6 1590 254200 508400 1540 N/A N/A N/A

nn-X [92] 2014 CNN Zynq XC7Z045 142 16-bit fixed point 4 545 218600 437200 900 N/A 23.18 2.9

Roofline based Accelerator [225] 2015 AlexNet Virtex7 VX485T 100 32-bit float point 4 2060 303600 607200 2800 50% 61.30% 33.87% 80% 61.62 3.31

Embedded FPGA Accelerator [178] 2016 VGG-16 Zynq XC7Z045 150 16-bit fixed point 4 545 218600 437200 900 86.70% 83.50% 29.20% 89.20% 136.97 14.22

DNN Acceleration using 2016 DNN Zynq-7000 100 16-bit fixed point 6 280 53200 106400 220 N/A N/A N/A

Batch Processing [173]

DLAU [210] 2017 DNN Zynq XC7Z020 200 48-bit float point 6 280 53200 106400 220 12.50% 68.40% 26.60% 75.90% N/A N/A

DNN Acceleration using Batch 2018 DNN ZedBoard 100 16-bit fixed point 6 280 53,200 106,400 220 N/A 4.48 N/A

Processing and Pruning [174]

BFP arithmetic-based 2019 VGG-16 Xilinx VC709 200 8-bit BFP (feature maps) 6 1470 433200 866400 3600 62.1% 53.5% 16.3% 28.5% 760.83 82.88

Accelerator [141] 16-bit BFP (weights and activations)

Accelerator for Space DNN [218] 2021 AlexNet/ Xilinx Virtex7 200 16-bit fixed point 6 1470 433200 866400 3600 N/A 1.34 53.14

VGG-16

LP-CNN [81] 2021 GoogLeNet Virtex-7 VC709 200 12-bit fixed point 6 1470 433200 866400 3600 77% 94% 10% 0% 129.2 32.7

Energy-efficient CNN Accelerator [117] 2021 LeNet Xilinx Artix XC7A100T 125 8-bit fixed point (weights) 6 135 63400 126800 240 21.48% 25.16% 13.93% 50% N/A N/A

16-bit fixed point (activations)

32-bit fixed point (biases)

Dynamically Reconfigurable Architecture [118] 2022 CNN, SNN Xilinx Zynq 7020 200 N/A 6 280 53200 106400 220 N/A N/A N/A

RCNN Accelerator [93] 2022 AlexNet Xilinx Zynq 7020 200 16-bit fixed point (feature maps) 6 280 53200 106400 220 N/A N/A N/A

VGG-16 8-bit fixed point (CONV layer weights)

VGG-19 4-bit fixed point (FC layer weights)

FPGA-based platform as inputs. The ConvNet description the library with the required interconnections. DeepBurning
is passed through a DSL (Domain-Specific Language) pro- supports a wide range of NN models and simplifies the
cessor, which parses the input script and populates the Con- design flow of NN-based accelerators for machine learning
vNet’s semantic model as a Directed Acyclic Graph (DAG), applications.
and also extracts platform-specific resource constraints. The A framework referred to as DNNWeaver is presented
ConvNet DAG is converted into an SDF hardware intermedi- in [188] that generates bitstream and host code to implement
ate format, which corresponds to an utterly parallel hardware DNNs on various FPGA boards. DNNWeaver employs
implementation. After several transformations on ConvNet’s Caffe as its programming interface. DNNWeaver consists
SDF hardware model, the design space is searched, and of three software components: translator, design weaver, and
this procedure provides a set of hardware mappings of integrator. The translator transforms the Caffe specification
the ConvNet onto the specific FPGA-based platform. The of a DNN into a macro data flow graph. Design weaver
fpgaConvNet front-end parser can examine models written accepts macro data flow graph as an input and generates
in the Caffe and Torch machine-learning libraries. This a synthesizable Verilog implementation of the accelerator
framework accomplishes efficient design space explorations code. The integrator adds the memory interface code to the
through graph segmentation, reconfiguration, folding, and accelerator code. DNNWeaver generates accelerator code
weight reloading. This framework can be used to map small from a series of scalable and customizable hand-optimized
CNN models, for instance, LeNet-5 on FPGAs. template designs, resulting in high performance and effi-
Wang et al. [213] developed a design automation tool ciency.
referred to as DeepBurning that contains a library of build- Guan et al. [96] proposed a framework called Field Pro-
ing blocks that mimic the behavior of typical neural net- grammable DNN (FP-DNN) to accelerate DNNs efficiently
work components. The general design flow of DeepBurning on FPGAs. FP-DNN Framework is shown in Fig. 26. The
framework is shown in Fig. 25. The DeepBurning Neural model description is generated by TensorFlow and is fed
Network Generator (NN-Gen) takes a model descriptive into a Symbolic Compiler. The compiler generates a C++
script ( Caffe-compatible script) as input, which describes program and an FPGA programming bit stream for model
a high-level view of network topology and layer definition. inference, executed by the host and device, respectively.
The DeepBurning NN-Gen also takes user-specified con- Model mapper examines the model description, extracts
straints such as area and power as an input. DeepBurning the target model’s topological structure and operations, and
NN-Gen consists of a hardware generator and compiler that sends the hardware kernel schedule and configuration to the
generate the control flow and data layout based on the user’s code generators. Software generator generates the host code
specifications. The DeepBurning automation tool’s hardware in C++ using the kernel schedule. The host code is compiled
generator builds a neural network architecture for a given using a commercial C++ compiler to create host programs.
network structure by selecting and instantiating blocks from Hardware generator creates device codes by instantiating
xvi VOLUME 4, 2016

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

80
700
Performance per watt (GOPS/W)

70
600

Performance (GOPS)
60
500
50
40 400

30 300

20 200
10 100
0 0
DN 41]
rith d F Acc nn- 4]
Ac tic-b Acc ator 2]
-C 3]

ato d Ac rato 5]
r S era 78]

-C 8]
N ]

uF 1]

-C 41]
-C 47]

ato 4]

-ba Ac ato 2]
P 3]

me PG cce nn-X 0]

cc tor ]
o ]
N ]

Ac Fl 1]

]
r C [85

[81

d A era 25

rat 78
CN [85

[81
[8
me PG ele X [9
DC N [18

e [22

LP [21
Ne N [5

ler [8

tic A ler [9
MA [18

7
ic eu [5
ce [1
r fo el r [1

LP r [1
DC LE [

r [1

se cel r [2
ele [1
r fo CNP

low

NN

for NP

ce ow
ntr N NN

NN
pa tor

N
N

C
r
ce ase el
c

or
sso

rith d F A
ss
A
P a de ed

d
ce

ce

P a dde se
BF bed bas
pro

pro

BF be ba
Ce
co

Em fline

co

Em line
ler

ry-
lel

lel

f
mo
o

o
ral

ral
Ro

Ro
Me
Pa

Accelerator Pa Accelerator
, ,
(a) Power efficiency (b) Throughput

Fig. 23. Power efficiency and throughput of FPGA-based accelerators listed in Table 3

Library of
Target Platform Configurable
ConvNet Description DeepBurning
Specifications Network Hardware/Software Co-Generation 

Components

Model
Connected Components with
Descriptive 

DSL Processor Script

detailed Parameters

 (RTL of NN)
FPGA
Hardware Burning

Generator

Constraint
Address Flow Generation

(Area and Power)


(RTL of AGU)

ConvNet DAG Target Platform Model


Input
Runtime
Dynamic Control

Data Layout in
Compiler
Control Flow of
Memory

ConvNet Hardware Target Platform Binaries

SDF Model Constraints

DeepBurning NN-Gen

Search Design
Space Fig. 25. Design flow of DeepBurning framework, adopted
from [213].
Supplied by Deep ConvNet Hardware
Learning Expert  Mapping
FPGA. FINN generates a synthesizable C++ network de-
scription of a flexible heterogeneous streaming architecture.
Fig. 24. Processing flow of fpgaConvNet, adopted
The architecture mainly contains pipelined compute engines
from [208].
that communicate via on-chip data streams. Each BNN layer
has been implemented using dedicated compute engines
with 1-bit values for FMs and weights. To evaluate FINN,
RTL-HLS hybrid templates based on kernel configuration. the authors implemented CNV, a convolutional network
The hardware code is compiled using commercial synthesis topology inspired by BinaryNet [68] and VGG-16 [193], on
tools to generate the programming files for the hardware a Xilinx Zynq-7000 FPGA board running at 200 MHz to
implementation. With a high-performance compute engine accelerate BNN inference.
and well-designed communication optimized algorithms, Guo et al. [97] proposed a flexible and programmable
FP-DNN performs model inference for DNNs. CNN accelerator, referred to as Angle-Eye, together with
Umuroglu et al. [206] proposed FINN, a framework that the compilation tool and the data quantization scheme.
maps trained Binarized Neural Networks (BNNs) onto an The data quantization scheme can be used to reduce the
VOLUME 4, 2016 xvii

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

Model Description (TensorFlow) Ins


Controller
Symbolic Compiler

Model Mapper PE 0
Kernel Kernel
Schedule Configuration Out
In PE 1
SW HW Buf
External Buf External
Generator Generator
Memory Memory
C++ Codes HW Codes
PE npe
C++ Synthesis
Compiler Tools SAVE
LOAD CALC

C++ Program Programming File


Fig. 27. Angel-Eye accelerator architecture, adopted
from [97]
Device
PCI-e Bus
Host FPGA DRAM
× and 1.5 ×, respectively.
Fig. 26. FP-DNN framework, adopted from [96]. Ghaffari et al. [91] developed a general framework called
CNN2Gate, which allows mapping CNN models on FPGAs
with automated design space exploration. The CNN2Gate
bit-width down to 8-bit with insignificant accuracy loss. overall architecture consists of an Open Neural Network
The compilation tool is responsible for mapping a given eXchange (ONNX) format parser, a design-space explo-
CNN model efficiently onto the hardware architecture. The ration module, and leverages automated high-level synthesis
proposed accelerator supports the acceleration of various is shown in Fig. 28. CNN2gate can parse CNN models using
CNNs on different FPGA platforms. The overall architec- ONNX parser from several popular high-level machine
ture of the Angel-Eye is shown in Fig. 27. Angle-Eye learning libraries, such as Caffe2, Keras, TensorFlow, etc.
accelerator mainly consists of PE array, controller, on-chip The computation flow of network layers and their weights
buffer, and external memory. The PE array is used to and biases are retrieved in CNN2Gate, and a fixed-point
perform the convolution operations, and it supports three quantization is used. To undertake design space exploration
levels of parallelism: input channel parallelism, kernel-level for deeply pipeline OpenCL kernels of CNN, the authors
parallelism, and output channel parallelism. The on-chip used time-limited reinforcement learning.
buffer can isolate the PE array from external memory, Xilinx Vitis AI [32] is a framework for implementing
allowing simultaneous convolution and data I/O operations. deep learning inference on Xilinx FPGAs and SoCs. It uses
All network parameters and the results of each layer can be an Intellectual Property (IP) core called the Deep Learn-
saved to external memory. The controller is responsible for ing Processor Unit (DPU) to implement ample essential
receiving, decoding, and issuing instructions to the other functions of deep learning on FPGAs, see Fig. 29. Xilinx
three components and monitoring each component’s work Inc. released the DPU, a programmable engine designed
status. Angle-Eye accelerator is implemented on the Zynq for DNNs. Xilinx Vitis AI framework enables the com-
XC7Z045 platform. pression of DNN models without sacrificing accuracy and
Zhang et al. [226] proposed a software/hardware co- compiling DNN models into DPU instruction code before
design library called Caffeine to accelerate CNNs efficiently deploying them to the target DPU platform. For efficient
on FPGAs. The authors developed a uniformed convolu- DNNs implementations, the Xilinx DPU offers a tailored
tional matrix-multiplication representation for both convolu- and scalable overlay with ISA architecture. Xilinx Vitis
tional and fully connected layers. Caffeine synthesizes Caffe AI framework supports various frameworks such as Caffe,
models comprising convolutional layers and fully connected PyTorch, TensorFlow, etc., and efficiently implements deep
layers for FPGAs. The Caffeine framework effectively han- learning tasks, CNN, and RNN, see Fig. 29. The internal
dles weights and biases reconfiguration in off-chip DRAM architecture of the DPU contains an Instruction Unit (IU),
to maximize the underlying memory bandwidth utilization. a Compute Array (CA), and a Global Memory Pool (GMP).
Authors integrated the Caffeine with a deep learning frame- The IU fetches the DPU instructions associated with the
work Caffe and implemented AlexNet and VGG networks model, decodes it, and drives the PEs present in compute
on multiple FPGA platforms, viz., Xilinx KU060 FPGA array. It also manages the data/instructions transfer among
board, and Virtex7 690t FPGA board. Caffeine achieved the PEs and the memory. The GMP acts as buffer for the
better energy efficiency than 12-core CPU and GPU by 43.5 input and output data as well as intermediate output from
xviii VOLUME 4, 2016

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

Tensor ciency than GPU and FPGA at the cost of reconfigurability.


Pytorch Keras Caffe MATLAB
Flow
Many researchers are focused on building custom ASICs for
accelerating CNNs inference workloads to achieve the best
performance and energy efficiency. In this section, we would
ONNX Model like to review the recent ASIC-based DNN accelerators.
CNN2Gate There are three broad types of ASIC-based DNN accel-
erators depending on how the architecture has been opti-
mized/designed: ALU (Arithmetic Logical Unit), Dataflow,
Front-end Parser and Sparsity-based accelerators. The main building block,
the MAC unit (or an array of MAC units), in ALU-
Automated high-level synthesis

based accelerators is modified to have ample computational


Weights and biases Computation flow resources and flexibility to obtain the best performance with
varying bit accuracy. In dataflow-based accelerators, the
Post-training activations, weights, and partial sums are managed to reduce
Kernel usage guide-line  the energy needed to move data within the chip and achieve
quantization
high arithmetic intensity. In Sparsity-based accelerators, the
unstructured sparse data is handled in such a way that the
Design-Space Exploration matrix multiplication units (2-D array of MAC units) can
prevent zero multiplications. Following sections provide a
comprehensive overview of ALU, Dataflow, and Sparsity-
RTL Generation based accelerators.

A. ALU BASED ACCELERATORS


Fig. 28. CNN2Gate, adopted from [91]. NeuFlow is the ASIC based CNN accelerator presented
in [171] to accelerate the NNs and other ML algorithms.
The architecture of the proposed accelerator is the same as
the DPU, resulting in high throughput [114]. The DPU can the accelerator discussed in [84] and shown in Fig. 14, but is
be configured to meet the requirements of a specific CNN implemented using IBM 45 nm Silicon-On-Insulator (SOI)
architecture, and the Vitis AI stack contains all the necessary process. The NeuFlow accelerator uses a compiler named
libraries to generate the instructions for the DPU. The de- luaFlow to process CNNs. The luaFlow compiler converts
velopment flow is described in Fig. 29, where trained model high-level data flow graph representations of deep learning
is compiled using the Vitis AI compiler. The Vitis AI tools algorithms in the Torch5 environment into machine code for
provide a model quantizer to reduce the precision of weights Neuflow. The proposed architecture provides higher power
without losing the accuracy. An Xmodel file is generated by efficiency and is suitable for vision-based applications, such
the Vitis Compiler consisting of domain-specific instructions as autonomous vehicle navigation, driving assistance, etc.
for the DPU unit, which are used to configure the DPU. The proposed architecture achieves the maximum through-
During inference, a Python script running on the PS acts put of 320 GOPS with a power consumption of 0.6 W; in
as the interface, and it is responsible for transferring the contrast, the NeuFlow architecture implemented on Xilinx
data from the on-chip memory to the DPU memory buffers. Virtex6 FPGA presented in [84] has a maximum throughput
Examples of CNNs that have been implemented using DPU of 16 GOPS with power consumption of 10 W.
include, but are not limited to, VGG, ResNet, GoogLeNet, Chen et al. [53] proposed the ASIC-based hardware
YOLO, SSD, MobileNet, and FPN. Table 4 summarizes accelerator, also called DianNao, to accelerate the large-
the reviewed FPGA-based accelerator frameworks for the scale CNNs and DNNs. The proposed architecture provides
implementation of DNNs. the quick and energy-efficient execution of the inference
of large-scale CNNs and DNNs. The architecture contains
IV. ASIC BASED ACCELERATORS the Neural Functional Unit (NFU), buffers, and control
Application Specific Integrated Circuit (ASIC) is a powerful processor (CP), see Fig. 30. The NFU module is used to
platform to accelerate the DNNs. ASICs are customized perform the computations needed to determine the output
chips designed for a specific application. They are smaller of the neuron in the fashion, i. e., in the first stage, NFU
in size, consume less power, and provide higher speeds, performs the multiplication of input neuron values with
making them suitable solutions for DNN acceleration [76]. weight coefficients. In the second stage, NFU accumulates
ASIC based hardware accelerators have limited computing these products using the adder trees. In the third stage, NFU
resources, memory resources, and I/O bandwidths compared calculates the activation functions. Buffers are used to store
with GPU based accelerators, but they can achieve moderate the input/output neuron values and weights. The proposed
performance and consume less power [166]. Furthermore, architecture contains three buffers viz., an input buffer to
ASIC exhibits the best computation speed and energy effi- store the input neuron values (NBin), an output buffer to
VOLUME 4, 2016 xix

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

User Application
Model Training
DPU
Frameworks update weights
Instruction Unit Tensorflow Caffe PyTorch
Compute
ARM Processor Array loss, accuracy
Fetch
Decode Saturated?
PE Dataset
AXI Dispatch
Vitis AI Vitis AI Quantizer and Compiler
BUS
Development Kit
PE Xilinx Runtime Library
Overlay Vitis AI
Off Chip Memory Global Memory
Memory Controller Pool PE Quantisation
Inputs Python
DPU
Overlay Deep Learning Processing Unit APIs
Outputs Xmodel Compilation

Processing System Programmable Logic

a) DPU architecture top-level overview b) Vitis AI stack c) Development flow

Fig. 29. DPU architecture overview, adopted from [230], Vitis AI stack, and development flow
.
TABLE 4: Summary of FPGA-based accelerator frameworks
Framework Name Year DNN Type Interface
Xilinx Vitis AI [32] 2022 CNN, RNN Caffe, PyTorch, TensorFlow,
CNN2Gate [91] 2020 Alexnet, VGG-16 Caffe2, Keras, TensorFlow
Caffeine [226] 2019 Alexnet, VGG-16 Caffe
Angle-Eye [97] 2018 VGG-16 Caffe
FP-DNN [96] 2017 VGG-19, Res-152 TensorFlow
FINN [206] 2017 CNV Caffe
DNNWeaver [188] 2016 LeNet, Siamese Caffe
DeepBurning [213] 2016 Alexnet, NiN Caffe
fpgaConvNet [208] 2016 CNN Caffe, Torch

store the output neuron values (NBout), and a third buffer uses eDRAM to store all the data related to a CNN, i. e.,
to store the weights (SB). Different computational operators input feature maps, weight kernels, output kernels, etc.
are invoked in each stage depending on the type of the The DaDianNao accelerator gives better performance while
layer (convolution, activation function, pooling, etc.) For accelerating the CNNs, but provides moderate to low per-
architecture exploration, author developed a C++ simulator formance while accelerating large-scale CNNs.
that evaluates execution time and serves as a specification
for the Verilog implementation. The Verilog version of the
accelerator is synthesized using Synopsys’ design compiler, Control Processor (CP)

and the generated design is placed and routed by Synopsys’


Instructions
DMA

ICC compiler. The design is simulated using Synopsys’ Inst.

VCS, while PrimeTime PX is used to determine the power.


The proposed architecture was implemented using 65 nm NFU-1 NFU-2 NFU-3
Tn

CMOS technology. The experimental results show that Di- NBin


anNao achieves an average performance of 452 GOPS with
DMA

Inst.
485 mW of power consumption. The proposed accelerator
DMA

Inst.
Memory Interface

has scalability issues due to the bandwidth constraints of the


Tn

memory system. The DaDianNao accelerator [54] and [147] NBout


are extensions of the DianNao accelerator [53]. DaDianNao
Tnx Tn

has enough on-chip memory to hold all of CNN’s weights.


DaDianNao also uses 16-bit fixed-point representation in the
inference process like DianNao. However, it is implemented SB
using 28 nm CMOS technology. The design compiler syn-
thesizes the Verilog version of the DaDianNao accelerator,
and the ICC compiler is used to generate the layout. The Fig. 30. DianNao accelerator architecture, adopted from [53]
energy, area, and critical path are obtained after layout.
The design is simulated using VCS, while PrimeTime PX Liu et al. [142] proposed the machine learning acceler-
is used to determine the power. The proposed architecture ator referred to as PuDianNao, that supports multiple ma-
chine learning scenarios (e.g., regression, classification, and
xx VOLUME 4, 2016

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

clustering) as well as many machine learning techniques, the acceleration of large-scale CNNs. The ShiDianNao
including k-means, k-nearest neighbors, linear regression, accelerator is implemented using 65 nm CMOS technology.
classification tree, naive bayes, support vector machine, and DianNao [53], DaDianNao [54], [147], PuDianNao [142],
DNNs. The PuDianNao mainly contains various Functional and ShiDianNao [78] are not built utilizing reconfigurable
Units (FUs) and three types of data buffers: ColdBuf, hardware, hence they cannot be adapted to changing appli-
HotBuf, and OutputBuf, an instruction buffer (InstBuf), cation demands such as NN sizes.
and a DMA, and a control module, see Fig. 31. The FU Lu et al. [146] proposed a flexible dataflow architecture
contains a Machine Learning Functional Unit (MLU) and called FlexFlow to accelerate the CNNs, exploiting all kinds
an Arithmetic Logic Unit (ALU). The MLU can be used of parallelisms viz., inter-kernel, intra-kernel, and inter-
to perform several computational primitives, including dot output on a two-dimensional array of PEs. FlexFlow has
product, counting, sorting, distance calculations, non-linear the additional interconnections between on-chip memories
functions, for instance, sigmoid and so on. The ALU has and PEs, which provides the flexibility to fetch any neuron
an adder, divider, and multiplier and converters for the 16- from any feature map. The proposed accelerator minimizes
bit float to 32-bit float and 32-bit float to 16-bit float. It the interconnections between the PEs at the cost of energy
may also be used to compute estimates using the Taylor because of data movement from on-chip memory to PEs.
expansion of log (1-x). HotBuf (8 KB) and ColdBuf (16 KB) In FlexFlow, all the PEs are operated in parallel, therefore,
store the input data with short and longer reuse distances, helping in improving the overall throughput. The proposed
respectively. OutputBuf (8 KB) is used to store the output architecture has high scalability and supports different sizes
data or intermediate results. The authors implemented an in- of CNNs with stable resource utilization. FlexFlow only
house C simulator of PuDianNao; it acts as a specification implements CNNs and is confined to within a layer rather
for the Verilog implementation and also measures the per- than across layers. The design is simulated, synthesized,
formance of PuDianNao on large-scale datasets. The design placed & routed using Synopsys’ tools. The FlexFlow
compiler synthesizes the design, and the ICC compiler is accelerator is implemented using TSMC 65 nm technology.
used to generate the layout. The energy, area, and critical Hardik et al. [189] developed a bit-level dynamically
path are obtained after layout. The design is simulated using composable architecture called Bit Fusion for accelerating
Synopsys VCS, and PrimeTime PX is used to determine DNNs. Bit fusion mainly consists of an array of bit-level
the power using the Value Change Dump (VCD) file. The computation elements, called BitBricks, that dynamically
proposed architecture has been implemented using TSMC fuse to match the bit width of individual DNN layers and
65 nm CMOS technology. execute DNN operations with the required bit width, without
any loss of accuracy. Furthermore, Bit Fusion supports
the multiplication of 2, 4, 8, and 16 bits spatially. Bit
Fusion decomposes a 16-bit multiplication into multiple 2-
InstBuf HotBuf ColdBuf bit multiplications to achieve the flexibility to efficiently
map various layers of CNN with different bit widths and
minimize the computation and the communication with no
MLU MLU FUs MLU MLU
loss of accuracy. Bit Fusion architecture comes with an
Control Module ALU ALU ALU ALU
Instruction Set Architecture (ISA) that minimizes the data
transfer and maximizes the parallelism in computations.
The proposed design is implemented in Verilog and is syn-
DMA OutputBuf thesized using Design Compiler, which estimates the area,
frequency, and power. The proposed accelerator architecture
is implemented on 45 nm CMOS technology. Bit Fusion
accelerator achieves 5.1× energy saving and 3.9× speedup
Fig. 31. PuDianNao accelerator architecture, adopted over Eyeriss accelerator.
from [142] Shin et al. [191] proposed Deep Neural Processing Unit
(DNPU) architecture to process CNNs and Recurrent Neural
Du et al. [78] proposed a CNN accelerator referred Networks (RNNs). DNPU is a SIMD MAC-based CN-
to as ShiDianNao to improve the energy efficiency and N/RNN accelerator that uses dynamic precision control to
scalability of DianNao [53] design discussed above. The minimize the kernel data size. DNPU consists of a convo-
ShiDianNao accelerator does not access the main memory lutional layer processor (CP), a fully connected and RNN-
while executing a CNN and achieves more energy efficiency LSTM layer processor (FRP), and a RISC controller. CP
compared to DianNao. The design is implemented in Verilog performs convolutional operations, and FRP performs ma-
and synthesized by the design compiler, and IC compiler trix multiplication operations. DNPU is the first CNN/RNN
is used to place and route the synthesized design. The accelerator with the highest energy efficiency of 8.1 TOP-
energy cost of DRAM accesses is calculated using CACTI S/W on 65 nm CMOS technology. DNPU has some limita-
6.0 [160]. The ShiDianNao accelerator will not support tions; for instance, its area limits the number of processing
VOLUME 4, 2016 xxi

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

elements (PEs) for convolutional layers (CL) and recurrent CNN data flows under the same hardware constraints. The
layers (CL). As a result, performance was suboptimal in proposed accelerator is implemented using 65 nm CMOS
cases that just required CLs or RLs. Furthermore, DNPU technology.
only supports a limited number of weight bit precisions,
such as 4 bits, 8 bits, or 16 bits. Lee et al. [134] proposed Link Clock Core Clock Configuration Bits
Top-Level Control Config Scan Chain 12 x 14 Accelerator
the Unified Neural Processing Unit (UNPU) architecture PE Array
Filter
to process CNNs and RNNs. UNPU contains a bit-serial Filter

MAC unit to perform the required computations. UNPU Ifmap


Ifmap RLC Global
supports CLs, RLs, and fully connected layers (FCLs) Off-Chip Decoder Buffer
DRAM 64
with fully-variable weight bit-precision from 1 to 16 bits. bits
Ofmap 108KB
Psum

UNPU achieves an energy efficiency of 3.08, 11.6, and RLC


Encoder ReLU Psum

50.6 TOPS/W for the case of 16-bit, 4-bit, and 1-bit weights,
respectively. UNPU achieves 1.43× higher energy efficiency
than the DNPU for convolutional layers with 4-bit weights. Processing
Spad
Element MAC
Control
B. DATAFLOW BASED ACCELERATOR
The accelerators based on dataflow put a special emphasis Fig. 32. Eyeriss DNN accelerator, adopted from [56]
on data management to minimize off-chip memory read-
s/writes. When it is feasible, reusing parameters between Chen et al. [57] proposed a DNN accelerator architecture
layers can enhance dataflow. For instance, in a convolutional referred to as Eyeriss v2 to accelerate compact and sparse
layer, both activations and weights can be reused. In a fully DNNs. Like Eyeriss [56], Eyeriss v2 is composed of an
connected layer, each neuron has a unique set of weights; array of PEs to perform MAC operations, global buffers,
as a result, weights cannot be reused, but input data may. and local scratchpad (SPad) memory to support data reuse.
In order to minimize data movement between a computing In the Eyeriss v2 accelerator, PEs and global buffers (GLB)
unit and higher-level memory, the reusable parameters are are grouped into clusters to support a flexible Network On
kept in local registers. Chip (NoC), as shown in Fig. 33. The main difference
Cavigelli et al. [50] proposed the Origami CNN acceler- between Eyeriss and Eyeriss v2 is that Eyeriss v2 uses a
ator, which is scalable to different network sizes. The pro- hierarchical mesh NoC (HM-NoC) to connect the global
posed architecture uses the Weight Stationary (WS) dataflow buffers to the PEs; in contrast, the Eyeriss uses multicast
to improve the energy efficiency during the acceleration NoC between the global buffer and PEs. Furthermore, the
process. WS dataflow minimizes the energy consumption by Eyeriss v2 accelerator uses separate NoCs to transfer the
maximizing the access of weight coefficients. WS dataflow input activations, weights, and partial sums between the
used in the Origami maximizes the convolution and filter global buffer and PEs. The hierarchical mesh NoC used
reuse of weights. The proposed accelerator was imple- in the Eyeriss v2 accelerator supports unicast, multicast,
mented using UMC 65 nm CMOS technology and having a and broadcast. The HM-NoC can be configured into various
core area of 3.09 mm2 . The proposed CNN accelerator can modes ranging from high data reuse to high bandwidth.
achieve the throughput of 274 GOPS and power efficiency The proposed architecture supports various CNN layer di-
of 369 GOPS/W with an external memory bandwidth of mensions and sizes because of flexible hierarchical mesh
525 MB/S full-duplex. The proposed architecture is only NoC. The authors proposed an analysis framework named
used to perform the convolution operation and is unsuitable EYEXAM for evaluating the performance of various CNN
for implementing the fully connected layer operations. dataflows. The Eyeriss v2 accelerator has higher hardware
Eyeriss [56] is an ASIC based CNN accelerator that uses a utilization than Eyeriss but has large area overhead. The
row-stationary (RS) dataflow that minimizes data movement experimental results show that Eyeriss v2 reaches 11.3× and
energy consumption on a spatial computing architecture. RS 42.5× improvement in energy efficiency and throughput,
dataflow is adaptable to various CNN shapes and minimizes respectively, with the sparse AlexNet, compared to Eyeriss
the energy consumption by reusing the filter coefficients running with the AlexNet. It also achieves 2.5× and 12.6×
and input feature maps. The proposed accelerator mainly improvement in energy efficiency and throughput, respec-
contains a 12 × 14 PE array, feature map compression units, tively, with sparse MobileNet compared to Eyeriss running
and a 108 KB global buffer; ReLU as shown in Fig. 32. with MobileNet.
The global buffer enables the reuse of loaded data from Multiply-Accumulate Engine with Reconfigurable Inter-
off-chip DRAM and the generated results by PEs and is connect (MAERI) is a DNN accelerator containing a set
also responsible for returning the final results to the off-chip of configurable building blocks to support various CNN
DRAM. In the Eyeriss accelerator, the PEs are connected partitions and mapping by configuring the tiny switches
via a Network on Chip (NoC). The NoC used in Eyeriss presented in [132]. MAERI contains a set of multiply adder
only supports multi-cast. The authors proposed an analysis computation units, each augmented with tiny configurable
framework for calculating the energy efficiency of various switches that can be configured to support various kinds of
xxii VOLUME 4, 2016

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

Top-Level Control & Configuration Psum SRAM Bank and the input activation function are stored in the unified
Iact SRAM Bank
GLB Cluster Router
Cluster
Router
Cluster
GLB Cluster
Iact SRAM Bank
Psum SRAM Bank local buffer. In order to perform convolution operation on
PE Cluster PE Cluster
Psum SRAM Bank
GLB Cluster Router Router GLB Cluster Iact SRAM Bank
Psum SRAM Bank
a matrix multiply unit, a systolic data setup block is used
Cluster Cluster
PE Cluster PE Cluster
iacts weights psums in order to rearrange the data. Efficient running of machine
GLB Cluster Router Router GLB Cluster
PE Cluster Cluster Cluster PE Cluster iacts Router
Psum Psum Router learning model tasks and inference tasks like search and
weights psums
External Memory

GLB Cluster Router Router GLB Cluster Psum Router Psum Router
image recognition, language translation have been the focus

External Memory
PE Cluster Cluster Cluster PE Cluster Weight Router Weight Router
GLB Cluster Router Router GLB Cluster Weight Router Weight Router
in the first version of TPU, called TPU1. Since 2015, TPU1
Cluster Cluster
PE Cluster PE Cluster
Weight Router Weight Router has been operational in Google’s data center. A second
GLB Cluster Router Router GLB Cluster
iacts weights psums
PE Cluster Cluster Cluster PE Cluster version TPU2, also called Cloud TPU is operational in data
PE PE PE PE
GLB Cluster Router
Cluster
Router
Cluster
GLB Cluster centers for the purpose of training and interference. Cloud
PE Cluster PE Cluster PE PE PE PE
GLB Cluster Router Router GLB Cluster
TPU supports several frameworks, including TensorFlow,
PE PE PE PE
Cluster Cluster
PE Cluster PE Cluster
PyTorch, and JAX/FLAX.
Fig. 33. Eyeriss v2 top-level architecture, adopted from [57]
DDR3 DRAM Chips

30 GiB/s
dataflows, see Fig. 34. The prefetch buffer stores the input 14 GiB/s 30 GiB/s
DDR3-2133 Weight FIFO
activations, intermediate partial sums, weights, and output Interfaces (Weight Fetcher)

30 GiB/s
activations. Acceleration units mainly contain look-up ta-
Control Control
bles (LUT) and perform activation functions. MAERI uses

PCIe Gen3 x16 Interface


two configurable interconnect networks, namely, distributed
10 GiB/s Unified Buffer
167 GiB/s
network and augmented reduction network. To assist the (Local
Systolic 
Matrix Multiply Unit

Host Interface
14 GiB/s 14 GiB/s
Activation
Data 
(64K per cycle)

effective mapping of the irregular dataflows and to provide Storage) Setup

high resource utilization, MAERI offers non-blocking com-


Control
munication via reconfigurable links with large bandwidth. Accumalators
The proposed accelerator can accelerate various operations

Instr
viz., convolution, pooling, fully connected layer, and LSTM. Off-Chip I/O
Activation

167 GiB/s
The proposed accelerator also supports sparsity and cross- Data Buffer
Normalize / Pool
layer mapping. MAERI is implemented in Bluespec System Computation

Control Control Control


Verilog (BSV) [164] and is synthesized with TSMC 28 nm
standard cell and SRAM library at 200 MHz.
Fig. 35. Block Diagram of TPU, adopted from [121]
Activation Control Layer Topology

Activation Units

Accelerator Controller C. SPARSITY BASED ACCELERATORS


The fraction of zeros in a CNN layer’s weights and input
+ activation matrices is called sparsity. Since multiplying by
+ + zero should produce a zero, there should be no effort
Prefetch Buffer
Distribution / Reduction Network

+ + + + required. As a result, typical layers can cut work by a


+ + + + + + + +
control

From / To

× × × × × × × × × × × × × × × ×
DRAM
factor of four, and in some instances, by a factor of ten.
Also, the addition is not needed because the zero products
won’t add anything to the total of which they are a part.
Moreover, data with many zeros can be compressed—
Distribution Tree
Weights / Input Activations
these traits, when combined, open up a lot of possibilities
MAERI
for improvement. This section provides a comprehensive
+ Adder Switch
× Multiplier Switch 1:2 Switch Local Buffer Local Forwarding Data Link
overview of accelerators that explore sparsity.
Fig. 34. MAERI architecture, adopted from [57] A CNN accelerator referred to as Sparse CNN (SCNN) is
presented in [168] for inference of CNNs. SCNN employs
Tensor Processing Unit (TPU) is developed by Google in a novel dataflow referred to as sparse Planar-Tiled Input-
order to implement machine learning algorithms. A matrix Stationary Cartesian Product (PT-IS-CP-sparse) dataflow
multiplication unit as a systolic array of 256x256 units is that maximizes the reuse of activations and weights and
used in the TPU architecture [121]. Fig. 35 shows the block removes needless data transfers and reduces storage and
diagram of the TPU. The mentioned systolic array structure power requirements. The dataflow used in SCNN eliminates
is basically built with weight-stationary dataflow and as a all multiplications with a zero and keeps both activations
2-D SIMD architecture. Extracting from the DRAM, the and weights in compressed form. SCNN mainly contains an
weights can then be stored in the weight FIFO (First-In, array of processing elements arranged in a 2-D fashion with
First-Out) register. The results from the previous layers systolic connections to transfer partial sums. The proposed
VOLUME 4, 2016 xxiii

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

dataflow efficiently delivers activations and weights to the uses Gustavson’s algorithm [100] to compute the spMspM
multiplier array to perform the required MAC operations. operations. GAMMA accelerator mainly consists of an array
SCNN exploits all the three kinds of parallelisms viz., of processing elements(PEs), on-chip storage referred to
inter-kernel, intra-kernel, and inter-output. SCNN requires as FiberCache, and a scheduler, as shown in Fig. 36. The
additional optimization circuitry to implement the fully con- PEs are used to perform the required spMspM operations
nected layer operations. SCNN improves the performance that combine sparse input rows to produce each output
by skipping the zeros in the input feature maps and weights. row. FiberCache is a specialized memory structure that
SCNN is implemented in system C and Catapult High-Level stores the non-zero elements and their coordinates. The
Synthesis (HLS) [30] tool is used to generate the Verilog scheduler distributes computational workloads among PEs
RTL. Synopsys Design Compiler synthesizes the Verilog to maximize resource efficiency while reducing unnecessary
version of the design. SCNN is implemented using TSMC access to shared memory. GAMMA is implemented using
16 nm FinFET technology. 45nm CMOS technology.
Eyeriss [56] also looked into input sparsity as a way
to save energy. The gating mechanism deactivates MAC
Memory
units that correspond to zero inputs. Gating saves energy
while not increasing throughput. With sparse models, The
processing speed and energy efficiency of Eyeriss V2 [57]
FiberCache
have improved due to its ability to process sparse data
directly in compressed format for both the weights and
activations.
PE PE PE PE
Zhang et al. [228] developed Sparse Neural Acceleration
Processor (SNAP) to exploit unstructured sparsity in DNNs.
To ensure that data is distributed evenly throughout the
Scheduler
MAC units, SNAP employs parallel associative search.
SNAP is fabricated using 16 nm CMOS technology and
achieves a peak energy efficiency of 21.55 TOPS/W (FP16) Fig. 36. Block Diagram of GAMMA, adopted from [227]
for CONV layers with 10% weight and activation density.
Lee et al. [136] proposed an energy-efficient on-chip We summarized the reviewed ASIC-based accelerators
accelerator called LNPU for sparse DNN model learning. for DNN in Table 5. For each accelerator, we list the year
In the LNPU accelerator, Sparsity is exploited with intra- the accelerator was introduced, the process technology, the
channel as well as inter-channel accumulation. The input clock frequency, the dataflow, the architecture type, the
load buffer module of the LNPU evenly distributes workload power dissipation, the area, the performance in GOPS, and
among the PEs while considering irregular sparsity. LNPU finally, the power efficiency. Fig. 37 shows the plots of
uses the fine-grained mixed precision (FGMP) of FP8-FP16 various metrics, such as power, throughput, area, and power
that optimizes data precision while maintaining training ac- efficiency of ASIC-based accelerators.
curacy. LNPU maintains an average hardware utilization of
100%. LNPU is fabricated using 65 nm CMOS technology V. GPU BASED ACCELERATORS
and has an energy efficiency of 3.48 TFLOPS/W (FP8) at Over the last few decades, Graphics Processing Units
0% sparsity and 25.3 TFLOPS/W (FP8) at 90% sparsity. (GPUs) are widely used in training DL algorithms or CNNs
SIGMA is a scalable and flexible accelerator proposed for face recognition [110], object detection [222], [229],
in [177] to implement the large, irregular, and sparse general data mining [89], and other AI applications. GPU supports
matrix-matrix multiplications (GEMMs). The basic building parallelism due to lots of parallel cores in the architecture
block in SIGMA is Flexible Dot Product Engine (Flex- and offers significant computation speed. GPU exploits large
DPE). All the Flex-DPE modules can be interconnected degrees of data-level parallelism in the applications through
via simple NoC. In SIGMA, all the Flex-DPE multipli- the Single Instruction Multiple Thread (SIMT) execution
ers are arranged in a 1-D fashion, and it performs the models. The high computational capacity of the GPUs
multiple variable-sized dot-products in parallel. SIGMA makes them the primary choice for DNN acceleration. In
uses scalable inter-connects to efficiently map the GEMMs this section, we would like to review some of the recent
of different dimensions and sparsity levels to the PEs. GPU-based DNN accelerators.
SIGMA outperforms systolic array architectures by 5.7× The study of implementing a standard backpropagation
for irregular sparse matrices. SIGMA is implemented using algorithm for training multiple perceptrons simultaneously
the 28 nm CMOS technology and achieves a throughput of on GPU using NVIDIA CUDA technology is presented
10.8 TFLOPS with a power dissipation of 22.33 W. in [101]. For a given program, GPU-based implementation
Zhang et al. [227] proposed an accelerator called on NVIDIA GTX 260 GPU achieves 50× to 150× speedup
GAMMA to perform the Sparse matrix-sparse matrix mul- compared to the CPU-based implementation. A neurally
tiplication (spMspM) operations. The proposed accelerator accelerated architecture for GPU, called NGPU (neurally
xxiv VOLUME 4, 2016

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

TABLE 5: Summary of ASIC-based accelerators


Frequency Power Dissipation Area Performance Power Efficiency
Accelerator Name Year Process Technology Dataflow Architecture
(MHz) (W) (mm2 ) (GOPS) (GOPS/W)
NeuFlow [171] 2012 IBM 45 nm SOI 400 Flexible 2-D systolic 0.6 12.5 320 490
DianNao [53] 2014 TSMC 65 nm CMOS 980 NLR 1-D array 0.485 3.02 452 932
DaDianNao [54] 2014 TSMC 28 nm CMOS 606 NLR 1-D array 15.97 0.78 5580 350
PuDianNao [142] 2015 TSMC 65 nm CMOS 1000 NLR 1-D array 0.596 3.51 1056 1752
ShiDianNao [79] 2015 TSMC 65 nm CMOS 1000 OS 2-D matrix 0.32 4.86 194 606
Origami [50] 2015 UMC 65 nm CMOS 700 WS 2-D array 0.744 3.09 274 369
FlexFlow [146] 2017 TSMC 65 nm CMOS 1000 flexible 2-D matrix 6.8 3.89 420 500
SCNN [168] 2017 TSMC 16 nm FinFET 1000 N/A 2-D systolic N/A 7.9 2000 N/A
0.278
Eyeriss [56] 2017 TSMC 65 nm CMOS 200 RS 2-D array 12.25 N/A N/A
(for AlexNet)
TPU [121] 2017 28 nm CMOS 700 WS 2-D systolic N/A < 331 N/A N/A
Bit Fusion [189] 2018 TSMC 45 nm CMOS 500 N/A 2-D array 0.895 5.87 N/A N/A
MAERI [132] 2018 TSMC 28 nm CMOS 200 WS, RS, OS 2-D array N/A 6 N/A N/A
DNPU [191] 2018 65 nm IP8M CMOS 200 N/A 2-D array 0.279 16 300 3900
UNPU [134] 2018 65 nm IP8M logic CMOS 200 N/A 2-D array 0.297 16 345.6 3080
LNPU [136] 2019 65 nm IP8M CMOS 200 N/A 2-D array 0.367 16 > 300 3480
SNAP [228] 2019 16 nm CMOS 33-480 N/A 2-D array 0.0163-0.364 2.4 N/A 3860
253.2
Eyeriss2 [57] 2019 TSMC 65 nm CMOS 200 RS 2-D array N/A N/A 153.6
(for AlexNet)
SIGMA [177] 2020 28 nm CMOS 500 WS 2-D array 22.33 65.1 10800 480
GAMMA [227] 2021 28 nm CMOS 1000 WS, RS, OS 1-D array N/A 30.6 N/A N/A

accelerated GPU) is presented in [220] to enable scalable High-performance GPU dedicated architecture referred as
integration of neural acceleration with a large number of TResNet is presented in [181] to accelerate CNNs. The
GPU cores. The proposed architecture brings the neural proposed architecture effectively utilizes the GPU resources
and GPU accelerators together without hampering the SIMT and achieves better accuracy and efficiency.
execution model. NGPU provides significant energy and Nvidia GPUs are the most popular for Deep Learning
performance benefits at the cost of reasonably low hardware (DL) implementations. Table 6 lists the accelerators that
overhead. NGPU achieves 2.44× average speedup and 2.8× Nvidia has released, which are used for the inference and
average energy reduction compared to the baseline GPU training of deep learning (DL) algorithms and have both a
architecture across different sets of benchmarks. Central Processing Unit (CPU) and a GPU integrated on a
Danial et al. [197] presented a framework for accelerating single chip.
the training and classification of arbitrary CNNs on the
GPU. The proposed method improves the performance VI. CGRA-BASED ACCELERATORS
by moving the computationally intensive tasks of a CNN Coarse Grain Reconfigurable Architectures (CGRAs) pri-
to the GPU. Training and classification of CNN on the marily consist of an array of Processing Elements (PEs) con-
GPU performs 2 to 24 times faster than on the CPU nected using reconfigurable interconnects. When compared
based on the network topology. Li et al. [140] proposed to FPGAs, CGRAs often have a shorter reconfiguration
an efficient GPU implementation to accelerate the training time. CGRAs have emerged as a popular option for real-
process of large-scale Recurrent Neural Networks (RNN). time computing due to their low power consumption, high
When compared to the CPU-based solution with the Intel’s efficiency, fast reconfiguration time, and ability to perform
Math Kernel Library (MKL), the proposed method yields both spatial and temporal calculations. In recent years,
a speedup of 2 to 11 times. Kim et al. [128] proposed a CGRAs have become increasingly significant in accelerating
new memory management scheme to enhance the overall DNNs, particularly CNNs, thanks to their ability to combine
GPU memory utilization in multi-GPU systems for deep FPGAs’ flexibility with ASICs’ efficiency. In this section,
learning algorithms acceleration. The authors extended the we would like to review some of the recent CGRA-based
concept of vDNN to a multi-GPU environment employing DNN accelerators.
PCIe-bus, where vDNN [180] virtualizes the GPU and Jafri et al. [119] proposed a CGRA-based accelerator
memory of the CPU so that it can be used simultaneously named NeuroCGRA to realize both neural networks and
to train DL algorithms in a hybrid fashion. The suggested digital signal processing applications. The authors have
memory scheme increases batch size by 60% in multi- opted to investigate the viability of deploying neural net-
GPU systems and enhances training throughput by 46.6%. works on an actual CGRA by means of a Dynamically
VOLUME 4, 2016 xxv

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

4000
Performance per watt (GOPS/W)

3500
10000
3000
8000

Throughput (GOPS)
2500
2000 6000
1500
4000
1000
500 2000
0
0
Sh nNa [54]
Da nNa 71]
Pu anN [53]

N 2]

PU 6]
PU 1]
PU 4]

Ey P [2 ]

A[ ]
iga 9]
low 0]

7]
A 6

GM 57
eri 28
ian 4

DN [14
UN [19
LN [13
SN [13
Or ao [7
xF i [5

17
Di w [1

iD o [1

SI s 2 [

Sh nNa [54]
Da nNa 71]
Pu anN [53]

N 2]

PU 6]
PU 1]

Ey U [1 ]
]
A[ ]
iga 9]

Fle NN 0]
low 8]

7]
Di ao
Di o

P 4
eri 36
GM 57
Fle m

ian 4

DN [14
UN [19
LN [13
Or ao [7
SC mi [5
xF [16

17
Di w [1

iD o [1

SI s 2 [
lo

Di ao
Di o
uF

lo
a

s
Ne

uF

a
Ne
Accelerator
, Accelerator
(a) Power efficiency
,
(b) Throughput

20 60

50
15
Area (mm 2)

40
Power (W)

10 30

20
5
10

0 0
iD ao 54]
Pu Dian ao 1]
Sh ianN ao [ 3]

Or Nao 42]

Fu riss 6]
MA ion [56]
DN RI [ 89]
UN U [ 32]
LN U [1 1]
SN U [1 4]
SI AP 36]
S am 79]
xF [1 ]
Bit Eye [14 ]

MM [1 ]
A [ 77]
7]
iD o ]
D a ]
Di Na 3]

Or ao 2]

Fu iss ]
DN n [1 ]
UN U [1 ]
LN U [1 ]
SN U [1 ]
SI AP [ 6]
S mi ]
Fle CNN [50]

Ey w [1 ]

A [ 8]
7]

Fle CNN i [50


low 68

GAGMA [228
Sh anNa o [54
Da anN 171

Bit er 46
sio [56
P 89
P 91
P 34
iga [79

lo 68

Da anN [17
D N [5

P 19
P 3

22
Pu ian o [5

N 4

3
GM 22
17

ian [1

E [1
P 1
ig [
ian [1

xF [1
Di low [

Di low

s
uF
uF

Ne
Ne

Accelerator Accelerator
, ,
(c) Power dissipation (d) Area

2
Fig. 37. performance metrics of ASIC-based accelerators

Reconfigurable Resource Array (DRRA). DRRA mainly provides a framework for mapping neural networks onto
consists of four elements, viz., Data Path Units (DPUs), CGRAs. The translater takes three inputs, viz., network
register files (Reg-files), Switch Boxes (SB), and sequencers, model, weights, and network specifications, and generates
as shown in Fig. 38. The DPUs are the functional units that three outputs: DPU, Reg-file, and SB instructions. NeuroC-
perform the required computations. The Reg-files store the GRA is synthesized using 65nm technology running at a
data for the DPUs. Interconnectivity between various DRRA frequency of 500 MHz. A framework called FIST is pre-
components is provided through SBs. The sequencers con- sented in [163] that allows the NeuroCGRA [119] to realize
figure the DPU, switch boxes, and register files. Distributed both DSP applications and neural networks, depending on
Memory Architecture (DiMArch) is essentially a scratch the target applications. The authors have implemented edge
pad providing enough data to the DRRA. The authors have detection on DRRA using the proposed framework.
embedded dedicated hardware, known as neuroDPU, with EMAX is an energy-efficient, low-power CGRA architec-
each DPU of DRRA to implement neural networks on ture with on-chip distributed memory proposed in [204] to
it. The authors proposed a neural network translator that implement CNNs. EMAX supports both CNN training and
xxvi VOLUME 4, 2016

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

TABLE 6: GPU-based accelerators developed by Nvidia


Memory Bandwidth Thermal Design Performance
Accelerator DNN Model Precision Memory Applications
(GB/s) Power (W) (GOPS)
Image Classification, object detection,
Jetson Xavier NX [26] ResNext-50 int8 16 GB 59.7 10 21
natural language processing
Image Classification, object detection,
Jetson AGX Xavier [28] ResNext-50 int8 32 GB 136.5 10 32
natural language processing
T4 [31] ResNet-50 int8 16 GB 320 70 130 Natural language interpretation
V100 [74] [122] ResNet-50 fp32 16 GB 900 300 15700 Natural language interpretation
A100 [200] [63] ResNext-50 fp32 40 GB 1555 250 19500 High performance computing

EMAX
Reg-file Reg-file
PE PE PE PE
SB SB CPU
Sequencer Sequencer Core
Row 0
DPU DPU

Memory Interface

Interconnection
SB SB PE PE PE PE

Cell 0 Cell 1

Reg-file Reg-file
PE PE PE PE
SB SB DRAM
Sequencer Sequencer
Row 1
DPU DPU
SB SB
Fig. 39. EMAX architecture, adopted from [204]
Cell 2 Cell 3

Column 0 Column 1
Special RC is used for operations like power (represented
Fig. 38. DRRA computation layer [119] as PRC in Fig. 40) and piecewise functions (represented as
IRC in Fig. 40). The crossbar switch serves as a bridge to
connect the RC array and SBUs. Data can be transferred
inference. EMAX is composed primarily of an array of PEs from off-chip memory to SBUs using the external direct
and an interconnection network, as shown in Fig. 39. Each memory access interface. Static and dynamic interfaces are
PE is connected to its neighbors by local interconnections, used for static and dynamic configurations, respectively.
and each row of the PE array has a shared bus. The results The proposed SDT-CGRA is realized in Verilog HDL, and
of calculations performed on the PEs are passed on to the Synopsys design compiler is used to synthesize the design.
PEs exist in the next row. The PEs can access external The proposed SDT-CGRA is implemented using SMIC
memory (DRAM) via the memory interface. Each PE has 55nm CMOS technology. Experimental results show that
two execution units that perform the arithmetic and logical SDT-CGRA outperforms EMAX by three times in terms of
operations. Each PE also has a local memory to store the operations per memory bandwidth.
required data, reducing the memory bandwidth pressure. In [111], the authors proposed mapping of CNNs onto
Experimental results show that EMAX performs better than Tightly Coupled Processor Array (TCPA) efficiently. TCPA
GPUs in terms of per memory bandwidth and per area. belongs to the class of CGRA, containing an array of tightly
A CGRA-based accelerator referred to as stream dual- coupled VLIW Processing Elements (PEs) [105]. TCPA
track CGRA (SDT-CGRA), which targets the implemen- offers multiple levels of parallelism, for instance, task-level,
tation of object inference algorithms, is presented in [83]. loop-level, iteration-level, instruction-level parallelism, etc.
SDT-CGRA employs stream processing and uses both static TCPAs are suited for accelerating computationally expen-
and dynamic configurations for stream processing. The SDT- sive nested loop programs exhibiting a high degree of
CGRA accelerator mainly contains an array of PEs known parallelism, such as CNNs. CNN layers are based on ma-
as reconfigurable cells (RS) and stream buffer units (SBUs), trix multiplications which can be written as 6-dimensional
as shown in Fig. 40. The SDT-CGRA architecture is divided nested loops, making them suitable for acceleration. It
into two sections: global memory and computing array. was demonstrated that TCPAs use techniques such as loop
The global memory section is dynamically configured and permutation, loop unrolling, and layer-parallel processing
stores data streams. On the other hand, the computing array to exploit the parallelism offered by the TCPA architecture.
section operates in a static configuration mode. It comprises Layer fusion allows the processing of multiple layers of
several RC columns and one special RC column. The CNN in the overlapped fashion [33], which was exploited
VOLUME 4, 2016 xxvii

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

Local data bus There is a lot of room for CGRA research to develop and
Interconnection between local data bus and RC expand as a topic of study for future architecture; this is
Interconnection between RCs in horizontal direction especially true when developing high-performance CGRAs
Interconnection between RCs in vertical direction tailored to specialized or general-purpose computing. Some
Global Memory Crossbar Computing array key issues that require further research in this area include
developing tools to program the architecture efficiently,
SBU
RC RC RC PRC
memory management, scalability, adaptability, productivity,
External virtualization, etc.
memory SBU
DMA
interface RC RC RC IRC
VII. EMBEDDED AI ACCELERATORS
The AI hardware requirements are more critical in the edge
SBU
environment, which is typically represented by Internet of
Things (IoT) devices (e.g., smart speaker, mobile, sensors
and actuators) with limited computing resources, as opposed
Dynamic
Config.
to cloud infrastructure with relatively sufficient computing
Ctr. Unit RC RC RC IRC capability. For the sake of real-time immediacy, latency,
Static Config.
offline capabilities, security, and privacy, AI models are
SDT-CGRA
Ctr. Unit increasingly required to be implemented on the edge. In
Dynamic Static this context, Small Form Factor (SFF) devices such as mi-
configuration interface configuration interface
crocontrollers, which dominate the market, are of particular
interest and having AI capabilities on theses devices can
Off-chip Memory Host Processor
help many applications. Many industrial solutions require
Fig. 40. SDT-CGRA architecture, adopted from [83] products with SFF and Size, Weight, and Power (SWaP)
enhanced embedded systems. In this section, we review
some of the latest embedded AI accelerators.
by TCPA to save intermediate memory needed between the Fig. 42 shows the architecture of Edge TPU from
layers. Loop permutation allows the computation of multiple Google which is used in products such as Coral and Pixel
convolution filters in an interspersed way. TCPA allows the Phones [112]. Edge TPUs are designed to give high perfor-
parallel execution of multiple layers by different PEs. A mance acceleration while staying within strict physical and
CNN model for the MNIST benchmark on an array of size power constraints [221]. Edge TPU is organized in a 2-D
4×4 was evaluated and the performance of the layer-parallel array of Processing Elements (PEs) where each PE performs
approach over layer-by-layer processing was compared. computations in a SIMD fashion. Data is transferred from
A CGRA-based accelerator called Neural Processing off-chip memory and PEs via an on-chip controller. Acti-
CGRA (NP-CGRA) is presented in [135] to accelerate vation and parameters are loaded into the on-chip staging
lightweight CNNs. The authors have proposed a set of buffers by the controller. In addition, the controller reads
extensions to the baseline CGRA [153] to improve the in the low-level instructions that will be executed on the
performance of CGRAs and to efficiently implement depth- PEs (e. g., convolution, pooling, etc.). Each PE may contain
wise convolution (DWC) and pointwise convolution (PWC). single or multiple cores, each having multiple compute
The authors have presented three architectural extensions: lanes to support operation in SIMD fashion. A memory
a crossbar-style memory bus, dual-mode MAC unit, and is shared across all cores, PE Memory is used to model
operand reuse network. The crossbar-style memory bus activations, partial results, and outputs are all stored in a
contains horizontal and vertical buses, and each bus is shared memory, which is labelled PE Memory, see Fig. 42.
accessible to all the PEs connected to it. Dual-mode MAC Each PE’s cores have a core memory that is mostly used to
unit works in two modes: MAC mode and MUL/ALU store model parameters. Each compute lane has multi-way
mode. The multiplication and accumulation operations are MAC units to perform computations between activations and
chained together in the MAC mode to realize the function. model parameters. A few prototyping boards, see Fig. 43
On the other hand, in the MUL/ALU mode, a PE can from Coral are available for the community to try and
choose either multiplication or an addition operation for deploy ML apps at the edge, including the Dev Board, USB
each cycle. Operand reuse network offers input-to-input accelerator, Dev Board Mini, and Dev Board Macro [27].
routing instead of output-to-input routing. The proposed NP- TensorFlow Lite framework is particularly developed for
CGRA is realized in Verilog HDL, and Synopsys design mapping various neural network operations onto the Edge
compiler is used to synthesize the design. The proposed TPU [29]. The Edge TPU coprocessor can compute 4 trillion
NP-CGRA is implemented using Samsung 65nm CMOS operations per second (TOPS) while consuming just 0.5
technology. Experimental results show that the area-delay watts for each TOPS (2 TOPS per watt) [27].
product of NP-CGRA is 8-18 times better than that of NVIDIA’s Jetson Nano [3], [184] is an embedded board
baseline CGRA. suitable for edge AI applications. It contains a 64-bit
xxviii VOLUME 4, 2016

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

AHB Bus
LEON3
Address
( Configuration Interrupt Global Generators
& Communication Controller Controller
Processor )

Reconfigurable Buffers

PE PE PE PE
Interconnect
Wrapper Module
Reconfigurable Buffers

Reconfigurable Buffers
PE
PE PE PE PE

PE PE PE PE Layer Conv0(+Pool1) Conv2(+Pool2) Conv4 FC


Size (RxCxN) 28x28x4 14x14x24 7x7x16 1x1x10
Par. (M,N,K) 24,1,3 (1,24,2) 24,24,3 (1,24,2) 16,24,3 10,784,1
Stoarage (KB) 19.8 14.6 5.4 8.6
MACs 169,368 1,016,088 169,360 7850
Configuration
(+18,816) (+4,704)
Loader
FU Branch Unit PE PE PE PE
VLIW Instruction Memory

Data Register Control Tightly Coupled Processor Array PE PE PE


File Register File (0, 0) (0, 1) (0, nx)

Data Control
TCPA tile Reconfigurable Buffers Activation
Memory
PE PE PE
(1, 0) (1, 1) (1, nx)
Instruction
DRAM
Memory
Parameter
Memory
a) TCPA accelerator b) CNN to recognize digits in 28x28 pixel images
Controller
including net parameters PE PE PE
(nx, 0) (nx, 1) (nx, nx)

Fig. 41. TCPA accelerator showing PE array of size 4 × 4 and a CNN that is mapped onto it for recognizing digits from
MNIST database

PE PE PE
(0, 0) (0, 1) (0, nx)

Activation PE PE PE
Memory
(1, 0) (1, 1) (1, nx)
Instruction
DRAM
Memory
Parameter
Memory

Controller
PE PE PE
(nx, 0) (nx, 1) (nx, nx)

d) Dev board
a) Dev board b) USB Accelerator c) Dev board Mini Macro

Fig. 43. Prototyping boards from Coral [27] having edge


Core Core
Memory Memory TPU.
Memory
PE

472 GFLOPS of FP16 computation performance while


Compute

Compute

Compute

Compute

Compute

Compute
Lanes

Lanes

Lanes

Lanes

Lanes

Lanes

consuming only 5-10 W of power. NVIDIA also provides


the developer kit with examples to map the multiple neural
Core0 Corek networks applications such as object detection, segmen-
tation, image classification, and speech processing [184].
Fig. 42. Overall architecture of Edge TPU [221]
NVIDIA also provide many such embedded boards such
as Jetson AGX Orin, Jetson Orin NX, Jetson Xavier NX
Series, Jetson TX2 Series in various combinations of form-
quad-core Arm Cortex-A57 CPU running at 1.43 GHz, factor, power-efficiency, and performance to address various
NVIDIA Maxwell GPU with 128 CUDA cores, and has industry segments [3], [11].
4GB LPDDR4 memory. Jetson Nano runs Linux and offers BeagleBone AI [4], built around Texas Instruments’ (TI)
VOLUME 4, 2016 xxix

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

Myriad X is a popular choice for on-device DNNs and


computer vision applications thanks to the Neural Compute
Engine, 16 SHAVE cores, and ultra-high throughput. The
Myriad X VPU includes a native 4K image processor
pipeline and can directly link up to eight HD sensors. The
Myriad Development Kit (MDK), which offers development
tools, frameworks, and APIs to implement computer vision,
imaging, and DNN workloads on the chip, can be used to
program both the Myriad 2 and Myriad X VPUs.

Fig. 44. NVIDIA’s Jetson Nano. Software Controlled I/O Multiplexing

INTERFACES
MIPI
(SPI, USB3, I2C, I2S, LCD, CIF, UART, ETHERNET etc.)
AM 5729 Sitara SoC [116], is yet another board for AI at the X12 lanes
edge. This SoC has two 32-bit Arm Cortex-A15 cores, two
Image Processing Units (IPUs) that each having two Cortex-
M4 cores, two C66x DSP cores, two PowerVER SGX5443D Imaging/Vision Hardware Accelerators RISC-RT
GPUs, and four Embedded Vision Engines (EVEs). It also
has 15 GB of eMMC flash, 1 GB of RAM, Wi-Fi as well Intelligent Memory Fabric
RISC-RTOS
as Bluetooth support, and USB connectors for power and
data transfer. The BeagleBone AI runs Linux and TI Deep
SHAVE DSP

SHAVE DSP

SHAVE DSP
SHAVE DSP

SHAVE DSP

SHAVE DSP

SHAVE DSP

SHAVE DSP

SHAVE DSP

SHAVE DSP

SHAVE DSP

SHAVE DSP
Learning (TIDL) framework can be used develop real-time 12 vector VLIW SHAVE processors

ML applications.
L2 Cache

Myriad 2 MA2x5x Block Diagram DDR

Fig. 46. Myriad 2 VPU architectural block diagram, adopted


from [15]

Sipeed Maixduino is like an Arduino for machine learning


projects. It has MAIX SoC [13] which includes Kendryte
K210 KPU (Knowledge Processing Unit, also called Net-
work Processing Unit), powerful chip suited for visual and
semantic recognition [12]. MAIX SoC block diagram is
shown in Fig. 47, which includes K210 featuring two RISC-
V 64-bit CPU cores, an APU (Audio Processing Unit, also
called Audio Accelerator), and KPU optimized for running
Fig. 45. BeagleBone AI board [4] CNNs. KPU offers 0.25 [email protected] W,400 MHz, when
overclock to 800 MHz, it offers 0.5 TOPS. It means, we can
A Vision Processing Unit (VPU) [10] is a processor do object recognition 60 fps@VGA. MAIX also includes
optimized to perform inference tasks at the edge with ultra- a Fast Fourier Transform (FFT) unit, making it useful for
low power without compromising performance. Movidius signal processing. In addition to these, it supports a wide
Myriad 2 VPU is based on the Intel Neural Compute Stick range of other peripherals, see Fig. 47. TensorFlow Lite
(NCS) platform, designed as a 28 nm co-processor that framework is supported by this board and platforms such as
provides high performance tensor acceleration, see Fig. 46. Arduino IDE and PlatformIO can be used for development.
The Streaming Hybrid Architecture Vector Engines (SHAVE)
are 12 highly parallelizable vector processors in the Myriad Sophon’s edge developer board, see Fig. 48, is envisioned
2 VPU, whose parallelism and ISA allow good performance as rapid prototype development board for ML applications.
efficiency across a range of computer vision applications, It contains a powerful BM1880 which is capable of imple-
even those with low latency requirements. The Neural menting DNN/CNN/RNN/LSTM models efficiently using a
Compute Engine, a dedicated hardware AI accelerator for tailored tensor processing unit. It also features two Arm
deep neural network deep-learning inferences, is included Cortex-A53 CPUs and a RISC-V CPU. TPU can perform
in the Myriad X VPU, Movidius’ third generation of VPUs. 1 TOPS for 8-bit integer data. This board is mainly used
xxx VOLUME 4, 2016

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

version of Xilinx’s all-programmable System-on-Chip (SoC)


DMA GPIO
CPU families, the Zynq architecture combines a dual-core ARM
UART
DVP Cortex-A9 processor with a conventional processor (FPGA).
RISC-V RISC-V
JTAG
64-bit 64-bit
SPI The Advanced eXtensible Interface (AXI) standard is used
AES FPU FPU
RTC to connect the various pieces of the Zynq architecture,
I2S allowing for high bandwidth and low latency connections.
OTP
I2C Vivado Design Suite [22] is used to map programs on
FPIOA KPU to Ultra96-V2 board and is widely used in AI and ML
Timer
CNN Accelerator
PWM projects. For instance, authors in [39] implemented a real-
SRAM
time face recognition on Ultra96-V2. Designers may use
APU WDT
FFT the Python language and libraries to make use of Zynq’s
K210 Audio Accelerator SHA256
programmable logic and microprocessors to create more
capable and intriguing embedded systems. PYNQ (Python
DC-DC ESP8285
FLASH IPEX
3-channel MCU WIFI

Fig. 47. Block diagram of KPU and MAIX SoC, adopted


from [13]

for surveillance cameras, BM1880 [5] is designed using Ultra96-V2 PYNQ-Z2

28 nm process and dissipated 2.5 W. Frameworks such as


Fig. 49. Ultra96-V2 [1] and PYNQ-Z2 [18] development
TensorFlow, Pytorch, ONNX, Caffe, etc. are supported by
boards from Xilinx
this board. However, BITMAIN has its own framework
called BITMAIN Neural Network Software Development On Zynq) is a Xilinx® open-source project that makes
Kit (BMNNSDK) [5] and recommends it to achieve high designing embedded systems with Zynq® Systems on Chips
inference throughput and efficiency. BMNET and BMRun- simple. PYNQ-Z2 is an FPGA development board based on
Time are included in the BMNNSDK. BMNET is a DNN the ZYNQ XC7Z020 FPGA, which has been meticulously
compiler for TPU processors on the edge. It translates CNN- developed to support PYNQ. Designers can create more
like algorithms into TPU instructions. powerful embedded systems using ZYNQ by combining
PL and PS. Furthermore, the SoCs may be programmed
in Python, and the code can be developed and tested on
the PYNQ-Z2 directly. In the same manner that software
libraries are imported and programmed, programmable logic
circuits are imported as hardware libraries and programmed
through APIs. PYNQ-Z2 board has many interfaces such
as user LEDs, push-buttons, switches, MIC input, Ethernet,
HDMI Input/Output, MIC Input, Audio Output, Arduino as
well as Rasberry Pi interfaces etc. PYNQ takes advantage
of the greatest features of both ZYNQ and Python. Machine
learning research and prototyping have made extensive use
of it. For instance, authors in [209] used this board for
implementing CNNs. Xilinix’s configurable DPU IP [7] can
also be used together with PYNQ board for creating a
network with desired number of layers, activation functions
Fig. 48. Bitmain Sophon(TM) Edge Developer Board, etc.. Vivado [22], Vitis [21] and Python can be used to work
adopted from [8] with PYNQ board.
Xilinx’s Kria KV260 [23], [123] is an AI starter kit
Ultra96-V2 [1] and PYNQ-Z2 [18] are embedded AI targeted for vision AI applications in smart cities, smart fac-
boards using FPGAs, see Fig. 49. Ultra96-V2 features a tories, robotics, home automation, etc., see Fig. 50. KV260
Zynq UltraScale+ MPSoC ZU3EG device. Xilinx’s Zynq includes a Zynq MPSoC, and it supports the Python-based
devices contain both Processor System (PS) and Pro- PYNQ framework. The trained models can be implemented
grammable Logic (PL) where PS consists of hardcore pro- in the DPU [7] and are loaded with PYNQ using hardware
cessors while PL contains the FPGA. Prior to the Zynq, overlays. In [123], authors have demonstrated pre-trained
processors were connected to a FPGA, which complicated models based on the MNIST dataset, RESNET based on
communication between the PL and the PS. As the latest Caffe framework, and InceptionV1 based on Tensorflow.
VOLUME 4, 2016 xxxi

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

The trained DNN model can be transferred to the Rasp-


berry Pi through network connectivity. However, network
connectivity can introduce delays, data loss, and other se-
curity concerns, limiting DNN deployment on the Raspberry
Pi [41]. Bhosale et al. [42] proposed Deep Convolutional
Neural Network (DCNN) for Covid-19 classification. In
this work, the DCNN architecture is deployed on the cloud
and uses radiology x-ray images for classification. On the
other hand, the authors in [41] proposed a lightweight Deep
PL AXI PS Learning model (LDC-Net) for Covid-19 classification with
lung disease. In this work, LDC-Net was trained on High-
Kria SOM DPU Performance Computing (HPC). Furthermore, the trained
LDC-Net and weights have been deployed in an IoT-enabled
Fig. 50. Xilinx’s Kria KV260 SOM, adopted from [123] Raspberry Pi with network connectivity for Covid-19 clas-
sification.

Furthermore, to exercise the features of KV260, many


models from Vitis AI Model Zoo [24] repository are imple-
mented. Traffic detection, lane detection and segmentation
algorithms were also implemented and tested in real time.
Silicon Lab has recently introduced BG24/MG24 [19] SoCs
with built-in AI accelerators and new software toolkit. These
new devices with optimized hardware and software will
help execute AL/ML applications on battery-powered edge
devices. The MAX78000 [14] from Maxim Integrated is an
AI microcontroller that runs neural networks at extremely
low power. It has a hardware-based CNN accelerator, en- Fig. 51. Raspberry Pi computer [233]
abling the battery-powered applications to execute AI infer-
ences. AlphaICs’ Gluon AI co-processor [9] is optimized An Arm processor is a general-purpose processor that
for vision AI applications. It comes with an SDK for easy belongs to the family of CPUs, and it uses Reduced In-
porting of neural networks. struction Set Computer (RISC) architecture. Because of their
Deep neural networks (DNN) are increasingly being used efficiency and flexibility, Arm processors are used in a
on IoT-enabled devices like the Raspberry Pi to improve wide range of electronic products, including smartphones,
efficiency, security, and privacy. However, the size and tablets, and wearables. Arm’s new portfolio hardware so-
complexity of the machine-learning (ML) model that can lutions are now aimed towards Machine Learning (ML)
be deployed in such systems are limited by the available and Deep Neural Network (DNN) applications. In recent
computational and memory resources. The Raspberry Pi is times, ARM-based processors have been developed for the
a low-cost, small, and portable computer board with built- acceleration of machine learning applications from various
in software that allows users to create scripts or programs manufactures viz., Marvell (ThunderX2), Fujitsu (A64FX),
in Python [232]. There are two main limitations to utilizing Huawei (Kunpeng 920), and Ampere (eMAG). With the help
a Raspberry Pi for deep learning: 1) the small amount of of its recently released Neural Processing Units (NPUs),
memory available and 2) the slow processing speed. These Arm processors bring machine learning to low-end edge
limitations severely hamper the implementation of more devices.
complex neural networks. There are two ways to deploy Arm ML processor uses the Neural Network (NN) soft-
deep learning at IoT end devices. 1) Deploy feature vector ware development kit provided by the company to inter-
and model architecture on the Server machine and call with face the ML software and corresponding hardware [216].
API using Web service to IoT. 2) Deploy feature vector The Arm-based ML accelerator consists of a number of
and model architecture on resource-constraint platforms like computing engines up to 16, each of which includes a
Raspberry Pi, also called on-device computing. The first programmable layer engine and a MAC convolution engine,
method has network latency issues, security risks, and high see Fig. 52. Each computing engine has its own local mem-
communication costs. The second method has difficulty in ory to process the ML models. Starting with weights applied
implementing large DNN models due to the limited memory to incoming data, processing via the MAC convolution
and computational resources of IoT-enabled devices like engine, and finally results processed by the Programmable
Raspberry Pi. Furthermore, devices with limited resources, Layer Engine (PLE), the flow is typical of DNN imple-
such as the Raspberry Pi, are only used for DNN inference. mentations. There are 128 multiple-accumulate (MAC) units
xxxii VOLUME 4, 2016

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

in the MAC convolution engine. MAC convolution engine captivating alternative uses of on-device machine learning.
receives the input data from the input feature map read TinyML supports various frameworks, including Tensor-
block, weights from the weight decoder, and performs the Flow Lite Micro (TFLM), TensorFlow-Native, Embedded
required MAC operation. The result of the convolution is Learning Library (ELL), Graph Lowering (GLOW), etc.
processed by the PLE, which is a vectorized microcontroller. Google developed an open-source framework referred to as
It is more akin to a RISC platform designed to wrap up CFU Playground [175] for TinyML acceleration on FPGA.
the processing of a layer for a piece of a DNN model with CFU playground toolchain combines open-source software
several layers. The PLE is in charge of tasks like pooling and (TensorFlow), RTL generators (LiteX, Migen, etc.), and
activation. The throughput of the proposed ML processor FPGA tools for synthesis (yosys), place, and route (vpr).
is 4.6 TOPS. The proposed design is implemented using The CFU playground framework makes it possible to in-
7 nm chip technology, and it is scalable, can achieve the vestigate custom architectures for the acceleration of Tiny
throughput of 150 TOPS for high-end applications. ML for embedded ML systems. TinyML is used in many
The Arm AI platform, also known as Project Trillium, applications, including medical face mask detection [157],
is a heterogeneous compute platform that includes Arm eating detection [167], Li-Ion batteries parameter estima-
Cortex CPUs, Ethos NPUs, Mali GPUs, and microNPUs to tion [69], etc. The most in-demand research areas among the
accelerate the ML algorithms [143]. Arm supports various TinyML community include sound recognition, computer
ML frameworks such as TensorFlow Lite, Caffe, Pytorch, vision, and the development of low-power accurate ML
etc. and accelerates the ML applications using software models. More research is needed to fully comprehend the
libraries including arm NN, arm COMPUTE LIBRARY, and advantages and drawbacks of the topics under discussion,
Common Microcontroller Software Interface Standard-NN even if many applications have demonstrated TinyML’s
(CMSIS-NN). The hardware products such as Arm Cortex promise. Some key issues that require further research in this
CPUs, Ethos NPUs, Mali GPUs, and microNPUs, FPGAs, area include developing benchmarks, memory constraints,
DSPs, etc. ARM’s new Cortex-A55/A75 and Mali-G72 energy, processor capacity, cost reduction, etc.
combination targets machine learning on edge computing
devices. VIII. COMPARISON BETWEEN VARIOUS HARDWARE
Arm has developed its Ethos series of ML processors ARCHITECTURES FOR DNN ACCELERATION
for machine learning applications. Ethos series is classified The performance of the various hardware accelerators
into two types: N-series and U-series [25]. Ethos N-series is for the DNN acceleration depends on the target applica-
introduced in October 2019, and it contains NPUs identical tion. However, researchers defined some standard metrics,
to Cortex family. Ethos U-series is introduced in early 2020, namely, area, power, and throughput, to measure the per-
and it contains microNPUs. MicroNPUs paired with the formance of the hardware accelerators for the development
CPU like the Cortex-M55 to process the ML algorithms. and deployment of DNNs. Here, the area is nothing but the
Ethos-U55 achieves a throughput of 0.5 TOPS, and it portion of silicon required for the DNN acceleration, which
contains 32 to 256 8-bit MAC units [144]. Ethos-U55 is generally represented in squared millimeters or squared
supports 8-bit and 16-bit integer data types. Ethos-U65 micrometers. The area depends on the size of the on-
achieves a throughput of 1 TOPS, and it contains 256 to chip memory and the technology used during the hardware
312 8-bit MAC units. Ethos-N57 achieves a throughput of synthesis process. Power is nothing but the amount of power
2 TOPS, and it contains 1024 8-bit MAC units. Ethos-N77 consumed by the specific hardware during the DNN acceler-
is a highly efficient ML inference processor that achieves ation. The power consumption mainly depends on off-chip
throughput of 5 TOPS, and it is best suitable for mobile and on-chip memories. Throughput is used to measure the
devices. Ethos-N77 ML processors can be used for facial or productivity of the hardware accelerator. The comparison
object recognition applications. Ethos-N78 is a scalable and between the various hardware accelerator architectures for
efficient ML inference processor that achieves a throughput DNN acceleration is shown in Table 7. Due to a lack of
of 1 to 10 TOPS [145]. Arm‘s Cortex-M55 and the Ethos- data on their footprint, power consumption, and throughput,
U55 can be used as an AI accelerator in edge computing CGRA-based accelerators are not represented in Table 7. As
devices [90]. This combination achieves a 32× improve- expected, temporal or general purpose architectures such as
ment in ML processing compared to the base Cortex- CPU and GPU have greater power consumption and area
M55 core. Furthermore, TinyML [20] advancements have than special purpose architectures such as FPGA and ASIC
made it possible to use ML models on the microcontroller because they are not tailored for a particular application. The
hardware found in our household appliances, including essential hardware metrics like power, area, technology, and
printers, TVs, smartwatches, and pacemakers, which can throughput are reported for each hardware architecture.
now carry out tasks that were previously only capable of In Table 8, we have compared the few embedded de-
being done by computers and smartphones. The machine velopment boards discussed above with respect to general
learning and embedded ultra-low power systems commu- purpose CPUs/GPUs, specialized co-processor they contain,
nities have joined forces to create TinyML foundation. performance, power, SDKs, and supported ML frameworks.
This joint effort has paved the way for innovative and
VOLUME 4, 2016 xxxiii

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

TABLE 7: Comparison among accelerators implemented on different hardware platforms


Accelerator Year Platform Area (mm2 ) Power (W) Throughput (GOPS)
CNP [86] 2009 FPGA N/A 15 N/A
Parallel coprocessor for CNN [183] 2009 FPGA N/A 11 6.74
MAPLE [47] 2010 FPGA N/A N/A 7
DC-CNN [51] 2010 FPGA N/A 14 16
NeuFlow [84] 2011 FPGA N/A 10 147
NeuFlow [171] 2012 ASIC 12.5 0.6 320
Memory- Centric Accelerator [170] 2013 FPGA N/A N/A 17
nn-X [92] 2014 FPGA N/A 8 23.18
DianNao [53] 2014 ASIC 3.02 0.485 452
DaDianNao [54] 2014 ASIC 0.78 15.97 5580
Origami [50] 2015 ASIC 3.09 0.744 274
PuDianNao [142] 2015 ASIC 3.51 0.596 1056
ShiDianNao [78] 2015 ASIC 4.86 0.32 194
Roofline based Accelerator [225] 2015 FPGA N/A 18.61 61.62
Embedded FPGA Accelerator [178] 2016 FPGA N/A 9.63 136.97
fpgaConvNet [208] 2016 FPGA N/A N/A 12.73
DeepBurning [213] 2016 FPGA N/A N/A 73
SCNN [168] 2017 ASIC 7.9 N/A 2000
FlexFlow [146] 2017 ASIC 3.89 6.8 420
Nvidia V100 [17] 2017 GPU 815 250 15700
Eyeriss [56] 2017 ASIC 12.25 0.278 N/A
TPU [121] 2017 ASIC <331 N/A N/A
DLAU [210] 2017 FPGA N/A 0.234 N/A
ESE [103] 2017 FPGA N/A 41 282
FP-DNN [96] 2017 FPGA N/A 25 364.4
FINN [206] 2017 FPGA N/A 11.7 2465.5
Angle-Eye [97] 2018 FPGA N/A 3.5 137
Bit Fusion [189] 2018 ASIC 5.87 0.895 N/A
MAERI [132] 2018 ASIC 6 N/A N/A
DNPU [191] 2018 ASIC 16 0.279 300
UNPU [134] 2018 ASIC 16 0.297 345.6
Jetson AGX Xavier [28] 2018 GPU N/A 10 32
Tesla T4 [31] 2018 GPU 545 70 130
Accelerator for SSDLiteM2 [82] 2018 FPGA N/A 9.9 N/A
Intel Xeon Platinum 9282 2019 CPU N/A 400 3200
AMD Ryzen Threadripper 3970x 2019 CPU N/A 280 1859
LNPU [136] 2019 ASIC 16 0.367 >300
SNAP [228] 2019 ASIC 2.4 0.16-0.36 N/A
Eyeriss2 [57] 2019 ASIC N/A N/A 153.6
BFP arithmetic-based Accelerator [141] 2019 FPGA N/A 9.18 760.83
Tera-OPS streaming Accelerator [162] 2019 FPGA N/A 18.29 1877
Caffeine [226] 2019 FPGA N/A 26 354
SIGMA [177] 2020 ASIC 65.1 22.3 10800
Nvidia A100 [16] 2020 GPU 826 400 19500
CNN2Gate [91] 2020 FPGA N/A N/A 80.04
Jetson Xavier NX [26] 2021 GPU N/A 10-20 14-21
GAMMA [227] 2021 ASIC 30.6 N/A N/A
Accelerator for space DNN [218] 2021 FPGA N/A 3.82 1.34
NPE [126] 2021 FPGA N/A 20 135.14
LP-CNN [81] 2021 FPGA N/A 3.92 129.2
Reconfigurable YOLOv3 Accelerator [211] 2021 FPGA N/A 25 N/A
Ad-MobileNet [45] 2021 FPGA N/A 3.25 N/A
Energy-efficient CNN Accelerator [117] 2021 FPGA N/A 0.628 N/A
Dynamically Reconfigurable Architecture [118] 2022 FPGA N/A 2.039 40.71
RCNN Accelerator [93] 2022 FPGA N/A 1.15 N/A

xxxiv VOLUME 4, 2016

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3229767

interrupt
Machine Learning Processor
ACE-Lite DMA Control Sync
Interface engine Unit Unit Vector
CPU
Engine

Input feature map read


Load/Store MCE
Broadcast PLE’s Vector
SRAM Register
Controller File
Weight decoder µ DMA

SRAM

MAC convolution engine

Programmable Layer Engine


SRAM (PLE)
Programmable Layer Engine
(PLE)
Main SRAM
Compute engine 1
Unit
Compute engine 16

Fig. 52. Arm based ML processor [216]

TABLE 8: Comparison of various embedded edge AI development boards


Coral Dev Board Jetson Nano BeagleBone Myriad X Maixduino Sophon Edge
NXP i.MX 8M SoC Dual LEON4, Dual RISC-V
General Purpose Quad-core ARM A57, Dual Cortex-A15, Dual Cortex A53.
(quad Cortex-A53, Cortex-4F), 16 SHAVE Core 64bit,
Processors 128-core Maxwell Dual SGX544, Dual PRU-ICSS Single RISC-V
GC7000 GPU (Vector Processing Unit) with FPU
AI Co-Processor Google Edge TPU 128-core NVIDIA GPU Dual C66x DSP, Quad EVE Neural Computing Engine KPU/NPU BM1880, TPU
Performance 4 TOPS 472 GFLOPS ~100GOPS 1 TOPS 0.25 -0.5 TOPS 1 TOPS
Power 2 TOPS per watt 5-10W 5-10W 1W 0.3 W 2.5W
TI Deep Myriad Maixduino BITMAIN Neural
SDK Edge TPU AI JetPack
Learning Development Kit SDK Network SDK
TensorFlow, PyTorch, Caffe, TensorFlow, Caffe, TensorFlow,
Supported Frameworks TensorFLow Lite Caffe, TensorFlow TensorFlowLite
Caffe MXNet PyTorch, MXnet

IX. FUTURE DIRECTIONS
In the future, hardware AI acceleration is set to become ubiquitous. In recent processors, some form of AI accelerator hardware is becoming a standard feature, indicating that AI acceleration is an essential general-purpose task. In this paper, we have reviewed several FPGA-based, ASIC-based, GPU-based, CGRA-based, and edge AI hardware accelerators. However, the industry trends and startups in this space indicate that we are still in the early stages of the AI revolution. Many more energy-efficient architectures will emerge in the future. In particular, architectures with transprecision or approximate computing, high-bandwidth memories, and emerging non-volatile memories such as MRAM and ReRAM may appear on the market. Evolving architectures involving the Tsetlin machine are another promising future research direction.

Emerging technologies such as nanomaterials, optical computing, and DNA computing may accelerate DNNs in the near future. Carbon nanomaterials, such as carbon nanotubes (CNTs) and graphene, are particularly intriguing due to their rapid electron transport [58]. CNTs and graphene have desirable switching and optical properties, making them well-suited to electronic and optical architectures [190]. New chip architectures become possible with the help of CNTs and other nanomaterials; researchers at MIT and Stanford have developed a new 3-D architecture based on a network of millions of carbon nanotubes [192]. Computations in optical computing technology can happen at the speed of light, much faster than in conventional electron-driven chips. To advance optical computing, MIT is driving research in advanced optical materials, switches, lasers, and nano-optics [107], and we may expect to see a greater deployment of optical chips in the future. DNA computing is a type of parallel computing in which many different DNA molecules are used to test many possibilities simultaneously [139]. The major advantage of DNA is its potential for memory storage: a single gram of DNA can store 215 petabytes (215 million gigabytes) [6]. Although DNA information storage has enormous application potential, many issues must be addressed before its widespread use, such as the high cost of writing and reading information and the still-unknown techniques for erasing and rewriting information in DNA [75].

For FPGA-based architectures, the following future directions seem promising. The combination of FPGAs and cloud computing opens up new avenues for developing deep learning applications. The FPGA cloud service is still in its early stages, and many imperfections must be investigated, such as the virtualization of FPGA hardware resources, task migration, and so on. The majority of current research is focused on lowering the bandwidth requirements of off-chip memory access. The performance of multiple FPGA chips working together is favourable; however, dealing with processing scheduling and chip allocation remains a significant challenge. Future research could also focus on the development of in-memory-computing processors. Moreover, further improvements are required in the computation of the activation functions used in DNNs: because most studies focus on loop optimization, only a few researchers are currently working on activation function optimization. Frameworks that integrate existing or new architectures will also emerge, helping applications to be deployed quickly. Most importantly, FPGA-based accelerator research will move towards training and not only inference.
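One concrete form that activation-function optimization takes in FPGA datapaths is replacing transcendental functions with lookup tables or piecewise-linear segments that map cheaply onto LUTs and DSP slices. The sketch below shows a simple piecewise-linear sigmoid approximation of that kind; the breakpoints and slopes are illustrative values chosen here for the example and are not drawn from any accelerator surveyed above.

```python
# Generic illustration: a piecewise-linear (PWL) sigmoid approximation of the
# kind used to avoid computing exp() in FPGA activation-function hardware.
# Breakpoints and slopes are illustrative, not taken from any surveyed design.
import math

def sigmoid_pwl(x: float) -> float:
    """Piecewise-linear sigmoid; exploits sigmoid(-x) = 1 - sigmoid(x)."""
    ax = abs(x)
    if ax >= 5.0:
        y = 1.0
    elif ax >= 2.375:
        y = 0.03125 * ax + 0.84375
    elif ax >= 1.0:
        y = 0.125 * ax + 0.625
    else:
        y = 0.25 * ax + 0.5
    return y if x >= 0.0 else 1.0 - y

if __name__ == "__main__":
    for x in (-6.0, -2.0, -0.5, 0.0, 0.5, 2.0, 6.0):
        exact = 1.0 / (1.0 + math.exp(-x))
        print(f"x={x:5.1f}  pwl={sigmoid_pwl(x):.4f}  exact={exact:.4f}  "
              f"err={abs(sigmoid_pwl(x) - exact):.4f}")
```

In hardware, multiplications by slopes such as 0.25, 0.125, and 0.03125 reduce to arithmetic shifts, which is one reason piecewise-linear schemes with power-of-two slopes are popular in FPGA implementations.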
For ASIC-based hardware accelerators, the following future research trends are suggested. The TPU is already a standard in the field of deep learning, and more capable replacements are likely to emerge in the coming years. There will be entirely new architectures targeting low-latency and low-power applications. Most current studies assume a trained DNN and focus on increasing the speed of its inference; there have been only a few studies on accelerator design for DNN training. Therefore, there will be more emphasis on developing ASIC-based DNN training accelerators in the future. More research and breakthroughs in CPU-GPU heterogeneous architectures are required for more efficient DNN implementations. Special-purpose or data-center system-on-chips (SoCs) with embedded FPGA- or GPU-based machine learning accelerators appear to be gaining traction. For CGRA-based accelerators, programming-driven architectures might be interesting. Other directions include introducing processing-in-memory into CGRA architectures to address the data-movement bottleneck. Further improvements are also needed in architectures that support dynamic configuration, as this is an important step towards the widespread use of CGRAs.

The following trends may be observed in the development of edge AI accelerators in the future. Edge AI operates in a heterogeneous environment in which the data at the edge and the preprocessing techniques required for each sensor vary greatly between applications. Therefore, more customized, powerful, and energy-efficient chips for specific edge ML applications will be developed. Multimodal deep learning is a major development that pulls data from multiple sources to extract more granular features; by using these multimodal techniques, instead of just recognizing a car, the make and model of the car can be pinpointed. Other potential research directions include using distributed ML algorithms to speed up ML algorithm training and to reduce the amount of memory required for processing. ML applications at the edge require high accuracy; therefore, methods for implementing cutting-edge models at the edge while maintaining accuracy, based on deep learning model pruning and quantization, are among the new research directions. We will also see the development of customized as well as general SDK frameworks targeting specific or multiple edge accelerators for easy deployment of neural network applications.

X. CONCLUSION
Deep Neural Networks have recently gained popularity in a variety of applications. They are, however, computationally demanding, making them difficult to handle by general-purpose architectures. In this context, a detailed review of recent advances in DNN acceleration on specialized hardware architectures such as FPGA, ASIC, GPU, and CGRA is presented. Furthermore, embedded AI accelerators for the edge environment have been thoroughly discussed. The review begins with a detailed background of DNNs, with a focus on their key operations and applications. CNNs, which have a wide range of applications, have also been included in the review. To improve the performance of the hardware accelerator, we discussed various computing architectures such as temporal and spatial architectures, as well as different dataflow patterns. The review focused on recent advancements in the acceleration of DNNs on FPGA, ASIC, GPU, CGRA, and embedded AI accelerators. The review divided the FPGA-based accelerators into three categories and briefly discussed their key features, including the frameworks available for each. Similarly, ASIC-based accelerators are classified, and the review summarizes the accelerators available in the literature based on area, power dissipation, throughput, resource utilization, and so on. A comprehensive review of Nvidia's GPU-based accelerators was also presented. Furthermore, the review compared the various popular FPGA/ASIC/GPU-based accelerators. It has been observed that temporal architectures such as CPU and GPU dissipate more power than spatial architectures such as FPGA and ASIC; however, they have higher throughput than FPGA and ASIC. As a result, it is difficult to say that one architecture is superior to another, because it depends on the target application and requirements. Furthermore, the survey presented and compared recent research contributions in Arm-based machine learning processors and a few embedded AI hardware accelerators in terms of their cores, performance, power, availability of Software Development Kits (SDKs), and supported frameworks. Finally, the review suggests future research directions for DNN acceleration using various hardware architectures, including FPGA, ASIC, GPU, CGRA, and edge AI accelerators.

REFERENCES
[1] ULTRA96-V2. https://www.avnet.com/opasdata/d120001/medias/docus/198/5365-pb-ultra96-v2-v10b.pdf. (Accessed on 01/07/2022).
[2] Accelerate fast math with Intel® oneAPI Math Kernel Library. https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/onemkl.html#gs.3595x9. (Accessed on 06/05/2021).
[3] Advanced AI Embedded Systems: NVIDIA Jetson: The AI Platform for Autonomous Machines. https://www.nvidia.com/en-in/autonomous-machines/embedded-systems/. (Accessed on 08/02/2022).
[4] BeagleBone AI: Fast Track to Embedded Artificial Intelligence. https://beagleboard.org/AI. (Accessed on 01/02/2022).
[5] BitMain Neural Network SDK: Introduction. https://sophon-edge.gitbook.io/project/. (Accessed on 01/02/2022).
[6] DNA could store all of the world's data in one room - Science.

[7] DPU for Convolutional Neural Network. https://www.xilinx.com/ Proceedings of Machine Learning Research, pages 173–182, New York,
products/intellectual-property/dpu.html. (Accessed on 05/01/2022). New York, USA, 20–22 Jun 2016. PMLR.
[8] Edge TPU Developer Board. https://www.sophon.ai/product/introduce/ [35] A. Argal, S. Gupta, A. Modi, P. Pandey, S. Shim, and C. Choo. Intelligent
edb.html. (Accessed on 01/02/2022). travel chatbot for predictive recommendation in echo platform. In
[9] Gluon AI Co-Processor. https://alphaics.ai/products/ 2018 IEEE 8th Annual Computing and Communication Workshop and
gluon-ai-accelerator/. (Accessed on 01/10/2022). Conference (CCWC), pages 176–183, 2018.
[10] Intel® Movidius™ Myriad™ X Vision Processing Unit. [36] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by
https://www.intel.com/content/www/us/en/products/details/processors/ jointly learning to align and translate. CoRR, abs/1409.0473, 2015.
movidius-vpu/movidius-myriad-x.html. (Accessed on 04/02/2022). [37] Y. Bengio. Learning deep architectures for ai. Found. Trends Mach.
[11] Jetson Nano Developer Kit. https://www.nvidia.com/en-in/ Learn., 2(1):1–127, Jan. 2009.
autonomous-machines/embedded-systems/jetson-nano-developer-kit/. [38] K. Benkrid and S. Belkacemi. Design and implementation of a 2d
(Accessed on 08/02/2022). convolution core for video applications on fpgas. In Third International
[12] Kendryte K210. https://canaan.io/product/kendryteai. (Accessed on Workshop on Digital and Computational Video, 2002. DCV 2002. Pro-
09/02/2022). ceedings., pages 85–92, 2002.
[13] Maixduino. https://www.seeedstudio.com/ [39] M. Bergeron. Real-Time Face Recognition on
Sipeed-Maixduino-Kit-for-RISC-V-AI-IoT-p-4047.html. (Accessed on Ultra96-V2. https://www.hackster.io/AlbertaBeef/
09/02/2022). real-time-face-recognition-on-ultra96-v2-94de9b. (Accessed on
[14] MAX78000—Artificial Intelligence Microcontroller with Ultra-Low- 01/02/2022).
Power Convolutional Neural Network Accelerator. https://www. [40] Y. H. Bhosale and K. S. Patnaik. Application of deep learning techniques
maximintegrated.com/en/products/microcontrollers/MAX78000.html. in diagnosis of covid-19 (coronavirus): A systematic review. Neural
(Accessed on 01/10/2022). Processing Letters, Sep 2022.
[15] Myriad 2 MA2x5x Vision Processor: Transforming Devices [41] Y. H. Bhosale and K. Sridhar Patnaik. Iot deployable lightweight deep
Through Ultra Low-Power Machine Vision - Google Search. learning application for covid-19 detection with lung diseases using
https://www.intel.com/content/www/us/en/products/details/processors/ raspberrypi. In 2022 International Conference on IoT and Blockchain
movidius-vpu/movidius-myriad-x.html,www.movidius.com. (Accessed Technology (ICIBT), pages 1–6, 2022.
on 04/02/2022). [42] Y. H. Bhosale, S. Zanwar, Z. Ahmed, M. Nakrani, D. Bhuyar, and
[16] Nvidia a100 tensor core gpu architecture. 2020. available U. Shinde. Deep convolutional neural network based covid-19 classi-
online: https://www.nvidia.com/content/d am/en-zz/solutions/data- fication from radiology x-ray images for iot enabled devices. In 2022 8th
center/nvidia-ampere-architecture-whitepaper.pdf (accessed on 6 june International Conference on Advanced Computing and Communication
2020). - google search. (Accessed on 06/13/2021). Systems (ICACCS), volume 1, pages 1398–1402, 2022.
[17] Nvidia tesla v100 gpu architecture. 2017. available online: [43] L. Bishnoi and S. Narayan Singh. Artificial intelligence techniques used
https://images.nvidia.com/content/ technologies/volta/pdf/437 317- in medical sciences: A review. In 2018 8th International Conference on
volta-v100-ds-nv-us-web.pdf (accessed on 6 june 2020). - google search. Cloud Computing, Data Science Engineering (Confluence), pages 1–8,
(Accessed on 06/13/2021). 2018.
[18] PYNQ-Z2. http://www.pynq.io/board.html. (Accessed on 01/07/2022). [44] A. G. Blaiech, K. Ben Khalifa, C. Valderrama, M. A. Fernandes, and
[19] Silicon Labs BG24 and MG24 SoCs. https://www.silabs.com/wireless/ M. H. Bedoui. A survey and taxonomy of fpga-based deep learning
zigbee/efr32mg24-series-2-socs. (Accessed on 01/10/2022). accelerators. Journal of Systems Architecture, 98:331–345, 2019.
[20] TinyML Foundation. https://www.tinyml.org/. (Accessed on [45] S. Bouguezzi, H. B. Fredj, T. Belabed, C. Valderrama, H. Faiedh, and
09/02/2022). C. Souani. An efficient fpga-based convolutional neural network for
[21] Vitis Unified Software Platform. https://www.xilinx.com/products/ classification: Ad-mobilenet. Electronics, 10(18), 2021.
design-tools/vitis/vitis-platform.html. (Accessed on 10/01/2022). [46] A. Boutros, S. Yazdanshenas, and V. Betz. You cannot improve what you
[22] Vivado. https://www.xilinx.com/products/design-tools/vivado.html. (Ac- do not measure: Fpga vs. asic efficiency gaps for convolutional neural
cessed on 15/01/2022). network inference. ACM Trans. Reconfigurable Technol. Syst., 11(3),
[23] Xilinx Kria—Adaptive System-on-Module. https://www.xilinx.com/ dec 2018.
products/som/kria.html. (Accessed on 01/10/2022). [47] S. Cadambi, A. Majumdar, M. Becchi, S. Chakradhar, and H. P. Graf. A
[24] Xilinx Vitis AI Model Zoo. https://github.com/Xilinx/AI-Model-Zoo. programmable parallel accelerator for learning and classification. In 2010
(Accessed on 01/10/2022). 19th International Conference on Parallel Architectures and Compilation
[25] Ethos - ARM - WikiChip, Jul 2021. [Online; accessed 7. Aug. 2021]. Techniques (PACT), pages 273–283, 2010.
[26] Jetson Xavier NX, Aug. 2021. [Online; accessed 15. Jul. 2022]. [48] M. Capra, B. Bussolino, A. Marchisio, M. Shafique, G. Masera, and
[27] Coral products. https://coral.ai/products/, 2022. M. Martina. An Updated Survey of Efficient Hardware Architectures
[28] Deploy AI-Powered Autonomous Machines at Scale, July 2022. [Online; for Accelerating Deep Convolutional Neural Networks. Future Internet,
accessed 15. Jul. 2022]. 12(7):113, July 2020.
[29] Edge tpu compiler. https://coral.ai/docs/edgetpu/compiler/ [49] F. Cardells-Tormo, P.-L. Molinet, J. Sempere-Agullo, L. Baldez, and
#system-requirements, 2022. M. Bautista-Palacios. Area-efficient 2d shift-variant convolvers for fpga-
[30] High-Level Synthesis & Verification, July 2022. [Online; accessed 18. based digital image processing. In International Conference on Field
Jul. 2022]. Programmable Logic and Applications, 2005., pages 578–581, 2005.
[31] NVIDIA Tesla T4 Specs, July 2022. [Online; accessed 16. Jul. 2022]. [50] L. Cavigelli, D. Gschwend, C. Mayer, S. Willi, B. Muheim, and
[32] Vitis AI, June 2022. [Online; accessed 17. Jun. 2022]. L. Benini. Origami: A convolutional network accelerator. In Proceedings
[33] M. Alwani, H. Chen, M. Ferdman, and P. Milder. Fused-layer cnn of the 25th Edition on Great Lakes Symposium on VLSI, GLSVLSI ’15,
accelerators. In 2016 49th Annual IEEE/ACM International Symposium page 199–204, New York, NY, USA, 2015. Association for Computing
on Microarchitecture (MICRO), pages 1–12. IEEE, 2016. Machinery.
[34] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, [51] S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi. A dynam-
C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, J. Chen, J. Chen, ically configurable coprocessor for convolutional neural networks. In
Z. Chen, M. Chrzanowski, A. Coates, G. Diamos, K. Ding, N. Du, Proceedings of the 37th Annual International Symposium on Computer
E. Elsen, J. Engel, W. Fang, L. Fan, C. Fougner, L. Gao, C. Gong, Architecture, ISCA ’10, page 247–257, New York, NY, USA, 2010.
A. Hannun, T. Han, L. Johannes, B. Jiang, C. Ju, B. Jun, P. LeGresley, Association for Computing Machinery.
L. Lin, J. Liu, Y. Liu, W. Li, X. Li, D. Ma, S. Narang, A. Ng, S. Ozair, [52] J.-W. Chang and S.-J. Kang. Optimizing fpga-based convolutional neural
Y. Peng, R. Prenger, S. Qian, Z. Quan, J. Raiman, V. Rao, S. Satheesh, networks accelerator for image super-resolution. In 2018 23rd Asia and
D. Seetapun, S. Sengupta, K. Srinet, A. Sriram, H. Tang, L. Tang, South Pacific Design Automation Conference (ASP-DAC), pages 343–
C. Wang, J. Wang, K. Wang, Y. Wang, Z. Wang, Z. Wang, S. Wu, 348, 2018.
L. Wei, B. Xiao, W. Xie, Y. Xie, D. Yogatama, B. Yuan, J. Zhan, and [53] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam.
Z. Zhu. Deep speech 2 : End-to-end speech recognition in english and Diannao: A small-footprint high-throughput accelerator for ubiquitous
mandarin. In M. F. Balcan and K. Q. Weinberger, editors, Proceedings of machine-learning. In Proceedings of the 19th International Conference
The 33rd International Conference on Machine Learning, volume 48 of on Architectural Support for Programming Languages and Operating

Systems, ASPLOS ’14, page 269–284, New York, NY, USA, 2014. [76] L. Du and Y. Du. Hardware accelerator design for machine learning.
Association for Computing Machinery. In H. Farhadi, editor, Machine Learning, chapter 1. IntechOpen, Rijeka,
[54] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, 2018.
N. Sun, and O. Temam. Dadiannao: A machine-learning supercomputer. [77] L. Du, Y. Du, Y. Li, J. Su, Y.-C. Kuan, C.-C. Liu, and M.-C. F. Chang. A
In 2014 47th Annual IEEE/ACM International Symposium on Microar- reconfigurable streaming deep convolutional neural network accelerator
chitecture, pages 609–622, 2014. for internet of things. IEEE Transactions on Circuits and Systems I:
[55] Y. Chen, Y. Xie, L. Song, F. Chen, and T. Tang. A survey of accelerator Regular Papers, 65(1):198–208, 2018.
architectures for deep neural networks. Engineering, 6(3):264–274, 2020. [78] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen,
[56] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze. Eyeriss: An energy- and O. Temam. Shidiannao: Shifting vision processing closer to the
efficient reconfigurable accelerator for deep convolutional neural net- sensor. SIGARCH Comput. Archit. News, 43(3S):92–104, June 2015.
works. IEEE Journal of Solid-State Circuits, 52(1):127–138, 2017. [79] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen,
[57] Y.-H. Chen, T.-J. Yang, J. Emer, and V. Sze. Eyeriss v2: A flexible and O. Temam. Shidiannao: Shifting vision processing closer to the
accelerator for emerging deep neural networks on mobile devices. IEEE sensor. In Proceedings of the 42nd Annual International Symposium on
Journal on Emerging and Selected Topics in Circuits and Systems, Computer Architecture, ISCA ’15, page 92–104, New York, NY, USA,
9(2):292–308, 2019. 2015. Association for Computing Machinery.
[58] Z. Chen, H. S. Philip Wong, S. Mitra, A. Bol, L. Peng, G. Hills, and [80] C. Dubout and F. Fleuret. Exact acceleration of linear object detectors.
N. Thissen. Carbon nanotubes for high-performance logic. MRS In Proceedings of the 12th European Conference on Computer Vision
Bulletin, 39(8):719–726, Aug 2014. - Volume Part III, ECCV’12, page 301–311, Berlin, Heidelberg, 2012.
[59] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, Springer-Verlag.
and E. Shelhamer. cudnn: Efficient primitives for deep learning. ArXiv, [81] A. J. A. El-Maksoud, M. Ebbed, A. H. Khalil, and H. Mostafa. Power
abs/1410.0759, 2014. efficient design of high-performance convolutional neural networks hard-
[60] D. Chicco, P. Sadowski, and P. Baldi. Deep autoencoder neural networks ware accelerator on fpga: A case study with googlenet. IEEE Access,
for gene ontology annotation predictions. In Proceedings of the 5th ACM 9:151897–151911, 2021.
Conference on Bioinformatics, Computational Biology, and Health Infor- [82] H. Fan, S. Liu, M. Ferianc, H.-C. Ng, Z. Que, S. Liu, X. Niu, and W. Luk.
matics, BCB ’14, page 533–540, New York, NY, USA, 2014. Association A real-time object detection accelerator with compressed ssdlite on fpga.
for Computing Machinery. In 2018 International Conference on Field-Programmable Technology
[61] P.-S. Chiu, J.-W. Chang, M.-C. Lee, C.-H. Chen, and D.-S. Lee. Enabling (FPT), pages 14–21, 2018.
intelligent environment by the design of emotionally aware virtual assis- [83] X. Fan, D. Wu, W. Cao, W. Luk, and L. Wang. Stream processing dual-
tant: A case of smart campus. IEEE Access, 8:62032–62041, 2020. track cgra for object inference. IEEE Transactions on Very Large Scale
[62] Y.-k. Choi, K. You, J. Choi, and W. Sung. A real-time fpga-based 20 000- Integration (VLSI) Systems, 26(6):1098–1111, 2018.
word speech recognizer with optimized dram access. IEEE Transactions
[84] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and
on Circuits and Systems I: Regular Papers, 57(8):2119–2131, 2010.
Y. Lecun. NeuFlow: A Runtime-Reconfigurable Dataflow Processor for
[63] J. Choquette, W. Gandhi, O. Giroux, N. Stam, and R. Krashinsky.
Vision. 2011 IEEE Computer Society Conference on Computer Vision
Nvidia a100 tensor core gpu: Performance and innovation. IEEE Micro,
and Pattern Recognition Workshops (CVPRW), Jun 2011.
41(2):29–35, 2021.
[85] C. Farabet, C. Poulet, J. Han, and Y. LeCun. Cnp: An fpga-based pro-
[64] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep
cessor for convolutional networks. In FPL 09, FPL 09: 19th International
network learning by exponential linear units (elus). arXiv: Learning,
Conference on Field Programmable Logic and Applications, pages 32–
2016.
37, 2009. FPL 09: 19th International Conference on Field Programmable
[65] J. Cloutier, E. Cosatto, S. Pigeon, F. Boyer, and P. Simard. Vip: an
Logic and Applications ; Conference date: 31-08-2009 Through 02-09-
fpga-based processor for image processing and neural networks. In
2009.
Proceedings of Fifth International Conference on Microelectronics for
[86] C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun. Cnp: An fpga-based
Neural Networks, pages 330–336, 1996.
processor for convolutional networks. In 2009 International Conference
[66] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A matlab-like
on Field Programmable Logic and Applications, pages 32–37, 2009.
environment for machine learning. In NIPS 2011, 2011.
[67] J. Cong and B. Xiao. Minimizing computation in convolutional neural [87] X. Feng, H. Zhang, Y. Ren, P. Shang, Y. Zhu, Y. Liang, R. Guan, and
networks. In ICANN, 2014. D. Xu. The Deep Learning–Based Recommender System “Pubmender”
[68] M. Courbariaux and Y. Bengio. Binarynet: Training deep neural networks for Choosing a Biomedical Publication Venue: Development and Valida-
with weights and activations constrained to+ 1 or- 1. arxiv 2016. arXiv tion Study. J. Med. Internet Res., 21(5), May 2019.
preprint arXiv:1602.02830. [88] K. Fukushima. Neocognitron: A hierarchical neural network capable of
[69] G. Crocioni, D. Pau, J.-M. Delorme, and G. Gruosso. Li-ion batteries visual pattern recognition. Neural Networks, 1(2):119–130, 1988.
parameter estimation with tiny neural networks embedded on intelligent [89] A. Gainaru, E. Slusanschi, and S. Trausan-Matu. Mapping data mining
iot microcontrollers. IEEE Access, 8:122135–122146, 2020. algorithms on a GPU architecture: A study. In M. Kryszkiewicz,
[70] D. Danopoulos, C. Kachris, and D. Soudris. Acceleration of image H. Rybinski, A. Skowron, and Z. W. Ras, editors, Foundations of In-
classification with caffe framework using fpga. In 2018 7th International telligent Systems - 19th International Symposium, ISMIS 2011, Warsaw,
Conference on Modern Circuits and Systems Technologies (MOCAST), Poland, June 28-30, 2011. Proceedings, volume 6804 of Lecture Notes in
pages 1–4, 2018. Computer Science, pages 102–112. Springer, 2011.
[71] L. Deng and D. Yu. Deep learning: Methods and applications. Found. [90] C. Gartenberg. ARM’s new edge AI chips promise IoT devices that won’t
Trends Signal Process., 7(3–4):197–387, June 2014. need the cloud. Verge, Feb 2020.
[72] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. de Freitas. Predicting [91] A. Ghaffari and Y. Savaria. Cnn2gate: An implementation of convolu-
parameters in deep learning. In Proceedings of the 26th International tional neural networks inference on fpgas with automated design space
Conference on Neural Information Processing Systems - Volume 2, exploration. Electronics, 2020.
NIPS’13, page 2148–2156, Red Hook, NY, USA, 2013. Curran Asso- [92] V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello. A 240
ciates Inc. g-ops/s mobile coprocessor for deep neural networks. In 2014 IEEE
[73] A. Deshpande. A Beginner’s Guide To Understanding Convolutional Conference on Computer Vision and Pattern Recognition Workshops,
Neural Networks. UCLA (‘19), 2018. pages 696–701, 2014.
[74] J. Domke, E. Vatai, A. Drozd, P. ChenT, Y. Oyama, L. Zhang, S. Salaria, [93] K. M. V. Gowda, S. Madhavan, S. Rinaldi, P. B. Divakarachari, and
D. Mukunoki, A. Podobas, M. WahibT, and S. Matsuoka. Matrix engines A. Atmakur. Fpga-based reconfigurable convolutional neural network
for high performance computing: A paragon of performance or grasping accelerator using sparse and convolutional optimization. Electronics,
at straws? In 2021 IEEE International Parallel and Distributed Processing 11(10), 2022.
Symposium (IPDPS), pages 1056–1065, Los Alamitos, CA, USA, may [94] H. Graf, S. Cadambi, V. Jakkula, M. Sankaradass, E. Cosatto, S. Chakrad-
2021. IEEE Computer Society. har, and I. Dourdanovic. A massively parallel digital learning proces-
[75] Y. Dong, F. Sun, Z. Ping, Q. Ouyang, and L. Qian. DNA storage: research sor. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors,
landscape and future prospects. National Science Review, 7(6):1092– Advances in Neural Information Processing Systems, volume 21. Curran
1107, 01 2020. Associates, Inc., 2009.

[95] S. Grigorescu, B. Trasnea, T. T. Cocias, and G. Macesanu. A sur- [116] T. Instruments. Am5729 sitara processor. URL: https://www. ti. com/pro-
vey of deep learning techniques for autonomous driving. ArXiv, duct/AM5729, 2015.
abs/1910.07738, 2020. [117] H. Irmak, N. Alachiotis, and D. Ziener. An energy-efficient fpga-based
[96] Y. Guan, H. Liang, N. Xu, W. Wang, S. Shi, X. Chen, G. Sun, W. Zhang, convolutional neural network implementation. In 2021 29th Signal
and J. Cong. Fp-dnn: An automated framework for mapping deep Processing and Communications Applications Conference (SIU), pages
neural networks onto fpgas with rtl-hls hybrid templates. In 2017 IEEE 1–4. IEEE, 2021.
25th Annual International Symposium on Field-Programmable Custom [118] H. Irmak, F. Corradi, P. Detterer, N. Alachiotis, and D. Ziener. A dynamic
Computing Machines (FCCM), pages 152–159, 2017. reconfigurable architecture for hybrid spiking and convolutional fpga-
[97] K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han, Y. Wang, based neural network designs. Journal of Low Power Electronics and
and H. Yang. Angel-eye: A complete design flow for mapping cnn Applications, 11(3), 2021.
onto embedded fpga. IEEE Transactions on Computer-Aided Design of [119] S. M. A. H. Jafri, T. N. Gia, S. Dytckov, M. Daneshtalab, A. Hemani,
Integrated Circuits and Systems, 37(1):35–47, 2018. J. Plosila, and H. Tenhunen. Neurocgra: A cgra with support for
[98] K. Guo, S. Zeng, J. Yu, Y. Wang, and H. Yang. [DL] A Survey neural networks. In 2014 International Conference on High Performance
of FPGA-based Neural Network Inference Accelerators. ACM Trans. Computing & Simulation (HPCS), pages 506–511, 2014.
Reconfigurable Technol. Syst., 12(1):1–26, Mar. 2019. [120] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick,
[99] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast
learning with limited numerical precision. In Proceedings of the 32nd feature embedding. CoRR, abs/1408.5093, 2014.
International Conference on International Conference on Machine Learn- [121] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa,
ing - Volume 37, ICML’15, page 1737–1746. JMLR.org, 2015. S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin,
[100] F. G. Gustavson. Two fast algorithms for sparse matrices: Multiplication C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb,
and permuted transposition. ACM Trans. Math. Softw., 4(3):250–269, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho,
sep 1978. D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski,
[101] A. Guzhva, S. Dolenko, and I. Persiantsev. Multifold acceleration of A. Kaplan, H. Khaitan, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law,
neural network computations using gpu. In Proceedings of the 19th D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore,
International Conference on Artificial Neural Networks: Part I, ICANN M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix,
’09, page 373–380, Berlin, Heidelberg, 2009. Springer-Verlag. T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross,
[102] R. Hadsell, P. Sermanet, J. Ben, A. Erkan, M. Scoffier, K. Kavukcuoglu, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter,
U. Muller, and Y. LeCun. Learning long-range vision for autonomous D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle,
off-road driving. J. Field Robot., 26(2):120–144, Feb. 2009. V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon. In-
datacenter performance analysis of a tensor processing unit, 2017.
[103] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao,
[122] D. Justus, J. Brennan, S. Bonner, and A. S. McGough. Predicting the
Y. Wang, H. Yang, and W. J. Dally. Ese: Efficient speech recognition
computational cost of deep learning models. In 2018 IEEE International
engine with sparse lstm on fpga. Proceedings of the 2017 ACM/SIGDA
Conference on Big Data (Big Data), pages 3873–3882, 2018.
International Symposium on Field-Programmable Gate Arrays, 2017.
[123] S. Kalapothas, G. Flamis, and P. Kitsos. Efficient edge-ai application
[104] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and
deployment for fpgas. Information, 13(6), 2022.
W. J. Dally. Eie: Efficient inference engine on compressed deep neural
[124] A. Karpathy. convolutional neural networks for visual recognition, 2018.
network. In 2016 ACM/IEEE 43rd Annual International Symposium on
[125] M. Kavitha, R. Srinivasan, and R. Bhuvanya. Fake News Detection Using
Computer Architecture (ISCA), pages 243–254, 2016.
Machine Learning Algorithms, chapter 10, pages 181–207. John Wiley
[105] F. Hannig, V. Lari, S. Boppu, A. Tanase, and O. Reiche. Invasive tightly-
& Sons, Ltd, 2022.
coupled processor arrays: A domain-specific architecture/compiler co-
[126] H. Khan, A. Khan, Z. Khan, L. B. Huang, K. Wang, and L. He. NPE: An
design approach. ACM Transactions on Embedded Computing Systems
FPGA-based overlay processor for natural language processing. In The
(TECS), 13(4s):1–29, 2014.
2021 ACM/SIGDA International Symposium on Field-Programmable
[106] C. Hao, A. Sarwari, Z. Jin, H. Abu-Haimed, D. Sew, Y. Li, X. Liu,
Gate Arrays. ACM, feb 2021.
B. Wu, D. Fu, J. Gu, and D. Chen. A hybrid gpu + fpga system design
[127] J.-Y. Kim. Chapter five - fpga based neural network accelerators. In
for autonomous driving cars. In 2019 IEEE International Workshop on
S. Kim and G. C. Deka, editors, Hardware Accelerator Systems for
Signal Processing Systems (SiPS), pages 121–126, 2019.
Artificial Intelligence and Machine Learning, volume 122 of Advances
[107] L. Hardesty. Researchers build an all-optical transistor. in Computers, pages 135–165. Elsevier, 2021.
[108] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Sur- [128] Y. Kim, J. Lee, J.-S. Kim, H. Jei, and H. Roh. Efficient multi-gpu
passing human-level performance on imagenet classification. 2015 IEEE memory management for deep learning acceleration. In 2018 IEEE
International Conference on Computer Vision (ICCV), pages 1026–1034, 3rd International Workshops on Foundations and Applications of Self*
2015. Systems (FAS*W), pages 37–43, 2018.
[109] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image [129] J. P. Klock, J. Corrêa, M. Bessa, J. Arias-Garcia, F. Barboza, and
recognition. In Proceedings of the IEEE Conference on Computer Vision C. Meinertz. A new automated energy meter fraud detection system based
and Pattern Recognition (CVPR), June 2016. on artificial intelligence. In 2021 XI Brazilian Symposium on Computing
[110] D. Hefenbrock, J. Oberg, N. T. N. Thanh, R. Kastner, and S. B. Baden. Systems Engineering (SBESC), pages 1–8, 2021.
Accelerating viola-jones face detection to fpga-level using gpus. In 2010 [130] A. Kojima and Y. Nose. Development of an autonomous driving robot
18th IEEE Annual International Symposium on Field-Programmable car using fpga. In 2018 International Conference on Field-Programmable
Custom Computing Machines, pages 11–18, 2010. Technology (FPT), pages 411–414, 2018.
[111] C. Heidorn, M. Witterauf, F. Hannig, and J. Teich. Efficient mapping of [131] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification
cnns onto tightly coupled processor arrays. J. Comput., 14(8):541–556, with deep convolutional neural networks. Commun. ACM, 60(6):84–90,
2019. May 2017.
[112] A. Howard and S. Gupta. Introducing the next generation of on-device vi- [132] H. Kwon, A. Samajdar, and T. Krishna. Maeri: Enabling flexible
sion models: Mobilenetv3 and mobilenetedgetpu. https://ai.googleblog. dataflow mapping over dnn accelerators via reconfigurable interconnects.
com/2019/11/introducing-next-generation-on-device.html, 2020. SIGPLAN Not., 53(2):461–475, Mar. 2018.
[113] H. Hu, J. Li, C. Wu, X. Li, and Y. Chen. Design and Implementation of [133] A. Lavin and S. Gray. Fast algorithms for convolutional neural networks.
Intelligent Speech Recognition System Based on FPGA. J. Phys. Conf. 2016 IEEE Conference on Computer Vision and Pattern Recognition
Ser., 2171(1):012010, Jan. 2022. (CVPR), pages 4013–4021, 2016.
[114] A. S. Hussein, A. Anwar, Y. Fahmy, H. Mostafa, K. N. Salama, and [134] J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H.-J. Yoo. Unpu: A
M. Kafafy. Implementation of a dpu-based intelligent thermal imaging 50.6tops/w unified deep neural network accelerator with 1b-to-16b fully-
hardware accelerator on fpga. Electronics, 11(1), 2022. variable weight bit-precision. In 2018 IEEE International Solid - State
[115] D. Im, D. Han, S. Choi, S. Kang, and H.-J. Yoo. Dt-cnn: Dilated and Circuits Conference - (ISSCC), pages 218–220, 2018.
transposed convolution neural network accelerator for real-time image [135] J. Lee and J. Lee. Np-cgra: Extending cgras for efficient processing of
segmentation on mobile devices. In 2019 IEEE International Symposium light-weight deep neural networks. In 2021 Design, Automation & Test
on Circuits and Systems (ISCAS), pages 1–5, 2019. in Europe Conference & Exhibition (DATE), pages 1408–1413, 2021.

[136] J. Lee, J. Lee, D. Han, J. Lee, G. Park, and H.-J. Yoo. 7.7 lnpu: A [159] D. Moolchandani, A. Kumar, and S. R. Sarangi. Accelerating cnn infer-
25.3tflops/w sparse deep-neural-network learning processor with fine- ence on asics: A survey. Journal of Systems Architecture, 113:101887,
grained mixed precision of fp8-fp16. 2019 IEEE International Solid- 2021.
State Circuits Conference - (ISSCC), pages 142–144, 2019. [160] N. Muralimanohar, R. Balasubramonian, and N. Jouppi. Optimizing nuca
[137] J. Lee and H.-J. Yoo. An overview of energy-efficient hardware acceler- organizations and wiring alternatives for large caches with cacti 6.0. In
ators for on-device deep-neural-network training. IEEE Open Journal of 40th Annual IEEE/ACM International Symposium on Microarchitecture
the Solid-State Circuits Society, 1:115–128, 2021. (MICRO 2007), pages 3–14, 2007.
[138] M. Lee, K. Hwang, J. Park, S. Choi, S. Shin, and W. Sung. Fpga- [161] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltz-
based low-power speech recognition with recurrent neural networks. In mann machines. In Proceedings of the 27th International Conference on
2016 IEEE International Workshop on Signal Processing Systems (SiPS), International Conference on Machine Learning, ICML’10, page 807–814,
pages 230–235, 2016. Madison, WI, USA, 2010. Omnipress.
[139] D. Lewin. Dna computing. Computing in Science & Engineering, 4(3):5– [162] D. T. Nguyen, T. N. Nguyen, H. Kim, and H.-J. Lee. A high-throughput
8, 2002. and power-efficient fpga implementation of yolo cnn for object detection.
[140] B. Li, E. Zhou, B. Huang, J. Duan, Y. Wang, N. Xu, J. Zhang, and IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
H. Yang. Large scale recurrent neural network on gpu. In 2014 Interna- 27(8):1861–1873, 2019.
tional Joint Conference on Neural Networks (IJCNN), pages 4062–4069, [163] T. Ngyen, S. M. Jafri, M. Daneshtalab, A. Hemani, S. Dytckov, J. Plosila,
2014. and H. Tenhunen. Fist: A framework to interleave spiking neural
[141] X. Lian, Z. Liu, Z. Song, J. Dai, W. Zhou, and X. Ji. High- networks on cgras. In 2015 23rd Euromicro International Conference
performance fpga-based cnn accelerator with block-floating-point arith- on Parallel, Distributed, and Network-Based Processing, pages 751–758,
metic. IEEE Transactions on Very Large Scale Integration (VLSI) 2015.
Systems, 27(8):1874–1885, 2019. [164] R. Nikhil. Bluespec system verilog: efficient, correct rtl from high level
[142] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Teman, X. Feng, X. Zhou, specifications. In Proceedings. Second ACM and IEEE International
and Y. Chen. Pudiannao: A polyvalent machine learning accelerator. In Conference on Formal Methods and Models for Co-Design, 2004. MEM-
Proceedings of the Twentieth International Conference on Architectural OCODE ’04., pages 69–70, 2004.
Support for Programming Languages and Operating Systems, ASPLOS [165] E. Nurvitadhi, D. Sheffield, Jaewoong Sim, A. Mishra, G. Venkatesh,
’15, page 369–381, New York, NY, USA, 2015. Association for Comput- and D. Marr. Accelerating binarized neural networks: Comparison of
ing Machinery. fpga, cpu, gpu, and asic. In 2016 International Conference on Field-
[143] A. Ltd. Learn more about the Linaro Machine Learning Initiative. Arm | Programmable Technology (FPT), pages 77–84, 2016.
The Architecture for the Digital World, Jan 2019. [166] E. Nurvitadhi, G. Venkatesh, J. Sim, D. Marr, R. Huang, J. Ong
[144] A. Ltd. Ethos-U55 – Arm Developer, Aug 2021. [Online; accessed 7. Gee Hock, Y. T. Liew, K. Srivatsan, D. Moss, S. Subhaschandra, and
Aug. 2021]. G. Boudoukh. Can fpgas beat gpus in accelerating next-generation deep
[145] A. Ltd. High-Performing AI Solutions to Transform our Digital World, neural networks? In Proceedings of the 2017 ACM/SIGDA International
Aug 2021. [Online; accessed 7. Aug. 2021]. Symposium on Field-Programmable Gate Arrays, FPGA ’17, page 5–14,
[146] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li. Flexflow: A flexible New York, NY, USA, 2017. Association for Computing Machinery.
dataflow accelerator architecture for convolutional neural networks. In [167] M. T. Nyamukuru and K. M. Odame. Tiny eats: Eating detection on a
2017 IEEE International Symposium on High Performance Computer microcontroller, 2020.
Architecture (HPCA), pages 553–564, 2017. [168] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan,
[147] T. Luo, S. Liu, L. Li, Y. Wang, S. Zhang, T. Chen, Z. Xu, O. Temam, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally. Scnn: An accelerator
and Y. Chen. Dadiannao: A neural network supercomputer. IEEE for compressed-sparse convolutional neural networks. In Proceedings
Transactions on Computers, 66(1):73–88, 2017. of the 44th Annual International Symposium on Computer Architecture,
[148] T. Luong, H. Pham, and C. D. Manning. Effective approaches to ISCA ’17, page 27–40, New York, NY, USA, 2017. Association for
attention-based neural machine translation. In Proceedings of the 2015 Computing Machinery.
Conference on Empirical Methods in Natural Language Processing, [169] S.-W. Park, J. Park, K. Bong, D. Shin, J. Lee, S. Choi, and H.-J.
pages 1412–1421, Lisbon, Portugal, Sept. 2015. Association for Com- Yoo. An energy-efficient and scalable deep learning/inference processor
putational Linguistics. with tetra-parallel mimd architecture for big data applications. IEEE
[149] P. Lv, W. Liu, and J. Li. A fpga-based accelerator implementaion for Transactions on Biomedical Circuits and Systems, 9(6):838–848, 2015.
yolov2 object detection using winograd algorithm. In 2020 5th Inter- [170] M. Peemen, A. A. A. Setio, B. Mesman, and H. Corporaal. Memory-
national Conference on Mechanical, Control and Computer Engineering centric accelerator design for convolutional neural networks. In 2013
(ICMCCE), pages 1894–1898, 2020. IEEE 31st International Conference on Computer Design (ICCD), pages
[150] A. L. Maas. Rectifier nonlinearities improve neural network acoustic 13–19, 2013.
models. 2013. [171] P.-H. Pham, D. Jelaca, C. Farabet, B. Martini, Y. LeCun, and E. Culur-
[151] R. Machupalli, M. Hossain, and M. Mandal. Review of asic accelerators ciello. Neuflow: Dataflow vision processing system-on-a-chip. In 2012
for deep neural network. Microprocessors and Microsystems, 89:104441, IEEE 55th International Midwest Symposium on Circuits and Systems
2022. (MWSCAS), pages 1044–1047, 2012.
[152] M. Mathieu, M. Henaff, and Y. LeCun. Fast training of convolutional [172] M. Pietras. Hardware conversion of neural networks simulation models
networks through ffts. CoRR, abs/1312.5851, 2014. for neural processing accelerator implemented as fpga-based soc. In
[153] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins. Adres: 2014 24th International Conference on Field Programmable Logic and
An architecture with tightly coupled vliw processor and coarse-grained Applications (FPL), pages 1–4, 2014.
reconfigurable matrix. In P. Y. K. Cheung and G. A. Constantinides, [173] T. Posewsky and D. Ziener. Efficient deep neural network acceleration
editors, Field Programmable Logic and Application, pages 61–70, Berlin, through fpga-based batch processing. In 2016 International Conference
Heidelberg, 2003. Springer Berlin Heidelberg. on ReConFigurable Computing and FPGAs (ReConFig), pages 1–8,
[154] J. Misra and I. Saha. Artificial neural networks in hardware: A survey of 2016.
two decades of progress. Neurocomputing, 74:239–255, 2010. [174] T. Posewsky and D. Ziener. Throughput optimizations for fpga-based
[155] S. Mittal. A survey of fpga-based accelerators for convolutional neural deep neural network inference. Microprocessors and Microsystems,
networks. Neural Computing and Applications, 32(4):1109–1139, Feb 60:151–161, 2018.
2020. [175] S. Prakash, T. Callahan, J. Bushagour, C. Banbury, A. V. Green, P. War-
[156] S. Mittal and J. S. Vetter. A survey of cpu-gpu heterogeneous computing den, T. Ansell, and V. J. Reddi. Cfu playground: Full-stack open-source
techniques. ACM Computing Surveys (CSUR), 47:1 – 35, 2015. framework for tiny machine learning (tinyml) acceleration on fpgas.
[157] P. Mohan, A. J. Paul, and A. Chirania. A tiny CNN architecture for arXiv preprint arXiv:2201.01863, 2022.
medical face mask detection for resource-constrained endpoints. In Lec- [176] W. Qadeer, R. Hameed, O. Shacham, P. Venkatesan, C. Kozyrakis, and
ture Notes in Electrical Engineering, pages 657–670. Springer Singapore, M. Horowitz. Convolution engine: Balancing efficiency and flexibility in
2021. specialized computing. Commun. ACM, 58(4):85–93, Mar. 2015.
[158] J. J. Moolayil. A Layman’s Guide to Deep Neural Networks - Towards [177] E. Qin, A. Samajdar, H. Kwon, V. Nadella, S. Srinivasan, D. Das, B. Kaul,
Data Science. Medium, May 2020. and T. Krishna. Sigma: A sparse and irregular gemm accelerator with

flexible interconnects for dnn training. In 2020 IEEE International Sym- [198] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-
posium on High Performance Computer Architecture (HPCA), pages 58– s. Seo, and Y. Cao. Throughput-optimized opencl-based fpga accelerator
70, 2020. for large-scale convolutional neural networks. In Proceedings of the 2016
[178] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, ACM/SIGDA International Symposium on Field-Programmable Gate
S. Song, Y. Wang, and H. Yang. Going deeper with embedded fpga Arrays, FPGA ’16, page 16–25, New York, NY, USA, 2016. Association
platform for convolutional neural network. In Proceedings of the 2016 for Computing Machinery.
ACM/SIGDA International Symposium on Field-Programmable Gate [199] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J. sun
Arrays, FPGA ’16, page 26–35, New York, NY, USA, 2016. Association Seo, and Y. Cao. Throughput-optimized opencl-based fpga accelerator
for Computing Machinery. for large-scale convolutional neural networks. In FPGA 2016 - Pro-
[179] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kep- ceedings of the 2016 ACM/SIGDA International Symposium on Field-
ner. AI accelerator survey and trends. In 2021 IEEE High Performance Programmable Gate Arrays, pages 16–25. Association for Computing
Extreme Computing Conference (HPEC). IEEE, sep 2021. Machinery, Inc, Feb. 2016. 2016 ACM/SIGDA International Symposium
[180] M. Rhu, N. Gimelshein, J. Clemons, A. Zulfiqar, and S. W. Keckler. on Field-Programmable Gate Arrays, FPGA 2016 ; Conference date: 21-
Vdnn: Virtualized deep neural networks for scalable, memory-efficient 02-2016 Through 23-02-2016.
neural network design. In The 49th Annual IEEE/ACM International [200] M. Svedin, S. W. D. Chien, G. Chikafa, N. Jansson, and A. Podobas.
Symposium on Microarchitecture, MICRO-49. IEEE Press, 2016. Benchmarking the nvidia gpu lineage: From early k80 to modern a100
[181] T. Ridnik, H. Lawen, A. Noy, and I. Friedman. Tresnet: High perfor- with asynchronous memory transfers. Proceedings of the 11th Interna-
mance gpu-dedicated architecture. ArXiv, abs/2003.13630, 2020. tional Symposium on Highly Efficient Accelerators and Reconfigurable
[182] S. Saha. A Comprehensive Guide to Convolutional Neural Networks — Technologies, 2021.
the ELI5 way, 2018. [201] D.-F. Syu, S. Syu, S.-J. Ruan, Y.-C. Huang, and C.-K. Yang. Fpga imple-
[183] M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar, I. Durdanovic, mentation of automatic speech recognition system in a car environment.
E. Cosatto, and H. P. Graf. A massively parallel coprocessor for convolu- 2015 IEEE 4th Global Conference on Consumer Electronics (GCCE),
tional neural networks. In 2009 20th IEEE International Conference on pages 485–486, 2015.
Application-specific Systems, Architectures and Processors, pages 53– [202] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer. Efficient processing of
60, 2009. deep neural networks: A tutorial and survey. Proceedings of the IEEE,
[184] V. Sati, S. M. Sánchez, N. Shoeibi, A. Arora, and J. M. Corchado. Face 105(12):2295–2329, 2017.
detection and recognition, face emotion recognition through nvidia jetson [203] M. A. Talib, S. Majzoub, Q. Nasir, and D. Jamal. A systematic literature
nano. In International Symposium on Ambient Intelligence, pages 177– review on hardware implementation of artificial intelligence algorithms.
185. Springer, 2020. J. Supercomput., 77(2):1897–1938, Feb. 2021.
[204] M. Tanomoto, S. Takamaeda-Yamazaki, J. Yao, and Y. Nakashima. A
[185] S. Sağlam, F. Tat, and S. Bayar. Fpga implementation of cnn algorithm for
cgra-based approach for accelerating convolutional neural networks. In
detecting malaria diseased blood cells. In 2019 International Symposium
2015 IEEE 9th International Symposium on Embedded Multicore/Many-
on Advanced Electrical and Communication Technologies (ISAECT),
core Systems-on-Chip, pages 73–80, 2015.
pages 1–5, 2019.
[205] Y. Tkachenko. Autonomous crm control via clv approximation with deep
[186] U. Schmidt and S. Roth. Shrinkage fields for effective image restoration.
reinforcement learning in discrete and continuous action space. arXiv
In 2014 IEEE Conference on Computer Vision and Pattern Recognition,
preprint arXiv:1504.01840, 2015.
pages 2774–2781, 2014.
[206] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre,
[187] D. Selvathi, R. D. Nayagam, D. J. Hemanth, and V. E. Balas. FPGA
and K. Vissers. Finn: A framework for fast, scalable binarized neural net-
implementation of on-chip ANN for breast cancer diagnosis. Intell.
work inference. In Proceedings of the 2017 ACM/SIGDA international
Decis. Technol., 10(4):341–352, Dec. 2016.
symposium on field-programmable gate arrays, pages 65–74, 2017.
[188] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao,
[207] A. Vasudevan, A. Anderson, and D. Gregg. Parallel multi channel
A. Mishra, and H. Esmaeilzadeh. From high-level deep neural models
convolution using general matrix multiplication. 2017 IEEE 28th Inter-
to fpgas. In 2016 49th Annual IEEE/ACM International Symposium on
national Conference on Application-specific Systems, Architectures and
Microarchitecture (MICRO), pages 1–12, 2016.
Processors (ASAP), pages 19–24, 2017.
[189] H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, V. Chandra, and H. Es- [208] S. I. Venieris and C.-S. Bouganis. fpgaconvnet: A framework for mapping
maeilzadeh. Bit fusion: Bit-level dynamically composable architecture convolutional neural networks on fpgas. In 2016 IEEE 24th Annual
for accelerating deep neural network. In 2018 ACM/IEEE 45th Annual International Symposium on Field-Programmable Custom Computing
International Symposium on Computer Architecture (ISCA), pages 764– Machines (FCCM), pages 40–47, 2016.
775, 2018. [209] T. Viet Huynh. Fpga-based acceleration for convolutional neural net-
[190] R. Shi, H. Xu, B. Chen, Z. Zhang, and L.-M. Peng. Scalable fabrication works on pynq-z2. International Journal Of Computing and Digital
of graphene devices through photolithography. Applied Physics Letters, System, 2021.
102(11):113102, 2013. [210] C. Wang, L. Gong, Q. Yu, X. Li, Y. Xie, and X. Zhou. Dlau: A scalable
[191] D. Shin, J. Lee, J. Lee, and H.-J. Yoo. 14.2 dnpu: An 8.1tops/w reconfig- deep learning accelerator unit on fpga. IEEE Transactions on Computer-
urable cnn-rnn processor for general-purpose deep neural networks. In Aided Design of Integrated Circuits and Systems, 36(3):513–517, 2017.
2017 IEEE International Solid-State Circuits Conference (ISSCC), pages [211] J. Wang and S. Gu. Fpga implementation of object detection accelerator
240–241, 2017. based on vitis-ai. In 2021 11th International Conference on Information
[192] M. M. Shulaker, G. Hills, R. S. Park, R. T. Howe, K. Saraswat, H.-S. P. Science and Technology (ICIST), pages 571–577, 2021.
Wong, and S. Mitra. Three-dimensional integration of nanotechnologies [212] T. Wang, C. Wang, X. Zhou, and H. Chen. A Survey of FPGA Based
for computing and data storage on a single chip. Nature, 547(7661):74– Deep Learning Accelerators: Challenges and Opportunities. ArXiv, Dec.
78, Jul 2017. 2018.
[193] K. Simonyan and A. Zisserman. Very deep convolutional networks for [213] Y. Wang, J. Xu, Y. Han, H. Li, and X. Li. Deepburning: Automatic
large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. generation of fpga-based learning accelerators for the neural network
[194] G. W. Smith and F. F. Leymarie. The Machine as Artist: An Introduction. family. In 2016 53nd ACM/EDAC/IEEE Design Automation Conference
Arts, 6(2):5, Apr 2017. (DAC), pages 1–6, 2016.
[195] S. Srinivas, R. K. Sarvadevabhatla, K. R. Mopuri, N. Prabhu, S. S. S. [214] S. Williams, A. Waterman, and D. Patterson. Roofline: an insightful
Kruthiventi, and R. V. Babu. A taxonomy of deep convolutional neural visual performance model for multicore architectures. Communications
nets for computer vision. Frontiers in Robotics and AI, 2:36, 2016. of the ACM, 52(4):65–76, 2009.
[196] V. Sriram, D. Cox, K. H. Tsoi, and W. Luk. Towards an embedded [215] K. B. Wim Vanderbauwhede. High-Performance Computing Using
biologically-inspired machine vision processor. In 2010 International FPGAs. Springer, New York, NY, USA, 2013.
Conference on Field-Programmable Technology, pages 273–278, 2010. [216] W. G. Wong. More Details Emerge About Arm’s Machine Learning.
[197] D. Strigl, K. Kofler, and S. Podlipnig. Performance and scalability of gpu- Electronic Design, Jun 2018.
based convolutional neural networks. In 2010 18th Euromicro Confer- [217] B. Wu, A. Wan, F. Iandola, P. H. Jin, and K. Keutzer. Squeezedet:
ence on Parallel, Distributed and Network-based Processing, pages 317– Unified, small, low power fully convolutional neural networks for real-
324, 2010. time object detection for autonomous driving. In 2017 IEEE Conference
