Efficient Hardware Architectures For Accelerating Deep Neural Networks Survey
ABSTRACT In the modern-day era of technology, a paradigm shift has been witnessed in the areas
involving applications of Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL).
Specifically, Deep Neural Networks (DNNs) have emerged as a popular field of interest in most AI
applications such as computer vision, image and video processing, robotics, etc. In the context of
developed digital technologies and the availability of authentic data and data handling infrastructure, DNNs
have been a credible choice for solving complex real-life problems. In certain situations, the performance and
accuracy of a DNN can even surpass that of human intelligence. However, it is noteworthy that DNNs are
computationally demanding, requiring substantial resources and time to handle these computations.
Furthermore, general-purpose architectures like CPUs have issues in handling such computationally
intensive algorithms. Therefore, a lot of interest and efforts have been invested by the research fraternity
in specialized hardware architectures such as Graphics Processing Unit (GPU), Field Programmable Gate
Array (FPGA), Application Specific Integrated Circuit (ASIC), and Coarse Grained Reconfigurable Array
(CGRA) in the context of effective implementation of computationally intensive algorithms. This paper
brings forward the various research works carried out on the development and deployment of DNNs
using the aforementioned specialized hardware architectures and embedded AI accelerators. The review
provides a detailed description of the specialized hardware-based accelerators used in the training and/or
inference of DNNs. A comparative study of the accelerators discussed, based on factors such as power, area,
and throughput, is also presented. Finally, future research and development directions are discussed,
such as future trends in DNN implementation on specialized hardware accelerators. This review article is
intended to serve as a guide for hardware architectures for accelerating and improving the effectiveness
of deep learning research.
INDEX TERMS Machine Learning, Field Programmable Gate Array (FPGA), Deep Neural Networks
(DNN), Deep Learning (DL), Application Specific Integrated Circuits (ASIC), Artificial Intelligence (AI),
Central Processing Unit (CPU), Graphics Processing Unit (GPU), Hardware Accelerators
learning, machine learning, and AI is illustrated in Fig. 1.

Fig. 1. AI vs. Machine Learning vs. Deep Learning. Artificial Intelligence: the science and engineering of making intelligent systems. Machine Learning: the field of study that gives computers the ability to learn without being explicitly programmed. Deep Learning: a technique to perform machine learning with algorithms inspired by the human brain's own network of neurons.

Nowadays, DNNs are used in many modern AI applications, including bioinformatics [60], natural language processing [148], image restoration [186], speech recognition [34], computer vision [195], machine translation [36], healthcare [43], finance [223], robotics [95], visual art processing [194], etc. Furthermore, recent applications of DNNs include aerospace and defence, automated driving, recommendation systems, and industrial automation [71], [87], [102], [217]. DNNs are also useful in a variety of applications, such as news aggregation and fraud detection [125], virtual assistants [61], chatbots [35], and customer relationship management systems [205]. In addition, DNNs have also been used to diagnose COVID-19 by classifying it based on different lung and chest imaging modalities [40].

DNNs contain many layers, and each layer is capable of detecting features at different levels. For instance, in pattern recognition, where the input is available in pixel form, the first layer of a DNN extracts minor details of the image, such as curves and edges. The outputs of this first layer act as inputs to the second layer. The image's primary details, such as squares and semi-circles, are extracted by the second layer. The outputs of the second layer act as inputs to the third layer. The third layer extracts parts of objects. Furthermore, each subsequent layer uses the previous layer's output and extracts more aspects of the objects. As the number of layers increases, the DNN extracts increasingly complicated features and complete objects [73]. DNNs provide superior accuracy and performance at the cost of high computational complexity. For instance, AlexNet [131] takes about 1.4 giga operations (GOP) to process a single image of size 224×224 with a top-1 accuracy of 61%, while ResNet-152 [109] takes 22.6 GOP with a top-1 accuracy of 79.3%. A DNN's superior accuracy and performance are due to its capacity to extract more complex high-level features, such as objects and facial structures, from raw input data.

DNNs are computationally expensive and need large amounts of computational resources and memory for training and inference. CPUs inherently support only a limited number of parallel workloads, even though they can context switch with hyper-threading, and they remain largely sequential in nature. CPUs may have more available resources than their counterpart architectures (like GPUs or FPGAs): they have a limited number of registers to support concurrent threads, but they may have larger cache sizes, larger branch control logic, and higher on-chip bandwidth than GPUs. However, the limited number of cores available on the CPU limits its ability to process large amounts of data in parallel, which is required for DNN acceleration. Although CPUs dominate the IoT industry in DNN inference on low-power edge devices, they struggle to realize complex DNNs. Therefore, specialized hardware designs are required for the acceleration of DNNs. DNNs can be implemented using customized hardware accelerators instead of a CPU. The heterogeneous computing platforms, viz. the Field Programmable Gate Array (FPGA), Application-Specific Integrated Circuit (ASIC), and Graphics Processing Unit (GPU), are widely used to accelerate DNNs. The specialized hardware-based DNN accelerators can be categorized into two classes: the first class of accelerators efficiently implements the computational primitives, such as convolutional operations, fully connected operations, etc., for the DNNs [86], [176], while the second class of DNN accelerators efficiently optimizes the data movement and memory access [56], [178]. These two generations of specialized hardware-based DNN accelerators improve the speed and energy efficiency of running DNNs. There are two ways to improve the performance of DNN acceleration: the first is to optimize the DNN algorithm, and the second is to optimize the hardware architecture. Therefore, the algorithm and the hardware need to be co-designed to achieve superior performance.

Because of their high throughput and memory bandwidth, GPUs are one of the most often employed hardware accelerators for improving inference and training processes in DNNs [220]. In floating-point matrix-based calculations, GPU-based hardware accelerators are extremely efficient [207]. GPU-based hardware accelerators, on the other hand, consume a lot of power. ASIC- and FPGA-based hardware accelerators have limited computational and memory resources when compared to GPU-based accelerators. They can, nevertheless, achieve a moderate level of performance while using less energy [154]. ASIC-based DNN accelerators provide superior performance compared to their GPU and FPGA counterparts at the cost of reconfigurability. However, ASIC-based accelerators have some limitations, including a high development cost, long time to market, inflexibility, etc. [77], [104]. FPGA-based accelerators can be used as an alternative to ASIC-based accelerators, and they can provide superior performance at an affordable cost with reconfigurability and low power dissipation [215]. FPGA,
ASIC, and GPU-based AI accelerators have been the subject of numerous research works [98], [151], [155], [156], [159], [212]. This survey, however, also looks at various embedded AI accelerators for DNN acceleration.

This survey supplements the existing work and contributes towards providing the complete background on DNN acceleration using various specialized hardware architectures. The contributions of this survey can be summarized as follows:
1) The survey discusses the various research works carried out on the development and deployment of DNNs using FPGA-based accelerators.
2) The survey covers the work done in ASIC-based AI accelerators in the last decade, from 2012 to 2022.
3) The survey describes the various GPU-based DNN accelerators.
4) The survey provides a comprehensive overview of CGRA-based accelerators for DNN implementation.
5) The survey covers the research works carried out on the implementation of DNNs on the edge using embedded AI accelerators.
6) The survey provides a comparative study of existing hardware architectures: FPGAs, GPUs, ASICs, and embedded AI accelerators.
7) The survey highlights the future research trends in DNN acceleration on specialized hardware architectures, including FPGA, ASIC, GPU, CGRA, and Edge AI accelerators.

A. SCOPE OF THE SURVEY
This paper lays its focus on research trends in FPGA-, ASIC-, and GPU-based accelerators for implementing DNNs. We have also briefly discussed the current trends in Arm-based machine learning processors and embedded edge AI accelerators. The review categorizes the FPGA-based accelerators into three categories and briefly discusses the key features of the accelerators, including the frameworks available. The three categories include accelerators for a specific application, such as speech recognition, object detection, natural language processing, etc.; accelerators for a specific algorithm, such as CNN, RNN, etc.; and accelerator frameworks with hardware templates. Furthermore, ASIC-based accelerators are categorized into three types: ALU-based accelerators, dataflow-based accelerators, and sparsity-based accelerators. A comparative study of these hardware accelerators based on performance metrics like power, throughput, and area has been presented. The review also focuses on the mapping frameworks available for each of these accelerators and briefly discusses the implementation details. In addition, the recent research contributions in Arm-based machine learning processors and a few embedded AI hardware accelerators are discussed and compared in terms of their cores, performance, power, availability of Software Development Kits (SDKs), and supported frameworks. This survey is different and unique with respect to many existing papers in this area in the following ways. A few studies [44], [98], [127], [155], [212] focused only on the developments of FPGA-based accelerators, whereas a few other studies [55], [137], [151], [159] have presented the details of ASIC-based accelerators. Some research reviews [48], [202], [203] have explored both FPGA- as well as ASIC-based accelerators. Very limited studies [179], [203] have dealt with the progress of GPU-based accelerators. On the other hand, studies on embedded AI accelerators have not been explored much. Many of these reviews do not mention the compiler/mapping frameworks and SDKs available for these accelerators, making it difficult for someone to choose the appropriate accelerator. This review, therefore, aims to bring a comprehensive study of all the aforementioned hardware accelerators in the context of the implementation of DNNs. Furthermore, this survey classifies the FPGA-based accelerators and ASIC-based accelerators in a unique way and briefly discusses the key architectural features and the compiler or mapping frameworks available. Accelerators for each category are summarized and compared. A comprehensive survey of GPU-based accelerators by Nvidia is also presented. The need for edge AI computing is emphasized, and state-of-the-art embedded AI accelerators, including Arm-based accelerators, are also discussed and compared. This survey also briefly discusses the recent developments in tinyML. Table 1 shows the comparison of this survey paper with recently published review articles on DNN implementation using specialized hardware architectures. Researchers in the fields of artificial intelligence, system design, and hardware architecture are expected to benefit from this survey.

B. ORGANIZATION
This paper is organized as follows: Section II provides a brief overview of neural networks and DNNs, including the basic architecture of hardware for DNN acceleration. Section III describes various architectures implemented on the FPGA platform for DNN acceleration. Section IV describes various ASIC-based accelerator architectures for DNN acceleration. Section V shows a detailed review of GPU-based accelerators for the acceleration of DNNs. Section VI discusses various CGRA-based accelerator architectures for DNN acceleration. Section VII discusses in detail the embedded edge AI accelerators for DNN acceleration. Section VIII provides the comparisons between the various hardware architectures used for DNN acceleration. Section IX provides the future research directions of various hardware architectures for DNN acceleration. Finally, the conclusion of this review is presented in Section X.

II. BACKGROUND
A. NEURAL NETWORKS
A Neural Network (NN) is a computational model inspired by biological neural networks. It is also known as an Artificial Neural Network (ANN). An ANN comprises hundreds or thousands of interconnected artificial neurons, also called processing units. Three or more interconnected layers are
formed by these neurons. The input neurons are in the first layer. The input neurons receive external signals and pass them on to the subsequent layers, which eventually provide the final output data to the final output layer. The intermediate layers in the ANN are called hidden layers. Fig. 2 depicts the architecture of a typical NN, which includes an input layer, an output layer, and two hidden layers.

Each neuron accepts all n inputs from the input layer and generates the output y. These inputs are multiplied by the weight coefficients (w_1, w_2, ..., w_n) and combined together with a bias value b for each neuron. A non-linear function σ(·), also called an activation function, is then used to calculate the neuron's output, see Eq. (1). In this scenario, the activation function causes a neuron to produce an output only if the input to it exceeds a specified threshold value. Common non-linear functions used in NNs are the Sigmoid, the Rectified Linear Unit (ReLU), and the hyperbolic tangent. The graphical model and mathematical representation of an artificial neuron are shown in Fig. 3 and Eq. (1), respectively.

y = \sigma\left(\sum_{i=1}^{n} w_i x_i + b\right)  (1)
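For readers who prefer code to notation, the short Python sketch below evaluates Eq. (1) for a single neuron. It is an illustration only and is not taken from any of the surveyed works; the input values, weights, bias, and the choice of the sigmoid activation are arbitrary examples.

```python
import math

def sigmoid(z):
    # A common activation function: sigma(z) = 1 / (1 + e^-z)
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(x, w, b, activation=sigmoid):
    # Eq. (1): y = sigma(sum_i w_i * x_i + b)
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(z)

# Example with arbitrary values: three inputs, three weights, one bias.
x = [0.5, -1.2, 3.0]
w = [0.8, 0.1, -0.4]
b = 0.2
print(neuron_output(x, w, b))  # single scalar output y
```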
In neural networks, weights are initialized with some random values. However, during the training process, all these weights get updated iteratively to predict the correct output. The weights are updated using the cost function, which is nothing more than the mean square error. The mathematical representation of the mean square error is shown in Eq. (2). Here, MSE is the mean squared error, n represents the number of input data points, and y_i and \hat{y}_i are the true and predicted outputs, respectively. Once the neural network is trained, it may be used for classification problems.

MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2  (2)

B. DEEP NEURAL NETWORK (DNN)
The Deep Neural Network (DNN) is a type of neural network that has more than three hidden layers and is well-suited to complicated tasks [37]. In today's DNNs, the typical number of layers used ranges from five to over a thousand. A DNN with N hidden layers is shown in Fig. 4. In DNNs, the model and its parameters are learned through an extensive training process. DNNs can be trained using several approaches, including supervised, unsupervised, semi-supervised, and reinforcement learning. The labeled data is used in supervised learning to train or model the network. The labeled data indicates that some input data has already been matched to the correct output. Unsupervised learning is another learning technique in which the network/model is trained using unlabeled data. The trained network generates clusters or structures in the unlabeled data. Semi-supervised learning uses partially labeled data sets, and it falls in between the supervised and unsupervised learning approaches. Finally, reinforcement learning is a type of training that rewards positive behaviours while punishing undesirable ones. Reinforcement learning is bound to learn from its previous experience. The pictorial representation of the aforementioned deep learning approaches is shown in Fig. 5.

[Fig. 5: supervised learning (labeled input data, error feedback), unsupervised learning (unlabeled input data, error feedback), and reinforcement learning (states and actions, reinforcement signal).]
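As an illustration of the cost function in Eq. (2), the snippet below computes the mean squared error over a small batch of true and predicted outputs. The numbers are made up purely for demonstration.

```python
def mean_squared_error(y_true, y_pred):
    # Eq. (2): MSE = (1/n) * sum_i (y_i - y_hat_i)^2
    n = len(y_true)
    return sum((yi - yhi) ** 2 for yi, yhi in zip(y_true, y_pred)) / n

y_true = [1.0, 0.0, 1.0, 1.0]   # example labels
y_pred = [0.9, 0.2, 0.7, 1.0]   # example network outputs
print(mean_squared_error(y_true, y_pred))  # 0.035
```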
Fig. 6. CNN architecture (adopted from [182]).

1) Convolution Layer
The convolution layer operates on feature maps (FMs), which are organized in two-dimensional grids. The FM from the previous layer of the convolution layer is convolved with the filter coefficients. More than one input feature map can be paired with each of the output feature maps. In a 2-D convolution operation between an input image matrix x (size R × C) and a filter f (size W × L), the convolution layer performs point-wise multiplication and addition of the corresponding pixels. The filter size is often smaller than the input matrix size. The filter multiplies the input matrix with a W × L sized block, accumulates the result, slides to the next block of the input matrix, and repeats the operation. The input matrix is processed one block at a time until all of the image's R × C elements have been processed. The 2-D convolution operation is given in Eq. (3), where y(r, c) signifies one output pixel in the output matrix y, with each pixel's coordinates expressed as (r, c). The iterators over the filter's length (L) and width (W) are l and w, respectively, in Eq. (3). Finally, non-linear activation functions such as the sigmoid, hyperbolic tangent, or rectified linear unit are applied to the resulting feature maps.

y(r, c) = \sum_{w=0}^{W-1}\sum_{l=0}^{L-1} f(w, l)\, x\!\left(r + w - \tfrac{W}{2},\; c + l - \tfrac{L}{2}\right)  (3)

2) Pooling Layer
The pooling layer shrinks the spatial dimensions of the input image after convolution, thereby reducing the computation and the number of parameters in the network. Pooling layers are also known as subsampling layers. In a CNN, the pooling layer is used between two convolution layers. The MAX operation is used to resize each slice of the input image spatially, on which the pooling layers operate individually. A pooling layer with filters of size 2×2 is found in many CNN topologies. Over the four samples in the filter, the pooling operation, which is nothing but the MAX operation, is done. The maximum value is retained while the other values are discarded [124]. It is noteworthy that additional operations like the MIN operation and AVG operation can also be used in the pooling layer, particularly in some CNNs [198]. The MAX and AVG pooling operations for filters of size 2 × 2 are shown in Fig. 7.

Fig. 7. Various forms of pooling (2 × 2 pooling, stride 2).

3) Rectified Linear Unit (ReLU) Layer
In a CNN, the ReLU layer is usually employed after the convolution and fully connected layers. By substituting all the negative-valued outputs with 0, it introduces non-linearity into the CNN. Because of its computational simplicity, sparsity, and ability to converge faster than other activation functions like the hyperbolic tangent and sigmoid [72], [199], ReLU [161] has gained a lot of traction in recent years. The mathematical representation of ReLU is shown in Eq. (4). Some popular extensions of ReLU, for instance, the exponential LU [64], parametric ReLU [108], and leaky ReLU [150], are also being used in CNNs for improved performance and accuracy.

f(x) = \max(0, x)  (4)

4) Fully Connected Layer
Fully connected layers do the final classification in the CNN network after multiple convolution, ReLU, and pooling layers. Weights, biases, and neurons are all part of the fully connected layer. All input and output neurons are connected in the fully connected layer. A CNN typically has one or more fully connected layers. The final output of the CNN comes from the last fully connected layer, often known as the classification layer. The fully connected layer in the CNN contains a large number of inputs and outputs. Therefore, it is challenging to implement the fully connected layer operations on hardware platforms with limited resources.

5) Deconvolution Layer
To increase the size of the feature map, a deconvolution layer, also known as a transposed convolution layer, is employed [52]. Upsampling (inserting zeros in the feature map) and then convolving the upsampled feature maps with the kernel coefficients are used to accomplish this.

6) Dilated Convolution Layer
The filter coefficients are up-sampled and convolved with the input image in a dilated convolution layer to capture a broader receptive field [115]. Image segmentation, for example, uses it to capture the larger global context in each output pixel.

With millions of weight coefficients, CNNs are extremely complex. They are computationally expensive and necessitate a significant amount of memory to store the input and output feature maps and the weight coefficients, causing CPUs
to underperform. To boost the performance of CNNs, specific hardware accelerators are used. As a result, different techniques for implementing CNNs efficiently on hardware platforms must be explored in order to reduce resource and memory requirements.

D. HARDWARE ARCHITECTURES FOR DNN ACCELERATION
DNNs have become increasingly popular in recent years, allowing for their development and deployment on a variety of hardware platforms. These hardware platforms are of various types, ranging from general-purpose architectures such as CPUs and GPUs and programmable architectures (FPGAs) to special-purpose chips (ASICs). In many DNN models, multiply-accumulate (MAC) operations are the most important computations, and they can be easily parallelized. Since these MAC operations can be executed in parallel, hardware architectures that enable parallel operations are required to process DNNs. To achieve superior performance, highly parallel computing models, encompassing both spatial and temporal computing architectures, are often employed for DNN acceleration. The spatial and temporal architectures have a similar computational structure, with a set of Processing Elements (PEs). However, processing units can have internal control in a spatial architecture, whereas control in a temporal architecture is centralized, as shown in Fig. 8. Each PE can have a register file (RF) to store data in a spatial architecture; however, PEs do not have this memory capacity in a temporal architecture. The PEs can also be connected to exchange data in spatial computing designs. To summarize, the PEs in temporal architectures contain only Arithmetic and Logic Units (ALUs), whereas in spatial architectures the PEs consist of an ALU as a computation unit, an RF to store data, and a control unit.

Fig. 8. Spatial and temporal architectures.

1) Temporal Architectures
The temporal architectures exploit parallelism by supporting a variety of techniques, such as Single Instruction Multiple Threads (SIMT) or Single Instruction Multiple Data (SIMD). The temporal computing architectures appear mostly in CPUs and GPUs. In temporal designs, ALUs can only access data from the memory hierarchy and cannot communicate directly with one another. The memory (i.e., register file) and control are shared by all ALUs in the temporal architecture. In temporal architectures like CPUs or GPUs, all the convolution or fully connected operations are mapped to matrix multiplication. CPU cores are the least employed among the several temporal architectures for DNN training and inference. CPUs contain a small number of processing cores, ranging from one to ten. As a result, only a small number of processes can be performed in parallel, limiting throughput. GPUs are commonly used to train and infer DNNs. They have thousands of cores to run highly parallel algorithms efficiently, for instance, matrix multiplication. Throughput is enhanced by lowering the number of multiplications in both CPUs and GPUs. There are software libraries that optimize matrix multiplication for GPUs (e.g., cuBLAS, cuDNN [59], etc.) and CPUs (e.g., Intel MKL [2], OpenBLAS, etc.). Another well-known technique to reduce the matrix multiplications is the Fast Fourier Transform (FFT) [80], [152]. Furthermore, several techniques, such as Winograd's algorithm [133] and Strassen's algorithm [67], are used to reduce the matrix multiplications and thereby reduce the resource and memory requirements.

2) Spatial Architectures
In spatial architectures, each ALU can have its own local memory and control logic. The local memory is also referred to as the register file. The development and deployment of DNNs on Field-Programmable Gate Arrays (FPGAs) and Application-Specific Integrated Circuits (ASICs) come under the category of spatial architectures. FPGAs are less expensive and have a faster time to market than ASICs, and the design flow is simpler. However, FPGAs are less energy-efficient and consume more power than ASICs since FPGAs, unlike ASICs, contain a significant chip area dedicated to reconfigurability. ASICs, on the other hand, are mainly designed for a particular application and cannot support reconfigurability. The design flow of ASICs is more complex than that of FPGAs [46]. ASIC chips are expensive, but they are highly optimized and energy-efficient and provide superior performance compared to FPGAs. Memory accesses are the real bottleneck in DNN computations; therefore, off-chip DRAM accesses must be minimized, as they have a high energy cost and delay. The off-chip memory accesses can be reduced by reusing data stored in smaller, quicker, and low-energy memories. In spatial computing architectures, weight stationary, row stationary, output stationary, and other specialized processing dataflows can be designed to improve data reuse from memories in the memory hierarchy and reduce energy dissipation. At each level of the memory hierarchy, the dataflow defines what data is read and when it is processed. In spatial architectures, dataflows can be classified as follows:

Weight Stationary (WS)
In weight stationary dataflow, the weights are kept fixed and are stored in the register files of the PEs, whereas the inputs and partial sums are distributed across the PEs.
[Figure: roofline model, showing attainable performance (GFLOPS) bounded by the platform's peak floating-point performance (GFLOPS) and its memory bandwidth (GB/s), with two example designs, Algorithm 1 and Algorithm 2, plotted against these bounds.]

Row Stationary (RS)
The operations of a row of convolution are mapped to the same PE in row stationary dataflow, and the weights are kept stationary inside the register file of the PEs. Row stationary dataflow maximizes the convolutional reuse of input feature maps, weights, and partial sums. Row stationary dataflow examples are found in [53], [57].
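To make the dataflow idea concrete, the following Python sketch emulates a weight-stationary schedule for a 1-D convolution: each position of a hypothetical PE array permanently holds one filter weight in its local register file, while inputs are streamed past the PEs and partial sums are accumulated. This is only a functional model of the scheduling idea and does not correspond to the implementation of any specific surveyed accelerator.

```python
def conv1d_weight_stationary(inputs, weights):
    """1-D convolution scheduled in a weight-stationary fashion.

    PE k permanently holds weights[k]; for every output position the
    inputs are streamed in, and each PE adds its contribution
    w_k * x[i + k] to the partial sum that is forwarded along the array.
    """
    K = len(weights)
    n_out = len(inputs) - K + 1
    outputs = []
    for i in range(n_out):
        psum = 0.0
        for k, w_k in enumerate(weights):   # PE k reuses w_k for every output
            psum += w_k * inputs[i + k]     # partial sum passed to the next PE
        outputs.append(psum)
    return outputs

# Example: the three weights stay fixed while eight inputs stream through.
print(conv1d_weight_stationary([1, 2, 3, 4, 5, 6, 7, 8], [0.25, 0.5, 0.25]))
```

In hardware, the point of this schedule is that each weight is fetched from DRAM once and then reused from the PE's register file for every output, which is exactly the data reuse the weight-stationary dataflow is designed to maximize.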
application is a good fit for the problem and has a low design complexity. Han et al. [103] proposed an FPGA-based accelerator named the Efficient Speech Recognition Engine (ESE) to implement the LSTM algorithm for speech recognition. A load-balance-aware pruning method is used in the proposed design to compress the LSTM model. The proposed accelerator uses a framework named Kaldi to implement the LSTM algorithm for speech recognition. The ESE has a performance of 282 GOPS and is implemented on a Xilinx XCKU060 FPGA running at 200 MHz. The implementation of speech recognition algorithms using FPGA-based accelerators is also presented in several earlier studies [62], [113], [138], [201].

Wang et al. [211] proposed a reconfigurable YOLOv3 FPGA hardware accelerator for object detection. In this context, YOLOv3 (You Only Look Once, Version 3) is a real-time object detection algorithm that detects specific objects in images or videos. The proposed accelerator is built using the ARM + FPGA architecture. Experimental results show that the FPGA-based YOLOv3 accelerator consumes less energy and achieves higher throughput than the GPU counterpart. The proposed accelerator is compatible with several frameworks, such as TensorFlow, Caffe, PyTorch, etc. The proposed accelerator is implemented on a Xilinx ZCU104 running at a frequency of 300 MHz. Several previous works [82], [149], [162] also used the FPGA to implement object detection algorithms.

Hamza et al. [126] proposed an FPGA-based accelerator named NPE to efficiently implement various Natural Language Processing (NLP) models. NPE provides a single framework for processing arbitrarily complex non-linear functions with software-like programmability. NPE consumes 4× and 6× less power than a CPU and GPU, respectively. NPE is implemented on the Xilinx Zynq Z-7100 FPGA running at a frequency of 200 MHz.

Serkan et al. [185] developed an FPGA-based CNN accelerator to classify malaria disease cells. The proposed accelerator is implemented on a Xilinx Zynq-7000 FPGA running at a frequency of 168 MHz and achieves an accuracy of 94.76%. Zhu et al. [231] proposed an FPGA-based accelerator to recognize liver dynamic CT images. Xiong et al. [219] developed an FPGA-based CNN accelerator to improve the automatic segmentation of 3D brain tumors. FPGA-based accelerators are also used to implement various applications such as autonomous driving [106], [130], image classification [45], [70], fraud detection [129], cancer detection [187], etc. Table 2 summarizes the reviewed FPGA-based accelerators for specific applications.

B. ACCELERATORS FOR A SPECIFIC ALGORITHM
A prominent topic of research in the realm of accelerators is the use of FPGA-based accelerators for a particular neural network algorithm. Since the accelerator is intended to address a specific problem, its operation typically requires minimal adjustments to a few parameters to operate effectively. Cloutier et al. [65] proposed a hardware accelerator, referred to as the Virtual Image Processor (VIP), to implement CNNs. The Altera EPF81500 FPGA platform is used to implement the proposed design. VIP primarily consists of Processing Elements (PEs) connected in a 2-D systolic architecture and supports the SIMD paradigm. VIP is designed to perform the following vector and matrix operations: matrix multiplication, matrix-vector multiplication, scalar multiplication, matrix addition, matrix-vector addition, vector addition, 1-D convolution, 2-D convolution, etc. The host computer is used to provide the configuration data to the FPGA board, which is connected through a Peripheral Component Interconnect (PCI) interface. VIP uses low-accuracy arithmetic because of the limited resources on the Altera EPF81500 FPGA. Fortunately, recent FPGAs contain large numbers of computing units and memory resources and allow fast CNN implementations. FPGA implementations of DNNs mainly focused on accelerating the convolution operations, which are reported in [38] and [49].

Farabet et al. [86] presented the ConvNet Processor (CNP): an FPGA-based accelerator to implement CNNs. CNP uses a dedicated hardware convolver for the data processing and also uses a soft processor for control. CNP is designed on the Virtex4 SX35 FPGA and is also equipped with external memory to store the input and filter coefficients. CNP includes a Vector Arithmetic and Logic Unit (VALU), one of the main components in the architecture, which implements the CNN operations, viz. 2-D convolutions, sub-sampling, and non-linear activation functions. The implementation of 2-D convolution, represented using Eq. (6), is shown in Fig. 10 for K = 3, i.e., a 3 × 3 kernel. In Eq. (6), x_{ij} is the data in the input plane, w_{mn} is the weight value in the K × K kernel, y_{ij} is the partial sum, z_{ij} is the result in the output plane, and W is the width of the input image. At each clock cycle, the convolution module performs K² multiply-accumulate operations simultaneously. CNP uses First In First Out (FIFO) buffers between the external memory and the FPGA to provide a continuous flow of data in both directions. CNP uses a 32-bit soft processor that provides macro instructions, generally higher-level instructions than those of most traditional processors, to the VALU for implementing the basic CNN operations. CNP has a compiler that converts network implementations written with Torch directly into CNP instructions. The proposed architecture has been used to implement a face detection system.

z_{ij} = y_{ij} + \sum_{m=0}^{K-1}\sum_{n=0}^{K-1} x_{i+m,\,j+n}\, w_{mn}  (6)

Sankaradas et al. [183] presented a massively parallel co-processor for accelerating CNNs. This co-processor is designed using the Virtex5 LX330T FPGA platform and four DDR2 (Double Data Rate 2) memory banks totalling 1 GB. The proposed co-processor mainly consists of clusters of Vector Processing Elements (VPEs) connected in parallel.
Fig. 10. 2-D convolution module for 3 × 3 kernel, adopted from [86].

Each cluster consists of 2-D convolver units, sub-samplers, and Look-Up Tables (LUTs) and performs convolution, pooling, and non-linearity operations. The 2-D convolver unit used in the proposed design is shown in Fig. 11. It contains k × k convolution units along with k² + k VPEs, and the final column of VPEs is used to add partial results. The co-processor operates in collaboration with a host, which can control the co-processor through an Application Programming Interface (API). The proposed design uses a low-precision data representation to improve the throughput and memory bandwidth. The proposed architecture has been used to implement a full face recognition application using a CNN with four convolution layers. The proposed accelerator cannot be used to realize full CNNs, which contain both convolution and fully connected layers. Graf et al. [94] used a similar approach to accelerate Support Vector Machines (SVMs), and their design contains VPEs instead of VPE clusters. However, the accelerator proposed in [94] provides low performance while accelerating DNNs compared to the co-processor proposed in [183].

Fig. 11. 2-D convolver unit of the CNN co-processor, adopted from [183].

A programmable parallel accelerator called MAPLE is presented in [47] to accelerate several learning and classification algorithms such as Support Vector Machines (SVM), K-means, CNNs, etc. MAPLE contains hundreds of simple PEs arranged in a 2-D grid fashion, as shown in Fig. 12. MAPLE can be used to perform vector and matrix operations in parallel. In MAPLE, each PE has local storage to perform the computations efficiently. Each PE has two operands; one operand comes from its local
storage, and another operand comes from the PE on its left, see Fig. 12. Furthermore, the output of each PE is connected to the PE on its right. The PEs are arranged as clusters, where each cluster has a separate off-chip memory block that creates independent data streams for memory-processor computations. The MAPLE processing core can be organized as H clusters, and each cluster contains M PEs. So, the total number of PEs in the MAPLE core equals H × M. MAPLE also uses smart memory banks to process the intermediate data and to perform secondary reduction operations such as aggregation, finding the minimum or maximum, and array ranking. The authors developed a tool to map applications onto MAPLE. For the given input matrices and reduction functions, the tool generates the assembly code needed to program MAPLE. The authors also created a C++ simulator that estimates how long MAPLE will take to execute from the input assembly code and an architectural configuration file that details the processor layout and off-chip memory architecture. The MAPLE design is implemented on the Virtex5 SX240T FPGA running at 125 MHz.

[Fig. 12: MAPLE processing core, consisting of clusters of PEs with local stores connected through an inter-chain interconnect, pipelined smart memory blocks with a REDUCE unit, off-chip memory banks, and a pipelined input broadcast.]

The DC-CNN co-processor presented in [51] is a dynamically reconfigurable CNN accelerator, as shown in Fig. 13. The co-processor uses a three-bank memory sub-system to store input images, kernels, and intermediate data. The DC-CNN uses the Torch7 [66] software for CNN implementation. The proposed dynamically reconfigurable architecture supports "inter-output" and "intra-output" parallelism. The performance of the proposed dynamically reconfigurable architecture with 20 convolvers and a 128-bit memory port width is 4 to 8 times faster than the CNP presented in [86]. The proposed architecture can be used to accelerate CNNs with only three convolutional layers. The proposed accelerator is not capable of realizing full CNNs, which contain both convolution and fully connected layers.

Fig. 13. DC-CNN co-processor architecture, adopted from [51] (C: convolvers, NL: non-linearity, S1: sub-sampling).
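The "inter-output" and "intra-output" parallelism mentioned above can be pictured with the small NumPy sketch below: inter-output parallelism computes several output feature maps at the same time (one per convolver bank), while intra-output parallelism splits the accumulation over input feature maps for a single output map. The shapes, the random data, and the use of NumPy are assumptions chosen only for illustration; the sketch does not reproduce the DC-CNN implementation.

```python
import numpy as np

def conv2d_valid(x, k):
    # Plain 2-D convolution (Eq. (3) style, "valid" padding), used as a building block.
    R, C = x.shape
    K, _ = k.shape
    out = np.zeros((R - K + 1, C - K + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(x[r:r + K, c:c + K] * k)
    return out

def conv_layer(in_fms, kernels):
    """in_fms: N input feature maps; kernels[m][n]: kernel from input n to output m.

    Inter-output parallelism: the iterations of the outer loop (one per output
    feature map) are independent and can run on separate convolver banks.
    Intra-output parallelism: the partial results of the inner loop (one per input
    feature map) can be computed by parallel convolvers and summed by an adder tree.
    """
    out_fms = []
    for k_row in kernels:                              # parallel across output maps
        partial = [conv2d_valid(in_fms[n], k_row[n])   # parallel across input maps
                   for n in range(len(in_fms))]
        out_fms.append(sum(partial))
    return out_fms

# Tiny example: 2 input feature maps, 3 output feature maps, 3x3 kernels.
rng = np.random.default_rng(0)
inputs = [rng.standard_normal((8, 8)) for _ in range(2)]
weights = [[rng.standard_normal((3, 3)) for _ in range(2)] for _ in range(3)]
print([fm.shape for fm in conv_layer(inputs, weights)])  # [(6, 6), (6, 6), (6, 6)]
```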
In the NeuFlow accelerator, the data flow can be managed, and operators can easily be cascaded and connected across tiles. The NeuFlow accelerator uses a compiler named luaFlow to process CNNs. The luaFlow compiler converts high-level data flow graph representations of deep learning algorithms in the Torch5 environment into machine code for NeuFlow. The proposed accelerator has been used to implement a real-time street scene parser.

[Figure: NeuFlow architecture, a grid of processing tiles (PT), each with a multiplexer, arithmetic operators, and local memory, connected through configurable routes, global and local data lines, and a runtime configuration bus to a smart DMA and off-chip memory.]

The NPU-based accelerator employs a time division multiplexing (TDM) processing scheme and a page-mirror algorithm. In the proposed accelerator, the NPUs get their inputs from the host computer through the Ethernet interface, and the weight coefficients are fetched from the page-mirror memory. The serializer sends the output of the NPUs to the activation function blocks. For each sample, the proposed accelerator requires a long time to transfer the appropriate weight coefficients from the host computer to the accelerator core.

[Figure: NPU-based accelerator, with page-mirror memories, a serializer, a communication interface, and control and configuration blocks handling the signals and weights.]
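To see why per-sample weight transfers from the host can dominate, consider the rough back-of-the-envelope estimate below. All of the numbers (layer size, link speed, compute rate) are assumptions chosen only to illustrate the imbalance; they are not figures reported for this accelerator.

```python
# Hypothetical fully connected layer: 4096 x 4096 weights, 16 bits each.
weights = 4096 * 4096
weight_bytes = weights * 2

# Assumed 100 Mb/s Ethernet link between host and accelerator.
link_bytes_per_s = 100e6 / 8
transfer_s = weight_bytes / link_bytes_per_s

# Assumed accelerator compute rate of 100 GOPS (2 operations per MAC).
compute_s = (2 * weights) / 100e9

print(f"transfer: {transfer_s * 1e3:.1f} ms, compute: {compute_s * 1e3:.3f} ms, "
      f"ratio: {transfer_s / compute_s:.0f}x")
```

Under these assumed numbers the weight transfer takes seconds while the computation takes well under a millisecond, which is consistent with the bottleneck described above.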
Fig. 16. Architecture of nn-X system, adopted from [92].

Using the roofline model, the authors of [225] identified the solutions with the best performance and the lowest FPGA resource requirement. This roofline-based model optimizes both the memory accesses as well as the computations in the convolutional layers. The accelerator design is implemented with the Vivado HLS tool, which enables the accelerator implementation in the C language. The proposed accelerator achieves a maximum throughput of 61.62 GFLOPS (Giga Floating-point Operations Per Second).

Implementing DNNs in embedded devices is tough due to resource and power constraints. In this regard, the authors in [173] have developed a novel FPGA-based accelerator for implementing trained and fully connected DNNs. Since it is difficult to map a DNN with a large number of neurons and corresponding weights directly onto an FPGA, the authors in [173] used a time division multiplexing scheme. Batch processing is used in the proposed architecture, which distributes different weights over many input samples. In addition, the suggested accelerator employs a pipelined architecture to make the most of the FPGA resources while staying within power and resource limits. The concept of pruning has also been incorporated into the proposed architecture to reduce data transfer from the external memory to the accelerator [174]. Both batch processing and weight pruning can enhance the throughput of DNN accelerators.

Qiu et al. [178] proposed an FPGA-based CNN accelerator, which efficiently accelerates all the layers of a CNN, including the fully connected layers. The proposed accelerator improves bandwidth and resource usage by employing a dynamic-precision data quantization method and a unique design of the convolver hardware module. The proposed accelerator applies singular value decomposition (SVD) to the weight coefficients to minimize the memory footprint of the fully connected layer. The convolver hardware module can be used for both convolutional and fully connected layers to reduce resource consumption. The adder tree, convolver complex, non-linearity, max-pooling, bias shift, and data shift modules are the main elements of the convolver hardware module, as shown in Fig. 17. Convolutions and fully connected layer operations are both performed using the convolver complex module. The max pooling action is carried out using the max-pooling module. The CNN's non-linearity function is calculated using the non-linearity module. The convolver complex module generates partial sums, which are added by the adder tree. Finally, for dynamic quantization, the bias shift and data shift modules are used. The proposed accelerator supports the Caffe deep learning framework and has been implemented on the Xilinx Zynq platform.

Fig. 17. Convolver architecture, adopted from [178].

Wang et al. [210] proposed a scalable design called the Deep Learning Accelerator Unit (DLAU) for accelerating deep learning algorithms. DLAU utilizes the tiling technique to produce a scalable architecture. The proposed accelerator mainly contains modules such as a DMA, an embedded processor, the DLAU, and a DDR3 memory controller, as shown in Fig. 18. The DLAU module mainly contains three processing units, viz. the Partial Sum Accumulation Unit (PSAU), the Tiled Matrix Multiplication Unit (TMMU), and the Activation Function Acceleration Unit (AFAU). The TMMU is used to perform the multiplication operations and also to generate partial sums. The PSAU is used to add the partial sums derived from the TMMU. Finally, the AFAU is used to perform the non-linear activation functions, for instance, the sigmoid function. The DLAU module reads the tiled input data through the DDR3 memory. The embedded processor provides the programming interface to the users and communicates with the DLAU via JTAG-UART. The proposed architecture is implemented on the Xilinx Zynq ZedBoard with ARM Cortex-A9 processors operating at 667 MHz.

Lian et al. [141] proposed a block-floating-point (BFP) arithmetic-based CNN accelerator for DNN inference. The proposed accelerator mainly contains three elements: a Processing Array (PEA), an on-chip buffer, and external memory, as shown in Fig. 19. The onboard DDR3 modules receive the input data and network parameters from the host computer via PCIe 3.0 x8. The Conv PEA performs the convolutional operations, and the FC PEA performs the fully connected layer operations. The proposed accelerator uses 8-bit and 16-bit formats to represent the feature maps and model parameters (activations and weights), which can reduce the off-chip bandwidth and memory compared to the 32-bit floating-point counterpart.
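Block-floating-point arithmetic groups a block of values under one shared exponent so that each value only needs a small integer mantissa. The sketch below shows one way such a quantization could look; the block size, the 8-bit mantissas, and the rounding choices are assumptions made for illustration and do not reproduce the exact scheme used in [141].

```python
import numpy as np

def bfp_quantize(block, mantissa_bits=8):
    """Quantize a block of floats to a block-floating-point representation.

    All values share one exponent chosen from the block's maximum magnitude;
    each value keeps only a signed integer mantissa.
    Returns (mantissas, shared_exponent).
    """
    max_mag = np.max(np.abs(block))
    if max_mag == 0:
        return np.zeros_like(block, dtype=np.int32), 0
    # Shared exponent chosen so the largest value fits in the mantissa range.
    exp = int(np.ceil(np.log2(max_mag))) - (mantissa_bits - 1)
    mantissas = np.clip(np.round(block / 2.0 ** exp),
                        -(2 ** (mantissa_bits - 1)),
                        2 ** (mantissa_bits - 1) - 1).astype(np.int32)
    return mantissas, exp

def bfp_dequantize(mantissas, exp):
    return mantissas.astype(np.float64) * 2.0 ** exp

block = np.array([0.031, -0.120, 0.004, 0.087])
m, e = bfp_quantize(block)
print(m, e, bfp_dequantize(m, e))  # small reconstruction error per value
```

Because every value in the block shares the same exponent, the multiply-accumulate hardware can operate on narrow integers, which is the main reason BFP reduces the memory footprint and off-chip bandwidth noted above.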
The proposed BFP arithmetic is conducted on the Caffe [120] scheme. The proposed accelerator is implemented on the Xilinx VC709 evaluation board, running at a frequency of 200 MHz, and achieves a throughput of 760.83 GOP/s.

Fig. 19. Block diagram of BFP arithmetic-based CNN accelerator, adopted from [141].

Xiao et al. [218] presented a DNN accelerator architecture specially designed for sparse and compressed DNN models. The proposed DNN accelerator mainly contains a PE array, an RLC encoder, a controller, and on-chip buffers, as shown in Fig. 20. In the proposed DNN accelerator, all the weights and non-linear activation functions are kept in Run-Length Coding (RLC) compressed form and are stored in off-chip DRAM memory. The PE array contains 64 PEs and performs the multiply-accumulate (MAC) operations of the fully connected layer. The proposed accelerator uses a novel circuit-level processing scheme to process the sparse data.

Fig. 20. DNN accelerator architecture proposed in [218], adopted from [218].

Ahmed et al. [81] proposed an FPGA-based Low Power CNN (LP-CNN) accelerator based on the GoogLeNet CNN. The proposed accelerator uses quantization and weight pruning techniques to reduce the memory size. The LP-CNN accelerator is a time-sharing processor designed to process the CNN model layer by layer, and it enables pipelining. The proposed accelerator only uses the on-chip memory to store the activations and weights instead of off-chip DRAM memory. Moreover, the proposed architecture replaces multiplication operations with shifting operations and uses no DSP units. The LP-CNN accelerator is implemented in Verilog RTL, and the Vivado power analyzer has been used to calculate the power. The experimental results show that the LP-CNN accelerator provides 49.5 and 7.8 times power improvement over the Intel Core-i7 and the NVidia GTX 1080Ti, respectively. The proposed accelerator has been implemented on the Virtex-7 FPGA running at a frequency of 200 MHz.

A low-power, energy-efficient FPGA-based accelerator is presented in [117] to accelerate LeNet CNNs. The proposed accelerator uses 8-bit, 16-bit, and 32-bit fixed-point formats to represent the weights, activations, and biases, respectively. The proposed accelerator supports pipelining and implements LeNet with the minimal resources possible without affecting the throughput. This work uses the Xilinx Vitis HLS tool to convert the C++ code to an RTL implementation. The proposed accelerator is implemented on the Nexys
DDR 4 FPGA evaluation board and achieves a throughput of 14K images/sec while using just 628 mW of power.

An FPGA-based dynamically reconfigurable architecture is presented in [118] to accelerate neural networks. Dynamic Partial Reconfiguration (DPR) is used in the proposed accelerator to realize different types of neural network architectures. DPR allows the proposed architecture to switch between networks and applications without sacrificing precision or throughput. The proposed accelerator mainly contains a PE array and configurable switches, as shown in Fig. 21. The PE is a high-level generic block that can implement the layers of a neural network accelerator and has three predefined interfaces: AXI STREAM (data interface), AXI4 (memory interface), and GPIO (I/O interface). The proposed accelerator is implemented on the Xilinx Zynq 7020 FPGA board.

[Fig. 21: PE array with AXI-stream switches, an AXI-Lite crossbar, a hard/soft processor, and GPIO I/O multiplexing on the FPGA.]

Fig. 22. RCNN accelerator architecture, adopted from [93].
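Several of the designs above, such as the LeNet accelerator in [117], keep weights, activations, and biases in different fixed-point word lengths (8, 16, and 32 bits, respectively). The snippet below is a generic illustration of symmetric fixed-point quantization with a configurable word length and fractional precision; the scaling policy is an assumption for demonstration and is not the exact format of any surveyed design.

```python
def to_fixed(value, total_bits, frac_bits):
    """Quantize a real value to signed fixed point with the given word length."""
    scale = 1 << frac_bits
    q = int(round(value * scale))
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, q))          # saturate on overflow

def from_fixed(q, frac_bits):
    return q / float(1 << frac_bits)

w = 0.8125          # example weight      -> 8-bit fixed point, 6 fractional bits
a = 3.14159         # example activation  -> 16-bit fixed point, 12 fractional bits
wq = to_fixed(w, 8, 6)
aq = to_fixed(a, 16, 12)

# A MAC in integer arithmetic; the product carries 6 + 12 = 18 fractional bits,
# so a wider (e.g., 32-bit) accumulator for biases/partial sums has ample headroom.
acc = wq * aq
print(from_fixed(acc, 18))  # close to 0.8125 * 3.14159
```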
TABLE 2. Summary of the reviewed FPGA-based accelerators.

| Accelerator | Year | Network | FPGA platform | Freq. (MHz) | Precision | Reported resources / utilization | GOPS | GOPS/W |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VIP [65] | 1996 | CNN | Altera EPF81500 | 16 | fixed point | 4, N/A, 1500, 1500, N/A, N/A | N/A | N/A |
| CNP [85] | 2009 | LeNet-5 | Virtex4 SX35 | 200 | 16-bit fixed point | 4, 192, 30720, 30720, 192, N/A, 90%, 90%, 28% | 5.25 | 0.35 |
| Parallel coprocessor for CNN [183] | 2009 | CNN | Virtex5 LX330T | 115 | 16-bit fixed point | 6, 324, 207360, 207360, 192, 0.93%, 17%, 19.05%, 55.73% | 6.74 | 0.61 |
| MAPLE [47] | 2010 | CNN | Virtex5 SX240T | 125 | fixed point | 6, 516, 149760, 149760, 1056, N/A | 7 | N/A |
| DC-CNN [51] | 2010 | CNN | Virtex5 SX240T | 120 | 48-bit fixed point | 6, 516, 149760, 149760, 1056, N/A | 16 | 1.14 |
| NeuFlow [84] | 2011 | CNN | Virtex6 VLX240T | 200 | 16-bit fixed point | 6, 416, 150720, 301440, 768, N/A | 147 | 14.7 |
| Memory-Centric Accelerator [170] | 2013 | CNN | Virtex6 VLX240T | 150 | fixed point | 6, 416, 150720, 301440, 768, 45.50%, 1.10%, N/A, 6% | 17 | N/A |
| NPU based Accelerator [172] | 2014 | DNN | Xilinx Kintex 7 | N/A | floating point | 6, 1590, 254200, 508400, 1540, N/A | N/A | N/A |
| nn-X [92] | 2014 | CNN | Zynq XC7Z045 | 142 | 16-bit fixed point | 4, 545, 218600, 437200, 900, N/A | 23.18 | 2.9 |
| Roofline based Accelerator [225] | 2015 | AlexNet | Virtex7 VX485T | 100 | 32-bit floating point | 4, 2060, 303600, 607200, 2800, 50%, 61.30%, 33.87%, 80% | 61.62 | 3.31 |
| Embedded FPGA Accelerator [178] | 2016 | VGG-16 | Zynq XC7Z045 | 150 | 16-bit fixed point | 4, 545, 218600, 437200, 900, 86.70%, 83.50%, 29.20%, 89.20% | 136.97 | 14.22 |
| DNN Acceleration using … | 2016 | DNN | Zynq-7000 | 100 | 16-bit fixed point | 6, 280, 53200, 106400, 220, N/A | N/A | N/A |
| DLAU [210] | 2017 | DNN | Zynq XC7Z020 | 200 | 48-bit floating point | 6, 280, 53200, 106400, 220, 12.50%, 68.40%, 26.60%, 75.90% | N/A | N/A |
| DNN Acceleration using Batch … | 2018 | DNN | ZedBoard | 100 | 16-bit fixed point | 6, 280, 53200, 106400, 220, N/A | 4.48 | N/A |
| BFP arithmetic-based … | 2019 | VGG-16 | Xilinx VC709 | 200 | 8-bit BFP (feature maps) | 6, 1470, 433200, 866400, 3600, 62.1%, 53.5%, 16.3%, 28.5% | 760.83 | 82.88 |
| Accelerator for Space DNN [218] | 2021 | AlexNet/VGG-16 | Xilinx Virtex7 | 200 | 16-bit fixed point | 6, 1470, 433200, 866400, 3600, N/A | 1.34 | 53.14 |
| LP-CNN [81] | 2021 | GoogLeNet | Virtex-7 VC709 | 200 | 12-bit fixed point | 6, 1470, 433200, 866400, 3600, 77%, 94%, 10%, 0% | 129.2 | 32.7 |
| Energy-efficient CNN Accelerator [117] | 2021 | LeNet | Xilinx Artix XC7A100T | 125 | 8-bit fixed point (weights) | 6, 135, 63400, 126800, 240, 21.48%, 25.16%, 13.93%, 50% | N/A | N/A |
| Dynamically Reconfigurable Architecture [118] | 2022 | CNN, SNN | Xilinx Zynq 7020 | 200 | N/A | 6, 280, 53200, 106400, 220, N/A | N/A | N/A |
| RCNN Accelerator [93] | 2022 | AlexNet | Xilinx Zynq 7020 | 200 | 16-bit fixed point (feature maps) | 6, 280, 53200, 106400, 220, N/A | N/A | N/A |
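When comparing the throughput figures in Table 2, it can help to translate GOPS at a given clock frequency into multiply-accumulates per cycle (counting one MAC as two operations). The short calculation below does this for two illustrative entries; it is a reading aid for the table, not data taken from the original papers.

```python
def macs_per_cycle(gops, freq_mhz):
    # 1 MAC = 2 operations (multiply + add)
    ops_per_cycle = (gops * 1e9) / (freq_mhz * 1e6)
    return ops_per_cycle / 2.0

# Examples using values reported in Table 2.
print(macs_per_cycle(136.97, 150))  # Embedded FPGA Accelerator [178]: ~457 MACs/cycle
print(macs_per_cycle(760.83, 200))  # BFP arithmetic-based accelerator: ~1902 MACs/cycle
```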
FPGA-based platform as inputs. The ConvNet description is passed through a DSL (Domain-Specific Language) processor, which parses the input script, populates the ConvNet's semantic model as a Directed Acyclic Graph (DAG), and also extracts platform-specific resource constraints. The ConvNet DAG is converted into an SDF hardware intermediate format, which corresponds to a fully parallel hardware implementation. After several transformations on the ConvNet's SDF hardware model, the design space is searched, and this procedure provides a set of hardware mappings of the ConvNet onto the specific FPGA-based platform. The fpgaConvNet front-end parser can examine models written in the Caffe and Torch machine-learning libraries. This framework accomplishes efficient design space exploration through graph segmentation, reconfiguration, folding, and weight reloading. This framework can be used to map small CNN models, for instance, LeNet-5, on FPGAs.
Wang et al. [213] developed a design automation tool referred to as DeepBurning that contains a library of building blocks that mimic the behavior of typical neural network components. The general design flow of the DeepBurning framework is shown in Fig. 25. The DeepBurning Neural Network Generator (NN-Gen) takes a model descriptive (Caffe-compatible) script as input, which describes a high-level view of the network topology and layer definitions. The DeepBurning NN-Gen also takes user-specified constraints such as area and power as inputs. The NN-Gen consists of a hardware generator and a compiler that generate the control flow and data layout based on the user's specifications. The hardware generator builds a neural network architecture for a given network structure by selecting and instantiating blocks from the library with the required interconnections. DeepBurning supports a wide range of NN models and simplifies the design flow of NN-based accelerators for machine learning applications.
A framework referred to as DNNWeaver is presented in [188] that generates the bitstream and host code to implement DNNs on various FPGA boards. DNNWeaver employs Caffe as its programming interface. DNNWeaver consists of three software components: a translator, a design weaver, and an integrator. The translator transforms the Caffe specification of a DNN into a macro dataflow graph. The design weaver accepts the macro dataflow graph as an input and generates a synthesizable Verilog implementation of the accelerator code. The integrator adds the memory interface code to the accelerator code. DNNWeaver generates accelerator code from a series of scalable and customizable hand-optimized template designs, resulting in high performance and efficiency.
Guan et al. [96] proposed a framework called Field Programmable DNN (FP-DNN) to accelerate DNNs efficiently on FPGAs. The FP-DNN framework is shown in Fig. 26. The model description is generated by TensorFlow and is fed into a symbolic compiler. The compiler generates a C++ program and an FPGA programming bitstream for model inference, executed by the host and device, respectively. The model mapper examines the model description, extracts the target model's topological structure and operations, and sends the hardware kernel schedule and configuration to the code generators. The software generator generates the host code in C++ using the kernel schedule. The host code is compiled using a commercial C++ compiler to create host programs.
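As a hedged, simplified illustration of the translator step described for DNNWeaver-style flows (the layer dictionaries and the function name below are invented and are not DNNWeaver's actual data structures), a Caffe-like layer list can be parsed into a macro dataflow graph as follows:

```python
# Illustrative sketch (not DNNWeaver's actual code) of the "translator" step:
# a Caffe-like layer list is parsed into a macro dataflow graph whose nodes
# would later be bound to hand-optimized hardware templates.
layers = [
    {"name": "conv1", "type": "Convolution",  "bottom": "data",  "top": "conv1"},
    {"name": "relu1", "type": "ReLU",         "bottom": "conv1", "top": "relu1"},
    {"name": "fc1",   "type": "InnerProduct", "bottom": "relu1", "top": "fc1"},
]

def build_macro_dataflow_graph(layers):
    """Return edges (producer -> consumer) of the macro dataflow graph."""
    producer_of = {l["top"]: l["name"] for l in layers}
    edges = []
    for l in layers:
        src = producer_of.get(l["bottom"])       # None for external inputs
        if src is not None:
            edges.append((src, l["name"]))
    return edges

print(build_macro_dataflow_graph(layers))
# [('conv1', 'relu1'), ('relu1', 'fc1')]
```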
Fig. 23. Power efficiency and throughput of the FPGA-based accelerators listed in Table 3: (a) power efficiency (GOPS/W); (b) throughput (GOPS).
Fig. 24. Processing flow of fpgaConvNet, adopted from [208].
Fig. 25. Design flow of DeepBurning framework, adopted from [213].
The hardware generator creates device code by instantiating RTL-HLS hybrid templates based on the kernel configuration. The hardware code is compiled using commercial synthesis tools to generate the programming files for the hardware implementation. With a high-performance compute engine and well-designed, communication-optimized algorithms, FP-DNN performs model inference for DNNs.
Umuroglu et al. [206] proposed FINN, a framework that maps trained Binarized Neural Networks (BNNs) onto an FPGA. FINN generates a synthesizable C++ network description of a flexible heterogeneous streaming architecture. The architecture mainly contains pipelined compute engines that communicate via on-chip data streams. Each BNN layer is implemented using dedicated compute engines with 1-bit values for feature maps (FMs) and weights. To evaluate FINN, the authors implemented CNV, a convolutional network topology inspired by BinaryNet [68] and VGG-16 [193], on a Xilinx Zynq-7000 FPGA board running at 200 MHz to accelerate BNN inference.
Guo et al. [97] proposed a flexible and programmable CNN accelerator, referred to as Angle-Eye, together with the compilation tool and the data quantization scheme. The data quantization scheme can be used to reduce the
Fig. 26. FP-DNN framework [96].
Fig. 29. DPU architecture overview, adopted from [230], Vitis AI stack, and development flow.
TABLE 4: Summary of FPGA-based accelerator frameworks
Framework Name | Year | DNN Type | Interface
Xilinx Vitis AI [32] | 2022 | CNN, RNN | Caffe, PyTorch, TensorFlow
CNN2Gate [91] | 2020 | AlexNet, VGG-16 | Caffe2, Keras, TensorFlow
Caffeine [226] | 2019 | AlexNet, VGG-16 | Caffe
Angle-Eye [97] | 2018 | VGG-16 | Caffe
FP-DNN [96] | 2017 | VGG-19, Res-152 | TensorFlow
FINN [206] | 2017 | CNV | Caffe
DNNWeaver [188] | 2016 | LeNet, Siamese | Caffe
DeepBurning [213] | 2016 | AlexNet, NiN | Caffe
fpgaConvNet [208] | 2016 | CNN | Caffe, Torch
store the output neuron values (NBout), and a third buffer to store the weights (SB). Different computational operators are invoked in each stage depending on the type of the layer (convolution, activation function, pooling, etc.). For architecture exploration, the authors developed a C++ simulator that evaluates execution time and serves as a specification for the Verilog implementation. The Verilog version of the accelerator is synthesized using Synopsys' Design Compiler, with 485 mW of power consumption.
The DaDianNao accelerator uses eDRAM to store all the data related to a CNN, i.e., input feature maps, weight kernels, output kernels, etc. DaDianNao gives better performance while accelerating CNNs but provides moderate to low performance while accelerating large-scale CNNs.
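The buffer organization described above can be pictured with a small, purely illustrative sketch (it is not DianNao's design; the function and variable names are invented): an input tile staged in NBin and a weight tile staged in SB are combined by a layer-dependent operator, and the result is written to NBout.

```python
# Schematic sketch (not the accelerator's RTL) of the buffer organization
# described above: inputs staged in NBin, weights in SB, outputs collected in
# NBout, with the computational operator selected per layer type.
import numpy as np

def run_layer(layer_type, nbin, sb):
    """nbin: input tile, sb: weight tile; returns the NBout tile."""
    if layer_type == "convolution":          # simplified here to a matmul tile
        return nbin @ sb
    if layer_type == "activation":
        return np.maximum(nbin, 0.0)          # ReLU, ignores sb
    if layer_type == "pooling":
        return nbin.reshape(nbin.shape[0], -1, 2).max(axis=2)
    raise ValueError(layer_type)

nbin = np.random.rand(4, 8)                   # tile fetched into NBin
sb = np.random.rand(8, 6)                     # weights fetched into SB
nbout = run_layer("convolution", nbin, sb)    # result written back via NBout
print(nbout.shape)                            # (4, 6)
```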
clustering) as well as many machine learning techniques, including k-means, k-nearest neighbors, linear regression, classification tree, naive Bayes, support vector machine, and DNNs. PuDianNao mainly contains various Functional Units (FUs), three types of data buffers (HotBuf, ColdBuf, and OutputBuf), an instruction buffer (InstBuf), a DMA, and a control module, as shown in Fig. 31. The FU contains a Machine Learning Functional Unit (MLU) and an Arithmetic Logic Unit (ALU). The MLU can be used to perform several computational primitives, including dot product, counting, sorting, distance calculations, and non-linear functions such as the sigmoid. The ALU has an adder, divider, and multiplier, plus converters between 16-bit and 32-bit floating point. It may also be used to compute estimates using the Taylor expansion of log(1-x). HotBuf (8 KB) and ColdBuf (16 KB) store input data with short and longer reuse distances, respectively. OutputBuf (8 KB) is used to store the output data or intermediate results. The authors implemented an in-house C simulator of PuDianNao; it acts as a specification for the Verilog implementation and also measures the performance of PuDianNao on large-scale datasets. Design Compiler synthesizes the design, and IC Compiler (ICC) is used to generate the layout. The energy, area, and critical path are obtained after layout. The design is simulated using Synopsys VCS, and PrimeTime PX is used to determine the power using the Value Change Dump (VCD) file. The proposed architecture has been implemented using TSMC 65 nm CMOS technology.
Fig. 31. PuDianNao accelerator architecture, adopted from [142]
Du et al. [78] proposed a CNN accelerator referred to as ShiDianNao to improve the energy efficiency and scalability of the DianNao [53] design discussed above. The ShiDianNao accelerator does not access the main memory while executing a CNN and achieves higher energy efficiency compared to DianNao. The design is implemented in Verilog and synthesized with Design Compiler, and IC Compiler is used to place and route the synthesized design. The energy cost of DRAM accesses is calculated using CACTI 6.0 [160]. The ShiDianNao accelerator does not support the acceleration of large-scale CNNs. The ShiDianNao accelerator is implemented using 65 nm CMOS technology. DianNao [53], DaDianNao [54], [147], PuDianNao [142], and ShiDianNao [78] are not built utilizing reconfigurable hardware; hence, they cannot be adapted to changing application demands such as NN sizes.
Lu et al. [146] proposed a flexible dataflow architecture called FlexFlow to accelerate CNNs, exploiting all kinds of parallelism, viz., inter-kernel, intra-kernel, and inter-output, on a two-dimensional array of PEs. FlexFlow has additional interconnections between on-chip memories and PEs, which provide the flexibility to fetch any neuron from any feature map. The proposed accelerator minimizes the interconnections between the PEs at the cost of energy because of data movement from on-chip memory to PEs. In FlexFlow, all the PEs operate in parallel, which helps improve the overall throughput. The proposed architecture has high scalability and supports different sizes of CNNs with stable resource utilization. FlexFlow only implements CNNs and is confined to within a layer rather than across layers. The design is simulated, synthesized, and placed & routed using Synopsys' tools. The FlexFlow accelerator is implemented using TSMC 65 nm technology.
Hardik et al. [189] developed a bit-level dynamically composable architecture called Bit Fusion for accelerating DNNs. Bit Fusion mainly consists of an array of bit-level computation elements, called BitBricks, that dynamically fuse to match the bit width of individual DNN layers and execute DNN operations with the required bit width, without any loss of accuracy. Furthermore, Bit Fusion supports the multiplication of 2, 4, 8, and 16 bits spatially. Bit Fusion decomposes a 16-bit multiplication into multiple 2-bit multiplications to achieve the flexibility to efficiently map various layers of a CNN with different bit widths and minimize the computation and communication with no loss of accuracy. The Bit Fusion architecture comes with an Instruction Set Architecture (ISA) that minimizes data transfer and maximizes the parallelism in computations. The proposed design is implemented in Verilog and is synthesized using Design Compiler, which estimates the area, frequency, and power. The proposed accelerator architecture is implemented on 45 nm CMOS technology. The Bit Fusion accelerator achieves 5.1× energy saving and 3.9× speedup over the Eyeriss accelerator.
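The Bit Fusion decomposition described above can be checked with a short arithmetic sketch (an illustration of the underlying shift-and-add identity only, not of the BitBrick hardware itself):

```python
# Numerical sketch of the Bit Fusion idea: a wide multiplication is rebuilt
# from 2-bit "BitBrick" partial products via shift-and-add. Arithmetic
# illustration only, not the accelerator's actual datapath.
def bitbrick_multiply(a: int, b: int, width: int = 16, brick: int = 2) -> int:
    mask = (1 << brick) - 1
    chunks = width // brick
    total = 0
    for i in range(chunks):            # 2-bit slices of operand a
        ai = (a >> (brick * i)) & mask
        for j in range(chunks):        # 2-bit slices of operand b
            bj = (b >> (brick * j)) & mask
            total += (ai * bj) << (brick * (i + j))   # shift-and-add fusion
    return total

a, b = 50321, 61234                    # arbitrary 16-bit operands
assert bitbrick_multiply(a, b) == a * b
print(bitbrick_multiply(a, b))         # 3081356114, identical to a * b
```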
Shin et al. [191] proposed the Deep Neural Processing Unit (DNPU) architecture to process CNNs and Recurrent Neural Networks (RNNs). DNPU is a SIMD MAC-based CNN/RNN accelerator that uses dynamic precision control to minimize the kernel data size. DNPU consists of a convolutional layer processor (CP), a fully connected and RNN-LSTM layer processor (FRP), and a RISC controller. The CP performs convolutional operations, and the FRP performs matrix multiplication operations. DNPU is the first CNN/RNN accelerator with the highest energy efficiency of 8.1 TOPS/W on 65 nm CMOS technology. DNPU has some limitations; for instance, its area limits the number of processing elements (PEs) for convolutional layers (CLs) and recurrent layers (RLs). As a result, performance was suboptimal in cases that required just CLs or RLs. Furthermore, DNPU only supports a limited number of weight bit precisions, such as 4 bits, 8 bits, or 16 bits. Lee et al. [134] proposed the Unified Neural Processing Unit (UNPU) architecture to process CNNs and RNNs. UNPU contains a bit-serial processing architecture and achieves an energy efficiency of up to 50.6 TOPS/W with 1-bit weights. UNPU achieves 1.43× higher energy efficiency than the DNPU for convolutional layers with 4-bit weights.

B. DATAFLOW BASED ACCELERATOR
The accelerators based on dataflow put a special emphasis on data management to minimize off-chip memory reads/writes. When it is feasible, reusing parameters between layers can enhance dataflow. For instance, in a convolutional layer, both activations and weights can be reused. In a fully connected layer, each neuron has a unique set of weights; as a result, weights cannot be reused, but input data may be. In order to minimize data movement between a computing unit and higher-level memory, the reusable parameters are kept in local registers.
Cavigelli et al. [50] proposed the Origami CNN accelerator, which is scalable to different network sizes. The proposed architecture uses the Weight Stationary (WS) dataflow to improve energy efficiency during the acceleration process. The WS dataflow minimizes energy consumption by maximizing the reuse of weight coefficients; the WS dataflow used in Origami maximizes convolution and filter reuse of weights. The proposed accelerator was implemented using UMC 65 nm CMOS technology with a core area of 3.09 mm2. The proposed CNN accelerator can achieve a throughput of 274 GOPS and a power efficiency of 369 GOPS/W with an external memory bandwidth of 525 MB/s full-duplex. The proposed architecture only performs the convolution operation and is unsuitable for implementing fully connected layer operations.
Eyeriss [56] is an ASIC-based CNN accelerator that uses a row-stationary (RS) dataflow that minimizes data-movement energy consumption on a spatial computing architecture. The RS dataflow is adaptable to various CNN shapes and minimizes energy consumption by reusing the filter coefficients and input feature maps. The proposed accelerator mainly contains a 12 × 14 PE array, feature map compression units, ReLU units, and a 108 KB global buffer, as shown in Fig. 32. The global buffer enables the reuse of data loaded from off-chip DRAM and of the results generated by the PEs, and is also responsible for returning the final results to the off-chip DRAM. In the Eyeriss accelerator, the PEs are connected via a Network on Chip (NoC); the NoC used in Eyeriss only supports multicast. The authors proposed an analysis framework for calculating the energy efficiency of various CNN dataflows under the same hardware constraints. The proposed accelerator is implemented using 65 nm CMOS technology.
Fig. 32. Eyeriss DNN accelerator, adopted from [56]
Chen et al. [57] proposed a DNN accelerator architecture referred to as Eyeriss v2 to accelerate compact and sparse DNNs. Like Eyeriss [56], Eyeriss v2 is composed of an array of PEs to perform MAC operations, global buffers, and local scratchpad (SPad) memory to support data reuse. In the Eyeriss v2 accelerator, PEs and global buffers (GLBs) are grouped into clusters to support a flexible Network on Chip (NoC), as shown in Fig. 33. The main difference between Eyeriss and Eyeriss v2 is that Eyeriss v2 uses a hierarchical mesh NoC (HM-NoC) to connect the global buffers to the PEs; in contrast, Eyeriss uses a multicast NoC between the global buffer and PEs. Furthermore, the Eyeriss v2 accelerator uses separate NoCs to transfer the input activations, weights, and partial sums between the global buffer and PEs. The hierarchical mesh NoC used in Eyeriss v2 supports unicast, multicast, and broadcast, and can be configured into various modes ranging from high data reuse to high bandwidth. The proposed architecture supports various CNN layer dimensions and sizes because of the flexible hierarchical mesh NoC. The authors proposed an analysis framework named EYEXAM for evaluating the performance of various CNN dataflows. The Eyeriss v2 accelerator has higher hardware utilization than Eyeriss but has a large area overhead. The experimental results show that Eyeriss v2 reaches 11.3× and 42.5× improvements in energy efficiency and throughput, respectively, with sparse AlexNet, compared to Eyeriss running AlexNet. It also achieves 2.5× and 12.6× improvements in energy efficiency and throughput, respectively, with sparse MobileNet compared to Eyeriss running MobileNet.
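To make the dataflow terminology concrete, the following toy sketch (invented for this discussion, with a 1-D convolution standing in for a full CNN layer) contrasts a weight-stationary loop order, in which each weight is fetched once and reused across all outputs, with an output-major order that re-fetches the weight for every MAC:

```python
# Loop-ordering sketch of the dataflow idea above (illustrative only): keeping
# a weight "stationary" in a local register while streaming inputs cuts the
# number of weight fetches from the buffer.
import numpy as np

def conv1d_counts(x, w, weight_stationary: bool):
    out = np.zeros(len(x) - len(w) + 1)
    weight_fetches = 0
    if weight_stationary:
        for k in range(len(w)):        # outer loop over weights
            wk = w[k]                  # fetched once, then reused
            weight_fetches += 1
            for i in range(len(out)):
                out[i] += wk * x[i + k]
    else:
        for i in range(len(out)):      # output-major order
            for k in range(len(w)):
                weight_fetches += 1    # weight re-fetched for every MAC
                out[i] += w[k] * x[i + k]
    return out, weight_fetches

x, w = np.arange(1024, dtype=float), np.ones(11)
_, ws = conv1d_counts(x, w, weight_stationary=True)
_, om = conv1d_counts(x, w, weight_stationary=False)
print(ws, om)   # 11 vs 11154 weight fetches for the same result
```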
Fig. 33. Eyeriss v2 top-level architecture, adopted from [57]
Multiply-Accumulate Engine with Reconfigurable Interconnect (MAERI) is a DNN accelerator containing a set of configurable building blocks to support various CNN partitions and mappings by configuring the tiny switches presented in [132]. MAERI contains a set of multiply-adder computation units, each augmented with tiny configurable switches that can be configured to support various kinds of dataflows, see Fig. 34. The prefetch buffer stores the input activations, intermediate partial sums, weights, and output activations. Acceleration units mainly contain look-up tables (LUTs) and perform activation functions. MAERI uses these configurable switches to support various layer types, viz., convolution, pooling, fully connected layers, and LSTM. The proposed accelerator also supports sparsity and cross-layer mapping. MAERI is implemented in Bluespec SystemVerilog.
Fig. 34. MAERI architecture, adopted from [57]
The Tensor Processing Unit (TPU) was developed by Google in order to implement machine learning algorithms. A matrix multiplication unit built as a systolic array of 256×256 units is used in the TPU architecture [121]. Fig. 35 shows the block diagram of the TPU. The systolic array is built around a weight-stationary dataflow and operates as a 2-D SIMD architecture. Weights fetched from DRAM are stored in the weight FIFO (First-In, First-Out) register. The results from the previous layers and the input activations are stored in the unified buffer. In order to perform a convolution operation on the matrix multiply unit, a systolic data setup block is used to rearrange the data. Efficiently running machine learning inference tasks such as search, image recognition, and language translation was the focus of the first version of the TPU, called TPU1. Since 2015, TPU1 has been operational in Google's data centers. A second version, TPU2, also called Cloud TPU, is operational in data centers for both training and inference. Cloud TPU supports several frameworks, including TensorFlow, PyTorch, and JAX/FLAX.
Fig. 35. Block diagram of the TPU
factor of four, and in some instances, by a factor of ten. Also, the addition is not needed because zero products add nothing to the total of which they are a part. Moreover, data with many zeros can be compressed. These traits, when combined, open up a lot of possibilities for improvement. This section provides a comprehensive overview of accelerators that exploit sparsity.
A CNN accelerator referred to as Sparse CNN (SCNN) is presented in [168] for the inference of CNNs. SCNN employs a novel dataflow referred to as the sparse Planar-Tiled Input-Stationary Cartesian Product (PT-IS-CP-sparse) dataflow, which maximizes the reuse of activations and weights, removes needless data transfers, and reduces storage and power requirements. The dataflow used in SCNN eliminates all multiplications with a zero and keeps both activations and weights in compressed form. SCNN mainly contains an array of processing elements arranged in a 2-D fashion with systolic connections to transfer partial sums.
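A generic zero-skipping sketch conveys the flavor of the sparsity exploitation discussed in this section; note that it is an index-matched sparse dot product for illustration only, not SCNN's actual PT-IS-CP-sparse Cartesian-product dataflow, and the helper names are invented:

```python
# Toy sketch of compressed, zero-skipping computation: activations and weights
# are kept as (index, value) pairs of their non-zeros, so zero products are
# never issued. Illustrative only.
import numpy as np

def compress(v):
    idx = np.flatnonzero(v)
    return list(zip(idx.tolist(), v[idx].tolist()))

def sparse_dot(compressed_a, compressed_w):
    w_map = dict(compressed_w)
    macs, acc = 0, 0.0
    for i, a in compressed_a:          # only non-zero activations are visited
        if i in w_map:                 # and only matching non-zero weights
            acc += a * w_map[i]
            macs += 1
    return acc, macs

rng = np.random.default_rng(0)
act = rng.random(256) * (rng.random(256) < 0.3)    # ~70% zero activations
wgt = rng.random(256) * (rng.random(256) < 0.3)    # ~70% zero weights
ref = float(act @ wgt)
val, macs = sparse_dot(compress(act), compress(wgt))
print(abs(val - ref) < 1e-9, macs, "MACs instead of", act.size)
```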
The proposed dataflow efficiently delivers activations and weights to the multiplier array to perform the required MAC operations. SCNN exploits all three kinds of parallelism, viz., inter-kernel, intra-kernel, and inter-output. SCNN requires additional optimization circuitry to implement fully connected layer operations. SCNN improves performance by skipping the zeros in the input feature maps and weights. SCNN is implemented in SystemC, and the Catapult High-Level Synthesis (HLS) [30] tool is used to generate the Verilog RTL. Synopsys Design Compiler synthesizes the Verilog version of the design. SCNN is implemented using TSMC 16 nm FinFET technology.
Eyeriss [56] also looked into input sparsity as a way to save energy. Its gating mechanism deactivates MAC units that correspond to zero inputs; gating saves energy but does not increase throughput. With sparse models, the processing speed and energy efficiency of Eyeriss v2 [57] improve due to its ability to process sparse data directly in compressed format for both the weights and activations.
Zhang et al. [228] developed the Sparse Neural Acceleration Processor (SNAP) to exploit unstructured sparsity in DNNs. To ensure that data is distributed evenly throughout the MAC units, SNAP employs parallel associative search. SNAP is fabricated using 16 nm CMOS technology and achieves a peak energy efficiency of 21.55 TOPS/W (FP16) for CONV layers with 10% weight and activation density.
Lee et al. [136] proposed an energy-efficient on-chip accelerator called LNPU for sparse DNN model learning. In the LNPU accelerator, sparsity is exploited with intra-channel as well as inter-channel accumulation. The input load buffer module of the LNPU evenly distributes the workload among the PEs while considering irregular sparsity. LNPU uses the fine-grained mixed precision (FGMP) of FP8-FP16, which optimizes data precision while maintaining training accuracy. LNPU maintains an average hardware utilization of 100%. LNPU is fabricated using 65 nm CMOS technology and has an energy efficiency of 3.48 TFLOPS/W (FP8) at 0% sparsity and 25.3 TFLOPS/W (FP8) at 90% sparsity.
SIGMA is a scalable and flexible accelerator proposed in [177] to implement large, irregular, and sparse general matrix-matrix multiplications (GEMMs). The basic building block in SIGMA is the Flexible Dot Product Engine (Flex-DPE). All the Flex-DPE modules can be interconnected via a simple NoC. In SIGMA, all the Flex-DPE multipliers are arranged in a 1-D fashion, and it performs multiple variable-sized dot products in parallel. SIGMA uses scalable interconnects to efficiently map GEMMs of different dimensions and sparsity levels to the PEs. SIGMA outperforms systolic array architectures by 5.7× for irregular sparse matrices. SIGMA is implemented using 28 nm CMOS technology and achieves a throughput of 10.8 TFLOPS with a power dissipation of 22.33 W.
Zhang et al. [227] proposed an accelerator called GAMMA to perform sparse matrix-sparse matrix multiplication (spMspM) operations. The proposed accelerator uses Gustavson's algorithm [100] to compute the spMspM operations. The GAMMA accelerator mainly consists of an array of processing elements (PEs), on-chip storage referred to as FiberCache, and a scheduler, as shown in Fig. 36. The PEs are used to perform the required spMspM operations that combine sparse input rows to produce each output row. FiberCache is a specialized memory structure that stores the non-zero elements and their coordinates. The scheduler distributes computational workloads among the PEs to maximize resource efficiency while reducing unnecessary access to shared memory. GAMMA is implemented using 45 nm CMOS technology.
Fig. 36. Block diagram of GAMMA, adopted from [227]
We summarize the reviewed ASIC-based accelerators for DNNs in Table 5. For each accelerator, we list the year the accelerator was introduced, the process technology, the clock frequency, the dataflow, the architecture type, the power dissipation, the area, the performance in GOPS, and finally, the power efficiency. Fig. 37 shows plots of various metrics, such as power, throughput, area, and power efficiency, of the ASIC-based accelerators.

V. GPU BASED ACCELERATORS
Over the last few decades, Graphics Processing Units (GPUs) have been widely used in training DL algorithms and CNNs for face recognition [110], object detection [222], [229], data mining [89], and other AI applications. GPUs support parallelism due to the large number of parallel cores in the architecture and offer significant computation speed. A GPU exploits large degrees of data-level parallelism in applications through the Single Instruction Multiple Thread (SIMT) execution model. The high computational capacity of GPUs makes them the primary choice for DNN acceleration. In this section, we review some of the recent GPU-based DNN accelerators.
The study of implementing a standard backpropagation algorithm for training multiple perceptrons simultaneously on a GPU using NVIDIA CUDA technology is presented in [101]. For a given program, the GPU-based implementation on an NVIDIA GTX 260 GPU achieves 50× to 150× speedup compared to the CPU-based implementation.
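Returning to the GAMMA accelerator described above, Gustavson's row-wise formulation of spMspM can be sketched in a few lines of plain Python (an algorithmic illustration only, with invented helper names; it does not model FiberCache or the scheduler):

```python
# Row-wise (Gustavson-style) spMspM sketch: matrices are stored as per-row
# dictionaries of non-zero columns, so zero products are never formed.
def spmspm_gustavson(A_rows, B_rows, num_rows):
    """C = A @ B with A_rows[i] = {col: val} and B_rows[k] = {col: val}."""
    C_rows = []
    for i in range(num_rows):
        acc = {}                                    # accumulator for row i of C
        for k, a_ik in A_rows.get(i, {}).items():   # non-zeros of A's row i
            for j, b_kj in B_rows.get(k, {}).items():
                acc[j] = acc.get(j, 0.0) + a_ik * b_kj
        C_rows.append(acc)
    return C_rows

A = {0: {1: 2.0}, 1: {0: 3.0, 2: 1.0}}              # 2x3 sparse matrix
B = {0: {0: 1.0}, 1: {2: 4.0}, 2: {0: 5.0}}         # 3x3 sparse matrix
print(spmspm_gustavson(A, B, num_rows=2))
# [{2: 8.0}, {0: 8.0}]  -> row 0: 2*4 at col 2; row 1: 3*1 + 1*5 at col 0
```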
A neurally accelerated architecture for GPUs, called NGPU (neurally accelerated GPU), is presented in [220] to enable scalable integration of neural acceleration with a large number of GPU cores. The proposed architecture brings the neural and GPU accelerators together without hampering the SIMT execution model. NGPU provides significant energy and performance benefits at the cost of reasonably low hardware overhead. NGPU achieves 2.44× average speedup and 2.8× average energy reduction compared to the baseline GPU architecture across different sets of benchmarks.
Danial et al. [197] presented a framework for accelerating the training and classification of arbitrary CNNs on the GPU. The proposed method improves performance by moving the computationally intensive tasks of a CNN to the GPU. Training and classification of a CNN on the GPU performs 2 to 24 times faster than on the CPU, depending on the network topology. Li et al. [140] proposed an efficient GPU implementation to accelerate the training process of large-scale Recurrent Neural Networks (RNNs). When compared to a CPU-based solution with Intel's Math Kernel Library (MKL), the proposed method yields a speedup of 2 to 11 times. Kim et al. [128] proposed a new memory management scheme to enhance the overall GPU memory utilization in multi-GPU systems for deep learning algorithm acceleration. The authors extended the concept of vDNN to a multi-GPU environment employing the PCIe bus, where vDNN [180] virtualizes the GPU and CPU memory so that both can be used simultaneously to train DL algorithms in a hybrid fashion. The suggested memory scheme increases batch size by 60% in multi-GPU systems and enhances training throughput by 46.6%.
A high-performance GPU-dedicated architecture referred to as TResNet is presented in [181] to accelerate CNNs. The proposed architecture effectively utilizes the GPU resources and achieves better accuracy and efficiency.
Nvidia GPUs are the most popular for Deep Learning (DL) implementations. Table 6 lists the accelerators that Nvidia has released, which are used for the inference and training of deep learning algorithms and have both a Central Processing Unit (CPU) and a GPU integrated on a single chip.

VI. CGRA-BASED ACCELERATORS
Coarse Grain Reconfigurable Architectures (CGRAs) primarily consist of an array of Processing Elements (PEs) connected using reconfigurable interconnects. Compared to FPGAs, CGRAs often have a shorter reconfiguration time. CGRAs have emerged as a popular option for real-time computing due to their low power consumption, high efficiency, fast reconfiguration time, and ability to perform both spatial and temporal computations. In recent years, CGRAs have become increasingly significant in accelerating DNNs, particularly CNNs, thanks to their ability to combine FPGAs' flexibility with ASICs' efficiency. In this section, we review some of the recent CGRA-based DNN accelerators.
Jafri et al. [119] proposed a CGRA-based accelerator named NeuroCGRA to realize both neural networks and digital signal processing applications. The authors investigated the viability of deploying neural networks on an actual CGRA by means of a Dynamically Reconfigurable Resource Array (DRRA).
Fig. 37. Performance metrics of ASIC-based accelerators: (a) power efficiency (GOPS/W); (b) throughput (GOPS); (c) power dissipation (W); (d) area (mm2).
DRRA mainly consists of four elements, viz., Data Path Units (DPUs), register files (Reg-files), Switch Boxes (SBs), and sequencers, as shown in Fig. 38. The DPUs are the functional units that perform the required computations. The Reg-files store the data for the DPUs. Interconnectivity between the various DRRA components is provided through the SBs. The sequencers configure the DPUs, switch boxes, and register files. The Distributed Memory Architecture (DiMArch) is essentially a scratchpad providing enough data to the DRRA. The authors embedded dedicated hardware, known as the neuroDPU, with each DPU of the DRRA to implement neural networks on it. The authors also proposed a neural network translator that provides a framework for mapping neural networks onto CGRAs. The translator takes three inputs, viz., the network model, weights, and network specifications, and generates three outputs: DPU, Reg-file, and SB instructions. NeuroCGRA is synthesized using 65 nm technology running at a frequency of 500 MHz. A framework called FIST is presented in [163] that allows NeuroCGRA [119] to realize both DSP applications and neural networks, depending on the target application. The authors implemented edge detection on the DRRA using the proposed framework.
EMAX is an energy-efficient, low-power CGRA architecture with on-chip distributed memory proposed in [204] to implement CNNs. EMAX supports both CNN training and inference.
Fig. 38. DRRA computation layer [119]
Fig. 39. EMAX architecture, adopted from [204]
EMAX is composed primarily of an array of PEs and an interconnection network, as shown in Fig. 39. Each PE is connected to its neighbors by local interconnections, and each row of the PE array has a shared bus. The results of calculations performed on the PEs are passed on to the PEs in the next row. The PEs can access external memory (DRAM) via the memory interface. Each PE has two execution units that perform the arithmetic and logical operations. Each PE also has a local memory to store the required data, reducing the memory bandwidth pressure. Experimental results show that EMAX performs better than GPUs in terms of performance per memory bandwidth and per area.
A CGRA-based accelerator referred to as the stream dual-track CGRA (SDT-CGRA), which targets the implementation of object inference algorithms, is presented in [83]. SDT-CGRA employs stream processing and uses both static and dynamic configurations. The SDT-CGRA accelerator mainly contains an array of PEs known as reconfigurable cells (RCs) and stream buffer units (SBUs), as shown in Fig. 40. The SDT-CGRA architecture is divided into two sections: global memory and a computing array. The global memory section is dynamically configured and stores data streams. On the other hand, the computing array section operates in a static configuration mode. It comprises several RC columns and one special RC column. The special RC is used for operations like power (represented as PRC in Fig. 40) and piecewise functions (represented as IRC in Fig. 40). A crossbar switch serves as a bridge to connect the RC array and the SBUs. Data can be transferred from off-chip memory to the SBUs using the external direct memory access interface. Static and dynamic interfaces are used for static and dynamic configurations, respectively. The proposed SDT-CGRA is realized in Verilog HDL, and Synopsys Design Compiler is used to synthesize the design. The proposed SDT-CGRA is implemented using SMIC 55 nm CMOS technology. Experimental results show that SDT-CGRA outperforms EMAX by three times in terms of operations per memory bandwidth.
In [111], the authors proposed an efficient mapping of CNNs onto a Tightly Coupled Processor Array (TCPA). TCPA belongs to the class of CGRAs, containing an array of tightly coupled VLIW Processing Elements (PEs) [105]. TCPA offers multiple levels of parallelism, for instance, task-level, loop-level, iteration-level, and instruction-level parallelism. TCPAs are suited for accelerating computationally expensive nested-loop programs exhibiting a high degree of parallelism, such as CNNs. CNN layers are based on matrix multiplications that can be written as 6-dimensional nested loops, making them suitable for acceleration.
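The nested-loop view of a convolutional layer mentioned above can be written directly as follows (a plain reference implementation for illustration; the loop order shown is only one of the many orderings that mapping tools may permute or unroll):

```python
# The convolution layer written as a direct, unoptimized nested-loop kernel.
import numpy as np

def conv_layer(inp, w):
    """inp: (C_in, H, W), w: (C_out, C_in, KH, KW) -> (C_out, H-KH+1, W-KW+1)."""
    C_out, C_in, KH, KW = w.shape
    _, H, W = inp.shape
    out = np.zeros((C_out, H - KH + 1, W - KW + 1))
    for co in range(C_out):                   # 1: output channels
        for oy in range(H - KH + 1):          # 2: output rows
            for ox in range(W - KW + 1):      # 3: output columns
                for ci in range(C_in):        # 4: input channels
                    for ky in range(KH):      # 5: kernel rows
                        for kx in range(KW):  # 6: kernel columns
                            out[co, oy, ox] += w[co, ci, ky, kx] * inp[ci, oy + ky, ox + kx]
    return out

x = np.random.rand(3, 8, 8); k = np.random.rand(4, 3, 3, 3)
print(conv_layer(x, k).shape)   # (4, 6, 6)
```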
Fig. 40. SDT-CGRA architecture, adopted from [83]
It was demonstrated that TCPAs can use techniques such as loop permutation, loop unrolling, and layer-parallel processing to exploit the parallelism offered by the TCPA architecture. Layer fusion allows the processing of multiple layers of a CNN in an overlapped fashion [33], which was exploited by TCPA to save the intermediate memory needed between the layers. Loop permutation allows the computation of multiple convolution filters in an interspersed way. TCPA allows the parallel execution of multiple layers by different PEs. A CNN model for the MNIST benchmark was evaluated on an array of size 4×4, and the performance of the layer-parallel approach was compared with layer-by-layer processing.
A CGRA-based accelerator called Neural Processing CGRA (NP-CGRA) is presented in [135] to accelerate lightweight CNNs. The authors proposed a set of extensions to the baseline CGRA [153] to improve the performance of CGRAs and to efficiently implement depthwise convolution (DWC) and pointwise convolution (PWC). The authors presented three architectural extensions: a crossbar-style memory bus, a dual-mode MAC unit, and an operand reuse network. The crossbar-style memory bus contains horizontal and vertical buses, and each bus is accessible to all the PEs connected to it. The dual-mode MAC unit works in two modes: MAC mode and MUL/ALU mode. The multiplication and accumulation operations are chained together in the MAC mode to realize the MAC function; in the MUL/ALU mode, a PE can choose either a multiplication or an addition operation in each cycle. The operand reuse network offers input-to-input routing instead of output-to-input routing. The proposed NP-CGRA is realized in Verilog HDL, and Synopsys Design Compiler is used to synthesize the design. The proposed NP-CGRA is implemented using Samsung 65 nm CMOS technology. Experimental results show that the area-delay product of NP-CGRA is 8-18 times better than that of the baseline CGRA.
There is a lot of room for CGRA research to develop and expand as a topic of study for future architectures; this is especially true when developing high-performance CGRAs tailored to specialized or general-purpose computing. Some key issues that require further research in this area include developing tools to program the architecture efficiently, memory management, scalability, adaptability, productivity, virtualization, etc.

VII. EMBEDDED AI ACCELERATORS
The AI hardware requirements are more critical in the edge environment, which is typically represented by Internet of Things (IoT) devices (e.g., smart speakers, mobile phones, sensors, and actuators) with limited computing resources, as opposed to cloud infrastructure with relatively sufficient computing capability. For the sake of real-time immediacy, latency, offline capabilities, security, and privacy, AI models are increasingly required to be implemented on the edge. In this context, Small Form Factor (SFF) devices such as microcontrollers, which dominate the market, are of particular interest, and having AI capabilities on these devices can help many applications. Many industrial solutions require products with SFF and Size, Weight, and Power (SWaP) enhanced embedded systems. In this section, we review some of the latest embedded AI accelerators.
Fig. 42 shows the architecture of the Edge TPU from Google, which is used in products such as Coral and Pixel phones [112]. Edge TPUs are designed to give high-performance acceleration while staying within strict physical and power constraints [221]. The Edge TPU is organized as a 2-D array of Processing Elements (PEs), where each PE performs computations in a SIMD fashion. Data is transferred between off-chip memory and the PEs via an on-chip controller. Activations and parameters are loaded into the on-chip staging buffers by the controller. In addition, the controller reads in the low-level instructions (e.g., convolution, pooling, etc.) that will be executed on the PEs. Each PE may contain single or multiple cores, each having multiple compute lanes to support operation in a SIMD fashion. Model activations, partial results, and outputs are all stored in a memory shared across all cores, labelled PE Memory in Fig. 42. Each PE's cores have a core memory that is mostly used to store model parameters. Each compute lane has multi-way MAC units to perform computations between activations and model parameters. A few prototyping boards from Coral, see Fig. 43, are available for the community to try and deploy ML apps at the edge, including the Dev Board, USB Accelerator, Dev Board Mini, and Dev Board Macro [27]. The TensorFlow Lite framework is particularly developed for mapping various neural network operations onto the Edge TPU [29]. The Edge TPU coprocessor can compute 4 trillion operations per second (TOPS) while consuming just 0.5 watts for each TOPS (2 TOPS per watt) [27].
NVIDIA's Jetson Nano [3], [184] is an embedded board suitable for edge AI applications. It contains a 64-bit
Fig. 41. TCPA accelerator showing a PE array of size 4 × 4 and a CNN that is mapped onto it for recognizing digits from the MNIST database: a) TCPA accelerator; b) CNN to recognize digits in 28×28 pixel images
Fig. 42. Edge TPU architecture
Fig. 43. Coral prototyping boards: a) Dev Board; b) USB Accelerator; c) Dev Board Mini; d) Dev Board Macro
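For completeness, the following snippet shows a typical TensorFlow Lite invocation pattern for running a model on an Edge TPU; it is only a hedged sketch that assumes the tflite_runtime package and the Edge TPU runtime (libedgetpu) are installed, and the model file name is a placeholder for a network already compiled with the Edge TPU compiler:

```python
# Illustrative Edge TPU inference via TensorFlow Lite (assumes tflite_runtime
# and libedgetpu are installed; "model_edgetpu.tflite" is a placeholder).
import numpy as np
from tflite_runtime.interpreter import Interpreter, load_delegate

interpreter = Interpreter(
    model_path="model_edgetpu.tflite",
    experimental_delegates=[load_delegate("libedgetpu.so.1")])  # Edge TPU delegate
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
dummy = np.zeros(inp["shape"], dtype=inp["dtype"])   # e.g. a quantized image tensor
interpreter.set_tensor(inp["index"], dummy)
interpreter.invoke()                                  # runs on the Edge TPU
print(interpreter.get_tensor(out["index"]).shape)
```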
The BeagleBone AI, which uses the AM5729 Sitara SoC [116], is yet another board for AI at the edge. This SoC has two 32-bit Arm Cortex-A15 cores, two Image Processing Units (IPUs), each with two Cortex-M4 cores, two C66x DSP cores, two PowerVR SGX544 3D GPUs, and four Embedded Vision Engines (EVEs). It also has 15 GB of eMMC flash, 1 GB of RAM, Wi-Fi and Bluetooth support, and USB connectors for power and data transfer. The BeagleBone AI runs Linux, and the TI Deep Learning (TIDL) framework can be used to develop real-time ML applications.
in the MAC convolution engine. MAC convolution engine captivating alternative uses of on-device machine learning.
receives the input data from the input feature map read TinyML supports various frameworks, including Tensor-
block, weights from the weight decoder, and performs the Flow Lite Micro (TFLM), TensorFlow-Native, Embedded
required MAC operation. The result of the convolution is Learning Library (ELL), Graph Lowering (GLOW), etc.
processed by the PLE, which is a vectorized microcontroller. Google developed an open-source framework referred to as
It is more akin to a RISC platform designed to wrap up CFU Playground [175] for TinyML acceleration on FPGA.
the processing of a layer for a piece of a DNN model with CFU playground toolchain combines open-source software
several layers. The PLE is in charge of tasks like pooling and (TensorFlow), RTL generators (LiteX, Migen, etc.), and
activation. The throughput of the proposed ML processor FPGA tools for synthesis (yosys), place, and route (vpr).
is 4.6 TOPS. The proposed design is implemented using The CFU playground framework makes it possible to in-
7 nm chip technology, and it is scalable, can achieve the vestigate custom architectures for the acceleration of Tiny
throughput of 150 TOPS for high-end applications. ML for embedded ML systems. TinyML is used in many
The Arm AI platform, also known as Project Trillium, applications, including medical face mask detection [157],
is a heterogeneous compute platform that includes Arm eating detection [167], Li-Ion batteries parameter estima-
Cortex CPUs, Ethos NPUs, Mali GPUs, and microNPUs to tion [69], etc. The most in-demand research areas among the
accelerate the ML algorithms [143]. Arm supports various TinyML community include sound recognition, computer
ML frameworks such as TensorFlow Lite, Caffe, Pytorch, vision, and the development of low-power accurate ML
etc., and accelerates ML applications using software libraries including Arm NN, the Arm Compute Library, and the Common Microcontroller Software Interface Standard-NN (CMSIS-NN). Its hardware products include Arm Cortex CPUs, Ethos NPUs, Mali GPUs, microNPUs, FPGAs, DSPs, etc. Arm's new Cortex-A55/A75 and Mali-G72 combination targets machine learning on edge computing devices.

Arm has developed its Ethos series of ML processors for machine learning applications. The Ethos series is classified into two types: the N-series and the U-series [25]. The Ethos N-series, introduced in October 2019, contains NPUs identical to the Cortex family. The Ethos U-series, introduced in early 2020, contains microNPUs, which are paired with a CPU such as the Cortex-M55 to process ML algorithms. The Ethos-U55 achieves a throughput of 0.5 TOPS and contains 32 to 256 8-bit MAC units [144]; it supports 8-bit and 16-bit integer data types. The Ethos-U65 achieves a throughput of 1 TOPS and contains 256 to 512 8-bit MAC units. The Ethos-N57 achieves a throughput of 2 TOPS and contains 1024 8-bit MAC units. The Ethos-N77 is a highly efficient ML inference processor that achieves a throughput of 5 TOPS and is best suited for mobile devices; it can be used for facial or object recognition applications. The Ethos-N78 is a scalable and efficient ML inference processor that achieves a throughput of 1 to 10 TOPS [145]. Arm's Cortex-M55 and the Ethos-U55 can be used together as an AI accelerator in edge computing devices [90]. This combination achieves a 32× improvement in ML processing compared to the base Cortex-M55 core. Furthermore, TinyML [20] advancements have made it possible to run ML models on the microcontroller hardware found in household appliances, including printers, TVs, smartwatches, and pacemakers, which can now carry out tasks that were previously only possible on computers and smartphones. The machine learning and embedded ultra-low-power systems communities have joined forces to create the TinyML Foundation. This joint effort has paved the way for innovative ML models. More research is needed to fully comprehend the advantages and drawbacks of the topics under discussion, even though many applications have demonstrated TinyML's promise. Some key issues that require further research in this area include developing benchmarks, memory constraints, energy, processor capacity, cost reduction, etc.

VIII. COMPARISON BETWEEN VARIOUS HARDWARE ARCHITECTURES FOR DNN ACCELERATION

The performance of the various hardware accelerators for DNN acceleration depends on the target application. However, researchers have defined some standard metrics, namely area, power, and throughput, to measure the performance of hardware accelerators for the development and deployment of DNNs. Here, the area is the portion of silicon required for the DNN acceleration, generally expressed in square millimeters or square micrometers; it depends on the size of the on-chip memory and the technology used during the hardware synthesis process. Power is the amount of power consumed by the specific hardware during DNN acceleration; the power consumption mainly depends on the off-chip and on-chip memories. Throughput measures the productivity of the hardware accelerator. The comparison between the various hardware accelerator architectures for DNN acceleration is shown in Table 7. Due to a lack of data on their footprint, power consumption, and throughput, CGRA-based accelerators are not represented in Table 7. As expected, temporal or general-purpose architectures such as CPUs and GPUs have greater power consumption and area than special-purpose architectures such as FPGAs and ASICs because they are not tailored to a particular application. The essential hardware metrics, namely power, area, technology, and throughput, are reported for each hardware architecture.

In Table 8, we have compared the few embedded development boards discussed above with respect to their general-purpose CPUs/GPUs, the specialized co-processors they contain, performance, power, SDKs, and supported ML frameworks.
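To connect the per-device MAC counts quoted above with the throughput metric used in this comparison, note that peak throughput is commonly estimated as 2 × (number of MAC units) × (clock frequency), since each MAC performs one multiply and one add per cycle. The short sketch below illustrates this back-of-the-envelope calculation; the 1 GHz clock and the example power figure are illustrative assumptions, not values taken from the vendor datasheets or from Table 7.

```python
def peak_throughput_tops(num_macs: int, clock_hz: float) -> float:
    """Peak throughput in TOPS, counting each MAC as two operations (multiply + add)."""
    return 2 * num_macs * clock_hz / 1e12


def efficiency_tops_per_watt(tops: float, power_w: float) -> float:
    """Energy efficiency, a metric often reported alongside area and power."""
    return tops / power_w


if __name__ == "__main__":
    clock_hz = 1e9  # assumed 1 GHz clock, for illustration only
    for macs in (256, 1024):
        tops = peak_throughput_tops(macs, clock_hz)
        print(f"{macs} MACs at 1 GHz -> ~{tops:.2f} TOPS peak")
    # Example efficiency for a hypothetical accelerator delivering 2 TOPS at 2 W.
    print(f"Efficiency: {efficiency_tops_per_watt(2.0, 2.0):.1f} TOPS/W")
```

Under this assumption, a 256-MAC configuration yields roughly 0.5 TOPS and a 1024-MAC configuration roughly 2 TOPS, which is consistent with the Ethos-U55 and Ethos-N57 figures quoted above; actual sustained throughput is lower and depends on dataflow, memory bandwidth, and utilization.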
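As a concrete illustration of the TinyML deployment path described above, microcontroller-class targets such as Cortex-M CPUs and Ethos-U microNPUs typically execute 8-bit integer models produced by post-training quantization. The following minimal sketch, assuming a trained Keras model and a hypothetical calibration generator named representative_samples, shows one common TensorFlow Lite conversion flow; it is an illustrative example rather than a flow taken from the surveyed works.

```python
import tensorflow as tf


def convert_to_int8_tflite(model: tf.keras.Model, representative_samples) -> bytes:
    """Convert a trained Keras model to a fully int8-quantized TFLite flatbuffer,
    the usual starting point for microcontroller/microNPU deployment."""
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_samples  # calibration data
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8
    return converter.convert()


# Hypothetical usage: calib_batches is a list of input arrays used for calibration.
# def representative_samples():
#     for batch in calib_batches:
#         yield [batch]
# open("model_int8.tflite", "wb").write(convert_to_int8_tflite(my_model, representative_samples))
```

The resulting flatbuffer can then be compiled offline for a specific microNPU (for example, with Arm's Vela compiler for the Ethos-U series) and executed on-device through a runtime such as TensorFlow Lite for Microcontrollers or optimized CMSIS-NN kernels.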
[Figure: Block diagram of the Arm Machine Learning Processor, showing the ACE-Lite interface, DMA engine, control unit, sync unit, vector engines, CPU interrupt, and on-chip SRAM.]
IX. FUTURE DIRECTIONS

In the future, hardware AI acceleration is set to become ubiquitous. In recent processors, some form of AI accelerator hardware is becoming a standard feature, indicating that AI acceleration is now treated as an essential general-purpose task. In this paper, we have reviewed several FPGA-based, ASIC-based, GPU-based, CGRA-based, and edge AI hardware accelerators. However, the industry trends and startups in this space indicate that we are still in the early stage of the AI revolution. Many more energy-efficient architectures will emerge in the future. In particular, architectures with transprecision or approximate computing, high-bandwidth memories, and emerging non-volatile memories such as MRAM and ReRAM may appear in the market. Evolving architectures involving the Tsetlin machine are another promising future research direction.

Emerging technologies such as nanomaterials, optical computing, and DNA computing may accelerate DNNs in the near future. Carbon nanomaterials, such as carbon nanotubes (CNTs) and graphene, are particularly intriguing due to their rapid electron transport [58]. CNTs and graphene have desirable switching and optical properties, making them well-suited to electronic and optical architectures [190]. New chip architectures become possible with the help of CNTs and other nanomaterials. Researchers at MIT and Stanford have developed a new 3-D architecture based on a network of millions of carbon nanotubes [192]. Computations in optical computing technology can happen at the speed of light, much faster than in conventional electron-driven chips. To advance optical computing, MIT is driving research in advanced optical materials, switches, lasers, and nano-optics [107]. We may expect to see a greater deployment of optical chips in the future. DNA computing is a type of parallel computing in which many different DNA molecules are used to test many possibilities simultaneously [139]. The major advantage of DNA is its potential for memory storage: a single gram of DNA can store 215 petabytes (215 million gigabytes) [6]. Although DNA information storage has enormous application potential, many issues, such as the high cost of writing and reading information and the still-unknown techniques to erase and rewrite information in DNA, must be addressed before its widespread use [75].

In FPGA-based architectures, the following future directions seem promising. The combination of FPGAs and cloud computing opens up new avenues for developing deep learning applications. The FPGA cloud service is still in its early stages. Many imperfections must be investigated, such as the virtualization of FPGA hardware resources, task migration, and so on. The majority of current research is
focused on lowering the bandwidth requirements for off-chip memory access. The performance of multiple FPGA chips combined is favourable. However, dealing with processing scheduling and chip allocation remains a significant challenge. Future research could focus on the development of in-memory-computing processors. Moreover, further improvements are required in the computation of the activation functions used in DNNs. Because most studies focus on loop optimization, only a few researchers are currently working on activation function optimization. There will be frameworks to integrate existing or new architectures, which will help quickly deploy applications. Most importantly, FPGA-based accelerator research will shift towards training rather than inference.

For ASIC-based hardware accelerators, the following future research trends are suggested. The TPU is already a standard in the field of deep learning. More capable replacements are likely to emerge in the coming years. There will be entirely new architectures targeting low-latency and low-power applications. Most current studies assume a trained DNN and focus on increasing the speed of its inference. There have been only a few studies on accelerator design for DNN training. Therefore, there will be more emphasis on developing ASIC-based DNN training accelerators in the future. More research and breakthroughs in CPU-GPU heterogeneous architectures are required for more efficient DNN implementations. Special-purpose or data center system-on-chips (SoCs) with embedded FPGA- or GPU-based machine learning accelerators appear to be gaining traction. In CGRA-based accelerators, architectures driven by programming might be interesting. Other directions include introducing processing-in-memory into CGRA architectures to address the data movement bottleneck. Further improvements are needed in architectures that support dynamic configuration, as this is an important step towards the widespread use of CGRAs.

The following trends may be observed in the development of edge AI accelerators in the future. Edge AI operates in a heterogeneous environment in which the data at the edge and the preprocessing techniques required for each sensor vary greatly between applications. Therefore, more customized, powerful, and energy-efficient chips for specific edge ML applications will be developed. Multimodal deep learning is a major development that pulls data from multiple sources to extract more granular features. By using these multimodal techniques, instead of just recognizing a car, the make and model of the car can be pinpointed. Other potential research directions include using distributed ML algorithms to speed up ML algorithm training and reduce the amount of memory required for processing. ML applications at the edge require high accuracy. Therefore, methods for implementing cutting-edge models at the edge while maintaining accuracy, based on deep learning model pruning and quantization, are among the new research directions. We will also see the development of customized as well as general SDK frameworks targeting specific or multiple edge accelerators for easy deployment of neural network applications.

X. CONCLUSION

Deep Neural Networks have recently gained popularity in a variety of applications. They are, however, computationally demanding, making them difficult to handle on general-purpose architectures. In this context, a detailed review of recent advances in DNN acceleration on specialized hardware architectures such as FPGA, ASIC, GPU, and CGRA is presented. Furthermore, embedded AI accelerators for the edge environment have been thoroughly discussed. The review begins with a detailed background of DNNs, with a focus on their key operations and applications. CNNs, which have a wide range of applications, have also been included in the review. To improve the performance of the hardware accelerator, we discussed various computing architectures such as temporal and spatial architectures, as well as different dataflow patterns. The review focused on recent advancements in the acceleration of DNNs on FPGA, ASIC, GPU, CGRA, and embedded AI accelerators. The review divided the FPGA-based accelerators into three categories and briefly discussed their key features, including the frameworks available for each. Similarly, ASIC-based accelerators are classified, and the review summarizes the accelerators available in the literature based on area, power dissipation, throughput, resource utilization, and so on. A comprehensive review of Nvidia's GPU-based accelerators was also presented. Furthermore, the review compared the various popular FPGA/ASIC/GPU-based accelerators. It has been observed that temporal architectures such as CPU and GPU dissipate more power than spatial architectures such as FPGA and ASIC; however, they have higher throughput than FPGA and ASIC. As a result, it is difficult to say that one architecture is superior to another, because it depends on the target application and requirements. Furthermore, the survey presented and compared recent research contributions in Arm-based machine learning processors and a few embedded AI hardware accelerators in terms of their cores, performance, power, availability of Software Development Kits (SDKs), and supported frameworks. Finally, the review suggests future research directions for DNN acceleration using various hardware architectures, including FPGA, ASIC, GPU, CGRA, and edge AI accelerators.

REFERENCES
[1] ULTRA96-V2. https://www.avnet.com/opasdata/d120001/medias/docus/198/5365-pb-ultra96-v2-v10b.pdf. (Accessed on 01/07/2022).
[2] Accelerate fast math with intel® oneapi math kernel library. https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/onemkl.html#gs.3595x9. (Accessed on 06/05/2021).
[3] Advanced AI Embedded Systems: NVIDIA Jetson: The AI Platform for Autonomous Machines. https://www.nvidia.com/en-in/autonomous-machines/embedded-systems/. (Accessed on 08/02/2022).
[4] BeagleBone AI: Fast Track to Embedded Artificial Intelligence. https://beagleboard.org/AI. (Accessed on 01/02/2022).
[5] BitMain Neural Network SDK: Introduction. https://sophon-edge.gitbook.io/project/. (Accessed on 01/02/2022).
[6] DNA could store all of the world's data in one room - science.
[7] DPU for Convolutional Neural Network. https://www.xilinx.com/ Proceedings of Machine Learning Research, pages 173–182, New York,
products/intellectual-property/dpu.html. (Accessed on 05/01/2022). New York, USA, 20–22 Jun 2016. PMLR.
[8] Edge TPU Developer Board. https://www.sophon.ai/product/introduce/ [35] A. Argal, S. Gupta, A. Modi, P. Pandey, S. Shim, and C. Choo. Intelligent
edb.html. (Accessed on 01/02/2022). travel chatbot for predictive recommendation in echo platform. In
[9] Gluon AI Co-Processor. https://alphaics.ai/products/ 2018 IEEE 8th Annual Computing and Communication Workshop and
gluon-ai-accelerator/. (Accessed on 01/10/2022). Conference (CCWC), pages 176–183, 2018.
[10] Intel® Movidius™ Myriad™ X Vision Processing Unit. [36] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by
https://www.intel.com/content/www/us/en/products/details/processors/ jointly learning to align and translate. CoRR, abs/1409.0473, 2015.
movidius-vpu/movidius-myriad-x.html. (Accessed on 04/02/2022). [37] Y. Bengio. Learning deep architectures for ai. Found. Trends Mach.
[11] Jetson Nano Developer Kit. https://www.nvidia.com/en-in/ Learn., 2(1):1–127, Jan. 2009.
autonomous-machines/embedded-systems/jetson-nano-developer-kit/. [38] K. Benkrid and S. Belkacemi. Design and implementation of a 2d
(Accessed on 08/02/2022). convolution core for video applications on fpgas. In Third International
[12] Kendryte K210. https://canaan.io/product/kendryteai. (Accessed on Workshop on Digital and Computational Video, 2002. DCV 2002. Pro-
09/02/2022). ceedings., pages 85–92, 2002.
[13] Maixduino. https://www.seeedstudio.com/ [39] M. Bergeron. Real-Time Face Recognition on
Sipeed-Maixduino-Kit-for-RISC-V-AI-IoT-p-4047.html. (Accessed on Ultra96-V2. https://www.hackster.io/AlbertaBeef/
09/02/2022). real-time-face-recognition-on-ultra96-v2-94de9b. (Accessed on
[14] MAX78000—Artificial Intelligence Microcontroller with Ultra-Low- 01/02/2022).
Power Convolutional Neural Network Accelerator. https://www. [40] Y. H. Bhosale and K. S. Patnaik. Application of deep learning techniques
maximintegrated.com/en/products/microcontrollers/MAX78000.html. in diagnosis of covid-19 (coronavirus): A systematic review. Neural
(Accessed on 01/10/2022). Processing Letters, Sep 2022.
[15] Myriad 2 MA2x5x Vision Processor: Transforming Devices [41] Y. H. Bhosale and K. Sridhar Patnaik. Iot deployable lightweight deep
Through Ultra Low-Power Machine Vision - Google Search. learning application for covid-19 detection with lung diseases using
https://www.intel.com/content/www/us/en/products/details/processors/ raspberrypi. In 2022 International Conference on IoT and Blockchain
movidius-vpu/movidius-myriad-x.html,www.movidius.com. (Accessed Technology (ICIBT), pages 1–6, 2022.
on 04/02/2022). [42] Y. H. Bhosale, S. Zanwar, Z. Ahmed, M. Nakrani, D. Bhuyar, and
[16] Nvidia a100 tensor core gpu architecture. 2020. available U. Shinde. Deep convolutional neural network based covid-19 classi-
online: https://www.nvidia.com/content/d am/en-zz/solutions/data- fication from radiology x-ray images for iot enabled devices. In 2022 8th
center/nvidia-ampere-architecture-whitepaper.pdf (accessed on 6 june International Conference on Advanced Computing and Communication
2020). - google search. (Accessed on 06/13/2021). Systems (ICACCS), volume 1, pages 1398–1402, 2022.
[17] Nvidia tesla v100 gpu architecture. 2017. available online: [43] L. Bishnoi and S. Narayan Singh. Artificial intelligence techniques used
https://images.nvidia.com/content/ technologies/volta/pdf/437 317- in medical sciences: A review. In 2018 8th International Conference on
volta-v100-ds-nv-us-web.pdf (accessed on 6 june 2020). - google search. Cloud Computing, Data Science Engineering (Confluence), pages 1–8,
(Accessed on 06/13/2021). 2018.
[18] PYNQ-Z2. http://www.pynq.io/board.html. (Accessed on 01/07/2022). [44] A. G. Blaiech, K. Ben Khalifa, C. Valderrama, M. A. Fernandes, and
[19] Silicon Labs BG24 and MG24 SoCs. https://www.silabs.com/wireless/ M. H. Bedoui. A survey and taxonomy of fpga-based deep learning
zigbee/efr32mg24-series-2-socs. (Accessed on 01/10/2022). accelerators. Journal of Systems Architecture, 98:331–345, 2019.
[20] TinyML Foundation. https://www.tinyml.org/. (Accessed on [45] S. Bouguezzi, H. B. Fredj, T. Belabed, C. Valderrama, H. Faiedh, and
09/02/2022). C. Souani. An efficient fpga-based convolutional neural network for
[21] Vitis Unified Software Platform. https://www.xilinx.com/products/ classification: Ad-mobilenet. Electronics, 10(18), 2021.
design-tools/vitis/vitis-platform.html. (Accessed on 10/01/2022). [46] A. Boutros, S. Yazdanshenas, and V. Betz. You cannot improve what you
[22] Vivado. https://www.xilinx.com/products/design-tools/vivado.html. (Ac- do not measure: Fpga vs. asic efficiency gaps for convolutional neural
cessed on 15/01/2022). network inference. ACM Trans. Reconfigurable Technol. Syst., 11(3),
[23] Xilinx Kria—Adaptive System-on-Module. https://www.xilinx.com/ dec 2018.
products/som/kria.html. (Accessed on 01/10/2022). [47] S. Cadambi, A. Majumdar, M. Becchi, S. Chakradhar, and H. P. Graf. A
[24] Xilinx Vitis AI Model Zoo. https://github.com/Xilinx/AI-Model-Zoo. programmable parallel accelerator for learning and classification. In 2010
(Accessed on 01/10/2022). 19th International Conference on Parallel Architectures and Compilation
[25] Ethos - ARM - WikiChip, Jul 2021. [Online; accessed 7. Aug. 2021]. Techniques (PACT), pages 273–283, 2010.
[26] Jetson Xavier NX, Aug. 2021. [Online; accessed 15. Jul. 2022]. [48] M. Capra, B. Bussolino, A. Marchisio, M. Shafique, G. Masera, and
[27] Coral products. https://coral.ai/products/, 2022. M. Martina. An Updated Survey of Efficient Hardware Architectures
[28] Deploy AI-Powered Autonomous Machines at Scale, July 2022. [Online; for Accelerating Deep Convolutional Neural Networks. Future Internet,
accessed 15. Jul. 2022]. 12(7):113, July 2020.
[29] Edge tpu compiler. https://coral.ai/docs/edgetpu/compiler/ [49] F. Cardells-Tormo, P.-L. Molinet, J. Sempere-Agullo, L. Baldez, and
#system-requirements, 2022. M. Bautista-Palacios. Area-efficient 2d shift-variant convolvers for fpga-
[30] High-Level Synthesis & Verification, July 2022. [Online; accessed 18. based digital image processing. In International Conference on Field
Jul. 2022]. Programmable Logic and Applications, 2005., pages 578–581, 2005.
[31] NVIDIA Tesla T4 Specs, July 2022. [Online; accessed 16. Jul. 2022]. [50] L. Cavigelli, D. Gschwend, C. Mayer, S. Willi, B. Muheim, and
[32] Vitis AI, June 2022. [Online; accessed 17. Jun. 2022]. L. Benini. Origami: A convolutional network accelerator. In Proceedings
[33] M. Alwani, H. Chen, M. Ferdman, and P. Milder. Fused-layer cnn of the 25th Edition on Great Lakes Symposium on VLSI, GLSVLSI ’15,
accelerators. In 2016 49th Annual IEEE/ACM International Symposium page 199–204, New York, NY, USA, 2015. Association for Computing
on Microarchitecture (MICRO), pages 1–12. IEEE, 2016. Machinery.
[34] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, [51] S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi. A dynam-
C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, J. Chen, J. Chen, ically configurable coprocessor for convolutional neural networks. In
Z. Chen, M. Chrzanowski, A. Coates, G. Diamos, K. Ding, N. Du, Proceedings of the 37th Annual International Symposium on Computer
E. Elsen, J. Engel, W. Fang, L. Fan, C. Fougner, L. Gao, C. Gong, Architecture, ISCA ’10, page 247–257, New York, NY, USA, 2010.
A. Hannun, T. Han, L. Johannes, B. Jiang, C. Ju, B. Jun, P. LeGresley, Association for Computing Machinery.
L. Lin, J. Liu, Y. Liu, W. Li, X. Li, D. Ma, S. Narang, A. Ng, S. Ozair, [52] J.-W. Chang and S.-J. Kang. Optimizing fpga-based convolutional neural
Y. Peng, R. Prenger, S. Qian, Z. Quan, J. Raiman, V. Rao, S. Satheesh, networks accelerator for image super-resolution. In 2018 23rd Asia and
D. Seetapun, S. Sengupta, K. Srinet, A. Sriram, H. Tang, L. Tang, South Pacific Design Automation Conference (ASP-DAC), pages 343–
C. Wang, J. Wang, K. Wang, Y. Wang, Z. Wang, Z. Wang, S. Wu, 348, 2018.
L. Wei, B. Xiao, W. Xie, Y. Xie, D. Yogatama, B. Yuan, J. Zhan, and [53] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam.
Z. Zhu. Deep speech 2 : End-to-end speech recognition in english and Diannao: A small-footprint high-throughput accelerator for ubiquitous
mandarin. In M. F. Balcan and K. Q. Weinberger, editors, Proceedings of machine-learning. In Proceedings of the 19th International Conference
The 33rd International Conference on Machine Learning, volume 48 of on Architectural Support for Programming Languages and Operating
Systems, ASPLOS ’14, page 269–284, New York, NY, USA, 2014. [76] L. Du and Y. Du. Hardware accelerator design for machine learning.
Association for Computing Machinery. In H. Farhadi, editor, Machine Learning, chapter 1. IntechOpen, Rijeka,
[54] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, 2018.
N. Sun, and O. Temam. Dadiannao: A machine-learning supercomputer. [77] L. Du, Y. Du, Y. Li, J. Su, Y.-C. Kuan, C.-C. Liu, and M.-C. F. Chang. A
In 2014 47th Annual IEEE/ACM International Symposium on Microar- reconfigurable streaming deep convolutional neural network accelerator
chitecture, pages 609–622, 2014. for internet of things. IEEE Transactions on Circuits and Systems I:
[55] Y. Chen, Y. Xie, L. Song, F. Chen, and T. Tang. A survey of accelerator Regular Papers, 65(1):198–208, 2018.
architectures for deep neural networks. Engineering, 6(3):264–274, 2020. [78] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen,
[56] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze. Eyeriss: An energy- and O. Temam. Shidiannao: Shifting vision processing closer to the
efficient reconfigurable accelerator for deep convolutional neural net- sensor. SIGARCH Comput. Archit. News, 43(3S):92–104, June 2015.
works. IEEE Journal of Solid-State Circuits, 52(1):127–138, 2017. [79] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen,
[57] Y.-H. Chen, T.-J. Yang, J. Emer, and V. Sze. Eyeriss v2: A flexible and O. Temam. Shidiannao: Shifting vision processing closer to the
accelerator for emerging deep neural networks on mobile devices. IEEE sensor. In Proceedings of the 42nd Annual International Symposium on
Journal on Emerging and Selected Topics in Circuits and Systems, Computer Architecture, ISCA ’15, page 92–104, New York, NY, USA,
9(2):292–308, 2019. 2015. Association for Computing Machinery.
[58] Z. Chen, H. S. Philip Wong, S. Mitra, A. Bol, L. Peng, G. Hills, and [80] C. Dubout and F. Fleuret. Exact acceleration of linear object detectors.
N. Thissen. Carbon nanotubes for high-performance logic. MRS In Proceedings of the 12th European Conference on Computer Vision
Bulletin, 39(8):719–726, Aug 2014. - Volume Part III, ECCV’12, page 301–311, Berlin, Heidelberg, 2012.
[59] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, Springer-Verlag.
and E. Shelhamer. cudnn: Efficient primitives for deep learning. ArXiv, [81] A. J. A. El-Maksoud, M. Ebbed, A. H. Khalil, and H. Mostafa. Power
abs/1410.0759, 2014. efficient design of high-performance convolutional neural networks hard-
[60] D. Chicco, P. Sadowski, and P. Baldi. Deep autoencoder neural networks ware accelerator on fpga: A case study with googlenet. IEEE Access,
for gene ontology annotation predictions. In Proceedings of the 5th ACM 9:151897–151911, 2021.
Conference on Bioinformatics, Computational Biology, and Health Infor- [82] H. Fan, S. Liu, M. Ferianc, H.-C. Ng, Z. Que, S. Liu, X. Niu, and W. Luk.
matics, BCB ’14, page 533–540, New York, NY, USA, 2014. Association A real-time object detection accelerator with compressed ssdlite on fpga.
for Computing Machinery. In 2018 International Conference on Field-Programmable Technology
[61] P.-S. Chiu, J.-W. Chang, M.-C. Lee, C.-H. Chen, and D.-S. Lee. Enabling (FPT), pages 14–21, 2018.
intelligent environment by the design of emotionally aware virtual assis- [83] X. Fan, D. Wu, W. Cao, W. Luk, and L. Wang. Stream processing dual-
tant: A case of smart campus. IEEE Access, 8:62032–62041, 2020. track cgra for object inference. IEEE Transactions on Very Large Scale
[62] Y.-k. Choi, K. You, J. Choi, and W. Sung. A real-time fpga-based 20 000- Integration (VLSI) Systems, 26(6):1098–1111, 2018.
word speech recognizer with optimized dram access. IEEE Transactions
[84] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and
on Circuits and Systems I: Regular Papers, 57(8):2119–2131, 2010.
Y. Lecun. NeuFlow: A Runtime-Reconfigurable Dataflow Processor for
[63] J. Choquette, W. Gandhi, O. Giroux, N. Stam, and R. Krashinsky.
Vision. 2011 IEEE Computer Society Conference on Computer Vision
Nvidia a100 tensor core gpu: Performance and innovation. IEEE Micro,
and Pattern Recognition Workshops (CVPRW), Jun 2011.
41(2):29–35, 2021.
[85] C. Farabet, C. Poulet, J. Han, and Y. LeCun. Cnp: An fpga-based pro-
[64] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep
cessor for convolutional networks. In FPL 09, FPL 09: 19th International
network learning by exponential linear units (elus). arXiv: Learning,
Conference on Field Programmable Logic and Applications, pages 32–
2016.
37, 2009. FPL 09: 19th International Conference on Field Programmable
[65] J. Cloutier, E. Cosatto, S. Pigeon, F. Boyer, and P. Simard. Vip: an
Logic and Applications ; Conference date: 31-08-2009 Through 02-09-
fpga-based processor for image processing and neural networks. In
2009.
Proceedings of Fifth International Conference on Microelectronics for
[86] C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun. Cnp: An fpga-based
Neural Networks, pages 330–336, 1996.
processor for convolutional networks. In 2009 International Conference
[66] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A matlab-like
on Field Programmable Logic and Applications, pages 32–37, 2009.
environment for machine learning. In NIPS 2011, 2011.
[67] J. Cong and B. Xiao. Minimizing computation in convolutional neural [87] X. Feng, H. Zhang, Y. Ren, P. Shang, Y. Zhu, Y. Liang, R. Guan, and
networks. In ICANN, 2014. D. Xu. The Deep Learning–Based Recommender System “Pubmender”
[68] M. Courbariaux and Y. Bengio. Binarynet: Training deep neural networks for Choosing a Biomedical Publication Venue: Development and Valida-
with weights and activations constrained to+ 1 or- 1. arxiv 2016. arXiv tion Study. J. Med. Internet Res., 21(5), May 2019.
preprint arXiv:1602.02830. [88] K. Fukushima. Neocognitron: A hierarchical neural network capable of
[69] G. Crocioni, D. Pau, J.-M. Delorme, and G. Gruosso. Li-ion batteries visual pattern recognition. Neural Networks, 1(2):119–130, 1988.
parameter estimation with tiny neural networks embedded on intelligent [89] A. Gainaru, E. Slusanschi, and S. Trausan-Matu. Mapping data mining
iot microcontrollers. IEEE Access, 8:122135–122146, 2020. algorithms on a GPU architecture: A study. In M. Kryszkiewicz,
[70] D. Danopoulos, C. Kachris, and D. Soudris. Acceleration of image H. Rybinski, A. Skowron, and Z. W. Ras, editors, Foundations of In-
classification with caffe framework using fpga. In 2018 7th International telligent Systems - 19th International Symposium, ISMIS 2011, Warsaw,
Conference on Modern Circuits and Systems Technologies (MOCAST), Poland, June 28-30, 2011. Proceedings, volume 6804 of Lecture Notes in
pages 1–4, 2018. Computer Science, pages 102–112. Springer, 2011.
[71] L. Deng and D. Yu. Deep learning: Methods and applications. Found. [90] C. Gartenberg. ARM’s new edge AI chips promise IoT devices that won’t
Trends Signal Process., 7(3–4):197–387, June 2014. need the cloud. Verge, Feb 2020.
[72] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. de Freitas. Predicting [91] A. Ghaffari and Y. Savaria. Cnn2gate: An implementation of convolu-
parameters in deep learning. In Proceedings of the 26th International tional neural networks inference on fpgas with automated design space
Conference on Neural Information Processing Systems - Volume 2, exploration. Electronics, 2020.
NIPS’13, page 2148–2156, Red Hook, NY, USA, 2013. Curran Asso- [92] V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello. A 240
ciates Inc. g-ops/s mobile coprocessor for deep neural networks. In 2014 IEEE
[73] A. Deshpande. A Beginner’s Guide To Understanding Convolutional Conference on Computer Vision and Pattern Recognition Workshops,
Neural Networks. UCLA (‘19), 2018. pages 696–701, 2014.
[74] J. Domke, E. Vatai, A. Drozd, P. ChenT, Y. Oyama, L. Zhang, S. Salaria, [93] K. M. V. Gowda, S. Madhavan, S. Rinaldi, P. B. Divakarachari, and
D. Mukunoki, A. Podobas, M. WahibT, and S. Matsuoka. Matrix engines A. Atmakur. Fpga-based reconfigurable convolutional neural network
for high performance computing: A paragon of performance or grasping accelerator using sparse and convolutional optimization. Electronics,
at straws? In 2021 IEEE International Parallel and Distributed Processing 11(10), 2022.
Symposium (IPDPS), pages 1056–1065, Los Alamitos, CA, USA, may [94] H. Graf, S. Cadambi, V. Jakkula, M. Sankaradass, E. Cosatto, S. Chakrad-
2021. IEEE Computer Society. har, and I. Dourdanovic. A massively parallel digital learning proces-
[75] Y. Dong, F. Sun, Z. Ping, Q. Ouyang, and L. Qian. DNA storage: research sor. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors,
landscape and future prospects. National Science Review, 7(6):1092– Advances in Neural Information Processing Systems, volume 21. Curran
1107, 01 2020. Associates, Inc., 2009.
[95] S. Grigorescu, B. Trasnea, T. T. Cocias, and G. Macesanu. A sur- [116] T. Instruments. Am5729 sitara processor. URL: https://www. ti. com/pro-
vey of deep learning techniques for autonomous driving. ArXiv, duct/AM5729, 2015.
abs/1910.07738, 2020. [117] H. Irmak, N. Alachiotis, and D. Ziener. An energy-efficient fpga-based
[96] Y. Guan, H. Liang, N. Xu, W. Wang, S. Shi, X. Chen, G. Sun, W. Zhang, convolutional neural network implementation. In 2021 29th Signal
and J. Cong. Fp-dnn: An automated framework for mapping deep Processing and Communications Applications Conference (SIU), pages
neural networks onto fpgas with rtl-hls hybrid templates. In 2017 IEEE 1–4. IEEE, 2021.
25th Annual International Symposium on Field-Programmable Custom [118] H. Irmak, F. Corradi, P. Detterer, N. Alachiotis, and D. Ziener. A dynamic
Computing Machines (FCCM), pages 152–159, 2017. reconfigurable architecture for hybrid spiking and convolutional fpga-
[97] K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han, Y. Wang, based neural network designs. Journal of Low Power Electronics and
and H. Yang. Angel-eye: A complete design flow for mapping cnn Applications, 11(3), 2021.
onto embedded fpga. IEEE Transactions on Computer-Aided Design of [119] S. M. A. H. Jafri, T. N. Gia, S. Dytckov, M. Daneshtalab, A. Hemani,
Integrated Circuits and Systems, 37(1):35–47, 2018. J. Plosila, and H. Tenhunen. Neurocgra: A cgra with support for
[98] K. Guo, S. Zeng, J. Yu, Y. Wang, and H. Yang. [DL] A Survey neural networks. In 2014 International Conference on High Performance
of FPGA-based Neural Network Inference Accelerators. ACM Trans. Computing & Simulation (HPCS), pages 506–511, 2014.
Reconfigurable Technol. Syst., 12(1):1–26, Mar. 2019. [120] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick,
[99] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast
learning with limited numerical precision. In Proceedings of the 32nd feature embedding. CoRR, abs/1408.5093, 2014.
International Conference on International Conference on Machine Learn- [121] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa,
ing - Volume 37, ICML’15, page 1737–1746. JMLR.org, 2015. S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin,
[100] F. G. Gustavson. Two fast algorithms for sparse matrices: Multiplication C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb,
and permuted transposition. ACM Trans. Math. Softw., 4(3):250–269, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho,
sep 1978. D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski,
[101] A. Guzhva, S. Dolenko, and I. Persiantsev. Multifold acceleration of A. Kaplan, H. Khaitan, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law,
neural network computations using gpu. In Proceedings of the 19th D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore,
International Conference on Artificial Neural Networks: Part I, ICANN M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix,
’09, page 373–380, Berlin, Heidelberg, 2009. Springer-Verlag. T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross,
[102] R. Hadsell, P. Sermanet, J. Ben, A. Erkan, M. Scoffier, K. Kavukcuoglu, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter,
U. Muller, and Y. LeCun. Learning long-range vision for autonomous D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle,
off-road driving. J. Field Robot., 26(2):120–144, Feb. 2009. V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon. In-
datacenter performance analysis of a tensor processing unit, 2017.
[103] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao,
[122] D. Justus, J. Brennan, S. Bonner, and A. S. McGough. Predicting the
Y. Wang, H. Yang, and W. J. Dally. Ese: Efficient speech recognition
computational cost of deep learning models. In 2018 IEEE International
engine with sparse lstm on fpga. Proceedings of the 2017 ACM/SIGDA
Conference on Big Data (Big Data), pages 3873–3882, 2018.
International Symposium on Field-Programmable Gate Arrays, 2017.
[123] S. Kalapothas, G. Flamis, and P. Kitsos. Efficient edge-ai application
[104] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and
deployment for fpgas. Information, 13(6), 2022.
W. J. Dally. Eie: Efficient inference engine on compressed deep neural
[124] A. Karpathy. convolutional neural networks for visual recognition, 2018.
network. In 2016 ACM/IEEE 43rd Annual International Symposium on
[125] M. Kavitha, R. Srinivasan, and R. Bhuvanya. Fake News Detection Using
Computer Architecture (ISCA), pages 243–254, 2016.
Machine Learning Algorithms, chapter 10, pages 181–207. John Wiley
[105] F. Hannig, V. Lari, S. Boppu, A. Tanase, and O. Reiche. Invasive tightly-
& Sons, Ltd, 2022.
coupled processor arrays: A domain-specific architecture/compiler co-
[126] H. Khan, A. Khan, Z. Khan, L. B. Huang, K. Wang, and L. He. NPE: An
design approach. ACM Transactions on Embedded Computing Systems
FPGA-based overlay processor for natural language processing. In The
(TECS), 13(4s):1–29, 2014.
2021 ACM/SIGDA International Symposium on Field-Programmable
[106] C. Hao, A. Sarwari, Z. Jin, H. Abu-Haimed, D. Sew, Y. Li, X. Liu,
Gate Arrays. ACM, feb 2021.
B. Wu, D. Fu, J. Gu, and D. Chen. A hybrid gpu + fpga system design
[127] J.-Y. Kim. Chapter five - fpga based neural network accelerators. In
for autonomous driving cars. In 2019 IEEE International Workshop on
S. Kim and G. C. Deka, editors, Hardware Accelerator Systems for
Signal Processing Systems (SiPS), pages 121–126, 2019.
Artificial Intelligence and Machine Learning, volume 122 of Advances
[107] L. Hardesty. Researchers build an all-optical transistor. in Computers, pages 135–165. Elsevier, 2021.
[108] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Sur- [128] Y. Kim, J. Lee, J.-S. Kim, H. Jei, and H. Roh. Efficient multi-gpu
passing human-level performance on imagenet classification. 2015 IEEE memory management for deep learning acceleration. In 2018 IEEE
International Conference on Computer Vision (ICCV), pages 1026–1034, 3rd International Workshops on Foundations and Applications of Self*
2015. Systems (FAS*W), pages 37–43, 2018.
[109] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image [129] J. P. Klock, J. Corrêa, M. Bessa, J. Arias-Garcia, F. Barboza, and
recognition. In Proceedings of the IEEE Conference on Computer Vision C. Meinertz. A new automated energy meter fraud detection system based
and Pattern Recognition (CVPR), June 2016. on artificial intelligence. In 2021 XI Brazilian Symposium on Computing
[110] D. Hefenbrock, J. Oberg, N. T. N. Thanh, R. Kastner, and S. B. Baden. Systems Engineering (SBESC), pages 1–8, 2021.
Accelerating viola-jones face detection to fpga-level using gpus. In 2010 [130] A. Kojima and Y. Nose. Development of an autonomous driving robot
18th IEEE Annual International Symposium on Field-Programmable car using fpga. In 2018 International Conference on Field-Programmable
Custom Computing Machines, pages 11–18, 2010. Technology (FPT), pages 411–414, 2018.
[111] C. Heidorn, M. Witterauf, F. Hannig, and J. Teich. Efficient mapping of [131] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification
cnns onto tightly coupled processor arrays. J. Comput., 14(8):541–556, with deep convolutional neural networks. Commun. ACM, 60(6):84–90,
2019. May 2017.
[112] A. Howard and S. Gupta. Introducing the next generation of on-device vi- [132] H. Kwon, A. Samajdar, and T. Krishna. Maeri: Enabling flexible
sion models: Mobilenetv3 and mobilenetedgetpu. https://ai.googleblog. dataflow mapping over dnn accelerators via reconfigurable interconnects.
com/2019/11/introducing-next-generation-on-device.html, 2020. SIGPLAN Not., 53(2):461–475, Mar. 2018.
[113] H. Hu, J. Li, C. Wu, X. Li, and Y. Chen. Design and Implementation of [133] A. Lavin and S. Gray. Fast algorithms for convolutional neural networks.
Intelligent Speech Recognition System Based on FPGA. J. Phys. Conf. 2016 IEEE Conference on Computer Vision and Pattern Recognition
Ser., 2171(1):012010, Jan. 2022. (CVPR), pages 4013–4021, 2016.
[114] A. S. Hussein, A. Anwar, Y. Fahmy, H. Mostafa, K. N. Salama, and [134] J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H.-J. Yoo. Unpu: A
M. Kafafy. Implementation of a dpu-based intelligent thermal imaging 50.6tops/w unified deep neural network accelerator with 1b-to-16b fully-
hardware accelerator on fpga. Electronics, 11(1), 2022. variable weight bit-precision. In 2018 IEEE International Solid - State
[115] D. Im, D. Han, S. Choi, S. Kang, and H.-J. Yoo. Dt-cnn: Dilated and Circuits Conference - (ISSCC), pages 218–220, 2018.
transposed convolution neural network accelerator for real-time image [135] J. Lee and J. Lee. Np-cgra: Extending cgras for efficient processing of
segmentation on mobile devices. In 2019 IEEE International Symposium light-weight deep neural networks. In 2021 Design, Automation & Test
on Circuits and Systems (ISCAS), pages 1–5, 2019. in Europe Conference & Exhibition (DATE), pages 1408–1413, 2021.
[136] J. Lee, J. Lee, D. Han, J. Lee, G. Park, and H.-J. Yoo. 7.7 lnpu: A [159] D. Moolchandani, A. Kumar, and S. R. Sarangi. Accelerating cnn infer-
25.3tflops/w sparse deep-neural-network learning processor with fine- ence on asics: A survey. Journal of Systems Architecture, 113:101887,
grained mixed precision of fp8-fp16. 2019 IEEE International Solid- 2021.
State Circuits Conference - (ISSCC), pages 142–144, 2019. [160] N. Muralimanohar, R. Balasubramonian, and N. Jouppi. Optimizing nuca
[137] J. Lee and H.-J. Yoo. An overview of energy-efficient hardware acceler- organizations and wiring alternatives for large caches with cacti 6.0. In
ators for on-device deep-neural-network training. IEEE Open Journal of 40th Annual IEEE/ACM International Symposium on Microarchitecture
the Solid-State Circuits Society, 1:115–128, 2021. (MICRO 2007), pages 3–14, 2007.
[138] M. Lee, K. Hwang, J. Park, S. Choi, S. Shin, and W. Sung. Fpga- [161] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltz-
based low-power speech recognition with recurrent neural networks. In mann machines. In Proceedings of the 27th International Conference on
2016 IEEE International Workshop on Signal Processing Systems (SiPS), International Conference on Machine Learning, ICML’10, page 807–814,
pages 230–235, 2016. Madison, WI, USA, 2010. Omnipress.
[139] D. Lewin. Dna computing. Computing in Science & Engineering, 4(3):5– [162] D. T. Nguyen, T. N. Nguyen, H. Kim, and H.-J. Lee. A high-throughput
8, 2002. and power-efficient fpga implementation of yolo cnn for object detection.
[140] B. Li, E. Zhou, B. Huang, J. Duan, Y. Wang, N. Xu, J. Zhang, and IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
H. Yang. Large scale recurrent neural network on gpu. In 2014 Interna- 27(8):1861–1873, 2019.
tional Joint Conference on Neural Networks (IJCNN), pages 4062–4069, [163] T. Ngyen, S. M. Jafri, M. Daneshtalab, A. Hemani, S. Dytckov, J. Plosila,
2014. and H. Tenhunen. Fist: A framework to interleave spiking neural
[141] X. Lian, Z. Liu, Z. Song, J. Dai, W. Zhou, and X. Ji. High- networks on cgras. In 2015 23rd Euromicro International Conference
performance fpga-based cnn accelerator with block-floating-point arith- on Parallel, Distributed, and Network-Based Processing, pages 751–758,
metic. IEEE Transactions on Very Large Scale Integration (VLSI) 2015.
Systems, 27(8):1874–1885, 2019. [164] R. Nikhil. Bluespec system verilog: efficient, correct rtl from high level
[142] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Teman, X. Feng, X. Zhou, specifications. In Proceedings. Second ACM and IEEE International
and Y. Chen. Pudiannao: A polyvalent machine learning accelerator. In Conference on Formal Methods and Models for Co-Design, 2004. MEM-
Proceedings of the Twentieth International Conference on Architectural OCODE ’04., pages 69–70, 2004.
Support for Programming Languages and Operating Systems, ASPLOS [165] E. Nurvitadhi, D. Sheffield, Jaewoong Sim, A. Mishra, G. Venkatesh,
’15, page 369–381, New York, NY, USA, 2015. Association for Comput- and D. Marr. Accelerating binarized neural networks: Comparison of
ing Machinery. fpga, cpu, gpu, and asic. In 2016 International Conference on Field-
[143] A. Ltd. Learn more about the Linaro Machine Learning Initiative. Arm | Programmable Technology (FPT), pages 77–84, 2016.
The Architecture for the Digital World, Jan 2019. [166] E. Nurvitadhi, G. Venkatesh, J. Sim, D. Marr, R. Huang, J. Ong
[144] A. Ltd. Ethos-U55 – Arm Developer, Aug 2021. [Online; accessed 7. Gee Hock, Y. T. Liew, K. Srivatsan, D. Moss, S. Subhaschandra, and
Aug. 2021]. G. Boudoukh. Can fpgas beat gpus in accelerating next-generation deep
[145] A. Ltd. High-Performing AI Solutions to Transform our Digital World, neural networks? In Proceedings of the 2017 ACM/SIGDA International
Aug 2021. [Online; accessed 7. Aug. 2021]. Symposium on Field-Programmable Gate Arrays, FPGA ’17, page 5–14,
[146] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li. Flexflow: A flexible New York, NY, USA, 2017. Association for Computing Machinery.
dataflow accelerator architecture for convolutional neural networks. In [167] M. T. Nyamukuru and K. M. Odame. Tiny eats: Eating detection on a
2017 IEEE International Symposium on High Performance Computer microcontroller, 2020.
Architecture (HPCA), pages 553–564, 2017. [168] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan,
[147] T. Luo, S. Liu, L. Li, Y. Wang, S. Zhang, T. Chen, Z. Xu, O. Temam, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally. Scnn: An accelerator
and Y. Chen. Dadiannao: A neural network supercomputer. IEEE for compressed-sparse convolutional neural networks. In Proceedings
Transactions on Computers, 66(1):73–88, 2017. of the 44th Annual International Symposium on Computer Architecture,
[148] T. Luong, H. Pham, and C. D. Manning. Effective approaches to ISCA ’17, page 27–40, New York, NY, USA, 2017. Association for
attention-based neural machine translation. In Proceedings of the 2015 Computing Machinery.
Conference on Empirical Methods in Natural Language Processing, [169] S.-W. Park, J. Park, K. Bong, D. Shin, J. Lee, S. Choi, and H.-J.
pages 1412–1421, Lisbon, Portugal, Sept. 2015. Association for Com- Yoo. An energy-efficient and scalable deep learning/inference processor
putational Linguistics. with tetra-parallel mimd architecture for big data applications. IEEE
[149] P. Lv, W. Liu, and J. Li. A fpga-based accelerator implementaion for Transactions on Biomedical Circuits and Systems, 9(6):838–848, 2015.
yolov2 object detection using winograd algorithm. In 2020 5th Inter- [170] M. Peemen, A. A. A. Setio, B. Mesman, and H. Corporaal. Memory-
national Conference on Mechanical, Control and Computer Engineering centric accelerator design for convolutional neural networks. In 2013
(ICMCCE), pages 1894–1898, 2020. IEEE 31st International Conference on Computer Design (ICCD), pages
[150] A. L. Maas. Rectifier nonlinearities improve neural network acoustic 13–19, 2013.
models. 2013. [171] P.-H. Pham, D. Jelaca, C. Farabet, B. Martini, Y. LeCun, and E. Culur-
[151] R. Machupalli, M. Hossain, and M. Mandal. Review of asic accelerators ciello. Neuflow: Dataflow vision processing system-on-a-chip. In 2012
for deep neural network. Microprocessors and Microsystems, 89:104441, IEEE 55th International Midwest Symposium on Circuits and Systems
2022. (MWSCAS), pages 1044–1047, 2012.
[152] M. Mathieu, M. Henaff, and Y. LeCun. Fast training of convolutional [172] M. Pietras. Hardware conversion of neural networks simulation models
networks through ffts. CoRR, abs/1312.5851, 2014. for neural processing accelerator implemented as fpga-based soc. In
[153] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins. Adres: 2014 24th International Conference on Field Programmable Logic and
An architecture with tightly coupled vliw processor and coarse-grained Applications (FPL), pages 1–4, 2014.
reconfigurable matrix. In P. Y. K. Cheung and G. A. Constantinides, [173] T. Posewsky and D. Ziener. Efficient deep neural network acceleration
editors, Field Programmable Logic and Application, pages 61–70, Berlin, through fpga-based batch processing. In 2016 International Conference
Heidelberg, 2003. Springer Berlin Heidelberg. on ReConFigurable Computing and FPGAs (ReConFig), pages 1–8,
[154] J. Misra and I. Saha. Artificial neural networks in hardware: A survey of 2016.
two decades of progress. Neurocomputing, 74:239–255, 2010. [174] T. Posewsky and D. Ziener. Throughput optimizations for fpga-based
[155] S. Mittal. A survey of fpga-based accelerators for convolutional neural deep neural network inference. Microprocessors and Microsystems,
networks. Neural Computing and Applications, 32(4):1109–1139, Feb 60:151–161, 2018.
2020. [175] S. Prakash, T. Callahan, J. Bushagour, C. Banbury, A. V. Green, P. War-
[156] S. Mittal and J. S. Vetter. A survey of cpu-gpu heterogeneous computing den, T. Ansell, and V. J. Reddi. Cfu playground: Full-stack open-source
techniques. ACM Computing Surveys (CSUR), 47:1 – 35, 2015. framework for tiny machine learning (tinyml) acceleration on fpgas.
[157] P. Mohan, A. J. Paul, and A. Chirania. A tiny CNN architecture for arXiv preprint arXiv:2201.01863, 2022.
medical face mask detection for resource-constrained endpoints. In Lec- [176] W. Qadeer, R. Hameed, O. Shacham, P. Venkatesan, C. Kozyrakis, and
ture Notes in Electrical Engineering, pages 657–670. Springer Singapore, M. Horowitz. Convolution engine: Balancing efficiency and flexibility in
2021. specialized computing. Commun. ACM, 58(4):85–93, Mar. 2015.
[158] J. J. Moolayil. A Layman’s Guide to Deep Neural Networks - Towards [177] E. Qin, A. Samajdar, H. Kwon, V. Nadella, S. Srinivasan, D. Das, B. Kaul,
Data Science. Medium, May 2020. and T. Krishna. Sigma: A sparse and irregular gemm accelerator with
flexible interconnects for dnn training. In 2020 IEEE International Sym- [198] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-
posium on High Performance Computer Architecture (HPCA), pages 58– s. Seo, and Y. Cao. Throughput-optimized opencl-based fpga accelerator
70, 2020. for large-scale convolutional neural networks. In Proceedings of the 2016
[178] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, ACM/SIGDA International Symposium on Field-Programmable Gate
S. Song, Y. Wang, and H. Yang. Going deeper with embedded fpga Arrays, FPGA ’16, page 16–25, New York, NY, USA, 2016. Association
platform for convolutional neural network. In Proceedings of the 2016 for Computing Machinery.
ACM/SIGDA International Symposium on Field-Programmable Gate [199] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J. sun
Arrays, FPGA ’16, page 26–35, New York, NY, USA, 2016. Association Seo, and Y. Cao. Throughput-optimized opencl-based fpga accelerator
for Computing Machinery. for large-scale convolutional neural networks. In FPGA 2016 - Pro-
[179] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kep- ceedings of the 2016 ACM/SIGDA International Symposium on Field-
ner. AI accelerator survey and trends. In 2021 IEEE High Performance Programmable Gate Arrays, pages 16–25. Association for Computing
Extreme Computing Conference (HPEC). IEEE, sep 2021. Machinery, Inc, Feb. 2016. 2016 ACM/SIGDA International Symposium
[180] M. Rhu, N. Gimelshein, J. Clemons, A. Zulfiqar, and S. W. Keckler. on Field-Programmable Gate Arrays, FPGA 2016 ; Conference date: 21-
Vdnn: Virtualized deep neural networks for scalable, memory-efficient 02-2016 Through 23-02-2016.
neural network design. In The 49th Annual IEEE/ACM International [200] M. Svedin, S. W. D. Chien, G. Chikafa, N. Jansson, and A. Podobas.
Symposium on Microarchitecture, MICRO-49. IEEE Press, 2016. Benchmarking the nvidia gpu lineage: From early k80 to modern a100
[181] T. Ridnik, H. Lawen, A. Noy, and I. Friedman. Tresnet: High perfor- with asynchronous memory transfers. Proceedings of the 11th Interna-
mance gpu-dedicated architecture. ArXiv, abs/2003.13630, 2020. tional Symposium on Highly Efficient Accelerators and Reconfigurable
[182] S. Saha. A Comprehensive Guide to Convolutional Neural Networks — Technologies, 2021.
the ELI5 way, 2018. [201] D.-F. Syu, S. Syu, S.-J. Ruan, Y.-C. Huang, and C.-K. Yang. Fpga imple-
[183] M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar, I. Durdanovic, mentation of automatic speech recognition system in a car environment.
E. Cosatto, and H. P. Graf. A massively parallel coprocessor for convolu- 2015 IEEE 4th Global Conference on Consumer Electronics (GCCE),
tional neural networks. In 2009 20th IEEE International Conference on pages 485–486, 2015.
Application-specific Systems, Architectures and Processors, pages 53– [202] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer. Efficient processing of
60, 2009. deep neural networks: A tutorial and survey. Proceedings of the IEEE,
[184] V. Sati, S. M. Sánchez, N. Shoeibi, A. Arora, and J. M. Corchado. Face 105(12):2295–2329, 2017.
detection and recognition, face emotion recognition through nvidia jetson [203] M. A. Talib, S. Majzoub, Q. Nasir, and D. Jamal. A systematic literature
nano. In International Symposium on Ambient Intelligence, pages 177– review on hardware implementation of artificial intelligence algorithms.
185. Springer, 2020. J. Supercomput., 77(2):1897–1938, Feb. 2021.
[204] M. Tanomoto, S. Takamaeda-Yamazaki, J. Yao, and Y. Nakashima. A
[185] S. Sağlam, F. Tat, and S. Bayar. Fpga implementation of cnn algorithm for
cgra-based approach for accelerating convolutional neural networks. In
detecting malaria diseased blood cells. In 2019 International Symposium
2015 IEEE 9th International Symposium on Embedded Multicore/Many-
on Advanced Electrical and Communication Technologies (ISAECT),
core Systems-on-Chip, pages 73–80, 2015.
pages 1–5, 2019.
[205] Y. Tkachenko. Autonomous crm control via clv approximation with deep
[186] U. Schmidt and S. Roth. Shrinkage fields for effective image restoration.
reinforcement learning in discrete and continuous action space. arXiv
In 2014 IEEE Conference on Computer Vision and Pattern Recognition,
preprint arXiv:1504.01840, 2015.
pages 2774–2781, 2014.
[206] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre,
[187] D. Selvathi, R. D. Nayagam, D. J. Hemanth, and V. E. Balas. FPGA
and K. Vissers. Finn: A framework for fast, scalable binarized neural net-
implementation of on-chip ANN for breast cancer diagnosis. Intell.
work inference. In Proceedings of the 2017 ACM/SIGDA international
Decis. Technol., 10(4):341–352, Dec. 2016.
symposium on field-programmable gate arrays, pages 65–74, 2017.
[188] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao,
[207] A. Vasudevan, A. Anderson, and D. Gregg. Parallel multi channel
A. Mishra, and H. Esmaeilzadeh. From high-level deep neural models
convolution using general matrix multiplication. 2017 IEEE 28th Inter-
to fpgas. In 2016 49th Annual IEEE/ACM International Symposium on
national Conference on Application-specific Systems, Architectures and
Microarchitecture (MICRO), pages 1–12, 2016.
Processors (ASAP), pages 19–24, 2017.
[189] H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, V. Chandra, and H. Es- [208] S. I. Venieris and C.-S. Bouganis. fpgaconvnet: A framework for mapping
maeilzadeh. Bit fusion: Bit-level dynamically composable architecture convolutional neural networks on fpgas. In 2016 IEEE 24th Annual
for accelerating deep neural network. In 2018 ACM/IEEE 45th Annual International Symposium on Field-Programmable Custom Computing
International Symposium on Computer Architecture (ISCA), pages 764– Machines (FCCM), pages 40–47, 2016.
775, 2018. [209] T. Viet Huynh. Fpga-based acceleration for convolutional neural net-
[190] R. Shi, H. Xu, B. Chen, Z. Zhang, and L.-M. Peng. Scalable fabrication works on pynq-z2. International Journal Of Computing and Digital
of graphene devices through photolithography. Applied Physics Letters, System, 2021.
102(11):113102, 2013. [210] C. Wang, L. Gong, Q. Yu, X. Li, Y. Xie, and X. Zhou. Dlau: A scalable
[191] D. Shin, J. Lee, J. Lee, and H.-J. Yoo. 14.2 dnpu: An 8.1tops/w reconfig- deep learning accelerator unit on fpga. IEEE Transactions on Computer-
urable cnn-rnn processor for general-purpose deep neural networks. In Aided Design of Integrated Circuits and Systems, 36(3):513–517, 2017.
2017 IEEE International Solid-State Circuits Conference (ISSCC), pages [211] J. Wang and S. Gu. Fpga implementation of object detection accelerator
240–241, 2017. based on vitis-ai. In 2021 11th International Conference on Information
[192] M. M. Shulaker, G. Hills, R. S. Park, R. T. Howe, K. Saraswat, H.-S. P. Science and Technology (ICIST), pages 571–577, 2021.
Wong, and S. Mitra. Three-dimensional integration of nanotechnologies [212] T. Wang, C. Wang, X. Zhou, and H. Chen. A Survey of FPGA Based
for computing and data storage on a single chip. Nature, 547(7661):74– Deep Learning Accelerators: Challenges and Opportunities. ArXiv, Dec.
78, Jul 2017. 2018.
[193] K. Simonyan and A. Zisserman. Very deep convolutional networks for [213] Y. Wang, J. Xu, Y. Han, H. Li, and X. Li. Deepburning: Automatic
large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. generation of fpga-based learning accelerators for the neural network
[194] G. W. Smith and F. F. Leymarie. The Machine as Artist: An Introduction. family. In 2016 53nd ACM/EDAC/IEEE Design Automation Conference
Arts, 6(2):5, Apr 2017. (DAC), pages 1–6, 2016.
[195] S. Srinivas, R. K. Sarvadevabhatla, K. R. Mopuri, N. Prabhu, S. S. S. [214] S. Williams, A. Waterman, and D. Patterson. Roofline: an insightful
Kruthiventi, and R. V. Babu. A taxonomy of deep convolutional neural visual performance model for multicore architectures. Communications
nets for computer vision. Frontiers in Robotics and AI, 2:36, 2016. of the ACM, 52(4):65–76, 2009.
[196] V. Sriram, D. Cox, K. H. Tsoi, and W. Luk. Towards an embedded [215] K. B. Wim Vanderbauwhede. High-Performance Computing Using
biologically-inspired machine vision processor. In 2010 International FPGAs. Springer, New York, NY, USA, 2013.
Conference on Field-Programmable Technology, pages 273–278, 2010. [216] W. G. Wong. More Details Emerge About Arm’s Machine Learning.
[197] D. Strigl, K. Kofler, and S. Podlipnig. Performance and scalability of gpu- Electronic Design, Jun 2018.
based convolutional neural networks. In 2010 18th Euromicro Confer- [217] B. Wu, A. Wan, F. Iandola, P. H. Jin, and K. Keutzer. Squeezedet:
ence on Parallel, Distributed and Network-based Processing, pages 317– Unified, small, low power fully convolutional neural networks for real-
324, 2010. time object detection for autonomous driving. In 2017 IEEE Conference