
IEEE TRANSACTIONS ON ARTIFICIAL INTELLIGENCE, VOL. 5, NO. 8, AUGUST 2024

A Survey on Neural Network Hardware Accelerators

Tamador Mohaidat and Kasem Khalil, Senior Member, IEEE

Abstract—Artificial intelligence (AI) hardware accelerators are an emerging research area for several applications and domains. The goal of a hardware accelerator is to provide high computational speed while retaining low cost and high learning performance. The main challenge is to implement complex machine learning models on hardware with high performance. This article presents a thorough investigation into machine learning accelerators and the associated challenges. It describes hardware implementations of different structures such as the convolutional neural network (CNN), recurrent neural network (RNN), and artificial neural network (ANN). Challenges such as speed, area, resource consumption, and throughput are discussed. It also presents a comparison between existing hardware designs. Finally, the article describes the evaluation parameters for a machine learning accelerator in terms of learning and testing performance and hardware design.

Impact Statement—Neural networks have revolutionized the field of AI, empowering machines to acquire knowledge from data and accomplish tasks that were previously deemed unachievable. This survey covers various types of accelerators, including custom application-specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), graphics processing units (GPUs), and dedicated AI chips, and compares their performance, power efficiency, and scalability. The survey also discusses the design tradeoffs involved in building neural network accelerators, such as memory hierarchy, dataflow architecture, and precision. It provides insights into the latest trends and advancements in hardware accelerators for neural networks. It helps researchers, engineers, and practitioners in the field choose the right hardware platform for their specific needs and optimize the performance and energy efficiency of their neural network models. Moreover, this survey can also inspire new research directions and advancements in neural network hardware accelerator design, paving the way for the next generation of intelligent systems.

Index Terms—Artificial intelligence (AI), artificial neural network (ANN), convolutional neural network (CNN), hardware accelerator, machine learning, machine learning design, machine learning on-chip, neural network, recurrent neural network (RNN).

Manuscript received 12 August 2023; revised 22 September 2023; accepted 4 February 2024. Date of publication 14 March 2024; date of current version 13 August 2024. This article was recommended for publication by Associate Editor Mehmet Onder Efe upon evaluation of the reviewers' comments. (Corresponding author: Kasem Khalil.)
Tamador Mohaidat is with the Electrical and Computer Engineering Department, University of Mississippi, Oxford, MS 38677 USA.
Kasem Khalil is with the Electrical and Computer Engineering Department, University of Mississippi, Oxford, MS 38677 USA, and also with the Electrical Engineering Department, Assiut University, Asyut 71515, Egypt (e-mail: [email protected]).
Digital Object Identifier 10.1109/TAI.2024.3377147

I. INTRODUCTION

Machine learning has emerged with a number of new issues alongside the growth of the Internet and multimedia technology. Machine learning has gained traction in various disciplines, with the goal of simulating human thought. Researchers from diverse fields collaborate to share their perspectives and techniques, contributing to the progression of the artificial intelligence (AI) field. Machine learning approaches are involved in several applications and domains, such as speech recognition [1], [2], image classification [3], [4], hardware and software fault prediction [5], [6], [7], text detection [8], [9], disease detection and prediction [4], [10], [11], nutrition monitoring [12], medical treatment [13], [14], and defect inspection and metrology [15], [16]. The main challenges of current approaches are providing an optimized model of learning and a hardware accelerator with low cost.

Machine learning serves as a crucial step toward achieving AI. Functioning as a specialized branch of AI, machine learning involves utilizing data and algorithms to simulate human learning processes and continually improve its accuracy. Rather than relying on direct instruction, machine learning involves the use of mathematical models to facilitate computer learning. By utilizing historical data as input, machine learning algorithms can predict new output values. As the core component of AI, machine learning enables computers to possess intelligent capabilities. Through the development of various theories and methodologies, research in machine learning aims to establish a specific application-based study system that utilizes insights from human physiology and cognitive science to conduct theoretical analysis and advance algorithmic models. Machine learning tracks complex human actions in multimedia streams. The field of machine learning comprises various types, including reinforcement learning, supervised learning, and unsupervised learning. Supervised learning involves training a model using data that is labeled, while unsupervised learning involves training a model using data that is unlabeled [17], [18], [19]. Reinforcement learning, on the other hand, requires trial and error to train a model [20], [21].

The importance of machine learning comes from its capacity to speed up the creation of new products and grant businesses insights into consumer behavior trends and operational patterns. Many of today's leading companies incorporate machine learning into their operations. Machine learning has become a critical factor in setting businesses apart from their competitors. Machine learning is reshaping every industry, from healthcare and education to transportation, the food and entertainment industries, manufacturing, and more. The fields of bioinformatics, physics, chemistry, material analysis, and other related disciplines utilize intelligent methods that enhance their content development. These methods leverage machine learning as a core technology. It will significantly influence nearly every facet of individuals' lives. Both cloud computing and the Internet of Things (IoT) are driving the expanded utilization of machine learning to enable objects and gadgets to become "smart" on their own [22], [23], [24].

Deep learning typically employs multiple layered structures, such as convolutional neural networks (CNNs) [25], [26], recurrent neural networks (RNNs) [27], [28], and artificial neural networks (ANNs) [29], [30], to process large-scale and unstructured data. Each structure has its own model of learning. An ANN is based on multiple layers connected in a chain. Each layer has multiple nodes that perform the computation process. The nodes in each layer are interconnected with the nodes in the following layer until they reach the output layer. The architecture of a CNN consists of convolutional layers and pooling layers, with a pooling layer following each convolutional layer. A convolutional layer is used to run the computation of the input data with the stored weights. To reduce complexity in the subsequent layers, a pooling layer is utilized to diminish the data size. The RNN is based on memory for the learning process, which makes it suitable for time series data.

Hardware implementation of machine learning has a significant role in current applications with low cost [31], [32], [33], [34], [35], [36]. The main challenge is to provide a machine learning accelerator with high speed for problem classification with low hardware costs, without compromising desired performance in terms of area and power [37], [38], [39], [40], [41], [42]. The objective of this article is to examine the current machine learning hardware accelerator approaches with a discussion of their advantages and limitations. It also defines the general challenges for any hardware accelerator design to be considered in future methods. Furthermore, it presents the evaluation parameters for a hardware accelerator. This study delves into a comprehensive examination of various hardware accelerators, without being limited to specific neural network architectures. It conducts an extensive comparative analysis of recent advancements, highlighting their respective strengths and shortcomings. This is accomplished by presenting numerical data for key performance metrics, including power consumption, area, and accuracy. By providing this detailed information, the reader gains a precise understanding of the distinctive attributes of each examined work. This article's main contributions are summarized as follows.
1) A list of the challenges on machine learning hardware accelerators.
2) A comprehensive study on hardware accelerator systems.
3) A comprehensive review on machine learning hardware accelerators.
4) A comparison between the existing hardware accelerators.
5) An evaluation framework for machine learning hardware accelerators.

The subsequent sections of this article are organized in the following manner. Section II presents the unique challenges of machine learning hardware accelerators. Section III investigates multiple hardware accelerator systems. The models and datasets covered by the existing methods are presented in Section IV. Section V presents a review of machine learning accelerators with a comparison between the existing methods. Section VI presents an evaluation framework for a hardware accelerator, followed by the conclusion and a discussion of future work in Section VII.

II. ACCELERATOR CHALLENGES

Existing machine learning hardware accelerators face some challenges in providing a design with the desired performance and cost. Machine learning models are complex, so their hardware implementation is complex and slow. Thus, the research direction is to propose a design with less complexity while preserving performance and increasing speed. Hardware accelerator challenges are power/energy consumption, throughput, area, speed, learning performance, and resource consumption. Each one is described as follows.

A. Power Consumption

In cloud-based deep neural network (DNN) processing, power consumption is a critical factor due to the strict power limits in data centers caused by cooling costs. Additionally, data movement consumes more energy than arithmetic operations like multiplier–accumulator (MAC) operations, as the associated capacitance is much higher. Hence, it is crucial to provide comprehensive reporting on not only the energy efficiency and power consumption of the chip but also the energy efficiency and power consumption associated with off-chip memory. This includes considerations such as dynamic random access memory (DRAM) or the frequency of off-chip accesses. By evaluating the energy efficiency and power consumption of the entire system, regardless of the specific memory technology employed, a more holistic assessment can be achieved [43]. Embedded system designers face an increasing challenge in reducing hardware resources and power consumption while maintaining the computational complexity of real-time applications. In some designs, the weights and intermediate results can be stored in on-chip buffers to cut down on the time spent retrieving data from off-chip memory and the amount of power required to keep the system running. The main issue is to design a hardware accelerator with a light structure to reduce power consumption.
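To make the dominance of data movement concrete, the back-of-the-envelope sketch below compares the energy spent on on-chip MAC operations against on-chip buffer and off-chip DRAM accesses for one layer. The per-operation energy values are illustrative assumptions only (off-chip accesses are commonly estimated to cost orders of magnitude more energy than a MAC); they are not measurements of any particular chip or technology node.

```python
# Rough energy-breakdown sketch for one DNN layer.
# The per-operation energies below are illustrative assumptions,
# not measured values for any specific technology.
E_MAC_PJ = 1.0      # assumed energy per MAC (picojoules)
E_SRAM_PJ = 5.0     # assumed energy per on-chip buffer access (pJ)
E_DRAM_PJ = 640.0   # assumed energy per off-chip DRAM access (pJ)

def layer_energy(macs, on_chip_accesses, dram_accesses):
    """Return total energy in microjoules and the DRAM share of it."""
    total_pj = (macs * E_MAC_PJ
                + on_chip_accesses * E_SRAM_PJ
                + dram_accesses * E_DRAM_PJ)
    dram_share = (dram_accesses * E_DRAM_PJ) / total_pj
    return total_pj / 1e6, dram_share

# Example: 100 M MACs, with and without on-chip weight/result buffering.
print(layer_energy(macs=100e6, on_chip_accesses=300e6, dram_accesses=5e6))
print(layer_energy(macs=100e6, on_chip_accesses=50e6, dram_accesses=50e6))
```

Even with these rough numbers, the second case (more traffic pushed off-chip) is dominated by DRAM energy, which is the motivation for keeping weights and intermediate results in on-chip buffers.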


B. Throughput

Obtaining high throughput and low latency concurrently can be challenging depending on the approach taken. Expanding the number of processing elements (PEs) can enhance the overall throughput, resulting in an increased number of parallel MAC operations. However, the system's area cost and the area of the PE determine the number of PEs. If the area cost of the system remains constant, expanding the number of PEs results in a decrease in the area per PE or a reduction of on-chip storage. This could affect how PEs are utilized. Reducing the logic necessary to send operands to a MAC by using a single piece of logic can decrease the area per PE. The maximum throughput is determined by the number of PEs and the maximum throughput achievable by a single PE. However, the actual throughput depends on several factors, such as the network architecture, weight and activation sparsity in the DNN model, and batch size. Increasing the batch size can enhance the reuse of data and increase throughput. The hardware's ability to support these approaches while maintaining PE utilization, the number of PEs, or cycles per second determines the overall impact of the DNN model on throughput [43].

C. Area

Machine learning accelerators face a multitude of challenges that can lead to area overhead, such as the need to perform both forward and backward passes without sharing any hardware resources between the two processes. Additionally, implementing the hardware accelerator on the chip can come at a high cost. To address these challenges, it is necessary to simplify complex machine learning models in hardware designs while also optimizing hardware components without sacrificing performance, making them more efficient and cost-effective.

D. Performance

Neural networks face difficulties with throughput due to waiting for the processing unit to finish reading data. To address this issue, improved activation functions are proposed in machine learning accelerator designs to enhance accuracy and performance. Strategies such as pooling and convolutional or kernel processing are used to further improve accuracy. To achieve low latency and high efficiency, the neural network is accelerated using pipeline design and multichannel parallel processing. The main challenge is to maintain high performance in terms of sensitivity, accuracy, and specificity while avoiding the addition of complex hardware components.

E. Resource Consumption

Reducing hardware resources poses a significant challenge due to the increased computational complexity of real-time applications. To address this challenge, some innovative architectures have been proposed that use a convolutional PE array. This PE array can reuse pixel and weight data effectively, thereby reducing the number of resources consumed while maintaining performance in learning and testing. The basic concept is to reduce the hardware resources without compromising the learning and testing performance of the system. The convolutional PE array architecture exploits the fact that the convolution operation is both data and weight reuse friendly. The array can perform multiple convolution operations simultaneously, and the weights for each convolution operation are stored in a weight buffer. The input pixels are stored in a buffer, which can be accessed multiple times during the computation. The PE array can also incorporate multiple output channels and multiple input channels to handle different types of convolution operations. By efficiently reusing the weight data and input data, the convolutional PE array architecture reduces the number of memory accesses, which results in a decrease in power consumption and hardware resources. This technique enables the hardware designer to attain a balance between performance and hardware resources, which is a critical aspect of designing hardware accelerators for machine learning applications.

F. Speed

The design of neural networks with both high speed and energy efficiency has been a challenging task. This has prompted researchers to explore alternatives to graphics processing units (GPUs) and central processing units (CPUs) for efficient acceleration of the algorithms used in neural network models. Due to the high energy cost per read and write operation and the long access time associated with external memory, a number of systems continue to experience difficulties handling their data loads. Alternate methods first adjust the memory to allow for a bigger data bus or make use of several memories distributed across the system in order to cut down on the overhead of this data movement. Parallel access makes it possible to handle many data streams during a single clock cycle, which both accelerates the system's overall speed and makes better use of its available hardware resources.
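As a rough illustration of the throughput and speed considerations above, the sketch below estimates peak throughput from the number of PEs and the clock frequency, and then caps the attainable throughput with a memory-bandwidth bound in the spirit of a roofline model. The PE count, utilization factor, bandwidth, and operational-intensity values are assumed for illustration only and do not describe any specific accelerator in this survey.

```python
# Peak vs. attainable throughput sketch (illustrative numbers only).
def peak_throughput_gops(num_pes, macs_per_pe_per_cycle, clock_ghz, utilization=1.0):
    # One MAC is counted as two operations (multiply + add).
    return 2 * num_pes * macs_per_pe_per_cycle * clock_ghz * utilization

def attainable_throughput_gops(peak_gops, bandwidth_gbps, ops_per_byte):
    # Roofline-style bound: compute peak vs. memory-bandwidth limit.
    return min(peak_gops, bandwidth_gbps * ops_per_byte)

peak = peak_throughput_gops(num_pes=256, macs_per_pe_per_cycle=1,
                            clock_ghz=0.2, utilization=0.85)   # assumed 85% PE utilization
bound = attainable_throughput_gops(peak, bandwidth_gbps=12.8,  # assumed DRAM bandwidth
                                   ops_per_byte=4.0)           # assumed operational intensity
print(f"peak = {peak:.1f} GOPS, attainable = {bound:.1f} GOPS")
```

When the bandwidth term is the smaller of the two, adding PEs no longer helps, which is why the memory-oriented techniques discussed under Speed matter as much as the PE count.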

III. HARDWARE ACCELERATOR SYSTEMS

Hardware accelerator systems are specialized hardware devices designed to accelerate the performance of specific tasks. These systems use dedicated hardware components such as FPGAs, ASICs, and GPUs to perform complex computations much faster than traditional CPUs. Hardware accelerator systems are widely used in a variety of industries due to their ability to perform complex computations faster and more efficiently than traditional computing systems. In the finance industry, hardware accelerators are used for a variety of purposes, including high-frequency trading, risk management, and fraud detection. High-frequency trading relies on the ability to make trades within fractions of a second, and hardware accelerators can process vast amounts of data in real time, making them a valuable tool. In healthcare, they can be used to accelerate medical imaging tasks such as magnetic resonance imaging (MRI) and computerized tomography scans, drug discovery and development, and genomics research. In scientific research, they can be used to accelerate simulations, modeling, and data analysis tasks. They also find applications in autonomous vehicles, aerospace, and defense industries for tasks such as image processing, sensor data analysis, and control systems. Hardware accelerator systems are also used in high-performance computing applications such as machine learning, data analytics, and virtual reality [44], [45]. They can greatly improve the performance of computing devices, allowing for faster and more efficient processing of tasks. This can lead to improved productivity and reduced wait times for users. Additionally, hardware accelerator systems can help reduce energy consumption and lower costs. By offloading certain tasks from the CPU and GPU, these systems can operate more efficiently and require less power. This can result in significant savings for both individuals and organizations. The possibilities for the use of hardware accelerator systems are endless and are only limited by our imagination.

Fig. 1. Hardware accelerator systems.

There are various types of hardware accelerator systems available in the market today, each designed to accelerate specific types of tasks. Some of the common types include FPGA-based systems, CPU-based systems, GPU-based systems, and ASIC-based systems. Each type has its unique advantages and disadvantages. FPGA-based systems are highly flexible and can be programmed to perform different tasks. GPU-based systems are ideal for parallel processing and are commonly used in gaming and graphics applications. ASIC-based systems are highly optimized for specific tasks and offer the highest performance but are expensive to design and manufacture. Fig. 1 shows the hardware accelerators used by all the works examined for this study. As can be seen, 64% of the works used FPGAs to implement the DNN networks. The CPU accounts for 16%, while the GPU and ASIC account for 12% and 8%, respectively. Some common types of hardware accelerators include the following units:

A. CPU

The current CPU can execute single instruction, multiple data (SIMD) instructions by utilizing multiple ALUs simultaneously. This feature is particularly useful in image processing, as it allows the same instruction to be performed on a continuous stream of data. In computer vision, most operations occur across the image [44], [46].

B. GPU

Fig. 2. GPU architecture block diagram.

Compared to general-purpose CPUs, GPUs have developed into a specialized architecture designed for parallel processing, with SIMD instruction extensions that allow for concurrent execution of multiple tasks, including image processing workloads [47]. Fig. 2 shows the GPU architecture; GPUs have processing cores that are far simpler than those of standard high-performance CPUs [46]. These cores often have simpler control logic, since they do not need to predict branching or prefetch data, and constrained memory per core. GPUs can accommodate a much larger number of cores on a single chip than CPUs due to the simpler computing cores. GPU architectures work very well in situations where there aren't many or any branching conditions. In addition, GPU architectures include a memory architecture designed specifically for high-speed data streaming for image processing.

C. Field Programmable Gate Array (FPGA)

Fig. 3. FPGA architecture block diagram.

FPGA is a programmable integrated circuit capable of being reconfigured to perform different circuit functions multiple times. Instead of having a fixed design like a processor, an FPGA is composed of digital signal processors (DSPs), configurable logic blocks (CLBs), I/O pads, on-chip block RAMs (BRAMs), and routing channels, as shown in Fig. 3. In an FPGA, you can set up your data paths so that pixels can be directly transferred between computing units and external memory without any intermediary steps. Furthermore, distributed BRAMs can be used to exploit data locality in vision kernels by storing pixels on the chip.


Programmers can change the structure of the FPGA's hardware to make it into any shape they need. The FPGA's fine-grained parallel architecture gives it advantages over the GPU. Once the computation clock cycle time has been calculated, the designer can optimize the output mode to minimize the demand for data storage in main memory, thereby decreasing memory reading delays. FPGA programming is powerful; the FPGA provides dynamic algorithm reconfiguration with robust reconfigurability. Also, an FPGA uses much less power than a GPU and works better with the same amount of power, which can help a lot with the processor's problem of getting rid of heat [45], [48].

D. Application-Specific Integrated Circuit (ASIC)

Fig. 4. ASIC block diagram.

ASIC is another type of integrated circuit that is designed to perform specific tasks, as opposed to a CPU, which is a general-purpose processor, as shown in Fig. 4 [44]. ASICs are more specialized than GPUs, since an ASIC is a processor built to perform a relatively small set of computations, whereas a GPU is still a massively parallel processor with thousands of processing units that can run multiple algorithms [49]. In contrast to an FPGA, you cannot reprogram an ASIC to do something different once it is fabricated. Its logic has been fixed since it was made, but on an FPGA you can make a different design that fits your needs better. ASICs are usually substantially more energy efficient as a result of this specialization.

IV. MODELS AND DATASETS

Fig. 5. DNN models used in the works we reviewed.

Datasets are essential for determining the accuracy of a DNN. Significant research effort has been expended over the decades to increase the performance of DNNs through innovative architectures. However, the constant need for more accuracy has driven new, deeper, and increasingly complex models [37], [47]. In Fig. 5, we show the datasets used most often across all the articles examined in our survey; various datasets were utilized to assess the accuracy of the suggested DNN algorithms. There could be multiple datasets for the same work. MNIST, ResNet, CIFAR, and ImageNet are the most popular datasets, as shown. In general, there is a well-balanced distribution of research efforts among CIFAR, ResNet, and ImageNet, while many DNN hardware works focus on the MNIST dataset. It is apparent that a significant portion (27%) of the accuracy assessments are carried out on the basic MNIST network. Nonetheless, considerable attention is devoted to sophisticated networks such as the CIFAR (18%) and ImageNet (18%) networks.

V. ACCELERATOR APPROACHES

Several machine learning approaches, such as ANN, CNN, and RNN, are implemented on hardware. This section discusses the different machine learning accelerator approaches for each category as follows.

A. ANN

Fig. 6. Block diagram of ANN.

Like the biological neural network in the human body, the ANN features a layered architecture in which each network node can process input and forward output to other nodes in the network. The nodes are known as neurons. An ANN is comprised of three or more interconnected layers. The first layer contains input neurons that transfer data to the subsequent layers. The output layer produces the final output data. The layers between the input and output layers are hidden and made up of units that adaptively transform the information received from the previous layer through a sequence of transformations. The ANN can understand more complicated objects since each layer works as an input and output layer. The neural layer is the term used to refer to these inner layers together. The units within a neural layer aim to learn from the gathered information by assigning weights based on the internal architecture of the ANN [50], [51], [52]. These principles enable units to provide a changed result, which is delivered as an output to the next layer. Fig. 6 presents the general block diagram of the ANN. An adder accumulator receives the product of the input from each node after it has been multiplied by a weight. The result of the adder accumulator is sent to an activation function, which returns the final result. The final output is expressed by the following equation:

y_k = f\left(\sum_{l=0}^{n} W_{lk} X_l + b_k\right)    (1)

Equation (1) represents the final output, where n denotes the total number of neurons. Here, X_l represents the output of the lth neuron in the previous layer, and W_{lk} represents the synaptic weight from the lth neuron to the kth neuron in the current layer. Additionally, f denotes the activation function and b_k represents the bias.
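As a concrete illustration of (1), the short sketch below computes the outputs of one fully connected ANN layer: each input is multiplied by its synaptic weight, the products are accumulated with the bias, and an activation function is applied. The sigmoid choice and the example sizes are assumptions for illustration only, not a particular design from the surveyed works.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ann_layer(x, W, b, activation=sigmoid):
    """Forward pass of one fully connected layer following (1).

    x: outputs of the previous layer, shape (n,)
    W: synaptic weights, W[l, k] connects neuron l to neuron k, shape (n, m)
    b: biases, shape (m,)
    """
    return activation(x @ W + b)   # adder-accumulator followed by activation

# Illustrative example: 4 inputs feeding a layer of 3 neurons.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W = rng.standard_normal((4, 3))
b = np.zeros(3)
print(ann_layer(x, W, b))
```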


ANNs have extensive practical applications, such as speech and image recognition, business intelligence predictive analysis, natural language processing (NLP), and many more. ANNs have several advantages, including processing speed, scalability, and accuracy. The most significant advantage of ANNs is their high-speed processing, which can be achieved through parallel implementation.

Khalil et al. [53] present an area-efficient ANN implementation. The proposed method utilizes certain layers called hidden layers in a novel way to reduce the number of layers in the ANN by nearly half. The study proposes nonconventional layers called hidden layers. Each of these layers serves two distinct functions through the intelligent use of various weights, which are adaptable. Consequently, each hidden layer performs the tasks of two regular ANN layers. Fixed layers are the other kind of layers that are traditionally used. The fixed layers act as a single layer and are not flexible. The proposed architecture minimizes the number of fixed layers. In addition to the adaptable layers proposed, specific applications may still require one or more fixed layers. The implementation of the proposed method using Verilog HDL and an Altera Arria 10 GX FPGA resulted in a 41% reduction in area compared to the state-of-the-art method, with low overhead in terms of power consumption and speed. The benefits in terms of area reduction and accuracy are substantial.

Huynh [54] presents a feedforward DNN accelerator on FPGAs for an efficient hardware implementation architecture. The proposed neural network architecture performs fully connected feedforward DNNs with customizable layers, neurons per layer, and inputs, using only one physical processing layer. The network represents inputs, weights, and outputs in half-precision floating-point number format using 16-b operands and employs the MNIST database for handwritten digit recognition applications. The performance evaluation is conducted on two Xilinx FPGA boards, the Virtex-5 XC5VLX-110T and Zynq-7000 7Z045. The accuracy achieved for the 784-40-40-10 network on the Virtex-5 XC5VLX-110T chip is 97.20%, while for the 784-126-126-10 network on the Zynq-7000 7Z045 chip, it is 98.16%. The peak performance achieved is 15.81 and 15.90 thousand handwritten image frames per second (kFPS), respectively.

Medus et al. [55] present novel hardware for any feedforward neural network (FFNN) implementation, including logistic regression, multilayer perceptrons (MLPs), and autoencoders. The architecture allows for any number of layers, units per layer, and input and output numbers. The hardware utilizes matrix algebra techniques and employs serial-parallel computation. It utilizes a systolic ring of neural processing elements (NPEs) that only requires as many NPEs as neuron units in the largest layer, regardless of the number of layers. The utilization of resources increases linearly as the number of NPEs increases. This adaptable design works as a real-time application accelerator and does not have an impact on the system clock frequency due to its size. In contrast to most existing methods, the proposed approach only utilizes a single activation function block (AFB) for the entire FFNN. Performance, accuracy, and resource usage are evaluated for several activation functions and network topologies. On a Virtex-7 FPGA, the architecture operates at a 550 MHz clock speed. The proposed approach achieves classification performance comparable to a floating-point approach using an 18-b fixed point. A smaller weight bit size has no effect on accuracy and allows for more weights in the same memory. A 256x acceleration was attained for a real-time application of abnormal cardiac detection, and different FFNNs for the Iris and MNIST datasets were evaluated.

Khalil et al. [56] present a NoC-based adaptive neural network. The NoC is composed of routers and PEs, where each router is connected to its PE, consisting of m nodes used for constructing neural network layers. A configuration packet is utilized to specify the number of layers and nodes per layer. The suggested method can allocate multiple routers to represent a layer based on the required configuration. Thus, the proposed solution enables the creation of several layers with flexible node configurations (as required by the application). The method is implemented on an Altera 10 GX FPGA, achieving an accuracy of 98.18% with the MNIST dataset. Multiple datasets are used for testing, and the results demonstrate that the suggested approach uses comparable resources as the traditional method. Xiao et al. [57] present intrachip and interchip communication strategies for neural network accelerators named NeuronLink. In terms of intrachip communication, this study suggests techniques for route computation parallelization, arbitration interception, and scoring crossbar arbitration to improve virtual-channel routing. These techniques lead to a high-throughput network on a chip (NoC) for multicast-based traffic while keeping the hardware cost low. Additionally, this work proposes a lightweight and NoC-aware chip-to-chip interconnection architecture to enable efficient interconnection for neural network processors that use NoCs for interchip communication. Zhang et al. [58] introduce a programmable in-memory computing accelerator (PIMCA) for DNN inference with low precision (1–2 b). PIMCA integrates 108 IMC SRAM macros with custom 10T1C bit-cell technology in 28 nm. These macros can store all weights of the targeted 1-b VGG-9 model, eliminating off-chip data movement during DNN inference. The IMC SRAM macros also improve speed and energy efficiency by eliminating row-by-row memory accesses.


Additionally, a flexible SIMD processor is integrated to handle non-MAC operations, such as max-pooling, element-wise addition, and residual operations, eliminating data movement energy consumption and latency between the accelerator and host. The test chip prototyped in a 28-nm technology achieves a system-level (macro-level) peak energy efficiency of 437 TOPS/W and a peak throughput of 49 TOPS at a 42-MHz clock frequency.

Song et al. [59] propose HYPAR, a method to ascertain layer-specific parallelism in the training of DNNs using an array of DNN accelerators. HYPAR divides the feature map tensors (input and output), kernel tensor, gradient tensor, and error tensors among the DNN accelerators, with each division representing the chosen parallelism for weighted layers. The primary optimization goal is to identify a partition that minimizes overall communication during the complete training of a DNN. Notably, HYPAR demonstrates practicality with a linear time complexity for the partition search. This approach is implemented within the HYPAR architecture, an HMC-based DNN training framework designed to reduce data movement. The results show an average performance increase of 3.39x and an energy efficiency improvement of 1.51x when compared to the default data parallelism. Wei et al. [60] introduce a novel approach called layer-conscious memory hierarchy (LCMH) for DNN accelerators. LCMH dynamically determines the appropriate memory levels for each layer based on their specific requirements for off-chip memory bandwidth and on-chip buffer size. This allows for the avoidance of off-chip memory usage in layers with high memory demands, keeping their data on-chip instead. Experimental results demonstrate that designs implementing layer-conscious memory management achieve a significant speedup of up to 36% compared to designs using a uniform memory hierarchy, and a 5% improvement over current state-of-the-art designs.

B. CNN

Fig. 7. Block diagram of CNN.

CNN is a deep learning model that is used to process data. Its architecture is intended to learn spatial feature hierarchies automatically and adaptively, starting from low-level features and building up to high-level ones. CNNs are made up of a variety of building blocks that help them extract relevant features from input data [61], [62], [63], [64]. As shown in Fig. 7, CNN structures are typically made up of three layers: convolution layers (CLs), pooling layers, and fully connected layers. Convolution and pooling layers are used to extract features from input data. These features are then mapped to the final output, which is typically a classification result, using a fully connected layer. A crucial component of a CNN is the CL, which comprises a stack of mathematical operations, including the convolution linear operation. Pixel values are recorded in digital images as an array of numbers in a 2-D grid. CNNs use a small parameter grid called a kernel as an optimizable feature extractor, which is applied at each position in an image. This makes CNNs particularly efficient for image processing. The extracted features can grow hierarchically and become increasingly complicated as the output of one layer is fed into the next. Using an optimization process called backpropagation and gradient descent, training involves improving parameters such as kernels to reduce the disparity between outputs and ground truth labels. CNNs have become widely used in various applications, including image classification, image captioning, object detection, facial recognition, semantic segmentation, and other applications.

To perform a convolution operation in a CNN, we need to find a local receptive field in the input feature map that has the same size as the convolution kernel. Next, multiply the corresponding points by the convolution kernel. Finally, add the offset coefficient to the final result, given by the following equation:

f = bias + \sum_{m}\sum_{n} (pixel_{mn} \times weight_{mn})    (2)

Equation (2) uses m and n to represent the height and width of the feature window, f is a pixel of the feature image generated by the convolutional layer, and bias is a constant value added to each feature window. The resulting values are typically passed through a nonlinear activation function such as the hyperbolic tangent (tanh), rectified linear unit (ReLU), or Sigmoid. When convolution is performed without padding, the output dimension is less than the input dimension. The output size is given by the following equation, which is a function of filter size and stride:

D = \frac{H - F + 2P}{S} + 1    (3)

In equation (3), H denotes the input size, F denotes the filter size, S denotes the stride size, and P denotes the padding size. The pooling layer is responsible for reducing network complexity, compressing features, and removing redundant information. The size of the final feature map after pooling is determined by the kernels' movement step, which is given by equation (3). The fully connected layer, or MLP, is the last layer of a CNN and is composed of layers, each of which has many neurons (nodes). Nodes in one layer are directly connected to nodes in the preceding and subsequent layers. Finally, the fully connected layer is linked to the final output node where classification results are received. A classification function, such as SoftMax, can be utilized at this point. The SoftMax function can be obtained by the following equation:

f_x(x_j) = \frac{e^{x_j}}{\sum_{i=1}^{k} e^{x_i}}, \quad j = 1, 2, 3, \ldots, k    (4)

In equation (4), x is the input signal and k is the number of output classes.

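To tie (2), (3), and (4) together, the short sketch below computes one output pixel of a convolution window, the output dimension from (3), and the SoftMax of a score vector. The array sizes and values are illustrative assumptions; the sketch only restates the equations above in code (the max-subtraction in the SoftMax is a standard numerical-stability step and does not change the result).

```python
import numpy as np

def conv_pixel(window, kernel, bias=0.0):
    """One output pixel of a convolution, following (2)."""
    return bias + np.sum(window * kernel)

def conv_output_size(H, F, S=1, P=0):
    """Output dimension D = (H - F + 2P) / S + 1, following (3)."""
    return (H - F + 2 * P) // S + 1

def softmax(x):
    """SoftMax over k output classes, following (4)."""
    e = np.exp(x - np.max(x))   # subtract max for numerical stability
    return e / np.sum(e)

# Illustrative example: 5x5 input, 3x3 kernel, stride 1, no padding.
rng = np.random.default_rng(0)
image = rng.standard_normal((5, 5))
kernel = rng.standard_normal((3, 3))
D = conv_output_size(H=5, F=3, S=1, P=0)          # -> 3
feature = np.array([[conv_pixel(image[r:r+3, c:c+3], kernel)
                     for c in range(D)] for r in range(D)])
print(feature.shape, softmax(np.array([1.0, 2.0, 0.5])))
```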
Tang et al. [65] create an improved CNN image classification model. Maximum pooling is used in the model network structure. The accuracy of the four activation functions of Sigmoid, Tanh, ReLU, and T-ReLU is compared in their work to improve neural network performance and image classification accuracy. The T-ReLU activation function improves the model, raising image classification accuracy from 62% to 76.52%. Khalil et al. [66] propose a hardware implementation of a new pooling method, absolute average deviation (AAD), for use in a CNN accelerator. AAD makes use of the spatial proximity of pixels by computing vertical and horizontal deviations, resulting in higher accuracy and lower area and power consumption compared to other pooling methods. The AAD pooling method achieves over 98% accuracy without increasing computational complexity. It was tested using various neural network structures and datasets, including EEG, ImageNet, common objects in context (COCO), and United States Postal Service (USPS). VHDL was used to implement AAD on an Altera Arria 10 GX FPGA with 45-nm technology, using the Synopsys Design Compiler. Song et al. [67] propose a multidie-based CNN accelerator. The VU9P chip involves three accelerators, each connected to an independent super logic region (SLR). The host computer manages the three accelerators, installing one accelerator in each SLR and using on-chip resources. This system utilizes an 8-b quantization method to enhance the throughput and computational efficiency of a single DSP for accelerating the YOLOv4-tiny algorithm. The design employs a full reuse of feature maps and weights during the calculation process and stores intermediate results in the on-chip buffer to minimize off-chip access, reduce bandwidth pressure, and decrease power consumption. Moreover, a designed instruction group enables the host computer to control the accelerator. This architecture achieves a frame rate of 148.14 frames per second (FPS) and a peak throughput of 2.76 tera operations per second (TOPS) at a frequency of 200 MHz with an energy efficiency ratio of 93.15 GOPS/W. It delivers promising results in real-time target detection applications.

Ting et al. [68] propose a batch normalization (BN) processor that supports training and inference processes. To accelerate CNN training, the proposed work develops an efficient dataflow that incorporates a novel BN processor design as well as processing elements for convolution acceleration. By sharing hardware elements between the two passes, this study took advantage of the comparable calculations necessary for the BN forward and backward passes, reducing the area overhead. The method completed automatic placement and routing (APR) and post-APR simulation on the training of the neural network and functional verification of the BN processor. The method implemented the BN processor in a CMOS technology process. The proposed solution accelerates the CNN training process while saving hardware. The proposed architecture can reduce the total area by 40.13%.

Khabbazan and Mirzakuchaki [69] describe optimized hardware for CNNs for use in embedded vision systems. This design method is intended to be applied to low-end hardware with the least resources needed. This hardware proposes a Z-turn evaluation board architecture with a Xilinx Zynq-7000 System-on-Chip (SoC). This architecture optimizes all computations to be 8-b. Moreover, due to its high-speed performance, low power consumption, and compact size, the architecture is a suitable option for CNN applications that require portability and embedded systems.

Xiao et al. [70] present a neural network acceleration architecture that is efficient, scalable, and has low latency and low error rates. The architecture achieves acceleration by utilizing multichannel parallel computing methods between layers and employing a pipeline design that prioritizes high efficiency and low latency performance requirements. The addition of a line buffer to accommodate varying image widths and the implementation of a selectable convolution kernel size mechanism enhance the network's flexibility and scalability. This proposed neural network performs 32-b floating-point operations. Since CNNs are based on floating-point operations, there will be a loss of precision and time-consuming transformation work if the algorithm's FPGA implementation involves the conversion of floating-point values to fixed-point values. The MNIST dataset is used for handwritten number recognition to perform an experimental evaluation of the solution. The acceleration strategy is implemented using the Xilinx Zynq-7000 FPGA, and the results of calculating 28 x 28 handwritten images at a clock frequency of 200 MHz in 25.95 us are examined. A 98.43% accuracy rate is obtained.

Lee et al. [71] present S3NAS, a quick hardware-aware NAS approach. The process is broken down into three stages: supernet design, Single-Path NAS for quick architectural exploration, and scaling and postprocessing. The initial stage involves creating a supernet, which is a set of candidate networks with two main features. First, it allows for varying numbers of blocks in the stages, and second, it permits blocks to have parallel layers with different kernel sizes (MixConv). To minimize the hyperparameter search overhead, a differential search can be carried out by extending the single-path NAS method to include the MixConv layer and incorporating a loss term that takes into account the latency. The network is scaled to its maximum within the latency constraint using compound scaling as the last step. In the postprocessing step, SE blocks and h-swish activation functions are incorporated if they are found to be advantageous. The efficiency of the proposed methodology is demonstrated by tests conducted on four different hardware platforms. Using TPUv3, the search process can be completed within 4 h, resulting in the discovery of networks that offer superior tradeoffs between latency and accuracy compared to state-of-the-art networks. Moreover, this model outperforms other models by 0.6% in terms of accuracy and 14% in terms of speed compared to EfficientNet-B2.

Liu et al. [72] propose a hardware architecture tailored for streaming applications, with a strong emphasis on increasing computation efficiency by fully accelerating CNNs on FPGAs. To support inference of CNNs with varied topologies, the architecture integrates most computational functions, including convolutional and deconvolutional layers, into a single unified module. It efficiently handles concatenative and residual connections between the functions, resulting in highly customized acceleration.

This design is further enhanced by utilizing various levels of parallelism, layer fusion, and completely utilizing DSPs. The suggested accelerator has been tested using a variety of benchmark models and implemented on Intel's Arria 10 GX1150 hardware. The results show a high performance of over 1.3 TOP/s throughput and up to 97% computation efficiency.

Wang et al. [73] propose a double buffer memory access structure, which considerably increases the computing unit's memory access efficiency. Furthermore, the proposed architecture utilizes a "ping-pong" buffer structure and employs calculation delay to overlap with memory access delay, resulting in improved acceleration performance. To improve the computation performance of the computing unit, an accelerator structure with a multilevel cache is proposed to execute data preparation reading. To prevent the processing unit from waiting when reading data, a double buffer method is used to perform calculation and data reading alternately. Based on the experimental results, the proposed accelerator achieved a detection speed of 15 FPS when processing an input image of size 3 x 160 x 320, while maintaining the same test accuracy as the original design. This signifies a 1.5 times enhancement in acceleration when compared to the original design.

Achararit et al. [74] offer an accuracy-and-performance-aware NAS (APNAS) that can efficiently create DNNs. APNAS is based on a weight-sharing and reinforcement learning-based exploration method. First, they provide a technique for calculating the cycle count in an RNN such that the network search does not require running a time-consuming hardware simulator. Additionally, they use analytical models for cycle count estimates to speed up the DNN creation process even further. The accuracy of these analytical models is demonstrated by the fact that they provide cycle count estimates that are comparable to those generated by a cycle-accurate hardware simulator. Then, in the RL, they establish a reward function that includes a configurable parameter for configuring the tradeoff between the performance and accuracy of the generated DNNs. The study showed that APNAS could construct neural network models in 0.55 GPU days on an Nvidia GTX 1080Ti GPU, resulting in an average of 53% fewer cycles when compared to a manually developed neural network model (ResNet) and a state-of-the-art NAS. They generated CNNs by APNAS for two different image classification datasets (CIFAR-10 and CIFAR-100) that required 52.78% and 53.57% fewer cycles compared to a manually designed CNN.

Yuan et al. [75] propose hardware-oriented compression and hybrid quantization techniques to reduce the memory requirements of CNNs. They classified all layers as either "no-pruning layers (NP-layers)" or "pruning layers (P-layers)" based on their processing features. The former uses parallel computation for high performance with a regular weights distribution, while the latter has a high compression ratio but is asymmetric due to pruning. The approach aimed to balance compression ratio and processing efficiency while maintaining reasonable accuracy by using uniform and incremental quantization techniques, as well as a distributed convolutional architecture with multiple parallel finite impulse response (FIR) filters for the regular model in the NP-layers. They introduced a shift-accumulator-based processing element with activation-driven data flow (ADF) for handling the irregular sparse model in the P-layers. They also proposed a hardware/algorithm cooptimization (HACO) method based on the compression strategy and hardware architecture to implement an NP-P hybrid compressed CNN model on FPGAs. They implemented the compressed VGG-16 model on a Xilinx VCU118 evaluation board for image applications and achieved a compression ratio of 27.5x for a hardware accelerator on a single FPGA chip without the use of off-chip memory, processing 83.0 FPS.

Huang et al. [76] propose an FPGA-based CNN hardware accelerator design, which utilizes a row-level pipelined streaming technique to calculate CLs using a multicomputing engine (CE) architecture. They also presented a mapping mechanism to optimize the computational resource utilization ratio of the PE array, achieving up to 98.15%. Additionally, an effective data storage system was implemented to improve the work efficiency of the CE by continuously feeding input data. A weighted data allocation technique was proposed to reduce the need for off-chip bandwidth while sacrificing some on-chip storage capacity. The design was tested on an XC7VX980T FPGA, achieving 1 TOPS at 150 MHz, which is approximately 98.15% of the theoretical throughput. Moreover, a ResNet-101 accelerator was implemented, achieving 600 GOPS at 100 MHz with up to 96.12% throughput efficiency. Kim et al. [77] present an ASIC accelerator for deep CNNs that uses a novel conditional computing technique to significantly reduce the number of redundant computations and external memory accesses. Precision cascading (PC) is a novel conditional computing technique that reduces redundant convolution operations by combining subsequent max-pooling processes. In addition, combining precision cascading with zero-skipping greatly reduced energy and external memory access. For the VGG-16 CNN for ImageNet, the accelerator achieved peak/average energy efficiency of 8.85/1.22 TOPS/W at a voltage of 0.9 V, and low external memory access of 55.31 MB, or 0.0018 accesses/MAC. Cheng et al. [78] introduce a low-power sparse CNN accelerator featuring a preencoding radix-4 Booth multiplier. Leveraging the properties of the radix-4 Booth algorithm, the accelerator reduces the number and bit width of partial products (PPs) and the encoder power consumption. It incorporates an activation selector module that chooses activations corresponding to nonzero weights for subsequent multiply-add operations after offline encoding of nonzero weights. Additionally, it consolidates eight encoders from relevant multipliers into a single preencoding module to save area. The proposed work is developed using the Verilog HDL language and implemented in a 28 nm process. The proposed accelerator achieves a performance of 7.0325 TOPS/W with 50% sparsity and scales up to 14.3720 TOPS/W at 87.5% sparsity.

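To illustrate the property that radix-4 Booth designs such as [78] exploit, the sketch below performs radix-4 (modified) Booth recoding of a signed multiplier operand: an n-bit operand yields only n/2 signed digits in {-2, -1, 0, 1, 2}, so roughly half as many partial products are needed as in a simple shift-and-add multiplier. This is a generic textbook recoding written in Python for readability, not the preencoding circuit of the cited accelerator.

```python
def booth_radix4_digits(x, nbits):
    """Radix-4 (modified) Booth recoding of an nbits-wide two's-complement value.

    Returns nbits/2 signed digits d_i in {-2, -1, 0, 1, 2} with
    x == sum(d_i * 4**i), i.e., half as many partial products as bits.
    """
    assert nbits % 2 == 0
    bits = x & ((1 << nbits) - 1)          # two's-complement bit pattern
    digits = []
    prev = 0                               # implicit bit to the right of bit 0
    for i in range(0, nbits, 2):
        b0 = (bits >> i) & 1
        b1 = (bits >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)  # overlapping triplet -> one signed digit
        prev = b1
    return digits

# Sanity check: the recoded digits reproduce the original value.
for value in (-97, 55, -128, 127):
    d = booth_radix4_digits(value, 8)
    assert sum(di * 4**i for i, di in enumerate(d)) == value
    print(value, d)
```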
Additionally, the article discusses a 2-D filter processing unit capable of handling a class of filter-like and pointwise operations, striking a balance between design complexity and performance. The experiment demonstrates that the suggested architecture implemented on KU115 attains the highest peak performance and throughput on FPGAs. Furthermore, it operates at a comparable level to cutting-edge GPUs, but with a latency more than 50 times lower. When compared to the total cost of ownership, the FPGA improves server throughput by 149.2%, albeit with a modest 31.5% increase in costs.

Hwang et al. [80] propose GROW, a graph convolutional neural network (GCN) accelerator utilizing a row-stationary SpDeGEMM dataflow. Unlike previous SpDeGEMM accelerators, GROW combines software/hardware architecture codesign to minimize data movements, notably enhancing memory-bound SpDeGEMM performance. It employs a row-stationary dataflow based on Gustavson's algorithm, enabling flexible adaptation to heterogeneous sparsity patterns. Compared to GCNAX, GROW's dataflow significantly reduces memory bandwidth waste, particularly during the aggregation phase. While it introduces a more irregular reuse of dense matrices, GROW employs a graph partitioning algorithm to improve dataflow locality. This is coupled with a multirow stationary run-ahead execution model, enhancing memory-level parallelism and overall throughput.

C. Recurrent Neural Network

Fig. 8. Block diagram of RNN.

Fig. 9. Block diagram of LSTM.

RNNs are a particular class of neural network that function well for processing time series and other sequential data. An RNN has loops that allow information to be stored inside the network. Fig. 8 presents the general block diagram of the RNN. RNNs leverage prior knowledge to make predictions about future events, making them valuable tools for complex task performance by enabling sequence modeling of vectors within the API. An RNN can be viewed as a series of interconnected networks. RNNs are an extension of feedforward networks to handle variable-length sequences. Some of the most commonly used recurrent architectures include gated recurrent units (GRUs) and long short-term memory (LSTM). RNNs are often designed with a chain-like structure, making them well-suited for tasks in NLP such as language translation, speech recognition, sentiment analysis, and automatic speech recognition. However, traditional RNNs experienced poor network performance due to vanishing and exploding gradients in applications requiring long-term input data. LSTM is a variation of RNN that has been proposed to address this problem. However, LSTM introduces gating units and many extra parameters, making direct implementation difficult on resource-limited platforms like FPGAs. The LSTM block diagram comprises three gates and two activation functions, as shown in Fig. 9. The first gate, known as the "forget gate," is responsible for deciding which portions of the cell state should be disregarded. The next step involves determining which new data should be stored in the cell. The "input gate" determines the values that need updating, while the "tanh" function generates new potential values. The final layer is the output layer, which determines which data will be sent out. Each part's equation is calculated by the following equations [81]:

$f_t = \sigma(W_f[h_{t-1}, x_t] + b_f)$  (5)

$i_t = \sigma(W_i[h_{t-1}, x_t] + b_i)$  (6)

$o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)$  (7)

$\tilde{c}_t = \tanh(W_c[h_{t-1}, x_t] + b_c)$  (8)

$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$  (9)

$h_t = o_t \odot \tanh(c_t)$  (10)

where $f_t$ is the result of the forget gate, $i_t$ is the result of the input gate, and $o_t$ is the result of the output gate. The matrix multiplication $W_f[h_{t-1}, x_t] = W_h h_{t-1} + W_x x_t$. $h_t$ is the cell output, whereas $\tilde{c}_t$ and $c_t$ denote the new and final cell states, respectively. The weights of the forget gate, the input gate, and the output gate are, in order, $W_f$, $W_i$, and $W_o$. Biases are $b_f$ for the forget layer, $b_i$ for the input layer, and $b_o$ for the output layer. The symbol $\odot$ represents the elementwise (Hadamard) multiplication, $\sigma$ is the logistic sigmoid function, and $\tanh$ is the hyperbolic tangent function.
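For illustration, the following minimal NumPy sketch evaluates one LSTM cell step according to (5)-(10). The shapes, initialization, and the lstm_step helper are our own illustrative assumptions rather than part of any surveyed accelerator design.

```python
import numpy as np

def sigmoid(z):
    # Logistic sigmoid used by the forget, input, and output gates.
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following (5)-(10).

    W is a dict of gate weight matrices applied to [h_{t-1}, x_t];
    b is a dict of the corresponding bias vectors.
    """
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate, (5)
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate, (6)
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate, (7)
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate state, (8)
    c_t = f_t * c_prev + i_t * c_tilde       # new cell state, (9), elementwise
    h_t = o_t * np.tanh(c_t)                 # cell output, (10)
    return h_t, c_t

# Hypothetical sizes: 4 inputs, 8 hidden units.
rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
W = {k: rng.standard_normal((n_hid, n_hid + n_in)) * 0.1 for k in "fioc"}
b = {k: np.zeros(n_hid) for k in "fioc"}
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.standard_normal(n_in), h, c, W, b)
print(h.shape, c.shape)  # (8,) (8,)
```

The sketch makes explicit why LSTM is costly on resource-limited platforms: each step requires four matrix-vector products plus several elementwise operations, which is precisely what the accelerators below try to compress or prune.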


RNN has several hardware implementations for low cost [82], [83], [84]. Khalil et al. [28] propose economic LSTM (ELSTM), a new LSTM architecture requiring only a single gate. ELSTM requires fewer processing units. Using a single gate, the proposed method can retain memory and learn data sequences. The proposed method offers the advantage of faster learning by utilizing fewer computation units. The accuracy of the proposed method is comparable to that of other methods, while having fewer units. Compared to LSTM, the proposed method offers a reduction in area and power consumption by 34% and 35%, respectively. This design exhibits notable attractiveness for low-cost hardware applications. The method proposed in this study is evaluated using three datasets: ImageNet, IMDB, and MNIST. The testing and implementation of the proposed method are performed using an Altera Arria 10 GX FPGA. Wu et al. [85] introduce an energy-efficient scalable processor that leverages the data locality of compressed RNNs. By eliminating redundant connections and sharing quantized values among several weights, the RNN models are significantly compressed. Adopting the quantified sparse matrix encoding significantly reduces repeated calculations and memory operations. Both approaches ensure the suggested design has a high level of energy efficiency. The scalable architecture and network cross-division approach enable hardware parallelism and flexibility. More than 80% of the weight fetching and matrix-vector multiplications for applications like natural language and keyword spotting can be further decreased when using compressed RNNs compared to traditional processors. The peak energy efficiency reaches 3.89 GOPS/mW. It achieves a peak performance of 24 GOPS and dissipates 6.16 mW of power with a 1.1 V supply at 200 MHz.

Kadetotad et al. [86] propose an LSTM RNN accelerator based on an algorithm-hardware cooptimized memory compression method known as hierarchical coarse-grain sparsity (HCGS). HCGS offers considerable compression (16x) of LSTM weights with gentle error rate degradation while minimizing index memory cost. The suggested LSTM accelerator utilizes a combination of hierarchical blockwise sparsity and low-precision quantization to store the compressed weights of LSTMs consisting of three layers and 512 cells in only 288 kB of on-chip SRAM. This method effectively reduces the necessary computation by up to 16 times. The prototype chip, fabricated using 65-nm LP CMOS technology, achieves a remarkable energy efficiency of up to 8.93 TOPS/W for real-time speech recognition. Experimental evaluations conducted on the TIMIT, TED-LIUM, and LibriSpeech datasets provide solid evidence of the effectiveness and suitability of HCGS across multiple LSTM RNNs. Nan et al. [87] present a hybrid-iterative compression (HIC) technique for LSTM/GRU, which separates gating units into error-sensitive and error-insensitive groups and compresses them using different techniques, leveraging the error sensitivity of RNNs. Additionally, an energy-efficient accelerator for bidirectional RNNs is proposed. In this accelerator, weights are rearranged to optimize data flow in the matrix operation unit based on the block structure matrix (MOU-S). A fine-grained parallelism configuration of matrix-vector multiplications (MVMs) is used to improve BRAM utilization. The challenge of load imbalance between MOU-S and the matrix operation unit based on top-k pruning (MOU-P) is effectively addressed through the implementation of the timing matching technique. The architecture of the compressed LSTM/GRU, as proposed, has been thoroughly assessed on the Xilinx ADM-PCIE-7V3 platform. Gao et al. [88] propose EdgeDRNN, a GRU-based RNN accelerator optimized for low-latency edge RNN inference with a batch size of 1 while maintaining a lightweight design. To utilize temporal sparsity in RNNs, EdgeDRNN employs a delta network technique inspired by spiking algorithms. The weight storage of EdgeDRNN is implemented using low-cost off-chip DRAM, and it employs temporal sparsity to decrease memory bandwidth requirements during RNN updates. By employing sparse updates, the memory access to DRAM weights can be reduced by a factor of up to 10. Furthermore, the delta value can be dynamically adjusted to strike a balance between latency and accuracy requirements. This helps optimize EdgeDRNN for efficient edge RNN inference with low latency.

Shan et al. [89] introduce dynamic recurrent routing neural networks (DRRNets) as a solution to typical RNN problems such as complicated dependencies and gradient vanishing. The suggested DRRNets use the routing pointer matrix's low-rank attribute to construct adaptive routes for diverse dependencies and drastically decrease redundant parameters by discovering low-rank approximations for fully connected layers based on the inner structure of the cell state. The article contains an optimization algorithm for training the network and assesses the model's performance in a variety of tasks, including image classification, language modeling, and speaker recognition. Chen et al. [90] introduce a specialized hardware accelerator called "Eciton" designed for implementing LSTM neural networks. Eciton showcases the ability to conduct real-time inference for LSTM neural network models of practical size, all while operating within a power constraint of 17 mW. In comparison to FPGA implementations that demand higher power consumption, Eciton delivers competitive performance. This is achieved through the utilization of 8-b fixed-point weight quantization, hard sigmoid activation functions, and a meticulously optimized microarchitecture, effectively minimizing chip resource and memory demands. Although these quantization techniques lead to a slight accuracy reduction of approximately 5% when assessed on real-world predictive maintenance LSTM models consisting of 3 to 4 layers, the advantage of low resource requisites permits Eciton to be accommodated within a cost-effective, low-power Lattice iCE40 UP5K FPGA.
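As a concrete illustration of the temporal sparsity exploited by delta-network accelerators such as EdgeDRNN [88], the following NumPy sketch updates a matrix-vector product only for inputs whose change since the last update exceeds a delta threshold. The function name, threshold, and shapes are our own illustrative assumptions; this is not the EdgeDRNN implementation.

```python
import numpy as np

def delta_matvec(W, x_t, x_ref, y_prev, delta=0.1):
    """Incrementally update y = W @ x using only inputs that changed by more
    than `delta` since the last update (temporal sparsity).

    Returns the updated output, the new stored reference input, and the
    fraction of W's columns that had to be fetched (a proxy for memory traffic).
    """
    change = x_t - x_ref
    active = np.abs(change) > delta          # only these columns are recomputed
    y_t = y_prev + W[:, active] @ change[active]
    x_new_ref = x_ref.copy()
    x_new_ref[active] = x_t[active]          # update stored values only where they fired
    return y_t, x_new_ref, active.mean()

rng = np.random.default_rng(1)
W = rng.standard_normal((16, 64))
x_ref = rng.standard_normal(64)
y = W @ x_ref                                 # dense initial pass
x_t = x_ref + rng.normal(0.0, 0.05, 64)       # slowly varying input
y, x_ref, frac = delta_matvec(W, x_t, x_ref, y, delta=0.1)
print(f"columns fetched: {frac:.0%}")         # typically well below 100%
print(np.allclose(y, W @ x_ref))              # exact with respect to the stored reference
```

Because only the active columns of W are fetched, a slowly varying input stream translates directly into fewer DRAM weight accesses, which is the effect the delta threshold trades against accuracy.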

D. Transformer-Based and Diffusion-Based Models

Transformer-based models have gained a significant amount of attention in recent years because of their outstanding results on NLP problems. The transformer architecture was first described in [91] by Vaswani et al. It uses self-attention mechanisms to capture dependencies between different input data elements, enabling parallel processing of sequences and reducing the sequential nature of conventional RNNs. For tasks such as accelerator optimization, automatic machine learning, and compiler optimization [92], transformer-based models can be applied. Diffusion-based models are a type of probabilistic model that propagates information across data points through repetitive processes. These models have found use in a variety of domains, such as image denoising, data imputation and generation tasks, data-driven accelerator design, and neural architecture search [93], [94]. Zhao et al. [95] introduce a transformer accelerator utilizing an output block stationary (OBS) dataflow to optimize memory access and improve DSP utilization, resulting in higher energy efficiency. By minimizing repeated memory access and employing block-level and vector-level broadcasting, the accelerator achieves reduced memory access bandwidth for input and output. The FPGA-based verification of the proposed accelerator demonstrates impressive performance, with a throughput of 728.3 GOPs and an energy efficiency of 58.31 GOPs/W when evaluating a transformer-in-transformer (TNT) model. Cheng et al. [96] present a novel transformer-based model proposed for signal detection in a multiuser molecular communication (MMC) system. The model is trained using received data generated with varying initial distances between transmitters and receivers. The numerical results demonstrate that the trained transformer-based model exhibits excellent convergence and outperforms the traditional DNN in terms of signal detection, achieving a lower bit error rate.

E. Large Language Models (LLMs)

An LLM accelerator is a specialized hardware or software component designed to enhance the performance of LLMs in NLP tasks. LLMs, such as OpenAI's GPT-3 and BERT, have demonstrated remarkable capabilities in understanding and generating human-like text, but they come with substantial computational requirements, making them resource-intensive and time-consuming to run on standard hardware [97]. These accelerators leverage techniques like parallel processing, optimized memory access, and specialized circuit designs to improve the overall efficiency of language model computations. The LLM accelerator has become crucial in a wide range of applications, including chatbots, language translation, text summarization, and sentiment analysis [98]. Maddigan and Susnjak [99] introduce an innovative system called Chat2VIS, which harnesses the capabilities of LLMs. Through effective prompt engineering, Chat2VIS demonstrates a more efficient solution for language understanding, resulting in simpler and more accurate end-to-end outcomes compared to previous methods. The research reveals that Chat2VIS, utilizing LLMs and proposed prompts, offers a reliable approach to generating visualizations from natural language queries, even when queries are imprecise or insufficiently specified. Moreover, this solution significantly reduces development costs for Natural Language Interface systems while achieving superior visualization inference abilities compared to traditional NLP approaches that rely on hand-crafted grammar rules and tailored models.

F. Performance Comparison of Different Methods

Table I provides a comprehensive summary of various machine learning hardware accelerators, highlighting their key features, performance metrics, and targeted applications. It aims to offer an overview of the latest advancements in ML hardware acceleration, assisting researchers, developers, and technology enthusiasts in understanding the landscape of available solutions and their respective strengths. By analyzing the characteristics and capabilities of different accelerators, readers can make informed decisions regarding the most suitable hardware for their specific ML requirements.

VI. EVALUATION

An evaluation of a machine learning accelerator is significant for any design for validation. The evaluation is divided into training and testing and hardware evaluations. Each evaluation parameter is described as follows.

A. Training and Testing Evaluation

In machine learning classification models, performance measures are used to evaluate how well the models perform in a specific context. This evaluation helps to improve machine learning classification models. Some of the performance metrics are accuracy, sensitivity, specificity, precision, F1 score (tension), and loss function. Model performance is critical for machine learning because it allows us to understand the strengths and limitations of these models when making predictions in new situations. True positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs) are commonly used performance measures for evaluating the performance of classification models. TP refers to the number of correctly predicted positive cases, while TN is the number of correctly predicted negative cases. FPs are the number of negative cases that were incorrectly predicted as positive, and FNs are the number of positive cases that were incorrectly predicted as negative. These metrics are typically used to calculate other performance measures, such as accuracy, precision, recall, and F1 score [66], [100]. The evaluation parameters are given by (11)-(20).

1) Accuracy: A test's accuracy is measured by its ability to differentiate classes accurately [55], [66], [100]. It indicates the quality of the result for a given task. Accuracy can be calculated using the following equation:

$\mathrm{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$  (11)

More complex DNN models typically require more computations and more memory resources to process the input data, which can lead to slower processing times and higher resource utilization. This is especially true for hardware implementations of DNNs, where the processing capabilities and resources are more limited compared to software implementations. As a result, there is often a tradeoff between model complexity, accuracy, hardware performance, and efficiency [43].

2) Sensitivity: It can be defined as a TP rate that measures the ratio between the number of classes that were correctly identified and the total number of TPs and FNs [55], [66], [100]. Sensitivity is given by the following equation:

$\mathrm{Sensitivity} = \dfrac{TP}{TP + FN}$  (12)

3) Precision: It is a positive predictive value. It measures the proportion of TP predictions to total positive predictions produced by the model. A high precision indicates that the model is good at avoiding FPs [55], [66], [100]. Precision is given by the following equation:

$\mathrm{Precision} = \dfrac{TP}{TP + FP}$  (13)


TABLE I
SUMMARY OF MACHINE LEARNING AND DEEP LEARNING HARDWARE ACCELERATORS

Ref. | Method | Power | Area | Performance | Dataset | Accuracy | Hardware Device
A fully connected feedforward DNN with a customizable
number of layers, neurons per layer, and inputs are per-
formed by the neural network architecture using just one
physical processing layer.
1) Advantages: Adequate recognition performance can be
achieved with relatively modest network sizes, resulting
in an increased performance while consuming fewer 15.90 Thousand handwritten
[54] N/A N/A MNIST 98.16% FPGA
hardware resources and power. image kFPS
2) Limitations: Compared to other related works, the per-
formance of floating-point DNN in this architecture
on The MNIST dataset exhibits a comparatively lower
when compared to fixed-point and binary-based neural
networks.

An optimized hardware approach for CNNs for embedded


vision systems.
1) Advantages: The presented design exhibits superior per-
formance with significantly fewer DSP48 and BRAM
resources and only half of the LUTs in comparison to
previous CNN implementations. This accelerator oper-
[69] ates at a frequency of 160 MHz and consumes 1.77 1.77 W N/A 40.96 GOPs AlexNet N/A FPGA
watts of power, achieving a performance of 40.96 GOP/s
while utilizing only 134 processing units and 601 KB
of internal memory.
2) Limitations: This architecture has less performance, and
energy efficiency compared to other related works.

Adaptive hardware architecture for neural-network-on-chip.


1) Advantages: With the MNIST dataset, the proposed
method achieves an accuracy of 98.18% and can
construct networks of various sizes to suit different
13% MNIST and
[56] applications. 0.97 W N/A 98.18% FPGA
reduction CIFAR
2) Limitations: The proposed method incurs an area over-
head of 13% when compared to the state-of-the-art
method.

Designing a novel hardware implementation AAD pooling


for a CNN accelerator.
1) Advantages: AAD pooling technique, which takes into
EEG,
account pixel variations to obtain a highly accurate
ImageNet,
representation. The hardware implementation achieves
Common
excellent separability and high precision. The AAD 244.46
[66] 0.31 mW N/A Objects in 98.51% FPGA
pooling achieved 98.51% accuracy without increasing nm2
Context
computational complexity.
(COCO), and
2) Limitations: Compared to the max and average methods,
USPS
the proposed method incurs only a minor increase in
power consumption and execution time.

Reconfigurable hardware design approach for economic neu-


ral network.
1) Advantages: The suggested approach’s hardware struc-
ture consists of a neural network with fewer layers than
the state-of-the-art method, resulting in a 41% reduction
in area while preserving performance efficiency. Further-
MNIST,
more, the suggested method allows for the configuration 41%
[53] 44 mW N/A CIFAR-10, 96.8% FPGA
and updating of the number of layers in the on-chip reduction
and USPS
design. It can be easily adapted for complex speech
recognition and image classification problems.
2) Limitations: The power consumption of the proposed
method is 44 mW, which is slightly higher (9%) than
the power consumption of the state-of-the-art methods.

Fast hardware-aware neural architecture search


methodology.
1) Advantages: The performance of this model exceeds that 0.6% higher
of the other models, achieving 0.6% higher accuracy accuracy
14% faster than EfficientNet- CPU and
[71] and 14 N/A N/A ImageNet than
B2 GPU
2) Limitations: This design has a higher number of pa- EfficientNet-
rameters and a larger number of FLOPS compared to B2.
EfficientNet-B1.

Full-stack acceleration of deep CNNs.


1) Advantages: The proposed method achieves a high level
VGG16:
of performance with over 1.3 TOP/s throughput and up
17.2W 1.3–1.59 TOP/s throughputs VGG16,
to 97% computation efficiency.
[72] ResNet-50: N/A and up to 97% computation ResNet-50, N/A FPGA
2) Limitations: The proposed method exhibits high re-
19.1W U-Net: efficiency and U-Net
source utilization and compute density, which may lead
21.5 W
to a reduced working clock frequency.

(Continued)


TABLE I
(Continued.) SUMMARY OF MACHINE LEARNING HARDWARE ACCELERATORS

Systolic parallel hardware architecture for the FPGA accel-


eration of FFNNs.
1) Advantages: The proposed architecture enables the im-
plementation of a multilayer FFNN with up to 3600
neurons per layer on a single chip, without the need for
external memory, achieving a maximum performance of
Iris, MNIST,
1980 GOPS. The architecture is designed in a way that it
[55] N/A N/A 1980 GOPS and MIT- 98.53% FPGA
can be easily scaled to larger capacity devices or multi-
BIH&AHA
chip configurations using a simple NPE ring extension.
The proposed architecture can adopt any type of FFNN.
2) Limitations: Compared to some related works, the pro-
posed architecture employs a higher number of DSP
blocks and registers.

Design and implementation of CNNs accelerator based on


multidie.
1) Advantages: At a clock frequency of 200 MHz, the
accelerator architecture can achieve a peak throughput
of 2.76 TOPS and a frame rate of 148.14 FPS. The
1) 148.14 FPS
proposed method has achieved significant improvements
2) A peak throughput of PASCAL
[67] in performance, with a threefold increase, and energy 12.689 W N/A N/A FPGA
2.76 TOPS VOC
efficiency, with a ratio of 93.15 GOPS/W, resulting in
excellent real-time target detection results.
2) Limitations: The proposed method exhibits comparable
power consumption to existing accelerators designed for
the YOLOv4-tiny algorithm.

CNN image classification model.


1) Advantages: This model improves the image classifica-
tion accuracy by 14.52% using the T-ReLU activation
function.
2) Limitations: The model training requires much time.
It also requires additional learning and improvement.
[65] To improve accuracy, additional fully connected layers N/A N/A N/A CIFAR-10 76.52% N/A
can be added to the network or the number of nodes
in the existing fully connected layer can be increased.
However, the current paper only utilizes a single fully
connected layer with 128 nodes and evaluates the net-
work’s performance on the small CIFAR-10 dataset.

Batch normalization processor design for CNN training and


inference.
1) Advantages: This method eliminates the need for a
divider and reduces the area required for the original
divider. Also, It can finish normalizing each data set
in a single cycle.The proposed architecture offers a 40.13%
[68] N/A N/A MNIST N/A FPGA
significant reduction in the time required for the orig- reduction
inal division operation while also achieving a 40.13%
reduction in the total area.
2) Limitations: It is not tested using datasets for accuracy
verification.

Accelerator structure with a double buffer memory access


structure significantly improves memory access efficiency. 1) Detection speed of 15
1) Advantages: The design proposed in this article ac- FPS when the input image
celerates the processing by approximately 1.4 times is 3 × 160 × 320
compared to the original design, as measured by the 2) It achieves an acceler-
[73] N/A N/A SkyNet N/A FPGA
number of cycles required to complete the task. ation effect of approx-
2) Limitations: The proposed design shows an increase in imately 1.4 times the
resource utilization for BRAM, LUT, and FF compared original design
to the original design.

Multi-channel parallel processing across layers and a


pipeline design make the neural network acceleration archi-
tecture efficient, scalable, low-latency, and low-error.
1) Advantages: The proposed neural network acceleration
architecture results in low latency, high accuracy, and
scalability. With a clock frequency of 200 M, it can
process 28 × 28 handwritten images in only 25.9 us, and It takes 25.9 us to calculate
[70] the DSP consumption is reduced through a reasonable N/A N/A 28 × 28 handwritten images MNIST 98.43% FPGA
multiplex method. The network achieves a 98.43% ac- at a clock frequency of 200 M
curacy rate with minimal errors, making it suitable for
a wide range of applications.
2) Limitations: Compared to other previous CNN imple-
mentation methods, the utilization of FF in this method
is higher and requires more resources.

(Continued)


TABLE I
(Continued.) SUMMARY OF MACHINE LEARNING HARDWARE ACCELERATORS

A novel RNN processor has been proposed that prioritizes


energy efficiency by leveraging data locality and network
compression techniques through a new quantified sparse
matrix encoding format. 1) The peak performance is THCHS-
1) Advantages: The proposed processor design utilizes the 24 GOPS 30 Chinese
network cross-division method, allowing for a high 2) The peak energy effi- speech
[85] 6.16 mW 0.65 mm2 ciency reaches 3.89
N/A ASIC
degree of flexibility and parallelism in handling various corpus and
sizes of embedded RNN applications while maintaining GOPS/mW a command
scalability. word library
2) Limitations: The number of MACs and SRAM units in
this design exceeds that of some related works.

20.6%
An energy-efficient LSTM RNN accelerator with hi- PER for
erarchical coarse-grain sparsity memory compression, TIMIT,
1) 8.93 TOPS/W for two-
an algorithm-hardware cooptimized memory compression 21.3%
layer LSTM for TIMIT
method (HCGS). WER for
data set TIMIT, TED-
1) Advantages: Comparing the hierarchical blockwise spar- TED-
2 2) 7.22/7.24 TOPS/W for LIUM, and
[86] sity technique to earlier research shows advantageous 67.3/1.85 mW 7.74 mm LIUM, N/A
three-layer LSTM for LibriSpeech
error rate and memory compression tradeoffs. It has a and
TED-LIUM/LibriSpeech data sets
high MAC efficiency reaching 99.66%. 11.4%
data sets
2) Limitations: It has higher power consumption than some WER
existing methods. for Lib-
riSpeech
data sets.
The new DNN design framework APNAS emphasizes accu-
racy and efficiency during neural architecture search.
1) Advantages: APNAS is capable of generating DNNs
with fewer parameters (i.e., cycle count) while main-
taining relatively high accuracy compared to state-of-
It offers an average of 53% CIFAR-10
the-art NAS techniques. This is achieved by adjusting
[74] N/A N/A fewer cycles than state-of-the- and CIFAR- 93.75% FPGA
the weight of the RNN to account for cycle count,
art techniques 100
allowing APNAS to successfully trade off accuracy and
cycle count.
2) Limitations: This model has less accuracy than other
state-of-the-art NAS techniques.

Hardware-oriented compression and hybrid quantization


techniques require less memory for CNN accelerator.
1) Advantages: The compressed VGG-16 architecture pro-
posed in the article achieves high performance without
the need for off-chip memory. It can process images 83.0 FPS, and a compression
VGG-16,
at a rate of 83.0 FPS while maintaining the same ratio of 27.5x is achieved for a
ResNet-50,
[75] level of accuracy. The hardware design can be used N/A N/A hardware accelerator on a sin- N/A GPU
and ResNet-
in various real-time image processing applications that gle FPGA chip without using
152
have limited resources. off-chip memory
2) Limitations: The proposed method exhibits a slightly
lower accuracy compared to other state-of-the-art FPGA
designs, also it comes with high resource utilization.

A novel multi-CE-based row-level pipelined streaming 1) The work efficiency of the


method for calculating CLs. PE Array is up to 99.83%
1) Advantages: The PE Array exhibits exceptional work 2) The VGG16 accelerator
efficiency, reaching up to 99.83%, with a remarkable on XC7VX980T FPGA,
resource utilization ratio of 98.15%. The VGG16 and achieves 1 TOPS at VGG-16 and
[76] ResNet-101 accelerators, which utilize this PE Array, 14.36 mW N/A 150 MHz N/A FPGA
ResNet-101
achieve throughput efficiencies of 98.15% and 96.12%, 3) A ResNet-101 accelera-
respectively, outperforming other existing works. tor is also implemented,
2) Limitations: It has a higher number of multiplication achieving 600 GOPS at
than the state-of-the-art methods. 100 MHZ

ASIC accelerator for deep CNNs. 1) The peak energy effi-


1) Advantages: This work demonstrates significant im- ciency of 8.85 TOPS/W at
provements in DCNN inference throughput and overall 0.9V supply
system-level energy efficiency, which includes both the 2) Low external memory
[77] 203 mW N/A access of 55.31 MB (or ImageNet N/A ASIC
accelerator chip and off-chip memory.
2) Limitations: It has higher on-chip SRAM than the state- 0.0018 access/MAC) for
of-the-art methods. ImageNet classification
with VGG-16 CNN

A HIC technique for LSTM/GRU. 1) The improvement in


1) Advantages: This method achieves a significant com- the energy efficiency of
pression ratio of 37.1/32.3 for LSTM/GRU with neg- the LSTM networks is
GRU: 14.906 (5%–237%)
ligible accuracy loss. It also shows improved energy
W Vanilla 2) A 58% improvement in DeepSpeech2
[87] efficiency for LSTM networks (5%–237%) and a 58% N/A N/A FPGA
LSTM: 17.106 GRU networks. (AN4)
improvement for GRU networks.
W 3) It achieves 379 507 FPS
2) Limitations: It has lower FPS computation than the
state-of-the-art methods. 4) It achieves a latency of
2.635 μs.

(Continued)


TABLE I
(Continued.) SUMMARY OF MACHINE LEARNING HARDWARE ACCELERATORS

ELSTM architecture, which only requires a single gate,


is more suitable for low-cost hardware design due to its
simplicity in terms of components. The gate is responsible
for data deletion and updating, and its output is fed to
three components: the memory layer, the update layer, and
The proposed method has MNIST,
the output layer. ELSTM requires fewer processing units 34%
[28] 1.192 W a latency of 23 ms and a IMDB, and 90.89% FPGA
compared to traditional LSTM architectures. reduction
throughput of 258.4 MOPS ImageNet
Advantages: ELSTM has a low error and faster training,
allowing it to achieve high accuracy faster. Compared to
LSTM, the proposed method saves 34% of area and 35%
of power consumption.

NeuronLink is a neural network accelerator communication


mechanism that combines intrachip and interchip communi-
cation techniques.
Neuron Link-based DNN
1) Advantages: The neuron link-based DNN accelerator
accelerator outperforms the
proposed in this work outperforms previous NoC-based
previous NoC-based DNN
DNN accelerators in terms of power efficiency, with a ResNet-18
[57] 15.4 W 41.2 mm2 accelerators 1.27×–1.34× N/A FPGA
1.27×–1.34× improvement, and area efficiency, with a and AlexNet
in terms of power efficiency
2.01×–9.12× improvement.
and 2.01×–9.12× in terms
2) Limitations: The design requires a significant number of
of area efficiency
FIFOs to store various packets with different priorities
and routers, which increases the demand for BRAMs.

EdgeDRNN is a RNN accelerator optimized for low-latency


edge inference and based on a lightweight implementation
of GRU networks. An effective throughput is
1) Advantages: EdgeDRNN achieves an average effec- 20.2 GOp/s and a wall plug TIDIGITS
[88] tive throughput of 20.2GOp/s and a wall plug power 2.3 W N/A power efficiency is over 4X and 99% FPGA
efficiency over 4X higher than the commercial edge higher than the commercial SensorsGas
platforms of AI. edge AI platforms
2) Limitations: It uses off-chip weigh storage.

A low-power sparse CNN accelerator featuring a pre-


encoding radix-4 Booth multiplier.
1) 7.0325 TOPS/W at 50%
1) Advantages:This design surpasses others in terms of
sparsity
both area and energy efficiency. 0.4839 VGG16 and
[78] 160.17 mW 2) 14.3720 TOPS/W at N/A FPGA
2) Limitations: The accelerator struggles to achieve high mm2 AlexNet
87.5%
MAC utilization in small-size convolutional layers and
fully connected layers.

DRRNets are proposed as a remedy for traditional


RNN challenges like intricate dependencies and gradient
vanishing.
Advantages: DRRNets utilize fewer parameters than other
MNIST and CPU and
[89] models in the same domain. This model is the first to use N/A N/A N/A 99%
CIFAR10 GPU
the low-rank property in terms of input structure, and both
the decoupling and low-rank features are imposed on fully
connected layers.

1-/1-b
VGG-9:
83.20%
A PIMCA for DNN inference with low precision (1–2 b).
1-/1-b
Advantages: By employing this method, a significant reduc- The peak energy efficiency
ResNet-
tion of up to 73% in the total program size is achieved, of 437 TOPS/W and a peak
[58] 124mW 20.9 mm2 CIFAR-10 18: FPGA
resulting in fewer cycle counts and ultimately leading to throughput of 49 TOPS at 42-
83.48%
improved energy efficiency. MHz clock frequency
2-/2-b
ResNet-
18:
86.48%
A transformer accelerator utilizing an OBS dataflow resulting
in higher energy efficiency.
Advantages: The proposed OBS dataflow reduces the power Throughput of 728.3 GOPs
[95] consumption of the BRAM, which leads to an overall power 80 mW N/A and an energy efficiency of ImageNet 79.5% FPGA
reduction of 33%. OBS also lowers the input reading and 58.31 GOPs/W
output writing bandwidth.

4) Specificity: It calculates the percentage of actual negative cases that the classifier correctly classifies as negative. It is often referred to as the TN rate [30], [34]. Specificity is calculated by the following equation:

$\mathrm{Specificity} = \dfrac{TN}{TN + FP}$  (14)

5) F1 Score or Tension: It measures the balanced relationship between sensitivity and precision [34]. F1 Score is calculated by the following equation:

$\mathrm{F1\ Score\ (Tension)} = \dfrac{2 \times \mathrm{Sensitivity} \times \mathrm{Precision}}{\mathrm{Sensitivity} + \mathrm{Precision}}$  (15)
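Specificity (14) and the F1 score (15) can be computed from the same confusion-matrix counts as the earlier metrics; the brief, self-contained sketch below (again with our own illustrative helper names) completes the set.

```python
def specificity(tn, fp):                  # (14): TN rate
    return tn / (tn + fp)

def f1_score(tp, fp, fn):                 # (15): balance of sensitivity and precision
    sens = tp / (tp + fn)
    prec = tp / (tp + fp)
    return 2 * sens * prec / (sens + prec)

# Using the counts TP=3, TN=3, FP=1, FN=1 from the earlier toy example:
print(specificity(tn=3, fp=1))      # 0.75
print(f1_score(tp=3, fp=1, fn=1))   # 0.75
```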


6) Loss Function: The evaluation of how well an algorithm models a dataset involves a mathematical function that depends on the machine learning algorithm's parameters. This function, known as a loss function, plays a crucial role in the training process and the results obtained from any deep learning methodology. Loss functions are typically categorized as either regression loss or classification loss. Regression loss functions, used in regression neural networks that predict an output value from an input value rather than preselected labels, include mean squared error (MSE) and mean absolute error (MAE). On the other hand, classification neural networks use classification loss functions, which allow selecting the category with the highest probability of the input belonging to it, such as binary cross-entropy and categorical cross-entropy. Each one is described as follows.

1) MSE: It is also known as L2 loss. MSE calculates the average of the squared differences between the predicted and actual values across the entire dataset. MSE is calculated as follows:

$\mathrm{MSE} = \dfrac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$  (16)

MSE is sensitive toward outliers; given multiple examples with the same input feature values, the ideal prediction is the mean target value. This function is ideal for calculating loss due to its many features. The difference is squared, thus the predicted value might be above or below the target value, but big errors are penalized. MSE is a convex function with a global minimum, making gradient descent optimization easier to use to select weight values.

2) MAE: MAE is also known as L1 loss. MAE represents the difference between the target and predicted values, extracted by averaging the absolute difference over the dataset. MAE is calculated as follows:

$\mathrm{MAE} = \dfrac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$  (17)

MAE is a robust metric that is not significantly affected by outliers. In cases where multiple samples have the same input feature values, MAE chooses the median target value as the best prediction. Compare this to MSE, where the mean represents the best prediction. MAE's limitation is that its gradient magnitude depends only on the sign of the difference between the predicted and actual values, not the error size. This results in large gradient magnitudes even for small errors, which can lead to convergence problems. Because of this, a loss function known as the Huber loss was developed. This loss function combines the benefits of MSE and MAE into a single package. We can define it using the following function:

$\mathrm{Huber\ Loss} = \begin{cases} \dfrac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, & \text{if } |y_i - \hat{y}_i| \le \delta \\ \dfrac{1}{n}\sum_{i=1}^{n}\delta\left(|y_i - \hat{y}_i| - \dfrac{1}{2}\delta\right), & \text{if } |y_i - \hat{y}_i| > \delta \end{cases}$  (18)

In (18), the delta hyperparameter (δ) defines the range for MAE and MSE.

3) Binary cross-entropy (log loss): Cross-entropy loss is also called logarithmic loss, log loss, or logistic loss. This is the loss function used in binary classification models, which take in an input and should classify it into one of two predefined categories. Classification neural networks output a vector of probabilities, the probability that the input fits into each preset category, and pick the category with the highest probability as the final output:

$\mathrm{CE\ Loss} = -\dfrac{1}{n}\sum_{i=1}^{N}\left(y_i \cdot \log(p_i) + (1 - y_i)\cdot \log(1 - p_i)\right)$  (19)

In binary classification, the actual value of y can only be 0 or 1. To accurately determine the loss between actual and predicted values, it is necessary to compare the actual value (0 or 1) to the probability that the input aligns with that category [p(i) = probability that the category is 1; 1 − p(i) = probability that the category is 0].

4) Categorical cross-entropy: In multiclass classification tasks, where an example can only belong to one of several possible categories, categorical cross-entropy is commonly used. This function is designed to measure the difference between two probability distributions. We use categorical cross-entropy when the number of classes is more than two. Binary cross-entropy is a special case of categorical cross-entropy, where M = 2, and M is the number of categories:

$\mathrm{CE} = -\dfrac{1}{n}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\cdot \log(p_{ij})$  (20)
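The losses in (16)-(20) translate directly into a few lines of NumPy. The sketch below is an illustration under our own naming, with a small epsilon clip added to avoid log(0); it is not a reference implementation from the surveyed works.

```python
import numpy as np

def mse(y, y_hat):                        # (16)
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):                        # (17)
    return np.mean(np.abs(y - y_hat))

def huber(y, y_hat, delta=1.0):           # (18) as written above: quadratic for small
    err = np.abs(y - y_hat)               # errors, linear for large ones
    quad = err ** 2
    lin = delta * (err - 0.5 * delta)
    return np.mean(np.where(err <= delta, quad, lin))

def binary_cross_entropy(y, p, eps=1e-12):    # (19)
    p = np.clip(p, eps, 1 - eps)              # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def categorical_cross_entropy(Y, P, eps=1e-12):   # (20); Y and P are N x M
    P = np.clip(P, eps, 1.0)
    return -np.mean(np.sum(Y * np.log(P), axis=1))

y, y_hat = np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.8, 3.5])
print(mse(y, y_hat), mae(y, y_hat), huber(y, y_hat))
print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))
Y = np.eye(3)
P = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]])
print(categorical_cross_entropy(Y, P))
```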


B. Hardware Evaluation

Various metrics can be used to evaluate hardware systems, such as power consumption, area, throughput, and latency. These metrics are helpful in comparing and assessing the advantages and limitations of different designs.

1) Energy Efficiency and Power Consumption: The efficiency of energy usage is a measure of the amount of data that can be processed or the number of tasks that can be performed per unit of energy. This is particularly important when processing DNNs on embedded devices at the edge. Power consumption is the amount of energy consumed during a given period. The thermal design power (TDP) is a design criterion that determines the maximum power consumption, which is the amount of power that the cooling system can dissipate due to increased power consumption.

2) Area: The size of each PE and the total area cost of the system together determine the optimal number of PEs. If the area cost of the system remains the same, increasing the number of PEs will necessitate either decreasing the amount of space required for each PE or exchanging some of the on-chip storage areas for additional PEs. However, decreasing the amount of storage on-chip can have an impact on how PEs are utilized. The area per PE can also be reduced by reducing the logic needed to send operands to a MAC [43].

3) Throughput: Throughput refers to the amount of data that can be transmitted or processed within a specific time frame. It is a key performance metric used to evaluate the efficiency and performance of network connections or data processing systems, as it indicates how many packets or messages can successfully reach their destination. Throughput is commonly measured in bits per second (bps) and is often expressed in units of megabits per second (Mbps) or gigabits per second (Gbps). A higher throughput indicates a more efficient network or system, while a lower throughput can indicate performance issues or bottlenecks [32], [44].

4) Latency: It indicates how long it takes for packets to reach their destination. In a network, throughput and latency are directly related. Applications that require real-time interaction, such as augmented reality, autonomous navigation, and robotics, require low latency in order to work correctly. Throughput and latency frequently trade off against each other, because the maximum throughput of a conversation is determined by the level of latency. Conversations are data exchanges from one point to another. Thus, depending on the approach, achieving high throughput and low latency simultaneously can sometimes be incompatible, and both metrics should be reported [43]. Latency is measured in milliseconds (ms).

Fig. 10. Performance analysis of the existing methods.

5) Analysis: Our AI accelerator survey begins with power usage and throughput comparisons. In Fig. 10, a comprehensive examination of power consumption, quantified in watts, is juxtaposed against the frequency of operations executed per second, measured in giga operations per second (GOPS). As part of our investigation, we derived throughput figures by the multiplication of power and power efficiency for specific articles. The observed trend reveals that contemporary accelerators predominantly align with the throughput trendline situated at 1 TOPS. Notably, accelerators with a low-power design exhibit a discernible pattern: their power consumption typically exceeds the threshold of 0.1 watts, while simultaneously showcasing a throughput surpassing 1 GOPS. It is worth highlighting that only a limited number of accelerators fall beneath these specified benchmarks.

Fig. 11 presents each accelerator's power efficiency together with the year that it was first published. We calculate the power efficiency of those articles that did not report it by dividing throughput by power. In the previous two years, the power efficiency of AI accelerators has ranged from a minimum of more than 50 GOPS/W to a maximum of more than 70 TOPS/W. In the surveyed accelerators, we observe that FPGA implementations have higher power efficiency than other implementations. According to the data, no major new developments have been produced that significantly affect power, power efficiency, or throughput when compared to previous years.
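The conversions used for the comparisons in Figs. 10 and 11 are simple enough to state as code; the sketch below, with purely hypothetical numbers, shows how power efficiency (GOPS/W) is obtained from reported throughput and power, and how throughput is derived when only power and efficiency are reported.

```python
def power_efficiency(throughput_gops, power_w):
    """GOPS/W, used when an article reports throughput and power."""
    return throughput_gops / power_w

def derived_throughput(power_w, efficiency_gops_per_w):
    """GOPS, used when an article reports power and efficiency instead."""
    return power_w * efficiency_gops_per_w

# Purely hypothetical accelerator numbers for illustration:
print(power_efficiency(500.0, 2.5))      # 200.0 GOPS/W
print(derived_throughput(0.25, 80.0))    # 20.0 GOPS
```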


Fig. 11. Comparison between the recent methods.

C. Future Machine Learning Accelerator Designs

Future machine learning accelerator designs face several challenges as AI applications continue to grow in complexity and scale. Here are some insights and suggestions to address these challenges.

1) Leveraging reconfigurable designs: With reconfigurable designs and optimization strategies such as parallel processing, dynamic resource allocation, and area optimization, it becomes possible to increase the speed of machine learning accelerators while minimizing costs and maintaining flexibility to adapt to varying workloads and applications. The reconfigurable designs proposed encompass nodes that possess the ability to seamlessly transition between different layers, thereby heightening network speed and achieving specific performance objectives. This adaptability empowers optimization by enabling the configuration and updating of on-chip layer quantities, offering a versatile approach to resource allocation. Furthermore, a reduction in the number of adders and multipliers is integrated, leading to a decrease in computational operations. The successful integration of these design elements yields a solution that excels in both efficiency and resource utilization.

2) Power efficiency: Energy consumption is a major concern for AI systems, especially in mobile and edge computing scenarios. Improving power efficiency through techniques like quantization, sparsity, and specialized memory architectures will be vital. Data reuse is an effective approach for reducing the energy consumption of data transfer. This requires moving data once from a remote, large memory source (such as an off-chip DRAM) and then using it for multiple operations from a nearby, smaller memory location (such as an on-chip buffer or a PE's scratchpad). The optimization of data movement holds substantial importance in the overall design of DNN processors. Furthermore, by reducing the number of adders and multipliers, the system executes fewer computational operations, leading to decreased energy consumption.

3) Model size and complexity: State-of-the-art AI models are becoming larger and more complex, demanding significant computational power and memory resources. Future accelerators need to be scalable to handle these large models efficiently. The optimization involves condensing layers, such as combining two layers to function as effectively as four, thereby enhancing performance. Additionally, simplifying units and reducing the number of pooling layers results in a reduced overall area footprint.

4) Diverse workloads: With the increasing diversity of AI workloads, designing accelerators that can efficiently handle various tasks is essential. Addressing the diversity of AI workloads requires a multifaceted approach that encompasses various strategies and methodologies like quantization (reducing the precision of weights and activations) and pruning (removing less significant connections) to reduce the computational requirements of AI models. This can make them more versatile and adaptable to different workloads.

5) Real-time inference: Many AI applications require real-time or low-latency responses. Future accelerators must face the challenge of offering rapid inference while maintaining high accuracy, especially in time-sensitive fields such as autonomous vehicles and robotics. Implementing parallel processing techniques to execute multiple tasks simultaneously can lead to substantial reductions in response times for AI applications. Also, utilizing edge devices with processing capabilities to perform computations locally reduces the need for data to be sent back and forth to a centralized server.

6) Bottlenecks in data transfer: Addressing memory access and data transfer bottlenecks in the accelerator can be achieved through various strategies. Reusing data in calculations helps minimize the need for frequent data transfers. Processing data in batches reduces the frequency of memory access and transfers, thereby enhancing overall efficiency. Additionally, employing cache memory to store frequently accessed data mitigates the impact of slow memory access. Data compression algorithms can also be employed to reduce the volume of data transferred, leading to improved performance. Utilizing direct memory access (DMA) controllers allows for the offloading of data transfer tasks from the CPU, enabling it to focus on computation. Furthermore, structuring algorithms and data layouts to enhance spatial locality can further reduce the frequency of memory accesses. These combined approaches can effectively alleviate memory access and data transfer bottlenecks, ultimately enhancing the performance of the accelerator.

7) Hardware-software co-design: Tight collaboration between hardware and software teams is required to extract maximum performance from accelerators. Codesign efforts can result in improved hardware-software integration and targeted optimizations. Future AI accelerators may increasingly adopt neuromorphic computing principles, mimicking the brain's architecture. Also, as quantum computing advances, algorithms and software frameworks will need to be tailored for quantum hardware.


8) Heterogeneous computing: To strike a balance between performance and energy efficiency, heterogeneous computing designs incorporating several types of accelerators (e.g., CPUs, GPUs, and TPUs) may become more common. Each type of processor can be optimized for specific types of computations.

Addressing these challenges would necessitate ongoing research and development in both the hardware and software areas. Collaboration between academia, industry, and the open-source community will be critical to advancing machine learning accelerator designs that match the needs of tomorrow's AI landscape.

VII. CONCLUSION

Machine learning is involved in most of the current domains such as the IoT environment and biomedical systems. The main challenge is to design a machine learning hardware accelerator with high speed and performance at a low cost. This article investigated different hardware accelerator structures: ANN, CNN, and RNN. It described the existing approaches with a comparison that shows the features and limitations of each method. This article also presented the current challenges for designing machine learning accelerators. We highlighted the evaluation parameters of both the learning and hardware sides, such as accuracy, sensitivity, area, speed, throughput, and energy consumption. Thus, this article presented a complete survey on machine learning hardware accelerators to help new researchers and designers in the field. For future research, the hardware accelerator can have reconfiguration features to be suitable for multiple applications. The reconfiguration process can be done online based on application criteria. Also, a hardware accelerator might be implemented using mixed circuits to have both benefits of analog and digital designs. Furthermore, some hardware components can be shared to support multiple operations to save area on a chip.

REFERENCES

[1] A. B. Nassif, I. Shahin, I. Attili, M. Azzeh, and K. Shaalan, "Speech recognition using deep neural networks: A systematic review," IEEE Access, vol. 7, pp. 19143-19165, 2019.
[2] S. Dua et al., "Developing a speech recognition system for recognizing tonal speech signals using a convolutional neural network," Appl. Sci., vol. 12, no. 12, 2022, Art. no. 6223.
[3] M. Chun, H. Jeong, H. Lee, T. Yoo, and H. Jung, "Development of Korean food image classification model using public food image dataset and deep learning methods," IEEE Access, vol. 10, pp. 128732-128741, 2022.
[4] C. T. Sari and C. Gunduz-Demir, "Unsupervised feature extraction via deep learning for histopathological classification of colon tissue images," IEEE Trans. Med. Imag., vol. 38, no. 5, pp. 1139-1149, May 2019.
[5] K. Khalil, O. Eldash, A. Kumar, and M. Bayoumi, "Machine learning-based approach for hardware faults prediction," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 11, pp. 3880-3892, Nov. 2020.
[6] R. Malhotra, "A systematic review of machine learning techniques for software fault prediction," Appl. Soft Comput., vol. 27, pp. 504-518, 2015.
[7] K. Khalil, O. Eldash, A. Kumar, and M. Bayoumi, "Intelligent fault-prediction assisted self-healing for embryonic hardware," IEEE Trans. Biomed. Circuits Syst., vol. 14, no. 4, pp. 852-866, Aug. 2020.
[8] L.-Q. Zuo, H.-M. Sun, Q.-C. Mao, R. Qi, and R.-S. Jia, "Natural scene text recognition based on encoder-decoder framework," IEEE Access, vol. 7, pp. 62616-62623, 2019.
[9] Y. Xu, Y. Wang, W. Zhou, Y. Wang, Z. Yang, and X. Bai, "TextField: Learning a deep direction field for irregular scene text detection," IEEE Trans. Image Process., vol. 28, no. 11, pp. 5566-5579, Nov. 2019.
[10] U. P. Singh, S. S. Chouhan, S. Jain, and S. Jain, "Multilayer convolution neural network for the classification of mango leaves infected by anthracnose disease," IEEE Access, vol. 7, pp. 43721-43729, 2019.
[11] K. Li, J. Daniels, C. Liu, P. Herrero, and P. Georgiou, "Convolutional recurrent neural networks for glucose prediction," IEEE J. Biomed. Health Inform., vol. 24, no. 2, pp. 603-613, Feb. 2020.
[12] C. N. Freitas, F. R. Cordeiro, and V. Macario, "MyFood: A food segmentation and classification system to aid nutritional monitoring," in Proc. 33rd SIBGRAPI Conf. Graph. Patterns Images (SIBGRAPI), Piscataway, NJ, USA: IEEE Press, 2020, pp. 234-239.
[13] H. Jelodar, Y. Wang, R. Orji, and S. Huang, "Deep sentiment classification and topic discovery on novel coronavirus or COVID-19 online discussions: NLP using LSTM recurrent neural network approach," IEEE J. Biomed. Health Inform., vol. 24, no. 10, pp. 2733-2742, Oct. 2020.
[14] M. Li, W. Hsu, X. Xie, J. Cong, and W. Gao, "SACNN: Self-attention convolutional neural network for low-dose CT denoising with self-supervised perceptual loss network," IEEE Trans. Med. Imag., vol. 39, no. 7, pp. 2289-2301, Jul. 2020.
[15] B. Dey et al., "SEM image denoising with unsupervised machine learning for better defect inspection and metrology," in Proc. Metrol. Inspection Process Control Semicond. Manuf. XXXV, vol. 11611, Bellingham, WA, USA: SPIE, 2021, pp. 245-254.
[16] B. Dey et al., "Unsupervised machine learning based SEM image denoising for robust contour detection," in Proc. Int. Conf. Extreme Ultraviolet Lithography, vol. 11854, Bellingham, WA, USA: SPIE, 2021, pp. 88-102.
[17] Y. Liu et al., "Graph self-supervised learning: A survey," IEEE Trans. Knowl. Data Eng., vol. 35, no. 6, pp. 5879-5900, Jun. 2023.
[18] X. Wang, D. Kihara, J. Luo, and G.-J. Qi, "EnAET: A self-trained framework for semi-supervised and supervised learning with ensemble transformations," IEEE Trans. Image Process., vol. 30, pp. 1639-1647, 2020.
[19] S. Ahmed, Y. Lee, S.-H. Hyun, and I. Koo, "Unsupervised machine learning-based detection of covert data integrity assault in smart grid networks utilizing isolation forest," IEEE Trans. Inf. Forensics Secur., vol. 14, no. 10, pp. 2765-2777, Oct. 2019.
[20] A. Uprety and D. B. Rawat, "Reinforcement learning for IoT security: A comprehensive survey," IEEE Internet Things J., vol. 8, no. 11, pp. 8693-8706, Jun. 2020.
[21] H. Xu, A. D. Domínguez-García, and P. W. Sauer, "Optimal tap setting of voltage regulation transformers using batch reinforcement learning," IEEE Trans. Power Syst., vol. 35, no. 3, pp. 1990-2001, May 2020.
[22] M. Saharkhizan, A. Azmoodeh, A. Dehghantanha, K.-K. R. Choo, and R. M. Parizi, "An ensemble of deep recurrent neural networks for detecting IoT cyber attacks using network traffic," IEEE Internet Things J., vol. 7, no. 9, pp. 8852-8859, Sep. 2020.
[23] P. Goswami, A. Mukherjee, M. Maiti, S. K. S. Tyagi, and L. Yang, "A neural-network-based optimal resource allocation method for secure IIoT network," IEEE Internet Things J., vol. 9, no. 4, pp. 2538-2544, Feb. 2022.
[24] M. Woźniak, J. Siłka, M. Wieczorek, and M. Alrashoud, "Recurrent neural network model for IoT and networking malware threat detection," IEEE Trans. Ind. Informat., vol. 17, no. 8, pp. 5583-5594, Aug. 2021.
[25] Z. Li, F. Liu, W. Yang, S. Peng, and J. Zhou, "A survey of convolutional neural networks: Analysis, applications, and prospects," IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 12, pp. 6999-7019, Dec. 2022.
[26] S. Kiranyaz, O. Avci, O. Abdeljaber, T. Ince, M. Gabbouj, and D. J. Inman, "1D convolutional neural networks and applications: A survey," Mech. Syst. Signal Process., vol. 151, 2021, Art. no. 107398.
[27] V. Veerasamy et al., "LSTM recurrent neural network classifier for high impedance fault detection in solar PV integrated power system," IEEE Access, vol. 9, pp. 32672-32687, 2021.
[28] K. Khalil, O. Eldash, A. Kumar, and M. Bayoumi, "Economic LSTM approach for recurrent neural networks," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 66, no. 11, pp. 1885-1889, Nov. 2019.
[29] O. I. Abiodun et al., "Comprehensive review of artificial neural network applications to pattern recognition," IEEE Access, vol. 7, pp. 158820-158846, 2019.
[30] K. Khalil, O. Eldash, A. Kumar, and M. Bayoumi, "An efficient approach for neural network architecture," in Proc. 25th IEEE Int. Conf. Electron. Circuits Syst. (ICECS), Piscataway, NJ, USA: IEEE Press, 2018, pp. 745-748.

Authorized licensed use limited to: G Narayanamma Institute of Technology & Science. Downloaded on October 17,2024 at 06:23:06 UTC from IEEE Xplore. Restrictions apply.
MOHAIDAT AND KHALIL: SURVEY ON NEURAL NETWORK HARDWARE ACCELERATORS 3821

[31] K. Khalil, O. Eldash, B. Dey, A. Kumar, and M. Bayoumi, “Architecture [52] K. Khalil, O. Eldash, A. Kumar, and M. Bayoumi, “Self-healing ap-
of a novel low-cost hardware neural network,” in Proc. IEEE 63rd Int. proach for hardware neural network architecture,” in Proc. IEEE 62nd
Midwest Symp. Circuits Syst. (MWSCAS), Piscataway, NJ, USA: IEEE Int. Midwest Symp. Circuits Syst. (MWSCAS), Piscataway, NJ, USA:
Press, 2020, pp. 1060–1063. IEEE Press, 2019, pp. 622–625.
[32] E. Wang et al., “Deep neural network approximation for custom hard- [53] K. Khalil, A. Kumar, and M. Bayoumi, “Reconfigurable hardware design
ware: Where we’ve been, where we’re going,” ACM Comput. Surveys approach for economic neural network,” IEEE Trans. Circuits Syst., II,
(CSUR), vol. 52, no. 2, pp. 1–39, 2019. Exp. Briefs, vol. 69, no. 12, pp. 5094–5098, Dec. 2022.
[33] K. Khalil, B. Dey, M. Abdelrehim, A. Kumar, and M. Bayoumi, “An [54] T. V. Huynh, “Deep neural network accelerator based on FPGA,” in
efficient reconfigurable neural network on chip,” in Proc. 28th IEEE Proc. 4th NAFOSTED Conf. Inf. Comput. Sci., Piscataway, NJ, USA:
Int. Conf. Electron. Circuits Syst. (ICECS), Piscataway, NJ, USA: IEEE IEEE Press, 2017, pp. 254–257.
Press, 2021, pp. 1–4. [55] L. D. Medus, T. Iakymchuk, J. V. Frances-Villora, M. Bataller-
[34] K. Khalil, O. Eldash, A. Kumar, and M. Bayoumi, “N2 OC: Neural- Mompeán, and A. Rosado-Muñoz, “A novel systolic parallel hardware
network-on-chip architecture,” in Proc. 32nd IEEE Int. System–Chip architecture for the FPGA acceleration of feedforward neural networks,”
Conf. (SOCC), Piscataway, NJ, USA: IEEE Press, 2019, pp. 272–277. IEEE Access, vol. 7, pp. 76084–76103, 2019.
[35] K. Khalil, O. Eldash, B. Dey, A. Kumar, and M. Bayoumi, “A novel [56] K. Khalil, B. Dey, A. Kumar, and M. Bayoumi, “Adaptive hardware
reconfigurable hardware architecture of neural network,” in Proc. IEEE architecture for neural-network-on-chip,” in Proc. IEEE 65th Int. Mid-
62nd Int. Midwest Symp. Circuits Syst. (MWSCAS), Piscataway, NJ, west Symp. Circuits Syst. (MWSCAS), Piscataway, NJ, USA: IEEE Press,
USA: IEEE Press, 2019, pp. 618–621. 2022, pp. 1–4.
[36] M. A. Rajput, S. Alyami, Q. A. Ahmed, H. Alshahrani, Y. Asiri, [57] S. Xiao et al., “Neuronlink: An efficient chip-to-chip interconnect for
and A. Shaikh, “Improved learning-based design space exploration for large-scale neural network accelerators,” IEEE Trans. Very Large Scale
approximate instance generation,” IEEE Access, vol. 11, pp. 18291– Integr. VLSI Syst., vol. 28, no. 9, pp. 1966–1978, Sep. 2020.
18299, 2023. [58] B. Zhang et al., “PIMCA: A programmable in-memory computing
[37] G. Armeniakos, G. Zervakis, D. Soudris, and J. Henkel, “Hardware accelerator for energy-efficient DNN inference,” IEEE J. Solid-State
approximate techniques for deep neural network accelerators: A survey,” Circuits, vol. 58, no. 5, pp. 1436–1449, May 2023.
ACM Comput. Surveys, vol. 55, no. 4, pp. 1–36, 2022. [59] L. Song, J. Mao, Y. Zhuo, X. Qian, H. Li, and Y. Chen, “HyPar: Towards
[38] K. Khalil, A. Kumar, and M. Bayoumi, “Low-power convolutional hybrid parallelism for deep learning accelerator array,” in Proc. IEEE
neural network accelerator on FPGA,” in Proc. IEEE 5th Int. Conf. Int. Symp. High Perform. Comput. Archit. (HPCA), 2019, pp. 56–68.
Artif. Intell. Circuits Syst. (AICAS), Piscataway, NJ, USA: IEEE Press, [60] X. Wei, Y. Liang, P. Zhang, C. H. Yu, and J. Cong, “Overcoming data
2023, pp. 1–5. transfer bottlenecks in DNN accelerators via layer-conscious memory
[39] C. Åleskog, H. Grahn, and A. Borg, “Recent developments in low- managment,” in Proc. ACM/SIGDA Int. Symp. Field-Programmable
power AI accelerators: A survey,” Algorithms, vol. 15, no. 11, 2022, Gate Arrays (FPGA), New York, NY, USA: ACM, 2019, p. 120,
Art. no. 419. doi: 10.1145/3289602.3293947.
[40] M. Giordano, L. Piccinelli, and M. Magno, “Survey and comparison of [61] M. Capra, B. Bussolino, A. Marchisio, M. Shafique, G. Masera, and
milliwatts micro controllers for tiny machine learning at the edge,” in M. Martina, “An updated survey of efficient hardware architectures
Proc. IEEE 4th Int. Conf. Artif. Intell. Circuits Syst. (AICAS), Piscataway, for accelerating deep convolutional neural networks,” Future Internet,
NJ, USA: IEEE Press, 2022, pp. 94–97. vol. 12, no. 7, 2020, Art. no. 113.
[41] S. S. Saha, S. S. Sandha, and M. Srivastava, “Machine learning [62] K. Khalil, B. Dey, A. Kumar, and M. Bayoumi, “A reversible-logic
for microcontroller-class hardware-a review,” IEEE Sens. J., vol. 22, based architecture for convolutional neural network (CNN),” in Proc.
no. 22, pp. 21362–21390, Nov. 2022. IEEE Int. Midwest Symp. Circuits Syst. (MWSCAS),. Piscataway, NJ,
[42] K. Khalil, T. Mohaidat, and M. Bayoumi, “Low-cost hardware design USA: IEEE Press, 2021, pp. 1070–1073.
approach for long short-term memory (LSTM),” in Proc. IEEE Int. [63] H. Li, X. Yue, Z. Wang, W. Wang, H. Tomiyama, and L. Meng, “A
Symp. Circuits Syst. (ISCAS), Piscataway, NJ, USA: IEEE Press, 2023, survey of convolutional neural networks—From software to hardware
pp. 1–5. and the applications in measurement,” Meas. Sens., vol. 18, 2021,
[43] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, “How to evaluate deep Art. no. 100080.
neural network processors: TOPS/W (alone) considered harmful,” IEEE [64] B. Dey, K. Khalil, A. Kumar, and M. Bayoumi, “A reversible-logic
Solid-State Circuits Mag., vol. 12, no. 3, pp. 28–41, Summer 2020. based architecture for VGGNet,” in Proc. 28th IEEE Int. Conf. Elec-
[44] N. Gupta, “Introduction to hardware accelerator systems for artificial tron. Circuits Syst. (ICECS), Piscataway, NJ, USA: IEEE Press, 2021,
intelligence and machine learning,” in Hardware Accelerator Systems pp. 1–4.
for Artificial Intelligence and Machine Learning, S. Kim and G. C. [65] Y. Tang, L. Tian, Y. Liu, Y. Wen, K. Kang, and X. Zhao, “Design and
Deka, Eds., Elsevier, 2021, ch. 1, vol. 122, pp. 1–21. [Online]. Available: implementation of improved CNN activation function,” in Proc. 3rd Int.
https://www.sciencedirect.com/science/article/pii/S0065245820300541 Conf. Comput. Vis. Image Deep Learn. Int. Conf. Comput. Eng. Appl.
[45] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. (CVIDL & ICCEA), Piscataway, NJ, USA: IEEE Press, 2022, pp. 1166–
Kepner, “Survey of machine learning accelerators,” in Proc. IEEE High 1170.
Perform. Extreme Comput. Conf. (HPEC), Piscataway, NJ, USA: IEEE [66] K. Khalil, O. Eldash, A. Kumar, and M. Bayoumi, “Designing novel
Press, 2020, pp. 1–12. AAD pooling in hardware for a convolutional neural network acceler-
[46] M. F. Hashmi, R. Pal, R. Saxena, and A. G. Keskar, “A new approach ator,” IEEE Trans. Very Large Scale Integr. VLSI Syst., vol. 30, no. 3,
for real time object detection and tracking on high resolution and multi- pp. 303–314, Mar. 2022.
camera surveillance videos using GPU,” J. Central South Univ., vol. 23, [67] Q. Song, J. Zhang, L. Sun, and G. Jin, “Design and implementation
pp. 130–144, 2016. of convolutional neural networks accelerator based on multidie,” IEEE
[47] M. Capra, B. Bussolino, A. Marchisio, G. Masera, M. Martina, and Access, vol. 10, pp. 91497–91508, 2022.
M. Shafique, “Hardware and software optimizations for accelerating [68] Y.-S. Ting, Y.-F. Teng, and T.-D. Chiueh, “Batch normalization processor
deep neural networks: Survey of current trends, challenges, and the road design for convolution neural network training and inference,” in Proc.
ahead,” IEEE Access, vol. 8, pp. 225134–225180, 2020. IEEE Int. Symp. Circuits Syst. (ISCAS), Piscataway, NJ, USA: IEEE
[48] Z. Qi, W. Chen, R. A. Naqvi, and K. Siddique, “Designing deep Press, 2021, pp. 1–4.
learning hardware accelerator and efficiency evaluation,” Comput. Intell. [69] B. Khabbazan and S. Mirzakuchaki, “Design and implementation of a
Neurosci., vol. 2022, 2022, Art. no. 1291103. low-power, embedded CNN accelerator on a low-end FPGA,” in Proc.
[49] S. Bavikadi et al., “A survey on machine learning accelerators and 22nd Euromicro Conf. Digit. Syst. Des. (DSD), Piscataway, NJ, USA:
evolutionary hardware platforms,” IEEE Des. Test, vol. 39, no. 3, IEEE Press, 2019, pp. 647–650.
pp. 91–116, Jun. 2022. [70] H. Xiao, K. Li, and M. Zhu, “FPGA-based scalable and highly con-
[50] Z. Zhang, K. Zhang, and A. Khelifi, Multivariate Time Series Analysis current convolutional neural network acceleration,” in Proc. IEEE Int.
in Climate and Environmental Research. Springer, 2018. Conf. Power Electron. Comput. Appl. (ICPECA), Piscataway, NJ, USA:
[51] B. Dey, K. Khalil, A. Kumar, and M. Bayoumi, “A reversible-logic IEEE Press, 2021, pp. 367–370.
based architecture for artificial neural network,” in Proc. IEEE 63rd Int. [71] J. Lee, J. Rhim, D. Kang, and S. Ha, “SNAS: Fast hardware-aware neural
Midwest Symp. Circuits Syst. (MWSCAS), Piscataway, NJ, USA: IEEE architecture search methodology,” IEEE Trans. Comput.-Aided Design
Press, 2020, pp. 505–508. Integr. Circuits Syst., vol. 41, no. 11, pp. 4826–4836, Nov. 2022.

Authorized licensed use limited to: G Narayanamma Institute of Technology & Science. Downloaded on October 17,2024 at 06:23:06 UTC from IEEE Xplore. Restrictions apply.
3822 IEEE TRANSACTIONS ON ARTIFICIAL INTELLIGENCE, VOL. 5, NO. 8, AUGUST 2024

[72] S. Liu, H. Fan, M. Ferianc, X. Niu, H. Shi, and W. Luk, “Toward full- [91] A. Vaswani et al., “Attention is all you need,” Proc. Adv. Neural Inf.
stack acceleration of deep convolutional neural networks on FPGAs,” Process. Syst., vol. 30, 2017.
IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 8, pp. 3974–3987, [92] W. Li, S. Wang, and G. Liu, “Transformer-based model for fMRI
Aug. 2022. data: ABIDE results,” in Proc. 7th Int. Conf. Comput. Commun. Syst.
[73] H. Wang, Y. Zhao, and F. Gao, “A convolutional neural network (ICCCS), 2022, pp. 162–167.
accelerator based on FPGA for buffer optimization,” in Proc. IEEE 5th [93] S. Ansari and K. A. Alnajjar, “Multi-hop genetic-algorithm-optimized
Adv. Inf. Technol., Electron. Automat. Control Conf. (IAEAC), vol. 5, routing technique in diffusion-based molecular communication,” IEEE
Piscataway, NJ, USA: IEEE Press, 2021, pp. 2362–2367. Access, vol. 11, pp. 22689–22704, 2023.
[74] P. Achararit, M. A. Hanif, R. V. W. Putra, M. Shafique, and Y. [94] M. S. Rao, K. Venkata Rao, and M. H. M. Krishna Prasad, “Hybrid
Hara-Azumi, “APNAS: Accuracy-and-performance-aware neural archi- security approach for database security using diffusion based cryp-
tecture search for neural hardware accelerators,” IEEE Access, vol. 8, tography and diffie-hellman key exchange algorithm,” in Proc. 5th
pp. 165319–165334, 2020. Int. Conf. I-SMAC (IoT Soc. Mob. Analytics Cloud) (I-SMAC), 2021,
[75] T. Yuan, W. Liu, J. Han, and F. Lombardi, “High performance CNN pp. 1608–1612.
accelerators based on hardware and algorithm co-optimization,” IEEE [95] Z. Zhao, R. Cao, K.-F. Un, W.-H. Yu, P.-I. Mak, and R. P. Martins,
Trans. Circuits Syst. I, Reg. Papers, vol. 68, no. 1, pp. 250–263, “An FPGA-based transformer accelerator using output block stationary
Jan. 2021. dataflow for object recognition applications,” IEEE Trans. Circuits Syst.,
[76] W. Huang et al., “FPGA-based high-throughput CNN hardware acceler- II, Exp. Briefs, vol. 70, no. 1, pp. 281–285, Jan. 2023.
ator with high computing resource utilization ratio,” IEEE Trans. Neural [96] Z. Cheng, Z. Zhang, J. Jiang, and J. Sun, “Signal detection of mobile
Netw. Learn. Syst., vol. 33, no. 8, pp. 4069–4083, Aug. 2022. multi-user molecular communication system using transformer-based
[77] M. Kim and J.-S. Seo, “Deep convolutional neural network accelerator model,” in Proc. 8th Int. Conf. Comput. Commun. Syst. (ICCCS), 2023,
featuring conditional computing and low external memory access,” in pp. 85–90.
Proc. IEEE Custom Integr. Circuits Conf. (CICC), Piscataway, NJ, USA: [97] Y. Yan, W. Du, D. Yang, and D. Yin, “CIPTA: Contrastive-based iterative
IEEE Press, 2020, pp. 1–4. prompt-tuning using text annotation from large language models,” in
[78] Q. Cheng et al., “A low-power sparse convolutional neural network Proc. 4th Int. Conf. Electron. Commun. Artif. Intell. (ICECAI), 2023,
accelerator with pre-encoding Radix-4 booth multiplier,” IEEE Trans. pp. 174–178.
Circuits Syst., II, Exp. Briefs, vol. 70, no. 6, pp. 2246–2250, Jun. 2023. [98] Y. Ye, H. You, and J. Du, “Improved trust in human-robot collaboration
[79] X. Yu et al., “A data-center FGPA acceleration platform for convolu- with ChatGPT,” IEEE Access, vol. 11, pp. 55748–55754, 2023.
tional neural networks,” in Proc. 29th Int. Conf. Field Programmable [99] P. Maddigan and T. Susnjak, “Chat2VIS: Generating data visualizations
Log. Appl. (FPL), 2019, pp. 151–158. via natural language using ChatGPT, Codex and GPT-3 large language
[80] R. Hwang, M. Kang, J. Lee, D. Kam, Y. Lee, and M. Rhu, “GROW: models,” IEEE Access, vol. 11, pp. 45181–45193, 2023.
A row-stationary sparse-dense GEMM accelerator for memory-efficient [100] W. Zhu et al., “Sensitivity, specificity, accuracy, associated confidence
graph convolutional neural networks,” in Proc. IEEE Int. Symp. High- interval and ROC analysis with practical SAS implementations,” in
Perform. Comput. Archit. (HPCA), 2023, pp. 42–55. Proc. Health Care Life Sci. (NESUG), Baltimore, MD, USA, vol. 19,
[81] A. Graves and J. Schmidhuber, “Framewise phoneme classification with 2010, p. 67.
bidirectional LSTM networks,” in Proc. IEEE Int. Joint Conf. Neural
Netw., vol. 4, Piscataway, NJ, USA: IEEE Press, 2005, pp. 2047–2052.
Tamador Mohaidat received the B.Sc. degree in computer engineering from Yarmouk University, Irbid, Jordan, in 2010. She is currently working toward the M.Sc. degree in computer engineering with the Department of Electrical and Computer Engineering, University of Mississippi, Oxford, MS, USA.

She was a Lecturer with the Deanship of the Preparatory Year, Prince Sattam Bin Abdulaziz University, Al-Kharj, Saudi Arabia, for two years. She is currently a Research Assistant with the Department of Electrical and Computer Engineering, University of Mississippi. Her research interests include very large-scale integration (VLSI), artificial intelligence, machine learning, and hardware accelerators.

Kasem Khalil (Senior Member, IEEE) received the B.Sc. and M.Sc. degrees in electrical engineering from Assiut University, Asyut, Egypt, in 2009 and 2014, respectively, and the Ph.D. degree in computer engineering from the Center of Advanced Computer Studies (CACS), University of Louisiana at Lafayette, Lafayette, LA, USA, in 2021.

Since 2022, he has been serving as an Associate Editor of the Elsevier Microelectronics Journal. His research interests include electronics, very large-scale integration (VLSI), microelectronics, reconfigurable hardware, self-healing hardware systems, machine learning, hardware accelerators, network-on-chip, artificial intelligence, intelligent hardware systems, and the Internet of Things.

Dr. Khalil was the recipient of the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION SYSTEMS Prize Paper Award (IEEE Circuits and Systems Society VLSI Paper Award) in 2023.
