A Survey On Neural Network Hardware Accelerators
and education to transportation and the food and entertainment industries to manufacturing and more. The fields of bioinformatics, physics, chemistry, material analysis, and other related disciplines utilize intelligent methods that enhance their content development. These methods leverage machine learning as a core technology. It will significantly influence nearly every facet of individuals' lives. Both cloud computing and the Internet of Things (IoT) are driving the expanded utilization of machine learning to enable objects and gadgets to become "smart" on their own [22], [23], [24].

Deep learning typically employs multiple layered structures, such as convolutional neural networks (CNN) [25], [26], recurrent neural networks (RNN) [27], [28], and artificial neural networks (ANN) [29], [30], to process large-scale and unstructured data. Each structure has its own model of learning. An ANN is based on multiple layers connected in a chain. Each layer has multiple nodes that perform the computation process. The nodes in each layer are interconnected with the nodes in the following layer until they reach the output layer. The architecture of a CNN consists of convolutional layers and pooling layers, with a pooling layer following each convolutional layer. A convolutional layer is used to run the computation of the input data with the stored weights. To reduce complexity in the subsequent layers, a pooling layer is utilized to diminish the data size. The RNN is based on memory for the learning process, which makes it suitable for time-series data.

Hardware implementation of machine learning plays a significant role in current applications with low cost [31], [32], [33], [34], [35], [36]. The main challenge is to provide a machine learning accelerator with high speed for classification problems and low hardware cost, without compromising the desired performance in terms of area and power [37], [38], [39], [40], [41], [42]. The objective of this article is to examine the current machine-learning hardware accelerator approaches and discuss their advantages and limitations. It also defines the general challenges that any hardware accelerator design should consider in future methods. Furthermore, it presents the evaluation parameters for a hardware accelerator. This study delves into a comprehensive examination of various hardware accelerators, without being limited to specific neural network architectures. It conducts an extensive comparative analysis of recent advancements, highlighting their respective strengths and shortcomings. This is accomplished by presenting numerical data for key performance metrics, including power consumption, area, and accuracy. By providing this detailed information, the reader gains a precise understanding of the distinctive attributes of each examined work. This article's main contributions are summarized as follows.
1) A list of the challenges on machine learning hardware accelerators.
2) A comprehensive study on hardware accelerator systems.
3) A comprehensive review on machine learning hardware accelerators.
4) A comparison between the existing hardware accelerators.
5) An evaluation framework for machine learning hardware accelerators.

The subsequent sections of this article are organized in the following manner. Section II presents the unique challenges of machine learning hardware accelerators. Section III investigates multiple hardware accelerator systems. The models and datasets covered by the existing methods are presented in Section IV. Section V presents a review of machine learning accelerators with a comparison between the existing methods. Section VI presents an evaluation framework of a hardware accelerator, followed by the conclusion and a discussion of future work in Section VII.

II. ACCELERATOR CHALLENGES

Existing machine learning hardware accelerators face several challenges in providing a design with the desired performance and cost. Machine learning models are complex, so their hardware implementations are complex and slow. Thus, the research direction is to propose designs with less complexity while preserving performance and increasing speed. Hardware accelerator challenges include power/energy consumption, throughput, area, speed, learning performance, and resource consumption. Each one is described as follows.

A. Power Consumption

In cloud-based deep neural network (DNN) processing, power consumption is a critical factor due to the strict power limits in data centers caused by cooling costs. Additionally, data movement consumes more energy than arithmetic operations like multiplier-accumulator (MAC) operations, as its capacitance is much higher. Hence, it is crucial to provide comprehensive reporting on not only the energy efficiency and power consumption of the chip but also the energy efficiency and power consumption associated with off-chip memory. This includes considerations such as dynamic random access memory (DRAM) or the frequency of off-chip accesses. By evaluating the energy efficiency and power consumption of the entire system, regardless of the specific memory technology employed, a more holistic assessment can be achieved [43]. Embedded system designers face an increasing challenge in reducing hardware resources and power consumption while maintaining the computational complexity of real-time applications. In some designs, the weights and intermediate results can be stored in on-chip buffers to cut down on the time spent retrieving data from off-chip memory and the amount of power required to keep the system running. The main issue is to design a hardware accelerator with a light structure to reduce power consumption.

B. Throughput

Obtaining high throughput and low latency concurrently can be challenging depending on the approach taken. Expanding the number of processing elements (PEs) can enhance the overall throughput, resulting in an increased number of parallel MAC operations. However, the system's area cost and the area of the PE determine the number of PEs.
If the area cost of the system remains constant, expanding the number of PEs results in a decrease in the area per PE or a reduction of on-chip storage. This could affect how PEs are utilized. Reducing the logic necessary to send operands to a MAC by using a single piece of logic can decrease the area per PE. The maximum throughput is determined by the number of PEs and the maximum throughput achievable by a single PE. However, the actual throughput depends on several factors, such as the network architecture, weight and activation sparsity in the DNN model, and batch size. Increasing the batch size can enhance the reuse of data and increase throughput. The hardware's ability to support these approaches while maintaining PE utilization, the number of PEs, or cycles per second determines the overall impact of the DNN model on throughput [43].

C. Area

Machine learning accelerators face a multitude of challenges that can lead to area overhead, such as the need to perform both forward and backward passes without sharing any hardware resources between the two processes. Additionally, implementing the hardware accelerator on the chip can come at a high cost. To address these challenges, it is necessary to simplify complex machine learning models in hardware designs while also optimizing hardware components without sacrificing performance, making them more efficient and cost-effective.

D. Performance

Neural networks face difficulties with throughput due to waiting for the processing unit to finish reading data. To address this issue, improved activation functions are proposed in machine learning accelerator designs to enhance accuracy and performance. Strategies such as pooling and convolutional or kernel processing are used to further improve accuracy. To achieve low latency and high efficiency, the neural network is accelerated using pipeline design and multichannel parallel processing. The main challenge is to maintain high performance in terms of sensitivity, accuracy, and specificity while avoiding the addition of complex hardware components.

E. Resource Consumption

Reducing hardware resources poses a significant challenge due to the increased computational complexity of real-time applications. To address this challenge, some innovative architectures have been proposed that use a convolutional PE array. This PE array can reuse pixel and weight data effectively, thereby reducing the number of resources consumed while maintaining performance in learning and testing. The basic concept is to reduce the hardware resources without compromising the learning and testing performance of the system. The convolutional PE array architecture exploits the fact that the convolution operation is both data and weight reuse friendly. The array can perform multiple convolution operations simultaneously, and the weights for each convolution operation are stored in a weight buffer. The input pixels are stored in a buffer, which can be accessed multiple times during the computation. The PE array can also incorporate multiple output channels and multiple input channels to handle different types of convolution operations. By efficiently reusing the weight data and input data, the convolutional PE array architecture reduces the number of memory accesses, which results in a decrease in power consumption and hardware resources. This technique enables the hardware designer to attain a balance between performance and hardware resources, which is a critical aspect of designing hardware accelerators for machine learning applications.
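As a rough illustration of this reuse argument, the sketch below counts how often each weight and each input pixel of a stride-1 convolution is touched, and the idealized reduction in memory fetches if both are held in on-chip buffers. It is not taken from any specific accelerator surveyed here; the layer dimensions are arbitrary example values and the buffering is assumed ideal.

# Illustrative only: rough reuse-factor arithmetic for one convolutional layer,
# assuming unit stride and "same" padding. Not drawn from any surveyed design.
def conv_reuse(h, w, c_in, c_out, k):
    outputs = h * w * c_out                 # output activations (stride 1, "same" padding)
    macs = outputs * c_in * k * k           # multiply-accumulate operations in the layer
    weights = c_out * c_in * k * k          # unique weight values
    pixels = h * w * c_in                   # unique input pixels
    naive_fetches = 2 * macs                # no reuse: one weight + one pixel fetch per MAC
    buffered_fetches = weights + pixels     # ideal on-chip buffering: each value fetched once
    return {"weight_reuse": macs / weights,
            "pixel_reuse": macs / pixels,
            "fetch_reduction": naive_fetches / buffered_fetches}

# Example: a 56 x 56 x 64 feature map, 3 x 3 kernels, 64 output channels.
print(conv_reuse(56, 56, 64, 64, 3))
# Each weight is used 3136 times and each pixel 576 times; the idealized
# fetch count drops by roughly three orders of magnitude (about 970x).

Even under these idealized assumptions, the arithmetic makes the qualitative point of this subsection: every weight and pixel is consumed hundreds to thousands of times per memory fetch, which is exactly the reuse a convolutional PE array is built to capture in its buffers.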
F. Speed

The design of neural networks with both high speed and energy efficiency has been a challenging task. This has prompted researchers to explore alternatives to graphics processing units (GPUs) and central processing units (CPUs) for efficient acceleration of the algorithms used in neural network models. Due to the high energy cost per read and write operation and the long access time associated with external memory, a number of systems continue to experience difficulties handling their data loads. Alternate methods first adjust the memory to allow for a bigger data bus or make use of several memories distributed across the system in order to cut down on the overhead of this data movement. Parallel access makes it possible to handle many data streams during a single clock cycle, which both accelerates the system's overall speed and makes better use of its available hardware resources.

III. HARDWARE ACCELERATOR SYSTEMS

Hardware accelerator systems are specialized hardware devices designed to accelerate the performance of specific tasks. These systems use dedicated hardware components such as FPGAs, ASICs, and GPUs to perform complex computations much faster than traditional CPUs. Hardware accelerator systems are widely used in a variety of industries due to their ability to perform complex computations faster and more efficiently than traditional computing systems. In the finance industry, hardware accelerators are used for a variety of purposes, including high-frequency trading, risk management, and fraud detection. High-frequency trading relies on the ability to make trades within fractions of a second, and hardware accelerators can process vast amounts of data in real time, making them a valuable tool. In healthcare, they can be used to accelerate medical imaging tasks such as magnetic resonance imaging (MRI) and computerized tomography scans, drug discovery and development, and genomics research. In scientific research, they can be used to accelerate simulations, modeling, and data analysis tasks. They also find applications in autonomous vehicles, aerospace, and defense industries for tasks such as image processing, sensor data analysis, and control systems. Hardware accelerator systems are also used in high-performance computing applications such as machine learning, data analytics, and virtual reality [44], [45]. They can greatly improve the performance of computing devices, allowing for faster and more efficient processing of tasks. This can lead to improved productivity and reduced wait times for users. Additionally, by offloading certain tasks from the CPU and GPU, hardware accelerator systems can help reduce energy consumption and lower costs.
Designers can change the structure of the FPGA's hardware to make it into any shape they need. The FPGA's fine-grained parallel architecture gives it advantages over the GPU. Once the computation clock cycle time has been calculated, the designer can optimize the output mode to minimize the demand for data storage in main memory, thereby decreasing memory reading delays. FPGA programming is powerful, and the FPGA provides dynamic algorithm reconfiguration with robust reconfigurability. Also, the FPGA uses much less power than the GPU and performs better with the same amount of power, which can help considerably with the processor's problem of dissipating heat [45], [48].

D. Application-Specific Integrated Circuit (ASIC)

Fig. 4. ASIC block diagram.

An ASIC is another type of integrated circuit that is designed to perform specific tasks, as opposed to a CPU, which is a general-purpose processor, as shown in Fig. 4 [44]. ASICs are more specialized than GPUs since an ASIC is a processor built to perform a relatively small set of computations, whereas a GPU is still a massively parallel processor with thousands of processing units that can run multiple algorithms [49]. In contrast to an FPGA, you cannot reprogram an ASIC to do something different once it is made. Its logic has been fixed since it was made, but on an FPGA you can create a different design that fits your needs better. ASICs are usually substantially more energy efficient as a result of this specialization.

IV. MODELS AND DATASETS

Fig. 5. DNN models used in the works we reviewed.

Datasets are essential for determining the accuracy of a DNN. Significant research effort has been expended over the decades to increase the performance of DNNs through innovative architectures. However, the constant need for more accuracy has led to new, deeper, and incredibly complex models [37], [47]. In Fig. 5, we present the most common datasets from all the articles examined in our survey; various datasets were utilized to assess the accuracy of the suggested DNN algorithms. There could be multiple datasets for the same work. MNIST, ResNet, CIFAR, and ImageNet are the most popular datasets, as shown. In general, there is a well-balanced distribution of research efforts among CIFAR, ResNet, and ImageNet, while many DNN hardware works focus on the MNIST dataset. It is apparent that a significant portion (27%) of the accuracy assessments are carried out on the basic MNIST network. Nonetheless, considerable attention is devoted to sophisticated networks such as the CIFAR (18%) and ImageNet (18%) networks.

V. ACCELERATOR APPROACHES

Several machine learning approaches, such as ANN, CNN, and RNN, are implemented on hardware. This section discusses the different machine learning accelerator approaches for each category as follows.

A. ANN

Like the biological neural network in the human body, the ANN features a layered architecture in which each network node can process input and forward output to other nodes in the network. The nodes are known as neurons. An ANN is comprised of three or more interconnected layers. The first layer contains input neurons that transfer data to the subsequent layers. The output layer produces the final output data. The layers between the input and output layers are hidden and made up of units that adaptively transform the information received from the previous layer through a sequence of transformations. The ANN can understand more complicated objects since each layer works as an input and output layer. The neural layer is the term used to refer to these inner layers together. The units within a neural layer aim to learn from the gathered information by assigning weights based on the internal architecture of the ANN [50], [51], [52]. These principles enable units to provide a changed result, which is delivered as an output to the next layer. Fig. 6 presents the general block diagram of the ANN. An adder accumulator receives the product of the input from each node after it has been multiplied by a weight. The result of the adder accumulator is sent to an activation function, which returns the final result. The final output is expressed by the following equation:

y_k = f( Σ_{l=0}^{n} W_{lk} · X_l + b_k ).    (1)

Equation (1) represents the final output, where n denotes the total number of neurons. Here, X_l represents the output of the lth node in the previous layer, W_lk is the weight connecting node l to node k, b_k is the bias of node k, and f is the activation function.
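A minimal software restatement of the layer computation in (1) is given below. The sigmoid activation and the toy weights are arbitrary choices for illustration only and do not correspond to any particular hardware design discussed in this survey.

import math

# Sketch of (1): each output neuron k accumulates W[k][l] * X[l] plus a bias,
# then applies an activation function f. Plain Python, no hardware mapping implied.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(X, W, b, f=sigmoid):
    """X: previous-layer outputs, W: weights W[k][l], b: biases b[k];
    returns y[k] = f(sum_l W[k][l] * X[l] + b[k])."""
    y = []
    for k in range(len(b)):
        acc = b[k]                    # adder accumulator initialised with the bias
        for l in range(len(X)):
            acc += W[k][l] * X[l]     # one MAC operation per input node
        y.append(f(acc))              # activation function produces the node output
    return y

# Toy example: 3 inputs feeding 2 output neurons.
X = [0.5, -1.0, 2.0]
W = [[0.1, 0.4, -0.2], [0.3, -0.5, 0.8]]
b = [0.0, 0.1]
print(layer_forward(X, W, b))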
Tang et al. [65] create an improved CNN image classification model. Maximum pooling is used in the model network structure. The accuracy of the four activation functions Sigmoid, Tanh, ReLU, and T-ReLU is compared in this article to improve neural network performance and image classification accuracy. The T-ReLU activation function improves the model, raising image classification accuracy from 62% to 76.52%. Khalil et al. [66] propose a hardware implementation of a new pooling method, absolute average deviation (AAD), for use in a CNN accelerator. AAD makes use of the spatial proximity of pixels by computing vertical and horizontal deviations, resulting in higher accuracy and lower area and power consumption compared to other pooling methods. The AAD pooling method achieves over 98% accuracy without increasing computational complexity. It was tested using various neural network structures and datasets, including EEG, ImageNet, common objects in context (COCO), and united states postal service (USPS). VHDL was used to implement AAD on an Altera Arria 10 GX FPGA with 45-nm technology, using the Synopsys Design Compiler. Song et al. [67] propose a multidie-based CNN accelerator. The VU9P chip involves three accelerators, each connected to an independent super logic region (SLR). The host computer manages the three accelerators under the control of the accelerator, which installs one accelerator in each SLR and uses on-chip resources. This system utilizes an 8-b quantization method to enhance the throughput and computational efficiency of a single DSP for accelerating the YOLOv4-tiny algorithm. The design employs a full reuse of feature maps and weights during the calculation process and stores intermediate results in the on-chip buffer to minimize off-chip access, reduce bandwidth pressure, and decrease power consumption. Moreover, a designed instruction group enables the host computer to control the accelerator. This architecture achieves a frame rate of 148.14 frames per second (FPS) and a peak throughput of 2.76 tera operations per second (TOPS) at a frequency of 200 MHz with an energy efficiency ratio of 93.15 GOPS/W. It delivers promising results in real-time target detection applications.

Ting et al. [68] propose a batch normalization (BN) processor that supports training and inference processes. To accelerate CNN training, the proposed work develops an efficient dataflow that incorporates a novel BN processor design as well as processing elements for convolution acceleration. By sharing hardware elements between the two passes, this study took advantage of the comparable calculations necessary for the BN forward and backward passes, reducing the area overhead. The method completed automatic placement and routing (APR) and post-APR simulation on the training of the neural network and functional verification of the BN processor. The method implemented the BN processor in a CMOS technology process. The proposed solution accelerates the CNN training process while saving hardware. The proposed architecture can reduce the total area by 40.13%.

Khabbazan and Mirzakuchaki [69] describe optimized hardware for CNNs for use in embedded vision systems. This design method is intended to be applied to low-end hardware with the fewest resources needed. This hardware proposes a Z-turn evaluation board architecture with a Xilinx Zynq-7000 System-on-Chip (SoC). This architecture optimizes all computations to be 8-b. Moreover, due to its high-speed performance, low power consumption, and compact size, the architecture is a suitable option for CNN applications that require portability and embedded systems.

Xiao et al. [70] present a neural network acceleration architecture that is efficient, scalable, and has low latency and low error rates. The architecture achieves acceleration by utilizing multichannel parallel computing methods between layers and employing a pipeline design that prioritizes high efficiency and low latency performance requirements. The addition of a line buffer to accommodate varying image widths and the implementation of a selectable convolution kernel size mechanism enhance the network's flexibility and scalability. This proposed neural network performs 32-b floating-point operations. Since CNNs are based on floating-point operations, there will be a loss of precision and time-consuming transformation work if the algorithm's FPGA implementation involves the conversion of floating-point values to fixed-point values. The MNIST dataset is used to perform handwritten number recognition for an experimental evaluation of the solution. The acceleration strategy is implemented using the Xilinx Zynq-7000 FPGA, and the results of calculating 28 × 28 handwritten images at a clock frequency of 200 MHz in 25.95 us are examined. A 98.43% accuracy rate is obtained.

Lee et al. [71] present S3NAS, a quick hardware-aware NAS approach. The process is broken down into three stages: supernet design, Single-Path NAS for quick architectural exploration, and scaling and postprocessing. The initial stage involves creating a supernet, which is a set of candidate networks with two main features. First, it allows for varying numbers of blocks in the stages, and secondly, it permits blocks to have parallel layers with different kernel sizes (MixConv). To minimize the hyperparameter search overhead, a differential search can be carried out by extending the single-path NAS method to include the MixConv layer and incorporating a loss term that takes into account the latency. The network is scaled to its maximum within the latency constraint using compound scaling as the last step. In the postprocessing step, SE blocks and h-swish activation functions are incorporated if they are found to be advantageous. The efficiency of the proposed methodology is demonstrated by tests conducted on four different hardware platforms. Using TPUv3, the search process can be completed within 4 h, resulting in the discovery of networks that offer superior tradeoffs between latency and accuracy compared to state-of-the-art networks. Moreover, this model outperforms other models by 0.6% in terms of accuracy and 14% in terms of speed compared to EfficientNet-B2.

Liu et al. [72] propose a hardware architecture tailored for streaming applications, with a strong emphasis on increasing computation efficiency by fully accelerating CNNs on FPGAs. To support inference of CNNs with varied topologies, the architecture integrates most computational functions, convolutional and deconvolutional layers, into a single unified module. It efficiently handles concatenative and residual connections between the functions, resulting in highly customized acceleration.
This design is further enhanced by utilizing various levels of parallelism, layer fusion, and completely utilizing DSPs. The suggested accelerator has been tested using a variety of benchmark models and implemented on Intel's Arria 10 GX1150 hardware. The results show a high performance of over 1.3 TOP/s throughput and up to 97% computation efficiency.

Wang et al. [73] propose a double buffer memory access structure, which considerably increases the computing unit's memory access efficiency. Furthermore, the proposed architecture utilizes a "ping-pong" buffer structure and employs calculation delay to overlap with memory access delay, resulting in improved acceleration performance. To improve the computation performance of the computing unit, an accelerator structure with a multilevel cache is proposed to execute data preparation reading. To prevent waiting for the processing unit when reading data, a double buffer method is used to perform calculation and data reading alternately. Based on the experimental results, the proposed accelerator achieved a detection speed of 15 FPS when processing an input image of size 3 × 160 × 320, while maintaining the same test accuracy as the original design. This signifies a 1.5 times enhancement in acceleration when compared to the original design.

Achararit et al. [74] offer an accuracy-and-performance-aware NAS (APNAS) that can efficiently create DNNs. APNAS is based on a weight-sharing and reinforcement learning-based exploration method. First, they provide a technique for calculating the cycle count in an RNN such that the network search does not require running a time-consuming hardware simulator. Additionally, they use analytical models for cycle count estimates to speed up the DNN creation process even further. The accuracy of these analytical models is demonstrated by the fact that they provide cycle count estimates that are comparable to those generated by a cycle-accurate hardware simulator. Then, in the RL, they establish a reward function by including a configurable parameter for configuring the tradeoff between the performance and accuracy of the generated DNNs. The study showed that APNAS could construct neural network models in 0.55 GPU days on an Nvidia GTX 1080Ti GPU, resulting in an average of 53% fewer cycles when compared to a manually developed neural network model (ResNet) and a state-of-the-art NAS. They generated CNNs by APNAS for two different image classification datasets (CIFAR-10 and CIFAR-100) that required 52.78% and 53.57% fewer cycles compared to a manually designed CNN.

Yuan et al. [75] propose hardware-oriented compression and hybrid quantization techniques to reduce the memory requirements of CNNs. They classified all layers as either "no-pruning layers (NP-layers)" or "pruning layers (P-layers)" based on their processing features. The former uses parallel computation for high performance with a regular weight distribution, while the latter has a high compression ratio but is asymmetric due to pruning. The approach aimed to balance compression ratio and processing efficiency while maintaining reasonable accuracy by using uniform and incremental quantization techniques, as well as a distributed convolutional architecture with multiple parallel finite impulse response (FIR) filters for the regular model in the NP-layers. They introduced a shift-accumulator-based processing element with activation-driven data flow (ADF) for handling the irregular sparse model in the P-layers. They also proposed a hardware/algorithm cooptimization (HACO) method based on the compression strategy and hardware architecture to implement an NP-P hybrid compressed CNN model on FPGAs. They implemented the compressed VGG-16 model on a Xilinx VCU118 evaluation board for image applications and achieved a compression ratio of 27.5x for a hardware accelerator on a single FPGA chip without the use of off-chip memory, processing 83.0 FPS.

Huang et al. [76] propose an FPGA-based CNN hardware accelerator design, which utilizes a row-level pipelined streaming technique to calculate CLs using a multicomputing engine (CE) architecture. They also presented a mapping mechanism to optimize the computational resource utilization ratio of the PE array, achieving up to 98.15%. Additionally, an effective data storage system was implemented to improve the work efficiency of the CE by continuously feeding input data. A weighted data allocation technique was proposed to reduce the need for off-chip bandwidth while sacrificing some on-chip storage capacity. The design was tested on a XC7VX980T FPGA, achieving 1 TOPS at 150 MHz, which is approximately 98.15% of the theoretical throughput. Moreover, a ResNet-101 accelerator was implemented, achieving 600 GOPS at 100 MHz with up to 96.12% throughput efficiency. Kim et al. [77] present an ASIC accelerator for deep CNNs that uses a novel conditional computing technique to significantly reduce the number of redundant computations and external memory accesses. By combining subsequent max-pooling processes, precision cascading (PC) is a novel conditional computing technique that reduces redundant convolution operations. In addition, combining precision cascading with zero-skipping greatly reduced energy and external memory access. For the VGG-16 CNN on ImageNet, the accelerator achieved peak/average energy efficiency of 8.85/1.22 TOPS/W at a voltage of 0.9 V and low external memory access of 55.31 MB, or 0.0018 accesses/MAC. Cheng et al. [78] introduce a low-power sparse CNN accelerator featuring a preencoding radix-4 Booth multiplier. Leveraging the properties of the radix-4 Booth algorithm, the accelerator reduces the number and bit width of partial products (PPs) and encoder power consumption. It incorporates an activation selector module that chooses activations corresponding to nonzero weights for subsequent multiply-add operations after offline encoding of nonzero weights. Additionally, it consolidates eight encoders from relevant multipliers into a single preencoding module to save area. The proposed work is developed using the Verilog HDL language and implemented in a 28 nm process. The proposed accelerator achieves a performance of 7.0325 TOPS/W with 50% sparsity and scales up to 14.3720 TOPS/W at 87.5% sparsity.

Yu et al. [79] introduce an FPGA-based acceleration platform utilizing supertile methods tailored for general-purpose CNNs in data center applications. The design of a dispatching-assembling buffering model incorporating broadcast cache sets, tailored for a multi-supertile unit (SU) architecture, significantly enhances both reading and writing bandwidth.
…method, while having fewer units; compared to LSTM, the proposed method offers a reduction in area and power consumption by 34% and 35%, respectively. This design exhibits notable attractiveness for low-cost hardware applications. The method proposed in this study is evaluated using three datasets: ImageNet, IMDB, and MNIST. The testing and implementation of the proposed method are performed using an Altera Arria 10 GX FPGA. Wu et al. [85] introduce an energy-efficient scalable processor that leverages the data locality of compressed RNNs. By eliminating redundant connections and sharing quantized values among several weights, the RNN models are significantly compressed. Adopting the quantified sparse matrix encoding significantly reduces repeated calculations and memory operations. Both approaches ensure the suggested design has a high level of energy efficiency. A scalable architecture and network cross-division approach enable hardware parallelism and flexibility. More than 80% of the weight fetching and matrix-vector multiplications for applications like natural language and keyword spotting can be further decreased when using compressed RNNs compared to traditional processors. The peak energy efficiency reaches 3.89 GOPS/mW. It achieves a peak performance of 24 GOPS and dissipates 6.16 mW of power with a 1.1 V supply at 200 MHz.

Kadetotad et al. [86] propose an LSTM RNN accelerator based on a memory compression method known as hierarchical coarse-grain sparsity (HCGS) that was algorithm-hardware cooptimized. HCGS offers considerable compression (16x) of LSTM weights with gentle error rate degradation while minimizing index memory cost. The suggested LSTM accelerator utilizes a combination of hierarchical blockwise sparsity and low-precision quantization to store the compressed weights of LSTMs consisting of three layers and 512 cells in only 288 kB of on-chip SRAM. This method effectively reduces the necessary computation by up to 16 times. The prototype chip, fabricated using 65-nm LP CMOS technology, achieves a remarkable energy efficiency of up to 8.93 TOPS/W for real-time speech recognition. Experimental evaluations conducted on the TIMIT, TED-LIUM, and LibriSpeech datasets provide solid evidence of the effectiveness and suitability of HCGS across multiple LSTM RNNs. Nan et al. [87] present a hybrid-iterative compression (HIC) technique for LSTM/GRU, which separates gating units into error-sensitive and error-insensitive groups and compresses them using different techniques, leveraging the error sensitivity of RNNs. Additionally, an energy-efficient accelerator for bidirectional RNNs is proposed. In this accelerator, weights are rearranged to optimize data flow in the matrix operation unit based on the block structure matrix (MOU-S). A fine-grained parallelism configuration of matrix-vector multiplications (MVMs) is used to improve BRAM utilization. The challenge of load imbalance between MOU-S and the matrix operation unit based on top-k pruning (MOU-P) is effectively addressed through the implementation of the timing matching technique. The architecture of the compressed LSTM/GRU, as proposed, has been thoroughly assessed on the Xilinx ADM-PCIE-7V3 platform. Gao et al. [88] propose EdgeDRNN, a GRU-based RNN accelerator optimized for low-latency edge RNN inference with a batch size of 1 while maintaining a lightweight design. To utilize temporal sparsity in RNNs, EdgeDRNN employs a delta network technique inspired by spiking algorithms. The weight storage of EdgeDRNN is implemented using low-cost off-chip DRAM, and it employs temporal sparsity to decrease memory bandwidth requirements during RNN updates. By employing sparse updates, the memory access to DRAM weights can be reduced by a factor of up to 10. Furthermore, the delta value can be dynamically adjusted to strike a balance between latency and accuracy requirements. This helps optimize EdgeDRNN for efficient edge RNN inference with low latency.

Shan et al. [89] introduce dynamic recurrent routing neural networks (DRRNets) as a solution to typical RNN problems such as complicated dependencies and gradient vanishing. The suggested DRRNets use the routing pointer matrix's low-rank attribute to construct adaptive routes for diverse dependencies and drastically decrease redundant parameters by discovering low-rank approximations for fully connected layers based on the inner structure of the cell state. The article contains an optimization algorithm for training the network and assesses the model's performance in a variety of tasks, including image classification, language modeling, and speaker recognition. Chen et al. [90] introduce a specialized hardware accelerator called "Eciton" designed for implementing LSTM neural networks. Eciton showcases the ability to conduct real-time inference for LSTM neural network models of practical size, all while operating within a power constraint of 17 mW. In comparison to FPGA implementations that demand higher power consumption, Eciton delivers competitive performance. This is achieved through the utilization of 8-b fixed-point weight quantization, hard sigmoid activation functions, and a meticulously optimized microarchitecture, effectively minimizing chip resource and memory demands. Although these quantization techniques lead to a slight accuracy reduction of approximately 5% when assessed on real-world predictive maintenance LSTM models consisting of 3 to 4 layers, the advantage of low resource requirements permits Eciton to be accommodated within a cost-effective, low-power Lattice iCE40 UP5K FPGA.

D. Transformer-Based and Diffusion-Based Models

Transformer-based models have gained a significant amount of attention in recent years because of their outstanding results on NLP problems. The transformer architecture was first described in [91] by Vaswani et al. It uses self-attention mechanisms to capture dependencies between different input data elements, enabling parallel processing of sequences and reducing the sequential nature of conventional RNNs. Transformer-based models can be applied to tasks such as accelerator optimization, automatic machine learning, and compiler optimization [92]. Diffusion-based models are a type of probabilistic model that propagates information across data points through repetitive processes. These models have found use in a variety of domains, such as image denoising, data imputation and generation tasks, data-driven accelerator design, and neural architecture search [93], [94]. Zhao et al. [95] introduce a transformer accelerator utilizing an output block stationary (OBS) dataflow to optimize memory access and improve DSP utilization, resulting in higher energy efficiency.
By minimizing repeated memory access and employing block-level and vector-level broadcasting, the accelerator achieves reduced memory access bandwidth for input and output. The FPGA-based verification of the proposed accelerator demonstrates impressive performance, with a throughput of 728.3 GOPs and an energy efficiency of 58.31 GOPs/W when evaluating a transformer-in-transformer (TNT) model. Cheng et al. [96] present a novel transformer-based model for signal detection in a multiuser molecular communication (MMC) system. The model is trained using received data generated with varying initial distances between transmitters and receivers. The numerical results demonstrate that the trained transformer-based model exhibits excellent convergence and outperforms the traditional DNN in terms of signal detection, achieving a lower bit error rate.

E. Large Language Models (LLMs)

An LLM accelerator is a specialized hardware or software component designed to enhance the performance of LLMs in NLP tasks. LLMs, such as OpenAI's GPT-3 and BERT, have demonstrated remarkable capabilities in understanding and generating human-like text, but they come with substantial computational requirements, making them resource-intensive and time-consuming to run on standard hardware [97]. These accelerators leverage techniques like parallel processing, optimized memory access, and specialized circuit designs to improve the overall efficiency of language model computations. The LLM accelerator has become crucial in a wide range of applications, including chatbots, language translation, text summarization, and sentiment analysis [98]. Maddigan and Susnjak [99] introduce an innovative system called Chat2VIS, which harnesses the capabilities of LLMs. Through effective prompt engineering, Chat2VIS demonstrates a more efficient solution for language understanding, resulting in simpler and more accurate end-to-end outcomes compared to previous methods. The research reveals that Chat2VIS, utilizing LLMs and proposed prompts, offers a reliable approach to generating visualizations from natural language queries, even when queries are imprecise or insufficiently specified. Moreover, this solution significantly reduces development costs for natural language interface systems while achieving superior visualization inference abilities compared to traditional NLP approaches that rely on handcrafted grammar rules and tailored models.

F. Performance Comparison of Different Methods

Table I provides a comprehensive summary of various machine learning hardware accelerators, highlighting their key features, performance metrics, and targeted applications. It aims to offer an overview of the latest advancements in ML hardware acceleration, assisting researchers, developers, and technology enthusiasts in understanding the landscape of available solutions and their respective strengths. By analyzing the characteristics and capabilities of different accelerators, readers can make informed decisions regarding the most suitable hardware for their specific ML requirements.

VI. EVALUATION

An evaluation of a machine learning accelerator is significant for validating any design. The evaluation is divided into training and testing evaluations and hardware evaluations. Each evaluation parameter is described as follows.

A. Training and Testing Evaluation

In machine learning classification models, performance measures are used to evaluate how well the models perform in a specific context. This evaluation helps to improve machine learning classification models. Some of the performance metrics are accuracy, sensitivity, specificity, precision, F1 score (tension), and loss function. Model performance is critical for machine learning because it allows us to understand the strengths and limitations of these models when making predictions in new situations. True positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs) are commonly used performance measures for evaluating the performance of classification models. TP refers to the number of correctly predicted positive cases, while TN is the number of correctly predicted negative cases. FPs are the number of negative cases that were incorrectly predicted as positive, and FNs are the number of positive cases that were incorrectly predicted as negative. These metrics are typically used to calculate other performance measures, such as accuracy, precision, recall, and F1 score [66], [100]. The evaluation parameters are given by (11)-(20).

1) Accuracy: A test's accuracy is measured by its ability to differentiate classes accurately [55], [66], [100]. It indicates the quality of the result for a given task. Accuracy can be calculated using the following equation:

Accuracy = (TP + TN) / (TP + TN + FP + FN).    (11)

More complex DNN models typically require more computations and more memory resources to process the input data, which can lead to slower processing times and higher resource utilization. This is especially true for hardware implementations of DNNs, where the processing capabilities and resources are more limited compared to software implementations. As a result, there is often a tradeoff between model complexity, accuracy, hardware performance, and efficiency [43].

2) Sensitivity: It can be defined as the TP rate, which measures the ratio between the number of classes that were correctly identified and the total number of TPs and FNs [55], [66], [100]. Sensitivity is given by the following equation:

Sensitivity = TP / (TP + FN).    (12)

3) Precision: It is the positive predictive value. It measures the proportion of TP predictions to the total positive predictions produced by the model. A high precision indicates that the model is good at avoiding FPs [55], [66], [100]. Precision is given by the following equation:

Precision = TP / (TP + FP).    (13)
TABLE I
SUMMARY OF MACHINE LEARNING AND DEEP LEARNING HARDWARE ACCELERATORS
(Columns: Ref. | Method | Power | Area | Performance | Dataset | Accuracy | Hardware Device)

[54] Method: A fully connected feedforward DNN with a customizable number of layers, neurons per layer, and inputs is performed by the neural network architecture using just one physical processing layer.
    1) Advantages: Adequate recognition performance can be achieved with relatively modest network sizes, resulting in increased performance while consuming fewer hardware resources and power.
    2) Limitations: Compared to other related works, the performance of the floating-point DNN in this architecture on the MNIST dataset is comparatively lower than that of fixed-point and binary-based neural networks.
    Power: N/A. Area: N/A. Performance: 15.90 kFPS. Dataset: MNIST handwritten image. Accuracy: 98.16%. Hardware device: FPGA.

[86] Method: An energy-efficient LSTM RNN accelerator with hierarchical coarse-grain sparsity memory compression, an algorithm-hardware cooptimized memory compression method (HCGS).
    1) Advantages: Comparing the hierarchical blockwise sparsity technique to earlier research shows advantageous error rate and memory compression tradeoffs. It has a high MAC efficiency reaching 99.66%.
    2) Limitations: It has higher power consumption than some existing methods.
    Power: 67.3/1.85 mW. Area: 7.74 mm². Performance: 1) 8.93 TOPS/W for a two-layer LSTM on the TIMIT dataset; 2) 7.22/7.24 TOPS/W for a three-layer LSTM on the TED-LIUM/LibriSpeech datasets. Dataset: TIMIT, TED-LIUM, and LibriSpeech. Accuracy: 20.6% PER for TIMIT, 21.3% WER for TED-LIUM, and 11.4% WER for LibriSpeech. Hardware device: N/A.

[74] Method: The new DNN design framework APNAS emphasizes accuracy and efficiency during neural architecture search.
    1) Advantages: APNAS is capable of generating DNNs with fewer parameters (i.e., cycle count) while maintaining relatively high accuracy compared to state-of-the-art NAS techniques. This is achieved by adjusting the weight of the RNN to account for cycle count, allowing APNAS to successfully trade off accuracy and cycle count.
    2) Limitations: This model has less accuracy than other state-of-the-art NAS techniques.
    Power: N/A. Area: N/A. Performance: It offers an average of 53% fewer cycles than state-of-the-art techniques. Dataset: CIFAR-10 and CIFAR-100. Accuracy: 93.75%. Hardware device: FPGA.

[58] Method: A PIMCA for DNN inference with low precision (1-2 b).
    Advantages: By employing this method, a significant reduction of up to 73% in the total program size is achieved, resulting in fewer cycle counts and ultimately leading to improved energy efficiency.
    Power: 124 mW. Area: 20.9 mm². Performance: A peak energy efficiency of 437 TOPS/W and a peak throughput of 49 TOPS at a 42-MHz clock frequency. Dataset: CIFAR-10. Accuracy: 1-/1-b VGG-9: 83.20%; 1-/1-b ResNet-18: 83.48%; 2-/2-b ResNet-18: 86.48%. Hardware device: FPGA.

[95] Method: A transformer accelerator utilizing an OBS dataflow, resulting in higher energy efficiency.
    Advantages: The proposed OBS dataflow reduces the power consumption of the BRAM, which leads to an overall power reduction of 33%. OBS also lowers the input reading and output writing bandwidth.
    Power: 80 mW. Area: N/A. Performance: Throughput of 728.3 GOPs and an energy efficiency of 58.31 GOPs/W. Dataset: ImageNet. Accuracy: 79.5%. Hardware device: FPGA.
4) Specificity: It calculates the percentage of actual negative cases that the classifier correctly classifies as negative. It is often referred to as the TN rate [30], [34]. Specificity is calculated by the following equation:

Specificity = TN / (TN + FP).    (14)

5) F1 Score or Tension: It measures the balanced relationship between sensitivity and precision [34]. The F1 score is calculated by the following equation:

F1 Score (Tension) = (2 · Sensitivity · Precision) / (Sensitivity + Precision).    (15)
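For quick reference, the metrics in (11)-(15) can all be computed directly from the four confusion-matrix counts. The counts in the sketch below are made-up example values, not results from any of the surveyed works.

# Illustrative computation of (11)-(15) from confusion-matrix counts.
def classification_metrics(tp, tn, fp, fn):
    accuracy    = (tp + tn) / (tp + tn + fp + fn)                    # (11)
    sensitivity = tp / (tp + fn)                                     # (12), true positive rate
    precision   = tp / (tp + fp)                                     # (13), positive predictive value
    specificity = tn / (tn + fp)                                     # (14), true negative rate
    f1 = 2 * sensitivity * precision / (sensitivity + precision)     # (15)
    return accuracy, sensitivity, precision, specificity, f1

print(classification_metrics(tp=90, tn=85, fp=15, fn=10))
# -> (0.875, 0.9, 0.857..., 0.85, 0.878...)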
6) Loss Function: The evaluation of how well an algorithm models a dataset involves a mathematical function that depends on the machine learning algorithm's parameters. This function, known as a loss function, plays a crucial role in the training process and the results obtained from any deep learning methodology. Loss functions are typically categorized as either regression loss or classification loss. Regression loss functions, used in regression neural networks, predict an output value from an input value rather than preselected labels; examples are mean squared error (MSE) and mean absolute error (MAE). On the other hand, classification neural networks use classification loss functions, which allow selecting the category with the highest probability of the input belonging to it, such as binary cross-entropy and categorical cross-entropy. Each one is described as follows.

1) MSE: It is also known as L2 loss. MSE calculates the average of the squared differences between the predicted and actual values across the entire dataset. MSE is calculated as follows:

MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)².    (16)

MSE is sensitive toward outliers; given multiple examples with the same input feature values, the ideal prediction is the mean target value. This function is ideal for calculating loss due to its many features. The difference is squared, so the predicted value might be above or below the target value, but big errors are penalized. MSE is a convex function with a global minimum, making gradient descent optimization easier to use to select weight values.

2) MAE: MAE is also known as L1 loss. MAE represents the difference between the target and predicted values, extracted by averaging the absolute difference over the dataset. MAE is calculated as follows:

MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|.    (17)

MAE is a robust metric that is not significantly affected by outliers. In cases where multiple samples have the same input feature values, MAE chooses the median target value as the best prediction. Compare this to MSE, where the mean represents the best prediction. MAE's limitation is that its gradient magnitude depends only on the sign of the difference between the predicted and actual values, not the error size. This results in large gradient magnitudes even for small errors, which can lead to convergence problems. Because of this, a loss function known as the Huber loss was developed. This loss function combines the benefits of MSE and MAE into a single package. We can define it using the following function:

Huber Loss = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²,            if |y_i − ŷ_i| ≤ δ
           = (1/n) Σ_{i=1}^{n} δ (|y_i − ŷ_i| − δ/2),    if |y_i − ŷ_i| > δ.    (18)

In (18), the delta hyperparameter (δ) defines the range for MAE and MSE.

3) Binary cross-entropy (log loss): Cross-entropy loss is also called logarithmic loss, log loss, or logistic loss. This is the loss function used in binary classification models, which take in an input and should classify it into one of two predefined categories. Classification neural networks output a vector of probabilities, the probability that the input fits into each preset category, and pick the category with the highest probability as the final output:

CE Loss = −(1/n) Σ_{i=1}^{n} [ y_i · log(p_i) + (1 − y_i) · log(1 − p_i) ].    (19)

In binary classification, the actual value of y can only be 0 or 1. To accurately determine the loss between actual and predicted values, it is necessary to compare the actual value (0 or 1) to the probability that the input aligns with that category [p(i) = probability that the category is 1; 1 − p(i) = probability that the category is 0].

4) Categorical cross-entropy: In multiclass classification tasks, where an example can only belong to one of several possible categories, categorical cross-entropy is commonly used. This function is designed to measure the difference between two probability distributions. We use categorical cross-entropy when the number of classes is more than two. Binary cross-entropy is a special case of categorical cross-entropy, where M = 2, and M is the number of categories:

CE = −(1/n) Σ_{i=1}^{n} Σ_{j=1}^{M} y_ij · log(p_ij).    (20)
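A minimal software sketch of the loss functions in (16)-(20) is given below. The example inputs and the default δ are arbitrary, and the Huber branch follows the piecewise form as written in (18).

import math

# Illustrative implementations of (16)-(20) for small Python lists.
def mse(y, y_hat):                                    # (16)
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def mae(y, y_hat):                                    # (17)
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def huber(y, y_hat, delta=1.0):                       # (18), as written above
    total = 0.0
    for a, b in zip(y, y_hat):
        e = abs(a - b)
        total += e ** 2 if e <= delta else delta * (e - delta / 2)
    return total / len(y)

def binary_cross_entropy(y, p):                       # (19)
    return -sum(a * math.log(q) + (1 - a) * math.log(1 - q)
                for a, q in zip(y, p)) / len(y)

def categorical_cross_entropy(y, p):                  # (20): y, p are n x M lists of lists
    n = len(y)
    return -sum(y[i][j] * math.log(p[i][j])
                for i in range(n) for j in range(len(y[i]))) / n

# Tiny regression and classification examples.
print(mse([1.0, 2.0], [1.5, 1.0]), mae([1.0, 2.0], [1.5, 1.0]))
print(binary_cross_entropy([1, 0], [0.9, 0.2]))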
B. Hardware Evaluation

Various metrics can be used to evaluate hardware systems, such as power consumption, area, throughput, and latency. These metrics are helpful in comparing and assessing the advantages and limitations of different designs.

1) Energy Efficiency and Power Consumption: The efficiency of energy usage is a measure of the amount of data that can be processed or the number of tasks that can be performed per unit of energy. This is particularly important when processing DNNs on embedded devices at the edge. Power consumption is the amount of energy consumed during a given period. The thermal design power (TDP) is a design criterion that determines the maximum power consumption, which is the amount of power that the cooling system can dissipate due to increased power consumption.

2) Area: The size of each PE and the total area cost of the system together determine the optimal number of PEs. If the area cost of the system remains the same, increasing the number of PEs will necessitate either decreasing the amount of space required for each PE or exchanging some of the on-chip storage areas for additional PEs. However, decreasing the amount of on-chip storage can have an impact on how PEs are utilized. You can also reduce the area per PE by reducing the logic needed to send operands to a MAC [43].
3) Throughput: Throughput refers to the amount of data that can be transmitted or processed within a specific time frame. It is a key performance metric used to evaluate the efficiency and performance of network connections or data processing systems, as it indicates how many packets or messages can successfully reach their destination. Throughput is commonly measured in bits per second (bps) and is often expressed in units of megabits per second (Mbps) or gigabits per second (Gbps). A higher throughput indicates a more efficient network or system, while a lower throughput can indicate performance issues or bottlenecks [32], [44].
4) Latency: Latency indicates how long it takes for packets to reach their destination. In a network, throughput and latency are directly related. Applications that require real-time interaction, such as augmented reality, autonomous navigation, and robotics, require low latency in order to work correctly. Throughput and latency often trade off against each other because the maximum throughput of a conversation, i.e., a data exchange from one point to another, is determined by the level of latency. Thus, depending on the approach, achieving high throughput and low latency simultaneously can sometimes be incompatible, and both metrics should be reported [43]. Latency is measured in milliseconds (ms).
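To make this throughput–latency tension concrete, the sketch below computes batch latency from an assumed per-inference operation count and sustained throughput; the numbers are hypothetical and only illustrate why batching can raise throughput while lengthening the wait for an individual result.

def batch_latency_ms(batch, ops_per_inference, throughput_gops):
    # Time to finish one batch: total work divided by sustained throughput.
    return batch * ops_per_inference / (throughput_gops * 1e9) * 1e3

# Hypothetical accelerator: 2 GOP per inference, 500 GOPS sustained at
# batch size 8, but only 100 GOPS at batch size 1 (under-utilized PEs).
print(batch_latency_ms(1, 2e9, 100.0))  # 20 ms for 1 result (50 inferences/s)
print(batch_latency_ms(8, 2e9, 500.0))  # 32 ms for 8 results (250 inferences/s)

The larger batch multiplies the delivered throughput but makes every request in the batch wait longer, which is exactly the trade-off that latency-critical applications cannot tolerate.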
5) Analysis: Our AI accelerator survey begins with power usage and throughput comparisons. In Fig. 10, a comprehensive examination of power consumption, quantified in watts, is juxtaposed against the rate of operations executed per second, measured in giga operations per second (GOPS). As part of our investigation, we derived throughput figures for specific articles by multiplying power and power efficiency. The observed trend reveals that contemporary accelerators predominantly align with the throughput trendline situated at 1 TOPS. Notably, accelerators with a low-power design exhibit a discernible pattern: their power consumption typically exceeds the threshold of 0.1 watts, while simultaneously showcasing a throughput surpassing 1 GOPS. It is worth highlighting that only a limited number of accelerators fall beneath these specified benchmarks.

Fig. 11 presents each accelerator's power efficiency together with the year in which it was first published. We calculate the power efficiency of those articles that did not report it by dividing throughput by power. In the previous two years, the power efficiency of AI accelerators has ranged from a minimum of more than 50 GOPS/W to a maximum of more than 70 TOPS/W. Among the surveyed accelerators, we observe that FPGA implementations have higher power efficiency than other implementations. According to the data, no major new developments have been produced that significantly affect power, power efficiency, or throughput when compared to previous years.
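The derivations used for Figs. 10 and 11 are simple arithmetic; the sketch below shows the two directions in which they are applied, with made-up numbers standing in for values read from a surveyed article.

def power_efficiency_gops_per_w(throughput_gops, power_w):
    # Power efficiency as plotted in Fig. 11: throughput divided by power.
    return throughput_gops / power_w

def derived_throughput_gops(power_w, efficiency_gops_per_w):
    # Throughput derived for Fig. 10 when an article reports only power
    # and power efficiency: their product.
    return power_w * efficiency_gops_per_w

print(derived_throughput_gops(2.0, 500.0))      # 1000 GOPS = 1 TOPS
print(power_efficiency_gops_per_w(800.0, 0.4))  # 2000 GOPS/W = 2 TOPS/W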
C. Future Machine Learning Accelerator Designs

Future machine learning accelerator designs face several challenges as AI applications continue to grow in complexity and scale. Here are some insights and suggestions to address these challenges.
1) Leveraging reconfigurable designs: By combining reconfigurable designs with optimization strategies such as parallel processing, dynamic resource allocation, and area optimization, it becomes possible to increase the speed of machine learning accelerators while minimizing costs and maintaining flexibility to adapt to varying workloads and applications. The reconfigurable designs proposed encompass nodes that can seamlessly transition between different layers, thereby increasing network speed and achieving specific performance objectives. This adaptability enables optimization by allowing the number of on-chip layers to be configured and updated, offering a versatile approach to resource allocation. Furthermore, a reduction in the number of adders and multipliers is integrated, leading to a decrease in computational operations. The successful integration of these design elements yields a solution that excels in both efficiency and resource utilization.
2) Power efficiency: Energy consumption is a major concern for AI systems, especially in mobile and edge computing
scenarios. Improving power efficiency through techniques like quantization, sparsity, and specialized memory architectures will be vital. Data reuse is an effective approach for reducing the energy consumption of data transfer. This requires moving data once from a remote, large memory source (such as an off-chip DRAM) and then using it for multiple operations from a nearby, smaller memory location (such as an on-chip buffer or a PE's scratchpad). The optimization of data movement holds substantial importance in the overall design of DNN processors. Furthermore, by reducing the number of adders and multipliers, the system executes fewer computational operations, leading to decreased energy consumption.
3) Model size and complexity: State-of-the-art AI models are becoming larger and more complex, demanding significant computational power and memory resources. Future accelerators need to be scalable to handle these large models efficiently. The optimization involves condensing layers, such as combining two layers to function as effectively as four, thereby enhancing performance. Additionally, simplifying units and reducing the number of pooling layers results in a reduced overall area footprint.
4) Diverse workloads: With the increasing diversity of AI workloads, designing accelerators that can efficiently handle various tasks is essential. Addressing the diversity of AI workloads requires a multifaceted approach that encompasses strategies and methodologies such as quantization (reducing the precision of weights and activations) and pruning (removing less significant connections) to reduce the computational requirements of AI models; a small sketch of both techniques is given after this list. This can make models more versatile and adaptable to different workloads.
5) Real-time inference: Many AI applications require real-time or low-latency replies. Future accelerators must face the challenge of offering rapid inference while maintaining high accuracy, especially in time-sensitive fields such as autonomous vehicles and robotics. Implementing parallel processing techniques to execute multiple tasks simultaneously can lead to substantial reductions in response times for AI applications. Also, utilizing edge devices with processing capabilities to perform computations locally reduces the need for data to be sent back and forth to a centralized server.
6) Bottlenecks in data transfer: Addressing memory access and data transfer bottlenecks in the accelerator can be achieved through various strategies. Reusing data in calculations helps minimize the need for frequent data transfers. Processing data in batches reduces the frequency of memory accesses and transfers, thereby enhancing overall efficiency. Additionally, employing cache memory to store frequently accessed data mitigates the impact of slow memory access. Data compression algorithms can also be employed to reduce the volume of data transferred, leading to improved performance. Utilizing direct memory access (DMA) controllers allows data transfer tasks to be offloaded from the CPU, enabling it to focus on computation. Furthermore, structuring algorithms and data layouts to enhance spatial locality can further reduce the frequency of memory accesses. These combined approaches can effectively alleviate memory access and data transfer bottlenecks, ultimately enhancing the performance of the accelerator.
7) Hardware–software co-design: Tight collaboration between hardware and software teams is required to extract maximum performance from accelerators. Codesign efforts can result in improved hardware–software integration and targeted optimizations. Future AI accelerators may increasingly adopt neuromorphic computing principles, mimicking the brain's architecture. Also, as quantum computing advances, algorithms and software frameworks will need to be tailored for quantum hardware.
8) Heterogeneous computing: To strike a balance between performance and energy efficiency, heterogeneous computing designs incorporating several types of accelerators (e.g., CPUs, GPUs, and TPUs) may become more common. Each type of processor can be optimized for specific types of computations.
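As referenced under the diverse-workloads item above, the following is a minimal NumPy sketch of the two model-compression techniques mentioned there, uniform weight quantization and magnitude-based pruning; the function names, bit width, and sparsity level are illustrative choices and are not taken from any surveyed accelerator.

import numpy as np

def quantize_uniform(w, bits=8):
    # Uniform symmetric quantization: map float weights to signed integers
    # with a single per-tensor scale; w is recovered approximately as q * scale.
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def prune_by_magnitude(w, sparsity=0.5):
    # Magnitude pruning: zero out the smallest-magnitude weights until the
    # requested fraction has been removed.
    threshold = np.quantile(np.abs(w), sparsity)
    mask = np.abs(w) > threshold
    return w * mask, mask

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, scale = quantize_uniform(w, bits=8)
w_pruned, mask = prune_by_magnitude(w, sparsity=0.5)
print("max quantization error:", np.max(np.abs(w - q * scale)))
print("kept weights:", int(mask.sum()), "of", mask.size)

Reduced-precision weights shrink both storage and MAC width, while the pruning mask allows a sparsity-aware accelerator to skip zero operands entirely.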
Addressing these challenges would necessitate ongoing research and development in both the hardware and software areas. Collaboration between academia, industry, and the open-source community will be critical to advancing machine learning accelerator designs that match the needs of tomorrow's AI landscape.
VII. CONCLUSION

Machine learning is involved in most current domains, such as IoT environments and biomedical systems. The main challenge is to design a machine learning hardware accelerator with high speed and performance at a low cost. This article investigated different hardware accelerator structures: ANN, CNN, and RNN. It described the existing approaches with a comparison that shows the features and limitations of each method. This article also presented the current challenges in designing machine learning accelerators. We highlighted the evaluation parameters of both the learning and hardware sides, such as accuracy, sensitivity, area, speed, throughput, and energy consumption. Thus, this article presented a complete survey on machine learning hardware accelerators to help new researchers and designers in the field. For future research, the hardware accelerator can include reconfiguration features to make it suitable for multiple applications; the reconfiguration process can be done online based on application criteria. Also, a hardware accelerator might be implemented using mixed circuits to obtain the benefits of both analog and digital designs. Furthermore, some hardware components can be shared to support multiple operations to save area on a chip.
REFERENCES

[1] A. B. Nassif, I. Shahin, I. Attili, M. Azzeh, and K. Shaalan, “Speech recognition using deep neural networks: A systematic review,” IEEE Access, vol. 7, pp. 19143–19165, 2019.
[2] S. Dua et al., “Developing a speech recognition system for recognizing tonal speech signals using a convolutional neural network,” Appl. Sci., vol. 12, no. 12, 2022, Art. no. 6223.
[3] M. Chun, H. Jeong, H. Lee, T. Yoo, and H. Jung, “Development of Korean food image classification model using public food image dataset and deep learning methods,” IEEE Access, vol. 10, pp. 128732–128741, 2022.
[4] C. T. Sari and C. Gunduz-Demir, “Unsupervised feature extraction via deep learning for histopathological classification of colon tissue images,” IEEE Trans. Med. Imag., vol. 38, no. 5, pp. 1139–1149, May 2019.
[5] K. Khalil, O. Eldash, A. Kumar, and M. Bayoumi, “Machine learning-based approach for hardware faults prediction,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 11, pp. 3880–3892, Nov. 2020.
[6] R. Malhotra, “A systematic review of machine learning techniques for software fault prediction,” Appl. Soft Comput., vol. 27, pp. 504–518, 2015.
[7] K. Khalil, O. Eldash, A. Kumar, and M. Bayoumi, “Intelligent fault-prediction assisted self-healing for embryonic hardware,” IEEE Trans. Biomed. Circuits Syst., vol. 14, no. 4, pp. 852–866, Aug. 2020.
[8] L.-Q. Zuo, H.-M. Sun, Q.-C. Mao, R. Qi, and R.-S. Jia, “Natural scene text recognition based on encoder-decoder framework,” IEEE Access, vol. 7, pp. 62616–62623, 2019.
[9] Y. Xu, Y. Wang, W. Zhou, Y. Wang, Z. Yang, and X. Bai, “TextField: Learning a deep direction field for irregular scene text detection,” IEEE Trans. Image Process., vol. 28, no. 11, pp. 5566–5579, Nov. 2019.
[10] U. P. Singh, S. S. Chouhan, S. Jain, and S. Jain, “Multilayer convolution neural network for the classification of mango leaves infected by anthracnose disease,” IEEE Access, vol. 7, pp. 43721–43729, 2019.
[11] K. Li, J. Daniels, C. Liu, P. Herrero, and P. Georgiou, “Convolutional recurrent neural networks for glucose prediction,” IEEE J. Biomed. Health Inform., vol. 24, no. 2, pp. 603–613, Feb. 2020.
[12] C. N. Freitas, F. R. Cordeiro, and V. Macario, “MyFood: A food segmentation and classification system to aid nutritional monitoring,” in Proc. 33rd SIBGRAPI Conf. Graph. Patterns Images (SIBGRAPI), Piscataway, NJ, USA: IEEE Press, 2020, pp. 234–239.
[13] H. Jelodar, Y. Wang, R. Orji, and S. Huang, “Deep sentiment classification and topic discovery on novel coronavirus or COVID-19 online discussions: NLP using LSTM recurrent neural network approach,” IEEE J. Biomed. Health Inform., vol. 24, no. 10, pp. 2733–2742, Oct. 2020.
[14] M. Li, W. Hsu, X. Xie, J. Cong, and W. Gao, “SACNN: Self-attention convolutional neural network for low-dose CT denoising with self-supervised perceptual loss network,” IEEE Trans. Med. Imag., vol. 39, no. 7, pp. 2289–2301, Jul. 2020.
[15] B. Dey et al., “SEM image denoising with unsupervised machine learning for better defect inspection and metrology,” in Proc. Metrol. Inspection Process Control Semicond. Manuf. XXXV, vol. 11611, Bellingham, WA, USA: SPIE, 2021, pp. 245–254.
[16] B. Dey et al., “Unsupervised machine learning based SEM image denoising for robust contour detection,” in Proc. Int. Conf. Extreme Ultraviolet Lithography, vol. 11854, Bellingham, WA, USA: SPIE, 2021, pp. 88–102.
[17] Y. Liu et al., “Graph self-supervised learning: A survey,” IEEE Trans. Knowl. Data Eng., vol. 35, no. 6, pp. 5879–5900, Jun. 2023.
[18] X. Wang, D. Kihara, J. Luo, and G.-J. Qi, “EnAET: A self-trained framework for semi-supervised and supervised learning with ensemble transformations,” IEEE Trans. Image Process., vol. 30, pp. 1639–1647, 2020.
[19] S. Ahmed, Y. Lee, S.-H. Hyun, and I. Koo, “Unsupervised machine learning-based detection of covert data integrity assault in smart grid networks utilizing isolation forest,” IEEE Trans. Inf. Forensics Secur., vol. 14, no. 10, pp. 2765–2777, Oct. 2019.
[20] A. Uprety and D. B. Rawat, “Reinforcement learning for IoT security: A comprehensive survey,” IEEE Internet Things J., vol. 8, no. 11, pp. 8693–8706, Jun. 2020.
[21] H. Xu, A. D. Domínguez-García, and P. W. Sauer, “Optimal tap setting of voltage regulation transformers using batch reinforcement learning,” IEEE Trans. Power Syst., vol. 35, no. 3, pp. 1990–2001, May 2020.
[22] M. Saharkhizan, A. Azmoodeh, A. Dehghantanha, K.-K. R. Choo, and R. M. Parizi, “An ensemble of deep recurrent neural networks for detecting IoT cyber attacks using network traffic,” IEEE Internet Things J., vol. 7, no. 9, pp. 8852–8859, Sep. 2020.
[23] P. Goswami, A. Mukherjee, M. Maiti, S. K. S. Tyagi, and L. Yang, “A neural-network-based optimal resource allocation method for secure IIoT network,” IEEE Internet Things J., vol. 9, no. 4, pp. 2538–2544, Feb. 2022.
[24] M. Woźniak, J. Siłka, M. Wieczorek, and M. Alrashoud, “Recurrent neural network model for IoT and networking malware threat detection,” IEEE Trans. Ind. Informat., vol. 17, no. 8, pp. 5583–5594, Aug. 2021.
[25] Z. Li, F. Liu, W. Yang, S. Peng, and J. Zhou, “A survey of convolutional neural networks: Analysis, applications, and prospects,” IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 12, pp. 6999–7019, Dec. 2022.
[26] S. Kiranyaz, O. Avci, O. Abdeljaber, T. Ince, M. Gabbouj, and D. J. Inman, “1D convolutional neural networks and applications: A survey,” Mech. Syst. Signal Process., vol. 151, 2021, Art. no. 107398.
[27] V. Veerasamy et al., “LSTM recurrent neural network classifier for high impedance fault detection in solar PV integrated power system,” IEEE Access, vol. 9, pp. 32672–32687, 2021.
[28] K. Khalil, O. Eldash, A. Kumar, and M. Bayoumi, “Economic LSTM approach for recurrent neural networks,” IEEE Trans. Circuits Syst., II, Exp. Briefs, vol. 66, no. 11, pp. 1885–1889, Nov. 2019.
[29] O. I. Abiodun et al., “Comprehensive review of artificial neural network applications to pattern recognition,” IEEE Access, vol. 7, pp. 158820–158846, 2019.
[30] K. Khalil, O. Eldash, A. Kumar, and M. Bayoumi, “An efficient approach for neural network architecture,” in Proc. 25th IEEE Int. Conf. Electron. Circuits Syst. (ICECS), Piscataway, NJ, USA: IEEE Press, 2018, pp. 745–748.
[31] K. Khalil, O. Eldash, B. Dey, A. Kumar, and M. Bayoumi, “Architecture of a novel low-cost hardware neural network,” in Proc. IEEE 63rd Int. Midwest Symp. Circuits Syst. (MWSCAS), Piscataway, NJ, USA: IEEE Press, 2020, pp. 1060–1063.
[32] E. Wang et al., “Deep neural network approximation for custom hardware: Where we’ve been, where we’re going,” ACM Comput. Surveys (CSUR), vol. 52, no. 2, pp. 1–39, 2019.
[33] K. Khalil, B. Dey, M. Abdelrehim, A. Kumar, and M. Bayoumi, “An efficient reconfigurable neural network on chip,” in Proc. 28th IEEE Int. Conf. Electron. Circuits Syst. (ICECS), Piscataway, NJ, USA: IEEE Press, 2021, pp. 1–4.
[34] K. Khalil, O. Eldash, A. Kumar, and M. Bayoumi, “N2OC: Neural-network-on-chip architecture,” in Proc. 32nd IEEE Int. System-on-Chip Conf. (SOCC), Piscataway, NJ, USA: IEEE Press, 2019, pp. 272–277.
[35] K. Khalil, O. Eldash, B. Dey, A. Kumar, and M. Bayoumi, “A novel reconfigurable hardware architecture of neural network,” in Proc. IEEE 62nd Int. Midwest Symp. Circuits Syst. (MWSCAS), Piscataway, NJ, USA: IEEE Press, 2019, pp. 618–621.
[36] M. A. Rajput, S. Alyami, Q. A. Ahmed, H. Alshahrani, Y. Asiri, and A. Shaikh, “Improved learning-based design space exploration for approximate instance generation,” IEEE Access, vol. 11, pp. 18291–18299, 2023.
[37] G. Armeniakos, G. Zervakis, D. Soudris, and J. Henkel, “Hardware approximate techniques for deep neural network accelerators: A survey,” ACM Comput. Surveys, vol. 55, no. 4, pp. 1–36, 2022.
[38] K. Khalil, A. Kumar, and M. Bayoumi, “Low-power convolutional neural network accelerator on FPGA,” in Proc. IEEE 5th Int. Conf. Artif. Intell. Circuits Syst. (AICAS), Piscataway, NJ, USA: IEEE Press, 2023, pp. 1–5.
[39] C. Åleskog, H. Grahn, and A. Borg, “Recent developments in low-power AI accelerators: A survey,” Algorithms, vol. 15, no. 11, 2022, Art. no. 419.
[40] M. Giordano, L. Piccinelli, and M. Magno, “Survey and comparison of milliwatts micro controllers for tiny machine learning at the edge,” in Proc. IEEE 4th Int. Conf. Artif. Intell. Circuits Syst. (AICAS), Piscataway, NJ, USA: IEEE Press, 2022, pp. 94–97.
[41] S. S. Saha, S. S. Sandha, and M. Srivastava, “Machine learning for microcontroller-class hardware: A review,” IEEE Sens. J., vol. 22, no. 22, pp. 21362–21390, Nov. 2022.
[42] K. Khalil, T. Mohaidat, and M. Bayoumi, “Low-cost hardware design approach for long short-term memory (LSTM),” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Piscataway, NJ, USA: IEEE Press, 2023, pp. 1–5.
[43] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, “How to evaluate deep neural network processors: TOPS/W (alone) considered harmful,” IEEE Solid-State Circuits Mag., vol. 12, no. 3, pp. 28–41, Summer 2020.
[44] N. Gupta, “Introduction to hardware accelerator systems for artificial intelligence and machine learning,” in Hardware Accelerator Systems for Artificial Intelligence and Machine Learning, S. Kim and G. C. Deka, Eds., Elsevier, 2021, ch. 1, vol. 122, pp. 1–21. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0065245820300541
[45] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner, “Survey of machine learning accelerators,” in Proc. IEEE High Perform. Extreme Comput. Conf. (HPEC), Piscataway, NJ, USA: IEEE Press, 2020, pp. 1–12.
[46] M. F. Hashmi, R. Pal, R. Saxena, and A. G. Keskar, “A new approach for real time object detection and tracking on high resolution and multi-camera surveillance videos using GPU,” J. Central South Univ., vol. 23, pp. 130–144, 2016.
[47] M. Capra, B. Bussolino, A. Marchisio, G. Masera, M. Martina, and M. Shafique, “Hardware and software optimizations for accelerating deep neural networks: Survey of current trends, challenges, and the road ahead,” IEEE Access, vol. 8, pp. 225134–225180, 2020.
[48] Z. Qi, W. Chen, R. A. Naqvi, and K. Siddique, “Designing deep learning hardware accelerator and efficiency evaluation,” Comput. Intell. Neurosci., vol. 2022, 2022, Art. no. 1291103.
[49] S. Bavikadi et al., “A survey on machine learning accelerators and evolutionary hardware platforms,” IEEE Des. Test, vol. 39, no. 3, pp. 91–116, Jun. 2022.
[50] Z. Zhang, K. Zhang, and A. Khelifi, Multivariate Time Series Analysis in Climate and Environmental Research. Springer, 2018.
[51] B. Dey, K. Khalil, A. Kumar, and M. Bayoumi, “A reversible-logic based architecture for artificial neural network,” in Proc. IEEE 63rd Int. Midwest Symp. Circuits Syst. (MWSCAS), Piscataway, NJ, USA: IEEE Press, 2020, pp. 505–508.
[52] K. Khalil, O. Eldash, A. Kumar, and M. Bayoumi, “Self-healing approach for hardware neural network architecture,” in Proc. IEEE 62nd Int. Midwest Symp. Circuits Syst. (MWSCAS), Piscataway, NJ, USA: IEEE Press, 2019, pp. 622–625.
[53] K. Khalil, A. Kumar, and M. Bayoumi, “Reconfigurable hardware design approach for economic neural network,” IEEE Trans. Circuits Syst., II, Exp. Briefs, vol. 69, no. 12, pp. 5094–5098, Dec. 2022.
[54] T. V. Huynh, “Deep neural network accelerator based on FPGA,” in Proc. 4th NAFOSTED Conf. Inf. Comput. Sci., Piscataway, NJ, USA: IEEE Press, 2017, pp. 254–257.
[55] L. D. Medus, T. Iakymchuk, J. V. Frances-Villora, M. Bataller-Mompeán, and A. Rosado-Muñoz, “A novel systolic parallel hardware architecture for the FPGA acceleration of feedforward neural networks,” IEEE Access, vol. 7, pp. 76084–76103, 2019.
[56] K. Khalil, B. Dey, A. Kumar, and M. Bayoumi, “Adaptive hardware architecture for neural-network-on-chip,” in Proc. IEEE 65th Int. Midwest Symp. Circuits Syst. (MWSCAS), Piscataway, NJ, USA: IEEE Press, 2022, pp. 1–4.
[57] S. Xiao et al., “Neuronlink: An efficient chip-to-chip interconnect for large-scale neural network accelerators,” IEEE Trans. Very Large Scale Integr. VLSI Syst., vol. 28, no. 9, pp. 1966–1978, Sep. 2020.
[58] B. Zhang et al., “PIMCA: A programmable in-memory computing accelerator for energy-efficient DNN inference,” IEEE J. Solid-State Circuits, vol. 58, no. 5, pp. 1436–1449, May 2023.
[59] L. Song, J. Mao, Y. Zhuo, X. Qian, H. Li, and Y. Chen, “HyPar: Towards hybrid parallelism for deep learning accelerator array,” in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), 2019, pp. 56–68.
[60] X. Wei, Y. Liang, P. Zhang, C. H. Yu, and J. Cong, “Overcoming data transfer bottlenecks in DNN accelerators via layer-conscious memory management,” in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), New York, NY, USA: ACM, 2019, p. 120, doi: 10.1145/3289602.3293947.
[61] M. Capra, B. Bussolino, A. Marchisio, M. Shafique, G. Masera, and M. Martina, “An updated survey of efficient hardware architectures for accelerating deep convolutional neural networks,” Future Internet, vol. 12, no. 7, 2020, Art. no. 113.
[62] K. Khalil, B. Dey, A. Kumar, and M. Bayoumi, “A reversible-logic based architecture for convolutional neural network (CNN),” in Proc. IEEE Int. Midwest Symp. Circuits Syst. (MWSCAS), Piscataway, NJ, USA: IEEE Press, 2021, pp. 1070–1073.
[63] H. Li, X. Yue, Z. Wang, W. Wang, H. Tomiyama, and L. Meng, “A survey of convolutional neural networks—From software to hardware and the applications in measurement,” Meas. Sens., vol. 18, 2021, Art. no. 100080.
[64] B. Dey, K. Khalil, A. Kumar, and M. Bayoumi, “A reversible-logic based architecture for VGGNet,” in Proc. 28th IEEE Int. Conf. Electron. Circuits Syst. (ICECS), Piscataway, NJ, USA: IEEE Press, 2021, pp. 1–4.
[65] Y. Tang, L. Tian, Y. Liu, Y. Wen, K. Kang, and X. Zhao, “Design and implementation of improved CNN activation function,” in Proc. 3rd Int. Conf. Comput. Vis. Image Deep Learn. Int. Conf. Comput. Eng. Appl. (CVIDL & ICCEA), Piscataway, NJ, USA: IEEE Press, 2022, pp. 1166–1170.
[66] K. Khalil, O. Eldash, A. Kumar, and M. Bayoumi, “Designing novel AAD pooling in hardware for a convolutional neural network accelerator,” IEEE Trans. Very Large Scale Integr. VLSI Syst., vol. 30, no. 3, pp. 303–314, Mar. 2022.
[67] Q. Song, J. Zhang, L. Sun, and G. Jin, “Design and implementation of convolutional neural networks accelerator based on multidie,” IEEE Access, vol. 10, pp. 91497–91508, 2022.
[68] Y.-S. Ting, Y.-F. Teng, and T.-D. Chiueh, “Batch normalization processor design for convolution neural network training and inference,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Piscataway, NJ, USA: IEEE Press, 2021, pp. 1–4.
[69] B. Khabbazan and S. Mirzakuchaki, “Design and implementation of a low-power, embedded CNN accelerator on a low-end FPGA,” in Proc. 22nd Euromicro Conf. Digit. Syst. Des. (DSD), Piscataway, NJ, USA: IEEE Press, 2019, pp. 647–650.
[70] H. Xiao, K. Li, and M. Zhu, “FPGA-based scalable and highly concurrent convolutional neural network acceleration,” in Proc. IEEE Int. Conf. Power Electron. Comput. Appl. (ICPECA), Piscataway, NJ, USA: IEEE Press, 2021, pp. 367–370.
[71] J. Lee, J. Rhim, D. Kang, and S. Ha, “SNAS: Fast hardware-aware neural architecture search methodology,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 41, no. 11, pp. 4826–4836, Nov. 2022.
[72] S. Liu, H. Fan, M. Ferianc, X. Niu, H. Shi, and W. Luk, “Toward full-stack acceleration of deep convolutional neural networks on FPGAs,” IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 8, pp. 3974–3987, Aug. 2022.
[73] H. Wang, Y. Zhao, and F. Gao, “A convolutional neural network accelerator based on FPGA for buffer optimization,” in Proc. IEEE 5th Adv. Inf. Technol., Electron. Automat. Control Conf. (IAEAC), vol. 5, Piscataway, NJ, USA: IEEE Press, 2021, pp. 2362–2367.
[74] P. Achararit, M. A. Hanif, R. V. W. Putra, M. Shafique, and Y. Hara-Azumi, “APNAS: Accuracy-and-performance-aware neural architecture search for neural hardware accelerators,” IEEE Access, vol. 8, pp. 165319–165334, 2020.
[75] T. Yuan, W. Liu, J. Han, and F. Lombardi, “High performance CNN accelerators based on hardware and algorithm co-optimization,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 68, no. 1, pp. 250–263, Jan. 2021.
[76] W. Huang et al., “FPGA-based high-throughput CNN hardware accelerator with high computing resource utilization ratio,” IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 8, pp. 4069–4083, Aug. 2022.
[77] M. Kim and J.-S. Seo, “Deep convolutional neural network accelerator featuring conditional computing and low external memory access,” in Proc. IEEE Custom Integr. Circuits Conf. (CICC), Piscataway, NJ, USA: IEEE Press, 2020, pp. 1–4.
[78] Q. Cheng et al., “A low-power sparse convolutional neural network accelerator with pre-encoding Radix-4 booth multiplier,” IEEE Trans. Circuits Syst., II, Exp. Briefs, vol. 70, no. 6, pp. 2246–2250, Jun. 2023.
[79] X. Yu et al., “A data-center FPGA acceleration platform for convolutional neural networks,” in Proc. 29th Int. Conf. Field Programmable Log. Appl. (FPL), 2019, pp. 151–158.
[80] R. Hwang, M. Kang, J. Lee, D. Kam, Y. Lee, and M. Rhu, “GROW: A row-stationary sparse-dense GEMM accelerator for memory-efficient graph convolutional neural networks,” in Proc. IEEE Int. Symp. High-Perform. Comput. Archit. (HPCA), 2023, pp. 42–55.
[81] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional LSTM networks,” in Proc. IEEE Int. Joint Conf. Neural Netw., vol. 4, Piscataway, NJ, USA: IEEE Press, 2005, pp. 2047–2052.
[82] K. Smagulova and A. P. James, “A survey on LSTM memristive neural network architectures and applications,” Eur. Phys. J. Special Top., vol. 228, no. 10, pp. 2313–2324, 2019.
[83] K. Khalil, B. Dey, A. Kumar, and M. Bayoumi, “A reversible-logic based architecture for long short-term memory (LSTM) network,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Piscataway, NJ, USA: IEEE Press, 2021, pp. 1–5.
[84] Y. Wei et al., “A review of algorithm & hardware design for AI-based biomedical applications,” IEEE Trans. Biomed. Circuits Syst., vol. 14, no. 2, pp. 145–163, Apr. 2020.
[85] J. Wu, F. Li, Z. Chen, and X. Xiang, “A 3.89-GOPS/mW scalable recurrent neural network processor with improved efficiency on memory and computation,” IEEE Trans. Very Large Scale Integr. VLSI Syst., vol. 27, no. 12, pp. 2939–2943, Dec. 2019.
[86] D. Kadetotad, S. Yin, V. Berisha, C. Chakrabarti, and J.-S. Seo, “An 8.93 TOPS/W LSTM recurrent neural network accelerator featuring hierarchical coarse-grain sparsity for on-device speech recognition,” IEEE J. Solid-State Circuits, vol. 55, no. 7, pp. 1877–1887, Jul. 2020.
[87] G. Nan et al., “An energy efficient accelerator for bidirectional recurrent neural networks (BiRNNs) using hybrid-iterative compression with error sensitivity,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 68, no. 9, pp. 3707–3718, Sep. 2021.
[88] C. Gao, A. Rios-Navarro, X. Chen, S.-C. Liu, and T. Delbruck, “EdgeDRNN: Recurrent neural network accelerator for edge inference,” IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 10, no. 4, pp. 419–432, Dec. 2020.
[89] D. Shan, Y. Luo, X. Zhang, and C. Zhang, “DRRNets: Dynamic recurrent routing via low-rank regularization in recurrent neural networks,” IEEE Trans. Neural Netw. Learn. Syst., vol. 34, no. 4, pp. 2057–2067, Apr. 2023.
[90] J. Chen, S. Hong, W. He, J. Moon, and S.-W. Jun, “Eciton: Very low-power LSTM neural network accelerator for predictive maintenance at the edge,” in Proc. 31st Int. Conf. Field-Programmable Log. Appl. (FPL), 2021, pp. 1–8.
[91] A. Vaswani et al., “Attention is all you need,” Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017.
[92] W. Li, S. Wang, and G. Liu, “Transformer-based model for fMRI data: ABIDE results,” in Proc. 7th Int. Conf. Comput. Commun. Syst. (ICCCS), 2022, pp. 162–167.
[93] S. Ansari and K. A. Alnajjar, “Multi-hop genetic-algorithm-optimized routing technique in diffusion-based molecular communication,” IEEE Access, vol. 11, pp. 22689–22704, 2023.
[94] M. S. Rao, K. Venkata Rao, and M. H. M. Krishna Prasad, “Hybrid security approach for database security using diffusion based cryptography and Diffie-Hellman key exchange algorithm,” in Proc. 5th Int. Conf. I-SMAC (IoT Soc. Mob. Analytics Cloud) (I-SMAC), 2021, pp. 1608–1612.
[95] Z. Zhao, R. Cao, K.-F. Un, W.-H. Yu, P.-I. Mak, and R. P. Martins, “An FPGA-based transformer accelerator using output block stationary dataflow for object recognition applications,” IEEE Trans. Circuits Syst., II, Exp. Briefs, vol. 70, no. 1, pp. 281–285, Jan. 2023.
[96] Z. Cheng, Z. Zhang, J. Jiang, and J. Sun, “Signal detection of mobile multi-user molecular communication system using transformer-based model,” in Proc. 8th Int. Conf. Comput. Commun. Syst. (ICCCS), 2023, pp. 85–90.
[97] Y. Yan, W. Du, D. Yang, and D. Yin, “CIPTA: Contrastive-based iterative prompt-tuning using text annotation from large language models,” in Proc. 4th Int. Conf. Electron. Commun. Artif. Intell. (ICECAI), 2023, pp. 174–178.
[98] Y. Ye, H. You, and J. Du, “Improved trust in human-robot collaboration with ChatGPT,” IEEE Access, vol. 11, pp. 55748–55754, 2023.
[99] P. Maddigan and T. Susnjak, “Chat2VIS: Generating data visualizations via natural language using ChatGPT, Codex and GPT-3 large language models,” IEEE Access, vol. 11, pp. 45181–45193, 2023.
[100] W. Zhu et al., “Sensitivity, specificity, accuracy, associated confidence interval and ROC analysis with practical SAS implementations,” in Proc. Health Care Life Sci. (NESUG), Baltimore, MD, USA, vol. 19, 2010, p. 67.

Tamador Mohaidat received the B.Sc. degree in computer engineering from Yarmouk University, Irbid, Jordan, in 2010. She is currently working toward the M.Sc. degree in computer engineering with the Department of Electrical and Computer Engineering, University of Mississippi, Oxford, MS, USA.
She was a Lecturer with the Deanship of the Preparatory Year, Prince Sattam Bin Abdulaziz University, Al-Kharj, Saudi Arabia, for two years. She is currently a Research Assistant with the Department of Electrical and Computer Engineering, University of Mississippi. Her research interests include very large-scale integration (VLSI), artificial intelligence, machine learning, and hardware accelerators.

Kasem Khalil (Senior Member, IEEE) received the B.Sc. and M.Sc. degrees in electrical engineering from Assiut University, Asyut, Egypt, in 2009 and 2014, respectively, and the Ph.D. degree in computer engineering from the Center of Advanced Computer Studies (CACS), University of Louisiana at Lafayette, Lafayette, LA, USA, in 2021.
Since 2022, he has been serving as an Associate Editor of the Elsevier Microelectronics Journal. His research interests include electronics, very large-scale integration (VLSI), microelectronics, reconfigurable hardware, self-healing hardware systems, machine learning, hardware accelerators, network-on-chip, artificial intelligence, intelligent hardware systems, and the Internet of Things.
Dr. Khalil was the recipient of the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION SYSTEMS Prize Paper Award (IEEE Circuits and Systems Society VLSI Paper Award), 2023.