Embedded Deep Learning Accelerators: A Survey on Recent Advances
Abstract—The exponential increase in generated data as well as the advances in high-performance computing has paved the way for the use of complex machine learning methods. Indeed, the availability of graphical processing units and tensor processing units has made it possible to train and prototype deep neural networks (DNNs) on large-scale datasets and for a variety of applications, i.e., vision, robotics, biomedical, etc. The popularity of these DNNs originates from their efficacy and state-of-the-art inference accuracy. However, this is obtained at the cost of a considerably high computational complexity. Such drawbacks rendered their implementation on limited-resource edge devices, without a major loss in inference speed and accuracy, a dire and challenging task. To this extent, it has become extremely important to design innovative architectures and dedicated accelerators to deploy these DNNs to embedded and reconfigurable processors in a high-performance, low-complexity structure. In this study, we present a survey on recent advances in deep learning accelerators for heterogeneous systems and Reduced Instruction Set Computer processors given their open-source nature, accessibility, customizability, and universality. After reading this article, the readers should have a comprehensive overview of the recent progress in this domain, cutting-edge knowledge of recent embedded machine learning trends, and substantial insights for future research directions and challenges.

Impact Statement—The surge in internet-connected devices and the amount of data generated have made it essential to adopt deep learning routines, i.e., neural networks (NNs), for intelligent data processing and decision making. In addition, it has become mandatory to implement these algorithms on edge devices and embedded systems restricted in memory and resources. As such, it is necessary to optimize these devices and customize their internal architecture to suit the target artificial intelligence application needs without a major loss in accuracy or performance. Although many works have been conducted, the literature is short of survey articles listing the advancements made in this field specifically for customizable processors. Thus, this survey article presents a comparative study of different deep NN (DNN) hardware accelerators implemented on customizable RISC-V processors. This article presents an introduction to DNNs, their different types, the Convolutional NN (CNN), the challenges faced in implementing DNNs on embedded devices, available optimization and quantization techniques, and an overview of the RISC-V CPU core and its instruction set architecture (ISA). We thus highlight its advantages and customizability compared to other licensed processors. We then proceed to detail and compare some of the work done on designing custom CNN accelerators, mainly for Internet of Things and object detection applications on the RISC-V core. Finally, in the remaining sections, we list other techniques for accelerating CNN tasks, such as ISA extensions to optimize extensive operations (dot product, multiply accumulate, looping), and we discuss advances in analog-digital in-memory computation.

Index Terms—Convolutional neural network (CNN), embedded machine learning, hardware accelerators, Reduced Instruction Set Computer (RISC-V), transformers.

Manuscript received 24 May 2023; revised 27 July 2023; accepted 30 August 2023. Date of publication 5 September 2023; date of current version 14 May 2024. This paper was recommended for publication by Associate Editor Supratik Mukhopadhyay upon evaluation of the reviewers' comments. (Corresponding author: Ghattas Akkad.)
Ghattas Akkad and Ali Mansour are with the Lab-STICC, UMR CNRS, École Nationale Supérieure de Techniques Avancées (ENSTA) Bretagne, 29200 Brest, France (e-mail: [email protected]; [email protected]).
Elie Inaty is with the Department of Computer Engineering, University of Balamand, Koura 100, Lebanon (e-mail: [email protected]).
Digital Object Identifier 10.1109/TAI.2023.3311776

I. INTRODUCTION

THE exponential growth in the deployed computing devices, as well as the abundance of generated data, has mandated the use of complex algorithms and structures for smart data processing [1]. Such overwhelming processing requirements have further mandated the use of artificial intelligence (AI) techniques and compatible hardware [2]. Nowadays, machine learning (ML) methods are routinely executed in various fields including health care, robotics, navigation, data mining, agriculture, environment, etc. [3], replacing the need for recurrent human interventions. Classical ML methods have rapidly evolved, over the recent years, to perform compute-intensive operations and have expanded various research areas tenfold. The introduction of high-accuracy and near-real-time-performing deep learning (DL) processes, such as the deep neural network (DNN) [2], [3], has brought unprecedented advances in the areas of natural language processing (NLP) [4], object detection, image classification, signal estimation and detection, protein folding, and genomics analysis, to name a few. These convoluted DNN models achieve their high inference accuracy and performance through the use of manifold trainable parameters and large-scale datasets [3]. Training and deploying DNNs rely on performing heavy computations with the indispensable use of high-performance computing units, such as graphical processing units (GPUs) and tensor processing units (TPUs). Consequently, DL structures require considerably high energy consumption and storage capacity [2], severely limiting their implementation, performance, and use on limited-resource devices, i.e., field-programmable gate arrays (FPGAs), system on chip (SoC), general-purpose microprocessors (MPs), and digital signal processing (DSP) processors [1], [5], [6].

However, the need for edge AI computing [6], [7] remains relevant and crucial. Edge computing involves offloading DNNs' inference operations to the node processor for implementing AI procedures on the device itself [7], [8]. Adopting this paradigm
requires scaling down the DNN to fit on limited-resource devices without a significant loss in performance and accuracy [2], [9], thus adding new challenges to those already at hand, such as area limitation, power consumption, and storage requirements. To abide by the imposed constraints and to efficiently deploy DNN structures on different processors, such as FPGA, SoC, and MP, one practical yet popular solution is to reduce the DNNs' size and develop task-specific deep learning accelerators (DLAs) [2], [10], [11]. Moreover, several optimization techniques can be applied to reduce DNNs' hardware usage, such as pruning [12], quantization [13], [14], knowledge distillation [15], multiplexing [16], and model compression [17], to name a few.

To account for the requirements of various applications, there exist different DL structures and models [18], such as the convolutional neural network (CNN) [3], [6], recursive neural network (RNN) [3], [18], generative adversarial network [18], graph neural network (NN), and transformers. To accommodate such diversity, popular approaches for implementing DNN accelerators rely on using reconfigurable devices, i.e., FPGA [19], [20], or extending the architecture and instruction set of ARM and Reduced Instruction Set Computer (RISC-V)-based processors [3]. In addition, the use of dedicated NN libraries and compilers, such as CMSIS-NN by ARM [21] for 16-bit and 8-bit processors, makes it possible to implement some sophisticated quantized DNNs, i.e., 8-bit, 4-bit, 2-bit, and even 1-bit [3]. However, commercial processors have major drawbacks, such as licensing costs and the lack of flexibility in modifying the general architecture [22].

Therefore, a more suitable option is to target open-source, RISC-V, processors [23]. The RISC-V instruction set architecture (ISA) is managed by a nonprofit foundation and offers several unique advantages [5], [24]:
1) flexibility to fully customize the hardware for specific application requirements, i.e., power, area, and performance, through open-source designs and ISA;
2) reduced third-party licensing dependence and intellectual property usage for DNN accelerators;
3) compatibility, standards, and interoperability among different platforms;
4) community support and long-term sustainability, i.e., frameworks, compilers, and software stacks to facilitate deployment.

In its nature, the RISC-V is an MP with a temporal-like architecture, where the arithmetic operations are performed by the arithmetic logic unit (ALU) found in each processing element (PE). Hence, extending its functionality by implementing parallel DNN accelerators is tremendously beneficial for edge AI inference and embedded ML.

A. Previous Surveys and Motivation

The research work dealing with embedded ML [7] has tremendously increased in an effort to meet the needs of edge AI applications [25]. These publications have been discussed in detailed survey articles published since 2015. Some surveys have discussed the optimization of different DNN hardware implementation techniques and accelerators from a general perspective:
1) hardware-aware neural architecture search [3], [26];
2) CNN accelerators [27];
3) model compression and hardware acceleration [12], [17];
4) hardware and software optimization [22].
Other surveys focused on hardware-specific DNN implementation targeting FPGAs and microcontroller units (MCUs):
1) optimizing NN accelerators for MCU-based inference [2], [9];
2) FPGA-based CNN accelerators [19];
3) custom hardware/application-based accelerators [20].
Other existing surveys, such as [28], [29], [30], and [31], do not specifically focus on RISC-V. However, they provide insights into different DNN acceleration techniques that can be adapted for the RISC-V architecture.

In this study, we provide a comprehensive overview of the recent advances in optimizing DNN hardware implementation and its accelerators for RISC-V, heterogeneous, and hybrid edge devices, as well as meaningful insights into future developments.

B. Survey Plan

The rest of this article is organized as follows. Section II presents the RISC-V central processing unit (CPU) core architecture. Recent work done on DNN hardware accelerators is detailed in Section III. Section IV discusses some RISC-V ISA extensions for accelerating embedded DNN operations. Section V explores transformer accelerators. Finally, Section VI concludes this article with insights into future research tracks and open challenges.

II. RISC-V CPU CORE

The RISC-V is an open-source, customizable, and scalable ISA. The RISC-V core implements a temporal architecture with three basic 32-bit instruction sets, six extended sets, and 32 standard registers [1]. The base instruction set supports three main formats: store (S), register (R), and immediate (I). These formats include logical, arithmetic, data transfer, memory access, and control flow instructions, such as branches and jumps. It also supports instructions to communicate with the operating system. Data transfer between memory and registers may only be done through the use of load and store instructions. In addition, privileged-mode instructions are available to manage system operations and exception handling, allowing full control over the architecture. Furthermore, the RISC-V ISA reserves four instruction spaces for user-defined extensions, i.e., custom-0, custom-1, custom-2/rv128, and custom-3/rv128 [6]. Optional floating point (FP) and vector (V) instructions [33] are supported to accelerate specific operations. Thus, a RISC-V CPU core coupled with an FPGA enables the designers to create custom accelerators and processors for various applications. Some of the most popularly used RISC-V cores are RV12 [24], E203 [6], RI5CY [34], and Rocket [35].

First, RV12, shown in Fig. 1, is a highly configurable single-core CPU based on the industry-standard RISC-V ISA, available
in both RV32I and RV64I (32/64-bit) versions [24]. It features a six-stage execution pipeline implemented in a Harvard architecture. The pipeline stages increase the overall execution efficiency, reduce stalls, and optimize the overlap between the execution and memory access [24]. In addition, the architecture includes several features, categorized as core, optional, parameterized, and configurable. These features provide the designer with the ability to emphasize performance, power, or area based on the application requirements [24]. The processor also includes a branch predictor unit, data cache, and instruction cache to speed up execution [24].

Second, E203 is a 32-bit RISC-V processor designed for energy-efficient and high-performance computing applications, such as Internet of Things (IoT) [6]. E203 supports the RV32IMAC instruction set and is the closest to the ARM Cortex M0+ [6]. It is composed of two pipeline stages, where the first pipeline stage handles instruction fetch, decode, and branch prediction. The resulting Program Counter (PC) and instruction value are loaded in the PC and Instruction Register (IR), respectively. The second pipeline stage mainly handles rerouting the IR to the appropriate processing unit to execute the required operation. The main processing units are the ALU, the multiplier/divider, the access memory unit, and the extension accelerator interface (EAI) [6].

Third, RI5CY is an energy-efficient four-stage-pipeline 32-bit RISC-V processor core designed by the PULP platform. The core architecture supports the RV32IMAC instruction set and implements power gating and clock frequency scaling (CFS) units to better manage and reduce power consumption. In addition, it also implements a hardware loop unit to efficiently execute loops, various single instruction multiple data (SIMD) instructions to accelerate DSP operations, and postincrement load and store addressing to improve overall performance. The RI5CY core is mostly used in accelerating mixed-precision DNN operations [36].
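SIMD extensions of this kind typically pack several narrow operands (for example, four 8-bit values) into one 32-bit register and reduce them in a single instruction. The following C routine is a minimal software model of the packed 4x8-bit dot-product-accumulate that such instructions replace; it is an illustrative sketch only, not the PULP intrinsic API.

```c
#include <stdint.h>

/* Software model of a packed 4x8-bit dot-product-accumulate:
 * acc += sum_{i=0..3} a[i] * b[i], with a and b packed into 32-bit words.
 * A SIMD-extended core performs this reduction in one instruction;
 * here it is unrolled explicitly for clarity. */
static int32_t sdotp_4x8(int32_t acc, uint32_t a_packed, uint32_t b_packed)
{
    for (int i = 0; i < 4; i++) {
        int8_t a = (int8_t)(a_packed >> (8 * i));  /* extract signed byte i */
        int8_t b = (int8_t)(b_packed >> (8 * i));
        acc += (int32_t)a * (int32_t)b;
    }
    return acc;
}
```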
III. DNN HARDWARE ACCELERATORS

Computing platforms, such as CPU, TPU, and GPU, are expensive, power-hungry, and unsuitable for edge applications. On the other hand, application-specific integrated circuits (ASICs) are fast but deploy a nonreconfigurable architecture [22], [30]. However, RISC-V processors and FPGAs can be used concurrently to accelerate different DL structures as they are highly customizable. Mostly, those with exploitable parallelism can benefit the most from optimized matrix operations. However, the choice varies with respect to the target application's requirements and the available hardware resources.

The most popular structure implemented on edge devices is the CNN [3]. CNNs are inherently parallel and more commonly used in error-tolerant applications. They can be further simplified, at the cost of minor unnoticeable errors, to optimize power usage, hardware resources, and latency [3]. Moreover, substantial work has been done on efficiently accelerating quantized transformer models for deployment on edge devices.

A. CNN Accelerators for the IoT

To meet the basic CNN functionalities for multimedia data processing, a low-bandwidth, area-efficient, and low-complexity accelerator was designed for the IoT SoC endpoint [32]. The CNN accelerator is constructed in the form of parallel operating acceleration chains, each with a serially connected convolution, adder, activation function, and pooling circuits [32], as shown in Fig. 2. Src is the source input, 32b is 32-bit bus width, and 2D-Conv is the 2-D convolution operation.

In Fig. 2(a), a classical IoT SoC processing data flow is expanded to include a compact CNN accelerator connected to the CPU kernel through the SoC bus. The compact CNN accelerator, detailed in Fig. 2(b), is formed of a core random access memory (RAM), three ping-pong buffer blocks denoted
TABLE I
PERFORMANCE COMPARISON OF THE RV12, E203, RI5CY, AND ROCKET RISC-V CORES
Fig. 2. Compact CNN accelerator. (a) Data flow. Source: Adapted from [6]. (b) Compact CNN accelerator architecture. Source: Adapted from [32]. (c) Acceleration
chain architecture. Source: Adapted from [32].
by (BUF RAM BANK), two data selectors, a CNN controller, and four acceleration chains. The accelerator chain top-level architecture is presented in Fig. 2(c) and performs the core mathematical operations, i.e., 2-D convolution, matrix addition, rectified linear unit (ReLU) activation function, and pooling, in fixed-point format [32]. It is essential to highlight that the fully connected layer (FCL) operation can be viewed as a special case of the convolution operation with similar hardware implementation [32], [40]. As such, the FCL operation is implemented in the 2D-Conv block [32]. Given the fixed sequence of operations, the operating blocks are serially connected to reduce internal data movement and interconnectivity. The Bypass control allows bypassing specific, not needed, modules without affecting the system performance or results. In addition, the data width in a chain varies to maintain accuracy; however, it remains consistent between layers [32].
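To make the chain structure concrete, the following C routine is a minimal fixed-point reference model of one acceleration chain (2-D convolution, bias addition, ReLU, and 2 x 2 max pooling applied in series, with a simple ReLU bypass flag). It is an illustrative sketch of the data flow only; buffer sizes, bit widths, and the bypass mechanism of the actual design [32] differ.

```c
#include <stdint.h>

#define IN_H 8
#define IN_W 8
#define K    3      /* 3x3 kernel, stride 1, no padding */
#define FRAC 8      /* fractional bits of the fixed-point format */

/* Saturate a 32-bit accumulator to the 16-bit chain data width. */
static int16_t sat16(int32_t v)
{
    if (v >  32767) return  32767;
    if (v < -32768) return -32768;
    return (int16_t)v;
}

/* 2D-Conv -> bias Add -> ReLU (optionally bypassed).  The FCL case maps to
 * the same routine with a kernel spanning the whole input. */
static void conv2d_chain(const int16_t in[IN_H][IN_W], const int16_t k[K][K],
                         int16_t bias, int16_t out[IN_H - K + 1][IN_W - K + 1],
                         int relu_bypass)
{
    for (int r = 0; r <= IN_H - K; r++) {
        for (int c = 0; c <= IN_W - K; c++) {
            int32_t acc = 0;
            for (int i = 0; i < K; i++)
                for (int j = 0; j < K; j++)
                    acc += (int32_t)in[r + i][c + j] * k[i][j];
            acc = (acc >> FRAC) + bias;             /* rescale and add bias */
            if (!relu_bypass && acc < 0) acc = 0;   /* ReLU unless bypassed */
            out[r][c] = sat16(acc);
        }
    }
}

/* Final pooling stage of the chain: 2x2 max pooling. */
static void maxpool2x2(const int16_t in[IN_H - K + 1][IN_W - K + 1],
                       int16_t out[(IN_H - K + 1) / 2][(IN_W - K + 1) / 2])
{
    for (int r = 0; r + 1 < IN_H - K + 1; r += 2)
        for (int c = 0; c + 1 < IN_W - K + 1; c += 2) {
            int16_t m = in[r][c];
            if (in[r][c + 1]     > m) m = in[r][c + 1];
            if (in[r + 1][c]     > m) m = in[r + 1][c];
            if (in[r + 1][c + 1] > m) m = in[r + 1][c + 1];
            out[r / 2][c / 2] = m;
        }
}
```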
The CNN accelerator was prototyped and tested for the ARM Cortex M3 [32]. The data flow direction is one way, whereas two memory access operations are required. This reduces efficiency and flexibility and increases power consumption [6]. In order to improve its performance and efficiency and reduce memory access operations, the IoT CNN accelerator was modified in [6] to a coprocessor and connected to the RISC-V E203 CPU through the EAI.
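Coprocessors attached through an interface such as the EAI are typically driven from the CPU by instructions taken from the RISC-V custom opcode spaces described in Section II. The snippet below is a hypothetical sketch of how such an instruction could be issued from C using the GNU assembler's .insn directive on the custom-0 opcode; the funct fields, the operand meaning, and the operation performed by the accelerator are placeholders fixed by a specific coprocessor design, not taken from [6].

```c
#include <stdint.h>

/* Hypothetical custom-0 (opcode 0x0b) R-type instruction that hands two
 * operands to an attached accelerator and returns its result.  The funct3
 * and funct7 values below are illustrative placeholders; a real design
 * fixes them in its instruction decoder. */
static inline uint32_t accel_op(uint32_t rs1, uint32_t rs2)
{
    uint32_t rd;
    __asm__ volatile(".insn r 0x0b, 0x0, 0x0, %0, %1, %2"
                     : "=r"(rd)
                     : "r"(rs1), "r"(rs2));
    return rd;
}
```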
The configurable CNN accelerator [6] modifies the compact CNN accelerator [32] by optimizing memory access and by
Fig. 3. Reconfigurable CNN accelerator. Source: Adapted from [6]. (a) Top-level diagram. (b) Crossbar architecture.
TABLE II
COMPARISON OF CNN ACCELERATOR RESOURCE CONSUMPTION
the Xilinx XC7A100TFTG256-1 [6]. A summary on resource consumption is shown in Table II and is compared to the work done in [41], [42], and [43]. LUT represents lookup table slices, FF represents register slices, and DSP represents the configurable digital signal processing units implemented at the transistor level and formed of dedicated multipliers, adders, and registers. FPS is frames per second, the power is given in watts (W), and the throughput in giga operations per second (GOPS).

It can be seen from Table II that the CNN coprocessor proposed in [6], in contrast to the accelerator proposed in [32], requires the use of 21 DSP blocks and a minor increase in LUT and FF elements. This is logical, since the CNN accelerator in [32] has been extended to a reconfigurable coprocessor [6]. In addition, the Cortex M3 core [32] accounts for 15 162 of the total SoC LUT resources, while the E203 core only requires 4338 [6]. The design suggested in [32] displayed better throughput and resource usage as compared to those suggested in [41] and [42].

Furthermore, a RISC-V-based CNN coprocessor is proposed in [44] for epilepsy detection. The coprocessor is formed of an eight-layer 1-D CNN accelerator, a two-stage RISC-V processor, a main controller, and local memory units operating at a 10-MHz clock frequency. The accelerator is programmable and supports the implementation of various CNN models. In contrast to the listed designs, the coprocessor in [44] requires the least resources, where only 3411 LUTs, 2262 registers, and six DSP units are needed. Moreover, the coprocessor consumes 0.118 W, has a latency of 0.137 ms per class, and provides a 99.16% accuracy on fixed-point operations. The design's low power and resource requirements make it a suitable choice for low-power IoT wearable devices [44]. While these accelerators are specifically designed for deployment on edge devices, they cannot compete with high-performance models such as that proposed in [43], offering a throughput of 84.3 GOPS.

B. CNN Accelerators for Object Detection

Classically, an object detector relies on segmentation, low-level feature extraction, and classification with respect to a shallow NN [45], [48]. However, with the advances in the DNN and hardware computing power, the state-of-the-art detectors make use of deep CNN structures to dynamically extract complex features for accurate classification [45]. One of the most prevailing object detectors is the you only look once (YOLO). The YOLO detector and its successors (YOLOv2 [48], YOLOv3, and YOLOv4 [45]) offer the best bargain between performance (speed) and accuracy. However, this performance is achieved at the cost of high computational complexity and requirements, making it difficult to implement these networks on edge devices. Lightweight YOLO models (Tiny-YOLOv3 and Tiny-YOLOv4) have been proposed to reduce the complexity, i.e., fewer parameters, at the cost of a slight reduction in accuracy. Thus, to implement these lightweight models on embedded systems, suitable, low-energy, and high-performance architectures are required [45].

To accommodate such requirements, a RISC-V-based YOLO hardware accelerator with a multilevel memory hierarchy was proposed in [35]. The YOLO model implements the Darknet-19 inference network [35]. In their design [35], the filters are considered of size 3 × 3 or 1 × 1, the stride is 1, and the output is always a multiple of 7. The YOLO hardware accelerator is designed and implemented with respect to specific considerations and parameters rather than generalized, in order to achieve an area- and energy-efficient architecture [35]. The YOLO accelerator controller is chosen as the open-source RISC-V Rocket Core with extended, customized instructions.

As shown in Fig. 5(a), describing the top-level architecture, the accelerator is connected to the CPU core through the rocket custom co-processor (ROCC) interface. The instruction FIFO and data FIFO (DFIFO) registers store the instructions and data forwarded by the CPU core. The decoder block decodes and processes the instructions to the finite-state machine (FSM) acting as the main control unit of the compute, padding, and memory modules. In a parallel process, the input is read from the double data rate synchronous dynamic random-access memory (DDR-SDRAM), stored in the buffer, and communicated to the computation module. The DFIFO transfers the CPU data to both the FSM and computation module to begin the CNN operations, i.e., convolution, pooling, and activation [35].

The computation module's core operating unit is the convolution unit, shown in Fig. 5(b), which performs the convolution operation, the max pooling, and the activation function. The convolution unit is formed of nine multipliers, seven adders, and a five-stage-pipeline FIFO, as noted in Fig. 5(b) [35]. The data and weights are serially fed to the convolution unit using the input FIFO buffers, at every clock cycle, to perform the convolution operation. The output is then passed to a pooling unit that performs only max pooling with respect to three comparators. Finally, the ReLU activation function is performed on the results [35].
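To illustrate the datapath described above, the C fragment below models one output of the 3 x 3 convolution unit (nine products reduced by an adder tree), a max-of-four pooling step using three comparisons, mirroring the three comparators, and the final ReLU. It is a behavioral sketch only; pipelining, bit widths, and FIFO handshaking of the design in [35] are omitted.

```c
#include <stdint.h>

/* One 3x3 convolution output: nine products reduced to a single sum
 * (an adder tree in hardware). */
static int32_t conv3x3(const int8_t win[9], const int8_t w[9])
{
    int32_t acc = 0;
    for (int i = 0; i < 9; i++)
        acc += (int32_t)win[i] * (int32_t)w[i];   /* 9 multipliers */
    return acc;
}

/* Max of four neighboring outputs using three comparisons, matching the
 * three-comparator pooling unit. */
static int32_t maxpool4(int32_t a, int32_t b, int32_t c, int32_t d)
{
    int32_t m0 = (a > b) ? a : b;   /* comparator 1 */
    int32_t m1 = (c > d) ? c : d;   /* comparator 2 */
    return (m0 > m1) ? m0 : m1;     /* comparator 3 */
}

/* ReLU applied to the pooled result, as the final step of the unit. */
static int32_t relu(int32_t x) { return x > 0 ? x : 0; }
```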
To improve overall performance, a memory hierarchy is designed and implemented by Zhang et al. [35]. With respect to Fig. 5(c), the memory hierarchy is composed of three levels: off-chip DDR-SDRAM, input/output (I/O) data buffers, and internal weight and input double register groups, 9 × 8 bit and 226 × 4 × 8 bit, respectively [35]. By adopting such a hierarchy, the interface limited-bandwidth bottleneck complications can be avoided. Although the design is energy efficient and requires a relatively small on-chip area, it requires 2.7 s to finish the YOLO inference operation. This delay is a result of a tradeoff between resource usage and speed, i.e., a serially implemented computational module. To decrease the inference latency, the authors suggested adding additional, parallel, computation modules [35]. The authors evaluated the system's performance with seven computation modules, achieving a 400-ms average time [35]. The accelerator is model specific and designed with respect to constant configurations, i.e., filter, output, and stride size. However, there exist different YOLO versions, each with a different input feature map size. Designing a specific accelerator for each feature can be a tedious solution. Thus, a configurable parameterizable RISC-V accelerator core is designed based on the Tiny-YOLO version [45].

The accelerator in [45] is designed as an algorithm-oriented hardware core to accelerate lightweight versions of YOLO,
Fig. 5. YOLO RISC-V accelerator. Source: Adapted from [35]. (a) Top-level architecture. (b) Convolution block architecture (part of the computation module).
(c) Multilevel memory hierarchy.
TABLE III
RISC-V YOLO ACCELERATOR RESOURCE COMPARISON AND PERFORMANCE
thus allowing flexibility, robustness, and configurability. In addition, this design not only accelerates the CNN network, but also the pre- and post-CNN operations [45]. The proposed generalized YOLO accelerator is shown in Fig. 6. The vRead Bias, vRead Weights, vRead Array, and vWrite Array are configurable dual-port memory units. The direct memory access (DMA) unit is used to read data from the external memory, the functional unit (FU) is a matrix of configurable custom computing units, and AGU is the address generator unit. The FU matrix unit is used for reading tiles of the input feature map [45].

As shown in Fig. 6(a), the YOLO accelerator's top-level architecture is mainly formed of three stages: xWeightRead, xComp, and AXI-DMA. The xWeightRead stage is formed of the vRead Bias and the vRead Weights arrays. These units perform the read, write, and store operations to and from external memory and provide the needed data to the FU matrix unit, as detailed in Fig. 6(b). The weight memories are implemented as asymmetric dual-port units with an external 256-bit bus. In addition, the FU matrix is the accelerator's main PE and is located in the xComp stage. The FU matrix is a collection of interconnected, reconfigurable PEs whose sole purpose is to perform the 3-D convolution operations. Each custom FU architecture, detailed in Fig. 6(c), is formed of an array of multiply accumulate (MAC) units, an adder tree, a Sigmoid activation function, and a leaky ReLU activation function. The multiplexers route the data internally and introduce a higher level of customizability [45].
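The last stage of such an FU reduces the MAC results through the adder tree and applies one of the two supported activations. The short C model below sketches that final selection step with a configuration flag choosing between leaky ReLU and a Sigmoid approximation; the fixed-point formats, the leaky slope, and the coarse Sigmoid stand-in are illustrative assumptions rather than the exact circuits of [45].

```c
#include <stdint.h>

enum fu_act { FU_ACT_LEAKY_RELU, FU_ACT_SIGMOID };

/* Very coarse piecewise-linear Sigmoid in Q8 fixed point (1.0 == 256),
 * used only as a stand-in for the hardware Sigmoid unit. */
static int32_t sigmoid_approx_q8(int32_t x_q8)
{
    if (x_q8 <= -1024) return 0;      /* x <= -4  -> ~0 */
    if (x_q8 >=  1024) return 256;    /* x >=  4  -> ~1 */
    return 128 + x_q8 / 8;            /* linear mid-range */
}

/* Final FU stage: select the activation applied to the adder-tree output. */
static int32_t fu_activate(int32_t acc, enum fu_act mode)
{
    if (mode == FU_ACT_LEAKY_RELU)
        return (acc >= 0) ? acc : acc / 8;   /* leaky slope of 1/8, illustrative */
    return sigmoid_approx_q8(acc);
}
```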
The RISC-V YOLO accelerators presented in [35] and [45] are compared with respect to resource utilization and speed in Table III. In addition, they are compared against different DNN architectures. BRAM is the internal FPGA block RAM units and CM signifies compute module.

As shown in Table III, the special-purpose YOLO accelerator designed in [35] requires the least resources with 161 DSP blocks compared to 832 for the TinyYOLO v3 and 1248 for the TinyYOLO v4. The YOLO accelerator CM is implemented in a serially operating manner, while the TinyYOLO v3 and v4 PEs operate in parallel. However, the massive reduction in resource usage is at the cost of slow performance, i.e., 2.7 s compared to 30.9 and 32.1 ms, with an architecture specifically tailored for predefined parameters [45]. In contrast, the TinyYOLO v3 and v4 designs presented in [45] offer a massive increase in performance, i.e., an average of 30 ms, at the cost of a tenfold increase in resource usage, mainly the DSP blocks and BRAM units. The TinyYOLO v3 and v4 cores are highly customizable and can be configured to meet any YOLO network version requirements. To improve the YOLO accelerator performance, Zhang et al. [35] suggested using seven serially operating CMs
Fig. 6. Generalized YOLO RISC-V accelerator. Source: Adapted from [45]. (a) Top-level architecture. (b) Detailed architecture. (c) Custom FU architecture.
placed in parallel to speed up the convolution operation, thus achieving an execution speed of approximately 400 ms. The overall resource requirements for implementing the RISC-V processor and the TinyYOLO accelerators are obtained at a slight increase in unit usage. Moreover, a lightweight SqueezeNet CNN was proposed for edge MCU-based object detection applications [46]. The proposed architecture is prototyped on the ZYNQ ZC702 SoC and can perform an inference run in 22.75 ms while consuming an average power of 2.11 W. Although the proposed model is not RISC-V specific, it can be adopted for use with these open-source processors. As the presented accelerators are architecture specific, i.e., TinyYOLO and SqueezeNet, a universal coprocessor is designed to efficiently implement different object detection networks [47]. The universal coprocessor is prototyped on the E203 RISC-V SoC and evaluated with respect to different architectures, such as the Faster R-CNN, YOLOv3, SSD513, and RetinaNet. The coprocessor is able to complete an inference run in 210, 51, 125, and 73 ms with 27.2, 33, 31.2, and 32.5 mean accuracy precision (mAP) for the listed networks, respectively [47].

The choice of an accelerator is heavily dependent on the edge device, its resources, and processing capabilities. While the YOLO accelerator and lightweight SqueezeNet [35], [46] are designed with specific considerations, they are most suitable for lower end devices and can be redesigned for other specifications if needed. For higher end devices and more complex applications, the designs presented in [45] can be a better alternative with an average speed of 30 ms. However, for general-purpose SoC and generic applications, the universal coprocessor [47] is the convenient choice.

C. Heterogeneous Single-Shot Multibox Detector Accelerator for Object Detection

DL-based real-time object detection [49], [50] and motion recognition [51] are popularly implemented in advanced driver assistance systems and video analysis applications. The single-shot multibox detector (SSD) combines the advantages of YOLO and Faster R-CNN for fast and accurate real-time object detection [49]. The SSD detects multiple objects through a
TABLE V
MINI-3D CNN PERFORMANCE, RESOURCE USAGE AND POWER
Fig. 7. Modified RI5CY RISC-V core block diagram. Source: Adapted from [34].
Fig. 8. Mixed-precision RI5CY modifications. Source: Adapted from [36]. (a) MPIC core. (b) Extended dot product.
Fig. 9. PULP cluster architecture with an eight-core RISC-V processor, IMA unit, and a digital depth width convolution accelerator. Source: Adapted from [37].
as user needs, application needs, compiler mapping, library support, memory access, and data transfer.

D. Analog in-Memory Computation

Analog in-memory computing (AIMC) is a promising solution to overcome memory bottlenecks in DNN operations as well as to efficiently accelerate QNN operations. It performs analog computations, i.e., matrix-vector multiplications and dot products, on the phase change memory (PCM) crossbars of nonvolatile memory (NVM) arrays, thus accelerating DNN inference while optimizing energy usage [37], [55].

Although efficient, AIMC still requires additional improvements to achieve full-scale application efficiency. Some of its key challenges are [37]:
• limited to matrix/vector operations;
• difficult to integrate in heterogeneous systems (lack of optimized interface designs);
• susceptible to computation bottlenecks in single-core processor devices when handling other workloads, i.e., activation functions and depthwise convolutions.

Heterogeneous RISC-V heavy computing clusters and hybrid SoC designs have gained popularity in extreme edge AI inference [56], [57]. In an effort to overcome the AIMC challenges, an eight-core RISC-V clustered architecture with in-memory computing accelerators (IMA) and digital accelerators was developed in [37]. The aim of this system is to sustain AIMC performance in heterogeneous systems for optimized DNN inference on edge devices targeting practical end-to-end IoT applications [37]. Similar to previous designs, the architecture presented in [37] is based on the popular RISC-V PULP cluster. The work mainly focused on:
1) designing a heterogeneous system with eight programmable RISC-V core processors, IMA, and digital accelerators dedicated for performing depthwise convolutions (DW);
2) improving computational performance by optimizing the interfaces between the IMA and the system;
3) exploiting heterogeneous analog–digital operations, such as pointwise/depthwise convolutions and residuals.

As shown in Fig. 9, the PULP cluster is formed of an eight-core RISC-V processor, a level 1 (L1) tightly coupled data memory cache, an instruction cache, the depthwise convolution digital accelerator, and the IMA subsystem. The components are connected together internally by means of a low-latency logarithmic interconnect and to the external world with respect to onboard DMA and through an AXI bus. The logarithmic interconnect ensures serving the memory in one cycle, while the AXI bus allows the cluster to communicate with the external MCU and peripherals. The external MCU also contains the cluster core program instructions. A hardware event unit is added to the system in order to synchronize operations and thread dispatching [37].

Each subsystem or hardware processing engine (HWPE) has its own streamer block, a standardized interface formed of source and sink FIFO buffers to interact with the RISC-V cores and exchange data with the internal engine. Each block implements an independent FSM to control and synchronize its operation. The HWPE provides two interfaces: control and data. The control "Ctrl intf" allows the cluster to manipulate the accelerator's internal registers for configuration purposes, while the data interface "data intf" connects to the logarithmic interconnect and, in its turn, to the L1 memory unit [37]. The IMA and DW subsystems are further detailed to show their internal architecture. The IMA subsystem engine implements both the analog and digital circuitry as follows.
1) Analog: AIMC crossbar with a 256 × 256 array, programming circuitry, i.e., PCM configuration, digital-to-analog (DAC), and analog-to-digital (ADC) converters.
2) Digital: I/O registers to communicate with the ADC/DAC and an internal FSM control unit.

The IMA operates on the L1 memory data encoded in a special format, i.e., HWC format. The IMA register file "INPUT PIPE REGS" can be set to pipeline different jobs by correctly setting the strides. The proposed IMA structure enables the execution of a full layer in one configuration phase.
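To make the crossbar mapping concrete, the following C sketch shows how a layer's matrix-vector product can be tiled onto 256 × 256 analog arrays: each tile contributes a partial dot product that is read back through the ADC and accumulated digitally. This is a functional illustration only (the tile size follows the array dimension reported in [37]); DAC/ADC quantization, PCM conductance programming, and the HWC data layout are not modeled.

```c
#include <stdint.h>

#define TILE 256   /* crossbar dimension reported in [37] */

/* y[rows] += W[rows][cols] * x[cols], computed tile by tile.
 * In hardware, each TILE x TILE block of W is programmed into one crossbar;
 * here the analog MAC of a tile is emulated digitally.  The caller must
 * zero-initialize y before the first call. */
static void mvm_tiled(const int8_t *W, const int8_t *x, int32_t *y,
                      int rows, int cols)
{
    for (int r0 = 0; r0 < rows; r0 += TILE) {
        for (int c0 = 0; c0 < cols; c0 += TILE) {
            int rmax = (r0 + TILE < rows) ? r0 + TILE : rows;
            int cmax = (c0 + TILE < cols) ? c0 + TILE : cols;
            for (int r = r0; r < rmax; r++) {          /* one crossbar row   */
                int32_t partial = 0;
                for (int c = c0; c < cmax; c++)        /* analog dot product */
                    partial += (int32_t)W[r * cols + c] * x[c];
                y[r] += partial;                       /* digital accumulate */
            }
        }
    }
}
```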
On the other hand, the DW convolution engine is a fully digital accelerator. It implements a network of multiple MAC units, i.e., 46 MACs, register files for data and configuration, window and weight buffers, a general controller FSM, and a dedicated engine FSM. The accelerator can also perform the ReLU activation function as well as the shift and clip operations [37], thus accelerating the convolution operation. Each DW convolution output channel depends on only one input channel, thus offering a reduction in size and a lower connectivity as compared to the original design. The specifically designed DW convolution accelerator resolves DW layer mapping to in-memory computing (IMC) arrays and eliminates any software-originating performance bottlenecks [37]. Additional studies concerning array structures for AIMC, such as systolic arrays for reduced energy consumption, can be found in [58].
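The property that each depthwise output channel reads a single input channel is what the dedicated DW engine exploits. The loop nest below shows this in plain C as a contrast to a standard convolution; the tensor layout, unit stride, and lack of padding are simplifying assumptions for illustration.

```c
#include <stdint.h>

/* Depthwise 3x3 convolution: output channel ch reads only input channel ch,
 * so there is no reduction across channels (a standard convolution would add
 * an inner loop over all input channels). */
static void depthwise_conv3x3(const int8_t *in,  /* [CH][H][W]     */
                              const int8_t *w,   /* [CH][3][3]     */
                              int32_t *out,      /* [CH][H-2][W-2] */
                              int CH, int H, int W)
{
    for (int ch = 0; ch < CH; ch++)
        for (int r = 0; r < H - 2; r++)
            for (int c = 0; c < W - 2; c++) {
                int32_t acc = 0;
                for (int i = 0; i < 3; i++)
                    for (int j = 0; j < 3; j++)
                        acc += (int32_t)in[(ch * H + r + i) * W + c + j]
                             * (int32_t)w[(ch * 3 + i) * 3 + j];
                out[(ch * (H - 2) + r) * (W - 2) + c] = acc;
            }
}
```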
The heterogeneous system was synthesized with Synopsys Design Compiler 2019.12. The full place-and-route flow was done using Cadence Innovus 20.12, and the cluster was implemented using the GlobalFoundries 22-nm FDX technology node. The total system area of the heterogeneous cluster is 2.5 mm², with the IMA core occupying one-third of the area with 0.000912 mm², the 512-kB TCD cache occupying another one-third, and one-third occupied by the remaining parts. The device can perform an average of 29.7 MAC operations per cycle and execute inference for the MobileNetV2 network in 10 ms while achieving a performance of 958 GOPS on NVM.

Emerging technologies, such as 3-D integration, when coupled with IMC techniques, can provide substantial design benefits. 3-D integration is achieved by stacking multiple layers of electronic components in a single chip or package to reduce power consumption, reach higher clock speeds, and improve signal integrity and overall circuit performance. Additional details on 3-D integration and IMC techniques can be found in [59] and [60].

V. HARDWARE ACCELERATORS FOR TRANSFORMERS

Transformers have been shown to outperform CNNs and RNNs in different applications, i.e., NLP and computer vision [38], [50], [61], [62]. They are formed of encoder and decoder blocks that execute several compute-intensive, FP, and nonlinear operations on massive data streams [61], such as multihead self-attention (MHSA), Softmax, Gaussian error linear unit (GELU), pointwise feed-forward network (FFN), and layer normalization (LN) [4], [61]. However, generic DL structures and accelerators are not tailored to support and optimize these specific transformer operations [61]. Some common optimization techniques include model compression with integer or fixed-point quantization [63], [64], [65], specific approximations with scaling factors to execute nonlinear operations [61], and specialized hardware accelerators [38], [62].

A. Fully Quantized Bidirectional Encoder Representations From Transformers

The Bidirectional Encoder Representations from Transformers (BERT) is a state-of-the-art model formed of stacked encoder layers [63]. However, its computational complexity and memory requirements are >20 GFLOPS and >320 MB of FP parameters, respectively [38], hindering its implementation on resource-constrained edge devices.

To reduce its memory footprint and computational complexity, a fully quantized BERT (FQ-BERT) with hardware–software acceleration is proposed in [38] for the SoC. The FQ-BERT compresses the model by quantizing all parameters and intermediate results to integer or fixed-point data types. Moreover, it accelerates inference by implementing dot-product-based PEs and bit-level reconfigurable multipliers [38]. The methods and techniques used for quantizing BERT parameters are detailed as follows [38] (a minimal sketch of the weight/activation scheme is given after this list).
1) Weights and activation functions: Quantized to 4-bit using a symmetric linear quantization strategy with tunable (MIN, MAX) clip thresholds and a scaling factor. The weight scaling factor is computed using a scaling formula. The exponential moving average is used to determine the activation functions' scaling factor during inference.
2) Biases and other parameters: The biases are quantized to 32-bit integers. The Softmax module and the LN parameters are quantized to 8-bit fixed-point values.
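The following C helper illustrates symmetric linear quantization to 4 bits with a clip threshold and a scaling factor derived from it. The clip value and rounding are simplified stand-ins; [38] derives the weight scale from its own scaling formula and tracks the activation scale with an exponential moving average.

```c
#include <math.h>
#include <stdint.h>

/* Symmetric linear quantization to a signed 4-bit range [-8, 7]:
 * q = clamp(round(x / s)), with the scale s chosen so that the clip
 * threshold maps to the largest representable magnitude. */
static int8_t quantize_sym4(float x, float clip)
{
    float s = clip / 7.0f;                /* scaling factor           */
    int   q = (int)lroundf(x / s);        /* round to nearest integer */
    if (q >  7) q =  7;                   /* clip to (MIN, MAX)       */
    if (q < -8) q = -8;
    return (int8_t)q;
}

static float dequantize_sym4(int8_t q, float clip)
{
    return (float)q * (clip / 7.0f);
}
```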
The proposed architecture is divided into two parts: software and hardware. The software part, running on the CPU and off-chip memory, implements the least computationally demanding operations, like embedding and task-specific layers. However, they require the most memory space. The hardware part, running on the FPGA, implements the encoder layers' accelerated units, such as the on-chip buffers, PE, LN core, and Softmax core [38].
1) On-chip buffers: A double-buffered weight buffer, an intermediate data buffer for the MHSA unit variables, a cache buffer for storing the scaling factors, Softmax lookup table values, and the I/O buffers.
2) PE: Each unit is formed of bit-level reconfigurable multipliers with support for 8 × 4 bit and 8 × 8 bit combinations. In addition, a Bit-split Inner-product Module is included to simplify reuse for different operations.
3) Softmax and LN core: The exponential function is quantized to 8 bits and 256 sampling points are stored in a lookup table to simplify the computation (a sketch of this lookup-table evaluation follows the list). Moreover, a coarse-grained three-stage pipeline parallel SIMD is designed to accelerate the elementwise multiplication.
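The snippet below sketches how a 256-entry exponential lookup table can replace a floating-point exponential inside Softmax, using the standard max-subtraction trick so that the argument is always nonpositive. The table range, index mapping, and fixed-point formats are illustrative assumptions rather than the exact design of [38].

```c
#include <math.h>
#include <stdint.h>

#define LUT_SIZE  256
#define LUT_RANGE 8.0f               /* table covers exp(-x) for x in [0, 8] */

static uint8_t exp_lut[LUT_SIZE];    /* exponential quantized to 8 bits */

/* Fill the table once, offline or at startup: entry i ~ exp(-i * step). */
static void exp_lut_init(void)
{
    const float step = LUT_RANGE / (LUT_SIZE - 1);
    for (int i = 0; i < LUT_SIZE; i++)
        exp_lut[i] = (uint8_t)lroundf(255.0f * expf(-step * (float)i));
}

/* Softmax over n scores using only table lookups and integer sums
 * (this sketch assumes n <= 64). */
static void softmax_lut(const float *x, float *y, int n)
{
    float xmax = x[0];
    for (int i = 1; i < n; i++)
        if (x[i] > xmax) xmax = x[i];

    uint32_t e[64];
    uint32_t sum = 0;
    const float inv_step = (LUT_SIZE - 1) / LUT_RANGE;
    for (int i = 0; i < n; i++) {
        int idx = (int)((xmax - x[i]) * inv_step);   /* xmax - x[i] >= 0 */
        if (idx >= LUT_SIZE) idx = LUT_SIZE - 1;
        e[i] = exp_lut[idx];
        sum += e[i];
    }
    for (int i = 0; i < n; i++)
        y[i] = (float)e[i] / (float)sum;
}
```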
Initially, the weights are loaded to the off-chip memory. A task-level scheduler is implemented to fully overlap off-chip memory access and computing operations. This is done by dividing each stage into several substages [38]. The FQ-BERT and BERT were implemented using PyTorch and evaluated on the SST-2 and MNLI tasks of the GLUE benchmark. The FQ-BERT, with a compression ratio of 7.94×, achieved an accuracy of 91.51% and 81.11% as compared to BERT with 92.32% and 84.19%, respectively [38]. Furthermore, the accelerator was implemented on the Xilinx ZCU102 (FPGA) and ZCU111 (SoC) and was compared to the baseline program, FQ-BERT, running on the Intel i7-8700 CPU and the Nvidia K80 GPU (CUDA 10.1). The sentence length and batch size are set to 128 and 1, respectively. Table VIII compares the performance and energy efficiency of the FQ-BERT and BERT when implemented on different processors. The accelerator achieved a 6.10× and 28.91× improvement
TABLE VIII
FQ-BERT PERFORMANCE COMPARISON FOR DIFFERENT PROCESSORS

… where $b$ and $c$ are integers. With the use of the DN, the unit implements a right shift operation and eliminates the need for a divider [61]

$q_o = q_a \frac{S_a}{S_o} = q_a \cdot DN = q_a \times \frac{b}{2^c}.$   (3)
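A minimal C sketch of the requantization step in (3) is given below: the ratio of scaling factors is approximated offline by a dyadic constant b / 2^c, so the runtime rescaling needs only one multiplication and one arithmetic right shift instead of a divider. The values of b and c here are placeholders, and rounding is omitted for clarity.

```c
#include <stdint.h>

/* q_o = q_a * (S_a / S_o) ~= (q_a * b) >> c, with b and c integers chosen
 * offline so that b / 2^c approximates S_a / S_o. */
static inline int32_t requantize_dyadic(int32_t q_a, int32_t b, int32_t c)
{
    int64_t prod = (int64_t)q_a * b;   /* q_a * b                  */
    return (int32_t)(prod >> c);       /* divide by 2^c via shift  */
}

/* Example: a ratio of about 0.59375 can be encoded as b = 19, c = 5 (19/32). */
```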
Fig. 10. SwiftTron linear layers architecture. Source: Adapted From [61]. (a) MatMul block. (b) MHSA block. (c) Head unit. (d) Attention block.
TABLE X
SUMMARY OF STATE-OF-THE-ART KEY FEATURES AND LIMITATIONS
Table X. The state-of-the-art designs are compared with respect to the selected processor, target application, key features, and limitations.

In conclusion, ISA extensions provide optimized general-purpose instructions to implement different core DNN operations. However, the performance of networks is limited by that of the compiler and its ability to correctly map and execute each instruction. In addition, certain models, such as transformers, may require the use of dedicated architectures to perform efficiently. Thus, a dedicated accelerator, while application specific, can improve certain networks' performance. The choice of an accelerator remains constrained by the application requirements and device limitations. The designs listed in this survey favored the FPGA over the CPU and the ASIC. Compared to ASICs, FPGAs offer the needed flexibility to implement dedicated and reconfigurable architectures to meet the ever-changing needs and advancements. Compared to the CPU, FPGAs offer parallelism and low power consumption with better performance per watt. For some models, a heterogeneous implementation was favored, where the design was implemented on a SoC and optimization was performed on both the software (CPU) and hardware (FPGA) sides. It was evident that the open-source RISC-V was the MCU choice for many applications. This is because it offers the needed flexibility and customizability to implement dedicated accelerators that meet specific design criteria, such as power, area, and performance. In addition, it allows developers to freely integrate similar designs through open-source licensing.

The need for implementing DNNs on edge devices is increasing tremendously and has become a dedicated research topic. However, it is clear from Table X that although the proposed designs offer considerable improvements, they share common limitations. These limitations are mainly in terms of size, resource utilization, compiler, processing capabilities, and data communication. Some future research tracks and open challenges in edge AI to reduce the effect of these limitations are as follows.
1) 3-D integration and IMC: Performing IMC in 3-D integrated circuits allows computations to take place in close proximity to the memory. This technique drastically reduces memory access, data transfer bottlenecks, and size, and improves overall performance. 3-D integration and IMC have the potential to revolutionize the field of embedded ML by enabling the full implementation of transformers and large DNNs in hardware. However, these technologies are still relatively new and face complex challenges. 3-D integration is expensive; it can lead to an increase in heat dissipation and is not freely scalable. In addition, IMC in 3-D integrated circuits can lead to complex and difficult-to-implement designs that require reliable data management techniques. The development of dedicated 3D-IMC libraries, instruction sets, tools, and compiler optimization methods can greatly reduce design and testing time.
2) Optimizing model size and resource usage: Knowledge distillation, model compression, pruning, sharing, partitioning, and offloading are some of the popular techniques adopted to reduce the size of DNNs. In knowledge distillation, a smaller student DNN is trained by the larger teacher DNN. Several requirements should be considered when choosing an optimization technique, such as accuracy, computational complexity, cost, speed, and availability. Although extensively researched, these
methods might result in a model that suffers from accuracy loss, requires retraining, poses security concerns, and becomes difficult to validate and deploy. The introduction of device-aware model compression techniques can help in obtaining device-specific models, which facilitates deployment and improves performance.
3) Optimizing memory communication: Data transfer and memory bandwidth play a crucial role in the performance of accelerators, especially in heterogeneous and hybrid digital–analog systems. Accelerators are mainly limited by frequent communication and bandwidth bottlenecks. Addressing the memory bandwidth and limitations in RISC-V can significantly optimize the overall performance. This can be achieved by implementing application-specific memory hierarchies, on-chip interconnects, dedicated access controllers, memory compression, data reuse, and data-flow optimization.
4) Tools and compilers: Implement automatic code generation to efficiently map the program instructions to hardware. In addition, develop open-source standards for RISC-V accelerator designs to simplify interoperability and integration for different hardware platforms and software frameworks.
5) Algorithmic: Exploit the unique capabilities of the open-source RISC-V ISA to better optimize DNN algorithms specifically for RISC-V implementation. The open-source ISA also allows investigating multimodel, heterogeneous, and hybrid computing. This can be done by designing algorithms and structures for data fusion from different sensors, as well as concurrently targeting various computing platforms, such as CPU, FPGA, GPU, and neural processing unit.
6) Optical chips: Optical AI accelerators are extensively investigated for implementing high-accuracy and high-speed inference CNNs. Classical processors are beginning to face limitations in the post-Moore's-law era, where their processing capabilities are not improving at the same pace as the requirements. Optical processors, on the other hand, are not affected by Moore's law and are currently investigated as an alternative to train and deploy DNN structures, offering the advantage of handling much larger and more complex networks.

Additional research tracks that would offer further improvements include those related to improving adaptability to dynamic workloads and exploring techniques to optimize online learning, federated learning, and training, to name a few.

REFERENCES

[1] Z. Liu, J. Jiang, G. Lei, K. Chen, B. Qin, and X. Zhao, "A heterogeneous processor design for CNN-based AI applications on IoT devices," Procedia Comput. Sci., vol. 174, pp. 2–8, 2020.
[2] A. N. Mazumder et al., "A survey on the optimization of neural network accelerators for micro-AI on-device inference," IEEE Trans. Emerg. Sel. Topics Circuits Syst., vol. 11, no. 4, pp. 532–547, Dec. 2021.
[3] L. Sekanina, "Neural architecture search and hardware accelerator co-search: A survey," IEEE Access, vol. 9, pp. 151337–151362, 2021.
[4] A. Vaswani et al., "Attention is all you need," in Proc. Int. Conf. Neural Inf. Process. Syst., 2017, vol. 30, pp. 6000–6010.
[5] S.-H. Lim, W. W. Suh, J.-Y. Kim, and S.-Y. Cho, "RISC-V virtual platform-based convolutional neural network accelerator implemented in systemC," Electronics, vol. 10, no. 13, 2021, Art. no. 1514.
[6] N. Wu, T. Jiang, L. Zhang, F. Zhou, and F. Ge, "A reconfigurable convolutional neural network-accelerated coprocessor based on RISC-V instruction set," Electronics, vol. 9, no. 6, 2020, Art. no. 1005.
[7] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, "Edge computing: Vision and challenges," IEEE Internet Things J., vol. 3, no. 5, pp. 637–646, Oct. 2016.
[8] B. Varghese, N. Wang, S. Barbhuiya, P. Kilpatrick, and D. S. Nikolopoulos, "Challenges and opportunities in edge computing," in Proc. IEEE Int. Conf. Smart Cloud, 2016, pp. 20–26.
[9] P. P. Ray, "A review on TinyML: State-of-the-art and prospects," J. King Saud Univ.-Comput. Inf. Sci., vol. 34, no. 4, pp. 1595–1623, 2022.
[10] E. Manor and S. Greenberg, "Custom hardware inference accelerator for tensorflow lite for microcontrollers," IEEE Access, vol. 10, pp. 73484–73493, 2022.
[11] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner, "AI accelerator survey and trends," in Proc. IEEE High Perform. Extreme Comput. Conf., 2021, pp. 1–9.
[12] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," in Proc. 4th Int. Conf. Learn. Representations, Y. Bengio and Y. LeCun, eds., San Juan, Puerto Rico, 2016.
[13] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks," Proc. Adv. Neural Inform. Process. Syst., D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, eds., vol. 29, 2016.
[14] C. Zhu, S. Han, H. Mao, and W. J. Dally, "Trained ternary quantization," in Proc. 5th Int. Conf. Learn. Representations, Toulon, France, 2017.
[15] A. Karine, T. Napoléon, J.-Y. Mulot, and Y. Auffret, "Video seals recognition using transfer learning of convolutional neural network," in Proc. 10th Int. Conf. Image Process. Theory, Tools, Appl., 2020, pp. 1–4.
[16] V. Murahari, C. E. Jimenez, R. Yang, and K. Narasimhan, "DataMUX: Data multiplexing for neural networks," in Proc. Adv. Neural Inform. Process. Syst., A. H. Oh, A. Agarwal, D. Belgrave, and Kyunghyun Cho, eds., 2022.
[17] L. Deng, G. Li, S. Han, L. Shi, and Y. Xie, "Model compression and hardware acceleration for neural networks: A comprehensive survey," Proc. IEEE, vol. 108, no. 4, pp. 485–532, Apr. 2020.
[18] S. Pouyanfar et al., "A survey on deep learning: Algorithms, techniques, and applications," ACM Comput. Surv., vol. 51, no. 5, pp. 1–36, 2018.
[19] S. Mittal, "A survey of FPGA-based accelerators for convolutional neural networks," Neural Comput. Appl., vol. 32, no. 4, pp. 1109–1139, 2020.
[20] E. Wang et al., "Deep neural network approximation for custom hardware: Where we've been, where we're going," ACM Comput. Surv., vol. 52, no. 2, pp. 1–39, 2019.
[21] L. Lai, N. Suda, and V. Chandra, "CMSIS-NN: Efficient neural network kernels for arm Cortex-M CPUs," 2018, arXiv:1801.06601.
[22] M. Capra, B. Bussolino, A. Marchisio, G. Masera, M. Martina, and M. Shafique, "Hardware and software optimizations for accelerating deep neural networks: Survey of current trends, challenges, and the road ahead," IEEE Access, vol. 8, pp. 225134–225180, 2020.
[23] A. Garofalo, M. Rusci, F. Conti, D. Rossi, and L. Benini, "PULP-NN: Accelerating quantized neural networks on parallel ultra-low-power RISC-V processors," Philos. Trans. Roy. Soc. A, vol. 378, no. 2164, 2020, Art. no. 20190155.
[24] "RV12 RISC-V 32/64-bit CPU core datasheet." Accessed: Apr. 28, 2022. [Online]. Available: https://roalogic.github.io/RV12/DATASHEET.html
[25] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner, "Survey and benchmarking of machine learning accelerators," in Proc. IEEE High Perform. Extreme Comput. Conf., 2019, pp. 1–9.
[26] K. T. Chitty-Venkata and A. K. Somani, "Neural architecture search survey: A hardware perspective," ACM Comput. Surv., vol. 55, no. 4, pp. 1–36, 2022.
[27] D. Ghimire, D. Kil, and S.-H. Kim, "A survey on efficient convolutional neural networks and hardware acceleration," Electronics, vol. 11, no. 6, 2022, Art. no. 945.
[28] S. Kalapothas, M. Galetakis, G. Flamis, F. Plessas, and P. Kitsos, "A survey on RISC-V-based machine learning ecosystem," Information, vol. 14, no. 2, 2023, Art. no. 64.
[29] A. Sanchez-Flores, L. Alvarez, and B. Alorda-Ladaria, "A review of CNN accelerators for embedded systems based on RISC-V," in Proc. IEEE Int. Conf. Omni-Layer Intell. Syst., 2022, pp. 1–6.
[30] J. K. L. Lee, M. Jamieson, N. Brown, and R. Jesus, “Test-driving RISC-V vector hardware for HPC,” in Proc. High Perform. Comput., A. Bienz, M. Weiland, M. Baboulin, and C. Kruse, Eds., Cham, Switzerland: Springer, 2023, vol. 13999, pp. 419–432.
[31] C. Silvano et al., “A survey on deep learning hardware accelerators for heterogeneous HPC platforms,” 2023, arXiv:2306.15552.
[32] F. Ge, N. Wu, H. Xiao, Y. Zhang, and F. Zhou, “Compact convolutional neural network accelerator for IoT endpoint SoC,” Electronics, vol. 8, no. 5, 2019, Art. no. 497.
[33] I. A. Assir, M. E. Iskandarani, H. R. A. Sandid, and M. A. Saghir, “Arrow: A RISC-V vector accelerator for machine learning inference,” 2021, arXiv:2107.07169.
[34] J. Vreca et al., “Accelerating deep learning inference in constrained embedded devices using hardware loops and a dot product unit,” IEEE Access, vol. 8, pp. 165913–165926, 2020.
[35] G. Zhang, K. Zhao, B. Wu, Y. Sun, L. Sun, and F. Liang, “A RISC-V based hardware accelerator designed for YOLO object detection system,” in Proc. IEEE Int. Conf. Intell. Appl. Syst. Eng., 2019, pp. 9–11.
[36] G. Ottavi, A. Garofalo, G. Tagliavini, F. Conti, L. Benini, and D. Rossi, “A mixed-precision RISC-V processor for extreme-edge DNN inference,” in Proc. IEEE Comput. Soc. Annu. Symp. Very Large Scale Integr., 2020, pp. 512–517.
[37] A. Garofalo et al., “A heterogeneous in-memory computing cluster for flexible end-to-end inference of real-world deep neural networks,” IEEE Trans. Emerg. Sel. Topics Circuits Syst., vol. 12, no. 2, pp. 422–435, Jun. 2022.
[38] Z. Liu, G. Li, and J. Cheng, “Hardware acceleration of fully quantized BERT for efficient natural language processing,” in Proc. Des., Autom. Test Eur. Conf. Exhib., 2021, pp. 513–516.
[39] S. Harini, A. Ravikumar, and D. Garg, “VeNNus: An artificial intelligence accelerator based on RISC-V architecture,” in Proc. Int. Conf. Comput. Intell. Data Eng., 2021, pp. 287–300.
[40] B. Jacob et al., “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 2704–2713.
[41] Y. Zhang, N. Wu, F. Zhou, and M. R. Yahya, “Design of multifunctional convolutional neural network accelerator for IoT endpoint SoC,” in Proc. World Congr. Eng. Comput. Sci., 2018, pp. 16–19.
[42] Z. Li et al., “Laius: An 8-bit fixed-point CNN hardware inference engine,” in Proc. IEEE Int. Symp. Parallel Distrib. Process. Appl./IEEE Int. Conf. Ubiquitous Comput. Commun., 2017, pp. 143–150.
[43] K. Guo et al., “Angel-Eye: A complete design flow for mapping CNN onto embedded FPGA,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 37, no. 1, pp. 35–47, Jan. 2018.
[44] S.-Y. Pan, S.-Y. Lee, Y.-W. Hung, C.-C. Lin, and G.-S. Shieh, “A programmable CNN accelerator with RISC-V core in real-time wearable application,” in Proc. IEEE Int. Conf. Recent Adv. Syst. Sci. Eng., 2022, pp. 1–4.
[45] D. Pestana et al., “A full featured configurable accelerator for object detection with YOLO,” IEEE Access, vol. 9, pp. 75864–75877, 2021.
[46] K. Kim, S.-J. Jang, J. Park, E. Lee, and S.-S. Lee, “Lightweight and energy-efficient deep learning accelerator for real-time object detection on edge devices,” Sensors, vol. 23, no. 3, 2023, Art. no. 1185.
[47] D. Wu, Y. Liu, and C. Tao, “A universal accelerated coprocessor for object detection based on RISC-V,” Electronics, vol. 12, no. 3, 2023, Art. no. 475.
[48] Z. Wang, K. Xu, S. Wu, L. Liu, L. Liu, and D. Wang, “Sparse-YOLO: Hardware/software co-design of an FPGA accelerator for YOLOv2,” IEEE Access, vol. 8, pp. 116569–116585, 2020.
[49] L. Cai, F. Dong, K. Chen, K. Yu, W. Qu, and J. Jiang, “An FPGA based heterogeneous accelerator for single shot multibox detector (SSD),” in Proc. IEEE 15th Int. Conf. Solid-State Integr. Circuit Technol., 2020, pp. 1–3.
[50] W. Lv et al., “DETRs beat YOLOs on real-time object detection,” 2023, arXiv:2304.08069.
[51] S. Lv, T. Long, Z. Hou, L. Yan, and Z. Li, “3D CNN hardware circuit for motion recognition based on FPGA,” J. Phys.: Conf. Ser., vol. 2363, 2022, Art. no. 012030.
[52] L. Lamberti, M. Rusci, M. Fariselli, F. Paci, and L. Benini, “Low-power license plate detection and recognition on a RISC-V multi-core MCU-based vision system,” in Proc. IEEE Int. Symp. Circuits Syst., 2021, pp. 1–5.
[53] Z. Azad et al., “An end-to-end RISC-V solution for ML on the edge using in-pipeline support,” in Proc. Boston Area Archit. Workshop, 2020.
[54] “The x86 Advanced Matrix Extension (AMX) Brings Matrix Operations to Debut With Sapphire Rapids,” WikiChip, New York, NY, USA, 2023.
[55] E. Azarkhish, D. Rossi, I. Loi, and L. Benini, “Neurostream: Scalable and energy efficient deep learning with smart memory cubes,” IEEE Trans. Parallel Distrib. Syst., vol. 29, no. 2, pp. 420–434, Feb. 2018.
[56] A. Garofalo et al., “DARKSIDE: A heterogeneous RISC-V compute cluster for extreme-edge on-chip DNN inference and training,” IEEE Open J. Solid-State Circuits Soc., vol. 2, pp. 231–243, 2022.
[57] K. Ueyoshi et al., “DIANA: An end-to-end energy-efficient digital and analog hybrid neural network SoC,” in Proc. IEEE Int. Solid-State Circuits Conf., 2022, pp. 1–3.
[58] M. E. Elbtity, B. Reidy, M. H. Amin, and R. Zand, “Heterogeneous integration of in-memory analog computing architectures with tensor processing units,” in Proc. Great Lakes Symp. VLSI, Knoxville, TN, USA, 2023, pp. 607–612.
[59] E. Giacomin, S. Gudaparthi, J. Boemmels, R. Balasubramonian, F. Catthoor, and P.-E. Gaillardon, “A multiply-and-accumulate array for machine learning applications based on a 3D nanofabric flow,” IEEE Trans. Nanotechnol., vol. 20, pp. 873–882, 2021.
[60] Z. Lin et al., “A fully digital SRAM-based four-layer in-memory computing unit achieving multiplication operations and results store,” IEEE Trans. Very Large Scale Integr. Syst., vol. 31, no. 6, pp. 776–788, Jun. 2023.
[61] A. Marchisio, D. Dura, M. Capra, M. Martina, G. Masera, and M. Shafique, “SwiftTron: An efficient hardware accelerator for quantized transformers,” in Proc. Int. Joint Conf. Neural Netw., Gold Coast, Australia, 2023, pp. 1–9, doi: 10.1109/IJCNN54540.2023.10191521.
[62] A. F. Laguna, M. M. Sharifi, A. Kazemi, X. Yin, M. Niemier, and X. S. Hu, “Hardware-software co-design of an in-memory transformer network accelerator,” Front. Electron., vol. 3, 2022, Art. no. 10.
[63] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. Conf. North Amer. Chap. Assoc. Comput. Linguistics: Human Lang. Technol., J. Burstein, C. Doran, and T. Solorio, Eds., 2019, pp. 4171–4186.
[64] S. Lu, M. Wang, S. Liang, J. Lin, and Z. Wang, “Hardware accelerator for multi-head attention and position-wise feed-forward in the transformer,” in Proc. IEEE 33rd Int. Syst.-Chip Conf., 2020, pp. 84–89.
[65] S. Kim, A. Gholami, Z. Yao, M. W. Mahoney, and K. Keutzer, “I-BERT: Integer-only BERT quantization,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 5506–5518.
[66] Y. Guan, Z. Yuan, G. Sun, and J. Cong, “FPGA-based accelerator for long short-term memory recurrent neural networks,” in Proc. 22nd Asia South Pacific Des. Autom. Conf., 2017, pp. 629–634.
[67] S. Nag, G. Datta, S. Kundu, N. Chandrachoodan, and P. A. Beerel, “ViTA: A vision transformer inference accelerator for edge applications,” in Proc. IEEE Int. Symp. Circuits Syst., 2023, pp. 1–5, doi: 10.1109/ISCAS46773.2023.10181988.