
IEEE TRANSACTIONS ON ARTIFICIAL INTELLIGENCE, VOL. 5, NO. 5, MAY 2024

Embedded Deep Learning Accelerators:


A Survey on Recent Advances
Ghattas Akkad, Ali Mansour, and Elie Inaty

Abstract—The exponential increase in generated data as well as the advances in high-performance computing have paved the way for the use of complex machine learning methods. Indeed, the availability of graphical processing units and tensor processing units has made it possible to train and prototype deep neural networks (DNNs) on large-scale datasets and for a variety of applications, i.e., vision, robotics, biomedical, etc. The popularity of these DNNs originates from their efficacy and state-of-the-art inference accuracy. However, this is obtained at the cost of a considerably high computational complexity. Such drawbacks rendered their implementation on limited-resource edge devices, without a major loss in inference speed and accuracy, a dire and challenging task. To this extent, it has become extremely important to design innovative architectures and dedicated accelerators to deploy these DNNs to embedded and reconfigurable processors in a high-performance, low-complexity structure. In this study, we present a survey on recent advances in deep learning accelerators for heterogeneous systems and Reduced Instruction Set Computer processors given their open-source nature, accessibility, customizability, and universality. After reading this article, the readers should have a comprehensive overview of the recent progress in this domain, cutting-edge knowledge of recent embedded machine learning trends, and substantial insights for future research directions and challenges.

Impact Statement—The surge in internet-connected devices and the amount of data generated have made it essential to adopt deep learning routines, i.e., neural networks (NNs), for intelligent data processing and decision making. In addition, it has become mandatory to implement these algorithms on edge devices and embedded systems restricted in memory and resources. As such, it is necessary to optimize these devices and customize their internal architecture to suit the target artificial intelligence application needs without a major loss in accuracy or performance. Although many works have been conducted, the literature is short of survey articles listing the advancements made in this field specifically for customizable processors. Thus, this survey article presents a comparative study of different deep NN (DNN) hardware accelerators implemented on customizable RISC-V processors. This article presents an introduction to DNNs, their different types, the convolutional NN (CNN), the challenges faced in implementing DNNs on embedded devices, available optimization and quantization techniques, and an overview of the RISC-V CPU core and its instruction set architecture (ISA). We highlight its advantages and customizability compared to other licensed processors. We then proceed to detail and compare some of the work done on designing custom CNN accelerators, mainly for Internet of Things and object detection applications on the RISC-V core. Finally, in the remaining sections, we list other techniques for accelerating CNN tasks, such as ISA extensions to optimize extensive operations (dot product, multiply accumulate, looping), and we discuss advances in analog-digital in-memory computation.

Index Terms—Convolutional neural network (CNN), embedded machine learning, hardware accelerators, Reduced Instruction Set Computer (RISC-V), transformers.

Manuscript received 24 May 2023; revised 27 July 2023; accepted 30 August 2023. Date of publication 5 September 2023; date of current version 14 May 2024. This paper was recommended for publication by Associate Editor Supratik Mukhopadhyay upon evaluation of the reviewers' comments. (Corresponding author: Ghattas Akkad.)
Ghattas Akkad and Ali Mansour are with the Lab-STICC, UMR CNRS, École Nationale Supérieure de Techniques Avancées (ENSTA) Bretagne, 29200 Brest, France (e-mail: [email protected]; [email protected]).
Elie Inaty is with the Department of Computer Engineering, University of Balamand, Koura 100, Lebanon (e-mail: [email protected]).
Digital Object Identifier 10.1109/TAI.2023.3311776

I. INTRODUCTION

THE exponential growth in the deployed computing devices, as well as the abundance of generated data, has mandated the use of complex algorithms and structures for smart data processing [1]. Such overwhelming processing requirements have further mandated the use of artificial intelligence (AI) techniques and compatible hardware [2]. Nowadays, machine learning (ML) methods are routinely executed in various fields including health care, robotics, navigation, data mining, agriculture, environment, etc. [3], replacing the need for recurrent human interventions. Classical ML methods have rapidly evolved, over the recent years, to perform compute-intensive operations and have expanded various research areas tenfold. The introduction of high-accuracy and near-real-time-performing deep learning (DL) processes, such as the deep neural network (DNN) [2], [3], has convoyed unprecedented advances in the areas of natural language processing (NLP) [4], object detection, image classification, signal estimation and detection, protein folding, and genomics analysis, to name a few. These convoluted DNN models achieve their high inference accuracy and performance through the use of manifold trainable parameters and large-scale datasets [3]. Training and deploying DNNs rely on performing heavy computations with the indispensable use of high-performance computing units, such as graphical processing units (GPUs) and tensor processing units (TPUs). Consequently, DL structures require considerably high energy consumption and storage capacity [2], severely limiting their implementation, performance, and use on limited-resource devices, i.e., field-programmable gate arrays (FPGAs), systems on chip (SoCs), general-purpose microprocessors (MPs), and digital signal processing (DSP) processors [1], [5], [6].

However, the need for edge AI computing [6], [7] remains relevant and crucial. Edge computing involves offloading DNNs' inference operations to the node processor for implementing AI procedures on the device itself [7], [8]. Adopting this paradigm

requires scaling down the DNN to fit on limited-resource devices without a significant loss in performance and accuracy [2], [9], thus adding new challenges to those already at hand, such as area limitation, power consumption, and storage requirements. To abide by the imposed constraints and to efficiently deploy DNN structures on different processors, such as FPGAs, SoCs, and MPs, one practical yet popular solution is to reduce the DNNs' size and develop task-specific deep learning accelerators (DLAs) [2], [10], [11]. Moreover, several optimization techniques can be applied to reduce DNNs' hardware usage, such as pruning [12], quantization [13], [14], knowledge distillation [15], multiplexing [16], and model compression [17], to name a few.

To account for the requirements of various applications, there exist different DL structures and models [18], such as the convolutional neural network (CNN) [3], [6], recursive neural network (RNN) [3], [18], generative adversarial network [18], graph neural network (NN), and transformers. To accommodate such diversity, popular approaches for implementing DNN accelerators rely on using reconfigurable devices, i.e., FPGAs [19], [20], or extending the architecture and instruction set of ARM and Reduced Instruction Set Computer (RISC-V)-based processors [3]. In addition, the use of dedicated NN libraries and compilers, such as CMSIS-NN by ARM [21] for 16-bit and 8-bit processors, makes it possible to implement some sophisticated quantized DNNs, i.e., 8-bit, 4-bit, 2-bit, and even 1-bit [3]. However, commercial processors have major drawbacks, such as licensing costs and the lack of flexibility in modifying the general architecture [22].

Therefore, a more suitable option is to target open-source RISC-V processors [23]. The RISC-V instruction set architecture (ISA) is managed by a nonprofit foundation and offers several unique advantages [5], [24]:
1) flexibility to fully customize the hardware for specific application requirements, i.e., power, area, and performance, through open-source designs and ISA;
2) reduced third-party licensing dependence and intellectual property usage for DNN accelerators;
3) compatibility, standards, and interoperability among different platforms;
4) community support and long-term sustainability, i.e., frameworks, compilers, and software stacks to facilitate deployment.

In its nature, the RISC-V is an MP with a temporal-like architecture, where the arithmetic operations are performed by the arithmetic logic unit (ALU) found in each processing element (PE). Hence, extending its functionality by implementing parallel DNN accelerators is tremendously beneficial for edge AI inference and embedded ML.

A. Previous Surveys and Motivation

The research work dealing with embedded ML [7] has tremendously increased in an effort to meet the needs of edge AI applications [25]. These publications have been discussed in detailed survey articles published since 2015. Some surveys have discussed the optimization of different DNN hardware implementation techniques and accelerators from a general perspective:
1) hardware-aware neural architecture search [3], [26];
2) CNN accelerators [27];
3) model compression and hardware acceleration [12], [17];
4) hardware and software optimization [22].
Other surveys focused on hardware-specific DNN implementation targeting FPGAs and microcontroller units (MCUs):
1) optimizing NN accelerators for MCU-based inference [2], [9];
2) FPGA-based CNN accelerators [19];
3) custom hardware/application-based accelerators [20].
Other existing surveys, such as [28], [29], [30], and [31], do not specifically focus on RISC-V. However, they provide insights into different DNN acceleration techniques that can be adapted for the RISC-V architecture.

In this study, we provide a comprehensive overview of the recent advances in optimizing DNN hardware implementation and its accelerators for RISC-V, heterogeneous, and hybrid edge devices, as well as meaningful insights into future developments.

B. Survey Plan

The rest of this article is organized as follows. Section II presents the RISC-V central processing unit (CPU) core architecture. Recent work done on DNN hardware accelerators is detailed in Section III. Section IV discusses some RISC-V ISA extensions for accelerating embedded DNN operations. Section V explores transformer accelerators. Finally, Section VI concludes this article with insights into future research tracks and open challenges.

II. RISC-V CPU CORE

The RISC-V is an open-source, customizable, and scalable ISA. The RISC-V core implements a temporal architecture with three basic 32-bit instruction sets, six extended sets, and 32 standard registers [1]. The base instruction set supports three main formats: store (S), register (R), and immediate (I). These formats include logical, arithmetic, data transfer, memory access, and control flow instructions, such as branches and jumps. It also supports instructions to communicate with the operating system. Data transfer between memory and registers may only be done through the use of load and store instructions. In addition, privileged-mode instructions are available to manage system operations and exception handling, allowing full control over the architecture. Furthermore, the RISC-V ISA reserves four instruction spaces for user-defined extensions, i.e., custom-0, custom-1, custom-2/rv128, and custom-3/rv128 [6]. Optional floating-point (FP) and vector (V) instructions [33] are supported to accelerate specific operations. Thus, a RISC-V CPU core coupled with an FPGA enables designers to create custom accelerators and processors for various applications. Some of the most popularly used RISC-V cores are RV12 [24], E203 [6], RI5CY [34], and Rocket [35].
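The reserved custom-0 through custom-3 opcode spaces are what let accelerator designers expose new instructions to software without breaking the standard toolchain. As a purely illustrative sketch (not taken from any of the surveyed designs), the snippet below shows how a hypothetical R-type instruction placed in the custom-0 space (major opcode 0x0B) could be issued from C through the GNU assembler's `.insn` directive; the wrapper name, funct fields, and "packed MAC" semantics are assumptions for the example, and the code only runs on a core that actually implements such an instruction.

```c
/* Illustrative only: issue a hypothetical R-type instruction from the
 * RISC-V custom-0 opcode space (major opcode 0x0B) using GCC inline asm.
 * The funct3/funct7 values and the "mac8" semantics are assumptions. */
#include <stdint.h>

static inline uint32_t custom0_mac8(uint32_t acc, uint32_t packed_a)
{
    uint32_t rd;
    /* .insn r  opcode, funct3, funct7, rd, rs1, rs2 */
    __asm__ volatile(".insn r 0x0B, 0x0, 0x00, %0, %1, %2"
                     : "=r"(rd)
                     : "r"(acc), "r"(packed_a));
    return rd;
}
```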


Fig. 1. RV12 single-core RV32I/RV64I six-stage pipeline RISC-V CPU. Source: Adapted from [24].

First, RV12, shown in Fig. 1, is a highly configurable single-core CPU based on the industry-standard RISC-V ISA, available in both RV32I and RV64I (32/64-bit) versions [24]. It features a six-stage execution pipeline implemented in a Harvard architecture. The pipeline stages increase the overall execution efficiency, reduce stalls, and optimize the overlap between execution and memory access [24]. In addition, the architecture includes several features, categorized as core, optional, parameterized, and configurable. These features provide the designer with the ability to emphasize performance, power, or area based on the application requirements [24]. The processor also includes a branch predictor unit, a data cache, and an instruction cache to speed up execution [24].

Second, E203 is a 32-bit RISC-V processor designed for energy-efficient and high-performance computing applications, such as the Internet of Things (IoT) [6]. E203 supports the RV32IMAC instruction set and is the closest to the ARM Cortex M0+ [6]. It is composed of two pipeline stages, where the first pipeline stage handles instruction fetch, decode, and branch prediction. The resulting program counter (PC) and instruction value are loaded in the PC and Instruction Register (IR), respectively. The second pipeline stage mainly handles rerouting the IR to the appropriate processing unit to execute the required operation. The main processing units are the ALU, the multiplier/divider, the access memory unit, and the extension accelerator interface (EAI) [6].

Third, RI5CY is an energy-efficient four-stage-pipeline 32-bit RISC-V processor core designed by the PULP platform. The core architecture supports the RV32IMAC instruction set and implements power gating and clock frequency scaling (CFS) units to better manage and reduce power consumption. In addition, it also implements a hardware loop unit to efficiently execute loops, various single instruction multiple data (SIMD) instructions to accelerate DSP operations, and postincrement load and store addressing to improve overall performance. The RI5CY core is mostly used in accelerating mixed-precision DNN operations [36].

Finally, the Rocket core is a high-performance five-stage-pipeline 64-bit RISC-V processor, which supports the RV64GC instruction set. The core supports a wide range of operating systems and has a peak performance of four instructions per cycle (IPC). The Rocket core is configurable to suit different application requirements and serves as a reference for the RISC-V ISA. In addition, it is highly extensible and is designed to allow developers to incorporate custom instructions, dedicated accelerators, and complex extensions [37].

As shown in Table I, all the processor cores support the listed features except for RV12. In addition, the Intel i7-8700 and the Rocket cores scored the highest and second highest peak IPC values of 4.6 and 4.0, respectively. However, in contrast to the Rocket core, the Intel i7-8700 is a power-hungry desktop processor and is not suitable for embedded applications. Surprisingly, all the RISC-V cores outperformed the commercial ARM Cortex M4 and M7 in peak IPC while offering similar features. Thus, to maximize performance and efficiency, the RISC-V core should be selected with respect to the application requirements. Additional comparison of the RISC-V against other platforms, such as TPU and GPU, is provided in [39].

III. DNN HARDWARE ACCELERATORS

Computing platforms, such as CPUs, TPUs, and GPUs, are expensive, power-hungry, and unsuitable for edge applications. On the other hand, application-specific integrated circuits (ASICs) are fast but deploy a nonreconfigurable architecture [22], [30]. However, RISC-V processors and FPGAs can be used concurrently to accelerate different DL structures as they are highly customizable. Mostly, those with exploitable parallelism can benefit the most from optimized matrix operations. However, the choice varies with respect to the target application's requirements and the available hardware resources.

The most popular structure implemented on edge devices is the CNN [3]. CNNs are inherently parallel and more commonly used in error-tolerant applications. They can be further simplified, at the cost of minor unnoticeable errors, to optimize power usage, hardware resources, and latency [3]. Moreover, substantial work has been done on efficiently accelerating quantized transformer models for deployment on edge devices.

A. CNN Accelerators for the IoT

To meet the basic CNN functionalities for multimedia data processing, a low-bandwidth, area-efficient, and low-complexity accelerator was designed for the IoT SoC endpoint [32]. The CNN accelerator is constructed in the form of parallel operating acceleration chains, each with a serially connected convolution, adder, activation function, and pooling circuit [32], as shown in Fig. 2. Src is the source input, 32b is the 32-bit bus width, and 2D-Conv is the 2-D convolution operation.

In Fig. 2(a), a classical IoT SoC processing data flow is expanded to include a compact CNN accelerator connected to the CPU kernel through the SoC bus.
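From the CPU kernel's point of view, an accelerator attached to the SoC bus is typically driven through a small block of memory-mapped registers. The sketch below is a generic, hypothetical illustration of that pattern; the base address, register map, and bit fields are invented for the example and are not the ones used in [6] or [32].

```c
/* Generic memory-mapped accelerator control sketch (hypothetical register
 * map; real designs such as [6], [32] define their own layout). */
#include <stdint.h>

#define CNN_ACC_BASE   0x40010000u            /* assumed base address    */
#define REG(off)       (*(volatile uint32_t *)(CNN_ACC_BASE + (off)))

#define CNN_CTRL       REG(0x00)              /* bit 0: start            */
#define CNN_STATUS     REG(0x04)              /* bit 0: done             */
#define CNN_SRC_ADDR   REG(0x08)              /* input feature map addr  */
#define CNN_DST_ADDR   REG(0x0C)              /* output buffer addr      */
#define CNN_CFG        REG(0x10)              /* layer configuration     */

void cnn_layer_run(uint32_t src, uint32_t dst, uint32_t cfg)
{
    CNN_SRC_ADDR = src;
    CNN_DST_ADDR = dst;
    CNN_CFG      = cfg;
    CNN_CTRL     = 1u;                        /* kick off the layer       */
    while ((CNN_STATUS & 1u) == 0u) { }       /* busy-wait for completion */
}
```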


TABLE I: PERFORMANCE COMPARISON OF THE RV12, E203, RI5CY, AND ROCKET RISC-V CORES

Fig. 2. Compact CNN accelerator. (a) Data flow. Source: Adapted from [6]. (b) Compact CNN accelerator architecture. Source: Adapted from [32]. (c) Acceleration chain architecture. Source: Adapted from [32].

The compact CNN accelerator, detailed in Fig. 2(b), is formed of a core random access memory (RAM), three ping-pong buffer blocks denoted by (BUF RAM BANK), two data selectors, a CNN controller, and four acceleration chains. The accelerator chain top-level architecture is presented in Fig. 2(c) and performs the core mathematical operations, i.e., 2-D convolution, matrix addition, rectified linear unit (ReLU) activation function, and pooling, in fixed-point format [32]. It is essential to highlight that the fully connected layer (FCL) operation can be viewed as a special case of the convolution operation with similar hardware implementation [32], [40]. As such, the FCL operation is implemented in the 2D-Conv block [32]. Given the fixed sequence of operations, the operating blocks are serially connected to reduce internal data movement and interconnectivity. The Bypass control allows bypassing specific, not needed, modules without affecting the system performance or results. In addition, the data width in a chain varies to maintain accuracy; however, it remains consistent between layers [32].
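The serially chained conv → add → ReLU → pool data path with per-stage bypass controls can be pictured in software as a fixed sequence of fixed-point stages. The following C sketch is only an illustrative model of such an acceleration chain, not the RTL of the cited designs; the function names, the Q1.15 format, and the bypass flags are assumptions made for the example, and the pooling stage is omitted for brevity.

```c
/* Illustrative fixed-point model of one acceleration chain:
 * 2D-Conv -> Add -> ReLU (-> Pool), each stage with a bypass flag.
 * Names and the Q1.15 format are assumptions for the sketch only. */
#include <stdint.h>
#include <stddef.h>

#define Q_SHIFT 15  /* assumed Q1.15 fixed-point format */

static int16_t sat16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* 3x3 convolution, unit stride, "valid" padding, fixed-point MAC. */
static void conv2d_3x3(const int16_t *src, int w, int h,
                       const int16_t k[9], int16_t *dst) {
    for (int y = 0; y < h - 2; ++y)
        for (int x = 0; x < w - 2; ++x) {
            int32_t acc = 0;
            for (int ky = 0; ky < 3; ++ky)
                for (int kx = 0; kx < 3; ++kx)
                    acc += (int32_t)src[(y + ky) * w + (x + kx)] * k[ky * 3 + kx];
            dst[y * (w - 2) + x] = sat16(acc >> Q_SHIFT);
        }
}

static void add_bias(int16_t *buf, size_t n, int16_t bias) {
    for (size_t i = 0; i < n; ++i) buf[i] = sat16((int32_t)buf[i] + bias);
}

static void relu(int16_t *buf, size_t n) {
    for (size_t i = 0; i < n; ++i) if (buf[i] < 0) buf[i] = 0;
}

/* One pass through the chain; each 'bypass_*' flag skips a stage,
 * mirroring the Bypass control described for the compact accelerator. */
void accel_chain(const int16_t *src, int w, int h, const int16_t k[9],
                 int16_t bias, int16_t *dst,
                 int bypass_add, int bypass_relu) {
    conv2d_3x3(src, w, h, k, dst);
    size_t n = (size_t)(w - 2) * (size_t)(h - 2);
    if (!bypass_add)  add_bias(dst, n, bias);
    if (!bypass_relu) relu(dst, n);
    /* pooling stage omitted for brevity; it would follow the same pattern */
}
```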


The CNN accelerator was prototyped and tested for the ARM Cortex M3 [32]. The data flow direction is one way, whereas two memory access operations are required. This reduces efficiency and flexibility and increases power consumption [6]. In order to improve its performance and efficiency and reduce memory access operations, the IoT CNN accelerator was modified in [6] to a coprocessor and connected to the RISC-V E203 CPU through the EAI.

The configurable CNN accelerator [6] modifies the compact CNN accelerator [32] by optimizing memory access and by replacing the CNN acceleration chain with a crossbar interconnecting different arithmetic units, as shown in Fig. 3. Replacing the serialized acceleration chain by a crossbar provides a reconfigurable architecture that allows data to flow in different directions, improving the computation performance of different algorithms [6].

Fig. 3. Reconfigurable CNN accelerator. Source: Adapted from [6]. (a) Top-level diagram. (b) Crossbar architecture.

TABLE II: COMPARISON OF CNN ACCELERATOR RESOURCE CONSUMPTION

In contrast to Fig. 2(b), the CNN accelerator in Fig. 3(a) uses two ping-pong buffers (BUF RAM BANK) instead of three and a reconfigurable controller instead of the CNN controller. Each PE has been modified to make use of a crossbar interconnecting its arithmetic units instead of a serialized chain. The crossbar architecture is displayed in Fig. 3(b) and is formed of a first-in first-out (FIFO) buffer, configuration registers (cfg Regs), and five multiplexers to route the data appropriately [6]. The cfg Regs are configured through the reconfigurable controller block.

Fig. 4. RISC-V IoT SoC top-level diagram. Source: Adapted from [6].

In contrast to Fig. 2(a), the CNN accelerator shown in Fig. 4 is a reconfigurable coprocessor rather than an extension. In addition, the memory access is optimized and controlled by the coprocessor, thus improving the overall performance [6].

Both designs were implemented using an FPGA, specifically the Xilinx VC707 board with an XC7VX485T-2 FPGA [32] and


the Xilinx XC7A100TFTG256-1 [6]. A summary of resource consumption is shown in Table II and is compared to the work done in [41], [42], and [43]. LUT represents lookup table slices, FF represents register slices, and DSP represents the configurable digital signal processing units implemented at the transistor level and formed of dedicated multipliers, adders, and registers. FPS is frames per second, the power is given in watts (W), and the throughput in giga operations per second (GOPS).

It can be seen from Table II that the CNN coprocessor proposed in [6], in contrast to the accelerator proposed in [32], requires the use of 21 DSP blocks and a minor increase in LUT and FF elements. This is logical, since the CNN accelerator in [32] has been extended to a reconfigurable coprocessor [6]. In addition, the Cortex M3 core [32] accounts for 15 162 of the total SoC LUT resources, while the E203 core only requires 4338 [6]. The design suggested in [32] displayed better throughput and resource usage as compared to those suggested in [41] and [42].

Furthermore, a RISC-V-based CNN coprocessor is proposed in [44] for epilepsy detection. The coprocessor is formed of an eight-layer 1-D CNN accelerator, a two-stage RISC-V processor, a main controller, and local memory units operating at a 10-MHz clock frequency. The accelerator is programmable and supports the implementation of various CNN models. In contrast to the listed designs, the coprocessor in [44] requires the least resources, where only 3411 LUTs, 2262 registers, and six DSP units are needed. Moreover, the coprocessor consumes 0.118 W, has a latency of 0.137 ms per class, and provides a 99.16% accuracy on fixed-point operations. The design's low power and resource requirements make it a suitable choice for low-power IoT wearable devices [44]. While these accelerators are specifically designed for deployment on edge devices, they cannot compete with high-performance models such as that proposed in [43], offering a throughput of 84.3 GOPS.

B. CNN Accelerators for Object Detection

Classically, an object detector relies on segmentation, low-level feature extraction, and classification with respect to a shallow NN [45], [48]. However, with the advances in DNNs and hardware computing power, state-of-the-art detectors make use of deep CNN structures to dynamically extract complex features for accurate classification [45]. One of the most prevailing object detectors is the you only look once (YOLO) detector. The YOLO detector and its successors (YOLOv2 [48], YOLOv3, and YOLOv4 [45]) offer the best bargain between performance (speed) and accuracy. However, this performance is achieved at the cost of high computational complexity and requirements, making it difficult to implement these networks on edge devices. Lightweight YOLO models (Tiny-YOLOv3 and Tiny-YOLOv4) have been proposed to reduce the complexity, i.e., fewer parameters, at the cost of a slight reduction in accuracy. Thus, to implement these lightweight models on embedded systems, suitable, low-energy, and high-performance architectures are required [45].

To accommodate such requirements, a RISC-V-based YOLO hardware accelerator with a multilevel memory hierarchy was proposed in [35]. The YOLO model implements the Darknet-19 inference network [35]. In their design [35], the filters are considered of size 3 × 3 or 1 × 1, the stride is 1, and the output is always a multiple of 7. The YOLO hardware accelerator is designed and implemented with respect to specific considerations and parameters rather than generalized. This is to achieve an area- and energy-efficient architecture [35]. The YOLO accelerator controller is chosen as the open-source RISC-V Rocket Core with extended, customized instructions.

As shown in Fig. 5(a), describing the top-level architecture, the accelerator is connected to the CPU core through the Rocket custom coprocessor (ROCC) interface. The instruction FIFO and data FIFO (DFIFO) registers store the instructions and data forwarded by the CPU core. The decoder block decodes and processes the instructions to the finite-state machine (FSM) acting as the main control unit of the compute, padding, and memory modules. In a parallel process, the input is read from the double data rate synchronous dynamic random-access memory (DDR-SDRAM), stored in the buffer, and communicated to the computation module. The DFIFO transfers the CPU data to both the FSM and the computation module to begin the CNN operations, i.e., convolution, pooling, and activation [35].

The computation module's core operating unit is the convolution unit, shown in Fig. 5(b), which performs the convolution operation, the max pooling, and the activation function. The convolution unit is formed of nine multipliers, seven adders, and a five-stage-pipeline FIFO, as noted in Fig. 5(b) [35]. The data and weights are serially fed to the convolution unit using the input FIFO buffers, at every clock cycle, to perform the convolution operation. The output is then passed to a pooling unit that performs only max pooling with respect to three comparators. Finally, the ReLU activation function is performed on the results [35].

To improve overall performance, a memory hierarchy is designed and implemented by Zhang et al. [35]. With respect to Fig. 5(c), the memory hierarchy is composed of three levels: off-chip DDR-SDRAM, input/output (I/O) data buffers, and internal weight and input double register groups, 9 × 8 bit and 226 × 4 × 8 bit, respectively [35]. By adopting such a hierarchy, the interface's limited-bandwidth bottleneck complications can be avoided. Although the design is energy efficient and requires a relatively small on-chip area, it requires 2.7 s to finish the YOLO inference operation. This delay is a result of a tradeoff between resource usage and speed, i.e., a serially implemented computational module. To decrease the inference latency, the authors suggested adding additional, parallel, computation modules [35]. The authors evaluated the system's performance with seven computation modules, achieving a 400-ms average time [35].
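Multilevel hierarchies of this kind hide the off-chip bandwidth limit by double-buffering the I/O data buffers: the next tile is fetched while the current one is being processed. The following C sketch only illustrates that ping-pong pattern in software terms; the buffer sizes are arbitrary, and the "DMA" is simulated with memcpy, whereas a real SoC would program the DMA engine and wait on its completion event.

```c
/* Double-buffering sketch: overlap fetching tile i+1 with computing tile i.
 * The "DMA" is simulated with memcpy; sizes and names are illustrative. */
#include <stdint.h>
#include <string.h>

#define NUM_TILES  16
#define TILE_WORDS 1024

static int16_t ext_mem[NUM_TILES][TILE_WORDS];  /* stand-in for off-chip DDR */
static int16_t buf[2][TILE_WORDS];              /* on-chip ping-pong buffers */

static void dma_start(int16_t *dst, size_t tile) {
    memcpy(dst, ext_mem[tile], sizeof ext_mem[tile]);  /* simulated transfer */
}
static void dma_wait(void) { /* immediate in this simulation */ }
static void compute_tile(const int16_t *tile) { (void)tile; /* e.g., 3x3 convs */ }

void run_layer(size_t num_tiles)   /* num_tiles <= NUM_TILES */
{
    dma_start(buf[0], 0);
    dma_wait();
    for (size_t i = 0; i < num_tiles; ++i) {
        if (i + 1 < num_tiles)
            dma_start(buf[(i + 1) & 1], i + 1);  /* prefetch next tile       */
        compute_tile(buf[i & 1]);                /* work on current tile     */
        if (i + 1 < num_tiles)
            dma_wait();                          /* ensure next tile arrived */
    }
}
```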


Fig. 5. YOLO RISC-V accelerator. Source: Adapted from [35]. (a) Top-level architecture. (b) Convolution block architecture (part of the computation module). (c) Multilevel memory hierarchy.

TABLE III: RISC-V YOLO ACCELERATOR RESOURCE COMPARISON AND PERFORMANCE

The accelerator is model specific and designed with respect to constant configurations, i.e., filter, output, and stride size. However, there exist different YOLO versions, each with a different input feature map size. Designing a specific accelerator for each feature can be a tedious solution. Thus, a configurable, parameterizable RISC-V accelerator core is designed based on the Tiny-YOLO version [45].

The accelerator in [45] is designed as an algorithm-oriented hardware core to accelerate lightweight versions of YOLO, thus allowing flexibility, robustness, and configurability. In addition, this design not only accelerates the CNN network, but also the pre- and post-CNN operations [45]. The proposed generalized YOLO accelerator is shown in Fig. 6. The Vread Bias, vRead Weights, vRead Array, and vWrite Array are configurable dual-port memory units. The direct memory access (DMA) unit is used to read data from the external memory, the functional unit (FU) is a matrix of configurable custom computing units, and AGU is the address generator unit. The FU matrix unit is used for reading tiles of the input feature map [45].

As shown in Fig. 6(a), the YOLO accelerator's top-level architecture is mainly formed of three stages: xWeightRead, xComp, and AXI-DMA. The xWeightRead stage is formed of the Vread Bias and the VRead Weights array. These units perform the read, write, and store operations to and from external memory and provide the needed data to the FU matrix unit, as detailed in Fig. 6(b). The weight memories are implemented as asymmetric dual-port units with an external 256-bit bus. In addition, the FU matrix is the accelerator's main PE and is located in the XComp stage. The FU matrix is a collection of interconnected, reconfigurable PEs whose sole purpose is to perform the 3-D convolution operations. Each custom FU architecture, detailed in Fig. 6(c), is formed of an array of multiply accumulate (MAC) units, an adder tree, a Sigmoid activation function, and a leaky ReLU activation function. The multiplexers route the data internally and introduce a higher level of customizability [45].

The RISC-V YOLO accelerators presented in [35] and [45] are compared with respect to resource utilization and speed in Table III. In addition, they are compared against different DNN architectures. BRAM is the internal FPGA block RAM unit and CM signifies compute module.

As shown in Table III, the special-purpose YOLO accelerator designed in [35] requires the least resources with 161 DSP blocks, compared to 832 for the TinyYOLO v3 and 1248 for the TinyYOLO v4. The YOLO accelerator CM is implemented in a serially operating manner, while the TinyYOLO v3 and v4 PEs operate in parallel. However, the massive reduction in resource usage is at the cost of slow performance, i.e., 2.7 s compared to 30.9 and 32.1 ms, with an architecture specifically tailored for predefined parameters [45]. In contrast, the TinyYOLO v3 and v4 designs presented in [45] offer a massive increase in performance, i.e., an average of 30 ms, at the cost of a tenfold increase in resource usage, mainly the DSP blocks and BRAM units. The TinyYOLO v3 and v4 cores are highly customizable and can be configured to meet any YOLO network version requirements.


Fig. 6. Generalized YOLO RISC-V accelerator. Source: Adapted from [45]. (a) Top-level architecture. (b) Detailed architecture. (c) Custom FU architecture.

To improve the YOLO accelerator performance, Zhang et al. [35] suggested using seven serially operating CMs placed in parallel to speed up the convolution operation, thus achieving an execution speed of approximately 400 ms. The overall resource requirements for implementing the RISC-V processor and the TinyYOLO accelerators are obtained at a slight increase in unit usage. Moreover, a lightweight SqueezeNet CNN was proposed for edge MCU-based object detection applications [46]. The proposed architecture is prototyped on the ZYNQ ZC702 SoC and can perform an inference run in 22.75 ms while consuming an average power of 2.11 W. Although the proposed model is not RISC-V specific, it can be adopted for use with these open-source processors. As the presented accelerators are architecture specific, i.e., TinyYOLO and SqueezeNet, a universal coprocessor is designed to efficiently implement different object detection networks [47]. The universal coprocessor is prototyped on the E203 RISC-V SoC and evaluated with respect to different architectures, such as Faster R-CNN, YOLOv3, SSD513, and RetinaNet. The coprocessor is able to complete an inference run in 210, 51, 125, and 73 ms with 27.2, 33, 31.2, and 32.5 mean average precision (mAP) for the listed networks, respectively [47].

The choice of an accelerator is heavily dependent on the edge device, its resources, and its processing capabilities. While the YOLO accelerator and lightweight SqueezeNet [35], [46] are designed with specific considerations, they are most suitable for lower end devices and can be redesigned for other specifications if needed. For higher end devices and more complex applications, the designs presented in [45] can be a better alternative with an average speed of 30 ms. However, for general-purpose SoCs and generic applications, the universal coprocessor [47] is the convenient choice.

C. Heterogeneous Single-Shot Multibox Detector Accelerator for Object Detection

DL-based real-time object detection [49], [50] and motion recognition [51] are popularly implemented in advanced driver assistance systems and video analysis applications. The single-shot multibox detector (SSD) combines the advantages of YOLO and Faster R-CNN for fast and accurate real-time object detection [49].


TABLE IV: HETEROGENEOUS SSD ACCELERATOR PERFORMANCE IN FPS

The SSD detects multiple objects through a single image snapshot. This is done by dividing the image into multiple grid cells with bounding/anchor boxes and performing concurrent object detection on each cell's region. The need for high-accuracy and high-speed inference makes implementing SSD DL structures on hardware a challenging task. ASICs, GPUs, and FPGAs are famously used for accelerating complex DL structures with high inference speed. For SSD acceleration, the ASIC and GPU offer the least desirable choices, mainly due to the lack of flexibility of the former and the high power consumption of the latter [49]. While the FPGA offers more flexibility and customizability, its limited resources constrain the performance of fully integrated complex systems [49]. As such, a heterogeneous CPU-FPGA-based approach was proposed in [49] to accelerate both the software and hardware parts of the SSD DL structure. The target CPU (host) and FPGA were chosen as the Intel Xeon Silver 4116 and the Arria 10 development board, respectively [49].

As for the software, the pretrained network is optimized by fusing the batch normalization layer with the convolution layer. The operator fusion technique reduces the number of network layers, input parameters, and memory accesses during inference. By adopting this technique, the inference speed, measured in FPS, is increased by 10–30% for different SSD networks [49]. Using layer hardware affinity and graph partitioning [49], each SSD network is divided into subgraphs. Then, each subgraph is executed on its target processor to appropriately manage workload, resource consumption, and inference speed. Base SSD operations, such as multiscale feature maps and CNNs, are executed on the FPGA side using parallel PEs (SubGraph 1). However, high-complexity layers and operations, such as nonmaximum suppression, are executed on the CPU side (SubGraph 2). With the use of OpenCL kernels, the CPU and the FPGA operate interchangeably to appropriately manage data transfer and the execution of compute-intensive layers [49].

As for the hardware, a block floating-point (BFP) scheme is implemented to reduce the data size and resource usage. In the BFP, a group of FP numbers share a common exponent. Precisions of 11-bit FP (FP11) and FP16 have resulted in a 30% and 20% reduction in data size with an mAP loss of 0.3–1.4 and 0.2, respectively [49]. Table IV compares the accelerator's inference speed performance, in FPS, with respect to different models and against other systems [49].

As compared to the CPU-Nvidia Titan X GPU system, the FP11 heterogeneous accelerator [49] achieved a 2.4× and 3.0× improvement for the Inception and MobileNet models, respectively. In addition, the FP11 accelerator achieved a 1.42× and 1.28× improvement versus the GPU. When compared to the full-precision system, the FP11 architecture achieved a 10–30% improvement with an mAP loss of 0.3–1.4. Finally, as compared to the SSD design, the RISC-V (RV) TinyYOLO v4 achieves lower FPS on both the GPU (Nvidia RTX 2080Ti) and the SoC. However, it can be fully implemented on hardware, where the RISC-V softcore occupies 3475 resources and is used as a controller.

Moreover, a low-power RISC-V MCU is proposed in [52] for SSD-based license plate recognition. The nine-core RISC-V processor implements a multimodel inference approach based on SSDlite-MobilenetV2 operating at 117 mW. It achieves an approximate accuracy of 99.13% with a 38.9% mAP score. In contrast to the heterogeneous SSD design [49], the nine-core MCU fully implements the SSDlite structure, however, at the cost of a much lower throughput of 1.09 FPS. This is because the architecture does not include any hardware accelerators, and the execution is run solely on the MCU cores [52].

D. 3-D CNN Accelerator for Motion Recognition

3-D CNNs have become widely used in video analysis and motion recognition. In contrast to the classical CNN, the 3-D CNN model can extract both temporal and spatial information, i.e., action information, from consecutive frames [51]. To accelerate 3-D CNNs on embedded devices, a mini 3-D CNN (mini-C3D) design is proposed in [51] for the Weizmann dataset with 3 × 3 × 3 kernels. The proposed design modifies the C3D network [51] by mapping the 3-D convolution to 2-D matrix multiplications using the general matrix multiplication (MatMul) algorithm. The matrix is divided into smaller blocks and processed by the FPGA's 2-D MAC accelerator array [51]. The design is optimized and implemented on the Xilinx PYNQ-Z2 development board with the ZYNQ 7020 SoC composed of an ARM A9 processor and an Artix-7 FPGA. The optimization is done for both the software and hardware parts. The CPU-FPGA data exchange is done through the AXI bus and the DMA feature [51].

The software design mainly consists of three different layers in order to load the 3-D CNN model and features on the FPGA, control the accelerator, and exchange CPU-FPGA data. The video input features are preprocessed using the video frame difference method, binarization, and portrait contour cropping [51]. An image-to-row operation is then performed to map convolution multiplications into matrix multiplications. Finally, the resulting feature matrix is divided into blocks in order to simplify the DMA channel transfers [51]. A controller is implemented to manage data transfer, zero padding, and output reconstruction.
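The image-to-row step is what turns each convolution window into one row of a feature matrix so that the whole layer reduces to a matrix product that a 2-D MAC array can process block by block. The sketch below illustrates the idea, reduced to a single 2-D channel and a 3 × 3 kernel for brevity (the mini-C3D design applies the same transform to 3-D volumes); all names are illustrative and not taken from [51].

```c
/* im2row sketch: each 3x3 window becomes one row of A, so the layer output
 * is the matrix product A (rows x 9) * w (9 x 1). Single channel, unit
 * stride, "valid" padding; illustrative names only. */
#include <stddef.h>

void im2row_3x3(const float *img, int w, int h, float *A)
{
    size_t row = 0;
    for (int y = 0; y < h - 2; ++y) {
        for (int x = 0; x < w - 2; ++x) {
            for (int ky = 0; ky < 3; ++ky)
                for (int kx = 0; kx < 3; ++kx)
                    A[row * 9 + (size_t)(ky * 3 + kx)] =
                        img[(y + ky) * w + (x + kx)];
            ++row;
        }
    }
}

/* One output value per row: the row-times-vector products are exactly the
 * MAC operations that the parallel accelerator array executes in blocks. */
void matmul_rows(const float *A, size_t rows, const float w9[9], float *out)
{
    for (size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (int k = 0; k < 9; ++k)
            acc += A[r * 9 + (size_t)k] * w9[k];
        out[r] = acc;
    }
}
```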


The FPGA hardware design is divided into five main parts [51].
1) Hardware interface: It consists of an AXI4-Stream data input interface and a 4-bit AXI4-Lite configuration interface. The parallel MAC accelerator receives the block feature and weight matrices through the data input stream, as read from the DDR3 memory.
2) On-chip cache: It is designed as a single-port BRAM buffer space.
3) Loop boundary control: Feature and weight matrices will be loaded in the on-chip cache in the form of cyclic tiling. The loop boundary control sets the cyclic boundary parameters to account for the different convolution layers' requirements.
4) Matrix tiling: It performs a circular tiling to cache the block feature and weight matrices.
5) Parallel MAC: It implements the parallel, pipelined MAC array accelerator to perform the MatMul. The accelerator is optimized using different high-level synthesis parameters, i.e., PIPELINE, UNROLL, and PARTITION.

TABLE V: MINI-3D CNN PERFORMANCE, RESOURCE USAGE, AND POWER

The mini-C3D was implemented on the ARM A9, the ARM-FPGA, and the AMD-3550H CPU running at a 3.7-GHz turbo frequency. It was evaluated in terms of inference speed, resource utilization, and computing power.

From Table V, as compared to the ARM A9 stand-alone implementation, the ZYNQ 7020 with the proposed FPGA accelerator greatly improved the inference speed by approximately 17.99× [51] with an average power consumption of 1.602 W. While the AMD-3550H CPU achieved the best speed, it is not suitable for embedded applications as it has the highest power consumption of 35 W. Moreover, in terms of computational performance, the ZYNQ 7020 system achieved better performance with 0.155 GOPS/W, while the ARM-based implementation only achieved 0.011 GOPS/W. The design implements 27 parallel MAC units and requires a total of 35 570 FPGA resources. The most utilized resources are the DSP blocks and BRAM, occupying 62.27% and 30.36% of the total available, respectively. Compared to other designs, the proposed system achieved an accuracy of 95% for the Weizmann dataset. The mini-C3D heterogeneous design provides a system for implementing large-scale, low-power 3-D CNNs on embedded SoC devices with accelerated inference speed [51].

IV. ISA DNN EXTENSIONS

While some work focused on designing fully programmable generic CNN accelerators for RISC-V processors, other work optimized embedded DNN operations by extending the original RISC-V ISA [33], [53]. This technique implements specific core DNN routines, such as hardware loops [34], dot product [34], mixed-precision support [36], in-memory computations [37], and in-pipeline ML processing [53], to improve overall performance.

A. Hardware Loop and Dot Product

DNN routines consist of heavy arithmetic operations. To accelerate these computations, dedicated, parallel hardware blocks are needed. Thus, a tradeoff exists between resource utilization and performance [34].

In order to speed up DL algorithms in RISC-V, without a major sacrifice in hardware, an instruction set extension has been proposed in [34], mainly for hardware loops, i.e., zero-overhead loops, and dot product operations, as shown in the following equation:

res = \sum_{n=0}^{N-1} vec_a[n] × vec_b[n]    (1)

where N is the vector length.

DNN routines include extensive matrix computations, i.e., MAC, implemented using loop instructions. Traditionally, loops, when implemented in software, incur a large branch overhead that adds numerous setbacks to the architecture, i.e., increased delay and resource usage. By considering hardware loops and supporting instruction set extensions, the branch overhead can be removed, resulting in increased performance [34]. In addition, extended instruction sets for accelerating critical arithmetic operations, i.e., vector multiplications, play an important role in enhancing overall performance [33].

To evaluate the advantages of hardware loops and dot product acceleration, Vreca et al. [34] modified the RISC-V RI5CY core ISA to support these custom instructions. The RI5CY core is a 32-bit four-stage-pipeline RISC-V core with integer multiplication, division, and FP instructions. It has 31 general-purpose registers, 32 FP registers, and a 128-bit cache for instruction prefetch [34]. In addition, it provides the XpulpV2 nonstandard extension, which includes several functionalities, such as hardware loops [36]. From the modified RI5CY RISC-V core block diagram, shown in Fig. 7, we can list the following details: nonhighlighted boxes: original RI5CY core architecture; blue boxes: PEs/operating units; red boxes: control logic; violet boxes: pipeline stage registers; orange boxes: general-purpose and status registers; and gray boxes: interface. The modified RI5CY core includes a hardware loop control "hwloop control" block highlighted in red and an FP dot product unit "fDotp" highlighted in blue [34]. The "hwloop control" core is capable of handling two-level nested loops. The "fDotp" unit performs two core instructions, i.e., p.fdotp2.s and p.fdotp4.s, on single-precision 32-bit FP numbers. These instructions perform the dot product operation described in (1) on two- or four-element vectors, respectively. The "fDotp" unit is not pipelined, since the RI5CY core runs at a low frequency for reduced energy consumption. However, this poses no considerable effect on performance [34]. This design was prototyped on the ZYNQ 7000 SoC board and later synthesized using the Synopsys Design Compiler and the 90-nm generic core cell library from United Microelectronics Corporation [34]. Compared to the original RI5CY design, occupying an area of 0.24233283 mm² with a dynamic power of 147.48 mW, the modified RI5CY is 72% larger and requires an area of 0.41758819 mm² with 148.47 mW. The increase in area is caused by the addition of the single-precision FP dot product unit [34].
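In plain C, the operation in (1) is the familiar reduction loop shown below. On the modified RI5CY core, the multiply–adds of the unrolled body correspond to the work performed by one fDotp instruction (p.fdotp2.s or p.fdotp4.s), while the loop counter and branch are removed by the XpulpV2 hardware-loop mechanism; the C code itself is only a reference illustration, not the optimized assembly library used in [34].

```c
/* Reference dot product for (1). On the modified RI5CY core of [34], the
 * unrolled multiply-adds map onto the fDotp instructions and the loop
 * bookkeeping onto XpulpV2 hardware loops; this C version only illustrates
 * the computation being accelerated. */
float dot_product(const float *vec_a, const float *vec_b, int n)
{
    float res = 0.0f;
    int i = 0;
    for (; i + 1 < n; i += 2)     /* 2-element step, as in p.fdotp2.s */
        res += vec_a[i] * vec_b[i] + vec_a[i + 1] * vec_b[i + 1];
    for (; i < n; ++i)            /* scalar tail */
        res += vec_a[i] * vec_b[i];
    return res;
}
```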


Fig. 7. Modified RI5CY RISC-V core block diagram. Source: Adapted from [34].

TABLE VI: MODIFIED RI5CY PERFORMANCE COMPARISON PER ONE INFERENCE RUN

A simple optical character recognition NN was implemented to evaluate the designed ISA optimization. The five-layer network architecture is as follows: 28 × 28 input, 24 × 24 convolution, 12 × 12 max pooling, 60 fully connected, and 10 output layers [34]. The modified RI5CY performance, shown in Table VI, is evaluated in terms of clock cycle count, dynamic instruction count, and energy consumption for different program implementations. The Fp program implements the reference library version using all optimizations except hardware loops. Similarly, the FpHwU is a modification of the Fp with hardware loops and loop unrolling. Finally, the FpDotHw makes use of the optimized assembly library, the dot product unit, and all optimizations including hardware loops [34].

Compared to the baseline Fp, the FpHwU presented a minor 10% improvement in clock cycle count. However, with the addition of the dot product unit, the FpDotHw demonstrated the best computational performance by achieving a 74% reduction in cycles. Similarly, the FpDotHw achieved the best performance in instruction count and energy with 66% and 27% improvements, respectively. Thus, for an MCU running at 10 MHz, a single inference run is performed within 7.5 ms and consumes 1118 µJ using the FpDotHw [34].

B. Mixed-Precision RISC-V Core

A mixed-precision inference core (MPIC) for the RI5CY RISC-V processor, using virtual instructions, is presented in [36]. It is developed for eliminating the RI5CY encoding problem and for implementing heavily quantized deep neural networks (QNNs) with improved performance and efficiency as compared to software-based mixed-precision RI5CY designs. The RISC-V ISA extension called XMPI extends the RI5CY core functionalities, adding support for status-based operations, for efficiently implementing 16-, 4-, and 2-bit QNNs [36]. A small set of the XpulpV2 instructions has been extended from 16/8-bit formats to support 4/2-bit precision and mixed-precision operation. Mainly, the following instructions from the XpulpV2 set have been extended for use in 4- and 2-bit formats:
1) basic: ADD (addition), SUB (subtraction), and AVG (average);
2) vector comparison: MAX (maximum) and MIN (minimum);
3) vector shift: SRL (shift right logical), SRA (shift right arithmetic), and SLL (shift left logical);
4) vector absolute: ABS (absolute value);
5) dot product variations (unsigned–signed).

Fig. 8(a) details the MPIC core architecture, where a mixed-precision controller (MPC) block has been added to orchestrate mixed-precision operations. In addition, the decoder, CSR, ALU, and DOTP units have been modified for performing the required tasks and the XMPI extended instructions. The mixed-precision dot product unit structure is shown in Fig. 8(b), where the MPC_CNT signal is the MPC count output controlled by the MPC core unit and used to select the subgroups of operands [36]. The DOTP unit has been extended, from its original 16- and 8-bit form, to support 4- and 2-bit formats by adding two additional DOTP units with internal adders and multipliers.

The modified RI5CY core was integrated into the PULPissimo SoC and synthesized using the Synopsys Design Compiler, obtaining a maximum operating frequency of 250 MHz with a power consumption of 5.30 mW. Compared to the original RI5CY variation, the MPIC SoC occupies an area of 1.004273 mm² versus 1.002681 mm², resulting in approximately 0.2% area overhead.
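What the extended DOTP unit computes in a single instruction can be pictured in software as a dot product over sub-byte operands packed into one 32-bit word. The following C sketch emulates the signed 4-bit case (eight 4-bit lanes per word); it is an illustration of the arithmetic only, not the hardware implementation and not an actual XMPI intrinsic.

```c
/* Software emulation of a signed 4-bit packed dot product: each 32-bit word
 * carries eight 4-bit lanes, handled in one instruction by the extended
 * DOTP unit of [36]. Illustrative only; not an XMPI intrinsic. */
#include <stdint.h>

static int32_t nibble_s(uint32_t word, int lane)   /* sign-extend one lane */
{
    int32_t v = (int32_t)((word >> (4 * lane)) & 0xFu);
    return (v & 0x8) ? (v - 16) : v;
}

int32_t dotp4_s(uint32_t a, uint32_t b, int32_t acc)
{
    for (int lane = 0; lane < 8; ++lane)
        acc += nibble_s(a, lane) * nibble_s(b, lane);
    return acc;
}
```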


Fig. 8. Mixed-precision RI5CY modifications. Source: Adapted from [36]. (a) MPIC core. (b) Extended dot product.

TABLE VII: MPIC AVERAGE COMPUTATIONAL PERFORMANCE AND ENERGY EFFICIENCY COMPARISON

The MPIC was benchmarked against different commercially available processors by executing a QNN layer with a combination of various uniform and mixed-precision configurations, namely 8-, 4-, and 2-bit. The MPIC average computational performance and energy efficiency are shown in Table VII for an input tensor and filter sizes of 16 × 16 × 32 and 64 × 3 × 3 × 32, respectively [36].

Compared to the Cortex M4 (STM32L4), the Cortex M7 (STM32H7), and the RI5CY, the MPIC achieved an 8.55×, 5.36×, and 2.77× increase in the number of MAC operations performed in a cycle (MAC/cycle), respectively. In addition, it attained the lowest power consumption of 5.30 mW. The energy efficiency is provided in GMAC/s/W; however, it is also affected by physical design parameters [36]. In contrast to the Cortex M4 with 2.64 GMAC/s/W, the Cortex M7 achieved a lower efficiency of 1.27 GMAC/s/W despite the higher frequency and better performance results. This is a consequence of its higher power consumption of ∼234 mW at 480 MHz. However, both Cortex M cores still fall behind when compared to the RI5CY and MPIC cores, achieving an efficiency of 42.18 and 96.7 GMAC/s/W, respectively.

C. In-Pipeline ML Processing

In mobile edge inference, such as on Android devices, the CPU handles all ML computations without any additional accelerators. This is because the gain in performance is not always the main parameter of interest. As such, for some edge AI applications, having a decent CPU with a dedicated SIMD unit is sufficient [53]. Some modern CPUs, such as the Intel Sapphire Rapids, include a matrix-multiply engine to perform tile-based multiply-add (TMUL). The TMUL instruction, part of the Advanced Matrix Extension tile operation category, performs only one operation, as defined in the following equation [54]:

Tile_C[i][j] += Tile_A[i][l] × Tile_B[l][j]    (2)

where i is the number of rows, j is the number of columns, and l is an intermediate variable.
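Spelled out as scalar code, the tile operation in (2) is nothing more than the loop nest below, accumulating Tile_A × Tile_B into Tile_C; the matrix engine performs the whole nest as one instruction over full tiles. The tile dimensions and element types here are illustrative placeholders, not the exact operand shapes of any particular engine.

```c
/* Scalar expansion of the tile multiply-add in (2):
 * Tile_C[i][j] += Tile_A[i][l] * Tile_B[l][j].
 * ROWS/COLS/INNER and the int8/int32 types are illustrative choices. */
#include <stdint.h>

#define ROWS  16
#define COLS  16
#define INNER 16

void tile_mac(int32_t C[ROWS][COLS],
              const int8_t A[ROWS][INNER],
              const int8_t B[INNER][COLS])
{
    for (int i = 0; i < ROWS; ++i)
        for (int j = 0; j < COLS; ++j)
            for (int l = 0; l < INNER; ++l)
                C[i][j] += (int32_t)A[i][l] * (int32_t)B[l][j];
}
```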


Fig. 9. PULP cluster architecture with an eight-core RISC-V processor, IMA unit, and a digital depth width convolution accelerator. Source: Adapted from [37].

D. Analog in-Memory Computation

Analog in-memory computing (AIMC) is a promising solution to overcome memory bottlenecks in DNN operations as well as to efficiently accelerate QNN operations. It performs analog computations, i.e., matrix–vector multiplications and dot products, on the phase change memory (PCM) crossbars of nonvolatile memory (NVM) arrays, thus accelerating DNN inference while optimizing energy usage [37], [55].

Although efficient, AIMC still requires additional improvements to achieve full-scale application efficiency. Some of its key challenges are [37]:
• limited to matrix/vector operations;
• difficult to integrate in heterogeneous systems (lack of optimized interface designs);
• susceptible to computation bottlenecks in single-core processor devices when handling other workloads, i.e., activation functions and depthwise convolutions.

Heterogeneous RISC-V heavy computing clusters and hybrid SoC designs have gained popularity in extreme edge AI inference [56], [57]. In an effort to overcome the AIMC challenges, an eight-core RISC-V clustered architecture with in-memory computing accelerators (IMA) and digital accelerators was developed in [37]. The aim of this system is to sustain AIMC performance in heterogeneous systems for optimized DNN inference on edge devices targeting practical end-to-end IoT applications [37]. Similar to previous designs, the architecture presented in [37] is based on the popular RISC-V PULP cluster. The work mainly focused on:
1) designing a heterogeneous system with eight programmable RISC-V core processors, an IMA, and digital accelerators dedicated to performing depthwise convolutions (DW);
2) improving computational performance by optimizing the interfaces between the IMA and the system;
3) exploiting heterogeneous analog–digital operations, such as pointwise/depthwise convolutions and residuals.

As shown in Fig. 9, the PULP cluster is formed of an eight-core RISC-V processor, a level 1 (L1) tightly coupled data memory cache, an instruction cache, the depthwise convolution digital accelerator, and the IMA subsystem. The components are connected together internally by means of a low-latency logarithmic interconnect, and to the external world through an onboard DMA and an AXI bus. The logarithmic interconnect ensures serving the memory in one cycle, while the AXI bus allows the cluster to communicate with the external MCU and peripherals. The external MCU also contains the cluster core program instructions. A hardware event unit is added to the system in order to synchronize operations and thread dispatching [37].

Each subsystem, or hardware processing engine (HWPE), has its own streamer block, a standardized interface formed of source and sink FIFO buffers, to interact with the RISC-V cores and exchange data with the internal engine. Each block implements an independent FSM to control and synchronize its operation. The HWPE provides two interfaces: control and data. The control interface "Ctrl intf" allows the cluster to manipulate the accelerator's internal registers for configuration purposes, while the data interface "data intf" connects to the logarithmic interconnect and, in its turn, to the L1 memory unit [37]. The IMA and DW subsystems are further detailed to show their internal architecture. The IMA subsystem engine implements both analog and digital circuitry as follows.
1) Analog: An AIMC crossbar with a 256 × 256 array, programming circuitry, i.e., PCM configuration, and digital-to-analog (DAC) and analog-to-digital (ADC) converters.
2) Digital: I/O registers to communicate with the ADC/DAC and an internal FSM control unit.

The IMA operates on L1 memory data encoded in a special format, i.e., the HWC format. The IMA register file "INPUT PIPE REGS" can be set to pipeline different jobs by correctly setting the strides. The proposed IMA structure enables the execution of a full layer in one configuration phase.
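The following sketch illustrates how a cluster core might program such a memory-mapped HWPE over its control interface: pointers and strides are written to the accelerator's registers, the job is triggered, and the core waits for completion. The base address, register offsets, and field names are hypothetical; they do not reproduce the actual register map of the IMA in [37].

```c
/* Minimal sketch of configuring and triggering an HWPE-style accelerator
 * from a PULP-like RISC-V cluster core. Register map is assumed.        */
#include <stdint.h>

#define IMA_BASE      0x10201000u                     /* assumed ctrl base */
#define IMA_REG(off)  (*(volatile uint32_t *)(IMA_BASE + (off)))

enum { IMA_IN_PTR = 0x00, IMA_OUT_PTR = 0x04, IMA_IN_STRIDE = 0x08,
       IMA_OUT_STRIDE = 0x0C, IMA_LEN = 0x10, IMA_TRIGGER = 0x14,
       IMA_STATUS = 0x18 };

void ima_run_job(uint32_t l1_in, uint32_t l1_out,
                 uint32_t in_stride, uint32_t out_stride, uint32_t len)
{
    IMA_REG(IMA_IN_PTR)     = l1_in;        /* HWC-encoded activations in L1 */
    IMA_REG(IMA_OUT_PTR)    = l1_out;
    IMA_REG(IMA_IN_STRIDE)  = in_stride;    /* strides pipeline successive jobs */
    IMA_REG(IMA_OUT_STRIDE) = out_stride;
    IMA_REG(IMA_LEN)        = len;
    IMA_REG(IMA_TRIGGER)    = 1;            /* start the analog MVM job */
    while (IMA_REG(IMA_STATUS) != 0)
        ;                                    /* or sleep on the event unit */
}
```

In a real cluster, the busy-wait would typically be replaced by a sleep on the hardware event unit so that the cores can execute other workloads while the accelerator runs.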


On the other hand, the DW convolution engine is a fully digital accelerator. It implements a network of multiple MAC units, i.e., 46 MACs, register files for data and configuration, window and weight buffers, a general controller FSM, and a dedicated engine FSM. The accelerator can also perform the ReLU activation function as well as the shift and clip operations [37], thus accelerating the convolution operation. Each DW convolution output channel depends on only one input channel, thus offering a reduction in size and a lower connectivity as compared to the original design. The specifically designed DW convolution accelerator resolves DW layer mapping to in-memory computing (IMC) arrays and eliminates any software-originating performance bottlenecks [37]. Additional studies concerning array structures for AIMC, such as systolic arrays for reduced energy consumption, can be found in [58].

The heterogeneous system was synthesized with Synopsys Design Compiler 2019.12. The full place-and-route flow was done using Cadence Innovus 20.12, and the cluster was implemented using the GlobalFoundries 22-nm FDX technology node. The total system area of the heterogeneous cluster is 2.5 mm², with the IMA core occupying one-third of the area with 0.000912 mm², the 512-kB TCD cache occupying another one-third, and the remaining parts occupying the last one-third. The device can perform an average of 29.7 MAC operations per cycle and execute inference for the MobileNetV2 network in 10 ms while achieving a performance of 958 GOPS on NVM.

Emerging technologies, such as 3-D integration, when coupled with IMC techniques, can provide substantial design benefits. 3-D integration is achieved by stacking multiple layers of electronic components in a single chip or package to reduce power consumption, reach higher clock speeds, and improve signal integrity and overall circuit performance. Additional details on 3-D integration and IMC techniques can be found in [59] and [60].

V. HARDWARE ACCELERATORS FOR TRANSFORMERS

Transformers have been shown to outperform CNNs and RNNs in different applications, i.e., NLP and computer vision [38], [50], [61], [62]. They are formed of encoder and decoder blocks that execute several compute-intensive, FP, and nonlinear operations on massive data streams [61], such as multihead self-attention (MHSA), Softmax, the Gaussian error linear unit (GELU), the pointwise feed-forward network (FFN), and layer normalization (LN) [4], [61]. However, generic DL structures and accelerators are not tailored to support and optimize these specific transformer operations [61]. Some common optimization techniques include model compression with integer or fixed-point quantization [63], [64], [65], specific approximations with scaling factors to execute nonlinear operations [61], and specialized hardware accelerators [38], [62].

A. Fully Quantized Bidirectional Encoder Representations From Transformers

The Bidirectional Encoder Representations from Transformers (BERT) is a state-of-the-art model formed of stacked encoder layers [63]. However, its computational complexity and memory requirements are >20 GFLOPs and >320 MB of FP parameters, respectively [38], hindering its implementation on resource-constrained edge devices.

To reduce its memory footprint and computational complexity, a fully quantized BERT (FQ-BERT) with hardware–software acceleration is proposed in [38] for the SoC. The FQ-BERT compresses the model by quantizing all parameters and intermediate results to integer or fixed-point data types. Moreover, it accelerates inference by implementing dot-product-based PEs and bit-level reconfigurable multipliers [38]. The methods and techniques used for quantizing the BERT parameters are detailed as follows [38].
1) Weights and activation functions: Quantized to 4 bits using a symmetric linear quantization strategy with tunable (MIN, MAX) clip thresholds and a scaling factor. The weight scaling factor is computed using a scaling formula, while an exponential moving average is used to determine the activation scaling factor during inference.
2) Biases and other parameters: The biases are quantized to 32-bit integers. The Softmax module and the LN parameters are quantized to 8-bit fixed-point values.

The proposed architecture is divided into two parts: software and hardware. The software part, running on the CPU and off-chip memory, implements the least computationally demanding operations, like embedding and task-specific layers; however, these require the most memory space. The hardware part, running on the FPGA, implements the encoder layers' accelerated units, such as the on-chip buffers, PEs, LN core, and Softmax core [38].
1) On-chip buffers: A double-buffered weight buffer, an intermediate data buffer for the MHSA unit variables, a cache buffer for storing the scaling factors and the Softmax lookup table values, and the I/O buffers.
2) PE: Each unit is formed of bit-level reconfigurable multipliers with support for 8 × 4 bit and 8 × 8 bit combinations. In addition, a bit-split inner-product module is included to simplify reuse for different operations.
3) Softmax and LN core: The exponential function is quantized to 8 bits, and 256 sampling points are stored in a lookup table to simplify the computation. Moreover, a coarse-grained three-stage pipelined parallel SIMD is designed to accelerate the elementwise multiplication.

Initially, the weights are loaded to the off-chip memory. A task-level scheduler is implemented to fully overlap off-chip memory access and computing operations. This is done by dividing each stage into several substages [38]. The FQ-BERT and BERT were implemented using PyTorch and evaluated on the SST-2 and MNLI tasks of the GLUE benchmark. The FQ-BERT, with a compression ratio of 7.94×, achieved accuracies of 91.51% and 81.11%, as compared to BERT with 92.32% and 84.19%, respectively [38]. Furthermore, the accelerator was implemented on the Xilinx ZCU102 (FPGA) and ZCU111 (SoC) and was compared to the baseline program, FQ-BERT, running on the Intel i7-8700 CPU and the Nvidia K80 GPU (CUDA 10.1). The sentence length and batch size are set to 128 and 1, respectively. Table VIII compares the performance and energy efficiency of the FQ-BERT and BERT when implemented on different processors. The accelerator achieved a 6.10× and 28.91× improvement as compared to the CPU, and a 1.17× and 12.72× improvement as compared to the GPU [38]. For 12 processing units with 16 PEs and 16 multipliers, the total resource consumption on the ZCU111 is 395 159, where 3287 DSP blocks were allocated.
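As an illustration of the quantization strategy described above, the following is a minimal sketch of symmetric linear quantization with tunable clip thresholds and an EMA-tracked activation scale. The constants, the 4-bit signed range, and the helper names are illustrative assumptions and are not taken from the FQ-BERT implementation in [38].

```c
/* Sketch of FQ-BERT-style symmetric linear quantization to 4-bit values. */
#include <math.h>
#include <stdint.h>

static float act_scale_ema = 1.0f;          /* running activation scale */

/* Symmetric scale for a signed 4-bit range [-7, 7]. */
static float quant_scale(float clip_max) { return clip_max / 7.0f; }

static int8_t quantize4(float x, float scale, float clip_max)
{
    if (x >  clip_max) x =  clip_max;        /* tunable clip thresholds */
    if (x < -clip_max) x = -clip_max;
    return (int8_t)lrintf(x / scale);        /* result stays in [-7, 7] */
}

/* EMA update of the activation scale observed on a batch. */
static void update_act_scale(float batch_absmax, float momentum)
{
    float s = quant_scale(batch_absmax);
    act_scale_ema = momentum * act_scale_ema + (1.0f - momentum) * s;
}
```

During inference, only the frozen scales and the integer weights are needed, which is what allows the hardware PEs to operate purely on low-precision operands.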


TABLE VIII
FQ-BERT PERFORMANCE COMPARISON FOR DIFFERENT PROCESSORS

Although transformers are the go-to choice in NLP applications, not all models can be fully deployed on hardware. As such, deploying long short-term memory (LSTM) networks can be a suitable alternative where the requirements are minimal. A 32-bit precision FP LSTM-RNN FPGA accelerator is proposed in [66]. The design is implemented on the Virtex 7 running at 150 MHz and can perform an average of 7.26 GFLOP/s. The memory-optimized architecture occupies 52.04% of the BRAMs, 42% of the DSP units, 30.08% of the FFs, and 65.31% of the LUTs. The network can be fully implemented on hardware and achieves a 20.18× speedup compared to the Intel Xeon CPU E5-2430 software implementation clocked at 2.20 GHz [66].

B. SwiftTron

Various DLAs were designed to implement fully quantized, fixed-point, and integer-based transformers, i.e., FQ-BERT and I-BERT [65]. However, these architectures do not fully deploy the model on hardware but only optimize and execute specific parts. In addition, nonlinear operations are difficult to implement in integer arithmetic without a significant loss in accuracy [61]. As such, SwiftTron, a specialized open-source hardware accelerator, is proposed in [61] for quantized transformers and vision transformers. The SwiftTron architecture implements several hardware units to fully and efficiently deploy quantized transformers in edge AI/TinyML devices using only integer operations. To minimize accuracy loss, a quantization strategy for transformers with scaling factors is designed and implemented. The scheme reliably implements linear and nonlinear operations in 8-bit integer (INT8) and 32-bit integer (INT32) arithmetic, respectively. Quantization is performed by using scaling factors that are dynamically computed during the process [61].

To accelerate the linear layers, an INT8-input MatMul block is proposed [61] and is shown in Fig. 10(a). The MatMul block is designed as an array of INT32, shareable and reusable, MAC units to avoid accuracy loss. The MAC units perform column-oriented computations with bias addition. This data flow simplifies the MatMul architecture as well as the interface between blocks. However, as nonlinear operations are performed with an INT8 representation, a requantization (Req) unit is needed. Since scaling factors can also assume real values, the requantization unit represents the scaling factor ratio with a dyadic number (DN), as shown in (3). Note that a and o are the INT32 and INT8 values, qa and qo are their quantized values, and Sa and So are their scales, such that a = qa Sa and o = qo So, and b and c are integers. With the use of the DN, the unit implements a right shift operation and eliminates the need for a divider [61]

qo = qa (Sa / So) = qa DN = qa × (b / 2^c).    (3)

Among others, the MatMul blocks are used as the basic building blocks of the MHSA and the FFN with dimension dff. The MHSA block, shown in Fig. 10(b), is formed of k head units operating in parallel and connected to a final MatMul block to generate the output. The MHSA block can be reconfigured to include one or many heads depending on the desired architecture, i.e., parallel or sequential with reuse, and the available resources. Each head unit, shown in Fig. 10(c), contains three MatMul blocks and one attention block to compute the Query (Q), Key (K), and Value (V) matrices in parallel [61]. The attention block, shown in Fig. 10(d), computes the QK^T matrix, where T denotes the transpose. It is formed of two MatMul blocks with intermediate scale units, i.e., division by the transformer dimension d, Softmax, and requantization. Finally, the FFN block structure also implements two MatMul blocks with intermediate GELU and requantization units [61].

Nonlinear operations, such as Softmax, GELU, and the square root for the LN, are performed by the use of second-order polynomial approximations and recursive implementations [61]. Softmax is applied to the row components of the QK^T matrix. As such, m parallel units are instantiated, where m is the sentence length. Its implementation is summarized as follows: First, the unit implements a maximum search block to obtain the maximum value, which is subtracted to obtain decomposable nonpositive real numbers. Second, the input range is restricted to [−ln 2, 0]. Third, the exponential function is computed by means of a second-order polynomial. Finally, the output is generated using an accumulate-and-divide block [61].

The GELU unit is implemented with simple add, multiply, and sign-handling operations. This is done by linearizing the error function (erf) through a second-order polynomial with a limited input range. The LN block's square root operation is implemented in a recursive manner; the algorithm iterates until xi+1 ≥ xi, where x is the partial result and i is the iteration index. Finally, a control unit is implemented to manage the different operations. The residual block output is added to the original inputs with respect to dyadic units to ensure matching scaling factors [61].

The SwiftTron architecture was synthesized in a 65-nm CMOS technology using Synopsys Design Compiler. Its parameters were set to d = 768, k = 12, m = 256, and dff = 3072. Synthesis results show that the architecture operates at a clock frequency of 143 MHz, occupies an area of 273.0 mm², and consumes 33.64 W. It was shown that the MatMul, Softmax, LN, and GELU blocks occupy 55%, 17%, 25%, and 3% of the total area, respectively. Their contribution to the total power is 79%, 14%, 6%, and 1%, respectively [61].

The architecture was evaluated by executing the RoBERTa-base/large models on SST-2 and the DeiT-S model with a 224 × 224 image resolution from the ImageNet database. The inference latency was compared to that of the Nvidia RTX 2080 Ti GPU [61]. The mean accuracy obtained for the RoBERTa models was 95.8% and 79.11% for the DeiT-S.
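To make the dyadic requantization in (3) concrete, the sketch below rescales an INT32 accumulator to INT8 with a single multiply and right shift, as the Req unit does when it replaces the real ratio Sa/So by b/2^c. The chosen b and c values are purely illustrative and are not the constants used in [61].

```c
/* Worked sketch of dyadic-number requantization per (3). */
#include <stdint.h>

static int8_t requantize_dyadic(int32_t qa, int32_t b, int32_t c)
{
    int64_t scaled = (int64_t)qa * b;        /* qa * b                    */
    int32_t qo = (int32_t)(scaled >> c);     /* divide by 2^c via a shift */
    if (qo >  127) qo =  127;                /* saturate to the INT8 range */
    if (qo < -128) qo = -128;
    return (int8_t)qo;
}

/* Example: a ratio Sa/So ≈ 0.1463 can be approximated by b = 19176 and
 * c = 17, since 19176 / 2^17 ≈ 0.14630, so no divider is required.      */
```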


Fig. 10. SwiftTron linear layers architecture. Source: Adapted from [61]. (a) MatMul block. (b) MHSA block. (c) Head unit. (d) Attention block.
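The Softmax flow inside the attention block of Fig. 10(d), i.e., max subtraction, range reduction to [−ln 2, 0], and a second-order polynomial exponential, can be sketched as follows. The polynomial coefficients follow the commonly used I-BERT-style fit and are illustrative rather than the exact SwiftTron constants, and the sketch is written in floating point for clarity, whereas the hardware performs these steps in integer arithmetic with scaling factors.

```c
/* Sketch of a max-subtract / range-reduce / polynomial-exp Softmax row. */
#include <math.h>
#include <stddef.h>

static const float LN2 = 0.6931472f;

static float poly_exp(float r)              /* approximates e^r on [-ln 2, 0] */
{
    return 0.3585f * (r + 1.353f) * (r + 1.353f) + 0.344f;
}

static void softmax_row(const float *x, float *y, size_t n)
{
    float xmax = x[0];
    for (size_t i = 1; i < n; i++)           /* 1) maximum search block */
        if (x[i] > xmax) xmax = x[i];

    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float z = x[i] - xmax;               /* nonpositive argument */
        int   k = (int)floorf(-z / LN2);     /* 2) range reduction */
        float r = z + (float)k * LN2;        /*    r lies in (-ln 2, 0] */
        y[i] = ldexpf(poly_exp(r), -k);      /* 3) e^z = 2^-k * e^r */
        sum += y[i];
    }
    for (size_t i = 0; i < n; i++)           /* 4) accumulate and divide */
        y[i] /= sum;
}
```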

In terms of latency, the RoBERTa-base, the RoBERTa-large, and the DeiT-S required 1.83, 45.70, and 1.13 ms, respectively, and the latency speedup factors with respect to the GPU were 3.81×, 3.90×, and 3.58×, respectively [61].

TABLE IX
BERT, I-BERT, FQ-BERT, AND SWIFTTRON INT8 COMPARISON

From Table IX and as compared to the baseline BERT, the I-BERT and FQ-BERT achieve speedups of 3.56× and 12.72× with accuracies of 96.3% and 96.4%, respectively. However, both designs make use of a heterogeneous processor and cannot be fully deployed on an FPGA. While the SwiftTron architecture was not explicitly designed for a specific processor, model, or ISA, it is prototyped on an FPGA; as compared to the FQ-BERT, it achieves a speedup of 3.90× with an accuracy of 96.4% and a latency of 45.70 ms. However, SwiftTron is not suitable for deployment on highly resource-constrained devices.

ViTA, a hardware accelerator architecture with an efficient data flow, is proposed in [67] to deploy compute-heavy vision transformer models on edge devices. The design supports several popular vision transformer models, avoids repeated off-chip memory access, and implements a head-level pipeline and several layer optimizations [67]. The design is based on the ViT-B/16 model and prototyped on the ZYNQ ZC7020. For an image dimension of 256 × 256 × 3, ViTA occupies 53 200 LUTs, 220 DSP slices, and 630 kB of BRAM. In terms of performance, the accelerator can perform 2.17 FPS with a 93.2% hardware utilization efficiency, while operating at a frequency of 150 MHz and consuming 0.88 W, i.e., 3.12 FPS/W [67].

In contrast to the SwiftTron and FQ-BERT hardware accelerators, ViTA presents a design suitable for resource-constrained edge devices with a reasonable frame rate and power consumption. Although these designs do not explicitly target RISC-V processors, they can be integrated into a RISC-V system given its open-source nature.

VI. SUMMARY AND FUTURE RESEARCH CHALLENGES

In this survey, we presented an overview of embedded DNN accelerators for the open-source RISC-V processor core. In addition, we offered an overview of some RISC-V ISA extensions, compatible accelerators, and heterogeneous and hybrid digital–AIMC designs. We explored different DNN structures and models, like 1-D, 2-D, and 3-D CNNs, SSD, and transformers. In addition, we provided some up-to-date references on recent advances in optical AI edge inference designs and 3-D integration. The work listed in this article is summarized in Table X.


TABLE X
SUMMARY OF STATE-OF-THE-ART KEY FEATURES AND LIMITATIONS

The state-of-the-art designs are compared with respect to the selected processor, target application, key features, and limitations.

In conclusion, ISA extensions provide optimized general-purpose instructions to implement the different core DNN operations. However, the performance of such networks is limited by that of the compiler and its ability to correctly map and execute each instruction. In addition, certain models, such as transformers, may require the use of dedicated architectures to perform efficiently. Thus, a dedicated accelerator, while application specific, can improve the performance of certain networks. The choice of an accelerator remains constrained by the application requirements and device limitations. The designs listed in this survey favored the FPGA over the CPU and the ASIC. Compared to ASICs, FPGAs offer the flexibility needed to implement dedicated and reconfigurable architectures that meet ever-changing needs and advancements. Compared to the CPU, FPGAs offer parallelism and low power consumption with better performance per watt. For some models, a heterogeneous implementation was favored, where the design was implemented on an SoC and optimization was performed on both the software (CPU) and hardware (FPGA) sides. It was evident that the open-source RISC-V was the MCU of choice for many applications. This is because it offers the flexibility and customizability needed to implement dedicated accelerators that meet specific design criteria, such as power, area, and performance. In addition, it allows developers to freely integrate similar designs through open-source licensing.

The need for implementing DNNs on edge devices is increasing tremendously and has become a dedicated research topic. However, it is clear from Table X that although the proposed designs offer considerable improvements, they share common limitations. These limitations are mainly in terms of size, resource utilization, compiler, processing capabilities, and data communication. Some future research tracks and open challenges in edge AI to reduce the effect of these limitations are as follows.
1) 3-D integration and IMC: Performing IMC in 3-D integrated circuits allows computations to take place in close proximity to the memory. This technique drastically reduces memory accesses, data transfer bottlenecks, and size, and improves overall performance. 3-D integration and IMC have the potential to revolutionize the field of embedded ML by enabling the full implementation of transformers and large DNNs in hardware. However, these technologies are still relatively new and face complex challenges. 3-D integration is expensive; it can lead to an increase in heat dissipation and is not freely scalable. In addition, IMC in 3-D integrated circuits can lead to complex and difficult-to-implement designs that require reliable data management techniques. The development of dedicated 3D-IMC libraries, instruction sets, tools, and compiler optimization methods can greatly reduce design and testing time.
2) Optimizing model size and resource usage: Knowledge distillation, model compression, pruning, sharing, partitioning, and offloading are some of the popular techniques adopted to reduce the size of DNNs. In knowledge distillation, a smaller student DNN is trained by the larger teacher DNN. Several requirements should be considered when choosing an optimization technique, such as accuracy, computational complexity, cost, speed, and availability. Although extensively researched, these


methods might result in a model that suffers from accuracy loss, requires retraining, poses security concerns, and becomes difficult to validate and deploy. The introduction of device-aware model compression techniques can help in obtaining device-specific models, which facilitates deployment and improves performance.
3) Optimizing memory communication: Data transfer and memory bandwidth play a crucial role in the performance of accelerators, especially in heterogeneous and hybrid digital–analog systems. Accelerators are mainly limited by frequent communication and bandwidth bottlenecks. Addressing the memory bandwidth limitations in RISC-V systems can significantly optimize the overall performance. This can be achieved by implementing application-specific memory hierarchies, on-chip interconnects, dedicated access controllers, memory compression, data reuse, and data-flow optimization.
4) Tools and compilers: Automatic code generation can be implemented to efficiently map program instructions to hardware. In addition, open-source standards for RISC-V accelerator designs can be developed to simplify interoperability and integration across different hardware platforms and software frameworks.
5) Algorithmic: The unique capabilities of the open-source RISC-V ISA can be exploited to better optimize DNN algorithms specifically for RISC-V implementations. The open-source ISA also allows investigating multimodel, heterogeneous, and hybrid computing. This can be done by designing algorithms and structures for data fusion from different sensors, as well as by concurrently targeting various computing platforms, such as the CPU, FPGA, GPU, and neural processing unit.
6) Optical chips: Optical AI accelerators are extensively investigated for implementing high-accuracy and high-speed inference CNNs. Classical processors are beginning to face limitations in the post Moore's law era, where their processing capabilities are not improving at the same pace as the requirements. Optical processors, on the other hand, are not affected by Moore's law and are currently investigated as an alternative to train and deploy DNN structures, offering the advantage of handling much larger and complex networks.

Additional research tracks that would offer further improvements include those related to improving adaptability to dynamic workloads and exploring techniques to optimize online learning, federated learning, and training, to name a few.

REFERENCES

[1] Z. Liu, J. Jiang, G. Lei, K. Chen, B. Qin, and X. Zhao, "A heterogeneous processor design for CNN-based AI applications on IoT devices," Procedia Comput. Sci., vol. 174, pp. 2–8, 2020.
[2] A. N. Mazumder et al., "A survey on the optimization of neural network accelerators for micro-AI on-device inference," IEEE Trans. Emerg. Sel. Topics Circuits Syst., vol. 11, no. 4, pp. 532–547, Dec. 2021.
[3] L. Sekanina, "Neural architecture search and hardware accelerator co-search: A survey," IEEE Access, vol. 9, pp. 151337–151362, 2021.
[4] A. Vaswani et al., "Attention is all you need," in Proc. Int. Conf. Neural Inf. Process. Syst., 2017, vol. 30, pp. 6000–6010.
[5] S.-H. Lim, W. W. Suh, J.-Y. Kim, and S.-Y. Cho, "RISC-V virtual platform-based convolutional neural network accelerator implemented in systemC," Electronics, vol. 10, no. 13, 2021, Art. no. 1514.
[6] N. Wu, T. Jiang, L. Zhang, F. Zhou, and F. Ge, "A reconfigurable convolutional neural network-accelerated coprocessor based on RISC-V instruction set," Electronics, vol. 9, no. 6, 2020, Art. no. 1005.
[7] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, "Edge computing: Vision and challenges," IEEE Internet Things J., vol. 3, no. 5, pp. 637–646, Oct. 2016.
[8] B. Varghese, N. Wang, S. Barbhuiya, P. Kilpatrick, and D. S. Nikolopoulos, "Challenges and opportunities in edge computing," in Proc. IEEE Int. Conf. Smart Cloud, 2016, pp. 20–26.
[9] P. P. Ray, "A review on TinyML: State-of-the-art and prospects," J. King Saud Univ.-Comput. Inf. Sci., vol. 34, no. 4, pp. 1595–1623, 2022.
[10] E. Manor and S. Greenberg, "Custom hardware inference accelerator for tensorflow lite for microcontrollers," IEEE Access, vol. 10, pp. 73484–73493, 2022.
[11] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner, "AI accelerator survey and trends," in Proc. IEEE High Perform. Extreme Comput. Conf., 2021, pp. 1–9.
[12] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," in Proc. 4th Int. Conf. Learn. Representations, Y. Bengio and Y. LeCun, eds., San Juan, Puerto Rico, 2016.
[13] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks," Proc. Adv. Neural Inform. Process. Syst., D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, eds., vol. 29, 2016.
[14] C. Zhu, S. Han, H. Mao, and W. J. Dally, "Trained ternary quantization," in Proc. 5th Int. Conf. Learn. Representations, Toulon, France, 2017.
[15] A. Karine, T. Napoléon, J.-Y. Mulot, and Y. Auffret, "Video seals recognition using transfer learning of convolutional neural network," in Proc. 10th Int. Conf. Image Process. Theory, Tools, Appl., 2020, pp. 1–4.
[16] V. Murahari, C. E. Jimenez, R. Yang, and K. Narasimhan, "DataMUX: Data multiplexing for neural networks," in Proc. Adv. Neural Inform. Process. Syst., A. H. Oh, A. Agarwal, D. Belgrave, and Kyunghyun Cho, eds., 2022.
[17] L. Deng, G. Li, S. Han, L. Shi, and Y. Xie, "Model compression and hardware acceleration for neural networks: A comprehensive survey," Proc. IEEE, vol. 108, no. 4, pp. 485–532, Apr. 2020.
[18] S. Pouyanfar et al., "A survey on deep learning: Algorithms, techniques, and applications," ACM Comput. Surv., vol. 51, no. 5, pp. 1–36, 2018.
[19] S. Mittal, "A survey of FPGA-based accelerators for convolutional neural networks," Neural Comput. Appl., vol. 32, no. 4, pp. 1109–1139, 2020.
[20] E. Wang et al., "Deep neural network approximation for custom hardware: Where we've been, where we're going," ACM Comput. Surv., vol. 52, no. 2, pp. 1–39, 2019.
[21] L. Lai, N. Suda, and V. Chandra, "CMSIS-NN: Efficient neural network kernels for arm Cortex-M CPUs," 2018, arXiv:1801.06601.
[22] M. Capra, B. Bussolino, A. Marchisio, G. Masera, M. Martina, and M. Shafique, "Hardware and software optimizations for accelerating deep neural networks: Survey of current trends, challenges, and the road ahead," IEEE Access, vol. 8, pp. 225134–225180, 2020.
[23] A. Garofalo, M. Rusci, F. Conti, D. Rossi, and L. Benini, "PULP-NN: Accelerating quantized neural networks on parallel ultra-low-power RISC-V processors," Philos. Trans. Roy. Soc. A, vol. 378, no. 2164, 2020, Art. no. 20190155.
[24] "RV12 RISC-V 32/64-bit CPU core datasheet." Accessed: Apr. 28, 2022. [Online]. Available: https://roalogic.github.io/RV12/DATASHEET.html
[25] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner, "Survey and benchmarking of machine learning accelerators," in Proc. IEEE High Perform. Extreme Comput. Conf., 2019, pp. 1–9.
[26] K. T. Chitty-Venkata and A. K. Somani, "Neural architecture search survey: A hardware perspective," ACM Comput. Surv., vol. 55, no. 4, pp. 1–36, 2022.
[27] D. Ghimire, D. Kil, and S.-H. Kim, "A survey on efficient convolutional neural networks and hardware acceleration," Electronics, vol. 11, no. 6, 2022, Art. no. 945.
[28] S. Kalapothas, M. Galetakis, G. Flamis, F. Plessas, and P. Kitsos, "A survey on RISC-V-based machine learning ecosystem," Information, vol. 14, no. 2, 2023, Art. no. 64.
[29] A. Sanchez-Flores, L. Alvarez, and B. Alorda-Ladaria, "A review of CNN accelerators for embedded systems based on RISC-V," in Proc. IEEE Int. Conf. Omni-Layer Intell. Syst., 2022, pp. 1–6.


[30] J. K. L. Lee, M. Jamieson, N. Brown, and R. Jesus, "Test-driving RISC-V vector hardware for HPC," Proc. High Perform. Comput., A. Bienz, M. Weiland, M. Baboulin, and C. Kruse, eds., Chem., vol. 13999, pp. 419–432, 2023.
[31] C. Silvano et al., "A survey on deep learning hardware accelerators for heterogeneous HPC platforms," 2023, arXiv:2306.15552.
[32] F. Ge, N. Wu, H. Xiao, Y. Zhang, and F. Zhou, "Compact convolutional neural network accelerator for IoT endpoint SOC," Electronics, vol. 8, no. 5, 2019, Art. no. 497.
[33] I. A. Assir, M. E. Iskandarani, H. R. A. Sandid, and M. A. Saghir, "Arrow: A RISC-V vector accelerator for machine learning inference," 2021, arXiv:2107.07169.
[34] J. Vreca et al., "Accelerating deep learning inference in constrained embedded devices using hardware loops and a dot product unit," IEEE Access, vol. 8, pp. 165913–165926, 2020.
[35] G. Zhang, K. Zhao, B. Wu, Y. Sun, L. Sun, and F. Liang, "A RISC-V based hardware accelerator designed for YOLO object detection system," in Proc. IEEE Int. Conf. Intell. Appl. Syst. Eng., 2019, pp. 9–11.
[36] G. Ottavi, A. Garofalo, G. Tagliavini, F. Conti, L. Benini, and D. Rossi, "A mixed-precision RISC-V processor for extreme-edge DNN inference," in Proc. IEEE Comput. Soc. Annu. Symp. Very Large Scale Integr., 2020, pp. 512–517.
[37] A. Garofalo et al., "A heterogeneous in-memory computing cluster for flexible end-to-end inference of real-world deep neural networks," IEEE Trans. Emerg. Sel. Topics Circuits Syst., vol. 12, no. 2, pp. 422–435, Jun. 2022.
[38] Z. Liu, G. Li, and J. Cheng, "Hardware acceleration of fully quantized BERT for efficient natural language processing," in Proc. Des., Autom. Test Eur. Conf. Exhib., 2021, pp. 513–516.
[39] S. Harini, A. Ravikumar, and D. Garg, "VeNNus: An artificial intelligence accelerator based on RISC-V architecture," in Proc. Int. Conf. Comput. Intell. Data Eng., 2021, pp. 287–300.
[40] B. Jacob et al., "Quantization and training of neural networks for efficient integer-arithmetic-only inference," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 2704–2713.
[41] Y. Zhang, N. Wu, F. Zhou, and M. R. Yahya, "Design of multifunctional convolutional neural network accelerator for IoT endpoint SoC," in Proc. World Congr. Eng. Comput. Sci., 2018, pp. 16–19.
[42] Z. Li et al., "Laius: An 8-bit fixed-point CNN hardware inference engine," in Proc. IEEE Int. Symp. Parallel Distrib. Process. Appl./IEEE Int. Conf. Ubiquitous Comput. Commun., 2017, pp. 143–150.
[43] K. Guo et al., "Angel-eye: A complete design flow for mapping CNN onto embedded FPGA," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 37, no. 1, pp. 35–47, Jan. 2018.
[44] S.-Y. Pan, S.-Y. Lee, Y.-W. Hung, C.-C. Lin, and G.-S. Shieh, "A programmable CNN accelerator with RISC-V core in real-time wearable application," in Proc. IEEE Int. Conf. Recent Adv. Syst. Sci. Eng., 2022, pp. 1–4.
[45] D. Pestana et al., "A full featured configurable accelerator for object detection with YOLO," IEEE Access, vol. 9, pp. 75864–75877, 2021.
[46] K. Kim, S.-J. Jang, J. Park, E. Lee, and S.-S. Lee, "Lightweight and energy-efficient deep learning accelerator for real-time object detection on edge devices," Sensors, vol. 23, no. 3, 2023, Art. no. 1185.
[47] D. Wu, Y. Liu, and C. Tao, "A universal accelerated coprocessor for object detection based on RISC-V," Electronics, vol. 12, no. 3, 2023, Art. no. 475.
[48] Z. Wang, K. Xu, S. Wu, L. Liu, L. Liu, and D. Wang, "Sparse-YOLO: Hardware/software co-design of an FPGA accelerator for YOLOV2," IEEE Access, vol. 8, pp. 116569–116585, 2020.
[49] L. Cai, F. Dong, K. Chen, K. Yu, W. Qu, and J. Jiang, "An FPGA based heterogeneous accelerator for single shot multibox detector (SSD)," in Proc. IEEE 15th Int. Conf. Solid-State Integr. Circuit Technol., 2020, pp. 1–3.
[50] W. Lv et al., "Detrs beat YOLOs on real-time object detection," 2023, arXiv:2304.08069.
[51] S. Lv, T. Long, Z. Hou, L. Yan, and Z. Li, "3D CNN hardware circuit for motion recognition based on FPGA," in J. Phys.: Conf. Ser., 2022, vol. 2363, Art. no. 012030.
[52] L. Lamberti, M. Rusci, M. Fariselli, F. Paci, and L. Benini, "Low-power license plate detection and recognition on a RISC-V multi-core MCU-based vision system," in Proc. IEEE Int. Symp. Circuits Syst., 2021, pp. 1–5.
[53] Z. Azad et al., "An end-to-end RISC-V solution for ML on the edge using in-pipeline support," in Proc. Boston Area Archit. Workshop, 2020.
[54] "The x86 Advanced Matrix Extension (AMX) Brings Matrix Operations to Debut With Sapphire Rapids," WikiChip, New York, NY, USA, 2023.
[55] E. Azarkhish, D. Rossi, I. Loi, and L. Benini, "Neurostream: Scalable and energy efficient deep learning with smart memory cubes," IEEE Trans. Parallel Distrib. Syst., vol. 29, no. 2, pp. 420–434, Feb. 2018.
[56] A. Garofalo et al., "DARKSIDE: A heterogeneous RISC-V compute cluster for extreme-edge on-chip DNN inference and training," IEEE Open J. Solid-State Circuits Soc., vol. 2, pp. 231–243, 2022.
[57] K. Ueyoshi et al., "Diana: An end-to-end energy-efficient digital and analog hybrid neural network SOC," in Proc. IEEE Int. Solid-State Circuits Conf., 2022, pp. 1–3.
[58] M. E. Elbtity, B. Reidy, M. H. Amin, and R. Zand, "Heterogeneous integration of in-memory analog computing architectures with tensor processing units," in Proc. Great Lakes Sympos. VLSI, Knoxville, TN, USA, 2023, pp. 607–612.
[59] E. Giacomin, S. Gudaparthi, J. Boemmels, R. Balasubramonian, F. Catthoor, and P.-E. Gaillardon, "A multiply-and-accumulate array for machine learning applications based on a 3D nanofabric flow," IEEE Trans. Nanotechnol., vol. 20, pp. 873–882, 2021.
[60] Z. Lin et al., "A fully digital SRAM-based four-layer in-memory computing unit achieving multiplication operations and results store," IEEE Trans. Very Large Scale Integr. Syst., vol. 31, no. 6, pp. 776–788, Jun. 2023.
[61] A. Marchisio, D. Dura, M. Capra, M. Martina, G. Masera, and M. Shafique, "SwiftTron: An efficient hardware accelerator for quantized transformers," in Proc. Int. Joint Conf. Neural Netw., Gold Coast, Australia, 2023, pp. 1–9, doi: 10.1109/IJCNN54540.2023.10191521.
[62] A. F. Laguna, M. M. Sharifi, A. Kazemi, X. Yin, M. Niemier, and X. S. Hu, "Hardware-software co-design of an in-memory transformer network accelerator," Front. Electron., vol. 3, 2022, Art. no. 10.
[63] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. Conf. North Amer. Chap. Assoc. Comput. Linguistics: Human Lang. Technol., J. Burstein, C. Doran, and T. Solorio, eds., 2019, pp. 4171–4186.
[64] S. Lu, M. Wang, S. Liang, J. Lin, and Z. Wang, "Hardware accelerator for multi-head attention and position-wise feed-forward in the transformer," in Proc. IEEE 33rd Int. Syst.-Chip Conf., 2020, pp. 84–89.
[65] S. Kim, A. Gholami, Z. Yao, M. W. Mahoney, and K. Keutzer, "I-BERT: Integer-only BERT quantization," in Proc. Int. Conf. Mach. Learn., 2021, pp. 5506–5518.
[66] Y. Guan, Z. Yuan, G. Sun, and J. Cong, "FPGA-based accelerator for long short-term memory recurrent neural networks," in Proc. 22nd Asia South Pacific Des. Autom. Conf., 2017, pp. 629–634.
[67] S. Nag, G. Datta, S. Kundu, N. Chandrachoodan, and P. A. Beerel, "ViTA: A vision transformer inference accelerator for edge applications," in Proc. IEEE Int. Sympos. Circuits Syst., 2023, pp. 1–5, doi: 10.1109/ISCAS46773.2023.10181988.

