Embedded Deep Learning Accelerators - A Survey On Recent Advances
Abstract—The exponential increase in generated data as well as the advances in high-performance computing have paved the way for the use of complex machine learning methods. Indeed, the availability of Graphical Processing Units (GPU) and Tensor Processing Units (TPU) has made it possible to train and prototype Deep Neural Networks (DNN) on large-scale data sets and for a variety of applications, i.e., vision, robotics, biomedical, etc. The popularity of these DNNs originates from their efficacy and state-of-the-art inference accuracy. However, this is obtained at the cost of a considerably high computational complexity. Such drawbacks render their implementation on resource-limited edge devices, without a major loss in inference speed and accuracy, a dire and challenging task. To this extent, it has become extremely important to design innovative architectures and dedicated accelerators to deploy these DNNs to embedded and re-configurable processors in a high-performance, low-complexity structure. In this study, we present a survey on recent advances in deep learning accelerators (DLA) for heterogeneous systems and Reduced Instruction Set Computer (RISC-V) processors, given their open-source nature, accessibility, customizability and universality. After reading this article, the readers should have a comprehensive overview of the recent progress in this domain, cutting-edge knowledge of recent embedded machine learning trends and substantial insights for future research directions and challenges.

Index Terms—Hardware Accelerators, RISC-V, Convolutional Neural Network (CNN), Embedded Machine Learning, Transformers

I. INTRODUCTION

The exponential growth in deployed computing devices, as well as the abundance of generated data, has mandated the use of complex algorithms and structures for smart data processing [1]. Such overwhelming processing requirements have further mandated the use of Artificial Intelligence (AI) techniques and compatible hardware [2]. Nowadays, Machine Learning (ML) methods are routinely executed in various fields including healthcare, robotics, navigation, data mining, agriculture, environmental monitoring and more [3], replacing the need for recurrent human intervention. Classical ML methods have rapidly evolved, over recent years, to perform compute-intensive operations and have expanded various research areas tenfold. The introduction of high-accuracy, near real-time Deep Learning (DL) processes, such as the Deep Neural Network (DNN) [2], [3], conveyed unprecedented advances in the areas of Natural Language Processing (NLP) [4], object detection, image classification, signal estimation and detection, protein folding and genomics analysis, to name a few. These convoluted DNN models achieve their high inference accuracy and performance through the use of manifold trainable parameters and large-scale datasets [3]. Training and deploying DNNs rely on performing heavy computations with the indispensable use of High Performance Computing (HPC) units, such as Graphical Processing Units (GPU) and Tensor Processing Units (TPU). Consequently, DL structures require a considerably high energy consumption and storage capacity [2], severely limiting their implementation, performance and use on limited-resource devices, i.e., Field Programmable Gate Arrays (FPGA), System on Chip (SoC), general purpose Microprocessors (MP) and digital signal processing (DSP) processors [1], [5], [6].

However, the need for edge AI computing [6], [7] remains relevant and crucial. Edge computing involves offloading DNNs' inference operations to the node processor for implementing AI procedures on the device itself [7], [8]. Adopting this paradigm requires scaling down the DNN to fit on limited-resource devices without a significant loss in performance and accuracy [2], [9], thus adding new challenges to those already at hand, such as area limitation, power consumption and storage requirements. To abide by the imposed constraints and to efficiently deploy DNN structures on different processors, such as FPGA, SoC and MP, one practical yet popular solution is to reduce the DNNs' size and develop task-specific DL accelerators (DLA) [2], [10], [11]. Moreover, several optimization techniques can be applied to reduce DNNs' hardware usage, such as pruning [12], quantization [13], [14], knowledge distillation [15], multiplexing [16] and model compression [17], to name a few.

To account for the requirements of various applications, there exist different DL structures and models [18], such as the Convolutional Neural Network (CNN) [3], [6], Recursive Neural Network (RNN) [3], [18], Generative Adversarial Network (GAN) [18], Graph Neural Network (GNN) and Transformers. To accommodate such diversity, popular approaches for implementing DNN accelerators rely on using re-configurable devices, i.e., FPGA [19], [20], or extending the architecture and instruction set of ARM and Reduced Instruction Set Computer (RISC-V) based processors [3]. Additionally, the use of dedicated Neural Network (NN) libraries and compilers, such as CMSIS-NN by ARM [21] for 16-bit and 8-bit processors, makes it possible to implement some sophisticated, quantized DNNs, i.e., 8-bit, 4-bit, 2-bit and even 1-bit [3]. However, commercial processors have major drawbacks, such as licensing costs and the lack of flexibility in modifying the general architecture [22].

1 Computer Engineering department, University of Balamand, Koura, Lebanon, {ghattas.akkad, elie.inaty}@balamand.edu.lb
2 Lab-STICC, UMR CNRS 6285, ENSTA Bretagne, Brest, France, [email protected]
Fig. 2. Compact CNN accelerator: (a) Data flow. Source: Adapted from [6]. (b) Compact CNN accelerator architecture. Source: Adapted from [32]. (c) Acceleration chain architecture. Source: Adapted from [32].
the architecture includes several features, categorized as core, optional, parameterized and configurable. These features provide the designer with the ability to emphasize performance, power or area based on the application's requirements [24]. The processor also includes a branch predictor unit (BPU), a data cache and an instruction cache to speed up execution [24].

Second, the E203 is a 32-bit RISC-V processor designed for energy-efficient and high-performance computing applications, such as the Internet of Things (IoT) [6]. The E203 supports the RV32IMAC instruction set and is the closest to the ARM Cortex M0+ [6]. It is composed of two pipeline stages, where the first pipeline stage handles instruction fetch, decode and branch prediction. The resulting Program Counter (PC) and instruction value are loaded into the PC and Instruction Register (IR), respectively. The second pipeline stage mainly handles rerouting the IR to the appropriate processing unit to execute the required operation. The main processing units are the ALU, the Multiplier/Divider, the Access Memory Unit and the Extension Accelerator Interface (EAI) [6].

Third, the RI5CY is an energy efficient, 4-stage pipeline, 32-bit RISC-V processor core designed by the PULP platform. The core architecture supports the RV32IMAC instruction set and implements power gating and clock frequency scaling (CFS) units to better manage and reduce power consumption. Additionally, it implements a hardware loop unit to efficiently execute loops, various Single Instruction Multiple Data (SIMD) instructions to accelerate DSP operations, and post-increment load and store addressing to improve overall performance. The RI5CY core is mostly used in accelerating mixed-precision DNN operations [36].

Finally, the Rocket core is a high-performance, 5-stage pipeline, 64-bit RISC-V processor which supports the RV64GC instruction set. The core supports a wide range of operating systems and has a peak performance of 4 Instructions Per Cycle (IPC). The Rocket core is configurable to suit different application requirements and serves as a reference for the RISC-V ISA. Additionally, it is highly extensible and is designed to allow developers to incorporate custom instructions, dedicated accelerators and complex extensions [37].

As shown in Table I, all processor cores support the listed features except for the RV12. Additionally, the Intel i7-8700 and the Rocket cores scored the highest and second highest peak IPC values of 4.6 and 4.0, respectively. However, in
TABLE I
Performance comparison of the RV12, E203, RI5CY and Rocket RISC-V cores

Core | ISA | Pipeline | Register file | Memory | Interrupts | Sleep mode | Power gating | CFS | Peak IPC
RV12 [24] | RV32I | 6-stage | 32 | 32-bit | Yes | No | No | No | 2.0
E203 [6] | RV32IMAC | 2-stage | 32 | 32-bit | Yes | Yes | Yes | Yes | 2.4
RI5CY [36] | RV32IMAC | 4-stage | 32 | 32-bit | Yes | Yes | Yes | Yes | 3.0
Rocket [37] | RV64GC | 5-stage | 64 | 64-bit | Yes | Yes | Yes | Yes | 4.0
ARM Cortex M4 [36] | ARMv7E-M | 3-stage | 32 | 32-bit | Yes | Yes | Yes | Yes | 1.5
ARM Cortex M7 [36] | ARMv7E-M | 3-stage | 32 | 32-bit | Yes | Yes | Yes | Yes | 2.0
Intel i7-8700 [38] | x86-64 | 14-stage | 16 | 64-bit | Yes | Yes | Yes | Yes | 4.6
contrast to the Rocket core, the Intel i7-8700 is a power hungry desktop processor and is not suitable for embedded applications. Surprisingly, all RISC-V cores outperformed the commercial ARM Cortex M4 and M7 in peak IPC while offering similar features. Thus, to maximize performance and efficiency, the RISC-V core should be selected with respect to the application's requirements. An additional comparison of RISC-V against other platforms, such as TPU and GPU, is provided in [39].

III. DNN HARDWARE ACCELERATORS

Computing platforms, such as CPU, TPU and GPU, are expensive, power-hungry and unsuitable for edge applications. On the other hand, Application Specific Integrated Circuits (ASIC) are fast but deploy a non re-configurable architecture [22], [30]. However, RISC-V processors and FPGAs can be used concurrently to accelerate different DL structures as they are highly customizable. Mostly, those with exploitable parallelism benefit the most from optimized matrix operations. However, the choice varies with respect to the target application's requirements and the available hardware resources.

The most popular structure implemented on edge devices is the CNN [3]. CNNs are inherently parallel and more commonly used in error tolerant applications. They can be further simplified, at the cost of minor unnoticeable errors, to optimize power usage, hardware resources and latency [3]. Moreover, substantial work has been done on efficiently accelerating quantized transformer models for deployment on edge devices.

A. CNN Accelerators for IoT

To meet the basic CNN functionalities for multimedia data processing, a low bandwidth, area efficient and low complexity accelerator was designed for IoT SoC endpoints [32]. The CNN accelerator is constructed in the form of parallel operating acceleration chains, each with serially connected convolution, adder, activation function and pooling circuits [32], as shown in Fig. 2, where Src is the source input, 32b is the 32-bit bus width and 2D-Conv is the two dimensional convolution operation.

In Fig. 2a, a classical IoT SoC processing data flow is expanded to include a compact CNN accelerator connected to the CPU kernel through the SoC bus. The compact CNN accelerator, detailed in Fig. 2b, is formed of a core Random Access Memory (RAM), three ping-pong buffer blocks denoted by BUF RAM BANK, two data selectors, a CNN controller and four acceleration chains. The acceleration chain's top level architecture is presented in Fig. 2c and performs the core mathematical operations, i.e., 2D convolution, matrix addition, Rectified Linear Unit (ReLU) activation function and pooling, in fixed point format [32]. It is essential to highlight that the fully connected layer (FCL) operation can be viewed as a special case of the convolution operation with a similar hardware implementation [32], [40]. As such, the FCL operation is implemented in the 2D-Conv block [32]. Given the fixed sequence of operations, the operating blocks are serially connected to reduce internal data movement and inter-connectivity. The Bypass control allows bypassing specific, not needed, modules without affecting the system's performance or results. Additionally, the data width in a chain varies to maintain accuracy, however it remains consistent between layers [32].
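To make the chain's fixed sequence of operations concrete, the following C sketch walks a single feature map through convolution, bias addition, ReLU and 2 × 2 max pooling in the same order as the hardware blocks. It is an illustration only: the 3 × 3 kernel, 8-bit fixed-point data, requantization shift and function names are assumptions, not taken from [32].

```c
/* Minimal sketch of one fixed-point acceleration-chain pass
 * (2D convolution -> addition -> ReLU -> 2x2 max pooling).
 * Names and data widths are illustrative, not from [32]. */
#include <stdint.h>

#define K 3  /* kernel size assumed 3x3 */

static int32_t relu(int32_t x) { return x > 0 ? x : 0; }

/* in:  h x w_dim input map (8-bit fixed point), w: K x K kernel,
 * out: pooled output of size (h-K+1)/2 x (w_dim-K+1)/2          */
void chain_pass(const int8_t *in, int h, int w_dim,
                const int8_t w[K][K], int32_t bias, int8_t *out)
{
    int conv_h = h - K + 1, conv_w = w_dim - K + 1;

    for (int y = 0; y + 1 < conv_h; y += 2) {
        for (int x = 0; x + 1 < conv_w; x += 2) {
            int32_t best = 0;                    /* ReLU keeps results >= 0 */
            for (int py = 0; py < 2; py++) {
                for (int px = 0; px < 2; px++) {
                    int32_t acc = bias;          /* adder block: bias add   */
                    for (int i = 0; i < K; i++)  /* 2D-Conv block: MACs     */
                        for (int j = 0; j < K; j++)
                            acc += in[(y + py + i) * w_dim + (x + px + j)] * w[i][j];
                    acc = relu(acc);             /* ReLU block              */
                    if (acc > best) best = acc;  /* 2x2 max-pooling block   */
                }
            }
            int32_t q = best >> 8;               /* illustrative requantization */
            out[(y / 2) * (conv_w / 2) + (x / 2)] = (int8_t)(q > 127 ? 127 : q);
        }
    }
}
```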
The CNN accelerator was prototyped and tested for the ARM Cortex M3 [32]. The data flow direction is one way, whereas two memory access operations are required. This reduces efficiency and flexibility and increases power consumption [6]. In order to improve its performance and efficiency and reduce memory access operations, the IoT CNN accelerator was modified in [6] into a co-processor and connected to the RISC-V E203 CPU through the EAI.

The configurable CNN accelerator [6] modifies the compact CNN accelerator [32] by optimizing memory access and by replacing the CNN acceleration chain with a crossbar interconnecting different arithmetic units, as shown in Fig. 3. Replacing the serialized acceleration chain by a crossbar provides a re-configurable architecture that allows data to flow in different directions, improving the computation performance of different algorithms [6].

In contrast to Fig. 2b, the CNN accelerator in Fig. 3a uses two ping-pong buffers (BUF RAM BANK) instead of three and a re-configurable controller instead of the CNN controller. Each PE has been modified to make use of a crossbar interconnecting its arithmetic units instead of a serialized chain. The crossbar architecture is displayed in Fig. 3b and is formed of a First In First Out (FIFO) buffer, configuration registers (cfg Regs) and five multiplexers to route the data appropriately [6]. The cfg Regs are configured through the re-configurable controller block.

In contrast to Fig. 2a, the CNN accelerator shown in Fig. 4 is a re-configurable co-processor rather than an extension. Additionally, the memory access is optimized and controlled by the co-processor, thus improving the overall performance [6].

Both designs were implemented using an FPGA, specifically
Fig. 3. Re-configurable CNN accelerator. Source: Adapted from [6]: (a) Top level diagram, (b) Crossbar architecture
TABLE II
CNN accelerator resource consumption comparison

Design | Processor | LUT | FF | DSP | GOPS | Power (W) | Network | Speed (ms)
[6] | RISC-V E203 | 8,534 | 7,023 | 21 | - | - | LeNet-5 | -
[32] | ARM Cortex M3 | 4,901 | 2,983 | 0 | 6.54 | 0.380 | LeNet-5 | 2.44
[41] | ARM Cortex M3 | 5,717 | 6,207 | 20 | 1.602 | 0.370 | LeNet-5 | 11
[42] | Xilinx 485T | 15,285 | 2,074 | 564 | 44.9 | 0.658 | LeNet-5 | 0.49
[43] | ZYNQ XC7Z020 | 29,867 | 35,489 | 190 | 84.3 | 9.630 | VGG-16 | 364
[44] | PYNQ-Z2 | 3,411 | 2,262 | 6 | - | 0.118 | 1D-CNN | 0.137
Fig. 5. YOLO RISC-V accelerator. Source: Adapted from [35]: (a) Top level architecture, (b) Convolution block architecture (part of the computation module),
(c) Multi-level memory hierarchy
in [44] requires the least resources, where only 3,411 LUTs, 2,262 registers and 6 DSP units are needed. Moreover, the co-processor consumes 0.118 W, has a latency of 0.137 ms per class and provides a 99.16% accuracy on fixed-point operations. The design's low power and resource requirements make it a suitable choice for low power IoT wearable devices [44]. While these accelerators are specifically designed for deployment on edge devices, they cannot compete with high performance models such as that proposed in [43], offering a throughput of 84.3 GOPS.

B. CNN Accelerators for Object Detection

Classically, an object detector relies on segmentation, low level feature extraction and classification with respect to a shallow NN [48], [45]. However, with the advances in DNN and hardware computing power, state of the art detectors make use of deep CNN structures to dynamically extract complex features for accurate classification [45]. One of the most prevailing object detectors is the You Only Look Once (YOLO) detector. The YOLO detector and its successors (YOLOv2 [48], YOLOv3 and YOLOv4 [45]) offer the best bargain between performance (speed) and accuracy. However, this performance is achieved at the cost of high computational complexity and requirements, making it difficult to implement these networks on edge devices. Lightweight YOLO models (Tiny-YOLOv3 and Tiny-YOLOv4) have been proposed to reduce the complexity, i.e., fewer parameters, at the cost of a slight reduction in accuracy. Thus, to implement these lightweight models on embedded systems, suitable, low energy and high performance architectures are required [45].

To accommodate such requirements, a RISC-V based YOLO hardware accelerator with a multi-level memory hierarchy was proposed in [35]. The YOLO model implements the Darknet-19 inference network [35]. In their design [35], the filters are considered of size 3 × 3 or 1 × 1, the stride is 1 and the output is always a multiple of 7. The YOLO hardware accelerator is designed and implemented with respect to specific considerations and parameters rather than being generalized. This is done to achieve an area and energy efficient architecture [35]. The YOLO accelerator controller is chosen as the open source RISC-V Rocket Core with extended, customized instructions.

As shown in Fig. 5a, describing the top level architecture, the accelerator is connected to the CPU core through the ROCC interface. The Instruction FIFO (IFIFO) and Data FIFO (DFIFO) registers store the instructions and data forwarded by the CPU core. The decoder block decodes and forwards the instructions to the Finite State Machine (FSM), acting as the main control unit of the compute, padding and memory modules. In a parallel process, the input is read from the Double Data Rate Synchronous Dynamic Random-Access Memory (DDR-SDRAM), stored in the buffer and communicated to the computation module. The DFIFO transfers the CPU data to both the FSM and the computation module to begin the CNN operations, i.e., convolution, pooling and activation [35].

The computation module's core operating unit is the convolution unit, shown in Fig. 5b, which performs the convolution operation, the max pooling and the activation function. The convolution unit is formed of 9 multipliers, 7 adders and a 5-stage pipeline FIFO, as noted in Fig. 5b [35]. The data and weights are serially fed to the convolution unit using the input FIFO buffers, at every clock cycle, to perform the convolution operation. The output is then passed to a pooling unit that performs only max pooling with respect to three comparators. Finally, the ReLU activation function is performed on the results [35].
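As a rough illustration of this per-window datapath, the C model below mirrors the described structure: nine parallel products reduced in an adder-tree fashion, a pooling step built from three comparisons, and a final ReLU. The data widths, function names and exact adder arrangement are assumptions rather than the circuit of [35].

```c
/* Illustrative C model of the Fig. 5b datapath; names and widths assumed. */
#include <stdint.h>

/* 3x3 convolution window: nine multipliers feed an adder-tree reduction. */
static int32_t conv3x3_window(const int8_t win[9], const int8_t wgt[9])
{
    int32_t p[9], s[4];
    for (int i = 0; i < 9; i++)          /* 9 multipliers operate in parallel */
        p[i] = (int32_t)win[i] * wgt[i];
    s[0] = p[0] + p[1];  s[1] = p[2] + p[3];
    s[2] = p[4] + p[5];  s[3] = p[6] + p[7];
    return (s[0] + s[1]) + (s[2] + s[3]) + p[8];
}

/* 2x2 max pooling realized with three comparisons. */
static int32_t maxpool2x2(int32_t a, int32_t b, int32_t c, int32_t d)
{
    int32_t m0 = a > b ? a : b;          /* comparator 1 */
    int32_t m1 = c > d ? c : d;          /* comparator 2 */
    return m0 > m1 ? m0 : m1;            /* comparator 3 */
}

static int32_t relu(int32_t x) { return x > 0 ? x : 0; }
```

Following the order described above, a pooled and activated output would be obtained as relu(maxpool2x2(c0, c1, c2, c3)) over four neighbouring window results c0..c3.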
To improve overall performance, a memory hierarchy is designed and implemented by [35]. With respect to Fig. 5c, the memory hierarchy is composed of three levels: the off-chip DDR-SDRAM, the input/output data buffers and the internal weight and input double register groups, of 9 × 8 bit
Fig. 6. Generalized YOLO RISC-V accelerator. Source: Adapted from [45]: (a) Top level architecture, (b) Detailed architecture, (c) Custom functional unit
(FU) architecture
and 226 × 4 × 8 bit, respectively [35]. By adopting such a hierarchy, the interface's limited bandwidth bottleneck complications can be avoided. Although the design is energy efficient and requires a relatively small on-chip area, it requires 2.7 seconds to finish the YOLO inference operation. This delay is a result of a trade-off between resource usage and speed, i.e., a serially implemented computation module. To decrease the inference latency, the authors suggested adding additional, parallel, computation modules [35]. The authors evaluated the system's performance with 7 computation modules, achieving a 400 milliseconds (ms) average time [35].

The accelerator is model specific and designed with respect to constant configurations, i.e., filter, output and stride size. However, there exist different YOLO versions, each with a different input feature map size. Designing a specific accelerator for each feature map can be a tedious solution. Thus, a configurable, parameterizable RISC-V accelerator core is designed based on the Tiny-YOLO version [45].

The accelerator in [45] is designed as an algorithm oriented hardware core to accelerate lightweight versions of YOLO, thus allowing flexibility, robustness and configurability. Additionally, this design not only accelerates the CNN network, but also the pre and post CNN operations [45]. The proposed generalized YOLO accelerator is shown in Fig. 6. The vRead Bias, vRead Weights, vRead Array and vWrite Array are configurable dual-port memory units. The Direct Memory Access (DMA) unit is used to read data from the external memory, the Functional Unit (FU) is a matrix of configurable custom computing units and the AGU is the Address Generator Unit. The FU Matrix unit is used for reading tiles of the input feature map [45].

As shown in Fig. 6a, the YOLO accelerator's top level architecture is mainly formed of three stages: xWeightRead, xComp and AXI-DMA. The xWeightRead stage is formed of the vRead Bias and the vRead Weights array. These units perform the read, write and store operations to and from external memory and provide the needed data to the FU Matrix unit, as detailed in Fig. 6b. The weight memories are implemented as asymmetric dual-port units with an external 256-bit bus. Additionally, the FU Matrix is the accelerator's main PE and
TABLE III
RISC-V YOLO accelerators resource comparison and performance

Design | Platform | LUT | FF | DSP | BRAM | Speed (ms) | Power (W) | GOPS
YOLO accelerator (1 CM) [35] | Virtex-7 VC709 | 13,798 | 17,514 | 161 | 83 | 2,700 | 2.38 | 3.5
TinyYOLO v3 core [45] | Ultrascale XCKU040 | 103,655 | 86,319 | 832 | 339 | 30.9 | - | 238
TinyYOLO v4 core [45] | Ultrascale XCKU040 | 146,820 | 124,761 | 1,248 | 403 | 32.1 | - | 357
TinyYOLO v3 + RISC-V [45] | Ultrascale XCKU040 | 138,946 | 110,988 | 839 | 383.5 | 30.9 | 3.87 | 180
TinyYOLO v4 + RISC-V [45] | Ultrascale XCKU040 | 182,111 | 149,430 | 1,255 | 447.5 | 32.1 | - | -
CNN SqueezeNet [46] | ZYNQ ZC702 | 18,300 | 21,500 | 7 | 31.5 | 22.75 | 2.11 | -
Universal co-processor [47] | E203 SoC | 19,500 | 15,600 | 23 | 48 | 51 | 2.1 | -
is located in the xComp stage. The FU Matrix is a collection of interconnected, reconfigurable PEs whose sole purpose is to perform the 3D convolution operations. Each custom FU architecture, detailed in Fig. 6c, is formed of an array of MAC units, an adder tree, a Sigmoid activation function and a leaky ReLU activation function. The multiplexers route the data internally and introduce a higher level of customizability [45].

The RISC-V YOLO accelerators presented in [45], [35] are compared with respect to resource utilization and speed in Table III. Additionally, they are compared against different DNN architectures. BRAM denotes the internal FPGA Block RAM units and CM signifies compute module.

As shown in Table III, the special purpose YOLO accelerator designed in [35] requires the least resources with 161 DSP blocks, compared to 832 for the TinyYOLO v3 and 1,248 for the TinyYOLO v4. The YOLO accelerator CM is implemented in a serially operating manner while the TinyYOLO v3 and v4 PEs operate in parallel. However, the massive reduction in resource usage comes at the cost of slow performance, i.e., 2.7 s compared to 30.9 ms and 32.1 ms, with an architecture specifically tailored for pre-defined parameters [45]. In contrast, the TinyYOLO v3 and v4 designs presented in [45] offer a massive increase in performance, i.e., an average of 30 ms, at the cost of a tenfold increase in resource usage, mainly the DSP blocks and BRAM units. The TinyYOLO v3 and v4 cores are highly customizable and can be configured to meet any YOLO network version requirements. To improve the YOLO accelerator's performance, the authors in [35] suggested using 7 serially operating CMs placed in parallel to speed up the convolution operation, thus achieving an execution speed of approximately 400 ms. The overall resource requirements for implementing the RISC-V processor and the TinyYOLO accelerators are obtained at a slight increase in unit usage.

Moreover, a lightweight SqueezeNet CNN was proposed for edge MCU based object detection applications [46]. The proposed architecture is prototyped on the ZYNQ ZC702 SoC and can perform an inference run in 22.75 ms while consuming an average power of 2.11 W. Although the proposed model is not RISC-V specific, it can be adopted for use with these open-source processors. As the presented accelerators are architecture specific, i.e., TinyYOLO and SqueezeNet, a universal co-processor is designed to efficiently implement different object detection networks [47]. The universal co-processor is prototyped on the E203 RISC-V SoC and evaluated with respect to different architectures, such as Faster R-CNN, YOLOv3, SSD513 and RetinaNet. The co-processor is able to complete an inference run in 210, 51, 125 and 73 ms with 27.2, 33, 31.2 and 32.5 mean Average Precision (mAP), for the listed networks, respectively [47].

The choice of an accelerator is heavily dependent on the edge device, its resources and processing capabilities. While the YOLO accelerator and lightweight SqueezeNet [35], [46] are designed with specific considerations, they are most suitable for lower-end devices and can be redesigned for other specifications if needed. For higher-end devices and more complex applications, the designs presented in [45] can be a better alternative with an average speed of 30 ms. However, for general purpose SoC and generic applications, the universal co-processor [47] is the convenient choice.

C. Heterogeneous SSD accelerator for object detection

DL based real-time object detection [49], [50] and motion recognition [51] are popularly implemented in Advanced Driver Assistance Systems (ADAS) and video analysis applications. The Single Shot Multibox Detector (SSD) combines the advantages of YOLO and Faster R-CNN for fast and accurate real-time object detection [49]. SSD detects multiple objects through a single image snapshot. This is done by dividing the image into multiple grid cells with bounding/anchor boxes and performing concurrent object detection on each cell's region. The need for high accuracy and high speed inference makes implementing SSD DL structures on hardware a challenging task. ASIC, GPU and FPGA are famously used for accelerating complex DL structures with high inference speed. For SSD acceleration, ASIC and GPU offer the least desirable choice, mainly due to the lack of flexibility of the former and the high power consumption of the latter [49]. While the FPGA offers more flexibility and customizability, its limited resources constrain the performance of fully integrated complex systems [49]. As such, a heterogeneous, CPU-FPGA based approach was proposed in [49] to accelerate both the software and hardware parts of the SSD DL structure. The target CPU (host) and FPGA were chosen as the Intel Xeon Silver 4116 and the Arria 10 development board, respectively [49].

As for the software, the pre-trained network is optimized by fusing the Batch Normalization (BN) layer with the convolution layer. The operator fusion technique reduces the number of network layers, input parameters and memory accesses during inference. By adopting this technique, the inference speed, measured in FPS, is increased by 10% to 30% for different SSD networks [49].
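The BN-convolution fusion mentioned above is commonly realized with the standard folding rule sketched below; the routine is a generic illustration, not the exact implementation used in [49].

```c
/* Standard BN-into-convolution folding: for each output channel c,
 *   w'[c] = gamma[c] / sqrt(var[c] + eps) * w[c]
 *   b'[c] = gamma[c] / sqrt(var[c] + eps) * (b[c] - mean[c]) + beta[c]
 * Function and parameter names are generic placeholders. */
#include <math.h>

void fuse_bn_into_conv(float *w, float *b, int out_ch, int w_per_ch,
                       const float *gamma, const float *beta,
                       const float *mean, const float *var, float eps)
{
    for (int c = 0; c < out_ch; c++) {
        float scale = gamma[c] / sqrtf(var[c] + eps);
        for (int k = 0; k < w_per_ch; k++)
            w[c * w_per_ch + k] *= scale;          /* rescale the kernel        */
        b[c] = scale * (b[c] - mean[c]) + beta[c]; /* fold mean/offset into bias */
    }
}
```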
Using Layer Hardware Affinity (LHA) and graph partitioning [49], each SSD network is divided into subgraphs. Then, each subgraph is executed on its target
TABLE V
Mini-3D CNN performance, resource usage and power

are the DSP blocks and BRAM, occupying 62.27% and 30.36% of the total available, respectively. Compared to other designs, the proposed system achieved an accuracy of 95% for the Weizmann data set. The mini-3DC heterogeneous design provides a system for implementing large-scale, low-power 3D-CNN on embedded SoC devices with accelerated inference speed [51].

IV. ISA DNN EXTENSIONS

While some work focused on designing fully programmable, generic CNN accelerators for RISC-V processors, others optimized embedded DNN operations by extending the original RISC-V ISA [33], [53]. This technique implements specific core DNN routines, such as hardware loops [34], dot product [34], mixed precision support [36], in-memory computations [37] and in-pipeline ML processing [53], to improve overall performance.

A. Hardware loop and dot product

DNN routines consist of heavy arithmetic operations. To accelerate these computations, dedicated, parallel hardware blocks are needed. Thus, a trade-off exists between resource utilization and performance [34].

In order to speed up DL algorithms in RISC-V, without a major sacrifice in hardware, an instruction set extension has been proposed in [34], mainly for hardware loops, i.e., zero overhead loops, and dot product operations as shown in (1), where N is the vector length.

res = Σ_{n=0}^{N−1} vec_a[n] · vec_b[n]    (1)

DNN routines include extensive matrix computations, i.e., MAC operations, implemented using loop instructions. Traditionally, loops, when implemented in software, incur a large branch overhead that adds numerous setbacks to the architecture, i.e., increased delay and resource usage. By considering hardware loops and supporting instruction set extensions, branch overhead can be removed, resulting in increased performance [34]. Additionally, extended instruction sets for accelerating critical arithmetic operations, i.e., vector multiplications, play an important role in enhancing overall performance [33].
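For reference, the kernel targeted by these extensions is the plain dot product of (1). The sketch below shows the scalar loop whose branch overhead a hardware loop removes, together with a four-element helper standing in for a fused dot-product instruction; the helper name is illustrative and is not an actual intrinsic from [34].

```c
/* The inner kernel of a fully connected or convolution layer is the dot
 * product of (1). Each iteration of the scalar loop pays a branch (loop
 * overhead) plus separate multiply and add instructions; a zero-overhead
 * hardware loop removes the branch, and a fused dot-product instruction
 * (a p.fdotp4.s-style operation) consumes several elements per issue. */
float dot_product(const float *vec_a, const float *vec_b, int n)
{
    float res = 0.0f;
    for (int i = 0; i < n; i++)       /* branch overhead paid every iteration */
        res += vec_a[i] * vec_b[i];
    return res;
}

/* Conceptual 4-element step, mimicking what a fused instruction would do. */
static inline float fdotp4(const float *a, const float *b, float acc)
{
    return acc + a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

float dot_product_unrolled(const float *vec_a, const float *vec_b, int n)
{
    float res = 0.0f;
    int i = 0;
    for (; i + 4 <= n; i += 4)        /* candidate body for a zero-overhead loop */
        res = fdotp4(&vec_a[i], &vec_b[i], res);
    for (; i < n; i++)                /* scalar tail */
        res += vec_a[i] * vec_b[i];
    return res;
}
```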
To evaluate the advantages of hardware loops and dot product acceleration, the authors of [34] modified the RISC-V RI5CY core ISA to support these custom instructions. The RI5CY core is a 32-bit, 4-stage pipeline RISC-V core with integer multiplication, division and floating point instructions. It has 31 general purpose registers, 32 floating point registers and a 128-bit cache for instruction prefetch [34]. Additionally, it provides the XpulpV2 non-standard extension, which includes several functionalities such as hardware loops [36]. From the modified RI5CY RISC-V core block diagram, shown in Fig. 7, we can list the following details: non-highlighted boxes: original RI5CY core architecture; blue boxes: processing elements/operating units; red boxes: control logic; violet boxes: pipeline stage registers; orange boxes: general purpose and status registers; gray boxes: interface. The modified RI5CY core includes a hardware loop control 'hwloop control' block, highlighted in red, and a floating point dot product unit 'fDotp', highlighted in blue [34]. The 'hwloop control' core is capable of handling two levels of nested loops. The 'fDotp' unit performs two instructions, i.e., p.fdotp2.s and p.fdotp4.s, on single precision, 32-bit floating point numbers. These instructions perform the dot product operation described in (1) on two or four element vectors, respectively. The 'fDotp' unit is not pipelined, since the RI5CY core runs at a low frequency for reduced energy consumption. However, this poses no considerable effect on performance [34].

This design was prototyped on the ZYNQ 7000 SoC board and later synthesized using the Synopsys Design Compiler and the 90 nm generic core cell library from the United Microelectronics Corporation [34]. Compared to the original RI5CY design, occupying an area of 0.24233283 mm² with a dynamic power of 147.48 mW, the modified RI5CY is 72% larger and requires an area of 0.41758819 mm² with 148.47 mW. The increase in area is caused by the addition of the single-precision floating point dot product unit [34].

A simple Optical Character Recognition (OCR) NN was implemented to evaluate the designed ISA optimization. The five-layer network architecture is as follows: 28 × 28 input, 24 × 24 convolution, 12 × 12 max pooling, 60 Fully Connected (FC) and 10 output [34]. The modified RI5CY performance, shown in Table VI, is evaluated in terms of clock cycle count, dynamic instruction count and energy consumption for different program implementations. The Fp program implements the reference library version using all optimizations except hardware loops. Similarly, the FpHwU is a modification of the Fp with hardware loops and loop unrolling. Finally, the FpDotHw makes use of the optimized assembly library, the dot product unit and all optimizations including hardware loops [34].

TABLE VI
Modified RI5CY performance comparison per one inference run

Program | Clock cycles | Dynamic instructions | Energy (µJ)
Fp | 277,388 | 200,188 | 4,108
FpHwU | 249,495 | 194,471 | 3,695
FpDotHw | 74,945 | 67,430 | 1,118

Compared to the baseline Fp, the FpHwU presented a minor
Fig. 7. The modified RI5CY RISC-V core block diagram. Source: Adapted from [34]
Fig. 8. Mixed precision RI5CY modifications. Source: Adapted from [36]: (a) MPIC core, (b) Extended dot product
10% improvement in clock cycle count. However, with the addition of the dot product unit, the FpDotHw demonstrated the best computational performance by achieving a 74% reduction in cycles. Similarly, the FpDotHw achieved the best performance in instruction count and energy, with 66% and 27% improvements, respectively. Thus, for an MCU running at 10 MHz, a single inference run is performed within 7.5 ms and consumes 1,118 µJ using the FpDotHw [34].

B. Mixed precision RISC-V core

A mixed precision inference core (MPIC) for the RI5CY RISC-V processor, using virtual instructions, is presented in [36]. It is developed to eliminate the RI5CY encoding space problem and to implement heavily quantized deep neural networks (QNN) with improved performance and efficiency as compared to software based mixed precision RI5CY designs. The RISC-V ISA extension, called XMPI, extends the RI5CY core functionalities, adding support for status-based operations, for efficiently implementing 16, 4 and 2-bit QNNs [36]. A small set of the XpulpV2 instructions has been extended from the 16/8-bit formats to support 4/2-bit precision and mixed precision operation. Mainly, the below list of instructions from the XpulpV2 instruction set has been extended for use in 4 and 2-bit formats:

• Basic: ADD (addition), SUB (subtraction) and AVG (average)
• Vector comparison: MAX (maximum) and MIN (minimum)
• Vector shift: SRL (shift right logical), SRA (shift right arithmetic) and SLL (shift left logical)
• Vector absolute: ABS (absolute value)
• Dot product variations (unsigned - signed), illustrated in the sketch after this list
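As a rough model of what such a packed, mixed-precision dot-product instruction computes, the C routine below accumulates eight signed 4-bit activations packed in one 32-bit word against 8-bit weights. The packing layout, names and the software sub-group counter are assumptions and do not reproduce the XMPI encoding.

```c
/* Rough model of a mixed-precision dot-product step: eight 4-bit
 * activations packed in a 32-bit word are multiplied with 8-bit weights
 * and accumulated. In hardware, a counter (the MPC CNT signal described
 * below) selects which packed sub-group feeds the DOTP unit; here that
 * role is played by the loop index. */
#include <stdint.h>

static int32_t sext4(uint32_t nibble)        /* sign-extend a 4-bit field */
{
    return (nibble & 0x8u) ? (int32_t)nibble - 16 : (int32_t)nibble;
}

int32_t dotp_4bit_x_8bit(uint32_t packed_act, /* 8 signed 4-bit activations */
                         const int8_t w[8],   /* 8 signed 8-bit weights     */
                         int32_t acc)
{
    for (int g = 0; g < 8; g++) {            /* sub-group select (MPC-like count) */
        int32_t a = sext4((packed_act >> (4 * g)) & 0xFu);
        acc += a * (int32_t)w[g];
    }
    return acc;
}
```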
Fig. 8a details the MPIC core architecture: a mixed precision controller (MPC) block has been added to orchestrate mixed precision operations. Additionally, the Decoder, CSR, ALU and DOTP units have been modified to perform the required tasks and the XMPI extended instructions. The mixed precision dot product unit structure is shown in Fig. 8b, where the MPC CNT signal is the MPC count output, controlled by the MPC core unit and used to select the sub-groups of
operands [36]. The DOTP unit has been extended, from its original 16 and 8-bit form, to support 4 and 2-bit formats by adding two additional DOTP units with internal adders and multipliers.

The modified RI5CY core was integrated into the PULPissimo SoC and synthesized using the Synopsys Design Compiler, obtaining a maximum operating frequency of 250 MHz with a power consumption of 5.30 mW. Compared to the original RI5CY variation, the MPIC SoC occupies an area of 1.004273 mm² versus 1.002681 mm², resulting in approximately 0.2% overhead.

The MPIC was benchmarked against different commercially available processors by executing a QNN layer with a combination of various uniform and mixed precision configurations, namely 8, 4 and 2-bit. The MPIC average computational performance and energy efficiency are shown in Table VII, for an input tensor and filter sizes of 16 × 16 × 32 and 64 × 3 × 3 × 32, respectively [36].

TABLE VII
MPIC average computational performance and energy efficiency comparison

MCU | Frequency (MHz) | Power (mW) | Performance (MAC/cycle) | Efficiency (GMAC/s/W)
Cortex M4 | 80 | 10 | 0.4 | 2.64
Cortex M7 | 480 | 234 | 0.6 | 1.27
RI5CY | 250 | 5.39 | 1.16 | 42.18
MPIC | 250 | 5.30 | 3.22 | 96.7

Compared to the Cortex M4 (STM32L4), the Cortex M7 (STM32H7) and the RI5CY, the MPIC achieved an 8.55×, 5.36× and 2.77× increase in the number of MAC operations performed in a cycle (MAC/cycle), respectively. Additionally, it attained the lowest power consumption of 5.30 mW. The energy efficiency is provided in GMAC/s/W; however, it is also affected by physical design parameters [36]. In contrast to the Cortex M4 with 2.64 GMAC/s/W, the Cortex M7 achieved a lower efficiency of 1.27 GMAC/s/W despite the higher frequency and better performance results. This is a consequence of its higher power consumption of ~234 mW at 480 MHz. However, both Cortex M cores still fall behind when compared to the RI5CY and MPIC cores, which achieve an efficiency of 42.18 and 96.7 GMAC/s/W, respectively.

C. In-pipeline ML processing

In mobile edge inference, such as on Android devices, the CPU handles all ML computations without any additional accelerators. This is because the gain in performance is not always the main parameter of interest. As such, for some edge-AI applications, having a decent CPU with a dedicated SIMD unit is sufficient [53]. Some modern CPUs, such as the Intel Sapphire Rapids, include a matrix-multiply engine to perform tile based multiply-add (TMUL). The TMUL instruction, part of the Advanced Matrix Extension (AMX) tile operations category, performs only one operation as defined in (2) [54], where i is the number of rows, j is the number of columns and l is an intermediate variable.

Tile_C[i][j] += Tile_A[i][l] × Tile_B[l][j]    (2)
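Read as software, (2) is simply the accumulation below over a pre-loaded tile; a single TMUL instruction performs the whole triple loop on tile registers. The 16 × 16 int8 tiles with 32-bit accumulation are illustrative assumptions, not the exact AMX tile configuration.

```c
/* The tile multiply-add of (2) written out as plain loops: every (i, j)
 * element of the C tile accumulates a row of A against a column of B.
 * Tile dimensions and data types are assumed for illustration. */
#include <stdint.h>

#define TROWS 16
#define TCOLS 16
#define TK    16

void tile_muladd(int32_t C[TROWS][TCOLS],
                 const int8_t A[TROWS][TK],
                 const int8_t B[TK][TCOLS])
{
    for (int i = 0; i < TROWS; i++)          /* i: rows of the C tile       */
        for (int j = 0; j < TCOLS; j++)      /* j: columns of the C tile    */
            for (int l = 0; l < TK; l++)     /* l: intermediate (reduction) */
                C[i][j] += (int32_t)A[i][l] * (int32_t)B[l][j];
}
```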
However, in contrast to RISC-V, these processors cannot be customized and tailored to support the application's needs. Additionally, they are high in power consumption and, in heterogeneous systems, performance is mostly limited by the frequent host-accelerator communications. Thus, the accelerator becomes severely underutilized [53]. As such, an end-to-end RISC-V edge ML solution using in-pipeline support is proposed in [53]. This design implements a custom processing unit, the RV-MLPU, to accelerate RISC-V Rocket Core processor DNN operations [53]. It mainly consists of: extending the RISC-V ISA with a dedicated ML SIMD unit, developing a software stack to support custom ML instructions and adding compiler support to map TensorFlow Lite operations and vectorized kernels to the ISA [53]. The RV-MLPU SIMD unit extends the RISC-V Rocket Core processor and includes: support for vector operations, a modified DCache for high memory bandwidth and different memory access operations.

The design's performance was evaluated with respect to popular benchmark ML models frequently implemented on mobile devices, i.e., DenseNet, MnasNet, Inception V3, etc. The modified RISC-V ISA was compared against the ARM v8-A with NEON Advanced SIMD extensions for different implementations, i.e., ARM-base, RV-base, ARM-opt, 128-bit RV-opt-v1 and 256-bit RV-opt-v2 [53]. Performance was evaluated with respect to the number of executed instructions. On average, both the ARM-base and the RV-base executed the same number of instructions for all models. However, the RV-opt-v1 implementation achieved an average of 8× and 1.25× reduction in executed instructions compared to the ARM-opt and the others, respectively. Also, the RV-opt-v2, at 256-bit register width, achieved a 2× further reduction compared to the RV-opt-v1 [53].

While ISA extensions offer a general purpose solution, the performance is constrained by several limiting factors, such as user needs, application needs, compiler mapping, library support, memory access and data transfer.

D. Analog in memory computation

Analog in memory computing (AIMC) is a promising solution to overcome memory bottlenecks in DNN operations as well as to efficiently accelerate QNN operations. It performs analog computations, i.e., matrix vector multiplications and dot products, on the phase change memory (PCM) cross-bars of non volatile memory (NVM) arrays, thus accelerating DNN inference while optimizing energy usage [37], [55].

Although efficient, AIMC still requires additional improvements to achieve full scale application efficiency. Some of its key challenges are [37]:
• Limited to matrix/vector operations
• Difficult to integrate in heterogeneous systems (lack of optimized interface designs)
• Susceptible to computation bottlenecks in single core processor devices when handling other workloads, i.e., activation function and depth wise convolution

Heterogeneous RISC-V heavy computing clusters and hybrid SoC designs have gained popularity in extreme edge AI inference [56], [57]. In an effort to overcome the AIMC
Fig. 9. PULP cluster architecture with an 8 core RISC-V processor, IMA unit and a digital depth-wise convolution accelerator. Source: Adapted from [37]
challenges, an 8-core RISC-V clustered architecture with in-memory computing accelerators (IMA) and digital accelerators was developed in [37]. The aim of this system is to sustain AIMC performance in heterogeneous systems for optimized DNN inference on edge devices, targeting practical end-to-end IoT applications [37]. Similar to previous designs, the architecture presented in [37] is based on the popular RISC-V PULP cluster. The work mainly focused on:
• Designing a heterogeneous system with 8 programmable RISC-V core processors, IMA and digital accelerators dedicated to performing depth-wise convolutions (DW)
• Improving computational performance by optimizing the interfaces between the IMA and the system
• Exploiting heterogeneous analog-digital operations, such as point wise/depth wise convolutions and residuals

As shown in Fig. 9, the PULP cluster is formed of an 8-core RISC-V processor, a level 1 (L1) Tightly Coupled Data Memory (TCDM) cache, an instruction cache, the depth-wise convolution digital accelerator and the IMA subsystem. The components are connected together internally by means of a low latency logarithmic interconnect, and to the external world through on-board DMA and an AXI-bus. The logarithmic interconnect ensures serving the memory in one cycle while the AXI-bus allows the cluster to communicate with the external MCU and peripherals. The external MCU also contains the cluster core program instructions. A hardware event unit is added to the system in order to synchronize operations and thread dispatching [37].

Each subsystem or hardware processing engine (HWPE) has its own streamer block, a standardized interface, formed of source and sink FIFO buffers to interact with the RISC-V cores and exchange data with the internal engine. Each block implements an independent FSM to control and synchronize its operation. The HWPE provides two interfaces, control and data. The control interface 'Ctrl intf' allows the cluster to manipulate the accelerator's internal registers for configuration purposes, while the data interface 'data intf' connects to the logarithmic interconnect and, in its turn, to the L1 memory unit [37]. The IMA and DW subsystems are further detailed to show their internal architecture. The IMA subsystem engine implements both the analog and digital circuitry as follows:
• Analog: AIMC crossbar with a 256 × 256 array, programming circuitry, i.e., PCM configuration, Digital to Analog (DAC) and Analog to Digital (ADC) converters
• Digital: Input/Output (IO) registers to communicate with the ADC/DAC and an internal FSM control unit

The IMA operates on the L1 memory data encoded in a special format, i.e., the HWC format. The IMA register file 'INPUT PIPE REGS' can be set to pipeline different jobs by correctly setting the strides, thus providing the advantage of executing a full layer in one configuration phase.

On the other hand, the DW convolution engine is a fully digital accelerator. It implements a network composed of multiple MAC units, i.e., 46 MACs, register files for data and configuration, window and weight buffers, a general controller FSM and a dedicated engine FSM. The accelerator can also perform the ReLU activation function as well as the shift and clip operations [37], thus accelerating the convolution operation. Each DW convolution output channel depends on only one input channel, thus offering a reduction in size and a lower connectivity as compared to the original design. The specifically designed DW convolution accelerator resolves DW layer mapping to IMC arrays and eliminates any software-originating performance bottlenecks [37]. Additional studies concerning array structures for AIMC, such as systolic arrays for reduced energy consumption, can be found in [58].
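The reduction in connectivity follows from the structure of the depth-wise operation itself, sketched below in plain C: each output channel is produced from a single input channel. The 3 × 3 kernel, unit stride and valid padding are assumptions made only for illustration.

```c
/* Depth-wise convolution: unlike a standard convolution, output channel c
 * reads only input channel c, which is what lets a DW engine keep its
 * connectivity and buffering small. CHW layout assumed. */
#include <stdint.h>

void depthwise_conv3x3(const int8_t *in, int ch, int h, int w,
                       const int8_t *k /* ch x 3 x 3 kernels */, int32_t *out)
{
    int oh = h - 2, ow = w - 2;
    for (int c = 0; c < ch; c++)                       /* one channel at a time */
        for (int y = 0; y < oh; y++)
            for (int x = 0; x < ow; x++) {
                int32_t acc = 0;
                for (int i = 0; i < 3; i++)
                    for (int j = 0; j < 3; j++)
                        acc += in[(c * h + y + i) * w + (x + j)]
                             * k[(c * 3 + i) * 3 + j];
                out[(c * oh + y) * ow + x] = acc;      /* depends on channel c only */
            }
}
```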
The heterogeneous system was synthesized with Synopsys Design Compiler 2019.12. The full place and route flow was done using Cadence Innovus 20.12 and the cluster was implemented using the GlobalFoundries 22 nm FDX technology node. The total system area of the heterogeneous cluster is 2.5 mm², with the IMA core occupying one-third of the area with 0.000912 mm². In addition, the 512 KiloByte (KB) TCDM cache occupies another one-third, and one-third is occupied by
the remaining parts. The device can perform an average of 29.7 MAC operations per cycle and execute inference for the MobileNetV2 network in 10 ms while achieving a performance of 958 GOPS on NVM.

Emerging technologies, such as 3D integration, when coupled with in-memory computing (IMC) techniques, can provide substantial design benefits. 3D integration is achieved by stacking multiple layers of electronic components in a single chip or package to reduce power consumption, reach higher clock speeds, and improve signal integrity and overall circuit performance. Additional details on 3D integration and IMC techniques can be found in [59], [60].

V. HARDWARE ACCELERATORS FOR TRANSFORMERS

Transformers have been shown to outperform CNN and RNN in different applications, i.e., NLP and computer vision [50], [61], [38], [62]. They are formed of encoder and decoder blocks that execute several compute-intensive, floating point and non-linear operations on massive data streams [61], such as Multi-Head Self Attention (MHSA), Softmax, Gaussian Error Linear Unit (GELU), point-wise Feed Forward Network (FFN) and Layer Normalization (LN) [4], [61]. However, generic DL structures and accelerators are not tailored to support and optimize these specific transformer operations [61]. Some common optimization techniques include: model compression with integer or fixed point quantization [63], [64], [65], specific approximations with scaling factors to execute non-linear operations [61] and specialized hardware accelerators [38], [62].

A. Fully quantized BERT

The Bidirectional Encoder Representations from Transformers (BERT) is a state-of-the-art model formed of stacked encoder layers [63]. However, its computational complexity and memory requirements are > 20 GFLOPs and > 320 MB of floating point parameters, respectively [38], hindering its implementation on resource constrained edge devices.

To reduce its memory footprint and computational complexity, a Fully Quantized BERT (FQ-BERT) with hardware-software acceleration is proposed in [38] for SoC. The FQ-BERT compresses the model by quantizing all parameters and intermediate results to integer or fixed-point data types. Moreover, it accelerates inference by implementing dot-product based PEs and bit-level reconfigurable multipliers [38]. The methods and techniques used for quantizing the BERT parameters are detailed as follows [38]:
1) Weights and activations: quantized to 4-bit using a symmetric linear quantization strategy with tunable (MIN, MAX) clip thresholds and a scaling factor (sketched after this list). The weight scaling factor is computed using a scaling formula. The Exponential Moving Average (EMA) is used to determine the activation scaling factor during inference.
2) Biases and other parameters: the biases are quantized to 32-bit integers. The Softmax module and the layer normalization parameters are quantized to 8-bit fixed-point values.
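A minimal sketch of the symmetric linear quantization in item 1) is given below, assuming a generic clipping threshold and a 4-bit signed output; the formulas are the textbook symmetric scheme, not the exact expressions used in [38].

```c
/* Symmetric linear quantization with a clipping range: values are clipped
 * to [-clip, clip] and mapped to signed (bits)-bit integers through a
 * single scale factor. Names and helpers are generic placeholders. */
#include <math.h>

typedef struct { float scale; int qmax; } qparams_t;

qparams_t make_qparams(float clip, int bits)        /* e.g. bits = 4 */
{
    qparams_t p;
    p.qmax  = (1 << (bits - 1)) - 1;                /* 7 for 4-bit   */
    p.scale = clip / (float)p.qmax;                 /* scaling factor */
    return p;
}

int quantize(float x, qparams_t p)                  /* float -> int   */
{
    float clipped = fminf(fmaxf(x, -p.scale * p.qmax), p.scale * p.qmax);
    return (int)lroundf(clipped / p.scale);
}

float dequantize(int q, qparams_t p)                /* int -> float   */
{
    return (float)q * p.scale;
}
```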
A. Fully quantized BERT Initially the weights are loaded to the off-chip memory. A
The Bidirectional Encoder Representations from Transform- task level scheduler is implemented to fully overlap off-chip
ers (BERT) is a state-of-the-art model formed of stacked memory access and computing operations. This is done by
encoder layers [63]. However, its computational complexity dividing each stage into several sub stages [38]. The FQ-BERT
and memory requirements are > 20 GFLOPS and > 320 and BERT were implemented using PyTorch and evaluated on
MB floating points parameters, respectively [38]. Hindering the SST-2 and MNLI tasks of GLUE benchmark. The FQ-
its implementation on resource constraint edge devices. BERT, with a compression ratio 7.94⇥, achieved an accuracy
To reduce its memory footprint and computational com- of 91.51% and 81.11% as compared to BERT with 92.32%
plexity, a Fully Quantized BERT (FQ-BERT) with hardware- and 84.19%, respectively [38]. Furthermore, the accelerator
software acceleration is proposed in [38] for SoC. The FQ- was implemented on the Xilinx ZCU102 (FPGA) and ZCU111
BERT compresses the model by quantizing all parameters (SoC) and was compared to the baseline program, FQ-BERT,
and intermediate results to integer or fixed-point data type. running on the Intel i7-8700 CPU and the Nvidia K80 GPU
Moreover, it accelerates inference by implementing a dot- (CUDA 10.1). The sentence length and batch size are set to
product based PEs and bit-level reconfigurable multipliers 128 and 1, respectively.
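As an illustration of this scheme, the following PyTorch-style sketch shows symmetric linear quantization with a tunable clip threshold and an EMA-tracked activation scale. It is a minimal software model under assumed names (symmetric_quantize, EmaActScale) and not the FQ-BERT implementation of [38]; 4-bit packing, per-channel scales and all hardware details are omitted.

import torch

def symmetric_quantize(x, num_bits=4, clip_max=None):
    """Symmetric linear quantization: map [-clip_max, clip_max] to signed integers."""
    qmax = 2 ** (num_bits - 1) - 1                 # e.g. 7 for 4-bit signed values
    clip_max = clip_max if clip_max is not None else x.abs().max()
    scale = clip_max / qmax                        # scaling factor
    q = torch.clamp(torch.round(x / scale), -qmax, qmax)
    return q.to(torch.int8), scale                 # 4-bit values stored in an int8 container

class EmaActScale:
    """Track an activation clip threshold with an Exponential Moving Average,
    so a fixed scaling factor can be reused at inference time."""
    def __init__(self, momentum=0.99):
        self.momentum, self.running_max = momentum, None
    def update(self, x):
        cur = x.abs().max().item()
        self.running_max = cur if self.running_max is None else \
            self.momentum * self.running_max + (1.0 - self.momentum) * cur
        return self.running_max

# usage: 4-bit weights, EMA-calibrated 4-bit activations (toy tensors)
w_q, w_scale = symmetric_quantize(torch.randn(768, 768), num_bits=4)
ema = EmaActScale()
act = torch.randn(128, 768)
a_q, a_scale = symmetric_quantize(act, num_bits=4, clip_max=ema.update(act))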
The software part, running on the CPU with access to the off-chip memory, implements the least computationally demanding operations, like the embedding and task-specific layers; however, these require the most memory space. The hardware part, running on the FPGA, implements the encoder layers' accelerated units, such as the on-chip buffers, PE, LN core and Softmax core [38]:

1) On-chip buffers: a double-buffered weight buffer, an intermediate data buffer for the MHSA unit variables, a cache buffer for storing the scaling factors and the Softmax look-up table values, and the input/output buffers.
2) PE: each unit is formed of bit-level re-configurable multipliers with support for 8 × 4 bit and 8 × 8 bit combinations. Additionally, a Bit-split Inner-product Module (BIM) is included to simplify reuse for different operations.
3) Softmax and LN core: the exponential function is quantized to 8 bits and 256 sampling points are stored in a look-up table to simplify the computation. Moreover, a coarse-grained 3-stage pipelined parallel SIMD unit is designed to accelerate the element-wise multiplication.

TABLE VIII
FQ-BERT PERFORMANCE COMPARISON FOR DIFFERENT PROCESSORS

Processor   Latency (ms)   Power (W)   FPS/W   Clock
CPU         145.06         65          0.11    3.2 GHz
GPU         27.84          143         0.25    -
ZCU102      43.89          9.8         2.32    240 MHz
ZCU111      23.79          13.2        3.18    240 MHz

Initially, the weights are loaded to the off-chip memory. A task-level scheduler is implemented to fully overlap off-chip memory access and computing operations by dividing each stage into several sub-stages [38]. The FQ-BERT and BERT were implemented using PyTorch and evaluated on the SST-2 and MNLI tasks of the GLUE benchmark. The FQ-BERT, with a compression ratio of 7.94×, achieved an accuracy of 91.51% and 81.11%, as compared to BERT with 92.32% and 84.19%, respectively [38]. Furthermore, the accelerator was implemented on the Xilinx ZCU102 (FPGA) and ZCU111 (SoC) and compared to the baseline FQ-BERT program running on the Intel i7-8700 CPU and the Nvidia K80 GPU (CUDA 10.1). The sentence length and batch size are set to 128 and 1, respectively.
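The overlap of off-chip transfers and computation described above is essentially a ping-pong (double-buffer) schedule: while one sub-stage is being computed, the weights of the next sub-stage are fetched. The snippet below is only a software analogue of that idea, not the RTL scheduler of [38]; the sub-stage names and callbacks are placeholders.

from concurrent.futures import ThreadPoolExecutor

def run_substages(substages, load_weights, compute):
    """Ping-pong schedule: compute sub-stage i while prefetching the weights of i+1."""
    with ThreadPoolExecutor(max_workers=1) as prefetcher:
        pending = prefetcher.submit(load_weights, substages[0])   # fill the first buffer
        for i, stage in enumerate(substages):
            weights = pending.result()                            # wait for the current buffer
            if i + 1 < len(substages):
                pending = prefetcher.submit(load_weights, substages[i + 1])  # fill the other buffer
            compute(stage, weights)                               # overlaps with the prefetch

# usage with placeholder callbacks
run_substages(["qkv", "attn", "ffn"],
              load_weights=lambda s: f"weights({s})",
              compute=lambda s, w: print("computing", s, "with", w))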
Table VIII compares the performance and energy efficiency of the FQ-BERT and BERT when implemented on different processors. The accelerator achieved a 6.10× and 28.91× improvement as compared to the CPU, and a 1.17× and 12.72× improvement as compared to the GPU [38]. For 12 processing units with 16 PEs and 16 multipliers, the total resource consumption on the ZCU111 is 395,159, where 3,287 DSP blocks were allocated.

Although transformers are the go-to choice in NLP applications, not all models can be fully deployed on hardware. As such, deploying Long Short-Term Memory (LSTM) networks can be a suitable alternative where the requirements are minimal. A 32-bit precision floating-point LSTM-RNN FPGA accelerator is proposed in [66].
The design is implemented on the Virtex 7 running at 150 MHz and can perform an average of 7.26 GFLOP/s. The memory-optimized architecture occupies 52.04% of the BRAMs, 42% of the DSP units, 30.08% of the FFs and 65.31% of the LUTs. The network can be fully implemented on hardware and achieves a 20.18× speed-up compared to the Intel Xeon CPU E5-2430 software implementation clocked at 2.20 GHz [66].

B. SwiftTron

Various DL accelerators were designed to implement fully quantized, fixed-point and integer-based transformers, i.e., FQ-BERT and I-BERT [65]. However, these architectures do not fully deploy the model on hardware but only optimize and execute specific parts. In addition, non-linear operations are difficult to implement in integer arithmetic without a significant loss in accuracy [61]. As such, SwiftTron, a specialized open-source hardware accelerator, is proposed in [61] for quantized transformers and vision transformers. The SwiftTron architecture implements several hardware units to fully and efficiently deploy quantized transformers on edge AI/TinyML devices using only integer operations. To minimize the accuracy loss, a quantization strategy for transformers with scaling factors is designed and implemented. The scheme reliably implements linear and non-linear operations in 8-bit integer (INT8) and 32-bit integer (INT32) arithmetic, respectively. Quantization is performed using scaling factors that are dynamically computed during the process [61].

To accelerate the linear layers, an INT8-input Matrix Multiplication (MatMul) block is proposed [61], as shown in Figure 10a. The MatMul block is designed as an array of shareable and reusable INT32 MAC units to avoid accuracy loss. The MAC units perform column-oriented computations with bias addition. This data flow simplifies the MatMul architecture as well as the interface between blocks. However, as non-linear operations are performed with INT8 representations, a requantization (Req) unit is needed. Since scaling factors can also assume real values, the requantization unit represents the scaling factor ratio with a Dyadic number (DN), as shown in (3), where a and o are the INT32 and INT8 values, q_a and q_o are their quantized values, S_a and S_o are their scales, such that a = q_a S_a and o = q_o S_o, and b and c are integers. With the use of the DN, the unit implements a right-shift operation and eliminates the need for a divider [61].

q_o = q_a \frac{S_a}{S_o} = q_a \, DN\!\left(\frac{S_a}{S_o}\right) = q_a \times \frac{b}{2^c}    (3)
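To make the dyadic requantization of (3) concrete, the sketch below models an INT8 × INT8 MatMul with INT32 accumulation and bias addition, followed by a multiply-and-shift requantization. It is an illustrative NumPy model with assumed scales and helper names (dyadic, int8_matmul_requant), not the RTL datapath of [61].

import numpy as np

def dyadic(ratio, c=16):
    """Approximate a real scale ratio S_a/S_o by a dyadic number b / 2^c, as in Eq. (3)."""
    return int(round(ratio * (1 << c))), c

def int8_matmul_requant(x_q, w_q, bias_q, s_a, s_o):
    """INT8 x INT8 -> INT32 accumulation with bias, then requantize back to INT8
    with a multiply and a right shift instead of a division."""
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32) + bias_q    # INT32 MACs, column-oriented in hardware
    b, c = dyadic(s_a / s_o)
    out = (acc.astype(np.int64) * b) >> c                          # q_o = q_a * b / 2^c (widened to avoid overflow here)
    return np.clip(out, -128, 127).astype(np.int8)

# usage with small random tensors and assumed scales
rng = np.random.default_rng(0)
x_q = rng.integers(-128, 128, size=(4, 16), dtype=np.int8)
w_q = rng.integers(-128, 128, size=(16, 8), dtype=np.int8)
bias_q = rng.integers(-1024, 1024, size=8, dtype=np.int32)
y_q = int8_matmul_requant(x_q, w_q, bias_q, s_a=0.02 * 0.015, s_o=0.05)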
Among others, the MatMul blocks are used as the basic building blocks of the MHSA and of the FFN with dimension d_ff. The MHSA block, shown in Figure 10b, is formed of k head units operating in parallel and connected to a final MatMul block that generates the output. The MHSA block can be reconfigured to include one or many heads, depending on the desired architecture, i.e., parallel or sequential with reuse, and on the available resources. Each head unit, shown in Figure 10c, contains three MatMul blocks and one attention block to compute the Query (Q), Key (K) and Value (V) matrices in parallel [61]. The attention block, shown in Figure 10d, computes the QK^T matrix, where T denotes the transpose. It is formed of two MatMul blocks with intermediate scale units, i.e., division by the transformer dimension d, Softmax and requantization. Finally, the FFN block structure also implements two MatMul blocks with intermediate GELU and requantization units [61].
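The data flow of one head unit and of the surrounding MHSA block can be summarized with the following floating-point sketch. The real design chains the integer MatMul and requantization units described above, so treat this only as a structural illustration; the weights and sizes are made up, and the scaling by d follows the description in [61].

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def head_unit(x, w_q, w_k, w_v, d):
    """One head unit: three MatMul blocks produce Q, K, V in parallel, then the
    attention block computes softmax(Q K^T / d) V."""
    Q, K, V = x @ w_q, x @ w_k, x @ w_v           # three MatMul blocks
    scores = softmax((Q @ K.T) / d)                # attention block: QK^T, scale, Softmax
    return scores @ V                              # second MatMul of the attention block

def mhsa(x, heads, w_out):
    """k head units operating in parallel, concatenated and fed to a final MatMul block."""
    outs = [head_unit(x, *hw, d=x.shape[-1]) for hw in heads]
    return np.concatenate(outs, axis=-1) @ w_out

# usage: m = 4 tokens, model dimension 8, k = 2 heads of width 4 (toy sizes)
rng = np.random.default_rng(1)
x = rng.standard_normal((4, 8))
heads = [tuple(rng.standard_normal((8, 4)) for _ in range(3)) for _ in range(2)]
y = mhsa(x, heads, rng.standard_normal((8, 8)))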
Non-linear operations, such as Softmax, GELU and the square root used by the LN, are performed by means of second-order polynomial approximations and recursive implementations [61]. Softmax is applied to the row components of the QK^T matrix; as such, m parallel units are instantiated, where m is the sentence length. Its implementation is summarized as follows. First, the unit implements a maximum search block to obtain the maximum value, which is subtracted from the inputs to obtain decomposable non-positive real numbers. Second, the input range is restricted to [-ln 2, 0]. Third, the exponential function is computed by means of a second-order polynomial. Finally, the output is generated using an accumulate-and-divide block [61].
The GELU unit is implemented with simple add, multiply and sign-handling operations. This is done by linearizing the error function (erf) through a second-order polynomial with a limited input range. The LN block's square root operation is implemented in a recursive manner: the algorithm iterates until x_{i+1} ≥ x_i, where x is the partial result and i is the iteration index. Finally, a control unit is implemented to manage the different operations. The residual block's output is added to the original inputs through the respective Dyadic units to ensure matching scaling factors [61].
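As an illustration of these two approximations, the snippet below sketches a GELU built on a second-order erf polynomial (I-BERT-style constants, shown for illustration) and an integer square root whose loop stops once the partial result no longer decreases, i.e. when x_{i+1} ≥ x_i. It is a software model of the idea, not the SwiftTron datapath.

import math

def approx_erf(x, a=-0.2888, b=-1.769):
    """Second-order polynomial approximation of erf on a limited input range."""
    s = 1.0 if x >= 0 else -1.0
    x = min(abs(x), -b)                      # limit the input range
    return s * (a * (x + b) ** 2 + 1.0)

def approx_gelu(x):
    """GELU(x) = 0.5 * x * (1 + erf(x / sqrt(2))) using the polynomial erf."""
    return 0.5 * x * (1.0 + approx_erf(x / math.sqrt(2.0)))

def recursive_isqrt(n):
    """Integer square root computed recursively; the iteration stops when x_{i+1} >= x_i."""
    x = 1 << ((n.bit_length() + 1) // 2)     # initial guess
    while True:
        x_next = (x + n // x) // 2
        if x_next >= x:
            return x
        x = x_next

print(approx_gelu(1.0), recursive_isqrt(1 << 20))   # ~0.84 and 1024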
The SwiftTron architecture was synthesized in a 65 nm CMOS technology using Synopsys Design Compiler, with its parameters set to d = 768, k = 12, m = 256 and d_ff = 3072. The synthesis results show that the architecture operates at a clock frequency of 143 MHz, occupies an area of 273.0 mm² and consumes 33.64 W. The MatMul, Softmax, LN and GELU blocks occupy 55%, 17%, 25% and 3% of the total area, respectively, and contribute 79%, 14%, 6% and 1% of the total power, respectively [61].

The architecture was evaluated by executing the RoBERTa-base/large models on SST-2 and the DeiT-S model with a 244 × 244 image resolution from the ImageNet database. The inference latency was compared to that of the Nvidia RTX 2080 Ti GPU [61]. The mean accuracy obtained was 95.8% for the RoBERTa models and 79.11% for the DeiT-S. In terms of latency, RoBERTa-base, RoBERTa-large and DeiT-S required 1.83 ms, 45.70 ms and 1.13 ms, respectively, with latency speed-up factors of 3.81×, 3.90× and 3.58× with respect to the GPU [61].

TABLE IX
BERT, I-BERT, FQ-BERT AND SWIFTTRON INT8 COMPARISON

Model            Processor   Accuracy   Latency (ms)   Speed up
BERT [65]        T4 GPU      96.3%      -              1×
I-BERT [65]      T4 GPU      96.4%      -              3.56×
FQ-BERT [38]     SoC         91.51%     23.79          12.72×
SwiftTron [61]   FPGA        96.4%      45.70          3.90×
Fig. 10. SwiftTron linear layers architecture. Source: Adapted From [61]: (a) MatMul block, (b) MHSA block, (c) Head Unit, (d) Attention block
From Table IX and as compared to the baseline BERT, the I-BERT and FQ-BERT achieve a speed-up of 3.56× and 12.72× with an accuracy of 96.4% and 91.51%, respectively. However, both designs make use of a heterogeneous processor and cannot be fully deployed on an FPGA. SwiftTron, while not explicitly designed for a specific processor, model or ISA, is prototyped on an FPGA and, as compared to the FQ-BERT, achieves a speed-up of 3.90× with an accuracy of 96.4% and a latency of 45.70 ms. However, SwiftTron is not suitable for deployment on highly resource-constrained devices.

ViTA, a hardware accelerator architecture with an efficient data-flow, is proposed in [67] to deploy compute-heavy vision transformer models on edge devices. The design supports several popular vision transformer models, avoids repeated off-chip memory accesses, and implements a head-level pipeline and several layer optimizations [67]. The design is based on the ViT-B/16 model and prototyped on the ZYNQ ZC7020. For an image dimension of 256 × 256 × 3, ViTA occupies 53,200 LUTs, 220 DSP slices and 630 KB of BRAM. In terms of performance, the accelerator achieves 2.17 FPS with a 93.2% hardware utilization efficiency, while operating at a frequency of 150 MHz and consuming 0.88 W, i.e., 3.12 FPS/W [67].

In contrast to the SwiftTron and FQ-BERT hardware accelerators, ViTA presents a design suitable for resource-constrained edge devices with a reasonable frame rate and power consumption. Although these designs do not explicitly target RISC-V processors, they can be integrated into a RISC-V system given its open-source nature.

VI. SUMMARY AND FUTURE RESEARCH CHALLENGES

In this survey, we presented an overview of embedded DNN accelerators for the open-source RISC-V processor core. In addition, we offered an overview of some RISC-V ISA extensions, compatible accelerators, and heterogeneous and hybrid digital/analog in-memory computing designs. We explored different DNN structures and models, like 1D, 2D and 3D-CNN, SSD and Transformers. Additionally, we provided some up-to-date references on recent advances in optical AI edge inference designs and 3D integration. The work listed in this article is summarized in Table X, where the state-of-the-art designs are compared with respect to the selected processor, target application, key features and limitations.

In conclusion, ISA extensions provide optimized, general-purpose instructions to implement different core DNN operations. However, the network's performance is then limited by that of the compiler and its ability to correctly map and execute each instruction. Additionally, certain models, such as transformers, may require the use of dedicated architectures to perform efficiently. Thus, a dedicated accelerator, while application-specific, can improve certain networks' performance. The choice of an accelerator remains constrained by the application's requirements and the device's limitations. The designs listed in this survey favored the FPGA over the CPU and the ASIC. Compared to ASICs, FPGAs offer the needed flexibility to implement dedicated and re-configurable architectures that meet the ever-changing needs and advancements.
TABLE X
STATE-OF-THE-ART KEY FEATURES AND LIMITATIONS SUMMARY
Compared to the CPU, FPGAs offer parallelism and low power consumption with a better performance per Watt. For some models, a heterogeneous implementation was favored, where the design was implemented on an SoC and optimization was performed on both the software (CPU) and hardware (FPGA) sides.

It was evident that the open-source RISC-V was the MCU of choice for many applications. This is because it offers the needed flexibility and customizability to implement dedicated accelerators that meet specific design criteria, such as power, area and performance. Additionally, it allows developers to freely integrate similar designs through open-source licensing.

The need for implementing DNNs on edge devices is increasing tremendously and has become a dedicated research topic. However, it is clear from Table X that, although the proposed designs offer considerable improvements, they share common limitations, mainly in terms of size, resource utilization, compiler support, processing capabilities and data communication. Some future research tracks and open challenges in edge AI to reduce the effect of these limitations are:

1) 3D integration and IMC: Performing IMC in 3D integrated circuits allows computations to take place in close proximity to the memory. This technique drastically reduces memory accesses, data transfer bottlenecks and size, and improves overall performance. 3D integration and IMC have the potential to revolutionize the field of embedded machine learning by enabling the full implementation of transformers and large DNNs in hardware. However, these technologies are still relatively new and face complex challenges: 3D integration is expensive, can lead to an increase in heat dissipation and is not freely scalable. Additionally, IMC in 3D integrated circuits can lead to complex and difficult-to-implement designs that require reliable data management techniques. The development of dedicated 3D-IMC libraries, instruction sets, tools and compiler optimization methods can greatly reduce design and testing time.
2) Optimizing model size and resource usage: Knowledge distillation, model compression, pruning, sharing, partitioning and offloading are some of the popular techniques adopted to reduce the size of DNNs. In knowledge distillation, a smaller student DNN is trained by the larger teacher DNN (a minimal sketch of this idea is given after this list). Several requirements should be considered when choosing an optimization technique, such as accuracy, computational complexity, cost, speed and availability. Although extensively researched, these methods might result in a model that suffers from accuracy loss, requires retraining, poses security concerns, or becomes difficult to validate and deploy. The introduction of device-aware model compression techniques can help in obtaining device-specific models, which facilitates deployment and improves performance.
3) Optimizing memory communication: Data transfer and memory bandwidth play a crucial role in the
performance of accelerators, especially in heterogeneous and hybrid digital-analog systems. Accelerators are mainly limited by frequent communication and bandwidth bottlenecks. Addressing the memory bandwidth limitations in RISC-V systems can significantly optimize the overall performance. This can be achieved by implementing application-specific memory hierarchies, on-chip interconnects, dedicated access controllers, memory compression, data reuse and data-flow optimization.
4) Tools and compilers: Implement automatic code generation to efficiently map the program instructions to hardware. Additionally, develop open-source standards for RISC-V accelerator designs to simplify interoperability and integration across different hardware platforms and software frameworks.
5) Algorithmic: Exploit the unique capabilities of the open-source RISC-V ISA to better optimize DNN algorithms specifically for RISC-V implementations. The open-source ISA also allows investigating multi-model, heterogeneous and hybrid computing. This can be done by designing algorithms and structures for data fusion from different sensors, as well as by concurrently targeting various computing platforms, such as the CPU, FPGA, GPU and Neural Processing Unit (NPU).
6) Optical chips: Optical AI accelerators are extensively investigated for implementing high-accuracy and high-speed inference CNNs. Classical processors are beginning to face limitations in the post-Moore's-law era, where their processing capabilities are not improving at the same pace as the requirements. Optical processors, on the other hand, are not affected by Moore's law and are currently investigated as an alternative to train and deploy DNN structures, offering the advantage of handling much larger and more complex networks.
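As referenced in item 2 above, knowledge distillation trains a compact student network to mimic a larger teacher. A minimal PyTorch-style sketch of the standard distillation loss is given below; the temperature, weighting and tensor shapes are illustrative choices, not values taken from any of the surveyed works.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Blend the usual cross-entropy with a KL term that pulls the student's
    softened predictions towards the teacher's (temperature T)."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# usage with toy tensors: 8 samples, 10 classes
s = torch.randn(8, 10, requires_grad=True)     # student logits
t = torch.randn(8, 10)                          # teacher logits (frozen)
y = torch.randint(0, 10, (8,))                  # ground-truth labels
loss = distillation_loss(s, t, y)
loss.backward()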
Additional research tracks that would offer further improvements include those related to improving the adaptability to dynamic workloads and exploring techniques to optimize online learning, federated learning and training, to name a few.

REFERENCES

[1] Z. Liu, J. Jiang, G. Lei, K. Chen, B. Qin, and X. Zhao, "A heterogeneous processor design for cnn-based ai applications on iot devices," Procedia Computer Science, vol. 174, pp. 2–8, 2020.
[2] A. N. Mazumder, J. Meng, H.-A. Rashid, U. Kallakuri, X. Zhang, J.-S. Seo, and T. Mohsenin, "A survey on the optimization of neural network accelerators for micro-ai on-device inference," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 11, no. 4, pp. 532–547, 2021.
[3] L. Sekanina, "Neural architecture search and hardware accelerator co-search: A survey," IEEE Access, vol. 9, pp. 151337–151362, 2021.
[4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[5] S.-H. Lim, W. W. Suh, J.-Y. Kim, and S.-Y. Cho, "Risc-v virtual platform-based convolutional neural network accelerator implemented in systemc," Electronics, vol. 10, no. 13, p. 1514, 2021.
[6] N. Wu, T. Jiang, L. Zhang, F. Zhou, and F. Ge, "A reconfigurable convolutional neural network-accelerated coprocessor based on risc-v instruction set," Electronics, vol. 9, no. 6, p. 1005, 2020.
[7] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, "Edge computing: Vision and challenges," IEEE Internet of Things Journal, vol. 3, no. 5, pp. 637–646, 2016.
[8] B. Varghese, N. Wang, S. Barbhuiya, P. Kilpatrick, and D. S. Nikolopoulos, "Challenges and opportunities in edge computing," in 2016 IEEE International Conference on Smart Cloud (SmartCloud), pp. 20–26, IEEE, 2016.
[9] P. P. Ray, "A review on tinyml: State-of-the-art and prospects," Journal of King Saud University - Computer and Information Sciences, 2021.
[10] E. Manor and S. Greenberg, "Custom hardware inference accelerator for tensorflow lite for microcontrollers," IEEE Access, vol. 10, pp. 73484–73493, 2022.
[11] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner, "Ai accelerator survey and trends," in 2021 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–9, IEEE, 2021.
[12] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding," arXiv preprint arXiv:1510.00149, 2015.
[13] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1," arXiv preprint arXiv:1602.02830, 2016.
[14] C. Zhu, S. Han, H. Mao, and W. J. Dally, "Trained ternary quantization," arXiv preprint arXiv:1612.01064, 2016.
[15] A. Karine, T. Napoléon, J.-Y. Mulot, and Y. Auffret, "Video seals recognition using transfer learning of convolutional neural network," in 2020 Tenth International Conference on Image Processing Theory, Tools and Applications (IPTA), pp. 1–4, IEEE, 2020.
[16] V. Murahari, C. E. Jimenez, R. Yang, and K. Narasimhan, "Datamux: Data multiplexing for neural networks," arXiv preprint arXiv:2202.09318, 2022.
[17] L. Deng, G. Li, S. Han, L. Shi, and Y. Xie, "Model compression and hardware acceleration for neural networks: A comprehensive survey," Proceedings of the IEEE, vol. 108, no. 4, pp. 485–532, 2020.
[18] S. Pouyanfar, S. Sadiq, Y. Yan, H. Tian, Y. Tao, M. P. Reyes, M.-L. Shyu, S.-C. Chen, and S. S. Iyengar, "A survey on deep learning: Algorithms, techniques, and applications," ACM Computing Surveys (CSUR), vol. 51, no. 5, pp. 1–36, 2018.
[19] S. Mittal, "A survey of fpga-based accelerators for convolutional neural networks," Neural Computing and Applications, vol. 32, no. 4, pp. 1109–1139, 2020.
[20] E. Wang, J. J. Davis, R. Zhao, H.-C. Ng, X. Niu, W. Luk, P. Y. Cheung, and G. A. Constantinides, "Deep neural network approximation for custom hardware: Where we've been, where we're going," ACM Computing Surveys (CSUR), vol. 52, no. 2, pp. 1–39, 2019.
[21] L. Lai, N. Suda, and V. Chandra, "Cmsis-nn: Efficient neural network kernels for arm cortex-m cpus," arXiv preprint arXiv:1801.06601, 2018.
[22] M. Capra, B. Bussolino, A. Marchisio, G. Masera, M. Martina, and M. Shafique, "Hardware and software optimizations for accelerating deep neural networks: Survey of current trends, challenges, and the road ahead," IEEE Access, vol. 8, pp. 225134–225180, 2020.
[23] A. Garofalo, M. Rusci, F. Conti, D. Rossi, and L. Benini, "Pulp-nn: Accelerating quantized neural networks on parallel ultra-low-power risc-v processors," Philosophical Transactions of the Royal Society A, vol. 378, no. 2164, p. 20190155, 2020.
[24] "Rv12 risc-v 32/64-bit cpu core datasheet." https://roalogic.github.io/RV12/DATASHEET.html. Accessed: 2022-04-28.
[25] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner, "Survey and benchmarking of machine learning accelerators," in 2019 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–9, IEEE, 2019.
[26] K. T. Chitty-Venkata and A. K. Somani, "Neural architecture search survey: A hardware perspective," ACM Computing Surveys, vol. 55, no. 4, pp. 1–36, 2022.
[27] D. Ghimire, D. Kil, and S.-h. Kim, "A survey on efficient convolutional neural networks and hardware acceleration," Electronics, vol. 11, no. 6, p. 945, 2022.
[28] S. Kalapothas, M. Galetakis, G. Flamis, F. Plessas, and P. Kitsos, "A survey on risc-v-based machine learning ecosystem," Information, vol. 14, no. 2, p. 64, 2023.
[29] A. Sanchez-Flores, L. Alvarez, and B. Alorda-Ladaria, "A review of cnn accelerators for embedded systems based on risc-v," in 2022 IEEE International Conference on Omni-layer Intelligent Systems (COINS), pp. 1–6, 2022.
[30] J. K. Lee, M. Jamieson, N. Brown, and R. Jesus, "Test-driving risc-v vector hardware for hpc," arXiv preprint arXiv:2304.10319, 2023.
[31] C. Silvano, D. Ielmini, F. Ferrandi, L. Fiorin, S. Curzel, L. Benini, F. Conti, A. Garofalo, C. Zambelli, E. Calore, et al., "A survey on deep learning hardware accelerators for heterogeneous hpc platforms," arXiv preprint arXiv:2306.15552, 2023.
[32] F. Ge, N. Wu, H. Xiao, Y. Zhang, and F. Zhou, "Compact convolutional neural network accelerator for iot endpoint soc," Electronics, vol. 8, no. 5, p. 497, 2019.
[33] I. A. Assir, M. E. Iskandarani, H. R. A. Sandid, and M. A. Saghir, "Arrow: A risc-v vector accelerator for machine learning inference," arXiv preprint arXiv:2107.07169, 2021.
[34] J. Vreca, K. J. X. Sturm, E. Gungl, F. Merchant, P. Bientinesi, R. Leupers, and Z. Brezocnik, "Accelerating deep learning inference in constrained embedded devices using hardware loops and a dot product unit," IEEE Access, vol. 8, pp. 165913–165926, 2020.
[35] G. Zhang, K. Zhao, B. Wu, Y. Sun, L. Sun, and F. Liang, "A risc-v based hardware accelerator designed for yolo object detection system," in 2019 IEEE International Conference of Intelligent Applied Systems on Engineering (ICIASE), pp. 9–11, IEEE, 2019.
[36] G. Ottavi, A. Garofalo, G. Tagliavini, F. Conti, L. Benini, and D. Rossi, "A mixed-precision risc-v processor for extreme-edge dnn inference," in 2020 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 512–517, 2020.
[37] A. Garofalo, G. Ottavi, F. Conti, G. Karunaratne, I. Boybat, L. Benini, and D. Rossi, "A heterogeneous in-memory computing cluster for flexible end-to-end inference of real-world deep neural networks," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2022.
[38] Z. Liu, G. Li, and J. Cheng, "Hardware acceleration of fully quantized bert for efficient natural language processing," in 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 513–516, IEEE, 2021.
[39] S. Harini, A. Ravikumar, and D. Garg, "Vennus: An artificial intelligence accelerator based on risc-v architecture," in Proceedings of International Conference on Computational Intelligence and Data Engineering: ICCIDE 2020, pp. 287–300, Springer, 2021.
[40] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, "Quantization and training of neural networks for efficient integer-arithmetic-only inference," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713, 2018.
[41] Y. Zhang, N. Wu, F. Zhou, and M. R. Yahya, "Design of multifunctional convolutional neural network accelerator for iot endpoint soc," in Proc. World Congress Eng. Comput. Sci., pp. 16–19, 2018.
[42] Z. Li, L. Wang, S. Guo, Y. Deng, Q. Dou, H. Zhou, and W. Lu, "Laius: An 8-bit fixed-point cnn hardware inference engine," in 2017 IEEE International Symposium on Parallel and Distributed Processing with Applications and 2017 IEEE International Conference on Ubiquitous Computing and Communications (ISPA/IUCC), pp. 143–150, 2017.
[43] K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han, Y. Wang, and H. Yang, "Angel-eye: A complete design flow for mapping cnn onto embedded fpga," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 1, pp. 35–47, 2017.
[44] S.-Y. Pan, S.-Y. Lee, Y.-W. Hung, C.-C. Lin, and G.-S. Shieh, "A programmable cnn accelerator with risc-v core in real-time wearable application," in 2022 IEEE International Conference on Recent Advances in Systems Science and Engineering (RASSE), pp. 1–4, 2022.
[45] D. Pestana, P. R. Miranda, J. D. Lopes, R. P. Duarte, M. P. Véstias, H. C. Neto, and J. T. De Sousa, "A full featured configurable accelerator for object detection with yolo," IEEE Access, vol. 9, pp. 75864–75877, 2021.
[46] K. Kim, S.-J. Jang, J. Park, E. Lee, and S.-S. Lee, "Lightweight and energy-efficient deep learning accelerator for real-time object detection on edge devices," Sensors, vol. 23, no. 3, p. 1185, 2023.
[47] D. Wu, Y. Liu, and C. Tao, "A universal accelerated coprocessor for object detection based on risc-v," Electronics, vol. 12, no. 3, p. 475, 2023.
[48] Z. Wang, K. Xu, S. Wu, L. Liu, L. Liu, and D. Wang, "Sparse-yolo: Hardware/software co-design of an fpga accelerator for yolov2," IEEE Access, vol. 8, pp. 116569–116585, 2020.
[49] L. Cai, F. Dong, K. Chen, K. Yu, W. Qu, and J. Jiang, "An fpga based heterogeneous accelerator for single shot multibox detector (ssd)," in 2020 IEEE 15th International Conference on Solid-State & Integrated Circuit Technology (ICSICT), pp. 1–3, IEEE, 2020.
[50] W. Lv, S. Xu, Y. Zhao, G. Wang, J. Wei, C. Cui, Y. Du, Q. Dang, and Y. Liu, "Detrs beat yolos on real-time object detection," arXiv preprint arXiv:2304.08069, 2023.
[51] S. Lv, T. Long, Z. Hou, L. Yan, and Z. Li, "3d cnn hardware circuit for motion recognition based on fpga," in Journal of Physics: Conference Series, vol. 2363, p. 012030, IOP Publishing, 2022.
[52] L. Lamberti, M. Rusci, M. Fariselli, F. Paci, and L. Benini, "Low-power license plate detection and recognition on a risc-v multi-core mcu-based vision system," in 2021 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–5, IEEE, 2021.
[53] Z. Azad, M. S. Louis, L. Delshadtehrani, A. Ducimo, S. Gupta, P. Warden, V. J. Reddi, and A. Joshi, "An end-to-end risc-v solution for ml on the edge using in-pipeline support," in Boston Area Architecture (BARC) Workshop, 2020.
[54] WikiChip, "The x86 advanced matrix extension (amx) brings matrix operations to debut with sapphire rapids," 2023.
[55] E. Azarkhish, D. Rossi, I. Loi, and L. Benini, "Neurostream: Scalable and energy efficient deep learning with smart memory cubes," IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 2, pp. 420–434, 2017.
[56] A. Garofalo, Y. Tortorella, M. Perotti, L. Valente, A. Nadalini, L. Benini, D. Rossi, and F. Conti, "Darkside: A heterogeneous risc-v compute cluster for extreme-edge on-chip dnn inference and training," IEEE Open Journal of the Solid-State Circuits Society, vol. 2, pp. 231–243, 2022.
[57] K. Ueyoshi, I. A. Papistas, P. Houshmand, G. M. Sarda, V. Jain, M. Shi, Q. Zheng, S. Giraldo, P. Vrancx, J. Doevenspeck, et al., "Diana: An end-to-end energy-efficient digital and analog hybrid neural network soc," in 2022 IEEE International Solid-State Circuits Conference (ISSCC), vol. 65, pp. 1–3, IEEE, 2022.
[58] M. E. Elbtity, B. Reidy, M. H. Amin, and R. Zand, "Heterogeneous integration of in-memory analog computing architectures with tensor processing units," arXiv preprint arXiv:2304.09258, 2023.
[59] E. Giacomin, S. Gudaparthi, J. Boemmels, R. Balasubramonian, F. Catthoor, and P.-E. Gaillardon, "A multiply-and-accumulate array for machine learning applications based on a 3d nanofabric flow," IEEE Transactions on Nanotechnology, vol. 20, pp. 873–882, 2021.
[60] Z. Lin, S. Zhang, Q. Jin, J. Xia, Y. Liu, K. Yu, J. Zheng, X. Xu, X. Fan, K. Li, Z. Tong, X. Wu, W. Lu, C. Peng, and Q. Zhao, "A fully digital sram-based four-layer in-memory computing unit achieving multiplication operations and results store," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 31, no. 6, pp. 776–788, 2023.
[61] A. Marchisio, D. Dura, M. Capra, M. Martina, G. Masera, and M. Shafique, "Swifttron: An efficient hardware accelerator for quantized transformers," arXiv preprint arXiv:2304.03986, 2023.
[62] A. F. Laguna, M. M. Sharifi, A. Kazemi, X. Yin, M. Niemier, and X. S. Hu, "Hardware-software co-design of an in-memory transformer network accelerator," Frontiers in Electronics, vol. 3, p. 10, 2022.
[63] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[64] S. Lu, M. Wang, S. Liang, J. Lin, and Z. Wang, "Hardware accelerator for multi-head attention and position-wise feed-forward in the transformer," in 2020 IEEE 33rd International System-on-Chip Conference (SOCC), pp. 84–89, IEEE, 2020.
[65] S. Kim, A. Gholami, Z. Yao, M. W. Mahoney, and K. Keutzer, "I-bert: Integer-only bert quantization," in International Conference on Machine Learning, pp. 5506–5518, PMLR, 2021.
[66] Y. Guan, Z. Yuan, G. Sun, and J. Cong, "Fpga-based accelerator for long short-term memory recurrent neural networks," in 2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 629–634, IEEE, 2017.
[67] S. Nag, G. Datta, S. Kundu, N. Chandrachoodan, and P. A. Beerel, "Vita: A vision transformer inference accelerator for edge applications," arXiv preprint arXiv:2302.09108, 2023.