Optimizing CNN Computation Using RISC-V Custom Instruction Sets For Edge Platforms
Abstract—Benefiting from its custom instruction extension capabilities, the RISC-V architecture can be optimized for many domain-specific applications. In this paper, we propose seven RISC-V SIMD (single instruction multiple data) custom instructions that can significantly optimize the convolution, activation and pooling operations in CNN inference computation. More specifically, the instruction CONV23 can greatly speed up the operation F(2 × 2, 3 × 3): with the adoption of the Winograd algorithm, the number of multiplications is reduced from 36 to 16, and the execution time is reduced from 140 to 21 clock cycles. These custom instructions can be executed in batch mode within the acceleration module, where the intermediate data can be reused, so the latency and energy overhead associated with excess memory accesses can be eliminated. Using the inline assembler in C language, the custom instructions can be called and compiled together with C source code. A revised RISC-V processor, RI5CY-Accel, is constructed on FPGA to accommodate these custom instructions. Revised LeNet-5, VGG16 and ResNet18 models, called LeNet-Accel, VGG16-Accel and ResNet18-Accel, are also optimized based on the RI5CY-Accel architecture. Benchmark experiments demonstrate that the inference of LeNet-Accel, VGG16-Accel and ResNet18-Accel on RI5CY-Accel reduces the execution latency by over 76.6%, 88.8% and 87.1%, with total energy savings of 74.8%, 87.8% and 85.1%, respectively.

Index Terms—RISC-V, CNN, acceleration, Winograd, RISC-V custom instruction sets, edge computing.

Manuscript received 22 October 2022; revised 3 August 2023; accepted 14 January 2024. Date of publication 5 February 2024; date of current version 9 April 2024. Recommended for acceptance by A. Sankaranarayanan. (Shihang Wang and Xingbo Wang contributed equally to this work.) (Corresponding author: Terry Tao Ye.)
Shihang Wang, Xingbo Wang, Zhiyuan Xu, Bingzhen Chen, and Chenxi Feng are with the Department of Electrical and Electronic Engineering, Southern University of Science and Technology, Shenzhen 518055, China (e-mail: [email protected]).
Qi Wang is with the Department of Electrical and Computer Engineering, The University of British Columbia, Vancouver, BC V6T 1Z4, Canada.
Terry Tao Ye is with the Institute of Nanoscience and Applications (INA) and the Department of Electrical and Electronic Engineering, Southern University of Science and Technology, Shenzhen 518055, China (e-mail: [email protected]).
Digital Object Identifier 10.1109/TC.2024.3362060

I. INTRODUCTION

RISC-V architecture has demonstrated its great potential in edge computing platforms and embedded systems. Enabled by its open-source instruction set architecture (ISA), RISC-V can be customized for many domain-specific applications. The standard RISC-V ISA defines only basic instructions that are both aligned and compact, allowing efficient instruction decoding and execution. At the same time, the RISC-V ISA is very flexible for customized instruction extensions and additions. In recent years, many varieties of the RISC-V ISA have been developed to accelerate computation in specific applications, such as edge computing platforms, software-defined radio [1], [2] and cryptographic computation [3], [4], [5].

A convolutional neural network (CNN) is a feed-forward artificial neural network widely used in image and video recognition, classification and detection [6], [7], [8]. In recent years, many AI applications, such as image/audio/video rendering and processing, have been implemented using CNN architectures on edge computing platforms [9], [10], [11]. Meanwhile, CNN implementations on RISC-V processors have attracted a lot of interest from research institutes as well as industry [12], [13], [14].

Convolution computations account for the majority, in many cases over 90%, of CNN workloads [15]. Under the standard RISC-V ISA pipeline, CNN operations need to continuously fetch and load data from cache and memory, followed by computation execution and write-back. Taking the multiply-accumulate (MAC) operation, which is widely used in CNN, as an example, a two-input MAC operation requires 4 data-load instructions, 2 multiplication instructions, 1 addition instruction and 1 data-store instruction. Most of the time is spent on loading and storing data, which is inefficient and time-consuming, and it also incurs large hardware overhead and power consumption. To speed up memory access, [16] proposes stream semantic registers that map accesses to specific registers into memory. Proprietary AI acceleration chips can also significantly accelerate CNN [17], [18].

In this paper, we have implemented a set of seven customized instructions based on the RISC-V ISA extension that can efficiently perform most of the critical operations in CNN computation, including convolution, memory loading, write-back, pooling and activation. These instructions are implemented in an ASIC (application-specific integrated circuit) acceleration module, which functions as a RISC-V co-processor once these instructions are called.

Among these instructions, CONV23 [19] is an SIMD (single instruction multiple data) vector instruction that performs the convolution between a 3 × 3 kernel matrix and a 4 × 4 input matrix and generates a 2 × 2 output matrix. WB23 is also an SIMD instruction that writes back the results of CONV23 to the
from the input matrix. Instruction MP_RI loads the input from memory, where the rs1 register indicates the first address from which to load the input; MAX_POOL performs the max-pooling operation; and MP_WB writes the output back to memory, where the rs1 register indicates the first address at which to store the output. Because the matrix size used by these three instructions is pre-determined, there is no need to indicate the size of the matrix. The instruction formats of MP_RI, MAX_POOL and MP_WB are shown in Fig. 6.

Fig. 6. Instruction format of MP_RI, MAX_POOL and MP_WB.

D. RELU - The Activation Operation

The RELU instruction performs the activation operation using the ReLU function, i.e., it changes all negative values in the input to 0 and preserves the positive values. Since the activation function is performed immediately after the convolution or pooling operation, the RELU activation instruction does not need a dedicated write-back instruction; it shares the write-back instruction WB23 used by the CONV23 and MAX_POOL instructions. After the RELU instruction is executed, the output is written back to the memory by WB23. The instruction format of RELU is shown in Fig. 7.

Fig. 7. Instruction format of RELU.

E. W_WB - Updating the Kernel Matrix

In CONV23 execution, the 3 × 3 convolutional kernel is pre-loaded in the acceleration module. Instruction W_WB is used to update the data of the convolution kernel matrix in the acceleration module. The value in the rs1 register defines the rank of the kernel matrix, and the value in the rs2 register defines the data. The instruction format of W_WB is shown in Fig. 8.

Fig. 8. Instruction format of W_WB.
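To illustrate how these instructions are meant to be issued back-to-back, the following is a minimal sketch in C inline assembler (the .insn invocation syntax is introduced in Section V). The opcode 0x77 is the one documented for CONV23; the funct3 values 4, 5 and 6 chosen here for MP_RI, MAX_POOL and MP_WB, as well as the function and variable names, are hypothetical placeholders and not the encodings defined in Table I:

    /* Hypothetical sketch: pool one pre-determined tile with MP_RI, MAX_POOL and MP_WB. */
    void maxpool_tile(int *src, int *dst)
    {
        int unused;
        /* MP_RI: rs1 = first address of the input tile; rs2 is unused (size is fixed) */
        __asm__ __volatile__(".insn r 0x77, 4, 0, %[rd], %[rs1], %[rs2]"
                             : [rd] "=r"(unused) : [rs1] "r"(src), [rs2] "r"(0));
        /* MAX_POOL: operates on the data already held in the acceleration module */
        __asm__ __volatile__(".insn r 0x77, 5, 0, %[rd], %[rs1], %[rs2]"
                             : [rd] "=r"(unused) : [rs1] "r"(0), [rs2] "r"(0));
        /* MP_WB: rs1 = first address at which to store the pooled output */
        __asm__ __volatile__(".insn r 0x77, 6, 0, %[rd], %[rs1], %[rs2]"
                             : [rd] "=r"(unused) : [rs1] "r"(dst), [rs2] "r"(0));
    }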
A. Convolution Algorithm Comparison

There are several convolution acceleration algorithms in use today, namely im2col, Winograd, and FFT (Fast Fourier Transform). Each of these algorithms has pros and cons for specific computation scenarios.

Im2col, for instance, transforms convolutions into matrix multiplications through memory rearrangement. The algorithm needs to utilize high-performance matrix multiplication libraries, and the re-mapping of the input matrices and convolution kernel matrices can substantially increase the memory overhead, rendering it less suitable for edge platforms.

FFT [21], on the other hand, involves complex-number computations when used for convolution. It is more advantageous for convolutions with larger kernels; however, since most of the kernels used in neural network convolution are 3 × 3 in size, FFT does not offer an advantage in these applications.

The Winograd algorithm was introduced to CNN convolution by Lavin [20] as an optimization technique to reduce the number of multiplications in matrix multiplication. The Winograd algorithm can be exploited to optimize the convolution calculation between a 4 × 4 input matrix and a 3 × 3 convolution kernel matrix, generating a 2 × 2 output matrix. Compared with FFT and im2col, for convolutions with 3 × 3 kernels, the required hardware overhead and computational complexity are smaller. Since the majority of the kernels used in CNN are 3 × 3 in size, the proposed custom ISA can effectively mitigate the convolution computation load.

B. Winograd Algorithm Introduction

The Winograd algorithm can be adopted to accelerate the convolution operations in CNN. Theoretical analysis shows that Winograd arithmetic can reduce the arithmetic complexity by a factor of 2.25 for convolutions with the 3 × 3 kernels that are widely used in CNN models.

The algorithm can be better explained using a one-dimensional convolution as the starting example. We denote the one-dimensional convolution operation as F(m, r), where m is the output size and r is the filter size. For an F(2, 3) convolution, assume the input data is d = (d_0, d_1, d_2, d_3) and the convolution kernel is g = (g_0, g_1, g_2). The input vector d can be transformed into a 2 × 3 matrix and the F(2, 3) convolution can be rewritten as:

F(2,3) = \begin{bmatrix} d_0 & d_1 & d_2 \\ d_1 & d_2 & d_3 \end{bmatrix} \begin{bmatrix} g_0 \\ g_1 \\ g_2 \end{bmatrix} = \begin{bmatrix} r_0 \\ r_1 \end{bmatrix} \qquad (1)
TABLE I
THE FORMAT OF CUSTOM INSTRUCTIONS

Computing r_0 and r_1 in this way requires 6 multiplications and 4 additions. However, we can instead compute the intermediate terms m_0, m_1, m_2 and m_3 as follows:

m_0 = (d_0 - d_2)\,g_0, \qquad m_1 = (d_1 + d_2)\,\frac{g_0 + g_1 + g_2}{2} \qquad (2)

m_3 = (d_1 - d_3)\,g_2, \qquad m_2 = (d_2 - d_1)\,\frac{g_0 - g_1 + g_2}{2} \qquad (3)

so that the F(2, 3) convolution can be transformed into the following form:

F(2,3) = \begin{bmatrix} d_0 & d_1 & d_2 \\ d_1 & d_2 & d_3 \end{bmatrix} \begin{bmatrix} g_0 \\ g_1 \\ g_2 \end{bmatrix} = \begin{bmatrix} m_0 + m_1 + m_2 \\ m_1 - m_2 - m_3 \end{bmatrix} \qquad (4)

Because the kernel elements are predetermined when performing the convolution, the terms related to g can be pre-calculated. The transformed form therefore requires 4 multiplications and 8 additions, as compared to 6 multiplications and 4 additions in the direct matrix multiplication. The Winograd algorithm reduces the number of multiplications at the cost of increasing the number of additions. In computer arithmetic implementations, the resources and latency required for a multiplication are an order of magnitude higher than those of an addition, so the overall computation complexity is reduced.
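As a quick sanity check of Eqs. (2)-(4), the following minimal C sketch (ours, not the paper's hardware implementation; the function and variable names are ours) computes F(2, 3) with the four Winograd multiplications and compares it against the direct convolution:

    #include <stdio.h>

    /* F(2,3): 1D Winograd convolution of a 4-element input with a 3-tap kernel,
     * using 4 multiplications (Eqs. (2)-(4)) instead of 6. */
    static void winograd_f23(const float d[4], const float g[3], float r[2])
    {
        /* kernel-side terms; these can be pre-computed once per kernel */
        float gp = (g[0] + g[1] + g[2]) / 2.0f;
        float gm = (g[0] - g[1] + g[2]) / 2.0f;

        float m0 = (d[0] - d[2]) * g[0];
        float m1 = (d[1] + d[2]) * gp;
        float m2 = (d[2] - d[1]) * gm;
        float m3 = (d[1] - d[3]) * g[2];

        r[0] = m0 + m1 + m2;   /* = d0*g0 + d1*g1 + d2*g2 */
        r[1] = m1 - m2 - m3;   /* = d1*g0 + d2*g1 + d3*g2 */
    }

    int main(void)
    {
        float d[4] = {1, 2, 3, 4}, g[3] = {0.5f, -1, 2}, r[2];
        winograd_f23(d, g, r);
        /* direct convolution for comparison */
        float r0 = d[0]*g[0] + d[1]*g[1] + d[2]*g[2];
        float r1 = d[1]*g[0] + d[2]*g[1] + d[3]*g[2];
        printf("winograd: %.2f %.2f   direct: %.2f %.2f\n", r[0], r[1], r0, r1);
        return 0;
    }

Both paths print the same values; the kernel-side terms gp and gm would be pre-computed once per kernel in a real implementation, so only the four m_i products count as run-time multiplications.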
In most applications, the CNN input image is a two-dimensional matrix; it is therefore necessary to extend the Winograd algorithm from 1D to 2D inputs, which is defined as F(m × m, r × r), where m is the size of the output matrix and r is the size of the kernel matrix. Since a kernel size of 3 × 3 is commonly used in CNN applications, we hereby use the operation F(2 × 2, 3 × 3) as an example.

Given that the output size is 2 × 2 and the convolution kernel size is 3 × 3, the corresponding input matrix size is 4 × 4, so the input matrix and convolution kernel can be expressed as:

D = \begin{bmatrix} d_{00} & d_{01} & d_{02} & d_{03} \\ d_{10} & d_{11} & d_{12} & d_{13} \\ d_{20} & d_{21} & d_{22} & d_{23} \\ d_{30} & d_{31} & d_{32} & d_{33} \end{bmatrix} \qquad (5)

K = \begin{bmatrix} k_{00} & k_{01} & k_{02} \\ k_{10} & k_{11} & k_{12} \\ k_{20} & k_{21} & k_{22} \end{bmatrix} \qquad (6)

where D is the input matrix and K is the convolution kernel.

As in the 1D convolution of Eq. (4), the matrices D and K can be folded and their convolution can be transformed into a matrix multiplication of higher dimension, as expressed in Eq. (7):

\begin{bmatrix}
d_{00} & d_{01} & d_{02} & d_{10} & d_{11} & d_{12} & d_{20} & d_{21} & d_{22} \\
d_{01} & d_{02} & d_{03} & d_{11} & d_{12} & d_{13} & d_{21} & d_{22} & d_{23} \\
d_{10} & d_{11} & d_{12} & d_{20} & d_{21} & d_{22} & d_{30} & d_{31} & d_{32} \\
d_{11} & d_{12} & d_{13} & d_{21} & d_{22} & d_{23} & d_{31} & d_{32} & d_{33}
\end{bmatrix}
\begin{bmatrix} k_{00} \\ k_{01} \\ k_{02} \\ k_{10} \\ k_{11} \\ k_{12} \\ k_{20} \\ k_{21} \\ k_{22} \end{bmatrix}
=
\begin{bmatrix} r_{00} \\ r_{01} \\ r_{10} \\ r_{11} \end{bmatrix} \qquad (7)

For clarity, the folded D and K matrices can be further segmented into sub-matrices and sub-vectors, as expressed in Eq. (8):

\begin{bmatrix} D_{00} & D_{10} & D_{20} \\ D_{10} & D_{20} & D_{30} \end{bmatrix} \begin{bmatrix} K_0 \\ K_1 \\ K_2 \end{bmatrix} = \begin{bmatrix} R_0 \\ R_1 \end{bmatrix} \qquad (8)

Eq. (8) is in fact Eq. (4) in a higher dimension (from 1D to 2D). The Winograd complexity-reduction technique introduced for F(2, 3) can also be applied in this case, and Eq. (8) can be transformed into its Winograd form:

\begin{bmatrix} D_{00} & D_{10} & D_{20} \\ D_{10} & D_{20} & D_{30} \end{bmatrix} \begin{bmatrix} K_0 \\ K_1 \\ K_2 \end{bmatrix} = \begin{bmatrix} R_0 \\ R_1 \end{bmatrix} = \begin{bmatrix} M_0 + M_1 + M_2 \\ M_1 - M_2 - M_3 \end{bmatrix} \qquad (9)
where, following the same construction as in Eqs. (2) and (3), the sub-matrix products are

M_0 = (D_{00} - D_{20})\,K_0, \qquad M_1 = (D_{10} + D_{20})\,\frac{K_0 + K_1 + K_2}{2}

M_2 = (D_{20} - D_{10})\,\frac{K_0 - K_1 + K_2}{2}, \qquad M_3 = (D_{10} - D_{30})\,K_2

Each M_i is itself an F(2, 3)-shaped product and can in turn be computed with the 1D Winograd decomposition, giving 4 × 4 = 16 multiplications for F(2 × 2, 3 × 3) instead of the 36 required by the direct method.

Fig. 9. Sliding window fashion.

Fig. 10. The structure diagram of the RISC-V processor.
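To make the nested construction of Eqs. (8) and (9) concrete, the following self-contained C sketch (ours, not the paper's RTL; all names are ours) computes F(2 × 2, 3 × 3) as four row-wise F(2, 3) Winograd products, 16 multiplications in total:

    /* 1D Winograd F(2,3): exact product of the sliding 2x3 matrix of d with kernel g. */
    static void f23_1d(const float d[4], const float g[3], float r[2])
    {
        float m0 = (d[0] - d[2]) * g[0];
        float m1 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2.0f;
        float m2 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2.0f;
        float m3 = (d[1] - d[3]) * g[2];
        r[0] = m0 + m1 + m2;
        r[1] = m1 - m2 - m3;
    }

    /* F(2x2,3x3): 4x4 input tile D, 3x3 kernel K, 2x2 result R, per Eq. (9). */
    static void f2x2_3x3(const float D[4][4], const float K[3][3], float R[2][2])
    {
        float a[4], g[3], M0[2], M1[2], M2[2], M3[2];
        int j;

        /* M0 = (D00 - D20) K0 : rows 0 and 2 of D differenced, kernel row 0 */
        for (j = 0; j < 4; j++) a[j] = D[0][j] - D[2][j];
        f23_1d(a, K[0], M0);

        /* M1 = (D10 + D20)(K0 + K1 + K2)/2 */
        for (j = 0; j < 4; j++) a[j] = D[1][j] + D[2][j];
        for (j = 0; j < 3; j++) g[j] = (K[0][j] + K[1][j] + K[2][j]) / 2.0f;
        f23_1d(a, g, M1);

        /* M2 = (D20 - D10)(K0 - K1 + K2)/2 */
        for (j = 0; j < 4; j++) a[j] = D[2][j] - D[1][j];
        for (j = 0; j < 3; j++) g[j] = (K[0][j] - K[1][j] + K[2][j]) / 2.0f;
        f23_1d(a, g, M2);

        /* M3 = (D10 - D30) K2 */
        for (j = 0; j < 4; j++) a[j] = D[1][j] - D[3][j];
        f23_1d(a, K[2], M3);

        /* R0 = M0 + M1 + M2,  R1 = M1 - M2 - M3  (Eq. (9)) */
        for (j = 0; j < 2; j++) {
            R[0][j] = M0[j] + M1[j] + M2[j];
            R[1][j] = M1[j] - M2[j] - M3[j];
        }
    }

The result can be checked against a direct 3 × 3 convolution in the same way as the earlier F(2, 3) example; as noted above for the pre-calculated g terms, the kernel-side combinations can be computed once when the kernel is loaded.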
operations. As shown in Fig. 13, at the end of convolution, the output matrix (a 2 × 2 matrix in the case of F(2 × 2, 3 × 3)) needs to be written back to the memory, and the subsequent pooling operation would then need to load the matrix to be pooled from the memory back into the acceleration module again.

By combining convolution and pooling, the operations WB23 and MP_RI can be omitted and the pooling operation can be performed directly inside the acceleration module. This approach not only reduces the latencies incurred, it also reduces the overhead of memory accesses (both read and write).

E. The Activation Unit

The activation unit performs the RELU instruction, where the negative input matrix values are set to 0 while positive values remain untouched. This operation can be implemented simply by checking the sign bit of each of the input values.

The RELU instruction also supports an activation operation with a bias value stored in register rs1. The RELU instruction adds the bias to the input values and then performs the activation calculation.

In CNN, the activation layer always follows the convolutional layer or the pooling layer, so the RELU instruction does not need to load or write back intermediate data to/from the memory. It can be executed immediately after the convolution or pooling operations. Then WB23 (in the case of convolution) or MP_WB (in the case of max pooling) is executed and the results are written back to the memory.

From the above discussion, we can see that, based on the network architecture, the seven instructions can be executed in a batched and/or separated fashion to further reduce the latencies of CNN computation, along with the avoidance of excess and unnecessary memory accesses.
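As an illustration of the batched mode, the following hedged sketch chains convolution, activation and pooling for one tile without the intermediate WB23/MP_RI round trips, using the .insn inline-assembler syntax described in Section V. Only CONV23's opcode 0x77 and funct3 = 1 follow the documented encoding; the other funct3 values, the ordering of RELU relative to pooling, and the helper name are placeholders of ours:

    /* Hypothetical batched sequence: CONV23 -> RELU -> MAX_POOL -> MP_WB.
     * WB23 and MP_RI are omitted; the data never leaves the acceleration module.
     * funct3 values 2, 5 and 6 are placeholders, not the encodings of Table I. */
    static inline void conv_relu_pool_tile(int *in, int *out, int size, int bias)
    {
        int r;
        __asm__ __volatile__(".insn r 0x77, 1, 0, %0, %1, %2"   /* CONV23: rs1 = address, rs2 = size */
                             : "=r"(r) : "r"(in), "r"(size));
        __asm__ __volatile__(".insn r 0x77, 2, 0, %0, %1, %2"   /* RELU: rs1 = bias */
                             : "=r"(r) : "r"(bias), "r"(0));
        __asm__ __volatile__(".insn r 0x77, 5, 0, %0, %1, %2"   /* MAX_POOL on the in-module result */
                             : "=r"(r) : "r"(0), "r"(0));
        __asm__ __volatile__(".insn r 0x77, 6, 0, %0, %1, %2"   /* MP_WB: rs1 = output address */
                             : "=r"(r) : "r"(out), "r"(0));
    }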
V. RISC-V TOOL-CHAIN INTEGRATION OF CUSTOM INSTRUCTIONS

Our proposed custom instructions need to be integrated with the RISC-V SDK tool chains in order to be called from the upper stacks. We have defined assembly code for each of the instructions and use inline assembly in C language to compile the source code into executable binaries.

A. Assembly Codes for Custom Instructions

The assembly code syntax is based on the RISC-V ISA custom instruction format. The proposed custom instructions are all R-type. Using the CONV23 instruction as an example, the opcode of CONV23 is 0x77, funct3 is 1, and funct7 is 0, so the assembly code of the CONV23 instruction is as follows:

    .insn r 0x77, 1, 0, rd, rs1, rs2

The assembly codes for the other proposed custom instructions follow the same format.

C language provides a wrapper for the inline assembler: the assembly code and the associated registers need to be specified inside the __asm__ __volatile__ block. Taking the CONV23 instruction as an example, CONV23 is called using its opcode, funct3 and funct7 inside the __asm__ __volatile__ block, and the variable names bound to the rs1, rs2 and rd registers are also specified in the block, as shown in the sample source code below. Here the rs1 register holds the header address, the rs2 register holds the data size, and the rd register receives the result. The RISC-V gcc compiler will compile it and generate the corresponding assembly code. For the other custom instructions WB23, MAX_POOL, RELU, etc., the corresponding inline-assembler C wrappers can be written in a similar way.

    __asm__ __volatile__(
        ".insn r 0x77, 1, 0, %[conv_o], %[conv_i1], %[conv_i2]"
        : [conv_o] "=r"(result)
        : [conv_i1] "r"(address), [conv_i2] "r"(size)
    );

With the above assembler code wrapper, the proposed custom instructions can be called directly through the inline assembler in C language. The standard RISC-V gcc compiler can then be used to compile the C source code with inline assembler into executable binaries.
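For instance, a wrapper for MP_WB could look like the following sketch; the funct3 value 6 and the function name are placeholders of ours (the actual encodings are those listed in Table I), and only the operand role of rs1 follows the description in Section III:

    /* Hypothetical wrapper for MP_WB: rs1 = first address at which to store the output. */
    static inline int mp_wb(int *dst)
    {
        int r;
        __asm__ __volatile__(
            ".insn r 0x77, 6, 0, %[o], %[i1], %[i2]"
            : [o] "=r"(r)
            : [i1] "r"(dst), [i2] "r"(0));
        return r;
    }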
B. Data Alignment in Memory Access

Our proposed instructions assume that the data stored in the memory have consecutive addresses, i.e., once the header address (defined in the rs1 register) and the size of the matrix (defined in the rs2 register) are given, the data can be loaded from, or written back to, the memory consecutively.

The elements of the input matrix stored in the memory therefore need to be rearranged before the execution of the custom instructions, as shown in Fig. 15. After the completion of these instructions, the locations of the output data need to be rearranged back so they can be used by the operations that follow.
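As a simple illustration of this rearrangement (a sketch of ours that assumes a row-major image and the contiguous tile layout described above; the actual layout is the one shown in Fig. 15), a 4 × 4 tile can be gathered into a buffer with consecutive addresses before the instructions are issued, and the 2 × 2 result scattered back afterwards:

    /* Gather a 4x4 tile starting at (row, col) of a row-major image of width `stride`
     * into a contiguous 16-element buffer whose base address is then passed in rs1. */
    static void gather_tile(const int *img, int stride, int row, int col, int buf[16])
    {
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                buf[i * 4 + j] = img[(row + i) * stride + (col + j)];
    }

    /* Scatter the contiguous 2x2 result written back to memory into a row-major output map. */
    static void scatter_result(int *out, int stride, int row, int col, const int res[4])
    {
        out[row * stride + col]           = res[0];
        out[row * stride + col + 1]       = res[1];
        out[(row + 1) * stride + col]     = res[2];
        out[(row + 1) * stride + col + 1] = res[3];
    }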
In order to reduce the overhead of rearranging data locations in memory, the operations of convolution and pooling are
TABLE II
THE ARCHITECTURE OF LENET-ACCEL

TABLE III
HARDWARE RESOURCE REPORT OF THE ORIGINAL RI5CY

TABLE IV
HARDWARE RESOURCE REPORT OF THE RI5CY-ACCEL

Fig. 19. Clock cycles for each layer in ResNet18-Accel (typical) and ResNet18-Accel (custom).
TABLE VI
COMPARE WITH OTHER PROCESSORS
Both RI5CY and RI5CY-Accel consume a static power of 0.115 W. For dynamic power during execution, RI5CY-Accel consumes 0.166 W, as compared to 0.144 W consumed by the generic RI5CY, an increase of 15.3%.

However, since the running times of LeNet-Accel, VGG16-Accel and ResNet18-Accel on RI5CY-Accel are greatly reduced when performing the inference of MNIST, Fashion-MNIST or CIFAR-10 classification, the total energy consumption is reduced.

For LeNet-Accel (typical) running on the original RI5CY processor, the inference computation takes 428 ms to complete at 55 MHz. Running LeNet-Accel (custom) on RI5CY-Accel takes only 99 ms at the same frequency.

For VGG16-Accel (typical) running on the original RI5CY processor, the inference computation takes 35.8 s to complete at 55 MHz. Running VGG16-Accel (custom) on RI5CY-Accel takes only 4.02 s at the same frequency.

For ResNet18-Accel (typical) running on the original RI5CY processor, the inference computation takes 62.9 s to complete at 55 MHz. Running ResNet18-Accel (custom) on RI5CY-Accel takes only 8.1 s at the same frequency.

Therefore, the total energy consumption on RI5CY-Accel is 99 ms × 0.281 W = 28 mJ for LeNet-Accel (custom), 4.02 s × 0.281 W = 1.13 J for VGG16-Accel (custom) and 8.1 s × 0.281 W = 2.28 J for ResNet18-Accel (custom). The same inference tasks on the original RI5CY processor consume 428 ms × 0.259 W = 111 mJ for LeNet-Accel (typical), 35.8 s × 0.259 W = 9.27 J for VGG16-Accel (typical) and 62.9 s × 0.259 W = 16.29 J for ResNet18-Accel (typical). Total energy savings of 74.8%, 87.8% and 85.1% are achieved. The energy consumption comparison is illustrated in Fig. 20.

D. Synthesized ASIC Implementation

Both RI5CY and RI5CY-Accel are synthesized using the Synopsys synthesis tool Design Compiler (DC). The process is TSMC 90 nm, and the library and database are both typical standard process libraries. We synthesized the two processors and generated area and power reports.

The area report generated after DC synthesis is shown in Table V. Compared with RI5CY, RI5CY-Accel shows different degrees of increase in the numbers of ports, nets and cells. The number of ports increased by 130.08%, the number of nets by 158.04%, and the number of cells by 121.85%. The total unit area increased by 184.37%, and the total area increased by 108.30%. Nets increased the most, because the addition of the acceleration module has resulted in more connections.

The power consumption reports generated after DC synthesis of the RI5CY and RI5CY-Accel processors are also shown in Table V. The dynamic power consumption accounts for the main part, and the combinational logic power consumption accounts for nearly 80%. Compared with the original processor, the internal power has increased by 16.08% and the switching power has increased by 11.40%. The static (leakage) power consumption has increased by 261.18%, which is about 3.6 times the original. However, since the static power consumption accounts for only a small proportion of the total power consumption, the increase in the total power consumption is only 23.23%.

E. Compare With Other Processors

In this section, we evaluate the performance/power efficiency of RI5CY-Accel and compare it with other processors. Three microprocessors are selected for the comparison, i.e., Intel i7-7700, ARM Cortex-M0 and NVIDIA GTX 1080 Ti.

The Intel i7-7700 processor runs at 3600 MHz with a power consumption of 65 W. 32-bit floating-point convolution computation is performed at 87.3 GOP/s, achieving a performance/power efficiency of 1.3 GOP/s/W [23].

The NVIDIA GTX 1080 Ti has a greater bandwidth, running at 1923 MHz with a power consumption of 243 W. 32-bit floating-point convolution computation is performed at 6974.2 GOP/s, achieving a performance/power efficiency of 28.7 GOP/s/W [23].

The ARM Cortex-M0, as a low-power RISC processor, runs at 96 MHz with a power consumption of 8.16 mW. 16-bit fixed-point convolution computation is performed at 0.0864 GOP/s, achieving a performance/power efficiency of 10.58 GOP/s/W.

The baseline RI5CY processor without an acceleration module runs at 100 MHz with a power consumption of 4.71 mW. 32-bit fixed-point convolution computation is performed at 0.0514 GOP/s, achieving a performance/power efficiency of 10.92 GOP/s/W.

Our proposed RI5CY-Accel processor runs at 100 MHz with a power consumption of 5.80 mW; 32-bit fixed-point convolution computation can be accelerated to 0.225 GOP/s, which is a 4.37× speedup as compared to the original RI5CY, and a 2.60× speedup as compared to the ARM Cortex-M0. It achieves a performance/power efficiency of 38.79 GOP/s/W, or 3.66× and 3.55× improvements as compared to the ARM Cortex-M0 and
the original RI5CY, respectively. The comparison is listed in Table VI.

VII. CONCLUSION

In this work, we proposed seven custom instructions based on the RISC-V ISA extension format. These instructions can significantly optimize the operations of convolution, activation and pooling in CNN inference computation. Powered by the Winograd algorithm, the execution time of convolution can be reduced from 140 to 21 clock cycles. These instructions can be executed in batch mode, where the intermediate data can be shared, consequently eliminating the excess memory load/store overhead and the associated energy consumption. A revised RISC-V processor adopting these instructions, called RI5CY-Accel, is implemented on FPGA. Through experiments, RI5CY-Accel demonstrated great advantages in execution speed-up and total energy consumption in inference computation. The source code is available on GitHub: https://github.com/QmppmQ/riscv.

REFERENCES
[1] V. Jain, A. Sharma, and E. A. Bezerra, "Implementation and extension of bit manipulation instruction on RISC-V architecture using FPGA," in Proc. IEEE 9th Int. Conf. Commun. Syst. Netw. Technol. (CSNT), Piscataway, NJ, USA: IEEE, 2020, pp. 167–172.
[2] C. A. R. Melo and E. Barros, "Oolong: A baseband processor extension to the RISC-V ISA," in Proc. IEEE 27th Int. Conf. Appl.-Specific Syst., Archit. Process. (ASAP), Piscataway, NJ, USA: IEEE, 2016, pp. 241–242.
[3] G. Xin et al., "VPQC: A domain-specific vector processor for post-quantum cryptography based on RISC-V architecture," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 8, pp. 2672–2684, Aug. 2020.
[4] D. B. Roy et al., "Customized instructions for protection against memory integrity attacks," IEEE Embedded Syst. Lett., vol. 10, no. 3, pp. 91–94, Sep. 2018.
[5] E. Alkim, H. Evkan, N. Lahr, R. Niederhagen, and R. Petri, "ISA extensions for finite field arithmetic: Accelerating Kyber and NewHope on RISC-V," IACR Trans. Cryptogr. Hardware Embedded Syst., vol. 2020, no. 3, pp. 219–242, 2020.
[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Commun. ACM, vol. 60, no. 6, pp. 84–90, 2017.
[7] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556.
[8] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1–9.
[9] F. Ge, N. Wu, H. Xiao, Y. Zhang, and F. Zhou, "Compact convolutional neural network accelerator for IoT endpoint SoC," Electronics, vol. 8, no. 5, p. 497, 2019.
[10] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2015, pp. 161–170.
[11] X. Liu, Y. Chen, C. Hao, A. Dhar, and D. Chen, "WinoCNN: Kernel sharing Winograd systolic array for efficient convolutional neural network acceleration on FPGAs," in Proc. IEEE 32nd Int. Conf. Appl.-Specific Syst., Archit. Process. (ASAP), Piscataway, NJ, USA: IEEE, 2021, pp. 258–265.
[12] Z. Li, W. Hu, and S. Chen, "Design and implementation of CNN custom processor based on RISC-V architecture," in Proc. IEEE 21st Int. Conf. High Perform. Comput. Commun.; IEEE 17th Int. Conf. Smart City; IEEE 5th Int. Conf. Data Sci. Syst. (HPCC/SmartCity/DSS), Piscataway, NJ, USA: IEEE, 2019, pp. 1945–1950.
[13] N. Wu, T. Jiang, L. Zhang, F. Zhou, and F. Ge, "A reconfigurable convolutional neural network-accelerated coprocessor based on RISC-V instruction set," Electronics, vol. 9, no. 6, p. 1005, 2020.
[14] D.-Z. Li, H.-R. Gong, and Y.-C. Chang, "Implementing RISCV system-on-chip for acceleration of convolution operation and activation function based on FPGA," in Proc. 14th IEEE Int. Conf. Solid-State Integr. Circuit Technol. (ICSICT), Piscataway, NJ, USA: IEEE, 2018, pp. 1–3.
[15] J. Cong and B. Xiao, "Minimizing computation in convolutional neural networks," in Proc. Int. Conf. Artif. Neural Netw., Cham, Switzerland: Springer, 2014, pp. 281–290.
[16] F. Schuiki, F. Zaruba, T. Hoefler, and L. Benini, "Stream semantic registers: A lightweight RISC-V ISA extension achieving full compute utilization in single-issue cores," IEEE Trans. Comput., vol. 70, no. 2, pp. 212–227, Feb. 2021.
[17] T. Chen et al., "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," ACM SIGARCH Comput. Archit. News, vol. 42, no. 1, pp. 269–284, 2014.
[18] Y. Chen et al., "DaDianNao: A machine-learning supercomputer," in Proc. 47th Annu. IEEE/ACM Int. Symp. Microarchit., Piscataway, NJ, USA: IEEE, 2014, pp. 609–622.
[19] S. Wang, J. Zhu, Q. Wang, C. He, and T. T. Ye, "Customized instruction on RISC-V for Winograd-based convolution acceleration," in Proc. IEEE 32nd Int. Conf. Appl.-Specific Syst., Archit. Process. (ASAP), Piscataway, NJ, USA: IEEE, 2021, pp. 65–68.
[20] A. Lavin and S. Gray, "Fast algorithms for convolutional neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4013–4021.
[21] M. Mathieu, M. Henaff, and Y. LeCun, "Fast training of convolutional networks through FFTs," 2013, arXiv:1312.5851.
[22] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[23] W. Xu, Z. Zhang, X. You, and C. Zhang, "Reconfigurable and low-complexity accelerator for convolutional and generative networks over finite fields," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 39, no. 12, pp. 4894–4907, Dec. 2020.

Shihang Wang received the B.E. degree in microelectronics science and engineering from the Southern University of Science and Technology, Shenzhen, China, in 2020. He is currently working toward the M.S. degree with the Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China. His research interests include hardware acceleration for neural networks, architecture optimization of RISC-V processors, and GPGPU.

Xingbo Wang received the B.E. degree in microelectronics science and engineering from the Southern University of Science and Technology, Shenzhen, China, in 2022. He is currently working toward the Ph.D. degree in microelectronics science and engineering with the Southern University of Science and Technology, Shenzhen, China. His research interests include RISC-V processors, hardware acceleration for convolutional neural networks, and parallel computing.

Zhiyuan Xu received the B.E. degree in electronic information engineering from Hangzhou Dianzi University, Hangzhou, China, in 2021. He is currently working toward the M.S. degree with the Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China. His research interest includes deep-learning architectures for energy-efficient systems.
Bingzhen Chen received the B.E. degree in microelectronics science and engineering from the Southern University of Science and Technology, Shenzhen, China, in 2021. He is currently working toward the M.S. degree with the Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China. His research interests include hardware acceleration for neural networks and RISC-V processors.

Chenxi Feng received the B.E. degree in microelectronics science and engineering from the Southern University of Science and Technology, Shenzhen, China, in 2022. She is currently working toward the M.S. degree with the Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China. Her research interests include hardware acceleration for neural networks and RISC-V processors.

Terry Tao Ye (Senior Member, IEEE) received the Bachelor of Science degree in electronic engineering from Tsinghua University, Beijing, in 1993, and the Ph.D. degree in electrical engineering from Stanford University, in 2003. He is a Professor with the Institute of Nanoscience and Applications (INA) and the Department of Electrical and Electronics Engineering (EEE), Southern University of Science and Technology (SUSTech), and, by courtesy, an Adjunct Professor with the Department of Electrical and Computer Engineering (ECE), Carnegie Mellon University. He is active in both academic research and industrial applications in many engineering areas, including IC design, Internet-of-Things (IoT) wearable sensor devices, and neuromorphic computing ICs. Besides his academic activities, he is also keen on industry-academic collaboration. He has held various engineering and consulting roles at the Chinese Academy of Sciences, Impinj Inc., Synopsys Inc., Magma Design Automation Inc., Silicon Architects Inc., and many other Silicon Valley companies.