
IEEE TRANSACTIONS ON COMPUTERS, VOL. 73, NO. 5, MAY 2024

Optimizing CNN Computation Using RISC-V Custom Instruction Sets for Edge Platforms

Shihang Wang, Xingbo Wang, Zhiyuan Xu, Bingzhen Chen, Chenxi Feng, Qi Wang, and Terry Tao Ye, Senior Member, IEEE

Manuscript received 22 October 2022; revised 3 August 2023; accepted 14 January 2024. Date of publication 5 February 2024; date of current version 9 April 2024. Recommended for acceptance by A. Sankaranarayanan. (Shihang Wang and Xingbo Wang contributed equally to this work.) (Corresponding author: Terry Tao Ye.)
Shihang Wang, Xingbo Wang, Zhiyuan Xu, Bingzhen Chen, and Chenxi Feng are with the Department of Electrical and Electronic Engineering, Southern University of Science and Technology, Shenzhen 518055, China (e-mail: [email protected]).
Qi Wang is with the Department of Electrical and Computer Engineering, The University of British Columbia, Vancouver, BC V6T 1Z4, Canada.
Terry Tao Ye is with the Institute of Nanoscience and Applications (INA) and the Department of Electrical and Electronic Engineering, Southern University of Science and Technology, Shenzhen 518055, China (e-mail: [email protected]).
Digital Object Identifier 10.1109/TC.2024.3362060

Abstract—Benefiting from its custom instruction extension capabilities, the RISC-V architecture can be optimized for many domain-specific applications. In this paper, we propose seven RISC-V SIMD (single instruction multiple data) custom instructions that significantly optimize the convolution, activation and pooling operations in CNN inference computation. More specifically, the instruction CONV23 greatly speeds up the operation F(2 × 2, 3 × 3). With the adoption of the Winograd algorithm, the number of multiplications is reduced from 36 to 16, and the execution time is reduced from 140 to 21 clock cycles. These custom instructions can be executed in batch mode within the acceleration module, where the intermediate data can be reused, so the latency and energy overhead associated with excess memory accesses are eliminated. Using the inline assembler in C, the custom instructions can be called and compiled together with the C source code. A revised RISC-V processor, RI5CY-Accel, is constructed on an FPGA to accommodate these custom instructions. Revised LeNet-5, VGG16 and ResNet18 models, called LeNet-Accel, VGG16-Accel and ResNet18-Accel, are also optimized for the RI5CY-Accel architecture. Benchmark experiments demonstrate that inference with LeNet-Accel, VGG16-Accel and ResNet18-Accel on RI5CY-Accel reduces the execution latency by over 76.6%, 88.8% and 87.1%, with total energy consumption savings of 74.8%, 87.8% and 85.1%, respectively.

Index Terms—RISC-V, CNN, acceleration, Winograd, RISC-V custom instruction sets, edge computing.

I. INTRODUCTION

RISC-V architecture has demonstrated its great potential in edge computing platforms and embedded systems. Enabled by its open-source instruction set architecture (ISA), RISC-V can be customized for many domain-specific applications. The RISC-V standard ISA defines only basic instructions that are both aligned and compact, allowing efficient instruction decoding and execution. In the meantime, the RISC-V ISA is also very flexible for customized instruction extensions and additions. In recent years, many varieties of RISC-V ISA have been developed to accelerate computation in specific applications, such as edge computing platforms, software-defined radio [1], [2] and cryptographic computation [3], [4], [5].

A convolutional neural network (CNN) is a feed-forward artificial neural network widely used in image and video recognition, classification and detection [6], [7], [8]. In recent years, many AI applications, such as image/audio/video rendering and processing, have been implemented using CNN architectures on edge computing platforms [9], [10], [11]. Meanwhile, CNN implementations on RISC-V processors have attracted a lot of interest from research institutes as well as industry [12], [13], [14].

Convolution computations account for the majority, in many cases over 90%, of CNN workloads [15]. Under the standard RISC-V ISA pipeline, CNN operations need to continuously fetch and load data from cache and memory, followed by computation and write-back. Taking the multiply-accumulate (MAC) operation, which is widely used in CNN, as an example, a two-input MAC operation requires 4 data-load instructions, 2 multiplication instructions, 1 addition instruction and 1 data-store instruction. Most of the time is spent on loading and storing data, which is inefficient and time-consuming. It also incurs a large hardware overhead and power consumption. In order to speed up memory access, [16] proposes stream semantic registers that map accesses to specific registers into memory. Proprietary AI acceleration chips can also significantly accelerate CNN [17], [18].
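As a concrete illustration of this instruction-count breakdown, the following C fragment (a sketch written for this discussion; the variable names are ours, not from the paper) shows a two-input MAC as the baseline scalar pipeline would execute it:

#include <stdint.h>

/* Two-input multiply-accumulate on a baseline scalar RV32IM pipeline.
 * Roughly: 4 loads (a[0], a[1], b[0], b[1]), 2 multiplications,
 * 1 addition and 1 store -- 8 instructions in total, half of which
 * only move data between memory and registers. */
void mac2(const int32_t *a, const int32_t *b, int32_t *out)
{
    int32_t p0 = a[0] * b[0];   /* load a[0], load b[0], mul  */
    int32_t p1 = a[1] * b[1];   /* load a[1], load b[1], mul  */
    *out = p0 + p1;             /* add, then store the result */
}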
In this paper, we have implemented a set of seven customized instructions based on the RISC-V ISA extension format that can efficiently perform most of the critical operations in CNN computation, including convolution, memory loading, write-back, pooling and activation. These instructions are implemented in an ASIC (application-specific integrated circuit) acceleration module, which functions as a RISC-V co-processor once these instructions are called.

Among these instructions, CONV23 [19] is an SIMD (single instruction multiple data) vector instruction that performs the convolution between a 3 × 3 kernel matrix and a 4 × 4 input matrix and generates a 2 × 2 output matrix. WB23 is also an SIMD instruction that writes back the results of CONV23 to the memory. MP_RI, MAX_POOL and MP_WB perform the max pooling operations. RELU performs the activation operation based on the ReLU function, and W_WB updates the elements in the kernel matrix.

We use the open-source RISC-V core RI5CY as the baseline architecture, and construct a dedicated acceleration module to implement these seven custom instructions. The new RISC-V processor is called RI5CY-Accel. In order to speed up the convolution operations between the input matrix and the kernel matrix, we adopt the Winograd algorithm to optimize the matrix multiplication computation [20]. Benefiting from the Winograd algorithm, the computation requires only 16 multiplications, as compared to the 36 multiplications needed by the traditional direct dot-product method. The CONV23 instruction can be executed within 21 clock cycles, as compared to 140 cycles if it is implemented by the standard RISC-V ISA pipeline.

The original RI5CY, as well as RI5CY-Accel with the acceleration module that performs the new custom instructions, are written in Verilog and implemented on the Xilinx FPGA platform PYNQ-Z2. The complete system architecture is illustrated in Fig. 1.

Fig. 1. The complete system architecture of RI5CY-Accel.

These seven custom instructions can be called directly from the RISC-V executable using assembly commands, and they can also be programmed in C using inline assembler wrapping. A complete neural network can be constructed using these custom instructions. We have implemented complete LeNet-5, VGG16 and ResNet18 neural networks on the RI5CY-Accel architecture using these custom instructions and perform inference tasks on the MNIST, Fashion-MNIST and CIFAR-10 datasets.

In order to demonstrate the benefits of RI5CY-Accel, the architectures of LeNet-5, VGG16 and ResNet18 are also revised. The acceleration module can perform convolution, activation and pooling operations in one batch mode without the need to repetitively load and store the intermediate data to/from the memory. Therefore, in the LeNet-5, VGG16 and ResNet18 constructions on RI5CY-Accel, we have fused the convolutional layer and pooling layer into one layer (called the conv-pool layer in this paper). The optimized networks are called LeNet-Accel, VGG16-Accel and ResNet18-Accel. This optimization greatly reduces the memory access latencies and the associated energy consumption.

Running LeNet-Accel, VGG16-Accel and ResNet18-Accel on the RI5CY-Accel processor as the benchmark, the execution latency, resource usage and power consumption are compared with the execution on the original RI5CY processor. Experimental results show that the RI5CY-Accel architecture significantly speeds up the computation of convolutional neural networks while reducing the total energy consumption needed to complete the inference task. For Fashion-MNIST and CIFAR-10 classification inference on LeNet-Accel, VGG16-Accel and ResNet18-Accel, the total execution time is reduced by over 76.6%, 88.8% and 87.1%, with total energy consumption savings of 74.8%, 87.8% and 85.1%, respectively.

Different from many other CNN accelerators proposed in prior works, the contributions of our work are:
1) We propose a complete RISC-V SIMD custom instruction set that can significantly optimize the convolution, activation and pooling operations in CNN inference computation. In particular, convolutions between input matrices and 3 × 3 kernels are specifically optimized through the exploitation of the Winograd algorithm to reduce the computation complexity.
2) The instructions can be executed in a batch mode (called instruction fusion in this paper) to avoid excess memory read/write accesses and reduce energy and hardware overhead.
3) The custom instructions are embedded in a co-processor (called the acceleration module in this paper) and can be called directly from the RISC-V CPU, while most other CNN accelerators utilize a PE systolic array architecture.
4) The acceleration module is specifically optimized for low-power operation; it has a much smaller hardware footprint as compared to other works.

The structure of the paper is as follows. Section II describes the methods to construct custom instructions based on the RISC-V ISA. Section III introduces the Winograd algorithm that is used to accelerate the convolution computation. Section IV describes the acceleration module implementation. Section V introduces the RISC-V tool-chain integration with the custom instructions. The hardware resource overhead, along with the latency and power reduction, is discussed in Section VI, and the conclusion is given in Section VII.

II. RISC-V CUSTOM INSTRUCTION EXTENSIONS

The RISC-V ISA supports multiple instruction subsets, and customized functions can be realized by configuring different instruction subsets to perform application-specific tasks. For example, the basic instruction set (the I subset) supports operations including addition, subtraction, shift, XOR and other logic operations. The integer multiplication and division instruction subset (the M subset) provides multiplication and division operations. If a 32-bit RISC-V ISA supports the I and M subsets, it is conventionally named RV32IM.

The RI5CY processor is a 32-bit, four-stage pipelined, single-issue RISC-V soft core in PULPino, an open-source platform jointly developed by the Swiss Federal Institute of Technology Zurich and the University of Bologna. The RISC-V ISA is very flexible in adding new instructions; the ISA reserves four


opcodes for users to define additional custom instructions in the opcode map, as shown in Fig. 2.

Fig. 2. RISC-V opcode map for custom instructions.

The 0x0b and 0x2b opcodes in RI5CY have already been used by the PULP system. Opcodes 0x5b and 0x7b are also reserved for other specific tasks. The 0x77 opcode is still available, so in this paper we choose opcode 0x77 to implement the custom instructions.

There are six types of RISC-V instruction encoding formats, and each format is intended for different functions. For example, the J-type format is used for jump instructions, the R-type format is used to perform operations between values stored in two registers, while the I-type format is used to perform operations on immediate data and values stored in registers.

In this paper, in order to speed up the complete set of operations needed to perform CNN convolution, we have implemented seven custom instructions based on the RISC-V extension format. These instructions all use opcode 0x77, and they are distinguished by the values set in the funct3 and funct7 fields defined in the RISC-V instruction format. These custom instructions are described in detail below.
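To make the encoding concrete, the sketch below (our own illustration, not code from the paper) packs the fields of a standard 32-bit R-type instruction word; the custom instructions described here differ only in their funct3/funct7 values, listed later in Table I.

#include <stdint.h>

/* Assemble a 32-bit RISC-V R-type instruction word:
 * bits [6:0] opcode, [11:7] rd, [14:12] funct3,
 * [19:15] rs1, [24:20] rs2, [31:25] funct7. */
static uint32_t encode_rtype(uint32_t opcode, uint32_t funct3, uint32_t funct7,
                             uint32_t rd, uint32_t rs1, uint32_t rs2)
{
    return (opcode & 0x7f)
         | ((rd     & 0x1f) << 7)
         | ((funct3 & 0x07) << 12)
         | ((rs1    & 0x1f) << 15)
         | ((rs2    & 0x1f) << 20)
         | ((funct7 & 0x7f) << 25);
}

/* Example: CONV23 uses opcode 0x77, funct3 = 1, funct7 = 0
 * (register numbers x10/x11/x12 here are arbitrary placeholders):
 * uint32_t conv23_word = encode_rtype(0x77, 1, 0, 12, 10, 11); */

This is the same 32-bit word that the .insn directive shown later in Section V-A emits.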

A. CONV23 - Instruction for Convolution Between Two Matrices

1) Instruction Format: CONV23 uses the opcode 0x77 with funct3 set to 1 and funct7 set to 0.

The main function of CONV23 is to perform a convolution calculation between a 4 × 4 input matrix and a 3 × 3 convolution kernel matrix, generating a 2 × 2 output matrix. The input matrix is stored in the memory before the convolution operation, and the elements of the kernel matrix are preloaded in the acceleration module, where they can be updated with a separate instruction, W_WB (described later).

In a CONV23 operation, the elements of the matrices come from memory and cache, not from immediate values. CONV23 uses two source registers, rs1 and rs2: rs1 stores the first address in memory of the input matrix elements, and rs2 stores the size of the input matrix. CONV23 therefore uses the R-type instruction format and performs the convolution based on the values in the rs1 and rs2 registers. The result of CONV23 is stored in the acceleration module for subsequent operations and does not need a destination register.

2) Data Re-Use: The convolution calculation between two matrices is performed in a sliding-window fashion, i.e., starting from the top-left corner, the kernel matrix overlays the input matrix, performs a dot-multiplication, and moves to the next position with a stride step. Taking the convolution between a 4 × 4 input matrix and a 3 × 3 kernel matrix with a stride of 2 as an example, as shown in Fig. 3, there are in total 8 elements of the input matrix that were already loaded by the operations in the previous stride. Because of these redundantly loaded elements of the input matrix, the processor wastes half of the clock cycles loading data from memory during the convolution.

Fig. 3. Redundant elements between subsequent strides in the convolution process.

To solve this issue, we have implemented a data-reuse option in rs2 of the CONV23 instruction to avoid the redundancy. If CONV23 is performed in the data-reuse mode, the CONV23 instruction only fetches 8 new data elements in the fetching stage, and the 8 reusable data elements are preserved in the corresponding registers. The lower 16 bits of the rs2 register indicate the size of the input matrix. The 17th bit indicates the data-reuse option, i.e., if the 17th bit is 1, the input matrix will be loaded in data-reuse mode; if it is 0, the input matrix will be loaded as a complete matrix. The implementation of the data-reuse option is described in detail in the following sections. The instruction format of CONV23 is shown in Fig. 4 below.

Fig. 4. Instruction format of CONV23.

B. WB23 - Writing Back CONV23 Results

The CONV23 instruction produces a 2 × 2 output matrix. The elements of this matrix are written back to the memory with the custom instruction WB23. In WB23, rs1 indicates the first address of the memory region that stores the result elements. There is no need to indicate the size of the resulting matrix because the matrix size is pre-determined. The instruction format of WB23 is shown in Fig. 5 below.

Fig. 5. Instruction format of WB23.

C. MP_RI, MAX_POOL and MP_WB - Max Pooling Operations

These three instructions together perform the max pooling operations, which output the largest value among the elements

of the input matrix. The MP_RI instruction loads the input from memory, with the rs1 register indicating the first address to load the input from; MAX_POOL performs the max pooling operation; MP_WB writes the output back to memory, with the rs1 register indicating the first address of the memory region that stores the output. Because the matrix size used by these three instructions is pre-determined, there is no need to indicate the size of the matrix. The instruction formats of MP_RI, MAX_POOL and MP_WB are shown in Fig. 6 below.

Fig. 6. Instruction format of MP_RI, MAX_POOL and MP_WB.

D. RELU - The Activation Operation

The RELU instruction performs the activation operation using the ReLU function, i.e., it changes all negative values in the input to 0 and preserves the positive values. Since the activation function is performed immediately after the convolution or pooling operation, the RELU activation instruction does not need a dedicated write-back instruction; it shares the write-back instructions used by the CONV23 and MAX_POOL instructions. After the RELU instruction is executed, the output is written back to the memory by WB23. The instruction format of RELU is shown in Fig. 7 below.

Fig. 7. Instruction format of RELU.

E. W_WB - Updating the Kernel Matrix

In CONV23 execution, the 3 × 3 convolution kernel is pre-loaded in the acceleration module. The instruction W_WB is used to update the data of the convolution kernel matrix in the acceleration module. The value in the rs1 register defines the rank of the kernel matrix, and the value in the rs2 register defines the data. The instruction format of W_WB is shown in Fig. 8.

Fig. 8. Instruction format of W_WB.

These seven custom instructions are summarized in Table I, along with the format and register definition of each instruction.

In order for the processor to recognize the custom instructions, the decoder needs to be modified with the corresponding control signals added. Once the decoder unit encounters one of these instructions, the acceleration module is invoked to perform the operation.

III. CONVOLUTION ALGORITHM

A. Convolution Algorithm Comparison

There are several convolution acceleration algorithms in use today, namely im2col, Winograd, and FFT (Fast Fourier Transform). Each of these algorithms has pros and cons for specific computation scenarios.

Im2col, for instance, transforms convolutions into matrix multiplications through memory rearrangement. The algorithm needs to utilize high-performance matrix multiplication libraries. The re-mapping of the input matrices and convolution kernel matrices can substantially increase memory overhead, rendering it less suitable for edge platforms.

FFT [21], on the other hand, when used for convolution, involves complex-number computations. It is more advantageous for convolutions with larger kernels. However, most of the kernels used in neural network convolution are 3 × 3 in size, so FFT does not have an advantage in these applications.

The Winograd algorithm was introduced to CNN convolution by Lavin [20] as an optimization technique to reduce the number of multiplications in matrix multiplication. The Winograd algorithm can be exploited to optimize the convolution calculation between a 4 × 4 input matrix and a 3 × 3 convolution kernel matrix, generating a 2 × 2 output matrix. Compared with FFT and im2col, for convolutions with 3 × 3 kernels the required hardware overhead and computational complexity are smaller. Since the majority of the kernels used in CNN are 3 × 3 in size, the proposed custom ISA can effectively mitigate the convolution computation load.

B. Winograd Algorithm Introduction

The Winograd algorithm can be adopted to accelerate the convolution operations in CNN. Theoretical analysis shows that Winograd arithmetic can reduce the arithmetic complexity by a factor of 2.25 (36/16 multiplications) for convolutions with the 3 × 3 kernels that are widely used in CNN models.

The algorithm can be better explained using a one-dimensional convolution as the starting example. We denote the one-dimensional convolution operation as F(m, r), where m is the output size and r is the filter size. For an F(2, 3) convolution, assume the input data is d = (d_0, d_1, d_2, d_3) and the convolution kernel is g = (g_0, g_1, g_2). The input vector d can be rearranged into a 2 × 3 matrix, and the F(2, 3) convolution can be rewritten as:

F(2, 3) = \begin{bmatrix} d_0 & d_1 & d_2 \\ d_1 & d_2 & d_3 \end{bmatrix} \begin{bmatrix} g_0 \\ g_1 \\ g_2 \end{bmatrix} = \begin{bmatrix} r_0 \\ r_1 \end{bmatrix}   (1)

TABLE I
THE FORMAT OF THE CUSTOM INSTRUCTIONS

Instruction | funct3 | funct7 | rs1 | rs2 | rd | Operation
CONV23 | 1 | 0 | Header address of input matrix in memory | Size of input matrix; data-reuse flag | / | Winograd convolution
WB23 | 1 | 1 | Header address of output matrix in memory | / | / | Write Winograd convolution results back to memory
MAX_POOL | 2 | 0 | / | / | / | Max pooling
MP_WB | 2 | 1 | Header address of output matrix in memory | / | / | Write max pooling results back to memory
MP_RI | 2 | 2 | Header address of input matrix in memory | / | / | Read max pooling input into the acceleration module
RELU | 3 | 0 | Value of bias | / | / | Activate the sum of the Winograd convolution (or max pooling) result and the bias
W_WB | 4 | 0 | Rank of convolution kernel | Value of convolution kernel | / | Update the value of the convolution kernel

Computing r_0 and r_1 in this way requires 6 multiplications and 4 additions. However, we can instead compute the intermediate terms m_0, m_1, m_2 and m_3 as follows:

m_0 = (d_0 - d_2)\,g_0, \qquad m_1 = (d_1 + d_2)\,\frac{g_0 + g_1 + g_2}{2}   (2)

m_3 = (d_1 - d_3)\,g_2, \qquad m_2 = (d_2 - d_1)\,\frac{g_0 - g_1 + g_2}{2}   (3)

The F(2, 3) convolution can then be transformed into the following form:

F(2, 3) = \begin{bmatrix} d_0 & d_1 & d_2 \\ d_1 & d_2 & d_3 \end{bmatrix} \begin{bmatrix} g_0 \\ g_1 \\ g_2 \end{bmatrix} = \begin{bmatrix} m_0 + m_1 + m_2 \\ m_1 - m_2 - m_3 \end{bmatrix}   (4)

Because the kernel elements are predetermined when performing the convolution, the terms related to g can be pre-calculated. The transformed form therefore requires 4 multiplications and 8 additions, as compared to 6 multiplications and 4 additions in the direct matrix multiplication. The Winograd algorithm reduces the number of multiplications at the cost of increasing the number of additions. In computer arithmetic implementations, the resources and latency required for a multiplication are an order of magnitude higher than those of an addition, so the overall computation complexity is reduced.
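The identity in Eq. 4 is easy to verify numerically. The short C program below (our own illustration, not part of the paper's toolchain) computes F(2, 3) both by direct dot products and with the intermediate terms m_0 .. m_3 from Eqs. 2-3, and prints the two results for comparison:

#include <stdio.h>

int main(void)
{
    /* Arbitrary example inputs: d is the 1-D input, g the 3-tap kernel. */
    float d[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float g[3] = {0.5f, -1.0f, 2.0f};

    /* Direct F(2,3): 6 multiplications, 4 additions. */
    float r0_direct = d[0]*g[0] + d[1]*g[1] + d[2]*g[2];
    float r1_direct = d[1]*g[0] + d[2]*g[1] + d[3]*g[2];

    /* Winograd F(2,3): the (g0 +/- g1 + g2)/2 factors depend only on the
     * kernel and can be precomputed, leaving 4 multiplications at run time. */
    float gp = (g[0] + g[1] + g[2]) * 0.5f;
    float gm = (g[0] - g[1] + g[2]) * 0.5f;
    float m0 = (d[0] - d[2]) * g[0];
    float m1 = (d[1] + d[2]) * gp;
    float m2 = (d[2] - d[1]) * gm;
    float m3 = (d[1] - d[3]) * g[2];
    float r0_wino = m0 + m1 + m2;
    float r1_wino = m1 - m2 - m3;

    printf("direct  : %f %f\n", r0_direct, r1_direct);
    printf("winograd: %f %f\n", r0_wino, r1_wino);
    return 0;
}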
In most applications, the CNN input image is a two-dimensional matrix, so it is necessary to extend the Winograd algorithm from 1D to 2D inputs. The 2D operation is defined as F(m × m, r × r), where m is the size of the output matrix and r is the size of the kernel matrix. Since a kernel size of 3 × 3 is commonly used in CNN applications, we hereby use the operation F(2 × 2, 3 × 3) as an example.

Given that the output size is 2 × 2 and the convolution kernel size is 3 × 3, the corresponding input matrix size is 4 × 4, so the input matrix and the convolution kernel can be expressed as:

D = \begin{bmatrix} d_{00} & d_{01} & d_{02} & d_{03} \\ d_{10} & d_{11} & d_{12} & d_{13} \\ d_{20} & d_{21} & d_{22} & d_{23} \\ d_{30} & d_{31} & d_{32} & d_{33} \end{bmatrix}   (5)

K = \begin{bmatrix} k_{00} & k_{01} & k_{02} \\ k_{10} & k_{11} & k_{12} \\ k_{20} & k_{21} & k_{22} \end{bmatrix}   (6)

where D is the input matrix and K is the convolution kernel.

As in the 1D case shown in Eq. 4, the matrices D and K can be folded so that their convolution becomes a matrix multiplication of higher dimensions, as expressed in Eq. 7:

\begin{bmatrix} d_{00} & d_{01} & d_{02} & d_{10} & d_{11} & d_{12} & d_{20} & d_{21} & d_{22} \\ d_{01} & d_{02} & d_{03} & d_{11} & d_{12} & d_{13} & d_{21} & d_{22} & d_{23} \\ d_{10} & d_{11} & d_{12} & d_{20} & d_{21} & d_{22} & d_{30} & d_{31} & d_{32} \\ d_{11} & d_{12} & d_{13} & d_{21} & d_{22} & d_{23} & d_{31} & d_{32} & d_{33} \end{bmatrix} \begin{bmatrix} k_{00} \\ k_{01} \\ k_{02} \\ k_{10} \\ k_{11} \\ k_{12} \\ k_{20} \\ k_{21} \\ k_{22} \end{bmatrix} = \begin{bmatrix} r_{00} \\ r_{01} \\ r_{10} \\ r_{11} \end{bmatrix}   (7)

For clarity, the folded D and K matrices can be further segmented into sub-matrices and sub-vectors, as expressed in Eq. 8:

\begin{bmatrix} D_{00} & D_{10} & D_{20} \\ D_{10} & D_{20} & D_{30} \end{bmatrix} \begin{bmatrix} K_0 \\ K_1 \\ K_2 \end{bmatrix} = \begin{bmatrix} R_0 \\ R_1 \end{bmatrix}   (8)

Eq. 8 is simply Eq. 4 in a higher dimension (blocks instead of scalars). The Winograd complexity reduction introduced for F(2, 3) can therefore also be applied here, and Eq. 8 can be transformed into its Winograd form:

\begin{bmatrix} D_{00} & D_{10} & D_{20} \\ D_{10} & D_{20} & D_{30} \end{bmatrix} \begin{bmatrix} K_0 \\ K_1 \\ K_2 \end{bmatrix} = \begin{bmatrix} R_0 \\ R_1 \end{bmatrix} = \begin{bmatrix} M_0 + M_1 + M_2 \\ M_1 - M_2 - M_3 \end{bmatrix}   (9)

where:

M_0 = (D_{00} - D_{20}) K_0   (10a)

M_1 = (D_{10} + D_{20}) \frac{K_0 + K_1 + K_2}{2}   (10b)

M_2 = (D_{20} - D_{10}) \frac{K_0 - K_1 + K_2}{2}   (10c)

M_3 = (D_{10} - D_{30}) K_2   (10d)

The intermediate terms M_0, M_1, M_2 and M_3 can be calculated using the function F(2, 3), so F(2 × 2, 3 × 3) can be calculated recursively from F(2, 3). More generally, the function F(m × m, r × r) can be computed recursively using the F(m, r) function.

In general, adopting the Winograd algorithm, the convolution between the input matrix and the kernel matrix can be formulated as follows, where d represents the two-dimensional input matrix and g represents the two-dimensional convolution kernel:

Y = A^T [(G g G^T) \odot (B^T d B)] A   (11)

A^T = \begin{bmatrix} 1 & 1 & 1 & 0 \\ 0 & 1 & -1 & -1 \end{bmatrix}   (12a)

B^T = \begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}   (12b)

G = \begin{bmatrix} 1 & 0 & 0 \\ 1/2 & 1/2 & 1/2 \\ 1/2 & -1/2 & 1/2 \\ 0 & 0 & 1 \end{bmatrix}   (12c)

The input matrix size does not have to be 4 × 4, as used in our example above. In fact, the input matrix can be of any dimension larger than 4 × 4. The function F(2 × 2, 3 × 3) is the baseline function for convolutions between an input matrix and a kernel matrix of dimension 3 × 3. For input matrices larger than 4 × 4, F(2 × 2, 3 × 3) can be applied in a sliding-window fashion with a stride step, as shown in Fig. 9 below.

Fig. 9. Sliding-window fashion.

The function F(2 × 2, 3 × 3) is implemented in the acceleration module and encapsulated into the custom instruction CONV23; the details of the implementation, along with the structure of the acceleration module, are described in the following sections.
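As a reference for what the acceleration module computes, the following C routine (a behavioral sketch of Eq. 11 written by us for illustration; it is not the hardware implementation) evaluates Y = A^T[(G g G^T) ⊙ (B^T d B)]A for one 4 × 4 input tile and a 3 × 3 kernel, using the transform matrices of Eqs. 12a-12c:

/* Winograd F(2x2, 3x3): transform matrices from Eqs. 12a-12c. */
static const float A_T[2][4] = { {1, 1,  1,  0},
                                 {0, 1, -1, -1} };
static const float B_T[4][4] = { {1,  0, -1,  0},
                                 {0,  1,  1,  0},
                                 {0, -1,  1,  0},
                                 {0,  1,  0, -1} };
static const float G[4][3]   = { {1.0f,  0.0f, 0.0f},
                                 {0.5f,  0.5f, 0.5f},
                                 {0.5f, -0.5f, 0.5f},
                                 {0.0f,  0.0f, 1.0f} };

/* y = A^T * [ (G g G^T) hadamard (B^T d B) ] * A, producing a 2x2 tile. */
static void winograd_f2x2_3x3(const float d[4][4], const float g[3][3],
                              float y[2][2])
{
    float U[4][4] = {{0}}, V[4][4] = {{0}}, M[4][4];
    float t[4][3] = {{0}}, s[4][4] = {{0}};
    int i, j, k, l;

    /* U = G * g * G^T (kernel transform; kernel-only, so it could be reused
     * across tiles once the kernel has been loaded). */
    for (i = 0; i < 4; i++) for (j = 0; j < 3; j++)
        for (k = 0; k < 3; k++) t[i][j] += G[i][k] * g[k][j];
    for (i = 0; i < 4; i++) for (j = 0; j < 4; j++)
        for (k = 0; k < 3; k++) U[i][j] += t[i][k] * G[j][k];

    /* V = B^T * d * B (input transform). */
    for (i = 0; i < 4; i++) for (j = 0; j < 4; j++)
        for (k = 0; k < 4; k++) s[i][j] += B_T[i][k] * d[k][j];
    for (i = 0; i < 4; i++) for (j = 0; j < 4; j++)
        for (k = 0; k < 4; k++) V[i][j] += s[i][k] * B_T[j][k];

    /* M = U hadamard V: these are the 16 multiplications. */
    for (i = 0; i < 4; i++) for (j = 0; j < 4; j++)
        M[i][j] = U[i][j] * V[i][j];

    /* y = A^T * M * A (output transform). */
    for (i = 0; i < 2; i++) for (j = 0; j < 2; j++) {
        float acc = 0;
        for (k = 0; k < 4; k++) {
            float row = 0;
            for (l = 0; l < 4; l++) row += A_T[i][l] * M[l][k];
            acc += row * A_T[j][k];
        }
        y[i][j] = acc;
    }
}

Applied tile by tile with a stride of 2, the routine mirrors the sliding-window scheme of Fig. 9; the element-wise (Hadamard) product in the middle is where the 16 multiplications of the CONV23 data path occur.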

IV. ACCELERATION MODULE

A. Module Architecture

The acceleration module works as a co-processor, and it shares the LSU (load-store unit) module with the CPU. When the custom instructions are executed, the data to be calculated can be read directly from the CPU cache or memory through the LSU. After the calculation is completed, the result is written back to the CPU cache or memory through the LSU.

The acceleration module consists of three units, i.e., the convolution unit, the pooling unit and the activation unit. The structure diagram of the RISC-V processor integrated with the acceleration module is shown in Fig. 10.

Fig. 10. The structure diagram of the RISC-V processor.

The function F(2 × 2, 3 × 3) is executed in the following manner. When the CONV23 instruction is called from the processor, the acceleration module loads the input matrix data from the memory through the LSU module, which performs the memory-read and memory-write functions in RI5CY. The loaded input matrix (of size 4 × 4) is temporarily stored in 16 32-bit registers inside the acceleration module. Then the CONV23 unit is invoked in the acceleration module and generates a 2 × 2 output matrix. Before the execution of CONV23, 16 load-data operations are required.

The CONV23 execution is implemented as a 5-state state machine, namely the IDLE state, GET-DATA state, CAL1 state, CAL2 state and CAL3 state. The state transition diagram is illustrated in Fig. 11.

Fig. 11. State transition diagram of CONV23.

1) IDLE state: the CONV23 module waits for the RI5CY processor to fetch the CONV23 instruction and decode the instruction arguments. Once the decoding is finished, control signals are sent to the acceleration module, and the acceleration module jumps from the IDLE state to the GET-DATA state.
2) GET-DATA state: the acceleration module interacts with the LSU module. The acceleration module transmits the control signals to the LSU module, along with the header address of the input matrix data stored in the rs1 register. After the LSU module receives the control signals from the acceleration module, it reads one 32-bit data word from the memory per cycle and transmits it to the corresponding registers in the acceleration module. In every cycle the load address is incremented by 4, which allows the LSU module to load the matrix data sequentially from the memory. Once all input matrix data have been loaded after 16 cycles, the acceleration module jumps to the CAL1 state.
3) CAL1 state: in order to reduce the latency of the Winograd convolution calculation, the Winograd calculation is performed in three clock cycles, namely the CAL1, CAL2 and CAL3 states. In the CAL1 state, GgG^T and B^T dB are calculated in parallel, the results are temporarily stored in the registers, and the module transitions into the CAL2 state.
4) CAL2 state: the Hadamard product of GgG^T and B^T dB is computed, the results are temporarily stored in the registers, and the module transitions into the CAL3 state.
5) CAL3 state: the result of the Hadamard product of GgG^T and B^T dB is multiplied by the matrix A^T on the left and the matrix A on the right, generating the 4 elements of the resulting 2 × 2 matrix as the output of the F(2 × 2, 3 × 3) convolution.

The clock cycle distribution of these five states is illustrated in Fig. 12. The execution of CONV23 can be divided into 4 groups of clock cycles:
1) the first clock cycle is used to detect the enable signal of the CONV23 instruction;
2) the second clock cycle is used to pass the header memory address stored in the rs1 register to the LSU module;
3) 16 clock cycles are used to complete the load-data operations;
4) the last 3 clock cycles are used to complete the Winograd convolution calculation.
Implemented with our proposed state machine and clock distribution scheme, the CONV23 instruction requires 21 clock cycles to execute.

Fig. 12. Simulation waveform of the CONV23 instruction.

Under the standard ISA in a generic RI5CY processor, the convolution between a 3 × 3 input window and a 3 × 3 kernel matrix requires 9 load-data operations for the input data, 9 load-data operations for the kernel, 9 multiplication operations and 8 addition operations, and this process needs to be repeated four times to calculate F(2 × 2, 3 × 3), where the input matrix is 4 × 4. Therefore, using the standard RISC-V ISA, F(2 × 2, 3 × 3) needs in total 72 load-data instructions, 36 multiplication instructions and 32 addition instructions. In the RI5CY processor, 1 clock cycle is required for each instruction execution, so a total of 72 + 36 + 32 = 140 clock cycles are required to perform F(2 × 2, 3 × 3). In comparison, our proposed CONV23 instruction only needs 21 clock cycles and achieves a speedup of 6.67× in the execution of the F(2 × 2, 3 × 3) function.

B. Data Reuse and Repetitive Data-Loading Reduction

As introduced in Section II, the convolution between two matrices is performed in a successive slide-and-stride fashion; when the stride is smaller than the kernel matrix size, at every step a certain number of elements of the input matrix have already been loaded in the previous step. In order to further reduce the latencies incurred during convolution, our proposed CONV23 instruction provides an option that performs the data-reuse function to avoid repetitive loading of elements from the input matrix. The 17th bit of the rs2 register is used to indicate this "data-reuse" option.

As shown in Fig. 3, in the case of CONV23 execution, at every convolution step 8 elements of the input matrix have already been loaded in the previous step, while 8 new elements need to be loaded in the current step. CONV23 with the "data-reuse" option only needs 8 cycles to load the element data from the memory, and consequently the total execution only needs 13 clock cycles, as compared to 21 clock cycles without the "data-reuse" option. For the convolution between a 28 × 28 input matrix and a 3 × 3 convolution kernel, the standard RISC-V ISA requires 23660 clock cycles, while the CONV23 instruction with the data-reuse option only requires 2301 clock cycles, achieving a 10.28× speedup. (With a stride of 2 there are 13 × 13 = 169 tile positions in a 28 × 28 input, so the baseline costs 169 × 140 = 23660 cycles; with data reuse, each row of tiles costs 21 + 12 × 13 = 177 cycles, i.e., 13 × 177 = 2301 cycles in total.)
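The cycle counts above can be reproduced with a simple counting model. The sketch below (our own illustration; the per-operation costs are taken from the text: 140 cycles for the baseline ISA, 21 cycles per CONV23, 13 cycles with data reuse, and we assume reuse restarts at the first tile of each row) tallies the cycles for a 28 × 28 input convolved with a 3 × 3 kernel at stride 2:

#include <stdio.h>

int main(void)
{
    const int input = 28, tile_stride = 2;
    /* Number of 4x4 tile positions per row/column at stride 2. */
    const int tiles = (input - 4) / tile_stride + 1;       /* 13            */

    long baseline  = (long)tiles * tiles * 140;            /* standard ISA   */
    long conv23    = (long)tiles * tiles * 21;             /* CONV23, no reuse */
    /* With data reuse, only the first tile of each row loads all 16 words;
     * the remaining tiles of the row reuse 8 words and take 13 cycles. */
    long conv23_dr = (long)tiles * (21 + (tiles - 1) * 13);

    printf("tiles per side : %d\n",  tiles);               /* 13            */
    printf("baseline ISA   : %ld cycles\n", baseline);     /* 23660         */
    printf("CONV23         : %ld cycles\n", conv23);       /* 3549          */
    printf("CONV23 + reuse : %ld cycles\n", conv23_dr);    /* 2301          */
    return 0;
}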
C. The Pooling and Activation Unit

The pooling function is implemented with three separate custom instructions: the MP_RI instruction, which reads the pooling data from the memory into the acceleration module; the MAX_POOL instruction, which compares the pooled data temporarily stored in the acceleration module and outputs the maximum value; and the MP_WB instruction, which writes the maximum value back to the memory.

These three instructions are executed consecutively. For the function F(2 × 2, 3 × 3) with a stride of 2, the pooling operation is performed on the 2 × 2 output matrix. The header memory address is stored in register rs1, and the MP_RI instruction spends 4 clock cycles loading the 4 data elements to be pooled and temporarily storing them in the registers. The subsequent MAX_POOL instruction does not need any source or destination registers; it directly compares the four values stored in the registers of the acceleration module and selects the maximum. The MP_WB instruction is executed immediately after MAX_POOL, and the maximum value obtained is written back to the memory location whose header address is specified in register rs1.

D. Combining CONV23 With Pooling Instructions

We split the pooling operation into 3 consecutive instructions; this approach has the advantage of further reducing the latencies when combining the convolution operations with the pooling

operations. As shown in Fig. 13, at the end of a convolution, the output matrix (a 2 × 2 matrix in the case of F(2 × 2, 3 × 3)) needs to be written back to the memory, and the subsequent pooling operation would then need to load the matrix to be pooled from the memory back into the acceleration module again.

Fig. 13. Instruction fusion.

With the combination of convolution and pooling, the WB23 and MP_RI operations can be omitted and the pooling operation can be performed directly inside the acceleration module. This approach not only reduces the latencies incurred, it also reduces the overhead of memory accesses (both read and write).

E. The Activation Unit

The activation unit performs the RELU instruction, where negative input values are set to 0 while positive values remain untouched. This operation can be simply implemented by checking the sign bit of each input value.

The RELU instruction also supports an activation operation with a bias value stored in register rs1: the RELU instruction adds the bias to the input values and then performs the activation calculation. In CNN, the activation layer always follows the convolutional layer or the pooling layer, so the RELU instruction does not need to load or write back intermediate data to/from the memory. It can be executed immediately after the convolution or pooling operations. Then WB23 (in the case of convolution) or MP_WB (in the case of max pooling) is executed and the results are written back to the memory.

From the above discussion, we can see that, depending on the network architecture, the seven instructions can be executed in a batched and/or separated fashion to further reduce the latency of CNN computation while avoiding excess and unnecessary memory accesses.

V. RISC-V TOOL-CHAIN INTEGRATION OF CUSTOM INSTRUCTIONS

Our proposed custom instructions need to be integrated with the RISC-V SDK tool chains in order to be called from the upper stacks. We have defined assembly code for each of the instructions and use inline assembly in C to compile the source code into executable binaries.

A. Assembly Codes for Custom Instructions

The assembly code syntax is based on the RISC-V ISA custom instruction format. Since the proposed custom instructions are all R-type, taking the CONV23 instruction as an example, the opcode of CONV23 is 0x77, funct3 is 1, and funct7 is 0. The assembly code of the CONV23 instruction is therefore:

    .insn r 0x77, 1, 0, rd, rs1, rs2

The assembly codes for the other proposed custom instructions are constructed in the same way (Fig. 14).

Fig. 14. Assembly code of instructions.

C provides a wrapper for the inline assembler; the assembly code and associated registers need to be specified inside the __asm__ __volatile__ block. Taking the CONV23 instruction as an example, CONV23 is called using its "opcode, funct3, funct7" syntax inside the __asm__ __volatile__ block, and the variable names of the rs1, rs2 and rd registers are also specified in the block, as shown in the sample source code below. The rs1 register here holds the header address, the rs2 register holds the data size, and the rd register receives the result. The RISC-V gcc compiler will compile this and generate the corresponding assembly code. For the other custom instructions, WB23, MAX_POOL, RELU, etc., the corresponding inline assembler C wrappers can be written in a similar way.

    __asm__ __volatile__(
        ".insn r 0x77, 1, 0, %[conv_o], %[conv_i1], %[conv_i2]"
        : [conv_o] "=r"(result)
        : [conv_i1] "r"(address), [conv_i2] "r"(size)
    );

With the above assembler code wrapper, the proposed custom instructions can be called directly through the inline assembler in C. The standard RISC-V gcc compiler can be used to compile the C source code with inline assembler into executable binaries.
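For reference, wrappers for the remaining instructions can be written along the same lines. The sketch below is our own illustration based on the funct3/funct7 values in Table I (it is not the authors' listing from Fig. 14); x0 is used for the operand slots an instruction does not need, and a "memory" clobber is added for the instructions that read or write memory:

#include <stdint.h>

/* WB23: funct3 = 1, funct7 = 1; rs1 holds the output header address. */
static inline void wb23(uint32_t out_addr)
{
    __asm__ __volatile__(".insn r 0x77, 1, 1, x0, %[a], x0"
                         :: [a] "r"(out_addr) : "memory");
}

/* MP_RI: funct3 = 2, funct7 = 2; rs1 holds the pooling input address. */
static inline void mp_ri(uint32_t in_addr)
{
    __asm__ __volatile__(".insn r 0x77, 2, 2, x0, %[a], x0"
                         :: [a] "r"(in_addr) : "memory");
}

/* MAX_POOL: funct3 = 2, funct7 = 0; no operands. */
static inline void max_pool(void)
{
    __asm__ __volatile__(".insn r 0x77, 2, 0, x0, x0, x0");
}

/* MP_WB: funct3 = 2, funct7 = 1; rs1 holds the output header address. */
static inline void mp_wb(uint32_t out_addr)
{
    __asm__ __volatile__(".insn r 0x77, 2, 1, x0, %[a], x0"
                         :: [a] "r"(out_addr) : "memory");
}

/* RELU: funct3 = 3, funct7 = 0; rs1 holds the bias value. */
static inline void relu(int32_t bias)
{
    __asm__ __volatile__(".insn r 0x77, 3, 0, x0, %[b], x0"
                         :: [b] "r"(bias));
}

/* W_WB: funct3 = 4, funct7 = 0; rs1 = kernel element rank, rs2 = value. */
static inline void w_wb(uint32_t rank, int32_t value)
{
    __asm__ __volatile__(".insn r 0x77, 4, 0, x0, %[r], %[v]"
                         :: [r] "r"(rank), [v] "r"(value));
}

Issued back to back, mp_ri(), max_pool() and mp_wb() reproduce the 4 + 1 + 1 cycle pooling sequence described in Section IV-C.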
B. Data Alignment in Memory Access

Our proposed instructions assume that the data stored in the memory have consecutive addresses, i.e., once the header address (defined in the rs1 register) and the size of the matrix (defined in the rs2 register) are given, the data can be loaded from, or written back to, the memory consecutively.

The elements of the input matrix stored in the memory need to be rearranged before the execution of the custom instructions, as shown in Fig. 15. After the completion of these instructions, the locations of the output data need to be rearranged back so that they can be used by the operations that follow.

Fig. 15. Rearrangement of the input matrix.

In order to reduce the overhead of rearranging memory data locations, the operations of convolution and pooling are

merged, i.e., the corresponding instructions are executed in the following order: CONV23, MAX_POOL, WB23.

In this way, the rearranging overhead between the convolution operation and the pooling operation is eliminated. Correspondingly, in the network construction, the convolution layer and pooling layer are consolidated into one layer, called "conv-pool" in our experiment.
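To make the fused conv-pool sequence concrete, a per-tile loop might look like the sketch below. This is our own illustration, assuming wrapper functions conv23(), max_pool() and wb23() built from the inline-assembler patterns shown earlier (declared here as prototypes); the addresses, step sizes and the interpretation of the "17th bit" as bit index 16 are placeholders and assumptions, not values given in the paper:

#include <stdint.h>

/* Hypothetical wrappers around the custom instructions (see Section V-A). */
void conv23(uint32_t in_addr, uint32_t size_and_reuse, uint32_t *result);
void max_pool(void);
void wb23(uint32_t out_addr);

/* One fused conv-pool pass over a row of 4x4 tiles: per tile, CONV23 leaves
 * its 2x2 result inside the acceleration module, MAX_POOL reduces it there,
 * and a single write-back stores the pooled value -- no intermediate
 * round-trip to memory between convolution and pooling (Section V-B order). */
void conv_pool_row(uint32_t in_base, uint32_t out_base,
                   uint32_t input_size, int tiles)
{
    uint32_t dummy;
    for (int t = 0; t < tiles; t++) {
        /* Assume the 17th bit (mask 1u << 16) of the size operand selects
         * data-reuse for every tile after the first one in the row. */
        uint32_t size_word = input_size | ((t > 0) ? (1u << 16) : 0u);
        conv23(in_base + (uint32_t)t * 2u * sizeof(int32_t), size_word, &dummy);
        max_pool();
        wb23(out_base + (uint32_t)t * sizeof(int32_t));
    }
}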
C. LeNet-5 Construction Using Custom Instructions

In order to demonstrate the application and benefit of our proposed custom instruction set, we have constructed a revised LeNet-5, called LeNet-Accel, in C with the inline assembler code and compiled it into RISC-V executable binaries running on the FPGA. LeNet was proposed by LeCun at AT&T Bell Labs in 1989 and was used to recognize handwritten digit images. The original LeNet-5 contains 2 convolutional layers, 2 pooling layers, 2 fully-connected layers and one output layer. Each convolutional layer uses a 5 × 5 convolution kernel and a Sigmoid activation function.

The original LeNet-5 architecture is revised in our experiment in order to be optimized for RISC-V processors with the proposed acceleration module. The original LeNet-5 architecture uses a 5 × 5 convolution kernel for each convolutional layer. We replace it with a 3 × 3 convolution kernel and use the max pooling operation in the pooling layer. The original Sigmoid activation is also replaced with the ReLU function.

In order to keep the same fully-connected layers as the original LeNet-5, the LeNet-Accel architecture uses the same number of channels, and replaces the original padding of 2 with a padding of 1 in the first convolutional layer. The LeNet-Accel architecture and parameters are listed in Table II. While most of the parameters are the same as in the original LeNet-5, the revisions are indicated in parentheses, where the values in parentheses are from the original LeNet-5.

The LeNet-Accel model uses a 3 × 3 convolution kernel to replace the original 5 × 5 convolution kernel, so the number of kernel weights is smaller than in LeNet-5. The number of parameters in the first fully-connected layer changes from 5 × 5 × 16 × 120 = 48000 to 6 × 6 × 16 × 120 = 69120. The total number of weights is 81194 in the new LeNet-Accel, which is larger than the original 61706. We have trained and tested LeNet-Accel using the MNIST and Fashion-MNIST datasets. The MNIST and Fashion-MNIST datasets each consist of a training set of 60,000 examples and a testing set of 10,000 examples, each of which is a 28 × 28 grayscale image associated with one of 10 class labels. After training, LeNet-Accel achieves 87.8% accuracy on the Fashion-MNIST testing set, where the original LeNet-5 achieves 90.1% accuracy. On the MNIST dataset, the accuracy is almost unchanged.

D. VGG16 Construction Using Custom Instructions

In order to further demonstrate the acceleration effect of our proposed custom instruction set, we have also constructed a revised VGG16, called VGG16-Accel, in C with the inline assembler code and compiled it for RISC-V. VGG was proposed by Karen Simonyan and Andrew Zisserman in 2015 and was used to classify complex images. VGG16 contains 12 convolutional layers, 5 pooling layers, 3 fully-connected layers and one output layer. Each convolutional layer uses a 3 × 3 convolution kernel and a ReLU activation function.

The original VGG16 architecture is revised in our experiment in order to be optimized for RISC-V processors with the proposed acceleration module. The original VGG16 architecture uses a 224 × 224 × 3 input. Due to the size limitation of the CIFAR-10 dataset, we adjusted the input size of VGG16 from 224 to 32. In order to prevent overfitting due to excessive weights, the number of channels convolved in each layer of VGG16 is reduced to 25% of the original. Since the test uses the CIFAR-10 dataset, the original 1000-class classification is reduced to 10 classes, so we removed the 3 fully-connected layers and used one 512-to-10 fully-connected layer to directly output the network results.

The VGG16-Accel model uses a smaller convolution input size and fewer channels, so the number of kernel weights is smaller than in VGG16. The number of parameters in the fully-connected layers changes from 25088 × 4096 + 4096 × 4096 + 4096 × 1000 = 123633664 to 512 × 10 = 5120. The total number of weights is 921310 in VGG16-Accel, which is smaller than in VGG16. We have trained and tested VGG16-Accel using CIFAR-10. The CIFAR-10 dataset consists of a training set of 50,000 examples and a testing set of 10,000 examples, each of which is a 32 × 32 color image associated with one of 10 class labels. After training, VGG16-Accel achieves 88.47% accuracy on the CIFAR-10 testing set.

E. ResNet18 Construction Using Custom Instructions

In order to further demonstrate the acceleration effect of our proposed custom instruction set, we have constructed a larger-scale network revised from ResNet18, called ResNet18-Accel, in C using the inline assembler code and compiled it into a RISC-V executable. ResNet18 [22] is a convolutional neural network architecture that was proposed in 2016. It contains 4 basic blocks, and each basic block has 2 convolutions to realize the residual structure. Except for the first convolutional layer, each subsequent convolutional layer uses a 3 × 3 kernel,

followed by a ReLU activation function. The original ResNet18 architecture is revised in our benchmark to accommodate the CIFAR-10 dataset.

The original ResNet18 architecture has an input dimension of 224 × 224 × 3. In order to be compatible with the CIFAR-10 dataset, the input dimension of ResNet18-Accel is revised to 32 × 32 × 3. The 1000-class classification of ResNet18 is reduced to a 10-class classification in ResNet18-Accel, and the output dimension of the fully-connected layer is revised from 512 to 10 accordingly.

The ResNet18-Accel network is trained and tested with the CIFAR-10 dataset and achieves 91.41% accuracy.

TABLE II
THE ARCHITECTURE OF LENET-ACCEL

Layer | Input | Kernel | Stride | Padding | Output
1 convolution | 28 × 28 × 1 | 3 × 3 × 1 × 6 | 1 | 1 | 28 × 28 × 6
2 pool | 28 × 28 × 6 | 2 × 2 | 2 | 0 | 14 × 14 × 6
3 convolution | 14 × 14 × 6 | 3 × 3 × 6 × 16 | 1 | 0 | 12 × 12 × 16
4 pool | 12 × 12 × 16 | 2 × 2 | 2 | 0 | 6 × 6 × 16
5 fully connected | 6 × 6 × 16 | 6 × 6 × 16 × 120 | / | / | 120
6 fully connected | 120 | 120 × 84 | / | / | 84
7 output | 84 | 84 × 10 | / | / | 10

Fig. 16. "Conv-pool" layer.

VI. HARDWARE OVERHEAD, LATENCY AND POWER CONSUMPTION

A. Latencies of Convolution Operations

1) F(2 × 2, 3 × 3) Operation: It takes 140 clock cycles to complete the F(2 × 2, 3 × 3) operation using the RI5CY processor with the standard RISC-V ISA; in comparison, the same operation only takes 21 cycles to complete using the CONV23 instruction. A single F(2 × 2, 3 × 3) operation therefore has its execution latency reduced by 85%.

2) 2 × 2 Max Pooling Operation: For a 4-input, 1-output max pooling operation, the standard RISC-V ISA needs to work in a pair-wise comparison fashion, i.e., comparing two numbers at each clock cycle. It requires a total of 6 load-data instructions, 3 compare instructions and 3 store-data instructions, so the total execution requires 12 clock cycles. Using the proposed instructions, i.e., MP_RI, MAX_POOL and MP_WB, it requires 4 cycles for the MP_RI instruction, 1 cycle for the MAX_POOL instruction, and 1 cycle for the MP_WB instruction, so the total execution requires 6 clock cycles. For a 2 × 2 max pooling operation, the latency is therefore reduced by 50%.

3) ReLU Activation Operation: A ReLU activation operation executed by the standard RISC-V ISA requires at least 1 data-load instruction, 1 comparison instruction and 1 data-store instruction. If the input value of the ReLU function is negative, then an additional instruction is required to set it to 0. A total of 3 or 4 clock cycles (in the case of negative inputs) are required to complete the activation operation. The custom instruction RELU does not need to load or store data; the instruction only takes 1 clock cycle to execute. The latency of a single activation operation is therefore reduced by 66% to 75%.

4) Overall Network Latencies: Fig. 17, Fig. 18 and Fig. 19 show the execution latencies (in clock cycles) needed for each layer in the LeNet-Accel, VGG16-Accel and ResNet18-Accel networks respectively. These comparisons
demonstrated that, powered by the proposed custom instructions, LeNet-Accel (custom), VGG16-Accel (custom) and ResNet18-Accel (custom) greatly speed up the network computation.

Fig. 17. Clock cycles for each layer in LeNet-Accel (typical) and LeNet-Accel (custom).

Fig. 18. Clock cycles for each layer in VGG16-Accel (typical) and VGG16-Accel (custom).

Fig. 19. Clock cycles for each layer in ResNet18-Accel (typical) and ResNet18-Accel (custom).

In LeNet-Accel (typical), about 84% of the clock cycles are spent on convolution and pooling, while this proportion is about 31% in LeNet-Accel (custom). In the fused conv-pool layers, LeNet-Accel (custom) reduces the clock cycles by 91.2% as compared to the original architecture. The entire network computation latency is reduced by 76.6%.

In VGG16-Accel (typical) and VGG16-Accel (custom), about 99.9% of the clock cycles are spent on convolution and pooling. In the fused conv-pool layers, VGG16-Accel (custom) reduces the clock cycles by 88.8% as compared to the original architecture. The entire network computation latency is reduced by 88.8%.

In ResNet18-Accel (typical) and ResNet18-Accel (custom), the residual structure is more complex than in LeNet and VGG. However, about 99.9% of the clock cycles are also spent on convolution and pooling. In the fused conv-pool layers, ResNet18-Accel (custom) reduces the clock cycles by 87.1% as compared to the original architecture. The entire network computation latency is reduced by 87.1%.

B. Comparison of Hardware Resource Overhead

Both the original RI5CY processor and its revision with the acceleration module, i.e., RI5CY-Accel, are written in Verilog RTL and are synthesized and implemented on the Xilinx PYNQ-Z2 platform using Vivado.

The hardware resources needed by the original RI5CY processor as well as RI5CY-Accel are compared in Table III and Table IV. The resource consumption is measured in the usage of LUT, FF, DSP and BUFG units, as reported in the Xilinx FPGA utilization reports.

TABLE III
HARDWARE RESOURCE REPORT OF THE ORIGINAL RI5CY

Resource | Utilization | Available | Utilization %
LUT | 4908 | 53200 | 9.23
FF | 2090 | 106400 | 1.96
BRAM | 128 | 140 | 91.43
DSP | 5 | 220 | 2.27
IO | 5 | 125 | 4.00
BUFG | 3 | 32 | 9.38
MMCM | 1 | 4 | 25.00

TABLE IV
HARDWARE RESOURCE REPORT OF THE RI5CY-ACCEL

Resource | Utilization | Available | Utilization %
LUT | 8389 | 53200 | 15.77
FF | 3105 | 106400 | 2.92
BRAM | 128 | 140 | 91.43
DSP | 53 | 220 | 24.09
IO | 6 | 125 | 4.80
BUFG | 14 | 32 | 43.75
MMCM | 1 | 4 | 25.00

From Table III and Table IV, we can estimate that the additional resources required by the acceleration module are about 3481 LUTs, 1015 FFs, 48 DSPs and 11 BUFGs. Compared with the resources required by the generic RI5CY, the acceleration module increases the LUT count by 70.9%, the FF count by 48.6%, the DSP count by 9.6× and the BUFG count by 3.67×.

C. Execution Energy Consumption Reduction

Due to the latency between the block memory of the PYNQ-Z2 FPGA board and the RI5CY soft core, the RI5CY processor can only run at a maximum frequency of 55MHz on the PYNQ-Z2 platform. Using the power reports from Vivado, we can compare the power between the original RI5CY and RI5CY-Accel.

1382 IEEE TRANSACTIONS ON COMPUTERS, VOL. 73, NO. 5, MAY 2024

TABLE VI
COMPARE WITH OTHER PROCESSORS

Intel i7-7700 GTX 1080Ti ARM Cotex- M0 RI5CY RI5CY-Accel


Platform CPU GPU CPU TSMC090(DC) TSMC090(DC)
Clock Freq. 3600MHz 1923MHz 96MHz 100MHz 100MHz
Precision 32-bit float 32-bit float 16-bit fixed 32-bit fixed 32-bit fixed
GOP/s 87.3 6974.2 0.0864 0.0514 0.225(CONV)
Power 65W 243W 8.16mW 4.71mW 5.80mW
Energy Efficiency(GOP/s/W) 1.3 28.7 10.58 10.92 38.79

Both RI5CY and RI5CY-Accel consume a static power of 0.115W. For dynamic power during execution, RI5CY-Accel consumes 0.166W, as compared to 0.144W consumed by the generic RI5CY, an increase of 15.3%.

However, since the running time of LeNet-Accel, VGG16-Accel and ResNet18-Accel on RI5CY-Accel is greatly reduced when performing MNIST, Fashion-MNIST or CIFAR-10 classification inference, the total energy consumption is also reduced.

For LeNet-Accel(typical) running on the original RI5CY processor, the inference computation takes 428ms to complete at 55MHz. Running LeNet-Accel(custom) on RI5CY-Accel takes only 99ms at the same frequency.

For VGG16-Accel(typical) running on the original RI5CY processor, the inference computation takes 35.8s to complete at 55MHz. Running VGG16-Accel(custom) on RI5CY-Accel takes only 4.02s at the same frequency.

For ResNet18-Accel(typical) running on the original RI5CY processor, the inference computation takes 62.9s to complete at 55MHz. Running ResNet18-Accel(custom) on RI5CY-Accel takes only 8.1s at the same frequency.

Therefore, the total energy consumption on RI5CY-Accel is 99ms × 0.281W = 28mJ for LeNet-Accel(custom), 4.02s × 0.281W = 1.13J for VGG16-Accel(custom) and 8.1s × 0.281W = 2.28J for ResNet18-Accel(custom). The same inference tasks on the original RI5CY processor consume 428ms × 0.259W = 111mJ for LeNet-Accel(typical), 35.8s × 0.259W = 9.27J for VGG16-Accel(typical) and 62.9s × 0.259W = 16.29J for ResNet18-Accel(typical). Total energy savings of 74.8%, 87.8% and 85.1% are achieved. The energy consumption comparison is illustrated in Fig. 20.

D. Synthesized ASIC Implementation

Both RI5CY and RI5CY-Accel are synthesized using the Synopsys synthesis tool Design Compiler (DC). The process is TSMC 90nm, and the library and database are both typical standard process libraries. We synthesize the two processors and generate area and power reports.

The area report generated after DC synthesis is shown in Table V. Compared with RI5CY, RI5CY-Accel shows increases of different degrees in the numbers of ports, nets and cells. The number of ports increased by 130.08%, the number of nets increased by 158.04%, and the number of cells increased by 121.85%. The total unit area increased by 184.37%, and the total area increased by 108.30%. Nets increased the most because the addition of the acceleration module results in more connections.

The power consumption reports generated after DC synthesis of the RI5CY and RI5CY-Accel processors are also shown in Table V. Dynamic power accounts for the main part of the total, and combinational logic accounts for nearly 80% of it. Compared with the original processor, the internal power increased by 16.08% and the switching power increased by 11.40%. The static (leakage) power increased by 261.18%, about 3.6 times that of the original. However, since static power accounts for only a small proportion of the total power, the increase in total power consumption is only 23.23%.

E. Compare With Other Processors

In this section, we evaluate the performance/power efficiency of RI5CY-Accel and compare it with other processors. Three processors are selected for the comparison, i.e., the Intel i7-7700, the ARM Cortex-M0 and the NVIDIA GTX 1080Ti.

The Intel i7-7700 processor runs at 3600MHz with a power consumption of 65W. 32-bit floating-point convolution is performed at 87.3GOP/s, achieving a performance/power efficiency of 1.3GOP/s/W [23].

The NVIDIA GTX 1080Ti has a greater bandwidth, running at 1923MHz with a power consumption of 243W. 32-bit floating-point convolution is performed at 6974.2GOP/s, achieving a performance/power efficiency of 28.7GOP/s/W [23].

The ARM Cortex-M0, a low-power RISC processor, runs at 96MHz with a power consumption of 8.16mW. 16-bit fixed-point convolution is performed at 0.0864GOP/s, achieving a performance/power efficiency of 10.58GOP/s/W.

The baseline RI5CY processor without the acceleration module runs at 100MHz with a power consumption of 4.71mW. 32-bit fixed-point convolution is performed at 0.0514GOP/s, achieving a performance/power efficiency of 10.92GOP/s/W.

Our proposed RI5CY-Accel processor runs at 100MHz with a power consumption of 5.80mW. 32-bit fixed-point convolution is accelerated to 0.225GOP/s, a 4.37× speedup over the original RI5CY and a 2.60× speedup over the ARM Cortex-M0. It achieves a performance/power efficiency of 38.79GOP/s/W, a 3.66× and 3.55× improvement over the ARM Cortex-M0 and the original RI5CY, respectively. The comparison is listed in Table VI.
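The board-level power totals, the per-network inference energies and the speedup/efficiency ratios quoted above all reduce to simple products and ratios of the reported measurements. The Python sketch below is illustrative only, using the numbers quoted in the text, and is not part of the released RI5CY-Accel code.

    # Board-level FPGA power: total = static + dynamic (Vivado figures quoted above).
    p_orig  = 0.115 + 0.144   # 0.259 W on the original RI5CY
    p_accel = 0.115 + 0.166   # 0.281 W on RI5CY-Accel

    # Inference energy at 55 MHz: energy = power x time.
    for net, t_orig, t_accel in [("LeNet",    0.428, 0.099),
                                 ("VGG16",    35.8,  4.02),
                                 ("ResNet18", 62.9,  8.1)]:
        print(f"{net:8s} {p_orig * t_orig:7.3f} J -> {p_accel * t_accel:6.3f} J")
    # Prints 0.111 J -> 0.028 J, 9.272 J -> 1.130 J and 16.291 J -> 2.276 J,
    # i.e., the 111 mJ/28 mJ, 9.27 J/1.13 J and 16.29 J/2.28 J values above.

    # ASIC (TSMC 90 nm) throughput and efficiency ratios from Table VI.
    print(0.225 / 0.0514, 0.225 / 0.0864)   # ~4.37x and ~2.60x throughput speedup
    print(38.79 / 10.92,  38.79 / 10.58)    # ~3.55x and ~3.67x efficiency gain
                                            # (quoted as 3.55x and 3.66x in the text)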


VII. CONCLUSION

In this work, we proposed seven custom instructions based on the RISC-V ISA extension format. These instructions can significantly optimize the operations of convolution, activation and pooling in CNN inference computation. Powered by the Winograd algorithm, the execution time of convolution can be reduced from 140 to 21 clock cycles. These instructions can be executed in batch mode, where the immediate data can be shared, which eliminates the excess memory load/store overhead and the associated energy consumption. A revised RISC-V processor adopting these instructions, called RI5CY-Accel, is implemented on FPGA. Through experiments, RI5CY-Accel has demonstrated great advantages in execution speed-up and total energy consumption for inference computation. The source code is available on GitHub: https://github.com/QmppmQ/riscv.

REFERENCES

[1] V. Jain, A. Sharma, and E. A. Bezerra, “Implementation and extension of bit manipulation instruction on RISC-V architecture using FPGA,” in Proc. IEEE 9th Int. Conf. Commun. Syst. Netw. Technol. (CSNT), Piscataway, NJ, USA: IEEE, 2020, pp. 167–172.
[2] C. A. R. Melo and E. Barros, “Oolong: A baseband processor extension to the RISC-V ISA,” in Proc. IEEE 27th Int. Conf. Appl.-Specific Syst., Archit. Process. (ASAP), Piscataway, NJ, USA: IEEE, 2016, pp. 241–242.
[3] G. Xin et al., “VPQC: A domain-specific vector processor for post-quantum cryptography based on RISC-V architecture,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 8, pp. 2672–2684, Aug. 2020.
[4] D. B. Roy et al., “Customized instructions for protection against memory integrity attacks,” IEEE Embedded Syst. Lett., vol. 10, no. 3, pp. 91–94, Sep. 2018.
[5] E. Alkim, H. Evkan, N. Lahr, R. Niederhagen, and R. Petri, “ISA extensions for finite field arithmetic: Accelerating Kyber and NewHope on RISC-V,” IACR Trans. Cryptogr. Hardware Embedded Syst., vol. 2020, no. 3, pp. 219–242, 2020.
[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Commun. ACM, vol. 60, no. 6, pp. 84–90, 2017.
[7] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014, arXiv:1409.1556.
[8] C. Szegedy et al., “Going deeper with convolutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1–9.
[9] F. Ge, N. Wu, H. Xiao, Y. Zhang, and F. Zhou, “Compact convolutional neural network accelerator for IoT endpoint SoC,” Electronics, vol. 8, no. 5, p. 497, 2019.
[10] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing FPGA-based accelerator design for deep convolutional neural networks,” in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2015, pp. 161–170.
[11] X. Liu, Y. Chen, C. Hao, A. Dhar, and D. Chen, “WinoCNN: Kernel sharing Winograd systolic array for efficient convolutional neural network acceleration on FPGAs,” in Proc. IEEE 32nd Int. Conf. Appl.-Specific Syst., Archit. Process. (ASAP), Piscataway, NJ, USA: IEEE, 2021, pp. 258–265.
[12] Z. Li, W. Hu, and S. Chen, “Design and implementation of CNN custom processor based on RISC-V architecture,” in Proc. IEEE 21st Int. Conf. High Perform. Comput. Commun.; IEEE 17th Int. Conf. Smart City; IEEE 5th Int. Conf. Data Sci. Syst. (HPCC/SmartCity/DSS), Piscataway, NJ, USA: IEEE, 2019, pp. 1945–1950.
[13] N. Wu, T. Jiang, L. Zhang, F. Zhou, and F. Ge, “A reconfigurable convolutional neural network-accelerated coprocessor based on RISC-V instruction set,” Electronics, vol. 9, no. 6, p. 1005, 2020.
[14] D.-Z. Li, H.-R. Gong, and Y.-C. Chang, “Implementing RISCV system-on-chip for acceleration of convolution operation and activation function based on FPGA,” in Proc. 14th IEEE Int. Conf. Solid-State Integr. Circuit Technol. (ICSICT), Piscataway, NJ, USA: IEEE, 2018, pp. 1–3.
[15] J. Cong and B. Xiao, “Minimizing computation in convolutional neural networks,” in Proc. Int. Conf. Artif. Neural Netw., Cham, Switzerland: Springer, 2014, pp. 281–290.
[16] F. Schuiki, F. Zaruba, T. Hoefler, and L. Benini, “Stream semantic registers: A lightweight RISC-V ISA extension achieving full compute utilization in single-issue cores,” IEEE Trans. Comput., vol. 70, no. 2, pp. 212–227, Feb. 2021.
[17] T. Chen et al., “DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” ACM SIGARCH Comput. Archit. News, vol. 42, no. 1, pp. 269–284, 2014.
[18] Y. Chen et al., “DaDianNao: A machine-learning supercomputer,” in Proc. 47th Annu. IEEE/ACM Int. Symp. Microarchit., Piscataway, NJ, USA: IEEE, 2014, pp. 609–622.
[19] S. Wang, J. Zhu, Q. Wang, C. He, and T. T. Ye, “Customized instruction on RISC-V for Winograd-based convolution acceleration,” in Proc. IEEE 32nd Int. Conf. Appl.-Specific Syst., Archit. Process. (ASAP), Piscataway, NJ, USA: IEEE, 2021, pp. 65–68.
[20] A. Lavin and S. Gray, “Fast algorithms for convolutional neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4013–4021.
[21] M. Mathieu, M. Henaff, and Y. LeCun, “Fast training of convolutional networks through FFTs,” 2013, arXiv:1312.5851.
[22] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[23] W. Xu, Z. Zhang, X. You, and C. Zhang, “Reconfigurable and low-complexity accelerator for convolutional and generative networks over finite fields,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 39, no. 12, pp. 4894–4907, Dec. 2020.

Shihang Wang received the B.E. degree in microelectronics science and engineering from the Southern University of Science and Technology, Shenzhen, China, in 2020. He is currently working toward the M.S. degree with the Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China. His research interests include hardware acceleration for neural network, architecture optimization of RISC-V processor, and GPGPU.

Xingbo Wang received the B.E. degree in microelectronics science and engineering from the Southern University of Science and Technology, Shenzhen, China, in 2022. He is currently working toward the Ph.D. degree in microelectronics science and engineering with the Southern University of Science and Technology, Shenzhen, China. His research interests include RISC-V processor, hardware acceleration for convolutional neural network, and parallel computing.

Zhiyuan Xu received the B.E. degree in electronic information engineering from Hangzhou Dianzi University, Hangzhou, China, in 2021. He is currently working toward the M.S. degree with the Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China. His research interest includes deep-learning architectures for energy-efficient systems.


Bingzhen Chen received the B.E. degree in microelectronics science and engineering from the Southern University of Science and Technology, Shenzhen, China, in 2021. He is currently working toward the M.S. degree with the Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China. His research interests include hardware acceleration for neural network and RISC-V processor.

Chenxi Feng received the B.E. degree in microelectronics science and engineering from the Southern University of Science and Technology, Shenzhen, China, in 2022. She is currently working toward the M.S. degree with the Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China. Her research interests include hardware acceleration for neural network and RISC-V processor.

Qi Wang received the B.E. degree in microelectronics science and engineering from the Southern University of Science and Technology, Shenzhen, China, in 2020. He is currently working toward the Ph.D. degree with the Department of Electrical and Computer Engineering, The University of British Columbia, Vancouver, BC, Canada. His research interests include hardware acceleration for neural network, computer-aided design algorithms, FPGA architectures, and parallel computing.

Terry Tao Ye (Senior Member, IEEE) received the Bachelor of Science degree in electronic engineering from Tsinghua University, Beijing, in 1993, and the Ph.D. degree in electrical engineering from Stanford University, in 2003. He is a Professor with the Institute of Nanoscience and Applications (INA) and the Department of Electrical and Electronics Engineering (EEE), Southern University of Science and Technology (SUSTech), and by courtesy, an Adjunct Professor with the Department of Electrical and Computer Engineering (ECE), Carnegie Mellon University. He is active in both academic research and industrial applications in many engineering areas that include IC designs, Internet-of-Things (IOT) wearable sensor devices, and neuromorphic computing ICs. Besides his academic activities, he is also keen on industry-academic collaborations. He has held various engineering and consulting roles in China Academy of Science, Impinj Inc., Synopsys Inc., Magma Design Automation Inc., Silicon Architects Inc., and many other Silicon Valley companies.
