RISC-VTF: RISC-V Based Extended Instruction Set for Transformer
Abstract— The deep learning model Transformer has been widely used in the natural language processing (NLP) field, and its demand for computing resources keeps growing. However, general-purpose processors (CPUs and GPUs) invest excessive hardware resources in order to flexibly support a wide variety of tasks, so they are not efficient for the implementation of Transformer. Consequently, various software optimizations targeting general-purpose processors have been proposed one after another; but, under the condition of ensuring sufficient accuracy, the degree of software optimization is limited. It is therefore necessary to support Transformer at the hardware level.

After analysing the computational characteristics of the Transformer model, we designed a hardware-friendly instruction set architecture for the Transformer model based on RISC-V. In addition to the basic instructions, for the intensive and general computing parts of the model, following the extension rules of the RISC-V instruction set, we design matrix load/store instructions, matrix calculation instructions, a softmax instruction, an activation instruction and other user-defined instructions. They support any matrix scale, and we deploy the design on an FPGA to realize a flexible and efficient custom processor, RISC-VTF, for Transformer. The design is integrated on a Xilinx Zynq-7000 FPGA, and the resource consumption and performance are analyzed. Compared with traditional common ISAs (Instruction Set Architectures) such as x86, ARM or MIPS, RISC-VTF provides higher code density and performance efficiency.

*This paper was sponsored by Key Project of Hubei Provincial Department of Education under Grant No. D20201103
1 College of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, Hubei 430065, China
2 Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, China
3 School of Computer Science, Wuhan University, Wuhan, 430072, China
4 Department of Information Technology, Wuhan City College, Wuhan, 430083, China

I. INTRODUCTION

Transformer is a deep learning model proposed in a 2017 paper [1] from Google. In the NLP field, compared with the CNN [2] [3] and RNN [4] [5] models used in previous deep learning tasks, recent studies [6] [7] show that Transformer's semantic feature extraction ability, long-distance feature capture ability and comprehensive task feature extraction ability are superior to those of CNN and RNN in most cases. Moreover, in terms of parallel computing capability, Transformer is equivalent to CNN, while RNN is less efficient than the former two because of its sequence dependence. These advantages of Transformer make it widely used in NLP [1] [8] [9].

After BERT [10] appeared, the scale and complexity of Transformer increased significantly. Due to the slowing down of Moore's law, the amount of parameters and computation of Transformer poses challenges to general-purpose processors. Although some traditional superscalar CPUs, GPUs and TPUs can quickly perform the forward computation of complex algorithms in terminals, they cannot be deployed on many occasions that are sensitive to power consumption, resources and cost and that have certain real-time requirements, such as edge computing, embedded devices, etc.

In order to solve this problem, it is feasible to deploy the algorithm on an FPGA [11] [12]. These accelerators usually use a combination of high-level instructions and control signals to build high-level functional modules such as convolution/pooling/MLP, and use weight reuse and computing-resource reuse to optimize the hardware design. This approach treats the network as a whole, rather than dividing it into low-level computing operations (such as dot products) [11]; but since the same network model often has different structures, such accelerators usually cannot be reconfigured and lack flexibility. Another method is to design a dedicated neural network processor; representatives in this field are the Cambricon series of deep learning processor architectures, DaDianNao [13] and PuDianNao [14]. They can support a variety of machine learning algorithms, and their performance and power consumption are significantly better than those of general-purpose processors. The Cambricon [15] proposed by Chen Tianshi's team is a dedicated instruction set architecture for the field of neural networks; it implements a chip that can support 10 network structures with low area overhead, power consumption and latency. However, none of them is dedicated to Transformer. Different application scenarios have different requirements for the computing power of AI algorithms, and such general-purpose AI chips leave room for improvement on any single deep learning model. Only a chip architecture customized for the actual scenario can reduce power consumption and cost while greatly improving computing performance.

II. BACKGROUND

AI algorithms are mostly computationally intensive and parallelizable. For the Transformer model, take the machine translation in reference [1] as an example, as shown in Figure 1. The matrix X is obtained by adding the Embedding of each word and the Embedding of the word's position in the input sentence. The number of columns is the Embedding dimension of 512, and the number of rows is determined by the number of words in the sentence. An EnCoder accepts the X matrix. After that, X will pass through the two sub-layers of the EnCoder, the Multi-Head Attention layer (consisting of multiple Self-Attention heads) and the Feed Forward layer. Here, we only analyze its specific calculation method, as shown in formula 1.
Fig. 1. Calculation process of the EnCoder: the input X (posi_encod + word_emb) is multiplied by the per-head weight matrices W1_K…Wn_K and W1_V…Wn_V to obtain K1…Kn and V1…Vn; each head evaluates softmax(QK^T/sqrt(dk))×V, the head results Z1…Zn are combined through W1…Wn into Z, and the output is formed via X + Z, Layer Normalization(E) and Layer Normalization(F+G).

\[
\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)\times V \tag{1}
\]
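For reference, here is a minimal NumPy sketch of formula 1 for one Self-Attention head. The 512-to-64 projection sizes follow reference [1]; the function and weight names are our own illustration, not the paper's hardware implementation.

```python
import numpy as np

def self_attention_head(X, W_q, W_k, W_v):
    """One Self-Attention head, following formula 1."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[1]
    scores = Q @ K.T / np.sqrt(d_k)                # QK^T / sqrt(d_k)
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights = e / e.sum(axis=1, keepdims=True)     # row-wise softmax
    return weights @ V                             # softmax(...) x V

# A 10-word sentence: X is 10 x 512 (posi_encod + word_emb),
# projected to d_k = d_v = 64 per head as in reference [1].
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 512))
W = [rng.standard_normal((512, 64)) for _ in range(3)]
Z1 = self_attention_head(X, *W)
print(Z1.shape)                                    # (10, 64)
```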
The calculation of the EnCoder then proceeds as follows:
• X is multiplied by each head's weight matrices to obtain Q, K and V, and each head's output is computed according to formula 1; the head outputs are combined into Z.
• X and Z are added through the residual connection to obtain E.
• Layer normalization is performed on E to get F.
• F goes through the feed-forward neural network to get G. The feed-forward network consists of two layers, as shown in formula 2: one layer uses the ReLU activation function, and the other does not.

\[
\mathrm{FFN}(x)=\max(0,\;xW_1+b_1)W_2+b_2 \tag{2}
\]

• Layer normalization is applied to F + G.

This yields one EnCoder output. In literature [1], there are 6 such EnCoders and 6 DeCoders (the DeCoder's calculation method is similar to the EnCoder's).
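The remaining EnCoder steps (residual addition, layer normalization, and the feed-forward network of formula 2) can be sketched the same way. A minimal sketch, assuming an inner dimension of 2048 as in [1]; the layer-norm scale/shift parameters are omitted and all weights are random placeholders.

```python
import numpy as np

def layer_norm(A, eps=1e-6):
    """Normalize each row to zero mean and unit variance (scale/shift omitted)."""
    return (A - A.mean(axis=1, keepdims=True)) / np.sqrt(A.var(axis=1, keepdims=True) + eps)

def encoder_tail(X, Z, W1, b1, W2, b2):
    E = X + Z                                   # residual connection
    F = layer_norm(E)                           # Layer Normalization(E)
    G = np.maximum(0, F @ W1 + b1) @ W2 + b2    # formula 2: relu layer, then linear layer
    return layer_norm(F + G)                    # Layer Normalization(F+G)

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((512, 2048)), np.zeros(2048)
W2, b2 = rng.standard_normal((2048, 512)), np.zeros(512)
out = encoder_tail(rng.standard_normal((10, 512)),
                   rng.standard_normal((10, 512)), W1, b1, W2, b2)
print(out.shape)                                # (10, 512)
```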
Figure 1 shows the calculation process intuitively; it involves a large number of matrix additions, multiplications, Softmax and other operations. A traditional CPU usually performs only one multiplication at a time; even a more advanced large-scale superscalar server CPU, with a 128-bit data width executing SIMD instructions, can only calculate four 32-bit fixed-point multiplications at the same time.

…to verify the corresponding extended instruction set.
• Optimize the micro-architecture, optimize memory access, and reuse computing resources.

III. RISC-VTF DESIGN

RISC-V is an open-source, modular, incremental ISA that allows individuals or enterprises to modify RISC-V according to their own needs. RISC-V's basic instruction set RV32I has 40 instructions covering arithmetic logic, control transfer, load and store. RISC-V also provides a wealth of extended instructions: after implementing the basic instructions, users can selectively include these extensions according to actual needs. Finally, RISC-V also reserves space for custom instructions [16]; users can easily modify and extend the hardware decoding, the calculation unit and the software compiler, which makes RISC-V very suitable for the design of dedicated processors.
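For reference, a small sketch that packs the standard RISC-V R-type fields (funct7, rs2, rs1, funct3, rd, opcode) into a 32-bit word. encode_r_type is our own helper name; 0001011 is the custom-0 opcode that the custom instructions in the next section occupy.

```python
def encode_r_type(funct7, rs2, rs1, funct3, rd, opcode):
    """Pack a 32-bit R-type word: funct7|rs2|rs1|funct3|rd|opcode."""
    assert 0 <= funct7 < 128 and 0 <= funct3 < 8 and 0 <= opcode < 128
    assert all(0 <= r < 32 for r in (rs2, rs1, rd))
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | \
           (funct3 << 12) | (rd << 7) | opcode

CUSTOM_0 = 0b0001011   # opcode slot reserved for custom instructions
print(hex(encode_r_type(0b0000001, 0, 5, 0b000, 10, CUSTOM_0)))
```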
Figure 2 shows the main instruction format, and Figure 3 shows the operand encoding of the basic instructions.

Fig. 2. Main instruction format (fields at bits 31–25, 24–20, 19–15, 14–12, 11–7 and 6–0).

Fig. 3. Instruction operand encoding: the RV32 opcode map indexed by inst[6:5] and inst[4:2]. Row 01 holds STORE, STORE-FP, custom-1, AMO, OP, LUI, op-32; row 10 holds MADD, MSUB, NMSUB, NMADD, OP-FP, custom-2/rv128; row 11 holds BRANCH, JALR, JAL, SYSTEM, custom-3/rv128; row 00 contains the custom-0 slot (opcode 0001011) that RISC-VTF uses.

A. Design of Transformer Instruction Set

The calculation process of Transformer has been described in detail above, and Figure 1 shows it intuitively. We analyze the calculation process and characteristics of the calculation…

1) Matrix load and store instructions: …a single matrix operation instruction. The funct7 code 0000001 defines the matrix load (MLOAD): rs1 holds the matrix address (m_addr), rd holds the on-chip destination address (dest_addr), and the rm/m_size field gives the length and width of the matrix respectively; rm is only valid in the lowest bit. The funct7 code 0000010 defines the operation of a matrix store (MSTORE). Its function is opposite to the matrix load, and its decoding is similar to that of the matrix load.
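As a behavioral reading of the load/store pair (a sketch under assumptions, not the RTL), M_RAM and the off-chip memory can be modeled as Python dictionaries keyed by address, with matrices stored row-major:

```python
import numpy as np

DDR = {}     # off-chip memory: address -> flat numpy array
M_RAM = {}   # on-chip cache used by the MALU: address -> matrix

def mload(dest_addr, m_addr, length, width):
    """MLOAD: copy a length x width matrix from off-chip DDR into M_RAM."""
    M_RAM[dest_addr] = DDR[m_addr].reshape(length, width).copy()

def mstore(dest_addr, m_addr, length, width):
    """MSTORE: the opposite of MLOAD, write an on-chip matrix back to DDR."""
    DDR[dest_addr] = M_RAM[m_addr].ravel().copy()

# usage: stage an 8 x 4 matrix on chip and write it back
DDR[0x1000] = np.arange(32.0)
mload(0x0, 0x1000, 8, 4)
mstore(0x2000, 0x0, 8, 4)
```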
2) Matrix operation instructions: The instruction format is shown in Figure 6. The funct3 values 010 and 011 define matrix multiplication (MMUL) and matrix addition (MADD). Among them, rs1 and rs2 specify the on-chip starting addresses of the two matrices to be operated on, and rd is the on-chip destination address for storing the operation result. We decode the high three bits and the low three bits of the immediate field into the addresses of two additional registers; the values of these registers specify the length and width of the two operand matrices.

Fig. 6. Matrix operation instruction format: imm[6:0] (rm1_rm2) | rs2 (m2_addr) | rs1 (m1_addr) | funct3 (010 MMUL, 011 MADD) | rd (dest_addr) | opcode 0001011. Assembly: MMUL rd,rs1,rs2,imm and MADD rd,rs1,rs2,imm.

3) Softmax and activation instructions: The instruction format has the same fields as the matrix load/store instructions: funct7 | rm (m_size) | rs1 (m_addr) | funct3 000 | rd (dest_addr) | opcode 0001011. The funct7 code 0000100 defines the activation instruction (MRELU); the softmax instruction (MSFT) shares the format. Their assembly forms are MSFT rd,rs1,rm and MRELU rd,rs1,rm.
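Putting the recovered encodings together, a hypothetical assembler sketch for the six custom instructions follows. The packing of rm1/rm2 into imm[6:0] and the MSFT funct7 value 0000011 are our assumptions, since neither is fully specified above.

```python
CUSTOM_0 = 0b0001011   # custom-0 opcode used by RISC-VTF

def encode(funct7, rs2, rs1, funct3, rd):
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | \
           (funct3 << 12) | (rd << 7) | CUSTOM_0

# Matrix load/store: funct3 = 000, the rs2 slot carries rm/m_size.
def MLOAD(rd, rs1, rm):   return encode(0b0000001, rm, rs1, 0b000, rd)
def MSTORE(rd, rs1, rm):  return encode(0b0000010, rm, rs1, 0b000, rd)

# Matrix multiply/add: funct3 = 010 / 011; the high and low three bits of
# imm[6:0] name the two size registers rm1 and rm2 (bit packing assumed).
def MMUL(rd, rs1, rs2, rm1, rm2):
    return encode((rm1 << 4) | rm2, rs2, rs1, 0b010, rd)
def MADD(rd, rs1, rs2, rm1, rm2):
    return encode((rm1 << 4) | rm2, rs2, rs1, 0b011, rd)

# Softmax / ReLU: MRELU uses funct7 0000100; MSFT's funct7 is not given
# in the paper, so 0000011 below is a placeholder assumption.
def MSFT(rd, rs1, rm):    return encode(0b0000011, rm, rs1, 0b000, rd)
def MRELU(rd, rs1, rm):   return encode(0b0000100, rm, rs1, 0b000, rd)
```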
Fig. 8. Overall structure of RISC-VTF: a five-stage pipeline (IF_ID, ID_EX, EX_MEM and MEM_WB registers, with a CTRL unit issuing stall/flush), the general ALU and the custom MALU in the execute stage, the REGISTER file, PC and instruction ROM on the fetch side, the data RAM and the MALU's M_RAM, all interconnected over a Wishbone bus.
The overall structure is shown in Figure 8. It includes five operations: fetch, decode, execute, memory access and write back. Here, the execution unit is the main part of the design; it includes the general computing unit (the ALU unit in Figure 8) and the custom instruction computing unit (the MALU unit in Figure 8).

After decoding, the basic logic, arithmetic, shift and other instructions are executed in the general computing unit, while the calculation operations involved in the custom instructions are completed in the MALU unit. The MALU unit exchanges data with off-chip memory indirectly through an on-chip cache (M_RAM).
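As an illustration of this split (our sketch, not the authors' decoder), an instruction word can be routed by its opcode: the custom-0 opcode goes to the MALU, everything else to the general ALU.

```python
CUSTOM_0 = 0b0001011

def dispatch(inst):
    """Route a fetched 32-bit instruction word to the ALU or the MALU."""
    opcode = inst & 0x7F
    funct3 = (inst >> 12) & 0x7
    if opcode == CUSTOM_0:
        return ("MALU", funct3)   # MLOAD/MSTORE/MMUL/MADD/MSFT/MRELU
    return ("ALU", funct3)        # RV32I base instructions
```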
Because the MALU unit is the core unit, we next introduce the MALU unit and the data access optimization method in detail.

1) Design of MALU unit: The overall structure is shown in Figure 9.
Fig. 9. MALU unit: control signals drive the MMA, SoftMax and Relu sub-units, which exchange data through M_RAM.

The instruction gives the initial address and size of the two operand matrices, and the number of clock cycles for multiplying the two matrices depends on their size. Considering the possible transpose operation in matrix multiplication, this paper sets an offset value, offset, which changes with the number of clock cycles spent in the calculation. The value of offset is combined with the initial address of each matrix and the length and width of the two matrices given in the instruction (the length and width determine whether a matrix needs to be transposed); this identifies the position, within the matrix, of the element to be operated on in each clock cycle, so that the operation over the entire matrix can be completed. The steps of the addition operation are similar to those of multiplication. Here we reuse the addition circuit to the greatest extent; its structure is shown in Figure 10.

Fig. 10. Structure of the reused addition circuit: an adder tree that accumulates the element products Ma1·Mb1 … Man·Mbn level by level.
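A behavioral sketch of the offset mechanism: one multiply-accumulate per "clock cycle", with the offset decomposed into a position in the result and a position along the reduction axis. The flat row-major layout and loop order are our assumptions.

```python
import numpy as np

def mmul_by_offset(A_flat, B_flat, m, k, n, b_transposed=False):
    """Compute the (m x k) by (k x n) product from row-major flat buffers,
    addressing one element per 'clock cycle' via an offset counter."""
    C = np.zeros(m * n)
    for cycle in range(m * n * k):            # one MAC per cycle
        i = cycle // (n * k)                  # result row
        j = (cycle // k) % n                  # result column
        t = cycle % k                         # position along the reduction
        a = A_flat[i * k + t]                 # offset into A
        # the offset into B changes if B is stored as its transpose (n x k)
        b = B_flat[j * k + t] if b_transposed else B_flat[t * n + j]
        C[i * n + j] += a * b
    return C.reshape(m, n)

# quick check against numpy
A, B = np.arange(6.0), np.arange(12.0)
assert np.allclose(mmul_by_offset(A, B, 2, 3, 4),
                   A.reshape(2, 3) @ B.reshape(3, 4))
```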
…approximation method, which combines a first-order polynomial with a lookup table to implement the exponential calculations and reduce the amount of computation and resource consumption. The denominator is then obtained by applying the exponential operation to the multiple input data. Implementing the final division directly in hardware would cost a lot of resources; therefore, this paper transforms the division into a one-time reciprocal and multiple multiplications. Since the storage location of the final result is known from the MSFT instruction, the result can be saved temporarily to the destination address while performing the exponential operation; in this way, when the value of e^{x_i} is needed again, it does not have to be calculated a second time. The structure diagram is shown in Figure 11.

\[
f(x_i)=\frac{e^{x_i}}{\sum_{j=1}^{n}e^{x_j}} \tag{3}
\]
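A behavioral model of this scheme: exponentials from a lookup table plus a first-order correction, one reciprocal for the denominator, and per-element multiplications. The table range, the 64-segment granularity and the max-subtraction shift are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

# Lookup table for exp() on [-8, 0): 64 segments, storing exp at each
# segment start; a first-order term covers the distance into the segment.
SEGS, LO, HI = 64, -8.0, 0.0
STEP = (HI - LO) / SEGS
LUT = np.exp(LO + STEP * np.arange(SEGS))

def exp_approx(x):
    """exp(a + dx) ~= exp(a) * (1 + dx): LUT entry plus first-order term."""
    x = np.clip(x, LO, HI - 1e-9)
    k = ((x - LO) / STEP).astype(int)      # segment index
    dx = x - (LO + k * STEP)               # distance into the segment
    return LUT[k] * (1.0 + dx)

def msft(row):
    # shift by the max so inputs fall in the table range (an assumption here)
    e = exp_approx(row - row.max())        # saved once, reused: no second pass
    recip = 1.0 / e.sum()                  # one-time reciprocal
    return e * recip                       # division replaced by multiplications

x = np.array([1.0, 2.0, 3.0, 4.0])
print(msft(x))                             # close to np.exp(x) / np.exp(x).sum()
```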
IV. EXPERIMENT

The design is written in Verilog, simulated and synthesized with Vivado 2018, and deployed on a Zynq-7000 FPGA using the Xilinx toolkit. Vivado's report on resource occupancy and power consumption is shown in Table I.

TABLE I
RESOURCE OCCUPANCY AND POWER CONSUMPTION (XC7Z020-2CLG400L)

Resource   Utilization   Available   Utilization (%)
FF         31092         106400      29.2
LUT        24350         53200       45.7
BRAM       78            140         55.7
DSP        198           220         90.0
I/O        93            200         46.5
BUFG       11            32          34.3
Power: 0.783 W
Based on the Transformer model in reference [1], this paper takes a 512 × 64 input matrix as an example to verify RISC-VTF. The comparison platforms are a general-purpose CPU (Intel Core i7-8750H) and an NVIDIA GP107 GPU. We use the riscv32-unknown-linux-gnu cross-compiler tool chain to compile the programs, and the user-defined instructions are implemented as embedded instruction code. The evaluation results are shown in the figure below.

[Figure: evaluation results for the MMUL, MADD and Softmax kernels, normalized to RISC-V (= 1): MMUL: MIPS 25.6, X86 28.6; MADD: MIPS 26.4, X86 29.3; Softmax: MIPS 32.5, X86 33.4.]
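To make "embedded instruction code" concrete, the snippet below reuses the encoding helpers sketched in Section III to emit a load-softmax-store sequence as .word directives. The register numbers, addresses and the sequence itself are illustrative assumptions, not the actual benchmark code.

```python
# Hypothetical sequence: x10 holds the matrix address, x5 the matrix size,
# x11 the write-back address (MLOAD/MSFT/MSTORE defined in Section III sketch).
program = [
    MLOAD(rd=1, rs1=10, rm=5),     # stage the matrix in M_RAM
    MSFT(rd=2, rs1=1, rm=5),       # softmax over the staged matrix
    MSTORE(rd=11, rs1=2, rm=5),    # write the result back to DDR
]
for word in program:
    print(f".word 0x{word:08x}")   # embedded in the compiled program
```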
V. CONCLUSION

According to the computing characteristics of the Transformer algorithm, this paper designs an instruction set to speed up the Transformer module. We propose a micro-architecture, design two schemes to perform the calculation, and use one scheme to optimize the data access. The design has been implemented on an FPGA in Verilog. Experimental results show that RISC-VTF implements the Transformer efficiently with low hardware consumption. However, to fully deploy AI algorithms that use Transformer, we still need to implement a compiler that supports the custom instructions; this is future work.

ACKNOWLEDGMENT

This paper was sponsored by Key Project of Hubei Provincial Department of Education under Grant No. D20201103. We thank them for their support of our research.