
2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC)

17-20 October, 2021. Melbourne, Australia

RISC-VTF: RISC-V Based Extended Instruction Set for Transformer


Qiang Jiao1,2 , Wei Hu1,2 , Fang Liu3,4 and Yong Dong1,2
DOI: 10.1109/SMC52423.2021.9658643

*This paper was sponsored by Key Project of Hubei Provincial Department of Education under Grant No. D20201103.
1 College of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, Hubei 430065, China
2 Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, China
3 School of Computer Science, Wuhan University, Wuhan, 430072, China
4 Department of Information Technology, Wuhan City College, Wuhan, 430083, China

Abstract— The deep learning model Transformer has been widely used in the natural language processing (NLP) field, and its demand for computing resources keeps growing. However, general-purpose processors (CPU and GPU) invest excessive hardware resources in their designs because they have to flexibly support a variety of tasks, so they are not efficient for the implementation of Transformer. Consequently, various software optimizations targeting general-purpose processors have been proposed one after another, but under the condition of ensuring sufficient accuracy the degree of software optimization is limited. It is therefore necessary to support Transformer in a friendly way at the hardware level.

After analysing the computational characteristics of the Transformer model, we designed a hardware-friendly instruction set architecture for the Transformer model based on RISC-V. In addition to the basic instructions, for the intensive and general computing parts of the model, and following the expansion rules of RISC-V, we design matrix load/store instructions, matrix calculation instructions, a softmax instruction, an activation instruction and other user-defined instructions. They support any matrix scale, and we deploy them on an FPGA to realize a flexible and efficient custom processor, RISC-VTF, for Transformer. The design is integrated on a Xilinx zynq-7000 FPGA, and the resource consumption and performance are analyzed. Compared with traditional common ISAs (Instruction Set Architectures) such as x86, ARM or MIPS, RISC-VTF provides higher code density and performance efficiency.

I. INTRODUCTION

Transformer is a deep learning model proposed in a 2017 paper [1] from Google. In the NLP field, compared with the CNN [2] [3] and RNN [4] [5] models used in earlier deep learning tasks, recent studies [6] [7] show that Transformer's semantic feature extraction ability, long-distance feature capture ability and comprehensive task feature extraction ability are superior to those of CNN and RNN in most cases. Moreover, in terms of parallel computing capability, Transformer is equivalent to CNN, while RNN is less efficient than the former two because of its sequence dependence. These advantages make Transformer widely used in NLP [1] [8] [9].

After BERT [10] appeared, the scale and complexity of Transformer increased significantly. Due to the slowing down of Moore's law, the amount of parameters and computation of Transformer poses challenges to general-purpose processors. Although traditional superscalar CPUs, GPUs and TPUs can quickly carry out the forward computation of complex algorithms in terminals, they cannot be deployed in many scenarios that are sensitive to power consumption, resources and cost and that have real-time requirements, such as edge computing and embedded devices.

One way to solve this problem is to deploy the algorithm on an FPGA [11] [12]. These accelerators usually use a combination of high-level instructions and control signals to build high-level functional modules such as convolution/pooling/MLP, and use weight reuse and computing resource reuse to optimize the hardware design. They treat the network as a whole rather than dividing it into low-level computing operations (such as dot products) [11]; but since the same network model often has different structures, such accelerators usually cannot be reconfigured and lack flexibility.

Another method is to design a dedicated neural network processor. Representatives in this field are the Cambrian series of deep learning processor architectures, DaDianNao [13] and PuDianNao [14]. They can support a variety of machine learning algorithms, and their performance and power consumption are significantly better than those of general-purpose processors. Cambricon [15], proposed by Chen Tianshi's team, is a dedicated instruction set architecture for the neural network field; it implements a chip that supports 10 network structures with lower area overhead, power consumption and latency. However, none of them is dedicated to Transformer. Different application scenarios have different requirements for the computing power of AI algorithms, and this kind of more general AI chip leaves room for improvement on a single deep learning model. Only by designing a customized chip architecture for the actual scenario can power consumption and cost be reduced while computing performance is greatly improved.

II. BACKGROUND

AI algorithms are mostly computationally intensive and parallelizable. For the Transformer model, take the machine translation in reference [1] as an example, as shown in Figure 1. The matrix X is obtained by adding the embedding of each word and the embedding of the word's position in the input sentence. The number of columns is the embedding dimension of 512, and the number of rows is determined by the number of words in the sentence. An Encoder accepts the X matrix; X then passes through the two sub-layers of the Encoder, the Multi-Head Attention layer (consisting of multiple Self-Attention heads) and the Feed Forward layer. Here, we only analyze its specific calculation method, as shown in formula 1.

Fig. 1. Transformer diagram (the input X = positional encoding + word embedding is projected by W_Q, W_K, W_V for each head; softmax(QK^T / √d_k) × V yields the per-head outputs Z_1…Z_n, which are combined through W_1…W_n into Z; then X + Z, Layer Normalization (E), Feedforward Neural Network (F), Layer Normalization (F+G))

softmax((X × W_Q)(X × W_K)^T / √d_K) × (X × W_V) × W        (1)

MAX(0, F·W_F1 + b_1)·W_F2 + b_2        (2)

W_Q, W_K, W_V and W are obtained through training, and the calculation steps are as follows:
• Output matrix Z after the Formula 1 calculation.
• Add Z and X to get E.
• Perform layer normalization on E to get F.
• Pass F through the feedforward neural network to get G. The feedforward neural network is divided into two layers, as shown in formula 2; one layer uses the relu activation function and the other does not.
• Layer normalization of (F + G).
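To make formulas 1 and 2 and the steps above concrete, the following is a minimal NumPy sketch of one Encoder pass for a single attention head. The variable names (X, W_Q, W_K, W_V, W, W_F1, b_1, W_F2, b_2) follow the formulas; the matrix sizes, the 2048-wide inner FFN layer and the layer-normalization helper are illustrative assumptions, not details fixed by this paper.

import numpy as np

def layer_norm(a, eps=1e-6):
    # Illustrative layer normalization over the embedding dimension.
    return (a - a.mean(axis=-1, keepdims=True)) / np.sqrt(a.var(axis=-1, keepdims=True) + eps)

n_words, d_model, d_k, d_ff = 64, 512, 64, 2048          # assumed sizes
rng = np.random.default_rng(0)
X = rng.standard_normal((n_words, d_model))              # word embedding + positional encoding
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) for _ in range(3))
W = rng.standard_normal((d_k, d_model))
W_F1, b_1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W_F2, b_2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)

# Formula 1: scaled dot-product attention followed by the output projection W.
scores = (X @ W_Q) @ (X @ W_K).T / np.sqrt(d_k)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)                 # row-wise softmax
Z = attn @ (X @ W_V) @ W

E = X + Z                                                # add Z and X
F = layer_norm(E)                                        # layer normalization
G = np.maximum(0, F @ W_F1 + b_1) @ W_F2 + b_2           # Formula 2: two-layer FFN with relu
encoder_out = layer_norm(F + G)                          # one Encoder output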
This gives one Encoder output. In literature [1] there are 6 such Encoders and 6 Decoders (the Decoder's calculation method is similar to the Encoder's).

Figure 1 shows the calculation process intuitively; it involves a large number of matrix additions, multiplications, Softmax and other operations. A traditional CPU usually performs only one multiplication at a time; even a more advanced large-scale server with a superscalar CPU and a 128-bit data width, executing SIMD instructions, can only compute 4 32-bit fixed-point multiplications at the same time. Therefore, it is necessary to design a hardware-friendly Transformer processor at the hardware level.

In this article, on the basis of summarizing the operational characteristics of Transformer, we extend the RISC-V instructions to abstract descriptions of mainstream Transformers at the instruction level. Our work is summarized as follows:
• Based on the extension rules of RISC-V, an instruction set is designed to optimize typical Transformer operations.
• A dedicated processor, RISC-VTF, is deployed on an FPGA to verify the corresponding extended instruction set.
• The micro architecture is optimized, memory access is optimized, and computing resources are reused.

III. RISC-VTF DESIGN

RISC-V is an open-source, modular, incremental ISA that allows individuals or enterprises to modify it according to their own needs. RISC-V's base instruction set RV32I has 40 instructions covering arithmetic logic, control transfer, and load and store. RISC-V also provides a wealth of extension instructions; after implementing the base instructions, users can selectively include these extensions according to actual needs. Finally, RISC-V reserves opcode space for custom instructions [16]; users can easily modify and extend the hardware decoding, the computation units and the software compiler, which makes RISC-V very suitable for the design of dedicated processors.

Figure 2 shows the main instruction formats, and Figure 3 shows the operand encoding of the basic instructions.

R-type: funct7 | rs2 | rs1 | funct3 | rd | opcode
I-type: imm[11:0] | rs1 | funct3 | rd | opcode
S-type: imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode
U-type: imm[31:12] | rd | opcode
Fig. 2. Basic instruction format

Fig. 3. Instruction operand encoding (the standard RV32 opcode map indexed by inst[6:5] and inst[4:2]; the custom-0, custom-1, custom-2 and custom-3 slots are reserved for user-defined instructions)

A. Design of Transformer Instruction Set

The calculation process of Transformer is described in detail above, and Figure 1 shows this process intuitively. We analyze the calculation process and its characteristics, and, using the custom-instruction reserved space custom-0, the following instructions are designed: MLOAD, MSTORE, MMUL, MADD, MSFT, MRELU.

opcode[6:0] = 0001011; funct3[14:12] = 000 with funct7[31:25] = 0000001 (MLOAD), 0000010 (MSTORE), 0000011 (MSFT), 0000100 (MRELU); funct3 = 010 (MMUL); funct3 = 011 (MADD)
Fig. 4. Custom instruction decoding

The design of the instructions follows two points: one is to conform to the expansion rule, and the other is to reuse the decoding circuit of the basic rv32i instructions as far as possible. The following is a detailed description.

1) Matrix load and store instructions: The instruction format is shown in Figure 5. The funct3 value 000 indicates a single-matrix operation instruction. The funct7 code 0000001 defines the matrix load operation (MLOAD). It is used to load the embedding matrix of a sentence or a weight matrix from main memory to the on-chip cache. The values of the rs1, rm and rd registers specify the off-chip address, the size and the on-chip destination address of the matrix to be loaded. It should be noted that rm is not the address of a general register; since a custom instruction involves at most two matrix operands, we define two additional 32-bit registers, ms1 and ms2, to store the matrix sizes, and only the custom instructions access these two registers. The high 16 bits and low 16 bits of such a register store the length and width of the matrix respectively, and rm is only valid in its lowest bit. The funct7 code 0000010 defines the matrix store operation (MSTORE). Its function is the opposite of the matrix load, and its decoding is similar to that of MLOAD.

MLOAD: funct7=0000001 | rm (m_size) | rs1 (m_addr) | funct3=000 | rd (dest_addr) | opcode=0001011 — MLOAD rd,rs1,rm
MSTORE: funct7=0000010 | rm (m_size) | rs1 (m_addr) | funct3=000 | rd (dest_addr) | opcode=0001011 — MSTORE rd,rs1,rm
Fig. 5. Instruction for matrix load and store

2) Matrix operation instructions: The instruction format is shown in Figure 6. The funct3 values 010 and 011 define matrix multiplication (MMUL) and addition (MADD). Among the operands, rs1 and rs2 specify the on-chip starting addresses of the two matrices to be operated on, and rd is the on-chip destination address for storing the result. We decode the high three bits and low three bits of the immediate field into the addresses of the additional registers whose values specify the length and width of the two operand matrices.

MMUL: imm[6:0] (rm1_rm2) | rs2 (m2_addr) | rs1 (m1_addr) | funct3=010 | rd (dest_addr) | opcode=0001011 — MMUL rd,rs1,rs2,imm
MADD: imm[6:0] (rm1_rm2) | rs2 (m2_addr) | rs1 (m1_addr) | funct3=011 | rd (dest_addr) | opcode=0001011 — MADD rd,rs1,rs2,imm
Fig. 6. Instruction for matrix multiplication and addition

3) Softmax and activation instructions: The instruction format is shown in Figure 7. The funct3 value 000 indicates a single-matrix operation instruction. The funct7 code 0000011 indicates the softmax instruction (MSFT), which performs the softmax operation on a matrix. The funct7 code 0000100 indicates the relu instruction (MRELU), used to activate a matrix. Their decoding is similar. The values of the rs1, rm and rd registers specify the on-chip start address, the size and the on-chip destination address of the matrix to be operated on.

MSFT: funct7=0000011 | rm (m_size) | rs1 (m_addr) | funct3=000 | rd (dest_addr) | opcode=0001011 — MSFT rd,rs1,rm
MRELU: funct7=0000100 | rm (m_size) | rs1 (m_addr) | funct3=000 | rd (dest_addr) | opcode=0001011 — MRELU rd,rs1,rm
Fig. 7. Instruction for matrix softmax and relu
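As a concrete reading of the encodings in Figures 4–7, the sketch below packs the described fields into 32-bit instruction words (opcode 0001011 in the custom-0 slot, funct3 selecting the instruction group, and funct7 or the 7-bit immediate distinguishing the single-matrix instructions or naming the ms size registers). It is only our illustration of the bit layout implied by the figures, not the authors' tooling; the helper names are hypothetical.

OPCODE_CUSTOM0 = 0b0001011     # custom-0 opcode used by all extended instructions
FUNCT7 = {"MLOAD": 0b0000001, "MSTORE": 0b0000010, "MSFT": 0b0000011, "MRELU": 0b0000100}

def pack_word(f7_or_imm, rs2_or_rm, rs1, funct3, rd, opcode=OPCODE_CUSTOM0):
    # R-type-shaped layout: [31:25] funct7/imm | [24:20] rs2/rm | [19:15] rs1
    #                       [14:12] funct3     | [11:7]  rd     | [6:0]   opcode
    return (f7_or_imm << 25) | (rs2_or_rm << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode

def encode_mload(rd, rs1, rm):
    # MLOAD rd, rs1, rm: funct3 = 000, funct7 = 0000001, rm occupies the rs2 slot.
    return pack_word(FUNCT7["MLOAD"], rm, rs1, 0b000, rd)

def encode_mmul(rd, rs1, rs2, imm7):
    # MMUL rd, rs1, rs2, imm: funct3 = 010, the 7-bit immediate names the ms size registers.
    return pack_word(imm7, rs2, rs1, 0b010, rd)

print(hex(encode_mload(rd=2, rs1=1, rm=0)))              # MLOAD r2, r1, ms1 (assuming rm = 0 selects ms1)
print(hex(encode_mmul(rd=5, rs1=2, rs2=12, imm7=0x01)))  # MMUL r5, r2, r12, 0x01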
B. Processor micro architecture design

In this paper, a five-stage pipelined 32-bit processor is designed, which includes 32 general registers, two 32-bit registers for the extra matrix size storage, data buffer memory, and peripheral devices such as flash, SRAM and SDRAM. All rv32i instructions and the user-defined instructions are implemented. The general internal structure is shown in Figure 8.

Fig. 8. Overall structure diagram (five-stage pipeline with IF_ID, ID_EX, EX_MEM and MEM_WB registers and a control unit; ALU and MALU execution units; register file and ms1/ms2 size registers; ROM, RAM and M_RAM; connected over a Wishbone bus)

The pipeline includes five stages: instruction fetch, decode, execute, memory access and write back. The execution unit is the main part of the design; it includes the general computing unit (the ALU unit in Figure 8) and the custom-instruction computing unit (the MALU unit in Figure 8).

After decoding, the basic logic, arithmetic, shift and other instructions are executed in the general computing unit, while the calculation operations of the custom instructions are completed in the MALU unit. The MALU unit exchanges data with off-chip memory indirectly through an on-chip cache (M_RAM). Because the MALU unit is the core unit, we next introduce the MALU unit and the data access optimization method in detail.

1) Design of MALU unit: The overall structure is shown in Figure 9.

Fig. 9. MALU unit (MMA, SoftMax and Relu sub-modules exchanging data with the M_RAM cache under control signals)

The MMA module implements 32 32-bit floating-point multipliers and several 32-bit floating-point adders to perform matrix multiplication and addition. Matrix multiplication can be mapped onto vector multiplication, so the fully-connected layer can also reuse the computing circuit; it is in essence vector multiply-accumulate. Through the MMUL instruction we know the initial addresses and sizes of the two operand matrices, and the number of clock cycles needed to multiply them depends on the matrix size. Considering the possible transpose operation in matrix multiplication, this paper sets an offset value that changes with the number of clock cycles spent in the calculation. The offset is combined with the initial address of the matrix and the length and width of the two matrices given in the instruction (the length and width determine whether a matrix needs to be transposed); it identifies the position, within the matrix, of the element to be operated on in each clock cycle, so that the operation covers the entire matrix. The addition operation proceeds similarly to multiplication, and the addition circuit is reused to the greatest extent. Its structure is shown in Figure 10.

Fig. 10. MMA module (parallel multipliers feeding an adder tree that accumulates the products)
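To illustrate how the MMA module's matrix multiplication reduces to vector multiply-accumulate steered by an offset that walks the operand matrices (optionally in transposed order), here is a small software model. The row-major layout, the indexing helper and the function names are our assumptions for illustration, not the hardware's actual address generator.

def mat_elem(buf, r, c, n_cols):
    # Row-major element access: the layout assumed for the on-chip buffers here.
    return buf[r * n_cols + c]

def mmul_model(a_buf, b_buf, m, k, n, transpose_b=False):
    # Software model of MMUL: each output element is one vector multiply-accumulate,
    # and the index arithmetic plays the role of the per-cycle offset described above.
    out = [0.0] * (m * n)
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                a = mat_elem(a_buf, i, p, k)
                # When B must be transposed, the walk simply swaps its row/column roles.
                b = mat_elem(b_buf, j, p, k) if transpose_b else mat_elem(b_buf, p, j, n)
                acc += a * b
            out[i * n + j] = acc
    return out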

The SoftMax module implements the nonlinear calculation shown in Equation 3:

f(x_i) = e^{x_i} / Σ_{j=1}^{n} e^{x_j}        (3)

To implement the nonlinear calculation in hardware, this article uses a piecewise linear approximation method, a combination of first-order polynomials and a lookup table, to implement the exponential calculation and reduce the amount of computation and resource consumption. The denominator is then obtained from the exponentials of the multiple input data. Implementing the final division directly in hardware would cost a lot of resources, so this paper transforms the division into a single reciprocal followed by multiple multiplications. Since the storage location of the final result is known from the MSFT instruction, the result can be temporarily saved back to the destination address while the exponential operation is performed; in this way, when the value of e^{x_i} is needed again, it does not have to be calculated a second time. The structure diagram is shown in Figure 11.

Fig. 11. Softmax module (input, exponential op_unit, reciprocal op_unit and multiplication op_unit around M_RAM)
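A small numerical sketch of the scheme described above: the exponential is approximated by a piecewise-linear lookup (the segment count and input range here are arbitrary illustrative choices, not the paper's table parameters), the exponentials are computed once and reused, and the division is replaced by one reciprocal followed by multiplications.

import numpy as np

# Piecewise-linear approximation of exp on a clamped range; 32 first-order segments
# are an assumed illustrative choice.
_BREAKS = np.linspace(-8.0, 0.0, 33)
_SLOPES = np.diff(np.exp(_BREAKS)) / np.diff(_BREAKS)

def exp_pwl(x):
    x = np.clip(x, _BREAKS[0], _BREAKS[-1])
    i = np.minimum(np.searchsorted(_BREAKS, x, side="right") - 1, len(_SLOPES) - 1)
    return np.exp(_BREAKS[i]) + _SLOPES[i] * (x - _BREAKS[i])   # first-order segment

def softmax_row(row):
    shifted = row - row.max()        # keep inputs inside the approximated range
    e = exp_pwl(shifted)             # exponentials computed once and stored for reuse
    inv_sum = 1.0 / e.sum()          # a single reciprocal...
    return e * inv_sum               # ...then only multiplications

print(softmax_row(np.array([1.0, 2.0, 3.0])))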
The implementation of the Relu module is relatively simple and will not be introduced here.

2) Optimization of memory access: After two matrices are loaded, their calculation usually takes several cycles, and the pipeline has to be suspended until the calculation on the loaded matrices is completed. Since the matrix calculations in Transformer are independent of each other and have no data dependence, a matrix load instruction can be placed after a matrix calculation instruction in order to further improve execution efficiency and reduce data access delay, so that data can continue to be loaded from off-chip while a matrix calculation is in progress.

As shown in Figure 12, there are two buffers in M_RAM. The idea is to use a multiplexer to select the data buffer that receives the data when loading from outside the chip. For example, if the data operated on in the matrix calculation unit comes from buffer 1, then the data being loaded is sent to buffer 2 at that moment, and vice versa, alternately. This double-buffering mechanism effectively uses idle time to reduce data access latency.

Fig. 12. Double-buffered mechanism (input data multiplexed between Data Buffer 1 and Data Buffer 2, which feed the MALU)
performed alternately. This double buffering mechanism can ...
effectively use idle time to reduce data access latency. MTMS ms1,0x02000040 #[ms1]=0x02000040,length:512,width:64
MLOAD r2,r1,ms1 #load m1(input F)
MLOAD r12,r11,ms1 #load m2(input Wf 1 )
MMUL r5,r2,r12,0x00 #r5:temporary address; 0x00:ms1,ms1
MLOAD r14,r13,ms1 #load m3(input Wb1 )
Data Buffer1 MADD r6,r5,r14,0x00 #r6:temporary address; 0x00:ms1,ms1
Input MUX1 MUX2 MRELU r7,r6,ms1 #r7:temporary address
MALU
Data
Data Buffer2 MLOAD r16,r15,ms1 #load m4(input Wf 2 )
MMUL r8,r7,r16,0x00 #r8:temporary address; 0x00:ms1,ms1
MLOAD r18,r17,ms1 #load m5(input Wb2 )
MADD r9,r8,r18,0x00 #r9:temporary address; 0x00:ms1,ms1
Fig. 12. Double-buffered mechanism ... #Layer Normalization
MSTORE...
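For reference, the MTMS immediates used above pack the matrix length into the high 16 bits of an ms register and the width into the low 16 bits; a short check (the helper is ours) decodes the constants from the listing:

def decode_msize(value):
    # High 16 bits: matrix length; low 16 bits: matrix width, as used by the custom instructions.
    return value >> 16, value & 0xFFFF

assert decode_msize(0x02000040) == (512, 64)    # the 512x64 input matrix
assert decode_msize(0x00400040) == (64, 64)     # the 64x64 weight matrices
assert decode_msize(0x02000200) == (512, 512)   # the 512x512 attention matrix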

Based on the Transformer model in reference [1], this paper takes a 512 × 64 input matrix as an example to verify RISC-VTF. The comparison targets are a general-purpose CPU (i7 8750H) and an NVIDIA GPU (gp107). The riscv32-unknown-linux-gnu cross-compiler tool chain is used to compile the program, and the user-defined instructions are emitted as embedded instruction code. The evaluation results are shown in the figures below.

Fig. 13. Code length (normalized to the RISC-V extended instructions) of MMUL, MADD, Softmax and Relu compared with MIPS and x86

Fig. 14. Assessment results: code length relative to MIPS and x86, and speed and energy consumption of RISC-VTF relative to the i7 8750H CPU and the gp107 GPU

The results show that the encoding length of the instructions designed in this paper is 12.3 times shorter than that of MIPS and 15.4 times shorter than that of x86. This is because an operation completed by one custom instruction requires multiple instructions on MIPS and x86. In terms of performance, RISC-VTF is 18.2 times faster than the CPU (i7 8750H) and 1.47 times faster than the GPU (gp107). This is because we have specially designed a computing circuit for the general and intensive computing part, which can perform multiple computing operations in one clock cycle; however, the GPU may become faster than RISC-VTF if the amount of computation grows large enough. In terms of energy consumption (the product of power and execution time), the CPU (i7 8750H) consumes 32.4 times as much as RISC-VTF, and the GPU (gp107) 22.5 times as much. The experimental results show that RISC-VTF is superior to general-purpose processors in speed and energy consumption. We also tried to add an instruction for the matrix layer normalization operation, which would further shorten the code and speed up execution, but the calculation is too complicated and would require many more resources to implement in hardware; after weighing speed against resource usage, we gave up this idea.

V. CONCLUSION

According to the computing characteristics of the Transformer algorithm, this paper designs an instruction set to speed up the Transformer module. We propose a micro architecture, design two schemes to perform the calculation, and use one scheme to optimize the data access. The design has been implemented on an FPGA in Verilog. Experimental results show that RISC-VTF implements the Transformer efficiently with low hardware consumption. However, to fully deploy AI algorithms that use Transformer, we need to implement a compiler that supports the custom instructions. This is future work.

ACKNOWLEDGMENT

This paper was sponsored by the Key Project of Hubei Provincial Department of Education under Grant No. D20201103. We thank them for their support of our research.

REFERENCES

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," arXiv preprint arXiv:1706.03762, 2017.
[2] A. Rakhlin, "Convolutional neural networks for sentence classification," GitHub, 2016.
[3] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, "Convolutional sequence to sequence learning," in International Conference on Machine Learning, pp. 1243–1252, PMLR, 2017.
[4] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[5] M.-T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," arXiv preprint arXiv:1508.04025, 2015.
[6] G. Tang, M. Müller, A. Rios, and R. Sennrich, "Why self-attention? A targeted evaluation of neural machine translation architectures," arXiv preprint arXiv:1808.08946, 2018.
[7] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving language understanding by generative pre-training," 2018.
[8] B. Gunel, C. Zhu, M. Zeng, and X. Huang, "Mind the facts: Knowledge-boosted coherent abstractive text summarization," arXiv preprint arXiv:2006.15435, 2020.
[9] P. Li, "An empirical investigation of pre-trained transformer language models for open-domain dialogue generation," arXiv preprint arXiv:2003.04195, 2020.
[10] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[11] C. Zhang, G. Sun, Z. Fang, P. Zhou, P. Pan, and J. Cong, "Caffeine: Toward uniformed representation and acceleration for deep convolutional neural networks," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 38, no. 11, pp. 2072–2085, 2018.
[12] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2016.
[13] T. Luo, S. Liu, L. Li, Y. Wang, S. Zhang, T. Chen, Z. Xu, O. Temam, and Y. Chen, "DaDianNao: A neural network supercomputer," IEEE Transactions on Computers, vol. 66, no. 1, pp. 73–88, 2016.
[14] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Teman, X. Feng, X. Zhou, and Y. Chen, "PuDianNao: A polyvalent machine learning accelerator," ACM SIGARCH Computer Architecture News, vol. 43, no. 1, pp. 369–381, 2015.
[15] S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen, "Cambricon: An instruction set architecture for neural networks," in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 393–405, IEEE, 2016.
[16] Y. Lee, A. Waterman, H. Cook, B. Zimmer, B. Keller, A. Puggelli, J. Kwak, R. Jevtic, S. Bailey, M. Blagojevic, et al., "An agile approach to building RISC-V microprocessors," IEEE Micro, vol. 36, no. 2, pp. 8–20, 2016.
