NeuralScale: A RISC-V Based Neural Processor Boosting AI Inference in Clouds
Rongkai Zhan, Xiaobo Fan
{mark,shawn}@streamcomputing.com
Stream Computing Inc.
Beijing, China
unit conducts standard vector operations. The matrix execution unit conducts matrix operations, e.g. the GEMM operation, which is at the heart of AI inference workloads. An industrial product configured with 32 NeuralScale cores is implemented in a TSMC 12nm FinFET process technology. Evaluation results on ResNet-50 V1.5 and BERT inference show that our processor achieves state-of-the-art throughput, latency, and energy efficiency for both CNN and NLP models.

2 NEURALSCALE ARCHITECTURE
2.1 Scalar Core
We adopt the AndesCore N25F core [12] as the scalar core, which is a 32-bit RISC-V CPU IP core with vector extension support. The scalar core has a 5-stage in-order execution pipeline and separate instruction and data caches. Features also include dynamic branch prediction for efficient branch execution. It is capable of delivering high per-MHz performance and operating at high frequencies with small gate counts. Figure 2 illustrates a high-level overview of the NeuralScale architecture. As the control flow shows, the scalar core fetches and decodes all instructions and diverts them to the correct path based on their types. Scalar instructions are executed in order in the scalar pipeline, while vector instructions flow through the scalar pipeline to the neural processor core.

2.2 Neural Processor Core
The neural processor core combines the features of vector processors and AI inference accelerators. As shown in Figure 2, the computation components include a MAC vector for executing vector operations, a MAC matrix for executing matrix operations, and a POLY module for complex arithmetic such as exp, div, and sqrt computations. On-chip memory components include a vector register file (the REG Bank module) as well as three local buffers, named L1 Buffer, Data Input Buffer, and Intermediate Buffer respectively.

The neural processor core's pipeline is conceptually divided into 4 stages: decode, issue, execute, and write-back. As the control flow in Figure 2 shows, the scalar core diverts the vector instructions to the neural processor core. Vector instructions are further decoded into micro-ops in the decode unit and then dispatched to the issue unit. The issue unit issues instructions to the corresponding execution units based on their operation types. There are three execution units, one per operation type: a vector MAC engine (VME), a matrix MAC engine (MME), and a memory transmission engine (MTE), each explained below. The issue unit maintains three instruction buffers tracking the state of all inflight instructions in each execution unit. A dispatched instruction from the decode unit is buffered according to its operation type and is removed once it is committed by the execution unit. All instructions are issued in order, and an instruction can be issued only when there is no address overlap with inflight instructions. The issue unit can issue at most three instructions, in which case all three execution units work simultaneously. Memory latency can hence be partially hidden by computation, which lifts overall performance. After execution, the results are written back to vector registers or local buffers.

Figure 2: A high-level overview of the NeuralScale architecture.

2.2.1 VME. VME performs all base vector extension instructions and part of the customized vector extension instructions with the MAC vector and POLY module. The MAC vector consists of a vector of multiply-and-add units, which support both 16-bit half-precision floating-point (FP16) and 8-bit integer (INT8) arithmetic. The POLY module contains exp, div, and sqrt function units for the complex arithmetic in activation functions or classifiers during AI inference. As the data flow shows, the source operands for VME may come from vector registers or local buffers, and the results may be written back to vector registers or local buffers. When performing base vector extension instructions, VME reads source operands from the specified source vector registers and writes results back to the destination vector register. For customized vector extension instructions, VME reads source operands from L1 Buffer or Intermediate Buffer based on the addresses specified by the source general-purpose registers, and writes results back to L1 Buffer or Intermediate Buffer based on the address specified by the destination general-purpose register. Final results will usually be written back to the L1 Buffer.

2.2.2 MME. MME performs customized vector extension instructions related to matrix or convolution operations with the MAC matrix. Note that convolution operations are implemented with the GEMM algorithm. The MAC matrix comprises 𝑚 × 𝑛 multiply-and-add units, where 𝑚 is the height of the MAC matrix and 𝑛 is its width. Each MAC unit supports both FP16 and INT8 arithmetic. When fully utilized, the MAC matrix can compute 𝑚 × 𝑛 FP16 operations or 2 × 𝑚 × 𝑛 INT8 operations simultaneously. That is why the MAC matrix is the most important contributor to the compute power of AI chips. MME reads source operands from the Data Input Buffer and the Weight Buffer in L1 Buffer, and writes results back to the Intermediate Buffer. The Data Input Buffer fetches data from the Data IO Buffer in L1 Buffer or from the Intermediate Buffer, depending on the addresses specified by the source general-purpose register.
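As a rough illustration of the peak-rate statement in Section 2.2.2, the following Python sketch computes the per-cycle peak of an 𝑚 × 𝑛 MAC matrix and the resulting ideal lower bound on cycles for a GEMM. The function names are hypothetical, the 64 × 32 shape in the demo is the MAC matrix size listed in Table 6, and the bound assumes every MAC unit is busy in every cycle. It ignores data movement, tiling edge effects, and pipeline fill, so it says nothing about the real MME schedule.

```python
import math

# Operations each MAC unit completes per cycle, following Section 2.2.2:
# one FP16 multiply-and-add, or two INT8 multiply-and-adds.
OPS_PER_MAC_UNIT = {"fp16": 1, "int8": 2}


def peak_macs_per_cycle(m: int, n: int, dtype: str) -> int:
    """Peak multiply-and-adds per cycle of a fully utilized m x n MAC matrix."""
    if m <= 0 or n <= 0:
        raise ValueError("m and n must be positive")
    if dtype not in OPS_PER_MAC_UNIT:
        raise ValueError(f"unsupported dtype: {dtype!r}")
    return OPS_PER_MAC_UNIT[dtype] * m * n


def ideal_gemm_cycles(rows: int, inner: int, cols: int,
                      m: int, n: int, dtype: str) -> int:
    """Ideal lower bound on cycles for a (rows x inner) @ (inner x cols) GEMM.

    Idealization (an assumption of this sketch): all MAC units are busy in
    every cycle, so cycles = ceil(total multiply-and-adds / peak rate).
    """
    total_macs = rows * inner * cols
    return math.ceil(total_macs / peak_macs_per_cycle(m, n, dtype))


if __name__ == "__main__":
    # 64 x 32 is the MAC matrix shape from Table 6; the GEMM shape is made up.
    for dtype in ("fp16", "int8"):
        print(dtype,
              peak_macs_per_cycle(64, 32, dtype),
              ideal_gemm_cycles(128, 64, 32, 64, 32, dtype))
```

Under these assumptions, switching the same GEMM from FP16 to INT8 halves the ideal cycle count, which is the practical meaning of the 2 × 𝑚 × 𝑛 figure.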
NeuralScale: A RISC-V Based Neural Processor Boosting AI Inference in Clouds CARRV’21, June 17, 2021
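The issue policy of Section 2.2 (in-order issue, one instruction buffer per execution unit, at most three instructions issued, and no address overlap with inflight instructions) can be sketched as a small behavioral model. This is a minimal Python sketch under stated assumptions, not the hardware design: the names MicroOp and IssueUnit are hypothetical, operands are modeled as half-open address ranges, and the overlap check is applied conservatively to both reads and writes because the text does not say which hazard types are distinguished.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Tuple

Range = Tuple[int, int]  # half-open [start, end) address range (assumed model)


@dataclass(frozen=True)
class MicroOp:
    name: str
    unit: str                        # "VME", "MME", or "MTE"
    reads: Tuple[Range, ...] = ()
    writes: Tuple[Range, ...] = ()

    def ranges(self) -> Tuple[Range, ...]:
        return self.reads + self.writes


def overlaps(a: Range, b: Range) -> bool:
    """True when two half-open ranges share at least one address."""
    return a[0] < b[1] and b[0] < a[1]


class IssueUnit:
    UNITS = ("VME", "MME", "MTE")

    def __init__(self) -> None:
        self.pending: Deque[MicroOp] = deque()
        # Three instruction buffers, one per execution unit.
        self.inflight: Dict[str, List[MicroOp]] = {u: [] for u in self.UNITS}

    def dispatch(self, op: MicroOp) -> None:
        if op.unit not in self.inflight:
            raise ValueError(f"unknown execution unit: {op.unit!r}")
        self.pending.append(op)

    def _blocked(self, op: MicroOp) -> bool:
        return any(overlaps(mine, theirs)
                   for ops in self.inflight.values()
                   for other in ops
                   for theirs in other.ranges()
                   for mine in op.ranges())

    def issue(self, max_issue: int = 3) -> List[MicroOp]:
        """Issue in order; a blocked head instruction stalls everything behind it."""
        issued: List[MicroOp] = []
        while self.pending and len(issued) < max_issue:
            if self._blocked(self.pending[0]):
                break
            op = self.pending.popleft()
            self.inflight[op.unit].append(op)
            issued.append(op)
        return issued

    def commit(self, op: MicroOp) -> None:
        """Remove an instruction from its buffer once the unit commits it."""
        self.inflight[op.unit].remove(op)


if __name__ == "__main__":
    iu = IssueUnit()
    load = MicroOp("mov", "MTE", writes=((0, 64),))
    gemm = MicroOp("memul", "MME", reads=((0, 64), (64, 128)), writes=((256, 320),))
    add = MicroOp("veadd", "VME", reads=((512, 576),), writes=((576, 640),))
    for op in (load, gemm, add):
        iu.dispatch(op)
    print([op.name for op in iu.issue()])   # memul waits on mov's destination
    iu.commit(load)
    print([op.name for op in iu.issue()])
```

In the demo, the matrix multiply reads the range the transfer is still writing, so it stalls, and the independent vector add stalls behind it because issue is in order. Once the transfer commits, both issue together and MME and VME run in parallel.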
2.2.3 MTE. MTE connects the local L1 Buffer with other memory components outside the core through NoCs. Outside memory components include remote L1 Buffers in a multi-core scenario and peripheral memory such as the last level buffer (LLB) or external DDR DRAMs. In the first case, MTE exchanges data between L1 Buffers in a point-to-point manner. In the second case, MTE can exchange data between L1 Buffer and LLB in a point-to-point manner, or broadcast the data in LLB to all corresponding L1 Buffers.

2.3 Instruction-Set Extension
The RISC-V Vector extension (RVV) [19] enables processor cores based on the RISC-V instruction set architecture to process data arrays, alongside traditional scalar operations, to accelerate the computation of single instruction streams on large data sets. The scalar core adopted in our processor implements RV32G. Therefore, we implement standard extensions including the base RVV extension (v0.8) and customized vector extensions with a fixed-width 32-bit instruction format in our neural processor core. We use the custom-3 opcode (1111011) in the RISC-V base opcode map as the major opcode for customized vector extensions, marked as OP-VE. All customized vector extensions keep the source (rs1 and rs2) and destination (rd) registers at the same position as the base RISC-V ISA does to simplify decoding, as shown in Table 2.

Table 2: Format for customized vector extension

bits    31:26    25    24:20   19:15   14   13:12   11:7   6:0
field   funct6   dmc   rs2     rs1     dm   opm2    rd     1111011

The opm2 field encodes the source operand types and source locations, as listed in Table 3. For a vector or matrix operand, the general-purpose register provides the memory address of the values, marked as (rsx).

Table 3: opm2[1:0] encoding

opm2[1:0]   source operand types   source1   source2
00          mm (matrix-matrix)     (rs1)     (rs2)
01          m (matrix)             (rs1)     null
10          mv (matrix-vector)     (rs1)     (rs2)
11          mf (matrix-scalar)     (rs1)     rs2

The matrix operation directions are encoded using the dmc and dm fields, as displayed in Table 4. Taking matrix-vector addition as an example, {𝑑𝑚𝑐, 𝑑𝑚} = 10 indicates adding a row vector to a matrix, while {𝑑𝑚𝑐, 𝑑𝑚} = 01 indicates adding a column vector to a matrix.

Table 4: dmc and dm encoding

dmc   dm   Operation directions
0     X    matrix operations on full elements
1     0    matrix operations vertically on row vectors
0     1    matrix operations horizontally on column vectors

The funct6 field encodes operation types, including addition, subtraction, multiplication, accumulation, etc. Some funct6 codes are listed in Table 5 for illustration. Typical operations such as convolutions and activation functions in AI inference workloads are all covered.

Table 5: funct6 encoding

funct6   Name       Description
000001   veadd      add
000010   vesub      subtract
000011   veacc      accumulate
000101   veemul     element-wise multiply
011001   memul      matrix multiply
000110   veemacc    element-wise multiply-accumulate
011010   meconv     convolution
001001   velkrelu   Leaky ReLU activation function
001011   mov        transfer data with MTE

A total of 53 customized instructions are added on top of the base RVV extension. For many matrix-related operations, information such as the height and width of the matrix cannot be encoded within the 32-bit fixed-width instruction. Therefore, 22 unprivileged vector CSRs are added to the base RVV extension. Table 1 lists several of them for illustration. Customized vector CSRs can only be updated with CSR instructions defined in the base scalar RISC-V ISA. The values should be properly set to match application needs.

3 SOC PLATFORM IMPLEMENTATION
Based on the NeuralScale architecture, we implement an industrial SoC platform named P920 for AI inference in clouds. A complete toolchain suite including a graph compiler, runtime, and driver is also released for developers.

3.1 Core Configuration
P920 consists of 32 NeuralScale cores, and the configuration of each core is listed in Table 6. The scalar core has a separate L1 Data Cache and L1 Instruction Cache, each of 64KB. The neural processor core has a 1MB L1 Data IO Buffer, a 256KB L1 Weight Buffer, and a 256KB Intermediate Buffer. The size of each local buffer in the neural processor core is selected based on experimental statistics of typical AI inference workloads, which helps to avoid frequently exchanging data between on-chip memory and external memory. The MAC vector in the neural processor core has 64 MAC units, and the MAC matrix in the neural processor core contains 64 × 32 MAC units. Each MAC unit supports both FP16 and INT8 arithmetic, which can be dynamically switched according to the operation type of each instruction.

Table 6: NeuralScale Core Configurations

core                    feature                configuration
scalar core             L1 Data Cache          64KB
scalar core             L1 Instruction Cache   64KB
neural processor core   L1 Data IO Buffer      1MB
neural processor core   L1 Weight Buffer       256KB
neural processor core   Intermediate Buffer    256KB
neural processor core   (VLEN, ELEN)           (1024, 16)
neural processor core   MAC Vector             64 FP16 MACs
neural processor core   MAC Matrix             64×32 FP16 MACs

LLB within a small latency. As there are 32 NeuralScale cores in total, an HSYNC subsystem is used to manage how these cores cooperate and synchronize. NeuralScale cores can be divided into up to 16 groups by the HSYNC subsystem, and the number of cores in each group is configured by the application. That is to say, an application can run either on one group with 32 cores or on multiple groups with several cores in each group. The HSYNC subsystem provides great flexibility, and hence an application can choose the granularity of task division according to its features to make full use of the NeuralScale cores.

P920 has two PCIe subsystems: PCIE0 and PCIE1. Each PCIe subsystem supports up to 16 lanes and can be configured as an endpoint or a root complex. PCIE0 is usually configured as an endpoint, receiving compute tasks and data from the host. PCIE1 is usually configured as a root complex for scalability, connecting to other SoC chips to construct a larger-scale compute platform.

In addition, P920 has four DDR subsystems. Each subsystem has an independent channel of LPDDR4 DRAM, supporting up to 4GB memory capacity and a 4266 MT/s transfer rate. Therefore, the DDR subsystems provide 16GB memory capacity in total and a peak theoretical bandwidth of 136 GB/s for AI inference workloads. During the AI inference process, LLBs need to fetch weights from DRAMs frequently. In order to improve the data transmission efficiency between LLBs and DRAMs, high-performance DMA controllers are integrated into the DDR subsystems. Each DMA controller connects a DDR controller and an LLB through NoCs. As there are eight LLBs and only four DDR channels, each DDR subsystem integrates two DMA controllers with independent DMA channels and data buses.

The peripheral subsystem implements many common hardware devices including UART, SPI, I2C, PWM, and RTC, which play an important role in booting, debugging, and managing the SoC. The CPU subsystem is implemented with an ARM Cortex-A53 core [5]. It features an in-order, 8-stage, dual-issue pipeline, and supports PPI interrupts and up to 64 SPI interrupts. The CPU subsystem is mainly used to initialize a series of devices during SoC startup, including PCIe controllers, DDR controllers, SPI controllers, and other devices. It also monitors and manages the SoC at run time.
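The OP-VE instruction format of Section 2.3 can be made concrete with a small encoder. The following Python sketch packs the fields of Table 2 into a 32-bit word and unpacks them again, using the opm2 codes of Table 3 and only the funct6 codes listed in Table 5. The function names encode_op_ve and decode_op_ve are hypothetical, the register numbers in the demo are arbitrary, and writing 0 into the rs2 field when source2 is null is an assumption of this sketch rather than something the text specifies.

```python
OP_VE = 0b1111011  # custom-3 major opcode, bits 6:0 (Table 2)

# funct6 codes listed for illustration in Table 5 (not the full set of 53).
FUNCT6 = {
    "veadd": 0b000001, "vesub": 0b000010, "veacc": 0b000011,
    "veemul": 0b000101, "veemacc": 0b000110, "velkrelu": 0b001001,
    "mov": 0b001011, "memul": 0b011001, "meconv": 0b011010,
}

# Source operand types from Table 3.
OPM2 = {"mm": 0b00, "m": 0b01, "mv": 0b10, "mf": 0b11}


def encode_op_ve(name: str, opm2: str, rd: int, rs1: int,
                 rs2: int = 0, dmc: int = 0, dm: int = 0) -> int:
    """Pack one customized vector instruction into a 32-bit word.

    Layout (Table 2): funct6[31:26] dmc[25] rs2[24:20] rs1[19:15]
                      dm[14] opm2[13:12] rd[11:7] opcode[6:0]
    rs2 defaults to 0 for the single-source 'm' form (assumption).
    """
    for label, reg in (("rd", rd), ("rs1", rs1), ("rs2", rs2)):
        if not 0 <= reg <= 31:
            raise ValueError(f"{label} must be in 0..31, got {reg}")
    if dmc not in (0, 1) or dm not in (0, 1):
        raise ValueError("dmc and dm are single-bit fields")
    return ((FUNCT6[name] << 26) | (dmc << 25) | (rs2 << 20) | (rs1 << 15)
            | (dm << 14) | (OPM2[opm2] << 12) | (rd << 7) | OP_VE)


def decode_op_ve(word: int) -> dict:
    """Unpack the fields of a 32-bit OP-VE instruction word."""
    if word & 0x7F != OP_VE:
        raise ValueError("not an OP-VE instruction")
    return {
        "funct6": (word >> 26) & 0x3F,
        "dmc": (word >> 25) & 0x1,
        "rs2": (word >> 20) & 0x1F,
        "rs1": (word >> 15) & 0x1F,
        "dm": (word >> 14) & 0x1,
        "opm2": (word >> 12) & 0x3,
        "rd": (word >> 7) & 0x1F,
    }


if __name__ == "__main__":
    # Matrix plus row vector: veadd, matrix-vector operands, {dmc, dm} = 10.
    word = encode_op_ve("veadd", "mv", rd=10, rs1=11, rs2=12, dmc=1, dm=0)
    print(f"{word:#010x}", decode_op_ve(word))
```

Matrix dimensions do not appear in the word at all, which is why the text introduces additional vector CSRs: a complete operation needs those CSRs set before an instruction like this one is issued.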
3.3 Toolchain
We implement an end-to-end inference stack named TensorTurbo for P920 that enables fast and efficient deployment of customers' pre-trained AI models, as shown in Figure 4. TensorTurbo is mainly comprised of a graph compiler and a heterogeneous program engine (HPE). The graph compiler is based on TVM [2] and has been deeply customized for the NeuralScale architecture. It provides C++ and Python inference APIs for popular deep learning frameworks including TensorFlow, PyTorch, MxNet, and Keras. Graph intermediate representations (GIRs) from different frameworks are imported as unified TensorTurbo IRs via the inference API. The graph compiler then applies graph scheduling, operator scheduling, and tiling strategies within an operator, among other optimizations, to find the fastest implementation that makes the most of the hardware features. The HPE provides high-level CUDA-style runtime APIs in the hardware abstraction layer (HAL), enabling functions like device management, kernel launch and management, memory management, etc. The HPE also provides utilities including a GDB debug tool, a performance profiling tool, and a system monitor interface tool by accessing P920's debugging features (event logging, performance counters, breakpointing).

choose the public ResNet-50 v1.5 [17] for Tensorflow to perform image classification tasks in our experiment. The Nvidia V100 GPU performs ResNet-50 inference with FP16 computations as it provides much higher FP16 performance than INT8. The other three platforms perform ResNet-50 inference with INT8 computations. In addition, the batch size is configured to fully exploit the compute performance of each platform: 128 for GPUs, 10 for the Habana Goya chip, and 64 for P920. The performance results of the four platforms, including throughput, power efficiency, and latency, are shown in Figure 5. Our P920 chip can process 14442 images per second (IPS), which is 2.98 times more powerful than the Nvidia T4 GPU, 1.85 times more powerful than the Nvidia V100 GPU, and nearly the same as the Habana Goya chip. With a thermal design power of 130 W, P920's power efficiency is 110 IPS/W, which is 1.59 times that of the Nvidia T4 GPU, 4.23 times that of the Nvidia V100 GPU, and 1.50 times that of the Habana Goya chip. In terms of latency, the Habana Goya chip has the shortest latency with only 0.87 ms. The latency of P920 is 4.43 ms, 5.87 times shorter than the Nvidia T4 GPU, and 3.61 times shorter than the Nvidia V100 GPU.
V100 GPU, and 2.37 times more powerful than the Habana Goya chip. The power efficiency of P920 is 32 sentences per second per watt, which is 1.14 times that of the Nvidia T4 GPU, 2.91 times that of the Nvidia V100 GPU, and 3.56 times that of the Habana Goya chip. The latency of P920 is 7.63 ms, 9.17 times shorter than the Nvidia T4 GPU, 5.24 times shorter than the Nvidia V100 GPU, and very close to the latency of the Habana Goya chip.

4.3 Trace Analysis
The runtime tool traces the instructions and performance counters for further profiling using P920's debugging features. Figure 7 illustrates the profiling results of P920's performance on BERT inference. While P920 is running, scalar cores process scalar instructions for control, neural processor cores process compute-intensive vector instructions, and DMAs transfer data between LLBs and external DRAMs. As expected, processing in neural processor cores takes up most of the total time, 95%. DMAs work in parallel with scalar cores or neural processor cores in 96% of their total time, showing that external memory transfers are very well overlapped with computations in NeuralScale cores.

Inside the neural processor cores, MME units take up 78% of the total cycles. MTE units work in parallel with MME units, VME units, or both in 92% of their total time, showing that on-chip memory transfers between LLBs and L1 Buffers are also well overlapped with computations. VME units work in serial with MTE units and MME units in 45% of their total time, mainly due to data dependencies with MME units, which is the focus of our future optimization.

Figure 7: BERT Performance Profile

5 CONCLUSION
In this paper, we present NeuralScale, a neural processor core architecture based on the RISC-V ISA for AI inference in clouds. NeuralScale takes advantage of customized RISC-V vector extensions to improve programmability and performance. Evaluations on our industrial product P920 demonstrate that our processor can achieve state-of-the-art inference performance on both CNN and NLP tasks. Optimizations will be done in future work to further lift overall performance, including replacing the in-order scalar core with an out-of-order alternative [15] [22] and adjusting instruction granularity or local buffer design to lift the overlapping ratios of VME and MME units.

REFERENCES
[1] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 265–283. https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf
[2] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation (Carlsbad, CA, USA) (OSDI'18). USENIX Association, USA, 579–594.
[3] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2014. DaDianNao: A Machine-Learning Supercomputer. In 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture. 609–622. https://doi.org/10.1109/MICRO.2014.58
[4] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. 2016. Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). 367–379. https://doi.org/10.1109/ISCA.2016.40
[5] CortexA53 2012. ARM Cortex-A53. Retrieved April 29, 2021 from https://developer.arm.com/ip-products/processors/cortex-a/cortex-a53
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
[7] Clément Farabet, Berin Martini, Benoit Corda, Polina Akselrod, Eugenio Culurciello, and Yann LeCun. 2011. NeuFlow: A runtime reconfigurable dataflow processor for vision. In CVPR 2011 WORKSHOPS. 109–116. https://doi.org/10.1109/CVPRW.2011.5981829
[8] Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Logan Adams, Mahdi Ghandi, Stephen Heil, Prerak Patel, Adam Sapek, Gabriel Weisz, Lisa Woods, Sitaram Lanka, Steven K. Reinhardt, Adrian M. Caulfield, Eric S. Chung, and Doug Burger. 2018. A Configurable Cloud-Scale DNN Processor for Real-Time AI. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). 1–14. https://doi.org/10.1109/ISCA.2018.00012
[9] HabanaGoya [n.d.]. Habana GOYA. Retrieved April 29, 2021 from https://habana.ai/inference/
[10] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. 2016. EIE: Efficient Inference Engine on Compressed Deep Neural Network. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). 243–254. https://doi.org/10.1109/ISCA.2016.30
[11] K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Los Alamitos, CA, USA, 770–778. https://doi.org/10.1109/CVPR.2016.90
[12] N25F [n.d.]. AndesCore N25F. Retrieved April 29, 2021 from http://www.andestech.com/en/products-solutions/andescore-processors/riscv-n25f/
[13] NVIDIA-T4 [n.d.]. NVIDIA T4 Tensor Core GPU for AI Inference. Retrieved April 29, 2021 from https://www.nvidia.com/en-us/data-center/tesla-t4/
[14] NVIDIA-V100 [n.d.]. NVIDIA V100 Tensor Core GPU. Retrieved April 29, 2021 from https://www.nvidia.com/en-us/data-center/v100/
[15] Karyofyllis Patsidis, Dimitris Konstantinou, Chrysostomos Nicopoulos, and Giorgos Dimitrakopoulos. 2018. A low-cost synthesizable RISC-V dual-issue processor core leveraging the compressed Instruction Set Extension. Microprocessors and Microsystems 61 (2018), 1–10. https://doi.org/10.1016/j.micpro.2018.05.007
[16] Abhinav Podili, Chi Zhang, and Viktor Prasanna. 2017. Fast and efficient implementation of Convolutional Neural Networks on FPGA. In 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP). 11–18. https://doi.org/10.1109/ASAP.2017.7995253
[17] ResNetV15 [n.d.]. ResNet-50 V1.5 Model for Tensorflow. Retrieved April 29, 2021 from https://github.com/IntelAI/models/blob/master/benchmarks/image_recognition/tensorflow/resnet50v1_5/README.md
[18] RISC-V [n.d.]. RISC-V ISA Specifications. Retrieved April 29, 2021 from https://riscv.org/technical/specifications/
[19] RVV [n.d.]. RISC-V Vector Extension Specification. Retrieved April 29, 2021 from https://github.com/riscv/riscv-v-spec
[20] TIRIAS 2019. Why Your AI infrastructure Needs Both Training and Inference. Retrieved April 29, 2021 from https://www.ibm.com/downloads/cas/QM4BYOPP
[21] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-Based Accelerator Design for Deep Convolutional Neural Networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (Monterey, California, USA) (FPGA '15). Association for Computing Machinery, New York, NY, USA, 161–170. https://doi.org/10.1145/2684746.2689060
[22] Jerry Zhao, Ben Korpan, Abraham Gonzalez, and Krste Asanovic. 2020. SonicBOOM: The 3rd Generation Berkeley Out-of-Order Machine. (May 2020).