RISC-V2: A Scalable RISC-V Vector Processor
Abstract—Machine learning adoption has seen a widespread bloom in recent years, with neural network implementations being at the forefront. In light of these developments, vector processors are currently experiencing a resurgence of interest, due to their inherent amenability to accelerating the data-parallel algorithms required in machine learning environments. In this paper, we propose a scalable and high-performance RISC-V vector processor core. The presented processor employs a triptych of novel mechanisms that work synergistically to achieve the desired goals. An enhanced vector-specific incarnation of register renaming is proposed to facilitate dynamic hardware loop unrolling and alleviate instruction dependencies. Moreover, a cost-efficient decoupled execution scheme splits instructions into execution and memory-access streams, while hardware support for reductions accelerates the execution of key instructions in the RISC-V ISA. Extensive performance evaluation and hardware synthesis analysis validate the efficiency of the new architecture.

I. INTRODUCTION

The last few years have witnessed the widespread proliferation and massive adoption of machine learning as a fundamental thrust in a multitude of application domains. Increasingly, more aspects of everyday life are being disrupted by new capabilities enabled by machine learning. Neural Networks (NN) have emerged as the most popular approach to implementing machine learning, and they are considered state-of-the-art in such applications as pattern [1], image [2], and speech recognition. The rapid and vast increase in the use of NNs has accentuated the demand for hardware architectures that can accelerate the processing of the various operations encountered in machine learning applications.

Traditional general-purpose processors have focused on Instruction-Level Parallelism (ILP) for decades. Consequently, they are not tuned to effectively handle the massively data-parallel workloads that machine learning algorithms and NNs have brought to the forefront [3]. While the addition of SIMD instructions to the ISA of general-purpose machines partially exploits Data-Level Parallelism (DLP), the obtained throughput is somewhat limited [4]. On the other hand, Graphics Processing Units (GPU) provide very high data parallelism, so they have been extensively used to accelerate NN workloads. Nevertheless, GPUs tend to be power-hungry, and the energy efficiency they can achieve is not adequate for many implementations, e.g., those requiring computation on the edge, where battery life is of paramount importance [5], [6]. To address energy efficiency, researchers have turned to custom architectures targeting specific NN implementations [7], [8], [9]. Even though such application-specific designs are very efficient, they typically offer limited programmability and little flexibility in adapting to the evolving and emerging needs of NN workloads.

The search for high performance and energy efficiency in highly data-parallel workloads has brought vector processors – a concept heavily explored in the 1970s [10] – back into the spotlight. Vector architectures are almost unique in their ability to effectively combine high programmability, high computational throughput, and high energy efficiency [11], [12]. The inception of modern vector processors was triggered by NN applications, which copiously rely on operations that can be readily vectorized [11]. The extensive proliferation of NNs in the last few years is precisely why vector processing is regaining notable traction in the community [13].

Building on this momentum, this paper presents a vector processor architecture that leverages the upcoming RISC-V [14], [15] vector extension [16], which allows RISC-V-based processors to be augmented with a vector processing core. While the proposed architecture is founded on the traditional tenets of vector processing [17], [18], [19], it introduces some novel techniques that reap high performance benefits in a very scalable and cost-effective implementation. Specifically, the new design is spearheaded by three mechanisms that collectively constitute the main contributions of this work:

• A new register remapping technique reimagines the notion of register renaming in a vector processing context. Coupled with a dynamically allocated register file, the new register remapping mechanism enables dynamic hardware-based loop unrolling and optimized instruction scheduling at run-time.

• The design's decoupled execution scheme employs resource acquire-and-release semantics to disambiguate between parallel computation and memory-access instruction streams, thereby allowing for independent execution/memory flow rates.

• A dynamically generated hardware reduction tree enables significant acceleration of reduction instructions, which are prevalent in most NN and DSP algorithms.

The efficacy and efficiency of the presented vector processor are corroborated through extensive performance simulations using real benchmark applications, and through detailed hardware analysis of synthesized and placed-and-routed designs using commercial 45 nm standard-cell libraries.

II. THE PROPOSED VECTOR PROCESSOR ARCHITECTURE

The proposed processor design uses a superscalar core as the main control processor, with all instructions being fetched and decoded in the superscalar pipeline, similar to [20], [21], [22]. A high-level overview of the micro-architecture is depicted in Figure 1. During the superscalar issue stage (sIS), instructions are diverted to the correct path (i.e., scalar or vector), based on their type. A vector instruction queue decouples the execution rates of the two datapaths. The vector processor core itself is implemented in a diversified pipelined organization, whereby the actual pipeline depth experienced by each instruction depends on the instruction type, as will shortly be explained. The vector pipeline includes the following stages: (a) Register Remap (vRRM), (b) Instruction Issue (vIS), (c) Execution (vEX), and (d) Memory Access (vMA).
Computation instructions are decoupled from memory-access instructions, and the two instruction types follow different pipeline paths, as illustrated in Figure 1 and explained in Section II-B.

Fig. 1. A high-level overview of the micro-architecture of the proposed vector processor. All vector instructions are diverted to the vector execution path upon completion of the scalar Issue Stage (sIS).

During the first vector pipeline stage (vRRM), the instruction operands are remapped to point to their newly allocated locations. This process is facilitated by a dynamic register file allocation mechanism, as will be described in Section II-A.

The remapped instructions then propagate to the issue stage (vIS), where they access the vector register file (RF) and/or the forwarding paths (as in vector chaining [17]) to get their source data, before proceeding to execution. The vector RF is sliced into v lanes, with each slice corresponding to a separate parallel execution lane. Vectors of arbitrary length are stripmined to the maximum number of lanes supported.
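As a concrete illustration of stripmining, the following minimal software model (with hypothetical names; the actual RTL operates on register slices, not Python lists) shows how a vector of arbitrary length is broken into chunks no wider than the v available lanes:

    # Minimal model of stripmining: a vector operation of arbitrary
    # length executes as a sequence of chunks, each at most v lanes wide.
    NUM_LANES = 8  # hypothetical lane count (v)

    def stripmine(vector_length, num_lanes=NUM_LANES):
        """Yield (start, width) chunks covering the whole vector."""
        start = 0
        while start < vector_length:
            width = min(num_lanes, vector_length - start)
            yield start, width
            start += width

    # A 19-element vector on 8 lanes is processed as chunks of 8, 8, 3.
    print(list(stripmine(19)))  # [(0, 8), (8, 8), (16, 3)]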
Hazarding is also implemented in this stage through the use of a scoreboard. If all the operands of an instruction are ready, they can be read directly from the register file. If any operand is pending (i.e., an earlier instruction that is currently in execution is the producer of the operand value), the instruction will be stalled – by the scoreboard – until the pending value appears on the forwarding path. Once all operands are available, the instruction can proceed to the execution stage (vEX). Since vector instructions operate on multiple elements (i.e., entire vectors), the vIS stage "transforms" vector instructions into multiple micro-operations (µops), with each µop operating on a different register group. Scheduling in the vIS stage is, therefore, performed at the granularity of individual µops.

The execution stage contains the pipelined parallel execution lanes. Similar to [17], each execution lane sees a portion of the vector RF. The duration, in cycles, of the vEX stage is variable, and it depends on the operation being executed. Table I lists the latencies for the various classes of instructions.

TABLE I
THE EXECUTION LATENCIES OF THE VARIOUS INSTRUCTION TYPES.

    Instruction Type               Latency (cycles)
    Simple arithmetic & logical    1
    Multiplication                 3
    Division                       4
    Reductions                     Variable: log2(vector length)
    Load/Store                     Variable

(Figure: example of a reduction tree – the elements A[0], A[1], A[2], A[3] are combined pairwise as A[0]+A[1] and A[2]+A[3].)
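The pairwise tree structure above also explains the Reductions entry of Table I: each tree level halves the number of live operands, so a reduction over the vector length completes in log2(vector length) steps. A small behavioral model of such a tree (not the actual RTL) is sketched below:

    # Behavioral model of a pairwise reduction tree. Reducing n
    # elements takes ceil(log2(n)) levels, matching the
    # "Variable: log2(vector length)" latency listed in Table I.
    def tree_reduce(values, op=lambda a, b: a + b):
        levels = 0
        while len(values) > 1:
            pairs = [op(values[i], values[i + 1])
                     for i in range(0, len(values) - 1, 2)]
            if len(values) % 2:           # odd element passes through
                pairs.append(values[-1])
            values = pairs
            levels += 1
        return values[0], levels

    print(tree_reduce([1, 2, 3, 4, 5, 6, 7, 8]))  # (36, 3): 3 = log2(8)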
When a result is generated, it becomes available to the issue stage through the forwarding paths. Since the execution latency is variable, the orchestration of instruction progress is performed by the scoreboard, which notifies stalled instructions in the issue stage whenever their pending operand values are ready. The stalled instructions "wake up" and proceed to the next pipeline stage. During execution, vector µops may trigger the same operation in multiple execution lanes, based on the vector length.
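This stall/wake-up protocol can be summarized with a small behavioral sketch; the class below is an illustrative simplification of a scoreboard, not the paper's implementation:

    # Sketch of scoreboard-driven issue: a µop stalls in vIS while any
    # source register has an in-flight producer, and wakes up once the
    # producer's result appears on the forwarding path.
    class Scoreboard:
        def __init__(self):
            self.pending = set()   # destination registers still in flight

        def try_issue(self, srcs, dst):
            if any(s in self.pending for s in srcs):
                return False       # stall: a source operand is pending
            self.pending.add(dst)  # dst stays pending until its result
            return True

        def result_ready(self, dst):
            self.pending.discard(dst)  # forwarded result wakes consumers

    sb = Scoreboard()
    sb.try_issue(srcs=["v1", "v2"], dst="v3")   # issues; v3 now pending
    print(sb.try_issue(srcs=["v3"], dst="v4"))  # False: stalls on v3
    sb.result_ready("v3")
    print(sb.try_issue(srcs=["v3"], dst="v4"))  # True: wakes up, issues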
Memory instructions do not access the execution lanes; instead, they are routed after the vRRM pipeline stage directly to the memory unit, as depicted in Figure 1. The memory unit features two parallel engines that allow for the simultaneous processing and disambiguation of one load and one store instruction. All instructions in the vMA and vEX stages are always issued and retired in order, writing their results directly into the register file upon retirement.
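The decoupling of the two streams can be pictured as a dispatcher feeding independent queues. Since the recovered text does not detail Section II-B, the sketch below (with assumed names and structure) only captures the routing decision made after vRRM:

    # Assumed sketch of post-vRRM routing: memory instructions feed the
    # memory unit's load/store engines, computation instructions feed
    # the vIS/vEX path, and the two streams drain at independent rates.
    from collections import deque

    compute_q, load_q, store_q = deque(), deque(), deque()

    def dispatch(inst):
        if inst["type"] == "load":
            load_q.append(inst)      # one dedicated load engine
        elif inst["type"] == "store":
            store_q.append(inst)     # one dedicated store engine
        else:
            compute_q.append(inst)   # routed towards the execution lanes

    for inst in ({"type": "load", "dst": "v1"},
                 {"type": "vadd", "dst": "v2"},
                 {"type": "store", "src": "v2"}):
        dispatch(inst)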
A. Register remapping and dynamic register file allocation

The first key micro-architectural novelty of the proposed processor design is a brand new approach to register renaming within the context of vector processing. The mechanism, aptly called register remapping, operates within the vRRM pipeline stage shown in Figure 1. The register remapping mechanism enables vector loops to be unrolled dynamically in hardware, thereby (a) minimizing the overhead of control instructions executed in the superscalar pipeline, and (b) maximizing the utilization of the available fetch bandwidth. The operation of the register remap scheme comprises three distinct phases, as abstractly depicted in Figure 2.

In the first phase, the mechanism generates groups of vector registers, based on the number of logical registers requested by the software. In the RISC-V ISA vector extension, the software communicates to the processor – through specialized system registers – the desired number of logical registers for the upcoming computations. This information is leveraged to generate the desired register-group numbers and sizes, as shown in Figure 2.
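As a hypothetical instance of phase one (the actual sizing policy is not spelled out in the text above): assuming a physical file of 32 vector registers, a request for k logical registers could carve the file into k equally sized groups:

    # One plausible phase-one policy (an assumption, for illustration):
    # partition the physical vector RF into one group per requested
    # logical register, each of floor(PHYS_REGS / k) physical registers.
    PHYS_REGS = 32  # assumed physical vector register file size

    def make_groups(num_logical):
        size = PHYS_REGS // num_logical
        return {lr: list(range(lr * size, (lr + 1) * size))
                for lr in range(num_logical)}

    # Requesting 4 logical registers yields 4 groups of 8 registers.
    print(make_groups(4)[1])  # [8, 9, 10, 11, 12, 13, 14, 15]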
Upon completion of group generation, the proposed mechanism proceeds to the second phase of its operation: it uses a remapping table (similar to a register alias table) to remap the logical registers to the corresponding base address of their assigned register group. Since these assignments are static for the duration of each computational kernel, the remap table is only written once per logical register, the first time each new destination operand is encountered in the instruction stream. Contrary to traditional register renaming [23], the presented register remapping process does not perform one-to-one register mappings; it performs one-to-group register mappings, whereby a single logical register is mapped to a group of registers to enable loop unrolling.

Finally, in phase three of the scheme, the remapped instructions are expanded into multiple µops that iterate over the registers of their assigned group, realizing the dynamic hardware-based loop unrolling described above.
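Putting phases two and three together, the one-to-group remapping and the subsequent µop expansion can be modeled as follows (a behavioral sketch under the same assumptions as the phase-one example above; source operands would be remapped analogously):

    # Phase two: the remap table maps a logical register to its whole
    # group, written once on the first destination use. Phase three:
    # each instruction expands into one µop per register of the group,
    # unrolling the loop in hardware.
    remap_table = {}  # logical register -> physical register group

    def remap_and_expand(inst, groups):
        dst = inst["dst"]
        remap_table.setdefault(dst, groups[dst])  # written only once
        return [{"op": inst["op"], "dst": phys}   # one µop per register
                for phys in remap_table[dst]]

    groups = {0: [0, 1, 2, 3, 4, 5, 6, 7]}  # hypothetical group for v0
    uops = remap_and_expand({"op": "vadd", "dst": 0}, groups)
    print([u["dst"] for u in uops])  # [0, 1, 2, 3, 4, 5, 6, 7]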
III. EVALUATION RESULTS

A. Performance evaluation

In this sub-section, we perform a detailed performance evaluation of the proposed vector design and its key features. A total of 10 benchmark applications are employed, consisting of 7 well-known linear algebra kernels and basic DSP algorithms, and 3 NNs of varying complexity: a simple perceptron, a 4-stage convolutional NN, and an 8-stage deep convolutional NN. The examined NNs execute inference tasks on digit recognition using the MNIST database [25]. The compared designs were implemented in fully-functional and synthesizable RTL code that will be open-sourced on GitHub [26]. All benchmarks were cycle-accurately executed at the RTL level, with various statistics retrieved from hardware counters and specialized trackers facilitating processor profiling.
We first examine the impact of the novel register remapping scheme discussed in Section II-A. We compare the proposed design with a simpler baseline vector processor [22] that does not have the register remapping mechanism and operates with a shorter pipeline (i.e., one without the vRRM stage). Figure 4 depicts the results, normalized to the throughput of the baseline design. The average throughput – calculated as Elements Per Cycle (EPC), the ratio of the total number of processed elements over the execution time in cycles – increases by 2.1×. This significant improvement is primarily attributed to the enhanced instruction scheduling resulting from the synergistic effect of register remapping, instruction expansion, and the dynamically allocated register file.
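To make the EPC metric concrete with made-up numbers (purely illustrative, not measured results): a kernel that processes 4096 vector elements in 1638 cycles achieves EPC = 4096 / 1638 ≈ 2.5. The 2.1× figure above therefore means the remapping-enabled design sustains, on average, 2.1 times the EPC of the baseline.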
We then assess the scalability of the proposed design by comparing three vector configurations – with 4, 8, and 16 lanes, respectively – against the baseline dual-issue superscalar processor. All three vector cores have all the features presented in Section II. Figure 5 shows the obtained throughput results. An almost linear scaling (with the number of lanes) is achieved in the 7 linear algebra and DSP algorithms, but smaller gains are observed in the 3 NN algorithms. This is due to the complex memory access patterns that NN kernels exhibit (primarily using indexed accesses), leading to limited scaling.

Fig. 5. Performance scaling for 3 different vector configurations, as compared to a baseline dual-issue superscalar core.

B. Hardware cost analysis

The proposed vector processor is also assessed in terms of its hardware cost and power efficiency. The RTL code of the design was synthesized using a commercial 45 nm standard-cell library under worst-case conditions (0.8 V, 125 °C), using the Cadence digital implementation flow. All designs under investigation were optimized for 1 GHz operation. The derived area/power results are summarized in Table III. Similar to Section III-A, the results also include the baseline dual-issue superscalar core. Recall that the proposed vector processor uses said superscalar core as the main control processor.