
RISC-V2: A Scalable RISC-V Vector Processor

Kariofyllis Patsidis∗, Chrysostomos Nicopoulos†, Georgios Ch. Sirakoulis∗, Giorgos Dimitrakopoulos∗


∗ Electrical and Computer Engineering, Democritus University of Thrace, Xanthi, Greece
† Electrical and Computer Engineering, University of Cyprus, Nicosia, Cyprus

Abstract—Machine learning adoption has seen a widespread bloom in recent years, with neural network implementations being at the forefront. In light of these developments, vector processors are currently experiencing a resurgence of interest, due to their inherent amenability to accelerate data-parallel algorithms required in machine learning environments. In this paper, we propose a scalable and high-performance RISC-V vector processor core. The presented processor employs a triptych of novel mechanisms that work synergistically to achieve the desired goals. An enhanced vector-specific incarnation of register renaming is proposed to facilitate dynamic hardware loop unrolling and alleviate instruction dependencies. Moreover, a cost-efficient decoupled execution scheme splits instructions into execution and memory-access streams, while hardware support for reductions accelerates the execution of key instructions in the RISC-V ISA. Extensive performance evaluation and hardware synthesis analysis validate the efficiency of the new architecture.

I. INTRODUCTION

The last few years have witnessed the widespread proliferation and massive adoption of machine learning as a fundamental thrust in a multitude of application domains. Increasingly, more aspects of everyday life are being disrupted by new capabilities enabled by machine learning. Neural Networks (NN) have emerged as the most popular approach to implementing machine learning, and they are considered state-of-the-art in such applications as pattern [1], image [2], and speech recognition. The rapid and vast increase in the use of NNs has accentuated the demand for hardware architectures that can accelerate the processing of various operations encountered in machine learning applications.

Traditional general-purpose processors have focused on Instruction-Level Parallelism (ILP) for decades. Consequently, they are not tuned to effectively handle the massively data-parallel workloads that machine learning algorithms and NNs have brought to the forefront [3]. While the addition of SIMD instructions to the ISA of general-purpose machines partially exploits Data-Level Parallelism (DLP), the obtained throughput is somewhat limited [4]. On the other hand, Graphics Processing Units (GPU) provide very high data parallelism, so they have been extensively used to accelerate NN workloads. Nevertheless, GPUs tend to be power-hungry, and the energy efficiency they can achieve is not adequate for many implementations, e.g., those requiring computation on the edge, where battery life is of paramount importance [5], [6]. To address energy efficiency, researchers have turned to custom architectures targeting specific NN implementations [7], [8], [9]. Even though such application-specific designs are very efficient, they typically offer limited programmability and little flexibility in adapting to the evolving and emerging needs of NN workloads.

The search for high performance and energy efficiency in highly data-parallel workloads has brought vector processors – a concept heavily explored in the 1970s [10] – back into the spotlight. Vector architectures are almost unique in their ability to effectively combine high programmability, high computational throughput, and high energy efficiency [11], [12]. The inception of modern vector processors was triggered by NN applications, which copiously rely on operations that can be readily vectorized [11]. The extensive proliferation of NNs in the last few years is precisely why vector processing is regaining notable traction in the community [13].

Building on this momentum, this paper presents a vector processor architecture that leverages the upcoming RISC-V [14], [15] vector extension [16], which allows RISC-V-based processors to be augmented with a vector processing core. While the proposed architecture is founded on the traditional tenets of vector processing [17], [18], [19], it introduces novel techniques that reap high performance benefits in a very scalable and cost-effective implementation. Specifically, the new design is spearheaded by three mechanisms that collectively constitute the main contributions of this work:

• A new register remapping technique reimagines the notion of register renaming in a vector processing context. Coupled with a dynamically allocated register file, the new register remapping mechanism enables dynamic hardware-based loop unrolling and optimized instruction scheduling at run-time.
• The design's decoupled execution scheme employs resource acquire-and-release semantics to disambiguate between parallel computation and memory-access instruction streams, thereby allowing for independent execution/memory flow rates.
• A dynamically generated hardware reduction tree enables significant acceleration of reduction instructions, which are prevalent in most NN and DSP algorithms.

The efficacy and efficiency of the presented vector processor are corroborated through extensive performance simulations using real benchmark applications, and through detailed hardware analysis of synthesized and placed-and-routed designs using commercial 45 nm standard-cell libraries.

II. THE PROPOSED VECTOR PROCESSOR ARCHITECTURE

The proposed processor design uses a superscalar core as the main control processor, with all instructions being fetched and decoded in the superscalar pipeline, similar to [20], [21], [22]. A high-level overview of the micro-architecture is depicted in Figure 1. During the superscalar issue stage (sIS), the instructions are diverted to the correct path (i.e., scalar or vector), based on their type. A vector instruction queue decouples the execution rates of the two datapaths. The vector processor core itself is implemented in a diversified pipelined organization, whereby the actual pipeline depth experienced by each instruction depends on the instruction type, as will be shortly explained. The vector pipeline includes the following stages: (a) Register Remap (vRRM), (b) Instruction Issue (vIS), (c) Execution (vEX), and (d) Memory Access (vMA). Computation instructions are decoupled from memory-access instructions, and the two instruction types follow different pipeline paths, as illustrated in Figure 1 and explained in Section II-B.
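As a rough behavioral sketch of how the superscalar issue stage diverts instructions and how the vector instruction queue decouples the two datapaths, consider the following Python illustration; the queue structure and instruction encoding are our assumptions, not the paper's RTL:

```python
from collections import deque

# Vector instruction queue (vIQ): decouples the issue rate of the
# superscalar front end from the consumption rate of the vector core.
# (Backpressure when the queue fills is omitted from this sketch.)
vIQ = deque()

def execute_scalar(inst):
    print("scalar path:", inst["op"])

def superscalar_issue(inst):
    """sIS stage: divert each decoded instruction by type."""
    if inst["type"] == "vector":
        vIQ.append(inst)      # later drained by the vector pipeline
    else:
        execute_scalar(inst)  # stays in the superscalar datapath

superscalar_issue({"type": "scalar", "op": "addi"})
superscalar_issue({"type": "vector", "op": "vadd"})
print("queued for vector core:", [i["op"] for i in vIQ])
```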
Fig. 1. A high-level overview of the micro-architecture of the proposed vector processor. All vector instructions are diverted to the vector execution path upon completion of the scalar Issue Stage (sIS). (The figure shows a 2-way out-of-order superscalar front end – Fetch, ICache, sIS, DCache – feeding a vector instruction queue (vIQ); the vRRM stage, which steers memory instructions toward the vMA path and computation instructions toward the Vector Issue (vIS) stage with its scoreboard; the vector RF sliced across Lanes #0…#v with forwarding logic and execution units EX #0…EX #v; and the writeback logic.)

TABLE I
THE EXECUTION LATENCIES OF THE VARIOUS INSTRUCTION TYPES.

Instruction Type              Latency (cycles)
Simple arithmetic & logical   1
Multiplication                3
Division                      4
Reductions                    Variable: log2(vector length)
Load/Store                    Variable
During the first vector pipeline stage (vRRM), the instruction operands are remapped to point to their newly allocated locations. This process is facilitated by a dynamic register file allocation mechanism, as described in Section II-A. The remapped instructions then propagate to the issue stage (vIS), where they access the vector register file (RF) and/or the forwarding paths (as in vector chaining [17]) to obtain their source data, before proceeding to execution. The vector RF is sliced into v lanes, with each slice corresponding to a separate parallel execution lane. Vectors of arbitrary length are stripmined to the maximum number of lanes supported.
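To make the stripmining arithmetic concrete, the following minimal Python sketch (an illustration of the idea, not the authors' RTL; the function name is ours) splits a vector of arbitrary length into lane-sized strips:

```python
def stripmine(vector_length: int, num_lanes: int):
    """Split a vector of arbitrary length into lane-sized strips.

    Returns (start_index, strip_width) pairs; every strip except
    possibly the last is num_lanes elements wide.
    """
    return [
        (start, min(num_lanes, vector_length - start))
        for start in range(0, vector_length, num_lanes)
    ]

# Example: a 10-element vector on a 4-lane configuration is covered
# in ceil(10 / 4) = 3 strips: [(0, 4), (4, 4), (8, 2)].
print(stripmine(10, 4))
```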
Hazarding is also implemented in this stage through the use of a scoreboard. If all the operands of an instruction are ready, they can be read directly from the register file. If any operand is pending (i.e., an earlier instruction that is currently in execution is the producer of the operand value), the instruction is stalled – by the scoreboard – until the pending value appears on the forwarding path. Once all operands are available, the instruction can proceed to the execution stage (vEX). Since vector instructions operate on multiple elements (i.e., entire vectors), the vIS stage "transforms" vector instructions into multiple micro-operations (µops), with each µop operating on different register groups. Scheduling in the vIS stage is, therefore, performed at the granularity of individual µops.

The execution stage contains the pipelined parallel execution lanes. Similar to [17], each execution lane sees a portion of the vector RF. The duration, in cycles, of the vEX stage is variable, and it depends on the operation being executed. Table I lists the latencies for the various classes of instructions. When a result is generated, it becomes available to the issue stage through the forwarding paths. Since the execution latency is variable, the orchestration of instruction progress is performed by the scoreboard, which notifies stalled instructions in the issue stage whenever their pending operand values are ready. The stalled instructions "wake up" and proceed to the next pipeline stage. During execution, vector µops may trigger the same operation in multiple execution lanes, based on the vector length.

Memory instructions do not access the execution lanes; instead, they are routed after the vRRM pipeline stage directly to the memory unit, as depicted in Figure 1. The memory unit features two parallel engines that allow for the simultaneous processing and disambiguating of one load and one store instruction. All instructions in the vMA and vEX stages are always issued and retired in order, writing their results directly into the register file upon retirement.
A. Register remapping and dynamic register file allocation

The first key micro-architectural novelty of the proposed processor design is a brand new approach to register renaming within the context of vector processing. The mechanism is aptly called register remapping, and it operates within the vRRM pipeline stage shown in Figure 1. The register remapping mechanism enables vector loops to be unrolled dynamically in hardware, thereby (a) minimizing the overhead of control instructions executed in the superscalar pipeline, and (b) maximizing the utilization of the available fetch bandwidth. The operation of the register remap scheme comprises three distinct phases, as abstractly depicted in Figure 2.

In the first phase, the mechanism generates groups of vector registers, based on the number of logical registers requested by the software. In the RISC-V ISA vector extension, the software communicates to the processor – through specialized system registers – the desired number of logical registers for the upcoming computations. This information is leveraged to generate the desired register-group numbers and sizes, as shown in Figure 2.

Upon completion of group generation, the proposed mechanism proceeds to the second phase of its operation; it uses a remapping table (similar to a register alias table) to remap the logical registers to the corresponding base address of their assigned register group. Since these assignments are static for the duration of each computational kernel, the remap table is only written once per logical register, the first time each new destination operand is encountered in the instruction stream. Contrary to traditional register renaming [23], the presented register remapping process does not perform one-to-one register mappings; it performs one-to-group register mappings, whereby a single logical register is mapped to a group of registers to enable loop unrolling. Subsequently, the remapping table dynamically allocates the generated register groups into the register file, as illustrated in Figure 2.

Finally, in phase three of the scheme, the remapped instructions are "expanded" to operate on the full size of their groups. In the vIS pipeline stage, each instruction generates and dispatches multiple micro-operations (µops) to the execution stage (vEX). In the example of Figure 2, the original instruction generates two µops. Each dispatched µop executes the parent instruction's operation, but with adjusted operands, so that the computation is applied to a different set of inputs within the assigned group space. Once the µops have covered the full group size, the parent instruction is retired. This expansion scheme also exists inside the memory unit, since memory instructions also need to undergo the same transformation.

Fig. 2. The three-phase process of remapping the registers and dynamically allocating the register file. The software initially requests a number of logical registers, which dictates the number of groups the vector register file is going to be split into. Each instruction dispatches multiple µops to cover the full size of the allocated group. (Depicted example: the software requires 2 logical registers per operation; the register file is split into 2 groups by the dynamic register file allocator; "add v1, v1, v0" is remapped to the group-wide "add [v3:v2], [v3:v2], [v1:v0]", which expands into µOP #0 "add v2, v2, v0" and µOP #1 "add v3, v3, v1", operating on the full size of the groups.)

In summary, the new register remapping mechanism facilitates dynamic loop unrolling in hardware. The unrolling mitigates the stalls incurred by data dependencies, since the direct consumer of a result is now separated from its producer by multiple µops. Consequently, resource utilization increases substantially.
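A minimal Python sketch of the one-to-group remapping and µop expansion may help. The allocation policy shown (an even split of the physical file across the requested logical registers) is our assumption based on the Figure 2 example, not a statement of the actual RTL:

```python
class RegisterRemapper:
    """Sketch of one-to-group register remapping (phases 1-3 above).

    With `physical_regs` physical vector registers and `logical_regs`
    requested by the software, each logical register is mapped to a
    group of physical_regs // logical_regs registers (assumed policy).
    """

    def __init__(self, physical_regs: int, logical_regs: int):
        self.group_size = physical_regs // logical_regs
        # Phases 1+2: statically map each logical register to the base
        # address of its group (written once per kernel).
        self.remap_table = {
            f"v{l}": l * self.group_size for l in range(logical_regs)
        }

    def expand(self, op: str, dst: str, src1: str, src2: str):
        """Phase 3: expand one remapped instruction into group_size
        uops, each touching consecutive physical registers."""
        d, a, b = (self.remap_table[r] for r in (dst, src1, src2))
        return [
            (op, f"p{d + i}", f"p{a + i}", f"p{b + i}")
            for i in range(self.group_size)
        ]

# Software requests 2 logical registers on a 4-register file (as in
# the Figure 2 example): each logical register maps to a group of 2.
rrm = RegisterRemapper(physical_regs=4, logical_regs=2)
print(rrm.expand("add", "v1", "v1", "v0"))
# -> [('add', 'p2', 'p2', 'p0'), ('add', 'p3', 'p3', 'p1')]
```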
B. Decoupled execution: computation and memory accesses

To further increase the utilization of the vector pipeline, we also introduce a novel memory decoupling scheme that effectively hides the latency of memory accesses. The computation (execution) and memory instructions are separated into two independent streams, which are appropriately diverted into different execution paths after the vRRM pipeline stage.

Traditionally, synchronization in decoupled processor schemes is achieved by employing so-called synchronization queues and special move operations [24]. However, such schemes are not amenable to vector processors, where hundreds of elements have to be moved from/to the memory. The synchronization queues incur a significant hardware overhead, while the inserted move instructions block the computation stream until all the data has been transferred.

To alleviate these shortcomings of the traditional approach, the proposed scheme uses a resource-locking mechanism to effectively safeguard the correct program execution flow without hindering the flow of the computation stream. During the vRRM pipe stage, the memory instructions are diverted to the memory unit, while a ghost copy of the instruction is dispatched into the vIS stage. The ghost instruction only updates the scoreboard, by locking the source operands of the memory instruction. It then disappears from the pipeline without triggering any computation. This way, data to be stored, or address offsets, are safeguarded against tampering by future computation instructions. At the same time, the computational flow remains completely unblocked and continues dispatching instructions, as long as no instruction tries to modify the locked registers. When the memory instruction finally retires, the data is written directly into the register file, and the corresponding registers are unlocked, thereby allowing for their subsequent reuse.
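The acquire-and-release semantics of the ghost-copy mechanism can be sketched as follows; the class and method names are hypothetical, and the model deliberately ignores pipeline timing:

```python
class LockingScoreboard:
    """Sketch of the acquire-and-release ghost-instruction protocol:
    a ghost copy of each memory instruction locks its source registers
    in the scoreboard, and retirement in the memory unit releases them.
    Names are illustrative, not the paper's RTL interface."""

    def __init__(self):
        self.locked = set()

    def dispatch_ghost(self, mem_inst):
        # The ghost only locks the memory instruction's source operands
        # (store data, address offsets) and then vanishes.
        self.locked |= set(mem_inst["srcs"])

    def can_dispatch_compute(self, inst):
        # Computation flows freely unless it would overwrite a locked
        # register that an in-flight memory instruction still needs.
        return inst["dst"] not in self.locked

    def retire_memory(self, mem_inst):
        # Acquire-and-release: retirement unlocks the registers.
        self.locked -= set(mem_inst["srcs"])

sb = LockingScoreboard()
store = {"srcs": ["v4", "v5"]}          # store data + address offset
sb.dispatch_ghost(store)
assert sb.can_dispatch_compute({"dst": "v6", "srcs": ["v4"]})  # reads OK
assert not sb.can_dispatch_compute({"dst": "v4", "srcs": []})  # blocked
sb.retire_memory(store)
assert sb.can_dispatch_compute({"dst": "v4", "srcs": []})      # unlocked
```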
C. Hardware support for reduction operations

Reduction operations have historically been handled using specialized hardware that iteratively shifts and computes on pairs of elements. However, such approaches tend to have long execution latencies and are unfit for contemporary NN (and convolutional NN) applications that rely heavily on such computational patterns. To effectively accelerate reduction operations, we employ a scalable tree scheme, which calculates multiple partial results in parallel, in order to achieve significant speedups.

The reduction tree is automatically generated and distributed, based on the design's number of configured vector execution lanes. The generated tree includes all necessary interconnections between the execution lanes. During each pipeline stage, the tree operates on pairs of neighbors, reducing the input vector's dimensionality by half. The partial results are then registered and used in the next stage's computations. The organization of the reduction tree is depicted in Figure 3 for four execution lanes. The vector length being operated on at any given time determines the required reduction depth, which, in turn, triggers the reduction tree control logic to dynamically activate the appropriate interconnects of the tree. Since the tree is automatically generated and the interconnects are dynamically activated at runtime, the scalability of the overall design is maintained, without requiring any manual effort in adjusting the design's RTL code. The proposed reduction scheme yields significant latency improvements; the execution latency of the unit is calculated as log2(vector length) cycles.

Fig. 3. The deployment of the dynamically generated hardware reduction tree in a setup with 4 execution lanes. In each cycle, the length of the vector is reduced by half by computing the neighboring partial results (A[0]+A[1] and A[2]+A[3], then A[0]+A[1]+A[2]+A[3]), until the final result is ready.
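A behavioral Python sketch of the pairwise reduction (mirroring Figure 3) shows why the latency works out to log2(vector length) pipeline stages; the function is illustrative, not the generated RTL:

```python
def tree_reduce(elements, op=lambda a, b: a + b):
    """Pairwise reduction tree: each 'cycle' halves the vector by
    combining neighboring partial results, so an n-element reduction
    needs ceil(log2(n)) cycles."""
    values, cycles = list(elements), 0
    while len(values) > 1:
        pairs = [values[i:i + 2] for i in range(0, len(values), 2)]
        values = [op(*p) if len(p) == 2 else p[0] for p in pairs]
        cycles += 1                      # one pipeline stage per level
    return values[0], cycles

# 4 lanes, as in Figure 3: A[0]+A[1] and A[2]+A[3] in cycle 1,
# then their sum in cycle 2 -> log2(4) = 2 cycles.
total, depth = tree_reduce([1, 2, 3, 4])
assert (total, depth) == (10, 2)
```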
III. EVALUATION RESULTS

A. Performance evaluation

In this sub-section, we perform a detailed performance evaluation of the proposed vector design and its key features. A total of 10 benchmark applications are employed, consisting of 7 well-known linear algebra kernels and basic DSP algorithms, and 3 NNs of varying complexity: a simple perceptron, a 4-stage convolutional NN, and an 8-stage deep convolutional NN. The examined NNs execute inference tasks on digit recognition using the MNIST database [25]. The compared designs were implemented in fully-functional and synthesizable RTL code that will be open-sourced on GitHub [26]. All benchmarks were cycle-accurately executed at the RTL level, with various statistics retrieved from hardware counters and specialized trackers facilitating processor profiling.

We first examine the impact of the novel register remapping scheme discussed in Section II-A. We compare the proposed design with a simpler baseline vector processor [22] that does not have the register remapping mechanism and operates with a shorter pipeline (i.e., one without the vRRM stage). Figure 4 depicts the results, normalized to the throughput of the baseline design. The average throughput – calculated as Elements Per Cycle (EPC), the ratio of total processed elements over the execution time in cycles – increases by 2.1×. This significant improvement is primarily attributed to the enhanced instruction scheduling resulting from the synergistic effect of register remapping, instruction expansion, and the dynamically allocated register file.
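For reference, the EPC metric used throughout this section reduces to a simple ratio; the numbers in the example below are hypothetical:

```python
def elements_per_cycle(total_elements: int, execution_cycles: int) -> float:
    """EPC: total processed elements divided by execution time in cycles."""
    return total_elements / execution_cycles

# Hypothetical example: a kernel processing 4096 elements in 512
# cycles achieves an EPC of 8.0.
assert elements_per_cycle(4096, 512) == 8.0
```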
Fig. 4. The performance improvement (in elements per cycle, normalized to the baseline) obtained when using the novel register remapping mechanism and the dynamic allocation of the register file, across the fir, dot product, median, multiply, vvadd, saxpy, motion estimation, perceptron, cnn, and dcnn benchmarks. Both vector cores feature 8 execution lanes.
The next feature we evaluate is the hardware support for reduction operations, as presented in Section II-C. The presence of a reconfigurable reduction tree improves the performance of NN algorithms. Consequently, our experiment focuses on the three NN benchmarks, since they make heavy use of reduction operations. Table II shows the obtained throughput results, normalized to the throughput of a design with no reduction tree. As can be seen, the hardware acceleration of reduction operations yields massive throughput improvements in the inference operations of the NN benchmarks.
TABLE II
IMPACT OF THE HARDWARE-BASED REDUCTION TREE ON THE THROUGHPUT (EPC) OF NN ALGORITHMS IN VARIOUS CONFIGURATIONS.

Design           Perceptron   CNN    Deep CNN   Average
No Red. Tree     1            1      1          1
With Red. Tree   2.57         1.89   1.87       2.11

Finally, we evaluate the scalability of the overall vector processor design. We compare three different vector configurations using 4, 8, and 16 execution lanes and a baseline dual-issue superscalar processor. All three vector cores have all the features presented in Section II. Figure 5 shows the obtained throughput results. Almost linear scaling (with the number of lanes) is achieved in the 7 linear algebra and DSP algorithms, but smaller gains are observed in the 3 NN algorithms. This is due to the complex memory access patterns that NN kernels exhibit (primarily indexed accesses), leading to limited scaling.

Fig. 5. Performance scaling (throughput in EPC) for 3 different vector configurations (4, 8, and 16 lanes), as compared to a baseline 2-way out-of-order superscalar core.

B. Hardware cost analysis

The proposed vector processor is also assessed in terms of its hardware cost and power efficiency. The RTL code of the design was synthesized using a commercial 45 nm standard-cell library under worst-case conditions (0.8 V, 125 °C), using the Cadence digital implementation flow. All designs under investigation were optimized for 1 GHz operation. The derived area/power results are summarized in Table III. Similar to Section III-A, the results also include the baseline dual-issue superscalar core. Recall that the proposed vector processor uses said superscalar core as the main control processor.

TABLE III
HARDWARE IMPLEMENTATION RESULTS OF FOUR INVESTIGATED DESIGNS (CACHES ARE EXCLUDED) AT 45 NM / 0.8 V AT 1 GHZ.

Design             Area (mm2)   Avg. Power (mW)   Power Efficiency (EPC/Watt, normalized)
Superscalar        0.24         13.1              1
Vector - 4-Lane    0.61         43.9              1.54
Vector - 8-Lane    0.97         75.3              1.62
Vector - 16-Lane   1.67         124               1.70

As expected, the area increases significantly when augmenting the superscalar processor with a vector core. The area overhead of the vector core scales almost linearly with the increase in the number of execution lanes. The same trends are also followed by the power consumption. Nevertheless, modern systems (and especially resource-constrained ones) demand increasingly higher computational power implemented in a cost-effective manner. Therefore, a key metric is that of power efficiency (EPC/Watt). Clearly, the proposed architecture achieves markedly better overall power efficiency that scales well with bigger vector configurations.
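To make the "Power Efficiency" column concrete, a normalized EPC/Watt value can be derived as follows; the EPC value in the example is hypothetical, while the power figures come from Table III:

```python
def normalized_power_efficiency(epc: float, power_mw: float,
                                base_epc: float, base_power_mw: float) -> float:
    """Power efficiency (EPC/Watt), normalized to a baseline design."""
    return (epc / power_mw) / (base_epc / base_power_mw)

# Hypothetical EPC values paired with the Table III power figures:
# if the 4-lane vector core achieved 5.15x the baseline's EPC at
# 43.9 mW (vs. 13.1 mW), its normalized efficiency would be ~1.54.
print(round(normalized_power_efficiency(5.15, 43.9, 1.0, 13.1), 2))
```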
IV. CONCLUSION

This work presented a RISC-V-based high-performance and power-efficient vector processor architecture. The new design employs three novel mechanisms that collectively yield impressive performance gains: (a) a register remapping scheme facilitated by dynamic register file allocation; (b) a decoupled execution scheme that separates execution and memory-access instructions; and (c) hardware support for vector reduction operations. A detailed evaluation of the new architecture highlighted both its performance prowess and its power efficiency.
REFERENCES

[1] G. E. Dahl, T. N. Sainath, and G. E. Hinton, "Improving deep neural networks for LVCSR using rectified linear units and dropout," in IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013, pp. 8609–8613.
[2] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proc. of IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1026–1034.
[3] V. Sze, Y. Chen, T. Yang, and J. S. Emer, "Efficient processing of deep neural networks: A tutorial and survey," Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, Dec 2017.
[4] Y. Lee, R. Avizienis, A. Bishara, R. Xia, D. Lockhart, C. Batten, and K. Asanović, "Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators," in International Symposium on Computer Architecture (ISCA), June 2011, pp. 129–140.
[5] H. Esmaeilzadeh, P. Saeedi, B. N. Araabi, C. Lucas, and S. M. Fakhraie, "Neural network stream processing core (nnsp) for embedded systems," in IEEE International Symposium on Circuits and Systems, May 2006.
[6] A. Coates, B. Huval, T. Wang, D. J. Wu, A. Y. Ng, and B. Catanzaro, "Deep learning with COTS HPC systems," in Proc. of International Conference on Machine Learning (ICML), 2013, pp. III-1337–III-1345.
[7] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, "Neural acceleration for general-purpose approximate programs," in IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec 2012, pp. 449–460.
[8] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," in Proc. of International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014, pp. 269–284.
[9] Y. Chen, T. Yang, J. Emer, and V. Sze, "Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 292–308, June 2019.
[10] R. M. Russell, "The Cray-1 computer system," Commun. ACM, vol. 21, no. 1, pp. 63–72, Jan. 1978.
[11] J. Wawrzynek, K. Asanovic, B. Kingsbury, D. Johnson, J. Beck, and N. Morgan, "Spert-II: a vector microprocessor system," Computer, vol. 29, no. 3, pp. 79–86, March 1996.
[12] R. Espasa, F. Ardanaz, J. Emer, S. Felix, J. Gago, R. Gramunt, I. Hernandez, T. Juan, G. Lowney, M. Mattina, and A. Seznec, "Tarantula: a vector extension to the Alpha architecture," in Proc. International Symposium on Computer Architecture (ISCA), May 2002, pp. 281–292.
[13] S. Hurkat and J. F. Martínez, "VIP: A versatile inference processor," in IEEE International Symposium on High Performance Computer Architecture (HPCA), 2019, pp. 345–358.
[14] "RISC-V Foundation," http://www.riscv.org, accessed: 17-10-2019.
[15] F. Conti, D. Rossi, A. Pullini, I. Loi, and L. Benini, "PULP: A Ultra-Low Power Parallel Accelerator for Energy-Efficient and Flexible Embedded Vision," Journal of Signal Processing Systems, vol. 84, pp. 339–354, 2016.
[16] "Working draft of the proposed RISC-V V vector extension," https://github.com/riscv/riscv-v-spec, accessed: 17-10-2019.
[17] K. Asanovic, "Vector microprocessors," Ph.D. dissertation, University of California, Berkeley, 1998.
[18] C. E. Kozyrakis and D. A. Patterson, "Scalable vector processors for embedded systems," IEEE Micro, vol. 23, no. 6, pp. 36–45, Nov 2003.
[19] J. Yu, G. Lemieux, and C. Eagleston, "Vector processing as a soft-core CPU accelerator," in Proc. of ACM International Symposium on Field Programmable Gate Arrays (FPGA), 2008, pp. 222–232.
[20] C. Celio, D. A. Patterson, and K. Asanovic, "The Berkeley Out-of-Order Machine (BOOM): An Industry-Competitive, Synthesizable, Parameterized RISC-V Processor," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2015-167, 2015.
[21] C. Celio, P. F. Chiu, B. Nikolic, D. A. Patterson, and K. Asanovic, "BOOM v2: An Open-Source Out-of-Order RISC-V Core," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2017-157, 2017.
[22] K. Patsidis, D. Konstantinou, C. Nicopoulos, and G. Dimitrakopoulos, "A low-cost synthesizable RISC-V dual-issue processor core leveraging the compressed instruction set extension," Microprocessors and Microsystems, vol. 61, pp. 1–10, 2018.
[23] R. Espasa, M. Valero, and J. E. Smith, "Out-of-order vector architectures," in Proc. of ACM/IEEE International Symposium on Microarchitecture (MICRO), 1997, pp. 160–170.
[24] R. Espasa and M. Valero, "Decoupled vector architectures," in Proc. International Symposium on High-Performance Computer Architecture (HPCA), 1996, pp. 281–290.
[25] L. Deng, "The MNIST database of handwritten digit images for machine learning research [best of the web]," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141–142, Nov 2012.
[26] IC-Lab-DUTH Repository, "RISC-V-Vector processor," 2020. [Online]. Available: https://github.com/ic-lab-duth/RISC-V-Vector
