HL5: A 32-bit RISC-V Processor

Designed with High-Level Synthesis


Paolo Mantovani, Robert Margelli, Davide Giri and Luca P. Carloni
Department of Computer Science · Columbia University, New York, NY 10027
[paolo, margelli, davide giri, luca]@cs.columbia.edu

Abstract—The growing complexity of systems-on-chip fuels the adoption of high-level synthesis (HLS) to reduce the design time of application-specific accelerators. General-purpose processors, however, are still designed using RTL and logic synthesis. Yet they are the most complex components of most systems-on-chip. We show that HLS can simplify the design of processors while enhancing their customization and reusability. We present HL5 as the first 32-bit RISC-V microprocessor designed with SystemC and optimized with a commercial HLS tool. We evaluate HL5 through the execution of software programs on an experimental infrastructure that combines FPGA emulation with a standard RTL synthesis flow for a commercial 32 nm CMOS technology. By describing the challenges and opportunities of applying HLS to processor design, our paper also aims at sparking a renewed interest in HLS research.

Index Terms—High Level Synthesis; RISC-V; Processor Pipeline.

I. INTRODUCTION

The worldwide market for semiconductor intellectual property (IP) blocks is expected to approach 8 billion by 2019 [1]. A growing variety of these IP blocks populates any system-on-chip (SoC) for embedded systems or Internet-of-Things (IoT) because SoC architects must find the right mix of components for the target application domain, while being pressured by stringent time-to-market constraints.

To cope with design complexity it is necessary to raise the level of abstraction by embracing system-level design methods [2], [3]. These include the use of high-level programming languages, like C/C++ and SystemC, for design specification and the application of high-level synthesis (HLS) for design optimization [4]. The benefits include: (1) the ability to complete a richer design-space exploration (DSE); (2) larger portability to various technology platforms, including FPGAs; and (3) a broader reusability across different target systems, as each IP-block implementation offers a different trade-off option between performance and cost (i.e., power or area).

Over the past decade, HLS has been used for thousands of ASIC tapeouts and FPGA designs [4], particularly to optimize IP blocks for wireless communication [5] and video decoding/encoding [6]. Meanwhile, researchers have demonstrated the application of HLS to many other IP blocks such as: accelerators for database analytics [7] and machine learning [8], the on-chip memory subsystems [9], [10], the memory hierarchy [11] and the on-chip interconnect, including networks-on-chip and interfaces for standard bus protocols [12], [13]. Moreover, HLS has been proposed as a key ingredient for new methodologies that target the problem of heterogeneous component integration in SoCs [12], [14]–[16].

In this landscape, one particular IP block stands out as a major exception: the processor core. As discussed in more detail in Section V, the literature offers very few examples of papers investigating HLS for processor design. Commercial HLS tools are very efficient in synthesizing computationally intensive datapaths, when the amount of control logic is limited. However, despite recent advances [17], [18], they still struggle to produce high-quality RTL for control-dominated branching logic [19]. These limitations may have prevented researchers from investing time and resources in applying HLS to processor design. Yet processor cores are the most complex components of most SoCs, with the biggest impact on design and verification costs [20]. Furthermore, as technology trends push designers towards even more heterogeneity and customization [21], [22], processor designs must become more configurable so that they can be optimized for each particular target SoC where they can be instanced. These considerations motivated our work.

Contributions. We present HL5 as the first 32-bit processor fully designed and implemented with HLS. HL5 is a pipelined processor that implements the full RV32IM subset of the RISC-V instruction set architecture (ISA). The RISC-V ISA, originally developed at UC Berkeley [23] and now supported by the RISC-V Foundation [24], has raised broad interest across academia and industry thanks to the appealing promise of representing a free and open ISA that could become an industry standard for a variety of systems, from IoT devices to datacenter servers [25].

We designed HL5 with the goal of leveraging HLS to maximize its configurability and minimize design costs. We derived a concurrent specification for the HL5 design that is entirely based on the synthesizable subset of SystemC. This required us to circumvent some limitations of current HLS tools through clever design choices applied to the high-level specification, as explained in Section II.

In Section III, we dive into the details of our design. The latency of the functional units of HL5 can be varied based on the HLS configuration parameters (aka HLS-knob settings or knobs). By using latency-insensitive channels [3], [26] as the main communication mechanism across the processor stages, the pipeline of HL5 can tolerate latency variations of the caches, the register file, the functional units and the branch
logic. These variations can be the result of both customization decisions at design time (through HLS knobs) and their impact on the data/instruction flow when executing a given program at run time. We used Cadence Stratus, a commercial HLS tool, to process the SystemC specification and synthesize 12 RTL implementations, with clock frequencies ranging from 700 MHz up to 2 GHz.

In Section IV, we describe our experimental infrastructure to evaluate HL5 with the execution of actual software programs: this combines FPGA emulation with a standard RTL synthesis flow for a commercial 32 nm CMOS technology. We compare our synthesized RTL implementations of HL5 against ZERO-RISCY, a processor core for IoT that is part of the PULP platform [27] and that was carefully optimized for area occupation through manual RTL design [28]. Through the design of HL5, we illustrate a set of guidelines to design processor pipelines in SystemC. Furthermore, we consider the limitations of this approach, which can drive future improvements of HLS tools.

II. HLS FOR PROCESSOR DESIGN

In this section, we discuss how to address the current limitations of HLS for processor design. We show that HLS can actually reduce the effort to design the pipeline of a processor compared to hand-written RTL, while still achieving good quality of results.

Code snippet 1:

typedef struct instruction {
  raddr_t rs1, rs2, rd;
  func_t func;
  ctrl_t ctrl;
} instruction_t;

// Body of the main SystemC SC_CTHREAD
while(true)
{
  HLS_PIPELINE_LOOP;
  wait(); // Wait on virtual clock edge

  pc = npc;
  instruction_t inst = decode(imem[pc]);
  op1 = regfile[rs1]; op2 = regfile[rs2];
  npc = next_pc(pc, inst, op1, op2);

  if (check_for_hazards(inst)) { // Stall
    npc = pc; continue;
  }
  if (inst.ctrl.jump) {
    if (npc == pc) break;   // End of program
    else continue;          // Jump or branch taken
  }
  out = execute(pc, inst, op1, op2);

  if (inst.ctrl.store) dmem[out] = op2;
  else if (inst.ctrl.load) regfile[rd] = dmem[out];
  else if (inst.ctrl.writereg) regfile[rd] = out;
}

Figure 1. Naive SystemC description of a pipeline.

First, let us consider the snippet of code in Fig. 1, which shows the body of a SystemC process of type SC_CTHREAD that specifies a simple processor pipeline. In the SystemC simulation engine, the SC_CTHREAD processes are concurrent threads that are triggered for execution at every edge of a clock signal. This is often called a virtual clock because a state transition in SystemC, controlled by its virtual clock, may correspond to many state transitions in the RTL circuit. For example, an invocation of the function execute() in Fig. 1 does not cause the SystemC simulation time to advance: the loop body processes one full instruction consuming zero simulation time, then suspends the execution of the SC_CTHREAD until the next virtual-clock edge by calling the wait() function. The RTL implementation, instead, will consume at least one physical-clock cycle to execute the logic synthesized for the execute() function, even in the case of simple instructions. In general, the virtual-clock abstraction allows a designer to simply specify the hardware for complex computations in a loosely-timed fashion, while letting HLS schedule these computations across physical-clock cycles and infer the necessary finite-state machines (FSM) to control them.
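For readers less familiar with this coding style, the sketch below shows how a loosely-timed loop of this kind is typically registered with the SystemC kernel: an SC_MODULE declares the loop function as an SC_CTHREAD triggered on the rising clock edge, with an active-low reset. This is a minimal illustration in standard SystemC; the module, port, and variable names are ours and are not taken from the HL5 sources.

#include <systemc.h>

SC_MODULE(naive_pipeline) {
  sc_in<bool> clk;
  sc_in<bool> rst;

  SC_CTOR(naive_pipeline) {
    SC_CTHREAD(run, clk.pos());   // thread triggered on the virtual-clock edge
    reset_signal_is(rst, false);  // active-low reset
  }

  void run() {
    // Reset section: initialize the architectural state.
    unsigned int pc = 0, npc = 0;
    wait();
    // Main loop: one iteration per virtual-clock boundary.
    while (true) {
      wait();
      pc = npc;
      // ... decode, operand fetch, execute, and write-back as in Fig. 1 ...
      npc = pc + 4;               // placeholder next-pc computation
    }
  }
};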
Wouldn’t it be great if we could synthesize the pipeline of a complete processor from this simple SystemC specification? Indeed, we can! State-of-the-art HLS tools can generate a working RTL implementation from the SystemC specification of Fig. 1. To the best of our knowledge, however, from this naive specification no HLS tool is currently capable of achieving a throughput of one instruction per cycle (IPC), which is the ideal IPC for single-issue processors in the absence of dependencies across instructions. Despite its simplicity, this code snippet captures most of the key challenges that HLS faces when the target design is a complex pipeline such as a processor pipeline.

Loop-Carried Dependencies. The ISA is an abstraction layer that exposes the processor state (i.e., the content of the program counter, the register file, and the memory) to the software while hiding the details of the hardware implementation (e.g., the content of the pipeline registers in a pipelined implementation). Indeed, the ISA is like a “contract” that allows many different pipelined hardware implementations as long as they can run software programs as if each instruction is executed atomically before the next one starts. The specification of Fig. 1 reflects this abstraction as each loop iteration may change the value of the processor state in a way that impacts the next iterations. Since a pipeline implementation overlaps the partial execution of instructions, it requires a careful handling of control and data dependencies to avoid possible hazards. These dependencies may translate into loop-carried dependencies (LCDs) that pose a significant challenge to any HLS tool.

Control Feedback. For example, with no control hazard the computation of the next value npc of the program counter requires just a simple addition that typically has a delay smaller than the physical-clock period. However, if a control hazard is present because the current instruction is a branch, then its resolution typically takes multiple clock cycles in any pipelined implementation. Hence, without any additional information from the designer, a valid scheduling cannot allow an initiation interval equal to one for a pipeline that implements the specification of Fig. 1.

Similar considerations apply for data dependencies and, indeed, for any modification of a state variable in the body
of a SystemC loop. For example, since the instruction flow cannot be known at design time, the HLS tool must expect that an entry in the register file (or in the data memory) that is written by an instruction at a given iteration could be read by another as early as in the next iteration. Hence, to synthesize a classic 5-stage pipeline without any data hazard, the HLS tool may force the execution of just one instruction every 5 cycles, thus yielding a poor IPC = 0.2. Note that checking for hazards in the body of the loop prevents an instruction with invalid operands from committing data into the register file and the data memory. When the HLS tool analyzes the code, however, this check does not eliminate intrinsic dependencies across loop iterations due to array accesses.

Memory Access. In SystemC, memories are naturally specified with arrays. To map an array to a given memory block, an HLS tool transforms a simple array indexing into a bundle of logic for address generation and read/write enable signals. While accessing an array in SystemC incurs a zero-time penalty, most memories have a latency of one or more clock cycles, each imposing the injection of at least one state into the synthesized RTL to schedule the memory access operation. Combining such latency constraints with LCDs further complicates the synthesis of pipelined implementations.

Given the above limitations, we now explain how to write an HLS-friendly specification of a pipeline in SystemC.

Breaking Dependencies with Concurrency. First, we must prevent HLS from implicitly handling data and control dependencies. Many hazards are read-after-write (RAW) hazards, which correspond to real data dependencies. Hence, forcing the HLS tool to ignore them during scheduling requires modeling the pipeline stages with multiple SC_CTHREAD processes, such that no SC_CTHREAD is both reading from and writing to the register file (or the data memory). Fig. 2 shows a partial specification based on this idea: e.g., Stage 3 may only access the register file with a write operation, whereas Stage 1 always accesses it with a read operation and Stage 2 operates without accessing it. In this way, dependencies only exist across concurrent SC_CTHREAD processes so that the HLS tool does not have to abide by the C++ semantics to handle them conservatively. Instead, they are handled explicitly by the check_for_hazards() function.

Code snippet 2:

// Stage 1
while(true)
{
  HLS_CONSTRAINT_LATENCY;
  pc = npc;
  instruction_t inst = decode(imem[pc]);

  if (check_for_hazards(inst)) { // Stall
    wait_for_stage_3(); // May block.
  }
  op1 = regfile[rs1]; op2 = regfile[rs2];
  npc = next_pc(pc, inst, op1, op2);
  if (inst.ctrl.jump) {
    if (npc == pc) break;   // End of program
    else continue;          // Jump or branch taken
  }
  signal_stage_2();
}

// Stage 2
while (true) {
  HLS_PIPELINE_LOOP;
  wait_for_stage_1(); // May block
  out = execute(pc, inst, op1, op2);
  signal_stage_3();
}

// Stage 3
while (true) {
  HLS_CONSTRAINT_LATENCY;
  wait_for_stage_2(); // May block
  if (inst.ctrl.store) dmem[out] = op2;
  else if (inst.ctrl.load) regfile[rd] = dmem[out];
  else if (inst.ctrl.writereg) regfile[rd] = out;
  signal_stage_1();
}

Figure 2. HLS-friendly specification of a pipeline.
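The body of check_for_hazards() is not listed in the paper. Purely as an illustration of what such a check can look like, the fragment below flags a read-after-write hazard by comparing the source registers of the incoming instruction against the destination registers of instructions still in flight in the later stages; the in-flight bookkeeping arrays and their depth are hypothetical names and parameters, not part of the HL5 code.

// Illustrative only: RAW-hazard check against instructions not yet retired.
static const int IN_FLIGHT = 2;     // hypothetical: one slot each for Stage 2 and Stage 3
raddr_t in_flight_rd[IN_FLIGHT];    // destination registers of in-flight instructions
bool    in_flight_wr[IN_FLIGHT];    // whether each in-flight instruction writes the register file

bool check_for_hazards(const instruction_t &inst)
{
  for (int i = 0; i < IN_FLIGHT; i++) {
    if (in_flight_wr[i] &&
        (in_flight_rd[i] == inst.rs1 || in_flight_rd[i] == inst.rs2))
      return true;                  // stall: an operand is still being produced
  }
  return false;
}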
Distributed Pipeline Control. The next step is to introduce a flow-control mechanism ensuring that each stage only processes and retires valid data and instructions. We do so through synchronization primitives across the stages that may block the caller whenever there is no valid data to process. For instance, Stage 2 in Fig. 2 calls wait_for_stage_1(), which suspends the execution of the corresponding SC_CTHREAD until Stage 1 calls signal_stage_2(). The latter is called when a valid instruction is decoded and all operands are available. These synchronization functions use latency-insensitive protocols [3], [26], [29] and apply the principles of transaction-level modeling (TLM) [30], which advocates a separation between computation and communication. As a result, the composition of the three processes is correct-by-construction, independently from the latency and the throughput of the logic synthesized from each SC_CTHREAD. This enables a richer DSE by allowing an independent optimization of each pipeline stage with HLS.

TLM Channels. While the partial SystemC specification in Fig. 2 shows generic synchronization primitives across the processes, HLS tools offer libraries of point-to-point (p2p) channels based on TLM. Usually these channels can be customized to be blocking, non-blocking, or conditionally-blocking. In order to implement a pipeline across processes, the best choice is to use conditionally-blocking channels, which guarantee maximum throughput when both the sending and the receiving processes call the synchronization primitive. A conditionally-blocking channel, on the other hand, enforces correctness by preventing the receiving process from advancing if no valid data is present on the channel. Note that non-blocking channels would achieve similar performance but require that the designer implement all the necessary checks on the presence of valid data. While these checks complicate the design, non-blocking channels are useful when implementing portions of the pipeline that may execute “out-of-order”. In fact, in order to prevent deadlock when the order of execution is not predetermined, the designer must be able to check the state of each channel without blocking the pipeline control logic. This behavior can be obtained only with non-blocking channels.
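The exact p2p channel classes are vendor-specific (the LIC_get_if / LIC_put_if interfaces that appear later in Fig. 5 come from the HLS library used for HL5). As a rough, simulation-only stand-in, two SystemC threads connected by a one-deep sc_fifo already exhibit the blocking handshake described above; the sketch below uses our own module and variable names and is not how the synthesized channels are implemented.

#include <systemc.h>

SC_MODULE(two_stage_demo) {
  sc_fifo<int> ch;                 // one-deep channel between the two stages

  SC_CTOR(two_stage_demo) : ch(1) {
    SC_THREAD(stage1);             // untimed producer
    SC_THREAD(stage2);             // untimed consumer
  }

  void stage1() {
    for (int i = 0; i < 8; i++)
      ch.write(i);                 // blocks while the single slot is still full
  }

  void stage2() {
    for (int i = 0; i < 8; i++) {
      int data = ch.read();        // blocks until stage1 has produced new data
      std::cout << "consumed " << data << std::endl;
    }
    sc_stop();
  }
};

int sc_main(int, char*[]) {
  two_stage_demo top("top");
  sc_start();
  return 0;
}

A depth larger than one plays the role of the storage-versus-throughput knob discussed in the next paragraph: it lets the sender run ahead for a few iterations before blocking.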
[Figure 3: diagram of the HL5 pipeline SystemC specification. Labels in the original figure: SC_MODULE(fedec), SC_MODULE(execute), SC_MODULE(memwb), each containing an SC_CTHREAD(); IMEM and DMEM; pipeline registers F/D, D/E1, En/M, M/WB, WB/F, D/F; the execute stage marked "Variable latency for DSE"; legend: latency-insensitive P2P channels, fixed forward pipeline register, fixed feedback pipeline register, optional forward pipeline register.]
Figure 3. HL5 pipeline: SystemC specification.

The implementation of conditionally-blocking channels depends on the particular HLS tool. The most common behavior consists in letting the sending process advance for one or a few iterations and then suspending it until the receiving process is ready to consume the data. The number of iterations allowed before blocking determines the amount of storage necessary to save the information on the channel and can be considered an HLS knob to explore trade-offs between area and throughput.

Remarks. While the specification of Fig. 1 cannot be synthesized with good quality-of-results, Fig. 2 shows how minor code changes can circumvent the limitations of HLS. With these guidelines, in the span of two months one master student was able to complete the initial design of HL5 and use HLS to synthesize several pipelined implementations, each characterized by a particular trade-off point between area and performance. In just two more hours another student managed to introduce data forwarding across stages to improve the overall IPC. The experience of designing HL5 allowed us to appreciate the advantage of debugging a complex IP block at a level where the ISA simulator corresponds to the design entry point for synthesis, thus eliminating the risk of injecting bugs while manually translating the initial specification into RTL. Furthermore, since the control logic of each stage is abstracted away from the specification, extending the ISA of HL5 with any standard or custom instruction can be done without changing the baseline design. The latency-insensitive p2p channels across the stages, in fact, create a flexible design that is tolerant to variations in latency and throughput, both at design time and at run time, in every stage of the pipeline.

III. HL5 PROCESSOR DESIGN

Fig. 3 illustrates the structure of the synthesizable SystemC specification of the HL5 pipeline, which consists of three main stages: fedec, execute, and memwb. Each stage is modeled with an SC_CTHREAD process as part of an SC_MODULE and communication is implemented through latency-insensitive p2p channels and read or write initiators (i.e., get() and put()). The execute stage is the target of most DSE procedures to optimize the structure of the operations it implements: e.g., addition, subtraction, multiplication, division. As shown in Fig. 4, each process is structured with two main sections: a reset section where initialization and configuration steps are performed, and an infinite loop where communication and computation occur. Within this loop, the stage acquires new data from the previous stage, performs some computation, and transfers the processed information to the next stage.

Each wait() statement corresponds to a virtual-clock boundary. The code in a region delimited by two consecutive wait() statements is typically specified as untimed logic, i.e., no constraints are imposed on the HLS tool with respect to the actual timing of the hardware that must be synthesized to implement this logic. In general, the HLS tool can be left free to decide how many physical-clock cycles to use for implementing this logic (which can be seen as implicitly adding/removing wait() statements during synthesis until the physical-clock boundaries coincide with the virtual-clock ones). These synthesis decisions can be influenced by setting a variety of HLS knobs, such as loop unrolling or pipelining, to guide the HLS tool towards synthesizing a particular microarchitecture with a specific trade-off in terms of performance versus area occupation. In general, depending on the specified settings of the HLS knobs, the synthesized pipeline may have a different number of stages, which doesn’t necessarily correspond to the three SC_MODULEs.

Some HLS directives can be used to further constrain the HLS tool on the desired timing characteristics of a circuit. Among these, the PROTOCOL_REGION() directive forces a code region to be interpreted as cycle accurate: this means that the HLS tool does not add or prune any wait() statement while synthesizing hardware for this region, but, instead, interprets those that are present as physical-clock boundaries. In particular, the p2p-channel primitives provide a transparent abstraction for optimized low-level communication protocols that are implemented with protocol regions. In summary, the combination of latency-insensitive p2p channels and SC_CTHREAD processes provides a clear separation of computation and communication in a compositional way that makes DSE more effective.

As an example, the listing in Fig. 5 presents the data structures and latency-insensitive p2p interfaces for one module.
Code snippet 3:

void pipeline_stage_cthread() {
  {
    PROTOCOL_REGION("reset");
    from_previous_stage_if.reset_get();
    to_next_stage_if.reset_put();
    // ... state initialization.
    wait();
  }
  while(true) {
    {
      PROTOCOL_REGION("input");
      din = from_previous_stage_if.get();
    }
    dout = compute(din); // Relax latency for DSE
    {
      PROTOCOL_REGION("output");
      to_next_stage_if.put(dout);
      wait();
    }
  }
}

Figure 4. Structure of a pipeline stage process.

Code snippet 4:

/* hl5_datatypes.h */
typedef struct exe2memwb_s {
  // ...
} exe2memwb_t;

typedef struct memwb2fedec_s {
  sc_bv<1>  regwrite;
  sc_bv<5>  regfile_address;
  sc_bv<32> regfile_data;
} memwb2fedec_t;

/* memwb.h */

// LIC interfaces
LIC_get_if<exe2memwb_t>   memwb_get_if;
LIC_put_if<memwb2fedec_t> memwb_put_if;

// LIC data structures
exe2memwb_t   memwb_din;
memwb2fedec_t memwb_dout;

Figure 5. Definition of the data structure and p2p interfaces for the memwb stage.

The members of each struct are of type sc_bv, which models a single wire or a bundle (the fields of the exe2memwb_t structure are omitted in the reported listing). Whenever the values on these wires must be used for computation, it is possible to cast them to other types (such as sc_int) that support the C++ arithmetic and logic operators.
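For instance, with a payload like the memwb2fedec_t fields above, a 32-bit value can be moved between the wire-level sc_bv view and an arithmetic view roughly as follows; the variable names are ours and are shown only to illustrate the conversion.

sc_bv<32>  regfile_data;              // wire-level view carried on the channel
sc_int<32> value;

value = regfile_data.to_int();        // reinterpret the bits as a signed integer
value = value + 4;                    // C++ arithmetic and logic operators now apply
regfile_data = value;                 // drive the result back onto the wires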
The use of multiple SC_MODULEs was initially intended simply to keep the design modular and thus easier to manage. However, it has an additional advantage over specifying the three threads within a single module: HLS preserves the hierarchy, including the signals across the stages in this case, thus these can still be observed when running RTL simulation.

Fedec Stage. Fetch and decode are the first two stages of many processor pipelines and are typically separated by pipeline registers. For the design of HL5, however, we combined them into a single stage due to the HLS limitations discussed in Section II. In particular, since scheduling an instruction-memory access requires at least one cycle and the communication across two SC_CTHREAD processes consumes another cycle, splitting fetch and decode would cause an undesirable lower bound of three cycles for just fetch and decode.

While the HLS tool automatically maps the instruction memory to a static RAM, the register file is modeled in SystemC as a simple non-shared array of type sc_bv. The operations of reading/writing its content both occur at the beginning of fedec, thus eliminating the problem of loop-carried dependencies and allowing for an initiation interval of one. Besides instruction fetching, fedec may also get input data from memwb through a feedback path. Hence, at any given iteration of its main loop it decodes the instruction while accessing the register file for reading and/or writing operands, before propagating its output to the execute stage.
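To make this concrete, the fragment below sketches one plausible shape of such a combined fetch/decode step: the RV32I fields are sliced out of the 32-bit instruction word, and the write-back value fed back from memwb is committed to the register-file array before the operands are read. The field positions follow the RISC-V base encoding; the function, the wb_feedback struct (which mirrors memwb2fedec_t from Fig. 5), and the array names are illustrative, not copied from the HL5 sources.

#include <systemc.h>

// Illustrative fetch/decode step, not the HL5 source.
struct wb_feedback {
  sc_bv<1>  regwrite;
  sc_bv<5>  regfile_address;
  sc_bv<32> regfile_data;
};

void fedec_step(const sc_bv<32> imem[], sc_bv<32> regfile[],
                unsigned pc, const wb_feedback &wb,
                sc_int<32> &op1, sc_int<32> &op2)
{
  sc_uint<32> ir = imem[pc].to_uint();   // fetched instruction word (word-addressed memory)

  // RV32I field extraction by bit slicing.
  sc_uint<7> opcode = ir.range(6, 0);
  sc_uint<5> rd     = ir.range(11, 7);
  sc_uint<3> funct3 = ir.range(14, 12);
  sc_uint<5> rs1    = ir.range(19, 15);
  sc_uint<5> rs2    = ir.range(24, 20);

  // Commit the write-back fed back from memwb first, then read the operands,
  // so both register-file accesses happen at the beginning of the loop body.
  if (wb.regwrite.to_uint() == 1)
    regfile[wb.regfile_address.to_uint()] = wb.regfile_data;
  op1 = regfile[rs1.to_uint()].to_int();
  op2 = regfile[rs2.to_uint()].to_int();

  (void)opcode; (void)rd; (void)funct3;  // consumed by the real decode/control logic
}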
Execute Stage. The execute stage is the core stage of the pipeline, where arithmetic and logical operations are performed. The main loop of its SystemC specification contains a large switch statement to select the operation that must be performed on the operands. Generally, the first operand is statically mapped to the first register RS1, while the second operand may be mapped to register RS2 or consists of the immediate field of the instruction. All operations, except for the C++ operators / (division) and % (modulo), are synthesizable by the HLS tool we used. We implemented an optimized algorithm for the division that we encapsulated in a separate function, which is called within the switch statement. The 32-bit division algorithm supports the execution of the div, divu, rem and remu instructions from the RISC-V ISA. Note that any ISA extension can be similarly implemented with a simple function call.

The main loop is specified without using many PROTOCOL_REGION() directives to give maximum freedom to the HLS tool while we performed our DSE to obtain many alternative microarchitectural implementations by applying HLS knobs. In particular:

(1) loop unrolling is applied to increase hardware parallelism. For instance, the hardware resources necessary to implement the divisor can be replicated multiple times in order to reduce the division latency from 32 clock cycles down to 16, 8, or 4 (a simplified sketch of such a division loop follows this list). In traditional RTL synthesis, there is no control of this kind and loops are always completely unrolled. In this case, a division which by the definition of the algorithm takes 32 clock cycles (CC) may be transformed into different implementations which may take as little as a few clock cycles to perform the operation. On the downside, the replication of hardware yields a larger area occupation.

(2) loop pipelining is applied to raise throughput while keeping the possibility of sharing most resources of multi-cycle units, thus reducing area occupation. While loop-carried dependencies prevent this option from being applied to the division, it can be used to improve a multi-cycle version of the multiplier and to automatically implement multiple pipeline stages within the execute phase.

(3) tool-specific synthesis directives enable fine tuning of the scheduling by requiring more aggressive synthesis approaches, such as scheduling operations as-soon-as-possible, or extracting portions of the logic into separate FSMs that are concurrent with respect to the rest of the circuit.
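The paper does not list its optimized division routine. Only to show the kind of loop on which the unrolling knob acts, the sketch below is a textbook restoring divider for unsigned 32-bit operands: one quotient bit is produced per iteration, so replicating the loop body (by hand or through the HLS unrolling directive) shortens the latency at the cost of duplicated hardware. This is an assumption-laden illustration, not the algorithm used in HL5.

#include <systemc.h>

// Illustrative restoring divider: one quotient bit per iteration.
// The caller must handle den == 0 separately, as the RISC-V ISA defines that case.
void divide32(sc_uint<32> num, sc_uint<32> den,
              sc_uint<32> &quo, sc_uint<32> &rem)
{
  sc_uint<33> r = 0;                 // partial remainder with one guard bit
  sc_uint<32> q = 0;

  for (int i = 31; i >= 0; i--) {    // the HLS unroll knob can replicate this body
    r = (r << 1) | num[i];           // bring down the next dividend bit
    if (r >= den) {                  // trial subtraction
      r = r - den;
      q[i] = 1;
    }
  }
  quo = q;
  rem = r.range(31, 0);
}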
[Figure 6 (implementation and evaluation flow) and Figure 7 (IP block system for FPGA emulation) appeared here. Recoverable labels: SystemC specification, Stratus HLS, HL5 RTL description, ZERO-RISCY RTL description, RISC-V executable, Incisive simulator, Design Compiler, CMOS library, CMOS timing & cost analysis, IP block packaging & ZYNQ integration, IP library, control application, Vivado, SDK, FPGA emulation & performance estimation.]
Figure 6. Implementation and evaluation flow.
Figure 7. IP block system for FPGA emulation.
Memwb Stage. To avoid the same issue discussed for the fedec stage, we combined memory and write-back into a single memwb stage. This accesses data memory for load/store operations and retires completed instructions while marking accordingly those that update an entry in the register file.

IV. DESIGN EVALUATION

We synthesized the SystemC specification of HL5 with the commercial STRATUS HLS tool from Cadence. Through various combinations of the HLS knobs, we obtained 12 different RTL implementations of the HL5 pipeline, each corresponding to a different microarchitecture. We compared these implementations with ZERO-RISCY, a processor core for IoT that is part of the PULP platform [27], [28]. Both processors implement the full RV32IM subset of the RISC-V ISA [24].

Fig. 6 shows the CAD flow we used to implement and evaluate both HL5 and ZERO-RISCY: the only difference is in the initial steps for HL5, which consist in simulating the SystemC program to check functional correctness with respect to the ISA and in synthesizing the RTL with Stratus HLS. The RTL implementation is validated again via RTL simulation (for ZERO-RISCY this is the starting point) before going through logic synthesis for FPGA. Estimates on area occupation and maximum clock frequency are also obtained for each implementation with Synopsys Design Compiler using a commercial 32 nm CMOS technology. The performance of each implementation is then evaluated by running on the FPGA bare-metal applications compiled from C software. In particular, each implementation is packaged as an IP block within the Xilinx Vivado environment and integrated “as a client processor” with the dual-core ARM processor on the ZYNQ SoC FPGA. Fig. 7 illustrates the ZYNQ system instantiating either HL5 or ZERO-RISCY as the device-under-test. Through the AXI interconnect, we added two SRAM modules that serve as instruction and data memories for both HL5 and ZERO-RISCY. The ARM core executes a control application to load these memories, start the target processor, and monitor its execution. A custom core controller, also implemented with HLS, interfaces the control application with the processor under test. Additionally, this module raises an interrupt to the ARM processor when the execution on the target core is completed and returns the value of a performance counter, corresponding to the number of clock cycles taken by the program execution. Notice that every HL5 implementation can be seamlessly integrated into the ZYNQ processing system without any edits to the SystemC code, or to the control application.

Area-Performance Analysis. Fig. 8 reports the performance of four Pareto-optimal implementations of HL5 normalized against ZERO-RISCY, when executing eight popular benchmarks: DHRYSTONE, HISTOGRAM-EQUALIZATION, AES256, MATRIX MULTIPLICATION, a division-intensive synthetic benchmark, fixed-point FFT, CONVOLUTION, and 2D-CONVOLUTION. The four HL5 implementations are labeled based on the chosen HLS-knob settings. For the “Basic” implementation, we set only default HLS knobs; for the “ASAP” implementation, we force the HLS tool to schedule operations as soon as possible, thus trading off some opportunities for resource sharing; finally, for the implementations labeled “DIV2” and “DIV4”, we leverage loop unrolling to speed up sequential units, and in particular the divider. Each bar with a value above one corresponds to a speedup, while each of the others corresponds to a slowdown. The speedup is evaluated as the ratio between the effective latency of ZERO-RISCY and the effective latency of HL5. This metric is computed as the product of the cycle count, measured through the FPGA emulation, and the achievable clock period.
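Stated as a formula, with C the cycle count measured on the FPGA and T the post-synthesis clock period of each core, the value plotted for a given benchmark is

  speedup = (C_ZERO-RISCY × T_ZERO-RISCY) / (C_HL5 × T_HL5),

so values above one favor HL5.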
When considering performance, HL5 implementations are on average comparable to ZERO-RISCY, with a speedup that varies across benchmarks.¹

¹ HL5 does not support the custom instructions of ZERO-RISCY.
[Figure 8: bar chart of the speedup of the Basic, ASAP, DIV2, and DIV4 implementations of HL5 over ZERO-RISCY (y axis: speedup, roughly 0.4 to 2.6) for the benchmarks dhrystone, aes256, fft, div, mmult, conv1D, conv2D, hist, and their geometric mean.]
Figure 8. Performance analysis: HL5 vs ZERO-RISCY.

[Figure 9: normalized CPU time (y axis, about 0.90 to 1.20) versus area in um^2 (x axis, about 12000 to 20000) for zero-riscy and the Basic, ASAP, DIV2, and DIV4 implementations of HL5, marking Pareto-optimal and Pareto-dominated points.]
Figure 9. Normalized CPU time vs. area.

[Figure 10: CPI = 1/IPC (y axis, about 1.60 to 1.80) versus area in um^2 (x axis, about 12000 to 20000) for zero-riscy and the Basic, ASAP, DIV2, and DIV4 implementations of HL5, marking Pareto-optimal and Pareto-dominated points.]
Figure 10. Clock-per-instruction (1/IPC) vs. area.

HISTOGRAM-EQUALIZATION shows most of the benefits of applying HLS to automatically trade off resources and clock period for spatial computation through loop unrolling. Even though this advantage is clearly application-dependent, in the context of embedded systems and IoT the mix of applications is typically well characterized. Therefore, designers can select a microarchitecture that performs best with the target workload without editing the source code of the processor.

Fig. 9 shows the area-speedup trade-offs in a bi-objective space where normalized latency is the y axis and area occupation is the x axis. Fig. 10, instead, shows a different bi-objective space where latency is replaced with clock-per-instruction (1/IPC). This metric is relevant because HLS tools are currently too pessimistic when calculating the critical path, thus penalizing some microarchitectural choices, which fail the scheduling step if the target clock period is too stringent. These results show that the HL5 footprint is between 35% and 50% larger than ZERO-RISCY, which was carefully optimized for area occupation through manual RTL design. On the other hand, HLS allows us to automatically generate multiple implementations of HL5 which have comparable performance and improved IPC with respect to ZERO-RISCY.

Lines of code (LOC). It should be noted that the effort and time required by the proposed design activity is notably less than that involved in a traditional RTL design flow. These aspects are usually hard to measure. We report the number of lines of code (LOC), which is a commonly used metric for estimating design effort. The SystemC LOC for HL5 is about 2k, while the RTL of ZERO-RISCY consists of more than 6k LOC.

V. RELATED WORK

SystemC has been used for modeling processors, but there are no papers on its application to design them with HLS and synthesize an implementation for FPGA or ASIC flows.

The work by Huang and Despain [31] is the first in a series of papers that propose processor-specific HLS tools for both pipeline optimization and compiler generation [32], [33]. In contrast, we use a commercial general-purpose HLS tool, since specialized ones have never reached a market large enough to enable hardware development beyond the research stage.

Skalicky et al. have proposed using a commercial HLS tool to improve the performance of a customizable MIPS-like processor [34]. While they only comment about how pipeline hazards reduce performance, their approach severely hinders the achievable throughput, as it is based on a purely sequential specification. In contrast, we start from a concurrent model of the processor and achieve better performance because pipeline hazards are handled up front.

Along different research lines, some researchers started from a domain-specific model of a processor pipeline and automatically performed transformations such as bypass and speculation to increase its throughput [35]–[37]. Some of these transformations (e.g. speculation) can be automatically performed by the commercial HLS tool we used, while others (e.g. register-file or memory bypass) could represent interesting enhancements to any HLS tool.
Finally, several implementations of RISC-V have been made in Chisel [38], a Scala-embedded language that allows functional and object-oriented descriptions of hardware circuits. Chisel, however, “more closely resembles traditional hardware description languages like Verilog than high-level synthesis systems”, as recognized by some of its developers [39].

VI. CONCLUDING REMARKS

We presented HL5, which supports the RV32IM subset of the RISC-V ISA, as the first 32-bit processor fully designed and implemented using SystemC and HLS. We showed how the HLS flow can be applied to realize processor pipelines with performance comparable to that of a manually-optimized RTL implementation. Despite the limitations of current HLS tools, the effort to design and optimize HL5 was significantly smaller compared to traditional RTL flows. Addressing these limitations represents a research driver for future HLS tools. We plan to release HL5 in the public domain to serve as an initial template for the design of future, more complex processors with HLS.

Acknowledgments. This work is supported in part by the National Science Foundation A#: 1219001 and 1527821 and DARPA DECADES C#: FA8650-18-2-7862. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory and DARPA or the U.S. Government.

REFERENCES

[1] S. R. Corp., “Licensing, royalty and service revenues for 3rd-party SIP: Market analysis and forecast for 2015.”
[2] A. Sangiovanni-Vincentelli, “Quo vadis SLD: Reasoning about trends and challenges of system-level design,” Proc. of the IEEE, vol. 95, no. 3, pp. 467–506, 2007.
[3] L. P. Carloni, “From latency-insensitive design to communication-based system-level design,” Proc. of the IEEE, vol. 103, no. 11, pp. 2133–2151, Nov. 2015.
[4] A. Takach, “High-level synthesis: Status, trends, and future directions,” IEEE Design & Test of Comp., vol. 33, no. 3, pp. 116–124, 2016.
[5] S. Mhaske, H. Kee, T. Ly, and P. Spasojevic, “FPGA-accelerated simulation of a hybrid-ARQ system using high level synthesis,” in IEEE 37th Sarnoff Symposium, Sep. 2016, pp. 19–21.
[6] P. Sjovall, J. Virtanen, J. Vanne, and T. D. Hamalainen, “High-level synthesis design flow for HEVC intra encoder on SoC-FPGA,” in Euromicro Conf. on Digital System Design, Aug. 2015, pp. 49–56.
[7] G. A. Malazgirt, N. Sonmez, A. Yurdakul, A. Cristal, and O. Unsal, “High level synthesis based hardware accelerator design for processing SQL queries,” in Proc. of the 12th FPGAworld Conference, Sep. 2015, pp. 27–32.
[8] E. D. Sozzo, A. Solazzo, A. Miele, and M. D. Santambrogio, “On the automation of high level synthesis of convolutional neural networks,” in Proc. of International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2016, pp. 217–224.
[9] C. Pilato, P. Mantovani, G. Di Guglielmo, and L. P. Carloni, “System-level optimization of accelerator local memory for heterogeneous systems-on-chip,” IEEE Trans. on CAD, vol. 36, no. 3, pp. 435–448, Mar. 2017.
[10] L.-N. Pouchet, P. Zhang, P. Sadayappan, and J. Cong, “Polyhedral-based data reuse optimization for configurable computing,” in Proc. of Symp. on FPGA, Jan. 2013, pp. 29–38.
[11] J. Cong, P. Zhang, and Y. Zou, “Optimizing memory hierarchy allocation with loop transformations for high-level synthesis,” in Proc. of DAC, Jun. 2012, pp. 1233–1238.
[12] V. Kathail, J. Hwang, W. Sun, Y. Chobe, T. Shui, and J. Carrillo, “SDSoC: A higher-level programming environment for Zynq SoC and Ultrascale+ MPSoC,” in Proc. of Symp. on FPGA, 2016, pp. 4–4.
[13] P. Parakh, D. Mullassery, A. Chandrashekar, H. Koc, D. Dal, and N. Mansouri, “Interconnect-centric high level synthesis for enhanced layouts with reduced wire length,” in Intl. Midwest Symp. on Circuits and Systems, Aug. 2006, pp. 595–600.
[14] A. Canis, J. Choi, B. Fort, R. Lian, Q. Huang, N. Calagar, M. Gort, J. J. Qin, M. Aldham, T. Czajkowski, S. Brown, and J. Anderson, “From software to accelerators with LegUp high-level synthesis,” in Proc. of CASES, 2013, pp. 18:1–18:9.
[15] L. P. Carloni, “Invited - the case for embedded scalable platforms,” in Proc. of DAC, 2016, pp. 17:1–17:6.
[16] Y.-T. Chen et al., “Accelerator-rich CMPs: From concept to real hardware,” in Proc. of ICCD, Oct. 2013, pp. 169–176.
[17] D. Pursley and T. H. Yeh, “High-level low-power system design optimization,” in Proc. of VLSI-DAT, April 2017, pp. 1–4.
[18] A. Kondratyev, L. Lavagno, M. Meyer, and Y. Watanabe, “Realistic performance-constrained pipelining in high-level synthesis,” in Proc. of DATE, Mar. 2011, pp. 1–6.
[19] G. Martin and G. Smith, “High-level synthesis: Past, present, and future,” IEEE Design & Test of Computers, vol. 26, no. 4, pp. 18–25, 2009.
[20] T. Kam, S. Rawat, D. Kirkpatrick, R. Roy, G. Spirakis, N. Sherwani, and C. Peterson, “EDA challenges facing future microprocessor design,” IEEE Trans. on CAD, vol. 19, no. 12, pp. 1498–1506, Dec. 2000.
[21] S. Borkar and A. A. Chien, “The future of microprocessors,” Communications of the ACM, vol. 54, pp. 67–77, May 2011.
[22] M. Horowitz, “Computing’s energy problem (and what we can do about it),” in ISSCC, Feb. 2014, pp. 10–14.
[23] K. Asanović and D. A. Patterson, “Instruction sets should be free: The case for RISC-V,” EECS Dept., UC Berkeley, Tech. Rep. UCB/EECS-2014-146, Aug. 2014.
[24] “The RISC-V foundation,” https://riscv.org/.
[25] D. Kanter, “RISC-V offers simple, modular ISA,” in Microprocessor Report, Mar. 2016.
[26] L. P. Carloni, K. L. McMillan, A. Saldanha, and A. L. Sangiovanni-Vincentelli, “A methodology for “correct-by-construction” latency insensitive design,” in Proc. of ICCAD, Nov. 1999, pp. 309–315.
[27] “The PULP platform,” https://pulp-platform.org.
[28] P. D. Schiavone, F. Conti, D. Rossi, M. Gautschi, A. Pullini, E. Flamand, and L. Benini, “Slow and steady wins the race?” in Proc. of PATMOS, Sept. 2017, pp. 1–8.
[29] L. P. Carloni, K. McMillan, and A. Sangiovanni-Vincentelli, “Theory of latency-insensitive design,” IEEE Trans. on CAD, vol. 20, no. 9, pp. 1059–1076, Sep. 2001.
[30] F. Ghenassia, Transaction-Level Modeling with SystemC. Springer, 2010.
[31] I. J. Huang and A. M. Despain, “High level synthesis of pipelined instruction set processors and back-end compilers,” in Proc. of DAC, Jun. 1992, pp. 135–140.
[32] M. Itoh, S. Higaki, J. Sato, A. Shiomi, Y. Takeuchi, A. Kitajima, and M. Imai, “PEAS-III: an ASIP design environment,” in Proc. of ICCD, Sep. 2000, pp. 430–436.
[33] O. Schliebusch, A. Chattopadhyay, R. Leupers, G. Ascheid, H. Meyr, M. Steinert, G. Braun, and A. Nohl, “RTL processor synthesis for architecture exploration and implementation,” in Proc. of DATE, Feb. 2004, pp. 156–160.
[34] S. Skalicky, T. Ananthanarayana, S. Lopez, and M. Lukowiak, “Designing customized ISA processors using high level synthesis,” in Proc. of ReConFig, Dec. 2015.
[35] S. Gupta, T. Kam, M. Kishinevsky, S. Rotem, N. Savoiu, N. Dutt, R. Gupta, and A. Nicolau, “Coordinated transformations for high-level synthesis of high performance microprocessor blocks,” in Proc. of DAC, 2002, pp. 898–903.
[36] T. Kam, M. Kishinevsky, J. Cortadella, and M. Galceran-Oms, “Correct-by-construction microarchitectural pipelining,” in Proc. of ICCAD, Nov. 2008, pp. 434–441.
[37] E. Nurvitadhi, J. C. Hoe, T. Kam, and S. L. L. Lu, “Automatic pipelining from transactional datapath specifications,” in Proc. of DATE, Mar. 2010.
[38] J. Bachrach, H. Vo, B. Richards, Y. Lee, A. Waterman, R. Avizienis, J. Wawrzynek, and K. Asanović, “Chisel: Constructing hardware in a Scala embedded language,” in Proc. of DAC, June 2012, pp. 1212–1221.
[39] K. Asanovic et al., “The Rocket chip generator,” EECS Dept., UC Berkeley, Tech. Rep. UCB/EECS-2016-17, Apr. 2016.
