IBTIDA: Fully open-source ASIC implementation of Chisel-generated System on a Chip
5 authors, including:
Sajjad Ahmed
University of Salento
LICENSE
CC BY 4.0
22-09-2021 / 30-09-2021
CITATION
Khan, Muhammad Hadir; Jalal, Aireen Amir; Ahmed, Sajjad; Ansari, Ali Ahmed; Naqvi, Syed Roomi (2021):
IBTIDA: Fully open-source ASIC implementation of Chisel-generated System on a Chip. TechRxiv. Preprint.
https://doi.org/10.36227/techrxiv.16663738.v1
DOI
10.36227/techrxiv.16663738.v1
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 14, NO. 8, SEPTEMBER 2021 1
Abstract—Building a System on Chip (SoC) using a fully open-source toolchain requires the availability of open-source tools for RTL simulation, generation, GDS-II conversion, manufacturable foundry process design kits (PDKs), IP libraries, and I/O blocks. The proposed work shows the methodology of using completely open-source tools and a hardware construction language (HCL) to tape out a RISC-V based SoC, Ibtida. The methodology utilizes Chisel (Constructing Hardware in Scala Embedded Language) as the RTL generator, Verilator as the RTL simulator, OpenLANE as the RTL to GDS-II converter, and the SKY-130nm Open PDK to manufacture the SoC. Ibtida consists of a 5-stage pipelined 32-bit RISC-V (RV32IM) core with 32 GPIOs, and separate instruction and data memories. The Ibtida design is embedded in a harness on a physical chip. The harness is equipped with a management SoC used as a controller for Ibtida. Prior to converting the RTL into GDS-II, cycle-accurate simulation using Verilator and FPGA emulation on a Xilinx ARTY A7 were performed for verification and regression testing. The FPGA implementation utilizes 8650 LUTs, 3356 Slice Registers, 714 flip flops, and 2.5 Block RAM of 36Kb. The ASIC implementation utilizes a 2.5 mm² area with a density of 37.44 KGate/mm². The manufacturing of this SoC is provided by the Google shuttle program called Open MPW (Multi Project Wafer) in association with Efabless and SkyWater Technologies. To the best of our knowledge, this is the first RISC-V based SoC generated using Chisel and taped out using fully open-source technologies.

Index Terms—OpenLANE, OpenROAD, Chisel, RTL, FPGA, SoC, Open-source hardware, RISC-V.

I. INTRODUCTION

Today, Moore's law is diminishing. The trend of increasing computing capabilities by doubling the number of transistors is coming to a halt [1]. Due to this, we are entering the golden age of computer architecture [2], where the key driving force in the pursuit of increased performance is something other than miniaturization alone. However, due to the proprietary nature of chip design, innovation has been somewhat limited because only big companies can design their own processors. This was democratized by the advent of the RISC-V Instruction Set Architecture (ISA) [3], which enabled startups and communities to work together in chip design. Still, there was another barrier preventing academic researchers, startups, and small companies from actually taping out their processors: the closed nature of the Process Design Kit (PDK). For a long time, many open-source Electronic Design Automation (EDA) tools (SPICE, Magic, etc.) have been available to physical design engineers, but the lack of a completely open-source PDK confined custom hardware design to a handful of large, established companies and well-funded research universities. However, this problem was also resolved in mid-2020, when the SkyWater foundry together with Google introduced the first fully open-source PDK, the SKY130 process node [4], which is based on a 130nm Complementary Metal Oxide Semiconductor (CMOS) technology.

A. The open-source hardware momentum

Since the arrival of the RISC-V ISA, there has been a boom in the open-source chip design domain. It proved that, like open-source software, open-source hardware can be greatly improved by a collaborative effort between small and big companies complementing each other, improving not only the ISA but also the ecosystem of other tools required for hardware design [5].

The ChipsAlliance [6], established in 2019, takes the aim of open-source hardware design even further. It provides a common place for designers to create innovative solutions using open-source tools. It has renowned companies as members working together to develop reusable open-source IPs. It is also focused on providing tools for open-source physical design. The very ambitious open-source OpenROAD project [7] is also part of the ChipsAlliance. It aims to provide 24-hour, No-Human-In-The-Loop layout design for SoC, Package, and PCB with no Power-Performance-Area (PPA) loss, enabling software engineers and people with scarce physical design knowledge to tape out their own processors.

Even with everything from RTL to EDA tools available as open source, a complete open chip design flow was still hindered by the nonexistence of a completely open-source PDK. For over twenty years, PDKs have been kept closed source and have required non-disclosure agreements (NDAs), license servers, and password-protected download sites, leaving the privilege of taping out designs in the hands of only big, established companies [8]. But the SkyWater foundry opening up its design for a 130nm process together with Google, and the Efabless/Google collaboration providing free tape-out shuttles, present a huge opportunity for startups, small academic institutes, and even high school students to come up with their own custom designs and actually get them fabricated.

B. Why hardware should learn from software

Due to the halt in performance even after doubling the number of transistors, the era of domain-specific architecture is booming. The advent of the RISC-V ISA has enabled small teams and startups to develop custom hardware to improve performance and efficiency in terms of power consumption.
However, the process for designing chips has been painfully long and has involved a rigid development model that earlier software development followed, known as the waterfall model. The software created in the early days suffered from budget overruns, missed deadlines, and abandoned projects. Making changes to a whole monolithic software project was very difficult as the customer's needs changed. The same goes for hardware projects. In a hardware project, first the micro-architecture is specified, followed by the RTL design, after which verification happens, and then the complete physical design of the netlist is done. Usually, the physical design is even outsourced to other companies, which further increases the timeline of the projects, usually ranging from 1-3 years, and if the customer's needs change the whole process needs to be repeated. The agile software methodology [9] emphasizes working software over detailed documentation, customer collaboration, and flexibility over rigid specifications. It promotes small teams working iteratively on improving working-but-incomplete prototypes and enhancing them until the end result is acceptable. Inspired by this agile software approach, researchers at the University of California, Berkeley proposed their own "Agile Hardware Manifesto" [10], through which they taped out eleven processors in a span of five years.

To facilitate this agile hardware development idea by increasing designer productivity, Chisel [11] was created. It is a domain-specific language built on top of Scala that provides all the high-level programming features, such as Object-Oriented Programming (OOP) and Functional Programming (FP), to the designer for creating reusable libraries that generate efficient hardware circuits. The idea is to create reusable packages, just as in software, which provide abstraction and easy-to-use integration of various verified IPs. Furthermore, the Chisel compiler automatically creates a fast, cycle-accurate C++ software simulator, or low-level synthesizable Verilog that maps to FPGA or ASIC flows.

C. Previous works

There have been eleven tape-outs based on Chisel utilizing the Rocket-chip generator [12] by the University of California, Berkeley, but they were based on commercial EDA tools and closed PDKs. Also, a family of striVe SoCs was taped out using OpenLANE and the Skywater 130nm PDK to prove the viability of all-open-source EDA tools and the PDK [13]. However, it is written in a traditional low-level hardware description language, Verilog. The Rocket-chip generated tape-outs were missing the open-source backend flow to generate the GDS, and the striVe family SoCs, although mapped on the open PDKs, lacked a frontend design written in a higher-level programming language.

In this paper, we present our contribution: using the abstractness and software programming feel of Chisel, we taped out a 5-stage pipelined RISC-V RV32IM core and a minimal SoC around it with no prior experience in chip design, and passed the generated RTL (Verilog) to OpenLANE [14] to provide a completely open-source RTL-GDS flow, which was then mapped onto the fully open-source SkyWater 130nm process design kit through the Google/Efabless MPW Shuttle program [15]. We used Chisel for the ease of programming hardware circuits, which gave us a quick start with RTL design compared to low-level Verilog, and proved that the generated Verilog can be mapped to the fully open suite of Electronic Design Automation (EDA) tools and can be fabricated on the Skywater 130nm open PDK.

II. DESIGN METHODOLOGY AND SPECIFICATION

To demonstrate the work proposed in this paper, we followed a methodology on a design specification and analyzed its implementation and results. In the following sub-sections, we discuss the methodology and specification of the design, and later sections give details related to the implementation and analysis of the design.

A. Methodology

Chisel was used as the frontend of the proposed design. It is a domain-specific language embedded inside Scala that provides the higher functionality of a programming language for designing circuits, instead of traditional HDLs like Verilog/VHDL [16]. The Chisel front-end generates an Intermediate Representation (IR) called Flexible Intermediate Representation for RTL (FIRRTL), which provides certain transforms and passes based on Scala [17], running on top of the Java Virtual Machine (JVM). These transform the same Chisel code for use in three different backends: 1) Simulation, 2) FPGA Emulation, and 3) ASIC Implementation.

Fig. 1. Overview of different stages used in Ibtida during the tape-out of the design

For simulation, to check the functionality of the design, the Chisel compiler was used to generate a C++ simulator based on the emitted Verilog of the SoC through Verilator [18], together with an emitted C++ wrapper for providing stimuli to the compiled simulator. Running the simulator then generates a Value Change Dump (VCD) file that can be viewed in the open-source waveform viewer GTKWave [19].

For emulating on the FPGA, the Chisel generated Verilog was mapped on the Arty A7 FPGA board using Xilinx's
Vivado for synthesizing, placing and routing, and generating the bitstream to be mapped on the board. This is the only closed-source path that was used, and only for emulation. Open-source alternatives for the FPGA implementation exist as well, such as the Symbiflow project [20] or OpenFPGA [21], but they are outside the scope of this paper.

For the ASIC, the generated Verilog and the SkyWater 130 nm PDK were used along with the OpenLANE flow, which comprises various open-source tools for Synthesis, Floorplan, Power Distribution Network (PDN) generation, Place and Route, Design Rule Check (DRC), Layout Versus Schematic (LVS) checks, and GDSII generation.

B. Specification

Ibtida is a minimal System on a Chip designed completely with Chisel using its higher-level programming language features. It consists of four basic elements that every computer has: 1) Compute, 2) Communication, 3) Peripherals, and 4) Storage. The instruction interface has a point-to-point interconnect for fetching instructions, and the data interface has a 1xN interconnect that allows the core to perform loads/stores either to the memory or to the GPIO peripheral. Since there is no non-volatile memory present for code storage, a UART controller is designed to accept the program from the host computer and write it into the ICCM memory every time the board is powered on or a new program needs to be uploaded.

Fig. 2. Ibtida System on a Chip block diagram

The details of each element highlighted in figure 2 are described below:

1) Compute: It is a 32-bit, 5-stage pipelined core compliant with the RISC-V base integer ISA (I) and the additional M extension, which supports multiply/divide instructions, together making it an RV32IM core. It has five pipeline stages: 1) Fetch (F). 2) Decode (D). 3) Execute (E). 4) Memory (M). 5) WriteBack (WB).

a) Fetch: The fetch stage has a Program Counter (PC) that points to the next instruction to be fetched and an interface to fetch the instructions from the memory. The PC value is updated through a multiplexer that selects the next PC value, which can be a simple PC + 4 through an adder or another jump address depending upon the instruction in the Decode stage.

b) Decode: The decode stage consists of a register file with 32 registers, x0 to x31, each 32 bits wide, as described in the RISC-V ISA. It also has an Immediate Generation unit that extracts the encoded immediate values from the instructions, concatenating and padding them to become 32 bits wide. There is a Control Unit as well that decodes the current instruction using the opcode and enables certain control signals depending upon the type of instruction. There is a Branch Unit that identifies if the current instruction is a branch instruction and calculates the next PC address if the branch is taken. The Branch Unit was kept in the Decode stage to reduce the branch miss penalty to 1 cycle if the branch is taken, since the fetch would need to be flushed and the new instruction fetched from the updated PC value. It also has a Hazard Detection logic unit that prevents structural hazards from happening, i.e., when the register being accessed by the current instruction is also being written at the same time by another instruction in the Write Back stage.

c) Execute: The execute stage has an Arithmetic Logic Unit (ALU) for computation-related tasks and an ALU Control unit that indicates to the ALU which operation needs to be performed. It also has a forwarding unit that is used to provide the ALU with proper operands if there are any data hazards in the pipeline.

d) Memory: The memory stage consists of a store/load unit that performs either stores or loads to the memory or the GPIO peripheral.

e) Write Back: The write back stage consists of a mux that selects the data to be written in the register file, which can be either from the ALU output or from the data memory.

2) Communication: The communication mechanism used between the core, peripherals, and memories is the TileLink Uncached Lightweight (TL-UL) bus protocol [22]. TL-UL, the miniature version of TileLink, was used since we did not require cache coherency and other complex communication. The fetch stage sends a valid request to the TL-UL Master, which then communicates with the TL-UL Slave that is in turn connected to the instruction memory. This forms a point-to-point interconnection between the core's fetch stage and the instruction memory, as shown in figure 2. For loads/stores during the memory stage, a 1xN switch is used to connect a single TL-UL Master with multiple TL-UL Slaves, of which there are two in our case: one for the data memory and the other for the GPIO peripheral. The 1xN switch automatically decodes which slave to route the master's request to depending upon the address issued. There is no support for burst accesses. The master can only send one request at a time and must wait for the acknowledgment before sending another request.

3) Peripherals: The SoC contains only one peripheral, the GPIO connected to the bus. The GPIO has 30 I/O pads going outside to interact with the outside world. Its control and status registers (CSRs) are accessible via the TL-UL bus and can be manipulated by the software program running on the
core. The other standalone peripheral is the UART that is used to program the instruction memory.

Fig. 3. The RISC-V ISA compliant RV32IM 5-Stage fully pipelined datapath designed from scratch with Chisel

4) Storage: The SoC has a Harvard Architecture consisting of separate Instruction Closely Coupled Memory (ICCM) and Data Closely Coupled Memory (DCCM). Both memories are 1 Kbyte in size and are accessible via the TL-UL bus.

III. IMPLEMENTATION AND ANALYSIS

A. Verilator Simulation

For testing the functionality of the design, Verilator was used to simulate the SoC and each of its individual components. Listing 1 shows how a 2-way mux can be designed in Chisel.

package myMux

import chisel3._
import chisel3.util._          // provides switch / is

// 2-way multiplexer for 8-bit operands
class Mux extends Module {
  val io = IO(new Bundle {
    val a   = Input(UInt(8.W))
    val b   = Input(UInt(8.W))
    val sel = Input(UInt(1.W))
    val o   = Output(UInt(8.W))
  })
  io.o := DontCare             // default; both sel values are covered below
  switch(io.sel) {
    is(1.U) {
      io.o := io.b             // sel = 1 routes b to the output
    }
    is(0.U) {
      io.o := io.a             // sel = 0 routes a to the output
    }
  }
}

Listing 1: Design of a 2-way Multiplexer in Chisel

The verification of the Mux can be done by creating a testbench in Chisel as shown in listing 2.

package myMux

import chisel3.iotesters.PeekPokeTester

class MuxTester(c: Mux) extends PeekPokeTester(c) {
  poke(c.io.a, 2)     // arbitrary test value
  poke(c.io.b, 4)     // arbitrary test value
  poke(c.io.sel, 0)   // io.a to output
  expect(c.io.o, 2)   // expecting the value of io.a
  step(1)             // advance one clock edge
  poke(c.io.a, 2)     // arbitrary test value
  poke(c.io.b, 4)     // arbitrary test value
  poke(c.io.sel, 1)   // io.b to output
  expect(c.io.o, 4)   // expecting the value of io.b
}

Listing 2: Testbench to verify a 2-way Mux in Chisel

A driver class as shown in listing 3 is used to configure the Scala backend to use Verilator for testing, and an additional flag is used to generate the VCD trace for waveform viewing. The Scala build tool (sbt) is utilized to compile the Scala classes and execute them as shown in listing 4, which in turn builds all the Verilator files using the testbench and generates a VCD trace to view.

The generated VCD trace can be viewed in GTKWave. Figure 4 depicts the resulting waveform for the mux.

Similarly, each module within the Ibtida SoC was tested for correct functionality using Chisel-based testbenches and Verilator-based simulation. Table I shows a RISC-V assembly test program that is run on the SoC, and figure 6 shows how the instructions pass through the pipeline, with only the important signals extracted for ease of reading. The whole test suite run on the Ibtida SoC is available on GitHub [23].

Initially, as shown in figure 6, the UART programmer loads the program into the instruction memory and asserts
uart_done high, signaling that the memory is loaded. The fetch stage then sends a valid request with the PC's current value and gets the instruction in the next cycle. Until then, a NOP (no operation) instruction, which does nothing in the pipeline, is sent to the datapath. After this, on each clock cycle, a new instruction is fetched and previous instructions progress through the pipeline. Finally, the registers get loaded with the values coming from the write back stage.

B. FPGA Emulation

The Verilog generated from Chisel for the Ibtida SoC was mapped on the Arty A7 FPGA board. It runs at a frequency of 8 MHz with no total negative slack (TNS) and no failing endpoints. Table II shows the timing report of the implemented design.

TABLE II
TIMING REPORT OF IBTIDA SOC

Setup                            Hold                         Pulse Width
Worst Negative Slack: 1.826ns    Worst Hold Slack: 0.028ns    Worst Pulse Width Slack: 3.000ns
Total Negative Slack: 0.000ns    Total Hold Slack: 0.000ns    Total Pulse Width Negative Slack: 0.000ns

The MMCM primitive was used as the clock generator to provide the clock to the design. The ICCM and DCCM memories were mapped into FPGA Block RAMs (BRAMs). The DSP units inside the board were used for efficient multiplication. The resource utilization of the design is given in Table III and

C. ASIC Implementation

For the ASIC implementation, the Chisel-generated Verilog was integrated inside a testing harness and then hardened through the OpenLANE flow to generate the GDSII layout.

1) Testing Harness: Caravel [24] is a testing harness that acts as a manager of the Ibtida SoC. It has three parts: 1) Management Area, 2) User Project Area, and 3) Storage Area, as shown in figure 5.

a) Management Area: The management area consists of an SoC built on top of a RISC-V based microprocessor, PicoRV32 [25]. It has some peripherals including timers, UART, and GPIO. The firmware on the management area can be used

TABLE IV
POWER CONSUMPTION REPORT OF IBTIDA SOC

Elements      Power in Watts (W)    Power in %
Clocks        0.001                 1
Signals       0.008                 7
Logic         0.006                 5
BRAM          <0.001                1
DSP           0.001                 1
MMCM          0.106                 86
I/O           <0.001                0

Power Type    Power in Watts (W)    Power in %
Dynamic       0.123                 66
Static        0.062                 34
TABLE V
NETLIST REPORT OF IBTIDA SOC AFTER SYNTHESIS
init_floorplan
place_io
gen_pdn
tap_decap_or
stage. The placement of cells is accomplished using two commands: global placement followed by detailed placement, as shown in listing 8.

The global placement command inserts all the standard cells into the core area haphazardly. There is no sequence or order; some standard cells might even overlap each other. Ibtida SoC after global placement is shown in figure 15.

global_placement
detailed_placement

Listing 8: Placement Script

Ibtida SoC was generated using the command as shown in listing 9.

run_cts

Listing 9: Clock Tree Synthesis Script

e) Routing: Routing is the step that follows CTS. In the OpenLANE flow, routing is executed automatically through scripts. The task of the router is to precisely define the paths on the layout surface that enable conductors to carry electrical signals. The conductors interconnect the pins and the standard cells on the layout, thus forming a routing grid. Since the routing grid is quite large, routing is performed using a divide and conquer approach: Global Routing followed by Detailed Routing, as shown in listing 10.

global_routing
detailed_routing
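The divide and conquer idea can be pictured with a toy model: global routing first decides which coarse tiles of the layout a net passes through, and detailed routing later picks exact tracks inside those tiles. The Scala sketch below covers only the first half, under stated assumptions: `Pin`, `tileOf`, `tilePath`, the tile size, and the L-shaped path are all hypothetical simplifications, and this is not the algorithm used by the routers in the OpenLANE flow.

```scala
// Toy model of the coarse (global) half of routing; all names are hypothetical.
final case class Pin(x: Int, y: Int) // layout coordinates, assumed non-negative

object GlobalRouteSketch {
  /** Coarse tile that contains a pin, for square tiles of side `tileSize`. */
  def tileOf(p: Pin, tileSize: Int): (Int, Int) =
    (p.x / tileSize, p.y / tileSize)

  /** L-shaped sequence of tiles from `a` to `b`: horizontal leg first, then vertical. */
  def tilePath(a: Pin, b: Pin, tileSize: Int): Seq[(Int, Int)] = {
    val (ax, ay) = tileOf(a, tileSize)
    val (bx, by) = tileOf(b, tileSize)
    val xs = if (ax <= bx) ax to bx else ax to bx by -1
    val ys = if (ay <= by) ay to by else ay to by by -1
    // Drop the first vertical tile: it is the corner already visited by the horizontal leg.
    xs.map(x => (x, ay)) ++ ys.drop(1).map(y => (bx, y))
  }
}
```

For example, with a tile size of 10, `GlobalRouteSketch.tilePath(Pin(5, 5), Pin(25, 15), 10)` visits tiles (0,0), (1,0), (2,0), and (2,1). A detailed router would then only need to search for legal wire tracks inside those four tiles instead of across the whole grid.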
Fig. 18. Final Ibtida SoC GDSII layout

g) Physical Verification: The physical verification step, also termed the sign-off step in the OpenLANE flow, validates the final layout. Throughout the flow, a series of reports and logs are generated, which usually involves checking the generated def file at each stage of physical design for any design rule violations. This is ensured by the EDA tools: fastroute, for identifying antenna violations, and tritonroute, which checks for any routing violations. The verification step ascertains that the placer and router have correctly placed the cells and routed the grid. The design is checked for any overlapping cells or short circuits, and inspected for any Layout vs. Schematic (LVS) errors, which include any unmatched pins or short/open circuits between nets that should have been connected. Some common Design Rule Check (DRC) errors corresponding to wire spacing, width, and pitch need to be addressed as defined in the PDK technology lef (.tlef) file. Some basic errors are shown in figures 19 and 20.

The final DRC and LVS check on the generated Ibtida SoC layout is performed using listing 13. For the Ibtida SoC design to be considered DRC and LVS clean, it needs to be validated through Magic, where the run_magic_drc command checks the layout for any design rule check errors and reports them, if any. Furthermore, a hierarchical SPICE netlist is extracted using run_magic_spice_export. The extracted netlist is then validated through the open-source tool Netgen

write_powered_verilog
set_netlist $::env(lvs_result_file_tag).powered.v
run_magic
run_magic_drc
run_magic_spice_export
run_lvs
run_antenna_check

Listing 13: Checkers Script

h) Getting aboard on Caravel: The generated GDS of the Ibtida SoC is then placed on the Caravel harness inside the allocated user space area and hardened using the command shown in listing 14. The final GDSII layout after Ibtida SoC Caravel integration is shown in figure 21.

make ship

Listing 14: On-boarding Ibtida SoC on Caravel

IV. CONCLUSION

In this paper we gave an overview of a Chisel generated SoC taped out using a completely open-source toolchain and discussed the different chip design flows involving RTL simulation, FPGA emulation, and ASIC implementation. Furthermore, we discussed how the OpenLANE suite allows the automatic place and route of a chip without needing a physical design expert.

Chisel HDL allows software programmers and novice hardware engineers to describe circuits with the feel of a higher-level programming language as compared to traditional hardware description languages. The user can abstractly design RTL logic and write