Group 2 Report

The document describes a project to design a RISC-V processor that supports floating point instructions. It will extend an existing RISC-V core with a floating point unit and support out-of-order execution. The design includes modifications to the decoder, register files, execution pipeline, and a testing framework.


6.375 Final Project Report: RISC-V processor with Floating-Point instructions

Kathy Camenzind and Miguel Gomez
Mentor: Andy Wright

1. Objective

The current RISC-V processor that we’ve worked on in 6.375 uses the base integer instruction
set. However, for many applications it is useful to perform floating-point operations, which are
supported by various extensions to the RISC-V instruction set. In this project, we’d like to
explore extending the current RISC-V processor to support the Single-Precision Floating-Point
“F” standard extension. While it is possible to handle floating-point operations through software
emulation, the goal of this project is to add a hardware FPU that handles these instructions as
part of the processor pipeline.

Adding an FPU to the processor raises several considerations that add considerable complexity.
First, we will need to change the Decoder so that it correctly interprets this extended set of
instructions and records whether each operation is integer or floating-point. A new register file
will also be necessary, since the extension requires a separate set of floating-point registers,
as well as a third read port.
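To make the decoder changes above concrete, the following is a minimal sketch of how an instruction word might be classified. The opcodes, funct7 values, and field positions follow the RISC-V "F" extension encoding; the function name, dictionary names, and the returned field names are hypothetical and are not taken from our Bluespec decoder.

```python
# Illustrative sketch only: names are hypothetical, encodings follow the
# RISC-V "F" extension specification.

FP_OPCODES = {
    0b0000111: "LOAD-FP",   # FLW
    0b0100111: "STORE-FP",  # FSW
    0b1000011: "FMADD",
    0b1000111: "FMSUB",
    0b1001011: "FNMSUB",
    0b1001111: "FNMADD",
    0b1010011: "OP-FP",     # FADD.S, FMUL.S, FCVT.*, FMV.*, ...
}

# OP-FP funct7 values whose destination is an integer register:
# FCVT.W[U].S, FMV.X.W / FCLASS.S, FEQ.S / FLT.S / FLE.S
INT_DEST_FUNCT7 = {0b1100000, 0b1110000, 0b1010000}
# OP-FP funct7 values whose rs1 source is an integer register:
# FCVT.S.W[U], FMV.W.X
INT_SRC_FUNCT7 = {0b1101000, 0b1111000}


def decode_fp(instr):
    """Return decoded fields for an F-extension instruction, else None."""
    kind = FP_OPCODES.get(instr & 0x7F)
    if kind is None:
        return None  # left to the existing integer decoder
    funct7 = (instr >> 25) & 0x7F
    fields = {
        "kind": kind,
        "rd": (instr >> 7) & 0x1F,
        "rm": (instr >> 12) & 0x7,    # rounding mode (width for FLW/FSW)
        "rs1": (instr >> 15) & 0x1F,
        "rs2": (instr >> 20) & 0x1F,
        "rs3": None,                  # only fused multiply-add uses rs3
        "rd_is_fp": True,
        "rs1_is_fp": True,
    }
    if kind in ("FMADD", "FMSUB", "FNMSUB", "FNMADD"):
        fields["rs3"] = (instr >> 27) & 0x1F  # third read port needed
    elif kind == "LOAD-FP":
        fields["rs1_is_fp"] = False   # integer base address
    elif kind == "STORE-FP":
        fields["rs1_is_fp"] = False   # integer base address, rs2 is FP data
        fields["rd"] = None           # stores have no destination register
        fields["rd_is_fp"] = False
    else:  # OP-FP
        fields["rd_is_fp"] = funct7 not in INT_DEST_FUNCT7
        fields["rs1_is_fp"] = funct7 not in INT_SRC_FUNCT7
    return fields
```

The `rd_is_fp` and `rs1_is_fp` flags correspond to the extra information about integer versus floating-point registers that the decoder must pass down the pipeline, and `rs3` is the reason the floating-point register file needs a third read port.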

The primary changes, however, will be made by adding the FPU to the pipeline. This FPU will
provide various functional units that perform math, comparisons, and other operations on
floating-point values. We must determine the path of execution of the new floating-point
instructions as well as the old integer instructions in the pipeline using the FPU.

Since floating-point operations are more computationally intensive than integer operations, the
floating point computations that will be done in the FPU should be pipelined in order to reduce
the combinational delay of the processor. This introduces the requirement that our processor
pipeline now has to handle potentially multi-cycle execution of instructions. If the processor is
scalar and in-order, then this will involve a relatively simple design that includes incorporating
stall signals based on the type of the executing instruction. However, to increase our processor
efficiency, we also plan to look into out of order implementations.

A basic out of order implementation scheme, which we will describe in further detail later,
would be to allow an integer instruction to dispatch while the FPU is busy, and to allow the
FPU to work on multiple floating-point operations simultaneously, provided that the required
functional unit is available. This will require scoreboard checks to ensure that there are no
data conflicts between executing instructions, and handling of newer instructions that finish
before older ones because they are lower-latency operations. Reordering of instructions that
complete out of order will be explored.

Through this project, we hope to further explore more complex RISC-V architectures and gain
an understanding of more sophisticated pipelining techniques, while keeping under consideration
data and control hazards that might arise from out-of-order implementations of the processor.

2. High-Level Design

The high-level design of the processor will work off of a 4-stage RISC-V pipeline, similar to the
3-stage pipeline that we developed in Lab 5 with a bypassing register file. It will have stages for
instruction fetch, decode, execute, and writeback, as well as a scoreboard to detect data hazards.
For simplicity, and since they are not the focus of the project, we will not have data bypassing
(other than the bypassing register file) or branch prediction. The increased number of stages as
opposed to Lab 5 cuts down on the latency of each stage, which should allow us to run the
processor at a higher clock speed. Most stages, other than instruction fetch, will have to be
modified in order to incorporate floating point instructions, as described below and shown in
Figure 1.

● Decode: The decoder reads an instruction and converts it into information in a more
usable format for the rest of the pipeline to handle. With the addition of floating point
instructions, we will have to implement additional paths in the Decoder for these new
instruction types, to teach it to handle their encodings. The data that is passed onto the
next stage will need to be altered to include information about whether the operation uses
floating-point registers, what floating-point operation it uses, etc.

Decode will also read values from the register file. The RISC-V floating-point extension
uses 32 additional 32-bit registers for floating-point operations, so we’ll need to include a
second register file for these floating-point registers. The register read stage will then
use a single additional bit to select which of the two register files to read from, and will
insert this augmented register information into the scoreboard. Based on the data hazards
recorded in the scoreboard and whether the corresponding execute pipeline is busy
(described below), we may stall the pipeline at this stage.

● Execute: The execute stage will be broken into two parts: a single-cycle ALU for all
operations that write back to the integer register file, and a multi-cycle FPU (with
multiple multi-cycle units for different FPU operation types) that can complete
floating-point operations out of order and reorders them before writing back to the
floating-point register file. Each individual unit will be pipelined, and the units will
operate in parallel. They will be implemented with a get/put interface, so that congestion
further down the pipeline simply stalls the earlier stages, and no data is lost.

The reordering of completed floating-point instructions will be done using a completion
buffer, which is pushed to in the order that instructions are dispatched and is popped in the
same order. When instructions complete out of order, the completion buffer will store each
result until all earlier instructions have completed and been popped in order.

● Writeback: Writeback is largely the same as before, with the exception of having to
decide which register file to write back to, either integer or floating point, depending on
the instruction type.
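The hazard check performed in Decode can be illustrated with a small software model. This is only a sketch under stated assumptions: the class and method names are hypothetical, registers are represented as `(is_fp, index)` pairs to model the single additional bit described above, and the capacity of 6 mirrors the five floating-point operations plus one integer operation discussed later in this report.

```python
class Scoreboard:
    """Software model of a scoreboard keyed by (is_fp, register index)."""

    def __init__(self, capacity=6):
        self.capacity = capacity
        self.pending = []  # destinations of in-flight instructions, oldest first

    def is_full(self):
        return len(self.pending) >= self.capacity

    def insert(self, dest):
        """Record the destination of a dispatched instruction (None if it has none)."""
        if self.is_full():
            raise RuntimeError("scoreboard full: Decode must stall")
        self.pending.append(dest)

    def remove(self, dest):
        """Clear the oldest matching entry when the instruction writes back."""
        self.pending.remove(dest)

    def has_hazard(self, sources):
        """True if any source register is still waiting on an in-flight write."""
        busy = {d for d in self.pending if d is not None}
        for src in sources:
            if src is None or src == (False, 0):
                continue  # integer register x0 is hardwired to zero
            if src in busy:
                return True
        return False


# f5 is pending, so an instruction reading f5 stalls but one reading x5 does not.
sb = Scoreboard()
sb.insert((True, 5))
stall_fp = sb.has_hazard([(True, 5), (True, 1)])
stall_int = sb.has_hazard([(False, 5)])
```

Tagging each entry with the register-file bit is what keeps `x5` and `f5` from aliasing, which would otherwise cause unnecessary stalls between integer and floating-point instructions.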

Figure 1: Instruction Pipeline for the processor with a multicycle FPU for floating-point
operations.

There are several intermediate steps that we implemented on the way to our eventual pipelined,
out of order floating point processor. First, we ensured that our floating point library was
functional by implementing the FPU as a single-cycle operation and integrating it directly into
the 3-stage processor from Lab 5 alongside the ALU. We additionally develop a

3. Testing Framework

The testing framework that we will be using is mostly the same as the one that was provided for
the use of lab 5, with some key differences that will allow us to test it with the floating-point
ISA. The basic structure of the test bench will remain the same as that of the lab; a Connectal
wrapper around the processor in order to send and receive data using the Connectal main.cpp
program. The instructions and data will be compiled to the riscv binary format using the
Makefile, which will then be loaded onto the memory by main.cpp using the memInit method.
The processor will be started by using the hostToCPU method, after which the program will run
until completion, and the final state will be returned with cpuToHost.

Figure 2: General Structure of Testing Framework with FPGA

To comprehensively test the processor, we will be using a combination of existing tests (which
cover most of the base RISC-V integer instructions) and new tests that we create, which will
cover all of the new floating-point instructions. To make a test program work with the new
floating-point architecture, all that is needed is to change the test configuration from
RVTEST_RV32U to RVTEST_32UF, which allows the use of all the new floating-point instructions
in the test program. To aid us in debugging our design, we will also use the test macros
provided in “test_macros.h” to verify correct functionality. Table 1, listed below, includes a
comprehensive list of all instructions that must be tested to ensure the correct implementation
of the RISC-V Floating-Point processor.

Instructions                      Description
--------------------------------  ------------------------------------------------------------
FLW, FSW                          Loads floating-point data from memory into rd (FLW), or
                                  stores floating-point data from rs2 to memory (FSW)
FMADD.S, FMSUB.S                  Multiplies rs1 and rs2, adds/subtracts rs3, stores in rd
FNMADD.S, FNMSUB.S                Multiplies rs1 and rs2, negates the product,
                                  subtracts/adds rs3, stores in rd
FADD.S, FSUB.S, FMUL.S, FDIV.S    Adds/Subtracts/Multiplies/Divides rs1, rs2, stores in rd
FSQRT.S                           Computes square root of rs1, stores in rd
FSGNJ.S, FSGNJN.S, FSGNJX.S       Takes all bits from rs1 except the sign bit, which is
                                  determined by the sign of rs2, the opposite sign of rs2, or
                                  the XOR of the signs of rs1 and rs2, stores in rd
FMIN.S, FMAX.S                    Takes min/max of rs1 and rs2, stores in rd
FCVT.W.S, FCVT.WU.S               Converts floating-point rs1 value to signed/unsigned integer
                                  value, stores in rd
FMV.X.W, FMV.W.X                  Moves floating point value from rs1 to lower 32 bits of
                                  integer register rd, or vice versa
FEQ.S, FLT.S, FLE.S               Equality/Less than/Less than or equal to of rs1, rs2,
                                  stores in rd
FCLASS.S                          Examines value in rs1, stores 10-bit mask in rd that
                                  indicates class of floating-point number
FCVT.S.W, FCVT.S.WU               Converts signed/unsigned rs1 value to floating-point value,
                                  stores in rd

Table 1: Descriptions of Floating-Point Instructions
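Two of the less familiar rows in Table 1, sign injection and FCLASS.S, are easiest to understand at the bit level. The sketch below is a reference model of the kind one might use to compute expected values for tests, not the hardware implementation; the helper names are hypothetical, while the bit positions of the FCLASS.S mask follow the RISC-V specification.

```python
import struct

SIGN = 0x80000000  # sign bit of an IEEE 754 single-precision value


def f32_bits(x):
    """Raw 32-bit pattern of a Python float rounded to single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(b):
    """Interpret a 32-bit pattern as a single-precision float."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]


def fsgnj(a, b):
    return (a & 0x7FFFFFFF) | (b & SIGN)      # sign of rs2


def fsgnjn(a, b):
    return (a & 0x7FFFFFFF) | (~b & SIGN)     # opposite sign of rs2


def fsgnjx(a, b):
    return a ^ (b & SIGN)                     # XOR of the two signs


def fclass(a):
    """Return the 10-bit one-hot class mask for a 32-bit pattern."""
    sign = a >> 31
    exp = (a >> 23) & 0xFF
    frac = a & 0x7FFFFF
    if exp == 0xFF:
        if frac == 0:
            bit = 0 if sign else 7            # -inf / +inf
        else:
            bit = 9 if frac & 0x400000 else 8  # quiet / signaling NaN
    elif exp == 0:
        if frac == 0:
            bit = 3 if sign else 4            # -0 / +0
        else:
            bit = 2 if sign else 5            # negative / positive subnormal
    else:
        bit = 1 if sign else 6                # negative / positive normal
    return 1 << bit
```

For example, `bits_f32(fsgnj(f32_bits(1.5), f32_bits(-2.0)))` yields `-1.5`, which shows that FSGNJ.S with both sources equal acts as a move and FSGNJN.S with both sources equal acts as a negation.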

4. Microarchitectural Description

In our pipeline, we will introduce several new modules. The first class of modules is the
floating-point execute units, which perform the new floating-point operations. These are already
implemented as multi-cycle units in the built-in Bluespec library, in FloatingPoint.bsv. They
have Server (request/response) interfaces, take as input 1-3 floating-point operands and the
rounding mode, and output the floating-point result and any exceptions. We plan to use the
following modules:

● mkFloatingPointAdder: Adds two floating point numbers. Takes 5 cycles.

● mkFloatingPointMultiplier: Multiplies two floating point numbers. Takes 5 cycles.

● mkFloatingPointDivider: Divides two floating point numbers in 5 cycles.

● mkFloatingPointSquareRooter: Takes the square root of a floating point number in 5
cycles. Only takes one floating point operand.

● mkFloatingPointFusedMultiplyAccumulate: Multiplies two operands and adds a third.
Takes 9 cycles to complete, and uses 3 floating point operands.

The second class of modules that we plan to use are the already-existing modules from the
processor implemented in class, that we will either instantiate or modify slightly. Any changes
made are to support floating point registers and allow for out of order execution. The modules
we will use and/or modify are as follows:

● mkBypassingRFile: The register file, which previously had two read ports and one write
port, will now be instantiated twice: once for the integer register file and once for the
floating-point register file. To support the multiply-accumulate function, we must add a 3rd
read port. Since we currently plan on only implementing single-precision, there are no other
changes to the register file, although if we were to extend to double precision, we could
parameterize the data size.

● mkCsrFile: Modified to include the addresses to read FCSR and its fields. Since FCSR
has to be read within the Decode stage to determine the rounding mode, we need to be
able to handle 2 CSR reads in a cycle. Instead of implementing a second read port, we
observed that one of the reads will always be of FRM, and that a more efficient
implementation was to just add a method, getFRM, that returns the rounding mode
directly.

● mkBypassingScoreboard: Our scoreboard will be augmented with a bit that
distinguishes between floating point and integer registers. We also extend the scoreboard
to a larger size: to always be functionally correct, it has to hold the maximum number
of instructions possible in the Execute stage, which includes up to 5 floating-point
operations (our CompletionBuffer size, described below) and one single-cycle integer
register-file operation.
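The rounding-mode lookup that motivates the getFRM method can be sketched as follows. The FCSR layout (fflags in bits 4:0, frm in bits 7:5) and the rounding-mode encodings come from the RISC-V specification; the Python function names are hypothetical, and the handling of the dynamic rounding mode is shown as one plausible way Decode could use the value returned by getFRM.

```python
ROUNDING_MODES = {
    0b000: "RNE",  # round to nearest, ties to even
    0b001: "RTZ",  # round towards zero
    0b010: "RDN",  # round down
    0b011: "RUP",  # round up
    0b100: "RMM",  # round to nearest, ties to max magnitude
}
DYN = 0b111  # instruction rm field value meaning "use frm from FCSR"


def get_frm(fcsr):
    """Extract the frm field (bits 7:5) from an FCSR value."""
    return (fcsr >> 5) & 0x7


def get_fflags(fcsr):
    """Extract the accrued exception flags (bits 4:0) from an FCSR value."""
    return fcsr & 0x1F


def resolve_rounding_mode(instr_rm, fcsr):
    """Pick the rounding mode for one instruction, as Decode would."""
    rm = get_frm(fcsr) if instr_rm == DYN else instr_rm
    if rm not in ROUNDING_MODES:
        raise ValueError("reserved rounding mode")
    return ROUNDING_MODES[rm]


fcsr = (0b010 << 5) | 0b00001  # frm = RDN, inexact flag set
dynamic = resolve_rounding_mode(DYN, fcsr)
static = resolve_rounding_mode(0b001, fcsr)
```

Because only the three frm bits are needed in Decode, exposing them through a dedicated method avoids the cost of a second general-purpose CSR read port.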

Lastly, we will need to add a few modules to complete the out-of-order functionality. When
dispatching instructions from Decode, we need to decide which functional unit to pass the
instruction to. Similarly, when an instruction finishes executing, we will need a completion
buffer to hold the result until it is ready to commit in-order (i.e. all previous instructions have
committed).

Figure 3: Microarchitecture of the modified pipelined execute stage for implementing
floating-point instructions. Includes pipelined functional units for floating point operations, and a
completion buffer to hold results until they’re ready to commit.

● mkCompletionBuffer: Holds results from all functional units when they complete. Has a
queue recording the order in which functional units were called, which acts as a FIFO; a
result can only be popped when it is at the head of the queue and has been completed.
The interface of this module additionally includes a complete method, which marks an
instruction as completed and records its data, and a push method, called by the reservation
station to enqueue a functional unit onto the ordered list of instructions to complete.

Internally, this completion buffer can be any size, but making it larger is beneficial, since
it can then hold many completed instructions that have passed an earlier instruction using
a long-latency functional unit (likely multiply-accumulate, which takes 9 cycles). This
avoids having the CompletionBuffer fill up, which would stall the rest of the pipeline.
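The behavior of the completion buffer can be modeled in a few lines of software. This is a behavioral sketch only: the method names push, complete, and pop mirror the interface described above, but the tag scheme, the `can_push`/`can_pop` guards, and the Python data structures are assumptions made for illustration and say nothing about how the Bluespec module is built.

```python
from collections import deque


class CompletionBuffer:
    """Behavioral model: results commit in dispatch order, not completion order."""

    def __init__(self, size=5):
        self.size = size
        self.order = deque()  # tags in the order instructions were dispatched
        self.results = {}     # tag -> result, filled in when a unit finishes
        self.next_tag = 0

    def can_push(self):
        return len(self.order) < self.size

    def push(self):
        """Reserve a slot at dispatch time and return its tag."""
        if not self.can_push():
            raise RuntimeError("completion buffer full: dispatch must stall")
        tag = self.next_tag
        self.next_tag += 1
        self.order.append(tag)
        return tag

    def complete(self, tag, result):
        """Called when a functional unit finishes the instruction with this tag."""
        self.results[tag] = result

    def can_pop(self):
        # Only the oldest instruction may leave, and only once it has completed.
        return bool(self.order) and self.order[0] in self.results

    def pop(self):
        if not self.can_pop():
            raise RuntimeError("oldest instruction has not completed yet")
        tag = self.order.popleft()
        return self.results.pop(tag)


# A 9-cycle multiply-accumulate is dispatched before a 5-cycle add.
cb = CompletionBuffer()
t_fmadd = cb.push()
t_fadd = cb.push()
cb.complete(t_fadd, "fadd result")        # the younger instruction finishes first
blocked = not cb.can_pop()                # but it cannot commit yet
cb.complete(t_fmadd, "fmadd result")
committed = [cb.pop(), cb.pop()]          # results leave in dispatch order
```

In this model a full buffer makes `can_push` return false, which corresponds to the pipeline stall that a larger buffer is meant to avoid.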

5. Implementation

The processor was developed in several design stages, described below.

● Combinational: Uses the combinational versions of the floating point add, multiply,
divide, and square root operations. Also, only one instruction is allowed to pass through
the processor at a time, similar to the Multicycle implementation.

● Multicycle: Uses sequential versions of the floating point add, multiply, divide, and
square root operations, decreasing the critical path delay of the design. One instruction
can flow through the pipeline at a time.

● Four Stage w/ Bypass: Uses sequential versions of floating point add, multiply, divide,
and square root operations. Multiple instructions can flow through the pipeline, and
bypassing is achieved with EHRs in the register files and scoreboard.

● Four Stage Superscalar w/ Bypass: Final version of the floating point processor. Uses
sequential versions of floating point add, multiply, divide, and square root operations.
Multiple instructions can flow through the pipeline, as well as within the Execute stage
using a completion buffer.

All of the “F” extension instructions pass the compliance tests, and our processor still
passes all of the original microtests for integer instructions, as well as both the small and
large benchmarks from Lab 5.

The Execute stage now has an implementation that includes out-of-order execution
between floating-point and non-floating-point instructions. All instructions that write
back to the integer register file still execute in one cycle. Floating point register
instructions, on the other hand, are entered into the pipelined functional unit for the
relevant floating point operation, which can take between 1 and 9 cycles to complete the
operation. When a floating point instruction enters a functional unit, it is pushed into a
completion buffer, and when a floating point instruction leaves a functional unit, it is
marked as “complete” in the completion buffer. The writeback stage then pops
instructions from the completion buffer in order, which reorders the writeback of floating
point instructions.
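The effect of in-order writeback on timing can be illustrated with a small calculation. The sketch below assumes independent instructions, the 9-cycle and 5-cycle latencies quoted earlier for multiply-accumulate and add, and at most one commit per cycle; the function name and the one-commit-per-cycle limit are assumptions for illustration rather than measured properties of our processor.

```python
def commit_schedule(program):
    """Compute completion and commit cycles for instructions in program order.

    program: list of (name, dispatch_cycle, latency) tuples.
    Returns a list of (name, complete_cycle, commit_cycle) tuples.
    """
    schedule = []
    last_commit = -1
    for name, dispatch_cycle, latency in program:
        complete_cycle = dispatch_cycle + latency
        # An instruction commits once it has completed and every older
        # instruction has already committed (one commit per cycle assumed).
        commit_cycle = max(complete_cycle, last_commit + 1)
        schedule.append((name, complete_cycle, commit_cycle))
        last_commit = commit_cycle
    return schedule


program = [
    ("fmadd.s", 0, 9),  # long-latency instruction dispatched first
    ("fadd.s", 1, 5),
    ("fadd.s", 2, 5),
]
schedule = commit_schedule(program)
for name, done, commit in schedule:
    print(f"{name}: completes at cycle {done}, commits at cycle {commit}")
```

Under these assumptions the two FADD.S instructions finish at cycles 6 and 7 but wait in the completion buffer until cycles 10 and 11, after the FMADD.S commits at cycle 9. The execution of the three instructions still overlaps, which is where the benefit over a blocking FPU comes from.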

6. Performance

The table below summarizes the synthesis results of each of these designs:

                      Combinational   Multicycle   FourStageBypass   Out of Order
Area (μm²)            ~190,000        ~187,000     ~368,000          ~397,000
Critical Path (ps)    ~10,200         1,147        2,018             1,656

By far, the greatest improvement in clock speed was between the first and second versions of the
processor, which was due to the change from the combinational to the sequential versions of the
FPU. The four-stage processor with bypassing has a longer critical path than the multicycle
version due to the long-latency bypass register file, and the out of order processor is the
largest processor, as it has not only the fully pipelined processor, but also the extra
CompletionBuffer for reordering out-of-order instructions.
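To put the critical-path numbers in more familiar terms, they can be converted to an upper bound on clock frequency. The snippet below uses only the values from the synthesis table; the combinational figure is approximate in the table, so its result is approximate as well, and these are bounds implied by synthesis rather than measured clock rates.

```python
# Critical path delays in picoseconds, taken from the synthesis table.
CRITICAL_PATH_PS = {
    "Combinational": 10_200,
    "Multicycle": 1_147,
    "FourStageBypass": 2_018,
    "Out of Order": 1_656,
}


def max_clock_mhz(critical_path_ps):
    """Upper bound on clock frequency: f = 1 / t, with t in ps and f in MHz."""
    return 1e6 / critical_path_ps


for design, delay_ps in CRITICAL_PATH_PS.items():
    print(f"{design}: about {max_clock_mhz(delay_ps):.0f} MHz")

# Ratio of critical paths between the first two design stages.
speedup = CRITICAL_PATH_PS["Combinational"] / CRITICAL_PATH_PS["Multicycle"]
```

The ratio between the combinational and multicycle critical paths is roughly 9x, which is the improvement in clock speed referred to above.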

However, it was more challenging to analyze the actual performance of the different processor
versions. We initially thought that the RISC-V compliance tests were an option; however, they
do not properly illustrate the improvement in the final version, since the structure of the
tests means that only one instruction is present in the FPU at any given time. This is because
every floating point math instruction is followed by an fmv instruction that directly depends on
the result of the math operation. The second floating point instruction, which would in theory
be able to pass the longer-latency math instruction, instead stalls due to the data dependency
between the two instructions.

Writing C code compiled to F-extension RISC-V instructions was also unsuccessful, due to the
nature of the compiler, and there are no available existing floating-point benchmark programs for
RVI32UF (as double-precision is more common). We settled on using the small benchmark code
for Lab 5; however, we are aware that it doesn't demonstrate the full capabilities of the processor
since it only uses integer instructions. The benchmarks are shown below:

Benchmark   Multicycle         Four Stage Bypass   Four Stage Out-of-Order Bypass
            (Floating Point)   (Floating Point)    (Floating Point)
towers      0.1033             0.1069              0.1069
median      0.2768             0.2872              0.2873
Multiply    0.4763             0.3838              0.3838
Qsort       0.3817             n/a                 0.3090
Vvadd       0.2229             0.2241              0.2242

7. Conclusions

Our greatest challenge by far was, surprisingly, debugging the functional correctness of our basic
floating point operations. We found bugs in the FloatingPoint.bsv library, as well as in the
compliance tests that we were using to check the correctness of our processor. These bugs were
difficult to track down, since we had to manually check our design, the tests, and the floating
point library for correctness.

In the end, we were able to design and build the processor that we set out to complete. It runs
and passes all of the single-precision floating point RISC-V compliance tests, and still runs
and passes the benchmarks that use integer instructions. This design also successfully compiled
and ran on the FPGA. However, we were unable to do significant performance testing, since we
lacked runnable tests that use a significant number of floating point operations in a realistic
manner.

We learned a lot from this project, particularly about putting in the time to develop a rigorous
and useful debug framework, to save time later when debugging the processor. We also spent a
lot of time fixing edge cases, and in the future would’ve liked to spend more time on the
high-level design of the processor and performance testing.
