Abstract— We present an adder/subtractor and a multiplier for single precision floating point numbers in IEEE-754 format. They are fully synthesizable hardware descriptions in VHDL that are available for general and educational use. Each one is presented in a single cycle and a pipelined implementation, suitable for high speed computing, with performance comparable to other available implementations. Precision for non-denormal multiplications is under one ulp, and for additions it is within ±1 LSB.

I. INTRODUCTION

Computation with floating point arithmetic is a necessary task in many applications. Until the end of the 1980's, floating point operations were mainly implemented as software emulation, while hardware implementation was only an option for mainstream general purpose microprocessors due to the high cost of the hardware. At present, all major microprocessors include hardware specifically for handling floating point operations, but the advancements in reconfigurable logic since the mid-1990's, particularly in size and speed, also allow for the implementation of multiple, high performance floating point units in Field Programmable Gate Arrays (FPGAs).

In the development of an FPGA based hardware coprocessor for the SPHINX speech recognition system [1][7], which required the use of floating point (FP) adders and multipliers, we looked for free implementations described in a Hardware Description Language (HDL), preferably VHDL. Even though significant work has been done, the implementations found were either in Verilog [9] or not freely available [2]. Therefore, we decided to implement our own in VHDL and to make it available to the public for general use and as an educational tool for advanced computer arithmetic courses, while providing performance comparable to other implementations. Given the nature of the tools used (described in a later section), the implementation is described as a combination of schematics and VHDL code, a very desirable aspect for education.

This document is organized as follows: Section II describes the design flow and methodology used in the design; Section III summarizes the important aspects of the IEEE-754 format; Sections IV and V describe the multiplier and adder implementations; and Section VI presents the results.

II. DESIGN FLOW AND METHODOLOGY

Figure 1 presents an overview of the design flow. The Design Entry involves the description of the design using graphical tools and/or high level languages. The Synthesis is the process of converting this description into a device independent RTL netlist. The Implementation is the process of generating a circuit description suitable for device programming from an RTL netlist. Each one of these steps can and should be validated for the correctness of the design representation.

Fig. 1. Design Flow Overview

The design process involves the capture of schematics and VHDL coding in our case. The design capture methodology involved a top-down analysis and general layout, followed by a bottom-up implementation of the required components.

The synthesis involves setting constraints for the design (timing, location and IO), plus some manipulation of the code, as the synthesizer normally imposes coding styles in the design entry stage to produce the desired hardware output.

The design implementation stage is generally called "Place and Route", but it involves the translation, mapping, placing and routing from the device independent RTL description into a device dependent circuit. This device dependent circuit is converted to a BIT file for device programming.

The design verification involves the use of simulations based on netlist equivalents of the outputs of each one of the previously described stages, to validate the correct behavior of the description against the design specifications.
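As a small illustration of the coding-style point above, the sketch below shows the kind of generic synchronous template (a hypothetical example, not taken from the actual design files) that synthesis tools map predictably onto flip-flops in the RTL netlist:

library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical design-entry template: a single clocked process with a
-- synchronous reset, which most synthesis tools infer as a register bank.
entity reg_stage is
  generic (WIDTH : natural := 32);
  port (
    clk : in  std_logic;
    rst : in  std_logic;
    d   : in  std_logic_vector(WIDTH-1 downto 0);
    q   : out std_logic_vector(WIDTH-1 downto 0));
end entity reg_stage;

architecture rtl of reg_stage is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        q <= (others => '0');
      else
        q <= d;
      end if;
    end if;
  end process;
end architecture rtl;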
V. FP ADDER

Given two FP numbers n1 and n2, and assuming that E1 > E2, their sum, denoted n, can be expressed as:

n = n1 + n2
  = ((−1)^S1 · p1 · 2^E1) + ((−1)^S2 · p2 · 2^E2)
  = ((−1)^S1 · p1 · 2^E1) + ((−1)^S2 · (p2 · 2^(E2−E1)) · 2^E1)
  = (−1)^S · (p1 + p2 · 2^(E2−E1)) · 2^E1    (3)

This means that adding two FP numbers involves aligning both numbers to the bigger exponent and adding the aligned significands. The sign S is a function of the operation (addition or subtraction) and of the result of the addition/subtraction of the mantissas. Given that the significands are in the range [1, 2) ([0, 2) if we consider denormals), the result is in the range [0, 4) in either case, excluding sign changes (the conditions differ depending on whether denormals are considered or not). Therefore, normalization is required to properly represent the result in the range [1, 2) ([0, 2)). This normalization may require a single division by 2 or multiple multiplications by 2.
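For instance, consider a subtraction with p1 = 1.5, E1 = 3 and p2 = 1.25, E2 = 1: the smaller operand is aligned as 1.25 · 2^(1−3) = 0.3125, the aligned significands are combined as 1.5 − 0.3125 = 1.1875, and the result 1.1875 · 2^3 is already in [1, 2), so no normalization shift is needed. If the operands nearly cancel, e.g. (1.5 − 1.4375) · 2^3 = 0.0625 · 2^3, a 4 bit left shift is required to obtain the normalized result 1.0 · 2^(−1).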
A. Implementation

Fig. 3. FP Adder Block Diagram

In Figure 3 we present a general FP adder block diagram similar to the one presented in [3]. The FP numbers are separated into their sign, exponent and significand components, with the hidden 1 or 0 restored, and processed in a flow fashion.
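A minimal sketch of this unpacking step is shown below (the entity and port names are assumptions for illustration, not taken from the actual design files); the hidden bit is restored as 1 for normalized numbers and as 0 when the exponent field is all zeros (denormals and zero):

library ieee;
use ieee.std_logic_1164.all;

-- Sketch: separate an IEEE-754 single precision word into its components.
entity fp_unpack is
  port (
    fp_in       : in  std_logic_vector(31 downto 0);
    sign        : out std_logic;
    exponent    : out std_logic_vector(7 downto 0);
    significand : out std_logic_vector(23 downto 0));  -- hidden bit plus 23-bit fraction
end entity fp_unpack;

architecture rtl of fp_unpack is
  signal exp_field : std_logic_vector(7 downto 0);
  signal hidden    : std_logic;
begin
  exp_field <= fp_in(30 downto 23);
  -- the hidden bit is 0 only when the exponent field is all zeros
  hidden    <= '0' when exp_field = "00000000" else '1';

  sign        <= fp_in(31);
  exponent    <= exp_field;
  significand <= hidden & fp_in(22 downto 0);
end architecture rtl;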
The signs of the inputs and the operation are used to determine the effective sign of the operation and whether one of the input significands needs to be complemented, and if so, which one.

The exponents of the inputs are subtracted; the bigger one is used as the tentative result exponent and the difference is used as the amount of shift needed to align the significands.

The significand processing can be divided into preparation, execution and normalization stages. The preparation involves selecting the significand of the smaller exponent and aligning it to the bigger exponent. This stage also includes complementing one of the significands if needed, which is actually done by inverting the selected significand and setting a carry-in in the adder. The execution stage performs the actual addition of the significands. The normalization stage consists of the normalization into the [1, 2) range, the rounding and a final normalization. The first normalization may be a single right shift or an n bit left shift, depending on the number of leading zeros of the tentative result significand. The rounding and the final normalization are like those in the FP multiplier.
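The invert-and-carry-in trick of the preparation and execution stages can be sketched as follows (widths and names are assumptions for illustration, not the actual unit):

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Sketch: an effective subtraction is obtained by inverting the selected
-- significand and setting the adder carry-in, avoiding a separate
-- two's complement step before the addition.
entity sig_addsub is
  port (
    sig_a, sig_b : in  unsigned(23 downto 0);   -- aligned significands
    subtract     : in  std_logic;                -- '1' for effective subtraction
    sum          : out unsigned(24 downto 0));   -- one extra bit for the carry out
end entity sig_addsub;

architecture rtl of sig_addsub is
begin
  process (sig_a, sig_b, subtract)
    variable b_op  : unsigned(23 downto 0);
    variable carry : unsigned(0 downto 0);
  begin
    if subtract = '1' then
      b_op  := not sig_b;   -- bitwise inversion of the selected significand
      carry := "1";         -- the carry-in completes the two's complement
    else
      b_op  := sig_b;
      carry := "0";
    end if;
    sum <= ('0' & sig_a) + ('0' & b_op) + carry;
  end process;
end architecture rtl;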
The proposed implementation has some differences from the diagram presented in [3], as the complement operations in the pre- and post-execution stages were moved closer to the execution stage, to minimize the manipulation of signed significands in the remaining blocks.

After analysis of the synthesis, the single cycle design was divided into a 6-stage pipeline to provide performance matching the multiplier. The division follows naturally from the data flow, but it is a little more complicated than in the multiplier, as some values are calculated in one stage and used in another that does not immediately follow. After analysis of the critical path (in the 4th stage, involving the leading zero counter unit needed for the n bit left shift), better performance can be achieved by dividing this stage, resulting in a 7-stage pipeline. Once again, this is suggested as an exercise.
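A leading zero counter of this kind can be sketched as a priority scan from the most significant bit (the 28 bit width and the names below are assumptions for illustration, not the actual unit):

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Sketch: count the leading zeros of the tentative result significand;
-- the count is used as the left shift amount during normalization.
entity lzc28 is
  port (
    value : in  std_logic_vector(27 downto 0);
    zeros : out unsigned(4 downto 0));
end entity lzc28;

architecture rtl of lzc28 is
begin
  process (value)
    variable count : unsigned(4 downto 0);
  begin
    count := to_unsigned(value'length, count'length);  -- all-zero input
    for i in value'high downto value'low loop          -- priority scan, MSB first
      if value(i) = '1' then
        count := to_unsigned(value'high - i, count'length);
        exit;
      end if;
    end loop;
    zeros <= count;
  end process;
end architecture rtl;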
VI. RESULTS

We used Mentor Graphics tools for the full development cycle, using the FPGA Advantage 6.2 suite [8], which integrates HDL Designer 2003.2 for design capture and as a design management interface, Precision Synthesis 2003b.41 as the synthesis tool and ModelSim 5.7f SE as the simulation tool, running on a Windows XP workstation.

Even though the design is target independent, it was validated in hardware using a Xilinx Virtex-II family XC2V3000-4 FPGA mounted on a RACE-1 PCI coprocessor card [5]. Xilinx ISE 6.1 [6] was used for the Translation, Mapping and Place and Route processes.

A. Limitations

All operations of the FP multiplier that do not involve denormals are performed correctly with precision under one unit in the last position (ulp). When denormals are involved, the result may be very small (as there is no hidden 1 in one or both of the operands) and multibit left shifting using the lower bits of the multiplication result may be needed to accurately represent the result. This is not implemented in the current unit, but it can be done as an exercise in a computational arithmetic course.

All operations of the FP adder are performed correctly, but precision is affected in certain situations. Precision for the FP adder can be bounded to ±1 LSB, as some rounding is affected by the lack of an extended representation in the internal data path.

B. Device Usage and Performance

Table I lists the CLB usage, hardware multiplier usage and theoretical maximum frequency of operation for both the single cycle and pipelined versions of the FP multiplier as synthesized for the Virtex-II architecture, while Table II lists the CLB usage and the theoretical maximum frequency of operation for the FP adder.

TABLE I
FP MULTIPLIER SYNTHESIS RESULTS

Design         CLBs   HW mult.   Max. Freq. (MHz)
Single Cycle   101    4          8
Pipelined      119    4          90.5

TABLE II
FP ADDER SYNTHESIS RESULTS

Design         CLBs   Max. Freq. (MHz)
Single Cycle   373    6
Pipelined      385    87.9

The design can be compared directly to the implementations presented in [2], as both were implemented and tested on the same hardware. The area usage and theoretical maximum operating frequency are presented in Table III for the pipelined implementations. It can be observed that our FP multiplier is 31% faster than their implementation at the cost of a 23% area increase, while the FP adder is 3% faster but consumes 32% more area.

TABLE III
AREA AND FREQUENCY COMPARISON

                          Lienhart   Marcus
FP Multiplier
  Area (CLBs)             78         101
  Max. Frequency (MHz)    63         90.5
FP Signed Adder
  Area (CLBs)             290        385
  Max. Frequency (MHz)    85         87.9

The implemented designs were stress tested with random operations, and the results are presented in Table IV. When the result of an operation produces a NaN, the sign is inverted compared to the sign produced by the reference FP unit (the workstation processor's FPU), but since IEEE-754 specifies that the sign is not significant in the NaN representation, this is not considered an error. The incorrect results were tracked to the limitations previously described in this section, and these numbers are provided as a sample of how often they occur in a random data set.

TABLE IV
STRESS TEST RESULTS

                     Multiplier       Adder
Incorrect Results    65684 (0.656%)   12539 (0.125%)
Total operations     10 million       10 million
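A stress test of this kind can be driven by a simple self-checking test bench that reads randomly generated operands and the corresponding host FPU results from a file (a sketch under assumptions: the entity name fp_adder, its ports and the vector file format are illustrative, not the original test environment):

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_textio.all;
use std.textio.all;

-- Sketch: compare the unit under test against reference results computed
-- by the workstation FPU and stored as hexadecimal words, one test per line.
entity tb_fp_adder_stress is
end entity tb_fp_adder_stress;

architecture sim of tb_fp_adder_stress is
  signal a, b, result : std_logic_vector(31 downto 0);
begin
  -- unit under test (entity name and ports assumed):
  -- dut: entity work.fp_adder port map (a => a, b => b, result => result);

  stimulus : process
    file vectors      : text open read_mode is "random_vectors.txt";
    variable l        : line;
    variable va, vb   : std_logic_vector(31 downto 0);
    variable vref     : std_logic_vector(31 downto 0);
    variable mismatch : natural := 0;
  begin
    while not endfile(vectors) loop
      readline(vectors, l);
      hread(l, va);        -- operand A
      hread(l, vb);        -- operand B
      hread(l, vref);      -- reference result from the host FPU
      a <= va;
      b <= vb;
      wait for 100 ns;     -- allow the combinational unit to settle
      if result /= vref then
        mismatch := mismatch + 1;
      end if;
    end loop;
    report "incorrect results: " & integer'image(mismatch);
    wait;
  end process;
end architecture sim;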
VII. CONCLUSION

An FP adder and an FP multiplier are presented. Both are available in single cycle and pipelined architectures, they are implemented in VHDL and are fully synthesizable, with performance comparable to other available high speed implementations. The design is described as graphical schematics and VHDL code, and both are freely available for general and educational use. This dual representation is very valuable, as it allows for easy navigation over all the components of the units, which in turn allows for a faster understanding of their interrelationships and of the different aspects of an FP operation. Various opportunities for extensions and modifications are also presented. The limitations in precision described for these units are similar to those of other implementations.

VIII. AVAILABILITY

The design files, documentation and source code are available under the GNU General Public License and the GNU Free Documentation License on the Internet at the ITESM Speech group website (http://speech.mty.itesm.mx/~gmarcus/FPU), at the ITESM FPGA group website (http://fpga.mty.itesm.mx) and at the OpenCores website (http://www.opencores.org).

ACKNOWLEDGMENT

The authors would like to thank Prof. Dr. Reinhard Männer and the Institute for Computer Science V of Universität Mannheim in Mannheim, Germany, for providing the development hardware used in this project.

REFERENCES

[1] Guillermo Marcus, Diseño e Implementación de un Coprocesador Basado en FPGA para el Reconocedor de Voz SPHINX, MSc. Thesis, ITESM, 2003.
[2] Gerhard Lienhart, Andreas Kugel and Reinhard Männer, Using Floating-Point Arithmetic on FPGAs to Accelerate Scientific N-Body Simulations, in Proceedings of the 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 22-24 April 2002.
[3] Behrooz Parhami, Computer Arithmetic: Algorithms and Hardware Designs, 1st ed., Oxford: Oxford University Press, 2000.
[4] IEEE Standards Board, IEEE-754, IEEE Standard for Binary Floating-Point Arithmetic, New York: IEEE, 1985.
[5] FPGA Processors Group, mpRACE Coprocessor, at http://www-li5.ti.uni-mannheim.de/fpga/?race/, Universität Mannheim, Mannheim, Germany.
[6] Xilinx Inc., ISE, at http://www.xilinx.com
[7] Carnegie Mellon University, SPHINX and SPHINX Train, at http://www.speech.cs.cmu.edu/
[8] Mentor Graphics Inc., FPGA Advantage, at http://www.mentor.com/fpga-advantage/
[9] Rudolf Usselmann, Floating Point Unit, at http://www.opencores.org/projects.cgi/web/fpu/overview or http://www.ASICS.ws