A High Performance and Full Utilization Hardware Implementation of Floating Point Arithmetic Units
2021 28th IEEE International Conference on Electronics, Circuits, and Systems (ICECS) | 978-1-7281-8281-0/21/$31.00 ©2021 IEEE | DOI: 10.1109/ICECS53924.2021.9665459

Abstract—Floating point operations are widely used in the fields of communication algorithms, digital signal processing, artificial intelligence and so on. However, low computation speed and excessive resource consumption have become key limitations on system performance and hardware overhead. Thus, the area efficiency of floating point arithmetic units is important for accelerating computation and reducing resources. This paper presents high performance and area efficient floating point arithmetic units, including an adder, a multiplier and a reciprocal operator. The proposed floating point arithmetic units are evaluated on a typical scenario of 4×4 matrix inversion in communication. Experimental results show that our designs achieve improvements in both performance and resource overhead. Compared with Xilinx Vivado IP, our designs save 20%-45% of resources and consume only 1/4 of the computing latency. Compared to DesignWare IP, our designs need only 1/4 of the computing latency, while improving area efficiency by 3.65 times.

Keywords—floating point arithmetic, mixed precision, Taylor series, pipeline, matrix inversion

I. INTRODUCTION

In computing systems, the floating point number is the most widely used form to represent real numbers. Floating point numbers use an exponent to let the position of the decimal point float up and down as needed, expressing a wide range of real numbers flexibly. Compared with fixed point numbers, floating point numbers can represent a wider range with higher accuracy, so they are widely used in the fields of communication algorithms, digital signal processing, artificial intelligence and so on. However, slow operation and high resource consumption have become their limitation. Taking communication algorithms as an example, channel estimation [1] and the MMSE algorithm [2] both rely on matrix operations, which inevitably require floating point adders, multipliers and dividers. Therefore, a high area efficiency floating point arithmetic unit is important to improve algorithm speed and reduce resources. Among the various matrix operations, addition, multiplication and division are used most; correspondingly, floating point adders, multipliers and dividers are generally required.

The floating point adder is one of the most frequently used floating point operators. An analysis of real time applications indicates that signal processing algorithms require, on average, 40% multiplication and 60% addition operations [3]. Leading one predictor and "far and close path" algorithms achieve a good overall latency, but use 88% more resources than the standard adder [4]. A tremendous resource and latency overhead likewise reduces the efficiency of floating point multipliers, where a significant portion of the latency is attributed mainly to the multiplication of the mantissa parts of the two multiplicands. To improve the efficiency of floating point multipliers, numerous fast algorithms have been put forth and modeled, such as the array multiplier [5], the Vedic multiplier [6] and the Wallace tree multiplier [7], [8]. Among them, the array multiplier consumes the fewest LUTs but suffers from increased latency, while the Wallace tree multiplier provides the lowest input-output latency but suffers a 53.54% increase in chip area [9]. The floating point divider is also a very common arithmetic unit, but it is usually very slow. If a reciprocal operation and a multiplier are combined to implement the divider, the performance can be optimized by designing them separately. Work [10] fits the reciprocal with a quadratic equation, which consumes at least two single precision multipliers and two single precision floating point adders. Work [11] adopts the Newton iteration method and has a long waiting period, which does not match the high throughput characteristic of GPU arithmetic units, which output one result per cycle once the pipeline is filled [12].

Therefore, to realize high-speed and high-performance applications, a single precision floating point adder, a reciprocal operator and a mixed precision floating point multiplier are proposed.

II. ALGORITHM ANALYSIS

A. Floating Point Representation

TABLE I shows the IEEE-754 format for floating point numbers. A single precision number consists of 32 bits and a double precision number consists of 64 bits [13]. A single precision floating point number A is expressed by (1). S is the sign bit of A, 0 for positive and 1 for negative. E is the exponent field of A; the effective exponent E − 127 ranges from −127 to +128. M is the mantissa field of A. A single precision floating point number includes 1 bit for the sign, 8 bits for the exponent and 23 bits for the mantissa. The mantissa is stored with one hidden bit as 1.M. The exponent requires a bias to represent negative exponents, which in single precision is 127. The representation of double precision floating point numbers is similar, except for the bit widths of the exponent and mantissa. IEEE-754 also specifies some special formats, such as the representations of infinity and denormalized numbers, which are ignored in this paper. That is to say, every 32-bit pattern is treated as a real number according to (1), and the number 0 is represented by 32'h00000000 or 32'h80000000.

TABLE I IEEE-754 FORMAT

                    Sign         Exponent         Mantissa
Single Precision    1-bit [31]   8-bit [30:23]    23-bit [22:0]
Double Precision    1-bit [63]   11-bit [62:52]   52-bit [51:0]

A = (−1)^S × 2^(E−127) × 1.M        (1)
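As a concrete check of the field layout in TABLE I and formula (1), the following Python sketch (illustrative only, not part of the hardware design) unpacks a 32-bit pattern into S, E and M and reconstructs the value:

```python
import struct

def decode_float32(bits: int):
    """Split a 32-bit IEEE-754 pattern into its fields and reconstruct
    the value as (-1)^S * 2^(E-127) * 1.M, following (1)."""
    s = (bits >> 31) & 0x1        # sign bit [31]
    e = (bits >> 23) & 0xFF       # exponent bits [30:23], biased by 127
    m = bits & 0x7FFFFF           # mantissa bits [22:0]; hidden 1 is implied
    value = (-1.0) ** s * 2.0 ** (e - 127) * (1.0 + m / 2.0 ** 23)
    return s, e, m, value

# Round-trip against the machine encoding of -6.25 (a normal, non-special value).
bits = struct.unpack(">I", struct.pack(">f", -6.25))[0]
s, e, m, value = decode_float32(bits)
```

Here −6.25 = −1.5625 × 2², so S = 1, the biased exponent is 129, and the reconstruction returns exactly −6.25. Special patterns (infinity, denormals) are excluded, as in the paper.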
B. Floating Point Adder

Fig. 1 shows the standard floating point adder process. A floating point adder requires a series of steps, such as sign judgment, exponent subtraction, exponent alignment, mantissa addition/subtraction, mantissa normalization and so on. Subtraction only needs to reverse the sign bit of the second operand. The function of the sign judgment is to decide whether the mantissas should be added or subtracted. Floating point addition must first bring the operands to the same exponent: the smaller number's mantissa is shifted by ΔE bits, the difference between the two exponents, which requires the exponent subtraction and alignment steps. After adding/subtracting the mantissa of the larger number and the shifted mantissa of the smaller number, the result may not conform to the IEEE-754 rule, so the mantissa needs to be normalized. In general, the above steps run in order, while some can be implemented in parallel. Therefore, implementing the floating point adder with combinational logic alone leads to a long latency, and a pipeline is usually used to reduce it. To unify the design and balance the timing, all three kinds of arithmetic units use a three-stage pipeline. Some advanced algorithms speed up the process by increasing area, but the pipeline can also improve the speedup ratio by evenly dividing the standard process.

Fig. 1. Standard process of floating point adder (flowchart: sign judgment and absolute value comparison, exponent subtraction ΔE = E1 − E2, exponent alignment, mantissa add/sub, mantissa normalization, exponent judgment with underflow/overflow flags).
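The steps of Fig. 1 can be sketched in software as follows (an illustrative Python model, not the RTL; for exactness it aligns by left-shifting the larger significand, where hardware would right-shift the smaller one with guard and round bits):

```python
def fp_add(a, b):
    """Model of the standard adder flow: compare magnitudes, subtract
    exponents, align, add/sub mantissas, then normalize.
    Operands are (sign, biased_exp, significand) with the hidden 1
    attached, i.e. value = (-1)^sign * significand * 2^(exp - 127 - 23)."""
    (s1, e1, f1), (s2, e2, f2) = a, b
    # Compare absolute values so L is the larger-magnitude operand.
    if (e1, f1) >= (e2, f2):
        (sL, eL, fL), (sS, eS, fS) = (s1, e1, f1), (s2, e2, f2)
    else:
        (sL, eL, fL), (sS, eS, fS) = (s2, e2, f2), (s1, e1, f1)
    delta = eL - eS                  # exponent subtractor
    fL <<= delta                     # exponent alignment (exact variant)
    # Sign judgment: equal signs add the mantissas, else subtract.
    f = fL + fS if sL == sS else fL - fS
    if f == 0:
        return (0, 0, 0)             # zero result, special encoding
    s, e = sL, eS
    # Mantissa normalization back to a 24-bit significand (1.M form).
    while f >= 1 << 24:
        f >>= 1                      # the sketch drops rounding bits here
        e += 1
    while f < 1 << 23:
        f <<= 1
        e -= 1
    return (s, e, f)
```

For example, 2.0 − 1.5, i.e. (0, 128, 0x800000) plus (1, 127, 0xC00000), yields (0, 126, 0x800000), which is 0.5. The exponent judgment against the representable range (overflow/underflow flags) is omitted here.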
C. Floating Point Multiplier

Fig. 2 shows the floating point multiplier process. The floating point multiplication flow is relatively simple, but the fixed point multiplier used for mantissa multiplication consumes many resources. In some application scenarios, such as channel estimation, both high and low accuracy are demanded of the multiplier: only low accuracy is needed for matrix multiplication, while high accuracy is needed for matrix inversion. At present, advanced algorithms concentrate on optimizing the multiplier itself, but they cannot adapt to differing accuracy requirements. Multiplying a pair of single precision floating point numbers requires a 24-bit multiplier for the mantissas, while half precision requires only a 12-bit multiplier. That is to say, a 12-bit multiplier can not only directly realize half precision multiplication, but also realize single precision multiplication through time-sharing multiplexing, saving multiplier resources.

Fig. 2. Process of floating point multiplier (flowchart: sign judgment, exponent addition, mantissa multiplication, mantissa normalization, exponent judgment with underflow/overflow flags).
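The time-sharing multiplexing idea can be sketched as follows (illustrative Python, not the RTL): the 24×24-bit single precision mantissa product is assembled from four 12×12-bit partial products, one per cycle, so the same 12-bit core serves both precisions:

```python
def mul24_with_12bit_core(a: int, b: int) -> int:
    """Compute a 24x24-bit significand product using only a 12x12-bit
    multiply, invoked four times (one partial product per 'cycle')."""
    MASK12 = (1 << 12) - 1
    a_hi, a_lo = a >> 12, a & MASK12
    b_hi, b_lo = b >> 12, b & MASK12
    product = 0
    # Four passes through the single 12-bit multiplier, shifted into place.
    for x, y, shift in ((a_lo, b_lo, 0),
                        (a_hi, b_lo, 12),
                        (a_lo, b_hi, 12),
                        (a_hi, b_hi, 24)):
        product += (x * y) << shift   # x and y are at most 12 bits wide
    return product
```

A half precision significand fits the 12-bit core directly, so half precision needs one pass while single precision needs four, matching the utilization argument above.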
D. Floating Point Reciprocal

A reciprocal operation based on a look-up table and a first-order Taylor series is proposed for floating point numbers. Let m0 be the upper m bits of the mantissa and n the lower (23 − m) bits, so that the full mantissa can be written 1.mn = 1.m0 + 0.0n. The reciprocal can be read directly from a table indexed by the mantissa when m is small; the greater m is, the higher the accuracy of the final result will be, but the table grows exponentially. In this way, finding 1/1.m0 and 1/1.m0² by looking up a smaller table and obtaining the reciprocal through (2) is more convenient.

1/1.mn = 1/1.m0 − (1/1.m0 − 1/1.mn)
       = 1/1.m0 − 0.0n/(1.m0 × 1.mn)
       ≈ 1/1.m0 − (1/1.m0²) × 0.0n        (2)

Fig. 3. Process of floating point reciprocal operation (flowchart: split the mantissa, look up the table, apply the first-order correction, handle the underflow case in the output exponent, raise the underflow flag).

III. IMPLEMENTATION

A. High speedup adder with computation balance

In the standard addition process, exponent alignment and mantissa normalization require the most complex operations, and the fixed point adder required for the mantissa add/sub is wide and slow. Therefore, these three steps are divided into three different pipeline stages. Fig. 4 shows the divided three-stage pipeline adder. The first stage mainly completes the preparatory work. The inputs of the whole module are select and Sx/Ex/Mx: the former determines whether the operands are added or subtracted, while the latter are the sign bit, exponent and mantissa of the input data. The core module of this stage is the comparator, which compares the absolute values of the two operands; the comparison is used to calculate ΔE (which determines the number of bits to shift the mantissa), to decide whether to add or subtract the mantissas, and to determine the sign bit of the output result. Outputs MS1/ML1 are the mantissas of the smaller/larger input data. The second stage mainly completes the mantissa add/sub, and the latency can be balanced by moving the positive/negative judgment of the smaller mantissa into this stage. The key of the third stage is the "leading 1" detection. Detecting the "leading 1" by dichotomy saves as much area as possible while remaining fast; moreover, reducing the latency of only one pipeline stage is of little significance, and the dichotomy does not make the third stage's latency too long. The exponent judgment determines whether the result overflows or underflows, because the single precision representation has upper and lower limits and requires this additional check. Finally, the outputs are the sign bit, exponent and mantissa of the result, together with the overflow and underflow flags.

Fig. 4. Divided three-stage pipeline adder (datapath: stage 1 comparator and operand selection, stage 2 mantissa ALU, stage 3 dichotomy-based leading-one detection, shifting and exponent judgment).
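The "leading 1" detection by dichotomy in the third stage can be sketched as a binary search over bit positions (illustrative Python; hardware realizes each halving step as one level of logic, so a w-bit word needs about log2(w) levels):

```python
def leading_one_pos(x: int, width: int = 32) -> int:
    """Find the position of the most significant 1 by dichotomy:
    repeatedly test whether any 1 lies in the upper half of the
    remaining range and keep that half. Returns -1 for x == 0."""
    if x == 0:
        return -1                    # all-zero case (flagged separately)
    lo, hi = 0, width                # invariant: leading 1 is in [lo, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if x >> mid:                 # any set bit at position mid or above?
            lo = mid                 # keep the upper half
        else:
            hi = mid                 # keep the lower half
    return lo
```

The returned position tells the normalization shifter how far to move the mantissa and how much to adjust the exponent.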
B. High-utilization multiplier for mixed precision

Usually, different multipliers are needed for multiplications of different precisions, which wastes resources. Fig. 5 is the pipeline diagram of the mixed precision multiplier. Compared with the adder, the flow of the multiplier is simpler. The core modules are the 12-bit fixed point multiplier in the second stage and the state machines in the first and second stages. The input is similar to the adder's, except that select determines the precision of the multiplier; this signal controls the state machines. When implementing half precision floating point multiplication, the two state machines need only one cycle, and once the pipeline is full a result is produced every cycle. When implementing single precision floating point multiplication, each state machine needs four cycles: the first prepares the operands ma/mb for the fixed point multiplier in each cycle, and the second accumulates the product result mul_t one cycle behind. This implementation can compute one half precision multiplication every cycle or one single precision multiplication every 4 cycles. Moreover, under both working modes, the utilization rate of the resources is close to 100%. Another important advantage is that a multiplier implemented in this way can cooperate with the reciprocal operation proposed later to realize single precision division: one reciprocal operation and four single precision multipliers can realize four dividers, which greatly improves efficiency and resource utilization.

Fig. 5. Divided three-stage pipeline multiplier (datapath: stage 1 sign XOR, exponent ALU and operand-preparing state machine; stage 2 12-bit fixed point multiplier with accumulating state machine; stage 3 normalization and exponent judgment).

C. Reciprocal operator using look-up table and first-order Taylor series

The reciprocal operation is not used alone; it realizes the division operation together with the multiplier. Existing reciprocal operations generally suffer from long latency and large area. For the overall optimization, the latency of the reciprocal operation only needs to be lower than that of the adder and the multiplier while meeting a certain accuracy, so that accurate and efficient division can be realized. Fig. 6 is the pipeline diagram of the reciprocal operator based on the first-order Taylor series.

Fig. 6. Divided three-stage pipeline reciprocal operation (datapath: stage 1 exponent ALU and table look-up; stage 2 multiplication by the lower mantissa bits; stage 3 subtraction, normalization and exponent judgment).

The most important design choice for the reciprocal operation is the scale of the table. TABLE II shows the verification results under three conditions. In the evaluation, 60000 randomly generated single precision floating point numbers all meet an accuracy of 18 bits when m is 10 bits and the approximate result keeps 20 fractional bits. Therefore, the width of the table is 40 bits, the depth is 1024, and the total capacity is 5KB.

TABLE II ACCURACY COMPARISON UNDER THREE CONDITIONS

m     entry width of 1/1.m0 and 1/1.m0²   accuracy
9     40-bit                               93.96%
10    20-bit                               100%
10    19-bit                               99.83%

IV. RESULTS AND COMPARISONS

A. Implementation setup

To evaluate the overall area efficiency of the floating point units, the matrix inversion used in communication algorithms is taken as an example to analyze the performance of our_adder, our_multiplier and our_reciprocal. The hardware implementation of matrix inversion is difficult and generally divided into several steps: triangular decomposition, inversion of the triangular/diagonal matrices, multiplication of the triangular/diagonal matrices, and conjugation [14]. These operations require a total of 128 adders, 128 multipliers and 4 reciprocal operations, executed in sequence.

B. Comparisons

1) Comparison with Vivado IP

We first evaluated the performance and resource overhead of our designs on a Xilinx xc7k325tffg900-2 FPGA device using the Vivado design suite. TABLE III shows the synthesis results of this design and the Xilinx floating point arithmetic IPs [15]. The resources of the adder, multiplier and reciprocal operation are far less than those of the Xilinx IPs, and the latency and cycle counts are much lower. The reciprocal operation reduces LUT, FF and DSP resources by introducing a ROM; in most cases, one ROM can support several reciprocal operations at the same time, so on average this method is more economical than directly calculating the reciprocal.

TABLE III SYNTHESIS RESULTS ON VIVADO

Operation           LUT   FF    ROM   DSP   Latency (ns)   Cycles
our_adder           311   212   0     0     4.513          3
our_multiplier      167   285   0     1     5.174          3
our_reciprocal      24    111   5kB   0     6.393          3
Vivado adder        350   654   0     0     6.618          14
Vivado multiplier   244   444   0     1     6.515          13
Vivado reciprocal   203   375   0     8     7.871          31

According to the relevant conclusions on matrix inversion, the longest path of a 4×4 matrix inversion needs 10 additions, 5 multiplications and 3 divisions. The division can be realized by a multiplication and a reciprocal operation.
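The division scheme used here (look up 1/1.m0 and 1/1.m0² with the upper m bits of the mantissa, apply the first-order correction of (2), then multiply) can be sketched as follows (illustrative Python with m = 10 as chosen above; table entries are kept as floats rather than the 20-bit fixed point words of the hardware):

```python
M = 10                                # index bits, as selected via TABLE II
FRAC = 23                             # single precision mantissa width

# Look-up tables of 1/1.m0 and 1/(1.m0)^2 for all 2^M leading values.
RECIP = [1.0 / (1.0 + i / 2.0 ** M) for i in range(2 ** M)]
RECIP2 = [r * r for r in RECIP]

def reciprocal_mantissa(mant: int) -> float:
    """First-order Taylor reciprocal of the mantissa per (2):
    1/1.mn ~= 1/1.m0 - (1/1.m0^2) * 0.0n."""
    idx = mant >> (FRAC - M)                               # m0, the upper m bits
    low = (mant & ((1 << (FRAC - M)) - 1)) / 2.0 ** FRAC   # 0.0n
    return RECIP[idx] - RECIP2[idx] * low

def divide(a: float, b_mant: int, b_exp: int) -> float:
    """One division = one reciprocal + one multiplication,
    for b = 1.mb * 2^b_exp (b_exp unbiased)."""
    return a * reciprocal_mantissa(b_mant) * 2.0 ** (-b_exp)
```

The first-order remainder is bounded by (0.0n)² < 2⁻²⁰ here, comfortably inside the 18-bit accuracy target reported in TABLE II; for example, divide(3.0, 0x400000, 0) ≈ 2.0, since 0x400000 encodes the mantissa of 1.5.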
TABLE IV shows the comparison of resources and latency. According to TABLE IV, the LUTs required for matrix inversion are reduced by about 20%, and the FFs are 45% of the standard IP's. Since the tables required by the four reciprocal operations are identical, only one ROM is needed, and by introducing the ROM, the DSP resources occupied by the reciprocal operation are saved. The number of arithmetic units in this design is far less than with the standard IP. At the same time, because the slowest unit, the reciprocal operation, is about 1.5 ns faster than the standard IP, the total latency of a 4×4 matrix inversion is less than 1/4 of the standard IP's.

TABLE IV RESOURCES AND LATENCY

Operation     LUT     FF       ROM   DSP   Cycles   Latency (ns)
This design   61280   64060    5kB   128   95       607.335
Vivado IP     76844   142044   0     160   334      2628.914

2) Comparison with DesignWare IP

The DesignWare (DW) library contains pre-designed and verified IPs. It includes a process independent, proven and integrated component set of virtual microarchitectures, covering logic, arithmetic, storage and special component series, with more than 140 modules [16]. TABLE V shows the synthesis results of this design and the floating point arithmetic IPs in the DW library, using SMIC 40nm CMOS technology. The adder, multiplier and divider in the DW library are combinational logic, so their area is smaller but their speed is slower. In terms of area, the adder and multiplier of this design are slightly larger than those of DW, but the latency is optimized. The area of this design's divider, which consists of a multiplier and a reciprocal operation, is also smaller than DW's divider. To realize one division, this design requires a total of 2.60 × 6 = 15.6 ns (six pipeline cycles, three for the reciprocal and three for the multiplier, at the 2.60 ns clock), which is much less than DW's divider.

TABLE V SYNTHESIS RESULTS USING 40NM TECHNOLOGY

Operation        Area (μm²)   Critical Path (ns)   Clock Frequency (MHz)
our_adder        2180.77      2.29                 436.68
our_multiplier   3513.75      2.60                 384.61
our_reciprocal   1059.74      2.01                 497.61
DW_fp_addsub     1400.01      7.20                 138.89
DW_fp_mult       3294.14      5.70                 175.44
DW_fp_div        5074.00      45.75                21.86

Taking the inversion of a 4×4 matrix as an example again to compare the overall area efficiency: when the pipeline realizes the matrix inversion, this design needs 4 cycles, and the clock frequency can reach 384.61 MHz. However, DW's IPs need their outputs stored after each module, and the clock frequency is limited by the divider. In terms of area efficiency, this design's area is about 1.2 times that of DW's IP, while its latency is 22.73% (10.4 ns versus 45.75 ns); the total area efficiency is therefore about 1/(1.206 × 0.2273) ≈ 3.65 times that of DW's IP.

TABLE VI AREA AND TIMING REQUIRED

Operation     Area (μm²)   Cycles   Clock Period (ns)   Latency (ns)
This design   733138.55    4        2.60                10.4
DW IP         607971.30    1        45.75               45.75

V. CONCLUSION AND ACKNOWLEDGEMENT

This paper proposed three types of floating point arithmetic units, i.e., a single precision adder, a mixed precision multiplier and a reciprocal operator. Compared with the Vivado and DW IPs, this design requires only 1/4 of the computation latency when calculating matrix inversion, while achieving much higher area efficiency. This work was supported by the NNSF of China under Grant 62176206, the Special Project of Artificial Intelligence of the Aviation Science Foundation under Grant 2020Z066070001, and the Key-Area Research and Development Program of Guangdong Province under Grant 2019B010154002. (Corresponding author: Chen Yang.)

REFERENCES

[1] D. Neumann, T. Wiese and W. Utschick, "Learning the MMSE channel estimator," IEEE Transactions on Signal Processing, vol. 66, no. 11, pp. 2905-2917, 2018, doi: 10.1109/TSP.2018.2799164.
[2] C. Senning and A. Burg, "Block-floating-point enhanced MMSE filter matrix computation for MIMO-OFDM communication systems," IEEE International Conference on Electronics, Circuits, and Systems (ICECS), 2013, pp. 787-790, doi: 10.1109/ICECS.2013.6815532.
[3] M. Shirke, S. Chandrababu and Y. Abhyankar, "Implementation of IEEE 754 compliant single precision floating-point adder unit supporting denormal inputs on Xilinx FPGA," IEEE International Conference on Power, Control, Signals and Instrumentation Engineering (ICPCSI), 2017, pp. 408-412, doi: 10.1109/ICPCSI.2017.8392326.
[4] A. Malik and S.-B. Ko, "A study on the floating-point adder in FPGAs," Canadian Conference on Electrical and Computer Engineering (CCECE), 2006, pp. 86-89, doi: 10.1109/CCECE.2006.277498.
[5] K. Arun and K. Srivatsan, "A binary high speed floating point multiplier," International Conference on Nextgen Electronic Technologies: Silicon to Software (ICNETS2), 2017, pp. 316-321, doi: 10.1109/ICNETS2.2017.8067953.
[6] K. V. Gowreesrinivas and P. Samundiswary, "Comparative performance analysis of multiplexer based single precision floating point multipliers," International Conference on Electronics, Communication and Aerospace Technology (ICECA), 2017, pp. 430-435, doi: 10.1109/ICECA.2017.8212851.
[7] N. Sureka, R. Porselvi and K. Kumuthapriya, "An efficient high speed Wallace tree multiplier," International Conference on Information Communication and Embedded Systems (ICICES), 2013, pp. 1023-1026, doi: 10.1109/ICICES.2013.6508192.
[8] P. Sangeetha and A. Ali Khan, "Comparison of Braun multiplier and Wallace multiplier techniques in VLSI," International Conference on Devices, Circuits and Systems (ICDCS), 2018, pp. 48-53, doi: 10.1109/ICDCSyst.2018.8605173.
[9] V. K. R, A. R. S and N. D. R, "A comparative study on the performance of FPGA implementations of high-speed single-precision binary floating-point multipliers," International Conference on Smart Systems and Inventive Technology (ICSSIT), 2019, pp. 1041-1045, doi: 10.1109/ICSSIT46314.2019.8987800.
[10] J. Vanderspek, "Integer division using floating-point reciprocal," U.S. Patent 8 938 485, Jan. 20, 2015.
[11] M. Joldes, J. Muller and V. Popescu, "On the computation of the reciprocal of floating point expansions using an adapted Newton-Raphson iteration," IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2014, pp. 63-67, doi: 10.1109/ASAP.2014.6868632.
[12] S. Mou and X. Yang, "Design of a high-speed FPGA-based 32-bit floating-point FFT processor," IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing (SNPD), 2007, pp. 84-87, doi: 10.1109/SNPD.2007.46.
[13] G. Ushasree, R. Dhanabal and S. Kumar Sahoo, "VLSI implementation of a high speed single precision floating point unit using Verilog," IEEE Conference on Information & Communication Technologies, 2013, pp. 803-808, doi: 10.1109/CICT.2013.6558204.
[14] C. S., L. V., S. S. and M. J., "Design and implementation of a floating point matrix inversion module using model based programming," IEEE India Council International Conference (INDICON), 2019, pp. 1-4, doi: 10.1109/INDICON47234.2019.9030370.
[15] Xilinx, PG060 - Floating-Point Operator v7.1 Product Guide, June 6, 2018. Accessed: June 15, 2021. [Online]. Available: Floating-Point Operator v7.1 LogiCORE IP Product Guide (xilinx.com)
[16] Synopsys, DesignWare Library, June 3, 2021. Accessed: July 7, 2021. [Online]. Available: Synopsys DesignWare Library of Design and Verification IP