An efficient implementation of floating point multiplier

Conference Paper · May 2011


DOI: 10.1109/SIECPC.2011.5876905 · Source: IEEE Xplore


An Efficient Implementation of Floating Point Multiplier

Mohamed Al-Ashrafy
Mentor Graphics, Cairo, Egypt
[email protected]

Ashraf Salem
Mentor Graphics, Cairo, Egypt
[email protected]

Wagdy Anis
Communications and Electronics Engineering, Ain Shams University, Cairo, Egypt
[email protected]

Abstract: In this paper we describe an efficient implementation of an IEEE 754 single precision floating point multiplier targeted for the Xilinx Virtex-5 FPGA. VHDL is used to implement a technology-independent pipelined design. The multiplier implementation handles the overflow and underflow cases. Rounding is not implemented, which gives more precision when the multiplier is used in a Multiply and Accumulate (MAC) unit. With a latency of three clock cycles the design achieves 301 MFLOPs. The multiplier was verified against the Xilinx floating point multiplier core.

Keywords-floating point; multiplication; FPGA; CAD design flow

I. INTRODUCTION

Floating point numbers are one possible way of representing real numbers in binary format; the IEEE 754 [1] standard presents two different floating point formats, the binary interchange format and the decimal interchange format. Multiplying floating point numbers is a critical requirement for DSP applications involving large dynamic range. This paper focuses only on the single precision normalized binary interchange format. Fig. 1 shows the IEEE 754 single precision binary format representation; it consists of a one bit sign (S), an eight bit exponent (E), and a twenty three bit fraction (M or Mantissa). An extra bit is added to the fraction to form what is called the significand (footnote 1). If the exponent is greater than 0 and smaller than 255, and there is a 1 in the MSB of the significand, then the number is said to be a normalized number; in this case the real number is represented by (1).

Figure 1. IEEE single precision floating point format

$Z = (-1)^{S} \times 2^{(E - \mathrm{Bias})} \times (1.M)$ (1)

where $M = m_{22} 2^{-1} + m_{21} 2^{-2} + m_{20} 2^{-3} + \dots + m_{1} 2^{-22} + m_{0} 2^{-23}$ and Bias = 127.

Footnote 1: The significand is the mantissa with an extra MSB bit.
This research has been supported by Mentor Graphics.

Multiplying two numbers in floating point format is done by (1) adding the exponents of the two numbers then subtracting the bias from their result, (2) multiplying the significands of the two numbers, and (3) calculating the sign by XORing the signs of the two numbers. In order to represent the multiplication result as a normalized number there should be a 1 in the MSB of the result (leading one).

Floating-point implementation on FPGAs has been the interest of many researchers. In [2], an IEEE 754 single precision pipelined floating point multiplier was implemented on multiple FPGAs (4 Actel A1280). In [3], a custom 16/18 bit three stage pipelined floating point multiplier that doesn't support rounding modes was implemented. In [4], a single precision floating point multiplier that doesn't support rounding modes was implemented using a digit-serial multiplier; using the Altera FLEX 8000 it achieved 2.3 MFlops. In [5], a parameterizable floating point multiplier was implemented using the software-like language Handel-C, using the Xilinx XCV1000 FPGA; a five stage pipelined multiplier achieved 28 MFlops. In [6], a latency optimized floating point unit using the primitives of the Xilinx Virtex II FPGA was implemented with a latency of 4 clock cycles. The multiplier reached a maximum clock frequency of 100 MHz.

II. FLOATING POINT MULTIPLICATION ALGORITHM

As stated in the introduction, normalized floating point numbers have the form $Z = (-1)^{S} \times 2^{(E - \mathrm{Bias})} \times (1.M)$. To multiply two floating point numbers the following is done:
1. Multiplying the significands; i.e. (1.M1 × 1.M2)
2. Placing the decimal point in the result
3. Adding the exponents; i.e. (E1 + E2 - Bias)
4. Obtaining the sign; i.e. s1 xor s2
5. Normalizing the result; i.e. obtaining a 1 at the MSB of the result's significand
6. Rounding the result to fit in the available bits
7. Checking for underflow/overflow occurrence

Consider a floating point representation similar to the IEEE 754 single precision floating point format, but with a reduced
number of mantissa bits (only 4) while still retaining the hidden '1' bit for normalized numbers:

A = 0 10000100 0100 = 40, B = 1 10000001 1110 = -7.5

To multiply A and B:

1. Multiply significand:

        1.0100
      × 1.1110
    ----------
         00000
        10100
       10100
      10100
     10100
    ----------
    1001011000

2. Place the decimal point: 10.01011000

3. Add exponents:

      10000100
    + 10000001
    ----------
     100000101

The exponent representing the two numbers is already shifted/biased by the bias value (127) and is not the true exponent; i.e. EA = EA-true + bias and EB = EB-true + bias, and

EA + EB = EA-true + EB-true + 2 bias

So we should subtract the bias from the resultant exponent, otherwise the bias will be added twice.

     100000101
    - 01111111
    ----------
      10000110

4. Obtain the sign bit and put the result together:

1 10000110 10.01011000

5. Normalize the result so that there is a 1 just before the radix point (decimal point). Moving the radix point one place to the left increments the exponent by 1; moving one place to the right decrements the exponent by 1.

1 10000110 10.01011000 (before normalizing)
1 10000111 1.001011000 (normalized)

The result is (without the hidden bit):
1 10000111 00101100

6. The mantissa bits are more than 4 bits (mantissa available bits); rounding is needed. If we applied the truncation rounding mode then the stored value is:
1 10000111 0010.

In this paper we present a floating point multiplier in which rounding support isn't implemented. Rounding support can be added as a separate unit that can be accessed by the multiplier or by a floating point adder, thus accommodating more precision if the multiplier is connected directly to an adder in a MAC unit. Fig. 2 shows the multiplier structure; exponent addition, significand multiplication, and the result's sign calculation are independent and are done in parallel. The significand multiplication is done on two 24 bit numbers and results in a 48 bit product, which we will call the intermediate product (IP). The IP is represented as (47 downto 0) and the decimal point is located between bits 46 and 45 in the IP. The following sections detail each block of the floating point multiplier.

Figure 2. Floating point multiplier block diagram

III. HARDWARE OF FLOATING POINT MULTIPLIER

A. Sign bit calculation

Multiplying two numbers results in a negative sign number iff exactly one of the multiplied numbers is of a negative value. By the aid of a truth table we find that this can be obtained by XORing the signs of the two inputs.

B. Unsigned Adder (for exponent addition)

This unsigned adder is responsible for adding the exponent of the first input to the exponent of the second input and subtracting the Bias (127) from the addition result (i.e. A_exponent + B_exponent - Bias). The result of this stage is called the intermediate exponent. The add operation is done on 8 bits, and there is no need for a quick result because most of the calculation time is spent in the significand multiplication process (multiplying 24 bits by 24 bits); thus we need a moderate exponent adder and a fast significand multiplier.

An 8-bit ripple carry adder is used to add the two input exponents. As shown in Fig. 3, a ripple carry adder is a chain of cascaded full adders and one half adder; each full adder has three inputs (A, B, Ci) and two outputs (S, Co). The carry out (Co) of each adder is fed to the next full adder (i.e. each carry bit "ripples" to the next full adder).

Figure 3. Ripple Carry Adder

The addition process produces an 8 bit sum (S7 to S0) and a carry bit (Co,7). These bits are concatenated to form a 9 bit addition result (S8 to S0) from which the Bias is subtracted. The Bias is subtracted using an array of ripple borrow subtractors.
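
The exponent path of Section III-B can be sketched as a bit-level software model. This is only an illustration in Python, not the paper's VHDL: the function names are hypothetical, the first adder cell is modelled as a full adder with carry-in 0 rather than a half adder, and the bias subtraction uses a generic full subtractor cell rather than a cell specialised for a constant input.

```python
BIAS = 127  # single precision exponent bias


def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(x, y, width=8):
    """Add two unsigned width-bit values; returns a (width + 1)-bit result."""
    carry = 0
    result = 0
    for i in range(width):
        # The carry out of cell i "ripples" into cell i + 1.
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result | (carry << width)  # final carry becomes bit `width`


def full_subtractor(s, t, bin_):
    """One-bit full subtractor for s - t - bin_: returns (difference, borrow_out)."""
    r = s ^ t ^ bin_
    bout = ((s ^ 1) & (t | bin_)) | (t & bin_)
    return r, bout


def ripple_borrow_sub(x, y, width=9):
    """Compute x - y over width bits; borrow_out = 1 means x < y."""
    borrow = 0
    result = 0
    for i in range(width):
        r, borrow = full_subtractor((x >> i) & 1, (y >> i) & 1, borrow)
        result |= r << i
    return result, borrow


def intermediate_exponent(e_a, e_b):
    """Return (E_a + E_b - Bias as 9 bits, borrow flag)."""
    total = ripple_carry_add(e_a, e_b, 8)      # 9-bit sum S8..S0
    return ripple_borrow_sub(total, BIAS, 9)   # subtract 001111111


# Exponents from the worked example in Section II.
print(intermediate_exponent(0b10000100, 0b10000001))
```

With the exponents of the worked example (10000100 and 10000001) the model returns the 9-bit value 010000110 with no borrow, matching step 3 of the example. A set borrow flag corresponds to a negative intermediate exponent.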
A normal subtractor has three inputs (minuend (S), subtrahend (T), Borrow in (Bi)) and two outputs (Difference (R), Borrow out (Bo)). The subtractor logic can be optimized if one of its inputs is a constant value, which is our case, where the Bias is constant (127 in decimal = 001111111 in binary). Table I shows the truth table for a 1-bit subtractor with the input T equal to 1, which we will call "one subtractor (OS)".

TABLE I. 1-BIT SUBTRACTOR WITH THE INPUT T = 1

S   T   Bi   Difference (R)   Bo
0   1   0    1                1
1   1   0    0                0
0   1   1    0                1
1   1   1    1                1

The Boolean equations (2) and (3) represent this subtractor; in the form implied by Table I:

$R = \overline{S \oplus B_i}$ (2)

$B_o = \overline{S} + B_i$ (3)

Figure 4. 1-bit subtractor with the input T = 1

Table II shows the truth table for a 1-bit subtractor with the input T equal to 0, which we will call "zero subtractor (ZS)".

TABLE II. 1-BIT SUBTRACTOR WITH THE INPUT T = 0

S   T   Bi   Difference (R)   Bo
0   0   0    0                0
1   0   0    1                0
0   0   1    1                1
1   0   1    0                0

The Boolean equations (4) and (5) represent this subtractor; in the form implied by Table II:

$R = S \oplus B_i$ (4)

$B_o = \overline{S} \cdot B_i$ (5)

Figure 5. 1-bit subtractor with the input T = 0

Fig. 6 shows the Bias subtractor, which is a chain of 7 one subtractors (OS) followed by 2 zero subtractors (ZS); the borrow output of each subtractor is fed to the next subtractor. If an underflow occurs then Eresult < 0 and the number is out of the IEEE 754 single precision normalized numbers range; in this case the output is set to 0 and an underflow flag is asserted.

Figure 6. Ripple Borrow Subtractor

C. Unsigned Multiplier (for significand multiplication)

This unit is responsible for multiplying the unsigned significands and placing the decimal point in the multiplication product. The result of the significand multiplication will be called the intermediate product (IP). The unsigned significand multiplication is done on 24 bits. Multiplier performance should be taken into consideration so as not to affect the whole multiplier's performance. A 24x24 bit carry save multiplier architecture is used as it has a moderate speed with a simple architecture. In the carry save multiplier, the carry bits are passed diagonally downwards (i.e. the carry bit is propagated to the next stage). Partial products are made by ANDing the inputs together and passing them to the appropriate adder.

The carry save multiplier has three main stages:
1- The first stage is an array of half adders.
2- The middle stages are arrays of full adders. The number of middle stages is equal to the significand size minus two.
3- The last stage is an array of ripple carry adders. This stage is called the vector merging stage.

The number of adders (half adders and full adders) in each stage is equal to the significand size minus one. For example, a 4x4 carry save multiplier is shown in Fig. 7 and it has the following stages:
1- The first stage consists of three half adders.
2- Two middle stages; each consists of three full adders.
3- The vector merging stage consists of one half adder and two full adders.

The decimal point is between bits 45 and 46 in the significand multiplier result. The multiplication time taken by the carry save multiplier is determined by its critical path. The critical path starts at the AND gate of the first partial products (i.e. a1b0 and a0b1), passes through the carry logic of the first half adder and the carry logic of the first full adder of the middle stages, then passes through all the vector merging adders. The critical path is marked in bold in Fig. 7.
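
The carry save principle described in Section III-C can be illustrated with a short software model. This is a minimal sketch in Python under stated assumptions: `carry_save_multiply` is a hypothetical name, each loop iteration stands for one row of 3:2 compressors (full adders) working on whole bit vectors, and the final `+` plays the role of the vector merging stage. It reproduces the arithmetic of the scheme, not the exact adder array of Fig. 7.

```python
def carry_save_multiply(a, b, width=24):
    """Multiply two unsigned width-bit integers with carry-save accumulation."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    sum_vec, carry_vec = 0, 0
    for j in range(width):
        # Row j of partial products: a_i AND b_j for every i, weighted by 2**j.
        partial = (a << j) if (b >> j) & 1 else 0
        # One row of full adders: nothing propagates sideways inside the row;
        # each carry is saved and handed down to the next stage, one bit
        # position higher.
        new_sum = sum_vec ^ carry_vec ^ partial
        new_carry = ((sum_vec & carry_vec)
                     | (sum_vec & partial)
                     | (carry_vec & partial)) << 1
        sum_vec, carry_vec = new_sum, new_carry
    # Vector merging stage: a single carry-propagating addition resolves
    # the saved carries into the final product.
    return sum_vec + carry_vec


# Significands of the worked example: 1.0100 x 1.1110 as 5-bit integers.
print(bin(carry_save_multiply(0b10100, 0b11110, width=5)))
```

For the 5-bit significands of the worked example in Section II the model yields 1001011000, the same product obtained by hand. The invariant behind it is that the sum vector plus the carry vector always equals the running total of the partial products added so far.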


Figure 7. 4x4 bit Carry Save multiplier

In Fig. 7:
1- Partial product: aibj = ai and bj
2- HA: half adder
3- FA: full adder

D. Normalizer

The result of the significand multiplication (intermediate product) must be normalized to have a leading '1' just to the left of the decimal point (i.e. in bit 46 of the intermediate product). Since the inputs are normalized numbers, the intermediate product has the leading one at bit 46 or 47.
1- If the leading one is at bit 46 (i.e. to the left of the decimal point) then the intermediate product is already a normalized number and no shift is needed.
2- If the leading one is at bit 47 then the intermediate product is shifted to the right and the exponent is incremented by 1.

The shift operation is done using combinational shift logic made by multiplexers. Fig. 8 shows a simplified logic of a Normalizer that has an 8 bit intermediate product input and a 6 bit intermediate exponent input.

Figure 8. Simplified Normalizer logic

IV. UNDERFLOW/OVERFLOW DETECTION

Overflow/underflow means that the result's exponent is too large/small to be represented in the exponent field. The exponent of the result must be 8 bits in size, and must be between 1 and 254, otherwise the value is not a normalized one. An overflow may occur while adding the two exponents or during normalization. Overflow due to exponent addition may be compensated during subtraction of the bias, resulting in a normal output value (normal operation). An underflow may occur while subtracting the bias to form the intermediate exponent. If the intermediate exponent < 0 then it's an underflow that can never be compensated; if the intermediate exponent = 0 then it's an underflow that may be compensated during normalization by adding 1 to it.

When an overflow occurs an overflow flag signal goes high and the result turns to ±Infinity (sign determined according to the signs of the floating point multiplier inputs). When an underflow occurs an underflow flag signal goes high and the result turns to ±Zero (sign determined according to the signs of the floating point multiplier inputs). Denormalized numbers are signaled to Zero with the appropriate sign calculated from the inputs and an underflow flag is raised. Assume that E1 and E2 are the exponents of the two numbers A and B respectively; the result's exponent is calculated by (6).

Eresult = E1 + E2 - 127 (6)

E1 and E2 can have the values from 1 to 254, resulting in Eresult having values from -125 (2 - 127) to 381 (508 - 127); but for normalized numbers, Eresult can only have the values from 1 to 254. Table III summarizes the different values of Eresult and the effect of normalization on it.

TABLE III. NORMALIZATION EFFECT ON RESULT'S EXPONENT AND OVERFLOW/UNDERFLOW DETECTION

Eresult               Category            Comments
-125 ≤ Eresult < 0    Underflow           Can't be compensated during normalization
Eresult = 0           Zero                May turn to normalized number during normalization (by adding 1 to it)
1 ≤ Eresult ≤ 254     Normalized number   May result in overflow during normalization
255 ≤ Eresult         Overflow            Can't be compensated

V. PIPELINING THE MULTIPLIER

In order to enhance the performance of the multiplier, three pipelining stages are used to divide the critical path, thus increasing the maximum operating frequency of the multiplier. The pipelining stages are embedded at the following locations:
1. In the middle of the significand multiplier, and in the middle of the exponent adder (before the bias subtraction).
2. After the significand multiplier, and after the exponent adder.
3. At the floating point multiplier outputs (sign, exponent and mantissa bits).

Fig. 9 shows the pipelining stages as dotted lines.
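
The blocks of Sections III and IV can be combined into a small behavioural model of the unrounded multiplier. The following is a hedged sketch in Python, not the paper's VHDL: the names `float_to_bits`, `fp_multiply` and `to_float` are hypothetical, the function is purely combinational (no pipeline registers), and it assumes both inputs are normalized numbers, rejecting anything else for simplicity.

```python
import struct

BIAS = 127


def float_to_bits(x):
    """Pack a Python float into its IEEE 754 single precision bit pattern."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def fp_multiply(a_bits, b_bits):
    """Return (sign, exponent, normalized_ip, overflow, underflow).

    normalized_ip keeps its leading one at bit 46; no rounding is applied.
    """
    sign = ((a_bits >> 31) ^ (b_bits >> 31)) & 1          # XOR of the signs
    e_a, e_b = (a_bits >> 23) & 0xFF, (b_bits >> 23) & 0xFF
    if not (1 <= e_a <= 254 and 1 <= e_b <= 254):
        # Simplification of this sketch: only normalized inputs are modelled.
        raise ValueError("inputs must be normalized numbers")
    sig_a = (1 << 23) | (a_bits & 0x7FFFFF)               # restore hidden 1
    sig_b = (1 << 23) | (b_bits & 0x7FFFFF)
    ip = sig_a * sig_b                                    # 48-bit intermediate product
    exponent = e_a + e_b - BIAS                           # equation (6)
    if (ip >> 47) & 1:                                    # leading one at bit 47
        ip >>= 1                                          # shift right once
        exponent += 1                                     # and bump the exponent
    if exponent >= 255:                                   # overflow: +/- Infinity
        return sign, 255, 0, True, False
    if exponent <= 0:                                     # underflow: +/- Zero
        return sign, 0, 0, False, True
    return sign, exponent, ip, False, False


def to_float(sign, exponent, ip):
    """Interpret a normal result of fp_multiply as a Python float (for checking)."""
    return (-1.0) ** sign * (ip / 2.0 ** 46) * 2.0 ** (exponent - BIAS)


s, e, ip, ovf, unf = fp_multiply(float_to_bits(40.0), float_to_bits(-7.5))
print(s, e, to_float(s, e, ip), ovf, unf)
```

For the operands of the worked example (40 and -7.5) the model reports sign 1 and exponent 135, which `to_float` interprets as -300. Multiplying 2^127 by 2 raises the overflow flag, and multiplying 2^-126 by 0.5 raises the underflow flag, in line with Table III.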
Figure 9. Floating point multiplier with pipelined stages

Three pipelining stages mean that there is a latency in the output of three clock cycles. The synthesis tool "retiming" option was used so that the synthesizer uses its optimization logic to better place the pipelining registers across the critical path.

VI. IMPLEMENTATION AND TESTING

The whole multiplier (top unit) was tested against the Xilinx floating point multiplier core generated by Xilinx coregen. The Xilinx core was customized to have two flags to indicate overflow and underflow, and to have a maximum latency of three cycles. The Xilinx core implements the "round to nearest" rounding mode.

A testbench is used to generate the stimulus and apply it to the implemented floating point multiplier and to the Xilinx core, then compare the results. The floating point multiplier code was also checked using DesignChecker [7]. DesignChecker is a linting tool which helps in filtering design issues like gated clocks, unused/undriven logic, and combinational loops. The design was synthesized using the Precision synthesis tool [8] targeting Xilinx Virtex-5 5VFX200TFF1738 with a timing constraint of 300 MHz. Post synthesis and place and route simulations were made to ensure the design functionality after synthesis and place and route. Table IV shows the resources and frequency of the implemented floating point multiplier and the Xilinx core.

TABLE IV. AREA AND FREQUENCY COMPARISON BETWEEN THE IMPLEMENTED FLOATING POINT MULTIPLIER AND XILINX CORE

                      Our Floating Point Multiplier   Xilinx Core
Function Generators   1263                            765
CLB Slices            604                             266
DFF                   293                             241
Max Frequency         301.114 MHz                     221.484 MHz

The area of the Xilinx core is less than that of the implemented floating point multiplier because the latter doesn't truncate/round the 48 bit result of the mantissa multiplier, which is reflected in the amount of function generators and registers used to perform operations on the extra bits; also the speed of the Xilinx core is affected by the fact that it implements the round to nearest rounding mode.

VII. CONCLUSIONS AND FUTURE WORK

This paper presents an implementation of a floating point multiplier that supports the IEEE 754-2008 binary interchange format; the multiplier doesn't implement rounding and just presents the significand multiplication result as is (48 bits); this gives better precision if the whole 48 bits are utilized in another unit; i.e. a floating point adder to form a MAC unit. The design has three pipelining stages and after implementation on a Xilinx Virtex5 FPGA it achieves 301 MFLOPs.

ACKNOWLEDGMENT

The authors would like to thank Randa Hashem for her invaluable support and contribution.

REFERENCES

[1] IEEE 754-2008, IEEE Standard for Floating-Point Arithmetic, 2008.
[2] B. Fagin and C. Renard, "Field Programmable Gate Arrays and Floating Point Arithmetic," IEEE Transactions on VLSI, vol. 2, no. 3, pp. 365–367, 1994.
[3] N. Shirazi, A. Walters, and P. Athanas, "Quantitative Analysis of Floating Point Arithmetic on FPGA Based Custom Computing Machines," Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'95), pp. 155–162, 1995.
[4] L. Louca, T. A. Cook, and W. H. Johnson, "Implementation of IEEE Single Precision Floating Point Addition and Multiplication on FPGAs," Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'96), pp. 107–116, 1996.
[5] A. Jaenicke and W. Luk, "Parameterized Floating-Point Arithmetic on FPGAs," Proc. of IEEE ICASSP, vol. 2, pp. 897–900, 2001.
[6] B. Lee and N. Burgess, "Parameterisable Floating-point Operations on FPGA," Conference Record of the Thirty-Sixth Asilomar Conference on Signals, Systems, and Computers, 2002.
[7] "DesignChecker User Guide," HDL Designer Series 2010.2a, Mentor Graphics, 2010.
[8] "Precision® Synthesis User's Manual," Precision RTL plus 2010a update 2, Mentor Graphics, 2010.
[9] D. Patterson and J. Hennessy, Computer Organization and Design: The Hardware/Software Interface, Morgan Kaufmann, 2005.
[10] J. G. Proakis and D. G. Manolakis, Digital Signal Processing: Principles, Algorithms and Applications, Third Edition, 1996.
