
1726 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 16, NO. 12, DECEMBER 2008

Bridge Floating-Point Fused Multiply-Add Design

Eric Quinnell, Earl E. Swartzlander Jr., and Carl Lemonds

Abstract—A new floating-point fused multiply-add (FMA) design for the execution of (A × B) + C as a single instruction is presented. The bridge fused multiply-add unit is a design intended to add FMA functionality to existing floating-point coprocessor units by including specialized hardware that reuses floating-point adder and floating-point multiplier components. The bridge unit adds this functionality without requiring an overhaul of coprocessor control units and without degrading the performance or parallel execution of addition and multiplication single instructions. To evaluate the performance, area, and power costs of adding a bridge FMA unit to common floating-point execution blocks, several circuits including a double-precision floating-point adder, floating-point multiplier, classic FMA, and a bridge FMA unit have been designed and implemented with AMD 65-nm silicon-on-insulator technology to provide a realistic and fair analysis of the presented FMA hardware tradeoffs.

Index Terms—Computer arithmetic, floating point arithmetic, fused multiply-add, IEEE-754 standard.

Manuscript received July 20, 2007; revised December 18, 2007. Current version published November 19, 2008.
E. Quinnell was with The University of Texas at Austin, Austin, TX 78712 USA. He is now with the mobile division of AMD, Austin, TX 78752 USA (e-mail: [email protected]).
E. E. Swartzlander, Jr., is with the Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78712 USA (e-mail: [email protected]).
C. Lemonds is with the Mobile Division of AMD, Austin, TX 78752 USA (e-mail: [email protected]).
Digital Object Identifier 10.1109/TVLSI.2008.2001944

I. INTRODUCTION

In 1990, the fused multiply-add (FMA) unit was introduced on the IBM RS/6000 for the single instruction execution of the equation (A × B) + C [1], [2]. This hardware unit was designed to reduce the latency of dot-product calculations. Additionally, the FMA unit provides more precise results since only a single rounding is performed after both the multiplication and addition are executed at full precision.

Since 1990, many algorithms that utilize the (A × B) + C single-instruction equation have been introduced, for applications in digital signal and graphics processing [3], [4], fast Fourier transforms (FFTs) [5], finite-impulse response (FIR) filters [3], division [6], argument reduction [7], etc. Several industrial-level chips have been designed and implemented with embedded fused multiply-add units. These chips include designs by IBM [1], [8]–[10], HP [11], [12], MIPS [13], ARM (MACC/un-fused) [3], and Intel [14], [15]. Some chips entirely replace the floating-point adder (FADD) and floating-point multiplier (FMUL) with an FMA unit by using constants to perform single floating-point operations, e.g., (A × B) + 0.0 for single multiplies and (A × 1.0) + C for single adds.

This combination of industrial implementation and increased algorithmic activity has pushed the IEEE-754R committee to consider including the FMA instruction in the IEEE standard for floating-point arithmetic [16]. Most recently, AMD introduced the SSE5 extensions to the x86 instruction set [17]; SSE5 extensions are centered on the inclusion of the FMA instruction and its derivatives into modern x86 computing processors.

However, the greatest advantage of the modern FMA unit is also the greatest argument against its use. FMA units remove the need for any single multiplication or add instruction and improve performance by combining the two. As described already, any calls to standalone additions or multiplications require the insertion of a constant into the unit to emulate the desired arithmetic result. Due to the increased complexity of an FMA unit, such standalone instructions are subject to greater latencies than if they were processed in a single-function arithmetic unit (i.e., an FADD or FMUL unit). Additionally, the introduction of a three-operand execution unit requires special scheduling, retire queue, register, and control hardware as well as the sacrifice of executing FADD and FMUL instructions in parallel. These custom hardware requirements are seen in the literature [1], [9].

A few possible solutions have already been identified. FMA designs have been proposed [18], [19] that use a FADD dual-path scheme within a reduced-latency FMA unit to allow FADD instructions to bypass the multiplier tree, approaching—but not yet reaching—the latency of a common FADD unit instruction. Though such proposals have been made, they have yet to be implemented. Additionally, no study has yet been presented that identifies the relative costs of creating a floating-point unit capable of performing all three basic floating-point arithmetic instructions (FADD, FMUL, and FMA) in hardware.

This paper presents a new architecture that adds FMA functionality by including hardware between an FADD and FMUL unit, creating a "bridge" that connects the two. The architecture is designed to reuse as many components as possible from both the FADD and FMUL to minimize the area and the power consumption of the enhancement. This paper is intended to provide an identification of the implementation costs in an execution unit capable of pure FADD, FMUL, and FMA instructions completely processed by hardware.

To provide details on the design costs of the bridge FMA enhancement, this paper presents the results of several custom circuit implementations, including an FADD, FMUL, classic FMA, and the new bridge FMA. All circuits have been designed and implemented with AMD 65-nm silicon-on-insulator technology to provide a realistic and fair comparison of the performance and cost tradeoffs seen between the bridge architecture and standard floating-point execution units.

II. CLASSIC FMA ARCHITECTURE

The FMA architectures implemented in industry are all based on the original serial design of the IBM RS/6000 [1], [2]. There are localized improvements and varying bit-width changes from one design to another, but the basic microarchitecture has remained constant. To describe this traditional architecture, this section touches briefly on the IEEE double-precision floating-point format used in the following FMA descriptions as well as an introduction to the structure of the traditional FMA architecture (denoted the "classic" FMA unit in this paper).

A. Double-Precision Floating-Point Format

All floating-point units described in this paper are designed for computation using the double-precision IEEE-754 standard format [20]. Specifically, we selected double-precision to both simplify the comparisons between architectures and to follow the same double-precision format presented in most FMA literature.

The IEEE-754 double-precision format consists of a 64-bit vector split into three sections: a sign bit, biased exponent bits, and a significand. As seen in Fig. 1, the sign bit is a single bit stored at the most significant bit (MSB) of the packed number to represent a positive or negative floating-point number. Following are 11 bits of excess-1023 biased exponent and finally a 52-bit fraction.

To represent any given floating-point number in a decimal format (i.e., a stored number given here as floating-point number "A"), the three fields are combined and interpreted as the following:

A = (−1)^(sign_A) × 1.fraction_A × 2^(exp_A − bias).    (1)

Fig. 1. IEEE-754 double-precision format [20].

where sign_A is the sign bit, fraction_A is the 52-bit significand fraction following a leading implicit "1" bit, and exp_A − bias is the 11-bit exponent minus the double-precision bias value 1023. Further specifics on the double-precision format may be found in the IEEE-754 standard [20].
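As an illustration of this encoding, the field extraction of Fig. 1 and the mapping of equation (1) can be sketched in a few lines of Python. This is a behavioral sketch only; the helper names are ours, not part of the paper's hardware, and it handles normal numbers only (no zeros, subnormals, infinities, or NaNs).

```python
import struct

def decode(x):
    # Unpack a packed double into its three IEEE-754 fields (Fig. 1 layout).
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    sign = bits >> 63                  # 1-bit sign at the MSB
    exp = (bits >> 52) & 0x7FF         # 11-bit excess-1023 biased exponent
    frac = bits & ((1 << 52) - 1)      # 52-bit fraction
    return sign, exp, frac

def value(sign, exp, frac, bias=1023):
    # Equation (1) for normal numbers: (-1)^sign * 1.fraction * 2^(exp - bias),
    # with the implicit leading "1" bit restored in front of the fraction.
    return (-1) ** sign * (1 + frac / 2 ** 52) * 2 ** (exp - bias)
```

For example, decode(-6.5) yields sign 1, biased exponent 1025, and a fraction field of 0.625 × 2^52, which equation (1) maps back to −6.5.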

B. Classic FMA Architecture


The classic FMA architecture is shown in Fig. 2. There are several serial steps to complete the execution of an FMA instruction in double-precision format on the traditional architecture.
1) Multiply the 53-bit input significands A × B in the multiply array to produce a pair of 106-bit numbers in the carry-save format. Align the addend C to any point in the range of the 106-bit product, including 55 bits above the product (53-bit operand length and a double-carry buffer of 2 bits) based on exponent difference. This requires a 161-bit aligner.
2) Combine the lower 106 bits of the addend and the product with a 3:2 carry-save adder (CSA) to produce a 161-bit string of data in carry-save format.
3) Input the 161-bit string into a 161-bit adder, or—more efficiently—a 109-bit adder followed by a 52-bit incrementer.
4) Input the lower 109 bits into a 109-bit leading-zero anticipator (LZA) that processes in parallel to the adder for the case of massive cancellation and addends smaller than the product.
5) Complement the sum if required. This step may be performed a number of ways, including incrementers, end-around-carry (EAC) adders, or massive parallelism.
6) Normalize the 161-bit result based on the output of the LZA, the exponent difference, and—in some designs—a zero detect on the output of the top 52-bit incrementer.
7) Round the result and post-normalize if necessary. This requires a 52-bit adder or incrementer plus control logic, depending on data-type support.

Fig. 2. Classic FMA architecture.
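The steps above keep the product exact until a single terminal rounding in step 7, which is the source of the FMA's extra precision. A small Python comparison (our own reference model, using exact rational arithmetic in place of the carry-save hardware above) shows why that single rounding matters:

```python
from fractions import Fraction

def fma_ref(a, b, c):
    # Exact product and full-width sum (Fraction is exact), then ONE
    # rounding to double, mirroring the fused datapath described above.
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def unfused(a, b, c):
    # Separate FMUL then FADD: the product is rounded before the add.
    return a * b + c

a = b = 1.0 + 2.0 ** -52
c = -(1.0 + 2.0 ** -51)
# The exact product is 1 + 2^-51 + 2^-104; rounding it to a double drops
# the 2^-104 term, so the unfused result cancels to exactly zero while
# the fused result preserves the residual.
```

Here unfused(a, b, c) returns 0.0, while fma_ref(a, b, c) returns the residual 2^−104.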

III. PROPOSED BRIDGE FMA ARCHITECTURE


The bridge fused multiply-add unit is intended to add FMA func-
tionality to existing floating-point coprocessor units by including spe-
cialized hardware that reuses FADD and FMUL components. Addi-
tionally, the bridge architecture is intended to provide a realistic study
of the implementation costs involved when adding the FMA feature to
popular FADD and FMUL coprocessor units. This section provides the
design details of the elements required to implement the bridge FMA
architecture.
A. Bridge FMA Architecture

Fig. 3 shows a high-level block diagram of the bridge FMA architecture. The design begins with common FMUL and FADD units capable of independent and parallel execution. Several blocks are added between the two execution units, creating a "bridge" capable of carrying data from the FMUL unit to the FADD unit to efficiently perform an FMA instruction.
The bridge FMA architecture does not require a fully independent FMA hardware implementation. Pieces from both the FMUL and FADD units are modified and reused for dual functionality. Specifically, the FADD add/round stage is used for both additions and FMAs, while the FMUL reuses the single largest component block of any arithmetic unit, the multiplier array. The remaining hardware requirements for a complete FMA instruction are implemented in the bridge unit itself, which is only powered on via clock-gating during an FMA instruction.

Fig. 3. Bridge FMA unit block diagram.
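The component reuse described above can be mirrored in a toy behavioral model. This is our own sketch, not the paper's hardware: the function names are illustrative, and exact rationals stand in for the 106-bit sum/carry pair and the add/round datapath.

```python
from fractions import Fraction

def multiplier_array(a, b):
    # Shared FMUL block: the exact, unrounded product (modeling the
    # multiplier array's 106-bit sum/carry result as an exact rational).
    return Fraction(a) * Fraction(b)

def add_round(x, y):
    # Shared FADD add/round stage: exact sum, then one rounding to double.
    return float(Fraction(x) + Fraction(y))

def fmul(a, b):
    # Standalone multiply: product rounded in the FMUL's own round unit.
    return float(multiplier_array(a, b))

def fadd(a, b):
    # Standalone add: handled entirely by the FADD's add/round stage.
    return add_round(a, b)

def fma(a, b, c):
    # Bridge path: the unrounded product bypasses the FMUL rounder and is
    # combined with the addend in the shared add/round stage. Only one
    # rounding occurs, exactly as in the fused hardware path.
    return add_round(multiplier_array(a, b), c)
```

The composition makes the block reuse explicit: fma is built entirely from the two shared components, with no intermediate rounding between them, while fmul and fadd keep their standalone behavior.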

Fig. 4. FMUL Unit.

Fig. 5. Dual-path FADD unit.

B. Bridge FMUL Unit

The bridge FMA architecture uses the FMUL unit to process both standalone multiplications as well as create the partial product for an FMA instruction. The FMUL itself is designed to be the exact implementation of a common FMUL unit with only the addition of an output bus passing data to the bridge.
As shown in Fig. 4, the double-precision FMUL unit takes two 64-bit operands as inputs. The significands are processed in a 53 × 53-bit multiplier, while the exponent and sign bits are processed in parallel. For any FMUL instruction, the multiplier array forwards the 106-bit sum and carry results to a rounding unit designed from a combination of several common multiplication rounding schemes [21]–[23].
When the required operation is an FMA instruction, the unit begins execution in the same way as an FMUL instruction. However, when the partial product tree produces a product in sum/carry format, it is passed to the bridge while the round element is shut down via clock-gating.

C. The Bridge FADD Unit

The bridge FMA architecture uses a common Farmwald [24] dual-path FADD design to execute standalone addition instructions. As shown in Fig. 5, the addition unit uses a far and close path to handle the two classical FADD cases. The far path, shown on the left side of Fig. 5, is used to process input significands for either an addition or a subtraction if their exponents differ by more than 1. For this path, the significands of both inputs are passed to a swap multiplexer that awaits the results of a comparison of the exponents. When the larger significand is detected, it is anchored and the smaller significand is aligned until the exponents match.
For cases of addition and subtraction where the exponents are equal or differ by ±1, the input data are processed in the addition unit close path that is shown on the right side of Fig. 5. The close path pre-shifts both input significands by one and inputs shifted and non-shifted operands to a swap multiplexer. Meanwhile, a comparator is used to determine the larger significand in the case of no exponent difference, while three leading-one predictors (LOP) operate in parallel on each possible exponent difference case.
The exponent prediction logic and significand comparator drive the select lines on several sets of swap multiplexers. The resulting LOP selection enters a 53-bit priority encoder and is reduced to a 5-bit normalization control. Both the larger and smaller significands in the close path are normalized by up to 54 bits, and the stage is complete.
The larger and smaller operands from both the far and close path exit the block and are forwarded to the FMA/FADD add/round stage for path merging, rounding, and instruction completion.

Fig. 6. FMA Bridge Unit.

D. Bridge FMA Unit

The bridge unit is shown in Fig. 6. The unit is essentially the classic FMA architecture described in Section II without the multiplier array, rounding, or post-normalization block. Instead, the bridge unit accepts the product from the floating-point multiplier and combines it with a

TABLE I
BRIDGE FMA RESULTS

pre-aligned 161-bit addend taken from an operand originally provided to the FADD. The unit then proceeds with steps 2–6 of the FMA algorithm described in Section II.

E. Bridge Add/Round Unit

The addition and rounding unit is designed to perform several roles. When a standalone FADD instruction is required, the add/round unit acts as a common FADD dual-path merge stage, selecting between the far- and close-path operands for inputs to the addition and rounding units. For FMA instructions, the same multiplexer used for a merge in the FADD path selects the FMA unit unrounded result. The second operand input to the addition and round units is passed a null string, as another operand is not needed for the FMA rounding completion.
The add/round unit is shown in Fig. 7. It uses a combined add/round scheme drawing on several schemes seen in [25], [26]. The two selected input operands are passed to dual 59-bit adders, producing a result and a result plus 2 (or plus 1 for subtraction). Providing these arithmetic results, as explained in the literature [25], allows for an easy LSB fix-up, shift, post-alignment, and final result selection. The controls for the shifts come from the overflow bits of the adders, and the rounding selections are decided by combinational rounding logic.

Fig. 7. Bridge add/round unit.

IV. RESULTS

Each unit's Verilog code was implemented using a standard-cell library in custom placement with wire extraction using Steiner route estimates. These libraries and models are from the AMD 65-nm silicon-on-insulator technology design set. With the results from these models and design implementations, the bridge FMA unit enhancement is compared in Table I to a custom floating-point adder, floating-point multiplier, and classic FMA unit in the fields of latency, area, and maximum observed power consumption.
To fully understand the units as a whole, each design's internal pipeline latches were removed so that a total block latency measurement could be obtained. The total latencies of each block do not directly correlate to their operating frequency, as each design may be pipelined according to a specific coprocessor's requirements.
In terms of power measurements, floating-point power signatures vary wildly according to the application and input vectors provided. The reported power results in this paper attempt to provide a general comparison by measuring the maximum power observed for any input vector combination. Each block was held at a predetermined and set frequency while given a long series of randomly generated floating-point vectors. The combination of vectors that caused the most switching in a given block was identified, the peak current draw extracted, and the maximum power reported. Peak power vector combinations for each block are mutually exclusive.
As shown in Table I, the bridge architecture is 30% to 70% faster and 50% to 70% lower in peak power consumption than a classic FMA unit when executing FADD or FMUL instructions. The bridge unit also may execute both an FADD and an FMUL instruction in parallel, which is not possible on a classic FMA unit. Additionally, when compared to the combination of an FADD and an FMUL, the bridge FMA unit shows about a 12% performance gain for FMA instructions.
The cost of the added FMA functionality is about 40% more area than an execution block with an FADD and FMUL unit, as well as a 65% increase in FMA instruction peak power due to shared hardware not necessarily intended for the FMA instruction itself. The bridge FMA results also show higher peak power and lower performance gain on FMA instructions as compared to a classic FMA unit.

V. CONCLUSION

We present a new architecture for the design and implementation of the FMA instruction. The bridge fused multiply-add unit adds FMA functionality to existing floating-point coprocessor units by including an FMA hardware "bridge" between an existing FADD and FMUL unit. This added functionality comes at the cost of additional hardware and power consumption compared to a common FADD/FMUL execution block, as well as a slightly degraded FMA instruction performance compared to a classic FMA unit.
The bridge FMA unit adds the FMA functionality to existing designs without degrading the performance or parallel execution of FADD and FMUL single instructions, providing an IEEE-754R FMA solution that does not require a complete overhaul of current coprocessing systems.

REFERENCES

[1] R. K. Montoye, E. Hokenek, and S. L. Runyon, "Design of the IBM RISC System/6000 floating-point execution unit," IBM J. Res. Development, vol. 34, pp. 59–70, 1990.
[2] E. Hokenek, R. Montoye, and P. W. Cook, "Second-generation RISC floating point with multiply-add fused," IEEE J. Solid-State Circuits, vol. 25, no. 5, pp. 1207–1213, Oct. 1990.

[3] C. Hinds, "An enhanced floating point coprocessor for embedded signal processing and graphics applications," in Proc. 33rd Asilomar Conf. Signals, Syst., Comput., 1999, pp. 147–151.
[4] Y. Voronenko and M. Puschel, "Automatic generation of implementations for DSP transforms on fused multiply-add architectures," in Proc. Int. Conf. Acoust., Speech Signal Process., 2004, pp. V-101–V-104.
[5] E. N. Linzer, "Implementation of efficient FFT algorithms on fused multiply-add architectures," IEEE Trans. Signal Process., vol. 41, no. 1, pp. 93–107, Jan. 1993.
[6] A. D. Robison, "N-Bit unsigned division via N-Bit multiply-add," in Proc. 17th IEEE Symp. Comput. Arithmetic, 2005, pp. 131–139.
[7] R.-C. Li, S. Boldo, and M. Daumas, "Theorems on efficient argument reductions," in Proc. 16th IEEE Symp. Comput. Arithmetic, 2003, pp. 129–136.
[8] F. P. O'Connell and S. W. White, "POWER3: The next generation of PowerPC processors," IBM J. Res. Development, vol. 44, pp. 873–884, 2000.
[9] R. Jessani and C. Olson, "The floating-point unit of the PowerPC 603e," IBM J. Res. Development, vol. 40, pp. 559–566, 1996.
[10] R. M. Jessani and M. Putrino, "Comparison of single- and dual-pass multiply-add fused floating-point units," IEEE Trans. Comput., vol. 47, no. 9, pp. 927–937, Sep. 1998.
[11] A. Kumar, "The HP PA-8000 RISC CPU," IEEE Micro Mag., vol. 17, no. 2, pp. 27–32, Mar./Apr. 1997.
[12] D. Hunt, "Advanced performance features of the 64-bit PA-8000," in Proc. COMPCON, 1995, pp. 123–128.
[13] K. C. Yeager, "The MIPS R10000 superscalar microprocessor," IEEE Micro Mag., vol. 16, no. 2, pp. 28–40, Mar. 1996.
[14] B. Greer, J. Harrison, G. Henry, W. Li, and P. Tang, "Scientific computing on the itanium processor," in Proc. ACM/IEEE SC Conf., Nov. 2001, pp. 1–1.
[15] H. Sharangpani and K. Arora, "Itanium processor microarchitecture," IEEE Micro Mag., vol. 20, no. 5, pp. 24–43, May 2000.
[16] DRAFT Standard for Floating-Point Arithmetic, IEEE Std. P754/D1.4.0, Apr. 2007.
[17] Advanced Micro Devices, "128-bit SSE5 instruction set," Pub. No. 43479, Rev. 3.01, Aug. 2007.
[18] H. Sun and M. Gao, "A novel architecture for floating-point multiply-add-fused operation," in Proc. 4th Int. Conf. Inf., Commun. Signal Process. 4th Pacific Rim Conf. Multimedia, Dec. 2003, vol. 3, pp. 1675–1679.
[19] T. Lang and J. D. Bruguera, "Floating-point fused multiply-add: Reduced latency for floating-point addition," in Proc. 17th IEEE Symp. Comput. Arithmetic, Jun. 2005, pp. 42–51.
[20] IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE Standard 754-1985, reaffirmed Dec. 6, 1990.
[21] G. Even and P. M. Seidel, "A comparison of three rounding algorithms for IEEE floating-point multiplication," IEEE Trans. Comput., vol. 49, no. 7, pp. 638–650, Jul. 2000.
[22] R. K. Yu and G. B. Zyner, "167 MHz Radix-4 floating-point multiplier," in Proc. 12th Symp. Comput. Arithmetic, 1995, pp. 149–154.
[23] N. Quach, N. Takagi, and M. Flynn, "On fast IEEE rounding," Stanford Univ., Stanford, CA, Tech. Rep. CSL-TR-91-459, Jan. 1991.
[24] M. P. Farmwald, "On the design of high performance digital arithmetic units," Ph.D. dissertation, Dept. Comput. Sci., Stanford Univ., Stanford, CA, 1981.
[25] N. Quach and M. J. Flynn, "An improved algorithm for high-speed floating point addition," Comput. Syst. Lab., Stanford Univ., Stanford, CA, Tech. Rep. CSL-TR-90-442, Aug. 1990.
[26] A. Naini, A. Dhablania, W. James, and D. Das Sarma, "1 GHz HAL Sparc64 dual floating point unit with RAS features," in Proc. 15th Symp. Comput. Arithmetic, 2001, pp. 173–183.
