
1726 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 16, NO. 12, DECEMBER 2008

Bridge Floating-Point Fused Multiply-Add Design

Eric Quinnell, Earl E. Swartzlander Jr., and Carl Lemonds

Abstract—A new floating-point fused multiply-add (FMA) design for the execution of (A × B) + C as a single instruction is presented. The bridge fused multiply-add unit is a design intended to add FMA functionality to existing floating-point coprocessor units by including specialized hardware that reuses floating-point adder and floating-point multiplier components. The bridge unit adds this functionality without requiring an overhaul of coprocessor control units and without degrading the performance or parallel execution of addition and multiplication single instructions. To evaluate the performance, area, and power costs of adding a bridge FMA unit to common floating-point execution blocks, several circuits including a double-precision floating-point adder, floating-point multiplier, classic FMA, and a bridge FMA unit have been designed and implemented with AMD 65-nm silicon-on-insulator technology to provide a realistic and fair analysis of the presented FMA hardware tradeoffs.

Index Terms—Computer arithmetic, floating point arithmetic, fused multiply-add, IEEE-754 standard.

Manuscript received July 20, 2007; revised December 18, 2007. Current version published November 19, 2008.
E. Quinnell was with The University of Texas at Austin, Austin, TX 78712 USA. He is now with the mobile division of AMD, Austin, TX 78752 USA (e-mail: [email protected]).
E. E. Swartzlander, Jr., is with the Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78712 USA (e-mail: [email protected]).
C. Lemonds is with the Mobile Division of AMD, Austin, TX 78752 USA (e-mail: [email protected]).
Digital Object Identifier 10.1109/TVLSI.2008.2001944

I. INTRODUCTION

In 1990, the fused multiply-add (FMA) unit was introduced on the IBM RS/6000 for the single instruction execution of the equation (A × B) + C [1], [2]. This hardware unit was designed to reduce the latency of dot-product calculations. Additionally, the FMA unit provides more precise results since only a single rounding is performed after both the multiplication and addition are executed at full precision.

Since 1990, many algorithms that utilize the (A × B) + C single-instruction equation have been introduced, for applications in digital signal and graphics processing [3], [4], fast Fourier transforms (FFTs) [5], finite-impulse response (FIR) filters [3], division [6], argument reduction [7], etc. Several industrial-level chips have been designed and implemented with embedded fused multiply-add units. These chips include designs by IBM [1], [8]–[10], HP [11], [12], MIPS [13], ARM (MACC/un-fused) [3], and Intel [14], [15]. Some chips entirely replace the floating-point adder (FADD) and floating-point multiplier (FMUL) with an FMA unit by using constants to perform single floating-point operations, e.g., (A × B) + 0.0 for single multiplies and (A × 1.0) + C for single adds.

This combination of industrial implementation and increased algorithmic activity has pushed the IEEE-754R committee to consider including the FMA instruction in the IEEE standard for floating-point arithmetic [16]. Most recently, AMD introduced the SSE5 extensions to the x86 instruction set [17]; SSE5 extensions are centered on the inclusion of the FMA instruction and its derivatives into modern x86 computing processors.

However, the greatest advantage of the modern FMA unit is also the greatest argument against its use. FMA units remove the need for any single multiplication or add instruction and improve performance by combining the two. As described already, any calls to standalone additions or multiplications require the insertion of a constant into the unit to emulate the desired arithmetic result. Due to the increased complexity of an FMA unit, such standalone instructions are subject to greater latencies than if they were processed in a single-function arithmetic unit (i.e., an FADD or FMUL unit). Additionally, the introduction of a three-operand execution unit requires special scheduling, retire queue, register, and control hardware as well as the sacrifice of executing FADD and FMUL instructions in parallel. These custom hardware requirements are seen in the literature [1], [9].

A few possible solutions have already been identified. FMA designs have been proposed [18], [19] that use a FADD dual-path scheme within a reduced-latency FMA unit to allow FADD instructions to bypass the multiplier tree, approaching—but not yet reaching—the latency of a common FADD unit instruction. Though such proposals have been made, they have yet to be implemented. Additionally, no study has yet been presented that identifies the relative costs of creating a floating-point unit capable of performing all three basic floating-point arithmetic instructions (FADD, FMUL, and FMA) in hardware.

This paper presents a new architecture that adds FMA functionality by including hardware between an FADD and FMUL unit, creating a "bridge" that connects the two. The architecture is designed to reuse as many components as possible from both the FADD and FMUL to minimize the area and the power consumption of the enhancement. This paper is intended to provide an identification of the implementation costs in an execution unit capable of pure FADD, FMUL, and FMA instructions completely processed by hardware.

To provide details on the design costs of the bridge FMA enhancement, this paper presents the results of several custom circuit implementations, including an FADD, FMUL, classic FMA, and the new bridge FMA. All circuits have been designed and implemented with AMD 65-nm silicon-on-insulator technology to provide a realistic and fair comparison of the performance and cost tradeoffs seen between the bridge architecture and standard floating-point execution units.

II. CLASSIC FMA ARCHITECTURE

The FMA architectures implemented in industry are all based on the original serial design of the IBM RS/6000 [1], [2]. There are localized improvements and varying bit-width changes from one design to another, but the basic microarchitecture has remained constant. To describe this traditional architecture, this section touches briefly on the IEEE double-precision floating-point format used in the following FMA descriptions as well as an introduction to the structure of the traditional FMA architecture (denoted the "classic" FMA unit in this paper).

A. Double-Precision Floating-Point Format

All floating-point units described in this paper are designed for computation using the double-precision IEEE-754 standard format [20]. Specifically, we selected double-precision to both simplify the comparisons between architectures and to follow the same double-precision format presented in most FMA literature.

The IEEE-754 double-precision format consists of a 64-bit vector split into three sections: a sign bit, biased exponent bits, and a significand. As seen in Fig. 1, the sign bit is a single bit stored at the most significant bit (MSB) of the packed number to represent a positive or negative floating-point number. Following are 11 bits of excess-1023 biased exponent and finally a 52-bit fraction.

To represent any given floating-point number in a decimal format (i.e., a stored number given here as floating-point number "A"), the three fields are combined and interpreted as the following:

A = (−1)^(sign_A) × 1.fraction_A × 2^(exp_A − bias).    (1)

Fig. 1. IEEE-754 double-precision format [20].

where sign_A is the sign bit, fraction_A is the 52-bit significand fraction following a leading implicit "1" bit, and exp_A − bias is the 11-bit exponent minus the double-precision bias value 1023. Further specifics on the double-precision format may be found in the IEEE-754 standard [20].
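As an illustration of this encoding, the field extraction of Fig. 1 and the mapping of equation (1) can be sketched in a few lines of Python. This is a behavioral sketch only; the helper names are ours, not part of the paper's hardware, and it handles normal numbers only (no zeros, subnormals, infinities, or NaNs).

```python
import struct

def decode(x):
    # Unpack a packed double into its three IEEE-754 fields (Fig. 1 layout).
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    sign = bits >> 63                  # 1-bit sign at the MSB
    exp = (bits >> 52) & 0x7FF         # 11-bit excess-1023 biased exponent
    frac = bits & ((1 << 52) - 1)      # 52-bit fraction
    return sign, exp, frac

def value(sign, exp, frac, bias=1023):
    # Equation (1) for normal numbers: (-1)^sign * 1.fraction * 2^(exp - bias),
    # with the implicit leading "1" bit restored in front of the fraction.
    return (-1) ** sign * (1 + frac / 2 ** 52) * 2 ** (exp - bias)
```

For example, decode(-6.5) yields sign 1, biased exponent 1025, and a fraction field of 0.625 × 2^52, which equation (1) maps back to −6.5.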

B. Classic FMA Architecture


The classic FMA architecture is shown in Fig. 2. There are several serial steps to complete the execution of an FMA instruction in double-precision format on the traditional architecture.
1) Multiply the 53-bit input significands A × B in the multiply array to produce a pair of 106-bit numbers in the carry-save format. Align the addend C to any point in the range of the 106-bit product, including 55 bits above the product (53-bit operand length and a double-carry buffer of 2 bits) based on exponent difference. This requires a 161-bit aligner.
2) Combine the lower 106 bits of the addend and the product with a 3:2 carry-save adder (CSA) to produce a 161-bit string of data in carry-save format.
3) Input the 161-bit string into a 161-bit adder, or—more efficiently—a 109-bit adder followed by a 52-bit incrementer.
4) Input the lower 109 bits into a 109-bit leading-zero anticipator (LZA) that processes in parallel to the adder for the case of massive cancellation and addends smaller than the product.
5) Complement the sum if required. This step may be performed a number of ways, including incrementers, end-around-carry (EAC) adders, or massive parallelism.
6) Normalize the 161-bit result based on the output of the LZA, the exponent difference, and—in some designs—a zero detect on the output of the top 52-bit incrementer.
7) Round the result and post-normalize if necessary. This requires a 52-bit adder or incrementer plus control logic, depending on data-type support.

Fig. 2. Classic FMA architecture.
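The steps above keep the product exact until a single terminal rounding in step 7, which is the source of the FMA's extra precision. A small Python comparison (our own reference model, using exact rational arithmetic in place of the carry-save hardware above) shows why that single rounding matters:

```python
from fractions import Fraction

def fma_ref(a, b, c):
    # Exact product and full-width sum (Fraction is exact), then ONE
    # rounding to double, mirroring the fused datapath described above.
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def unfused(a, b, c):
    # Separate FMUL then FADD: the product is rounded before the add.
    return a * b + c

a = b = 1.0 + 2.0 ** -52
c = -(1.0 + 2.0 ** -51)
# The exact product is 1 + 2^-51 + 2^-104; rounding it to a double drops
# the 2^-104 term, so the unfused result cancels to exactly zero while
# the fused result preserves the residual.
```

Here unfused(a, b, c) returns 0.0, while fma_ref(a, b, c) returns the residual 2^−104.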

III. PROPOSED BRIDGE FMA ARCHITECTURE


The bridge fused multiply-add unit is intended to add FMA func-
tionality to existing floating-point coprocessor units by including spe-
cialized hardware that reuses FADD and FMUL components. Addi-
tionally, the bridge architecture is intended to provide a realistic study
of the implementation costs involved when adding the FMA feature to
popular FADD and FMUL coprocessor units. This section provides the
design details of the elements required to implement the bridge FMA
architecture.
A. Bridge FMA Architecture

Fig. 3 shows a high-level block diagram of the bridge FMA architecture. The design begins with common FMUL and FADD units capable of independent and parallel execution. Several blocks are added between the two execution units, creating a "bridge" capable of carrying data from the FMUL unit to the FADD unit to efficiently perform an FMA instruction.
The bridge FMA architecture does not require a fully independent FMA hardware implementation. Pieces from both the FMUL and FADD units are modified and reused for dual functionality. Specifically, the FADD add/round stage is used for both additions and FMAs, while the FMUL reuses the single largest component block of any arithmetic unit, the multiplier array. The remaining hardware requirements for a complete FMA instruction are implemented in the bridge unit itself, which is only powered on via clock-gating during an FMA instruction.

Fig. 3. Bridge FMA unit block diagram.
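The component reuse described above can be mirrored in a toy behavioral model. This is our own sketch, not the paper's hardware: the function names are illustrative, and exact rationals stand in for the 106-bit sum/carry pair and the add/round datapath.

```python
from fractions import Fraction

def multiplier_array(a, b):
    # Shared FMUL block: the exact, unrounded product (modeling the
    # multiplier array's 106-bit sum/carry result as an exact rational).
    return Fraction(a) * Fraction(b)

def add_round(x, y):
    # Shared FADD add/round stage: exact sum, then one rounding to double.
    return float(Fraction(x) + Fraction(y))

def fmul(a, b):
    # Standalone multiply: product rounded in the FMUL's own round unit.
    return float(multiplier_array(a, b))

def fadd(a, b):
    # Standalone add: handled entirely by the FADD's add/round stage.
    return add_round(a, b)

def fma(a, b, c):
    # Bridge path: the unrounded product bypasses the FMUL rounder and is
    # combined with the addend in the shared add/round stage. Only one
    # rounding occurs, exactly as in the fused hardware path.
    return add_round(multiplier_array(a, b), c)
```

The composition makes the block reuse explicit: fma is built entirely from the two shared components, with no intermediate rounding between them, while fmul and fadd keep their standalone behavior.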

Fig. 4. FMUL Unit.

Fig. 5. Dual-path FADD unit.

B. Bridge FMUL Unit

The bridge FMA architecture uses the FMUL unit to process both standalone multiplications as well as create the partial product for an FMA instruction. The FMUL itself is designed to be the exact implementation of a common FMUL unit with only the addition of an output bus passing data to the bridge.
As shown in Fig. 4, the double-precision FMUL unit takes two 64-bit operands as inputs. The significands are processed in a 53 × 53-bit multiplier, while the exponent and sign bits are processed in parallel. For any FMUL instruction, the multiplier array forwards the 106-bit sum and carry results to a rounding unit designed from a combination of several common multiplication rounding schemes [21]–[23].
When the required operation is an FMA instruction, the unit begins execution in the same way as an FMUL instruction. However, when the partial product tree produces a product in sum/carry format, it is passed to the bridge while the round element is shut down via clock-gating.

C. The Bridge FADD Unit

The bridge FMA architecture uses a common Farmwald [24] dual-path FADD design to execute standalone addition instructions. As shown in Fig. 5, the addition unit uses a far and close path to handle the two classical FADD cases. The far path, shown on the left side of Fig. 5, is used to process input significands for either an addition or a subtraction if their exponents differ by more than 1. For this path, the significands of both inputs are passed to a swap multiplexer that awaits the results of a comparison of the exponents. When the larger significand is detected, it is anchored and the smaller significand is aligned until the exponents match.
For cases of addition and subtraction where the exponents are equal or differ by ±1, the input data are processed in the addition unit close path that is shown on the right side of Fig. 5. The close path pre-shifts both input significands by one and inputs shifted and non-shifted operands to a swap multiplexer. Meanwhile, a comparator is used to determine the larger significand in the case of no exponent difference, while three leading-one predictors (LOP) operate in parallel on each possible exponent difference case.
The exponent prediction logic and significand comparator drive the select lines on several sets of swap multiplexers. The resulting LOP selection enters a 53-bit priority encoder and is reduced to a 5-bit normalization control. Both the larger and smaller significands in the close path are normalized by up to 54 bits, and the stage is complete.
The larger and smaller operands from both the far and close path exit the block and are forwarded to the FMA/FADD add/round stage for path merging, rounding, and instruction completion.

Fig. 6. FMA Bridge Unit.

D. Bridge FMA Unit

The bridge unit is shown in Fig. 6. The unit is essentially the classic FMA architecture described in Section II without the multiplier array, rounding, or post-normalization block. Instead, the bridge unit accepts the product from the floating-point multiplier and combines it with a

TABLE I
BRIDGE FMA RESULTS

pre-aligned 161-bit addend taken from an operand originally provided to the FADD. The unit then proceeds with steps 2–6 of the FMA algorithm described in Section II.

E. Bridge Add/Round Unit

The addition and rounding unit is designed to perform several roles. When a standalone FADD instruction is required, the add/round unit acts as a common FADD dual-path merge stage, selecting between the far- and close-path operands for inputs to the addition and rounding units. For FMA instructions, the same multiplexer used for a merge in the FADD path selects the FMA unit unrounded result. The second operand input to the addition and round units is passed a null string, as another operand is not needed for the FMA rounding completion.
The add/round unit is shown in Fig. 7. It uses a combined add/round scheme drawing on several schemes seen in [25], [26]. The two selected input operands are passed to dual 59-bit adders, producing a result and a result plus 2 (or plus 1 for subtraction). Providing these arithmetic results, as explained in the literature [25], allows for an easy LSB fix-up, shift, post-alignment, and final result selection. The controls for the shifts come from the overflow bits of the adders, and the rounding selections are decided by combinational rounding logic.

Fig. 7. Bridge add/round unit.

IV. RESULTS

Each unit's Verilog code was implemented using a standard-cell library in custom placement with wire extraction using Steiner route estimates. These libraries and models are from the AMD 65-nm silicon-on-insulator technology design set. With the results from these models and design implementations, the bridge FMA unit enhancement is compared in Table I to a custom floating-point adder, floating-point multiplier, and classic FMA unit in the fields of latency, area, and maximum observed power consumption.
To fully understand the units as a whole, each design's internal pipeline latches were removed so that a total block latency measurement could be obtained. The total latencies of each block do not directly correlate to their operating frequency, as each design may be pipelined according to a specific coprocessor's requirements.
In terms of power measurements, floating-point power signatures vary wildly according to the application and input vectors provided. The reported power results in this paper attempt to provide a general comparison by measuring the maximum power observed for any input vector combination. Each block was held at a predetermined and set frequency while given a long series of randomly generated floating-point vectors. The combination of vectors that caused the most switching in a given block was identified, the peak current draw extracted, and the maximum power reported. Peak power vector combinations for each block are mutually exclusive.
As shown in Table I, the bridge architecture is 30% to 70% faster and 50% to 70% lower in peak power consumption than a classic FMA unit when executing FADD or FMUL instructions. The bridge unit also may execute both an FADD and an FMUL instruction in parallel, which is not possible on a classic FMA unit. Additionally, when compared to the combination of an FADD and an FMUL, the bridge FMA unit shows about a 12% performance gain for FMA instructions.
The cost of the added FMA functionality is about 40% more area than an execution block with an FADD and FMUL unit, as well as a 65% increase in FMA instruction peak power due to shared hardware not necessarily intended for the FMA instruction itself. The bridge FMA results also show higher peak power and lower performance gain on FMA instructions as compared to a classic FMA unit.

V. CONCLUSION

We present a new architecture for the design and implementation of the FMA instruction. The bridge fused multiply-add unit adds FMA functionality to existing floating-point coprocessor units by including an FMA hardware "bridge" between an existing FADD and FMUL unit. This added functionality comes at the cost of additional hardware and power consumption compared to a common FADD/FMUL execution block, as well as a slightly degraded FMA instruction performance compared to a classic FMA unit.
The bridge FMA unit adds the FMA functionality to existing designs without degrading the performance or parallel execution of FADD and FMUL single instructions, providing an IEEE-754R FMA solution that does not require a complete overhaul of current coprocessing systems.

REFERENCES

[1] R. K. Montoye, E. Hokenek, and S. L. Runyon, "Design of the IBM RISC System/6000 floating-point execution unit," IBM J. Res. Development, vol. 34, pp. 59–70, 1990.
[2] E. Hokenek, R. Montoye, and P. W. Cook, "Second-generation RISC floating point with multiply-add fused," IEEE J. Solid-State Circuits, vol. 25, no. 5, pp. 1207–1213, Oct. 1990.

[3] C. Hinds, "An enhanced floating point coprocessor for embedded signal processing and graphics applications," in Proc. 33rd Asilomar Conf. Signals, Syst., Comput., 1999, pp. 147–151.
[4] Y. Voronenko and M. Puschel, "Automatic generation of implementations for DSP transforms on fused multiply-add architectures," in Proc. Int. Conf. Acoust., Speech Signal Process., 2004, pp. V-101–V-104.
[5] E. N. Linzer, "Implementation of efficient FFT algorithms on fused multiply-add architectures," IEEE Trans. Signal Process., vol. 41, no. 1, pp. 93–107, Jan. 1993.
[6] A. D. Robison, "N-Bit unsigned division via N-Bit multiply-add," in Proc. 17th IEEE Symp. Comput. Arithmetic, 2005, pp. 131–139.
[7] R.-C. Li, S. Boldo, and M. Daumas, "Theorems on efficient argument reductions," in Proc. 16th IEEE Symp. Comput. Arithmetic, 2003, pp. 129–136.
[8] F. P. O'Connell and S. W. White, "POWER3: The next generation of PowerPC processors," IBM J. Res. Development, vol. 44, pp. 873–884, 2000.
[9] R. Jessani and C. Olson, "The floating-point unit of the PowerPC 603e," IBM J. Res. Development, vol. 40, pp. 559–566, 1996.
[10] R. M. Jessani and M. Putrino, "Comparison of single- and dual-pass multiply-add fused floating-point units," IEEE Trans. Comput., vol. 47, no. 9, pp. 927–937, Sep. 1998.
[11] A. Kumar, "The HP PA-8000 RISC CPU," IEEE Micro Mag., vol. 17, no. 2, pp. 27–32, Mar./Apr. 1997.
[12] D. Hunt, "Advanced performance features of the 64-bit PA-8000," in Proc. COMPCON, 1995, pp. 123–128.
[13] K. C. Yeager, "The MIPS R10000 superscalar microprocessor," IEEE Micro Mag., vol. 16, no. 2, pp. 28–40, Mar. 1996.
[14] B. Greer, J. Harrison, G. Henry, W. Li, and P. Tang, "Scientific computing on the itanium processor," in Proc. ACM/IEEE SC Conf., Nov. 2001, pp. 1–1.
[15] H. Sharangpani and K. Arora, "Itanium processor microarchitecture," IEEE Micro Mag., vol. 20, no. 5, pp. 24–43, May 2000.
[16] DRAFT Standard for Floating-Point Arithmetic, IEEE Std. P754/D1.4.0, Apr. 2007.
[17] Advanced Micro Devices, "128-bit SSE5 instruction set," Pub. No. 43479, Rev. 3.01, Aug. 2007.
[18] H. Sun and M. Gao, "A novel architecture for floating-point multiply-add-fused operation," in Proc. 4th Int. Conf. Inf., Commun. Signal Process. 4th Pacific Rim Conf. Multimedia, Dec. 2003, vol. 3, pp. 1675–1679.
[19] T. Lang and J. D. Bruguera, "Floating-point fused multiply-add: Reduced latency for floating-point addition," in Proc. 17th IEEE Symp. Comput. Arithmetic, Jun. 2005, pp. 42–51.
[20] IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE Standard 754-1985, reaffirmed Dec. 6, 1990.
[21] G. Even and P. M. Seidel, "A comparison of three rounding algorithms for IEEE floating-point multiplication," IEEE Trans. Comput., vol. 49, no. 7, pp. 638–650, Jul. 2000.
[22] R. K. Yu and G. B. Zyner, "167 MHz Radix-4 floating-point multiplier," in Proc. 12th Symp. Comput. Arithmetic, 1995, pp. 149–154.
[23] N. Quach, N. Takagi, and M. Flynn, "On fast IEEE rounding," Stanford Univ., Stanford, CA, Tech. Rep. CSL-TR-91-459, Jan. 1991.
[24] M. P. Farmwald, "On the design of high performance digital arithmetic units," Ph.D. dissertation, Dept. Comput. Sci., Stanford Univ., Stanford, CA, 1981.
[25] N. Quach and M. J. Flynn, "An improved algorithm for high-speed floating point addition," Comput. Syst. Lab., Stanford Univ., Stanford, CA, Tech. Rep. CSL-TR-90-442, Aug. 1990.
[26] A. Naini, A. Dhablania, W. James, and D. Das Sarma, "1 GHz HAL Sparc64 dual floating point unit with RAS features," in Proc. 15th Symp. Comput. Arithmetic, 2001, pp. 173–183.
