
High Performance FPGA based Floating Point Arithmetics

Project report for Computer Arithmetic Algorithms

Andreas Ehliar and Per Karlström


{ehliar,perk}@isy.liu.se

June 13, 2006

1 Introduction
We decided to investigate what floating point arithmetic performance it is possible to achieve in a modern FPGA. In order to gain a thorough understanding of the issues involved we decided to implement a fast FPU ourselves. We expect that an expert in the field could come up with a better solution, especially given the limited amount of time available for this project. However, a search on the Internet did not turn up any references to high performance FPUs on Virtex-4 FPGAs.
The Virtex-4 uses a relatively standard FPGA architecture with CLBs consisting of four slices, each containing two 4-LUTs and two flip-flops. The FPGA also has a large number of embedded memories and DSP blocks containing high speed multipliers and adders. In addition, the Virtex-4 contains a number of specialized components which were not used in this project. For further details about the Virtex-4 FPGA, see the Virtex-4 User Guide [2]. The DSP blocks are thoroughly described in the XtremeDSP User Guide [3].
In order to test the FPU in a realistic environment we decided to implement a complex radix-2 butterfly kernel with the FPU adder and multiplier we were going to implement. Such a kernel can be used to implement, for example, higher radix FFTs.
We selected a simple floating point format with no denormalized numbers and neither NaN nor Inf.

2 Methodology
To verify the final result we implemented a C++ class for floating point numbers. The number of bits in the mantissa and exponent could be configured from 1 to 31 bits. The C++ model was used to generate the test vectors for the RTL test benches.
An initial RTL model was then developed and tested against the floating point test data. The RTL model was written with the hardware in mind, but it was not optimized for the Virtex-4 FPGA.
The performance of the initial RTL model was evaluated and the most critical part of the design was optimized to better fit the FPGA. After each optimization, the model was verified with the test benches. This was repeated until the performance was satisfactory.
Finally, the design was tested in an FPGA by downloading test data to the FPGA and uploading the results from the butterfly calculation for verification against test patterns generated by the C++ model.

3 Floating point format


The first version of the RTL code was fairly configurable with regard to mantissa and exponent sizes. In order to ease the development of an optimized FPGA implementation, we decided to limit the floating point format to a maximum of one sign bit, 10 bits of exponent, and 15 bits of mantissa with an implicit one. The mantissa is represented using regular unsigned binary numbers. The exponent is encoded using excess-511.
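As an illustration, packing and unpacking this format in software might look as follows. This is our own sketch, not the project's C++ model: the field widths and excess-511 bias come from the text, while the helper names and the truncating conversion are our assumptions.

```cpp
#include <cmath>
#include <cstdint>

// Sketch of the report's format: 1 sign bit, 10-bit exponent in
// excess-511, 15-bit mantissa with an implicit leading one.
// An all-zero word encodes zero (no denormals, NaN, or Inf).
constexpr int EXP_BITS  = 10;
constexpr int MANT_BITS = 15;
constexpr int BIAS      = 511;   // excess-511 exponent encoding

uint32_t fp_pack(double x) {
    if (x == 0.0) return 0;
    uint32_t sign = (x < 0.0) ? 1u : 0u;
    int e;
    double frac = std::frexp(std::fabs(x), &e);   // |x| = frac * 2^e, frac in [0.5, 1)
    double m = frac * 2.0;                        // significand in [1, 2)
    uint32_t exp_field = (uint32_t)((e - 1) + BIAS);
    // Truncate the fractional mantissa bits (a simplification of ours).
    uint32_t mant = (uint32_t)((m - 1.0) * (1 << MANT_BITS));
    return (sign << (EXP_BITS + MANT_BITS)) | (exp_field << MANT_BITS) | mant;
}

double fp_unpack(uint32_t w) {
    if (w == 0) return 0.0;
    uint32_t sign = w >> (EXP_BITS + MANT_BITS);
    int exp_field = (int)((w >> MANT_BITS) & ((1u << EXP_BITS) - 1));
    uint32_t mant = w & ((1u << MANT_BITS) - 1);
    double m = 1.0 + (double)mant / (1 << MANT_BITS);  // restore implicit one
    double mag = std::ldexp(m, exp_field - BIAS);
    return sign ? -mag : mag;
}
```

With this layout, 1.0 packs to a zero mantissa with exponent field 511, matching the excess-511 encoding above.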

4 Multiplier
The multiplier is quite simple to construct due to the large number of multiplier blocks available in the FPGA. A single multiplier is used for the mantissa and an adder is used for the exponent. It is also necessary to normalize the result of the multiplication. This normalizer is very simple since, given normalized inputs to the multiplier, the most significant bit of the product can only be located at one of two bit positions. The overall architecture of the multiplier is shown in figure 1.
A simple rounding scheme was chosen where the rounding is done before the normalization. This can be implemented essentially for free in the DSP48 blocks in the FPGA. This can be contrasted with the rounding schemes used in IEEE-754, where rounding is performed after normalization, with an extra small normalization step required to check for overflow after rounding; that would not map well onto the DSP48 block. Except for the utilization of the DSP48 block, no FPGA specific optimizations were performed in the multiplier block.
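The round-before-normalize datapath can be sketched as a behavioural model. This is our own illustration under stated assumptions, not the project's RTL: 16-bit significands with the implicit one made explicit (bit 15 set, value in [1,2) scaled by 2^15), and unbiased integer exponents to keep the exponent adder trivial.

```cpp
#include <cstdint>

// Behavioural sketch of the multiplier datapath in section 4.
// sig: 16-bit significand with bit 15 = implicit one; exp: unbiased.
struct Fp { int exp; uint32_t sig; };

Fp fp_mul(Fp a, Fp b) {
    // 16x16 -> 32-bit product; for normalized inputs the leading one
    // lands in bit 31 or bit 30 (product in [1,4) scaled by 2^30).
    uint32_t prod = a.sig * b.sig;
    // Round BEFORE normalizing: one fixed constant added to the raw
    // product, which maps onto the DSP48's post-adder for free.
    prod += 1u << 14;
    int e = a.exp + b.exp;
    if (prod & (1u << 31)) {
        return { e + 1, prod >> 16 };   // product in [2,4): shift right once
    }
    return { e, prod >> 15 };           // product in [1,2): already normalized
}
```

The single fixed rounding constant is exact for the unshifted case and half an ulp off when the product needs the extra shift — precisely the simplification this scheme trades for the free DSP48 implementation.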

Figure 1: The floating point multiplier architecture (mantissa multiply followed by Round and Normalize stages)

5 Adder/Subtracter
A floating point adder is more complicated than a floating point multiplier. The basic architecture of the adder is shown in figure 2. The first step compares the operands and swaps them if necessary so that the larger number always enters the left path. This step also adds the implicit one if the input operands are non-zero. In the next step, the smaller number is aligned so that the exponents of both operands match. After this step, an addition or subtraction of the two numbers is performed. A subtraction can never produce a negative result because of the earlier compare and swap step.
The normalization step is the final step. It is implemented using two pipeline stages. The first stage looks at the mantissa in 4-bit intervals as shown in figure 3. The first module looks at the first four bits and outputs a normalized result assuming a one was found in these bits. An extra output signal, shown as gray lines in the figure, is used to signal that all four bits were zero. The second module assumes that the first four bits were all zero and instead looks at the following four bits, outputting a normalized result. This is repeated for the remaining bits of the mantissa. The next stage decides which of the previous results should be used. If all bits were zero, a zero is output as the result. The value needed to correct the exponent is generated according to the same scheme; this is shown as dashed lines in the figure.
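The steps above can be sketched as a behavioural model. This is our own illustration, not the project's RTL: significand layout (16 bits, bit 15 = implicit one), unbiased exponents, and helper names are all our assumptions, and the final normalization is written as a simple loop rather than the two-stage hardware scheme.

```cpp
#include <cstdint>
#include <utility>

// Behavioural sketch of the adder datapath in section 5.
struct Fp { bool neg; int exp; uint32_t sig; };  // sig: bit 15 = implicit one

Fp fp_add(Fp a, Fp b) {
    // Step 1: compare and swap so the larger operand takes the left
    // path; a subtraction then never yields a negative significand.
    if (b.exp > a.exp || (b.exp == a.exp && b.sig > a.sig)) std::swap(a, b);
    // Step 2: align the smaller significand to the larger exponent.
    int d = a.exp - b.exp;
    uint32_t small = (d < 32) ? (b.sig >> d) : 0;
    // Step 3: effective addition or subtraction.
    uint32_t sum = (a.neg == b.neg) ? a.sig + small : a.sig - small;
    if (sum == 0) return { false, 0, 0 };        // exact cancellation
    // Step 4: normalize — bring the leading one back to bit 15.
    int e = a.exp;
    if (sum >> 16) { sum >>= 1; ++e; }           // carry out of the add
    while (!(sum & (1u << 15))) { sum <<= 1; --e; }
    return { a.neg, e, sum };
}
```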

Figure 2: The overall architecture for the adder (Compare/Select, Align, Add, and Find-leading-one stages)

Figure 3: The normalizer architecture (four "ff1 in 4" modules with shifters feeding a priority decoder and a 4-to-1 mux)
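The chunked scheme of figure 3 can be sketched behaviourally as follows. This is our own software model, assuming a 16-bit unnormalized mantissa; in hardware the four chunk modules run in parallel and a priority decoder selects among them, whereas here a sequential loop stands in for that selection.

```cpp
#include <cstdint>

// Sketch of the two-stage normalizer in figure 3 for a 16-bit mantissa.
struct NormResult { uint16_t mant; int exp_adj; bool zero; };

NormResult normalize16(uint16_t m) {
    // Four "ff1 in 4" modules, each looking at one 4-bit interval.
    for (int chunk = 0; chunk < 4; ++chunk) {
        uint16_t bits = (uint16_t)((m >> (12 - 4 * chunk)) & 0xF);
        if (bits == 0) continue;   // "gray line" signal: all four bits zero
        // The leading one is in this chunk; shift it up to bit 15.
        for (int b = 0; b < 4; ++b) {
            int pos = 15 - 4 * chunk - b;
            if (m & (1u << pos)) {
                int shift = 15 - pos;     // exponent correction amount
                return { (uint16_t)(m << shift), -shift, false };
            }
        }
    }
    return { 0, 0, true };   // all bits zero: a zero is output as the result
}
```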

5.1 FPGA optimizations

Initially the adder met timing at 250 MHz, but it no longer did so once it was inserted into a complex butterfly. At this point further optimizations were required. The first FPGA specific optimization was to make sure that the adder/subtracter was implemented using only one LUT per bit. A standard adder structure compared to an adder structure supporting both addition and subtraction is shown in figure 4.
Another optimization concerned the exponent selection in the normalization step. At first, this was implemented using a 5-to-1 mux in front of an adder. By implementing a 2-to-1 mux directly in the same LUT used for the addition, a smaller 4-to-1 mux could be used in front of the adder. In order to make sure that this mux was placed near the adder, RLOC directives were used to place the components in relation to each other.
In both the exponent and mantissa muxes, the reset signal of the flip-flop was used to set the result to zero instead of embedding this logic into the LUT.
Another technique that we tried was to construct a 4-to-1 mux combined with a priority decoder, as shown in figure 5. This mux should achieve slightly better performance than an ordinary mux since there is only one level of LUTs. In a later stage of the implementation we moved the OR function to the previous pipeline stage as well.

Figure 4: A regular adder using 1 LUT/bit compared to an adder/subtracter using 1 LUT/bit

Figure 5: A priority decoder combined with a 4-to-1 mux. Functionality table:

X Y Z | O
1 x x | A
0 1 x | B
0 0 1 | C
0 0 0 | D
6 Floorplanning
In order to improve the performance of the final system we tried to locate different pipeline stages close to each other by using RLOC directives. Doing this resulted in more regularity and a smaller area footprint.

7 Results and Discussion


Knowing the FPGA architecture is important when writing efficient HDL code. A good understanding of the Virtex-4 architecture enables the designer to use the fabric in ways not (yet) supported by the synthesis tools. In some cases the gains can be substantial; in other cases the gains are more limited.
With the initial RLOC optimizations we achieved better timing results, but as soon as we tried to use RLOC over pipeline boundaries we got worse timing results. Eventually we managed to reach a 250 MHz clock frequency for the radix-2 butterfly by using RLOC. The floorplan for this implementation is shown in figure 6. At this point, however, a number of low level optimizations had been done which enabled the design to meet timing at 250 MHz even without the use of RLOC. Unfortunately, the RLOC:ed radix-4 butterfly could not be fit into the FPGA because one radix-2 butterfly was too wide, and we did not have time to correct this problem. Thus the radix-4 butterfly could only be placed without RLOC directives. The radix-4 butterfly also met timing at 250 MHz. The floorplan for the radix-4 butterfly is shown in figure 7. Table 1 lists the final resource utilization in the FPGA for various components. The radix-2 and radix-4 are complex valued butterflies, whereas the floating point adder and multiplier operate on real values.

Resource     Radix-4   Radix-2   Adder   Multiplier   Available
LUTs         10104     2514      73      372          30720
Flip Flops   14432     3660      63      325          30720
DSP48        16        4         1       0            192

Table 1: Component resource utilization

There are a number of opportunities for further optimization in this design. For example, instead of using CLBs for the shifting, a multiplier could be used for this task by sending in the number to be shifted as one operand and a bit vector with a single one in a suitable position as the other operand.
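The idea can be demonstrated arithmetically. This is our own minimal sketch of the proposed optimization, not something implemented in the project: shifting x left by s positions is the same as multiplying x by the one-hot constant 2^s, so a DSP block multiplier can stand in for LUT-based shifter logic.

```cpp
#include <cstdint>

// A barrel shift expressed as a multiplication, as suggested above:
// the DSP multiplier does the work instead of CLB shifter logic.
uint32_t shift_via_mul(uint16_t x, unsigned s) {
    uint32_t one_hot = 1u << s;      // bit vector with a single one at position s
    return (uint32_t)x * one_hot;    // equivalent to (uint32_t)x << s
}
```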
If the application of the floating point blocks is known it is possible to do some application specific optimizations. For example, in a butterfly with an adder and a subtracter operating on the same operands, the first compare stage could be shared between them. If the application can tolerate it, further pipelining could increase the performance significantly. If the latency tolerance is very high, bit-serial arithmetic could probably be used as well. In this project we limited the pipeline depth to compare well with the FPUs used in CPUs.
According to a post on comp.arch.fpga it is possible to achieve 400 MHz performance for IEEE single precision floating point arithmetic. Few details are available, but a key technique is to use the DSP48 block for the adder, since an adder implemented with a carry chain would be too slow. The post-normalization step is supposedly implemented using both DSP48 blocks and Block RAMs [1]. The pipeline depth of this implementation is not known.

Figure 6: RLOC:ed complex butterfly

Figure 7: Non-RLOC:ed radix-4 butterfly
It would also be interesting to look at the newly announced Virtex-5 architecture. Its 6-LUT architecture should reduce the number of logic levels and the amount of routing all over the design. Unfortunately, no tools that target the Virtex-5 are publicly available today.

8 RLOC related problems


It is relatively easy to RLOC individual pipeline stages, but once we tried to hierarchically RLOC several pipeline stages, the performance suddenly decreased. Generally, the place and route tool seems to place modules quite far from each other. This balances the different pipeline stages and eases routing due to lower congestion. However, as soon as we started to RLOC several pipeline stages together, the distance between two non-RLOCed stages grew larger and it became harder to meet timing. In the end, we had to RLOC at least some parts of all modules involved in the design to be able to meet timing.

9 Conclusions
The Virtex-4 FPGA is not really suited for floating point arithmetic, although with the techniques detailed in this report it is possible to get relatively decent performance. We would have liked to achieve higher performance, though. We also realized that the placer does a pretty good job, and it is not trivial to achieve higher performance by doing some of the placement by hand.

References
[1] Ray Andraka; Re: Floating point reality check, news:comp.arch.fpga, 14 May 2006.
[2] Xilinx; Virtex-4 User Guide.
[3] Xilinx; XtremeDSP for Virtex-4 FPGAs User Guide.

