High Performance FPGA Based Floating Point Arithmetics: Project Report For Computer Arithmetic Algorithms
1 Introduction
We decided to investigate what kind of floating point performance it is possible to achieve in a modern FPGA. In order to gain a thorough understanding of the issues involved we decided to try to implement a fast FPU ourselves. We do expect that an expert in the field could come up with a better solution, especially given the limited amount of time available for this project. However, a search on the Internet did not turn up any references to high performance FPUs on Virtex-4 FPGAs.
The Virtex-4 uses a relatively standard FPGA architecture with CLBs consisting of four slices, each of which contains two 4-LUTs and two flip-flops. The FPGA also has a large number of embedded memories and DSP blocks containing high speed multipliers and adders. In addition, the Virtex-4 contains a number of specialized components which were not used in this project. For further details about the Virtex-4 FPGA, see the Virtex-4 User Guide [2]. The DSP blocks are thoroughly described in the XtremeDSP User Guide [3].
In order to test the FPU in a realistic environment we decided to implement a complex radix-2 butterfly kernel using the FPU adder and multiplier we were going to implement. This kernel can be used to implement, for example, higher radix FFTs.
We selected a simple floating point format with no denormalized numbers and neither NaN nor Inf.
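As an illustration, the format described above might be packed as follows. This is a sketch under our own assumptions: the field widths, the `SimpleFloat` name, and the all-bits-zero encoding of zero are not specified in the report, which leaves the widths configurable.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical packing of the simplified format: 1 sign bit, EXP_BITS
// biased exponent bits, MAN_BITS mantissa bits with an implicit leading
// one. No denormals, NaN, or Inf; zero is encoded as the all-zero word.
constexpr unsigned EXP_BITS = 8;   // assumption for this sketch
constexpr unsigned MAN_BITS = 23;  // assumption for this sketch
constexpr int32_t  BIAS     = (1 << (EXP_BITS - 1)) - 1;

struct SimpleFloat {
    uint32_t sign;     // 0 or 1
    int32_t  exp;      // unbiased exponent
    uint32_t man;      // MAN_BITS fraction bits; implicit 1 not stored
    bool     is_zero;  // no denormals, so zero is a separate flag
};

uint32_t pack(const SimpleFloat& f) {
    if (f.is_zero) return 0;
    return (f.sign << (EXP_BITS + MAN_BITS))
         | (uint32_t(f.exp + BIAS) << MAN_BITS)
         | (f.man & ((1u << MAN_BITS) - 1));
}

SimpleFloat unpack(uint32_t w) {
    SimpleFloat f{};
    f.is_zero = (w << 1) == 0;  // ignore the sign bit when testing for zero
    f.sign = w >> (EXP_BITS + MAN_BITS);
    f.exp  = int32_t((w >> MAN_BITS) & ((1u << EXP_BITS) - 1)) - BIAS;
    f.man  = w & ((1u << MAN_BITS) - 1);
    return f;
}
```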
2 Methodology
To test the final result we implemented a C++ class for floating point numbers.
The number of bits in the mantissa and exponent could be configured from 1
to 31 bits. The C++ model was used to generate the test vectors for the RTL
test benches.
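A minimal sketch of such a configurable-width reference model, assuming round-to-nearest quantization. The `RefFloat` name and the `quantize` interface are our assumptions; the report does not reproduce the actual class.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Sketch of a configurable-precision reference model in the spirit of
// the C++ class described above. The double value is rounded so that
// only man_bits fraction bits (plus the implicit one) remain.
class RefFloat {
public:
    RefFloat(unsigned man_bits, unsigned exp_bits)
        : man_bits_(man_bits), exp_bits_(exp_bits) {}

    // Quantize a double to the configured mantissa width, rounding the
    // significand to the nearest representable value.
    double quantize(double x) const {
        if (x == 0.0) return 0.0;
        int e;
        double m = std::frexp(x, &e);  // x = m * 2^e with 0.5 <= |m| < 1
        double scale = std::ldexp(1.0, int(man_bits_) + 1);
        m = std::round(m * scale) / scale;  // keep man_bits fraction bits
        return std::ldexp(m, e);
    }

private:
    unsigned man_bits_;
    unsigned exp_bits_;  // would bound the exponent range in a full model
};
```

In the project's flow, a model like this would produce both the stimulus and the expected responses for the RTL test benches.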
An initial RTL model was then developed and tested against the floating point
test data. The RTL model was written with the hardware in mind but it was
not optimized for the Virtex-4 FPGA.
The performance of the initial RTL model was evaluated and the most critical
part of the design was optimized to better fit the FPGA. After the optimization,
the model was verified with the test benches. This was repeated until the
performance was satisfactory.
Finally, the design was tested in an FPGA by downloading test data to the
FPGA and uploading the results from the butterfly calculation for verification
against test patterns generated by the C++ model.
4 Multiplier
The multiplier is quite simple to construct due to the large number of available
multiplier blocks in the FPGA. A single multiplier is used for the mantissa and
an adder is used for the exponent. It is also necessary to normalize the result
of the multiplication. This normalizer is very simple since the most significant
bit can only be located at one out of two bit positions given normalized inputs
to the multiplier. The overall architecture of the multiplier is shown in figure 1.
A simple rounding scheme was chosen where the rounding was done before
the normalization. This can be implemented basically for free in the DSP48
blocks in the FPGA. This can be contrasted with the rounding schemes used in
IEEE-754 where rounding is performed after normalization with an extra small
normalization step required to check for overflow after rounding. This would
not map very well to the DSP48 block. Except for the utilization of the DSP48
block, no FPGA specific optimizations were performed in the multiplier block.
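The multiplier datapath described above can be sketched behaviorally as follows. The mantissa width is an assumption, as is applying one fixed rounding constant regardless of the final shift, which is exactly the simplification of rounding before normalization.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned MAN_BITS = 10;  // fraction bits; width is an assumption

// Mantissa path of the multiplier. Inputs are normalized mantissas with
// the implicit one attached, i.e. values in [2^MAN_BITS, 2^(MAN_BITS+1)).
// Returns the normalized result mantissa in the same form and adjusts
// the exponent in place.
uint32_t fp_mul_mantissa(uint32_t ma, uint32_t mb, int& exp) {
    uint64_t prod = uint64_t(ma) * mb;  // value in [1, 4)

    // Round BEFORE normalization: add half an ULP at the position that
    // is discarded when the product stays in [1, 2). The same constant
    // is reused when the product lands in [2, 4), which is the small
    // accuracy trade-off that makes the rounding essentially free in
    // the DSP48's post-adder.
    prod += 1u << (MAN_BITS - 1);

    // Given normalized inputs the MSB can only sit at one of two bit
    // positions, so the normalizer is a single conditional shift.
    if (prod >> (2 * MAN_BITS + 1)) {   // product in [2, 4)
        exp += 1;
        return uint32_t(prod >> (MAN_BITS + 1));
    }
    return uint32_t(prod >> MAN_BITS);  // product in [1, 2)
}
```

The exponent path is simply an addition of the two (biased) exponents, folded here into the caller's `exp` bookkeeping.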
[Figure 1: The overall architecture of the multiplier, including Round and Normalize stages.]
5 Adder/Subtracter
A floating point adder is more complicated than a floating point multiplier. The
basic architecture for the adder is shown in figure 2. The first step compares
the operands and swaps them if necessary so that the largest number always
enters the left path. This step also adds the implicit one if the input operands
are non-zero. In the next step, the smallest number is aligned so that the
exponents of both operands match. After this step, an addition or subtraction of the two numbers is performed. A subtraction can never cause a negative
result because of the earlier comparison and swap step. The normalization step
is the final step. It is implemented using two pipeline stages. The first stage
looks at the mantissa in 4 bits intervals as seen in figure 3. The first module
looks at the first four bits and outputs a normalized result assuming a one was
found in these bits. An extra output signal, shown as gray lines in the figure,
is used to signal that all four bits were zero. The second module assumes
that the first four bits were all zero and instead looks at the following four bits,
outputting a normalized result. This is repeated for the remaining bits of the
mantissa. The next stage decides which of the previous results should be used. If all bits were zero, a zero is output as the result. The value needed to
correct the exponent is generated according to the same scheme. This is shown
as dashed lines in the figure.
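The two-stage normalization scheme can be modelled behaviorally like this. The loop plays the role of the parallel per-group modules and the selecting mux of the second stage; the mantissa width is an assumption for illustration.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned W = 16;  // unnormalized mantissa width (assumption)

// Behavioral model of the two-stage normalizer: each 4-bit group yields
// a candidate result under the assumption that the leading one falls in
// that group; the first non-zero group (highest priority) wins. Returns
// the normalized mantissa (MSB at bit W-1) and writes the exponent
// correction to 'shift'. A zero input yields zero, as in the text.
uint32_t normalize(uint32_t m, unsigned& shift) {
    for (unsigned g = 0; g < W / 4; ++g) {
        uint32_t top4 = (m >> (W - 4 * (g + 1))) & 0xF;
        if (top4 != 0) {  // leading one is somewhere in this group
            unsigned within = top4 & 0x8 ? 0
                            : top4 & 0x4 ? 1
                            : top4 & 0x2 ? 2 : 3;
            shift = 4 * g + within;            // exponent correction
            return (m << shift) & ((1u << W) - 1);
        }
    }
    shift = 0;  // all bits zero: output zero
    return 0;
}
```

In hardware all `W/4` group modules evaluate in parallel in the first pipeline stage, and the second stage performs the selection that the loop's early return models here.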
[Figure 2: The basic adder architecture, with separate exponent and mantissa paths passing through Compare/Select (CMP), Align, Add, and Denormalization stages.]
[Figure 3: The first normalization stage: the unnormalized mantissa and exponent are examined in 4-bit groups (starting at bits 0, 4, 8, 12), feeding a priority decoder and a 4-1 MUX.]
Initially the adder met timing at 250 MHz. It did not achieve this performance
once it was inserted into a complex butterfly. At this point further optimizations
were required. The first FPGA specific optimization was to make sure that the
adder/subtracter was implemented using only one LUT per bit. A standard
adder structure as compared to an adder structure with both addition and
subtraction is shown in figure 4.
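The combined adder/subtracter of figure 4 corresponds to XOR-ing the second operand with the Sub signal and feeding Sub into the carry-in, so that subtraction becomes two's-complement addition. A bit-level model:

```cpp
#include <cassert>
#include <cstdint>

// Bit-level model of the single-LUT-per-bit adder/subtracter: each bit
// computes A xor (B xor Sub) in one LUT, and Sub doubles as the carry-in
// so that A - B = A + ~B + 1. The width parameter is an assumption.
uint32_t addsub(uint32_t a, uint32_t b, bool sub, unsigned width = 16) {
    uint32_t carry = sub ? 1u : 0u;  // carry-in = Sub
    uint32_t result = 0;
    for (unsigned i = 0; i < width; ++i) {
        uint32_t ai = (a >> i) & 1;
        uint32_t bi = ((b >> i) & 1) ^ (sub ? 1u : 0u);  // B xor Sub (LUT)
        uint32_t s = ai ^ bi ^ carry;                    // sum bit
        carry = (ai & bi) | (carry & (ai ^ bi));         // carry chain
        result |= s << i;
    }
    return result & ((1u << width) - 1);
}
```

In the FPGA, the final XOR with the carry and the carry propagation live in the dedicated carry logic, which is what keeps the structure at one LUT per bit.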
Another optimization targeted the exponent selection in the normalization step. At first, this was implemented using a 5-to-1 mux in front of an adder. By implementing a 2-to-1 mux directly in the same LUT used for the addition, a smaller 4-to-1 mux could be used in front of the adder. In order to make sure
that this mux was placed near the adder, RLOC directives were used to place
the components in relation to each other.
In both the exponent and mantissa mux, the reset signal of the flip flop was
used to set the result to zero instead of embedding this logic into the LUT.
Another technique that we tried was to construct a 4-to-1 mux combined with a priority decoder as shown in figure 5. This mux should achieve slightly better performance than an ordinary mux since there is only one level of LUTs. In a later stage of the implementation we moved the OR function to the previous pipeline stage as well.
[Figure 4: A standard adder bit slice compared to a combined adder/subtracter bit slice: the B input is XOR-ed with the Sub signal before the sum XOR, with carry in and carry out on the dedicated carry chain.]
[Figure 5: A 4-to-1 mux combined with a priority decoder, selecting among inputs A-D using select signals X, Y, Z.]

Functionality table:

X Y Z | O
1 x x | A
0 1 x | B
0 0 1 | C
0 0 0 | D
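Behaviorally, the priority-decoded mux implements the following selection (a sketch; the signal names follow the figure):

```cpp
#include <cassert>
#include <cstdint>

// Model of the priority-decoded 4-to-1 mux of figure 5: X has priority
// over Y, and Y over Z, matching the functionality table above.
uint32_t prio_mux4(bool x, bool y, bool z,
                   uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    if (x) return a;  // X = 1:            select A
    if (y) return b;  // X = 0, Y = 1:     select B
    if (z) return c;  // X = Y = 0, Z = 1: select C
    return d;         // X = Y = Z = 0:    select D
}
```

Because the decode and the select collapse into one level of logic, each output bit fits in a single LUT level, which is the source of the speed advantage claimed above.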
6 Floorplanning
In order to improve performance of the final system we tried to locate different
pipeline stages close to each other by using RLOC directives. Doing this resulted
in more regularity and a smaller area footprint.
Figure 6: RLOC'ed complex butterfly
Figure 7: Non-RLOC'ed radix-4 butterfly
available but a key technique is to use the DSP48 block for the adder since an
adder implemented with a carry chain would be too slow. The post-normalization step is reportedly implemented using both DSP48 blocks and block RAMs [1].
The pipeline depth of this implementation is not known.
It would also be interesting to look at the newly announced Virtex-5 architecture. The 6-LUT architecture should reduce the number of logic levels and the routing all over the design. Unfortunately, no tools that target the Virtex-5 are publicly available today.
9 Conclusions
The Virtex-4 FPGA is not really suited for floating point arithmetic. With the techniques detailed in this report it is possible to get relatively decent performance, though we would have liked to achieve more. We also realized that the placer does a pretty good job, and it is not trivial to beat it by doing some of the placement by hand.
References
[1] Ray Andraka; Re: Floating point reality check, news:comp.arch.fpga, 14 May 2006.
[2] Xilinx; Virtex-4 User Guide.
[3] Xilinx; XtremeDSP for Virtex-4 FPGAs User Guide.