Emulation of FMA and correctly-rounded sums: proved algorithms using rounding to odd

Sylvie Boldo and Guillaume Melquiond

Abstract— Rounding to odd is a non-standard rounding on floating-point numbers. By using it for some intermediate values instead of rounding to nearest, correctly rounded results can be obtained at the end of computations. We present an algorithm for emulating the fused multiply-and-add operator. We also present an iterative algorithm for computing the correctly rounded sum of a set of floating-point numbers under mild assumptions. A variation on both previous algorithms is the correctly rounded sum of any three floating-point numbers. This leads to efficient implementations, even when this rounding is not available. In order to guarantee the correctness of these properties and algorithms, we formally proved them using the Coq proof checker.

Index Terms— Floating-point, rounding to odd, accurate summation, FMA, formal proof, Coq.

I. INTRODUCTION

FLOATING-POINT computations and their roundings are described by the IEEE-754 standard [1], [2], followed by every modern general-purpose processor. This standard was written to ensure the coherence of the result of a computation whatever the environment. This is the "correct rounding" principle: the result of an operation is the same as if it were first computed with an infinite precision and then rounded to the precision of the destination format. There may exist higher precision formats though, and it would not be unreasonable for a processor to store all kinds of floating-point results in a single kind of register instead of having as many register sets as it supports floating-point formats. In order to ensure IEEE-754 conformance, care must then be taken that a result is not first rounded to the extended precision of the registers and then rounded to the precision of the destination format.

This "double rounding" phenomenon may happen on processors built around the Intel x86 instruction set, for example. Indeed, their floating-point units use 80-bit registers to store the results of their computations, while the most common format used to store in memory is only 64-bit wide (IEEE double precision). To prevent double rounding, a control register allows setting the floating-point precision, so that results are not first rounded to the register precision. Unfortunately, setting the target precision is a costly operation, as it requires the processor pipeline to be flushed. Moreover, thanks to the extended precision, programs generally seem to produce more accurate results. As a consequence, compilers usually do not generate the additional code that would ensure that each computation is correctly rounded in its own precision.

Double rounding can however lead to unexpected inaccuracy. As such, it is considered a dangerous feature, and writing robust floating-point algorithms requires extra care to ensure that this potential double rounding will not produce incorrect results [3]. Nevertheless, double rounding is not necessarily a threat. For example, if the extended precision is at least twice as big, it can be used to emulate correctly rounded basic operations for a smaller precision [4]. Double rounding can also be made innocuous by introducing a new rounding mode and using it for the first rounding. When a real number is not representable, it will be rounded to the adjacent floating-point number with an odd mantissa. In this article, this rounding will be named rounding to odd.

Von Neumann was already considering this rounding when designing the arithmetic unit of the EDVAC [5]. Goldberg later used it when converting binary floating-point numbers to decimal representations [6]. The properties of this rounding operator are close to the ones needed when implementing rounded floating-point operators with guard bits [7]. Because of its double rounding property, it has also been studied in the context of multistep gradual rounding [8]. Rounding to odd was never more than an implementation detail though, as two extra bits had to be stored in the floating-point registers. It was part of some hardware recipes that were claimed to give a correct result. Our work aims at giving precise and clear definitions and properties with a strong guarantee on their correctness. We also show that it is worth making rounding to odd a rounding mode in its own right (it may be computed in hardware or in software). By rounding some computations to odd in an algorithm, more accurate results can be produced without extra precision.

Section II will detail a few characteristics of double rounding and why rounding to nearest fails us. Section III will introduce the formal definition of rounding to odd, how it solves the double rounding issue, and how to implement this rounding. Its property with respect to double rounding will then be extended to two applications. Section IV will describe an algorithm that emulates the floating-point fused-multiply-and-add operator. Section V will then present algorithms for performing accurate summation. Formal proofs of the lemmas and theorems have been written and included in the Pff library¹ on floating-point arithmetic. Whenever relevant, the names of the properties in the following sections match the ones in the library.

S. Boldo is with the INRIA Futurs. G. Melquiond is with the École Normale Supérieure de Lyon.
¹ See http://lipforge.ens-lyon.fr/www/pff/.

II. DOUBLE ROUNDING

A. Floating-point definitions

Our formal proofs are based on the floating-point formalization [9] of Daumas, Rideau, and Théry in Coq [10], and on the corresponding library by Théry, Rideau, and one of the authors [11]. Floating-point numbers are represented by pairs (n, e) that stand for n · 2^e. We use both an integral signed mantissa n and an integral signed exponent e for the sake of simplicity.
A floating-point format is denoted by B and is a pair composed of the lowest exponent −E available and the precision p. We do not set an upper bound on the exponent, as overflows do not matter here (see Section VI). We define a pair (n, e) to be representable when |n| < 2^p and e ≥ −E. We denote by F the subset of real numbers represented by these pairs for a given format B. From now on, only representable floating-point numbers will be referred to; they will simply be called floating-point numbers.

All the IEEE-754 rounding modes are also defined in the Coq library, especially the default rounding: rounding to nearest even, denoted by ◦. We have f = ◦(x) if f is the floating-point number closest to x; when x is halfway between two consecutive floating-point numbers, the one with an even mantissa is chosen.

A rounding mode is defined in the Coq library as a relation between a real number and a floating-point number, not as a function from real values to floats. Indeed, there may be several floats corresponding to the same real value. For a relation, a weaker property than being a rounding mode is being a faithful rounding. A floating-point number f is a faithful rounding of a real x if it is either the rounding up or the rounding down of x, as shown in Figure 1. When x is a floating-point number, it is its own and only faithful rounding. Otherwise, there are always two faithful roundings bracketing the real value when no overflow occurs.

[Fig. 1. Faithful roundings: the two floats bracketing x are its faithful roundings; the correct rounding is the closest one.]

B. Double rounding accuracy

As explained before, a floating-point computation may first be done in an extended precision, and later rounded to the working precision. The extended precision is denoted by Be = (p + k, Ee) and the working precision by Bw = (p, Ew). If the same rounding mode is used for both computations (usually to nearest even), it can lead to a less precise result than a single rounding.

For example, see Figure 2. When the real value x is in the neighborhood of the midpoint t of two consecutive floating-point numbers g and h, it may first be rounded in one direction toward this middle t in extended precision, and then rounded in the same direction toward f in working precision. Although the result f is close to x, it is not the closest floating-point number to x, as h is. When both rounding directions are to nearest, we formally proved that the distance between the given result f and the real value x may be as much as

|f − x| ≤ (1/2 + 2^(−k−1)) · ulp(f).

When there is only one rounding, the corresponding inequality is |f − x| ≤ 1/2 · ulp(f). This is the expected bound for an IEEE-754 compatible implementation.

[Fig. 2. Bad case for double rounding: x is first rounded to t (Be step), then to f (Bw step), although h is the float closest to x.]
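This bad case is easy to trigger in practice. The following C program (our illustration, not an example from the paper) plays it out with binary64 in the role of Be and binary32 in the role of Bw; it assumes an IEEE-754 environment where double arithmetic is not evaluated in a wider intermediate format:

#include <stdio.h>

int main(void) {
    /* The real value x = 1 + 2^-24 + 2^-60 lies just above the midpoint
       of the two consecutive binary32 numbers 1 and 1 + 2^-23, so its
       correctly rounded binary32 value is 1 + 2^-23 = 0x1.000002p+0. */
    double a = 1.0 + 0x1p-24;   /* exact in binary64 */
    double t = a + 0x1p-60;     /* first rounding: the 2^-60 bit is lost,
                                   t is exactly the midpoint 1 + 2^-24 */
    float  f = (float)t;        /* second rounding: ties-to-even now
                                   rounds down to 1.0f */
    printf("%a\n", f);          /* prints 0x1p+0 instead of 0x1.000002p+0 */
    return 0;
}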

Section IV-B.1 will show that, when there is only one single floating-point format but many computations, trying to get a correctly rounded result is somehow similar to avoiding incorrect double rounding.

C. Double rounding and faithfulness

Another interesting property of double rounding as defined previously is that it is a faithful rounding. We even have a more generic result.

[Fig. 3. Double roundings are faithful: the final result is always one of the two floats f1 and f2 bracketing x.]

Let us consider that the relations are not required to be rounding modes but only faithful roundings. We formally certified that the rounded result f of a double faithful rounding is faithful to the initial real value x, as shown in Figure 3.

Theorem 1 (DblRndStable): Let Re be a faithful rounding in the extended precision Be = (p + k, Ee) and let Rw be a faithful rounding in the working precision Bw = (p, Ew). If k ≥ 0 and k ≤ Ee − Ew, then for every real value x, the floating-point number Rw(Re(x)) is a faithful rounding of x in the working precision.

This is a deep result, as faithfulness is the best we can expect as soon as we consider at least two roundings to nearest. This result can be applied to any two successive IEEE-754 rounding modes (to zero, toward +∞, ...). The requirements are k ≥ 0 and k ≤ Ee − Ew. The last requirement means that the minimum exponents e_min,e and e_min,w — as defined by the IEEE-754 standard — should satisfy e_min,e ≤ e_min,w. It is consequently equivalent to: any normal floating-point number with respect to Bw should be normal with respect to Be.

This means that any sequence of successive roundings in decreasing precisions gives a faithful rounding of the initial value.

III. ROUNDING TO ODD

As seen in the previous section, rounding two times to nearest may induce a bigger round-off error than one single rounding to nearest, and may then lead to unexpected incorrect results. By rounding to odd first, the second rounding will correctly round the initial value to nearest.
A. Formal description

Rounding to odd does not belong to the IEEE-754 or even 754R² rounding modes. It should not be mixed up with rounding to the nearest odd (which would be similar to the default rounding, rounding to the nearest even).

We denote by △ rounding toward +∞ and by ▽ rounding toward −∞. Rounding to odd is defined by:

    odd(x) = x      if x ∈ F,
             △(x)   if the mantissa of △(x) is odd,
             ▽(x)   otherwise.

Note that the result of x rounded to odd can be even only when x is a representable floating-point number. Note also that, when x is not representable, odd(x) is not necessarily the nearest floating-point number with an odd mantissa. Indeed, this fails when x is close to a power of two. This partly explains why the formal proofs of the algorithms involving rounding to odd will have to separate the case of powers of two from the other floating-point numbers.

Theorem 2 (To_Odd*): Rounding to odd has the properties of a rounding mode [9]:
• each real can be rounded to odd;
• rounding to odd is faithful;
• rounding to odd is monotone.
Moreover,
• rounding to odd can be expressed as a function: a real cannot be rounded to two different floating-point values;
• rounding to odd is symmetric: if f = odd(x), then −f = odd(−x).

² See http://www.validlab.com/754R/.
B. Implementing rounding to odd

Rounding to odd the real result x of a floating-point computation can be done in two steps. First round it to zero into the floating-point number Z(x), with respect to the IEEE-754 standard. Then perform a logical or between the inexact flag ι (or the sticky bit) of the first step and the last bit of the mantissa.

If the mantissa of Z(x) is already odd, this floating-point number is also the value of x rounded to odd; the logical or does not change it. If the floating-point computation is exact, Z(x) is equal to x and ι is not set; consequently odd(x) = Z(x) is correct. Otherwise, the computation is inexact and the mantissa of Z(x) is even, but the final mantissa must be odd, hence the logical or with ι. In this last case, this odd float is the correct one, since the first rounding was toward zero.

Computing ι is not a problem per se, since the IEEE-754 standard requires this flag to be implemented, and hardware already uses sticky bits for the other rounding modes. Furthermore, the value of ι can directly be reused to flag the rounded value of x as exact or inexact. As a consequence, on an already IEEE-754 compliant architecture, adding this new rounding has no significant cost.

Another way to round to odd with precision p + k is the following. We first round x toward zero with p + k − 1 bits. We then concatenate the inexact bit of the previous operation at the end of the mantissa in order to get a (p + k)-bit float. The justification is similar to the previous one.

Both previous methods are aimed at hardware implementations. They may not be efficient enough to be used in software. Section V-D will present a third way of rounding to odd, more adapted to current architectures and actually implemented. It is portable and available in higher-level languages, as it requires neither changing the rounding direction nor accessing the inexact flag.
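The first method can nevertheless be transliterated into portable C through the C99 fenv.h interface, at the cost of a rounding-mode switch. The helper below is our sketch, not code from the paper (whose Section V-D emulation, shown later in Listing 1, avoids fenv altogether); the name odd_add is ours, and it assumes finite binary64 inputs and arithmetic that is not evaluated in a wider intermediate format:

#include <fenv.h>
#include <stdint.h>
#include <string.h>

#pragma STDC FENV_ACCESS ON

/* odd(a + b): round toward zero, then or the inexact flag into the
   last mantissa bit, as described in Section III-B. */
double odd_add(double a, double b) {
    int old = fegetround();
    feclearexcept(FE_INEXACT);
    fesetround(FE_TOWARDZERO);
    double z = a + b;                        /* Z(a + b) */
    int inexact = fetestexcept(FE_INEXACT) != 0;
    fesetround(old);

    uint64_t bits;                           /* IEEE-754 binary64 encoding */
    memcpy(&bits, &z, sizeof bits);
    bits |= (uint64_t)inexact;               /* force an odd mantissa if inexact */
    memcpy(&z, &bits, sizeof bits);
    return z;
}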
C. Correct double rounding

Let x be a real number. This number is first rounded to odd in an extended format (the precision is p + k bits and 2^−Ee is the smallest positive floating-point number). Let t be this intermediate rounded result. It is then rounded to nearest even in the working format (the precision is p bits and 2^−Ew is the smallest positive floating-point number). Although we are considering a real value x here, an implementation does not need to really handle x. The value x can indeed represent the abstract exact result of an operation between floating-point numbers. Although this sequence of operations is a double rounding, we state that the computed final result is correctly rounded.

Theorem 3 (To_Odd_Even_is_Even): Assuming p ≥ 2, k ≥ 2, and Ee ≥ 2 + Ew,

∀x ∈ R,  ◦^p(odd^(p+k)(x)) = ◦^p(x).

The proof is split into three cases, as shown in Figure 4. When x is exactly equal to the middle of two consecutive floating-point numbers f1 and f2 (case 1), then t is exactly x and f is the correct rounding of x. Otherwise, when x is slightly different from this midpoint (case 2), then t is different from this midpoint: it is the odd value just greater or just smaller than the midpoint, depending on the value of x. The reason is that, as k ≥ 2, the midpoint is even in the p + k precision, so t cannot be rounded to it if it is not exactly equal to it. This value t will then be correctly rounded to f, which is the p-bit float closest to x. The other numbers (case 3) are far away from the midpoint and are easy to handle.

[Fig. 4. The different cases of rounding to odd, relative to the midpoint of two consecutive floats f1 and f2.]

Note that the hypothesis Ee ≥ 2 + Ew is a requirement easy to satisfy. It is weaker than the corresponding one in Theorem 1. In particular, the following condition is sufficient but no longer necessary: any normal floating-point number with respect to Bw should be normal with respect to Be.

While the pen-and-paper proof is a bit technical, it does seem easy. It does not, however, consider the special cases, especially the ones where ◦^p(x) is a power of two, and subsequently where ◦^p(x) is the smallest normal floating-point number. We must look into all these special cases in order to ensure that the rounding is always correct, even when underflow occurs. We have formally proved this result using the Coq proof assistant. By using a proof checker, we are sure no cases were forgotten and no mistakes were made in the numerous computations. There are many splittings into subcases; they make the final proof rather long: seven theorems and about one thousand lines of Coq, but we are now sure that every case (normal/subnormal, power of the radix or not) is supported. Details on this proof were presented in a previous work [12].

Theorem 3 is even more general than what is presented here: it can be applied to any realistic rounding to the closest (meaning that the result of a computation is uniquely defined by the value of the infinitely precise result and does not depend on the machine state). In particular, it handles the new rounding to nearest, ties away from zero, defined by the revision of the IEEE-754 standard.
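As a concrete instance (our example, not the paper's): take binary32 as the working format (p = 24, Ew = 149) and binary64 as the extended format (p + k = 53, so k = 29 ≥ 2, and Ee = 1074 ≥ 2 + Ew). Theorem 3 then states that ◦^24(odd^53(x)) = ◦^24(x) for every real x: an odd-rounded binary64 computation followed by a conversion to binary32 yields the correctly rounded binary32 result — exactly the failure-free version of the double rounding exhibited by the program of Section II-B.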

IV. EMULATING THE FMA

The fused-multiply-and-add is a recent floating-point operator that is present on a few modern processors like PowerPC or Itanium. This operation will hopefully be standardized in the revision of the IEEE-754 standard. Given three floating-point numbers a, b, and c, it computes the value z = ◦(a · b + c) with one single rounding at the end of the computation. There is no rounding after the product a · b. This operator is very useful, as it may increase the performance and accuracy of dot products and matrix multiplications. Algorithm 1 shows how it can be emulated thanks to rounding to odd. This section will describe its principles.

A. The algorithm

Algorithm 1 relies on error-free transformations (ExactAdd and ExactMult) to perform some of the operations exactly. These transformations, described below, return two floating-point values. The first one is the usual result: the exact sum or product rounded to nearest. The other one is the error term. For addition and multiplication, this term happens to be exactly representable by a floating-point number and computable using only floating-point operations, provided neither underflow (for the multiplication) nor overflow occurs. As a consequence, in Algorithm 1, these equalities hold: a · b = uh + ul and c + uh = th + tl. And the rounded result is stored in the higher word: uh = ◦(a · b) and th = ◦(c + uh).

Algorithm 1 Emulating the FMA.
  (uh, ul) = ExactMult(a, b)
  (th, tl) = ExactAdd(c, uh)
  v = odd(tl + ul)
  z = ◦(th + v)

[The accompanying graph shows the dataflow: an error-free multiplication of a and b, an error-free addition with c, an odd-rounded addition producing v, and a final addition rounded to nearest producing z.]
A fast operator for computing the error term of the multiplication is the FMA itself: ul = ◦(a · b + (−uh)). Unfortunately, our goal is the emulation of a FMA, so we have to use another method. In IEEE-754 double precision, Dekker's algorithm first splits the 53-bit floating-point inputs into 26-bit parts thanks to the sign bit. These parts can then be multiplied exactly and subtracted in order to get the error term [13]. For the error term of the addition, since we do not know the relative order of |c| and |uh|, we use Knuth's unconditional version of the algorithm [14]. These two algorithms have been formally proved in Coq [9], [15].

Our emulated FMA first computes an approximation of the correct result: th = ◦(◦(a · b) + c). It also computes an auxiliary term v that is added to th to get the final result. All the computations are done at the working precision; there is no need for an extended precision. The number v is computed by adding the neglected terms ul and tl, and by rounding the result to odd.
are not normal numbers, as long as ul is representable.
A. The algorithm 1) Adding a negligible yet odd value: We need an intermediate
lemma for simplicity and reusability, described by Figure 5.
B. Theorem of correctness

Theorem 4 (FmaEmul): Under the notations of Algorithm 1, if ul is representable and p ≥ 5, then

z = ◦(a · b + c).

The previous theorem states that the algorithm emulates a FMA under two hypotheses. First, the value ul has to be the error term of the multiplication a · b, in order to avoid some degenerate underflow cases: the error term becomes so small that its exponent falls outside the admitted range. The second hypothesis requires the mantissa to have at least 5 bits. This requirement is reasonable, since even the smallest format of the IEEE-754 standard has a 24-bit mantissa.

This theorem has been formally proved with a proof checker. This is especially important as it is quite generic. In particular, it does not contain any hypothesis regarding subnormal numbers. The algorithm will behave correctly even if some computed values are not normal numbers, as long as ul is representable.

1) Adding a negligible yet odd value: We need an intermediate lemma for simplicity and reusability, described by Figure 5.

Lemma 1 (AddOddEven): Let µ be the smallest positive normal floating-point number. Let x be a floating-point number such that |x| ≥ 5 · µ. Let z be a real number and y = odd(z). Assuming 5 · |y| ≤ |x|,

◦(x + y) = ◦(x + z).

[Fig. 5. Lemma AddOddEven: starting from z ∈ R and y = odd(z) ∈ F, adding x on both sides yields ◦(x + z) = ◦(x + y).]

By uniqueness of the rounded value to nearest even, it is enough to prove that ◦(x + y) is a correct rounding to nearest of x + z with tie breaking to even. By applying Theorem 3, we just have to prove that x + y is equal to odd^(p+k)(x + z) for a k that we may choose as we want (as long as it is greater than 1).

The integer k is chosen so that there exists a floating-point number f equal to x + y, normal with respect to an extended format of precision p + k, and having the same exponent as y. For that, we set f = (n_x · 2^(e_x − e_y) + n_y, e_y). As |y| ≤ |x|, we know that e_y ≤ e_x, so this definition has the required exponent. We then choose k such that 2^(p+k−1) ≤ |n_f| < 2^(p+k). The property k ≥ 2 is guaranteed by 5 · |y| ≤ |x|. The underflow threshold of the extended format is defined as needed thanks to the 5 · µ ≤ |x| hypothesis. These ponderous details are handled in the machine-checked proof.

So we have defined an extended format where x + y is representable. It remains to prove that x + y = odd^(p+k)(x + z). We know that y = odd(z), so we have two cases. First, y = z; then x + y = x + z and the result holds. Second, y is odd and is a faithful rounding of z. We then prove (several case splits and many computations later) that x + y is odd and is a faithful rounding of x + z with respect to the extended format. That ends the proof.

Several variants of this lemma are used in Section V-A. They have all been formally proved too. Their proofs have a similar structure and will not be detailed here. Please refer to the Coq formal development for in-depth proofs.

2) Emulating a FMA: First, we can eliminate the case where v is computed without rounding error. Indeed, it means that z = ◦(th + v) = ◦(th + tl + ul). Since ul = a · b − uh and th + tl = c + uh, we have z = ◦((c + uh) + (a · b − uh)) = ◦(a · b + c).

Now, if v is rounded, it means that v is not a subnormal number. Indeed, if the result of a floating-point addition is a subnormal number, then the addition is exact. It also means that neither ul nor tl is zero. So neither the product a · b nor the sum c + uh is a representable floating-point number.

Since c + uh is not representable, the inequality 2 · |th| ≥ |uh| holds. Moreover, since ul is the error term in uh + ul, we have |ul| ≤ 2^−p · |uh|. Similarly, |tl| ≤ 2^−p · |th|. As a consequence, both |ul| and |tl| are bounded by 2^(1−p) · |th|, so their sum |ul + tl| is bounded by 2^(2−p) · |th|. Since v is not a subnormal number, the inequality still holds when rounding ul + tl to v. So we have proved that |v| ≤ 2^(2−p) · |th| when the computation of v is inexact.

To summarize, either v is equal to tl + ul, or v is negligible with respect to th. Lemma 1 can then be applied with x = th, y = v, and z = tl + ul. Indeed, x + z = th + tl + ul = a · b + c. We have to verify two inequalities in order to apply it, though. First, we must prove 5 · |y| ≤ |x|, meaning 5 · |v| ≤ |th|. We have just shown that |v| ≤ 2^(2−p) · |th|. As p ≥ 5, this inequality easily holds.

Second, we must prove 5 · µ ≤ |x|, meaning 5 · µ ≤ |th|. We prove it by assuming |th| < 5 · µ and deriving a contradiction. So tl is subnormal. Moreover, th must be normal: if th were subnormal, then tl would be 0, which is impossible. We then look into ul. If ul were subnormal, then v = odd(ul + tl) would be computed exactly, which is impossible. So ul is normal. We then prove that both tl = 0 and tl ≠ 0 hold. First, tl ≠ 0 as v ≠ ul + tl. Second, we prove tl = 0 by proving that the addition c + uh is computed exactly (as th = ◦(c + uh)). For that, we prove that e_th < e_uh − 1, as it implies a cancellation in the computation of c + uh and therefore the exactness of th. It remains to prove that 2^e_th < 2^(e_uh − 1). As th is normal, 2^e_th ≤ |th| · 2^(1−p). As p ≥ 5 and ul is normal, 5 · µ · 2^(1−p) ≤ µ ≤ |ul|. Since we have both |th| < 5 · µ and |ul| ≤ 2^(e_uh − 1), we can deduce 2^e_th < 2^(e_uh − 1). We have a contradiction in all cases, therefore 5 · µ ≤ |th| holds. So the hypotheses of Lemma 1 are verified and the proof is completed.

V. ACCURATE SUMMATION
The last steps of the algorithm for emulating the FMA actually compute the correctly rounded sum of three floating-point numbers at once. Although there is no particular assumption on two of the numbers (c and uh), there is a strong hypothesis on the third one: |ul| ≤ 2^−p · |uh|. We will generalize this summation scheme to an iterated scheme that computes the correctly rounded sum of a set of floating-point numbers under some strong assumptions. We will then describe a generic adder for three floating-point numbers that relies on rounding to odd to produce the correctly rounded result.

A. Iterated summation

We consider the problem of adding a sequence of floating-point numbers (fi)1≤i≤n. Let us pose tj = Σ1≤i≤j fi, the exact partial sums. The objective is to compute the correctly rounded sum s = ◦(tn). This problem is not new: adding several floating-point numbers with good accuracy is an important problem of scientific computing [16]. Demmel and Hida presented a simple algorithm that yields almost correct summation results [17]. And recently Ogita, Rump, and Oishi presented some other algorithms for accurate summation [18]. Our algorithm requires stronger assumptions, but it is simple, very fast, and returns the correctly rounded result thanks to rounding to odd.

Two approaches are possible. The first one was described in a previous work [12]. The algorithm computes the partial sums in an extended precision format with rounding to odd. In order to produce the final result, it relies on this double rounding property: ◦(tn) = ◦(odd^(p+k)(tn)). The correctness of the algorithm depends on the following property, proved by induction on j for j < n:

odd^(p+k)(tj+1) = odd^(p+k)(fj+1 + odd^(p+k)(tj)).

Once the value odd^(p+k)(tn) has been computed, it is rounded to the working precision in order to obtain the correctly rounded result s, thanks to the double rounding property. Unfortunately, this algorithm requires that an extended precision format be available for the intermediate results.

Let us now present a new approach. While similar to the old one, Algorithm 2 does not need any extended precision to perform its intermediate computations.

Algorithm 2 Iterated summation.
  Input: the (fi)1≤i≤n are suitably ordered and spaced out.
  g1 = f1
  for i from 2 to n − 1: gi = odd(gi−1 + fi)
  s = ◦(gn−1 + fn)
  Output: s = ◦(Σ fi).

[The accompanying graph shows the chain: odd-rounded additions fold f2, ..., fn−1 into g2, ..., gn−1, then a final addition rounded to nearest incorporates fn.]
Theorem 5 (Summation): We use the notations of Algorithm 2 and assume a reasonable floating-point format is used. Let µ be the smallest positive normal floating-point number. If the following properties hold for any j such that 2 < j < n,

|fj| ≥ 2 · |tj−1|  and  |fj| ≥ 2 · µ,

and if the most significant term verifies

|fn| ≥ 6 · |tn−1|  and  |fn| ≥ 5 · µ,

then s = ◦(tn).

The proof of this theorem has two parts. First, we prove by induction on j that gj = odd(tj) holds for all j < n. In particular, we have s = ◦(fn + odd(tn−1)). Second, we prove that ◦(fn + odd(tn−1)) = ◦(fn + tn−1). This equality is precisely s = ◦(tn). Both parts are proved by applying variants of Lemma 1. The correctness of the induction is a consequence of Lemma 2, while the correctness of the final step is a consequence of Lemma 3.

Lemma 2 (AddOddOdd2): Let x be a floating-point number such that |x| ≥ 2 · µ (so that x/2 is still a normal floating-point number). Let z be a real number. Assuming 2 · |z| ≤ |x|,

odd(x + odd(z)) = odd(x + z).

Lemma 3 (AddOddEven2): Let x be a floating-point number such that |x| ≥ 5 · µ. Let z be a real number. Assuming p > 3 and 6 · |z| ≤ |x|,

◦(x + odd(z)) = ◦(x + z).

It may generally be a bit difficult to verify at execution time that the hypotheses of the summation theorem hold. So it is interesting to have a sufficient criterion that can be checked with floating-point numbers only:

|f2| ≥ 2 · µ,  |fn| ≥ 9 · |fn−1|,  and, for 1 ≤ i ≤ n − 2, |fi+1| ≥ 3 · |fi|.
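Under these hypotheses, Algorithm 2 is a straight loop over an odd-rounded addition. A possible C rendering (ours, reusing the odd_add sketch of Section III-B) is:

/* Algorithm 2: returns o(f[0] + ... + f[n-1]) for n >= 2, assuming the
   spacing hypotheses of Theorem 5 hold, e.g. the floating-point criterion
   |f[1]| >= 2*mu, |f[i+1]| >= 3*|f[i]|, and |f[n-1]| >= 9*|f[n-2]|. */
double odd_sum(const double *f, int n) {
    double g = f[0];
    for (int i = 1; i < n - 1; i++)
        g = odd_add(g, f[i]);   /* partial sums rounded to odd */
    return g + f[n - 1];        /* final addition rounded to nearest */
}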
B. Reducing expansions

A floating-point expansion is a list of sorted floating-point numbers, its value being the exact sum of its components [19]. Computations on these multi-precision values are done using only existing hardware and are therefore very fast.

If the expansion is non-overlapping, looking at the three most significant terms is sufficient to get the correct approximated value of the expansion. This can be achieved by computing the sum of these three terms with Algorithm 2. The algorithm's requirements on ordering and spacing are easily met by expansions.

Known fast algorithms for basic operations on expansions (addition, multiplication, etc.) take as inputs and outputs pseudo-expansions, i.e. expansions with a slight overlap (typically a few bits) [19], [20]. Then, looking at three terms only is no longer enough. All the terms, down to the least significant one, may have an influence on the correctly rounded sum. This problem can be solved by normalizing the pseudo-expansions in order to remove overlapping terms. This process is, however, extremely costly: if the expansion has n terms, Priest's algorithm requires about 6 · n floating-point additions in the best case (9 · n in the worst case). In a simpler normalization algorithm with weaker hypotheses [20], the length of the dependency path to get the three most significant terms is 7 · n additions.

A more efficient solution is provided by Algorithm 2, as it can directly compute the correctly rounded result with n floating-point additions only. Indeed, the criterion at the end of Section V-A is verified by expansions which overlap by at most p − 5 bits, and therefore also by pseudo-expansions.

C. Adding three numbers

Let us now consider a simpler situation. We still want to compute a correctly-rounded sum, but there are only three numbers left. In return, we will remove all the requirements on the relative ordering of the inputs. Algorithm 3 shows how to compute this correctly-rounded sum of three numbers.

Algorithm 3 Adding three numbers.
  (uh, ul) = ExactAdd(b, c)
  (th, tl) = ExactAdd(a, uh)
  v = odd(tl + ul)
  z = ◦(th + v)

Its graph looks similar to the graph of Algorithm 1 for emulating the FMA. The only difference lies in its first error-free transformation. Instead of computing the exact product of two of its inputs, this algorithm computes their exact sum. As a consequence, its proof of correctness can be directly derived from the one for the FMA emulation. Indeed, the correctness of the emulation does not depend on the properties of an exact product. The only property that matters is: uh + ul is a normalized representation of a number u. As a consequence, both Algorithm 1 and Algorithm 3 are special cases of a more general algorithm that computes the correctly rounded sum of a floating-point number with a real number exactly represented by the sum of two floating-point numbers.

Note that the three inputs of the adder do not play a symmetric role. This property will be used in the following section to optimize some parts of the adder.
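Since Algorithm 3 reuses the blocks of Algorithm 1, its C rendering is a two-line variation on fma_emul, reusing the exact_add and odd_add sketches of Sections IV-A and III-B (again our illustration, not the authors' code):

/* Algorithm 3: returns o(a + b + c), with no ordering assumption
   on the three inputs. */
double sum3(double a, double b, double c) {
    double uh, ul, th, tl;
    exact_add(b, c, &uh, &ul);
    exact_add(a, uh, &th, &tl);
    double v = odd_add(tl, ul);
    return th + v;
}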
D. A practical use case

CRlibm³ is an efficient library for computing correctly rounded results of elementary functions in IEEE-754 double precision. Let us consider the logarithm function [21]. In order to be efficient, the library first executes a fast algorithm. This usually gives the correctly rounded result, but in some situations it may be off by one unit in the last place. When the library detects such a situation, it starts again with a slower yet accurate algorithm in order to get the correct final result.

When computing the logarithm ◦(log f), the slow algorithm will use triple-double arithmetic [22] to first compute an approximation of log f stored on three double precision numbers xh + xm + xl. Thanks to results on the table maker's dilemma [23], this approximation is known to be sufficiently accurate for the equality ◦(log f) = ◦(xh + xm + xl) to hold. This means the library just has to compute the correctly rounded sum of the three floating-point numbers xh, xm, and xl.

³ See http://lipforge.ens-lyon.fr/www/crlibm/.

Listing 1 Correctly rounded sum of three ordered values

double CorrectRoundedSum3(double xh, double xm, double xl) {
  double th, tl;
  db_number thdb; // thdb.l is the binary representation of th

  // Dekker's error-free adder of two ordered numbers
  Add12(th, tl, xm, xl);

  // round th to odd if tl is not zero
  if (tl != 0.0) {
    thdb.d = th;
    // if the mantissa of th is odd, there is nothing to do
    if (!(thdb.l & 1)) {
      // choose the rounding direction
      // depending on the signs of th and tl
      if ((tl > 0.0) ^ (th < 0.0))
        thdb.l++;
      else
        thdb.l--;
      th = thdb.d;
    }
  }

  // final addition rounded to nearest
  return xh + th;
}
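Listing 1 relies on CRlibm's db_number type and Add12 macro. Up to details, they are equivalent to the following sketch (our reconstruction, not CRlibm's exact source):

#include <stdint.h>

/* A double and its 64-bit integer encoding sharing the same storage. */
typedef union {
    double  d;
    int64_t l;
} db_number;

/* Dekker's Fast2Sum: s + e == a + b exactly, provided |a| >= |b|.
   Only three floating-point additions. */
#define Add12(s, e, a, b) \
    do { (s) = (a) + (b); (e) = (b) - ((s) - (a)); } while (0)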

Computing this sum is exactly the point of Algorithm 3. Unfortunately, rounding to odd is not available on any architecture targeted by CRlibm, so it has to be emulated. Although such an emulation is costly in software, rounding to odd still allows for a speed-up here. Indeed, xh + xm + xl is the result of a sequence of triple-double floating-point operations, so this is precisely the case described in Section V-B. As a consequence, the operands are ordered in such a way that some parts of Algorithm 3 are not necessary. In fact, Lemma 3 implies the following equality:

◦(xh + xm + xl) = ◦(xh + odd(xm + xl)).

This means that, at the end of the logarithm function, we just have to compute the rounded-to-odd sum of xm and xl, and then do a standard floating-point addition with xh. Now, all that is left is the computation of odd(xm + xl). This is achieved by first computing th = ◦(xm + xl) and tl = xm + xl − th thanks to an error-free adder. If tl is zero or if the mantissa of th is odd, then th is already equal to odd(xm + xl). Otherwise, th is off by one unit in the last place. We replace it either by its successor or by its predecessor, depending on the signs of tl and th.

Listing 1 shows a cleaned version of a macro used by CRlibm: ReturnRoundToNearest3Other. The macro Add12 is an implementation of Dekker's error-free adder. It is only three additions long, and it is correct since the inequality |xm| ≥ |xl| holds. The successor or the predecessor of th is directly computed by incrementing or decrementing the integer thdb.l that holds its binary representation. Working on the integer representation is correct, since th cannot be zero when tl is not zero.

CRlibm already contained some code at the end of the logarithm function in order to compute the correctly rounded sum of three floating-point numbers. When the code of Listing 1 is used instead, the slow step of this elementary function gets 25 cycles faster on an AMD Opteron processor. While we only looked at the very last operation of the logarithm, it still amounts to a 2% speed-up on the whole function.

The performance increase would obviously be even greater if we did not have to emulate the rounded-to-odd addition. Moreover, this speed-up is not restricted to the logarithm: it is available for every other correctly rounded elementary function, since they all rely on triple-double arithmetic at the end of their slow step.

VI. CONCLUSION

We first considered rounding to odd as a way of performing intermediate computations in an extended precision while still obtaining correctly rounded results at the end of the computations. This is expressed by Theorem 3. Rounding to odd then led us to consider algorithms that could benefit from its robustness. We first considered an iterated summation algorithm that was using extended precision and rounding to odd in order to perform the intermediate additions. The FMA emulation however showed that the extended precision only has to be virtual. As long as we prove that the computations behave as if an extended precision were used, the working precision can be used. This is especially useful when we already compute with the highest available precision. The constraints on the inputs of Algorithm 2 are compatible with floating-point expansions: the correctly rounded sum of an overlapping expansion can easily be computed.
Algorithm 1 for emulating the FMA and Algorithm 3 for adding three numbers are similar. They both make it possible to compute ◦(a ⋄ b + c), with a, b, and c three floating-point numbers and ⋄ the operation (× for Algorithm 1, + for Algorithm 3), as long as a ⋄ b is exactly representable as the sum of two floating-point numbers. These algorithms rely on rounding to odd to ensure that the result is correctly rounded. Although this rounding is not available in current hardware, our changes to CRlibm have shown that reasoning on it opens the way to some efficient new algorithms for computing correctly rounded results.

In this paper, we did not tackle at all the problem of overflowing operations. The reason is that overflow does not matter here: in all the algorithms presented, overflow can be detected afterward. Indeed, any of these algorithms will produce an infinity or a NaN as a result in case of overflow. The only remaining problem is that they may create an infinity or a NaN although the result could be represented. For example, let M be the biggest positive floating-point number, and let a = −M and b = c = M in Algorithm 3. Then uh = th = +∞ and ul = tl = v = −∞ and z = NaN, whereas the correct result is M. This can be misleading, but it is not a real problem when adding three numbers. Indeed, the crucial point is that we cannot create inexact finite results: when the result is finite, it is correct. When emulating the FMA, the error term of the product is required to be correctly computed. This property can be checked by verifying that the magnitude of the product is big enough.

While the algorithms presented here look short and simple, their correctness is far from trivial. When rounding to odd is replaced by a standard rounding to nearest, there exist inputs such that the final results are no longer correctly rounded. It may be difficult to believe that simply changing one intermediate rounding is enough to fix some algorithms. So we have written formal proofs of their correctness and used the Coq proof checker to guarantee their validity. This approach is essential to ensure that the algorithms are correct, even in the unusual cases.
REFERENCES

[1] D. Stevenson et al., "A proposed standard for binary floating point arithmetic," IEEE Computer, vol. 14, no. 3, pp. 51–62, 1981.
[2] ——, "An American national standard: IEEE standard for binary floating point arithmetic," ACM SIGPLAN Notices, vol. 22, no. 2, pp. 9–25, 1987.
[3] G. Melquiond and S. Pion, "Formally certified floating-point filters for homogeneous geometric predicates," Theoretical Informatics and Applications, vol. 41, no. 1, pp. 57–70, 2007.
[4] S. A. Figueroa, "When is double rounding innocuous?" SIGNUM Newsletter, vol. 30, no. 3, pp. 21–26, 1995.
[5] J. von Neumann, "First draft of a report on the EDVAC," IEEE Annals of the History of Computing, vol. 15, no. 4, pp. 27–75, 1993.
[6] D. Goldberg, "What every computer scientist should know about floating point arithmetic," ACM Computing Surveys, vol. 23, no. 1, pp. 5–47, 1991.
[7] M. D. Ercegovac and T. Lang, Digital Arithmetic. Morgan Kaufmann Publishers, 2004.
[8] C. Lee, "Multistep gradual rounding," IEEE Transactions on Computers, vol. 38, no. 4, pp. 595–600, 1989.
[9] M. Daumas, L. Rideau, and L. Théry, "A generic library of floating-point numbers and its application to exact computing," in 14th International Conference on Theorem Proving in Higher Order Logics, Edinburgh, Scotland, 2001, pp. 169–184.
[10] Y. Bertot and P. Castéran, Interactive Theorem Proving and Program Development. Coq'Art: the Calculus of Inductive Constructions, ser. Texts in Theoretical Computer Science. Springer-Verlag, 2004.
[11] S. Boldo, "Preuves formelles en arithmétiques à virgule flottante" (formal proofs in floating-point arithmetic), Ph.D. dissertation, École Normale Supérieure de Lyon, Nov. 2004.
[12] S. Boldo and G. Melquiond, "When double rounding is odd," in Proceedings of the 17th IMACS World Congress on Computational and Applied Mathematics, Paris, France, 2005.
[13] T. J. Dekker, "A floating point technique for extending the available precision," Numerische Mathematik, vol. 18, no. 3, pp. 224–242, 1971.
[14] D. E. Knuth, The Art of Computer Programming: Seminumerical Algorithms. Addison-Wesley, 1969, vol. 2.
[15] S. Boldo, "Pitfalls of a full floating-point proof: example on the formal proof of the Veltkamp/Dekker algorithms," in Third International Joint Conference on Automated Reasoning, ser. Lecture Notes in Artificial Intelligence, U. Furbach and N. Shankar, Eds., vol. 4130. Seattle, USA: Springer-Verlag, 2006.
[16] N. J. Higham, Accuracy and Stability of Numerical Algorithms. SIAM, 1996.
[17] J. W. Demmel and Y. Hida, "Fast and accurate floating point summation with applications to computational geometry," in Proceedings of the 10th GAMM-IMACS International Symposium on Scientific Computing, Computer Arithmetic, and Validated Numerics (SCAN 2002), January 2003.
[18] T. Ogita, S. M. Rump, and S. Oishi, "Accurate sum and dot product," SIAM Journal on Scientific Computing, vol. 26, no. 6, pp. 1955–1988, 2005.
[19] D. M. Priest, "Algorithms for arbitrary precision floating point arithmetic," in Proceedings of the 10th IEEE Symposium on Computer Arithmetic, P. Kornerup and D. W. Matula, Eds. Grenoble, France: IEEE Computer Society, 1991, pp. 132–144.
[20] M. Daumas, "Multiplications of floating point expansions," in Proceedings of the 14th IEEE Symposium on Computer Arithmetic, I. Koren and P. Kornerup, Eds., Adelaide, Australia, 1999, pp. 250–257.
[21] F. de Dinechin, C. Q. Lauter, and J.-M. Muller, "Fast and correctly rounded logarithms in double-precision," Theoretical Informatics and Applications, vol. 41, no. 1, pp. 85–102, 2007.
[22] C. Q. Lauter, "Basic building blocks for a triple-double intermediate format," LIP, Tech. Rep. RR2005-38, Sep. 2005.
[23] V. Lefèvre and J.-M. Muller, "Worst cases for correct rounding of the elementary functions in double precision," in Proceedings of the 15th IEEE Symposium on Computer Arithmetic, N. Burgess and L. Ciminiera, Eds., Vail, Colorado, 2001, pp. 111–118.

Sylvie Boldo received the MSc and PhD degrees in computer science from the École Normale Supérieure de Lyon, France, in 2001 and 2005. She is now a researcher at INRIA Futurs in the ProVal team (Orsay, France), whose research focuses on formal certification of programs. Her main research interests include floating-point arithmetic, formal methods, and formal verification of numerical programs.

Guillaume Melquiond received the MSc and PhD degrees in computer science from the École Normale Supérieure de Lyon, France, in 2003 and 2007. He is now a postdoctoral fellow at the INRIA–Microsoft Research joint laboratory (Orsay, France) in the Mathematical Components team, whose research focuses on developing tools for formally proving mathematical theorems. His interests include floating-point arithmetic, formal methods for certifying numerical software, interval arithmetic, and C++ software engineering.
