Ulrich W. Kulisch
Advanced Arithmetic
for the Digital Computer
This work is subject to copyright. All rights are reserved, whether the whole or part of
the material is concerned, specifically those of translation, reprinting, re-use of
illustrations, broadcasting, reproduction by photocopying machines or similar means,
and storage in data banks.
Product Liability: The publisher can give no guarantee for all the information contained
in this book. This also refers to information about drug dosage and application
thereof. In every individual case the respective user must check its accuracy by
consulting other pharmaceutical literature. The use of registered names, trademarks,
etc. in this publication does not imply, even in the absence of a specific statement, that
such names are exempt from the relevant protective laws and regulations and therefore
free for general use.
The number one requirement for computer arithmetic has always been speed.
It is the main force that drives the technology. With increased speed larger
problems can be attempted. To gain speed, advanced processors and pro-
gramming languages offer, for instance, compound arithmetic operations like
matmul and dotproduct.
But there is another side to the computational coin - the accuracy and
reliability of the computed result. Progress on this side is very important, if
not essential. Compound arithmetic operations, for instance, should always
deliver a correct result. The user should not be obliged to perform an error
analysis every time a compound arithmetic operation, implemented by the
hardware manufacturer or in the programming language, is employed.
This treatise deals with computer arithmetic in a more general sense than
usual. Advanced computer arithmetic extends the accuracy of the elementary
floating-point operations, for instance, as defined by the IEEE arithmetic
standard, to all operations in the usual product spaces of computation: the
complex numbers, the real and complex intervals, and the real and complex
vectors and matrices and their interval counterparts. The implementation of
advanced computer arithmetic by fast hardware is examined in this book.
Arithmetic units for its elementary components are described. It is shown
that the requirements for speed and for reliability do not conflict with each
other. Advanced computer arithmetic is superior to other arithmetic with
respect to accuracy, costs, and speed.
Vector processing is an important technique used to speed up computa-
tion. Difficulties concerning the accuracy of conventional vector processors
are addressed in [116,117]. See also [32] and [78]. Accurate vector processing
is subsumed in what is called advanced computer arithmetic in this treatise.
Compared with elementary floating-point arithmetic it speeds up computa-
tions considerably and it eliminates many rounding errors and exceptions. Its
implementation requires little more hardware than is needed for elementary
floating-point arithmetic. All this strongly supports the case for implementing
such advanced computer arithmetic on every CPU. With the speed comput-
ers have reached and the problem sizes that are dealt with, vector operations
should be performed with the same reliability as elementary floating-point
operations.
The picture on the cover page illustrates the contents of the book. It
shows a chip for fast Advanced Computer Arithmetic and eXtended Pre-
cision Arithmetic (ACA-XPA). Its components are symbolically indicated
on top: hardware support for 15 basic arithmetic operations including ac-
curate scalar products with different roundings and case selections for in-
terval multiplication and division. Corresponding circuits are developed in
the book.
Another picture shows friends: Ursula Kulisch flanked by the host, Satoshi Sekiguchi, and Ulrich Kulisch.
Summary.
Advances in computer technology are now so profound that the arith-
metic capability and repertoire of computers can and should be expanded.
Nowadays the elementary floating-point operations +, -, x, / give com-
puted results that coincide with the rounded exact result for any operands.
Advanced computer arithmetic extends this accuracy requirement to all
operations in the usual product spaces of computation: the real and com-
plex vector spaces as well as their interval correspondents. This enhances
the mathematical power of the digital computer considerably. A new com-
puter operation, the scalar product, is fundamental to the development of
advanced computer arithmetic.
This paper studies the design of arithmetic units for advanced com-
puter arithmetic. Scalar product units are developed for different kinds
of computers like personal computers, workstations, mainframes, supercomputers
or digital signal processors. The new expanded computational
capability is gained at modest cost. The units put a methodology into
modern computer hardware which was available on old calculators before
the electronic computer entered the scene. In general the new arithmetic
units increase both the speed of computation as well as the accuracy of
the computed result. The circuits developed in this paper show that there
is no way to compute an approximation of a scalar product faster than the
correct result can be computed.
A collection of constructs in terms of which a source language may
accommodate advanced computer arithmetic is described in the paper.
The development of programming languages in the context of advanced
computer arithmetic is reviewed. The simulation of the accurate scalar
product on existing, conventional processors is discussed. Finally the the-
oretical foundation of advanced computer arithmetic is reviewed and a
comparison with other approaches to achieving higher accuracy in com-
putation is given. Shortcomings of existing processors and standards are
discussed.
1.1 Introduction
1.1.1 Background
Using the long accumulator, the result is independent of the sequence in which
the summands are added. For details see Remark 3 on page 60 in Section 1.7.
Approximation of a continuous function by a polynomial by the method
of least squares leads to a linear system whose coefficient matrix is the Hilbert
matrix, which is extremely ill conditioned. It is well known that it is impossible to invert the Hilbert
matrix in double precision floating-point arithmetic successfully by any direct
or iterative method for dimensions greater than 11. Implementation of the
accurate scalar product in hardware also supports very fast multiple precision
arithmetic. It easily inverts the Hilbert matrix of dimension 40 to full accuracy
on a PC in a very short computing time. If increase or decrease of the precision
in a program is provided by the programming environment, the user or the
computer itself can choose the precision that optimally fits the problem.
Inversion of the Hilbert matrix of dimension 40 is impossible with quadru-
ple precision arithmetic. With it only one fixed precision is available. If one
runs out of precision in a certain problem class, one often runs out of quadru-
ple precision very soon as well. It is preferable and simpler, therefore, to
provide the principles for enlarging the precision than simply providing any
fixed higher precision. A hardware implementation of a full quadruple preci-
sion arithmetic is much more costly than an implementation of the accurate
scalar product. The latter only requires fixed-point accumulation of the prod-
ucts. On the computer, usually only one standardized floating-point format
beyond single precision is provided, namely double precision.
With increasing speed of computers, problems to be dealt with become
larger. Instead of two dimensional problems users would like to solve three
dimensional problems. Gauss elimination for a linear system of equations
requires O(n^3) operations. Large, sparse or structured linear
or non linear systems, therefore, can only be solved iteratively. The basic op-
eration of iterative methods (Jacobi method, Gauss-Seidel method, overrelax-
ation method, conjugate gradient method, Krylow space methods, multigrid
methods and others like the QR method for the computation of eigenval-
ues) is the matrix-vector multiplication which consists of a number of scalar
products. It is well known that finite precision arithmetic often worsens the
convergence of these methods. An iterative method which converges to the
solution in infinite precision arithmetic often converges much slower or even
diverges in finite precision arithmetic. The accurate scalar product is faster
than a computation in conventional floating-point arithmetic. In addition to
that it can speed up the rate of convergence of iterative methods significantly
in many cases [27,28].
For many applications it is necessary to compute the value of the deriva-
tive of a function. Newton's method in one or several variables is a typical
example for this. Modern numerical analysis solves this problem by auto-
matic or algorithmic differentiation. The so called reverse mode is a very fast
method of automatic differentiation. It computes the gradient, for instance,
with at most five times the number of operations which are needed to com-
pute the function value. The memory overhead and the spatial complexity
of the reverse mode can be significantly reduced by the exact scalar product
if this is considered as a single, always correct, basic arithmetic operation
in the vector spaces [88]. The very powerful methods of global optimization
[79], [80], [81] are impressive applications of these techniques.
Many other applications require that rigorous mathematics can be done
with the computer using floating-point arithmetic. As an example, this is
essential in simulation runs (fusion reactor, eigenfrequencies of large genera-
tors) or mathematical modelling where the user has to distinguish between
computational artifacts and genuine reactions of the model. The model can
only be developed systematically if errors resulting from the computation can
be excluded.
Nowadays computer applications are of immense variety. Any discussion of
where a dot product computed in quadruple or extended precision arithmetic
can be used to substitute for the accurate scalar product is superfluous. Since
the former can fail to produce a correct answer an error analysis is needed
for all applications. This can be left to the computer. As the scalar product
can always be executed correctly with moderate technical effort it should
indeed always be executed correctly. An error analysis thus becomes irrelevant.
This text summarizes both an extensive research activity during the past
twenty years and the experience gained through various implementations of
the entire arithmetic package on diverse processors. The text is also based
on lectures held at the Universitat Karlsruhe during the preceding 25 years.
The collection of research articles that contribute to this paper is large;
I refrain from a detailed review of them and refer the
reader to the list of references. This text synthesizes and organizes diverse
contributions into a coherent presentation. In many cases more detailed in-
formation can be obtained from original doctoral theses.
1 GAMM = Gesellschaft für Angewandte Mathematik und Mechanik
2 IMACS = International Association for Mathematics and Computers in Simulation
Floating-point arithmetic has been used since the early forties and fifties
(Zuse Z3, 1941) [11,82]. Technology in those days was poor (electromechan-
ical relays, electron tubes). It was complex and expensive. The word size of
the Z3 consisted of 24 bits. The storage provided 64 words. The four ele-
mentary floating-point operations were all that could be provided. For more
complicated calculations an error analysis was left to and put on the shoulder
of the user.
Before that time, highly sophisticated mechanical computing devices were
used. Several very interesting techniques provided the four elementary oper-
ations addition, subtraction, multiplication and division. Many of these cal-
culators were able to perform an additional fifth operation which was called
Auflaufenlassen or the running total. The input register of such a machine
had perhaps 10 or 12 decimal digits. The result register was much wider and
had perhaps 30 digits. It was a fixed-point register which could be shifted
back and forth relative to the input register. This allowed a continuous accu-
mulation of numbers and of products of numbers into different positions of
the result register. Fixed-point accumulation is thus error free. See Fig. 1.22
and Fig. 1.23 on page 62. This fifth arithmetic operation was the fastest way
to use the computer. It was applied as often as possible. No intermediate re-
sults needed to be written down and typed in again for the next operation.
No intermediate roundings or normalizations had to be performed. No error
analysis was necessary. As long as no under- or overflow occurred, which
would be obvious and visible, the result was always correct. It was indepen-
dent of the order in which the summands were added. If desired, only one
final rounding was executed at the very end of the accumulation.
This extremely useful and fast fifth arithmetic operation was not built into
the early floating-point computers. It was too expensive for the technologies
of those days. Later its superior properties were forgotten. Thus floating-
point arithmetic is still somehow incomplete.
After Zuse, the early electronic computers in the late forties and early
fifties represented their data as fixed-point numbers. Fixed-point arithmetic
was used because of its superior properties. Fixed-point addition and subtrac-
tion are error free. Fixed-point arithmetic with a rather limited word size,
however, imposed a scaling requirement. Problems had to be preprocessed
by the user so that they could be accommodated by this fixed-point number
representation. With increasing speed of computers, the problems that could
be solved became larger and larger. The necessary preprocessing soon became
an enormous burden.
Thus floating-point arithmetic became generally accepted. It largely eliminated
this burden. A scaling factor is appended to each number in floating-
point representation. The arithmetic itself takes care of the scaling. An ex-
ponent addition (subtraction) is executed during multiplication (division).
It may result in a big change in the value of the exponent. But multiplication
and division are comparatively harmless in this respect; the critical operations are
addition and subtraction, as the following sums of five summands, each evaluated
from left to right, illustrate:
10^20 + 17 - 10 + 130 - 10^20
10^20 - 10 + 130 - 10^20 + 17
10^20 + 17 - 10^20 - 10 + 130
10^20 - 10 - 10^20 + 130 + 17
10^20 - 10^20 + 17 - 10 + 130
10^20 + 17 + 130 - 10^20 - 10
A conventional computer using the data format double-precision of the
IEEE floating-point arithmetic standard returns the values 0, 17, 120, 147,
137, -10. These errors come about because the floating-point arithmetic is
unable to cope with the digit range required with this calculation. Notice
that the data cover less than 4% of the digit range of the data format double
precision!
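The effect is easy to reproduce. A small C program along the following lines (assuming strict IEEE double precision evaluation without extended intermediate registers) prints exactly these six values:

#include <stdio.h>

int main(void)
{
    /* the six orderings of the five summands, evaluated from left to right */
    double s1 = 1e20 + 17.0 - 10.0 + 130.0 - 1e20;
    double s2 = 1e20 - 10.0 + 130.0 - 1e20 + 17.0;
    double s3 = 1e20 + 17.0 - 1e20 - 10.0 + 130.0;
    double s4 = 1e20 - 10.0 - 1e20 + 130.0 + 17.0;
    double s5 = 1e20 - 1e20 + 17.0 - 10.0 + 130.0;
    double s6 = 1e20 + 17.0 + 130.0 - 1e20 - 10.0;
    /* prints 0 17 120 147 137 -10; only the fifth ordering happens to give
       the exact result 137 */
    printf("%g %g %g %g %g %g\n", s1, s2, s3, s4, s5, s6);
    return 0;
}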
3. Compute the scalar product of the two vectors a and b with five com-
ponents each:
Problems that can be solved by computers become larger and larger. To-
day fast computers are able to execute several billion floating-point operations
in each second. This number exceeds the imagination of any user. Traditional
error analysis of numerical algorithms is based on estimates of the error of
each individual arithmetic operation and on the propagation of these errors
through a complicated algorithm. It is simply no longer possible to expect
that the error of such computations can be controlled by the user. There
remains no alternative to further develop the computer's arithmetic and to
furnish it with the capability of control and validation of the computational
process.
Computer technology is extremely powerful today. It allows solutions
which even an experienced computer user may be totally unaware of. Floating-
point arithmetic which may fail in simple calculations, as illustrated above,
is no longer adequate to be used exclusively in computers of such gigantic
speed for huge problems. The reintroduction of the fifth arithmetic operation,
the accurate scalar product, into computers is a step which is long overdue.
A central and fundamental operation of numerical analysis which can be
executed correctly with only modest technical effort should indeed always
be executed correctly and no longer only approximately. With the accurate
scalar product all the nice properties which have been listed in connection
with the old mechanical calculators return to the modern digital computer.
The accurate scalar product is the fastest way to use the computer. It should
be applied as often as possible. No intermediate results need to be stored and
read in again for the next operation. No intermediate roundings and normal-
izations have to be performed. No intermediate over- or underflow can occur.
m = Σ_{i=1}^{l} d_i b^{-i} = d_1 b^{-1} + d_2 b^{-2} + ... + d_l b^{-l}.
The d_i are the digits of the mantissa. They have the property d_i ∈
{0, 1, ..., b-1} for all i = 1(1)l and d_1 ≠ 0. Without this last condition
floating-point numbers does not contain zero. For a unique representation of
zero we assume the mantissa and the exponent to be zero. Thus a floating-
point system depends on the four constants b, l, e1 and e2. We denote it
by R = R(b, l, e1, e2). Occasionally we shall use the abbreviations sign(x),
mant(x) and exp(x) to denote the sign, mantissa and exponent of x respec-
tively.
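For b = 2 this normalization (the first digit of the mantissa is nonzero, so 1/2 ≤ m < 1) is exactly the one delivered by the C library function frexp; a small illustration of sign(x), mant(x) and exp(x) in these terms:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = -6.25;                     /* -6.25 = -(0.78125 * 2^3)       */
    int e;
    double m = frexp(fabs(x), &e);        /* m in [0.5, 1), |x| = m * 2^e   */
    int s = (x < 0.0) ? -1 : 1;
    printf("sign = %d, mant = %g, exp = %d\n", s, m, e);  /* -1, 0.78125, 3 */
    return 0;
}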
Nowadays the elementary floating-point operations +, -, x, / give com-
puted results that coincide with the rounded exact result of the operation for
any operands. See, for instance, the IEEE Arithmetic Standards 754 and 854,
[114,115]. Advanced computer arithmetic extends this accuracy requirement
to all operations in the usual product spaces of computation: the complex
numbers, the real and complex vectors, real and complex matrices, real and
complex intervals as well as the real and complex interval vectors and interval
matrices.
A careful analysis and a general theory of computer arithmetic [60,62]
show that all arithmetic operations in the computer representable subsets of
these spaces can be realized on the computer by a modular technique as soon
as fifteen fundamental operations are made available at a low level, possibly
by fast hardware routines. These fifteen operations are the operations
+, -, ×, / and the scalar product, each provided with the three roundings
to nearest, downwards and upwards. In particular, for floating-point vectors
a = (a_i) and b = (b_i) the exact and the rounded scalar products are

s := Σ_{i=1}^{n} a_i × b_i = a_1 × b_1 + a_2 × b_2 + ... + a_n × b_n

and

c := ◯ Σ_{i=1}^{n} a_i × b_i = ◯(a_1 × b_1 + a_2 × b_2 + ... + a_n × b_n) = ◯ s,

where all additions and multiplications are the operations for real numbers
and ◯ is a rounding symbol representing, for instance, rounding to nearest,
rounding towards zero, rounding upwards or downwards.
Since ai and bi are floating-point numbers with a mantissa of l digits,
the products ai x bi in the sums for s and c are floating-point numbers with
a mantissa of 2l digits. The exponent range of these numbers doubles also,
i. e. ai x bi ∈ R(b, 2l, 2e1, 2e2). All these summands can be expressed in a
fixed-point register of length 2e2 + 2l + 2|e1| without loss of information,
see Fig. 1.1. If one of the summands has an exponent 0, its mantissa can be
expressed in a register of length 2l. If another summand has exponent 1, it
can be expressed with exponent 0, if the register provides further digits on
the left and the mantissa is shifted one place to the left. An exponent -1 in
one of the summands requires a corresponding shift to the right. The largest
exponents in magnitude that may occur in the summands are 2e2 and 2|e1|.
So all summands can be expressed with exponent 0 in a fixed-point register
of length 2e2 + 2l + 2|e1| without loss of information.
bound for k. Thus, the long accumulator and the long adder consist of L =
k + 2e2 + 2l + 2|e1| digits of base b. The summands are shifted to the proper
position and added. See Fig. 1.1. Fast carry resolution techniques will be
discussed later. The final sums s and c are supposed to be in the single
exponent range e1 ≤ e ≤ e2, otherwise c is not representable as a floating-
point number and the problem has to be scaled.
Fig. 1.1. Long accumulator with long shift for accurate scalar product accumulation.
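For the IEEE double precision format the register length is easily checked with a few lines of integer arithmetic. The value of k below is not derived here; it is simply chosen so that the total comes out as the 4288 bit (67 word) accumulator used in the later sections:

#include <stdio.h>

int main(void)
{
    int l = 53, e1 = -1022, e2 = 1023;      /* IEEE double precision         */
    int base = 2 * e2 + 2 * l + 2 * (-e1);  /* 2e2 + 2l + 2|e1| = 4196 bits  */
    int k = 92;                             /* assumed guard digits          */
    printf("L = %d bits = %d words of 64 bits\n", base + k, (base + k) / 64);
    return 0;                /* prints: L = 4288 bits = 67 words of 64 bits  */
}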
In a scalar product computation the summands are all of length 2l. So actually
the long adder and long accumulator may be replaced by a short adder and
a local store of size L on the arithmetic unit. The local store is organized in
words of length l or l', where l' is a power of 2 and slightly larger than l (for
instance l = 53 bits and l' = 64 bits). Since the summands are of length 2l,
they fit into a part of the local store of length 3l'. This part of the store is
determined by the exponent of the summand. We load this part of the store
into an accumulator of length 3l'. The summand mantissa is placed in a shift
register and is shifted to the correct position as determined by the exponent.
Then the shift register contents are added to the contents of the accumulator.
Fig. 1.2.
An addition into the accumulator may produce a carry. As a simple
method to accommodate carries, we enlarge the accumulator on its left end
by a few more digit positions. These positions are filled with the correspond-
ing digits of the local store. If not all of these digits equal b - 1 in case of
addition (or zero in case of subtraction), they will accommodate a possible
carry of the addition (or borrow in case of subtraction). Of course, it is possi-
ble that all these additional digits are b - 1 (or zero). In this case, a loop can
be provided that takes care of the carry and adds it to (subtracts it from)
the next digits of the local store. This loop may need to be traversed several
times. Other carry (borrow) handling processes are possible and will be dealt
with later. This completes our sketch of the second method for an accurate
computation of scalar products using a short adder and some local store on
the arithmetic unit. See Fig. 1.2.
Fig. 1.2. Short adder and local store on the arithmetic unit for accurate scalar
product accumulation.
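A minimal software model of this second method might look as follows (names are illustrative; it assumes the 106 bit product has already been multiplied, shifted and split into three 64 bit chunks, chunk[0] least significant, and that the start word pos has been derived from the exponent):

#include <stdint.h>

#define LA_WORDS 67             /* 67 x 64 = 4288 bits, as in the later sections */

static void la_add(uint64_t la[LA_WORDS], const uint64_t chunk[3], int pos)
{
    unsigned carry = 0;
    for (int i = 0; i < 3; i++) {                      /* add the three portions */
        uint64_t old = la[pos + i];
        uint64_t sum = old + chunk[i] + carry;
        carry = (sum < old) || (carry && sum == old);  /* carry out of this word */
        la[pos + i] = sum;
    }
    for (int i = pos + 3; carry && i < LA_WORDS; i++) {    /* the carry loop     */
        la[i] += 1;
        carry = (la[i] == 0);    /* the loop continues only while words wrap     */
    }
}

Subtraction of a summand works analogously, with a borrow loop in place of the carry loop.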
1.2.3 Remarks
These samples show that the register size (at a time where memory space
is measured in gigabits and gigabytes) is modest in all cases. It grows with
the exponent range of the data format. If this range should be extremely
large, as for instance in case of an extended precision floating-point format,
only an inner part of the register would be supported by hardware. The outer
parts which then appear very rarely could be simulated in software. The long
data format of the /370 architecture covers in decimal a range from about
10^-75 to 10^75 which is very modest. This architecture dominated the market
for more than 20 years and most problems could conveniently be solved with
machines of this architecture within this range of numbers.
Remark 4: Multiplication is often considered to be more complex than ad-
dition. In modern computer technology this is no longer the case. Very fast
circuits for multiplication using carry-save-adders (Wallace tree) are available
and common practice. They nearly equalize the time to compute a sum and
a product of two floating-point numbers. In a scalar product computation
usually a large number of products is to be computed. The multiplier is able
to produce these products very quickly. In a balanced scalar product unit the
accumulator should be able to absorb a product in about the same time the
multiplier needs to produce it. Therefore, measures have to be taken to equal-
ize the speed of both operations. Because of a possible long carry propagation
the accumulation seems to be the more complicated process.
Remark 5: Techniques to implement the optimal scalar product on machines
which do not provide enough register space on the arithmetic logical unit will
be discussed in Section 1.6 later in this paper.
Both solutions A and B for our problem which we sketched above seem to
be slow at first glance. Solution A requires a long shift which is necessarily
slow. The addition over perhaps 4000 bits is slow also, in particular if a long
carry propagation is necessary. For solution B, five steps have to be carried
out: 1. read from the local store, 2. perform the shift, 3. add the summand,
4. resolve the carry, possibly by loops, and 5. write the result back into the
local store. Again the carry resolution may be very time consuming.
As a first step to speed up solutions A and B, we discuss a technique which
allows a very fast carry resolution. Actually a possible carry can already be
accommodated while the product, the addition of which might produce a
carry, is still being computed.
Both solutions A and B require a long register in which the final sum in a
scalar product computation is built up. Henceforth we shall call this register
the Long Accumulator and abbreviate it as LA. It consists of L bits. LA is a
fixed-point register wherein any sum of floating-point numbers and of simple
products of floating-point numbers can be represented without error.
To be more specific we now assume that we are using the double precision
data format of the IEEE-arithmetic standard 754. See case c) of remark 3.
As soon as the principles are clear, a transfer of the technique to other data
formats is easy. Thus, in particular, the mantissa consists of l = 53 bits.
We assume additionally that the LA that appears in solutions A and B is
subdivided into words of l' = 64 bits. The mantissa of the product ai x bi then
is 106 bits wide. It touches at most three consecutive 64-bit words of the LA
which are determined by the exponent of the product. A shifter then aligns
the 106 bit product into the correct position for the subsequent addition into
the three consecutive words of the LA. This addition may produce a carry
(or a borrow in case of subtraction). The carry is absorbed by that next more
significant 64 bit word of the LA in which not all digits are 1 (or 0 in case of
subtraction). Fig. 1.3, a). For a fast detection of this word two information
bits or flags are appended to each long accumulator word. Fig. 1.3, b). One
of these bits, the all bits 1 flag, is set to 1 if all 64 bits of the register word
are 1. This means that a carry will propagate through the entire word. The
other bit, the all bits 0 flag, is set to 0, if all 64 bits of the register word are
0. This means that in case of subtraction a borrow will propagate through
the entire word.
During the addition of a product into the three consecutive words of the
LA, a search is started for the next more significant word of the LA where
the all bits 1 flag is not set. This is the word which will absorb a possible
carry. If the addition generates a carry, this word must be incremented by one
and all intermediate words must be changed from all bits 1 to all bits O. The
easiest way to do this is simply to switch the flag bits from all bits 1 to all
bits 0 with the additional semantics that if a flag bit is set, the appropriate
constant (all bits 0 or all bits 1) must be generated instead of reading the LA
word contents when reading a LA word, Fig. 1.3, b). Borrows are handled in
an analogous way.
Fig. 1.3. a) Carry generation and the carry skip area for the local fixed-point
addition; b) carry resolution address and carry start address with the two flags
appended to each LA word.
This carry handling scheme allows a very fast carry resolution. The gen-
eration of the carry resolution address is independent of the addition of the
product, so it can be performed in parallel. At the same time, a second set of
flags is set up for the case that a carry is generated. If the latter is the case,
the carry is added into the appropriate word and the second set of flags is
copied into the former flag word.
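In software the flag mechanism can be modelled roughly as follows (a sketch only; in the hardware the search for the carry resolution address and the incrementation run in parallel with the multiplication, as described above). Here all_ones[i] is nonzero when LA word i is known to consist of ones only, independent of the exact flag encoding used in the hardware:

#include <stdint.h>

#define LA_WORDS 67

static void resolve_carry(uint64_t la[LA_WORDS],
                          unsigned char all_ones[LA_WORDS], int top)
{
    int j = top + 1;                        /* top = most significant word of    */
    while (j < LA_WORDS && all_ones[j])     /* the addition; skip the carry      */
        j++;                                /* skip area                          */
    if (j == LA_WORDS)                      /* cannot happen if enough guard     */
        return;                             /* digits k are provided             */
    for (int i = top + 1; i < j; i++) {     /* 111...1 + 1 becomes 000...0       */
        la[i] = 0;
        all_ones[i] = 0;
    }
    la[j] += 1;                             /* this word absorbs the carry       */
    all_ones[j] = (la[j] == ~UINT64_C(0));  /* keep its flag up to date          */
}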
Simultaneously with the multiplication of the mantissa of ai and bi their
exponents are added. This is just an eleven bit addition. The result is available
very quickly. It delivers the exponent of the product and the address for
its addition. By looking at the flags, the carry resolution address can be
determined and the carry word can already be incremented/decremented as
soon as the exponent of the product is available. It could be available before
the multiplication of the mantissas is finished. If the accumulation of the
product then produces a carry, the incremented/decremented carry word is
written back into the LA, otherwise nothing is changed.
This very fast carry resolution technique could be used in particular for
the computation of short scalar products which occur, for instance, in the
computation of the real and imaginary part of a product of two complex
floating-point numbers. A long scalar product, however, is usually performed
in a pipeline. Then, during the execution of a product, the former product
1.3 High-Performance Scalar Product Units (SPU)
Here we consider a computer which is able to read the data into the arithmetic
logical unit and/or the SPU in portions of 32 bits. The personal computer is
a typical representative of this kind of computer.
Solution A with an adder and a shifter for the full LA of 4288 bits would
be too expensive. So the SPU for these computers is built upon solution B
Fig. 1.4. Scalar product unit with 32 bit data supply: 53 x 53 bit multiplication
by a 27 x 27 bit multiplier; the shifted product is added into the LA with carry
resolution.
(see Fig. 1.4). For the computation of the product ai x bi the two factors
ai and bi are to be read. Both consist of 64 bits. Since the data can only
be read in 32 bit portions, the unit has to read 4 times. We assume that
with the necessary decoding this can be done in eight cycles. See Fig. 1.5.
This is rather slow and turns out to be the bottleneck for the whole pipeline.
In a balanced SPU the multiplier should be able to produce a product and
the adder should be able to accumulate the product in about the same time
the unit needs to read the data. Therefore, it suffices to provide a 27 x 27
bit multiplier. It computes the 106 bit product of the two 53 bit mantissas
of ai and bi by 4 partial products. The subsequent addition of the product
into the three consecutive words of the LA is performed by an adder of 64
bits. The appropriate three words of the LA are loaded into the adder one
after the other and the appropriate portion of the product is added. The sum
is written back into the same word of the LA where the portion has been
read from. A 64 out of 106 bit shifter must be used to align the product
onto the relevant word boundaries. See Fig. 1.4. The addition of the three
portions of the product into the LA may cause a carry. The carry is absorbed
by the next more significant LA word that does not consist of ones only,
using the flag mechanism described in Section 1.2.4.
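The splitting of the multiplication is easily written down. The following C sketch (illustrative only; the two 53 bit mantissas are passed as integers) assembles the 106 bit product from four partial products, each of which fits a 27 x 27 bit multiplier:

#include <stdint.h>

static void mul53(uint64_t ma, uint64_t mb, uint64_t *hi, uint64_t *lo)
{
    uint64_t aL = ma & ((1ULL << 27) - 1), aH = ma >> 27;  /* 27 low, 26 high bits */
    uint64_t bL = mb & ((1ULL << 27) - 1), bH = mb >> 27;

    uint64_t ll = aL * bL;                       /* the four partial products */
    uint64_t lh = aL * bH;
    uint64_t hl = aH * bL;
    uint64_t hh = aH * bH;

    /* assemble  ma*mb = hh*2^54 + (lh+hl)*2^27 + ll  as a 106 bit result     */
    uint64_t mid  = lh + hl;                     /* < 2^54, cannot overflow   */
    uint64_t lo64 = ll, hi64 = 0, t;

    t = mid << 27;  lo64 += t;  hi64 += (lo64 < t);  hi64 += mid >> 37;
    t = hh  << 54;  lo64 += t;  hi64 += (lo64 < t);  hi64 += hh  >> 10;

    *hi = hi64;                                  /* upper 42 bits of the product */
    *lo = lo64;                                  /* lower 64 bits of the product */
}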
Fig. 1.5. Pipeline for the accumulation of scalar products on computers with 32
bit data bus.
Fig. 1.6. Block diagram for a SPU with 32 bit data supply and sequential addition
into SPU
same time. See Fig. 1.5. Fig. 1.6 shows a block diagram for a SPU with 32
bit data bus.
The sum of the exponents of ai and bi delivers the exponent of the product
ai x bi. It consists of 12 bits. The 6 low order (less significant) bits of this sum
are used to perform the shift. The more significant bits of the sum deliver the
LA address to which the product ai x bi has to be added. So the originally
very long shift is split into a short shift and an addressing operation. The
shifter performs a relatively short shift operation. The addressing selects the
three words of the LA for the addition of the product.
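In software terms this splitting of the originally long shift reads roughly as follows (illustrative only; the offset 2044 = 2|e1| that maps the most negative product exponent to position 0 is an assumption of this sketch):

static void product_position(int e, int *shift, int *address)
{
    int pos  = e + 2044;     /* e = exponent of the product, the sum of the   */
                             /* two operand exponents; pos is non-negative    */
    *shift   = pos % 64;     /* 6 low order bits: the short shift             */
    *address = pos / 64;     /* remaining bits: start word address in the LA  */
}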
The LA RAM needs only one address decoder to find the start address
for an addition. The two more significant parts of the product are added to
the contents of the two LA words with the two subsequent addresses. The
carry logic determines the word which absorbs the carry. All these address
decodings can be hard wired. The result of each one of the four additions is
written back into the same LA word to which the addition has been executed.
The two carry flags appended to each accumulator word are indicated in Fig.
1.6. In practice the flags are kept in separate registers.
We stress the fact that in the circuit just discussed virtually no specific
computing time is needed for the execution of the arithmetic. In the pipeline
the arithmetic is performed in the time which is needed to read the data
into the SPU. Here, we assumed that this requires 8 cycles. This allows both
the multiplication and the accumulation to be performed very economically
and sequentially by a 27 x 27 bit multiplier and a 64 bit adder. Both the
multiplication and the addition are themselves performed in a pipeline. The
arithmetic overlaps with the loading of the data into the SPU.
There are processors on the market, where the data supply to the arith-
metic unit or the SPU is much faster. We discuss the design of a SPU for
such processors in the next section and in Section 1.5.
Now we consider a computer which is able to read the data into the arithmetic
logical unit and/or the SPU in portions of 64 bits. Fast workstations or
mainframes are typical for this kind of computer.
Now the time to perform the multiplication and the accumulation over-
lapped in pipelines as before is no longer available. In order to keep the
execution time for the arithmetic within the time the SPU needs to read the
data, we have to invest in more hardware. For the multiplication a 53 x 53
bit multiplier must now be used. The result is still 106 bits wide. It could
touch three 64 bit words of the LA. But the addition of the product and the
carry resolution now have to be performed in parallel.
The 106 bit summand may fit into two instead of three consecutive 64
bit words of the LA. A closer look at the details shows that the 22 least
significant bits of the three consecutive LA words are never changed by an
addition of the 106 bit product. Thus the adder needs to be 170 bits wide
only. Fig. 1.7 shows a sketch for the parallel accumulation of a product.
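In numbers: the three consecutive LA words comprise 3 x 64 = 192 bits, but a 106 bit summand shifted by at most 64 bit positions within its leading word reaches down over at most 106 + 64 = 170 bit positions, so the lowest 192 - 170 = 22 bits are never touched and a 170 bit adder suffices.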
In the circuit a 106 to 170 bit shifter is used. The four additions are to
be performed in parallel. So four read/write ports are to be provided for the
LA RAM. A sophisticated logic must be used for the generation of the carry
resolution address, since this address must be generated very quickly. Again
the LA RAM needs only one address decoder to find the start address for an
addition. The more significant parts of the product are added to the contents
Fig. 1.7. Parallel accumulation of a product into the LA.
of the two LA words with the two subsequent addresses. A tree structured
carry logic now determines the LA word which absorbs the carry. A very fast
hardwired multi-port driver can be designed which allows all 4 LA words to
be read into the adder in one cycle.
Fig. 1.8 shows the pipeline for this kind of addition. In the figure we
assume that 2 machine cycles are needed to decode and read one 64 bit word
into the SPU.
Fig. 1.9 shows a block diagram for a SPU with a 64 bit data bus and
parallel addition.
Fig. 1.8. Pipeline for the accumulation of scalar products on computers with 64
bit data bus and parallel addition.

We emphasize again that virtually no computing time is needed for the
execution of the arithmetic. In a pipeline the arithmetic is performed in the
time which is needed to read the data into the SPU. Here, we assume that
with the necessary decoding, this requires 4 cycles for the two 64 bit factors
ai and bi for a product. To match the shorter time required to read the data,
more hardware has to be invested for the multiplier and the adder.
If the technology is fast enough it may be reasonable to provide a 256
bit adder instead of the 170 bit adder. An adder width of a power of 2 may
simplify the shift operation as well as the address decoding. The lower bits of
the exponent of the product control the shift operation while the higher bits
are directly used as the start address for the accumulation of the product
into the LA.
The two flag registers appended to each accumulator word are indicated
in Fig. 1.9 again. In practice the flags are kept in separate registers.
Fig. 1.9. Block diagram for a SPU with 64 bit data bus and parallel addition into
the SPU.

1.4.1 Rounding
If not processed any further the correct result of a scalar product compu-
tation usually has to be rounded into a floating-point number or a floating-
point interval. The flag bits that are used for the fast carry resolution can
be used for the rounding of the LA contents also. By looking at the flag
bits, the leading result word in the accumulator can easily be identified. This
and the next LA word are needed to compose the mantissa of the result.
This 128 bit quantity must then be shifted to form a normalized mantissa.
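A software analogue of this first step is simply a scan from the most significant LA word downwards (illustrative only; all_zero[i] is nonzero when word i is known to contain only zeros, standing in for the all bits 0 flag):

#define LA_WORDS 67

static int leading_word(const unsigned char all_zero[LA_WORDS])
{
    for (int i = LA_WORDS - 1; i >= 0; i--)
        if (!all_zero[i])        /* this word contains the leading result bits */
            return i;
    return -1;                   /* the whole LA is zero                       */
}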
There are applications which make it desirable to provide more than one long
accumulator on the SPU. If, for instance, the components of the two vectors
a = (ai) and b = (b i ) are complex floating-point numbers, the scalar product
a· b is also a complex floating-point number. It is obtained by accumulating
the real and imaginary parts of the product of two complex floating-point
numbers. The formula for the product of two complex floating-point numbers
For the SPU the following 10 instructions for the LA are recommended. These
10 low level instructions are most natural and inevitable as soon as the idea
of the long accumulator for the accurate scalar product has been chosen.
They are low level capabilities to support the high level instructions devel-
oped in the next section, and are based on preceding experience with these
in the XSC-languages since 1980. Very similar instructions were provided by
the processor developed in [93]. Practically identical instructions were used in
[109] to support ACRITH and ACRITH-XSC [108,110,112]. These IBM pro-
gram products have been developed at the author's institute in collaboration
with IBM.
The 10 low level instructions for the LA are:
1. clear the LA,
2. add a product to the LA,
3. add a floating-point number to the LA,
4. subtract a product from the LA,
5. subtract a floating-point number from the LA,
6. round the LA contents into a floating-point register,
7. store the LA contents into a memory operand,
8. load accumulator contents from a memory operand into the LA,
9. add a stored accumulator to the LA,
10. subtract a stored accumulator from the LA.
The clear instruction can be performed by setting all all bits 0 flags to
0. The load and store instructions are performed by using the load/store
instructions of the processor. For the add, subtract and round instructions
the following denotations could be used. There the prefix sp identifies SPU
instructions. In denotes the floating-point format that is used and will be db
for IEEE double. In all SPU instructions, the LA is an implicit source and
destination operand. The number of the instruction above is repeated in the
following coding which could be used to realize it.
2. spadd In src1, src2,
multiply the numbers in the given registers and add the product to the
LA.
3. spadd In src,
add the number in the given register to the LA.
4. spsub In src1, src2,
multiply the numbers in the given registers and subtract the product
from the LA.
5. spsub In src,
subtract the number in the given register from the LA.
6. spstore In.rd dest,
get LA contents and put the rounded value into the destination register.
In the instruction rd controls the rounding mode that is used when the
LA contents is stored in a floating-point register. It is one of the following:
rn round to nearest,
rz round towards zero,
rp round upwards, i. e. towards plus infinity,
rm round downwards, i. e. towards minus infinity.
7. spstore dest,
get LA contents and put its value into the destination memory operand.
8. spload src,
load accumulator contents from the given memory operand into the LA.
9. spadd src,
the contents of the accumulator at the location src are added to the
contents of the accumulator in the processor.
10. spsub src,
the contents of the accumulator at the location src are subtracted from
the contents of the accumulator in the processor.
x:= x +y *z
the double length product of y and z is added to the variable x of type
dotprecision and its new value is assigned to x.
The scalar product of two vectors a = (ai) and b = (b i ) is now easily
implemented with a variable x of type dotprecision as follows:
x:= 0;
for i := 1 to n do x := x + a[i] * b[i];
y:=x;
The last statement y := x rounds the value of the variable x of type dotpre-
cision into the variable y of type real by applying the standard rounding of
the computer. y then has the value of the rounded scalar product, which is within
a single rounding error of the exact scalar product a · b.
For example, the method of defect correction or iterative refinement re-
quires highly accurate computation of expressions of the form
a·b-c·d
with vectors a, b, c, and d. Employing a variable x of type dotprecision, this
expression can now be programmed as follows:
x:= 0;
for i := 1 to n do x := x + a[i] * b[i];
for i := 1 to n do x := x - c[i] * d[i];
y:=x;
In [44] the basic ideas have been developed for a general data format. How-
ever, to be very specific we discuss here a circuit for the double precision
format of the IEEE-arithmetic standard 754. The word size is 64 bits. The
mantissa has 53 bits and the exponent 11 bits. The exponent covers a range
from -1022 to +1023. The LA has 4288 bits. We assume again that the scalar
product computation can be subdivided into a number of independent steps
like
a) read ai and bi ,
b) compute the product ai x bi ,
c) add the product to the LA.
Now by assumption the SPU can read the two factors ai and bi simul-
taneously in one portion. We call the time that is needed for this a cycle.
Figure: summand register, parallel adder and accumulator, and their segmented
counterparts (segmented summand, segmented adder, segmented accumulator).
wherever a carry is left. So in an average case there will only be very few
carries left at the end of the accumulation and a few additional cycles will
suffice to absorb the remaining carries. Thus, segmenting the adder enables it
to keep up in speed with steps a) and b) and to read and process a summand
in each cycle.
The long shift of the 106 bit summand is slow also. It is speeded up by
a matrix shaped arrangement of the adders. Only a few, let us assume here
four of the partial adders, are placed in a row. We begin with the four least
significant adders. The four next more significant adders are placed directly
beneath of them and so on. The most significant adders form the last row.
The rows are connected as shown in Fig. 1.11.
In our example, where we have 67 adders of 64 bits, 17 rows suffice to
arrange the entire summing matrix. Now the long shift is performed as fol-
lows: The summand of 106 bits carries an exponent. In a fast shifter of 106
to 256 bits it is shifted into a position where its most significant digit is
placed directly above the position in the long adder which carries the same
exponent identification E. The remaining digits of the summand are placed
immediately to its right. Now the summing matrix reads this summand into
the S-registers (summand registers) of every row. The addition is executed
in that row where the exponent identification coincides with that of the sum-
mand.
It may happen that the most significant digit of the summand has to be
shifted so far to the right that the remaining digits would hang over at the
right end of the shifter. These digits then are reinserted at the left end of the
shifter by a ring shift. If now the more significant part of the summand is
added in row r, its less significant part will be added in row r - 1.
By this matrix shaped arrangement of the adders, the unit can perform
both a shift and an addition in a single cycle. The long shift is reduced to
a short shift of 106 to 256 bits which is fast. The remaining shift happens
automatically by the row selection for the addition in the summing matrix.
Every summand carries an exponent which in our example consists of
12 bits. The lower part of the exponent, i. e. the 8 least significant digits,
determine the shift width and with it the selection of the columns in the
summing matrix. The row selection is obtained by the 4 most significant
bits of the exponent. This complies roughly with the selection of the adding
position in two steps by the process of Fig. 1.2. The shift width and the row
selection for the addition of a product ai x bi to the LA are known as soon
as the exponent of the product has been computed. Since the exponents of
ai and bi consist of 11 bits only, the result of their addition is available very
quickly. So while the multiplication of the mantissas is still being executed
the shifter can already be switched and the addresses of the LA words for
the accumulation of the product ai x bi can be selected.
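In software terms the selection reads roughly as follows (illustrative only; the offset 2044 = 2|e1| that makes the position non-negative is again an assumption of this sketch):

static void matrix_position(int e, int *row, int *shift_in_row)
{
    int pos       = e + 2044;     /* bit position of the product in the LA      */
    *shift_in_row = pos & 0xFF;   /* 8 least significant bits: the shift width  */
    *row          = pos >> 8;     /* 4 most significant bits: row of the matrix */
}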
The 106 bit summand touches at most three consecutive words of the
LA. The addition of the summand is executed by these three partial adders.
Fig. 1.11. Block diagram of a SPU with long adder for a 64 bit data word and 128
bit data bus.
Each of these adders can produce a carry. The carry of the leftmost of these
partial adders can with high probability be absorbed, if the addition always
is executed over four adders and the fourth adder then is the next more
significant one. This can reduce the number of carries that have to be resolved
during future steps of the accumulation and in particular at the end.
In each step of the accumulation an addition only has to be activated in
the selected row of adders and in those adders where a non zero carry is wait-
ing to be absorbed. This adder selection can reduce the power consumption
for the accumulation step significantly.
The carry resolution method that has been discussed so far is quite nat-
ural. It is simple and does not require particular hardware support. If long
scalar products are being computed it works very well. Only at the end of
the accumulation, if no more summands are coming, a few additional cycles
may be required to absorb the remaining carries. Then a rounding can be ex-
ecuted. However, this number of additional cycles for the carry resolution at
the end of the accumulation, although it is small in general, depends on the
data and is unpredictable. In case of short scalar products the time needed
for these additional cycles may be disproportionately high and indeed exceed
the addition time.
With the fast carry resolution mechanism that has been discussed in Sec-
tion 1.2.4 these difficulties can be overcome. At the cost of some additional
hardware all carries can be absorbed immediately at each step of the accu-
mulation. The method is shown in Fig. 1.11 also. Two flag registers for the
all bits 0 and the all bits 1 flags are shown at the left end of each partial accu-
mulator word in the figure. The addition of the 106 bit products is executed
by three consecutive partial adders. Each one of these adders can produce a
carry. The carries between two of these adjacent adders can be avoided, if all
partial adders are built as Carry Select Adders. This increases the hardware
costs only moderately. The carry registers between two adjacent adders then
are no longer necessary. 5 The flags indicate which one of the more significant
LA words will absorb the left most carry. During an addition of a product
only these 4 LA words are changed and only these 4 adders need to be acti-
vated. The addresses of these 4 words are available as soon as the exponent
of the summand ai x bi has been computed. During the addition step now
simultaneously with the addition of the product the carry word can be in-
cremented (decremented). If the addition produces a carry the incremented
word will be written back into the local accumulator. If the addition does not
produce a carry, the local accumulator word remains unchanged. Since we
have assumed that all partial adders are built as Carry Select Adders this fi-
nal carry resolution scheme requires no additional hardware. Simultaneously
with the incrementation/decrementation of the carry word a second set of
5 This is the case in Fig. 1.12 where a similar situation is discussed. There all
adders are supposed to be carry select adders.
flags is set up for the case that a carry is generated. In this case the second
set of flags is copied into the former word.
The accumulators that belong to partial adders in Fig. 1.11 are denoted by
AC. Beneath them a small memory is indicated in the figure. It can be used to
save the LA contents very quickly in case that a program with higher priority
interrupts the computation of a scalar product and requires the unit for itself.
However, the author is of the opinion that the scalar product is a fundamental
and basic arithmetic operation which should never be interrupted. The local
memory on the SPU can be used for fast execution of scalar products in the
case of complex and of interval arithmetic.
In Section 1.4.2 we have discussed applications like complex arithmetic
or interval arithmetic which make it desirable to provide more than one LA
on the SPU. The local memory on the SPU shown in Fig. 1.11 serves this
purpose.
In Fig. 1.11 the registers for the summands carry an exponent identifica-
tion denoted by E. This is very useful for the final rounding. The usefulness
of the flags for the final rounding has already been discussed. They also serve
for fast clearing of the accumulator.
The SPU which has been discussed in this section seems to be costly.
However, it consists of a large number of identical parts and it is very regular.
This allows a highly compact design. Furthermore the entire unit is simple.
No particular exception handling techniques are to be dealt with by the
hardware. Vector computers are the most expensive. A compact and simple
solution, though expensive, is justified for these systems.
a) read ai and bi ,
b) compute the product ai x bi ,
c) add the product to the LA.
Each of the mantissas of ai and bi has 24 bits. Their product has 48
bits. It can be computed very fast by a 24 x 24 bit multiplier using standard
techniques like Booth-Recoding and Wallace tree. The addition of the two 8
bit exponents of ai and bi delivers the exponent of the product consisting of
9 bits.
The LA consists of 10 words of 64 bits. The 48 bit mantissa of the product
touches at most two of these words. The addition of the product is executed
by the corresponding two consecutive partial adders. Each of these two adders
can produce a carry. The carry between the two adjacent adders can immedi-
ately be absorbed if all partial adders are built as Carry Select Adders again.
The carry of the more significant of the two adders will be absorbed by one of
the more significant 64 bit words of the LA. The flag mechanism (see Section
1.2.4) indicates which one of the LA words will absorb a possible carry. So
during an addition of a summand the contents of at most 3 LA words are
changed and only these three partial adders need to be activated. The ad-
dresses of these words are available as soon as the exponent of the summand
ai x bi has been computed. During the addition step, simultaneously with the
addition of the product, the carry word can be incremented (decremented).
If the addition produces a carry the incremented word will be written back
into the local accumulator. If the addition does not produce a carry, the lo-
cal accumulator word remains unchanged. Since all partial adders are built
as Carry Select Adders no additional hardware is needed for the carry reso-
lution. Simultaneously with the incrementation/ decrementation of the carry
word a second set of flags is set up for the case that a carry is generated.
If the latter is the case the second set of flags is copied into the former flag
word.
Details of the circuitry just discussed are summarized in Fig. 1.12. The
figure is highly similar to Fig. 1.11 of the previous section. In order to avoid
the long shift, the long adder is designed as a summing matrix consisting of
2 adders of 64 bits in each row. For simplicity in the figure only 3 rows (of
the 5 needed to represent the full LA) are shown.
In a fast shifter of 48 to 128 bits the 48 bit product is shifted into a
position where its most significant digit is placed directly above the position
in the long adder which carries the same exponent identification E. The
remaining digits of the summand are placed immediately to its right. If they
hang over at the right end of the shifter, they are reinserted at the left end
by a ring shift. Above the summing matrix in Fig. 1.12 two possible positions
of summands after the shift are indicated.
The summing matrix now reads the summand into its S-registers. The
addition is executed by those adders where the exponent identification coin-
cides with the one of the summand. The exponent of the summand consists
of 9 bits. The lower part of the exponent, i. e. the 7 least significant digits,
determines the shift width. The selection of the two adders which perform the
addition is determined by the most significant bits of the exponent.

Fig. 1.12. Block diagram of a SPU with long adder for a 32 bit data word and 64
bit data bus.
In Fig. 1.12 again some memory is indicated for each part of the LA. It
can be used to save the LA contents very quickly in case a program with
higher priority interrupts the computation of a scalar product and requires
the unit for itself. The local memory on the SPU also can be used for fast
execution of scalar products in the case of complex arithmetic and of interval
arithmetic.
In comparison with Fig. 1.11, Fig. 1.12 shows an additional 32 bit data
path directly from the input register file to the fast shifter. This data path
is supposed to allow a very fast execution of the operation multiply and
add fused, rnd(a x b + c), which is provided by some conventional floating-
point processors. While the product a x b is computed by the multiplier, the
summand c is added to the LA.
The SPU which has been discussed in this section seems to be costly
at first glance. While a single floating-point addition conveniently can be
done with one 64 bit adder, here 640 full adders (10 64-bit adders) have
been used in carry select adder mode. However, the advantages of this design
are tremendous. While a conventional floating-point addition can produce a
completely wrong result with only two or three additions, the new unit never
delivers a wrong answer, even if millions of floating-point numbers or single
products of such numbers are added. An error analysis is never necessary for
these operations. The unit consists of a large number of identical parts and
it is very regular. This allows a very compact design. No particular hardware
has to be included to deal with rare exceptions. Although an increase in
adder equipment by a factor of 10, compared with a conventional floating-
point adder, might seem to be high, the number of full adders used for the
circuitry is not extraordinary. We stress the fact that for a Wallace tree in
case of a standard 53 x 53 bit multiplier about the same number of full adders
is used. For fast conventional computers this has been the state of the art
multiplication for many years and nobody complains about high cost.
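The behaviour of such a word-segmented accumulator is easy to model in software. The following sketch is only an illustration of the principle, not the hardware design itself: it ripples carries immediately instead of resolving them lazily with flags, and the word count, offsets and data values are illustrative assumptions.

WORD = 64
NWORDS = 10                      # e.g. 10 words of 64 bits as in Section 1.5.2
acc = [0] * NWORDS               # least significant word first

def add_at(value, bit_offset):
    """Add the unsigned integer 'value' into the accumulator at 'bit_offset',
    propagating carries word by word (the hardware resolves them with flags;
    here we simply ripple them)."""
    idx, shift = divmod(bit_offset, WORD)
    carry = value << shift
    while carry and idx < NWORDS:
        total = acc[idx] + (carry & (2**WORD - 1))
        acc[idx] = total & (2**WORD - 1)
        carry = (carry >> WORD) + (total >> WORD)
        idx += 1

def as_integer():
    return sum(w << (i * WORD) for i, w in enumerate(acc))

# accumulate a few products placed at different exponent offsets
add_at(0xFFFF_FFFF_FFFF, 100)
add_at(0xFFFF_FFFF_FFFF, 100)
add_at(1, 37)
assert as_integer() == (0xFFFF_FFFF_FFFF << 100) * 2 + (1 << 37)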
1.5.3 Short Adder with Local Memory on the Arithmetic Unit for
64 Bit Data Word (Solution B)
In the circuits discussed in Sections 1.5.1 and 1.5.2 adder equipment was
provided for the full width of the LA. The long adder was segmented into
partial adders of 64 bits. In Section 1.5.1, 67 such units were used, and in
Section 1.5.2, 10. During the addition of a summand, however, only 4 of these
units are activated in Section 1.5.1, and only 3 in Section 1.5.2. This raises
the question whether adder equipment is really needed for the full width of
the LA and whether the accumulation can be done with only 4 or 3 adders
in accordance with Solution B of Section 1.2.2. There the LA is kept as local
memory on the arithmetic unit.
In this section we develop such a solution for the double precision data
format. An in-principle solution using a short adder and local memory on the
arithmetic unit was discussed in Section 1.3.2. There the data ai and bi to
perform a product ai x bi are read into the SPU successively in two portions
of 64 bits. This leaves 4 machine cycles to perform the accumulation in the
pipeline.
Now we assume that the two data ai and bi for a product ai x bi are read
into the SPU simultaneously in one portion of 128 bits. Again we call the
time that is needed for this a cycle. In accordance with the solution shown in
Fig. 1.11 and Section 1.5.1 we assume again that the multiplication and the
shift also can be done in one such read cycle. In a balanced pipeline, then, the
circuit for the accumulation must be able to read and process one summand
in each (read) cycle also. The circuit in Fig. 1.13 displays a solution. Closely
following the summing matrix in Fig. 1.11 we assume there that the local
memory LA is organized in 17 rows of four 64 bit words.
In each cycle the multiplier supplies a product (summand) to be added
in the accumulation unit. Every such summand carries an exponent which
in our example consists of 12 bits. The 8 lower (least significant) bits of the
exponent determine the shift width. The row selection of the LA is obtained
by the 4 most significant bits of the exponent. This roughly corresponds to the
selection of the adding position in two steps by the process described in the
context of Fig. 1.2. The shift width and the row selection for the addition of
the product to the LA are known as soon as the exponent of the product has
been computed. Since the exponents of ai and bi consist of 11 bits only, the
result of their addition is available very quickly. So while the multiplication
of the mantissa is still being executed the shifter can already be switched and
the addresses for the LA words for the accumulation of the product ai x bi
can be selected.
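The address computation just described can be sketched in a few lines. The split into 8 shift bits and 4 row-select bits follows the text; the helper name and the example values are illustrative assumptions.

def decode_product_exponent(exp_a, exp_b):
    """exp_a, exp_b: 11-bit exponents of a_i and b_i (bias handling omitted)."""
    e = exp_a + exp_b              # 12-bit exponent of the product a_i * b_i
    shift_width = e & 0xFF         # least significant 8 bits -> shift width
    row = (e >> 8) & 0xF           # most significant 4 bits  -> LA row select
    return shift_width, row

print(decode_product_exponent(0x3F2, 0x011))   # illustrative values -> (3, 4)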
After being shifted the summand reaches the accumulation unit. It is read
into the input register IR of this unit. The shifted summand now consists of
an exponent e, a sign s, and a mantissa m. The mantissa touches three
consecutive words of the LA, while the exponent is reduced to the four most
significant bits of the original exponent of the product.
Now the addition of the summand is performed in the accumulation unit
by the following three steps:
[Figure: interface and register file, exponent adder, 53 x 53 multiplier, shifter 106 to 170, instruction decoder and controller (multiply, add, rounding, debugging command, control, flag control), the local accumulator with read and write data paths and address decoding, the multiplexer, the registers IR, RBS and RAS, the add/subtract unit, and the rounding unit; two possible locations of summands after the shift are indicated above the local accumulator.]
Fig. 1.13. Block diagram of a SPU with short adder and local store for a 64 bit data word and 128 bit data bus (IR = Input Register, RBS = Register Before Summation, RAS = Register After Summation)

1. The addressed words of the LA, together with the carry word selected by
the flag mechanism, are read from the local memory into the register before
summation RBS.
2. The addition is performed by four adders
of 64 bits which are working in carry select mode. The summand touches
three of these adders. Each one of these three adders can produce a carry.
The carries between two of these adjacent adders are absorbed by the
carry select addition. The fourth word is the carry word. It is selected by
the flag mechanism. During the addition step a 1 is added to or subtracted
from this word in carry select mode. If the addition produces a carry the
incremented/decremented word will be selected. If the addition does not
produce a carry this word remains unchanged. Simultaneously with the
incrementation/decrementation of the carry word a second set of flags
is set up which is copied into the flag word in the case that a carry is
generated. In Fig. 1.13 two possible locations of the summand after the
shift are indicated. The carry word is always the most significant word.
An incrementation/decrementation of this word never produces a carry.
Thus the adder/subtracter in Fig. 1.13 simply can be built as a parallel
carry select adder.
3. In the next cycle the computed sum is written back into the same four
memory cells of the LA to which the addition has been executed. Thus
only one address decoding is necessary for the read and write step. A
different bus called write data in Fig. 1.13 is used for this purpose.
In summary the addition consists of the typical three steps: 1. read the
summand, 2. perform the addition, and 3. write the sum back into the (local)
memory. Since a summand is delivered from the multiplier in each cycle, all
three phases must be active simultaneously, i. e. the addition itself must be
performed in a pipeline. This means that it must be possible to read from the
memory and to write into the memory in each cycle simultaneously. So two
different data paths have to be provided. This, however, is usual for register
memory.
The pipeline for the addition consists of three steps. Pipeline conflicts are
quite possible. A pipeline conflict occurs if an incoming summand needs to
be added to a partner from the LA which is still being computed and not yet
available in the local memory. These situations can be detected by comparing
the exponents e, e' and e" of three successively incoming summands. In prin-
ciple all pipeline conflicts can be solved by the hardware. Here we discuss the
solution of two pipeline conflicts which with high probability are the most
frequent occurrences.
One conflict situation occurs if two consecutive products carry the same
exponent e. In this case the two summands touch the same three words of
the LA. Then the second summand is unable to read its partner for the
addition from the local memory because it is not yet available. This situation
is checked by the hardware where the exponents e and e' of two consecutive
summands are compared. If they are identical, the multiplexer blocks off the
process of reading from the local memory. Instead the sum which is just being
computed is directly written back into the register before summation RBS
via the multiplexer so that the second summand can immediately be added
without memory involvement.
Another possibility of a pipeline conflict occurs if from three successively
incoming summands the first one and the third one carry the same exponent.
Since the pipeline consists of three steps, the partner for the addition of the
third one then is not yet in the local memory but still in the register after
summation RAS. This situation is checked by the hardware also, see Fig.
1.13. There the two exponents e and e" of the two summands are compared.
In case of coincidence the multiplexer again suppresses the reading from the
local memory. Instead now, the sum of the former addition, the result of
which is still in RAS, is directly written back into the register RBS before
summation via the multiplexer. So also this pipeline conflict can be solved
by the hardware without memory involvement.
The case e = e' = e" is also possible. It would cause a reading conflict in
the multiplexer. The situation can be avoided by writing a dummy exponent
into e" or by reading from the add/subtract unit with higher priority.
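The forwarding decision just described can be summarized in a small sketch. The source names are illustrative; giving the adder output the higher priority also resolves the case e = e' = e'' mentioned above.

def select_rbs_source(e, e1, e2):
    """e: exponent of the incoming summand,
       e1: exponent one cycle earlier (sum still in the add/subtract unit),
       e2: exponent two cycles earlier (sum still in RAS)."""
    if e == e1:
        return "adder_output"     # forward the sum that is just being computed
    if e == e2:
        return "RAS"              # forward the previous sum from RAS
    return "local_accumulator"    # no conflict: read the partner from the LA

assert select_rbs_source(5, 5, 5) == "adder_output"   # e = e' = e''
assert select_rbs_source(5, 7, 5) == "RAS"
assert select_rbs_source(5, 7, 9) == "local_accumulator"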
The product that arrives at the accumulation unit touches three consec-
utive words of the LA. A more significant fourth word absorbs the possible
carry. The solution for the two pipeline conflicts just described works well, if
this fourth word is the next more significant word. A carry is not absorbed
by the fourth word if all its bits are one, or are all zero. The probability that
this is the case is 1 : 2⁶⁴ < 10⁻¹⁸. In the vast majority of instances this will
not be the case.
If it is the case the word which absorbs the carry is selected by the flag
mechanism and read into the most significant word of the RBS. The addition
step then again works well including the carry resolution. But difficulties
occur in both cases of a pipeline conflict. Fig. 1.14 displays a certain part of
the LA. The three words to which the addition is executed are denoted by 1,
2 and 3. The next more significant word is denoted by 4 and the word which
absorbs the carry by 5.
[Fig. 1.14. Part of the LA: the addition is executed to words 1, 2 and 3; word 4 is the next more significant word and word 5 is the word which absorbs the carry.]
significant position of the RBS. It is simply treated the same way as the
words 1, 2 and 3. In the other case word 4 has to be read from the LA into
RBS, simultaneously with the words 1, 2 and 3 from the add/subtract unit
or from RAS into RBS. In this case word 5 is written into the local memory
via the normal write path.
So far certain solutions for the possible pipeline conflicts e = e' and e = e"
have been discussed. These are the most frequent but not the only conflicts
that may occur. Similar difficulties appear if two or three successive incoming
summands overlap only partially. In this case the exponents e and e' and/or
e" differ by 1 or 2 so that also these situations can be detected by comparison
of the exponents. Another pipeline conflict appears if one of the two following
summands overlaps with a carry word. In these cases summands have to
be built up in parts from the adder/subtracter or RAS and the LA. Thus
hardware solutions for these situations are more complicated and costly. We
leave a detailed study of these situations to the reader/designer and offer the
following alternative: The accumulation pipeline consists of three steps only.
Instead of investing in a lot of hardware logic for rare situations of a pipeline
conflict it may be simpler and less expensive to stall the pipeline and delay
the accumulation by one or two cycles as needed. It should be mentioned
that other details, for instance the width of the adder that is used, can also
heavily influence the design. An adder width of 128 bits instead of the 64 bits
assumed here could simplify several details.
It was already mentioned that the probability for the carry to run further
than the fourth word is less than 10⁻¹⁸. A particular situation where this
happens occurs if the sum changes its sign from a positive to a negative
value or vice versa. This can happen frequently. To avoid a complicated carry
handling procedure in this case a small carry counter of perhaps three bits
could be appended to each 64 bit word of the LA. If these counters are not
zero at the end of the accumulation their contents have to be added to the
LA. For further details see [66], [45].
As was pointed out in connection with the unit discussed in Section 1.3.2,
the addition of the summand actually can be carried out over 170 bits only.
Thus the shifter that is shown in Fig. 1.13 can be reduced to a 106 to 170 bits
shifter and the data path from the shifter to the input register IR as well as
the one to RBS also need to be 170 bits wide only. If this possible hardware
reduction is applied, the summand has to be expanded to the full 256 bits
when it is transferred to the adder/subtracter.
1.5.4 Short Adder with Local Memory on the Arithmetic Unit for
32 Bit Data Word (Solution B)
Now we consider again a 32 bit data word. We assume that two of these are
read simultaneously into the SPU in one read cycle. The LA is kept as local
memory in the SPU. We assume that the addition of a summand, which now
is a 48 bit product, can be done by three adders of 64 bits including the carry
Since a summand is delivered from the multiplier in each cycle, all three
of these phases must be active simultaneously, i. e. the addition must be
performed in a pipeline. This means, in particular, that it must be possible
to read from the LA and to write into the LA simultaneously in each cycle.
Therefore, two different data paths have to be provided, as shown in Fig.
1.15.
The pipeline for the addition consists of three steps. Pipeline conflicts
again are quite possible. A pipeline conflict occurs if an incoming summand
needs to be added to a partner from the LA which is still being computed
and not yet available in the local memory. These situations can be detected
by comparing the exponents e, e' and e" of three successively incoming sum-
mands. In principle all pipeline conflicts can be solved by the hardware. We
discuss here the solution of two pipeline conflicts which with high probability
are the most frequent occurrences.
[Figure: the local accumulator LA with read and write data paths, the input register IR, the multiplexer, the registers RBS and RAS, the add/subtract unit, and the rounding unit.]
Fig. 1.15. Block diagram for a SPU with short adder and local store for a 32 bit data word and 64 bit data bus (IR = Input Register, RBS = Register Before Summation, RAS = Register After Summation)
One conflict situation occurs if two consecutive products carry the same
exponent e. In this case the two summands touch the same two words of the
LA. Then the second summand is unable to read its partner for the addition
from the LA because it is not yet available. This situation is checked by
the hardware where the exponents e and e' of two consecutive summands
are compared. In case of coincidence the process of reading from the LA is
blocked off. Instead the sum which is just being computed is directly written
back into the register RBS so that the second summand can immediately be
added without memory involvement.
Another possibility of a pipeline conflict occurs if from three successive
incoming summands the first one and the third one carry the same exponent.
Since the pipeline consists of three phases the partner for the addition of
the third one then is not yet in the LA but still in the register RAS. This
situation is checked by the hardware as well, see Fig. 1.15. There the two
exponents e and e" are compared. In case of coincidence again the process
of reading from the LA is blocked off. Instead now, the result of the former
addition, which is still in RAS, is directly written back into RBS. Then the
addition can be executed without LA involvement.
The case e = e' = e" is also possible. It would cause a conflict in the
selection unit which in Fig. 1.15 is shown directly beneath the LA. The
situation can be avoided by writing a dummy exponent into e" or by reading
from the add/subtract unit with higher priority. This solution is not shown
in Fig. 1.15.
The product that arrives at the accumulation unit touches two consecutive
words of the LA. A more significant third word absorbs the possible carry.
The solution for the two pipeline conflicts works well if this third word is the
next more significant word of the LA. The probability that this is not the
case is less than 10⁻¹⁸. In the vast majority of instances this will be the case.
If it is not the case the word which absorbs the carry is selected by the flag
mechanism and read into the most significant word of the RBS. The addition
step then works well again including the carry resolution. But difficulties can
occur in both cases of a pipeline conflict. Fig. 1.16 shows a certain part of
the LA. The two words to which the addition is executed are denoted by 1
and 2. The next more significant word is denoted by 3 and the word which
absorbs the carry by 4.
[Fig. 1.16. Part of the LA: the addition is executed to words 1 and 2; word 3 is the next more significant word and word 4 is the word which absorbs the carry.]
single exponent range in order to save some silicon area. This would require
the installation of complicated exception handling routines in software or in
hardware. The latter may finally require as much silicon. A software solution
certainly is much slower. The hardware requirement for the LA in case of
standard arithmetics is modest and the necessary register space really should
be invested.
However, the memory space for the LA on the arithmetic unit grows with
the exponent range of the data format. If this range is extremely large, as
for instance in case of an extended precision floating-point format, then only
an inner part of the LA can be supported by hardware. We call this part of
the LA a Hardware Accumulation Window (HAW). See Fig. 1.17. The outer
parts of this window must then be handled in software. They are probably
needed less often.
[Figure: the full (software) LA, spanning the segments k, 2e2, 2l and 2|e1|, with an inner part that is supported by hardware as the HAW.]
Fig. 1.17. Hardware Accumulation Window (HAW)
There are still other reasons that support the development of techniques
for the computation of the accurate scalar product using a HAW. Many
conventional computers on the market do not provide enough register space
to represent the full LA on the CPU. Then a HAW is one choice which allows
a fast and correct computation of the scalar product in many cases.
Another possibility is to place the LA in the user memory, i. e. in the data
cache. In this case only the start address of the LA and the flag bits are put
into (fixed) registers of the general purpose register set of the computer. This
solution has the advantage that only a few registers are needed and that a
longer accumulator window or even the full LA can be provided. This reduces
the need to handle exceptions. The disadvantage of this solution is that for
each accumulation step, four memory words must be read and written in
addition to the two operand loads. So the scalar product computation speed
is limited by the data cache to processor transfer bandwidth and speed. If
the full long accumulator is provided this is a very natural solution. It has
been realized on several IBM, SIEMENS and HITACHI computers of the
/370 architecture in the 1980s [109,110,112,119].
A faster solution certainly is obtained for many applications with a HAW
in the general purpose register set of the processor. Here only a part of the
LA is present in hardware. Overflows and underflows of this window have to
be handled by software. A full LA for the data format double precision of the
IEEE-arithmetic standard 754 requires 4288 bits or 67 words of 64 bits. We
assume here that only 10 of these words are located in the general purpose
register set.
Such a window covers the full LA that is needed for a scalar product com-
putation in case of the data format single precision of the IEEE-arithmetic
standard 754. It also allows a correct computation of scalar products in the
case of the long data format of the /370 architecture as long as no under-
or overflows occur. In this case 64 + 28 + 63 = 155 hexadecimal digits or
620 bits are required. With a HAW of 640 bits all scalar products that do
not cause an under- or overflow could have been correctly computed on these
machines. This architecture was successfully used and even dominated the
market for more than 20 years. This example shows that even if a HAW of
only 640 bits is available, the vast majority of scalar products will execute
on fast hardware.
Of course, even if only a HAW is available, all scalar products should be
computed correctly. Any operation that over- or underflows the HAW must
be completed in software. This requires a complete software implementation
of the LA, i. e. a variable of type dotprecision. All additions that do not fit
into the HAW must be executed in software into this dotprecision variable.
There are three situations where the HAW cannot correctly accumulate
the product:
• the exponent of the product is so high that the product does not (completely) fit into the HAW. Then the product is added in software to the dotprecision variable.
• the exponent of the product is so low that the product does not (completely) fit into the HAW. Then the product is added in software to the dotprecision variable.
• the product fits into the HAW, but its accumulation causes a carry to be propagated outside the range of the HAW. In this case the product is added into the HAW. The carry must be added in software to the dotprecision variable.
If at the end of the accumulation the contents of the software accumulator
are non zero, the contents of the HAW must be added to the software accu-
mulator to obtain the correct value of the scalar product. Then a rounding
can be performed if required. If at the end of the accumulation the contents
of the software accumulator are zero, the HAW contains the correct value of
the scalar product and a rounded value can be obtained from it.
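A toy model may make the three cases concrete. The window position, its width of 640 bits and the helper names are assumptions made only for this illustration; the full LA is modelled as an unbounded integer.

HAW_LO, HAW_HI = 128, 128 + 640          # a 640-bit window inside the full LA
haw = 0                                   # hardware window contents
soft = 0                                  # software dotprecision accumulator

def accumulate(value, bit_offset):
    """Add the exact product 'value' (a nonnegative integer) scaled by 2**bit_offset."""
    global haw, soft
    hi = bit_offset + value.bit_length()
    if bit_offset < HAW_LO or hi > HAW_HI:
        soft += value << bit_offset                      # does not fit: software
    else:
        total = haw + (value << (bit_offset - HAW_LO))   # fits: add into the HAW
        haw = total % (1 << (HAW_HI - HAW_LO))
        soft += (total >> (HAW_HI - HAW_LO)) << HAW_HI   # carry leaves the HAW

def result():
    return soft + (haw << HAW_LO)         # combine window and software parts

accumulate(3, 200); accumulate(5, 40); accumulate(1 << 639, 130)
assert result() == (3 << 200) + (5 << 40) + (1 << 769)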
Thus, in general, a software controlled full LA supplements a HAW. The
software routines must be able to perform the following functions:
• clear the software LA. This routine must be called during the initialization
of the HAW. Ideally, this routine only sets a flag. The actual clearing is
only done if the software LA is needed.
• add or subtract a product to/from the software LA.
α ≤ a ∘ b ≤ β,
then
◯α = α ≤ ◯(a ∘ b) = a ⊚ b ≤ ◯β = β.     (1.1)
(R1)    (R2)          (RG)        (R2)   (R1)
Thus, all semimorphic computer operations are of 1 ulp (unit in the last
place) accuracy. 1/2 ulp accuracy is achieved in the case of rounding to nearest.
In the product spaces the order relation is defined componentwise. So in
the product spaces property (1.1) holds for every component.
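For the reader's convenience, the properties referred to in (1.1) can be stated as follows. This is the form in which they are commonly given for semimorphisms (◯ denotes the rounding onto the subset N and ∘ an operation of M); it is a reminder, not a quotation of their definition in an earlier chapter:

\begin{align*}
&(R1)\quad \bigcirc a = a \quad \text{for all } a \in N,\\
&(R2)\quad a \le b \;\Rightarrow\; \bigcirc a \le \bigcirc b \quad \text{for all } a, b \in M,\\
&(R3)\quad \bigcirc(-a) = -\bigcirc a \quad \text{for all } a \in M,\\
&(R4)\quad \triangledown a \le a \;\text{ and }\; a \le \vartriangle a \quad \text{for the directed roundings},\\
&(RG)\quad a \circledcirc b := \bigcirc(a \circ b) \quad \text{for all } a, b \in N.
\end{align*}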
Figure 1.18 shows a table of the twelve basic arithmetic data types and
corresponding operators as they are provided by the programming language
PASCAL-XSC [46,47,49,67,68,108]. All data types and operators are predefined
and available in the language. The operations can be called by the operator
symbols shown in the table. An arithmetic operator followed by a less or
greater symbol denotes an operation with rounding downwards or upwards,
respectively. The operator +* takes the interval hull of two elements, **
means intersection. Also all outer operations that occur in Fig. 1.18 (scalar
times vector, matrix times vector, etc.) are defined by the five properties
(RG), (R1, 2, 3, 4), whatever applies. A count of all inner and outer pre-
defined operations in the figure leads to a number of about 600 arithmetic
operations.
[Fig. 1.18. The twelve basic arithmetic data types of PASCAL-XSC and their predefined operators. The rows list the left operand (integer, real, complex; interval, cinterval; rvector, cvector; ivector, civector; rmatrix, cmatrix; imatrix, cimatrix), the columns the right operand, and the entries the available operator symbols: the monadic +, -, the dyadic +, -, *, /, their directed variants such as +<, +>, -<, ->, *<, *>, /<, />, the interval hull +*, and the intersection **.]
Figure 1.19 lists the same data types in their usual mathematical notation.
There ℝ denotes the real and ℂ the complex numbers. A heading letter V, M
and I denotes vectors, matrices and intervals, respectively. R stands for the
set of floating-point numbers and D for any set of higher precision floating-
point numbers. If M is any set, ℙM denotes the power set, which is the set
of all subsets of M. For any operation ∘ in M a corresponding operation ∘
in ℙM is defined by A ∘ B := {a ∘ b | a ∈ A ∧ b ∈ B} for all A, B ∈ ℙM.
For each set-subset pair in Fig. 1.19, arithmetic in the subset is defined by
semimorphism. These operations are different in general from those which are
performed in the product spaces if only elementary floating-point arithmetic
is furnished on the computer. Semimorphism defines operations in a subset
N of a set M directly by making use of the operations in M. It makes a
direct link between an operation in M and its approximation in the subset
N. For instance, the operations in MCR (see Fig. 1.19) are directly defined
by the operations in MC, and not in a roundabout way via ℂ, ℝ, R, CR,
ℝ ⊇ D ⊇ R
Vℝ ⊇ VD ⊇ VR
Mℝ ⊇ MD ⊇ MR
ℙℝ ⊇ Iℝ ⊇ ID ⊇ IR
ℙVℝ ⊇ IVℝ ⊇ IVD ⊇ IVR
ℙMℝ ⊇ IMℝ ⊇ IMD ⊇ IMR
ℂ ⊇ CD ⊇ CR
Vℂ ⊇ VCD ⊇ VCR
Mℂ ⊇ MCD ⊇ MCR
Fig. 1.20. The fifteen fundamental operations for advanced computer arithmetic.
The IEEE arithmetic standards 754 and 854 offer 12 of these operations:
◯∘, ▽∘ and △∘ with ∘ ∈ {+, -, ×, /}. These standards also prescribe specific data
Fig. 1.21. Functional units, chip and board of the vector arithmetic coprocessor
XPA 3233
56. Krämer, W.; Walter, W.: FORTRAN-SC: A FORTRAN Extension for Engineering/Scientific Computation with Access to ACRITH, General Information Notes and Sample Programs. pp. 1-51, IBM Deutschland GmbH, Stuttgart, 1989.
57. Krämer, W.; Kulisch, U.; Lohner, R.: Numerical Toolbox for Verified Computing II: Theory, Algorithms and Pascal-XSC Programs. (Vol. I see [34,35]) Springer-Verlag, Berlin / Heidelberg / New York, to appear.
58. Kulisch, U.: An axiomatic approach to rounded computations. TS Report No. 1020, Mathematics Research Center, University of Wisconsin, Madison, Wisconsin, 1969, and Numerische Mathematik 19, pp. 1-17, 1971.
59. Kulisch, U.: Formalization and Implementation of Floating-Point Arithmetic. Computing 14, pp. 323-348, 1975.
60. Kulisch, U.: Grundlagen des Numerischen Rechnens - Mathematische Begründung der Rechnerarithmetik. Reihe Informatik, Band 19, Bibliographisches Institut, Mannheim/Wien/Zürich, 1976 (ISBN 3-411-01517-9).
61. Kulisch, U.: Schaltungsanordnung und Verfahren zur Bildung von Skalarprodukten und Summen von Gleitkommazahlen mit maximaler Genauigkeit. Patentschrift DE 3144015 A1, 1981.
62. Kulisch, U.; Miranker, W. L.: Computer Arithmetic in Theory and Practice. Academic Press, New York, 1981 (ISBN 0-12-428650-x).
63. Kulisch, U.; Ullrich, Ch. (Eds.): Wissenschaftliches Rechnen und Programmiersprachen. Proceedings of Seminar held in Karlsruhe, April 2-3, 1982. Berichte des German Chapter of the ACM, Band 10, B. G. Teubner Verlag, Stuttgart, 1982 (ISBN 3-519-02429-2).
64. Kulisch, U.; Miranker, W. L. (Eds.): A New Approach to Scientific Computation. Proceedings of Symposium held at IBM Research Center, Yorktown Heights, N. Y., 1982. Academic Press, New York, 1983 (ISBN 0-12-428660-7).
65. Kulisch, U.; Miranker, W. L.: The Arithmetic of the Digital Computer: A New Approach. IBM Research Center RC 10580, pp. 1-62, 1984. SIAM Review, Vol. 28, No. 1, pp. 1-40, March 1986.
66. Kulisch, U.; Kirchner, R.: Schaltungsanordnung zur Bildung von Produktsummen in Gleitkommadarstellung, insbes. von Skalarprodukten. Patentschrift DE 3703440 C2, 1986.
67. Kulisch, U. (Ed.): PASCAL-SC: A PASCAL extension for scientific computation, Information Manual and Floppy Disks, Version IBM PC/AT; Operating System DOS. B. G. Teubner Verlag (Wiley-Teubner series in computer science), Stuttgart, 1987 (ISBN 3-519-02106-4 / 0-471-91514-9).
68. Kulisch, U. (Ed.): PASCAL-SC: A PASCAL extension for scientific computation, Information Manual and Floppy Disks, Version ATARI ST. B. G. Teubner Verlag, Stuttgart, 1987 (ISBN 3-519-02108-0).
69. Kulisch, U. (Ed.): Wissenschaftliches Rechnen mit Ergebnisverifikation - Eine Einführung. Ausgearbeitet von S. Georg, R. Hammer und D. Ratz. Vol. 58. Akademie Verlag, Berlin, und Vieweg Verlagsgesellschaft, Wiesbaden, 1989.
70. Kulisch, U.; Teufel, T.; Hoefflinger, B.: Genauer und trotzdem schneller, Ein neuer Coprozessor für hochgenaue Matrix- und Vektoroperationen. Titelgeschichte, Elektronik 26, 1994.
71. Kulisch, U.; Lohner, R. and Facius, A. (Eds.): Perspectives on Enclosure Methods. Springer-Verlag, Wien, New York, 2001.
72. Lichter, P.: Realisierung eines VLSI-Chips für das Gleitkomma-Skalarprodukt der Kulisch-Arithmetik. Diplomarbeit, Fachbereich 10, Angewandte Mathematik und Informatik, Universität des Saarlandes, 1988.
Summary.
This paper deals with arithmetic on a discrete subset S of the real
numbers ℝ and with floating-point arithmetic in particular. We assume
that arithmetic on S is defined by semimorphism. Then for any element
a ∈ S the element -a ∈ S is an additive inverse of a, i.e. a ⊕ (-a) = 0.
The first part of the paper describes a necessary and sufficient condition
under which -a is the unique additive inverse of a in S. In the second
part this result is generalized. We consider algebraic structures M which
carry a certain metric, and their semimorphic images on a discrete subset
N of M. Again, a necessary and sufficient condition is given under which
elements of N have a unique additive inverse. This result can be applied to
complex floating-point numbers, real and complex floating-point intervals,
real and complex floating-point matrices, and real and complex floating-
point interval matrices.
Here ∘ ∈ {+, -} is the sign of the number, m is the mantissa, b is the base
of the number system in use and e is the exponent. b is an integer greater
than unity. The exponent is an integer between two fixed integer bounds e1,
e2, and in general e1 ≤ 0 ≤ e2. The mantissa is of the form

m = Σ_{i=1}^{r} d_i · b^{-i}.

The d_i are the digits of the mantissa. They have the property d_i ∈ {0,
1, ..., b - 1} for all i = 1(1)r and d_1 ≠ 0. Without the condition d_1 ≠ 0,
floating-point numbers are said to be unnormalized. The set of normalized
floating-point numbers does not contain zero. So zero is adjoined to S. For
a unique representation of zero it is often assumed that m = 0.00···0 and
e = 0. A floating-point system depends on the constants b, r, e1, and e2. We
denote it by S = S(b, r, e1, e2).
The floating-point numbers are not equally spaced between successive
powers of b and their negatives. This spacing changes at every power of
b. In particular, there are relatively large gaps around zero which contain
no further floating-point number. Figure 2.1 shows a simple floating-point
system S = S(2, 3, -1, 2) consisting of 33 elements.
If, for instance, the rounding towards zero is chosen, the entire interval
(-1/4, 1/4) is mapped onto zero. So whenever the real sum of two numbers
of S falls into this interval (e.g. 1/4 - 3/8) their sum in S is zero, a ⊕ b = 0,
and the two elements form a pair of additive inverses.
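This example is easy to verify with a small script. The enumeration of S(2, 3, -1, 2) and the rounding towards zero below are only an illustration of the statement above.

from fractions import Fraction

b, r, e1, e2 = 2, 3, -1, 2
pos = {Fraction(m, b**r) * Fraction(b)**e for e in range(e1, e2 + 1)
       for m in range(b**(r - 1), b**r)}          # normalized mantissas, d1 != 0
S = sorted({Fraction(0)} | pos | {-x for x in pos})
assert len(S) == 33                               # the 33 elements of S(2,3,-1,2)

def round_towards_zero(x):
    candidates = [s for s in S if abs(s) <= abs(x) and s * x >= 0] or [Fraction(0)]
    return max(candidates, key=abs)

# 1/4 - 3/8 = -1/8 lies in (-1/4, 1/4) and is rounded to zero
assert round_towards_zero(Fraction(1, 4) - Fraction(3, 8)) == 0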
The following theorem characterizes a discrete subset S of ℝ by a necessary
and sufficient condition under which the element -a is the unique
additive inverse of a in S:
Fig. 2.1. The characteristic spacing of a floating-point system.
Theorem 1:
If S is a symmetric, discrete subset of ℝ with 0 ∈ S, ◯ : ℝ → S a semimorphism,
and c > 0 the least distance between distinct elements of S, then for
all a ∈ S the element b = -a is the unique additive inverse of a if and only
if

◯⁻¹(0) ⊆ (-c, c).     (2.1)

Here ◯⁻¹(0) denotes the inverse image of 0 and (-c, c) is the open interval
between -c and c.
[Fig. 2.2. Typical roundings in the neighborhood of zero.]
mantissas are permitted. Then c itself becomes an element of S and for all
a E S the element -a is the unique additive inverse of a. This is the case, for
instance, if IEEE arithmetic with denormalized numbers is implemented.
Figure 2.2 illustrates the behavior of typical roundings in the neighbor-
hood of zero. (R1) means that for floating-point numbers the rounding func-
tion coincides with the identity mapping.
zero and a number that is not, however small it might actually be. This
rounding is not provided by the IEEE arithmetic standard.
In case of the interval spaces, the order relation ≤ means set inclusion ⊆. In
this case the rounding is required to have the additional property
With this definition it is shown in [3,4] for all spaces mentioned above
that the multiplicative unit e has a unique additive inverse in N. With
this quantity the minus operator (negation) and subtraction are defined by
This preserves all rules of the minus operator in the computer representable
subspaces, [3,4].
The proof that this additive inverse is unique is intricate and not easy in all
the individual cases [3,4]. A generalization of the conditions given in Section 2.1 for the
existence and uniqueness of additive inverses for the product spaces listed
above could simplify the situation considerably. This is now done.
We assume that the basic set M is mapped into a discrete subset N by
semimorphism, where N is symmetric, i.e. 0 ∈ N and for all a ∈ N also
-a ∈ N. It follows from (RG) and (R1) that an element a ∈ N which has an
additive inverse -a in M has the same additive inverse in N:
Theorem 2:
For all elements a of N which have a unique additive inverse -a in M, -a
is also the unique additive inverse of a in N if and only if

d(b, -a) ≥ ε  ⟹  d(a + b, a + (-a)) = d(a + b, 0) ≥ ε.
Summary.
This paper deals with interval arithmetic and interval mathematics.
Interval mathematics has been developed to a high standard during the
last few decades. It provides methods which deliver results with guaran-
tees. However, the arithmetic available on existing processors makes these
methods extremely slow. The paper reviews a number of basic methods
and techniques of interval mathematics in order to derive and focus on
those properties which by today's knowledge could effectively be supported
by the computer's hardware, by basic software, and by the programming
languages. The paper is not aiming for completeness. Unnecessary math-
ematical details, formalisms and derivations are left aside whenever possi-
ble. Particular emphasis is put on an efficient implementation of interval
arithmetic on computers.
Interval arithmetic is introduced as a shorthand notation and auto-
matic calculus to add, subtract, multiply, divide, and otherwise deal with
inequalities. Interval operations are also interpreted as special powerset or
set operations. The inclusion isotony and the inclusion property are central
and important consequences of this interpretation. The basic techniques for
enclosing the range of function values by centered forms or by subdivi-
sion are discussed. The Interval Newton Method is developed as an always
(globally) convergent technique to enclose zeros of functions.
Then extended interval arithmetic is introduced. It allows division by
intervals that contain zero and is the basis for the development of the
extended Interval Newton Method. This is the major tool for computing
enclosures at all zeros of a function or of systems of functions in a given
domain. It is also the basic ingredient for many other important applica-
tions like global optimization, subdivision in higher dimensional cases or
for computing error bounds for the remainder term of definite integrals in
more than one variable. We also sketch the techniques of differentiation
arithmetic, sometimes called automatic differentiation, for the computa-
tion of enclosures of derivatives, of Taylor coefficients, of gradients, of
Jacobian or Hessian matrices.
The major final part of the paper is devoted to the question of how
interval arithmetic can effectively be provided on computers. This is an
essential prerequisite for its superior and fascinating properties to be more
widely used in the scientific computing community. With more appropri-
ate processors, rigorous methods based on interval arithmetic could be
comparable in speed with today's "approximate" methods. At processor
speeds of gigaFLOPS there remains no alternative but to furnish future
computers with the capability to control the accuracy of a computation at
least to a certain extent.
The author sees several reasons for this which should be discussed briefly. A broad un-
derstanding of these reasons is an essential prerequisite for further progress.
Forty years of nearly exclusive use of floating-point arithmetic in scientific
computing has formed and now dominates our thinking. Interval arithmetic
requires a much higher level of abstraction than languages like Fortran-77,
Pascal or C provide. If every single interval operation requires a procedure
call, the user's energy and attention are forced down to the level of coding,
and are dissipated there.
The development and implementation of adequate and powerful program-
ming environments like PASCAL-XSC [17,26,27] or ARITH-XSC [77] re-
quires a large body of experienced and devoted scientists (about 20 man
years for each) which is not easy to muster. In such environments interval
arithmetic, the elementary functions for the data types real and interval,
a long real and a long real interval arithmetic including the corresponding
elementary functions, vector and matrix arithmetic, differentiation and Tay-
lor arithmetic both for real and interval data are provided by the run time
system of the compiler. All operations can be called by the usual mathemat-
ical operator symbols and are of maximum accuracy. This releases the user
from coding drudgery. This means, for instance, that an enclosure of a high
derivative of a function over an interval - needed for step size control and
to guarantee the value of a definite integral or a differential equation within
close bounds - can be computed by the same notation used to compute the
real function value. The compiler interprets the operators according to the
type specification of the data. This level of programming is essential indeed.
It opens a new era of conceptual thinking for mathematical numerics.
A second reason for the low acceptance of interval arithmetic in the sci-
entific computing community is simply the prejudices which are often the
result of superficial experiments. Sentences like the following appear again
and again in the literature: "The error bounds are overly conservative; they
quickly grow to the computer representation of [-00, +00]", "Interval arith-
metic is expensive because it takes twice the storage and at least twice the
work of ordinary arithmetic."
Such sentences are correct for what is called "naive interval arithmetic".
Interval arithmetic, however, should not be applied naively. Its properties
must be studied and understood first, before it can be applied successfully.
Many program packages have been developed using interval arithmetic, which
deliver close bounds for their solutions. In no case are these bounds obtained
by substituting intervals in a conventional floating-point algorithm. Interval
arithmetic is an extension of floating-point arithmetic, not a replacement
for it. Sophisticated use of interval arithmetic often leads to safe and bet-
ter results. There are many applications where the extended tool delivers a
guaranteed answer faster than the restricted tool of floating-point arithmetic
delivers an "approximation". Examples are numerical integration (because
of automatic step size control) and global optimization (intervals bring the
It is important that these factors are well understood. Real progress depends critically
on an understanding of their details. Interval methods are not slow per se. It
is the actual available arithmetic on existing processors which makes them
slow. With better processor and language support, rigorous methods could
be comparable in speed to today's "approximate" methods. Interval mathe-
matics or mathematical numerics has been developed to a level where already
today library routines could speedily deliver validated bounds instead of just
approximations for small and medium size problems. This would ease the life
of many users dramatically.
Future computers must be equipped with fast and effective interval arith-
metic. At processor speeds of gigaFLOPS it is almost the only way to check
the accuracy of a computation. Computer-generated graphics requires vali-
dation techniques in many cases.
After Sunaga's early paper the publication of Ramon E. Moore's book
on interval arithmetic in 1966 [44] certainly was another milestone in the
development of interval arithmetic. Moore's book is full of unconventional
ideas which were out of the mainstream of numerical analysis of that time.
To many colleagues the book appeared as a utopian dream. Others tried
to carry out his ideas, in general with little success. Computers were very,
very slow at that time. Today Moore's book appears as an exposition of
extraordinary intellectual and creative power. The basic ideas of a great many
well established methods of validation numerics can be traced back to Moore's
book.
We conclude this introduction with a brief sketch of the development of
interval arithmetic at the author's institute. Already by 1967 an ALGOL-
60 extension implemented on a Zuse Z 23 computer provided operators and
a number of elementary functions for a new data type interval [69,70]. In
1968/69 this language was implemented on a more powerful computer, an
Electrologica X8. To speed up the arithmetic, the hardware of the processor
was extended by the four arithmetic operations with rounding downwards
▽∘, ∘ ∈ {+, -, *, /}. Operations with rounding upwards were produced by
use of the relation △(a) = -▽(-a). Many early interval methods have been
developed using these tools. Based on this experience a book [5] was written
by two collaborators of that time. The English translation which appeared
in 1983 is still a standard monograph on interval arithmetic [6].
At about 1969 the author became aware that interval and floating-point
arithmetic basically follow the same mathematical mapping principles, and
can be subsumed by a general mathematical theory of what is called advanced
computer arithmetic in this paper. The basic assumption is that all arithmetic
operations on computers (for real and complex numbers, real and complex
intervals as well as for vectors and matrices over these four basic data types)
should be defined by four simple rules which are called a semimorphism. This
guarantees the best possible answers for all these arithmetic operations. A
book on the subject was published in 1976 [33] and the German company
publications with problem solving routines are available for both languages
[17,18,31].
Of course, much valuable work on the subject had been done at other
places as well. International Conferences where new results can be presented
and discussed are held regularly.
After completion of this paper Sun Microsystems announced an interval
extension of Fortran 95 [83]. With this new product and compiler, interval
arithmetic is now available on computers which are widespread.
As Teruo Sunaga did in 1958 and many others after him, I am looking
forward to, expect, and eagerly await a revision of the structure of the digital
computer for better support of interval arithmetic.
A + B = [a1 + b1, a2 + b2]     (3.1)
A - B = [a1 - b2, a2 - b1]     (3.2)
The rule for multiplication of two intervals is more complicated. Nine cases
are to be distinguished depending on whether a1, a2, b1, b2 are less or greater
than zero. For division the situation is similar. Since we shall build upon
these rules later they are cited here. For a detailed derivation see [33,34]. In
the tables the order relation ≤ is used for intervals. It is defined by

[a1, a2] ≤ [b1, b2] :⟺ a1 ≤ b1 ∧ a2 ≤ b2.
Table 3.1. The 9 cases for the multiplication of two intervals or inequalities
Table 3.2. The 6 cases for the division of two intervals or inequalities
In Tables 3.1 and 3.2 Å denotes the interior of A, i.e. c ∈ Å means
a1 < c < a2. In the cases 0 ∈ B division A/B is not defined.
As a result of these rules it can be stated that in the case of real intervals
the result of an interval operation A ∘ B, for all ∘ ∈ {+, -, *, /}, can be
expressed in terms of the bounds of the interval operands (with the A/B
exception above). In order to get each of these bounds, typically only one
real operation is necessary. Only in case 9 of Table 3.1, where 0 is an interior
point of both A and B, do two products have to be calculated and compared.
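A sketch of interval multiplication organized by such sign cases is given below. The case split is written in the spirit of Table 3.1, whose entries are not reproduced above; the bounds are computed exactly here, whereas an implementation on the computer would round the lower bound downwards and the upper bound upwards.

def imul(A, B):
    a1, a2 = A
    b1, b2 = B
    if a1 >= 0:                                   # A >= 0
        if b1 >= 0:   return (a1 * b1, a2 * b2)
        if b2 <= 0:   return (a2 * b1, a1 * b2)
        return (a2 * b1, a2 * b2)                 # 0 interior to B
    if a2 <= 0:                                   # A <= 0
        if b1 >= 0:   return (a1 * b2, a2 * b1)
        if b2 <= 0:   return (a2 * b2, a1 * b1)
        return (a1 * b2, a1 * b1)
    if b1 >= 0:       return (a1 * b2, a2 * b2)   # 0 interior to A
    if b2 <= 0:       return (a2 * b1, a1 * b1)
    # both operands contain 0 in their interior: two products per bound
    return (min(a1 * b2, a2 * b1), max(a1 * b1, a2 * b2))

assert imul((-1, 2), (-3, 4)) == (-6, 8)
assert imul((-2, -1), (3, 4)) == (-8, -3)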
Whenever in the Tables 3.1 and 3.2 both operands are comparable with
the interval [0,0] with respect to ≤, ≥, < or >, the result of the interval
operation A * B or A/B contains both bounds of A and B. If one or both of
the operands A or B, however, contains zero as an interior point, then the
result A * B and A/B is expressed by only three of the four bounds of A and
B. In all these cases (3, 6, 7, 8, 9) in Table 3.1, the bound which is missing
in the expression for the result can be shifted towards zero without changing
the result of the operation A * B. Similarly, in cases 5 and 6 in Table 3.2, the
bound of B, which is missing in the expression for the resulting interval, can
be shifted toward ∞ (resp. -∞) without changing the result of the operation.
This shows a certain lack of sensitivity of interval arithmetic or computing
with inequalities whenever in the cases of multiplication and division one of
the operands contains zero as an interior point.
In all these cases - 3, 6, 7, 8, 9 of Table 3.1 and 5, 6 of Table 3.2 -
the result of A * B or A/B also contains zero, and the formulas show that
the result tends toward the zero interval if the operands that contain zero do
likewise. In the limit case when the operand that contains zero has become the
zero interval, no such imprecision is left. This suggests that within arithmetic
expressions interval operands that contain zero as an interior point should be
made as small in diameter as possible.
We illustrate the efficiency of this calculus for inequalities by a simple
example. See [4]. Let x = Ax + b be a system of linear equations in fixed
point form with a contracting real matrix A and a real vector b, and let the
interval vector X be a rough initial enclosure of the solution x* ∈ X. We can
now formally write down the Jacobi method, the Gauss-Seidel method, a re-
laxation method or some other iterative scheme for the solution of the linear
system. In these formulas we then interpret all components of the vector x
as being intervals. Doing so we obtain a number of iterative methods for the
computation of enclosures of the solution of linear systems of equations.
Further iterative schemes then can be obtained by taking the intersection of
two successive approximations. If we now decompose all these methods into
formulas for the bounds of the intervals we obtain a large number of methods
for the computation of bounds for the solution of linear systems which were
painstakingly derived by well-known mathematicians about 40 years ago, see [14]. The
calculus of interval arithmetic reproduces these and other methods in the
simplest way. The user does not have to take care of the many case distinc-
tions occurring in the matrix vector multiplications. The computer executes
The rules (3.1), (3.2), (3.3), and (3.4) also can be interpreted as arithmetic operations
for sets. As such they are special cases of general set operations. Further
important properties of interval arithmetic can immediately be obtained
via set operations. Let M be any set with a dyadic operation ∘ : M × M → M
defined for its elements. The powerset ℙM of M is defined as the set of all
subsets of M. The operation ∘ in M can be extended to the powerset ℙM
by the following definition

A ∘ B := {a ∘ b | a ∈ A ∧ b ∈ B} for all A, B ∈ ℙM.     (3.5)

From (3.5) one obtains immediately

A ⊆ B ∧ C ⊆ D ⟹ A ∘ C ⊆ B ∘ D,     (3.6)

and in particular

a ∈ A ∧ b ∈ B ⟹ a ∘ b ∈ A ∘ B.     (3.7)
(3.6) is called the inclusion isotony (or inclusion monotony). (3.7) is called
the inclusion property.
By use of parentheses these rules can immediately be extended to expres-
sions with more than one arithmetic operation, e.g.
and so on. Moreover, if more than one operation is defined in M this chain
of conclusions also remains valid for expressions containing several different
operations.
If we now replace the general set M by the set of real numbers, (3.5),
(3.6), and (3.7) hold in particular for the powerset ℙℝ of the real numbers
ℝ. This is the case for all operations ∘ ∈ {+, -, *, /}, if we assume that in
case of division 0 is not an element of the denominator, for instance, 0 ∉ B
in (3.5).
The set Iℝ of closed and bounded intervals over ℝ is a subset of ℙℝ.
Thus (3.5), (3.6), and (3.7) are also valid for elements of Iℝ. The set Iℝ
[Fig. 3.1. Syntax diagram of a real expression, built from real constants, real variables, and real elementary functions.]
For instance:
[Fig. 3.2. Syntax diagram of an interval expression, built from interval constants, interval variables, interval elementary functions, and interval functions.]
For non-monotonic functions the computation of the range of values over
an interval [a1, a2] requires the determination of the global minimum and
maximum of the function in the interval [a1, a2]. For the usual elementary
functions, however, these are known. With this definition of elementary functions
for intervals the key properties of interval arithmetic, the inclusion
monotony (3.7) and the inclusion property (3.8), extend immediately to
elementary functions and with this to interval expressions as defined in Fig. 3.2:
real operands). This is just the step from Fig. 3.1 to Fig. 3.2. What is obtained
is an interval expression. Then all arithmetic operations are performed in
interval arithmetic. For a real function f(a) we denote the interval evaluation
over the interval A by F(A).
With this definition we can immediately conclude that interval evaluations
of (computable) real functions are inclusion isotone and that the inclusion
property holds in particular:
Evaluation of the two expressions for a real number always leads to the same
real function value. In contrast to this, interval evaluation of the two expres-
sions may lead to different intervals. In the example we obtain for the interval
A = [1,2]:
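The concrete expressions of the original example are not reproduced above. As an illustration we take the equivalent expressions x - x*x and x*(1 - x) and evaluate both over A = [1, 2] with exact integer bounds, so that no rounding is involved.

def iadd(A, B): return (A[0] + B[0], A[1] + B[1])
def isub(A, B): return (A[0] - B[1], A[1] - B[0])
def imul(A, B):
    p = [A[0] * B[0], A[0] * B[1], A[1] * B[0], A[1] * B[1]]
    return (min(p), max(p))

A = (1, 2)
print(isub(A, imul(A, A)))            # x - x*x   over [1,2] -> (-3, 1)
print(imul(A, isub((1, 1), A)))       # x*(1 - x) over [1,2] -> (-2, 0)
# the true range of x - x*x over [1,2] is [-2, 0]; the second form is exact here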
elementary functions are provided for interval arguments. Then, if called for
a point interval (where the lower and upper bound coincide), a compari-
son of the lower and upper bound of the result of the interval evaluation
of the function reveals immediately the accuracy with which the elementary
function has been implemented. This situation has forced extremely careful
implementation of the elementary functions and since interval versions of the
elementary functions have been provided on a large scale [26-29,37,38,77]
the conventional real elementary functions on computers also had to be and
have been improved step by step by the manufacturers. A most advanced
programming environment in this respect is a decimal version of PASCAL-
XSC [10] where, besides the usual 24 elementary functions, about the same
number of special functions are provided for real and interval arguments with
highest accuracy.
1., 2. and 3. are minimum requirements for any sophisticated use of inter-
val arithmetic. If they are not met, coding difficulties absorb all the attention
and capacity of users and prevent them from developing deeper mathemati-
cal ideas and insight. So far none of the widespread programming languages
like Fortran, C, and even Fortran 95 and C++ provide the necessary pro-
gramming ease. This is the basic reason for the slow progress in the field.
It is a matter of fact that a great deal of the existing and established inter-
val methods and algorithms have originally been developed in PASCAL-XSC
even if they have been coded afterwards in other languages. Programming
ease is essential indeed. The typical user, however, is reluctant to leave the
programming environment he is used to, just to apply interval methods.
We summarize this discussion by stating that it does not suffice for an
adequate use of interval arithmetic on computers that only the four basic
arithmetic operations +, -, * and I for intervals are somehow supported by
the computer hardware. An appropriate language support is absolutely nec-
essary. So far this has been missing. This is the basic dilemma of interval
arithmetic. Experience has shown that it cannot be overcome via slow mov-
ing standardization committees for programming languages. Two things seem
to be necessary for the great breakthrough. A major vendor has to provide
the necessary support and the body of numerical analysts must acquire a
broader insight and skills in order to use this support.
property (3.10), (3.13) holds. Since (3.10) and (3.13) hold for all a E A we
can immediately state that
i.e. that the interval evaluation of a real function over an interval delivers
a superset of the range of function values over that interval. If A is a point
interval [a, a] this reduces to:
f(a) ∈ F([a, a]).     (3.19)
and, in general, the union of the interval evaluations over all subintervals of
a subdivision of A. It should be clear, however, that in general only for small
intervals is the bound in (3.21) better than in (3.18).
The decrease of the overestimation of the range of function values by the
interval evaluation of the function with the diameter of the interval A, and the
method of subdivision, are reasons why interval arithmetic can successfully
be used in many applications. Numerical methods often proceed in small
steps. This is the case, for instance, with numerical quadrature or cubature,
or with numerical integration of ordinary differential equations. In all these
cases an interval evaluation of the remainder term of the integration formula
(using differentiation arithmetic) controls the step size of the integration, and
because of the small steps overestimation is practically negligible anyway.
We now mention briefly how centered forms can be obtained. Usually a
centered form is derived via the mean-value theorem. If f is differentiable in
its domain D, then f(x) = f(c) + f'(ξ)(x - c) for fixed c ∈ D and some ξ
between x and c. If x and c are elements of the interval A ⊆ D, then also
ξ ∈ A. Therefore
f(x) ∈ f(c) + F'(A) · (A - c) for all x ∈ A.
In all these and other cases, zero finding is a central task. Here the extended
Interval Newton Method plays an extraordinary role so we are now going to
review this method, which is also one of the requirements that have to be
met when interval arithmetic is implemented on the computer.
f(x) = 0.     (3.22)
x1 := x0 - f(x0)/f'(x0).     (3.24)

xν+1 := xν - f(xν)/f'(xν),  ν = 0, 1, 2, ....     (3.25)
It is well known that if f (x) has a single zero x* in an interval X and f (x)
is twice continuously differentiable, then the sequence
Xν+1 := (m(Xν) - f(m(Xν))/F'(Xν)) ∩ Xν,  ν = 0, 1, 2, ...,     (3.26)
In contrast to (3.25) the method (3.26) can never diverge (fail). Because
of the intersection with Xν the sequence
(3.27)
N(X) := x - f(x)/F'(X),  x ∈ X ∈ Iℝ     (3.28)
[Figure: the Newton step for an interval X = [x1, x2]: N(X) = [n1, n2], and the new interval is X1 = [x1, n2] = N(X) ∩ X.]
f(x) = f(x*) + f'(ξ)(x - x*) for all x ∈ X and some ξ between x and x*.
Since f(x*) = 0 this gives
x* = x - f(x)/f'(ξ).
If F'(X) denotes the interval evaluation of f'(x) over the interval X, we have
f'(ξ) ∈ F'(X) and therefore

x* = x - f(x)/f'(ξ) ∈ x - f(x)/F'(X) = N(X) for all x ∈ X,

and since x* ∈ X also

x* ∈ (x - f(x)/F'(X)) ∩ X = N(X) ∩ X.
As a consequence of (3.30) again the inclusion isotony (3.6) and the in-
clusion property (3.7) hold for all operations and arithmetic expressions in
ℙℝ. In particular, this is the case if (3.30) is restricted to operands of Iℝ.
Iℝ is a subset of ℙℝ.
We are now going to define division by an interval B of Iℝ which contains
zero. It turns out that the result is no longer an interval of Iℝ. But we can
apply the definition of the division in the powerset as given by (3.30). This
leads to

    A/B := { a/b | a ∈ A ∧ b ∈ B }  for all A, B ∈ Iℝ.   (3.31)
2 An ordered set is called conditionally complete if every non empty, bounded sub-
set has a greatest lower bound (infimum) and a least upper bound (supremum).
3 In a complete lattice every subset has an infimum and a supremum.
In order to interpret the right hand side of (3.31) we remember that the
quotient a/b is defined as the inverse operation of multiplication, i.e. as the
solution of the equation b · x = a. Thus (3.31) can also be written in the form
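which presumably reads

    A/B = { x | b · x = a  ∧  a ∈ A  ∧  b ∈ B }.   (3.32)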
Now we have to interpret the right hand side of (3.32). We are interested
in obtaining simply executable, explicit formulas for the right hand side of
(3.32). The case 0 ∉ B was already dealt with in Table 3.2. So we assume
here generally that 0 ∈ B. For A = [a1, a2] and B = [b1, b2] ∈ Iℝ, 0 ∈ B, the
following eight distinct cases can be set out:
1.  0 ∈ A,          0 ∈ B.
2.  0 ∉ A,          B = [0, 0].
3.  a1 ≤ a2 < 0,    b1 < b2 = 0.
4.  a1 ≤ a2 < 0,    b1 < 0 < b2.
5.  a1 ≤ a2 < 0,    0 = b1 < b2.
6.  0 < a1 ≤ a2,    b1 < b2 = 0.
7.  0 < a1 ≤ a2,    b1 < 0 < b2.
8.  0 < a1 ≤ a2,    0 = b1 < b2.
The list distinguishes the cases 0 ∈ A (case 1) and 0 ∉ A (cases 2 to 8).
Since it is generally assumed that 0 ∈ B these eight cases indeed cover all
possibilities.
We are now going to derive simple formulas for the result of the interval
division A/B for these eight cases:
In all other cases 0 ∉ A also. We have already observed under 2. that in
this case the element 0 in B does not contribute to the solution set. So it can
be excluded without changing the set A/B.
So the general rule for computing A/B by (3.32) is to punch out zero of
the interval B and replace it by a small positive or negative number ε as the
case may be. The interval changed in this way is denoted by B̃ and represented in
column 4 of Table 3.3. With this B̃ the solution set A/B̃ can now easily be
computed by applying the rules of Table 3.2. The results are shown in column
5 of Table 3.3. Now the desired result A/B as defined by (3.32) is obtained if
in column 5 ε tends to zero. Thus in cases 3 to 8 the results are obtained by
the limit process A/B = lim_{ε→0} A/B̃. The solution set A/B is shown in the
last column of Table 3.3 for all the 8 cases. There, as usual in mathematics,
parentheses denote open interval ends, i.e. the bound is excluded. In contrast
to this brackets denote closed interval ends, i.e. the bound is included.
In Table 3.3 the operands A and B of the division A/B are intervals of
Iℝ! The results of the division A/B shown in the last column, however, are
no longer intervals of Iℝ nor are they intervals of Iℝ*, which is the set of
intervals over ℝ*. This is logically correct and should not be surprising, since
the division has been defined as an operation in ℙℝ by (3.30).
Table 3.4 shows the result of the division A/B of two intervals A = [a1, a2]
and B = [b1, b2] in the case 0 ∈ B in a more convenient layout.
Table 3.3. The 8 cases of the division of two intervals A/B, with A, B ∈ Iℝ and
0 ∈ B (ε > 0 small).

 case  A              B              B̃                    A/B̃                                 A/B
 1     0 ∈ A          0 ∈ B          -                     -                                    (-∞, +∞)
 2     0 ∉ A          B = [0, 0]     -                     -                                    [ ]
 3     a1 ≤ a2 < 0    b1 < b2 = 0    [b1, -ε]              [a2/b1, a1/(-ε)]                     [a2/b1, +∞)
 4     a1 ≤ a2 < 0    b1 < 0 < b2    [b1, -ε] ∪ [ε, b2]    [a2/b1, a1/(-ε)] ∪ [a1/ε, a2/b2]     (-∞, a2/b2] ∪ [a2/b1, +∞)
 5     a1 ≤ a2 < 0    0 = b1 < b2    [ε, b2]               [a1/ε, a2/b2]                        (-∞, a2/b2]
 6     0 < a1 ≤ a2    b1 < b2 = 0    [b1, -ε]              [a2/(-ε), a1/b1]                     (-∞, a1/b1]
 7     0 < a1 ≤ a2    b1 < 0 < b2    [b1, -ε] ∪ [ε, b2]    [a2/(-ε), a1/b1] ∪ [a1/b2, a2/ε]     (-∞, a1/b1] ∪ [a1/b2, +∞)
 8     0 < a1 ≤ a2    0 = b1 < b2    [ε, b2]               [a1/b2, a2/ε]                        [a1/b2, +∞)
Table 3.4. The result of the division A/B of two intervals A = [a1, a2] and B = [b1, b2] with 0 ∈ B.

                 B = [0, 0]   b1 < b2 = 0    b1 < 0 < b2                   0 = b1 < b2
 a2 < 0          [ ]          [a2/b1, +∞)    (-∞, a2/b2] ∪ [a2/b1, +∞)     (-∞, a2/b2]
 a1 ≤ 0 ≤ a2     (-∞, +∞)     (-∞, +∞)       (-∞, +∞)                      (-∞, +∞)
 a1 > 0          [ ]          (-∞, a1/b1]    (-∞, a1/b1] ∪ [a1/b2, +∞)     [a1/b2, +∞)
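The case analysis of Table 3.4 can be expressed in a few lines of software. The following C sketch is only an illustration (the type and function names are ours, IEEE infinities stand in for -∞ and +∞, and the directed rounding of the finite quotient bounds is omitted for clarity); it returns the result of A/B with 0 ∈ B as zero, one or two intervals:

#include <math.h>
#include <stdio.h>

typedef struct { double lo, hi; } interval;

/* Extended interval division A/B under the assumption 0 in B.
   Returns the number of result pieces (0, 1 or 2) written to r[]. */
static int ext_div(interval A, interval B, interval r[2]) {
    if (A.lo <= 0.0 && 0.0 <= A.hi) {                 /* 0 in A and 0 in B */
        r[0].lo = -INFINITY; r[0].hi = INFINITY; return 1;
    }
    if (B.lo == 0.0 && B.hi == 0.0) return 0;          /* A/[0,0] = empty set */
    if (A.hi < 0.0) {                                  /* A entirely negative */
        if (B.hi == 0.0) { r[0] = (interval){A.hi / B.lo,  INFINITY}; return 1; }
        if (B.lo == 0.0) { r[0] = (interval){-INFINITY, A.hi / B.hi}; return 1; }
        r[0] = (interval){-INFINITY, A.hi / B.hi};     /* b1 < 0 < b2: two pieces */
        r[1] = (interval){A.hi / B.lo,  INFINITY};     return 2;
    } else {                                           /* A entirely positive */
        if (B.hi == 0.0) { r[0] = (interval){-INFINITY, A.lo / B.lo}; return 1; }
        if (B.lo == 0.0) { r[0] = (interval){A.lo / B.hi,  INFINITY}; return 1; }
        r[0] = (interval){-INFINITY, A.lo / B.lo};
        r[1] = (interval){A.lo / B.hi,  INFINITY};     return 2;
    }
}

int main(void) {
    interval r[2];
    int n = ext_div((interval){1.0, 2.0}, (interval){-3.0, 4.0}, r);
    for (int i = 0; i < n; i++) printf("[%g, %g]\n", r[i].lo, r[i].hi);
    /* expected: (-inf, -1/3] and [1/4, +inf) */
    return 0;
}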
For completeness we repeat at the end of this section the results of the
basic arithmetic operations for intervals A = [a1, a2] and B = [b1, b2] of Iℝ
which have already been given in Section 3.2. In the cases of multiplication
and division we use different representations. We also list the basic rules of
the order relations ≤ and ⊆ for intervals of Iℝ*.
The closed intervals over ℝ* are ordered with respect
to two different order relations, the comparison ≤ and the set inclusion ⊆.
With respect to both order relations they are complete lattices. The basic
properties are:
VII. {Iℝ*, ≤}:  [a1, a2] ≤ [b1, b2]  :⇔  a1 ≤ b1 ∧ a2 ≤ b2.
The least element of Iℝ* with respect to ≤ is the interval [-∞, -∞], the
greatest element is [+∞, +∞]. The infimum and supremum respectively
of a subset S ⊆ Iℝ* are:
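these are presumably the componentwise bounds

    inf≤ S = [ inf a1, inf a2 ],    sup≤ S = [ sup a1, sup a2 ],

where the infima and suprema are taken over all intervals [a1, a2] ∈ S.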
VIII. {Iℝ*, ⊆}:  [a1, a2] ⊆ [b1, b2]  :⇔  b1 ≤ a1 ∧ a2 ≤ b2.
The interval [-∞, +∞] is the greatest element in {Iℝ*, ⊆}, i.e. for all
intervals A ∈ Iℝ* we have A ⊆ [-∞, +∞]. But a least element is missing
in Iℝ*. So we adjoin the empty set [ ] as the least element of Iℝ*. The
empty set [ ] is a subset of any set, thus for all A ∈ Iℝ* we have [ ] ⊆ A.
We denote the resulting set again by Iℝ* := Iℝ* ∪ {[ ]}. With this completion
{Iℝ*, ⊆} is a complete lattice. The infimum and supremum respectively
of a subset S ⊆ Iℝ* are [33,34]:
i.e. the infimum is the intersection and the supremum is the interval
(convex) hull of all intervals out of S. For inf⊆ S we shall also use the
usual symbol ⋂S. sup⊆ S is occasionally written as ⋃S. If in particular
S just consists of two intervals A, B, this reads:

    intersection: A ∩ B,    interval hull: A ∪ B,

where A ∪ B here denotes the interval (convex) hull of the set union.
Since ℝ is a linearly ordered set with respect to ≤, the interval hull is
the same as the convex hull. The intersection may be empty.
the case 0 ∈ F'(X) is no longer excluded. The result of the division then can
be taken from Tables 3.3 and 3.4. It is no longer an element out of Iℝ, but
an element of the powerset ℙℝ. Thus the subtraction that occurs in (3.33)
is also an operation in ℙℝ. As such it is defined by (3.29) and (3.30). As
a consequence of this, of course, the operation is inclusion isotone and the
inclusion property holds. We are interested in the evaluation of an expression
of the form x - a/B, where x and a are real numbers and B = [b1, b2] is an
interval with 0 ∈ B.
Table 3.5. The 8 cases of Z := x - a/B for a ∈ ℝ and B = [b1, b2] ∈ Iℝ with 0 ∈ B.

 case  a       B = [b1, b2]    -a/B                              Z := x - a/B
 1     a = 0   0 ∈ B           (-∞, +∞)                          (-∞, +∞)
 2     a ≠ 0   B = [0, 0]      [ ]                               [ ]
 3     a < 0   b1 < b2 = 0     (-∞, -a/b1]                       (-∞, x - a/b1]
 4     a < 0   b1 < 0 < b2     (-∞, -a/b1] ∪ [-a/b2, +∞)         (-∞, x - a/b1] ∪ [x - a/b2, +∞)
 5     a < 0   0 = b1 < b2     [-a/b2, +∞)                       [x - a/b2, +∞)
 6     a > 0   b1 < b2 = 0     [-a/b1, +∞)                       [x - a/b1, +∞)
 7     a > 0   b1 < 0 < b2     (-∞, -a/b2] ∪ [-a/b1, +∞)         (-∞, x - a/b2] ∪ [x - a/b1, +∞)
 8     a > 0   0 = b1 < b2     (-∞, -a/b2]                       (-∞, x - a/b2]
The general rules for subtraction of the type of sets which occur in column
4 of Table 3.5, from a real number x are:
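presumably

    x - (-∞, +∞) = (-∞, +∞),
    x - (-∞, c]  = [x - c, +∞),
    x - [c, +∞)  = (-∞, x - c],
    x - ((-∞, c] ∪ [d, +∞)) = (-∞, x - d] ∪ [x - c, +∞),
    x - [ ] = [ ].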
    X_{ν+1} := (m(X_ν) - f(m(X_ν))/F'(X_ν)) ∩ X_ν = N(X_ν) ∩ X_ν,   ν = 0, 1, 2, ...,
As shown by Tables 3.3 and 3.4 the result is no longer an interval of Iℝ.
It is an element of the powerset ℙℝ which, with the exception of case 2,
stretches continuously to -∞ or +∞ or both. The intersection N(X) ∩ X
with the finite interval X then produces a finite set again. It may consist
of a finite interval of Iℝ, or of two separate such intervals, or of the empty
set. These sets are now the starting values for the next iteration. This means
that in the case where two separate intervals have occurred, the iteration
has to be continued with two different starting values. This situation can
occur repeatedly. On a sequential computer where only one iteration can be
performed at a time all intervals which are not yet dealt with are collected
in a list. This list then is treated sequentially. If more than one processor is
available different subintervals can be dealt with in parallel.
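The following self-contained C sketch illustrates this scheme for the concrete function f(x) = x² - 2 on the starting interval X = [-2, 2] (an illustration only: the names are ours, IEEE infinities represent -∞ and +∞, and the directed roundings are omitted for clarity). The first Newton step splits X into two subintervals which are kept on a small work list; the iteration then contracts each of them onto one of the two zeros ±√2.

#include <math.h>
#include <stdio.h>

typedef struct { double lo, hi; } ival;

static double f (double x) { return x * x - 2.0; }
/* interval evaluation of f'(x) = 2x over X */
static ival   dF(ival X)   { return (ival){2.0 * X.lo, 2.0 * X.hi}; }

/* One extended Newton step: N(X) intersected with X.
   Returns the number of resulting pieces (0, 1 or 2) written to out[]. */
static int newton_step(ival X, ival *out) {
    double m = 0.5 * (X.lo + X.hi);
    double fm = f(m);
    ival D = dF(X);
    /* If f(m) = 0 while 0 lies in F'(X), N(X) would be the whole real line
       (case 1 of Table 3.5); as suggested in the text, shift m slightly. */
    if (fm == 0.0 && D.lo <= 0.0 && 0.0 <= D.hi) { m = nextafter(m, X.hi); fm = f(m); }
    ival q[2]; int nq = 0;
    if (D.lo > 0.0 || D.hi < 0.0) {                 /* 0 not in F'(X): ordinary division */
        double p1 = fm / D.lo, p2 = fm / D.hi;
        q[nq++] = (ival){fmin(p1, p2), fmax(p1, p2)};
    } else if (fm > 0.0) {                          /* extended division, two unbounded pieces */
        if (D.lo < 0.0) q[nq++] = (ival){-INFINITY, fm / D.lo};
        if (D.hi > 0.0) q[nq++] = (ival){fm / D.hi,  INFINITY};
    } else {                                        /* fm < 0 */
        if (D.hi > 0.0) q[nq++] = (ival){-INFINITY, fm / D.hi};
        if (D.lo < 0.0) q[nq++] = (ival){fm / D.lo,  INFINITY};
    }
    int n = 0;
    for (int i = 0; i < nq; i++) {                  /* N = m - q, then N intersected with X */
        ival N = {m - q[i].hi, m - q[i].lo};
        double lo = fmax(N.lo, X.lo), hi = fmin(N.hi, X.hi);
        if (lo <= hi) out[n++] = (ival){lo, hi};
    }
    return n;
}

int main(void) {
    ival work[64] = { { -2.0, 2.0 } };              /* work list of candidate intervals */
    int n = 1, guard = 0;
    while (n > 0 && guard++ < 1000) {
        ival X = work[--n];
        if (X.hi - X.lo < 1e-13) {
            printf("enclosure of a zero: [%.16g, %.16g]\n", X.lo, X.hi);
            continue;
        }
        ival out[2];
        int k = newton_step(X, out);
        for (int i = 0; i < k; i++) work[n++] = out[i];
    }
    return 0;   /* expected output: tight enclosures of -sqrt(2) and +sqrt(2) */
}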
Again, we illustrate this process by a simple example. The starting interval
is denoted by X = [x1, x2] and the result of the Newton operator by N =
[n1, n2]. See Fig. 3.4.
Now F'(X) is again a superset of all slopes of tangents of f(x) in the
interval X = [x1, x2], and 0 ∈ F'(X). N(X) again is the set of zeros of straight
lines through (x, f(x)) with slopes out of F'(X). Let F'(X) = [s1, s2]. Since
0 ∈ F'(X) we have s1 ≤ 0 and s2 ≥ 0. The straight lines through (x, f(x))
with the slopes s1 and s2 cut the real axis in n1 and n2. Thus the Newton
operator produces the set (-∞, n2] ∪ [n1, +∞).
Intersection with the original set X (the former iterate) delivers the set
consisting of two finite intervals of Iℝ. From this point the iteration has to
be continued with the two starting intervals [x1, n2] and [n1, x2].
Remark: In case of division of a finite interval A = [a1, a2] by an interval
B = [b1, b2] which contains zero, 8 non-overlapping cases of the result were
distinguished in Table 3.3 and its context. Applied to the Newton operator
these 8 cases resulted in the 8 cases in Table 3.5. So far we have discussed
the behaviour of the Interval Newton Method in the cases 3 to 8 of Table 3.5.
We are now going to consider and interpret the particular cases 1 and 2 of
Table 3.5 which, of course, may also occur. In Table 3.5 a stands for the
function value and B is the enclosure of all derivatives of f(x) in the interval
X.
Case 2 in Table 3.5 is easy to interpret. If B = [0, 0], i.e. the derivative is zero
in the entire interval X, then f(x) is constant in X and its value is f(x) = a ≠ 0.
So the occurrence of the empty interval in the Newton iteration indicates that
the function f(x) is a constant.
In case 1 of Table 3.5 the result of the Newton operator is the interval
(-00, +00). In this case the intersection with the former iterate X does not
reduce the interval and delivers the interval X again. The Newton iteration
does not converge! In this case the function value a is zero (or the numerator
A in case 1 of Table 3.3 contains zero) and a zero has already been found.
In order to avoid rounding noise and to obtain safe bounds for the solution
the value x may be shifted by a small ε to the left or right. This may transform
case 1 into one of the other cases 2 to 8.
These rules can be used to define an arithmetic for ordered pairs of num-
bers, similar to complex arithmetic or interval arithmetic. The first compo-
nent of the pair consists of a function value u(x0) at a point x0. The second
component consists of the value of the derivative u'(x0) of the function at the
point x0. For brevity we simply write (u, u') for the pair of numbers. Then
the following arithmetic for pairs follows immediately from (3.35):
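presumably the familiar differentiation rules

    (u, u') + (v, v') = (u + v, u' + v'),
    (u, u') - (v, v') = (u - v, u' - v'),
    (u, u') * (v, v') = (u * v, u' * v + u * v'),
    (u, u') / (v, v') = (u / v, (u' - (u / v) * v') / v),  v ≠ 0.   (3.36)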
With these rules the value of a function and the value of its first derivative
can be computed simultaneously for any computable real function at a point x0.
For brevity we call these values the function-derivative-value-pair. Why and
how can this computation be done?
Earlier in this paper we have defined a (computable) real function by an
arithmetic expression in the manner that arithmetic expressions are usually
defined in a programming language. Apart from the arithmetic operators
+, -, *, and /, arithmetic expressions contain only three kinds of operands
as basic ingredients. These are constants, variables and certain differentiable
elementary functions as, for instance, exp, ln, sin, cos, sqr, .... The derivatives
of these functions are also well known.
If for a function f(x) a function-derivative-value-pair is to be computed
at a point xo, all basic ingredients of the arithmetic expression of the function
are replaced by their particular function-derivative-value-pair by the following
rules:
    a constant:                 c        →  (c, 0),
    the variable:               x0       →  (x0, 1),
    the elementary functions:   exp(x0)  →  (exp(x0), exp(x0)),
                                ln(x0)   →  (ln(x0), 1/x0),
                                sin(x0)  →  (sin(x0), cos(x0)),          (3.37)
                                cos(x0)  →  (cos(x0), -sin(x0)),
                                sqr(x0)  →  (sqr(x0), 2·x0),
and so on.
Now the operations in the expression are executed following the rules
(3.36) of differentiation arithmetic. The result is the function-derivative-
value-pair (f(x0), f'(x0)) of the function f at the point x0.
Example: For the function f(x) = 25(x - 1)/(x² + 1) the function value
and the value of the first derivative are to be computed at the point x0 = 2.
Applying the substitutions (3.37) and the rules (3.36) we obtain
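in essence the chain of pairs x0 = 2 → (2, 1), x - 1 → (1, 1), 25(x - 1) → (25, 25), sqr(x) + 1 → (5, 4), and finally (25, 25)/(5, 4) = (5, 1), i.e. f(2) = 5 and f'(2) = 1.
The following C sketch carries out exactly this computation (an illustration in our own notation, not the book's code):

#include <stdio.h>

typedef struct { double v, d; } pair;          /* (function value, derivative value) */

static pair p_const(double c)       { return (pair){c, 0.0}; }
static pair p_var  (double x0)      { return (pair){x0, 1.0}; }
static pair p_add  (pair a, pair b) { return (pair){a.v + b.v, a.d + b.d}; }
static pair p_sub  (pair a, pair b) { return (pair){a.v - b.v, a.d - b.d}; }
static pair p_mul  (pair a, pair b) { return (pair){a.v * b.v, a.d * b.v + a.v * b.d}; }
static pair p_div  (pair a, pair b) { return (pair){a.v / b.v, (a.d - (a.v / b.v) * b.d) / b.v}; }

int main(void) {
    pair x   = p_var(2.0);
    pair num = p_mul(p_const(25.0), p_sub(x, p_const(1.0)));   /* 25*(x-1) */
    pair den = p_add(p_mul(x, x), p_const(1.0));                /* x*x + 1  */
    pair fx  = p_div(num, den);
    printf("(f(2), f'(2)) = (%g, %g)\n", fx.v, fx.d);           /* expected (5, 1) */
    return 0;
}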
Example: For the function f(x) = exp(sin(x)) the value and the value of
the first derivative are to be computed for x0 = π. Applying the above rules
we obtain
    (f(π), f'(π)) = (exp(sin(π)), exp(sin(π)) · cos(π))
                  = (exp(0), -exp(0)) = (1, -1).
program sample;
use itaylor;
function f(x: itaylor): itaylor[lb(x)..ub(x)];
begin
  f := exp(5000/(sin(11+sqr(x/100))+30));
end;
var a: interval; b, fb: itaylor[0..40];
begin
  read(a);
  expand(a,b);   { initialize the Taylor variable b from the interval a }
  fb := f(b);
  writeln('36th Taylor coefficient: ', fb[36]);
  writeln('40th Taylor coefficient: ', fb[40]);
end.
Test results: a = [1.001, 1.005]
36th Taylor coefficient: [-2.4139E+002, -2.4137E+002]
40th Taylor coefficient: [ 1.0759E-006, 1.0760E-006]
So far the basic set of all our considerations was the set of real numbers ℝ or
the set of extended real numbers ℝ* := ℝ ∪ {-∞} ∪ {+∞}. Actual computa-
tions, however, can only be carried out on a computer. The elements of ℝ and
Iℝ are in general not representable and the arithmetic operations defined for
them are not executable on the computer. So we have to map these spaces and
their operations onto computer representable subsets. Typical such subsets
are floating-point systems, for instance, as defined by the IEEE arithmetic
standard. However, in this article we do not assume any particular number
representation and data format of the computer representable subsets. The
considerations should apply to other data formats as well. Nevertheless, all
essential properties of floating-point systems are covered.
We assume that R is a finite subset of computer representable elements
of ℝ with the following properties:
The least positive nonzero element of R will be denoted by L and the
greatest positive element of R by C. Let R* := R ∪ {-∞} ∪ {+∞}.
Now let ∇ : ℝ* → R* and △ : ℝ* → R* be mappings of ℝ* onto R*
with the property that for all a ∈ ℝ*, ∇a is the greatest lower bound of a
in R* and △a is the least upper bound of a in R*. These mappings have the
following three properties which also define them uniquely [33,34]:
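presumably, in the usual formulation,

    (R1)  ∇a = a  for all a ∈ R*            (representable elements are left unchanged),
    (R2)  a ≤ b  ⇒  ∇a ≤ ∇b                 (monotonicity),
    (R3)  ∇a ≤ a  for all a ∈ ℝ*            (the rounding is directed downwards),

and analogously for △ with a ≤ △a.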
Since R* only contains a finite number of elements these can also
be written
Because of the finiteness of R* and IR* these can also be written
i.e. the infimum is the intersection and the supremum is the interval (con-
vex) hull of all intervals of S. As in the case of Iℝ* we shall use the usual
mathematical symbols ⋂S for inf⊆ S and ⋃S for sup⊆ S. The intersection
may be empty. If in particular S consists of just two elements A = [a1, a2]
and B = [b1, b2] this reads:
(a) The result of any computation in IR always has to include the result of
the corresponding computation in Iℝ.
(b) The result of the computation in IR should be as close as possible to the
result of the corresponding computation in Iℝ.
For all arithmetic operations o ∈ {+, -, *, /} in Iℝ, (a) means that the
computer approximation ◇o in IR must be defined in a way that the following
inequality holds:
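presumably

    A o B ⊆ A ◇o B  for all A, B ∈ IR and o ∈ {+, -, *, /},   (3.39)

and, for the elementary functions,

    f(A) ⊆ ◇f(A)  for all A ∈ IR.   (3.40)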
(3.39) and (3.40) are necessary consequences of (a). There are reasonably
good realizations of interval arithmetic on computers which only fulfil prop-
erty (a).
(b) is an independent additional requirement. In the cases (3.39) and
(3.40) it requires that A ◇o B and ◇f(A) should be the smallest interval in
IR* that includes the result A o B and f(A) in Iℝ* respectively. It turns
out that interval arithmetic on any computer is uniquely defined by this
requirement. Realization of it actually is the easiest way to support interval
arithmetic on the computer by hardware. To establish this is the aim of this
paper.
We are now going to discuss this arithmetic in detail. First we define the
mapping ◇ : Iℝ* → IR* which approximates each interval A of Iℝ* by
its least upper bound ◇A in IR* with respect to the order relation ⊆. This
mapping has the property that for each interval A = [a1, a2] ∈ Iℝ* its image
in IR* is
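presumably

    ◇A = [∇a1, △a2],

i.e. the lower bound of A is rounded downwards and the upper bound upwards.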
This mapping ◇ has the following properties which also define it uniquely
[33,34]:
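presumably

    (R1)  ◇A = A  for all A ∈ IR*,
    (R2)  A ⊆ B  ⇒  ◇A ⊆ ◇B              (monotonicity with respect to ⊆),
    (R3)  A ⊆ ◇A  for all A ∈ Iℝ*         (the rounding is directed upwards with respect to ⊆).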
We call this mapping ◇ the interval rounding. It has the additional property
◇(-A) = -◇A,
since with A = [a1, a2], -A = [-a2, -a1] and ◇(-A) = [∇(-a2), △(-a1)] =
[-△a2, -∇a1] = -[∇a1, △a2] = -◇A.
The interval rounding ◇ : Iℝ* → IR* is now employed in order to define
arithmetic operations ◇o, o ∈ {+, -, *, /}, in IR, i.e. on the computer, by

    A ◇o B := ◇(A o B),   (RG)

and, for the interval evaluation of an elementary function f, ◇f(A) is defined
analogously by applying the interval rounding ◇ to f(A). Thus ◇f(A) is the
least upper bound of f(A) in IR* with respect to the order relation ⊆.
If the arithmetic operations for elements of IR are defined by (RG) with
the rounding (R) the inclusion isotony and the inclusion property hold for the
computer approximations of all interval operations o ∈ {+, -, *, /}. These are
simple consequences of (R2) and (R3) respectively:
Inclusion isotony:

    A ⊆ C ∧ B ⊆ D  ⇒  A o B ⊆ C o D  ⇒(R2)  ◇(A o B) ⊆ ◇(C o D)  ⇒(RG)  A ◇o B ⊆ C ◇o D,
    for A, B, C, D ∈ IR.

Inclusion property:

    a ∈ A ∧ b ∈ B  ⇒  a o b ∈ A o B  ⇒(R3)  a o b ∈ ◇(A o B)  ⇒(RG)  a o b ∈ A ◇o B,
    for a, b ∈ ℝ, A, B ∈ IR.
Both properties also hold for the interval evaluation of the elementary func-
tions:
Inclusion isotony:  A ⊆ B  ⇒  f(A) ⊆ f(B)  ⇒(R2)  ◇f(A) ⊆ ◇f(B),
    for A, B ∈ IR.
Since ∇ : ℝ* → R* and △ : ℝ* → R* are monotone mappings (R2), we
obtain
Employing this equation and the explicit formulas for the arithmetic opera-
tions in Iℝ listed under I, II, III, IV, V, VI in Section 3.6 leads to the follow-
ing formulas for the execution of the arithmetic operations ◇o, o ∈ {+, -, *, /},
in IR on the computer for intervals A = [a1, a2] and B = [b1, b2] ∈ IR:
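presumably, for addition and subtraction,

    A ◇+ B = [a1 ∇+ b1, a2 △+ b2],
    A ◇- B = [a1 ∇- b2, a2 △- b1],

and, for multiplication and division, the case distinctions of Tables 3.6 and 3.7
with every lower bound computed with the rounding ∇ and every upper bound
with the rounding △.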
These formulas show, in particular, that the operations ◇o, o ∈ {+, -, *, /},
in IR are executable on a computer if the operations ∇o and △o, o ∈ {+, -, *, /},
for elements of R are available. These operations have been defined earlier in
this Section by a ∇o b := ∇(a o b) and a △o b := △(a o b).
This in turn shows the importance of the roundings ∇ : ℝ* → R* and
△ : ℝ* → R*.
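On a processor implementing IEEE 754 arithmetic these two roundings are available as rounding modes. The following C99 sketch shows how a sum with the two directed roundings, and with it the interval sum A ◇+ B = [a1 ∇+ b1, a2 △+ b2], can be obtained in software. It is only an illustration with our own names; whether the rounding-mode switches are honoured may additionally require the FENV_ACCESS pragma and restrained compiler optimization:

#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON   /* tell the compiler that the rounding mode is changed */

typedef struct { double lo, hi; } interval;

/* a (down+) b : addition rounded downwards */
static double add_down(double a, double b) {
    fesetround(FE_DOWNWARD);
    volatile double r = a + b;
    fesetround(FE_TONEAREST);
    return r;
}

/* a (up+) b : addition rounded upwards */
static double add_up(double a, double b) {
    fesetround(FE_UPWARD);
    volatile double r = a + b;
    fesetround(FE_TONEAREST);
    return r;
}

/* A (+) B = [a1 (down+) b1, a2 (up+) b2] */
static interval iadd(interval A, interval B) {
    interval C = { add_down(A.lo, B.lo), add_up(A.hi, B.hi) };
    return C;
}

int main(void) {
    interval A = {1.0, 1.0}, B = {1e-20, 1e-20};
    interval C = iadd(A, B);
    /* 1 + 1e-20 is not a floating-point number: the lower bound stays 1,
       the upper bound becomes the successor of 1, so C has a positive width. */
    printf("C = [%.17g, %.17g], width = %g\n", C.lo, C.hi, C.hi - C.lo);
    return 0;
}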
Table 3.8. The 8 cases of the division of two intervals A ◇/ B, with A, B ∈ IR and
0 ∈ B.

                 B = [0, 0]   b1 < b2 = 0       b1 < 0 < b2                          0 = b1 < b2
 a2 < 0          [ ]          [a2 ∇/ b1, +∞]    [-∞, a2 △/ b2] ∪ [a2 ∇/ b1, +∞]      [-∞, a2 △/ b2]
 a1 ≤ 0 ≤ a2     [-∞, +∞]     [-∞, +∞]          [-∞, +∞]                             [-∞, +∞]
 a1 > 0          [ ]          [-∞, a1 △/ b1]    [-∞, a1 △/ b1] ∪ [a1 ∇/ b2, +∞]      [a1 ∇/ b2, +∞]
    x ◇- [-∞, +∞] = [-∞, +∞],
    x ◇- [-∞, y]   = [x ∇- y, +∞],
    x ◇- [y, +∞]   = [-∞, x △- y],
    x ◇- ([-∞, y] ∪ [z, +∞]) = [-∞, x △- z] ∪ [x ∇- y, +∞],
    x ◇- [ ] = [ ].
After the computation of the Interval Newton Operator the intersection with
a finite interval [c1, c2] still has to be taken in the generalized Interval Newton
Method. The result may be one or two finite intervals or the empty interval [ ].
These cases are expressed by the following explicit formulas:
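presumably

    [-∞, +∞] ∩ [c1, c2] = [c1, c2],
    [-∞, y] ∩ [c1, c2] = [ ] if y < c1, and = [c1, min(y, c2)] otherwise,
    [z, +∞] ∩ [c1, c2] = [ ] if z > c2, and = [max(z, c1), c2] otherwise,
    ([-∞, y] ∪ [z, +∞]) ∩ [c1, c2] = the union of the two preceding results,
    [ ] ∩ [c1, c2] = [ ].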
For geometric reasons [c1, c2] can only occur as the result of the intersection
in the first case.
For interval arithmetic the roundings ∇ : ℝ* → R* and △ : ℝ* → R* are
of particular interest. They can be defined by the following properties:
Using the function t(a) (truncation of a, i.e. the digits beyond the last mantissa
place are dropped) the description of ∇a can be shortened:

    ∇a = t(a) for a ≥ 0.
This is very easy to execute. Truncation can also be used to perform the
rounding ∇a in case of negative numbers a < 0 if negative numbers are rep-
resented by their b-complement. Then the rounded value ∇a can be obtained
by truncation of the b-complement a + x of a via the process:
    a  →  a + x (b-complement)  →  t(a + x) (truncation)  →  t(a + x) - x (b-complement),

i.e. ∇a = t(a + x) - x.
Example: We assume that the decimal number system is used, and that
the mantissa has three decimal digits. Then we obtain for the positive real
number a = 0.354672 · 10³ ∈ ℝ:
Here the easily executable b-complement has been taken twice. In between
the function t(a) was applied which also is easily executable. These three
steps are particularly simple if the binary number system is used.
It is interesting that in the case of the (b - 1)-complement representation of
negative numbers the monotone rounding downwards ∇a cannot be executed
by the function t(a). This representation is isomorphic to the sign-magnitude
representation.
In the preceding Sections 3.1 to 3.8 ideal interval arithmetic for elements
of Iℝ including division by an interval which contains zero has been devel-
oped. In no case did the symbols -∞ and +∞ occur as result of an interval
operation. This is not so in this Section where interval arithmetic on the com-
puter is considered. Here -∞ and +∞ can occur as result of the roundings ∇
and △, and as result of the operations a ∇o b and a △o b, o ∈ {+, -, *, /}, respec-
tively. The interval rounding is defined by ◇A := [∇a1, △a2], and the arith-
metic operations for intervals A, B ∈ IR are defined by A ◇o B := ◇(A o B),
o ∈ {+, -, *, /}. As a consequence of this the symbols -∞ and +∞ can also
occur as bounds of the result of an interval operation.
This happens, for instance, in case of division by an interval which con-
tains zero, see Table 3.8. The extended Interval Newton Method is an example
of this. We have studied this process in detail. Here very large intervals with
-00 and +00 as bounds only appear intermediately. They disappear again as
soon as the intersection with the previous approximation is taken. Finally the
diameters of the approximations decrease to small bounds for the solution.
Among the six interval operations addition, subtraction, multiplication,
division, intersection, and interval hull, the intersection is the only operation
which can reduce an interval which stretches to -00 or +00 or both to a
finite interval again. This step is advantageously used in the extended Interval
Newton Method.
Also certain elementary functions can reduce an interval which stretches
to -00 or +00 or both to a finite interval again. In such a case continuation
of the computation may also be reasonable. The user has to take care that
such situations are appropriately treated in his program.
In general, the appearance of -00 or +00 in the result of an interval oper-
ation indicates that an exponent overflow has occurred or that an operation
or an elementary function has been called outside its range of definition. This
means that the computation has gotten out of control. In this case continua-
tion of the computation is not really recommendable. An appropriate scaling
of the problem may be necessary.
Here the situation is very different from a conventional floating-point com-
putation. In floating-point arithmetic the general directive often is just to
"compute" at any price, hoping that at the end of the computation some-
thing that is reasonable will be delivered. In this process the non numbers
-00, +00, and even NaN (not a number) are often treated as numbers and
the computation is continued with these entities. Since a floating-point com-
putation often slips out of control anyhow it must be the user's responsibility
to control and judge the final result by other means.
In interval mathematics the general philosophy is very different. The user
and the computation itself are controlling the computational process at any
time. In general, an interval computation is aiming to compute small bounds
for the solution of the problem. If during a computation the intervals grow
overly large or even an interval appears which stretches to -00 or +00 or
both, this should be taken as a severe warning. It should cause the user
to think about and study the computational process again with the aim
of obtaining smaller intervals. Blind continuation of the computation even
with non numbers as in the case of floating-point arithmetic hoping that
something reasonable will come out at the end is in strong contradiction to
the philosophy and basic understanding of interval mathematics.
bounds of the interval operands. The lower bound of the resulting interval has
to be computed with rounding downwards and the upper bound with round-
ing upwards. While addition and subtraction are straightforward, multiplica-
tion and division require a detailed case analysis and in case of multiplication
additionally a maximum / minimum computation if both interval operands
contain zero. This may slow down these operations considerably in particular
if the case analysis is performed in software. Thus in summary an interval
operation is slower by a factor of at least two on a conventional sequential
processor in comparison with the corresponding floating-point operation.
We show in this section that with dedicated hardware interval arithmetic
can be made more or less as fast as simple floating-point arithmetic. The
cost increase for the additional hardware is relatively modest and it is close
to zero on superscalar processors. Although different in detail we follow in
this Section ideas of [71,72].
We assume in this section that one arithmetic operation as well as one
comparison costs one unit of time, whereas switches controlled by one bit, such as
the sign bit, or data transports inside the unit, are free of charge. For simplicity
we denote the computer operations for intervals in this section by +, -, *,
and /. The interval operands are denoted by A = [a1, a2] and B = [b1, b2].
The lower bound and upper bound of the result are denoted by lb and ub
respectively, i.e. [lb, ub] := [a1, a2] o [b1, b2], o ∈ {+, -, *, /}.
3.10.2 Multiplication A * B
A basic method for the multiplication of two intervals is the method of case
distinction. Nine cases have been distinguished in Table 3.6. In eight of the
nine cases one multiplication with directed rounding suffices for the compu-
tation of each bound of the resulting interval. When both interval operands
contain zero, two multiplications and a comparison are needed for each bound.
Algorithm 1:
The eight cases with only one multiplication for each bound can be obtained
by:
(A) lb := ( if (b1 ≥ 0 ∨ (a2 ≤ 0 ∧ b2 > 0)) then a1 else a2 )
       ∇* ( if (a1 ≥ 0 ∨ (a2 > 0 ∧ b2 ≤ 0)) then b1 else b2 );
(B) ub := ( if (b1 ≥ 0 ∨ (a1 ≥ 0 ∧ b2 > 0)) then a2 else a1 )
       △* ( if (a1 ≥ 0 ∨ (a2 > 0 ∧ b1 ≥ 0)) then b2 else b1 );
and the final case where two multiplications have to be performed for each
bound by:
(C) p := a1 ∇* b2;
(D) q := a2 ∇* b1;
(E) lb := min(p, q); r := a1 △* b1;
(F) s := a2 △* b2;
(G) ub := max(r, s);
Taking all parts together we have:
if (a1 < 0 ∧ a2 > 0 ∧ b1 < 0 ∧ b2 > 0) then
    { (C), (D), (E), (F), (G) }
else
    { (A), (B) };
The correctness of the algorithm can be checked against the case distinc-
tions of Table 3.6. The algorithm needs 5 time steps in the worst case. In all
the other cases the product can be computed in two time steps.
If two multipliers and one comparator are provided the same algorithm
reduces the execution time to one time step for (A), (B) and three time steps
for (C), (D), (E), (F), (G). Two multiplications and a comparison can then
be performed in parallel:
Algorithm 2:
(A) lb := ( if (b1 ≥ 0 ∨ (a2 ≤ 0 ∧ b2 > 0)) then a1 else a2 )
       ∇* ( if (a1 ≥ 0 ∨ (a2 > 0 ∧ b2 ≤ 0)) then b1 else b2 );
    ub := ( if (b1 ≥ 0 ∨ (a1 ≥ 0 ∧ b2 > 0)) then a2 else a1 )
       △* ( if (a1 ≥ 0 ∨ (a2 > 0 ∧ b1 ≥ 0)) then b2 else b1 );
and
Algorithm 3:
By (3.41) the interval product can be computed by the following formula:
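presumably

    A * B := [ ∇ min(a1·b1, a1·b2, a2·b1, a2·b2), △ max(a1·b1, a1·b2, a2·b1, a2·b2) ].   (3.41)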
Note that here the minimum and maximum are taken from the unrounded
products of double length. The algorithm always needs 5 time steps. In algo-
rithm 1 this is the worst case.
Algorithm 4:
Using the same formula but 2 multipliers, 2 comparators and 2 assignments
leads to:
(A) p := a1 * b1; q := a1 * b2;
(B) r := a2 * b1; s := a2 * b2; MIN := min(p, q); MAX := max(p, q);
(C) MIN := min(MIN, r); MAX := max(MAX, r);
(D) lb := ∇min(MIN, s); ub := △max(MAX, s);
Again the minimum and maximum are taken from the unrounded products.
The algorithm needs 4 time steps. This is one time step more than the cor-
responding algorithm 2 using case distinction with two multipliers.
Let us denote the components of the two interval vectors A = (Ak) and
B = (Bk) by Ak = [ak1, ak2] and Bk = [bk1, bk2], k = 1(1)n. Then the
product Ak * Bk is to be computed by
Algorithm 5:
(A) p := ak1 * bk1;
(B) q := ak1 * bk2;
(C) r := ak2 * bk1; MIN := min(p, q); MAX := max(p, q);
(D) s := ak2 * bk2; MIN := min(MIN, r); MAX := max(MAX, r);
(E) p := a(k+1)1 * b(k+1)1; MIN := min(MIN, s); MAX := max(MAX, s);
(F) q := a(k+1)1 * b(k+1)2; lb := lb + MIN; ub := ub + MAX;
This algorithm shows that in each sequence of 4 time steps one inter-
val product can be accumulated. Again the minimum and maximum are
taken from the unrounded products. Only at the very end of the accumula-
tion of the bounds is a rounding applied. Then lb and ub are floating-point
numbers which optimally enclose the product A*B of the two interval vectors
A and B.
In the algorithms 3, 4, and 5 the unrounded, double length products were
compared and used for the computation of their minimum and maximum
corresponding to (3.41). This requires comparators of double length. This
can be avoided if formula (3.42) is used instead:
    A * B := [ min(a1 ∇* b1, a1 ∇* b2, a2 ∇* b1, a2 ∇* b2),
               max(a1 △* b1, a1 △* b2, a2 △* b1, a2 △* b2) ].   (3.42)
If the product a * b is a floating-point number, then it is "exact" and

    a ∇* b = a * b = a △* b.

If the product a * b is not a floating-point number, then it is "not exact" and
the product with rounding upwards can be obtained by taking the successor:
a △* b := succ(a ∇* b). This changes algorithm 4, for instance, into
Algorithm 6:
(A) p := a1 ∇* b1; q := a1 ∇* b2;
(B) r := a2 ∇* b1; s := a2 ∇* b2; MIN := min(p, q); MAX := max(p, q);
(C) MIN := min(MIN, r); MAX := max(MAX, r);
(D) lb := min(MIN, s); MAX := max(MAX, s);
(E) if MAX = "exact" then ub := MAX else ub := succ(MAX);
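In software, formula (3.42) can be realized with the two IEEE rounding modes. The following C sketch is only an illustration of the formula (our own names, not the hardware circuit discussed here) and relies on the compiler honouring the rounding-mode switches:

#include <fenv.h>
#include <math.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

typedef struct { double lo, hi; } interval;

static double min4(double a, double b, double c, double d) { return fmin(fmin(a, b), fmin(c, d)); }
static double max4(double a, double b, double c, double d) { return fmax(fmax(a, b), fmax(c, d)); }

/* Interval product by formula (3.42): all four products are formed twice,
   once rounded downwards for the lower bound and once rounded upwards
   for the upper bound. */
static interval imul(interval A, interval B) {
    interval R;
    fesetround(FE_DOWNWARD);
    R.lo = min4(A.lo * B.lo, A.lo * B.hi, A.hi * B.lo, A.hi * B.hi);
    fesetround(FE_UPWARD);
    R.hi = max4(A.lo * B.lo, A.lo * B.hi, A.hi * B.lo, A.hi * B.hi);
    fesetround(FE_TONEAREST);
    return R;
}

int main(void) {
    interval A = {-2.0, 3.0}, B = {-4.0, 5.0};
    interval C = imul(A, B);
    printf("[%g, %g]\n", C.lo, C.hi);   /* expected [-12, 15] */
    return 0;
}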
3.10.4 Division A / B
Algorithm 7:
if b2 < 0 ∨ b1 > 0 then
{
    lb := ( if b1 > 0 then a1 else a2 ) ∇/
          ( if a1 ≥ 0 ∨ (a2 > 0 ∧ b2 < 0) then b2 else b1 );
    ub := ( if b1 > 0 then a2 else a1 ) △/
          ( if a1 ≥ 0 ∨ (a2 > 0 ∧ b1 > 0) then b1 else b2 );
} else {
    if (a1 ≤ 0 ∧ 0 ≤ a2 ∧ b1 ≤ 0 ∧ 0 ≤ b2) then [lb, ub] := [-∞, +∞];
    if ((a2 < 0 ∨ a1 > 0) ∧ b1 = 0 ∧ b2 = 0) then [lb, ub] := [ ];
    if (a2 < 0 ∧ b2 = 0) then [lb, ub] := [a2 ∇/ b1, +∞];
    if (a2 < 0 ∧ b1 = 0) then [lb, ub] := [-∞, a2 △/ b2];
    if (a1 > 0 ∧ b2 = 0) then [lb, ub] := [-∞, a1 △/ b1];
    if (a1 > 0 ∧ b1 = 0) then [lb, ub] := [a1 ∇/ b2, +∞];
    if (a2 < 0 ∧ b1 < 0 ∧ b2 > 0) then { [lb1, ub1] := [-∞, a2 △/ b2];
                                          [lb2, ub2] := [a2 ∇/ b1, +∞]; }
    if (a1 > 0 ∧ b1 < 0 ∧ b2 > 0) then { [lb1, ub1] := [-∞, a1 △/ b1];
                                          [lb2, ub2] := [a1 ∇/ b2, +∞]; }
}
The algorithm is organized in such a way that the most complicated cases,
where the result consists of two separate intervals, appear at the end. It would
be possible also in these cases to write the result as a single interval which then
would overlap the point infinity. In such an interval the lower bound would
then be greater than the upper bound. This could cause difficulties with the
order relation. So we prefer the notation with the two separate intervals. On
the other hand, the representation of the result as an interval which overlaps
the point infinity has advantages as well. The result of an interval division
then always consists of just two bounds. In the Newton step the separation
into two intervals then would have to be done by the intersection.
In practice, division by an interval that contains zero occurs infrequently.
So algorithm 7 shows again that on a processor with two dividers and some
Convenient high level programming languages with particular data types and
operators for intervals, the XSC-languages for instance [11,12,26-29,37,38,
69,70,77], have been in use for more than thirty years now. Due to the lack
of hardware and instruction set support for interval arithmetic, subroutine
calls have to be used by the compiler to map the interval operators and com-
parisons to appropriate floating-point instructions. This slows down interval
arithmetic by a factor close to ten compared to the corresponding floating-
point arithmetic.
It has been shown in the last Section that with appropriate hardware
support interval operations can be made as fast as floating-point operations.
Three additional measures are necessary to let an interval calculation on
the computer run at a speed comparable to the corresponding floating-point
calculation:
From the mathematical point of view the following instructions for interval
operations are desirable (A = [a1, a2], B = [b1, b2]):
Algebraic operators:
    addition        C := A + B             C := [a1 ∇+ b1, a2 △+ b2],
    subtraction     C := A - B             C := [a1 ∇- b2, a2 △- b1],
    negation        C := -A                C := [-a2, -a1],
    multiplication  C := A * B             Table 3.6,
    division        C := A / B, 0 ∉ B      Table 3.7,
                    C := A / B, 0 ∈ B      Table 3.8,
    scalar product  C := ◇(A · B)          for interval vectors A = (Ak) and
                                           B = (Bk), see the first chapter.
17. Hammer, R.; Hocks, M.; Kulisch, U.; Ratz, D.: Numerical Toolbox for Ver-
ified Computing I: Basic Numerical Problems. (Vol. II see [31], version
in C++ see [18]) Springer-Verlag, Berlin / Heidelberg / New York, 1993.
18. Hammer, R.; Hocks, M.; Kulisch, U.; Ratz, D.: C++ Toolbox for Veri-
fied Computing: Basic Numerical Problems. Springer-Verlag, Berlin /
Heidelberg / New York, 1995.
19. Hansen, E.: Topics in Interval Analysis. Clarendon Press, Oxford, 1969.
20. Hansen, E.: Global Optimization Using Interval Analysis. Marcel
Dekker Inc., New York/Basel/Hong Kong, 1992.
21. Herzberger, J. (ed.): Topics in Validated Computations. Proceedings of
IMACS-GAMM International Workshop on Validated Numerics, Oldenburg,
1993. North Holland, 1994.
22. Herzberger, J.: Wissenschaftliches Rechnen, Eine Einführung in das
Scientific Computing. Akademie Verlag, 1995.
23. Kaucher, E.: Über metrische und algebraische Eigenschaften einiger beim nu-
merischen Rechnen auftretender Räume. Dissertation, Universität Karlsruhe,
1973.
24. Kaucher, E.: Algebraische Erweiterungen der Intervallrechnung unter Erhal-
tung der Ordnungs- und Verbandsstrukturen. In: Albrecht, R.; Kulisch, U.
(Eds.): Grundlagen der Computerarithmetik. Computing Supplementum
1. Springer-Verlag, Wien / New York, pp. 65-79, 1977.
25. Kaucher, E.: Über Eigenschaften und Anwendungsmöglichkeiten der erweiterten
Intervallrechnung und des hyperbolischen Fastkörpers über R. In: Albrecht, R.;
Kulisch, U. (Eds.): Grundlagen der Computerarithmetik. Computing
Supplementum 1. Springer-Verlag, Wien / New York, pp. 81-94, 1977.
26. Klatte, R.; Kulisch, U.; Neaga, M.; Ratz, D.; Ullrich, Ch.: PASCAL-
XSC Sprachbeschreibung mit Beispielen. Springer-Verlag,
Berlin/Heidelberg/New York, 1991 (ISBN 3-540-53714-7, 0-387-53714-7).
27. Klatte, R.; Kulisch, U.; Neaga, M.; Ratz, D.; Ullrich, Ch.: PASCAL-
XSC - Language Reference with Examples. Springer-Verlag,
Berlin/Heidelberg/New York, 1992.
28. Klatte, R.; Kulisch, U.; Lawo, C.; Rauch, M.; Wiethoff, A.: C-XSC, A C++
Class Library for Extended Scientific Computing. Springer-Verlag,
Berlin/Heidelberg/New York, 1993.
29. Klatte, R.; Kulisch, U.; Neaga, M.; Ratz, D.; Ullrich, Ch.: PASCAL-XSC
- Language Reference with Examples (In Russian). Moscow, 1994,
second edition 2000.
30. Knöfel, A.: Hardwareentwurf eines Rechenwerks für semimorphe Skalar- und
Vektoroperationen unter Berücksichtigung der Anforderungen verifizierender
Algorithmen. Dissertation, Universität Karlsruhe, 1991.
31. Kramer, W.; Kulisch, U.; Lohner, R.: Numerical Toolbox for Verified
Computing II: Theory, Algorithms and Pascal-XSC Programs. (Vol. I
see [17,18]) Springer-Verlag, Berlin / Heidelberg / New York, to appear.
32. Krawczyk, R.; Neumaier, A.: Interval Slopes for Rational Functions and Asso-
ciated Centered Forms. SIAM Journal on Numerical Analysis 22, pp. 604-616,
1985.
33. Kulisch, U.: Grundlagen des Numerischen Rechnens - Mathematische
Begründung der Rechnerarithmetik. Reihe Informatik, Band 19,
Bibliographisches Institut, Mannheim/Wien/Zürich, 1976 (ISBN 3-411-01517-9).
34. Kulisch, U.; Miranker, W. L.: Computer Arithmetic in Theory and Prac-
tice. Academic Press, New York, 1981 (ISBN 0-12-428650-x).
35. Kulisch, U.; Ullrich, Ch. (Eds.): Wissenschaftliches Rechnen und Pro-
grammiersprachen. Proceedings of Seminar held in Karlsruhe, April 2-3,
1982. Berichte des German Chapter of the ACM, Band 10, B. G. Teubner Ver-
lag, Stuttgart, 1982 (ISBN 3-519-02429-2).
36. Kulisch, U.; Miranker, W. L. (Eds.): A New Approach to Scientific Com-
putation. Proceedings of Symposium held at IBM Research Center, Yorktown
Heights, N. Y., 1982. Academic Press, New York, 1983 (ISBN 0-12-428660-7).
37. Kulisch, U. (Ed.): PASCAL-SC: A PASCAL extension for scientific
computation, Information Manual and Floppy Disks, Version IBM PC/AT;
Operating System DOS. B. G. Teubner Verlag (Wiley-Teubner series in com-
puter science), Stuttgart, 1987 (ISBN 3-519-02106-4 / 0-471-91514-9).
38. Kulisch, U. (Ed.): PASCAL-SC: A PASCAL extension for scientific
computation, Information Manual and Floppy Disks, Version ATARI ST. B.
G. Teubner Verlag, Stuttgart, 1987 (ISBN 3-519-02108-0).
39. Kulisch, U. (Ed.): Wissenschaftliches Rechnen mit Ergebnisverifikation
- Eine Einführung. Ausgearbeitet von S. Georg, R. Hammer und D. Ratz.
Vol. 58. Akademie Verlag, Berlin, und Vieweg Verlagsgesellschaft, Wiesbaden,
1989.
40. Kulisch, U.: Advanced Arithmetic for the Digital Computer - Design
of Arithmetic Units. Electronic Notes in Theoretical Computer Science,
http://www.elsevier.nl/locate/entcs/volume24.html pp. 1-72, 1999.
41. Lohner, R.: Einschließung der Lösung gewöhnlicher Anfangs- und Randwert-
aufgaben und Anwendungen. Dissertation, Universität Karlsruhe, 1988.
42. Lohner, R.: Computation of Guaranteed Enclosures for the Solutions of Ordi-
nary Initial and Boundary Value Problems. pp. 425-435 in: Cash, J. R.; Glad-
well, I. (Eds.): Computational Ordinary Differential Equations. Claren-
don Press, Oxford, 1992.
43. Mayer, G.: Grundbegriffe der Intervallrechnung. In [39, pp. 101-117], 1989.
44. Moore, R. E.: Interval Analysis. Prentice Hall Inc., Englewood Cliffs, N. J.;
1966.
45. Moore, R. E.: Methods and Applications of Interval Analysis. SIAM,
Philadelphia, Pennsylvania, 1979.
46. Moore, R. E. (Ed.): Reliability in Computing: The Role of Interval
Methods in Scientific Computing. Proceedings of the Conference at
Columbus, Ohio, September 8-11, 1987; Perspectives in Computing 19, Aca-
demic Press, San Diego, 1988 (ISBN 0-12-505630-3).
47. Neumaier, A.: Interval Methods for Systems of Equations. Cambridge
University Press, Cambridge, 1990.
48. Neumann, J. von; Goldstine, H. H.: Numerical Inverting of Matrices of High
Order. Bulletin of the American Mathematical Society, 53, 11, pp. 1021-1099,
1947.
49. Rall, L. B.: Automatic Differentiation: Techniques and Applications.
Lecture Notes in Computer Science, No. 120, Springer-Verlag, Berlin, 1981.
50. Ratschek, H.; Rokne, J.: Computer Methods for the Range of Functions.
Ellis Horwood Limited, Chichester, 1984.
51. Ratz, D.: Programmierpraktikum mit PASCAL-SC. In: Höhler, G.; Stau-
denmaier, H. M. (Hrsg.): Computer Theoretikum und Praktikum für
Physiker. Band 5, Fachinformationszentrum Karlsruhe, 1990.
52. Ratz, D.: Globale Optimierung mit automatischer Ergebnisverifikation. Disser-
tation, Universität Karlsruhe, 1992.
53. Ratz, D.: Automatic Slope Computation and its Application in Non-
smooth Global Optimization. Shaker Verlag, Aachen, 1998.
54. Ratz, D.: On Extended Interval Arithmetic and Inclusion Isotony. Preprint,
Institut für Angewandte Mathematik, Universität Karlsruhe, 1999.
55. Rump, S. M.: Kleine Fehlerschranken bei Matrixproblemen. Dissertation, Uni-
versität Karlsruhe, 1980.
56. Rump, S. M.: How Reliable are Results of Computers? / Wie zuverlässig sind
die Ergebnisse unserer Rechenanlagen? In: Jahrbuch Überblicke Mathematik,
Bibliographisches Institut, Mannheim, 1983.
57. Rump, S.M.: Validated Solution of Large Linear Systems. In [2, pp. 191-212],
1993.
58. Rump, S.M.: Verification Methods for Dense and Sparse Systems of Equations.
In [21, pp. 63-135], 1994.
59. Rump, S.M.: INTLAB - Interval Laboratory. TU Hamburg-Harburg, 1998.
60. Schmidt, L.: Semimorphe Arithmetik zur automatischen Ergebnisverifikation
auf Vektorrechnern. Dissertation, Universität Karlsruhe, 1992.
61. Shiriaev, D. V.: Fast Automatic Differentiation for Vector Processors and Re-
duction of the Spatial Complexity in a Source Translation Environment. Dis-
sertation, Universität Karlsruhe, 1994.
62. Sunaga, T.: Theory of an interval algebra and its application to numerical anal-
ysis. RAAG Memoires 2, pp. 547-564, 1958.
63. Teufel, T.: Ein optimaler Gleitkommaprozessor. Dissertation, Universität Karls-
ruhe, 1984.
64. Ullrich, Ch. (Ed.): Computer Arithmetic and Self-Validating Numerical
Methods. (Proceedings of SCAN 89, held in Basel, Oct. 2-6, 1989, invited
papers). Academic Press, San Diego, 1990.
65. Walter, W. V.: FORTRAN-SC, A FORTRAN Extension for Engineering /
Scientific Computation with Access to ACRITH: Language Description with
Examples. In [46, pp. 43-62], 1988.
66. Walter, W. V.: Einführung in die wissenschaftlich-technische Programmier-
sprache FORTRAN-SC. ZAMM 69, 4, T52-T54, 1989.
67. Walter, W. V.: FORTRAN-SC: A FORTRAN Extension for Engineering /
Scientific Computation with Access to ACRITH, Language Reference and User's
Guide. 2nd ed., pp. 1-396, IBM Deutschland GmbH, Stuttgart, Jan. 1989.
68. Walter, W. V.: Flexible Precision Control and Dynamic Data Structures for
Programming Mathematical and Numerical Algorithms. Dissertation, Univer-
sität Karlsruhe, 1990.
69. Wippermann, H.-W.: Realisierung einer Intervallarithmetik in einem ALGOL-
60 System. Elektronische Rechenanlagen 9, pp. 224-233, 1967.
70. Wippermann, H.-W.: Implementierung eines ALGOL-60 Systems mit
Schrankenzahlen. Elektronische Datenverarbeitung 10, pp. 189-194, 1968.
71. Wolff v. Gudenberg, J.: Hardware Support for Interval Arithmetic, Extended
Version. Report No. 125, Institut für Informatik, Universität Würzburg, 1995.
72. Wolff v. Gudenberg, J.: Hardware Support for Interval Arithmetic. In [8, pp. 32-
38], 1996.
73. Wolff v. Gudenberg, J.: Proceedings of Interval'96. International Confer-
ence on Interval Methods and Computer Aided Proofs in Science and Engi-
neering, Würzburg, Germany, Sep. 30 - Oct. 2, 1996. Special issue 3/97 of the
journal Reliable Computing, 1997.
74. Yohe, J.M.: Roundings in Floating-Point Arithmetic. IEEE Trans. on Com-
puters, Vol. C-22, No.6, June 1973, pp. 577-586.
75. IBM: IBM System/370 RPQ. High Accuracy Arithmetic. SA 22-7093-0,
IBM Deutschland GmbH (Department 3282, Schönaicher Strasse 220, D-71032
Böblingen), 1984.