
Ulrich W. Kulisch

Advanced Arithmetic
for the Digital Computer

Design of Arithmetic Units

Springer-Verlag Wien GmbH


Dr. Ulrich W. Kulisch
Professor of Mathematics
Institut für Angewandte Mathematik, Universität Karlsruhe, Germany

This work is subject to copyright. All rights are reserved, whether the whole or part of
the material is concerned, specifically those of translation, reprinting, re-use of
illustrations, broadcasting, reproduction by photocopying machines or similar means,
and storage in data banks.

© 2002 Springer-Verlag Wien


Originally published by Springer-Verlag Wien in 2002

Product Liability: The publisher can give no guarantee for all the information contained
in this book. This does also refer to information about drug dosage and application
thereof. In every individual case the respective user must check its accuracy by
consulting other pharmaceutical literature. The use of registered names, trademarks,
etc. in this publication does not imply, even in the absence of a specific statement, that
such names are exempt from the relevant protective laws and regulations and therefore
free for general use.

Cover illustration: Falko Schröder, University of Applied Science, Zweibrücken


Typesetting: Camera ready by editor

Printed on acid-free and chlorine-free bleached paper


SPIN: 10892255
CIP data applied for

With 32 Figures (1 in colour)

ISBN 978-3-211-83870-9 ISBN 978-3-7091-0525-2 (eBook)


DOI 10.1007/978-3-7091-0525-2
Preface

The number one requirement for computer arithmetic has always been speed.
It is the main force that drives the technology. With increased speed larger
problems can be attempted. To gain speed, advanced processors and pro-
gramming languages offer, for instance, compound arithmetic operations like
matmul and dotproduct.
But there is another side to the computational coin - the accuracy and
reliability of the computed result. Progress on this side is very important, if
not essential. Compound arithmetic operations, for instance, should always
deliver a correct result. The user should not be obliged to perform an error
analysis every time a compound arithmetic operation, implemented by the
hardware manufacturer or in the programming language, is employed.
This treatise deals with computer arithmetic in a more general sense than
usual. Advanced computer arithmetic extends the accuracy of the elementary
floating-point operations, for instance, as defined by the IEEE arithmetic
standard, to all operations in the usual product spaces of computation: the
complex numbers, the real and complex intervals, and the real and complex
vectors and matrices and their interval counterparts. The implementation of
advanced computer arithmetic by fast hardware is examined in this book.
Arithmetic units for its elementary components are described. It is shown
that the requirements for speed and for reliability do not conflict with each
other. Advanced computer arithmetic is superior to other arithmetic with
respect to accuracy, costs, and speed.
Vector processing is an important technique used to speed up computa-
tion. Difficulties concerning the accuracy of conventional vector processors
are addressed in [116,117]. See also [32] and [78]. Accurate vector processing
is subsumed in what is called advanced computer arithmetic in this treatise.
Compared with elementary floating-point arithmetic it speeds up computa-
tions considerably and it eliminates many rounding errors and exceptions. Its
implementation requires little more hardware than is needed for elementary
floating-point arithmetic. All this strongly supports the case for implementing
such advanced computer arithmetic on every CPU. With the speed comput-
ers have reached and the problem sizes that are dealt with, vector operations
should be performed with the same reliability as elementary floating-point
operations.
On parallel computers faster and more powerful arithmetic units essen-
tially reduce the number of processors needed and the complexity of the
interconnection network. Thus a desired efficiency can be reached at a lower
cost.
A basic feature of advanced computer arithmetic as well as of vector
processing is the two instructions accumulate and multiply and accumulate
added to the instruction set for floating-point numbers. The first instruction
is a particular case of the second one, which computes a sum of products - the
dot product or scalar product of two vectors. Pipelining makes these opera-
tions really fast. We show in the first chapter that fixed-point accumulation
of products is the fastest way to accumulate scalar products on a computer.
This is so for all kinds of computers - personal computers, workstations,
mainframes or super computers. In contrast to floating-point accumulation
of products, fixed-point accumulation is error free. Not a single bit is lost.
The new operation is gained at modest cost. It increases both the speed of a
computation as well as the accuracy of the computed result.
A conventional floating-point computation may fail to produce a correct
answer without any error signal being given to the user. A very worthy goal
of computing, therefore, would be to do rigorous mathematics with the com-
puter. Examples of such rigour are verified solution of differential or integral
equations, or validation of the solution of a system of equations with proof
of existence and uniqueness of the solution within the computed bounds.
Interval arithmetic serves this purpose. If the verification or validation step
fails, the user is made aware that some more powerful tool has to be applied.
Higher precision arithmetic might then be used, for instance. Variable pre-
cision arithmetic is thus a sound complement for interval arithmetic. With
a fast and accurate scalar product, fast multiple precision arithmetic can be
easily provided on the computer.
The second chapter of this booklet reveals a necessary and sufficient condi-
tion under which a computer representable element in anyone of the relevant
computational spaces has a unique additive inverse.
The third chapter deals with interval arithmetic. It is shown that on su-
perscalar processors the four basic interval operations can be made as fast as
simple floating-point operations with only modest additional hardware costs.
In combination with the results of the first chapter - a hardware-supported
accurate scalar product - interval vector and matrix operations can be per-
formed with highest accuracy and faster than with simple floating-point arith-
metic.
The three chapters of this volume were written as independent articles.
They were prepared while the author was staying at the Electrotechnical
Laboratory (ETL), Agency of Industrial Science and Technology, MITI, at
Tsukuba, Japan, during sabbaticals in 1998 and in 1999/2000. Gathering
these articles into a single publication raised the question of whether the
text should be rewritten and reorganized into a unitary exposition. Thus a
small number of repetitions - of simple definitions, of historic remarks, or of
the list of references - could have been avoided. I decided not to do so. The
articles have been prepared to help implement different aspects of advanced
arithmetic on the computer. So it seems preferable not to interweave and
combine separate things into a complex whole. Readability should have the
highest priority.
I am grateful to all those colleagues and co-workers who have contributed
through their research to the development of advanced computer arithmetic
as it is presented in this treatise. In particular I would like to mention and
thank Gerd Bohlender, Willard L. Miranker, Reinhard Kirchner, Siegfried
M. Rump, Thomas Teufel, Harald Böhm, Jürgen Wolff von Gudenberg, An-
dreas Knöfel, and Christof Baumhof.
I gratefully acknowledge the help of Neville Holmes who went carefully
through the manuscripts, sending back corrections and suggestions that led
to many improvements.
Finally I wish to thank the Electrotechnical Laboratory, Agency of In-
dustrial Science and Technology at Tsukuba, Japan for providing me the
opportunity to write the articles in a pleasant scientific environment without
constantly being interrupted by the usual university business. I especially
owe thanks to Satoshi Sekiguchi for being a wonderful host personally and
scientifically. I am looking forward to, and eagerly await, advanced computer
arithmetic on commercial computers.

Karlsruhe, July 2002                                    Ulrich W. Kulisch

The picture on the cover page illustrates the contents of the book. It
shows a chip for fast Advanced Computer Arithmetic and eXtended Pre-
cision Arithmetic (ACA-XPA). Its components are symbolically indicated
on top: hardware support for 15 basic arithmetic operations including ac-
curate scalar products with different roundings and case selections for in-
terval multiplication and division. Corresponding circuits are developed in
the book.
The picture shows friends: Ursula Kulisch flanked by the host, Satoshi Sekiguchi, and Ulrich Kulisch.
Contents

1. Fast and Accurate Vector Operations                                    1
   1.1 Introduction                                                       1
       1.1.1 Background                                                   1
       1.1.2 Historic Remarks                                             7
   1.2 Implementation Principles                                         10
       1.2.1 Solution A: Long Adder and Long Shift                       13
       1.2.2 Solution B: Short Adder with Local Memory on the
             Arithmetic Unit                                             14
       1.2.3 Remarks                                                     15
       1.2.4 Fast Carry Resolution                                       17
   1.3 High-Performance Scalar Product Units (SPU)                       19
       1.3.1 SPU for Computers with a 32 Bit Data Bus                    19
       1.3.2 SPU for Computers with a 64 Bit Data Bus                    23
   1.4 Comments on the Scalar Product Units                              25
       1.4.1 Rounding                                                    25
       1.4.2 How much Local Memory should be Provided on a SPU?          27
       1.4.3 A SPU Instruction Set                                       28
       1.4.4 Interaction with High Level Programming Languages           30
   1.5 Scalar Product Units for Top-Performance Computers                32
       1.5.1 Long Adder for 64 Bit Data Word (Solution A)                32
       1.5.2 Long Adder for 32 Bit Data Word (Solution A)                37
       1.5.3 Short Adder with Local Memory on the Arithmetic Unit
             for 64 Bit Data Word (Solution B)                           40
       1.5.4 Short Adder with Local Memory on the Arithmetic Unit
             for 32 Bit Data Word (Solution B)                           45
   1.6 Hardware Accumulation Window                                      49
   1.7 Theoretical Foundation of Advanced Computer Arithmetic            53
   Bibliography and Related Literature                                   63

2. Rounding Near Zero                                                    71
   2.1 The one dimensional case                                          71
   2.2 Rounding in product spaces                                        75
   Bibliography and Related Literature                                   79

3. Interval Arithmetic Revisited                                         81
   3.1 Introduction and Historical Remarks                               82
   3.2 Interval Arithmetic, a Powerful Calculus to Deal with Inequalities 89
   3.3 Interval Arithmetic as Executable Set Operations                  92
   3.4 Enclosing the Range of Function Values                            97
   3.5 The Interval Newton Method                                       101
   3.6 Extended Interval Arithmetic                                     104
   3.7 The Extended Interval Newton Method                              110
   3.8 Differentiation Arithmetic, Enclosures of Derivatives            112
   3.9 Interval Arithmetic on the Computer                              116
   3.10 Hardware Support for Interval Arithmetic                        127
        3.10.1 Addition A + B and Subtraction A - B                     128
        3.10.2 Multiplication A * B                                     128
        3.10.3 Interval Scalar Product Computation                      131
        3.10.4 Division A / B                                           133
        3.10.5 Instruction Set for Interval Arithmetic                  134
        3.10.6 Final Remarks                                            135
   Bibliography and Related Literature                                  137

1. Fast and Accurate Vector Operations

Summary.
Advances in computer technology are now so profound that the arith-
metic capability and repertoire of computers can and should be expanded.
Nowadays the elementary floating-point operations +, -, x, / give com-
puted results that coincide with the rounded exact result for any operands.
Advanced computer arithmetic extends this accuracy requirement to all
operations in the usual product spaces of computation: the real and com-
plex vector spaces as well as their interval correspondents. This enhances
the mathematical power of the digital computer considerably. A new com-
puter operation, the scalar product, is fundamental to the development of
advanced computer arithmetic.
This paper studies the design of arithmetic units for advanced com-
puter arithmetic. Scalar product units are developed for different kinds
of computers like personal computers, workstations, mainframes, super
computers or digital signal processors. The new expanded computational
capability is gained at modest cost. The units put a methodology into
modern computer hardware which was available on old calculators before
the electronic computer entered the scene. In general the new arithmetic
units increase both the speed of computation as well as the accuracy of
the computed result. The circuits developed in this paper show that there
is no way to compute an approximation of a scalar product faster than the
correct result.
A collection of constructs in terms of which a source language may
accommodate advanced computer arithmetic is described in the paper.
The development of programming languages in the context of advanced
computer arithmetic is reviewed. The simulation of the accurate scalar
product on existing, conventional processors is discussed. Finally the the-
oretical foundation of advanced computer arithmetic is reviewed and a
comparison with other approaches to achieving higher accuracy in com-
putation is given. Shortcomings of existing processors and standards are
discussed.

1.1 Introduction

1.1.1 Background

Advances in computer technology are now so profound that the arithmetic


capability and repertoire of computers can and should be expanded. At a
time when more than 100 million transistors can be placed on a single chip,
computing speed is measured in giga- and teraflops, and memory space in
giga-words, there is no longer any need to perform all computer calculations
by the four elementary floating-point operations with all the shortcomings of
this arithmetic (for the shortcomings see the three examples listed in Section
1.1.2).
Nowadays the elementary floating-point operations +, -, x, / give com-
puted results that coincide with the rounded exact result of the operation
for any operands. See, for instance, the IEEE-Arithmetic Standards 754 and
854, [114,115]. Advanced computer arithmetic extends this accuracy require-
ment to all operations in the usual product spaces of computation: the com-
plex numbers, real and complex vectors, real and complex matrices, real and
complex intervals as well as real and complex interval vectors and interval
matrices. This enhances the mathematical power of the digital computer con-
siderably. A great many computer operations can then be performed with but
a single rounding error.
If, for instance, the scalar product of two vectors with 1000 components is
to be computed, about 2000 roundings are executed in conventional floating-
point arithmetic. Advanced arithmetic reduces this to a single rounding. The
computed result is within a single rounding error of the correct result.
The new operations are distinctly different from the customary ones which
are based on elementary floating-point arithmetic. A careful analysis and a
general theory of computer arithmetic [60,62] show that the new operations
can be built up on the computer by a modular technique as soon as a new
fundamental operation, the scalar product, is provided with full accuracy on
a low level, possibly in hardware.
The computer realization of the scalar product of two floating-point vec-
tors can be achieved with full accuracy in several ways. A most natural way
is to add the products of corresponding vector components into a long fixed-
point register (accumulator) which covers twice the exponent range of the
floating-point format in which the vector components are given. Use of the
long accumulator has the advantage of being rather simple, straightforward
and fast. Since fixed-point accumulation of numbers is error free it always
provides the desired accurate answer. The technique was already used on old
mechanical calculators long before the electronic computer.
In a floating-point system the number of mantissa digits and the exponent
range are finite. Therefore, the fixed-point register is finite as well, and it is
relatively small, consisting of about one to four thousand bits depending on
the data format in use. So we have the seemingly paradoxical and striking
situation that scalar products of floating-point vectors with even millions of
components can be computed to a fully accurate result using a relatively
small finite local register on the arithmetic unit.
In numerical analysis the scalar or dot product is ubiquitous. It is not
merely a fundamental operation in all the product spaces mentioned above.
The process of residual or defect correction, or of iterative refinement,
is composed of scalar products. There are well known limitations to these
processes in floating-point arithmetic. The question of how many digits of a
defect can be guaranteed with single, double or extended precision arithmetic
has been carefully investigated. With the optimal scalar product the defect
can always be computed to full accuracy. It is the accurate scalar product
which makes residual correction effective.
With the accurate scalar product quadruple or multiple precision arith-
metic can easily be provided on the computer. This enables the user to use
higher precision operations in numerically critical parts of his computation.
It helps to increase software reliability. A multiple precision number is repre-
sented as an array of floating-point numbers. The value of this number is the
sum of its components. It can be represented in the long accumulator. Ad-
dition and subtraction of multiple precision variables or numbers can easily
be performed in the long accumulator. Multiplication of two such numbers
is simply a sum of products. It can be computed by means of the accurate
scalar product. For instance in the case of fourfold precision the product of two
such numbers a = (a1 + a2 + a3 + a4) and b = (b1 + b2 + b3 + b4) is obtained
by

    a × b = (a1 + a2 + a3 + a4) × (b1 + b2 + b3 + b4)
          = a1 b1 + a1 b2 + a1 b3 + a1 b4 + a2 b1 + ... + a4 b3 + a4 b4
          = Σ_{i=1}^{4} Σ_{j=1}^{4} ai bj.

Using the long accumulator the result is independent of the sequence in which
the summands are added. For details see Remark 3 on page 60 in section 1.7.
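
The idea can be made concrete with a small Python sketch (an addition to this text, with illustrative names; Python's Fraction type merely stands in for the long accumulator): a fourfold precision number is held as a list of four floating-point components, and the product is formed by accumulating all sixteen component products exactly before a single rounding.

    from fractions import Fraction

    def quad_mul(a, b, rnd=float):
        """Multiply two staggered-precision numbers a = a1+a2+a3+a4 and
        b = b1+b2+b3+b4 (given as lists of floats).  All 16 products ai*bj
        are accumulated exactly; a single rounding is applied at the end."""
        acc = Fraction(0)                          # stands in for the long accumulator
        for ai in a:
            for bj in b:
                acc += Fraction(ai) * Fraction(bj) # exact product, exact sum
        return rnd(acc)                            # one final rounding

    # usage: the order of accumulation does not influence the result
    a = [1.0, 1e-17, 1e-34, 1e-51]
    b = [3.0, 2e-17, 5e-35, 1e-52]
    print(quad_mul(a, b))
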
Approximation of a continuous function by a polynomial by the method
of least squares leads to the Hilbert matrix as coefficients. It is extremely
ill conditioned. It is well known that it is impossible to invert the Hilbert
matrix in double precision floating-point arithmetic successfully by any direct
or iterative method for dimensions greater than 11. Implementation of the
accurate scalar product in hardware also supports very fast multiple precision
arithmetic. It easily inverts the Hilbert matrix of dimension 40 to full accuracy
on a PC in a very short computing time. If increase or decrease of the precision
in a program is provided by the programming environment, the user or the
computer itself can choose the precision which optimally fits his problem.
Inversion of the Hilbert matrix of dimension 40 is impossible with quadru-
ple precision arithmetic. With it only one fixed precision is available. If one
runs out of precision in a certain problem class, one often runs out of quadru-
ple precision very soon as well. It is preferable and simpler, therefore, to
provide the principles for enlarging the precision than simply providing any
fixed higher precision. A hardware implementation of a full quadruple preci-
sion arithmetic is much more costly than an implementation of the accurate
scalar product. The latter only requires fixed-point accumulation of the prod-
ucts. On the computer, the highest standardized floating-point format is
double precision.
With increasing speed of computers, problems to be dealt with become
larger. Instead of two dimensional problems users would like to solve three
dimensional problems. Gauss elimination for a linear system of equations re-
quires O(n^3) operations. Large, sparse or structured linear
or non linear systems, therefore, can only be solved iteratively. The basic op-
eration of iterative methods (Jacobi method, Gauss-Seidel method, overrelax-
ation method, conjugate gradient method, Krylov subspace methods, multigrid
methods and others like the QR method for the computation of eigenval-
ues) is the matrix-vector multiplication which consists of a number of scalar
products. It is well known that finite precision arithmetic often worsens the
convergence of these methods. An iterative method which converges to the
solution in infinite precision arithmetic often converges much slower or even
diverges in finite precision arithmetic. The accurate scalar product is faster
than a computation in conventional floating-point arithmetic. In addition to
that it can speed up the rate of convergence of iterative methods significantly
in many cases [27,28].
For many applications it is necessary to compute the value of the deriva-
tive of a function. Newton's method in one or several variables is a typical
example for this. Modern numerical analysis solves this problem by auto-
matic or algorithmic differentiation. The so called reverse mode is a very fast
method of automatic differentiation. It computes the gradient, for instance,
with at most five times the number of operations which are needed to com-
pute the function value. The memory overhead and the spatial complexity
of the reverse mode can be significantly reduced by the exact scalar product
if this is considered as a single, always correct, basic arithmetic operation
in the vector spaces [88]. The very powerful methods of global optimization
[79], [80], [81] are impressive applications of these techniques.
Many other applications require that rigorous mathematics can be done
with the computer using floating-point arithmetic. As an example, this is
essential in simulation runs (fusion reactor, eigenfrequencies of large genera-
tors) or mathematical modelling where the user has to distinguish between
computational artifacts and genuine reactions of the model. The model can
only be developed systematically if errors resulting from the computation can
be excluded.
Nowadays computer applications are of immense variety. Any discussion of
where a dot product computed in quadruple or extended precision arithmetic
can be used to substitute for the accurate scalar product is superfluous. Since
the former can fail to produce a correct answer an error analysis is needed
for all applications. This can be left to the computer. As the scalar product
can always be executed correctly with moderate technical effort it should
indeed always be executed correctly. An error analysis thus becomes irrele-
vant. Furthermore, the same result is always obtained on different computer
platforms. A fully accurate scalar product eliminates many rounding errors
in numerical computations. It stabilizes these computations and speeds them
up as well. It is the necessary complement to floating-point arithmetic.
This paper studies the design of arithmetic units for advanced computer
arithmetic. Scalar product units are developed for different kinds of comput-
ers like personal computers, workstations, mainframes, super computers or
even digital signal processors. The differences in the circuits for these diverse
processors are dictated by the speed with which the processor delivers the
data to the arithmetic or scalar product unit. The data are the vector com-
ponents. In all cases the new expanded computational capability is gained at
modest cost. The cost increase is comparable to that accepted years ago in moving
from a simple to a fast multiplier, for instance one based on a Wallace tree. It is a main
result of our study that for all processors mentioned above circuits can be
given for the computation of the accurate scalar product with virtually no
computing time needed for the execution of the arithmetic. In a pipeline, the
arithmetic can be executed within the time the processor needs to read the
data into the arithmetic unit. This means that no other method to compute
a scalar product can be faster, in particular not a conventional approximate
computation of the scalar product in floating-point arithmetic which can lead
to an incorrect result.
In the pipeline a multiplication and the accumulation of a product to the
intermediate sum in the long accumulator are performed simultaneously. This
doubles the speed of the accurate scalar product in comparison with a con-
ventional computation in floating-point arithmetic where these operations are
performed sequentially. Furthermore, fixed-point accumulation of the prod-
ucts is simpler than accumulation in floating-point. Many intermediate steps
that are executed in a floating-point accumulation such as normalization
and rounding of the products and the intermediate sum, composition into a
floating-point number and decomposition into mantissa and exponent for the
next operation do not occur in the fixed-point accumulation of the accurate
scalar product used in advanced computer arithmetic.
In recent years there has been a significant shift of numerical computa-
tion from general purpose computers towards vector and parallel computers
- so-called super computers. Along with the four elementary floating-point
operations these computers usually offer compound operations as additional
arithmetic operations. A particular such compound operation, multiply and
accumulate, is provided for the computation of the scalar product of two vec-
tors. These compound operations are heavily pipelined and make the com-
putation really fast. They are automatically inserted in a user's program by
a vectorizing compiler. However, if these operations are not carefully imple-
mented the user loses complete control of his computation.
In 1987 GAMM¹ and IMACS² published a Resolution on Computer Arith-
metic [116] which criticized the mathematically inadequate execution of ma-
trix and vector operations on all existing vector processors. An amendment
was demanded. The user should not be obliged to perform an error analysis
every time an elementary compound operation, predefined by the manufac-
turer, is employed. In 1993 the two organizations approved and published a
Proposal for Accurate Floating-Point Vector Arithmetic [117]. It requires a
mathematically correct implementation of matrix and vector operations, in
particular, of the accurate scalar product on all computers. In 1995 the IFIP-
Working Group 2.5 on Numerical Software endorsed this proposal. Meanwhile
it became an EU Guideline.
We finish this Section with a warning to the reader. This chapter does not
consist of independent sections. The later sections are built upon the earlier
ones. On the other hand material that is presented later can be helpful in
contributing to a full understanding of circuits that are discussed earlier. In
Section 1.2 basic ideas for the solution of the problem are discussed. Section
1.3 develops fast solutions for small and medium size computers. Section 1.5
then considers solutions for very fast systems.
New fundamental computer operations must be embedded into program-
ming languages where they can be activated by the user. Of course, operator
notation is the ultimate solution. However, it turns out to be extremely useful
to put a number of elementary instructions into the hands of the user also.
Such instructions for low and high level languages are discussed in Section
1.4, in particular in 1.4.3 and 1.4.4. These are based on a long lasting experi-
ence with the XSC-languages since 1980. Some readers might not so much be
interested in the very fast solutions in Section 1.5. Therefore, the interplay
with programming languages is already presented in Section 1.4.

This text summarizes both an extensive research activity during the past
twenty years and the experience gained through various implementations of
the entire arithmetic package on diverse processors. The text is also based
on lectures held at the Universität Karlsruhe during the preceding 25 years.
While the collection of research articles that contribute to this paper is not
very large in number, I refrain from a detailed review of them and refer the
reader to the list of references. This text synthesizes and organizes diverse
contributions into a coherent presentation. In many cases more detailed in-
formation can be obtained from original doctoral theses.
¹ GAMM = Gesellschaft für Angewandte Mathematik und Mechanik
² IMACS = International Association for Mathematics and Computers in Simulation

1.1.2 Historic Remarks

Floating-point arithmetic has been used since the early forties and fifties
(Zuse Z3, 1941) [11,82]. Technology in those days was poor (electromechan-
ical relays, electron tubes). It was complex and expensive. The word size of
the Z3 consisted of 24 bits. The storage provided 64 words. The four ele-
mentary floating-point operations were all that could be provided. For more
complicated calculations an error analysis was left to and put on the shoulder
of the user.
Before that time, highly sophisticated mechanical computing devices were
used. Several very interesting techniques provided the four elementary oper-
ations addition, subtraction, multiplication and division. Many of these cal-
culators were able to perform an additional fifth operation which was called
Auflaufenlassen or the running total. The input register of such a machine
had perhaps 10 or 12 decimal digits. The result register was much wider and
had perhaps 30 digits. It was a fixed-point register which could be shifted
back and forth relative to the input register. This allowed a continuous accu-
mulation of numbers and of products of numbers into different positions of
the result register. Fixed-point accumulation is thus error free. See Fig. 1.22
and Fig. 1.23 on page 62. This fifth arithmetic operation was the fastest way
to use the computer. It was applied as often as possible. No intermediate re-
sults needed to be written down and typed in again for the next operation.
No intermediate roundings or normalizations had to be performed. No error
analysis was necessary. As long as no under- or overflow occurred, which
would be obvious and visible, the result was always correct. It was indepen-
dent of the order in which the summands were added. If desired, only one
final rounding was executed at the very end of the accumulation.
This extremely useful and fast fifth arithmetic operation was not built into
the early floating-point computers. It was too expensive for the technologies
of those days. Later its superior properties had been forgotten. Thus floating-
point arithmetic is still somehow incomplete.
After Zuse, the early electronic computers in the late forties and early
fifties represented their data as fixed-point numbers. Fixed-point arithmetic
was used because of its superior properties. Fixed-point addition and subtrac-
tion are error free. Fixed-point arithmetic with a rather limited word size,
however, imposed a scaling requirement. Problems had to be preprocessed
by the user so that they could be accommodated by this fixed-point number
representation. With increasing speed of computers, the problems that could
be solved became larger and larger. The necessary preprocessing soon became
an enormous burden.
Thus floating-point arithmetic became generally accepted. It largely elim-
inated this burden. A scaling factor is appended to each number in floating-
point representation. The arithmetic itself takes care of the scaling. An ex-
ponent addition (subtraction) is executed during multiplication (division).
It may result in a big change in the value of the exponent. But multiplica-
tion and division are relatively stable operations in floating-point arithmetic.
Addition and subtraction, on the contrary, are troublesome in floating-point.
The quality of floating-point arithmetic has been improved over the years.
The data format was extended to 64 and even more bits and the IEEE-
arithmetic standard has finally taken the bugs out of particular realizations.
Floating-point arithmetic has been used very successfully in the past. Very
sophisticated and versatile algorithms and libraries have been developed for
particular problems. However, in a general application the result of a floating-
point computation is often hard to judge. It can be satisfactory, inaccurate
or even completely wrong. The computation itself as well as the computed
data do not indicate which one of the three cases has occurred. We illustrate
the typical shortcomings by three very simple examples. All data in these
examples are IEEE double precision floating-point numbers! For these and
other examples see [84]:

1. Compute the following, theoretically equivalent expressions:

10^20 + 17 - 10 + 130 - 10^20
10^20 - 10 + 130 - 10^20 + 17
10^20 + 17 - 10^20 - 10 + 130
10^20 - 10 - 10^20 + 130 + 17
10^20 - 10^20 + 17 - 10 + 130
10^20 + 17 + 130 - 10^20 - 10
A conventional computer using the data format double-precision of the
IEEE floating-point arithmetic standard returns the values 0, 17, 120, 147,
137, -10. These errors come about because the floating-point arithmetic is
unable to cope with the digit range required with this calculation. Notice
that the data cover less than 4% of the digit range of the data format double
precision!
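
The effect is easy to reproduce. The following lines (added here; Python floats are IEEE double precision numbers) evaluate the six expressions from left to right:

    # Left-to-right evaluation of the six theoretically equivalent expressions
    # in IEEE double precision (Python floats).  The exact value of each is 137.
    exprs = [
        "1e20 + 17 - 10 + 130 - 1e20",
        "1e20 - 10 + 130 - 1e20 + 17",
        "1e20 + 17 - 1e20 - 10 + 130",
        "1e20 - 10 - 1e20 + 130 + 17",
        "1e20 - 1e20 + 17 - 10 + 130",
        "1e20 + 17 + 130 - 1e20 - 10",
    ]
    for e in exprs:
        print(f"{e}  ->  {eval(e)}")
    # prints 0.0, 17.0, 120.0, 147.0, 137.0, -10.0 respectively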

2. Compute the solution of a system of two linear equations Ax = b, with

    A = ( 64919121      -159018721 )        b = ( 1 )
        ( 41869520.5    -102558961 ) ,          ( 0 )

The solution can be expressed by the formulas (Cramer's rule):

    x1 = a22 / (a11·a22 - a12·a21),    x2 = -a21 / (a11·a22 - a12·a21).

A workstation using IEEE double precision floating-point arithmetic re-
turns the approximate solution:

x1 = 102558961 and x2 = 41869520.5 ,

while the correct solution is

x1 = 205117922 and x2 = 83739041 .

After only 4 floating-point operations all digits of the computed solution
are wrong. A closer look into the problem reveals that the error happens
during the computation of the denominator. This is just the kind of expression
which always can be computed error free by the missing fifth operation.
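
A small Python sketch (added here for illustration) reproduces the failure: Cramer's rule is evaluated once in IEEE double precision and once in exact rational arithmetic. The critical denominator a11·a22 - a12·a21 has the exact value -1/2, but in double precision it evaluates to -1, which halves both solution components.

    from fractions import Fraction

    a11, a12 = 64919121.0, -159018721.0
    a21, a22 = 41869520.5, -102558961.0
    b1, b2 = 1.0, 0.0

    def cramer(a11, a12, a21, a22, b1, b2):
        det = a11 * a22 - a12 * a21          # the critical expression
        return (b1 * a22 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det

    # IEEE double precision: det rounds to -1 instead of -0.5
    print(cramer(a11, a12, a21, a22, b1, b2))   # (102558961.0, 41869520.5)

    # exact rational arithmetic: 205117922 and 83739041
    F = Fraction
    print(cramer(F(a11), F(a12), F(a21), F(a22), F(b1), F(b2)))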

3. Compute the scalar product of the two vectors a and b with five com-
ponents each:

a1 = 2.718281828 × 10^10      b1 = 1486.2497 × 10^9
a2 = -3.141592654 × 10^10     b2 = 878366.9879 × 10^9
a3 = 1.414213562 × 10^10      b3 = -22.37492 × 10^9
a4 = 0.5772156649 × 10^10     b4 = 4773714.647 × 10^9
a5 = 0.3010299957 × 10^10     b5 = 0.000185049 × 10^9

The correct value of the scalar product is -1.00657107 × 10^8. IEEE double
precision arithmetic delivers +4.328386285 × 10^9, so even the sign is incorrect.
Note that no vector element has more than 10 decimal digits.
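
The following Python check (added here; the Fraction type provides the error-free accumulation) confirms the exact value and shows that the conventional floating-point loop has no correct digit:

    from fractions import Fraction

    a = [ 2.718281828e10, -3.141592654e10, 1.414213562e10,
          0.5772156649e10, 0.3010299957e10]
    b = [ 1486.2497e9, 878366.9879e9, -22.37492e9,
          4773714.647e9, 0.000185049e9]

    # conventional floating-point accumulation (IEEE double precision)
    s_float = 0.0
    for ai, bi in zip(a, b):
        s_float += ai * bi

    # exact accumulation, one rounding at the very end
    s_exact = sum(Fraction(ai) * Fraction(bi) for ai, bi in zip(a, b))

    print(s_float)          # wrong by many orders of magnitude
    print(float(s_exact))   # -100657107.0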

Problems that can be solved by computers become larger and larger. To-
day fast computers are able to execute several billion floating-point operations
in each second. This number exceeds the imagination of any user. Traditional
error analysis of numerical algorithms is based on estimates of the error of
each individual arithmetic operation and on the propagation of these errors
through a complicated algorithm. It is simply no longer possible to expect
that the error of such computations can be controlled by the user. There
remains no alternative to further develop the computer's arithmetic and to
furnish it with the capability of control and validation of the computational
process.
Computer technology is extremely powerful today. It allows solutions
which even an experienced computer user may be totally unaware of. Floating-
point arithmetic which may fail in simple calculations, as illustrated above,
is no longer adequate to be used exclusively in computers of such gigantic
speed for huge problems. The reintroduction of the fifth arithmetic operation,
the accurate scalar product, into computers is a step which is long overdue.
A central and fundamental operation of numerical analysis which can be
executed correctly with only modest technical effort should indeed always
be executed correctly and no longer only approximately. With the accurate
scalar product all the nice properties which have been listed in connection
with the old mechanical calculators return to the modern digital computer.
The accurate scalar product is the fastest way to use the computer. It should
be applied as often as possible. No intermediate results need to be stored and
read in again for the next operation. No intermediate roundings and normal-
izations have to be performed. No intermediate over- or underflow can occur.
No error analysis is necessary. The result is always correct. It is indepen-
dent of the order in which the summands are added. If desired, only one final
rounding is executed at the very end of the accumulation.
This paper pleads for an extension of floating-point arithmetic by the
accurate scalar product as fifth elementary operation. This combines the
advantages of floating-point arithmetic (no scaling requirement) with those
of fixed-point arithmetic (error free accumulation of numbers and of single
products of numbers even for very long sums). It is obtained by putting a
methodology into modern computer hardware which was already available
on calculators before the electronic computer entered the scene.
To wipe out frequent misunderstandings a few words about what the
paper does (and what it does not do) seem to be necessary. The paper claims
that, and explains how, scalar products, the data of which are floating-point
numbers, can always be correctly computed. In the old days of computing
(1950 - 1980) computers often provided sloppy arithmetic in order to be
fast. This was "justified" by explaining that the last bit was incorrect in
many cases, due to rounding errors. So why should the arithmetic be slowed
down or more hardware be invested by computing the best possible answer
of the operations under the assumption that the last bit is correct? Today
it is often asked: why do we need an accurate scalar product? The last bit
of the data is often incorrect and a floating-point computation of the scalar
product delivers the best possible answer for problems with perturbed data.
With the IEEE arithmetic standard this kind of "justification" has been
overcome. This allows problems to be handled correctly where the data are
correct and the best possible result of an operation is needed. With respect
to the scalar product several such problems are listed in Section 1.1.1. In
mathematics it makes a big difference whether a computation is correct for
many or most data or for all data! For problems with perturbed (inexact)
data interval arithmetic is the appropriate tool. Even in this case fixed-point
accumulation of the scalar product delivers bounds or an approximation faster
than a conventional computation in floating-point arithmetic. In summary,
the paper extends the accuracy requirements of the IEEE arithmetic standard
to the scalar product and with it to all operations in the usual product spaces
of computation which are listed in the second paragraph of Section 1.1.1. All
this is shortly called advanced computer arithmetic.

1.2 Implementation Principles


A normalized floating-point number x (in sign-magnitude representation) is
a real number of the form x = ∗ m · b^e. Here ∗ ∈ {+, -} is the sign of the
number, b is the base of the number system in use and e is the exponent. The
base b is an integer greater than unity. The exponent e is an integer between
two fixed integer bounds e1 and e2, and in general e1 ≤ 0 ≤ e2. The mantissa
m is of the form

    m = d1·b^(-1) + d2·b^(-2) + ... + dl·b^(-l) = Σ_{i=1}^{l} di·b^(-i).

The di are the digits of the mantissa. They have the property di ∈
{0, 1, ..., b-1} for all i = 1(1)l and d1 ≠ 0. Without this last condition
floating-point numbers are said to be unnormalized. The set of normalized
floating-point numbers does not contain zero. For a unique representation of
zero we assume the mantissa and the exponent to be zero. Thus a floating-
point system depends on the four constants b, l, e1 and e2. We denote it
by R = R(b, l, e1, e2). Occasionally we shall use the abbreviations sign(x),
mant(x) and exp(x) to denote the sign, mantissa and exponent of x respec-
tively.
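
For the binary case b = 2 these components can be read off in software; the following short Python illustration (added here; the three helper functions are named after the abbreviations used above) relies on math.frexp, which returns a mantissa of magnitude in [0.5, 1) together with the corresponding exponent:

    import math

    def sign(x): return '+' if math.copysign(1.0, x) > 0 else '-'
    def mant(x): return abs(math.frexp(x)[0])   # mantissa m with 0.5 <= m < 1 (x != 0)
    def exp(x):  return math.frexp(x)[1]        # exponent e with |x| = mant(x) * 2**e

    x = -6.25
    print(sign(x), mant(x), exp(x))             # - 0.78125 3, since 6.25 = 0.78125 * 2**3
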
Nowadays the elementary floating-point operations +, -, x, / give com-
puted results that coincide with the rounded exact result of the operation for
any operands. See, for instance, the IEEE Arithmetic Standards 754 and 854,
[114,115]. Advanced computer arithmetic extends this accuracy requirement
to all operations in the usual product spaces of computation: the complex
numbers, the real and complex vectors, real and complex matrices, real and
complex intervals as well as the real and complex interval vectors and interval
matrices.
A careful analysis and a general theory of computer arithmetic [60,62]
show that all arithmetic operations in the computer representable subsets of
these spaces can be realized on the computer by a modular technique as soon
as fifteen fundamental operations are made available at a low level, possibly
by fast hardware routines. These fifteen operations are

    □+, □-, □×, □/, □·,
    ∇+, ∇-, ∇×, ∇/, ∇·,
    △+, △-, △×, △/, △·.

Here □*, * ∈ {+, -, ×, /}, denotes (semimorphic³) operations using some
particular monotone and antisymmetric rounding □: ℝ → R such as round-
ing to the nearest floating-point number or rounding towards zero. Likewise
∇* and △*, * ∈ {+, -, ×, /}, denote the operations using the optimal (mono-
tone³) rounding downwards ∇: ℝ → R, and the optimal (monotone³) round-
ing upwards △: ℝ → R, respectively. □·, ∇· and △· denote scalar products
with high accuracy. That is, if a = (ai) and b = (bi) are vectors with floating-
point components, ai, bi ∈ R, then a ○ b := ○(a1×b1 + a2×b2 + ... + an×bn),
○ ∈ {□, ∇, △}. The multiplication and addition signs on the right hand
side of the assignment denote exact multiplication and summation in the
sense of real numbers.
³ For a precise mathematical definition see Section 1.7.
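
As an illustration of the definition a ○ b := ○(a1×b1 + ... + an×bn), the following Python sketch (added to this text; the function name and the use of Fraction and math.nextafter are choices made for this illustration only) computes the exact sum of products and applies one of the three roundings at the very end:

    import math
    from fractions import Fraction

    def dot(a, b, rounding="nearest"):
        """Scalar product with a single rounding at the end:
        'nearest', 'down' (toward -inf) or 'up' (toward +inf)."""
        exact = sum(Fraction(x) * Fraction(y) for x, y in zip(a, b))
        r = float(exact)                       # correctly rounded to nearest
        if rounding == "down" and Fraction(r) > exact:
            r = math.nextafter(r, -math.inf)
        elif rounding == "up" and Fraction(r) < exact:
            r = math.nextafter(r, math.inf)
        return r

    a = [1e16, 3.0, -1e16]
    b = [1.0,  1/3,  1.0]
    # enclosure of the exact value: 0.9999999999999999 1.0 1.0
    print(dot(a, b, "down"), dot(a, b, "nearest"), dot(a, b, "up"))
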
These 15 operations are sufficient for the computer implementation of
all arithmetic operations that are to be defined for all numerical data types
listed above in the third paragraph of this section. Of the 15 fundamental
operations, traditional numerical methods use only the four operations □+,
□-, □× and □/. Interval arithmetic requires the eight operations ∇+, ∇-,
∇×, ∇/ and △+, △-, △×, △/. These eight operations are computer equivalents
of the operations for real floating-point intervals, i.e. of interval arithmetic.
Processors which support the IEEE arithmetic standard, for instance, offer 12
of these 15 operations: □*, ∇*, △*, * ∈ {+, -, ×, /}. The latter 8 operations ∇*,
△*, * ∈ {+, -, ×, /} are not yet provided by the usual high level programming
languages. They are available and can be used in PASCAL-XSC, [46,47,
49,67,68], a PASCAL extension for high accuracy scientific computing
which was developed at the author's Institute. Roughly speaking, interval
arithmetic brings guarantees into computation while the three scalar or dot
products deliver high accuracy. These two features should not be confused.
The implementation of the 12 operations □*, ∇*, △*, * ∈ {+, -, ×, /}
on computers is routine and standard nowadays. Fast techniques are largely
discussed in the literature. So we now turn to the implementation of the
three optimal scalar products □·, ∇· and △· on computers. We shall discuss
circuits for the hardware realization of these operations for different kinds
of processors like personal computers, workstations, mainframes, super com-
puters and digital signal processors. The differences in the circuits for these
diverse processors are dictated by the speed with which the processor delivers
the vector components ai and bi, i = 1, 2, ..., n, to the arithmetic or scalar
product unit.
After a brief discussion of the implementation of the accurate scalar prod-
uct on computers we shall detail two principal solutions to the problem. So-
lution A uses a long adder and a long shift. Solution B uses a short adder
and some local memory in the arithmetic unit. At first sight both of these
principal solutions seem to lead to relatively slow hardware circuits. However
later, more refined studies will show that very fast circuits can be devised
for both methods and for the diverse processors mentioned above. A first
step in this direction is the provision of the very fast carry resolution scheme
described in Section 1.2.4.
Actually it is a central result of this study that, for all processors under
consideration, circuits for the computation of the optimal scalar product are
available where virtually no computing time for the execution of the arith-
metic is needed. In a pipeline, the arithmetic can be done within the time the
processor needs to read the data into the arithmetic unit. This means that no
other method to compute the scalar product can be faster, in particular, not
even a conventional computation of scalar products in floating-point arith-
metic which may lead to an incorrect answer. Once more we emphasize the
fact that the methods to be discussed here compute the scalar product of two
floating-point vectors of arbitrary finite length without loss of information or
with only one final rounding at the very end of the computation.
Now we turn to our task. Let a = (ai) and b = (bi), i = 1(1)n, be two
vectors with n components which are floating-point numbers, i.e.

    ai, bi ∈ R(b, l, e1, e2), for i = 1(1)n.


We are going to compute the two results (scalar products):

    s := Σ_{i=1}^{n} ai × bi = a1 × b1 + a2 × b2 + ... + an × bn,

and

    c := ○ Σ_{i=1}^{n} ai × bi = ○(a1 × b1 + a2 × b2 + ... + an × bn) = ○ s,

where all additions and multiplications are the operations for real numbers
and ○ is a rounding symbol representing, for instance, rounding to nearest,
rounding towards zero, rounding upwards or downwards.
Since ai and bi are floating-point numbers with a mantissa of l digits,
the products ai × bi in the sums for s and c are floating-point numbers with
a mantissa of 2l digits. The exponent range of these numbers doubles also,
i.e. ai × bi ∈ R(b, 2l, 2e1, 2e2). All these summands can be expressed in a
fixed-point register of length 2e2 + 2l + 2|e1| without loss of information,
see Fig. 1.1. If one of the summands has an exponent 0, its mantissa can be
expressed in a register of length 2l. If another summand has exponent 1, it
can be expressed with exponent 0, if the register provides further digits on
the left and the mantissa is shifted one place to the left. An exponent -1 in
one of the summands requires a corresponding shift to the right. The largest
exponents in magnitude that may occur in the summands are 2e2 and 2|e1|.
So all summands can be expressed with exponent 0 in a fixed-point register
of length 2e2 + 2l + 2|e1| without loss of information.

1.2.1 Solution A: Long Adder and Long Shift

If the register is built as an accumulator with an adder, all summands could


even be added without loss of information. In order to accommodate possible
overflows, it is convenient to provide a few, say k more digits of base b on
the left. In such an accumulator, every such sum or scalar product can be
added without loss of information. As many as b^k overflows may occur and
be accommodated without loss of information. In the worst case, presum-
ing every sum causes an overflow, we can accommodate sums with n ≤ b^k
summands.
A gigaflops computer would perform about 10^17 operations in 10 years.
So 17 decimal or about 57 binary digits certainly are a reasonable upper
bound for k. Thus, the long accumulator and the long adder consist of L =
k + 2e2 + 2l + 2|e1| digits of base b. The summands are shifted to the proper
position and added. See Fig. 1.1. Fast carry resolution techniques will be
discussed later. The final sums s and c are supposed to be in the single
exponent range e1 ≤ e ≤ e2, otherwise c is not representable as a floating-
point number and the problem has to be scaled.

Fig. 1.1. Long accumulator with long shift for accurate scalar product accumulation.
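
A software model of solution A (added here; the class and its names are illustrative only) exploits the fact that every IEEE double precision number is an integer multiple of 2^-1074, so every product of two such numbers is an integer multiple of 2^-2148; scaled by 2^2148, the long fixed-point accumulator becomes a single arbitrarily long Python integer:

    from fractions import Fraction

    SCALE = 2 ** 2148   # every product of two IEEE doubles is a multiple of 2**-2148

    class LongAccumulator:
        """Software model of the long fixed-point accumulator (solution A)."""
        def __init__(self):
            self.acc = 0                          # one big integer plays the role of the LA

        def accumulate(self, a, b):
            p = Fraction(a) * Fraction(b) * SCALE
            assert p.denominator == 1             # the shifted product is an integer
            self.acc += p.numerator               # error-free fixed-point addition

        def result(self, rnd=float):
            return rnd(Fraction(self.acc, SCALE)) # one rounding at the very end

    # usage: the tiny middle term survives; a plain float loop would lose it
    la = LongAccumulator()
    for ai, bi in zip([1e300, 1.0, -1e300], [1e-300, 2.0**-60, 1e-300]):
        la.accumulate(ai, bi)
    print(la.result())    # 2.0**-60 recovered exactly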

1.2.2 Solution B: Short Adder with Local Memory on the Arithmetic Unit

In a scalar product computation the summands are all of length 2l. So actually
the long adder and long accumulator may be replaced by a short adder and
a local store of size L on the arithmetic unit. The local store is organized in
words of length l or l', where l' is a power of 2 and slightly larger than l. (For
instance l = 53 bits and l' = 64 bits). Since the summands are of length 2l,
they fit into a part of the local store of length 3l'. This part of the store is
determined by the exponent of the summand. We load this part of the store
into an accumulator of length 3l'. The summand mantissa is placed in a shift
register and is shifted to the correct position as determined by the exponent.
Then the shift register contents are added to the contents of the accumulator.
See Fig. 1.2.
An addition into the accumulator may produce a carry. As a simple
method to accommodate carries, we enlarge the accumulator on its left end
by a few more digit positions. These positions are filled with the correspond-
ing digits of the local store. If not all of these digits equal b - 1 in case of
addition (or zero in case of subtraction), they will accommodate a possible
carry of the addition (or borrow in case of subtraction). Of course, it is possi-
ble that all these additional digits are b - 1 (or zero). In this case, a loop can
be provided that takes care of the carry and adds it to (subtracts it from)
the next digits of the local store. This loop may need to be traversed several
times. Other carry (borrow) handling processes are possible and will be dealt
with later. This completes our sketch of the second method for an accurate
computation of scalar products using a short adder and some local store on
the arithmetic unit. See Fig. 1.2.

Fig. 1.2. Short adder and local store on the arithmetic unit for accurate scalar product accumulation.
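
A rough software model of solution B (added here; word width, store size and names are illustrative, not the circuits developed later in this chapter) keeps the local store as an array of 64-bit words, shifts the product mantissa into place, adds it into the consecutive words it touches, and propagates a possible carry word by word until it is absorbed:

    WORD = 64
    MASK = (1 << WORD) - 1
    NWORDS = 67                  # enough 64-bit words for the IEEE double format (see Remark 3)

    store = [0] * NWORDS         # local store on the arithmetic unit, least significant word first

    def add_summand(mantissa, lsb_position):
        """Add a (here: 106 bit) product mantissa whose least significant bit
        has weight 2**lsb_position into the local store."""
        word, offset = divmod(lsb_position, WORD)
        value = mantissa << offset               # the shifter aligns the summand
        while value:                             # at most three words, plus carry words
            s = store[word] + (value & MASK)
            store[word] = s & MASK
            value = (value >> WORD) + (s >> WORD)   # remaining part plus carry
            word += 1

    # usage: two summands whose sum produces a long carry chain
    add_summand((1 << 106) - 1, 0)
    add_summand(1, 0)
    print(hex(store[0]), hex(store[1]), hex(store[2]))   # 0x0 0x40000000000 0x0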

1.2.3 Remarks

The scalar product is a highly frequent operation in scientific computing. The
two solutions A and B are both simple, straightforward and mature.
Remark 1: The purpose of the k digits on the left end of the register in Fig.
1.1 and Fig. 1.2 is to accommodate possible overflows. The only numbers that
are added to this part of the register are plus or minus unity. So this part of
the register just can be treated as a counter by an incrementer/decrementer.
Remark 2: The final result of a scalar product computation is assumed to
be a floating-point number with an exponent in the range e1 ≤ e ≤ e2.
During the computation, however, summands with an exponent outside of
this range may very well occur. The remaining computation then has to cancel
all these digits. This shows that normally in a scalar product computation,
the register space outside the range e1 ≤ e ≤ e2 will be used less frequently.
The conclusion should not be drawn from this consideration that the register
size can be restricted to the single exponent range in order to save some
silicon area. This would require the implementation of complicated exception
handling routines which finally require as much silicon but do not solve the
problem in principle.
Remark 3: We emphasize once more that the number of digits, L, needed for
the register to compute scalar products of two vectors to full accuracy only
depends on the floating-point data format. In particular it is independent of
the number n of components of the two vectors to be multiplied.
As samples we calculate the register width L for a few typical and fre-
quently used floating-point data formats:
a) IEEE-arithmetic single precision:
b = 2; word length: 32 bits; sign: 1 bit; exponent: 8 bits; mantissa: l = 24
bits; exponent range: e1 = -126, e2 = 127, binary.
L = k + 2e2 + 2l + 2|e1| = k + 554 bits.
With k = 86 bits we obtain L = 640 bits. This register can be represented
by 10 words of 64 bits.
b) /370 architecture, long data format:
b = 16; word length: 64 bits; sign: 1 bit; mantissa: l = 14 hex digits;
exponent range: e1 = -64, e2 = 63, hexadecimal.
2e2 + 2l + 2|e1| = 282 hexadecimal digits = 1128 bits.
With k = 88 bits we obtain L = 88 + 4 · 282 = 1216 bits. This register
can be represented by 19 words of 64 bits.
c) IEEE-arithmetic double precision:
b = 2; word length: 64 bits; sign: 1 bit; exponent: 11 bits; mantissa: l = 53
bits; exponent range: e1 = -1022, e2 = 1023, binary.
L = k + 2e2 + 2l + 2|e1| = k + 4196 bits.
With k = 92 bits we obtain L = 4288 bits. This register can be repre-
sented by 67 words of 64 bits.
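
The three sample values can be reproduced with a few lines of Python (added here; the parameters are those listed above, with the /370 quantities converted from hexadecimal digits to bits):

    # register width L = k + 2*e2 + 2*l + 2*|e1|  (all quantities counted in bits)
    formats = {
        "IEEE single":     dict(l=24,   e1=-126,   e2=127,   k=86),
        "/370 long":       dict(l=14*4, e1=-64*4,  e2=63*4,  k=88),   # hex digits -> bits
        "IEEE double":     dict(l=53,   e1=-1022,  e2=1023,  k=92),
    }
    for name, f in formats.items():
        L = f["k"] + 2*f["e2"] + 2*f["l"] + 2*abs(f["e1"])
        print(f"{name}: L = {L} bits = {(L + 63) // 64} words of 64 bits")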

These samples show that the register size (at a time where memory space
is measured in gigabits and gigabytes) is modest in all cases. It grows with
the exponent range of the data format. If this range should be extremely
large, as for instance in case of an extended precision floating-point format,
only an inner part of the register would be supported by hardware. The outer
parts which then appear very rarely could be simulated in software. The long
data format of the /370 architecture covers in decimal a range from about
10^-75 to 10^75 which is very modest. This architecture dominated the market
for more than 20 years and most problems could conveniently be solved with
machines of this architecture within this range of numbers.
Remark 4: Multiplication is often considered to be more complex than ad-
dition. In modern computer technology this is no longer the case. Very fast
circuits for multiplication using carry-save-adders (Wallace tree) are available
and common practice. They nearly equalize the time to compute a sum and
a product of two floating-point numbers. In a scalar product computation
usually a large number of products is to be computed. The multiplier is able
to produce these products very quickly. In a balanced scalar product unit the
accumulator should be able to absorb a product in about the same time the
multiplier needs to produce it. Therefore, measures have to be taken to equal-
ize the speed of both operations. Because of a possible long carry propagation
the accumulation seems to be the more complicated process.
Remark 5: Techniques to implement the optimal scalar product on machines
which do not provide enough register space on the arithmetic logical unit will
be discussed in Section 1.6 later in this paper.

1.2.4 Fast Carry Resolution

Both solutions A and B for our problem which we sketched above seem to
be slow at first glance. Solution A requires a long shift which is necessarily
slow. The addition over perhaps 4000 bits is slow also, in particular if a long
carry propagation is necessary. For solution B, five steps have to be carried
out: 1. read from the local store, 2. perform the shift, 3. add the summand,
4. resolve the carry, possibly by loops, and 5. write the result back into the
local store. Again the carry resolution may be very time consuming.
As a first step to speed up solutions A and B, we discuss a technique which
allows a very fast carry resolution. Actually a possible carry can already be
accommodated while the product, the addition of which might produce a
carry, is still being computed.
Both solutions A and B require a long register in which the final sum in a
scalar product computation is built up. Henceforth we shall call this register
the Long Accumulator and abbreviate it as LA. It consists of L bits. LA is a
fixed-point register wherein any sum of floating-point numbers and of simple
products of floating-point numbers can be represented without error.
To be more specific we now assume that we are using the double precision
data format of the IEEE-arithmetic standard 754. See case c) of remark 3.
As soon as the principles are clear, a transfer of the technique to other data
formats is easy. Thus, in particular, the mantissa consists of l = 53 bits.
We assume additionally that the LA that appears in solutions A and B is subdivided into words of l' = 64 bits. The mantissa of the product ai x bi then
is 106 bits wide. It touches at most three consecutive 64-bit words of the LA
which are determined by the exponent of the product. A shifter then aligns
the 106 bit product into the correct position for the subsequent addition into
the three consecutive words of the LA. This addition may produce a carry
(or a borrow in case of subtraction). The carry is absorbed by that next more
significant 64 bit word of the LA in which not all digits are 1 (or 0 in case of subtraction), see Fig. 1.3 a). For a fast detection of this word two information bits or flags are appended to each long accumulator word, see Fig. 1.3 b). One of these bits, the all bits 1 flag, is set to 1 if all 64 bits of the register word are 1. This means that a carry will propagate through the entire word. The other bit, the all bits 0 flag, is set to 0 if all 64 bits of the register word are 0. This means that in case of subtraction a borrow will propagate through the entire word.
During the addition of a product into the three consecutive words of the
LA, a search is started for the next more significant word of the LA where
the all bits 1 flag is not set. This is the word which will absorb a possible
carry. If the addition generates a carry, this word must be incremented by one
and all intermediate words must be changed from all bits 1 to all bits O. The
easiest way to do this is simply to switch the flag bits from all bits 1 to all
bits 0 with the additional semantics that if a flag bit is set, the appropriate
constant (all bits 0 or all bits 1) must be generated instead of the stored word contents whenever that LA word is read, see Fig. 1.3 b). Borrows are handled in
an analogous way.
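To make the flag mechanism concrete, the following C sketch models a small segmented accumulator in software. The data layout, the boolean encoding of the two flags and the function names are choices made for this illustration only; in the hardware the flags are separate register bits, the skip is resolved by the flag logic in parallel with the addition, and borrows are treated analogously.

  #include <stdint.h>
  #include <stdbool.h>

  #define WORDS 8                      /* toy accumulator of 8 x 64 bit words */

  typedef struct {
      uint64_t w[WORDS];               /* w[0] is the least significant word  */
      bool all_ones[WORDS];            /* word consists of 64 one bits        */
      bool all_zeros[WORDS];           /* word consists of 64 zero bits       */
  } acc_t;

  static void update_flags(acc_t *a, int i) {
      a->all_ones[i]  = (a->w[i] == UINT64_MAX);
      a->all_zeros[i] = (a->w[i] == 0);
  }

  /* Add one 64 bit portion of a summand into word i and resolve a possible
     carry: words whose all_ones flag is set are skipped (they flip to all
     zeros), and the first word that is not all ones absorbs the carry. */
  static void add_portion(acc_t *a, int i, uint64_t portion) {
      uint64_t old = a->w[i];
      a->w[i] = old + portion;
      update_flags(a, i);
      if (a->w[i] < old) {                          /* carry out of word i   */
          int j = i + 1;
          while (j < WORDS && a->all_ones[j]) {     /* carry skip area       */
              a->w[j] = 0;
              update_flags(a, j);
              j++;
          }
          if (j < WORDS) {                          /* carry resolution word */
              a->w[j] += 1;
              update_flags(a, j);
          }
      }
  }

In the SPU the word j that absorbs the carry is found from the flag bits alone, so it can be determined, and incremented speculatively, before the addition into word i has finished.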

[Figure: a) local fixed-point addition of a shifted summand into the LA with carry generation, the carry skip area, the carry start address and the carry resolution address; b) the extracted carry flag word with the carry flag logic and a carry dissolution indicator.]
Fig. 1.3. Fast carry resolution

This carry handling scheme allows a very fast carry resolution. The gen-
eration of the carry resolution address is independent of the addition of the
product, so it can be performed in parallel. At the same time, a second set of
flags is set up for the case that a carry is generated. If the latter is the case,
the carry is added into the appropriate word and the second set of flags is
copied into the former flag word.
Simultaneously with the multiplication of the mantissa of ai and bi their
exponents are added. This is just an eleven bit addition. The result is available
very quickly. It delivers the exponent of the product and the address for
its addition. By looking at the flags, the carry resolution address can be
determined and the carry word can already be incremented/decremented as
soon as the exponent of the product is available. It could be available before
the multiplication of the mantissas is finished. If the accumulation of the
product then produces a carry, the incremented/decremented carry word is
written back into the LA, otherwise nothing is changed.
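A minimal software picture of this speculative update, for a single accumulator word and its carry word, might look as follows; the three-word addition of the full product and the update of the second flag set are left out, and the names are chosen only for this sketch.

  #include <stdint.h>
  #include <stdbool.h>

  /* The word selected by the flags is incremented "on the side" while the
     summand is still being added.  Only if the addition really produces a
     carry is the prepared copy written back; otherwise nothing changes. */
  static void add_with_speculative_carry(uint64_t *target_word,
                                         uint64_t *carry_word,
                                         uint64_t summand) {
      uint64_t prepared = *carry_word + 1;      /* computed in parallel        */
      uint64_t old = *target_word;
      *target_word = old + summand;             /* local fixed-point addition  */
      bool carry_out = (*target_word < old);
      if (carry_out)
          *carry_word = prepared;               /* commit the prepared word    */
  }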
This very fast carry resolution technique could be used in particular for
the computation of short scalar products which occur, for instance, in the
computation of the real and imaginary part of a product of two complex
floating-point numbers. A long scalar product, however, is usually performed
in a pipeline. Then, during the execution of a product, the former product
is added. It seems reasonable, then, to postpone the carry resolution until the former addition has actually finished.

1.3 High-Performance Scalar Product Units (SPU)


After having discussed the two principal Solutions A and B for exact scalar
product computation as well as a very fast carry handling scheme, we now
turn to a more detailed design of scalar product computation units for diverse
processors. These units will be called SPU, which stands for Scalar Product
Unit. If not otherwise mentioned we assume throughout this section that the
data are stored in the double precision format of the IEEE-arithmetic stan-
dard 754. There the floating-point word has 64 bits and the mantissa consists
of 53 bits. A central building block for the SPU is the long accumulator LA.
It is a fixed-point register wherein any sum of floating-point numbers and of
simple products of floating-point numbers can be represented without error.
The unit allows the computation of scalar products of two vectors with any
finite number of floating-point components to full accuracy or with one sin-
gle rounding at the very end of the computation. As shown in Remark 3c) of
Section 1.2.3, the LA consists of 4288 bits. It can be represented by 67 words
of 64 bits.
The scalar product is a highly frequent operation in scientific computa-
tion. So its execution should be fast. All circuits to be discussed in this section
perform the scalar product in a pipeline which simultaneously executes the
following steps:

a) read the two factors ai and bi to perform a product,
b) compute the product ai x bi to the full double length, and
c) add the product ai x bi to the LA.

Step a) turns out to be the bottleneck of this pipeline. Therefore, we shall develop different circuits for computers which are able to read the two factors ai and bi into the SPU in four, two, or a single portion. The latter case will be discussed in Section 1.5. Step b) produces a product of 106 bits. It maps onto at most three consecutive words of the LA. The address of these words is determined by the product's exponent. In step c) the 106 bit product is
added to the three consecutive words of the LA.

1.3.1 SPU for Computers with a 32 Bit Data Bus

Here we consider a computer which is able to read the data into the arithmetic
logical unit and/or the SPU in portions of 32 bits. The personal computer is
a typical representative of this kind of computer.
Solution A with an adder and a shifter for the full LA of 4288 bits would
be too expensive. So the SPU for these computers is built upon solution B

[Figure: the 53 bit mantissas of ai and bi are multiplied by a 27 x 27 bit multiplier, the 106 bit product is shifted, and the shifted product is added into the LA by a 64 bit adder; a possible carry is absorbed by a more significant LA word.]
Fig. 1.4. Accumulation of a product to the LA by a 64 bit adder

(see Fig. 1.4). For the computation of the product ai x bi the two factors
ai and bi are to be read. Both consist of 64 bits. Since the data can only
be read in 32 bit portions, the unit has to read 4 times. We assume that
with the necessary decoding this can be done in eight cycles. See Fig. 1.5.
This is rather slow and turns out to be the bottleneck for the whole pipeline.
In a balanced SPU the multiplier should be able to produce a product and
the adder should be able to accumulate the product in about the same time
the unit needs to read the data. Therefore, it suffices to provide a 27 x 27
bit multiplier. It computes the 106 bit product of the two 53 bit mantissas
of ai and bi by 4 partial products. The subsequent addition of the product
into the three consecutive words of the LA is performed by an adder of 64
bits. The appropriate three words of the LA are loaded into the adder one
after the other and the appropriate portion of the product is added. The sum
is written back into the same word of the LA where the portion has been
read from. A 64 out of 106 bit shifter must be used to align the product
onto the relevant word boundaries. See Fig. 1.4. The addition of the three
portions of the product into the LA may cause a carry. The carry is absorbed

[Figure: cycle diagram of the three pipeline stages read (each pair ai, bi is read in four 32 bit portions), mult/shift (ci := ai * bi and shift of ci), and accumulate (load, add/subtract and store of the three LA words, load and increment/decrement of the carry word, store of carry and flags).]
Fig. 1.5. Pipeline for the accumulation of scalar products on computers with 32 bit data bus.

by incrementing (or decrementing in case of subtraction) a more significant word of the LA as determined by the carry handling scheme.
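The decomposition of the 53 x 53 bit multiplication into four partial products, mentioned above for the 27 x 27 bit multiplier, can be checked with a few lines of C. The sketch splits each 53 bit mantissa into a 27 bit low part and a high part and reassembles the 106 bit result; the GCC/Clang type unsigned __int128 is used here only to hold the result for verification and is not meant to model the hardware datapath.

  #include <stdint.h>

  /* 53 x 53 bit multiplication from four partial products of at most
     27 x 27 bits, as a 27 x 27 bit multiplier used four times would do. */
  static unsigned __int128 mul53(uint64_t a, uint64_t b) {
      const uint64_t MASK27 = (1ULL << 27) - 1;
      uint64_t a_lo = a & MASK27, a_hi = a >> 27;   /* a = a_hi*2^27 + a_lo */
      uint64_t b_lo = b & MASK27, b_hi = b >> 27;   /* b = b_hi*2^27 + b_lo */

      uint64_t p_ll = a_lo * b_lo;                  /* 27 x 27 bits */
      uint64_t p_lh = a_lo * b_hi;                  /* 27 x 26 bits */
      uint64_t p_hl = a_hi * b_lo;                  /* 26 x 27 bits */
      uint64_t p_hh = a_hi * b_hi;                  /* 26 x 26 bits */

      unsigned __int128 prod = (unsigned __int128)p_hh << 54;
      prod += ((unsigned __int128)p_lh + p_hl) << 27;
      prod += p_ll;
      return prod;                                  /* at most 106 bits */
  }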
A brief sketch of the pipeline is shown in Fig. 1.5. There, we assume that
a dual port RAM is available on the SPU to store the LA. This is usual for
register memory. It allows simultaneous reading from the LA and writing into
the LA. Eight machine cycles are needed to read the two 64 bit factors ai and
bi for a product, including the necessary decoding of the data. This is also
about the time in which the multiplication and the shift can be performed in
the second step of the pipeline. The three successive additions and the carry
resolution in the third step of the pipeline again can be done in about the

[Figure: 32 bit data bus, interface with exception handling, registers for exp(ai), exp(bi), mant(ai) and mant(bi), an exponent adder, the 53 x 53 bit multiplication performed by a 27 x 27 bit multiplier, a shifter, an address decoder, a 64 bit adder, flag control, and a dual port RAM of 64 x 67 bits serving as accu memory with flag registers.]
Fig. 1.6. Block diagram for a SPU with 32 bit data supply and sequential addition into SPU

same time. See Fig. 1.5. Fig. 1.6 shows a block diagram for a SPU with 32
bit data bus.
The sum of the exponents of ai and bi delivers the exponent of the product
ai x bi. It consists of 12 bits. The 6 low order (less significant) bits of this sum
are used to perform the shift. The more significant bits of the sum deliver the
LA address to which the product ai x bi has to be added. So the originally
very long shift is split into a short shift and an addressing operation. The
shifter performs a relatively short shift operation. The addressing selects the
three words of the LA for the addition of the product.
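In software terms this split of the product exponent amounts to nothing more than a mask and a shift. The sketch below works on the 12 bit sum of the two biased 11 bit exponents; the fixed offset that maps this sum onto the word numbering of the LA is omitted, since it depends on how the LA words are laid out, and all names are chosen only for this example.

  typedef struct {
      unsigned shift;      /* 0..63: bit position inside the addressed word */
      unsigned word_addr;  /* LA word at which the addition starts          */
  } placement_t;

  /* ea, eb: the 11 bit biased exponents of the two factors as stored in the
     IEEE double format.  Their 12 bit sum locates the product in the LA. */
  static placement_t place_product(unsigned ea, unsigned eb) {
      placement_t p;
      unsigned e  = ea + eb;       /* 12 bit exponent of the product        */
      p.shift     = e & 0x3F;      /* 6 low order bits: shift width         */
      p.word_addr = e >> 6;        /* remaining bits: LA word address       */
      return p;
  }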
The LA RAM needs only one address decoder to find the start address
for an addition. The two more significant parts of the product are added to
the contents of the two LA words with the two subsequent addresses. The
carry logic determines the word which absorbs the carry. All these address
decodings can be hard wired. The result of each one of the four additions is
written back into the same LA word to which the addition has been executed.
The two carry flags appended to each accumulator word are indicated in Fig.
1.6. In practice the flags are kept in separate registers.
We stress the fact that in the circuit just discussed virtually no specific
computing time is needed for the execution of the arithmetic. In the pipeline
the arithmetic is performed in the time which is needed to read the data
into the SPU. Here, we assumed that this requires 8 cycles. This allows both
the multiplication and the accumulation to be performed very economically
and sequentially by a 27 x 27 bit multiplier and a 64 bit adder. Both the
multiplication and the addition are themselves performed in a pipeline. The
arithmetic overlaps with the loading of the data into the SPU.
There are processors on the market, where the data supply to the arith-
metic unit or the SPU is much faster. We discuss the design of a SPU for
such processors in the next section and in Section 1.5.

1.3.2 SPU for Computers with a 64 Bit Data Bus

Now we consider a computer which is able to read the data into the arithmetic
logical unit and/or the SPU in portions of 64 bits. Fast workstations or
mainframes are typical for this kind of computer.
Now the time to perform the multiplication and the accumulation over-
lapped in pipelines as before is no longer available. In order to keep the
execution time for the arithmetic within the time the SPU needs to read the
data, we have to invest in more hardware. For the multiplication a 53 x 53
bit multiplier must now be used. The result is still 106 bits wide. It could
touch three 64 bit words of the LA. But the addition of the product and the
carry resolution now have to be performed in parallel.
The 106 bit summand may fit into two instead of three consecutive 64
bit words of the LA. A closer look at the details shows that the 22 least
significant bits of the three consecutive LA words are never changed by an
addition of the 106 bit product. Thus the adder needs to be 170 bits wide
only. Fig. 1.7 shows a sketch for the parallel accumulation of a product.
In the circuit a 106 to 170 bit shifter is used. The four additions are to
be performed in parallel. So four read/write ports are to be provided for the
LA RAM. A sophisticated logic must be used for the generation of the carry
resolution address, since this address must be generated very quickly. Again
the LA RAM needs only one address decoder to find the start address for an
addition. The more significant parts of the product are added to the contents

[Figure: a 53 x 53 bit multiplier, a 106 to 170 bit shifter, and the LA with the start address of the addition and the carry resolution address.]
Fig. 1.7. Parallel accumulation of a product into the LA

of the two LA words with the two subsequent addresses. A tree structured
carry logic now determines the LA word which absorbs the carry. A very fast
hardwired multi-port driver can be designed which allows all 4 LA words to
be read into the adder in one cycle.
Fig. 1.8 shows the pipeline for this kind of addition. In the figure we
assume that 2 machine cycles are needed to decode and read one 64 bit word
into the SPU.
Fig. 1.9 shows a block diagram for a SPU with a 64 bit data bus and
parallel addition.
We emphasize again that virtually no computing time is needed for the
execution of the arithmetic. In a pipeline the arithmetic is performed in the

[Figure: cycle diagram of the three pipeline stages read (ai and bi), mult/shift (ci := ai * bi and shift of ci), and accumulate (address decoding, load, add/subtract, store of the result and of the flags).]
Fig. 1.8. Pipeline for the accumulation of scalar products.

time which is needed to read the data into the SPU. Here, we assume that
with the necessary decoding, this requires 4 cycles for the two 64 bit factors
ai and bi for a product. To match the shorter time required to read the data,
more hardware has to be invested for the multiplier and the adder.
If the technology is fast enough it may be reasonable to provide a 256
bit adder instead of the 170 bit adder. An adder width of a power of 2 may
simplify the shift operation as well as the address decoding. The lower bits of
the exponent of the product control the shift operation while the higher bits
are directly used as the start address for the accumulation of the product
into the LA.
The two flag registers appended to each accumulator word are indicated
in Fig. 1.9 again. In practice the flags are kept in separate registers.

1.4 Comments on the Scalar Product Units

1.4.1 Rounding

If the result of an exact scalar product is needed later in a program, the contents of the LA must be put into the user memory. How this can be done
will be discussed later in this section.

[Figure: 64 bit data bus, interface with exception handling, registers for exp(ai), exp(bi), mant(ai) and mant(bi), an exponent adder, a 53 x 53 bit multiplier, a 106 to 170 bit shifter, an address decoder, a 170 bit adder with carry increment, flag control and carry resolve logic, and a four port RAM of 64 x 67 bits serving as accu memory with flag registers.]
Fig. 1.9. Block diagram for a SPU with 64 bit data bus and parallel addition into the SPU

If not processed any further the correct result of a scalar product compu-
tation usually has to be rounded into a floating-point number or a floating-
point interval. The flag bits that are used for the fast carry resolution can
be used for the rounding of the LA contents also. By looking at the flag
bits, the leading result word in the accumulator can easily be identified. This
and the next LA word are needed to compose the mantissa of the result.
This 128 bit quantity must then be shifted to form a normalized mantissa
of an IEEE-arithmetic double precision number. The shift length can be extracted by looking at the leading result word in the accumulator with the
same procedure which identified it by looking at the flag bit word.
For the correct execution of the rounding downwards (or upwards) it is
necessary to check whether any one of the discarded bits is different from
zero. This is done by testing the remaining bits of the 128 bit quantity in the
shifter and by looking at the all bits 0 flags of the following LA words. This
information then is used to control the rounding.
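A software view of this rounding preparation might look as follows. The sketch assumes the LA is available as an array of 64 bit words with the all bits 0 flags as booleans; it locates the leading result word from the flags, takes this word and the next lower one as the raw 128 bit mantissa window, and condenses everything below the window into a single sticky bit for the directed roundings. Sign handling, normalization and packing into the IEEE format are left out, and all names are choices for this example.

  #include <stdint.h>
  #include <stdbool.h>

  #define WORDS 67                        /* LA for IEEE double: 67 x 64 bits */

  typedef struct {
      uint64_t w[WORDS];                  /* w[WORDS-1] is most significant   */
      bool all_zeros[WORDS];              /* flag: word consists of zero bits */
  } la_t;

  typedef struct {
      int      lead;                      /* index of the leading result word */
      uint64_t hi, lo;                    /* raw 128 bit mantissa window      */
      bool     sticky;                    /* any discarded bit nonzero?       */
  } la_window_t;

  static la_window_t la_extract(const la_t *a) {
      la_window_t r = { -1, 0, 0, false };
      for (int i = WORDS - 1; i >= 0; i--)     /* scan the flags from the top */
          if (!a->all_zeros[i]) { r.lead = i; break; }
      if (r.lead < 0) return r;                /* the accumulator is zero     */
      r.hi = a->w[r.lead];
      r.lo = (r.lead > 0) ? a->w[r.lead - 1] : 0;
      for (int i = r.lead - 2; i >= 0; i--)    /* sticky bit from the rest    */
          if (!a->all_zeros[i]) { r.sticky = true; break; }
      return r;
  }

For the directed roundings the sticky bit, together with the bits shifted out of the 128 bit window, decides whether the truncated result has to be adjusted by one unit in the last place.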
Only one rounding at the very end of a scalar product computation is
needed. If a large number of products has been accumulated the contribution
of the rounding to the computing time is not substantial. However, if a short scalar product or a single floating-point addition or subtraction has to be carried out by the SPU, a very fast rounding procedure is essential
for the speed of the overall operation.
The rounding depends heavily on the speed with which the leading non
zero digit of the LA can be detected. A pointer to this digit, carried along
with the computation, would immediately identify this digit. The pointer
logic requires additional hardware and its usefulness decreases for lengthy
scalar products to be computed.
For short scalar products or single floating-point operations leading zero
anticipation (LZA) would be more useful. The final result of a scalar product
computation is supposed to lie in the exponent range between e1 and e2 of
the LA. Otherwise the problem has to be scaled. So hardware support for the
LZA is only needed for this part of the LA. A comparison of the exponents of
the summands identifies the LA word for which the LZA should be activated.
The LZA consists of a fast computation of a provisional sum which differs
from the correct sum by at most one leading zero. With this information the
leading zeros and the shift width for the two LA words in question can be
detected easily and quickly [91].

1.4.2 How much Local Memory should be Provided on a SPU?

There are applications which make it desirable to provide more than one long
accumulator on the SPU. If, for instance, the components of the two vectors
a = (ai) and b = (bi) are complex floating-point numbers, the scalar product
a· b is also a complex floating-point number. It is obtained by accumulating
the real and imaginary parts of the product of two complex floating-point
numbers. The formula for the product of two complex floating-point numbers

(x = x1 + i x2, y = y1 + i y2  ⇒  x × y = (x1 × y1 - x2 × y2) + i (x1 × y2 + x2 × y1))
shows that the real and imaginary part of ai and bi are needed for the com-
putation of both the real part of the product ai x bi as well as the imaginary
part.

Access to user memory is usually slower than access to register memory.


To obtain high computing speed it is desirable, therefore, to read the real
and imaginary parts of the vector components only once and to compute
the real and imaginary parts of the products simultaneously in two long
accumulators on the SPU instead of reading the data twice and performing
the two accumulations sequentially. The old calculators shown in Fig. 1.22
and Fig. 1.23 on page 62 had already two long registers.
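The data flow for one component pair of a complex scalar product then looks roughly as follows. Plain double variables stand in here for the two long accumulators; in the SPU the four products would be accumulated exactly and rounded only once at the very end, and the type and function names are invented for this sketch.

  typedef struct { double re, im; } cplx;

  /* One step of a complex scalar product: the real and imaginary parts of
     a[i] and b[i] are read once and feed four products into two accumulators. */
  static void cdot_step(cplx a_i, cplx b_i, double *acc_re, double *acc_im) {
      *acc_re += a_i.re * b_i.re - a_i.im * b_i.im;
      *acc_im += a_i.re * b_i.im + a_i.im * b_i.re;
  }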
Very similar considerations show that a high speed computation of the
scalar product of two vectors with interval components makes two long ac-
cumulators desirable as well.
There might be other reasons to provide local memory space for more
than one LA on the SPU. A program with higher priority may interrupt the
computation of a scalar product and require a LA. The easiest way to solve
this problem is to open a new LA for the program with higher priority. Of
course, this can happen several times which raises the question how much
local memory for how many long accumulators should be provided on a SPU.
Three might be a good number to solve this problem. If a further interrupt
requires another LA, the LA with the lowest priority could be mapped into
the main memory by some kind of stack mechanism and so on. This tech-
nique would not limit the number of interrupts that may occur during a
scalar product computation. These problems and questions must be solved
in connection with the operating system.
For a time sharing environment memory space for more than one LA on
the SPU may also be useful.
However the contents of the last two paragraphs are of a more hypothetical
nature. The author is of the opinion that the scalar product is a fundamental
and basic operation which should not and never needs to be interrupted.

1.4.3 A SPU Instruction Set

For the SPU the following 10 instructions for the LA are recommended. These
10 low level instructions are most natural and inevitable as soon as the idea
of the long accumulator for the accurate scalar product has been chosen.
They are low level capabilities to support the high level instructions devel-
oped in the next section, and are based on preceding experience with these
in the XSC-languages since 1980. Very similar instructions were provided by
the processor developed in [93]. Practically identical instructions were used in
[109] to support ACRITH and ACRITH-XSC [108,110,112]. These IBM pro-
gram products have been developed at the author's institute in collaboration
with IBM.
The 10 low level instructions for the LA are:
1. clear the LA,
2. add a product to the LA,
3. add a floating-point number to the LA,
4. subtract a product from the LA,
5. subtract a floating-point number from the LA,
6. read LA and round to the destination format,
7. store LA contents in memory,
8. load LA contents from memory,
9. add LA to LA,
10. subtract LA from LA.

The clear instruction can be performed by setting all all bits 0 flags to
0. The load and store instructions are performed by using the load/store
instructions of the processor. For the add, subtract and round instructions
the following denotations could be used. There the prefix sp identifies SPU
instructions. In denotes the floating-point format that is used and will be db
for IEEE double. In all SPU instructions, the LA is an implicit source and
destination operand. The number of the instruction above is repeated in the
following coding which could be used to realize it.
2. spadd In src1, src2,
multiply the numbers in the given registers and add the product to the
LA.
3. spadd In src,
add the number in the given register to the LA.
4. spsub In src1, src2,
multiply the numbers in the given registers and subtract the product
from the LA.
5. spsub In src,
subtract the number in the given register from the LA.
6. spstore In.rd dest,
get LA contents and put the rounded value into the destination register.
In the instruction rd controls the rounding mode that is used when the
LA contents is stored in a floating-point register. It is one of the following:
rn round to nearest,
rz round towards zero,
rp round upwards, i. e. towards plus infinity,
rm round downwards, i. e. towards minus infinity.
7. spstore dest,
get LA contents and put its value into the destination memory operand.
8. spload src,
load accumulator contents from the given memory operand into the LA.
9. spadd src,
the contents of the accumulator at the location src are added to the
contents of the accumulator in the processor.
10. spsub src,
the contents of the accumulator at the location src are subtracted from
the contents of the accumulator in the processor.

1.4.4 Interaction with High Level Programming Languages

This paper is motivated by the tremendous advances in computer technology that have been made in recent years. 100 million transistors can be placed on
a single chip. This allows the quality and high accuracy of the basic floating-
point operations of addition, subtraction, multiplication and division to be
extended to the arithmetic operations in the linear spaces and their interval
extensions which are most commonly used in computation. A new fundamen-
tal operation, the scalar product, is needed to provide this advanced computer
arithmetic. The scalar product can be produced by an instruction multiply
and accumulate and placed in the LA which has enough digit positions to
contain the exact sum without rounding. Only a single rounding error of
at most one unit in the last place is introduced when the completed scalar
product (often also called dot product) is returned to one of the floating-point
registers.
By operator overloading in modern programming languages matrix and
vector operations can be provided with highest accuracy and in a simple
notation, if the optimal scalar product is available. However, many scalar
products that occur in a computation do not appear as vector or matrix
operations in the program. A vectorizing compiler is certainly a good tool
for detecting such additional scalar products in a program. Since the hardware
supported optimal scalar product is faster than a conventional computation
in floating-point arithmetic this would increase both the accuracy and the
speed of the computation.
In the computer, the scalar product is produced by several, more ele-
mentary computer instructions as shown in the last section. Programming
and the detection of scalar products in a program can be simplified a great
deal if several of these computer instructions are put into the hands of the
user and incorporated into high level programming languages. This has been
done with great success in the so-called XSC-languages (eXtended Scien-
tific Computation) since 1980 [14,41,46-49,56,67,68,106-108,112] that have
been developed at the author's institute. All these languages provide an ac-
curate scalar product implemented in software based on integer arithmetic.
If a computer is equipped with the hardware unit XPA 3233 (see Section
1.7) the hardware unit is called instead. A large number of problem solv-
ing routines with automatic result verification has been implemented in the
XSC-languages for practically all standard problems of numerical analysis
[34,35,57,64,110,112]. These routines have very successfully been applied in
the sciences.
We mention a few of these constructs and demonstrate their usefulness.
Central to this is the idea of allowing variables of the size of the LA to
be defined in a user's program. For this purpose a new data type called
dotprecision is introduced. A variable of the type dotprecision is a fixed-point
variable with L = k + 2e2 + 2l + 2|e1| digits of base b. See Fig. 1.1. As has been shown earlier, every finite sum of floating-point products ∑ ai x bi (i = 1, ..., n)
can be represented as a variable of type dotprecision. Moreover, every such
sum can be computed in a local store of length L on the SPU without loss of
information. Along with the type dotprecision the following constructs serve
as primitives for developing expressions in a program which can easily be
evaluated with the SPU instruction set:
dotprecision    new data type
assignment from dotprecision    to dotprecision, or to real with rounding to nearest, or to interval with roundings downwards and upwards, depending on the type on the left hand side of the := operator.
For variables of type dotprecision so-called dotprecision expressions are
permitted which are defined as sums of simple expressions. A simple expres-
sion is either a signed or unsigned constant or a variable of type real or a single
product of two such objects or another dotprecision variable. All operations
(multiplications and accumulations) are to be executed to full accuracy.
For instance, let x be a variable of type dotprecision and y and z variables
of type real. Then in the assignment

x := x + y * z
the double length product of y and z is added to the variable x of type
dotprecision and its new value is assigned to x.
The scalar product of two vectors a = (ai) and b = (bi) is now easily
implemented with a variable x of type dotprecision as follows:
x := 0;
for i := 1 to n do x := x + a[i] * b[i];
y := x;
The last statement y := x rounds the value of the variable x of type dotpre-
cision into the variable y of type real by applying the standard rounding of
the computer. y then has the value of the computed scalar product, which is within a single rounding error of the exact scalar product a · b.
For example, the method of defect correction or iterative refinement re-
quires highly accurate computation of expressions of the form
a·b-c·d
with vectors a, b, c, and d. Employing a variable x of type dotprecision, this
expression can now be programmed as follows:
x := 0;
for i := 1 to n do x := x + a[i] * b[i];
for i := 1 to n do x := x - c[i] * d[i];
y := x;

The result, involving 2n multiplications and 2n - 1 additions, is produced with but a single rounding operation.
In the last two examples y could have been defined to be of type interval.
Then the last statement y := x would produce an interval with a lower bound
which is obtained by rounding the dotprecision value of x downwards and an
upper bound by rounding it upwards. Thus, the bounds of y will be either
the same or two adjacent floating-point numbers.
In the XSC-languages the functionality of the dotprecision type and ex-
pression is available also for complex data as well as for interval and complex
interval data.

1.5 Scalar Product Units for Top-Performance Computers

By definition a top-performance computer is able to read two data x and y to perform a product x × y into the arithmetic logical unit and/or the
SPU simultaneously in one portion. Supercomputers and vector processors
are typical representatives of this kind of computers. Usually the floating-
point word consists of 64 bits and the data bus is 128 or even more bits wide.
However, digital signal processors with a word size of 32 bits can also belong
in this class if two 32 bit words are read into the ALU and/or SPU in one
portion. For these kind of computers both solutions A and B which have been
sketched in Sections 1.2.1 and 1.2.2 make sense and will be discussed. The
higher the speed of the system the more hardware has to be employed. The
most involved and expensive solution seems to be best suited to reveal the
basic ideas. So we begin with solution A using a long adder for the double
precision data format.

1.5.1 Long Adder for 64 Bit Data Word (Solution A)

In [44] the basic ideas have been developed for a general data format. How-
ever, to be very specific we discuss here a circuit for the double precision
format of the IEEE-arithmetic standard 754. The word size is 64 bits. The
mantissa has 53 bits and the exponent 11 bits. The exponent covers a range
from -1022 to +1023. The LA has 4288 bits. We assume again that the scalar
product computation can be subdivided into a number of independent steps
like

a) read ai and bi ,
b) compute the product ai x bi ,
c) add the product to the LA.
Now by assumption the SPU can read the two factors ai and bi simul-
taneously in one portion. We call the time that is needed for this a cycle.

Then, in a balanced design, steps b) and c) should both be performed in about the same time. Using well known fast multiplication techniques like
Booth-Recoding and Wallace-tree this certainly is possible for step b). Here,
the two 53 bit mantissas are multiplied. The product has 106 bits. The main
difficulty seems to appear in step c). There, we have to add a summand of
106 bits to the LA in every cycle.
With solution A the addition is performed by a long adder and a long
shift, both of L = 4288 bits. An adder and a shift of this size are necessarily
slow, certainly too slow to process one summand of 106 bits in a single cycle.
Therefore, measures have to be taken to speed up the addition as well as
the shift. As a first step we subdivide the long adder into shorter segments.
Without loss of generality we assume that the segments consist of 64 bits (other segment sizes are possible, see [44,45]).
A 64 bit adder certainly is faster than a 4288 bit adder. Now each one of the
64 bit adders may produce a carry. We write these carries into carry registers
between two adjacent adders. See Fig. 1.10.

[Figure: a summand register, a parallel adder and an accumulator of full length, compared with a segmented summand, a segmented adder and a segmented accumulator with carry registers between adjacent segments.]
Fig. 1.10. Parallel and segmented parallel adder

If a single addition has to be performed these carries still have to be propagated. In a scalar product computation, however, this is not necessary.
We assume that a large number of summands has to be added. We simply add
the carry with the next summand to the next more significant adder. Only
at the very end of the accumulation, when no more summands are coming,
carries may have to be eliminated. However, every summand is relatively
short. It consists of 106 bits only. So during the addition of a summand,
carries are only produced in a small part of the 4288 bit adder. The carry
elimination, on the other hand, takes place during each step of the addition

wherever a carry is left. So in an average case there will only be very few
carries left at the end of the accumulation and a few additional cycles will
suffice to absorb the remaining carries. Thus, segmenting the adder enables it
to keep up in speed with steps a) and b) and to read and process a summand
in each cycle.
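The deferred carry handling can be simulated with a few lines of C. Each 64 bit segment has a carry register; a carry out of a segment is only recorded, it is consumed when a later summand portion arrives at the segment above, and a final sweep absorbs whatever is left when no more summands are coming. In the hardware, segments with a waiting carry are activated in every cycle; the sketch, for simplicity, consumes a carry only when its segment is addressed again or during the final sweep, and its data layout is an arbitrary choice.

  #include <stdint.h>

  #define SEGS 67                        /* 67 segments of 64 bits each      */

  typedef struct {
      uint64_t seg[SEGS];                /* seg[0] is least significant      */
      uint64_t carry[SEGS];              /* pending carries into segment i+1 */
  } seg_acc_t;

  /* Add a 64 bit portion into segment i together with the carry waiting at
     its input; a carry out of the segment is not propagated but recorded. */
  static void seg_add(seg_acc_t *a, int i, uint64_t portion) {
      uint64_t pending = (i > 0) ? a->carry[i - 1] : 0;
      if (i > 0) a->carry[i - 1] = 0;

      uint64_t s1 = a->seg[i] + portion;
      uint64_t c  = (s1 < portion);              /* carry of the first add  */
      uint64_t s2 = s1 + pending;
      c += (s2 < pending);                       /* carry of the second add */
      a->seg[i] = s2;
      if (i + 1 < SEGS) a->carry[i] += c;        /* postpone the carry      */
  }

  /* Final sweep when no more summands are coming. */
  static void seg_flush(seg_acc_t *a) {
      for (int i = 0; i + 1 < SEGS; i++) {
          uint64_t c = a->carry[i];
          a->carry[i] = 0;
          uint64_t old = a->seg[i + 1];
          a->seg[i + 1] = old + c;
          if (a->seg[i + 1] < old && i + 2 < SEGS)
              a->carry[i + 1] += 1;
      }
  }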
The long shift of the 106 bit summand is slow also. It is speeded up by
a matrix shaped arrangement of the adders. Only a few, let us assume here four, of the partial adders are placed in a row. We begin with the four least significant adders. The four next more significant adders are placed directly beneath them, and so on. The most significant adders form the last row.
The rows are connected as shown in Fig. 1.11.
In our example, where we have 67 adders of 64 bits, 17 rows suffice to
arrange the entire summing matrix. Now the long shift is performed as fol-
lows: The summand of 106 bits carries an exponent. In a fast shifter of 106
to 256 bits it is shifted into a position where its most significant digit is
placed directly above the position in the long adder which carries the same
exponent identification E. The remaining digits of the summand are placed
immediately to its right. Now the summing matrix reads this summand into
the S-registers (summand registers) of every row. The addition is executed
in that row where the exponent identification coincides with that of the sum-
mand.
It may happen that the most significant digit of the summand has to be
shifted so far to the right that the remaining digits would hang over at the
right end of the shifter. These digits then are reinserted at the left end of the
shifter by a ring shift. If now the more significant part of the summand is
added in row r, its less significant part will be added in row r - 1.
By this matrix shaped arrangement of the adders, the unit can perform
both a shift and an addition in a single cycle. The long shift is reduced to
a short shift of 106 to 256 bits which is fast. The remaining shift happens
automatically by the row selection for the addition in the summing matrix.
Every summand carries an exponent which in our example consists of
12 bits. The lower part of the exponent, i. e. the 8 least significant digits,
determine the shift width and with it the selection of the columns in the
summing matrix. The row selection is obtained by the 4 most significant
bits of the exponent. This complies roughly with the selection of the adding
position in two steps by the process of Fig. 1.2. The shift width and the row
selection for the addition of a product ai x bi to the LA are known as soon
as the exponent of the product has been computed. Since the exponents of
ai and bi consist of 11 bits only, the result of their addition is available very
quickly. So while the multiplication of the mantissas is still being executed
the shifter can already be switched and the addresses of the LA words for
the accumulation of the product ai x bi can be selected.
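As a sketch, this address computation for the summing matrix, including the test for the ring shift, can be written down as follows. The convention that bit positions are counted from the right (least significant) end of the 256 bit row, the constants 8 and 4 for the column and row parts of the exponent, and all names are assumptions of this illustration; the exact numbering of the rows and the handling of the wrapped part in row r - 1 are simplified.

  typedef struct {
      unsigned row;        /* row of the summing matrix selected for the add  */
      unsigned bitpos;     /* position of the most significant product digit,
                              counted from the right end of the 256 bit row   */
      int      wraps;      /* nonzero: the low part wraps into row r - 1      */
  } matrix_pos_t;

  /* e12: the 12 bit exponent of the product. */
  static matrix_pos_t place_in_matrix(unsigned e12) {
      matrix_pos_t p;
      p.bitpos = e12 & 0xFF;             /* 8 low order bits: column / shift  */
      p.row    = (e12 >> 8) & 0xF;       /* 4 most significant bits: row      */
      /* the 106 bit summand needs 105 bit positions to the right of its
         leading digit; if fewer remain, the overhang is reinserted at the
         left end of the shifter and added in the next lower row              */
      p.wraps  = (p.bitpos < 105);
      return p;
  }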
The 106 bit summand touches at most three consecutive words of the
LA. The addition of the summand is executed by these three partial adders.

[Figure: interface, instruction decoder, controller (multiply, add, rounding, debugging command control, flag control), a shifter of 256 bits, and the summing matrix of 64 bit adders/subtracters arranged in rows, each row position with a summand register S, an exponent identifier E, an accu word AC (64 bits with flags), a carry bit cy and a small local memory, followed by a rounding unit. Legend: S summand, E exponent identifier, AC accu word (64 bits with flags), cy carry bit, +/- adder/subtracter (64 bits), LSB least significant bit, MSB most significant bit.]
Fig. 1.11. Block diagram of a SPU with long adder for a 64 bit data word and 128 bit data bus

Each of these adders can produce a carry. The carry of the leftmost of these
partial adders can with high probability be absorbed, if the addition always
is executed over four adders and the fourth adder then is the next more
significant one. This can reduce the number of carries that have to be resolved
during future steps of the accumulation and in particular at the end.
In each step of the accumulation an addition only has to be activated in
the selected row of adders and in those adders where a non zero carry is wait-
ing to be absorbed. This adder selection can reduce the power consumption
for the accumulation step significantly.
The carry resolution method that has been discussed so far is quite nat-
ural. It is simple and does not require particular hardware support. If long
scalar products are being computed it works very well. Only at the end of
the accumulation, if no more summands are coming, a few additional cycles
may be required to absorb the remaining carries. Then a rounding can be ex-
ecuted. However, this number of additional cycles for the carry resolution at
the end of the accumulation, although it is small in general, depends on the
data and is unpredictable. In case of short scalar products the time needed
for these additional cycles may be disproportionately high and indeed exceed
the addition time.
With the fast carry resolution mechanism that has been discussed in Sec-
tion 1.2.4 these difficulties can be overcome. At the cost of some additional
hardware all carries can be absorbed immediately at each step of the accu-
mulation. The method is shown in Fig. 1.11 also. Two flag registers for the
all bits 0 and the all bits 1 flags are shown at the left end of each partial accu-
mulator word in the figure. The addition of the 106 bit products is executed
by three consecutive partial adders. Each one of these adders can produce a
carry. The carries between two of these adjacent adders can be avoided, if all
partial adders are built as Carry Select Adders. This increases the hardware
costs only moderately. The carry registers between two adjacent adders then
are no longer necessary (this is also the case in Fig. 1.12, where a similar situation is discussed and all adders are supposed to be carry select adders). The flags indicate which one of the more significant
LA words will absorb the left most carry. During an addition of a product
only these 4 LA words are changed and only these 4 adders need to be acti-
vated. The addresses of these 4 words are available as soon as the exponent
of the summand ai x bi has been computed. During the addition step now
simultaneously with the addition of the product the carry word can be in-
cremented (decremented). If the addition produces a carry the incremented
word will be written back into the local accumulator. If the addition does not
produce a carry, the local accumulator word remains unchanged. Since we
have assumed that all partial adders are built as Carry Select Adders this fi-
nal carry resolution scheme requires no additional hardware. Simultaneously
with the incrementation/decrementation of the carry word a second set of
flags is set up for the case that a carry is generated. In this case the second
set of flags is copied into the former word.
The accumulators that belong to partial adders in Fig. 1.11 are denoted by
AC. Beneath them a small memory is indicated in the figure. It can be used to
save the LA contents very quickly in case that a program with higher priority
interrupts the computation of a scalar product and requires the unit for itself.
However, the author is of the opinion that the scalar product is a fundamental
and basic arithmetic operation which should never be interrupted. The local
memory on the SPU can be used for fast execution of scalar products in the
case of complex and of interval arithmetic.
In Section 1.4.2 we have discussed applications like complex arithmetic
or interval arithmetic which make it desirable to provide more than one LA
on the SPU. The local memory on the SPU shown in Fig. 1.11 serves this
purpose.
In Fig. 1.11 the registers for the summands carry an exponent identifica-
tion denoted by E. This is very useful for the final rounding. The usefulness
of the flags for the final rounding has already been discussed. They also serve
for fast clearing of the accumulator.
The SPU which has been discussed in this section seems to be costly.
However, it consists of a large number of identical parts and it is very regular.
This allows a highly compact design. Furthermore the entire unit is simple.
No particular exception handling techniques are to be dealt with by the
hardware. Vector computers are the most expensive. A compact and simple
solution, though expensive, is justified for these systems.

1.5.2 Long Adder for 32 Bit Data Word (Solution A)

In this section as well as in Section 1.5.4 we consider a computer which uses a 32 bit floating-point word and which is able to read two such words into the ALU and/or SPU simultaneously in one portion. Digital signal processors are
representatives of this kind of computer. Real time computing requires very
high computing speed and high accuracy in the result. As in the last section
we call the time that is needed to read the two 32 bit floating-point words a
cycle.
We first develop circuitry which realizes Solution A using a long adder
and a long shift. To be very specific we assume that the data are given as
single precision floating-point numbers conforming to the IEEE-arithmetic
standards 754. There the mantissa consists of 24 bits and the exponent has
8 bits. The exponent covers a range from -126 to +127 (in binary). As
discussed in Remark 3a) of Section 1.2.3, 640 bits are a reasonable choice for
the LA. It can be represented by 10 words of 64 bits.
Again the scalar product is computed by a number of independent steps
like

a) read ai and bi ,
b) compute the product ai x bi ,
c) add the product to the LA.
Each of the mantissas of ai and bi has 24 bits. Their product has 48
bits. It can be computed very fast by a 24 x 24 bit multiplier using standard
techniques like Booth-Recoding and Wallace tree. The addition of the two 8
bit exponents of ai and bi delivers the exponent of the product consisting of
9 bits.
The LA consists of 10 words of 64 bits. The 48 bit mantissa of the product
touches at most two of these words. The addition of the product is executed
by the corresponding two consecutive partial adders. Each of these two adders
can produce a carry. The carry between the two adjacent adders can immedi-
ately be absorbed if all partial adders are built as Carry Select Adders again.
The carry of the more significant of the two adders will be absorbed by one of
the more significant 64 bit words of the LA. The flag mechanism (see Section
1.2.4) indicates which one of the LA words will absorb a possible carry. So
during an addition of a summand the contents of at most 3 LA words are
changed and only these three partial adders need to be activated. The ad-
dresses of these words are available as soon as the exponent of the summand
ai x bi has been computed. During the addition step, simultaneously with the
addition of the product, the carry word can be incremented (decremented).
If the addition produces a carry the incremented word will be written back
into the local accumulator. If the addition does not produce a carry, the lo-
cal accumulator word remains unchanged. Since all partial adders are built
as Carry Select Adders no additional hardware is needed for the carry reso-
lution. Simultaneously with the incrementation/ decrementation of the carry
word a second set of flags is set up for the case that a carry is generated.
If the latter is the case the second set of flags is copied into the former flag
word.
Details of the circuitry just discussed are summarized in Fig. 1.12. The
figure is highly similar to Fig. 1.11 of the previous section. In order to avoid
the long shift, the long adder is designed as a summing matrix consisting of
2 adders of 64 bits in each row. For simplicity in the figure only 3 rows (of
the 5 needed to represent the full LA) are shown.
In a fast shifter of 48 to 128 bits the 48 bit product is shifted into a
position where its most significant digit is placed directly above the position
in the long adder which carries the same exponent identification E. The
remaining digits of the summand are placed immediately to its right. If they
hang over at the right end of the shifter, they are reinserted at the left end
by a ring shift. Above the summing matrix in Fig. 1.12 two possible positions
of summands after the shift are indicated.
The summing matrix now reads the summand into its S-registers. The
addition is executed by those adders where the exponent identification coin-
cides with the one of the summand. The exponent of the summand consists

[Figure: interface, register file, registers for exp(a), exp(b), mant(a) and mant(b), an 8 bit exponent adder, a 24 x 24 bit multiplier, a 48 to 128 bit shifter, instruction decoder and controller (multiply, add, rounding, debugging command control), and the summing matrix of 64 bit adders/subtracters arranged in rows of two, each with summand register S, exponent identifier E, accu word AC (64 bits with flags) and carry bit cy, followed by a rounding unit; an additional 32 bit data path leads from the register file directly to the shifter. Legend: S summand, E exponent identifier, AC accu word (64 bits with flags), cy carry bit, +/- adder/subtracter (64 bits), LSB least significant bit, MSB most significant bit.]
Fig. 1.12. Block diagram of a SPU with long adder for a 32 bit data word and 64 bit data bus

of 9 bits. The lower part, i. e. the 7 least significant digits, determine the
shift width. The selection of the two adders which perform the addition is
determined by the most significant bits of the exponent.
In Fig. 1.12 again some memory is indicated for each part of the LA. It
can be used to save the LA contents very quickly in case a program with
higher priority interrupts the computation of a scalar product and requires
the unit for itself. The local memory on the SPU also can be used for fast
execution of scalar products in the case of complex arithmetic and of interval
arithmetic.
In comparison with Fig. 1.11, Fig. 1.12 shows an additional 32 bit data
path directly from the input register file to the fast shifter. This data path
is supposed to allow a very fast execution of the operation multiply and
add fused, rnd(a x b + c), which is provided by some conventional floating-
point processors. While the product a x b is computed by the multiplier, the
summand c is added to the LA.
The SPU which has been discussed in this section seems to be costly
at first glance. While a single floating-point addition conveniently can be
done with one 64 bit adder, here 640 full adders (10 64-bit adders) have
been used in carry select adder mode. However, the advantages of this design
are tremendous. While a conventional floating-point addition can produce a
completely wrong result with only two or three additions, the new unit never
delivers a wrong answer, even if millions of floating-point numbers or single
products of such numbers are added. An error analysis is never necessary for
these operations. The unit consists of a large number of identical parts and
it is very regular. This allows a very compact design. No particular hardware
has to be included to deal with rare exceptions. Although an increase in
adder equipment by a factor of 10, compared with a conventional floating-
point adder, might seem to be high, the number of full adders used for the
circuitry is not extraordinary. We stress the fact that for a Wallace tree in
case of a standard 53 x 53 bit multiplier about the same number of full adders
is used. For fast conventional computers this has been the state of the art
multiplication for many years and nobody complains about high cost.

1.5.3 Short Adder with Local Memory on the Arithmetic Unit for
64 Bit Data Word (Solution B)

In the circuits discussed in Sections 1.5.1 and 1.5.2 adder equipment was
provided for the full width of the LA. The long adder was segmented into
partial adders of 64 bits. In Section 1.5.1, 67 such units were used, and in Section 1.5.2, 10. During an addition of a summand, however, only 4 of these units are activated in Section 1.5.1, and only 3 in Section 1.5.2. This raises
the question whether adder equipment is really needed for the full width of
the LA and whether the accumulation can be done with only 4 or 3 adders
in accordance with Solution B of Section 1.2.2. There the LA is kept as local
memory on the arithmetic unit.

In this section we develop such a solution for the double precision data
format. An in-principle solution using a short adder and local memory on the
arithmetic unit was discussed in Section 1.3.2. There the data ai and bi to
perform a product ai x bi are read into the SPU successively in two portions
of 64 bits. This leaves 4 machine cycles to perform the accumulation in the
pipeline.
Now we assume that the two data ai and bi for a product ai x bi are read
into the SPU simultaneously in one portion of 128 bits. Again we call the
time that is needed for this a cycle. In accordance with the solution shown in
Fig. 1.11 and Section 1.5.1 we assume again that the multiplication and the
shift also can be done in one such read cycle. In a balanced pipeline, then, the
circuit for the accumulation must be able to read and process one summand
in each (read) cycle also. The circuit in Fig. 1.13 displays a solution. Closely
following the summing matrix in Fig. 1.11 we assume there that the local
memory LA is organized in 17 rows of four 64 bit words.
In each cycle the multiplier supplies a product (summand) to be added
in the accumulation unit. Every such summand carries an exponent which
in our example consists of 12 bits. The 8 lower (least significant) bits of the
exponent determine the shift width. The row selection of the LA is obtained
by the 4 most significant bits of the exponent. This roughly corresponds to the
selection of the adding position in two steps by the process described in the
context of Fig. 1.2. The shift width and the row selection for the addition of
the product to the LA are known as soon as the exponent of the product has
been computed. Since the exponents of ai and bi consist of 11 bits only, the
result of their addition is available very quickly. So while the multiplication
of the mantissa is still being executed the shifter can already be switched and
the addresses for the LA words for the accumulation of the product ai x bi
can be selected.
After being shifted the summand reaches the accumulation unit. It is read
into the input register IR of this unit. The shifted summand now consists of
an exponent e, a sign s, and a mantissa m. The mantissa touches three
consecutive words of the LA, while the exponent is reduced to the four most
significant bits of the original exponent of the product.
Now the addition of the summand is performed in the accumulation unit
by the following three steps:

1. The local memory is addressed by the exponent e. The contents of the addressed part of the LA including the word which resolves the carry are
transferred to the register before summation RBS. This transfer moves
four words of 64 bits. The summand is also transferred from IR to the
corresponding section of RBS. In Fig. 1.13 this part of the RBS is denoted
by e', s' and m' respectively.
2. In the next cycle the addition or subtraction is executed in the add/sub-
tract unit according to the sign. The result is transferred to the register
after summation RAS. The adder/subtracter consists of 4 parallel adders
42 1. Fast and Accurate Vector Operations

[Figure: block diagram of the SPU; legend: IR = Input Register, RBS = Register Before Summation, RAS = Register After Summation]

Fig. 1.13. Block diagram of a SPU with short adder and local store for a 64 bit
data word and 128 bit data bus
of 64 bits which are working in carry select mode. The summand touches
three of these adders. Each one of these three adders can produce a carry.
The carries between two of these adjacent adders are absorbed by the
carry select addition. The fourth word is the carry word. It is selected by
the flag mechanism. During the addition step a 1 is added to or subtracted
from this word in carry select mode. If the addition produces a carry the
incremented/decremented word will be selected. If the addition does not
produce a carry this word remains unchanged. Simultaneously with the
incrementation/decrementation of the carry word a second set of flags
is set up which is copied into the flag word in the case that a carry is
generated. In Fig. 1.13 two possible locations of the summand after the
shift are indicated. The carry word is always the most significant word.
An incrementation/decrementation of this word never produces a carry.
Thus the adder/subtracter in Fig. 1.13 simply can be built as a parallel
carry select adder.
3. In the next cycle the computed sum is written back into the same four
memory cells of the LA to which the addition has been executed. Thus
only one address decoding is necessary for the read and write step. A
different bus called write data in Fig. 1.13 is used for this purpose.

In summary the addition consists of the typical three steps: 1. read the
summand, 2. perform the addition, and 3. write the sum back into the (local)
memory. Since a summand is delivered from the multiplier in each cycle, all
three phases must be active simultaneously, i. e. the addition itself must be
performed in a pipeline. This means that it must be possible to read from the
memory and to write into the memory in each cycle simultaneously. So two
different data paths have to be provided. This, however, is usual for register
memory.
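The following sketch models only the functional effect of one such accumulation step in software, assuming an LA stored as an array of 64 bit words. It ignores the pipelining and replaces the flag/carry-select mechanism by a simple ripple of the carry, so it illustrates the data flow rather than the hardware:

```cpp
#include <array>
#include <cstdint>

// Functional model of one accumulation step: a summand that has already been
// shifted occupies `words` consecutive 64 bit words of the LA starting at
// index `pos`; it is added there and the carry ripples into the words above.
// The real unit resolves this carry with flags and carry-select adders.
constexpr std::size_t kLaWords = 68;               // illustrative LA size
using LongAccumulator = std::array<std::uint64_t, kLaWords>;

void add_summand(LongAccumulator& la, const std::uint64_t* summand,
                 std::size_t words, std::size_t pos) {
    unsigned carry = 0;
    for (std::size_t i = 0; i < words; ++i) {      // add the summand words
        std::uint64_t old = la[pos + i];
        std::uint64_t t   = old + summand[i];
        unsigned c1 = (t < old) ? 1u : 0u;
        std::uint64_t n   = t + carry;
        unsigned c2 = (n < t) ? 1u : 0u;
        la[pos + i] = n;
        carry = c1 | c2;
    }
    for (std::size_t i = pos + words; carry && i < kLaWords; ++i) {
        ++la[i];                                   // propagate the carry upward
        carry = (la[i] == 0) ? 1u : 0u;
    }
}
```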
The pipeline for the addition consists of three steps. Pipeline conflicts are
quite possible. A pipeline conflict occurs if an incoming summand needs to
be added to a partner from the LA which is still being computed and not yet
available in the local memory. These situations can be detected by comparing
the exponents e, e' and e" of three successively incoming summands. In prin-
ciple all pipeline conflicts can be solved by the hardware. Here we discuss the
solution of two pipeline conflicts which with high probability are the most
frequent occurrences.
One conflict situation occurs if two consecutive products carry the same
exponent e. In this case the two summands touch the same three words of
the LA. Then the second summand is unable to read its partner for the
addition from the local memory because it is not yet available. This situation
is checked by the hardware where the exponents e and e' of two consecutive
summands are compared. If they are identical, the multiplexer blocks off the
process of reading from the local memory. Instead the sum which is just being
computed is directly written back into the register before summation RBS
via the multiplexer so that the second summand can immediately be added
without memory involvement.
Another possibility of a pipeline conflict occurs if from three successively
incoming summands the first one and the third one carry the same exponent.
Since the pipeline consists of three steps, the partner for the addition of the
third one then is not yet in the local memory but still in the register after
summation RAS. This situation is checked by the hardware also, see Fig.
1.13. There the two exponents e and e" of the two summands are compared.
In case of coincidence the multiplexer again suppresses the reading from the
local memory. Instead now, the sum of the former addition, the result of
which is still in RAS, is directly written back into the register RBS before
summation via the multiplexer. So also this pipeline conflict can be solved
by the hardware without memory involvement.
The case e = e' = e" is also possible. It would cause a reading conflict in
the multiplexer. The situation can be avoided by writing a dummy exponent
into e" or by reading from the add/subtract unit with higher priority.
The product that arrives at the accumulation unit touches three consec-
utive words of the LA. A more significant fourth word absorbs the possible
carry. The solution for the two pipeline conflicts just described works well, if
this fourth word is the next more significant word. A carry is not absorbed
by the fourth word if all its bits are one, or are all zero. The probability that
this is the case is 1 : 264 < 10- 18 . In the vast majority of instances this will
not be the case.
If it is the case the word which absorbs the carry is selected by the flag
mechanism and read into the most significant word of the RBS. The addition
step then again works well including the carry resolution. But difficulties
occur in both cases of a pipeline conflict. Fig. 1.14 displays a certain part of
the LA. The three words to which the addition is executed are denoted by 1,
2 and 3. The next more significant word is denoted by 4 and the word which
absorbs the carry by 5.

[Figure: consecutive LA words labelled 5 | 4 | 3 | 2 | 1]

Fig. 1.14. Carry propagation in case of a pipeline conflict

In case of a pipeline conflict with e = e' or e = e" the following addition
again touches the words 1, 2 and 3. Now the carry is absorbed either by word
4 or by word 5. Word 4 absorbs the carry if an addition is followed by an
addition or a subtraction followed by a subtraction. Word 5 absorbs the carry
if an addition is followed by a subtraction or vice versa. So the hardware has
to take care that either word 4 or 5 is read into the most significant word
of RBS depending on the operation which follows. The case that word 5 is
the carry word again needs no particular care. Word 5 is already in the most
significant position of the RBS. It is simply treated the same way as the
words 1, 2 and 3. In the other case word 4 has to be read from the LA into
RBS, simultaneously with the words 1, 2 and 3 from the add/subtract unit
or from RAS into RBS. In this case word 5 is written into the local memory
via the normal write path.
So far certain solutions for the possible pipeline conflicts e = e' and e = e"
have been discussed. These are the most frequent but not the only conflicts
that may occur. Similar difficulties appear if two or three successive incoming
summands overlap only partially. In this case the exponents e and e' and/or
e" differ by 1 or 2 so that also these situations can be detected by comparison
of the exponents. Another pipeline conflict appears if one of the two following
summands overlaps with a carry word. In these cases summands have to
be built up in parts from the adder/subtracter or RAS and the LA. Thus
hardware solutions for these situations are more complicated and costly. We
leave a detailed study of these situations to the reader/designer and offer the
following alternative: The accumulation pipeline consists of three steps only.
Instead of investing in a lot of hardware logic for rare situations of a pipeline
conflict it may be simpler and less expensive to stall the pipeline and delay
the accumulation by one or two cycles as needed. It should be mentioned
that other details as for instance the width of the adder that is used also
can heavily change the design aspects. A 128 instead of a 64 bit adder width
which was assumed here could simplify several details.
It was already mentioned that the probability for the carry to run further
than the fourth word is less than 10⁻¹⁸. A particular situation where this
happens occurs if the sum changes its sign from a positive to a negative
value or vice versa. This can happen frequently. To avoid a complicated carry
handling procedure in this case a small carry counter of perhaps three bits
could be appended to each 64 bit word of the LA. If these counters are not
zero at the end of the accumulation their contents have to be added to the
LA. For further details see [66], [45].
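One possible reading of this carry-counter idea, sketched in software; the interpretation that each counter records carries still to be applied to the next more significant word, as well as all sizes and names, are assumptions of this sketch:

```cpp
#include <array>
#include <cstdint>

// Sketch of the carry-counter idea: instead of letting a carry run beyond the
// carry word during the accumulation, a small signed counter per LA word
// records carries (or borrows) that still have to be applied to the next more
// significant word. At the end of the accumulation the counters are folded
// into the LA.
constexpr std::size_t kWords = 68;

struct AccumulatorWithCounters {
    std::array<std::uint64_t, kWords> word{};
    std::array<std::int8_t, kWords>   pending{};   // the small "3 bit" counters

    void flush_counters() {
        for (std::size_t i = 0; i + 1 < kWords; ++i) {
            long long c = pending[i];
            pending[i] = 0;
            if (c == 0) continue;
            std::uint64_t old = word[i + 1];
            word[i + 1] = old + static_cast<std::uint64_t>(c);   // add or subtract c
            if (c > 0 && word[i + 1] < old) pending[i + 1] += 1; // wrapped: new carry
            if (c < 0 && word[i + 1] > old) pending[i + 1] -= 1; // wrapped: new borrow
        }
    }
};
```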
As was pointed out in connection with the unit discussed in Section 1.3.2,
the addition of the summand actually can be carried out over 170 bits only.
Thus the shifter that is shown in Fig. 1.13 can be reduced to a 106 to 170 bits
shifter and the data path from the shifter to the input register IR as well as
the one to RBS also need to be 170 bits wide only. If this possible hardware
reduction is applied, the summand has to be expanded to the full 256 bits
when it is transferred to the adder/subtracter.

1.5.4 Short Adder with Local Memory on the Arithmetic Unit for
32 Bit Data Word (Solution B)

Now we consider again a 32 bit data word. We assume that two of these are
read simultaneously into the SPU in one read cycle. The LA is kept as local
memory in the SPU. We assume that the addition of a summand, which now
is a 48 bit product, can be done by three adders of 64 bits including the carry
resolution. Multiplication of the mantissas and addition of the exponents are
done in full accordance with the upper part of the circuits shown in Fig. 1.12.
The shift is executed similarly to the one in Fig. 1.12. We shall comment on it
later. The appropriately shifted product then reaches the accumulation unit.
A block diagram of this unit is shown in Fig. 1.15.
We assume that the multiplication and the shift can be performed in
one read cycle. Then, a shifted product reaches the input register IR of the
accumulation unit in each (read) cycle. The accumulation unit must add and
process one summand in each such cycle. The addition itself is performed by
the following three steps, see Fig. 1.15.

1. The product which is already in IR touches at most two successive 64 bit
words of the LA. These words are addressed by the exponent e of the
product. The contents of these two words of the LA and the word which
absorbs the carry are transferred from the LA to the register part r' of
RBS. This transfer moves three 64 bit words. The summand in IR is also
transferred to the corresponding section of RBS. This part is denoted by
e', s' and m' in Fig. 1.15.
2. In the next step the addition or subtraction is executed in the add/subtract
unit according to the sign. The result is transferred to the register RAS.
The adder/subtracter consists of three 64 bit adders which are working in
carry select mode. So the carries between the lower two of these adders are
absorbed by the carry select addition. The carry word is the most signifi-
cant one. An incrementation/decrementation of this word never produces
a carry. Thus the adder/subtracter in Fig. 1.15 can be built simply as a
parallel adder.
3. In the next cycle the computed sum is written back into the same three
memory cells of the LA to which the addition has been executed. The write
bus is used for this purpose. Thus only one address decoding is necessary
for the read and write step.

Since a summand is delivered from the multiplier in each cycle, all three
of these phases must be active simultaneously, i. e. the addition must be
performed in a pipeline. This means, in particular, that it must be possible
to read from the LA and to write into the LA simultaneously in each cycle.
Therefore, two different data paths have to be provided, as shown in Fig.
1.15.
The pipeline for the addition consists of three steps. Pipeline conflicts
again are quite possible. A pipeline conflict occurs if an incoming summand
needs to be added to a partner from the LA which is still being computed
and not yet available in the local memory. These situations can be detected
by comparing the exponents e, e' and e" of three successively incoming sum-
mands. In principle all pipeline conflicts can be solved by the hardware. We
discuss here the solution of two pipeline conflicts which with high probability
are the most frequent occurrences.
[Figure: block diagram of the accumulation unit; legend: IR = Input Register, RBS = Register Before Summation, RAS = Register After Summation]

Fig. 1.15. Block diagram for a SPU with short adder and local store for a 32 bit
data word and 64 bit data bus

One conflict situation occurs if two consecutive products carry the same
exponent e. In this case the two summands touch the same two words of the
LA. Then the second summand is unable to read its partner for the addition
from the LA because it is not yet available. This situation is checked by
the hardware where the exponents e and e' of two consecutive summands
are compared. In case of coincidence the process of reading from the LA is
blocked off. Instead the sum which is just being computed is directly written
back into the register RBS so that the second summand can immediately be
added without memory involvement.
Another possibility of a pipeline conflict occurs if from three successive
incoming summands the first one and the third one carry the same exponent.
Since the pipeline consists of three phases the partner for the addition of
the third one then is not yet in the LA but still in the register RAS. This
situation is checked by the hardware as well, see Fig. 1.15. There the two
exponents e and e" are compared. In case of coincidence again the process
of reading from the LA is blocked off. Instead now, the result of the former
addition, which is still in RAS, is directly written back into RBS. Then the
addition can be executed without LA involvement.
The case e = e' = e" is also possible. It would cause a conflict in the
selection unit which in Fig. 1.15 is shown directly beneath the LA. The
situation can be avoided by writing a dummy exponent into e" or by reading
from the add/subtract unit with higher priority. This solution is not shown
in Fig. 1.15.
The product that arrives at the accumulation unit touches two consecutive
words of the LA. A more significant third word absorbs the possible carry.
The solutions for the two pipeline conflicts work well if this third word is the
next more significant word of the LA. The probability that this is not the
case is less than 10⁻¹⁸. In the vast majority of instances this will be the case.
If it is not the case the word which absorbs the carry is selected by the flag
mechanism and read into the most significant word of the RBS. The addition
step then works well again including the carry resolution. But difficulties can
occur in both cases of a pipeline conflict. Fig. 1.16 shows a certain part of
the LA. The two words to which the addition is executed are denoted by 1
and 2. The next more significant word is denoted by 3 and the word which
absorbs the carry by 4.

[Figure: consecutive LA words labelled 4 | 3 | 2 | 1]

Fig. 1.16. Carry propagation in case of a pipeline conflict

In case of a pipeline conflict with e = e' or e = e" the following addition
again touches the words 1 and 2. Now the carry is absorbed either by the
word 3 or by the word 4. Word 3 absorbs the carry if an addition is followed
by an addition or a subtraction is followed by a subtraction. Word 4 absorbs
the carry if an addition is followed by a subtraction or vice versa. So the
hardware has to take care that either word 3 or word 4 is read into the most
significant word of RBS depending on the operation which follows. The case
that the word 4 is the carry word again needs no particular care. Word 4
is already in the most significant position of the RBS. It is simply treated
the same way as the words 1 and 2. In the other case word 3 has to be
read from the LA into RBS simultaneously with the words 1 and 2 from the
add/subtract unit or from RAS into RBS. In this case word 4 is written into
the local memory via the normal write path.
So far solutions for the two pipeline conflicts e = e' and e = e" have
been discussed. These are not the only conflicts that may occur. Similar
difficulties appear if two or three successively incoming summands overlap
only partially. In this case the exponents e and e' and/or e" differ by 1 so that
these situations can be detected by comparison of the exponents also. Another
pipeline conflict appears if one of the following two summands overlaps with
a carry word. In these cases summands have to be built up in parts from
the adder/subtracter or RAS and the LA. Thus hardware solutions for these
situations are more complicated and costly. We leave a detailed study of
these situations to the reader/designer and offer the following alternative.
The accumulation pipeline consists of three steps only. Instead of investing
in a lot of hardware logic for very rare situations of a pipeline conflict it may
be simpler and less expensive to stall the pipeline and delay the accumulation
by one or two cycles as needed.
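The stall alternative can be expressed as a simple predicate; the overlap criterion below (row exponents differing by at most 1) follows the 32 bit case described here, and all names are illustrative:

```cpp
// Stall decision: a bubble is inserted whenever the incoming summand overlaps
// one of the two summands still in flight and the simple forwarding paths
// (e = e' or e = e") do not apply.
inline bool must_stall(unsigned e, unsigned e1, unsigned e2) {
    auto overlaps = [](unsigned a, unsigned b) { return a <= b + 1 && b <= a + 1; };
    bool handled_by_forwarding = (e == e1) || (e == e2);
    bool conflict = overlaps(e, e1) || overlaps(e, e2);
    return conflict && !handled_by_forwarding;   // rare case: delay one or two cycles
}
```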
The product consists of 48 bits. So the summand never touches the 16 least
significant bits of word 1. The most significant third 64 bit word of the adder
is supposed to absorb the carry. It can be built as an incrementer/decrementer
by half adders. Thus, in comparison with Fig. 1.12, the shifter can be reduced
to a 48 to 112 bit shifter and the data path from the shifter to the input
register IR as well as the one to RBS also needs to be 112 bits wide only. If
this possibility is chosen, the summand has to be expanded to the full 192
bits when it is read into the adder/subtracter.
The circuits that have been discussed so far are based on the assumption
that the LA is organized in words of 64 bits and that the partial adder that
is used is also 64 bits wide. It should be mentioned that these assumptions,
although realistic, are nevertheless somewhat arbitrary and that other choices
are quite possible and may lead to simpler or better solutions. The LA could
as well be organized in words of 128 or only 32 bits. The width of the partial
adder could also be 128 or 32 bits. All these possibilities allow interesting
solutions for the different cases that have been discussed in this paper. We
leave it to the reader to play with these combinations and select the one
which fits best to a given hardware environment. With increasing word size
the probability for a pipeline conflict which has not been discussed so far
decreases.

1.6 Hardware Accumulation Window


So far it has been assumed in this paper that the SPU is incorporated as
an integral part of the arithmetic unit of the processor. Now we discuss the
question of what can be done if not enough register space for the LA is
available on the processor.
The final result of a scalar product computation is assumed to be a
floating-point number with an exponent in the range e1 ≤ e ≤ e2. If this
is not the case, the problem has to be scaled. During the computation of the
scalar product, however, summands with an exponent outside of this range
may occur. The remaining computation then has to cancel all the digits
outside of the range e1 ≤ e ≤ e2. So in a normal scalar product computa-
tion, the register space outside this range will be used less frequently. It was
already mentioned earlier in this paper that the conclusion should not be
drawn from this consideration that the register size can be restricted to the
single exponent range in order to save some silicon area. This would require
the installation of complicated exception handling routines in software or in
hardware. The latter may finally require as much silicon. A software solution
certainly is much slower. The hardware requirement for the LA in case of
standard arithmetics is modest and the necessary register space really should
be invested.
However, the memory space for the LA on the arithmetic unit grows with
the exponent range of the data format. If this range is extremely large, as
for instance in case of an extended precision floating-point format, then only
an inner part of the LA can be supported by hardware. We call this part of
the LA a Hardware Accumulation Window (HAW). See Fig. 1.17. The outer
parts of this window must then be handled in software. They are probably
needed less often.

[Figure: the HAW covers an inner part of the LA; the outer parts form the software LA]

Fig. 1.17. Hardware Accumulation Window (HAW)

There are still other reasons that support the development of techniques
for the computation of the accurate scalar product using a HAW. Many
conventional computers on the market do not provide enough register space
to represent the full LA on the CPU. Then a HAW is one choice which allows
a fast and correct computation of the scalar product in many cases.
Another possibility is to place the LA in the user memory, i. e. in the data
cache. In this case only the start address of the LA and the flag bits are put
into (fixed) registers of the general purpose register set of the computer. This
solution has the advantage that only a few registers are needed and that a
longer accumulator window or even the full LA can be provided. This reduces
the need to handle exceptions. The disadvantage of this solution is that for
each accumulation step, four memory words must be read and written in
addition to the two operand loads. So the scalar product computation speed
is limited by the data cache to processor transfer bandwidth and speed. If
the full long accumulator is provided this is a very natural solution. It has
been realized on several IBM, SIEMENS and HITACHI computers of the
/370 architecture in the 1980s [109,110,112,119].
A faster solution certainly is obtained for many applications with a HAW
in the general purpose register set of the processor. Here only a part of the
LA is present in hardware. Overflows and underflows of this window have to
be handled by software. A full LA for the data format double precision of the
IEEE-arithmetic standard 754 requires 4288 bits or 67 words of 64 bits. We
assume here that only 10 of these words are located in the general purpose
register set.
Such a window covers the full LA that is needed for a scalar product com-
putation in case of the data format single precision of the IEEE-arithmetic
standard 754. It also allows a correct computation of scalar products in the
case of the long data format of the /370 architecture as long as no under-
or overflows occur. In this case 64 + 28 + 63 = 155 hexadecimal digits or
620 bits are required. With a HAW of 640 bits all scalar products that do
not cause an under- or overflow could have been correctly computed on these
machines. This architecture was successfully used and even dominated the
market for more than 20 years. This example shows that even if a HAW of
only 640 bits is available, the vast majority of scalar products will execute
on fast hardware.
Of course, even if only a HAW is available, all scalar products should be
computed correctly. Any operation that over- or underflows the HAW must
be completed in software. This requires a complete software implementation
of the LA, i. e. a variable of type dotprecision. All additions that do not fit
into the HAW must be executed in software into this dotprecision variable.
There are three situations where the HAW can not correctly accumulate
the product:
• the exponent of the product is so high that the product does not (com-
pletely) fit into the HAW. Then the product is added in software to the
dotprecision variable.
• the exponent of the product is so low that the product does not (com-
pletely) fit into the HAW. Then the product is added in software to the
dotprecision variable.
• the product fits into the HAW, but its accumulation causes a carry to be
propagated outside the range of the HAW. In this case the product is added
into the HAW. The carry must be added in software to the dotprecision
variable.
If at the end of the accumulation the contents of the software accumulator
are non zero, the contents of the HAW must be added to the software accu-
mulator to obtain the correct value of the scalar product. Then a rounding
can be performed if required. If at the end of the accumulation the contents
of the software accumulator are zero, the HAW contains the correct value of
the scalar product and a rounded value can be obtained from it.
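A sketch of the resulting control flow for one accumulation step; the types and helper functions below are placeholders for the hardware window and the software routines described in this section, not an existing interface:

```cpp
// Control flow for accumulating one exact product when only a hardware
// accumulation window is available. Haw, SoftLa and the helper functions are
// placeholders, not an existing API.
struct Product { /* exact double-length product a_i * b_i */ };
struct Haw     { /* hardware accumulation window        */ };
struct SoftLa  { /* full long accumulator in software   */ };

bool fits_into_haw(const Haw&, const Product&);     // exponent inside the window?
bool add_to_haw(Haw&, const Product&);              // true if a carry left the window
void add_to_software_la(SoftLa&, const Product&);
void add_carry_to_software_la(SoftLa&, const Haw&); // carry at the window's upper end

void accumulate(Haw& haw, SoftLa& soft, const Product& p) {
    if (!fits_into_haw(haw, p)) {        // exponent too high or too low for the HAW
        add_to_software_la(soft, p);     // complete this operation in software
        return;
    }
    bool carry_out = add_to_haw(haw, p); // normal, fast case
    if (carry_out)
        add_carry_to_software_la(soft, haw);
}
```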
Thus, in general, a software controlled full LA supplements a HAW. The
software routines must be able to perform the following functions:

• clear the software LA. This routine must be called during the initialization
of the HAW. Ideally, this routine only sets a flag. The actual clearing is
only done if the software LA is needed.
• add or subtract a product to/from the software LA.
• add or subtract a carry or borrow to/from the software LA at the appropriate
digit position.
• add the HAW to the software LA. This is required to produce the final re-
sult when both the HAW and the software LA were used. Then a rounding
can be performed.
• round the software LA to a floating-point number.
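An interface sketch of such a software LA with exactly these routines (the member names are illustrative and not the XSC library interface):

```cpp
#include <cstdint>
#include <vector>

// Interface sketch of the software long accumulator listed above.
class SoftwareLongAccumulator {
public:
    void clear();                                   // ideally only sets a flag
    void add_product(double a, double b, bool subtract = false);
    void add_carry(long long carry, std::size_t digit_position);
    void add_haw(const std::uint64_t* haw_words, std::size_t n,
                 std::size_t position);             // fold the HAW into the LA
    double round_to_double() const;                 // final rounding
private:
    std::vector<std::uint64_t> words_;              // fixed-point digits of the LA
    bool needs_clearing_ = false;                   // deferred clearing
};
```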
With this software support scalar products can be computed correctly
using a HAW at the cost of a substantial software overhead and a considerable
time penalty for products that fall outside the range of the HAW.
An alternative to the HAW-software environment just described is to
discard the products that underflow the HAW. A counter variable is used
to count the number of discarded products. If a number of products were
discarded, the last bits of the HAW must be considered invalid. A valid
rounded result can be generated by hardware if these bits are not needed.
If this procedure fails to produce a useful answer the whole accumulation is
repeated in software using a full LA.
A 640 bit HAW seems to be the shortest satisfactory hardware window. If
this much register space is not available, a software implementation probably
is the best solution.
If a shorter HAW must be implemented, then it should be a movable
window. This can be represented by an exponent register associated with the
hardware window. At the beginning of an accumulation, the exponent regis-
ter is set so that the window covers the least significant portion of the LA.
Whenever a product would cause the window to overflow, its exponent tag is
adjusted, i. e. the window moves to the left, so that the product fits into the
window. Products that would cause an underflow are counted and otherwise
ignored. The rounding instruction checks whether enough significant digits
are left to produce a correctly rounded result or whether too much cancel-
lation did occur. In the latter case it is up to the user to accept the inexact
result or to repeat the whole accumulation in software using a full LA.
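A sketch of the movable-window bookkeeping just described; positions are counted in LA bit positions, and all names and the window width are illustrative:

```cpp
// Bookkeeping sketch for a movable hardware accumulation window. The window
// covers `window_bits` consecutive bit positions of the LA starting at
// `window_base` (its exponent tag). Products below the window are only
// counted; products above it move the window up, giving up the lowest bits,
// whose validity is checked later by the rounding instruction.
struct MovableHaw {
    unsigned window_base = 0;          // exponent tag of the window
    unsigned window_bits = 256;        // illustrative window width
    unsigned long long discarded = 0;  // products ignored below the window

    void place(unsigned lo, unsigned hi) {  // product occupies LA bits [lo, hi)
        if (hi <= window_base) { ++discarded; return; }   // underflow: count only
        if (hi > window_base + window_bits)               // overflow: move window up,
            window_base = hi - window_bits;               // low bits are given up
        // ... accumulate the product into the window here ...
        (void)lo;   // lo would be used by the actual accumulation
    }
};
```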
Using this technique a HAW as short as 256 bits could be used to perform
rounded scalar product computation and quadruple precision arithmetic.
However, it would not be possible to perform many other nice and useful
applications of the optimal scalar product with this type of scalar product
hardware as for instance a long real arithmetic.
The software overhead caused by the reduction of the full width of the LA
to a HAW represents a trade off between hardware expenditure and runtime.
With the accurate scalar product, operators for multiple precision arith-
metic, including multiple precision interval arithmetic, can easily be provided.
This enables the user to use higher precision operations in numerically critical
parts of a computation. Experience shows that if one runs out of precision in
a certain problem class one often runs out of double or extended precision very
soon as well. It is preferable and simpler, therefore, to provide the principles
for enlarging the precision than simply providing any fixed higher precision.
To allow fast execution of a number of multiple precision arithmetics the
HAW should not be too small.

1.7 Theoretical Foundation of Advanced Computer
Arithmetic and Shortcomings of Existing Processors
and Standards
Arithmetic is the basis of mathematics. Advanced computer arithmetic ex-
pands the arithmetic and mathematical capability of the digital computer in
the most natural way. Instead of reducing all calculations to the four elemen-
tary operations for floating-point numbers, advanced computer arithmetic
provides twelve fundamental data types or mathematical spaces with opera-
tions of highest accuracy in a computing environment.
Besides the real numbers, the complex numbers form the basis of analy-
sis. For computations with guarantees one needs the intervals over the real
and complex numbers as well. The intervals bring the continuum onto the
computer. An interval between two floating-point bounds represents the con-
tinuous set of real numbers between these two bounds.
The twelve fundamental data types or mathematical spaces consist of the
four basic data types real, complex, interval and complex interval as well
as the vectors and matrices over these types. See Fig. 1.18 and Fig. 1.19.
Arithmetic operations in the computer representable subsets of these spaces
are defined by a general mapping principle which is called a semimorphism.
These arithmetic operations are distinctly different from the customary ones
in these spaces which are based on elementary floating-point arithmetic.
If M is any one of these twelve data types (or mathematical spaces) and
N is its computer representable subset, then for every arithmetic operation
∘ in M, a corresponding computer operation ⊚ in N is defined by

(RG)  a ⊚ b := ○(a ∘ b)  for all a, b ∈ N and all operations ∘ in M,

where ○ : M → N is a mapping from M onto N which is called a rounding
if it has the following properties:

(R1)  ○a = a  for all a ∈ N  (projection).
(R2)  a ≤ b ⇒ ○a ≤ ○b  for a, b ∈ M  (monotonicity).

The concept of semimorphism requires additionally that the rounding is
antisymmetric, i. e. that it has the property

(R3)  ○(−a) = −○(a)  for all a ∈ M  (antisymmetry).

For the interval spaces among the twelve basic data types - the intervals
over the real and complex numbers as well as the intervals over the real and
complex vectors and matrices - the order relation in (R2) is the subset rela-
tion ⊆. A rounding from any interval set M onto its computer representable
subset N is defined by properties (R1), (R2) (with ≤ replaced by ⊆), plus
the additional property
(R4)  a ⊆ ○a  for all a ∈ M  (inclusion).


These interval roundings are also antisymmetric, that is, they satisfy prop-
erty (R3) [60,62].
Additional important roundings from the real numbers onto the floating-
point numbers are the monotone downwardly and upwardly directed roundings
with the property

(R4)  ▽a ≤ a  resp.  a ≤ △a  for all a ∈ M  (directed).

These directed roundings are uniquely defined by (R1), (R2) and (R4), see
[60,62]. Arithmetic operations are also defined by (RG) with the roundings
▽ and △.
With the five rules (RG) and (R1, 2, 3, 4), a large number of arithmetic
operations is defined in the computer representable subsets of the twelve fun-
damental data types or mathematical spaces. (RG) means that every com-
puter operation should be performed in such a way that it produces the
same result as if the mathematically correct operation were first performed
in the basic space M and the exact result then rounded into the computer
representable subset N. In contrast to the traditional approximation of the
arithmetic operations in the product spaces by floating-point arithmetic, all
operations with the properties (RG), (R1) and (R2) are optimal in the sense
that there is no better computer representable approximation to the true re-
sult (with respect to the prescribed rounding). In other words, between the
correct and the computed result of an operation there is no other element
of the corresponding computer representable subset. This can easily be seen:
Let a, b ∈ N, and α ∈ N the greatest lower and β ∈ N the least upper bound
of the correct result a ∘ b in M, i. e.

α ≤ a ∘ b ≤ β,

then

○α = α ≤ ○(a ∘ b) = a ⊚ b ≤ ○β = β,        (1.1)

where the individual relations hold by (R1), (R2), (RG), (R2) and (R1), respectively.
Thus, all semimorphic computer operations are of 1 ulp (unit in the last
place) accuracy. 1/2 ulp accuracy is achieved in the case of rounding to near-
est. In the product spaces the order relation is defined componentwise. So in
the product spaces property (1.1) holds for every component.
Figure 1.18 shows a table of the twelve basic arithmetic data types and
corresponding operators as they are provided by the programming language
PASCAL-XSC [46,47,49,67,68,108]. All data types and operators are prede-
fined and available in the language. The operations can be called by the operator
symbols shown in the table. An arithmetic operator followed by a less or
greater symbol denotes an operation with rounding downwards or upwards,
respectively. The operator +* takes the interval hull of two elements, **
means intersection. Also all outer operations that occur in Fig. 1.18 (scalar
times vector, matrix times vector, etc.) are defined by the five properties
(RG), (R1, 2, 3, 4), whatever applies. A count of all inner and outer pre-
defined operations in the figure leads to a number of about 600 arithmetic
operations.

[Table: operand types integer, real, complex, interval, cinterval, rvector, cvector,
ivector, civector, rmatrix, cmatrix, imatrix and cimatrix, with the predefined
monadic and dyadic operator symbols +, -, *, /, the directed-rounding variants
+<, +>, -<, ->, *<, *>, /<, />, the interval hull +* and the intersection **]

Fig. 1.18. Predefined arithmetic data types and operators of PASCAL-XSC.

Figure 1.19 lists the same data types in their usual mathematical notation.
There ℝ denotes the real and ℂ the complex numbers. A heading letter V, M
and I denotes vectors, matrices and intervals, respectively. R stands for the
set of floating-point numbers and D for any set of higher precision floating-
point numbers. If M is any set, ℙM denotes the power set, which is the set
of all subsets of M. For any operation ∘ in M a corresponding operation ∘
in ℙM is defined by A ∘ B := {a ∘ b | a ∈ A ∧ b ∈ B} for all A, B ∈ ℙM.
For each set-subset pair in Fig. 1.19, arithmetic in the subset is defined by
semimorphism. These operations are different in general from those which are
performed in the product spaces if only elementary floating-point arithmetic
is furnished on the computer. Semimorphism defines operations in a subset
N of a set M directly by making use of the operations in M. It makes a
direct link between an operation in M and its approximation in the subset
N. For instance, the operations in MCR (see Fig. 1.19) are directly defined
by the operations in Mℂ, and not in a roundabout way via ℂ, ℝ, R, CR,
and MCR as it would have to be done by using elementary floating-point
arithmetic only.

ℝ  ⊇ D   ⊇ R
Vℝ ⊇ VD  ⊇ VR
Mℝ ⊇ MD  ⊇ MR

ℙℝ  ⊇ Iℝ  ⊇ ID  ⊇ IR
ℙVℝ ⊇ IVℝ ⊇ IVD ⊇ IVR
ℙMℝ ⊇ IMℝ ⊇ IMD ⊇ IMR

ℂ  ⊇ CD  ⊇ CR
Vℂ ⊇ VCD ⊇ VCR
Mℂ ⊇ MCD ⊇ MCR

ℙℂ  ⊇ Iℂ  ⊇ ICD  ⊇ ICR
ℙVℂ ⊇ IVℂ ⊇ IVCD ⊇ IVCR
ℙMℂ ⊇ IMℂ ⊇ IMCD ⊇ IMCR

Fig. 1.19. Table of the spaces occurring in numerical computations.

The properties of a semimorphism can be derived as necessary condi-
tions for a homomorphism between ordered algebraic structures [60,62]. It
is easy to see that repetition of semimorphism is again a semimorphism. A
careful analysis of the requirements of semimorphism is given in [60,62]. The
resulting algebraic and order structure are studied there under the mapping
properties (RG) and (R1, 2, 3, 4). Many properties of both the order structure
and the algebraic structure are invariant under a semimorphism. Because of
(R2) with respect to ≤ or ⊆ the order structure is not changed if we move
from a set into a subset in any row of Fig. 1.19, while the algebraic structure
is considerably weakened. The concept of semimorphism and its explicit five
rules (RG), (R1, 2, 3, 4) are used as an axiomatic definition of computer
arithmetic in the XSC-languages [14,41,46-49,56,67,68,106-108,112].
In the theory of computer arithmetic it is ultimately shown that all arith-
metic operations of the twelve fundamental numerical data types of Fig. 1.18
or spaces of Fig. 1.19 can be provided in a higher programming language by
a modular technique, if on a low level, preferably in hardware, 15 fundamen-
tal operations are available: the five operations +, −, ×, /, ·, each one with
the three roundings ○, ▽, △. Here · means the scalar product of two vectors,
○ is a monotone, antisymmetric rounding, e. g. rounding to nearest, and ▽
and △ are the monotone downwardly and upwardly directed roundings from
the real numbers into the floating-point numbers. All 15 operations, the five
operations ∘ ∈ {+, −, ×, /, ·} with each of the three roundings (Fig. 1.20), are
defined by (RG). In case of the scalar product, a and b are vectors a = (ai),
b = (bi) with any finite number of components.
+, −, ×, /, ·  with the rounding ○ :   a ⊡ b := ○( Σ_{i=1}^{n} ai × bi ),
+, −, ×, /, ·  with the rounding ▽ :            ▽( Σ_{i=1}^{n} ai × bi ),
+, −, ×, /, ·  with the rounding △ :            △( Σ_{i=1}^{n} ai × bi ).

Fig. 1.20. The fifteen fundamental operations for advanced computer arithmetic.

The IEEE arithmetic standards 754 and 854 offer 12 of these operations:
the operations ∘ ∈ {+, −, ×, /} with each of the three roundings ○, ▽, △.
These standards also prescribe specific data formats. A general theory of
computer arithmetic is not bound to these data formats. By adding just
three more operations, the optimal scalar products with the roundings
○, ▽, △, all operations in the usual product spaces of numerical mathematics
can be performed with 1 or 1/2 ulp accuracy in each component.
Remark 1: With this information it seems to be relatively easy to provide
advanced computer arithmetic on processors which offer the IEEE arithmetic
standard 754. The standard seems to be a step in the right direction. All that
is additionally needed are the three optimal scalar products with the roundings ○, ▽ and △. If
they are not supported by the computer hardware they could be simulated.
One possibility to simulate these operations certainly would be to place the
LA into the user memory, i. e. in the data cache. This possibility was discussed
in Section 1.6.
However, a closer look into the subject reveals severe difficulties and dis-
advantages which result in unnecessary performance penalties. Thus at a
place where an increase in speed is to be expected, a severe loss of speed
results instead.
A first severe drawback comes from the fact that processors that provide
IEEE arithmetic separate the rounding from the operation. First the round-
ing mode has to be set. Then an arithmetic operation can be performed. In a
conventional floating-point computation this does not cause any difficulties.
The rounding mode is set only once. Then a large number of arithmetic oper-
ations is performed with this rounding mode. However, when interval arith-
metic is performed, the rounding mode has to be switched very frequently.
In the computer the lower bound of the result of every interval operation
has to be rounded downwards and the upper bound rounded upwards. Thus
the rounding mode has to be set for every arithmetic operation. If setting
the rounding mode and the arithmetic operation are equally fast, this slows
down interval arithmetic unnecessarily by a factor of two in comparison to a
conventional floating-point arithmetic. On the Pentium processor setting the
rounding mode takes three cycles, the following operation only one!! Thus
an interval operation is 8 times slower than the corresponding floating-point
operation. On workstations the situation is even worse in general. The round-
ing should be part of the arithmetic operation as required by the postulate
(RG) of the axiomatic definition of (advanced) computer arithmetic. Every
one of the rounded operations ∘ ∈ {+, −, ×, /} with any of the roundings
○, ▽, △ should be executed in a single cycle! The rounding must be an
integral part of the operation.
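For illustration, this is what an interval addition looks like when the rounding mode is processor state that has to be switched around every operation (a plain C++/<cfenv> sketch, not the XSC implementation):

```cpp
#include <cfenv>
#include <utility>

#pragma STDC FENV_ACCESS ON

// An interval addition on an IEEE-style processor: the rounding mode is
// processor state, so it has to be switched before each bound is computed.
std::pair<double, double> interval_add(double alo, double ahi,
                                       double blo, double bhi) {
    std::fesetround(FE_DOWNWARD);      // first mode switch
    double lo = alo + blo;             // lower bound, rounded downwards
    std::fesetround(FE_UPWARD);        // second mode switch
    double hi = ahi + bhi;             // upper bound, rounded upwards
    std::fesetround(FE_TONEAREST);     // restore the default
    return {lo, hi};
}
```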
A second severe drawback comes from the fact that all the commercial
processors that perform IEEE-arithmetic in case of multiplication only deliver
a rounded product to the outside world. Computation of an accurate scalar
product requires products of the full double length. These products have to be
simulated from outside on the processor. This slows down the multiplication
by a factor of 10 in comparison to a rounded hardware multiplication. In
a software simulation of the accurate scalar product the products of double
length then have to be accumulated into the LA. This process is again slower
by a factor of 5 in comparison to a (possibly wrong) hardware accumulation
of products in floating-point arithmetic. Thus in summary a factor of at least
50 for the runtime is the trade-off for an accurate computation of the scalar
product on existing processors. This is too much to be easily accepted by the
user. Again at a place where an increase in speed by a factor of between two
and four is to be expected if the scalar product is supported by hardware, a
severe loss of speed is obtained by processors which have not been designed
for accurate computation of the scalar product.
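A common software substitute for the missing double-length product is the following "two product" trick, which recovers the exact product as an unevaluated sum of two floating-point numbers, assuming a correctly rounded fused multiply-add is available; it illustrates the kind of extra work a software simulation has to do:

```cpp
#include <cmath>
#include <utility>

// Recover the exact product of two doubles as the unevaluated sum hi + lo,
// assuming std::fma is correctly rounded. Barring over- and underflow,
// a * b == hi + lo holds exactly.
std::pair<double, double> two_product(double a, double b) {
    double hi = a * b;                 // rounded product
    double lo = std::fma(a, b, -hi);   // exact error term a*b - hi
    return {hi, lo};
}
```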
A third severe drawback is the fact that no reasonable interface to the pro-
gramming languages is required by existing computer arithmetic standards.
The majority of operations shown in Fig. 1.18 can be provided in a pro-
gramming language which allows operator overloading. Operator overloading,
however, is not enough to call the twelve operations ∘ ∈ {+, −, ×, /} with the
roundings ○, ▽, △, which are provided by all IEEE-arithmetic processors, in a
higher programming language. A general operator concept is necessary for ease
of programming (three real operations for each of +, −, ×, /). This solution has
been chosen in PASCAL-XSC. In C-XSC, which has been developed as a C++
class library, the 8 operators with the directed roundings ▽ and △ for
+, −, ×, / are hidden in the interval operations and not openly available.
This is necessary because C++ does not allow three different operators for
addition, subtraction, multiplication and division for the data type real.
Computer arithmetic is an integral part of all programming languages.
The quality of the arithmetic operations should be an integral part of the
definition of all programming languages. This can easily be done. All oper-
ations that are shown in Fig. 1.18 can be defined by the five simple rules
(RG) and (R1, 2, 3, 4). In particular the eight operations Wand £, with
o E {+, -, x, /} are defined by (RG), (Rl), (R2) and (R4). All interval oper-
ations are defined by (RG), (R1), (R2), (R3) and (R4). All other operations
that appear in Fig. 1.18 can be defined by (RG), (R1), (R2) and (R3) with the
additional information whether rounding to nearest, towards infinity or to-
wards zero is required. A precise definition of advanced computer arithmetic
thus turns out to be short and simple.
IEEE-arithmetic has been developed as a standard for microprocessors
in the early eighties at a time when the microprocessor was the 8086. Since
that time the speed of microprocessors has been increased by several magni-
tudes. IEEE-arithmetic is now even provided and used by super computers,
the speed of which is faster again by several magnitudes. All this is no longer
in balance. With respect to arithmetic many manufacturers believe that real-
ization of the IEEE-arithmetic standard is all that is necessary to do. In this
way the existing standards prove to be a great hindrance to further progress.
Advances in computer technology are now so profound that the arithmetic
capability and repertoire of computers should be expanded to prepare the
digital computer for the computations of the next century. The provision of
Advanced Computer Arithmetic is the most natural way to do this.
Remark 2: A vector arithmetic coprocessor chip XPA 3233 for the PC has
been developed in a CMOS 0.8 µm VLSI gate array technology at the au-
thor's Institute in 1993/94. VHDL and COMPASS design tools were used.
For design details see [8,43] and in particular [9]. The chip is connected with
the PC via the PCI-bus. The PCI- and EMC-interface are integrated on chip.
In its time the chip computed the accurate scalar product between two and
four times faster than the PC computed an approximation in floating-point
arithmetic. With increasing clock rate of the PC the PCI-bus turned out to
be a severe bottleneck. To keep up with the increased speed the SPU must be inte-
grated into the arithmetic logical unit of the processor and interconnected by
an internal bus system.
The chip, see Fig. 1.21, realizes the SPU that has been discussed in Section
1.3.1; 207,000 transistors are needed. About 30% of the transistors and the
silicon area are used for the local memory and the flag registers with the
carry resolution logic. The remaining 70% of the silicon area is needed for
the PCI/EMC-interface and the chip's own multiplier, shifter, adder and
rounding unit. All these units would be superfluous if the SPU were integrated
into the arithmetic unit of the processor. A multiplier, shifter, adder and
rounding unit are already there. Everything just needs to be arranged a
little differently. Thus finally the SPU requires fewer transistors and less
silicon area than is needed for the exception handling of the IEEE-arithmetic
standard. Logically the SPU is much more regular and simpler. With it a
large number of exceptions that can occur in a conventional floating-point
computation are avoided.
Testing of the coprocessor XPA 3233 was easy. XSC-languages had been
available and used since 1980. There, an identical software simulation of the
accurate scalar product had been implemented. Additionally a large number
of problem solving routines had been developed and collected in the toolbox
volumes [34, 35, 57]. All that had to be done was to change the PASCAL-XSC
compiler a little to call the hardware chip instead of its software simulation.
Surprisingly 40% of the chips on the first wafer were correct and, probably
due to the high standard of the implementors and their familiarity with the
theoretical background, with PASCAL-XSC and the toolbox routines no re-
design was necessary. The chips produced identical results to the software
simulation.
Modern computer technology can provide millions of transistors on a sin-
gle chip. This allows solutions to put into the computer hardware which
even an experienced computer user is totally unaware of. Due to the insuf-
ficient knowledge and familiarity with the technology, the design tools and
implementation techniques, obvious and easy solutions are not demanded by
mathematicians. The engineer on the other hand, who is familiar with these
techniques, is not aware of the consequences for mathematics [37].
Remark 3: In addition to the numerical data types and operators displayed
in Fig. 1.18, the XSC-languages provide an array type staggered (staggered
precision) [89,90] for multiple precision data. A variable of type staggered
consists of an array of variables of the type of its components. Components
of the staggered type can be of type real or of type interval. The value of a
variable of type staggered is the sum of its components. Addition and sub-
traction of such multiple precision data can easily be performed in the LA.
Multiplication of two variables of this type can be computed easily and fast by
the accurate scalar product. Division is performed iteratively. The multiple
precision data type staggered is controlled by a global variable called stagprec.
If stagprec is 1, the staggered type is identical to its component type. If, for
instance, stagprec is 4 each variable of this type consists of an array of four
variables of its component type. Again its value is the sum of its components.
The global variable stagprec can be increased or decreased at any place in a
program. This enables the user to use higher precision data and operations
in numerically critical parts of his computation. It helps to increase software
reliability. The elementary functions for the type staggered are also available
in the XSC-languages for the component types real and interval [22,53]. In
the case that stagprec is 2, a data type is encountered which occasionally is
denoted as double-double or quadruple precision.
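A minimal sketch of the staggered idea (the helper exact_dot, standing for an accurate scalar product evaluated in the LA, and the global stagprec are assumptions of this sketch, not the XSC interface):

```cpp
#include <vector>

// Sketch of the staggered idea: the value of a variable is the exact sum of
// its floating-point components. exact_dot stands for an accurate scalar
// product evaluated in a long accumulator and rounded back into `precision`
// components; it is an assumed helper, not implemented here.
extern int stagprec;                       // global precision (components per value)

struct Staggered {
    std::vector<double> part;              // value == part[0] + part[1] + ...
};

Staggered exact_dot(const std::vector<double>& a,
                    const std::vector<double>& b, int precision);

Staggered add(const Staggered& x, const Staggered& y) {
    std::vector<double> a, b;              // sum of all components, written as a
    for (double xi : x.part) { a.push_back(xi); b.push_back(1.0); }   // dot product
    for (double yi : y.part) { a.push_back(yi); b.push_back(1.0); }
    return exact_dot(a, b, stagprec);
}

Staggered mul(const Staggered& x, const Staggered& y) {
    std::vector<double> a, b;              // all component products x_i * y_j
    for (double xi : x.part)
        for (double yj : y.part) { a.push_back(xi); b.push_back(yj); }
    return exact_dot(a, b, stagprec);      // summed exactly, then rounded
}
```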
Fig. 1.21. Functional units, chip and board of the vector arithmetic coprocessor
XPA 3233
Fig. 1.22. MADAS, model 20 BTZG, by H.W. Egli, Zurich, Switzerland
(Multiplication, Automatic Division, Addition, Subtraction).
0: Multiplication Register, I: Main or Product Register,
II: Counter or Dividend Register, III: Keyboard or Entry Register,
IV: Accumulation Register.

Fig. 1.23. MONROE, model MONROMATIC ASMD (1956),
by Monroe Calculating Machine Company, Inc., Orange, New Jersey, USA.
Addition, Subtraction, Multiplication, Division, Multiply and Accumulate.
Bibliography and Related Literature

1. Adams, E.; Kulisch, U. (eds.): Scientific Computing with Automatic Re-
sult Verification. I. Language and Programming Support for Verified Sci-
entific Computation, II. Enclosure Methods and Algorithms with Automatic
Result Verification, III. Applications in the Engineering Sciences. Academic
Press, San Diego, 1993 (ISBN 0-12-044210-8).
2. Albrecht, R.; Kulisch, U. (Eds.): Grundlagen der Computerarithmetik.
Computing Supplementum 1. Springer-Verlag, Wien / New York, 1977.
3. Albrecht, R.; Alefeld, G.; Stetter, H.J. (Eds.): Validation Numerics - The-
ory and Applications. Computing Supplementum 9, Springer-Verlag, Wien /
New York, 1993.
4. Alefeld, G.; Herzberger, J.: Einführung in die Intervallrechnung. Bibli-
ographisches Institut (Reihe Informatik, Nr. 12), Mannheim / Wien / Zürich,
1974 (ISBN 3-411-01466-0).
5. Alefeld, G.; Herzberger, J.: An Introduction to Interval Computations.
Academic Press, New York, 1983 (ISBN 0-12-049820-0).
6. Apostolatos, N.; Kulisch, U.; Krawczyk, R.; Lortz, B.; Nickel, K.; Wippermann,
H.-W.: The Algorithmic Language Triplex-ALGOL 60. Numerische Mathe-
matik 11, pp. 175-180, 1968.
7. Baumhof, Ch.: Behavioural Description of a Scalar Product Unit. Univer-
sität Karlsruhe, ESPRIT Project OMI/HORN, Deliverable Report D1.2/2, Dec.
1992.
8. Baumhof, Ch.: A New VLSI Vector Arithmetic Coprocessor for the PC. In
[105, Vol. 12, pp. 210-215], 1995.
9. Baumhof, Ch.: Ein Vektorarithmetik-Koprozessor in VLSI-Technik zur Un-
terstützung des Wissenschaftlichen Rechnens. Dissertation, Universität Karl-
sruhe, 1996.
10. Baumhof, Ch.; Bohlender, G.: A VLSI Vector Arithmetic Coprocessor for the
PC. Proceedings of WAI'96 in Recife/Brasil, RITA (Revista de Informatica
Teorica e Aplicada), Extra Edition, October 1996.
11. De Beauclair, W.: Rechnen mit Maschinen. Vieweg, Braunschweig, 1968.
12. Bierlox, N.: Ein VHDL Koprozessor für das exakte Skalarprodukt. Dissertation,
Universität Karlsruhe, 2002.
13. Bleher, J. H.; Kulisch, U.; Metzger, M.; Rump, S. M.; Ullrich, Ch.; Walter, W.:
FORTRAN-SC: A Study of a FORTRAN Extension for Engineering/Scientific
Computation with Access to ACRITH. Computing 39, pp. 93-110, Nov. 1987.
14. Blomquist, F.: PASCAL-XSC, BCD-Version 1.0, Benutzerhandbuch
für das dezimale Laufzeitsystem. Universität Karlsruhe, Institut für Ange-
wandte Mathematik, 1997.
15. Bohlender, G.: Floating-Point Computation of Functions with Maximum Accu-
racy. IEEE Transactions on Computers, Vol. C-26, no. 7, July 1977.
16. Bohlender, G.: Genaue Berechnung mehrfacher Summen, Produkte und
Wurzeln von Gleitkommazahlen und allgemeine Arithmetik in höheren Program-
miersprachen. Dissertation, Universität Karlsruhe, 1978.
17. Bohlender, G.; Grüner, K.; Kaucher, E.; Klatte, R.; Kramer, W.; Kulisch, U.;
Miranker, W. L.; Rump, S. M.; Ullrich, Ch.; Wolff v. Gudenberg, J.: PASCAL-
SC: A PASCAL for Contemporary Scientific Computation. IBM Research Re-
port RC 9009 (#39456) 8/25/81, 79 pages, 1981.
18. Bohlender, G.; Kaucher, E.; Klatte, R.; Kulisch, U.; Miranker, W. L.; Ullrich,
Ch.; Wolff v. Gudenberg, J.: FORTRAN for Contemporary Numerical Compu-
tation. IBM Research Report RC 8348. Computing 26, pp. 277-314, 1981.
19. Bohlender, G.: What Do We Need Beyond IEEE Arithmetic? In [94, pp. 1-32],
1990.
20. Bohlender, G.: Literature List on Enclosure Methods and Related Topics. Insti-
tut für Angewandte Mathematik, Universität Karlsruhe, Report, 1998.
21. Böhm, H.: Berechnung von Polynomnullstellen und Auswertung arithmetischer
Ausdrücke mit garantierter maximaler Genauigkeit. Dissertation, Universität
Karlsruhe, 1983.
22. Braune, K.: Hochgenaue Standardfunktionen für reelle und komplexe Punkte und
Intervalle in beliebigen Gleitpunktrastern. Dissertation, Universität Karlsruhe,
1987.
23. Cappello, P. R.; Miranker, W. L.: Systolic Super Summation. IEEE Transac-
tions on Computers 37 (6), pp. 657-677, June 1988.
24. Cappello, P. R.; Miranker, W. L.: Systolic Super Summation with Reduced Hard-
ware. IBM Research Report RC 14259 (#63831), IBM Research Division, York-
town Heights, New York, Nov. 30, 1988.
25. Dietrich, St.: Adaptive verifizierte Lösung gewöhnlicher Differentialgleichungen.
Dissertation, Universität Karlsruhe, 2002.
26. Erb, H.: Ein Gleitpunkt-Arithmetikprozessor mit mehrfacher Präzision zur ver-
ifizierten Lösung linearer Gleichungssysteme. Dissertation, Fakultät für Infor-
matik, Universität Karlsruhe, 1992.
27. Facius, A.: Iterative Solution of Linear Systems with Improved Arithmetic and
Result Verification. Dissertation, Universität Karlsruhe, 2000.
28. Facius, A.: Highly Accurate Verified Error Bounds for Krylov Type Linear Sys-
tem Solvers. pp. 76-98, in [71].
29. Fischer, H.-C.: Schnelle Automatische Differentiation, Einschließungsmethoden
und Anwendungen. Dissertation, Universität Karlsruhe, 1990.
30. Fischer, H.-C.: Automatic Differentiation and Applications. pp. 105-142, in [1].
31. Hamada, H.: A New Real Number Representation and its Operation. In [105,
Vol. 8, pp. 153-157], 1987.
32. Hammer, R.: How Reliable is the Arithmetic of Vector Computers. pp. 467-482,
in [95], 1990.
33. Hammer, R.: Maximal genaue Berechnung von Skalarproduktausdrücken und
hochgenaue Auswertung von Programmteilen. Dissertation, Universität Karl-
sruhe, 1992.
34. Hammer, R.; Hocks, M.; Kulisch, U.; Ratz, D.: Numerical Toolbox for Ver-
ified Computing I: Basic Numerical Problems. (Vol. II see [57], version
in C++ see [35]) Springer-Verlag, Berlin / Heidelberg / New York, 1993.
35. Hammer, R.; Hocks, M.; Kulisch, U.; Ratz, D.: C++ Toolbox for Veri-
fied Computing: Basic Numerical Problems. Springer-Verlag, Berlin /
Heidelberg / New York, 1995.
36. Hergenhan, A.: Spezifikation und Entwurf einer hochleistungsfähigen
Gleitkomma-Architektur. Diplomarbeit, Technische Universität Dresden,
1994.
37. Höfflinger, B.: Next-Generation Floating-Point Arithmetic for Top-
Performance PCs. The 1995 Silicon Valley Personal Computer Design
Conference and Exposition, Conference Proceedings, pp. 319-325, 1995.
38. Hoff, T.: How Children Accumulate Numbers or Why We Need a Fifth Floating-
Point Operation. In: Jahrbuch Überblicke Mathematik, S. 219-222, Vieweg
Verlag, 1993.
39. Hofschuster, W.: Zur Berechnung von Funktionseinschließungen bei speziellen
Funktionen der mathematischen Physik. Dissertation, Universität Karlsruhe,
2000.
40. Hofschuster, W.; Kramer, W.: A Computer Oriented Approach to Get Sharp
Reliable Error Bounds. Reliable Computing, Issue 3, Volume 3, 1997.
41. Januschke, P.: Oberon-XSC, Eine Programmiersprache und Arithmetikbiblio-
thek für das Wissenschaftliche Rechnen. Dissertation, Universität Karlsruhe,
1998.
42. Kelch, R.: Ein adaptives Verfahren zur numerischen Quadratur mit automatischer Ergebnisverifikation. Dissertation, Universität Karlsruhe, 1989.
43. Kernhof, J.; Baumhof, Ch.; Hoefflinger, B.; Kulisch, U.; Kwee, S.; Schramm,
P.; Selzer, M.; Teufel, Th.: A CMOS Floating-Point Processing Chip for Ver-
ified Exact Vector Arithmetic. European Solid State Circuits Conference 94
ESSCIRC, Ulm, Sept. 1994.
44. Kirchner, R.; Kulisch, U.: Arithmetic for Vector Processors. In [105, Vol. 8,
pp. 256-269], 1987.
45. Kirchner, R.; Kulisch, U.: Accurate Arithmetic for Vector Processing. Journal
of Parallel and Distributed Computing 5, special issue on "High Speed Com-
puter Arithmetic", pp. 250-270, 1988.
46. Klatte, R.; Kulisch, U.; Neaga, M.; Ratz, D.; Ullrich, Ch.: PASCAL-
XSC Sprachbeschreibung mit Beispielen. Springer-Verlag,
Berlin/Heidelberg/New York, 1991 (ISBN 3-540-53714-7, 0-387-53714-7).
47. Klatte, R.; Kulisch, U.; Neaga, M.; Ratz, D.; Ullrich, Ch.: PASCAL-
XSC - Language Reference with Examples. Springer-Verlag,
Berlin/Heidelberg/New York, 1992.
48. Klatte, R.; Kulisch, U.; Lawo, C.; Rauch, M.; Wiethoff, A.: C-XSC, A C++
Class Library for Extended Scientific Computing. Springer-Verlag,
Berlin/Heidelberg/New York, 1993.
49. Klatte, R.; Kulisch, U.; Neaga, M.; Ratz, D.; Ullrich, Ch.: PASCAL-XSC -
Language Reference with Examples (In Russian). Moscow, 1994.
50. Knöfel, A.: Hardwareentwurf eines Rechenwerks für semimorphe Skalar- und Vektoroperationen unter Berücksichtigung der Anforderungen verifizierender Algorithmen. Dissertation, Universität Karlsruhe, 1991.
51. Klein, W.: Zur Einschließung der Lösung von linearen und nichtlinearen Fredholmschen Integralgleichungssystemen zweiter Art. Dissertation, Universität Karlsruhe, 1990.
52. Knöfel, A.: Fast Hardware Units for the Computation of Accurate Dot Products. In [105, Vol. 10, pp. 70-74], 1991.
53. Kramer, W.: Inverse Standardfunktionen für reelle und komplexe Intervallargumente mit a priori Fehlerabschätzungen für beliebige Datenformate. Dissertation, Universität Karlsruhe, 1987.
54. Kramer, W.: Constructive Error Analysis. Journal of Universal Computer Sci-
ence (JUCS), Vol. 4, No.2, pp. 147-163, 1998.
55. Kramer, W.; Bantle, A.: Automatic Forward Error Analysis for Floating-Point
Algorithms. Reliable computing, Vol. 7, No.4, pp. 321-340, 2001.
66 Bibliography and Related Literature

56. Kramer, W.; Walter, W.: FORTRAN-SC: A FORTRAN Extension for Engi-
neering/Scientific Computation with Access to ACRITH, General Information
Notes and Sample Programs. pp 1-51, IBM Deutschland GmbH, Stuttgart,
1989.
57. Kramer, W.; Kulisch, U.; Lohner, R.: Numerical Toolbox for Verified
Computing II: Theory, Algorithms and Pascal-XSC Programs. (Vol. I
see [34,35]) Springer-Verlag, Berlin / Heidelberg / New York, to appear.
58. Kulisch, U.: An axiomatic approach to rounded computations. TS Report No.
1020, Mathematics Research Center, University of Wisconsin, Madison, Wis-
consin, 1969, and Numerische Mathematik 19, pp. 1-17, 1971.
59. Kulisch, U.: Formalization and Implementation of Floating-Point Arithmetic.
Computing 14, pp. 323-348, 1975.
60. Kulisch, U.: Grundlagen des Numerischen Rechnens - Mathematis-
che Begründung der Rechnerarithmetik. Reihe Informatik, Band 19,
Bibliographisches Institut, Mannheim/Wien/Ziirich, 1976 (ISBN 3-411-01517-
9).
61. Kulisch, U.: Schaltungsanordnung und Verfahren zur Bildung von Skalarpro-
dukten und Summen von Gleitkommazahlen mit maximaler Genauigkeit.
Patentschrift DE 3144015 AI, 1981.
62. Kulisch, U.; Miranker, W. L.: Computer Arithmetic in Theory and Prac-
tice. Academic Press, New York, 1981 (ISBN 0-12-428650-x).
63. Kulisch, U.; Ullrich, Ch. (Eds.): Wissenschaftliches Rechnen und Pro-
grammiersprachen. Proceedings of Seminar held in Karlsruhe, April 2-3,
1982. Berichte des German Chapter of the ACM, Band 10, B. G. Teubner Ver-
lag, Stuttgart, 1982 (ISBN 3-519-02429-2).
64. Kulisch, U.; Miranker, W. L. (Eds.): A New Approach to Scientific Com-
putation. Proceedings of Symposium held at IBM Research Center, Yorktown
Heights, N. Y., 1982. Academic Press, New York, 1983 (ISBN 0-12-428660-7).
65. Kulisch, U.; Miranker, W. L.: The Arithmetic of the Digital Computer: A New
Approach. IBM Research Center RC 10580, pp. 1-62, 1984. SIAM Review, Vol.
28, No.1, pp. 1-40, March 1986.
66. Kulisch, U.; Kirchner, R: Schaltungsanordnung zur Bildung von Produktsum-
men in Gleitkommadarstellung, insbes. von Skalarprodukten. Patentschrift
DE 3703440 C2, 1986.
67. Kulisch, U. (Ed.): PASCAL-SC: A PASCAL extension for scientific
computation, Information Manual and Floppy Disks, Version IBM PC/AT;
Operating System DOS. B. G. Teubner Verlag (Wiley-Teubner series in com-
puter science), Stuttgart, 1987 (ISBN 3-519-02106-4 / 0-471-91514-9).
68. Kulisch, U. (Ed.): PASCAL-SC: A PASCAL extension for scientific
computation, Information Manual and Floppy Disks, Version ATARI ST. B.
G. Teubner Verlag, Stuttgart, 1987 (ISBN 3-519-02108-0).
69. Kulisch, U. (Ed.): Wissenschaftliches Rechnen mit Ergebnisverifikation
- Eine Einführung. Ausgearbeitet von S. Georg, R. Hammer und D. Ratz.
Vol. 58. Akademie Verlag, Berlin, und Vieweg Verlagsgesellschaft, Wiesbaden, 1989.
70. Kulisch, U.; Teufel, T.; Hoefflinger, B.: Genauer und trotzdem schneller, Ein neuer Coprozessor für hochgenaue Matrix- und Vektoroperationen. Titelgeschichte, Elektronik 26, 1994.
71. Kulisch, U.; Lohner, R. and Facius, A. (Eds.): Perspectives on Enclosure Methods. Springer-Verlag, Wien, New York, 2001.
72. Lichter, P.: Realisierung eines VLSI-Chips für das Gleitkomma-Skalarprodukt der Kulisch-Arithmetik. Diplomarbeit, Fachbereich 10, Angewandte Mathematik und Informatik, Universität des Saarlandes, 1988.
73. Meis, T.: Brauchen wir eine Hochgenauigkeitsarithmetik? Elektronische Rechenanlagen, Carl Hanser Verlag, pp. 19-23, 1987.
74. Müller, M.; Rüb, Ch.; Rülling, W.: Exact Accumulation of Floating-Point Numbers. In [105, Vol. 10, pp. 64-69], 1991.
75. Müller, M.: Entwicklung eines Chips für auslöschungsfreie Summation von Gleitkommazahlen. Dissertation, Universität des Saarlandes, Saarbrücken, 1993.
76. Pichat, M.: Correction d'une somme en arithmétique à virgule flottante. Numerische Mathematik 19, pp. 400-406, 1972.
77. Priest, D. M.: Algorithms for Arbitrary Precision Floating Point Arithmetic. In [105, Vol. 10, pp. 132-143], 1991.
78. Ratz, D.: The Effects of the Arithmetic of Vector Computers on Basic Numerical Methods. pp. 499-514, in [95], 1990.
79. Ratz, D.: Automatische Ergebnisverifikation bei globalen Optimierungsproblemen. Dissertation, Universität Karlsruhe, 1992.
80. Ratz, D.: Automatic Slope Computation and its Application in Non-
smooth Global Optimization. Shaker-Verlag, Aachen, 1998.
81. Ratz, D.: Nonsmooth Global Optimization. pp. 277-338, in [71].
82. Rojas, R.: Die Architektur der Rechenmaschinen Z1 und Z3 von Konrad Zuse.
Informatik Spektrum 19/6, Springer-Verlag, pp. 303-315, 1996.
83. Rump, S. M.: Kleine Fehlerschranken bei Matrixproblemen. Dissertation, Universität Karlsruhe, 1980.
84. Rump, S. M.: How Reliable are Results of Computers? / Wie zuverlässig sind die Ergebnisse unserer Rechenanlagen? In: Jahrbuch Überblicke Mathematik, Bibliographisches Institut, Mannheim, 1983.
85. Rump, S. M.: Solving algebraic problems with high accuracy. pp. 51-120, in [64], 1983.
86. Rump, S. M.; Böhm, H.: Least Significant Bit Evaluation of Arithmetic Expressions in Single-Precision. Computing 30, pp. 189-199, 1983.
87. Schmidt, L.: Semimorphe Arithmetik zur automatischen Ergebnisverifikation auf Vektorrechnern. Dissertation, Universität Karlsruhe, 1992.
88. Shiriaev, D.: Fast Automatic Differentiation for Vector Processors and Reduction of the Spatial Complexity in a Source Translation Environment. Dissertation, Universität Karlsruhe, 1993.
89. Stetter, H. J.: Sequential Defect Correction for High-Accuracy Floating-Point
Algorithms. Lecture Notes in Mathematics, Vol. 1006, pp. 186-202, Springer-
Verlag, 1984.
90. Stetter, H. J.: Staggered Correction Representation, a Feasible Approach to
Dynamic Precision. In: Proceedings of the Symposium on Scientific Software,
edited by Cai, Fosdick, Huang, China University of Science and Technology
Press, Beijing, China, 1989.
91. Suzuki, H.; Morinaka, H.; Makino, H.; Nakase, Y.; Mashiko, K.; Sumi, T,:
Leading-Zero Anticipatory Logic for High-Speed Floating-Point Addition. IEEE
Journal of Solid-State Circuits, Vol. 31, No.8, August 1996.
92. Tangelder, R.J.W.T: The Design of Chip Architectures for Accurate Inner Prod-
uct Computation. Dissertation, Technical University Eindhoven, 1992. ISBN
90-9005204-6.
93. Teufel, T.: Ein optimaler Gleitkommaprozessor. Dissertation, Universität Karlsruhe, 1984.
94. Ullrich, Ch. (Ed.): Computer Arithmetic and Self-Validating Numerical
Methods. (Proceedings of SCAN 89, held in Basel, Oct. 2-6, 1989, invited
papers). Academic Press, San Diego, 1990.
95. Ullrich, Ch. (Ed.): Contributions to Computer Arithmetic and Self-
Validating Numerical Methods. J.C.Baltzer AG, Scientific Publishing Co., 1990.
96. Wallis, P. J. L. (Ed.): Improving Floating-Point Programming. J. Wiley,
Chichester, 1990 (ISBN 0 471 92437 7).
97. Walter, W.: FORTRAN-SC: A FORTRAN Extension for Engineering / Sci-
entific Computation with Access to A CRITH, Language Reference and User's
Guide. 2nd ed., pp. 1-396, IBM Deutschland GmbH, Stuttgart, Jan. 1989.
98. Walter, W.: Flexible Precision Control and Dynamic Data Structures for Pro-
gramming Mathematical and Numerical Algorithms. 1990.
99. Walter, W. V.: Mathematical Foundations of Fully Reliable and Portable Soft-
ware for Scientific Computing. Universität Karlsruhe, 1995.
100. Wilkinson, J.: Rounding Errors in Algebraic Processes. Prentice-Hall,
Englewood Cliffs, New Jersey, 1963.
101. Winter, Th.: Ein VLSI-Chip für Gleitkomma-Skalarprodukt mit maximaler Genauigkeit. Diplomarbeit, Fachbereich 10, Angewandte Mathematik und Informatik, Universität des Saarlandes, 1985.
102. Winter, D. T.: Automatic Identification of Scalar Products. In [96], 1990.
103. Yilmaz, T.; Theeuwen, J.F.M.; Tangelder, RJ.W.T.; Jess, J.A.G.: The Design
of a Chip for Scientific Computation. Eindhoven University of Technology, 1989
and pp. 335-346 of Proceedings of the Euro-Asic Symposium, Grenoble, Jan.25-
27, 1989.
104. Yohe, J.M.: Roundings in Floating-Point Arithmetic. IEEE Trans. on Com-
puters, Vol. C-22, No.6, June 1973, pp. 577-586.
105. Institute of Electrical and Electronics Engineers: Proceedings of x-th Sym-
posium on Computer Arithmetic ARITH. IEEE Computer Society Press.
IEEE Service Center, 445 Hoes Lane, P.O.Box 1331, Piscataway, NJ 08855-1331,
USA.
Editors of proceedings; place of conference; date of conference.
1. Shively, R.R.; Minneapolis; June 16, 1969.
2. Garner, H.L.; Atkins, D.E.; Univ Maryland, College Park; May 15 - 16,
1972.
3. Rao, T.R.N.; Matula, D.W.; SMU, Dallas; Nov. 19 - 20, 1975.
4. Avizienis, A.; Ercegovac, M.D.; UCLA, Los Angeles; Oct. 25 - 27, 1978.
5. Trivedi, K.S.; Atkins, D.E.; Univ Michigan, Ann Arbor; May 18 - 19, 1981.
6. Rao, T.R.N.; Kornerup, P.; Univ Aarhus, Denmark; June 20 - 22, 1983.
7. Hwang, K.; Univ Illinois, Urbana; June 4 - 6, 1985.
8. Irwin, M.J.; Stefanelli, R.; Como, Italy; May 19 - 21, 1987.
9. Ercegovac, M.; Swartzlander, E.; Santa Monica; Sept. 6 - 8, 1989.
10. Kornerup, P.; Matula, D.; Grenoble, France; June 26 - 28, 1991.
11. Swartzlander Jr., E.; Irwin, M. J.; Jullien, G.; Windsor, Ontario; June 29
- July 2, 1993.
12. Knowles, S.; Mc Allister, W. H.; Bath, England; July 19 - 21, 1995;
13. Lang, Th.; Muller, J.-M.; Takagi, N.; Asilomar, California; July 6 - 9,1997;
106. IAM: PASCAL-XR: PASCAL for eXtended Real arithmetic. Joint research
project with Nixdorf Computer AG. Institute of Applied Mathematics, Univer-
sity of Karlsruhe, Postfach 6980, D-76128 Karlsruhe, Germany, 1980.
107. IAM: FORTRAN-SC: A FORTRAN Extension for Engineering / Scientific
Computation with Access to ACRITH. Institute of Applied Mathematics, Uni-
versity of Karlsruhe, Postfach 6980, D-76128 Karlsruhe, Germany, Jan. 1989.
1. Language Reference and User's Guide, 2nd edition.
2. General Information Notes and Sample Programs.
108. IAM: ACRITH-XSC, A Programming Language for Scientific Computation. Syntax Diagrams. Institute of Applied Mathematics, University of Karlsruhe, Postfach 6980, D-76128 Karlsruhe, Germany, 1990.
109. IBM: IBM System/370 RPQ. High Accuracy Arithmetic. SA 22-7093-0,
IBM Deutschland GmbH (Department 3282, Schönaicher Straße 220, D-71032
Böblingen), 1984.
110. IBM: IBM High-Accuracy Arithmetic Subroutine Library
(ACRITH). IBM Deutschland GmbH (Department 3282, Schönaicher
Straße 220, D-71032 Böblingen), 3rd edition, 1986.
1. General Information Manual. GC 33-6163-02.
2. Program Description and User's Guide. SC 33-6164-02.
3. Reference Summary. GX 33-9009-02.
111. IBM: Verfahren und Schaltungsanordnung zur Addition von Gleitkommazahlen.
Europäische Patentanmeldung, EP 0 265 555 A1, 1986.
112. IBM: ACRITH-XSC: IBM High Accuracy Arithmetic - Extended
Scientific Computation. Version 1, Release 1. IBM Deutschland GmbH
(Schönaicher Straße 220, D-71032 Böblingen), 1990.
1. General Information, GC33-6461-01.
2. Reference, SC33-6462-00.
3. Sample Programs, SC33-6463-00.
4. How To Use, SC33-6464-00.
5. Syntax Diagrams, SC33-6466-00.
113. IEEE: A Proposed Standard for Binary Floating-Point Arithmetic. IEEE
Computer, March 1981.
114. American National Standards Institute / Institute of Electrical and Electron-
ics Engineers: A Standard for Binary Floating-Point Arithmetic. ANSI/IEEE
Std. 754-1985, New York, 1985 (reprinted in SIGPLAN 22, 2, pp. 9-25, 1987).
Also taken over as IEC Standard 559:1989.
115. American National Standards Institute / Institute of Electrical and Electron-
ics Engineers: A Standard for Radix-Independent Floating-Point Arithmetic.
ANSI/IEEE Std. 854-1987, New York, 1987.
116. IMACS; GAMM: IMACS-GAMM Resolution on Computer Arithmetic. In
Mathematics and Computers in Simulation 31, pp. 297-298, 1989. In Zeitschrift
für Angewandte Mathematik und Mechanik 70, no. 4, p. T5, 1990.
117. IMACS; GAMM: GAMM-IMACS Proposal for Accurate Floating-Point Vector
Arithmetic. GAMM, Rundbrief2, pp. 9-16,1993. Mathematics and Computers
in Simulation, Vol. 35, IMACS, North Holland, 1993. News of IMACS, Vol. 35,
No.4, pp. 375-382, Oct. 1993.
118. Numerik Software GmbH: PASCAL-XSC: A PASCAL Extension for
Scientific Computation. User's Guide. Numerik Software GmbH, Haid-
und-Neu-Straße 7, D-76131 Karlsruhe, Germany / Postfach 2232, D-76492
Baden-Baden, Germany, 1991.
119. SIEMENS: ARITHMOS (BS 2000) Unterprogrammbibliothek für
Hochpräzisionsarithmetik. Kurzbeschreibung, Tabellenheft, Benutzerhandbuch. SIEMENS AG, Bereich Datentechnik, Postfach 83 09 51, D-8000 München 83. Bestellnummer U2900-J-Z87-1, Sept. 1986.
2. Rounding Near Zero

Summary.
This paper deals with arithmetic on a discrete subset S of the real
numbers ℝ and with floating-point arithmetic in particular. We assume
that arithmetic on S is defined by semimorphism. Then for any element
a ∈ S the element −a ∈ S is an additive inverse of a, i.e. a ⊕ (−a) = 0.
The first part of the paper describes a necessary and sufficient condition
under which -a is the unique additive inverse of a in S. In the second
part this result is generalized. We consider algebraic structures M which
carry a certain metric, and their semimorphic images on a discrete subset
N of M. Again, a necessary and sufficient condition is given under which
elements of N have a unique additive inverse. This result can be applied to
complex floating-point numbers, real and complex floating-point intervals,
real and complex floating-point matrices, and real and complex floating-
point interval matrices.

2.1 The one dimensional case


Let ℝ denote the set of real numbers and S a discrete subset of ℝ which is
symmetric with respect to zero, i.e. 0 ∈ S and for all a ∈ S also −a ∈ S. A
semimorphism defines arithmetic operations ⊕ and ⊗ on S by the following
rule:

(RG)  a ⊚ b := □(a ∘ b)  for all a, b ∈ S and ∘ ∈ {+, ∗}.

In (RG) □ is a rounding □ : ℝ → S with the properties

(R1)  □(a) = a  for all a ∈ S  (projection).
(R2)  a ≤ b ⇒ □(a) ≤ □(b)  for a, b ∈ ℝ  (monotonicity).
(R3)  □(−a) = −□(a)  for all a ∈ ℝ  (antisymmetry).

That is, the rounding is a monotone and antisymmetric projection of ℝ
onto S.
A semimorphism leads to the best possible arithmetic in S in the sense
that between the computed result of an operation and the correct result there
is never another element of S. The computed result is rounding dependent.
In case of rounding to nearest it is accurate to 1/2 unit in the last place
(ulp). In all other cases it is accurate to 1 ulp.
Typical monotone and antisymmetric roundings are the rounding to the
nearest element in S, the rounding towards zero, and the rounding away from
zero.
Since 0 ∈ S, (R1) and (RG) yield immediately that for every a ∈ S the
element −a is an additive inverse of a in S:

a ⊕ (−a) = □(a + (−a)) = □(0) = 0  for all a ∈ S.

However, in a normalized floating-point system −a is not in general the
only additive inverse of a in S. We briefly recall the definition of normalized
floating-point numbers:
A normalized floating-point number is a real number of the form

x = ∘ m · b^e.

Here ∘ ∈ {+, −} is the sign of the number, m is the mantissa, b is the base
of the number system in use and e is the exponent. b is an integer greater
than unity. The exponent is an integer between two fixed integer bounds e1,
e2, and in general e1 ≤ 0 ≤ e2. The mantissa is of the form

m = Σ_{i=1}^{r} d_i · b^{-i}.

The d_i are the digits of the mantissa. They have the property d_i ∈ {0,
1, ..., b − 1} for all i = 1(1)r and d_1 ≠ 0. Without the condition d_1 ≠ 0,
floating-point numbers are said to be unnormalized. The set of normalized
floating-point numbers does not contain zero. So zero is adjoined to S. For
a unique representation of zero it is often assumed that m = 0.00···0 and
e = 0. A floating-point system depends on the constants b, r, e1, and e2. We
denote it by S = S(b, r, e1, e2).
The floating-point numbers are not equally spaced between successive
powers of b and their negatives. This spacing changes at every power of
b. In particular, there are relatively large gaps around zero which contain
no further floating-point number. Figure 2.1 shows a simple floating-point
system S = S(2, 3, −1, 2) consisting of 33 elements.
If, for instance, the rounding towards zero is chosen, the entire interval
(−1/4, 1/4) is mapped onto zero. So whenever the real sum of two numbers
of S falls into this interval (e.g. 1/4 − 3/8) their sum in S is zero, a ⊕ b = 0,
and the two elements form a pair of additive inverses.
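A minimal sketch of this example (an added illustration; the function names are not from the text) enumerates the system S(2, 3, −1, 2) of Figure 2.1 with exact rational arithmetic, defines the rounding towards zero, and reproduces the sum 1/4 − 3/8:

    from fractions import Fraction

    def system(b=2, r=3, e1=-1, e2=2):
        """All normalized floating-point numbers of S(b, r, e1, e2), with zero adjoined."""
        s = {Fraction(0)}
        for e in range(e1, e2 + 1):
            for m in range(b ** (r - 1), b ** r):        # mantissas 0.d1...dr with d1 != 0
                x = Fraction(m, b ** r) * Fraction(b) ** e
                s.update((x, -x))
        return sorted(s)

    S = system()
    print(len(S))                                        # 33 elements, as in Figure 2.1

    def round_towards_zero(x):
        """A monotone and antisymmetric rounding of a rational x onto S."""
        return max(s for s in S if s <= x) if x >= 0 else min(s for s in S if s >= x)

    def plus(a, b):                                      # rule (RG): the rounded exact sum
        return round_towards_zero(a + b)

    print(plus(Fraction(1, 4), Fraction(-3, 8)))         # 0: -3/8 acts as a second additive inverse of 1/4
    print(plus(Fraction(1, 4), Fraction(-1, 4)))         # 0 as well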
The following theorem characterizes a discrete subset S of ℝ by a necessary and sufficient condition under which the element −a is the unique additive inverse of a in S:
[Figure: the real axis ℝ from −3 to 3 with the nonzero elements of S marked; the least element of least absolute value and the greatest element are indicated.]
Fig. 2.1. The characteristic spacing of a floating-point system.

Theorem 1:
If S is a symmetric, discrete subset of ℝ with 0 ∈ S, □ : ℝ → S a semimorphism, and ε > 0 the least distance between distinct elements of S, then for
all a ∈ S the element b = −a is the unique additive inverse of a if and only
if

□⁻¹(0) ⊆ (−ε, ε).     (2.1)

Here □⁻¹(0) denotes the inverse image of 0 and (−ε, ε) is the open interval
between −ε and ε.

Proof. At first we show that (2.1) is sufficient: We assume that b ≠ −a is an
additive inverse of a in S. Then a ⊕ b = 0 and by (RG) a + b ∈ □⁻¹(0). This
means by (2.1)

−ε < a + b < ε.     (2.2)

Since −b ≠ a we have by definition of ε: |a − (−b)| = |a + b| ≥ ε. This
contradicts (2.2), so the assumption is false. Under the condition (2.1) there
is no additive inverse of a other than −a. In other words (2.1) is sufficient.
Now we show that (2.1) is necessary also: Since 0 ∈ S, by (R1) □(0) = 0
and 0 ∈ □⁻¹(0). Since □ is monotone, □⁻¹(0) is convex. Since □ is antisymmetric, □⁻¹(0) is symmetric with respect to zero. We assume now that
(2.1) is not true; then □⁻¹(0) ⊋ (−ε, ε). We take two elements a, b ∈ S with
distance ε. Then a ≠ b and |a − b| = ε, i.e. a + (−b) ∈ □⁻¹(0) or a ⊕ (−b) = 0.
This means that −b ≠ −a is inverse to a. Thus (2.1) is necessary because
otherwise there would be more than one additive inverse to a. □

(2.1) holds automatically if the number ε itself is an element of S. Then,
because of (R1), □(ε) = ε and, because of the monotonicity of the rounding
(R2), □⁻¹(0) ⊆ (−ε, ε). In other words: if the least distance ε of two elements
of a discrete subset S of ℝ is an element of S, then for all a ∈ S the element
−a ∈ S is the unique additive inverse of a in S. (This holds under the
assumption that the mapping □ : ℝ → S is a semimorphism.)
Such subsets of the real numbers do indeed occur. For instance, the integers or the so called fixed-point numbers are subsets of ℝ with this property.
Sometimes the normalized floating-point numbers are extended into a set
with this property. Such a set is obtained if in the case e = e1 unnormalized
mantissas are permitted. Then ε itself becomes an element of S and for all
a ∈ S the element −a is the unique additive inverse of a. This is the case, for
instance, if IEEE arithmetic with denormalized numbers is implemented.

[Figure: six panels (a)–(f) showing the graph of the rounding □(a) near zero for the cases discussed below.]
Fig. 2.2. The behavior of frequently used roundings near zero.
Figure 2.2 illustrates the behavior of typical roundings in the neighbor-
hood of zero. (R1) means that for floating-point numbers the rounding func-
tion coincides with the identity mapping.

(a) shows the conventional behavior of the rounding in case of normalized
floating-point numbers. In this case (2.1) does not hold and we have no
uniqueness of the additive inverse.
(b) shows the rounding away from zero in case of normalized floating-point
numbers. In this case (2.1) holds and we have unique additive inverses.
(c) here and in the following cases ε is an element of S and we have unique
additive inverses.
(d) shows the rounding toward zero near zero in the case where denormalized
numbers are permitted for e = e1. In the IEEE arithmetic standard this
situation is called gradual underflow.
(e) shows the rounding to nearest in the neighborhood of zero. The roundings
(d) and (e) are provided by the IEEE arithmetic standard.
(f) shows the rounding away from zero in the neighborhood of zero with
denormalized numbers permitted for e = e1. This rounding has all required
properties. It is □(a) = 0 if and only if a = 0, a property which can be
very important for making a clear distinction between a number that is
zero and a number that is not, however small it might actually be. This
rounding is not provided by the IEEE arithmetic standard.

2.2 Rounding in product spaces

In addition to the real numbers, other spaces frequently occur in numerical


mathematics. Such spaces are the complex numbers, the real and complex
intervals, the real and complex matrices, and the real and complex interval
matrices. In their computer representable subsets subtraction is no longer
the inversion of addition nor is division the inversion of multiplication. Nev-
ertheless subtraction is not an independent operation. If arithmetic in the
computer representable subspaces is defined by semimorphism, subtraction
can be defined by addition and multiplication with the negated multiplicative
unit. For clarity we briefly show the definition here.
Let M denote any one of the sets listed above and N its computer representable subset. A semimorphism defines arithmetic operations ⊕ and ⊗ in
N by

(RG)  a ⊚ b := □(a ∘ b)  for all a, b ∈ N and ∘ ∈ {+, ∗}.

Here □ : M → N is a mapping with the following properties

(R1)  □(a) = a  for all a ∈ N  (rounding).
(R2)  a ≤ b ⇒ □(a) ≤ □(b)  for a, b ∈ M  (monotonicity).
(R3)  □(−a) = −□(a)  for all a ∈ M  (antisymmetry).

In case of the interval spaces, the order relation ≤ means set inclusion ⊆. In
this case the rounding is required to have the additional property

(R4)  a ≤ □(a)  for all a ∈ M  (upwardly directed).

With this definition it is shown in [3,4] for all spaces mentioned above
that the multiplicative unit e has a unique additive inverse ⊖e in N. With
this quantity the minus operator (negation) and subtraction are defined by

⊖b := (⊖e) ⊗ b   and   a ⊖ b := a ⊕ (⊖b).     (2.3)

This preserves all rules of the minus operator in the computer representable
subspaces, [3,4].
The proof that ⊖e is unique is intricate and not easy in all the individual
cases [3,4]. A generalization of the conditions given in Section 2.1 for the
existence and uniqueness of additive inverses for the product spaces listed
above could simplify the situation considerably. This is now done.
We assume that the basic set M is mapped into a discrete subset N by
semimorphism, where N is symmetric, i.e. 0 ∈ N and for all a ∈ N also
−a ∈ N. It follows from (RG) and (R1) that an element a ∈ N which has an
additive inverse −a in M has the same additive inverse in N:

a ⊕ (−a) := □(a + (−a)) = □(0) = 0.

Now we assume additionally that M is a metric space with a distance
function d : M × M → ℝ which has the property:

d(a + c, b + c) = d(a, b)  for all a, b, c ∈ M  (translation invariant).     (2.4)

With d, the property of N being discrete can now be expressed by

d(a, b) ≥ ε > 0  for all a, b ∈ N with a ≠ b,     (2.5)

where ε > 0 is the least distance of distinct elements of N. With these
concepts the following theorem holds:

Theorem 2:
For all elements a of N which have a unique additive inverse −a in M, −a
is also the unique additive inverse of a in N if and only if

d(x, 0) < ε  for all x ∈ M with □(x) = 0.     (2.6)

Proof. At first we show that (2.6) is sufficient: We assume that b ≠ −a is an
additive inverse of a in N. Then we obtain by (RG) and (2.6):

a ⊕ b = 0  ⇒  □(a + b) = 0  ⇒  d(a + b, 0) < ε.     (2.7)

On the other hand we get by the definition of ε and by (2.4):

d(b, −a) ≥ ε  ⇒  d(a + b, a + (−a)) = d(a + b, 0) ≥ ε.

This contradicts (2.7), so the assumption that there is an additive inverse b
of a other than −a is false. In other words (2.6) is sufficient.
Now we show that (2.6) is necessary also: By (R1) we obtain 0 ∈ □⁻¹(0).
As a consequence of (R2) □⁻¹(0) is convex and by (R3) □⁻¹(0) is symmetric,
since for all a ∈ M

a ∈ □⁻¹(0)  ⇒  □(a) = 0  ⇒  □(−a) = −□(a) = 0  ⇒  −a ∈ □⁻¹(0).

Should (2.6) not hold there would be elements x ∈ M with

d(x, 0) ≥ ε > 0  and  □(x) = 0.

ε is the least distance of distinct elements of N. Now we choose two different
elements a, b ∈ N with distance ε, for instance a = x and b = 0. Then
d(a, b) = ε and □(a − b) = 0, i.e. a − b ∈ □⁻¹(0) or a ⊕ (−b) = 0. This means
that −b is inverse to a and −b ≠ −a. In other words: if (2.6) does not hold
there are elements in N which have more than one additive inverse. This
shows that (2.6) is necessary, which completes the proof. □
To fully establish the theorem of Section 2.2 we still have to demonstrate
that in all the basic spaces under consideration a metric does indeed exist
which is translation invariant, see (2.4). We just mention the appropriate
metric and leave the demonstration of (2.4) to the reader:
• If M is the set of real numbers ℝ, then d(a, b) = |a − b|.
• If M is the set of complex numbers ℂ, the distance of two complex numbers
  a = a1 + ia2 and b = b1 + ib2 is defined by d(a, b) := |a1 − b1| + |a2 − b2|.
• If M is the set of real intervals Iℝ the distance of two intervals a = [a1, a2]
  and b = [b1, b2] is defined by d(a, b) := max(|a1 − b1|, |a2 − b2|).
• If M is the set of complex intervals Iℂ the distance of two complex intervals
  a = a1 + ia2 and b = b1 + ib2 with real intervals a1, a2, b1 and b2 is defined
  by d(a, b) := d(a1, b1) + d(a2, b2).
• In case of two matrices a = (aij) and b = (bij) with components aij, bij of
  ℝ, ℂ, Iℝ, or Iℂ, the distance is defined in each case as the maximum of the
  distances of corresponding matrix components: d(a, b) := max(d(aij, bij)).
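As a small numerical check (added for illustration; not from the original text), the interval distance from the list above is translation invariant in the sense of (2.4); the following sketch verifies this on random rational intervals:

    from fractions import Fraction
    import random

    def dist(A, B):
        """d([a1,a2],[b1,b2]) = max(|a1-b1|, |a2-b2|), the interval metric listed above."""
        return max(abs(A[0] - B[0]), abs(A[1] - B[1]))

    def iadd(A, B):
        return (A[0] + B[0], A[1] + B[1])

    def rand_interval():
        x = Fraction(random.randint(-100, 100), random.randint(1, 10))
        y = Fraction(random.randint(-100, 100), random.randint(1, 10))
        return (min(x, y), max(x, y))

    for _ in range(1000):
        A, B, C = rand_interval(), rand_interval(), rand_interval()
        assert dist(iadd(A, C), iadd(B, C)) == dist(A, B)   # property (2.4)
    print("translation invariance (2.4) holds on all samples")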
In all the basic structures under consideration, which are the sets ℝ, ℂ,
Iℝ, Iℂ, and the matrices with components of these sets, the multiplicative
unit e has a unique additive inverse −e. Under the condition (2.1) and (2.6)
respectively, therefore, −e is also the unique additive inverse in any discrete
computer representable subset. This allows the definition of the minus operator and subtraction in the computer representable subsets as shown in (2.3)
with all its consequences (see [3,4]).
A closer look at interval spaces is particularly interesting. Again we consider the sets ℝ and ℂ as well as the matrices Mℝ and Mℂ with components
of ℝ and ℂ. All these sets are ordered with respect to the order relation ≤.
If M denotes any one of these sets the concept of an interval is defined by

A = [a1, a2] := {a ∈ M | a1, a2 ∈ M, a1 ≤ a ≤ a2}.

In the set IM of all such intervals arithmetic operations can be defined,
see [1,3,4]. For all M ∈ {ℝ, ℂ, Mℝ, Mℂ} the elements of IM in general
are not computer representable and the arithmetic operations in IM are not
computer executable. Therefore, subsets N ⊆ M of computer representable
elements have to be chosen. An interval in IN is defined by

A = [a1, a2] := {a ∈ M | a1, a2 ∈ N, a1 ≤ a ≤ a2}.

Arithmetic operations in IN are defined by semimorphism, i.e. by (RG)
with the monotone and antisymmetric rounding ◇ : IM → IN which is
upwardly directed. ◇ is uniquely defined by these properties (R1), (R2),
(R3), and (R4). This process leads to computer executable operations in the
interval spaces IN of computer representable subsets of ℝ, ℂ, Mℝ and Mℂ.
In this treatise the inverse image of zero with respect to a rounding plays a
key role. Since the rounding ◇ : IM → IN is upwardly directed with respect
to set inclusion as an order relation, the inverse image of zero ◇⁻¹(0) can
only be zero itself. Thus the necessary and sufficient criterion (2.6) for the
existence of unique additive inverses evidently holds for IN. Among others
this establishes the fact with all its consequences that the unit interval [e, e]
has a unique additive inverse [−e, −e] in IN for all discrete subsets N of
M ∈ {ℝ, ℂ, Mℝ, Mℂ}.
Bibliography and Related Literature

1. Alefeld, G.; Herzberger, J.: An Introduction to Interval Computations.
Academic Press, New York, 1983 (ISBN 0-12-049820-0).
2. Kaucher, E.: Über metrische und algebraische Eigenschaften einiger beim numerischen Rechnen auftretender Räume. Dissertation, Universität Karlsruhe, 1973.
3. Kulisch, U.: Grundlagen des Numerischen Rechnens - Mathematische
Begründung der Rechnerarithmetik. Reihe Informatik, Band 19, Bibliographisches Institut, Mannheim/Wien/Zürich, 1976 (ISBN 3-411-01517-9).
4. Kulisch, U.; Miranker, W. L.: Computer Arithmetic in Theory and Prac-
tice. Academic Press, New York, 1981 (ISBN 0-12-428650-x).
5. Yohe, J.M.: Roundings in Floating-Point Arithmetic. IEEE Trans. on Comput-
ers, Vol. C-22, No.6, June 1973, pp. 577-586.
6. American National Standards Institute/Institute of Electrical and Electronic
Engineers: A Standard for Binary Floating-Point Arithmetic. ANSI/IEEE Std.
754-1985, New York, 1985 (reprinted in SIGPLAN 22, 2, pp. 9-25, 1987). Also
taken over as IEC Standard 559:1989.
7. American National Standards Institute/Institute of Electrical and Elec-
tronic Engineers: A Standard for Radix-Independent Floating-Point Arithmetic.
ANSI/IEEE Std. 854-1987, New York, 1987.
3. Interval Arithmetic Revisited

Summary.
This paper deals with interval arithmetic and interval mathematics.
Interval mathematics has been developed to a high standard during the
last few decades. It provides methods which deliver results with guaran-
tees. However, the arithmetic available on existing processors makes these
methods extremely slow. The paper reviews a number of basic methods
and techniques of interval mathematics in order to derive and focus on
those properties which by today's knowledge could effectively be supported
by the computer's hardware, by basic software, and by the programming
languages. The paper is not aiming for completeness. Unnecessary math-
ematical details, formalisms and derivations are left aside whenever possi-
ble. Particular emphasis is put on an efficient implementation of interval
arithmetic on computers.
Interval arithmetic is introduced as a shorthand notation and auto-
matic calculus to add, subtract, multiply, divide, and otherwise deal with
inequalities. Interval operations are also interpreted as special powerset or
set operations. The inclusion isotony and the inclusion property are cen-
tral and important consequences of this property. The basic techniques for
enclosing the range of function values by centered forms or by subdivi-
sion are discussed. The Interval Newton Method is developed as an always
(globally) convergent technique to enclose zeros of functions.
Then extended interval arithmetic is introduced. It allows division by
intervals that contain zero and is the basis for the development of the
extended Interval Newton Method. This is the major tool for computing
enclosures at all zeros of a function or of systems of functions in a given
domain. It is also the basic ingredient for many other important applica-
tions like global optimization, subdivision in higher dimensional cases or
for computing error bounds for the remainder term of definite integrals in
more than one variable. We also sketch the techniques of differentiation
arithmetic, sometimes called automatic differentiation, for the computa-
tion of enclosures of derivatives, of Taylor coefficients, of gradients, of
Jacobian or Hessian matrices.
The major final part of the paper is devoted to the question of how
interval arithmetic can effectively be provided on computers. This is an
essential prerequisite for its superior and fascinating properties to be more
widely used in the scientific computing community. With more appropri-
ate processors, rigorous methods based on interval arithmetic could be
comparable in speed with today's "approximate" methods. At processor
speeds of gigaFLOPS there remains no alternative but to furnish future
computers with the capability to control the accuracy of a computation at
least to a certain extent.

3.1 Introduction and Historical Remarks


In 1958 the Japanese mathematician Teruo Sunaga published a paper entitled
"Theory of an Interval Algebra and its Application to Numerical Analysis"
[62]. Sunaga's paper was intended to indicate a method of rigorous error
estimation alternative to the methods and ideas developed in J. v. Neumann
and H. H. Goldstine's paper on "Numerical Inverting of Matrices of High
Order". [48]
Sunaga's paper is not the first one using interval arithmetic in numerical
computing. However, several ideas which are standard techniques in inter-
val mathematics today are for the first time mentioned there in rudimentary
form. The structure of interval arithmetic is studied in Sunaga's paper. The
possibility of enclosing the range of a rational function by interval arith-
metic is discussed. The basic idea of what today is called the Interval New-
ton Method can be found there, and also the methods of obtaining rigorous
bounds in the cases of numerical integration of definite integrals or of initial
value problems of ordinary differential equations by evaluating the remain-
der term of the integration routine in interval arithmetic are indicated in
Sunaga's paper. Under "Conclusion" Sunaga's paper ends with the state-
ment "that a future problem will be to revise the structure of the automatic
digital computer from the standpoint of interval calculus" .
Today Interval Analysis or Interval Mathematics appears as a mature
mathematical discipline. However, the last statement of Sunaga's paper still
describes a "future problem". The present paper is intended to help close this
gap.
This paper is supposed to provide an informal, easily readable introduc-
tion to basic features, properties and methods of interval arithmetic. In par-
ticular it is intended to deepen the understanding and clearly derive those
properties of interval arithmetic which should be supported by computer
hardware, by basic software, and by programming languages. The paper is
not aiming for completeness. Unnecessary mathematical details, formalisms
and derivations are put aside, whenever possible.
Interval mathematics has been developed to a high level during the last
decades at only a few academic sites. Problem solving routines which deliver
validated results are actually available for all the standard problems of nu-
merical analysis. Many applications have been solved using these tools. Since
all these solutions are mathematically proven to be correct, interval mathe-
matics has occasionally been called the Mathematical Numerics in contrast to
Numerical Mathematics, where results are sometimes merely speculative. In-
terval mathematics is not a trivial subject which can just be applied naively.
It needs education, training and practice. The author is convinced that with
the necessary skills interval arithmetic can be useful, and can be successfully
applied to any serious scientific computing problem.
In spite of all its advantages it is a fact that interval arithmetic is not
widely used in the scientific computing community as a whole. The author
sees several reasons for this which should be discussed briefly. A broad un-
derstanding of these reasons is an essential prerequisite for further progress.
Forty years of nearly exclusive use of floating-point arithmetic in scientific
computing has formed and now dominates our thinking. Interval arithmetic
requires a much higher level of abstraction than languages like Fortran-77,
Pascal or C provide. If every single interval operation requires a procedure
call, the user's energy and attention are forced down to the level of coding,
and are dissipated there.
The development and implementation of adequate and powerful program-
ming environments like PASCAL-XSC [17,26,27] or ACRITH-XSC [77] re-
quires a large body of experienced and devoted scientists (about 20 man
years for each) which is not easy to muster. In such environments interval
arithmetic, the elementary functions for the data types real and interval,
a long real and a long real interval arithmetic including the corresponding
elementary functions, vector and matrix arithmetic, differentiation and Tay-
lor arithmetic both for real and interval data are provided by the run time
system of the compiler. All operations can be called by the usual mathemat-
ical operator symbols and are of maximum accuracy. This releases the user
from coding drudgery. This means, for instance, that an enclosure of a high
derivative of a function over an interval - needed for step size control and
to guarantee the value of a definite integral or a differential equation within
close bounds - can be computed by the same notation used to compute the
real function value. The compiler interprets the operators according to the
type specification of the data. This level of programming is essential indeed.
It opens a new era of conceptual thinking for mathematical numerics.
A second reason for the low acceptance of interval arithmetic in the sci-
entific computing community is simply the prejudices which are often the
result of superficial experiments. Sentences like the following appear again
and again in the literature: "The error bounds are overly conservative; they
quickly grow to the computer representation of [-00, +00]", "Interval arith-
metic is expensive because it takes twice the storage and at least twice the
work of ordinary arithmetic."
Such sentences are correct for what is called "naive interval arithmetic".
Interval arithmetic, however, should not be applied naively. Its properties
must be studied and understood first, before it can be applied successfully.
Many program packages have been developed using interval arithmetic, which
deliver close bounds for their solutions. In no case are these bounds obtained
by substituting intervals in a conventional floating-point algorithm. Interval
arithmetic is an extension of floating-point arithmetic, not a replacement
for it. Sophisticated use of interval arithmetic often leads to safe and bet-
ter results. There are many applications where the extended tool delivers a
guaranteed answer faster than the restricted tool of floating-point arithmetic
delivers an "approximation". Examples are numerical integration (because
of automatic step size control) and global optimization (intervals bring the
continuum on the computer). One interval evaluation of a function over an
interval may suffice to prove that the function definitively has no zero in that
interval, while 1000 floating-point evaluations of the function in the interval
could not provide a safe answer. Interval methods that have been developed
for systems of ordinary differential and integral equations may be a bit slower.
But they deliver not just unproven numbers. Interval methods deliver close
bounds and prove existence and uniqueness of the solution within the com-
puted bounds. The bounds include both discretization and rounding errors.
This can save a lot of computing time by avoiding experimental reruns.
The main reason why interval methods are sometimes slow is already
expressed in the last statement of Sunaga's early article. It's not that the
methods are slow. It is the missing hardware support which makes them
slow. While conventional floating-point arithmetic nowadays is provided by
fast hardware, interval arithmetic has to be simulated by software routines
based on integer arithmetic. The IEEE arithmetic standard, adopted in 1985,
seems to support interval arithmetic. It requires the basic four arithmetic
operations with rounding to nearest, towards zero, and with rounding down-
wards and upwards. The latter two are needed for interval arithmetic. But
processors that provide IEEE arithmetic separate the rounding from the op-
eration, which proves to be a severe drawback. In a conventional floating-
point computation this does not cause any difficulties. The rounding mode
is set only once. Then a large number of operations is performed with this
rounding mode. However, when interval arithmetic is performed the rounding
mode has to be switched very frequently. The lower bound of the result of
every interval operation has to be rounded downwards and the upper bound
rounded upwards. Thus, the rounding mode has to be reset for every arith-
metic operation. If setting the rounding mode and the arithmetic operation
are equally fast this slows down interval arithmetic unnecessarily by a fac-
tor of two in comparison to conventional floating-point arithmetic. On all
existing commercial processors, however, setting the rounding mode takes
a multiple (three, ten, twenty and even more) of the time that is needed
for the arithmetic operation. Thus an interval operation is unnecessarily at
least eight (or twenty and even more) times slower than the corresponding
floating-point operation. The rounding should be part of the arithmetic op-
eration as required by the theory of computer arithmetic [33,34]. Every one
of the rounded operations □o, ▽o, △o, o ∈ {+, −, ∗, /}, with rounding to nearest,
downwards or upwards should be equally fast and executed in a single cycle.
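The following sketch (an added illustration with assumed helper names, not a proposal from the text) shows the kind of software workaround forced on current processors: the rounded-to-nearest result is widened by one unit in the last place in the required direction, which keeps the bounds rigorous but costs an extra operation per bound and loses up to one ulp of sharpness compared with true downward and upward rounding:

    import math

    def add_down(x, y):
        return math.nextafter(x + y, -math.inf)     # a lower bound for the exact sum

    def add_up(x, y):
        return math.nextafter(x + y, math.inf)      # an upper bound for the exact sum

    def interval_add(A, B):
        """[a1,a2] + [b1,b2] with both bounds rounded outwards."""
        return (add_down(A[0], B[0]), add_up(A[1], B[1]))

    print(interval_add((0.1, 0.1), (0.2, 0.2)))     # a thin, rigorous enclosure of 0.1 + 0.2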
The IEEE arithmetic standard requires that these 12 operations for float-
ing-point numbers give computed results that coincide with the rounded exact
result of the operation for any operands [78]. The standard was developed
around 1980 as a standard for microprocessors at a time when the typical
microprocessor was the 8086 running at 2 MHz and serving a memory space
of 64 KB. Since that time the speed of microprocessors has been increased
by a factor of more than 1000. IEEE arithmetic is now even provided by
supercomputers, the speed of which is still faster by magnitudes. Advances
in computer technology are now so profound that the arithmetic capability
and repertoire of computers can and should be expanded. In contrast to
IEEE arithmetic a general theory of advanced computer arithmetic requires
that all arithmetic operations in the usual product spaces of computation:
the complex numbers, real and complex vectors, real and complex matrices,
real and complex intervals as well as real and complex interval vectors and
interval matrices are provided on the computer by a general mathematical
mapping principle which is called a semimorphism. For definition see [33,34].
This guarantees, among other things, that all arithmetic operations in all
these spaces deliver a computed result which differs from the exact result of
the operation by (no or) only a single rounding.
A careful analysis within the theory of computer arithmetic shows that the
arithmetic operations in the computer representable subsets of these spaces
can be realized on the computer by a modular technique provided fifteen
fundamental operations are made available on a low level, possibly by fast
hardware routines. These fifteen operations are

□+ , □− , □∗ , □/ , □· ,
▽+ , ▽− , ▽∗ , ▽/ , ▽· ,
△+ , △− , △∗ , △/ , △· .

Here □o, o ∈ {+, −, ∗, /}, denotes operations using a monotone and antisymmetric rounding □ from the real numbers onto the subset of floating-point
numbers, such as rounding to the nearest floating-point number. Likewise
▽o and △o, o ∈ {+, −, ∗, /}, denote the operations using the monotone
rounding downwards ▽ and upwards △ respectively. □· , ▽· and △· denote scalar products with only a single rounding. That is, if a = (a_i) and
b = (b_i) are vectors with floating-point components a_i, b_i, then a ◯ b :=
◯(a_1 ∗ b_1 + a_2 ∗ b_2 + ... + a_n ∗ b_n), ◯ ∈ {□, ▽, △}. The multiplication
and addition signs on the right hand side of the assignment denote exact
multiplication and summation in the sense of real numbers.
Of these 15 fundamental operations above, traditional numerical methods
use only the four operations □+, □−, □∗ and □/. Conventional interval arithmetic
employs the eight operations ▽+, ▽−, ▽∗, ▽/ and △+, △−, △∗, △/. These eight operations are computer equivalents of the operations for real intervals; they provide interval arithmetic. The IEEE arithmetic standard requires 12 of these
15 fundamental operations: □o, ▽o, △o, o ∈ {+, −, ∗, /}. Generally speaking, interval arithmetic brings guarantees into computation, while the three scalar
products □· , ▽· and △· bring high accuracy.
A detailed discussion of the implementation of the three scalar products
on all kinds of computers is given in the first chapter. Basically the products
ai * bi are accumulated in fixed-point arithmetic with or without a single
rounding at the very end of the accumulation. In contrast to accumulation in
floating-point arithmetic, fixed-point accumulation is error free. Apart from
this important property it is simpler than accumulation in floating-point and
it is even faster. Accumulations in floating-point arithmetic are very sensitive
with respect to cancellation.
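A tiny illustration of this sensitivity (the data are chosen here for demonstration and are not from the text): the same scalar product once accumulated term by term in double precision, and once accumulated exactly with a single rounding at the very end, as the operations □· , ▽· , △· require:

    from fractions import Fraction

    a = [1e16,  1.0, -1e16]
    b = [1.0,   1.0,  1.0]

    naive = 0.0
    for x, y in zip(a, b):
        naive += x * y               # every += may round; the contribution 1.0 is lost

    exact = sum(Fraction(x) * Fraction(y) for x, y in zip(a, b))
    print(naive)                     # 0.0  -- floating-point accumulation cancels
    print(float(exact))              # 1.0  -- exact accumulation, one rounding at the end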
So accumulations should be done in fixed-point arithmetic whenever pos-
sible whether the data are integers, floating-point numbers or products of
two floating-point numbers. An arithmetic operation which can always be
performed correctly on a digital computer should not be simulated by a rou-
tine which can easily fail in critical situations. Many real life and expensive
accidents have been attributed to loss of numeric accuracy in a floating-point
calculation or to other arithmetic failures. Examples are: bursting of a large
turbine under test due to wrongly predicted eigenvalues; failure of early space
shuttle retriever arms under space conditions; disastrous homing failure on
ground to air missile missions; software failure in the Ariane 5 guidance pro-
gram.
Advanced computer arithmetic requires a correct implementation of all
arithmetic operations in the usual product spaces of computations. This includes interval arithmetic and in particular the three scalar products □· , ▽·
and △· . This confronts us with another severe slowdown of interval arithmetic.
All commercial processors that provide IEEE arithmetic only deliver a
rounded product to the outside world in the case of multiplication. Com-
putation of an accurate scalar product requires products of the full double
length. So these products have to be simulated on the processor. This slows
down the multiplication by a factor of up to 10 in comparison to a rounded
hardware multiplication. In a software simulation of the accurate scalar prod-
uct the products of double length then have to be accumulated in fixed-point
mode. This process is again slower by a factor of about 5 in comparison to
a possibly wrong hardware accumulation of the products in floating-point
arithmetic. Thus in summary a factor of at least 50 is the penalty for an
accurate computation of the scalar product on existing processors. This is
too much to be readily accepted by the user. In contrast to this a hardware
implementation of the optimal scalar product could even be faster than a
conventional implementation in floating-point arithmetic.
Another severe shortcoming which makes interval arithmetic slow is the
fact that no reasonable interface to the programming languages has been
accepted by the standardization committees so far. Operator overloading is
not adequate for calling all fifteen operations □o, ▽o, △o, o ∈ {+, −, ∗, /, ·}, in a
high level programming language. A general operator concept is necessary for
ease of programming (three real operations for +, -, *, / and the dot product
with three different roundings) otherwise clumsy and slow function calls have
to be used to call different rounded arithmetic operations.
All these factors which make interval arithmetic on existing processors
slow are quite well known. Nevertheless, they are generally not taken into
account when the speed of interval methods is judged. It is, however, important that these factors are well understood. Real progress depends critically
on an understanding of their details. Interval methods are not slow per se. It
is the actual available arithmetic on existing processors which makes them
slow. With better processor and language support, rigorous methods could
be comparable in speed to today's "approximate" methods. Interval mathe-
matics or mathematical numerics has been developed to a level where already
today library routines could speedily deliver validated bounds instead of just
approximations for small and medium size problems. This would ease the life
of many users dramatically.
Future computers must be equipped with fast and effective interval arith-
metic. At processor speeds of gigaFLOPS it is almost the only way to check
the accuracy of a computation. Computer-generated graphics requires vali-
dation techniques in many cases.
After Sunaga's early paper the publication of Ramon E. Moore's book
on interval arithmetic in 1966 [44] certainly was another milestone in the
development of interval arithmetic. Moore's book is full of unconventional
ideas which were out of the mainstream of numerical analysis of that time.
To many colleagues the book appeared as an utopian dream. Others tried
to carry out his ideas with little success in general. Computers were very
very slow at that time. Today Moore's book appears as an exposition of
extraordinary intellectual and creative power. The basic ideas of a great many
well established methods of validation numerics can be traced back to Moore's
book.
We conclude this introduction with a brief sketch of the development of
interval arithmetic at the author's institute. Already by 1967 an ALGOL-
60 extension implemented on a Zuse Z 23 computer provided operators and
a number of elementary functions for a new data type interval [69,70]. In
1968/69 this language was implemented on a more powerful computer, an
Electrologica X8. To speed up the arithmetic, the hardware of the processor
was extended by the four arithmetic operations with rounding downwards
▽o, o ∈ {+, −, ∗, /}. Operations with rounding upwards were produced by
use of the relation △(a) = −▽(−a). Many early interval methods have been
developed using these tools. Based on this experience a book [5] was written
by two collaborators of that time. The English translation which appeared
in 1983 is still a standard monograph on interval arithmetic [6].
At about 1969 the author became aware that interval and floating-point
arithmetic basically follow the same mathematical mapping principles, and
can be subsumed by a general mathematical theory of what is called advanced
computer arithmetic in this paper. The basic assumption is that all arithmetic
operations on computers (for real and complex numbers, real and complex
intervals as well as for vectors and matrices over these four basic data types)
should be defined by four simple rules which are called a semimorphism. This
guarantees the best possible answers for all these arithmetic operations. A
book on the subject was published in 1976 [33] and the German company
Nixdorf funded an implementation of the new arithmetic. At that time a Z-
80 microprocessor with 64 KB main memory had to be used. The result was a
PASCAL extension called PASCAL-SC, published in [37,38]. The language
provides about 600 predefined arithmetic operations for all the data types
mentioned above and a number of elementary functions for the data types
real and interval. The programming convenience of PASCAL-SC allowed a
small group of collaborators to implement a large number of problem solving
routines with automatic result verification within a few months. All this work
was exhibited at the Hannover fair in March 1980 with the result that Nixdorf
donated a number of computers to the Universität Karlsruhe. This allowed
the programming education at Universität Karlsruhe to be decentralized from
the summer of 1980. PASCAL-SC was the proof that advanced computer
arithmetic need not be restricted to the very large computers. It had been
realized on a microprocessor. When the PC appeared on the scene in 1982 it
looked poor compared with what we had already two years earlier. But the
PASCAL-SC system was never marketed.
In 1978 an English version of the theoretical foundation of advanced com-
puter arithmetic was prepared during a sabbatical of the author jointly with
W. L. Miranker at the IBM Research Center at Yorktown Heights. It ap-
peared as a book in 1981 [34].
In May 1980 IBM became aware of the decentralized programming educa-
tion with PASCAL-SC at the Universität Karlsruhe. This was the beginning
of nearly ten years of close cooperation with IBM. We jointly developed and
implemented a Fortran extension corresponding to the PASCAL extension
with a large number of problem solving routines with automatic result veri-
fication [75-77].
In 1980 IBM had only the /370 architecture on the market. So we had
to work for this architecture. IBM supported the arithmetic on an early
processor (4361 in 1983) by microcode and later by VLSI design. Everything
we developed for IBM was offered on the market as IBM program products
in several versions between 1983 and 1989. But the products did not sell
in the quantities IBM had expected. During the 1980s scientific computing
had moved from the old mainframes to workstations and supercomputers. So
the final outcome of these wonderful products was the same as for all the
other earlier attempts to establish interval arithmetic effectively. With the
next processor generation or a new language standard work for a particular
processor loses its attraction and its value.
Nevertheless all these developments have contributed to the high standard
attained by interval mathematics or mathematical numerics today. What we
have today is a new version of PASCAL-SC, called PASCAL-XSC [26,27,29],
with fast elementary functions and a corresponding C++ extension called C-
XSC [28]. Both languages are translated into C so that they can be used on
nearly all platforms. The arithmetic is implemented in software in C with all
the regrettable consequences with respect to speed discussed earlier. Toolbox
publications with problem solving routines are available for both languages
[17,18,31].
Of course, much valuable work on the subject had been done at other
places as well. International Conferences where new results can be presented
and discussed are held regularly.
After completion of this paper Sun Microsystems announced an interval
extension of Fortran 95 [83]. With this new product and compiler, interval
arithmetic is now available on computers which are widespread.
As Teruo Sunaga did in 1958 and many others after him, I am looking
forward to, expect, and eagerly await a revision of the structure of the digital
computer for better support of interval arithmetic.

3.2 Interval Arithmetic, a Powerful Calculus to Deal with Inequalities
Problems in technology and science are often described by an equation or
a system of equations. Mathematics is used to manipulate these equations
in order to obtain a solution. The Gauss algorithm, for instance, is used to
compute the solution of a system of linear equations by adding, subtracting,
multiplying and dividing equations in a systematic manner. Newton's method
is used to compute approximately the location of a zero of a nonlinear
function or of a system of such functions.
Data are often given by bounds rather than by simple numbers. Bounds
are expressed by inequalities. To compute bounds for problems derived from
given data requires a systematic calculus to deal with inequalities. Interval
arithmetic provides this calculus. It supplies the basic rules for how to add,
subtract, multiply, divide, and manipulate inequalities in a systematic man-
ner: Let bounds for two real numbers a and b be given by the inequalities
a1 ≤ a ≤ a2 and b1 ≤ b ≤ b2. Addition of these inequalities leads to bounds for the sum a + b:

a1 + b1 ≤ a + b ≤ a2 + b2.

The inequality for b can be reversed by multiplication with −1: −b2 ≤ −b ≤ −b1. Addition to the inequality for a then delivers the rule for the subtraction of two inequalities:

a1 − b2 ≤ a − b ≤ a2 − b1.
Interval arithmetic provides a shorthand notation for these rules, suppressing the ≤ symbols. We simply identify the inequality a1 ≤ a ≤ a2 with the closed and bounded real interval [a1, a2]. The rules for addition and subtraction of two such intervals now read:

[a1, a2] + [b1, b2] = [a1 + b1, a2 + b2],   (3.1)
[a1, a2] − [b1, b2] = [a1 − b2, a2 − b1].   (3.2)
The rule for multiplication of two intervals is more complicated. Nine cases are to be distinguished depending on whether a1, a2, b1, b2 are less than or greater than zero. For division the situation is similar. Since we shall build upon these rules later they are cited here. For a detailed derivation see [33,34]. In the tables the order relation ≤ is used for intervals. It is defined by

[a1, a2] ≤ [b1, b2] :⇔ a1 ≤ b1 ∧ a2 ≤ b2.
Table 3.1. The 9 cases for the multiplication of two intervals or inequalities

Nr.  A = [a1, a2]   B = [b1, b2]   A * B
 1   A ≥ [0,0]      B ≥ [0,0]      [a1b1, a2b2]
 2   A ≥ [0,0]      B ≤ [0,0]      [a2b1, a1b2]
 3   A ≥ [0,0]      0 ∈ int(B)     [a2b1, a2b2]
 4   A ≤ [0,0]      B ≥ [0,0]      [a1b2, a2b1]
 5   A ≤ [0,0]      B ≤ [0,0]      [a2b2, a1b1]
 6   A ≤ [0,0]      0 ∈ int(B)     [a1b2, a1b1]
 7   0 ∈ int(A)     B ≥ [0,0]      [a1b2, a2b2]
 8   0 ∈ int(A)     B ≤ [0,0]      [a2b1, a1b1]
 9   0 ∈ int(A)     0 ∈ int(B)     [min(a1b2, a2b1), max(a1b1, a2b2)]
                                                                      (3.3)

Table 3.2. The 6 cases for the division of two intervals or inequalities

Nr.  A = [a1, a2]   B = [b1, b2]   A/B
 1   A ≥ [0,0]      0 < b1 ≤ b2    [a1/b2, a2/b1]
 2   A ≥ [0,0]      b1 ≤ b2 < 0    [a2/b2, a1/b1]
 3   A ≤ [0,0]      0 < b1 ≤ b2    [a1/b1, a2/b2]
 4   A ≤ [0,0]      b1 ≤ b2 < 0    [a2/b1, a1/b2]
 5   0 ∈ int(A)     0 < b1 ≤ b2    [a1/b1, a2/b1]
 6   0 ∈ int(A)     b1 ≤ b2 < 0    [a2/b2, a1/b2]
                                                                      (3.4)

In Tables 3.1 and 3.2, int(A) denotes the interior of A, i.e. c ∈ int(A) means a1 < c < a2. In the cases 0 ∈ B the division A/B is not defined.
As a result of these rules it can be stated that in the case of real intervals the result of an interval operation A ∘ B, for all ∘ ∈ {+, −, *, /}, can be expressed in terms of the bounds of the interval operands (with the exception for A/B noted above). In order to get each of these bounds, typically only one real operation is necessary. Only in case 9 of Table 3.1, 0 ∈ int(A) and 0 ∈ int(B), do two products have to be calculated and compared.

Whenever in Tables 3.1 and 3.2 both operands are comparable with the interval [0,0] with respect to ≤, ≥, < or >, the result of the interval operation A * B or A/B contains both bounds of A and B. If one or both of the operands A or B, however, contains zero as an interior point, then the result A * B or A/B is expressed by only three of the four bounds of A and B. In all these cases (3, 6, 7, 8, 9) in Table 3.1, the bound which is missing in the expression for the result can be shifted towards zero without changing the result of the operation A * B. Similarly, in cases 5 and 6 in Table 3.2, the bound of B which is missing in the expression for the resulting interval can be shifted toward +∞ (resp. −∞) without changing the result of the operation. This shows a certain lack of sensitivity of interval arithmetic, or of computing with inequalities, whenever in the cases of multiplication and division one of the operands contains zero as an interior point.

In all these cases (3, 6, 7, 8, 9 of Table 3.1 and 5, 6 of Table 3.2) the result of A * B or A/B also contains zero, and the formulas show that the result tends toward the zero interval if the operands that contain zero do likewise. In the limit case, when the operand that contains zero has become the zero interval, no such imprecision is left. This suggests that within arithmetic expressions interval operands that contain zero as an interior point should be made as small in diameter as possible.
We illustrate the efficiency of this calculus for inequalities by a simple
example. See [4]. Let x = Ax + b be a system of linear equations in fixed
point form with a contracting real matrix A and a real vector b, and let the
interval vector X be a rough initial enclosure of the solution x* EX. We can
now formally write down the Jacobi method, the Gauss-Seidel method, a re-
laxation method or some other iterative scheme for the solution of the linear
system. In these formulas we then interpret all components of the vector x
as being intervals. Doing so we obtain a number of iterative methods for the
computation of enclosures of linear systems of equations. Further iterative
schemes then can be obtained by taking the intersection of two successive
approximations. If we now decompose all these methods in formulas for the
bounds of the intervals we obtain a large number of methods for the compu-
tation of bounds for the solution of linear systems which have been derived by
well-known mathematicians painstakingly about 40 years ago, see [14]. The
calculus of interval arithmetic reproduces these and other methods in the
simplest way. The user does not have to take care of the many case distinc-
tions occurring in the matrix vector multiplications. The computer executes
them automatically by the preprogrammed calculus. Also the rounding errors are enclosed. The calculus evolves its own dynamics.

3.3 Interval Arithmetic as Executable Set Operations

The rules (3.1), (3.2), (3.3), and (3.4) can also be interpreted as arithmetic operations for sets. As such they are special cases of general set operations. Further important properties of interval arithmetic can immediately be obtained via set operations. Let M be any set with a dyadic operation ∘ : M × M → M defined for its elements. The powerset IP M of M is defined as the set of all subsets of M. The operation ∘ in M can be extended to the powerset IP M by the following definition

A ∘ B := {a ∘ b | a ∈ A ∧ b ∈ B} for all A, B ∈ IP M.   (3.5)

The least element in IP M with respect to set inclusion as an order relation is the empty set. The greatest element is the set M. We denote the empty set by the character string [ ]. The empty set is a subset of any set. Any arithmetic operation on the empty set produces the empty set.
The following properties are obvious and immediate consequences of (3.5):

A ⊆ B ∧ C ⊆ D ⇒ A ∘ C ⊆ B ∘ D for all A, B, C, D ∈ IP M,   (3.6)

and in particular

a ∈ A ∧ b ∈ B ⇒ a ∘ b ∈ A ∘ B for all A, B ∈ IP M.   (3.7)

(3.6) is called the inclusion isotony (or inclusion monotony). (3.7) is called
the inclusion property.
By use of parentheses these rules can immediately be extended to expres-
sions with more than one arithmetic operation, e.g.

A <;;; B 1\ C <;;; D 1\ E <;;; F =} A 0 C <;;; BoD =} (A 0 C) 0 E <;;; (B 0 D) 0 F,

and so on. Moreover, if more than one operation is defined in M this chain
of conclusions also remains valid for expressions containing several different
operations.
If we now replace the general set M by the set of real numbers, (3.5), (3.6), and (3.7) hold in particular for the powerset IP IR of the real numbers IR. This is the case for all operations ∘ ∈ {+, −, *, /}, if we assume that in case of division 0 is not an element of the denominator, for instance, 0 ∉ B in (3.5).
The set IIR of closed and bounded intervals over IR is a subset of IP IR. Thus (3.5), (3.6), and (3.7) are also valid for elements of IIR. The set IIR
with the operations (3.5), ∘ ∈ {+, −, *, /}, is an algebraically closed¹ subset within IP IR. That is, if (3.5) is performed for two intervals A, B ∈ IIR the result is always an interval again. This holds for all operations ∘ ∈ {+, −, *, /} with 0 ∉ B in case of division. This property is a simple consequence of the fact that for all arithmetic operations ∘ ∈ {+, −, *, /}, a ∘ b is a continuous function of both variables. A ∘ B is the range of this function over the product set A × B. Since A and B are closed intervals, A × B is a simply connected, bounded and closed subset of IR². In such a region the continuous function a ∘ b takes a maximum and a minimum as well as all values in between. Therefore

A ∘ B = [min{a ∘ b | a ∈ A, b ∈ B}, max{a ∘ b | a ∈ A, b ∈ B}] for all ∘ ∈ {+, −, *, /},

provided that 0 ∉ B in case of division.

¹ As the integers are within the reals for ∘ ∈ {+, −, *}.


Consideration of (3.5), (3.6), and (3.7) for intervals of IIR leads to the crucial properties of all applications of interval arithmetic. Because of the great importance of these properties we repeat them here explicitly. Thus we obtain for all operations ∘ ∈ {+, −, *, /}:

The set definition of interval arithmetic:

A ∘ B := {a ∘ b | a ∈ A ∧ b ∈ B} for all A, B ∈ IIR,
0 ∉ B in case of division,   (3.8)

the inclusion isotony (or inclusion monotony):

A ⊆ B ∧ C ⊆ D ⇒ A ∘ C ⊆ B ∘ D for all A, B, C, D ∈ IIR,
0 ∉ C, D in case of division,   (3.9)

and in particular the inclusion property:

a ∈ A ∧ b ∈ B ⇒ a ∘ b ∈ A ∘ B for all A, B ∈ IIR,
0 ∉ B in case of division.   (3.10)

If for M = IR in (3.5) the number of elements in A or B is infinite, the operations are effectively not executable because infinitely many real operations would have to be performed. If A and B are intervals of IIR, however, the situation is different. In general A or B or both will again contain infinitely many real numbers. The result of the operation (3.8), however, can now be computed by a finite number of operations with real numbers, namely with the bounds of A and B. For all operations ∘ ∈ {+, −, *, /} the result is obtained by the explicit formulas (3.1), (3.2), (3.3), and (3.4), [33,34].
For intervals A = [a1, a2] and B = [b1, b2] the formulas (3.1), (3.2), (3.3), and (3.4) can be summarized by

A ∘ B = [min_{i,j=1,2}(ai ∘ bj), max_{i,j=1,2}(ai ∘ bj)] for all ∘ ∈ {+, −, *, /},   (3.11)


with 0 ∉ B in case of division.
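Formula (3.11) condenses the case tables into one rule: every bound of A ∘ B is attained at bounds of A and B. A small illustrative C++ sketch of this form (ad hoc names, plain doubles without outward rounding, and 0 ∉ B assumed for division) could look as follows; it reproduces, for instance, case 7 of Table 3.1 and case 5 of Table 3.2.

    // Formula (3.11): each bound of A o B is attained at the bounds of A and B.
    #include <algorithm>
    #include <cstdio>

    struct Interval { double lo, hi; };

    Interval apply(Interval A, Interval B, double (*op)(double, double)) {
        double v[4] = { op(A.lo, B.lo), op(A.lo, B.hi), op(A.hi, B.lo), op(A.hi, B.hi) };
        return { *std::min_element(v, v + 4), *std::max_element(v, v + 4) };
    }

    int main() {
        Interval A{ -1.0, 2.0 }, B{ 3.0, 4.0 };
        Interval P = apply(A, B, [](double a, double b) { return a * b; });
        Interval Q = apply(A, B, [](double a, double b) { return a / b; });
        std::printf("A*B = [%g, %g]   A/B = [%g, %g]\n", P.lo, P.hi, Q.lo, Q.hi);
        // prints A*B = [-4, 8]   A/B = [-0.333333, 0.666667]
    }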


Since interval operations are particular powerset operations, the inclusion
isotony and the inclusion property also hold for expressions with more than
one arithmetic operation.
In programming languages the concept of an arithmetic expression is usu-
ally defined to be a little more general. Besides constants and variables ele-
mentary functions (sometimes called standard functions) like sqr, sqrt, sin,
cos, exp, In, tan, ... may also be elementary ingredients. All these are put
together with arithmetic operators and parentheses into the general concept
of an arithmetic expression. This construct is illustrated by the syntax di-
agram of Fig. 3.1. Therein solid lines are to be traversed from left to right
and from top to bottom. Dotted lines are to be traversed oppositely, i.e.
from right to left and from bottom to top. In Fig. 3.1 the syntax variable
REAL FUNCTION merely represents a real arithmetic expression hidden in
a subroutine.
Now we define the general concept of an arithmetic expression for the
new data type interval by exchanging the data type real in Fig. 3.1 for the
new data type interval. This results in the syntax diagram for INTERVAL
EXPRESSION shown in Fig. 3.2. In Fig. 3.2 the syntax variable INTERVAL
FUNCTION represents an interval expression hidden in a subroutine.

[Syntax diagram omitted. It composes a REAL EXPRESSION from REAL CONSTANTs, REAL VARIABLEs, REAL ELEMENTARY FUNCTIONs, and REAL FUNCTIONs, combined by arithmetic operators and parentheses.]

Fig. 3.1. Syntax diagram for REAL EXPRESSION

In the syntax diagram for INTERVAL EXPRESSION in Fig. 3.2 the concept of an interval elementary function is not yet defined. We simply define it as the range of function values taken over an interval (out of the domain of definition D(f) of the function). In case of a real function f we denote the range of values over the interval [a1, a2] by

f([a1, a2]) := {f(a) | a ∈ [a1, a2]}.

[Syntax diagram omitted. It composes an INTERVAL EXPRESSION from INTERVAL CONSTANTs, INTERVAL VARIABLEs, INTERVAL ELEMENTARY FUNCTIONs, and INTERVAL FUNCTIONs, combined by arithmetic operators and parentheses.]

Fig. 3.2. Syntax diagram for INTERVAL EXPRESSION

For instance, for the elementary function sqr:

sqr([a1, a2]) = [min(a1², a2²), max(a1², a2²)]   for 0 ∉ [a1, a2],
sqr([a1, a2]) = [0, max(a1², a2²)]               for 0 ∈ [a1, a2].

For non-monotonic functions the computation of the range of values over an interval [a1, a2] requires the determination of the global minimum and maximum of the function in the interval [a1, a2]. For the usual elementary functions, however, these are known. With this definition of elementary functions for intervals the key properties of interval arithmetic, the inclusion monotony (3.9) and the inclusion property (3.10), extend immediately to elementary functions and with this to interval expressions as defined in Fig. 3.2:

A ⊆ B ⇒ f(A) ⊆ f(B), with A, B ∈ IIR   (inclusion isotone),

and in particular for a ∈ IR and A ∈ IIR:

a ∈ A ⇒ f(a) ∈ f(A)   (inclusion property).

We summarize the development so far by stating that interval arithmetic expressions are generally inclusion isotone and that the inclusion property holds. These are the key properties of interval arithmetic. They give interval arithmetic its raison d'être. To start with, they provide the possibility of enclosing imprecise data within bounds and then continuing the computation with these bounds. This always results in guaranteed enclosures.
As the next step we define a (computable) real function simply by a real
arithmetic expression. We need the concept of an interval evaluation of a
real function. It is defined as follows: In the arithmetic expression for the
function all operands are replaced by intervals and all operations by interval
operations (where all intervals must be within the domain of definition of the
real operands). This is just the step from Fig. 3.1 to Fig. 3.2. What is obtained
is an interval expression. Then all arithmetic operations are performed in
interval arithmetic. For a real function f(a) we denote the interval evaluation
over the interval A by F(A).
With this definition we can immediately conclude that interval evaluations
of (computable) real functions are inclusion isotone and that the inclusion
property holds in particular:

A ⊆ B ⇒ F(A) ⊆ F(B)   inclusion isotone,   (3.12)

a ∈ A ⇒ f(a) ∈ F(A)   inclusion property.   (3.13)


These concepts immediately extend in a natural way to functions of several real variables. In this case in (3.13) a is an n-tuple, a = (a1, a2, ..., an), and A and B are higher dimensional intervals, e.g. A = (A1, A2, ..., An).
Remark: Two different real arithmetic expressions can define equivalent real
functions, for instance:

f(x) = x(x − 1) and g(x) = x² − x.

Evaluation of the two expressions for a real number always leads to the same
real function value. In contrast to this, interval evaluation of the two expres-
sions may lead to different intervals. In the example we obtain for the interval
A = [1,2]:

F(A) = [1,2]([1,2] + [−1,−1]) = [1,2][0,1] = [0,2],
G(A) = [1,2][1,2] − [1,2] = [1,4] − [1,2] = [−1,3].
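This dependency effect can be reproduced with a few lines of code. The following illustrative C++ fragment (ad hoc Interval type, plain doubles; exact here because only small integers occur) evaluates both expressions over A = [1, 2]:

    // Two equivalent real expressions, two different interval evaluations.
    #include <algorithm>
    #include <cstdio>

    struct Interval { double lo, hi; };

    Interval sub(Interval A, Interval B) { return { A.lo - B.hi, A.hi - B.lo }; }
    Interval mul(Interval A, Interval B) {
        double p[4] = { A.lo*B.lo, A.lo*B.hi, A.hi*B.lo, A.hi*B.hi };
        return { *std::min_element(p, p + 4), *std::max_element(p, p + 4) };
    }

    int main() {
        Interval A{ 1.0, 2.0 }, one{ 1.0, 1.0 };
        Interval F = mul(A, sub(A, one));     // F(A) = A(A - 1)   -> [0, 2]
        Interval G = sub(mul(A, A), A);       // G(A) = A*A - A    -> [-1, 3]
        std::printf("F(A) = [%g, %g]   G(A) = [%g, %g]\n", F.lo, F.hi, G.lo, G.hi);
        // Both enclose the true range [0, 2]; G overestimates because the two
        // occurrences of A are treated as independent intervals.
    }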

Although an interval evaluation of a real function is very naturally defined via the arithmetic expression of the function, a closer look at the syntax diagram
in Fig. 3.2 reveals major problems that appear when such evaluations are to
be coded. The widely used programming languages do not provide the nec-
essary ease of programming. An interval evaluation of a real function should
be performable as easily as an execution of the corresponding expression in
real arithmetic. For that purpose the programming language
1. must allow an operator notation A ∘ B for the basic interval operations ∘ ∈ {+, −, *, /}, i.e. operator overloading must be provided,
2. the concept of a function subroutine must not be restricted to the data types integer and real, i.e. subroutine functions with general result type should be provided by the programming language, and
3. the elementary functions must be provided for interval arguments.
While 1. and 2. are challenges for the designer of the programming lan-
guage, 3. is a challenge for the mathematician. In a conventional call of an
elementary function the computer provides a result, the accuracy of which
cannot easily be judged by the user. This is no longer the case when the
elementary functions are provided for interval arguments. Then, if called for
a point interval (where the lower and upper bound coincide), a compari-
son of the lower and upper bound of the result of the interval evaluation
of the function reveals immediately the accuracy with which the elementary
function has been implemented. This situation has forced extremely careful
implementation of the elementary functions and since interval versions of the
elementary functions have been provided on a large scale [26-29,37,38,77]
the conventional real elementary functions on computers also had to be and
have been improved step by step by the manufacturers. A most advanced
programming environment in this respect is a decimal version of PASCAL-
XSC [10] where, besides the usual 24 elementary functions, about the same
number of special functions are provided for real and interval arguments with
highest accuracy.
1., 2. and 3. are minimum requirements for any sophisticated use of inter-
val arithmetic. If they are not met, coding difficulties absorb all the attention
and capacity of users and prevent them from developing deeper mathemati-
cal ideas and insight. So far none of the widespread programming languages
like Fortran, C, and even Fortran 95 and C++ provide the necessary pro-
gramming ease. This is the basic reason for the slow progress in the field.
It is a matter of fact that a great deal of the existing and established inter-
val methods and algorithms have originally been developed in PASCAL-XSC
even if they have been coded afterwards in other languages. Programming
ease is essential indeed. The typical user, however, is reluctant to leave the
programming environment he is used to, just to apply interval methods.
We summarize this discussion by stating that it does not suffice for an
adequate use of interval arithmetic on computers that only the four basic
arithmetic operations +, -, * and I for intervals are somehow supported by
the computer hardware. An appropriate language support is absolutely nec-
essary. So far this has been missing. This is the basic dilemma of interval
arithmetic. Experience has shown that it cannot be overcome via slow mov-
ing standardization committees for programming languages. Two things seem
to be necessary for the great breakthrough. A major vendor has to provide
the necessary support and the body of numerical analysts must acquire a
broader insight and skills in order to use this support.

3.4 Enclosing the Range of Function Values


The interval evaluation of a real function f over the interval A was denoted by
F(A). We now compare it with the range of function values over the interval
A which was denoted by
f(A) := {f(a) | a ∈ A}.   (3.14)
We have observed that interval evaluation of an arithmetic expression
and of real functions is inclusion isotone (3.9), (3.12) and that the inclusion
property (3.10), (3.13) holds. Since (3.10) and (3.13) hold for all a ∈ A we can immediately state that

f(A) ⊆ F(A),   (3.15)

i.e. that the interval evaluation of a real function over an interval delivers a superset of the range of function values over that interval. If A is a point interval [a, a] this reduces to:

f(a) ∈ F([a, a]).   (3.16)

These are basic properties of interval arithmetic. Computing with inequalities always aims for bounds for function values, or for bounds for the range of function values. Interval arithmetic allows this computation in principle.
The range of function values over an interval is needed for many applica-
tions. Its computation is a very difficult task. It is equivalent to the compu-
tation of the global minimum and maximum of the function in that interval.
An interval evaluation of the arithmetic expression on the other hand is very
easy to perform. It requires about twice as many real arithmetic operations
as an evaluation of the function in real arithmetic. Thus interval arithmetic
provides an easy means to compute upper and lower bounds for the range of
function values.
In the end a complicated algorithm just performs an arithmetic expres-
sion. So an interval evaluation of the algorithm would compute bounds for
the result from given bounds for the data. However, it is observed that in
doing so, in general, the diameters of the intervals grow very fast and for
large algorithms the bounds quickly become meaningless in particular if the
bounds for the data are already large. This raises the question whether mea-
sures can be taken to keep the diameters of the intervals from growing too
fast. Interval arithmetic has developed such measures and we are going to
sketch these now.
If an enclosure for a function value is computed by (3.16), the quality of
the computed result F([a, a]) can be judged by the diameter of the interval
F([a, a]). This possibility of easily judging the quality of the computed result,
is not available in (3.15). Even if F(A) is a large interval, it can be a good
approximation for the range of function values f(A) if the latter is large also.
So some means to measure the deviation between f(A) and F(A) in (3.15)
is desirable.
It is well known that the set IIR of real intervals becomes a metric space with the so-called Hausdorff metric, where the distance q of two intervals A = [a1, a2] and B = [b1, b2] is defined by

q(A, B) := max{|a1 − b1|, |a2 − b2|}.   (3.17)

See, for instance, [6].


With this distance function q the following relation can be proved to hold
under natural assumptions on f:
q(f(A), F(A)) ≤ α · d(A), with a constant α ≥ 0.   (3.18)

Here d(A) denotes the diameter of the interval A = [a1, a2]:

d(A) := a2 − a1.   (3.19)

In case of functions of several real variables the maximum of the diameters d(Ai) appears on the right hand side of (3.18).
The relation (3.18) shows that the distance between the range of values
of the function f over the interval A and the interval evaluation of the ex-
pression for f tends to zero linearly with the diameter of the interval A. So
the overestimation of f(A) by F(A) decreases with the diameter of A and in
the limit d(A) = 0 no such overestimation is left.
Because of this result, subdivision of the interval A into subintervals Ai, i = 1(1)n, with A = ⋃_{i=1}^{n} Ai, is a frequently applied technique to obtain better approximations for the range of function values. Then (3.18) holds for each subinterval:

q(f(Ai), F(Ai)) ≤ α · d(Ai), i = 1(1)n,

and, in general, the union of the interval evaluations over all subintervals,

⋃_{i=1}^{n} F(Ai),

is a much better approximation for the range f(A) than is F(A).
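As an illustration of this effect, the following C++ sketch (ad hoc names, plain doubles without outward rounding) evaluates the expression x² − x over n subintervals of A = [1, 2] and forms the hull of the results; the hull shrinks towards the true range [0, 2] as n grows.

    // Subdivision narrows the enclosure of the range of x^2 - x over [1, 2].
    #include <algorithm>
    #include <cstdio>

    struct Interval { double lo, hi; };
    Interval sub(Interval A, Interval B) { return { A.lo - B.hi, A.hi - B.lo }; }
    Interval mul(Interval A, Interval B) {
        double p[4] = { A.lo*B.lo, A.lo*B.hi, A.hi*B.lo, A.hi*B.hi };
        return { *std::min_element(p, p + 4), *std::max_element(p, p + 4) };
    }
    Interval G(Interval X) { return sub(mul(X, X), X); }   // interval evaluation of x^2 - x

    int main() {
        Interval A{ 1.0, 2.0 };
        int ns[] = { 1, 2, 4, 8, 16 };
        for (int n : ns) {
            double w = (A.hi - A.lo) / n;
            Interval hull = G({ A.lo, A.lo + w });          // hull over all subintervals
            for (int i = 1; i < n; ++i) {
                Interval Fi = G({ A.lo + i * w, A.lo + (i + 1) * w });
                hull.lo = std::min(hull.lo, Fi.lo);
                hull.hi = std::max(hull.hi, Fi.hi);
            }
            std::printf("n = %2d : union = [%g, %g]\n", n, hull.lo, hull.hi);
        }
        // n = 1 gives [-1, 3]; the union contracts towards [0, 2], in accordance
        // with (3.18): the overestimation on each piece is of order d(Ai).
    }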


There are yet other methods to obtain better enclosures for the range of
function values f(A). We have already observed that the interval evaluation
F(A) of a function f depends on the expression used for the representation
of f. So by choosing appropriate representations for f the overestimation
of f(A) by the interval evaluation F(A) can often be reduced. Indeed, if f
allows a representation of the form

f(x) = f(c) + (x − c) · h(x), with c ∈ A,   (3.20)

then under natural assumptions on h the following inequality holds

q(f(A), F(A)) ≤ β · (d(A))², with a constant β ≥ 0.   (3.21)

(3.20) is called a centered form of f. In (3.20) c is not necessarily the center of A although it is often chosen as the center. (3.21) shows that the distance
between the range of values of the function f over the interval A and the
interval evaluation of a centered form of f tends toward zero quadratically
with the diameter of the interval A. In practice, this means that for small
intervals the interval evaluation of the centered form leads to a very good
approximation of the range of function values over an interval A. Again,
subdivision is a method that can be applied in the case of a large interval
A. It should be clear, however, that in general only for small intervals is the
bound in (3.21) better than in (3.18).
The decrease of the overestimation of the range of function values by the
interval evaluation of the function with the diameter of the interval A, and the
method of subdivision, are reasons why interval arithmetic can successfully
be used in many applications. Numerical methods often proceed in small
steps. This is the case, for instance, with numerical quadrature or cubature,
or with numerical integration of ordinary differential equations. In all these
cases an interval evaluation of the remainder term of the integration formula
(using differentiation arithmetic) controls the step size of the integration, and
anyhow because of the small steps, overestimation is practically negligible.
We now mention briefly how centered forms can be obtained. Usually a centered form is derived via the mean-value theorem. If f is differentiable in its domain D, then f(x) = f(c) + f'(ξ)(x − c) for fixed c ∈ D and some ξ between x and c. If x and c are elements of the interval A ⊆ D, then also ξ ∈ A. Therefore

f(x) ∈ F(A) := f(c) + F'(A)(A − c), for all x ∈ A.

Here F'(A) is an interval evaluation of f'(x) in A.


In (3.20) the slope

h(x) = (f(x) − f(c)) / (x − c)

can be used instead of the derivative for the representation of f(x). Slopes often lead to better enclosures for f(A) than do derivatives. For details see [7,32,53].
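The following C++ sketch compares the naive interval evaluation of f(x) = x² − x with its mean-value form f(c) + F'(A)(A − c) over A = [1, 1.5]. It is illustrative only: the Interval type and helper names are ad hoc, and plain doubles are used instead of the directed roundings a rigorous implementation needs.

    // Mean-value (centered) form versus naive interval evaluation of x^2 - x.
    #include <algorithm>
    #include <cstdio>

    struct Interval { double lo, hi; };
    Interval add(Interval A, Interval B) { return { A.lo + B.lo, A.hi + B.hi }; }
    Interval sub(Interval A, Interval B) { return { A.lo - B.hi, A.hi - B.lo }; }
    Interval mul(Interval A, Interval B) {
        double p[4] = { A.lo*B.lo, A.lo*B.hi, A.hi*B.lo, A.hi*B.hi };
        return { *std::min_element(p, p + 4), *std::max_element(p, p + 4) };
    }

    double   f (double x)    { return x * x - x; }
    Interval F (Interval X)  { return sub(mul(X, X), X); }                      // naive evaluation
    Interval dF(Interval X)  { return { 2.0 * X.lo - 1.0, 2.0 * X.hi - 1.0 }; } // F'(X) = 2X - 1

    int main() {
        Interval A{ 1.0, 1.5 };
        double c = 0.5 * (A.lo + A.hi);                                         // centre of A
        Interval naive    = F(A);
        Interval centered = add({ f(c), f(c) }, mul(dF(A), sub(A, { c, c })));
        std::printf("naive    = [%g, %g]\n", naive.lo, naive.hi);               // [-0.5, 1.25]
        std::printf("centered = [%g, %g]\n", centered.lo, centered.hi);         // [-0.1875, 0.8125]
        // The true range is [0, 0.75]; by (3.21) the centered form overestimates
        // it only by O(d(A)^2).
    }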
Derivatives and enclosures of derivatives can be computed by a process
which is called automatic differentiation or differentiation arithmetic. Slopes
and enclosures of slopes can be computed by another process which is very
similar to automatic differentiation. In both cases the computation of the
derivative or slope or enclosures of these is done together with the com-
putation of the function value. For these processes only the expression or
algorithm for the function is required. No explicit formulas for the derivative
or slope are needed. The computer interprets the arithmetic operations in the
expression by differentiation or slope arithmetic. The arithmetic is hidden in
the runtime system of the compiler. It is activated by type specification of
the operands. For details see [17,18,53], and Section 3.8. Thus the computer
is able to produce and enclose the centered form via the derivative or slope
automatically.
Without going into further details we mention once more, that all these
considerations are not restricted to functions of a single real variable. Subdi-
vision in higher dimensions, however, is a difficult task which requires addi-
tional tools and strategies. Typical of such problems are the computation of
the bounds of the solution of a system of nonlinear equations, and global opti-
mization or numerical integration of functions of more than one real variable.
In all these and other cases, zero finding is a central task. Here the extended
Interval Newton Method plays an extraordinary role so we are now going to
review this method, which is also one of the requirements that have to be
met when interval arithmetic is implemented on the computer.

3.5 The Interval Newton Method

Traditionally Newton's method is used to compute an approximation of a zero of a nonlinear real function f(x), i.e. to compute a solution of the equation

f(x) = 0.   (3.22)

The method approximates the function f(x) in the neighborhood of an initial value x0 by the linear function (the tangent)

t(x) = f(x0) + f'(x0)(x − x0),   (3.23)

the zero of which can easily be calculated by

x1 := x0 − f(x0)/f'(x0).   (3.24)

x1 is used as the new approximation for the zero of (3.22). Continuation of this method leads to the general iteration scheme:

x_{ν+1} := xν − f(xν)/f'(xν),   ν = 0, 1, 2, ....   (3.25)

It is well known that if f(x) has a single zero x* in an interval X and f(x) is twice continuously differentiable, then the sequence (xν) defined by (3.25) converges quadratically towards x* if x0 is sufficiently close to x*. If the latter condition does not hold the method may well fail.
The interval version of Newton's method computes an enclosure of the zero x* of a continuously differentiable function f(x) in the interval X by the following iteration scheme:

X_{ν+1} := (m(Xν) − f(m(Xν))/F'(Xν)) ∩ Xν,   ν = 0, 1, 2, ...,   (3.26)

with X0 = X. Here F'(Xν) is the interval evaluation of the first derivative f'(x) of f over the interval Xν, and m(Xν) is the midpoint of the interval Xν. Instead of m(Xν) another point within Xν could be chosen. The method can only be applied if 0 ∉ F'(X0). This guarantees that f(x) has only a single zero in X0.
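A minimal C++ sketch of the iteration (3.26) for f(x) = x² − 2 on X0 = [1, 2] is given below. It is illustrative only: all names are ad hoc, 0 ∉ F'(X) is presupposed, and plain doubles are used, so unlike a rigorous implementation with outward rounding the computed bounds are not guaranteed.

    // Interval Newton iteration (3.26) for f(x) = x^2 - 2 on [1, 2].
    #include <algorithm>
    #include <cstdio>

    struct Interval { double lo, hi; };

    double   f (double x)    { return x * x - 2.0; }
    Interval dF(Interval X)  { return { 2.0 * X.lo, 2.0 * X.hi }; }   // F'(X) = 2X, here > 0

    int main() {
        Interval X{ 1.0, 2.0 };
        for (int k = 0; k < 6; ++k) {
            double m   = 0.5 * (X.lo + X.hi);                         // m(X), the midpoint
            Interval D = dF(X);
            double q1  = f(m) / D.lo, q2 = f(m) / D.hi;               // f(m)/F'(X), 0 not in D
            Interval N{ m - std::max(q1, q2), m - std::min(q1, q2) }; // N(X) = m - f(m)/F'(X)
            X = { std::max(X.lo, N.lo), std::min(X.hi, N.hi) };       // intersection with X
            if (X.lo > X.hi) { std::puts("empty: no zero in X"); break; }
            std::printf("X%d = [%.12f, %.12f]\n", k + 1, X.lo, X.hi);
        }
        // The iterates contract quadratically to an enclosure of sqrt(2).
    }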
In contrast to (3.25) the method (3.26) can never diverge (fail). Because of the intersection with Xν the sequence

X0 ⊇ X1 ⊇ X2 ⊇ ...   (3.27)

is bounded. It can be shown that under natural conditions on the function f the sequence converges quadratically to x* [6,47].
The operator

N(X) := x − f(x)/F'(X),   x ∈ X ∈ IIR   (3.28)

is called the Interval Newton Operator. It has the following properties:

I.  If N(X) ⊆ X, then f(x) has exactly one zero x* in X.
II. If N(X) ∩ X = [ ], then f(x) has no zero in X.

[Figure omitted: graph of f(x) over X = [x1, x2], with N(X) = [n1, n2] and X1 = [x1, n2].]

Fig. 3.3. Geometric interpretation of the Interval Newton Method

Thus, N(X) can be used to prove the existence or absence of a zero x* of f(x) in X. Since in the case of existence of a zero x* in X the sequence (3.26), (3.27) converges, in the case of absence the situation N(X) ∩ X = [ ] must occur in (3.27).
The interval version of Newton's method (3.26) can also be derived via the mean value theorem. If f(x) is continuously differentiable and has a single zero x* in the interval X, and f'(x) ≠ 0 for all x ∈ X, then

f(x) = f(x*) + f'(ξ)(x − x*) for all x ∈ X and some ξ between x and x*.

Since f(x*) = 0 and f'(ξ) ≠ 0 this leads to

x* = x − f(x)/f'(ξ).
If F'(X) denotes the interval evaluation of f'(x) over the interval X, we have f'(ξ) ∈ F'(X) and therefore

x* = x − f(x)/f'(ξ) ∈ x − f(x)/F'(X) = N(X) for all x ∈ X,

i.e. x* ∈ N(X) and thus

x* ∈ (x − f(x)/F'(X)) ∩ X = N(X) ∩ X.

Now we obtain by setting X0 := X and x = m(X0)

X1 := (m(X0) − f(m(X0))/F'(X0)) ∩ X0,

and by continuation (3.26).


In close similarity to the conventional Newton method the Interval New-
ton Method also allows some geometric interpretation. For that purpose let
be X = [Xl, X2] and N(X) = [nl' n2]. F'(X) is the interval evaluation of f'(x)
over the interval X. As such it is a superset of all slopes of tangents that can
occur in X. (3.24) computes the zero of the tangent of f(x) in (xo, f(xo)).
Similarly N(X) is the interval of all zeros of straight lines through (x, f(x))
with slopes within F'(X), see Fig. 3.3. Of course, f'(x) E F'(X).
The straight line through f(x) with the least slope within F'(X) cuts the
real axis in nl, and the one with the greatest slope in n2. Thus the Interval
Newton Operator N(X) computes the interval [nl' n2] which in the sketch of
Fig. 3.3 is situated on the left hand side of x. The intersection of N(X) with
X then delivers the new interval Xl. In the example in Fig. 3.3, Xl = [Xl, n2].
Newton's method allows some visual interpretation. From the point (x,
f(x)) the conventional Newton method sends a beam along the tangent. The
search is continued at the intersection of this beam with the x-axis. The
Interval Newton Method sends a set of beams like a floodlight from the point
(x, f(x)) to the x-axis. This set includes the directions of all tangents that
occur in the entire interval X. The interval N(X) comprises all cuts of these
beams with the x-axis.
It is a fascinating discovery that the Interval Newton Method can be ex-
tended so that it can be used to compute all zeros of a real function in a
given interval. The basic idea of this extension is already old [3]. Many scien-
tists have worked on details of how to use this method, of how to define the
necessary arithmetic operations, and of how to bring them to the computer.
But inconsistencies have occurred again and again. However, it seems that
understanding has now reached a point which allows a consistent realization
of the method and of the necessary arithmetic. The extended Interval Newton
Method is the most powerful and most frequently used tool for subdivision
in higher dimensional spaces. It requires an extension of interval arithmetic
which we are now going to discuss.
3.6 Extended Interval Arithmetic

In the definition of interval arithmetic, division by an interval which includes zero was excluded. We are now going to eliminate this exception.
The real numbers IR are defined as a conditionally complete, linearly ordered field.² With respect to the order relation ≤ they can be completed by adjoining a least element −∞ and a greatest element +∞. We denote the resulting set by IR* := IR ∪ {−∞} ∪ {+∞}. {IR*, ≤} is a complete lattice.³ This completion is frequently applied in mathematics and it is well known that the new elements −∞ and +∞ fail to satisfy several of the algebraic properties of a field. −∞ and +∞ are not real numbers! For example a + ∞ = b + ∞ even if a < b, so that the cancellation law is not valid. For the new elements −∞ and +∞ the following operations with elements x ∈ IR are defined in analysis:

∞ + x = ∞,   −∞ + x = −∞,
∞ * x = ∞ for x > 0,   ∞ * x = −∞ for x < 0,
x/∞ = x/(−∞) = 0,   (3.29)
∞ + ∞ = ∞ * ∞ = ∞,
−∞ + (−∞) = (−∞) * ∞ = −∞,

together with variants obtained by applying the sign rules and the law of commutativity. Not defined are the terms ∞ − ∞ and 0 * ∞, again with variants obtained by applying the sign rules and the law of commutativity. These rules are well established in real analysis and there is no need to extend them for the purposes of interval arithmetic in IIR.
IR is a set with certain arithmetic operations. These operations can be extended to the powerset IP IR in complete analogy to (3.5):

A ∘ B := {a ∘ b | a ∈ A ∧ b ∈ B} for all ∘ ∈ {+, −, *, /}
and all A, B ∈ IP IR.   (3.30)

As a consequence of (3.30) again the inclusion isotony (3.6) and the inclusion property (3.7) hold for all operations and arithmetic expressions in IP IR. In particular, this is the case if (3.30) is restricted to operands of IIR. IIR is a subset of IP IR.
We are now going to define division by an interval B of IIR which contains zero. It turns out that the result is no longer an interval of IIR. But we can apply the definition of the division in the powerset as given by (3.30). This leads to

A/B := {a/b | a ∈ A ∧ b ∈ B} for all A, B ∈ IIR.   (3.31)

² An ordered set is called conditionally complete if every non empty, bounded subset has a greatest lower bound (infimum) and a least upper bound (supremum).
³ In a complete lattice every subset has an infimum and a supremum.
In order to interpret the right hand side of (3.31) we remember that the quotient a/b is defined as the inverse operation of multiplication, i.e. as the solution of the equation b · x = a. Thus (3.31) can also be written in the form

A/B := {x | bx = a ∧ a ∈ A ∧ b ∈ B} for all A, B ∈ IIR.   (3.32)

Now we have to interpret the right hand side of (3.32). We are interested in obtaining simply executable, explicit formulas for the right hand side of (3.32). The case 0 ∉ B was already dealt with in Table 3.2. So we assume here generally that 0 ∈ B. For A = [a1, a2] and B = [b1, b2] ∈ IIR, 0 ∈ B, the following eight distinct cases can be set out:

1. 0 ∈ A,         0 ∈ B.
2. 0 ∉ A,         B = [0, 0].
3. a1 ≤ a2 < 0,   b1 < b2 = 0.
4. a1 ≤ a2 < 0,   b1 < 0 < b2.
5. a1 ≤ a2 < 0,   0 = b1 < b2.
6. 0 < a1 ≤ a2,   b1 < b2 = 0.
7. 0 < a1 ≤ a2,   b1 < 0 < b2.
8. 0 < a1 ≤ a2,   0 = b1 < b2.

The list distinguishes the cases 0 ∈ A (case 1) and 0 ∉ A (cases 2 to 8). Since it is generally assumed that 0 ∈ B, these eight cases indeed cover all possibilities.
We are now going to derive simple formulas for the result of the interval division A/B for these eight cases:

1. 0 ∈ A ∧ 0 ∈ B. Since every x ∈ IR fulfils the equation 0 · x = 0, we have A/B = (−∞, +∞). Here (−∞, +∞) denotes the open interval between −∞ and +∞ which just consists of all real numbers, i.e. A/B = IR.
2. In case 0 ∉ A ∧ B = [0, 0] the set defined by (3.32) consists of all elements which fulfil the equation 0 · x = a for a ∈ A. Since 0 ∉ A, there is no real number which fulfils this equation. Thus A/B is the empty set, A/B = [ ].

In all other cases 0 ∉ A also. We have already observed under 2. that in this case the element 0 in B does not contribute to the solution set. So it can be excluded without changing the set A/B.
So the general rule for computing A/B by (3.32) is to punch out zero of the interval B and replace it by a small positive or negative number ε as the case may be. The interval changed in this way is denoted by B' and represented in column 4 of Table 3.3. With this B' the solution set A/B' can now easily be computed by applying the rules of Table 3.2. The results are shown in column 5 of Table 3.3. Now the desired result A/B as defined by (3.32) is obtained if in column 5 ε tends to zero. Thus in cases 3 to 8 the results are obtained by the limit process A/B = lim_{ε→0} A/B'. The solution set A/B is shown in the last column of Table 3.3 for all 8 cases. There, as usual in mathematics,
parentheses denote open interval ends, i.e. the bound is excluded. In contrast to this, brackets denote closed interval ends, i.e. the bound is included.
In Table 3.3 the operands A and B of the division A/B are intervals of IIR! The results of the division A/B shown in the last column, however, are no longer intervals of IIR nor are they intervals of IIR*, which is the set of intervals over IR*. This is logically correct and should not be surprising, since the division has been defined as an operation in IP IR by (3.30).
Table 3.4 shows the result of the division A/B of two intervals A = [a1, a2] and B = [b1, b2] in the case 0 ∈ B in a more convenient layout.

Table 3.3. The 8 cases of the division of two intervals A/B, with A, B ∈ IIR and 0 ∈ B.

case  A = [a1,a2]  B = [b1,b2]   B'                   A/B'                                A/B
 1    0 ∈ A        0 ∈ B                                                                  (−∞, +∞)
 2    0 ∉ A        B = [0,0]                                                              [ ]
 3    a2 < 0       b1 < b2 = 0   [b1, −ε]             [a2/b1, a1/(−ε)]                    [a2/b1, +∞)
 4    a2 < 0       b1 < 0 < b2   [b1, −ε] ∪ [ε, b2]   [a2/b1, a1/(−ε)] ∪ [a1/ε, a2/b2]    (−∞, a2/b2] ∪ [a2/b1, +∞)
 5    a2 < 0       0 = b1 < b2   [ε, b2]              [a1/ε, a2/b2]                       (−∞, a2/b2]
 6    a1 > 0       b1 < b2 = 0   [b1, −ε]             [a2/(−ε), a1/b1]                    (−∞, a1/b1]
 7    a1 > 0       b1 < 0 < b2   [b1, −ε] ∪ [ε, b2]   [a2/(−ε), a1/b1] ∪ [a1/b2, a2/ε]    (−∞, a1/b1] ∪ [a1/b2, +∞)
 8    a1 > 0       0 = b1 < b2   [ε, b2]              [a1/b2, a2/ε]                       [a1/b2, +∞)

Table 3.4. The result of the division A/B, with A, B ∈ IIR and 0 ∈ B.

A/B            B = [0,0]    b1 < b2 = 0    b1 < 0 < b2                   0 = b1 < b2
a2 < 0         [ ]          [a2/b1, +∞)    (−∞, a2/b2] ∪ [a2/b1, +∞)     (−∞, a2/b2]
a1 ≤ 0 ≤ a2    (−∞, +∞)     (−∞, +∞)       (−∞, +∞)                      (−∞, +∞)
a1 > 0         [ ]          (−∞, a1/b1]    (−∞, a1/b1] ∪ [a1/b2, +∞)     [a1/b2, +∞)
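Table 3.4 can be realized as a division that returns zero, one, or two (possibly unbounded) pieces. The following C++ fragment is an ad hoc illustration of that case distinction, not a library routine; it works with plain doubles and IEEE infinities.

    // Extended interval division A/B for 0 in B, following Table 3.4 (sketch).
    #include <cstdio>
    #include <limits>
    #include <vector>

    struct Interval { double lo, hi; };
    const double INF = std::numeric_limits<double>::infinity();

    // Precondition: B.lo <= 0 <= B.hi (otherwise use the ordinary division).
    std::vector<Interval> extended_div(Interval A, Interval B) {
        if (A.lo <= 0.0 && 0.0 <= A.hi) return { { -INF, +INF } };      // row 2 of Table 3.4
        if (B.lo == 0.0 && B.hi == 0.0) return { };                     // empty set [ ]
        if (A.hi < 0.0) {                                               // row 1: A < 0
            if (B.hi == 0.0) return { { A.hi / B.lo, +INF } };
            if (B.lo == 0.0) return { { -INF, A.hi / B.hi } };
            return { { -INF, A.hi / B.hi }, { A.hi / B.lo, +INF } };
        } else {                                                        // row 3: A > 0
            if (B.hi == 0.0) return { { -INF, A.lo / B.lo } };
            if (B.lo == 0.0) return { { A.lo / B.hi, +INF } };
            return { { -INF, A.lo / B.lo }, { A.lo / B.hi, +INF } };
        }
    }

    int main() {
        Interval A{ 1.0, 2.0 }, B{ -1.0, 4.0 };                         // case 7 of Table 3.3
        for (Interval piece : extended_div(A, B))
            std::printf("[%g, %g]  ", piece.lo, piece.hi);              // (-inf,-1] and [0.25,+inf)
        std::printf("\n");
    }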

For completeness we repeat at the end of this section the results of the basic arithmetic operations for intervals A = [a1, a2] and B = [b1, b2] of IIR which have already been given in Section 3.2. In the cases of multiplication and division we use different representations. We also list the basic rules of the order relations ≤ and ⊆ for intervals of IIR*.
II. Addition: [a1, a2] + [b1, b2] = [a1 + b1, a2 + b2].
III. Subtraction: [a1, a2] − [b1, b2] = [a1 − b2, a2 − b1].
IV. Negation: A = [a1, a2], −A = [−a2, −a1].
V. Multiplication:

A * B          b1 ≥ 0          b1 < 0 < b2                          b2 ≤ 0
a2 ≤ 0         [a1b2, a2b1]    [a1b2, a1b1]                         [a2b2, a1b1]
a1 < 0 < a2    [a1b2, a2b2]    [min(a1b2, a2b1), max(a1b1, a2b2)]   [a2b1, a1b1]
a1 ≥ 0         [a1b1, a2b2]    [a2b1, a2b2]                         [a2b1, a1b2]

VI. Division, 0 ∉ B:

A/B            b1 > 0            b2 < 0
a2 ≤ 0         [a1/b1, a2/b2]    [a2/b1, a1/b2]
a1 < 0 < a2    [a1/b1, a2/b1]    [a2/b2, a1/b2]
a1 ≥ 0         [a1/b2, a2/b1]    [a2/b2, a1/b1]

The closed intervals over the real numbers IR* are ordered with respect to two different order relations, the comparison ≤ and the set inclusion ⊆. With respect to both order relations they are complete lattices. The basic properties are:

VII. {IIR*, ≤}: [a1, a2] ≤ [b1, b2] :⇔ a1 ≤ b1 ∧ a2 ≤ b2.
The least element of IIR* with respect to ≤ is the interval [−∞, −∞], the greatest element is [+∞, +∞]. The infimum and supremum respectively of a subset S ⊆ IIR* are:

inf_≤ S = [inf_{A∈S} a1, inf_{A∈S} a2],   sup_≤ S = [sup_{A∈S} a1, sup_{A∈S} a2].

VIII. {IIR*, ⊆}: [a1, a2] ⊆ [b1, b2] :⇔ b1 ≤ a1 ∧ a2 ≤ b2.
The interval [−∞, +∞] is the greatest element in {IIR*, ⊆}, i.e. for all intervals A ∈ IIR* we have A ⊆ [−∞, +∞]. But a least element is missing in IIR*. So we adjoin the empty set [ ] as the least element of IIR*. The empty set [ ] is a subset of any set, thus for all A ∈ IIR* we have [ ] ⊆ A. We denote the resulting set again by IIR*. With this completion {IIR*, ⊆} is a complete lattice. The infimum and supremum respectively of a subset S ⊆ IIR* are [33,34]:

inf_⊆ S = [sup_{A∈S} a1, inf_{A∈S} a2],   sup_⊆ S = [inf_{A∈S} a1, sup_{A∈S} a2],

i.e. the infimum is the intersection and the supremum is the interval (convex) hull of all intervals out of S. For inf_⊆ S we shall also use the usual symbol ⋂S. sup_⊆ S is occasionally written as ⋃̲S. If in particular S just consists of two intervals A, B, this reads:

intersection: A ∩ B,   interval hull: A ∪̲ B.

Since IR is a linearly ordered set with respect to ≤, the interval hull is the same as the convex hull. The intersection may be empty.
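Intersection and interval hull are the two lattice operations that the extended Interval Newton Method of the next section relies on. A minimal C++ sketch of both (ad hoc names; an empty intersection is signalled here by lo > hi):

    // Intersection and interval (convex) hull of two intervals (sketch).
    #include <algorithm>
    #include <cstdio>

    struct Interval { double lo, hi; };

    Interval intersect(Interval A, Interval B) { return { std::max(A.lo, B.lo), std::min(A.hi, B.hi) }; }
    Interval hull     (Interval A, Interval B) { return { std::min(A.lo, B.lo), std::max(A.hi, B.hi) }; }

    int main() {
        Interval A{ 1.0, 3.0 }, B{ 2.0, 5.0 };
        Interval I = intersect(A, B), H = hull(A, B);
        std::printf("A n B = [%g, %g]   hull = [%g, %g]\n", I.lo, I.hi, H.lo, H.hi);  // [2,3], [1,5]
    }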

In the following section we shall generalize the Interval Newton Method in such a way that for the Interval Newton Operator

N(X) := x − f(x)/F'(X),   x ∈ X ∈ IIR,   (3.33)

the case 0 ∈ F'(X) is no longer excluded. The result of the division then can be taken from Tables 3.3 and 3.4. It is no longer an element of IIR, but an element of the powerset IP IR. Thus the subtraction that occurs in (3.33) is also an operation in IP IR. As such it is defined by (3.29) and (3.30). As a consequence of this, of course, the operation is inclusion isotone and the inclusion property holds. We are interested in the evaluation of an expression of the form

Z := x − a/B, with x, a ∈ IR and 0 ∈ B ∈ IIR.   (3.34)

(3.34) can also be written as Z = x + (−a/B). Multiplication of the set a/B by −1 negates and exchanges all bounds (see IV. above). Corresponding to the eight cases of Table 3.3, eight cases are again to be distinguished. The result is shown in the last column of Table 3.5.

Table 3.5. Evaluation of Z = x − a/B for x, a ∈ IR, and 0 ∈ B ∈ IIR.

     a        B = [b1, b2]   −a/B                           Z := x − a/B
1    a = 0    0 ∈ B          (−∞, +∞)                       (−∞, +∞)
2    a ≠ 0    B = [0, 0]     [ ]                            [ ]
3    a < 0    b1 < b2 = 0    (−∞, −a/b1]                    (−∞, x − a/b1]
4    a < 0    b1 < 0 < b2    (−∞, −a/b1] ∪ [−a/b2, +∞)      (−∞, x − a/b1] ∪ [x − a/b2, +∞)
5    a < 0    0 = b1 < b2    [−a/b2, +∞)                    [x − a/b2, +∞)
6    a > 0    b1 < b2 = 0    [−a/b1, +∞)                    [x − a/b1, +∞)
7    a > 0    b1 < 0 < b2    (−∞, −a/b2] ∪ [−a/b1, +∞)      (−∞, x − a/b2] ∪ [x − a/b1, +∞)
8    a > 0    0 = b1 < b2    (−∞, −a/b2]                    (−∞, x − a/b2]

The general rules for subtraction of the type of sets which occur in column
4 of Table 3.5, from a real number x are:
x − (−∞, +∞) = (−∞, +∞),
x − (−∞, y] = [x − y, +∞),
x − [y, +∞) = (−∞, x − y],
x − ((−∞, y] ∪ [z, +∞)) = (−∞, x − z] ∪ [x − y, +∞),
x − [ ] = [ ].
If in any arithmetic operation an operand is the empty set the result of
the operation is also the empty set.
At the end of this Section we briefly summarize what has been developed
so far.
In Section 3.3 we have considered the powerset IP IR of the real numbers and the subset IIR of closed and bounded intervals over IR. Arithmetic operations have been defined in IP IR by (3.5). We have seen that with these operations IIR is an algebraically closed subset of IP IR if division by an interval which contains zero is excluded.
With respect to the order relation ≤, {IR, ≤} is a linearly ordered set. With respect to the order relation ⊆, {IIR, ⊆} is an ordered set. Both sets {IR, ≤} and {IIR, ⊆} are conditionally complete lattices (i.e. every non empty, bounded subset has an infimum and a supremum).
In this section we have completed the set {IR, ≤} by adjoining a least element −∞ and a greatest element +∞. This leads to the set IR* := IR ∪ {−∞} ∪ {+∞}. {IR*, ≤} then is a complete lattice (i.e. every subset has an infimum and a supremum). Similarly we have completed the set {IIR*, ⊆} by adjoining the empty set [ ] as a least element. {IIR*, ⊆} then is a complete lattice also.
Then we have extended interval division A/B, with A, B ∈ IIR, to the case 0 ∈ B. We have seen that division by an interval of IIR which contains zero is well defined in IP IR and that the result always is an element of IP IR, i.e. a set of real numbers.
This is an important result. We stress the fact that the result of division by an interval of IIR which contains zero is not an element of IIR nor of IIR*. Thus division by an interval that contains zero does not require definition of arithmetic operations in the completed set of intervals IIR* ∪ {[ ]}.
Although this is often done in the literature, we have not done so here.
Thus complicated definitions and rules for computing with intervals like [−∞, −∞], [−∞, 0], [0, +∞], [+∞, +∞], and [−∞, +∞] need not be considered. Putting aside such details makes interval arithmetic more friendly for the user.
Particular and important applications in the field of enclosure methods and validated numerics may require the definition of arithmetic operations in the complete lattice IIR*. The infrequent occurrence of such applications cer-
tainly justifies leaving their realization to software and to the user. This paper
aims to set out those central properties of interval arithmetic which should
effectively be supported by the computer's hardware, by basic software, and by the programming languages.
∞ takes on a more subtle meaning in complex arithmetic. We defer consideration of complex arithmetic and complex interval arithmetic on the computer to a follow-up paper.

3.7 The Extended Interval Newton Method

The extended Interval Newton Method can be used to compute enclosures of all zeros of a continuously differentiable function f(x) in a given interval X. The iteration scheme is identical to the one defined by (3.26) in Section 3.5:

X_{ν+1} := (m(Xν) − f(m(Xν))/F'(Xν)) ∩ Xν = N(Xν) ∩ Xν,   ν = 0, 1, 2, ...,

with X0 := X. Here again F'(Xν) is the interval evaluation of the first derivative f'(x) of the function f over the interval Xν, and m(Xν) is any point out of Xν, the midpoint for example. If f(x) has more than one zero in X, then the derivative f'(x) has at least one zero (horizontal tangent of f(x)) in X also, and the interval evaluation F'(X) of f'(x) contains zero. Thus extended interval arithmetic has to be used to execute the Newton operator

N(X) = x − f(x)/F'(X), with x ∈ X.

As shown by Tables 3.3 and 3.4, the result is no longer an interval of IIR. It is an element of the powerset IP IR which, with the exception of case 2, stretches continuously to −∞ or +∞ or both. The intersection N(X) ∩ X with the finite interval X then produces a finite set again. It may consist of a finite interval of IIR, or of two separate such intervals, or of the empty set. These sets are now the starting values for the next iteration. This means that in the case where two separate intervals have occurred, the iteration has to be continued with two different starting values. This situation can occur repeatedly. On a sequential computer where only one iteration can be performed at a time all intervals which are not yet dealt with are collected in a list. This list then is treated sequentially. If more than one processor is available different subintervals can be dealt with in parallel.
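The following C++ sketch illustrates such a work list for f(x) = x² − 2 on [−2, 2], which has the two zeros ±√2. It is a simplified, non-rigorous illustration: all names are ad hoc, plain doubles are used without outward rounding, and the degenerate cases 1 and 2 of Table 3.5 are replaced by a bisection fallback.

    // Extended Interval Newton with a work list (sketch).
    #include <algorithm>
    #include <cstdio>
    #include <limits>
    #include <vector>

    struct Interval { double lo, hi; };
    const double INF = std::numeric_limits<double>::infinity();

    double   f (double x)    { return x * x - 2.0; }
    Interval dF(Interval X)  { return { 2.0 * X.lo, 2.0 * X.hi }; }   // F'(X) = 2X

    Interval intersect(Interval A, Interval B) {
        return { std::max(A.lo, B.lo), std::min(A.hi, B.hi) };        // empty if lo > hi
    }

    int main() {
        std::vector<Interval> work{ { -2.0, 2.0 } }, results;
        while (!work.empty()) {
            Interval X = work.back(); work.pop_back();
            if (X.hi - X.lo < 1e-10) { results.push_back(X); continue; }

            double m = 0.5 * (X.lo + X.hi), fm = f(m);
            Interval D = dF(X);
            std::vector<Interval> N;                                  // pieces of m - fm/F'(X)
            if (D.lo > 0.0 || D.hi < 0.0) {                           // 0 not in F'(X)
                double q1 = m - fm / D.lo, q2 = m - fm / D.hi;
                N = { { std::min(q1, q2), std::max(q1, q2) } };
            } else if (fm != 0.0 && D.lo < 0.0 && D.hi > 0.0) {       // two rays (Table 3.5)
                double p = m - fm / D.lo, q = m - fm / D.hi;
                N = { { -INF, std::min(p, q) }, { std::max(p, q), +INF } };
            } else {                                                  // degenerate cases: bisect
                N = { { X.lo, m }, { m, X.hi } };
            }
            for (Interval piece : N) {
                Interval Y = intersect(piece, X);
                if (Y.lo > Y.hi) continue;                            // empty: no zero here
                if (Y.hi - Y.lo > 0.75 * (X.hi - X.lo)) {             // too little progress:
                    work.push_back({ Y.lo, 0.5 * (Y.lo + Y.hi) });    // bisect Y
                    work.push_back({ 0.5 * (Y.lo + Y.hi), Y.hi });
                } else {
                    work.push_back(Y);
                }
            }
        }
        for (Interval R : results)
            std::printf("zero in [%.12f, %.12f]\n", R.lo, R.hi);      // encloses -sqrt(2) and +sqrt(2)
    }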
Again, we illustrate this process by a simple example. The starting interval is denoted by X = [x1, x2] and the result of the Newton operator by N = [n1, n2]. See Fig. 3.4.
Now F'(X) is again a superset of all slopes of tangents of f(x) in the interval X = [x1, x2], and 0 ∈ F'(X). N(X) again is the set of zeros of straight lines through (x, f(x)) with slopes out of F'(X). Let F'(X) = [s1, s2]. Since 0 ∈ F'(X) we have s1 ≤ 0 and s2 ≥ 0.
[Figure omitted: graph of f(x) over X = [x1, x2] with the two cut points n1 and n2 of the Newton operator, N = [n1, n2].]

Fig. 3.4. Geometric interpretation of the extended Interval Newton Method.

The straight lines through (x, f(x)) with the slopes s1 and s2 cut the real axis in n1 and n2. Thus the Newton operator produces the set

N(X) = (−∞, n2] ∪ [n1, +∞).

Intersection with the original set X (the former iterate) delivers the set

N(X) ∩ X = [x1, n2] ∪ [n1, x2],

consisting of two finite intervals of IIR. From this point the iteration has to be continued with the two starting intervals [x1, n2] and [n1, x2].
Remark: In case of division of a finite interval A = [a1, a2] by an interval B = [b1, b2] which contains zero, 8 non-overlapping cases of the result were distinguished in Table 3.3 and its context. Applied to the Newton operator these 8 cases resulted in the 8 cases in Table 3.5. So far we have discussed the behaviour of the Interval Newton Method in the cases 3 to 8 of Table 3.5. We are now going to consider and interpret the particular cases 1 and 2 of Table 3.5 which, of course, may also occur. In Table 3.5 a stands for the function value and B is the enclosure of all derivatives of f(x) in the interval X.
Case 2 in Table 3.5 is easy to interpret. If B = [0, 0], i.e. f'(x) ≡ 0 in the entire interval X, then f(x) is a constant in X and its value is f(x) = a ≠ 0. So the occurrence of the empty interval in the Newton iteration indicates that the function f(x) is a constant.
In case 1 of Table 3.5 the result of the Newton operator is the interval (−∞, +∞). In this case the intersection with the former iterate X does not reduce the interval and delivers the interval X again. The Newton iteration does not converge! In this case the function value a is zero (or the numerator A in case 1 of Table 3.3 contains zero) and a zero has already been found.
In order to avoid rounding noise and to obtain safe bounds for the solution the value x may be shifted by a small ε to the left or right. This may transform case 1 into one of the other cases 2 to 8.
However, since 0 ∈ B in case 1, normally case 1 will indicate a multiple zero at the point x. This case can be further evaluated by applying the Interval Newton Method to the derivative f' of f. The values of f' as well as enclosures F''(X) for the second derivative f''(x) can be obtained by differentiation arithmetic (automatic differentiation), which will be dealt with in the next section.

3.8 Differentiation Arithmetic, Enclosures of Derivatives

For many applications in scientific computing it is necessary to compute the value of the derivative of a function. The Interval Newton Method requires
the computation of an enclosure of the first derivative of the function over an
interval. The typical "school method" first computes a formal expression for
the derivative of the function by applying well known rules of differentiation.
Then this expression is evaluated for a point or an interval. Differentiation
arithmetic avoids the computation of a formal expression for the derivative. It
computes values or enclosures of derivatives just by computing with numbers
or intervals. We are now going to sketch this method for the simplest case
where the value of the first derivative is to be computed. If u(x) and v(x) are
differentiable functions then the following rules for the computation of the
derivative of the sum, difference, product, and quotient of the functions are
well known:

(u(x) ± v(x))' = u'(x) ± v'(x),
(u(x) * v(x))' = u'(x)v(x) + u(x)v'(x),
(u(x) / v(x))' = (u'(x)v(x) − u(x)v'(x)) / v²(x)   (3.35)
              = (1/v(x)) (u'(x) − (u(x)/v(x)) v'(x)).

These rules can be used to define an arithmetic for ordered pairs of numbers, similar to complex arithmetic or interval arithmetic. The first component of the pair consists of a function value u(x0) at a point x0. The second component consists of the value of the derivative u'(x0) of the function at the point x0. For brevity we simply write (u, u') for the pair of numbers. Then the following arithmetic for pairs follows immediately from (3.35):

(u, u') + (v, v') = (u + v, u' + v'),
(u, u') − (v, v') = (u − v, u' − v'),
(u, u') * (v, v') = (u * v, u'v + uv'),                      (3.36)
(u, u') / (v, v') = (u/v, (1/v)(u' − (u/v)v')),   v ≠ 0.

The set of rules (3.36) is called differentiation arithmetic. It is an arithmetic which deals just with numbers. The rules (3.36) are easily programmable and are executable by a computer. These rules are now used to compute simultaneously the value and the value of the derivative of a
real function at a point x0. For brevity we call these values the function-
derivative-value-pair. Why and how can this computation be done?
Earlier in this paper we have defined a (computable) real function by an
arithmetic expression in the manner that arithmetic expressions are usually
defined in a programming language. Apart from the arithmetic operators
+, -, *, and /, arithmetic expressions contain only three kinds of operands
as basic ingredients. These are constants, variables and certain differentiable
elementary functions as, for instance, exp, ln, sin, cos, sqr, .... The derivatives
of these functions are also well known.
If for a function f(x) a function-derivative-value-pair is to be computed at a point x0, all basic ingredients of the arithmetic expression of the function are replaced by their particular function-derivative-value-pair by the following rules:

a constant:                 c        → (c, 0),
the variable:               x0       → (x0, 1),
the elementary functions:   exp(x0)  → (exp(x0), exp(x0)),
                            ln(x0)   → (ln(x0), 1/x0),
                            sin(x0)  → (sin(x0), cos(x0)),        (3.37)
                            cos(x0)  → (cos(x0), −sin(x0)),
                            sqr(x0)  → (sqr(x0), 2x0),

and so on.

Now the operations in the expression are executed following the rules (3.36) of differentiation arithmetic. The result is the function-derivative-value-pair (f(x0), f'(x0)) of the function f at the point x0.
Example: For the function f(x) = 25(x − 1)/(x² + 1) the function value and the value of the first derivative are to be computed at the point x0 = 2. Applying the substitutions (3.37) and the rules (3.36) we obtain

(f(2), f'(2)) = (25, 0)((2, 1) − (1, 0)) / ((2, 1)(2, 1) + (1, 0))
              = (25, 0)(1, 1) / ((4, 4) + (1, 0))
              = (25, 25) / (5, 4) = (5, 1).

Thus f(2) = 5 and f'(2) = 1.
If in the arithmetic expression for the function f(x) elementary functions
occur in composed form, the chain rule has to be applied, for instance

exp(u(x0)) → (exp(u(x0)), exp(u(x0)) · u'(x0)) = (exp u, u' exp u),
sin(u(x0)) → (sin(u(x0)), cos(u(x0)) · u'(x0)) = (sin u, u' cos u),

and so on.
Example: For the function f(x) = exp(sin(x)) the value and the value of the first derivative are to be computed for x0 = π. Applying the above rules we obtain

    (f(π), f'(π)) = (exp(sin(π)), exp(sin(π)) · cos(π)) = (exp(0), -exp(0)) = (1, -1).

Thus f(π) = 1 and f'(π) = -1.


Differentiation arithmetic is often called automatic differentiation. All operations are performed with numbers. A computer can easily and safely execute these operations, whereas carrying them out by hand quickly becomes impractical.
Automatic differentiation is not restricted to real functions which are de-
fined by an arithmetic expression. Any real algorithm in essence evaluates a
real expression or the value of one or several real functions. Substituting for
all constants, variables and elementary functions their function-derivative-
value-pair, and performing all arithmetic operations by differentiation arith-
metic, computes simultaneously the function-derivative-value-pair of the re-
sult. Large program packages have been developed which do just this in
particular for problems in higher dimensions.
Automatic differentiation or differentiation arithmetic simply uses the
arithmetic expression or the algorithm for the function. A formal arithmetic
expression or algorithm for the derivative does not explicitly occur. Of course
an arithmetic expression or algorithm for the derivative is evaluated indi-
rectly. However, this expression remains hidden. It is evaluated by the rules of
differentiation arithmetic. Similarly, if differentiation arithmetic is performed for a fixed interval X0 instead of for a real point x0, an enclosure of the range of function values and an enclosure of the range of values of the derivative over that interval X0 are computed simultaneously. Thus, for instance, neither the Newton Method nor the Interval Newton Method requires that the user provide a formal expression for the derivatives. The derivative or an enclosure of it is computed just by use of the expression for the function itself.
Automatic differentiation allows many generalizations which all together
would fill a thick book. We mention only a few of these.
If the value or an enclosure of the second derivative is needed one would
use triples instead of pairs and extend the rules (3.36) for the third component
by corresponding rules: u'' + v'', u'' - v'', u v'' + 2u'v' + v u'', and so on. In the arithmetic
expression a constant c would now have to be replaced by the triple (c, 0, 0),
the variable x by (x, 1,0) and the elementary functions also by a triple with
the second derivative as the third component.
Another generalization is Taylor arithmetic. It works with tuples where
the first component represents the function value and the following compo-
nents represent the successive Taylor coefficients. The remainder term of an
integration routine for a definite integral or for an initial value problem of an
ordinary differential equation usually contains a derivative of higher order.
Interval Taylor arithmetic can be used to compute a safe enclosure of the
remainder term over an interval. This enclosure can serve as an indicator for
automatic step size control.
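
As a rough illustration of the idea of Taylor arithmetic, the following C++ sketch multiplies two truncated Taylor series by the Cauchy product. It works with real coefficients only and is our own simplification; an interval Taylor arithmetic such as the one used in the program of Fig. 3.5 below would carry interval coefficients and use the rounded interval operations instead of double.

// Sketch: multiplication of truncated Taylor series (Cauchy product).
#include <vector>
#include <cstdio>

using Taylor = std::vector<double>;   // t[k] = k-th Taylor coefficient

Taylor mul(const Taylor& a, const Taylor& b) {
    Taylor c(a.size(), 0.0);                  // truncate to the common length
    for (std::size_t k = 0; k < c.size(); ++k)
        for (std::size_t i = 0; i <= k; ++i)
            c[k] += a[i] * b[k - i];
    return c;
}

int main() {
    // (1 + x)^2 = 1 + 2x + x^2, expanded around 0 with 4 coefficients
    Taylor p = {1.0, 1.0, 0.0, 0.0};
    Taylor q = mul(p, p);
    for (double c : q) std::printf("%g ", c);   // prints 1 2 1 0
    std::printf("\n");
}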
In arithmetics like complex arithmetic, rational arithmetic, matrix or vector arithmetic, interval arithmetic, differentiation arithmetic and Taylor arithmetic, the arithmetic itself is predefined and can be hidden in the run-time system of the compiler. The user calls the arithmetic operations by the usual operator symbols. The desired arithmetic is activated by type specification of the operands.

program sample;
use itaylor;
function f(x: itaylor): itaylor[lb(x)..ub(x)];
begin f := exp(5000/(sin(11+sqr(x/100))+30));
end;
var a: interval; b, fb: itaylor[0..40];
begin
read(a);
expand(a,b);
fb := f(b);
writeln ('36th Taylor coefficient: ', fb[36]);
writeln ('40th Taylor coefficient: ', fb[40]);
end.
Test results: a = [1.001, 1.005]
36th Taylor coefficient: [-2.4139E+002, -2.4137E+002]
40th Taylor coefficient: [ 1.0759E-006, 1.0760E-006]

Fig. 3.5. Computation of enclosures of Taylor coefficients

As an example the PASCAL-XSC program shown in Fig. 3.5 computes and prints enclosures of the 36th and the 40th Taylor coefficient of the function

    f(x) = exp(5000/(sin(11 + sqr(x/100)) + 30))

over the interval a = [1.001, 1.005].
First the interval a is read. Then it is expanded into the 41-tuple of its Taylor coefficients (a, 1, 0, 0, ..., 0) which is kept in b. Then the expression for f(x) is evaluated in interval Taylor arithmetic and enclosures of the 36th and the 40th Taylor coefficient over the interval a are printed.
Automatic differentiation develops its full power in the case of differentiable functions of several real variables. For instance, values or enclosures of the gradient

    ∇f = (∂f/∂x1, ∂f/∂x2, ..., ∂f/∂xn)

of a function f : ℝ^n → ℝ or of the Jacobian or Hessian matrix can be computed directly from the expression for the function f. No formal expressions for the derivatives are needed. A particular mode, the so called reverse mode, allows a considerable acceleration for many algorithms of automatic differentiation. In the particular case of the computation of the gradient the following inequality can be shown to hold:

    A(f, ∇f) ≤ 5 A(f).

Here A(f, ∇f) denotes the number of operations for the computation of the gradient including the function evaluation, and A(f) the number of operations for the function evaluation. For more details see [15, 16, 49].

3.9 Interval Arithmetic on the Computer

So far the basic set of all our considerations was the set of real numbers ℝ or the set of extended real numbers ℝ* := ℝ ∪ {−∞} ∪ {+∞}. Actual computations, however, can only be carried out on a computer. The elements of ℝ and Iℝ are in general not representable and the arithmetic operations defined for them are not executable on the computer. So we have to map these spaces and their operations onto computer representable subsets. Typical such subsets are floating-point systems, for instance, as defined by the IEEE arithmetic standard. However, in this article we do not assume any particular number representation and data format of the computer representable subsets. The considerations should apply to other data formats as well. Nevertheless, all essential properties of floating-point systems are covered.
We assume that R is a finite subset of computer representable elements of ℝ with the following properties:

    0, 1 ∈ R and for all a ∈ R also −a ∈ R.

The least positive nonzero element of R will be denoted by L and the greatest positive element of R by G. Let R* := R ∪ {−∞} ∪ {+∞}.
Now let ∇ : ℝ* → R* and Δ : ℝ* → R* be mappings of ℝ* onto R* with the property that for all a ∈ ℝ*, ∇a is the greatest lower bound of a in R* and Δa is the least upper bound of a in R*. These mappings have the following three properties which also define them uniquely [33, 34]:

(R1) ∇a = a for all a ∈ R*,             Δa = a for all a ∈ R*,
(R2) a ≤ b ⇒ ∇a ≤ ∇b for a, b ∈ ℝ*,     a ≤ b ⇒ Δa ≤ Δb for a, b ∈ ℝ*,
(R3) ∇a ≤ a for all a ∈ ℝ*,             a ≤ Δa for all a ∈ ℝ*.

Because of these properties ∇ is called the monotone rounding downwards and Δ is called the monotone rounding upwards. The mappings ∇ and Δ are not independent of each other. The following equalities hold for them:

    ∇a = −Δ(−a)  and  Δa = −∇(−a)  for all a ∈ ℝ*.



With the roundings ∇ and Δ arithmetic operations ∇∘ and Δ∘, ∘ ∈ {+, −, *, /}, can be defined in R by:

(RG)  a ∇∘ b := ∇(a ∘ b) for all a, b ∈ R and all ∘ ∈ {+, −, *, /},
      a Δ∘ b := Δ(a ∘ b) for all a, b ∈ R and all ∘ ∈ {+, −, *, /},
      with b ≠ 0 in case of division.

For elements a, b ∈ R (floating-point numbers, for instance) these operations approximate the correct result a ∘ b in ℝ by the greatest lower bound ∇(a ∘ b) and the least upper bound Δ(a ∘ b) in R* for all operations ∘ ∈ {+, −, *, /}.

In the particular case of floating-point numbers, the IEEE arithmetic standard, for instance, requires the roundings ∇ and Δ, and the corresponding operations defined by (RG). As a consequence of this all processors that provide IEEE arithmetic are equipped with the roundings ∇ and Δ and the eight operations ∇+, ∇−, ∇*, ∇/, Δ+, Δ−, Δ*, and Δ/. On any computer each one of these roundings and operations should be provided by a single instruction which is directly supported by the computer hardware.

The IEEE arithmetic standard [78], however, separates the rounding from the arithmetic operation. First the rounding mode has to be set, then one of the operations ∇+, ∇−, ∇*, ∇/, Δ+, Δ−, Δ*, and Δ/ may be called. This slows down these operations, and interval arithmetic, unnecessarily and significantly.
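
On common processors the operations of (RG) can at least be emulated in software with the rounding-mode control of C and C++. The following sketch shows this for addition; the helper names are ours. It assumes that the platform honours fesetround and that the compiler does not constant-fold or reorder the floating-point operations across the mode switches (with GCC, for instance, a flag such as -frounding-math is needed); a genuine hardware instruction set would provide each of the eight operations as a single instruction instead.

#include <cfenv>
#include <cstdio>

// a + b rounded downwards and a + b rounded upwards, emulated in software.
double add_down(double a, double b) {
    std::fesetround(FE_DOWNWARD);        // round towards -infinity
    double r = a + b;
    std::fesetround(FE_TONEAREST);
    return r;
}

double add_up(double a, double b) {
    std::fesetround(FE_UPWARD);          // round towards +infinity
    double r = a + b;
    std::fesetround(FE_TONEAREST);
    return r;
}

int main() {
    double lo = add_down(1.0, 1e-20);    // greatest floating-point number <= 1 + 1e-20
    double hi = add_up(1.0, 1e-20);      // least floating-point number  >= 1 + 1e-20
    std::printf("%.17g <= 1 + 1e-20 <= %.17g\n", lo, hi);
}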
In the preceding sections we have defined and studied the set of intervals Iℝ*. We are now going to approximate intervals of Iℝ* by intervals over R*. We consider intervals over ℝ* with endpoints in R* of the form

    [a1, a2] = {x ∈ ℝ* | a1, a2 ∈ R*, a1 ≤ x ≤ a2}.

The set of all such intervals is denoted by IR*. The empty set [ ] is assumed to be an element of IR* also. Then IR* ⊆ Iℝ*. Note that an interval of IR* represents a continuous set of real numbers. It is not just a set of elements of R*! Only the bounds of intervals of IR* are restricted to be elements of R*.
Like Iℝ*, the subset IR* is an ordered set with respect to both order relations ≤ and ⊆. It can be shown that IR* is a complete sublattice of Iℝ* with respect to both order relations. For a complete proof of these properties see [33, 34]. For completeness we list the order and lattice operations in both cases. We assume that A = [a1, a2] and B = [b1, b2] are elements of IR*.

{IR*, ≤}: [a1, a2] ≤ [b1, b2] :⇔ a1 ≤ b1 ∧ a2 ≤ b2.

The least element of IR* with respect to ≤ is the interval [−∞, −∞]. The greatest element is [+∞, +∞]. The infimum and supremum respectively of a subset S ⊆ IR* with respect to ≤ are, with A = [a1, a2] ∈ S:

    inf_≤ S = [inf_{A∈S} a1, inf_{A∈S} a2],    sup_≤ S = [sup_{A∈S} a1, sup_{A∈S} a2].

Since R* and IR* only contain a finite number of elements these can also be written

    inf_≤ S = [min_{A∈S} a1, min_{A∈S} a2],    sup_≤ S = [max_{A∈S} a1, max_{A∈S} a2].

{IR*, ⊆}: [a1, a2] ⊆ [b1, b2] :⇔ b1 ≤ a1 ∧ a2 ≤ b2.

The least element of IR* with respect to ⊆ is the empty set [ ]. The greatest element is the interval [−∞, +∞]. The infimum and supremum respectively of a subset S ⊆ IR* with respect to ⊆ are, with A = [a1, a2] ∈ S,

    inf_⊆ S = [sup_{A∈S} a1, inf_{A∈S} a2],    sup_⊆ S = [inf_{A∈S} a1, sup_{A∈S} a2].

Because of the finiteness of R* and IR* these can also be written

    inf_⊆ S = [max_{A∈S} a1, min_{A∈S} a2],    sup_⊆ S = [min_{A∈S} a1, max_{A∈S} a2],

i.e. the infimum is the intersection and the supremum is the interval (convex) hull of all intervals of S. As in the case of Iℝ* we shall use the usual mathematical symbols ∩S for inf_⊆ S and ⊔S for sup_⊆ S. The intersection may be empty. If in particular S consists of just two elements A = [a1, a2] and B = [b1, b2] this reads:

    A ∩ B = [max(a1, b1), min(a2, b2)]   intersection,
    A ⊔ B = [min(a1, b1), max(a2, b2)]   interval hull.

Thus, for both order relations ≤ and ⊆ and for any subset S of IR* the infimum and supremum are the same as taken in Iℝ*. This is by definition the criterion for a subset of a complete lattice to be a complete sublattice. So we have the result:

    {IR*, ≤} is a complete sublattice of {Iℝ*, ≤}, and
    {IR*, ⊆} is a complete sublattice of {Iℝ*, ⊆}.

In many applications of interval arithmetic, it has to be determined whether an interval A is strictly included in an interval B. This is formally expressed by the notation

    A ⊂ B°.                                                          (3.38)

Here B° denotes the interior of B. With A = [a1, a2] and B = [b1, b2], (3.38) is equivalent to

    A ⊂ B° :⇔ b1 < a1 ∧ a2 < b2.
In general, interval calculations are employed to determine sets that include the solution to a given problem. Since the arithmetic operations in Iℝ cannot in general be executed on the computer, they have to be approximated by corresponding operations in IR. These approximations are required to have the following properties:

(a) The result of any computation in IR always has to include the result of the corresponding computation in Iℝ.
(b) The result of the computation in IR should be as close as possible to the result of the corresponding computation in Iℝ.

For all arithmetic operations ∘ ∈ {+, −, *, /} in Iℝ, (a) means that the computer approximation ◊∘ in IR must be defined in a way that the following inclusion holds:

    A ∘ B ⊆ A ◊∘ B   for A, B ∈ IR and all ∘ ∈ {+, −, *, /}.          (3.39)

Similar requirements must hold for the elementary functions. Earlier in this paper we have defined the interval evaluation of an elementary function f over an interval A ∈ Iℝ by the range of function values f(A) = {f(a) | a ∈ A}. So (a) requires that for the computer evaluation ◊f(A) of f the following inclusion holds:

    f(A) ⊆ ◊f(A)   with A and ◊f(A) ∈ IR.                            (3.40)

(3.39) and (3.40) are necessary consequences of (a). There are reasonably good realizations of interval arithmetic on computers which only fulfil property (a).
(b) is an independent additional requirement. In the cases (3.39) and (3.40) it requires that A ◊∘ B and ◊f(A) should be the smallest intervals in IR* that include the results A ∘ B and f(A) in Iℝ* respectively. It turns out that interval arithmetic on any computer is uniquely defined by this requirement. Realization of it actually is the easiest way to support interval arithmetic on the computer by hardware. To establish this is the aim of this paper.
We are now going to discuss this arithmetic in detail. First we define the mapping ◊ : Iℝ* → IR* which approximates each interval A of Iℝ* by its least upper bound ◊A in IR* with respect to the order relation ⊆. This mapping has the property that for each interval A = [a1, a2] ∈ Iℝ* its image in IR* is

    ◊A = [∇a1, Δa2].

This mapping ◊ has the following properties which also define it uniquely [33, 34]:

(R1) ◊A = A for all A ∈ IR*,
(R2) A ⊆ B ⇒ ◊A ⊆ ◊B for A, B ∈ Iℝ*,   (monotone)
(R3) A ⊆ ◊A for all A ∈ Iℝ*.            (upwardly directed)

We call this mapping ◊ the interval rounding. It has the additional property
(R4) ◊(−A) = −◊(A),                      (antisymmetry)

since with A = [a1, a2], −A = [−a2, −a1] and ◊(−A) = [∇(−a2), Δ(−a1)] = [−Δa2, −∇a1] = −[∇a1, Δa2] = −◊A.
The interval rounding ◊ : Iℝ* → IR* is now employed in order to define arithmetic operations ◊∘, ∘ ∈ {+, −, *, /}, in IR, i.e. on the computer, by

(RG)  A ◊∘ B := ◊(A ∘ B) for all A, B ∈ IR and ∘ ∈ {+, −, *, /},
      with 0 ∉ B in case of division.

For intervals A, B ∈ IR (for instance intervals the bounds of which are floating-point numbers) these operations approximate the correct result of the interval operation A ∘ B in Iℝ by the least upper bound ◊(A ∘ B) in IR* with respect to the order relation ⊆ for all operations ∘ ∈ {+, −, *, /}.

Now we proceed similarly with the elementary functions. The interval evaluation f(A) of an elementary function f over an interval A ∈ IR is approximated on the computer by its image under the interval rounding ◊. Consequently the following inclusion holds:

    f(A) ⊆ ◊f(A)   with A ∈ IR and ◊f(A) ∈ IR*.

Thus ◊f(A) is the least upper bound of f(A) in IR* with respect to the order relation ⊆.
If the arithmetic operations for elements of IR are defined by (RG) with the rounding (R), the inclusion isotony and the inclusion property hold for the computer approximations of all interval operations ∘ ∈ {+, −, *, /}. These are simple consequences of (R2) and (R3) respectively:

Inclusion isotony:

    A ⊆ B ∧ C ⊆ D ⇒ A ∘ C ⊆ B ∘ D
                  ⇒ ◊(A ∘ C) ⊆ ◊(B ∘ D)        (by (R2))
                  ⇒ A ◊∘ C ⊆ B ◊∘ D,           (by (RG))   for all A, B, C, D ∈ IR.

Inclusion property:

    a ∈ A ∧ b ∈ B ⇒ a ∘ b ∈ A ∘ B
                  ⇒ a ∘ b ∈ ◊(A ∘ B)           (by (R3))
                  ⇒ a ∘ b ∈ A ◊∘ B,            (by (RG))   for a, b ∈ ℝ, A, B ∈ IR.

Both properties also hold for the interval evaluation of the elementary functions:

Inclusion isotony:  A ⊆ B ⇒ f(A) ⊆ f(B) ⇒ ◊f(A) ⊆ ◊f(B)   (by (R2)),   for A, B ∈ IR.

Inclusion property: a ∈ A ⇒ f(a) ∈ f(A) ⇒ f(a) ∈ ◊f(A)    (by (R3)),   for a ∈ ℝ, A ∈ IR.

Note that these two properties for the elementary functions are simple consequences of (R2) and (R3) only. The optimality of the rounding ◊ : Iℝ* → IR*, which requires that the image of an interval A ∈ Iℝ* is the least upper bound in IR*, is not necessarily required!
With these results we can define the computer evaluation of general arithmetic expressions and of real functions in interval arithmetic for an interval X ∈ IR. If f(x) is an arithmetic expression (consisting of constants, variables, and elementary functions connected by arithmetic operations and parentheses) an interval evaluation on the computer for an interval X ∈ IR (out of the domain of definition D(f)) is obtained by the following rules:

• Every constant a ∈ ℝ is replaced by the interval [∇a, Δa].
• Every occurrence of the variable x in the expression for f(x) is replaced by the interval X.
• An elementary function φ(x) is replaced by its computer evaluation ◊φ(X).
• Every real operation ∘ ∈ {+, −, *, /} is replaced by the corresponding interval operation ◊∘, ∘ ∈ {+, −, *, /}.
• The interval expression thus defined is evaluated on the computer in interval arithmetic.

This procedure extends the central properties of interval arithmetic - the inclusion isotony and the inclusion property - to computer evaluations of arithmetic expressions and of real functions in interval arithmetic.
(3.11) in Section 3.3 summarizes the explicit formulas (3.1), (3.2), (3.3), and (3.4) for the operations with intervals A = [a1, a2] and B = [b1, b2] ∈ Iℝ by

    A ∘ B = [min_{i,j=1,2}(ai ∘ bj), max_{i,j=1,2}(ai ∘ bj)]   for all ∘ ∈ {+, −, *, /},
    with 0 ∉ B in case of division.

Thus, the definition of the operations in IR by (RG) and of the interval rounding ◊ : Iℝ* → IR* by (R) leads directly to the following formula for the operations for intervals A = [a1, a2] and B = [b1, b2] ∈ IR:

    A ◊∘ B := ◊(A ∘ B) = [∇ min_{i,j=1,2}(ai ∘ bj), Δ max_{i,j=1,2}(ai ∘ bj)],      (3.41)

∘ ∈ {+, −, *, /}, with 0 ∉ B in case of division.

Since ∇ : ℝ* → R* and Δ : ℝ* → R* are monotone mappings (R2), we obtain

    A ◊∘ B := ◊(A ∘ B) = [min_{i,j=1,2}(ai ∇∘ bj), max_{i,j=1,2}(ai Δ∘ bj)],        (3.42)

∘ ∈ {+, −, *, /}, with 0 ∉ B in case of division.

Employing this equation and the explicit formulas for the arithmetic operations in Iℝ listed under I, II, III, IV, V, VI in Section 3.6 leads to the following formulas for the execution of the arithmetic operations ◊∘, ∘ ∈ {+, −, *, /}, in IR on the computer for intervals A = [a1, a2] and B = [b1, b2] ∈ IR:

I.   Equality:        [a1, a2] = [b1, b2] :⇔ a1 = b1 ∧ a2 = b2.
II.  Addition:        [a1, a2] ◊+ [b1, b2] := [a1 ∇+ b1, a2 Δ+ b2].
III. Subtraction:     [a1, a2] ◊− [b1, b2] := [a1 ∇− b2, a2 Δ− b1].
IV.  Negation:        A = [a1, a2], −A = [−a2, −a1].
V.   Multiplication:  see Table 3.6.
VI.  Division, 0 ∉ B: see Table 3.7.

Table 3.6. Multiplication of two intervals A, B ∈ IR on the computer.

A ◊* B         b1 ≥ 0                  b1 < 0 < b2                   b2 ≤ 0
a1 ≥ 0         [a1 ∇* b1, a2 Δ* b2]    [a2 ∇* b1, a2 Δ* b2]          [a2 ∇* b1, a1 Δ* b2]
a1 < 0 < a2    [a1 ∇* b2, a2 Δ* b2]    [min(a1 ∇* b2, a2 ∇* b1),     [a2 ∇* b1, a1 Δ* b1]
                                        max(a1 Δ* b1, a2 Δ* b2)]
a2 ≤ 0         [a1 ∇* b2, a2 Δ* b1]    [a1 ∇* b2, a1 Δ* b1]          [a2 ∇* b2, a1 Δ* b1]

Table 3.7. Division of two intervals A, B ∈ IR with 0 ∉ B on the computer.

A ◊/ B         b1 > 0                  b2 < 0
a1 ≥ 0         [a1 ∇/ b2, a2 Δ/ b1]    [a2 ∇/ b2, a1 Δ/ b1]
a1 < 0 < a2    [a1 ∇/ b1, a2 Δ/ b1]    [a2 ∇/ b2, a1 Δ/ b2]
a2 ≤ 0         [a1 ∇/ b1, a2 Δ/ b2]    [a2 ∇/ b1, a1 Δ/ b2]

These formulas show, in particular, that the operations ◊∘, ∘ ∈ {+, −, *, /}, in IR are executable on a computer if the operations ∇∘ and Δ∘, ∘ ∈ {+, −, *, /}, for elements of R are available. These operations have been defined earlier in this Section by

(RG)  a ∇∘ b := ∇(a ∘ b) and a Δ∘ b := Δ(a ∘ b) for a, b ∈ R and ∘ ∈ {+, −, *, /},
      with b ≠ 0 in case of division.

This in turn shows the importance of the roundings ∇ : ℝ* → R* and Δ : ℝ* → R*.
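
For illustration, the formulas II, III and (3.42) can be written down directly in software. The sketch below uses the fesetround emulation of the directed roundings introduced earlier; the type name Ival and the helpers down and up are ours, the multiplication and division deliberately take the simple minimum/maximum form of (3.42) rather than the case distinction of Tables 3.6 and 3.7, and the division assumes 0 ∉ B.

#include <algorithm>
#include <cfenv>
#include <cstdio>

struct Ival { double lo, hi; };

template <class Op>
double down(Op op) {                 // evaluate op() with rounding downwards
    std::fesetround(FE_DOWNWARD);
    double r = op();
    std::fesetround(FE_TONEAREST);
    return r;
}
template <class Op>
double up(Op op) {                   // evaluate op() with rounding upwards
    std::fesetround(FE_UPWARD);
    double r = op();
    std::fesetround(FE_TONEAREST);
    return r;
}

Ival add(Ival A, Ival B) {           // formula II
    return { down([&]{ return A.lo + B.lo; }), up([&]{ return A.hi + B.hi; }) };
}
Ival sub(Ival A, Ival B) {           // formula III
    return { down([&]{ return A.lo - B.hi; }), up([&]{ return A.hi - B.lo; }) };
}
Ival mul(Ival A, Ival B) {           // formula (3.42)
    double l = down([&]{ return std::min({A.lo*B.lo, A.lo*B.hi, A.hi*B.lo, A.hi*B.hi}); });
    double u = up  ([&]{ return std::max({A.lo*B.lo, A.lo*B.hi, A.hi*B.lo, A.hi*B.hi}); });
    return {l, u};
}
Ival div(Ival A, Ival B) {           // requires 0 not in B
    double l = down([&]{ return std::min({A.lo/B.lo, A.lo/B.hi, A.hi/B.lo, A.hi/B.hi}); });
    double u = up  ([&]{ return std::max({A.lo/B.lo, A.lo/B.hi, A.hi/B.lo, A.hi/B.hi}); });
    return {l, u};
}

int main() {
    Ival A{-1.0, 2.0}, B{3.0, 4.0};
    Ival P = mul(A, B);
    std::printf("[%g, %g]\n", P.lo, P.hi);   // [-4, 8]
}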
Table 3.8. The 8 cases of the division of two intervals A ◊/ B, with A, B ∈ IR and 0 ∈ B.

case   A = [a1, a2]   B = [b1, b2]    A ◊/ B
1      0 ∈ A          0 ∈ B           [−∞, +∞]
2      0 ∉ A          B = [0, 0]      [ ]
3      a2 < 0         b1 < b2 = 0     [a2 ∇/ b1, +∞]
4      a2 < 0         b1 < 0 < b2     [−∞, a2 Δ/ b2] ∪ [a2 ∇/ b1, +∞]
5      a2 < 0         0 = b1 < b2     [−∞, a2 Δ/ b2]
6      a1 > 0         b1 < b2 = 0     [−∞, a1 Δ/ b1]
7      a1 > 0         b1 < 0 < b2     [−∞, a1 Δ/ b1] ∪ [a1 ∇/ b2, +∞]
8      a1 > 0         0 = b1 < b2     [a1 ∇/ b2, +∞]

In case of division by an interval B which contains zero, eight cases had to be distinguished in Table 3.3. On the computer these cases have to be performed as shown in Table 3.8. With A = [a1, a2] and B = [b1, b2] Table 3.9 shows the same cases as Table 3.8 in another representation.

Table 3.9. The result of the division A ◊/ B, with A, B ∈ IR and 0 ∈ B.

A ◊/ B        B = [0, 0]   b1 < b2 = 0       b1 < 0 < b2                         0 = b1 < b2
a2 < 0        [ ]          [a2 ∇/ b1, +∞]    [−∞, a2 Δ/ b2] ∪ [a2 ∇/ b1, +∞]     [−∞, a2 Δ/ b2]
a1 ≤ 0 ≤ a2   [−∞, +∞]     [−∞, +∞]          [−∞, +∞]                            [−∞, +∞]
a1 > 0        [ ]          [−∞, a1 Δ/ b1]    [−∞, a1 Δ/ b1] ∪ [a1 ∇/ b2, +∞]     [a1 ∇/ b2, +∞]
[al Wb2, +ooJ

The generalized Newton operator requires the subtraction of a set which


tends to plus or minus infinity or both or which is the empty set from a real
number x. On the computer the corresponding rules now appear in the form

+00]
x~[-oo, = [-00, +00],
y]
x~[-oo, = [x'V'y, +00],
x~[y, +00] = [-00, x8y],
xN[-oo, y] u [z, +00]) = [-00, x8z] U [x'V'y, +00],
x~[] = [].
After the computation of the Interval Newton Operator the intersection with
a finite interval [Cl, C2] still has to be taken in the generalized Interval Newton
Method. The result may be one or two finite intervals or the empty interval [ ].
These cases are expressed by the following explicit formulas:
124 3. Interval Arithmetic Revisited

(x~[-oo,+oo]) n [c1,C2] = [CI,C2J,


(X~[-oo, y]) n [CI, C2] = [X'Vy, C2] or [J,
(x~[y, +00]) n [CI,C2] = [cl,xAy] or [J,
x~([-oo,y] U [z,+oo]) n [CI,C2] = [cI,xAz] U [X'Vy,C2] or [J,
(x~[]) n [CI, C2] = [] n [CI, C2] = [].

For geometric reasons [CI, C2] can only occur as the result of the intersection
in the first case.
For interval arithmetic the roundings ∇ : ℝ* → R* and Δ : ℝ* → R* are of particular interest. They can be defined by the following properties:

    ∇a := max{x ∈ R* | x ≤ a},   monotone rounding downwards, and
    Δa := min{x ∈ R* | x ≥ a},   monotone rounding upwards.

The following equalities hold for ∇ and Δ:

    ∇a = −Δ(−a)  and  Δa = −∇(−a),

i.e. they can be expressed by one another.
For completeness we give an explicit description of the rounding ∇ : ℝ* → R* in the case that R* is a floating-point system. A floating-point number is a real number of the form x = m · b^e. Here m is the mantissa, b is the base of the number system in use and e is the exponent. b is an integer greater than one. The exponent is an integer between two fixed integer bounds e1, e2, and in general e1 ≤ 0 ≤ e2. The mantissa is of the form m = s · Σ_{i=1}^{r} d[i] · b^{−i}, where s ∈ {+, −} is the sign of the number. The d[i] are the digits of the mantissa. In a normalized floating-point system they have the property d[i] ∈ {0, 1, ..., b − 1} for all i = 1(1)r and d[1] ≠ 0. Thus |m| < 1. Without the condition d[1] ≠ 0, floating-point numbers are said to be unnormalized. The set of normalized floating-point numbers does not contain zero. So zero is adjoined to R*. For a unique representation of zero it is often assumed that m = 0.00...0 and e = 0. A floating-point system thus depends on the constants b, r, e1, and e2. Here we denote it by R = R(b, r, e1, e2); then R* := R ∪ {−∞} ∪ {+∞}.
In the following description of ∇ : ℝ* → R* we use the abbreviation G := 0.(b−1)(b−1)...(b−1) · b^{e2} for the greatest positive floating-point number. Then we obtain for ∇a:
        +∞                                   for a = +∞,
        +G                                   for +G ≤ a < +∞,
        +0.a[1]a[2]...a[r] · b^e             for b^{e1−1} ≤ a < +G,
        +0.00...0 · b^0                      for 0 ≤ a < b^{e1−1},
        −0.10...0 · b^{e1}                   for −b^{e1−1} ≤ a < 0,
        −0.a[1]a[2]...a[r] · b^e             for −G ≤ a < −b^{e1−1} ∧
∇a =                                             a[r+i] = 0 for all i ≥ 1,
        −0.10...0 · b^{e+1}                  for −G ≤ a < −b^{e1−1} ∧
                                                 a[i] = b−1 for all i = 1(1)r ∧
                                                 a[r+i] ≠ 0 for at least one i ≥ 1,
        −(0.a[1]a[2]...a[r] + b^{−r}) · b^e  for −G ≤ a < −b^{e1−1} ∧
                                                 a[i] ≠ b−1 for at least one i ∈ {1, ..., r} ∧
                                                 a[r+i] ≠ 0 for at least one i ≥ 1,
        −∞                                   for −∞ ≤ a < −G.

Here ±0.a[1]a[2]... · b^e denotes the normalized representation of a with respect to the base b, and m_a denotes its mantissa.

Using the function [a] (the greatest integer less than or equal to a) the description of ∇a can be shortened:

        +∞                          for a = +∞,
        +G                          for +G ≤ a < +∞,
        [m_a · b^r] · b^{−r} · b^e  for b^{e1−1} ≤ |a| ≤ +G,
∇a =    0.00...0 · b^0              for 0 ≤ a < b^{e1−1},
        −0.10...0 · b^{e1}          for −b^{e1−1} ≤ a < 0,
        −∞                          for −∞ ≤ a < −G.

The more detailed description of ∇a above shows that a normalization may still be necessary.
A few additional but very similar cases occur if, in the case e < e1, e is set to e1 and unnormalized mantissas are permitted.
In these representations for ∇a we have assumed that a floating-point number is represented by the so called sign-magnitude representation. For real numbers a ≥ 0 the rounded value ∇a is obtained by truncation of a after the r-th digit of the normalized mantissa m_a of a. If we denote this process by t(a) (truncation), we have

    ∇a = t(a)   for a ≥ 0.

This is very easy to execute. Truncation can also be used to perform the rounding ∇a in case of negative numbers a < 0 if negative numbers are represented by their b-complement. Then the rounded value ∇a can be obtained by truncation of the b-complement a + x of a via the process:

    ∇a = t(a + x) − x   for a < 0,                                   (3.43)

with a suitable x. See Fig. 3.6.


Fig. 3.6. Execution of the rounding ∇a in case of b-complement representation of negative numbers a < 0.

Example: We assume that the decimal number system is used, and that the mantissa has three decimal digits. Then we obtain for the positive real number a = 0.354672 · 10^3 ∈ ℝ:

    ∇a = t(a) = 0.354 · 10^3.

For the negative real number a = −0.354672 · 10^3 we obtain obviously

    ∇a = −0.355 · 10^3.

This value is obtained by application of (3.43) with x = 1.00...0 · 10^3:

    a + x = 0.645328 · 10^3,
    t(a + x) = 0.645 · 10^3,
    ∇a = t(a + x) − x = −0.355 · 10^3.

Here the easily executable b-complement has been taken twice. In between the function t(a) was applied, which also is easily executable. These three steps are particularly simple if the binary number system is used.
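
The effect of the rounding ∇ in this toy decimal format can also be reproduced in a few lines of C++. The sketch below simply rounds towards −∞ to r significant decimal digits with floor; it reproduces the two values of the example, but it does not model the b-complement mechanism itself and ignores boundary cases such as exact powers of ten.

#include <cmath>
#include <cstdio>

// Round a nonzero a toward -infinity to r significant decimal digits (toy model).
double round_down_decimal(double a, int r) {
    if (a == 0.0) return 0.0;
    int e = (int)std::ceil(std::log10(std::fabs(a)));   // exponent with mantissa < 1
    double scale = std::pow(10.0, r - e);
    return std::floor(a * scale) / scale;
}

int main() {
    std::printf("%g\n", round_down_decimal( 354.672, 3));   //  354
    std::printf("%g\n", round_down_decimal(-354.672, 3));   // -355
}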
It is interesting that in case of the (b − 1)-complement representation of negative numbers the monotone rounding downwards ∇a cannot be executed by the function t(a). This representation is isomorphic to the sign-magnitude representation.
In the preceding Sections 3.1 to 3.8 ideal interval arithmetic for elements of Iℝ, including division by an interval which contains zero, has been developed. In no case did the symbols −∞ and +∞ occur as result of an interval operation. This is not so in this Section where interval arithmetic on the computer is considered. Here −∞ and +∞ can occur as result of the roundings ∇ and Δ, and as result of the operations a ∇∘ b and a Δ∘ b, ∘ ∈ {+, −, *, /}, respectively. The interval rounding is defined by ◊A := [∇a1, Δa2], and the arithmetic operations for intervals A, B ∈ IR are defined by A ◊∘ B := ◊(A ∘ B), ∘ ∈ {+, −, *, /}. As a consequence of this the symbols −∞ and +∞ can also occur as bounds of the result of an interval operation.
This happens, for instance, in case of division by an interval which contains zero, see Table 3.8. The extended Interval Newton Method is an example of this. We have studied this process in detail. Here very large intervals with −∞ and +∞ as bounds only appear intermediately. They disappear again as soon as the intersection with the previous approximation is taken. Finally the diameters of the approximations decrease to small bounds for the solution.
Among the six interval operations addition, subtraction, multiplication, division, intersection, and interval hull, the intersection is the only operation which can reduce an interval which stretches to −∞ or +∞ or both to a finite interval again. This step is advantageously used in the extended Interval Newton Method.
Also certain elementary functions can reduce an interval which stretches to −∞ or +∞ or both to a finite interval again. In such a case continuation of the computation may also be reasonable. The user has to take care that such situations are appropriately treated in his program.
In general, the appearance of −∞ or +∞ in the result of an interval operation indicates that an exponent overflow has occurred or that an operation or an elementary function has been called outside its range of definition. This means that the computation has gotten out of control. In this case continuation of the computation is not really recommendable. An appropriate scaling of the problem may be necessary.
Here the situation is very different from a conventional floating-point computation. In floating-point arithmetic the general directive often is just to "compute" at any price, hoping that at the end of the computation something reasonable will be delivered. In this process the non-numbers −∞, +∞, and even NaN (not a number) are often treated as numbers and the computation is continued with these entities. Since a floating-point computation often flips out of control anyhow, it must be the user's responsibility to control and judge the final result by other means.
In interval mathematics the general philosophy is very different. The user and the computation itself are controlling the computational process at all times. In general, an interval computation aims to compute small bounds for the solution of the problem. If during a computation the intervals grow overly large or even an interval appears which stretches to −∞ or +∞ or both, this should be taken as a severe warning. It should cause the user to think about and study the computational process again with the aim of obtaining smaller intervals. Blind continuation of the computation, even with non-numbers as in the case of floating-point arithmetic, hoping that something reasonable will come out at the end is in strong contradiction to the philosophy and basic understanding of interval mathematics.

3.10 Hardware Support for Interval Arithmetic


An interval operation requires the computation of the lower and the upper bound of the resulting interval. For the four basic operations each of these bounds can be expressed by a single floating-point operation with particular bounds of the interval operands. The lower bound of the resulting interval has
to be computed with rounding downwards and the upper bound with round-
ing upwards. While addition and subtraction are straightforward, multiplica-
tion and division require a detailed case analysis and in case of multiplication
additionally a maximum / minimum computation if both interval operands
contain zero. This may slow down these operations considerably in particular
if the case analysis is performed in software. Thus in summary an interval
operation is slower by a factor of at least two on a conventional sequential
processor in comparison with the corresponding floating-point operation.
We show in this section that with dedicated hardware interval arithmetic
can be made more or less as fast as simple floating-point arithmetic. The
cost increase for the additional hardware is relatively modest and it is close
to zero on superscalar processors. Although different in detail we follow in
this Section ideas of [71,72].
We assume in this section that one arithmetic operation as well as one comparison costs one unit of time, whereas switches controlled by a single bit such as the sign bit, and data transports inside the unit, are free of charge. For simplicity we denote the computer operations for intervals in this section by +, -, *, and /. The interval operands are denoted by A = [a1, a2] and B = [b1, b2]. The lower bound and upper bound of the result are denoted by lb and ub respectively, i.e. [lb, ub] := [a1, a2] ∘ [b1, b2], ∘ ∈ {+, -, *, /}.

3.10.1 Addition A + B and Subtraction A - B

The formulas for addition and subtraction

    [a1, a2] + [b1, b2] = [a1 ∇+ b1, a2 Δ+ b2],
    [a1, a2] - [b1, b2] = [a1 ∇− b2, a2 Δ− b1]

require no conditionals or exceptions. They show that in comparison with floating-point arithmetic a time factor of 2 is achieved with one arithmetic unit. Duplication of this unit yields a factor of 1. In the case of addition we have with one arithmetic unit sequentially:
(A) lb := a1 ∇+ b1;
(B) ub := a2 Δ+ b2;
and with two arithmetic units in parallel:
(C) lb := a1 ∇+ b1; ub := a2 Δ+ b2;

3.10.2 Multiplication A * B
A basic method for the multiplication of two intervals is the method of case
distinction. Nine cases have been distinguished in Table 3.6. In eight of the
nine cases one multiplication with directed rounding suffices for the compu-
tation of each bound of the resulting interval. When both interval operands
contain zero as an interior point, two multiplications with directed roundings and one comparison have to be performed for each bound of the resulting
interval. The case selection depends on the sign bits of the interval operands.
It may be performed by hardware multiplexers which select one of two in-
puts. On a sequential processor with one multiplier and one comparator the
following algorithm solves the problem:

Algorithm 1:
The eight cases with only one multiplication for each bound can be obtained by:

(A) lb := (if (b1 ≥ 0 ∨ (a2 ≤ 0 ∧ b2 > 0)) then a1 else a2)
      ∇* (if (a1 ≥ 0 ∨ (a2 > 0 ∧ b2 ≤ 0)) then b1 else b2);
(B) ub := (if (b1 ≥ 0 ∨ (a1 ≥ 0 ∧ b2 > 0)) then a2 else a1)
      Δ* (if (a1 ≥ 0 ∨ (a2 > 0 ∧ b1 ≥ 0)) then b2 else b1);

and the final case, where two multiplications have to be performed for each bound, by:

(C) p := a1 ∇* b2;
(D) q := a2 ∇* b1;
(E) lb := min(p, q); r := a1 Δ* b1;
(F) s := a2 Δ* b2;
(G) ub := max(r, s);

Taking all parts together we have:

if (a1 < 0 ∧ a2 > 0 ∧ b1 < 0 ∧ b2 > 0) then
  {(C),(D),(E),(F),(G)}
else
  {(A),(B)};

The correctness of the algorithm can be checked against the case distinctions of Table 3.6. The algorithm needs 5 time steps in the worst case. In all the other cases the product can be computed in two time steps.
If two multipliers and one comparator are provided the same algorithm reduces the execution time to one time step for (A), (B) and three time steps for (C), (D), (E), (F), (G). Two multiplications and a comparison can then be performed in parallel:

Algorithm 2:
(A) lb := (if (b1 ≥ 0 ∨ (a2 ≤ 0 ∧ b2 > 0)) then a1 else a2)
      ∇* (if (a1 ≥ 0 ∨ (a2 > 0 ∧ b2 ≤ 0)) then b1 else b2);
    ub := (if (b1 ≥ 0 ∨ (a1 ≥ 0 ∧ b2 > 0)) then a2 else a1)
      Δ* (if (a1 ≥ 0 ∨ (a2 > 0 ∧ b1 ≥ 0)) then b2 else b1);

and

(B) p := a1 ∇* b2; q := a2 ∇* b1;
(C) lb := min(p, q); r := a1 Δ* b1; s := a2 Δ* b2;
(D) ub := max(r, s);

if (a1 < 0 ∧ a2 > 0 ∧ b1 < 0 ∧ b2 > 0) then {(B),(C),(D)} else (A);
The resulting interval is delivered either in step (A) or in step (C) (min-
imum) and step (D) (maximum). In step (A) one multiplication for each
bound suffices while in the steps (B), (C), (D) a second multiplication and
a comparison are necessary for each bound. This case, where both operands
contain zero occurs rather rarely. So the algorithm shows that on a processor
with two multipliers and one comparator an interval multiplication can in
general be performed in the same time as a floating-point multiplication.
There are applications where a large number of interval products have to
be computed consecutively. This is the case, for instance, if the scalar product
of two interval vectors or a matrix vector product with interval components is
to be computed. In such a case it is desirable to perform the computation in
a pipeline. Algorithms like 1 and 2 can, of course, be performed in a pipeline.
But in these algorithms the time needed to compute an interval product
heavily depends on the data. In algorithm 1 the computation of an interval
product requires 2 or 5 time steps and in algorithm 2, 1 or 3 time steps. So
the pipeline would have to provide 5 time steps in case of algorithm 1 and 3
in case of algorithm 2 for each interval product. I.e. the worst case rules the
pipeline. The pipeline can not easily draw advantage out of the fact that in
the majority of cases the data would allow to compute the product in 2 or 1
time step, respectively.
There are other methods for computing an interval product which, al-
though they look more complicated at first glance, lead to a more regular
pipeline. These methods compute an interval product in the same number of
time steps as algorithms 1 and 2. The following two algorithms display such
possibilities.

Algorithm 3:
By (3.41) the interval product can be computed by the following formula:

    A * B := [∇ min(a1*b1, a1*b2, a2*b1, a2*b2),
              Δ max(a1*b1, a1*b2, a2*b1, a2*b2)].

This leads to the following 5 time steps for the computation of A * B using 1 multiplier, 2 comparators and 2 assignments:
(A) p := a1 * b1;
(B) q := a1 * b2;
(C) r := a2 * b1; MIN := min(p, q); MAX := max(p, q);
(D) s := a2 * b2; MIN := min(MIN, r); MAX := max(MAX, r);
(E) lb := ∇ min(MIN, s); ub := Δ max(MAX, s);

Note that here the minimum and maximum are taken from the unrounded
products of double length. The algorithm always needs 5 time steps. In algo-
rithm 1 this is the worst case.

Algorithm 4:
Using the same formula but 2 multipliers, 2 comparators and 2 assignments leads to:
(A) p := a1 * b1; q := a1 * b2;
(B) r := a2 * b1; s := a2 * b2; MIN := min(p, q); MAX := max(p, q);
(C) MIN := min(MIN, r); MAX := max(MAX, r);
(D) lb := ∇ min(MIN, s); ub := Δ max(MAX, s);

Again the minimum and maximum are taken from the unrounded products.
The algorithm needs 4 time steps. This is one time step more than the cor-
responding algorithm 2 using case distinction with two multipliers.

3.10.3 Interval Scalar Product Computation

Let us denote the components of the two interval vectors A = (Ak) and B = (Bk) by Ak = [a_{k,1}, a_{k,2}] and Bk = [b_{k,1}, b_{k,2}], k = 1(1)n. Then the product Ak * Bk is to be computed by

    Ak * Bk = [min(a_{k,1}b_{k,1}, a_{k,1}b_{k,2}, a_{k,2}b_{k,1}, a_{k,2}b_{k,2}),
               max(a_{k,1}b_{k,1}, a_{k,1}b_{k,2}, a_{k,2}b_{k,1}, a_{k,2}b_{k,2})].

The formula for the scalar product now reads:

    [lb, ub] := A ◊* B := ◊(A * B) := ◊ Σ_{k=1}^{n} Ak * Bk
             = [∇ Σ_{k=1}^{n} min(a_{k,1}b_{k,1}, a_{k,1}b_{k,2}, a_{k,2}b_{k,1}, a_{k,2}b_{k,2}),
                Δ Σ_{k=1}^{n} max(a_{k,1}b_{k,1}, a_{k,1}b_{k,2}, a_{k,2}b_{k,1}, a_{k,2}b_{k,2})].

This leads to the following pipeline using 1 multiplier, 2 comparators, and 2 long fixed-point accumulators (see chapter 1):

Algorithm 5:
(A) p := a_{k,1} * b_{k,1};
(B) q := a_{k,1} * b_{k,2};
(C) r := a_{k,2} * b_{k,1};       MIN := min(p, q);   MAX := max(p, q);
(D) s := a_{k,2} * b_{k,2};       MIN := min(MIN, r); MAX := max(MAX, r);
(E) p := a_{k+1,1} * b_{k+1,1};   MIN := min(MIN, s); MAX := max(MAX, s);
(F) q := a_{k+1,1} * b_{k+1,2};   lb := lb + MIN;     ub := ub + MAX;
(G) r := a_{k+1,2} * b_{k+1,1};   MIN := min(p, q);   MAX := max(p, q);
(H) s := a_{k+1,2} * b_{k+1,2};   MIN := min(MIN, r); MAX := max(MAX, r);
                                  MIN := min(MIN, s); MAX := max(MAX, s);
                                  lb := lb + MIN;     ub := ub + MAX;
    ...
                                  lb := ∇(lb + MIN);  ub := Δ(ub + MAX);

This algorithm shows that in each sequence of 4 time steps one inter-
val product can be accumulated. Again the minimum and maximum are
taken from the unrounded products. Only at the very end of the accumula-
tion of the bounds is a rounding applied. Then lb and ub are floating-point
numbers which optimally enclose the product A*B of the two interval vectors
A and B.
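
A pure software version of the interval scalar product cannot reproduce the exact long accumulator of Algorithm 5, but the data flow can be imitated: for every component the minimum and maximum of the four products are accumulated into the two bounds. In the sketch below each partial sum is rounded downwards or upwards immediately, so the result is a correct but in general slightly wider enclosure than the optimal one delivered by Algorithm 5; the names idot and Ival are ours.

#include <algorithm>
#include <cfenv>
#include <cstdio>
#include <vector>

struct Ival { double lo, hi; };

Ival idot(const std::vector<Ival>& A, const std::vector<Ival>& B) {
    double lb = 0.0, ub = 0.0;
    for (std::size_t k = 0; k < A.size(); ++k) {
        std::fesetround(FE_DOWNWARD);          // lower bound: everything downwards
        double mn = std::min({A[k].lo * B[k].lo, A[k].lo * B[k].hi,
                              A[k].hi * B[k].lo, A[k].hi * B[k].hi});
        lb = lb + mn;
        std::fesetround(FE_UPWARD);            // upper bound: everything upwards
        double mx = std::max({A[k].lo * B[k].lo, A[k].lo * B[k].hi,
                              A[k].hi * B[k].lo, A[k].hi * B[k].hi});
        ub = ub + mx;
    }
    std::fesetround(FE_TONEAREST);
    return {lb, ub};
}

int main() {
    std::vector<Ival> A = {{1.0, 2.0}, {-1.0, 1.0}};
    std::vector<Ival> B = {{3.0, 3.0}, { 2.0, 4.0}};
    Ival S = idot(A, B);
    std::printf("[%g, %g]\n", S.lo, S.hi);     // [-1, 10]
}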
In the algorithms 3, 4, and 5 the unrounded, double length products were compared and used for the computation of their minimum and maximum corresponding to (3.41). This requires comparators of double length. This can be avoided if formula (3.42) is used instead:

    A * B := [min(a1 ∇* b1, a1 ∇* b2, a2 ∇* b1, a2 ∇* b2),
              max(a1 Δ* b1, a1 Δ* b2, a2 Δ* b1, a2 Δ* b2)].

Computation of the 8 products ai ∇* bj, ai Δ* bj, i, j = 1, 2, can be avoided if the exact flag of the IEEE arithmetic is used. In general a ∇* b and a Δ* b differ only by one unit in the last place and we have

    a ∇* b ≤ a * b ≤ a Δ* b.

If the computation of the product a * b already leads to a floating-point number which needs no rounding, then the product is called exact and we have

    a ∇* b = a * b = a Δ* b.

If the product a * b is not a floating-point number, then it is "not exact" and the product with rounding upwards can be obtained by taking the successor a Δ* b := succ(a ∇* b). This changes algorithm 4, for instance, into

Algorithm 6:
(A) p := a1 ∇* b1; q := a1 ∇* b2;
(B) r := a2 ∇* b1; s := a2 ∇* b2; MIN := min(p, q); MAX := max(p, q);
(C) MIN := min(MIN, r); MAX := max(MAX, r);
(D) lb := min(MIN, s); MAX := max(MAX, s);
(E) if MAX = "exact" then ub := MAX else ub := succ(MAX);
The algorithm requires one additional step in comparison with algorithm 4, where products of double length have been compared.

3.10.4 Division A / B

If 0 ∉ B, 6 different cases have been distinguished as listed in Table 3.7. If 0 ∈ B, 8 cases have to be considered. These are listed in Table 3.8. This leads directly to the following.

Algorithm 7:
if b2 < 0 ∨ b1 > 0 then
{
  lb := (if b1 > 0 then a1 else a2) ∇/
        (if a1 ≥ 0 ∨ (a2 > 0 ∧ b2 < 0) then b2 else b1);
  ub := (if b1 > 0 then a2 else a1) Δ/
        (if a1 ≥ 0 ∨ (a2 > 0 ∧ b1 > 0) then b1 else b2);
} else {
  if (a1 ≤ 0 ∧ 0 ≤ a2 ∧ b1 ≤ 0 ∧ 0 ≤ b2) then [lb, ub] := [−∞, +∞];
  if ((a2 < 0 ∨ a1 > 0) ∧ b1 = 0 ∧ b2 = 0) then [lb, ub] := [ ];
  if (a2 < 0 ∧ b2 = 0) then [lb, ub] := [a2 ∇/ b1, +∞];
  if (a2 < 0 ∧ b1 = 0) then [lb, ub] := [−∞, a2 Δ/ b2];
  if (a1 > 0 ∧ b2 = 0) then [lb, ub] := [−∞, a1 Δ/ b1];
  if (a1 > 0 ∧ b1 = 0) then [lb, ub] := [a1 ∇/ b2, +∞];
  if (a2 < 0 ∧ b1 < 0 ∧ b2 > 0) then { [lb1, ub1] := [−∞, a2 Δ/ b2];
                                       [lb2, ub2] := [a2 ∇/ b1, +∞]; }
  if (a1 > 0 ∧ b1 < 0 ∧ b2 > 0) then { [lb1, ub1] := [−∞, a1 Δ/ b1];
                                       [lb2, ub2] := [a1 ∇/ b2, +∞]; }
}

The algorithm is organized in such a way that the most complicated cases,
where the result consists of two separate intervals, appear at the end. It would
be possible also in these cases to write the result as a single interval which then
would overlap the point infinity. In such an interval the lower bound would
then be greater than the upper bound. This could cause difficulties with the
order relation. So we prefer the notation with the two separate intervals. On
the other hand, the representation of the result as an interval which overlaps
the point infinity has advantages as well. The result of an interval division
then always consists of just two bounds. In the Newton step the separation
into two intervals then would have to be done by the intersection.
In practice, division by an interval that contains zero occurs infrequently.
So algorithm 7 shows again that on a processor with two dividers and some
multiplexer equipment an interval division can in general be performed in the same time as a floating-point division.
Variants of the algorithms discussed in this Section can, of course, also be used. In algorithm 7, for instance, the sequence of the if-statements after the else could be interchanged. If these if-statements are chained by replacing the semicolons between them with else, the result may be obtained faster.

3.10.5 Instruction Set for Interval Arithmetic

Convenient high level programming languages with particular data types and
operators for intervals, the XSC-languages for instance [11,12,26-29,37,38,
69,70,77], have been in use for more than thirty years now. Due to the lack
of hardware and instruction set support for interval arithmetic, subroutine
calls have to be used by the compiler to map the interval operators and com-
parisons to appropriate floating-point instructions. This slows down interval
arithmetic by a factor close to ten compared to the corresponding floating-
point arithmetic.
It has been shown in the last Section that with appropriate hardware
support interval operations can be made as fast as floating-point operations.
Three additional measures are necessary to let an interval calculation on
the computer run at a speed comparable to the corresponding floating-point
calculation:

1. Interval arithmetic hardware must be supported by the instruction set of


the processor.
2. The high level programming language should provide operators for float-
ing-point operations with directed roundings. The language must provide
data types for intervals, and operators for interval operations and com-
parisons. It must allow overloading of names of elementary functions for
interval data types.
3. The compiler must directly map the interval operators, comparisons and
elementary functions of the high level programming language onto the
instruction set of the processor. This mapping must not be done by slow
function or subroutine calls.

From the mathematical point of view the following instructions for interval operations are desirable (A = [a1, a2], B = [b1, b2]):

Algebraic operators:

    addition        C := A + B             C := [a1 ∇+ b1, a2 Δ+ b2],
    subtraction     C := A - B             C := [a1 ∇− b2, a2 Δ− b1],
    negation        C := -A                C := [-a2, -a1],
    multiplication  C := A * B             Table 3.6,
    division        C := A / B, 0 ∉ B      Table 3.7,
                    C := A / B, 0 ∈ B      Table 3.8,
    scalar product  C := ◊(A * B)          for interval vectors A = (Ak) and
                                           B = (Bk), see the first chapter.

Comparisons and lattice operations:

    equality              A = B            a1 = b1 ∧ a2 = b2,
    less than or equal    A ≤ B            a1 ≤ b1 ∧ a2 ≤ b2,
    greatest lower bound  C := glb(A, B)   C := [min(a1, b1), min(a2, b2)],
    least upper bound     C := lub(A, B)   C := [max(a1, b1), max(a2, b2)],
    inclusion             A ⊆ B            b1 ≤ a1 ∧ a2 ≤ b2,
    element of            a ∈ A            a1 ≤ a ≤ a2,
    interval hull         C := A ⊔ B       C := [min(a1, b1), max(a2, b2)],
    intersection          C := A ∩ B       C := if max(a1, b1) ≤ min(a2, b2)
                                                then [max(a1, b1), min(a2, b2)]
                                                else [ ].

Other comparisons can directly be obtained by comparison of bounds of


the intervals A and B. With two comparators all comparisons and lattice
operations can be performed in parallel in one time step.
Fast multiple precision arithmetic and fast multiple precision interval
arithmetic can easily be obtained by means of the exact scalar product. See
Remark 3 on page 60 in section 1.7.
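
The comparisons and lattice operations of the list above translate into a few lines of straightforward code; the following C++ sketch (with our own names) shows them for intervals with finite double bounds, signalling the empty intersection by a flag.

#include <algorithm>
#include <cstdio>

struct Ival { double lo, hi; };

bool equal      (Ival A, Ival B) { return A.lo == B.lo && A.hi == B.hi; }
bool less_equal (Ival A, Ival B) { return A.lo <= B.lo && A.hi <= B.hi; }
bool subset     (Ival A, Ival B) { return B.lo <= A.lo && A.hi <= B.hi; }
bool element_of (double a, Ival A) { return A.lo <= a && a <= A.hi; }

Ival glb (Ival A, Ival B) { return {std::min(A.lo, B.lo), std::min(A.hi, B.hi)}; }
Ival lub (Ival A, Ival B) { return {std::max(A.lo, B.lo), std::max(A.hi, B.hi)}; }
Ival hull(Ival A, Ival B) { return {std::min(A.lo, B.lo), std::max(A.hi, B.hi)}; }

bool intersect(Ival A, Ival B, Ival& C) {      // false: intersection is empty
    double lo = std::max(A.lo, B.lo), hi = std::min(A.hi, B.hi);
    if (lo > hi) return false;
    C = {lo, hi};
    return true;
}

int main() {
    Ival C;
    if (intersect({1.0, 3.0}, {2.0, 5.0}, C))
        std::printf("[%g, %g]\n", C.lo, C.hi);   // [2, 3]
}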

3.10.6 Final Remarks

We have seen that relatively modest hardware equipment consisting of two operation units, a few multiplexers and comparators could make interval
arithmetic as fast as floating-point arithmetic. For multiplication various ap-
proaches have been compared. The case selection clearly is the most favorable
one. Several of the multiplication algorithms use and compare double length
products. These are not easily obtainable on most existing processors. Since
the double length product is a fundamental operation for other applications
as well, like complex arithmetic, the accurate dot product, vector and matrix
arithmetic, we require that multipliers should have a fifth rounding mode,
namely "unrounded", which enables them to provide the full double length
product.
A similar situation appears in interval arithmetic, if division by an interval which contains zero is permitted. The result of an interval division then may consist of two disjoint intervals. In applications of interval arithmetic, division by an interval which contains zero has been used for 30 years now. This forces an extension of the real numbers ℝ by −∞ and +∞ and a consideration of the complete lattice ℝ* := ℝ ∪ {−∞} ∪ {+∞} and its computer representable subset R* := R ∪ {−∞} ∪ {+∞}. In the early days of interval arithmetic attempts were made to define interval arithmetic immediately in Iℝ* instead of Iℝ. See [23-25] and others. This leads to deep mathematical considerations. We did not follow such lines in this study. The Extended Interval Newton Method is practically the only frequent application where −∞ and +∞ are needed. But there they appear only in an intermediate step as auxiliary values and they disappear immediately in the next step when the intersection with the former approximation is taken.
In conventional numerical analysis Newton's method is the key method
for nonlinear problems. The method converges quadratically to the solution
if the initial value of the iteration is already close enough. However, it may
fail in finite as well as in infinite precision arithmetic even in the case of only
a single solution in a given interval. In contrast to this the interval version of
Newton's method is globally convergent. It never fails, not even in rounded
arithmetic. Newton's method reaches its final elegance and strength in the
Extended Interval Newton Method. It encloses all (single) zeros in a given
domain. It is locally quadratically convergent. The key operation to achieve
these fascinating properties is division by an interval which contains zero. It
separates different solutions from each other. A Method which provides for
computation of all zeros of a system of nonlinear equations in a given domain
is much more frequently applied than the conventional Newton method. This
justifies taking division by an interval which contains zero into the basic set
of interval operations, and supporting it within the instruction set of the
computer.
Bibliography and Related Literature

1. Adams, E.; Kulisch, U. (eds.): Scientific Computing with Automatic Result Verification. I. Language and Programming Support for Verified Scientific Computation, II. Enclosure Methods and Algorithms with Automatic Result Verification, III. Applications in the Engineering Sciences. Academic Press, San Diego, 1993 (ISBN 0-12-044210-8).
2. Albrecht, R.; Alefeld, G.; Stetter, H.J. (Eds.): Validation Numerics - The-
ory and Applications. Computing Supplementum 9, Springer-Verlag, Wien I
New York, 1993.
3. Alefeld, G.: Intervallrechnung iiber den komplexen Zahlen und einige Anwen-
dungen. Dissertation, Universitiit Karlsruhe, 1968.
4. Alefeld, G.: Über die aus monoton zerlegbaren Operatoren gebildeten Iterationsverfahren. Computing 6, pp. 161-172, 1970.
5. Alefeld, G.; Herzberger, J.: Einfiihrung in die Intervallrechnung. Bibli-
ographisches Institut (Reihe Informatik, Nr. 12), Mannheim I Wien I Zurich,
1974 (ISBN 3-411-01466-0).
6. Alefeld, G.; Herzberger, J.: An Introduction to Interval Computations.
Academic Press, New York, 1983 (ISBN 0-12-049820-0).
7. Alefeld, G.; Mayer, G.: Einschließungsverfahren. In [22, pp. 155-186], 1995.
8. Alefeld, G.; Frommer, A.; Lang, B. (eds.): Scientific Computing and Vali-
dated Numerics. Proceedings of SCAN-95. Akademie Verlag, Berlin, 1996.
ISBN 3-05-501737-4
9. Baumhof, Ch.: Ein Vektorarithmetik-Koprozessor in VLSI-Technik zur Unterstützung des Wissenschaftlichen Rechnens. Dissertation, Universität Karlsruhe, 1996.
10. Blomquist, F.: PASCAL-XSC, BCD-Version 1.0, Benutzerhandbuch für das dezimale Laufzeitsystem. Institut für Angewandte Mathematik, Universität Karlsruhe, 1997.
11. Bohlender, G.; Rall, L. B.; Ullrich, Ch.; Wolff v. Gudenberg, J.: PASCAL-SC: Wirkungsvoll programmieren, kontrolliert rechnen. Bibliographisches Institut, Mannheim / Wien / Zürich, 1986 (ISBN 3-411-03113-1).
12. Bohlender, G.; Rall, L. B.; Ullrich, Ch.; Wolff v. Gudenberg, J.: PASCAL-SC: A Computer Language for Scientific Computation. Perspectives in Computing, Vol. 17, Academic Press, Orlando, 1987 (ISBN 0-12-111155-5).
13. Bohlender, G.: Literature on Enclosure Methods and Related Topics. Institut für Angewandte Mathematik, Universität Karlsruhe, pp. 1-68, 2000.
14. Collatz, L.: Funktionalanalysis und numerische Mathematik. Springer-
Verlag, Berlin I Heidelberg I New York, 1968.
15. Fischer, H.-C.: Schnelle automatische Differentiation, Einschließungsmethoden und Anwendungen. Dissertation, Universität Karlsruhe, 1990.
16. Fischer, H.: Automatisches Differenzieren. In [22, pp. 53-104], 1995.
17. Hammer, R.; Hocks, M.; Kulisch, U.; Ratz, D.: Numerical Toolbox for Verified Computing I: Basic Numerical Problems. (Vol. II see [31], version in C++ see [18]) Springer-Verlag, Berlin / Heidelberg / New York, 1993.
fied Computing: Basic Numerical Problems. Springer-Verlag, Berlin /
Heidelberg / New York, 1995.
19. Hansen, E.: Topics in Interval Analysis. Clarendon Press, Oxford, 1969.
20. Hansen, E.: Global Optimization Using Interval Analysis. Marcel
Dekker Inc., New York/Basel/Hong Kong, 1992.
21. Herzberger, J. (ed.): Topics in Validated Computations. Proceedings of
IMACS-GAMM International Workshop on Validated Numerics, Oldenburg,
1993. North Holland, 1994.
22. Herzberger, J.: Wissenschaftliches Rechnen, Eine Einfiihrung in das
Scientific Computing. Akademie Verlag, 1995.
23. Kaucher, E.: Über metrische und algebraische Eigenschaften einiger beim numerischen Rechnen auftretender Räume. Dissertation, Universität Karlsruhe, 1973.
24. Kaucher, E.: Algebraische Erweiterungen der Intervallrechnung unter Erhal-
tung der Ordnungs- und Verbandsstrukturen. In: Albrecht, R.; Kulisch, U.
(Eds.): Grundlagen der Computerarithmetik. Computing Supplementum
1. Springer-Verlag, Wien / New York, pp. 65-79, 1977.
25. Kaucher, E.: Über Eigenschaften und Anwendungsmöglichkeiten der erweiterten Intervallrechnung und des hyperbolischen Fastkörpers über R. In: Albrecht, R.; Kulisch, U. (Eds.): Grundlagen der Computerarithmetik. Computing Supplementum 1. Springer-Verlag, Wien / New York, pp. 81-94, 1977.
26. Klatte, R.; Kulisch, U.; Neaga, M.; Ratz, D.; Ullrich, Ch.: PASCAL-
XSC Sprachbeschreibung mit Beispielen. Springer-Verlag,
Berlin/Heidelberg/New York, 1991 (ISBN 3-540-53714-7, 0-387-53714-7).
27. Klatte, R.; Kulisch, U.; Neaga, M.; Ratz, D.; Ullrich, Ch.: PASCAL-
XSC - Language Reference with Examples. Springer-Verlag,
Berlin/Heidelberg/New York, 1992.
28. Klatte, R.; Kulisch, U.; Lawo, C.; Rauch, M.; Wiethoff, A.: C-XSC, A C++
Class Library for Extended Scientific Computing. Springer-Verlag,
Berlin/Heidelberg/New York, 1993.
29. Klatte, R.; Kulisch, U.; Neaga, M.; Ratz, D.; Ullrich, Ch.: PASCAL-XSC
- Language Reference with Examples (In Russian). Moscow, 1994,
second edition 2000.
30. Knöfel, A.: Hardwareentwurf eines Rechenwerks für semimorphe Skalar- und Vektoroperationen unter Berücksichtigung der Anforderungen verifizierender Algorithmen. Dissertation, Universität Karlsruhe, 1991.
31. Kramer, W.; Kulisch, U.; Lohner, R.: Numerical Toolbox for Verified
Computing II: Theory, Algorithms and Pascal-XSC Programs. (Vol. I
see [17,18]) Springer-Verlag, Berlin / Heidelberg / New York, to appear.
32. Krawczyk, R.; Neumaier, A.: Interval Slopes for Rational Functions and Asso-
ciated Centered Forms. SIAM Journal on Numerical Analysis 22, pp. 604-616,
1985.
33. Kulisch, U.: Grundlagen des Numerischen Rechnens - Mathematische Begründung der Rechnerarithmetik. Reihe Informatik, Band 19, Bibliographisches Institut, Mannheim / Wien / Zürich, 1976 (ISBN 3-411-01517-9).
34. Kulisch, U.; Miranker, W. L.: Computer Arithmetic in Theory and Prac-
tice. Academic Press, New York, 1981 (ISBN 0-12-428650-x).
35. Kulisch, U.; Ullrich, Ch. (Eds.): Wissenschaftliches Rechnen und Programmiersprachen. Proceedings of Seminar held in Karlsruhe, April 2-3,
36. Kulisch, U.; Miranker, W. L. (Eds.): A New Approach to Scientific Com-
putation. Proceedings of Symposium held at IBM Research Center, Yorktown
Heights, N. Y., 1982. Academic Press, New York, 1983 (ISBN 0-12-428660-7).
37. Kulisch, U. (Ed.): PASCAL-SC: A PASCAL extension for scientific computation, Information Manual and Floppy Disks, Version IBM PC/AT; Operating System DOS. B. G. Teubner Verlag (Wiley-Teubner series in computer science), Stuttgart, 1987 (ISBN 3-519-02106-4 / 0-471-91514-9).
38. Kulisch, U. (Ed.): PASCAL-SC: A PASCAL extension for scientific computation, Information Manual and Floppy Disks, Version ATARI ST. B. G. Teubner Verlag, Stuttgart, 1987 (ISBN 3-519-02108-0).
39. Kulisch, U. (Ed.): Wissenschaftliches Rechnen mit Ergebnisverifikation - Eine Einführung. Ausgearbeitet von S. Georg, R. Hammer und D. Ratz.
Vol. 58. Akademie Verlag, Berlin, und Vieweg Verlagsgesellschaft, Wiesbaden,
1989.
40. Kulisch, U.: Advanced Arithmetic for the Digital Computer - Design
of Arithmetic Units. Electronic Notes in Theoretical Computer Science,
http://www.elsevier.nl/locate/entcs/volume24.html pp. 1-72, 1999.
41. Lohner, R.: Einschließung der Lösung gewöhnlicher Anfangs- und Randwertaufgaben und Anwendungen. Dissertation, Universität Karlsruhe, 1988.
42. Lohner, R.: Computation of Guaranteed Enclosures for the Solutions of Ordi-
nary Initial and Boundary Value Problems. pp. 425-435 in: Cash, J. R.; Glad-
well, I. (Eds.): Computational Ordinary Differential Equations. Claren-
don Press, Oxford, 1992.
43. Mayer, G.: Grundbegriffe der Intervallrechnung. In [39, pp. 101-117], 1989.
44. Moore, R. E.: Interval Analysis. Prentice Hall Inc., Englewood Cliffs, N. J.;
1966.
45. Moore, R. E.: Methods and Applications of Interval Analysis. SIAM,
Philadelphia, Pennsylvania, 1979.
46. Moore, R. E. (Ed.): Reliability in Computing: The Role of Interval
Methods in Scientific Computing. Proceedings of the Conference at
Columbus, Ohio, September 8-11, 1987; Perspectives in Computing 19, Aca-
demic Press, San Diego, 1988 (ISBN 0-12-505630-3).
47. Neumaier, A.: Interval Methods for Systems of Equations. Cambridge
University Press, Cambridge, 1990.
48. Neumann, J. von; Goldstine, H. H.: Numerical Inverting of Matrices of High
Order. Bulletin of the American Mathematical Society, 53, 11, pp. 1021-1099,
1947.
49. Rall, L. B.: Automatic Differentiation: Techniques and Applications.
Lecture Notes in Computer Science, No. 120, Springer-Verlag, Berlin, 1981.
50. Ratschek, H.; Rokne, J.: Computer Methods for the Range of Functions.
Ellis Horwood Limited, Chichester, 1984.
51. Ratz, D.: Programmierpraktikum mit PASCAL-SC. In: Höhler, G.; Stau-
denmaier, H. M. (Hrsg.): Computer Theoretikum und Praktikum für
Physiker. Band 5, Fachinformationszentrum Karlsruhe, 1990.
52. Ratz, D.: Globale Optimierung mit automatischer Ergebnisverifikation. Disser-
tation, Universität Karlsruhe, 1992.
53. Ratz, D.: Automatic Slope Computation and its Application in Non-
smooth Global Optimization. Shaker Verlag, Aachen, 1998.
54. Ratz, D.: On Extended Interval Arithmetic and Inclusion Isotony. Preprint,
Institut für Angewandte Mathematik, Universität Karlsruhe, 1999.
55. Rump, S. M.: Kleine Fehlerschranken bei Matrixproblemen. Dissertation, Uni-
versität Karlsruhe, 1980.
56. Rump, S. M.: How Reliable are Results of Computers? / Wie zuverlässig sind
die Ergebnisse unserer Rechenanlagen? In: Jahrbuch Überblicke Mathematik,
Bibliographisches Institut, Mannheim, 1983.
57. Rump, S.M.: Validated Solution of Large Linear Systems. In [2, pp. 191-212],
1993.
58. Rump, S.M.: Verification Methods for Dense and Sparse Systems of Equations.
In [21, pp. 63-135], 1994.
59. Rump, S.M.: INTLAB - Interval Laboratory. TU Hamburg-Harburg, 1998.
60. Schmidt, L.: Semimorphe Arithmetik zur automatischen Ergebnisverifikation
auf Vektorrechnern. Dissertation, Universität Karlsruhe, 1992.
61. Shiriaev, D. V.: Fast Automatic Differentiation for Vector Processors and Re-
duction of the Spatial Complexity in a Source Translation Environment. Dis-
sertation, Universität Karlsruhe, 1994.
62. Sunaga, T.: Theory of an interval algebra and its application to numerical anal-
ysis. RAAG Memoires 2, pp. 547-564, 1958.
63. Teufel, T.: Ein optimaler Gleitkommaprozessor. Dissertation, Universität
Karlsruhe, 1984.
64. Ullrich, Ch. (Ed.): Computer Arithmetic and Self-Validating Numerical
Methods. (Proceedings of SCAN 89, held in Basel, Oct. 2-6, 1989, invited
papers). Academic Press, San Diego, 1990.
65. Walter, W. V.: FORTRAN-SC, A FORTRAN Extension for Engineering /
Scientific Computation with Access to ACRITH: Language Description with
Examples. In [46, pp. 43-62], 1988.
66. Walter, W. V.: Einführung in die wissenschaftlich-technische Programmier-
sprache FORTRAN-SC. ZAMM 69, 4, T52-T54, 1989.
67. Walter, W. V.: FORTRAN-SC: A FORTRAN Extension for Engineering /
Scientific Computation with Access to ACRITH, Language Reference and User's
Guide. 2nd ed., pp. 1-396, IBM Deutschland GmbH, Stuttgart, Jan. 1989.
68. Walter, W. V.: Flexible Precision Control and Dynamic Data Structures for
Programming Mathematical and Numerical Algorithms. Dissertation, Univer-
sität Karlsruhe, 1990.
69. Wippermann, H.-W.: Realisierung einer Intervallarithmetik in einem ALGOL-
60 System. Elektronische Rechenanlagen 9, pp. 224-233, 1967.
70. Wippermann, H.-W.: Implementierung eines ALGOL-60 Systems mit
Schrankenzahlen. Elektronische Datenverarbeitung 10, pp. 189-194, 1968.
71. Wolff v. Gudenberg, J.: Hardware Support for Interval Arithmetic, Extended
Version. Report No. 125, Institut für Informatik, Universität Würzburg, 1995.
72. Wolff v. Gudenberg, J.: Hardware Support for Interval Arithmetic. In [8, pp. 32-
38], 1996.
73. Wolff v. Gudenberg, J.: Proceedings of Interval'96. International Confer-
ence on Interval Methods and Computer Aided Proofs in Science and Engi-
neering, Würzburg, Germany, Sep. 30 - Oct. 2, 1996. Special issue 3/97 of the
journal Reliable Computing, 1997.
74. Yohe, J.M.: Roundings in Floating-Point Arithmetic. IEEE Trans. on Com-
puters, Vol. C-22, No.6, June 1973, pp. 577-586.
75. IBM: IBM System/370 RPQ. High Accuracy Arithmetic. SA 22-7093-0,
IBM Deutschland GmbH (Department 3282, Schönaicher Strasse 220, D-71032
Böblingen), 1984.
76. IBM: IBM High-Accuracy Arithmetic Subroutine Library
(ACRITH). IBM Deutschland GmbH (Department 3282, Schönaicher
Strasse 220, D-71032 Böblingen), 3rd edition, 1986.
1. General Information Manual. GC 33-6163-02.
2. Program Description and User's Guide. SC 33-6164-02.
3. Reference Summary. GX 33-9009-02.
77. IBM: ACRITH-XSC: IBM High Accuracy Arithmetic - Extended
Scientific Computation. Version 1, Release 1. IBM Deutschland GmbH
(Schönaicher Strasse 220, D-71032 Böblingen), 1990.
1. General Information, GC33-6461-01.
2. Reference, SC33-6462-00.
3. Sample Programs, SC33-6463-00.
4. How To Use, SC33-6464-00.
5. Syntax Diagrams, SC33-6466-00.
78. American National Standards Institute / Institute of Electrical and Electronics
Engineers: A Standard for Binary Floating-Point Arithmetic. ANSI/IEEE Std.
754-1985, New York, 1985 (reprinted in SIGPLAN 22, 2, pp. 9-25, 1987). Also
adopted as IEC Standard 559:1989.
79. American National Standards Institute / Institute of Electrical and Electron-
ics Engineers: A Standard for Radix-Independent Floating-Point Arithmetic.
ANSI/IEEE Std. 854-1987, New York, 1987.
80. IMACS; GAMM: IMACS-GAMM Resolution on Computer Arithmetic. In
Mathematics and Computers in Simulation 31, pp. 297-298, 1989. In Zeitschrift
für Angewandte Mathematik und Mechanik 70, no. 4, p. T5, 1990.
81. IMACS; GAMM: GAMM-IMACS Proposal for Accurate Floating-Point Vector
Arithmetic. GAMM, Rundbrief 2, pp. 9-16, 1993. Mathematics and Computers
in Simulation, Vol. 35, IMACS, North Holland, 1993. News of IMACS, Vol. 35,
No.4, pp. 375-382, Oct. 1993.
82. SIEMENS: ARITHMOS (BS 2000) Unterprogrammbibliothek für
Hochpräzisionsarithmetik. Kurzbeschreibung, Tabellenheft, Be-
nutzerhandbuch. SIEMENS AG, Bereich Datentechnik, Postfach 83 09 51,
D-8000 München 83. Bestellnummer U2900-J-Z87-1, Sept. 1986.
83. Sun Microsystems: Interval Arithmetic Programming Reference, For-
tran 95. Sun Microsystems, Inc., 901 San Antonio Road, Palo Alto, CA 94303,
USA, 2000.