
Azərbaycan Dövlət Neft və Sənaye Universiteti

Numerical Methods I

Round-Off Errors: Arithmetic Manipulations
Aside from the limitations of a computer's number system, the actual arithmetic manipulations involving these numbers can also result in round-off error.

In the following section, we will first illustrate how common arithmetic operations affect round-off errors. Then we will investigate a number of particular manipulations that are especially prone to round-off errors.
Because of their familiarity, normalized base-10 numbers will be employed to illustrate the effect of round-off errors on addition, subtraction, multiplication, and division. To simplify the discussion, we will employ a hypothetical decimal computer with a 4-digit mantissa and a 1-digit exponent.
When two floating-point numbers are added, the mantissa of the number with the smaller exponent is modified so that the exponents are the same (this has the effect of aligning the decimal points).
For example, suppose we want to perform the addition

0.1557 × 10^1 + 0.4381 × 10^−1

The mantissa of the second number is shifted to the right a number of places equal to the difference of the exponents [1 − (−1) = 2], as in

0.4381 × 10^−1 → 0.004381 × 10^1

Now the numbers can be added:

0.1557 × 10^1 + 0.004381 × 10^1 = 0.160081 × 10^1

Let's assume that our hypothetical computer uses chopping (rounding would lead to similar, though less dramatic, errors). The result of the addition is chopped to

0.1600 × 10^1

Notice how the last two digits of the second number that were shifted to the right have essentially been lost.
Subtraction is performed identically to addition, except that the sign of the subtrahend is reversed. For example, suppose that we are subtracting 26.86 from 36.41.

First we need to represent these numbers in normalized form:

0.3641 × 10^2 − 0.2686 × 10^2

Then we perform the subtraction as usual:

0.3641 × 10^2 − 0.2686 × 10^2 = 0.0955 × 10^2

For this case the result is not normalized, so we must shift the decimal one place to the right to give

0.9550 × 10^1

Notice that the zero added to the end of the mantissa is not significant but is merely appended to fill the empty position.
Even more dramatic results would be obtained when the numbers are very close, as in

0.7642 × 10^3 − 0.7641 × 10^3 = 0.0001 × 10^3

which would be converted to 0.1000 × 10^0. Thus, for this case, three insignificant zeros are appended.

This introduces a substantial computational error because subsequent manipulations would act as if these zeros were significant.
As we will see later on, the loss of significance during the subtraction of nearly equal numbers is among the greatest sources of round-off error in numerical methods.

Multiplication is more straightforward than addition or subtraction: the exponents are added and the mantissas are multiplied.

Because multiplication of two n-digit mantissas will yield a 2n-digit result, most computers hold intermediate results in a double-length register.
For example:

0.1363 × 10^3 × 0.6423 × 10^−1 = 0.08754549 × 10^2

If, as in this case, a leading zero is introduced, the result is normalized:

0.08754549 × 10^2 → 0.8754549 × 10^1

And finally, the result is chopped to give

0.8754 × 10^1

Division is performed in a similar manner, but the mantissas are divided and the exponents are subtracted. Then the results are normalized and chopped.
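The hand calculations above are easy to reproduce in software. Below is a minimal Python sketch of the hypothetical 4-digit chopping computer; the chop helper is our own illustrative construction, not a standard routine.

```python
import math

def chop(x, t=4):
    """Normalize x to the form m * 10^e with 0.1 <= |m| < 1,
    then truncate the mantissa to t digits (chopping, no rounding)."""
    if x == 0:
        return 0.0
    e = math.floor(math.log10(abs(x))) + 1   # decimal exponent
    m = x / 10.0**e                          # normalized mantissa
    m = math.trunc(m * 10**t) / 10**t        # keep t digits, drop the rest
    return m * 10.0**e

# Addition: the shifted digits of the smaller number are lost.
print(chop(0.1557e1 + 0.4381e-1))   # 1.6, i.e. 0.1600 x 10^1

# Subtraction of nearly equal numbers: only one significant digit survives.
print(chop(0.7642e3 - 0.7641e3))    # 0.1, i.e. 0.1000 x 10^0
```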

Certain numerical methods require extremely large numbers of arithmetic manipulations to arrive at their final results. In addition, these computations are often interdependent; that is, later calculations are dependent on the results of earlier ones. Even though an individual round-off error could be small, the cumulative effect over the course of a large computation can be significant.
Investigate the effect of round-off error on large numbers of interdependent computations: develop a program to sum a number 100,000 times. Sum the number 1 in single precision, and the number 0.00001 in single and double precision.

Whereas the single-precision summation of 1 yields the expected result, the single-precision summation of 0.00001 yields a large discrepancy. This error is reduced significantly when 0.00001 is summed in double precision.

Quantizing errors are the source of the discrepancies. Because the integer 1 can be represented exactly within the computer, it can be summed exactly. In contrast, 0.00001 cannot be represented exactly; it is quantized by a stored value that is slightly different from its true value.

Whereas this very slight discrepancy would be negligible for a small number of computations, it accumulates after repeated summations. The problem still occurs in double precision, but it is greatly mitigated because the quantizing error is much smaller.
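A sketch of this experiment in Python, using NumPy's float32 to stand in for single precision (the exact discrepancy you observe may vary slightly by platform):

```python
import numpy as np

n = 100_000

# 1 is exactly representable in binary, so the single-precision sum is exact.
s = np.float32(0.0)
for _ in range(n):
    s += np.float32(1.0)
print(s)            # 100000.0, as expected

# 0.00001 is not exactly representable; its quantizing error accumulates.
s = np.float32(0.0)
for _ in range(n):
    s += np.float32(1.0e-5)
print(s)            # visibly different from 1.0

# In double precision the quantizing error is far smaller.
s = 0.0
for _ in range(n):
    s += 1.0e-5
print(s)            # very close to 1.0
```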
Note that the type of error illustrated by the previous example is somewhat atypical in that all the errors in the repeated operation are of the same sign. In most cases, the errors of a long computation alternate sign in a random fashion and thus often cancel. However, there are also instances where such errors do not cancel but in fact lead to a spurious (false) final result. The following sections are intended to provide insight into ways in which this may occur.

Suppose we add a small number, 0.0010, to a large number, 4000, using a hypothetical computer with the 4-digit mantissa and the 1-digit exponent. We modify the smaller number so that its exponent matches the larger:

0.4000 × 10^4 + 0.0000001 × 10^4 = 0.4000001 × 10^4

The result of the addition is chopped to

0.4000 × 10^4

Thus, we might as well have not performed the addition!
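The same effect is easy to demonstrate in single precision, where the spacing between adjacent float32 values near 10^8 is larger than 1 (a small illustrative sketch):

```python
import numpy as np

big = np.float32(1.0e8)
small = np.float32(1.0)

# The small addend falls below the spacing of float32 values near 1e8,
# so it is lost entirely: we might as well not have performed the addition.
print(big + small == big)   # True
```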

This type of error can occur in the computation of an infinite series. The initial terms in such series are often relatively large in comparison with the later terms. Thus, after a few terms have been added, we are in the situation of adding a small quantity to a large quantity.

One way to mitigate this type of error is to sum the series in reverse order, that is, in ascending rather than descending order. In this way, each new term will be of comparable magnitude to the accumulated sum.
The finite series

f(n) = Σ (i = 1 to n) 1/i^4

converges on the value π^4/90 as n tends to infinity. Write a program in single precision to calculate f(n) for n = 10,000 by computing the sum from i = 1 to 10,000. Then repeat the calculation, but in reverse order, that is, from i = 10,000 to 1 using increments of −1. In each case, compute the true percent relative error. Explain the results.
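A single-precision sketch of this exercise (the series and the limit π^4/90 follow the problem statement above; the printed error values will depend on the platform):

```python
import numpy as np

n = 10_000
true_val = np.pi**4 / 90.0                 # limit of the series

def f(order):
    """Sum 1/i^4 in single precision over the given index order."""
    s = np.float32(0.0)
    for i in order:
        s += np.float32(1.0) / np.float32(i)**4
    return s

fwd = f(range(1, n + 1))     # descending term size: large terms first
rev = f(range(n, 0, -1))     # ascending term size: small terms first

for name, s in (("forward", fwd), ("reverse", rev)):
    err = abs((true_val - s) / true_val) * 100.0
    print(f"{name}: sum = {s:.7f}, true percent relative error = {err:.2e}")
```

The reverse (ascending) order should show the smaller error, since each new term is comparable in magnitude to the running sum.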

This term refers to the round-off induced when subtracting two nearly equal floating-point numbers. One common instance where this can occur involves finding the roots of a quadratic equation with the quadratic formula

x = (−b ± √(b^2 − 4ac)) / (2a)

For cases where b^2 >> 4ac, the difference in the numerator can be very small.

In such cases, double precision can mitigate the problem. In addition, an alternative formulation, obtained by rationalizing the numerator, can be used to minimize subtractive cancellation:

x = −2c / (b ± √(b^2 − 4ac))

An illustration of the problem and the use of this alternative formula are provided in the following example.

Compute the values of the roots of the quadratic equation x^2 + 3000.001x + 3 = 0, with a = 1, b = 3000.001, and c = 3. Check the computed values versus the true roots of x1 = −0.001 and x2 = −3000. Let's develop a computer program that computes the roots x1 and x2 on the basis of the quadratic formula.

Whereas the results for x2 are adequate, the percent relative error for x1 is poor for the single-precision version. This level could be inadequate for many applied engineering and scientific problems.
This result is particularly surprising because we are employing an analytical formula to obtain our solution. The loss of significance occurs in the line of the program where two relatively large numbers are subtracted. Similar problems do not occur when the same numbers are added. On the basis of the above, we can draw the general conclusion that the quadratic formula will be susceptible to subtractive cancellation whenever b^2 >> 4ac.

• One way to circumvent this problem is to use double precision.
• Another is to recast the quadratic formula in the format of the alternative formulation above.

As in the program output, both options give a much smaller error because the subtractive cancellation is minimized or avoided.
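A minimal single-precision sketch of this example, contrasting the standard formula with the alternative formulation for the small root x1 (printed values are approximate):

```python
import numpy as np

a = np.float32(1.0)
b = np.float32(3000.001)
c = np.float32(3.0)
d = np.sqrt(b * b - np.float32(4.0) * a * c)   # square root of the discriminant

# Standard formula: -b + d subtracts two nearly equal large numbers.
x1_standard = (-b + d) / (np.float32(2.0) * a)

# Alternative formulation: the troublesome subtraction becomes an addition.
x1_alternative = -np.float32(2.0) * c / (b + d)

print(x1_standard)     # visibly off from the true root -0.001
print(x1_alternative)  # much closer to -0.001
```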

Note that, as in the foregoing example, there are times when subtractive cancellation can be circumvented by using a transformation. However, the only general remedy is to employ extended precision.

Smearing occurs whenever the individual terms in a summation are larger than the summation itself. The following example demonstrates one case where smearing occurs: in series of mixed signs.
The exponential function is given by the infinite series

e^x = 1 + x + x^2/2! + x^3/3! + ...

Evaluate this function for x = 10 and x = −10, and be attentive to the problems of round-off error.
Let's develop a computer program that uses the infinite series to evaluate e^x. In the program:
• one variable counts the number of terms in the series;
• another holds the value of the current term added to the series;
• a third holds the accumulated value of the series;
• a fourth holds the preceding accumulated value of the series, prior to the addition of the current term.

The series is terminated when the computer cannot detect the difference between the preceding and the updated accumulated values.

Running the program for x = 10 gives completely satisfactory results: the final result is achieved after a modest number of terms, and the series value agrees with the library exponential function to within seven significant figures.
The results for x = −10, however, are quite different. In this case, the results of the series calculation do not even have the same sign as the true answer. As a matter of fact, the negative results are open to serious question because e^x can never be less than zero.
The problem here is caused by round-off error: note that many of the terms that make up the sum are much larger than the final result of the sum. Furthermore, unlike the previous case, the individual terms vary in sign. Thus, in effect, we are adding and subtracting large numbers (each with some small error) and placing great significance on the differences, that is, on subtractive cancellation. Thus we can see that the culprit behind this example of smearing is, in fact, subtractive cancellation.

For such cases, it is appropriate to seek some other computational strategy. For example, one might try to compute e^−10 as 1/e^10.

Other than such a reformulation, the only general recourse is extended precision.
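Here is a single-precision sketch of such a series program together with the reformulation; the variable names are our own, and the stopping rule mirrors the one described above (stop when adding a term no longer changes the sum):

```python
import math
import numpy as np

def exp_series(x, max_terms=200):
    """Evaluate e^x from its Maclaurin series in single precision."""
    x = np.float32(x)
    term = np.float32(1.0)    # current term, starting from x^0 / 0! = 1
    total = np.float32(1.0)
    for i in range(1, max_terms):
        term = term * x / np.float32(i)   # builds x^i / i! incrementally
        new_total = total + term
        if new_total == total:            # no detectable change: terminate
            return total
        total = new_total
    return total

print(exp_series(10.0), math.exp(10.0))    # satisfactory agreement
print(exp_series(-10.0), math.exp(-10.0))  # smearing: result is badly wrong
print(1.0 / exp_series(10.0))              # computing e^-10 as 1/e^10 works
```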

Some infinite series are particularly prone to round-off error. Fortunately, the calculation of series is not one of the more common operations in numerical methods. A far more ubiquitous manipulation is the calculation of inner products, as in

x1·y1 + x2·y2 + ... + xn·yn

This operation is very common, particularly in the solution of simultaneous linear algebraic equations. Such summations are prone to round-off error.

Consequently, it is often desirable to compute such summations in extended precision.
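A short sketch contrasting single- and double-precision accumulation of an inner product (the test vectors here are arbitrary illustrative data):

```python
import numpy as np

rng = np.random.default_rng(1)               # arbitrary test data
x = rng.standard_normal(100_000).astype(np.float32)
y = rng.standard_normal(100_000).astype(np.float32)

# Accumulate the inner product in single precision.
s32 = np.float32(0.0)
for xi, yi in zip(x, y):
    s32 += xi * yi

# Accumulate the same float32 data in extended (double) precision.
s64 = 0.0
for xi, yi in zip(x, y):
    s64 += float(xi) * float(yi)

print(s32, s64)   # the double-precision accumulation is the more reliable
```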
Although the foregoing sections should provide rules of thumb to mitigate round-off error, they do not provide a direct means, beyond trial and error, to actually determine the effect of such errors on a computation.
Thank you for your attention!
