
Azərbaycan Dövlət Neft və Sənaye Universiteti

Numerical Methods I
Round-Off Errors
We can now proceed to the two types of error connected directly with numerical methods:
• round-off errors;
• truncation errors.
Recall that
• round-off errors originate from the fact that computers retain only a fixed number of significant figures during a calculation.
Numbers such as π, e, or √7 cannot be expressed by a fixed number of significant figures.

In addition, because computers use a base-2 representation, they cannot precisely represent certain exact base-10 numbers (like 0.1).
The discrepancy introduced by this omission of significant figures is called round-off error.
Numerical round-off errors are directly related to the manner in which numbers are stored in computer memory.
Fractional quantities are typically represented in computers using floating-point form.
In this approach the number is expressed as a fractional part, called a mantissa or significand, and an integer part, called an exponent or characteristic, as in m × b^e, where m is the mantissa, b is the base of the number system, and e is the exponent.
For instance, the number 156.78 could be represented as 0.15678 × 10^3 in a base-10 floating-point system.
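A minimal Python sketch of this decomposition (the helper name to_float_form is chosen here just for illustration; edge cases near exact powers of the base are ignored):

import math

def to_float_form(x, base=10):
    # Split x into (mantissa, exponent) with 1/base <= |mantissa| < 1, so x = mantissa * base**exponent.
    if x == 0:
        return 0.0, 0
    e = math.floor(math.log(abs(x), base)) + 1
    return x / base**e, e

print(to_float_form(156.78))   # approximately (0.15678, 3), i.e. 0.15678 x 10^3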

One way that a floating-point number could be stored in a word is the following:
• the first bit is reserved for the sign;
• the next series of bits for the signed exponent;
• the last bits for the mantissa.
Note that the mantissa is usually normalized if it has leading zero digits.
For example, suppose the quantity 1/34 = 0.029411765... was stored in a base-10 floating-point system that allowed only four decimal places to be stored.

Thus 1/34 would be stored as 0.0294 × 10^0.

However, in the process of doing this, the inclusion of the useless zero to the right of the decimal point forces us to drop the digit in the fifth decimal place.
The number can be normalized to remove the leading zero by multiplying the mantissa by 10 and lowering the exponent by 1 to give 0.2941 × 10^-1.
Thus we retain an additional significant figure when the number is stored.
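The gain from normalization can be checked numerically (a rough Python sketch that mimics four-digit decimal storage; real hardware does not store numbers this way):

x = 1 / 34                                    # 0.029411764705882...
unnormalized = int(x * 10**4) / 10**4         # mantissa 0.0294, exponent 0  -> 0.0294
mantissa = int(x * 10 * 10**4) / 10**4        # shift left once, keep four digits -> 0.2941
normalized = mantissa / 10                    # mantissa 0.2941, exponent -1 -> 0.02941
print(abs(x - unnormalized), abs(x - normalized))   # the normalized form has the smaller error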
The consequence of normalization is that the absolute value of the mantissa m is limited, i.e. 1/b ≤ m < 1, where b is the base.
For example,
• for a base-10 system, m would range between 0.1 and 1;
• for a base-2 system, m would range between 0.5 and 1.
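In Python the normalized binary mantissa is exposed by math.frexp, which can serve as a quick check of the 0.5 ≤ m < 1 bound:

import math

for x in (156.78, 0.03125, 1/34):
    m, e = math.frexp(x)          # x = m * 2**e with 0.5 <= |m| < 1 for nonzero x
    print(x, m, e)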

Floating-point representation allows both fractions and very large numbers to be expressed on the computer.
However, it has some disadvantages:
• For example, floating-point numbers take up more room and take longer to process than integer numbers.
• More significantly, however, their use introduces a source of error because the mantissa holds only a finite number of significant figures.

Thus a round-off error is introduced.
Example. Hypothetical Set of Floating-Point Numbers
Create a hypothetical floating-point number set for a machine that stores information using 7-bit words.

Employ the first bit for the sign of the number, the next three bits for the sign and the magnitude of the exponent, and the last three bits for the magnitude of the mantissa.
Solution
The smallest possible positive number has the bit pattern 0 111 100.
The initial 0 indicates that the quantity is positive.

The 1 in the second place designates that the exponent has a negative sign.
Solution (continued)
The 1s in the third and fourth places give a maximum value to the exponent of 1 × 2^1 + 1 × 2^0 = 3;
therefore the exponent will be -3.

Finally, the mantissa is specified by the 100 in the last three places, which conforms to 1 × 2^-1 + 0 × 2^-2 + 0 × 2^-3 = 0.5.
Solution (continued)
Although a smaller mantissa is possible (e.g. 000, 001, 010, 011), the value 100 is used because of the limit imposed by normalization (recall that 1/b ≤ m < 1, so the leading mantissa bit must be 1).

Thus the smallest possible positive number for this system is 0.5 × 2^-3, which is equal to 0.0625 in the base-10 system.


Solution (continued)
The next highest numbers are developed by increasing the mantissa, as in
0111101 = (1 × 2^-1 + 0 × 2^-2 + 1 × 2^-3) × 2^-3 = 0.078125
0111110 = (1 × 2^-1 + 1 × 2^-2 + 0 × 2^-3) × 2^-3 = 0.093750
0111111 = (1 × 2^-1 + 1 × 2^-2 + 1 × 2^-3) × 2^-3 = 0.109375
Solution (continued)
Notice that the base-10 equivalents are spaced evenly with an interval of 0.015625.

At this point, to continue increasing, we must decrease the exponent to 10, which gives a value of 1 × 2^1 + 0 × 2^0 = 2.

The mantissa is brought (decreased) back to its smallest value of 100.
Solution (continued)
Therefore the next number is
0110100 = (1 × 2^-1 + 0 × 2^-2 + 0 × 2^-3) × 2^-2 = 0.125000

This still represents a gap of 0.125000 - 0.109375 = 0.015625.

However, now when higher numbers are generated by increasing the mantissa, the gap is lengthened to 0.031250.
Solution (continued)
The next highest numbers are developed by increasing the mantissa:
0110101 = (1 × 2^-1 + 0 × 2^-2 + 1 × 2^-3) × 2^-2 = 0.156250
0110110 = (1 × 2^-1 + 1 × 2^-2 + 0 × 2^-3) × 2^-2 = 0.187500
0110111 = (1 × 2^-1 + 1 × 2^-2 + 1 × 2^-3) × 2^-2 = 0.218750
Solution (continued)
This pattern is repeated as each larger quantity is formulated until a maximum number is reached:
0011111 = (1 × 2^-1 + 1 × 2^-2 + 1 × 2^-3) × 2^3 = 7
Solution (continued)
The final number set can be depicted graphically as points on a number line.
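A short Python sketch (variable names chosen freely here) enumerates the positive values of this hypothetical system and reproduces these gaps:

values = set()
for exp_sign in (1, -1):                  # second bit: sign of the exponent
    for exp_mag in range(4):              # two bits of exponent magnitude: 0..3
        for m_bits in range(4, 8):        # normalized 3-bit mantissas 100..111
            mantissa = m_bits / 8         # 0.5, 0.625, 0.75, 0.875
            values.add(mantissa * 2 ** (exp_sign * exp_mag))
values = sorted(values)
print(values[0], values[-1])              # 0.0625 and 7.0
print(values[:4])                         # [0.0625, 0.078125, 0.09375, 0.109375]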
This number set manifests several aspects of floating-point representation that have significance regarding computer round-off errors:

• There is a limited range of real quantities that may be represented.

Just as for the integer case, there are large positive and negative numbers that cannot be represented.
Attempts to employ numbers outside the acceptable range will result in what is called an overflow error.
However, in addition to large quantities, the floating-point representation has the added limitation that very small numbers cannot be represented.
This is illustrated by the underflow "hole" between zero and the smallest representable positive number.
• There are only a finite number of quantities that can be represented within the range; thus, the degree of precision is limited.
Obviously, irrational numbers cannot be represented exactly.

Furthermore, rational numbers that do not exactly match one of the values in the set also cannot be represented exactly.
The errors introduced by approximating both these cases are referred to as quantizing errors.

The actual approximation is accomplished in either of two ways: chopping or rounding.
Suppose the value of π = 3.14159265358... is to be stored on a base-10 number system carrying seven significant figures.

One method of approximation would be to merely omit, or "chop off", the eighth and higher digits, as in π ≈ 3.141592, with the introduction of an associated error of about 0.00000065.

This technique of retaining only the significant terms was originally dubbed "truncation" in computer jargon; below it is referred to as chopping.
Note that for the base-10 number system above, chopping means that any quantity falling within an interval of length Δx will be stored as the quantity at the lower end of the interval.

Thus the upper error bound for chopping is Δx.

Additionally, a bias is introduced because all errors are positive.

The shortcoming of chopping is that rounding yields a lower absolute error.

For instance, in our example of π the first discarded digit is 6. If we round up the last retained digit we get π ≈ 3.141593.
Such rounding reduces the error to about -0.00000035.

Note that for the base-10 number system above, rounding means that any quantity falling within an interval of length Δx will be represented as the nearest allowable number.
Thus the upper error bound for rounding is Δx/2 (as opposed to Δx in the case of chopping).
Additionally, no bias is introduced because some errors are positive and some are negative.
Some computers employ rounding. However, this adds to the computational overhead, and, consequently, many machines use simple chopping.
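The π example can be reproduced with the standard decimal module (a Python sketch; seven significant figures of π correspond to six decimal places here):

import math
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

pi = Decimal(repr(math.pi))                                            # 3.141592653589793
chopped = pi.quantize(Decimal("0.000001"), rounding=ROUND_DOWN)        # 3.141592
rounded = pi.quantize(Decimal("0.000001"), rounding=ROUND_HALF_UP)     # 3.141593
print(chopped, pi - chopped)   # error ~ +0.00000065 (always positive -> bias)
print(rounded, pi - rounded)   # error ~ -0.00000035 (at most half the chopping bound)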
• The interval Δx between representable numbers increases as the numbers grow in magnitude.
It is this characteristic, of course, that allows floating-point representation to preserve significant digits.
However, it also means that quantizing errors will be proportional to the magnitude of the number being represented.
For normalized floating-point numbers, this proportionality can be expressed, for cases where chopping is employed, as
|Δx / x| ≤ ε

and for cases where rounding is employed, as
|Δx / x| ≤ ε / 2
Here ε is referred to as the machine epsilon.

Machine epsilon can be computed as follows:
ε = b^(1 - t)
where b is the number base and t is the number of significant digits in the mantissa.

Example. Determine the machine epsilon and verify its effectiveness in characterizing the errors of the number system from the previous example. Assume that chopping is used.
Solution
The hypothetical floating-point system from the previous example employed values of the base b = 2 and the number of mantissa bits t = 3.

Therefore, the machine epsilon would be ε = 2^(1-3) = 0.25.

Consequently, the relative quantizing error should be bounded by 0.25 for chopping.
The largest relative errors should occur for those quantities that fall just below the upper bound of the first interval between successive equispaced numbers.
Those numbers falling in the succeeding higher intervals would have the same value of Δx but a greater value of x and hence would have a lower relative error.
An example of a maximum error would be a value falling just below the upper bound of the interval between 0.125000 and 0.156250.
For this case the error would be less than
0.03125 / 0.125000 = 0.25

Thus the error is as predicted by the bound |Δx / x| ≤ ε.
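A quick numerical check of this bound (Python sketch; the test value 0.15624 is chosen arbitrarily, just below the top of that interval):

b, t = 2, 3
eps = b ** (1 - t)                       # machine epsilon = 0.25
x = 0.15624                              # just below the upper end of the interval [0.125, 0.15625)
stored = 0.125                           # chopping keeps the lower end of the interval
rel_error = (x - stored) / x
print(eps, rel_error, rel_error <= eps)  # 0.25  ~0.2  True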



The magnitude dependence of quantizing errors has a number of practical applications in numerical methods.

Most of these relate to the commonly employed operation of testing whether two numbers are equal.
This occurs when testing convergence of quantities as well as in the stopping mechanism for iterative processes.
For these cases it should be clear that, rather than test whether the two quantities are equal, it is advisable to test whether their difference is less than an acceptably small tolerance.
In addition, the machine epsilon can be employed in formulating stopping or convergence criteria.
All these precautions ensure that programs are portable, that is, not dependent on the computer on which they are implemented.
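As an illustration, such a tolerance-based test might look like the following Python sketch (the helper name converged and the tolerance value are arbitrary choices):

import sys

def converged(x_new, x_old, rel_tol=1e-8):
    # Stop when the relative difference falls below a tolerance tied to machine epsilon,
    # rather than testing x_new == x_old exactly.
    tol = max(rel_tol, 4 * sys.float_info.epsilon)
    return abs(x_new - x_old) <= tol * max(abs(x_new), abs(x_old))

print(converged(0.1 + 0.2, 0.3))   # True, even though 0.1 + 0.2 != 0.3 exactly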
Write pseudocode to automatically determine the machine epsilon of a binary computer, then implement it in some programming language (the listing below uses MATLAB-style syntax):
epsilon = 1.0;
while (epsilon + 1.0 > 1.0)     % keep halving while epsilon still makes a difference when added to 1
    epsilon = epsilon / 2.0;
end
epsilon = 2.0 * epsilon;        % step back to the last value that did make a difference
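The same loop in Python, checked against the constant the standard library reports (sys.float_info.epsilon):

import sys

epsilon = 1.0
while epsilon + 1.0 > 1.0:       # keep halving while adding epsilon to 1 still changes the sum
    epsilon = epsilon / 2.0
epsilon = 2.0 * epsilon          # undo the last halving

print(epsilon, sys.float_info.epsilon)   # both print 2.220446049250313e-16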
Extended Precision
Commercial computers use much larger words than in the examples above and, consequently, allow numbers to be expressed with more than adequate precision.
For example, computers that employ the IEEE single-precision format use 24 bits for the mantissa, which translates into about seven significant base-10 digits of precision with a range of roughly 10^-38 to 10^38.
With this acknowledged, there are still cases where round-off error becomes critical.
For this reason most computers allow the specification of extended precision, the most common of which is double precision.

It provides about 15 to 16 decimal digits of precision and a range of approximately 10^-308 to 10^308.
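These figures can be read directly from the floating-point type metadata (a Python sketch; NumPy is assumed to be available only to get a single-precision type for comparison):

import sys
import numpy as np

for dtype in (np.float32, np.float64):          # single vs. double precision
    info = np.finfo(dtype)
    print(dtype.__name__, info.precision, info.eps, info.tiny, info.max)

print(sys.float_info.epsilon)                   # plain Python floats are IEEE double precision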
In many cases, the use of double-precision quantities can greatly mitigate the effect of round-off errors.

However, a price is paid for such remedies in that they also require more memory and execution time.

The difference in execution time for a small calculation might seem insignificant.
However, as your programs become larger and more complicated, the added execution time could become considerable and have a negative impact on your effectiveness as a problem solver.
Therefore, extended precision should be employed selectively, where it will yield the maximum benefit at the least cost in terms of execution time.
It should be noted that some of the commonly used software packages (for example, Excel and Mathcad) routinely use double precision to represent numerical quantities.

Others, like MATLAB, allow you to use extended precision if you so desire.
Thank you very much for your attention!
