L3 Source of Error, Floating-Point

Round-off and truncation errors occur in numerical calculations because of the limits of computer number representation. Numbers are stored in binary using a sign bit, an exponent, and a significand, and this finite representation introduces errors in arithmetic operations. Doubling precision, grouping operations, and rewriting equations can minimize round-off errors.


Source of Error

• Round-off Error
  • Caused by the limited number of digits used to represent numbers
    in a computer, and
  • The ways numbers are stored and additions and subtractions are
    performed in a computer

• Truncation Error
  • Caused by the approximation used in the mathematical formula of
    the scheme

Roundoff Errors
• Roundoff errors arise because digital computers cannot
represent some quantities exactly
• There are two major aspects of roundoff errors involved in
numerical calculations:
1. Digital computers have magnitude and precision limits on
their ability to represent numbers.
2. Certain numerical manipulations are highly sensitive to
roundoff errors. This can result both from mathematical
considerations and from the way in which computers perform
arithmetic operations.
Computer Number Representation
• Numerical roundoff errors are directly related to the manner
in which numbers are stored in a computer.
• A number system is merely a convention for representing
quantities.
• Because we have 10 fingers and 10 toes, the number system that we
are most familiar with is the decimal, or base-10, system.
• Numbers on the computer are represented with a binary, or
base-2, system
• (10101101)₂ = 2⁷ + 2⁵ + 2³ + 2² + 2⁰ = 128 + 32 + 8 + 4 + 1 = (173)₁₀
Floating Point Number
• Fractional quantities are typically represented in computers using
floating-point format. In this approach, which is very much like
scientific notation, the number is expressed as
±s × bᵉ
• where s = the significand (or mantissa), b = the base of the number
system being used, and e = the exponent.
• For example, a value like 0.005678 could be represented in a
wasteful manner as 0.005678 × 10⁰
• However, normalization would yield 5.678 × 10⁻³, which eliminates
the useless zeros
Implications of Floating-Point Representation
• Consider a base-10 computer with a 5-digit word size. Assume one
digit is used for the sign, two for the exponent, and two for the
mantissa.
• For simplicity, assume that one of the exponent digits is used
for its sign, leaving a single digit for its magnitude.
• A general representation of the number following normalization
would be
    s₁s₀d₂d₁d₀  →  s₁d₁.d₂ × 10^(s₀d₀)
where s₀ and s₁ = the signs, d₀ = the magnitude of the exponent,
and d₁ and d₂ = the magnitude of the significand digits.
Implications of Floating-Point Representation
• With the largest possible digit in base-10, that is, 9:
  • Largest value = ±9.9 × 10⁺⁹
  • Smallest value = ±1.0 × 10⁻⁹
• We could not use it to represent a quantity like Planck's constant,
6.626 × 10⁻³⁴ J·s
• If we increase the mantissa by one digit, the maximum/minimum
value increases only slightly, to
    ±9.99 × 10^±9
• In contrast, a one-digit increase in the exponent raises the
maximum/minimum by 90 orders of magnitude, to
    ±9.9 × 10^±99
Implications of Floating-Point Representation

[Figure: number line showing the possible ranges corresponding to the
hypothetical base-10 floating-point scheme described above]
IEEE-754 Floating-Point Format

    field:    S (sign)   Exponent   Fraction
    single:   1 bit      8 bits     23 bits
    double:   1 bit      11 bits    52 bits

x = (−1)^S × (1 + Fraction) × 2^(Exponent − Bias)

• S: sign bit (0 = non-negative, 1 = negative)
• Normalize significand: 1.0 ≤ |significand| < 2.0
  • Always has a leading pre-binary-point 1 bit, so there is no need
    to represent it explicitly (hidden bit)
  • Significand is Fraction with the "1." restored
• Exponent = Actual Exponent + Bias (or, Actual Exponent = Exponent − Bias)
  • Ensures the stored exponent is unsigned
  • Single: Bias = 127; Double: Bias = 1023
Single-Precision Range
• Exponents 00000000 = 0 and 11111111 = 255 are reserved
• Smallest value
  • Exponent: 00000001 = 1, so Actual Exponent = 1 − 127 = −126
  • Fraction: 000…00, so significand = 1.0
  • ±1.0 × 2⁻¹²⁶ ≈ ±1.2 × 10⁻³⁸
• Largest value
  • Exponent: 11111110 = 254, so Actual Exponent = 254 − 127 = +127
  • Fraction: 111…11, so significand ≈ 2.0
  • ±2.0 × 2⁺¹²⁷ ≈ ±3.4 × 10⁺³⁸
Double-Precision Range
• Exponents 0000…00 = 0 and 1111…11 = 2047 are reserved
• Smallest value
  • Exponent: 00000000001 = 1, so Actual Exponent = 1 − 1023 = −1022
  • Fraction: 000…00, so significand = 1.0
  • ±1.0 × 2⁻¹⁰²² ≈ ±2.2 × 10⁻³⁰⁸
• Largest value
  • Exponent: 11111111110 = 2046, so Actual Exponent = 2046 − 1023 = +1023
  • Fraction: 111…11, so significand ≈ 2.0
  • ±2.0 × 2⁺¹⁰²³ ≈ ±1.8 × 10⁺³⁰⁸
Floating-Point Example
• Represent −0.625
• −0.625 = −0.101₂ = (−1)¹ × 1.01₂ × 2⁻¹
• S = 1
• Fraction = 01000…00₂
• Exponent = −1 + Bias
  • Single: −1 + 127 = 126 = 01111110₂
  • Double: −1 + 1023 = 1022 = 01111111110₂
• Single: 1 01111110 01000…00
• Double: 1 01111111110 01000…00
Floating Point Arithmetic in Computer
• Consider adding 1.0₁₀ and 0.00001₁₀ in single precision. The exact
sum is
    1.00001₁₀ = 1.0000 0000 0000 0000 1010 0111 1100 0101 1…₂ × 2⁰
• Because the fraction field has only 23 bits, the result of this
calculation is stored in memory (after rounding) as
    0011 1111 1000 0000 0000 0000 0101 0100₂
• Thus, whenever 0.00001₁₀ is added to 1.0₁₀, the result gains about
0.0000000136₁₀ as an error, which is called round-off error
Floating Point Arithmetic in Computer
• When two floating-point numbers are added, the numbers are first
expressed so that they have the same exponents.
• For example, to add 1.557 + 0.04341, the computer would express the
numbers as 0.1557 × 10¹ + 0.004341 × 10¹.
• Then the mantissas are added to give 0.160041 × 10¹ = 1.60041.
• As another example, suppose that we are subtracting 26.86 from
36.41. That is,
    0.3641 × 10² − 0.2686 × 10² = 0.0955 × 10²
which is normalized to 0.9550 × 10¹ = 9.550.
[Notice that the zero added to the end of the mantissa is not
significant but is merely appended to fill the empty space created by
the shift.]
• Even more dramatic results are obtained when the numbers are very
close, as in
    0.7642 × 10³ − 0.7641 × 10³ = 0.0001 × 10³
which would be converted to 0.1000 × 10⁰ = 0.1000.
• Thus, for this case, three nonsignificant zeros are appended.
• The subtraction of two nearly equal numbers is called subtractive
cancellation.
Floating Point Arithmetic in Computer
• Large Computations: Certain methods require extremely large numbers
of arithmetic manipulations to arrive at their final results.
• A very simple case involves summing a round base-10 number that is
not round in base-2. Suppose that the following function is called as:

    int main(void){
        …;
        s_out = sum(0.0001, 10000);
        …;
    }

    float sum(float num, int times){
        float s = 0;
        for(int i = 0; i < times; i++){
            s = s + num;   /* each addition rounds to 23 fraction bits */
        }
        return s;
    }

• For the case of a 15-significant-digit representation of the result:
  • float:  s_out = 1.000053524971008
  • double (the same routine with float replaced by double):
    s_out = 0.99999999999990619
Minimizing Round-off Error

The effects of round-off error can be minimized by changing the
computational algorithm, although this must be devised case by case.
Some useful strategies include:
• Double precision
• Grouping
• Taylor expansions
• Changing definition of variables
• Rewriting the equation to avoid subtraction
Minimizing Round-off Error
• Double Precision
  • In double precision, 8 bytes, or equivalently 64 bits, are used
    to store one real number. In this format, 1 bit is used for the
    sign, 11 bits for the exponent, and 52 bits for the mantissa.

• Grouping
  • When many small numbers are combined, e.g. by addition or
    subtraction, grouping them helps to reduce round-off errors. For
    example, to add 0.00001 to unity ten thousand times, one can
    group the additions into 100 groups of 100 small values each, sum
    each group separately, and then add the group sums to unity, as
    in the sketch below.
Minimizing Round-off Error
• Taylor Expansions
  As ε approaches 0, the accuracy of a numerical evaluation of
      f(ε) = (sin(1 + ε) − sin(1)) / ε
  becomes very poor because of round-off errors. By using a Taylor
  expansion, we can rewrite the equation so that the accuracy for
  small ε is improved:
      f(ε) ≈ cos(1) − 0.5 sin(1) ε
Minimizing Round-off Error
• Rewriting the equation to avoid subtraction
  Consider the equation
      f(x) = x(√(x + 1) − √x)
  For increasing values of x, this form of the equation suffers a
  loss-of-significance error. To avoid this error, one can
  reformulate it as
      f(x) = x / (√(x + 1) + √x)
Truncation Error
• Numerical solutions are mostly approximations of exact solutions
• Most numerical methods are based on approximating functions by
polynomials
• How accurately does the polynomial approximate the true function?
• By comparing the polynomial to the exact solution, it becomes
possible to evaluate the error, called the truncation error
Truncation Error
• Truncation errors are those that result from using an
approximation in place of an exact mathematical procedure.
• Example 1: approximation of a derivative using a finite-difference
equation:
    f′(xᵢ) ≈ Δf/Δx = (f(xᵢ₊₁) − f(xᵢ)) / (xᵢ₊₁ − xᵢ)
Truncation Error
Example 2: The Taylor Series
• Taylor's theorem states that any smooth function can be
approximated by a polynomial.
• The Taylor series provides a means to express this idea
mathematically:
    f(x) = f(x₀) + f′(x₀)(x − x₀) + (f″(x₀)/2!)(x − x₀)² + …
           + (f⁽ⁿ⁾(x₀)/n!)(x − x₀)ⁿ + Rₙ
Taylor Series
• Let x = xᵢ₊₁, x₀ = xᵢ, and (xᵢ₊₁ − xᵢ) = h; then the above series
can be written as
    f(xᵢ₊₁) = f(xᵢ) + f′(xᵢ)h + (f″(xᵢ)/2!)h² + … + (f⁽ⁿ⁾(xᵢ)/n!)hⁿ + Rₙ
• In general, the nth-order Taylor series expansion will be exact for
an nth-order polynomial.
• In other cases, the remainder term Rₙ is of the order of hⁿ⁺¹,
meaning:
  • The more terms are used, the smaller the error, and
  • The smaller the spacing, the smaller the error for a given number
    of terms
Numerical Differentiation

[Two slides of worked numerical-differentiation examples; figures not
recoverable]
Total Numerical Error
• The total numerical error is the summation of the truncation and
roundoff errors.
• The truncation error generally increases as the step size
increases, while the roundoff error decreases as the step size
increases; this leads to a point of diminishing returns when
reducing the step size, as the sketch below illustrates.
