0% found this document useful (0 votes)
18 views15 pages

Numerical Error RoundOff

Uploaded by

alyssayoum04
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
18 views15 pages

Numerical Error RoundOff

Uploaded by

alyssayoum04
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 15

Numerical Methods (Engr-280)

Errors in Numerical Methods


Roundoff Errors

Dakar American University of Science and Technology (DAUST)

Dr. Amadou Toure

26 March, 2023

Dr. Amadou Toure Numerical Methods (Engr-280) Errors in Numerical Methods


26 March, 2023 1 / 15
Source of Error:
Two sources of numerical error:
1 Round off error - Caused by representing a number approximately
2 Truncation error - Caused by truncating or approximating a
mathematical procedure.
1 √
Round off error: ≅ 0.333333, 2 ≅ 1.4142 …
3
Example problems created by round off error:
▶ Over 20 civilians killed by enemy’s Scud missile due to failure in the

defense tracking system to intercept the Scud. Why?


▶ Clock cycle of 1/10 seconds was represented in 24-bit fixed point

register created an error of 9.5 × 10−8 seconds.


▶ The battery was on for consecutive hours, thus causing an inaccuracy of
𝑠 3600𝑠
9.5 × 10−8 × 100ℎ𝑟 × = 0.342𝑠
0.1𝑠 1ℎ𝑟
▶ Calculated shift in the ranging system of the missile ≈ 687 meters.
▶ The target was considered to be out of range at a distance greater

than 137 meters.


Dr. Amadou Toure Numerical Methods (Engr-280) Errors in Numerical Methods
26 March, 2023 2 / 15
Binary Representation
Decimal Number Representation (Base 10):
257.76 = 2 × 102 + 5 × 101 + 7 × 100 + 7 × 10−1 + 6 × 10−2
Binary Number Representation (Base 2):
1 × 23 + 0 × 2 2 + 1 × 2 1 + 1 × 2 0
(1011.0011)2 = ( )
+ 0 × 2−1 + 0 × 2−2 + 1 × 2−3 + 1 × 2−4
10
= 11.1875
Convert base-10 to binary:
Quotient Remainder
11/2 5 1=𝑎0 ⎧ 2) 11 ↑
↑ 1 ⎫
{ ↑
↑ }
5/2 2 1=𝑎1 2) 5 ↑
↑ 1
1110 →⎨ ↑
2/2 1 0=𝑎2 2) 2 ↑

↑ 0 ⎬ 10112
{ ↑
↑ }
1/2 0 1=𝑎3 ⎩ 2) 1 ↑ 1 ⎭
Hence (11)10 = (𝑎3 𝑎2 𝑎1 𝑎0 )2 = (1011)2

Dr. Amadou Toure Numerical Methods (Engr-280) Errors in Numerical Methods


26 March, 2023 3 / 15
Fractional Decimal Number to Binary

Number Number after decimal Number before decimal


0.1875 × 2 0.375 0.375 0 = 𝑎−1
0.375 × 2 0.75 0.75 0 = 𝑎−2
0.75 × 2 1.75 0.5 1 = 𝑎−3
0.5 × 2 1.0 0.0 1 = 𝑎−4

Hence (0.1875)10 = (𝑎−1 𝑎−2 𝑎−3 𝑎−4 )2 = (0.0011)2


(11.1875)10 = (?.?)2
▶ (11)10 = (1011)2 and (0.1875)10 = (0.0011)2
Thus:
▶ (11.1875)10 = (1011.0011)2

Dr. Amadou Toure Numerical Methods (Engr-280) Errors in Numerical Methods


26 March, 2023 4 / 15
All Fractional Decimal Numbers Cannot be Represented Exactly:

Number Number after decimal Number before decimal


0.3 × 2 0.6 0.6 0 = 𝑎−1
0.6 × 2 1.2 0.2 1 = 𝑎−2
0.2 × 2 0.4 0.4 0 = 𝑎−3
0.4 × 2 0.8 0.8 0 = 𝑎−4
0.8 × 2 1.6 0.6 1 = 𝑎−5
⋯ ⋯ ⋯ ⋯

Hence (0.3)10 = (𝑎−1 𝑎−2 𝑎−3 𝑎−4 𝑎−5 )2 = (0.01001)2 = 0.28125

Dr. Amadou Toure Numerical Methods (Engr-280) Errors in Numerical Methods


26 March, 2023 5 / 15
Scientific Form: Floating Point

Floating Decimal Point : Scientific Form


256.78 → +2.5678 × 102 0.003678 → +3.678 × 10−3
exponent 𝑒
The form: sign × mantissa × base or 𝜎 × 𝑚 × base
Floating Point Format for Binary Numbers:

𝑦 = 𝜎 × 𝑚 × 2𝑒

𝜎 = sign (0 for +ve, 1 for -ve )


𝑚 - mantissa or fraction (aka significand) - the valuable digits
𝑒 - exponent, how far & in which direction to move the decimal point
(11.1875)10 = (1011.0011)2 = (1.0110011)2 × 23

Dr. Amadou Toure Numerical Methods (Engr-280) Errors in Numerical Methods


26 March, 2023 6 / 15
How computer represents number in memory.

Computer represents numbers in binary (1’s and 0’s).


Computer has a limited amount of space for each number or variable.
Thus direct impact to the size, or range, of numbers represented.
▶ a byte (8-bits) to represent 28 = 256 different numbers,
unsigned (+ve): 0 − 256
signed: 2𝑛−1 to 2𝑛−1 − 1 or −128 to 127
▶ a word (16-bits) can be used to represent 216 = 65 536 numbers
▶ a double-word (32-bits) can be used to represent 232 = 4 294 967 296
numbers.

Dr. Amadou Toure Numerical Methods (Engr-280) Errors in Numerical Methods


26 March, 2023 7 / 15
IEEE Standard 754
Most (but not all) computer manufacturers use IEEE-754 format
Number represented:
(−1)𝑠 × (1.𝑀 ) × 2𝐸−𝑏𝑖𝑎𝑠 ≡ (−1)𝑠 × 2𝐸−𝑏𝑖𝑎𝑠 × (1.𝑀 )
31 23 0
Sign

Exponent Mantissa

Single Precision (32-bits): The exponent is 8 bits and the mantissa 23


bits.

Dr. Amadou Toure Numerical Methods (Engr-280) Errors in Numerical Methods


26 March, 2023 8 / 15
Number Range:
𝑒 = 1111 1111 = 28 − 1 = 255 reserved
𝑒 = 0000 0000 = 0 reserved
so, 𝐸 = 𝑒 − 127 ≡ 1 − 127 ≤ 𝐸 ≤ 254 − 127 ≡ −126 ≤ 𝐸 ≤ 127
Smallest exponent: 𝑒 = 0000 0000, represents denormal numbers
(1.𝑚 → 0.𝑚)
Largest exponent:
𝑒 = 1111 1111, represents ±∞, if 𝑚 = 0
𝑒 = 1111 1111, represents NaN (not a number) , if 𝑚 ≠ 0
Smallest positive normal number
= 1.0000 0000 ⋯ 0000 × 2−126
≈ 1.2 × 10−38
bin: 0000 0000 1000 0000 0000 0000 0000 0000 MATLAB:
realmin(’single’)
Largest positive number
= 1.1111 1111 ⋯ 1111 × 2127
= (1 + (1 − 2−23 )) × 2127
≈ 2128 ≈ 3.4 × 1038
bin: 0111 1111 0111 1111 1111 1111 1111 1111 MATLAB:
realmax(’single’)
Dr. Amadou Toure Numerical Methods (Engr-280) Errors in Numerical Methods
26 March, 2023 9 / 15
Machine Epsilon

Machine epsilon (𝜖𝑚 ) is defined as the distance (gap) between 1 and the
next largest floating point number.
For IEEE-754 single precision, 𝜖𝑚 = 2−23 ≈ 1.192 × 10−7 , as shown by:
1 = 1.00000000000000000000000
1 + 𝜖𝑚 = 1⏟ .0000000000000000000000 1⏟
20 2−23

Dr. Amadou Toure Numerical Methods (Engr-280) Errors in Numerical Methods


26 March, 2023 10 / 15
(11.1875)10 = (1011.0011)2 = (1.0110011)2 × 23
▶ 𝑠 = 0, +ve ⟹ (−1)0 = 1

▶ 𝑒 = 3 = 𝐸 − bias ⟹ 𝐸 = 3 + 127 = (130) 0 = (100000010)


1 2
▶ 𝑀 = 0110011, fill with 0 to 23-bits
31 23 0

01000001000110011000000000000000

Given ieee-format, convert to decimal:


31 23 0

11000001110110010100000000000000

𝑠 = 1, -ve ⟹ (−1)1 = −1
𝐸 = 28 + 21 + 20 = 128 + 2 + 1 = 131
𝑒 = 𝐸 − 127 = 131 − 127 = 4
𝑀 = 2−1 + 2−3 + 2−4 + 2−7 + 2−9 = 0.697265625
= −1 × 24 × (20 + 0.697265625) = −27.15625
Dr. Amadou Toure Numerical Methods (Engr-280) Errors in Numerical Methods
26 March, 2023 11 / 15
Some fractions can’t be represented exactly! (But, 𝜖𝑎 < 𝜖𝑚 )

0.02832 x 2 = 0.05664
0.05664 x 2 = 0.11328
0.11328 x 2 = 0.22656
0.22656 x 2 = 0.45312
0.45312 x 2 = 0.90624 (0.02832)10 = (0.00000111001111111111101011000001)2
0.90624 x 2 = 1.81248 = 1.11001111111111101011000001 × 2−6
0.81248 x 2 = 1.62496 𝐸 = −6 + 127 = 121 = 011110012
0.62496 x 2 = 1.24992
0.24992 x 2 = 0.49984
0.49984 x 2 = 0.99968
0.99968 x 2 = 1.99936

31 23 0

00111100111001111111111101011000

(0.02832)10 ≈ (0.02831999957561492919921875)2
The Error 𝐸𝑡 = −4.2438507080078125 × 10−10
𝐸𝑡
𝜖𝑎 = ∣ ∣ = 1.498535 × 10−08 < 𝜖𝑚 (𝑒−23 or 1.192 × 10−07 )
0.02832
Dr. Amadou Toure Numerical Methods (Engr-280) Errors in Numerical Methods
26 March, 2023 12 / 15
Not Really Real
Floating point numbers approximate real numbers.
▶ The number of bits in the mantissa sets the granularity,
▶ the exponent sets the range of the approximation.

Exple: a 3-bit exponent and 3-bit mantissa represent the following values on
the real line:

Calculations with results greater than greatest value (15 in this case) gives
an overflow.
A result between 0 and the lowest value (1/8 in this case) gives an
underflow.
The distance from 1 to the smallest value greater than 1 is 𝜖𝑚 (machine
epsilon)
Dr. ▶ (in Matlab
Amadou Toure eps)Numerical Methods (Engr-280) Errors in Numerical Methods
26 March, 2023 13 / 15
single precision - The exponent is 8 bits and the mantissa 23 bits.
𝜖𝑚 ≈ 1.2 × 10−7
Overflow at ≈ 3.4 × 1038
double precision - The exponent is 11 bits and the mantissa 52 bits.
𝜖𝑚 ≈ 2.2 × 10−16
Overflow at ≈ 1.8 × 10308
The underflow is handled gracefully by unnormalized values, that fill
the gap round zero.
If an underflow still occurs the result is set to 0.
An overflow gives ∞ as result.
Undeterminable results produce a NaN – Not a Number. (ex: 0/0 =
NaN).
Arithmetic operations (+, −, ∗, /) get a result as if they were
calculated in full precision, and then rounded to fit into the floating
point representation.
Dr. Amadou Toure Numerical Methods (Engr-280) Errors in Numerical Methods
26 March, 2023 14 / 15
Rounding errors introduced in IEEE floating point arithmetics are modeled
like this:
𝜖
𝑓𝑙(𝑥 + 𝑦) = (𝑥 + 𝑦)(1 + 𝜎), |𝜎| ≤ 𝑚
2
𝜖
𝑓𝑙(𝑥 − 𝑦) = (𝑥 − 𝑦)(1 + 𝜎), |𝜎| ≤ 𝑚
2
𝜖
𝑓𝑙(𝑥 ∗ 𝑦) = (𝑥 ∗ 𝑦)(1 + 𝜎), |𝜎| ≤ 𝑚
2
𝜖𝑚
𝑓𝑙(𝑥/𝑦) = (𝑥/𝑦)(1 + 𝜎), |𝜎| ≤
2
Two uses:
(uncommon) Keep track on errors in calculations
(important) Analyze algorithms with respect to influence of floating
point round off. If round off errors are moderate compared to a
perturbation analysis of the problem, the algorithm is stable.

Dr. Amadou Toure Numerical Methods (Engr-280) Errors in Numerical Methods


26 March, 2023 15 / 15

You might also like