MAM380F 2008 Tutorial 1 Solutions

This document provides solutions to problems involving floating point arithmetic. It discusses: 1) Calculating sums, differences, products, and quotients of floating point numbers and analyzing the resulting errors 2) Determining the number of representable numbers in a floating point system 3) Sources of error such as subtractive cancellation that can occur when subtracting two nearly equal numbers

Uploaded by

nmhatitye
Copyright
© Attribution Non-Commercial (BY-NC)

MAM380F 2008 Tutorial 1 Solutions

1. Note: In all problems in this section I first find $fl(x)$ for each number. I then do the operation with the correct digit shifting, and normalize the result if necessary. You need to do all these steps for full marks in an exam situation. Converting the number from its floating point form back into decimal form isn't required, but it can help make the error calculation clearer.

(a) $fl(401) = 0.4010 \times 10^{3}$ and $fl(27) = 0.2700 \times 10^{2}$. We shift $fl(27)$ as it has the smaller exponent, which gives $(0.4010 \times 10^{3}) + (0.0270 \times 10^{3}) = 0.4280 \times 10^{3} = 428$. The floating point calculation is exact.

(b) $fl(1.736) = 0.1736 \times 10^{1}$ and $fl(204.8) = 0.2048 \times 10^{3}$, therefore we have $(0.0017 \times 10^{3}) + (0.2048 \times 10^{3}) = 0.2065 \times 10^{3} = 206.5$. The exact result is 206.536. The absolute error is $|206.536 - 206.5| = 0.036$ and the relative error is $\frac{|206.536 - 206.5|}{|206.536|} \approx 0.174 \times 10^{-3}$.

(c) $fl(1023.5) = 0.1024 \times 10^{4}$ and $fl(2.46) = 0.2460 \times 10^{1}$, therefore we have $(0.1024 \times 10^{4}) - (0.0002 \times 10^{4}) = 0.1022 \times 10^{4} = 1022.0$. The exact result is 1021.04. The absolute error is $|1021.04 - 1022.0| = 0.96$ and the relative error is $\frac{|1021.04 - 1022.0|}{|1021.04|} \approx 0.940 \times 10^{-3}$.

(d) $fl(1.73647) = 0.1736 \times 10^{1}$ and $fl(1.73652) = 0.1737 \times 10^{1}$, therefore we have $(0.1736 \times 10^{1}) - (0.1737 \times 10^{1}) = -0.0001 \times 10^{1} = -0.1000 \times 10^{-2}$. The exact result is $-0.00005$. The absolute error is $|-0.00005 - (-0.001)| = 0.95 \times 10^{-3}$ and the relative error is $\frac{|-0.00005 - (-0.001)|}{|-0.00005|} = 19.0$.

(e) $fl(327800) = 0.3278 \times 10^{6}$, but the exponent 6 is larger than our maximum exponent of 4, therefore we have overflow. This is easily corrected, however, if we first divide the numerator and denominator by 10000, giving the algebraically equivalent expression $32.78/0.1876$. Now we have $fl(32.78) = 0.3278 \times 10^{2}$ and $fl(0.1876) = 0.1876 \times 10^{0}$. We have $\frac{0.3278 \times 10^{2}}{0.1876 \times 10^{0}} = 1.7473 \times 10^{2} = 0.1747 \times 10^{3}$. In decimal form this is 174.7. The exact answer is 174.7334755. The absolute error is calculated the same way as in all other questions and is 0.0334755, while the relative error is $0.192 \times 10^{-3}$.

(f) $fl(0.1748) = 0.1748 \times 10^{0}$ and $fl(0.00001) = 0.1000 \times 10^{-4}$. So we have $(0.1748 \times 10^{0}) \times (0.1000 \times 10^{-4}) = 0.01748 \times 10^{-4} = 0.1748 \times 10^{-5}$. Since $-5 < -4$ we have underflow and the floating point representation of our result is 0. The actual result is $0.1748 \times 10^{-5}$ and hence the absolute error $|0.1748 \times 10^{-5} - 0|$ is also $0.1748 \times 10^{-5}$. The relative error in this case is $\frac{|0.1748 \times 10^{-5} - 0|}{|0.1748 \times 10^{-5}|} = 1.0$.

(g) $fl(\sqrt{2}) = 0.1414 \times 10^{1}$. So we have $(0.1414 \times 10^{1}) \times (0.1414 \times 10^{1}) = 0.01999 \times 10^{2} = 0.1999 \times 10^{1}$. The exact answer is clearly 2.0. This gives an absolute error of 0.001 and a relative error of 0.0005.

(h) $fl(1000) = 0.1000 \times 10^{4}$ and $fl(0.3) = 0.3000 \times 10^{0}$. When we add $fl(1000)$ to $fl(0.3)$ we get $0.1000 \times 10^{4} + 0.0000 \times 10^{4} = 0.1000 \times 10^{4} = 1000$. Note how the 3 falls off the end when shifting the digits to line up the exponent. Therefore, using the order of operations given, we could add as many 0.3 terms as we want and the result is still 1000! The relative and absolute error depend on how many 0.3 terms we use. To fix this we could bracket the smaller terms so that they are added to each other first and then added to 1000. For example, we can write $1000 + (0.3 + 0.3 + 0.3)$ instead of $1000 + 0.3 + 0.3 + 0.3$. Now we have $(0.3000 \times 10^{0}) + (0.3000 \times 10^{0}) + (0.3000 \times 10^{0}) = 0.9000 \times 10^{0}$. Then we do the operation $0.1000 \times 10^{4} + 0.9000 \times 10^{0}$. Shifting the digits to get the same exponent we get $0.1000 \times 10^{4} + 0.0001 \times 10^{4} = 0.1001 \times 10^{4} = 1001.0$. The exact answer is 1000.9, so this is a big improvement.

2. (a) Numbers in the system have the form $\pm 0.b_1 b_2 b_3 b_4 \times 10^{e}$. There are 9 choices for the 1st digit and 10 choices for each of the following digits, hence we have $9 \times 10 \times 10 \times 10 = 9000$ normalized positive mantissas.

(b) We have 9 possible exponents $(-4, -3, -2, -1, 0, 1, 2, 3, 4)$ so, including positive and negative numbers and zero, we have $2 \times 9000 \times 9 + 1 = 162001$ floating point numbers in the system.

(c) The smallest normalized mantissa is 0.1000 and the smallest exponent is $-4$, so the smallest positive number is $0.1000 \times 10^{-4} = 0.00001$. Similarly, the largest number in the system is $0.9999 \times 10^{4} = 9999$.

(d) The base is 10 with a 4 digit mantissa, so the machine precision is $\frac{1}{2} \times 10^{1-4} = 0.0005$.
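The counts and extremes in Problem 2 can be checked with exact rational arithmetic. This is only a sketch: the names `BASE`, `DIGITS`, `E_MIN` and `E_MAX` are hypothetical and simply encode the system described above (base 10, 4-digit normalized mantissa, exponents from -4 to 4).

```python
from fractions import Fraction

# Assumed parameters of the toy system from Problem 2 (hypothetical names).
BASE, DIGITS, E_MIN, E_MAX = 10, 4, -4, 4

def count_numbers():
    """Positive and negative normalized numbers, plus zero."""
    mantissas = (BASE - 1) * BASE ** (DIGITS - 1)  # leading digit is nonzero
    exponents = E_MAX - E_MIN + 1
    return 2 * mantissas * exponents + 1

def smallest_positive():
    """Smallest mantissa 0.1000 with the smallest exponent."""
    return Fraction(1, BASE) * Fraction(BASE) ** E_MIN

def largest():
    """Largest mantissa 0.9999 with the largest exponent."""
    return (1 - Fraction(BASE) ** -DIGITS) * Fraction(BASE) ** E_MAX

def machine_precision():
    """Rounding unit (1/2) * BASE**(1 - DIGITS)."""
    return Fraction(1, 2) * Fraction(BASE) ** (1 - DIGITS)

print(count_numbers(), float(smallest_positive()), largest(), float(machine_precision()))
```

Using `Fraction` keeps every value exact, so the comparison with the hand calculation is not itself affected by binary rounding.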

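The hand calculations in Problem 1 can be cross-checked with a small emulator built on Python's `decimal` module. This is a sketch under stated assumptions, not the tutorial's procedure: the helpers `fl`, `fl_op` and `rel_err` are hypothetical names, rounding is taken to be round-half-up to 4 significant digits, underflow is flushed to zero, and each operation is computed exactly and then rounded once (the tutorial instead rounds the shifted operand first; the two approaches agree on the inputs used here).

```python
from decimal import Decimal, Context, ROUND_HALF_UP

# Assumed system: 0.d1d2d3d4 x 10^e with e in [-4, 4], round-half-up.
CTX = Context(prec=4, rounding=ROUND_HALF_UP)
E_MIN, E_MAX = -4, 4

def fl(x):
    """Round x into the toy system: 0 on underflow, OverflowError on overflow."""
    d = CTX.create_decimal(str(x))
    if d == 0:
        return d
    e = d.adjusted() + 1  # exponent in the 0.dddd x 10^e convention
    if e > E_MAX:
        raise OverflowError(f"exponent {e} exceeds {E_MAX}")
    if e < E_MIN:
        return Decimal(0)  # underflow is flushed to zero
    return d

def fl_op(op, a, b):
    """Apply op to fl(a) and fl(b), then round the result back into the system."""
    ops = {"+": CTX.add, "-": CTX.subtract, "*": CTX.multiply, "/": CTX.divide}
    return fl(ops[op](fl(a), fl(b)))

def rel_err(exact, approx):
    """Relative error |exact - approx| / |exact| at Decimal's default precision."""
    exact, approx = Decimal(str(exact)), Decimal(str(approx))
    return abs(exact - approx) / abs(exact)

# Part (h): summation order matters.
left_to_right = fl(1000)
for _ in range(3):
    left_to_right = fl_op("+", left_to_right, 0.3)   # each 0.3 is lost
small_first = fl_op("+", 1000, fl_op("+", fl_op("+", 0.3, 0.3), 0.3))

print(fl_op("+", 1.736, 204.8), fl_op("-", 1023.5, 2.46))
print(left_to_right, small_first)
```

Calling `fl(327800)` raises `OverflowError`, which corresponds to the overflow in part (e), and `fl_op("*", 0.1748, 0.00001)` returns zero, which corresponds to the underflow in part (f).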
3. The result will be highly inaccurate if x is large and positive. Note that if x 1 the result is fine as fl x x 1 1 which is correct.

4. Start by computing the floating point representation of the intermediate calculations. We have $fl(b^2) = 0.3856 \times 10^{4}$ and $fl(4ac) = 0.8000 \times 10^{0}$. Therefore $fl(\sqrt{fl(b^2) - fl(4ac)}) = fl(\sqrt{(0.3856 \times 10^{4}) - (0.8000 \times 10^{0})})$. Shifting the smaller number to make the exponents the same gives us $fl(\sqrt{(0.3856 \times 10^{4}) - (0.0001 \times 10^{4})}) = fl(\sqrt{0.3855 \times 10^{4}}) = 0.6209 \times 10^{2}$. Since $fl(b) = 0.6210 \times 10^{2}$ we are going to be subtracting two nearly identical numbers. Doing the subtraction, $-fl(b) + fl(\sqrt{fl(b^2) - fl(4ac)}) = -0.0001 \times 10^{2} = -0.1000 \times 10^{-1} = -0.01$. Completing the calculation gives $x_1 = \frac{-0.1000 \times 10^{-1}}{2} = -0.005$.

To get the other root we do a similar calculation: $x_2 = \frac{-(0.6210 \times 10^{2}) - (0.6209 \times 10^{2})}{2} = \frac{-0.1242 \times 10^{3}}{2} = -0.6210 \times 10^{2} = -62.10$. Note that here the two nearly identical numbers have the same sign, so we were adding them rather than subtracting.

The actual roots are approximately $x_1 = -0.0032208$ and $x_2 = -62.09678$. So the relative error in our calculation of $x_1$ is $\frac{|-0.0032208 - (-0.005)|}{|-0.0032208|} = 0.5524$, which is terrible. Our calculation of $x_2$ is fine though, as the relative error is $\frac{|-62.09678 - (-62.10)|}{|-62.09678|} \approx 0.519 \times 10^{-4}$. The reason for this is that in the calculation of $x_1$ subtractive cancellation occurred when we subtracted two nearly identical numbers.
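The cancellation can be reproduced with a short sketch in 4-digit decimal arithmetic. The coefficients $a = 1$, $b = 62.10$, $c = 0.2$ are an assumption inferred from the intermediate values $fl(b^2)$, $fl(4ac)$ and $fl(b)$ quoted in the solution; `roots_standard` is a hypothetical helper name, and rounding is taken to be round-half-up.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

CTX = Context(prec=4, rounding=ROUND_HALF_UP)  # assumed 4-digit rounding arithmetic

def roots_standard(a, b, c):
    """Textbook quadratic formula with every intermediate rounded to 4 digits."""
    a, b, c = (CTX.create_decimal(str(v)) for v in (a, b, c))
    b2 = CTX.multiply(b, b)                                 # fl(b^2)
    four_ac = CTX.multiply(CTX.multiply(Decimal(4), a), c)  # fl(4ac)
    root = CTX.sqrt(CTX.subtract(b2, four_ac))              # fl(sqrt(b^2 - 4ac))
    two_a = CTX.multiply(Decimal(2), a)
    x1 = CTX.divide(CTX.add(CTX.minus(b), root), two_a)       # -b + root: cancellation
    x2 = CTX.divide(CTX.subtract(CTX.minus(b), root), two_a)  # -b - root: same signs
    return x1, x2

# Assumed coefficients (not stated explicitly in the solution text).
x1, x2 = roots_standard(1, "62.10", "0.2")
true_x1 = Decimal("-0.0032208")
print(x1, x2, abs(true_x1 - x1) / abs(true_x1))
```

The small root keeps only one significant digit of the difference $-62.10 + 62.09$, which is why its relative error is so large while the large root is essentially unaffected.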

Using the new formula we still have $fl(\sqrt{fl(b^2) - fl(4ac)}) = 0.6209 \times 10^{2}$. The value of $\mathrm{sgn}(b)$ is 1. Therefore $fl(fl(b) + \mathrm{sgn}(b)\, fl(\sqrt{fl(b^2) - fl(4ac)})) = (0.6210 \times 10^{2}) + (0.6209 \times 10^{2}) = 0.1242 \times 10^{3}$. Therefore $q = -\frac{1}{2}(0.1242 \times 10^{3}) = -0.6210 \times 10^{2} = -62.10$. Now $x_1 = 0.2/q = -0.32206 \times 10^{-2} = -0.0032206$ and $x_2 = q/1 = -62.10$. The relative error for our new approximation of $x_1$ is $\frac{|-0.0032208 - (-0.0032206)|}{|-0.0032208|} \approx 0.621 \times 10^{-4}$, which is a very large improvement. This is due to the elimination of the subtractive cancellation in the original equation.
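The rearranged formula can be sketched the same way. As before, the coefficients $a = 1$, $b = 62.10$, $c = 0.2$ are inferred rather than stated, `roots_stable` is a hypothetical name, and rounding is assumed to be round-half-up to 4 digits. Note that this sketch also rounds the final quotient $c/q$ to 4 digits, whereas the solution above quotes it to 5 digits.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

CTX = Context(prec=4, rounding=ROUND_HALF_UP)  # assumed 4-digit rounding arithmetic

def roots_stable(a, b, c):
    """q = -(b + sgn(b) * sqrt(b^2 - 4ac)) / 2, then x1 = c / q and x2 = q / a."""
    a, b, c = (CTX.create_decimal(str(v)) for v in (a, b, c))
    disc = CTX.subtract(CTX.multiply(b, b),
                        CTX.multiply(CTX.multiply(Decimal(4), a), c))
    root = CTX.sqrt(disc)
    sgn = 1 if b >= 0 else -1
    # b and sgn(b) * root share a sign, so this sum cannot cancel.
    s = CTX.add(b, CTX.multiply(Decimal(sgn), root))
    q = CTX.divide(CTX.minus(s), Decimal(2))
    return CTX.divide(c, q), CTX.divide(q, a)

# Assumed coefficients (not stated explicitly in the solution text).
x1, x2 = roots_stable(1, "62.10", "0.2")
true_x1 = Decimal("-0.0032208")
print(x1, x2, abs(true_x1 - x1) / abs(true_x1))
```

The design choice is to compute the root of larger magnitude first, where the two terms add, and to recover the smaller root from the product of the roots $x_1 x_2 = c/a$.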
5. (a) The relative error is $\frac{|\pi - 3.1415|}{|\pi|} \approx 0.295 \times 10^{-4}$. This is less than $5 \times 10^{-5}$ but greater than $5 \times 10^{-6}$, so 3.1415 approximates $\pi$ to 5 significant figures. This is most easily seen by writing the relative error as $2.95 \times 10^{-5}$, which is clearly less than $5 \times 10^{-5}$. You should solve all significant figure problems in the same way to ensure you don't make a mistake.

(b) The relative error $\approx 2.34 \times 10^{-6} < 5 \times 10^{-6}$ but is greater than $5 \times 10^{-7}$, and hence the approximation has 6 significant figures.

(c) The relative error $= 1 \times 10^{-4} < 5 \times 10^{-4}$ but is greater than $5 \times 10^{-5}$, and hence the approximation has 4 significant figures despite the fact that not a single digit is in agreement.

(d) The relative error $= 1 \times 10^{-1} < 5 \times 10^{-1}$ but is greater than $5 \times 10^{-2}$, and hence the approximation has 1 significant figure.

(e) The relative error is the same as in part (d), and hence the approximation also has 1 significant figure.

(f) The relative error $\approx 4.55 \times 10^{-5} < 5 \times 10^{-5}$ but is greater than $5 \times 10^{-6}$, and hence the approximation has 5 significant figures.
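The rule used throughout Problem 5 (find the largest $t$ with relative error below $5 \times 10^{-t}$) can be sketched as a short function. The name `significant_digits` is hypothetical, and the second example pair (10000 and 9999) is an illustration of mine, not one of the tutorial's inputs.

```python
from decimal import Decimal

def significant_digits(p, p_star):
    """Largest t >= 0 with |p - p*| / |p| < 5 * 10**(-t)."""
    p, p_star = Decimal(p), Decimal(p_star)
    rel = abs(p - p_star) / abs(p)
    if rel == 0:
        raise ValueError("the approximation is exact")
    t = 0
    while rel < 5 * Decimal(10) ** -(t + 1):
        t += 1
    return t

# Part (a): 3.1415 as an approximation to pi.
print(significant_digits("3.14159265358979", "3.1415"))
# Hypothetical pair with relative error 1e-4 and no matching digits.
print(significant_digits("10000", "9999"))
```

Passing the values as strings avoids binary rounding of the inputs, so the comparison against $5 \times 10^{-t}$ is made on the decimal values as written.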
6. (a) If $p^*$ is to approximate 150 with relative error less than $10^{-3}$ then $p^*$ must satisfy $\frac{|150 - p^*|}{|150|} < 10^{-3}$. This implies $|150 - p^*| < 0.150$. To remove the absolute value we note that the relation will be satisfied for $p^* > 150 - 0.150$ and $p^* < 150 + 0.150$. Hence the bounds are $150 \pm 0.150$. This means $p^* \in (149.85, 150.15)$.

(b) The bounds on $p^*$ are $1500 \pm 1.50$, therefore $p^* \in (1498.5, 1501.5)$.

(c) The bounds on $p^*$ are $90 \pm 0.09$, therefore $p^* \in (89.91, 90.09)$.

(d) The bounds on $p^*$ are $900 \pm 0.9$, therefore $p^* \in (899.1, 900.9)$.

Note: The calculations in this question are trivial, but the important thing to note is that the interval on which $p^*$ can lie grows proportionally to the number it is approximating.

7. This question sounds complex but it's actually trivial. On any machine, the machine precision (eps) is $\frac{1}{2}\beta^{1-t}$, where $\beta$ is the base and $t$ is the number of mantissa digits. Since this value is the bound on the relative error $\frac{|p - p^*|}{|p|}$, a bound on the absolute error is $|p - p^*| < |p| \cdot \mathrm{eps}$. So when approximating $e$ using any machine the bound on the absolute error is $e \cdot \mathrm{eps}$. Similarly, for $10^{7}$ the bound is $10^{7} \cdot \mathrm{eps}$. This implies that the larger the number you are approximating, the larger the bound on the absolute error. This is expected though, since floating point numbers are not evenly distributed throughout the range.
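Problems 6 and 7 both turn a relative error bound into an absolute one by multiplying by $|p|$. A minimal sketch follows; the function names `interval` and `abs_error_bound` are hypothetical, and the default base and digit count are those of the toy system in Problem 2 rather than of any particular machine.

```python
from fractions import Fraction

def interval(p, rel_tol):
    """Open interval for p* such that |p - p*| / |p| < rel_tol."""
    p, rel_tol = Fraction(p), Fraction(rel_tol)
    half_width = abs(p) * rel_tol
    return p - half_width, p + half_width

def abs_error_bound(p, base=10, digits=4):
    """|p| * eps, with eps = (1/2) * base**(1 - digits)."""
    eps = Fraction(1, 2) * Fraction(base) ** (1 - digits)
    return abs(Fraction(p)) * eps

tol = Fraction(1, 1000)
for p in (150, 1500, 90, 900):
    lo, hi = interval(p, tol)
    print(p, float(lo), float(hi))

# The absolute bound scales with |p|: compare a small and a large value.
print(float(abs_error_bound(Fraction("2.718281828"))), float(abs_error_bound(10**7)))
```

Other machines can be modelled by changing the arguments, for example `abs_error_bound(10**7, base=2, digits=53)` for a 53-bit binary significand.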
