Computational Physics I: Luigi Scorzato Lecture 2: Floating Point Arithmetic
Computer memories are finite: 1. How can we represent real numbers on a computer? 2. To what extent can such representation(s) be trusted?
Representation
Computers usually assign 32 or 64 bits for each single number, but there are two main strategies to do it:
$$ n = (-1)^{a_0}\left(a_1\,2^0 + a_2\,2^1 + \cdots + a_{M-1}\,2^{M-2}\right) $$
M is the number of bits available for a single number (typical choices are M = 32 or 64) and a_i = 0, 1. The largest representable integer is n_max = 2^(M-1) - 1, i.e. (with 64 bits) about 9.2 x 10^18. This means that 9 x 10^18 and 9 x 10^18 + 1 are both represented and distinguishable; but 10^19 cannot be represented at all.
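This limit is easy to see in practice. A minimal sketch in Python (rather than the course's Octave/Matlab scripts), using the standard `struct` module to pack a number into a 64-bit signed integer:

```python
import struct

# A 64-bit signed integer holds at most 2**63 - 1, about 9.2e18.
n_max = 2**63 - 1
print(n_max)                      # largest representable 64-bit signed integer

struct.pack(">q", n_max)          # fits in 64 bits: no error
try:
    struct.pack(">q", n_max + 1)  # 2**63 does not fit: overflow
except struct.error as err:
    print("overflow:", err)
```

(Python's own integers are arbitrary precision; the 64-bit limit appears only when the value must actually be stored in a fixed-width word, as `struct.pack` forces here.)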
$$ x = (-1)^s \cdot 1.f \cdot 2^{\,e - \mathrm{bias}} $$
IEEE 754 standard: s is 1 bit for the sign; f is the 23-bit (single precision) or 52-bit (double precision) mantissa; e is the 8-bit (single) or 11-bit (double) exponent. The length of the mantissa roughly defines the relative precision, and that of the exponent the range. Both are stored as integers (the exponent with a bias).
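The three fields can be inspected directly. A small sketch in Python (rather than the course's .m files), using only the standard library to reinterpret a double's bits:

```python
import struct

def fields_double(x):
    """Split a Python float (IEEE 754 double) into sign, biased exponent, fraction."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF       # 11-bit biased exponent
    fraction = bits & ((1 << 52) - 1)     # 52-bit stored mantissa (hidden leading 1)
    return sign, exponent, fraction

# 1.0 = (-1)^0 * 1.0 * 2^(1023 - 1023): sign 0, exponent 1023, fraction 0
print(fields_double(1.0))
print(fields_double(-2.0))  # sign 1, exponent 1024 (i.e. 2^1), fraction 0
```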
Commutativity and the additive inverse are OK in IEEE 754: a+b = b+a; a*b = b*a; a-a = 0 (less trivial than you may think). But these are the last good news...
- Addition is not associative: (a+b)+c != a+(b+c)
- The distributive law does not hold: (a+b)*c != a*c + b*c
- A multiplicative inverse may not exist: a*(1/a) != 1
- Most simple numbers in decimal notation are not mapped exactly
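All of these failures are easy to observe with doubles. A short sketch in Python (any IEEE 754 double behaves the same way, whatever the language):

```python
a, b = 0.1, 0.2

print((a + b) + 0.3 == a + (b + 0.3))            # False: addition is not associative
print((a + b) * 100.0 == a * 100.0 + b * 100.0)  # False: distributivity fails
print(49.0 * (1.0 / 49.0) == 1.0)                # False: no exact multiplicative inverse
print(a + b)                                      # 0.30000000000000004: 0.1 is not exact in base 2
```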
A Model
Instead of requiring |float(A op B) - (A op B)| = 0, we can only assume that |float(A op B) - (A op B)| <= u |A op B| (where op = +, -, *, / acts on single floating point numbers and u is the unit roundoff). We can use this model to predict which errors we might expect. For example, for the scalar product one finds (Golub-van Loan):
$$ \left|\,\mathrm{fl}\!\left(\sum_{k=1}^{N} x_k y_k\right) - \sum_{k=1}^{N} x_k y_k \,\right| \;\le\; N\,u \sum_{k=1}^{N} |x_k y_k| + O(u^2) $$
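The bound can be checked numerically. A minimal sketch in Python (rather than the course's Octave scripts), comparing a naive accumulation of the scalar product with `math.fsum`, which produces an accurately rounded sum of the rounded products:

```python
import math
import random

random.seed(1)
N = 10**5
x = [random.uniform(-1.0, 1.0) for _ in range(N)]
y = [random.uniform(-1.0, 1.0) for _ in range(N)]

# Naive left-to-right accumulation: one rounding error per addition.
naive = 0.0
for xk, yk in zip(x, y):
    naive += xk * yk

# math.fsum tracks partial sums exactly, giving an accurate reference value.
accurate = math.fsum(xk * yk for xk, yk in zip(x, y))

print(abs(naive - accurate))  # typically nonzero, but far below the worst-case bound
print(N * 2**-53 * math.fsum(abs(xk * yk) for xk, yk in zip(x, y)))  # the N*u*sum|x_k y_k| bound
```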
Simple exercises
- Exponential function [see my_exp.m]:
$$ e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!}, \qquad e^x = \lim_{n\to\infty}\left(1+\frac{x}{n}\right)^n \quad \text{[my\_exp\_seq.m]} $$
- Accumulating sums, e.g. the harmonic series:
$$ \sum_{k=1}^{N} \frac{1}{k} \;\simeq\; \ln N + \gamma_{\mathrm{Euler}} $$
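As a taste of the first exercise, here is a sketch in Python (the course's my_exp.m presumably does the analogous thing in Octave) of the Taylor series summed term by term. For moderate positive x it works well; for large negative x the large alternating terms cancel catastrophically and the relative error explodes:

```python
import math

def exp_taylor(x, terms=200):
    """Sum the Taylor series of e^x term by term."""
    term, total = 1.0, 1.0
    for n in range(1, terms):
        term *= x / n          # builds x^n / n! incrementally
        total += term
    return total

# Moderate positive x: relative error near machine precision.
print(abs(exp_taylor(5.0) - math.exp(5.0)) / math.exp(5.0))
# Large negative x: huge alternating terms (~1e12) cancel down to ~1e-13,
# so the accumulated rounding noise completely swamps the true result.
print(abs(exp_taylor(-30.0) - math.exp(-30.0)) / math.exp(-30.0))
```

A standard fix is to compute exp(30.0) and take the reciprocal, which avoids the cancellation entirely.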
Message: trying to understand precisely the origin of rounding errors is often frustrating and as hard as solving analytically the problem that we want to solve numerically. What we can do is check a posteriori:
- check the correctness against known exact results which have the same numerical difficulties;
- check consistency when changing conditions by negligible amounts (when you know that they should not matter: sometimes high sensitivity is physical);
- check consistency when changing the numerical precision of the operations.
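The last check can be done even without hardware support for several precisions. A sketch in Python, where the helper `to_single` (a name introduced here for illustration) simulates a single-precision accumulator by re-rounding after every addition; comparing it against the double-precision result for the harmonic series reveals how much of the answer is rounding noise:

```python
import math
import struct

def to_single(x):
    """Round a double to the nearest IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def harmonic(N, single=False):
    s = 0.0
    for k in range(1, N + 1):
        s += 1.0 / k
        if single:
            s = to_single(s)   # re-round the accumulator after every operation
    return s

N = 10**5
h64 = harmonic(N)
h32 = harmonic(N, single=True)
ref = math.log(N) + 0.5772156649015329   # asymptotic value: ln N + Euler's gamma
print(h64 - ref, h32 - ref)              # the two precisions visibly disagree
```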
The forward difference (where $f_{\pm k} \equiv f(x \pm kh)$):
$$ f^{(1)} = \frac{f_1 - f_0}{h} + O(h) $$
From the Taylor expansions
$$ f_{\pm 1} = f_0 \pm h f^{(1)} + \frac{h^2}{2} f^{(2)} \pm \frac{h^3}{3!} f^{(3)} + \frac{h^4}{4!} f^{(4)} + O(h^5) $$
$$ f_{\pm 2} = f_0 \pm 2h f^{(1)} + \frac{(2h)^2}{2} f^{(2)} \pm \frac{(2h)^3}{3!} f^{(3)} + \frac{(2h)^4}{4!} f^{(4)} + O((2h)^5) $$
one gets
$$ f_1 - f_{-1} = 2h f^{(1)} + \frac{1}{3} h^3 f^{(3)} + O(h^5), \qquad f_2 - f_{-2} = 4h f^{(1)} + \frac{8}{3} h^3 f^{(3)} + O(h^5) $$
and hence the central differences
$$ f^{(1)} = \frac{f_1 - f_{-1}}{2h} + O(h^2), \qquad f^{(1)} = \frac{8(f_1 - f_{-1}) - (f_2 - f_{-2})}{12h} + O(h^4) $$
However, smaller h and higher orders are not necessarily better: see the following example.
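The trade-off can be seen by sweeping h. A sketch in Python for the O(h^2) central difference applied to f = sin at x0 = 1: the truncation error shrinks like h^2, but the rounding error grows like u/h, so the total error has a minimum (around the cube root of the machine epsilon for this formula) and then grows again:

```python
import math

x0 = 1.0
exact = math.cos(x0)   # d/dx sin(x) at x0

for h in [1e-1, 1e-3, 1e-5, 1e-8, 1e-11, 1e-13]:
    approx = (math.sin(x0 + h) - math.sin(x0 - h)) / (2.0 * h)
    print(f"h = {h:.0e}   error = {abs(approx - exact):.2e}")
```

The error decreases down to h ~ 1e-5, then increases: for very small h the difference f_1 - f_{-1} is computed from nearly equal numbers, and cancellation destroys the result.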