Computational Physics I
Luigi Scorzato
Lecture 2: Floating Point Arithmetic

This lecture discusses floating point arithmetic and its limitations for computational physics simulations:
- Floating point numbers represent real numbers with a finite number of bits, which introduces rounding errors. The commonly used formats are the 32-bit and 64-bit formats of the IEEE 754 standard.
- Basic arithmetic operations such as addition and multiplication are reasonably accurate, but errors accumulate over many operations; associativity and distributivity do not hold exactly.
- Even simple decimal fractions cannot be represented exactly, and sums and derivatives lose accuracy when terms of very different magnitude are combined, because the mantissa of the smaller term is shifted out.
- A simple model assumes that the error of each floating point operation is bounded in proportion to its result; it predicts the errors to expect, and a posteriori consistency checks are used to validate results.



Computer memories are finite. This raises two questions: 1. How can we represent real numbers on a computer? 2. To what extent can such representations be trusted?

Representation
Computers usually assign 32 or 64 bits to each single number, and there are two main strategies for doing so:

Fixed Point (used for Integers):

n = (−1)^{a_0} (a_1·2^0 + a_2·2^1 + … + a_{M−1}·2^{M−2})

M is the number of bits available for a single number (typical choices are M = 32 or 64) and a_i = 0, 1. The largest representable value is n_max = 2^{M−1}, i.e. about 9.2·10^18 with 64 bits. This means that 9·10^18 and 9·10^18 + 1 are both represented and distinguishable, but 10^19 does not fit and overflows.
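As a quick illustration (a minimal Python/NumPy sketch; the course examples themselves are Octave scripts, so this is only an aside), the limits of a 64-bit signed integer can be checked directly:

    import numpy as np

    # Largest signed 64-bit integer: 2**63 - 1, roughly 9.2e18.
    n_max = np.iinfo(np.int64).max
    print(n_max)                 # 9223372036854775807
    print(n_max - 1 < n_max)     # True: nearby integers stay distinguishable

    # 10**19 does not fit into 64 bits: the conversion fails.
    try:
        np.int64(10**19)
    except OverflowError as err:
        print("overflow:", err)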

Floating Point (used for Reals, ...)

x = (−1)^s · 1.f · 2^{e − bias}

(IEEE 754 standard). Here s is 1 bit for the sign; f is the 23-bit (single precision) or 52-bit (double precision) mantissa, written with an implicit leading 1; and e is the 8-bit (single) or 11-bit (double) exponent. The length of the mantissa roughly determines the relative precision, and the length of the exponent determines the range. Both are stored as integers.
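To make the bit layout concrete, here is a small sketch (Python; the helper name double_bits is ad hoc, not part of the course material) that extracts the three fields of a 64-bit double:

    import struct

    def double_bits(x):
        """Return the (sign, exponent, fraction) bit fields of a 64-bit IEEE 754 double."""
        (bits,) = struct.unpack(">Q", struct.pack(">d", x))
        sign     = bits >> 63
        exponent = (bits >> 52) & 0x7FF       # 11 bits, biased by 1023
        fraction = bits & ((1 << 52) - 1)     # the 52 explicitly stored bits of f
        return sign, exponent, fraction

    s, e, f = double_bits(-6.25)
    # -6.25 = (-1)^1 * 1.5625 * 2^2, so the stored exponent is 2 + 1023
    print(s, e - 1023, 1.0 + f / 2**52)       # prints: 1 2 1.5625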

Limitations of Floating Point Arithmetic

Commutativity and the additive inverse are guaranteed by IEEE 754: a+b = b+a, a*b = b*a, a−a = 0 (less trivial than you may think). But this is the last piece of good news:
Addition is not associative: (a+b)+c != a+(b+c).
The distributive law does not hold: (a+b)*c != a*c + b*c.
The multiplicative inverse may not exist: a*(1/a) != 1.
Most simple numbers in decimal notation are not represented exactly.
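A minimal Python sketch (not from the lecture material, which uses Octave scripts) reproducing some of these failures in double precision:

    # Double-precision (64-bit) illustrations of the points above.
    a, b, c = 1.0, 1e-16, 1e-16
    print((a + b) + c == a + (b + c))   # False: addition is not associative

    print(0.1 + 0.2 == 0.3)             # False: 0.1, 0.2 and 0.3 are not exact in base 2

    # Integers n below 100 for which n * (1/n) is not exactly 1.
    print([n for n in range(1, 100) if n * (1.0 / n) != 1.0])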

A typical mechanism that produces errors is the shift of the mantissa when summing numbers of very different magnitude, e.g.:
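In double precision (a small Python sketch; 2^53 ≈ 9·10^15, so near 10^16 the spacing between representable numbers is 2):

    big, small = 1.0e16, 1.0

    # Aligning the two mantissas shifts the small addend entirely out of the
    # 53 significant bits of the larger one, so it is simply lost.
    print(big + small == big)      # True
    print((big + small) - big)     # 0.0, not 1.0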

A Model

Instead of float(A op B) − (A op B) = 0, we can only assume that

|float(A op B) − (A op B)| ≤ u · |A op B|   (where op = +, −, *, / acts on single floating point numbers).

We can use this model to predict which errors to expect. For example, for the scalar product one finds (Golub and Van Loan):
| fl( Σ_{k=1}^{N} x_k y_k ) − Σ_{k=1}^{N} x_k y_k | ≤ N·u · Σ_{k=1}^{N} |x_k y_k| + O(u^2)
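A quick numerical check of this bound (a Python/NumPy sketch, assuming float32 as the working precision and float64 as the reference; not part of the original material):

    import numpy as np

    rng = np.random.default_rng(0)
    N = 10_000
    x = rng.standard_normal(N)
    y = rng.standard_normal(N)

    reference = np.dot(x, y)                                # double precision
    computed  = float(np.dot(x.astype(np.float32),          # single precision
                             y.astype(np.float32)))

    u = np.finfo(np.float32).eps                            # machine epsilon of float32, used as u
    bound = N * u * np.sum(np.abs(x * y))                   # leading-order bound above

    print(abs(computed - reference), "<=", bound)           # error typically far below the bound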

Simple exercises

Exponential function from its Taylor series [see my_exp.m]:

e^x = Σ_{n=0}^{∞} x^n / n!

Exponential function as the limit of a sequence [see my_exp_seq.m]:

e^x = lim_{n→∞} (1 + x/n)^n

Accumulating sums, e.g. the harmonic series:

Σ_{k=1}^{N} 1/k ≈ ln N + γ_Euler   (for large N)
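As a sketch of the first exercise, here is a Python version of what my_exp.m plausibly does (the actual file is not reproduced in these notes, so this is only an approximation of it):

    import math

    def my_exp(x, nmax=200):
        """Sum the Taylor series of exp(x) term by term."""
        term, total = 1.0, 1.0
        for n in range(1, nmax + 1):
            term *= x / n            # builds x**n / n! incrementally
            total += term
        return total

    # Accurate for moderate positive x, but for large negative x the huge
    # alternating terms cancel and rounding errors dominate the tiny result.
    for x in (1.0, 10.0, -10.0, -30.0):
        approx, exact = my_exp(x), math.exp(x)
        print(f"x={x:6.1f}  series={approx: .6e}  exp={exact: .6e}  "
              f"rel.err={abs(approx - exact) / exact:.1e}")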

Message:

Trying to understand precisely the origin of rounding errors is often frustrating and as hard as solving analytically the problem that we want to solve numerically. What we can do is check a posteriori:
check the correctness against known exact results that have the same numerical difficulties;
check consistency when changing conditions by negligible amounts, when you know they should not matter (sometimes high sensitivity is physical);
check consistency when changing the numerical precision of the operations.

Computing the derivative of a function


Notation: f_n = f(t_0 + n·h), and f^{(k)} denotes the k-th derivative of f at t_0.

Naive 1st derivative (order 1):

f^{(1),order=1} = (f_1 − f_0)/h = f^{(1)} + O(h)

One can do better (remember Taylor):


f_{−1} = f_0 − h f^{(1)} + (h^2/2) f^{(2)} − (h^3/3!) f^{(3)} + (h^4/4!) f^{(4)} + O(h^5)
f_{−2} = f_0 − 2h f^{(1)} + ((2h)^2/2) f^{(2)} − ((2h)^3/3!) f^{(3)} + ((2h)^4/4!) f^{(4)} + O((2h)^5)
f_1 − f_{−1} = 2h f^{(1)} + (1/3) h^3 f^{(3)} + O(h^5)
f_2 − f_{−2} = 4h f^{(1)} + (8/3) h^3 f^{(3)} + O(h^5)
f^{(1),o=2} = (f_1 − f_{−1}) / (2h) = f^{(1)} + O(h^2)
f^{(1),o=4} = [8(f_1 − f_{−1}) − (f_2 − f_{−2})] / (12h) = f^{(1)} + O(h^4)

However, smaller h and higher orders are not necessarily better, as the following example shows.

Exercise: write a program that computes the derivative of sin(ω x) for:
different orders of approximation;
different values of h;
different values of ω.


Compare with [numdiff.m]
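The course compares against numdiff.m; as a rough stand-in (a Python sketch with ω = 1, i.e. f(x) = sin(x), purely for illustration, not the course's own script), here are the two central-difference formulas and the trade-off between truncation and rounding error as h shrinks:

    import math

    def deriv(f, x0, h, order=2):
        """Central finite differences of order 2 or 4 (the formulas derived above)."""
        if order == 2:
            return (f(x0 + h) - f(x0 - h)) / (2 * h)
        if order == 4:
            return (8 * (f(x0 + h) - f(x0 - h)) - (f(x0 + 2 * h) - f(x0 - 2 * h))) / (12 * h)
        raise ValueError("order must be 2 or 4")

    # d/dx sin(x) at x0 = 1.0 is cos(1.0). Shrinking h first reduces the truncation
    # error, but below some threshold the rounding error of f_1 - f_{-1} dominates.
    x0, exact = 1.0, math.cos(1.0)
    for h in (1e-1, 1e-3, 1e-5, 1e-7, 1e-9, 1e-11):
        err2 = abs(deriv(math.sin, x0, h, order=2) - exact)
        err4 = abs(deriv(math.sin, x0, h, order=4) - exact)
        print(f"h={h:.0e}   err(order 2)={err2:.2e}   err(order 4)={err4:.2e}")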
