Scientific Programming
Steve Summit
C was not used very much for floating-point work at first (i.e. back when it was still
being developed and refined), so its support for some aspects of floating-point and
scientific work is a little weak. (On the other hand, C's support for string processing and
other data structures can make up the difference in a program that does some of each.)
There are two significant areas of concern. One is specific to C, and concerns the
handling of multidimensional arrays. The other is endemic in floating-point work done on
computers, and concerns accuracy, precision, and round-off error.
Multidimensional Arrays
C's difficulties with multidimensional arrays all stem from what happens when arrays are
passed to functions, so let us start with the one-dimensional case. When we pass an
ordinary one-dimensional array to a function, declaring the function as
f(double a[])
{
...
}
the size of the array need not be specified, because the function is not allocating the array
and does not need to know how big it is. The array is actually allocated in the caller (and
in fact, only a pointer to the beginning of the array is passed to a function like f). If, in
the function, we need to know the number of elements in the array, we have to pass it as a
second parameter:
f(double a[], int n)
{ ... }
If we wish to emphasize that it is really a pointer which is being passed, we can write it
like this:
f(double *a, int n)
{ ... }
(In fact, if we declare the parameter as an array, the compiler pretends we declared it as a
pointer, as in this last form.)
(If, on the other hand, we had to declare the array parameter with an explicit size, as in
f(double a[10])
{ ... }
then we might find it a real nuisance that f could only operate on arrays of size 10; we'd
hate to have to write several different versions of f, one for each size.)
Multidimensional arrays are where the real difficulty begins. Suppose you had a
two-dimensional array
double a2[10][20];
which you wanted to pass to a function g, what would g receive? We could declare g as
g(double a[10][20])
{ ... }
and let the compiler worry about it--that is, let it ignore the dimensions it doesn't need,
and pretend that g really receives a pointer. But what pointer does it receive? It's easy to
imagine that g could also be declared as
g(double **a) /* WRONG */
{ ... }
but this is incorrect. Notice that the array a2 that we're trying to pass to g is not really a
multidimensional array; it is an array of arrays (more precisely, a one-dimensional array
of one-dimensional arrays). So in fact, what g receives is a pointer to the first element of
the array, but the first element of the array is itself an array! Therefore, g receives a
pointer to an array. Its actual declaration looks like this:
g(double (*a)[20])
{ ... }
(and this is what the compiler pretends we'd written if we instead write a[10][20] or
a[][20]). This says that the parameter a is a pointer to an array of 20 doubles. We can still
use normal-looking subscripts: a[i][j] is the j'th element in the i'th array pointed to by
a. The problem is that no matter how we declare g and its array parameter a, we have to
specify the second dimension. If we write
g(double a[][])
the compiler complains that a needed dimension is missing. If we write
g(double (*a)[])
the compiler complains that a needed dimension is missing. We have to specify the
``width'' of the array, and if we've got various sizes of arrays running around in our
program, we're back with the problem of potentially needing different functions to deal
with different sizes of arrays.
Why wouldn't the compiler be able to get along without the ``width''? Because to locate
the element a[i][j], it has to skip over i complete rows, and it can only do that if it
knows how many elements each row contains.
A second problem arises if you need any temporary arrays inside a function. Often, a
temporary array needs to be the same size as the passed-in arrays being worked on.
Unfortunately, there's no way to declare an array and say ``just make it as big as the
passed-in array.''
It's easy to imagine a solution to all of the problems we've been discussing: what we'd
really like is to let array dimensions come from other function parameters, both for a
multidimensional array parameter itself and for any local, temporary arrays.
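A sketch of the sort of declarations one might wish for (the function name func and its
parameter order are invented here for illustration):

func(int nrows, int ncols, double a[nrows][ncols])	/* not accepted by most compilers */
{
	double tmp[nrows][ncols];	/* a local temporary the same size as a */
	...
}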
(It is worth noting that at least one popular compiler, namely gcc, does let you do these
things: parameter and local arrays can take their dimensions from parameters. If you
make use of this extension, just be aware that your program won't work under other
compilers.)
How can we work around these limitations? The most powerful and flexible solution
involves abandoning multidimensional arrays and using pointers to pointers instead. This
approach has the advantage that all dimensions may be specified at run time, though it
also has a few disadvantages. (Namely: you may have to remember to free the arrays,
especially if they're local; the intermediate pointers take up more space; and accessing
``array'' elements via pointers-to-pointers may be slightly less efficient than true subscript
references. On the other hand, on some machines, pointer accesses are more efficient.
Another disadvantage is that our treatise on scientific programming in C is about to
digress into a treatise on memory allocation.)
To illustrate, here is a pair of functions for allocating simulated one- and two-dimensional
arrays of double. (By ``simulated'' I mean that these are not true arrays as the compiler
knows them, but because of the way arrays and pointers work in C, we are going to be
able to treat these pointers as if they were arrays, by using [] array subscript notation.)
#include <stdio.h>
#include <stdlib.h>
double *alloc1d(int n)
{
	double *dp = (double *)malloc(n * sizeof(double));
	if(dp == NULL) {
		fprintf(stderr, "out of memory\n");
		exit(1);
	}
	return dp;
}
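The two-dimensional allocator can be built on top of alloc1d: allocate an array of row
pointers, then use alloc1d to allocate each row. A sketch of such an implementation (the
signature alloc2d(nrows, ncols) is an assumption here):

double **alloc2d(int nrows, int ncols)
{
	int i;
	double **dpp = (double **)malloc(nrows * sizeof(double *));
	if(dpp == NULL) {
		fprintf(stderr, "out of memory\n");
		exit(1);
	}
	for(i = 0; i < nrows; i++)
		dpp[i] = alloc1d(ncols);	/* allocate one row of ncols doubles */
	return dpp;
}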
These routines print an error message and exit if they can't allocate the requested
memory. Another approach would be to have them return NULL, and have the caller check
the return value and take appropriate action on failure. (In this case, alloc2d would want
to ``back out'' by freeing any intermediate pointers it had successfully allocated.)
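A sketch of what such a variant might look like (the name alloc2d_or_null is invented
here; the rest of this section sticks with the exit-on-failure versions):

double **alloc2d_or_null(int nrows, int ncols)
{
	int i, j;
	double **dpp = (double **)malloc(nrows * sizeof(double *));
	if(dpp == NULL)
		return NULL;
	for(i = 0; i < nrows; i++) {
		dpp[i] = (double *)malloc(ncols * sizeof(double));
		if(dpp[i] == NULL) {
			/* back out: free the rows we did manage to allocate */
			for(j = 0; j < i; j++)
				free((void *)dpp[j]);
			free((void *)dpp);
			return NULL;
		}
	}
	return dpp;
}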
Note that these routines allocate arrays of type double. They could be trivially rewritten
to allocate arrays of other types. (Note, however, that it is impossible in portable,
standard C to write a routine which generically allocates a multidimensional array of
arbitrary type, or with an arbitrary number of dimensions.)
Since these simulated arrays are allocated dynamically, we must also remember to free
them. Freeing the two-dimensional version takes a little care: each row must be freed
before the array of row pointers itself. The deallocator might look like this:

void free2d(double **dpp, int nrows)
{
	int i;
	for(i = 0; i < nrows; i++)
		free((void *)dpp[i]);	/* free each row */
	free((void *)dpp);		/* then free the array of row pointers */
}
(The casts in the malloc and free calls in these routines are not necessary under ANSI
C, but are included to make it easier to port this code to a pre-ANSI compiler, if
necessary.)
With these routines in place, we can dynamically allocate one- and two-dimensional
arrays of double to our heart's content. Here is an (artificial) example:
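(The example below is only a sketch; it assumes the alloc2d(nrows, ncols) signature used
above, and it needs <math.h> for sqrt.)

#include <math.h>

void example(void)
{
	int i, j;
	double *a = alloc1d(10);	/* a simulated 1-D array of 10 doubles */
	double **b = alloc2d(10, 20);	/* a simulated 10 x 20 array */

	for(i = 0; i < 10; i++)
		a[i] = sqrt((double)i);

	for(i = 0; i < 10; i++)
		for(j = 0; j < 20; j++)
			b[i][j] = a[i] * j;	/* ordinary-looking subscripts work fine */
}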
One thing to keep in mind, however, is that because a function declared as
g1(double a[10][20])
is not compatible with
g2(double **a)
we will have to maintain a segregation between arrays-of-arrays and pointers-to-pointers.
We can't write a single function that operates on either a true multidimensional array or
one of our pointer-to-pointer, simulated multidimensional arrays. (Actually, there is one
way, but it's more elaborate, and not strictly portable.)
That's about enough (for now, at least) about multidimensional arrays. We now turn to
the subject of...
Floating-Point Precision
It's the essence of real numbers that there are infinitely many of them: not only infinitely
large (in both the positive and negative direction), but infinitely small and infinitely fine.
Between any two real numbers like 1.1 and 1.2 there are infinitely many more numbers:
1.11, 1.12, etc., and also 1.101, 1.102, etc., and also 1.1001, 1.1002, etc.; and etc. There is
no way in a finite amount of space to represent the infinite granularity of real numbers.
Most computers use an approximation called floating point. In a floating point
representation, we keep track of a base number or ``mantissa'' with some degree of
precision, and a magnitude or exponent. The parallel here with scientific notation is
close: The number actually represented is something like mantissa x 10^exponent.
(Actually, most computers use mantissa x 2^exponent, and we'll soon see how this
difference can be important.)
The problem is that error can accumulate. In the worst case, you can lose one (or more
than one) significant digit in each step of a calculation. If a calculation involves six steps,
and you only have six significant digits to play with, by the end of the calculation, you've
got nothing of significance left. As a wise programmer once said, ``Floating point
numbers are like piles of sand. Every time you move them around, you lose a little sand.''
How can an error grow to involve more than the least-significant digit? Here is one
simple example. Suppose we are computing the sum 1+1+1+1+1+1+1+1+1+1+1000000,
and we have only six significant digits to play with. Clearly the answer is 1000010,
which has six significant digits, so we're okay. But suppose we compute it in the other
order, as 1000000+1+1+1+1+1+1+1+1+1+1. In the first step, 1000000+1, the
intermediate result is 1000001, which has seven significant digits and so cannot be
represented. It will probably be rounded back down to 1000000. So
1000000+1+1+1+1+1+1+1+1+1+1 = 1000000, and 1+1+1+1+1+1+1+1+1+1+1000000 !=
1000000+1+1+1+1+1+1+1+1+1+1. In other words, the associative law does not hold
for floating-point addition. Elementary school students often wonder whether it makes a
difference which order you add the column of numbers up in (and sometimes notice that
it does, i.e. they sometimes make mistakes when they try it). But here we see that, on
modern digital computers, the folklore and superstition are true: it really can make a
difference which order you add the numbers up in.
If you don't believe the above, I invite you to try this little program:
#include <stdio.h>

#define BIG 1e8

main()
{
	int i;
	float f = 0;

	f = f + BIG;
	for(i = 0; i < 100; i++)
		f = f + 1;
	printf("%f\n", f);	/* big number added first */

	f = 0;
	for(i = 0; i < 100; i++)
		f = f + 1;
	f = f + BIG;
	printf("%f\n", f);	/* big number added last */

	return 0;
}
You may have to play with the value of BIG (other values to try are 1e6, 1e7, or 1e9), but
you should probably be able to get a surprising result.
Because of the finite precision of floating point numbers on computers, you also have to
be careful when subtracting large numbers, as the result may not be significant at all. If
we try to compute the difference 100000789 - 100000123 (again using only six
significant digits), we are not likely to get 666, because neither 100000789 nor
100000123 are representable in six digits in the first place. This may not seem surprising,
but what if the difference we were really interested in, and thought was significant, was
789 - 123, with that bias of 100000000 being an artifact of the algorithm which we
assumed would cancel out? In this case, it does not cancel out, and in general we cannot
assume that
(a + b) - (c + b) = a - c
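Here is a small demonstration of the effect (a sketch; on a machine with IEEE-style
single-precision float, the two results typically differ):

#include <stdio.h>

int main(void)
{
	float a = 789, b = 100000000, c = 123;
	float apb = a + b;	/* storing forces rounding to float precision */
	float cpb = c + b;

	printf("(a+b) - (c+b) = %g\n", apb - cpb);	/* typically not 666 */
	printf("a - c         = %g\n", a - c);		/* exactly 666 */
	return 0;
}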
Another thing to know about floating point numbers on computers is that, as mentioned
above, they're usually represented internally in base 2. It's well known that in base 10,
fractions like 1/3 have infinitely-repeating decimal representations. (In fact, if you
truncate them, you can easily find anomalies like the one we discussed in the previous
paragraphs, such as that 0.333+0.333+0.333 is 0.999, not 1.) What's not so well known is
that in base 2, the fraction one tenth is an infinitely-repeating binary number: it's
0.00011001100110011... (base 2), where the 0011 part keeps repeating. This means
that almost any number that you thought was a ``clean'' decimal fraction like 0.1 or 2.3 is
probably represented internally as a messy infinite binary fraction which has been
truncated or rounded and hence is not quite the number you thought it was. (In fact,
depending on how carefully routines like atof and printf have been written, it's
possible to have something like printf("%f", atof("2.3")) print an ugly number like
2.299999.) Where you have to watch out for this is that if you multiply a number by 0.1,
it is not quite the same as dividing it by 10. Furthermore, if you multiply a number by 0.1
three times, it may be different than if you'd multiplied it by 0.001 and different than if
you'd divided it by 1000.
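A quick way to see this is to print a ``clean'' constant to more places than usual and to
compare sums that ought, mathematically, to be equal (a small illustrative program; the
exact digits printed vary with the machine's floating-point format):

#include <stdio.h>

int main(void)
{
	double sum = 0.1 + 0.1 + 0.1;

	printf("0.1 to 20 places: %.20f\n", 0.1);	/* not exactly one tenth */
	printf("0.1 + 0.1 + 0.1 = %.20f\n", sum);
	printf("0.3             = %.20f\n", 0.3);
	if(sum != 0.3)
		printf("the sum does not compare equal to 0.3\n");
	return 0;
}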
My purpose in bringing up these anomalies is not to scare you off of doing floating point
work forever, but to drive home the point that floating point inaccuracies can be
significant, and cannot be brushed off as simple uncertainty about the least-significant
digit. With care, the inaccuracies can be confined to the least-significant digit, and the
care involves making sure that errors do not proliferate and compound themselves. (Also,
I should again point out that none of these problems are specific to C; they arise any time
floating-point work is done with finite precision.)
In practice, anomalies crop up when both large and small numbers participate in the same
calculation (as in the 1+1+1+1+1+1+1+1+1+1+1000000 example), and when large
numbers are subtracted, and when calculations are iterated many times. One thing to
beware of is that many common algorithms magnify the effects of these problems. For
example, the definition of the standard deviation is the square root of the variance (the
mean of the squared deviations from the mean), namely
sqrt(sum(sqr(x[i] - mean)) / n)
where sum() and sqr() are summing and squaring operators (neither of which are
standard C library functions, of course) and where mean is
sum(x[i]) / n
(The statisticians out there know that the denominator in the first expression is not
necessarily n but should usually be n-1.)
Computing standard deviations from the definition can be a bit of a nuisance because you
have to walk over the numbers twice: once to compute the mean, and a second time to
subtract it from each input number, square the difference, and sum the squares. So it's
quite popular to rearrange the expression to
sqrt((sum(sqr(x[i])) - sqr(sum(x[i]))/n) / n)
which involves only the sum of the x's and the sum of the squares of the x's, and hence
can be completed in one pass. However, it turns out that this method can in practice be
significantly less accurate. By squaring each individual x and accumulating
sum(sqr(x[i])), the big numbers get bigger and the small numbers get smaller, so the
small numbers are more likely to get lost in the underflow. When you use the definition,
it's only the differences that get squared, so less of significance gets lost. (However, even
working from the definition, if the individual variances sum to near zero they can
sometimes end up summing to a slightly negative number, due to accumulated error,
which is a problem when you try to take the square root.)
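To make the two approaches concrete, here is a sketch of both computations (the function
names are invented for this illustration, and the divisor n is used throughout rather than
n-1):

#include <math.h>

/* two-pass method: compute the mean first, then sum the squared differences */
double stddev_twopass(const double x[], int n)
{
	int i;
	double mean = 0, sumsq = 0;

	for(i = 0; i < n; i++)
		mean += x[i];
	mean /= n;

	for(i = 0; i < n; i++)
		sumsq += (x[i] - mean) * (x[i] - mean);

	return sqrt(sumsq / n);
}

/* one-pass method: accumulate the sum and the sum of squares, then rearrange;
 * algebraically equivalent, but often numerically worse */
double stddev_onepass(const double x[], int n)
{
	int i;
	double sum = 0, sumsq = 0;

	for(i = 0; i < n; i++) {
		sum += x[i];
		sumsq += x[i] * x[i];
	}

	return sqrt((sumsq - sum * sum / n) / n);
}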
Here are a few particular situations in which these problems tend to crop up.
The quadratic formula
If b^2 and 4ac are both large, there may be very few significant digits left in the
discriminant b^2 - 4ac.
Iterated simulations
If a calculation is iterated over many steps, a tiny error introduced at each step can
compound into a substantial one by the end.
Trigonometric reductions
The only sane ways of computing trig functions numerically (i.e. the ways likely to be
used by library functions such as sin, cos, and tan) work only for the one period of the
function near 0. These functions will of course work on arguments outside the interval
(-pi, pi), but the function must first reduce the arguments by subtracting some multiple of
2*pi, and as we've seen, taking the difference of two large numbers (whether you do it or
the library function does it) can lose all significance.
Compounded interest
The amount after n periods is
principal x (1 + rate)^n
but raising a number near 1 to a large power can be inaccurate.
Decimal input conversion
[okay, so this is a computer science problem, not a mathematical problem] If you handle
digits to the right of the decimal point by repeatedly multiplying by 0.1, you'll lose accuracy.
For some of these problems, using double instead of single precision makes the
difference, but for others, great care and rearrangement of the algorithms may be required
for best accuracy.
Returning to C, you might like to know that type float is guaranteed to give you the
equivalent of at least 6 decimal digits of significance, and double (which is essentially
C's version of the ``double precision'' mentioned in the preceding paragraph) is
guaranteed to give you at least 10. Both float and double are guaranteed to handle
exponents in at least the range -37 to +37.
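If you want to know exactly what your implementation provides, the standard header
<float.h> records these limits. A small illustrative program:

#include <stdio.h>
#include <float.h>

int main(void)
{
	/* decimal digits of precision for float and double */
	printf("FLT_DIG = %d, DBL_DIG = %d\n", FLT_DIG, DBL_DIG);

	/* approximate base-10 exponent ranges */
	printf("float exponent range:  %d to %d\n", FLT_MIN_10_EXP, FLT_MAX_10_EXP);
	printf("double exponent range: %d to %d\n", DBL_MIN_10_EXP, DBL_MAX_10_EXP);
	return 0;
}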
One other point relating to numerical work and specific to C is that C does not have a
built-in exponentiation operator. It does have the library function pow, which is the right
thing to use for general exponents, but don't use it for simple squares or cubes. (Use
simple multiplication instead.) pow is often implemented using the identity
x^y = exp(y * log(x))
which for a small integer exponent is likely to be both slower and less accurate than a
multiplication or two.
One more point: when a floating-point value is converted to an integer in C, the fractional
part is simply discarded (the value is truncated toward zero), not rounded.
The fact that a floating-to-int conversion truncates the fractional part has nothing to do,
by the way, with the problem mentioned earlier of a poor-quality floating-point processor
not rounding well. (Back there we were talking about rounding an intermediate result
which was still trying to be represented as a floating-point number.) When you do want to
round a number off to the nearest integer, the usual trick is
(int)(x + 0.5)
By adding 0.5 before truncating the fractional part, you arrange that numbers with
fractions of .5 or above get rounded up. (This trick obviously doesn't handle negative
numbers correctly, nor does it do even/odd rounding.)
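If negative numbers matter, one common fix is to add or subtract 0.5 depending on the
sign before truncating (a sketch; the name round_nearest is invented here, and this still
isn't even/odd rounding):

int round_nearest(double x)
{
	if(x >= 0)
		return (int)(x + 0.5);
	else
		return (int)(x - 0.5);	/* round negative values away from zero at .5 */
}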
***
I am not a numerical analyst, so this presentation has been somewhat superficial. On the
other hand, I've tried to make it practical and accessible and to highlight the real-world
problems which do come up in practice but which too few programmers are aware of.
You can learn much more (including the traditional notations for describing and
analyzing floating point accuracy, and some clever algorithms for doing floating point
arithmetic without losing so much accuracy) in the following references.
David Goldberg, ``What Every Computer Scientist Should Know about Floating-Point
Arithmetic,'' ACM Computing Surveys, Vol. 23 #1, March, 1991, pp. 5-48.
Brian W. Kernighan and P.J. Plauger, The Elements of Programming Style, Second
Edition, McGraw-Hill, 1978, ISBN 0-07-034207-5, pp. 115-118.