Number Representation
The representation of numbers in computers is usually not based on the decimal system,
and therefore it is necessary to understand how to convert between different systems of
representation. Computers also have a limited amount of bits for representing numbers,
and for this reason numbers with arbitrarily large magnitude cannot be represented nor can
floating-point numbers be represented with arbitrary precision. In this chapter, we define
the key concepts of number representation in computers and discuss the main sources of
errors in numerical computing.
Integer part
In the decimal system, the digits 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9 are used for the representation
of numbers. The individual digits in a number, such as 37294, are the coefficients of
powers of 10:

37294 = 4 + 9 × 10 + 2 × 10^2 + 7 × 10^3 + 3 × 10^4
Fractional part
A number between 0 and 1 is represented by a string of digits to the right of a decimal
point. The individual digits represent the coefficients of negative powers of 10. In general,
we have the formula

(0.b_1 b_2 b_3 …)_{10} = b_1 × 10^{−1} + b_2 × 10^{−2} + b_3 × 10^{−3} + …

For example,

(0.7217)_{10} = 7 × 10^{−1} + 2 × 10^{−2} + 1 × 10^{−3} + 7 × 10^{−4}
0.1.2 Base β numbers
In computers, the number 10 is generally not used as the base for number representation.
Instead, other systems, such as the binary (2 as the base), the octal (8 as the base), and the
hexadecimal (16 as the base) systems are commonly used in computing. In the general
case, the base is denoted by β.
The separator between the integer and fractional parts is called the radix point, since the
term decimal point is reserved for base-10 numbers.
For example, in the octal representation, the individual digits before the radix point refer
to increasing powers of 8:
(21467)_8 = 7 + 6 × 8 + 4 × 8^2 + 1 × 8^3 + 2 × 8^4
          = 7 + 8(6 + 8(4 + 8(1 + 8 · 2)))
          = 9015
To convert this number to the decimal system, we build the nested form shown above by
repeatedly factoring out the base, and then evaluate the expression by replacing each of
the numbers by its representation in the decimal system. We will soon discuss the
conversion between different bases in more detail.
A number between 0 and 1, when expressed in the octal system, is given by the coefficients
of negative powers of 8. For example,

(0.36)_8 = 3 × 8^{−1} + 6 × 8^{−2} = (0.46875)_{10}
In order to convert a number N to the number system with base β, we first write N in its
nested form and then replace each of the numbers on the right by its representation in
β-arithmetic. The replacement requires a table showing how each of the numbers is
represented in the β-system. In addition, a base-β multiplication table may be required.
As an example, consider the conversion of the decimal number 3781 to binary form.
The following table can be used for the replacement of base-10 numbers by their binary
counterparts:
DEC BIN
0 0
1 1
2 10
3 11
4 100
5 101
6 110
7 111
8 1000
9 1001
10 1010
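In the nested form, the decimal number is first written as

3781 = 1 + 10 × (8 + 10 × (7 + 10 × 3))

and each of the numbers 1, 3, 7, 8 and 10 on the right is then replaced by its binary
counterpart from the table, after which the multiplications and additions are carried out
in binary arithmetic.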
This calculation is easy for computers but quite tedious for humans. Therefore, another
procedure can be used for hand calculations. Taking the conversion from the decimal to
the binary system as an example, we repeatedly divide by 2 until the quotient becomes 0.
The converted number is then given by the remainders of the divisions in reverse order.
For example, in the table below each number in the left-hand column is obtained by
dividing the previous number by 2 and taking the integer part (the quotient); the
corresponding remainder is listed in the right-hand column. Reading the remainders from
bottom to top, the converted number is (111011000101)_2.
Quotients Remainders
3781
1890 1
945 0
472 1
236 0
118 0
59 0
29 1
14 1
7 0
3 1
1 1
0 1
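The repeated-division procedure is also easy to program. Below is a small C sketch (the
helper name print_binary is chosen only for this illustration) that generates the remainders
and prints them in reverse order:

#include <stdio.h>

/* Convert a positive integer to binary by repeated division by 2.
   The remainders appear least significant bit first, so they are
   collected in a buffer and printed in reverse order.              */
static void print_binary(unsigned int n)
{
    char bits[sizeof(n) * 8];
    int m = 0;

    while (n > 0) {
        bits[m++] = (char)('0' + n % 2);   /* remainder = next binary digit   */
        n /= 2;                            /* quotient continues the process  */
    }
    while (m > 0)
        putchar(bits[--m]);                /* print remainders in reverse     */
    putchar('\n');
}

int main(void)
{
    print_binary(3781);    /* prints 111011000101 */
    return 0;
}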
Since carrying out the divisions in binary arithmetic is not straightforward, we look for an
alternative way of doing the conversion such that the arithmetic can be carried out in the
decimal system.
Suppose that x is in the range 0 < x < 1 and is given in the base-β representation

x = Σ_{k=1}^{∞} c_k β^{−k} = (0.c_1 c_2 c_3 …)_β

Observe that

β x = (c_1 . c_2 c_3 c_4 …)_β

because we only need to multiply by β to shift the radix point.
Thus the unknown digit c_1 can be obtained as the integer part of βx, denoted I(βx).
Denote the fractional part of βx by F(βx). The process can be repeated until all the
unknown digits c_k have been determined:

d_0 = x
d_1 = F(β d_0),   c_1 = I(β d_0)
d_2 = F(β d_1),   c_2 = I(β d_1)
...
For example, here we repeatedly multiply by 2 and remove the integer parts to get
(0.372)_{10} = (0.010111 …)_2.
Products Digits
0.372
0.744 0
1.488 1
0.976 0
1.952 1
1.904 1
1.808 1
etc.
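The repeated-multiplication procedure can be programmed in the same spirit. The following
C sketch (with the illustrative helper name print_binary_fraction) extracts one binary
digit per multiplication:

#include <stdio.h>

/* Print the first ndigits binary digits of a fraction 0 < x < 1.
   Each step multiplies by 2: the integer part I(2x) is the next digit,
   and the fractional part F(2x) is carried to the next step.           */
static void print_binary_fraction(double x, int ndigits)
{
    printf("0.");
    for (int k = 0; k < ndigits; k++) {
        x *= 2.0;
        int digit = (int)x;    /* integer part of 2x   */
        printf("%d", digit);
        x -= digit;            /* keep fractional part */
    }
    printf("\n");
}

int main(void)
{
    print_binary_fraction(0.372, 6);   /* prints 0.010111 */
    return 0;
}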
To include negative numbers as well, we must assign a separate sign bit. The first bit of
the string is the sign bit, which is zero for positive numbers and one for negative numbers.
The most often used method for obtaining the representation for negative numbers is a
method called two’s complement:
In the binary representation of a positive number x, invert all the bits (0 ↔ 1),
and add 1 to get the binary representation of −x.
For example, if we have eight bits for the representation (of which one is for the sign and
seven for the digits), then
+2 = 0000 0010
+1 = 0000 0001
0 = 0000 0000
−1 = 1111 1111
−2 = 1111 1110
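The rule can be checked directly in C: inverting the bits of x and adding 1 indeed produces
−x. The following small sketch uses an 8-bit integer type so that the bit pattern matches
the table above:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int8_t x = 2;
    int8_t neg = (int8_t)(~x + 1);   /* invert all bits and add 1: two's complement */

    /* Prints: -2  bits: FE   (FE hex = 1111 1110, as in the table above) */
    printf("%d  bits: %02X\n", neg, (unsigned)(uint8_t)neg);
    return 0;
}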
The largest representable number in an n-bit system is 2^{n−1} − 1. For example, a 32-bit
integer can have values between

−2^{31} = −2147483648   and   2^{31} − 1 = 2147483647

In the minimum value, the sign bit has been interpreted as part of the number.
Most compilers do not give error messages when the range of integer numbers is exceeded,
except in some obvious situations. As an example, consider the following C code.
#include <stdio.h>

int main(void)
{
    short int si;
    int i, k, j1, j2, j3, j4;

    si = 1; i = 1;
    /* Double a short int and an int side by side; the short wraps around
       first (at 2^15), the int later (at 2^31), as the output shows.      */
    for (k = 1; k <= 32; k++) {
        si = si * 2;
        i = i * 2;
        fprintf(stdout, "%d %d %d\n", k, si, i);
    }

    j1 = -2147483647;
    fprintf(stdout, "\n%d\n", j1);
    j2 = -2147483648;
    fprintf(stdout, "\n%d\n", j2);
    j3 = 2147483647;
    fprintf(stdout, "\n%d\n", j3);
    j4 = 2147483648;
    fprintf(stdout, "\n%d\n", j4);

    return 0;
}
COMPILING:
OUTPUT:
1 2 2
2 4 4
3 8 8
4 16 16
5 32 32
6 64 64
7 128 128
8 256 256
9 512 512
10 1024 1024
11 2048 2048
12 4096 4096
13 8192 8192
14 16384 16384
15 -32768 32768
16 0 65536
17 0 131072
18 0 262144
19 0 524288
20 0 1048576
21 0 2097152
22 0 4194304
23 0 8388608
24 0 16777216
25 0 33554432
26 0 67108864
27 0 134217728
28 0 268435456
29 0 536870912
30 0 1073741824
31 0 -2147483648
32 0 0
-2147483647
-2147483648
2147483647
-2147483648
0.2.2 Floating-point numbers
In a computer, there are no real numbers or rational numbers (in the sense of the mathematical
definition); all noninteger numbers are represented with finite precision.
For example, the numbers π or 1/3 cannot be represented exactly (in any representation),
and in some representations the numbers 1.00000000 and 1.00000001 are represented by
the same value.

A floating-point number x is stored in the form

x = s × B^{c−E} × M

where s is the sign, M is the mantissa, c is the stored exponent, B is the base of the
representation (usually B = 2), and E is the bias of the exponent (a fixed integer constant
for a given machine and representation which enables representing negative exponents
without a separate sign bit).
In the decimal system, this corresponds to the normalized floating-point form

x = ±(0.d_1 d_2 d_3 …)_{10} × 10^n,   d_1 ≠ 0, n an integer
Machine numbers
A floating-point number system within a computer is always limited by the finite word
length of the computer. This means that only a finite number of digits can be represented. As
a consequence, numbers that are too large or too small cannot be represented. Moreover,
most real numbers cannot be represented exactly in a computer. For example,
1/10 = (0.1)_{10} = (0.0 0011 0011 0011 0011 0011 …)_2
The effective number system for a computer is not a continuum but a discrete set of
numbers called the machine numbers.
In a normalized representation (used in most computers), the bit patterns are as "left
shifted" as possible, corresponding to the "1-plus" form of the mantissa (this form doesn’t
waste any bits but uses the correct exponent such that there are no leading zero bits in
the mantissa). The machine numbers in the normalized representation are not uniformly
spaced; they are distributed unevenly about zero.
For example, if we list all floating-point numbers which can be expressed in the following
normalized form
x = ±(0.1 b_2 b_3)_2 × 2^{±k}     (k, b_i ∈ {0, 1})
where we have only two bits for the mantissa (b2 and b3 ) and one bit for the exponent
(k), we get the set of numbers shown in Fig. 2. This shows the phenomenon known as
the hole at zero. There is a relatively wide gap between the smallest positive number and
zero, that is (0.100)2 × 2−1 = 1/4.
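The same set of numbers can be generated by brute force. The following C sketch (our own
illustration of this toy system, not of any standard format) enumerates all values
±(0.1 b_2 b_3)_2 × 2^{±k} and makes the uneven spacing and the gap around zero visible:

#include <stdio.h>

int main(void)
{
    const double scale[3] = { 0.5, 1.0, 2.0 };      /* 2^-1, 2^0, 2^+1 */

    for (int sign = -1; sign <= 1; sign += 2)
        for (int e = 0; e < 3; e++)
            for (int b2 = 0; b2 <= 1; b2++)
                for (int b3 = 0; b3 <= 1; b3++) {
                    /* mantissa (0.1 b2 b3)_2 = 1/2 + b2/4 + b3/8 */
                    double m = 0.5 + b2 * 0.25 + b3 * 0.125;
                    printf("% .5f\n", sign * m * scale[e]);
                }
    return 0;
}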
Machine epsilon
Arithmetic among numbers in floating-point representation is not exact even if the operands
happen to be exactly represented. For example, two floating-point numbers are added by
first right-shifting (dividing by two) the mantissa of the smaller one (in magnitude),
simultaneously increasing the exponent, until the two operands have the same exponent.
Low-order (least significant) bits of the smaller operand are lost by this shifting. If the
two operands differ too much in magnitude, then the smaller operand is effectively
replaced by zero. This leads us to the concept of machine accuracy.
Machine epsilon ε (or machine accuracy) is defined to be the smallest number that can
be added to 1.0 to give a number other than one. Thus, the machine epsilon describes
the accuracy of floating-point calculations. Note that this is not the same as the smallest
floating-point number that is representable in a given computer (see below)!
When using single-precision numbers, there are 23 bits allocated for the mantissa. The
machine epsilon is given by ε = 2−23 ≈ 1.19 × 10−7 , which means that numbers can be
represented with approximately six accurate digits. In double precision, we have 52 bits
allocated for the mantissa, and thus the machine epsilon is approximately ε = 2−52 ≈
2.22 × 10−16 , giving the accuracy of about 15 digits.
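The machine epsilon can be estimated experimentally by halving a trial value until adding
it to 1.0 no longer changes the result. A simple C sketch is shown below; note that on some
systems extended-precision intermediate arithmetic may slightly affect the float result:

#include <stdio.h>

int main(void)
{
    /* Halve eps until adding it to 1.0 no longer changes the sum; the last
       value that still made a difference is (approximately) the machine
       epsilon of the type.                                                 */
    double eps = 1.0;
    while (1.0 + eps / 2.0 > 1.0)
        eps /= 2.0;
    printf("double epsilon: %g\n", eps);   /* about 2.22e-16 */

    float feps = 1.0f;
    while (1.0f + feps / 2.0f > 1.0f)
        feps /= 2.0f;
    printf("float  epsilon: %g\n", feps);  /* about 1.19e-07 */

    return 0;
}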
To a great extent any arithmetic operation among floating-point numbers should be thought
of as introducing an additional fractional error of at least ε. This type of error is called
roundoff error (we return to this in more detail later in this chapter).
The smallest and largest representable numbers are determined by the number of bits that
are allocated for the exponent c. [Note: the machine epsilon, in contrast, depends on how
many bits there are in the mantissa.]
x_min = 2^{c_min},     x_max = M_max × 2^{c_max}

where c_min and c_max are the minimum and maximum values of the exponent, and M_max is
the maximum value of the mantissa.
The value of the stored exponent c in the representation of single-precision floating-point
numbers is restricted by the inequality

0 < c < (11 111 111)_2 = 255

The values 0 and 255 are reserved for special cases, including ±0 and ±∞. Hence, with the
bias E = 127, the actual exponent is restricted by

−126 ≤ c − 127 ≤ 127

and the mantissa, written in the 1-plus form (1.f)_2, satisfies

1 ≤ (1.f)_2 ≤ (1.111 1111 1111 1111 1111 1111)_2 = 2 − 2^{−23}
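Assuming the IEEE 754 single-precision layout described above (1 sign bit, 8 exponent bits,
23 mantissa bits, and a 4-byte float), the fields of a float can be inspected directly. The
following C sketch copies the bits into a 32-bit integer and extracts the three fields:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    float x = -6.5f;                       /* -6.5 = -1.625 * 2^2            */
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);        /* copy the raw bit pattern       */

    unsigned sign = bits >> 31;            /* 1 sign bit                     */
    unsigned c    = (bits >> 23) & 0xFFu;  /* 8-bit stored (biased) exponent */
    unsigned f    = bits & 0x7FFFFFu;      /* 23-bit mantissa field          */

    /* For -6.5 this prints: sign = 1, c = 129 (actual exponent 2), f = 0x500000 */
    printf("sign = %u, c = %u (actual exponent %d), f = 0x%06X\n",
           sign, c, (int)c - 127, f);
    return 0;
}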
Infinity is defined by setting all the exponent bits to one and all the mantissa bits to zero.
Depending on the sign bit, we have two different representations of infinity: +∞ and −∞.
The following rules apply to arithmetic operations involving ∞ (here x is a finite positive
number):

∞ ± x = ∞,    x × ∞ = ∞,    x/∞ = 0
In addition, the standard defines a constant called NaN (not a number), which is an error
pattern rather than a number. It is obtained by setting all the exponent bits to one and the
mantissa to a nonzero value (for example, all bits set to one).
In C, the header file float.h contains constants which reflect the characteristics of the
machine's floating-point arithmetic. Some of the most widely used ones are FLT_EPSILON,
DBL_EPSILON, FLT_MIN, FLT_MAX, DBL_MIN and DBL_MAX; their values are machine dependent
(the examples in these notes refer to a Linux/Intel machine).
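A small program of roughly the following form can be used to print these constants on any
particular machine (the exact values printed depend on the platform):

#include <stdio.h>
#include <float.h>

int main(void)
{
    printf("FLT_EPSILON = %g\n", FLT_EPSILON);   /* machine epsilon for float  */
    printf("DBL_EPSILON = %g\n", DBL_EPSILON);   /* machine epsilon for double */
    printf("FLT_MIN     = %g\n", FLT_MIN);       /* smallest normalized float  */
    printf("FLT_MAX     = %g\n", FLT_MAX);       /* largest float              */
    printf("DBL_MIN     = %g\n", DBL_MIN);       /* smallest normalized double */
    printf("DBL_MAX     = %g\n", DBL_MAX);       /* largest double             */
    return 0;
}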
Calculations with double-precision numbers are somewhat slower than when using single
precision. A simple test loop repeated N = 10^7 times took 5.70 seconds with single-precision
numbers and 6.43 seconds with double precision on a Linux/Intel machine (the variables y
and z in the loop were defined as floats or doubles, respectively). In this case the
difference is not large, but it may depend strongly on the machine in question.
0.3 Errors
Numerical calculations always involve approximations due to several reasons. These errors
are not the result of poor thinking or carelessness (like programming errors) but they
inevitably arise in all numerical calculations. We can divide the sources of errors roughly
into four categories: model, method, initial values (data) and roundoff.
An example of a methodological error is the truncation error (or chopping error), which
is encountered when, for example, a non-terminating series is truncated:
x(t) = e^t = 1 + t + t^2/2! + … + t^n/n! + R_{n+1}(t)
Here the truncation error is −R_{n+1}(t). It follows that when n is sufficiently large, the
remainder term R_{n+1}(t) is small and the truncated series gives a good approximation of
the exact result.
Another example of methodological errors is the discretizing error which results when a
continuous quantity is replaced by a discrete approximation. For example, replacing the
derivative by the difference quotient leads to a discretizing error:
x'(t) ≈ y(t, h) = (x(t + h) − x(t)) / h
Or as another example, in numerical integration, the integrand is evaluated at a discrete
set of points, rather than at "every" point.
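The competition between the discretization error and roundoff can be seen in a small
numerical experiment. The following C sketch approximates the derivative of sin t at t = 1
with decreasing step sizes h (compile with the math library, e.g. -lm); the error first
decreases with h and then grows again once roundoff dominates:

#include <stdio.h>
#include <math.h>

int main(void)
{
    const double t = 1.0;

    /* Forward difference (sin(t+h) - sin(t))/h compared with the exact
       derivative cos(t). The error first decreases with h and then
       increases again when roundoff in the subtraction takes over.     */
    for (int k = 1; k <= 12; k++) {
        double h = pow(10.0, -k);
        double approx = (sin(t + h) - sin(t)) / h;
        printf("h = 1e-%02d   error = %.3e\n", k, fabs(approx - cos(t)));
    }
    return 0;
}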
0.3.3 Errors due to initial values
The initial values of a numerical computation can involve inaccurate values (e.g.
measurements). When designing the algorithm, it is important to keep in mind that the initial
errors must not accumulate during the calculation. There are also techniques for data
filtering that are designed to decrease the effects of errors in the initial values.
0.3.4 Roundoff errors
The process of replacing a real number x by its nearest machine number is called correct
rounding, and the error involved is called the roundoff error. How large can it be?
Suppose x is written in the normalized binary form

x = (1.a_1 a_2 a_3 … a_{23} a_{24} a_{25} …)_2 × 2^m

One nearby machine number can be obtained by rounding down, that is, by dropping the excess
bits a_{24} a_{25} … (if only 23 bits have been allocated to the mantissa). This machine
number is

x_− = (1.a_1 a_2 a_3 … a_{23})_2 × 2^m
Another nearby machine number, x_+, is found by rounding up: one unit is added to the last
bit a_{23} in the expression for x_−. Thus,

x_+ = ((1.a_1 a_2 a_3 … a_{23})_2 + 2^{−23}) × 2^m

Figure 3: Rounding to the nearest: the number x is represented by the machine number
(x_− or x_+) that lies closest to the true value of x. The relative roundoff error involved
in the operation is no greater than ε/2.
0.3.5 Elementary arithmetic operations
We continue by examining the errors that are produced in basic arithmetic operations.
Roundoff errors accumulate with increasing amounts of calculation. If you perform N
arithmetic operations to obtain a given value, the total roundoff error might be of the
order √N ε, if you are lucky! [The square root comes from a random walk, assuming that
the errors come in randomly up or down.]
However, this estimate can be badly off the mark for two reasons:
(i) It frequently happens that for some reason (because of the regularities of your
calculation or the peculiarities of your computer) the roundoff errors accumulate
preferentially in one direction. In this case, the total error will be of order Nε.
(ii) Some especially unfavorable operations can increase the roundoff error substantially
in a single arithmetic operation. A good example is the subtraction of two almost equal
numbers, which results in the loss of significance (see the discussion later in this chapter).
To discuss these issues in more detail, we first introduce the notation fl(x) to denote the
floating-point machine number that corresponds to the real number x. For example, in a
hypothetical five-decimal machine, fl(2/3) = 0.66667.
If x is any real number within the range of the computer, then the error involved in the
operation is bounded by ε/2 (assuming correct rounding):

|x − fl(x)| / |x| ≤ (1/2) ε
Let the symbol ⊙ denote any one of the operations +, −, ×, or ÷. Suppose that whenever
two machine numbers x and y are combined arithmetically, the computer produces fl(x ⊙ y)
instead of x ⊙ y. Under this assumption, the previous analysis gives

fl(x ⊙ y) = (x ⊙ y)(1 + δ),     |δ| ≤ (1/2) ε

It is of course possible that x ⊙ y overflows or underflows, but apart from such cases, the
above assumption is realistic for most computers.
However, in real calculations, the initial values are given as real numbers and not as
machine numbers. For example, the relative error for addition can be estimated by
z = fl[fl(x) + fl(y)]
= [x(1 + δ1 ) + y(1 + δ2 )](1 + δ3 )
≈ (x + y) + x(δ1 + δ3 ) + y(δ2 + δ3 )
The relative roundoff error is

(z − (x + y)) / (x + y) = (x(δ_1 + δ_3) + y(δ_2 + δ_3)) / (x + y)
                        = δ_3 + (x δ_1 + y δ_2) / (x + y)
We notice that this quantity cannot be bounded in general, because the second term has a
denominator that can be zero or close to zero. Problems arising from these types of
situations are discussed next.
Consider, for example, the number

x = 0.3721498 × 10^{−5}

The digits of x do not all have the same significance, since they represent different powers
of 10. Thus 3 is the most significant digit and 8 is the least significant digit.
Remember that the output of a computer program often produces numbers with a long
list of digits. It is perfectly acceptable (and even recommended) to include these in the
raw output, but after analysis, the final results should always be given with appropriate
precision!
0.4.2 Loss of significance
In some cases, the relative error involved in arithmetic calculations can grow very large.
This typically involves the subtraction of two almost equal numbers.
For example, consider the case y = x − sin x, and suppose we have a computing system
which works with ten decimal digits. Then, for x = 1/15,

x         = 0.6666666667 × 10^{−1}
sin x     = 0.6661729492 × 10^{−1}
x − sin x = 0.0004937175 × 10^{−1} = 0.4937175000 × 10^{−4}

Thus the number of significant digits was reduced by three! Three spurious zeros were
added by the computer to the last three decimal places, but these are not significant digits.
The correct value is 0.4937174327 × 10^{−4}.
The simplest solution to this problem is to use floating-point numbers with higher
precision, but this only helps up to a certain point. A better approach is to anticipate that a
problematic situation may arise and change the algorithm in such a way that the problem
is avoided.
Example.
How many significant bits are lost in the subtraction x − y = 37.593621 − 37.584216?
We have
1 − y/x = 0.0002501754
This lies between 2−12 = 0.000244 and 2−11 = 0.000488. Hence, at least 11 but at most
12 bits are lost.
0.4.4 Avoiding loss of significance
i. Rationalizing
Consider the function

f(x) = √(x^2 + 1) − 1

We see that near zero, there is a potential loss of significance.
However, the function can be rewritten in the form

f(x) = (√(x^2 + 1) − 1) × (√(x^2 + 1) + 1) / (√(x^2 + 1) + 1)
     = x^2 / (√(x^2 + 1) + 1)
This cancels the troublesome terms analytically and thus removes the problematic
subtraction.
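The effect can be verified numerically. In the following C sketch, the naive form loses all
significant digits for a small argument, while the rewritten form does not (compile with -lm):

#include <stdio.h>
#include <math.h>

int main(void)
{
    double x = 1.0e-8;
    double naive = sqrt(x * x + 1.0) - 1.0;              /* subtracts nearly equal numbers */
    double safe  = (x * x) / (sqrt(x * x + 1.0) + 1.0);  /* rewritten form                 */

    /* The naive form gives 0, the rewritten form gives about 5e-17. */
    printf("naive: %.16e\nsafe : %.16e\n", naive, safe);
    return 0;
}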
ii. Using series expansions
Another option is to rewrite the expression using a Taylor series. For the function
f(x) = x − sin x considered earlier, expanding sin x gives

f(x) = x − (x − x^3/3! + x^5/5! − x^7/7! + …) = x^3/3! − x^5/5! + x^7/7! − …

so that the subtraction of nearly equal quantities is avoided for small x.
0.4.5 Range reduction
Another cause for loss of significant digits is the use of various library functions with very
large arguments.
For example, consider the sine function, whose basic property is its periodicity:

sin x = sin(x + 2nπ),   n = ±1, ±2, …

Before sin x can be computed for a very large argument, the argument must first be reduced
to a smaller range by subtracting a suitable multiple of 2π, and this reduction can itself
destroy significant digits, as the sketch below illustrates.
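In the following C sketch, the argument is reduced with fmod using a value of 2π rounded
to double precision; several trailing digits of the computed sine change as a result (the
exact output depends on the math library; compile with -lm):

#include <stdio.h>
#include <math.h>

int main(void)
{
    const double two_pi = 6.283185307179586;   /* 2*pi rounded to double precision */
    double x = 1.0e8;                          /* a very large argument            */

    /* Naive range reduction: subtract a whole number of periods with fmod.
       Because two_pi is itself rounded, the error is multiplied by the number
       of periods removed, and trailing digits of the result are lost.          */
    double r = fmod(x, two_pi);

    printf("sin(x), library    : %.15f\n", sin(x));
    printf("sin(x), after fmod : %.15f\n", sin(r));
    return 0;
}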