
LECTURE 1: Number representation and errors

The representation of numbers in computers is usually not based on the decimal system,
and therefore it is necessary to understand how to convert between different systems of
representation. Computers also have a limited number of bits for representing numbers,
and for this reason numbers with arbitrarily large magnitude cannot be represented, nor can
floating-point numbers be represented with arbitrary precision. In this chapter, we define
the key concepts of number representation in computers and discuss the main sources of
errors in numerical computing.

0.1 Representation of numbers in different bases

0.1.1 Decimal system


We first review the representation of numbers in the familiar decimal system and then
generalize the concept for systems with different base numbers.

Integer part
In the decimal system, the digits 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9 are used for the representation
of numbers. The individual digits in a number, such as 37294, are the coefficients of
powers of 10:

37294 = 4 + 90 + 200 + 7000 + 30000
      = 4 × 10^0 + 9 × 10^1 + 2 × 10^2 + 7 × 10^3 + 3 × 10^4

In general, a string of digits represents a number according to the formula

a_n a_{n-1} . . . a_1 a_0 = a_0 × 10^0 + a_1 × 10^1 + . . . + a_{n-1} × 10^{n-1} + a_n × 10^n

Fractional part
A number between 0 and 1 is represented by a string of digits to the right of a decimal
point. The individual digits represent the coefficients of negative powers of 10. In general,
we have the formula

0.b_1 b_2 b_3 . . . = b_1 × 10^{-1} + b_2 × 10^{-2} + b_3 × 10^{-3} + . . .

For example,

0.7215 = 7/10 + 2/100 + 1/1000 + 5/10000
       = 7 × 10^{-1} + 2 × 10^{-2} + 1 × 10^{-3} + 5 × 10^{-4}

General form of base-10 numbers


The representation of a real number in the decimal system is given by the following gen-
eral form in which the integer part is the first summation and the decimal part is the latter
summation.
(a_n a_{n-1} . . . a_1 a_0 . b_1 b_2 b_3 . . .)_{10} = ∑_{k=0}^{n} a_k 10^k + ∑_{k=1}^{∞} b_k 10^{-k}

The notation ( )_{10} is used for the decimal system to indicate that the number 10 is used
as the base number.

0.1.2 Base β numbers
In computers, the number 10 is generally not used as the base for number representation.
Instead, other systems, such as the binary (2 as the base), the octal (8 as the base), and the
hexadecimal (16 as the base) systems are commonly used in computing. In the general
form, the base is denoted by β.

General form of base β numbers


The digits used in the general β representation are 0, 1, 2, . . . , β − 1. For example, in the
octal representation of a number, the digits used are 0,1,2,3,4,5,6, and 7.
The general form of base β representation is given by
(a_n a_{n-1} . . . a_1 a_0 . b_1 b_2 b_3 . . .)_β = ∑_{k=0}^{n} a_k β^k + ∑_{k=1}^{∞} b_k β^{-k}

The separator between the integer and fractional parts is called the radix point, since the
term decimal point is reserved for base-10 numbers.
For example, in the octal representation, the individual digits before the radix point refer
to increasing powers of 8:

(21467)_8 = 7 + 6 × 8 + 4 × 8^2 + 1 × 8^3 + 2 × 8^4
          = 7 + 8(6 + 8(4 + 8(1 + 8(2))))
          = 9015

To convert this number to the decimal system, we build the nested form of the number
by repeatedly factoring out the base, and then replace each of the digits by its represen-
tation in the decimal system. We will soon discuss the conversion between different bases
in more detail.
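In code, evaluating the nested form is a simple loop. Below is a minimal C sketch (the
function name and digit-array convention are illustrative, not part of the lecture):

#include <stdio.h>

/* Evaluate the digits d[0..n-1] (most significant first) in base beta
   using the nested form d[0]*beta^(n-1) + ... + d[n-1]. */
long base_to_decimal(const int *d, int n, int beta)
{
    long value = 0;
    for (int k = 0; k < n; k++)
        value = value * beta + d[k];   /* Horner step: shift left by one base-beta digit */
    return value;
}

int main(void)
{
    int digits[] = {2, 1, 4, 6, 7};                  /* (21467) in base 8 */
    printf("%ld\n", base_to_decimal(digits, 5, 8));  /* prints 9015 */
    return 0;
}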
A number between 0 and 1, when expressed in the octal system, is given by the coeffi-
cients of negative powers of 8. For example,

(0.36207)_8 = 3 × 8^{-1} + 6 × 8^{-2} + 2 × 8^{-3} + 0 × 8^{-4} + 7 × 8^{-5}
            = 8^{-5} (7 + 8^2 (2 + 8(6 + 8(3))))
            = 15495/32768 = 0.47286987 . . .

Conversion of integer part


We now consider a general procedure for converting from one base to another. We do this
by considering the conversion of the integer and fractional parts separately.
Consider an integer N in the number system with base α:

N = (a_n a_{n-1} . . . a_1 a_0)_α = ∑_{k=0}^{n} a_k α^k

In order to convert N to the number system with base β , we first write N in its nested form

N = a_0 + α(a_1 + α(a_2 + . . . + α(a_{n-1} + α(a_n)) . . .))

and then replace each of the numbers on the right by its representation in the β -arithmetic.
The replacement requires a table showing how each of the numbers is represented in the

β -system. In addition, a base-β multiplication table may be required.

As an example, consider the conversion of the decimal number 3781 to binary form.

(3781)_{10} = 1 + 10(8 + 10(7 + 10(3)))
            = (1)_2 + (1010)_2 ((1000)_2 + (1010)_2 ((111)_2 + (1010)_2 (11)_2))
            = (111011000101)_2

The following table can be used for the replacement of base-10 numbers by their binary
counterparts:

DEC BIN
0 0
1 1
2 10
3 11
4 100
5 101
6 110
7 111
8 1000
9 1001
10 1010

This calculation is easy for computers but quite tedious for humans. Therefore, another
procedure can be used for hand calculations. Taking the conversion from the decimal to
the binary system as an example, we repeatedly divide by 2 until 0 remains. Then the
converted number is given by the remainders of the divisions in reversed order.

For example, here each number in the left-hand column is the quotient obtained by dividing
the previous number by 2; the corresponding remainder is marked in the right-hand column.
Thus the converted number is (111011000101)_2.

Quotients Remainders
3781
1890 1
945 0
472 1
236 0
118 0
59 0
29 1
14 1
7 0
3 1
1 1
0 1
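This procedure translates directly into code. A minimal C sketch of the repeated-division
method (the function name and fixed-size digit buffer are my own choices):

#include <stdio.h>

/* Convert a nonnegative decimal integer to base beta by repeated division;
   the remainders are the digits of the result in reversed order. */
void decimal_to_base(long n, int beta)
{
    int digits[64], count = 0;
    if (n == 0)
        digits[count++] = 0;
    while (n > 0) {
        digits[count++] = (int)(n % beta);  /* remainder = next digit */
        n /= beta;                          /* continue with the quotient */
    }
    while (count > 0)                       /* print in reversed order */
        printf("%d", digits[--count]);
    printf("\n");
}

int main(void)
{
    decimal_to_base(3781, 2);               /* prints 111011000101 */
    return 0;
}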

Conversion of fractional part


The conversion of the fractional part can be done by the following direct but somewhat
impractical approach:

(0.372)_{10} = 3 × 10^{-1} + 7 × 10^{-2} + 2 × 10^{-3}
             = (1/(1010)_2) [ (011)_2 + (1/(1010)_2) ( (111)_2 + (1/(1010)_2) (010)_2 ) ]

Since division in binary arithmetic is not straightforward, we look for alternative ways of
doing the conversion such that the arithmetic can be carried out in the decimal system.
Suppose that x is in the range 0 < x < 1 and given in the β representation:

x = ∑_{k=1}^{∞} c_k β^{-k} = (0.c_1 c_2 c_3 . . .)_β

Observe that

β x = (c_1 . c_2 c_3 c_4 . . .)_β

because we only need to multiply by β to shift the radix point.
Thus the unknown digit c_1 can be obtained as the integer part of β x, denoted I(β x).
Denote the fractional part of β x by F(β x). The process can be repeated until all the
unknown digits c_k have been determined:

d_0 = x
d_1 = F(β d_0),   c_1 = I(β d_0)
d_2 = F(β d_1),   c_2 = I(β d_1)
. . .

For example, here we repeatedly multiply by 2 and remove the integer parts to get
(0.372)_{10} = (0.010111 . . .)_2.
Products Digits
0.372
0.744 0
1.488 1
0.976 0
1.952 1
1.904 1
1.808 1
etc.
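The multiplication procedure is equally easy to program. A small C sketch (most fractions
do not terminate in the target base, so the number of printed digits is capped):

#include <stdio.h>

/* Convert a fraction 0 < x < 1 to base beta by repeatedly multiplying by
   beta and peeling off the integer part; print at most nd digits. */
void fraction_to_base(double x, int beta, int nd)
{
    printf("0.");
    for (int k = 0; k < nd; k++) {
        x *= beta;
        int c = (int)x;        /* c_k = I(beta * d_{k-1}) */
        printf("%d", c);
        x -= c;                /* d_k = F(beta * d_{k-1}) */
    }
    printf("\n");
}

int main(void)
{
    fraction_to_base(0.372, 2, 6);   /* prints 0.010111 */
    return 0;
}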

0.2 Representation of numbers in computers


In this section, we discuss how numbers are represented in typical computers and how
arithmetic between the numbers works. We begin by discussing integer numbers and then
continue to floating-point numbers.

0.2.1 Integer numbers


A number in integer representation is exact. Arithmetic between two integer numbers is
also exact provided that (i) the answer is not outside the range of representable numbers,
and that (ii) division is interpreted as producing an integer result (throwing any remainder
away).
Unsigned integers are easy to represent. All digits are represented with bits 0 and 1, and
thus the string of digits can be directly interpreted as a binary number. For example, with
eight bits, we have
0011 1001 = (1 + 8 + 16 + 32)_{10} = (57)_{10}

To include negative numbers as well, we must assign a separate sign bit. The first bit of the
string is the sign bit, which is zero for positive numbers and one for negative numbers.
The most often used method for obtaining the representation for negative numbers is a
method called two’s complement:
In the binary representation of a positive number x, invert all the bits (0 ↔ 1),
and add 1 to get the binary representation of −x.
For example, if we have eight bits for the representation (of which one is for the sign and
seven for the digits), then
+2 = 0000 0010
+1 = 0000 0001
0 = 0000 0000
−1 = 1111 1111
−2 = 1111 1110
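The rule can be verified directly in C with an 8-bit integer type; a small sketch:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t x   = 2;                  /* 0000 0010 */
    uint8_t neg = (uint8_t)(~x + 1);  /* invert all bits, add 1: 1111 1110 */
    printf("%d\n", (int8_t)neg);      /* reinterpreted as signed: prints -2 */
    return 0;
}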

The largest representable number in an n-bit system is 2^{n-1} − 1. For example, a 32-bit
integer number can have values between

2^{31} − 1 = 2147483647 = (01111111 11111111 11111111 11111111)_2   and

−2^{31} = −2147483648 = (10000000 00000000 00000000 00000000)_2

In the minimum value, the sign bit has been interpreted as part of the number.

Most compilers do not give error messages about exceeding the range of integer numbers,
except in some obvious situations. As an example, consider the following C code.

#include <stdio.h>

int main(void)
{
    short int si;
    int i, k, j1, j2, j3, j4;

    si = 1; i = 1;

    /* Repeated doubling: the 16-bit short overflows at k = 15,
       the 32-bit int at k = 31. */
    for(k=1; k<33; k++) {
        si *= 2;
        i *= 2;
        fprintf(stdout,"%3d%8d%12d\n",k,si,i);
    }

    j1 = -2147483647;
    fprintf(stdout,"\n%d\n",j1);
    j2 = -2147483648;   /* the constant 2147483648 itself exceeds the int range */
    fprintf(stdout,"\n%d\n",j2);
    j3 = 2147483647;    /* the largest representable int */
    fprintf(stdout,"\n%d\n",j3);
    j4 = 2147483648;    /* out of range: wraps around to -2147483648 */
    fprintf(stdout,"\n%d\n",j4);

    return 0;
}

COMPILING:

warning: decimal constant is so large that it is unsigned

OUTPUT:

1 2 2
2 4 4
3 8 8
4 16 16
5 32 32
6 64 64
7 128 128
8 256 256
9 512 512
10 1024 1024
11 2048 2048
12 4096 4096
13 8192 8192
14 16384 16384
15 -32768 32768
16 0 65536
17 0 131072
18 0 262144
19 0 524288
20 0 1048576
21 0 2097152
22 0 4194304
23 0 8388608
24 0 16777216
25 0 33554432
26 0 67108864
27 0 134217728
28 0 268435456
29 0 536870912
30 0 1073741824
31 0 -2147483648
32 0 0

-2147483647

-2147483648

2147483647

-2147483648

0.2.2 Floating-point numbers
In a computer, there are no real numbers or rational numbers (in the sense of the mathe-
matical definition); all noninteger numbers are represented with finite precision.
For example, the numbers π and 1/3 cannot be represented precisely (in any representa-
tion), and in some representations, the numbers 1.00000000 and 1.00000001 are indistin-
guishable.

The floating-point representation in computers is based on an internal division of the bits
reserved for representing a given number x. The number is represented in three parts:
a sign s that is either + or −, an integer exponent c, and a positive mantissa M:

x = s × B^{c−E} × M

where B is the base of the representation (usually B = 2) and E is the bias of the expo-
nent (a fixed integer constant for any given machine and representation which enables
representing negative exponents without a separate sign bit).
In the decimal system, this corresponds to the following normalized floating-point form

x = ±0.d_1 d_2 d_3 . . . × 10^n = ±r × 10^n

where d_1 ≠ 0 and n is an integer.

In most computers, floating-point numbers are represented in the following standard


IEEE floating-point form
x = s × 2^{c−E} × (1. f)_2
The first bit s is the sign bit (0 = + and 1 = −). The next bits are used to represent
the exponent c corresponding to 2c−E where E is a constant which enables representing
negative exponents without a separate sign bit. The last bits are reserved for the mantissa
(also called the significand) which is given in the "1-plus" form (1. f )2 .
The IEEE standard defines single-precision (32 bits) and double-precision (64 bits) floating-
point numbers. The available bits are allocated as shown in Fig. 1 (constant E is 127 for
single precision and 1023 for double precision):

   sign s: 1 bit  |  exponent c: 8 / 11 bits  |  mantissa f: 23 / 52 bits

Figure 1: IEEE standard for floating-point numbers (single / double precision).
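The three fields can be inspected in C by copying the bits of a float into an integer. A
sketch assuming IEEE single precision with E = 127 (the example value is arbitrary):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    float x = -6.25f;                  /* -6.25 = -(1.5625) * 2^2 */
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);    /* reinterpret the 32 bits */

    uint32_t s = bits >> 31;           /* 1 sign bit */
    uint32_t c = (bits >> 23) & 0xFF;  /* 8 exponent bits */
    uint32_t f = bits & 0x7FFFFF;      /* 23 mantissa bits */

    /* expected: s = 1, c = 129 (c - 127 = 2), f = 0x480000 = (.1001...)_2 */
    printf("s = %u, c = %u (c - 127 = %d), f = 0x%06X\n",
           s, c, (int)c - 127, f);
    return 0;
}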

Machine numbers

A floating-point number system within a computer is always limited by the finite word-
length of computers. This means that only a finite number of digits can be represented. As
a consequence, numbers that are too large or too small cannot be represented. Moreover,
most real numbers cannot be represented exactly in a computer. For example,
1/10 = (0.1)_{10} = (0.0 0011 0011 0011 0011 0011 . . .)_2

The effective number system for a computer is not a continuum but a discrete set of
numbers called the machine numbers.
In a normalized representation (used in most computers), the bit patterns are as "left
shifted" as possible, corresponding to the "1-plus" form of the mantissa (this form doesn’t
waste any bits but uses the correct exponent such that there are no leading zero bits in
the mantissa). The machine numbers in the normalized representation are not uniformly
spaced but unevenly distributed about zero.
For example, if we list all floating-point numbers which can be expressed in the following
normalized form

x = ±(0.1b_2 b_3)_2 × 2^{±k}   (k, b_i ∈ {0, 1})

where we have only two bits for the mantissa (b_2 and b_3) and one bit for the exponent
(k), we get the set of numbers shown in Fig. 2. This shows the phenomenon known as
the hole at zero: there is a relatively wide gap between zero and the smallest positive
number, (0.100)_2 × 2^{-1} = 1/4.

Figure 2: Representable numbers (machine numbers) in the example number system:
±1/4, 5/16, 3/8, 7/16, 1/2, 5/8, 3/4, 7/8, 1, 5/4, 3/2, and 7/4.

Machine epsilon

Arithmetic among numbers in floating-point representation is not exact even if the operands
happen to be exactly represented. For example, two floating-point numbers are added by
first right-shifting (dividing by two) the mantissa of the smaller one (in magnitude), si-
multaneously increasing the exponent, until the two operands have the same exponent.
Low-order (least significant) bits of the smaller operand are lost by this shifting. If the
two operands differ too much in magnitude, then the smaller operand is effectively re-
placed by zero. This leads us to the concept of machine accuracy.
Machine epsilon ε (or machine accuracy) is defined to be the smallest number that can
be added to 1.0 to give a number other than one. Thus, the machine epsilon describes
the accuracy of floating-point calculations. Note that this is not the same as the smallest
floating-point number that is representable in a given computer (see below)!
When using single-precision numbers, there are 23 bits allocated for the mantissa. The
machine epsilon is given by ε = 2^{-23} ≈ 1.19 × 10^{-7}, which means that numbers can be
represented with approximately six accurate digits. In double precision, we have 52 bits
allocated for the mantissa, and thus the machine epsilon is approximately ε = 2^{-52} ≈
2.22 × 10^{-16}, giving an accuracy of about 15 digits.
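The machine epsilon can be estimated experimentally by halving a candidate until adding
it to 1.0 no longer changes the result. A sketch (volatile is used so that, on some
compilers, the comparison is not carried out in extended-precision registers):

#include <stdio.h>

int main(void)
{
    double eps = 1.0;
    volatile double dsum = 2.0;
    while (dsum > 1.0) {
        eps /= 2.0;
        dsum = 1.0 + eps;
    }
    /* the loop exits one halving too far, so the epsilon is 2*eps */
    printf("double epsilon ~ %g\n", 2.0 * eps);   /* ~ 2.22045e-16 = 2^-52 */

    float epsf = 1.0f;
    volatile float fsum = 2.0f;
    while (fsum > 1.0f) {
        epsf /= 2.0f;
        fsum = 1.0f + epsf;
    }
    printf("float epsilon  ~ %g\n", 2.0f * epsf); /* ~ 1.19209e-07 = 2^-23 */
    return 0;
}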

To a great extent any arithmetic operation among floating-point numbers should be thought
of as introducing an additional fractional error of at least ε. This type of error is called
roundoff error (we return to this in more detail later in this chapter).

Smallest and largest numbers

The smallest and largest representable numbers are determined by the number of bits that
are allocated for the exponent c. [Note: the machine epsilon, in contrast, depends on how
many bits there are in the mantissa.]

The smallest representable number is given by

x_min = 2^{c_min}

and similarly, the largest number is

x_max = M_max × 2^{c_max}

where cmin and cmax are the minimum and maximum values of the exponent, and Mmax is
the maximum value of the mantissa.
The value of c in the representation of single-precision floating-point numbers is restricted
by the inequality
0 < c < (1111 1111)_2 = 255

The values 0 and 255 are reserved for special cases including ±0 and ±∞. Hence, the
actual exponent is restricted by

−126 ≤ c − 127 ≤ 127

Likewise, the mantissa is restricted by

1 ≤ (1. f)_2 ≤ (1.1111 1111 1111 1111 1111 111)_2 = 2 − 2^{-23}

The largest representable 32-bit number is therefore

(2 − 2^{-23}) × 2^{127} ≈ 2^{128} ≈ 3.4 × 10^{38}

The smallest positive normalized number is

2^{-126} ≈ 1.2 × 10^{-38}

Zero, infinity and NaN

The IEEE standard defines some useful special characters.


The number zero is represented by all bits being zero. The sign bit, however, can take
both values, and thus there are two different representations of zero: +0 (sign bit 0) and
−0 (sign bit 1).

Infinity is defined by setting all the exponent bits to one and those in the mantissa to zero.
Depending on the sign bit, we have two different representations of infinity: +∞ and −∞.
The following rules apply to arithmetic operations involving ∞:

∞ ± x = ∞,   x × ∞ = ∞,   x/∞ = 0

In addition, the standard defines a constant called NaN (not a number) which is an error
pattern rather than a number. It is obtained by setting all bits to ones.

C header file float.h

In C, the header file float.h contains constants which reflect the characteristics of the
machine's floating-point arithmetic. Some of the most widely used values are (these were
obtained on a Linux/Intel machine):

Number of decimal digits for float (FLT_DIG) = 6


Number of decimal digits for double (DBL_DIG) = 15

Precision for float (FLT_EPSILON) = 1.19209e-07


Precision for double (DBL_EPSILON) = 2.22045e-16

Maximum float (FLT_MAX) = 3.40282e+38


Maximum double (DBL_MAX) = 1.79769e+308

Minimum positive float (FLT_MIN) = 1.17549e-38


Minimum positive double (DBL_MIN) = 2.22507e-308
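These constants can be printed on any machine with a few lines of C; a minimal sketch:

#include <stdio.h>
#include <float.h>

int main(void)
{
    printf("FLT_DIG     = %d\n", FLT_DIG);
    printf("DBL_DIG     = %d\n", DBL_DIG);
    printf("FLT_EPSILON = %g\n", FLT_EPSILON);
    printf("DBL_EPSILON = %g\n", DBL_EPSILON);
    printf("FLT_MAX     = %g\n", FLT_MAX);
    printf("DBL_MAX     = %g\n", DBL_MAX);
    printf("FLT_MIN     = %g\n", FLT_MIN);
    printf("DBL_MIN     = %g\n", DBL_MIN);
    return 0;
}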

Single or double precision?

Calculations with double-precision numbers are somewhat slower than when using single
precision. The following loop (N = 10^7)

for(i=0; i<N; i++) {
    x += 0.1;
    y += sin(x);
    z += log(x)*cos(x);
}

took 5.70 seconds with single-precision numbers and 6.43 seconds with double precision
on a Linux/Intel machine (y and z were defined as floats or doubles). In this case,
the difference is not large, but it may depend strongly on the machine in question.

In many cases, it is advisable to use double precision in order to avoid problems
due to, for example, loss of significance or round-off errors.

0.3 Errors
Numerical calculations always involve approximations for several reasons. These er-
rors are not the result of poor thinking or carelessness (like programming errors); they
inevitably arise in all numerical calculations. We can divide the sources of errors roughly
into four categories: model, method, initial values (data), and roundoff.

0.3.1 Modeling errors


When a practical problem is formulated into mathematical language, it is almost always
necessary to make simplifications. Examples of modeling errors include leaving out less-
influential factors (e.g., no air resistance in falling) or using a simplified description of a
more complex system (e.g., classical description of a quantum-mechanical system).
Modeling errors are not discussed here in more detail but left as a subject of courses in
the various application fields.

0.3.2 Methodological errors


The conversion of a mathematical problem into a numerical one is also a source of errors.
Care should be taken to control these errors and to estimate their magnitude and thus the
quality of the numerical solution. Note that by methodological errors we mean errors that
would persist even if a hypothetical "perfect" computer had an infinitely accurate repre-
sentation and no roundoff error. As a general rule, there is not much a programmer can
do about the computer’s roundoff error (see below for more details). Methodological er-
rors, on the other hand, are entirely under the programmer's control. In fact, an incredible
amount of work in the field of numerical analysis has been devoted to the minimization
of methodological errors!

An example of methodological errors is the truncation error (or chopping error) which
is encountered when, for example, a non-terminating series is chopped:

x(t) = e^t = 1 + t + t^2/2! + . . . + t^n/n! + R_{n+1}(t)

Here the truncation error is −R_{n+1}(t). It follows that when n is sufficiently large, the ex-
cess term R_{n+1}(t) is small and the chopping gives a good approximation of the exact result.

Another example of methodological errors is the discretizing error which results when a
continuous quantity is replaced by a discrete approximation. For example, replacing the
derivative by the difference quotient leads to a discretizing error:

x′(t) ≈ y(t, h) = (x(t + h) − x(t)) / h
Or as another example, in numerical integration, the integrand is evaluated at a discrete
set of points, rather than at "every" point.
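For the difference quotient above, the interplay between the discretizing error (which
shrinks with h) and the roundoff error (which grows as h shrinks) can be observed
numerically. A sketch using x(t) = sin t at t = 1, so that the exact derivative cos 1 is known:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double t = 1.0, exact = cos(t);
    /* the error first decreases roughly like h, then grows again
       once roundoff in sin(t+h) - sin(t) takes over */
    for (int k = 1; k <= 15; k++) {
        double h = pow(10.0, -k);
        double approx = (sin(t + h) - sin(t)) / h;
        printf("h = 1e-%02d   error = %.3e\n", k, fabs(approx - exact));
    }
    return 0;
}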

0.3.3 Errors due to initial values
The initial values of a numerical computation can involve inaccurate values (e.g., mea-
surements). When designing the algorithm, it is important to ensure that the initial
errors do not accumulate during the calculation. There are also data-filtering techniques
that are designed to decrease the effects of errors in the initial values.

0.3.4 Roundoff errors


Roundoff errors are the result of having a finite number of bits to represent floating-point
numbers in computers. As already mentioned, arbitrarily large or small numbers cannot
be represented and floating-point numbers cannot have arbitrary precision.
Consider a positive real number x in normalized floating-point form

x = (1.a_1 a_2 a_3 . . . a_{23} a_{24} a_{25} . . .)_2 × 2^m

The process of replacing x by its nearest machine number is called correct rounding,
and the error involved is called the roundoff error. How large can it be?

One nearby machine number can be obtained by rounding down, that is, by dropping the
excess bits a_{24} a_{25} . . . (if only 23 bits have been allocated to the mantissa). This
machine number is

x_− = (1.a_1 a_2 a_3 . . . a_{23})_2 × 2^m

Another nearby machine number is found by rounding up: one unit is added
to a_{23} in the expression for x_−. Thus,

x_+ = [(1.a_1 a_2 a_3 . . . a_{23})_2 + 2^{-23}] × 2^m

The closer of these machine numbers is chosen for x.


Since the unit roundoff error (machine epsilon) for a 32-bit floating-point representation
is ε = 2^{-23}, we notice that in the case of rounding to the nearest, the relative error is no
greater than ε/2. This is illustrated in Fig. 3. [Remember that the machine epsilon is
defined to be the smallest number such that when added to 1, the result is not equal to 1.]

Figure 3: Rounding to the nearest: the number x is represented by whichever of the machine
numbers x_− and x_+ lies closest to the true value of x. The relative roundoff error involved
in the operation is no greater than ε/2.

0.3.5 Elementary arithmetic operations
We continue by examining errors that are produced in basic arithmetic operations. Round-
off errors accumulate with increasing amounts of calculation. If you perform N arithmetic
operations to obtain a given value, the total roundoff error might be of the order √N ε, if
you are lucky! [The square root comes from a random walk, assuming that the errors
come in randomly up or down.]
However, this estimate can be badly off the mark for two reasons:
(i) It frequently happens that for some reason (because of the regularities of your calcu-
lation or the peculiarities of your computer) the roundoff errors accumulate preferentially
in one direction. In this case, the total error will be of order Nε.
(ii) Some especially unfavorable operations can increase the roundoff error substantially
in a single arithmetic operation. A good example is the subtraction of two almost equal
numbers, which results in the loss of significance (see the discussion later in this chapter).

To discuss these issues in more detail, we first introduce the notation fl(x) to denote the
floating-point machine number that corresponds to the real number x.
For example, in a hypothetical five-decimal machine,

fl(0.3721871422 × 10^4) = 0.37219 × 10^4

If x is any real number within the range of the computer, then the error involved in the
operation is bounded by ε/2 (assuming correct rounding):

|x − fl(x)| / |x| ≤ ε/2

This can be written as

fl(x) = x(1 + δ)   (|δ| ≤ ε/2)

Remember that the unit round-off error or the machine epsilon is the smallest positive
machine number ε such that
fl(1 + ε) > 1

Let the symbol ⊙ denote any one of the operations +, −, ×, or ÷. Suppose that whenever
two machine numbers x and y are combined arithmetically, the computer produces
fl(x ⊙ y) instead of x ⊙ y. Under this assumption, the previous analysis gives

fl(x ⊙ y) = (x ⊙ y)(1 + δ)   (|δ| ≤ ε/2)

It is of course possible that x ⊙ y overflows or underflows, but except for this case, the
above assumption is realistic for most computers.
However, in real calculations, the initial values are given as real numbers and not as
machine numbers. For example, the relative error for addition can be estimated by

z = fl[fl(x) + fl(y)]
  = [x(1 + δ_1) + y(1 + δ_2)](1 + δ_3)
  ≈ (x + y) + x(δ_1 + δ_3) + y(δ_2 + δ_3)

The relative round-off error is

(z − (x + y)) / (x + y) = (x(δ_1 + δ_3) + y(δ_2 + δ_3)) / (x + y)
                        = δ_3 + (xδ_1 + yδ_2) / (x + y)

We notice that this cannot be bounded, because the second term has a denominator that
can be zero or close to zero. Problems due to this type of situation are discussed next.

0.4 Loss of significance


In this section, we discuss how the problem of loss of significance arises and how it can be
avoided by various techniques, such as the use of rationalization, Taylor series, trigono-
metric identities, and so on.

0.4.1 Significant digits


We begin by considering the concept of significant digits in a number. Consider a real
number x expressed in normalized scientific form. For example,

x = 0.3721498 × 10^{-5}

The digits of x do not have the same significance since they represent different powers of
10. Thus, 3 is the most significant digit and 8 is the least significant digit.

A mathematically exact quantity, such as π, can be expressed with as many significant


digits as we like. A measured quantity, however, always involves an error whose magni-
tude depends on the measuring device. Likewise, numerically computed quantities are
not exact, but can only be expressed with a certain precision depending on the machine.
It is important to understand that the precision of a computed quantity is determined by
the least precise value that was used in the computation.

For example, suppose we have a measured quantity s = 0.736. It is a scientific convention


that the least significant digit given in a measured quantity should be in error by at most
five units. The following computation gives
p
y = s × (2) ≈ 0.1040861182 × 101

but the result should be reported as 0.104 × 101 .

Remember that the output of a computer program often produces numbers with a long
list of digits. It is perfectly acceptable (and even recommended) to include these in the
raw output, but after analysis, the final results should always be given with appropriate
precision!

0.4.2 Loss of significance
In some cases, the relative error involved in arithmetic calculations can grow significantly.
This typically involves subtracting two almost equal numbers.
For example, consider the case y = x − sin(x), and a computing system that works with
ten decimal digits. Then

x         = 0.66666 66667 × 10^{-1}
sin x     = 0.66617 29492 × 10^{-1}
x − sin x = 0.00049 37175 × 10^{-1}
          = 0.49371 75000 × 10^{-4}

Thus the number of significant digits was reduced by three! Three spurious zeros were
added by the computer to the last three decimal places, but these are not significant digits.
The correct value is 0.49371 74327 × 10^{-4}.

The simplest solution to this problem is to use floating-point numbers with higher pre-
cision, but this only helps up to a certain point. A better approach is to anticipate that a
problematic situation may arise and change the algorithm in such a way that the problem
is avoided.

0.4.3 Loss of precision theorem


Exactly how many significant binary digits are lost in the subtraction x − y when x is close
to y?

Let x and y be normalized floating-point numbers with x > y > 0.
If 2^{-p} ≤ 1 − y/x ≤ 2^{-q} for some positive integers p and q, then at most
p and at least q significant binary digits are lost in the subtraction x − y.

Example.
How many significant bits are lost in the subtraction x − y = 37.593621 − 37.584216?
We have

1 − y/x = 0.0002501754

This lies between 2^{-12} = 0.000244 and 2^{-11} = 0.000488. Hence, at least 11 but at most
12 bits are lost.
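The bound is easy to evaluate in code via −log2(1 − y/x); a small C sketch for the
numbers above:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double x = 37.593621, y = 37.584216;
    double r = 1.0 - y / x;
    /* -log2(r) is about 11.96, i.e. at least 11 and at most 12 bits lost */
    printf("1 - y/x = %.10f, bits lost ~ %.2f\n", r, -log2(r));
    return 0;
}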

0.4.4 Avoiding loss of significance
i. Rationalizing
Consider the function

f(x) = √(x^2 + 1) − 1

We see that near zero, there is a potential loss of significance.
However, the function can be rewritten in the form

f(x) = (√(x^2 + 1) − 1) × (√(x^2 + 1) + 1) / (√(x^2 + 1) + 1)
     = x^2 / (√(x^2 + 1) + 1)

This cancels the problematic terms analytically and therefore removes the subtraction of
nearly equal numbers.
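The effect is easy to demonstrate in C by comparing the two mathematically equal forms
(a sketch; the function names are mine):

#include <stdio.h>
#include <math.h>

double f_naive(double x)        { return sqrt(x*x + 1.0) - 1.0; }
double f_rationalized(double x) { return x*x / (sqrt(x*x + 1.0) + 1.0); }

int main(void)
{
    double x = 1.0e-8;
    printf("naive:        %.17e\n", f_naive(x));        /* 0: all digits lost */
    printf("rationalized: %.17e\n", f_rationalized(x)); /* ~ 5.0e-17 */
    return 0;
}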

ii. Using series expansion


Consider the function
f (x) = x − sin x
whose values are required near x = 0. We can avoid the loss of significance by using the
Taylor series for sin x
sin x = x − x^3/3! + x^5/5! − x^7/7! + . . .

For x near zero, the series converges quite rapidly.
We can now rewrite the function f as

f(x) = x − (x − x^3/3! + x^5/5! − x^7/7! + . . .) = x^3/3! − x^5/5! + x^7/7! − . . .

This is a very effective form for calculating f for small x.
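A C sketch comparing the direct formula with a truncated series (the term count and the
recurrence are my own choices; each new term is the previous one times −x²/((2n+2)(2n+3))):

#include <stdio.h>
#include <math.h>

double f_direct(double x) { return x - sin(x); }

double f_series(double x)
{
    double term = x*x*x / 6.0;     /* first term x^3/3! */
    double sum  = term;
    for (int n = 1; n <= 5; n++) {
        term *= -x*x / ((2.0*n + 2.0) * (2.0*n + 3.0));  /* next series term */
        sum  += term;
    }
    return sum;
}

int main(void)
{
    double x = 1.0e-4;
    /* the direct form loses about 12 digits to cancellation;
       the series form is accurate to full double precision */
    printf("direct: %.17e\nseries: %.17e\n", f_direct(x), f_series(x));
    return 0;
}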

iii. Using trigonometric identities


As a simple example, consider the function

f(x) = cos^2(x) − sin^2(x)

There will be loss of significance at x = π/4.


The problem can be solved by the simple substitution

cos^2(x) − sin^2(x) = cos(2x)

0.4.5 Range reduction
Another cause for loss of significant digits is the use of various library functions with very
large arguments.
For example, consider the sine function whose basic property is its periodicity

sin(x) = sin(x + 2πn)

for all real values of x and all integer values of n.


The periodicity is used in the computer evaluation of sin x: one only needs to know
the values of sin x in some fixed interval of length 2π in order to compute sin x for arbitrary
x. This is called range reduction.
This procedure leads to an unavoidable loss of precision if the original argument x is very
large. For example, if we want to evaluate sin(12532.14), we first subtract 3988π from the
original argument and calculate the value of sin(3.47). Here the argument has only three
significant figures due to the subtraction! Thus our computed value of sin(12532.14) can
only be given with three significant digits!
The only way to improve the situation is to use double- or extended-precision program-
ming. This is recommended for variables which are used as arguments of library functions
such as the sine function.
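The effect can be observed by evaluating the same large argument in single and double
precision; a small sketch:

#include <stdio.h>
#include <math.h>

int main(void)
{
    float  xf = 12532.14f;   /* carries only about 7 significant digits */
    double xd = 12532.14;
    /* the two results agree only to a few digits, reflecting the
       precision lost in reducing the large argument */
    printf("sinf(x) = %.8f\n", sinf(xf));
    printf("sin(x)  = %.8f\n", sin(xd));
    return 0;
}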
