
Introduction To Numerical Methods: 1.1 A Problem and Its Solution

This document introduces numerical methods for solving engineering problems. Section 1.1 uses an example of calculating the forces on a falling object to illustrate how numerical methods can approximate solutions without needing exact analytical solutions. Section 1.2 discusses sources of error in numerical solutions, including round-off error from number representation and truncation error from approximate methods. Section 1.3 covers number representation on computers, such as floating-point formats. Section 1.4 briefly mentions different types of problems that can be solved using numerical methods.

Uploaded by

Coder D

CHAPTER 1

Introduction to Numerical Methods

Contents
1.1 A problem and its solution
1.2 Error in numerical solutions
1.2.1 Round-off Errors
1.2.2 Truncation Error
1.2.3 Total Error
1.3 Number representation on a computer
1.3.1 Scientific number representation
1.3.2 Decimal and binary representation
1.3.3 Floating-point representation
1.3.4 Single and double precision floating point representation
1.4 Types of problems to be solved

In this lecture, an engineering problem is introduced and solved both analytically and
numerically. You will see how a numerical method can solve a complex engineering
problem without knowledge of its analytical solution. However, numerical methods
always carry some error, because they produce an approximate solution rather than an
exact one. The different types of error that arise in numerical methods are therefore
also discussed. In this regard, number representation and arithmetic operations on a
computer (32-bit and 64-bit) are covered. At the end of the lecture, the types of
problems that can be solved using numerical methods are briefly introduced; they will
be covered in more detail in forthcoming lectures.

1.1 A problem and its solution


Let us introduce an engineering problem: the forces acting on a falling object, as shown
in Fig. 1.1. This problem can be formulated using Newton's second law of motion, which
states that the time rate of change of momentum of a body is equal to the resultant force
acting on it. For an object falling in the vicinity of the earth, the net force $F$ is
composed of two opposing forces: the downward pull of gravity $F_D$ and the upward force
of air resistance $F_U$, i.e.,

\[ F = F_D + F_U. \tag{1.1} \]

Figure 1.1: Falling object. Schematic diagram of the forces acting on a falling
parachutist: $F_D$ is the downward force due to gravity and $F_U$ is the upward force due
to air resistance.

If the downward force is assigned a positive sign, the second law can be used to
formulate the force due to gravity as

\[ F_D = mg, \tag{1.2} \]

where $g$ is the gravitational constant, or the acceleration due to gravity, which is
approximately equal to 9.81 m/s$^2$, and $m$ is the mass of the object.
Although the air resistance can be formulated in many ways (for example, proportional
to $v^2$), we consider a very simple approach in which air resistance is linearly
proportional to velocity and acts in an upward direction, as in

\[ F_U = -cv, \tag{1.3} \]

where $c$ is a proportionality constant called the drag coefficient (kg/s). Thus, the
greater the fall velocity, the greater the upward force due to air resistance.
The net force is the algebraic sum of the downward and upward forces. Therefore, we
can write

\[
\begin{aligned}
F &= F_D + F_U \\
\Rightarrow\quad F &= mg - cv \\
\Rightarrow\quad ma &= mg - cv \qquad \left[ a = \frac{dv}{dt} \right] \\
\Rightarrow\quad \frac{dv}{dt} &= \frac{mg - cv}{m} \\
\Rightarrow\quad \frac{dv}{dt} &= g - \frac{c}{m} v
\end{aligned}
\tag{1.4}
\]

Equation (1.4) is a model that relates the acceleration of a falling object to the forces
acting on it. It is a differential equation because it is written in terms of the differential
rate of change (dv/dt) of the variable that we are interested in predicting. The exact
solution of Eq. (1.4) for the velocity of the falling object cannot be obtained using simple
algebraic manipulation. Rather, more advanced techniques, such as those of calculus,
must be applied to obtain an exact or analytical solution. For example, if the object is
initially at rest (v = 0 at t = 0), calculus can be used to solve Eq. (1.4) for
\[ v(t) = \frac{gm}{c} \left( 1 - e^{-(c/m)t} \right). \tag{1.5} \]
Equation (1.5) is a simple mathematical model of the falling object problem.
More generally, a mathematical model can be broadly defined as a functional
relationship of the form

Dependent variable = f (independent variables, parameters, forcing functions), (1.6)

where the dependent variable (e.g., v(t)) is a characteristic that usually reflects the
behavior or state of the system; the independent variables (e.g., t) are usually dimensions,
such as time and space, along which the system’s behavior is being determined. The
parameters (e.g., m, c) are reflective of the system’s properties and the forcing functions
(e.g., g) are external influences acting upon the system.
Now, we can solve a problem using the mathematical model derived in Eq. (1.5).

Example 1.1. A parachutist of mass 68.1 kg jumps out of a stationary hot air
balloon. Use Eq. (1.5) to compute velocity prior to opening the parachute. The
drag coefficient is equal to 12.5 kg/s.

Solution: Substituting the given values $m = 68.1$ kg, $g = 9.81$ m/s$^2$, and
$c = 12.5$ kg/s into Eq. (1.5), we get

\[ v(t) = \frac{9.81 \times 68.1}{12.5} \left( 1 - e^{-(12.5/68.1)t} \right)
        = 53.44 \left( 1 - e^{-0.18355t} \right) \]

which can be used to compute the velocity attained after time t. Here, the velocity
of the parachutist prior to opening the parachute is tabulated at different instants
of time.

t (s)    v (m/s)
0        0.00
2        16.42
4        27.80
6        35.68
8        41.14
10       44.92
12       47.54
∞        53.44
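
The table in Example 1.1 can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `analytical_velocity` and its default arguments are illustrative choices, with the parameter values taken from the example.

```python
import math

def analytical_velocity(t, m=68.1, c=12.5, g=9.81):
    """Velocity from Eq. (1.5) for an object released from rest (v = 0 at t = 0)."""
    return (g * m / c) * (1.0 - math.exp(-(c / m) * t))

# Tabulate the velocity every 2 s, as in Example 1.1.
for t in range(0, 14, 2):
    print(f"{t:4d} s   {analytical_velocity(t):6.2f} m/s")

# As t grows, the exponential term vanishes and v approaches g*m/c.
terminal_velocity = 9.81 * 68.1 / 12.5
print(f"terminal velocity: {terminal_velocity:.2f} m/s")
```

The last line corresponds to the row marked ∞ in the table.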

Example 1.1 is solved using the analytical approach, which gives the analytical (exact)
solution because it solves the mathematical model of the system exactly. The model
defined in Eq. (1.4) can be solved exactly, but only with techniques beyond simple
algebra, and many other models require one or more advanced techniques that demand
considerable effort. In many such cases, the only alternative is to develop a
numerical solution that approximates the exact solution.
Now let us reformulate the problem to find an approximate solution that is close to the
exact solution. This can be illustrated for Newton's second law by realizing that the
time rate of change of velocity can be approximated by (Fig. 1.2)

\[ \frac{dv}{dt} \cong \frac{\Delta v}{\Delta t}
   = \frac{v(t_{i+1}) - v(t_i)}{t_{i+1} - t_i} \tag{1.7} \]

Figure 1.2: Use of a finite difference to approximate the first derivative of v with
respect to t.

where $\Delta v$ and $\Delta t$ are differences in velocity and time, respectively,
computed over finite intervals, $v(t_i)$ is the velocity at an initial time $t_i$, and
$v(t_{i+1})$ is the velocity at some later time $t_{i+1}$. Note that
$dv/dt \cong \Delta v/\Delta t$ is approximate because $\Delta t$ is finite. Remember
from calculus that

\[ \frac{dv}{dt} = \lim_{\Delta t \to 0} \frac{\Delta v}{\Delta t} \]

Equation (1.7) represents the reverse process and is called a finite divided difference
approximation of the derivative at time $t_i$. Substituting it into Eq. (1.4), we get

\[ \frac{v(t_{i+1}) - v(t_i)}{t_{i+1} - t_i} = g - \frac{c}{m} v(t_i). \]

This equation can then be rearranged to yield

\[ v(t_{i+1}) = v(t_i) + \left[ g - \frac{c}{m} v(t_i) \right] (t_{i+1} - t_i). \tag{1.8} \]

Thus, the differential equation has been transformed into an equation that can be used to
determine the velocity algebraically at $t_{i+1}$ using the slope and previous values of
v and t. If you are given an initial value for velocity at some time $t_i$, you can easily
compute the velocity at a later time $t_{i+1}$. Let us solve the same problem
(Example 1.1) using the numerical approach formulated in Eq. (1.8).

Example 1.2. A parachutist of mass 68.1 kg jumps out of a stationary hot air
balloon. Compute the velocity attained by the parachutist at time t using the
approximate approach of Eq. (1.8). As in Example 1.1, the drag coefficient is
12.5 kg/s. Employ a step size of 2 s for the calculation.

Solution: At the start of the computation ($t_0 = 0$), the velocity of the parachutist
is zero ($v(t_0) = 0$). The step size is $t_{i+1} - t_i = 2$ s, so we can compute the
velocity $v(t_1)$ at $t_1 = 2$ s as

\[ v(t_1) = v(t_0) + \left[ 9.81 - \frac{12.5}{68.1} v(t_0) \right] \times 2 \tag{1.9} \]

\[ v(t_1) = 0 + \left[ 9.81 - \frac{12.5}{68.1} (0) \right] \times 2 = 19.62 \text{ m/s} \tag{1.10} \]

For the next interval (from t = 2 to 4 s), the computation is repeated, but this time
the updated velocity is used to compute the velocity in the next iteration.

\[ v(t_2) = v(t_1) + \left[ 9.81 - \frac{12.5}{68.1} v(t_1) \right] \times 2 \tag{1.11} \]

\[ v(t_2) = 19.62 + \left[ 9.81 - \frac{12.5}{68.1} (19.62) \right] \times 2 = 32.04 \text{ m/s} \tag{1.12} \]

The computed values of velocity at different instants of time are tabulated
as

t(sec) v (m/sec)
0 0.00
2 19.62
4 32.04
6 39.90
8 44.87
10 48.02
12 50.01
∞ 53.44

Please note that the computed values of velocities are approximated values.
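
The repeated application of Eq. (1.8) in Example 1.2 can be sketched as a short loop. The function name `euler_velocity` and its argument names are illustrative assumptions; the parameter values are those of the example.

```python
def euler_velocity(t_end, dt, m=68.1, c=12.5, g=9.81, v0=0.0):
    """March Eq. (1.8) from t = 0 to t_end with a fixed step size dt."""
    steps = round(t_end / dt)
    t, v = 0.0, v0
    history = [(t, v)]
    for _ in range(steps):
        # New velocity = old velocity + slope * step size, Eq. (1.8)
        v = v + (g - (c / m) * v) * dt
        t = t + dt
        history.append((t, v))
    return history

# Reproduce the table of Example 1.2 (step size of 2 s).
for t, v in euler_velocity(12.0, 2.0):
    print(f"{t:4.0f} s   {v:6.2f} m/s")
```

Each pass through the loop uses only the previous velocity, which is why an initial value is all that is needed to start the computation.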

Note that the technique used to solve Example 1.1 is called an analytical method. The
solution obtained using an analytical method is called an analytical solution; it is an
exact answer in the form of a mathematical expression in terms of the variables
associated with the problem being solved. In contrast, the technique used to solve
Example 1.2 is called a numerical method. Formally, numerical methods are used for
calculating approximate solutions to problems that cannot be solved (or are difficult
to solve) analytically. In other words, numerical methods are techniques by which
mathematical problems are formulated so that they can be solved with arithmetic
operations. Although numerical solutions are approximations, they can be very
accurate. In many numerical methods, the calculations are executed in an iterative
manner until a desired accuracy is achieved. As the previous examples suggest, a
computational price must be paid for more accurate numerical results: each halving of
the step size to attain more accuracy leads to a doubling of the number of
computations. Thus, we see that there is a trade-off between accuracy and
computational effort.
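
The trade-off can be observed directly for the falling-object problem. The sketch below compares the numerical velocity at t = 12 s with the exact value from Eq. (1.5) for step sizes of 2, 1, and 0.5 s. The helper names `exact` and `euler` and the choice of t = 12 s are illustrative assumptions.

```python
import math

M, C, G = 68.1, 12.5, 9.81  # values from Examples 1.1 and 1.2

def exact(t):
    """Analytical velocity, Eq. (1.5)."""
    return (G * M / C) * (1.0 - math.exp(-(C / M) * t))

def euler(t_end, dt):
    """Apply Eq. (1.8) repeatedly; return final velocity and number of steps."""
    v = 0.0
    steps = round(t_end / dt)
    for _ in range(steps):
        v += (G - (C / M) * v) * dt
    return v, steps

results = []
for dt in (2.0, 1.0, 0.5):
    v, steps = euler(12.0, dt)
    error = abs(exact(12.0) - v)
    results.append((dt, steps, error))
    print(f"dt = {dt:3.1f} s   steps = {steps:2d}   |error| = {error:.3f} m/s")
```

Each halving of the step size doubles the number of steps while the error shrinks, which is the trade-off described above.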

1.2 Error in numerical solutions


Numerical solutions can be very accurate, but in general they are not exact; they are
almost always associated with some error. If we compare the exact and approximate
solutions obtained in Examples 1.1 and 1.2, respectively, we find that the approximate
solution carries some error (see Fig. 1.3). It can be seen that the numerical
method captures the essential features of the exact solution. However, because we have
employed straight-line segments in the numerical method to approximate a continuously
curving function, there is some discrepancy between the two results. One way to minimize
such discrepancies is to use a smaller step size.

Figure 1.3: Comparison of the analytical and numerical solutions for the falling
object problem

Two kinds of errors are introduced when numerical methods are used for solving a
problem.

• The first kind is round-off error. Round-off errors occur because of the way that
we (or digital computers) store numbers and execute numerical operations
(discussed in Section 1.3).

• The second kind of error is introduced by the numerical method that is used for
the solution. These errors are called truncation errors.

Together, the two errors constitute the total error of the numerical solution, which
is the difference (can be defined in various ways) between the true (exact) solution (which
is usually unknown) and the approximate numerical solution.

1.2.1 Round-off Errors


A mathematical quantity or real number x is not always stored exactly. Instead,
a machine (or computer) stores and processes a number in a standard form that supports
a trade-off between range and precision:

\[ \text{mantissa} \times 10^{\text{exponent}} \quad \text{or} \quad \text{mantissa} \times 2^{\text{exponent}} \]

Here, 10 and 2 are the bases used to represent decimal and binary numbers,
respectively. A computer's representation of real numbers is limited to the fixed
precision of the mantissa, so true values are sometimes not stored exactly.
Numbers are represented on a computer by a finite number of bits.
Consequently, real numbers that have a mantissa longer than the number of bits that are
available for representing them have to be shortened. The actual number that is stored
in the computer may undergo chopping or rounding of the last digit. A number can be
shortened either by

• Chopping off the extra digits:

– In chopping, the digits in the mantissa beyond the length that can be stored
are simply left out.
– For illustration, consider the number 2/3. In decimal form with four significant
digits, 2/3 can be written as 0.6666.

• Rounding:

– In rounding, the last digit that is stored is rounded. Ex: 2/3 can be written
as 0.6667.

Either way, such chopping and rounding of real numbers lead to errors in numerical
computations, especially when many operations are performed. This is called Round-off
error.
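
The difference between chopping and rounding can be illustrated with Python's standard `decimal` module, which lets the number of significant digits and the rounding rule be set explicitly. This is only a sketch in decimal arithmetic with four significant digits, mirroring the 2/3 illustration above; a real computer shortens the binary mantissa instead.

```python
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_UP

# Two arithmetic contexts that keep 4 significant digits each.
chop_ctx = Context(prec=4, rounding=ROUND_DOWN)      # chopping: drop extra digits
round_ctx = Context(prec=4, rounding=ROUND_HALF_UP)  # rounding: round the last digit

chopped = chop_ctx.divide(Decimal(2), Decimal(3))
rounded = round_ctx.divide(Decimal(2), Decimal(3))

# Reference value with the default 28 significant digits.
reference = Decimal(2) / Decimal(3)

print("chopped:", chopped, " error:", abs(reference - chopped))
print("rounded:", rounded, " error:", abs(reference - rounded))
```

For this number, rounding leaves a smaller error than chopping, although both stored values differ from 2/3.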

1.2.2 Truncation Error
Truncation error usually refers to the error introduced when a more complicated
mathematical expression is "replaced" with a more elementary formula, as we did in
Example 1.2. For better understanding, consider the infinite Taylor series expansion
of the sine function,

\[ \sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \frac{x^9}{9!} - \frac{x^{11}}{11!} + \cdots \tag{1.13} \]

which might be replaced with just the first one or two terms. For example, if only the
first term is used,

\[ \sin\left(\frac{\pi}{6}\right) \approx \frac{\pi}{6} = 0.5235988 \]

\[ E_{Trunc} = 0.5 - 0.5235988 = -0.0235988 \]

If two terms of the Taylor series are used,

\[ \sin\left(\frac{\pi}{6}\right) \approx \frac{\pi}{6} - \frac{(\pi/6)^3}{3!} = 0.4996742 \]

\[ E_{Trunc} = 0.5 - 0.4996742 = 0.0003258 \]

The truncation error depends on the specific numerical method or algorithm used to
solve a problem.
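
The effect of keeping more terms of Eq. (1.13) can be checked with a short script. The function name `sin_taylor` is an illustrative assumption; the true value sin(π/6) = 0.5 is used as the reference, as in the text.

```python
import math

def sin_taylor(x, n_terms):
    """Sum the first n_terms terms of the Taylor series of sin(x), Eq. (1.13)."""
    total = 0.0
    for k in range(n_terms):
        total += (-1) ** k * x ** (2 * k + 1) / math.factorial(2 * k + 1)
    return total

x = math.pi / 6
for n in (1, 2, 3):
    approx = sin_taylor(x, n)
    print(f"{n} term(s): {approx:.7f}   truncation error = {0.5 - approx:+.7f}")
```

The magnitude of the truncation error drops quickly as terms are added, but it is zero only for the full infinite series.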

Example 1.3. The Taylor series expansion of cos(x) is given by:

\[ \cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \frac{x^8}{8!} - \frac{x^{10}}{10!} + \cdots \tag{1.14} \]
Use the first three terms to calculate the value of cos(π/4). Use the decimal format
with six significant digits (apply rounding at each step). Calculate the truncation
error.

1.2.3 Total Error


Together, the round-off and truncation errors yield the total numerical error that is
included in the numerical solution. This total error, also called the absolute error,
is the difference between the true (exact) solution and the numerical solution. Suppose
that $\hat{p}$ is an approximation to $p$. Then, we can define the absolute error and
the relative error as
• Absolute error: The absolute error is simply the magnitude of the difference between
the true value and the approximate value,

\[ E_p = |p - \hat{p}| \tag{1.15} \]

• Relative error: The relative error expresses the error as a fraction of the true
value (multiply by 100 to obtain a percentage),

\[ R_p = \frac{|p - \hat{p}|}{|p|}, \quad \text{provided that } p \neq 0 \tag{1.16} \]

The number $\hat{p}$ is said to approximate $p$ to $d$ significant digits if $d$ is the
largest non-negative integer for which

\[ \frac{|p - \hat{p}|}{|p|} < \frac{10^{1-d}}{2}. \tag{1.17} \]

Example 1.4. Compute the absolute error and relative error in Example 1.2 at
time instances 2s, 8s, 20s, and 100s.

Example 1.5. Find the error and relative error in the following three cases.

(i) Let $x = 3.141592$ and $\hat{x} = 3.14$
(ii) Let $y = 1,000,000$ and $\hat{y} = 999,996$
(iii) Let $z = 0.000012$ and $\hat{z} = 0.000009$

Solution:

(i) Given $x = 3.141592$ and $\hat{x} = 3.14$, the error is

\[ E_x = |x - \hat{x}| = |3.141592 - 3.14| = 0.001592 \]

and the relative error is

\[ R_x = \frac{|x - \hat{x}|}{|x|} = \frac{0.001592}{3.141592} = 0.000507 \]

(ii) Given $y = 1,000,000$ and $\hat{y} = 999,996$, the error is

\[ E_y = |y - \hat{y}| = |1,000,000 - 999,996| = 4 \]

and the relative error is

\[ R_y = \frac{|y - \hat{y}|}{|y|} = \frac{4}{1,000,000} = 0.000004 \]

(iii) Given $z = 0.000012$ and $\hat{z} = 0.000009$, the error is

\[ E_z = |z - \hat{z}| = |0.000012 - 0.000009| = 0.000003 \]

and the relative error is

\[ R_z = \frac{|z - \hat{z}|}{|z|} = \frac{0.000003}{0.000012} = 0.25 \]
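
The definitions in Eqs. (1.15) to (1.17) translate directly into code. The sketch below applies them to the three cases of Example 1.5. The function names and the `max_digits` cap (which only prevents an endless loop when the error is zero) are illustrative assumptions.

```python
def absolute_error(true_value, approx):
    """Eq. (1.15): magnitude of the difference."""
    return abs(true_value - approx)

def relative_error(true_value, approx):
    """Eq. (1.16): absolute error as a fraction of the true value."""
    if true_value == 0:
        raise ValueError("relative error is undefined for a true value of zero")
    return abs(true_value - approx) / abs(true_value)

def significant_digits(true_value, approx, max_digits=16):
    """Largest d with relative error < 10**(1 - d) / 2, Eq. (1.17)."""
    r = relative_error(true_value, approx)
    d = 0
    while d < max_digits and r < 10.0 ** (1 - (d + 1)) / 2:
        d += 1
    return d

cases = [(3.141592, 3.14), (1_000_000, 999_996), (0.000012, 0.000009)]
for p, p_hat in cases:
    print(p, p_hat,
          absolute_error(p, p_hat),
          relative_error(p, p_hat),
          significant_digits(p, p_hat))
```

The second and third cases show why the relative error is often more informative: an absolute error of 4 is negligible for a value of one million, while an absolute error of 0.000003 is a quarter of the true value in case (iii).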

1.3 Number representation on a computer

Human beings do arithmetic using the decimal (base 10) number system. Most
computers do arithmetic using the binary (base 2) number system. It may seem
otherwise, since communication with the computer (input/output) is in base 10
numbers. This transparency does not mean that the computer uses base 10. In fact, it
converts inputs to base 2 (or perhaps base 16), then performs base 2 arithmetic, and
finally translates the answer into base 10 before it displays a result.

1.3.1 Scientific number representation

A standard way to present a real number, called scientific notation, is obtained by
shifting the decimal point and supplying an appropriate power of 10. For example,

\[ 0.0000747 = 7.47 \times 10^{-5} \]
\[ 31.4159265 = 3.14159265 \times 10 \tag{1.18} \]
\[ 9,700,000,000 = 9.7 \times 10^{9} \]

In computer science, $1K = 1.024 \times 10^{3}$.

1.3.2 Decimal and binary representation

• Decimal representation:
Numbers can be represented in various forms. The familiar decimal system (base
10) uses ten digits 0, 1, . . . , 9. A number is written as a sequence of digits that
correspond to multiples of powers of 10. As shown in Fig. 1.4, the first digit to
the left of the decimal point corresponds to $10^0$. The digit next to it on the left
corresponds to $10^1$, the next digit to the left to $10^2$, and so on. In the same way,
the first digit to the right of the decimal point corresponds to $10^{-1}$, the next digit
to the right to $10^{-2}$, and so on.

Figure 1.4: Decimal representation of a real number (the number 60,724.3125 in the
decimal system, base 10).

• Binary representation:
Similarly, the binary system (base 2) uses two digits, 0 and 1. A number is then written
as a sequence of zeros and ones that correspond to multiples of powers of 2. The
first digit to the left of the binary point corresponds to $2^0$. The digit next to it on
the left corresponds to $2^1$, the next digit to the left to $2^2$, and so on. In the same
way, the first digit to the right of the binary point corresponds to $2^{-1}$, the next
digit to the right to $2^{-2}$, and so on. For example, the number 19.625 is written in
binary as 10011.101:

\[ 1 \times 2^4 + 0 \times 2^3 + 0 \times 2^2 + 1 \times 2^1 + 1 \times 2^0 + 1 \times 2^{-1} + 0 \times 2^{-2} + 1 \times 2^{-3} \]
\[ = 1 \times 16 + 0 \times 8 + 0 \times 4 + 1 \times 2 + 1 \times 1 + 1 \times 0.5 + 0 \times 0.25 + 1 \times 0.125 = 19.625 \]

Another example is shown in Fig. 1.5, where the number 60,724.3125 is written in
binary form as 1110110100110100.0101:

\[ 1 \times 2^{15} + 1 \times 2^{14} + 1 \times 2^{13} + 0 \times 2^{12} + 1 \times 2^{11} + 1 \times 2^{10} + 0 \times 2^{9} + 1 \times 2^{8} + 0 \times 2^{7} + 0 \times 2^{6} + 1 \times 2^{5} \]
\[ + 1 \times 2^{4} + 0 \times 2^{3} + 1 \times 2^{2} + 0 \times 2^{1} + 0 \times 2^{0} + 0 \times 2^{-1} + 1 \times 2^{-2} + 0 \times 2^{-3} + 1 \times 2^{-4} = 60,724.3125 \]

Figure 1.5: Binary representation of a real number (the number 60,724.3125 in the
binary system, base 2).

As mentioned earlier, most computers store and process numbers in binary (base 2)
form. For real numbers, however, they use a normalized floating-point binary
representation.
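
The positional expansion described above can be sketched as a small conversion routine. The function name `to_binary` and the `max_frac_bits` limit are illustrative assumptions; the routine handles non-negative numbers only and stops after a fixed number of fractional bits, since many decimal fractions have no finite binary expansion.

```python
def to_binary(value, max_frac_bits=16):
    """Return the base-2 digits of a non-negative number as a string."""
    integer_part = int(value)
    fraction = value - integer_part

    # Integer part: Python's bin() gives the digits after the '0b' prefix.
    int_bits = bin(integer_part)[2:]

    # Fractional part: repeatedly multiply by 2 and peel off the integer digit.
    frac_bits = []
    while fraction > 0 and len(frac_bits) < max_frac_bits:
        fraction *= 2
        bit = int(fraction)
        frac_bits.append(str(bit))
        fraction -= bit

    return int_bits + ("." + "".join(frac_bits) if frac_bits else "")

print(to_binary(19.625))
print(to_binary(60724.3125))
print(to_binary(0.1))  # cut off after max_frac_bits: 0.1 has no finite binary form
```

The last call hints at the origin of round-off error: a number as simple as 0.1 cannot be written with a finite number of binary digits.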

1.3.3 Floating-point representation


To accommodate large and small numbers, real numbers are written in floating-point
representation.

Decimal floating point representation has the form

\[ d.dddddd \times 10^{p} \tag{1.19} \]

One digit is written to the left of the decimal point, and the rest of the significant digits
are written to the right of the decimal point (normalized form). The decimal floating
point representation is also known as scientific notation (see Section 1.3.1). The number
0.dddddd is called the mantissa. The power of 10, $p$, represents the number's order of
magnitude, provided the preceding number is smaller than 5. Otherwise, the number is
said to be of the order of $p + 1$. Thus, the number $3.91 \times 10^{-6}$ is of the order
of $10^{-6}$, written as $O(10^{-6})$, and the number $6.51923 \times 10^{3}$ is of the
order of $10^{4}$, written as $O(10^{4})$.

Example 1.6. Floating Point Addition: Add the following two decimal
numbers in scientific notation: $8.70 \times 10^{-1}$ and $9.95 \times 10^{1}$

Solution: Rewrite the smaller number such that its exponent matches the
exponent of the larger number.

\[ 8.70 \times 10^{-1} = 0.087 \times 10^{1} \]

Add the mantissas

\[ 9.95 + 0.087 = 10.037 \quad \text{and write the sum} \quad 10.037 \times 10^{1} \]

Put the result in normalised form

\[ 10.037 \times 10^{1} = 1.0037 \times 10^{2} \quad \text{(shift mantissa, adjust exponent)} \]

Round the result: If the mantissa does not fit in the space reserved for it, it has
to be rounded off. For example, if only 4 digits are allowed for the mantissa,

\[ 1.0037 \times 10^{2} \Rightarrow 1.004 \times 10^{2} \]
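
The result of Example 1.6 can be checked with Python's standard `decimal` module, which performs decimal floating-point addition with a chosen number of significant digits. This is a sketch only: the module aligns, adds, normalises, and rounds internally rather than exposing the individual steps, and the half-up rounding rule is an assumption.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

# A context that keeps 4 significant digits, as in the last step of Example 1.6.
ctx = Context(prec=4, rounding=ROUND_HALF_UP)

a = Decimal("8.70E-1")
b = Decimal("9.95E+1")

exact_sum = a + b            # default context: 28 digits, no rounding needed here
rounded_sum = ctx.add(a, b)  # rounded to a 4-digit mantissa

print("exact sum:  ", exact_sum)
print("rounded sum:", rounded_sum)
```

The rounded sum equals 1.004 × 10^2, in agreement with the hand calculation.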

Binary floating point representation has the form:

\[ 1.bbbbbb \times 2^{bbb} \quad \text{($b$ is a binary digit)} \tag{1.20} \]

In this form, the mantissa is .bbbbbb, and the power of 2 is called the exponent. Both
the mantissa and the exponent are written in binary form. The form in Eq. (1.20)
is obtained by normalizing the number (when it is written in the decimal form) with
respect to the largest power of 2 that is smaller than the number itself. If there are eight
choices for the mantissa and eight choices for the exponent, then this produces a set of 64
numbers

\[ \{ 1.000_{two} \times 2^{-3},\ 1.001_{two} \times 2^{-3},\ \ldots,\ 1.110_{two} \times 2^{4},\ 1.111_{two} \times 2^{4} \} \]
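
The 64-number set can be enumerated explicitly to see how coarse such a system is. The sketch assumes, as the set above suggests, a 3-bit mantissa (values 1.000 to 1.111 in binary) and exponents from -3 to 4; the variable names are illustrative.

```python
# Eight mantissas 1.000, 1.001, ..., 1.111 (binary), i.e. 1 + k/8 for k = 0..7.
mantissas = [1 + k / 8 for k in range(8)]

# Eight exponents -3, -2, ..., 4.
exponents = range(-3, 5)

values = sorted(m * 2.0 ** e for m in mantissas for e in exponents)

print("count:   ", len(values))
print("smallest:", values[0])
print("largest: ", values[-1])

# The spacing between neighbouring numbers grows with their magnitude.
print("gap near the smallest value:", values[1] - values[0])
print("gap near the largest value: ", values[-1] - values[-2])
```

Every real number between the smallest and largest value has to be replaced by one of these 64 numbers, and the gaps widen as the numbers grow.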

Example 1.7. Write the number 50 in binary floating point representation.

Solution: To write the number 50 in binary floating point representation, the
number is divided (and multiplied) by $2^5 = 32$ (which is the largest power of 2 that
is smaller than 50):

\[ 50 = \frac{50}{2^5} \times 2^5 = 1.5625 \times 2^5 \qquad \text{Binary floating point form: } 1.1001 \times 2^{101} \tag{1.21} \]

Floating-point operations:

Example 1.8. Addition in binary: Perform

0.5 + (−0.4375)

Solution:

\[ 0.5 = 0.1 \times 2^{0} = 1.000 \times 2^{-1} \quad \text{(normalised)} \]

\[ -0.4375 = -0.0111 \times 2^{0} = -1.110 \times 2^{-2} \quad \text{(normalised)} \]

Rewrite the smaller number such that its exponent matches the exponent of
the larger number.

\[ -1.110 \times 2^{-2} = -0.1110 \times 2^{-1} \]

Add the mantissas:

\[ 1.000 \times 2^{-1} + (-0.1110 \times 2^{-1}) = 0.001 \times 2^{-1} \]

Normalise the sum, checking for overflow/underflow:

\[ 0.001 \times 2^{-1} = 1.000 \times 2^{-4} \]

\[ -126 \le -4 \le 127 \Rightarrow \text{No overflow or underflow} \]

Round the sum: The sum fits in 4 bits, so rounding is not required.
Check: $1.000 \times 2^{-4} = 0.0625$, which is equal to $0.5 - 0.4375$.

 
Example 1.9. Compute $\left( \frac{1}{10} + \frac{1}{5} \right) + \frac{1}{6}$ if a
computer had only a 3-bit mantissa and an exponent
$n \in \{-3, -2, -1, 0, 1, 2, 3, 4\}$.

Solution: First we need to compute $\frac{1}{10} + \frac{1}{5}$. To write the binary
representation of $\frac{1}{10}$ with a 3-bit mantissa, we first convert the number to
normalized form:

\[ \frac{1}{10} = 0.1 = \frac{0.1}{2^{-4}} \times 2^{-4} = 1.6 \times 2^{-4} \]

Now, we can write the binary representation of $\frac{1}{10}$ as

\[ \left( \frac{1}{10} \right)_{10} = (1.101)_2 \times 2^{-4} \quad \text{(after rounding)} \]

Similarly, the binary representation of $\frac{1}{5}$ is

\[ \frac{1}{5} = 0.2 = \frac{0.2}{2^{-3}} \times 2^{-3} = 1.6 \times 2^{-3} \]

\[ \left( \frac{1}{5} \right)_{10} = (1.101)_2 \times 2^{-3} \quad \text{(after rounding)} \]

The next step is to rewrite the smaller number such that its exponent matches the
exponent of the larger number. Therefore, we can write

\[ \left( \frac{1}{10} \right)_{10} = (0.1101)_2 \times 2^{-3} \]

Now, we can add the mantissas of these two numbers as

\[ (0.1101)_2 + (1.1010)_2 = (10.0111)_2 \]

\[ \left( \frac{1}{10} + \frac{1}{5} \right)_{10} = (10.0111)_2 \times 2^{-3} = (1.00111)_2 \times 2^{-2} \]

After rounding,

\[ 1.00111 \times 2^{-2} = 1.010 \times 2^{-2} \]

Now, we need to add $\frac{1}{6}$ to the previously computed value:

\[ \frac{1}{6} = (1.3333)_{10} \times 2^{-3} = (1.011)_2 \times 2^{-3} \]

Now, we can add the mantissas of these two numbers after equating the exponents:

\[ (0.1011)_2 + (1.0100)_2 = (1.1111)_2 \]

\[ \left( \frac{1}{10} + \frac{1}{5} + \frac{1}{6} \right)_{10} = (1.1111)_2 \times 2^{-2} = (1.000)_2 \times 2^{-1} \quad \text{(after rounding)} \]

After several roundings, the calculated approximate value is

\[ (1.000)_2 \times 2^{-1} = (0.5)_{10} \]

\[ \text{Actual value} = \frac{1}{10} + \frac{1}{5} + \frac{1}{6} = \frac{7}{15} = 0.466667 \]

\[ \text{Round-off error} = |\text{Actual value} - \text{Approx. value}| = |0.466667 - 0.500000| = 0.033333 \]

\[ \text{\% of Error} = \frac{0.033333}{0.466667} = 0.0714 = 7.14\% \]
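
The hand calculation in Example 1.9 can be imitated by rounding every stored value and every intermediate sum to a normalized binary number with three bits after the leading 1. The helper `round_to_mantissa` is a hypothetical name, and rounding ties upward is an assumption chosen to match the rounding steps of the example.

```python
import math

def round_to_mantissa(x, frac_bits=3):
    """Round x to the nearest value of the form 1.bbb (binary) * 2**e."""
    if x == 0:
        return 0.0
    # frexp gives x = m * 2**e with 0.5 <= |m| < 1, so the normalized
    # exponent (leading digit 1) is e - 1.
    _, e = math.frexp(x)
    exponent = e - 1
    spacing = 2.0 ** (exponent - frac_bits)  # value of the last kept bit
    # Round the magnitude to a multiple of the spacing, ties rounded up.
    return math.copysign(math.floor(abs(x) / spacing + 0.5) * spacing, x)

a = round_to_mantissa(1 / 10)            # 1.101 (binary) * 2**-4
b = round_to_mantissa(1 / 5)             # 1.101 (binary) * 2**-3
c = round_to_mantissa(1 / 6)             # 1.011 (binary) * 2**-3

partial = round_to_mantissa(a + b)       # 1.010 (binary) * 2**-2
approx = round_to_mantissa(partial + c)  # 1.000 (binary) * 2**-1

actual = 1 / 10 + 1 / 5 + 1 / 6
print("approximate value:", approx)
print("actual value:     ", actual)
print("relative error:   ", abs(actual - approx) / actual)
```

Because rounding is applied after every operation, the individual errors accumulate into a relative error of about 7%, far larger than the error of any single rounded input.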

1.3.4 Single and double precision floating point representation


Once in binary floating point representation, the number is stored in the computer. The
computer stores the values of the exponent and the mantissa separately, while the leading
1 in front of the binary point is not stored. As already mentioned, a bit is a binary digit.
The memory in the computer is organized in bytes, where each byte is 8 bits.
According to the IEEE-754 standard (1985), computers store numbers and carry out
calculations in single precision or in double precision. In single precision, the numbers
are stored in a string of 32 bits (4 bytes), and in double precision in a string of 64 bits (8
bytes). In both cases, the first bit stores the sign (0 corresponds to + and 1 corresponds
to −) of the number. The next 8 bits in single precision (11 bits in double precision) are
used for storing the exponent. The following 23 bits in single precision (52 bits in double
precision) are used for storing the mantissa (see Figure 1.6).

Figure 1.6: Floating-point representation in double precision: sign (1 bit),
exponent + bias (11 bits), mantissa (52 bits).

The value of the mantissa is entered in binary form. The value of the exponent is entered
with a bias. A bias means that a constant is added to the value of the exponent. The bias
is introduced in order to avoid using one of the bits for the sign of the exponent (since the
exponent can be positive or negative). In binary notation, the largest number that can
be written with 11 bits is 2047 (when all 11 digits are 1). The bias that is used is 1023,
which means that if, for example, the exponent is 4, then the value that is stored is 4 +
1023 = 1027. Thus, the smallest exponent that can be stored by the computer is −1023,
and the largest is 1024 (which will be stored as 2047). However, the smallest and largest
values of the exponent plus bias are reserved for zero and for infinity (Inf) or not-a-number
(NaN, the result of an invalid mathematical operation). The 11 bits for the exponent plus
bias therefore hold stored values between 0 and 2047, which correspond to exponents
between −1023 and 1024. If the exponent plus bias and the mantissa are both
zero, then the number actually stored is 0. If the exponent plus bias is 2047, then the
number stored is Inf if the mantissa is zero, and it is NaN if the mantissa is not zero. In
single precision, 8 bits are allocated to the value of the exponent and the bias is 127.

Example 1.10. How can the number 22.5 be stored in double precision according
to the IEEE-754 standard?

Solution: First, the number is normalized:

\[ \frac{22.5}{2^4} \times 2^4 = 1.40625 \times 2^4 \]

In double precision, the exponent with the bias is 4 + 1023 = 1027, which is
stored in binary form as 10000000011. The mantissa is 0.40625, which is stored
in binary form as .01101000....000. The storage of the number is illustrated below.

Figure 1-7: Storing the number 22.5 in double precision according to the IEEE-754
standard: sign (1 bit), exponent + bias (11 bits), mantissa (52 bits).
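
The stored bit pattern of Example 1.10 can be inspected with Python's standard `struct` module, assuming (as is the case on common platforms) that Python floats are IEEE-754 double-precision numbers. The function name `double_fields` is illustrative.

```python
import struct

def double_fields(x):
    """Split a double into its sign bit, biased exponent, and mantissa bits."""
    # Pack as a big-endian 8-byte double, then reinterpret as a 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                   # 1 bit
    exponent = (bits >> 52) & 0x7FF     # 11 bits, exponent plus bias
    mantissa = bits & ((1 << 52) - 1)   # 52 bits, leading 1 not stored
    return sign, exponent, mantissa

sign, biased_exponent, mantissa = double_fields(22.5)
print("sign:           ", sign)
print("exponent + bias:", biased_exponent, "=", format(biased_exponent, "011b"))
print("mantissa:       ", format(mantissa, "052b"))
```

The biased exponent prints as 1027 (binary 10000000011), and the mantissa field begins with 01101 followed by zeros, in agreement with the example.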
Example 1.11. How can the number 22.5 be stored in single precision according
to the IEEE-754 standard?

Additional notes
• The smallest positive number that can be expressed in double precision is:

\[ 2^{-1022} \approx 2.2 \times 10^{-308} \]

This means that there is a (small) gap between zero and the smallest
number that can be stored on the computer. Attempts to define a
number in this gap cause an underflow error. (In the same way, the
closest negative number to zero is $-2.2 \times 10^{-308}$.)

• The largest positive number that can be expressed in double precision
is approximately:

1.4 Types of problems to be solved

(i) Roots of equations: Solve $f(x) = 0$ for $x$.

(ii) Linear algebraic equations: Given the a's and c's, solve

\[ a_{11} x_1 + a_{12} x_2 = c_1 \]
\[ a_{21} x_1 + a_{22} x_2 = c_2 \]

(iii) Optimization: Determine $x$ that gives the optimum $f(x)$.

(iv) Curve fitting

(v) Interpolation

(vi) Integration: Find the area under the curve,

\[ I = \int_a^b f(x)\, dx \]

(vii) Ordinary differential equations: Given

\[ \frac{dy}{dt} \approx \frac{\Delta y}{\Delta t} = f(t, y) \]

solve for $y$ as a function of $t$:

\[ y_{i+1} = y_i + f(t_i, y_i) \Delta t \]

(viii) Partial differential equations: Given

\[ \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = f(x, y) \]

solve for $u$ as a function of $x$ and $y$.