CIT335
NATIONAL OPEN UNIVERSITY OF NIGERIA
FACULTY OF SCIENCES
LAGOS OFFICE
14/16 Ahmadu Bello Way
Victoria Island , Lagos
e-mail:[email protected]
URL: www.nou.edu.ng
Published by
National Open University of Nigeria
ISBN:
2022
All Rights Reserved
Introduction
The overall aim of this course is to introduce you to computational science and
numerical methods. Topics related to machine numbers, Least Square
Approximation, Computer Arithmetic and Accumulated Errors are equally
discussed.
The bottom-up approach is adopted in structuring this course. We start with the basic
Machine Arithmetic and Related Matter concepts and move on to the fundamental
principles of Approximation and Interpolation.
The overall aim and objectives of this course provide guidance on what you should
be achieving in the course of your studies. Each unit also has its own objectives, which state specifically what you should achieve in the corresponding unit. To
evaluate your progress continuously, you are expected to refer to the overall course
aims and objectives as well as the corresponding unit objectives upon the completion
of each.
Course Aims
The overall aim and objectives of this course include:
Course Objectives
Upon completion of the course, you should be able to:
Course Materials
Online Materials
Feel free to refer to the websites provided for all the online reference materials required in
this course. The website is designed to integrate with the print-based course
materials. The structure follows the structure of the units and all the reading and
activity numbers are the same in both media.
Study Units
There are 3 modules in this course. Each module comprises various units which you are
expected to complete in 3 hours. The 3 modules and their units are listed below.
The questions addressed in this first chapter are fundamental in the sense that they are relevant in
any situation that involves numerical machine computation, regardless of the kind of problem that
gave rise to these computations. In the first place, one has to be aware of the rather primitive type
of number system available on computers. It is basically a finite system of numbers of finite length,
thus a far cry from the idealistic number system familiar to us from mathematical analysis. The
passage from a real number to a machine number entails rounding, and thus small errors, called
roundoff errors. Additional errors are introduced when the individual arithmetic operations are
carried out on the computer. In themselves, these errors are harmless, but acting in concert and propagating through a long computation, they can significantly affect the accuracy of the final results.
2.0 OBJECTIVES
By the end of this unit, you should be able to:
Explain real number
Describe machine numbers
Identify fixed-point numbers
Explain other data structures for numbers
Describe the rounding in machine numbers
We begin with the number system commonly used in mathematical analysis and confront it with the more
primitive number system available to us on any particular computer. We identify the basic constant (the
machine precision) that determines the level of precision attainable on such a computer.
One can introduce real numbers in many different ways. Mathematicians favor the axiomatic approach,
which leads them to define the set of real numbers as a "complete Archimedean ordered field". Here we
adopt a more pedestrian attitude and consider the set of real numbers ℝ to consist of positive and negative
numbers represented in some appropriate number system and manipulated in the usual manner known from
elementary arithmetic. We adopt here the binary number system, since it is the one most commonly used
on computers. Thus,
It is important to note that in general we need infinitely many binary digits to represent a real number. We
conveniently write such a number in the abbreviated form (familiar from the decimal number system)
x = ± (b_n b_{n−1} … b_0 . b_{−1} b_{−2} b_{−3} …)_2 .   (1.3)
Where the subscript 2 at the end is to remind us that we are dealing with a binary number. (Without this
subscript, the number could also be read as a decimal number, which would be a source of ambiguity). The
dot in (1.3) – appropriately called the binary point – separates the integer part on the left from the fractional
part on the right. Note that representation (1.3) is not unique; for example, (0.0111…)_2 = (0.1)_2. Further examples:
2.  (0.010101…)_2 = Σ_{k=2, k even}^∞ 2^{−k} = Σ_{m=1}^∞ 2^{−2m} = Σ_{m=1}^∞ (1/4)^m = (1/4)/(1 − 1/4) = 1/3 = (0.333…)_10 .
3.  1/5 = (0.2)_10 = (0.00110011…)_2 , the group of digits 0011 repeating indefinitely.
To determine the binary digits on the right of the binary point, one keeps multiplying by 2 and observing the integer part of the
result; if it is zero, the binary digit in question is 0, otherwise it is 1. In the latter case, the integer part is removed before the next multiplication.
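To make the procedure concrete, here is a minimal Python sketch (the function name binary_fraction_digits is ours, not taken from the text):

def binary_fraction_digits(x, n):
    """Return the first n binary digits of a fraction 0 <= x < 1,
    obtained by repeatedly multiplying by 2 and recording the integer part."""
    digits = []
    for _ in range(n):
        x *= 2
        d = int(x)        # the integer part is the next binary digit (0 or 1)
        digits.append(d)
        x -= d            # remove the integer part before the next multiplication
    return digits

# 1/5 = (0.2)_10 has the infinite binary expansion 0.00110011...
# (0.2 is itself stored inexactly, so only the leading digits are meaningful)
print(binary_fraction_digits(0.2, 12))   # [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]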
The last example is of interest insofar as it shows that to a finite decimal number there may correspond a
(nontrivial) infinite binary representation. One cannot assume, therefore, that a finite decimal number is
exactly representable on a binary computer. Conversely, however, to a finite binary number there always corresponds a finite decimal number.
There are two kinds of machine numbers: floating point and fixed point. The first corresponds to the
"scientific notation" in the decimal system, whereby a number is written as a decimal fraction times an
integral power of 10. The second allows only for fractions. On a binary computer, one consistently uses
powers of 2 instead of 10. More important, the number of binary digits, both in the fraction and in the
exponent of 2 (if any), is finite and cannot exceed certain limits that are characteristic of the particular
computer at hand.
We denote by t the number of binary digits allowed by the computer in the fractional part and by s the
number of binary digits in the exponent. Then the (real) floating-point numbers on that computer will be
Here all 𝑏𝑖 and 𝑐𝑗 are binary digits, that is, either zero or one. The binary fraction f is usually referred to as
the mantissa of x and the integer e as the exponent of x. The number x in (1.4) is said to be normalized if in
its fraction f we have 𝑏−1 = 1. We assume that all numbers in ℝ(𝑡, 𝑠) are normalized (with the exception
of x = 0, which is treated as a special number). If x ≠ 0 were not normalized, we could multiply f by an appropriate power of 2 to normalize it and adjust the exponent accordingly. This is always possible within the available exponent range.
Fig. 1.1  A machine register: the mantissa f occupies t bits and the exponent e occupies s bits
We can think of a floating-point number (1.4) as being accommodated in a machine register as shown in
Fig. 1.1. The figure does not quite correspond to reality, but is close enough to it for our purposes.
Note that the set (1.4) of normalized floating-point numbers is finite and is thus represented by a finite set
of points on the real line. What is worse, these points are not uniformly distributed (cf. Ex. 1). This, then,
It is immediately clear from (1.4) and (1.5) that the largest and smallest magnitude of a (normalized)
On a Sun Sparc workstation, for example, one has t = 23, s = 7, so that the maximum and minimum in (1.6)
are 1.70 × 10^38 and 2.94 × 10^−39, respectively. (Because of an asymmetric internal hardware representation
of the exponent on these computers, the true range of floating-point numbers is slightly shifted, more like
from 1.18 × 10^−38 to 3.40 × 10^38.) Matlab arithmetic, essentially double precision, uses t = 53 and s = 10,
which greatly expands the number range, to something like 10^−308 to 10^+308.
A real nonzero number whose modulus is not in the range determined by (1.6) cannot be represented on
this particular computer. If such a number is produced during the course of a computation, one says that
overflow has occurred if its modulus is larger than the maximum in (1.6) and underflow if it is smaller than
the minimum in (1.6). The occurrence of overflow is fatal, and the machine (or its operating system) usually
prompts the computation to be interrupted. Underflow is less serious, and one may get away with replacing
the delinquent number by zero. However, this is not foolproof. Imagine that at the next step the number that
underflowed is to be multiplied by a huge number. If the replacement by zero has been made, the result will be zero rather than a quantity of possibly significant magnitude.
To increase the precision, one can use two machine registers to represent a machine number. In effect, one
then embeds ℝ(t, s) ⊂ ℝ(2t, s), and calls x ∈ ℝ(2t, s) a double-precision number.
This is the case of (1.4) where e = 0. That is, fixed-point numbers are binary fractions, x = f, hence | f | < 1.
We can therefore only deal with numbers that are in the interval (-1,1). This, in particular, requires extensive
scaling and rescaling to make sure that all initial data, as well as all intermediate and final results, lie in that
interval. Such a complication can only be justified in special circumstances where machine time and/or
precision is at a premium. Note that on the same computer as considered before, we do not need to allocate
space for the exponent in the machine register, and thus have in effect s + t binary digits available for the fraction.
Complex floating-point numbers consist of pairs of real floating-point numbers, the first of the pair
representing the real part and the second the imaginary part. To avoid rounding errors in arithmetic
operations altogether, one can employ rational arithmetic, in which each (rational) number is represented
by a pair of extended-precision integers, the numerator and denominator of the rational number. The
Euclidean algorithm is used to remove common factors. A device that allows keeping track of error
propagation and the influence of data errors is interval arithmetic involving intervals guaranteed to contain
the desired numbers. In complex arithmetic, one employs rectangular or circular domains.
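As an illustration of rational arithmetic, the following Python sketch uses the standard fractions.Fraction class, which keeps each number as a numerator/denominator pair reduced by the gcd (the role played by the Euclidean algorithm mentioned above). It is a sketch of the idea, not of any particular package discussed in the text:

from fractions import Fraction

# In binary floating point, 0.1 + 0.2 is not exactly 0.3 ...
print(0.1 + 0.2 == 0.3)                      # False

# ... whereas exact rational arithmetic carries no rounding error at all.
a = Fraction(1, 10) + Fraction(2, 10)
print(a, a == Fraction(3, 10))               # 3/10 True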
3.3.1 Rounding
A machine register acts much like the infamous Procrustes bed in Greek mythology. Procrustes was the
innkeeper whose inn had only beds of one size. If a fellow came along who was too tall to fit into his beds,
he cut off his feet. If the fellow was too short, he stretched him. In the same way, if a real number comes
along that is too long, its tail end (not the head) is cut off; if it is too short, it is padded with zeros at the end.
Let
x ∈ ℝ ,   x = ± ( Σ_{k=1}^∞ b_{−k} 2^{−k} ) · 2^e    (1.7)
be given in normalized form, and denote by x* = rd(x)
the rounded number. One then distinguishes between two methods of rounding, the first being Procrustes'
method.
(b) Symmetric rounding. This corresponds to the familiar rounding up or rounding down in decimal
arithmetic, based on the first discarded decimal digit: if it is larger than or equal to 5, one rounds up; if it is
less than 5, one rounds down. In binary arithmetic, the procedure is somewhat simpler, since there are only
two possibilities: either the first discarded binary digit is 1, in which case one rounds up, or it is 0, in which
case one rounds down. We can write the procedure very simply in terms of the chop operation in (1.9):
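A small Python sketch of the two rounding methods for a fraction held to t binary digits; the helper names chop and sym_round are ours, and sym_round simply adds half a unit in the last retained place and then chops, mirroring the remark that symmetric rounding can be expressed through the chop operation:

def chop(x, t):
    """Chop the fraction x (0 <= x < 1) to t binary digits (Procrustes' method)."""
    return int(x * 2**t) / 2**t          # discard everything beyond digit t

def sym_round(x, t):
    """Symmetric rounding: add half a unit in the last retained place, then chop."""
    return chop(x + 2**-(t + 1), t)

x = 0.84375                 # = (0.110110)_2, exactly representable as a machine number
print(chop(x, 4))           # (0.1101)_2 = 0.8125  (digit 5 simply discarded)
print(sym_round(x, 4))      # (0.1110)_2 = 0.875   (digit 5 is 1, so we round up)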
There is a small error incurred in rounding, which is most easily estimated in the case of chopping. Here the
absolute error is at most 2^{−t} · 2^e, since only digits beyond position t of the fraction are discarded.
It depends on e (i.e., on the magnitude of x), which is the reason why one prefers the relative error |(x − x*)/x|
(if x ≠ 0), which, for normalized x, can be estimated as
| (x − chop(x)) / x | = ( | Σ_{k=t+1}^∞ b_{−k} 2^{−k} | · 2^e ) / ( | Σ_{k=1}^∞ b_{−k} 2^{−k} | · 2^e ) ≤ ( 2^{−t} · 2^e ) / ( (1/2) · 2^e ) = 2 · 2^{−t} .   (1.11)
The number on the right is an important, machine-dependent quantity, called the machine precision (or
unit roundoff); for symmetric rounding one obtains half of it, eps = 2^{−t}.
It determines the level of precision of any large-scale floating-point computation. In Matlab double-
precision arithmetic, one has t = 53, so that eps ≈ 1.11 × 10^{−16} (cf. Ex. 5), corresponding to a precision of 15 to 16 significant decimal digits.
Since it is awkward to work with inequalities, one prefers writing (1.12) equivalently as an equality,
rd(x) = x (1 + ε) ,   |ε| ≤ eps ,
and defers dealing with the inequality (for ε) to the very end.
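As a quick numerical check of the machine precision (a Python sketch): the loop below finds the spacing between 1 and the next larger double, 2^(-52) ≈ 2.22 × 10^(-16), which is twice the unit roundoff 2^(-53) ≈ 1.11 × 10^(-16) quoted above; both conventions appear in the literature.

import numpy as np

# Find the smallest power of two eps such that fl(1 + eps) > 1.
eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2
print(eps)                       # 2.220446049250313e-16 = 2**-52

print(np.finfo(float).eps)       # the same value, as reported by NumPy
print(eps / 2)                   # about 1.11e-16: the unit roundoff used in the text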
SELF-ASSESSMENT EXERCISE 1
SELF-ASSESSMENT EXERCISE 2
What are the constituents of a Machine number? Give 2 typical examples of machine
numbers.
4.0 CONCLUSION
In this unit, you have learned about real numbers and machine numbers. You have also
been able to understand the meaning of some notions about real and machine numbers.
Finally, you have been able to appreciate the significance of machine numbers in
representing numbers in a computer system.
5.0 SUMMARY
What you have learned borders on the basics of real numbers and machine numbers. The
subsequent units build on these fundamentals.
6.0 TUTOR-MARKED ASSIGNMENT
Represent all elements of ℝ+(3, 2) = {x ∈ ℝ(3, 2) : x > 0, x normalized} as dots on the real
axis. For clarity, draw two axes, one from 0 to 8, the other from 0 to ½.
7.0 REFERENCES/FURTHER READINGS
Bloch E. D. (2011). The Real Numbers and Real Analysis, Springer Nature Pp 577.
Unit 2: Machine Arithmetic
CONTENTS
1.0 Introduction
2.0 Objectives
3.0 Main Content
3.1 A Model of Machine Arithmetic
3.2 Error Propagation in Arithmetic Operations
3.2.1 Cancellation Error
3.3 The Condition of a Problem
4.0 Conclusion
5.0 Summary
6.0 Tutor-Marked Assignment
7.0 References/Further Readings
1.0 INTRODUCTION
The arithmetic used on computers unfortunately does not respect the laws of ordinary arithmetic. Each
elementary floating-point operation, in general, generates a small error that may then propagate through
subsequent machine operations. As a rule, this error propagation is harmless, except in the case of
subtraction, where cancellation effects may seriously compromise the accuracy of the results.
Most problems involve input data not represented exactly on the computer. Therefore, even before the
solution process starts, simply by storing the input in computer memory, the problem is already slightly
perturbed, owing to the necessity of rounding the input. It is important, then, to estimate how such small
perturbations in the input affect the output, the solution of the problem. This is the question of the
(numerical) condition of a problem: the problem is called well-conditioned if the changes in the solution of
the problem are of the same order of magnitude as the perturbations in the input that caused those changes.
If, on the other hand, they are much larger, the problem is called ill conditioned. It is desirable to measure
by a single number – the condition number of the problem – the extent to which the solution is sensitive to
perturbations in the input. The larger this number, the more ill conditioned the problem.
2.0 OBJECTIVES
By the end of this unit, you should be able to:
Explain machine arithmetic
Describe condition numbers
Identify the condition of a problem
3.0 MAIN CONTENT
3.1 A Model of Machine Arithmetic
Any of the four basic arithmetic operations, when applied to two machine numbers, may produce a result
no longer represented on the computer. We have therefore errors also associated with arithmetic operations.
Barring the occurrence of overflow or underflow, we may assume as a model of machine arithmetic that
each arithmetic operation ∘ (= +, −, ×, /) produces a correctly rounded result. Thus, if x, y ∈ ℝ(t, s) are
floating-point machine numbers, and fl(x∘y) denotes the machine-produced result of the arithmetic operation, this model states that
fl(x∘y) = (x∘y)(1 + ε) ,   |ε| ≤ eps .   (1.15)
In each equation we identify the computed result as the exact result on data that are slightly perturbed,
whereby the respective relative perturbations can be estimated, for example, by | ɛ|≤eps in the first two
equations, and √(1 + ε) ≈ 1 + ½ε, |½ε| ≤ ½ eps, in the third. These are elementary examples of backward error analysis.
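The model can be checked empirically. The following Python sketch multiplies random machine numbers, compares each product with the exact rational product of the operands, and verifies that the relative error never exceeds the unit roundoff u = 2^(-53) of IEEE double precision (the test itself is our choice, not something prescribed by the text):

import random
from fractions import Fraction

u = 2.0**-53          # unit roundoff for IEEE double precision
random.seed(0)

worst = 0.0
for _ in range(1000):
    x, y = random.uniform(1, 2), random.uniform(1, 2)
    exact = Fraction(x) * Fraction(y)     # exact rational product of the two machine numbers
    computed = Fraction(x * y)            # machine-produced result fl(x*y), converted exactly
    rel_err = abs((computed - exact) / exact)
    worst = max(worst, float(rel_err))

print(worst, worst <= u)   # the observed relative error stays below the unit roundoff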
Even though a single arithmetic operation causes a small error that can be neglected, a succession of
arithmetic operations can well result in a significant error, owing to error propagation. It is like the small
microorganisms that we all carry in our bodies; if our defense mechanism is in good order, the
microorganisms cause no harm, in spite of their large presence. If for some reason our defenses are
weakened, then all of a sudden, they can play havoc with our health. The same is true in machine
computation: the rounding errors, although widespread, will cause little harm unless our computations
contain some weak spots that allow rounding errors to take over to the point of completely invalidating the
results. We learn about one such weak spot (indeed the only one) in the next section.
We now study the extent to which the basic arithmetic operations propagate errors already present in their
operands. Previously we assumed the operands to be exact machine-representable numbers and discussed the errors due to imperfect execution
of the arithmetic operations by the computer. We now change our viewpoint and assume that the operands
themselves are contaminated by errors, but the arithmetic operations are carried out exactly. (We already
know what to do, cf. (1.15), when we are dealing with machine operations.) Our interest is in the errors in the results caused by the errors in the operands.
a) Multiplication. We consider values x(1 + ε_x) and y(1 + ε_y) of x and y contaminated by relative
errors ε_x and ε_y, respectively. What is the relative error in the product? We assume ε_x, ε_y sufficiently
small so that quantities of second order – ε_x², ε_x ε_y and ε_y² – and, even more so, quantities of still higher
order can be neglected.
Thus, the relative error ε_{x·y} in the product is given (at least approximately) by
ε_{x·y} ≈ ε_x + ε_y ;
that is, the (relative) errors in the data are being added to produce the (relative) error in the result.
We consider this to be acceptable error propagation, and in this sense, multiplication is a benign
operation.
b) Division. Similarly, for the quotient of the two contaminated values we obtain
x(1 + ε_x) / ( y(1 + ε_y) ) ≈ (x/y)(1 + ε_x − ε_y) ,
that is,
ε_{x/y} ≈ ε_x − ε_y .
c) Addition and subtraction. Since x and y can be numbers of arbitrary signs, it suffices to look at
addition. We have
assuming x + y≠ 0. Therefore,
ε_{x+y} = ( x/(x + y) ) · ε_x + ( y/(x + y) ) · ε_y .   (1.18)
Fig. 1.3 The cancellation phenomenon
As before, the error in the result is a linear combination of the errors in the data, but now the coefficients
are no longer ±1 but can assume values that are arbitrarily large. Note first, however, that when x and y
have the same sign, then both coefficients are positive and bounded by 1, so that |ε_{x+y}| ≤ max(|ε_x|, |ε_y|).
Addition, in this case, is a benign operation. It is only when x and y have opposite signs that the coefficients
in (1.18) can be arbitrarily large, namely, when |𝑥 + 𝑦| is arbitrarily small compared to |𝑥| and|𝑦|. This
happens when x and y are almost equal in absolute value, but opposite in sign. The large magnification of
error then occurring in (1.18) is referred to as cancellation error. It is the only serious weakness – the
Achilles heel, as it were – of numerical computation, and it should be avoided whenever possible. In
particular, one should be prepared to encounter cancellation effects not only in single devastating amounts,
but also repeatedly over a long period of time involving "small doses" of cancellation. Either way, the end result can be badly compromised.
We illustrate the cancellation phenomenon schematically in Fig. 1.3, where b, b’, b’’ stand for binary digits
that are reliable, and the g represents binary digits contaminated by error; these are often called garbage
digits. Note in Fig. 1.3 that "garbage – garbage = garbage", but more importantly, that the final
normalization of the result moves the first garbage digit from the 12th position to the 3rd.
Cancellation is such a serious matter that we wish to give a number of elementary examples, not only of its occurrence but also of ways to circumvent it.
Examples. 1. An algebraic identity: (𝑎 − 𝑏)2 = 𝑎2 − 2𝑎𝑏 + 𝑏2. Although this is a valid identity in
algebra, it is no longer valid in machine arithmetic. Thus, on a 2-decimal-digit computer, with a = 1.8, b =
… , the right-hand side does not give the true result 0.010 (which we obtain also on our 2-digit computer if we use the left-hand side);
instead it produces a value that is off by one order of magnitude and, on top of that, has the wrong sign.
2. Quadratic equation: 𝑥2 − 56𝑥 + 1 = 0. The usual formula for a quadratic gives, in 5-decimal arithmetic,
This should be contrasted with the exact roots 0.0178628… and 55.982137… . As can be seen, the smaller
of the two is obtained to only two decimal digits, owing to cancellation. An easy way out, of course, is to
compute x₂ first, which involves a benign addition, and then to compute x₁ = 1/x₂ by Vieta's formula,
which again involves a benign operation – division. In this way we obtain both roots to full machine
accuracy.
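A short Python sketch of the remedy for this particular equation (x² − 56x + 1 = 0): compute the large root with the usual formula and recover the small one from Vieta's product x₁x₂ = 1, avoiding the subtraction of nearly equal numbers:

import math

b, c = -56.0, 1.0                      # x**2 - 56x + 1 = 0
disc = math.sqrt(b * b - 4 * c)

# Naive formula: the smaller root suffers cancellation (56 minus almost 56).
x1_naive = (-b - disc) / 2

# Stable variant: compute the large root first, then use Vieta (x1 * x2 = c).
x2 = (-b + disc) / 2
x1 = c / x2

print(x1_naive, x1)    # both near 0.0178628..., but only the second is accurate to full precision
print(x2)              # 55.982137...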
3. Compute y = √(x + δ) − √x, where x > 0 and |δ| is very small. Clearly, the formula as written
causes severe cancellation errors, since each square root has to be rounded. Writing instead
y = δ / ( √(x + δ) + √x )
completely avoids the cancellation problem.
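The effect is easy to observe in Python; the numbers x and δ below are our own choice, picked so that the two square roots agree in many leading digits:

import math

x, delta = 1.0e8, 1.0e-4

naive = math.sqrt(x + delta) - math.sqrt(x)        # two nearly equal square roots: cancellation
stable = delta / (math.sqrt(x + delta) + math.sqrt(x))

print(naive, stable)   # both roughly 5.0e-09, but only the stable value is accurate
                       # to nearly full machine precision; the naive one loses digits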
4. Compute y = cos(x + δ) − cos(x), where |δ| is very small. Here cancellation can be avoided by
means of the identity
y = −2 sin(δ/2) · sin(x + δ/2) .
5. Compute y = f(x + δ) − f(x), where |δ| is very small and f is a given function. Special tricks, such
as those used in the two preceding examples, can no longer be played, but if f is sufficiently smooth in the
neighborhood of x, one can use the Taylor expansion
y = f′(x)·δ + (1/2) f″(x)·δ² + ⋯ .
The terms in this series decrease rapidly when |δ| is small, so that cancellation is no longer a problem.
Addition is an example of a potentially ill-conditioned function (of two variables). It naturally leads us to
the general question of the condition of a problem: the input is a set of data, for example the coefficients of
some equation, and the output another set of numbers uniquely determined by the input, say, all the roots
of the equation in some prescribed order. If we collect the input in a vector x ∈ ℝ^m (assuming
the data consists of real numbers), and the output in the vector y∈ ℝ𝑛 (also assumed real), we have the
black box situation shown in Fig. 1.4, where the box P accepts some input x and then solves the problem
(One or both of the spaces ℝ𝑚, ℝ𝑛 could be complex spaces without changing in any essential way the
discussion that follows.) What we are interested in is the sensitivity of the map f at some given point x to a
small perturbation of x, that is, how much bigger (or smaller) the perturbation in y is compared to the
perturbation in x. In particular, we wish to measure the degree of sensitivity by a single number – the
condition number of the map f at the point x. We emphasize that, as we perturb x, the function f is always
assumed to be evaluated exactly, with infinite precision. The condition of f, therefore is an inherent property
of the map f and does not depend on algorithmic considerations concerning its implementation.
This is not to say that knowledge of the condition of a problem is irrelevant to any algorithmic solution of
the problem. On the contrary: the reason is that quite often the computed solution y* of (1.20) (computed
in floating point machine arithmetic, using a specified algorithm) can be demonstrated to be the exact
solution corresponding to slightly perturbed data x* = x + δ,
y* = f(x*) ,   (1.21)
and moreover, the distance ‖𝛿‖ of 𝑥∗ to x can be estimated in terms of the machine precision. Therefore, if
we know how strongly (or weakly) the map f reacts to a small perturbation, such as 𝛿 in (1.22), we can say
something about the error 𝑦∗ − 𝑦 in the solution caused by this perturbation. This, indeed, is an important
technique of error analysis – known as backward error analysis – which was pioneered in the 1950s by J. H. Wilkinson.
Maps f between more general spaces (in particular, function spaces) have also been considered from the
point of view of conditioning, but eventually these spaces have to be reduced to finite dimensional spaces
SELF-ASSESSMENT EXERCISE 1
Define exhaustively the term 'machine arithmetic'.
SELF-ASSESSMENT EXERCISE 2
List the sources of error in machine arithmetic. Give 2 typical examples of each source
mentioned.
4.0 CONCLUSION
In this unit, you have learned about machine arithmetic and error propagation. You have also
been able to understand the meaning of some notions about machine arithmetic.
5.0 SUMMARY
What you have learned borders on the basic of machine arithmetic in computer science.
6.0 TUTOR-MARKED ASSIGNMENT
Consider a miniature binary computer whose floating-point words consist of four binary
digits for the mantissa and three binary digits for the exponent (plus sign bits). Let
x = (0.1011)_2 × 2^0 ,   y = (0.1100)_2 × 2^0 .
Mark in the following table whether the machine operation indicated (with the result z
assumed normalized) is exact, rounded (i.e., subject to a nonzero rounding error),
overflows, or underflows.
Operation                  Exact   Rounded   Overflow   Underflow
z = fl(x − y)
z = fl((x − y)10)
z = fl(x + y)
z = fl(y + (x/4))
z = fl(x + (y/4))
7.0 REFERENCES/FURTHER READINGS
Bloch E. D. (2011).The Real Numbers and Real Analysis, Springer Nature Pp 577.
Francis Scheid. (1989) Schaum‘s Outlines Numerical Analysis 2nd ed. McGraw-Hill
New York.
Okunuga, S. A., and Akanbi M, A., (2004). Computational Mathematics, First Course,
WIM Pub. Lagos, Nigeria.
Turner P. R. (1994) Numerical Analysis Macmillan College Work Out Series Malaysia
Unit 3: Condition Number
CONTENTS
1.0 Introduction
2.0 Objectives
3.0 Main Content
3.1 Condition Numbers
3.2 The Condition of a Problem
4.0 Conclusion
5.0 Summary
6.0 Tutor-Marked Assignment
7.0 References/Further Readings
1.0 INTRODUCTION
In the field of numerical analysis, the condition number of a function measures how much the output value
of the function can change for a small change in the input argument. This is used to measure how sensitive a
function is to changes or errors in the input, and how much error in the output results from an error in the
input. Very frequently, one is solving the inverse problem: given f(x) = y, one is solving
for x, and thus the condition number of the (local) inverse must be used. In linear regression the condition
number of the moment matrix can be used as a diagnostic for multicollinearity. Once the solution process
starts, additional rounding errors will be committed, which also contaminate the solution. The resulting
errors, in contrast to those caused by input errors, depend on the particular solution algorithm. It makes
sense, therefore, to also talk about the condition of an algorithm, although its analysis is usually quite a bit
harder. The quality of the computed solution is then determined by both (essentially the product of) the condition of the problem and the condition of the algorithm.
2.0 OBJECTIVES
By the end of this unit, you should be able to:
Explain condition numbers
Identify the condition of a problem
3.0 MAIN CONTENT
3.1 Condition Numbers
A perturbation Δx of x produces the perturbation Δy ≈ f′(x)·Δx in y, assuming that f is differentiable at x. Since our interest is in relative errors, we write this in the form
Δy / y ≈ ( x f′(x) / f(x) ) · ( Δx / x ) .   (1.24)
The approximate equality becomes a true equality in the limit as Δx → 0. This suggests defining the condition number of f at x by
(cond f)(x) = | x f′(x) / f(x) | .   (1.25)
This number tells us how much larger the relative perturbation in y is compared to the relative perturbation
in x.
If x = 0 and y ≠ 0, it is more meaningful to consider the absolute error measure for x and still the relative error for y.
This leads to the condition number |f′(x)/f(x)|. Similarly, for y = 0, x ≠ 0, one uses |x f′(x)|, and if both vanish, |f′(x)|.
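A small Python sketch that estimates (cond f)(x) = |x f′(x)/f(x)| numerically; the central-difference derivative and the helper name cond are ours, and the approximation is only meant to give the order of magnitude:

import math

def cond(f, x, h=1e-6):
    """Estimate (cond f)(x) = |x * f'(x) / f(x)| with a central-difference derivative."""
    dfdx = (f(x + h) - f(x - h)) / (2 * h)
    return abs(x * dfdx / f(x))

# Subtraction of nearly equal numbers is ill-conditioned:
# f(x) = x - 1 near x = 1 has condition number |x / (x - 1)|.
print(cond(lambda x: x - 1.0, 1.001))   # about 1001

# A benign example: f(x) = exp(x) has (cond f)(x) = |x|.
print(cond(math.exp, 3.0))              # about 3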
We assume again that each function 𝑓𝑣 has partial derivatives with respect to all 𝑚 variables at the point 𝒙.
Then the most detailed analysis departs from considering each component 𝑦𝑣 as a function of one single
variable, x_μ. In other words, we subject just one component x_μ to a small relative perturbation and observe the effect on the single component y_ν. Then we can apply (1.25) and obtain
γ_{νμ}(x) = (cond_{νμ} f)(x) = | x_μ (∂f_ν/∂x_μ)(x) / f_ν(x) | ,   ν = 1, …, n ,  μ = 1, …, m .   (1.27)
These numbers form an n × m matrix 𝚪(x) = [γ_{νμ}(x)]. To obtain a single overall condition
number, we can take any convenient measure of the "magnitude" of 𝚪(x), such as one of the matrix norms defined in
(1.30):  (cond f)(x) = ‖𝚪(x)‖ .   (1.28)
The condition so defined, of course, depends on the choice of norm, but the order of magnitude (and that is
all that counts) should be more or less the same for any reasonable norm.
If a component of 𝑥, or 𝑦, vanishes, one modifies (1.27) as discussed earlier. A less-refined analysis can be
modelled after the one-dimensional case by defining the relative perturbation of 𝑥 ∈ ℝ𝑚 to mean
‖Δx‖_{ℝ^m} / ‖x‖_{ℝ^m} ,    Δx = [Δx_1, Δx_2, …, Δx_m]^T ,   (1.29)
where Δx is a perturbation vector whose components Δx_μ are small compared to x_μ, and where ‖·‖_{ℝ^m} is
some vector norm in ℝ^m. For the perturbation Δy caused by Δx, one defines similarly the relative
perturbation ‖Δy‖_{ℝ^n} / ‖y‖_{ℝ^n}, with a suitable vector norm ‖·‖_{ℝ^n} in ℝ^n. One then tries to relate the relative perturbation in y to the relative perturbation in x.
To carry this out, one needs to define a matrix norm for matrices 𝑨 ∈ ℝ𝑛×𝑚. We choose the so-called
"operator norm",
‖A‖_{ℝ^{n×m}} := max_{x ∈ ℝ^m, x ≠ 0} ‖Ax‖_{ℝ^n} / ‖x‖_{ℝ^m} .   (1.30)
In the following we take as vector norms the "uniform" (or infinity) norm,
‖x‖_∞ = max_{1≤μ≤m} |x_μ| ,   (1.31)        ‖y‖_∞ = max_{1≤ν≤n} |y_ν| .   (1.32)
Then, for each component,
|Δy_ν| ≤ Σ_{μ=1}^m |∂f_ν/∂x_μ| · |Δx_μ| ≤ max_μ |Δx_μ| · max_ν Σ_{μ=1}^m |∂f_ν/∂x_μ| .
Since this holds for each ν = 1, 2, …, n, it also holds for max_ν |Δy_ν|, giving, in view of (1.31) and (1.32),
‖Δy‖_∞ ≤ ‖Δx‖_∞ · ‖∂f/∂x‖_∞ .   (1.33)
Here,
∂f/∂x (x) = [ ∂f_ν/∂x_μ (x) ] ∈ ℝ^{n×m}   (1.34)
is the Jacobian matrix of f. (This is the analogue of the first derivative for systems of functions of several variables.)
Although (1.33) is an inequality, it is sharp in the sense that equality can be achieved for a suitable perturbation Δx. It is natural, therefore, to define
(cond f)(x) = ‖x‖_∞ · ‖∂f/∂x (x)‖_∞ / ‖f(x)‖_∞ .   (1.35)
Clearly, in the case m = n = 1, definition (1.35) reduces precisely to definition (1.25) (as well as (1.28)) given
earlier. In higher dimensions (m and/or n larger than 1), however, the condition number in (1.35) is much
cruder than the one in (1.28). This is because norms tend to destroy detail: if x, for example, has components
of vastly different magnitudes, then ‖x‖_∞ is simply equal to the largest of these components and the others
are ignored. For this reason, some caution is required when using (1.35).
An instructive example is the map f : ℝ² → ℝ², f(x) = [ 1/x₁ + 1/x₂ , 1/x₁ − 1/x₂ ]^T. The detailed condition numbers (1.27) involve the factors 1/(x₁ + x₂) and 1/(x₁ − x₂),
indicating ill-conditioning if either x₁ ≈ x₂ or x₁ ≈ −x₂ and |x₁| (hence also |x₂|) is not small. The global
condition number (1.35), computed from the Jacobian
∂f/∂x (x) = − ( 1/(x₁² x₂²) ) [ x₂²   x₁² ;  x₂²   −x₁² ] ,
becomes, when L1 vector and matrix norms are used (cf. Ex. 33),
(cond f)(x) = ‖x‖₁ · ‖∂f/∂x (x)‖₁ / ‖f(x)‖₁ = 2 (|x₁| + |x₂|) · max(x₁², x₂²) / ( |x₁ x₂| · ( |x₁ + x₂| + |x₁ − x₂| ) ) .
Examples
We illustrate the idea of numerical condition in a number of examples, some of which are of considerable
interest in applications.
1. Compute I_n = ∫₀¹ tⁿ/(t + 5) dt for some fixed integer n ≥ 1. As it stands, the example here deals with a map
where I*_ν is some approximation of I_ν. Actually, I*_ν does not even have to be close to I_ν for (1.46) to hold,
since the function g_n is linear. Thus, we may take I*_ν = 0, committing a 100% error in the starting value,
The bound on the right can be made arbitrarily small, say ≤ ε, if we choose ν large enough, for example,
ν ≥ n + ln(1/ε) / ln 5 .   (1.48)
The final procedure, therefore, is: given the desired relative accuracy ε, choose ν to be the smallest integer satisfying (1.48), and compute
I*_ν = 0 ,    I*_{k−1} = (1/5) ( 1/k − I*_k ) ,   k = ν, ν − 1, …, n + 1 .   (1.49)
This will produce a sufficiently accurate I*_n ≈ I_n, even in the presence of rounding errors committed during the recursion.
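A Python sketch of this procedure, together with the unstable forward recurrence for comparison. The recurrence I_k = 1/k − 5 I_{k−1} follows from t^k = t^{k−1}(t + 5) − 5 t^{k−1}, with I₀ = ln(6/5); the choice ν = n + 25 below is ours and is merely one convenient way of satisfying (1.48):

import math

def forward(n):
    """Forward recurrence I_k = 1/k - 5*I_{k-1}: each step amplifies the error by 5."""
    I = math.log(6 / 5)                  # I_0 = ln(6/5), rounded to machine precision
    for k in range(1, n + 1):
        I = 1.0 / k - 5.0 * I
    return I

def backward(n, nu):
    """Backward recurrence (1.49): start from I*_nu = 0 and damp the error by 1/5 per step."""
    I = 0.0
    for k in range(nu, n, -1):
        I = (1.0 / k - I) / 5.0
    return I

n = 28
print(forward(n))          # swamped by the amplified rounding error (factor 5**28)
print(backward(n, n + 25)) # about 5.8e-3, accurate to essentially full machine precision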
Similar ideas can be applied to the more important problem of computing solutions to second-order linear
recurrence relations, such as those satisfied by Bessel functions and many other special functions of
mathematical physics.
The procedure of backward recurrence is then closely tied up with the theory of continued fractions.
2. Another example concerns computing a root ξ of a (monic) polynomial p(x) = xⁿ + a_{n−1}x^{n−1} + … + a₁x + a₀.
The problem then is to find ξ, given p. The data vector a = [a₀, a₁, …, a_{n−1}]^T ∈ ℝⁿ consists of the
coefficients of the polynomial p, and the result is ξ, a real or complex number. Thus, we have
What is the condition of 𝜉? We adopt the detailed approach of (1.27) and first define
(cond_ν ξ)(a) = | ( a_ν / ξ ) · ( ∂ξ / ∂a_ν ) | ,   ν = 0, 1, …, n − 1 .   (1.53)
To determine the partial derivative of ξ with respect to a_ν, observe that ξ satisfies the identity p(ξ(a)) ≡ 0. Differentiating it with respect to a_ν gives
p′(ξ) · (∂ξ/∂a_ν) + ξ^ν ≡ 0 ,
where the last term comes from differentiating the coefficient a_ν in the term a_ν ξ^ν of the polynomial.
and, therefore, (cond A)(x) ≤ 5 |ln x|. The algorithm A is well conditioned, except in the immediate right-
hand vicinity of x = 0 and for x very large. (In the latter case, however, x is likely to overflow before A
f: ℝ𝑛 ⟶ ℝ, y = 𝑥1𝑥2 … 𝑥𝑛.
A: 𝑝𝑘 = fl(𝑥𝑘𝑝k−1), 𝑘 = 2,3,…,n,
𝑦𝐴 = 𝑝n.
Note that x1 is machine representable, since for the algorithm A we assume x ∈ ℝ(t, s).
Now using the basic law of machine arithmetic (cf. (1.15)), we get
𝑝1 = 𝑥1,
From which
and so, by (1.66), (cond A)(x)≤ 1 for any x 𝜖 ℝ𝑛(𝑡, 𝑠). Our algorithm, to nobody‘s surprise, is perfectly
well conditioned.
SELF-ASSESSMENT EXERCISE 1
Define condition number.
SELF-ASSESSMENT EXERCISE 2
List two types of condition number. Give 2 typical examples of each mentioned.
4.0 CONCLUSION
The condition number is an application of the derivative, and is formally defined as the value of the
asymptotic worst-case relative change in output for a relative change in input. The "function" is the solution
of a problem and the "arguments" are the data in the problem. The condition number is frequently applied
to questions in linear algebra, in which case the derivative is straightforward but the error could be in many
different directions, and is thus computed from the geometry of the matrix. More generally, condition
numbers can be defined for non-linear functions in several variables. A problem with a low condition
number is said to be well-conditioned, while a problem with a high condition number is said to be ill-
conditioned.
5.0 SUMMARY
In non-mathematical terms, an ill-conditioned problem is one where, for a small change in the inputs
(the independent variables) there is a large change in the answer or dependent variable. This means that the
correct solution/answer to the equation becomes hard to find. The condition number is a property of the
problem. Paired with the problem are any number of algorithms that can be used to solve the problem, that
is, to calculate the solution. Some algorithms have a property called backward stability. In general, a
backward stable algorithm can be expected to accurately solve well-conditioned problems. Numerical
analysis textbooks give formulas for the condition numbers of problems and identify known backward
stable algorithms.
6.0 TUTOR-MARKED ASSIGNMENT
Let f(x) = √(1 + x²) − 1.
(a) Explain the difficulty of computing f(x) for a small value of |x| and show how it can be circumvented.
(b) Compute (cond f)(x) and discuss the conditioning of f(x) for small |x|.
7.0 REFERENCES/FURTHER READINGS
Belsley, David A.; Kuh, Edwin; Welsch, Roy E. (1980). "The Condition
Number". Regression Diagnostics: Identifying Influential Data and Sources of
Collinearity. New York: John Wiley & Sons. pp. 100–104. ISBN 0-471-05856-4.
Pesaran, M. Hashem (2015). "The Multicollinearity Problem". Time Series and Panel
Data Econometrics. New York: Oxford University Press. pp. 67–72 [p. 70]. ISBN 978-0-
19-875998-0.
Cheney; Kincaid (2008). Numerical Mathematics and Computing. p. 321. ISBN 978-0-
495-11475-8.
Demmel, James (1990). "Nearest Defective Matrices and the Geometry of Ill-
conditioning". In Cox, M. G.; Hammarling, S. (eds.). Reliable Numerical Computation.
Oxford: Clarendon Press. pp. 35–55. ISBN 0-19-853564-3.
Module 2 Error Analysis and Computer Arithmetic
CONTENTS
1.0 Introduction
2.0 Objectives
3.0 Main Content
3.1 Overall Errors
3.2 Sources of Errors
3.2.1 Errors in the Input Data
3.2.2 Rounding Errors
3.2.3 Truncation Errors
3.3 Basic Concepts
4.0 Conclusion
5.0 Summary
6.0 Tutor-Marked Assignment
7.0 References/Further Readings
1.0 INTRODUCTION
In scientific computing, we never expect to get the exact answer. Inexactness is practically the definition
of scientific computing. Getting the exact answer, generally with integers or rational numbers, is symbolic
computing, an interesting but distinct subject. Suppose we are trying to compute the number A. The
computer will produce an approximation, which we call Â. This Â may agree with A to 16 decimal
places, but the identity A = Â (almost) never is true in the mathematical sense, if only because the computer
does not have an exact representation for A. For example, if we need to find x that satisfies the equation
x² − 175 = 0, we might get 13 or 13.22876, depending on the computational method, but √175 cannot be
represented exactly as a floating point number. Four primary sources of error are: (i) roundoff error, (ii)
truncation error, (iii) termination of iterations, and (iv) statistical error in Monte Carlo. We will estimate
the sizes of these errors, either a priori from what we know in advance about the solution, or a posteriori
from the computed (approximate) solutions themselves. Software development requires distinguishing
these errors from those caused by outright bugs. In fact, the bug may not be that a formula is wrong in a
2.0 OBJECTIVES
By the end of this unit, you should be able to:
Explain errors in computational science
Identify the sources of errors
Describe the basic concepts of errors
3.0 MAIN CONTENT
This is the mathematical (idealized) problem, where the data are exact real numbers, and the solution is the
When solving such a problem on a computer, in floating-point arithmetic with precision eps, and
using some algorithm A, one first of all rounds the data, and then applies to these rounded data not f , but
𝒇𝑨:
x* = rounded data ,    ‖x* − x‖ / ‖x‖ = ε ,   (1.68)
y*_A = f_A(x*) .
This shows how the data error and machine precision contribute toward the total error: both are amplified
by the condition of the problem, but the latter is further amplified by the condition of the algorithm.
In the previous module we illustrated how approximations are introduced in the solution of mathematical problems that
cannot be solved exactly. One of the tasks in numerical analysis is to estimate the accuracy of the result of
a numerical computation. In this chapter we discuss different sources of the error that affected the computed
result and we derive methods for error estimation. In particular we discuss some properties of computer
arithmetic. Finally, we describe the main features of the standard for floating point arithmetic, which was adopted in 1985.
Basically there are three types of error that affect the result of a numerical computation
1. Errors in the input data are often unavoidable. The input data may be results of measurements
with limited accuracy, or real numbers which must be represented with a fixed number of digits.
2. Rounding errors arise when computation are performed using a fixed number of digits.
3. Truncation errors arise when an infinite process is replaced by a finite one, e.g. when an
infinite series is truncated after finitely many terms, or when a curve is approximated by a straight line.
Truncation errors will be discussed in connection with different numerical methods. In this chapter we shall concentrate on errors in the input data and on rounding errors.
The different types of errors are illustrated in Figure 2.1, which refers to the discussion in Chapter 1.
Figure 2.1. Sources of error: an experiment leads, via model construction, to a mathematical model and a mathematical problem; this is turned into a numerical problem, which a numerical algorithm solves using input data to produce output data.
RXF error in the result, coming from errors in the function values used,
RC rounding error,
RT truncation error.
Absolute error in ā: Δa = ā – a.
Relative error in ā:  Δa / a ,   (a ≠ 0).
a = √2, ā = 1.414.
Δa = 1.414 - √2 = -0.0002135… ,
Δa / a = −0.0002135… / √2 = −0.0001510 …
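The same computation, as a Python sketch:

import math

a = math.sqrt(2)          # "exact" value (to machine precision)
a_bar = 1.414             # approximate value

abs_err = a_bar - a
rel_err = abs_err / a

print(abs_err)   # -0.0002135623...
print(rel_err)   # -0.0001510...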
In many cases we only know a bound on the magnitude of the absolute error of an approximation. Also, it
is often practical to give an estimate of the magnitude of the absolute and relative error, even if more
information is available.
|Δa| ≤ 0.00022 ,    |Δa / a| ≤ 0.00016 ,
|Δa| ≤ 0.0003 ,    |Δa / a| ≤ 0.0002 ,
Note that we must always round upwards in order that the inequalities shall hold. Relative errors are often given as percentages.
3°  1.41378 ≤ a ≤ 1.41422.
There are two ways to reduce the number of digits in a numerical value: rounding and chopping. We first
consider rounding of decimal numbers to t digits. Let η denote the part of the number that corresponds to
positions to the right of the t-th decimal. If η < 0.5 · 10^−t, then the t-th decimal is left unchanged, and it is
raised by 1 if η > 0.5 · 10^−t. In the limit case where η = 0.5 · 10^−t, the t-th decimal is raised by one if it is
odd and left unchanged if it is even. This is called rounding to even. With chopping, all the decimals after
position t are simply discarded. Chopping to 3 decimals:
It is important to remember that errors are introduced when numbers are rounded or chopped. From the
above rules we see that when a number is rounded to t decimals, then the error is at most 0.5 · 10^−t, while the
chopping error can be almost 10^−t. Note that chopping errors are systematic: the chopped result is always closer to
zero than the original number. When an approximate value is rounded or chopped, then the error already associated with it and the new rounding or chopping error add up.
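A Python sketch of the two reduction rules for decimal numbers, using the standard decimal module so that the decimal digits are exact; the helper names round_to and chop_to are ours:

from decimal import Decimal, ROUND_HALF_EVEN, ROUND_DOWN

def round_to(x, t):
    """Round the decimal number x to t decimals, halves going to the even digit."""
    q = Decimal(10) ** -t
    return Decimal(x).quantize(q, rounding=ROUND_HALF_EVEN)

def chop_to(x, t):
    """Chop x to t decimals: discard all decimals beyond position t (toward zero)."""
    q = Decimal(10) ** -t
    return Decimal(x).quantize(q, rounding=ROUND_DOWN)

print(round_to("0.2345", 3), chop_to("0.2345", 3))   # 0.234 (tie, kept digit 4 is even) and 0.234
print(round_to("0.2375", 3), chop_to("0.2375", 3))   # 0.238 (tie, digit 7 is odd) and 0.237
print(round_to("0.6667", 3), chop_to("0.6667", 3))   # 0.667 and 0.666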
EXAMPLE. Let b = 11.2376 ± 0.1. Since the absolute error can be one unit in the first decimal, it is not
meaningful to give four decimals, and we round to one decimal, b_rd = 11.2. The rounding error is
11.2 − 11.2376 = −0.0376, and the exact value lies somewhere in the interval [11.1376, 11.3376].
Notice that the rounded interval [11.1, 11.3] does not necessarily contain the exact value.
The following definitions relate to the concepts of absolute and relative error.
SELF-ASSESSMENT EXERCISE 1
Define errors in computational science.
SELF-ASSESSMENT EXERCISE 2
State the four (4) sources of error. Give 3 typical examples of each source of error.
4.0 CONCLUSION
Only when you study how error is measured, and which factors affect it in which way, can you
improve it. Computation is work in which machines carry out mathematical and other operations, and
machines too make mistakes; that is what we call error. The analysis of
errors computed using the global positioning system is important for understanding how GPS works, and
This module discusses errors and sources of error. Scientific computing is shaped by the fact that nothing
is exact. A mathematical formula that would give the exact answer with exact inputs might not be robust
enough to give an approximate answer with (inevitably) approximate inputs. Individual errors that were
small at the source might combine and grow in the steps of a long computation. Such a method is unstable.
A problem is ill conditioned if any computational method for it is unstable. Stability theory, which is
6.0 TUTOR-MARKED ASSIGNMENT
Consider a decimal computer with three (decimal) digits in the floating-point mantissa.
(a) Estimate the relative error committed in symmetric rounding.
(b) Let x₁ = 0.982, x₂ = 0.984 be two machine numbers. Calculate in machine arithmetic
the mean m = ½ (x₁ + x₂). Is the computed number between x₁ and x₂?
(c) Derive sufficient conditions for x₁ < fl(m) < x₂ to hold, where x₁, x₂ are two
machine numbers with 0 < x₁ < x₂.
7.0 REFERENCES/FURTHER READINGS
Murray, R. S. (1974). Schaums Outline Series or Theory and Problem of Advanced Calculus. Great
Britain: McGraw–Hill Inc.
Murray,R.S.(1974). ―Schaums Outline Series on Vector Analysis‖. In: An Introduction to Tension
Analysis. Great Britain: McGraw-Hill Inc.
Stephenson, G. (1977). Mathematical Methods for Science Students. London: Longman. Group Limited.
Stroud, K.A. (1995). Engineering Maths. 5th Edition Palgraw Verma, P.D.S. (1995). Engineering
Mathematics. New Delhi: Vikas Publishing House PVT Ltd.
Unit 2: Error Propagation
CONTENTS
1.0 Introduction
2.0 Objectives
3.0 Main Content
3.1 Error Propagation
3.2 Addition and Subtraction
3.3 Multiplication and Division
4.0 Conclusion
5.0 Summary
6.0 Tutor-Marked Assignment
7.0 References/Further Readings
3.1 Error Propagation
If |Δa| = |ā − a| ≤ 0.5 · 10^−t, then the approximate value ā is said to have t correct
decimals.
In an approximate value with t>0 correct decimals, the digits in positions with
unit ≥ 10-t are called significant digits. Leading zeros are not counted; they only
indicate the position of the decimal point.
The definitions are easily modified to cover the case when the magnitude of the absolute error is larger than
0.5 units in the position of the last given decimal. For example,
210000 ± 5000 has 2 significant digits, and
a = 1.789 ± 0.005
has two correct decimals even though the exact value may be 1.794. The principles for rounding and
chopping, and the concepts of correct decimals and significant digits, are completely analogous in number systems other than the
decimal system.
When approximate values are used in computations, their errors will, of course, give rise to errors in the
results. We shall derive some simple methods for estimating how errors in the data are propagated in
computations.
In practical applications error analysis is often closely related to the technology for constructions of devices
Example. The efficiency of a certain type of solar energy collectors can be computed by the formula
η = K · Q · T_d / I ,
where K is a constant, known to high accuracy; Q denotes volume flow; T_d denotes temperature difference
between ingoing and outgoing fluid; and I denotes irradiance. The accuracies with which these quantities can be measured are given below.
Assume that we have computed the efficiencies 0.76 and 0.70 for two solar collectors S1 and S2, and that the relative errors in the measured quantities are bounded as follows.
Collector S1 S2
Q 1.5% 0.5%
Td 1% 1%
I 3.6% 2%
Based on these data, can we be sure that S1 has a larger efficiency than S2?
We return to this example when we have derived mathematical tools that can help us answer the question.
First, assume that we shall compute f(x̄), where f is a differentiable function. Further, assume that we know a bound ε for the absolute error in x̄. Then we
can estimate the propagated error simply by computing f(x̄ − ε) and f(x̄ + ε):
A more generally applicable method is provided by the following theorem, which will be used repeatedly
in the module.
When the mean value theorem is used for practical error estimation, the derivative is computed at x̄, giving the estimate Δf ≈ f′(x̄) Δx.
Figure 2.2: The mean value theorem: Δf = f′(ξ) Δx for some ξ between x̄ and x.
Example. We shall compute f(α) = √α for α = 2.05 ± 0.01. The mean value theorem gives
Δf = f′(ξ) Δα = ( 1 / (2√ξ) ) Δα .
We can estimate
|Δf| ≲ ( 1 / (2√2.05) ) |Δα| ≤ 0.01 / (2√2.05) = 0.00349… ≤ 0.0036 .
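The same estimate as a Python sketch:

import math

alpha = 2.05
d_alpha = 0.01                      # |Δα| ≤ 0.01

f = math.sqrt
dfdx = 1 / (2 * math.sqrt(alpha))   # f'(α) = 1/(2√α), evaluated at the approximate value

bound = dfdx * d_alpha
print(bound)                        # 0.003492... ; rounded upwards: |Δf| ≤ 0.0036

# Compare with the actual deviations f(α ± Δα) - f(α); both stay within the rounded-up bound.
print(f(alpha + d_alpha) - f(alpha), f(alpha - d_alpha) - f(alpha))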
We first examine error propagation for the four simple arithmetic operations.
If we only know bounds for the absolute errors in x₁ and x₂, we must take absolute values and use the
triangle inequality:
y = x₁ + x₂ ,  y = x₁ − x₂ :    |Δy| ≤ |Δx₁| + |Δx₂| ,
and, more generally, for y = Σ_{i=1}^n x_i we get |Δy| ≤ Σ_{i=1}^n |Δx_i| .
For multiplication and division it is the relative errors that are propagated:
Δy / y ≃ Δx₁ / x₁ + Δx₂ / x₂ ,
and if we take absolute values and use the triangle inequality, we get the bound
| Δy / y | ≲ | Δx₁ / x₁ | + | Δx₂ / x₂ | .
We summarize:
y = x₁ · x₂ ,  y = x₁ / x₂ ,
Relative error:   Δy/y ≃ Δx₁/x₁ + Δx₂/x₂  (product),    Δy/y ≃ Δx₁/x₁ − Δx₂/x₂  (quotient).
Example. Now we can solve the solar collector problem. The error propagation formulas for
products and quotients give, for S1 and S2 respectively,
| Δη / η | ≤ (1.5 + 1 + 3.6) · 10⁻² = 0.061   and   | Δη / η | ≤ (0.5 + 1 + 2) · 10⁻² = 0.035 ,
so that
0.714 ≲ η₁ ≲ 0.806   and   0.676 ≲ η₂ ≲ 0.725 .
The two intervals overlap, and therefore we cannot be sure that the solar collector S1 is better than S2
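The whole argument fits in a few lines of Python (a sketch; the helper name efficiency_interval is ours, and the percentages are the bounds from the table above):

def efficiency_interval(eta, rel_errs):
    """Bound the relative error of eta = K*Q*Td/I by the sum of the relative errors."""
    r = sum(rel_errs)
    return eta * (1 - r), eta * (1 + r), r

lo1, hi1, r1 = efficiency_interval(0.76, [0.015, 0.01, 0.036])   # collector S1
lo2, hi2, r2 = efficiency_interval(0.70, [0.005, 0.01, 0.02])    # collector S2

print(r1, r2)          # 0.061 and 0.035
print(lo1, hi1)        # about 0.714 ... 0.806
print(lo2, hi2)        # about 0.676 ... 0.725  -> the two intervals overlap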
The following generalization of the mean value theorem is useful for examination of error propagation in functions of several variables.
If f is differentiable in a neighbourhood of x̄ = (x̄₁, x̄₂, …, x̄ₙ) and x = x̄ + Δx is a point in that neighbourhood, then there is a number θ, 0 < θ < 1, such that
Δf = f(x) − f(x̄) = Σ_{k=1}^n (∂f/∂x_k)(x̄ + θ Δx) · Δx_k .
Proof. Define the function F(t) = f(x̄ + t Δx). The mean value theorem for a function of one variable, together with the chain rule, then gives the result.
When this theorem is used for practical error estimation, the partial derivatives are evaluated at x = x̄ (the
given approximation). When there are only bounds for the errors in the arguments, one can get a bound for |Δf| by taking absolute values and using the triangle inequality.
SELF-ASSESSMENT EXERCISE 1
How accurately do we need to know π in order to be able to compute √π with four correct decimals?
SELF-ASSESSMENT EXERCISE 2
Derive the error propagation formula for division.
4.0 CONCLUSION
Error propagation, a term that refers to the way in which, at a given stage of a calculation, part of the error
arises out of the error at a previous stage. This is independent of the further round off inevitably introduced
between the two stages. Unfavorable error propagation can seriously affect the results of a calculation. The
investigation of error propagation in simple arithmetical operations is used as the basis for the detailed
analysis of more extensive calculations. The way in which uncertainties in the data propagate into the final
results of a calculation can be assessed in practice by repeating the calculation with slightly perturbed data.
5.0 SUMMARY
Error propagation describes the effect on a function of uncertainties in its variables. Example: for F = ma, the error in F
is estimated by ΔF/F = Δm/m + Δa/a; this is nothing other than error propagation. Propagation of error in
addition: when a quantity y is the sum of two variables x and z, the error in y
can be estimated by
dy = dx + dz ,
so that, for instance, a total length L = l₁ + l₂ has error ΔL = Δl₁ + Δl₂.
For F = ma with m = 5 and a = 10,
F = 5 × 10 = 50 N ,
and with the given uncertainties in m and a this leads to
F = (50 ± 2) N.
6.0 TUTOR-MARKED ASSIGNMENT
(a) Derive the error propagation formula for the function y = log x.
(b) Use the result from (a) to derive the error propagation formula for the function y = f(x₁, x₂, x₃) =
x₁^{α₁} x₂^{α₂} x₃^{α₃}. (This technique is called logarithmic differentiation.)
7.0 REFERENCES/FURTHER READINGS
Murray, R. S. (1974). Schaums Outline Series or Theory and Problem of Advanced Calculus. Great
Britain: McGraw–Hill Inc.
Murray, R.S.(1974). ―Schaums Outline Series on Vector Analysis‖. In: An Introduction to Tension
Analysis. Great Britain: McGraw-Hill Inc.
Stephenson, G. (1977). Mathematical Methods for Science Students. London: Longman. Group Limited.
Stroud, K.A. (1995). Engineering Maths. 5th Edition Palgraw Verma, P.D.S. (1995). Engineering
1.0 Introduction
2.0 Objectives
3.0 Main Content
3.1 Number Representation
3.2 Rounding Errors in Floating Point Representation
4.0 Conclusion
5.0 Summary
6.0 Tutor-Marked Assignment
7.0 References/Further Readings
1.0 INTRODUCTION
The architecture of most computers is based on the principle that data are stored in main memory with a
fixed amount of information as a unit. This unit is called a word, and the word length is the number of bits
per word. Most computers have word length 32 bits, but some use 64 bits. Integers can, of course, be
represented exactly in a computer, provided that the word length is large enough for storing all the digits.
In contrast, a real number cannot in general be represented exactly in a computer. There are two reasons
for this: errors may arise when a number is converted from one number system to another, e.g. from the decimal system to the binary system.
Thus, the number (0.1)₁₀ cannot be represented exactly in a computer with a binary number system. The
other reason is that errors may arise because of finite word length of the computer.
2.0 OBJECTIVES
By the end of this unit, you should be able to:
Explain number representation in computational science
Describe rounding errors in floating point representation
3.0 MAIN CONTENT
3.1 Number Representation
How should real numbers be represented in a computer? The first computers (in the 1940s and early 1950s) used fixed point representation.
For each computation the user decided how many units in a computer word were to be used for representing
respectively the integer and the fractional parts of a real number. Obviously, with this method it is difficult
to represent simultaneously both large and small numbers. Assume e.g. that we have a decimal
representation with word length six digits, and that we use three digits for decimal parts. The largest and
smallest positive numbers that can be represented are 999.999 and 0.001, respectively.
Note that small numbers are represented with larger relative errors than large numbers.
This difficulty is overcome if real numbers are represented in the exponential form that we generally use for very small or very large numbers: we do not write
0.00000000789 or 6540000000000,
but rather
7.89 · 10⁻⁹ and 6.54 · 10¹².
This way of representing real numbers is called floating point representation (as opposed to fixed point).
In the number system with base β any real number X ≠ 0 can be written in the form
X = M . 𝛽𝐸 ,
M = ±𝐷0.𝐷1𝐷2𝐷3 … ,
1 ≤ 𝐷0≤ β-1
0 ≤ 𝐷𝑖 ≤ β-1, i = 1, 2,….
M may be a number with infinitely many digits. When a number written in this form is to be stored in a
computer, it must be reduced (by rounding or chopping). Assume that t + 1 digits are used for representing the significand; then X is stored as
x = m · β^e ,
m = ±d₀.d₁d₂d₃…d_t ,
1 ≤ d₀ ≤ β − 1 ,
0 ≤ d_i ≤ β − 1 ,  i = 1, 2, …, t ,
and e = E. The number m is called the significand or mantissa, and e is called the exponent. The digits to the
right of the point in m are called the fraction. From the expression for m it follows that
1 ≤ │m│ < β .
The range of the numbers that can be represented in the computer depends on the amount of storage that is
L≤e≤U,
Where L and U are negative and positive integers respectively. If the result of a computation is floating
point number with e > U, then the computer issue an error signal. This kind of error is called overflow. The
3) In the extreme case where we use rounding, all D_i = β − 1, i = 0, …, t, and we have to augment the last
digit by 1, we get m = 1.00…0 and e = E + 1. We shall ignore this case in the following presentation, but
the results regarding relative error are also valid for this extreme case.
Corresponding error with e < L is called underflow. It is often handled by assigning the value 0 to the result.
A set of normalized floating point numbers is uniquely characterized by β (the base), t (the precision), and the exponent range [L, U].
The floating point system (β, t, L, U) is the set of normalized floating point numbers in the
number system with base β and t digits for the fraction (equivalent to t + 1 digits in the
significand), ie all numbers of the form
x = m . 𝛽𝑒 ,
where
m = ±d₀.d₁d₂d₃…d_t ,   d₀ ≠ 0 ,
0 ≤ d_i ≤ β − 1 ,  i = 1, 2, …, t ,   and L ≤ e ≤ U.
It is important to realize that the floating point numbers are not evenly distributed over the real axis. As an
example, the positive floating point numbers in the system (β, t, L, U) = (2, 2, -2, 1) are shown in Figure
2.3.
Figure 2.3. The positive numbers in the floating point system (2, 2, −2, 1).
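The positive numbers of such a small system can be generated directly; the following Python sketch reproduces the sixteen numbers of the system (2, 2, −2, 1):

def positive_floats(beta, t, L, U):
    """All positive normalized numbers m * beta**e with m = d0.d1...dt and d0 != 0."""
    numbers = set()
    for e in range(L, U + 1):
        # significands run from 1.00...0 up to (beta-1).(beta-1)...(beta-1) in steps of beta**-t
        for k in range(beta**t, beta**(t + 1)):
            numbers.add((k / beta**t) * beta**e)
    return sorted(numbers)

print(positive_floats(2, 2, -2, 1))
# [0.25, 0.3125, 0.375, 0.4375, 0.5, 0.625, 0.75, 0.875,
#  1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0, 3.5]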
Double precision floating point numbers are available in several programming languages, eg Fortran and
MATLAB. Usually, such number are implemented so that two computer words are used for storing one
number; this gives higher precision and a larger range. We return to this in Section 2.8.
Again, we want to emphasize that our notation "the floating point system (β, t, L, U)" means that the
fraction occupies t digits. This notation is consistent with the IEEE standard for floating point arithmetic,
see Section 2.8. In other literature, floating point numbers are often normalized so that β⁻¹ ≤ |m| < 1, and
there (β, t, L, U) means that the significand (equal to the fraction) occupies t digits.
When number are represented in the floating point system (β, t, L, U), we get rounding errors because of
the limiting precision. We shall derive a bound for the relative error.
|x − X| / |X| ≤ ( ½ β^{−t} · β^e ) / ( |M| · β^e ) = ½ β^{−t} / |M| ≤ ½ β^{−t} .
The last inequality follows from the condition |𝑀| ≥ 1. Thus, we have shown
Theorem 2.5.1 The relative rounding error in floating point representation can be estimated as
|x − X| / |X| ≤ μ ,    μ = ½ β^{−t} .
Note that the bound for the relative error is independent of the magnitude of X. This means that both large
and small numbers are represented with the same relative accuracy.
Equivalently, the stored value can be written as x = X (1 + ε), |ε| ≤ μ.
In Section 2.7 we shall see that this formulation is useful in the analysis of accumulated rounding errors in floating point arithmetic.
If we have a computer with binary arithmetic, using t + 1 digits in the significand, how accurate is this
computer, expressed in terms of decimal digits? We must answer this question in order to know how many decimal digits of a computed result can be trusted.
Example. The floating point system (2, 23, -126, 127) has unit roundoff
μ = ½ · 2⁻²³ = 2⁻²⁴ ≅ 5.96 · 10⁻⁸ ≅ 0.5 · 10⁻⁷ .
Thus, this system is roughly equivalent to a floating-point system with (𝛽, 𝑡) = (10,7).
Alternative formulations of the above question are: "How many decimal digits correspond to t + 1 binary digits?"
and "If the unit roundoff in a binary floating point system is μ = 0.5 · 2^{−t}, how many digits must we have
in a decimal system with the same unit roundoff?"
This was the formulation used in the example. In general, we have to solve the equation 0.5 · 10^{−s} = 0.5 · 2^{−t} for s.
Rule of thumb: t binary digits in the fraction correspond to roughly s ≈ t · log₁₀ 2 ≈ 0.3 t decimal digits.
Example. The IEEE standard for single precision arithmetic prescribes t=23 binary digits in the fraction.
This corresponds approximately to a decimal floating point system with s = 7, since 23 log10 2 ≅ 6.9.
The standard for MATLAB has (β, t) = (2, 52). This corresponds to a decimal system with s ≈ 16, since 52 · log₁₀ 2 ≅ 15.7.
>> y = sin(pi/4)
y = 7.071067811865475e-01
If chopping is used instead of rounding, the difference is that then the unit roundoff is μ_c = 2μ = β^{−t}. Floating point arithmetic with chopping is easier
to implement than arithmetic with rounding, but it is rarely used today because the IEEE standard for floating point arithmetic prescribes rounding.
SELF-ASSESSMENT EXERCISE 1
SELF-ASSESSMENT EXERCISE 2
Convert the following: (i) 2345 to base two (ii) ADE3 to base ten (iii) 65328 to base
two
4.0 CONCLUSION
Numbers in everyday life are usually represented using the digits 0 to 9, but this is not the
only way in which a number can be represented. There are multiple number base systems,
which determine which digits are used to represent a number. The number system that we
are most familiar with is called denary or decimal (base-10), but binary (base-2)
and hexadecimal (hex or base-16) are also used by computers. You can perform
arithmetic calculations on numbers written in other base notations, and even convert
numbers between bases. At a more advanced level, you will learn that representing negative
numbers and fractional numbers using binary is also possible. There are several different
5.0 SUMMARY
All the numbers are the same and the easiest version to remember/understand for humans
is the base-16. Hexadecimal is used in computers for representing numbers for human
consumption, having uses for things such as memory addresses and error codes. NOTE:
Hexadecimal is used as it is shorthand for binary and easier for people to remember. It
DOES NOT take up less space in computer memory, only on paper or in your head!
Computers still have to store everything as binary whatever it appears as on the screen.
6.0 TUTOR-MARKED ASSIGNMENT
Let f be a function from ℝⁿ to ℝᵐ, and assume that we want to compute f(ā), where ā is
an approximation of a. Show that the general error propagation formula applied to each
component of f leads to
Δf ≃ J Δa ,
where J is the Jacobian matrix with entries
(J)_{ij} = ∂f_i / ∂a_j .
7.0 REFERENCES/FURTHER READINGS
1.0 Introduction
2.0 Objectives
3.0 Main Content
3.1 Arithmetic Operations
3.2 Floating Point Addition
3.3 Normalize
3.4 Floating Point Multiplication
3.5 Floating Point Division
4.0 Conclusion
5.0 Summary
6.0 Tutor-Marked Assignment
7.0 References/Further Readings
1.0 INTRODUCTION
To perform calculations, you can write programming statements that include arithmetic operators and
functions. The values in the calculations to which the arithmetic operators are applied are called operands.
2.0 OBJECTIVES
By the end of this unit, you should be able to:
Explain arithmetic operation in computational science
Describe addition and subtraction of floating point in arithmetic operation
Describe multiplication and division of floating point in arithmetic operation
3.0 MAIN CONTENT
3.1 ARITHMETIC OPERATION
The aim of this unit is not to describe in full detail how floating point arithmetic can be implemented.
Rather, we want to show that under certain assumptions it is possible to perform floating point operations
with good accuracy. This accuracy should then be requested from all implementations.
Since rounding errors arise already when real numbers are stored in the computer, one can hardly expect that arithmetic operations can be performed without errors. As an example, let a, b and c be variables holding floating point numbers, and consider the assignment
𝑎 ∶= 𝑏 ∗ 𝑐.
In general, the product of two (t + 1)-digit numbers has 2t + 1 or 2t + 2 digits, so the significand must be
reduced (by rounding or chopping) before the result can be stored in 𝑎.
Before 1985 there did not exist a standard for the implementation of floating point arithmetic, and different
computer manufacturers chose different solution, depending on economic and other reasons. In this section
we shall describe somewhat simplified arithmetic in order to be able to explain the principle of floating
point arithmetic without giving too many details. A survey of IEEE floating point standard is given in unit
6.
We shall describe an arithmetic for the floating point system (β, t, L, U) and assume rounding.
Computers have special registers for performing arithmetic operations. The length of these registers (the number of digits they hold) determines how exactly floating point operations can be performed. We assume that the arithmetic registers hold 2t + 4 digits in the significand. (In practice the same accuracy can be obtained more cheaply, and faster, using shorter registers, but this assumption simplifies the discussion.)
Apart from arithmetic and logical operations one must be able to perform shift operations, which are used
in connection with normalization and to make two floating point numbers have the same exponent. As an example of such a shift,
0.031 · 10^1 = 3.100 · 10^(−1).
We first discuss floating point addition (and at the same time subtraction, since x – y = x + (-y)). Let
x = m_x · β^(e_x),   y = m_y · β^(e_y),
and let z = fl[x + y] denote the result of the floating point addition. We assume that e_x ≥ e_y. If e_x > e_y, then m_y is shifted e_x − e_y places to the right before the significands are added; in a decimal system with t + 1 = 4 digits, an exact intermediate sum such as 1.27967 · 10^0 is then rounded to 1.280 · 10^0.
z := x + y:
ez := ex;
if ex − ey ≥ t + 3 then
mz := mx;
else
shift my by (ex − ey) places to the right;
mz := mx + my;
endif
if ex - ey < t + 3, then my can be stored exactly after the shift, since the arithmetic register is assumed to
hold 2t + 4 digits. Also the addition mx + my is performed without error. In general, the result of these
operations may be an unnormalized floating point number z = m_z · β^(e_z), with |m_z| ≥ β or |m_z| < 1, e.g. a sum whose significand is 10.245 (≥ β), or 5.678 · 10^0 + (−5.612 · 10^0) = 0.066 · 10^0. In such cases the floating point number is
normalized by appropriate shifts. Further, the significand must be rounded to t+1 digits. These two tasks
are performed by the following algorithm that takes an unnormalized, nonzero floating point number m.
3.3 NORMALIZE
if |m| ≥ β then
shift m to the right and increase e by one, until |m| < β;
else
shift m to the left and decrease e by one, until |m| ≥ 1;
endif
Round m to t+1 digits;
if |m| = β then
m := m/β; e := e + 1;
endif
if e < L then
x := 0; (exponent underflow)
elseif e > U then
error signal; (exponent overflow)
else
x := m · β^e;
endif
The if-statement after the rounding is needed because the rounding to t + 1 digits can give an unnormalized result: in a decimal system with t + 1 = 4 digits, for instance, rounding the significand 9.9996 gives 10.00, which must be renormalized to 1.000 with the exponent increased by one.
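A minimal Python sketch of this normalize-and-round step for a toy decimal system is given below; the function name and parameters are chosen here for illustration only, and Python's binary floats are used, so the decimal rounding is only approximate.

def normalize(m, e, t=3, L=-9, U=9, beta=10):
    """Normalize and round a nonzero significand m with exponent e
    in the toy system (beta, t, L, U); a sketch, not exact decimal arithmetic."""
    while abs(m) >= beta:          # shift right, increase exponent
        m /= beta
        e += 1
    while abs(m) < 1:              # shift left, decrease exponent
        m *= beta
        e -= 1
    m = round(m, t)                # round significand to t+1 digits
    if abs(m) >= beta:             # rounding may give e.g. 10.00
        m /= beta
        e += 1
    if e < L:
        return 0.0                 # exponent underflow
    if e > U:
        raise OverflowError("exponent overflow")
    return m * float(beta) ** e

print(normalize(0.066, 0))         # 0.066*10^0 -> 6.6*10^-2
print(normalize(10.245, 0))        # 10.245*10^0 -> significand brought back between 1 and 10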
z := x * y;
ez := ex + ey;
mz := mx * my;
Normalize;
z := x / y;
if y = 0 then
error signal; (division by zero)
else
ez := ex − ey;
mz := mx / my;
Normalize;
endif
We have assumed that the arithmetic registers hold 2t + 4 digits. This implies that the results of addition
and multiplication are exact before normalization and rounding. Therefore, the only error in these
operations is the rounding error. A careful analysis of the division algorithm shows that the division of the
significands can be performed so that the 2t + 4 digits are correct. Therefore, the fundamental error
estimate for floating point representation (Theorem 2.5.1) is valid for floating point arithmetic:
Theorem 2.6.1. Let ꙩ denote any of the arithmetic operators +,-,* and /, and assume that x ꙩ y
≠ 0 and that the arithmetic registers are as described above. Then
| fl[x ꙩ y] − x ꙩ y | / | x ꙩ y | ≤ μ,
Or, equivalently,
fl[x ꙩ y] = (x ꙩ y) (1 + 𝜖 ),
for some ε that satisfies |ε| ≤ μ, where μ = (1/2) · β^(−t) is the unit roundoff.
5) In the IEEE standard the error signal for division by zero is z := Inf if x ≠ 0 and z := NaN (Not-a-Number) if x = 0; see unit 6.
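As an informal check of Theorem 2.6.1 in ordinary IEEE double precision (where μ = 2^(−53)), the Python sketch below compares each floating point result with the exact rational result; the operand values are arbitrary illustration choices.

from fractions import Fraction

mu = 2.0 ** -53                       # unit roundoff of IEEE double precision
x, y = 0.1, 0.3                       # arbitrary nonzero operands
for name, fl_result, exact in [
    ("+", x + y, Fraction(x) + Fraction(y)),
    ("-", x - y, Fraction(x) - Fraction(y)),
    ("*", x * y, Fraction(x) * Fraction(y)),
    ("/", x / y, Fraction(x) / Fraction(y)),
]:
    rel = abs(Fraction(fl_result) - exact) / abs(exact)   # exact relative error
    print(name, float(rel), float(rel) <= mu)             # bound of Theorem 2.6.1 holds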
It can be shown that the theorem is valid even if the arithmetic registers hold t + 4 digits only, provided that
A consequence of the errors in floating point arithmetic is that some of the usual mathematical laws are no
longer valid. E.g. the associative law for addition does not necessarily hold. It may happen that fl[fl[a + b] + c] ≠ fl[a + fl[b + c]].
Example. Let a = 9.876 · 10^4, b = −9.880 · 10^4, c = 3.456 · 10^1, and use a decimal floating point system with t + 1 = 4 digits in the significand and rounding. Adding in the two different orders gives very different results.
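The following Python sketch reproduces this kind of behaviour using the decimal module set to four significant digits (the precision is an assumption made here to match the four-digit data, and the printed values depend on these settings).

from decimal import Decimal, getcontext

getcontext().prec = 4                      # four significant digits, round-half-even
a = Decimal("9.876E+4")
b = Decimal("-9.880E+4")
c = Decimal("3.456E+1")

print((a + b) + c)                          # about -5.44
print(a + (b + c))                          # about -1E+1: a very different answer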
Another consequence of the errors is that it is seldom meaningful to test for equality between floating point
numbers. Let x and y be floating point numbers that are results of earlier computation.
Then there is only a very small probability that the Boolean expression x == y will be true.
Therefore, instead of
if x == y
one should use a test of the form
if abs(x − y) ≤ delta · max(abs(x), abs(y))
where delta is some small number, slightly larger than the unit roundoff μ.
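In Python this kind of tolerance test can be written as below; math.isclose provides the same idea as a library routine, and the value of delta is an illustrative choice.

import math

x = 0.1 + 0.2
y = 0.3
print(x == y)                                      # False, because of rounding errors

delta = 1.0e-12                                    # somewhat larger than the unit roundoff
print(abs(x - y) <= delta * max(abs(x), abs(y)))   # True
print(math.isclose(x, y, rel_tol=delta))           # True: library version of the same test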
The basic arithmetic operators that are used in programming follow the same notation as
in mathematics. It is useful to know the common operators, in particular integer division. You may be unfamiliar with DIV: it returns the whole number result (or quotient) of a division, discarding any remainder; DIV is the name normally used for this operator in pseudocode.
5.0 SUMMARY
The basic arithmetic operations (addition, subtraction, multiplication, division and exponentiation) are performed in the natural way with Mathematica. Whenever possible, Mathematica gives an exact answer. For example, "a times b", ab, is entered as either a*b or a b (note the space between the two symbols), and "a divided by b", a/b, is entered as a/b; executing the command a/b results in a fraction in lowest terms.
6.0 TUTOR-MARKED ASSIGNMENT
Let f be a function from Rn to Rm , and assume that we want to compute f(𝑎) , where 𝑎is
an approximation of a. Show that the general error propagation formula applied to each
component of f leads to
∆𝑓 ≃ 𝐽∆𝑎 ,
where (J)_ij = ∂f_i / ∂x_j is the Jacobian matrix of f.
7.6 REFERENCES/FURTHER READINGS
J.H. Wilkinson, "The algebraic eigenvalue problem" , Oxford Univ. Press (1969)
V.V. Voevodin, "Rounding-off errors and stability in direct methods of linear algebra" ,
Moscow (1969) (In Russian)
1.0 Introduction
2.0 Objectives
3.25 Main Content
3.26 Accumulated errors
4.0 Conclusion
5.0 Summary
6.0 Tutor-Marked Assignment
7.0 References/Further Readings
1.0 INTRODUCTION
This unit is concerned with the overall effect of rounding-off at the various stages of a computation procedure on the accuracy of the computed solution, for example the solution of a system of algebraic equations. The most commonly
employed technique for a priori estimation of the total effect of rounding-off errors in
numerical methods of linear algebra is the scheme of inverse (or backward) analysis.
Ax = b.   (1)
The scheme of inverse analysis is as follows. On the assumption that some direct
method M has been used, the computed solution x_M does not satisfy (1), but it can be represented as the exact solution of a perturbed system
(A + F_M) x_M = b + k_M.
The quality of the direct method is estimated in terms of the best a priori estimate that can
be found for the norms of the matrix FM and the vector kM. These "best" FM and kM are
known as the equivalent perturbation matrix and vector, respectively, for the method M.
If estimates for F_M and k_M are available, the error of the approximate solution x_M can be bounded, for example by
‖x_M − x‖ / ‖x‖ ≲ cond(A) ( ‖F_M‖/‖A‖ + ‖k_M‖/‖b‖ ).   (2)
Here cond(A) = ‖A‖ ‖A^(−1)‖ is the condition number of the matrix A, and the matrix norm in (2) is the one subordinate to the chosen vector norm.
In reality, an estimate for ‖A^(−1)‖ is rarely known in advance, and the principal meaning of (2) is the possibility that it offers of comparing the merits of different methods. In the sequel, estimates of this kind are quoted for some common methods.
For methods with orthogonal transformations and floating-point arithmetic (A and b in (1) are assumed to be given exactly), one has the estimate
‖F_M‖_E ≤ f(n) · ‖A‖_E · ε,   (4)
where ε is the relative rounding error, ‖·‖_E is the Euclidean matrix norm, and f(n) is a function of type C·n^k, where n is the order of the system. The exact values of the constants C and the exponents k depend on such details of the computation procedure as the rounding-off method, the use made of accumulation of inner products, and so on.
In Gauss-type methods, the right-hand side of the estimate (4) involves yet another
factor g(A), reflecting the possibility that the elements of A may increase at intermediate
steps of the method in comparison with their initial level (no such increase occurs in
orthogonal methods). In order to reduce g(A) one resorts to various ways of pivoting, thus limiting the growth of the elements.
In the square-root method (or Cholesky method), which is commonly used when the matrix A is positive definite, one obtains the sharper estimate
‖F_M‖_E ≤ C · ‖A‖_E · ε.
There exist direct methods (the methods of Gordan, bordering and of conjugate gradients)
for which a direct application of a scheme of inverse analysis does not yield effective
estimates. In these cases, different arguments are utilized to investigate the accumulation
of errors
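As a numerical illustration of these ideas (not taken from the text), the Python/NumPy sketch below computes a relative residual, which plays the role of a backward error, and the condition number of a small linear system; the matrix and right-hand side are arbitrary example data.

import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

x = np.linalg.solve(A, b)                     # computed solution x_M
r = b - A @ x                                 # residual of the computed solution

backward = np.linalg.norm(r) / (np.linalg.norm(A) * np.linalg.norm(x))
cond_A = np.linalg.cond(A)                    # cond(A) = ||A|| * ||A^-1||

print("relative residual (backward error) ~", backward)
print("cond(A) =", cond_A)
print("forward error bound ~", cond_A * backward)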
2.0 OBJECTIVES
By the end of this unit, you should be able to:
Explain accumulation errors in computational science
As an example of error accumulation in repeated floating point operations we shall consider the
computation of a sum,
S_n = Σ_{k=1}^{n} x_k.
We assume that the sum is computed in the natural order and let s_i denote the computed partial sum,
s_1 := x_1,   s_i := fl[s_{i−1} + x_i] = (s_{i−1} + x_i)(1 + ε_i),   |ε_i| ≤ μ,   i = 2, ..., n.
A simple induction argument shows that the final result can be written in the form
s_n = x_1(1 + η_1) + x_2(1 + η_2) + ⋯ + x_n(1 + η_n),   (2.7.1a)
where 1 + η_1 = (1 + ε_2)(1 + ε_3)⋯(1 + ε_n) and 1 + η_k = (1 + ε_k)(1 + ε_{k+1})⋯(1 + ε_n) for k = 2, ..., n.
To be able to obtain practical error estimates, we need the following lemma, the proof of which is left as an
exercise.
Lemma 2.7.1 Let the numbers ε_1, ε_2, ..., ε_r satisfy |ε_i| ≤ μ, i = 1, 2, ..., r, and assume that rμ ≤ 0.1. Then there is a number δ_r such that
(1 + ε_1)(1 + ε_2)⋯(1 + ε_r) = 1 + δ_r
and
|δ_r| ≤ 1.06 r μ.
Now we can derive two types of results, which give error estimates for summation in floating point
arithmetic.
Theorem 2.7.2. Forward analysis. If nμ ≤ 0.1, then the error in the computed sum can be estimated as
|s_n − S_n| ≤ |x_1||δ_{n−1}| + |x_2||δ_{n−1}| + |x_3||δ_{n−2}| + ⋯ + |x_n||δ_1|,
where
|δ_i| ≤ i · 1.06 μ,   i = 1, 2, ..., n − 1.
Proof sketch: apply Lemma 2.7.1 to the factors in (2.7.1a); the δ_i then satisfy the inequality in the theorem. Subtract S_n and use the triangle inequality.
Forward analysis is the type of error analysis that we used at the beginning of this chapter. However, it is
difficult to use this method to analyze such a fundamental algorithm as Gaussian elimination for the solution
of a linear system of equations. The first correct error analysis of this algorithm was made in the mid 1950s
In backward analysis one shows that the approximate solution S̃ which has been computed for the problem P is the exact solution of a perturbed problem P̃, and one estimates the "distance" between P and P̃. By means of perturbation analysis of the problem it is then possible to estimate the difference between S̃ and the true solution S.
We cite the following description of the aim of backward error analysis from J.R. Rice, Matrix computations
―The objective of backward error analysis is to stop worrying about whether one has the ―exact‖ answer,
because this is not a well-defined thing in most real-world situations. What one wants is to find an answer
which is the true mathematical solution to a problem which is within the domain of uncertainty of the
original problem. Any result that does this must be acceptable as an answer to the problem, at least with the
The error estimates in these two theorems lead to an important conclusion: We can rewrite the estimates in
the form
This shows that in order to minimize the error bound, we should add the terms in increasing order of magnitude, so that the largest terms are affected by the fewest rounding error factors.
Example: Let
x_1 = 1.234 · 10^1,
x_2 = 3.453 · 10^0,
x_3 = 3.442 · 10^(−2),
x_4 = 4.667 · 10^(−3),
x_5 = 9.876 · 10^(−4),
and use the floating point system (10, 3, −9, 9) with rounding. Summation in decreasing and in increasing order of magnitude gives different computed sums, and the second ordering is the one favoured by the estimates above.
Similarly, a relatively large error may arise when a slowly converging series is summed in decreasing order of magnitude (largest terms first). Consider, for example, the sum
Σ_{n=1}^{30000} 1/n²
in the floating point system (2, 23, −126, 127) with rounding. If we take the terms in increasing order of n, we get the result 1.644725, while we get 1.644901 if we sum in decreasing order of n. The last result is equal to the exact sum, correctly rounded.
It should be pointed out that the major part of the difference between the two results is due to the fact that when we sum in increasing order of n, the last 25904 terms do not contribute to the sum, because each of them is smaller than the unit roundoff times the current partial sum.
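The effect can be reproduced in single precision with NumPy, as in the sketch below; the exact printed digits depend on the platform's float32 arithmetic, so they are indicative only.

import numpy as np

terms = (1.0 / np.arange(1, 30001, dtype=np.float64) ** 2).astype(np.float32)

s_forward = np.float32(0.0)            # increasing order of n: largest terms first
for t in terms:
    s_forward = np.float32(s_forward + t)

s_backward = np.float32(0.0)           # decreasing order of n: smallest terms first
for t in terms[::-1]:
    s_backward = np.float32(s_backward + t)

print(s_forward, s_backward)           # the backward (smallest-first) sum is the more accurate one
print(np.pi ** 2 / 6)                  # limit of the full series, for comparison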
SELF ASSESSMENT EXERCISE 1
What are accumulation errors?
SELF ASSESSMENT EXERCISE 2
State the steps in finding accumulated errors in computational science
4.0 CONCLUSION
The accumulation of rounding errors has been studied most thoroughly for algebraic problems (see above). In turn, one is most commonly concerned with algebraic
problems that arise from the approximation of differential equations. These problems
display certain specific features. Errors accumulate in accordance with the same or even
simpler laws as those governing the accumulation of computational errors; they may be
investigated when analyzing a method for the solution of a problem. There are two different
approaches to investigating the accumulation of computational errors. In the first case one
assumes that the computational errors at each step are introduced in the most unfavourable
way and so obtains a majorizing estimate for the error. In the second case, one assumes
that the errors are random and obey a certain distribution law.
5.0 SUMMARY
The nature of the accumulation of errors depends on the problem being solved, the method
of solving and a variety of other factors that appear at first sight to be rather inessential:
such as the type of computer arithmetic (fixed-point or floating-point), the order in which
the arithmetic operations are performed, etc. For example, in computing the sum
of N numbers
A_N = a_1 + ⋯ + a_N
the order of the operations is significant. Suppose that the computation is being done in a computer with floating-point arithmetic with t binary digits, all numbers lying within the range of the machine. If the sum is computed by successive additions,
A_{n+1} = A_n + a_{n+1},   n = 1, ..., N − 1,
the majorizing error estimate is of the order 2^(−t) N. But one can proceed otherwise (see [1]). First compute sums of pairs, A_k^1 = a_{2k−1} + a_{2k} (if N = 2l + 1 is odd, one puts A_{l+1}^1 = a_{2l+1}), then compute sums of pairs of the A_k^1, and so on; after about log_2 N stages the full sum is obtained, and the majorizing error estimate is only of the order 2^(−t) log_2 N. In such cases use of the scheme just described increases the load on the computer memory. However, the sequence of computations can be so organized that the load on the operating memory never exceeds about log_2 N numbers.
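A small Python sketch of the two summation schemes in single precision is given below; float32 is used to make the difference visible, the function names are chosen here for illustration, and the data are arbitrary example values.

import numpy as np

def sequential_sum(a):
    s = np.float32(0.0)
    for x in a:                         # A_{n+1} = A_n + a_{n+1}
        s = np.float32(s + x)
    return s

def pairwise_sum(a):
    if len(a) == 1:
        return np.float32(a[0])
    mid = len(a) // 2                   # sum the two halves recursively (pairs of pairs, ...)
    return np.float32(pairwise_sum(a[:mid]) + pairwise_sum(a[mid:]))

a = (1.0 / np.arange(1, 100001, dtype=np.float64)).astype(np.float32)
print(sequential_sum(a))                # larger accumulated error
print(pairwise_sum(a))                  # error grows only like log2(N)
print(np.sum(a.astype(np.float64)))     # double precision reference value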
In the numerical solution of differential equations one may encounter the following cases.
As the grid spacing h tends to zero, the error may increase as (a(h))^(h^(−q)), where q > 0, while lim sup_{h→0} |a(h)| > 1. Such methods of solution fall into the category of unstable methods.
6.0 TUTOR-MARKED ASSIGNMENT
The following questions concern floating point operations.
(a) Let a and b be two floating point numbers and c = fl[a + b]. Show that the error e = (a + b) − c can be computed exactly in floating point arithmetic by an algorithm of the form
if |a| < |b| then e := (b − c) + a;
else e := (a − c) + b;
(a1) Use the algorithm in the floating point system (10, 4, −9, 9) with a = 1.2345, b =
0.045678.
7.7 REFERENCES/FURTHER READINGS
M.L. Overton, Numerical Computing with IEEE Floating Point Arithmetic (Including One Theorem, One Rule of Thumb and One Hundred and One Exercises), SIAM, Philadelphia, PA, 2001.
Unit 6: IEEE Standard for Floating Point
CONTENTS
1.0 Introduction
2.0 Objectives
3.27 Main Content
3.28 Floating point
3.29 IEEE Standard
4.0 Conclusion
5.0 Summary
6.0 Tutor-Marked Assignment
7.0 References/Further Readings
1.0 INTRODUCTION
IEEE Standards documents (standards, recommended practices, and guides), both full-use and trial-use, are
developed within IEEE Societies and the Standards Coordinating Committees of the IEEE Standards
Association (―IEEE-SA‖) Standards Board. IEEE (―the Institute‖) develops its standards through a
consensus development process, approved by the American National Standards Institute (―ANSI‖), which
brings together volunteers representing varied viewpoints and interests to achieve the final product. IEEE
Standards are documents developed through scientific, academic, and industry-based technical working
groups. Volunteers in IEEE working groups are not necessarily members of the Institute and participate
without compensation from IEEE. While IEEE administers the process and establishes rules to promote
fairness in the consensus development process, IEEE does not independently evaluate, test, or verify the
accuracy of any of the information or the soundness of any judgments contained in its standards. IEEE
Standards do not guarantee or ensure safety, security, health, or environmental protection, or ensure against
interference with or from other devices or networks. Implementers and users of IEEE Standards documents
are responsible for determining and complying with all appropriate safety, security, environmental, health,
and interference protection practices and all applicable laws and regulations. IEEE does not warrant or
represent the accuracy or content of the material contained in its standards, and expressly disclaims all
warranties (express, implied and statutory) not included in this or any other document relating to the
standard, including, but not limited to, the warranties of: merchantability; fitness for a particular purpose;
non-infringement; and quality, accuracy, effectiveness, currency, or completeness of material. In addition,
IEEE disclaims any and all conditions relating to: results; and workmanlike effort. IEEE standards documents are supplied "AS IS" and "WITH ALL FAULTS."
Above all, it is the development of microcomputers that has made it necessary to standardize floating point
arithmetic. The aim is to facilitate portability, i.e. a program should run on different computers without changes, and if two computers conform to the standard, then the program should give identical results (or results that differ only in well-defined ways).
A proposal for a standard for binary floating point arithmetic was presented in 1979. Some changes were
made, and the standard was adopted in 1985. It has been implemented in most computers6). We shall present
the most important parts of the standard for binary floating point arithmetic without going into too much
detail.
The standard defines four floating point formats divided into two groups, basic and extended, each in a
single precision and double precision version. The single precision basic format requires a word of length
s E F
0 8 31
The component of a floating point number x is the sign s (one bit), the biased exponent E (8bits) and the
(In everyday life we normally use the base β = 10. There have also been computers using base 8 (octal system, using 3 bits per digit) and base 16 (hexadecimal system, using 4 bits per digit). A major argument for these systems is that a larger exponent range can be covered with the same number of exponent bits.)
The normal case is a. Due to the normalization the significand satisfies 1 ≤ |𝑚| < 2. Thus, the integer part
is always one, and by not storing this digit we save an extra bit for the fraction. Also note that the exponent
is stored in biased (or shifted) form. The range of values that can be obtained with 8 bits is 0 ≤ E ≤ 255. The two extreme values are reserved (cf. the above table), so the range of values for the true exponent e = E − 127 is −126 ≤ e ≤ 127.
Example. The bit patterns (sign, biased exponent, fraction)
0 01111111 00000000000000000000000
0 10000001 00001000000000000000000
1 01111011 10000000000000000000000
represent the numbers 1.0, 1.03125 · 2^2 = 4.125 and −1.5 · 2^(−4) = −0.09375, respectively.
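Such bit patterns can be decoded directly in Python with the struct module, as sketched below; the helper function is written here for illustration only.

import struct

def decode_single(bits):
    """Interpret a 32-character string of 0s and 1s as an IEEE single precision number."""
    bits = bits.replace(" ", "")
    raw = int(bits, 2).to_bytes(4, "big")
    return struct.unpack(">f", raw)[0]

print(decode_single("0 01111111 00000000000000000000000"))   # 1.0
print(decode_single("0 10000001 00001000000000000000000"))   # 4.125
print(decode_single("1 01111011 10000000000000000000000"))   # -0.09375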
One reason for introducing NaN (Not-a-Number) and Inf (Infinity), items d. and e. above is to make
debugging easier. Both of these are returned as the result in exceptional case. Eg the result of 1/0 and 0/0
is Inf and NaN, respectively. Also, NaN is the result of a computation involving an uninitialized variable, and Inf is returned in case of overflow. Rather than stopping the execution of the program it may be preferable to let the computation continue and examine the exceptional results afterwards.
Example. If e < −126, the floating point number is denormalized (unnormalized). E.g. the number stored as
0 00000000 00010000000000000000000
is 2^(−4) · 2^(−126) = 2^(−130). Note that the leading zeros in f are not significant, and the bound for the relative representation error no longer holds for such numbers. The smallest, nonzero, positive number that can be represented in this way is 2^(−23) · 2^(−126) = 2^(−149).
The basic double precision format is analogous to the single precision format. Here, 64 bits are used, divided into three fields:
[ s | E (11 bits) | f (52 bits) ]   (bit positions 0, 1–11 and 12–63)
f is given with t = 52 binary digits, the biased exponent is E = e + 1023, and the range of positive, normalized numbers is roughly from 2.2 · 10^(−308) to 1.8 · 10^(308).
Details of the extended single and double formats are left to the implementer, but there must be at least one extended format. A common choice is an 80-bit format used both as extended single and extended double. One bit is used for the sign, 15 bits for the biased exponent, and 64 bits for the significand.
Implementations of the standard must provide the four simple arithmetic operations, the square root
function and binary-decimal conversion. When every operand is normalized, then an operation (also the
square root function) must be performed such that the result is equal to the correctly rounded result of the same operation carried out exactly.
The standard specifies that rounding is done according to the rules in section 2.2. In particular, rounding to nearest, with ties broken by rounding to an even last digit, is the default rounding mode.
The extended formats can be used (by the compiler in some high level languages) both for avoiding overflow or underflow in intermediate results and for obtaining higher accuracy.
Example. The computation of s := √(x1² + x2²) may give overflow or underflow, even if the result can be
represented as a normalized floating point number, see Exercise E9. If the computer uses extended precision
for the computed squares and their sum, then overflow or underflow cannot occur.
Extended real s;
s := 0;
for i := 1 to n do
s := s + x_i * x_i;
endfor
l := √s;
If l can be represented (e.g. in single precision), overflow or underflow cannot occur. Further, since the
significand of the extended format has more digits, l will be computed more accurately than in the case
where s is a basic format variable. If n is not too large, the l can even be equal to the exact result rounded
to the basic format.
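The same overflow problem, and a standard way around it, can be shown in Python double precision; the values below are chosen only so that the naive formula overflows, and math.hypot is the library routine that performs the scaled computation.

import math

x1 = x2 = 1.0e200                      # chosen so that x1*x1 overflows in double precision

naive = math.sqrt(x1 * x1 + x2 * x2)
print(naive)                           # inf: the squares overflow although the true result does not

scale = max(abs(x1), abs(x2))          # scaled computation avoids overflow/underflow
safe = scale * math.sqrt((x1 / scale) ** 2 + (x2 / scale) ** 2)
print(safe)                            # about 1.414e200

print(math.hypot(x1, x2))              # library routine, same idea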
In the beginning of this section we mentioned that even if two computers both conform to the IEEE standard for floating point arithmetic, the same program does not necessarily give identical results when run on the two machines, for example because extended precision may be used differently by the compilers.
SELF ASSESSMENT EXERCISE 1
What is the IEEE Standard for Floating-Point Arithmetic?
SELF ASSESSMENT EXERCISE 2
State two of the floating point formats defined by the IEEE Standard for Floating-Point Arithmetic.
4.0 CONCLUSION
The IEEE Standard for Floating-Point Arithmetic (IEEE 754) is a technical standard for
floating-point computation which was established in 1985 by the Institute of Electrical and
Electronics Engineers (IEEE). The standard addressed many problems found in the diverse
floating point implementations that made them difficult to use reliably and reduced their
portability. IEEE Standard 754 floating point is the most common representation today for
real numbers on computers, including Intel-based PC‘s, Macs, and most Unix platforms.
There are several ways to represent floating point numbers, but IEEE 754 is the most widely used. The representation has three fields. The sign of the mantissa: this is as simple as the name; 0 represents a positive number while 1 represents a negative number. The exponent field needs to represent both positive and negative exponents, so a bias is added to the actual exponent before it is stored. The mantissa is the part of the floating-point number consisting of its significant digits; in binary we have only 2 digits, i.e. 0 and 1, so a normalized mantissa is one with only a single 1 to the left of the binary point.
5.0 SUMMARY
This standard specifies interchange and arithmetic formats and methods for binary and decimal floating-point arithmetic in computer systems. A floating-point system conforming to this standard may be realized entirely in software, entirely in
hardware, or in any combination of software and hardware. For operations specified in the
normative part of this standard, numerical results and exceptions are uniquely determined
by the values of the input data, sequence of operations, and destination formats, all under
user control. This standard specifies formats and operations for floating-point arithmetic in
computer systems. Exception conditions are defined and handling of these conditions is
specified.
6.0 TUTOR-MARKED ASSIGNMENT
1. (a) Use Taylor expansion to avoid cancellation in e^x − e^(−x), x close to 0. Alternatively, use a reformulation to avoid the cancellation.
2. Let x be a normalized floating point number in the system (β, t, L, U). Show that r ≤ |x| ≤ R, where r = β^L and R = β^(U+1) · (1 − β^(−(t+1))).
3. Assume that x and y are binary floating point numbers that satisfy xy > 0 and |y| ≤ |x| ≤ 2|y|. Show that
fl[x-y] = x – y.
4. (a) Show that fl[1+x] = 1 for all x ∈ [0, µ], where µ is the unit roundoff.
(b) Show that fl[1+x] > 1 for all x > µ
(c) Let 1 + e be the smallest floating point number greater than 1. Determine e in the floating point system (2, t, L, U) and compare it to μ. (This number e is sometimes called the 'machine epsilon'; it is not the same as the unit roundoff.)
5. Show that the computation of s = √(x1² + x2²) can give overflow or underflow even if s is in the range of the floating point system. (As examples take x1 = x2 = 8 · 10^5 and x1 = x2 = 2 · 10^(−5) in the system (10, 4, −9, 9).) Rewrite the computation so that overflow and underflow are avoided for all data x1, x2 such that the result s is in the range of the system.
7.8 REFERENCES/FURTHER READINGS
M.L. Overton, Numerical Computing with IEEE floating point arithmetic (including one
Theorem, one Rule of Thumb and One Hundred and One Exercises), SIAM, Philadelphia,
PA, 2001.
Module 3 Approximation and Interpolation
CONTENTS
1.0 Introduction
2.0 Objectives
3.30 Main Content
3.31 The module overview
4.0 Conclusion
5.0 Summary
6.0 Tutor-Marked Assignment
7.0 References/Further Readings
1.0 INTRODUCTION
For practical use, it is convenient to have an analytical representation of the relationships between variables
in a physical problem, and this representation can be approximately reproduced from data given by the
problem. The purpose of such a representation might be to determine the values at intermediate points, to
approximate an integral or derivative, or simply to represent the phenomena of interest in the form of a
smooth or continuous function. Interpolation refers to the problem of determining a function that exactly
represents a collection of data. The most elementary type of interpolation consists of fitting a polynomial to the given data points.
2.0 OBJECTIVES
By the end of this unit, you should be able to:
Explain approximation and interpolation in computational science
Describe the general overview of the two concepts
The present chapter is basically concerned with the approximation of functions. The functions in question may be defined on a whole interval, or they may be known only on a finite set of points. The first instance arises, for example, in the context
of special functions (elementary or transcendental) that one wishes to evaluate as part of a subroutine. Since
any such evaluation must be reduced to a finite number of arithmetic operations, we must ultimately
approximate the function by means of a polynomial or a rational function. The second instance is frequently
encountered in the physical sciences when measurements are taken of a certain physical quantity as a
function of some other physical quantity (such as time). In either case one wants to approximate the given function by a simpler function that is easy to evaluate.
The general scheme of approximation can be described as follows. We are given the function f to be
approximated, along with a class Φ of "approximating functions" φ and a "norm" || ⋅ || measuring the size of the error. One then looks for an element φ̂ ∈ Φ such that
||f − φ̂|| ≤ ||f − φ|| for all φ ∈ Φ.   (2.1)
The function φ̂ is called the best approximation to f from the class Φ, relative to the norm || ⋅ ||.
The class Φ is called a (real) linear space if with any two functions φ1, φ2 ∈ Φ it also contains φ1 + φ2 and cφ1 for any c ∈ ℝ, hence also any (finite) linear combination of functions φi ∈ Φ. Given n "basis functions" πj, j = 1, 2, ..., n, one may take Φ to be the set of all linear combinations
φ(t) = c1 π1(t) + c2 π2(t) + ⋯ + cn πn(t).   (2.2)
Examples of linear spaces Φ. 1. Φ = ℙm: polynomials of degree ≤ m. A basis for ℙm is, for example, πj(t) = t^(j−1), j = 1, 2, ..., m + 1, so that n = m + 1. Polynomials are the most frequently used "general-purpose" approximants for dealing with functions on bounded domains (finite intervals or finite sets of points). One reason is Weierstrass's theorem, which states that any continuous function can be approximated on a finite interval, arbitrarily closely, by a polynomial of sufficiently high degree.
2. Φ = 𝕊_m^k(Δ): (polynomial) spline functions of degree m and smoothness class k on the subdivision
Δ: a = t1 < t2 < ⋯ < tN−1 < tN = b
of the interval [a, b]. These are piecewise polynomials of degree ≤ m pieced together at the "joints" t2, ..., tN−1 in such a way that all derivatives up to and including the kth are continuous on the whole interval [a, b], including the joints. We assume here 0 ≤ k < m; otherwise, we are back to polynomials ℙm (see Ex. 68). We set k = −1 if no continuity at all is required at the joints.
3. 𝜙 = 𝕋𝑚 [0, 2𝜋]: trigonometric polynomials of degree ≤m on [0, 2𝜋]. These are linear
combinations of the basic harmonics up to and including the mth one, that is,
φ(t) = a0 + Σ_{k=1}^{m} (ak cos kt + bk sin kt),
where now n = 2m + 1. Such approximants are a natural choice when the function f to be approximated is
periodic with period 2𝜋. (If f has period p, one makes a preliminary change of variables 𝑡 → 𝑡 ⋅ 𝑝/2𝜋.)
4. Φ = E_n: exponential sums. For given distinct αj > 0, one takes πj(t) = e^(−αj t), j = 1, 2, ..., n. Exponential sums are often employed on the half-infinite interval ℝ+: 0 ≤ t < ∞, especially if one knows that f decays exponentially as t → ∞.
Possible choices of norm - both for continuous and discrete functions and the type of approximation they
generate are summarized in Table 2.1. The continuous case involves an interval [a, b] and a "weight
function" w(t) (possibly w(t) = 1) defined on [a, b] and positive except for isolated zeros. The discrete case
involves a set of N distinct points t1, t2….,tN along with positive weight factors
w_1, w_2, ..., w_N (possibly all equal to 1).

Table 2.1. Norms (continuous and discrete) and the type of approximation they generate.
||u||_∞ = max_{a≤t≤b} |u(t)|                      ||u||_∞ = max_{1≤i≤N} |u(t_i)|                      Uniform (Chebyshev)
||u||_1 = ∫_a^b |u(t)| dt                         ||u||_1 = Σ_{i=1}^{N} |u(t_i)|                      L1
||u||_{1,w} = ∫_a^b |u(t)| w(t) dt                ||u||_{1,w} = Σ_{i=1}^{N} w_i |u(t_i)|              Weighted L1
||u||_{2,w} = ( ∫_a^b |u(t)|² w(t) dt )^(1/2)     ||u||_{2,w} = ( Σ_{i=1}^{N} w_i |u(t_i)|² )^(1/2)   Weighted L2 (least squares)

The interval [a, b] may be unbounded if the weight function w is such
that the integral extended over [a, b], which defines the norm, makes sense.
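For the discrete case these norms are easy to compute directly; the small Python sketch below evaluates them for an arbitrary error vector u(t_i) with unit weights (the data are illustrative only).

import numpy as np

u = np.array([0.3, -0.1, 0.25, -0.4, 0.05])   # values u(t_i) of an error function
w = np.ones_like(u)                           # weight factors w_i

norm_inf = np.max(np.abs(u))                  # uniform (Chebyshev) norm
norm_1   = np.sum(w * np.abs(u))              # weighted L1 norm
norm_2   = np.sqrt(np.sum(w * u ** 2))        # weighted L2 (least squares) norm

print(norm_inf, norm_1, norm_2)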
Hence, we may take any one of the norms in Table 2.1 and combine it with any of the preceding linear
spaces 𝜙 to arrive at a meaningful best approximation problem (2.1). In the continuous case, the given
function f, and the functions 𝜑 of the class 𝜙, of course, must be defined on [a, b] and such that the norm
||𝑓 − 𝜑|| makes sense. Likewise, f and 𝜑 must be defined at the points ti in the discrete case.
Note that if the best approximant φ̂ in the discrete case is such that ||f − φ̂|| = 0, then φ̂(t_i) = f(t_i) for i = 1, 2, ..., N. We then say that φ̂ interpolates f at the points t_i, and we refer to this kind of approximation as interpolation.
The simplest approximation problems are the least squares problem and the interpolation problem, and the easiest space Φ to work with is the space of polynomials of a given degree. These are indeed the problems we concentrate on in this chapter. In the case of the least squares problem, however, we admit general linear spaces Φ of approximants, and also in the case of the interpolation problem, we include polynomial splines. It is convenient to treat the continuous and the discrete case simultaneously. We define, in the continuous case,
dλ(t) = w(t) dt on [a, b],   dλ(t) = 0 outside [a, b].   (2.3)
Thus dλ ≡ 0 "outside" [a, b], and dλ(t) = w(t) dt inside. We call dλ a continuous (positive) measure.
The discrete measure (also called ―Dirac measure‖) associated with the point set {t1, t2, ….., tN} is a
measure dλ that is nonzero only at the points t_i and has the value w_i there. Thus, in this case,
∫_ℝ u(t) dλ(t) = Σ_{i=1}^{N} w_i u(t_i).   (2.5)
(A more precise definition can be given in terms of Stieltjes integrals, if we define λ(t) to be a step function with a jump of w_i at each point t_i.) We can then write both norms in the common form ||u||_{2,dλ} = ( ∫_ℝ |u(t)|² dλ(t) )^(1/2) and obtain the continuous or the discrete norm depending on whether λ is taken to be as in (2.3), or a step function, as in (2.5).
We call the support of 𝑑ƛ – and denote it by supp 𝑑ƛ – the interval [a,b] in the continuous case (assuming
w positive on [a,b] except for isolated zeros), and the set {𝑡1, 𝑡2, ....... ,𝑡𝑁 } in the discrete case. We say that
the set of functions πj(t) in (2.2) is linearly independent on the support of dλ if
Σ_{j=1}^{n} cj πj(t) ≡ 0 for all t ∈ supp dλ implies c1 = c2 = ... = cn = 0.   (2.7)
Consider, for example, the powers πj(t) = t^(j−1). Here Σ_{j=1}^{n} cj πj(t) = p_{n−1}(t) is a polynomial of degree ≤ n − 1. Suppose, first, that supp dλ = [a, b]. Then the identity in (2.7) says that p_{n−1}(t) ≡ 0 on [a, b]. Clearly, this implies c1 = c2 = ... = cn = 0, so that the powers are linearly independent on supp dλ = [a, b]. If, on the other hand, supp dλ = {t1, t2, ..., tN}, then the premise in (2.7) says that p_{n−1}(t_i) = 0, i = 1, 2, ..., N; that is, p_{n−1} has N distinct zeros t_i. This implies p_{n−1} ≡ 0 only if N ≥ n. Otherwise,
p_{n−1}(t) = Π_{i=1}^{N} (t − t_i) ∈ ℙ_{n−1}
would satisfy p_{n−1}(t_i) = 0, i = 1, 2, ..., N, without being identically zero. Thus, we have linear independence on supp dλ = {t1, t2, ..., tN} if and only if N ≥ n.
SELF ASSESSMENT EXERCISE 1
Define Interpolation
SELF ASSESSMENT EXERCISE 2
Differentiate between approximation and Interpolation
4.0 CONCLUSION
To apply the methods of numerical mathematics, it is necessary to know and analyze the
error estimate. In general, we can say that the problem we solve is called input information
and the corresponding result is output information. The process of transforming input into output information is the numerical method itself. Both approximation, which allows an approximate match, and interpolation, which requires exact matches, are treated. By interpolation we come to functions
that pass exactly through all given points, and we use it for a small amount of input data.
Interpolation implies the passage of an interpolation function through all given points,
while the approximation allows errors to a certain extent, and then we smooth the obtained
function. In the case of interpolation, the problem of determining the function f is called
the interpolation problem, and the given points and xi are called nodes (base points,
interpolation points). We choose the function f according to the nature of the model, but so
that it is relatively simple to calculate. These are most often polynomials, trigonometric
functions, exponential functions and, more recently, rational functions. In practice, it has
been shown that it is not wise to use polynomials of degree greater than three for interpolation over many nodes. By approximation, we arrive at functions that pass through a group of data in the best possible
way, without the obligation to pass exactly through the given points. The approximation is
suitable for large data groups, nicely grouped data, and small and large groups of scattered
data.
5.0 SUMMARY
Approximation occurs in two forms. We know the function f, but its form is complicated
to compute. In this case, we select a simpler substitute function to use in its place. The error of the obtained
approximation can be estimated with respect to the true value of the function. The
function f is unknown to us, but only some information about it is known. For example,
values at a set of points are known. The substitution function ϕ is determined from the
available information, which, in addition to the data itself, includes the expected form of the data behaviour, i.e. the form of the function ϕ. In this case, we cannot make an error estimate without
additional information about the unknown function f [1-3]. In practice, we often encounter
the variant that the function f is not known to us. It most often occurs when measuring
various quantities, because in addition to the measured data, we also try to approximate the
data between the measured points. Some of the mathematical problems can be solved by
numerical methods, however, not always with great precision and accuracy. Sometimes the
time we have to solve problems is not enough and in that case we use programming
methods using a computer. Programming allows you to solve complex tasks with great
accuracy and in a short period of time. The ability of a computer to perform a large number of operations very quickly has driven the development of numerical mathematics and mathematics in general, and thus the development of science and
technology. All software solutions are integrated systems for numerical and symbolic
calculations, graphical presentation and interpretation, and provide support that allows the user to concentrate on the problem rather than on the computational details.
6.0 TUTOR-MARKED ASSIGNMENT
We want to compute f(a) = √a, and we have very high requirements concerning speed.
(a) One possible method is to interpolate linearly in an equidistant table. Which table size
is needed if we require that |RXF + RT| shall be smaller than 2μ? The computer is using
the floating point system (2, 23,−126, 127).
(b) Another method is to perform one iteration with Newton-Raphson‘s method applied
to the equation f(x) = x2 − a = 0. The initial approximation x0 is taken from a table.
Which table size is needed if we require that, the error after one iteration is smaller than
2μ?
7.9 REFERENCES/FURTHER READINGS
Taibleson MH, Nikol'skiĭ SM. Approximation of functions of several variables and imbedding theorems.
CONTENTS
1.0 Introduction
2.0 Objectives
3.0 Main Content
3.1 Inner products
3.2 The normal equations
3.3 Convergence
4.0 Conclusion
5.0 Summary
6.0 Tutor-Marked Assignment
7.0 References/Further Readings
1.0 INTRODUCTION
We specialize the best approximation problem (2.1) by taking as norm the L2 norm
||u||_{2,dλ} = ( ∫_ℝ |u(t)|² dλ(t) )^(1/2),   (2.8)
where dλ is either a continuous measure (cf. (2.3)) or a discrete measure (cf. (2.5)), and by using for Φ a linear space of the form (2.2). Here the basis functions πj are assumed linearly independent on supp dλ (cf. (2.7)). We furthermore assume, of course, that the integral in (2.8) is meaningful whenever u = πj or u = f, the given function to be approximated.
The solution of the least squares problem is most easily expressed in terms of systems of functions πj that are orthogonal relative to an appropriate inner product. We therefore begin with a discussion of inner products.
2.0 OBJECTIVES
By the end of this unit, you should be able to:
Explain inner products in approximation and interpolation
Describe the normal equations concepts
3.0 MAIN CONTENT
3.1 Inner Products
Given a continuous or discrete measure dλ, as introduced earlier, and given any two functions u, v having finite norms ||u||_{2,dλ}, ||v||_{2,dλ}, we define their inner product by
(u, v) = ∫_ℝ u(t) v(t) dλ(t).   (2.10)
(Schwarz's inequality |(u, v)| ≤ ||u||_{2,dλ} · ||v||_{2,dλ}, cf. Ex. 6, tells us that the integral in (2.10) is well defined.) The inner product (2.10) has the following obvious (but useful) properties:
1. Symmetry: (u, v) = (v, u);
2. Homogeneity: (αu, v) = α(u, v), α ∈ ℝ;
3. Additivity: (u + v, w) = (u, w) + (v, w);
4. Positive definiteness: (u, u) ≥ 0, with equality holding if and only if u ≡ 0 on supp dλ.
Properties 2 and 3 express linearity of the inner product in the first variable and, by symmetry, also in the second; this linearity easily extends to linear combinations of several functions. Two functions u and v are called orthogonal if (u, v) = 0. For orthogonal functions the Pythagorean theorem holds:
||u + v||² = ||u||² + ||v||²,
where || ⋅ || = || ⋅ ||_{2,dλ}. (From now on we use this abbreviated notation for the norm.) Indeed,
||u + v||² = (u + v, u + v) = (u, u + v) + (v, u + v) = (u, u) + 2(u, v) + (v, v) = ||u||² + ||v||²,
where the first equality is a definition, the second follows from additivity, the third from symmetry, and the last from orthogonality. Interpreting functions u, v as "vectors," we can picture the configuration of u, v and u + v as a right triangle.
More generally, the functions u_1, u_2, ..., u_n are said to form an orthogonal system if
(u_i, u_j) = 0 if i ≠ j,   u_k ≢ 0 on supp dλ;
then the Pythagorean theorem generalizes to
|| Σ_{k=1}^{n} αk u_k ||² = Σ_{k=1}^{n} |αk|² ||u_k||².   (2.16)
The proof is essentially the same as before. An important consequence of (2.16) is that every orthogonal
system is linearly independent on the support of 𝑑𝜆 Indeed, if the left-hand side (2.16) vanishes, then so
does the right-hand side, and this, since ||u_k||² > 0 by assumption, implies α1 = α2 = ⋯ = αn = 0.
We are now in a position to solve the least squares approximation problem. By (2.12), we can write the L2 error of the approximation φ = Σ_{j=1}^{n} cj πj as
E2[φ] = ||f − φ||² = ∫_ℝ ( f(t) − Σ_{j=1}^{n} cj πj(t) )² dλ(t).   (2.17)
The squared L2 error, therefore, is a quadratic function of the coefficients c1, c2, ..., cn. The problem of best L2 approximation thus amounts to minimizing a quadratic function of n variables. This is a standard problem of calculus and is solved by setting all partial derivatives equal to zero. This yields a system of linear algebraic equations. Indeed, differentiating partially with respect to ci under the integral sign in (2.17) gives
∂E2[φ] / ∂ci = 2 ∫_ℝ ( Σ_{j=1}^{n} cj πj(t) ) πi(t) dλ(t) − 2 ∫_ℝ πi(t) f(t) dλ(t),
and setting this equal to zero, interchanging integration and summation in the process, we get
Σ_{j=1}^{n} (πi, πj) cj = (πi, f),   i = 1, 2, ..., n.   (2.18)
These are called the normal equations for the least squares problem. They form a linear system of the form
Ac = b,   A = [(πi, πj)] ∈ ℝ^(n×n),   b = [(πi, f)] ∈ ℝ^n,   c = [cj] ∈ ℝ^n.
By symmetry of the inner product, A is a symmetric matrix. Moreover, A is positive definite; that is,
x^T A x = Σ_{i=1}^{n} Σ_{j=1}^{n} a_ij x_i x_j > 0  if x ≠ [0, 0, ..., 0]^T.   (2.21)
The quadratic function in (2.21) is called a quadratic form (since it is homogeneous of degree 2). Positive
definiteness of A thus says that the quadratic form whose coefficients are the elements of A is always positive, except when all variables are zero.
To prove (2.21), all we have to do is insert the definition of the a_ij = (πi, πj) and use the elementary properties 1-4 of the inner product.
Now it is a well-known fact of linear algebra that a symmetric positive definite matrix A is nonsingular.
Indeed, its determinant, as well as all its leading principal minor determinants, are strictly positive. It
follows that the system (2.18) of normal equations has a unique solution. Does this solution correspond to
the minimum of E2[φ] in (2.17)? Calculus tells us that for this to be the case, the Hessian matrix H = [∂²E2 / ∂ci ∂cj] has to be positive definite. But H = 2A, since E2 is a quadratic function. Therefore, H, with A, is indeed positive definite, and the solution of the normal equations gives us the desired minimum. The least squares approximant is therefore
φ̂(t) = Σ_{j=1}^{n} ĉj πj(t),   (2.22)
where ĉ = [ĉj] is the solution of the normal equations (2.18).
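In the discrete case with unit weights the normal equations take a very concrete form; the Python/NumPy sketch below fits a quadratic polynomial to sample data by forming and solving Ac = b (the data, the monomial basis and the degree are illustrative assumptions).

import numpy as np

t = np.linspace(0.0, 1.0, 21)                 # support points t_i
f = np.exp(t)                                 # data f(t_i) to be approximated
n = 3                                         # basis pi_j(t) = t^(j-1), j = 1..n

P = np.vander(t, n, increasing=True)          # columns are pi_1(t_i), ..., pi_n(t_i)
A = P.T @ P                                   # a_ij = (pi_i, pi_j), discrete inner product
b = P.T @ f                                   # b_i = (pi_i, f)

c = np.linalg.solve(A, b)                     # coefficients of the least squares approximant
phi = P @ c
print(c)
print(np.sqrt(np.sum((f - phi) ** 2)))        # discrete L2 error ||f - phi||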
This completely settles the least squares approximation problem in theory. How about in practice?
Assuming a general set of (linearly independent) basis functions, we can see the following possible
difficulties.
1. The system (2.18) may be ill-conditioned. A simple example is provided by supp dλ = [0, 1], dλ(t) = dt, and the power basis πj(t) = t^(j−1). Then
(πi, πj) = ∫_0^1 t^(i+j−2) dt = 1 / (i + j − 1),   i, j = 1, 2, ..., n;
that is, A is the notoriously ill-conditioned Hilbert matrix.
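The growth of the condition number of this matrix can be observed numerically, as in the short NumPy sketch below (the sizes n are arbitrary choices).

import numpy as np

for n in (4, 8, 12):
    H = np.array([[1.0 / (i + j - 1) for j in range(1, n + 1)]
                  for i in range(1, n + 1)])          # (pi_i, pi_j) = 1/(i+j-1)
    print(n, np.linalg.cond(H))                       # condition number grows explosively with n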
The resulting severe ill-conditioning of the normal equations in this example is entirely due to an
unfortunate choice of basis functions – the powers. These become almost linearly dependent, more so the larger n is. A related difficulty concerns the right-hand sides b_j = (πj, f) in (2.18). When j is large, the power πj(t) = t^(j−1) behaves very much like a discontinuous function
on [0,1]: it is practically zero for much of interval until it shoots up to the value 1 at the right
endpoint. This has the unfortunate consequence that a good deal of information about 𝑓 is lost when
one forms the integral defining 𝑏𝑗. A polynomial 𝜋𝑗 that oscillates rapidly on [0,1] would seem to
be preferable from this point of view, since it would "engage" the function f more vigorously over the whole interval [0, 1].
2. The second disadvantage is the fact that all coefficients cj in (2.22) depend on n; that is, cj = cj^(n), j = 1, 2, ..., n. Increasing n, for example, will give an enlarged system of normal equations with a completely new solution vector. We refer to this as the nonpermanence of the coefficients cj.
Both defects 1 and 2 can be eliminated (or at least attenuated in case of 1) in one stroke: by choosing the basis functions to be orthogonal,
(πi, πj) = 0 if i ≠ j;   (πj, πj) = ||πj||² > 0.
Then the system of normal equations becomes diagonal and is solved immediately by
ĉj = (πj, f) / (πj, πj),   j = 1, 2, ..., n.   (2.24)
Clearly, each of these coefficients 𝑐𝑗 is independent of n, and once computed, remains the same for any
larger n. We now have permanence of the coefficients. Also, we do not have to go through the trouble of
solving a linear system of equations, but instead can use the formula (2.24) directly. This does not mean
that there are no numerical problems associated with (2.24). Indeed, it is typical that the denominator ||πj||²
in (2.24) decrease rapidly with increasing 𝑗, whereas the integrand in the numerator (or the individual terms
in the case of a discrete inner product) are of the same magnitude as 𝑓. Yet the coefficients 𝑐𝑗 also are
expected to decrease rapidly. Therefore, cancellation errors must occur when one computes the inner
product in the numerator. The cancellation problem can be alleviated somewhat by computing 𝑐𝑗 in the
alternative form
ĉj = ( 1 / (πj, πj) ) · ( f − Σ_{k=1}^{j−1} ĉk πk , πj ),   j = 1, 2, ..., n,   (2.25)
where the empty sum (when j = 1) is taken to be zero, as usual. Clearly, by orthogonality of the πj, (2.25) is mathematically equivalent to (2.24).
An algorithm for computing the ĉj from (2.25), and at the same time φ̂(t), is as follows:
s0 := 0;
for j = 1, 2, ..., n do
ĉj := (1 / ||πj||²) · (f − s_{j−1}, πj);
s_j := s_{j−1} + ĉj πj;
endfor
This produces the coefficients ĉ1, ĉ2, ..., ĉn as well as φ̂ = s_n.
Any system {πj} that is linearly independent on the support of dλ can be orthogonalized (with respect to the measure dλ) by a device known as the Gram–Schmidt procedure. One takes
π̃1 = π1;   π̃j = πj − Σ_{k=1}^{j−1} ck π̃k,   ck = (πj, π̃k) / (π̃k, π̃k),   j = 2, 3, ..., n.
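A small Python sketch of this Gram–Schmidt orthogonalization with respect to a discrete inner product is given below; the support points, weights and monomial basis are illustrative choices, and the helper name inner is introduced only for this example.

import numpy as np

t = np.linspace(-1.0, 1.0, 41)                 # discrete support points t_i
w = np.ones_like(t)                            # weight factors w_i

def inner(u, v):
    return np.sum(w * u * v)                   # discrete inner product (u, v)

basis = [t ** j for j in range(4)]             # powers 1, t, t^2, t^3
ortho = []
for p in basis:
    q = p.copy()
    for pk in ortho:                           # subtract components along earlier orthogonal functions
        q = q - (inner(p, pk) / inner(pk, pk)) * pk
    ortho.append(q)

for i in range(len(ortho)):                    # check: off-diagonal inner products are (close to) zero
    for j in range(i):
        print(i, j, inner(ortho[i], ortho[j]))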
3.3 Convergence
We have seen in Sect. 2.1.2 that if the class Φ = Φn is spanned by n functions πj, j = 1, 2, ..., n, that are linearly independent on the support of some measure dλ, then the least squares problem for this measure,
||f − φ̂n|| = min of ||f − φ|| over all φ ∈ Φn,   (2.26)
has a unique solution φ̂ = φ̂n given by (2.22). There are many ways we can select a basis πj in Φn and,
therefore many ways the solution 𝜑n can be represented. Nevertheless, it is always one and the same
function. The least squares error – the quantity on the right-hand side of (2.26) – therefore is independent
of the choice of basis functions (although the calculation of the least squares solution, as mentioned
previously, is not). In studying this error, we may thus assume, without restricting generality, that the basis
πj is an orthogonal system. (every Linearly independent system can be orthogonalized by the Gram-Schmidt
We first note that the error f − φ̂n is orthogonal to the space Φn; that is,
(f − φ̂n, φ) = 0 for all φ ∈ Φn,   (2.28)
where the inner product is the one in (2.10). Since φ is a linear combination of the πk, it suffices to show (2.28) for each φ = πk, k = 1, 2, ..., n. Inserting φ̂n from (2.27) in the left-hand side of (2.28), and using orthogonality of the πj, we obtain
(f − φ̂n, πk) = (f, πk) − ĉk (πk, πk) = 0,
the last equation following from the formula for ĉk in (2.27). The result (2.28) has a simple geometric
interpretation. If we picture functions as vectors, and the space φn as a plane, then for any f that ―sticks out‖
of the plane φn, the least squares approximant 𝜑n is the orthogonal projection of f onto φn; see Fig. 2.2.
[Fig. 2.2: f, the subspace Φn, and the orthogonal projection φ̂n of f onto Φn; the error satisfies (f − φ̂n, φ̂n) = 0.]
SELF ASSESSMENT EXERCISE 1
Define the least squares method.
SELF ASSESSMENT EXERCISE 2
Differentiate between the least squares method and the normal equations.
4.0 CONCLUSION
The least squares method is a mathematical procedure for finding the best-fitting curve to a given set of points by minimizing the sum of the squares of the offsets ("the residuals") of the points from the curve. The sum of the squares of the offsets is used instead of the offset absolute values because this allows the residuals to be treated as a continuously differentiable quantity. However, because squares of the offsets are used, outlying points can have a disproportionate effect on the fit, a property which may or may not be desirable depending on the problem at hand.
5.0 SUMMARY
The least squares method, also called least squares approximation, is, in statistics, a method for estimating the true value of some quantity based on a consideration of errors in observations or measurements. … One of the first applications of the method of least squares was the determination of the orbits of celestial bodies from astronomical observations.
6.0 TUTOR-MARKED ASSIGNMENT
(1) Derive a method for estimating ∫_a^b f(x) dx by interpolating f by a linear spline with knots x0, x1, ..., xn.
(2) Show that the interpolating linear spline with knots x0, x1, ..., xn is the function that minimizes ∫_{x0}^{xn} (g′(x))² dx among all functions g such that g(xi) = fi, i = 0, 1, ..., n.
(3) For r = 1, 2, 3 show that the B-spline B_i^r(x) has support [xi, x_{i+r+1}] and that B_i^r(x) > 0 for xi < x < x_{i+r+1}.
7.10 REFERENCES/FURTHER READINGS
Press WH, Teukolsky SA, Vetterling WT, et al. Numerical recipes in C++. The art of
scientific computing. 1992;2:1002
Davis PJ. Interpolation and approximation. Courier Corporation; 1975.
Taibleson MH, Nikol'skiĭ SM. Approximation of functions of several variables and imbedding theorems.