MTH 214 Accuracy in Numerical Calculations and Error Analysis
The smallest unit of information stored in memory is the binary digit, abbreviated as
BIT. It represents either “0” or “1”. The instructions given to a computer and the data to
be processed are stored as groups of bits, which are classified as follows:
NIBBLE: A string of four bits is called a NIBBLE.
COMPUTER WORD: A string of bits whose size, called the WORD LENGTH or WORD
SIZE, is fixed for a specific computer, though it may vary from computer to computer.
WORD LENGTH: The WORD LENGTH may be 1 Byte, 2 Bytes, 4 Bytes or even larger.
All computers are designed to use binary digits to represent numbers and other
information. The memory is organized into strings of bits called words. Computers read
decimal numbers supplied by humans but convert them automatically into binary numbers
for internal use. These binary numbers may also be expressed in the octal or hexadecimal
form. For output, the numbers are reconverted to decimal form for human use.
INTEGER REPRESENTATION
Decimal numbers are first converted into their binary equivalent and then expressed in either
integer or floating point form. In the integer representation the decimal or binary point is
always fixed to the right of the least significant digit and therefore fractions are not
included. The magnitude of the number is restricted to $2^n - 1$, where n is the word length in
bits. Negative numbers are stored by using the 2’s complement. This is done by taking the
1’s complement of the binary representation of the positive number and then adding 1 to it.
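Example: Represent -13 using the 2’s complement.
Solution: 13 = 01101
1’s complement of 01101 = 10010
                         +00001
2’s complement          = 10011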
Note: The extra bit at the leftmost position of the binary number is the sign bit: “0”
indicates that the number is positive and “1” indicates that the number is negative. If one
bit is reserved to represent the sign of the number, there are only n-1 bits to represent the
magnitude. Thus a 16-bit word can contain numbers from $-2^{15}$ to $2^{15} - 1$, i.e. from
-32768 to +32767.
Example: Show that the number -32768 is represented in a 16-bit word as follows:
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
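The 2’s complement rule above can be checked with a few lines of Python. This is only a minimal sketch: the function name `twos_complement` and its `bits` parameter are illustrative choices, not part of the notes.

```python
def twos_complement(value: int, bits: int) -> str:
    """Return the `bits`-wide 2's complement bit pattern of `value`."""
    low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not low <= value <= high:
        raise OverflowError(f"{value} does not fit in {bits} bits")
    # Masking with 2**bits - 1 maps a negative value v to 2**bits + v,
    # which equals "invert the bits of |v| and add 1".
    return format(value & ((1 << bits) - 1), f"0{bits}b")


print(twos_complement(13, 5))       # 01101
print(twos_complement(-13, 5))      # 10011
print(twos_complement(-32768, 16))  # 1000000000000000
```

The last call reproduces the 16-bit pattern of -32768 shown in the example.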
Suppose a register is capable of storing 6 digits and a sign bit, and that this register is
split into two parts: one part contains the integral portion of the number and the other part
contains the fractional portion, with the decimal point located between the two parts. The
first drawback of this scheme is that the range of numbers which can be represented
is limited to -999.999 to +999.999.
A floating point number is said to be normalized if the most significant digit of the
mantissa is non-zero. The shifting of the decimal point to the left of the most significant
digit is called normalization, and the real numbers represented in this form are known as
normalized floating point numbers. The mantissa m of a normalized decimal floating point
number satisfies the following inequality:
$0.1 \le |m| < 1$
The actual exponent values represented range from -128 to +127. An 8-bit exponent field
can hold the values 0 to 255, so the constant 128 is added to the actual exponent to give the
biased exponent. All biased values are non-negative and therefore this technique eliminates
the need for a sign bit in the exponent. One bit holds the sign of the floating point number:
if it is 0 the number is positive, and if it is 1 the number is negative. The remaining 23
bits represent the bit pattern of the mantissa.
Example: Represent IEEE 32-bit format for 12.6875 in normalized floating point form
Solution:
*The binary equivalent of 12.6875 is 1100.1011, which in normalized form is
0.11001011 x 2^4.
*The mantissa 11001011, when extended to 23 bits, by adding 0’s on the right, becomes
1 1 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
*The exponent of floating point number is 4. Thus we are adding constant 128 in the
biased exponent. The modified exponent becomes 132 and its binary equivalent is
10000100.
*Combining the results of all the above steps, the final representation of 12.6875 is
1 0 0 0 0 1 0 0 0 1 1 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Example: Represent IEEE 32-bit format for -12.6875 in normalized floating point form.
*The mantissa 11001011, when extended to 23 bits, by adding 0’s on the right, becomes
1 1 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
*Taking 2’s complement of the above 23-bit pattern, we get
0 0 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
*The exponent of floating point number is 4. Thus we are adding constant 128 in the
biased exponent. The modified exponent becomes 132 and its binary equivalent is
10000100.
* Combining the result of all the above steps, we get the final representation of -12.6875
is
1 0 0 0 0 1 0 0 1 0 0 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
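The two worked examples can be replayed in Python. The sketch below encodes a number in the layout these notes use (8-bit excess-128 exponent, then the sign bit, then a 23-bit mantissa of the form 0.1xxx, with a negative mantissa stored in 2’s complement). Note that this layout is not the one defined by the IEEE 754 standard, which places the sign bit first, biases the exponent by 127 and does not store the leading 1 of the mantissa; the second function prints the actual IEEE 754 single-precision bits for comparison. The function names are illustrative assumptions.

```python
import math
import struct


def notes_layout(x: float) -> str:
    """Encode a non-zero x in the layout used in these notes (not IEEE 754)."""
    m, e = math.frexp(abs(x))             # abs(x) = m * 2**e with 0.5 <= m < 1
    mantissa = int(m * (1 << 23))         # first 23 bits after the binary point, chopped
    if x < 0:
        mantissa = (-mantissa) & ((1 << 23) - 1)   # 23-bit 2's complement
    sign = "1" if x < 0 else "0"
    return format(e + 128, "08b") + sign + format(mantissa, "023b")


def ieee754_single(x: float) -> str:
    """Return the 32 bits of x as an IEEE 754 single-precision number."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    return format(word, "032b")


print(notes_layout(12.6875))     # 10000100011001011000000000000000
print(notes_layout(-12.6875))    # 10000100100110101000000000000000
print(ieee754_single(12.6875))   # 01000001010010110000000000000000
```

In the IEEE 754 pattern the exponent field is 10000010 (3 + 127), because the standard normalizes 12.6875 as 1.1001011 x 2^3 rather than 0.11001011 x 2^4.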
FLOATING POINT ARITHMETIC OPERATIONS
The arithmetic operations below are performed on numbers in normalized floating point form
having a four-digit mantissa and a two-digit exponent. Let us assume that the mantissa
retained by a hypothetical computer is from -0.9999 to +0.9999 and the exponent is from
-99 to +99.
ADDITION OPERATION
For addition it is necessary that both operands have the same exponent. If the exponents
are not equal, the exponent of the number with the smaller exponent is made equal to
the larger exponent and its mantissa is modified accordingly. This is done by shifting the
decimal point of its mantissa to the left by a number of places equal to the positive
difference between the two exponents. The mantissas of the two numbers are then added.
The next phase normalizes the result. Normalization consists of shifting significant digits
left until the most significant digit is non-zero. Each shift causes a decrement of the
exponent, and could cause an exponent underflow. Finally, the result must be reduced to
four digits (in the examples below the extra digits are chopped off).
Since the mantissa retained by the hypothetical computer is from -0.9999 to +0.9999, the
mantissa of the sum (before normalization) can be as large as 1.9998. In that case the
decimal point must be shifted to the left by one position in order to normalize it, and
the exponent of the sum is increased by 1. Thus, because of the normalization
process, the exponent of the sum may become greater than +99. This exceeds the largest
number which the computer can store. This condition is called overflow, and the computer
will indicate this error.
Example: Add 0.8761E5 to 0.8906E7
Solution
The decimal point of the mantissa of 0.8761 is shifted by 2 (7 - 5) positions to the left and
the exponent is increased by 2. The number after this adjustment becomes 0.0087E7,
where the digits 6 and 1 are chopped off. Now these numbers can be added as follows:
Addend 0.0087E7
Augend 0.8906E7
Sum 0.8993E7
Example: Add 0.9754E8 to 0.9871E9
Solution
The decimal point of the mantissa of 0.9754 is shifted by 1 (9 – 8) position to the left and
the exponent is increased by 1. The number after this adjustment becomes 0.0975E9,
where the digit 4 is chopped off. Now these numbers can be added as follows.
Addend 0.9871E9
Augend 0.0975E9
Sum 1.0846E9
Since the mantissa of the sum is greater than 1.0, the decimal point is shifted to the left
by one position and the exponent is increased by 1. The result after normalization becomes
0.1084E10, where the digit 6 is chopped off.
Example: Add 0.9278E98 to 0.9896E99
Solution
The decimal point of the mantissa of 0.9278 is shifted by 1 (99 – 98) position to the left
and the exponent is increased by 1. The number after this adjustment becomes 0.0927E99,
where the digit 8 is chopped off. Now these numbers can be added as follows:
Addend 0.9896E99
Augend 0.0927E99
Sum 1.0823E99
Since the mantissa of the sum is greater than 1.0, the decimal point is shifted to the
left by one position and the exponent is increased by 1. The result after normalization
becomes 0.1082E100, where the digit 3 is chopped off. Since this number is greater than the
largest number which the hypothetical computer can handle, it is a case of overflow and the
computer will indicate this error.
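The addition procedure can be simulated with integer mantissas, which avoids any binary rounding in the simulation itself. The following is a minimal sketch under stated assumptions: a number such as 0.8761E5 is written as the pair `(8761, 5)`, both operands are positive and normalized, and extra digits are chopped as in the worked examples. The name `fp_add` is hypothetical.

```python
MANTISSA_DIGITS = 4
MAX_EXPONENT = 99


def fp_add(a, b):
    """Add two positive numbers given as (mantissa, exponent) pairs."""
    (ma, ea), (mb, eb) = a, b
    if ea < eb:                          # let `a` be the operand with the larger exponent
        (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
    mb //= 10 ** (ea - eb)               # align exponents; low-order digits are chopped
    total, exponent = ma + mb, ea
    if total >= 10 ** MANTISSA_DIGITS:   # mantissa of the sum is 1.0 or more
        total //= 10                     # shift the point left, chopping the last digit
        exponent += 1
    if exponent > MAX_EXPONENT:
        raise OverflowError(f"exponent {exponent} exceeds +{MAX_EXPONENT}")
    return total, exponent


print(fp_add((8761, 5), (8906, 7)))   # (8993, 7), i.e. 0.8993E7
print(fp_add((9754, 8), (9871, 9)))   # (1084, 10), i.e. 0.1084E10
try:
    fp_add((9278, 98), (9896, 99))
except OverflowError as exc:
    print("overflow:", exc)
```

The three calls replay the three addition examples, including the overflow case.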
SUBTRACTION OPERATION
As in addition, subtraction requires that both operands have the same
exponent. If the exponents are not equal, the exponent of the number with the smaller
exponent is made equal to the larger exponent and its mantissa is modified accordingly. This
is done by shifting the decimal point to the left by a number of places equal to the positive
difference between the two exponents. The mantissas of the two numbers are then subtracted.
The mantissa of the difference (before normalization) will be in
the range 0.0000 to 0.9999. These extreme values occur when the two numbers are equal
and when the mantissa of the number to be subtracted is 0.0000, respectively. Therefore, the
decimal point may have to be shifted to the right by more than one position, and the exponent
is decreased by one for each such shift. Thus, because of the normalization process,
the exponent of the difference may become less than -99, which gives a number smaller than
the smallest number which the computer can store. This condition is called underflow, and
the computer will indicate this error.
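Example: Subtract 0.6434E3 from 0.4217E5
Solution:
The number 0.6434E3, after its exponent is made equal to 5, becomes 0.0064E5 (the digits
3 and 4 are chopped off). The subtraction process is as follows: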
Minuend 0.4217E5
Subtrahend 0.0064E5
Difference 0.4153E5
Example: Subtract 0.7673E7 from 0.7678E7
Solution:
Since the exponents are already equal, the given numbers can be directly subtracted as
Minuend 0.7678E7
Subtrahend 0.7673E7
Difference 0.0005E7
Since the first three digits of the mantissa are zero, the number 0.0005E7 must be
normalized. This is done by shifting the decimal point to the right by three positions and
decreasing the exponent by 3. Therefore, the resultant difference becomes 0.5000E4.
Example: Subtract 0.8678E-99 from 0.8691E-99
Solution:
Since the exponents are already equal, the given numbers can be directly subtracted as:
Minuend 0.8691E-99
Subtrahend 0.8678E-99
Difference 0.0013E-99
Since the first two digits of the mantissa are zero, the number 0.0013E-99 must be normalized.
This is done by shifting the decimal point to the right by two positions and decreasing the
exponent by 2. Therefore, the resultant difference becomes 0.1300E-101.
Since this number is smaller than the smallest number which our hypothetical computer can
handle, it is a case of underflow and the computer will indicate this error.
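Subtraction on the same hypothetical machine can be sketched in the same way. The assumptions are that numbers are `(mantissa, exponent)` pairs such as `(4217, 5)` for 0.4217E5, that both operands are positive with the minuend not smaller than the subtrahend, and that extra digits are chopped. `fp_sub` and `UnderflowError` are illustrative names (Python has no built-in underflow exception).

```python
class UnderflowError(ArithmeticError):
    """Exponent fell below the smallest exponent the machine can store."""


MIN_EXPONENT = -99


def fp_sub(a, b):
    """Compute a - b for positive (mantissa, exponent) pairs with a >= b."""
    (ma, ea), (mb, eb) = a, b
    if ea < eb:
        raise ValueError("this sketch assumes a >= b")
    mb //= 10 ** (ea - eb)             # align exponents; low-order digits are chopped
    diff, exponent = ma - mb, ea
    if diff < 0:
        raise ValueError("this sketch assumes a >= b")
    if diff == 0:
        return 0, 0
    while diff < 1000:                 # leading mantissa digit is zero: normalize
        diff *= 10                     # shift the point one place to the right
        exponent -= 1
    if exponent < MIN_EXPONENT:
        raise UnderflowError(f"exponent {exponent} is below {MIN_EXPONENT}")
    return diff, exponent


print(fp_sub((4217, 5), (6434, 3)))   # (4153, 5), i.e. 0.4153E5
print(fp_sub((7678, 7), (7673, 7)))   # (5000, 4), i.e. 0.5000E4
try:
    fp_sub((8691, -99), (8678, -99))
except UnderflowError as exc:
    print("underflow:", exc)
```

The second call shows the loss of significant digits: three of the four digits of 0.5000E4 are zeros introduced by normalization.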
MULTIPLICATION OPERATION
To multiply two numbers given in the normalized floating point form, their mantissas are
multiplied and their exponents are added. After multiplication of the mantissas, the
resulting mantissa is normalized and the exponent is adjusted. Since each mantissa has a
magnitude of at least 0.1 and less than 1.0, the magnitude of their product is at least 0.01
and less than 1.0. Therefore, at most, the decimal point of the mantissa of the product will
be shifted one position to the right, and the exponent is then decreased by 1. If the
exponent of the product becomes greater than +99, the result is an overflow, and if it
becomes less than -99, the result is an underflow.
Since the computer retains a 4-decimal-digit mantissa, the product becomes 0.6022E12 after
truncating its mantissa to four digits.
Example: Multiply 0.1191E8 by 0.1232E-4
Solution: 0.1191E8 x 0.1232E-4 = (0.1191 x 0.1232)E(8 - 4) = 0.01467312E4
Since the first digit of the mantissa is zero, the product is normalized to 0.1467312E3. We
obtain 0.1467E3 after truncating the mantissa of the product to four digits.
We obtain 0.2353E134 after truncating the mantissa of the product to four digits. Since
this number is greater than the largest number which the computer can store, it is a case of
overflow and the computer will indicate this error.
We obtain 0.2884E-112 after truncating the mantissa of the product to four digits. Since
this number is smaller than the smallest number which the computer can store, it is a case of
underflow and the computer will indicate this error.
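Multiplication can be sketched along the same lines, again with the assumption that a number such as 0.1191E8 is stored as the pair `(1191, 8)` and that both operands are positive and normalized. The names `fp_mul` and `UnderflowError` are illustrative, and the operands used to trigger overflow and underflow are hypothetical values chosen for the demonstration.

```python
class UnderflowError(ArithmeticError):
    """Exponent fell below the smallest exponent the machine can store."""


MIN_EXPONENT, MAX_EXPONENT = -99, 99


def fp_mul(a, b):
    """Multiply two positive normalized (mantissa, exponent) pairs, chopping to 4 digits."""
    (ma, ea), (mb, eb) = a, b
    product = ma * mb                  # mantissa product scaled by 10**8
    exponent = ea + eb
    if product < 10 ** 7:              # product mantissa is below 0.1: normalize
        product *= 10                  # shift the point one place to the right
        exponent -= 1
    mantissa = product // 10 ** 4      # keep four digits, chop the rest
    if exponent > MAX_EXPONENT:
        raise OverflowError(f"exponent {exponent} exceeds +{MAX_EXPONENT}")
    if exponent < MIN_EXPONENT:
        raise UnderflowError(f"exponent {exponent} is below {MIN_EXPONENT}")
    return mantissa, exponent


print(fp_mul((1191, 8), (1232, -4)))   # (1467, 3), i.e. 0.1467E3
for pair in [((5000, 60), (5000, 70)), ((5000, -60), (5000, -60))]:
    try:
        fp_mul(*pair)
    except ArithmeticError as exc:     # OverflowError is also an ArithmeticError
        print(type(exc).__name__, exc)
```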
DIVISION OPERATION
To divide two numbers given in the normalized floating point form, their mantissas are
divided and their exponents are subtracted. After division of the mantissas, the
resulting mantissa is normalized and the exponent is suitably adjusted. Note that, since each
mantissa has a magnitude of at least 0.1 and less than 1.0, the magnitude of their quotient
is greater than 0.1 but less than 10.0. Therefore, at most,
the decimal point of the mantissa of the quotient will shift by one position to the left, and
the exponent of the quotient is then increased by 1. If the exponent of the quotient becomes
greater than +99, the result is an overflow, and if it becomes less than -99, the result is
an underflow.
We obtain 0.9254E4 after truncating the mantissa of the quotient to four digits.
Example: Divide 0.9542E-18 by 0.8532E91
Solution: 0.9542E-18 / 0.8532E91 = (0.9542 / 0.8532)E(-18 - 91) = 1.118377872E-109
The mantissa of the quotient obtained is greater than 1.0, therefore the decimal point is
shifted one position to the left and the exponent is increased by 1. The result obtained is
0.1118377872E-108. We obtain 0.1118E-108 after truncating the mantissa of the
quotient to four digits. Since this number is smaller than the smallest number which the
computer can store, it is a case of underflow and the computer will indicate this error.
The mantissa of the quotient obtained is greater than 1.0, therefore the decimal point is
shifted one position to the left and the exponent is increased by 1. The result obtained is
0.1336521137E113. We obtain 0.1336E113 after truncating the mantissa of the
quotient to four digits. Since this number is larger than the largest number which the
computer can store, it is a case of overflow and the computer will indicate this error.
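Division follows the same pattern. In this sketch a number such as 0.9542E-18 is stored as `(9542, -18)`, operands are assumed positive and normalized, and the quotient mantissa is chopped to four digits. `fp_div` and `UnderflowError` are illustrative names, and the second call uses hypothetical operands.

```python
class UnderflowError(ArithmeticError):
    """Exponent fell below the smallest exponent the machine can store."""


MIN_EXPONENT, MAX_EXPONENT = -99, 99


def fp_div(a, b):
    """Divide positive normalized (mantissa, exponent) pairs, chopping to 4 digits."""
    (ma, ea), (mb, eb) = a, b
    if mb == 0:
        raise ZeroDivisionError("division by zero")
    quotient = (ma * 10 ** 4) // mb    # mantissa quotient scaled by 10**4, chopped
    exponent = ea - eb
    if quotient >= 10 ** 4:            # quotient mantissa is 1.0 or more
        quotient //= 10                # shift the point one place to the left
        exponent += 1
    if exponent > MAX_EXPONENT:
        raise OverflowError(f"exponent {exponent} exceeds +{MAX_EXPONENT}")
    if exponent < MIN_EXPONENT:
        raise UnderflowError(f"exponent {exponent} is below {MIN_EXPONENT}")
    return quotient, exponent


print(fp_div((5000, 5), (2500, 2)))    # (2000, 4), i.e. 0.2000E4
try:
    fp_div((9542, -18), (8532, 91))    # 0.1118E-108 cannot be stored
except UnderflowError as exc:
    print("underflow:", exc)
```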
ERRORS IN ARITHMETIC
In integer arithmetic, while all arithmetic operations are exact, we might come across the
following two situations:
1. An operation may result in a large number that is beyond the range of the numbers that
the computer can handle.
In floating point arithmetic, the following errors may occur:
1. Error due to the inexact representation of a decimal number in binary form. E.g., consider
the decimal number 0.1. The binary equivalent of this number is 0.0001100110011... The
binary equivalent has a repeating fraction and therefore must be terminated at some
point.
2. Error due to rounding method used by the computer in order to limit the number of
significant digits.
3. Floating point subtraction may induce a special phenomenon. It is possible that some
mantissa positions in the result are unspecified. This happens when two nearly equal
numbers are subtracted. This is known as subtractive cancellation. If the operands
themselves represent approximate values, the loss of significance is serious since it
greatly reduces the number of significant digits.
4. Overflow or underflow can occur in floating point operations when the result is outside
the limits of floating point number system of the computer.
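Two of these effects are easy to observe in Python, whose `float` type is a binary floating point number. The sketch first shows that 0.1 is not stored exactly, and then illustrates subtractive cancellation with two hypothetical 7-digit quantities kept to four significant digits by chopping.

```python
from decimal import Decimal
from fractions import Fraction

# 1. Inexact representation: the stored value of 0.1 is not exactly one tenth.
stored = Fraction(0.1)                 # exact value of the binary number actually stored
print(stored == Fraction(1, 10))       # False
print(Decimal(0.1))                    # exact decimal expansion of the stored value
print(0.1 + 0.2 == 0.3)                # False

# 3. Subtractive cancellation (hypothetical true values, chopped to 4 digits).
x_true, y_true = 7678400, 7672600
x4 = x_true // 1000 * 1000             # 7678000
y4 = y_true // 1000 * 1000             # 7672000
true_diff = x_true - y_true            # 5800
computed_diff = x4 - y4                # 6000
rel_err_x = (x_true - x4) / x_true
rel_err_diff = abs(computed_diff - true_diff) / true_diff
print(rel_err_x, rel_err_diff)
```

Each operand is accurate to better than one part in ten thousand, yet the computed difference is off by more than 3 percent.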
LAWS OF ARITHMETIC
Due to errors introduced in floating point arithmetic, the associative and distributive laws
of arithmetic are not always satisfied. That is,
(i) $x + (y + z) \ne (x + y) + z$
(ii) $x * (y * z) \ne (x * y) * z$
(iii) $x * (y + z) \ne (x * y) + (x * z)$
Although the failure of these laws affects relatively few computations, it can
be very critical on some occasions.
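The failure of these laws can be reproduced with Python's `decimal` module configured to behave like the hypothetical computer: four significant digits with chopping (`ROUND_DOWN`). This is only a sketch, and the operand values are hypothetical, chosen so that each law fails.

```python
from decimal import Context, Decimal, ROUND_DOWN

ctx = Context(prec=4, rounding=ROUND_DOWN)   # 4 significant digits, extra digits chopped


def add(a, b):
    return ctx.add(Decimal(a), Decimal(b))


def mul(a, b):
    return ctx.multiply(Decimal(a), Decimal(b))


# (i) associativity of addition
left_i = add(add("1000", "0.6"), "0.6")      # 1000.6 -> 1000, then 1000.6 -> 1000
right_i = add("1000", add("0.6", "0.6"))     # 1000 + 1.2 = 1001.2 -> 1001

# (ii) associativity of multiplication
left_ii = mul(mul("1.234", "5.678"), "9.012")    # 7.006 * 9.012 -> 63.13
right_ii = mul("1.234", mul("5.678", "9.012"))   # 1.234 * 51.17 -> 63.14

# (iii) distributive law
left_iii = mul("3", add("0.3333", "0.3334"))                 # 3 * 0.6667 -> 2.000
right_iii = add(mul("3", "0.3333"), mul("3", "0.3334"))      # 0.9999 + 1.000 -> 1.999

print(left_i, right_i)
print(left_ii, right_ii)
print(left_iii, right_iii)
```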
(i) Exact Numbers: Exact Numbers are those in which there is no uncertainty or
approximation associated with them. E.g. 5, 10, -6, 1/10, 1, 5, etc.
(ii) Approximate Numbers: Approximate Numbers are those in which there is uncertainty
or approximation associated with them. E.g. $e$, $\sqrt{5}$, $\pi$, $\frac{1}{3}$, etc. These
numbers appear to be exact, but they cannot be expressed exactly with a finite number of
digits. These numbers can be expressed as 2.7183, 2.236067, 3.1416, 0.3333… respectively.
These numbers are approximations to the true values and they are called approximate numbers.
Hence an approximate number is defined as a number which approximates an exact
number.
An error is defined as the difference between the exact value and the approximate value
obtained from experimental observations.
Suppose X is the true value of a quantity and Xa is the approximate value, then
Error = True Value- Approximate Value
E = X - Xa
In any numerical computation, we may come across the following types and sources
of errors:
Inherent Errors
These are errors already contained in the statement of a problem before obtaining the
solution to the problem. Such errors arise as a result of the given data being approximated
or the limitations of mathematical tables, calculators, digital computer or inaccurate
measurements or observations which may be due to limitations of the measuring device.
E.g. Screw gauge, Vernier caliper, weighing machine, etc. can measure the quantity up to
smallest permissible value.
Rounding Errors
These errors arise due to rounding off of a number during computations. If a number is to
be rounded off to n significant digits, then the following rules are observed.
3. If the (n + 1)th digit is greater than 5, then the nth digit is increased by one.
4. If the (n + 1)th digit is 5 and it is followed by digits other than zero, then the
nth digit is increased by one.
5. If the (n + 1)th digit is 5, or 5 followed only by zeros, then the nth digit is left
unchanged if it is even.
6. If the (n + 1)th digit is 5, or 5 followed only by zeros, then the nth digit is increased
by one if it is odd.
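These rules correspond to what is usually called round-half-to-even. A minimal sketch using Python's `decimal` module is given below; the function name `round_significant` is illustrative, and the sample numbers are hypothetical.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN


def round_significant(number: str, n: int) -> Decimal:
    """Round a decimal string to n significant digits, ties going to the even digit."""
    ctx = Context(prec=n, rounding=ROUND_HALF_EVEN)
    return ctx.create_decimal(number)   # create_decimal applies the context's rounding


print(round_significant("2.346", 3))    # 2.35  ((n+1)th digit greater than 5)
print(round_significant("2.3451", 3))   # 2.35  (5 followed by a non-zero digit)
print(round_significant("2.345", 3))    # 2.34  (5 alone, nth digit even: unchanged)
print(round_significant("2.355", 3))    # 2.36  (5 alone, nth digit odd: raised)
```

The numbers are passed as strings so that the digits being rounded are exactly the decimal digits written, not a binary approximation of them.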
Analytic Errors
These are errors introduced due to transforming a physical or mathematical problem into a
computational problem. E.g. if we compute $\sin x$ by the formula
\[ \sin x = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \dots \]
then it leads to an error, since only a finite number of terms can be used. Similarly, the
transformation of $e^x - x = 0$ into the equation
\[ \left(1 + x + \frac{x^2}{2!} + \frac{x^3}{3!}\right) - x = 0 \]
involves an analytic error.
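The size of the analytic error in the sine formula can be estimated numerically. The sketch below sums the first four terms of the series at a hypothetical point x = 0.5 and compares the result with `math.sin`; because the series alternates with decreasing terms, the error is bounded by the first omitted term, $x^9/9!$.

```python
import math


def sin_series(x: float, terms: int) -> float:
    """Sum the first `terms` terms of x - x^3/3! + x^5/5! - ..."""
    total = 0.0
    for k in range(terms):
        n = 2 * k + 1
        total += (-1) ** k * x ** n / math.factorial(n)
    return total


x = 0.5
approx = sin_series(x, 4)
error = math.sin(x) - approx           # error caused by stopping after four terms
bound = x ** 9 / math.factorial(9)     # first omitted term
print(approx, error, bound)
```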
Truncation Errors
These errors are caused by leaving out the extra digits that are not required in a number
without rounding off. The difference between a numerical value X and its truncated value
XT is called the truncation error. The following points must be taken into consideration
during truncation of a numerical value:
(i) In truncation, the numerical value of a positive number is decreased and that of a
negative number is increased.
(ii) If we round off a large number of positive numbers to the same number of decimal
places, then the average error due to rounding off is zero.
(iii) In case of truncation of a large number of positive numbers to the same number of
decimal places, the average truncation error is one half of the place value of the last
retained digit.
(iv) If a number is both rounded off and truncated to the same number of decimal places,
then the truncation error is never smaller in magnitude than the round off error.
(v) Round off error may be positive or negative, but truncation error is always positive in
case of positive numbers and negative in case of negative numbers.
Note: The maximum error due to truncation of a number cannot exceed the place value of
the last retained digit in the number.
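Points (ii), (iii) and the note can be checked on a small data set. The sketch below takes the hypothetical values 0.000, 0.001, ..., 0.999, reduces each to two decimal places by truncation and by rounding (ties to even), and averages the errors X - Xa. Exact decimal arithmetic is used so that the averages are not disturbed by binary rounding.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN

PLACE = Decimal("0.01")                          # place value of the last retained digit
values = [Decimal(k) / 1000 for k in range(1000)]

trunc_errors = [v - v.quantize(PLACE, rounding=ROUND_DOWN) for v in values]
round_errors = [v - v.quantize(PLACE, rounding=ROUND_HALF_EVEN) for v in values]

avg_trunc = sum(trunc_errors) / len(values)
avg_round = sum(round_errors) / len(values)

print(avg_trunc)            # 0.0045, close to half of 0.01
print(avg_round == 0)       # True: positive and negative errors cancel
print(max(trunc_errors))    # 0.009, below the place value 0.01
```

With only three decimal digits per value the average truncation error is 0.0045 rather than exactly 0.005; it approaches one half of the place value as more digits are dropped.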
Exercise: Find the truncation error in the result of the following function for
$x = \frac{1}{5}$ when we use (a) first three terms (b) first four terms (c) first five terms:
\[ e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \dots \]
Accumulated Error
In a sequence of computations, the error in one value may affect the computation of the
next value and the error gets added. This is called the accumulated error. The Relative
Accumulated Error is the ratio of the accumulated error to the exact value at that
iteration.
Note: It is observed that the relative accumulated error is the same for all the values.
Modelling Errors
Mathematical models are the basis for numerical solutions. They are formulated to
represent physical processes using certain parameters involved in the situations. A model
is an approximate representation of the real system under consideration. In many
situations, it is impossible to include all aspects of the real problem and therefore certain
simplifying assumptions are made.
Blunders
Blunders are errors that are caused due to human imperfection. Such errors may cause a
very serious disaster in the result. Since these errors are due to human mistakes, it should
be possible to avoid them to a large extent by acquiring a sound knowledge of all the
aspects of the problems as well as the numerical process.
MEASUREMENT OF ACCURACY
In numerical computation, round off errors are difficult to estimate, so their effect on
the final result has to be reduced by following specific rules. However, truncation errors
can be easily estimated and can be reduced effectively. Thus in any case, we need some
measures of the accuracy of the results.
The Absolute Error is defined as $E_a = |X - X_a|$, the Relative Error is defined as
\[ E_r = \frac{|X - X_a|}{|X|} \]
and the Percentage Error is defined as
\[ E_p = \frac{|X - X_a|}{|X|} \times 100. \]
If Y is a number such that $|X - X_a| \le Y$, then Y is an upper limit on the magnitude of
the absolute error and measures the absolute accuracy.
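The three measures can be computed directly. The sketch below uses exact fractions and a hypothetical example, X = 1/3 approximated by Xa = 0.3333; the function name `error_measures` is illustrative.

```python
from fractions import Fraction


def error_measures(true_value, approx_value):
    """Return the absolute, relative and percentage errors."""
    absolute = abs(true_value - approx_value)
    relative = absolute / abs(true_value)
    return absolute, relative, relative * 100


absolute, relative, percentage = error_measures(Fraction(1, 3), Fraction(3333, 10000))
print(absolute, relative, percentage)   # 1/30000 1/10000 1/100
```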
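Here n is taken to be the decimal exponent of X (that is, $X = m \times 10^n$ with
$0.1 \le |m| < 1$) and k the number of mantissa digits retained in $X_a$.
1. Absolute Error due to truncation is: $|X - X_a| < 10^{n-k}$
2. Relative Error due to truncation is: $\frac{|X - X_a|}{|X|} < 10^{-k+1}$
3. Absolute Error due to rounding off is: $|X - X_a| \le 0.5 \times 10^{n-k}$
4. Relative Error due to rounding off is: $\frac{|X - X_a|}{|X|} \le 0.5 \times 10^{-k+1}$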
Exercise:
1. Round off the number 865230 to four significant figures and compute the absolute,
relative and percentage errors.
2. Let X = 0.0045895. Find the relative error if X is truncated to three decimal places
3. The computed value of a problem is 7.896. The error in the computed value is
less than 1%. Find the range within which the true value must lie.
ERROR PROPAGATION
Let there be two quantities A and B. If $\Delta A$ and $\Delta B$ are the corresponding
absolute errors in their measurement, then:
(i) Let Z denote the sum of A and B and $\Delta Z$ be the corresponding absolute error.
Clearly, $Z = A + B$ and $Z \pm \Delta Z = (A \pm \Delta A) + (B \pm \Delta B)$
$Z \pm \Delta Z = (A + B) \pm (\Delta A + \Delta B)$ or $\Delta Z = \pm(\Delta A + \Delta B)$
Thus the maximum error in Z is $\Delta Z = \Delta A + \Delta B$,
i.e., the maximum error in the sum is the sum of the individual errors.
(ii) Let Z denote the difference of A and B and $\Delta Z$ be the corresponding absolute error.
Clearly, $Z = A - B$ and $Z \pm \Delta Z = (A \pm \Delta A) - (B \pm \Delta B)$
$Z \pm \Delta Z = (A - B) \pm \Delta A \mp \Delta B$ or $\Delta Z = \pm \Delta A \mp \Delta B$
Thus the maximum error in Z is $\Delta Z = \Delta A + \Delta B$,
i.e., the maximum error in the difference is again the sum of the individual errors.
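The rule for sums and differences can be confirmed by brute force: perturb each quantity by plus or minus its absolute error and record the largest deviation of the result. The sketch below uses hypothetical integer measurements (A = 254 with error 1, B = 123 with error 2, in units of 0.1) so that the arithmetic is exact; `worst_case_deviation` is an illustrative name.

```python
from itertools import product
from operator import add, sub


def worst_case_deviation(op, a, da, b, db):
    """Largest |op(a +/- da, b +/- db) - op(a, b)| over all sign combinations."""
    nominal = op(a, b)
    return max(
        abs(op(a + sa * da, b + sb * db) - nominal)
        for sa, sb in product((-1, 1), repeat=2)
    )


A, dA, B, dB = 254, 1, 123, 2
print(worst_case_deviation(add, A, dA, B, dB))   # 3, i.e. dA + dB
print(worst_case_deviation(sub, A, dA, B, dB))   # 3, i.e. dA + dB again
```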
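(iii) Let Z denote the quotient of A and B and $\Delta Z$ be the corresponding absolute error.
Clearly, $Z = \frac{A}{B}$ and
\[ Z \pm \Delta Z = \frac{A \pm \Delta A}{B \pm \Delta B} \quad \text{or} \quad Z \pm \Delta Z = \frac{A}{B}\left(1 \pm \frac{\Delta A}{A} \pm \frac{\Delta B}{B} \pm \frac{\Delta A \, \Delta B}{AB}\right) \]
to first order in $\frac{\Delta B}{B}$, the signs being independent of one another.
Dividing both sides by $Z = \frac{A}{B}$, we get
\[ 1 \pm \frac{\Delta Z}{Z} = 1 \pm \frac{\Delta A}{A} \pm \frac{\Delta B}{B} \pm \frac{\Delta A \, \Delta B}{AB} \]
{Ignore $\frac{\Delta A \, \Delta B}{AB}$ as it contains the product of two small quantities $\Delta A$ and $\Delta B$}
\[ \pm\frac{\Delta Z}{Z} = \pm\frac{\Delta A}{A} \pm \frac{\Delta B}{B} \quad \text{or, at most,} \quad \frac{\Delta Z}{Z} = \frac{\Delta A}{A} + \frac{\Delta B}{B} \]
$\frac{\Delta Z}{Z}$, $\frac{\Delta A}{A}$ and $\frac{\Delta B}{B}$ are the relative errors in the measurement of Z, A and B respectively.
Hence when two quantities are multiplied (or divided), the maximum relative error in the product
(or quotient) is the sum of the relative errors in the quantities to be multiplied or divided.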
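3. Error Propagation due to the power of a measured Quantity
Let there be quantities A, B and C, and let $\Delta A$, $\Delta B$ and $\Delta C$ be the absolute
errors in their measurement. In general, if $Z = A^p B^q / C^r$, then, since relative errors are
always added when quantities are multiplied or divided,
\[ \frac{\Delta Z}{Z} = \left[\frac{\Delta A}{A} + \frac{\Delta A}{A} + \dots \text{ to } p \text{ terms}\right] + \left[\frac{\Delta B}{B} + \frac{\Delta B}{B} + \dots \text{ to } q \text{ terms}\right] + \left[\frac{\Delta C}{C} + \frac{\Delta C}{C} + \dots \text{ to } r \text{ terms}\right] \]
\[ \frac{\Delta Z}{Z} = p\frac{\Delta A}{A} + q\frac{\Delta B}{B} + r\frac{\Delta C}{C} \]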
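Example: Let $P = a^3 b^2 / (cd)^{1/2}$. The percentage errors of measurement in a, b, c and d
are 1%, 3%, 4% and 2% respectively. What is the percentage error in the quantity P? If the
value of P calculated using the above relation turns out to be 3.763, to what value of P
should you round off the result?
Solution: Given that $P = a^3 b^2 / (cd)^{1/2}$, and applying the formula for the combination of
errors,
\[ \frac{\Delta Z}{Z} = p\frac{\Delta A}{A} + q\frac{\Delta B}{B} + r\frac{\Delta C}{C} \]
we get
\[ \frac{\Delta P}{P} = 3\frac{\Delta a}{a} + 2\frac{\Delta b}{b} + \frac{1}{2}\frac{\Delta c}{c} + \frac{1}{2}\frac{\Delta d}{d} \]
\[ \frac{\Delta P}{P} = 3\left(\frac{1}{100}\right) + 2\left(\frac{3}{100}\right) + \frac{1}{2}\left(\frac{4}{100}\right) + \frac{1}{2}\left(\frac{2}{100}\right) = \frac{12}{100} \]
\[ \text{\% error in } P = \frac{\Delta P}{P} \times 100 = \frac{12}{100} \times 100 = 12\% \]
Now if $P = 3.763$, then $\Delta P = P \times \frac{12}{100} = 3.763 \times 0.12 = 0.45156$
Since the error already affects the first decimal place, the result should be rounded off to
P = 3.8.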
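The power rule lends itself to a one-line function. The sketch below takes a list of (exponent, percentage error) pairs; the name `combined_percentage_error` is illustrative, and exponents of quantities in the denominator are passed as negative numbers, with their magnitudes used because errors always add.

```python
def combined_percentage_error(terms):
    """terms: iterable of (exponent, percentage_error) pairs for a product of powers."""
    return sum(abs(exponent) * pct for exponent, pct in terms)


# P = a^3 b^2 / (cd)^(1/2) with errors of 1%, 3%, 4% and 2% in a, b, c and d
pct_error = combined_percentage_error([(3, 1), (2, 3), (-0.5, 4), (-0.5, 2)])
delta_p = 3.763 * pct_error / 100

print(pct_error)            # 12.0
print(round(delta_p, 5))    # 0.45156
```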
Example: The following observations were made during an experiment to find the value of g
using a simple pendulum: l = 90.0 cm, and the time (t) for 20 vibrations is 36.0 s. Find the
percentage error in the measurement of g, given that the time period of the pendulum is
$T = 2\pi\sqrt{\frac{l}{g}}$. Length is measured to an accuracy of 0.1 cm and time to 0.2 s.
Solution: Given that $T = 2\pi\sqrt{\frac{l}{g}}$, we have
\[ g = \frac{4\pi^2 l}{T^2} \quad \text{or} \quad g = \frac{4\pi^2 l}{\left(\frac{t}{20}\right)^2} \]
\[ \frac{\Delta g}{g} = \frac{\Delta l}{l} + 2\frac{\Delta t}{t} = \frac{0.1}{90} + 2\left(\frac{0.2}{36.0}\right) = 0.01222 \]
Percentage error = 0.01222 * 100% = 1.222%
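The pendulum calculation can be reproduced in a few lines. This sketch simply evaluates the formulas of the solution with the measured values; the variable names are illustrative.

```python
import math

length, d_length = 90.0, 0.1      # cm
t20, d_t20 = 36.0, 0.2            # s, time for 20 vibrations

period = t20 / 20                                   # time period T in seconds
g = 4 * math.pi ** 2 * length / period ** 2         # in cm/s^2
relative_error = d_length / length + 2 * d_t20 / t20
percentage_error = relative_error * 100

print(round(g, 1))
print(round(relative_error, 5))     # 0.01222
print(round(percentage_error, 3))   # 1.222
```

The factor 2 in front of the time term comes from the power rule, since t enters the formula for g as $t^2$.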