Numerical Methods Chap1
Contents
1 Numbering systems and notation of numbers
1.1 Main numbering systems
1.2 Binary to decimal conversion
1.3 Decimal to binary conversion
1.3.1 a) Case of non-decimal fractions
1.3.2 b) Case of decimal fractions
1.4 Normalized scientific notation
1 Numbering systems and notation of numbers
1.1 Main numbering systems
Binary system: numbers are written with the two digits 0 and 1; the base is 2 (used internally by
most modern computers).
Decimal system: numbers are written with the ten digits 0 to 9; the base is 10 (everyday
arithmetic, a shop’s calculator).
Hexadecimal system: numbers are written with the digits 0 to 9 together with the capital
letters A, B, ..., F; the base is 16.
Illustration: 12, 1100 and 0xC represent the same number in the decimal, binary and hexadecimal
systems, respectively. In the last one, the prefix "0x" simply indicates that the number is
hexadecimal. Both 0xF and 1111 (binary) equal 15 in base 10.
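The illustration can be checked quickly in Python, whose integer literals and built-in `int()` accept all three bases. This is only a sanity check added for convenience; the variable names are illustrative.

```python
# The same value written in decimal, binary and hexadecimal.
dec_value = 12
bin_value = 0b1100          # binary literal, prefix 0b
hex_value = 0xC             # hexadecimal literal, prefix 0x

same = dec_value == bin_value == hex_value

# Parsing digit strings in a given base, and formatting back.
fifteen_from_hex = int("F", 16)
fifteen_from_bin = int("1111", 2)
as_binary = format(12, "b")   # binary digits of 12, without prefix
as_hex = format(12, "X")      # hexadecimal digits of 12, upper case
```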
Binary digit format: $d_{n-1} d_{n-2} \cdots d_1 d_0$ $\Longrightarrow$ Decimal digit format: $d_{n-1} \times 2^{n-1} + d_{n-2} \times 2^{n-2} + \cdots + d_1 \times 2^{1} + d_0 \times 2^{0}$
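The positional expansion can be sketched in a few lines of Python. The function name `binary_to_decimal` is an illustrative choice, and the sketch handles only unsigned integer digit strings.

```python
def binary_to_decimal(bits: str) -> int:
    """Evaluate d_{n-1} ... d_1 d_0 as the sum of d_i * 2**i."""
    n = len(bits)
    total = 0
    for k, ch in enumerate(bits):
        digit = int(ch)
        if digit not in (0, 1):
            raise ValueError("not a binary digit: " + ch)
        # The character at position k (from the left) has weight 2**(n-1-k).
        total += digit * 2 ** (n - 1 - k)
    return total
```

For instance, `binary_to_decimal("1100")` evaluates 1×8 + 1×4 + 0×2 + 0×1.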
1.3.2 b) Case of decimal fractions
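Method: Use repeated multiplication by 2 until the fractional part of the product is 0. For
fractions whose binary expansion does not terminate (e.g. 0.9), stop once enough digits have
been obtained.
Steps:
1. Multiply the fraction by 2.
2. Take the integer part of the product as the next binary digit, and keep the fractional part
for the next iteration.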
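The two steps above can be sketched as follows. This is a minimal sketch, assuming the input is given as a decimal string so that `fractions.Fraction` keeps the arithmetic exact; the name `fraction_to_binary` and the cut-off `max_bits` are illustrative additions for fractions that never reach 0.

```python
from fractions import Fraction

def fraction_to_binary(x: str, max_bits: int = 24) -> str:
    """Binary digits of a fraction 0 <= x < 1 by repeated multiplication by 2."""
    frac = Fraction(x)                 # exact value of the decimal string
    if not 0 <= frac < 1:
        raise ValueError("expected 0 <= x < 1")
    digits = []
    while frac != 0 and len(digits) < max_bits:
        frac *= 2                      # step 1: multiply by 2
        bit = int(frac)                # step 2: integer part is the next digit
        digits.append(str(bit))
        frac -= bit                    # fractional part feeds the next iteration
    return "0." + "".join(digits) if digits else "0.0"
```

Calling `fraction_to_binary("0.1", max_bits=8)` stops after 8 digits because the expansion of 0.1 repeats forever.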
1.4 Normalized scientific notation
Problem: 0.7e-2, $0.7 \times 10^{-2}$, $70 \times 10^{-4}$, $7 \times 10^{-3}$ and 0.007 all represent the same number
in 3 different notation forms. Which of these is in scientific notation? Which is in normalized
scientific notation?
Scientific or exponential notation: Writing a number r in the form
$r = \pm M \times b^{p}$,
where M, called the mantissa, is a number with a single digit to the left of the decimal point,
b is the base, and p is the exponent. Scientific notation is said to be normalized when the
mantissa has no leading zero, i.e. the digit to the left of the point is nonzero.
Examples
(i) The decimal numbers $5.4 \times 10^{-5}$, $1.25 \times 10^{-5}$, $0.125 \times 10^{-4}$, $0.0125 \times 10^{-2}$ are all in scientific
notation. The first two numbers are normalized while the last two are not.
(ii) In binary, any number of the form $1.m_1 m_2 \cdots \times 2^{p_1 p_2 \cdots}$ is a normalized number ($m_i$ and $p_i$ are
binary digits), e.g. $1.01 \times 2^{-5}$.
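As a quick check of the decimal case, the standard `decimal` module can shift the point so that exactly one nonzero digit remains to its left. This is a minimal sketch; `normalize_decimal` is an illustrative name, and zero is excluded because it has no normalized form.

```python
from decimal import Decimal

def normalize_decimal(text: str):
    """Return (mantissa, exponent) with 1 <= |mantissa| < 10, base 10."""
    x = Decimal(text)
    if x == 0:
        raise ValueError("zero has no normalized form")
    p = x.adjusted()       # exponent of the most significant digit
    m = x.scaleb(-p)       # shift the decimal point by -p places
    return m, p
```

All five spellings from the problem above (0.7e-2, 70e-4, 7e-3, 0.007, ...) normalize to the same mantissa and exponent.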
Coursework
1. Convert to binary the following decimal numbers: 1245, 98, 0.9, 0.8125, 0.6875, 0.65, 12.75
2. Convert from binary to decimal the numbers: 101010, 1111, 0.1101, 0.1011, 1100.11
Multiplication: for $r_1 = m_1 \times b^{p_1}$ and $r_2 = m_2 \times b^{p_2}$, $r = r_1 r_2 = m_1 m_2 \times b^{p_1 + p_2}$.
Example 1 (decimal). Multiply the following two numbers in normalized scientific notation:
$1.110 \times 10^{10}$ and $9.200 \times 10^{-4}$.
1. Add the exponents: $p = 10 + (-4) = 6$
2. Multiply the mantissas: $m = 1.110 \times 9.200 = 10.212000$
3. Keeping only 3 digits in the fractional part: $m = 10.212$, and thus $r = 10.212 \times 10^{6}$
4. Normalize the result and round to 3 fractional digits: $r = 1.0212 \times 10^{7} \approx 1.021 \times 10^{7}$
Example 2 (binary). Multiply the following two numbers in normalized scientific notation:
$1.000 \times 2^{-1}$ and $-1.110 \times 2^{-2}$.
1. Sign: minus
2. Add the exponents: $p = -1 + (-2) = -3$
3. Multiply the mantissas: $m = 1.000 \times 1.110 = 1.110000$
4. Keeping only 3 digits in the fractional part: $m = 1.110$, and thus $r = -1.110 \times 2^{-3}$
The result is already normalized.
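The multiplication steps can be replayed with exact rational arithmetic. This is a minimal sketch: the function name `multiply_normalized` is illustrative, mantissas are passed as values (the binary mantissa 1.110 is 7/4), and rounding to a fixed number of digits is left out on purpose.

```python
from fractions import Fraction

def multiply_normalized(m1, p1, m2, p2, base=10):
    """Multiply m1*base**p1 by m2*base**p2; return a normalized (mantissa, exponent)."""
    m = Fraction(m1) * Fraction(m2)   # multiply the mantissas (sign included)
    p = p1 + p2                       # add the exponents
    # Normalize so that 1 <= |m| < base.
    while abs(m) >= base:
        m /= base
        p += 1
    while m != 0 and abs(m) < 1:
        m *= base
        p -= 1
    return m, p

# Example 1 (decimal): mantissas given as decimal strings.
m_dec, p_dec = multiply_normalized("1.110", 10, "9.200", -4)
# Example 2 (binary): 1.000 (base 2) = 1 and 1.110 (base 2) = 7/4.
m_bin, p_bin = multiply_normalized(1, -1, Fraction(-7, 4), -2, base=2)
```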
Example 1 (decimal). Add the following two numbers in normalized scientific notation: $9.970 \times 10^{1}$
and $8.740 \times 10^{-1}$.
1. Rewrite the smaller number using the exponent of the larger number: $8.740 \times 10^{-1} = (8.740 \times 10^{-2}) \times 10^{1} = 0.08740 \times 10^{1}$
2. Add the mantissas: $m = 9.970 + 0.08740 = 10.05740 \Longrightarrow r = 10.05740 \times 10^{1}$
3. Normalize the result (if necessary, shift the mantissa and adjust the exponent): $r = 10.05740 \times 10^{1} = 1.005740 \times 10^{2}$
4. Check for overflow/underflow of the exponent and round off: $r = 1.006 \times 10^{2}$
Example 2 (binary). Write the binary normalized notation of the numbers 0.5 and −0.4375, and
add them.
1. Binary normalized notation: $0.5_{10} = 0.1_2 = 0.1 \times 2^{0} = 1.000 \times 2^{-1}$, and $-0.4375_{10} = -0.0111_2 = -0.0111 \times 2^{0} = -1.110 \times 2^{-2}$
2. Rewrite the smaller number using the exponent of the larger number: $-1.110 \times 2^{-2} = -(1.110 \times 2^{-1}) \times 2^{-1} = -0.111 \times 2^{-1}$
3. Add the mantissas: $m = 1.000 + (-0.111) = 0.001 \Longrightarrow r = 0.001 \times 2^{-1}$
4. Normalize the result: $r = 1.000 \times 2^{-4}$
5. No overflow/underflow ($-4 \in [-126, 127]$), no rounding required: $r = 1.000 \times 2^{-4}$
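The addition steps (align exponents, add mantissas, normalize) can be sketched the same way with exact fractions. The name `add_normalized` is illustrative, mantissas are passed as values (binary 1.110 is 7/4), and the final rounding and overflow check are left to the reader.

```python
from fractions import Fraction

def add_normalized(m1, p1, m2, p2, base=10):
    """Add m1*base**p1 and m2*base**p2; return a normalized (mantissa, exponent)."""
    m1, m2 = Fraction(m1), Fraction(m2)
    b = Fraction(base)
    p = max(p1, p2)                          # use the larger exponent
    # Shift each mantissa right by the exponent difference, then add.
    m = m1 / b ** (p - p1) + m2 / b ** (p - p2)
    # Normalize so that 1 <= |m| < base (zero is returned unchanged).
    while abs(m) >= base:
        m /= base
        p += 1
    while m != 0 and abs(m) < 1:
        m *= base
        p -= 1
    return m, p

# Example 1 (decimal): 9.970 x 10^1 + 8.740 x 10^-1
m_dec, p_dec = add_normalized("9.970", 1, "8.740", -1)
# Example 2 (binary): 1.000 x 2^-1 + (-1.110 x 2^-2), with 1.110 (base 2) = 7/4
m_bin, p_bin = add_normalized(1, -1, Fraction(-7, 4), -2, base=2)
```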
3 Computer representation of numbers: general principle
3.1 Computer’s memory structure
The computer representation of numbers is based on the so-called floating-point representation. The
computer operates in the binary system and therefore represents numbers as ordered sets of the
digits 0 and 1. Each digit is stored in a bit (binary digit), and a group of 8 bits is called a byte.
Solved exercise. Find the signed binary number representing the decimal value −5 in
two’s-complement form using 1 byte.
On 1 byte (8 bits), the decimal number 5 is represented by $00000101_2$. The most significant bit is
0, so the pattern represents a non-negative value. To convert to −5 in two’s-complement notation,
first invert the bits: $11111010_2$. This is the one’s-complement representation of the decimal
value −5. To obtain the two’s complement, add 1 to the result, giving $11111011_2$. The most
significant bit is 1, so the value represented is negative.
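The solved exercise can be reproduced in Python in two ways: directly, by masking the integer to 8 bits, and step by step, by inverting the bits and adding 1. Both helper names below are illustrative.

```python
def twos_complement(value: int, bits: int = 8) -> str:
    """Two's-complement bit pattern of a signed integer on the given width."""
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise ValueError("value does not fit in the given number of bits")
    # Masking keeps the low `bits` bits, which is the two's-complement pattern.
    return format(value & ((1 << bits) - 1), f"0{bits}b")

def invert_and_add_one(pattern: str) -> str:
    """Negate a bit pattern the manual way: invert every bit, then add 1."""
    bits = len(pattern)
    inverted = "".join("1" if b == "0" else "0" for b in pattern)
    return format((int(inverted, 2) + 1) % (1 << bits), f"0{bits}b")
```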
• The biased binary system. A more practical solution: instead of letting 0 be represented
by 00000000, we let 0 be represented by 01111111, which is 127 in the standard binary system.
We say that the system is biased by 127. In this case 00000000 represents −127.
Decimal (to represent)   Binary       Biased binary   Biased decimal
−127                     −01111111    00000000        0
0                        00000000     01111111        127
127                      01111111     11111110        254
128                      10000000     11111111        255
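The rows of the table can be regenerated with a short sketch, assuming an 8-bit field and a bias of 127 as in the text. The helper names `to_biased` and `from_biased` are illustrative.

```python
BIAS = 127

def to_biased(value: int, bias: int = BIAS, bits: int = 8) -> str:
    """Store `value` as the unsigned pattern of value + bias."""
    stored = value + bias
    if not 0 <= stored < (1 << bits):
        raise ValueError("value out of range for this bias and width")
    return format(stored, f"0{bits}b")

def from_biased(pattern: str, bias: int = BIAS) -> int:
    """Recover the represented value from a biased bit pattern."""
    return int(pattern, 2) - bias

# Biased patterns sort in the same order as the values they represent.
patterns = [to_biased(v) for v in (-127, 0, 127, 128)]
```

Because the stored patterns are plain unsigned integers, comparing two biased numbers reduces to comparing their bit strings from left to right.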
$r = (-1)^{s} \times M \times b^{p}$, where M = mantissa or significand, p = exponent, b = base (radix) of the exponent, s = sign.
Example: Decimal: $0.00527_{10} = 0.527_{10} \times 10^{-2_{10}}$; Binary: $10.1_2 = 0.101_2 \times 2^{2_{10}} = 0.101_2 \times 2^{10_2}$
Remark. Negative and positive numbers have 1 and 0 as sign bits, respectively, in contrast to the
convention used for the naive representation of integers above.
Biasing the exponent consists in adding a constant, called the bias, to the exponent, chosen so
that the stored exponents are nonnegative. The bias solves the comparison issues. As for the range
of representable numbers, there is still room for improvement.
interchange formats: encodings (bit strings) that may be used to exchange floating-point
data in an efficient and compact form
rounding rules: properties to be satisfied when rounding numbers during arithmetic and
conversions
Convention 1. Since the normalized binary mantissa always has the form 1.xx · · · (except for 0),
omit the leading 1 and store only the fractional part of the mantissa (the IEEE mantissa, f ).
Bit  = 0 or 1 (binary digit)
Byte = 8 bits
Word = Reals:    4 bytes (single precision)
                 8 bytes (double precision)
     = Integers: signed:   1, 2, 4, or 8 bytes
                 unsigned: 1, 2, 4, or 8 bytes
Remark: The single precision format uses 32 bits while the double precision format uses 64 bits.
4.2.4 Example.
Find the sign, mantissa and biased exponent, and give the single precision representation of the
number 1.5.
Since 1.5 is positive, s = 0. Convert the number into binary and use the normalized notation in
base 2: $1.5_{10} = 1.1_2 = 1.1 \times 2^{0} = (-1)^{0} \times 1.1 \times 2^{(e-127)}$, with e = 127.
Hence, s = 0, $e = 127_{10} = 01111111_2$, $0.f = 0.100\cdots \Longrightarrow f = 10000000000000000000000$
Bin: 0 01111111 10000000000000000000000
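The hand computation can be cross-checked with Python's standard `struct` module, which packs a float into its IEEE single precision bytes. This is a minimal sketch; `float32_fields` is an illustrative helper name.

```python
import struct

def float32_fields(x: float):
    """Split the single precision encoding of x into (sign, exponent, fraction) bit strings."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an unsigned integer.
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    bits = format(raw, "032b")
    return bits[0], bits[1:9], bits[9:]    # 1 + 8 + 23 bits

sign, exponent, fraction = float32_fields(1.5)
```

The stored exponent field can be turned back into the true exponent with `int(exponent, 2) - 127`.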
4.2.5 Coursework.
Find the sign, mantissa and biased exponent, and give the single precision representation of the
numbers 2.0 and 0.5.
We have
$2.0 = 2 = 2^{1} = (-1)^{0} \times 1.0 \times 2^{(128-127)}$,
$0.5 = 1/2 = 2^{-1} = (-1)^{0} \times 1.0 \times 2^{(126-127)}$.
• Ranges and precisions (in decimal):
4.2.7 d) Extended
Apart from the basic floating-point formats (single and double precision), there exists a wider
IEEE format called extended precision, which consists of 80 bits: 1 bit for the sign, 15 for
the exponent and 64 for the significand, with a bias of 16383. (Note, however, that numbers stored
in extended precision do not use hidden bit storage.)
Overflow: values have grown too large to be represented as a float in the given format. More
precisely, overflow occurs when the exponent is too large to be represented in the exponent
field.
Invalid operation: an operand is invalid for the operation about to be performed, so the
result of the operation is ill-defined, e.g. 0.0/0.0, the square root of a negative operand, or
any operation with a signaling NaN (not a number) operand.
Inexact calculation: the result of a floating-point operation is not exact, i.e. the
result was rounded.
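These situations can be observed from Python, whose `float` is an IEEE double. Note that this is an illustration of Python's behaviour rather than of the IEEE flags themselves: Python returns inf or nan for some operations and raises an exception for others (for example `math.sqrt` of a negative number).

```python
import math

# Overflow: the product exceeds the largest double, so the result is infinity.
overflowed = 1e308 * 10.0

# Invalid operation: inf - inf has no meaningful value, so the result is NaN.
invalid = math.inf - math.inf

# Invalid operation reported as an exception by the math module.
try:
    math.sqrt(-1.0)
    sqrt_raised = False
except ValueError:
    sqrt_raised = True

# Inexact calculation: 0.1 and 0.2 are rounded on input, and so is their sum.
inexact_sum = 0.1 + 0.2
```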
Inexact floating-point numbers. In the decimal system, only rational numbers whose denominator
has no prime factors other than 2 and 5 (i.e., numbers of the form $a/(2^{n} \times 5^{m})$) have a
terminating expansion. Similarly, in the binary system, only rational numbers whose denominator
is a power of 2 have a terminating expansion.
Example: $-1/3 = -(0.0101010101\cdots)_2 = (-1)^{1} \times (1.01010101\cdots)_2 \times 2^{-2} = (-1)^{1} \times (1 + 0.01010101\cdots) \times 2^{125-127}$.
The single precision representation of −1/3 is then 10111110101010101010101010101011.
Similarly $1/10 = (0.00011001100110011\cdots)_2$.
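Both claims can be checked numerically. The sketch below tests whether a rational number terminates in binary (its reduced denominator must be a power of 2) and prints the stored single precision pattern of a float; the helper names are illustrative.

```python
from fractions import Fraction
import struct

def terminates_in_base2(x: Fraction) -> bool:
    """True if x has a finite binary expansion."""
    d = x.denominator            # Fraction is always stored in lowest terms
    return d & (d - 1) == 0      # holds exactly when d is a power of two

def float32_bits(x: float) -> str:
    """32-bit pattern of x after rounding to single precision."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    return format(raw, "032b")

bits_minus_third = float32_bits(-1 / 3)
```

The last fraction bit of `bits_minus_third` is 1 rather than 0 because the infinite pattern 0101... is rounded to the nearest representable value.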
$\frac{|x - \bar{x}|}{|x|} \le \epsilon_{\text{mach}}$.
5.2 Roundoff errors and loss of significance
Problems
Question 1: Find the sign, mantissa and biased exponent, and write the single-precision
representation of the decimal numbers: −1.5, 0.2 and 4.
Question 2: Find the sign, mantissa and biased exponent, and write the single-precision
representation of the binary numbers: −0.1 and 0.00101.
Question 3: Write the binary normalized notation of the numbers 1.5 and −0.6375, and add them.
Question 4: Write the binary normalized notation of the numbers 12.0 and −0.2375, and multiply
them.
Question 5: On changing from IEEE single to double precision, by how many bits do the mantissa
and exponent fields grow? Deduce whether the change prioritizes the precision or the range of
representable numbers.
Question 6: Consider the quadratic equation $ax^2 + bx + c = 0$.
- Express the solutions in terms of a, b, c.
- In real arithmetic, compute the solutions for a = 1, b = 200 and c = −0.000015.
- In 10-digit floating-point arithmetic, compute the solutions for a = 1, b = 200 and c =
−0.000015. Round-off is applied where need be.
- Compare the floating-point and real results in terms of the number of correct significant digits.
- Compute the absolute and relative errors on the smallest solution (smallest in absolute value).