
Chapter 1: Floating-point arithmetic

Tools and Numerical Methods for Engineering – CEF 352

Academic year 2022-2023


Instructors: Dr. Wamba & Dr. Azeufack

Contents

1 Numbering systems and notation of numbers
  1.1 Main numbering systems
  1.2 Binary to decimal Conversion
  1.3 Decimal to Binary Conversion
      1.3.1 a) Case of integers (no fractional part)
      1.3.2 b) Case of decimal fractions
  1.4 Normalized scientific notation

2 Floating-point arithmetic operations
  2.1 Multiplication and Division
  2.2 Addition and subtraction

3 Computer representation of numbers: general principle
  3.1 Computer's memory structure
  3.2 Representation of integers
      3.2.1 a) Representation of unsigned integers
      3.2.2 b) Representation of signed integers
  3.3 Representation of real numbers
      3.3.1 a) Floating point numbers: definition
      3.3.2 b) Naive representation of FP: Comparison issues with negative exponents
      3.3.3 c) FP representation with bias exponent

4 IEEE standard for floating-point arithmetic: formats and exceptions
  4.1 What is the IEEE 754 standard about?
      4.1.1 Advantage of IEEE formatting
  4.2 IEEE Floating-point formats
      4.2.1 a) Available formats
      4.2.2 b) Half-precision format
      4.2.3 c) Single-precision format
      4.2.4 Example
      4.2.5 Coursework
      4.2.6 d) Double-precision format
      4.2.7 e) Extended precision
  4.3 Floating-point exceptions and special numbers

5 Number rounding and accuracy requirements: Machine epsilon, round-off errors and loss of significance
  5.1 Machine epsilon
  5.2 Roundoff errors and loss of significance

1 Numbering systems and notation of numbers
1.1 Main numbering systems
• Binary system: Numbers are made of 0 and 1; the base is 2, i.e. it uses 2 figures (most modern computers).
• Decimal system: Numbers contain figures that range from 0 to 9; the base is 10 (our mind, a shop's calculator).
• Hexadecimal system: Numbers contain figures that range from 0 to 9 as well as the letters A, B, ..., F; the base is 16.
Illustration: 12, 1100 and 0xC represent the same number in the decimal, binary and hexadecimal systems, respectively. For the last one, the prefix "0x" just shows that the number is hexadecimal. Both 0xF and 1111 (binary) are 15 in base-10.

1.2 Binary to decimal Conversion


Method: Obtain the decimal number as the sum of the binary digits (d_n) times their powers of 2 (2^n).

Binary digit format: d_{n−1} d_{n−2} · · · d_1 d_0 =⇒ Decimal value: d_{n−1}×2^{n−1} + d_{n−2}×2^{n−2} + · · · + d_1×2^1 + d_0×2^0

Example: Convert into decimal the binary number 11001: 11001_2 = 1×2^4 + 1×2^3 + 0×2^2 + 0×2^1 + 1×2^0 = 16 + 8 + 1 = 25.
Example with fractional part: Convert into decimal the binary number 0.1011: 0.1011_2 = 1×2^−1 + 0×2^−2 + 1×2^−3 + 1×2^−4 = 0.5 + 0.125 + 0.0625 = 0.6875.

Hence we have 11001_2 = 25_10 and 0.1011_2 = 0.6875_10.
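These digit-weight sums are easy to check in Python; the built-in `int` parses binary integers directly, and `frac_bin_to_dec` is a small illustrative helper (not part of the notes) for the fractional part:

```python
def frac_bin_to_dec(bits: str) -> float:
    """Sum each fractional binary digit times its negative power of 2."""
    return sum(int(b) * 2 ** -(i + 1) for i, b in enumerate(bits))

print(int('11001', 2))          # 25  (integer part: int() parses base 2)
print(frac_bin_to_dec('1011'))  # 0.6875
```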

1.3 Decimal to Binary Conversion


1.3.1 a) Case of integers (no fractional part)
Method: Use repeated division by 2 until the quotient is equal to 0.
Steps:
1. Divide the number by 2
2. Get the integer quotient for next iteration, and get the remainder for the binary digit
3. Repeat the steps until the quotient is equal to 0
4. Write the remainder from last to first
Example without fractional part:
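The repeated-division steps above can be sketched in Python; `int_dec_to_bin` is an illustrative helper, not part of the notes (it converts 25 back to 11001, mirroring the earlier example):

```python
def int_dec_to_bin(n: int) -> str:
    """Repeated division by 2; remainders are read from last to first."""
    if n == 0:
        return '0'
    bits = []
    while n > 0:
        n, r = divmod(n, 2)   # quotient for the next step, remainder is a digit
        bits.append(str(r))
    return ''.join(reversed(bits))

print(int_dec_to_bin(25))  # 11001
```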

1.3.2 b) Case of decimal fractions
Method: Use repeated multiplication by 2 until the fractional part of the product is 0 (or until the required number of digits is reached, since some fractions never terminate in binary).
Steps:

1. Multiply by 2

2. Get the integer part for the binary result digit, and get the fractional part for the next iteration

3. Repeat the steps until the fractional product is 0

4. Write those digits from first to last

Examples with fractional part:


(i) Convert the decimal number 0.3125 into binary

(ii) Convert the decimal number 12.125 into binary

(iii) Simple binary fractions

Binary   Fractional form (base 10)   Decimal form
0.1      1/2                         0.5
0.01     1/4                         0.25
0.11     3/4                         0.75
0.001    1/8                         0.125
0.101    5/8                         0.625
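The repeated-multiplication steps can likewise be sketched in Python; `frac_dec_to_bin` and its `max_digits` cap are added assumptions, the cap being needed because fractions such as 0.65 never terminate in binary:

```python
def frac_dec_to_bin(x: float, max_digits: int = 24) -> str:
    """Repeated multiplication by 2; integer parts are read first to last."""
    bits = []
    while x != 0 and len(bits) < max_digits:
        x *= 2
        d = int(x)        # the integer part becomes the next binary digit
        bits.append(str(d))
        x -= d            # keep the fractional part for the next step
    return '0.' + ''.join(bits)

print(frac_dec_to_bin(0.3125))  # 0.0101
```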

Note: We have Σ_i d_i × 2^i = Σ_i D_i × 10^i, where d_i and D_i are digits in base 2 and 10, respectively.

1.4 Normalized scientific notation
Problem: 0.7e-2, 0.7×10^−2, 70×10^−4, 7×10^−3 and 0.007 can all be used to represent the same number, in different notation forms. Which of those forms are in scientific notation? In normalized scientific notation?
Scientific or exponential notation: Writing a number r in the form

r = ±M × b^p

where M, called the mantissa, is the significant-digit part and p is the exponent. Scientific notation is said to be normalized when the mantissa has no leading zeros, i.e., exactly one nonzero digit to the left of the radix point.
Examples
(i) The decimal numbers 5.4 × 10^−5, 1.25 × 10^−5, 0.125 × 10^−4, 0.0125 × 10^−2 are all in scientific notation. The first two numbers are normalized while the last two are not.
(ii) In binary, any number of the form 1.m_1 m_2 · · · × 2^(p_1 p_2 ···) is a normalized number (the m_i and p_i are binary digits), e.g. 1.01 × 2^−5.

Coursework
1. Convert to binary the following decimal numbers: 1245, 98, 0.9, 0.8125, 0.6875, 0.65, 12.75
2. Convert from binary to decimal the numbers: 101010, 1111, 0.1101, 0.1011, 1100.11

2 Floating-point arithmetic operations


2.1 Multiplication and Division
Suppose we want to multiply two numbers r1 = m1 × b^p1 and r2 = m2 × b^p2. Then the product is

r = r1 r2 = m1 m2 × b^(p1+p2).

Next, if need be, normalize it to get

r = ±m × b^p,

where 1 ≤ m < b and, in general, p ≠ p1 + p2.

Example 1 (decimal). Multiply the following two numbers in normalized scientific notation: 1.110 × 10^10 and 9.200 × 10^−4.
1. Add the exponents: p = 10 + (−4) = 6
2. Multiply the mantissas: m = 1.110 × 9.200 = 10.212000
3. Keeping only 3 digits in the fractional part: m = 10.212, and thus r = 10.212 × 10^6
4. Normalize the result: r = 1.021 × 10^7

Example 2 (binary). Multiply the following two numbers in normalized scientific notation: 1.000 × 2^−1 and −1.110 × 2^−2.
1. Sign = minus
2. Add the exponents: p = −1 + (−2) = −3
3. Multiply the mantissas: m = 1.000 × 1.110 = 1.110000
4. Keeping only 3 digits in the fractional part: m = 1.110, and thus r = −1.110 × 2^−3
The result is already normalized!
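The four steps can be sketched in Python on (mantissa, exponent) pairs; `fp_multiply` is a toy illustration (keeping a 3-digit fractional part as in Example 1), not the notes' algorithm verbatim:

```python
def fp_multiply(m1, p1, m2, p2, base=10, digits=3):
    """Multiply (m1 * base**p1) by (m2 * base**p2): add exponents,
    multiply mantissas, round, then renormalize the mantissa."""
    p = p1 + p2                  # 1. add the exponents
    m = round(m1 * m2, digits)   # 2.-3. multiply the mantissas and round
    while abs(m) >= base:        # 4. normalize: one digit before the point
        m /= base
        p += 1
    return round(m, digits), p

print(fp_multiply(1.110, 10, 9.200, -4))  # (1.021, 7)
```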

2.2 Addition and subtraction

Suppose we want to add or subtract two numbers r1 = m1 × b^p1 and r2 = m2 × b^p2, with p1 > p2. Then the result is written in the form

r = r1 ± r2 = (m1 ± m2 × b^(p2−p1)) b^p1.

Next, if need be, normalize it to get

r = ±m × b^p,

where 1 ≤ m < b and, in general, p ≠ p1.

Example 1 (decimal). Add the following two numbers in normalized scientific notation: 9.970 × 10^1 and 8.740 × 10^−1.
1. Rewrite the smaller number using the exponent of the larger number: 8.740 × 10^−1 = (8.740 × 10^−2) × 10^1 = 0.08740 × 10^1
2. Add the mantissas: m = 9.970 + 0.08740 = 10.05740 =⇒ r = 10.05740 × 10^1
3. Normalize the result (if necessary, shift the mantissa and adjust the exponent): r = 10.05740 × 10^1 = 1.005740 × 10^2
4. Check for overflow/underflow of the exponent and round off: r = 1.006 × 10^2

Example 2 (binary). Write the binary normalized notation of the numbers and add them: 0.5 and −0.4375.
1. Binary normalized notation: 0.5_10 = 0.1_2 = 0.1 × 2^0 = 1.000 × 2^−1, and −0.4375_10 = −0.0111_2 = −0.0111 × 2^0 = −1.110 × 2^−2
2. Rewrite the smaller number using the exponent of the larger number: −1.110 × 2^−2 = −(1.110 × 2^−1) × 2^−1 = −0.111 × 2^−1
3. Add the mantissas: m = 1.000 + (−0.111) = 0.001 =⇒ r = 0.001 × 2^−1
4. Normalize the result: r = 1.000 × 2^−4
5. No overflow/underflow (−4 ∈ [−126, 127]), no rounding required: r = 1.000 × 2^−4
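A matching sketch for addition, with mantissas passed as ordinary values (so the binary mantissa 1.110_2 is passed as 1.75); `fp_add` is again only an illustration:

```python
def fp_add(m1, p1, m2, p2, base=2):
    """Add m1*base**p1 and m2*base**p2 by aligning to the larger exponent."""
    if p1 < p2:                        # ensure p1 >= p2
        m1, p1, m2, p2 = m2, p2, m1, p1
    m = m1 + m2 * base ** (p2 - p1)    # rewrite r2 with exponent p1, then add
    p = p1
    while m != 0 and abs(m) < 1:       # renormalize a too-small mantissa...
        m *= base
        p -= 1
    while abs(m) >= base:              # ...or a too-large one
        m /= base
        p += 1
    return m, p

# 0.5 + (-0.4375): 1.000 x 2^-1 plus -1.110 x 2^-2 (mantissa 1.110_2 = 1.75)
print(fp_add(1.0, -1, -1.75, -2))  # (1.0, -4)
```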

3 Computer representation of numbers: general principle
3.1 Computer’s memory structure
The computer representation of numbers is based on the so-called floating-point representation. The computer operates in the binary system, and thus represents numbers as an ordered set of the digits 0 and 1. Each digit is contained in a bit (binary digit), and any set of 8 bits is called a byte.

3.2 Representation of integers


3.2.1 a) Representation of unsigned integers
Suppose a computer’s memory has 1 byte available to represent numbers (like the RGB range in
most standard image formats). Which range of positive integers can we represent on that computer?
Answer: Positive numbers range in the binary system from 00000000 to 11111111, i.e., from 0 to
28 − 1 = 255 in decimal.

3.2.2 b) Representation of signed integers


The above representation of unsigned integers does not allow representing signed numbers. A number of ways of fixing the problem have been explored.
• Naive trial: Make the first bit represent the sign. Using 0 for negative and 1 for positive, negative numbers would range (in the standard binary system) from 0 0000000 to 0 1111111, and positive numbers would range from 1 0000000 to 1 1111111. The conversion to decimal would range the negative integers from 0 to 2^7 − 1 = 127 (where 0 represents −0) and the positive integers from 128 to 255, with 128 representing +0. The fact that −0 and +0 (actually equal) are represented as 00000000 and 10000000, which are quite different, leads to comparison issues. Moreover, the extra bit reserved for the sign reduces the range of representable numbers.
• The two’s complement. In two’s complement notation, a non-negative number is repre-
sented by its ordinary binary representation; in this case, the most significant bit is 0. Negative
numbers are represented by the two’s complement of their positive counterpart, leading to the most
significant bit 1. The two’s complement of an N -bit number is defined as its complement with
respect to 2N , such that the sum of a number and its two’s complement is 2N .
Example. The two's complement of the three-bit number 011_2 (3_10) is 101_2 (5_10), because 011_2 + 101_2 = 1000_2 = 8_10, which is equal to 2^3. Hence 3_10 is represented as 011_2 while −3_10 is represented as 101_2. Conversely, the two's complement of −3_10 (101_2) is 3_10 (011_2). So the two's complement of a negative number is the corresponding positive value (except in the special case of the most negative number).
In practice, the two's complement is calculated by inverting/flipping the bits (0 becomes 1 and 1 becomes 0). At this point, the representation is the one's complement of the number. Then adding one to the resulting number gives the two's complement.
Example. For the number 011_2, inverting (0 → 1 and 1 → 0) gives 100_2. Adding 1 gives 101_2. For the 5-bit number 10011_2 (19_10 unsigned), inverting gives 01100_2. Adding 1 gives 01101_2 (13_10).
Decimal (to represent)   Binary   Two's-complement notation   Notation read as unsigned decimal
        −4               −100             100                          4
        −3               −011             101                          5
        −2               −010             110                          6
        −1               −001             111                          7
         0                000             000                          0
         1                001             001                          1
         2                010             010                          2
         3                011             011                          3
Remark: Taking the two's complement of a pattern representing N yields the pattern representing −N, and vice versa. However, the two's-complement notation of a non-negative number N is just its ordinary binary representation, while the two's-complement notation of a negative number −N is the two's complement of the representation of N.

Solved exercise. Find the signed binary number representing the decimal value −5 in two's-complement form using 1 byte.
On 1 byte (8 bits), the decimal number 5 is represented by 00000101_2. The most significant bit is 0, so the pattern represents a non-negative value. To convert to −5 in two's-complement notation, first the bits are inverted: 11111010_2. That representation is the one's complement of the decimal value −5. To obtain the two's complement, 1 is added to the result, giving 11111011_2. The most significant bit is 1, so the value represented is negative.
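The invert-and-add-one recipe can be checked in Python, where masking with 0xFF yields the 8-bit two's-complement pattern directly (`twos_complement_8bit` is an illustrative helper):

```python
def twos_complement_8bit(n: int) -> str:
    """8-bit two's-complement pattern of a small signed integer."""
    return format(n & 0xFF, '08b')   # masking keeps the low 8 bits

five = 0b00000101
# invert the bits of +5 and add 1: same pattern as the one for -5
inverted_plus_one = ((five ^ 0xFF) + 1) & 0xFF
assert format(inverted_plus_one, '08b') == twos_complement_8bit(-5)
print(twos_complement_8bit(-5))  # 11111011
```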
• The biased binary system. A more practical solution: instead of letting 0 be represented by 00000000, we let 0 be 01111111, which is 127 in the standard binary system. We say that the system is biased by 127. In this case 00000000 represents −127.

Decimal (to represent)   Binary      Biased binary   Biased decimal
       −127              −01111111   00000000          0
          0               00000000   01111111        127
        127               01111111   11111110        254
        128               10000000   11111111        255

3.3 Representation of real numbers


3.3.1 a) Floating point numbers: definition
Computers store real numbers as normalized binary numbers in a very specific way called floating point (FP), which allows the representation of both astronomic and atomic distances in a similar way. The basic idea of floats (FP numbers) is to use the binary equivalent of normalized scientific notation. The term FP refers to the fact that the radix point of a number (decimal or binary point) can float, i.e., be placed anywhere relative to the significant digits of the number. The radix position is specified by the exponent.

r = (−1)^s × M × b^p,   where M = mantissa or significand, p = exponent, b = base (radix) of the exponent, s = sign.

Example: Decimal: 0.00527_10 = 0.527_10 × 10^−2; Binary: 10.1_2 = 0.101_2 × 2^2 = 0.101_2 × 2^(10_2)

Remark. Negative and positive numbers will have 1 and 0 as sign bits, resp, in contrast to what
we took for naive representation of integers above.

3.3.2 b) Naive representation of FP: Comparison issues with negative exponents


The idea is to book one bit for the sign, and then share the rest of available bits between the
mantissa and exponent. The numbers of bits of the mantissa and exponent, respectively, determine
the precision and the range of representable numbers of the computer.
Problem situation. Suppose a computer's memory has 2 bytes available to represent numbers. Find the representation of the numbers 1.0 × 2^−1 and 1.0 × 2^+1 if the memory architecture is as follows: (i) the exponent is unbiased, and positive and negative exponents are signed with 0 and 1, resp.; (ii) 1 byte is reserved for the exponent; (iii) the exponent is stored starting from the 2nd bit, right after the sign. Which range of floats can we represent on that computer?
Discussion. 1.0 × 2^−1 ≡ 0 10000001 0000001
1.0 × 2^+1 ≡ 0 00000001 0000001
We know that 1.0 × 2^−1 < 1.0 × 2^+1, but 10000001 > 00000001! So the first exponent shows a "larger" binary number, making direct comparison more difficult.

3.3.3 c) FP representation with bias exponent


Reconsider the problem above. The bias used in the exponent is 127. Then we have −1 + 127 = 126 = 01111110_2 and +1 + 127 = 128 = 10000000_2. Since 10000000 > 01111110, the comparison issue with negative exponents is solved by the biased exponent!

Biasing the exponent consists in adding a constant, called the bias, to the exponent, chosen to make the range of stored exponents nonnegative. The bias solves the comparison issues. As for the range of representable numbers, there is still room for improvement.
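A quick Python check that biased 8-bit exponent fields compare correctly as plain bit strings (`biased_exponent_bits` is an illustrative helper):

```python
BIAS = 127

def biased_exponent_bits(p: int) -> str:
    """8-bit stored form of exponent p under a bias of 127."""
    return format(p + BIAS, '08b')

e_neg = biased_exponent_bits(-1)   # '01111110'
e_pos = biased_exponent_bits(+1)   # '10000000'
# equal-length binary strings compare like unsigned integers
assert e_neg < e_pos
print(e_neg, e_pos)
```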

4 IEEE standard for floating-point arithmetic: formats and exceptions
4.1 What is the IEEE 754 standard about?
IEEE 754 standard specifications. There were various floating-point representations up to the 1990s. Today, the IEEE 754 standard (on floating-point representation) is the most common representation for real numbers on computers and in many hardware floating-point units. The IEEE Standard for Floating-Point Arithmetic (IEEE 754) is a technical standard for floating-point arithmetic established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). The standard addressed many problems found in the diverse floating-point implementations that made them difficult to use reliably and portably. The standard defines:
• arithmetic formats: sets of binary and decimal floating-point data, which consist of finite numbers (including signed zeros and subnormal numbers), infinities, and special "not a number" values (NaNs)

• interchange formats: encodings (bit strings) that may be used to exchange floating-point data in an efficient and compact form

• rounding rules: properties to be satisfied when rounding numbers during arithmetic and conversions

• operations: arithmetic and other operations (such as trigonometric functions) on arithmetic formats

• exception handling: indications of exceptional conditions (such as division by zero, overflow, etc.)

Convention 1. Since the normalized binary mantissa always writes 1.xx· · · (except for 0), omit the leading 1 and store only the fractional part of the mantissa (as the IEEE mantissa, f).

Convention 2. If the actual exponent is p then it is represented as E = p + bias.

IEEE floating-point number representation:

r = (−1)^s × (1 + 0.f) × 2^(E−bias),   standard mantissa: 1 + 0.f

f is the IEEE mantissa (fractional part of the standard mantissa), p is the exponent, s is the sign, bias = 127 for single precision and 1023 for double. E is the so-called biased exponent, 1.f is the significand, and the dot is the radix point (binary point for base 2, decimal point for base 10).

4.1.1 Advantage of IEEE formatting


• Increases the effective mantissa storage by the hidden bit (single precision: 23+1 bits; double precision: 52+1 bits)
• Avoids comparison issues with negative exponents.

4.2 IEEE Floating-point formats


4.2.1 a) Available formats
In the FP representation there are two main IEEE formats: single precision and double precision. A floating-point format can only represent a finite set of numbers (written as per the specifications of the format).

Bit  = 0 or 1 (binary digit)
Byte = 8 bits
Word = Reals:    4 bytes (single precision)
                 8 bytes (double precision)
     = Integers: signed:   1, 2, 4, or 8 bytes
                 unsigned: 1, 2, 4, or 8 bytes
Remark: The single-precision format uses 32 bits while the double-precision format uses 64 bits.

4.2.2 b) Half-precision format


Half precision consists of 16 bits, with 1 bit for the sign, 5 for the exponent, 10 for the significand, and a bias of 15 (hidden-bit storage is used). Note that bias = 2^(n−1) − 1, with n being the number of bits of the exponent.

4.2.3 c) Single-precision format


Single precision consists of 32 bits, with 1 bit for the sign, 8 for the exponent, 23 for the significand, and a bias of 127 (hidden-bit storage is used).
• Bit index: 31, 30, · · · , 0

• Ranges and precisions (in decimal):

4.2.4 Example.
Find the sign, mantissa, biased exponent and give the single-precision representation of the number 1.5.
Since 1.5 is positive, s = 0. Convert the number into binary and use the normalized notation in base 2: 1.5_10 = 1.1_2 = 1.1 × 2^0 = (−1)^0 × 1.1 × 2^(E−127), with E = 127.
Hence, s = 0, E = 127_10 = 01111111_2, 0.f = 0.100· · · =⇒ f = 10000000000000000000000
Bin: 0 01111111 10000000000000000000000
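This bit pattern can be verified with Python's standard `struct` module (`float32_bits` is an illustrative helper):

```python
import struct

def float32_bits(x: float) -> str:
    """IEEE 754 single-precision pattern as 'sign exponent fraction'."""
    (n,) = struct.unpack('>I', struct.pack('>f', x))  # raw 32-bit pattern
    b = format(n, '032b')
    return f'{b[0]} {b[1:9]} {b[9:]}'

print(float32_bits(1.5))  # 0 01111111 10000000000000000000000
```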

4.2.5 Coursework.
Find the sign, mantissa, biased exponent and give the single-precision representation of the numbers 2.0 and 0.5.
We have
2.0 = 2 = 2^1 = (−1)^0 × 1.0 × 2^(128−127),
0.5 = 1/2 = 2^−1 = (−1)^0 × 1.0 × 2^(126−127)

4.2.6 d) Double-precision format


Double precision consists of 64 bits, with 1 bit for the sign, 11 for the exponent, 52 for the significand, and a bias of 1023 (hidden-bit storage is used).
• Bit index: 63, 62, · · · , 0
• The exponent field 11111111111 is reserved for infinities and NaNs.
• The exponent field 00000000000 is reserved for zero and the subnormal numbers (it would otherwise correspond to the actual exponent −1023).

• Ranges and precisions (in decimal):

4.2.7 e) Extended precision
Apart from the basic floating-point formats (single and double precision), there exists a larger IEEE format called extended precision, which consists of 80 bits, with 1 bit for the sign, 15 for the exponent, 64 for the significand, and a bias of 16383. (Note, however, that numbers stored in extended precision do not use hidden-bit storage.)

4.3 Floating-point exceptions and special numbers


Extreme situations may arise in floating-point representation. An arithmetic exception arises when the result of a floating-point operation is unclear or undesirable. The IEEE 754 standard defines five types of floating-point exception that must be signaled when detected:

• Overflow. Values have grown too large to be represented as a float in the given format. More precisely, overflow occurs when the exponent is too large to be represented in the exponent field.

• Underflow. When the result of an operation is too small to be represented as a normalized float in its format, there is underflow. In that case the exponent is too small to be represented in the exponent field. Underflow is a less serious problem because it just denotes a loss of precision, and the result is closely approximated by zero.

• Invalid operation: when an operand is invalid for the operation about to be performed, so that the result is ill-defined, such as 0.0/0.0, the square root of a negative operand, or any operation with a signaling NaN (not a number) operand.

• Division by zero: when a finite nonzero float is divided by zero.

• Inexact calculation: when the result of a floating-point operation is not exact, i.e. the result was rounded.

Inexact floating-point numbers. In the decimal system, only rational numbers whose denominator can be factorized in terms of 2 and 5 (i.e., a/(2^n × 5^m)) terminate, while others do not.
Similarly, in the binary system only rational numbers whose denominator is a power of 2 terminate, while others do not.
Example: −1/3 = (0.0101010101 · · · )_2 = (−1)^1 × (1.01010101 · · · ) × 2^−2 = (−1)^1 × (1 + 0.01010101 · · · ) × 2^(125−127). The single-precision representation of −1/3 is then 1 01111101 01010101010101010101011.
Similarly, 1/10 = (0.00011001100110011 · · · )_2
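The non-terminating expansion of 1/10 shows up directly in Python's double-precision arithmetic when more digits are printed than the stored value actually carries:

```python
# 0.1 is stored as the nearest binary double, not exactly one tenth
print(f'{0.1:.20f}')       # 0.10000000000000000555
# the rounding errors surface in arithmetic
print(0.1 + 0.2 == 0.3)    # False
print(0.1 + 0.2)           # 0.30000000000000004
```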

5 Number rounding and accuracy requirements: Machine epsilon, round-off errors and loss of significance

Number rounding: direction, precision, significant figures and round-off error

5.1 Machine epsilon


Digital computers are fixed-precision devices, and the number of digits the device can manipulate depends on its hardware configuration.
Machine precision or machine epsilon is the smallest number, denoted ε_mach (or macheps or eps), such that the difference between 1 and 1 + ε_mach is nonzero, i.e., it is the smallest difference between two numbers that the computer can recognize or represent. It gives an upper bound on the relative error due to rounding in floating-point arithmetic.
On single-precision processors, machine epsilon is 2^−23 (approximately 10^−7), while in double precision it is 2^−52 (approximately 10^−16): these correspond to the IEEE mantissas f = 00000000000000000000000 and f = 00000000000000000000001, which differ in the last bit.
ε_mach is determined computationally by finding the smallest positive ε for which 1 + ε ≠ 1. For instance, if a particular computing device computes 1.000000001 for 1 + 10^−9 but 1 for 1 + 10^−10, then we conclude that 10^−10 ≤ ε_mach < 10^−9, and the device in this case would be known as a 10-significant-digit device.
Machine precision characterizes the accuracy of a floating-point system, and its value depends on the particular rounding being used.
For rounding to nearest we write ε_mach = 0.5 × B^(1−P), where P is the precision and B the base. In the above example B = 10 and P = 10 (digits of the number 1.000000001). Then, for rounding to nearest, ε_mach = 0.5 × 10^−9.
It is important since it bounds the relative error in representing any nonzero real number x within the normalized range of a floating-point system:

|x − x̄| / |x| ≤ ε_mach,

where x̄ is the floating-point value representing x.
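The computational determination of ε_mach described above can be sketched in Python:

```python
import sys

eps = 1.0
while 1.0 + eps / 2 != 1.0:   # halve until 1 + eps/2 is indistinguishable from 1
    eps /= 2

print(eps)  # 2.220446049250313e-16, i.e. 2**-52 in double precision
assert eps == sys.float_info.epsilon
```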

5.2 Roundoff errors and loss of significance
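Loss of significance arises when two nearly equal numbers are subtracted and their leading digits cancel. A sketch in Python, reusing the quadratic of Question 6 below; the rewritten formula is a standard remedy added here as an illustration, not taken from the notes:

```python
import math

a, b, c = 1.0, 200.0, -0.000015
d = math.sqrt(b * b - 4 * a * c)

# Naive formula: -b + d subtracts two nearly equal numbers
# (200.00000015 and 200), so leading digits cancel in the small root.
x_naive = (-b + d) / (2 * a)

# Algebraically equivalent form that avoids the subtraction.
x_stable = (2 * c) / (-b - d)

print(x_naive)
print(x_stable)
```

Comparing the printed values digit by digit shows the naive root losing significant digits to cancellation, which is the effect Question 6 asks you to quantify in 10-digit arithmetic.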

Problems
Question 1: Find the sign, mantissa, bias exponent and write the single-precision representation
of the decimal numbers: −1.5, 0.2 and 4.
Question 2: Find the sign, mantissa, bias exponent and write the single-precision representation
of the binary numbers: −0.1 and 0.00101.
Question 3: Write the binary normalized notation of the numbers and add them : 1.5 and −0.6375.
Question 4: Write the binary normalized notation of the numbers and multiply them : 12.0 and
−0.2375.
Question 5: On changing from IEEE single- to double-precision, how quantitatively do the numbers
of bits representing the mantissa and exponent change? Deduce whether the change prioritizes the
precision or the range of expressible numbers.
Question 6: Consider the quadratic equation ax^2 + bx + c = 0.
- Express the solutions in terms of a, b, c.
- In real arithmetic, compute the solutions for a = 1, b = 200 and c = −0.000015.
- In 10-digit floating-point arithmetic, compute the solutions for a = 1, b = 200 and c = −0.000015. Apply round-off where need be.
- Compare the floating-point and real results in terms of the number of correct significant digits.
- Compute the absolute and relative errors on the smallest solution (smallest in absolute value).


