
Chapter 1: Floating-point arithmetic

Tools and Numerical Methods for Engineering – CEF 352

Academic year 2022-2023


Instructors: Dr. Wamba & Dr. Azeufack

Contents

1 Numbering systems and notation of numbers
  1.1 Main numbering systems
  1.2 Binary to decimal Conversion
  1.3 Decimal to Binary Conversion
      1.3.1 a) Case of integers (no fractional part)
      1.3.2 b) Case of decimal fractions
  1.4 Normalized scientific notation

2 Floating-point arithmetic operations
  2.1 Multiplication and Division
  2.2 Addition and subtraction

3 Computer representation of numbers: general principle
  3.1 Computer's memory structure
  3.2 Representation of integers
      3.2.1 a) Representation of unsigned integers
      3.2.2 b) Representation of signed integers
  3.3 Representation of real numbers
      3.3.1 a) Floating point numbers: definition
      3.3.2 b) Naive representation of FP: Comparison issues with negative exponents
      3.3.3 c) FP representation with bias exponent

4 IEEE standard for floating-point arithmetic: formats and exceptions
  4.1 What is the IEEE 754 standard about?
      4.1.1 Advantage of IEEE formatting
  4.2 IEEE Floating-point formats
      4.2.1 a) Available formats
      4.2.2 b) Half-precision format
      4.2.3 c) Single-precision format
      4.2.4 Example
      4.2.5 Coursework
      4.2.6 d) Double-precision format
      4.2.7 e) Extended precision
  4.3 Floating-point exceptions and special numbers

5 Number rounding and accuracy requirements: Machine epsilon, round-off errors and loss of significance
  5.1 Machine epsilon
  5.2 Roundoff errors and loss of significance

1 Numbering systems and notation of numbers
1.1 Main numbering systems
• Binary system: Numbers are made of 0 and 1; the base is 2, i.e. it uses 2 figures (most modern computers).
• Decimal system: Numbers contain figures that range from 0 to 9; the base is 10 (our mind, a shop's calculator).
• Hexadecimal system: Numbers contain figures that range from 0 to 9 as well as the letters A, B, ..., F; the base is 16.
Illustration: 12, 1100 and 0xC represent the same number in the decimal, binary and hexadecimal systems, respectively. For the last one, the prefix "0x" just shows that the number is hexadecimal. Both 0xF and 1111 (binary) are 15 in base-10.

1.2 Binary to decimal Conversion


Method: Obtain the decimal number as the sum of the binary digits (d_n) times their powers of 2 (2^n).

Binary digit format: d_{n−1} d_{n−2} · · · d_1 d_0 =⇒ Decimal value: d_{n−1}×2^{n−1} + d_{n−2}×2^{n−2} + · · · + d_1×2^1 + d_0×2^0

Example: Convert into decimal the binary number 11001: 11001_2 = 1×2^4 + 1×2^3 + 0×2^2 + 0×2^1 + 1×2^0 = 16 + 8 + 1 = 25.
Example with fractional part: Convert into decimal the binary number 0.1011: 0.1011_2 = 1×2^−1 + 0×2^−2 + 1×2^−3 + 1×2^−4 = 0.5 + 0.125 + 0.0625 = 0.6875.

Hence we have 11001_2 = 25_10 and 0.1011_2 = 0.6875_10.
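These digit-weight sums are easy to check in Python; the built-in `int` parses binary integers directly, and `frac_bin_to_dec` is a small illustrative helper (not part of the notes) for the fractional part:

```python
def frac_bin_to_dec(bits: str) -> float:
    """Sum each fractional binary digit times its negative power of 2."""
    return sum(int(b) * 2 ** -(i + 1) for i, b in enumerate(bits))

print(int('11001', 2))          # 25  (integer part: int() parses base 2)
print(frac_bin_to_dec('1011'))  # 0.6875
```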

1.3 Decimal to Binary Conversion


1.3.1 a) Case of integers (no fractional part)
Method: Use repeated division by 2 until the quotient is equal to 0.
Steps:
1. Divide the number by 2
2. Get the integer quotient for next iteration, and get the remainder for the binary digit
3. Repeat the steps until the quotient is equal to 0
4. Write the remainder from last to first
Example without fractional part:
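The repeated-division steps above can be sketched in Python; `int_dec_to_bin` is an illustrative helper, not part of the notes (it converts 25 back to 11001, mirroring the earlier example):

```python
def int_dec_to_bin(n: int) -> str:
    """Repeated division by 2; remainders are read from last to first."""
    if n == 0:
        return '0'
    bits = []
    while n > 0:
        n, r = divmod(n, 2)   # quotient for the next step, remainder is a digit
        bits.append(str(r))
    return ''.join(reversed(bits))

print(int_dec_to_bin(25))  # 11001
```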

1.3.2 b) Case of decimal fractions
Method: Use repeated multiplication by 2 until the fractional part of the product is 0 (or until the required number of digits is reached, since some fractions never terminate in binary).
Steps:

1. Multiply by 2

2. Get the integer part for the binary result digit, and get the fractional part for the next iteration

3. Repeat the steps until the fractional product is 0

4. Write those digits from first to last

Examples with fractional part:


(i) Convert the decimal number 0.3125 into binary

(ii) Convert the decimal number 12.125 into binary

(iii) Simple binary fractions

Binary   Fractional form (base 10)   Decimal form
0.1      1/2                         0.5
0.01     1/4                         0.25
0.11     3/4                         0.75
0.001    1/8                         0.125
0.101    5/8                         0.625
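The repeated-multiplication steps can likewise be sketched in Python; `frac_dec_to_bin` and its `max_digits` cap are added assumptions, the cap being needed because fractions such as 0.65 never terminate in binary:

```python
def frac_dec_to_bin(x: float, max_digits: int = 24) -> str:
    """Repeated multiplication by 2; integer parts are read first to last."""
    bits = []
    while x != 0 and len(bits) < max_digits:
        x *= 2
        d = int(x)        # the integer part becomes the next binary digit
        bits.append(str(d))
        x -= d            # keep the fractional part for the next step
    return '0.' + ''.join(bits)

print(frac_dec_to_bin(0.3125))  # 0.0101
```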

Note: We have Σ_i d_i × 2^i = Σ_i D_i × 10^i, where d_i and D_i are digits in base 2 and 10, respectively.

1.4 Normalized scientific notation
Problem: 0.7e-2, 0.7×10^−2, 70×10^−4, 7×10^−3 and 0.007 can all be used to represent the same number, in different notation forms. Which of those forms are in scientific notation? In normalized scientific notation?
Scientific or exponential notation: Writing a number r in the form

r = ±M × b^p

where M, called the mantissa, is the significant-digit part and p is the exponent. Scientific notation is said to be normalized when the mantissa has no leading zeros, i.e., exactly one nonzero digit to the left of the radix point.
Examples
(i) The decimal numbers 5.4 × 10^−5, 1.25 × 10^−5, 0.125 × 10^−4, 0.0125 × 10^−2 are all in scientific notation. The first two numbers are normalized while the last two are not.
(ii) In binary, any number of the form 1.m_1 m_2 · · · × 2^(p_1 p_2 ···) is a normalized number (the m_i and p_i are binary digits), e.g. 1.01 × 2^−5.

Coursework
1. Convert to binary the following decimal numbers: 1245, 98, 0.9, 0.8125, 0.6875, 0.65, 12.75
2. Convert from binary to decimal the numbers: 101010, 1111, 0.1101, 0.1011, 1100.11

2 Floating-point arithmetic operations


2.1 Multiplication and Division
Suppose we want to multiply two numbers r1 = m1 × b^p1 and r2 = m2 × b^p2. Then the product is

r = r1 r2 = m1 m2 × b^(p1+p2).

Next, if need be, normalize it to get

r = ±m × b^p,

where 1 ≤ m < b and, in general, p ≠ p1 + p2.

Example 1 (decimal). Multiply the following two numbers in normalized scientific notation: 1.110 × 10^10 and 9.200 × 10^−4.
1. Add the exponents: p = 10 + (−4) = 6
2. Multiply the mantissas: m = 1.110 × 9.200 = 10.212000
3. Keeping only 3 digits in the fractional part: m = 10.212, and thus r = 10.212 × 10^6
4. Normalize the result: r = 1.021 × 10^7

Example 2 (binary). Multiply the following two numbers in normalized scientific notation: 1.000 × 2^−1 and −1.110 × 2^−2.
1. Sign = minus
2. Add the exponents: p = −1 + (−2) = −3
3. Multiply the mantissas: m = 1.000 × 1.110 = 1.110000
4. Keeping only 3 digits in the fractional part: m = 1.110, and thus r = −1.110 × 2^−3
The result is already normalized!
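The four steps can be sketched in Python on (mantissa, exponent) pairs; `fp_multiply` is a toy illustration (keeping a 3-digit fractional part as in Example 1), not the notes' algorithm verbatim:

```python
def fp_multiply(m1, p1, m2, p2, base=10, digits=3):
    """Multiply (m1 * base**p1) by (m2 * base**p2): add exponents,
    multiply mantissas, round, then renormalize the mantissa."""
    p = p1 + p2                  # 1. add the exponents
    m = round(m1 * m2, digits)   # 2.-3. multiply the mantissas and round
    while abs(m) >= base:        # 4. normalize: one digit before the point
        m /= base
        p += 1
    return round(m, digits), p

print(fp_multiply(1.110, 10, 9.200, -4))  # (1.021, 7)
```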

2.2 Addition and subtraction

Suppose we want to add or subtract two numbers r1 = m1 × b^p1 and r2 = m2 × b^p2, with p1 > p2. Then the result is written in the form

r = r1 ± r2 = (m1 ± m2 × b^(p2−p1)) b^p1.

Next, if need be, normalize it to get

r = ±m × b^p,

where 1 ≤ m < b and, in general, p ≠ p1.

Example 1 (decimal). Add the following two numbers in normalized scientific notation: 9.970 × 10^1 and 8.740 × 10^−1.
1. Rewrite the smaller number using the exponent of the larger number: 8.740 × 10^−1 = (8.740 × 10^−2) × 10^1 = 0.08740 × 10^1
2. Add the mantissas: m = 9.970 + 0.08740 = 10.05740 =⇒ r = 10.05740 × 10^1
3. Normalize the result (if necessary, shift the mantissa and adjust the exponent): r = 10.05740 × 10^1 = 1.005740 × 10^2
4. Check for overflow/underflow of the exponent and round off: r = 1.006 × 10^2

Example 2 (binary). Write the binary normalized notation of the numbers and add them: 0.5 and −0.4375.
1. Binary normalized notation: 0.5_10 = 0.1_2 = 0.1 × 2^0 = 1.000 × 2^−1, and −0.4375_10 = −0.0111_2 = −0.0111 × 2^0 = −1.110 × 2^−2
2. Rewrite the smaller number using the exponent of the larger number: −1.110 × 2^−2 = −(1.110 × 2^−1) × 2^−1 = −0.111 × 2^−1
3. Add the mantissas: m = 1.000 + (−0.111) = 0.001 =⇒ r = 0.001 × 2^−1
4. Normalize the result: r = 1.000 × 2^−4
5. No overflow/underflow (−4 ∈ [−126, 127]), no rounding required: r = 1.000 × 2^−4
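A matching sketch for addition, with mantissas passed as ordinary values (so the binary mantissa 1.110_2 is passed as 1.75); `fp_add` is again only an illustration:

```python
def fp_add(m1, p1, m2, p2, base=2):
    """Add m1*base**p1 and m2*base**p2 by aligning to the larger exponent."""
    if p1 < p2:                        # ensure p1 >= p2
        m1, p1, m2, p2 = m2, p2, m1, p1
    m = m1 + m2 * base ** (p2 - p1)    # rewrite r2 with exponent p1, then add
    p = p1
    while m != 0 and abs(m) < 1:       # renormalize a too-small mantissa...
        m *= base
        p -= 1
    while abs(m) >= base:              # ...or a too-large one
        m /= base
        p += 1
    return m, p

# 0.5 + (-0.4375): 1.000 x 2^-1 plus -1.110 x 2^-2 (mantissa 1.110_2 = 1.75)
print(fp_add(1.0, -1, -1.75, -2))  # (1.0, -4)
```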

3 Computer representation of numbers: general principle
3.1 Computer’s memory structure
The computer representation of numbers is based on the so-called floating-point representation. The computer operates in the binary system, and thus represents numbers as an ordered set of the digits 0 and 1. Each digit is contained in a bit (binary digit), and any set of 8 bits is called a byte.

3.2 Representation of integers


3.2.1 a) Representation of unsigned integers
Suppose a computer’s memory has 1 byte available to represent numbers (like the RGB range in
most standard image formats). Which range of positive integers can we represent on that computer?
Answer: Positive numbers range in the binary system from 00000000 to 11111111, i.e., from 0 to
28 − 1 = 255 in decimal.

3.2.2 b) Representation of signed integers


The above representation of unsigned integers does not allow representing signed numbers. A number of ways of fixing the problem have been explored.
• Naive trial: Make the first bit represent the sign. Using 0 for negative and 1 for positive, negative numbers would range (in the standard binary system) from 0 0000000 to 0 1111111, and positive numbers would range from 1 0000000 to 1 1111111. The conversion to decimal would range the negative integers from 0 to 2^7 − 1 = 127 (where 0 represents −0) and the positive integers from 128 to 255, with 128 representing +0. The fact that −0 and +0 (actually equal) are represented as 00000000 and 10000000, which are quite different, leads to comparison issues. Moreover, the extra bit reserved for the sign reduces the range of representable numbers.
• The two’s complement. In two’s complement notation, a non-negative number is repre-
sented by its ordinary binary representation; in this case, the most significant bit is 0. Negative
numbers are represented by the two’s complement of their positive counterpart, leading to the most
significant bit 1. The two’s complement of an N -bit number is defined as its complement with
respect to 2N , such that the sum of a number and its two’s complement is 2N .
Example. The two's complement of the three-bit number 011_2 (3_10) is 101_2 (5_10), because 011_2 + 101_2 = 1000_2 = 8_10, which is equal to 2^3. Hence 3_10 is represented as 011_2 while −3_10 is represented as 101_2. Conversely, the two's complement of −3_10 (101_2) is 3_10 (011_2). So the two's complement of a negative number is the corresponding positive value (except in the special case of the most negative number).
In practice, the two's complement is calculated by inverting/flipping the bits (0 becomes 1 and 1 becomes 0). At this point, the representation is the one's complement of the number. Then adding one to the resulting number gives the two's complement.
Example. For the number 011_2, inverting (0 → 1 and 1 → 0) gives 100_2. Adding 1 gives 101_2. For the 5-bit number 10011_2 (19_10 unsigned), inverting gives 01100_2. Adding 1 gives 01101_2 (13_10).
Decimal (to represent)   Binary   Two's-complement notation   Notation read as unsigned decimal
        −4               −100             100                          4
        −3               −011             101                          5
        −2               −010             110                          6
        −1               −001             111                          7
         0                000             000                          0
         1                001             001                          1
         2                010             010                          2
         3                011             011                          3
Remark: Taking the two's complement of a pattern representing N yields the pattern representing −N, and vice versa. However, the two's-complement notation of a non-negative number N is just its ordinary binary representation, while the two's-complement notation of a negative number −N is the two's complement of the representation of N.

Solved exercise. Find the signed binary number representing the decimal value −5 in two's-complement form using 1 byte.
On 1 byte (8 bits), the decimal number 5 is represented by 00000101_2. The most significant bit is 0, so the pattern represents a non-negative value. To convert to −5 in two's-complement notation, first the bits are inverted: 11111010_2. That representation is the one's complement of the decimal value −5. To obtain the two's complement, 1 is added to the result, giving 11111011_2. The most significant bit is 1, so the value represented is negative.
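The invert-and-add-one recipe can be checked in Python, where masking with 0xFF yields the 8-bit two's-complement pattern directly (`twos_complement_8bit` is an illustrative helper):

```python
def twos_complement_8bit(n: int) -> str:
    """8-bit two's-complement pattern of a small signed integer."""
    return format(n & 0xFF, '08b')   # masking keeps the low 8 bits

five = 0b00000101
# invert the bits of +5 and add 1: same pattern as the one for -5
inverted_plus_one = ((five ^ 0xFF) + 1) & 0xFF
assert format(inverted_plus_one, '08b') == twos_complement_8bit(-5)
print(twos_complement_8bit(-5))  # 11111011
```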
• The biased binary system. A more practical solution: instead of letting 0 be represented by 00000000, we let 0 be 01111111, which is 127 in the standard binary system. We say that the system is biased by 127. In this case 00000000 represents −127.

Decimal (to represent)   Binary      Biased binary   Biased decimal
       −127              −01111111   00000000          0
          0               00000000   01111111        127
        127               01111111   11111110        254
        128               10000000   11111111        255

3.3 Representation of real numbers


3.3.1 a) Floating point numbers: definition
Computers store real numbers as normalized binary numbers in a very specific way called floating point (FP), which allows the representation of both astronomic and atomic distances in a similar way. The basic idea of floats (FP numbers) is to use the binary equivalent of normalized scientific notation. The term FP refers to the fact that the radix point of a number (decimal or binary point) can float, i.e., be placed anywhere relative to the significant digits of the number. The radix position is specified by the exponent.

r = (−1)^s × M × b^p,   where M = mantissa or significand, p = exponent, b = base (radix) of the exponent, s = sign.

Example: Decimal: 0.00527_10 = 0.527_10 × 10^−2; Binary: 10.1_2 = 0.101_2 × 2^2 = 0.101_2 × 2^(10_2)

Remark. Negative and positive numbers will have 1 and 0 as sign bits, resp, in contrast to what
we took for naive representation of integers above.

3.3.2 b) Naive representation of FP: Comparison issues with negative exponents


The idea is to book one bit for the sign, and then share the rest of available bits between the
mantissa and exponent. The numbers of bits of the mantissa and exponent, respectively, determine
the precision and the range of representable numbers of the computer.
Problem situation. Suppose a computer's memory has 2 bytes available to represent numbers. Find the representation of the numbers 1.0 × 2^−1 and 1.0 × 2^+1 if the memory architecture is as follows: (i) the exponent is unbiased, and positive and negative exponents are signed with 0 and 1, resp.; (ii) 1 byte is reserved for the exponent; (iii) the exponent is stored starting from the 2nd bit, right after the sign. Which range of floats can we represent on that computer?
Discussion. 1.0 × 2^−1 ≡ 0 10000001 0000001
1.0 × 2^+1 ≡ 0 00000001 0000001
We know that 1.0 × 2^−1 < 1.0 × 2^+1, but 10000001 > 00000001! So the first exponent shows a "larger" binary number, making direct comparison more difficult.

3.3.3 c) FP representation with bias exponent


Reconsider the problem above. The bias used in the exponent is 127. Then we have −1 + 127 = 126 = 01111110_2 and +1 + 127 = 128 = 10000000_2. Since 10000000 > 01111110, the comparison issue with negative exponents is solved by the biased exponent!

Biasing the exponent consists in adding a constant, called the bias, to the exponent, chosen to make the range of stored exponents nonnegative. The bias solves the comparison issues. As for the range of representable numbers, there is still room for improvement.
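A quick Python check that biased 8-bit exponent fields compare correctly as plain bit strings (`biased_exponent_bits` is an illustrative helper):

```python
BIAS = 127

def biased_exponent_bits(p: int) -> str:
    """8-bit stored form of exponent p under a bias of 127."""
    return format(p + BIAS, '08b')

e_neg = biased_exponent_bits(-1)   # '01111110'
e_pos = biased_exponent_bits(+1)   # '10000000'
# equal-length binary strings compare like unsigned integers
assert e_neg < e_pos
print(e_neg, e_pos)
```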

4 IEEE standard for floating-point arithmetic: formats and exceptions
4.1 What is the IEEE 754 standard about?
IEEE 754 standard specifications. There were various floating-point representations up to the 1990s. Today, the IEEE 754 standard (on floating-point representation) is the most common representation for real numbers on computers and in many hardware floating-point units. The IEEE Standard for Floating-Point Arithmetic (IEEE 754) is a technical standard for floating-point arithmetic established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). The standard addressed many problems found in the diverse floating-point implementations that made them difficult to use reliably and portably. The standard defines:
• arithmetic formats: sets of binary and decimal floating-point data, which consist of finite numbers (including signed zeros and subnormal numbers), infinities, and special "not a number" values (NaNs)

• interchange formats: encodings (bit strings) that may be used to exchange floating-point data in an efficient and compact form

• rounding rules: properties to be satisfied when rounding numbers during arithmetic and conversions

• operations: arithmetic and other operations (such as trigonometric functions) on arithmetic formats

• exception handling: indications of exceptional conditions (such as division by zero, overflow, etc.)

Convention 1. Since the normalized binary mantissa always writes 1.xx· · · (except for 0), omit the leading 1 and store only the fractional part of the mantissa (as the IEEE mantissa, f).

Convention 2. If the actual exponent is p then it is represented as E = p + bias.

IEEE floating-point number representation:

r = (−1)^s × (1 + 0.f) × 2^(E−bias),   standard mantissa: 1 + 0.f

f is the IEEE mantissa (fractional part of the standard mantissa), p is the exponent, s is the sign, bias = 127 for single precision and 1023 for double. E is the so-called biased exponent, 1.f is the significand, and the dot is the radix point (binary point for base 2, decimal point for base 10).

4.1.1 Advantage of IEEE formatting


• Increases the effective mantissa storage by the hidden bit (single precision: 23+1 bits; double precision: 52+1 bits)
• Avoids comparison issues with negative exponents.

4.2 IEEE Floating-point formats


4.2.1 a) Available formats
In the FP representation there are two main IEEE formats: single precision and double precision. A floating-point format can only represent a finite set of numbers (written as per the specifications of the format).

Bit  = 0 or 1 (binary digit)
Byte = 8 bits
Word = Reals:    4 bytes (single precision)
                 8 bytes (double precision)
     = Integers: signed:   1, 2, 4, or 8 bytes
                 unsigned: 1, 2, 4, or 8 bytes
Remark: The single-precision format uses 32 bits while the double-precision format uses 64 bits.

4.2.2 b) Half-precision format


Half precision consists of 16 bits, with 1 bit for the sign, 5 for the exponent, 10 for the significand, and a bias of 15 (hidden-bit storage is used). Note that bias = 2^(n−1) − 1, with n being the number of bits of the exponent.

4.2.3 c) Single-precision format


Single precision consists of 32 bits, with 1 bit for the sign, 8 for the exponent, 23 for the significand, and a bias of 127 (hidden-bit storage is used).
• Bit index: 31, 30, · · · , 0

• Ranges and precisions (in decimal):

4.2.4 Example.
Find the sign, mantissa, biased exponent and give the single-precision representation of the number 1.5.
Since 1.5 is positive, s = 0. Convert the number into binary and use the normalized notation in base 2: 1.5_10 = 1.1_2 = 1.1 × 2^0 = (−1)^0 × 1.1 × 2^(E−127), with E = 127.
Hence, s = 0, E = 127_10 = 01111111_2, 0.f = 0.100· · · =⇒ f = 10000000000000000000000
Bin: 0 01111111 10000000000000000000000
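This bit pattern can be verified with Python's standard `struct` module (`float32_bits` is an illustrative helper):

```python
import struct

def float32_bits(x: float) -> str:
    """IEEE 754 single-precision pattern as 'sign exponent fraction'."""
    (n,) = struct.unpack('>I', struct.pack('>f', x))  # raw 32-bit pattern
    b = format(n, '032b')
    return f'{b[0]} {b[1:9]} {b[9:]}'

print(float32_bits(1.5))  # 0 01111111 10000000000000000000000
```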

4.2.5 Coursework.
Find the sign, mantissa, biased exponent and give the single-precision representation of the numbers 2.0 and 0.5.
We have
2.0 = 2 = 2^1 = (−1)^0 × 1.0 × 2^(128−127),
0.5 = 1/2 = 2^−1 = (−1)^0 × 1.0 × 2^(126−127)

4.2.6 d) Double-precision format


Double precision consists of 64 bits, with 1 bit for the sign, 11 for the exponent, 52 for the significand, and a bias of 1023 (hidden-bit storage is used).
• Bit index: 63, 62, · · · , 0
• The exponent field 11111111111 is reserved for infinities and NaNs.
• The exponent field 00000000000 is reserved for zero and the subnormal numbers (it would otherwise correspond to the actual exponent −1023).

• Ranges and precisions (in decimal):

4.2.7 e) Extended precision
Apart from the basic floating-point formats (single and double precision), there exists a larger IEEE format called extended precision, which consists of 80 bits, with 1 bit for the sign, 15 for the exponent, 64 for the significand, and a bias of 16383. (Note, however, that numbers stored in extended precision do not use hidden-bit storage.)

4.3 Floating-point exceptions and special numbers


Extreme situations may arise in floating-point representation. An arithmetic exception arises when the result of a floating-point operation is unclear or undesirable. The IEEE 754 standard defines five types of floating-point exception that must be signaled when detected:

• Overflow. Values have grown too large to be represented as a float in the given format. More precisely, overflow occurs when the exponent is too large to be represented in the exponent field.

• Underflow. When the result of an operation is too small to be represented as a normalized float in its format, there is underflow. In that case the exponent is too small to be represented in the exponent field. Underflow is a less serious problem because it just denotes a loss of precision, and the result is closely approximated by zero.

• Invalid operation: when an operand is invalid for the operation about to be performed, so that the result is ill-defined, such as 0.0/0.0, the square root of a negative operand, or any operation with a signaling NaN (not a number) operand.

• Division by zero: when a finite nonzero float is divided by zero.

• Inexact calculation: when the result of a floating-point operation is not exact, i.e. the result was rounded.

Inexact floating-point numbers. In the decimal system, only rational numbers whose denominator can be factorized in terms of 2 and 5 (i.e., a/(2^n × 5^m)) terminate, while others do not.
Similarly, in the binary system only rational numbers whose denominator is a power of 2 terminate, while others do not.
Example: −1/3 = (0.0101010101 · · · )_2 = (−1)^1 × (1.01010101 · · · ) × 2^−2 = (−1)^1 × (1 + 0.01010101 · · · ) × 2^(125−127). The single-precision representation of −1/3 is then 1 01111101 01010101010101010101011.
Similarly, 1/10 = (0.00011001100110011 · · · )_2
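The non-terminating expansion of 1/10 shows up directly in Python's double-precision arithmetic when more digits are printed than the stored value actually carries:

```python
# 0.1 is stored as the nearest binary double, not exactly one tenth
print(f'{0.1:.20f}')       # 0.10000000000000000555
# the rounding errors surface in arithmetic
print(0.1 + 0.2 == 0.3)    # False
print(0.1 + 0.2)           # 0.30000000000000004
```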

5 Number rounding and accuracy requirements: Machine epsilon, round-off errors and loss of significance

Number rounding: direction, precision, significant figures and round-off error

5.1 Machine epsilon


Digital computers are fixed-precision devices, and the number of digits the device can manipulate depends on its hardware configuration.
Machine precision or machine epsilon is the smallest number, denoted ε_mach (or macheps or eps), such that the difference between 1 and 1 + ε_mach is nonzero, i.e., it is the smallest difference between two numbers that the computer can recognize or represent. It gives an upper bound on the relative error due to rounding in floating-point arithmetic.
On single-precision processors, machine epsilon is 2^−23 (approximately 10^−7), while in double precision it is 2^−52 (approximately 10^−16): these correspond to the IEEE mantissas f = 00000000000000000000000 and f = 00000000000000000000001, which differ in the last bit.
ε_mach is determined computationally by finding the smallest positive ε for which 1 + ε ≠ 1. For instance, if a particular computing device computes 1.000000001 for 1 + 10^−9 but 1 for 1 + 10^−10, then we conclude that 10^−10 ≤ ε_mach < 10^−9, and the device in this case would be known as a 10-significant-digit device.
Machine precision characterizes the accuracy of a floating-point system, and its value depends on the particular rounding being used.
For rounding to nearest we write ε_mach = 0.5 × B^(1−P), where P is the precision and B the base. In the above example B = 10 and P = 10 (digits of the number 1.000000001). Then, for rounding to nearest, ε_mach = 0.5 × 10^−9.
It is important since it bounds the relative error in representing any nonzero real number x within the normalized range of a floating-point system:

|x − x̄| / |x| ≤ ε_mach,

where x̄ is the floating-point value representing x.
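The computational determination of ε_mach described above can be sketched in Python:

```python
import sys

eps = 1.0
while 1.0 + eps / 2 != 1.0:   # halve until 1 + eps/2 is indistinguishable from 1
    eps /= 2

print(eps)  # 2.220446049250313e-16, i.e. 2**-52 in double precision
assert eps == sys.float_info.epsilon
```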

5.2 Roundoff errors and loss of significance
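Loss of significance arises when two nearly equal numbers are subtracted and their leading digits cancel. A sketch in Python, reusing the quadratic of Question 6 below; the rewritten formula is a standard remedy added here as an illustration, not taken from the notes:

```python
import math

a, b, c = 1.0, 200.0, -0.000015
d = math.sqrt(b * b - 4 * a * c)

# Naive formula: -b + d subtracts two nearly equal numbers
# (200.00000015 and 200), so leading digits cancel in the small root.
x_naive = (-b + d) / (2 * a)

# Algebraically equivalent form that avoids the subtraction.
x_stable = (2 * c) / (-b - d)

print(x_naive)
print(x_stable)
```

Comparing the printed values digit by digit shows the naive root losing significant digits to cancellation, which is the effect Question 6 asks you to quantify in 10-digit arithmetic.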

Problems
Question 1: Find the sign, mantissa, bias exponent and write the single-precision representation
of the decimal numbers: −1.5, 0.2 and 4.
Question 2: Find the sign, mantissa, bias exponent and write the single-precision representation
of the binary numbers: −0.1 and 0.00101.
Question 3: Write the binary normalized notation of the numbers and add them : 1.5 and −0.6375.
Question 4: Write the binary normalized notation of the numbers and multiply them : 12.0 and
−0.2375.
Question 5: On changing from IEEE single- to double-precision, how quantitatively do the numbers
of bits representing the mantissa and exponent change? Deduce whether the change prioritizes the
precision or the range of expressible numbers.
Question 6: Consider the quadratic equation ax^2 + bx + c = 0.
- Express the solutions in terms of a, b, c.
- In real arithmetic, compute the solutions for a = 1, b = 200 and c = −0.000015.
- In 10-digit floating-point arithmetic, compute the solutions for a = 1, b = 200 and c = −0.000015. Apply round-off where need be.
- Compare the floating-point and real results in terms of the number of correct significant digits.
- Compute the absolute and relative errors on the smallest solution (smallest in absolute value).


