Numerical Methods For Engineers ch1

Uploaded by

okeremoozcan

Chapter 1

INTRODUCTION TO PYTHON (NUMERICAL) AND FP ARITHMETIC

1
PYTHON NUMPY

NumPy
• Throughout this course, Python will be used as the programming language. Sample code will be given in Python (MATLAB is an alternative).
• Python is an object-oriented programming language. Its syntax is plain and reads much like a pseudocode description of the process.
• NumPy is a library written for numerical calculations based on array operations.
• SciPy is another library for scientific computing, which focuses on algorithms.

For NumPy examples you can follow CH4 of “Python for Data Analysis” by William Wesley McKinney (O'Reilly), 2012.

For computational aspects you can follow CH1 of Svein Linge, Hans Petter Langtangen, “Programming for Computations – Python: A Gentle Introduction to Numerical Simulations with Python 3.6”, Second Edition, Springer Open (2020).

For a fast start to Python
• To run the supplied codes you will need a Python installation; you can download and install Spyder (version 5.0 or later) or the Anaconda development environment.

• To modify the codes or to understand how they work, you should have a programming background. For a fast start you may consult the open-source book “Think Python”, Second Edition, by Allen B. Downey: https://greenteapress.com/wp/think-python-2e/

• You do not need to know programming for the exams; however, knowledge of pseudocode representation (programming logic) is required.

6
Python code structure
• Consider a mathematical model based on Newton's second law:

$y = v_0 t - \frac{1}{2} g t^2$

• An object thrown upwards at an initial speed $v_0$ will inevitably slow down until it reaches its maximum altitude.

7
Python code structure
• The output can be used in a modular structure by using a Python function.
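The code shown on the original slide is not recoverable, so the following is only a minimal sketch of how the model could be wrapped in a function. The name `height`, the value of `v0` and the default `g = 9.81` are illustrative assumptions.

```python
def height(t, v0, g=9.81):
    """Vertical position y = v0*t - 0.5*g*t**2 at time t (SI units)."""
    return v0 * t - 0.5 * g * t**2


v0 = 5.0                    # initial upward speed in m/s (illustrative value)
t_peak = v0 / 9.81          # time at which the upward speed drops to zero

print(height(0.6, v0))      # height after 0.6 s
print(height(t_peak, v0))   # maximum altitude, equal to v0**2 / (2*g)
```

Because the formula lives in one function, other scripts can import and reuse it instead of repeating the expression.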

8
Python code structure

[Figure: code example with the labels "Code library" and "library function"]

10
Python NumPy
• NumPy, short for Numerical Python, is the fundamental package required for high-performance scientific computing and data analysis.

• ndarray is a fast and memory-efficient structure for multi-dimensional computations.

• Standard mathematical functions are included (no need to design cumbersome loops).

• Linear algebra routines.

• Integration with other programming languages (C, C++ etc.).

11
Python NumPy:
The NumPy ndarray, Multidimensional Array Object

$arr2 = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \end{bmatrix}_{2 \times 4}$

$\dim(arr2) = 2$
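The slide's code is missing from the extracted text. A minimal sketch that builds the same 2×4 array and queries its dimension could look as follows; only standard NumPy attributes are used.

```python
import numpy as np

# The 2x4 array from the slide
arr2 = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8]])

print(arr2.ndim)    # number of dimensions (axes) of the array
print(arr2.shape)   # (rows, columns)
print(arr2.dtype)   # element type chosen by NumPy
print(arr2 * 2)     # arithmetic acts on every element, no explicit loop
```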

$arr2^{T} \times arr2$

$arr2 \cdot arr2$
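The two products above can be sketched in NumPy as below. It is an assumption here that the first expression denotes the matrix product and the second the element-by-element product; the original code is not available to confirm this.

```python
import numpy as np

arr2 = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8]])

# Matrix product of the transpose with the array: (4x2) times (2x4) gives 4x4
gram = arr2.T @ arr2

# Element-by-element product: the shape stays 2x4
elementwise = arr2 * arr2

print(gram.shape)
print(elementwise)
```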

Initial Placeholders:
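The slide content for this heading was an image. As a hedged sketch, these are some standard NumPy constructors commonly used to create placeholder arrays; the shapes and values are illustrative.

```python
import numpy as np

zeros = np.zeros((2, 3))          # 2x3 array filled with 0.0
ones = np.ones((2, 3))            # 2x3 array filled with 1.0
identity = np.eye(3)              # 3x3 identity matrix
filled = np.full((2, 2), 7.0)     # 2x2 array filled with a constant
steps = np.arange(0, 10, 2)       # 0, 2, 4, 6, 8 (stop value excluded)
grid = np.linspace(0.0, 1.0, 5)   # 5 equally spaced points including both ends

print(zeros.shape, identity.shape)
print(steps)
print(grid)
```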

$H_p = B \times H \times B^{-1}$

$H_p^{-1} \times H_p \to I$
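A small numerical check of these two relations, as a sketch: the matrices `H` and `B` below are hypothetical, chosen only so that both are invertible.

```python
import numpy as np

H = np.array([[2.0, 1.0],
              [1.0, 3.0]])      # hypothetical matrix
B = np.array([[1.0, 2.0],
              [0.0, 1.0]])      # hypothetical invertible matrix

Hp = B @ H @ np.linalg.inv(B)   # H_p = B H B^{-1}
check = np.linalg.inv(Hp) @ Hp  # should be the identity up to round-off

print(check)
print(np.allclose(check, np.eye(2)))
```

The product is compared with `np.allclose` rather than `==`, because floating point round-off usually leaves tiny non-zero residuals.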

Indexing of a 2D NumPy array:

Indexing and Slicing Operations:

• Select the 3rd row.
• Select the elements from k = 2 to the end (k = END) of the first column (column 0).
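The two selections can be sketched as follows. The 4×3 array `arr` is a hypothetical example, since the array used on the slide is not recoverable.

```python
import numpy as np

# Hypothetical 4x3 array: rows 0..3, columns 0..2
arr = np.arange(12).reshape(4, 3)

element = arr[1, 2]      # single element: row 1, column 2
third_row = arr[2]       # 3rd row (indices start at 0)
col_tail = arr[2:, 0]    # rows k = 2 to the end, taken from column 0

print(element)
print(third_row)
print(col_tail)
```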


Transposition

24
EVALUATING A POLYNOMIAL

25
Evaluating a polynomial
• Polynomials are the basic forms used to represent functions.

• We will usually consider polynomials, or polynomial representations of more complex functions.

• A real polynomial is a real-valued function with real coefficients that is built using only certain algebraic operations: addition, multiplication and exponentiation to non-negative integer exponents.

$P(x) = 2x^4 + 3x^2 + x - 1$

• How do you calculate $P(0.1)$?

26
Evaluating a polynomial
• How do you calculate $P(0.1)$?

$P(x) = 2x^4 + 3x^2 + x - 1$

One way is:

$P(0.1) = 2 \cdot 0.1 \cdot 0.1 \cdot 0.1 \cdot 0.1 + 3 \cdot 0.1 \cdot 0.1 + 0.1 - 1 = -\frac{4349}{5000}$

A total of 6 multiplications and 3 additions.
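The direct evaluation can be checked with exact rational arithmetic. This is a small sketch using Python's `fractions` module, so the result is not affected by floating point round-off.

```python
from fractions import Fraction

x = Fraction(1, 10)   # the exact rational number 0.1

# Direct evaluation: 6 multiplications and 3 additions
direct = 2 * x * x * x * x + 3 * x * x + x - 1

print(direct)         # -4349/5000
print(float(direct))
```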

27
Evaluating a polynomial
• First calculate the powers of 0.1:

$0.1 \cdot 0.1 = 0.1^2 \to$ store

$0.1^2 \cdot 0.1^2 = 0.1^4 \to$ store

Then:

$P(0.1) = 2 \cdot 0.1^4 + 3 \cdot 0.1^2 + 0.1 - 1$

A total of 2 multiplications and 3 additions here, plus the 2 multiplications used to compute the powers (4 multiplications in all).

28
Evaluating a polynomial
• Nested form (Horner's method)

$P(x) = 2x^4 + 3x^2 + x - 1$
$P(x) = (2x^3 + 3x + 1)x - 1$
$P(x) = ((2x^2 + 3)x + 1)x - 1$
$P(x) = (((2x + 0)x + 3)x + 1)x - 1$

• Then a polynomial in the form

$P(x) = c_0 + c_1 x + c_2 x^2 + \dots + c_n x^n$

can be represented as

$P(x) = c_0 + x(c_1 + x(c_2 + x(\dots + x(c_n)\dots)))$

29
Evaluating a polynomial
A general degree-$d$ polynomial in nested form can be evaluated with $d$ multiplications and $d$ additions.

$P(x) = c_0 + c_1 x + c_2 x^2 + \dots + c_d x^d$

30
Evaluating a polynomial
The nested evaluation in pseudocode, with $p$ initialized to 0 and $d$ the degree of the polynomial:

p ← 0
do for j ← (d, 0, -1)
    p ← p × x + c_j
end do
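The loop above translates almost line by line into Python. The function name `horner` and the coefficient ordering (lowest degree first) are choices made for this sketch.

```python
from fractions import Fraction


def horner(coeffs, x):
    """Evaluate c0 + c1*x + ... + cd*x**d, with coeffs = [c0, c1, ..., cd]."""
    p = 0
    for c in reversed(coeffs):   # j = d, d-1, ..., 0
        p = p * x + c            # one multiplication and one addition per step
    return p


# P(x) = 2x^4 + 0x^3 + 3x^2 + x - 1, coefficients from c0 to c4
coeffs = [-1, 1, 3, 0, 2]

print(horner(coeffs, Fraction(1, 10)))   # -4349/5000
print(horner(coeffs, 0.1))               # floating point evaluation
```

The first multiplication acts on `p = 0`, so the loop as written performs one multiplication more than the count for the nested form; starting from `p = c_d` would avoid it.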

31
Evaluating a polynomial
• Polynomials are constructed by choosing the coefficients from (preferably) the real numbers, $c_k \in \mathbb{R}$, and the powers of the variable, with integer exponents, from a basis set, (preferably) the monomial basis.

Coefficients: $c_0, c_1, c_2, \dots, c_{k-1}, c_k$
Monomial basis: $1, x, x^2, \dots, x^n$

$P(x) = \sum_{k=0}^{N} c_k x^k$

• If all $c_k$'s are obtained by algebraic operations ($\times, \div, +, -, \sqrt[k]{\cdot}, x^k$), then $P(x)$ is an algebraic polynomial. Some well-known real numbers cannot be obtained by algebraic operations.

32
Evaluating a polynomial
• $\pi$ is an irrational number that cannot be obtained by these operations. It is defined not by algebraic but by geometrical relationships.

• $e$ is an irrational number that requires an infinite sum to be calculated. Numbers such as $\pi$ and $e$ are called transcendental constants.

• $\sqrt{2} = 1.414213562\dots$ is an irrational number with infinitely many non-repeating digits after the decimal point. Unlike $\pi$ and $e$, it can be obtained by algebraic operations. Like every irrational number, it cannot be represented by a rational number (it is not a ratio of two integers in irreducible form).

• Whether transcendental or merely irrational, storing these numbers requires effort in computer representations.

33
Evaluating a polynomial
• Think about the computation of the polynomial: which stages should be completed in order to store the result in memory?

Step  Operation
1     Store c (all elements are int)
2     Store x and p (initial values)
3     Start the loop over j
4     Update p

34
Evaluating a polynomial
• A standard integer element of $k = 32$ bits (4 bytes) can store integers from $0$ to $2^k - 1$.

Bit weights: $2^{31}\ 2^{30}\ 2^{29}\ \dots\ 2^{2}\ 2^{1}\ 2^{0}$

$2^{32} - 1 = 4294967295$ (no sign bit)
$\pm(2^{31} - 1) = \pm 2147483647$ (with sign bit)
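These limits can be compared with the integer types that NumPy provides. Note, as a hedge, that NumPy's signed integers use two's complement rather than the sign-bit scheme sketched on the slide, so the most negative `int32` value is $-2^{31}$ instead of $-(2^{31} - 1)$.

```python
import numpy as np

k = 32
print(2**k - 1)          # largest value without a sign bit
print(2**(k - 1) - 1)    # largest value when one bit is used for the sign

signed = np.iinfo(np.int32)      # limits of a 32-bit signed integer
unsigned = np.iinfo(np.uint32)   # limits of a 32-bit unsigned integer

print(signed.min, signed.max)
print(unsigned.min, unsigned.max)
```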

35
Evaluating a polynomial
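Example
What are the smallest and largest numbers that can be stored in a $k = 5$ bit integer definition (one sign bit followed by the bit weights $2^3\ 2^2\ 2^1\ 2^0$)?

Largest: 0 1 1 1 1 → $+(1 \times 2^3 + 1 \times 2^2 + 1 \times 2^1 + 1 \times 2^0) = +15$

Smallest: 1 1 1 1 1 → $-(1 \times 2^3 + 1 \times 2^2 + 1 \times 2^1 + 1 \times 2^0) = -15$

$\pm(2^{k-1} - 1) = \pm 15$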

36
Evaluating a polynomial
• Result:
• We can store integers with arbitrary precision by increasing the memory field assigned to a single integer.
• For integer $x$, this polynomial can be stored in memory without any loss of precision (numerically exact representation). We will not lose any significant bits.

37
Evaluating a polynomial
• Think about

$P(x) = 1 + 5x + 0.2x^2 + \pi x^3$

• Some of the coefficients cannot be represented by rational numbers. Their expansions are infinitely long.
• $c_2 = 0.2$ is not an integer, but it is a rational number.
• How can we represent $c_2$ by using integers?
• One way is the fixed format.

38
Evaluating a polynomial
• Think about

$P(x) = 1 + 5x + 0.2x^2 + \pi x^3$

• One way is the fixed format:

0 0 0 0 1 . 0 0 0 1 0
(bit weights: ± $2^3\ 2^2\ 2^1\ 2^0$ . $2^4\ 2^3\ 2^2\ 2^1\ 2^0$)

• Since $c_2 = \frac{1}{5} = 0.2$:

0 0 0 0 1 / 0 0 1 0 1
(bit weights: ± $2^3\ 2^2\ 2^1\ 2^0$ / $2^4\ 2^3\ 2^2\ 2^1\ 2^0$)

• Very limited upper and lower bounds for general-purpose operations.
• If the scale of the problem intrinsically changes, then the boundaries are reached rapidly.

39
Evaluating a polynomial

𝜋 = 3.14159 26535 89793 23846 26433 83279 50288 41971 69399 37510 58209 74944
59230 78164 06286 20899 86280 34825 34211 70679 …

[Figure: the real line $x$ with the points 3 and $\pi$ marked]

40
Evaluating a polynomial
• Storing $\pi$ and $\sqrt{5}$ requires much more attention.
• Real numbers fill the real line in an uncountable manner.

real → count
3 → $q_0$
3.1 → $q_1$
3.14 → $q_2$
3.141 → $q_3$
3.1415 → $q_4$
3.1415…2643… → $q_\infty$

[Figure: the real line $x$ with 3, 3.1, 3.14, 3.141 and $\pi$ marked]

• There is no direct way to store irrational numbers in memory.

41
FLOATING POINT REPRESENTATION OF REAL NUMBERS

42
Floating point representation of real numbers
• Scientific notation and its binary version

Think about

$2.1503 = 2 \times 10^{0} + 1 \times 10^{-1} + 5 \times 10^{-2} + 0 \times 10^{-3} + 3 \times 10^{-4}$

A decimal number can be expressed as

$\dots d_3 d_2 d_1 d_0 . d_{-1} d_{-2} d_{-3} \dots$

Thus we use powers of 10 in the expansion, indicated by the subscript in $(\cdot)_{10}$. Similarly,

$(2.1503)_{10} = (21503)_{10} \times 10^{-4}$

43
Floating point representation of real numbers
• Scientific notation and its binary version

• The expansion in powers of 10, indicated by $(\cdot)_{10}$, is called the positional notation:

$(71)_{10} = 7 \times 10^{1} + 1 \times 10^{0}$

$(15.2)_{10} = 1 \times 10^{1} + 5 \times 10^{0} + 2 \times 10^{-1}$

• To add such numbers, we can shift the decimal point so that both terms share the same power of 10:

$(701.3)_{10} + (0.012)_{10} \times 10^{3} \to (0.7013)_{10} \times 10^{3} + (0.012)_{10} \times 10^{3}$

44
Floating point representation of real numbers
• Base 60 of the Babylonians

The notation rules are similar for other base systems (e.g. the Babylonians, around 2000 BC):

$\sqrt{2} \approx 1 + \frac{24}{60} + \frac{51}{60^2} + \frac{10}{60^3} = 1.41421296 = (1.\ 24\ 51\ 10)_{60}$

Source: https://en.wikipedia.org/wiki/Babylonian_cuneiform_numerals#/media/File:Babylonian_numerals.svg
In general, a base-60 number in positional notation is expanded as

$\dots h_3 h_2 h_1 h_0 . h_{-1} h_{-2} h_{-3} \dots = \dots + h_3 \times 60^{3} + h_2 \times 60^{2} + h_1 \times 60^{1} + h_0 \times 60^{0} + h_{-1} \times 60^{-1} + h_{-2} \times 60^{-2} + h_{-3} \times 60^{-3} + \dots$
Floating point representation of real numbers
• Positional notation and its binary version

For computer systems, binary representations are crucial. This is due to the electronic design of arithmetic units, in which binary arithmetic operations are easy to implement. A binary floating point number in positional notation is:

$\dots b_3 b_2 b_1 b_0 . b_{-1} b_{-2} b_{-3} \dots$

$= \dots + b_3 \times 2^{3} + b_2 \times 2^{2} + b_1 \times 2^{1} + b_0 \times 2^{0} + b_{-1} \times 2^{-1} + b_{-2} \times 2^{-2} + b_{-3} \times 2^{-3} + \dots$

Here each digit $b_i$ is an integer with $b_i \in \{0, 1\}$.

47
Floating point representation of real numbers
• Binary representation of floating point numbers

How do we calculate binary reals? Assume we have $(72.125)_{10}$.

Integer part (72), divide by 2:

num      Quotient  Remainder
72 ÷ 2   36        0
36 ÷ 2   18        0
18 ÷ 2   9         0
9 ÷ 2    4         1
4 ÷ 2    2         0
2 ÷ 2    1         0
1 ÷ 2    0         1

Reading the remainders from bottom to top gives $(1001000)_2$.

Fractional part (.125), multiply by 2:

num        fraction  Int
.125 × 2   0.25      0
.25 × 2    0.5       0
.5 × 2     0         1

Reading the integer parts from top to bottom gives $(.001)_2$.

Then $(72.125)_{10} = (1001000.001)_2$
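The two tables above can be sketched as one routine: repeated division by 2 for the integer part, repeated multiplication by 2 for the fractional part. The function name `to_binary` and the cap `max_frac_bits` are assumptions of this sketch, which handles non-negative inputs only.

```python
def to_binary(value, max_frac_bits=20):
    """Binary string of a non-negative number, e.g. 72.125 -> '1001000.001'."""
    n = int(value)
    frac = value - n

    # Integer part: divide by 2, collect remainders from last to first
    int_bits = ""
    while n > 0:
        n, remainder = divmod(n, 2)
        int_bits = str(remainder) + int_bits
    int_bits = int_bits or "0"

    # Fractional part: multiply by 2, collect integer parts in order
    frac_bits = ""
    while frac > 0 and len(frac_bits) < max_frac_bits:
        frac *= 2
        bit = int(frac)
        frac_bits += str(bit)
        frac -= bit

    return int_bits + "." + (frac_bits or "0")


print(to_binary(72.125))   # 1001000.001
```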

48
Floating point representation of real numbers
• Binary representation of floating point numbers

• Some decimal fractions do not have a terminating binary representation. In these cases the fractional part has repeating (non-terminating) binary digits.

Fractional part (0.1), multiply by 2:

num      fraction  Int
.1 × 2   0.2       0
.2 × 2   0.4       0
.4 × 2   0.8       0
.8 × 2   0.6       1
.6 × 2   0.2       1
.2 × 2   0.4       0   (the pattern repeats from here)
⋮        ⋮         ⋮

$(0.1)_{10} = (0.0\overline{0011})_2$
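The repeating pattern can be generated exactly with rational arithmetic. This is a sketch; the helper name `fraction_bits` is an assumption.

```python
from fractions import Fraction


def fraction_bits(frac, n_bits):
    """First n_bits binary digits of a fraction 0 <= frac < 1."""
    bits = []
    for _ in range(n_bits):
        frac *= 2
        bit = int(frac)    # integer part: 0 or 1
        bits.append(bit)
        frac -= bit
    return bits


bits = fraction_bits(Fraction(1, 10), 12)
print(bits)               # [0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1]

# Because 0.1 cannot be stored exactly, familiar identities can fail:
print(0.1 + 0.2 == 0.3)   # False
```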

49
Floating point representation of real numbers
• Binary representation of floating point numbers

Example
Find the first 4 bits in the binary representation of $\pi = 3.14159265\dots$

Integer part (3), divide by 2:

num     Quotient  Remainder
3 ÷ 2   1         1
1 ÷ 2   0         1

This gives $(11)_2$.

Fractional part (.14159265…), multiply by 2:

num             fraction  Int
.14159265 × 2   0.2831…   0
.2831… × 2      0.5663…   0
.5663… × 2      0.1327…   1

This gives $(.001)_2$.

Then, keeping the first 4 bits, $\pi \approx (11.00)_2$

50
Floating point representation of real numbers
• Binary representation of floating point numbers

Example

Find the binary representation of $(11.25)_{10}$.

integer part (11), divide by 2; fractional part (.25), multiply by 2

51
Floating point representation of real numbers
• Binary representation of floating point numbers

Example

Find the binary representation of $\left(\frac{3}{7}\right)_{10}$. Use algebraic operations on this rational number.

fractional part (3/7), multiply by 2

52
Floating point representation of real numbers
• Binary representation of floating point numbers

Example

Convert $(10101.1011)_2$ to decimal.

53
IEEE-754

54
Floating point representation in IEEE-754
• Before the 1980s, each company or facility worked with its own floating point (FP) format, which caused portability problems between systems.

• In the late 1970s, an IEEE committee began defining unified procedures for handling FP numbers in computer systems. This process was largely motivated by the electronics industry.

• The model we will describe here is the so-called IEEE-754 Floating Point Standard (IEEE: Institute of Electrical and Electronics Engineers).

• This representation is based on scientific notation.

55
Floating point representation in IEEE-754
• In exponential notation a real number $x$ is expressed as

$x = \pm S \times 10^{E}$

where $E$ is an integer chosen so that $1 \le S < 10$. Consequently, $x = 0.0024$ is represented as

$x = 2.4 \times 10^{-3}$

In base 2 nothing changes, and a real number is

$x = \pm S \times 2^{E}$

where $E$ is an integer chosen so that $1 \le S < 2$. Then the first bit of $S$ will always be 1, since $1 \le S$. The exponent $E$ is varied to satisfy this constraint.

56
Floating point representation in IEEE-754
The first bit of $S$ will always be 1, since $1 \le S$. The exponent $E$ is varied to satisfy this constraint.

$x = \pm S \times 2^{E}$

$x = \pm (b_0 . b_{-1} b_{-2} b_{-3} \dots)_2 \times 2^{E}$

where $b_0 = 1$. The adjustment of $E$ that makes $b_0 = 1$ is called normalization.

Example
Represent $(0.01011)_2$ in base 10:

$(0.01011)_2 = 0 \times 2^{-1} + 1 \times 2^{-2} + 0 \times 2^{-3} + 1 \times 2^{-4} + 1 \times 2^{-5} = 0.34375 = 3.4375 \times 10^{-1}$

57
Floating point representation in IEEE-754

Example
Normalize $x = (0.00101)_2$.

After normalization we should get $S = (1.01)_2$. The operations are as follows:

$(0.00101)_2 \times 2 \to (0.0101)_2$
$(0.0101)_2 \times 2 \to (0.101)_2$
$(0.101)_2 \times 2 \to (1.01)_2$

so that $x \times 2 \times 2 \times 2 = x \times 2^{3} = (1.01)_2$, and therefore

$x = (1.01)_2 \times 2^{-3}$
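Normalization can be sketched with the standard library function `math.frexp`, which returns a mantissa in $[0.5, 1)$; one extra factor of 2 moves it into the interval $[1, 2)$ used here. The helper name `normalize` is an assumption, and the sketch covers positive $x$ only.

```python
import math


def normalize(x):
    """Return (S, E) with x = S * 2**E and 1 <= S < 2, for x > 0."""
    m, e = math.frexp(x)   # x = m * 2**e with 0.5 <= m < 1
    return 2 * m, e - 1


x = 0b101 / 2**5           # (0.00101)_2 = 5/32 = 0.15625
S, E = normalize(x)

print(S, E)                # 1.25 -3, and 1.25 = (1.01)_2
```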

58
Floating point representation in IEEE-754
• In a computer, to store normalized numbers in memory, the word dedicated to a real number is divided into three fields. The 32-bit floating point layout is as follows:

$x = \pm (1 . b_{-1} b_{-2} b_{-3} \dots)_2 \times 2^{E}$

sign    exponent $E$    mantissa (significand) $S$
1 bit   8 bits          23 bits

59
Floating point representation in IEEE-754
• Computer implementations do not store the leftmost bit $b_0$, since it is always 1 (hidden bit). There are also predefined rules for the exponent field (extended exponent) and for the special cases of 0, ±infinity and subnormal numbers. We will omit these discussions and limit our concern to the basic rules.

60
Floating point representation in IEEE-754

Example
Find the normalized binary expression of $x = (71)_{10}$.

$x = (1000111)_2$, $S = (1 . b_{-1} b_{-2} b_{-3} \dots)_2$

Then the normalized form is:

$x = (1.000111)_2 \times 2^{6}$

This will be written in memory as (sign | exponent | mantissa):

0 | …… | 00011100000000000000000
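The stored fields can be inspected with the standard `struct` module. A hedged note: in actual IEEE-754 single precision the exponent field holds $E + 127$ (a bias that these slides leave out), so the sketch subtracts 127 to recover $E$. The helper name `float32_fields` is an assumption.

```python
import struct


def float32_fields(x):
    """Split the 32-bit IEEE-754 pattern of x into (sign, exponent, mantissa)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF        # 23 bits, hidden leading 1 not stored
    return sign, exponent, mantissa


sign, exponent, mantissa = float32_fields(71.0)

print(sign)
print(f"{exponent:08b}", exponent - 127)   # stored bits and the true exponent E
print(f"{mantissa:023b}")                  # 00011100000000000000000
```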

61
Floating point representation in IEEE-754
• A Toy System (8 bits):

$x = \pm (1 . b_{-1} b_{-2} b_{-3})_2 \times 2^{\pm E}$

sign of $x$ | sign of $E$ | $E$ (3 bits) | $b_{-1} b_{-2} b_{-3}$

• 1 bit for the sign of $x$
• 1 bit for the sign of $E$
• 3 bits for $E$
• 3 bits for $S$

62
Floating point representation in IEEE-754
• A Toy System:

Example
If the binary allocation of $x$ in memory is

0 | 1 | 1 0 1 | 0 0 1

what is the value of $x$ in base 10?

$x = \pm (1 . b_{-1} b_{-2} b_{-3})_2 \times 2^{\pm E} = +(1.001)_2 \times 2^{-(101)_2}$

$(x)_{10} = (1.001)_2 \times 2^{-5} = (1001)_2 \times 2^{-8} = 9 \times 2^{-8} = 0.03515625$
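A decoder for the toy format can be sketched as below. The function name `decode_toy` is an assumption, and the sketch ignores any special bit patterns (such as a dedicated encoding of 0).

```python
def decode_toy(bits):
    """Decode an 8-character bit string of the toy FP system.

    Layout: sign of x, sign of E, 3 bits of E, 3 mantissa bits.
    """
    sign = -1 if bits[0] == "1" else 1
    exp_sign = -1 if bits[1] == "1" else 1
    E = exp_sign * int(bits[2:5], 2)
    S = 1 + int(bits[5:8], 2) / 8     # hidden leading 1 plus the 3 stored bits
    return sign * S * 2.0**E


print(decode_toy("01101001"))   # 0.03515625
print(decode_toy("00111111"))   # 240.0, the largest representable value
print(decode_toy("01111000"))   # 0.0078125 = 2**-7, the smallest positive value
```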

63
Floating point representation in IEEE-754
• A Toy System:

Example
Find the maximum value of $x$ in base 10 that can be represented by our 8-bit toy FP system.

0 | 0 | 1 1 1 | 1 1 1

$x = \pm (1 . b_{-1} b_{-2} b_{-3})_2 \times 2^{\pm E} = +(1.111)_2 \times 2^{+(111)_2}$

$(x)_{10} = (1.111)_2 \times 2^{7} = (1111)_2 \times 2^{4} = 15 \times 2^{4} = 240$

64
Floating point representation in IEEE-754
• A Toy System:

Example
Find the minimum value of $x > 0$ in base 10 that can be represented by the 8-bit toy FP system.

0 | 1 | 1 1 1 | 0 0 0

$x = \pm (1 . b_{-1} b_{-2} b_{-3})_2 \times 2^{\pm E} = +(1.000)_2 \times 2^{-(111)_2}$

$(x)_{10} = (1.000)_2 \times 2^{-7} = 2^{-7} \approx 0.008$

No positive real number smaller than $2^{-7}$ can be represented by the toy (8-bit) system.

65
Floating point representation in IEEE-754
• We have seen that the exponential description of an FP number is obtained by finding normalized expressions for $S$ and $E$. This information reproduces FP numbers, however with errors.

• Consider the toy system: how can $x = 0.004$ be represented in memory?

[Figure: number line marking 0 (bit pattern 1 0000 000, the biased definition of 0), 0.008 (0 1111 000) and 1.0 (0 0000 000); the value 0.004 lies between 0 and 0.008]
66
Floating point representation in IEEE-754

• The value 0.004 falls between the representable numbers 0 (1 0000 000) and 0.008 (0 1111 000), so it must be rounded to one of them, with an error $\varepsilon$.
67
Floating point representation in IEEE-754

• Although there are infinitely many real numbers between two reals $p$ and $q$, a binary expression can represent only a finite number of them.

• Any number beyond the capacity of the binary expression must be rounded to a nearby value.

• The errors $\varepsilon$ caused by such operations are called quantizing errors.

• Two ways to handle this error: rounding or chopping.

68
APPROXIMATIONS AND ROUND -OFF ERRORS

69
Floating point representation in IEEE-754
• To remedy quantizing errors:

A Simple Rounding Rule
Round $\pi = 3.14159265\dots$ to 3 digits after the decimal point:
_____________________________________
If the 4th digit $d_{-4} \ge 5$:
    Then $d_{-3} = d_{-3} + 1$, and apply chopping at $d_{-4}$
Else:
    apply chopping at $d_{-4}$
_____________________________________
Thus, $\pi \approx 3.142$

Chopping Operation
Chop $\pi = 3.14159265\dots$ to 3 digits after the decimal point:
_____________________________________
Apply chopping beginning from $d_{-4}$:
If $j > 3$:
    Then set $d_{-j} = 0$
_____________________________________
Thus, $\pi \approx 3.141$
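Both rules can be reproduced with Python's `decimal` module, which works on decimal digits directly. In this sketch `ROUND_HALF_UP` plays the role of the simple rounding rule and `ROUND_DOWN` the role of chopping.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

pi_digits = Decimal("3.14159265")
step = Decimal("0.001")    # keep 3 digits after the decimal point

rounded = pi_digits.quantize(step, rounding=ROUND_HALF_UP)
chopped = pi_digits.quantize(step, rounding=ROUND_DOWN)

print(rounded)   # 3.142
print(chopped)   # 3.141
```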

70
Floating point representation in IEEE-754
• To remedy quantizing errors for binary numbers:

Round the 21-bit real $x$ = 0 0001 0011110010010100 into our 8-bit toy system.

A Simple Rounding Rule
Assumption: we preserve the exponent part and round the mantissa $S$ = 001|1110010010100 from the 4th binary digit position. Since $b_{-4} = 1$,

$b_{-3} \leftarrow b_{-3} + 1$

Then $b_{-3} = 2$, which produces a carry that is delivered to $b_{-2}$. Thus the new 3-bit mantissa is $S$ = 010, and $x$ is:

$x$ = 0 0001 010

Chopping Operation
Assumption: we preserve the exponent part and chop the mantissa $S$ = 001|1110010010100 beyond the third binary digit. Then $S$ = 001, and $x$ is:

$x$ = 0 0001 001

71
Floating point representation in IEEE-754
• To remedy quantizing errors for binary numbers:

$x$ = 0 0001 0011110010010100
$x$ = 0 0001 010 (rounding, not IEEE rounding)
$x$ = 0 0001 001 (chopping)

The fractional parts of the mantissa are:

$S = \frac{1}{2^3} + \frac{1}{2^4} + \frac{1}{2^5} + \frac{1}{2^6} + \frac{1}{2^9} + \frac{1}{2^{12}} + \frac{1}{2^{14}} = 0.23663330078125$

$S_{round} = \frac{1}{2^2} = 0.25$

$S_{chop} = \frac{1}{2^3} = 0.125$

72
Floating point representation in IEEE-754
• Machine epsilon ($\epsilon$) is a measure of the precision of an FP number system.

• $\epsilon$ is the distance between 1 and the next larger floating point number. For the toy system:

$\epsilon = \frac{1}{2^3}$

0 0000 000 → 1.0
0 0000 001 → $1 + \epsilon$
73
Floating point representation in IEEE-754
• What is the measure of the precision of a 32-bit FP number with 1 sign bit, 8 exponent bits and a 23-bit mantissa?

• Since the mantissa has $m = 23$ bits, epsilon equals the contribution of the last mantissa bit, $\epsilon = 2^{-23} \approx 1.2 \times 10^{-7}$. In general,

$\epsilon = 2^{-m}$

0 00000000 0…0 → 1.0
0 00000000 0…01 → $1 + \epsilon$
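Machine epsilon can be found experimentally by halving a candidate until adding half of it to 1 no longer changes the result. This is a sketch; the function name `machine_epsilon` is an assumption, and `np.finfo` is NumPy's own report of the same quantity.

```python
import numpy as np


def machine_epsilon(dtype):
    """Distance between 1 and the next larger number of the given float type."""
    one = dtype(1)
    eps = dtype(1)
    while one + eps / dtype(2) > one:   # is 1 + eps/2 still distinguishable from 1?
        eps = eps / dtype(2)
    return eps


print(machine_epsilon(np.float32), np.finfo(np.float32).eps)   # 2**-23
print(machine_epsilon(np.float64), np.finfo(np.float64).eps)   # 2**-52
```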

74
Floating point representation in IEEE-754
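• 64-bit Floating Point Arithmetic in Computer
Example
In computer arithmetic, the real number $x$ is replaced by $fl_{round}(x)$ in memory. Then what is $fl_{round}(2.4)$? As discussed above, 2.4 has integer and fractional parts:

$2 = (10.0)_2$
$0.4 = (0.\overline{0110})_2$
$2.4 = (10.\overline{0110})_2 = (1.0\overline{0110})_2 \times 2^{1}$

The 52 mantissa bits, followed after the bar by the rounding bit (bit 53) and the remaining bits:

+1. 00 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 11 | 00 1100 …

The rounding bit is zero, so nothing is added to the 52nd position. The discarded tail is $(0.\overline{0011})_2 \times 2^{-52} \times 2^{1} = 0.2 \times 2^{-51}$, so $fl_{round}(2.4) = 2.4 - 0.2 \times 2^{-51}$.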
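The stored value of 2.4 can be examined exactly, as a sketch: `Fraction(2.4)` converts the double precision number to an exact rational, so the difference from the true value 12/5 is free of further rounding.

```python
from fractions import Fraction

stored = Fraction(2.4)      # exact value of the double nearest to 2.4
exact = Fraction(12, 5)     # the real number 2.4
error = stored - exact

print((2.4).hex())          # 0x1.3333333333333p+1, i.e. mantissa bits 0011 0011 ...
print(error == -Fraction(1, 5) / 2**51)   # True: the error is -0.2 * 2**-51
print(float(error))
```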

75
64-bit Floating Point Rounding To The Nearest Rule for IEEE-754:

• For double precision, if the 53rd bit to the right of the binary point is 0, round down (apply chopping after the 52nd bit):

1.00…011|0111… (53rd bit is 0: apply chopping)

• If the 53rd bit is 1, round up (add 1 to the 52nd bit), unless all bits to the right of the 53rd bit are 0. In that tie case, add 1 to the 52nd bit if and only if the 52nd bit is 1:

1.00…1000|10000… (tie, 52nd bit is 0: apply chopping)

1.00…1001|10000… (tie, 52nd bit is 1: round up the 52nd bit)
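The cases of the rule can be observed near 1.0. In this sketch the exact values are built as `Fraction` objects and then converted with `float`, which in CPython rounds to the nearest double; the variable names are illustrative.

```python
from fractions import Fraction

ulp = Fraction(1, 2**52)              # weight of the 52nd mantissa bit at 1.0

below = 1 + ulp * Fraction(3, 8)      # bit 53 is 0: chopped
tie_even = 1 + ulp / 2                # tie, bit 52 is 0: chopped
tie_odd = 1 + ulp + ulp / 2           # tie, bit 52 is 1: rounded up
above = 1 + ulp / 2 + ulp / 256       # bit 53 is 1 with more bits after it: rounded up

print(float(below) == 1.0)
print(float(tie_even) == 1.0)
print(float(tie_odd) == 1.0 + 2 * 2.0**-52)
print(float(above) == 1.0 + 2.0**-52)
```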

76
77
64bit Floating Point Rounding To The Nearest Rule for IEEE-754:
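• 64-bit Floating Point Arithmetic in Computer IEEE-754
Example
For which positive integers $k$ can the number $5 + 2^{-k}$ be represented exactly in 64 bits (with no rounding error)?

$5 = (101.0)_2 = (1.01)_2 \times 2^{2}$ (normalized)

$2^{-k} = (0.\underbrace{00 \dots 0}_{k-1\ \text{zeros}} 1)_2 = (0.\underbrace{00 \dots 0}_{k+1\ \text{zeros}} 1)_2 \times 2^{2}$

$5 + 2^{-k} = (1.01\,0 \dots 0\,1)_2 \times 2^{2}$, where the last 1 is at position $k + 2$ after the binary point.

The mantissa $S$ has 52 bits, so the last 1 must satisfy $(k + 1) + 1 \le 52$. The number is therefore exact for every $k \le 50$, and $k = 50$ is the largest such integer.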

78

79
64bit Floating Point Rounding To The Nearest Rule for IEEE-754:
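• 64-bit Floating Point Arithmetic in Computer IEEE-754
Example
Find the largest integer $k$ for which $fl_{round}(19 + 2^{-k}) > fl_{round}(19)$ in double precision format.

$19 = (10011.0)_2 = (1.0011)_2 \times 2^{4}$ (normalized)

$2^{-k} = (0.\underbrace{00 \dots 0}_{k-1\ \text{zeros}} 1)_2 = (0.\underbrace{00 \dots 0}_{k+3\ \text{zeros}} 1)_2 \times 2^{4}$

$19 + 2^{-k} = (1.0011\,0 \dots 0\,1)_2 \times 2^{4}$, where the last 1 is at position $k + 4$ after the binary point.

The mantissa $S$ has 52 bits, so the last 1 is kept when $(k + 3) + 1 = 52 \to k = 48$. For $k = 49$ the last 1 falls on the 53rd bit with only zeros after it (a tie) and the 52nd bit is 0, so the sum is rounded down to 19.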
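The answer can be confirmed by brute force in double precision. This is a sketch; the function name `largest_visible_k` and the search limit `k_max` are assumptions.

```python
def largest_visible_k(base, k_max=200):
    """Largest k in 1..k_max for which base + 2**-k is still greater than base."""
    largest = None
    for k in range(1, k_max + 1):
        if base + 2.0**-k > base:   # evaluated in 64-bit floating point
            largest = k
    return largest


print(largest_visible_k(19.0))   # 48
print(largest_visible_k(5.0))    # 50
```

The same search applied to 5 returns 50, the largest $k$ for which $5 + 2^{-k}$ is stored without rounding error.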

80

81
64bit Floating Point Rounding To The Nearest Rule for IEEE-754:
Problem
Find the integer $k$ which maximizes the value of $fl_{chop}(2^{k} + 2^{-k})$ for a $q$-bit mantissa in normalized format:

$x = (1 . b_{-1} b_{-2} b_{-3} \dots b_{-q})_2 \times 2^{E}$

82
MEASURING ERRORS

83
Floating point Errors in IEEE-754
• Let $\hat{x}$ be a computed quantity of $x$. Then,

True Absolute Error: $\varepsilon_{T,abs} = \hat{x} - x$

True Relative Error: $\varepsilon_{T,rel} = \frac{\hat{x} - x}{x}$

if the quantity exists (that is, $x \ne 0$). If the exact quantity $x$ is not known, then, as an approximation in iterative algorithms, the current estimate can be used for the ratio:

Approximate Relative Error: $\hat{\varepsilon}_{rel} = \frac{\hat{x}_{current} - \hat{x}_{prev}}{\hat{x}_{current}}$
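The three error measures can be sketched as small functions. The function names and the sample values (3.142 as an approximation of $\pi$, and two successive estimates 1.5 and 1.4167) are illustrative assumptions.

```python
import math


def true_abs_error(x_hat, x):
    return x_hat - x


def true_rel_error(x_hat, x):
    return (x_hat - x) / x          # defined only for x != 0


def approx_rel_error(x_current, x_prev):
    return (x_current - x_prev) / x_current


print(true_abs_error(3.142, math.pi))
print(true_rel_error(3.142, math.pi))
print(approx_rel_error(1.4167, 1.5))   # change between two successive estimates
```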

84
Floating point Errors in IEEE-754
• Let $fl(x)$ be the IEEE machine arithmetic model of $x$. Then the relative rounding error of $x$ is no more than one-half machine epsilon:

Relative Rounding Error: $\frac{|fl_{round}(x) - x|}{|x|} \le \frac{1}{2}\epsilon_{mach}$

Relative Chopping Error: $\frac{|fl_{chop}(x) - x|}{|x|} \le \epsilon_{mach}$

85
Floating point Errors in IEEE-754
• Let $fl(x)$ be the IEEE machine arithmetic model of $x$.

Example
Calculate $fl(x) - x$ for $x = 10/3$ and check the relative rounding error for $x$.
____________________________________________________________________________________

Integer part: $3 = (11)_2$

Fractional part (1/3), multiply by 2:

num       fraction  Int
1/3 × 2   2/3       0
2/3 × 2   1/3       1
1/3 × 2   2/3       0
2/3 × 2   1/3       1
…         …         …

Fractional part: $1/3 = (0.\overline{01})_2$

$x = (11.\overline{01})_2 = (1.1\overline{01})_2 \times 2^{1}$

86
Floating point Errors in IEEE-754
Example
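Let $fl(x)$ be the (round to nearest) IEEE machine arithmetic model of $x$. Calculate $fl(x) - x$ for $x = 10/3$ and check the relative rounding error for $x$ in a 64-bit representation.
____________________________________________________________________________________
$x = (11.\overline{01})_2 = (1.1\overline{01})_2 \times 2^{1}$ (normalized)

$S$ = 1. 1 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 0 | 1 01 …
(52 mantissa bits before the bar)

The 53rd bit is 1 and the bits after it are not all zero, so round to nearest adds 1 to the 52nd bit. The discarded tail is $(0.1\overline{01})_2 \times 2^{-52} = \frac{2}{3} \times 2^{-52}$, and rounding up adds $2^{-52}$, so

$fl(x) - x = \left(1 - \frac{2}{3}\right) \times 2^{-52} \times 2^{1} = \frac{1}{3} \times 2^{-51}$

The relative rounding error is:

$\frac{|fl(x) - x|}{|x|} = \frac{1}{3} \times 2^{-51} \times \frac{3}{10} = 0.4 \times 2^{-53} \le 2^{-53} = \frac{1}{2}\epsilon_{mach}$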
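The rounding error of 10/3 in double precision can be computed exactly, as a sketch: `Fraction(10 / 3)` gives the exact value of the stored double, which is then compared with the rational number 10/3.

```python
from fractions import Fraction

x = Fraction(10, 3)          # the exact real number
fl_x = Fraction(10 / 3)      # exact value of the double stored for 10/3

error = fl_x - x
rel_error = abs(error) / x

print(error == Fraction(1, 3) / 2**51)       # True: fl(x) lies above x
print(rel_error == Fraction(2, 5) / 2**53)   # True: 0.4 * 2**-53
print(rel_error <= Fraction(1, 2**53))       # True: within half machine epsilon
```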

87
