Numerical Methods For Engineers ch1

Uploaded by

okeremoozcan


Chapter 1

INTRODUCTION TO PYTHON (NUMERICAL) AND FP ARITHMETIC

1
PYTHON NUMPY

4
NumPy
• Throughout this course Python will be used as the programming language. Sample codes will be given in Python (another alternative is MATLAB).
• Python is an object-based programming language. It is very plain and reads much like a pseudocode description of the process.
• NumPy is a library written for numerical, array-based calculations.
• SciPy is another library for scientific computing, which focuses on algorithms.

For NumPy examples you can follow CH4 of "Python for Data Analysis" by William Wesley McKinney (O'Reilly), 2012.

For computational aspects you can follow CH1 of Svein Linge, Hans Petter Langtangen, "Programming for Computations – Python: A Gentle Introduction to Numerical Simulations with Python 3.6", Second Edition, Springer Open, 2020.

5
For a fast start to python
• To run the supplied codes you will need a Python installation; you can download and install Spyder (v > 5.0) or the Anaconda development environment.

• To modify the codes, or to understand how they work, you should have a programming background. For a fast start you may look at the open source book ''Think Python'', First Edition, by Allen B. Downey: https://greenteapress.com/wp/think-python-2e/

• You do not need to know programming for the exams; however, a knowledge of pseudocode representation (programming logic) is required.

6
Python code structure
• Consider a mathematical model based on Newton's second law:

$y = v_0 t - \frac{1}{2} g t^2$

• An object thrown upwards at an initial speed of $v_0$ will inevitably slow down until it reaches a maximum altitude.

7
Python code structure
• Consider a mathematical model based on Newton's second law:

$y = v_0 t - \frac{1}{2} g t^2$

• The output can be used in a modular structure by using a Python function.
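A minimal sketch of how this model could be wrapped in a reusable Python function. The function name `height`, the value g = 9.81 m/s² and the sample inputs are assumptions chosen for illustration.

```python
def height(t, v0, g=9.81):
    """Vertical position y = v0*t - 0.5*g*t**2 of an object thrown upwards."""
    return v0 * t - 0.5 * g * t**2


v0 = 5.0   # initial speed in m/s (illustrative value)
t = 0.6    # time in s (illustrative value)
print(f"y({t}) = {height(t, v0):.4f} m")

# The object stops rising when its speed v0 - g*t reaches zero.
t_max = v0 / 9.81
y_max = height(t_max, v0)
print(f"maximum altitude {y_max:.4f} m at t = {t_max:.4f} s")
```

Keeping the formula inside a function lets other scripts import and reuse it instead of copying the expression.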

8
Python code structure

Code library

library function

9
Python code structure
Code

10
Python NumPy
• NumPy, short for Numerical Python, is the fundamental package required for high performance scientific computing and data analysis.

• ndarray is a fast and memory-efficient structure for multi-dimensional computations.

• Standard mathematical functions are included (no need to design cumbersome loops).

• Linear algebra routines.

• Integration with other programming languages (C, C++ etc.).

11
Python NumPy:
The NumPy ndarray, Multidimensional Array Object

Code

12
Python NumPy:
The NumPy ndarray, Multidimensional Array Object

Code

$arr2 = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \end{bmatrix}_{2 \times 4}$

$\dim(arr2) = 2$
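A short sketch of how the array above could be created and inspected with NumPy; only the name `arr2` comes from the slide.

```python
import numpy as np

# The 2x4 array from the slide.
arr2 = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8]])

print(arr2.shape)  # (2, 4): two rows, four columns
print(arr2.ndim)   # 2: number of dimensions (axes)
print(arr2.dtype)  # integer type chosen by NumPy for these values
```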

13
Python NumPy:
The NumPy ndarray, Multidimensional Array Object

Code

$arr2^{T} \times arr2$

$arr2 \cdot arr2$
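A hedged sketch of the two products, assuming the first expression is the matrix product of the transpose with the array and the second is the element-wise product. The names `gram` and `elementwise` are illustrative.

```python
import numpy as np

arr2 = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8]])

# Matrix product: (4x2) @ (2x4) gives a 4x4 result.
gram = arr2.T @ arr2

# Element-wise product: each entry is multiplied by itself.
elementwise = arr2 * arr2

print(gram.shape)
print(elementwise)
```

Note that `arr2 @ arr2` would raise an error, because a 2×4 matrix cannot be matrix-multiplied by another 2×4 matrix.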

14
Python NumPy:
The NumPy ndarray, Multidimensional Array Object

15
Python NumPy:
The NumPy ndarray, Multidimensional Array Object

16
Python NumPy:
The NumPy ndarray, Multidimensional Array Object
Initial Placeholders:

17
Python NumPy:
The NumPy ndarray, Multidimensional Array Object
Initial Placeholders:
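A sketch of common NumPy constructors that create arrays with initial placeholder content. The selection of functions and all variable names are assumptions, since the original listing is not reproduced here.

```python
import numpy as np

zeros = np.zeros((2, 3))          # 2x3 array filled with 0.0
ones = np.ones((2, 3))            # 2x3 array filled with 1.0
sevens = np.full((2, 2), 7.0)     # 2x2 array filled with a constant
identity = np.eye(3)              # 3x3 identity matrix
scratch = np.empty((2, 2))        # uninitialized memory, contents arbitrary
steps = np.arange(0, 10, 2)       # evenly spaced integers: start, stop, step
grid = np.linspace(0.0, 1.0, 5)   # 5 evenly spaced points including both ends

print(steps)
print(grid)
```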

18
Python NumPy:
The NumPy ndarray, Multidimensional Array Object

19
Python NumPy:
The NumPy ndarray, Multidimensional Array Object

$\mathbf{H}_p = \mathbf{B} \times \mathbf{H} \times \mathbf{B}^{-1}$

$\mathbf{H}_p^{-1} \times \mathbf{H}_p \to \mathbf{I}$
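A small sketch of these two matrix relations using `np.linalg.inv`. The entries of `H` and `B` are made-up illustrative values; any invertible matrices would do.

```python
import numpy as np

H = np.array([[2.0, 1.0],
              [1.0, 3.0]])   # illustrative invertible matrix
B = np.array([[1.0, 2.0],
              [0.0, 1.0]])   # illustrative invertible matrix

# H_p = B H B^{-1}
Hp = B @ H @ np.linalg.inv(B)

# H_p^{-1} H_p should reproduce the identity up to round-off.
product = np.linalg.inv(Hp) @ Hp
print(product)
print(np.allclose(product, np.eye(2)))
```

The comparison uses `np.allclose` rather than `==` because the computed product only approaches the identity within floating point round-off.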

20
Python NumPy:
The NumPy ndarray, Multidimensional Array Object
Indexing of a 2D NumPy array:

21
Python NumPy:
The NumPy ndarray, Multidimensional Array Object
Indexing and Slicing Operations:

22
Python NumPy:
The NumPy ndarray, Multidimensional Array Object
Indexing and Slicing Operations:

Select the 3rd row.

Select elements from k=2 to k=END of the first column (0).
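A sketch of the indexing and slicing operations named above, applied to a hypothetical 4×3 array `a` (the array on the slide is not reproduced here).

```python
import numpy as np

# Hypothetical 4x3 array holding the integers 1..12.
a = np.arange(1, 13).reshape(4, 3)

element = a[1, 2]            # single element: row 1, column 2 (zero-based)
third_row = a[2]             # the 3rd row (index 2)
col0_from_2 = a[2:, 0]       # rows k=2 to the end, first column (0)
transposed = a.T             # transposition swaps rows and columns

print(third_row)
print(col0_from_2)
print(transposed.shape)
```

Indices are zero-based, so the 3rd row is `a[2]`, and a slice `2:` runs from index 2 to the end.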

23
Python NumPy:
The NumPy ndarray, Multidimensional Array Object
Indexing and Slicing Operations:

Transposition

24
EVALUATING A POLYNOMIAL

25
Evaluating a polynomial
• Polynomials are the basic forms used to represent functions.

• We will usually consider polynomials, or polynomial representations of more complex functions.

• A real polynomial is a real-valued function with real coefficients that is built using only certain algebraic operations: addition, multiplication and exponentiation to a non-negative integer exponent.

$P(x) = 2x^4 + 3x^2 + x - 1$

• How do you calculate $P(0.1)$?

26
Evaluating a polynomial
• How do you calculate $P(0.1)$?

$P(x) = 2x^4 + 3x^2 + x - 1$

One way is:

$P(0.1) = 2 \cdot 0.1 \cdot 0.1 \cdot 0.1 \cdot 0.1 + 3 \cdot 0.1 \cdot 0.1 + 0.1 - 1 = -\frac{4349}{5000}$

A total of 6 multiplications and 3 additions.

27
Evaluating a polynomial
• First calculate the powers of 0.1

$0.1 \cdot 0.1 = 0.1^2 \to store$

$0.1^2 \cdot 0.1^2 = 0.1^4 \to store$

Then:

$P(0.1) = 2 \cdot 0.1^4 + 3 \cdot 0.1^2 + 0.1 - 1$

A total of 2 multiplications and 3 additions, plus the 2 additional multiplications used to form the powers (4 multiplications in all).

28
Evaluating a polynomial
• Nested form (Horner's)

$P(x) = 2x^4 + 3x^2 + x - 1$
$P(x) = (2x^3 + 3x + 1)x - 1$
$P(x) = ((2x^2 + 3)x + 1)x - 1$
$P(x) = ((((2 + 0)x + 0)x + 3)x + 1)x - 1$

• Then a polynomial in the form

$P(x) = c_0 + c_1 x + c_2 x^2 + \cdots + c_n x^n$

can be represented as

$P(x) = c_0 + x(c_1 + x(c_2 + x(\cdots + x(c_{n-1} + x\,c_n)\cdots)))$

29
Evaluating a polynomial
• Then a polynomial in the form

$P(x) = c_0 + c_1 x + c_2 x^2 + \cdots + c_n x^n$

can be represented as

$P(x) = c_0 + x(c_1 + x(c_2 + x(\cdots + x(c_{n-1} + x\,c_n)\cdots)))$

A general degree $d$ (nested) polynomial can be evaluated in $d$ multiplications and $d$ additions.

$P(x) = c_0 + c_1 x + c_2 x^2 + \cdots + c_d x^d$

30
Evaluating a polynomial
A general degree $d$ (nested) polynomial can be evaluated in $d$ multiplications and $d$ additions.

$P(x) = c_0 + c_1 x + c_2 x^2 + \cdots + c_d x^d$

p ← 0
do for j ← d, d−1, …, 0
    p ← p × x + c_j
end do
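The nested evaluation can be sketched in Python as follows. The function name `horner` and the coefficient ordering (lowest degree first) are assumptions made for this illustration; the loop starts from the leading coefficient so that exactly $d$ multiplications and $d$ additions are used.

```python
from fractions import Fraction


def horner(c, x):
    """Evaluate c[0] + c[1]*x + ... + c[d]*x**d in nested (Horner) form."""
    p = c[-1]                      # start with the leading coefficient c_d
    for cj in reversed(c[:-1]):    # c_{d-1}, ..., c_1, c_0
        p = p * x + cj             # one multiplication and one addition
    return p


# P(x) = 2x^4 + 3x^2 + x - 1, coefficients ordered c_0 ... c_4.
c = [-1, 1, 3, 0, 2]

print(horner(c, Fraction(1, 10)))  # exact rational arithmetic: -4349/5000
print(horner(c, 0.1))              # floating point evaluation
```

Using `Fraction` reproduces the exact value $-\frac{4349}{5000}$, while the float call shows what a machine number computation returns.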

31
Evaluating a polynomial
• Polynomials are constructed by choosing the coefficients from (preferably) a real set, $c_k \in \Re$, and variable representations with integer exponents from a basis set, (preferably) the monomial basis.

Coefficients: $c_0, c_1, c_2, \ldots, c_{k-1}, c_k$
Monomial basis: $1, x, x^2, \ldots, x^n$

$P(x) = \sum_{k=0}^{N} c_k x^k$

do for j ← (n,0,-1)
    p ← p × x + c_j
end do

• If all $c_k$'s are obtained by algebraic operations ($\times, \div, +, -, \sqrt[k]{\ }, x^k$), then $P(x)$ is an algebraic polynomial. Some well-known real numbers cannot be obtained by algebraic operations.

32
Evaluating a polynomial
• If all $c_k$'s are obtained by algebraic operations ($\times, \div, +, -, \sqrt[k]{\ }, x^k$), then $P(x)$ is an algebraic polynomial. Some well-known real numbers cannot be obtained by algebraic operations.

• $\pi$ is an irrational number that cannot be obtained by these operations. This number is defined not by algebraic, but by geometrical relationships.

• $e$ is an irrational number that requires an infinite sum to be calculated. Numbers such as $\pi$ and $e$ are called transcendental constants.

• $\sqrt{2} = 1.414213562\ldots$ is an irrational number that has infinitely many decimals beyond the point. However, such irrational numbers can be obtained by algebraic operations. On the other hand, they cannot be represented by any rational number (they are not the division of two integers in irreducible form).

• Transcendental or irrational, storing these numbers requires effort in computer representations.

33
Evaluating a polynomial
• Think about the computation of the polynomial below: which stages should be completed in order to store the result in memory?

Step  Operations
1     Store c, all elements are int
2     Store x and p (initial values)
3     Start loop over j
4     Update p

34
Evaluating a polynomial
• Think about the computation of the polynomial below: which stages should be completed in order to store the result in memory?

Step  Operations
1     Store c, all elements are int
2     Store x and p (initial values)
3     Start loop over j
4     Update p

A standard integer element of k = 32 bits (4 bytes) can store integers from 0 to $2^k - 1$.

Bit weights: $2^{31}\ 2^{30}\ 2^{29}\ \ldots\ 2^{2}\ 2^{1}\ 2^{0}$

$2^{32} - 1 = 4294967295$
$\pm(2^{31} - 1) = \pm 2147483647$ (with sign bit)
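These limits can be checked with NumPy's `np.iinfo`; a brief sketch. Note that NumPy's signed integers use two's complement, so the smallest `int32` is $-2^{31}$ rather than the $-(2^{31} - 1)$ of the sign-bit layout described above.

```python
import numpy as np

print(np.iinfo(np.uint32).max)   # largest unsigned 32-bit integer, 2**32 - 1
print(np.iinfo(np.int32).max)    # largest signed 32-bit integer, 2**31 - 1
print(np.iinfo(np.int32).min)    # smallest signed 32-bit integer, -2**31

# Built-in Python integers are not limited to 32 bits.
big = 2**100
print(big)
```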

35
Evaluating a polynomial
Example
What are the smallest and largest numbers that can be stored in a $k = 5$ bit integer definition?

Bit layout: ± | $2^3$ $2^2$ $2^1$ $2^0$

0 1 1 1 1 → $+(1 \times 2^3 + 1 \times 2^2 + 1 \times 2^1 + 1 \times 2^0) = +15$
1 1 1 1 1 → $-(1 \times 2^3 + 1 \times 2^2 + 1 \times 2^1 + 1 \times 2^0) = -15$

$\pm(2^{k-1} - 1) = \pm 15$

36
Evaluating a polynomial
• Result:
• We can store integers with arbitrary precision by increasing the memory field assigned to a single integer.

Step  Operations
1     Store c, all elements are int
2     Store x and p (initial values)
3     Start loop over j
4     Update p

This polynomial can be stored in memory without any loss of precision (numerically exact representation) for integer x. We will not lose any significant bits.

37
Evaluating a polynomial
• Think about

$P(x) = 1 + 5x + 0.2x^2 + \pi x^3$

Step  Operations
1     Store c, all elements are int
2     Store x and p (initial values)
3     Start loop over j
4     Update p

Some of the coefficients cannot be represented by rational numbers. They are infinitely long.
$c_2 = 0.2$ is not an integer, but a rational number.

• How can we represent $c_2$ by using integers?

• One way is the fixed format.

38
Evaluating a polynomial
• Think about

$P(x) = 1 + 5x + 0.2x^2 + \pi x^3$

• One way is the fixed format, since $c_2 = \frac{1}{5} = 0.2$:

0 0 0 0 1 ⋅ 0 0 0 1 0   (bit weights: ± $2^3$ $2^2$ $2^1$ $2^0$ ⋅ $2^4$ $2^3$ $2^2$ $2^1$ $2^0$)
0 0 0 0 1 / 0 0 1 0 1   (bit weights: ± $2^3$ $2^2$ $2^1$ $2^0$ / $2^4$ $2^3$ $2^2$ $2^1$ $2^0$)

• Very limited upper and lower bounds for general purpose operations.

• If the scale of the problem intrinsically changes, then the boundaries are reached rapidly.

39
Evaluating a polynomial

𝜋 = 3.14159 26535 89793 23846 26433 83279 50288 41971 69399 37510 58209 74944
59230 78164 06286 20899 86280 34825 34211 70679 …

Real line 𝑥

3 𝜋

40
Evaluating a polynomial
• Storing $\pi$ and $\sqrt{5}$ requires much more attention.
• Real numbers fill the space in an uncountable manner.

real → count
3 → $q_0$
3.1 → $q_1$
3.14 → $q_2$
3.141 → $q_3$
3.1415 → $q_4$
⋮
3.1415 … 2643 … → $q_\infty$

[Figure: real line $x$ with the points 3, 3.1, 3.14, 3.141 approaching $\pi$]

• There is no direct way to store irrational numbers in memory.

41
FLOATING POINT REPRESENTATION OF REAL NUMBERS

42
Floating point representation of real numbers
• Scientific notation and its binary version

Think about

$2.1503 = 2 \times 10^{0} + 1 \times 10^{-1} + 5 \times 10^{-2} + 3 \times 10^{-4}$

A decimal number can be expressed as

$\ldots d_3 d_2 d_1 d_0 . d_{-1} d_{-2} d_{-3} \ldots$

Thus we use the powers of 10 in the expansion, represented by $(\cdot)_{10}$. Similarly,

$(2.1503)_{10} = (21503) \times 10^{-4}$

43
Floating point representation of real numbers
• Scientific notation and its binary version

• Thus we use the powers of 10 in the expansion, represented by $(\cdot)_{10}$. The expansion below is called the positional notation:

$(71)_{10} = 7 \times 10^{1} + 1 \times 10^{0}$

$(15.2)_{10} = 1 \times 10^{1} + 5 \times 10^{0} + 2 \times 10^{-1}$

• To add such numbers we can shift the decimal point so that both terms carry the same power of 10:

$(701.3)_{10} + (0.012)_{10} \times 10^{3}$

$\to (0.7013)_{10} \times 10^{3} + (0.012)_{10} \times 10^{3}$

44
Floating point representation of real numbers
• Base 60 of Babylonians

The notation rules are similar for other base systems (e.g. the Babylonians, 2000 BC):

$\sqrt{2} \approx 1 + \frac{24}{60} + \frac{51}{60^2} + \frac{10}{60^3} = 1.41421296$

$= (1.\ 24\ 51\ 10)_{60}$

Source:
https://en.wikipedia.org/wiki/Babylonian_cuneiform_numerals#/media/File:Babylonian_numerals.svg
Floating point representation of real numbers
• Base 60 of Babylonians

The notation rules are similar for other base systems (e.g. the Babylonians, 2000 BC):

$(\ldots h_3 h_2 h_1 h_0 . h_{-1} h_{-2} h_{-3} \ldots)_{60} = \cdots + h_3 \times 60^{3} + h_2 \times 60^{2} + h_1 \times 60^{1} + h_0 \times 60^{0} + h_{-1} \times 60^{-1} + h_{-2} \times 60^{-2} + h_{-3} \times 60^{-3} + \cdots$

$\sqrt{2} \approx 1 + \frac{24}{60} + \frac{51}{60^2} + \frac{10}{60^3} = 1.41421296$

$= (1.\ 24\ 51\ 10)_{60}$

Source:
https://en.wikipedia.org/wiki/Babylonian_cuneiform_numerals#/media/File:Babylonian_numerals.svg
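The base-60 expansion can be checked with exact rational arithmetic; a brief sketch in which the list `digits` holds the sexagesimal digits 1; 24, 51, 10 quoted above.

```python
from fractions import Fraction
import math

digits = [1, 24, 51, 10]   # h_0 ; h_-1, h_-2, h_-3 in base 60

# Positional notation: sum of h_{-i} * 60**(-i).
value = sum(Fraction(h, 60**i) for i, h in enumerate(digits))

print(value)                               # exact fraction
print(float(value))                        # decimal approximation
print(abs(float(value) - math.sqrt(2)))    # distance to sqrt(2)
```

The three sexagesimal fraction digits already agree with $\sqrt{2}$ to about six decimal places.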
Floating point representation of real numbers
• Positional notation and its binary version

For computer systems binary representations are crucial. This is due to the electronic design of arithmetic units, in which algebraic operations on binary digits are easy to implement. A binary floating point number in positional notation is then:

$(\ldots b_3 b_2 b_1 b_0 . b_{-1} b_{-2} b_{-3} \ldots)_2$

$= \cdots + b_3 \times 2^{3} + b_2 \times 2^{2} + b_1 \times 2^{1} + b_0 \times 2^{0} + b_{-1} \times 2^{-1} + b_{-2} \times 2^{-2} + b_{-3} \times 2^{-3} + \cdots$

Here each $b_i \in \{0, 1\}$ is an integer, $b_i \in \mathbb{Z}$.

47
Floating point representation of real numbers
• Binary representation of floating point numbers

How do we calculate binary reals? Assume we have $(72.125)_{10}$.

Integer part (72), divide by 2:

num      Division  Remainder
72 ÷ 2   36        0
36 ÷ 2   18        0
18 ÷ 2   9         0
9 ÷ 2    4         1
4 ÷ 2    2         0
2 ÷ 2    1         0
1 ÷ 2    0         1

Reading the remainders from bottom to top gives $(1001000)_2$.

Fractional part (.125), multiply by 2:

num        fraction  Int
.125 × 2   0.25      0
.25 × 2    0.5       0
.5 × 2     0         1

Reading the integer digits from top to bottom gives $(.001)_2$.

Then $(72.125)_{10} = (1001000.001)_2$
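The two tables translate directly into code. A minimal sketch, assuming a non-negative input; the function name `to_binary` and the cut-off parameter `max_frac_bits` are hypothetical.

```python
def to_binary(x, max_frac_bits=20):
    """Binary digits of a non-negative number as a string."""
    int_part = int(x)
    frac = x - int_part

    # Integer part: divide by 2 repeatedly, collect remainders bottom-up.
    int_bits = ""
    while int_part > 0:
        int_part, remainder = divmod(int_part, 2)
        int_bits = str(remainder) + int_bits
    int_bits = int_bits or "0"

    # Fractional part: multiply by 2 repeatedly, collect the integer digits.
    frac_bits = ""
    while frac > 0 and len(frac_bits) < max_frac_bits:
        frac *= 2
        bit = int(frac)
        frac_bits += str(bit)
        frac -= bit

    return int_bits + ("." + frac_bits if frac_bits else "")


print(to_binary(72.125))   # 1001000.001
print(to_binary(11.25))    # 1011.01
```

The cut-off matters for inputs whose binary expansion does not terminate; without it the second loop could run for a long time.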

48
Floating point representation of real numbers
• Binary representation of floating point numbers

• Some fractional decimals do not have an ending binary representation. In these cases the fractional part has repeating binary digits (non-terminating).

Fractional part (0.1), multiply by 2:

num      fraction  Int
.1 × 2   0.2       0
.2 × 2   0.4       0
.4 × 2   0.8       0
.8 × 2   0.6       1
.6 × 2   0.2       1
.2 × 2   0.4       0
.4 × 2   0.8       0
.8 × 2   0.6       1
.6 × 2   0.2       1
⋮        ⋮         ⋮

The pattern repeats from the row .2 × 2, so

$(0.1)_{10} = (0.0\overline{0011})_2$
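Because the expansion never terminates, a double precision variable can only hold a nearby value of 0.1. A short sketch that exposes the stored value with the standard `fractions` module:

```python
from fractions import Fraction

stored = Fraction(0.1)            # exact value of the double nearest to 0.1
print(stored)
print(stored == Fraction(1, 10))  # False: 1/10 itself is not representable

# The hexadecimal form shows the repeating pattern and the final rounded digit.
print((0.1).hex())                # 0x1.999999999999ap-4

# A familiar consequence of the representation error.
print(0.1 + 0.2 == 0.3)           # False
```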

49
Floating point representation of real numbers
• Binary representation of floating point numbers

Example
Find the first 4 bits in the binary representation of $\pi = 3.14159265\ldots$

Integer part (3), divide by 2:

num     Division  Remainder
3 ÷ 2   1         1
1 ÷ 2   0         1

This gives $(11)_2$.

Fractional part (.14159265…), multiply by 2:

num             fraction  Int
.14159265 × 2   0.2831…   0
.2831… × 2      0.5663…   0
.5663… × 2      0.1327…   1

This gives $(.001)_2$.

Then $\pi \approx (11.00)_2$

50
Floating point representation of real numbers
• Binary representation of floating point numbers

Example

Find the binary representation of $(11.25)_{10}$

Integer part (11), divide by 2; fractional part (.25), multiply by 2.

51
Floating point representation of real numbers
• Binary representation of floating point numbers

Example

Find the binary representation of $\left(\frac{3}{7}\right)_{10}$. Use algebraic operations on this rational number.

Fractional part (3/7), multiply by 2.

52
Floating point representation of real numbers
• Binary representation of floating point numbers

Example

Convert 10101.1011 2 to decimal,
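A sketch of the reverse conversion, summing each bit times its power of 2 as in the positional notation. The helper name `from_binary` is hypothetical.

```python
def from_binary(s):
    """Decimal value of a binary string such as '10101.1011'."""
    int_bits, _, frac_bits = s.partition(".")
    value = int(int_bits, 2) if int_bits else 0
    # Fractional bits carry the weights 2**-1, 2**-2, ...
    for i, bit in enumerate(frac_bits, start=1):
        value += int(bit) * 2.0**-i
    return value


print(from_binary("10101.1011"))   # 21.6875
```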

53
IEEE-754

54
Floating point representation in IEEE-754
• Before the 1980s, every company or facility worked with its own FP description, which caused portability problems between systems.

• At the end of the 1970s, a committee convened by the IEEE defined unified procedures for handling FP numbers in computer systems. This process was largely motivated by the electronics industry.

• The model we will describe here is the so-called IEEE-754 Floating Point Standard (IEEE: Institute of Electrical and Electronics Engineers).

• This representation is based on scientific notation.

55
Floating point representation in IEEE-754
• In exponential notation a real number $x$ is expressed as

$x = \pm S \times 10^{E}$

where $E$ is an integer chosen so that $1 \le S < 10$. Consequently, $x = 0.0024$ is represented as

$x = 2.4 \times 10^{-3}$

In base 2 nothing changes; a real number is

$x = \pm S \times 2^{E}$

where $E$ is an integer chosen so that $1 \le S < 2$. Then the first bit of $S$ will always be 1, since $1 \le S$. The exponent $E$ should be varied to satisfy this constraint.

56
Floating point representation in IEEE-754
The first bit of $S$ will always be 1, since $1 \le S$. The exponent $E$ should be varied to satisfy this constraint.

$x = \pm S \times 2^{E}$

$x = (b_0 . b_{-1} b_{-2} b_{-3} \ldots)_2 \times 2^{E}$

where $b_0 = 1$. The operation applied to $E$ to satisfy $b_0 = 1$ is called normalization.

Example
Represent $(0.01011)_2$ in base 10:

$(0.01011)_2 = 0 \times 2^{-1} + 1 \times 2^{-2} + 0 \times 2^{-3} + 1 \times 2^{-4} + 1 \times 2^{-5} = 3.4375 \times 10^{-1}$

57
Floating point representation in IEEE-754

Example
Normalize $x = (0.00101)_2$

After normalization we should get $S = (1.01)_2$. The operations are as follows:

$(0.00101)_2 \times 2 \to (0.0101)_2$
$(0.0101)_2 \times 2 \to (0.101)_2$
$(0.101)_2 \times 2 \to (1.01)_2$

$x \times 2 \times 2 \times 2 = (1.01)_2$, that is $x \times 2^{3} = (1.01)_2$, so

$x = (1.01)_2 \times 2^{-3}$
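Normalization can be sketched with the standard library function `math.frexp`, which returns a fraction in $[0.5, 1)$ and an exponent; doubling the fraction and lowering the exponent by one gives the convention $1 \le S < 2$ used here. The wrapper name `normalize` is hypothetical.

```python
import math


def normalize(x):
    """Return (S, E) with x = S * 2**E and 1 <= |S| < 2, for nonzero x."""
    m, e = math.frexp(x)      # x = m * 2**e with 0.5 <= |m| < 1
    return 2 * m, e - 1


x = 0.15625                   # (0.00101)_2 = 1/8 + 1/32
S, E = normalize(x)
print(S, E)                   # 1.25 -3, and 1.25 = (1.01)_2
```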

58
Floating point representation in IEEE-754
• In a computer, to store normalized numbers in memory, the word dedicated to a real number is divided into three sections. The 32-bit floating point description is as follows:

$x = (1.b_{-1} b_{-2} b_{-3} \ldots)_2 \times 2^{E}$

sign    exponent $E$    mantissa (significand) $S$
1 bit   8 bits          23 bits

59
Floating point representation in IEEE-754
• In a computer, to store normalized numbers in memory, the word dedicated to a real number is divided into three sections.

$x = (1.b_{-1} b_{-2} b_{-3} \ldots)_2 \times 2^{E}$

• Computer implementations do not store the leftmost bit $b_0 = 1$, since it is always 1 (hidden bit). There are also predefined rules for the exponent part (extended exponent) and for the cases of 0 and ±infinity (subnormal descriptions). We will omit these discussions and limit our concern to the basic rules.

60
Floating point representation in IEEE-754

Example
Find the normalized binary expression of $x = (71)_{10}$

$x = (1000111)_2$, with $S = (1.b_{-1} b_{-2} b_{-3} \ldots)_2$

Then the normalized form is:

$x = (1.000111)_2 \times 2^{6}$

This will be written in memory as:

0 …… 00011100000000000000000
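The stored fields of a real 32-bit float can be inspected with the standard `struct` module; a sketch with the hypothetical helper `float32_fields`. In the actual IEEE-754 single precision format the exponent field holds $E + 127$ (a biased exponent), which is one of the rules omitted in the simplified description above.

```python
import struct


def float32_fields(x):
    """Return the (sign, exponent, mantissa) bit strings of x as a 32-bit float."""
    (n,) = struct.unpack(">I", struct.pack(">f", x))   # reinterpret the 4 bytes
    bits = f"{n:032b}"
    return bits[0], bits[1:9], bits[9:]


sign, exponent, mantissa = float32_fields(71.0)
print(sign, exponent, mantissa)
print(int(exponent, 2) - 127)   # unbiased exponent E
```

For 71.0 the 23 mantissa bits start with 000111, the digits after the hidden leading 1 of $(1.000111)_2$.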

61
Floating point representation in IEEE-754
• A Toy System:

$x = (1.b_{-1} b_{-2} b_{-3})_2 \times 2^{E}$

Bit layout: sign of $x$ | sign of $E$ | $E$ (3 bits) | $b_{-1}$ $b_{-2}$ $b_{-3}$

• 1 bit for sign of $x$
• 1 bit for sign of $E$
• 3 bits for $E$
• 3 bits for $S$

62
Floating point representation in IEEE-754
• A Toy System:

Example
If the binary allocation of $x$ in memory is

0 1 1 0 1 0 0 1

what is the value of $x$ in base 10?

$x = \pm (1.b_{-1} b_{-2} b_{-3})_2 \times 2^{E} = (1.001)_2 \times 2^{-(101)_2}$

$(x)_{10} = (1.001)_2 \times 2^{-(101)_2} = (1.001)_2 \times 2^{-5}$

$= (1001)_2 \times 2^{-8} = 9 \times 2^{-8} = 0.03515625$
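A sketch of a decoder for the 8-bit toy format (1 sign bit, 1 exponent sign bit, 3 exponent bits, 3 mantissa bits with a hidden leading 1). The function name `decode_toy` is hypothetical, and special cases such as the toy system's definition of zero are not handled.

```python
def decode_toy(bits):
    """Decode an 8-bit toy floating point pattern given as a string of 0s and 1s."""
    bits = bits.replace(" ", "")
    sign = -1 if bits[0] == "1" else 1
    exp_sign = -1 if bits[1] == "1" else 1
    E = exp_sign * int(bits[2:5], 2)      # 3-bit exponent magnitude
    S = 1 + int(bits[5:8], 2) / 8         # hidden 1 plus b_-1 b_-2 b_-3
    return sign * S * 2.0**E


print(decode_toy("0 1 1 0 1 0 0 1"))   # 0.03515625
print(decode_toy("00111111"))          # largest value, 240.0
print(decode_toy("01111000"))          # smallest positive value, 2**-7
```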

63
Floating point representation in IEEE-754
• A Toy System:

Example
Find the max value of $x$ in base 10 that can be represented by our Toy 8-bit FP system.

0 0 1 1 1 1 1 1

$x = \pm (1.b_{-1} b_{-2} b_{-3})_2 \times 2^{\pm E} = +(1.111)_2 \times 2^{+(111)_2}$

$(x)_{10} = (1.111)_2 \times 2^{(111)_2} = (1.111)_2 \times 2^{7} = (1111)_2 \times 2^{4} = 15 \times 2^{4} = 240$

64
Floating point representation in IEEE-754
• A Toy System:

Example
Find the min value of $x > 0$ in base 10 that can be represented by the Toy 8-bit FP system.

0 1 1 1 1 0 0 0

$x = \pm (1.b_{-1} b_{-2} b_{-3})_2 \times 2^{\pm E} = +(1.000)_2 \times 2^{-(111)_2}$

$(x)_{10} = (1.000)_2 \times 2^{-(111)_2} = (1)_2 \times 2^{-7} = 2^{-7} \approx 0.008$

No positive real number smaller than $2^{-7}$ can be represented by the Toy (8-bit) system.

65
Floating point representation in IEEE-754
• We have seen that the exponential description of an FP number can be achieved by finding normalized expressions for $S$ and $E$. This information reproduces the FP number, however with errors.

• Consider the toy system: how can $x = 0.004$ be represented in memory?

[Figure: number line with 0 ↔ 1 0000 000 (biased definition of 0), 0.008 ↔ 0 1111 000 and 1.0 ↔ 0 0000 000; $x = 0.004$ lies between 0 and 0.008]

66
Floating point representation in IEEE-754

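• Consider the toy system: how can $x = 0.004$ be represented in memory?

[Figure: number line from 0 ↔ 1 0000 000 to 0.008 ↔ 0 1111 000; $x = 0.004$ lies between them and rounding moves it to a representable value with error $\varepsilon$]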

67
Floating point representation in IEEE-754

• Although the number of real numbers between two reals $p, q$ is not finite, a binary expression can point to only a finite number of reals.

• Any number beyond the capacity of the binary expression must be rounded to a nearby value.

• Errors $\varepsilon$ caused by such operations are called quantizing errors.

• Two ways to remedy this error: rounding or chopping.

68
APPROXIMATIONS AND ROUND -OFF ERRORS

69
Floating point representation in IEEE-754
• To remedy quantizing errors:

A Simple Rounding Rule
Round π = 3.14159265… to 3 digits after the decimal point:
_____________________________________
If the 4th digit, $d_{-4} \ge 5$:
    Then $d_{-3} = d_{-3} + 1$ and apply chopping at $d_{-4}$
Else:
    apply chopping at $d_{-4}$
_____________________________________
Thus, π ≈ 3.142

Chopping Operation
Round π = 3.14159265… to 3 digits after the decimal point:
_____________________________________
Apply chopping beginning from $d_{-4}$:
If $j > 3$:
    Then set $d_{-j} = 0$
_____________________________________
Thus, π ≈ 3.141
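Both rules can be reproduced with the standard `decimal` module; a sketch in which `ROUND_HALF_UP` plays the role of the simple rounding rule and `ROUND_DOWN` the role of chopping.

```python
from decimal import Decimal, ROUND_HALF_UP, ROUND_DOWN

pi_digits = Decimal("3.14159265")
step = Decimal("0.001")          # keep three digits after the decimal point

rounded = pi_digits.quantize(step, rounding=ROUND_HALF_UP)
chopped = pi_digits.quantize(step, rounding=ROUND_DOWN)

print(rounded)   # 3.142
print(chopped)   # 3.141
```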

70
Floating point representation in IEEE-754
• To remedy quantizing errors for binary numbers:

A Simple Rounding Rule
Round the 21-bit real $x$ = 0 0001 0011110010010100 into our 8-bit toy system.
Assumption: we will preserve the exponent part and round the mantissa $S$ = 0011110010010100 from the 4th binary digit position. Since $b_{-4} = 1$,
$b_{-3} \leftarrow b_{-3} + 1$
Then $b_{-3} = 2$, which produces a carry that is delivered to $b_{-2}$. Thus, the new 8-bit mantissa is $S$ = 010, and $x$ is:
$x$ = 0 0001 010

Chopping Operation
Round the 21-bit real $x$ = 0 0001 0011110010010100 into our 8-bit toy system.
Assumption: we will preserve the exponent part and round the mantissa $S$ = 0011110010010100 from the 4th binary digit position. Chop the digits beyond the third. Then $S$ = 001, and $x$ is:
$x$ = 0 0001 001

71
Floating point representation in IEEE-754
• To remedy quantizing errors for binary numbers:

$x$ = 0 0001 0011110010010100

$x$ = 0 0001 010 (rounding, not IEEE rounding)

$x$ = 0 0001 001 (chopping)

$S = \frac{1}{2^{3}} + \frac{1}{2^{4}} + \frac{1}{2^{5}} + \frac{1}{2^{6}} + \frac{1}{2^{9}} + \frac{1}{2^{12}} + \frac{1}{2^{14}} = 0.23663330078125$

$S_{round} = \frac{1}{2^{2}} = 0.25$

$S_{chop} = \frac{1}{2^{3}} = 0.125$
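The three mantissa values can be recomputed from the bit string; a sketch of the simple (non-IEEE) rounding rule and of chopping to three bits. The helper `frac_value` is hypothetical, and a carry out of the top bit is not handled.

```python
mantissa = "0011110010010100"   # 16 fractional bits b_-1 ... b_-16


def frac_value(bits):
    """Value of a fractional bit string: sum of b_-i * 2**-i."""
    return sum(int(b) * 2.0**-(i + 1) for i, b in enumerate(bits))


S = frac_value(mantissa)

# Chopping: keep the first three bits, drop the rest.
S_chop = frac_value(mantissa[:3])

# Simple rounding: add the 4th bit to the 3-bit integer formed by the first three.
kept = int(mantissa[:3], 2) + int(mantissa[3])
S_round = kept / 2**3

print(S, S_round, S_chop)
```

Here rounding is closer to $S$ than chopping: the errors are about 0.013 and 0.112 respectively.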

72
Floating point representation in IEEE-754
• Machine epsilon ($\epsilon$) is a measure of the precision of an FP number system.

• $\epsilon$ is the distance between 1 and the next larger floating point number. For the toy system:

$\epsilon = \frac{1}{2^{3}}$

0 0000 000 ↔ 1.0
0 0000 001 ↔ 1 + $\epsilon$

73
Floating point representation in IEEE-754
• What is the measure of the precision of a 32-bit FP number with 1 sign bit, 8 exponent bits and a 23-bit mantissa?

• Since the mantissa has $m = 23$ bits, epsilon is equal to the contribution of the last mantissa digit, $\epsilon = 2^{-23} \approx 10^{-7}$.

$\epsilon = 2^{-m}$

0 00000000 0…0 ↔ 1.0
0 00000000 0…01 ↔ 1 + $\epsilon$
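Machine epsilon can be read from `np.finfo` or found by a halving loop; a brief sketch. Python floats are 64-bit doubles with a 52-bit mantissa, so the loop is expected to stop at $2^{-52}$.

```python
import numpy as np

# Halve eps until adding half of it to 1.0 no longer changes the result.
eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2

print(eps)                         # machine epsilon of Python floats
print(np.finfo(np.float64).eps)    # the same value reported by NumPy
print(np.finfo(np.float32).eps)    # 32-bit floats: 2**-23
```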

74
Floating point representation in IEEE-754
• 64-bit Floating Point Arithmetic in Computer
Example
In computer arithmetic, the real number $x$ is replaced by $fl_{round}(x)$ in memory. Then what is $fl_{round}(2.4)$? As discussed above, 2.4 has integer and fractional parts:

$2 = (10.0)_2$
$0.4 = (0.\overline{0110})_2$
$2.4 = (10.\overline{0110})_2 = (1.0\overline{0110})_2 \times 2^{1}$

Keeping 52 bits after the binary point (the bar marks the cut; the first bit after it is the rounding bit):

+1. 00 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 11|00 1100 …

The rounding bit is zero, so nothing is added to the 52nd position. The error is $(0.\overline{0011})_2 \times 2^{-52} \times 2^{1} = 0.4 \times 2^{-52}$.

75
64bit Floating Point Rounding To The Nearest Rule for IEEE-754:
For double precision, if the 53rd bit to the right of the binary point is 0, round down (apply chopping after the 52nd bit). In the patterns below the 52nd bit is immediately left of the bar and the 53rd bit immediately right of it:

1.00 … 011|0111 …

If the 53rd bit is 1 and any later bit is also 1, round up (add 1 to the 52nd bit). If the 53rd bit is 1 and all later bits are 0 (a tie), add 1 to the 52nd bit if and only if the 52nd bit is 1:

1.00 … 1000|10000 … (apply chopping)

1.00 … 1001|10000 … (round up 52nd bit)
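The rule can be observed directly with Python floats, which are IEEE-754 doubles. A sketch using numbers just above 1, where the 52nd bit has the weight $2^{-52}$ and the 53rd bit the weight $2^{-53}$; all variable names are illustrative.

```python
ulp = 2.0**-52     # weight of the 52nd bit for numbers in [1, 2)
half = 2.0**-53    # weight of the 53rd bit

below_tie = 1.0 + half / 2              # 53rd bit is 0: round down
tie_even = 1.0 + half                   # tie, 52nd bit is 0: chop
tie_odd = (1.0 + ulp) + half            # tie, 52nd bit is 1: round up
above_tie = 1.0 + (half + 2.0**-60)     # 53rd bit 1 and a later bit 1: round up

print(below_tie == 1.0)
print(tie_even == 1.0)
print(tie_odd == 1.0 + 2 * ulp)
print(above_tie == 1.0 + ulp)
```

All four comparisons print `True` on a platform with IEEE-754 round-to-nearest-even arithmetic.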

77
64bit Floating Point Rounding To The Nearest Rule for IEEE-754:
• 64-bit Floating Point Arithmetic in Computer IEEE-754
Example
For which positive integers $k$ can the number $5 + 2^{-k}$ be represented exactly in 64 bits (with no rounding error)?

$5 = (101.0)_2 = (1.01)_2 \times 2^{2}$ (normalized)

$2^{-k} = (0.\underbrace{00 \ldots 0}_{k-1 \text{ times}} 1)_2 = (0.\underbrace{00 \ldots 0}_{k+1 \text{ times}} 1)_2 \times 2^{2}$

$5 + 2^{-k} = (1.\underbrace{01\,0 \ldots 0}_{k+1} 1)_2 \times 2^{2}$

The fractional part of $S$ must fit in the 52 mantissa bits, so $(k + 1) + 1 \le 52$; the largest such $k$ is $k = 50$.

78
64bit Floating Point Rounding To The Nearest Rule for IEEE-754:
• 64-bit Floating Point Arithmetic in Computer IEEE-754
Example
For which positive integers $k$ can the number $5 + 2^{-k}$ be represented exactly in 64 bits (with no rounding error)?

$(k + 1) + 1 = 52 \to k = 50$
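The result can be checked numerically; a sketch that compares the computed double with the exact rational value using the standard `fractions` module. The helper name `is_exact` is hypothetical.

```python
from fractions import Fraction


def is_exact(k):
    """True if 5 + 2**-k is stored in a double without rounding error."""
    computed = Fraction(5 + 2.0**-k)        # exact value of the stored double
    exact = 5 + Fraction(1, 2**k)           # exact rational value
    return computed == exact


exact_ks = [k for k in range(1, 61) if is_exact(k)]
print(max(exact_ks))   # 50
```

Every $k$ from 1 to 50 passes the check; from $k = 51$ onwards the added term is lost to rounding.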

79
64bit Floating Point Rounding To The Nearest Rule for IEEE-754:
• 64-bit Floating Point Arithmetic in Computer IEEE-754
Example
Find the largest integer $k$ for which $fl_{round}(19 + 2^{-k}) > fl_{round}(19)$ in double precision format.

$19 = (10011.0)_2 = (1.0011)_2 \times 2^{4}$ (normalized)

$2^{-k} = (0.\underbrace{00 \ldots 0}_{k-1 \text{ times}} 1)_2 = (0.\underbrace{00 \ldots 0}_{k+3 \text{ times}} 1)_2 \times 2^{4}$

$19 + 2^{-k} = (1.\underbrace{0011\,0 \ldots 0}_{k+3} 1)_2 \times 2^{4}$

The fractional part of $S$ should be of length 52 bits, then $(k + 3) + 1 = 52 \to k = 48$

80
64bit Floating Point Rounding To The Nearest Rule for IEEE-754:
• 64-bit Floating Point Arithmetic in Computer IEEE-754
Example
Find the largest integer $k$ for which $fl_{round}(19 + 2^{-k}) > fl_{round}(19)$ in double precision format.

The fractional part of $S = (1.\underbrace{0011\,0 \ldots 0}_{k+3} 1)_2$ should be of length 52 bits, then $(k + 3) + 1 = 52 \to k = 48$
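A one-line numerical check of this example, relying on Python floats being IEEE-754 doubles with round-to-nearest-even; the name `largest_k` is illustrative.

```python
# Largest k for which adding 2**-k to 19 still changes the stored double.
largest_k = max(k for k in range(1, 80) if 19 + 2.0**-k > 19)
print(largest_k)   # 48
```

For $k = 49$ the sum lies exactly halfway between 19 and the next double, and the tie is resolved towards 19.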

81
64bit Floating Point Rounding To The Nearest Rule for IEEE-754:
Problem
Find the integer $k$ which maximizes the value of $fl_{chop}(2^{k} + 2^{-k})$ for a $q$-bit mantissa in normalized format.

$x = (1.b_{-1} b_{-2} b_{-3} \ldots b_{-q})_2 \times 2^{E}$

[Figure: $2^{-k}$ and $2^{k}$]

82
MEASURING ERRORS

83
Floating point Errors in IEEE-754
• Let $\hat{x}$ be a computed quantity of $x$. Then,

True Absolute Error: $\varepsilon_{T,abs} = |\hat{x} - x|$

True Relative Error: $\varepsilon_{T,rel} = \frac{|\hat{x} - x|}{|x|}$

if the quantity exists. If the exact quantity $x$ is not known, then, as an approximation in iterative algorithms, the current estimate can be used for the ratio.

Approximate Relative Error: $\hat{\varepsilon}_{rel} = \frac{|\hat{x}_{current} - \hat{x}_{prev}|}{|\hat{x}_{current}|}$
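The three error measures as small Python functions; a sketch that assumes absolute values are intended in each definition. The function names and the sample numbers are illustrative.

```python
import math


def abs_error(x_hat, x):
    """True absolute error |x_hat - x|."""
    return abs(x_hat - x)


def rel_error(x_hat, x):
    """True relative error |x_hat - x| / |x|, defined for x != 0."""
    return abs(x_hat - x) / abs(x)


def approx_rel_error(x_current, x_prev):
    """Approximate relative error between two successive estimates."""
    return abs(x_current - x_prev) / abs(x_current)


print(abs_error(3.142, math.pi))      # error of the rounded value of pi
print(rel_error(3.142, math.pi))
print(approx_rel_error(1.5, 1.0))     # two successive estimates of some iteration
```

The approximate relative error is the one usually available inside an iterative algorithm, since the exact value is not known there.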

84
Floating point Errors in IEEE-754
• Let $fl(x)$ be the IEEE machine arithmetic model of $x$. Then the relative rounding error of $x$ is no more than one-half machine epsilon:

Relative Rounding Error: $\frac{|fl_{round}(x) - x|}{|x|} \le \frac{1}{2} \epsilon_{mach}$

Relative Chopping Error: $\frac{|fl_{chop}(x) - x|}{|x|} \le \epsilon_{mach}$

85
Floating point Errors in IEEE-754
• Let $fl(x)$ be the IEEE machine arithmetic model of $x$.

Example
Calculate $|fl(x) - x|$ for $x = 10/3$ and check the relative rounding error for $x$.
____________________________________________________________________________________

Integer part: $3 = (11)_2$

Fractional part (1/3), multiply by 2:

num       fraction  Int
1/3 × 2   2/3       0
2/3 × 2   1/3       1
1/3 × 2   2/3       0
2/3 × 2   1/3       1
…         …         …

Fractional part: $1/3 = (0.\overline{01})_2$

$x = (11.\overline{01})_2 = (1.1\overline{01})_2 \times 2^{1}$

86
Floating point Errors in IEEE-754
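Example
Let $fl(x)$ be the (round to nearest) IEEE machine arithmetic model of $x$. Calculate $|fl(x) - x|$ for $x = 10/3$ and check the relative rounding error for $x$ in a 64-bit representation.
____________________________________________________________________________________
$x = (11.\overline{01})_2 = (1.1\overline{01})_2 \times 2^{1}$ (normalized)

$S = 1.\underbrace{1\ 01\ 01 \cdots 01\ 0}_{52 \text{ bits mantissa}}\,|\,1\ 01 \ldots$

The 53rd bit is 1 and the bits after it are not all zero, so round to nearest adds 1 to the 52nd bit. The discarded tail is $(0.\overline{10})_2 \times 2^{-52} = \frac{2}{3} \times 2^{-52}$, so rounding up changes $S$ by $\left(1 - \frac{2}{3}\right) \times 2^{-52} = \frac{1}{3} \times 2^{-52}$. Then

$|fl(x) - x| = \frac{1}{3} \times 2^{-52} \times 2^{1} = \frac{2}{3} \times 2^{-52}$

Relative rounding error is:

$\frac{|fl(x) - x|}{|x|} = \frac{2}{3} \times 2^{-52} \times \frac{3}{10} = 0.4 \times 2^{-53} \le 2^{-53}$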
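A check of this example with exact rational arithmetic, assuming Python floats are IEEE-754 doubles with round to nearest. On such a platform the stored value of 10/3 is slightly larger than 10/3, and the relative error equals $0.4 \times 2^{-53}$.

```python
from fractions import Fraction

x = Fraction(10, 3)            # exact value
fl_x = Fraction(10 / 3)        # exact value of the double produced by 10 / 3

abs_err = abs(fl_x - x)
rel_err = abs_err / x

print(fl_x > x)                              # the stored value was rounded up
print(abs_err == Fraction(2, 3) / 2**52)     # absolute error
print(rel_err == Fraction(2, 5) / 2**53)     # relative error 0.4 * 2**-53
print(rel_err <= Fraction(1, 2**53))         # bound: half of machine epsilon
```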
