
An Introduction to Floating Point

Arithmetic by Example
Pat Quillen

21 January 2010

Floating Point Arithmetic by Example – p.1/15


Example
What is the value of

1 − 3 ∗ (4/3 − 1)

according to MATLAB?

2.220446049250313e-016
Why?? Essentially because 4/3 cannot be represented
exactly by a binary number with finitely many terms.
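The slides use MATLAB, but Python's float is also an IEEE 754 double, so the same residue can be reproduced there (a sketch, not part of the original deck):

```python
# fl(4/3) is slightly smaller than 4/3, so the algebraic identity
# 1 - 3*(4/3 - 1) = 0 fails by exactly one unit in the last place of 1.
residue = 1 - 3 * (4/3 - 1)
print(residue)             # 2.220446049250313e-16
print(residue == 2**-52)   # True
```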


Example (continued)
Notice that

4/3 = 1/(3/4) = 1/(1 − 1/4) = Σ_{k=0}^{∞} (1/4)^k

That is,

4/3 = 1 + 1/2^2 + 1/2^4 + 1/2^6 + · · ·

or, in binary,

4/3 = 1.010101010101 · · ·

which, again, is not exactly representable by finitely many
terms.


Floating Point Representation
In binary computers, most floating point numbers are
represented as

(−1)^s 2^e (1 + f)

where
s is represented by one bit (called the sign bit).
e is the exponent.
f is the mantissa.

For double precision numbers, e is an eleven-bit number
and f is a fifty-two-bit number.
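The three fields can be inspected directly in any IEEE 754 environment. A Python sketch (the `fields` helper is ours, not from the slides), using the standard `struct` module:

```python
import struct

def fields(x):
    """Split a double into (sign bit, 11-bit biased exponent, 52-bit mantissa)."""
    bits = struct.unpack('>Q', struct.pack('>d', x))[0]
    s  = bits >> 63
    eb = (bits >> 52) & 0x7FF
    f  = bits & ((1 << 52) - 1)
    return s, eb, f

print(fields(1.0))    # (0, 1023, 0): s = 0, e = 0 stored with bias 1023, f = 0
print(fields(-2.0))   # (1, 1024, 0): s = 1, e = 1, f = 0
```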


Floating Point Exponent
As e is represented by 11 bits, it can range in value from
0 to 2^11 − 1 = 2047.
Negative exponents are represented by biasing e when
stored.
The double precision bias is 2^10 − 1 = 1023. Thus,
−1023 ≤ e ≤ 1024.
The extreme values e = −1023 (stored as eb = 0) and
e = 1024 (stored as eb = 2047) are special, so
−1022 ≤ e ≤ 1023 is the valid range of the exponent.
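The reserved extremes can be observed directly. A Python sketch (`biased_exp` is a hypothetical helper name):

```python
import struct

def biased_exp(x):
    """Extract the stored 11-bit exponent field eb of a double."""
    bits = struct.unpack('>Q', struct.pack('>d', x))[0]
    return (bits >> 52) & 0x7FF

print(biased_exp(1.0))           # 1023: e = 0 plus the bias
print(biased_exp(float('inf')))  # 2047: reserved for infinities and NaNs
print(biased_exp(5e-324))        # 0:    reserved for zero and subnormals
```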


Floating Point Mantissa
f limits the precision of the floating point number.
0 ≤ f < 1
The format 2^e (1 + f) provides an implicitly stored 1, so
doubles actually have 53 bits of precision.
2^52 f is an integer ⇒ gaps between successive doubles.

For example, all integers up to 2^53 are exactly representable
as floating point numbers, but 2^53 + 1 is not.
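The gap at 2^53 is easy to see in any IEEE 754 double environment (Python shown here):

```python
big = 2.0**53

# 2**53 is exactly representable, but 2**53 + 1 must round,
# and it rounds back down to 2**53.
print(big + 1 == big)   # True
print(big - 1 == big)   # False: 2**53 - 1 still fits in 53 bits
```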


Examples
The number 1 is represented as

(−1)^0 2^0 (1 + 0).

That is, s = 0, e = 0, f = 0. Adding the bias (1023), the
biased value of e is eb = 1023.

You can use format hex in MATLAB to see the bit pattern
of the floating point number in hexadecimal. The first three
hex digits (12 bits) represent the sign bit and the biased
exponent, and the remaining 13 hex digits (52 bits) represent
the mantissa.
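A comparable view in Python (`hexbits` is our stand-in for MATLAB's format hex):

```python
import struct

def hexbits(x):
    """The 16 hex digits of a double's bit pattern, like MATLAB's format hex."""
    return struct.pack('>d', x).hex()

print(hexbits(1.0))   # 3ff0000000000000
```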


Examples
In the case of the number 1, s = 0 and eb = 01111111111,
so the first three hex digits are 001111111111 = 3ff, and 1
is represented by
3ff0000000000000
For 4/3, f = 0.01010101 · · · 0101, or 55 · · · 5 in hex. As with
1, 4/3 has e = 0, and so it has representation
3ff5555555555555
which is just slightly smaller than 4/3.
The real number 0.1 has e = −4, and
f = 0.10011001 · · · 10011010, and thus has representation
3fb999999999999a
which is just slightly larger than 0.1.
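The same bit patterns, reproduced in Python (the helper mimics MATLAB's format hex and is not from the slides):

```python
import struct

def hexbits(x):
    # hex bit pattern of a double, like MATLAB's format hex
    return struct.pack('>d', x).hex()

print(hexbits(1.0))   # 3ff0000000000000
print(hexbits(4/3))   # 3ff5555555555555: slightly smaller than 4/3
print(hexbits(0.1))   # 3fb999999999999a: slightly larger than 0.1
```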
Round-off
Since fl(4/3) ≠ 4/3 (where fl(x) stands for "the floating point
representation of x"), we see the behavior

1 − 3 ∗ (4/3 − 1) ≠ 0.

All of the operations except the division are performed
without error, and the special value

ε = 2^−52

is the result.
ε is referred to as machine epsilon, or the unit roundoff, and it
is the distance between 1 and the next larger floating point
number.
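Python (version 3.9 or later) exposes this distance directly through math.ulp and math.nextafter:

```python
import math

eps = math.ulp(1.0)   # distance from 1.0 to the next larger double
print(eps == 2**-52)                           # True
print(math.nextafter(1.0, 2.0) - 1.0 == eps)   # True
```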
Example
A very common example of propagation of round-off comes
in the form of
0.1 + 0.1 + 0.1
Specifically, is the above expression equal to 0.3?

No! As a matter of fact, MATLAB will tell you that 0.3 is
represented by
3fd3333333333333
while 0.1 + 0.1 + 0.1 is represented by
3fd3333333333334
The difference in the last place is due to accumulation of the
difference between 0.1 and fl(0.1).
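The same one-bit discrepancy shows up in Python, since its floats are the same doubles:

```python
import struct

print(0.1 + 0.1 + 0.1 == 0.3)                    # False
print(struct.pack('>d', 0.3).hex())              # 3fd3333333333333
print(struct.pack('>d', 0.1 + 0.1 + 0.1).hex())  # 3fd3333333333334
```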


Deadly Consequences

Numerical Disasters
1991: Patriot Missile misses Scud!
1996: Ariane Rocket explodes!



Swamping
Due to finiteness of precision, floating point addition can
suffer swamping. Suppose we have two floating point
numbers a = 10^5 and b = 10^−12. The quantity c = a + b is
equal to a, since a and b differ by many orders of
magnitude.

To rectify the effects of swamping, one may compute in
increasing order of magnitude. For example, try these in
MATLAB:

eps/2 + 1 − eps/2
eps/2 − eps/2 + 1

Note: It is frequently infeasible to do this!
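The two suggested expressions, evaluated in Python (sys.float_info.epsilon plays the role of MATLAB's eps):

```python
import sys

eps = sys.float_info.epsilon   # 2**-52

# Left to right, eps/2 is swamped by the 1 before it can be subtracted back:
print(eps/2 + 1 - eps/2)   # 0.9999999999999999

# Summing the small terms first keeps the result exact:
print(eps/2 - eps/2 + 1)   # 1.0
```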


Cancellation
A phenomenon not dissimilar from swamping is
cancellation, which occurs when a number is subtracted
from another number of roughly the same magnitude.

For example, for values of x very near 0, the expression

√(x + 1) − 1

suffers cancellation, as 1 swamps x in the computation of
x + 1, and the subsequent subtraction results in 0.
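A quick Python demonstration of the total loss:

```python
import math

x = 1e-20
# x + 1 rounds to exactly 1.0, so the subtraction returns 0
# even though sqrt(x + 1) - 1 is really about 5e-21.
print(math.sqrt(x + 1) - 1)   # 0.0
```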


Cancellation
To get around the effects of cancellation, one may rewrite
their computation in an equivalent form that avoids the
cancellation altogether. For example, computing with

√(x + 1) − 1 = x / (√(x + 1) + 1)

avoids the cancellation for values of x near zero. Now, the
only value of x that results in a zero output is 0 itself.

Note: Not all cancellation can be avoided, and not all
cancellation is bad!
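Comparing the two forms in Python (the helper name `sqrt1pm1` is ours):

```python
import math

def sqrt1pm1(x):
    """sqrt(x + 1) - 1 via the cancellation-free form."""
    return x / (math.sqrt(x + 1) + 1)

x = 1e-20
print(math.sqrt(x + 1) - 1)   # 0.0: the naive form loses every digit
print(sqrt1pm1(x))            # 5e-21: the rewritten form is accurate
```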


Resources
What Every Computer Scientist Should Know About
Floating-Point Arithmetic by David Goldberg.
Numerical Analysis, 8th ed. by Richard L. Burden and
J. Douglas Faires.
Numerical Computing with MATLAB by Cleve Moler.
Accuracy and Stability of Numerical Algorithms by
Nicholas J. Higham.
Technical Note regarding Floating Point Arithmetic.
