The IEEE Floating Point Standard
Jonathan Goodman
May 25, 2024
These are course notes for Scientific Computing, given Spring 1996 at the Courant Institute of Mathematical Sciences at New York
University by Jonathan Goodman. Professor Goodman retains the copyright to these notes. He does not give anyone permission to
copy computer files related to them. Send email to [email protected].
The IEEE floating point standard is a set of rules issued by the IEEE (Institute of Electrical and Electron-
ics Engineers) on computer representation and processing of floating point numbers. Today, most computers
claim to be IEEE compliant but many cut corners in what (they consider to be) minor details. The stan-
dard is currently being enlarged to specify some details left open in the original standard, mostly on how
programmers interact with flags and traps. The standard has four main goals:
(1) to make floating point arithmetic as accurate as possible;
(2) to produce sensible outcomes in exceptional situations;
(3) to standardize floating point operations across computers;
(4) to give the programmer control over how exceptions are handled.
Point (1) is achieved in two ways. The standard specifies exactly how a floating point number should be
represented in hardware. It demands that operations (addition, square root, etc.) should be as accurate as
possible. For point (2), the standard introduces inf (infinite) to indicate that the result is larger than the
largest floating point number, and NaN (Not a Number) to indicate that the result makes no sense. Before the
standard, most computers would abort a program in such circumstances. Point (3) has several consequences:
(i) that floating point numbers can be transferred from one IEEE compliant computer to another in binary
without the loss of precision and extra storage associated with conversion to a decimal ASCII representation,
(ii) that the details of floating point arithmetic should be understood by the programmer, and (iii) that the
same program run on a different computer should produce exactly identical results, down to the last bit.
The last one is not true in practice, yet.
The IEEE standard specifies exactly what floating point numbers are and how they are to be represented
in hardware. The most basic unit of information that a computer stores is a bit, a variable whose value
may be either 0 or 1. Bits are organized into groups of 8 called bytes. The most common unit of computer
number information is the 4 byte (32 bit) word. For higher accuracy this is doubled to 8 bytes or 64 bits.
There are 2 possible values for a bit, there are $2^8 = 256$ possible values for a byte, and there are $2^{32}$ different
4 byte words, about 4.3 billion. A typical computer should take less than an hour to list all 4.3 billion 32 bit
words. To get an idea how fast your computer is, try running the program in figure 1. This will keep going
as long as there are positive integers (half the 4 byte words represent positive integers). This program relies
on the assumption that when the computer adds one to the largest integer, the result will be the smallest
(most negative) integer. This is true in almost all computers.
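Figure 1 is not reproduced here. A minimal C sketch of a loop of the kind described, assuming 4 byte int variables and the wrap-around behavior just mentioned (which the C language itself does not guarantee), might look like this:

\begin{verbatim}
#include <stdio.h>

/* Count upward through the positive 32 bit integers.  The loop stops
   when adding 1 to the largest integer wraps around to the most
   negative integer, as the text assumes most machines do.  (Strictly
   speaking, signed overflow is not defined behavior in C.) */
int main(void)
{
    int i = 1;
    while (i > 0)
        i = i + 1;
    printf("wrapped around to %d\n", i);
    return 0;
}
\end{verbatim}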
The two kinds of number are fixed point (integer) and floating point (real). A fixed point number has
type int in C and type integer in FORTRAN. A floating point number has type float in C and real in
FORTRAN. In most C compilers, floating point constants and arithmetic are treated as double precision by default, which uses 8 bytes rather than 4.
Integer arithmetic is very simple. There are $2^{32} \approx 4 \times 10^9$ 32 bit integers filling the range from about
$-2 \cdot 10^9$ to $2 \cdot 10^9$. Addition, subtraction, and multiplication are done exactly whenever the answer is within
this range. Most computers will do something unpredictable when the answer is out of range (overflow).
The disadvantages of integer arithmetic are that it cannot represent fractions and that it has a narrow
range of values. The number of dollars in the US national debt cannot be represented as a 32 bit integer.
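As an illustration (not part of the original notes), the limits of 32 bit integer arithmetic can be read off from the standard <limits.h> header; the debt figure below is only a placeholder:

\begin{verbatim}
#include <stdio.h>
#include <limits.h>

int main(void)
{
    /* Largest and smallest values a 32 bit int can hold. */
    printf("INT_MAX = %d\n", INT_MAX);   /* 2147483647, about 2.1e9 */
    printf("INT_MIN = %d\n", INT_MIN);   /* -2147483648 */

    /* A debt of several trillion dollars is far outside this range,
       so it cannot be stored in a 32 bit integer.  The figure is a
       placeholder, not actual data. */
    double debt = 5.0e12;
    printf("debt / INT_MAX = %.1f\n", debt / (double)INT_MAX);
    return 0;
}
\end{verbatim}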
A floating point (or “real”) number is a computer version of the exponential (or “scientific”) notation
used on calculator displays. Consider the example expression:
−.2491E−5
which is one way a calculator could display the number $-2.491 \times 10^{-6}$. This expression consists of a sign bit,
s = −, a mantissa, m = 2491, and an exponent, e = −5. The expression s.mEe corresponds to the number
$s \cdot .m \cdot 10^{e}$.
The IEEE format replaces the base 10 with base 2, and makes a few other changes. When a 32 bit
word is interpreted as a floating point number, the first bit is the sign bit, s = ±. The next 8 bits form
the “exponent”, e, and the remaining 23 bits determine the “fraction”, f . There are two possible signs, 256
possible exponents (ranging from 0 to 255), and $2^{23} \approx 8.4$ million possible fractions. Normally a floating
point number has the value
$$x = \pm 2^{e-127} \cdot (1.f)_2 \, ,$$
where f is written in base 2 and the notation $(1.f)_2$ means that the expression 1.f is interpreted in base 2. Note
that the mantissa is 1.f rather than just the fractional part, f . In base 2 any number (except 0) can be
normalized so that the mantissa has the form 1.f . There is no need to store the “1.” explicitly. For example,
the number $2.752 \cdot 10^3 = 2752$ can be written
$$
\begin{aligned}
2752 &= 2^{11} + 2^{9} + 2^{7} + 2^{6} \\
     &= 2^{11}\cdot\left(1 + 2^{-2} + 2^{-4} + 2^{-5}\right) \\
     &= 2^{11}\cdot\left(1 + (.01)_2 + (.0001)_2 + (.00001)_2\right) \\
     &= 2^{11}\cdot(1.01011)_2 \, .
\end{aligned}
$$
Thus, the representation of this number would have sign $s = +$, exponent $e - 127 = 11$ so that $e = 138 =
(10001010)_2$, and fraction $f = (01011000000000000000000)_2$. The entire 32 bit string corresponding to
$2.752 \cdot 10^3$ then is:
$$\underbrace{0}_{s}\;\underbrace{10001010}_{e}\;\underbrace{01011000000000000000000}_{f}\, .$$
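A short C program, not part of the original notes, can check this bit pattern by copying the 4 bytes of the float into an unsigned integer and extracting the three fields. It assumes a 32 bit unsigned int and IEEE single precision float:

\begin{verbatim}
#include <stdio.h>
#include <string.h>

int main(void)
{
    float x = 2752.0f;
    unsigned int bits;                          /* assumed to be 32 bits */

    /* Reinterpret the 4 bytes of the float as an unsigned integer. */
    memcpy(&bits, &x, sizeof bits);

    unsigned int s = bits >> 31;                /* sign bit          */
    unsigned int e = (bits >> 23) & 0xFFu;      /* 8 exponent bits   */
    unsigned int f = bits & 0x7FFFFFu;          /* 23 fraction bits  */

    printf("s = %u\n", s);                      /* 0 (positive)      */
    printf("e = %u (e - 127 = %d)\n", e, (int)e - 127);   /* 138, 11 */

    /* Print the fraction as 23 binary digits, most significant first. */
    printf("f = ");
    for (int k = 22; k >= 0; k--)
        printf("%u", (f >> k) & 1u);
    printf("\n");                               /* 01011 followed by zeros */
    return 0;
}
\end{verbatim}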
The exceptional cases $e = 0$ (which would correspond to $2^{-127}$) and $e = 255$ (which would correspond
to $2^{128}$) have complex and carefully engineered interpretations that make the IEEE standard distinctive. If
$e = 0$, the value is
$$x = \pm 2^{-126} \cdot (0.f)_2 \, .$$
This feature is called “gradual underflow”. (“Underflow” is the situation in which the result of an operation
is not zero but is closer to zero than any floating point number.) The corresponding numbers are called
“denormalized”. Gradual underflow has the consequence that two floating point numbers are equal, x = y,
if and only if subtracting one from the other gives exactly zero.
The use of denormalized (or “subnormal”) numbers makes sense when you consider the spacing between
floating point numbers. If we exclude denormalized numbers then the smallest positive floating point number
(in single precision) is $a = 2^{-126}$ (corresponding to e = 1 and f = 00···00 (23 zeros)), but the next positive
floating point number larger than a is b, which also has e = 1 but now has f = 00···01 (22 zeros and a 1).
The distance between b and a is $2^{23}$ times smaller than the distance between a and zero. That is, without
gradual underflow, there is a large and unnecessary gap between 0 and the nearest nonzero floating point
number.
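The following sketch, which assumes the C99 functions ldexpf and nextafterf are available, shows the spacing near the smallest normalized single precision number and the equality property just mentioned:

\begin{verbatim}
#include <stdio.h>
#include <math.h>

int main(void)
{
    float a = ldexpf(1.0f, -126);   /* smallest normalized float, 2^-126 */
    float b = nextafterf(a, 2.0f);  /* the next float above a            */

    printf("a     = %e\n", a);
    printf("b - a = %e\n", b - a);  /* 2^-149, a denormalized number     */

    /* With gradual underflow, x == y exactly when x - y is zero. */
    float x = a, y = b;
    printf("x == y : %d,  x - y == 0 : %d\n", x == y, (x - y) == 0.0f);
    return 0;
}
\end{verbatim}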
The other extreme case, e = 255, has two subcases, inf (for infinity) if f = 0 and NaN (for Not a Number)
if f ̸= 0. Both C and FORTRAN print “inf” and “NaN” when you print out a variable in floating point
format that has one of these values. The computer produces inf if the result of an operation is larger than
the largest floating point number, in cases such as x*x*x*x when $x = 5.2 \cdot 10^{15}$, or exp(x) when x = 204, or
1/x if x = ±0. (Actually 1/+0 = +inf and 1/−0 = −inf.) Other invalid operations such as sqrt(-1.),
log(-4.), and inf/inf, produce NaN. It is planned that f will contain information about how or where in
the program the NaN was created but this is not standardized yet. It might be worthwhile to look in the
hardware or arithmetic manual of the computer you are using.
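For example, a short C program exercising the cases just listed (a sketch added here, not taken from the notes; the printed form of inf and NaN varies a little between systems) is:

\begin{verbatim}
#include <stdio.h>
#include <math.h>

int main(void)
{
    float x, y;

    x = 5.2e15f;
    y = x * x * x * x;                     /* overflows single precision */
    printf("x*x*x*x  = %e\n", y);          /* inf                        */

    printf("exp(204) = %e\n", (float)exp(204.0));  /* inf in single precision */

    x = 0.0f;
    printf("1/+0 = %e, 1/-0 = %e\n", 1.0f / x, 1.0f / -x);  /* +inf, -inf */

    printf("sqrt(-1) = %e\n", sqrt(-1.0)); /* NaN */
    printf("log(-4)  = %e\n", log(-4.0));  /* NaN */
    printf("inf/inf  = %e\n", y / y);      /* NaN, since y is inf here   */
    return 0;
}
\end{verbatim}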
Exactly what happens when a floating point exception, the generation of an inf or a NaN, occurs is
supposed to be under software control. There is supposed to be a flag (an internal computer status bit with
values 0 or 1, something like the open/closed sign on a shop) that the program can set. If the flag is on, the
exception will cause a program interrupt (the program will halt there and print an error message); otherwise,
the computer will just give the inf or NaN as the result of the operation and continue. The latter is usually
the default. In practice, it may be hard for the programmer to figure out how to set the arithmetic flags and
other arithmetic defaults.
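Modern C (C99 and later, so well after these notes were written) exposes these flags portably through the <fenv.h> header. A small sketch of clearing and then testing the overflow flag is:

\begin{verbatim}
#include <stdio.h>
#include <fenv.h>

/* Tell the compiler we read the floating point status flags. */
#pragma STDC FENV_ACCESS ON

int main(void)
{
    feclearexcept(FE_ALL_EXCEPT);      /* reset all exception flags      */

    volatile double big = 1.0e308;
    volatile double r = big * big;     /* overflows and sets FE_OVERFLOW */

    if (fetestexcept(FE_OVERFLOW))
        printf("overflow flag is set, result = %e\n", r);
    if (fetestexcept(FE_INVALID))
        printf("invalid flag is set\n");
    return 0;
}
\end{verbatim}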
The floating point standard specifies that all arithmetic operations should be as accurate as possible within
the constraints of finite precision arithmetic. The result of arithmetic operations is to be “the exact result
correctly rounded”. This means that you can get the computed result in two steps. First interpret the
operand(s) as mathematical real number(s) and perform the mathematical operation exactly. This usually
gives a result that cannot be represented exactly in IEEE format. The second step is to “round” this
mathematical answer, that is, replace it with the IEEE number closest to it. Ties (two IEEE numbers the
same distance from the mathematical answer) are broken in some way (e.g. round to nearest even). Any
operation involving a NaN produces another NaN. Operations with inf are common sense: inf + finite = inf,
inf/inf = NaN, finite/inf = 0., inf − inf = NaN.
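A brief C illustration of both points, assuming a C99 math library and that HUGE_VALF expands to the single precision inf (as it does on IEEE machines), is:

\begin{verbatim}
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* "Exact result correctly rounded": 1 + 2^-24 lies exactly halfway
       between 1 and the next float, and the tie is broken toward the
       neighbor with an even last bit, which is 1 itself. */
    float one = 1.0f;
    float half_ulp = ldexpf(1.0f, -24);     /* 2^-24              */
    float sum = one + half_ulp;             /* rounds back to 1.0 */
    printf("1 + 2^-24 == 1 : %d\n", sum == one);

    /* Common sense arithmetic with inf. */
    float inf = HUGE_VALF;                  /* +inf on IEEE machines */
    printf("inf + 1   = %e\n", inf + 1.0f); /* inf */
    printf("1 / inf   = %e\n", 1.0f / inf); /* 0   */
    printf("inf - inf = %e\n", inf - inf);  /* NaN */
    printf("inf / inf = %e\n", inf / inf);  /* NaN */
    return 0;
}
\end{verbatim}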
From the above, it is clear that the accuracy of floating point operations is determined by the size
of rounding error. This rounding error is determined by the distance between neighboring floating point
numbers. Except for denormalized numbers, neighboring floating point numbers differ by one bit in the last bit
of the fraction, f. This is one part in $2^{23}$, roughly $10^{-7}$, in single precision. Note that this is relative error, rather
than absolute error. If the result is on the order of $10^{12}$ then the roundoff error will be on the order of $10^5$.
This is sometimes expressed by saying that z = x + y produces $(x + y)(1 + \epsilon)$ where $|\epsilon| \le \epsilon_{\mathrm{mach}}$, and $\epsilon_{\mathrm{mach}}$,
the "machine epsilon", is on the order of $10^{-7}$ in single precision.
A related remark about IEEE arithmetic is its scale invariance. The relative distance between neighboring
floating point numbers does not depend too much on the size of the numbers, as long as they are not
denormalized. A common definition of $\epsilon_{\mathrm{mach}}$ is that it is the smallest positive floating point number so that
$1 + \epsilon_{\mathrm{mach}} \ne 1$. This is in keeping with a general principle in numerical computing. We should always measure
error in relative terms rather than absolute terms.
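The classical way to measure $\epsilon_{\mathrm{mach}}$ experimentally is to keep halving a number until adding it to 1 no longer makes a difference. A single precision version in C might be:

\begin{verbatim}
#include <stdio.h>

int main(void)
{
    /* Halve eps until 1 + eps/2 is indistinguishable from 1.  The cast
       to float forces the sum to be rounded to single precision even on
       machines that compute in higher precision internally. */
    float eps = 1.0f;
    while ((float)(1.0f + eps / 2.0f) != 1.0f)
        eps = eps / 2.0f;

    printf("single precision machine epsilon = %e\n", eps);
    /* prints about 1.2e-7, that is 2^-23, consistent with the
       one-part-in-2^23 figure above */
    return 0;
}
\end{verbatim}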
Double precision IEEE arithmetic uses 8 bytes (64 bits) rather than 4 bytes. There is one sign bit, 11
exponent bits, and 52 fraction bits. Therefore the double precision floating point precision is determined by
$\epsilon_{\mathrm{mach}} = 2^{-52} \approx 2 \cdot 10^{-16}$. That is, double precision arithmetic gives roughly 15 decimal digits of accuracy
instead of 7 for single precision. There are $2^{11} = 2048$ possible exponents in double precision, the effective
exponent ranging from $-1022$ to $1023$. The largest double precision number is of the order of $2^{1023} \approx 10^{308}$. Not only is double precision
arithmetic more accurate than single precision, but the range of numbers is far greater.
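These figures can be confirmed with the constants in the standard <float.h> header, for example:

\begin{verbatim}
#include <stdio.h>
#include <float.h>

int main(void)
{
    /* Precision: spacing of floating point numbers just above 1. */
    printf("FLT_EPSILON = %e\n", FLT_EPSILON);   /* about 1.2e-7,  2^-23 */
    printf("DBL_EPSILON = %e\n", DBL_EPSILON);   /* about 2.2e-16, 2^-52 */

    /* Range: largest representable numbers. */
    printf("FLT_MAX = %e\n", FLT_MAX);           /* about 3.4e38  */
    printf("DBL_MAX = %e\n", DBL_MAX);           /* about 1.8e308 */
    return 0;
}
\end{verbatim}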
Many features of IEEE arithmetic are illustrated in the program below. The reader would do well to
do some of this kind of experimentation for herself or himself. The first section illustrates how inf
and NaN work. Note that $e^{204}$ = inf in single precision but not in double precision because the range of
values is larger in double precision. The next section shows that IEEE addition is commutative. This is a
consequence of the “exact answer correctly rounded” procedure. The exact answer is commutative so either
way the computer has the same number to round. This commutativity does not apply to triples of numbers
because the computer only adds two numbers at a time. The compiler turns the expression x + y + z into
(x + y) + z. It adds x to y, rounds the answer and adds the result to z. The expression z + x + y causes
z + x to be rounded and added to y, which gives a (slightly) different result. Then comes an illustration
of type conversion. Doing the integer division i/j gives 1/2, which is truncated to the integer value 0, and
then converted to floating point format. When y is computed, the conversion to floating point format is
done before the division. Finally there is another example in which two variables might be expected to be
equal but aren’t because of inexact arithmetic. It is almost always wrong to ask whether two floating point
numbers are equal. Last is a serendipitous example of getting exactly the right answer by accident. Don’t
count on this!
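The program referred to above is not reproduced in these notes. A C sketch of some of the experiments it describes, with the particular numbers chosen here for illustration rather than taken from the notes, is:

\begin{verbatim}
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* inf: e^204 overflows in single precision but not in double. */
    float  es = (float)exp(204.0);
    double ed = exp(204.0);
    printf("single exp(204) = %e,  double exp(204) = %e\n", es, ed);

    /* Adding two numbers is commutative ...                        */
    float x = 1.0e8f, y = 1.0f, z = -1.0e8f;
    float c1 = x + y, c2 = y + x;
    printf("x + y == y + x : %d\n", c1 == c2);        /* 1 */

    /* ... but grouping matters: (x + y) + z versus (z + x) + y.    */
    float s1 = x + y;   /* 1.0e8 + 1 rounds back to 1.0e8 in single */
    float t1 = s1 + z;  /* 0 */
    float s2 = z + x;   /* exactly 0 */
    float t2 = s2 + y;  /* 1 */
    printf("(x + y) + z = %e,  (z + x) + y = %e\n", t1, t2);

    /* Type conversion: integer division truncates before conversion. */
    int i = 1, j = 2;
    float u = i / j;                /* integer division gives 0, then 0.0 */
    float v = (float)i / (float)j;  /* conversion first, then 0.5         */
    printf("u = %e,  v = %e\n", u, v);

    /* Quantities that "should" be equal often are not exactly equal. */
    float a = 1.0f / 3.0f;
    float b = 1.0f - 2.0f / 3.0f;
    printf("a == b : %d\n", a == b);                  /* 0 */
    return 0;
}
\end{verbatim}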