IEEE 754 Floating Point Notes

This document provides an overview of floating point arithmetic and normalized real numbers. It discusses how real numbers are represented in decimal, other bases, and the IEEE 754 format. Key points include:
- Real numbers can be represented in normalized scientific notation as a significand/mantissa and exponent.
- IEEE 754 uses a sign bit, exponent field, and fraction field to represent numbers. It also defines special values like infinities and NaNs.
- Converting between decimal and IEEE 754 involves normalizing and assembling the sign, exponent, and fraction pieces.

CS 312, Spring 2009: Some Notes on Floating Point Arithmetic.

[email protected]

Introduction

Blah, blah, say some words to introduce the topic, maybe provide an example, whatever. On second
thought, let's just dive head-first into things and see where the current takes us.

Normalized Real Numbers in Decimal

Before we hit the meat of the topic too terribly hard, let's briefly review scientific notation, significant places, and the process for normalizing real numbers in the base that we're familiar with: base 10. Additionally, we say real number to mean any number x in R, the set of real numbers.
Scientific notation is a way of representing a real number that would otherwise be cumbersome to write conventionally. We've all seen this several thousand times already, so I'll skip most of the detail except for the form of the representation and what it means for a real number to be normalized.
That said, a normalized nonzero real number x in scientific notation has the form

$$ x = a \times 10^{b}, \qquad (1) $$

where $1 \le |a| < 10$ and b is an integer. We call the coefficient a one of either the significand, mantissa, or fraction and b the characteristic or exponent. In 312, we used the terms fraction and exponent.
The process of normalization itself is braindead simple to the point that we can probably do it in our sleep. Informally, we just shift the decimal point as appropriate to ensure $1 \le |a| < 10$ and keep count of how many places we shift it. The count of places is, hence, b: positive when the point moves left and negative when it moves right. For example, $1234.5 = 1.2345 \times 10^{3}$ and $0.00042 = 4.2 \times 10^{-4}$.

Normalized Real Numbers in Some Base

Now that we've hit the material with which we're all familiar, we can finally abstract the select few parts so that we can glean more insight as to how this mess works as a whole.
Really, the changes to the definition we discussed earlier are minimal. A normalized nonzero real number x in a base $\beta$ in floating-point notation has the form

$$ x = a \times \beta^{b}, \qquad (2) $$

where $1 \le |a| < \beta$. The same vocabulary applies, and so on. The process of normalization is even the same.
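
The shifting process is mechanical enough to write down. What follows is a minimal sketch, not anything from the course text: the function name `normalize` is made up for illustration, and it uses Python's `Fraction` type so that the arithmetic stays exact while we shift.

```python
from fractions import Fraction

def normalize(x, base=10):
    """Return (a, b) with x == a * base**b and 1 <= |a| < base."""
    x = Fraction(x)
    if x == 0:
        raise ValueError("zero has no normalized form")
    a, b = x, 0
    while abs(a) >= base:   # point moves left: exponent goes up
        a /= base
        b += 1
    while abs(a) < 1:       # point moves right: exponent goes down
        a *= base
        b -= 1
    return a, b

a, b = normalize(Fraction("1234.5"))
print(a, b)                    # 2469/2000 3, i.e. 1.2345 x 10^3
print(normalize(12, base=2))   # (Fraction(3, 2), 3), i.e. 1.1_2 x 2^3
```

Nothing about the loop depends on the base, which is the whole point of this section: the base only changes the bound that a has to stay under.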

Normalized Real Numbers in IEEE 754 Format

It may seem like we're beating this normalization bit pretty heavily, but we intend to say a few different words in this section compared to the others. Namely, we intend to introduce the mathematical representation and some of the associated analysis formulas for a real number represented using IEEE 754, and we intend to explain some of the features of this format. For specific examples and calculations, we assume that we're dealing with double-precision floating-point numbers.

Having said that, a real number x in IEEE 754 double-precision floating-point format (C/C++ double on most platforms) has the form

$$ x = (-1)^{s} \times 2^{c-b} \times (1 + m), \qquad (3) $$

where s ($l_s = 1$ bit) is an integer that represents the sign of x, c ($l_c = 11$ bits) is the stored (biased) exponent, $b = 1023$ is the bias subtracted from c to recover the signed exponent $c - b$ obtained by normalizing x, and m ($l_m = 52$ bits) is the fraction obtained by normalizing x, with $0 \le m < 1$.
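
To see the three fields in a real double, we can pull the bits apart and then put them back together with the formula above. This is only a sketch: the helper names `fields` and `value` are made up here, and it assumes that Python's `float` is an IEEE 754 double, which holds on essentially every current platform.

```python
import struct

BIAS = 1023  # the bias b

def fields(x):
    """Split a double into (s, c, m_bits): sign, stored exponent, raw 52 fraction bits."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    s = bits >> 63
    c = (bits >> 52) & 0x7FF
    m_bits = bits & ((1 << 52) - 1)
    return s, c, m_bits

def value(s, c, m_bits):
    """Evaluate (-1)^s * 2^(c-b) * (1 + m) for a normalized number (1 <= c <= 2046)."""
    m = m_bits / 2**52          # fraction as a real number in [0, 1)
    return (-1) ** s * 2.0 ** (c - BIAS) * (1 + m)

print(fields(1.0))            # (0, 1023, 0)
print(fields(-2.5))           # (1, 1024, 1125899906842624), i.e. m = 0.25
print(value(*fields(-2.5)))   # -2.5
```

Note that `value` only covers normalized numbers. The reserved exponent values c = 0 and c = 2047 need separate handling.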
Question. Why do we multiply $(1 + m)$ by $2^{c-b}$ and not just m?
Answer. Since we're dealing with a binary representation of our number, we consider $\beta = 2$. This means then that, since $1 \le |a| < \beta$, the digit in front of the binary point of |a| is always 1. Hence, our number $x_2$ in normalized form will always be

$$ x_2 = 1.m^{(1)} m^{(2)} \ldots m^{(l_m)} \times 2^{c-b}, $$

where each $m^{(i)}$ represents an individual bit. The biggest consequence of this happens to be the fact that we don't need to store the leading 1, since it is implied. In essence, we get 53 bits of precision for the price of 52.
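
A quick way to convince ourselves of the 53 bits is to poke at the integers around $2^{53}$. This is a small sketch, again assuming that Python's `float` is an IEEE 754 double.

```python
import sys

print(sys.float_info.mant_dig)   # 53

big = 2.0 ** 53
print(big - 1 == 2**53 - 1)   # True: 2^53 - 1 fits in 53 bits, so it is exact
print(big + 1 == big)         # True: 2^53 + 1 needs 54 bits and gets rounded away
```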
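We computed the bias $b = 1023$ by noting that the length of the bitfield for our exponent is $l_c = 11$ bits and then performing

$$ b = \left\lfloor \frac{2^{l_c} - 1}{2} \right\rfloor = \left\lfloor \frac{2^{11} - 1}{2} \right\rfloor = \left\lfloor \frac{2047}{2} \right\rfloor = 1023. \qquad (4) $$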
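From here, we can compute $e_{\min}$ by

$$ e_{\min} = 1 - b = -1022 \qquad (5) $$

and $e_{\max}$ by

$$ e_{\max} = 2^{l_c} - b - 2 = 1023. \qquad (6) $$

Here 1 and $2^{l_c} - 2 = 2046$ are the smallest and largest values of the stored exponent c that are used for normalized numbers.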
Question. OK, wait. Why isn't it the case that $e_{\min} = -1023$ and $e_{\max} = 1024$? Why are (5) and (6) both off by one?
Answer. The IEEE 754 floating-point representation reserves the exponents $e_{\min} - 1$ and $e_{\max} + 1$ (stored as $c = 0$ and $c = 2^{l_c} - 1$) to encode special values: zeroes and denormalized numbers at $e_{\min} - 1$, and $\pm\infty$ and NaNs at $e_{\max} + 1$. We'll discuss the specifics of these later.
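
The bias and the exponent range are easy to recompute from the width of the exponent field. The sketch below does that and compares against what Python reports in `sys.float_info`. One caveat, which belongs to the C convention and not to these notes: `min_exp` and `max_exp` assume a significand in [0.5, 1) rather than [1, 2), so they sit one above our values.

```python
import sys

l_c = 11                       # width of the exponent field in bits
bias = (2**l_c - 1) // 2
e_max = 2**l_c - bias - 2      # largest normalized stored exponent (2^l_c - 2) minus the bias
e_min = 1 - bias               # smallest normalized stored exponent (1) minus the bias
print(bias, e_min, e_max)      # 1023 -1022 1023

# Shift the C-style values down by one to match the 1 <= |a| < 2 convention.
print(sys.float_info.min_exp - 1, sys.float_info.max_exp - 1)   # -1022 1023
```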
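Going further, we can compute the minimum positive fraction $m_{\min}$ by

$$ m_{\min} = 1 \times 2^{-l_m} = 1 \times 2^{-52} = \frac{1}{4\,503\,599\,627\,370\,496} \qquad (7) $$

and the maximum positive fraction $m_{\max}$ by

$$ m_{\max} = \sum_{i=1}^{l_m} 1 \times 2^{-i} = \sum_{i=1}^{52} 1 \times 2^{-i} = \frac{4\,503\,599\,627\,370\,495}{4\,503\,599\,627\,370\,496}. \qquad (8) $$

Now that we have $e_{\min}$, $e_{\max}$, $m_{\min}$, and $m_{\max}$, we can find the values for $x_{\min}$ and $x_{\max}$.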
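
Both fractions can be checked with exact rational arithmetic. This is a minimal sketch using Python's `Fraction`. The comparison with `sys.float_info.epsilon` at the end assumes Python's `float` is an IEEE 754 double.

```python
import sys
from fractions import Fraction

l_m = 52
m_min = Fraction(1, 2**l_m)
m_max = sum(Fraction(1, 2**i) for i in range(1, l_m + 1))

print(m_min)                # 1/4503599627370496
print(m_max)                # 4503599627370495/4503599627370496
print(m_max == 1 - m_min)   # True: the geometric sum stops one step short of 1

# m_min is the gap between 1.0 and the next double, the machine epsilon.
print(float(m_min) == sys.float_info.epsilon)   # True
```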
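To do this, we refer back to (3) and plug in values as appropriate. For $x_{\min}$, the smallest positive normalized number, we let $s = 0$, $c - b = e_{\min}$, and $m = 0$. (The fraction is allowed to be zero, so plugging in $m_{\min}$ would give the second-smallest normalized value, which agrees with the smallest to six significant digits.) We obtain then

$$ x_{\min} = (-1)^{0} \times 2^{e_{\min}} \times (1 + 0) \qquad (9) $$
$$ = (-1)^{0} \times 2^{-1022} \times 1 \qquad (10) $$
$$ \approx 2.22507 \times 10^{-308}. \qquad (11) $$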

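Similarly, for $x_{\max}$, we let $s = 0$, $c - b = e_{\max}$, and $m = m_{\max}$. Hence,

$$ x_{\max} = (-1)^{0} \times 2^{e_{\max}} \times (1 + m_{\max}) \qquad (12) $$
$$ = (-1)^{0} \times 2^{1023} \times \left(1 + \frac{4\,503\,599\,627\,370\,495}{4\,503\,599\,627\,370\,496}\right) \qquad (13) $$
$$ \approx 1.79769 \times 10^{308}. \qquad (14) $$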
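
These two limits are exactly what Python reports as `sys.float_info.min` and `sys.float_info.max`, so we can rebuild them from the pieces and compare. This is a sketch under the assumption that `float` is an IEEE 754 double. Note that `sys.float_info.min` corresponds to a fraction field of zero, and that plugging in $m_{\min}$ instead lands on the next double up.

```python
import math
import sys

m_max = (2**52 - 1) / 2**52                  # exactly representable
x_max = math.ldexp(1 + m_max, 1023)          # (1 + m_max) * 2^1023
x_min = math.ldexp(1.0, -1022)               # (1 + 0) * 2^-1022
x_next = math.ldexp(1 + 2.0**-52, -1022)     # (1 + m_min) * 2^-1022

print(x_max == sys.float_info.max)   # True
print(x_min == sys.float_info.min)   # True
print(x_next > x_min)                # True
print(f"{x_min:.5e} {x_max:.5e}")    # 2.22507e-308 1.79769e+308
```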

Since we've hinted at it already in this writing, we should probably enumerate the special cases of the representation sooner or later. Doing it now doesn't sound like a bad idea.

c - b             stored c      m           Object Represented
e_min - 1         0             0           zeroes
e_min - 1         0             nonzero     denormalized numbers
e_min to e_max    1 to 2046     anything    normalized numbers
e_max + 1         2047          0           infinities
e_max + 1         2047          nonzero     NaNs

Table 1: IEEE 754 encoding of floating-point numbers.

The only problem with Table 1 is that we haven't yet covered what a denormalized number is.
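
The table translates almost line for line into code. Here is a minimal sketch of a classifier that looks only at the stored exponent and fraction fields. The function name `classify` is made up for illustration, and it assumes Python's `float` is an IEEE 754 double.

```python
import struct

def classify(x):
    """Name the row of the encoding table that the double x falls in."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    c = (bits >> 52) & 0x7FF          # stored exponent, 11 bits
    m = bits & ((1 << 52) - 1)        # fraction field, 52 bits
    if c == 0:                        # c - b = e_min - 1
        return "zero" if m == 0 else "denormalized"
    if c == 0x7FF:                    # c - b = e_max + 1
        return "infinity" if m == 0 else "NaN"
    return "normalized"

for x in (0.0, -0.0, 5e-324, 1.0, float("inf"), float("nan")):
    print(x, classify(x))
```

The sign bit plays no part in the classification, which is why there are two zeroes and two infinities.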
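Question. What is a denormalized number?
Answer. A denormalized number is a floating-point number that fills in the gap between zero and the smallest positive (or largest negative) normalized number. It has a stored exponent of $c = 0$ and a nonzero fraction, drops the implicit leading 1, and has the value $(-1)^{s} \times 2^{e_{\min}} \times m$. These are mostly important to consider from a numerical analysis standpoint because they have interesting implications when involved in arithmetic operations: with them, $a - b = 0$ only when $a = b$, so a guard like $a \ne b$ is enough to keep $1/(a - b)$ from dividing by zero. Without them, $a - b$ could underflow to zero for $a \ne b$.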
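
We can watch a subtraction land in the denormalized range. This is a small sketch with values picked by hand for illustration. It assumes Python's `float` is an IEEE 754 double and that denormalized results are not flushed to zero, which is the default behavior.

```python
import math
import sys

a = math.ldexp(1.5, -1022)
b = math.ldexp(1.0, -1022)    # smallest positive normalized double
d = a - b                     # exactly 2^-1023, too small to be normalized

print(a != b, d != 0.0)               # True True
print(0.0 < d < sys.float_info.min)   # True: d sits in the gap below x_min
print(math.ldexp(1.0, -1074))         # 5e-324, the smallest positive denormalized double
```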

More Floating-Point Numbers

So, what more is there? For starters, we can talk about arithmetic operations like addition, subtraction, and so on and their pitfalls, but we'll probably leave this for last or near to last.
There's also the possibility for discussing conversions between decimal representations and IEEE 754 representations, which we'll most likely discuss next. Furthermore, we can even get into some of the interesting analytic topics like the spacing of IEEE 754 values on a number line or error analysis or ordering or the like, but it's unlikely that we'll get too far with this because of the mathematical background of the participants in 312 and because of the fact that this is not a numerical analysis course.
Still, there's the topic of rounding, but that also tends to involve numerical analysis.

Converting from Decimal to IEEE 754

The conversion from decimal to IEEE 754 isn't exactly the most straightforward at first, but the process pretty much goes like the following. Assume that we have a real number x in decimal.
1. Convert x to binary.
2. Normalize the binary representation of x.
3. Let e be the number of binary places shifted in the normalization process (positive for a shift to the left). Set c - b = e in (3), so that c = e + b.
4. Assemble the pieces: the sign bit s, the stored exponent c, and the fraction bits m that follow the leading 1.
5. ???
6. PROFIT!
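
Step 1 is the only part that takes real work by hand, so here is a sketch of it: the integer part converts directly, and the fractional bits come from repeated doubling. The name `to_binary` is made up for illustration, and the function truncates rather than rounds.

```python
import math
from fractions import Fraction

def to_binary(x, frac_bits):
    """Binary expansion of a nonnegative x, truncated to frac_bits fractional bits."""
    x = Fraction(x)
    whole = int(x)           # integer part
    frac = x - whole         # fractional part, kept exact
    digits = []
    for _ in range(frac_bits):
        frac *= 2
        bit = int(frac)      # 1 if doubling carried past the binary point
        digits.append(str(bit))
        frac -= bit
    return f"{whole:b}." + "".join(digits)

print(to_binary(Fraction("0.625"), 3))      # 0.101
print(to_binary(Fraction("3.14159"), 18))   # 11.001001000011111100
print(to_binary(math.pi, 18))               # 11.001001000011111101
```

The last two lines differ only in the final bit: truncating 3.14159 ends in 00, while rounding it to 18 fractional bits, or truncating pi itself, ends in 01.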
There are exercises in the text that cover this, but just for the sake of being at least halfway
complete here, let's work out an example.
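Example. Convert 3.14159 to IEEE 754 double-precision format.
Solution. First, we convert 3.14159 to binary, rounding to 18 fractional bits, which yields

$$ \pi_2 \approx 11.001001000011111101_2. $$

We normalize $\pi_2$, which gets us

$$ \pi_2 \approx 1.1001001000011111101_2 \times 10_2^{1}. $$

This means that $e = 1$. We can now start assembling the pieces. We have

$$ \pi_2 \approx (-1)^{0} \times 2^{1} \times (1 + 2^{-1} + 2^{-4} + 2^{-7} + 2^{-12} + \cdots + 2^{-17} + 2^{-19}), $$

so $s = 0$, $c = e + b = 1024 = 10000000000_2$, and the fraction field holds the bits 1001001000011111101 followed by zeroes. This is our IEEE 754 double-precision representation for an approximation of $\pi$ to six significant digits.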
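
The remaining steps can be sketched in a few lines and checked against the bits the machine actually stores. The helper name `to_ieee754` is made up for illustration. The sketch assumes Python's `float` is an IEEE 754 double and only handles normalized, finite, nonzero inputs.

```python
import math
import struct

BIAS = 1023

def to_ieee754(x):
    """Return (s, c, fraction_bits) for a normalized, finite, nonzero double x."""
    s = 0 if math.copysign(1.0, x) > 0 else 1
    mant, exp = math.frexp(abs(x))    # abs(x) == mant * 2**exp with 0.5 <= mant < 1
    a, e = mant * 2, exp - 1          # renormalize so that 1 <= a < 2
    c = e + BIAS                      # stored exponent
    frac = int((a - 1) * 2**52)       # the 52 bits after the leading 1
    return s, c, format(frac, "052b")

x = 3 + 37117 / 2**18                 # 11.001001000011111101 in binary
s, c, m = to_ieee754(x)
print(s, c, format(c, "011b"))        # 0 1024 10000000000
print(m)                              # 1001001000011111101 followed by 33 zeroes

# Cross-check against the 64 bits stored for x.
(bits,) = struct.unpack(">Q", struct.pack(">d", x))
assert format(bits, "064b") == str(s) + format(c, "011b") + m
```

Calling `to_ieee754(3.14159)` directly gives the same sign and exponent but a fraction field with many more nonzero bits, since the double nearest to 3.14159 uses all 52 of them.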
