IEEE 754 Floating Point Notes

This document provides an overview of floating point arithmetic and normalized real numbers. It discusses how real numbers are represented in decimal, other bases, and the IEEE 754 format. Key points include:
- Real numbers can be represented in normalized scientific notation as a significand/mantissa and exponent.
- IEEE 754 uses a sign bit, exponent field, and fraction field to represent numbers. It also defines special values like infinities and NaNs.
- Converting between decimal and IEEE 754 involves normalizing and assembling the sign, exponent, and fraction pieces.

Uploaded by

Nathaniel R. Reindl


CS 312, Spring 2009: Some Notes on Floating Point Arithmetic.

[email protected]

Introduction

Blah, blah, say some words to introduce the topic, maybe provide an example, whatever. On second
thought, let's just dive head-first into things and see where the current takes us.

Normalized Real Numbers in Decimal

Before we hit the meat of the topic too terribly hard, let's briefly review scientific notation, significant places, and the process for normalizing real numbers in the base that we're familiar with: base
10. Additionally, we say "real number" to mean any number $x \in \mathbb{R}$, the set of real numbers.
Scientific notation is a way of representing a real number that would otherwise be cumbersome
to write conventionally. We've all seen this several thousand times already, so I'll skip most
of the detail except for the form of the representation and what it means for a real number to be
normalized.
That said, a normalized real number $x$ in scientific notation has the form
$$x = a \times 10^{b}, \tag{1}$$
where $1 \le |a| < 10$. We call the coefficient $a$ the significand, mantissa, or fraction, and
$b$ the characteristic or exponent. In 312, we used the terms fraction and exponent.
The process of normalization itself is braindead simple to the point that we can probably do it
in our sleep. Informally, we just shift the decimal point as appropriate to ensure $1 \le |a| < 10$ and
keep count of how many places we shift it. That count is $b$: positive if the point moves left, negative if it moves right. For example, $12345.678 = 1.2345678 \times 10^{4}$ and $0.00042 = 4.2 \times 10^{-4}$.

Normalized Real Numbers in Some Base

Now that we've hit the material with which we're all familiar, we can finally abstract the select few
parts so that we can glean more insight as to how this mess works as a whole.
Really, the changes to the definition we discussed earlier are minimal. A normalized real number
$x$ in a base $\beta$ in floating-point notation has the form
$$x = a \times \beta^{b}, \tag{2}$$
where $1 \le |a| < \beta$. The same vocabulary applies, and so on. The process of normalization is even
the same.
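As a rough illustration of normalization in an arbitrary base, here is a minimal Python sketch. The function name `normalize` and its signature are made up for this example and are not from the course text. It only handles finite, nonzero inputs of ordinary magnitude, and because it works in floating point itself, the returned coefficient is only as exact as the division allows.

```python
import math


def normalize(x: float, base: int = 10) -> tuple[float, int]:
    """Return (a, b) such that x == a * base**b and 1 <= |a| < base.

    Hypothetical helper for illustration; zero cannot be normalized.
    """
    if x == 0 or not math.isfinite(x):
        raise ValueError("only finite, nonzero values can be normalized")
    # First guess at the exponent: how many places the point must move.
    b = math.floor(math.log(abs(x), base))
    a = x / base**b
    # log() can be off by one ulp near exact powers of the base, so repair.
    if abs(a) >= base:
        a /= base
        b += 1
    elif abs(a) < 1:
        a *= base
        b -= 1
    return a, b


print(normalize(12345.678, 10))
print(normalize(0.15625, 2))   # (1.25, -3)
```

With `base=2` the coefficient always lands in the range from 1 up to (but not including) 2, which is the property the IEEE 754 format relies on.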

Normalized Real Numbers in IEEE 754 Format

It may seem like we're beating this normalization bit pretty heavily, but we intend to say a few
different words in this section compared to the others. Namely, we intend to introduce the mathematical representation and some of the associated analysis formulas for a real number represented
using IEEE 754, and we intend to explain some of the features of this format. For specific examples
and calculations, we assume that we're dealing with double-precision floating-point numbers.

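Having said that, a real number $x$ in IEEE 754 double-precision floating-point format (a C/C++ double)
has the form
$$x = (-1)^{s} \times 2^{c-b} \times (1 + m), \tag{3}$$
where $s$ ($l_s = 1$ bit) is an integer that represents the sign of $x$, $c$ ($l_c = 11$ bits) is the stored exponent field
obtained by normalizing $x$, $b = 1023$ is the bias subtracted from $c$ so that the unsigned field can encode both positive and negative exponents,
and $m$ ($l_m = 52$ bits) is the fraction obtained by normalizing $x$, with $0 \le m < 1$. A single-precision float has the same form with $l_c = 8$, $l_m = 23$, and $b = 127$.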
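To make the three fields concrete, the following Python sketch pulls $s$, $c$, and the raw 52-bit fraction field out of a double and then rebuilds the value using (3). It assumes that Python's `float` is an IEEE 754 double, which holds on mainstream platforms. The helper names `decode_double` and `rebuild_normal` are invented for this example, and `rebuild_normal` is only meaningful for normalized numbers.

```python
import struct

BIAS = 1023  # b in equation (3)


def decode_double(x: float) -> tuple[int, int, int]:
    """Split a double into (s, c, fraction field) as unsigned integers."""
    # Reinterpret the 8 bytes of the double as one 64-bit unsigned integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    s = bits >> 63                    # 1 sign bit
    c = (bits >> 52) & 0x7FF          # 11 exponent bits
    frac = bits & ((1 << 52) - 1)     # 52 fraction bits
    return s, c, frac


def rebuild_normal(s: int, c: int, frac: int) -> float:
    """Evaluate (-1)^s * 2^(c - b) * (1 + m) for a normalized number."""
    m = frac / 2**52                  # fraction field read as a value in [0, 1)
    return (-1) ** s * 2.0 ** (c - BIAS) * (1 + m)


s, c, frac = decode_double(-2.5)
print(s, c - BIAS, frac / 2**52)      # 1 1 0.25, since -2.5 = -(1 + 0.25) * 2^1
```

Every step in `rebuild_normal` is exact for normalized inputs, so decoding and rebuilding returns the original value.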
Question. Why do we multiply $(1 + m)$ by $2^{c-b}$ and not just $m$?
Answer. Since we're dealing with a binary representation of our number, we consider $\beta = 2$. This
means that, since $1 \le |a| < \beta$, the digit of $|a|$ to the left of the binary point is always 1. Hence, our number $x_2$ in normalized
form will always be
$$x_2 = 1.m^{(1)} m^{(2)} \ldots m^{(l_m)} \times 2^{b},$$
where each $m^{(i)}$ represents an individual bit and $b$ here is the exponent from (2), not the bias. The biggest consequence of this happens to be the
fact that we don't store the leading 1. In essence, we get 53 bits of precision for the
price of 52.
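We computed the bias $b = 1023$ by noting that the length of the bitfield for our exponent is
$l_c = 11$ bits and then performing
$$b = \left\lfloor \frac{2^{l_c} - 1}{2} \right\rfloor = \left\lfloor \frac{2^{11} - 1}{2} \right\rfloor = \left\lfloor \frac{2047}{2} \right\rfloor = 1023. \tag{4}$$
From here, we can compute $e_{\min}$ by
$$e_{\min} = 1 - b = -1022 \tag{5}$$
and $e_{\max}$ by
$$e_{\max} = 2^{l_c} - b - 2 = 1023. \tag{6}$$
Here $c = 1$ and $c = 2^{l_c} - 2$ are the smallest and largest exponent fields available to normalized numbers.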
Question. OK, wait. Why isn't it the case that $e_{\min} = -1023$ and $e_{\max} = 1024$? Why are (5)
and (6) both off by one?
Answer. The IEEE 754 floating-point representation uses $e_{\min} - 1$ and $e_{\max} + 1$ as special values to
encode non-numeric results like NaNs and $\pm\infty$, as well as zeroes and denormalized numbers. We'll discuss the specifics
of these later.
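The bias and exponent limits can be checked with a few lines of Python. This is a sketch assuming Python's `float` is an IEEE 754 double; the constant name `L_C` is chosen here for illustration. Note that `sys.float_info` follows the C convention of a significand in [0.5, 1), so its `min_exp` and `max_exp` are each one larger than the exponents used in these notes.

```python
import sys

L_C = 11                          # width of the exponent field in bits

bias = (2**L_C - 1) // 2          # floor(2047 / 2)
e_min = 1 - bias                  # smallest stored exponent for normals is c = 1
e_max = (2**L_C - 2) - bias       # largest stored exponent for normals is c = 2046

print(bias, e_min, e_max)         # 1023 -1022 1023

# Cross-check against the platform's own description of its doubles.
print(sys.float_info.min_exp - 1, sys.float_info.max_exp - 1)
```

The two stored exponent values left over, `c = 0` and `c = 2047`, are exactly the reserved codes mentioned in the answer above.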
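Going further, we can compute the minimum positive fraction $m_{\min}$ by
$$m_{\min} = 1 \times 2^{-l_m} = 1 \times 2^{-52} = \frac{1}{4\,503\,599\,627\,370\,496} \tag{7}$$
and the maximum positive fraction $m_{\max}$ by
$$m_{\max} = \sum_{i=1}^{l_m} 1 \times 2^{-i} = \sum_{i=1}^{52} 1 \times 2^{-i} = \frac{4\,503\,599\,627\,370\,495}{4\,503\,599\,627\,370\,496}. \tag{8}$$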

Now that we have $e_{\min}$, $e_{\max}$, $m_{\min}$, and $m_{\max}$, we can find the values for $x_{\min}$ and $x_{\max}$.
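To do this, we refer back to (3) and plug in values as appropriate. For $x_{\min}$, we let $s = 0$,
$c - b = e_{\min}$, and $m = m_{\min}$. We obtain then
$$x_{\min} = (-1)^{0} \times 2^{e_{\min}} \times (1 + m_{\min}) \tag{9}$$
$$= (-1)^{0} \times 2^{-1022} \times \left(1 + \frac{1}{4\,503\,599\,627\,370\,496}\right) \tag{10}$$
$$\approx 2.22507 \times 10^{-308}. \tag{11}$$
Strictly speaking, $m = 0$ is also a valid fraction for a normalized number, so the smallest positive normalized value is $2^{-1022}$ itself; it agrees with (11) to the six digits shown.

Similarly, for $x_{\max}$, we let $s = 0$, $c - b = e_{\max}$, and $m = m_{\max}$. Hence,
$$x_{\max} = (-1)^{0} \times 2^{e_{\max}} \times (1 + m_{\max}) \tag{12}$$
$$= (-1)^{0} \times 2^{1023} \times \left(1 + \frac{4\,503\,599\,627\,370\,495}{4\,503\,599\,627\,370\,496}\right) \tag{13}$$
$$\approx 1.79769 \times 10^{308}. \tag{14}$$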
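The fractions and limits above can be reproduced numerically. The sketch below assumes Python's `float` is an IEEE 754 double and uses exact rational arithmetic from `fractions` for the fraction values; the variable names mirror the symbols in the text and are otherwise arbitrary.

```python
import sys
from fractions import Fraction

L_M = 52                                   # width of the fraction field in bits

m_min = Fraction(1, 2**L_M)
m_max = sum(Fraction(1, 2**i) for i in range(1, L_M + 1))

# Plug into x = 2^e * (1 + m) with s = 0; both products are exact in binary64.
x_min = 2.0**-1022 * float(1 + m_min)
x_max = 2.0**1023 * float(1 + m_max)

print(m_max == Fraction(2**L_M - 1, 2**L_M))   # True
print(f"{x_min:.5e}", f"{x_max:.5e}")          # 2.22507e-308 1.79769e+308

# x_max is the largest finite double; x_min, as computed in the notes,
# sits one step above the smallest positive normalized double.
print(x_max == sys.float_info.max, x_min > sys.float_info.min)
```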

Since we've hinted at it already in this writing, we should probably enumerate the special cases
of the representation sooner or later. Doing it now doesn't sound like a bad idea. The only problem
with Table 1 is that we haven't yet covered what a denormalized number is.

$c - b$                       $m$         Object Represented
$e_{\min} - 1$                0           zeroes
$e_{\min} - 1$                nonzero     denormalized numbers
$e_{\min}$ to $e_{\max}$      anything    normalized numbers
$e_{\max} + 1$                0           infinities
$e_{\max} + 1$                nonzero     NaNs

Table 1: IEEE 754 encoding of floating-point numbers. The stored field is $c = 0$ in the first two rows and $c = 2^{l_c} - 1 = 2047$ in the last two.
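Question. What is a denormalized number?
Answer. A denormalized number is a floating-point number that fills in the
gap between zero and the smallest positive (or largest negative) normalized number. It uses the reserved exponent field $c = 0$ and has no implicit leading 1, so its value is $(-1)^{s} \times 2^{e_{\min}} \times m$ with $m \ne 0$. These are mostly important
to consider from a numerical analysis standpoint because they have interesting implications when
involved in arithmetic operations: with denormalized numbers available, $a - b = 0$ only when $a = b$ for finite $a$ and $b$, so dividing by $a - b$ for $a \ne b$ never turns into a division by zero.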
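The rows of Table 1 translate directly into a small classifier on the stored fields. This is an illustrative sketch, assuming Python's `float` is an IEEE 754 double and that the platform does not flush denormalized results to zero (the usual default). The function name `classify` is made up for this example.

```python
import struct
import sys


def classify(x: float) -> str:
    """Name the Table 1 row that a double falls into."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    c = (bits >> 52) & 0x7FF           # stored exponent field
    frac = bits & ((1 << 52) - 1)      # stored fraction field
    if c == 0:                         # c - b = e_min - 1
        return "zero" if frac == 0 else "denormalized"
    if c == 0x7FF:                     # c - b = e_max + 1
        return "infinity" if frac == 0 else "NaN"
    return "normalized"


for value in (0.0, 5e-324, 1.0, float("inf"), float("nan")):
    print(value, classify(value))

# Two distinct normalized numbers whose difference is too small to normalize:
a = 1.5 * sys.float_info.min
b = sys.float_info.min
print(a - b != 0.0, classify(a - b))   # True denormalized
```

The last two lines show the point made in the answer above: the difference of two unequal values lands on a denormalized number rather than collapsing to zero.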

More Floating-Point Numbers

So, what more is there? For starters, we can talk about arithmetic operations like addition, subtraction, and so on and their pitfalls, but we'll probably leave this for last or near to last.
There's also the possibility of discussing conversions between decimal representations and IEEE
754 representations, which we'll most likely discuss next. Furthermore, we can even get into some of
the interesting analytic topics like the spacing of IEEE 754 values on a number line or error analysis
or ordering or the like, but it's unlikely that we'll get too far with this because of the mathematical
background of the participants in 312 and because this is not a numerical analysis
course.
Still, there's the topic of rounding, but that also tends to involve numerical analysis.
5.1 Converting from Decimal to IEEE 754

The conversion from decimal to IEEE 754 isn't exactly the most straightforward at first, but the
process pretty much goes like the following. Assume that we have a real number $x$ in decimal.
1. Convert $x$ to binary.
2. Normalize the binary representation of $x$.
3. Let $e$ be the number of binary places shifted in the normalization process (positive if the point moves left, negative if it moves right). Set $c - b = e$
in (3).
4. Assemble the pieces: the sign bit $s$, the exponent field $c = e + b$, and the fraction $m$.
5. ???
6. PROFIT!
There are exercises in the text that cover this, but just for the sake of being at least halfway
complete here, let's work out an example.
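Example. Convert 3.14159 to IEEE 754 double-precision format.
Solution. First, we convert 3.14159 to binary, which yields
$$\pi_2 \approx 11.001001000011111101_2.$$
We normalize $\pi_2$, which gets us
$$\pi_2 \approx 1.1001001000011111101_2 \times 10_2^{1}.$$
This means that $e = 1$. We can now start assembling the pieces. We have
$$\pi_2 \approx (-1)^{0} \times 2^{1} \times (1 + 2^{-1} + 2^{-4} + 2^{-7} + 2^{-12} + \cdots + 2^{-17} + 2^{-19}).$$
In terms of the stored fields, $s = 0$, $c = e + b = 1024$, and $m$ holds the 19 bits after the binary point followed by zeroes.
This is our IEEE 754 double-precision representation for an approximation of $\pi$ to six significant
digits.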
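The worked example can be checked by assembling the hand-derived fields into an actual double and comparing against what the machine stores for 3.14159. This sketch assumes Python's `float` is an IEEE 754 double; the helper names `fields` and `from_fields` are invented here. The hand-worked fraction keeps only 19 bits, whereas the machine keeps all 52, so the two agree in sign, exponent, and leading fraction bits but not bit for bit.

```python
import struct

BIAS = 1023


def fields(x: float) -> tuple[int, int, int]:
    """Return (s, c, fraction field) of a double."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)


def from_fields(s: int, c: int, frac: int) -> float:
    """Pack the three fields back into a double."""
    bits = (s << 63) | (c << 52) | frac
    (x,) = struct.unpack(">d", struct.pack(">Q", bits))
    return x


# Steps 1-3 done by hand in the example: 1.1001001000011111101_2 x 2^1.
e = 1
hand_bits = "1001001000011111101"
# Step 4: left-align the 19 fraction bits in the 52-bit field, zero-fill the rest.
hand_frac = int(hand_bits, 2) << (52 - len(hand_bits))
hand_value = from_fields(0, e + BIAS, hand_frac)

s, c, frac = fields(3.14159)
print(hand_value)                      # the value the hand-worked bits encode
print(s, c, format(frac, "052b"))      # what the machine stores for 3.14159
```

Printing both makes it easy to see where the 19-bit hand conversion and the full 52-bit conversion start to diverge.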
