The IEEE Standard For Floating Point Arithmetic
The IEEE (Institute of Electrical and Electronics Engineers) has produced a standard for
floating point arithmetic. This standard specifies how single precision (32 bit) and double
precision (64 bit) floating point numbers are to be represented, as well as how arithmetic
should be carried out on them.
Because many of our users may have occasion to transfer unformatted or "binary" data
between an IEEE machine and the Cray or the VAX/VMS, it is worth noting the details of
this format for comparison with the Cray and VAX representations. The differences in the
formats also affect the accuracy of floating point computations.
Summary:
Single Precision
The IEEE single precision floating point standard representation requires a 32 bit word,
which may be represented as numbered from 0 to 31, left to right. The first bit is the sign
bit, S, the next eight bits are the exponent bits, 'E', and the final 23 bits are the fraction 'F':
S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF
0 1      8 9                    31
In particular,
0 00000000 00000000000000000000000 = 0
1 00000000 00000000000000000000000 = -0
0 11111111 00000000000000000000000 = Infinity
1 11111111 00000000000000000000000 = -Infinity
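These patterns are easy to verify with a few lines of C. The sketch below (assuming float maps to IEEE single precision on your machine, and using C99's INFINITY macro from <math.h>) dumps the raw bits of each value as hex:

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Dump the raw 32-bit pattern of a float as hex. */
    static void show(const char *label, float x) {
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);   /* well-defined way to view the bits */
        printf("%-4s = 0x%08X\n", label, (unsigned)bits);
    }

    int main(void) {
        show("+0",  0.0f);       /* 0x00000000 */
        show("-0", -0.0f);       /* 0x80000000 */
        show("+Inf",  INFINITY); /* 0x7F800000 */
        show("-Inf", -INFINITY); /* 0xFF800000 */
        return 0;
    }

The sign bit flips the leading hex digit between 0 and 8, and the all-1s exponent of the infinities shows up as the 7F8/FF8 prefix.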
Double Precision
The IEEE double precision floating point standard representation requires a 64 bit word,
which may be represented as numbered from 0 to 63, left to right. The first bit is the sign
bit, S, the next eleven bits are the exponent bits, 'E', and the final 52 bits are the fraction
'F':
S EEEEEEEEEEE FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
0 1        11 12                                                63
IEEE Standard 754 floating point is the most common representation today for real
numbers on computers, including Intel-based PC's, Macintoshes, and most Unix
platforms. This article gives a brief overview of IEEE floating point and its
representation. Discussion of arithmetic implementation may be found in the book
mentioned at the bottom of this article.
Storage Layout
IEEE floating point numbers have three basic components: the sign, the exponent, and
the mantissa. The mantissa is composed of the fraction and an implicit leading digit
(explained below). The exponent base (2) is implicit and need not be stored.
The following figure shows the layout for single (32-bit) and double (64-bit) precision
floating-point values. The number of bits for each field is shown (bit ranges are in
square brackets):
                    Sign     Exponent     Fraction     Bias
Single Precision    1 [31]   8 [30-23]    23 [22-00]   127
Double Precision    1 [63]   11 [62-52]   52 [51-00]   1023
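As a concrete check of this layout, here's a small C sketch (assuming float maps to the IEEE single format, as it does on essentially all current hardware) that pulls the three fields out of a value with shifts and masks:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        float x = -6.25f;               /* -6.25 = -1.5625 * 2^2 */
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits); /* view the float's raw bits */

        unsigned sign     = bits >> 31;          /* bit  [31]    */
        unsigned exponent = (bits >> 23) & 0xFF; /* bits [30-23] */
        unsigned fraction = bits & 0x7FFFFF;     /* bits [22-00] */

        /* expect: sign=1, exponent=129 (2 + bias of 127), fraction=0x480000 */
        printf("sign=%u exponent=%u fraction=0x%06X\n",
               sign, exponent, fraction);
        return 0;
    }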
The Exponent
The exponent field needs to represent both positive and negative exponents. To do this, a
bias is added to the actual exponent in order to get the stored exponent. For IEEE single-precision floats, this value is 127. Thus, an exponent of zero means that 127 is stored in
the exponent field. A stored value of 200 indicates an exponent of (200-127), or 73. For
reasons discussed later, exponents of -127 (all 0s) and +128 (all 1s) are reserved for
special numbers.
For double precision, the exponent field is 11 bits, and has a bias of 1023.
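To see the bias in action (a minimal check, under the same IEEE assumptions as above), read back the stored exponent field for 1.0, whose actual exponent is zero:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        float  f = 1.0f;   /* 1.0 * 2^0, so the actual exponent is 0 */
        double d = 1.0;
        uint32_t fbits;
        uint64_t dbits;
        memcpy(&fbits, &f, sizeof fbits);
        memcpy(&dbits, &d, sizeof dbits);
        /* stored = actual + bias: expect 127 and 1023 */
        printf("float  exponent field: %u\n", (unsigned)((fbits >> 23) & 0xFF));
        printf("double exponent field: %u\n", (unsigned)((dbits >> 52) & 0x7FF));
        return 0;
    }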
The Mantissa
The mantissa, also known as the significand, represents the precision bits of the number.
It is composed of an implicit leading bit and the fraction bits.
To find out the value of the implicit leading bit, consider that any number can be
expressed in scientific notation in many different ways. For example, the number five can
be represented as any of these:
5.00 × 10^0
0.05 × 10^2
5000 × 10^-3
In normalized form, the radix point is placed just after the first non-zero digit, so five is written 5.0 × 10^0. In base two, the only possible non-zero digit is 1, which means a normalized binary mantissa always starts with a leading 1. That bit can therefore be assumed rather than stored, and the mantissa gets an effective 24 bits of precision out of 23 stored fraction bits.
Even so, a floating-point value usually only approximates the real number it stands for; 24 bits of precision cannot represent every value exactly. On the other hand, besides the ability to represent fractional components (which integers lack completely), a floating-point value can represent numbers around 2^127, compared to a 32-bit integer's maximum value of around 2^32.
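Here's a short demonstration of that trade-off between range and precision (a sketch, assuming IEEE single precision): 2^24 + 1 = 16777217 has no exact 24-bit representation, so it rounds back to 2^24:

    #include <stdio.h>

    int main(void) {
        /* 2^24 = 16777216 is exactly representable; 2^24 + 1 is not,
           because the mantissa carries only 24 significant bits. */
        float a = 16777216.0f;   /* 2^24                          */
        float b = 16777217.0f;   /* 2^24 + 1 rounds back to 2^24  */
        printf("%.1f %.1f equal=%d\n", a, b, a == b);   /* equal=1 */
        return 0;
    }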
The range of positive floating point numbers can be split into normalized numbers (which
preserve the full precision of the mantissa), and denormalized numbers (discussed later)
which use only a portion of the fraction's precision.
                    Denormalized                      Normalized                       Approximate Decimal
Single Precision    2^-149 to (1-2^-23) × 2^-126      2^-126 to (2-2^-23) × 2^127      ~10^-44.85 to ~10^38.53
Double Precision    2^-1074 to (1-2^-52) × 2^-1022    2^-1022 to (2-2^-52) × 2^1023    ~10^-323.3 to ~10^308.3
Since the sign of floating point numbers is given by a special leading bit, the range for
negative numbers is given by the negation of the above values.
There are five distinct numerical ranges that single-precision floating-point numbers are
not able to represent:
1. Negative numbers less than -(2-2^-23) × 2^127 (negative overflow)
2. Negative numbers greater than -2^-149 (negative underflow)
3. Zero
4. Positive numbers less than 2^-149 (positive underflow)
5. Positive numbers greater than (2-2^-23) × 2^127 (positive overflow)
Overflow means that values have grown too large for the representation, much in the
same way that you can overflow integers. Underflow is a less serious problem because it
just denotes a loss of precision, which is guaranteed to be closely approximated by zero.
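Both effects are easy to provoke. This sketch uses FLT_MAX and FLT_TRUE_MIN (the smallest denormal, standardized in C11) from <float.h>, and assumes IEEE semantics with the default round-to-nearest mode:

    #include <float.h>
    #include <stdio.h>

    int main(void) {
        float big  = FLT_MAX;       /* ~3.4e38, largest finite single   */
        float tiny = FLT_TRUE_MIN;  /* 2^-149, smallest denormal (C11)  */

        printf("overflow : %g * 2 -> %g\n", big,  big * 2.0f);  /* -> inf */
        printf("underflow: %g / 2 -> %g\n", tiny, tiny / 2.0f); /* -> 0   */
        return 0;
    }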
Here's a table of the effective range (excluding infinite values) of IEEE floating-point
numbers:
         Binary                   Decimal
Single   ±(2-2^-23) × 2^127       ~ ±10^38.53
Double   ±(2-2^-52) × 2^1023      ~ ±10^308.25
Note that the extreme values occur (regardless of sign) when the exponent is at the
maximum value for finite numbers (2^127 for single-precision, 2^1023 for double), and the
mantissa is filled with 1s (including the normalizing 1 bit).
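You can confirm that the formula matches the library constant; the arithmetic below is carried out in double, where (2-2^-23) × 2^127 is exactly representable:

    #include <float.h>
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        /* maximum exponent (127) with the mantissa all 1s: (2 - 2^-23) * 2^127 */
        double max = (2.0 - ldexp(1.0, -23)) * ldexp(1.0, 127);
        printf("formula: %.8e\n", max);
        printf("FLT_MAX: %.8e\n", (double)FLT_MAX);
        printf("equal  : %d\n", max == (double)FLT_MAX);  /* 1 */
        return 0;
    }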
Special Values
IEEE reserves exponent field values of all 0s and all 1s to denote special values in the
floating-point scheme.
Zero
As mentioned above, zero is not directly representable in the straight format, due to the
assumption of a leading 1 (we'd need to specify a true zero mantissa to yield a value of
zero). Zero is a special value denoted with an exponent field of zero and a fraction field
of zero. Note that -0 and +0 are distinct values, though they both compare as equal.
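This sketch shows both behaviors, using C99's signbit() macro to tell the two zeros apart (the divisions assume IEEE semantics, where a finite number divided by zero yields a signed infinity):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        float pz = 0.0f, nz = -0.0f;
        printf("+0 == -0   : %d\n", pz == nz);                   /* 1 */
        printf("signbit(+0): %d\n", signbit(pz) != 0);           /* 0 */
        printf("signbit(-0): %d\n", signbit(nz) != 0);           /* 1 */
        printf("1/+0 = %g, 1/-0 = %g\n", 1.0f / pz, 1.0f / nz);  /* inf, -inf */
        return 0;
    }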
Denormalized
If the exponent is all 0s, but the fraction is non-zero (else it would be interpreted as zero),
then the value is a denormalized number, which does not have an assumed leading 1
before the binary point. Thus, this represents a number (-1)^s × 0.f × 2^-126, where s is the
sign bit and f is the fraction. For double precision, denormalized numbers are of the form
(-1)^s × 0.f × 2^-1022. From this you can interpret zero as a special type of denormalized
number.
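Here's a sketch that builds the smallest single-precision denormal straight from its bit pattern (exponent field all 0s, fraction field 1); it should print about 1.4e-45, which is 2^-149:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        uint32_t bits = 0x00000001;  /* exponent field all 0s, fraction = 1 */
        float smallest;
        memcpy(&smallest, &bits, sizeof smallest);
        printf("%g\n", smallest);    /* ~1.4013e-45, i.e. 2^-149 */
        return 0;
    }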
Infinity
The values +infinity and -infinity are denoted with an exponent of all 1s and a fraction of
all 0s. The sign bit distinguishes between negative infinity and positive infinity. Being
able to denote infinity as a specific value is useful because it allows operations to
continue past overflow situations. Operations with infinite values are well defined in
IEEE floating point.
Not A Number
The value NaN (Not a Number) is used to represent a value that does not represent a real
number. NaN's are represented by a bit pattern with an exponent of all 1s and a non-zero
fraction. There are two categories of NaN: QNaN (Quiet NaN) and SNaN (Signalling
NaN).
A QNaN is a NaN with the most significant fraction bit set. QNaN's propagate freely
through most arithmetic operations. These values pop out of an operation when the result
is not mathematically defined.
An SNaN is a NaN with the most significant fraction bit clear. It is used to signal an
exception when used in operations. SNaN's can be handy to assign to uninitialized
variables to trap premature usage.
Semantically, QNaN's denote indeterminate operations, while SNaN's denote invalid
operations.
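The sketch below produces a QNaN from 0/0 (an invalid operation) and shows the two properties worth remembering: a NaN compares unequal to everything, including itself, and it propagates through arithmetic. Strictly speaking, C only guarantees this behavior under IEEE (Annex F) semantics:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        float zero = 0.0f;
        float qnan = zero / zero;   /* invalid operation -> quiet NaN */

        printf("isnan(qnan)  : %d\n", isnan(qnan) != 0);  /* 1 */
        printf("qnan == qnan : %d\n", qnan == qnan);      /* 0: NaN is unordered */
        printf("qnan + 1.0   : %f\n", qnan + 1.0f);       /* nan: QNaNs propagate */
        return 0;
    }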
Special Operations
Operations on special numbers are well-defined by IEEE. In the simplest case, any
operation with a NaN yields a NaN result. Other operations are as follows:
Operation               Result
n × Infinity            Infinity
Infinity + Infinity     Infinity
0 ÷ 0                   NaN
Infinity - Infinity     NaN
Infinity ÷ Infinity     NaN
Infinity × 0            NaN
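These rows are straightforward to reproduce in C (assuming IEEE semantics; INFINITY is C99's <math.h> macro):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double inf = INFINITY, zero = 0.0;
        printf("3 * Infinity        = %g\n", 3.0 * inf);   /* inf */
        printf("Infinity + Infinity = %g\n", inf + inf);   /* inf */
        printf("0 / 0               = %g\n", zero / zero); /* nan */
        printf("Infinity - Infinity = %g\n", inf - inf);   /* nan */
        printf("Infinity / Infinity = %g\n", inf / inf);   /* nan */
        printf("Infinity * 0        = %g\n", inf * zero);  /* nan */
        return 0;
    }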
Summary
To sum up, the following are the corresponding values for a given representation:
Float Values (b = bias)
Sign   Exponent (e)      Fraction (f)       Value
0      00..00            00..00             +0
0      00..00            00..01 : 11..11    Positive Denormalized Real (0.f × 2^(-b+1))
0      00..01 : 11..10   XX..XX             Positive Normalized Real (1.f × 2^(e-b))
0      11..11            00..00             +Infinity
0      11..11            00..01 : 01..11    SNaN
0      11..11            10..00 : 11..11    QNaN
1      00..00            00..00             -0
1      00..00            00..01 : 11..11    Negative Denormalized Real (-0.f × 2^(-b+1))
1      00..01 : 11..10   XX..XX             Negative Normalized Real (-1.f × 2^(e-b))
1      11..11            00..00             -Infinity
1      11..11            00..01 : 01..11    SNaN
1      11..11            10..00 : 11..11    QNaN
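The table translates almost line-for-line into code. Here's a sketch of a classifier driven by the exponent and fraction fields of a single-precision value; the standard library's fpclassify() macro does the same job portably:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Classify a single-precision value from its exponent and fraction
       fields, following the table above. */
    static const char *classify(float x) {
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);
        uint32_t e = (bits >> 23) & 0xFFu;  /* exponent field */
        uint32_t f = bits & 0x7FFFFFu;      /* fraction field */

        if (e == 0)    return f == 0 ? "zero" : "denormalized";
        if (e == 0xFF) return f == 0 ? "infinity" : "NaN";
        return "normalized";
    }

    int main(void) {
        float zero = 0.0f;
        printf("%s %s %s %s\n",
               classify(zero), classify(1.0f),
               classify(1e-45f), classify(zero / zero));
        /* prints: zero normalized denormalized NaN */
        return 0;
    }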
References
A lot of this stuff was observed from small programs I wrote to go back and forth
between hex and floating point (printf-style), and to examine the results of various
operations. The bulk of this material, however, was lifted from Stallings' book.
1. Computer Organization and Architecture, William Stallings, pp. 222-234,
Macmillan Publishing Company, ISBN 0-02-415480-6.
2. IEEE Computer Society (1985), IEEE Standard for Binary Floating-Point
Arithmetic, IEEE Std 754-1985.
3. Intel Architecture Software Developer's Manual, Volume 1: Basic Architecture (a
PDF document downloaded from intel.com).