Floating Point Numbers: Do You Have Your Laptop Here?

31/10/2011
Floating Point Numbers

Question 6.626068 x 10^-34
Do you have your laptop here?
A yes B no C what’s a laptop D where is here? E none of the above
• Eddie Edwards
• [email protected]
• https://www.doc.ic.ac.uk/~eedwards/compsys
• Heavily based on notes from Naranker Dulay

Eddie Edwards 2008 Floating Point Numbers 7.2
Learning Outcomes

At the end of this lecture you should:
• Understand the representation of real numbers in other bases (e.g. 2)
• Know the mantissa/exponent representation (in base 10, 2 etc.)
• Be able to express numbers in normalised/un-normalised form
• Be able to convert fractions/decimals between bases
• Know the IEEE 754 floating point format (32 and 64 bit)
• Know the special values and when they should occur
• Understand the issues of accuracy in floating point representation

Number Representation - recap

We have seen how to represent integers:
• positive integers as binary, octal and hexadecimal
• negative integers as one's complement, two's complement, Excess-n
• BCD, ASCII, ...

We have also seen how to perform arithmetic:
• Addition
  by adding the binary bits
  overflow conditions
• Multiplication/division
  same “long hand” techniques as base 10
  slightly complicated in two's complement
  can take the absolute values, perform calculation, then sort out the sign
Numbers: Large, Small, Fractional

Population of the World             6,879,009,033 people
US National Debt (1990)             $3,144,830,000,000
1 Light Year                        9,460,000,000,000 km
Mass of the Sun                     2,000,000,000,000,000,000,000,000,000,000 kg
Diameter of an Electron             0.000,000,000,000,000,000,01 m
Mass of an Electron                 0.000,000,000,000,000,000,000,000,000,000,9 kg
Smallest Measurable Length of Time  0.000,000,000,000,000,000,000,000,000,000,000,000,000,000,1 sec
Pi (to 8 decimal places)            3.14159265...
Standard Rate of VAT                17.5%

Large Integers

Example: How can we represent integers up to 30 decimal digits long?

Binary   log2(10^30) = ~100 bits   (1 decimal digit = 3.322 bits)
BCD      30 x 4-bit  = 120 bits
ASCII    30 x 8-bit  = 240 bits

The Pentium includes instructions for writing multi-precision integer routines using Binary Coded Decimal (BCD) arithmetic & ASCII arithmetic.
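
As a rough check of the bit counts for a 30-digit integer, here is a minimal Python sketch. The function name `storage_bits` is made up for illustration and is not part of the lecture.

```python
import math

def storage_bits(decimal_digits: int) -> dict:
    """Bits needed to hold any integer of up to `decimal_digits` decimal digits."""
    largest = 10 ** decimal_digits - 1      # 999...9 with `decimal_digits` nines
    return {
        "binary": largest.bit_length(),     # pure binary encoding
        "bcd": decimal_digits * 4,          # one 4-bit nibble per decimal digit
        "ascii": decimal_digits * 8,        # one 8-bit character per decimal digit
    }

bits = storage_bits(30)
bits_per_digit = math.log2(10)              # roughly 3.322 bits per decimal digit
print(bits, round(bits_per_digit, 3))
```

Python integers are arbitrary precision, so `bit_length()` gives the exact binary width without any rounding concerns.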
Floating Point Numbers

Scientific Notation
Number = M x 10^E    Decimal
Number = M x 2^E     Binary

M is the Mantissa (or Significand or Fraction or Argument)
E is the Exponent (or Characteristic)
10 (or for binary, 2) is the Radix (or Base)

Digits (bits) in Exponent -> Range (Bigness/Smallness)
Digits (bits) in Mantissa -> Precision (Exactness)

Zones of Expressibility

Example: Assume numbers are formed with a Signed 3-digit Mantissa and a Signed 2-digit Exponent
Numbers span from ±.001 x 10^-99 to ±.999 x 10^+99

Zones of Expressibility (left to right along the number line):
Negative Overflow | Expressible -ve Nums | Negative Underflow | Zero | Positive Underflow | Expressible +ve Nums | Positive Overflow
Zone boundaries: -.999 x 10^+99, -.001 x 10^-99, 0, +.001 x 10^-99, +.999 x 10^+99
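
The zones can be sketched as a small classifier. This is only an illustration under the slide's assumptions (3-digit mantissa, 2-digit exponent); the function name `zone` and the use of Python's `decimal` module are choices made here, not part of the lecture.

```python
from decimal import Decimal

SMALLEST = Decimal("0.001e-99")   # smallest expressible magnitude
LARGEST = Decimal("0.999e99")     # largest expressible magnitude

def zone(x) -> str:
    """Name the zone of expressibility that the value x falls into."""
    x = Decimal(x)
    if x == 0:
        return "zero"
    side = "positive" if x > 0 else "negative"
    magnitude = abs(x)
    if magnitude > LARGEST:
        return f"{side} overflow"
    if magnitude < SMALLEST:
        return f"{side} underflow"
    return f"expressible {side}"

print(zone("1e100"), "|", zone("-1e-103"), "|", zone("0.5"))
```

Note that "expressible" here only means the value lies inside the range; a value in that zone may still need rounding to fit a 3-digit mantissa.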
Reals vs. Floating Point Numbers

                 Mathematical Real          Floating-point Number
Range            -Infinity .. +Infinity     Finite
No. of Values    Infinite                   Finite
Spacing          Constant & Infinite        Gap between numbers varies
Errors           ?                          Incorrect results are possible

Normalised Floating Point Numbers

Floating Point Numbers can have multiple forms, e.g.
0.232 x 10^4 = 2.32 x 10^3
             = 23.2 x 10^2
             = 2 320 x 10^0
             = 232 000 x 10^-2

For hardware implementation it's desirable for each number to have a unique representation => Normalised Form
We’ll normalise Mantissas in the Range [ 1 .. R ) where R is the Base, e.g.:
[ 1 .. 10 ) for DECIMAL
[ 1 .. 2 ) for BINARY
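
A minimal sketch of decimal normalisation into the range [ 1 .. 10 ), using Python's `decimal` module so that the digits stay exact. The helper name `normalise` is an assumption made for this example.

```python
from decimal import Decimal

def normalise(mantissa, exponent):
    """Return (m, e) with m x 10^e equal to mantissa x 10^exponent and 1 <= |m| < 10."""
    m = Decimal(mantissa)
    if m == 0:
        return m, 0                      # zero has no normalised form; returned as-is
    shift = m.adjusted()                 # power of ten of the leading digit
    return m.scaleb(-shift), exponent + shift

# All of these are the same number written in different forms.
for form in [("0.232", 4), ("23.2", 2), ("2320", 0), ("232000", -2)]:
    print(form, "->", normalise(*form))
```

Every form in the loop reduces to a mantissa of 2.32 with exponent 3, which is the point of having a unique normalised representation.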
Normalised Forms (Base 10)

Number                        Normalised Form
23.2 x 10^4                   2.32 x 10^5
-4.01 x 10^-3                 -4.01 x 10^-3
343 000 x 10^0                3.43 x 10^5
0.000 000 098 9 x 10^0        9.89 x 10^-8

Binary & Decimal Fractions

Binary    Decimal
0.1       0.5
0.01      0.25
0.001     0.125
0.11      0.75
0.111     0.875
0.011     0.375
0.101     0.625
Binary Fraction to Decimal Fraction

Example: What is the binary value 0.01101₂ in decimal?
Bits after the point:      0    1    1    0    1
Weights (out of 32):      16    8    4    2    1
Sum = 8 + 4 + 1 = 13
Answer: 13 / 32 = 0.40625

Example: What is 0.0 0011 0011 00₂ in decimal?
Answer: (32 + 16 + 2 + 1) / 512 = 51 / 512 = 0.099609375

Decimal Fraction to Binary Fraction

Example: What is 0.6875₁₀ in binary?
0.6875 * 2 = 1 .3750
0.3750 * 2 = 0 .7500
0.7500 * 2 = 1 .5000
0.5000 * 2 = 1 .0000
0.0000 * 2 = 0
Answer: 0.1011₂
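
Both conversions can be sketched in a few lines of Python. The function names are invented for this example, and `Fraction` is used so that the arithmetic is exact rather than itself subject to floating point error.

```python
from fractions import Fraction

def binary_fraction_to_decimal(bits: str) -> Fraction:
    """Value of the binary fraction 0.<bits>, e.g. '01101' -> 13/32."""
    bits = bits.replace(" ", "").replace("_", "")
    return Fraction(int(bits, 2), 2 ** len(bits))

def decimal_fraction_to_binary(value, max_bits: int = 24) -> str:
    """Bits after the binary point of 0 <= value < 1, by repeated doubling."""
    frac = Fraction(value)
    if not 0 <= frac < 1:
        raise ValueError("value must be a fraction in [0, 1)")
    bits = []
    while frac != 0 and len(bits) < max_bits:
        frac *= 2
        bit = 1 if frac >= 1 else 0      # the integer part is the next bit
        bits.append(str(bit))
        frac -= bit
    return "".join(bits)

print(float(binary_fraction_to_decimal("01101")))    # 13/32
print(decimal_fraction_to_binary("0.6875"))
print(decimal_fraction_to_binary("0.1", max_bits=12))
```

The `max_bits` limit matters because some decimal fractions, such as 0.1, never terminate in binary.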
What is 0.1₁₀ in binary?

0.1 * 2 = 0 .2
0.2 * 2 = 0 .4
0.4 * 2 = 0 .8
0.8 * 2 = 1 .6
0.6 * 2 = 1 .2
0.2 * 2 = 0 .4   and then repeating 0.4, 0.8, 0.6

Answer: 0.0 0011 0011 0011 0011 0011 0011 ...₂

Normalised Binary Floating Point Numbers

Number             Normalised Binary     Normalised Decimal
100.01 x 2^1       1.0001 x 2^3          8.5 x 10^0
1010.11 x 2^2      1.01011 x 2^5         4.3 x 10^1
0.00101 x 2^-2     1.01 x 2^-5           3.90625 x 10^-2
1100101 x 2^-2     1.100101 x 2^+4       2.525 x 10^1
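
Binary normalisation into [ 1 .. 2 ) can be sketched with the standard library's `math.frexp`, which returns a mantissa in [ 0.5 .. 1 ) and therefore needs a one-bit adjustment. The wrapper name `normalise_binary` is an assumption for this example.

```python
import math

def normalise_binary(x: float):
    """Return (m, e) with x == m * 2**e and 1 <= |m| < 2. x must be non-zero."""
    if x == 0:
        raise ValueError("zero has no normalised form")
    m, e = math.frexp(x)        # x == m * 2**e with 0.5 <= |m| < 1
    return m * 2, e - 1         # move one factor of two from exponent to mantissa

print(normalise_binary(8.5))         # 1.0001 (binary) is 1.0625 in decimal
print(normalise_binary(43.0))        # 1.01011 (binary) is 1.34375 in decimal
print(normalise_binary(0.0390625))   # 1.01 (binary) is 1.25 in decimal
```

The mantissas are printed in decimal; they correspond to the binary mantissas 1.0001, 1.01011 and 1.01 respectively.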
Floating Point Multiplication

N1 x N2 = (M1 x 10^E1) x (M2 x 10^E2)
        = (M1 x M2) x (10^E1 x 10^E2)
        = (M1 x M2) x 10^(E1+E2)

i.e. We multiply the Mantissas and Add the Exponents

Example: 20 * 6
= (2.0 x 10^1) x (6.0 x 10^0)
= (2.0 x 6.0) x 10^(1+0)
= 12.0 x 10^1
We must also normalise the result, so the final answer = 1.2 x 10^2

Truncation and Rounding

For many computations the result of a floating point operation can be too large to store in the Mantissa.

Example: with a 2-digit mantissa
2.3 x 10^1 * 2.3 x 10^1 = 5.29 x 10^2
TRUNCATION => 5.2 x 10^2 (Biased Error)
ROUNDING   => 5.3 x 10^2 (Unbiased Error)
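
The multiply, normalise, then truncate-or-round sequence can be sketched as follows. This is a decimal toy model, not how hardware does it; the function name `multiply` and its `digits` and `rounding` parameters are assumptions made for illustration.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

def multiply(m1, e1, m2, e2, digits=2, rounding=ROUND_HALF_UP):
    """Multiply (m1 x 10^e1) by (m2 x 10^e2), keeping `digits` mantissa digits."""
    m = Decimal(m1) * Decimal(m2)        # multiply the mantissas
    e = e1 + e2                          # add the exponents
    if m == 0:
        return Decimal(0), 0
    shift = m.adjusted()                 # renormalise into [1, 10)
    m, e = m.scaleb(-shift), e + shift
    step = Decimal(1).scaleb(1 - digits) # 0.1 for a 2-digit mantissa
    m = m.quantize(step, rounding=rounding)
    if abs(m) >= 10:                     # rounding carried into a new digit
        m, e = m.scaleb(-1).quantize(step), e + 1
    return m, e

print(multiply("2.0", 1, "6.0", 0))                        # 20 * 6
print(multiply("2.3", 1, "2.3", 1, rounding=ROUND_DOWN))   # truncation
print(multiply("2.3", 1, "2.3", 1))                        # rounding
```

`ROUND_DOWN` always discards the extra digits, which is why truncation errors are biased in one direction, while rounding errors go either way.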
Floating Point Addition

A floating point addition such as 4.5 x 10^3 + 6.7 x 10^2 is not a simple mantissa addition, unless the exponents are the same
=> we need to ensure that the mantissas are aligned first.

N1 + N2 = (M1 x 10^E1) + (M2 x 10^E2)
        = (M1 + M2 x 10^(E2-E1)) x 10^E1

To align, choose the number with the smaller exponent & shift mantissa the corresponding number of digits to the right.

Example: 4.5 x 10^3 + 6.7 x 10^2 = 4.5 x 10^3 + 0.67 x 10^3
                                 = 5.17 x 10^3
                                 = 5.2 x 10^3 (rounded)

Exponent Overflow & Underflow

EXPONENT OVERFLOW occurs when the Result is too Large
i.e. when the Result’s Exponent > Maximum Exponent
Example: if Max Exponent is 99 then 10^99 * 10^99 = 10^198 (overflow)
On Overflow => Proceed with incorrect value or infinity value or raise an Exception

EXPONENT UNDERFLOW occurs when the Result is too Small
i.e. when the Result’s Exponent < Smallest Exponent
Example: if Min Exp. is -99 then 10^-99 * 10^-99 = 10^-198 (underflow)
On Underflow => Proceed with zero value or raise an Exception
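
The alignment rule and the exponent range check can be combined in one small decimal model. This is a sketch under assumed policies: it raises an exception on overflow and proceeds with zero on underflow, which are just two of the options listed above. The name `add` and the constants `MAX_EXP` and `MIN_EXP` are chosen for this example.

```python
from decimal import Decimal, ROUND_HALF_UP

MAX_EXP, MIN_EXP = 99, -99           # exponent limits from the example above

def add(m1, e1, m2, e2, digits=2):
    """Add (m1 x 10^e1) and (m2 x 10^e2), keeping `digits` mantissa digits."""
    m1, m2 = Decimal(m1), Decimal(m2)
    if e1 < e2:                      # make (m1, e1) the operand with the larger exponent
        m1, e1, m2, e2 = m2, e2, m1, e1
    total = m1 + m2.scaleb(e2 - e1)  # shift the smaller operand right by e1 - e2 digits
    if total == 0:
        return Decimal(0), 0
    shift = total.adjusted()         # renormalise into [1, 10)
    m, e = total.scaleb(-shift), e1 + shift
    step = Decimal(1).scaleb(1 - digits)
    m = m.quantize(step, rounding=ROUND_HALF_UP)
    if abs(m) >= 10:                 # rounding carried into a new digit
        m, e = m.scaleb(-1).quantize(step), e + 1
    if e > MAX_EXP:
        raise OverflowError("exponent overflow")
    if e < MIN_EXP:
        return Decimal(0), 0         # exponent underflow: proceed with zero
    return m, e

print(add("4.5", 3, "6.7", 2))       # mantissas aligned as 4.5 + 0.67
print(add("1.1", -99, "-1.0", -99))  # result 1.0 x 10^-100 underflows
```

The second call shows that underflow is not limited to multiplication: subtracting two nearly equal small numbers can also push the exponent below the minimum.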
Comparing Floating-Point Values

Because of the potential for producing inexact results, comparing floating-point values should account for close results.
If we know the likely magnitude and precision of results we can adjust for closeness (epsilon), for example, for equality we can test:

a = b    as    a > (b - e) AND a < (b + e)
a = 1    as    a > (1 - 0.000005) AND a < (1 + 0.000005)
         i.e.  a > 0.999995 AND a < 1.000005

Alternatively we can calculate | a - b | < e    e.g. | a - 1 | < 0.000005
A more general approach is to calculate the closeness based on the relative size of the two numbers being compared.
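
Both styles of comparison can be sketched in Python. The helper names and the default relative tolerance of 1e-9 are assumptions for this example; the standard library's `math.isclose` offers a ready-made relative comparison.

```python
import math

def nearly_equal_abs(a: float, b: float, eps: float) -> bool:
    """Absolute test: | a - b | < eps."""
    return abs(a - b) < eps

def nearly_equal_rel(a: float, b: float, rel: float = 1e-9) -> bool:
    """Relative test: the allowed gap scales with the larger magnitude."""
    return abs(a - b) <= rel * max(abs(a), abs(b))

a = 0.1 + 0.2
print(a == 0.3)                                   # exact comparison fails
print(nearly_equal_abs(a, 0.3, 0.000005))
print(nearly_equal_rel(1e20 + 1e5, 1e20))         # large values need a relative test
print(math.isclose(a, 0.3))                       # library equivalent
```

A fixed epsilon of 0.000005 is sensible near 1 but far too strict near 10^20 and far too loose near 10^-20, which is why the relative form is the more general one.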
Floating point numbers - questions

What is the binary notation for 3.625?
A 11.011 B 10.101 C 11.101 D 101.11 E 11.11

What is binary 0.1101 in decimal?
A 0.8125 B 0.8 C 0.8625 D 0.9125 E 0.7865

IEEE Floating-Point Standard

IEEE: Institute of Electrical and Electronics Engineers (USA)
Comprehensive standard for Binary Floating-Point Arithmetic
Widely adopted => Predictable results independent of architecture
The standard defines:
• The format of binary floating-point numbers
• Semantics of arithmetic operations
• Rules for error conditions
Single Precision Format (32-bit)

Sign     Exponent    Significand
S        E           F
1 bit    8 bits      23 bits

The mantissa is called the SIGNIFICAND in the IEEE standard
Value represented = ± 1.F x 2^(E-127)        127 = 2^(8-1) - 1
The Normal Bit (the 1.) is omitted from the Significand field => a HIDDEN bit
Single precision yields 24 bits = ~7 decimal digits of precision
Normalised Ranges in decimal are approximately:
-10^+38 to -10^-38, 0, +10^-38 to +10^+38

Exponent Field

In the IEEE Standard, exponents are stored as Excess (Bias) Values, not as 2’s Complement Values
Example: In 8-bit Excess 127
-127 would be held as 0000 0000
 ...                  ...
   0 would be held as 0111 1111
   1 would be held as 1000 0000
 ...                  ...
 128 would be held as 1111 1111
Excess notation allows non-negative floating point numbers to be compared using simple integer comparisons, regardless of the absolute magnitude of the exponents.
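
The three fields can be inspected with Python's `struct` module, which packs a value as a 32-bit IEEE single. The helper names `float32_word` and `float32_fields` are assumptions for this sketch.

```python
import struct

def float32_word(x: float) -> int:
    """32-bit IEEE single precision pattern of x, as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def float32_fields(x: float):
    """Split the pattern into (sign, exponent field, significand field)."""
    word = float32_word(x)
    sign = word >> 31                  # 1 bit
    exponent = (word >> 23) & 0xFF     # 8 bits, stored in Excess 127
    fraction = word & 0x7FFFFF         # 23 bits, hidden bit not stored
    return sign, exponent, fraction

print(float32_fields(1.0))             # true exponent 0 is stored as 127
print(float32_fields(2.0))             # true exponent 1 is stored as 128

# Non-negative values keep their order when compared as plain integers.
values = [0.0, 0.375, 1.0, 1.5, 2.0, 1000.0]
words = [float32_word(v) for v in values]
print(words == sorted(words))
```

The final check illustrates the benefit of excess notation: because the biased exponent sits above the significand, larger non-negative values always have larger bit patterns.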
Double Precision Format (64-bit)

Sign     Exponent    Significand
S        E           F
1 bit    11 bits     52 bits

Value represented = ± 1.F x 2^(E-1023)       1023 = 2^(11-1) - 1
Yields 53 bits of precision = ~16 decimal digits of precision
Normalised Ranges in decimal are approximately:
-10^+308 to -10^-308, 0, +10^-308 to +10^+308
Double-Precision format is preferred for its greater precision. Single-precision is useful when memory is scarce and for debugging numerical calculations since rounding errors show up more quickly.

Example: Conversion to IEEE format

What is +42.6875 in IEEE Single Precision Format?
First convert to a binary number: 42.6875 = 10_1010.1011
Next normalise: 1.0101_0101_1 x 2^5
Significand field is therefore: 0101_0101_1000_0000_0000_000
Exponent field is (5 + 127 = 132): 1000_0100
Value in IEEE Single Precision is:
0 1000_0100 0101_0101_1000_0000_0000_000
Regrouped into 4-bit nibbles: 0100_0010_0010_1010_1100_0000_0000_0000
In hexadecimal this value is 422A_C000
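
The same steps (normalise, drop the hidden bit, bias the exponent) can be written out by hand and cross-checked against `struct`. This sketch only handles normalised numbers; `encode_single` and `reference_word` are names invented for the example.

```python
import math
import struct

def encode_single(x: float) -> int:
    """Hand-built IEEE single precision word for a normalised value x."""
    if x == 0 or not math.isfinite(x):
        raise ValueError("only non-zero finite values are handled")
    sign = 1 if x < 0 else 0
    m, e = math.frexp(abs(x))             # abs(x) == m * 2**e, 0.5 <= m < 1
    m, e = m * 2, e - 1                   # normalise so that 1 <= m < 2
    exponent_field = e + 127              # Excess 127
    if not 1 <= exponent_field <= 254:
        raise ValueError("outside the normalised single precision range")
    fraction_field = round((m - 1) * 2 ** 23)   # drop the hidden bit, keep 23 bits
    # Addition (not OR) lets a rounding carry ripple into the exponent field.
    return (sign << 31) + (exponent_field << 23) + fraction_field

def reference_word(x: float) -> int:
    """The same word as produced by the struct module, for comparison."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

print(f"{encode_single(42.6875):08X}")
print(encode_single(0.1) == reference_word(0.1))
```

For 42.6875 the printed word matches the hexadecimal value derived above; 0.1 is included because its significand has to be rounded to fit 23 bits.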

Example: Conversion from IEEE format

Convert the IEEE Single Precision Value given by BEC0_0000 to Decimal
BEC0_0000 = 1011_1110_1100_0000_0000_0000_0000_0000

Sign    Exponent     Significand
1       0111_1101    1000_0000_0000_0000_0000_000

Exponent Field = 0111_1101 = 125
True Binary Exponent = 125 - 127 = -2
Significand Field = 1000_0000_0000_0000_0000_000
Adding Hidden Bit = 1.1000_0000_0000_0000_0000_000
Therefore unsigned value = 1.1 x 2^-2 = 0.011 (binary)
                         = 0.25 + 0.125 = 0.375 (decimal)
Sign bit = 1 therefore number is -0.375

Example: Addition

Carry out the addition 42.6875 + 0.375 in IEEE single precision arithmetic.

Number     Sign    Exponent     Significand
42.6875    0       1000_0100    0101_0101_1000_0000_0000_000
0.375      0       0111_1101    1000_0000_0000_0000_0000_000

To add these numbers the exponents of the numbers must be the same => Make the smaller exponent equal to the larger exponent, shifting the mantissa accordingly.
Note: We must restore the Hidden bit when carrying out floating point operations.
Example: Addition Contd.

Significand of Larger No  = 1 . 0101_0101_1000_0000_0000_000
Significand of Smaller No = 1 . 1000_0000_0000_0000_0000_000

Exponents differ by +7 (1000_0100 - 0111_1101). Therefore shift binary point of smaller number 7 places to the left:

Significand of Smaller No = 0 . 0000_0011_0000_0000_0000_000
Significand of Larger No  = 1 . 0101_0101_1000_0000_0000_000
Significand of SUM        = 1 . 0101_1000_1000_0000_0000_000

Therefore SUM = 1 . 0101_1000_1 x 2^5 = 10_1011.0001 = 43.0625
= 0 1000_0100 0101_1000_1000_0000_0000_000 = 422C 4000H

Special Values

The IEEE format can represent five kinds of values: Zero, Normalised Numbers, Denormalised Numbers, Infinity and Not-A-Numbers (NaNs).
For single precision format we have the following representations:

IEEE Value           Sign Field    Exponent Field    Significand Field        True Exponent
± Zero               0 or 1        0                 0 (All zeroes)
± Denormalised No    0 or 1        0                 Any non-zero bit pat.    -126
± Normalised No      0 or 1        1 .. 254          Any bit pattern          -126 .. +127
± Infinity           0 or 1        255               0 (All zeroes)
Not-A-Number         0 or 1        255               Any non-zero bit pat.
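
The table of special values translates fairly directly into a decoder. The sketch below is an illustration only; `kind` and `decode_single` are hypothetical helper names, and the decoder works on the 32-bit word as a Python integer.

```python
def kind(word: int) -> str:
    """Classify a 32-bit single precision word using the table above."""
    exponent = (word >> 23) & 0xFF
    fraction = word & 0x7FFFFF
    if exponent == 0:
        return "zero" if fraction == 0 else "denormalised"
    if exponent == 255:
        return "infinity" if fraction == 0 else "nan"
    return "normalised"

def decode_single(word: int) -> float:
    """Value represented by a 32-bit single precision word."""
    sign = -1.0 if word >> 31 else 1.0
    exponent = (word >> 23) & 0xFF
    fraction = word & 0x7FFFFF
    if exponent == 255:
        return sign * float("inf") if fraction == 0 else float("nan")
    if exponent == 0:                                   # zero or denormalised: 0.F x 2^-126
        return sign * (fraction / 2 ** 23) * 2.0 ** -126
    return sign * (1 + fraction / 2 ** 23) * 2.0 ** (exponent - 127)   # 1.F x 2^(E-127)

for word in (0xBEC00000, 0x422C4000, 0x7F800000, 0x00000001):
    print(f"{word:08X}", kind(word), decode_single(word))
```

Python floats are doubles, so every single precision value, including the denormalised ones, is represented exactly by the returned result.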
Denormalised Numbers

An Exponent of All 0’s is used to represent Zero and Denormalised numbers, while All 1’s is used to represent Infinities and Not-A-Numbers (NaNs)
This means that the maximum range for normalised numbers is reduced, i.e. for Single Precision the range is -126 .. +127 rather than -127 .. +128 as one might expect for Excess 127.
Denormalised Numbers represent values between the Underflow limits and zero, i.e. for single precision we have:
± 0.F x 2^-126
Traditionally a “flush-to-zero” is done when an underflow occurs
Denormalised numbers allow a more gradual shift to zero, and are useful in a few numerical applications
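
The boundary between normalised and denormalised single precision values can be examined by building the bit patterns directly. This is a sketch using the `struct` module; `float32_from_word` is a helper name chosen for the example.

```python
import struct

def float32_from_word(word: int) -> float:
    """Interpret a 32-bit pattern as an IEEE single precision value."""
    return struct.unpack(">f", struct.pack(">I", word))[0]

smallest_normal = float32_from_word(0x00800000)     # exponent field 1, F = 0
largest_denormal = float32_from_word(0x007FFFFF)    # exponent field 0, F = all ones
smallest_denormal = float32_from_word(0x00000001)   # exponent field 0, F = 0...01

print(smallest_normal, largest_denormal, smallest_denormal)
# The step from the largest denormal to the smallest normal is one denormal step.
print(smallest_normal - largest_denormal == smallest_denormal)
```

With flush-to-zero, everything below the smallest normalised value would collapse to 0; with denormalised numbers the values keep shrinking in equal steps of 2^-149 until they reach zero.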
IEEE 754 floating point numbers - questions

What decimal is represented by the hex word C0CA0000?
Answer: -6.3125

What hex word is -0.75 in IEEE-754?
Answer: BFE8000000000000 (64-bit double precision; the 32-bit single precision word would be BF400000)

Infinities and NaNs

Infinities (both positive & negative) are used to represent values that exceed the overflow limits, and for operations like Divide by Zero
Infinities behave as in Mathematics, e.g.
Infinity + 5 = Infinity, -Infinity + -Infinity = -Infinity

Not-A-Numbers (NaNs) are used to represent the results of operations which have no mathematical interpretation, e.g.
0 / 0, +Infinity + -Infinity, 0 x Infinity, Square root of a -ve number
Operations with a NaN operand yield either a NaN result (quiet NaN operand) or an exception (signalling NaN operand)
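
These rules can be tried out with Python floats, which are IEEE doubles. One caveat: Python itself takes the "raise an exception" route for `1.0 / 0.0` (`ZeroDivisionError`) and `math.sqrt(-1.0)` (`ValueError`) rather than returning an infinity or a NaN. The helper `safe_sqrt` below is a made-up wrapper that maps that exception back to a NaN.

```python
import math

inf = math.inf
nan = inf - inf                      # +Infinity + -Infinity has no meaning, so NaN

examples = {
    "inf + 5": inf + 5,
    "-inf + -inf": -inf + -inf,
    "0 * inf": 0 * inf,
    "nan + 1": nan + 1,              # a quiet NaN operand propagates
}

def safe_sqrt(x: float) -> float:
    """Square root that returns NaN for negative input instead of raising."""
    try:
        return math.sqrt(x)
    except ValueError:
        return math.nan

for name, value in examples.items():
    print(name, "=", value)
print(nan == nan)                    # NaN compares unequal to everything, itself included
```

Because a NaN never compares equal to itself, use `math.isnan` to test for one.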
This lecture - feedback
The pace of the lecture was:

A. much too fast B. too fast C. about right D. too slow E. much too slow
The learning objectives were met:

A. Fully B. Mostly C. Partially D. Slightly E. Not at all