Floating Point Numbers: Do You Have Your Laptop Here?


31/10/2011

Floating Point Numbers


Question 6.626068 x 10^-34

Do you have your laptop here?

A yes   B no   C what’s a laptop   D where is here?   E none of the above
• Eddie Edwards

• https://www.doc.ic.ac.uk/~eedwards/compsys

• Heavily based on notes from Naranker Dulay


Eddie Edwards 2008 Floating Point Numbers 7.2

Learning Outcomes

At the end of this lecture you should:
• Understand the representation of real numbers in other bases (e.g. 2)
• Know the mantissa/exponent representation (in base 10, 2 etc.)
• Be able to express numbers in normalised/un-normalised form
• Be able to convert fractions/decimals between bases
• Know the IEEE 754 floating point format (32 and 64 bit)
• Know the special values and when they should occur
• Understand the issues of accuracy in floating point representation

Number Representation - recap

• We have seen how to represent integers
    • positive integers as binary, octal and hexadecimal
    • negative integers as one's complement, two's complement, Excess-n
    • BCD, ASCII, ...

• We have also seen how to perform arithmetic
    • Addition
        • by adding the binary bits
        • overflow conditions
    • Multiplication/division
        • same “long hand” techniques as base 10
        • slightly complicated in two's complement
        • can take the absolute values, perform calculation, then sort out the sign


Numbers: Large, Small, Fractional

Population of the World               6,879,009,033 people
US National Debt (1990)               $3,144,830,000,000
1 Light Year                          9,460,000,000,000 km
Mass of the Sun                       2,000,000,000,000,000,000,000,000,000,000 kg
Diameter of an Electron               0.000,000,000,000,000,000,01 m
Mass of an Electron                   0.000,000,000,000,000,000,000,000,000,000,9 kg
Smallest Measurable Length of Time    0.000,000,000,000,000,000,000,000,000,000,000,000,000,000,1 sec
Pi (to 8 decimal places)              3.14159265...
Standard Rate of VAT                  17.5%

Large Integers

Example: How can we represent integers up to 30 decimal digits long?

• Binary   log2(10^30) = ~100 bits (1 decimal digit = 3.322 bits)
• BCD      30 x 4-bit = 120 bits
• ASCII    30 x 8-bit = 240 bits

The Pentium includes instructions for writing multi-precision integer routines using Binary Coded Decimal (BCD) arithmetic & ASCII arithmetic.
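The three storage estimates for a 30-digit integer can be checked with a minimal Python sketch. The function name storage_bits is an illustrative assumption, not part of the lecture.

```python
def storage_bits(decimal_digits):
    """Bits needed to store any unsigned integer of the given decimal length."""
    largest = 10 ** decimal_digits - 1       # e.g. thirty nines for 30 digits
    return {
        "binary": largest.bit_length(),      # pure binary representation
        "bcd": decimal_digits * 4,           # one 4-bit nibble per decimal digit
        "ascii": decimal_digits * 8,         # one 8-bit character per decimal digit
    }

print(storage_bits(30))
```

For 30 digits this gives 100, 120 and 240 bits, matching the figures on the slide.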


Floating Point Numbers

Scientific Notation

    Number = M x 10^E    Decimal
    Number = M x 2^E     Binary

• M is the Mantissa (or Significand or Fraction or Argument)
• E is the Exponent (or Characteristic)
• 10 (or for binary, 2) is the Radix (or Base)
• Digits (bits) in Exponent -> Range (Bigness/Smallness)
• Digits (bits) in Mantissa -> Precision (Exactness)

Zones of Expressibility

• Example: Assume numbers are formed with a Signed 3-digit Mantissa and a Signed 2-digit Exponent
• Numbers span from ±.001 x 10^-99 to ±.999 x 10^+99

Zones of Expressibility, left to right along the number line:

    Negative Overflow       below -.999 x 10^+99
    Expressible -ve Nums    -.999 x 10^+99 to -.001 x 10^-99
    Negative Underflow      between -.001 x 10^-99 and 0
    Zero                    0
    Positive Underflow      between 0 and +.001 x 10^-99
    Expressible +ve Nums    +.001 x 10^-99 to +.999 x 10^+99
    Positive Overflow       above +.999 x 10^+99

Reals vs. Floating Point Numbers

                     Mathematical Real         Floating-point Number
    Range            -Infinity .. +Infinity    Finite
    No. of Values    Infinite                  Finite
    Spacing          Constant & Infinite       Gap between numbers varies
    Errors           ?                         Incorrect results are possible

Normalised Floating Point Numbers

• Floating Point Numbers can have multiple forms, e.g.

    0.232 x 10^4 = 2.32 x 10^3
                 = 23.2 x 10^2
                 = 2 320 x 10^0
                 = 232 000 x 10^-2

• For hardware implementation it's desirable for each number to have a unique representation => Normalised Form

• We’ll normalise Mantissas in the Range [ 1 .. R ) where R is the Base, e.g.:

    [ 1 .. 10 ) for DECIMAL
    [ 1 .. 2 )  for BINARY


Normalised Forms (Base 10)

    Number                        Normalised Form
    23.2 x 10^4                   2.32 x 10^5
    -4.01 x 10^-3                 -4.01 x 10^-3
    343 000 x 10^0                3.43 x 10^5
    0.000 000 098 9 x 10^0        9.89 x 10^-8

Binary & Decimal Fractions

    Binary    Decimal
    0.1       0.5
    0.01      0.25
    0.001     0.125
    0.11      0.75
    0.111     0.875
    0.011     0.375
    0.101     0.625
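The rows of the normalised-form table can be reproduced with a short Python sketch that shifts the mantissa into the range [ 1 .. R ) and adjusts the exponent to compensate. The function name normalise and the use of exact fractions are assumptions made for illustration.

```python
from fractions import Fraction

def normalise(mantissa, exponent, base=10):
    """Return (m, e) with 1 <= |m| < base and m * base**e unchanged in value."""
    m = Fraction(mantissa)            # exact arithmetic, no rounding surprises
    if m == 0:
        return m, 0                   # zero has no normalised form; return as-is
    while abs(m) >= base:             # mantissa too big: shift right, bump exponent
        m /= base
        exponent += 1
    while abs(m) < 1:                 # mantissa too small: shift left, drop exponent
        m *= base
        exponent -= 1
    return m, exponent

for number, exp in [("23.2", 4), ("-4.01", -3), ("343000", 0), ("0.0000000989", 0)]:
    m, e = normalise(number, exp)
    print(f"{number} x 10^{exp} -> {float(m)} x 10^{e}")
```

The same function works for binary by passing base=2, for example normalise(Fraction(17, 4), 1, base=2) for 100.01 x 2^1.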


Binary Fraction to Decimal Fraction

• Example: What is the binary value 0.01101 in decimal?

    Bit                  .     0     1     1     0     1
    Weight (in 32nds)    32    16    8     4     2     1

    Sum = 8 + 4 + 1 = 13

    Answer: 13 / 32 = 0.40625

• Example: What is 0.0 0011 0011 00 (binary) in decimal?

    Answer: (32 + 16 + 2 + 1) / 512 = 51 / 512 = 0.099609375

Decimal Fraction to Binary Fraction

• Example: What is 0.6875 (base 10) in binary?

    0.6875 * 2 = 1 .3750
    0.3750 * 2 = 0 .7500
    0.7500 * 2 = 1 .5000
    0.5000 * 2 = 1 .0000
    0.0000 * 2 = 0

    The integer parts, read from top to bottom, are the binary digits.

    Answer: 0.1011 (base 2)

• Example: What is 0.1 (base 10) in binary?


0.1 (base 10) in binary?

What is 0.1 (base 10) in binary?

    0.1 * 2 = 0 .2
    0.2 * 2 = 0 .4
    0.4 * 2 = 0 .8
    0.8 * 2 = 1 .6
    0.6 * 2 = 1 .2
    0.2 * 2 = 0 .4   and then repeating 0.4, 0.8, 0.6

• Answer: 0.0 0011 0011 0011 0011 0011 0011 ..... (base 2)

Normalised Binary Floating Point Numbers

    Number             Normalised Binary    Normalised Decimal
    100.01 x 2^1       1.0001 x 2^3         8.5 x 10^0
    1010.11 x 2^2      1.01011 x 2^5        4.3 x 10^1
    0.00101 x 2^-2     1.01 x 2^-5          3.90625 x 10^-2
    1100101 x 2^-2     1.100101 x 2^+4      2.525 x 10^1
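The repeated-doubling method for converting a decimal fraction to binary, and the weighted sum for going the other way, can be sketched in Python as follows. The function names and the max_bits cut-off are illustrative assumptions; the cut-off is needed because fractions such as 0.1 never terminate in binary.

```python
from fractions import Fraction

def fraction_to_binary(value, max_bits=24):
    """Binary digits of 0 <= value < 1, found by repeated doubling."""
    x = Fraction(value)               # exact, so "0.1" really means one tenth
    bits = []
    while x != 0 and len(bits) < max_bits:
        x *= 2
        bit = int(x)                  # integer part of the product is the next bit
        bits.append(str(bit))
        x -= bit                      # carry on with the fractional part only
    return "0." + "".join(bits)

def binary_to_fraction(bits):
    """Inverse direction: read the digits after the point as an integer over 2^n."""
    digits = bits.split(".")[1]
    return Fraction(int(digits, 2), 2 ** len(digits))

print(fraction_to_binary("0.6875"))
print(fraction_to_binary("0.1", max_bits=12))
print(binary_to_fraction("0.01101"))
```

With 12 bits the second call shows the start of the repeating 0011 pattern, which is why 0.1 cannot be stored exactly in any finite binary mantissa.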


Floating Point Multiplication

    N1 x N2 = (M1 x 10^E1) x (M2 x 10^E2)
            = (M1 x M2) x (10^E1 x 10^E2)
            = (M1 x M2) x 10^(E1+E2)

i.e. We multiply the Mantissas and Add the Exponents

Example: 20 * 6
    = (2.0 x 10^1) x (6.0 x 10^0)
    = (2.0 x 6.0) x 10^(1+0)
    = 12.0 x 10^1

We must also normalise the result, so the final answer = 1.2 x 10^2

Truncation and Rounding

• For many computations the result of a floating point operation can be too large to store in the Mantissa.

• Example: with a 2-digit mantissa

    2.3 x 10^1 * 2.3 x 10^1 = 5.29 x 10^2

• TRUNCATION => 5.2 x 10^2 (Biased Error)

• ROUNDING => 5.3 x 10^2 (Unbiased Error)
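The multiply-then-normalise rule, together with the choice between truncation and rounding for a 2-digit mantissa, can be sketched with Python's decimal module. The function name multiply and the (mantissa, exponent) tuple layout are assumptions for illustration only.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN

def multiply(m1, e1, m2, e2, rounding=ROUND_HALF_EVEN):
    """Multiply m1 x 10^e1 by m2 x 10^e2, keeping a 2-digit mantissa (d.d)."""
    m = Decimal(m1) * Decimal(m2)     # multiply the mantissas
    e = e1 + e2                       # add the exponents
    while abs(m) >= 10:               # normalise into [1, 10)
        m /= 10
        e += 1
    while m != 0 and abs(m) < 1:
        m *= 10
        e -= 1
    # Cut the mantissa back to 2 digits. A mantissa that rounds up to 10.0
    # would need one more normalising step, which this sketch leaves out.
    return m.quantize(Decimal("0.1"), rounding=rounding), e

print(multiply("2.0", 1, "6.0", 0))                       # 20 * 6
print(multiply("2.3", 1, "2.3", 1))                       # rounding
print(multiply("2.3", 1, "2.3", 1, rounding=ROUND_DOWN))  # truncation
```

ROUND_DOWN always moves towards zero, which is the biased error noted above; ROUND_HALF_EVEN spreads the error in both directions.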

Floating Point Addition

• A floating point addition such as 4.5 x 10^3 + 6.7 x 10^2 is not a simple mantissa addition, unless the exponents are the same => we need to ensure that the mantissas are aligned first.

    N1 + N2 = ( M1 x 10^E1 ) + ( M2 x 10^E2 )
            = ( M1 + M2 x 10^(E2-E1) ) x 10^E1

• To align, choose the number with the smaller exponent & shift mantissa the corresponding number of digits to the right.

    Example: 4.5 x 10^3 + 6.7 x 10^2 = 4.5 x 10^3 + 0.67 x 10^3
                                     = 5.17 x 10^3
                                     = 5.2 x 10^3 (rounded)

Exponent Overflow & Underflow

• EXPONENT OVERFLOW occurs when the Result is too Large, i.e. when the Result’s Exponent > Maximum Exponent

    Example: if Max Exponent is 99 then 10^99 * 10^99 = 10^198 (overflow)

    On Overflow => Proceed with incorrect value or infinity value or raise an Exception

• EXPONENT UNDERFLOW occurs when the Result is too Small, i.e. when the Result’s Exponent < Smallest Exponent

    Example: if Min Exp. is -99 then 10^-99 * 10^-99 = 10^-198 (underflow)

    On Underflow => Proceed with zero value or raise an Exception
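The align-then-add procedure for floating point addition can be sketched in Python as below, again assuming a 2-digit decimal mantissa. The function name add is hypothetical, and the sketch only handles the single renormalising shift caused by a carry; it does not renormalise after cancellation.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def add(m1, e1, m2, e2):
    """Add m1 x 10^e1 and m2 x 10^e2, returning a 2-digit mantissa result."""
    m1, m2 = Decimal(m1), Decimal(m2)
    if e1 < e2:                       # make (m1, e1) the operand with the larger exponent
        m1, e1, m2, e2 = m2, e2, m1, e1
    m2 = m2.scaleb(e2 - e1)           # shift the smaller operand's mantissa to the right
    m, e = m1 + m2, e1                # mantissas are aligned, so a plain add works
    if abs(m) >= 10:                  # carry out of the top digit: shift once more
        m, e = m.scaleb(-1), e + 1
    return m.quantize(Decimal("0.1"), rounding=ROUND_HALF_EVEN), e

print(add("4.5", 3, "6.7", 2))        # 4.5 x 10^3 + 6.7 x 10^2
```

The intermediate sum here is 5.17, which is then rounded to the 2-digit mantissa 5.2, as in the worked example.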


Comparing Floating-Point Values

• Because of the potential for producing inexact results, comparing floating-point values should account for close results.

• If we know the likely magnitude and precision of results we can adjust for closeness (epsilon), for example, for equality we can replace:

    a = b    with    a > ( b - e ) AND a < ( b + e )

    a = 1    with    a > ( 1 - 0.000005 ) AND a < ( 1 + 0.000005 )
             i.e.    a > 0.999995 AND a < 1.000005

  Alternatively we can calculate | a - b | < e    e.g. | a - 1 | < 0.000005

• A more general approach is to calculate the closeness based on the relative size of the two numbers being compared.
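Both comparison styles can be tried out in Python. The helper nearly_equal is a hypothetical name for the fixed-epsilon test; math.isclose is the standard library's relative-tolerance comparison. The tolerance values are chosen only for illustration.

```python
import math

def nearly_equal(a, b, eps=0.000005):
    """Absolute test: |a - b| < e, suitable when the magnitude of results is known."""
    return abs(a - b) < eps

total = sum([0.1] * 10)               # ten tenths, each slightly inexact in binary

print(total == 1.0)                   # exact comparison is unreliable
print(nearly_equal(total, 1.0))       # fixed epsilon around 1
print(math.isclose(total, 1.0, rel_tol=1e-9))   # tolerance scaled to the operands

# For very large values a fixed epsilon is too strict, a relative one is not.
big = 1e20
print(nearly_equal(big, big + 1e5))
print(math.isclose(big, big + 1e5, rel_tol=1e-9))
```

The last two lines show why the relative approach is more general: near 10^20 adjacent floating point numbers are already thousands apart, so no fixed small epsilon can work.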

Floating point numbers - questions

• What is the binary notation for 3.625?

    A 11.011   B 10.101   C 11.101   D 101.11   E 11.11

• What is binary 0.1101 in decimal?

    A 0.8125   B 0.8   C 0.8625   D 0.9125   E 0.7865

IEEE Floating-Point Standard

• IEEE: Institute of Electrical and Electronics Engineers (USA)

• Comprehensive standard for Binary Floating-Point Arithmetic

• Widely adopted => Predictable results independent of architecture

• The standard defines:

    The format of binary floating-point numbers

    Semantics of arithmetic operations

    Rules for error conditions


Single Precision Format (32-bit)

    Sign     Exponent    Significand
    S        E           F
    1 bit    8 bits      23 bits

• The mantissa is called the SIGNIFICAND in the IEEE standard
• Value represented = ± 1.F x 2^(E-127), where 127 = 2^(8-1) - 1
• The Normal Bit (the 1.) is omitted from the Significand field => a HIDDEN bit
• Single precision yields 24 bits = ~7 decimal digits of precision
• Normalised Ranges in decimal are approximately:

    -10^+38 to -10^-38, 0, +10^-38 to +10^+38

Exponent Field

• In the IEEE Standard, exponents are stored as Excess (Bias) Values, not as 2’s Complement Values

• Example: In 8-bit Excess 127

    -127 would be held as 0000 0000
     ...                  ...
       0 would be held as 0111 1111
       1 would be held as 1000 0000
     ...                  ...
     128 would be held as 1111 1111

• Excess notation allows non-negative floating point numbers to be compared using simple integer comparisons, regardless of the absolute magnitude of the exponents.
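The excess-127 encoding and the integer-comparison property can both be checked with a small Python sketch. The helper names to_excess and word are assumptions; struct is used only to obtain the single-precision bit pattern of a Python float.

```python
import struct

def to_excess(n, bias=127, bits=8):
    """Bit pattern of exponent n in excess-`bias` notation."""
    return format(n + bias, f"0{bits}b")

def word(x):
    """Bit pattern of x, rounded to IEEE single precision, as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

for n in (-127, 0, 1, 128):
    print(n, "held as", to_excess(n))

# For non-negative values, ordering by bit pattern matches ordering by value,
# because the biased exponent sits above the significand in the word.
values = [0.375, 42.6875, 1.0, 3.5e10, 1e-20]
print(sorted(values) == sorted(values, key=word))
```

With a two's complement exponent the small values (negative exponents) would have a leading 1 bit and sort above the large ones, so this shortcut would not work.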


Double Precision Format (64-bit)

    Sign     Exponent    Significand
    S        E           F
    1 bit    11 bits     52 bits

• Value represented = ± 1.F x 2^(E-1023), where 1023 = 2^(11-1) - 1
• Yields 53 bits of precision = ~16 decimal digits of precision
• Normalised Ranges in decimal are approximately:

    -10^+308 to -10^-308, 0, +10^-308 to +10^+308

• Double-Precision format is preferred for its greater precision. Single-precision is useful when memory is scarce and for debugging numerical calculations since rounding errors show up more quickly.

Example: Conversion to IEEE format

What is +42.6875 in IEEE Single Precision Format?

    First convert to a binary number:    42.6875 = 10_1010.1011
    Next normalise:                      1.0101_0101_1 x 2^5
    Significand field is therefore:      0101_0101_1000_0000_0000_000
    Exponent field is (5+127=132):       1000_0100

Value in IEEE Single Precision is:

    Sign    Exponent     Significand
    0       1000_0100    0101_0101_1000_0000_0000_000

Regrouped into 4-bit nibbles:

    0100_0010_0010_1010_1100_0000_0000_0000

In hexadecimal this value is 422A_C000
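The hand conversion of 42.6875 can be cross-checked in Python with the standard struct module, which packs a float into IEEE single precision. The helper names single_bits and single_hex are illustrative assumptions.

```python
import struct

def single_bits(x):
    """Return (sign, exponent, significand) bit strings of x in IEEE single precision."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    bits = f"{word:032b}"
    return bits[0], bits[1:9], bits[9:]      # 1 + 8 + 23 bits

def single_hex(x):
    """Return the 32-bit pattern of x as 8 hexadecimal digits."""
    return struct.pack(">f", x).hex().upper()

sign, exponent, significand = single_bits(42.6875)
print(sign, exponent, significand)
print(single_hex(42.6875))
```

The three fields printed should match the Sign, Exponent and Significand rows worked out by hand above.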



Example: Conversion from IEEE format

Convert the IEEE Single Precision Value given by BEC0_0000 to Decimal

    BEC0_0000 = 1011_1110_1100_0000_0000_0000_0000_0000

    Sign    Exponent     Significand
    1       0111_1101    1000_0000_0000_0000_0000_000

    Exponent Field = 0111_1101 = 125
    True Binary Exponent = 125 - 127 = -2
    Significand Field = 1000_0000_0000_0000_0000_000
    Adding Hidden Bit = 1.1000_0000_0000_0000_0000_000
    Therefore unsigned value = 1.1 x 2^-2 = 0.011 (binary)
                             = 0.25 + 0.125 = 0.375 (decimal)
    Sign bit = 1 therefore number is -0.375

Example: Addition

• Carry out the addition 42.6875 + 0.375 in IEEE single precision arithmetic.

    Number     Sign    Exponent     Significand
    42.6875    0       1000_0100    0101_0101_1000_0000_0000_000
    0.375      0       0111_1101    1000_0000_0000_0000_0000_000

• To add these numbers the exponents of the numbers must be the same => Make the smaller exponent equal to the larger exponent, shifting the mantissa accordingly.

• Note: We must restore the Hidden bit when carrying out floating point operations.
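The decoding steps used for BEC0_0000 (split the fields, remove the bias, restore the hidden bit, apply the sign) can be written out as a short Python sketch. The function name decode_single is an assumption, and the sketch deliberately handles normalised numbers only.

```python
import struct

def decode_single(hex_word):
    """Decode a normalised IEEE single-precision hex word such as 'BEC0_0000'."""
    word = int(hex_word.replace("_", ""), 16)
    sign = word >> 31                             # top bit
    exponent = ((word >> 23) & 0xFF) - 127        # remove the excess-127 bias
    significand = 1 + (word & 0x7FFFFF) / 2**23   # restore the hidden bit: 1.F
    return (-1) ** sign * significand * 2.0 ** exponent

print(decode_single("BEC0_0000"))
print(decode_single("422A_C000"))

# Cross-check against the library's own interpretation of the same bytes.
print(struct.unpack(">f", bytes.fromhex("BEC00000"))[0])
```

Exponent fields of 0 and 255 need separate treatment because they encode the special values rather than normalised numbers.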


Example: Addition Contd.

• Significand of Larger No  = 1.0101_0101_1000_0000_0000_000
  Significand of Smaller No = 1.1000_0000_0000_0000_0000_000

• Exponents differ by +7 (1000_0100 - 0111_1101). Therefore shift binary point of smaller number 7 places to the left:

• Significand of Smaller No = 0.0000_0011_0000_0000_0000_000
  Significand of Larger No  = 1.0101_0101_1000_0000_0000_000
  Significand of SUM        = 1.0101_1000_1000_0000_0000_000

• Therefore SUM = 1.0101_1000_1 x 2^5 = 10_1011.0001 = 43.0625

    Sign    Exponent     Significand
    0       1000_0100    0101_1000_1000_0000_0000_000    = 422C 4000H

Special Values

• The IEEE format can represent five kinds of values: Zero, Normalised Numbers, Denormalised Numbers, Infinity and Not-A-Numbers (NaNs).

• For single precision format we have the following representations:

    IEEE Value           Sign Field    Exponent Field    Significand Field        True Exponent
    ± Zero               0 or 1        0                 0 (All zeroes)
    ± Denormalised No    0 or 1        0                 Any non-zero bit pat.    -126
    ± Normalised No      0 or 1        1 .. 254          Any bit pattern          -126 .. +127
    ± Infinity           0 or 1        255               0 (All zeroes)
    Not-A-Number         0 or 1        255               Any non-zero bit pat.
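The special-values table translates directly into a classification on the exponent and significand fields. The Python sketch below assumes a 32-bit word given as an integer; the function name classify and the returned labels are illustrative.

```python
def classify(word):
    """Classify a 32-bit IEEE single-precision pattern using the table's rules."""
    exponent = (word >> 23) & 0xFF        # 8-bit exponent field
    significand = word & 0x7FFFFF         # 23-bit significand field
    if exponent == 0:
        return "zero" if significand == 0 else "denormalised"
    if exponent == 255:
        return "infinity" if significand == 0 else "NaN"
    return "normalised"                   # exponent field 1 .. 254

for w in (0x00000000, 0x80000000, 0x00000001, 0x422AC000, 0x7F800000, 0x7FC00000):
    print(f"{w:08X}", classify(w))
```

The sign bit plays no part in the classification, which is why both +0 and -0 (and both infinities) fall into the same class.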


Denormalised Numbers

• An Exponent of All 0’s is used to represent Zero and Denormalised numbers, while All 1’s is used to represent Infinities and Not-A-Numbers (NaNs)

• This means that the maximum range for normalised numbers is reduced, i.e. for Single Precision the range is -126 .. +127 rather than -127 .. +128 as one might expect for Excess 127.

• Denormalised Numbers represent values between the Underflow limits and zero, i.e. for single precision we have:

    ± 0.F x 2^-126

  Traditionally a “flush-to-zero” is done when an underflow occurs

• Denormalised numbers allow a more gradual shift to zero, and are useful in a few numerical applications
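The gap that denormalised numbers fill can be seen by decoding a few single-precision bit patterns in Python. The helper name single is an assumption. The last two lines use Python's own floats, which are IEEE doubles, so the corresponding limit there is 2^-1022 rather than 2^-126; they also assume the platform does not flush denormals to zero.

```python
import struct
import sys

def single(word):
    """Interpret a 32-bit pattern as an IEEE single-precision value."""
    return struct.unpack(">f", word.to_bytes(4, "big"))[0]

smallest_normal = single(0x00800000)      # exponent field 1, F = 0: 1.0 x 2^-126
largest_denormal = single(0x007FFFFF)     # exponent field 0, F all ones
smallest_denormal = single(0x00000001)    # exponent field 0: 0.F x 2^-126 = 2^-149

print(smallest_normal == 2.0 ** -126)
print(smallest_denormal == 2.0 ** -149)
print(0 < largest_denormal < smallest_normal)

# Gradual underflow in double precision:
print(sys.float_info.min / 2 > 0)         # below the smallest normal, still non-zero
print(5e-324 / 2 == 0.0)                  # below the smallest denormal: becomes zero
```

Without denormals, the step from 2^-126 straight down to zero would be far larger than the spacing between neighbouring normalised numbers just above it.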

IEEE 754 floating point numbers - questions

• What decimal is represented by the hex word C0CA0000?

    Answer: -6.3125

• What hex word is -0.75 in IEEE-754?

    Answer: BFE8000000000000 (double precision; the single precision word is BF400000)

Infinities and NaNs

• Infinities (both positive & negative) are used to represent values that exceed the overflow limits, and for operations like Divide by Zero

• Infinities behave as in Mathematics, e.g.

    Infinity + 5 = Infinity, -Infinity + -Infinity = -Infinity

• Not-A-Numbers (NaNs) are used to represent the results of operations which have no mathematical interpretation, e.g.

    0 / 0, +Infinity + -Infinity, 0 x Infinity, Square root of a -ve number

• Operations with a NaN operand yield either a NaN result (quiet NaN operand) or an exception (signalling NaN operand)
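The behaviour of infinities and NaNs can be observed with Python floats, which are IEEE doubles. Note that Python itself chooses to raise an exception for some of the invalid operations listed above (float division by zero and the square root of a negative number) instead of returning a NaN, so those cases are shown inside try blocks.

```python
import math

inf = math.inf
nan = math.nan

print(inf + 5 == inf)                 # Infinity + 5 = Infinity
print(-inf + -inf == -inf)            # -Infinity + -Infinity = -Infinity
print(1e308 * 10 == inf)              # exceeding the overflow limit gives Infinity

print(math.isnan(inf + -inf))         # +Infinity + -Infinity has no meaning
print(math.isnan(0 * inf))            # neither does 0 x Infinity
print(math.isnan(nan + 1))            # a quiet NaN operand propagates
print(nan == nan)                     # a NaN compares unequal even to itself

try:
    0.0 / 0.0
except ZeroDivisionError as err:      # Python raises rather than returning NaN
    print("0.0 / 0.0 raised:", err)

try:
    math.sqrt(-1.0)
except ValueError as err:             # likewise for sqrt of a negative number
    print("sqrt(-1.0) raised:", err)
```

Because a NaN never equals itself, math.isnan is the reliable way to test for one.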


This lecture - feedback

• The pace of the lecture was:

    A. much too fast   B. too fast   C. about right   D. too slow   E. much too slow

• The learning objectives were met:

    A. Fully   B. Mostly   C. Partially   D. Slightly   E. Not at all
