2.4 Floating Points

The document discusses floating point arithmetic, focusing on fractional binary numbers and the IEEE floating point standard established in 1985. It covers the representation of numbers, operations like rounding, addition, and multiplication, and highlights the importance of precision and potential errors in calculations, illustrated by the Patriot Missile failure. The document also provides examples of floating-point representation and operations in both decimal and binary formats.


Carnegie Mellon

Floating Point

Instructor
Dr. Neha Agrawal
Carnegie Mellon

Today: Floating Point


■ Background: Fractional binary numbers
■ IEEE floating point standard: Definition
■ Example and properties
■ Rounding, addition, multiplication
■ Floating point in C
■ Summary

2
Carnegie Mellon

Fractional binary numbers


■ What is 1011.101₂?

3
Carnegie Mellon

Fractional Binary Numbers


Bit pattern:  b_i  b_(i–1)  •••  b_2  b_1  b_0 . b_(–1)  b_(–2)  b_(–3)  •••  b_(–j)
Bit weights:  2^i  2^(i–1)  •••   4    2    1     1/2     1/4     1/8    •••  2^(–j)

■ Representation
 Bits to right of “binary point” represent fractional powers of 2
 Represents the rational number: sum over k from –j to i of b_k × 2^k
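As a check on this formula, here is a minimal C sketch (not part of the slides) that evaluates a fractional binary string by summing b_k × 2^k; the helper name parse_binary is made up for illustration. It also answers the question on the previous slide: 1011.101₂ = 11.625.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical helper (not from the slides): evaluate a fractional binary
     * string such as "1011.101" by summing b_k * 2^k over all bits. */
    double parse_binary(const char *s) {
        const char *dot = strchr(s, '.');
        double value = 0.0;
        for (const char *p = s; *p && *p != '.'; p++)   /* integer bits: *2 per step */
            value = value * 2.0 + (*p - '0');
        if (dot) {
            double weight = 0.5;                        /* 1/2, 1/4, 1/8, ... */
            for (const char *p = dot + 1; *p; p++, weight /= 2.0)
                value += (*p - '0') * weight;
        }
        return value;
    }

    int main(void) {
        printf("%g\n", parse_binary("1011.101"));  /* 11.625 = 8 + 2 + 1 + 1/2 + 1/8 */
        return 0;
    }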

4
Carnegie Mellon

Fractional Binary Numbers: Examples


■ Value        Representation
  5 3/4        101.11₂
  2 7/8        10.111₂
  1 7/16       1.0111₂

■ Observations
 Divide by 2 by shifting right
 Multiply by 2 by shifting left
 Numbers of form 0.111111…₂ are just below 1.0
  1/2 + 1/4 + 1/8 + … + 1/2^i + … ➙ 1.0
  Use notation 1.0 – ε

5
Carnegie Mellon

Representable Numbers
■ Limitation
 Can only exactly represent numbers of the form x/2^k
 Other rational numbers have repeating bit representations

■ Value        Representation
 1/3           0.0101010101[01]…₂
 1/5           0.001100110011[0011]…₂
 1/10          0.0001100110011[0011]…₂

6
Motivation
■ On February 25, 1991, during the Gulf War, an American Patriot Missile
battery in Dhahran, Saudi Arabia, failed to track and intercept an
incoming Iraqi Scud missile. The Scud struck an American Army
barracks, killing 28 soldiers and injuring around 100 other people.
A report of the General Accounting Office, GAO/IMTEC-92-26, entitled
Patriot Missile Defense: Software Problem Led to System Failure at
Dhahran, Saudi Arabia, reported on the cause of the failure. It turns out
that the cause was an inaccurate calculation of the time since boot due
to computer arithmetic errors.

7
■ Specifically, the time in tenths of a second as measured by the system's internal clock was
multiplied by 1/10 to produce the time in seconds. This calculation was performed using a
24-bit fixed-point register. In particular, the value 1/10, which has a non-terminating binary
expansion, was chopped at 24 bits after the radix point. The Patriot battery had been up
around 100 hours, and an easy calculation shows that the resulting time error due to the
magnified chopping error was about 0.34 seconds. (The number 1/10 equals
1/2^4 + 1/2^5 + 1/2^8 + 1/2^9 + 1/2^12 + 1/2^13 + …. In other words, the binary expansion of 1/10 is
0.0001100110011001100110011001100…₂. The 24-bit register in the Patriot instead stored
0.00011001100110011001100₂, introducing an error of
0.0000000000000000000000011001100…₂, or about 0.000000095 decimal.
Multiplying by the number of tenths of a second in 100 hours gives
0.000000095 × 100 × 60 × 60 × 10 = 0.34. A short C sketch after these bullets reproduces this arithmetic.)
■ A Scud travels at about 1,676 meters per second, and so travels more than half a
kilometer in this time. This was far enough that the incoming Scud was outside the "range
gate" that the Patriot tracked. Ironically, the fact that the bad time calculation had been
improved in some parts of the code, but not all, contributed to the problem, since it meant
that the inaccuracies did not cancel.
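A minimal C sketch reproducing the arithmetic quoted above (assuming truncation keeps the 23 fraction bits shown in the bullet, so the discarded tail begins at bit 24):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        /* 1/10 "chopped" in a fixed-point register: keep the 23 fraction bits
         * quoted above, so the discarded tail starts at bit 24. */
        double chopped = floor(0.1 * ldexp(1.0, 23)) * ldexp(1.0, -23);
        double error   = 0.1 - chopped;                    /* ~0.000000095 */
        double ticks   = 100.0 * 60 * 60 * 10;             /* tenths of a second in 100 hours */
        printf("error per tick : %.3g\n", error);          /* ~9.54e-08 */
        printf("drift in 100 h : %.2f s\n", error * ticks);   /* ~0.34 s */
        printf("Scud travel    : %.0f m\n", 1676 * error * ticks); /* > 500 m */
        return 0;
    }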

8
Carnegie Mellon

Today: Floating Point


■ Background: Fractional binary numbers
■ IEEE floating point standard: Definition
■ Example and properties
■ Rounding, addition, multiplication
■ Floating point in C
■ Summary

9
Carnegie Mellon

IEEE Floating Point


■ IEEE Standard 754
 Established in 1985 as uniform standard for floating point arithmetic
Before that, many idiosyncratic formats
 Supported by all major CPUs

■ Driven by numerical concerns


 Nice standards for rounding, overflow, underflow
 Hard to make fast in hardware
 Numerical analysts predominated over hardware designers in defining
standard

10
IEEE Floating-Point Format
Field layout:  S | Exponent | Fraction (mantissa)
  single: 1 sign bit,  8 exponent bits, 23 fraction bits
  double: 1 sign bit, 11 exponent bits, 52 fraction bits

x = (–1)^S × (1 + Fraction) × 2^(Exponent – Bias)
■ S: sign bit (0  non-negative, 1  negative)
■ Normalize significand: 1.0 ≤ |significand| < 2.0
 Always has a leading pre-binary-point 1 bit, so no need to represent it
explicitly (hidden bit)
 Significand is Fraction with the “1.” restored
■ Exponent: excess representation: actual exponent + Bias
 Ensures exponent is unsigned
 Single: Bias = 127; Double: Bias = 1023
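A small C sketch (assuming IEEE-754 single precision; the decode helper is made up for illustration) that pulls apart the S / Exponent / Fraction fields described above:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical helper: split a float into its S / Exponent / Fraction fields
     * (1 / 8 / 23 bits, as on the slide). */
    static void decode(float f) {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);          /* reinterpret the bit pattern */
        unsigned s    = bits >> 31;
        unsigned exp  = (bits >> 23) & 0xFF;     /* stored (biased) exponent */
        unsigned frac = bits & 0x7FFFFF;         /* 23 fraction bits */
        printf("%g: S=%u exp=%u (actual %d) frac=0x%06X\n",
               f, s, exp, (int)exp - 127, frac);
    }

    int main(void) {
        decode(1.0f);      /* S=0 exp=127 (actual 0)  frac=0x000000 */
        decode(-0.75f);    /* S=1 exp=126 (actual -1) frac=0x400000 */
        return 0;
    }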

11
Floating-Point Example
■ Represent –0.75
 –0.75 = (–1)^1 × 1.1₂ × 2^–1
 S = 1
 Fraction = 1000…00₂
 Exponent = –1 + Bias
  Single: –1 + 127 = 126 = 01111110₂
  Double: –1 + 1023 = 1022 = 01111111110₂

■ Single: 1 01111110 1000…00
■ Double: 1 01111111110 1000…00

12
Reason for bias
 Storing Exponent = actual exponent + Bias keeps the exponent field unsigned, so the bit patterns
 of nonnegative floats sort in the same order as their values (simple unsigned comparison works)

13
Single Precision

■ Overflow - the exponent is too large to be represented in the exponent field
■ Underflow - a nonzero fraction has become so small that it cannot be represented

14
Single-Precision Range
■ Exponents 00000000 and 11111111 reserved
■ Smallest value
 Exponent: 00000001
  actual exponent = 1 – 127 = –126
 Fraction: 000…00 ⇒ significand = 1.0
 ±1.0 × 2^–126 ≈ ±1.2 × 10^–38
■ Largest value
 Exponent: 11111110
  actual exponent = 254 – 127 = +127
 Fraction: 111…11 ⇒ significand ≈ 2.0
 ±2.0 × 2^+127 ≈ ±3.4 × 10^+38

15
Double Precision

16
Double-Precision Range
■ Exponents 0000…00 and 1111…11 reserved
■ Smallest value
 Exponent: 00000000001
  actual exponent = 1 – 1023 = –1022
 Fraction: 000…00 ⇒ significand = 1.0
 ±1.0 × 2^–1022 ≈ ±2.2 × 10^–308
■ Largest value
 Exponent: 11111111110
  actual exponent = 2046 – 1023 = +1023
 Fraction: 111…11 ⇒ significand ≈ 2.0
 ±2.0 × 2^+1023 ≈ ±1.8 × 10^+308

17
Floating-Point Example
■ What number is represented by the single-precision float
  1 10000001 01000…00 ?
 S = 1
 Fraction = 01000…00₂
 Exponent = 10000001₂ = 129
■ x = (–1)^1 × (1 + 0.01₂) × 2^(129 – 127)
    = (–1) × 1.25 × 2^2
    = –5.0
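The bit pattern in this example, 1 10000001 01000…00, is 0xC0A00000; a short C check (assuming IEEE-754 single precision):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        uint32_t bits = 0xC0A00000u;   /* 1 10000001 01000...00 */
        float f;
        memcpy(&f, &bits, sizeof f);
        printf("%f\n", f);             /* -5.000000 */
        return 0;
    }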

18
19
Single Precision Examples
■ Denormalized Numbers
 Exponent field all zeros: no hidden leading 1; value = (–1)^S × 0.Fraction × 2^(1 – Bias)
 Include ±0 and provide gradual underflow below the smallest normalized value
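A short C sketch of the denormalized range (assuming IEEE-754 single precision with default rounding and no flush-to-zero):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        float smallest_norm   = ldexpf(1.0f, -126);  /* smallest normalized: 1.0 * 2^-126 */
        float smallest_denorm = ldexpf(1.0f, -149);  /* smallest denormalized */
        printf("%e -> %s\n", smallest_norm,
               fpclassify(smallest_norm) == FP_NORMAL ? "normal" : "other");
        printf("%e -> %s\n", smallest_denorm,
               fpclassify(smallest_denorm) == FP_SUBNORMAL ? "subnormal" : "other");
        printf("%e\n", ldexpf(1.0f, -150));          /* typically rounds to 0: underflow */
        return 0;
    }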

20
Precisions
■ Extended precision: 80 bits (Intel only)

s exp frac

1 15-bits 63 or 64-bits

23
Carnegie Mellon

Special Properties of Encoding


■ FP Zero Same as Integer Zero
 All bits = 0

■ Can (Almost) Use Unsigned Integer Comparison
 Must first compare sign bits
 Must consider −0 = 0
 NaNs problematic
  Will be greater than any other value
  What should comparison yield?
 Otherwise OK
  Denorm vs. normalized
  Normalized vs. infinity
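A sketch of the "unsigned integer comparison" property for nonnegative, non-NaN floats; bits_of is a made-up helper, and negative values and −0 need the extra care listed above:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical helper: raw bit pattern of a float as an unsigned integer. */
    static uint32_t bits_of(float f) {
        uint32_t u;
        memcpy(&u, &f, sizeof u);
        return u;
    }

    int main(void) {
        float a = 1.5f, b = 2.25f;
        /* For nonnegative, non-NaN floats the two comparisons agree. */
        printf("%d %d\n", a < b, bits_of(a) < bits_of(b));        /* 1 1 */
        printf("%d %d\n", 0.0f < a, bits_of(0.0f) < bits_of(a));  /* 1 1 */
        return 0;
    }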

24
Carnegie Mellon

Today: Floating Point


■ Background: Fractional binary numbers
■ IEEE floating point standard: Definition
■ Example and properties
■ Rounding, addition, multiplication
■ Floating point in C
■ Summary

25
Carnegie Mellon

Floating Point Operations: Basic Idea


■ x +f y = Round(x + y)

■ x f y = Round(x  y)

■ Basic idea
 First compute exact result
 Make it fit into desired precision
 Possibly overflow if exponent too large
 Possibly round to fit into frac

26
Carnegie Mellon

Rounding
■ Rounding Modes (illustrate with $ rounding)

                            $1.40   $1.60   $1.50   $2.50   –$1.50
  Towards zero               $1      $1      $1      $2     –$1
  Round down (−∞)            $1      $1      $1      $2     –$2
  Round up (+∞)              $2      $2      $2      $3     –$1
  Nearest Even (default)     $1      $2      $2      $2     –$2

■ What are the advantages of the modes?
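The four modes above can be selected through the standard C99 <fenv.h> interface; a sketch applying them to 2.5 (the $2.50 column) using nearbyint, which honors the current rounding mode. Strictly conforming code would also enable the FENV_ACCESS pragma.

    #include <stdio.h>
    #include <fenv.h>
    #include <math.h>

    int main(void) {
        volatile double x = 2.5;   /* volatile: keep the calls from being folded at compile time */
        fesetround(FE_TOWARDZERO); printf("toward zero : %g\n", nearbyint(x)); /* 2 */
        fesetround(FE_DOWNWARD);   printf("round down  : %g\n", nearbyint(x)); /* 2 */
        fesetround(FE_UPWARD);     printf("round up    : %g\n", nearbyint(x)); /* 3 */
        fesetround(FE_TONEAREST);  printf("nearest even: %g\n", nearbyint(x)); /* 2 */
        return 0;
    }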

27
Carnegie Mellon

Closer Look at Round-To-Even


■ Default Rounding Mode
 All others are statistically biased
 Sum of set of positive numbers will consistently be over- or under-
estimated

■ Applying to Other Decimal Places / Bit Positions


 When exactly halfway between two possible values
 Round so that least significant digit is even
 E.g., round to nearest hundredth
1.2349999 → 1.23 (Less than half way)
1.2350001 → 1.24 (Greater than half way)
1.2350000 → 1.24 (Half way—round up)
1.2450000 → 1.24 (Half way—round down)

28
Carnegie Mellon

Rounding Binary Numbers


■ Binary Fractional Numbers
 “Even” when least significant bit is 0
 “Half way” when bits to right of rounding position = 100…₂

■ Examples
 Round to nearest 1/4 (2 bits right of binary point)
Value    Binary      Rounded   Action          Rounded Value
2 3/32   10.00011₂   10.00₂    (< 1/2—down)    2
2 3/16   10.00110₂   10.01₂    (> 1/2—up)      2 1/4
2 7/8    10.11100₂   11.00₂    (= 1/2—up)      3
2 5/8    10.10100₂   10.10₂    (= 1/2—down)    2 1/2
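A small C sketch of this exercise under the default round-to-nearest-even mode; round_quarter is a made-up helper that scales by 4, rounds to an integer with rint (ties go to even by default), and scales back. All of the slide's values are exactly representable, so only the final rounding step is in play.

    #include <stdio.h>
    #include <math.h>

    /* Hypothetical helper: round to the nearest 1/4 (2 bits right of the binary point). */
    static double round_quarter(double x) {
        return rint(x * 4.0) / 4.0;
    }

    int main(void) {
        printf("%g\n", round_quarter(2.0 + 3.0 / 32));  /* 2     (< 1/2: down)       */
        printf("%g\n", round_quarter(2.0 + 3.0 / 16));  /* 2.25  (> 1/2: up)         */
        printf("%g\n", round_quarter(2.875));           /* 3     (tie: up to even)   */
        printf("%g\n", round_quarter(2.625));           /* 2.5   (tie: down to even) */
        return 0;
    }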

29
Floating-Point Addition
Consider a 4-digit decimal example
  9.999 × 10^1 + 1.610 × 10^–1
1. Align decimal points
   Shift the number with the smaller exponent
   9.999 × 10^1 + 0.016 × 10^1
2. Add significands
   9.999 × 10^1 + 0.016 × 10^1 = 10.015 × 10^1
3. Normalize result & check for over/underflow
   1.0015 × 10^2
4. Round and renormalize if necessary. Assume only four
   digits are allowed for the significand and two digits for the exponent
   1.002 × 10^2

Chapter 3 — Arithmetic for Computers — 30


30
Floating-Point Addition
Now consider a 4-digit binary example
  1.000₂ × 2^–1 + –1.110₂ × 2^–2   (0.5 + –0.4375)
1. Align binary points
   Shift the number with the smaller exponent
   1.000₂ × 2^–1 + –0.111₂ × 2^–1
2. Add significands
   1.000₂ × 2^–1 + –0.111₂ × 2^–1 = 0.001₂ × 2^–1
3. Normalize result & check for over/underflow
   1.000₂ × 2^–4 (no over/underflow)
4. Round and renormalize if necessary
   1.000₂ × 2^–4 (no change) = 0.0625
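The alignment step is where precision can be lost: if the exponents differ by more than the significand width, the smaller operand's bits are shifted out entirely. A short single-precision C illustration (a sketch, not from the slides):

    #include <stdio.h>

    int main(void) {
        float big = 16777216.0f;             /* 2^24: one ulp here is 2.0 */
        printf("%d\n", big + 1.0f == big);   /* 1: the 1 is shifted out and rounded away */
        printf("%.1f\n", big + 2.0f);        /* 16777218.0: a full ulp survives */
        return 0;
    }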

Chapter 3 — Arithmetic for Computers — 31


31
Floating-Point Multiplication
 Consider a 4-digit decimal example
   1.110 × 10^10 × 9.200 × 10^–5
 1. Add exponents
    Unbiased: new exponent = 10 + (–5) = 5
    With biased exponents, adding the stored fields counts the bias twice:
    (10 + 127) + (–5 + 127) = 137 + 122 = 259, which is wrong!
    So subtract the bias from the sum: 259 – 127 = 132 = 5 + 127
 2. Multiply significands
    1.110 × 9.200 = 10.212  ⇒  10.212 × 10^5
 3. Normalize result & check for over/underflow
    1.0212 × 10^6
 4. Round and renormalize if necessary
    1.021 × 10^6
 5. Determine sign of result from signs of operands
    +1.021 × 10^6
Chapter 3 — Arithmetic for Computers — 32


32
Floating-Point Multiplication
■ Now consider a 4-digit binary example
 1.000₂ × 2^–1 × –1.110₂ × 2^–2   (0.5 × –0.4375)
■ 1. Add exponents
 Unbiased: –1 + –2 = –3
 Biased: (–1 + 127) + (–2 + 127) = –3 + 254; subtract 127 to get –3 + 127
■ 2. Multiply significands
 1.000₂ × 1.110₂ = 1.110000₂  ⇒  1.110₂ × 2^–3
■ 3. Normalize result & check for over/underflow
 1.110₂ × 2^–3 (no change) & no over/underflow
■ 4. Round and renormalize if necessary
 1.110₂ × 2^–3 (no change)
■ 5. Determine sign: +ve × –ve ⇒ –ve
 –1.110₂ × 2^–3 = –0.21875
Chapter 3 — Arithmetic for Computers — 33
33
Carnegie Mellon

Today: Floating Point


■ Background: Fractional binary numbers
■ IEEE floating point standard: Definition
■ Example and properties
■ Rounding, addition, multiplication
■ Floating point in C
■ Summary

34
Carnegie Mellon

Floating Point in C
■ C Guarantees Two Levels
float single precision
double double precision

■ Conversions/Casting
Casting between int, float, and double changes bit representation
 double/float → int
 Truncates fractional part
 Like rounding toward zero
 Not defined when out of range or NaN: Generally sets to TMin
 int → double
 Exact conversion, as long as int has ≤ 53 bit word size
 int → float
 Will round according to rounding mode
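A short C illustration of the cast rules above (and of the first two puzzles on the next slide): float's 23-bit fraction cannot hold 2^24 + 1, but double can hold any 32-bit int exactly.

    #include <stdio.h>

    int main(void) {
        int x = (1 << 24) + 1;               /* 16777217 */
        printf("%d\n", x == (int)(float)x);  /* 0: float rounds it to 16777216 */
        printf("%d\n", x == (int)(double)x); /* 1: double holds it exactly */
        return 0;
    }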

35
Carnegie Mellon

Floating Point Puzzles


■ For each of the following C expressions, either:
 Argue that it is true for all argument values
 Explain why not true
  int x = …;   float f = …;   double d = …;
  (Assume neither d nor f is NaN)

 • x == (int)(float) x
 • x == (int)(double) x
 • f == (float)(double) f
 • d == (float) d
 • f == -(-f);
 • 2/3 == 2/3.0
 • d < 0.0  ⇒  ((d*2) < 0.0)
 • d > f  ⇒  -f > -d
 • d * d >= 0.0
 • (d+f)-d == f

36
Carnegie Mellon

Today: Floating Point


■ Background: Fractional binary numbers
■ IEEE floating point standard: Definition
■ Example and properties
■ Rounding, addition, multiplication
■ Floating point in C
■ Summary

37
Carnegie Mellon

Summary
■ IEEE Floating Point has clear mathematical properties
■ Represents numbers of form M × 2^E
■ One can reason about operations independent of
implementation
 As if computed with perfect precision and then rounded
■ Not the same as real arithmetic
 Violates associativity/distributivity
 Makes life difficult for compilers & serious numerical applications
programmers
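A two-line C illustration of the associativity point (single precision; the float temporaries force rounding after each step):

    #include <stdio.h>

    int main(void) {
        float a = 3.14f, b = 1e10f;
        float t1 = a + b;            /* 3.14 is far below one ulp of 1e10f, so t1 == b */
        float t2 = b - b;            /* exactly 0 */
        printf("%g\n", t1 - b);      /* 0 */
        printf("%g\n", a + t2);      /* 3.14 */
        return 0;
    }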

38
