2.4 Floating Points
Floating Point
Instructor
Dr. Neha Agrawal
Carnegie Mellon
Fractional Binary Numbers
■ Representation
Bit pattern: bi bi-1 ••• b2 b1 b0 . b-1 b-2 b-3 ••• b-j
Bits to the left of the “binary point” represent nonnegative powers of 2 (…, 4, 2, 1)
Bits to the right of the “binary point” represent fractional powers of 2 (1/2, 1/4, 1/8, …, 2^–j)
Represents the rational number: Σ (k = –j to i) bk × 2^k
■ Observations
Divide by 2 by shifting right
Multiply by 2 by shifting left
Numbers of form 0.111111…2 are just below 1.0
1/2 + 1/4 + 1/8 + … + 1/2^i + … ➙ 1.0
Use notation 1.0 – ε
Representable Numbers
■ Limitation
Can only exactly represent numbers of the form x/2^k
Other rational numbers have repeating bit representations
■ Value → Representation
1/3 → 0.0101010101[01]…₂
1/5 → 0.001100110011[0011]…₂
1/10 → 0.0001100110011[0011]…₂
Motivation
■ On February 25, 1991, during the Gulf War, an American Patriot Missile
battery in Dhahran, Saudi Arabia, failed to track and intercept an
incoming Iraqi Scud missile. The Scud struck an American Army
barracks, killing 28 soldiers and injuring around 100 other people.
A report of the General Accounting Office, GAO/IMTEC-92-26, entitled
Patriot Missile Defense: Software Problem Led to System Failure at
Dhahran, Saudi Arabia, reported on the cause of the failure. It turns out
that the cause was an inaccurate calculation of the time since boot, due
to computer arithmetic errors.
■ Specifically, the time in tenths of a second, as measured by the system's internal clock, was
multiplied by 1/10 to produce the time in seconds. This calculation was performed using a
24-bit fixed-point register. In particular, the value 1/10, which has a non-terminating binary
expansion, was chopped at 24 bits after the radix point. The Patriot battery had been up
around 100 hours, and an easy calculation shows that the resulting time error due to the
magnified chopping error was about 0.34 seconds. (The number 1/10 equals
1/2^4 + 1/2^5 + 1/2^8 + 1/2^9 + 1/2^12 + 1/2^13 + …. In other words, the binary expansion of 1/10 is
0.0001100110011001100110011001100…. The 24-bit register in the Patriot instead stored
0.00011001100110011001100, introducing an error of
0.0000000000000000000000011001100… binary, or about 0.000000095 decimal.
Multiplying by the number of tenths of a second in 100 hours gives
0.000000095 × 100 × 60 × 60 × 10 = 0.34.)
■ A Scud travels at about 1,676 meters per second, and so travels more than half a
kilometer in this time. This was far enough that the incoming Scud was outside the "range
gate" that the Patriot tracked. Ironically, the fact that the bad time calculation had been
improved in some parts of the code, but not all, contributed to the problem, since it meant
that the inaccuracies did not cancel, as discussed in the GAO report.
IEEE Floating-Point Format
S: 1 bit | Exponent: 8 bits (single), 11 bits (double) | Fraction/Mantissa: 23 bits (single), 52 bits (double)
x = (–1)^S × (1 + Fraction) × 2^(Exponent – Bias)
Floating-Point Example
■ Represent –0.75
–0.75 = (–1)^1 × 1.1₂ × 2^–1
S = 1
Fraction = 1000…00₂
Exponent = –1 + Bias
  Single: –1 + 127 = 126 = 01111110₂
  Double: –1 + 1023 = 1022 = 01111111110₂
■ Single: 1 01111110 1000…00
■ Double: 1 01111111110 1000…00
Reason for bias

Single Precision
Single-Precision Range
■ Exponents 00000000 and 11111111 reserved
■ Smallest value
  Exponent: 00000001 → actual exponent = 1 – 127 = –126
  Fraction: 000…00 → significand = 1.0
  ±1.0 × 2^–126 ≈ ±1.2 × 10^–38
■ Largest value
  Exponent: 11111110 → actual exponent = 254 – 127 = +127
  Fraction: 111…11 → significand ≈ 2.0
  ±2.0 × 2^+127 ≈ ±3.4 × 10^+38
Double Precision
Double-Precision Range
■ Exponents 0000…00 and 1111…11 reserved
■ Smallest value
  Exponent: 00000000001 → actual exponent = 1 – 1023 = –1022
  Fraction: 000…00 → significand = 1.0
  ±1.0 × 2^–1022 ≈ ±2.2 × 10^–308
■ Largest value
  Exponent: 11111111110 → actual exponent = 2046 – 1023 = +1023
  Fraction: 111…11 → significand ≈ 2.0
  ±2.0 × 2^+1023 ≈ ±1.8 × 10^+308
Floating-Point Example
■ What number is represented by the single-precision float
  1 10000001 01000…00 ?
  S = 1
  Fraction = 01000…00₂
  Exponent = 10000001₂ = 129
■ x = (–1)^1 × (1 + 0.01₂) × 2^(129 – 127)
    = (–1) × 1.25 × 2^2
    = –5.0
Single Precision Examples
■ Denormalized Numbers
Precisions
■ Extended precision: 80 bits (Intel only)
s: 1 bit | exp: 15 bits | frac: 63 or 64 bits
Floating-Point Operations: Basic Idea
■ x +f y = Round(x + y)
■ x ×f y = Round(x × y)
■ Basic idea
  First compute the exact result
  Then make it fit into the desired precision
    Possibly overflow if the exponent is too large
    Possibly round to fit into frac
Rounding
■ Rounding Modes (illustrate with $ rounding)
■ Examples
Round to nearest 1/4 (2 bits right of binary point)

Value   Binary     Rounded   Action         Rounded Value
2 3/32  10.00011₂  10.00₂    (<1/2, down)   2
2 3/16  10.00110₂  10.01₂    (>1/2, up)     2 1/4
2 7/8   10.11100₂  11.00₂    (=1/2, up)     3
2 5/8   10.10100₂  10.10₂    (=1/2, down)   2 1/2
Floating-Point Addition
Consider a 4-digit decimal example:
  9.999 × 10^1 + 1.610 × 10^–1
1. Align decimal points
   Shift the number with the smaller exponent:
   9.999 × 10^1 + 0.016 × 10^1
2. Add significands:
   9.999 × 10^1 + 0.016 × 10^1 = 10.015 × 10^1
3. Normalize the result and check for over/underflow:
   1.0015 × 10^2
4. Round and renormalize if necessary. Assume only four
   digits are allowed for the significand and two digits for the exponent:
   1.002 × 10^2
Floating Point in C
■ C Guarantees Two Levels
float → single precision
double → double precision
■ Conversions/Casting
Casting between int, float, and double changes bit representation
double/float → int
Truncates fractional part
Like rounding toward zero
Not defined when out of range or NaN: Generally sets to TMin
int → double
Exact conversion, as long as int has ≤ 53 bit word size
int → float
Will round according to rounding mode
Summary
■ IEEE Floating Point has clear mathematical properties
■ Represents numbers of the form M × 2^E
■ One can reason about operations independent of
implementation
As if computed with perfect precision and then rounded
■ Not the same as real arithmetic
Violates associativity/distributivity
Makes life difficult for compilers and serious numerical-applications programmers