04 Float
Floating Point
15-213/18-213/15-513: Introduction to Computer Systems
4th Lecture, June 1, 2021
Representation (bit weights …, 4, 2, 1 to the left of the binary point; 1/2, 1/4, …, 2^-j to the right)
▪ Bits to right of “binary point” represent fractional powers of 2
▪ Represents rational number: the sum of b_k × 2^k over all bit positions k, where b_k is the bit with weight 2^k
Observations
▪ Divide by 2 by shifting right (unsigned)
▪ Multiply by 2 by shifting left
▪ Numbers of form 0.111111…_2 are just below 1.0
▪ 1/2 + 1/4 + 1/8 + … + 1/2^i + … ➙ 1.0
▪ Use notation 1.0 – ε
Representable Numbers
Limitation #1
▪ Can only exactly represent numbers of the form x/2^k
▪ Other rational numbers have repeating bit representations
▪ Value   Representation
▪ 1/3     0.0101010101[01]…_2
▪ 1/5     0.001100110011[0011]…_2
▪ 1/10    0.0001100110011[0011]…_2
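Limitation #1 shows up directly in C. The sketch below assumes IEEE double-precision arithmetic; `repeated_sum` is a hypothetical helper, not part of the lecture. Adding 1/8 eight times gives exactly 1.0, because 1/8 has the form x/2^k, while adding 1/10 ten times does not, because each 0.1 is already a rounded approximation.

```c
#include <assert.h>

/* Add `step` to a running total `n` times, in double precision. */
static double repeated_sum(double step, int n) {
    assert(n >= 0);
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += step;
    return total;
}
```

Comparing `repeated_sum(0.125, 8)` and `repeated_sum(0.1, 10)` against 1.0 with `==` shows the difference: only the first comparison succeeds.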
Limitation #2
▪ Just one setting of binary point within the w bits
▪ Limited range of numbers (very small values? very large?)
This is important!
Ariane 5 explodes on maiden voyage: $500 million lost
▪ 64-bit floating point number converted to a 16-bit integer, which overflowed
▪ Causes rocket to get incorrect value of horizontal velocity and crash
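This kind of failure can be guarded against by checking the range before narrowing. The following is a minimal sketch of such a guard, not a reconstruction of the Ariane software; the function name `double_to_int16_checked` is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Convert x to int16_t only if the truncated value fits.
 * Returns false (and leaves *out untouched) when x is out of range or NaN. */
static bool double_to_int16_checked(double x, int16_t *out) {
    assert(out != 0);
    /* Written as a negated "in range" test so that NaN, which fails every
     * comparison, is rejected as well. */
    if (!(x > -32769.0 && x < 32768.0))
        return false;
    *out = (int16_t)x;   /* truncates toward zero */
    return true;
}
```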
(–1)^s M 2^E
▪ Sign bit s determines whether number is negative or positive
▪ Significand M normally a fractional value in range [1.0,2.0).
▪ Exponent E weights value by power of two
Encoding
▪ MSB s is sign bit s
▪ exp field encodes E (but is not equal to E)
▪ frac field encodes M (but is not equal to M)
s exp frac
Precision options
Single precision: 32 bits
about 7 decimal digits, range about 10^±38
s exp frac
1 8-bits 23-bits
Significand
M = 1.1101101101101_2
frac = 11011011011010000000000_2
Exponent
E = 13
Bias = 127
exp = E + Bias = 140 = 10001100_2
Result:
0 10001100 11011011011010000000000
s exp frac
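The encoding above can be checked by packing the three fields by hand. This is a sketch assuming `float` is IEEE single precision; `pack_float_bits` and `bits_to_float` are hypothetical helper names. The significand and exponent listed above work out to 1.1101101101101_2 × 2^13 = 15213, so the packed pattern should reinterpret as 15213.0f.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack sign bit, biased exponent field, and fraction field into 32 bits. */
static uint32_t pack_float_bits(uint32_t s, uint32_t exp, uint32_t frac) {
    assert(s <= 1 && exp <= 0xFF && frac <= 0x7FFFFF);
    return (s << 31) | (exp << 23) | frac;
}

/* Reinterpret a 32-bit pattern as a float; memcpy avoids aliasing problems. */
static float bits_to_float(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

With s = 0, exp = 13 + 127, and frac = 0x6DB400 (the 23 frac bits shown above), `pack_float_bits` returns 0x466DB400.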
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 14
Carnegie Mellon
Special Values
Condition: exp = 111…1
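In IEEE single precision, an all-ones exp field with frac = 0 encodes infinity, and an all-ones exp field with frac ≠ 0 encodes NaN. The sketch below classifies a raw bit pattern on that basis; the enum and function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef enum { KIND_FINITE, KIND_INF, KIND_NAN } value_kind;

/* Classify a single-precision bit pattern using only its exp and frac fields. */
static value_kind classify_single(uint32_t bits) {
    uint32_t exp = (bits >> 23) & 0xFF;
    uint32_t frac = bits & 0x7FFFFF;
    if (exp != 0xFF)
        return KIND_FINITE;            /* normalized or denormalized */
    return frac == 0 ? KIND_INF : KIND_NAN;
}

/* Reinterpret a 32-bit pattern as a float. */
static float bits_to_float(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

A NaN produced this way compares unequal to itself, which is the usual portable test for NaN.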
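Decoding example (single precision, Bias = 127)
s exp frac
1 8-bits 23-bits
exp = 129, so E = exp – Bias = 129 – 127 = 2
s = 1 -> negative number
M = 1.010 0000 0000 0000 0000 0000 (binary)
M = 1 + 1/4 = 1.25
v = (–1)^s M 2^E = (–1)^1 × 1.25 × 2^2 = –5.0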
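The decoding steps can be written out as a function that applies the formula field by field. This is a sketch under stated assumptions: `decode_single` and `pow2` are hypothetical names, special values are not handled, and the pattern 0xC0A00000 is derived here from the fields in the example (sign 1, exp field 129, frac 010…0) rather than quoted from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* 2^e as a double, by repeated exact multiplication or division. */
static double pow2(int e) {
    double r = 1.0;
    for (; e > 0; e--) r *= 2.0;
    for (; e < 0; e++) r /= 2.0;
    return r;
}

/* Decode a single-precision pattern as (-1)^s * M * 2^E. */
static double decode_single(uint32_t bits) {
    uint32_t s = bits >> 31;
    uint32_t exp = (bits >> 23) & 0xFF;
    uint32_t frac = bits & 0x7FFFFF;
    assert(exp != 0xFF);                       /* infinities and NaNs not handled */
    double f = frac / 8388608.0;               /* frac / 2^23 */
    double m = (exp == 0) ? f : 1.0 + f;       /* denorm: 0.f, norm: 1.f */
    int e = (exp == 0) ? 1 - 127 : (int)exp - 127;
    double v = m * pow2(e);
    return s ? -v : v;
}
```

Every single-precision value, including the denormalized ones, is exactly representable as a double, so the result is exact.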
[Figure: number line of encodings, left to right: NaN, −∞, −Normalized, −Denorm, −0, +0, +Denorm, +Normalized, +∞, NaN]
s exp frac
1 4-bits 3-bits
v = (–1)^s M 2^E
norm: E = exp – Bias
denorm: E = 1 – Bias
Dynamic Range (s = 0 only)
                s exp  frac    E    Value
Denormalized    0 0000 000    -6    0
numbers         0 0000 001    -6    1/8*1/64 = 1/512      closest to zero
                0 0000 010    -6    2/8*1/64 = 2/512      (–1)^0 (0 + 1/4) * 2^-6
                …
                0 0000 110    -6    6/8*1/64 = 6/512
                0 0000 111    -6    7/8*1/64 = 7/512      largest denorm
Normalized      0 0001 000    -6    8/8*1/64 = 8/512      smallest norm
numbers         0 0001 001    -6    9/8*1/64 = 9/512      (–1)^0 (1 + 1/8) * 2^-6
                …
                0 0110 110    -1    14/8*1/2 = 14/16
                0 0110 111    -1    15/8*1/2 = 15/16      closest to 1 below
                0 0111 000     0    8/8*1 = 1
                0 0111 001     0    9/8*1 = 9/8           closest to 1 above
                0 0111 010     0    10/8*1 = 10/8
                …
                0 1110 110     7    14/8*128 = 224
                0 1110 111     7    15/8*128 = 240        largest norm
                0 1111 000    n/a   inf
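The rows of this table can be reproduced with a small decoder for the 8-bit format. The sketch assumes the layout shown (1 sign bit, 4 exp bits, 3 frac bits) and a bias of 7, which is what E = 1 – Bias = –6 implies; `tiny_to_double` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Decode an 8-bit tiny float (1 sign, 4 exp, 3 frac bits, bias 7). */
static double tiny_to_double(uint8_t bits) {
    unsigned s = bits >> 7;
    unsigned exp = (bits >> 3) & 0xF;
    unsigned frac = bits & 0x7;
    assert(exp != 0xF);                              /* inf and NaN not handled */
    double v = (exp == 0 ? 0.0 : 1.0) + frac / 8.0;  /* M: 0.fff or 1.fff */
    int e = (exp == 0 ? 1 : (int)exp) - 7;           /* E = 1 - Bias or exp - Bias */
    for (; e > 0; e--) v *= 2.0;
    for (; e < 0; e++) v /= 2.0;
    return s ? -v : v;
}
```

For example, the pattern 0 1110 111 is 0x77 and decodes to 240, the largest normalized value, and 0 0000 001 is 0x01 and decodes to 1/512.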
Distribution of Values
6-bit IEEE-like format
▪ e = 3 exponent bits
▪ f = 2 fraction bits
▪ Bias is 2^(3-1) – 1 = 3
s exp frac
1 3-bits 2-bits
[Figure: number line from -15 to 15 marking denormalized, normalized, and infinity values]
[Figure: close-up from -1 to 1 marking denormalized, normalized, and infinity values]
Quiz Time!
Check out:
https://canvas.cmu.edu/courses/16836
x +f y = Round(x + y)
x ×f y = Round(x × y)
Basic idea
▪ First compute exact result
▪ Make it fit into desired precision
▪ Possibly overflow if exponent too large
▪ Possibly round to fit into frac
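The "compute exactly, then round" model can be observed near 2^24, where single precision can no longer represent every integer. This is a sketch assuming IEEE single precision and the default round-to-nearest-even mode; `add_single` and `add_single_examples` are hypothetical names.

```c
#include <assert.h>

/* Single-precision sum: the exact sum rounded to the nearest float.
 * The volatile store forces the result to be held as a float. */
static float add_single(float x, float y) {
    volatile float r = x + y;
    return r;
}

/* Above 2^24 = 16777216 only even integers are representable as floats. */
static void add_single_examples(void) {
    assert(add_single(16777216.0f, 1.0f) == 16777216.0f); /* exact sum is a tie: rounds to even */
    assert(add_single(16777216.0f, 2.0f) == 16777218.0f); /* exact sum is representable */
    assert(add_single(16777216.0f, 3.0f) == 16777220.0f); /* tie again: rounds to even */
}
```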
Rounding
Rounding Modes (illustrate with $ rounding)
Examples
▪ Round to nearest 1/4 (2 bits right of binary point)
Value    Binary       Rounded    Action           Rounded Value
2 3/32   10.00011_2   10.00_2    (< 1/2: down)    2
2 3/16   10.00110_2   10.01_2    (> 1/2: up)      2 1/4
2 7/8    10.11100_2   11.00_2    (= 1/2: up)      3
2 5/8    10.10100_2   10.10_2    (= 1/2: down)    2 1/2
Rounding 1.BBGRXXX
Guard bit: LSB of result
Sticky bit: OR of remaining bits
Round bit: 1st bit removed
Round up conditions
▪ Round = 1, Sticky = 1 ➙ > 0.5
▪ Guard = 1, Round = 1, Sticky = 0 ➙ Round to even
Fraction GRS Incr? Rounded
1.0000000 000 N 1.000
1.1010000 100 N 1.101
1.0001000 010 N 1.000
1.0011000 110 Y 1.010
1.0001010 011 Y 1.001
1.1111100 111 Y 10.000
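The guard/round/sticky rule can be written as a few bit operations. The sketch below assumes the layout used in the table: an 8-bit value 1.BBGRXXX (leading 1 in bit 7) rounded to the 4-bit value 1.BBG; `round_grs` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Round 1.BBGRXXX (8 bits) to 1.BBG (4 bits) using round-to-nearest-even.
 * The result may be 10000 (binary), which the caller must postnormalize. */
static unsigned round_grs(uint8_t frac8) {
    assert(frac8 & 0x80);                 /* input must be normalized */
    unsigned kept = frac8 >> 4;           /* 1BBG */
    unsigned guard = kept & 1;            /* LSB of the result */
    unsigned round = (frac8 >> 3) & 1;    /* first bit removed */
    unsigned sticky = (frac8 & 0x7) != 0; /* OR of the remaining bits */
    if (round && (sticky || guard))       /* > 1/2, or exactly 1/2 with odd LSB */
        kept++;
    return kept;
}
```

Applied to the table rows, 1.0011000 (0x98) rounds up to 1.010 and 1.0001000 (0x88) stays at 1.000.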
FP Multiplication
(–1)^s1 M1 2^E1 × (–1)^s2 M2 2^E2
Exact Result: (–1)^s M 2^E
▪ Sign s: s1 XOR s2 (written s1 ^ s2 in C)
▪ Significand M: M1 × M2
▪ Exponent E: E1 + E2
Fixing
▪ If M ≥ 2, shift M right, increment E
▪ If E out of range, overflow
▪ Round M to fit frac precision
Implementation
▪ Biggest chore is multiplying significands
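The steps above can be sketched in software for single precision. This is an illustration under loud assumptions, not how any particular hardware does it: both inputs and the result must be normalized (no zeros, denormalized values, infinities, NaNs, overflow, or underflow), and `fmul_sketch`, `f2u`, and `u2f` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint32_t f2u(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float u2f(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Multiply two normalized floats; assumes the result is normalized too. */
static float fmul_sketch(float a, float b) {
    uint32_t ua = f2u(a), ub = f2u(b);
    uint32_t ea = (ua >> 23) & 0xFF, eb = (ub >> 23) & 0xFF;
    assert(ea != 0 && ea != 0xFF && eb != 0 && eb != 0xFF);
    uint32_t s = (ua >> 31) ^ (ub >> 31);            /* s = s1 ^ s2 */
    int e = ((int)ea - 127) + ((int)eb - 127);       /* E = E1 + E2 */
    uint64_t ma = (ua & 0x7FFFFF) | 0x800000;        /* 1.frac as a 24-bit integer */
    uint64_t mb = (ub & 0x7FFFFF) | 0x800000;
    uint64_t m = ma * mb;                            /* exact 48-bit product */
    if (m & (1ULL << 47)) e++;                       /* M >= 2: renormalize, E + 1 */
    else m <<= 1;                                    /* put the leading 1 at bit 47 */
    uint64_t kept = m >> 24;                         /* top 24 bits: 1.frac */
    uint64_t rem = m & 0xFFFFFF;                     /* bits being dropped */
    if (rem > 0x800000 || (rem == 0x800000 && (kept & 1)))
        kept++;                                      /* round to nearest even */
    if (kept == (1ULL << 24)) { kept >>= 1; e++; }   /* rounding overflowed M */
    return u2f((s << 31) | ((uint32_t)(e + 127) << 23) | ((uint32_t)kept & 0x7FFFFF));
}
```

Most of the work is the 24-bit by 24-bit integer product, which matches the remark that multiplying significands is the biggest chore.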
Monotonicity
▪ a ≥ b & c ≥ 0 ⇒ a * c ≥ b * c? Almost
▪ Except for infinities & NaNs
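One concrete exception, as a sketch assuming IEEE arithmetic: with a = +∞, b = 1, and c = 0, the product a * c is NaN, and every ordered comparison involving NaN is false. The function names below are hypothetical.

```c
#include <assert.h>

/* Returns 1 if a*c >= b*c holds when both products are computed as floats. */
static int product_order_holds(float a, float b, float c) {
    volatile float ac = a * c;
    volatile float bc = b * c;
    return ac >= bc;
}

static void monotonicity_examples(void) {
    volatile float big = 3.0e38f;
    float inf = big * 10.0f;                        /* overflows to +infinity */
    assert(product_order_holds(3.0f, 2.0f, 5.0f));  /* ordinary case */
    assert(inf >= 1.0f);                            /* a >= b and c >= 0 hold ... */
    assert(!product_order_holds(inf, 1.0f, 0.0f));  /* ... but inf * 0 is NaN */
}
```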
Floating Point in C
C Guarantees Two Levels
▪ float single precision
▪ double double precision
Conversions/Casting
▪ Casting between int, float, and double changes bit representation
▪ double/float → int
▪ Truncates fractional part
▪ Like rounding toward zero
▪ Not defined when out of range or NaN: generally sets to TMin
▪ int → double
▪ Exact conversion, as long as int has ≤ 53 bit word size
▪ int → float
▪ Will round according to rounding mode
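These conversion rules can be checked with a few casts. The sketch assumes a 32-bit `int` and IEEE `float` and `double`; the helper names are hypothetical.

```c
#include <assert.h>

/* 1 if converting x to float loses no information. */
static int float_roundtrip_ok(int x) {
    assert(sizeof(int) == 4);       /* assumption: 32-bit int */
    volatile float f = (float)x;    /* may round: float has a 24-bit significand */
    return (double)f == (double)x;  /* compare in double to avoid an out-of-range cast */
}

/* 1 if converting x to double and back gives x again. */
static int double_roundtrip_ok(int x) {
    volatile double d = (double)x;  /* exact: double has a 53-bit significand */
    return (int)d == x;
}

/* double -> int discards the fractional part (rounds toward zero). */
static int truncate_to_int(double d) {
    return (int)d;
}
```

For example, 16777216 (2^24) survives the trip through `float`, but 16777217 does not, while every 32-bit `int` survives the trip through `double`.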
Summary
IEEE Floating Point has clear mathematical properties
Represents numbers of form M × 2^E
One can reason about operations independent of implementation
▪ As if computed with perfect precision and then rounded
Not the same as real arithmetic
▪ Violates associativity/distributivity
▪ Makes life difficult for compilers & serious numerical applications programmers
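The loss of associativity is easy to reproduce. In this sketch, which assumes IEEE single precision, the same three values are added in two different groupings; the function names are hypothetical. Near 10^10 adjacent floats are 1024 apart, so adding 3.14 to 1e10f rounds back to 1e10f.

```c
#include <assert.h>

/* (a + b) + c, with each intermediate result rounded to float. */
static float add3_left(float a, float b, float c) {
    volatile float t = a + b;
    volatile float r = t + c;
    return r;
}

/* a + (b + c), with each intermediate result rounded to float. */
static float add3_right(float a, float b, float c) {
    volatile float t = b + c;
    volatile float r = a + t;
    return r;
}

static void associativity_examples(void) {
    assert(add3_left(3.14f, 1e10f, -1e10f) == 0.0f);   /* 3.14 is absorbed, then cancelled */
    assert(add3_right(3.14f, 1e10f, -1e10f) == 3.14f); /* large terms cancel first */
}
```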
Additional Slides
Case Study
▪ Convert 8-bit unsigned numbers to tiny floating point format
Example Numbers
128 10000000
13 00001101
17 00010001
19 00010011
138 10001010
63 00111111
Postnormalize
Issue
▪ Rounding may have caused overflow
▪ Handle by shifting right once & incrementing exponent
Value   Rounded   Exp   Adjusted   Numeric Result
128     1.000     7                128
13      1.101     3                13
17      1.000     4                16
19      1.010     4                20
138     1.001     7                144
63      10.000    5     1.000/6    64
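The whole case study (normalize, round, postnormalize, encode) fits in one short function. This is a sketch assuming the tiny format with 1 sign bit, 4 exp bits, 3 frac bits, and bias 7; `u8_to_tiny` and `tiny_to_uint` are hypothetical names. Inputs of 248 and above round up to the infinity encoding, since the largest normalized value is 240.

```c
#include <assert.h>
#include <stdint.h>

/* Convert an 8-bit unsigned value to the tiny floating point encoding. */
static uint8_t u8_to_tiny(uint8_t x) {
    if (x == 0)
        return 0;
    int e = 7;                                  /* E if the leading 1 is in bit 7 */
    uint8_t m = x;
    while (!(m & 0x80)) { m <<= 1; e--; }       /* normalize to 1.BBGRXXX */
    unsigned kept = m >> 4;                     /* 1BBG */
    unsigned guard = kept & 1;
    unsigned round = (m >> 3) & 1;
    unsigned sticky = (m & 0x7) != 0;
    if (round && (sticky || guard))
        kept++;                                 /* round to nearest even */
    if (kept & 0x10) { kept >>= 1; e++; }       /* postnormalize: 10.000 -> 1.000 */
    return (uint8_t)(((unsigned)(e + 7) << 3) | (kept & 0x7));
}

/* Value of a normalized, non-negative tiny float, dropping any fraction bits. */
static unsigned tiny_to_uint(uint8_t t) {
    unsigned exp = (t >> 3) & 0xF, frac = t & 0x7;
    assert(exp >= 1 && exp <= 14);              /* normalized encodings only */
    int e = (int)exp - 7;                       /* E = exp - Bias */
    unsigned m = 8 | frac;                      /* 1.fff scaled by 8 */
    return e >= 3 ? m << (e - 3) : m >> (3 - e);
}
```

Converting and decoding the example numbers gives the numeric results in the table: 17 becomes 16, 19 becomes 20, 138 becomes 144, and 63 becomes 64.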