04-float-2
Instructor:
Alan L. Cox
Alternative formats
Floating point in C
Summary
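[Figure: bit positions $b_i \cdots b_2\, b_1\, b_0 \,.\, b_{-1}\, b_{-2}\, b_{-3} \cdots b_{-j}$, with weights $2^i, \ldots, 4, 2, 1$ to the left of the binary point and $1/2, 1/4, 1/8, \ldots, 2^{-j}$ to the right]
Representation
Bits to right of “binary point” represent fractional powers of 2
Represents rational number: $\sum_{k=-j}^{i} b_k \times 2^k$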
Observations
Divide by 2 by shifting right (unsigned)
Multiply by 2 by shifting left
Numbers of form $0.111111\ldots_2$ are just below 1.0
$1/2 + 1/4 + 1/8 + \ldots + 1/2^i + \ldots \to 1.0$
Use notation 1.0 – ε
Representable Numbers
Limitation #1
Can only exactly represent numbers of the form $x/2^k$
Other rational numbers have repeating bit representations
Value    Representation
1/3      $0.0101010101[01]\ldots_2$
1/5      $0.001100110011[0011]\ldots_2$
1/10     $0.0001100110011[0011]\ldots_2$
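Limitation #1 can be observed directly in C. The following is a minimal sketch, assuming `double` is IEEE binary64; the helper names `repeat_add` and `show_repeat_add` are made up for illustration. Adding 1/8 (of the form $x/2^k$) eight times gives exactly 1.0, while adding 1/10 ten times does not, because 1/10 is stored as a rounded approximation.

```c
#include <stdio.h>

/* Hypothetical helper: add `step` to 0.0 `n` times in double precision.
   The volatile accumulator forces each partial sum to be rounded to double. */
static double repeat_add(double step, int n)
{
    volatile double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += step;
    return sum;
}

/* Print both sums with 17 significant digits so rounding error is visible. */
static void show_repeat_add(void)
{
    printf("8 x 1/8   = %.17g\n", repeat_add(0.125, 8));
    printf("10 x 1/10 = %.17g\n", repeat_add(0.1, 10));
}
```

Calling `show_repeat_add()` from `main` prints the two sums; comparing each against 1.0 with `==` shows which one is exact.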
Limitation #2
Just one setting of binary point within the w bits
Limited range of numbers (very small values? very large?)
Encoding
MSB s is sign bit s
exp field encodes E (but is not equal to E)
frac field encodes M (but is not equal to M)
Field layout: s | exp | frac
E = Exp – Bias
Value: float F = 15213.0;
$15213_{10} = 11101101101101_2 = 1.1101101101101_2 \times 2^{13}$
Significand
$M = 1.1101101101101_2$
$frac = 11011011011010000000000_2$
Exponent
E = 13
Bias = 127
$Exp = 140 = 10001100_2$
Result:
0 10001100 11011011011010000000000
s exp      frac
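The encoding of 15213.0 can be checked by pulling the three fields out of a `float`. This is a minimal sketch, assuming `float` is IEEE binary32 (1 sign bit, 8 exp bits, 23 frac bits); the helper names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy the bits of a float into a 32-bit integer.
   memcpy avoids the aliasing problems of pointer casts. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Field extractors for the binary32 layout: s (1) | exp (8) | frac (23). */
static unsigned sign_field(float f) { return float_bits(f) >> 31; }
static unsigned exp_field(float f)  { return (float_bits(f) >> 23) & 0xFF; }
static uint32_t frac_field(float f) { return float_bits(f) & 0x7FFFFF; }
```

For `15213.0f`, `exp_field` returns 140 (so E = 140 – 127 = 13) and `frac_field` returns the 23 bits 11011011011010000000000.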
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
Carnegie Mellon
Denormalized (Subnormal)
$v = (-1)^s \, M \, 2^E$
Exponent value: E = 1 – Bias
Condition: exp = 000…0
Significand coded with implied leading 0: $M = 0.xxx\ldots x_2$
xxx…x: bits of frac
Cases
exp = 000…0, frac = 000…0
Represents zero value
Note distinct values: +0 and –0 (why?)
exp = 000…0, frac ≠ 000…0
Numbers closest to 0.0
Equispaced
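The two denormalized cases can be probed from C. The sketch below assumes `float` is IEEE binary32 and that the hardware is not flushing denormals to zero; the helper names are hypothetical. It measures the gap between neighboring denormalized values and exposes the bit patterns of +0 and –0.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: convert between a float and its bit pattern. */
static float bits_to_float(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }
static uint32_t float_to_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }

/* Gap between the values whose bit patterns are k and k + 1.
   For k below 0x800000 both patterns have exp = 000...0 or sit at the
   boundary with the smallest normalized value. */
static float denorm_gap(uint32_t k)
{
    volatile float lo = bits_to_float(k);
    volatile float hi = bits_to_float(k + 1);
    return hi - lo;
}
```

Every gap equals the smallest denormalized value (bit pattern 1), which is the "equispaced" property. The two zeros compare equal with `==` even though `float_to_bits` shows that their sign bits differ.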
Special Values
Condition: exp = 111…1
[Figure: number line of encodings, left to right: NaN, $-\infty$, −Normalized, −Denorm, −0, +0, +Denorm, +Normalized, $+\infty$, NaN]
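The conditions on exp can be turned into a small classifier. This is a sketch for binary32 bit patterns; the enum and function names are made up, and the split of the exp = 111…1 case by frac (zero for infinity, nonzero for NaN) is taken from IEEE 754 rather than from the text above.

```c
#include <stdint.h>

enum kind { KIND_ZERO, KIND_DENORM, KIND_NORM, KIND_INF, KIND_NAN };

/* Classify a binary32 bit pattern by its exp and frac fields. */
static enum kind classify_bits(uint32_t u)
{
    uint32_t exp  = (u >> 23) & 0xFF;
    uint32_t frac = u & 0x7FFFFF;

    if (exp == 0)                 /* exp = 000...0 */
        return frac == 0 ? KIND_ZERO : KIND_DENORM;
    if (exp == 0xFF)              /* exp = 111...1 */
        return frac == 0 ? KIND_INF : KIND_NAN;
    return KIND_NORM;             /* everything in between */
}
```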
Distribution of Values
6-bit IEEE-like format
e = 3 exponent bits
f = 2 fraction bits
Bias is $2^{3-1} - 1 = 3$
Field layout: s (1 bit) | exp (3 bits) | frac (2 bits)
[Figure: number line from -15 to 15 marking the denormalized, normalized, and infinity values of the format]
[Figure: close-up of the same number line from -1 to 1]
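The distribution can be reproduced by decoding all 64 bit patterns. The following is a sketch of a decoder for this 6-bit format; `pow2i` and `decode6` are hypothetical names, and the rules for denormalized and special values are the same IEEE-style rules used for the larger formats.

```c
#include <math.h>   /* only for the INFINITY and NAN macros */

/* Hypothetical helper: 2^e for a small integer e, computed by repeated
   multiplication or division (exact for this range). */
static double pow2i(int e)
{
    double r = 1.0;
    for (; e > 0; e--) r *= 2.0;
    for (; e < 0; e++) r /= 2.0;
    return r;
}

/* Decode a 6-bit pattern: s (1 bit) | exp (3 bits) | frac (2 bits), Bias = 3. */
static double decode6(unsigned bits)
{
    unsigned s = (bits >> 5) & 1, exp = (bits >> 2) & 7, frac = bits & 3;
    double sign = s ? -1.0 : 1.0;

    if (exp == 7)                     /* exp = 111: infinity or NaN */
        return frac == 0 ? sign * INFINITY : NAN;
    if (exp == 0)                     /* denormalized: M = 0.ff, E = 1 - Bias */
        return sign * (frac / 4.0) * pow2i(1 - 3);
    return sign * (1.0 + frac / 4.0) * pow2i((int)exp - 3);  /* normalized */
}
```

Looping `bits` from 0 to 63 and printing `decode6(bits)` lists every value: the largest finite magnitude is 14 and the smallest positive value is 1/16, with values packed more densely near zero.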
$x +_f y = \mathrm{Round}(x + y)$
$x \times_f y = \mathrm{Round}(x \times y)$
Basic idea
First compute exact result
Make it fit into desired precision
Possibly overflow if exponent too large
Possibly round to fit into frac
Rounding
Rounding Modes (illustrate with $ rounding)
Examples
Round to nearest 1/4 (2 bits right of binary point)
Value     Binary          Rounded      Action                Rounded Value
2 3/32    $10.00011_2$    $10.00_2$    (< 1/2: down)         2
2 3/16    $10.00110_2$    $10.01_2$    (> 1/2: up)           2 1/4
2 7/8     $10.11100_2$    $11.00_2$    (exactly 1/2: up)     3
2 5/8     $10.10100_2$    $10.10_2$    (exactly 1/2: down)   2 1/2
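The four examples can be replayed with integer arithmetic. This is a minimal sketch under the assumption that the input is given in 32nds (five bits right of the binary point) and the result is counted in quarters; `round_to_quarters` is a hypothetical helper implementing round-to-nearest with ties going to the even result.

```c
/* Round n32/32 to the nearest multiple of 1/4, ties to even.
   Returns the rounded result counted in quarters. */
static unsigned round_to_quarters(unsigned n32)
{
    unsigned q   = n32 / 8;   /* bits kept: whole quarters */
    unsigned rem = n32 % 8;   /* bits removed, in 32nds; 4 is exactly half */

    if (rem > 4 || (rem == 4 && (q & 1)))   /* above half, or tie with odd q */
        q++;
    return q;
}
```

For example, 2 7/8 is 92/32: 11 quarters with remainder 4, a tie, and 11 is odd, so the result is 12 quarters (3). 2 5/8 is 84/32: 10 quarters with a tie, and 10 is even, so it stays at 10 quarters (2 1/2).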
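FP multiplication: $(-1)^{s_1} M_1 \, 2^{E_1} \times (-1)^{s_2} M_2 \, 2^{E_2}$, with exact result $(-1)^s M \, 2^E$
Sign s: s1 ^ s2
Significand M: M1 × M2
Exponent E: E1 + E2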
Fixing
If M ≥ 2, shift M right, increment E
If E out of range, overflow
Round M to fit frac precision
Implementation
Biggest chore is multiplying significands
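The steps above can be sketched in software for binary32. This is an illustration only, not how hardware is built: the hypothetical `soft_mul` assumes both inputs and the result are normalized (no zeros, denormalized values, infinities, NaNs, overflow, or underflow) and rounds to nearest even.

```c
#include <stdint.h>
#include <string.h>

static uint32_t f2u(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float u2f(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Multiply two normalized binary32 values (see assumptions in the text). */
static float soft_mul(float x, float y)
{
    uint32_t ux = f2u(x), uy = f2u(y);
    uint32_t s = (ux ^ uy) & 0x80000000u;                 /* s = s1 ^ s2 */
    int e = (int)((ux >> 23) & 0xFF) - 127
          + (int)((uy >> 23) & 0xFF) - 127;               /* E = E1 + E2 */
    uint64_t m1 = (ux & 0x7FFFFF) | 0x800000;             /* M1 with leading 1 */
    uint64_t m2 = (uy & 0x7FFFFF) | 0x800000;             /* M2 with leading 1 */
    uint64_t m = m1 * m2;                                 /* exact, 46 frac bits */

    int drop = 23;                                        /* bits to remove */
    if (m >> 47) { drop = 24; e++; }                      /* M >= 2: shift right, E++ */

    uint64_t keep = m >> drop;                            /* 1.frac, 24 bits */
    uint64_t rem  = m & ((1ull << drop) - 1);             /* removed bits */
    uint64_t half = 1ull << (drop - 1);
    if (rem > half || (rem == half && (keep & 1)))        /* round to nearest even */
        keep++;
    if (keep >> 24) { keep >>= 1; e++; }                  /* rounding gave 10.000... */

    return u2f(s | ((uint32_t)(e + 127) << 23) | ((uint32_t)keep & 0x7FFFFF));
}
```

The 24-bit by 24-bit significand product is the expensive part, which matches the remark that multiplying significands is the biggest chore.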
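FP addition: $(-1)^{s_1} M_1 \, 2^{E_1} + (-1)^{s_2} M_2 \, 2^{E_2}$, with exact result $(-1)^s M \, 2^E$
Sign s, significand M:
Result of signed align & add
Exponent E: E1 (taking E1 ≥ E2, so M2 is shifted right by E1 – E2 to align)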
Fixing
If M ≥ 2, shift M right, increment E
If M < 1, shift M left k positions, decrement E by k
Overflow if E out of range
Round M to fit frac precision
Mathematical Properties of FP Add
Compare to those of Abelian Group
Closed under addition? Yes
But may generate infinity or NaN
Commutative? Yes
Associative? No
Overflow and inexactness of rounding
Ex (single precision): (3.14+1e10)-1e10 = 0, 3.14+(1e10-1e10) = 3.14
0 is additive identity? Yes
Every element has additive inverse? Almost
Yes, except for infinities & NaNs
Monotonicity
a ≥ b ⇒ a+c ≥ b+c? Almost
Except for infinities & NaNs
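The associativity example can be run in C. This is a minimal sketch assuming `float` is IEEE binary32 with round-to-nearest; the helper names are hypothetical, and the `volatile` temporaries force each intermediate sum to be rounded to single precision.

```c
/* (a + b) + c, with each step rounded to float. */
static float add_left(float a, float b, float c)
{
    volatile float t = a + b;
    volatile float r = t + c;
    return r;
}

/* a + (b + c), with each step rounded to float. */
static float add_right(float a, float b, float c)
{
    volatile float t = b + c;
    volatile float r = a + t;
    return r;
}
```

With `a = 3.14f`, `b = 1e10f`, `c = -1e10f`, the left grouping loses 3.14 entirely because neighboring floats near 1e10 are 1024 apart, while the right grouping cancels the large values first and keeps 3.14.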
Mathematical Properties of FP Mult
Compare to Commutative Ring
Closed under multiplication? Yes
But may generate infinity or NaN
Multiplication Commutative? Yes
Multiplication is Associative? No
Possibility of overflow, inexactness of rounding
Ex (single precision): (1e20*1e20)*1e-20 = inf, 1e20*(1e20*1e-20) = 1e20
1 is multiplicative identity? Yes
Multiplication distributes over addition? No
Possibility of overflow, inexactness of rounding
Ex (single precision): 1e20*(1e20-1e20) = 0.0, 1e20*1e20 - 1e20*1e20 = NaN
Monotonicity
a ≥ b & c ≥ 0 ⇒ a * c ≥ b *c? Almost
Except for infinities & NaNs
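Both multiplication examples can be checked the same way. The sketch below assumes `float` is IEEE binary32, where 1e40 is out of range; the helper names are hypothetical and the `volatile` temporaries force rounding to single precision at each step.

```c
#include <float.h>

/* (a * b) * c and a * (b * c), each step rounded to float. */
static float mul_left(float a, float b, float c)
{
    volatile float t = a * b;
    volatile float r = t * c;
    return r;
}

static float mul_right(float a, float b, float c)
{
    volatile float t = b * c;
    volatile float r = a * t;
    return r;
}

/* a * (b - c) and a*b - a*c, each step rounded to float. */
static float dist_factored(float a, float b, float c)
{
    volatile float t = b - c;
    volatile float r = a * t;
    return r;
}

static float dist_expanded(float a, float b, float c)
{
    volatile float p = a * b;
    volatile float q = a * c;
    volatile float r = p - q;
    return r;
}
```

A result is infinite if it exceeds `FLT_MAX`, and it is NaN if it compares unequal to itself. In the second grouping of the associativity example the result is close to 1e20 but need not be bit-identical to `1e20f`, since 1e20 and 1e-20 are themselves rounded.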
Application Driven
Reduced range and precision sufficient for some applications, for example,
Image processing
Neural networks
Field layout: s (1 bit) | exp (5 bits) | frac (10 bits) (as in IEEE binary16)
Field layout: s (1 bit) | exp (8 bits) | frac (7 bits) (as in bfloat16)
Statistics Driven
Multiplying a large number of probabilities results in underflow
Hidden Markov Models
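The underflow is easy to trigger even in double precision. The sketch below first multiplies naively, then keeps the running product as a pair (m, e) meaning m × 2^e, rescaling m whenever it gets small. The rescaling scheme is only one common workaround chosen for illustration, not a format named in the text, and all names are hypothetical.

```c
/* Running product stored as m * 2^e, so the exponent range is that of a long. */
struct scaled { double m; long e; };

/* Multiply the running product by p, rescaling by 2^64 when m gets small. */
static struct scaled scaled_mul(struct scaled a, double p)
{
    a.m *= p;
    while (a.m != 0.0 && a.m < 0x1p-64) {
        a.m *= 0x1p64;    /* exact: multiplying by a power of two */
        a.e -= 64;
    }
    return a;
}

/* Product of n copies of p in plain double arithmetic. */
static double naive_product(double p, int n)
{
    double r = 1.0;
    for (int i = 0; i < n; i++)
        r *= p;
    return r;
}

/* The same product kept in scaled form. */
static struct scaled scaled_product(double p, int n)
{
    struct scaled s = { 1.0, 0 };
    for (int i = 0; i < n; i++)
        s = scaled_mul(s, p);
    return s;
}
```

Multiplying 2000 probabilities of 0.001 gives exactly 0.0 in the naive version, while the scaled version keeps a nonzero m with an exponent far below what binary64 can hold.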
Floating Point in C
C provides three standard binary floating point types
float binary32 (single precision)
double binary64 (double precision)
long double varies by machine, at least binary64
binary128 on Arm64
binary64-extended on x86-64 (80 bits)
Field layout (80-bit extended): s (1 bit) | exp (15 bits) | frac (63 or 64 bits)
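What `long double` means on a given machine can be read from `<float.h>`. The sketch below prints the size and significand precision of each type; the function name is hypothetical. Note that the `*_MANT_DIG` macros count the leading bit of the significand, so binary32 reports 24 and binary64 reports 53.

```c
#include <float.h>
#include <stdio.h>

/* Print storage size and significand precision of the three C types. */
static void print_fp_types(void)
{
    printf("float:       %zu bytes, %d significand bits\n",
           sizeof(float), FLT_MANT_DIG);
    printf("double:      %zu bytes, %d significand bits\n",
           sizeof(double), DBL_MANT_DIG);
    printf("long double: %zu bytes, %d significand bits\n",
           sizeof(long double), LDBL_MANT_DIG);
}
```

The `long double` line is the one that changes between platforms, while the first two lines are the same wherever `float` and `double` are binary32 and binary64.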
Floating Point in C
Conversions/Casting
Casting between int, float, and double changes bit representation
double/float → int
Truncates fractional part
Like rounding toward zero
Not defined when out of range or NaN: Generally sets to TMin
int → double
Exact conversion, as long as int has ≤ 53 bit word size
int → float
Will round according to rounding mode
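The conversion rules can be seen with a few round trips. This is a minimal sketch assuming a 32-bit `int`, IEEE binary32 and binary64, and the default round-to-nearest mode; the helper names are hypothetical. Inputs must stay in range, since out-of-range conversions to `int` are not defined.

```c
/* double -> int: truncates the fractional part (rounds toward zero).
   The caller must keep d within the range of int. */
static int trunc_to_int(double d) { return (int)d; }

/* int -> double -> int: exact for every 32-bit int. */
static int via_double(int i)
{
    volatile double d = (double)i;
    return (int)d;
}

/* int -> float -> int: the first step rounds when i needs more than
   24 significant bits. Keep |i| small enough that the rounded float
   still fits in int. */
static int via_float(int i)
{
    volatile float f = (float)i;
    return (int)f;
}
```

16777217 is 2^24 + 1, which needs 25 significant bits. It survives the trip through `double` unchanged but comes back from `float` as 16777216.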
Case Study
Convert 8-bit unsigned numbers to tiny floating point format
Example Numbers
128 10000000
13 00001101
17 00010001
19 00010011
138 10001010
63 00111111
Rounding
1.BBGRXXX
Guard bit (G): LSB of result
Round bit (R): 1st bit removed
Sticky bit (S): OR of remaining bits
Round up conditions
Round = 1, Sticky = 1 ➙ > 0.5
Guard = 1, Round = 1, Sticky = 0 ➙ Round to even
Value Fraction GRS Incr? Rounded
128 1.0000000 000 N 1.000
13 1.1010000 100 N 1.101
17 1.0001000 010 N 1.000
19 1.0011000 110 Y 1.010
138 1.0001010 011 Y 1.001
63 1.1111100 111 Y 10.000
Postnormalize
Issue
Rounding may have caused overflow
Handle by shifting right once & incrementing exponent
Value Rounded Exp Adjusted Result
128 1.000 7 128
13 1.101 3 13
17 1.000 4 16
19 1.010 4 20
138 1.001 7 144
63 10.000 5 1.000/6 64
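The round and postnormalize steps of the case study can be sketched as one function. This assumes, as in the tables above, a significand with three fraction bits (1.BBG); the hypothetical `round_to_tiny` returns the integer value that the rounded result represents rather than an encoded bit pattern.

```c
/* Round an 8-bit unsigned value to a 1.BBB significand using the
   guard, round, and sticky bits, then postnormalize if needed. */
static unsigned round_to_tiny(unsigned x)
{
    if (x == 0)
        return 0;

    int e = 0;                                /* position of the leading 1 */
    while ((x >> (e + 1)) != 0)
        e++;
    if (e <= 3)
        return x;                             /* fits in 1.BBB: nothing removed */

    int drop = e - 3;                         /* number of bits removed */
    unsigned sig       = x >> drop;           /* 1BBB; its LSB is the guard bit */
    unsigned guard     = sig & 1;
    unsigned round_bit = (x >> (drop - 1)) & 1;              /* 1st bit removed */
    unsigned sticky    = (x & ((1u << (drop - 1)) - 1)) != 0; /* OR of the rest */

    if (round_bit && (sticky || guard))       /* > 0.5, or tie with odd result */
        sig++;
    if (sig == 16) {                          /* 10.000: shift right, exponent + 1 */
        sig = 8;
        drop++;
    }
    return sig << drop;
}
```

For 63 the significand rounds up to 10.000, so the postnormalize branch runs and the result is 1.000 with exponent 6, which is 64.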
Summary
IEEE Floating Point has clear mathematical properties
Represents numbers of form $M \times 2^E$
One can reason about operations independent of implementation
As if computed with perfect precision and then rounded
Not the same as real arithmetic
Violates associativity/distributivity
Makes life difficult for compilers & serious numerical applications programmers