Lecture 3 - Floating Point
Carnegie Mellon
Floating point in C
Summary
Representation
▪ Bits to right of “binary point” represent fractional powers of 2
▪ Represents rational number: $\sum_{k=-j}^{i} b_k \times 2^k$ (bits $b_i \dots b_0 \,.\, b_{-1} \dots b_{-j}$)
Observations
▪ Divide by 2 by shifting right
▪ Multiply by 2 by shifting left
▪ Numbers of form 0.111111…₂ are just below 1.0
▪ 1/2 + 1/4 + 1/8 + … + 1/2^i + … ➙ 1.0
▪ Use notation 1.0 – ε
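The "just below 1.0" observation can be checked with a minimal C sketch, assuming IEEE double arithmetic. The helper name `partial_sum` is made up for illustration. It adds the first n fractional powers of 2, which is the value of 0.11…1₂ with n ones.

```c
/* Hypothetical helper: 1/2 + 1/4 + ... + 1/2^n, i.e. 0.11...1 (n ones) in binary.
   Every term and every partial sum is exactly representable for small n. */
double partial_sum(int n) {
    double sum = 0.0;
    double term = 0.5;          /* 1/2^1 */
    for (int i = 1; i <= n; i++) {
        sum += term;
        term /= 2.0;            /* next fractional power of 2 */
    }
    return sum;
}
```

For n = 10 the result falls short of 1.0 by exactly 1/1024, which plays the role of ε in the notation 1.0 – ε.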
Representable Numbers
Limitation
▪ Can only exactly represent numbers of the form x/2^k
▪ Other rational numbers have repeating bit representations
Value   Representation
1/3     0.0101010101[01]…₂
1/5     0.001100110011[0011]…₂
1/10    0.0001100110011[0011]…₂
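One way to see the limitation in practice is a small C sketch, assuming IEEE double precision. The helper name `repeated_add` is hypothetical. Adding 1/8 (of the form x/2^k) eight times gives exactly 1.0, while adding 1/10 ten times does not, because each 0.1 is already a rounded value.

```c
/* Hypothetical helper: add `step` to zero `count` times.
   The volatile accumulator forces each partial sum to be rounded to double. */
double repeated_add(double step, int count) {
    volatile double total = 0.0;
    for (int i = 0; i < count; i++) {
        total = total + step;
    }
    return total;
}
```

A call such as `repeated_add(0.125, 8) == 1.0` holds, whereas `repeated_add(0.1, 10) == 1.0` does not.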
Numerical form: $(-1)^s \, M \, 2^E$
Encoding
▪ MSB is sign bit s
▪ exp field encodes E (but is not equal to E)
▪ frac field encodes M (but is not equal to M)
Layout: s | exp | frac
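The three fields can be pulled out of a C `float` with shifts and masks. This is a sketch under the assumption that `float` is a 32-bit IEEE single-precision value with 1 sign bit, 8 exp bits, and 23 frac bits. The struct and function names are made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the three fields of a single-precision float. */
struct fp_fields {
    uint32_t s;     /* sign bit (1 bit)        */
    uint32_t exp;   /* exponent field (8 bits) */
    uint32_t frac;  /* fraction field (23 bits) */
};

struct fp_fields split_float(float f) {
    uint32_t bits;
    struct fp_fields out;
    memcpy(&bits, &f, sizeof bits);   /* reinterpret the bytes, no numeric conversion */
    out.s    = bits >> 31;
    out.exp  = (bits >> 23) & 0xFFu;
    out.frac = bits & 0x7FFFFFu;
    return out;
}
```

For example, `split_float(-1.0f)` yields s = 1, exp = 127, frac = 0. The exp field holds 127 rather than E = 0, which illustrates "encodes E but is not equal to E".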
Precisions
Single precision: 32 bits
  s: 1 bit | exp: 8 bits | frac: 23 bits
Double precision: 64 bits
  s: 1 bit | exp: 11 bits | frac: 52 bits
Extended precision: 80 bits (Intel only)
  s: 1 bit | exp: 15 bits | frac: 63 or 64 bits
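The single and double precision widths can be cross-checked against what a C compiler reports. This sketch assumes a platform where `float` and `double` are the IEEE single and double formats. The function names are hypothetical.

```c
#include <float.h>
#include <limits.h>

/* Total storage bits of each C type on this platform. */
int float_total_bits(void)  { return (int)(sizeof(float)  * CHAR_BIT); }
int double_total_bits(void) { return (int)(sizeof(double) * CHAR_BIT); }

/* FLT_MANT_DIG and DBL_MANT_DIG count significand bits including the
   leading bit that is not stored, so the frac field is one bit narrower. */
int float_frac_bits(void)  { return FLT_MANT_DIG - 1; }
int double_frac_bits(void) { return DBL_MANT_DIG - 1; }
```

On such a platform these return 32, 64, 23, and 52 respectively.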
Normalized Values
Condition: exp ≠ 000…0 and exp ≠ 111…1
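Example: 15213.0 = 11101101101101₂ = 1.1101101101101₂ × 2^13
Significand
M = 1.1101101101101₂
frac = 11011011011010000000000₂ (bits after the leading 1, padded to 23 bits)
Exponent
E = 13
Bias = 127
Exp = E + Bias = 140 = 10001100₂
Result:
0 10001100 11011011011010000000000
s exp      frac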
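The encoding above can be verified by assembling the three fields into a 32-bit pattern and reading it back as a `float`. This is a sketch assuming 32-bit IEEE single precision. The function name `float_from_fields` is made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: pack s (1 bit), exp (8 bits), frac (23 bits)
   and reinterpret the result as a float. */
float float_from_fields(uint32_t s, uint32_t exp, uint32_t frac) {
    uint32_t bits = (s << 31) | (exp << 23) | frac;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

With s = 0, exp = 13 + 127 = 140, and frac = 0x6DB400 (the 23-bit pattern 11011011011010000000000₂), the call returns 15213.0f.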
Denormalized Values
Condition: exp = 000…0
Special Values
Condition: exp = 111…1
[Figure: number line of floating point encodings, left to right: NaN, −∞, −Normalized, −Denorm, −0, +0, +Denorm, +Normalized, +∞, NaN]
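The three conditions on the exp field translate directly into a classifier. This is a sketch for single precision (8-bit exp, 23-bit frac). The enum and function names are hypothetical. Splitting the exp = 111…1 case into infinity (frac = 0) and NaN (frac ≠ 0) follows the IEEE 754 convention and is not spelled out in the text above.

```c
#include <stdint.h>

enum value_kind { KIND_NORMALIZED, KIND_DENORMALIZED, KIND_INFINITY, KIND_NAN };

/* Classify a raw single-precision bit pattern by its exp and frac fields. */
enum value_kind classify_bits(uint32_t bits) {
    uint32_t exp  = (bits >> 23) & 0xFFu;
    uint32_t frac = bits & 0x7FFFFFu;
    if (exp == 0)                       /* exp = 000...0, includes +0 and -0 */
        return KIND_DENORMALIZED;
    if (exp == 0xFFu)                   /* exp = 111...1: special values */
        return frac == 0 ? KIND_INFINITY : KIND_NAN;
    return KIND_NORMALIZED;             /* everything else */
}
```

For instance, 0x3F800000 (1.0) is normalized, 0x00000001 is denormalized, 0x7F800000 is +∞, and 0x7FC00000 is a NaN.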
8-bit format: s (1 bit) | exp (4 bits) | frac (3 bits)
s exp  frac   E    Value
Denormalized numbers
0 0000 000   -6    0
0 0000 001   -6    1/8*1/64 = 1/512     closest to zero
0 0000 010   -6    2/8*1/64 = 2/512
…
0 0000 110   -6    6/8*1/64 = 6/512
0 0000 111   -6    7/8*1/64 = 7/512     largest denorm
Normalized numbers
0 0001 000   -6    8/8*1/64 = 8/512     smallest norm
0 0001 001   -6    9/8*1/64 = 9/512
…
0 0110 110   -1    14/8*1/2 = 14/16
0 0110 111   -1    15/8*1/2 = 15/16     closest to 1 below
0 0111 000    0    8/8*1 = 1
0 0111 001    0    9/8*1 = 9/8          closest to 1 above
0 0111 010    0    10/8*1 = 10/8
…
0 1110 110    7    14/8*128 = 224
0 1110 111    7    15/8*128 = 240       largest norm
0 1111 000   n/a   inf
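The rows of the table can be reproduced with a short decoder for the 8-bit format. This is a sketch that assumes a bias of 7 (consistent with exp = 0111 giving E = 0) and E = 1 − Bias = −6 for denormalized values, as the table shows. The function name `tiny8_value` is made up for illustration.

```c
#include <math.h>    /* only for the INFINITY and NAN macros */
#include <stdint.h>

/* Decode the 8-bit format: 1 sign bit, 4 exp bits, 3 frac bits, bias 7. */
double tiny8_value(uint8_t bits) {
    int s    = (bits >> 7) & 1;
    int exp  = (bits >> 3) & 0xF;
    int frac = bits & 0x7;
    double m, v;
    int e;

    if (exp == 0xF)                       /* special values */
        return frac ? NAN : (s ? -INFINITY : INFINITY);
    if (exp == 0) {                       /* denormalized: M = 0.frac, E = 1 - 7 */
        e = 1 - 7;
        m = frac / 8.0;
    } else {                              /* normalized: M = 1.frac, E = exp - 7 */
        e = exp - 7;
        m = 1.0 + frac / 8.0;
    }
    v = m;
    for (; e > 0; e--) v *= 2.0;          /* scale by 2^E without libm calls */
    for (; e < 0; e++) v /= 2.0;
    return s ? -v : v;
}
```

For example, `tiny8_value(0x07)` gives 7/512 (largest denorm), `tiny8_value(0x08)` gives 8/512 (smallest norm), and `tiny8_value(0x77)` gives 240 (largest norm).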
Distribution of Values
6-bit IEEE-like format
▪ e = 3 exponent bits
▪ f = 2 fraction bits
▪ Bias is $2^{3-1} - 1 = 3$
Layout: s (1 bit) | exp (3 bits) | frac (2 bits)
[Figure: number line from −15 to 15 marking denormalized, normalized, and infinity values]
[Figure: close-up of the number line from −1 to 1; legend: Denormalized, Normalized, Infinity]
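The uneven spacing that these number lines depict can also be measured on real single-precision values. This sketch assumes a 32-bit IEEE `float` and relies on the fact that, for non-negative finite values, adding 1 to the bit pattern gives the next larger representable number. The function name `gap_above` is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Distance from x to the next larger representable float.
   Assumes x is non-negative, finite, and below the largest finite float. */
float gap_above(float x) {
    uint32_t bits;
    float next;
    memcpy(&bits, &x, sizeof bits);
    bits += 1;                         /* adjacent bit pattern = adjacent value */
    memcpy(&next, &bits, sizeof next);
    return next - x;                   /* exact: both operands are neighbors */
}
```

The gap is 2^-23 just above 1.0, doubles to 2^-22 just above 2.0, and halves to 2^-24 just above 0.5, so values become denser toward zero.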
$x +_f y = \mathrm{Round}(x + y)$
$x \times_f y = \mathrm{Round}(x \times y)$
Basic idea
▪ First compute exact result
▪ Make it fit into desired precision
▪ Possibly overflow if exponent too large
▪ Possibly round to fit into frac
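Both outcomes in the list above (rounding and overflow) show up with ordinary `float` arithmetic. This sketch assumes IEEE single precision with the default round-to-nearest-even mode. The wrapper names `add_f` and `mul_f` are made up for illustration.

```c
#include <float.h>

/* Storing into a volatile float forces the exact result to be
   rounded to single precision before it is returned. */
float add_f(float x, float y) {
    volatile float r = x + y;
    return r;
}

float mul_f(float x, float y) {
    volatile float r = x * y;
    return r;
}
```

The exact sum 2^24 + 1 does not fit in a 24-bit significand, so `add_f(16777216.0f, 1.0f)` rounds back to 16777216.0f. The exact product FLT_MAX × 2 has an exponent that is too large, so `mul_f(FLT_MAX, 2.0f)` overflows to +∞.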
Rounding
Rounding Modes (illustrate with $ rounding)
Examples
▪ Round to nearest 1/4 (2 bits right of binary point)
▪ Halfway cases (= 1/2) round to the value whose last kept bit is 0 (even)
Value     Binary       Rounded    Action           Rounded Value
2 3/32    10.00011₂    10.00₂     (< 1/2: down)    2
2 3/16    10.00110₂    10.01₂     (> 1/2: up)      2 1/4
2 7/8     10.11100₂    11.00₂     (= 1/2: up)      3
2 5/8     10.10100₂    10.10₂     (= 1/2: down)    2 1/2
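The four rows can be reproduced with integer arithmetic. This is a sketch that represents each value as a count of 1/32 units (5 bits right of the binary point) and returns a count of 1/4 units, rounding halfway cases to even. The function name `round_to_quarters` is hypothetical.

```c
/* x32: non-negative value in units of 1/32.
   Returns the rounded value in units of 1/4 (round to nearest, ties to even). */
unsigned round_to_quarters(unsigned x32) {
    unsigned kept = x32 >> 3;        /* keep 2 fractional bits: units of 1/4 */
    unsigned rem  = x32 & 0x7u;      /* the 3 discarded bits */
    unsigned half = 0x4u;            /* 100 in binary: exactly halfway */
    if (rem > half || (rem == half && (kept & 1u)))
        kept += 1;                   /* above half, or a tie with odd last bit */
    return kept;
}
```

For example, 2 7/8 is 92/32 and rounds to 12/4 = 3, while 2 5/8 is 84/32 and rounds to 10/4 = 2 1/2.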
FP Multiplication
$(-1)^{s_1} M_1 2^{E_1} \times (-1)^{s_2} M_2 2^{E_2}$
Exact Result: $(-1)^s M 2^E$
▪ Sign s: s1 ^ s2
▪ Significand M: M1 x M2
▪ Exponent E: E1 + E2
Fixing
▪ If M ≥ 2, shift M right, increment E
▪ If E out of range, overflow
▪ Round M to fit frac precision
Implementation
▪ Biggest chore is multiplying significands
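The steps above can be sketched in C on raw single-precision bit patterns. This is an illustrative sketch, not how any particular hardware does it. It assumes both inputs are nonzero normalized values, uses round-to-nearest-even, and flushes results below the normalized range to zero rather than producing denormalized values. All names are made up for illustration.

```c
#include <stdint.h>
#include <string.h>

static uint32_t to_bits(float f)      { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float    from_bits(uint32_t u) { float f;    memcpy(&f, &u, sizeof f); return f; }

/* Sketch of FP multiplication for nonzero normalized single-precision inputs. */
float fp_mul_sketch(float a, float b) {
    uint32_t ua = to_bits(a), ub = to_bits(b);
    uint32_t s  = (ua >> 31) ^ (ub >> 31);                 /* sign: s1 ^ s2 */
    int32_t  e  = ((int32_t)((ua >> 23) & 0xFF) - 127)     /* exponent: E1 + E2 */
                + ((int32_t)((ub >> 23) & 0xFF) - 127);
    uint64_t m1 = (ua & 0x7FFFFFu) | 0x800000u;            /* 1.frac scaled by 2^23 */
    uint64_t m2 = (ub & 0x7FFFFFu) | 0x800000u;
    uint64_t m  = m1 * m2;                                 /* exact M1 x M2, scaled by 2^46 */
    int shift = 23;
    uint64_t kept, rem, half;

    if (m >> 47) { shift = 24; e++; }      /* M >= 2: shift M right, increment E */

    kept = m >> shift;                     /* 24-bit significand to keep */
    rem  = m & ((1ULL << shift) - 1);      /* discarded low bits */
    half = 1ULL << (shift - 1);
    if (rem > half || (rem == half && (kept & 1)))
        kept++;                            /* round M to fit frac precision */
    if (kept >> 24) { kept >>= 1; e++; }   /* rounding carried M up to 2.0 */

    if (e > 127)                           /* E out of range: overflow to infinity */
        return from_bits((s << 31) | 0x7F800000u);
    if (e < -126)                          /* simplification: flush to signed zero */
        return from_bits(s << 31);
    return from_bits((s << 31) | ((uint32_t)(e + 127) << 23) | ((uint32_t)kept & 0x7FFFFFu));
}
```

Most of the work is the 24-bit by 24-bit integer product, matching the remark that multiplying significands is the biggest chore. For example, `fp_mul_sketch(3.0f, 3.0f)` has M = 1.5 × 1.5 = 2.25 ≥ 2, so M is shifted right and E incremented, giving 9.0f.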
FP Addition
Fixing
▪ If M ≥ 2, shift M right, increment E
▪ If M < 1, shift M left k positions, decrement E by k
▪ Overflow if E out of range
▪ Round M to fit frac precision
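The final rounding step has a visible consequence when the operands differ greatly in magnitude. This sketch assumes IEEE single precision with default rounding. The function name `add_then_subtract` is hypothetical.

```c
/* Add a small value to a big one, then take the big one away again.
   The volatile temporaries force rounding to float after each operation. */
float add_then_subtract(float small, float big) {
    volatile float sum  = small + big;   /* low-order bits of small may be rounded away */
    volatile float back = sum - big;
    return back;
}
```

Near 1e10 adjacent floats are 1024 apart, so `add_then_subtract(3.14f, 1e10f)` returns 0.0f: the 3.14 is lost when the sum is rounded to fit frac. With operands of similar size, as in `add_then_subtract(1.0f, 2.0f)`, nothing is lost and the result is 1.0f.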
Floating Point in C
C Guarantees Two Levels
▪ float: single precision
▪ double: double precision
Conversions/Casting
▪ Casting between int, float, and double changes bit representation
▪ double/float → int
  ▪ Truncates fractional part
  ▪ Like rounding toward zero
  ▪ Not defined when out of range or NaN: generally sets to TMin
▪ int → double
  ▪ Exact conversion, as long as int has ≤ 53 bit word size
▪ int → float
  ▪ Will round according to rounding mode
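The conversion rules above can be exercised with a few casts. This sketch assumes a 32-bit `int`, IEEE `float` and `double`, and the default rounding mode. The wrapper names are made up for illustration, and out-of-range inputs are deliberately avoided because their result is not defined.

```c
/* double -> int: truncates the fractional part (rounds toward zero).
   The argument must be within int range. */
int trunc_to_int(double d) { return (int)d; }

/* int -> double: exact for a 32-bit int, since 32 <= 53. */
double int_to_double(int i) { return (double)i; }

/* int -> float: only 24 significand bits, so large values get rounded. */
float int_to_float(int i) { return (float)i; }
```

For example, `trunc_to_int(-3.99)` is −3, `int_to_double(2147483647)` is exactly 2147483647.0, and `int_to_float(16777217)` rounds to 16777216.0f because 2^24 + 1 needs 25 significand bits.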
Summary
IEEE Floating Point has clear mathematical properties
Represents numbers of form $M \times 2^E$