5 Data: Floating Point v1
Fractional Binary Numbers
• Representation
• Bits to right of “binary point” represent fractional powers of 2
• Represents rational number: Σ b_k × 2^k, summing over bit positions k from -j up to i
• Observations
• Divide by 2 by shifting right
• Multiply by 2 by shifting left
• Numbers of the form 0.111111…₂ are just below 1.0
• 1/2 + 1/4 + 1/8 + … + 1/2^i + … ➙ 1.0
• Use notation 1.0 – ε
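A small C sketch of these observations (the variable names are illustrative, not from the slides): scaling by a power of two with ldexp just moves the binary point and is exact, and the partial sums 1/2 + 1/4 + … stay strictly below 1.0.

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    /* Summing 1/2 + 1/4 + ... + 1/2^i approaches but never reaches 1.0 */
    double sum = 0.0;
    for (int i = 1; i <= 30; i++)
        sum += pow(2.0, -i);                 /* add 2^-i */
    printf("1/2 + ... + 1/2^30 = %.12f\n", sum);   /* 0.999999999069 */

    /* Scaling by powers of 2 just moves the binary point: no rounding */
    double x = 0.75;                         /* 0.11 in binary */
    printf("x/2 = %.4f, x*2 = %.4f\n", ldexp(x, -1), ldexp(x, 1));  /* 0.375, 1.5 */
    return 0;
}
```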
Representable Numbers
• Limitation #1
• Can only exactly represent numbers of the form x/2^k
• Other rational numbers have repeating bit representations
• Value    Representation
• 1/3      0.0101010101[01]…₂
• 1/5      0.001100110011[0011]…₂
• 1/10     0.0001100110011[0011]…₂
• Limitation #2
• Just one setting of binary point within the w bits
• Limited range of numbers (very small values? very large?)
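A quick C illustration of Limitation #1 (assuming a standard IEEE double): 1/10 has a repeating binary expansion, so what gets stored and printed is only the nearest representable value, while 3/8 = x/2^k is exact.

```c
#include <stdio.h>

int main(void) {
    /* 0.1 (= 1/10) repeats in binary, so the stored double is only
       the nearest representable value, not exactly 1/10 */
    printf("%.20f\n", 0.1);     /* 0.10000000000000000555 */

    /* 0.375 = 3/8 = x/2^k, so it is stored exactly */
    printf("%.20f\n", 0.375);   /* 0.37500000000000000000 */
    return 0;
}
```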
Scientific Notation
• Allows us to specify a number and where the decimal point goes
• Useful notation for very small and very large numbers
• ±m × 10^n
• n is the order of magnitude
• m is called the significand (also called the mantissa)
• Example
• 123.456e-2 = 123.456 × 10^-2 = 1.23456
• 123.456e2 = 123.456 × 10^2 = 12345.6
• 1.23456e4 = 1.23456 × 10^4 = 12345.6
• Normalized notation
• Exponent is chosen so that m is at least 1 but less than 10
• 12345.6 would be written as 1.23456e4 in normalized form
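As a rough C illustration, printf's %e conversion prints exactly this normalized notation; the equality check below assumes the usual correctly rounded decimal-to-binary conversion, so all three spellings of the same value compare equal.

```c
#include <stdio.h>

int main(void) {
    /* %e prints normalized scientific notation: one nonzero digit
       before the decimal point, then the exponent */
    printf("%e\n", 12345.6);      /* 1.234560e+04 */
    printf("%e\n", 123.456e-2);   /* 1.234560e+00 */

    /* 123.456e2, 1.23456e4, and 12345.6 denote the same value; on typical
       implementations they convert to the same double */
    printf("%d\n", 123.456e2 == 1.23456e4);   /* 1 */
    return 0;
}
```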
Floating-Point Representation
• Numerical Form: v = (–1)^s × M × 2^E
• Sign bit s determines whether number is negative (1) or positive (0)
• Significand M is the binary fractional value of the number, usually normalized
• Exponent E weights the significand by a (possibly negative) power of two
• Example: floating-point representation of 15213.0
• 15213₁₀ = 11101101101101₂
= 1.1101101101101₂ × 2^13 (normalized form)
• Significand
• M = 1.1101101101101₂
• Exponent
• E = 13
• Sign bit
• S = 0 (positive number)
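A small C check of this decomposition using the standard frexp function. Note that frexp normalizes to the 0.xxx form with m in [0.5, 1), hence the factor-of-2 adjustment below to recover the 1.xxx form used on the slide.

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    /* frexp splits v into m * 2^e with 0.5 <= m < 1; shifting one bit
       position gives the 1.xxx form: 15213.0 = 1.8571777... * 2^13 */
    int e;
    double m = frexp(15213.0, &e);                      /* m = 0.9285888..., e = 14 */
    printf("15213.0 = %.13f * 2^%d\n", 2 * m, e - 1);   /* 1.8571777343750 * 2^13 */
    return 0;
}
```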
Floating-Point Representation
• Numerical Form: v = (–1)^s × M × 2^E
• Sign bit s determines whether number is negative (1) or positive (0)
• Significand M is the binary fractional value of the number, usually normalized
• Exponent E weights the significand by a (possibly negative) power of two
• Encoding
• MSB is the sign bit s (0 for +, 1 for –)
• exp field encodes E (but is not equal to E)
• frac field encodes M (but is not equal to M)
[ s | exp | frac ]
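A sketch of how the three fields can be pulled out of a float in C, assuming float is the 32-bit IEEE single-precision format; show_fields is an illustrative helper, not a library function.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Pull apart the three fields of a single-precision float:
   bit 31 = s, bits 30..23 = exp, bits 22..0 = frac */
static void show_fields(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);          /* reinterpret the bytes */
    unsigned s    = bits >> 31;
    unsigned exp  = (bits >> 23) & 0xFF;
    unsigned frac = bits & 0x7FFFFF;
    printf("%g: s=%u exp=%u frac=0x%06X\n", f, s, exp, frac);
}

int main(void) {
    show_fields(1.0f);       /* s=0 exp=127 frac=0x000000 */
    show_fields(-0.75f);     /* s=1 exp=126 frac=0x400000 */
    return 0;
}
```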
Normalized Encoding Example: 15213.0
• Significand
• M = 1.1101101101101₂
• frac = 11011011011010000000000₂
• Exponent
• E = 13
• Bias = 127
• Exp = E + Bias = 140 = 10001100₂
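Reassembling those fields in C reproduces the value. The constant 0x6DB400 is just the 23-bit frac string above written in hex; the sketch assumes a 32-bit IEEE float.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    /* Rebuild 15213.0 from the fields above:
       s = 0, Exp = 140 (E = 13 + bias 127), frac = 11011011011010000000000_2 */
    uint32_t bits = (0u << 31) | (140u << 23) | 0x6DB400u;
    float v;
    memcpy(&v, &bits, sizeof v);
    printf("0x%08X decodes to %f\n", (unsigned)bits, v);  /* 0x466DB400 decodes to 15213.000000 */
    return 0;
}
```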
Denormalized Values
v = (–1)^s × M × 2^E, with E = 1 – Bias
• Goal: To represent 0 and have good precision for numbers very close to zero
• Can’t do this with normalized values having an implied leading 1.xxxx…xxx
• Condition: exp = 000…0 (all zeros for exp)
• Significand coded with implied leading 0: M = 0.xxx…x₂
• xxx…x: are the bits encoding frac
• Exponent value: E = 1 – Bias (instead of E = 0 – Bias)
• This allows for a smooth transition between normalized and denormalized numbers
• Cases
• exp = 000…0, frac = 000…0
• Represents zero value
• Note distinct values: +0 and –0 (why?)
• exp = 000…0, frac ≠ 000…0
• Numbers closest to 0.0
• Equispaced
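A short C illustration of these cases, assuming IEEE single precision: the smallest denormalized float, the smallest normalized float, the uniform spacing of denormals near zero, and the fact that +0 and –0 compare equal even though their bit patterns differ.

```c
#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void) {
    /* Smallest positive denormalized float: frac = 0...01, E = 1 - 127 = -126 */
    float tiny = ldexpf(1.0f, -149);             /* 2^-23 * 2^-126 */
    printf("smallest denormal = %g\n", tiny);    /* ~1.4013e-45 */
    printf("smallest normal   = %g\n", FLT_MIN); /* ~1.17549e-38 */

    /* Denormals are equispaced: consecutive values differ by 2^-149 */
    printf("spacing near 0    = %g\n", nextafterf(tiny, 1.0f) - tiny);  /* ~1.4013e-45 */

    /* +0 and -0 are distinct bit patterns but compare equal */
    printf("0.0f == -0.0f: %d\n", 0.0f == -0.0f);   /* 1 */
    return 0;
}
```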
[ s (1 bit) | exp (4 bits) | frac (3 bits) ]
• 8-bit Floating-Point Representation
• the sign bit is in the most significant bit
• the next four bits are the exponent, with a bias of 2^(4-1) – 1 = 7
• the last three bits are the frac
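A sketch of a decoder for this 8-bit format, assuming the bias of 7 above and the same normalized/denormalized/special cases as IEEE; tiny_fp8_value is an illustrative name, not a standard function.

```c
#include <stdio.h>
#include <math.h>

/* Decode one byte of the 8-bit format sketched above:
   1 sign bit, 4 exponent bits (bias 7), 3 fraction bits */
static double tiny_fp8_value(unsigned char b) {
    int s    = (b >> 7) & 0x1;
    int e    = (b >> 3) & 0xF;
    int frac =  b       & 0x7;
    double sign = s ? -1.0 : 1.0;

    if (e == 0xF)                              /* exp all ones: special values */
        return frac == 0 ? sign * INFINITY : NAN;
    if (e == 0)                                /* denormalized: M = 0.frac, E = 1 - bias */
        return sign * (frac / 8.0) * exp2(1 - 7);
    return sign * (1.0 + frac / 8.0) * exp2(e - 7);   /* normalized: M = 1.frac */
}

int main(void) {
    printf("%g\n", tiny_fp8_value(0x01));   /* smallest denormal: (1/8) * 2^-6 ~ 0.00195 */
    printf("%g\n", tiny_fp8_value(0x38));   /* exp = bias, frac = 0: exactly 1 */
    printf("%g\n", tiny_fp8_value(0x77));   /* largest normalized: (15/8) * 2^7 = 240 */
    return 0;
}
```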
Distribution of Values
• 6-bit IEEE-like format
• exp = 3 exponent bits
• frac = 2 fraction bits
• Bias is 2^(3-1) – 1 = 3
[ s (1 bit) | exp (3 bits) | frac (2 bits) ]
[Figure: number line from –15 to +15 showing where the denormalized, normalized, and infinite values of this format fall]
[Figure: close-up view of the same distribution from –1 to +1, showing denormalized and normalized values near zero]
Floating Point Operations: Basic Idea
• x +f y = Round(x + y)
• x ×f y = Round(x × y)
• Basic idea
• First compute exact result
• Make it fit into desired precision
• Possibly overflow if exponent too large
• Possibly round to fit into frac
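A C example of rounding to fit into frac, assuming IEEE single precision: 2^24 + 1 needs one more significand bit than float provides, so the exact sum is rounded back down.

```c
#include <stdio.h>

int main(void) {
    /* 2^24 + 1 needs 25 significand bits, but float holds only 24
       (23 frac bits plus the implied leading 1), so the sum is rounded */
    float big = 16777216.0f;                 /* 2^24 */
    printf("%d\n", big + 1.0f == big);       /* 1: the +1 is rounded away */

    /* The double sum is exact; rounding happens only at the cast to float */
    double exact = 16777216.0 + 1.0;
    printf("%.1f vs %.1f\n", exact, (double)(float)exact);  /* 16777217.0 vs 16777216.0 */
    return 0;
}
```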
Rounding
• Rounding Modes (illustrate with rounding to the nearest dollar)
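A sketch of the four modes using the standard <fenv.h> interface (which modes are actually switchable is implementation-dependent); rint rounds to an integer, i.e. "to the nearest dollar", under whichever mode is current. The dollar amounts here are illustrative inputs, not taken from the slides.

```c
#include <stdio.h>
#include <fenv.h>
#include <math.h>

int main(void) {
    double dollars[] = {1.40, 1.60, 1.50, 2.50, -1.50};
    int modes[]      = {FE_TOWARDZERO, FE_DOWNWARD, FE_UPWARD, FE_TONEAREST};
    const char *names[] = {"toward zero", "round down", "round up", "nearest even"};

    for (int m = 0; m < 4; m++) {
        fesetround(modes[m]);
        printf("%-12s:", names[m]);
        for (int i = 0; i < 5; i++)
            printf(" %5.1f", rint(dollars[i]));   /* round to whole dollars */
        printf("\n");
    }
    /* Note: under "nearest even", both 1.50 and 2.50 round to 2, and -1.50 to -2 */
    fesetround(FE_TONEAREST);                     /* restore the default mode */
    return 0;
}
```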
• Examples
• Round to nearest 1/4 (2 bits right of binary point)
• Value     Binary       Rounded   Action              Rounded Value
• 2 3/32    10.00011₂    10.00₂    < 1/2: round down   2
• 2 3/16    10.00110₂    10.01₂    > 1/2: round up     2 1/4
• 2 7/8     10.11100₂    11.00₂    = 1/2: round up     3
• 2 5/8     10.10100₂    10.10₂    = 1/2: round down   2 1/2
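These rows can be checked in C: under the default round-to-nearest-even mode, scaling by 4, rounding with nearbyint, and scaling back rounds to the nearest 1/4, with ties going to the value whose last kept bit is 0.

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    /* Round to the nearest 1/4: scale by 4, round to nearest (even), scale back */
    double vals[] = {2.09375 /* 2 3/32 */, 2.1875 /* 2 3/16 */,
                     2.875   /* 2 7/8  */, 2.625  /* 2 5/8  */};
    for (int i = 0; i < 4; i++)
        printf("%.5f -> %.2f\n", vals[i], nearbyint(vals[i] * 4.0) / 4.0);
    /* Prints 2.00, 2.25, 3.00, 2.50, matching the table above */
    return 0;
}
```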
Floating Point in C
• C Guarantees Two Levels
• float: single precision
• double: double precision
• Conversions/Casting
• Casting between int, float, and double changes bit representation
• double/float → int
• Truncates fractional part
• Like rounding toward zero
• Not defined when out of range or NaN: Generally sets to TMin
• int → double
• Exact conversion, as long as int has ≤ 53 bit word size
• int → float
• Will round according to rounding mode
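A few of these conversions in C. The out-of-range cast is undefined behavior in the language; the TMin result shown is only what x86 implementations typically produce.

```c
#include <stdio.h>
#include <limits.h>

int main(void) {
    /* double/float -> int truncates toward zero */
    printf("%d %d\n", (int)2.9, (int)-2.9);         /* 2 -2 */

    /* Out-of-range conversion: undefined by C, typically TMin on x86 */
    printf("%d (INT_MIN = %d)\n", (int)1e10, INT_MIN);

    /* int -> float rounds when the value needs more than 24 significand bits */
    int i = (1 << 24) + 1;                          /* 16777217 */
    printf("%d -> %.1f\n", i, (double)(float)i);    /* 16777217 -> 16777216.0 */

    /* int -> double is exact for any 32-bit int */
    printf("%d\n", (double)i == 16777217.0);        /* 1 */
    return 0;
}
```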
Summary
• IEEE Floating Point has clear mathematical properties
• Represents numbers of form M × 2^E
• One can reason about operations independent of implementation
• As if computed with perfect precision and then rounded
• Not the same as real arithmetic
• Violates associativity/distributivity in some corner cases
• Overflow and inexactness of rounding
• (3.14 + 1e10) - 1e10 evaluates to 0, but 3.14 + (1e10 - 1e10) evaluates to 3.14
• Makes life difficult for compilers & serious numerical applications programmers
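The parenthesization example written out in single-precision C: 3.14 is entirely lost when rounded next to 1e10, so the two groupings give different answers.

```c
#include <stdio.h>

int main(void) {
    float a = (3.14f + 1e10f) - 1e10f;   /* 3.14 rounded away -> 0.0 */
    float b = 3.14f + (1e10f - 1e10f);   /* exact cancellation -> 3.14 */
    printf("%f %f\n", a, b);             /* 0.000000 3.140000 */
    return 0;
}
```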