COA Module6 FloatingPoint
Floating-point Numbers
10/14/24
Some Examples
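1011.1 → $1 \times 2^3 + 0 \times 2^2 + 1 \times 2^1 + 1 \times 2^0 + 1 \times 2^{-1} = 11.5$
101.11 → $1 \times 2^2 + 0 \times 2^1 + 1 \times 2^0 + 1 \times 2^{-1} + 1 \times 2^{-2} = 5.75$
10.111 → $1 \times 2^1 + 0 \times 2^0 + 1 \times 2^{-1} + 1 \times 2^{-2} + 1 \times 2^{-3} = 2.875$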
Some Observations:
• Shift right by 1 bit means divide by 2
• Shift left by 1 bit means multiply by 2
• Numbers of the form $0.111111\ldots_2$ have a value less than 1.0 (one).
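The positional sums above can be checked with a few lines of Python. This is only an illustrative sketch; the helper name `binary_to_decimal` is an assumption and does not come from the slides.

```python
def binary_to_decimal(bits: str) -> float:
    """Evaluate a binary string such as '1011.1' as a weighted sum of powers of 2."""
    int_part, _, frac_part = bits.partition(".")
    value = 0.0
    # Integer bits carry weights 2^0, 2^1, ... counted from the binary point leftwards.
    for i, b in enumerate(reversed(int_part)):
        value += int(b) * 2 ** i
    # Fraction bits carry weights 2^-1, 2^-2, ... counted from the binary point rightwards.
    for i, b in enumerate(frac_part, start=1):
        value += int(b) * 2 ** -i
    return value


# Moving the binary point one place left halves the value each time.
for s in ("1011.1", "101.11", "10.111"):
    print(s, "->", binary_to_decimal(s))
```

Each printed value is half of the previous one, which matches the "shift right by 1 bit means divide by 2" observation.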
Limitations of Representation
• In the fractional part, we can only represent numbers of the form $x/2^k$ exactly.
• Other numbers have repeating bit representations (i.e. the expansion never terminates).
• Examples:
3/4 = 0.11
7/8 = 0.111
5/8 = 0.101
1/3 = 0.0101010101 [01] ….
1/5 = 0.001100110011 [0011] ….
1/10 = 0.0001100110011 [0011] ….
• The more bits we use, the more accurate the representation.
• We sometimes see: (1/3)*3 ≠ 1.
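The repeating expansions can be generated by repeated doubling. The sketch below uses Python's `fractions.Fraction` to keep the arithmetic exact; the function name `fraction_bits` is an assumption made for illustration.

```python
from fractions import Fraction


def fraction_bits(x: Fraction, n: int) -> str:
    """Return the first n bits after the binary point of x, where 0 <= x < 1."""
    bits = []
    for _ in range(n):
        x *= 2               # shift left by one bit
        bit = int(x >= 1)    # the bit that crossed the binary point
        bits.append(str(bit))
        x -= bit
    return "0." + "".join(bits)


print(fraction_bits(Fraction(5, 8), 8))     # terminates: only zeros after 0.101
print(fraction_bits(Fraction(1, 3), 12))    # the pattern 01 repeats
print(fraction_bits(Fraction(1, 10), 13))   # the pattern 0011 repeats

# A double stores the nearest x/2^k, so decimal 0.1 is not held exactly.
print(Fraction(0.1) == Fraction(1, 10))     # False
print(0.1 + 0.2 == 0.3)                     # False
```

Whether a particular identity such as (1/3)*3 = 1 survives depends on how the rounding errors happen to cancel, so the last line uses 0.1 + 0.2 as a case that reliably fails in double precision.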
• To represent numbers with fractional parts, we can assume that the binary point sits at a fixed position within the number (say, n bits in the integer part and m bits in the fraction part). → Fixed-point representation
• Lacks flexibility.
• Cannot be used to represent very small or very large numbers (for example: $2.53 \times 10^{-26}$, $1.7562 \times 10^{+35}$, etc.).
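A minimal sketch of the fixed-point idea, assuming an unsigned format with n = 8 integer bits and m = 8 fraction bits. The helper names `to_fixed` and `from_fixed` are hypothetical and chosen only for this illustration.

```python
def to_fixed(x: float, n: int, m: int) -> int:
    """Quantize non-negative x to an unsigned code with n integer and m fraction bits."""
    code = round(x * 2 ** m)             # scale so the fraction becomes an integer
    if not 0 <= code < 2 ** (n + m):
        raise OverflowError(f"{x} does not fit in {n}.{m} fixed point")
    return code


def from_fixed(code: int, m: int) -> float:
    """Recover the value represented by a fixed-point code."""
    return code / 2 ** m


N, M = 8, 8
print(from_fixed(to_fixed(5.75, N, M), M))       # representable exactly
print(from_fixed(to_fixed(2.53e-26, N, M), M))   # too small: collapses to 0.0
try:
    to_fixed(1.7562e35, N, M)
except OverflowError as err:
    print("overflow:", err)                      # too large for 8 integer bits
```

The very small value is lost entirely and the very large one does not fit, which is the inflexibility noted above.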
$F = (-1)^s \times M \times 2^E$
• s is the sign bit indicating whether the number is negative (=1) or positive (=0).
• M is called the mantissa, and is normally a fraction in the range [1.0, 2.0), i.e. 1.0 ≤ M < 2.0.
• E is called the exponent, which weights the number by a power of 2.
Encoding:
• Single-precision numbers: total 32 bits (s: 1 bit, E: 8 bits, M: 23 bits)
• Double-precision numbers: total 64 bits (s: 1 bit, E: 11 bits, M: 52 bits)
Field layout, most significant bit first: s | E | M
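The decomposition $F = (-1)^s \times M \times 2^E$ can be illustrated with Python's `math.frexp`, which returns a fraction in [0.5, 1) that is rescaled here into [1, 2). The function name `decompose` is an assumption, and the sketch covers only nonzero finite values.

```python
import math


def decompose(x: float):
    """Return (s, M, E) with x == (-1)**s * M * 2**E and 1.0 <= M < 2.0."""
    s = 1 if math.copysign(1.0, x) < 0 else 0
    frac, exp = math.frexp(abs(x))    # abs(x) == frac * 2**exp with 0.5 <= frac < 1
    return s, frac * 2, exp - 1       # rescale so the mantissa lies in [1, 2)


print(decompose(11.5))     # (0, 1.4375, 3)
print(decompose(-5.75))    # (1, 1.4375, 2)
```

Both numbers share the mantissa 1.4375 (binary 1.0111) and differ only in sign and exponent.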
Points to Note
• The number of significant digits depends on the number of bits in M.
• About 7 significant decimal digits for a 24-bit mantissa (23 bits + 1 implied bit).
• The range of the number depends on the number of bits in E.
• Magnitudes from roughly $10^{-38}$ to $10^{38}$ for an 8-bit exponent.
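The two figures above follow from simple arithmetic, sketched below. The exponent limits of -126 and 127 are an assumption taken from the biased encoding of single precision (1 ≤ E ≤ 254 with a bias of 127), and the function name `decimal_digits` is hypothetical.

```python
import math


def decimal_digits(mantissa_bits: int) -> float:
    """Decimal digits carried by a mantissa of the given width (implied bit included)."""
    return mantissa_bits * math.log10(2)


print(round(decimal_digits(24), 2))   # single precision: a little over 7 digits
print(round(decimal_digits(53), 2))   # double precision: just under 16 digits

# Magnitude limits of normalized single precision, assuming exponents -126 .. 127.
largest = (2 - 2.0 ** -23) * 2.0 ** 127
smallest = 2.0 ** -126
print(f"{smallest:.3e} .. {largest:.3e}")
```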
“Normalized” Representation
• We shall now see how E and M are actually encoded.
• Assume that the actual exponent of the number is EXP (i.e. the number is $M \times 2^{EXP}$).
• Permissible range of E for single-precision: 1 ≤ E ≤ 254 (the all-0 and all-1 patterns are not allowed). For double-precision the corresponding range is 1 ≤ E ≤ 2046.
• Encoding of the exponent E:
The exponent is encoded as a biased value: E = EXP + BIAS,
where BIAS = 127 ($2^{8-1} - 1$) for single-precision, and
BIAS = 1023 ($2^{11-1} - 1$) for double-precision.
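As a sketch of the biased encoding, the snippet below uses Python's `struct` module to expose the single-precision bit pattern of a value. The helper names `exponent_field` and `encode32` are assumptions made for illustration. Only the 23 fraction bits of the mantissa are stored; the leading 1 is implied.

```python
import struct

BIAS = 127  # 2**(8-1) - 1 for single precision


def exponent_field(exp: int) -> int:
    """Biased exponent E = EXP + BIAS, restricted to the normalized range 1..254."""
    e = exp + BIAS
    if not 1 <= e <= 254:
        raise ValueError("exponent outside the normalized single-precision range")
    return e


def encode32(x: float) -> str:
    """Return the single-precision encoding of x as 's E M' bit strings."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    bits = f"{word:032b}"
    return f"{bits[0]} {bits[1:9]} {bits[9:]}"


# 11.5 = 1.0111 x 2^3, so EXP = 3 and E = 3 + 127 = 130 = 10000010.
print(exponent_field(3))
print(encode32(11.5))
print(encode32(-5.75))   # same fraction bits, sign 1, EXP = 2
```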
An Encoding Example
• When E = 111…1:
• M = 000…0 represents the value ∞ (infinity).
• M ≠ 000…0 represents Not-a-Number (NaN).
NaN represents cases when no numeric value can be determined, like uninitialized values, ∞*0, ∞-∞, square root of a negative number, etc.
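The special encodings can be built directly from bit patterns. This is a sketch for single precision; the helper name `from_bits32` is an assumption, and the particular NaN pattern 0x7FC00000 is just one of many with a nonzero M.

```python
import math
import struct


def from_bits32(word: int) -> float:
    """Interpret a 32-bit integer as a single-precision bit pattern."""
    (x,) = struct.unpack(">f", struct.pack(">I", word))
    return x


pos_inf = from_bits32(0x7F800000)   # s = 0, E = all ones, M = 0
neg_inf = from_bits32(0xFF800000)   # s = 1, E = all ones, M = 0
nan = from_bits32(0x7FC00000)       # E = all ones, M != 0
print(pos_inf, neg_inf, nan)

# Operations with no meaningful numeric result produce NaN.
print(math.isnan(pos_inf - pos_inf))   # True
print(math.isnan(pos_inf * 0.0))       # True
print(nan == nan)                      # False: NaN compares unequal even to itself
```

Note that Python's `math.sqrt(-1.0)` raises `ValueError` rather than returning NaN, so that case is not shown here.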
[Figure: number line of representable values, with labels NaN, -0, +0, NaN]
Denormal numbers have very small magnitudes (close to 0) such that trying to
normalize them will lead to an exponent that is below the minimum possible value.
• Mantissa with leading 0’s and exponent field equal to zero.
• Number of significant digits gets reduced in the process.
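A sketch of denormal values in single precision. It assumes the IEEE 754 convention, not spelled out in the text above, that an all-zero exponent field means the value $0.M \times 2^{-126}$ with no implied leading 1. The helper names are hypothetical.

```python
import struct


def from_bits32(word: int) -> float:
    """Interpret a 32-bit integer as a single-precision bit pattern."""
    (x,) = struct.unpack(">f", struct.pack(">I", word))
    return x


def denormal_value(m: int) -> float:
    """Value of a denormal with 23-bit mantissa field m: 0.M x 2^-126 (assumed convention)."""
    return (m / 2 ** 23) * 2.0 ** -126


smallest_normal = from_bits32(0x00800000)    # E = 1, M = 0
largest_denormal = from_bits32(0x007FFFFF)   # E = 0, M = all ones
smallest_denormal = from_bits32(0x00000001)  # E = 0, only the last mantissa bit set

print(smallest_normal, largest_denormal, smallest_denormal)
print(denormal_value(1) == smallest_denormal)
```

The smallest denormal has a single significant bit left, which is the loss of significant digits mentioned above.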
Rounding
• Suppose we are adding two numbers (say, in single-precision).
• We add the mantissa values after shifting one of them right for exponent alignment.
• We take the first 23 bits of the sum, and discard the residue R (remaining bits).
Some Exercises
Floating-point Arithmetic
Addition Example
• Suppose we want to add F1 = 270.75 and F2 = 2.375
F1 = $(270.75)_{10} = (100001110.11)_2 = 1.0000111011 \times 2^8$
F2 = $(2.375)_{10} = (10.011)_2 = 1.0011 \times 2^1$
• Shift the mantissa of F2 right by 8 - 1 = 7 positions, and add:
  1000 0111 0110 0000 0000 0000
+ 0000 0001 0011 0000 0000 0000 0000 000
= 1000 1000 1001 0000 0000 0000 0000 000
• The trailing 7 bits (0000 000) beyond the 24-bit mantissa are the residue.
• Result: $1.00010001001 \times 2^8$
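The alignment and addition steps can be mimicked with integer mantissas. This is a simplified sketch for two positive operands, not a full adder: it keeps a 24-bit integer mantissa (1.f scaled by $2^{23}$), reports the residue, and performs no rounding. The names `to_mantissa` and `align_and_add` are assumptions.

```python
import math


def to_mantissa(x: float):
    """Return (m, e): the top 24 mantissa bits of x > 0 as an integer, and its exponent."""
    frac, exp = math.frexp(x)            # x == frac * 2**exp with 0.5 <= frac < 1
    return int(frac * 2 ** 24), exp - 1


def align_and_add(m1: int, e1: int, m2: int, e2: int):
    """Add two positive operands with e1 >= e2; returns (mantissa, exponent, residue)."""
    shift = e1 - e2
    wide = (m1 << shift) + m2            # same sum as shifting m2 right, but no bit is lost
    mantissa = wide >> shift             # bits that stay in the result
    residue = wide & ((1 << shift) - 1)  # low bits that fall off the end
    exp = e1
    if mantissa >> 24:                   # carry out of bit 23: renormalize
        mantissa >>= 1                   # (this sketch drops the bit shifted out here)
        exp += 1
    return mantissa, exp, residue


m1, e1 = to_mantissa(270.75)
m2, e2 = to_mantissa(2.375)
m, e, r = align_and_add(m1, e1, m2, e2)
print(f"{m:024b}", e, r)
print(m / 2 ** 23 * 2 ** e)              # 273.125
```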
Subtraction Example
• Suppose we want to subtract F2 = 224 from F1 = 270.75
F1 = $(270.75)_{10} = (100001110.11)_2 = 1.0000111011 \times 2^8$
F2 = $(224)_{10} = (11100000)_2 = 1.11 \times 2^7$
• Shift the mantissa of F2 right by 8 - 7 = 1 position, and subtract:
  1000 0111 0110 0000 0000 0000
- 0111 0000 0000 0000 0000 0000 000
= 0001 0111 0110 0000 0000 0000 000
• For normalization, shift the mantissa left by 3 positions, and decrement E by 3.
• Result: $1.01110110 \times 2^5$
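The subtraction and the normalization step can be sketched the same way with integer mantissas. The code assumes F1 > F2 > 0, drops any bits shifted out during alignment, and uses hypothetical helper names.

```python
import math


def to_mantissa(x: float):
    """Return (m, e): the top 24 mantissa bits of x > 0 as an integer, and its exponent."""
    frac, exp = math.frexp(x)        # x == frac * 2**exp with 0.5 <= frac < 1
    return int(frac * 2 ** 24), exp - 1


def subtract_and_normalize(m1: int, e1: int, m2: int, e2: int):
    """Compute operand 1 minus operand 2 (both positive, operand 1 larger)."""
    shift = e1 - e2
    diff = m1 - (m2 >> shift)        # align operand 2, then subtract mantissas
    exp = e1
    # Normalize: shift left until the leading 1 is back in bit 23.
    while diff and not (diff >> 23):
        diff <<= 1
        exp -= 1
    return diff, exp


m1, e1 = to_mantissa(270.75)
m2, e2 = to_mantissa(224.0)
m, e = subtract_and_normalize(m1, e1, m2, e2)
print(f"{m:024b}", e)
print(m / 2 ** 23 * 2 ** e)          # 46.75
```

The loop runs three times here, matching the three left shifts and the exponent dropping from 8 to 5.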
Floating-point Multiplication
Multiplication Example
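[Figure: floating-point multiplier datapath. Inputs: (s1, E1, M1) and (s2, E2, M2), with 8-bit exponents and 23-bit mantissas plus the implied 1. Blocks: 8-bit Adder on E1 and E2 (9-bit result), 9-bit Subtractor removing the bias 1111111, 24 x 24 Multiplier producing a 48-bit product, XOR of s1 and s2, Normalizer. Outputs: s3, E3 (8 bits), M3 (23 bits).]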
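The multiplier datapath (sign XOR, exponent adder with bias subtraction, 24 x 24 mantissa multiplier, normalizer) can be modelled in software. This is a simplified sketch under stated assumptions: normalized finite single-precision inputs, no overflow or underflow handling, and truncation instead of rounding. All function names are hypothetical.

```python
import struct

BIAS = 127


def unpack32(x: float):
    """Split the single-precision encoding of x into (s, E, M)."""
    (w,) = struct.unpack(">I", struct.pack(">f", x))
    return w >> 31, (w >> 23) & 0xFF, w & 0x7FFFFF


def pack32(s: int, e: int, m: int) -> float:
    """Assemble (s, E, M) fields into a single-precision value."""
    (x,) = struct.unpack(">f", struct.pack(">I", (s << 31) | (e << 23) | m))
    return x


def fmul(a: float, b: float) -> float:
    s1, e1, m1 = unpack32(a)
    s2, e2, m2 = unpack32(b)
    s3 = s1 ^ s2                                   # XOR of the sign bits
    e3 = e1 + e2 - BIAS                            # add exponents, remove one bias
    prod = (m1 | (1 << 23)) * (m2 | (1 << 23))     # 24 x 24 -> up to 48 bits
    if prod >> 47:                                 # product in [2, 4): normalize
        e3 += 1
        mant = prod >> 24
    else:                                          # product in [1, 2)
        mant = prod >> 23
    return pack32(s3, e3, mant & 0x7FFFFF)         # drop the implied leading 1


print(fmul(11.5, 2.375))    # 27.3125
print(fmul(-5.75, 2.0))     # -11.5
print(fmul(3.0, 3.0))       # 9.0 (exercises the normalization branch)
```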
Floating-point Division
Division Example
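[Figure: floating-point divider datapath. Inputs: (s1, E1, M1) and (s2, E2, M2), with 8-bit exponents and 23-bit mantissas plus the implied 1. Blocks: 8-bit Subtractor on E1 and E2 (9-bit result), 9-bit Adder restoring the bias 1111111, 24-bit Divider, XOR of s1 and s2, Normalizer. Outputs: s3, E3 (8 bits), M3 (23 bits).]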
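The divider datapath (sign XOR, exponent subtractor with the bias added back, mantissa divider, normalizer) can be modelled in the same style. Again this is only a sketch: it assumes normalized finite single-precision inputs and a nonzero divisor, ignores overflow and underflow, and truncates the quotient. All function names are hypothetical.

```python
import struct

BIAS = 127


def unpack32(x: float):
    """Split the single-precision encoding of x into (s, E, M)."""
    (w,) = struct.unpack(">I", struct.pack(">f", x))
    return w >> 31, (w >> 23) & 0xFF, w & 0x7FFFFF


def pack32(s: int, e: int, m: int) -> float:
    """Assemble (s, E, M) fields into a single-precision value."""
    (x,) = struct.unpack(">f", struct.pack(">I", (s << 31) | (e << 23) | m))
    return x


def fdiv(a: float, b: float) -> float:
    s1, e1, m1 = unpack32(a)
    s2, e2, m2 = unpack32(b)
    s3 = s1 ^ s2                                     # XOR of the sign bits
    e3 = e1 - e2 + BIAS                              # subtract exponents, restore the bias
    q = ((m1 | (1 << 23)) << 24) // (m2 | (1 << 23)) # mantissa ratio scaled by 2^24
    if q >> 24:                                      # ratio in [1, 2): already normalized
        mant = q >> 1
    else:                                            # ratio in (0.5, 1): normalize
        mant = q
        e3 -= 1
    return pack32(s3, e3, mant & 0x7FFFFF)           # drop the implied leading 1


print(fdiv(11.5, 2.0))       # 5.75
print(fdiv(9.0, 3.0))        # 3.0 (exercises the normalization branch)
print(fdiv(-5.75, 2.875))    # -2.0
```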
FPRs (floating-point registers): F0, F1, F2, F3, F4, F5, ..., F30, F31
Special-purpose Registers: FIR, FCCR, FEXR, FENR, FCSR
• Arithmetic instructions
• Floating-point absolute value
• Floating-point compare
• Floating-point negate
• Floating-point add
• Floating-point subtract
• Floating-point multiply
• Floating-point divide
• Floating-point square root
• Floating-point multiply add
• Floating-point multiply subtract
• Rounding instructions:
• Floating-point truncate
• Floating-point ceiling
• Floating-point floor
• Floating-point round
• Format conversions:
• Single-precision to double-precision
• Double-precision to single-precision
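The effect of the rounding and conversion operations can be illustrated with Python's standard library. These are software analogues for illustration only, not the machine instructions themselves; the helper name `to_single` is an assumption.

```python
import math
import struct

x = -2.5
print(math.trunc(x))   # truncate: toward zero            -> -2
print(math.ceil(x))    # ceiling: toward +infinity        -> -2
print(math.floor(x))   # floor: toward -infinity          -> -3
print(round(x))        # round: nearest, ties to even     -> -2


def to_single(d: float) -> float:
    """Convert a double-precision value to single precision and back to double."""
    (s,) = struct.unpack("f", struct.pack("f", d))
    return s


print(to_single(0.5) == 0.5)   # True: 0.5 fits in a 24-bit mantissa
print(to_single(0.1) == 0.1)   # False: double -> single discards mantissa bits
```

Converting single to double never loses information, while double to single may have to round.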