Floating Point Arithmetic
ARITHMETIC UNIT
SHWETA(2k21/VLS/18)
Outline
Floating-Point Numbers
IEEE 754 Floating-Point Standard
Floating-Point Addition and Subtraction
Floating-Point Multiplication
Simulation Results
The World is Not Just Integers
Programming languages support numbers with fractions
Called floating-point numbers
Examples:
3.14159265… (π)
2.71828… (e)
0.000000001 or 1.0 × 10⁻⁹ (seconds in a nanosecond)
86,400,000,000,000 or 8.64 × 10¹³ (nanoseconds in a day)
The last number is a large integer that cannot fit in a 32-bit integer
Solution:
Sign = 1 is negative
Exponent = (01111100)₂ = 124, E – bias = 124 – 127 = –3
Significand = (1.0100 … 0)₂ = 1 + 2⁻² = 1.25 (1. is implicit)
Value in decimal = –1.25 × 2⁻³ = –0.15625
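The field-by-field decoding above can be checked with a short Python sketch. The bit pattern is assembled from the sign, exponent and fraction quoted in the solution; the helper name `decode_single` is hypothetical, and it handles normalized numbers only.

```python
import struct

def decode_single(bits: int) -> dict:
    """Split a 32-bit pattern into IEEE 754 single-precision fields.

    Sketch for normalized numbers only (exponent field not 0 and not 255).
    """
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF          # 8-bit biased exponent
    fraction = bits & 0x7FFFFF              # 23-bit fraction
    significand = 1 + fraction / 2**23      # implicit leading 1.
    value = (-1) ** sign * significand * 2.0 ** (exponent - 127)
    return {"sign": sign, "exponent": exponent,
            "significand": significand, "value": value}

# sign = 1, exponent = 01111100, fraction = 0100...0 (only bit 21 set)
pattern = (1 << 31) | (0b01111100 << 23) | (1 << 21)
fields = decode_single(pattern)
print(fields)

# Cross-check: let the struct module interpret the same four bytes.
reference = struct.unpack(">f", pattern.to_bytes(4, "big"))[0]
print(reference)
```

Both the manual decoding and the `struct` cross-check should agree with the value worked out on the slide.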
Examples of Double Precision Float
Solution:
Value of exponent = (10000000101)₂ – bias = 1029 – 1023 = 6
Value of double float = (1.00101010 … 0)₂ × 2⁶ (1. is implicit) = (1001010.10 … 0)₂ = 74.5
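Going the other way, a small sketch can encode 74.5 as a double and print its fields for comparison with the solution. The helper name `double_fields` is hypothetical. The sign bit does not appear in the surviving text; since the value is positive, it is expected to be 0.

```python
import struct

def double_fields(x: float):
    """Return (sign, biased exponent, fraction) of a double-precision float."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF          # 11-bit biased exponent
    fraction = bits & ((1 << 52) - 1)        # 52-bit fraction
    return sign, exponent, fraction

sign, exponent, fraction = double_fields(74.5)
exponent_bits = format(exponent, "011b")         # biased exponent in binary
fraction_head = format(fraction, "052b")[:8]     # first 8 fraction bits

print(sign, exponent_bits, exponent - 1023, fraction_head)
```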
Largest Normalized Float
What is the Largest normalized float?
Solution for Single Precision:
01111111011111111111111111111111
Exponent – bias = 254 – 127 = 127 (largest exponent for SP)
Significand = (1.111 … 1)₂ = almost 2
Value in decimal ≈ 2 × 2¹²⁷ ≈ 2¹²⁸ ≈ 3.4028 … × 10³⁸
Solution for Double Precision:
01111111111011111111111111111111
11111111111111111111111111111111
Value in decimal ≈ 2 × 2¹⁰²³ ≈ 2¹⁰²⁴ ≈ 1.79769 … × 10³⁰⁸
Overflow: exponent is too large to fit in the exponent field
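As a quick check, the two bit patterns above can be reinterpreted as floats in Python. This is a minimal sketch: the hexadecimal constants are the patterns from the slide, and the closed-form comparison (2 minus one unit in the last place, times the largest power of two) is added here for verification.

```python
import struct
import sys

# 0 11111110 111...1  and  0 11111111110 111...1, written in hexadecimal
largest_single = struct.unpack(">f", (0x7F7FFFFF).to_bytes(4, "big"))[0]
largest_double = struct.unpack(">d", (0x7FEFFFFFFFFFFFFF).to_bytes(8, "big"))[0]

print(largest_single)
print(largest_double)

# Significand is 2 minus one unit in the last fraction bit.
assert largest_single == (2 - 2.0**-23) * 2.0**127
assert largest_double == (2 - 2.0**-52) * 2.0**1023
assert largest_double == sys.float_info.max

# Overflow: doubling the largest double no longer fits in the exponent field.
overflowed = largest_double * 2
print(overflowed)  # inf
```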
Smallest Normalized Float
What is the smallest (in absolute value) normalized float?
Solution for Single Precision:
00000000100000000000000000000000
Exponent – bias = 1 – 127 = –126 (smallest exponent for SP)
Significand = (1.000 … 0)₂ = 1
Value in decimal = 1 × 2⁻¹²⁶ = 1.17549 … × 10⁻³⁸
Solution for Double Precision:
00000000000100000000000000000000
00000000000000000000000000000000
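The same kind of check works for the smallest normalized patterns. This sketch also prints the double-precision value, which should equal 2⁻¹⁰²² (exponent field 1, bias 1023).

```python
import struct
import sys

# 0 00000001 000...0  and  0 00000000001 000...0, written in hexadecimal
smallest_single = struct.unpack(">f", (0x00800000).to_bytes(4, "big"))[0]
smallest_double = struct.unpack(">d", (0x0010000000000000).to_bytes(8, "big"))[0]

print(smallest_single)
print(smallest_double)

# Exponent field 1 with a zero fraction gives 1.0 x 2^(1 - bias).
assert smallest_single == 2.0**-126
assert smallest_double == 2.0**-1022
assert smallest_double == sys.float_info.min
```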
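Subtraction Example
After aligning exponents: (0.00001)₂ × 2² – (1.00000)₂ × 2²
Convert the subtraction into an addition in 2's complement (the leading bit is the sign):
  0 0.00001 × 2²     (+0.00001 × 2²)
+ 1 1.00000 × 2²     (–1.00000 × 2² in 2's complement)
  1 1.00001 × 2²     (negative result in 2's complement)
Converted to sign-magnitude: –0.11111 × 2²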
Subtraction Example – cont’d
So, (1.000)₂ × 2⁻³ – (1.000)₂ × 2² = –(0.11111)₂ × 2²
Normalize result: –(0.11111)₂ × 2² = –(1.1111)₂ × 2¹
For subtraction, we can have leading zeros
Count the number z of leading zeros (in this case z = 1)
Shift left and decrement exponent by z
Round the significand to fit in appropriate number of bits
We assumed 4 bits of precision or 3 bits of fraction
Round to nearest: (1.1111)₂ ≈ (10.000)₂, because 1.111 + 0.001 = 10.000
Renormalize: rounding generated a carry
–(1.1111)₂ × 2¹ ≈ –(10.000)₂ × 2¹ = –(1.000)₂ × 2²
The result would have been exact if more fraction bits were used
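The effect of the precision limit can be illustrated with exact rational arithmetic. This is a sketch under stated assumptions: the helper `round_to_precision` is hypothetical, and it rounds ties to even, which gives the same result as the slide for this example.

```python
from fractions import Fraction

def round_to_precision(x: Fraction, p: int) -> Fraction:
    """Round x to the nearest value with p significant bits (ties to even)."""
    if x == 0:
        return Fraction(0)
    sign = -1 if x < 0 else 1
    mag = abs(x)
    # Find e with 2**e <= mag < 2**(e + 1).
    e = mag.numerator.bit_length() - mag.denominator.bit_length()
    if Fraction(2) ** e > mag:
        e -= 1
    ulp = Fraction(2) ** (e - (p - 1))   # weight of the last kept bit
    return sign * round(mag / ulp) * ulp  # round() on a Fraction is ties-to-even

# (1.000)2 x 2^-3  -  (1.000)2 x 2^2, computed exactly
exact = Fraction(1, 8) - Fraction(4)

four_bits = round_to_precision(exact, 4)   # 4 bits of precision, as on the slide
five_bits = round_to_precision(exact, 5)   # one more fraction bit

print(float(exact), float(four_bits), float(five_bits))
```

With 4 bits the result rounds to –1.000 × 2², while 5 bits are enough to hold –1.1111 × 2¹ exactly.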
Floating Point Addition / Subtraction
Start
1. Compare the exponents of the two numbers. Shift the smaller number to the right until its exponent matches the larger exponent.
   Shift significand right by d = | EX – EY |
2. Add / Subtract the significands according to the sign bits.
   Add significands when signs of X and Y are identical, subtract when different
   X – Y becomes X + (–Y)
3. Normalize the sum, either shifting right and incrementing the exponent or shifting left and decrementing the exponent.
   Normalization shifts right by 1 if there is a carry, or shifts left by the number of leading zeros in the case of subtraction
4. Round the significand to the appropriate number of bits, and renormalize if rounding generates a carry.
   Rounding either truncates fraction, or adds a 1 to least significant fraction bit
Overflow or underflow?
   yes: Exception
   no: Done
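The four steps of the flowchart can be sketched in Python on a toy format. Everything here is an illustrative assumption, not the simulated hardware design: the name `fp_add`, the `(sign, significand, exponent)` tuple encoding, the default exponent range (borrowed from single precision), and round-to-nearest-even. The sketch keeps all shifted-out bits instead of modelling guard bits.

```python
def fp_add(x, y, p=4, emin=-126, emax=127):
    """Add two toy floats given as (sign, significand, exponent).

    significand is a p-bit integer (1.fff read as an integer), so the value
    is (-1)**sign * significand * 2**(exponent - (p - 1)).
    """
    (sx, mx, ex), (sy, my, ey) = x, y

    # Step 1: align exponents. Let X be the operand with the larger exponent.
    if ex < ey:
        (sx, mx, ex), (sy, my, ey) = (sy, my, ey), (sx, mx, ex)
    d = ex - ey
    mx <<= d   # same as shifting Y right by d, but keeps every bit of Y

    # Step 2: add when signs match, subtract when they differ.
    total = (-mx if sx else mx) + (-my if sy else my)
    if total == 0:
        return (0, 0, 0)
    sign = 1 if total < 0 else 0
    mag = abs(total)

    # Step 3: normalize. bit_length() locates the leading 1, which covers
    # both the carry case and the leading-zeros case.
    exp = (mag.bit_length() - 1) + (ex - (p - 1) - d)

    # Step 4: round to p bits (nearest, ties to even), then renormalize.
    shift = mag.bit_length() - p
    if shift > 0:
        sig = mag >> shift
        rem = mag & ((1 << shift) - 1)
        half = 1 << (shift - 1)
        if rem > half or (rem == half and sig & 1):
            sig += 1                  # add 1 to the least significant bit
        if sig == 1 << p:             # rounding generated a carry
            sig >>= 1
            exp += 1
    else:
        sig = mag << -shift

    # Final check from the flowchart: overflow or underflow raises.
    if exp > emax:
        raise OverflowError("exponent too large")
    if exp < emin:
        raise ArithmeticError("exponent too small (underflow)")
    return (sign, sig, exp)

# (1.000)2 x 2^-3  -  (1.000)2 x 2^2, i.e. X + (-Y)
result = fp_add((0, 0b1000, -3), (1, 0b1000, 2))
print(result)
```

For the subtraction example this returns sign 1, significand 1000 and exponent 2, that is –1.000 × 2².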
Simulation
Floating Point Multiplication Example
Rounding either truncates fraction, or adds a 1 to least significant fraction bit
Overflow or underflow?
   yes: Exception
   no: Done
Simulation
Advantages of IEEE 754 Standard
Used predominantly by the industry
Encoding of exponent and fraction simplifies comparison
Integer comparator used to compare magnitude of FP numbers
Includes special exceptional values: NaN and ±∞
Special rules are used such as:
0/0 is NaN, sqrt(–1) is NaN, 1/0 is ∞, and 1/∞ is 0
Computation may continue in the face of exceptional conditions
Denormalized numbers to fill the gap
Between the smallest normalized number 1.0 × 2^Emin and zero
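Three of the points above (integer comparison of magnitudes, exceptional values, denormalized numbers) can be observed from Python with a small sketch. The helper `bits_of` is hypothetical. Python itself raises an exception for 1/0 and 0/0 on floats, so the sketch uses ∞ – ∞ and 1/∞ instead.

```python
import math
import struct
import sys

def bits_of(x: float) -> int:
    """Return the 64-bit pattern of a double as an unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

# For non-negative floats, comparing bit patterns as integers gives the
# same ordering as comparing the values themselves.
values = [0.0, 1e-300, 0.15625, 1.0, 74.5, 1e300, math.inf]
for a, b in zip(values, values[1:]):
    assert (a < b) == (bits_of(a) < bits_of(b))

# Exceptional values: computation continues instead of stopping.
inf = math.inf
nan = inf - inf
print(1 / inf)           # 0.0
print(math.isnan(nan))   # True
print(nan == nan)        # False

# Denormalized numbers: exponent field 0 with a nonzero fraction.
smallest_denormal = struct.unpack(">d", (1).to_bytes(8, "big"))[0]
assert 0.0 < smallest_denormal < sys.float_info.min
assert smallest_denormal == math.ldexp(1.0, -1074)
```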