Floating Point Representation
Introduction
Floating-point representation is a way to store real numbers in a computer's memory using a
format similar to scientific notation. This representation allows computers to handle very
large and very small numbers efficiently.
IEEE 754 Floating-Point Standard
The most commonly used standard for floating-point representation is the IEEE 754
standard. This standard divides a floating-point number into three parts:
1. Sign Bit (S): 1 bit that indicates the sign of the number.
o 0 for positive numbers.
o 1 for negative numbers.
2. Exponent (E): Determines the scaling factor of the number.
o Single Precision (32-bit) uses 8 bits for the exponent.
o Double Precision (64-bit) uses 11 bits for the exponent.
o The exponent is stored with a bias (127 for single precision, 1023 for double
precision).
o The exponent allows the floating-point number to represent very large and very small values by adjusting the position of the binary point.
3. Mantissa (M): Stores the significant digits of the number.
o Single Precision has 23 bits for the mantissa.
o Double Precision has 52 bits for the mantissa.
o The mantissa represents the precise value of the number, with an implicit
leading 1 in normalized representation.
o The more bits used in the mantissa, the more precise the number is.
A floating-point number is represented as:
(−1)^S × M × 2^E
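To make the three fields concrete, here is a minimal Python sketch that extracts the sign, exponent, and mantissa bits from a 32-bit single-precision value. The helper name decode_float32 is illustrative, not a library function; only the standard struct module is assumed.

import struct

def decode_float32(x):
    # Reinterpret the float's 4 bytes as an unsigned 32-bit integer.
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign     = bits >> 31            # 1 bit: 0 for positive, 1 for negative
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF       # 23 bits; the leading 1 is implicit
    return sign, exponent, mantissa

s, e, m = decode_float32(5.75)
print(s, e - 127, format(m, '023b'))   # 0 2 01110000000000000000000

Subtracting the bias of 127 from the stored exponent recovers the true exponent, which is why the call above prints 2 for the number 5.75 = 1.0111 × 2².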
Step-by-Step Example: 5.75
We already know that:
5.75₁₀ = (101.11)₂
1. Normalize the number
o We shift the binary point to get the 1.yyy form:
o 1.0111 × 2²
o The 1 before the binary point is always there in normalized form (so we don't store it).
o The remaining 0111 is our mantissa.
2. Fill to 23 Bits
o The mantissa field must be 23 bits long, so we pad with zeros on the right:
o 01110000000000000000000
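Putting the pieces together for 5.75: the sign bit is 0, the biased exponent is 127 + 2 = 129 (10000001 in binary), and the mantissa is the 23-bit pattern above. As a quick check, this short sketch (assuming only Python's standard struct module) prints the full 32-bit pattern:

import struct

bits = struct.unpack('>I', struct.pack('>f', 5.75))[0]
print(format(bits, '032b'))
# prints 01000000101110000000000000000000
# i.e. sign 0 | exponent 10000001 | mantissa 01110000000000000000000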
Another Example: 0.15625
1. Convert to binary:
o 0.15625₁₀ = (0.00101)₂
2. Normalize:
o 1.01 × 2⁻³
3. Mantissa (without the leading 1):
o 01000000000000000000000
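The same check works for 0.15625: sign 0, biased exponent 127 − 3 = 124 (01111100 in binary), followed by the mantissa above. A verification sketch, again assuming nothing beyond Python's standard struct module:

import struct

bits = struct.unpack('>I', struct.pack('>f', 0.15625))[0]
print(format(bits, '032b'))
# prints 00111110001000000000000000000000
# i.e. sign 0 | exponent 01111100 | mantissa 01000000000000000000000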
Floating-Point Representation Format (Single Precision - 32-bit)