0% found this document useful (0 votes)
13 views3 pages

8.1.4 Data Representation - Floatng Point Numbers

The document explains floating point numbers, including their representation in both decimal and binary formats using scientific notation. It details the IEEE 754 standard for single and double precision, including the structure of the sign, exponent, and mantissa fields, as well as the concepts of overflow and underflow. Additionally, it covers floating point arithmetic operations such as addition and multiplication, providing examples for both decimal and binary calculations.

Uploaded by

Candice
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
13 views3 pages

8.1.4 Data Representation - Floatng Point Numbers

The document explains floating point numbers, including their representation in both decimal and binary formats using scientific notation. It details the IEEE 754 standard for single and double precision, including the structure of the sign, exponent, and mantissa fields, as well as the concepts of overflow and underflow. Additionally, it covers floating point arithmetic operations such as addition and multiplication, providing examples for both decimal and binary calculations.

Uploaded by

Candice
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 3

Floating Point Numbers

Real Numbers: pi = 3.14159265... e = 2.71828...


Scientific Notation: has a single digit to the left of the decimal point.
A number in Scientific Notation with no leading 0s is called a Normalised Number: 1.0 × 10-8
Not in normalised form: 0.1 × 10-7 or 10.0 × 10-9
Can also represent binary numbers in scientific notation: 1.0 × 2-3
Computer arithmetic that supports such numbers is called Floating Point.
The form is 1.xxxx… × 2yy…
Using normalised scientific notation
1. Simplifies the exchange of data that includes floating-point numbers
2. Simplifies the arithmetic algorithms to know that the numbers will always be in this form
3. Increases the accuracy of the numbers that can be stored in a word, since each unnecessary
leading 0 is replaced by another significant digit to the right of the decimal point
Representation of Floating-Point numbers
-1S × M × 2E
Bit No Size Field Name
31 1 bit Sign (S)
23-30 8 bits Exponent (E)
0-22 23 bits Mantissa (M)
A Single-Precision floating-point number occupies 32-bits, so there is a compromise between the
size of the mantissa and the size of the exponent.
These chosen sizes provide a range of approx:
± 10-38 ... 1038
 Overflow
The exponent is too large to be represented in the Exponent field
 Underflow
The number is too small to be represented in the Exponent field
To reduce the chances of underflow/overflow, can use 64-bit Double-Precision arithmetic
Bit No Size Field Name
63 1 bit Sign (S)
52-62 11 bits Exponent (E)
0-51 52 bits Mantissa (M)
providing a range of approx
± 10-308 ... 10308
These formats are called ...
IEEE 754 Floating-Point Standard
Since the mantissa is always 1.xxxxxxxxx in the normalised form, no need to represent the leading 1.
So, effectively:
Single Precision: mantissa ===> 1 bit + 23 bits
Double Precision: mantissa ===> 1 bit + 52 bits
Since zero (0.0) has no leading 1, to distinguish it from others, it is given the reserved bit pattern all
0s for the exponent so that hardware won't attach a leading 1 to it. Thus:
 Zero (0.0) = 0000...0000
 Other numbers = -1S × (1 + Mantissa) × 2E
If we number the mantissa bits from left to right m1, m2, m3, ...
mantissa = m1 × 2-1 + m2 × 2-2 + m3 × 2-3 + ....
Negative exponents could pose a problem in comparisons.
For example (with two's complement):
Sign Exponent Mantissa
-1
1.0 × 2 0 11111111 0000000 00000000 00000000
+1
1.0 × 2 0 00000001 0000000 00000000 00000000
With this representation, the first exponent shows a "larger" binary number, making direct
comparison more difficult.
To avoid this, Biased Notation is used for exponents.
If the real exponent of a number is X then it is represented as (X + bias)
IEEE single-precision uses a bias of 127. Therefore, an exponent of
-1 is represented as -1 + 127 = 126 = 011111102
0 is represented as 0 + 127 = 127 = 011111112
+1 is represented as +1 + 127 = 128 = 100000002
+5 is represented as +5 + 127 = 132 = 100001002
So the actual exponent is found by subtracting the bias from the stored exponent. Therefore, given
S, E, and M fields, an IEEE floating-point number has the value:
-1S × (1.0 + 0.M) × 2E-bias
(Remember: it is (1.0 + 0.M) because, with normalised form, only the fractional part of the mantissa
needs to be stored)

Floating Point Addition


Add the following two decimal numbers in scientific notation:
8.70 × 10-1 with 9.95 × 101
1. Rewrite the smaller number such that its exponent matches with the exponent of the larger
number.
8.70 × 10-1 = 0.087 × 101
2. Add the mantissas
9.95 + 0.087 = 10.037 and write the sum 10.037 × 101
3. Put the result in Normalised Form
10.037 × 101 = 1.0037 × 102 (shift mantissa, adjust exponent)
check for overflow/underflow of the exponent after normalisation
4. Round the result
If the mantissa does not fit in the space reserved for it, it has to be rounded off.
For Example: If only 4 digits are allowed for mantissa
1.0037 × 102 ===> 1.004 × 102
(only have a hidden bit with binary floating point numbers)
Example addition in binary
Perform 0.5 + (-0.4375)
0.5 = 0.1 × 20 = 1.000 × 2-1 (normalised)
-0.4375 = -0.0111 × 20 = -1.110 × 2-2 (normalised)
1. Rewrite the smaller number such that its exponent matches with the exponent of the larger
number.
-1.110 × 2-2 = -0.1110 × 2-1
2. Add the mantissas:
1.000 × 2-1 + -0.1110 × 2-1 = 0.001 × 2-1
3. Normalise the sum, checking for overflow/underflow:
0.001 × 2-1 = 1.000 × 2-4
-126 <= -4 <= 127 ===> No overflow or underflow
4. Round the sum:
The sum fits in 4 bits so rounding is not required
Check: 1.000 × 2-4 = 0.0625 which is equal to 0.5 - 0.4375
Correct!
Floating Point Multiplication
Multiply the following two numbers in scientific notation by hand:
1.110 × 1010 × 9.200 × 10-5
1. Add the exponents to find
New Exponent = 10 + (-5) = 5
If we add biased exponents, bias will be added twice. Therefore we need to subtract it once
to compensate:
(10 + 127) + (-5 + 127) = 259
259 - 127 = 132 which is (5 + 127) = biased new exponent
2. Multiply the mantissas
1.110 × 9.200 = 10.212000
Can only keep three digits to the right of the decimal point, so the result is
10.212 × 105
3. Normalise the result
1.0212 × 106
4. Round it
1.021 × 106

Example multiplication in binary:


1.000 × 2-1 × -1.110 × 2-2
1. Add the biased exponents
(-1 + 127) + (-2 + 127) - 127 = 124 ===> (-3 + 127)
2. Multiply the mantissas
1.000
× 1.110
-----------
0000
1000
1000
+ 1000
-----------
1110000 ===> 1.110000

The product is 1.110000 × 2-3


Need to keep it to 4 bits 1.110 × 2-3
3. Normalise (already normalised)
At this step check for overflow/underflow by making sure that
-126 <= Exponent <= 127
1 <= Biased Exponent <= 254
4. Round the result (no change)
5. Adjust the sign.
Since the original signs are different, the result will be negative
-1.110 × 2-3

You might also like