Floating Point Representation

Floating point numbers represent numbers with fractional components. They have three parts - mantissa, base, and exponent. In IEEE format, they are represented using 32-bit single precision or 64-bit double precision. Single precision uses 1 sign bit, 8 exponent bits, and 23 mantissa bits. Double precision uses 1 sign bit, 11 exponent bits, and 52 mantissa bits. Examples show how floating point addition and subtraction are performed by normalizing the numbers and adjusting the shared exponent.
Floating point Representation

What Are Floating Point Numbers?

• Programming languages support numbers with fractional parts, e.g. -5.43 x 10^2.
• The position of the binary point is not fixed; it "floats" according to the exponent, which is why these numbers are called floating point numbers.
Floating Point Representation
• It has three parts: mantissa, base, and exponent.

Example:
Number      Mantissa   Base   Exponent
3 x 10^6    3          10     6
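The three parts in the table can be read back out of a decimal literal. The following is only a minimal sketch using Python's standard `decimal` module; the helper name `parts` is made up for illustration and is not part of the slides.

```python
from decimal import Decimal

def parts(text):
    """Split a decimal literal into (mantissa, base, exponent).

    Hypothetical helper for illustration; assumes a finite number.
    """
    sign, digits, exponent = Decimal(text).as_tuple()
    mantissa = int("".join(str(d) for d in digits))
    if sign:
        mantissa = -mantissa
    return mantissa, 10, exponent

mantissa, base, exponent = parts("3E6")
print(mantissa, base, exponent)

# The value is recovered as mantissa x base^exponent.
assert mantissa * base ** exponent == 3_000_000
```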
Floating Point Number Representation in IEEE
It has two types:

• Single precision format (32 bit)
• Double precision format (64 bit)
Floating Point Number Representation in IEEE

• Single precision format (32 bit)

Sign Bit   Exponent   Mantissa
1 bit      8 bits     23 bits


Floating Point Number Representation in IEEE
• Double precision format (64 bit)

Sign Bit   Exponent   Mantissa
1 bit      11 bits    52 bits
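The two layouts can be inspected on a real machine by reinterpreting a float's bytes as an integer and slicing the bit string at the field boundaries. This is a sketch under the assumption that Python's `struct` formats `f` and `d` map to IEEE single and double precision, which holds on common platforms; `FORMATS` and `split_fields` are illustrative names, not part of the slides.

```python
import struct

# precision -> (float format, same-size unsigned int format, exponent bits, mantissa bits)
FORMATS = {
    "single": (">f", ">I", 8, 23),
    "double": (">d", ">Q", 11, 52),
}

def split_fields(value, precision="single"):
    """Return (sign, exponent, mantissa) as bit strings."""
    float_fmt, int_fmt, exp_bits, man_bits = FORMATS[precision]
    # Reinterpret the packed float as one unsigned integer.
    (raw,) = struct.unpack(int_fmt, struct.pack(float_fmt, value))
    bits = format(raw, "0{}b".format(1 + exp_bits + man_bits))
    return bits[0], bits[1:1 + exp_bits], bits[1 + exp_bits:]

for precision in ("single", "double"):
    sign, exponent, mantissa = split_fields(1.0, precision)
    print(precision, len(sign), len(exponent), len(mantissa))
```

The printed field lengths should match the 1/8/23 and 1/11/52 splits shown in the tables.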
Floating Point Number Representation in IEEE
• Single precision format (32 bit)

Example: (1460.125)_10

Step 1: Convert into binary
(1460.125)_10 = (10110110100.001)_2

Step 2: Normalize the number
(10110110100.001)_2 = 1.0110110100001 x 2^10
Step 3: Find the biased exponent
1.0110110100001 x 2^10
General form: (1.N) x 2^(E-127)
E - 127 = 10
E = 127 + 10
E = 137
Convert the exponent into binary:
E = (10001001)_2
• Exponent (8 bit): 10001001
• Mantissa (23 bit): 0110110100001 followed by 10 zeros

Sign    Exponent    Mantissa
0       10001001    01101101000010000000000
1 bit   8 bits      23 bits
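The three steps above can be sketched in code. This is an illustrative sketch, not a full encoder: the function name `encode_single` is made up, and it assumes a positive value of at least 1 whose bits fit in the mantissa without rounding (extra bits would simply be cut off).

```python
from fractions import Fraction

def encode_single(value):
    """Follow the three steps for a positive value >= 1 (illustrative only)."""
    value = Fraction(value)
    # Step 1: convert the integer and fractional parts to binary.
    whole, frac = divmod(value, 1)
    int_bits = format(int(whole), "b")
    frac_bits = ""
    while frac and len(frac_bits) < 32:
        frac *= 2
        bit, frac = divmod(frac, 1)
        frac_bits += str(int(bit))
    # Step 2: normalize to 1.N x 2^k by moving the point k places left.
    k = len(int_bits) - 1
    n = int_bits[1:] + frac_bits
    # Step 3: add the bias 127 and pad (or cut) the mantissa to 23 bits.
    exponent = format(k + 127, "08b")
    mantissa = n[:23].ljust(23, "0")
    return "0", exponent, mantissa

print(encode_single(1460.125))
# ('0', '10001001', '01101101000010000000000')
```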


Floating Point Number Representation in IEEE
• Double precision format (64 bit)

Example: (1259.125)_10

Step 1: Convert into binary
(1259.125)_10 = (10011101011.001)_2

Step 2: Normalize the number
(10011101011.001)_2 = 1.0011101011001 x 2^10
Step 3: Find the biased exponent
1.0011101011001 x 2^10
General form: (1.N) x 2^(E-1023)
E - 1023 = 10
E = 1023 + 10
E = 1033
Convert the exponent into binary:
E = (10000001001)_2
• Exponent (11 bit): 10000001001
• Mantissa (52 bit): 0011101011001 followed by 39 zeros

Sign    Exponent       Mantissa
0       10000001001    0011101011001 0...0 (39 zeros)
1 bit   11 bits        52 bits
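One way to check a hand-built encoding is to run the steps in reverse and rebuild the value from the three fields. The sketch below does this for normalized numbers only (no zero, subnormal, infinity, or NaN handling); `decode` is an illustrative name, and the bias is derived from the exponent width as 2^(bits-1) - 1, which gives 127 for 8 bits and 1023 for 11 bits.

```python
from fractions import Fraction

def decode(sign, exponent, mantissa):
    """Rebuild the value of a normalized number from its bit-string fields."""
    bias = 2 ** (len(exponent) - 1) - 1
    e = int(exponent, 2) - bias
    # The significand is 1.N: a hidden leading 1 plus the mantissa bits.
    significand = 1 + Fraction(int(mantissa, 2), 2 ** len(mantissa))
    value = significand * Fraction(2) ** e
    return -value if sign == "1" else value

fields = ("0", "10000001001", "0011101011001" + "0" * 39)
print(float(decode(*fields)))  # 1259.125
```

Because the bias is computed from the field width, the same function can also be applied to the single precision fields.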


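Floating Point Addition (32 bit)
123.45 + 15.35

Step 1: Convert into binary
1111011.0111 + 1111.0101
(The fractional parts are truncated to four bits, so the operands are really 123.4375 and 15.3125.)

Step 2: Normalize the numbers
1.1110110111 x 2^6 + 1.1110101 x 2^3

Align the exponents: move the point of the smaller number 3 places left and add 3 to its exponent.
1.1110101 x 2^3 = 0.0011110101 x 2^(3+3)

1.1110110111 x 2^6 + 0.0011110101 x 2^6
(1.1110110111 + 0.0011110101) x 2^6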
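Add the aligned significands:

   1.1110110111
+  0.0011110101
  10.0010101100

The sum 10.0010101100 x 2^6 is not in 1.N form, so renormalize it: move the point one place left and add 1 to the exponent.
10.0010101100 x 2^6 = 1.00010101100 x 2^7
(This is 138.75 in decimal; the four-bit truncation in Step 1 accounts for the difference from 138.80.)

General form: (1.N) x 2^(E-127)
E - 127 = 7
E = 134
E = (10000110)_2

Sign    Exponent    Mantissa
0       10000110    00010101100000000000000
1 bit   8 bits      23 bits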
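The align, add, and renormalize sequence can be sketched with integer arithmetic. This is a simplified sketch, not how hardware rounds: it assumes two positive, normalized operands given as a "1.xxx" bit string plus an exponent, and it truncates any bits shifted out. The names `FRAC_BITS`, `to_fixed`, and `add_floats` are made up for illustration.

```python
FRAC_BITS = 23  # fraction bits kept, as in the single precision mantissa

def to_fixed(sig):
    """'1.1110110111' -> integer holding the value with FRAC_BITS fraction bits."""
    whole, frac = sig.split(".")
    return int(whole + frac.ljust(FRAC_BITS, "0"), 2)

def add_floats(sig_a, exp_a, sig_b, exp_b):
    """Add two positive normalized values; return (mantissa, exponent) bit strings."""
    a, b = to_fixed(sig_a), to_fixed(sig_b)
    # Align: make `a` the operand with the larger exponent, shift `b` right.
    if exp_a < exp_b:
        a, exp_a, b, exp_b = b, exp_b, a, exp_a
    b >>= exp_a - exp_b  # bits shifted out are truncated
    total, exp = a + b, exp_a
    # Renormalize: a carry gives 10.xxx or 11.xxx, so shift right once.
    if total >> (FRAC_BITS + 1):
        total >>= 1
        exp += 1
    mantissa = format(total, "b")[1:]  # drop the hidden leading 1
    return mantissa, format(exp + 127, "08b")

print(add_floats("1.1110110111", 6, "1.1110101", 3))
# ('00010101100000000000000', '10000110')
```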


Floating Point Subtraction (32 bit)
4.5 - 0.125

Step 1: Convert into binary
100.1000 - 0.00100

Step 2: Normalize the numbers
1.001000 x 2^2 - 1.00 x 2^-3

Align the exponents: move the point of the smaller number 5 places left and add 5 to its exponent.
1.00 x 2^-3 = 0.0000100 x 2^(-3+5)

1.001000 x 2^2 - 0.0000100 x 2^2
(1.001000 - 0.0000100) x 2^2
Subtract the aligned significands:

   1.0010000
-  0.0000100
   1.0001100

1.0001100 x 2^2 is already in 1.N form.
General form: (1.N) x 2^(E-127)
E - 127 = 2
E = 129
E = (10000001)_2

Sign    Exponent    Mantissa
0       10000001    00011000000000000000000
1 bit   8 bits      23 bits
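Subtraction follows the same alignment, but the result may lose its leading 1 and then has to be shifted left rather than right. The sketch below assumes positive normalized operands with a > b and exp_a >= exp_b, and truncates bits shifted out during alignment; `FRAC_BITS`, `to_fixed`, and `sub_floats` are illustrative names only.

```python
FRAC_BITS = 23  # fraction bits kept, as in the single precision mantissa

def to_fixed(sig):
    """'1.001000' -> integer holding the value with FRAC_BITS fraction bits."""
    whole, frac = sig.split(".")
    return int(whole + frac.ljust(FRAC_BITS, "0"), 2)

def sub_floats(sig_a, exp_a, sig_b, exp_b):
    """Compute a - b for a > b > 0; return (mantissa, exponent) bit strings."""
    a, b = to_fixed(sig_a), to_fixed(sig_b)
    b >>= exp_a - exp_b  # align b to the larger exponent
    diff, exp = a - b, exp_a
    # Renormalize: shift left until the leading 1 is back in the 1.N position.
    while diff and not diff >> FRAC_BITS:
        diff <<= 1
        exp -= 1
    mantissa = format(diff, "b")[1:]  # drop the hidden leading 1
    return mantissa, format(exp + 127, "08b")

print(sub_floats("1.001000", 2, "1.00", -3))
# ('00011000000000000000000', '10000001')
```

In the 4.5 - 0.125 example the loop never runs, because the difference still starts with 1; a case such as 4.5 - 4.0 would need three left shifts.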
