Floating Point Representation
• A binary number in NSN (normalized scientific notation) form always has the form 1.F x 2^E
• To store this in a computer, we need to store (i) the signs of the number and of the
exponent, (ii) the fraction (F) and (iii) the exponent (E)
• We do not need to store the ‘1’, the ‘.’ or the base ‘2’, because they are the same for every
floating point number in NSN form.
• If we store (1) the signs, (2) F and (3) E, we can retrieve the complete number
as 1.F x 2^E (see the sketch after this list)
• The IEEE 754 Standard is an international standard for storing floating
point numbers in computer memory
• In ‘single precision’, it uses a 32-bit word.
• In ‘double precision’, it uses a 64-bit word (or two 32-bit words).
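For example, the following Python sketch (decompose and recompose are illustrative names chosen here, not part of any standard) splits a nonzero value into its sign, fraction and exponent and then rebuilds it from those three pieces alone:

```python
import math

def decompose(x):
    # Split a nonzero float into (sign, F, E) so that
    # x == (-1)**sign * (1 + F) * 2**E, i.e. the 1.F x 2^E form.
    sign = 0 if x > 0 else 1
    m, e = math.frexp(abs(x))          # abs(x) == m * 2**e with 0.5 <= m < 1
    significand, E = 2.0 * m, e - 1    # rescale so the significand is 1.F
    return sign, significand - 1.0, E

def recompose(sign, F, E):
    return (-1) ** sign * (1.0 + F) * 2.0 ** E

s, F, E = decompose(-6.5)              # -6.5 = -1.101 (binary) x 2^2
print(s, F, E)                         # 1 0.625 2
print(recompose(s, F, E))              # -6.5
```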
2. IEEE 754 in Single Precision
• The standard specifies a 32-bit word laid out as shown below
Bit 31 (MSB): sign (s) | Bits 30-23: exponent + bias (127) | Bits 22-0: fraction (F), with bit 0 as the LSB
• Bit # 0 is the rightmost bit, also called the least significant bit (LSB)
• Bit # 31 is the leftmost bit, also called the most significant bit (MSB)
• The MSB is used as the sign bit (s) of the number
• Bits 23-30 (eight bits) are used to store the exponent plus the bias (127)
• Bits 0-22 (23 bits) are used to store the fraction (F) (see the sketch after this list)
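As a sketch of how these three fields can be pulled out of the 32-bit word (using Python's standard struct module; the helper name fields is just for this example, and -28.75 is chosen because it appears to match the bit pattern worked through on the next slide):

```python
import struct

def fields(x):
    # Pack x as an IEEE 754 single-precision value, then extract the
    # sign bit (bit 31), biased exponent (bits 30-23) and fraction (bits 22-0).
    (w,) = struct.unpack('>I', struct.pack('>f', x))
    sign     = (w >> 31) & 0x1
    exponent = (w >> 23) & 0xFF        # stored value is E + 127
    fraction = w & 0x7FFFFF            # the 23 fraction bits of F
    return sign, exponent, fraction

s, e, f = fields(-28.75)
print(s, e - 127, format(f, '023b'))   # 1 4 11001100000000000000000
```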
3. Storing a number in IEEE 754 Single Precision
Bit 31 (MSB): sign (s) | Bits 30-23: exponent + bias (127) | Bits 22-0: fraction (F), with bit 0 as the LSB
1 | 1 0 0 0 0 0 1 1 | 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
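Reading this pattern with the layout above gives sign = 1, biased exponent = 10000011 (binary) = 131 (so E = 131 - 127 = 4) and fraction F = .110011 (binary), i.e. a significand of 1.796875, so the stored value appears to be -1.796875 x 2^4 = -28.75. A minimal Python check of that reading:

```python
import struct

# sign | biased exponent | 23-bit fraction, as shown on the slide
bits = '1' + '10000011' + '11001100' + '0' * 15
word = int(bits, 2).to_bytes(4, 'big')
print(struct.unpack('>f', word)[0])    # -28.75
```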
6. IEEE 754 Single Precision Floating Format (32 bits)
0 | 0 1 1 1 1 1 1 0 | 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
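Reading this second pattern the same way: sign = 0, biased exponent = 01111110 (binary) = 126 (so E = -1) and fraction F = .1 (binary), giving a significand of 1.5; the stored value therefore appears to be 1.5 x 2^-1 = 0.75. (The same struct check as above, with bits = '0' + '01111110' + '1' + '0' * 22, prints 0.75.)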
• IEEE 754 double precision uses one 64-bit register / memory cell, laid out as follows (see the sketch below)
Bit 63 (MSB): sign (s) | Bits 62-52: 11-bit exponent + bias (1023) | Bits 51-0: 52 bits of fraction (F), with bit 0 as the LSB
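The same field extraction can be sketched for double precision (again using Python's standard struct module; fields64 is an illustrative name chosen here):

```python
import struct

def fields64(x):
    # Pack x as an IEEE 754 double-precision value, then extract the
    # sign bit (bit 63), biased exponent (bits 62-52) and fraction (bits 51-0).
    (w,) = struct.unpack('>Q', struct.pack('>d', x))
    sign     = (w >> 63) & 0x1
    exponent = (w >> 52) & 0x7FF        # stored value is E + 1023
    fraction = w & ((1 << 52) - 1)      # the 52 fraction bits of F
    return sign, exponent, fraction

s, e, f = fields64(-28.75)
print(s, e - 1023, format(f, '052b')[:8])   # 1 4 11001100
```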
7. Review Questions
• Question 1: What is the precision of the IEEE 754 standard using single precision?
• Answer: Here, precision means the smallest magnitude that can be stored. For the 32-bit
standard, the 8-bit exponent field uses a bias of 2^7 - 1 = 127; since the all-zero and
all-one exponent patterns are reserved, the smallest normalized magnitude is 2^-126.
• Question 2. Can we store a number smaller than 2^-126 in the single precision IEEE
standard?
• Answer: Not as a normalized number. Values smaller than this underflow toward zero
(subnormal encodings extend the range slightly, at the cost of precision), as illustrated
in the sketch after this answer.
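A rough illustration of this threshold (a sketch only; the helper name to_float32 is just chosen for this example):

```python
import struct

def to_float32(x):
    # Round a Python float (a double) to IEEE 754 single precision and back.
    return struct.unpack('>f', struct.pack('>f', x))[0]

print(to_float32(2.0 ** -126))   # 1.1754943508222875e-38, the smallest normalized single
print(to_float32(2.0 ** -150))   # 0.0: too small, underflows to zero
```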
• Question 3. What is the possible impact of questions 1 & 2?
• Answer: In numerical methods using single precision, we cannot use a step size smaller
than this precision. If we compare the result of a floating point operation with 0, the
equality test may not come out true even when the exact answer is zero. There are also
two zeros (+0 and -0), depending on whether the result underflows from the positive or
the negative side. Where an algorithm must compare a result exactly with zero, we should
work with integer values instead, as sketched below.
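A small illustration of both points in Python (the tolerance 1e-9 is an arbitrary choice for this example):

```python
import math

# 0.1 + 0.2 is not exactly 0.3 in binary floating point,
# so the difference does not compare equal to zero.
x = 0.1 + 0.2 - 0.3
print(x == 0.0)                    # False
print(abs(x) < 1e-9)               # True: compare against a tolerance instead

# Scaling to integers (here, counting in tenths) makes the test exact.
print((1 + 2 - 3) == 0)            # True

# Both signed zeros exist; they compare equal but carry different sign bits.
print(0.0 == -0.0)                 # True
print(math.copysign(1.0, -0.0))    # -1.0
```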