Floating Point Representation

The document explains the IEEE 754 standard for floating point representation, detailing the structure for single and double precision formats. It describes how to store and retrieve floating point numbers using binary representation, including the roles of the sign bit, exponent, and fraction. It also addresses precision limitations and numerical implications in computations using single precision.


5. Floating point representation


1. IEEE 754 Standard for floating point representation

• A binary number in normalized scientific notation (NSN) always has the form ±1.F x 2^E.
• To store it in a computer, we need to store (i) the signs of the number and the exponent, (ii) the fraction (F), and (iii) the exponent (E).
• We do not need to store the leading '1', the point '.', or the base '2', because they are part of every floating point number in NSN form.
• If we can store (1) the signs, (2) F, and (3) E, we can retrieve the complete number as ±1.F x 2^E.
• The IEEE 754 standard is an international standard for storing floating point numbers in computer memory.
• In 'single precision', it uses a 32-bit word.
• In 'double precision', it uses a 64-bit word or two 32-bit words.
2. IEEE 754 in Single Precision
• The standard specifies the 32-bit word as shown below:

Bit 31 (MSB): sign bit (1 = negative, 0 = positive)
Bits 30-23 (8 bits): exponent + bias (127)
Bits 22-0 (23 bits, LSB at bit 0): fraction (mantissa)

• Bit # 0 is the rightmost bit, also called the least significant bit (LSB).
• Bit # 31 is the leftmost bit, also called the most significant bit (MSB).
• The MSB is used as the sign bit (s) for the number.
• Bits 30-23 (eight bits) are used to store the exponent plus the bias (127).
• Bits 22-0 (23 bits) are used to store the fraction (F).
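The bit positions above can be checked with shifts and masks. A minimal sketch in Python, assuming the 32-bit word is held in an ordinary integer (the helper name `decode_fields` is illustrative, not from any library):

```python
def decode_fields(bits):
    """Split a 32-bit IEEE 754 single-precision word into its three fields."""
    sign = (bits >> 31) & 0x1        # bit 31 (MSB): sign bit
    exponent = (bits >> 23) & 0xFF   # bits 30-23: exponent + bias (127)
    fraction = bits & 0x7FFFFF       # bits 22-0: fraction F
    return sign, exponent, fraction

# 0x41E60000 is the pattern for +28.75: sign 0, exponent field 131, F = 1100110...
print(decode_fields(0x41E60000))
```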
3. Storing a number in IEEE 754 Single
Precision

The 32-bit word layout is the same as shown in Section 2.

• Step 1: We convert the number into normalized scientific notation: ±1.F x 2^E.
• Step 2: MSB = 1 if the number is negative, and MSB = 0 if the number is positive.
• Step 3: Bits 30-23 hold the 8-bit binary representation of E + 127.
• Step 4: Bits 22-0 hold F.
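The four storing steps can be sketched directly in Python for nonzero normal numbers (`encode_single` is an illustrative helper written for this sketch, not a library function):

```python
def encode_single(x):
    """Encode a nonzero float by the four steps: normalize, sign, bias, pack."""
    sign = 1 if x < 0 else 0            # Step 2: sign bit from the sign of x
    m, E = abs(x), 0
    while m >= 2:                       # Step 1: normalize so that 1 <= m < 2,
        m, E = m / 2, E + 1             #         i.e. bring x into 1.F x 2^E form
    while m < 1:
        m, E = m * 2, E - 1
    exponent = E + 127                  # Step 3: 8-bit biased exponent
    fraction = round((m - 1) * 2**23)   # Step 4: 23 fraction bits of F
    return (sign << 31) | (exponent << 23) | fraction

print(f"{encode_single(-28.75):032b}")
```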
4. Retrieving a number stored in IEEE 754 Single Precision

• Step 1: We check whether the number is positive or negative from the sign bit s.
• Step 2: We convert the 8-bit combination in bits 30-23 into decimal. If the decimal value is Eb, then E = Eb - 127.
• Step 3: We read the fraction F from bits 22-0.
• Step 4: We now have the sign of the number, F and E. The number is ±1.F x 2^E.
• We can convert this into decimal using the known method for this purpose.
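The four retrieval steps map directly onto code. A sketch, assuming the 32-bit word is held in a Python integer (`from_single_bits` is an illustrative name):

```python
def from_single_bits(bits):
    """Rebuild the stored float from a 32-bit word using Steps 1-4."""
    sign = -1 if (bits >> 31) & 1 else 1    # Step 1: sign bit s
    E = ((bits >> 23) & 0xFF) - 127         # Step 2: E = Eb - 127
    F = (bits & 0x7FFFFF) / 2**23           # Step 3: fraction as 0.F
    return sign * (1 + F) * 2.0**E          # Step 4: +/- 1.F x 2^E

print(from_single_bits(0x3F400000))  # 0.75
```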
5. IEEE 754 Single Precision Floating Format

Example: Convert -28.75 into single precision floating point.
• Integer part: 28 in binary is 11100. Fraction part: 0.75 = .11. So 28.75 = 11100.11.
• Normalize: shift the point 4 places to the left, giving 1.1100110 x 2^4.
• Sign bit = 1, because the number is negative.
• Exponent + bias = 4 + 127 = 131 = 10000011 in binary.
• Fraction = the bits after the point, 1100110, padded on the right with zeros to 23 bits.

Stored word: 1 10000011 11001100000000000000000
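The worked example can be cross-checked with Python's `struct` module, which packs a float in IEEE 754 single precision (big-endian format `'>f'`):

```python
import struct

# Pack -28.75 as big-endian single precision, then reread the raw 32 bits
(bits,) = struct.unpack(">I", struct.pack(">f", -28.75))
print(f"{bits:032b}")  # sign=1, exponent field=10000011, fraction=1100110 then zeros
```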
6. IEEE 754 Single Precision Floating Format

Stored word: 0 01111110 10000000000000000000000

To find the number stored:
• Step 1: Sign bit s = 0, so the number is positive.
• Step 2: E + 127 = 01111110 in binary = 126 in decimal. This means E = -1.
• Step 3: F = bits 22-0 = .1 (zeros on the right side of the fraction are meaningless).
• Step 4: The stored number is 1.1 x 2^-1 = 0.11 in binary = 0.5 + 0.25 = 3/4. Answer
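The same check works in reverse: reinterpreting the raw 32-bit pattern as a single-precision float should recover 0.75.

```python
import struct

# Reinterpret the bit pattern 0 01111110 100...0 as a single-precision float
(value,) = struct.unpack(">f", struct.pack(">I", 0b00111111010000000000000000000000))
print(value)  # 0.75
```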
7. IEEE 754 Double precision
• In double precision, IEEE 754 uses 64 bits.
• The MSB is used as the sign bit to store the sign of the number.
• The next 11 bits after the MSB store the exponent from normalized scientific notation (NSN) plus the bias.
• Bias = 2^(11-1) - 1 = 1023
• The remaining 52 bits are used for the fraction.
• Therefore the smallest stored magnitude is about 1.0 x 2^-1023, which sets its precision.
• The range: the largest exponent field is 2047, so 2^(2047-1023) = 2^1024 is roughly the largest stored number.
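The 64-bit layout can be inspected the same way, using `struct`'s double-precision format `'>d'` (`double_fields` is an illustrative helper):

```python
import struct

def double_fields(x):
    """Split a double into sign (1 bit), E (from the 11-bit biased field), and fraction (52 bits)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                     # bit 63: sign bit
    E = ((bits >> 52) & 0x7FF) - 1023     # bits 62-52: exponent + bias (1023)
    fraction = bits & ((1 << 52) - 1)     # bits 51-0: fraction F
    return sign, E, fraction

print(double_fields(-28.75))
```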
7. IEEE 754 Double precision - II
• IEEE 754 double precision with two 32-bit registers / memory cells:

Register 1 (bits 31-0): sign bit | 11-bit exponent + 1023 | 20 most significant bits of the fraction
Register 2 (bits 31-0): 32 least significant bits of the fraction

• IEEE 754 double precision with one 64-bit register / memory cell:

Bit 63: sign bit; bits 62-52: 11-bit exponent + 1023; bits 51-0: 52 bits of fraction (MSB to LSB)
8. Review Questions
• Question 1: What is the precision of the IEEE 754 standard using single precision?
• Answer: The precision is the smallest quantity that can be stored. For the 32-bit format, the bias is 2^(8-1) - 1 = 127, so the smallest exponent is -127 and the precision is 2^-127.
• Question 2: Can we store a number smaller than 2^-127 in the single precision IEEE standard?
• Answer: No.
• Question 3: What is the possible impact of Questions 1 & 2?
• Answer: In numerical methods using single precision, we cannot have a step size smaller than the precision. If we compare the result of a floating point operation with 0, we will not get a true value for the equality test. We effectively have two zeros, depending on whether we approach 2^-127 from the positive or the negative side. We should use integer values where operations could contain a comparison with zero.
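The danger of comparing floating point results with 0 is easy to demonstrate; a short sketch:

```python
# In binary floating point, 0.1 + 0.2 - 0.3 is not exactly zero
r = 0.1 + 0.2 - 0.3
print(r == 0.0)        # False: the equality test fails
print(abs(r) < 1e-9)   # True: compare against a small tolerance instead
```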
