COA Module6 FloatingPoint

The document discusses floating-point numbers, focusing on their representation, limitations, and encoding methods, particularly the IEEE-754 standard. It explains how floating-point arithmetic works, including addition, subtraction, multiplication, and division, with examples for clarity. Special values, rounding methods, and encoding examples are also provided to illustrate the concepts.



Floating-point Numbers


Representing Fractional Numbers

• A binary number with a fractional part

      B = b_{n-1} b_{n-2} ... b_1 b_0 . b_{-1} b_{-2} ... b_{-m}

  corresponds to the decimal number

      D = Σ_{i = -m}^{n-1} b_i × 2^i

• Also called fixed-point numbers.
• The position of the radix point is fixed.
• If the radix point is allowed to move, we call it a floating-point representation.


Some Examples
1011.1 → 1×2^3 + 0×2^2 + 1×2^1 + 1×2^0 + 1×2^-1 = 11.5
101.11 → 1×2^2 + 0×2^1 + 1×2^0 + 1×2^-1 + 1×2^-2 = 5.75
10.111 → 1×2^1 + 0×2^0 + 1×2^-1 + 1×2^-2 + 1×2^-3 = 2.875

Some Observations:
• Shift right by 1 bit means divide by 2
• Shift left by 1 bit means multiply by 2
• Numbers of the form 0.111111…₂ have a value less than 1.0 (one).


Limitations of Representation

• In the fractional part, we can only represent numbers of the form x/2^k exactly.
• Other numbers have repeating bit representations (i.e. they never terminate).
• Examples:
      3/4  = 0.11
      7/8  = 0.111
      5/8  = 0.101
      1/3  = 0.0101010101 [01] ….
      1/5  = 0.001100110011 [0011] ….
      1/10 = 0.0001100110011 [0011] ….
• The more bits we use, the more accurate the representation becomes.
• We sometimes see the effect of this: (1/3) × 3 ≠ 1.
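
As an illustration of the last two points, here is a minimal added C sketch (assuming an IEEE-754 platform) that accumulates 0.1 ten times; because 1/10 has no terminating binary representation, the sum is close to, but not exactly, 1.0:

#include <stdio.h>

int main(void) {
    /* 0.1 cannot be represented exactly in binary, so each addition
       introduces a tiny rounding error that accumulates. */
    double sum = 0.0;
    for (int i = 0; i < 10; i++)
        sum += 0.1;

    printf("sum        = %.17g\n", sum);                        /* typically 0.99999999999999989 */
    printf("sum == 1.0 : %s\n", sum == 1.0 ? "true" : "false"); /* false */
    return 0;
}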


Floating-point Number Representation (IEEE-754)

• For representing numbers with fractional parts, we can assume that the radix point sits at a
  fixed position inside the number (say, n bits in the integer part, m bits in the fraction
  part). → Fixed-point representation
  • Lacks flexibility.
  • Cannot be used to represent very small or very large numbers
    (for example: 2.53 × 10^-26, 1.7562 × 10^+35, etc.).

• Solution: use floating-point number representation.

• A number F is represented as a triplet <s, M, E> such that
      F = (-1)^s × M × 2^E


F = (-1)^s × M × 2^E
• s is the sign bit, indicating whether the number is negative (s = 1) or positive (s = 0).
• M is called the mantissa, and is normally a fraction in the range [1.0, 2.0).
• E is called the exponent, which weights the number by a power of 2.

Encoding:
• Single-precision numbers: total 32 bits, E 8 bits, M 23 bits
• Double-precision numbers: total 64 bits, E 11 bits, M 52 bits

Bit layout:   | s | E | M |
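
To make the s / E / M layout concrete, here is a small added C sketch (assuming the machine's float is an IEEE-754 single-precision value) that copies the 32 bits of a float into an integer and masks out the three fields:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float f = -3.75f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);       /* reinterpret the 32 bits of the float */

    uint32_t s = bits >> 31;              /*  1 sign bit                     */
    uint32_t e = (bits >> 23) & 0xFF;     /*  8 exponent bits (biased)       */
    uint32_t m = bits & 0x7FFFFF;         /* 23 mantissa bits (no implied 1) */

    printf("s = %u, E = %u, M = 0x%06X\n", s, e, m);   /* s = 1, E = 128, M = 0x700000 */
    return 0;
}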


Points to Note
• The number of significant digits depends on the number of bits in M.
  • About 7 significant decimal digits for a 24-bit mantissa (23 bits + 1 implied bit).
• The range of the number depends on the number of bits in E.
  • About 10^-38 to 10^+38 for an 8-bit exponent.

How many significant digits?
  2^24 = 10^x  ⇒  24 log₁₀2 = x  ⇒  x ≈ 7.2  →  7 significant decimal places

Range of exponent?
  2^127 = 10^y  ⇒  127 log₁₀2 = y  ⇒  y ≈ 38.1  →  maximum exponent value ≈ 38 (in decimal)
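
These estimates can be cross-checked against the constants the C library exposes in <float.h>; a quick added sketch (assuming an IEEE-754 single-precision float type):

#include <stdio.h>
#include <float.h>

int main(void) {
    /* FLT_DIG is the number of decimal digits a float reliably preserves (~7);
       FLT_MAX and FLT_MIN show the roughly 10^+38 .. 10^-38 range of the exponent. */
    printf("FLT_DIG = %d\n", FLT_DIG);
    printf("FLT_MAX = %e\n", FLT_MAX);
    printf("FLT_MIN = %e\n", FLT_MIN);
    return 0;
}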


“Normalized” Representation
• We shall now see how E and M are actually encoded.
• Assume that the actual exponent of the number is EXP
  (i.e. the number is M × 2^EXP).
• Permissible range of E: 1 ≤ E ≤ 254 (the all-0 and all-1 patterns are not allowed).
• Encoding of the exponent E:
  The exponent is encoded as a biased value: E = EXP + BIAS,
  where BIAS = 127 (2^(8-1) – 1) for single precision, and
  BIAS = 1023 (2^(11-1) – 1) for double precision.


• Encoding of the mantissa M:


• The mantissa is coded with an implied leading 1 (i.e. effectively in 24 bits):
      M = 1.xxxx...x
• Here, xxxx…x denotes the bits that are actually stored for the mantissa. We get the extra
leading bit for free.
• When xxxx…x = 0000…0, M is minimum (= 1.0).
• When xxxx…x = 1111…1, M is maximum (= 2.0 – ε).
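
Putting the bias and the implied leading 1 together, a normalized single-precision value decodes as (-1)^s × (1 + M/2^23) × 2^(E-127). A minimal added C sketch of this decoding (illustrative only; it does not handle zero, denormals, or the special values discussed later):

#include <stdio.h>
#include <stdint.h>
#include <math.h>

/* Decode a normalized single-precision value from its raw 32-bit pattern. */
static double decode_normalized(uint32_t bits) {
    int s      = bits >> 31;                           /* sign bit                  */
    int exp    = (int)((bits >> 23) & 0xFF) - 127;     /* EXP = E - BIAS            */
    double man = 1.0 + (bits & 0x7FFFFF) / 8388608.0;  /* 1.xxxx..., 2^23 = 8388608 */
    return (s ? -1.0 : 1.0) * man * pow(2.0, exp);
}

int main(void) {
    printf("%g\n", decode_normalized(0x40700000u));    /* prints 3.75 */
    return 0;
}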


An Encoding Example

• Consider the number F = 15335


15335₁₀ = 11101111100111₂ = 1.1101111100111 × 2^13

• Mantissa will be stored as: M = 1101111100111 0000000000 (23 bits)

• Here, EXP = 13, BIAS = 127  ⇒  E = 13 + 127 = 140 = 10001100₂

  0 10001100 11011111001110000000000  =  466F9C00 in hex
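
The encoding can be verified mechanically with the same bit-copy trick used earlier; a short added C sketch (assuming IEEE-754 single precision):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float f = 15335.0f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);        /* grab the raw encoding of the float */
    printf("0x%08X\n", (unsigned)bits);    /* prints 0x466F9C00 */
    return 0;
}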


Another Encoding Example

• Consider the number F = -3.75


-3.75₁₀ = -11.11₂ = -1.111 × 2^1

• Mantissa will be stored as: M = 11100000000000000000000₂

• Here, EXP = 1, BIAS = 127  ⇒  E = 1 + 127 = 128 = 10000000₂

  1 10000000 11100000000000000000000  =  C0700000 in hex


Special Values

• When E = 000…0
  • M = 000…0 represents the value 0.
    (Zero is represented by the all-zero string.)
  • M ≠ 000…0 represents numbers very close to 0.
    (Also referred to as de-normalized numbers.)

• When E = 111…1
  • M = 000…0 represents the value ∞ (infinity).
  • M ≠ 000…0 represents Not-a-Number (NaN).
    (NaN represents cases when no numeric value can be determined, like uninitialized values,
    ∞ × 0, ∞ – ∞, the square root of a negative number, etc.)
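
These special values can be produced and tested from C via the classification macros in <math.h>; an added illustrative sketch (assuming IEEE-754 behaviour, where a non-zero value divided by zero yields infinity and 0/0 yields NaN):

#include <stdio.h>
#include <math.h>

int main(void) {
    float inf_val = 1.0f / 0.0f;    /* E = all 1s, M = 0  -> infinity */
    float nan_val = 0.0f / 0.0f;    /* E = all 1s, M != 0 -> NaN      */

    printf("isinf(inf_val)     = %d\n", isinf(inf_val));      /* 1 */
    printf("isnan(nan_val)     = %d\n", isnan(nan_val));      /* 1 */
    printf("nan_val == nan_val = %d\n", nan_val == nan_val);  /* 0: NaN compares unequal even to itself */
    return 0;
}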


Summary of Number Encodings


[Number line: NaN | –∞ | –Normalized | –Denorm | –0 | +0 | +Denorm | +Normalized | +∞ | NaN]

Denormal numbers have very small magnitudes (close to 0) such that trying to
normalize them will lead to an exponent that is below the minimum possible value.
• Mantissa with leading 0’s and exponent field equal to zero.
• Number of significant digits gets reduced in the process.
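
The denormal range can be inspected directly; an added C sketch (the bit pattern 0x00000001, i.e. E = 0 with the smallest non-zero M, is the smallest positive single-precision denormal):

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <float.h>

int main(void) {
    uint32_t bits = 0x00000001;                    /* E = 000...0, M = 000...001 */
    float smallest_denorm;
    memcpy(&smallest_denorm, &bits, sizeof bits);

    printf("smallest denormal = %e\n", smallest_denorm);  /* about 1.4e-45 */
    printf("smallest normal   = %e\n", FLT_MIN);          /* about 1.2e-38 */
    return 0;
}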


Rounding
• Suppose we are adding two numbers (say, in single-precision).
• We add the mantissa values after shifting one of them right for exponent alignment.
• We take the first 23 bits of the sum, and discard the residue R (remaining bits).

• IEEE-754 format supports four rounding modes:


a) Truncation
b) Round to +∞ (similar to ceiling function)
c) Round to -∞ (similar to floor function)
d) Round to nearest


• To implement rounding, two temporary bits are maintained:


• Round Bit (r): This is equal to the MSB of the residue R.
• Sticky Bit (s): This is the logical OR of the rest of the bits of the residue R.

• Decisions regarding rounding can be taken based on these bits:


a) R > 0: if r + s = 1
b) R = 0.5: if r.s’ = 1
c) R > 0.5: if r.s = 1 // ‘+’ is logical OR, ‘.’ is logical AND

• Renormalization after Rounding:


• If the process of rounding generates a result that is not in normalized form, then we need to re-normalize the result.
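
As a software illustration of these two bits (not the hardware datapath), the added sketch below computes r and s from a residue held in an integer and applies round-to-nearest with the common ties-to-even convention (an assumption here; the slides only name "round to nearest"):

#include <stdio.h>
#include <stdint.h>

/* Round a kept mantissa given a residue of 'rbits' discarded bits (rbits >= 1).
   r = MSB of the residue, s = OR of the remaining residue bits. Illustrative only. */
static uint32_t round_nearest(uint32_t kept, uint32_t residue, int rbits) {
    uint32_t r = (residue >> (rbits - 1)) & 1;
    uint32_t s = (residue & ((1u << (rbits - 1)) - 1)) != 0;

    if (r && (s || (kept & 1)))   /* R > 0.5, or R = 0.5 with an odd kept value */
        kept += 1;                /* a carry out here would require renormalization */
    return kept;
}

int main(void) {
    /* keep 1010, discard residue 110 (R > 0.5): rounds up to 1011 = 11 */
    printf("%u\n", round_nearest(0xA, 0x6, 3));
    return 0;
}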


Some Exercises

Decode the following single-precision floating-point numbers.


a) 0011 1111 1000 0000 0000 0000 0000 0000
b) 0100 0000 0110 0000 0000 0000 0000 0000
c) 0100 1111 1101 0000 0000 0000 0000 0000
d) 1000 0000 0000 0000 0000 0000 0000 0000
e) 0111 1111 1000 0000 0000 0000 0000 0000
f) 0111 1111 1101 0101 0101 0101 0101 0101


Floating-point Arithmetic


Floating Point Addition/Subtraction


• Two numbers: M1 × 2^E1 and M2 × 2^E2, where E1 > E2 (say).
• Basic steps:
• Select the number with the smaller exponent (i.e. E2) and shift its mantissa right by (E1-E2) positions.
• Set the exponent of the result equal to the larger exponent (i.e. E1).
• Carry out M1 ± M2, and determine the sign of the result.
• Normalize the resulting value, if necessary.
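
A simplified software version of these steps, operating on positive values given as integer mantissa/exponent pairs (an added sketch only: sign handling is omitted, and the bits shifted out during alignment, i.e. the residue, are simply discarded rather than rounded):

#include <stdio.h>
#include <stdint.h>

/* Add two positive values given as M x 2^E with integer mantissas. */
static void fp_add(uint64_t m1, int e1, uint64_t m2, int e2,
                   uint64_t *mr, int *er) {
    if (e1 < e2) {                 /* make the first operand the one with the larger exponent */
        uint64_t tm = m1; m1 = m2; m2 = tm;
        int te = e1; e1 = e2; e2 = te;
    }
    m2 >>= (e1 - e2);              /* align: shift the smaller operand's mantissa right */
    *mr = m1 + m2;                 /* add the mantissas */
    *er = e1;                      /* the result keeps the larger exponent */
    /* a real implementation would now normalize and round *mr */
}

int main(void) {
    uint64_t m; int e;
    /* 270.75 = 1083 x 2^-2  and  2.375 = 19 x 2^-3 */
    fp_add(1083, -2, 19, -3, &m, &e);
    printf("%llu x 2^%d\n", (unsigned long long)m, e);   /* 1092 x 2^-2 = 273 (273.125 before the residue was dropped) */
    return 0;
}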


Addition Example
• Suppose we want to add F1 = 270.75 and F2 = 2.375
  F1 = (270.75)₁₀ = (100001110.11)₂ = 1.0000111011 × 2^8
  F2 = (2.375)₁₀ = (10.011)₂ = 1.0011 × 2^1
• Shift the mantissa of F2 right by 8 – 1 = 7 positions, and add:

      1.0000111011 0000000000000
    + 0.0000001001 1000000000000 0000000
    -------------------------------------
      1.0001000100 1000000000000 0000000   (the trailing bits beyond 23 places form the residue)

• Result: 1.00010001001 × 2^8


Subtraction Example
• Suppose we want to subtract F2 = 224 from F1 = 270.75
  F1 = (270.75)₁₀ = (100001110.11)₂ = 1.0000111011 × 2^8
  F2 = (224)₁₀ = (11100000)₂ = 1.11 × 2^7
• Shift the mantissa of F2 right by 8 – 7 = 1 position, and subtract:

      1.0000111011 0000000000000
    – 0.1110000000 0000000000000 0
    ------------------------------
      0.0010111011 0000000000000 0

• For normalization, shift the mantissa left by 3 positions, and decrement E by 3.
• Result: 1.0111011 × 2^5


Floating-point Multiplication

• Two numbers: M1 × 2^E1 and M2 × 2^E2


• Basic steps:
• Add the exponents E1 and E2 and subtract the BIAS.
• Multiply M1 and M2 and determine the sign of the result.
• Normalize the resulting value, if necessary.
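
The same steps can be mimicked in software by working directly on the bit fields; a rough added C sketch (assuming normalized IEEE-754 single-precision inputs; rounding, overflow/underflow and special values are ignored):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Multiply two normalized single-precision floats field by field. */
static float fp_mul(float a, float b) {
    uint32_t x, y;
    memcpy(&x, &a, sizeof x);
    memcpy(&y, &b, sizeof y);

    uint32_t sign = (x ^ y) & 0x80000000u;                  /* XOR of the sign bits              */
    int32_t  e    = (int32_t)((x >> 23) & 0xFF)
                  + (int32_t)((y >> 23) & 0xFF) - 127;      /* add exponents, subtract the BIAS  */
    uint64_t m    = (uint64_t)((x & 0x7FFFFF) | 0x800000)   /* 24-bit mantissas with implied 1   */
                  * (uint64_t)((y & 0x7FFFFF) | 0x800000);  /* 48-bit product                    */

    if (m & (1ull << 47)) { m >>= 1; e++; }                 /* normalize: product was in [2, 4)  */
    uint32_t frac = (uint32_t)(m >> 23) & 0x7FFFFF;         /* keep 23 bits, truncate the residue */

    uint32_t bits = sign | ((uint32_t)e << 23) | frac;
    float r;
    memcpy(&r, &bits, sizeof r);
    return r;
}

int main(void) {
    printf("%f\n", fp_mul(270.75f, -2.375f));   /* -643.031250 */
    return 0;
}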


Multiplication Example

• Suppose we want to multiply F1 = 270.75 and F2 = -2.375


F1 = (270.75)₁₀ = (100001110.11)₂ = 1.0000111011 × 2^8
F2 = (-2.375)₁₀ = (-10.011)₂ = -1.0011 × 2^1

• Add the exponents: 8 + 1 = 9

• Multiply the mantissas: 1.0000111011 × 1.0011 = 1.01000001100001
• Result: -1.01000001100001 × 2^9


[Block diagram: hardware for floating-point multiplication. The sign bits s1 and s2 are XORed
to give the result sign s3. The 8-bit exponents E1 and E2 go through an 8-bit adder, and a
9-bit subtractor then removes the BIAS (1111111₂ = 127) to give E3. The 23-bit mantissas M1
and M2, extended to 24 bits with the implied leading 1, feed a 24 x 24 multiplier whose 48-bit
product passes through a normalizer that produces the 23-bit mantissa M3 and adjusts the
exponent if necessary.]


Floating-point Division

• Two numbers: M1 × 2^E1 and M2 × 2^E2


• Basic steps:
• Subtract the exponents E1 and E2 and add the BIAS.
• Divide M1 by M2 and determine the sign of the result.
• Normalize the resulting value, if necessary.
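
A matching added sketch for division, working on 24-bit mantissas (implied 1 included) and biased exponents (illustrative only: no rounding or special-case handling; the example operands are the mantissa bit patterns of 270.75 and 2.375):

#include <stdio.h>
#include <stdint.h>

/* Divide two normalized mantissa/exponent pairs (biased 8-bit exponents). */
static void fp_div(uint32_t m1, int e1, uint32_t m2, int e2,
                   uint64_t *mq, int *eq) {
    *eq = e1 - e2 + 127;                  /* subtract the exponents, add the BIAS back  */
    *mq = ((uint64_t)m1 << 23) / m2;      /* quotient with 23 fraction bits (truncated) */
    if (*mq < (1u << 23)) {               /* quotient below 1.0: normalize              */
        *mq <<= 1;
        *eq -= 1;
    }
}

int main(void) {
    /* 270.75: M = 1.0000111011 -> 0x876000 (with implied 1), biased E = 135
        2.375: M = 1.0011       -> 0x980000 (with implied 1), biased E = 128 */
    uint64_t m; int e;
    fp_div(0x876000, 135, 0x980000, 128, &m, &e);
    printf("M = 0x%llX, unbiased E = %d\n", (unsigned long long)m, e - 127);
    /* prints M = 0xE40000 (= 1.110010...), unbiased E = 6: 270.75 / 2.375 = 114 */
    return 0;
}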


Division Example

• Suppose we want to divide F1 = 270.75 by F2 = -2.375


F1 = (270.75)₁₀ = (100001110.11)₂ = 1.0000111011 × 2^8
F2 = (-2.375)₁₀ = (-10.011)₂ = -1.0011 × 2^1

• Subtract the exponents: 8 – 1 = 7

• Divide the mantissas: 1.0000111011 / 1.0011 = 0.1110010
• Result: -0.1110010 × 2^7
• After normalization: -1.110010 × 2^6


[Block diagram: hardware for floating-point division, mirroring the multiplier. The sign bits
s1 and s2 are XORed to give s3. An 8-bit subtractor computes E1 – E2, and a 9-bit adder then
adds the BIAS (1111111₂ = 127) back to give E3. The 24-bit mantissas (with the implied leading 1)
feed a mantissa divider, and the quotient passes through a normalizer that produces the 23-bit
mantissa M3.]


FLOATING-POINT ARITHMETIC in MIPS32


• The MIPS32 architecture defines the following floating-point registers (FPRs).


• 32 32-bit floating-point registers F0 to F31, each of which is capable of storing a single-
precision floating-point number.
• Double-precision floating-point numbers can be stored in even-odd pairs of FPRs (e.g.,
(F0,F1), (F10,F11), etc.).

• In addition, there are five special-purpose FPU control registers.


[Diagram: the thirty-two FPRs F0, F1, F2, ..., F30, F31, alongside the five special-purpose
FPU control registers FIR, FCCR, FEXR, FENR and FCSR.]


Typical Floating Point Instructions in MIPS32

• Load and Store instructions


• Load Word from memory
• Load Double-word from memory
• Store Word to memory
• Store Double-word to memory

• Data Movement instructions


• Move data between integer registers and floating-point registers
• Move data between integer registers and floating-point control registers


• Arithmetic instructions
• Floating-point absolute value
• Floating-point compare
• Floating-point negate
• Floating-point add
• Floating-point subtract
• Floating-point multiply
• Floating-point divide
• Floating-point square root
• Floating-point multiply add
• Floating-point multiply subtract


• Rounding instructions:
• Floating-point truncate
• Floating-point ceiling
• Floating-point floor
• Floating-point round

• Format conversions:
• Single-precision to double-precision
• Double-precision to single-precision


Example: Add a scalar s to a vector A

for (i = 1000; i > 0; i--)
    A[i] = A[i] + s;

Loop: L.D   F0, 0(R1)      # load A[i] into (F0,F1)
      ADD.D F4, F0, F2     # (F4,F5) = A[i] + s
      S.D   F4, 0(R1)      # store the result back into A[i]
      ADDI  R1, R1, -8     # point R1 at the previous element
      BNE   R1, R2, Loop   # repeat until R1 reaches R2

R1: initially points to A[1000]
R2: initialized such that 8(R2) is the address of A[1]
(F2,F3): contains the scalar s
We assume double precision (64 bits): the numbers are stored in (F0,F1), (F2,F3), and (F4,F5).
