
Lecture5 COA


BITS Pilani

Pilani Campus

DS RAO
COA -IS ZC353
IMP Note to Self
IMP Note to Students
• Logging in to the session does not by itself guarantee attendance.
• Once you join the session, stay until the end to be counted as present in the class.
• Importantly, help make the class more interactive by responding to the professor's queries during the session.
• Whenever the professor calls your number or name, you need to respond; otherwise you will be marked ABSENT.
Real Numbers

Numbers with fractions


Could be done in pure binary
▪ 1001.1010 = 2^3 + 2^0 + 2^{-1} + 2^{-3} = 9.625
Where is the binary point?
Fixed?
▪ Very limited
Moving?
▪ How do you show where it is?
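
The positional weights in the pure-binary example above can be checked with a short sketch. The function name `binary_to_decimal` is illustrative and not part of the lecture; it assumes a string with at most one binary point.

```python
def binary_to_decimal(bits: str) -> float:
    """Evaluate a pure binary string such as '1001.1010' by positional weights."""
    whole, _, frac = bits.partition(".")
    value = 0.0
    # Integer digits carry weights 2^0, 2^1, ... counted from the binary point leftwards.
    for i, b in enumerate(reversed(whole)):
        value += int(b) * 2 ** i
    # Fraction digits carry weights 2^-1, 2^-2, ... counted from the binary point rightwards.
    for i, b in enumerate(frac, start=1):
        value += int(b) * 2 ** -i
    return value


print(binary_to_decimal("1001.1010"))  # 9.625
```

With a fixed binary point, the number of digits on each side is frozen, which is why the slide calls this approach very limited.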

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Floating Point

• Normalized scientific notation: a single non-zero digit to the left of the decimal (binary) point. Example: 3.5 × 10^9

• 1.010001_two × 2^{-5} = (1 + 0 × 2^{-1} + 1 × 2^{-2} + … + 1 × 2^{-6})_ten × 2^{-5}

• A standard notation enables easy exchange of data between machines and simplifies hardware algorithms. The IEEE 754 standard defines how floating point numbers are represented.



Sign and Magnitude Representation

Field:   Sign    Exponent   Fraction
Width:   1 bit   8 bits     23 bits
Symbol:  S       E          F

• More exponent bits → wider range of numbers (not necessarily more numbers; recall that there are infinitely many real numbers)

• More fraction bits → higher precision

• Register value = (-1)^S × F × 2^E

• Since we are only representing normalized numbers, we are guaranteed that the number is of the form 1.xxxx…
Hence, in the IEEE 754 standard, the leading 1 is implicit:
Register value = (-1)^S × (1 + F) × 2^E

• Largest number that can be represented:

• Smallest number that can be represented:

0.000000001_ten or 1.0_ten × 10^{-9} (seconds in a nanosecond)

3,155,760,000_ten or 3.15576_ten × 10^9 (seconds in a typical century)



Biased Exponent Representation
• How to represent a signed exponent? Choices are …
  • Sign + magnitude representation for the exponent
  • Two's complement representation
  • Biased representation
• IEEE 754 uses biased representation for the exponent
  • Value of exponent = val(E) = E – Bias (Bias is a constant)
• Recall that the exponent field is 8 bits for single precision
  • E can be in the range 0 to 255
  • E = 0 and E = 255 are reserved for special use (discussed later)
  • E = 1 to 254 are used for normalized floating point numbers
  • Bias = 127 (half of 254), val(E) = E – 127
  • val(E=1) = –126, val(E=127) = 0, val(E=254) = 127
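
The mapping between the stored field E and the true exponent can be sketched as follows. The helper names `encode_exponent` and `decode_exponent` are assumptions made for illustration; only the normalized range E = 1 to 254 is handled.

```python
BIAS = 127  # single-precision bias, as stated on the slide


def encode_exponent(value: int) -> int:
    """Return the stored field E for a true exponent (normalized range only)."""
    e = value + BIAS
    if not 1 <= e <= 254:
        raise ValueError("exponent outside the normalized single-precision range")
    return e


def decode_exponent(e: int) -> int:
    """Return val(E) = E - Bias for a stored exponent field."""
    return e - BIAS


for e in (1, 127, 254):
    # Stored field in decimal, as 8 bits, and the exponent value it stands for.
    print(e, format(e, "08b"), decode_exponent(e))
```

One consequence of the bias is that the stored fields sort in the same order as the exponents they represent, so exponents can be compared as unsigned integers.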


Biased Exponent – Cont’d

 Value of a Normalized Floating Point Number is

(–1)^S × (1.F)_2 × 2^{E – Bias}

(–1)^S × (1.f1 f2 f3 f4 …)_2 × 2^{E – Bias}

(–1)^S × (1 + f1×2^{-1} + f2×2^{-2} + f3×2^{-3} + f4×2^{-4} + …) × 2^{E – Bias}



Examples of Single Precision Float
• What is the decimal value of this Single Precision float?
  10111110001000000000000000000000
• Solution:
  • Sign = 1, so the number is negative
  • Exponent = (01111100)_2 = 124, E – Bias = 124 – 127 = –3
  • Significand = (1.0100 … 0)_2 = 1 + 2^{-2} = 1.25 (the leading 1. is implicit)
  • Value in decimal = –1.25 × 2^{-3} = –0.15625
    Or: –(1.0100 … 0)_2 × 2^{-3} = –(0.00101)_2 = –0.15625

• What is the decimal value of this Single Precision float?
  01000001001001100000000000000000
• Solution:
  • Sign = 0 (positive), Exponent = (10000010)_2 = 130, Fraction = 0100110 … 0 (the leading 1. is implicit)
  • Value in decimal = +(1.01001100 … 0)_2 × 2^{130 – 127}
    = (1.01001100 … 0)_2 × 2^3 = (1010.01100 … 0)_2 = 10.375
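
The two decodings above can be reproduced by splitting the 32-bit pattern into its S, E and F fields. This is a minimal sketch for normalized numbers only; `decode_single` is a hypothetical helper name, and the `struct` call is used purely as an independent cross-check.

```python
import struct


def decode_single(bits: str) -> float:
    """Decode a normalized single-precision pattern using (-1)^S * (1 + F) * 2^(E - 127)."""
    bits = bits.replace(" ", "")
    if len(bits) != 32:
        raise ValueError("expected 32 bits")
    sign = int(bits[0])
    e = int(bits[1:9], 2)      # 8-bit biased exponent field
    f = int(bits[9:], 2)       # 23-bit fraction field as an integer
    if e == 0 or e == 255:
        raise ValueError("reserved exponent: not a normalized number")
    significand = 1 + f / 2 ** 23   # restore the implicit leading 1
    return (-1) ** sign * significand * 2.0 ** (e - 127)


def decode_with_struct(bits: str) -> float:
    """Cross-check: let the struct module interpret the same 4 bytes."""
    return struct.unpack(">f", int(bits, 2).to_bytes(4, "big"))[0]


for pattern in ("10111110001000000000000000000000",
                "01000001001001100000000000000000"):
    print(decode_single(pattern), decode_with_struct(pattern))
```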



Converting FP Decimal to Binary

• Convert –0.8125 to binary in single precision
• Solution:
  • Fraction bits can be obtained using multiplication by 2
    0.8125 × 2 = 1.625
    0.625 × 2 = 1.25
    0.25 × 2 = 0.5
    0.5 × 2 = 1.0
  • Stop when the fractional part is 0
  • 0.8125 = (0.1101)_2 = 1/2 + 1/4 + 1/16 = 13/16
  • Fraction = (0.1101)_2 = (1.101)_2 × 2^{-1} (Normalized)
  • Exponent = –1 + Bias = 126 (single precision) = 7E hex = (01111110)_2
  • Sign = 1 (negative)

10111111010100000000000000000000
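
The multiplication-by-2 steps and the final bit pattern can be checked with a small sketch. `fraction_bits` and `encode_single` are illustrative names; the encoding itself is delegated to Python's `struct` module rather than built by hand.

```python
import struct


def fraction_bits(x: float, max_bits: int = 32) -> str:
    """Repeated multiplication by 2 for 0 <= x < 1; stops when the fraction is 0."""
    bits = ""
    while x != 0 and len(bits) < max_bits:
        x *= 2
        bit = int(x)        # the integer part is the next binary digit
        bits += str(bit)
        x -= bit            # keep only the fractional part
    return bits


def encode_single(value: float) -> str:
    """Return the 32-bit single-precision pattern of value as a bit string."""
    (word,) = struct.unpack(">I", struct.pack(">f", value))
    return format(word, "032b")


print(fraction_bits(0.8125))    # 1101
pattern = encode_single(-0.8125)
print(pattern[0], pattern[1:9], pattern[9:])  # sign, exponent and fraction fields
```

The `max_bits` cap is there because most decimal fractions (0.1, for instance) never reach a fractional part of 0 in binary.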



Largest Normalized Float

• What is the largest normalized float?
• Solution for Single Precision:
  01111111011111111111111111111111
  • Exponent – Bias = 254 – 127 = 127 (largest exponent for SP)
  • Significand = (1.111 … 1)_2 = almost 2
  • Value in decimal ≈ 2 × 2^{127} ≈ 2^{128} ≈ 3.4028 … × 10^{38}
• Overflow: exponent is too large to fit in the exponent field



Smallest Normalized Float

• What is the smallest (in absolute value) normalized float?
• Solution for Single Precision:
  00000000100000000000000000000000
  • Exponent – Bias = 1 – 127 = –126 (smallest exponent for SP)
  • Significand = (1.000 … 0)_2 = 1
  • Value in decimal = 1 × 2^{-126} = 1.17549 … × 10^{-38}
• Underflow: exponent is too small to fit in the exponent field
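
Both extremes can be confirmed by interpreting the two bit patterns directly. This is a sketch using Python's `struct` module; `bits_to_float` is a hypothetical helper name.

```python
import struct


def bits_to_float(bits: str) -> float:
    """Interpret a 32-character bit string as an IEEE 754 single-precision value."""
    return struct.unpack(">f", int(bits, 2).to_bytes(4, "big"))[0]


largest = bits_to_float("0" + "11111110" + "1" * 23)    # E = 254, F = all ones
smallest = bits_to_float("0" + "00000001" + "0" * 23)   # E = 1,   F = 0

# Exact values: (2 - 2^-23) * 2^127 and 1 * 2^-126.
print(largest == (2 - 2 ** -23) * 2.0 ** 127)   # True
print(smallest == 2.0 ** -126)                  # True
print(f"{largest:.4e}  {smallest:.5e}")         # 3.4028e+38  1.17549e-38
```

The exact largest significand is 2 – 2^{-23}, which is why the slide rounds it to "almost 2".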



Zero, Infinity, and NaN

• Zero
  • Exponent field E = 0 and fraction F = 0
  • +0 and –0 are possible according to sign bit S
• Infinity
  • Infinity is a special value represented with maximum E and F = 0
  • For single precision with 8-bit exponent: maximum E = 255
  • Infinity can result from overflow or division by zero
  • +∞ and –∞ are possible according to sign bit S
• NaN (Not a Number)
  • NaN is a special value represented with maximum E and F ≠ 0
  • Result of exceptional situations, such as 0/0 or sqrt(negative)
  • An operation on a NaN results in NaN: Op(X, NaN) = NaN
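
The encoding rules above amount to a small decision procedure on the E and F fields. The sketch below uses an illustrative `classify` function; the case E = 0 with F ≠ 0 is not covered on this slide, so it is only labelled as such.

```python
import math


def classify(bits: str) -> str:
    """Classify a 32-bit single-precision pattern from its E and F fields."""
    sign = "-" if bits[0] == "1" else "+"
    e = int(bits[1:9], 2)
    f = int(bits[9:], 2)
    if e == 0 and f == 0:
        return sign + "zero"
    if e == 255 and f == 0:
        return sign + "infinity"
    if e == 255:
        return "NaN"
    if e == 0:
        return "E = 0 with F != 0 (not covered on this slide)"
    return "normalized"


print(classify("0" + "0" * 8 + "0" * 23))          # +zero
print(classify("1" + "1" * 8 + "0" * 23))          # -infinity
print(classify("0" + "1" * 8 + "1" + "0" * 22))    # NaN

# NaN propagates through arithmetic: Op(X, NaN) = NaN.
nan = float("inf") - float("inf")
print(math.isnan(nan), math.isnan(nan + 1.0))      # True True
```

Note that Python itself raises `ZeroDivisionError` for `1.0 / 0.0` instead of returning infinity, so the NaN here is produced from ∞ – ∞.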
Summary of IEEE 754 Encoding



Floating Point Addition Example

Consider Adding (Single-Precision Floating-Point):

+ 1.11100100000000000000010_2 × 2^4
+ 1.10000000000000110000101_2 × 2^2
• Cannot add significands … Why?
  • Because exponents are not equal
• How to make exponents equal?
  • Shift the significand of the lesser exponent right
  • Difference between the two exponents = 4 – 2 = 2
  • So, shift the second number right by 2 bits and increment its exponent by 2
1.10000000000000110000101_2 × 2^2 =
0.01100000000000001100001 01_2 × 2^4



Floating-Point Addition – cont'd

• Now, ADD the Significands:

+ 1.11100100000000000000010 × 2^4
+ 1.10000000000000110000101 × 2^2
--------------------------------------------------
+ 1.11100100000000000000010 × 2^4
+ 0.01100000000000001100001 01 × 2^4 (shift right)
--------------------------------------------------
+10.01000100000000001100011 01 × 2^4 (result)

• Addition produces a carry bit, result is NOT normalized
• Normalize Result (shift right and increment exponent):
+ 10.01000100000000001100011 01 × 2^4
= + 1.00100010000000000110001 101 × 2^5



Rounding

• Single-precision requires only 23 fraction bits
• However, the normalized result can contain additional bits
1.00100010000000000110001 | 1 01 × 2^5
Round Bit: R = 1, Sticky Bit: S = 1
• Two extra bits are needed for rounding
  • Round bit: appears just after the normalized result
  • Sticky bit: appears after the round bit (OR of all additional bits)
• Since RS = 11, increment fraction to round to nearest
1.00100010000000000110001 × 2^5
+1 (added to the last fraction bit)
----------------------------------------------
1.00100010000000000110010 × 2^5 (Rounded)
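
The align, add, normalize and round steps of this example can be traced with plain integer arithmetic. This is a sketch for two positive normalized operands only; the names `to_int`, `add_positive` and `show` are illustrative. Instead of shifting the smaller operand right and losing bits, it scales the larger one left so the sum stays exact until rounding.

```python
FRAC = 23  # fraction bits in single precision


def to_int(sig: str) -> int:
    """'1.' followed by 23 fraction bits -> 24-bit integer significand."""
    return int(sig.replace(".", ""), 2)


def add_positive(sig_a: int, exp_a: int, sig_b: int, exp_b: int):
    """Add two positive normalized numbers; returns (24-bit significand, exponent)."""
    if exp_a < exp_b:                       # make operand a the one with the larger exponent
        sig_a, exp_a, sig_b, exp_b = sig_b, exp_b, sig_a, exp_a
    extra = exp_a - exp_b                   # bits that fall below the 23rd fraction bit
    total = (sig_a << extra) + sig_b        # exact sum, aligned to the larger exponent
    exp = exp_a
    if total >> (FRAC + extra + 1):         # carry out: sum is 10.xxx, so normalize
        extra += 1
        exp += 1
    kept = total >> extra                   # 1.f1 ... f23
    rest = total & ((1 << extra) - 1)       # bits beyond f23
    round_bit = ((rest >> (extra - 1)) & 1) if extra else 0
    sticky = int((rest & ((1 << (extra - 1)) - 1)) != 0) if extra else 0
    if round_bit and (sticky or kept & 1):  # round to nearest, ties to even
        kept += 1
        if kept >> (FRAC + 1):              # rounding itself overflowed to 10.000...
            kept >>= 1
            exp += 1
    return kept, exp


def show(sig: int, exp: int) -> str:
    s = format(sig, "024b")
    return f"{s[0]}.{s[1:]} x 2^{exp}"


a = to_int("1.11100100000000000000010")
b = to_int("1.10000000000000110000101")
print(show(*add_positive(a, 4, b, 2)))  # 1.00100010000000000110010 x 2^5
```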



FP Subtraction
• Sometimes, addition is converted into subtraction
  • If the sign bits of the operands are different
• Consider Adding:
+ 1.00000000101100010001101 × 2^{-6}
– 1.00000000000000010011010 × 2^{-1}
--------------------------------------------------
+ 0.00001000000001011000100 01101 × 2^{-1} (shift right 5 bits)
– 1.00000000000000010011010 × 2^{-1}
--------------------------------------------------
0 0.00001000000001011000100 01101 × 2^{-1}
1 0.11111111111111101100110 × 2^{-1} (2's complement)
--------------------------------------------------
1 1.00001000000001000101010 01101 × 2^{-1} (ADD)
--------------------------------------------------
– 0.11110111111110111010101 10011 × 2^{-1} (2's complement)
• 2's complement of result is required if result is negative



FP Subtraction – cont'd

+ 1.00000000101100010001101 × 2^{-6}
– 1.00000000000000010011010 × 2^{-1}
– 0.11110111111110111010101 10011 × 2^{-1} (result is negative)
• Result should be normalized
  • For subtraction, we can have leading zeros. To normalize, count the number of leading zeros, then shift the result left and decrement the exponent accordingly.
– 0.11110111111110111010101 1 0011 × 2^{-1} (guard bit = 1, shown separately after the fraction)
– 1.11101111111101110101011 0011 × 2^{-2} (Normalized; the guard bit is now the last fraction bit)
• Guard bit: guards against loss of a fraction bit
  • Needed for subtraction, when the result has a leading zero and should be normalized.



FP Subtraction – cont'd
• Next, the normalized result should be rounded
– 0.11110111111110111010101 1 0 011 × 2^{-1} (guard bit = 1)
– 1.11101111111101110101011 0 011 × 2^{-2} (Normalized)
Round bit: R = 0, Sticky bit: S = 1

• Since R = 0, it is more accurate to truncate the result even if S = 1. We simply discard the extra bits.

– 1.11101111111101110101011 0 011 × 2^{-2} (Normalized)
– 1.11101111111101110101011 × 2^{-2} (Rounded to nearest)
• Exponent –2 is biased to –2 + 127 = 125 = 7D hex
• IEEE 754 Representation of Result

10111110111101111111101110101011
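
The whole subtraction example (alignment, leading-zero normalization, rounding and final encoding) can be traced with integers. This is a sketch that assumes the operand with the larger magnitude is passed first; `subtract_magnitudes` and `to_int` are illustrative names. The extra low-order bits kept during alignment play the role of the guard, round and sticky bits.

```python
FRAC = 23
BIAS = 127


def to_int(sig: str) -> int:
    """'1.' followed by 23 fraction bits -> 24-bit integer significand."""
    return int(sig.replace(".", ""), 2)


def subtract_magnitudes(sig_big: int, exp_big: int, sig_small: int, exp_small: int):
    """Return (significand, exponent) of big - small, assuming big > small > 0."""
    extra = exp_big - exp_small                 # low-order bits kept below f23
    diff = (sig_big << extra) - sig_small       # exact difference
    exp = exp_big
    while diff < (1 << (FRAC + extra)):         # leading zero: shift left, decrement exponent
        if extra > 0:
            extra -= 1                          # an extra (guard) bit moves into the fraction
        else:
            diff <<= 1
        exp -= 1
    kept = diff >> extra
    rest = diff & ((1 << extra) - 1)
    round_bit = ((rest >> (extra - 1)) & 1) if extra else 0
    sticky = int((rest & ((1 << (extra - 1)) - 1)) != 0) if extra else 0
    if round_bit and (sticky or kept & 1):      # round to nearest, ties to even
        kept += 1
        if kept >> (FRAC + 1):
            kept >>= 1
            exp += 1
    return kept, exp


x = to_int("1.00000000101100010001101")   # positive operand, exponent -6
y = to_int("1.00000000000000010011010")   # negative operand, exponent -1
sig, exp = subtract_magnitudes(y, -1, x, -6)

# |y| > |x|, so the result takes the sign of y (negative, sign bit = 1).
word = (1 << 31) | ((exp + BIAS) << FRAC) | (sig & ((1 << FRAC) - 1))
print(format(word, "032b"))   # 10111110111101111111101110101011
```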
Rounding to Nearest Even
• Normalized result has the form: 1. f1 f2 … fl R S
  • The round bit R appears after the last fraction bit fl
  • The sticky bit S is the OR of all remaining additional bits
• Round to Nearest Even: default rounding mode
• Four cases for RS:
  • RS = 00 → Result is exact, no need for rounding
  • RS = 01 → Truncate result by discarding RS
  • RS = 11 → Increment result: ADD 1 to last fraction bit
  • RS = 10 → Tie case (either truncate or increment result)
    • Check the last fraction bit fl (f23 for single precision or f52 for double)
    • If fl is 0 then truncate result to keep fraction even
    • If fl is 1 then increment result to make fraction even
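
The four RS cases translate into a few lines of code. This is a sketch in which the kept significand (1.f1 … fl) is held in an integer; `round_nearest_even` is an illustrative name.

```python
def round_nearest_even(kept: int, r: int, s: int) -> int:
    """Apply round-to-nearest-even to the kept bits, given round bit r and sticky bit s."""
    if r == 0:
        return kept               # RS = 00 (exact) or RS = 01 (truncate)
    if s == 1:
        return kept + 1           # RS = 11: more than halfway, increment
    return kept + (kept & 1)      # RS = 10: tie, increment only if the last bit is 1


print(bin(round_nearest_even(0b1010, 0, 1)))  # 0b1010 (truncate)
print(bin(round_nearest_even(0b1010, 1, 1)))  # 0b1011 (increment)
print(bin(round_nearest_even(0b1010, 1, 0)))  # 0b1010 (tie, already even)
print(bin(round_nearest_even(0b1011, 1, 0)))  # 0b1100 (tie, rounded up to even)
```

If the increment carries out of the top bit (for example 1.11…1 becoming 10.00…0), the result has to be normalized once more by shifting right and incrementing the exponent.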



Floating Point Multiplication Example

• Consider multiplying:
–1.110 1000 0100 0000 1010 0001_2 × 2^{-4}
× 1.100 0000 0001 0000 0000 0000_2 × 2^{-2}
• Unlike addition, we add the exponents of the operands
  • Result exponent value = (–4) + (–2) = –6
• Using the biased representation: EZ = EX + EY – Bias
  • EX = (–4) + 127 = 123 (Bias = 127 for single precision)
  • EY = (–2) + 127 = 125
  • EZ = 123 + 125 – 127 = 121 (value = –6)
• Sign bit of product can be computed independently
  • Sign bit of product = SignX XOR SignY = 1 (negative)



FP Multiplication – cont'd

• Now multiply the significands:

(Multiplicand) 1.11010000100000010100001
(Multiplier) × 1.10000000001000000000000
--------------------------------------------------
111010000100000010100001 (partial product for multiplier bit 2^{-11})
111010000100000010100001 (partial product for multiplier bit 2^{-1})
1.11010000100000010100001 (partial product for multiplier bit 2^0)
--------------------------------------------------
10.1011100011111011111100110010100001000000000000 (sum of the shifted partial products)
• 24 bits × 24 bits → 48 bits (double number of bits)
• Multiplicand × 0 = 0, so zero rows are eliminated
• Multiplicand × 1 = Multiplicand (shifted left)



FP Multiplication – cont'd

• Normalize Product:
–10.10111000111110111111001100... × 2^{-6}
Shift right and increment exponent because of carry bit
= –1.010111000111110111111001100... × 2^{-5}
• Round to Nearest Even: (keep only 23 fraction bits)
1.01011100011111011111100 | 1 100... × 2^{-5}
Round bit = 1, Sticky bit = 1, so increment fraction
Final result = –1.01011100011111011111101 × 2^{-5}
• Biased exponent = –5 + 127 = 122 = (01111010)_2
• IEEE 754 Representation
10111101001011100011111011111101
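
The multiplication example can be reproduced end to end with integer arithmetic: XOR the signs, add the exponents, multiply the 24-bit significands, then normalize and round. This is a sketch for normalized operands whose result stays in the normalized range; `multiply` and `to_int` are illustrative names.

```python
FRAC = 23
BIAS = 127


def to_int(sig: str) -> int:
    """'1.' followed by 23 fraction bits -> 24-bit integer significand."""
    return int(sig.replace(".", ""), 2)


def multiply(sign_x: int, sig_x: int, exp_x: int,
             sign_y: int, sig_y: int, exp_y: int) -> str:
    """Return the 32-bit pattern of x * y (exponents are unbiased values)."""
    sign = sign_x ^ sign_y                   # sign is computed independently
    exp = exp_x + exp_y                      # exponents add
    product = sig_x * sig_y                  # 48-bit product with 46 fraction bits
    extra = FRAC                             # bits below the 23 kept fraction bits
    if product >> (2 * FRAC + 1):            # product is 1x.xxx: normalize
        extra += 1
        exp += 1
    kept = product >> extra
    rest = product & ((1 << extra) - 1)
    round_bit = (rest >> (extra - 1)) & 1
    sticky = int((rest & ((1 << (extra - 1)) - 1)) != 0)
    if round_bit and (sticky or kept & 1):   # round to nearest, ties to even
        kept += 1
        if kept >> (FRAC + 1):
            kept >>= 1
            exp += 1
    word = (sign << 31) | ((exp + BIAS) << FRAC) | (kept & ((1 << FRAC) - 1))
    return format(word, "032b")


x = to_int("1.11010000100000010100001")
y = to_int("1.10000000001000000000000")
print(multiply(1, x, -4, 0, y, -2))  # 10111101001011100011111011111101
```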



Examples

Final representation: (-1)^S × (1 + Fraction) × 2^{(Exponent – Bias)}

• Represent -0.75_ten in single and double-precision formats

Single: (1 + 8 + 23)

Double: (1 + 11 + 52)

• What decimal number is represented by the following single-precision number?
1 1000 0001 01000…0000
