Digital Engineering

Fall 2024

Lecture 02 - Floating Point Arithmetic


Instructor: Dr. Tarek Abdul Hamid
The World is Not Just Integers
 Programming languages support numbers with a fractional part
 Called floating-point numbers
 Examples:
3.14159265… (π)
2.71828… (e)
0.000000001 or 1.0 × 10^–9 (seconds in a nanosecond)
86,400,000,000,000 or 8.64 × 10^13 (nanoseconds in a day)
The last number is an integer too large to fit in a 32-bit integer
 We use scientific notation to represent
 Very small numbers (e.g. 1.0 × 10^–9)
 Very large numbers (e.g. 8.64 × 10^13)
 Scientific notation: ± d.f1f2f3f4… × 10^(±e1e2e3)



Floating-Point Numbers
 Examples of floating-point numbers in base 10 …
 5.341×10^3 , 0.05341×10^5 , –2.013×10^–1 , –201.3×10^–3
 Examples of floating-point numbers in base 2 …
 1.00101×2^23 , 0.0100101×2^25 , –1.101101×2^–3 , –1101.101×2^–6
 Exponents are kept in decimal for clarity
 The binary number (1101.101)_2 = 2^3 + 2^2 + 2^0 + 2^–1 + 2^–3 = 13.625
 Floating-point numbers should be normalized
 Exactly one non-zero digit should appear before the point
 In a decimal number, this digit can be from 1 to 9
 In a binary number, this digit should be 1
 Normalized FP Numbers: 5.341×10^3 and –1.101101×2^–3
 NOT Normalized: 0.05341×10^5 and –1101.101×2^–6
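The positional weights and the normalization rule above can be checked with a short sketch. The helper names `binary_to_decimal` and `normalize` are made up for this illustration and are not part of the lecture; `normalize` leans on Python's `math.frexp`, which returns a mantissa in [0.5, 1), so the result is rescaled to get exactly one non-zero bit before the point.

```python
import math

def binary_to_decimal(bits: str) -> float:
    """Evaluate an unsigned binary string such as '1101.101'."""
    whole, _, frac = bits.partition(".")
    value = float(int(whole, 2)) if whole else 0.0
    # The i-th bit after the binary point (1-based) has weight 2^-i.
    for i, bit in enumerate(frac, start=1):
        value += int(bit) * 2.0 ** -i
    return value

def normalize(x: float) -> tuple[float, int]:
    """Return (m, e) with x = m * 2**e and 1 <= |m| < 2 (x must be non-zero)."""
    m, e = math.frexp(x)   # frexp yields 0.5 <= |m| < 1
    return m * 2, e - 1    # shift one position so the leading bit is 1

print(binary_to_decimal("1101.101"))   # 13.625
print(normalize(-13.625))              # (-1.703125, 3)
```

Here 1.703125 is (1.101101)_2, so –1101.101×2^–6 = –13.625×2^–6 normalizes to –1.101101×2^(3–6) = –1.101101×2^–3, matching the slide.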



Floating-Point Representation
 A floating-point number is represented by the triple (S, E, F)
 S is the Sign bit (0 is positive and 1 is negative)
 Representation is called sign and magnitude
 E is the Exponent field (signed)
 Very large numbers have large positive exponents
 Very small close-to-zero numbers have negative exponents
 More bits in the exponent field increase the range of values
 F is the Fraction field (fraction after binary point)
 More bits in the fraction field improve the precision of FP numbers

Field layout: [ S | Exponent | Fraction ]

Value of a floating-point number = (–1)^S × val(F) × 2^val(E)



IEEE 754 Floating-Point Standard
 Found in virtually every computer invented since 1980
 Simplified porting of floating-point numbers
 Unified the development of floating-point algorithms
 Increased the accuracy of floating-point numbers
 Single Precision Floating Point Numbers (32 bits)
 1-bit sign + 8-bit exponent + 23-bit fraction
Field layout: [ S (1 bit) | Exponent (8 bits) | Fraction (23 bits) ]

 Double Precision Floating Point Numbers (64 bits)
 1-bit sign + 11-bit exponent + 52-bit fraction
Field layout: [ S (1 bit) | Exponent (11 bits) | Fraction (52 bits) ]
Normalized Floating Point Numbers
 For a normalized floating point number (S, E, F)

Field layout: [ S | E | F = f1 f2 f3 f4 … ]
 Significand is equal to (1.F)_2 = (1.f1f2f3f4…)_2
 IEEE 754 assumes a hidden 1. (not stored) for normalized numbers
 Significand is 1 bit longer than fraction
 Value of a Normalized Floating Point Number is

(–1)^S × (1.F)_2 × 2^val(E)
= (–1)^S × (1.f1f2f3f4…)_2 × 2^val(E)
= (–1)^S × (1 + f1×2^–1 + f2×2^–2 + f3×2^–3 + f4×2^–4 + …) × 2^val(E)

(–1)^S is 1 when S is 0 (positive), and –1 when S is 1 (negative)
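As a rough illustration of the value formula, the sketch below evaluates it term by term. The function name `normalized_value` and its parameters are hypothetical; the exponent is passed in already decoded, as val(E).

```python
def normalized_value(sign: int, fraction_bits: str, exponent_value: int) -> float:
    """Evaluate (-1)^S x (1.F)_2 x 2^val(E) for a normalized number."""
    # Hidden leading 1 plus f1*2^-1 + f2*2^-2 + ...
    significand = 1.0 + sum(
        int(f) * 2.0 ** -i for i, f in enumerate(fraction_bits, start=1)
    )
    return (-1) ** sign * significand * 2.0 ** exponent_value

# (1.101)_2 x 2^2 = 1.625 x 4
print(normalized_value(0, "101", 2))    # 6.5
# -(1.1)_2 x 2^-1 = -1.5 x 0.5
print(normalized_value(1, "1", -1))     # -0.75
```

Trailing zeros in the fraction do not change the result, which is why only the leading fraction bits need to be written out.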



Biased Exponent Representation
 How to represent a signed exponent? Choices are …
 Sign + magnitude representation for the exponent
 Two’s complement representation
 Biased representation
 IEEE 754 uses biased representation for the exponent
 Value of exponent = val(E) = E – Bias (Bias is a constant)
 Recall that exponent field is 8 bits for single precision
 E can be in the range 0 to 255
 E = 0 and E = 255 are reserved for special use (discussed later)
 E = 1 to 254 are used for normalized floating point numbers
 Bias = 127 (half of 254), val(E) = E – 127
 val(E=1) = –126, val(E=127) = 0, val(E=254) = 127
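The bias arithmetic can be written once for any exponent width. This is a small sketch with a hypothetical helper `exponent_range`; it assumes, as stated above, that the all-zeros and all-ones exponent fields are reserved.

```python
def exponent_range(exponent_bits: int) -> dict:
    """Bias and usable exponent values for a biased exponent field."""
    max_field = 2 ** exponent_bits - 1   # all ones, reserved
    largest_normal = max_field - 1       # e.g. 254 for 8 bits
    bias = largest_normal // 2           # "half of 254" = 127
    return {
        "bias": bias,
        "min_value": 1 - bias,           # val(E = 1)
        "max_value": largest_normal - bias,
    }

print(exponent_range(8))    # {'bias': 127, 'min_value': -126, 'max_value': 127}
print(exponent_range(11))   # {'bias': 1023, 'min_value': -1022, 'max_value': 1023}
```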



Biased Exponent – Cont’d
 For double precision, exponent field is 11 bits
 E can be in the range 0 to 2047
 E = 0 and E = 2047 are reserved for special use
 E = 1 to 2046 are used for normalized floating point numbers
 Bias = 1023 (half of 2046), val(E) = E – 1023
 val(E=1) = –1022, val(E=1023) = 0, val(E=2046) = 1023
 Value of a Normalized Floating Point Number is

(–1)^S × (1.F)_2 × 2^(E – Bias)
= (–1)^S × (1.f1f2f3f4…)_2 × 2^(E – Bias)
= (–1)^S × (1 + f1×2^–1 + f2×2^–2 + f3×2^–3 + f4×2^–4 + …) × 2^(E – Bias)



Examples of Single Precision Float
 What is the decimal value of this Single Precision float?
10111110001000000000000000000000
 Solution:
 Sign = 1 is negative
 Exponent = (01111100)_2 = 124, E – bias = 124 – 127 = –3
 Significand = (1.0100 … 0)_2 = 1 + 2^–2 = 1.25 (1. is implicit)
 Value in decimal = –1.25 × 2^–3 = –0.15625
 What is the decimal value of?
01000001001001100000000000000000

 Solution:
 Sign = 0 is positive
 Exponent = (10000010)_2 = 130, E – bias = 130 – 127 = 3
 Value in decimal = +(1.01001100 … 0)_2 × 2^(130–127) (1. is implicit) =
(1.01001100 … 0)_2 × 2^3 = (1010.01100 … 0)_2 = 10.375
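Both worked examples can be reproduced by slicing the 32-bit pattern into its three fields. The sketch below uses a hypothetical `decode_single` helper for the by-hand computation and cross-checks it against Python's `struct` module, which reinterprets the same four bytes as a single-precision float. It only handles normalized numbers (E from 1 to 254).

```python
import struct

def decode_single(bits: str) -> float:
    """Decode a normalized single-precision bit pattern by hand."""
    if len(bits) != 32:
        raise ValueError("expected 32 bits")
    sign = int(bits[0])
    exponent_field = int(bits[1:9], 2)   # 8-bit biased exponent E
    fraction_field = int(bits[9:], 2)    # 23-bit fraction F
    if not 1 <= exponent_field <= 254:
        raise ValueError("E = 0 and E = 255 are reserved")
    significand = 1 + fraction_field / 2 ** 23   # (1.F)_2
    return (-1) ** sign * significand * 2.0 ** (exponent_field - 127)

for pattern in ("10111110001000000000000000000000",
                "01000001001001100000000000000000"):
    by_hand = decode_single(pattern)
    by_struct = struct.unpack(">f", int(pattern, 2).to_bytes(4, "big"))[0]
    print(by_hand, by_struct)
# -0.15625 -0.15625
# 10.375 10.375
```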
Examples of Double Precision Float
 What is the decimal value of this Double Precision float?
01000000010100101010000000000000
00000000000000000000000000000000
 Solution:
 Value of exponent = (10000000101)_2 – Bias = 1029 – 1023 = 6
 Value of double float = (1.00101010 … 0)_2 × 2^6 (1. is implicit) =
(1001010.10 … 0)_2 = 74.5
 What is the decimal value of this Double Precision float?

10111111100010000000000000000000
00000000000000000000000000000000

 Do it yourself! (answer should be –1.5 × 2^–7 = –0.01171875)
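One way to check both double-precision answers is to split the 64-bit pattern into fields and compare with what the machine reports. In this sketch `split_double` and `bits_to_double` are illustrative names; the second relies on `struct` reinterpreting the eight bytes as a double, and `split_double` assumes a normalized number (E from 1 to 2046).

```python
import struct

def split_double(bits: str) -> tuple[int, int, float]:
    """Return (sign, exponent value, significand) of a normalized double."""
    sign = int(bits[0])
    exponent_field = int(bits[1:12], 2)   # 11-bit biased exponent
    fraction_field = int(bits[12:], 2)    # 52-bit fraction
    return sign, exponent_field - 1023, 1 + fraction_field / 2 ** 52

def bits_to_double(bits: str) -> float:
    """Reinterpret a 64-bit pattern as a double-precision float."""
    return struct.unpack(">d", int(bits, 2).to_bytes(8, "big"))[0]

first = "01000000010100101010000000000000" + "0" * 32
second = "10111111100010000000000000000000" + "0" * 32

print(split_double(first), bits_to_double(first))     # (0, 6, 1.1640625) 74.5
print(split_double(second), bits_to_double(second))   # (1, -7, 1.5) -0.01171875
```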



Converting FP Decimal to Binary
 Convert –0.8125 to binary in single and double precision
 Solution:
 Fraction bits can be obtained using multiplication by 2
 0.8125 × 2 = 1.625
 0.625 × 2 = 1.25
 0.25 × 2 = 0.5 0.8125 = (0.1101)2 = ½ + ¼ + 1/16 = 13/16
 0.5 × 2 = 1.0
 Stop when fractional part is 0
 Fraction = (0.1101)2 = (1.101)2 × 2 –1 (Normalized)
 Exponent = –1 + Bias = 126 (single precision) and 1022 (double)
Single
10111111010100000000000000000000
Precision
10111111111010100000000000000000 Double
Precision
00000000000000000000000000000000
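The multiply-by-2 procedure and the two resulting encodings can be sketched as follows. `fraction_to_binary` is a hypothetical helper, and its `max_bits` cutoff is an added safeguard for fractions whose binary expansion does not terminate. The bit patterns are read back from Python's `struct` module rather than assembled by hand.

```python
import struct

def fraction_to_binary(x: float, max_bits: int = 52) -> str:
    """Binary digits of 0 <= x < 1 by repeated multiplication by 2."""
    digits = []
    while x != 0 and len(digits) < max_bits:
        x *= 2
        bit = int(x)          # the integer part is the next fraction bit
        digits.append(str(bit))
        x -= bit              # keep only the fractional part
    return "0." + "".join(digits)

print(fraction_to_binary(0.8125))   # 0.1101

# Read back the stored bits of -0.8125 in both precisions.
single = format(struct.unpack(">I", struct.pack(">f", -0.8125))[0], "032b")
double = format(struct.unpack(">Q", struct.pack(">d", -0.8125))[0], "064b")
print(single[0], single[1:9], single[9:])     # sign, exponent (126), fraction
print(double[0], double[1:12], double[12:])   # sign, exponent (1022), fraction
```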
Largest Normalized Float
 What is the Largest normalized float?
 Solution for Single Precision:
01111111011111111111111111111111
 Exponent – bias = 254 – 127 = 127 (largest exponent for SP)
 Significand = (1.111 … 1)_2 = almost 2
 Value in decimal ≈ 2 × 2^127 ≈ 2^128 ≈ 3.4028 … × 10^38
 Solution for Double Precision:
01111111111011111111111111111111
11111111111111111111111111111111
 Value in decimal ≈ 2 × 2^1023 ≈ 2^1024 ≈ 1.79769 … × 10^308
 Overflow: exponent is too large to fit in the exponent field
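The two largest patterns can be assembled from their fields and compared with the values quoted above. This sketch assumes the platform's Python float is an IEEE 754 double (the usual case), so that `sys.float_info.max` is the largest normalized double; the variable names are illustrative.

```python
import struct
import sys

# Sign 0, largest non-reserved exponent field, fraction of all ones.
single_bits = "0" + "11111110" + "1" * 23
double_bits = "0" + "11111111110" + "1" * 52

largest_single = struct.unpack(">f", int(single_bits, 2).to_bytes(4, "big"))[0]
largest_double = struct.unpack(">d", int(double_bits, 2).to_bytes(8, "big"))[0]

print(largest_single)   # 3.4028234663852886e+38
print(largest_double)   # 1.7976931348623157e+308

# Exact form: significand (2 - 2^-23) times 2^127, just below 2^128.
print(largest_single == (2 - 2 ** -23) * 2.0 ** 127)   # True
print(largest_double == sys.float_info.max)            # True

# Doubling needs exponent 1024, which does not fit: the result overflows.
print(largest_double * 2)                              # inf
```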



Smallest Normalized Float
 What is the smallest (in absolute value) normalized float?
 Solution for Single Precision:
00000000100000000000000000000000
 Exponent – bias = 1 – 127 = –126 (smallest exponent for SP)
 Significand = (1.000 … 0)_2 = 1
 Value in decimal = 1 × 2^–126 = 1.17549 … × 10^–38
 Solution for Double Precision:
00000000000100000000000000000000
00000000000000000000000000000000
 Value in decimal = 1 × 2^–1022 = 2.22507 … × 10^–308
 Underflow: exponent is too small to fit in exponent field
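Going the other way, the smallest normalized values can be computed as powers of two and packed to confirm the bit patterns shown above. As before, this sketch assumes Python floats are IEEE 754 doubles, so `sys.float_info.min` is the smallest normalized double; the variable names are illustrative.

```python
import math
import struct
import sys

smallest_single = math.ldexp(1.0, -126)    # 1.0 x 2^-126
smallest_double = math.ldexp(1.0, -1022)   # 1.0 x 2^-1022

print(smallest_single)   # 1.1754943508222875e-38
print(smallest_double)   # 2.2250738585072014e-308

# Exponent field is 1 (smallest non-reserved value), fraction is all zeros.
single_bits = format(struct.unpack(">I", struct.pack(">f", smallest_single))[0], "032b")
double_bits = format(struct.unpack(">Q", struct.pack(">d", smallest_double))[0], "064b")
print(single_bits)
print(double_bits)
print(smallest_double == sys.float_info.min)   # True
```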
