Floating Point Arithmetic: Computer Architecture and Assembly Language Dr. Aiman El-Maleh

Floating Point Arithmetic

ICS 233
Computer Architecture and Assembly Language
Dr. Aiman El-Maleh
College of Computer Sciences and Engineering
King Fahd University of Petroleum and Minerals
[Adapted from slides of Dr. M. Mudawar, ICS 233, KFUPM]
Outline

❖ Floating-Point Numbers

❖ IEEE 754 Floating-Point Standard

Floating Point ICS 233 – KFUPM © Muhamed Mudawar slide 2

The World is Not Just Integers
❖ Programming languages support numbers with fractions
 Called floating-point numbers
 Examples:
3.14159265… (π)
2.71828… (e)
0.000000001 or $1.0 \times 10^{-9}$ (seconds in a nanosecond)
86,400,000,000,000 or $8.64 \times 10^{13}$ (nanoseconds in a day)
The last number is an integer that is too large to fit in a 32-bit integer

❖ We use scientific notation to represent
 Very small numbers (e.g. $1.0 \times 10^{-9}$)
 Very large numbers (e.g. $8.64 \times 10^{13}$)
 Scientific notation: $\pm\, d.f_1 f_2 f_3 f_4 \ldots \times 10^{\pm e_1 e_2 e_3}$
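The claim about the 32-bit limit can be checked with a few lines of Python. This is only an illustrative sketch (Python is not used in the original slides, and the helper name `fits_in_int32` is made up for the example); it assumes a 32-bit two's complement integer.

```python
# Illustrative check: does the nanoseconds-in-a-day count fit in 32 bits?
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1                 # largest 32-bit two's complement value

ns_per_day = 24 * 60 * 60 * 10**9     # 86,400,000,000,000

def fits_in_int32(n):
    """Return True if n is representable as a 32-bit signed integer."""
    return INT32_MIN <= n <= INT32_MAX

print(fits_in_int32(ns_per_day))      # False
print(f"{ns_per_day:.2e}")            # 8.64e+13 (scientific notation)
print(f"{0.000000001:.1e}")           # 1.0e-09
```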
Floating-Point Numbers
❖ Examples of floating-point numbers in base 10 …
 $5.341 \times 10^{3}$, $0.05341 \times 10^{5}$, $-2.013 \times 10^{-1}$, $-201.3 \times 10^{-3}$ (the point is a decimal point)
❖ Examples of floating-point numbers in base 2 …
 $1.00101 \times 2^{23}$, $0.0100101 \times 2^{25}$, $-1.101101 \times 2^{-3}$, $-1101.101 \times 2^{-6}$ (the point is a binary point)
 Exponents are kept in decimal for clarity
 The binary number $(1101.101)_2 = 2^{3} + 2^{2} + 2^{0} + 2^{-1} + 2^{-3} = 13.625$
❖ Floating-point numbers should be normalized
 Exactly one non-zero digit should appear before the point
▪ In a decimal number, this digit can be from 1 to 9
▪ In a binary number, this digit should be 1
 Normalized FP Numbers: $5.341 \times 10^{3}$ and $-1.101101 \times 2^{-3}$
 NOT Normalized: $0.05341 \times 10^{5}$ and $-1101.101 \times 2^{-6}$
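Binary normalization can be sketched in Python with `math.frexp`, which returns a significand in [0.5, 1); shifting it by one bit gives the "one non-zero digit before the point" form used above. The function name `normalize_binary` is an assumption made for this example.

```python
import math

def normalize_binary(x):
    """Return (significand, exponent) with x == significand * 2**exponent
    and 1 <= |significand| < 2. x must be nonzero and finite."""
    m, e = math.frexp(x)       # frexp gives 0.5 <= |m| < 1
    return m * 2, e - 1        # shift one bit so that 1 <= |m| < 2

# (1101.101)_2 = 13.625 normalizes to (1.101101)_2 x 2^3
print(normalize_binary(13.625))            # (1.703125, 3)
# -(1101.101)_2 x 2^-6 normalizes to -(1.101101)_2 x 2^-3
print(normalize_binary(-13.625 * 2**-6))   # (-1.703125, -3)
```

Here 1.703125 is the decimal value of the binary significand 1.101101.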
Floating-Point Representation
❖ A floating-point number is represented by the triple (S, E, F)
 S is the Sign bit (0 is positive and 1 is negative)
▪ Representation is called sign and magnitude
 E is the Exponent field (signed)
▪ Very large numbers have large positive exponents
▪ Very small close-to-zero numbers have negative exponents
▪ More bits in the exponent field increase the range of values
 F is the Fraction field (fraction after binary point)
▪ More bits in the fraction field improve the precision of FP numbers

S Exponent Fraction

Value of a floating-point number = $(-1)^{S} \times val(F) \times 2^{val(E)}$

FP Overflow & Underflow

• Fixed-size representation leads to limitations
• Overflow: large positive exponent; the result rounds to –∞ or +∞
• Underflow: large negative exponent; the result rounds to zero
• Unlike integer arithmetic, overflow gives an imprecise result (∞), not an inaccurate result

[Figure: the real number line, left to right: negative overflow, expressible negative values, negative underflow, zero, positive underflow, expressible positive values, positive overflow]

Alan L. Cox
Next . . .

❖ Floating-Point Numbers

❖ IEEE 754 Floating-Point Standard


IEEE 754 Floating-Point Standard
❖ Single Precision Floating Point Numbers (32 bits)
 1-bit sign + 8-bit exponent + 23-bit fraction

❖ Double Precision Floating Point Numbers (64 bits)

 1-bit sign + 11-bit exponent + 52-bit fraction

Single precision: S (1 bit) | Exponent (8 bits) | Fraction (23 bits)

Double precision: S (1 bit) | Exponent (11 bits) | Fraction (52 bits)
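The field widths above can be made concrete with a small Python sketch that splits a stored bit pattern into S, E and F. It assumes the standard `struct` module and that Python's `float` is an IEEE 754 double (the case for CPython on common hardware); the function names are illustrative and not from the slides. The exponent values returned are the raw stored fields.

```python
import struct

def float32_fields(x):
    """Return (S, E, F) of x after rounding to IEEE 754 single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # 32-bit pattern as int
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits
    fraction = bits & 0x7FFFFF        # 23 bits
    return sign, exponent, fraction

def float64_fields(x):
    """Return (S, E, F) of a double-precision value."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))  # 64-bit pattern as int
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

print(float32_fields(-5.0))   # (1, 129, 2097152)
print(float64_fields(-5.0))   # (1, 1025, 1125899906842624)
```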


Normalized Floating Point Numbers
❖ For a normalized floating point number (S, E, F)
S | E | F = $f_1 f_2 f_3 f_4 \ldots$

❖ Significand is equal to $(1.F)_2 = (1.f_1 f_2 f_3 f_4 \ldots)_2$

 IEEE 754 assumes a hidden 1. (not stored) for normalized numbers
 Significand is 1 bit longer than fraction
❖ Value of a Normalized Floating Point Number is
$(-1)^{S} \times (1.F)_2 \times 2^{val(E)}$
$(-1)^{S} \times (1.f_1 f_2 f_3 f_4 \ldots)_2 \times 2^{val(E)}$
$(-1)^{S} \times (1 + f_1 \times 2^{-1} + f_2 \times 2^{-2} + f_3 \times 2^{-3} + f_4 \times 2^{-4} + \ldots) \times 2^{val(E)}$

$(-1)^{S}$ is 1 when S is 0 (positive), and –1 when S is 1 (negative)

Biased Exponent Representation
❖ How to represent a signed exponent? Choices are …
 Sign + magnitude representation for the exponent
 Two’s complement representation
 Biased representation
❖ IEEE 754 uses biased representation for the exponent
 Value of exponent = val(E) = E – Bias (Bias is a constant)
❖ Recall that exponent field is 8 bits for single precision
 E can be in the range 0 to 255
 E = 0 and E = 255 are reserved for special use (discussed later)
 E = 1 to 254 are used for normalized floating point numbers
 Bias = 127 (half of 254), val(E) = E – 127
 val(E=1) = –126, val(E=127) = 0, val(E=254) = 127
Biased Exponent – Cont’d
❖ For double precision, exponent field is 11 bits
 E can be in the range 0 to 2047
 E = 0 and E = 2047 are reserved for special use
 E = 1 to 2046 are used for normalized floating point numbers
 Bias = 1023 (half of 2046), val(E) = E – 1023
 val(E=1) = –1022, val(E=1023) = 0, val(E=2046) = 1023
❖ Value of a Normalized Floating Point Number is

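$(-1)^{S} \times (1.F)_2 \times 2^{E - Bias}$

$(-1)^{S} \times (1.f_1 f_2 f_3 f_4 \ldots)_2 \times 2^{E - Bias}$
$(-1)^{S} \times (1 + f_1 \times 2^{-1} + f_2 \times 2^{-2} + f_3 \times 2^{-3} + f_4 \times 2^{-4} + \ldots) \times 2^{E - Bias}$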
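The bias and exponent ranges quoted for single and double precision follow one pattern, sketched below in Python. The closed form `2**(k-1) - 1` for a k-bit exponent field is the usual IEEE 754 choice and is an addition here (the slides only give the values 127 and 1023); the function name is hypothetical.

```python
def exponent_parameters(exp_bits):
    """Return (bias, smallest val(E), largest val(E)) for normalized numbers
    with an exponent field of exp_bits bits."""
    bias = 2**(exp_bits - 1) - 1        # 127 for 8 bits, 1023 for 11 bits
    e_min, e_max = 1, 2**exp_bits - 2   # all-zeros and all-ones are reserved
    return bias, e_min - bias, e_max - bias

print(exponent_parameters(8))    # (127, -126, 127)
print(exponent_parameters(11))   # (1023, -1022, 1023)
```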


Examples of Single Precision Float
❖ What is the decimal value of this Single Precision float?
10111110001000000000000000000000

❖ Solution:
 Sign = 1 is negative
 Exponent = $(01111100)_2 = 124$, E – Bias = 124 – 127 = –3
 Significand = $(1.0100 \ldots 0)_2 = 1 + 2^{-2} = 1.25$ (1. is implicit)
 Value in decimal = $-1.25 \times 2^{-3} = -0.15625$
❖ What is the decimal value of this Single Precision float?
01000001001001100000000000000000

❖ Solution:
 Value in decimal = $+(1.01001100 \ldots 0)_2 \times 2^{130 - 127}$ =
$(1.01001100 \ldots 0)_2 \times 2^{3} = (1010.01100 \ldots 0)_2 = 10.375$ (1. is implicit)
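The two worked examples can be reproduced with a short Python sketch that applies the normalized-number formula field by field. It is a minimal illustration, not part of the original slides: the name `decode_single` is made up, and the special exponent values 0 and 255 are deliberately not handled.

```python
def decode_single(bits):
    """Decode a 32-character bit string as a normalized single-precision
    float using (-1)^S x (1.F)_2 x 2^(E - 127)."""
    assert len(bits) == 32
    s = int(bits[0], 2)          # sign bit
    e = int(bits[1:9], 2)        # 8-bit biased exponent
    f = int(bits[9:], 2)         # 23-bit fraction as an integer
    assert 1 <= e <= 254, "only normalized numbers are handled in this sketch"
    significand = 1 + f / 2**23  # the implicit leading 1 plus the fraction
    return (-1)**s * significand * 2.0**(e - 127)

print(decode_single("10111110001000000000000000000000"))   # -0.15625
print(decode_single("01000001001001100000000000000000"))   # 10.375
```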
Examples of Double Precision Float
❖ What is the decimal value of this Double Precision float?
01000000010100101010000000000000
00000000000000000000000000000000

❖ Solution:
 Value of exponent = $(10000000101)_2$ – Bias = 1029 – 1023 = 6
 Value of double float = $(1.00101010 \ldots 0)_2 \times 2^{6}$ (1. is implicit) =
$(1001010.10 \ldots 0)_2 = 74.5$
❖ What is the decimal value of this Double Precision float?
10111111100010000000000000000000
00000000000000000000000000000000

❖ Do it yourself! (answer should be $-1.5 \times 2^{-7} = -0.01171875$)
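One way to check a hand-decoded answer is to let the machine reinterpret the same 64 bits. The Python sketch below does this with the standard `struct` module, assuming the platform's double is IEEE 754; `bits_to_double` is a name chosen for the example.

```python
import struct

def bits_to_double(bits):
    """Interpret a 64-character bit string as an IEEE 754 double."""
    assert len(bits) == 64
    return struct.unpack(">d", int(bits, 2).to_bytes(8, "big"))[0]

first = "01000000010100101010000000000000" + "0" * 32
second = "10111111100010000000000000000000" + "0" * 32

print(bits_to_double(first))    # 74.5
print(bits_to_double(second))   # -0.01171875
```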

Converting FP Decimal to Binary
❖ Convert –0.8125 to binary in single and double precision
❖ Solution:
 Fraction bits can be obtained using multiplication by 2
▪ 0.8125 × 2 = 1.625
▪ 0.625 × 2 = 1.25
▪ 0.25 × 2 = 0.5
▪ 0.5 × 2 = 1.0
▪ Stop when fractional part is 0
 $0.8125 = (0.1101)_2 = 1/2 + 1/4 + 1/16 = 13/16$
 Fraction = $(0.1101)_2 = (1.101)_2 \times 2^{-1}$ (Normalized)
 Exponent = –1 + Bias = 126 (single precision) and 1022 (double)
Single Precision:
10111111010100000000000000000000
Double Precision:
10111111111010100000000000000000
00000000000000000000000000000000
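The repeated multiplication by 2 can be sketched in Python as follows. Exact rational arithmetic (`fractions.Fraction`) is used so that each step is exact; the function name and the `max_bits` cut-off are assumptions added for the example, since fractions such as 1/10 never terminate in binary.

```python
from fractions import Fraction

def fraction_bits(x, max_bits=32):
    """Binary digits after the point of 0 <= x < 1, obtained by repeated
    multiplication by 2. Stops when the fractional part is 0 or after
    max_bits digits."""
    x = Fraction(x)
    bits = []
    while x != 0 and len(bits) < max_bits:
        x *= 2
        bit = int(x)           # the integer part is the next bit (0 or 1)
        bits.append(str(bit))
        x -= bit               # keep only the fractional part
    return "".join(bits)

print(fraction_bits(0.8125))                # 1101
print(fraction_bits(Fraction(1, 10), 12))   # 000110011001 (never terminates)
```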
Basic Technique

• Represent the decimal in the form $\pm 1.xxx_2 \times 2^{y}$

• And “fill in the fields”
– Remember biased exponent and implicit “1.” mantissa!
• Examples:
– 0.0: 0 00000000 00000000000000000000000
– 1.0 (1.0 x 2^0): 0 01111111 00000000000000000000000
– 0.5 (0.1 binary = 1.0 x 2^-1): 0 01111110 00000000000000000000000
– 0.75 (0.11 binary = 1.1 x 2^-1): 0 01111110 10000000000000000000000
– 3.0 (11 binary = 1.1*2^1): 0 10000000 10000000000000000000000
– -0.375 (-0.011 binary = -1.1*2^-2): 1 01111101 10000000000000000000000
– 1 10000011 01000000000000000000000 = - 1.01 * 2^4 = -20.0
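The "fill in the fields" technique can be written out as a small Python sketch. It is illustrative only: `encode_single` is a hypothetical helper that handles zero and values exactly representable as normalized single-precision numbers, and it performs no rounding.

```python
import math

def encode_single(x):
    """Return 'S EEEEEEEE F...F' for zero or for a value that is exactly
    representable as a normalized single-precision float."""
    sign = "1" if math.copysign(1.0, x) < 0 else "0"
    if x == 0:
        return f"{sign} {'0' * 8} {'0' * 23}"
    m, e = math.frexp(abs(x))               # abs(x) == m * 2**e, 0.5 <= m < 1
    significand, exponent = m * 2, e - 1    # now 1 <= significand < 2
    assert -126 <= exponent <= 127, "outside the normalized range"
    frac = (significand - 1) * 2**23        # drop the implicit 1, scale to 23 bits
    assert frac == int(frac), "value needs more than 23 fraction bits"
    return f"{sign} {exponent + 127:08b} {int(frac):023b}"

for value in (0.0, 1.0, 0.5, 0.75, 3.0, -0.375, -20.0):
    print(value, encode_single(value))
```

The printed patterns can be compared line by line with the examples listed above.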

Lec 14 Systems Architecture 15

http://www.math-cs.gordon.edu/courses/cs311/lectures-2003/binary.html
Copyright ©2003 - Russell C. Bjork
Floating-Point Example
• Represent –0.75
– $-0.75 = (-1)^{1} \times 1.1_2 \times 2^{-1}$
– S = 1
– Fraction = $1000 \ldots 00_2$
– Exponent = –1 + Bias
• Single: –1 + 127 = 126 = $01111110_2$
• Double: –1 + 1023 = 1022 = $01111111110_2$

• Single: 1011111101000…00
• Double: 1011111111101000…00


Jeremy R. Johnson, Anatole D. Ruslanov, William M. Mongan
Floating-Point Example
• What number is represented by the single-precision float
11000000101000…00
– S=1
– Fraction = $01000 \ldots 00_2$
– Exponent = $10000001_2 = 129$
• $x = (-1)^{1} \times (1 + .01_2) \times 2^{(129 - 127)}$
$= (-1) \times 1.25 \times 2^{2}$
$= -5.0$

Largest Normalized Float
❖ What is the Largest normalized float?
❖ Solution for Single Precision:
01111111011111111111111111111111

 Exponent – bias = 254 – 127 = 127 (largest exponent for SP)

 Significand = $(1.111 \ldots 1)_2$ = almost 2
 Value in decimal ≈ $2 \times 2^{127} \approx 2^{128} \approx 3.4028 \ldots \times 10^{38}$
❖ Solution for Double Precision:
01111111111011111111111111111111
11111111111111111111111111111111

 Value in decimal ≈ $2 \times 2^{1023} \approx 2^{1024} \approx 1.79769 \ldots \times 10^{308}$

❖ Overflow: exponent is too large to fit in the exponent field
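The largest values can be checked numerically. The Python sketch below assumes that `float` is an IEEE 754 double, so `sys.float_info.max` is the largest normalized double; the exact significands `2 - 2**-23` and `2 - 2**-52` spell out "almost 2" and are added here for illustration.

```python
import math
import struct
import sys

# Largest single: sign 0, exponent field 254, fraction all ones
pattern = "0" + "11111110" + "1" * 23
largest_single = struct.unpack(">f", int(pattern, 2).to_bytes(4, "big"))[0]
print(largest_single == (2 - 2**-23) * 2.0**127)      # True

# Largest double: significand (1.11...1)_2 with 52 fraction bits
largest_double = (2 - 2**-52) * 2.0**1023
print(largest_double == sys.float_info.max)           # True

# Going past the largest exponent overflows to infinity
print(math.isinf(sys.float_info.max * 2))             # True
```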
Smallest Normalized Float
❖ What is the smallest (in absolute value) normalized float?
❖ Solution for Single Precision:
00000000100000000000000000000000
 Exponent – bias = 1 – 127 = –126 (smallest exponent for SP)
 Significand = $(1.000 \ldots 0)_2 = 1$
 Value in decimal = $1 \times 2^{-126} = 1.17549 \ldots \times 10^{-38}$
❖ Solution for Double Precision:
00000000000100000000000000000000
00000000000000000000000000000000

 Value in decimal = $1 \times 2^{-1022} = 2.22507 \ldots \times 10^{-308}$

❖ Underflow: exponent is too small to fit in exponent field
Zero, Infinity, and NaN
❖ Zero
 Exponent field E = 0 and fraction F = 0
 +0 and –0 are possible according to sign bit S
❖ Infinity
 Infinity is a special value represented with maximum E and F = 0
▪ For single precision with 8-bit exponent: maximum E = 255
▪ For double precision with 11-bit exponent: maximum E = 2047
 Infinity can result from overflow or division by zero
 +∞ and –∞ are possible according to sign bit S
❖ NaN (Not a Number)
 NaN is a special value represented with maximum E and F ≠ 0
 Results from exceptional situations, such as 0/0 or sqrt(negative)
 An operation on a NaN results in NaN: Op(X, NaN) = NaN
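The encodings of the special values can be inspected directly. The Python sketch below prints the single-precision fields of ±0, ±∞ and a NaN using the standard `struct` module; `single_bits` is an illustrative helper name. The exact fraction bits of the NaN are platform dependent, so only "maximum E and F ≠ 0" is checked.

```python
import math
import struct

def single_bits(x):
    """Return the single-precision encoding of x as 'S EEEEEEEE F...F'."""
    (n,) = struct.unpack(">I", struct.pack(">f", x))
    b = f"{n:032b}"
    return f"{b[0]} {b[1:9]} {b[9:]}"

print(single_bits(0.0))         # 0 00000000 00000000000000000000000
print(single_bits(-0.0))        # 1 00000000 00000000000000000000000
print(single_bits(math.inf))    # 0 11111111 00000000000000000000000
print(single_bits(-math.inf))   # 1 11111111 00000000000000000000000

nan_bits = single_bits(math.nan)
print(nan_bits[2:10] == "11111111" and "1" in nan_bits[11:])   # True

print(0.0 == -0.0)              # True: the two zeros compare equal
print(math.nan == math.nan)     # False: NaN is not equal to anything
```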
Denormalized Numbers
❖ IEEE standard uses denormalized numbers to …
 Fill the gap between 0 and the smallest normalized float
 Provide gradual underflow to zero
❖ Denormalized: exponent field E is 0 and fraction F ≠ 0
 Implicit 1. before the fraction now becomes 0. (not normalized)
❖ Value of denormalized number ( S, 0, F )
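Single precision: $(-1)^{S} \times (0.F)_2 \times 2^{-126}$
Double precision: $(-1)^{S} \times (0.F)_2 \times 2^{-1022}$
[Figure: single-precision number line, left to right: negative overflow (–∞) below $-2^{128}$, normalized negative values, negative denormals between $-2^{-126}$ and 0 (negative underflow), positive denormals between 0 and $2^{-126}$ (positive underflow), normalized positive values, positive overflow (+∞) above $2^{128}$]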
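Gradual underflow can be observed with ordinary doubles. The Python sketch below assumes `float` is an IEEE 754 double with denormals enabled (the usual CPython behaviour); `denormal_double` is a hypothetical helper that evaluates the double-precision formula above for a given fraction field.

```python
import math
import struct
import sys

def denormal_double(f):
    """Value of the positive denormal with 52-bit fraction field f:
    (0.F)_2 x 2^-1022 = f x 2^-52 x 2^-1022."""
    assert 0 < f < 2**52
    return math.ldexp(f, -52 - 1022)

smallest_normal = sys.float_info.min       # 1 x 2^-1022
smallest_denormal = denormal_double(1)     # fraction 00...01

print(smallest_normal == math.ldexp(1.0, -1022))    # True
print(0.0 < smallest_denormal < smallest_normal)    # True: fills the gap
print(smallest_normal / 2 == denormal_double(2**51))  # True: gradual underflow
print(smallest_denormal / 2)                        # 0.0: finally underflows

# Cross-check against the stored encoding: E = 0, F = 1
print(struct.unpack(">d", (1).to_bytes(8, "big"))[0] == smallest_denormal)  # True
```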
Special Value Rules

Operation            Result
n / ±∞               0
±∞ × ±∞              ±∞
nonzero / 0          ±∞
∞ + ∞                ∞ (similar for –∞)
0 / 0                NaN
∞ – ∞                NaN (similar for –∞)
±∞ / ±∞              NaN
±∞ × 0               NaN
NaN op anything      NaN
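Most of these rules can be tried out in Python using `math.inf` and `math.nan`, as in the sketch below. One caveat: Python raises `ZeroDivisionError` for a float division by zero instead of returning ±∞ or NaN, so the two division-by-zero rows are not reproduced by the `/` operator.

```python
import math

inf, nan = math.inf, math.nan

print(1.0 / inf)                # 0.0    n / inf   -> 0
print(inf * -inf)               # -inf   inf x inf -> +/-inf
print(inf + inf)                # inf    inf + inf -> inf
print(math.isnan(inf - inf))    # True   inf - inf -> NaN
print(math.isnan(inf / inf))    # True   inf / inf -> NaN
print(math.isnan(inf * 0.0))    # True   inf x 0   -> NaN
print(math.isnan(nan + 1.0))    # True   NaN op anything -> NaN

try:
    1.0 / 0.0                   # Python-specific: raises instead of giving inf
except ZeroDivisionError as err:
    print("ZeroDivisionError:", err)
```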


Summary of IEEE 754 Encoding
Single-Precision        Exponent (8 bits)    Fraction (23 bits)    Value
Normalized Number       1 to 254             Anything              $\pm (1.F)_2 \times 2^{E - 127}$
Denormalized Number     0                    nonzero               $\pm (0.F)_2 \times 2^{-126}$
Zero                    0                    0                     ±0
Infinity                255                  0                     ±∞
NaN                     255                  nonzero               NaN

Double-Precision        Exponent (11 bits)   Fraction (52 bits)    Value
Normalized Number       1 to 2046            Anything              $\pm (1.F)_2 \times 2^{E - 1023}$
Denormalized Number     0                    nonzero               $\pm (0.F)_2 \times 2^{-1022}$
Zero                    0                    0                     ±0
Infinity                2047                 0                     ±∞
NaN                     2047                 nonzero               NaN
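The single-precision half of the summary table translates directly into a classification routine. The following Python sketch is one possible reading of the table; the function name and the returned labels are chosen for the example.

```python
def classify_single(bits):
    """Classify a 32-character bit string using the single-precision
    rows of the summary table."""
    assert len(bits) == 32
    e = int(bits[1:9], 2)     # 8-bit exponent field
    f = int(bits[9:], 2)      # 23-bit fraction field
    if e == 0:
        return "zero" if f == 0 else "denormalized"
    if e == 255:
        return "infinity" if f == 0 else "NaN"
    return "normalized"       # exponent field 1 to 254, any fraction

print(classify_single("0" * 32))                              # zero
print(classify_single("0" + "0" * 8 + "0" * 22 + "1"))        # denormalized
print(classify_single("10111110001000000000000000000000"))    # normalized
print(classify_single("1" + "1" * 8 + "0" * 23))              # infinity
print(classify_single("0" + "1" * 8 + "1" + "0" * 22))        # NaN
```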

Simple 6-bit Floating Point Example
❖ 6-bit floating point representation
S (1 bit) | Exponent (3 bits) | Fraction (2 bits)
 Sign bit is the most significant bit
 Next 3 bits are the exponent with a bias of 3
 Last 2 bits are the fraction
❖ Same general form as IEEE
 Normalized, denormalized
 Representation of 0, infinity and NaN
❖ Value of normalized numbers: $(-1)^{S} \times (1.F)_2 \times 2^{E - 3}$
❖ Value of denormalized numbers: $(-1)^{S} \times (0.F)_2 \times 2^{-2}$

Values Related to Exponent

Exp.   exp   E     $2^{E}$
0      000   -2    ¼            Denormalized
1      001   -2    ¼            Normalized
2      010   -1    ½            Normalized
3      011   0     1            Normalized
4      100   1     2            Normalized
5      101   2     4            Normalized
6      110   3     8            Normalized
7      111   n/a   Inf or NaN

Dynamic Range of Values
s exp frac E value
0 000 00 -2 0
0 000 01 -2 1/4*1/4=1/16 smallest denormalized
0 000 10 -2 2/4*1/4=2/16
0 000 11 -2 3/4*1/4=3/16 largest denormalized
0 001 00 -2 4/4*1/4=4/16=1/4=0.25 smallest normalized
0 001 01 -2 5/4*1/4=5/16
0 001 10 -2 6/4*1/4=6/16
0 001 11 -2 7/4*1/4=7/16
0 010 00 -1 4/4*2/4=8/16=1/2=0.5
0 010 01 -1 5/4*2/4=10/16
0 010 10 -1 6/4*2/4=12/16=0.75
0 010 11 -1 7/4*2/4=14/16
Dynamic Range of Values
s exp frac E value
0 011 00 0 4/4*4/4=16/16=1
0 011 01 0 5/4*4/4=20/16=1.25
0 011 10 0 6/4*4/4=24/16=1.5
0 011 11 0 7/4*4/4=28/16=1.75
0 100 00 1 4/4*8/4=32/16=2
0 100 01 1 5/4*8/4=40/16=2.5
0 100 10 1 6/4*8/4=48/16=3
0 100 11 1 7/4*8/4=56/16=3.5
0 101 00 2 4/4*16/4=64/16=4
0 101 01 2 5/4*16/4=80/16=5
0 101 10 2 6/4*16/4=96/16=6
0 101 11 2 7/4*16/4=112/16=7
Dynamic Range of Values
s exp frac E value
0 110 00 3 4/4*32/4=128/16=8
0 110 01 3 5/4*32/4=160/16=10
0 110 10 3 6/4*32/4=192/16=12
0 110 11 3 7/4*32/4=224/16=14 largest normalized
0 111 00 +∞
0 111 01 NaN
0 111 10 NaN
0 111 11 NaN
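The whole table of 6-bit values can be regenerated with a few lines of Python. This is a sketch under the rules stated for the 6-bit format (bias 3, denormalized exponent –2, all-ones exponent for infinity and NaN); `decode6` is a name made up for the example.

```python
import math

def decode6(bits):
    """Decode a 6-bit pattern 'SEEEFF' (exponent bias 3)."""
    assert len(bits) == 6
    s, e, f = int(bits[0], 2), int(bits[1:4], 2), int(bits[4:], 2)
    sign = -1.0 if s else 1.0
    if e == 7:                                   # all ones: infinity or NaN
        return sign * math.inf if f == 0 else math.nan
    if e == 0:                                   # zero or denormalized: 0.F x 2^-2
        return sign * (f / 4) * 2.0**-2
    return sign * (1 + f / 4) * 2.0**(e - 3)     # normalized: 1.F x 2^(E - 3)

for n in range(32):                              # all patterns with sign bit 0
    b = f"{n:06b}"
    print(b[0], b[1:4], b[4:], decode6(b))
```

The spacing between consecutive values doubles each time the exponent field increases by one, which is visible in the printed list.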

FP Behavior
Programmers must be aware of accuracy limitations!
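Two common symptoms of these limitations are shown in the Python sketch below, assuming `float` is an IEEE 754 double: decimal fractions such as 0.1 are not exactly representable in binary, and a small addend can be lost entirely next to a large value. Comparing with a tolerance (`math.isclose`) is one usual workaround.

```python
import math

# 0.1, 0.2 and 0.3 have no exact binary representation
print(0.1 + 0.2 == 0.3)                # False
print(0.1 + 0.2)                       # 0.30000000000000004
print(math.isclose(0.1 + 0.2, 0.3))    # True: compare with a tolerance

# With a 53-bit significand, adding 1 to 10^16 changes nothing
print((1e16 + 1.0) - 1e16)             # 0.0
```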