
Carnegie Mellon

198:331 Introduction to Computer Organization
Lecture 4: Floating Point Data

Instructor: Michael A. Palis
[email protected]


Fixed Point Numbers

■ Properties
● Consists of an integer part + a fractional part
● Fixed number of digits to the left and right of the radix point
■ In decimal, the radix point separates integer and fraction:

  23.784₁₀   ("23" = integer part, "784" = fraction)

  23.784₁₀ = 2×10¹ + 3×10⁰ + 7×10⁻¹ + 8×10⁻² + 4×10⁻³

■ Similarly in binary:
  10.1011₂ = 1×2¹ + 0×2⁰ + 1×2⁻¹ + 0×2⁻² + 1×2⁻³ + 1×2⁻⁴
           = 2 + 0 + 0.5 + 0 + 0.125 + 0.0625
           = 2.6875₁₀
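The positional expansion above can be evaluated mechanically. A minimal sketch in Python (the helper name `fixed_to_decimal` is illustrative, not from the slides); `Fraction` keeps the arithmetic exact:

```python
from fractions import Fraction

def fixed_to_decimal(bits: str) -> Fraction:
    """Evaluate a binary fixed-point string like '10.1011' positionally."""
    int_part, _, frac_part = bits.partition(".")
    value = Fraction(int(int_part, 2) if int_part else 0)
    # Each fractional bit i (1-based) contributes bit × 2^-i.
    for i, bit in enumerate(frac_part, start=1):
        value += int(bit) * Fraction(1, 2**i)
    return value

print(fixed_to_decimal("10.1011"))  # 43/16, i.e. 2.6875
```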


Converting Decimal Fraction to Binary Fraction

■ Algorithm illustration: 0.6875₁₀ = ?₂

                          int part   frac part
  0.6875 × 2 = 1.375         1         0.375
  0.375  × 2 = 0.75          0         0.75
  0.75   × 2 = 1.5           1         0.5
  0.5    × 2 = 1.0           1         0

  Read off the integer parts in order; stop when the fractional part = 0.

  Therefore, 0.6875₁₀ = 0.1011₂
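The repeated-doubling algorithm above translates directly into code. A sketch (the name `frac_to_binary` is illustrative); exact `Fraction` arithmetic is used so that inputs like 1/10 behave as in the hand calculation rather than as already-rounded doubles:

```python
from fractions import Fraction

def frac_to_binary(x: Fraction, max_bits: int = 24) -> str:
    """Convert a fraction in [0, 1) to binary by repeated doubling."""
    bits = []
    while x != 0 and len(bits) < max_bits:
        x *= 2
        bit = int(x)          # integer part is the next binary digit
        bits.append(str(bit))
        x -= bit
    return "".join(bits)

print(frac_to_binary(Fraction(11, 16)))  # '1011' (0.6875 = 0.1011 in binary)
print(frac_to_binary(Fraction(1, 10)))   # non-terminating: 0011 repeats
```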


Converting Decimal Fraction to Binary Fraction (Cont.)

■ Converting from decimal to binary may result in a non-terminating fraction

■ Example: 0.1₁₀ = ?₂

                     int part   frac part
  0.1 × 2 = 0.2         0         0.2
  0.2 × 2 = 0.4         0         0.4
  0.4 × 2 = 0.8         0         0.8  ─┐
  0.8 × 2 = 1.6         1         0.6   │
  0.6 × 2 = 1.2         1         0.2   │ repeats
  0.2 × 2 = 0.4         0         0.4  ─┘
  0.4 × 2 = 0.8         0         0.8
  0.8 × 2 = 1.6         1         0.6
  0.6 × 2 = 1.2         1         0.2
  ...

  0.1₁₀ ≈ 0.00011₂ (the sequence 0011 repeats forever)

■ May need to round to the desired number of fractional places


Rounding
■ Representing numbers using a fixed number of bits limits:
● Range – the span of numbers that can be represented
● Precision – the difference between successive values that can be represented
■ Precision
● Associated with the number of fractional bits allowed by the computer representation
● If a number has more fractional bits than are allowed, it must be rounded to the required precision
■ Example: How should 10.1011₂ be rounded to 2 fractional bits?
● 10.10₂?
● Or 10.11₂?


Rounding Modes
■ Let a be the number and ā be its rounded value
1. Round-toward-zero (aka truncation)
► Round a to the nearest number ā of the desired precision such that |ā| ≤ |a|
2. Round-down (aka round-toward-negative-infinity)
► Round a to the nearest number ā of the desired precision such that ā ≤ a
3. Round-up (aka round-toward-positive-infinity)
► Round a to the nearest number ā of the desired precision such that ā ≥ a
4. Round-to-even (aka round-to-nearest)
► Round a to the number ā of the desired precision such that |a − ā| is minimized
► If there is a tie, choose the ā whose least significant digit/bit is even
► Default mode used in the IEEE Floating Point Format, which we'll discuss shortly
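The four modes map directly onto constants in Python's decimal module, which can reproduce the decimal rows of the rounding example on the next slide (note that decimal's ROUND_DOWN means round toward zero, and ROUND_FLOOR means round toward −∞):

```python
from decimal import Decimal, ROUND_DOWN, ROUND_FLOOR, ROUND_CEILING, ROUND_HALF_EVEN

two_places = Decimal("0.01")          # quantize target: 2 fractional digits

x = Decimal("1.4523")
print(x.quantize(two_places, rounding=ROUND_DOWN))       # 1.45 (toward zero)
print(x.quantize(two_places, rounding=ROUND_FLOOR))      # 1.45
print(x.quantize(two_places, rounding=ROUND_CEILING))    # 1.46
print(x.quantize(two_places, rounding=ROUND_HALF_EVEN))  # 1.45

y = Decimal("-2.1786")
print(y.quantize(two_places, rounding=ROUND_DOWN))       # -2.17 (toward zero)
print(y.quantize(two_places, rounding=ROUND_FLOOR))      # -2.18
```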


Rounding Modes
■ Example
● Assume precision is 2 fractional digits

                        Rounded Value
  Number        Round-toward-0   Round-down   Round-up   Round-to-even
  1.4523₁₀          1.45₁₀         1.45₁₀      1.46₁₀       1.45₁₀
  −2.1786₁₀        −2.17₁₀        −2.18₁₀     −2.17₁₀      −2.18₁₀
  10.10011₂        10.10₂         10.10₂      10.11₂       10.10₂
  −1.00110₂        −1.00₂         −1.01₂      −1.00₂       −1.01₂
  −10.11100₂      −10.11₂        −11.00₂     −10.11₂      −11.00₂
  1.10100₂          1.10₂          1.10₂       1.11₂        1.10₂


Fixed Point Arithmetic

■ Adapt integer arithmetic algorithms
● Will illustrate for unsigned fixed point only

■ Addition and Subtraction
● Similar to integer addition/subtraction
● Just align the radix points

■ Example: 100.101₂ + 10.1101₂

    1 0 0 . 1 0 1      = 4.625
  +   1 0 . 1 1 0 1    = 2.8125
    1 1 1 . 0 1 1 1    = 7.4375
          ↑ align binary points
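Aligning radix points is the same as scaling both operands to a common number of fractional bits and adding as integers. A sketch of the idea (the helper name `fx_add` is illustrative):

```python
def fx_add(a_bits: str, b_bits: str) -> str:
    """Add two unsigned binary fixed-point numbers by aligning radix points."""
    def parse(s):
        i, _, f = s.partition(".")
        return i, f
    ai, af = parse(a_bits)
    bi, bf = parse(b_bits)
    n = max(len(af), len(bf))              # common number of fractional bits
    a = int(ai + af.ljust(n, "0"), 2)      # pad and scale by 2^n
    b = int(bi + bf.ljust(n, "0"), 2)
    s = format(a + b, "b").rjust(n + 1, "0")
    return s[:-n] + "." + s[-n:]           # re-insert the radix point

print(fx_add("100.101", "10.1101"))  # 111.0111 (4.625 + 2.8125 = 7.4375)
```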


Fixed Point Arithmetic

■ Multiplication
1. Ignore the radix points; multiply as integers
2. Insert the radix point of the product: no. of fractional places = sum of the no. of fractional places of the two operands

■ Example: 11.01₂ × 0.101₂

        1 1 . 0 1      = 3.25
    ×    0 . 1 0 1     = 0.625
        1 1 0 1
      0 0 0 0
    1 1 0 1
  0 0 0 0
  1 0 . 0 0 0 0 1      = 2.03125
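The two-step rule (multiply as integers, then place the radix point at the sum of the fractional widths) can be sketched the same way (the helper name `fx_mul` is illustrative):

```python
def fx_mul(a_bits: str, b_bits: str) -> str:
    """Multiply as integers; the product has fa + fb fractional bits."""
    def parse(s):
        i, _, f = s.partition(".")
        return int((i + f) or "0", 2), len(f)
    a, fa = parse(a_bits)
    b, fb = parse(b_bits)
    n = fa + fb                            # fractional places of the product
    p = format(a * b, "b").rjust(n + 1, "0")
    return p[:-n] + "." + p[-n:]

print(fx_mul("11.01", "0.101"))  # 10.00001 (3.25 × 0.625 = 2.03125)
```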


Fixed Point Arithmetic

■ Division
1. Shift the radix point of the divisor right until the divisor is a whole integer
2. Shift the radix point of the dividend right by the same number of positions
3. Divide as in integer division
4. The radix point of the quotient is in the same position as that of the dividend

■ Example: 10.1101₂ ÷ 1.01₂ (2.8125₁₀ ÷ 1.25₁₀)
  ➙ 1011.01₂ ÷ 101₂ (11.25₁₀ ÷ 5₁₀)

          1 0 . 0 1    = 2.25₁₀
  101 ) 1 0 1 1 . 0 1
       −1 0 1
          0 1 0 1
         −  1 0 1
                0

● May result in a quotient with a non-terminating fractional part ➙ round to the desired number of fractional places
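Shifting both radix points by the divisor's fractional width fb leaves the dividend with fa − fb fractional bits, which is where the quotient's radix point lands. A sketch using exact rational arithmetic (the name `fx_div` is illustrative):

```python
from fractions import Fraction

def fx_div(a_bits: str, b_bits: str) -> Fraction:
    """Shift both radix points right until the divisor is an integer, then divide."""
    def parse(s):
        i, _, f = s.partition(".")
        return int(i + f, 2), len(f)
    a, fa = parse(a_bits)
    b, fb = parse(b_bits)
    # After shifting both operands fb places, the dividend keeps fa - fb
    # fractional bits, so the exact quotient is (a/b) / 2^(fa - fb).
    return Fraction(a, b) / 2**(fa - fb)

print(fx_div("10.1101", "1.01"))  # 9/4, i.e. 2.25
```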

Floating Point Numbers

■ Fixed-point numbers can also be written in scientific notation – also referred to as floating point format
● Decimal:
  975.673 = 9.75673 × 10²
  −0.000324 = −3.24 × 10⁻⁴
● Binary (significand × 2^exponent):
  1101.011 = 1.101011 × 2³
  −0.010111 = −1.0111 × 2⁻²

■ The significand (aka mantissa) is normalized: exactly one digit/bit to the left of the decimal/binary point
■ Allows a more compact representation of real numbers than the fixed-point format


IEEE Floating Point Format

■ IEEE Standard 754
● Established in 1985 as a uniform standard for floating point arithmetic
► Before that, many idiosyncratic formats
● Supported by all major CPUs

■ Driven by numerical concerns
● Nice standards for rounding, overflow, underflow
● Hard to make fast in hardware
► Numerical analysts predominated over hardware designers in defining the standard


Floating Point Representation

■ Numerical Form:
  ± M × 2^E
● Sign ± indicates whether the number is negative or positive
● Significand (aka mantissa) M is normalized – i.e., of the form M = 1.frac
● Exponent E weights the value by a power of two

■ Encoding
● MSB s encodes the sign: 0 if + and 1 if −
● The exp field encodes E (but is not equal to E)
● The frac field encodes M (but is not equal to M)

  | s | exp | frac |


Precision Options
■ Single precision: 32 bits
  | s (1) | exp (8 bits) | frac (23 bits) |

■ Double precision: 64 bits
  | s (1) | exp (11 bits) | frac (52 bits) |

■ Extended precision: 80 bits (Intel only)
  | s (1) | exp (15 bits) | frac (64 bits) |

IEEE Single Precision FP Format

■ Normalized binary FP number ➙ Single precision FP format

  ± 1.frac × 2^E

  | s (1) | exp (8 bits) | frac (23 bits) |

  Field | # Bits | Value                                          | Remarks
  s     | 1      | 0 if +; 1 if −                                 |
  exp   | 8      | E + bias, where bias = 2⁸⁻¹ − 1 = 2⁷ − 1 = 127 | called the biased exponent
  frac  | 23     | frac of the significand                        | the '1' to the left of the binary point is not stored (hidden bit)


IEEE Single Precision FP Format

■ Example: −54.625₁₀
■ Steps
1. Convert to binary FP
   −54.625₁₀ = −110110.101₂
2. Normalize the binary FP
   −110110.101 = −1.10110101 × 2⁵
3. Encode into IEEE single precision FP format
   s = 1
   frac = 10110101000000000000000 (pad with zeros to make 23 bits)
   exp = 5 + 127 = 132 = 10000100₂
4. Answer: 1 10000100 10110101000000000000000
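The hand-worked encoding can be checked by reinterpreting the bits of an actual IEEE single precision value; Python's struct module packs a float as binary32 and the result splits into exactly the fields computed above:

```python
import struct

# Pack -54.625 as IEEE single precision, then read the raw 32 bits back.
bits = struct.unpack(">I", struct.pack(">f", -54.625))[0]
s = bits >> 31            # sign bit
exp = (bits >> 23) & 0xFF # 8-bit biased exponent
frac = bits & 0x7FFFFF    # 23-bit fraction field

print(s, format(exp, "08b"), format(frac, "023b"))
# 1 10000100 10110101000000000000000
print(exp - 127)          # 5: the true exponent E
```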


IEEE Double Precision FP Format

■ Normalized binary FP number ➙ Double precision FP format

  ± 1.frac × 2^E

  | s (1) | exp (11 bits) | frac (52 bits) |

  Field | # Bits | Value                                              | Remarks
  s     | 1      | 0 if +; 1 if −                                     |
  exp   | 11     | E + bias, where bias = 2¹¹⁻¹ − 1 = 2¹⁰ − 1 = 1023  | called the biased exponent
  frac  | 52     | frac of the significand                            | the '1' to the left of the binary point is not stored (hidden bit)


Normalized Values                                  ± M × 2^E

■ Condition: exp ≠ 000…0 and exp ≠ 111…1

■ Exponent coded as a biased value: E = exp − bias
● In general, bias = 2ᵏ⁻¹ − 1, where k is the number of exponent bits
● Single precision: bias = 127 ➙ E: −126 … 127
● Double precision: bias = 1023 ➙ E: −1022 … 1023

■ Significand coded with an implied leading 1: M = 1.xxx…x₂
● xxx…x: bits of the frac field
● Minimum when frac = 000…0 (M = 1.0)
● Maximum when frac = 111…1 (M = 2.0 − ε)
● Get the extra leading bit for "free"

Why Bias the Exponent?

■ exp = E + bias
● In general, bias = 2ᵏ⁻¹ − 1, where k is the number of exponent bits
► Single precision: bias = 127, E: −126 … 127 ⬌ exp: 1 … 254
► Double precision: bias = 1023, E: −1022 … 1023 ⬌ exp: 1 … 2046
● The biased exponent is positive – it can be treated as an unsigned integer

■ Comparing unsigned integers is easy:
● Compare bitwise starting from the left (msb)           10100111
● Stop at the bit position where the numbers differ      10111010 ← larger
● The number with a '1' in that position is larger

■ Two normalized numbers in IEEE FP format with the same sign can be compared with the same algorithm:

  0 10100111 01011100000000000000000 < 0 10111011 01101100000000000000000
          +1.010111 × 2⁴⁰                       +1.011011 × 2⁶⁰
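This ordering property can be demonstrated directly: for positive normalized values, comparing the raw encodings as unsigned integers gives the same answer as comparing the floats (the helper name `bits_of` is illustrative):

```python
import struct

def bits_of(x: float) -> int:
    """Reinterpret a single precision float as an unsigned 32-bit integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

# The two values from the slide: 1.010111₂ × 2^40 and 1.011011₂ × 2^60.
a = 1.359375 * 2.0**40
b = 1.421875 * 2.0**60

# Same-sign normalized numbers: float order matches unsigned bit-pattern order.
print(bits_of(a) < bits_of(b), a < b)  # True True
```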



Denormalized Values                                ± M × 2^E

■ Condition: exp = 000…0 (all zeros)
● Case 1: exp = 000…0, frac = 000…0
► Represents the zero value (note: two distinct values, +0 and −0)
● Case 2: exp = 000…0, frac ≠ 000…0
► Represents a binary number of the form ± 0.frac × 2^(1−bias)
► Significand M < 1 (the bit to the left of the binary point is 0)
► Exponent E = 1 − bias (not E = 0 − bias)
► Allows representation of numbers smaller than the least positive/negative normalized number
▪ smaller than ± 1.00…0 × 2⁻¹²⁶ for single precision
▪ smaller than ± 1.00…0 × 2⁻¹⁰²² for double precision
► These are the non-zero numbers that are closest to zero
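For double precision, the smallest positive denormalized value is 2⁻⁵² × 2⁻¹⁰²² = 2⁻¹⁰⁷⁴, whose encoding is all zeros except the last frac bit. This can be checked in Python (math.ulp requires Python 3.9+):

```python
import math
import struct

# Smallest positive double precision denormal: 2^-52 × 2^-1022 = 2^-1074.
tiny = 2.0**-1074
print(tiny)                            # 5e-324
print(struct.pack(">d", tiny).hex())   # 0000000000000001: frac = 00...01
print(math.ulp(0.0) == tiny)           # True: the spacing at zero is this denormal
```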


Special Values                                     ± M × 2^E

■ Condition: exp = 111…1 (all ones)

■ Cases
● exp = 111…1, frac = 000…0
► Represents the value ∞ (infinity)
► Result of an operation that overflows
► Both positive and negative
► E.g., 1.0/0.0 = −1.0/−0.0 = +∞, 1.0/−0.0 = −∞
● exp = 111…1, frac ≠ 000…0
► Not-a-Number (NaN)
► Represents the case when no numeric value can be determined
► E.g., sqrt(−1), ∞ − ∞, ∞ × 0
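Python floats are IEEE doubles, so ∞ and NaN propagation can be observed directly (note one difference from the C expressions above: Python raises ZeroDivisionError for 1.0/0.0 rather than returning ∞, so overflow is shown by multiplication instead):

```python
import math

inf = float("inf")
print(1e308 * 10)            # inf: overflow saturates to +infinity
print(-1.0 / inf)            # -0.0: the sign of zero is preserved
print(inf - inf)             # nan: no numeric value can be determined
print(math.isnan(inf * 0))   # True
```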


Example Values

  Description                exp     frac    Numeric Value ({single, double})
■ Zero                       00…00   00…00   0.0
■ Smallest Pos. Denorm.      00…00   00…01   2⁻{23,52} × 2⁻{126,1022}
                                             ● single ≈ 1.4 × 10⁻⁴⁵, double ≈ 4.9 × 10⁻³²⁴
■ Largest Denormalized       00…00   11…11   (1.0 − ε) × 2⁻{126,1022}
                                             ● single ≈ 1.18 × 10⁻³⁸, double ≈ 2.2 × 10⁻³⁰⁸
■ Smallest Pos. Normalized   00…01   00…00   1.0 × 2⁻{126,1022}
                                             ● just larger than the largest denormalized
■ One                        01…11   00…00   1.0
■ Largest Normalized         11…10   11…11   (2.0 − ε) × 2^{127,1023}
                                             ● single ≈ 3.4 × 10³⁸, double ≈ 1.8 × 10³⁰⁸

Example Single Precision Values

  Description*                     Bit Pattern (Hex)   Decimal Value
  0                                00000000            0.0
  Smallest positive denormalized   00000001            1.40129846e-45
  Largest positive denormalized    007fffff            1.17549421e-38
  Smallest positive normalized     00800000            1.17549435e-38
  1                                3f800000            1.0
  Largest positive normalized      7f7fffff            3.40282347e+38
  Positive infinity                7f800000            +∞
  Not-a-Number                     7fc00000            NaN

  * Positive values only; for negative values change the sign bit to 1.


Example Double Precision Values

  Description*                     Bit Pattern (Hex)    Decimal Value
  0                                00000000 00000000    0.0
  Smallest positive denormalized   00000000 00000001    4.9406564584124654e-324
  Largest positive denormalized    000fffff ffffffff    2.2250738585072009e-308
  Smallest positive normalized     00100000 00000000    2.2250738585072014e-308
  1                                3ff00000 00000000    1.0
  Largest positive normalized      7fefffff ffffffff    1.7976931348623157e+308
  Positive infinity                7ff00000 00000000    +∞
  Not-a-Number                     7ff80000 00000000    NaN

  * Positive values only; for negative values change the sign bit to 1.


Visualization: Floating Point Encodings

  [Number line:  NaN | −∞ | −Normalized | −Denorm | −0 +0 | +Denorm | +Normalized | +∞ | NaN]


Non-Uniform Distribution
■ FP numbers are not uniformly distributed
● The spacing between successive FP numbers is magnified by a factor of 2 at each power of 2
● Something you need to understand if you do numerical calculations
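The doubling of spacing at each power of 2 is directly observable with math.ulp (Python 3.9+), which returns the gap between a double and the next representable value:

```python
import math

# Spacing between adjacent doubles doubles at each power of two.
for x in [1.0, 2.0, 4.0, 8.0]:
    print(x, math.ulp(x))
# ulp(1.0) = 2^-52, ulp(2.0) = 2^-51, ulp(4.0) = 2^-50, ulp(8.0) = 2^-49
```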


Floating Point Arithmetic

■ Addition and Subtraction
1. Make the exponents equal
2. Add/subtract the significands
3. Normalize the result

■ Why?
● Let A = a × 2^e1 and B = b × 2^e2, and suppose e1 < e2
● Then A can be rewritten as A = a × 2^e2 × 2^−(e2−e1)
● Therefore,

  A + B = ( a × 2^−(e2−e1) + b ) × 2^e2

  i.e., shift a right of the binary point by (e2−e1) places; then add to b


Floating Point Arithmetic

■ Addition Example: IEEE single precision format
      s   exp        frac
      0 01111101 00000000000000000000000      1.0 × 2⁻² = 0.25₁₀
    +
      0 10000101 10010000000000000000000      1.1001 × 2⁶ = 100.0₁₀

● Don't forget the hidden bit!
● To simplify the illustration, let's show the hidden bit (between exp and frac):

      0 01111101 1 00000000000000000000000    1.0 × 2⁻² = 0.25₁₀
    +
      0 10000101 1 10010000000000000000000    1.1001 × 2⁶ = 100.0₁₀
                 ↑ hidden bit


Floating Point Arithmetic

■ Addition Example, Cont.
      0 01111101 1 00000000000000000000000    1.0 × 2⁻² = 0.25₁₀
    +
      0 10000101 1 10010000000000000000000    1.1001 × 2⁶ = 100.0₁₀

1. Make the exponents equal
► To leave the value unchanged:
– Shift the significand left by 1 bit → must decrease the exponent by 1
– Shift the significand right by 1 bit → must increase the exponent by 1
► Increase the smaller exponent to equal the larger exponent. Why?
– This shifts the significand right, losing only least significant bits
► Therefore, increase the exponent of 0.25₁₀, shifting its significand right by 10000101 − 01111101 = 00001000 = 8₁₀ places


Floating Point Arithmetic

■ Addition Example, Cont.
● Shift the significand of 0.25₁₀ right by 8 places (note that the hidden bit is shifted into the msb)

      0 01111101 1 00000000000000000000000    original value
      0 01111110 0 10000000000000000000000    shift right 1 place
      0 01111111 0 01000000000000000000000    shift right 2 places
      0 10000000 0 00100000000000000000000    shift right 3 places
      0 10000001 0 00010000000000000000000    shift right 4 places
      0 10000010 0 00001000000000000000000    shift right 5 places
      0 10000011 0 00000100000000000000000    shift right 6 places
      0 10000100 0 00000010000000000000000    shift right 7 places
      0 10000101 0 00000001000000000000000    shift right 8 places


Floating Point Arithmetic

■ Addition Example, Cont.
2. Add the significands

      0 10000101 0 00000001000000000000000    1.0 × 2⁻² = 0.25₁₀
    +
      0 10000101 1 10010000000000000000000    1.1001 × 2⁶ = 100.0₁₀

      0 10000101 1 10010001000000000000000    1.10010001 × 2⁶ = 100.25₁₀

3. Normalize the result (already normalized; hide the hidden bit)

      0 10000101 10010001000000000000000      1.10010001 × 2⁶ = 100.25₁₀
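The final encoding can be cross-checked against the hardware's own addition: packing 0.25 + 100.0 as a single precision float yields exactly the bit pattern derived above (the helper name `f32_bits` is illustrative):

```python
import struct

def f32_bits(x: float) -> str:
    """IEEE single precision bit pattern of x, as a 32-character string."""
    b = struct.unpack(">I", struct.pack(">f", x))[0]
    return format(b, "032b")

s = f32_bits(0.25 + 100.0)
print(s[0], s[1:9], s[9:])
# 0 10000101 10010001000000000000000  (1.10010001 × 2^6 = 100.25)
```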


Floating Point Arithmetic

■ Multiplication
1. Add the exponents
2. Multiply the significands
3. Normalize the result

■ Why?
● Let A = a × 2^e1 and B = b × 2^e2
● Then,

  A × B = ( a × b ) × 2^(e1+e2)


Floating Point Arithmetic

■ Multiplication Example: IEEE single precision format
      s   exp        frac
      0 01111100 01000000000000000000000      1.01 × 2⁻³ = 0.15625₁₀
    ×
      1 10000011 11000000000000000000000      −1.11 × 2⁴ = −28.0₁₀

● As before, let's show the hidden bit:

      0 01111100 1 01000000000000000000000    1.01 × 2⁻³ = 0.15625₁₀
    ×
      1 10000011 1 11000000000000000000000    −1.11 × 2⁴ = −28.0₁₀
                 ↑ hidden bit


Floating Point Arithmetic

■ Multiplication Example, Cont.
1. Compute the biased exponent of the result as:
   exp_result = exp1 + exp2 − bias
   Why?
► As shown earlier, if A = a × 2^e1 and B = b × 2^e2, then
   A × B = ( a × b ) × 2^(e1+e2)
► Here, e1 and e2 are the true exponents, and are related to the biased exponents exp1 and exp2 as follows:
   exp1 = e1 + bias
   exp2 = e2 + bias
► This implies that exp1 + exp2 = (e1 + e2) + 2 × bias
► But exp_result = (e1 + e2) + bias
► Therefore, exp_result = exp1 + exp2 − bias


Floating Point Arithmetic

■ Multiplication Example, Cont.
1. Compute the biased exponent of the result as:
   exp_result = exp1 + exp2 − bias

      0 01111100 1 01000000000000000000000    1.01 × 2⁻³ = 0.15625₁₀   (exp1 = 01111100)
    ×
      1 10000011 1 11000000000000000000000    −1.11 × 2⁴ = −28.0₁₀     (exp2 = 10000011)

► Here, bias = 127 (single precision)
► exp_result = exp1 + exp2 − 127
             = 01111100₂ + 10000011₂ − 01111111₂
             = 10000000₂


Floating Point Arithmetic

■ Multiplication Example, Cont.
2. Multiply the significands

      0 01111100 1 01000000000000000000000    1.01 × 2⁻³ = 0.15625₁₀
    ×
      1 10000011 1 11000000000000000000000    −1.11 × 2⁴ = −28.0₁₀

► significand_result = 1.01 × 1.11 = 10.0011
► sign_result = 1 (Why? The operands have different signs.)
► exp_result = 10000000 (from the previous slide)

3. Normalize the result
► shift significand_result right by 1 bit ➙ 1.00011
► increase exp_result by 1 ➙ 10000001
► hide the hidden bit

      1 10000001 00011000000000000000000      −1.00011 × 2² = −4.375₁₀
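As with addition, the worked result agrees with the hardware: packing 0.15625 × −28.0 as a single precision float reproduces the sign, exponent, and frac fields derived above (the helper name `f32_bits` is illustrative):

```python
import struct

def f32_bits(x: float) -> str:
    """IEEE single precision bit pattern of x, as a 32-character string."""
    b = struct.unpack(">I", struct.pack(">f", x))[0]
    return format(b, "032b")

p = 0.15625 * -28.0
print(p)  # -4.375
s = f32_bits(p)
print(s[0], s[1:9], s[9:])
# 1 10000001 00011000000000000000000  (-1.00011 × 2^2)
```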



Floating Point Arithmetic

■ Division
1. Subtract the exponents
2. Divide the significands
3. Normalize the result

■ Why?
● Let A = a × 2^e1 and B = b × 2^e2
● Then,

  A / B = ( a / b ) × 2^(e1−e2)


Floating Point Arithmetic

■ Division Example: IEEE single precision format
      s   exp        frac
      0 10000110 00011000000000000000000      1.00011 × 2⁷ = 140.0₁₀
    ÷
      0 01111101 11000000000000000000000      1.11 × 2⁻² = 0.4375₁₀

● As before, let's show the hidden bit:

      0 10000110 1 00011000000000000000000    1.00011 × 2⁷ = 140.0₁₀
    ÷
      0 01111101 1 11000000000000000000000    1.11 × 2⁻² = 0.4375₁₀
                 ↑ hidden bit


Floating Point Arithmetic

■ Division Example, Cont.
1. Compute the biased exponent of the result as:
   exp_result = exp1 − exp2 + bias
   Why?
► As shown earlier, if A = a × 2^e1 and B = b × 2^e2, then
   A / B = ( a / b ) × 2^(e1−e2)
► Here, e1 and e2 are the true exponents, and are related to the biased exponents exp1 and exp2 as follows:
   exp1 = e1 + bias
   exp2 = e2 + bias
► This implies that exp1 − exp2 = (e1 − e2)
► But exp_result = (e1 − e2) + bias
► Therefore, exp_result = exp1 − exp2 + bias


Floating Point Arithmetic

■ Division Example, Cont.
1. Compute the biased exponent of the result as:
   exp_result = exp1 − exp2 + bias

      0 10000110 1 00011000000000000000000    1.00011 × 2⁷ = 140.0₁₀   (exp1 = 10000110)
    ÷
      0 01111101 1 11000000000000000000000    1.11 × 2⁻² = 0.4375₁₀    (exp2 = 01111101)

► Here, bias = 127 (single precision)
► exp_result = exp1 − exp2 + 127
             = 10000110₂ − 01111101₂ + 01111111₂
             = 10001000₂


Floating Point Arithmetic

■ Division Example, Cont.
2. Divide the significands

      0 10000110 1 00011000000000000000000    1.00011 × 2⁷ = 140.0₁₀
    ÷
      0 01111101 1 11000000000000000000000    1.11 × 2⁻² = 0.4375₁₀

► significand_result = 1.00011 ÷ 1.11 = 0.101
► sign_result = 0
► exp_result = 10001000 (from the previous slide)

3. Normalize the result
► shift significand_result left by 1 bit ➙ 1.01
► decrease exp_result by 1 ➙ 10000111
► hide the hidden bit

      0 10000111 01000000000000000000000      1.01 × 2⁸ = 320₁₀
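The division result can likewise be cross-checked: 140.0 ÷ 0.4375 computed in IEEE arithmetic gives exactly 320.0, with the fields derived above (the helper name `f32_bits` is illustrative):

```python
import struct

def f32_bits(x: float) -> str:
    """IEEE single precision bit pattern of x, as a 32-character string."""
    b = struct.unpack(">I", struct.pack(">f", x))[0]
    return format(b, "032b")

q = 140.0 / 0.4375
print(q)  # 320.0
s = f32_bits(q)
print(s[0], s[1:9], s[9:])
# 0 10000111 01000000000000000000000  (1.01 × 2^8)
```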



Floating Point Arithmetic

■ Things to watch out for
● Overflow in the exp field when adding exponents in FP multiply
► E.g., 2⁸⁰ × 2⁵⁰ = 2¹³⁰ ≥ 2¹²⁸
► The largest single precision value is just below 2¹²⁸ (largest true exponent = 127)
● Underflow in the exp field when subtracting exponents in FP divide
► E.g., 2⁻⁸⁰ ÷ 2⁷⁰ = 2⁻¹⁵⁰ < 2⁻¹⁴⁹
► −149 = most negative true exponent for the single precision FP format (including denormalized numbers)
● Rounding to fit the frac field when significands are added or multiplied
► IEEE FP format uses round-to-even by default
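Both failure modes can be demonstrated by simulating single precision in Python (the helper name `to_f32` is illustrative; note that CPython's struct raises OverflowError rather than returning ∞ when a value is too large for the 'f' format):

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (a C double) to single precision and back."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# Overflow: 2^80 × 2^50 = 2^130 exceeds the largest single precision value (< 2^128).
try:
    struct.pack(">f", 2.0**80 * 2.0**50)
    overflowed = False
except OverflowError:
    overflowed = True
print("overflowed:", overflowed)

# Underflow: 2^-80 ÷ 2^70 = 2^-150 is below the smallest single denormal (2^-149),
# so rounding to single precision flushes it to zero.
print(to_f32(2.0**-80 / 2.0**70))  # 0.0
```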


Floating Point in C
■ C Guarantees Two Levels
● float    single precision
● double   double precision

■ Conversions/Casting
● Casting between int, float, and double changes the bit representation
● double/float ➙ int
► Truncates the fractional part
► Like rounding toward zero
► Not defined when out of range or NaN: generally sets to TMin
● int ➙ double
► Exact conversion, as long as the int has ≤ 53-bit word size
● int ➙ float
► Will round according to the rounding mode
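Since Python's float is a C double, the C conversion behaviors above can be mimicked (single precision simulated via struct; the helper name `to_f32` is illustrative):

```python
import struct

def to_f32(x: float) -> float:
    """Round a double to single precision, like a C (float) cast."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# double -> int truncates toward zero, like C's (int)d.
print(int(-2.9), int(2.9))  # -2 2

# int -> float may round: 2^24 + 1 does not fit in a 23-bit frac field.
n = 2**24 + 1
print(to_f32(float(n)) == float(2**24))  # True: rounded to the even neighbor

# int -> double is exact up to 53 significant bits.
print(float(2**53) == 2**53, float(2**53 + 1) == 2**53 + 1)  # True False
```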

The End
