Lecture 4 - Floating Point Data
198:331 Introduction to Computer Organization
Instructor:
Michael A. Palis
[email protected]
Carnegie Mellon
198:331 Intro to Computer Organization Lecture 4
23.784₁₀
integer part (23) . fraction part (784)
Rounding
Representing numbers using a fixed number of bits limits:
● Range – span of numbers that can be represented
● Precision – difference between successive values that can be
represented
Precision
● Associated with the number of fractional bits allowed by the
computer representation
● If number has more fractional bits than is allowed, it must be
rounded to the required precision
Example: How should 10.1011₂ be rounded to 2 fractional bits?
● 10.10₂?
● Or 10.11₂?
Rounding Modes
Let a be the number and ā be its rounded value
1. Round-toward-zero (aka truncation)
► Round a to nearest number ā of desired precision such that |ā| ≤ |a|
2. Round-down (aka round-toward-negative-infinity)
► Round a to nearest number ā of desired precision such that ā ≤ a
3. Round-up (aka round-toward-positive-infinity)
► Round a to nearest number ā of desired precision such that ā ≥ a
4. Round-to-even (aka round-to-nearest)
► Round a to the number ā of desired precision such that |a – ā| is
minimized
► If there is a tie, choose the ā whose least significant digit/bit is even
► Default mode used in IEEE Floating Point Format, which we'll discuss later
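The four modes can be sketched on exact binary fractions. This is only an illustration: the function name `round_to_bits` and the mode strings are made up for this example, and Python's `fractions.Fraction` stands in for an exact binary value.

```python
from fractions import Fraction
import math

def round_to_bits(a: Fraction, bits: int, mode: str) -> Fraction:
    """Round a to a multiple of 2**-bits using the named rounding mode."""
    scaled = a * 2**bits                  # rounding now means picking an integer
    lo, hi = math.floor(scaled), math.ceil(scaled)
    if mode == "toward-zero":
        n = lo if scaled >= 0 else hi     # never increase the magnitude
    elif mode == "down":
        n = lo
    elif mode == "up":
        n = hi
    elif mode == "to-even":
        if scaled - lo < hi - scaled:
            n = lo
        elif scaled - lo > hi - scaled:
            n = hi
        else:                             # tie (or exact): take the even candidate
            n = lo if lo % 2 == 0 else hi
    else:
        raise ValueError(mode)
    return Fraction(n, 2**bits)

a = Fraction(0b101011, 2**4)              # 10.1011 in binary = 2.6875
for mode in ("toward-zero", "down", "up", "to-even"):
    print(mode, float(round_to_bits(a, 2, mode)))
```

For 10.1011₂ with 2 fractional bits, round-toward-zero and round-down give 2.5 (10.10₂), while round-up and round-to-even give 2.75 (10.11₂).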
Rounding Modes
Example
● Assume precision is 2 fractional digits
                Rounded Value
Number          Round-toward-0   Round-down   Round-up   Round-to-even
1.4523₁₀        1.45₁₀           1.45₁₀       1.46₁₀     1.45₁₀
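The table row can be checked with Python's `decimal` module, whose rounding constants correspond to the four modes. The helper `round_all` and the extra inputs (a negative number and two ties) are additions for illustration, not part of the slide.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_FLOOR, ROUND_CEILING, ROUND_HALF_EVEN

# decimal's name for each rounding mode used in the table
MODES = {
    "round-toward-0": ROUND_DOWN,
    "round-down": ROUND_FLOOR,
    "round-up": ROUND_CEILING,
    "round-to-even": ROUND_HALF_EVEN,
}

def round_all(text: str, places: str = "0.01") -> dict:
    """Round the decimal string to 2 fractional digits under every mode."""
    x = Decimal(text)
    return {name: str(x.quantize(Decimal(places), rounding=mode))
            for name, mode in MODES.items()}

print(round_all("1.4523"))    # the row from the table
print(round_all("-1.4523"))   # toward-0 and round-down now differ
print(round_all("1.455"))     # a tie: round-to-even picks the even last digit
```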
    1 0 0 . 1 0 1       = 4.625₁₀
+     1 0 . 1 1 0 1     = 2.8125₁₀
  -------------------
    1 1 1 . 0 1 1 1     = 7.4375₁₀
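The sum above can be verified with exact arithmetic. The helpers `parse_binary` and `to_binary` are hypothetical names introduced for this sketch.

```python
from fractions import Fraction

def parse_binary(text: str) -> Fraction:
    """Parse a string such as '100.101' as an exact binary fraction."""
    sign = -1 if text.startswith("-") else 1
    whole, _, frac = text.lstrip("+-").partition(".")
    return sign * Fraction(int(whole + frac, 2), 2 ** len(frac))

def to_binary(x: Fraction, frac_bits: int) -> str:
    """Format a non-negative x that fits exactly in frac_bits fractional bits."""
    scaled = x * 2**frac_bits
    assert scaled.denominator == 1 and scaled >= 0
    bits = format(int(scaled), "b").zfill(frac_bits + 1)
    return (bits[:-frac_bits] + "." + bits[-frac_bits:]) if frac_bits else bits

total = parse_binary("100.101") + parse_binary("10.1101")
print(float(total), to_binary(total, 4))
```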
            1 0.0 1   = 2.25₁₀
1 0 1 ) 1 0 1 1.0 1
      - 1 0 1
            0 1 0 1
            - 1 0 1
                  0
● May result in a quotient with non-terminating fractional part
➙ round to desired number of fractional places
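A minimal sketch of binary long division that produces a fixed number of fractional quotient bits and reports whether the quotient terminated. The function name `quotient_bits` is an assumption; operands are integers, so fractional operands must first be scaled by the same power of two (1011.01₂ / 101₂ becomes 101101₂ / 10100₂). Extra bits are simply dropped here, which is truncation; any other rounding mode could be applied instead.

```python
def quotient_bits(dividend: int, divisor: int, frac_bits: int):
    """Long division: return (quotient string, True if the remainder is 0)."""
    whole, rem = divmod(dividend, divisor)
    bits = []
    for _ in range(frac_bits):
        rem *= 2                          # bring down a 0 after the binary point
        bit, rem = divmod(rem, divisor)
        bits.append(str(bit))
    return format(whole, "b") + "." + "".join(bits), rem == 0

print(quotient_bits(0b101101, 0b10100, 2))   # the slide's example: terminates
print(quotient_bits(1, 3, 6))                # 1/3 never terminates in binary
```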
1101.011₂ = 1.101011₂ × 2^3
−0.010111₂ = −1.0111₂ × 2^-2
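Normalization can be sketched with `math.frexp`, which returns a significand in [0.5, 1); shifting it by one position gives the 1.xxx form used above. The function name `normalize` is hypothetical, and the sketch assumes a nonzero, finite input.

```python
import math

def normalize(x: float):
    """Return (sign, M, E) with x == sign * M * 2**E and 1 <= M < 2."""
    m, e = math.frexp(abs(x))             # abs(x) == m * 2**e with 0.5 <= m < 1
    sign = -1 if x < 0 else 1
    return sign, m * 2, e - 1             # move the binary point one place right

print(normalize(13.375))      # 1101.011 in binary
print(normalize(-0.359375))   # -0.010111 in binary
```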
Encoding
● Numerical form: (−1)^s × M × 2^E, with significand M and exponent E
● MSB s encodes sign: 0 if + and 1 if −
● exp field encodes E (but is not equal to E)
● frac field encodes M (but is not equal to M)
Field layout: s | exp | frac
Precision Options
Single precision: 32 bits
  s        exp        frac
  1 bit    8 bits     23 bits
Double precision: 64 bits
  s        exp        frac
  1 bit    11 bits    52 bits
Extended precision: 80 bits (Intel only)
  s        exp        frac
  1 bit    15 bits    64 bits
± 1.frac × 2^E
  s        exp        frac
  1 bit    8 bits     23 bits
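The three fields of a single-precision value can be pulled apart with shifts and masks. This sketch uses Python's `struct` module to obtain the 32-bit pattern; the function name `float_fields` is made up for the example. For normalized values, IEEE 754 single precision stores exp = E + 127.

```python
import struct

def float_fields(x: float):
    """Split the single-precision encoding of x into (s, exp, frac)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31                        # 1 sign bit
    exp = (bits >> 23) & 0xFF             # 8 exponent bits
    frac = bits & 0x7FFFFF                # 23 fraction bits
    return s, exp, frac

s, exp, frac = float_fields(0.15625)      # 0.00101 in binary = 1.01 x 2^-3
print(s, format(exp, "08b"), format(frac, "023b"), "E =", exp - 127)
```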
2. Normalize binary FP
−110110.101₂ = −1.10110101₂ × 2^5
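Continuing from the normalized form, the remaining steps (bias the exponent, pad the fraction, pack the fields) might look like the sketch below. The function name `encode_single` is hypothetical; 127 is the single-precision bias, and −110110.101₂ equals −54.625₁₀, which is used to cross-check against `struct`.

```python
import struct

def encode_single(s: int, E: int, frac_bits: str) -> int:
    """Pack sign, unbiased exponent, and fraction bits into a 32-bit pattern."""
    exp = E + 127                              # biased exponent
    frac = int(frac_bits.ljust(23, "0"), 2)    # pad on the right to 23 bits
    return (s << 31) | (exp << 23) | frac

bits = encode_single(1, 5, "10110101")         # from -1.10110101 x 2^5
(check,) = struct.unpack(">I", struct.pack(">f", -54.625))
print(hex(bits), bits == check)
```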
± 1.frac × 2^E
  s        exp        frac
  1 bit    11 bits    52 bits
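Normalized Values
± M × 2^E
● Condition: exp ≠ 00…00 and exp ≠ 11…11
● E = exp − bias, where bias = {127, 1023} for {single, double}
● M = 1.frac (implied leading 1)
Denormalized Values
± M × 2^E
● Condition: exp = 00…00
● E = 1 − bias
● M = 0.frac (no implied leading 1)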
Special Values
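● Condition: exp = 11…11
● frac = 00…00: ±∞ (infinity)
● frac ≠ 00…00: NaN (Not-a-Number), used when no numeric value can be determined
► E.g., sqrt(–1), ∞ − ∞, ∞ × 0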
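The three categories are distinguished by the exp field alone, with frac splitting each edge case in two. A minimal sketch for single precision follows; the function name `classify` and its return strings are assumptions of this example.

```python
import struct

def classify(x: float) -> str:
    """Name the IEEE 754 single-precision category of x."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    exp, frac = (bits >> 23) & 0xFF, bits & 0x7FFFFF
    if exp == 0:                          # all zeros: zero or denormalized
        return "zero" if frac == 0 else "denormalized"
    if exp == 0xFF:                       # all ones: special values
        return "infinity" if frac == 0 else "NaN"
    return "normalized"

for x in (1.0, -0.0, 1e-40, float("inf"), float("nan")):
    print(x, classify(x))
```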
Example Values ({single, double})
Description                exp     frac    Numeric Value
Zero                       00…00   00…00   0.0
Smallest Pos. Denorm.      00…00   00…01   2^-{23,52} × 2^-{126,1022}
  ● Single ≈ 1.4 × 10^-45
  ● Double ≈ 4.9 × 10^-324
Largest Denormalized       00…00   11…11   (1.0 − ε) × 2^-{126,1022}
  ● Single ≈ 1.18 × 10^-38
  ● Double ≈ 2.2 × 10^-308
Smallest Pos. Normalized   00…01   00…00   1.0 × 2^-{126,1022}
  ● Just larger than largest denormalized
One                        01…11   00…00   1.0
Largest Normalized         11…10   11…11   (2.0 − ε) × 2^{127,1023}
  ● Single ≈ 3.4 × 10^38
  ● Double ≈ 1.8 × 10^308
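The single-precision rows of the table can be reproduced by building each bit pattern and reinterpreting it as a float. The helper name `from_bits32` is hypothetical; ε is 2^-23 for single precision.

```python
import struct

def from_bits32(bits: int) -> float:
    """Reinterpret a 32-bit pattern as a single-precision value."""
    (x,) = struct.unpack(">f", struct.pack(">I", bits))
    return x

smallest_denorm = from_bits32(0x00000001)   # exp 00...00, frac 00...01
largest_denorm  = from_bits32(0x007FFFFF)   # exp 00...00, frac 11...11
smallest_norm   = from_bits32(0x00800000)   # exp 00...01, frac 00...00
one             = from_bits32(0x3F800000)   # exp 01...11, frac 00...00
largest_norm    = from_bits32(0x7F7FFFFF)   # exp 11...10, frac 11...11

for v in (smallest_denorm, largest_denorm, smallest_norm, one, largest_norm):
    print(v)
```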
Example: 1.0 in single precision has the bit pattern 0x3f800000 (s = 0, exp = 0111 1111, frac = 00…00)
[Figure: number line of FP encodings, left to right: NaN, −∞, −Normalized, −Denorm, −0, +0, +Denorm, +Normalized, +∞, NaN]
Non-Uniform Distribution
FP numbers not uniformly distributed
● Spacing between successive FP numbers doubles at each power of 2
● Something you need to understand if you do numerical calculations
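The doubling of the spacing can be observed directly by stepping to the next representable double. This sketch assumes a positive, finite input below the largest double; the function name `spacing` is made up for the example.

```python
import struct

def spacing(x: float) -> float:
    """Gap between the positive double x and the next larger double."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    (nxt,) = struct.unpack(">d", struct.pack(">Q", bits + 1))  # next bit pattern
    return nxt - x

for x in (1.0, 2.0, 4.0, 8.0):
    print(x, spacing(x))
```

Within one binade, such as [1, 2), the gap is constant; it doubles each time x crosses a power of 2.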
Let A = a × 2^e1 and B = b × 2^e2, with e1 ≤ e2. Aligning A to the larger exponent e2:
A + B = ((a × 2^-(e2−e1)) + b) × 2^e2
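The alignment step in the formula can be sketched with exact fractions for the significands. The function name `fp_add` is hypothetical, and no rounding is modeled: real hardware has a fixed number of significand bits, so the right shift may discard low-order bits of a.

```python
from fractions import Fraction

def fp_add(a: Fraction, e1: int, b: Fraction, e2: int):
    """Add a*2**e1 and b*2**e2; return (significand, exponent) of the sum."""
    if e1 > e2:                           # make (b, e2) the larger-exponent operand
        a, e1, b, e2 = b, e2, a, e1
    aligned = a / 2 ** (e2 - e1)          # a * 2**-(e2-e1): shift a right
    return aligned + b, e2

m, e = fp_add(Fraction(3, 2), 1, Fraction(5, 4), 3)   # 3 + 10
print(m, e, m * 2**e)
```

A further normalization step would be needed whenever the resulting significand falls outside [1, 2).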
Why?
● Let A = a × 2^e1 and B = b × 2^e2
● Then,
A × B = (a × b) × 2^(e1+e2)
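The identity can be exercised on Python floats using `math.frexp` and `math.ldexp`. Note that `frexp` normalizes significands to [0.5, 1) rather than [1, 2), so the renormalization condition below follows that convention; the function name `fp_mul` is an assumption, and overflow and underflow are not handled.

```python
import math

def fp_mul(x: float, y: float) -> float:
    """Multiply by multiplying significands and adding exponents."""
    a, e1 = math.frexp(x)                 # x == a * 2**e1, 0.5 <= |a| < 1
    b, e2 = math.frexp(y)
    m, e = a * b, e1 + e2
    if m != 0 and abs(m) < 0.5:           # product in [0.25, 0.5): renormalize
        m, e = m * 2, e - 1
    return math.ldexp(m, e)               # reassemble m * 2**e

print(fp_mul(6.0, 0.75), fp_mul(3.0, 5.0))
```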
Why?
● Let A = a × 2^e1 and B = b × 2^e2
● Then,
A / B = (a / b) × 2^(e1−e2)
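A sketch of the same idea for division, this time with exact significands in [1, 2). The quotient a / b then lies in (0.5, 2), so at most one left shift is needed to renormalize. The function name `fp_div` is hypothetical; positive significands are assumed and rounding is not modeled.

```python
from fractions import Fraction

def fp_div(a: Fraction, e1: int, b: Fraction, e2: int):
    """Divide a*2**e1 by b*2**e2; return a normalized (significand, exponent)."""
    m, e = a / b, e1 - e2                 # divide significands, subtract exponents
    if m < 1:                             # quotient in (0.5, 1): renormalize
        m, e = m * 2, e - 1
    return m, e

m, e = fp_div(Fraction(1), 3, Fraction(5, 4), 0)   # 8 / 1.25
print(m, e, float(m * 2**e))
```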
Floating Point in C
C Guarantees Two Levels
● float: single precision
● double: double precision
Conversions/Casting
● Casting between int, float, and double changes bit
representation
● double/float ➙ int
► Truncates fractional part
► Like rounding toward zero
► Not defined when out of range or NaN: generally sets to TMin
● int ➙ double
► Exact conversion, as long as int has ≤ 53-bit word size
● int ➙ float
► Will round according to rounding mode
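These conversions can be emulated in Python, whose `float` is a C double and whose `int()` truncates toward zero as a C cast does. This is an emulation rather than C itself: the helper `to_c_float` is a made-up name that rounds through the 32-bit format with `struct`, and the out-of-range and NaN cases are not reproduced (Python raises an exception there instead).

```python
import struct

def to_c_float(n: int) -> float:
    """Value of n after rounding to single precision (default rounding mode)."""
    # float(n) is exact for |n| <= 2**53, so only the pack step rounds
    return struct.unpack(">f", struct.pack(">f", float(n)))[0]

print(int(2.9), int(-2.9))                 # double -> int drops the fraction
print(to_c_float(2**24 + 1))               # needs 25 significant bits: rounded
print(float(2**53 + 1) == 2**53 + 1)       # needs 54 bits: not exact in a double
```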
The End