Lect4 Floats

The document discusses representing real numbers in computers using floating point representation, which stores numbers in the format sign x 2^exponent x fraction, allowing for a flexible decimal place and a wide range of values. It explains how single precision floating point uses 1 bit for the sign, 8 bits for the exponent, and 23 bits for the fraction, and describes how the exponent is encoded and how the fraction works by always having a 1 to the left of the decimal point.

Uploaded by

Cem Güven

photo by unsplash user @swimstaralex

COMP201
Computer
Systems &
Programming
Lecture #04 –Floating Point
Aykut Erdem // Koç University // Spring 2023
Image: Professor Farnsworth (Futurama)

Good news, everyone!

• Labs will start this Friday!


• Assg1 will be out on Mar 9
(due Mar 16)

2
2
Recap: Bitwise Operators
• You’re already familiar with many operators in C:
– Arithmetic operators: +, -, *, /, %
– Comparison operators: ==, !=, <, >, <=, >=
– Logical Operators: &&, ||, !

• Bitwise operators:
– Logical operators: &, |, ~, ^
– Bit shift operators: <<, >>

3
Plan For Today
• Representing real numbers
• Fixed Point
• Floating Point
• Example and Properties
• Floating Point Arithmetic
• Floating Point in C

Disclaimer: Slides for this lecture were borrowed from
– Nick Troccoli's Stanford CS107 class
– Randal E. Bryant and David R. O’Hallaron's CMU 15-213 class
6
COMP201 Topic 2: How can a
computer represent real numbers
in addition to integer numbers?
Learning Goals
Understand the design and compromises of the floating point
representation, including:
• Fixed point vs. floating point
• How a floating point number is represented in binary
• Issues with floating point imprecision
• Other potential pitfalls using floating point numbers in programs

8
Real Numbers
• We previously discussed representing integer numbers using two’s
complement.
• However, this system does not represent real numbers such as
3/5 or 0.25.
• How can we design a representation for real numbers?

10
Real Numbers
Problem: unlike with the integer number line, where there are a finite
number of values between two numbers, there are an infinite number of
real number values between two numbers!

Integers between 0 and 2: 1


Real Numbers Between 0 and 2: 0.1, 0.01, 0.001, 0.0001, 0.00001,…

We need a fixed-width representation for real numbers. Therefore, by definition, we will not be able to represent all numbers.
11
Real Numbers
Problem: every number base has un-representable real numbers.

Base 10: 1/6 (base 10) = 0.16666666… (base 10)
Base 2: 1/10 (base 10) = 0.000110011001100110011… (base 2)

Therefore, by representing in base 2, we will not be able to represent all numbers, even those we can exactly represent in base 10.
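To make the 1/10 case concrete, here is a minimal sketch in C. It assumes `double` is an IEEE 754 64-bit type; the function name `tenth_sums_to_one` is made up for this illustration.

```c
#include <stdio.h>

/* Adds 0.1 ten times and reports whether the total is exactly 1.0.
 * 0.1 has no finite base-2 expansion, so each addend is already rounded. */
int tenth_sums_to_one(void) {
    volatile double sum = 0.0;   /* volatile: keep every step in double precision */
    for (int i = 0; i < 10; i++) {
        sum += 0.1;
    }
    return sum == 1.0;
}

/* Prints the value actually stored for the literal 0.1. */
void print_stored_tenth(void) {
    printf("%.20f\n", 0.1);
}
```

On a typical IEEE 754 platform `tenth_sums_to_one()` returns 0, because the rounding error in each stored 0.1 accumulates.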

12
Fixed Point
• Idea: Like in base 10, let’s add binary decimal places to our existing
number representation.

Base 10:   5      9      3      4    .   2       1       6
           10^3   10^2   10^1   10^0     10^-1   10^-2   10^-3

Base 2:    1      0      1      1    .   0       1       1
           2^3    2^2    2^1    2^0      2^-1    2^-2    2^-3

14
Fixed Point
• Idea: Like in base 10, let’s add binary decimal places to our existing
number representation.

1    0    1    1    .   0      1      1
8s   4s   2s   1s       1/2s   1/4s   1/8s

• Pros: arithmetic is easy! And we know exactly how much precision we have.
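As a rough illustration of why fixed-point arithmetic is easy, the sketch below stores a value with three binary places in an 8-bit integer. The names `FRAC_BITS`, `fixed_to_double`, and `fixed_add` are hypothetical and chosen only for this example.

```c
#include <stdint.h>

#define FRAC_BITS 3   /* assumption: three binary places, as in 1011.011 */

/* Interpret the raw bits as an unsigned fixed-point value:
 * the integer value divided by 2^FRAC_BITS. */
double fixed_to_double(uint8_t raw) {
    return (double)raw / (1 << FRAC_BITS);
}

/* Adding two fixed-point values is plain integer addition on the raw bits
 * (the result wraps around if it does not fit in 8 bits). */
uint8_t fixed_add(uint8_t a, uint8_t b) {
    return (uint8_t)(a + b);
}
```

With this layout the bit pattern 1011011 (0x5B) stands for 1011.011 in binary, which is 11.375, and adding 0x04 (0.5) yields 11.875.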
15
Fixed Point
• Problem: we have to fix where the decimal point is in our representation.
What should we pick? This also fixes us to 1 place per bit.

.   0      1      1      0   0   1   1
    1/2s   1/4s   1/8s   …

1     0    1    1    0    .   1      1
16s   8s   4s   2s   1s       1/2s   1/4s

16
Fixed Point
• Problem: we have to fix where the decimal point is in our representation.
What should we pick? This also fixes us to 1 place per bit.
Base 10        Base 2
5.07E30    =   10.....0.1   (100 zeros)
9.86E-32   =   0.0.....01   (100 zeros)

To be able to store both these numbers using the same fixed point representation, the bitwidth of the type would need to be at least 207 bits wide!

17
Let’s Get Real
What would be nice to have in a real number representation?
• Represent widest range of numbers possible
• Flexible “floating” decimal point
• Represent scientific notation numbers, e.g. 1.2 x 10^6
• Still be able to compare quickly
• Have more predictable over/under-flow behavior

18
IEEE Floating Point
Let’s aim to represent numbers of the following scientific-notation-like
format:

x * 2^y

With this format, 32-bit floats represent numbers in the range ~1.2 x 10^-38 to ~3.4 x 10^38! Is every number between those representable? No.
20
IEEE Single Precision Floating Point

bit 31   bits 30-23          bits 22-0
s        exponent (8 bits)   fraction (23 bits)

Sign bit (0 = positive)      x * 2^y
21
Exponent
s exponent (8 bits) fraction (23 bits)

Exponent (Binary) Exponent (Base 10)


11111111 ?
11111110 ?
11111101 ?
11111100 ?
… ?
00000011 ?
00000010 ?
00000001 ?
00000000 ?
22
Exponent
s exponent (8 bits) fraction (23 bits)

Exponent (Binary) Exponent (Base 10)


11111111 RESERVED
11111110 ?
11111101 ?
11111100 ?
… ?
00000011 ?
00000010 ?
00000001 ?
00000000 RESERVED
23
Exponent
s exponent (8 bits) fraction (23 bits)

Exponent (Binary) Exponent (Base 10)


11111111 RESERVED
11111110 127
11111101 126
11111100 125
… …
00000011 -124
00000010 -125
00000001 -126
00000000 RESERVED
24
Exponent
s exponent (8 bits) fraction (23 bits)

• The exponent is not represented in two’s complement.
• Instead, exponents are sequentially represented starting from 000…1 (most negative) to 111…10 (most positive). This makes bit-level comparison fast.
• Actual value = binary value – 127 (“bias”)

Exponent (Binary)   Actual value
11111110            254 – 127 = 127
11111101            253 – 127 = 126
…                   …
00000010            2 – 127 = -125
00000001            1 – 127 = -126
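The bias rule can be checked directly on real floats. The following is a small sketch, assuming `float` is a 32-bit IEEE 754 single precision type; the helper names `float_bits`, `exponent_field`, and `unbiased_exponent` are invented for this example.

```c
#include <stdint.h>
#include <string.h>

/* Copy the 32 bits of a float into an integer without changing them. */
uint32_t float_bits(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Raw 8-bit exponent field (bits 30..23). */
unsigned exponent_field(float f) {
    return (float_bits(f) >> 23) & 0xFF;
}

/* Actual exponent of a normalized float: field value minus the bias of 127. */
int unbiased_exponent(float f) {
    return (int)exponent_field(f) - 127;
}
```

For example, 1.0f stores the field value 127 (actual exponent 0), and 8.0f = 1.0 x 2^3 stores 130.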
25
Fraction
s exponent (8 bits) fraction (23 bits)

x * 2^y
• We could just encode whatever x is in the fraction field. But there’s a trick
we can use to make the most out of the bits we have.

26
An Interesting Observation
In Base 10:
42.4 x 10^5 = 4.24 x 10^6
324.5 x 10^5 = 3.245 x 10^7
0.624 x 10^5 = 6.24 x 10^4
We tend to adjust the exponent until we get down to one place to the left of the decimal point.

In Base 2:
10.1 x 2^5 = 1.01 x 2^6
1011.1 x 2^5 = 1.0111 x 2^8
0.110 x 2^5 = 1.10 x 2^4
Observation: in base 2, this means there is always a 1 to the left of the decimal point!

27
Fraction
s exponent (8 bits) fraction (23 bits)

x * 2^y
• We can adjust this value to fit the format described previously. Then, x will
always be in the format 1.XXXXXXXXX…
• Therefore, in the fraction portion, we can encode just what is to the right of
the decimal point! This means we get one more digit for precision.

Value encoded = 1._[FRACTION BINARY DIGITS]_
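Putting sign, biased exponent, and implicit leading 1 together, a normalized 32-bit pattern can be decoded by hand. This is only a sketch: the function name `decode_normalized` is hypothetical, and it deliberately ignores the reserved exponent values (all zeros and all ones).

```c
#include <stdint.h>

/* value = (-1)^s * 1.fraction * 2^(exponent field - 127)
 * Valid only for normalized encodings (exponent field 1..254). */
double decode_normalized(uint32_t bits) {
    unsigned sign = bits >> 31;
    int exp = (int)((bits >> 23) & 0xFF) - 127;   /* remove the bias */
    uint32_t frac = bits & 0x7FFFFF;              /* low 23 bits */

    /* Implicit leading 1, plus the fraction scaled by 2^-23. */
    double value = 1.0 + (double)frac / 8388608.0;

    /* Apply 2^exp by repeated doubling or halving (exact in binary). */
    for (; exp > 0; exp--) value *= 2.0;
    for (; exp < 0; exp++) value *= 0.5;

    return sign ? -value : value;
}
```

For instance, 0x40A00000 has exponent field 129 and fraction 0100…0, so it decodes as 1.25 x 2^2 = 5.0.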


28
Practice
Sign   Exponent   Fraction
0      0…0001     010…

Is this number:
A) Greater than 0?
B) Less than 0?

Is this number:
A) Less than -1?
B) Between -1 and 1?
C) Greater than 1?

Value: 1.25 x 2^-126
30
Skipping Numbers
• We said that it’s not possible to represent all real numbers using a fixed-
width representation. What does this look like?

Float Converter
• https://www.h-schmidt.net/FloatConverter/IEEE754.html

Floats and Graphics
• https://www.shadertoy.com/view/4tVyDK

31
Let’s Get Real
What would be nice to have in a real number representation?
• Represent widest range of numbers possible ✔
• Flexible “floating” decimal point ✔
• Represent scientific notation numbers, e.g. 1.2 x 10^6 ❓
• Still be able to compare quickly ✔
• Have more predictable over/under-flow behavior ❓

32
Representing Zero
The float representation of zero is all zeros (with any value for the sign bit)
Sign Exponent Fraction
any All zeros All zeros

• This means there are two representations for zero (+0 and -0)!
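The two zeros can be observed from C. The sketch below assumes a 32-bit IEEE 754 `float`; `bits_of` and `zeros_compare_equal` are names made up for this example.

```c
#include <stdint.h>
#include <string.h>

/* Copy the 32 bits of a float into an integer without changing them. */
uint32_t bits_of(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* +0 and -0 differ in the sign bit but compare equal with ==. */
int zeros_compare_equal(void) {
    float pos_zero = 0.0f;
    float neg_zero = -0.0f;
    return pos_zero == neg_zero;
}
```

Here `bits_of(0.0f)` is 0x00000000 and `bits_of(-0.0f)` is 0x80000000, yet `zeros_compare_equal()` returns 1.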

33
Representing Small Numbers
If the exponent is all zeros, we switch into “denormalized” mode.

Sign Exponent Fraction


any All zeros Any

• We now treat the exponent as -126, and the fraction as having no implicit leading 1 (the value is 0.fraction instead of 1.fraction).
• This allows us to represent the smallest numbers as precisely as
possible.

34
Representing Exceptional Values
If the exponent is all ones, and the fraction is all zeros, we have ±infinity.

Sign Exponent Fraction


any All ones All zeros

• The sign bit indicates whether it is positive or negative infinity.


• Floats have built-in handling of over/underflow!
– Infinity + anything (except negative infinity) = infinity
– Negative infinity + negative anything = negative infinity
– Etc.

35
Representing Exceptional Values
If the exponent is all ones, and the fraction is nonzero, we have
Not a Number (NaN)
Sign Exponent Fraction
any 1 … … … … 1 Any nonzero

• NaN results from computations that produce an invalid mathematical result.
– Sqrt(negative)
– Infinity / infinity
– Infinity + -infinity
– Etc.
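These rules can be tried out in C. The sketch below assumes IEEE 754 arithmetic with float expressions evaluated in float precision (the usual case on x86-64); the function names are made up for this example, and `INFINITY` comes from the standard `<math.h>` header.

```c
#include <math.h>   /* INFINITY macro */

/* NaN is the only value that is not equal to itself. */
int is_nan(double x) {
    return x != x;
}

/* Infinity + -infinity has no meaningful result, so it yields NaN. */
double inf_minus_inf(void) {
    volatile double inf = INFINITY;   /* volatile: evaluate at run time */
    return inf - inf;
}

/* Overflow: a product too large for float becomes +infinity. */
float overflow_to_inf(void) {
    volatile float big = 3e38f;       /* close to the largest float */
    float result = big * 10.0f;
    return result;
}
```

Note that comparing against NaN with `==` is always false, which is why `is_nan` uses `x != x` rather than a comparison with a NaN constant.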
36
Number Ranges
• 32-bit integer (type int):
– -2,147,483,648 to 2,147,483,647
– Every integer in that range can be represented
• 64-bit integer (type long):
– -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
• 32-bit floating point (type float):
– ~1.2 x 10^-38 to ~3.4 x 10^38
– Not all numbers in the range can be represented (not even all integers in the range can be represented!)
– Gaps can get quite large! (The larger the exponent, the larger the gap between successive fraction values.)
• 64-bit floating point (type double):
– ~2.2 x 10^-308 to ~1.8 x 10^308
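To see the gaps, one can step from a float to its next larger neighbor by adding 1 to its bit pattern. This is a sketch assuming a 32-bit IEEE 754 `float`; `next_up` and `survives_float` are hypothetical helpers, and `next_up` is only meant for positive finite inputs.

```c
#include <stdint.h>
#include <string.h>

/* Next representable float above a positive finite f:
 * increment the bit pattern by one. */
float next_up(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    bits += 1;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Does an int keep its value after a round trip through float?
 * (Intended for values well inside the int range.) */
int survives_float(int x) {
    float as_float = (float)x;
    return (int)as_float == x;
}
```

Near 1.0 the gap is 2^-23, but at 16777216 (2^24) the gap is already 2, so 16777217 is an integer that a float cannot hold.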
37
Precision options
• Single precision: 32 bits
    s (1 bit) | exp (8 bits) | frac (23 bits)
• Double precision: 64 bits
    s (1 bit) | exp (11 bits) | frac (52 bits)
• Extended precision: 80 bits (Intel only)
    s (1 bit) | exp (15 bits) | frac (63 or 64 bits)
38
Visualization: Floating Point Encodings

[Figure: number line of encodings, from left to right: NaN, −∞, −Normalized, −Denorm, −0, +0, +Denorm, +Normalized, +∞, NaN]

39
Tiny Floating Point Example
Layout: s (1 bit) | exp (4 bits) | frac (3 bits)

• 8-bit Floating Point Representation
– the sign bit is in the most significant bit
– the next four bits are the exponent, with a bias of 7 (= 2^(4-1) - 1)
– the last three bits are the frac
• Same general form as IEEE Format
– normalized, denormalized
– representation of 0, NaN, infinity
41
Dynamic Range (Positive Only)
v = (–1)^s x M x 2^E
Normalized (n): E = Exp – Bias
Denormalized (d): E = 1 – Bias

Kind           s exp  frac    E     Value
Denormalized   0 0000 000    -6     0
               0 0000 001    -6     1/8 * 1/64 = 1/512     closest to zero
               0 0000 010    -6     2/8 * 1/64 = 2/512
               …
               0 0000 110    -6     6/8 * 1/64 = 6/512
               0 0000 111    -6     7/8 * 1/64 = 7/512     largest denorm
Normalized     0 0001 000    -6     8/8 * 1/64 = 8/512     smallest norm
               0 0001 001    -6     9/8 * 1/64 = 9/512
               …
               0 0110 110    -1     14/8 * 1/2 = 14/16
               0 0110 111    -1     15/8 * 1/2 = 15/16     closest to 1 below
               0 0111 000     0     8/8 * 1 = 1
               0 0111 001     0     9/8 * 1 = 9/8          closest to 1 above
               0 0111 010     0     10/8 * 1 = 10/8
               …
               0 1110 110     7     14/8 * 128 = 224
               0 1110 111     7     15/8 * 128 = 240       largest norm
               0 1111 000    n/a    inf
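The rows of this table can be reproduced with a small decoder for the 8-bit format (1 sign bit, 4 exponent bits with bias 7, 3 fraction bits). This is a sketch; `tiny_decode` is a name invented for this example, and `INFINITY` and `NAN` are the standard `<math.h>` macros.

```c
#include <stdint.h>
#include <math.h>   /* INFINITY and NAN macros */

/* Decode the 8-bit format: s (1 bit) | exp (4 bits, bias 7) | frac (3 bits). */
double tiny_decode(uint8_t bits) {
    int sign = bits >> 7;
    int exp  = (bits >> 3) & 0xF;
    int frac = bits & 0x7;
    double m;   /* significand M */
    int e;      /* exponent E    */

    if (exp == 0xF) {                 /* all ones: infinity or NaN */
        if (frac != 0) return NAN;
        return sign ? -INFINITY : INFINITY;
    }
    if (exp == 0) {                   /* denormalized: no leading 1, E = 1 - Bias */
        m = frac / 8.0;
        e = 1 - 7;
    } else {                          /* normalized: leading 1, E = Exp - Bias */
        m = 1.0 + frac / 8.0;
        e = exp - 7;
    }
    for (; e > 0; e--) m *= 2.0;      /* apply 2^E exactly */
    for (; e < 0; e++) m *= 0.5;
    return sign ? -m : m;
}
```

For example, the pattern 0 1110 111 (0x77) decodes to 15/8 x 128 = 240, the largest normalized value, and 0 0000 001 (0x01) decodes to 1/512.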
42
Distribution of Values
• 6-bit IEEE-like format
– e = 3 exponent bits
– f = 2 fraction bits
– Bias is 2^(3-1) - 1 = 3
– Layout: s (1 bit) | exp (3 bits) | frac (2 bits)

• Notice how the distribution gets denser toward zero.

[Figure: number line from -15 to 15 marking the denormalized, normalized, and infinity values; annotation: 8 values]

43
Distribution of Values (close-up view)
• 6-bit IEEE-like format
– e = 3 exponent bits
– f = 2 fraction bits
– Bias is 3
– Layout: s (1 bit) | exp (3 bits) | frac (2 bits)

[Figure: close-up of the number line from -1 to 1 marking the denormalized, normalized, and infinity values]

44
Special Properties of the IEEE Encoding
• FP Zero Same as Integer Zero
– All bits = 0

• Can (Almost) Use Unsigned Integer Comparison


– Must first compare sign bits
– Must consider −0 = 0
– NaNs problematic
• Will be greater than any other values
• What should comparison yield?
– Otherwise OK
• Denorm vs. normalized
• Normalized vs. infinity
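The "almost unsigned comparison" property can be sketched as follows for the easy case of non-negative, non-NaN values, where the bit patterns sort in the same order as the numbers. The helper names are hypothetical and a 32-bit IEEE 754 `float` is assumed.

```c
#include <stdint.h>
#include <string.h>

/* Copy the 32 bits of a float into an integer without changing them. */
uint32_t to_bits(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Build a float from a 32-bit pattern. */
float from_bits(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* a < b for non-negative, non-NaN floats, using only integer comparison.
 * Negative values, -0 versus +0, and NaN need the extra handling
 * listed on the slide and are not covered here. */
int less_than_nonneg(float a, float b) {
    return to_bits(a) < to_bits(b);
}
```

The ordering holds across the denormalized/normalized boundary and up to infinity (pattern 0x7F800000), because the exponent field sits above the fraction field.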
45
Demo: Float Arithmetic

float_arithmetic.c
Floating Point Arithmetic
Is this just overflowing? It turns out it’s more subtle.
float a = 3.14;
float b = 1e20;
printf("(3.14 + 1e20) - 1e20 = %g\n", (a + b) - b); // prints 0
printf("3.14 + (1e20 - 1e20) = %g\n", a + (b - b)); // prints 3.14
Let’s look at the binary representations for 3.14 and 1e20:
        bit 31   bits 30-23   bits 22-0
3.14:   0        10000000     10010001111010111000011
1e20:   0        11000001     01011010111100011101100


48
Floating Point Arithmetic
        bit 31   bits 30-23   bits 22-0
3.14:   0        10000000     10010001111010111000011
1e20:   0        11000001     01011010111100011101100

To add real numbers, we must align their binary points:

                      3.14
+ 100000000000000000000.00
= 100000000000000000003.14

What does this number look like in 32-bit IEEE format?
49
Floating Point Arithmetic
Step 1: convert from base 10 to binary

What is 100000000000000000003.14 in binary? Let’s find out!


http://web.stanford.edu/class/archive/cs/cs107/cs107.1184/float/convert.html

1010110101111000111010111100010110101100011000100000000000000000011.0010001111010111000010100011…

50
Floating Point Arithmetic
Step 2: find most significant 1 and take the
next 23 digits for the fractional component,
rounding if needed.
1010110101111000111010111100010110101100011000100000000000000000011.0010001111010111000010100011…

1 01011010111100011101100
51
Floating Point Arithmetic
Step 3: find how many places we need to shift the binary point left to put the number in 1.xxx format. This gives the exponent; adding the bias of 127 fills in the exponent component.
1010110101111000111010111100010110101100011000100000000000000000011.0010001111010111000010100011…

66 shifts -> 66 + 127 = 193 (11000001 in binary)


52
Floating Point Arithmetic
Step 4: if the sign is positive, the sign bit is 0.
Otherwise, it’s 1.

1010110101111000111010111100010110101100011000100000000000000000011.0010001111010111000010100011…

Sign bit is 0.

53
Floating Point Arithmetic
The binary representation for 1e20 + 3.14 thus equals the following:
bit 31   bits 30-23   bits 22-0
0        11000001     01011010111100011101100

This is the same as the binary representation for 1e20 that we had before!

We didn’t have enough bits to differentiate between 1e20 and 1e20 + 3.14.
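The same effect can be tested directly: with an exponent of 66 and 23 fraction bits, neighboring floats near 1e20 are 2^43 (about 8.8 x 10^12) apart, so adding 3.14 cannot move the result to a different float. A minimal sketch, assuming float expressions are evaluated in float precision; the function name `absorbed` is made up for this example.

```c
/* Returns 1 if adding `small` to `big` leaves `big` unchanged,
 * i.e. `small` is lost entirely to rounding. */
int absorbed(float small, float big) {
    volatile float sum = small + big;   /* volatile: force rounding to float */
    return sum == big;
}
```

`absorbed(3.14f, 1e20f)` returns 1, while `absorbed(3.14f, 100.0f)` returns 0 because floats near 100 are spaced closely enough to register the change.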

54
Floating Point Arithmetic
Is this just overflowing? It turns out it’s more subtle.
float a = 3.14;
float b = 1e20;
printf("(3.14 + 1e20) - 1e20 = %g\n", (a + b) - b); // prints 0
printf("3.14 + (1e20 - 1e20) = %g\n", a + (b - b)); // prints 3.14

Floating point arithmetic is not associative. The order of operations matters!
• The first line loses precision when first adding 3.14 and 1e20, as we have
seen.
• The second line first evaluates 1e20 – 1e20 = 0, and then adds 3.14

55
Demo: Float Equality

float_equality.c
Floating Point Arithmetic
Float arithmetic is an issue with most languages, not just C!
• http://geocar.sdf1.org/numbers.html

57
Let’s Get Real
What would be nice to have in a real number representation?
• Represent widest range of numbers possible ✔
• Flexible “floating” decimal point ✔
• Represent scientific notation numbers, e.g. 1.2 x 10^6 ✔
• Still be able to compare quickly ✔
• Have more predictable over/under-flow behavior ✔

58
Floating Point in C
• C Guarantees Two Levels
– float: single precision
– double: double precision
• Conversions/Casting
– Casting between int, float, and double changes bit representation
– double/float → int
• Truncates fractional part
• Like rounding toward zero
• Not defined when out of range or NaN: Generally sets to TMin
– int → double
• Exact conversion, as long as int has ≤ 53 bit word size
– int → float
• Will round according to rounding mode
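The in-range conversion rules can be sketched as below, assuming a 32-bit `int`, IEEE 754 `float` and `double`, and the default round-to-nearest mode. The function names are hypothetical. Out-of-range and NaN inputs are deliberately not exercised, since the slide notes their result is not defined.

```c
/* double -> int: the fractional part is dropped (rounding toward zero). */
int double_to_int(double d) {
    return (int)d;
}

/* int -> float: only 24 significant bits fit, so large values are rounded. */
float int_to_float(int x) {
    return (float)x;
}

/* int -> double -> int: exact, because a 32-bit int fits in 53 bits. */
int roundtrip_via_double(int x) {
    return (int)(double)x;
}
```

For example, `double_to_int(-3.99)` is -3 rather than -4, and `int_to_float(16777217)` yields 16777216.0f.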
60
Ariane 5: A Bug and A Crash

• On June 4, 1996, the Ariane 5 rocket self-destructed just 37 seconds after liftoff
• Cost: $500 million
• Cause: An overflow in the conversion
from a 64 bit floating point number to a
16 bit signed integer
• A design flaw:
– 5 times faster than Ariane 4
– Reused same software
specifications from Ariane 4
– Ariane 4 assumes horizontal
velocity would never overflow a
16-bit number
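One way to guard a narrowing conversion like this in C is to test the range before casting. The sketch below is purely illustrative and is not the Ariane flight software; the helpers `fits_int16` and `to_int16_or` are hypothetical names.

```c
#include <stdint.h>

/* Returns 1 if d, truncated toward zero, fits in a 16-bit signed integer.
 * NaN fails both comparisons and is therefore rejected as well. */
int fits_int16(double d) {
    return d > -32769.0 && d < 32768.0;
}

/* Convert when safe, otherwise return a caller-chosen fallback
 * instead of performing an undefined conversion. */
int16_t to_int16_or(double d, int16_t fallback) {
    return fits_int16(d) ? (int16_t)d : fallback;
}
```

A caller could treat the fallback as an error signal and handle it explicitly, rather than letting an out-of-range value propagate.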
© Fourmy/REA/SABA/Corbis
61
0x5F3759DF or The Fast Inverse Square Root

[Figure: the fast inverse square root implementation from Quake III Arena, including the exact original comment text]
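Since the code itself did not survive as text here, the following is a sketch of the widely known technique rather than the verbatim Quake III source: it uses `memcpy` instead of the original pointer cast, and the name `fast_rsqrt` is made up. It assumes a 32-bit IEEE 754 `float` and a positive, normalized input.

```c
#include <stdint.h>
#include <string.h>

/* Approximates 1/sqrt(x) for positive, normalized x. */
float fast_rsqrt(float x) {
    float half = 0.5f * x;
    uint32_t i;
    float y;

    memcpy(&i, &x, sizeof i);        /* reinterpret the float's bits as an integer */
    i = 0x5F3759DF - (i >> 1);       /* magic constant gives a first guess */
    memcpy(&y, &i, sizeof y);        /* back to float */

    y = y * (1.5f - half * y * y);   /* one Newton-Raphson refinement step */
    return y;
}
```

The trick works because the bit pattern of a positive float, read as an integer, behaves roughly like a scaled logarithm of its value, so halving and negating it approximates raising the value to the power -1/2.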

62
Floating Point Puzzles
• For each of the following C expressions, either:
– Argue that it is true for all argument values
– Explain why not true

int x = …;
float f = …;
double d = …;
Assume neither d nor f is NaN

• x == (int)(float) x              False
• x == (int)(double) x             True
• f == (float)(double) f           True
• d == (float) d                   False
• f == -(-f);                      True
• 2/3 == 2/3.0                     False
• d < 0.0 ⇒ ((d*2) < 0.0)          True (OF?)
• d > f ⇒ -f > -d                  True
• d * d >= 0.0                     True (OF?)
• (d+f)-d == f                     False
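A few of the "False" answers can be confirmed with concrete counterexamples. This sketch assumes IEEE 754 `float` and `double`; the function names are invented for this example.

```c
/* 2/3 is integer division (0), while 2/3.0 is about 0.667. */
int puzzle_int_division(void) {
    return 2 / 3 == 2 / 3.0;
}

/* (d+f)-d == f fails when f is too small to survive being added to d. */
int puzzle_sum(double d, float f) {
    volatile double sum = d + f;   /* volatile: force rounding to double */
    return (sum - d) == f;
}

/* d == (float) d fails whenever d needs more precision than float has. */
int puzzle_narrow(double d) {
    return d == (float)d;
}
```

With d = 1e20 and f = 3.14f the sum rounds back to 1e20, so `puzzle_sum` returns 0; with d = 1.0 and f = 2.0f nothing is lost and it returns 1.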
63
Floats Summary
• IEEE Floating Point is a carefully-thought-out standard. It’s complicated, but engineered for its goals.
• Floats have an extremely wide range, but cannot represent every number
in that range.
• Some approximation and rounding may occur! This means you definitely
don’t want to use floats e.g. for currency.
• Associativity does not hold for numbers far apart in the range
• Equality comparison operations are often unwise.
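A common alternative to `==` is to compare within a tolerance. The sketch below uses a relative tolerance; the function name `nearly_equal` and the tolerance value are assumptions for illustration, and the right tolerance depends on the application.

```c
/* Returns 1 if a and b differ by at most rel_tol times the larger magnitude. */
int nearly_equal(double a, double b, double rel_tol) {
    double diff  = a > b ? a - b : b - a;     /* |a - b| */
    double mag_a = a < 0 ? -a : a;            /* |a| */
    double mag_b = b < 0 ? -b : b;            /* |b| */
    double scale = mag_a > mag_b ? mag_a : mag_b;
    return diff <= rel_tol * scale;
}

/* The classic surprise: 0.1 + 0.2 is not exactly 0.3 in double. */
int exact_sum_matches(void) {
    volatile double sum = 0.1;
    sum += 0.2;
    return sum == 0.3;
}
```

`exact_sum_matches()` returns 0, whereas `nearly_equal(0.1 + 0.2, 0.3, 1e-9)` returns 1. A purely relative test never accepts a nonzero value as equal to 0, so comparisons near zero usually need an additional absolute tolerance.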

64
Additional Reading

David Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic," ACM Computing Surveys, 23(1), 1991
65
Recap
• Representing real numbers
• Fixed Point
• Floating Point
• Example and Properties
• Floating Point Arithmetic
• Floating Point in C

Next time: How can a computer represent and manipulate more complex
data like text?
66