Lect4 Floats

The document discusses representing real numbers in computers using floating point representation, which stores numbers in the format sign x 2^exponent x fraction, allowing for a flexible decimal place and wide range of values. It explains how single precision floating point uses 1 bit for the sign, 8 bits for the exponent, and 23 bits for the fraction, and describes how the exponent is encoded and how the fraction works by always having a 1 to the left of the decimal point.

Uploaded by

Cem Güven


photo by unsplash user @swimstaralex

COMP201
Computer
Systems &
Programming
Lecture #04 –Floating Point
Aykut Erdem // Koç University // Spring 2023
Image: Professor Farnsworth (Futurama)

Good news, everyone!

• Labs will start this Friday!

• Assg1 will be out on Mar 9
(due Mar 16)

2
Recap: Bitwise Operators
• You’re already familiar with many operators in C:
– Arithmetic operators: +, -, *, /, %
– Comparison operators: ==, !=, <, >, <=, >=
– Logical Operators: &&, ||, !

• Bitwise operators:
– Logical operators: &, |, ~, ^
– Bit shift operators: <<, >>

3
Plan For Today
• Representing real numbers
• Fixed Point
• Floating Point
• Example and Properties
• Floating Point Arithmetic
• Floating Point in C

Disclaimer: Slides for this lecture were borrowed from

—Nick Troccoli's Stanford CS107 class
—Randal E. Bryant and David R. O’Hallaron's CMU 15-213 class
6
COMP201 Topic 2: How can a
computer represent real numbers
in addition to integer numbers?
Learning Goals
Understand the design and compromises of the floating point
representation, including:
• Fixed point vs. floating point
• How a floating point number is represented in binary
• Issues with floating point imprecision
• Other potential pitfalls using floating point numbers in programs

8
Plan For Today
• Representing real numbers
• Fixed Point
• Floating Point
• Example and Properties
• Floating Point Arithmetic
• Floating Point in C

9
Real Numbers
• We previously discussed representing integer numbers using two’s
complement.
• However, this system does not represent real numbers such as
3/5 or 0.25.
• How can we design a representation for real numbers?

10
Real Numbers
Problem: unlike with the integer number line, where there are a finite
number of values between two numbers, there are an infinite number of
real number values between two numbers!

Integers between 0 and 2: 1

Real Numbers Between 0 and 2: 0.1, 0.01, 0.001, 0.0001, 0.00001,…

We need a fixed-width representation for real numbers. Therefore, by definition, we will not be able to represent all numbers.
11
Real Numbers
Problem: every number base has un-representable real numbers.

Base 10: (1/6)_10 = 0.16666666…_10

Base 2: (1/10)_10 = 0.000110011001100110011…_2

Therefore, by representing in base 2, we will not be able to represent all numbers, even those we can exactly represent in base 10.

12
Plan For Today
• Representing real numbers
• Fixed Point
• Floating Point
• Example and Properties
• Floating Point Arithmetic
• Floating Point in C

13
Fixed Point
• Idea: Like in base 10, let’s add binary decimal places to our existing
number representation.

5      9      3      4    .   2       1       6
10^3   10^2   10^1   10^0     10^-1   10^-2   10^-3

1      0      1      1    .   0       1       1
2^3    2^2    2^1    2^0      2^-1    2^-2    2^-3

14
Fixed Point
• Idea: Like in base 10, let’s add binary decimal places to our existing
number representation.

1    0    1    1   .   0      1      1
8s   4s   2s   1s      1/2s   1/4s   1/8s

• Pros: arithmetic is easy! And we know exactly how much precision we have.
15
Fixed Point
• Problem: we have to fix where the decimal point is in our representation.
What should we pick? This also fixes us to 1 place per bit.

.  0     1     1     0  0  1  1
   1/2s  1/4s  1/8s  …

1    0   1   1   0  .  1     1
16s  8s  4s  2s  1s    1/2s  1/4s

16
Fixed Point
• Problem: we have to fix where the decimal point is in our representation.
What should we pick? This also fixes us to 1 place per bit.
Base 10     Base 2
5.07E30   = 10.....0.1   (100 zeros)
9.86E-32  = 0.0.....01   (100 zeros)

To be able to store both these numbers using the same fixed point representation, the type would need to be at least 207 bits wide!

17
Let’s Get Real
What would be nice to have in a real number representation?
• Represent widest range of numbers possible
• Flexible “floating” decimal point
• Represent scientific notation numbers, e.g. 1.2 x 10^6
• Still be able to compare quickly
• Have more predictable over/under-flow behavior

18
Lecture Plan
• Representing real numbers
• Fixed Point
• Floating Point
• Example and Properties
• Floating Point Arithmetic
• Floating Point in C

19
IEEE Floating Point
Let’s aim to represent numbers of the following scientific-notation-like
format:

x * 2^y

With this format, 32-bit floats represent numbers in the range ~1.2 x 10^-38
to ~3.4 x 10^38! Is every number between those representable? No.
20
IEEE Single Precision Floating Point

Bit 31: s (sign bit, 0 = positive) | Bits 30-23: exponent (8 bits) | Bits 22-0: fraction (23 bits)

Represented value: x * 2^y
21
Exponent
s exponent (8 bits) fraction (23 bits)

Exponent (Binary) Exponent (Base 10)

11111111 ?
11111110 ?
11111101 ?
11111100 ?
… ?
00000011 ?
00000010 ?
00000001 ?
00000000 ?
22
Exponent
s exponent (8 bits) fraction (23 bits)

Exponent (Binary) Exponent (Base 10)

11111111 RESERVED
11111110 ?
11111101 ?
11111100 ?
… ?
00000011 ?
00000010 ?
00000001 ?
00000000 RESERVED
23
Exponent
s exponent (8 bits) fraction (23 bits)

Exponent (Binary) Exponent (Base 10)

11111111 RESERVED
11111110 127
11111101 126
11111100 125
… …
00000011 -124
00000010 -125
00000001 -126
00000000 RESERVED
24
Exponent
s exponent (8 bits) fraction (23 bits)

• The exponent is not represented in two’s complement.

• Instead, exponents are sequentially represented starting from 000…1 (most
negative) to 111…10 (most positive). This makes bit-level comparison fast.
• Actual value = binary value – 127 (“bias”)
11111110 254 – 127 = 127
11111101 253 – 127 = 126
… …
00000010 2 – 127 = -125
00000001 1 – 127 = -126
25
Fraction
s exponent (8 bits) fraction (23 bits)

x * 2^y
• We could just encode whatever x is in the fraction field. But there’s a trick
we can use to make the most out of the bits we have.

26
An Interesting Observation
In Base 10:
42.4 x 10^5 = 4.24 x 10^6
324.5 x 10^5 = 3.245 x 10^7
0.624 x 10^5 = 6.24 x 10^4
We tend to adjust the exponent until we get down to one place to the left of the decimal point.

In Base 2:
10.1 x 2^5 = 1.01 x 2^6
1011.1 x 2^5 = 1.0111 x 2^8
0.110 x 2^5 = 1.10 x 2^4
Observation: in base 2, this means there is always a 1 to the left of the decimal point!

27
Fraction
s exponent (8 bits) fraction (23 bits)

x * 2^y
• We can adjust this value to fit the format described previously. Then, x will
always be in the format 1.XXXXXXXXX…
• Therefore, in the fraction portion, we can encode just what is to the right of
the decimal point! This means we get one more digit for precision.

Value encoded = 1._[FRACTION BINARY DIGITS]_

28
Practice
Sign | Exponent | Fraction
0 | 0 … 0 0 0 1 | 0 1 0 …

Is this number:
A) Greater than 0?
B) Less than 0?

29
Practice
Sign | Exponent | Fraction
0 | 0 … 0 0 0 1 | 0 1 0 …

Is this number:
A) Greater than 0?
B) Less than 0?

The value is 1.25 x 2^-126. Is this number:
A) Less than -1?
B) Between -1 and 1?
C) Greater than 1?
30
Skipping Numbers
• We said that it’s not possible to represent all real numbers using a fixed-
width representation. What does this look like?

Float Converter
• https://www.h-schmidt.net/FloatConverter/IEEE754.html

Floats and Graphics

• https://www.shadertoy.com/view/4tVyDK

31
Let’s Get Real
What would be nice to have in a real number representation?
• Represent widest range of numbers possible ✔
• Flexible “floating” decimal point ✔
• Represent scientific notation numbers, e.g. 1.2 x 10^6 ❓
• Still be able to compare quickly ✔
• Have more predictable over/under-flow behavior ❓

32
Representing Zero
The float representation of zero is all zeros (with any value for the sign bit)
Sign Exponent Fraction
any All zeros All zeros

• This means there are two representations for zero! ☹

33
Representing Small Numbers
If the exponent is all zeros, we switch into “denormalized” mode.

Sign Exponent Fraction

any All zeros Any

• We now treat the exponent as -126, and interpret the fraction without the
implicit leading 1.
• This allows us to represent the smallest numbers as precisely as
possible.

34
Representing Exceptional Values
If the exponent is all ones, and the fraction is all zeros, we have +- infinity.

Sign Exponent Fraction

any All ones All zeros

• The sign bit indicates whether it is positive or negative infinity.

• Floats have built-in handling of over/underflow!
– Infinity + anything = infinity
– Negative infinity + negative anything = negative infinity
– Etc.

35
Representing Exceptional Values
If the exponent is all ones, and the fraction is nonzero, we have
Not a Number (NaN)
Sign Exponent Fraction
any 1 … … … … 1 Any nonzero

• NaN results from computations that produce an invalid mathematical result.
– Sqrt(negative)
– Infinity / infinity
– Infinity + -infinity
– Etc.
36
Number Ranges
• 32-bit integer (type int):
› -2,147,483,648 to 2,147,483,647
› Every integer in that range can be represented

• 64-bit integer (type long):

› −9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
• 32-bit floating point (type float):
– ~1.2 x 10^-38 to ~3.4 x 10^38
– Not all numbers in the range can be represented (not even all integers in the range can be
represented!)
– Gaps can get quite large! (The larger the exponent, the larger the gap between successive
representable values.)

• 64-bit floating point (type double):

– ~2.2 x 10^-308 to ~1.8 x 10^308
37
Precision options
• Single precision: 32 bits
s (1 bit) | exp (8 bits) | frac (23 bits)

• Double precision: 64 bits
s (1 bit) | exp (11 bits) | frac (52 bits)

• Extended precision: 80 bits (Intel only)
s (1 bit) | exp (15 bits) | frac (63 or 64 bits)
38
Carnegie Mellon

Visualization: Floating Point Encodings

[Figure: number line, left to right: −∞, −Normalized, −Denorm, −0, +0, +Denorm, +Normalized, +∞, with NaN drawn beyond both ends]

39
Lecture Plan
• Representing real numbers
• Fixed Point
• Floating Point
• Example and Properties
• Floating Point Arithmetic
• Floating Point in C

40
Tiny Floating Point Example
s (1 bit) | exp (4 bits) | frac (3 bits)

• 8-bit Floating Point Representation

– the sign bit is in the most significant bit
– the next four bits are the exponent, with a bias of 7 (= 2^(4-1) - 1)
– the last three bits are the frac

• Same general form as IEEE Format

– normalized, denormalized
– representation of 0, NaN, infinity
41
Dynamic Range (Positive Only)
v = (–1)^s * M * 2^E
Normalized (n): E = Exp – Bias
Denormalized (d): E = 1 – Bias

s exp  frac   E     Value
Denormalized numbers:
0 0000 000    -6    0
0 0000 001    -6    1/8 * 1/64 = 1/512     (closest to zero)
0 0000 010    -6    2/8 * 1/64 = 2/512
…
0 0000 110    -6    6/8 * 1/64 = 6/512
0 0000 111    -6    7/8 * 1/64 = 7/512     (largest denorm)
Normalized numbers:
0 0001 000    -6    8/8 * 1/64 = 8/512     (smallest norm)
0 0001 001    -6    9/8 * 1/64 = 9/512
…
0 0110 110    -1    14/8 * 1/2 = 14/16
0 0110 111    -1    15/8 * 1/2 = 15/16     (closest to 1 below)
0 0111 000     0    8/8 * 1 = 1
0 0111 001     0    9/8 * 1 = 9/8          (closest to 1 above)
0 0111 010     0    10/8 * 1 = 10/8
…
0 1110 110     7    14/8 * 128 = 224
0 1110 111     7    15/8 * 128 = 240       (largest norm)
0 1111 000    n/a   inf
42
Distribution of Values
• 6-bit IEEE-like format
– e = 3 exponent bits
– f = 2 fraction bits
– Bias is 2^(3-1) - 1 = 3
– Layout: s (1 bit) | exp (3 bits) | frac (2 bits)

• Notice how the distribution gets denser toward zero.

[Figure: number line from -15 to 15 showing the denormalized, normalized, and infinity values, with a callout labeled "8 values"]

43
Distribution of Values (close-up view)
• 6-bit IEEE-like format
– e = 3 exponent bits
– f = 2 fraction bits
– Bias is 3
– Layout: s (1 bit) | exp (3 bits) | frac (2 bits)

[Figure: close-up of the number line from -1 to 1; legend: Denormalized, Normalized, Infinity]

44
Special Properties of the IEEE Encoding
• FP Zero Same as Integer Zero
– All bits = 0

• Can (Almost) Use Unsigned Integer Comparison

– Must first compare sign bits
– Must consider −0 = 0
– NaNs problematic
• Will be greater than any other values
• What should comparison yield?
– Otherwise OK
• Denorm vs. normalized
• Normalized vs. infinity
45
Lecture Plan
• Representing real numbers
• Fixed Point
• Floating Point
• Example and Properties
• Floating Point Arithmetic
• Floating Point in C

46
Demo: Float Arithmetic

float_arithmetic.c
Floating Point Arithmetic
Is this just overflowing? It turns out it’s more subtle.
float a = 3.14;
float b = 1e20;
printf("(3.14 + 1e20) - 1e20 = %g\n", (a + b) - b); // prints 0
printf("3.14 + (1e20 - 1e20) = %g\n", a + (b - b)); // prints 3.14
Let’s look at the binary representations for 3.14 and 1e20:
31 30 23 22 0

3.14: 0 10000000 10010001111010111000011

31 30 23 22 0

1e20: 0 11000001 01011010111100011101100

48
Floating Point Arithmetic
31 30 23 22 0

3.14: 0 10000000 10010001111010111000011

31 30 23 22 0

1e20: 0 11000001 01011010111100011101100

To add real numbers, we must align their binary points:

                       3.14
+ 100000000000000000000.00
= 100000000000000000003.14

What does this number look like in 32-bit IEEE format?
49
Floating Point Arithmetic
Step 1: convert from base 10 to binary

What is 100000000000000000003.14 in binary? Let’s find out!

http://web.stanford.edu/class/archive/cs/cs107/cs107.1184/float/convert.html

1010110101111000111010111100010110101100011000100000000000000000011.0010001111010111000010100011…

50
Floating Point Arithmetic
Step 2: find most significant 1 and take the
next 23 digits for the fractional component,
rounding if needed.
1010110101111000111010111100010110101100011000100000000000000000011.0010001111010111000010100011…

1 01011010111100011101100
51
Floating Point Arithmetic
Step 3: find how many places we need to shift
left to put the number in 1.xxx format. This fills
in the exponent component.
1010110101111000111010111100010110101100011000100000000000000000011.0010001111010111000010100011…

66 shifts -> 66 + 127 = 193

52
Floating Point Arithmetic
Step 4: if the sign is positive, the sign bit is 0.
Otherwise, it’s 1.

1010110101111000111010111100010110101100011000100000000000000000011.0010001111010111000010100011…

Sign bit is 0.

53
Floating Point Arithmetic
The binary representation for 1e20 + 3.14 thus equals the following:
31 30 23 22 0

0 11000001 01011010111100011101100

This is the same as the binary representation for 1e20 that we had
before!

We didn’t have enough bits to differentiate between 1e20 and 1e20 + 3.14.

54
Floating Point Arithmetic
Is this just overflowing? It turns out it’s more subtle.
float a = 3.14;
float b = 1e20;
printf("(3.14 + 1e20) - 1e20 = %g\n", (a + b) - b); // prints 0
printf("3.14 + (1e20 - 1e20) = %g\n", a + (b - b)); // prints 3.14

Floating point arithmetic is not associative. The order of operations matters!
• The first line loses precision when first adding 3.14 and 1e20, as we have
seen.
• The second line first evaluates 1e20 – 1e20 = 0, and then adds 3.14

55
Demo: Float Equality

float_equality.c
Floating Point Arithmetic
Float arithmetic is an issue with most languages, not just C!
• http://geocar.sdf1.org/numbers.html

57
Let’s Get Real
What would be nice to have in a real number representation?
• Represent widest range of numbers possible ✔
• Flexible “floating” decimal point ✔
• Represent scientific notation numbers, e.g. 1.2 x 10^6 ✔
• Still be able to compare quickly ✔
• Have more predictable over/under-flow behavior ✔

58
Lecture Plan
• Representing real numbers
• Fixed Point
• Floating Point
• Example and Properties
• Floating Point Arithmetic
• Floating Point in C

59
Floating Point in C
• C Guarantees Two Levels
– float single precision
– double double precision
• Conversions/Casting
– Casting between int, float, and double changes bit representation
– double/float → int
• Truncates fractional part
• Like rounding toward zero
• Not defined when out of range or NaN: Generally sets to TMin
– int → double
• Exact conversion, as long as int has ≤ 53 bit word size
– int → float
• Will round according to rounding mode
60
Ariane 5: A Bug and A Crash

• On June 4, 1996, the Ariane 5 rocket self-destructed just 37 seconds after
liftoff
• Cost: $500 million
• Cause: An overflow in the conversion
from a 64 bit floating point number to a
16 bit signed integer
• A design flaw:
– 5 times faster than Ariane 4
– Reused same software
specifications from Ariane 4
– Ariane 4 assumes horizontal
velocity would never overflow a
16-bit number
© Fourmy/REA/SABA/Corbis
61
0x5F3759DF or The Fast Inverse Square Root

The fast inverse square root implementation from Quake III Arena, including the exact original comment text

62
Floating Point Puzzles
• For each of the following C expressions, either:
– Argue that it is true for all argument values
– Explain why not true

Given:
int x = …;
float f = …;
double d = …;
Assume neither d nor f is NaN.

• x == (int)(float) x             False
• x == (int)(double) x            True
• f == (float)(double) f          True
• d == (float) d                  False
• f == -(-f);                     True
• 2/3 == 2/3.0                    False
• d < 0.0 ⇒ ((d*2) < 0.0)         True (OF?)
• d > f ⇒ -f > -d                 True
• d * d >= 0.0                    True (OF?)
• (d+f)-d == f                    False
63
Floats Summary
• IEEE Floating Point is a carefully-thought-out standard. It’s complicated,
but engineered for its goals.
• Floats have an extremely wide range, but cannot represent every number
in that range.
• Some approximation and rounding may occur! This means you definitely
don’t want to use floats e.g. for currency.
• Associativity does not hold for numbers far apart in the range
• Equality comparison operations are often unwise.

64
Additional Reading

What Every Computer Scientist Should Know About Floating-Point Arithmetic, David Goldberg, ACM Computing Surveys, 23(1), 1991
65
Recap
• Representing real numbers
• Fixed Point
• Floating Point
• Example and Properties
• Floating Point Arithmetic
• Floating Point in C

Next time: How can a computer represent and manipulate more complex
data like text?
66
