floating_point

The document discusses floating point arithmetic, focusing on the representation of real numbers in computers. It explains the structure of floating point numbers, including the base, mantissa, and exponent, and presents a toy floating point number system as an example. The goals include understanding computer number representation and the implications of floating point arithmetic in numerical computation.

Uploaded by

thamsanqadube008


MATH 3511

Lecture 4. Floating Point Arithmetic

Dmitriy Leykekhman

Spring 2012

Goals
- Basic understanding of computer representation of numbers
- Basic understanding of floating point arithmetic
- Consequences of floating point arithmetic for numerical computation

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 1

Representation of Real Numbers

In everyday life we use the decimal representation of numbers. For example,

1234.567

for us means

\[ 1 \cdot 10^{3} + 2 \cdot 10^{2} + 3 \cdot 10^{1} + 4 \cdot 10^{0} + 5 \cdot 10^{-1} + 6 \cdot 10^{-2} + 7 \cdot 10^{-3}. \]

More generally,

\[ \dots d_j \dots d_1 d_0 \, . \, d_{-1} \dots d_{-i} \dots \]

represents

\[ \dots + d_j \cdot 10^{j} + \dots + d_1 \cdot 10^{1} + d_0 \cdot 10^{0} + d_{-1} \cdot 10^{-1} + \dots + d_{-i} \cdot 10^{-i} + \dots . \]
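
The positional rule can be checked mechanically. The following is a minimal sketch, assuming a non-negative decimal string as input; the helper name `decimal_expansion` is hypothetical and not part of the lecture. It lists each digit with its power of 10 and confirms that the terms sum back to the original number.

```python
from decimal import Decimal

def decimal_expansion(text):
    """Return (digit, power) pairs for a non-negative decimal string."""
    _sign, digits, exponent = Decimal(text).as_tuple()
    # The last stored digit carries the power `exponent`;
    # the first digit therefore carries exponent + len(digits) - 1.
    top = exponent + len(digits) - 1
    return [(d, top - k) for k, d in enumerate(digits)]

terms = decimal_expansion("1234.567")
print(" + ".join(f"{d}*10^{p}" for d, p in terms))

# Summing the terms exactly recovers the original number.
assert sum(Decimal(d).scaleb(p) for d, p in terms) == Decimal("1234.567")
```

The leading digit 1 is paired with the power 3, so the highest power is one less than the number of digits before the decimal point.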

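Let $\beta \ge 2$ be an integer. For every $x \in \mathbb{R}$ there exist an integer $e$ and integers $d_i \in \{0, \dots, \beta - 1\}$, $i = 0, 1, \dots$, such that

\[ x = \operatorname{sign}(x) \left( \sum_{i=0}^{\infty} d_i \beta^{-i} \right) \beta^{e}. \tag{1} \]

The representation is unique if one requires that $d_0 > 0$ when $x \neq 0$ (and excludes expansions that end in an infinite run of the digit $\beta - 1$, such as $0.999\ldots = 1$ in base 10).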

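Example

\[ \frac{11}{2} = 5 \cdot 10^{0} + 5 \cdot 10^{-1} = (5.5)_{10}, \]

\[ \frac{11}{2} = 1 \cdot 2^{2} + 0 \cdot 2^{1} + 1 \cdot 2^{0} + 1 \cdot 2^{-1} = \left( 1 \cdot 2^{0} + 0 \cdot 2^{-1} + 1 \cdot 2^{-2} + 1 \cdot 2^{-3} \right) \cdot 2^{2} = (1.011)_2 \cdot 2^{2}. \]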
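
The example can be reproduced for any base. The following is a minimal sketch, assuming $x > 0$ and exact rational input; the function name `normalized_digits` and its `count` parameter are hypothetical. It scales $x$ until the leading digit $d_0$ is nonzero, then reads off the first few digits of representation (1).

```python
from fractions import Fraction

def normalized_digits(x, beta=2, count=4):
    """Return (e, digits) with x = (d0.d1d2...)_beta * beta**e and d0 > 0."""
    x = Fraction(x)
    if x <= 0:
        raise ValueError("this sketch handles positive x only")
    e = 0
    # Scale x into [1, beta) so that the leading digit d0 is nonzero.
    while x >= beta:
        x /= beta
        e += 1
    while x < 1:
        x *= beta
        e -= 1
    digits = []
    for _ in range(count):
        d = int(x)          # the integer part is the next digit
        digits.append(d)
        x = (x - d) * beta  # shift the remaining fraction one place left
    return e, digits

print(normalized_digits(Fraction(11, 2), beta=2))   # (2, [1, 0, 1, 1])
print(normalized_digits(Fraction(11, 2), beta=10))  # (0, [5, 5, 0, 0])
```

Using `Fraction` keeps every step exact, so the digits are not affected by the rounding that the lecture goes on to discuss.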

Floating Point Numbers
In a computer only a finite subset of all real numbers can be represented. These are the so-called floating point numbers, and they are of the form

\[ \bar{x} = (-1)^{s} \left( \sum_{i=0}^{m-1} d_i \beta^{-i} \right) \beta^{e} \]

with $s \in \{0, 1\}$, $d_i \in \{0, \dots, \beta - 1\}$, $i = 0, 1, \dots, m - 1$, and $e \in \{e_{\min}, \dots, e_{\max}\}$.

- $\beta$ is called the base,
- $\sum_{i=0}^{m-1} d_i \beta^{-i}$ is the significand or mantissa, and $m$ is the mantissa length,
- $e$ is the exponent, and $\{e_{\min}, \dots, e_{\max}\}$ is the exponent range.
- If $\beta = 2$, then we say the floating point number system is a binary system. In this case the $d_i$'s are called bits.
- If $\beta = 10$, then we say the floating point number system is a decimal system. In this case the $d_i$'s are called digits.
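
To make the roles of the sign, mantissa, and exponent concrete, here is a minimal sketch that evaluates a floating point number from its parts. The function name `fl_value` and its argument layout are hypothetical, chosen only for illustration.

```python
from fractions import Fraction

def fl_value(s, digits, e, beta=2):
    """Exact value of (-1)**s * (sum_i d_i * beta**(-i)) * beta**e."""
    if any(not 0 <= d < beta for d in digits):
        raise ValueError("each digit must lie in {0, ..., beta - 1}")
    # digits[i] is d_i, so the mantissa length m is len(digits).
    mantissa = sum(Fraction(d, beta**i) for i, d in enumerate(digits))
    return (-1) ** s * mantissa * Fraction(beta) ** e

print(fl_value(0, [1, 0, 1, 1], 2))     # 11/2
print(fl_value(1, [5, 5], 0, beta=10))  # -11/2
```

The first call uses a binary system with $m = 4$ bits, the second a decimal system with $m = 2$ digits; both encode a number of magnitude $11/2$.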

A Toy Floating Point Number System
Consider the floating point number system
\[ \beta = 2, \quad m = 3, \quad e_{\min} = -1, \quad e_{\max} = 2. \]
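
A system this small can be listed exhaustively. The sketch below assumes that only normalized numbers ($d_0 > 0$) are wanted and leaves out zero and the negative values; that restriction is an assumption made here and is not stated in the text above. The function name `positive_normalized` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def positive_normalized(beta=2, m=3, e_min=-1, e_max=2):
    """All positive values with m base-beta digits, d0 > 0, e in [e_min, e_max]."""
    values = set()
    for e in range(e_min, e_max + 1):
        for d0 in range(1, beta):  # normalization assumption: d0 > 0
            for tail in product(range(beta), repeat=m - 1):
                digits = (d0, *tail)
                mantissa = sum(Fraction(d, beta**i) for i, d in enumerate(digits))
                values.add(mantissa * Fraction(beta) ** e)
    return sorted(values)

toy = positive_normalized()
print(len(toy), float(toy[0]), float(toy[-1]))  # 16 0.5 7.0
print([float(v) for v in toy])
```

Under this assumption there are 16 positive numbers, from 0.5 to 7, and the gap between neighbours doubles each time the exponent increases by one.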