IEEE Arithmetic

The document discusses IEEE floating point arithmetic including: - Definitions of bits, bytes, words for single and double precision numbers. - Examples of binary and decimal numbers with binary and decimal points. - The format of single and double precision numbers including sign, exponent, significand. - Special numbers like infinity, NaN, smallest/largest normal numbers. - Examples of common numbers in hexadecimal. - Sources of error like inexact representation of numbers like 1/3 and the smallest positive integer not represented exactly. - Machine epsilon which is the distance between 1 and the next largest number.

Uploaded by

omar shady

Chapter 1

IEEE Arithmetic

1.1 Definitions
Bit = 0 or 1
Byte = 8 bits
Word = Reals: 4 bytes (single precision) or 8 bytes (double precision)
     = Integers: 1, 2, 4, or 8 bytes, signed or unsigned

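1.2 Numbers with a decimal or binary point

Place values of the digits on either side of the point (·):

Decimal: $10^3 \quad 10^2 \quad 10^1 \quad 10^0 \;\cdot\; 10^{-1} \quad 10^{-2} \quad 10^{-3} \quad 10^{-4}$
Binary: $2^3 \quad 2^2 \quad 2^1 \quad 2^0 \;\cdot\; 2^{-1} \quad 2^{-2} \quad 2^{-3} \quad 2^{-4}$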

1.3 Examples of binary numbers


Decimal Binary
1 1
2 10
3 11
4 100
0.5 0.1
1.5 1.1
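
The place values above can be checked with a short Python sketch. The function name `binary_to_decimal` is made up for this illustration; it simply sums each digit times its power of 2.

```python
def binary_to_decimal(text: str) -> float:
    """Evaluate a binary numeral such as '1.1' using place values 2^k."""
    whole, _, frac = text.partition(".")
    value = 0.0
    # Digits left of the binary point carry weights 2^0, 2^1, ... (right to left).
    for k, digit in enumerate(reversed(whole)):
        value += int(digit) * 2.0 ** k
    # Digits right of the binary point carry weights 2^-1, 2^-2, ...
    for k, digit in enumerate(frac, start=1):
        value += int(digit) * 2.0 ** -k
    return value


# Reproduce the rows of the table above.
for numeral in ["1", "10", "11", "100", "0.1", "1.1"]:
    print(numeral, "->", binary_to_decimal(numeral))
```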

1.4 Hex numbers

{0, 1, 2, 3, ..., 9, 10, 11, 12, 13, 14, 15} = {0, 1, 2, 3, ..., 9, a, b, c, d, e, f}


1.5 4-bit unsigned integers as hex numbers

Decimal Binary Hex
1 0001 1
2 0010 2
3 0011 3
⋮ ⋮ ⋮
10 1010 a
⋮ ⋮ ⋮
15 1111 f
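
The rows elided in the table can be generated with Python's built-in format specifiers. This is only an illustrative sketch; `table_row` is a hypothetical helper name.

```python
def table_row(n: int) -> str:
    """Format n as decimal, 4-bit binary, and a single hex digit."""
    return f"{n:>2} {n:04b} {n:x}"


# Print all sixteen 4-bit unsigned integers, 0 through 15.
for n in range(16):
    print(table_row(n))
```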

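1.6 IEEE single precision format

The 32 bits, numbered 0 to 31 from left to right, are divided into three fields:

bit 0: s (1 bit)
bits 1-8: e (8 bits)
bits 9-31: f (23 bits)

$$\# = (-1)^s \times 2^{e-127} \times 1.f$$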

where s = sign
e = biased exponent
p=e-127 = exponent
1.f = significand (use binary point)
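
As a rough illustration of this layout, the sketch below uses Python's standard `struct` module to pull the three fields out of a number and then rebuilds the value from the formula. The function names are assumptions made for this example, and the reconstruction only covers normal numbers.

```python
import struct


def decode_single(x: float):
    """Return the (s, e, f) bit fields of x rounded to IEEE single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31             # bit 0 in the notes' numbering: sign
    e = (bits >> 23) & 0xFF    # bits 1-8: biased exponent
    f = bits & 0x7FFFFF        # bits 9-31: fraction
    return s, e, f


def value_from_fields(s: int, e: int, f: int) -> float:
    """Apply (-1)^s * 2^(e-127) * 1.f; valid for normal numbers (0 < e < 255)."""
    return (-1) ** s * 2.0 ** (e - 127) * (1 + f / 2 ** 23)


s, e, f = decode_single(-6.5)
print(s, e, f"{f:023b}", value_from_fields(s, e, f))
```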

1.7 Special numbers


Smallest exponent: e = 0000 0000, represents denormal numbers (1.f → 0.f)
Largest exponent: e = 1111 1111, represents ±∞ if f = 0
                  e = 1111 1111, represents NaN if f ≠ 0

Number range: e = 1111 1111 = $2^8 - 1$ = 255 is reserved
              e = 0000 0000 = 0 is reserved
so p = e − 127 satisfies
1 − 127 ≤ p ≤ 254 − 127,
−126 ≤ p ≤ 127.

Smallest positive normal number
= $1.0000\,0000 \cdots 0000 \times 2^{-126}$
≃ $1.2 \times 10^{-38}$
bin: 0000 0000 1000 0000 0000 0000 0000 0000
hex: 00800000
MATLAB: realmin('single')
Largest positive number
= $1.1111\,1111 \cdots 1111 \times 2^{127}$
= $(1 + (1 - 2^{-23})) \times 2^{127}$
≃ $2^{128}$ ≃ $3.4 \times 10^{38}$
bin: 0111 1111 0111 1111 1111 1111 1111 1111
hex: 7f7fffff
MATLAB: realmax('single')
Zero
bin: 0000 0000 0000 0000 0000 0000 0000 0000
hex: 00000000

Subnormal numbers
Allow 1.f → 0.f (in software)
Smallest positive number = $0.0000\,0000 \cdots 0001 \times 2^{-126}$
= $2^{-23} \times 2^{-126}$ ≃ $1.4 \times 10^{-45}$
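
The hex patterns in this section can be interpreted directly with a small Python sketch based on the standard `struct` module. The helper name `single_from_hex` is an assumption for this example, and the infinity and NaN patterns are one possible choice of bits that matches the rules above (e all ones, f zero or nonzero).

```python
import math
import struct


def single_from_hex(hex_digits: str) -> float:
    """Interpret 8 hex digits as an IEEE single-precision number."""
    (value,) = struct.unpack(">f", bytes.fromhex(hex_digits))
    return value


print(single_from_hex("00800000"))  # smallest positive normal number, 2^-126
print(single_from_hex("7f7fffff"))  # largest positive number
print(single_from_hex("00000000"))  # zero
print(single_from_hex("00000001"))  # smallest positive subnormal, 2^-149
print(single_from_hex("7f800000"))  # e = 1111 1111, f = 0: +infinity
print(single_from_hex("7fc00000"))  # e = 1111 1111, f != 0: NaN
```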

1.8 Examples of computer numbers


What are 1.0, 2.0, and 1/2 in hex?

$$1.0 = (-1)^0 \times 2^{(127-127)} \times 1.0$$

Therefore, s = 0, e = 0111 1111, f = 0000 0000 0000 0000 0000 000
bin: 0011 1111 1000 0000 0000 0000 0000 0000
hex: 3f80 0000

$$2.0 = (-1)^0 \times 2^{(128-127)} \times 1.0$$

Therefore, s = 0, e = 1000 0000, f = 0000 0000 0000 0000 0000 000
bin: 0100 0000 0000 0000 0000 0000 0000 0000
hex: 4000 0000

$$1/2 = (-1)^0 \times 2^{(126-127)} \times 1.0$$

Therefore, s = 0, e = 0111 1110, f = 0000 0000 0000 0000 0000 000
bin: 0011 1111 0000 0000 0000 0000 0000 0000
hex: 3f00 0000
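
These three encodings can be reproduced with Python's standard `struct` module. The sketch below packs each value as a big-endian single; `single_hex` and `single_bits` are hypothetical helper names.

```python
import struct


def single_hex(x: float) -> str:
    """Hex digits of x stored as an IEEE single-precision number."""
    return struct.pack(">f", x).hex()


def single_bits(x: float) -> str:
    """The 32 bits of x in single precision, grouped in fours."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    text = f"{bits:032b}"
    return " ".join(text[i:i + 4] for i in range(0, 32, 4))


for x in [1.0, 2.0, 0.5]:
    print(x, single_bits(x), single_hex(x))
```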

1.9 Inexact numbers


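Example:

$$\frac{1}{3} = (-1)^0 \times \frac{1}{4} \times \left(1 + \frac{1}{3}\right),$$

so that $p = e - 127 = -2$ and e = 125 = 128 − 3, or in binary, e = 0111 1101. How is f = 1/3 represented in binary? To compute the binary digits, multiply successively by 2, recording a 1 (and subtracting 1) whenever the product reaches 1 and a 0 otherwise: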

0.333 . . . 0.
0.666 . . . 0.0
1.333 . . . 0.01
0.666 . . . 0.010
1.333 . . . 0.0101
etc.

so that 1/3 exactly in binary is 0.010101.... With only 23 bits to represent f, the number is inexact and we have

f = 01010101010101010101011,

where we have rounded to the nearest binary number (here, rounded up). The machine number 1/3 is then represented as

00111110101010101010101010101011

or in hex

3eaaaaab.
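
The multiply-by-2 procedure and the final hex value can both be checked with a short Python sketch. It uses `fractions.Fraction` so that the doubling is done exactly, and the standard `struct` module to obtain the single-precision encoding; `fraction_bits` is a name invented for this example.

```python
from fractions import Fraction
import struct


def fraction_bits(x: Fraction, n: int) -> str:
    """First n binary digits of 0 <= x < 1, found by repeated doubling."""
    digits = []
    for _ in range(n):
        x *= 2
        if x >= 1:
            digits.append("1")  # the product reached 1: record a one-bit
            x -= 1
        else:
            digits.append("0")
    return "".join(digits)


# Exact expansion of 1/3 (truncated, not rounded).
print(fraction_bits(Fraction(1, 3), 24))
# Encoding of 1/3 after rounding to single precision.
print(struct.pack(">f", 1 / 3).hex())
```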

1.9.1 Find the smallest positive integer that is not exact in single precision
Let N be the smallest positive integer that is not exact. Now, I claim that

$$N - 2 = 2^{23} \times 1.11\ldots1,$$

and

$$N - 1 = 2^{24} \times 1.00\ldots0.$$

The integer N would then require a one-bit in the $2^{-24}$ position, which is not available. Therefore, the smallest positive integer that is not exact is $2^{24} + 1$ = 16 777 217. In MATLAB, single(2^24) has the same value as single(2^24+1). Since $2^{24} + 1$ is exactly halfway between the two consecutive machine numbers $2^{24}$ and $2^{24} + 2$, MATLAB rounds to the number with a final zero-bit in f, which is $2^{24}$.
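
The same behaviour can be observed outside MATLAB. The Python sketch below rounds integers near $2^{24}$ to single precision by packing and unpacking them with the standard `struct` module; `to_single` is a hypothetical helper, and the claim that the conversion rounds to nearest with ties going to the even significand is an assumption about the platform's default rounding mode.

```python
import struct


def to_single(x: float) -> float:
    """Round x to the nearest single-precision number (returned as a Python float)."""
    return struct.unpack(">f", struct.pack(">f", float(x)))[0]


n = 2 ** 24
print(to_single(n - 1) == n - 1)  # 2^24 - 1 is still exact
print(to_single(n + 1) == n)      # halfway case: rounds to 2^24
print(to_single(n + 2) == n + 2)  # the next machine number
print(to_single(n + 3) == n + 4)  # halfway case: rounds to 2^24 + 4
```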

1.10 Machine epsilon


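Machine epsilon ($\epsilon_{\text{mach}}$) is the distance between 1 and the next larger machine number. If $0 \le \delta < \epsilon_{\text{mach}}/2$, then $1 + \delta = 1$ in computer math. Also, since

$$x + y = x(1 + y/x),$$

if x is a positive power of 2 and $0 \le y/x < \epsilon_{\text{mach}}/2$, then $x + y = x$ in computer math. For other positive x the threshold on y/x is smaller, by a factor of up to 2.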

Find $\epsilon_{\text{mach}}$
The number 1 in the IEEE format is written as

$$1 = 2^0 \times 1.000\ldots0,$$

with 23 0's following the binary point. The number just larger than 1 has a 1 in the 23rd position after the binary point. Therefore,

$$\epsilon_{\text{mach}} = 2^{-23} \approx 1.192 \times 10^{-7}.$$

What is the distance between 1 and the number just smaller than 1? Here, the number just smaller than one can be written as

$$2^{-1} \times 1.111\ldots1 = 2^{-1}(1 + (1 - 2^{-23})) = 1 - 2^{-24}.$$

Therefore, this distance is $2^{-24} = \epsilon_{\text{mach}}/2$.
The spacing between numbers is uniform between powers of 2, with logarithmic spacing of the powers of 2. That is, the spacing of numbers between 1 and 2 is $2^{-23}$, between 2 and 4 is $2^{-22}$, between 4 and 8 is $2^{-21}$, etc. This spacing changes for denormal numbers, where the spacing is uniform all the way down to zero.
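
The spacings quoted above can be measured with a Python sketch that steps to the neighbouring single-precision number by adding or subtracting 1 from the 32-bit pattern. The helper names are assumptions for this example, and the stepping trick is only meant for positive, finite, nonzero inputs that are already single-precision numbers.

```python
import struct


def to_single(x: float) -> float:
    """Round x to the nearest single-precision number."""
    return struct.unpack(">f", struct.pack(">f", float(x)))[0]


def step_single(x: float, step: int) -> float:
    """Neighbour of a positive single x: step=+1 is the next number up, -1 the next down."""
    (bits,) = struct.unpack(">I", struct.pack(">f", float(x)))
    return struct.unpack(">f", struct.pack(">I", bits + step))[0]


eps = step_single(1.0, +1) - 1.0
print(eps == 2.0 ** -23)                       # distance to the number above 1
print(1.0 - step_single(1.0, -1) == eps / 2)   # distance to the number below 1
for x in [1.0, 2.0, 4.0]:
    print(x, step_single(x, +1) - x)           # spacing doubles at each power of 2
print(to_single(1.0 + eps / 4) == 1.0)         # delta < eps/2 is rounded away
```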

Find the machine number just greater than 5


A rough estimate would be $5(1 + \epsilon_{\text{mach}}) = 5 + 5\epsilon_{\text{mach}}$, but this is not exact. The exact answer can be found by writing

$$5 = 2^2\left(1 + \frac{1}{4}\right),$$

so that the next largest number is

$$2^2\left(1 + \frac{1}{4} + 2^{-23}\right) = 5 + 2^{-21} = 5 + 4\epsilon_{\text{mach}}.$$
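
This result can be verified by assembling the two numbers from their bit fields. The Python sketch below is illustrative only; `single_from_fields` is a made-up helper that places s, e, and f into a 32-bit pattern and reads it back as a single.

```python
import struct


def single_from_fields(s: int, e: int, f: int) -> float:
    """Build a single-precision number from sign, biased exponent, and fraction bits."""
    bits = (s << 31) | (e << 23) | f
    return struct.unpack(">f", struct.pack(">I", bits))[0]


e = 2 + 127            # p = 2, so the biased exponent is 129
f_five = 1 << 21       # fraction bits 0100...0, i.e. 1/4
five = single_from_fields(0, e, f_five)
nxt = single_from_fields(0, e, f_five + 1)  # add a one-bit in the last position

eps = 2.0 ** -23
print(five)
print(nxt - five == 4 * eps)  # exact spacing just above 5
print(nxt - five == 5 * eps)  # the rough estimate does not match
```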

1.11 IEEE double precision format


Most computations take place in double precision, where round-off error is reduced, and all of the above calculations in single precision can be repeated for double precision. The 64 bits, numbered 0 to 63 from left to right, are divided into three fields:

bit 0: s (1 bit)
bits 1-11: e (11 bits)
bits 12-63: f (52 bits)

$$\# = (-1)^s \times 2^{e-1023} \times 1.f$$

where s = sign
e = biased exponent
p=e-1023 = exponent
1.f = significand (use binary point)
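
As a sketch of how the single-precision calculations carry over, the Python code below extracts the double-precision fields with the standard `struct` module and compares the resulting machine epsilon, $2^{-52}$, with the value Python reports in `sys.float_info.epsilon`. The function name `decode_double` is an assumption for this example.

```python
import struct
import sys


def decode_double(x: float):
    """Return the (s, e, f) bit fields of an IEEE double-precision number."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    s = bits >> 63                 # bit 0: sign
    e = (bits >> 52) & 0x7FF       # bits 1-11: biased exponent
    f = bits & ((1 << 52) - 1)     # bits 12-63: fraction
    return s, e, f


print(decode_double(1.0))
print(decode_double(-6.5))
# With 52 fraction bits, the number just above 1 is 1 + 2^-52.
print(sys.float_info.epsilon == 2.0 ** -52)
```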

1.12 Roundoff error example


Consider solving the quadratic equation

$$x^2 + 2bx - 1 = 0,$$

where b is a parameter. The quadratic formula yields the two solutions

$$x_\pm = -b \pm \sqrt{b^2 + 1}.$$

Consider the solution with b > 0 and x > 0 (the $x_+$ solution) given by

$$x = -b + \sqrt{b^2 + 1}. \tag{1.1}$$

As $b \to \infty$,

$$\begin{aligned}
x &= -b + \sqrt{b^2 + 1} \\
&= -b + b\sqrt{1 + 1/b^2} \\
&= b\left(\sqrt{1 + 1/b^2} - 1\right) \\
&\approx b\left(1 + \frac{1}{2b^2} - 1\right) \\
&= \frac{1}{2b}.
\end{aligned}$$
Now in double precision, realmin ≈ $2.2 \times 10^{-308}$ and we would like x to be accurate to this value before it goes to 0 via denormal numbers. Therefore, x should be computed accurately to $b \approx 1/(2 \times \text{realmin}) \approx 2 \times 10^{307}$. What happens if we compute (1.1) directly? Then x = 0 when $b^2 + 1 = b^2$, or $1 + 1/b^2 = 1$. That is, $1/b^2 = \epsilon_{\text{mach}}/2$, or $b = \sqrt{2}/\sqrt{\epsilon_{\text{mach}}} \approx 10^8$.
For a subroutine written to compute the solution of a quadratic for a general user, this is not good enough. The way for a software designer to solve this problem is to compute the solution for x as

$$x = \frac{1}{b + \sqrt{b^2 + 1}}.$$

In this form, if $b^2 + 1 = b^2$, then $x = 1/(2b)$, which is the correct asymptotic form.
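
The two formulas can be compared numerically in double precision, which is what Python floats use. The sketch below is a minimal illustration with hypothetical function names. Note that it forms b*b directly, so it would overflow for b above roughly $10^{154}$; reaching the full range discussed above would need additional rescaling that is not shown here.

```python
import math


def root_naive(b: float) -> float:
    """Equation (1.1) evaluated directly: subtracts two nearly equal numbers."""
    return -b + math.sqrt(b * b + 1.0)


def root_stable(b: float) -> float:
    """Rewritten form 1 / (b + sqrt(b^2 + 1)): no cancellation."""
    return 1.0 / (b + math.sqrt(b * b + 1.0))


# Compare both forms with the asymptotic value 1/(2b) as b grows.
for b in [1.0, 1.0e4, 1.0e8, 1.0e12]:
    print(b, root_naive(b), root_stable(b), 1.0 / (2.0 * b))
```

At b = 10^8 the direct formula already returns 0, while the rewritten form agrees with 1/(2b).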
