Notebook 4: Floating-point arithmetic (lecture notes)

Scientific notation. A floating-point number is represented as a tuple (sign, significand, exponent); its value is $\pm\,\text{significand} \times 2^{\text{exponent}}$, with the significand carrying a fixed number of digits. Because a format holds only finitely many digits, a result that is too big to represent exactly must be rounded; consequently, two programs that are equivalent in exact arithmetic can produce two different results in floating-point.

Normalization. According to the standard, for any nonzero value the significand has the form 1.xxx...x in binary, so the leading digit is always implicitly equal to 1 and there is no need to store it. That yields less storage needed and more range.

Range. The exponent range is slightly asymmetric. In single precision, the smallest possible positive normalized value is $2^{-126} \approx 1.18 \times 10^{-38}$. (Fun fact from the notes: a quark, the smallest known particle, measures roughly $10^{-18}$ meters.)

Machine epsilon. Machine epsilon, $\epsilon$, is the largest relative error after rounding: it is the smallest positive value that, added to 1.0, produces a result greater than 1.0; anything smaller gets lost. In single precision, $\epsilon = 2^{-23} \approx 1.19 \times 10^{-7}$; in double precision, the default in Python, $\epsilon = 2^{-52} \approx 2.22 \times 10^{-16}$.

Framework for error analysis. Given finite time, we want a suitably accurate algorithm. Two complementary techniques:

Forward stability analysis: calculation of an upper bound on the forward error, the absolute difference between what the algorithm (the program) computes and the exact function it represents: $|\mathrm{alg}(x) - f(x)|$. A small forward error means $\mathrm{alg}(x)$ closely approximates $f(x)$.

Backward stability analysis: showing that the algorithm computes an exact answer to a slightly wrong problem, i.e., $\mathrm{alg}(x) = f(x + \Delta x)$ for some small backward error $\Delta x$.

Suppose $\mathrm{alg}(x)$ approximates a function $f$, the true function is smooth (differentiable everywhere), and the input data $x$ can be noisy. Taylor's theorem, $f(x + \Delta x) \approx f(x) + f'(x)\,\Delta x$, allows us to approximate the forward error from the backward error: $|\mathrm{alg}(x) - f(x)| \approx |f'(x)|\,|\Delta x|$, so a small backward error implies a small forward error unless $|f'(x)|$ is large.

Example: summation. Use a temporary scalar variable to hold a running sum. Let $s_i$ denote the partial sum in exact arithmetic and $\hat{s}_i$ the computed (rounded) partial sum. Expanding the accumulated rounding errors (a double sum) shows that the computed sum is the sum of the input values with each $x_i$ perturbed by some $\Delta_i$, that is, $\hat{s}_{n-1} = \sum_{i=0}^{n-1} x_i (1 + \Delta_i)$, and that is exactly the form of a backward stable computation. When is $\Delta$ small, i.e., how large can any $\Delta_i$ be? Each individual rounding error $\delta$ satisfies $|\delta| \leq \epsilon$; applying the triangle inequality to the at most $n$ accumulated error factors gives $|\Delta_i| \leq n\epsilon$.
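A quick numerical check of the machine-epsilon claims above for double precision; this small demo cell is an addition, not part of the original notes:

eps_d = 2.0**(-52)               # double-precision machine epsilon
print(1.0 + eps_d > 1.0)         # True: epsilon survives addition to 1.0
print(1.0 + eps_d/2 == 1.0)      # True: anything smaller is lost to rounding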
Part 0: Representing numbers as strings
The following exercises are designed to reinforce your understanding of how we can view the encoding of a number as a string of digits in a given base.
If you are interested in exploring this topic in more depth, see the "Floating-Point Arithmetic" section
(https://docs.python.org/3/tutorial/floatingpoint.html) of the Python documentation.
Integers as strings
Consider the string of digits:
'16180339887'
If you are told this string is for a decimal number, meaning the base of its digits is ten (10), then its value is given by

$(16180339887)_{10} = 1 \cdot 10^{10} + 6 \cdot 10^9 + 1 \cdot 10^8 + 8 \cdot 10^7 + 0 \cdot 10^6 + 3 \cdot 10^5 + 3 \cdot 10^4 + 9 \cdot 10^3 + 8 \cdot 10^2 + 8 \cdot 10^1 + 7 \cdot 10^0 = 16{,}180{,}339{,}887.$

Now consider a different string of digits:

'100111010'

If you are told this string is for a binary number, meaning its base is two (2), then its value is

$(100111010)_2 = 1 \cdot 2^8 + 1 \cdot 2^5 + 1 \cdot 2^4 + 1 \cdot 2^3 + 1 \cdot 2^1 = 314.$
Bases greater than ten (10). Observe that when the base is at most ten, the digits are the usual decimal digits, 0, 1, 2, ..., 9. What happens when the base is greater than ten? For this notebook, suppose we are interested in bases that are at most 36; then, we will adopt the convention of using lowercase Roman letters, a, b, c, ..., z for "digits" whose values correspond to 10, 11, 12, ..., 35.
Before moving on to the next exercise, run the following code cell. It has three functions, which are used in some of the testing code. Given a base, one of these functions checks whether a single-character input string is a valid digit; another returns a list of all valid string digits. (The third one simply prints the valid digit list, given a base.) If you want some additional practice reading code, you might inspect these functions.
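Note: the cell below calls is_valid_strdigit(), whose definition did not survive in this extract. A minimal sketch consistent with how it is used (a single-character check against the valid digits for the base) might look like the following; treat the details as an assumption rather than the notebook's exact code.

def is_valid_strdigit(c, base=2):
    if type(c) is not str or len(c) != 1:
        return False  # reject anything that is not a single character
    # map '0'-'9' and 'a'-'z' to the values 0-35, then check against the base
    digit_values = '0123456789abcdefghijklmnopqrstuvwxyz'
    return (c in digit_values) and (digit_values.index(c) < base)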
def valid_strdigits(base=2):
    POSSIBLE_DIGITS = '0123456789abcdefghijklmnopqrstuvwxyz'
    return [c for c in POSSIBLE_DIGITS if is_valid_strdigit(c, base)]

def print_valid_strdigits(base=2):
    valid_list = valid_strdigits(base)
    if not valid_list:
        msg = '(none)'
    else:
        msg = ', '.join([c for c in valid_list])
    print('The valid base ' + str(base) + ' digits: ' + msg)
# Quick demo:
print_valid_strdigits(6)
print_valid_strdigits(16)
print_valid_strdigits(23)
Exercise 0 (3 points). Write a function, eval_strint(s, base). It takes a string of digits, s, in the base given by base. It returns its value as an integer.

That is, this function implements the mathematical object that converts a string to its numerical value, assuming its digits are given in the base base. For example:
eval_strint('100111010', base=2) == 314
Hint: Python makes this exercise very easy. Search Python's online documentation for information about the int() constructor to see how you can apply it to solve this problem. (You have encountered this constructor already, in Lab/Notebook 2.)
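Following that hint, a minimal sketch of a solution (not necessarily the notebook's reference implementation) can be as short as:

def eval_strint(s, base=2):
    # int() accepts an explicit base between 2 and 36, using 'a'-'z' for 10-35
    return int(s, base)

For instance, eval_strint('100111010', base=2) returns 314, matching the example above.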
(Passed!)
(Passed!)
(Passed!)
Fractional values
Recall that we can extend the basic string representation to include a fractional part by interpreting digits to the right of the "fractional point" (i.e., the '.') as having negative indices. For instance,

$(1.25)_{10} = 1 \cdot 10^0 + 2 \cdot 10^{-1} + 5 \cdot 10^{-2}.$

Or, in general, for a base-$b$ string $(d_k d_{k-1} \cdots d_0 \, . \, d_{-1} d_{-2} \cdots d_{-m})_b$, the value is

$\sum_{i=-m}^{k} d_i \, b^i.$
Exercise 1 (4 points). Suppose a string of digits s in base base contains up to one fractional point. Complete the function, eval_strfrac(s, base), so that it returns its corresponding floating-point value. Your function should always return a value of type float, even if the input happens to correspond to an exact integer.

Examples:

eval_strfrac('1.25', base=10) == 1.25
eval_strfrac('100.101', base=2) == 4.625
Comment. Because of potential floating-point roundoff errors, as explained in the videos, conversions based on the general polynomial formula given previously will not be exact. The testing code will include a built-in tolerance to account for such errors.
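Here is one possible sketch of eval_strfrac(), reusing the helpers defined earlier; the reference solution may differ.

def eval_strfrac(s, base=2):
    assert is_valid_strfrac(s, base), \
        "'{}' contains invalid digits for a base-{} number.".format(s, base)
    # split into whole and fractional parts around the (optional) point
    whole, _, frac = s.partition('.')
    v = float(int(whole, base)) if whole else 0.0
    for i, c in enumerate(frac):
        v += int(c, base) * (base ** -(i + 1))  # digit at index -(i+1)
    return float(v)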
In [6]: def is_valid_strfrac(s, base=2):
            # valid if every non-point character is a valid digit
            # and there is at most one fractional point in the string
            return all([is_valid_strdigit(c, base) for c in s if c != '.']) \
                and (len([c for c in s if c == '.']) <= 1)
print("\n(Passed!)")
(Passed!)
def check_random_strfrac():
    from random import randint
    b = randint(2, 36) # base
    d = randint(0, 5)  # leading digits
    r = randint(0, 5)  # trailing digits
    v_true = 0.0
    s = ''
    possible_digits = valid_strdigits(b)
    for i in range(-r, d+1):
        v_i = randint(0, b-1)
        s_i = possible_digits[v_i]
        # (the remainder of this loop body was truncated in this extract
        #  and is reconstructed here)
        v_true += v_i * (b ** i)  # accumulate the exact value
        s = s_i + s               # prepend the digit for position i
        if i == -1:
            s = '.' + s           # insert the fractional point
    v_you = eval_strfrac(s, base=b)
    assert abs(v_you - v_true) <= 1e-8 * abs(v_true), \
        "eval_strfrac('{}', base={}): you computed {} instead of {}".format(s, b, v_you, v_true)

for _ in range(10):
    check_random_strfrac()
print("\n(Passed!)")
(Passed!)
Floating-point encodings
Recall that a floating-point encoding or format is a normalized scientific notation consisting of a base, a sign, a fractional significand or mantissa, and a signed integer exponent. Conceptually, think of it as a tuple of the form, $(\pm, s, e)_b$, where $b$ is the digit base (e.g., decimal, binary); $\pm$ is the sign; $s$ is the significand encoded as a base-$b$ string; and $e$ is the exponent. For simplicity, let's assume that only the significand is encoded in base $b$ and that $e$ has an integer value. Mathematically, the value of this tuple is $\pm \, s \times b^e$.

IEEE double-precision. For instance, Python, R, and MATLAB, by default, store their floating-point values in a standard tuple representation known as IEEE double-precision format. It's a 64-bit binary encoding having the following components:

- The sign: 1 bit.
- The significand: a 53-bit binary string of the form 1.xx...x, of which only the 52 fraction bits are stored (the leading 1 is implicit).
- The exponent: an 11-bit signed integer.

Thus, the smallest positive normalized value in this format is $2^{-1022} \approx 10^{-308}$, and the smallest positive value greater than 1 is $1 + \epsilon$, where $\epsilon = 2^{-52} \approx 2.2 \times 10^{-16}$ is known as machine epsilon (in this case, for double-precision).
Special values. You might have noticed that the exponent range is slightly asymmetric. Part of the reason is that the IEEE floating-point encoding can also represent several kinds of special values, such as infinities and an odd bird called "not-a-number" or NaN. This latter value, which you may have seen if you have used any standard statistical packages, can be used to encode certain kinds of floating-point exceptions that result when, for instance, you try to divide zero by zero.
If you are familiar with languages like C, C++, or Java, then IEEE double-precision format is the same as the double primitive type. The other common format is single-precision, which is float in those same languages.
Inspecting a floating-point number in Python. Python provides support for looking at floating-point values directly! Given any floating-point value v (that is, type(v) is float), the method v.hex() returns a string representation of its encoding. It's easiest to see by example, so run the following code cell:
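The cell uses a helper, print_fp_hex(), whose definition did not survive extraction; presumably it just prints the value alongside its hex encoding, along these lines:

def print_fp_hex(v):
    assert type(v) is float
    # show the value and its hexadecimal encoding side by side
    print("v = {} ==> v.hex() == '{}'".format(v, v.hex()))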
print_fp_hex(0.0)
print_fp_hex(1.0)
print_fp_hex(16.0625)
print_fp_hex(-0.1)
Thus, to convert this string back into the floating-point value, you could do the following:
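The standard tool for the reverse conversion is the float.fromhex() class method, the inverse of .hex():

s = (16.0625).hex()       # '0x1.0100000000000p+4'
v = float.fromhex(s)      # back to 16.0625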
For example, here is how you can get 16.0625 back from its hex() representation, '0x1.0100000000000p+4':
16.0625
Exercise 2 (4 points). Write a function, fp_bin(v), that determines the IEEE-754 tuple representation of any double-precision floating-point value. That is, given the variable v such that type(v) is float, it should return a tuple with three components, (s_sign, s_signif, v_exp), such that

s_sign is a string representing the sign bit, encoded as either a '+' or '-' character;
s_signif is the significand, which should be a string of 54 characters having the form, x.xxx...x, where there are (at most) 53 x bits (0 or 1 values);
v_exp is the value of the exponent and should be an integer.
For example:
v = -1280.03125
assert v.hex() == '-0x1.4002000000000p+10'
assert fp_bin(v) == ('-', '1.0100000000000010000000000000000000000000000000000000', 10)
There are many ways to approach this problem. One we came up with exploits the observation that each hexadecimal digit of the string returned by v.hex() expands to exactly four binary digits, and applies an idea in this Stackoverflow post: https://stackoverflow.com/questions/1425493/convert-hex-to-binary
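For reference, here is one hedged sketch along those lines; it relies only on float.hex() and string manipulation, and the padding width (52 fraction bits) matches the expected outputs shown below. The reference solution may differ.

def fp_bin(v):
    assert type(v) is float
    s_hex = v.hex()
    s_sign = '-' if s_hex.startswith('-') else '+'
    # strip the sign and the '0x' prefix, then split off the exponent
    mantissa, _, exp = s_hex.lstrip('-')[2:].partition('p')
    v_exp = int(exp)
    lead, _, frac_hex = mantissa.partition('.')
    # each hex digit expands to exactly four binary digits
    frac_bin = ''.join('{:04b}'.format(int(h, 16)) for h in frac_hex)
    s_signif = lead + '.' + frac_bin.ljust(52, '0')  # pad to 52 fraction bits
    return (s_sign, s_signif, v_exp)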
0.0 [0x0.0p+0] ==
('+', '0.0000000000000000000000000000000000000000000000000000', 0)
vs. you: ('+', '0.0000000000000000000000000000000000000000000000000000', 0)
-0.1 [-0x1.999999999999ap-4] ==
('-', '1.1001100110011001100110011001100110011001100110011010', -4)
vs. you: ('-', '1.1001100110011001100110011001100110011001100110011010', -4)
1.0000000000000002 [0x1.0000000000001p+0] ==
('+', '1.0000000000000000000000000000000000000000000000000001', 0)
vs. you: ('+', '1.0000000000000000000000000000000000000000000000000001', 0)
(Passed!)
print("\n(Passed.)")
-1280.03125 [-0x1.4002000000000p+10] ==
('-', '1.0100000000000010000000000000000000000000000000000000', 10)
vs. you: ('-', '1.0100000000000010000000000000000000000000000000000000', 10)
6.2831853072 [0x1.921fb544486e0p+2] ==
('+', '1.1001001000011111101101010100010001001000011011100000', 2)
vs. you: ('+', '1.1001001000011111101101010100010001001000011011100000', 2)
-0.7614972118393695 [-0x1.85e2f669b0c80p-1] ==
('-', '1.1000010111100010111101100110100110110000110010000000', -1)
vs. you: ('-', '1.1000010111100010111101100110100110110000110010000000', -1)
(Passed.)
Exercise 3 (2 points). Suppose you are given a floating-point value in a base given by base and in the form of the tuple, (sign, significand, exponent), where

sign is either the character '+' if the value is positive and '-' otherwise;
significand is a string representation of the significand in base base;
exponent is an integer representing the exponent value.

Complete the function, eval_fp(sign, significand, exponent, base), so that it converts the tuple into a numerical value (of type float) and returns it.
One of the two test cells below uses your implementation of fp_bin() from a previous exercise. If you are encountering errors you cannot figure out, it's possible that there is still an unresolved bug in fp_bin() that its test cell did not catch.
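A minimal sketch, reusing eval_strfrac() from earlier (again, the reference solution may differ):

def eval_fp(sign, significand, exponent, base=2):
    # value = (+/-) significand * base**exponent
    v = eval_strfrac(significand, base) * (base ** exponent)
    return -v if sign == '-' else v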
print("\n(Passed.)")
('+', '1.25000', -1)_{10} ~= 0.125: You computed 0.125, which differs by 0.0.
(Passed.)
for _ in range(5):
    (v_true, sign, significand, exponent) = gen_rand_fp_bin()
    check_eval_fp(sign, significand, exponent, v_true, base=2)
print("\n(Passed.)")
(Passed.)
Exercise 4 (2 points). Suppose you are given two binary floating-point values, u and v, in the tuple form given above. That is, u == (u_sign, u_signif, u_exp) and v == (v_sign, v_signif, v_exp), where the base for both u and v is two (2). Complete the function, add_fp_bin(u, v, signif_bits), so that it returns the sum of these two values with the resulting significand truncated to signif_bits digits.
Note 0: Assume that signif_bits includes the leading 1. For instance, suppose signif_bits == 4. Then the significand will have the form, 1.xxx.
Note 1: You may assume that u_signif and v_signif use signif_bits bits (including the leading 1). Furthermore, you may assume each uses far fewer bits than the underlying native floating-point type (float) does, so that you can use native floating-point to compute intermediate values.
Hint: The test cell above defines a function, fp_bin(v), which you can use to convert a Python native floating-point value (i.e., type(v) is float) into a binary tuple representation.
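One hedged sketch, which leans on the earlier eval_fp() and fp_bin() and on Note 1's guarantee that native floats can hold the intermediate sum exactly:

def add_fp_bin(u, v, signif_bits):
    # convert both tuples to native floats and add them exactly
    w = eval_fp(*u, base=2) + eval_fp(*v, base=2)
    (w_sign, w_signif, w_exp) = fp_bin(w)
    # truncate the significand to signif_bits digits; '+1' accounts for the point
    return (w_sign, w_signif[:signif_bits + 1], w_exp)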
u = ('+', '1.010010', 0)
v = ('-', '1.000000', -2)
w_true = ('+', '1.000010', 0)
check_add_fp_bin(u, v, 7, w_true)
u = ('+', '1.00000', 0)
v = ('+', '1.00000', -5)
w_true = ('+', '1.00001', 0)
check_add_fp_bin(u, v, 6, w_true)
u = ('+', '1.00000', 0)
v = ('-', '1.00000', -5)
w_true = ('+', '1.11110', -1)
check_add_fp_bin(u, v, 6, w_true)
u = ('+', '1.00000', 0)
v = ('+', '1.00000', -6)
w_true = ('+', '1.00000', 0)
check_add_fp_bin(u, v, 6, w_true)
u = ('+', '1.00000', 0)
v = ('-', '1.00000', -6)
w_true = ('+', '1.11111', -1)
check_add_fp_bin(u, v, 6, w_true)
print("\n(Passed!)")
('+', '1.010010', 0) + ('-', '1.000000', -2) == ('+', '1.000010', 0): You produced ('+', '1.000010', 0).
('+', '1.00000', 0) + ('+', '1.00000', -5) == ('+', '1.00001', 0): You produced ('+', '1.00001', 0).
('+', '1.00000', 0) + ('-', '1.00000', -5) == ('+', '1.11110', -1): You produced ('+', '1.11110', -1).
('+', '1.00000', 0) + ('+', '1.00000', -6) == ('+', '1.00000', 0): You produced ('+', '1.00000', 0).
('+', '1.00000', 0) + ('-', '1.00000', -6) == ('+', '1.11111', -1): You produced ('+', '1.11111', -1).
(Passed!)
Done! You've reached the end of part0. Be sure to save and submit your work. Once you are satisfied, move on to part1.
Floating-point arithmetic
As a data analyst, you will be concerned primarily with numerical programs in which the bulk of the computational work involves floating-point computation. This notebook guides you through some of the most fundamental concepts in how computers store real numbers, so you can be smarter about number crunching.
This code invokes Python's decimal (https://docs.python.org/3/library/decimal.html) package, which implements base-10 floating-point arithmetic in software.

from decimal import Decimal
x = 1.0 + 2.0**(-52)  # reconstructed: matches the printed output below
print(x)
print(Decimal(x)) # What does this do?
1.0000000000000002
1.0000000000000002220446049250313080847263336181640625
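The third output line below appears to come from a follow-up cell that is missing from this extract. A computation that reproduces that exact value (an assumption, since the cell itself is gone) is the difference between the binary double nearest 0.1 and the true decimal 0.1:

# Decimal(0.1) captures the binary double exactly; Decimal('0.1') is exact decimal
print(Decimal(0.1) - Decimal('0.1'))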
5.551115123125782702118158340E-18
Aside: If you ever need true decimal storage with no loss of precision (e.g., an accounting application), turn to the decimal package. Just be warned that it might be slower. See the following experiment for a practical demonstration.
from random import random

NUM_TRIALS = 2500000

print("Native arithmetic:")
A_native = [random() for _ in range(NUM_TRIALS)]
B_native = [random() for _ in range(NUM_TRIALS)]
%timeit [a+b for a, b in zip(A_native, B_native)]
print("\nDecimal package:")
A_decimal = [Decimal(a) for a in A_native]
B_decimal = [Decimal(b) for b in B_native]
%timeit [a+b for a, b in zip(A_decimal, B_decimal)]
Native arithmetic:
212 ms ± 4.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Decimal package:
583 ms ± 11.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The same and not the same. Consider the following two program fragments:
Program 1:
s = a - b
t = s + b
Program 2:
s = a + b
t = s - b
Let a = 1.0 and $b = \epsilon_d / 2 \approx 1.11 \times 10^{-16}$, where $\epsilon_d$ is machine epsilon for IEEE-754 double-precision. Recall that we do not expect these programs to return the same value; let's run some Python code to see.
Note: The IEEE standard guarantees that, given two finite-precision floating-point values, the result of applying any binary operator to them is the same as if the operator were applied in infinite-precision and then rounded back to finite-precision. The precise nature of rounding can be controlled by so-called rounding modes; the default rounding mode is "round-half-to-even" (http://en.wikipedia.org/wiki/Rounding).
In [4]: a = 1.0
b = 2.**(-53) # == $\epsilon_d$ / 2.0
s1 = a - b
t1 = s1 + b
s2 = a + b
t2 = s2 - b
print("s1:", s1.hex())
print("t1:", t1.hex())
print("\n")
print("s2:", s2.hex())
print("t2:", t2.hex())
print("")
print(t1, "vs.", t2)
print("(t1 == t2) == {}".format(t1 == t2))
s1: 0x1.fffffffffffffp-1
t1: 0x1.0000000000000p+0
s2: 0x1.0000000000000p+0
t2: 0x1.fffffffffffffp-1
By the way, the NumPy/SciPy package, which we will cover later in the semester, allows you to determine machine epsilon in a portable way. Just take note of this fact for now.

Here is an example of printing machine epsilon for both single-precision and double-precision values.
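The code cell itself did not survive extraction; the standard way to query these values with NumPy is np.finfo:

import numpy as np
print(np.finfo(np.float32).eps)  # single precision: ~1.19e-07
print(np.finfo(np.float64).eps)  # double precision: ~2.22e-16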
Forward stability. One way to show that an algorithm alg(x) is good is to show that

$|\mathrm{alg}(x) - f(x)|$

is "small" for all x of interest to your application. What is small depends on context. In any case, if you can show it, then you can claim that the algorithm is forward stable.
Backward stability. Sometimes it is not easy to show forward stability directly. In such cases, you can also try a different technique, which is to show that the algorithm is, instead, backward stable.
In particular, alg(x) is a backward stable algorithm to compute f(x) if, for all x, there exists a "small" Δx such that

$\mathrm{alg}(x) = f(x + \Delta x).$

In other words, if you can show that the algorithm produces the exact answer to a slightly different input problem (meaning Δx is small, again in some context-dependent sense), then you can claim that the algorithm is backward stable.
Round-off errors. You already know that numerical values can only be represented finitely, which introduces round-off error. Thus, at the very least, you should hope that a scheme to compute f(x) is as insensitive to round-off errors as possible. In other words, given that there will be round-off errors, if you can prove that alg(x) is either forward or backward stable, then that will give you some measure of confidence that your algorithm is good.

Here is the "standard model" of round-off error. Start by assuming that every scalar floating-point operation incurs some bounded error. That is, let $a \odot b$ be the exact mathematical result of some operation on the inputs, a and b, and let $\mathrm{fl}(a \odot b)$ be the computed value, after rounding in finite-precision. The standard model says that

$\mathrm{fl}(a \odot b) \equiv (a \odot b)(1 + \delta),$

where $\delta$ is an unknown error satisfying $|\delta| \leq \epsilon$, machine epsilon.
Example: summation. Let $x \equiv (x_0, \ldots, x_{n-1})$ be a collection of $n$ values whose sum we wish to compute:

$f(x) \equiv \sum_{i=0}^{n-1} x_i.$

Given x, let's also denote its exact sum by the synonym $s_{n-1} \equiv f(x)$.
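The program analyzed below, alg_sum(), did not survive extraction. Based on the line numbers referenced in the induction proof later in this notebook (line 4 is the accumulation, line 5 the return), it is presumably of this form:

def alg_sum(p): # p = p[0:n]
    s = 0.
    for p_i in p: # loop iteration i
        s += p_i
    return s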
In exact arithmetic, meaning without any rounding errors, this program would compute the exact sum. (See also the note below.) However, you know that finite arithmetic means there will be some rounding error after each addition.
Let $\delta_i$ denote the (unknown) error at iteration i. Then, assuming the collection x represents the input values exactly, you can show that alg_sum() computes $\hat{s}_{n-1}$ where

$\hat{s}_{n-1} \approx s_{n-1} + \sum_{i=0}^{n-1} s_i \delta_i,$
that is, the exact sum plus a perturbation, which is the second term (the sum). The question, then, is under what conditions this perturbation will be small. With a little algebra, you can rewrite the computed sum as

$\hat{s}_{n-1} \approx \sum_{i=0}^{n-1} x_i (1 + \Delta_i) = f(x + \Delta),$

where $\Delta \equiv (\Delta_0, \Delta_1, \ldots, \Delta_{n-1})$. In other words, the computed sum is the exact solution to a slightly different problem, $x + \Delta$.
Moreover, each perturbation can be bounded:

$|\Delta_i| \leq (n - i)\epsilon \leq n\epsilon,$

where $\epsilon$ is machine precision. Thus, as long as $n\epsilon \ll 1$, the algorithm is backward stable, and you should expect the computed result to be close to the true result. Interpreted differently, as long as you are summing $n \ll \frac{1}{\epsilon}$ values, then you needn't worry about the accuracy of the computed result compared to the true result.
Based on this result, you can probably surmise why double-precision is usually the default in many languages.
In the case of this summation, we can quantify not just the backward error (i.e., the $\Delta_i$) but also the forward error. In that case, it turns out that

$|\hat{s}_{n-1} - s_{n-1}| \lesssim n \epsilon \|x\|_1.$
Note: Analysis in exact arithmetic. We claimed above that alg_sum() is correct in exact arithmetic, i.e., in the absence of round-off error. You probably have a good sense of that just from reading the code.

However, if you wanted to argue about its correctness more formally, you might do so as follows, using the technique of proof by induction (https://en.wikipedia.org/wiki/Mathematical_induction). When your loops are more complicated and you want to prove that they are correct, you can often adapt this technique to your problem.
First, assume that the for loop enumerates each element p[i] in order from i=0 to n-1, where n=len(p). That is, assume p_i is p[i].

Let $p_k \equiv$ p[k] be the k-th element of p[:]. Let $s_i \equiv \sum_{k=0}^{i} p_k$; in other words, $s_i$ is the exact mathematical sum of p[:i+1]. Thus, $s_{n-1}$ is the exact sum of p[:].

Let $\hat{s}_{-1}$ denote the initial value of the variable s, which is 0. For any $i \geq 0$, let $\hat{s}_i$ denote the computed value of the variable s immediately after the execution of line 4, where i = i. When i = i = 0, $\hat{s}_0 = \hat{s}_{-1} + p_0 = p_0$, which is the exact sum of p[:1]. Thus, $\hat{s}_0 = s_0$.

Now suppose that $\hat{s}_{i-1} = s_{i-1}$. When i = i, we want to show that $\hat{s}_i = s_i$. After line 4 executes, $\hat{s}_i = \hat{s}_{i-1} + p_i = s_{i-1} + p_i = s_i$. Thus, the computed value $\hat{s}_i$ is the exact sum $s_i$.

If i = n, then, at line 5, the value $s = \hat{s}_{n-1} = s_{n-1}$, and thus the program must, in line 5, return the exact sum.
print(t)
print("\n(Passed!)")
n rel_err rel_err_bound
0 10 1.110223e-16 2.220446e-15
(Passed!)
Now suppose we store the values of x and y exactly in two Python arrays, x[0:n] and y[0:n]. Further suppose we compute their dot product by the program, alg_dot().
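The cell defining alg_dot() is missing from this extract; presumably it forms the elementwise products and then calls alg_sum() on them, e.g.:

def alg_dot(x, y):
    p = [xi * yi for (xi, yi) in zip(x, y)]  # elementwise products
    s = alg_sum(p)                           # then sum them
    return s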
Exercise 1 (OPTIONAL -- 0 points, not graded or collected). Show under what conditions alg_dot() is backward stable.

Hint. Let $(x_k, y_k)$ denote the exact values of the corresponding inputs, (x[k], y[k]). Then the true dot product is $x^T y = \sum_{l=0}^{n-1} x_l y_l$. Next, let $\hat{p}_k$ denote the k-th computed product, i.e., $\hat{p}_k \equiv x_k y_k (1 + \gamma_k)$, where $\gamma_k$ is the k-th round-off error and $|\gamma_k| \leq \epsilon$. Then apply the results for alg_sum() to analyze alg_dot().
Answer. Following the hint, alg_sum will compute $\hat{s}_{n-1}$ on the computed inputs, $\{\hat{p}_k\}$. Thus,

$$\hat{s}_{n-1} \approx \sum_{l=0}^{n-1} \hat{p}_l (1 + \Delta_l) = \sum_{l=0}^{n-1} x_l y_l (1 + \gamma_l)(1 + \Delta_l) = \sum_{l=0}^{n-1} x_l y_l (1 + \gamma_l + \Delta_l + \gamma_l \Delta_l).$$
Mathematically, this appears to be the exact dot product to an input in which x is exact and y is perturbed (or vice-versa). To argue that alg_dot() is backward stable, we need to establish under what conditions the perturbation, $|\gamma_l + \Delta_l + \gamma_l \Delta_l|$, is "small." Since $|\gamma_l| \leq \epsilon$ and $|\Delta_l| \leq n\epsilon$,

$$|\gamma_l + \Delta_l + \gamma_l \Delta_l| \leq \epsilon + n\epsilon + n\epsilon^2 \approx (n + 1)\epsilon,$$

which is small as long as $n\epsilon \ll 1$, just as for alg_sum().
Consider the sum of just four values, $s = x_0 + x_1 + x_2 + x_3$. For the standard algorithm, let the i-th addition incur a roundoff error, $\delta_i$. Then our usual error analysis would reveal that the absolute error in the computed sum, $\hat{s}$, is approximately

$$\hat{s} - s \approx x_0(\delta_0 + \delta_1 + \delta_2 + \delta_3) + x_1(\delta_1 + \delta_2 + \delta_3) + x_2(\delta_2 + \delta_3) + x_3 \delta_3.$$

And since $|\delta_i| \leq \epsilon$, you would bound the absolute value of the error by

$$|\hat{s} - s| \lesssim (4|x_0| + 3|x_1| + 2|x_2| + 1|x_3|)\epsilon.$$
Exercise 2 (3 points). Based on the preceding observation, implement a new summation function, alg_sum_accurate(x), that computes a more accurate sum than alg_sum().
Hint 1. You do not need Decimal() in this problem. Some of you will try to use it, but it's not necessary.
Hint 2. Some of you will try to "implement" the error formula to somehow compensate for the round-off error. But that shouldn't make sense to do. (Why not? Because the formula above is a bound, not an exact formula.) Instead, the intent of this problem is to see if you can look at the formula and understand how to interpret it. That is, what does the formula tell you?
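Reading the bound above, the earliest-summed values carry the largest error coefficients. One hedged sketch that follows from that reading (not necessarily the reference solution) is simply to sum the values in order of increasing magnitude:

def alg_sum_accurate(x):
    # the bound weights early summands most heavily,
    # so add the smallest-magnitude values first
    return alg_sum(sorted(x, key=abs))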
print("Running alg_sum()...")
s_alg = [alg_sum(x[:n]) for n in N]
print("==>", s_alg)
print("Running alg_sum_accurate()...")
s_acc = [alg_sum_accurate(x[:n]) for n in N]
print("==>", s_acc)
print("\n(Passed!)")