CHAP 1 Intro MathBackGround
Chapter 1: Introduction
and Mathematical Background
University of Calgary
Schulich School of Engineering
Department of Geomatics Engineering
u Common base-b representations
o Decimal (b = 10) or base-10 representation – digits: 0,1,2,3,4,5,6,7,8,9
² $[12345.67]_{10} = 1\cdot10^{4} + 2\cdot10^{3} + 3\cdot10^{2} + 4\cdot10^{1} + 5\cdot10^{0} + 6\cdot10^{-1} + 7\cdot10^{-2}$
² The RHS of the formula above converts a number from any base to the decimal one
o Binary (b = 2) or base-2 representation – digits: 0,1
² $[11111.111]_{2} = 1\cdot2^{4} + 1\cdot2^{3} + 1\cdot2^{2} + 1\cdot2^{1} + 1\cdot2^{0} + 1\cdot2^{-1} + 1\cdot2^{-2} + 1\cdot2^{-3}$
$= 16 + 8 + 4 + 2 + 1 + (7/8) = [31.875]_{10}$
Hex  0    1    2    3    4    5    6    7    8    9    A    B    C    D    E    F
Dec  0    1    2    3    4    5    6    7    8    9    10   11   12   13   14   15
Oct  00   01   02   03   04   05   06   07   10   11   12   13   14   15   16   17
Bin  0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
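The positional expansions above can be checked numerically. The following is a minimal sketch, not part of the original slides; the function name `to_decimal` is hypothetical, and it handles only unsigned digit strings in bases 2 to 16.

```python
def to_decimal(digits: str, base: int) -> float:
    """Evaluate a base-`base` digit string such as '11111.111' as a decimal value."""
    symbols = "0123456789ABCDEF"

    def digit_value(d: str) -> int:
        v = symbols.index(d)
        if v >= base:
            raise ValueError(f"digit {d!r} is not valid in base {base}")
        return v

    int_part, _, frac_part = digits.upper().partition(".")
    value = 0.0
    # Integer digits carry the non-negative powers b^k, k = 0, 1, 2, ...
    for k, d in enumerate(reversed(int_part)):
        value += digit_value(d) * base ** k
    # Fractional digits carry the negative powers b^-k, k = 1, 2, 3, ...
    for k, d in enumerate(frac_part, start=1):
        value += digit_value(d) * base ** -k
    return value


print(to_decimal("11111.111", 2))  # 31.875
print(to_decimal("2BAD", 16))      # 11181.0
```

The result is a Python float, so very long digit strings are themselves subject to round-off.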
u Notes:
o Numbers with finite representation in one base may have infinite representation in
another, e.g., $[0.1]_{10} = [0.0\,0011\,0011\,0011\ldots]_{2}$ ⇨ truncation and round-off errors
o There is also a need to store very large and very small numbers, for which we use a
floating point representation with signs, exponents and mantissas ⇨ overflow and
underflow errors
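Both notes can be observed directly in any language that uses binary floating point. The sketch below assumes Python floats are IEEE 754 double-precision numbers, which holds on common platforms but is an assumption here, not something stated in the slides.

```python
import sys
from decimal import Decimal

# Decimal(float) prints the exact binary value that is stored for the literal 0.1.
print(Decimal(0.1))
# The individual round-off errors do not cancel, so this comparison is False.
print(0.1 + 0.2 == 0.3)

largest = sys.float_info.max   # largest finite double
print(largest * 10)            # overflow: the result is inf

smallest = 5e-324              # smallest positive (subnormal) double
print(smallest / 2)            # underflow: the result is 0.0
```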
ENDG 407 - Computational Numerical Methods 1.9
Introduction (11)
o Most computers store and process numbers in binary form
² 0 and 1 can be represented by the off and on position of a switch (transistor)
² 0's and 1's are called binary digits or bits
² Thus there is a need to convert numbers from decimal to binary (and between any two bases)
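One common way to carry out such a conversion is repeated division of the integer part by the target base and repeated multiplication of the fractional part. The sketch below illustrates that approach only; it is not necessarily one of the numbered procedures of the lecture, and the name `from_decimal` and the `max_frac_digits` cut-off are hypothetical.

```python
from fractions import Fraction


def from_decimal(x, base: int, max_frac_digits: int = 20) -> str:
    """Convert a non-negative decimal value (given as a string) to base `base`."""
    symbols = "0123456789ABCDEF"
    x = Fraction(x)                    # exact arithmetic, no float round-off
    int_part, frac = divmod(x, 1)
    int_part = int(int_part)
    # Integer part: repeated division by the base; the remainders are the digits.
    int_digits = ""
    while int_part > 0:
        int_part, r = divmod(int_part, base)
        int_digits = symbols[r] + int_digits
    # Fractional part: repeated multiplication; the integer parts are the digits.
    frac_digits = ""
    while frac > 0 and len(frac_digits) < max_frac_digits:
        frac *= base
        d = int(frac)
        frac_digits += symbols[d]
        frac -= d
    return (int_digits or "0") + ("." + frac_digits if frac_digits else "")


print(from_decimal("22.5", 2))      # 10110.1
print(from_decimal("0.1", 2, 13))   # 0.0001100110011
```

Stopping after `max_frac_digits` digits is exactly the kind of cut-off that introduces a representation error when the expansion does not terminate.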
u Notes:
o Procedure 1 is preferred when a < b and Procedure 2 is preferred when a > b
Introduction (14)
o When converting between decimal and binary by hand, it is convenient to use the
octal representation as an intermediate step because the 8⟷2 conversion is very simple,
i.e., instead of 10⟷2 directly, we can carry out a 10⟷8⟷2 conversion in two steps:
² 1st step: 10⟷8 with one of the two procedures discussed previously
² 2nd step: 8⟷2 by translating groups of three binary digits directly into single octal digits using
this table
Oct  0   1   2   3   4   5   6   7
Bin  000 001 010 011 100 101 110 111
² e.g., $[551.624]_{8} = [101\ 101\ 001.110\ 010\ 100]_{2}$
² Proof: If we start from a fractional binary number $x_2 = [0.b_1b_2b_3b_4b_5b_6\ldots]_2$ we can express it as
$x_2 = b_1 2^{-1} + b_2 2^{-2} + b_3 2^{-3} + b_4 2^{-4} + b_5 2^{-5} + b_6 2^{-6} + \ldots$
$= (4b_1 + 2b_2 + b_3)\,8^{-1} + (4b_4 + 2b_5 + b_6)\,8^{-2} + \ldots = c_1 8^{-1} + c_2 8^{-2} + \ldots = [0.c_1c_2c_3\ldots]_8$
In the above, we have set the sums in the parentheses equal to the ci coefficients in base-8
because, since each bi is either 0 or 1, the sums in the parentheses are digits between 0 and 7,
i.e., the digits of base-8
o With exactly the same logic, when converting between decimal and hexadecimal by
hand, it is convenient to use 10⟷8⟷2⟷16 because the 16⟷2 conversion is also very
simple: each group of four binary digits translates directly into a hexadecimal digit
² e.g., $[\mathrm{2BAD}]_{16} = [0010\ 1011\ 1010\ 1101]_{2}$
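The grouping rule for 2⟷8 and 2⟷16 can be sketched in a few lines. This is an illustrative helper only; the name `regroup` is hypothetical, and it covers just the binary-to-octal and binary-to-hexadecimal direction for unsigned numbers.

```python
def regroup(bits: str, group: int) -> str:
    """Translate a binary string into base 2**group (3 -> octal, 4 -> hexadecimal)."""
    symbols = "0123456789ABCDEF"
    int_part, _, frac_part = bits.partition(".")
    # Pad with zeros so that both parts split into whole groups:
    # on the left for the integer part, on the right for the fractional part.
    int_part = int_part.zfill(-(-len(int_part) // group) * group)
    frac_part = frac_part.ljust(-(-len(frac_part) // group) * group, "0")

    def digits(part: str) -> str:
        return "".join(symbols[int(part[i:i + group], 2)]
                       for i in range(0, len(part), group))

    out = digits(int_part) or "0"
    return out + ("." + digits(frac_part) if frac_part else "")


print(regroup("101101001.110010100", 3))  # 551.624
print(regroup("0010101110101101", 4))     # 2BAD
```

Padding away from the radix point is the one detail that is easy to get wrong by hand: `11111.111` must be read as `011 111.111` before translating to octal.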
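[Figure: bit layout of the binary floating point representation. SP (32 bits): 1 sign bit, 8 exponent bits, 23 mantissa bits. DP (64 bits): 1 sign bit, 11 exponent bits, 52 mantissa bits.]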
o The bias is introduced in order to avoid using one of the bits for the sign of the exponent
² The largest exponent that can be stored with 11 bits is $2^{11} - 1 = 2047$. The bias that is used is 1023:
² the largest positive actual exponent is 1024, resulting in exp. + bias = 2047 = $[11111111111]_2$
² the smallest negative actual exponent is −1023, resulting in exp. + bias = 0 = $[00000000000]_2$
o Example 1.6: How is the number 22.5 stored in BFPR in double precision?
normalization: $22.5 = (22.5/2^{4}) \cdot 2^{4} = (22.5/16) \cdot 2^{4} = 1.40625 \cdot 2^{4}$
binary 1-bit sign: 0
binary 11-bit exponent + bias: $4 + 1023 = 1027 = [10000000011]_2$
binary 52-bit mantissa: $0.40625 = [0.011010000\ldots0000]_2$
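The three fields of Example 1.6 can be read back from the stored bit pattern. The sketch below assumes the platform stores Python floats as IEEE 754 doubles; `double_bits` is a hypothetical helper name.

```python
import struct


def double_bits(x: float):
    """Split the 64-bit pattern of a double into sign, exponent and mantissa strings."""
    # Pack as a big-endian double, then reinterpret the same 8 bytes as an integer.
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    bits = f"{raw:064b}"
    return bits[0], bits[1:12], bits[12:]


sign, exponent, mantissa = double_bits(22.5)
print(sign)                     # 0
print(exponent)                 # 10000000011
print(int(exponent, 2) - 1023)  # 4
print(mantissa[:8])             # 01101000
```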
u Example 1.7: In a computer having a 5-digit mantissa (a) what are the absolute and
relative errors of representing x = 0.003721478693, y = 0.003730230572, x-y?
(b) How many significant digits will be lost when storing the value r = x/y?
(c) How can the loss of significance be avoided/minimized?
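$\bar{x} = 0.00372$, $E_A^{x} = x - \bar{x} = 0.003721478693 - 0.00372 = 0.000001478693$
$\bar{y} = 0.00373$, $E_A^{y} = y - \bar{y} = 0.003730230572 - 0.00373 = 0.000000230572$
$x - y = -0.000008751879$, $\bar{x} - \bar{y} = -0.00001$,
$E_A^{x-y} = (x - y) - (\bar{x} - \bar{y}) = 0.000001248121$
Relative errors: $\left|\frac{E_A^{x}}{x}\right| = \frac{0.000001478693}{0.003721478693} \approx 0.040\%$, $\left|\frac{E_A^{y}}{y}\right| \approx 0.006\%$, $\left|\frac{E_A^{x-y}}{x-y}\right| \approx 14.261\%$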
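The arithmetic of Example 1.7(a) is easy to reproduce with exact decimal arithmetic. In this sketch the stored values `x_bar` and `y_bar` are simply copied from the example; the variable names are illustrative.

```python
from decimal import Decimal

x = Decimal("0.003721478693")
y = Decimal("0.003730230572")
# Stored values as given in the example.
x_bar = Decimal("0.00372")
y_bar = Decimal("0.00373")

# Absolute errors: true value minus stored value.
abs_err_x = x - x_bar
abs_err_y = y - y_bar
abs_err_diff = (x - y) - (x_bar - y_bar)

# Relative errors: magnitude of the absolute error over the true value.
rel_err_x = abs(abs_err_x / x)
rel_err_y = abs(abs_err_y / y)
rel_err_diff = abs(abs_err_diff / (x - y))

for name, rel in (("x", rel_err_x), ("y", rel_err_y), ("x - y", rel_err_diff)):
    print(name, f"{100 * float(rel):.3f} %")
```

The relative error of the difference is several hundred times larger than that of either operand, which is the loss of significance caused by subtracting two nearly equal numbers.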
q Truncation error $E_{TR} = x - \hat{x}$
u The difference between the true solution x and the solution x̂ calculated by the
specific approximate mathematical algorithm employed to solve the problem
u Example 1.8: Computers use a series expansion to compute sine values. The
magnitude of ETR depends on the number of terms that are retained in the
expansion. E.g., for $x = \pi/6$, $y = \sin(x)$ can be approximated as follows:
$$y = \sin(x) = x - \frac{x^{3}}{3!} + \frac{x^{5}}{5!} - \frac{x^{7}}{7!} + \frac{x^{9}}{9!} - \frac{x^{11}}{11!} + \ldots$$
all terms: $y = \sin(\pi/6) = 0.5000000$ (exact solution)
1 term: $\hat{y}_1 = \pi/6 \approx 0.5235988$, $E_{TR}^{y} = y - \hat{y}_1 = -0.0235988$
2 terms: $\hat{y}_2 = \pi/6 - \frac{(\pi/6)^{3}}{3!} \approx 0.4996742$, $E_{TR}^{y} = y - \hat{y}_2 = +0.0003258$
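The decay of the truncation error with the number of retained terms can be tabulated with a short script. This is a sketch for illustration, with a hypothetical function name `sin_series`; it is not a description of how any particular library evaluates the sine.

```python
import math


def sin_series(x: float, n_terms: int) -> float:
    """Sum the first n_terms of x - x^3/3! + x^5/5! - ..."""
    return sum((-1) ** k * x ** (2 * k + 1) / math.factorial(2 * k + 1)
               for k in range(n_terms))


x = math.pi / 6
for n in (1, 2, 3, 4):
    y_hat = sin_series(x, n)
    # Columns: number of terms, approximation, truncation error y - y_hat.
    print(n, f"{y_hat:.7f}", f"{math.sin(x) - y_hat:+.7f}")
```

The error alternates in sign because the series is alternating: each partial sum overshoots or undershoots the exact value in turn.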
q Total Error $E_{TE} = E_{RO} + E_{TR}$
u The total (round-off plus truncation) error of the numerical solution
o We can again define a true, an absolute and a relative total error
Introduction (23)
u Rule 3: The result of addition and subtraction should show significant digits only
as far to the right as is seen in the least precise number in the calculation
o The insignificant digits following the least significant figure should be rounded to the
least significant figure. E.g.,
o 3.51 + 2.246 + 0.0192 = 5.7752 → 5.78
o 3.510 + 2.246 + 0.0192 = 5.7752 → 5.775
o 1725.463 - 189.2 = 1536.263 → 1536.3
o 23578.3 + 0.1892 = 23578.4892 → 23578.5
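Rule 3 can be mechanized with decimal arithmetic. The sketch below assumes every term is written with an explicit decimal point and that ties are rounded half up; both are assumptions of this illustration, and `add_with_rule_3` is a hypothetical name.

```python
from decimal import Decimal, ROUND_HALF_UP


def add_with_rule_3(*terms: str) -> Decimal:
    """Sum signed decimal strings, rounding to the fewest decimal places present."""
    values = [Decimal(t) for t in terms]
    # The exponent of a Decimal is minus its number of decimal places,
    # so the largest exponent belongs to the least precise term.
    least_precise = max(v.as_tuple().exponent for v in values)
    total = sum(values)
    return total.quantize(Decimal(10) ** least_precise, rounding=ROUND_HALF_UP)


print(add_with_rule_3("3.51", "2.246", "0.0192"))   # 5.78
print(add_with_rule_3("3.510", "2.246", "0.0192"))  # 5.775
print(add_with_rule_3("1725.463", "-189.2"))        # 1536.3
print(add_with_rule_3("23578.3", "0.1892"))         # 23578.5
```

Passing the terms as strings matters: the trailing zero of `3.510` is significant and would be lost in a float.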
u Rule 4: The result of multiplication or division should show significant digits
only as far to the right as seen in the number with the fewest significant digits
o The insignificant digits following the least significant figure should be rounded to the
least significant figure. E.g.,
o 2.43 × 17.675 = 42.950250 → 42.95
o 75.22 ÷ 25.1 = 2.9968127 → 3.0
o 75.220 ÷ 25.100 = 2.9968127 → 2.997
$|x^{2} - 4| < \varepsilon$ if $|x - 2| < \varepsilon / (5 + \varepsilon)$
[Figure: graph of f(x) near x = a, marking the limit L and the distance δ.]
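o For f(x) = |x| / x, $\lim_{x \to 0} f(x) = L$ is not true for any L, i.e., the limit does not exist, because
the ε-δ condition fails for some ε > 0.
Take ε = 1. For the limit to exist, we must have | |x|/x − L | < 1 whenever 0 < |x − 0| < δ, i.e., there must
be an L that satisfies simultaneously |−1−L| < 1 and |1−L| < 1. The first condition requires −2 < L < 0 and the second 0 < L < 2, so no such L exists!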
q Continuity of a function
u f(x) is continuous at x = a if $\lim_{x \to a} f(x) = f(a)$
o Useful in finding bounds for the order of magnitude of errors in numerical methods
$$f'(c) = \left.\frac{df(x)}{dx}\right|_{x=c} = \frac{f(b) - f(a)}{b - a}$$
q Integral of a function
u Indefinite integral or antiderivative: The inverse of the derivative
$$f(x) = F'(x) = \frac{dF(x)}{dx} \;\Leftrightarrow\; F(x) = \int f(x)\,dx = \int F'(x)\,dx$$
u Definite integral (it is a number I) defined on a closed interval [a,b]:
$$I = \int_{a}^{b} f(x)\,dx = \lim_{\Delta x_i \to 0} \sum_{i=1}^{N} f(c_i)\,\Delta x_i$$
$$I = \int_{a}^{b} f(x)\,dx = f(c)\,(b - a)$$
u f(c) above is the average value ⟨ f ⟩ of the
function in the interval [a,b], i.e.,
$$\langle f \rangle = \frac{1}{b - a} \int_{a}^{b} f(x)\,dx$$
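The definite integral as a limit of sums, and the average value that follows from it, can be approximated numerically. The sketch below uses a midpoint Riemann sum with equal subintervals, which is one simple choice of the points c_i and not necessarily the one intended in the slides; the function name is hypothetical.

```python
import math


def riemann_sum(f, a: float, b: float, n: int) -> float:
    """Midpoint Riemann sum of f over [a, b] with n equal subintervals."""
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx


a, b = 0.0, math.pi
integral = riemann_sum(math.sin, a, b, 1000)
average = integral / (b - a)   # average value <f> on [a, b]
c = math.asin(average)         # a point in (a, b) where f(c) equals the average

print(integral)  # close to 2
print(average)   # close to 2/pi
print(c)
```

For sin(x) on [0, π] the exact integral is 2, so the average value is 2/π and the mean value theorem for integrals is satisfied at c = arcsin(2/π) (and, by symmetry, at π − c).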
u Second fundamental theorem of calculus
o If f ∈ C[a,b] and c ∈ (a,b), then ∀ x ∈ [a,b]
$$\frac{d}{dx}\left(\int_{c}^{x} f(\xi)\,d\xi\right) = f(x)$$
o Useful in evaluating the derivative of a definite integral
ENDG 407 - Computational Numerical Methods 1.28
Mathematical Background (6)
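$$f(x) = f(x_o) + (x - x_o)\left.\frac{df}{dx}\right|_{x_o} + \frac{(x - x_o)^{2}}{2!}\left.\frac{d^{2}f}{dx^{2}}\right|_{x_o} + \frac{(x - x_o)^{3}}{3!}\left.\frac{d^{3}f}{dx^{3}}\right|_{x_o} + \ldots + \frac{(x - x_o)^{n}}{n!}\left.\frac{d^{n}f}{dx^{n}}\right|_{x_o} + \frac{(x - x_o)^{n+1}}{(n+1)!}\left.\frac{d^{n+1}f}{dx^{n+1}}\right|_{\xi}$$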
u Example 1.11: (a) Expand ln x into a Taylor series around x_o = 1 in the interval
[1,2]. (b) What is the upper bound of the truncation error of computing ln 2 with
an accuracy of 1 part in $10^{8}$? How many terms are needed to obtain this accuracy?
How many terms would we need to obtain ln 1.5 with the same accuracy?
(a) Recall that d(ln x)/dx = 1/x, therefore
$$\ln(x) = (x - 1) - \frac{1}{2}(x - 1)^{2} + \frac{1}{3}(x - 1)^{3} - \frac{1}{4}(x - 1)^{4} + \ldots + (-1)^{n-1}\frac{1}{n}(x - 1)^{n} + R_n(x)$$
$$R_n(x) = (-1)^{n}\,\frac{\xi^{-(n+1)}}{n+1}\,(x - 1)^{n+1}$$
(b) Since 1 < ξ < x, $\xi^{-(n+1)} < 1$ and therefore we can obtain the upper bound for $R_n$
$$|R_n(x)| = \frac{\xi^{-(n+1)}}{n+1}\,(x - 1)^{n+1} < \frac{1}{n+1}\,(x - 1)^{n+1}$$
(c) We need to have $|R_n(2)| \le 1/10^{8}$, which requires 100 million terms!
$$|R_n(2)| < \frac{1}{n+1}\,(2 - 1)^{n+1} \le \frac{1}{10^{8}} \;\Rightarrow\; n + 1 \ge 10^{8}$$
(d) We need to have $|R_n(1.5)| \le 1/10^{8}$, which requires only 22 terms!
$$|R_n(1.5)| < \frac{1}{n+1}\,(1.5 - 1)^{n+1} \le \frac{1}{10^{8}} \;\Rightarrow\; \frac{n+1}{0.5^{\,n+1}} \ge 10^{8} \;\Rightarrow\; n \ge 22$$
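The term count for ln 1.5 can be confirmed by searching for the smallest n that satisfies the remainder bound and then summing that many terms. This is a sketch with hypothetical function names; the search loop is only practical for x well below 2, since at x = 2 it would need about 10^8 iterations.

```python
import math


def terms_needed(x: float, tol: float) -> int:
    """Smallest n with (x - 1)^(n+1) / (n + 1) <= tol, intended for 1 < x < 2."""
    n = 0
    while (x - 1) ** (n + 1) / (n + 1) > tol:
        n += 1
    return n


def ln_taylor(x: float, n: int) -> float:
    """Sum of the first n terms of the Taylor series of ln(x) about x_o = 1."""
    return sum((-1) ** (k - 1) * (x - 1) ** k / k for k in range(1, n + 1))


n = terms_needed(1.5, 1e-8)
print(n)                                       # 22
print(abs(math.log(1.5) - ln_taylor(1.5, n)))  # below 1e-8
```

The contrast with ln 2 comes entirely from the factor (x − 1)^(n+1): it decays geometrically for x = 1.5 but stays equal to 1 for x = 2, leaving only the slow 1/(n + 1) decay.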
Mathematical Background (10)
u Rolle's theorem
o It is easy to see that for n = 0, the Taylor series equation becomes
$$f(x) = f(x_o) + (x - x_o)\left.\frac{df}{dx}\right|_{\xi} \;\Rightarrow\; f'(\xi) = \frac{f(x) - f(x_o)}{x - x_o}$$
which for x = b, x_o = a and ξ = c is the mean value theorem for derivatives (see p. 1.28)
o Rolle's theorem is a special case of the mean value theorem when f(a) = f(b) = 0:
If f ∈ C[a,b], f ' ∈ D(a,b) and f(a) = f(b) = 0,
then ∃ c ∈ (a,b) such that f '(c) = 0
o Interpretation: When f(a) = f(b) = 0, f(x) will have
at least one maximum or minimum in (a,b)
² Useful to find the min/max values of a function
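where
$$\left(h\frac{\partial}{\partial x} + i\frac{\partial}{\partial y}\right)^{0} f(x,y) = f$$
$$\left(h\frac{\partial}{\partial x} + i\frac{\partial}{\partial y}\right)^{1} f(x,y) = h\frac{\partial f}{\partial x} + i\frac{\partial f}{\partial y}$$
$$\left(h\frac{\partial}{\partial x} + i\frac{\partial}{\partial y}\right)^{2} f(x,y) = h^{2}\frac{\partial^{2} f}{\partial x^{2}} + 2hi\frac{\partial^{2} f}{\partial x\,\partial y} + i^{2}\frac{\partial^{2} f}{\partial y^{2}}$$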
Review Engineering Problems (1)