
CHAPTER 2

Differentiation

1. Directional and Partial Derivatives

Definition 1.1. Let U ⊆ R^d be a domain, f: U → R be a function, and v ∈ R^d − {0} be a vector. We define the directional derivative of f in the direction v at the point a by

    D_v f(a) := (d/dt) f(a + tv) |_{t=0}.

Example 1.2. If f(x) = |x|^2, then D_v f(x) = 2x · v.

Remark 1.3. Be aware that some authors define D_v f by additionally dividing by the length of v. We will never do that!

Definition 1.4. We define the i-th partial derivative of f (denoted by ∂_i f) to be the directional derivative of f in the direction e_i (where e_i is the i-th elementary basis vector).

Practically, to compute the i-th partial derivative of f, differentiate it with respect to x_i, treating all the other coordinates as constants.

Example 1.5. For x ≠ 0 we have ∂_i |x| = x_i / |x|.

2. Derivatives

Definition 2.1. Let U ⊆ R^d be a domain, f: U → R be a function, and a ∈ U. We say f is differentiable at a if there exists a linear transformation T: R^d → R such that

    lim_{h→0} ( f(a + h) − f(a) − Th ) / |h| = 0.

In this case, the linear transformation T is called the derivative of f at a, and is denoted by Df_a.

Proposition 2.2. Let U ⊆ R^d be a domain, f: U → R be a function, and a ∈ U. The function f is differentiable at a if and only if there exists a linear transformation T: R^d → R and a function e such that
(1) f(a + h) = f(a) + Th + e(h),
(2) and lim_{h→0} |e(h)|/|h| = 0.

Proposition 2.3. If f is differentiable at a, then all the directional derivatives D_v f(a) exist. Further,

    Df_a = ( ∂_1 f(a)  ∂_2 f(a)  ···  ∂_d f(a) )    and    D_v f(a) = Df_a v = Σ_{i=1}^d v_i ∂_i f(a).

Remark 2.4. This shows that the linear transformation appearing in the definition of differentiability is unique!

The converse of Proposition 2.3 is (surprisingly?) false: all directional derivatives can exist, and yet the function need not be differentiable (or even continuous!).

Example 2.5. Let f(x, y) = x^2 y / (x^4 + y^2) for (x, y) ≠ 0, and f(0) = 0. Then for every v ∈ R^2 − {0}, D_v f(0) exists, but f is not differentiable (or even continuous) at 0.
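A quick verification (worked out here; the notes state this without computation): for v = (v_1, v_2) with v_2 ≠ 0,

    f(tv) = t v_1^2 v_2 / (t^2 v_1^4 + v_2^2),    so    D_v f(0) = lim_{t→0} f(tv)/t = v_1^2 / v_2,

while D_v f(0) = 0 when v_2 = 0. Thus every directional derivative exists at 0. However, along the parabola y = x^2 we have f(x, x^2) = x^4 / (2x^4) = 1/2, which does not converge to f(0) = 0 as x → 0, so f is not even continuous at 0.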
The converse of Proposition 2.3 is true under the additional assumption that the partial derivatives are continuous.

Theorem 2.6. If all partial derivatives of f exist in a neighbourhood of a, and are continuous at a, then f is differentiable at a.

Proof. For simplicity we assume d = 2. By the mean value theorem

    f(a + h) − f(a) = f(a_1 + h_1, a_2 + h_2) − f(a_1 + h_1, a_2) + f(a_1 + h_1, a_2) − f(a_1, a_2)
                    = h_2 ∂_2 f(a_1 + h_1, a_2 + ξ_2) + h_1 ∂_1 f(a_1 + ξ_1, a_2)

for some ξ_1, ξ_2 such that ξ_i lies between 0 and h_i. Now let T be the matrix ( ∂_1 f(a)  ∂_2 f(a) ) and observe

    f(a + h) = f(a) + Th + e(h),

where

    e(h) = h_2 ( ∂_2 f(a_1 + h_1, a_2 + ξ_2) − ∂_2 f(a) ) + h_1 ( ∂_1 f(a_1 + ξ_1, a_2) − ∂_1 f(a) ).

Clearly

    |e(h)| / |h| ≤ |∂_2 f(a_1 + h_1, a_2 + ξ_2) − ∂_2 f(a)| + |∂_1 f(a_1 + ξ_1, a_2) − ∂_1 f(a)|,

which converges to 0 as h → 0. □

Note, however, it is possible for a function to be differentiable while its partial derivatives, though they exist, are discontinuous.

Example 2.7. Let f: R^2 → R be defined by f(x) = |x|^2 sin(1/|x|) when x ≠ 0, and f(0) = 0. Then f is differentiable on all of R^2 (including x = 0), and hence all partial derivatives of f exist at all points in R^2. However, ∂_1 f and ∂_2 f are not continuous at x = 0.

Definition 2.8. Let U ⊆ R^m be a domain, and a ∈ U. We say a function f: U → R^n is differentiable at a if there exists a linear transformation T: R^m → R^n and a function e such that
(1) f(a + h) = f(a) + Th + e(h),
(2) and lim_{h→0} |e(h)|/|h| = 0.

Note this is almost the same as Definition 2.1. The only change is that the linear transformation T is now a map from R^m to R^n instead.
Proposition 2.9. Let f = (f_1, ..., f_n): R^m → R^n and a ∈ R^m. The function f is differentiable at a if and only if each coordinate function f_i is differentiable at a. Further, the derivative Df_a is an n × m matrix given by

    Df_a = [ ∂_1 f_1(a)  ∂_2 f_1(a)  ···  ∂_m f_1(a) ]
           [ ∂_1 f_2(a)  ∂_2 f_2(a)  ···  ∂_m f_2(a) ]
           [     ⋮            ⋮                ⋮     ]
           [ ∂_1 f_n(a)  ∂_2 f_n(a)  ···  ∂_m f_n(a) ].

As before, the derivative Df is also called the Jacobian matrix.

3. Tangent planes and Level Sets

Let f: R^d → R^n be differentiable.

Definition 3.1. The graph of f is the set Γ ⊆ R^d × R^n defined by

    Γ = {(x, f(x)) | x ∈ R^d}.

Given a point (a, f(a)) ∈ Γ we define the tangent plane of f at the point a by the equation

    y = f(a) + Df_a (x − a).

Note that the tangent plane is a d-dimensional hyper-plane in R^d × R^n. It is the best linear approximation to the graph Γ at the point a. Projecting the tangent plane into 2 dimensions (by freezing the other coordinates) gives you a tangent line.

Definition 3.2. The tangent space to the graph Γ at the point (a, f(a)), denoted by TΓ_{(a, f(a))}, is defined by

    TΓ_{(a, f(a))} = {(x, y) | y = Df_a x, x ∈ R^d}.

Namely, the tangent space is the space of all vectors parallel to the tangent plane, and passing through the origin.

Definition 3.3. Given c ∈ R we define the level set of f to be the set {x ∈ R^d | f(x) = c}.

If d = 2, then level sets are typically curves. If d = 3, then level sets are typically surfaces. In higher dimensions (for "nice" functions) level sets of f are typically (d − 1)-dimensional hyper-surfaces.

Example 3.4. Let d = 3 and f(x) = |x|^2. Then {f(x) = c} is the sphere of radius √c for c > 0, a point for c = 0, and the empty set for c < 0.

Level sets are very useful in plotting, and are often used to produce contour plots. We will see later that if v is tangent to a level set of f, then D_v f = 0. Moreover, if f: R^d → R, then ∇f (defined to be (Df)^T) is orthogonal to level sets.
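As an illustration (a sketch of mine, not part of the notes), here is how one might produce such a contour plot with Python's NumPy and Matplotlib, using the two-variable analogue of Example 3.4; the level sets {f = c} appear as concentric circles of radius √c:

    # Contour plot of f(x, y) = x^2 + y^2: each curve is a level set {f = c}.
    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(-2.0, 2.0, 200)
    y = np.linspace(-2.0, 2.0, 200)
    X, Y = np.meshgrid(x, y)          # grid of sample points
    F = X**2 + Y**2                   # f evaluated on the grid

    cs = plt.contour(X, Y, F, levels=[0.25, 1.0, 2.25, 4.0])
    plt.clabel(cs)                    # label each level curve with its value c
    plt.gca().set_aspect("equal")     # circles should look like circles
    plt.show()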
4. Chain rule

The one variable calculus rules for differentiation of sums, products and quotients (when they make sense) are still valid in higher dimensions.

Proposition 4.1. Let f, g: R^d → R be two differentiable functions.
• f + g is differentiable and D(f + g) = Df + Dg.
• fg is differentiable and D(fg) = f Dg + g Df.
• At points where g ≠ 0, f/g is also differentiable and

    D(f/g) = (g Df − f Dg) / g^2.

These follow in a manner very similar to the one variable analogues, and are left for you to verify. The one rule that is a little different in this context is the differentiation of composites.

Theorem 4.2 (Chain rule). Let U ⊆ R^m, V ⊆ R^n be domains, and g: U → V, f: V → R^d be two differentiable functions. Then f ∘ g: U → R^d is also differentiable, and

    D(f ∘ g)_a = (Df_{g(a)})(Dg_a).

Note Df_{g(a)} and Dg_a are both matrices, and the product above is the matrix product of the two.

Proof. The basic intuition is as follows. Since f and g are differentiable we know there exist functions e_1 and e_2 such that

    g(a + h) = g(a) + Dg_a h + e_2(h)    and    f(g(a) + h) = f(g(a)) + Df_{g(a)} h + e_1(h),

with lim_{h→0} e_i(h)/|h| = 0. Consequently,

    f(g(a + h)) = f( g(a) + Dg_a h + e_2(h) )
                = f(g(a)) + Df_{g(a)} ( Dg_a h + e_2(h) ) + e_1( Dg_a h + e_2(h) )
                = f(g(a)) + Df_{g(a)} Dg_a h + e_3(h),

where

    e_3(h) = Df_{g(a)} e_2(h) + e_1( Dg_a h + e_2(h) ).

Now to finish the proof one only needs to show lim_{h→0} e_3(h)/|h| = 0, which can be done directly from the ε-δ definition. □
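To make the statement concrete, here is a small numerical sanity check (my own illustration, not from the notes; the helper jacobian is hypothetical): we approximate the Jacobians of f, g, and f ∘ g by central finite differences and confirm that D(f ∘ g)_a ≈ Df_{g(a)} Dg_a.

    # Numerical check of the chain rule D(f∘g)_a = Df_{g(a)} Dg_a,
    # for g: R^2 → R^2 and f: R^2 → R, at a sample point a.
    import numpy as np

    def jacobian(F, a, eps=1e-6):
        """Finite-difference Jacobian of F at a (rows: outputs, cols: inputs)."""
        a = np.asarray(a, dtype=float)
        cols = []
        for i in range(a.size):
            h = np.zeros_like(a)
            h[i] = eps
            cols.append((np.atleast_1d(F(a + h)) - np.atleast_1d(F(a - h))) / (2 * eps))
        return np.column_stack(cols)

    g = lambda x: np.array([x[0] * x[1], np.sin(x[0])])   # g: R^2 → R^2
    f = lambda y: np.array([y[0]**2 + y[1]])              # f: R^2 → R

    a = np.array([0.7, -1.2])
    lhs = jacobian(lambda x: f(g(x)), a)       # D(f∘g)_a
    rhs = jacobian(f, g(a)) @ jacobian(g, a)   # Df_{g(a)} Dg_a
    print(np.allclose(lhs, rhs, atol=1e-5))    # True, up to discretization error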
Remark 4.3. An extremely useful special case arises when d = 1 and we need to compute ∂_i (f ∘ g). In this case, by the chain rule,

    ∂_i (f ∘ g) = (Df_g)(Dg) e_i = Σ_{j=1}^n ( (∂_j f) ∘ g ) ∂_i g_j.

While this can be derived immediately by multiplying the matrices Df_g and Dg, it arises often enough that it is worth remembering directly.

Another version of the chain rule that often shows up in problems is as follows.
Proposition 4.4. Suppose z is a function of x and y, and x and y are in turn functions of s and t. Then

(4.1)    ∂_s z = ∂_x z ∂_s x + ∂_y z ∂_s y    and    ∂_t z = ∂_x z ∂_t x + ∂_y z ∂_t y.

Proof. Let z = f(x, y), x = g(s, t) and y = h(s, t) for some functions f, g, h. Define ψ: R^2 → R^2 by ψ = (g, h). Now equation (4.1) follows immediately by realizing z = f ∘ ψ and using the chain rule. □

Example 4.5. Compute (d/dx) x^x and (d/dt) ∫_0^t e^{−(t−s)^2} ds.
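Worked solutions (added here for convenience; the notes leave these as exercises). For the first, apply Proposition 4.4 to z = u^v with u = x and v = x:

    (d/dx) x^x = ∂_u (u^v) + ∂_v (u^v) = v u^{v−1} + u^v ln u = x^x (1 + ln x).

For the second, set G(u, t) = ∫_0^u e^{−(t−s)^2} ds, so the desired quantity is (d/dt) G(t, t). By Proposition 4.4,

    (d/dt) G(t, t) = ∂_u G(t, t) + ∂_t G(t, t) = e^{−(t−t)^2} − ∫_0^t 2(t − s) e^{−(t−s)^2} ds = 1 − (1 − e^{−t^2}) = e^{−t^2},

since the integral evaluates exactly to [ e^{−(t−s)^2} ]_{s=0}^{s=t} = 1 − e^{−t^2}.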
Proposition 4.6. Let f, g: R^d → R be differentiable. Then fg: R^d → R is differentiable and D(fg) = f(Dg) + g(Df).

Proof. Let F(x, y) = xy and G(x) = (f(x), g(x)). Then observe fg = F ∘ G, and the conclusion follows from the chain rule. □

Remark 4.7. A similar trick can be used to prove the quotient rule.

As a consequence, here is a "proof" that directional derivatives in directions tangent to level sets vanish.

Proposition 4.8. Let Γ = {x | f(x) = c} be a level set of a differentiable function f. Let γ: [−1, 1] → Γ be a differentiable function, v = Dγ(0), and a = γ(0). Then D_v f(a) = 0.

Think of γ(t) as the position of a particle at time t. If for all t, γ(t) belongs to the curve Γ, then the velocity Dγ should be tangent to the curve Γ, and thus the vector v above should be tangent to Γ. (When we can define this rigorously, we will revisit it and prove it.)

Proof. Note f ∘ γ = c (since γ(t) ∈ Γ for all t), and so D(f ∘ γ) = 0. By the chain rule D(f ∘ γ) = Df_γ Dγ, and at t = 0 this gives Df_{γ(0)} v = 0, i.e. D_v f(γ(0)) = 0 as desired. □

Definition 4.9. If f: R^d → R is differentiable, define the gradient of f (denoted by ∇f) to be the transpose of the derivative of f.

We've seen above that if v is tangent to a level set of f at a, then D_v f(a) = 0. This is equivalent to saying ∇f(a) · v = 0, or that the gradient of f is perpendicular to level sets of f. Note, in directions tangent to level sets, f changes the least. One would expect that in the perpendicular direction (given by ∇f), the function f changes the most. This is shown by the following proposition.

Proposition 4.10. Let v ∈ R^d with |v| = 1. Then D_v f(a) is maximised when v = ∇f(a)/|∇f(a)|, and minimised when v = −∇f(a)/|∇f(a)|.

Remark 4.11. This fact is often used when numerically finding minima of functions, and is known as the method of gradient descent. Namely, start with a guess x_0 for the minimum, and choose successive approximations by moving directly against ∇f. That is, define

    x_{n+1} = x_n − γ_n ∇f(x_n),

for some small γ_n. Standard numerical algorithms usually suggest using

    γ_n = ( (x_n − x_{n−1}) · (∇f(x_n) − ∇f(x_{n−1})) ) / |∇f(x_n) − ∇f(x_{n−1})|^2,

which guarantees convergence to a local minimum under certain assumptions on f.
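As a concrete illustration (mine, not from the notes; the function grad_descent and the test objective are hypothetical choices), here is a minimal Python sketch of the update above, using the step size γ_n from Remark 4.11 (often called the Barzilai–Borwein step size):

    # Gradient descent x_{n+1} = x_n − γ_n ∇f(x_n), with the step size
    # γ_n = (Δx · Δ∇f) / |Δ∇f|^2 suggested above.
    import numpy as np

    def grad_descent(grad_f, x0, gamma0=1e-3, steps=100):
        x_prev, g_prev = x0, grad_f(x0)
        x = x_prev - gamma0 * g_prev          # first step uses a fixed small γ_0
        for _ in range(steps):
            g = grad_f(x)
            dx, dg = x - x_prev, g - g_prev
            if np.dot(dg, dg) == 0:           # gradient unchanged: at a critical point
                break
            gamma = np.dot(dx, dg) / np.dot(dg, dg)
            x_prev, g_prev = x, g
            x = x - gamma * g
        return x

    # Example: f(x) = x_1^2 + 5 x_2^2, with ∇f(x) = (2 x_1, 10 x_2); minimum at 0.
    grad_f = lambda x: np.array([2 * x[0], 10 * x[1]])
    print(grad_descent(grad_f, np.array([3.0, -2.0])))   # ≈ [0, 0]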
5. Higher order derivatives

Given a function f, treat ∂_i f as a function. If ∂_i f is itself a differentiable function, we can differentiate it again. The second derivative (denoted by ∂_j ∂_i f) is called a second order partial of f. These can further be differentiated to obtain third order partials.

Theorem 5.1 (Clairaut). If ∂_i ∂_j f and ∂_j ∂_i f both exist in a neighbourhood of a, and are continuous at a, then they must be equal.

If the mixed second order partials are not continuous, however, they need not be equal.

Example 5.2. Let f(x, y) = x^3 y / (x^2 + y^2) for (x, y) ≠ 0 and f(0, 0) = 0. Then ∂_x ∂_y f(0, 0) = 1 but ∂_y ∂_x f(0, 0) = 0.
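A short verification (added here; the notes state the values without computation). For x ≠ 0,

    ∂_y f(x, 0) = lim_{k→0} f(x, k)/k = x^3 / x^2 = x,

and ∂_y f(0, 0) = 0, so ∂_y f(x, 0) = x for all x, giving ∂_x ∂_y f(0, 0) = 1. On the other hand,

    ∂_x f(0, y) = lim_{h→0} f(h, y)/h = lim_{h→0} h^2 y / (h^2 + y^2) = 0,

so ∂_x f(0, y) = 0 for all y, giving ∂_y ∂_x f(0, 0) = 0.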
Proof of Clairaut's theorem. Here's the idea in 2D (the same works in higher dimensions). For simplicity assume a = 0.
• Let R be the rectangle with corners (0, 0), (h, 0), (0, k), (h, k).
• Using the mean value theorem, show f(h, k) − f(h, 0) − f(0, k) + f(0, 0) = hk ∂_x ∂_y f(α) for some point α ∈ R.
• Observe f(h, k) − f(h, 0) − f(0, k) + f(0, 0) = f(h, k) − f(0, k) − f(h, 0) + f(0, 0), and so, using the mean value theorem again, show f(h, k) − f(h, 0) − f(0, k) + f(0, 0) = hk ∂_y ∂_x f(β) for some point β ∈ R.
• Note that as (h, k) → 0, we have α, β → 0. Consequently, if ∂_x ∂_y f and ∂_y ∂_x f are both continuous at 0 we must have

    ∂_x ∂_y f(0, 0) = lim_{(h,k)→0} ( f(h, k) − f(h, 0) − f(0, k) + f(0, 0) ) / (hk) = ∂_y ∂_x f(0, 0),

proving equality as desired. □

Definition 5.3. A function is said to be of class C^k if all its k-th order partial derivatives exist and are continuous.

By Clairaut's theorem, we know that mixed partials are equal for C^k functions.

6. Maxima and Minima

Definition 6.1. A function f has a local maximum at a if ∃ε > 0 such that whenever |x − a| < ε we have f(x) ≤ f(a).

Our aim is now to understand what having a local maximum / minimum translates to in terms of derivatives of f. For this we do a simple calculation: observe that if f has a local maximum at a, then for all v ∈ R^d − {0} the function f(a + tv)
must have a local maximum at t = 0. Hence we must have ∂_t f(a + tv)|_{t=0} = 0 and ∂_t^2 f(a + tv)|_{t=0} ≤ 0. Using the chain rule, we compute

    ∂_t f(a + tv) = Σ_{i=1}^d ∂_i f(a + tv) v_i    and    ∂_t^2 f(a + tv) = Σ_{i,j=1}^d ∂_i ∂_j f(a + tv) v_i v_j.

Thus at a local maximum we must have

    Σ_{i=1}^d ∂_i f(a) v_i = 0    and    Σ_{i,j=1}^d ∂_i ∂_j f(a) v_i v_j ≤ 0

for every v ∈ R^d. This translates to the following proposition.

Proposition 6.2. If f is a C^2 function which has a local maximum at a, then
(1) The first derivative Df must vanish at a (i.e. Df_a = 0).
(2) The Hessian Hf is negative semi-definite at a.

For a local minimum, we replace negative semi-definite above with positive semi-definite.

Definition 6.3. The Hessian of a C^2 function (denoted by Hf) is defined to be the matrix

    Hf = [ ∂_1 ∂_1 f   ∂_2 ∂_1 f   ···   ∂_d ∂_1 f ]
         [ ∂_1 ∂_2 f   ∂_2 ∂_2 f   ···   ∂_d ∂_2 f ]
         [     ⋮            ⋮                ⋮     ]
         [ ∂_1 ∂_d f   ∂_2 ∂_d f   ···   ∂_d ∂_d f ].

Note if f ∈ C^2, then Hf is symmetric.

Definition 6.4. Let A be a d × d symmetric matrix.
• If (Av) · v ≤ 0 for all v ∈ R^d, then A is called negative semi-definite.
• If (Av) · v < 0 for all v ∈ R^d − {0}, then A is called negative definite.
• If (Av) · v ≥ 0 for all v ∈ R^d, then A is called positive semi-definite.
• If (Av) · v > 0 for all v ∈ R^d − {0}, then A is called positive definite.

Recall a symmetric matrix is positive semi-definite if and only if all its eigenvalues are non-negative. In 2D this simplifies to the following:

Proposition 6.5. Let A be the symmetric 2 × 2 matrix ( a b ; b c ).
(1) A is positive definite if and only if a > 0 and ac − b^2 > 0.
(2) A is negative definite if and only if a < 0 and ac − b^2 > 0.
(3) A is positive semi-definite if and only if a, c ≥ 0 and ac − b^2 ≥ 0.
(4) A is negative semi-definite if and only if a, c ≤ 0 and ac − b^2 ≥ 0.

Finally, we address the converse: namely, we look for a condition on the derivatives of f that guarantees that f attains a local maximum or minimum at a.

Theorem 6.6. Let f be a C^2 function.
(1) If Df_a = 0 and further Hf_a is positive definite, then f attains a local minimum at a.
(2) If Df_a = 0 and further Hf_a is negative definite, then f attains a local maximum at a.

The proof uses Taylor's theorem, and we will prove it in Section 7 below.

Definition 6.7. We say a is a local saddle of f if there exist two linearly independent vectors v_1 and v_2 such that f has a strict local minimum in direction v_1 and a strict local maximum in direction v_2.

Proposition 6.8. If f is C^2, Df_a = 0 and Hf_a has at least one strictly positive and one strictly negative eigenvalue, then a is a local saddle of f.

This corresponds to points where f has a local maximum in one direction and a local minimum in the other.

Example 6.9. The function |x|^2 has a local minimum at 0. The function −|x|^2 has a local maximum at 0. The function x_1^2 − x_2^2 has a saddle at 0.
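Putting Theorem 6.6 and Proposition 6.8 to work, here is a small sketch (an illustration of mine, not from the notes; the helper classify is hypothetical) that classifies a critical point by the signs of the eigenvalues of the Hessian, with the saddle x_1^2 − x_2^2 of Example 6.9 as the test case:

    # Classify a critical point a (where Df_a = 0) from the Hessian's eigenvalues.
    import numpy as np

    def classify(hessian):
        lam = np.linalg.eigvalsh(hessian)   # eigenvalues of a symmetric matrix
        if np.all(lam > 0):
            return "local minimum"          # Hf_a positive definite (Theorem 6.6)
        if np.all(lam < 0):
            return "local maximum"          # Hf_a negative definite (Theorem 6.6)
        if lam.min() < 0 < lam.max():
            return "local saddle"           # mixed signs (Proposition 6.8)
        return "inconclusive"               # zero eigenvalues: the test is silent

    # f(x) = x_1^2 − x_2^2 has Df_0 = 0 and Hf_0 = diag(2, −2).
    print(classify(np.diag([2.0, -2.0])))   # "local saddle"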
Example 6.10. Let f: R^d → R, and let Γ ⊆ R^{d+1} be the graph of f (i.e. Γ = {(x, y) | x ∈ R^d, y = f(x)}). Fix (x^0, y^0) ∈ R^{d+1}, and let (a, f(a)) be the point on Γ which is closest to (x^0, y^0). Then x^0 − a is parallel to ∇f(a), and (x^0 − a, y^0 − f(a)) is normal to the tangent plane at (a, f(a)).

Proof. Let d(x) = |x − x^0|^2 + (f(x) − y^0)^2. Since (a, f(a)) is the closest point, d attains a minimum at a, so ∇d(a) = 0, and hence

(6.1)    2(a − x^0) + 2(f(a) − y^0) ∇f(a) = 0.

This shows x^0 − a is parallel to ∇f(a).

For the second assertion recall that the tangent plane is defined by

    y = f(a) + ∇f(a) · (x − a).

This is equivalent to saying (∇f(a), −1) · (x − a, y − f(a)) = 0, and hence (∇f(a), −1) is normal to the tangent plane. But from (6.1) we immediately see

    (x^0 − a, y^0 − f(a)) = −(y^0 − f(a)) (∇f(a), −1),

and hence (x^0 − a, y^0 − f(a)) is also normal to the tangent plane at a. □

7. Taylor's theorem

Theorem 7.1. If f ∈ C^2, then

(7.1)    f(a + h) = f(a) + Df_a h + (1/2) h · Hf_a h + R_2(h),

where R_2(h) is some function such that

    lim_{h→0} R_2(h) / |h|^2 = 0.

In coordinates, equation (7.1) reads

    f(a + h) = f(a) + Σ_i ∂_i f(a) h_i + (1/2) Σ_{i,j} ∂_i ∂_j f(a) h_i h_j + R_2(h).
Proof. Let g(t) = f(a + th). Using the 1D Taylor's theorem we have

    g(1) = g(0) + g′(0) + (1/2) g″(ξ)

for some ξ ∈ (0, 1). Writing this in terms of f finishes the proof. □
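For concreteness, here is a worked instance of (7.1) (added here as an illustration): take f(x, y) = e^x cos y and a = 0. Then Df_0 = (1, 0) and Hf_0 = diag(1, −1), so

    e^{h_1} cos h_2 = 1 + h_1 + (1/2)(h_1^2 − h_2^2) + R_2(h),    with R_2(h)/|h|^2 → 0,

which one can also confirm by multiplying the one-variable expansions e^{h_1} = 1 + h_1 + h_1^2/2 + ··· and cos h_2 = 1 − h_2^2/2 + ···.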
The same technique can show the following mean value theorem:

Theorem 7.2 (Mean value theorem). If f is differentiable on the entire line segment joining a and b, then

    f(b) = f(a) + (b − a) · ∇f(ξ)

for some point ξ on the line segment joining a and b.

Taylor's theorem allows us to prove Theorem 6.6. In order to do this, we need a standard lemma from linear algebra.

Lemma 7.3. If A is a d × d positive definite symmetric matrix, then for every v ∈ R^d we have (Av) · v ≥ λ_0 |v|^2, where λ_0 is the smallest eigenvalue of A.

Proof. By the spectral theorem, we know there exists an orthonormal basis of R^d consisting of eigenvectors of A. Let {e_1, ..., e_d} denote this basis, and let λ_i denote the corresponding eigenvalues (i.e. Ae_i = λ_i e_i). Now let v_i = v · e_i, and observe v = Σ_i v_i e_i and hence Av = Σ_i λ_i v_i e_i. Thus

    (Av) · v = ( Σ_i λ_i v_i e_i ) · ( Σ_j v_j e_j ) = Σ_{i,j} λ_i v_i v_j e_i · e_j = Σ_i λ_i v_i^2 ≥ (min_i λ_i) Σ_i v_i^2 = λ_0 |v|^2. □

Proof of Theorem 6.6. Suppose Df_a = 0 and Hf_a is positive definite. Let λ_0 be the smallest eigenvalue of Hf_a. By Lemma 7.3 (expanding in terms of an orthonormal basis of eigenvectors of Hf_a), we see Hf_a h · h ≥ λ_0 |h|^2. Now choose δ > 0 so that |R_2(h)| < λ_0 |h|^2 / 4 for |h| < δ, and note that for 0 < |h| < δ,

    f(a + h) ≥ f(a) + (λ_0/2) |h|^2 − |R_2(h)| ≥ f(a) + (λ_0/4) |h|^2 > f(a),

showing f has a local min at a. The negative definite case follows by applying the above to −f. □

A higher order version of Taylor's theorem is also true. It is usually stated using multi-index notation, collecting all mixed partials that are equal.

Definition 7.4. Let α = (α_1, α_2, ..., α_d), with α_i ∈ N ∪ {0}. If h ∈ R^d define

    h^α = h_1^{α_1} h_2^{α_2} ··· h_d^{α_d},    |α| = α_1 + ··· + α_d,    and    α! = α_1! α_2! ··· α_d!.

Given a C^{|α|} function f, define

    D^α f = ∂_1^{α_1} ∂_2^{α_2} ··· ∂_d^{α_d} f,

with the convention that ∂_i^0 f = f.

Theorem 7.5. If f is a C^n function on R^d and a ∈ R^d, we have

    f(a + h) = Σ_{|α| ≤ n} (1/α!) D^α f(a) h^α + R_n(h)

for some function R_n such that

    lim_{h→0} R_n(h) / |h|^n = 0.

The proof follows from the one variable Taylor's theorem in exactly the same way as our second order version does; collecting all mixed partials that are equal puts it in the above form.