Lecture Notes - Updates
AO SUN
CONTENTS
Introduction
1. Differentiation
1.1. Differentiation of single variable functions
1.2. Euclidean space
1.3. Directional derivatives
1.4. Differentiable functions
1.5. Linear map and Matrix
1.6. Differentiable maps
1.7. Mean value theorem
1.8. Higher order derivatives and Taylor's theorem
1.9. Hessian and second partial derivative test
2. Submanifolds in ℝ𝑁
2.1. Invertible linear maps
2.2. Inverse function theorem
2.3. 𝑘-surfaces and submanifolds in ℝ𝑁
2.4. Tangent space and 𝐶¹-maps
2.5. Implicit function theorem
2.6. Constrained optimization and Lagrange multipliers
2.7. Applications of Lagrange multiplier
2.8. Newton's iteration method*
2.9. Diagonalizing symmetric matrix*
INTRODUCTION
If you find any typos, please let me know; I would appreciate it!
1. DIFFERENTIATION
1.1. Differentiation of single variable functions. Let us recall some basic concepts. ℝ is the real number field. Suppose 𝑓 ∶ [𝑎, 𝑏] → ℝ is a single variable function. We say 𝑓 is differentiable at a point 𝑥 ∈ (𝑎, 𝑏) if
(1.1) lim_{ℎ→0} [𝑓(𝑥 + ℎ) − 𝑓(𝑥)]∕ℎ
exists, and we use the notation 𝑓′(𝑥) or 𝐷𝑓(𝑥) to denote this limit. This limit is called the derivative of 𝑓 at 𝑥.
If we use the 𝜖-𝛿 language to write down this definition, a function 𝑓 is differentiable at 𝑥 if and only if there exists 𝐿 ∈ ℝ, such that for any 𝜖 > 0, there exists 𝛿 > 0, such that for any 0 < |ℎ| < 𝛿 and 𝑥 + ℎ ∈ [𝑎, 𝑏],
(1.2) |[𝑓(𝑥 + ℎ) − 𝑓(𝑥)]∕ℎ − 𝐿| < 𝜖.
We will use the following notation: suppose 𝑓 ∶ (𝑎, 𝑏) → ℝ is a function. If 𝑓 is continuous, we write 𝑓 ∈ 𝐶((𝑎, 𝑏)). If 𝑓 is in addition bounded on (𝑎, 𝑏), then we write 𝑓 ∈ 𝐶⁰((𝑎, 𝑏)). If 𝑓 is differentiable, and its derivative is continuous on (𝑎, 𝑏), then we write 𝑓 ∈ 𝐶¹((𝑎, 𝑏)). In general, if 𝑓 is 𝑘-times differentiable and all its derivatives of order 𝑚 ≤ 𝑘 are continuous on (𝑎, 𝑏), then we write 𝑓 ∈ 𝐶ᵏ((𝑎, 𝑏)). We will use 𝑓⁽ᵏ⁾ to denote the 𝑘-th order derivative of 𝑓.
Differentiability is a local property: a function can be differentiable at a single point and nowhere else. The following example shows this fact.
Example 1.1. Consider a function 𝑓 ∶ ℝ → ℝ defined as follows:
(1.3) 𝑓(𝑥) = 𝑥² if 𝑥 is rational, and 𝑓(𝑥) = 0 if 𝑥 is irrational.
Using the definition of derivatives, we can calculate 𝑓′(0) = 0 (the difference quotient satisfies |𝑓(ℎ)∕ℎ| ≤ |ℎ| → 0), so 𝑓 is differentiable at 0. However, 𝑓 is not differentiable anywhere else. In fact, 𝑓 is even discontinuous on ℝ∖{0}. Then Theorem 1.2 implies that 𝑓 is not differentiable at any point besides 0.
Differentiability is closely related to another basic concept called continuity.
Theorem 1.2. Suppose 𝑓 ∶ [𝑎, 𝑏] → ℝ is differentiable at 𝑥 ∈ (𝑎, 𝑏). Then 𝑓 is continuous at 𝑥.
Proof. Suppose 𝑓′(𝑥) = 𝐿. Using the 𝜖-𝛿 definition of derivative, if we choose 𝜖 = 1, we can find 𝛿 > 0 such that for any 0 < |ℎ| < 𝛿 and 𝑥 + ℎ ∈ [𝑎, 𝑏], we have
|[𝑓(𝑥 + ℎ) − 𝑓(𝑥)]∕ℎ − 𝑓′(𝑥)| < 1,
then we have
|𝑓(𝑥 + ℎ) − 𝑓(𝑥)| < |ℎ|(1 + |𝑓′(𝑥)|).
This implies that lim_{ℎ→0} 𝑓(𝑥 + ℎ) = 𝑓(𝑥), which implies that 𝑓 is continuous at 𝑥.
On the other hand, a continuous function may be far from differentiable. We have the following very pathological example.
Example 1.3 (A continuous function that is nowhere differentiable). Let us define a function 𝜑₁ ∶ ℝ → ℝ as follows. On [0, 1] we define
(1.4) 𝜑₁(𝑥) = 𝑥 if 0 ≤ 𝑥 < 1∕2, and 𝜑₁(𝑥) = 1 − 𝑥 if 1∕2 ≤ 𝑥 ≤ 1.
Then we extend 𝜑₁ periodically to ℝ by 𝜑₁(𝑥 + 1) = 𝜑₁(𝑥). Then we inductively define 𝜑_{𝑛+1}(𝑥) = (1∕2)𝜑ₙ(2𝑥). Let
(1.5) 𝑆ₙ(𝑥) = ∑_{𝑗=1}^{𝑛} 𝜑ⱼ(𝑥).
One can show that 𝑆ₙ converges uniformly to a continuous function 𝑆, and that 𝑆 is nowhere differentiable.
The mean value theorem can be viewed as an approximation theorem - it describes how a general
differentiable function 𝑓 can be approximated by a linear function. In fact, we can interpret the
mean value theorem as
𝑓 (𝑥) ≈ 𝑓 (𝑎) + 𝑓 ′ (𝑎)(𝑥 − 𝑎).
In general, we can approximate a function with continuous 𝑘-th order derivatives by a degree 𝑘
polynomial, which is known as the Taylor polynomial.
Theorem 1.8 (Taylor expansion). Suppose 𝑓 ∈ 𝐶ᵏ((𝑎, 𝑏)). Then for any 𝑥₀ ∈ (𝑎, 𝑏) and 𝑥 ∈ (𝑎, 𝑏), there exists 𝑐 between 𝑥 and 𝑥₀ such that
(1.8) 𝑓(𝑥) = 𝑓(𝑥₀) + [𝑓′(𝑥₀)∕1!](𝑥 − 𝑥₀) + [𝑓⁽²⁾(𝑥₀)∕2!](𝑥 − 𝑥₀)² + ⋯ + [𝑓⁽ᵏ⁻¹⁾(𝑥₀)∕(𝑘 − 1)!](𝑥 − 𝑥₀)^{𝑘−1} + [𝑓⁽ᵏ⁾(𝑐)∕𝑘!](𝑥 − 𝑥₀)ᵏ.
Here the polynomial
𝑃_{𝑘−1}(𝑥) ∶= 𝑓(𝑥₀) + [𝑓′(𝑥₀)∕1!](𝑥 − 𝑥₀) + [𝑓⁽²⁾(𝑥₀)∕2!](𝑥 − 𝑥₀)² + ⋯ + [𝑓⁽ᵏ⁻¹⁾(𝑥₀)∕(𝑘 − 1)!](𝑥 − 𝑥₀)^{𝑘−1}
is called the degree 𝑘 − 1 Taylor polynomial of 𝑓 at 𝑥₀.
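For instance, for 𝑓(𝑥) = 𝑒ˣ (so that 𝑓⁽ᵐ⁾(𝑥) = 𝑒ˣ for every 𝑚) and 𝑥₀ = 0, the theorem gives, for some 𝑐 between 0 and 𝑥,
𝑒ˣ = 1 + 𝑥 + 𝑥²∕2! + ⋯ + 𝑥^{𝑘−1}∕(𝑘 − 1)! + 𝑒ᶜ𝑥ᵏ∕𝑘!.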
1.2. Euclidean space. Before we start the discussion of the differentiation of multivariable func-
tions, we need to understand the space that we will be working on. The 𝑛-dimensional Euclidean
space, denoted by ℝ𝑛 , is the Cartesian product of 𝑛 copies of real numbers as a set. There are three
important structures on the Euclidean space.
1.2.1. Vector space structure. ℝⁿ is an (ℝ-) vector space. This means that we can define the following operations:
∙ (Addition) There is a function + ∶ ℝ𝑛 × ℝ𝑛 → ℝ𝑛 . We denote the image of (𝐮, 𝐯) by 𝐮 + 𝐯.
∙ (Scalar multiplication) There is a function ℝ × ℝ𝑛 → ℝ𝑛 . We denote the image of (𝑎, 𝐯) by
𝑎𝐯.
These operations should satisfy the following axioms. Suppose 𝐮, 𝐯, 𝐰 are elements of ℝ𝑛 , 𝑎, 𝑏
are elements of ℝ.
(1) (Associativity of vector addition ) 𝐮 + (𝐯 + 𝐰) = (𝐮 + 𝐯) + 𝐰.
(2) (Commutativity of vector addition) 𝐮 + 𝐯 = 𝐯 + 𝐮.
(3) (Identity element of vector addition) There exists an element in ℝ𝑛 denoted by 𝟎, such that
for every 𝐮 ∈ ℝ𝑛 , 𝐮 + 𝟎 = 𝐮.
(4) (Inverse elements of vector addition) For every 𝐮 ∈ ℝ𝑛 , there exists an element denoted by
(−𝐮), such that (−𝐮) + 𝐮 = 𝟎.
(5) (Compatibility of scalar multiplication with field multiplication) 𝑎(𝑏𝐮) = (𝑎𝑏)𝐮.
(6) (Identity element of scalar multiplication) 1𝐮 = 𝐮.
(7) (Distributivity of scalar multiplication with respect to vector addition) 𝑎(𝐮 + 𝐯) = 𝑎𝐮 + 𝑎𝐯.
(8) (Distributivity of scalar multiplication with respect to field addition) (𝑎 + 𝑏)𝐮 = 𝑎𝐮 + 𝑏𝐮.
In general, a vector space can be defined over any field, not necessarily ℝ. Throughout this
course, we will be only focused on vector spaces over ℝ.
We also have a special name for the zero element.
Definition 1.9. 𝟎 ∈ ℝ𝑛 is called the origin.
For Euclidean space, the addition and multiplication can be written down explicitly. Suppose
𝐮 ∈ ℝ𝑛 , 𝐯 ∈ ℝ𝑛 . Then we can use the Cartesian coordinate to write the vectors in the form of
𝑛-tuple:
𝐮 = (𝑢1 , 𝑢2 , ⋯ , 𝑢𝑛 ), 𝐯 = (𝑣1 , 𝑣2 , ⋯ , 𝑣𝑛 ).
Here each 𝑢𝑖 and 𝑣𝑖 are real numbers. Then we define
𝐮 + 𝐯 = (𝑢1 + 𝑣1 , 𝑢2 + 𝑣2 , ⋯ , 𝑢𝑛 + 𝑣𝑛 ),
and for 𝑎 ∈ ℝ, we define
𝑎𝐮 = (𝑎𝑢1 , 𝑎𝑢2 , ⋯ , 𝑎𝑢𝑛 ).
Given a vector space 𝑉, a basis of 𝑉 is a collection of vectors {𝐯₁, 𝐯₂, ⋯, 𝐯ₙ} such that any vector in 𝑉 can be expressed as a unique linear combination of these vectors; namely, for any vector 𝐮 ∈ 𝑉, there exists a unique array of numbers 𝑎₁, 𝑎₂, ⋯, 𝑎ₙ such that
𝐮 = ∑_{𝑗=1}^{𝑛} 𝑎ⱼ𝐯ⱼ.
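For example, ℝⁿ has the standard basis {𝐞₁, ⋯, 𝐞ₙ}, where 𝐞ᵢ has 1 in the 𝑖-th coordinate and 0 elsewhere: any 𝐮 = (𝑢₁, ⋯, 𝑢ₙ) is expressed uniquely as 𝐮 = ∑_{𝑗=1}^{𝑛} 𝑢ⱼ𝐞ⱼ.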
Using this expression of directional derivatives, we can easily check the linearity of the directional derivative in the direction 𝐯 if 𝑓 is differentiable.
Proposition 1.26. Suppose 𝑈 ⊂ ℝⁿ is open, 𝑓 ∶ 𝑈 → ℝ, 𝑥₀ ∈ 𝑈. Suppose 𝑓 is differentiable at 𝑥₀. Then for any 𝐯 ∈ ℝⁿ, 𝐮 ∈ ℝⁿ and 𝑎 ∈ ℝ, 𝑏 ∈ ℝ,
(1.14) 𝐷_{𝑎𝐮+𝑏𝐯}𝑓(𝑥₀) = 𝑎𝐷_𝐮𝑓(𝑥₀) + 𝑏𝐷_𝐯𝑓(𝑥₀).
Although the differentiability of 𝑓 implies the directional derivatives exist for all directions, the
converse is not true.
We can write
𝑔(𝐯) − 𝑔(𝟎) = 𝑔(𝑣₁, ⋯, 𝑣_{𝑛−1}, 𝑣ₙ) − 𝑔(0, ⋯, 0, 0)
= [𝑔(𝑣₁, ⋯, 𝑣_{𝑛−1}, 𝑣ₙ) − 𝑔(𝑣₁, ⋯, 𝑣_{𝑛−1}, 0)] + [𝑔(𝑣₁, ⋯, 𝑣_{𝑛−1}, 0) − 𝑔(𝑣₁, ⋯, 0, 0)] + ⋯ + [𝑔(𝑣₁, 0, ⋯, 0) − 𝑔(0, ⋯, 0, 0)].
For the term 𝑔(𝑣₁, ⋯, 𝑣ₖ, 𝑣_{𝑘+1}, 0, ⋯, 0) − 𝑔(𝑣₁, ⋯, 𝑣ₖ, 0, 0, ⋯, 0), we can use the mean value theorem to show that there exists 𝑐_{𝑘+1} between 0 and 𝑣_{𝑘+1}, such that
𝑔(𝑣₁, ⋯, 𝑣ₖ, 𝑣_{𝑘+1}, 0, ⋯, 0) − 𝑔(𝑣₁, ⋯, 𝑣ₖ, 0, 0, ⋯, 0) = 𝐷_{𝑘+1}𝑔(𝑣₁, ⋯, 𝑣ₖ, 𝑐_{𝑘+1}, 0, ⋯, 0)𝑣_{𝑘+1}.
Note that this equality holds even when 𝑣_{𝑘+1} = 0. Then we have
|𝑔(𝐯) − 𝑔(𝟎)| ≤ ∑_{𝑗=1}^{𝑛} |𝑣ⱼ| sup_{‖𝐮‖≤‖𝐯‖} |𝐷ⱼ𝑔(𝐮)|.
Definition 1.30. An 𝑚 by 𝑛 matrix is a rectangular table of numbers, with 𝑚 rows and 𝑛 columns.
Now we consider a more complicated question. Suppose we have two linear maps 𝐴 ∶ ℝⁿ → ℝᵐ and 𝐵 ∶ ℝᵐ → ℝᵏ, written in matrix form as
𝐴 = ⎡𝑎₁₁ 𝑎₁₂ ⋯ 𝑎₁ₙ⎤   𝐵 = ⎡𝑏₁₁ 𝑏₁₂ ⋯ 𝑏₁ₘ⎤
    ⎢𝑎₂₁ 𝑎₂₂ ⋯ 𝑎₂ₙ⎥       ⎢𝑏₂₁ 𝑏₂₂ ⋯ 𝑏₂ₘ⎥
    ⎢ ⋯   ⋯  ⋯  ⋯ ⎥       ⎢ ⋯   ⋯  ⋯  ⋯ ⎥
    ⎣𝑎ₘ₁ 𝑎ₘ₂ ⋯ 𝑎ₘₙ⎦       ⎣𝑏ₖ₁ 𝑏ₖ₂ ⋯ 𝑏ₖₘ⎦
What is the composition 𝐶 = 𝐵◦𝐴?
One can check that 𝐶 is again a linear map. If we write 𝐶 in the form of a matrix, then it must be a 𝑘 by 𝑛 matrix, and each column vector is the image of 𝑒ᵢ ∈ ℝⁿ under 𝐶 = 𝐵◦𝐴.
Remark 1.32. The matrix multiplication can be interpreted by the linear maps: If 𝐴 ∶ ℝ𝑛 → ℝ𝑚
and 𝐵 ∶ ℝ𝑚 → ℝ𝑘 are linear maps, then the matrix 𝐶 gives the linear map 𝐵◦𝐴 ∶ ℝ𝑛 → ℝ𝑘 .
Remark 1.33. We can only multiply a matrix 𝐵 by a matrix 𝐴 if the number of columns of 𝐵 equals
the number of rows of 𝐴.
Remark 1.34. Using the inner product, we can interpret the (𝑝, 𝑞) entry 𝑐𝑝𝑞 of 𝐶 = 𝐵𝐴 to be the
inner product of the 𝑝-th row vector of 𝐵 and the 𝑞-th column vector of 𝐴.
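As a quick numerical sanity check of this interpretation, here is a sketch in Python with numpy (the matrices 𝐴, 𝐵 below are my own arbitrary choices, not from the notes):

    import numpy as np

    # A : R^3 -> R^2 and B : R^2 -> R^4 as matrices (n = 3, m = 2, k = 4)
    A = np.array([[1., 2., 0.],
                  [0., 1., 3.]])          # 2 x 3
    B = np.array([[1., 0.],
                  [2., 1.],
                  [0., 1.],
                  [1., 1.]])              # 4 x 2
    C = B @ A                             # 4 x 3, the matrix of B∘A

    x = np.array([1., -1., 2.])
    # applying B after A agrees with applying C directly
    assert np.allclose(B @ (A @ x), C @ x)
    # the (p, q) entry of C is the inner product of the p-th row of B
    # with the q-th column of A, as in Remark 1.34
    assert np.isclose(C[2, 1], B[2, :] @ A[:, 1])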
Example 1.35. Suppose
𝐴 = ⎡1 1⎤
    ⎣0 1⎦.
Then 𝐴𝐴 =∶ 𝐴², the multiplication of 𝐴 with itself, is
𝐴² = ⎡1 2⎤
     ⎣0 1⎦.
𝐴𝐵 = ⎡6 5 3⎤
     ⎣8 6 3⎦
As in the function case, we can calculate the directional derivative of a map from the differential. Just like for differentiable functions, we can use coordinates to express the differential. We can use the Jacobi matrix to compute the directional derivative of a map explicitly. In fact, if 𝐯 = (𝑣₁, 𝑣₂, ⋯, 𝑣ₙ) ∈ ℝⁿ is given, then 𝐷_𝐯𝑓(𝑥₀) = 𝐷𝑓(𝑥₀)(𝐯), which we can write out in coordinates.
Then we have
𝑔(𝑓(𝑥₀ + 𝑣)) − 𝑔(𝑓(𝑥₀)) = 𝑔(𝑓(𝑥₀) + 𝐷𝑓(𝑥₀)𝑣 + 𝛼(𝑣)) − 𝑔(𝑓(𝑥₀))
= 𝐷𝑔(𝑓(𝑥₀))(𝐷𝑓(𝑥₀)𝑣 + 𝛼(𝑣)) + 𝛽(𝐷𝑓(𝑥₀)𝑣 + 𝛼(𝑣)).
Letting 𝑣 → 0, we see that
lim_{𝑣→0} ‖𝑔(𝑓(𝑥₀ + 𝑣)) − 𝑔(𝑓(𝑥₀)) − 𝐷𝑔(𝑓(𝑥₀))𝐷𝑓(𝑥₀)𝑣‖ ∕ ‖𝑣‖ = 0.
Here we use the following computation. For 𝑣 ≠ 0,
‖𝛽(𝐷𝑓(𝑥₀)𝑣 + 𝛼(𝑣))‖ ∕ ‖𝑣‖ = [‖𝛽(𝐷𝑓(𝑥₀)𝑣 + 𝛼(𝑣))‖ ∕ (‖𝐷𝑓(𝑥₀)𝑣 + 𝛼(𝑣)‖ + 𝜖‖𝑣‖)] · [(‖𝐷𝑓(𝑥₀)𝑣 + 𝛼(𝑣)‖ + 𝜖‖𝑣‖) ∕ ‖𝑣‖]
≤ [‖𝛽(𝐷𝑓(𝑥₀)𝑣 + 𝛼(𝑣))‖ ∕ (‖𝐷𝑓(𝑥₀)𝑣 + 𝛼(𝑣)‖ + 𝜖‖𝑣‖)] · (‖𝐷𝑓(𝑥₀)‖ + ‖𝛼(𝑣)‖∕‖𝑣‖ + 𝜖).
(We divide by ‖𝐷𝑓(𝑥₀)𝑣 + 𝛼(𝑣)‖ + 𝜖‖𝑣‖ rather than by ‖𝐷𝑓(𝑥₀)𝑣 + 𝛼(𝑣)‖ alone, in case the latter is zero.) Hence the right-hand side tends to 0 as 𝑣 → 0.
Proof. For every 1 ≤ 𝑗 ≤ 𝑚, we can apply the mean value theorem to 𝑓𝑗 to show that there exists
𝑐𝑗 ∈ (𝑎, 𝑏) such that
𝑓𝑗 (𝑏) − 𝑓𝑗 (𝑎) = 𝑓𝑗′ (𝑐𝑗 )(𝑏 − 𝑎).
Then we have the inequality
|𝑓ⱼ(𝑏) − 𝑓ⱼ(𝑎)| ≤ sup_{𝑐∈(𝑎,𝑏)} ‖𝐷𝑓(𝑐)‖ |𝑏 − 𝑎|.
Then we can use the triangle inequality and add them together to conclude that
‖𝑓(𝑏) − 𝑓(𝑎)‖ ≤ ∑_{𝑗=1}^{𝑚} |𝑓ⱼ(𝑏) − 𝑓ⱼ(𝑎)| ≤ 𝑚 sup_{𝑐∈(𝑎,𝑏)} ‖𝐷𝑓(𝑐)‖ |𝑏 − 𝑎|.
Next, we consider multi-variable functions. We introduce the special maps called differentiable curves.
Finally, we can combine all the previous theorems to show the following:
Theorem 1.53. Suppose 𝑈 is a convex set, and 𝑓 ∶ 𝑈 → ℝᵐ is a differentiable map. Then for 𝑥, 𝑦 ∈ 𝑈,
(1.25) ‖𝑓(𝑥) − 𝑓(𝑦)‖ ≤ 𝑚 sup_{𝑧∈𝑈} ‖𝐷𝑓(𝑧)‖ ‖𝑥 − 𝑦‖.
1.8. Higher order derivatives and Taylor's theorem. Now we generalize the derivatives to higher orders. Suppose 𝑈 ⊂ ℝⁿ and 𝑥₀ ∈ 𝑈. It is natural to say 𝑓 ∶ 𝑈 → ℝᵐ is twice differentiable at 𝑥₀ if 𝑓 is differentiable in a neighborhood 𝑉 of 𝑥₀, and 𝐷𝑓 ∶ 𝑉 → ℝ^{𝑚×𝑛} is differentiable at 𝑥₀. Here recall that 𝐷𝑓(𝑥) is a linear map from ℝⁿ to ℝᵐ, and such a linear map can be identified with an 𝑚 by 𝑛 matrix, which can be viewed as a vector in ℝ^{𝑚×𝑛}. Similarly, we can define 𝓁-th order differentiability of 𝑓.
Remark 1.55. In linear algebra, the space of all linear maps from ℝ𝑛 → ℝ𝑚 is usually denoted by
Hom(ℝ𝑛 , ℝ𝑚 ), where Hom stands for homomorphisms. We have Hom(ℝ𝑛 , ℝ𝑚 ) ≅ ℝ𝑚×𝑛 .
From the above discussion, we see that an 𝓁-th order differential of a function 𝑓 can be identified with an element of ℝ^{𝑚×𝑛×𝑛×⋯×𝑛}. This is a space of huge dimension. In general, we can only write down a higher-order differential in coordinates.
We will use the following notation: 𝐷ᵢⱼ𝑓 denotes the second order partial derivative 𝐷ᵢ𝐷ⱼ𝑓.
Proof. The proof adapts the approximation idea that we discussed above. Let us prove 𝐷₁₂𝑓(𝑥) = 𝐷₂₁𝑓(𝑥); the proof for other indices is the same. For (𝑦₁, 𝑦₂) ≠ (0, 0), define
𝑔(𝑥, 𝑦₁, 𝑦₂) = [𝑓(𝑥 + 𝑦₂𝑒₂ + 𝑦₁𝑒₁) − 𝑓(𝑥 + 𝑦₂𝑒₂) − 𝑓(𝑥 + 𝑦₁𝑒₁) + 𝑓(𝑥)] ∕ (𝑦₁𝑦₂).
Then the mean value theorem for the single variable function
𝛼(𝑧) = 𝑓(𝑥 + 𝑧𝑒₂ + 𝑦₁𝑒₁) − 𝑓(𝑥 + 𝑧𝑒₂) − 𝑓(𝑥 + 𝑦₁𝑒₁) + 𝑓(𝑥)
shows that there exists 𝑧₂ between 0 and 𝑦₂ such that
𝑔(𝑥, 𝑦₁, 𝑦₂) = [𝐷₂𝑓(𝑥 + 𝑧₂𝑒₂ + 𝑦₁𝑒₁) − 𝐷₂𝑓(𝑥 + 𝑧₂𝑒₂)] ∕ 𝑦₁.
Similarly, applying the mean value theorem in the 𝑒₁ direction, we conclude that there exists 𝑧₁ between 0 and 𝑦₁ such that
𝑔(𝑥, 𝑦₁, 𝑦₂) = 𝐷₁₂𝑓(𝑥 + 𝑧₂𝑒₂ + 𝑧₁𝑒₁).
Similarly, there exists 𝑤₁ between 0 and 𝑦₁ and 𝑤₂ between 0 and 𝑦₂, such that
𝑔(𝑥, 𝑦₁, 𝑦₂) = 𝐷₂₁𝑓(𝑥 + 𝑤₂𝑒₂ + 𝑤₁𝑒₁).
In particular, if we let (𝑦₁, 𝑦₂) → (0, 0) with 𝑦₁ ≠ 0 and 𝑦₂ ≠ 0, by the continuity of 𝐷₁₂𝑓 and 𝐷₂₁𝑓, we see that
𝐷₁₂𝑓(𝑥) = 𝐷₂₁𝑓(𝑥).
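As a numerical illustration of this proof, one can compute the quotient 𝑔 for a sample smooth function and watch it converge to the common value of the mixed partials; here is a Python sketch (the function 𝑓 below is my own choice, not from the notes):

    import numpy as np

    def f(x, y):
        return np.sin(x) * y**2 + x * y   # a sample smooth function

    x0, y0 = 0.7, -1.3
    exact = 2 * y0 * np.cos(x0) + 1       # D12 f = D21 f = 2y cos(x) + 1 here

    def g(y1, y2):
        # the quotient g(x, y1, y2) from the proof above
        return (f(x0 + y1, y0 + y2) - f(x0 + y1, y0)
                - f(x0, y0 + y2) + f(x0, y0)) / (y1 * y2)

    for h in [1e-2, 1e-3, 1e-4]:
        print(h, g(h, h), exact)          # g -> D12 f = D21 f as (y1, y2) -> 0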
Next, we generalize Taylor's theorem to multi-variable functions. Again, we first use the idea of approximation to see what should be the correct formulation of the Taylor theorem. Suppose 𝑓 is a very nicely higher-order differentiable function. Then let us expand 𝑓 using the single variable Taylor theorem in each one of the variables. We would get
𝑓(𝑥₁ + 𝑦₁, ⋯, 𝑥ₙ + 𝑦ₙ) ≈ ∑_{0≤𝛼₁≤𝑘₁} 𝐷₁^{𝛼₁}𝑓(𝑥₁, 𝑥₂ + 𝑦₂, ⋯, 𝑥ₙ + 𝑦ₙ) 𝑦₁^{𝛼₁}∕𝛼₁!
≈ ∑_{0≤𝛼₂≤𝑘₂} ∑_{0≤𝛼₁≤𝑘₁} 𝐷₂^{𝛼₂}𝐷₁^{𝛼₁}𝑓(𝑥₁, 𝑥₂, ⋯, 𝑥ₙ + 𝑦ₙ) (𝑦₁^{𝛼₁}∕𝛼₁!)(𝑦₂^{𝛼₂}∕𝛼₂!)
≈ ⋯
≈ ∑_{0≤𝛼ₙ≤𝑘ₙ} ⋯ ∑_{0≤𝛼₂≤𝑘₂} ∑_{0≤𝛼₁≤𝑘₁} 𝐷ₙ^{𝛼ₙ}⋯𝐷₂^{𝛼₂}𝐷₁^{𝛼₁}𝑓(𝑥₁, 𝑥₂, ⋯, 𝑥ₙ) (𝑦₁^{𝛼₁}∕𝛼₁!)(𝑦₂^{𝛼₂}∕𝛼₂!)⋯(𝑦ₙ^{𝛼ₙ}∕𝛼ₙ!).
In order to write down this expression succinctly, we introduce the following notation: suppose 𝛼 = (𝛼₁, 𝛼₂, ⋯, 𝛼ₙ) is an 𝑛-tuple, where 𝛼ᵢ ∈ ℤ_{≥0} for all 1 ≤ 𝑖 ≤ 𝑛. Then we use |𝛼| to denote ∑_{𝑖=1}^{𝑛} 𝛼ᵢ, we use 𝐷^𝛼 to denote 𝐷₁^{𝛼₁}𝐷₂^{𝛼₂}⋯𝐷ₙ^{𝛼ₙ}, we use 𝑥^𝛼 to denote 𝑥₁^{𝛼₁}𝑥₂^{𝛼₂}⋯𝑥ₙ^{𝛼ₙ}, and we use 𝛼! to denote 𝛼₁!𝛼₂!⋯𝛼ₙ!. For example, for 𝑛 = 2 and 𝛼 = (2, 1) we have |𝛼| = 3, 𝐷^𝛼 = 𝐷₁²𝐷₂, 𝑥^𝛼 = 𝑥₁²𝑥₂, and 𝛼! = 2!1! = 2. Now we can state the Taylor theorem for 𝑛-variable functions.
Theorem 1.58 (Taylor theorem for 𝑛-variable functions). Suppose 𝑈 ⊂ ℝⁿ is an open convex set, 𝑓 ∶ 𝑈 → ℝ is a function such that for all 1 ≤ 𝑚 ≤ 𝑘 + 1, all 𝑚-th order partial derivatives of 𝑓 exist and are continuous on 𝑈. Then for any 𝑥, 𝑦 ∈ 𝑈, there exists 𝑠 ∈ [0, 1], such that
(1.27) 𝑓(𝑦) = ∑_{|𝛼|≤𝑘} [𝐷^𝛼𝑓(𝑥)∕𝛼!] (𝑦 − 𝑥)^𝛼 + ∑_{|𝛼|=𝑘+1} [𝐷^𝛼𝑓(𝑠𝑦 + (1 − 𝑠)𝑥)∕𝛼!] (𝑦 − 𝑥)^𝛼.
Proof. By the openness and convexity of 𝑈 , we may find 𝜖 > 0 such that {𝑡𝑦 + (1 − 𝑡)𝑥|𝑡 ∈
[−𝜖, 1 + 𝜖]} ⊂ 𝑈 . Now we define the single variable function 𝑔 ∶ [−𝜖, 1 + 𝜖] → ℝ by
𝑔(𝑡) = 𝑓 (𝑡𝑦 + (1 − 𝑡)𝑥).
Now we compute the 𝑘-th order derivative of 𝑔. We prove the following identity by induction on 𝑘:
𝑔⁽ᵏ⁾(𝑡) = ∑_{|𝛼|=𝑘} (𝑘!∕𝛼!) 𝐷^𝛼𝑓(𝑡𝑦 + (1 − 𝑡)𝑥)(𝑦 − 𝑥)^𝛼.
When 𝑘 = 1, this identity becomes
𝑔′(𝑡) = ∑_{𝑖=1}^{𝑛} 𝐷ᵢ𝑓(𝑡𝑦 + (1 − 𝑡)𝑥)(𝑦ᵢ − 𝑥ᵢ),
and this is just the chain rule. Now suppose the identity holds for 𝑘 ∈ ℤ₊. Then we have
𝑔⁽ᵏ⁺¹⁾(𝑡) = 𝐷ₜ[ ∑_{|𝛼|=𝑘} (𝑘!∕𝛼!) 𝐷^𝛼𝑓(𝑡𝑦 + (1 − 𝑡)𝑥)(𝑦 − 𝑥)^𝛼 ]
= ∑_{𝑖=1}^{𝑛} ∑_{|𝛼|=𝑘} (𝑘!∕𝛼!) 𝐷ᵢ𝐷^𝛼𝑓(𝑡𝑦 + (1 − 𝑡)𝑥)(𝑦ᵢ − 𝑥ᵢ)(𝑦 − 𝑥)^𝛼.
Note that for 𝛼 = (𝛼₁, 𝛼₂, ⋯, 𝛼ₙ) and any 1 ≤ 𝑖 ≤ 𝑛, if we write 𝛼* = (𝛼₁, ⋯, 𝛼_{𝑖−1}, 𝛼ᵢ + 1, 𝛼_{𝑖+1}, ⋯, 𝛼ₙ), then
𝐷ᵢ𝐷^𝛼𝑓(𝑡𝑦 + (1 − 𝑡)𝑥)(𝑦ᵢ − 𝑥ᵢ)(𝑦 − 𝑥)^𝛼 = 𝐷^{𝛼*}𝑓(𝑡𝑦 + (1 − 𝑡)𝑥)(𝑦 − 𝑥)^{𝛼*}.
In order to take the sum over 𝛼* with |𝛼*| = 𝑘 + 1, we notice that any 𝛼* = (𝛼₁*, ⋯, 𝛼ₙ*) with |𝛼*| = 𝑘 + 1 can be obtained by adding 1 to one coordinate of one of the following tuples: (𝛼₁* − 1, ⋯, 𝛼ₙ*), ⋯, (𝛼₁*, ⋯, 𝛼ₙ* − 1), and there are 𝑛 different possibilities. So the coefficient of 𝐷^{𝛼*}𝑓(𝑡𝑦 + (1 − 𝑡)𝑥)(𝑦 − 𝑥)^{𝛼*} is
∑_{𝑗=1}^{𝑛} 𝑘!𝛼ⱼ*∕𝛼*! = (𝑘 + 1)!∕𝛼*!.
Note that we can extend the notation by declaring the terms involving 𝐷^𝛼, 𝛼!, (𝑦 − 𝑥)^𝛼 to be 0 if 𝛼 = (𝛼₁, ⋯, 𝛼ₙ) has some 𝛼ᵢ < 0. The above argument in the induction step still works. This implies that
𝑔⁽ᵏ⁺¹⁾(𝑡) = ∑_{|𝛼|=𝑘+1} ((𝑘 + 1)!∕𝛼!) 𝐷^𝛼𝑓(𝑡𝑦 + (1 − 𝑡)𝑥)(𝑦 − 𝑥)^𝛼,
which concludes the induction.
Finally, we apply the Taylor theorem of single variable functions to 𝑔 on (−𝜖, 1 + 𝜖). There exists 𝑠 ∈ (0, 1) such that
𝑔(1) = ∑_{𝑙=0}^{𝑘} 𝑔⁽ˡ⁾(0)∕𝑙! + 𝑔⁽ᵏ⁺¹⁾(𝑠)∕(𝑘 + 1)!.
Then we use the expression of 𝑔⁽ˡ⁾(0) that we have obtained above to see that
𝑓(𝑦) = ∑_{|𝛼|≤𝑘} [𝐷^𝛼𝑓(𝑥)∕𝛼!] (𝑦 − 𝑥)^𝛼 + ∑_{|𝛼|=𝑘+1} [𝐷^𝛼𝑓(𝑠𝑦 + (1 − 𝑠)𝑥)∕𝛼!] (𝑦 − 𝑥)^𝛼.
For later purposes, we state Taylor’s theorem with the Peano remainder.
Theorem 1.59 (Taylor theorem for 𝑛-variable functions, with Peano remainder). Suppose 𝑈 ⊂ ℝⁿ is an open convex set, 𝑓 ∶ 𝑈 → ℝ is a function such that for all 1 ≤ 𝑚 ≤ 𝑘, all 𝑚-th order partial derivatives of 𝑓 exist and are continuous on 𝑈. Then
(1.28) 𝑓(𝑦) = ∑_{|𝛼|≤𝑘} [𝐷^𝛼𝑓(𝑥)∕𝛼!] (𝑦 − 𝑥)^𝛼 + 𝑜(‖𝑥 − 𝑦‖ᵏ).
Hess𝑓(𝑥) = ⎡𝑎₁ 0  0  ⋯ 0 ⎤
            ⎢0  𝑎₂ 0  ⋯ 0 ⎥
            ⎢0  0  𝑎₃ ⋯ 0 ⎥
            ⎢⋯  ⋯  ⋯  ⋯ ⋯⎥
            ⎣0  0  0  ⋯ 𝑎ₙ⎦
It is clear that Hess𝑓 is positive definite if 𝑎ᵢ > 0 for all 1 ≤ 𝑖 ≤ 𝑛.
Example 1.68. Let us consider a 2 × 2 symmetric matrix
𝐴 = ⎡𝑎 𝑏⎤
    ⎣𝑏 𝑐⎦.
One can show that it is positive definite if and only if 𝑎 > 0 and det 𝐴 = 𝑎𝑐 − 𝑏² > 0. (Indeed, for 𝑎 > 0 and 𝑣 = (𝑥, 𝑦), completing the square gives 𝑣ᵗ𝐴𝑣 = 𝑎(𝑥 + 𝑏𝑦∕𝑎)² + [(𝑎𝑐 − 𝑏²)∕𝑎]𝑦².)
Definition 1.69. Suppose 𝑈 ⊂ ℝ𝑛 is open and 𝑓 ∈ 𝐶 1 (𝑈 , ℝ). We say 𝑥0 is a local maximum
(resp. minimum) point of 𝑓 , if there exists a neighbourhood 𝑉 of 𝑥0 such that for all 𝑥 ∈ 𝑉 ,
𝑓 (𝑥) ≤ 𝑓 (𝑥0 ) (resp. 𝑓 (𝑥) ≥ 𝑓 (𝑥0 )).
Proposition 1.70. Suppose 𝑈 ⊂ ℝⁿ is open and 𝑓 ∈ 𝐶¹(𝑈, ℝ). Suppose 𝑥₀ is a local maximum point or a local minimum point. Then 𝐷𝑓(𝑥₀) = 0.
Proof. We prove the case that 𝑥₀ is a local maximum point; the local minimum point case is similar. We only need to show that 𝐷ᵥ𝑓(𝑥₀) = 0 for all 𝑣 ∈ ℝⁿ. Suppose 𝑥₀ is the maximum point in 𝑉. Then
𝐷ᵥ𝑓(𝑥₀) = lim_{ℎ→0} [𝑓(𝑥₀ + ℎ𝑣) − 𝑓(𝑥₀)]∕ℎ.
Because for ℎ sufficiently small such that 𝑥₀ + ℎ𝑣 ∈ 𝑉 we have 𝑓(𝑥₀ + ℎ𝑣) − 𝑓(𝑥₀) ≤ 0, it follows that
lim_{ℎ→0⁺} [𝑓(𝑥₀ + ℎ𝑣) − 𝑓(𝑥₀)]∕ℎ ≤ 0,  lim_{ℎ→0⁻} [𝑓(𝑥₀ + ℎ𝑣) − 𝑓(𝑥₀)]∕ℎ ≥ 0.
Therefore, 𝐷ᵥ𝑓(𝑥₀) = 0.
Just like in the single variable case, we call a point 𝑥₀ a critical point if 𝐷𝑓(𝑥₀) = 0.
Example 1.71. Critical points may not be local minimum/maximum points. Even in the single variable case, 0 is a critical point of 𝑓(𝑥) = 𝑥³, but it is neither a local minimum nor a local maximum.
Now we show that the Hessian of a function at a local minimum/maximum point must be semi-
definite.
Theorem 1.72. Suppose 𝑈 ⊂ ℝ𝑛 is open and 𝑓 ∈ 𝐶 2 (𝑈 , ℝ). Suppose 𝑥0 ∈ 𝑈 is a local mini-
mum point (resp. local maximum point) of 𝑓 , then Hess𝑓 (𝑥0 ) is semi-positive definite (resp. semi-
negative definite).
Proof. We prove this by contradiction. Suppose Hess𝑓(𝑥₀) is not semi-positive definite; then there exists 𝑣 = (𝑣₁, ⋯, 𝑣ₙ) ≠ 0 such that 𝑣ᵗ Hess𝑓(𝑥₀)𝑣 < 0. Because 𝑥₀ is a local minimum point, 𝐷𝑓(𝑥₀) = 0, so by the Taylor theorem with the Peano remainder,
𝑓(𝑥₀ + 𝑡𝑣) = 𝑓(𝑥₀) + 𝑡² ∑_{𝑖,𝑗=1}^{𝑛} [𝐷ᵢⱼ𝑓(𝑥₀)∕2] 𝑣ᵢ𝑣ⱼ + 𝑜(‖𝑡𝑣‖²)
= 𝑓(𝑥₀) + (𝑡²∕2) 𝑣ᵗ Hess𝑓(𝑥₀)𝑣 + 𝑜(𝑡²).
Because 𝑥₀ is a local minimum point, there exists 𝑡₀ > 0 such that when |𝑡| < 𝑡₀, 𝑓(𝑥₀ + 𝑡𝑣) ≥ 𝑓(𝑥₀). Hence
𝑣ᵗ Hess𝑓(𝑥₀)𝑣 + 2𝑜(𝑡²)∕𝑡² ≥ 0.
Letting 𝑡 → 0 we see 𝑣ᵗ Hess𝑓(𝑥₀)𝑣 ≥ 0, which is a contradiction.
Proof. We use the Taylor theorem with the Peano remainder. Because Hess𝑓(𝑥₀) is positive definite, there exists Λ > 0 such that ℎᵗ Hess𝑓(𝑥₀)ℎ ≥ 2Λ‖ℎ‖² for all ℎ. Because 𝑥₀ is a critical point, for ℎ = (ℎ₁, ℎ₂, ⋯, ℎₙ),
𝑓(𝑥₀ + ℎ) = 𝑓(𝑥₀) + ∑_{𝑖,𝑗=1}^{𝑛} [𝐷ᵢⱼ𝑓(𝑥₀)∕2] ℎᵢℎⱼ + 𝑜(‖ℎ‖²)
= 𝑓(𝑥₀) + (1∕2) ℎᵗ Hess𝑓(𝑥₀)ℎ + 𝑜(‖ℎ‖²)
≥ 𝑓(𝑥₀) + Λ‖ℎ‖² + 𝑜(‖ℎ‖²).
By the definition of the little 𝑜 notation, lim_{‖ℎ‖→0} 𝑜(‖ℎ‖²)∕‖ℎ‖² = 0. So there exists 𝛿 > 0 such that when 0 < ‖ℎ‖ < 𝛿, |𝑜(‖ℎ‖²)| ≤ (Λ∕2)‖ℎ‖². Therefore, when 0 < ‖ℎ‖ < 𝛿, we have
𝑓(𝑥₀ + ℎ) ≥ 𝑓(𝑥₀) + Λ‖ℎ‖² − (Λ∕2)‖ℎ‖² > 𝑓(𝑥₀).
This implies that 𝑥₀ is a local minimum point.
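In practice, this test is often carried out by computing the eigenvalues of the Hessian (see Section 2.9 and Remark 2.55: for a symmetric matrix, positive definite is equivalent to all eigenvalues being positive). Here is a Python sketch with numpy (the function below is my own example, not from the notes):

    import numpy as np

    # f(x, y) = x^2 + 3y^2 - xy has a critical point at the origin,
    # with constant Hessian
    H = np.array([[2., -1.],
                  [-1., 6.]])

    eigenvalues = np.linalg.eigvalsh(H)   # eigenvalues of a symmetric matrix
    print(eigenvalues)                    # both positive
    if np.all(eigenvalues > 0):
        print("Hessian positive definite: the origin is a local minimum")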
2. SUBMANIFOLDS IN ℝ𝑁
2.1. Invertible linear maps. Suppose 𝐴 ∶ ℝⁿ → ℝᵐ is a linear map. Let us consider in what circumstances 𝐴 is invertible; in other words, when for any 𝑦 ∈ ℝᵐ there exists a unique 𝑥 ∈ ℝⁿ such that 𝐴(𝑥) = 𝑦.
Suppose {𝛼₁, ⋯, 𝛼ₙ} is a basis of ℝⁿ, and {𝛽₁, ⋯, 𝛽ₘ} is a basis of ℝᵐ. Recall the linear map 𝐴 can be written as a matrix
𝐴 = ⎡𝑎₁₁ 𝑎₁₂ ⋯ 𝑎₁ₙ⎤
    ⎢𝑎₂₁ 𝑎₂₂ ⋯ 𝑎₂ₙ⎥
    ⎢ ⋯   ⋯  ⋯  ⋯ ⎥
    ⎣𝑎ₘ₁ 𝑎ₘ₂ ⋯ 𝑎ₘₙ⎦.
From linear algebra, we know that the matrix 𝐴 is invertible if and only if 𝑚 = 𝑛 and there exists a
matrix 𝐴−1 such that 𝐴𝐴−1 = 𝐴−1 𝐴 = Id, where Id is the identity matrix, whose (𝑖, 𝑖)-entries are
all 1 and all other entries are 0.
The following result is key to determining whether a matrix is invertible or not.
2.2. Inverse function theorem. Now we consider a map 𝑓 ∶ 𝑈 → 𝑉 , where 𝑈 and 𝑉 are open
sets. We want to know when 𝑓 is invertible.
Because 𝐷𝑓(𝑥₀) is the linear approximation of 𝑓 near 𝑥₀, we naturally expect that if 𝐷𝑓(𝑥₀) is invertible, then 𝑓 is invertible, at least in a neighbourhood of 𝑥₀.
The main goal of this section is to prove the following theorem.
Theorem 2.2 (Inverse function theorem). Suppose Ω is an open set in ℝ𝑛 , 𝑥0 ∈ Ω, 𝑓 ∈ 𝐶 1 (Ω, ℝ𝑛 )
and 𝐷𝑓 (𝑥0 ) is invertible. Then there exists a neighbourhood 𝑈 of 𝑥0 and a neighbourhood 𝑉 of
𝑓 (𝑥0 ), such that 𝑓 |𝑈 ∶ 𝑈 → 𝑉 is bijective, and 𝑓 −1 ∶ 𝑉 → 𝑈 is in 𝐶 1 (𝑉 , ℝ𝑛 ).
The proof is based on the Contraction Mapping Theorem.
Theorem 2.3 (Contraction Mapping Theorem). Suppose (𝑋, 𝑑) is a complete metric space and 𝑇 ∶ 𝑋 → 𝑋 is a contraction map, i.e. there exists 𝑎 ∈ (0, 1) such that for all 𝑥, 𝑦 ∈ 𝑋, 𝑑(𝑇(𝑥), 𝑇(𝑦)) ≤ 𝑎𝑑(𝑥, 𝑦). Then there exists a unique 𝑥₀ ∈ 𝑋 such that 𝑇(𝑥₀) = 𝑥₀.
The Contraction Mapping Theorem can be viewed as a method to solve “nonlinear” problems.
In mathematics, if we would like to solve linear problems, we usually adapt the ideas from linear
algebra; if we would like to solve nonlinear problems, we would like to apply some fixed point
theorem, and the Contraction Mapping Theorem is such a theorem.
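To see the theorem in action, here is a minimal Python sketch (my own illustration, not from the notes): 𝑇(𝑥) = cos(𝑥) maps [0, 1] to itself and |𝑇′(𝑥)| = |sin(𝑥)| ≤ sin(1) < 1 there, so it is a contraction, and iterating it converges to the unique fixed point.

    import math

    x = 0.0
    for _ in range(100):
        x = math.cos(x)       # iterate the contraction T(x) = cos(x)

    print(x)                  # ~0.739085, the unique solution of cos(x) = x
    assert abs(math.cos(x) - x) < 1e-12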
Proof of Inverse function theorem. The proof is divided into several parts.
Step 1: bijectivity. We first turn the problem into a problem of finding a fixed point. For simplicity, we suppose 𝑥₀ = 0, 𝑓(0) = 0 and 𝐷𝑓(0) = Id. (Otherwise, we can replace 𝑓(𝑥) by 𝐴⁻¹𝑓(𝑥 + 𝑥₀) − 𝐴⁻¹𝑓(𝑥₀), where 𝐴 = 𝐷𝑓(𝑥₀).) Given 𝑦, our goal is to find 𝑥 ∈ Ω such that 𝑓(𝑥) = 𝑦. Let
𝑇(𝑥) = 𝑥 − 𝑓(𝑥) + 𝑦.
If 𝑥 is a fixed point of 𝑇, then 𝑇(𝑥) = 𝑥, hence 𝑓(𝑥) = 𝑦. So the question becomes finding a fixed point of 𝑇.
Now we check that 𝑇 is a contraction map.
𝑇(𝑥₁) − 𝑇(𝑥₂) = 𝑥₁ − 𝑥₂ − (𝑓(𝑥₁) − 𝑓(𝑥₂)).
If we apply the mean value theorem to the map Id − 𝑓, we have for 𝑥₁, 𝑥₂ ∈ 𝐵_𝛿(0),
‖𝑇(𝑥₁) − 𝑇(𝑥₂)‖ ≤ 𝑛 sup_{𝑧∈𝐵_𝛿(0)} ‖Id − 𝐷𝑓(𝑧)‖ ‖𝑥₁ − 𝑥₂‖.
Because 𝑓 ∈ 𝐶¹(Ω, ℝⁿ) and 𝐷𝑓(0) = Id, there exists 𝛿 > 0 such that
sup_{𝑧∈𝐵_𝛿(0)} ‖Id − 𝐷𝑓(𝑧)‖ < 1∕(2𝑛).
Then inside 𝐵_𝛿(0),
‖𝑇(𝑥₁) − 𝑇(𝑥₂)‖ ≤ (1∕2)‖𝑥₁ − 𝑥₂‖,
hence 𝑇 is a contraction map.
In order to apply the contraction mapping theorem, we need to choose 𝑦 such that 𝑇 is a map from 𝐵_𝛿(0) to 𝐵_𝛿(0). In fact, if 𝑦 is chosen such that ‖𝑦‖ < 𝛿∕2, then for 𝑥 ∈ 𝐵_𝛿(0),
‖𝑇(𝑥)‖ ≤ ‖𝑇(𝑥) − 𝑇(0)‖ + ‖𝑇(0)‖ ≤ (1∕2)‖𝑥‖ + ‖𝑦‖ < 𝛿.
This implies that 𝑇 is a map from 𝐵_𝛿(0) to 𝐵_𝛿(0). Then by the contraction mapping theorem, we conclude the following fact: there exists 𝛿 > 0 such that for 𝑦 ∈ 𝐵_{𝛿∕2}(0), there exists 𝑥 ∈ 𝐵_𝛿(0) such that 𝑓(𝑥) = 𝑦. If we call 𝑉 ∶= 𝐵_{𝛿∕2}(0) and 𝑈 = 𝑓⁻¹(𝑉), we have proved that 𝑓|_𝑈 ∶ 𝑈 → 𝑉 is a bijection.
Step 2: 𝑓⁻¹ is continuous. For simplicity we denote 𝑓⁻¹ by 𝑔. Suppose 𝑓(𝑥₁) = 𝑦₁ and 𝑓(𝑥₂) = 𝑦₂. Then
‖𝑔(𝑦₁) − 𝑔(𝑦₂)‖ = ‖𝑥₁ − 𝑥₂‖ = ‖𝑇(𝑥₁) + 𝑓(𝑥₁) − (𝑇(𝑥₂) + 𝑓(𝑥₂))‖ ≤ ‖𝑇(𝑥₁) − 𝑇(𝑥₂)‖ + ‖𝑦₁ − 𝑦₂‖.
Here 𝑇 is the contraction map with any fixed 𝑦 ∈ 𝐵_{𝛿∕2}(0). Then the contraction property of 𝑇 implies that ‖𝑇(𝑥₁) − 𝑇(𝑥₂)‖ ≤ (1∕2)‖𝑥₁ − 𝑥₂‖. This implies that
‖𝑥₁ − 𝑥₂‖ ≤ (1∕2)‖𝑥₁ − 𝑥₂‖ + ‖𝑦₁ − 𝑦₂‖.
Thus, we have
‖𝑥₁ − 𝑥₂‖ ≤ 2‖𝑦₁ − 𝑦₂‖
and
‖𝑔(𝑦₁) − 𝑔(𝑦₂)‖ ≤ 2‖𝑦₁ − 𝑦₂‖.
This implies that 𝑓 −1 is continuous.
Step 3: differentiability of 𝑓⁻¹. Let us use 𝑔 to denote 𝑓⁻¹. We only need to show that 𝐷ᵢ𝑔 exists and is continuous in 𝐵_{𝛿∕2}(0); then Theorem 1.28 implies that 𝑔 is differentiable on 𝐵_{𝛿∕2}(0). Suppose 𝑦 = 𝑓(𝑥) and 𝐵 = 𝐷𝑓(𝑥). Then for 𝑣 ≠ 0 such that 𝑦 + 𝑣 ∈ 𝐵_{𝛿∕2}(0), let 𝑔(𝑦 + 𝑣) = 𝑥′. Then
‖𝑔(𝑦 + 𝑣) − 𝑔(𝑦) − 𝐵⁻¹𝑣‖ = ‖𝑥′ − 𝑥 − 𝐵⁻¹(𝑓(𝑥′) − 𝑓(𝑥))‖ ≤ 𝑛 sup_{𝑧∈𝐵_{‖𝑥′−𝑥‖}(𝑥)} ‖Id − 𝐵⁻¹◦𝐷𝑓(𝑧)‖ ‖𝑥′ − 𝑥‖.
From Step 2,
‖𝑥′ − 𝑥‖ ≤ 2‖(𝑦 + 𝑣) − 𝑦‖ = 2‖𝑣‖.
Therefore, we conclude that
lim_{𝑣→0} ‖𝑔(𝑦 + 𝑣) − 𝑔(𝑦) − 𝐵⁻¹𝑣‖ ∕ ‖𝑣‖ = 0.
This shows that 𝑔 is differentiable on 𝐵_{𝛿∕2}(0), and 𝐷𝑔(𝑦) = (𝐷𝑓(𝑥))⁻¹ for 𝑦 = 𝑓(𝑥).
Remark 2.4. 𝐷(𝑓⁻¹) = (𝐷𝑓)⁻¹ is expected from the chain rule: differentiating 𝑓⁻¹◦𝑓 = id gives 𝐷(𝑓⁻¹)(𝑓(𝑥))◦𝐷𝑓(𝑥) = Id.
𝐷𝑓(𝑥, 𝑦) = ⎡(𝑥³ + 3𝑥𝑦²)∕(𝑥² + 𝑦²)^{3∕2}  −(𝑦³ + 3𝑥²𝑦)∕(𝑥² + 𝑦²)^{3∕2}⎤
            ⎣2𝑦³∕(𝑥² + 𝑦²)^{3∕2}          2𝑥³∕(𝑥² + 𝑦²)^{3∕2}       ⎦
It is always invertible. Therefore, even if a function has an invertible differential everywhere, the function need not be invertible globally.
We finish this section by discussing some applications of the inverse function theorem.
Theorem 2.13 (Continuity of roots). Given (𝑐_{𝑛−1}, 𝑐_{𝑛−2}, ⋯, 𝑐₁, 𝑐₀) ∈ ℝⁿ, we can define a polynomial 𝑝(𝑠) = 𝑠ⁿ + 𝑐_{𝑛−1}𝑠^{𝑛−1} + ⋯ + 𝑐₁𝑠 + 𝑐₀. If 𝑝 has 𝑛 distinct real roots 𝑥₁ < 𝑥₂ < ⋯ < 𝑥ₙ, then the roots depend smoothly on the coefficients.
To set this up, consider the map 𝐹(𝑥₁, ⋯, 𝑥ₙ) = (𝑐_{𝑛−1}(𝑥), ⋯, 𝑐₀(𝑥)), where
𝑐₀(𝑥) = (−1)ⁿ𝑥₁𝑥₂⋯𝑥ₙ,
𝑐₁(𝑥) = (−1)^{𝑛−1} ∑_{1≤𝑖₁<𝑖₂<⋯<𝑖_{𝑛−1}≤𝑛} 𝑥_{𝑖₁}𝑥_{𝑖₂}⋯𝑥_{𝑖_{𝑛−1}},
⋯
𝑐_{𝑛−2}(𝑥) = ∑_{1≤𝑖₁<𝑖₂≤𝑛} 𝑥_{𝑖₁}𝑥_{𝑖₂},
𝑐_{𝑛−1}(𝑥) = −∑_{𝑖=1}^{𝑛} 𝑥ᵢ.
They are called Newton polynomials. We can obtain these expressions by expanding the product 𝑝(𝑠) = (𝑠 − 𝑥₁)(𝑠 − 𝑥₂)⋯(𝑠 − 𝑥ₙ).
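Numerically, this correspondence between roots and coefficients is easy to play with; here is a sketch assuming numpy (np.poly expands ∏(𝑠 − 𝑥ᵢ) into coefficients, and np.roots goes back):

    import numpy as np

    roots = np.array([-1.0, 0.5, 2.0])   # x1 < x2 < x3 distinct
    coeffs = np.poly(roots)              # [1, c_{n-1}, ..., c_0]
    print(coeffs)                        # [ 1.  -1.5 -1.5  1. ]

    # perturbing the coefficients slightly moves the roots only slightly,
    # as Theorem 2.13 predicts for distinct real roots
    perturbed = coeffs + np.array([0., 1e-6, -1e-6, 1e-6])
    print(np.sort(np.roots(perturbed).real))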
Let us compute the differential of 𝐹. To do so, we need a trick. Let
𝑃(𝑠, 𝑥) = (𝑠 − 𝑥₁)(𝑠 − 𝑥₂)⋯(𝑠 − 𝑥ₙ) = 𝑠ⁿ + 𝑐_{𝑛−1}(𝑥)𝑠^{𝑛−1} + ⋯ + 𝑐₁(𝑥)𝑠 + 𝑐₀(𝑥).
Then
𝐷ᵢ𝑃(𝑠, 𝑥) = −(𝑠 − 𝑥₁)(𝑠 − 𝑥₂)⋯(𝑠 − 𝑥_{𝑖−1})(𝑠 − 𝑥_{𝑖+1})⋯(𝑠 − 𝑥ₙ) = 𝐷ᵢ𝑐_{𝑛−1}(𝑥)𝑠^{𝑛−1} + ⋯ + 𝐷ᵢ𝑐₁(𝑥)𝑠 + 𝐷ᵢ𝑐₀(𝑥).
In particular, if we plug in 𝑠 = 𝑥ⱼ, we see that for 𝑗 ≠ 𝑖,
𝐷ᵢ𝑐_{𝑛−1}(𝑥)𝑥ⱼ^{𝑛−1} + ⋯ + 𝐷ᵢ𝑐₁(𝑥)𝑥ⱼ + 𝐷ᵢ𝑐₀(𝑥) = 0,
and
𝐷ᵢ𝑐_{𝑛−1}(𝑥)𝑥ᵢ^{𝑛−1} + ⋯ + 𝐷ᵢ𝑐₁(𝑥)𝑥ᵢ + 𝐷ᵢ𝑐₀(𝑥) ≠ 0.
Let us use 𝑎ᵢ to denote 𝐷ᵢ𝑐_{𝑛−1}(𝑥)𝑥ᵢ^{𝑛−1} + ⋯ + 𝐷ᵢ𝑐₁(𝑥)𝑥ᵢ + 𝐷ᵢ𝑐₀(𝑥).
This implies that
⎡𝑥₁^{𝑛−1} 𝑥₁^{𝑛−2} ⋯ 𝑥₁ 1⎤        ⎡𝑎₁ 0  0  ⋯ 0 ⎤
⎢𝑥₂^{𝑛−1} 𝑥₂^{𝑛−2} ⋯ 𝑥₂ 1⎥ 𝐷𝐹 = ⎢0  𝑎₂ 0  ⋯ 0 ⎥.
⎢ ⋯        ⋯       ⋯  ⋯  ⋯⎥        ⎢⋯  ⋯  ⋯  ⋯ ⋯⎥
⎣𝑥ₙ^{𝑛−1} 𝑥ₙ^{𝑛−2} ⋯ 𝑥ₙ 1⎦        ⎣0  0  0  ⋯ 𝑎ₙ⎦
Because the left factor (a Vandermonde matrix with distinct 𝑥ᵢ) and the right-hand side are invertible matrices, 𝐷𝐹 is also invertible. Then by the inverse function theorem, we know that in a neighbourhood of (𝑐_{𝑛−1}, ⋯, 𝑐₀), the map
(𝑐_{𝑛−1}, 𝑐_{𝑛−2}, ⋯, 𝑐₁, 𝑐₀) ↦ (𝑥₁, 𝑥₂, ⋯, 𝑥ₙ)
is smooth as long as 𝑥₁ < 𝑥₂ < ⋯ < 𝑥ₙ.
Corollary 2.14. The eigenvalues of a matrix depend smoothly on the matrix, as long as they are distinct.
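A numerical illustration of the corollary (a numpy sketch; the matrices below are my own choices): perturb a symmetric matrix with distinct eigenvalues and watch the difference quotients of the eigenvalues converge.

    import numpy as np

    A = np.array([[2., 1.],
                  [1., 0.]])             # symmetric, eigenvalues 1 ± sqrt(2)
    E = np.array([[0., 1.],
                  [1., 1.]])             # a perturbation direction

    for t in [1e-2, 1e-3, 1e-4]:
        quotient = (np.linalg.eigvalsh(A + t * E) - np.linalg.eigvalsh(A)) / t
        print(t, quotient)               # quotients converge: the eigenvalues
                                         # are differentiable in t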
2.3. 𝑘-surfaces and submanifolds in ℝᴺ. The inverse function theorem can be interpreted as follows: if the differential of a 𝐶¹-map is invertible, then the map is a local diffeomorphism. In particular, this requires the domain and the target of the map to have the same dimensions.
A natural question is: what if the dimensions of the domain and the target are different? Previously we saw that a map 𝛾 from an interval 𝐼 ⊂ ℝ to 𝑈 ⊂ ℝⁿ is a curve if we require 𝛾′ ≠ 0; the reason we require 𝛾′ ≠ 0 is that we do not want to see "degenerate" curves like a constant map, whose "curve" is actually a point.
We adapt this idea to higher dimensions.
Definition 2.15. Suppose 𝑈 ⊂ ℝ𝑘 is an open set. A map 𝐹 ∶ 𝑈 → ℝ𝑁 is called a 𝐶 1 𝑘-surface if
𝐹 ∈ 𝐶 1 (𝑈 , ℝ𝑁 ) and 𝐷𝐹 (𝑥) has rank 𝑘 for all 𝑥 ∈ 𝑈 .
Of course, we can change the regularity in the definition to get surfaces with other regularity, e.g. 𝐶ᵐ 𝑘-surfaces, smooth 𝑘-surfaces, etc.
Let us recall the definition of rank. The rank of a linear map 𝐴 ∶ ℝ𝑛 → ℝ𝑚 is the dimension of
the image of 𝐴. Equivalently, the rank of a matrix 𝐴 is the maximum number of linearly independent
column vectors.
We require 𝐷𝐹 (𝑥) to have rank 𝑘 to assure the image also “has dimension 𝑘”. The meaning of
this sentence is not clear at this moment, and we will try to make it more precise in the following
several sections.
Example 2.16 (Hemisphere). Let 𝐵₁²(0) be the unit disk in ℝ². Consider a map 𝐹 ∶ 𝐵₁²(0) → ℝ³,
𝐹(𝑥, 𝑦) = (𝑥, 𝑦, √(1 − (𝑥² + 𝑦²))).
Then the image of 𝐹 is the upper hemisphere. We can check that 𝐷𝐹 has rank 2 everywhere in 𝐵₁²(0).
Example 2.17 (Graph of a map). Suppose 𝑈 is open in ℝᵏ and 𝑓 ∶ 𝑈 → ℝᵐ is a 𝐶¹-map. Then we can define a map 𝐹 ∶ 𝑈 → ℝ^{𝑚+𝑘} by
𝐹(𝑥) = (𝑥, 𝑓(𝑥)).
Then 𝐹 is a 𝐶¹ 𝑘-surface.
To see this, we can compute the differential of 𝐹:
𝐷𝐹(𝑥) = ⎡ Id  ⎤
         ⎣𝐷𝑓(𝑥)⎦,
where Id is the 𝑘 by 𝑘 identity matrix and 𝐷𝑓(𝑥) is an 𝑚 by 𝑘 matrix. Because of the identity block, this matrix has rank 𝑘.
Now we would like to upgrade the notion of 𝑘-surfaces. Each 𝑘-surface can be viewed as a parametrization of a surface by an open set in ℝᵏ. We will allow multiple parametrizations.
Definition 2.18. A 𝑘-dimensional 𝐶¹-submanifold 𝑋 ⊂ ℝᴺ is a set 𝑋 such that for each 𝑥 ∈ 𝑋 there exist an open set 𝑈 ⊂ ℝᵏ, a 𝐶¹ 𝑘-surface 𝜑 ∶ 𝑈 → ℝᴺ and an open set 𝑉 ⊂ ℝᴺ containing 𝑥, such that 𝑋 ∩ 𝑉 is the image of 𝜑.
Such a pair (𝑈 , 𝜑) is called a parametrization of 𝑋. The collection of (𝑈 , 𝜑) is called an atlas
of 𝑋.
Example 2.19 (Unit sphere). The unit sphere 𝑆ⁿ is an 𝑛-dimensional submanifold in ℝ^{𝑛+1}, defined as the set
𝑆ⁿ ∶= {(𝑥₁, 𝑥₂, ⋯, 𝑥_{𝑛+1}) ∶ 𝑥₁² + 𝑥₂² + ⋯ + 𝑥_{𝑛+1}² = 1}.
We can find the following atlas of 𝑆ⁿ: let 𝐵₁(0) ⊂ ℝⁿ be the open unit ball. For any 1 ≤ 𝑖 ≤ 𝑛 + 1, we can use graphs to define the parametrizations 𝜑ᵢ± ∶ 𝐵₁(0) → ℝ^{𝑛+1} by
𝜑ᵢ⁺(𝑦₁, ⋯, 𝑦ₙ) = (𝑦₁, ⋯, 𝑦_{𝑖−1}, √(1 − (𝑦₁² + ⋯ + 𝑦ₙ²)), 𝑦ᵢ, ⋯, 𝑦ₙ),
𝜑ᵢ⁻(𝑦₁, ⋯, 𝑦ₙ) = (𝑦₁, ⋯, 𝑦_{𝑖−1}, −√(1 − (𝑦₁² + ⋯ + 𝑦ₙ²)), 𝑦ᵢ, ⋯, 𝑦ₙ),
where the square root sits in the 𝑖-th coordinate. Then the collection {(𝐵₁(0), 𝜑ᵢ±)}_{𝑖=1}^{𝑛+1} is an atlas of 𝑆ⁿ.
Remark 2.20. "Sub" in the name submanifold means that they are subsets of some larger space. We also have the notion of "manifolds", which are geometric objects that do not sit inside some larger space. Manifolds are the model of our universe - they are the mathematical language for general relativity.
2.4. Tangent space and 𝐶 1 -maps. Let us discuss two important concepts related to submanifolds.
We will see how to use the parametrizations to characterize local properties.
Definition 2.21. Suppose 𝑀 ⊂ ℝ𝑁 is a 𝑘-dimensional submanifold, 𝑥 ∈ 𝑀 is a point. Suppose
𝜑 ∶ 𝑈 → ℝ𝑁 is a parametrization of 𝑀 and 𝜑(𝑝) = 𝑥. Then the tangent space of 𝑀 at 𝑥 is the
set
{𝑥 + 𝐷𝜑(𝑝)𝑣|𝑣 ∈ ℝ𝑘 } =∶ 𝑇𝑥 𝑀.
Notice that this definition seems to rely on the choice of the parametrization (𝑈, 𝜑). Nevertheless, this definition is actually independent of the choice of the parametrization. We will see this later.
We can view the tangent spaces as vector spaces if we shift 𝑥 to the origin. Because the rank of
𝐷𝜑(𝑝) is 𝑘, it is a 𝑘-dimensional space.
In the case that we can visualize: 1-dimensional submanifolds in ℝ2 or ℝ3 (curves), 2-dimensional
submanifolds in ℝ3 (surfaces), the tangent space is just the line (in ℝ2 ) or the plane (in ℝ3 ) that
“touch” the submanifold at a point.
Example 2.22. Consider 𝑆² ⊂ ℝ³. The tangent space at the north pole (0, 0, 1) can be computed explicitly. Consider the upper hemisphere parametrization
𝜑 ∶ 𝐵₁²(0) → ℝ³,  𝜑(𝑥, 𝑦) = (𝑥, 𝑦, √(1 − 𝑥² − 𝑦²)).
Then 𝜑(0, 0) = (0, 0, 1) and
𝐷𝜑(𝑥, 𝑦) = ⎡1                   0                  ⎤
            ⎢0                   1                  ⎥
            ⎣−𝑥∕√(1 − 𝑥² − 𝑦²)  −𝑦∕√(1 − 𝑥² − 𝑦²)⎦.
Then the image of (1, 0), (0, 1) ∈ ℝ2 under 𝐷𝜑(0, 0) is (1, 0, 0) and (0, 1, 0). So the tangent space
is the plane
𝑇(0,0,1) 𝑆 2 = {(𝑠, 𝑡, 0)|𝑠 ∈ ℝ, 𝑡 ∈ ℝ}.
Similarly, if 𝑝 = (𝑎, 𝑏, 𝑐) is any point in the upper hemisphere, we can compute that
𝑇𝑝 𝑆 2 = {(𝑎, 𝑏, 𝑐) + 𝑠(𝑐, 0, −𝑎) + 𝑡(0, 𝑐, −𝑏)|𝑠 ∈ ℝ, 𝑡 ∈ ℝ}.
Definition 2.23. Suppose 𝑀 ⊂ ℝ𝑁 is a 𝑘-dimensional submanifold, 𝑥 ∈ 𝑀 is a point. Then the
normal space, denoted by 𝑁𝑥 𝑀, is the subspace that is orthogonal complement to 𝑇𝑥 𝑀.
Recall that we say a subspace 𝑉 ⊂ ℝᴺ is the orthogonal complement of 𝑊 ⊂ ℝᴺ if for any 𝑣₁, 𝑣₂ ∈ 𝑉 and 𝑤₁, 𝑤₂ ∈ 𝑊, ⟨𝑣₁ − 𝑣₂, 𝑤₁ − 𝑤₂⟩ = 0, and 𝑉 and 𝑊 span the whole ℝᴺ. (The differences appear because 𝑇ₓ𝑀 is an affine subspace through 𝑥.) Then by linear algebra, dim 𝑁ₓ𝑀 = 𝑁 − 𝑘.
Example 2.24. In the sphere example, because for a point 𝑝 = (𝑎, 𝑏, 𝑐) in the upper hemisphere,
𝑇𝑝 𝑆 2 = {(𝑎, 𝑏, 𝑐) + 𝑠(𝑐, 0, −𝑎) + 𝑡(0, 𝑐, −𝑏)|𝑠 ∈ ℝ, 𝑡 ∈ ℝ},
So we can compute that
𝑁𝑝 𝑆 2 = {𝑡(𝑎, 𝑏, 𝑐)|𝑡 ∈ ℝ}.
If a submanifold 𝑀 has dimension 𝑁 − 1 in ℝ𝑁 , then for any 𝑥 ∈ 𝑀, 𝑁𝑥 𝑀 has dimension 1.
So there is only one (up to ± sign) vector 𝐧 in this space with ‖𝐧‖ = 1. Such a vector is called a
unit normal vector.
It is natural to define maps, or even functions, from a submanifold to another submanifold. How-
ever, it is a little bit subtle to define the regularity of the maps, because the submanifolds are just
subsets of ℝ𝑁 , and a map defined on a submanifold may not be differentiable in ℝ𝑁 . Instead, we
use the parametrizations to define the regularity of a map.
Definition 2.25. Suppose 𝑀 ⊂ ℝ𝑁 is a submanifold. Then a map 𝐹 ∶ 𝑀 → ℝ𝑚 is 𝐶 1 (denoted by
𝐹 ∈ 𝐶 1 (𝑀, ℝ𝑚 )) if for any parametrization 𝜑 ∶ 𝑈 → 𝑀, the composition 𝐹 ◦𝜑 ∶ 𝑈 → ℝ𝑚 is 𝐶 1 .
Here we use the idea that the regularity (or differentiability) is a local property. Similarly, we
define a map to be 𝐶 𝑘 , smooth, etc.
Example 2.26. Consider a function 𝑓 ∶ 𝑆 2 → ℝ defined by 𝑓 (𝑥, 𝑦, 𝑧) = 𝑧. We can check that this
function is a smooth function using the definition. For example, on the front hemisphere, if we use
the parametrization 𝜑 ∶ 𝐵₁²(0) → ℝ³ defined by
𝜑(𝑥, 𝑦) = (𝑥, √(1 − 𝑥² − 𝑦²), 𝑦),
then 𝑓◦𝜑(𝑥, 𝑦) = 𝑦, which is smooth. This function 𝑓 gives the latitude of 𝑆².
Proposition 2.27. Suppose 𝑀 ⊂ ℝ𝑁 is a submanifold, Ω ⊂ ℝ𝑁 is open, and 𝑀 ⊂ Ω. If 𝐹 ∶ Ω →
ℝ𝑚 is 𝐶 1 , then 𝐹 |𝑀 is 𝐶 1 . Similar statements hold for 𝐶 𝑘 and 𝐶 ∞ if the submanifold is 𝐶 𝑘 or 𝐶 ∞
respectively.
Proof. Suppose 𝜑 ∶ 𝑈 → ℝ𝑁 is a parametrization, then 𝐹 ◦𝜑 is 𝐶 1 by the chain rule.
This proposition gives another proof that the map 𝑓 ∶ 𝑆² → ℝ sending (𝑥, 𝑦, 𝑧) ↦ 𝑧 is smooth.
2.5. Implicit function theorem. The concept of submanifolds suggests that we should study maps
and their differentials between domains in different dimensional spaces. We have an analog of the
inverse function theorem in this setting.
Before then, let us first introduce some notation. Suppose Ω ⊂ ℝ^{𝑛+𝑝} is open. We write ℝ^{𝑛+𝑝} = ℝⁿ × ℝᵖ, and we will use (𝑥, 𝑦) to denote the points in the Cartesian product ℝⁿ × ℝᵖ. Given a function 𝐹 ∶ Ω → ℝᵐ, we will use 𝐷ₓ𝐹 to denote the 𝑚 by 𝑛 matrix of partial derivatives in the 𝑥-variables, and 𝐷_𝑦𝐹 to denote the 𝑚 by 𝑝 matrix
𝐷_𝑦𝐹 = ⎡𝐷_{𝑦₁}𝐹₁ 𝐷_{𝑦₂}𝐹₁ ⋯ 𝐷_{𝑦ₚ}𝐹₁⎤
        ⎢𝐷_{𝑦₁}𝐹₂ 𝐷_{𝑦₂}𝐹₂ ⋯ 𝐷_{𝑦ₚ}𝐹₂⎥
        ⎢ ⋯        ⋯       ⋯  ⋯      ⎥
        ⎣𝐷_{𝑦₁}𝐹ₘ 𝐷_{𝑦₂}𝐹ₘ ⋯ 𝐷_{𝑦ₚ}𝐹ₘ⎦.
Theorem 2.28 (Implicit function theorem). Suppose Ω ⊂ ℝ𝑛+𝑝 is open, 𝐹 ∈ 𝐶 1 (Ω, ℝ𝑝 ). Suppose
(𝑥0 , 𝑦0 ) ∈ Ω such that 𝐹 (𝑥0 , 𝑦0 ) = 0, and 𝐷𝑦 𝐹 (𝑥0 , 𝑦0 ) is invertible, then there exists a neighbour-
hood 𝑈 ⊂ ℝ𝑛 of 𝑥0 and a neighbourhood 𝑉 ⊂ ℝ𝑝 of 𝑦0 and a map 𝜙 ∈ 𝐶 1 (𝑈 , 𝑉 ), such that
∙ (𝑥, 𝑦) ∈ 𝑈 × 𝑉 and 𝐹 (𝑥, 𝑦) = 0
if and only if
∙ 𝑥 ∈ 𝑈 and 𝑦 = 𝜙(𝑥).
Moreover,
𝐷𝜙(𝑥) = −(𝐷_𝑦𝐹(𝑥, 𝜙(𝑥)))⁻¹◦𝐷ₓ𝐹(𝑥, 𝜙(𝑥)).
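For example, take 𝑛 = 𝑝 = 1 and 𝐹(𝑥, 𝑦) = 𝑥² + 𝑦² − 1, whose zero set is the unit circle. At a point (𝑥₀, 𝑦₀) with 𝑦₀ ≠ 0 we have 𝐷_𝑦𝐹(𝑥₀, 𝑦₀) = 2𝑦₀ ≠ 0, and the theorem produces 𝜙(𝑥) = ±√(1 − 𝑥²) (with the sign of 𝑦₀) together with
𝐷𝜙(𝑥) = −(𝐷_𝑦𝐹)⁻¹𝐷ₓ𝐹 = −(2𝜙(𝑥))⁻¹(2𝑥) = −𝑥∕𝜙(𝑥),
which matches differentiating √(1 − 𝑥²) directly. At (±1, 0), where 𝐷_𝑦𝐹 = 0, the circle is not locally a graph of 𝑦 over 𝑥, and indeed the theorem does not apply there.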
It is quite difficult (to be honest, impossible) to understand the Implicit function theorem without
geometry. Let me state an equivalent version, which illustrates the geometric meaning behind it.
Theorem 2.29 (Implicit function theorem, submanifold version). Suppose Ω ⊂ ℝ^{𝑛+𝑝} is open, 𝐹 ∈ 𝐶¹(Ω, ℝᵖ). Suppose (𝑥₀, 𝑦₀) ∈ Ω is such that 𝐹(𝑥₀, 𝑦₀) = 0 and 𝐷_𝑦𝐹(𝑥₀, 𝑦₀) is invertible. Then there exist a neighbourhood 𝑈 ⊂ ℝⁿ of 𝑥₀ and a neighbourhood 𝑉 ⊂ ℝᵖ of 𝑦₀, such that {𝐹(𝑥, 𝑦) = 0} ∩ (𝑈 × 𝑉) is an 𝑛-surface.
In fact, the parametrization is given by Φ ∶ 𝑈 → ℝ^{𝑛+𝑝} defined by
Φ(𝑥) = (𝑥, 𝜙(𝑥)).
Example 2.30 (Local model). Let us consider the map 𝐹 (𝑥, 𝑦) = 𝑦. This map clearly has 𝐹 (0, 0) =
0 and 𝐷𝑦 𝐹 (0, 0) = Id. We can see that 𝐹 (𝑥, 𝑦) = 0 if and only if 𝑦 = 0, and the function 𝜙 in the
implicit function theorem is just the constant map 𝜙(𝑥) = 0.
hence
𝑇_{(𝑥₀,𝑦₀)}𝑀 = {(𝑥₀, 𝑦₀) + (𝑣, −(𝐷_𝑦𝐹(𝑥₀, 𝜙(𝑥₀)))⁻¹◦𝐷ₓ𝐹(𝑥₀, 𝜙(𝑥₀))𝑣) ∶ 𝑣 ∈ ℝⁿ}.
Note that this expression only relies on 𝐹, not on the choice of the parametrization.
Proof. By the implicit function theorem, since 𝐷𝐹(𝑥₀) is surjective, the set 𝑀 ∶= {𝑥 | 𝐹(𝑥) = 0} is locally a submanifold, and we can find a parametrization Φ ∶ 𝑈 → ℝᴺ where 𝑈 ⊂ ℝ^{𝑁−𝑚} is open and Φ is 𝐶¹ with Φ(0) = 𝑥₀. Then the fact that 𝑥₀ is a local minimum/maximum point of the function 𝑓 ∈ 𝐶¹(Ω, ℝ) under the constraint 𝐹(𝑥) = 0 implies that 𝑓◦Φ ∶ 𝑈 → ℝ has a local minimum/maximum at 0. By the characterization of local minimum/maximum points, we have 𝐷(𝑓◦Φ)(0) = 0. By the chain rule, 𝐷𝑓(𝑥₀)◦𝐷Φ(0) = 0.
Plugging in any vector 𝑣, this gives 𝐷𝑓(𝑥₀)(𝐷Φ(0)𝑣) = 0, and by the expression of the tangent space that we have discussed in the last section, 𝐷𝑓(𝑥₀) is perpendicular to the tangent space 𝑇_{𝑥₀}𝑀, and hence 𝐷𝑓(𝑥₀) is perpendicular to the kernel of 𝐷𝐹(𝑥₀). Therefore, 𝐷𝑓(𝑥₀) is a linear combination of the row vectors of 𝐷𝐹(𝑥₀); namely there exists 𝜆₀ ∈ ℝᵐ (as a column vector) such that
𝐷𝑓(𝑥₀) = 𝜆₀𝐷𝐹(𝑥₀).
Now differentiate 𝐿:
𝐷𝐿(𝑥, 𝜆) = (𝐷₁𝑓(𝑥) + ⟨𝜆, 𝐷₁𝐹(𝑥)⟩, 𝐷₂𝑓(𝑥) + ⟨𝜆, 𝐷₂𝐹(𝑥)⟩, ⋯, 𝐷_𝑁𝑓(𝑥) + ⟨𝜆, 𝐷_𝑁𝐹(𝑥)⟩, 𝐹(𝑥)).
So a critical point (𝑥, 𝜆) of 𝐿 satisfies 𝐷ᵢ𝑓(𝑥) + ⟨𝜆, 𝐷ᵢ𝐹(𝑥)⟩ = 0 for 𝑖 = 1, 2, ⋯, 𝑁 and 𝐹(𝑥) = 0. Then we can see that (𝑥₀, −𝜆₀) is a critical point of 𝐿.
Remark 2.43. In the proof we used the following lemma: suppose 𝐴 ∶ ℝ𝑛 → ℝ𝑚 is linear. Then
if a vector 𝑣 ∈ ℝ𝑛 is perpendicular to the kernel of 𝐴, then 𝑣 is a linear combination of the row
vectors of 𝐴.
The intuition behind the Lagrange multiplier is the following: −𝐷𝑓(𝑥) is the direction where 𝑓 decreases fastest at 𝑥, in the following sense: for any 𝑣, 𝐷ᵥ𝑓(𝑥) = ⟨𝐷𝑓(𝑥), 𝑣⟩. If we plug in 𝑣 = −𝐷𝑓(𝑥), then 𝐷ᵥ𝑓(𝑥) = −‖𝐷𝑓(𝑥)‖² ≤ 0. Moreover, by the Cauchy-Schwarz inequality for the inner product,
|⟨𝑣, 𝐷𝑓(𝑥)⟩| ≤ ‖𝑣‖‖𝐷𝑓(𝑥)‖,
we see that among all 𝑣 of the same length, 𝑣 = −𝐷𝑓(𝑥) is actually the direction that makes 𝐷ᵥ𝑓(𝑥) most negative. This makes −𝐷𝑓(𝑥) a very special vector.
Definition 2.44. 𝐷𝑓 (𝑥) is called the gradient of 𝑓 at 𝑥. Sometimes people use ∇𝑓 (𝑥) to denote it.
The Lagrange multiplier theorem is actually saying that if 𝑥₀ is a minimum point, then the gradient 𝐷𝑓(𝑥₀) is normal to the submanifold given by the constraint function 𝐹. Intuitively, this means that there is no way to decrease the value of the function 𝑓 inside the manifold, because the gradient 𝐷𝑓(𝑥₀) points completely outside the submanifold.
Just like all the critical point tests, the Lagrange multiplier method only gives a necessary con-
dition for a point being local minimum/local maximum. A critical point of 𝐿 does not necessarily
give a critical point of 𝑓 with the constraint 𝐹 = 0.
Similarly we have a second-order derivative test for a function with constraint.
Theorem 2.45. Suppose Ω ⊂ ℝᴺ is open, 𝑓 ∈ 𝐶²(Ω, ℝ) and 𝐹 ∈ 𝐶²(Ω, ℝᵐ). If 𝑥₀ is a local minimum point of the function 𝑓 under the constraint 𝐹(𝑥) = 0, and 𝐷𝐹(𝑥₀) is surjective, then Hess𝑓(𝑥₀) is semi-positive definite when restricted to the subspace 𝑉 = ker(𝐷𝐹(𝑥₀)) (translated to pass through 𝑥₀).
When we say an 𝑁 by 𝑁 matrix 𝐴 restricted to a subspace 𝑉 ⊂ ℝᴺ is semi-positive definite, we mean that 𝑣^⊤𝐴𝑣 ≥ 0 for every 𝑣 ∈ 𝑉; it is positive definite on 𝑉 if 𝑣^⊤𝐴𝑣 > 0 for every nonzero 𝑣 ∈ 𝑉.
Theorem 2.46. Suppose Ω ⊂ ℝᴺ is open, 𝑓 ∈ 𝐶²(Ω, ℝ) and 𝐹 ∈ 𝐶²(Ω, ℝᵐ). If (𝑥₀, 𝜆₀) is a critical point of the Lagrange function 𝐿, 𝐷𝐹(𝑥₀) is surjective, and Hess𝑓(𝑥₀) is positive definite when restricted to the subspace 𝑉 = ker(𝐷𝐹(𝑥₀)) (translated to pass through 𝑥₀), then 𝑥₀ is a local minimum point of the function 𝑓 under the constraint 𝐹(𝑥) = 0.
The proof of these theorems is similar to the Hessian test, but we need to incorporate the language
of submanifolds. We omit the proof here.
We want to point out that 𝐷𝐹 (𝑥0 ) being surjective is crucial to the Lagrange multiplier. It actually
tells us that locally the constraint really gives us a submanifold.
2.7. Applications of Lagrange multiplier. Let us see some classical applications of the Lagrange multiplier.
Theorem 2.47 (Hölder inequality). Suppose 𝑥 ≥ 0, 𝑦 ≥ 0, 𝑝 > 1, 𝑞 > 1, and 1∕𝑝 + 1∕𝑞 = 1. Then
𝑥𝑦 ≤ 𝑥^𝑝∕𝑝 + 𝑦^𝑞∕𝑞.
Proof. It is clear that if 𝑥 = 0 or 𝑦 = 0 the inequality holds. So it suffices to show the inequality for 𝑥 > 0, 𝑦 > 0. Let
𝑓(𝑥, 𝑦) = 𝑥^𝑝∕𝑝 + 𝑦^𝑞∕𝑞 − 𝑥𝑦.
It suffices to show 𝑓(𝑥, 𝑦) ≥ 0 for all 𝑥 > 0, 𝑦 > 0. We remark that the direct derivative test gives a huge set of critical points, and it is hard to prove 𝑓(𝑥, 𝑦) ≥ 0 this way.
Instead, we use the following observation: for any 𝑎 > 0,
𝑓(𝑎^{1∕𝑝}𝑥, 𝑎^{1∕𝑞}𝑦) = 𝑎𝑓(𝑥, 𝑦).
So in order to show 𝑓(𝑥, 𝑦) ≥ 0, it suffices to show that 𝑓(𝑥∕(𝑥𝑦)^{1∕𝑝}, 𝑦∕(𝑥𝑦)^{1∕𝑞}) ≥ 0 (taking 𝑎 = 1∕(𝑥𝑦)). Namely, we only need to show that 𝑓(𝑥, 𝑦) ≥ 0 with the constraint 𝑥𝑦 = 1.
If 𝑥 ≥ 𝑝^{1∕𝑝} or 𝑦 ≥ 𝑞^{1∕𝑞}, then 𝑓(𝑥, 𝑦) ≥ 0. Therefore, we only need to show that 𝑓(𝑥, 𝑦) ≥ 0 with the constraint 𝑥𝑦 = 1 within the bounded domain
𝑈 ∶= [0, 𝑝^{1∕𝑝}] × [0, 𝑞^{1∕𝑞}].
It is clear that 𝑓(𝑥, 𝑦) ≥ 0 on 𝜕𝑈. Now we consider the interior points. Let
𝐿(𝑥, 𝑦, 𝜆) = 𝑥^𝑝∕𝑝 + 𝑦^𝑞∕𝑞 − 𝑥𝑦 + 𝜆(𝑥𝑦 − 1).
We compute that
𝐷ₓ𝐿(𝑥, 𝑦, 𝜆) = 𝑥^{𝑝−1} − 𝑦 + 𝜆𝑦,  𝐷_𝑦𝐿(𝑥, 𝑦, 𝜆) = 𝑦^{𝑞−1} − 𝑥 + 𝜆𝑥,  𝐷_𝜆𝐿(𝑥, 𝑦, 𝜆) = 𝑥𝑦 − 1.
‖𝑥_{𝑛+1} − 𝑥₀‖ ≤ 𝑀‖𝑥ₙ − 𝑥₀‖².
Then by mathematical induction,
‖𝑥_{𝑛+1} − 𝑥₀‖ ≤ 𝑀^{2ⁿ−1}‖𝑥₁ − 𝑥₀‖^{2ⁿ} = (𝑀‖𝑥₁ − 𝑥₀‖)^{2ⁿ}∕𝑀.
Thus, if 𝑀‖𝑥₁ − 𝑥₀‖ < 1, the convergence is actually super-exponential.
Example 2.53 (Fast inverse square root). The fast inverse square root is a classical application of Newton's iteration method. This algorithm is best known for its implementation in 1999 in Quake III Arena, a first-person shooter video game heavily based on 3D graphics.
The algorithm consists of two steps. The first step provides a first guess of the inverse square root, and it uses the full power of the structure of 32-bit floating-point numbers. This part needs some knowledge of computer science and we omit it here, but refer you to a nice survey by Chris Lomont: http://www.lomont.org/papers/2003/InvSqrt.pdf.
The second step uses Newton's method to increase the precision of the result. The question can be formulated as follows: given 𝑎 > 0, find 1∕√𝑎. We first translate the problem into finding the zero of 𝐹(𝑥) = 𝑎𝑥² − 1 for 𝑥 > 0. Starting with 𝑥₁ = 1, we consider
𝑥_{𝑛+1} = 𝑥ₙ − (𝑎𝑥ₙ² − 1)∕(2𝑎𝑥ₙ) = 𝑥ₙ∕2 + 1∕(2𝑎𝑥ₙ).
Then after several iterations (actually even one), we can get an approximate value of 1∕√𝑎.
However, in computer science, the computation 𝑥 ↦ 1∕𝑥 is actually very hard. Therefore, in the algorithm, the real choice of 𝐹 is 𝐹(𝑥) = 1∕𝑥² − 𝑎. Then Newton's iteration becomes
𝑥_{𝑛+1} = 𝑥ₙ + (1∕2)𝑥ₙ³(1∕𝑥ₙ² − 𝑎) = 𝑥ₙ((3∕2) − (𝑎∕2)𝑥ₙ²).
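In code, the Newton step is indeed just additions and multiplications; here is a Python sketch (the starting guess below is a plain placeholder, not the bit-level trick from the actual algorithm):

    def inv_sqrt(a, x=1.0, iterations=5):
        # approximate 1/sqrt(a) by Newton's iteration on F(x) = 1/x^2 - a
        for _ in range(iterations):
            x = x * (1.5 - 0.5 * a * x * x)   # x_{n+1} = x_n (3/2 - a x_n^2 / 2)
        return x

    print(inv_sqrt(2.0))    # ~0.707107, close to 1/sqrt(2)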
In the iteration, we only need to compute addition and multiplication, which is much simpler in
computers.
2.9. Diagonalizing symmetric matrix*. We show another application of the Lagrange multiplier method. Recall that an 𝑛 × 𝑛 matrix 𝐴 is symmetric if 𝐴ᵗ = 𝐴. If the entries of 𝐴 are 𝑎ᵢⱼ, 𝐴 being symmetric is equivalent to 𝑎ᵢⱼ = 𝑎ⱼᵢ for all 1 ≤ 𝑖, 𝑗 ≤ 𝑛.
The goal of this section is to prove a diagonalization theorem of symmetric matrices.
Theorem 2.54 (diagonalization theorem of symmetric matrices). Suppose 𝐴 is a symmetric matrix. Then there exists an orthogonal matrix 𝑂 such that 𝑂ᵗ𝐴𝑂 = diag{𝜆₁, ⋯, 𝜆ₙ}, where diag{𝜆₁, ⋯, 𝜆ₙ} denotes the diagonal matrix with diagonal entries 𝜆₁, ⋯, 𝜆ₙ.
Remark 2.55. If all 𝜆ᵢ > 0, then 𝐴 is positive definite; if all 𝜆ᵢ ≥ 0, then 𝐴 is semi-positive definite. Similar statements hold for 𝐴 being negative definite and semi-negative definite.
Proof of the diagonalization theorem. We divide the proof into several steps.
Step 1. Define 𝑓 ∶ ℝⁿ → ℝ by
𝑓(𝑥) = 𝑥ᵗ𝐴𝑥.
Let 𝑣₁ ∈ ℝⁿ be a minimizer of 𝑓 under the constraint ‖𝑥‖² = 1. The Lagrange function in this situation is
𝐿(𝑥, 𝜆) = 𝑥ᵗ𝐴𝑥 + 𝜆(1 − ‖𝑥‖²).
Then
𝐷𝐿(𝑥, 𝜆) = (2(𝐴𝑥)₁ − 2𝜆𝑥₁, 2(𝐴𝑥)₂ − 2𝜆𝑥₂, ⋯, 2(𝐴𝑥)ₙ − 2𝜆𝑥ₙ, 1 − ‖𝑥‖²).
Notice that {‖𝑥‖² = 1} is a compact submanifold (it is just the sphere), so a minimum point under the constraint must exist; we call it 𝑣₁. Then the Lagrange multiplier method shows that there exists 𝜆₁ ∈ ℝ such that (𝑣₁, 𝜆₁) is a critical point of 𝐿. Then we see that
𝐴𝑣₁ = 𝜆₁𝑣₁.
Step 2. Define 𝑊₁ = span{𝑣₁}, the space spanned by 𝑣₁, and let 𝑉₁ = 𝑊₁^⟂. Repeat the above discussion, but restrict 𝑓 to 𝑉₁; we can find a minimizer 𝑣₂ ∈ 𝑉₁ under the constraint ‖𝑣₂‖ = 1. The same argument shows that there exists 𝜆₂ ∈ ℝ such that 𝐴𝑣₂ = 𝜆₂𝑣₂.
Step 3. Repeat the above discussion 𝑛 times; we get (column) vectors 𝑣₁, 𝑣₂, ⋯, 𝑣ₙ and real numbers 𝜆₁, ⋯, 𝜆ₙ, such that
∙ ‖𝑣ᵢ‖ = 1;
∙ for 𝑖 ≠ 𝑗, 𝑣ᵢ ⟂ 𝑣ⱼ;
∙ 𝐴𝑣ᵢ = 𝜆ᵢ𝑣ᵢ.
Then if we define
𝑂 = [𝑣₁ 𝑣₂ ⋯ 𝑣ₙ],
we have
𝑂ᵗ𝐴𝑂 = diag{𝜆₁, ⋯, 𝜆ₙ}.
Remark 2.56. You can reverse the above procedure by picking the maximum point under the constraint. The result is the same.
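A numerical counterpart of the theorem (a numpy sketch, my own illustration): numpy's eigh routine returns exactly such an 𝑂 and 𝜆ᵢ for a symmetric matrix.

    import numpy as np

    A = np.array([[2., 1., 0.],
                  [1., 3., 1.],
                  [0., 1., 2.]])                   # symmetric

    lam, O = np.linalg.eigh(A)                     # eigenvalues and the matrix O
    assert np.allclose(O.T @ O, np.eye(3))         # columns are orthonormal
    assert np.allclose(O.T @ A @ O, np.diag(lam))  # O^t A O = diag{λ1, ..., λn}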
UNIVERSITY OF CHICAGO, DEPARTMENT OF MATHEMATICS, 5734 S UNIVERSITY AVE, CHICAGO IL, 60637
Email address: [email protected]