
MATH 20400 - ANALYSIS IN ℝ𝑛 II - LECTURE NOTES

AO SUN

CONTENTS
Introduction
1. Differentiation
1.1. Differentiation of single variable functions
1.2. Euclidean space
1.3. Directional derivatives
1.4. Differentiable functions
1.5. Linear map and Matrix
1.6. Differentiable maps
1.7. Mean value theorem
1.8. Higher order derivatives and Taylor's theorem
1.9. Hessian and second partial derivative test
2. Submanifolds in ℝ𝑁
2.1. Invertible linear maps
2.2. Inverse function theorem
2.3. 𝑘-surfaces and submanifolds in ℝ𝑁
2.4. Tangent space and 𝐶 1-maps
2.5. Implicit function theorem
2.6. Constrained optimization and Lagrange multipliers
2.7. Applications of Lagrange multiplier
2.8. Newton's iteration method*
2.9. Diagonalizing symmetric matrix*
References

INTRODUCTION
If you find any typos, please let me know; I would appreciate it!

Date: May 24, 2023.



1. DIFFERENTIATION
1.1. Differentiation of single variable functions. Let us recall some basic concepts. ℝ is the real number field. Suppose 𝑓 ∶ [𝑎, 𝑏] → ℝ is a single variable function. We say 𝑓 is differentiable at a point 𝑥 ∈ (𝑎, 𝑏) if

(1.1) lim_{ℎ→0} [𝑓(𝑥 + ℎ) − 𝑓(𝑥)]∕ℎ

exists, and we use the notation 𝑓′(𝑥) or 𝐷𝑓(𝑥) to denote this limit. This limit is called the derivative of 𝑓 at 𝑥.
If we use the 𝜖 − 𝛿 language to write down this definition, a function 𝑓 is differentiable at 𝑥 if and only if there exists 𝐿 ∈ ℝ, such that for any 𝜖 > 0, there exists 𝛿 > 0, such that for any 0 < |ℎ| < 𝛿 with 𝑥 + ℎ ∈ [𝑎, 𝑏],

(1.2) |[𝑓(𝑥 + ℎ) − 𝑓(𝑥)]∕ℎ − 𝐿| < 𝜖.
We will use the following notation: suppose 𝑓 ∶ (𝑎, 𝑏) → ℝ is a function. If 𝑓 is continuous, we write 𝑓 ∈ 𝐶((𝑎, 𝑏)). If 𝑓 is in addition bounded on (𝑎, 𝑏), then we write 𝑓 ∈ 𝐶 0((𝑎, 𝑏)). If 𝑓 is differentiable and its derivative is continuous on (𝑎, 𝑏), then we write 𝑓 ∈ 𝐶 1((𝑎, 𝑏)). In general, if 𝑓 is 𝑘-times differentiable and all its derivatives of order 𝑚 ≤ 𝑘 are continuous on (𝑎, 𝑏), then we write 𝑓 ∈ 𝐶 𝑘((𝑎, 𝑏)). We will use 𝑓 (𝑘) to denote the 𝑘-th order derivative of 𝑓.
Differentiability is a local property: a function can be differentiable at a single point and nowhere else. The following example shows this fact.
Example 1.1. Consider a function 𝑓 ∶ ℝ → ℝ defined as follows:

(1.3) 𝑓(𝑥) = 𝑥² if 𝑥 is rational, and 𝑓(𝑥) = 0 if 𝑥 is irrational.
Using the definition of derivatives, we can calculate 𝑓′(0) = 0, so 𝑓 is differentiable at 0. However, 𝑓 is not differentiable anywhere else. In fact, 𝑓 is even discontinuous on ℝ∖{0}. Then Theorem 1.2 implies that 𝑓 is not differentiable at any point besides 0.
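As a quick check of the claim 𝑓′(0) = 0 (the computation left implicit above): for ℎ ≠ 0 we have |𝑓(ℎ) − 𝑓(0)|∕|ℎ| ≤ ℎ²∕|ℎ| = |ℎ|, because 𝑓(ℎ) is either ℎ² or 0. Letting ℎ → 0 gives 𝑓′(0) = 0.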
Differentiability is closely related to another basic concept called continuity.
Theorem 1.2. Suppose 𝑓 ∶ [𝑎, 𝑏] → ℝ is differentiable at 𝑥 ∈ (𝑎, 𝑏). Then 𝑓 is continuous at 𝑥.
Proof. Suppose 𝑓′(𝑥) = 𝐿. Using the 𝜖 − 𝛿 definition of the derivative, if we choose 𝜖 = 1, we can find 𝛿 > 0 such that for any 0 < |ℎ| < 𝛿 and 𝑥 + ℎ ∈ [𝑎, 𝑏], we have

|[𝑓(𝑥 + ℎ) − 𝑓(𝑥)]∕ℎ − 𝑓′(𝑥)| < 1,

and then we have

|𝑓(𝑥 + ℎ) − 𝑓(𝑥)| < |ℎ|(1 + |𝑓′(𝑥)|).

This implies that lim_{ℎ→0} 𝑓(𝑥 + ℎ) = 𝑓(𝑥), which implies that 𝑓 is continuous at 𝑥. □

On the other hand, a continuous function may be far from differentiable. We have the following very pathological example.
Example 1.3 (A continuous function that is nowhere differentiable). Let us define a function 𝜑1 ∶ ℝ → ℝ as follows. On [0, 1] we define

(1.4) 𝜑1(𝑥) = 𝑥 if 0 ≤ 𝑥 < 1∕2, and 𝜑1(𝑥) = 1 − 𝑥 if 1∕2 ≤ 𝑥 ≤ 1.

Then we extend 𝜑1 periodically to ℝ by 𝜑1(𝑥 + 1) = 𝜑1(𝑥), and we inductively define 𝜑𝑛+1(𝑥) = (1∕2)𝜑𝑛(2𝑥). Let

(1.5) 𝑆𝑛(𝑥) = Σ_{𝑗=1}^{𝑛} 𝜑𝑗(𝑥).

Because each 𝜑𝑗 is continuous, 𝑆𝑛 is a continuous function. We leave it as an exercise to show that (𝑆𝑛)_{𝑛∈ℤ+} converges uniformly to a continuous function, and the limit function is not differentiable at any point in ℝ.
Next, we review some basic properties of derivatives.
Proposition 1.4. Suppose 𝑓 ∶ [𝑎, 𝑏] → ℝ, 𝑔 ∶ [𝑎, 𝑏] → ℝ are both differentiable at 𝑥 ∈ (𝑎, 𝑏).
Suppose 𝛼 ∈ ℝ, 𝛽 ∈ ℝ. Then
(1) (Linear property) (𝛼𝑓 + 𝛽𝑔)′(𝑥) = 𝛼𝑓′(𝑥) + 𝛽𝑔′(𝑥).
(2) (Product property) (𝑓𝑔)′(𝑥) = 𝑓′(𝑥)𝑔(𝑥) + 𝑓(𝑥)𝑔′(𝑥).
(3) (Quotient property) (𝑓∕𝑔)′(𝑥) = [𝑓′(𝑥)𝑔(𝑥) − 𝑓(𝑥)𝑔′(𝑥)]∕𝑔(𝑥)² if 𝑔(𝑥) ≠ 0.
Proposition 1.5 (Chain rule). Suppose 𝑓 ∶ [𝑎, 𝑏] → ℝ is differentiable at 𝑥 ∈ (𝑎, 𝑏), 𝑓 (𝑥) ∈ (𝑐, 𝑑)
and 𝑔 ∶ [𝑐, 𝑑] → ℝ is differentiable at 𝑓 (𝑥). Then
(1.6) 𝐷(𝑔◦𝑓 )(𝑥) = 𝐷𝑔(𝑓 (𝑥)) ⋅ 𝐷𝑓 (𝑥).
If you have forgotten the proofs, you can find them in any rigorous calculus textbook.
Let us review one of the most important theorems in single variable calculus.
Theorem 1.6 (Mean value theorem). Suppose 𝑓 ∶ [𝑎, 𝑏] → ℝ is a continuous function and 𝑓 is
differentiable on (𝑎, 𝑏). Then there exists 𝑐 ∈ (𝑎, 𝑏) such that
(1.7) 𝑓 (𝑏) − 𝑓 (𝑎) = 𝑓 ′ (𝑐)(𝑏 − 𝑎).
As a consequence, if the derivative of a function is nonnegative, then the function is nondecreasing.
Theorem 1.7. Suppose 𝑓 ∶ [𝑎, 𝑏] → ℝ is continuous, 𝑓 is differentiable on (𝑎, 𝑏), and 𝑓 ′ (𝑥) ≥ 0
for all 𝑥 ∈ (𝑎, 𝑏). Then 𝑓 (𝑎) ≤ 𝑓 (𝑏).

The mean value theorem can be viewed as an approximation theorem - it describes how a general
differentiable function 𝑓 can be approximated by a linear function. In fact, we can interpret the
mean value theorem as
𝑓 (𝑥) ≈ 𝑓 (𝑎) + 𝑓 ′ (𝑎)(𝑥 − 𝑎).
In general, we can approximate a function with continuous 𝑘-th order derivatives by a degree 𝑘
polynomial, which is known as the Taylor polynomial.
Theorem 1.8 (Taylor expansion). Suppose 𝑓 ∈ 𝐶 𝑘 ((𝑎, 𝑏)), then for any 𝑥0 ∈ (𝑎, 𝑏) and 𝑥 ∈ (𝑎, 𝑏),
there exists 𝑐 between 𝑥 and 𝑥0 such that
(1.8) 𝑓(𝑥) = 𝑓(𝑥0) + [𝑓′(𝑥0)∕1!](𝑥 − 𝑥0) + [𝑓 (2)(𝑥0)∕2!](𝑥 − 𝑥0)² + ⋯ + [𝑓 (𝑘−1)(𝑥0)∕(𝑘−1)!](𝑥 − 𝑥0)^{𝑘−1} + [𝑓 (𝑘)(𝑐)∕𝑘!](𝑥 − 𝑥0)^{𝑘}.

Here the polynomial

𝑃_{𝑘−1}(𝑥) ∶= 𝑓(𝑥0) + [𝑓′(𝑥0)∕1!](𝑥 − 𝑥0) + [𝑓 (2)(𝑥0)∕2!](𝑥 − 𝑥0)² + ⋯ + [𝑓 (𝑘−1)(𝑥0)∕(𝑘−1)!](𝑥 − 𝑥0)^{𝑘−1}

is called the degree 𝑘 − 1 Taylor polynomial of 𝑓 at 𝑥0.
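For instance, taking 𝑓(𝑥) = 𝑒^{𝑥} and 𝑥0 = 0 (where 𝑓 (𝑚)(𝑥) = 𝑒^{𝑥} for all 𝑚), the theorem gives, for some 𝑐 between 0 and 𝑥,

𝑒^{𝑥} = 1 + 𝑥 + 𝑥²∕2! + ⋯ + 𝑥^{𝑘−1}∕(𝑘−1)! + 𝑒^{𝑐}𝑥^{𝑘}∕𝑘!.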
1.2. Euclidean space. Before we start the discussion of the differentiation of multivariable functions, we need to understand the space that we will be working in. The 𝑛-dimensional Euclidean space, denoted by ℝ𝑛, is, as a set, the Cartesian product of 𝑛 copies of the real numbers. There are three important structures on the Euclidean space.

1.2.1. Vector space structure. ℝ𝑛 is an (ℝ-)vector space. This means that we can define the following operations:
∙ (Addition) There is a function + ∶ ℝ𝑛 × ℝ𝑛 → ℝ𝑛 . We denote the image of (𝐮, 𝐯) by 𝐮 + 𝐯.
∙ (Scalar multiplication) There is a function ℝ × ℝ𝑛 → ℝ𝑛 . We denote the image of (𝑎, 𝐯) by
𝑎𝐯.
These operations should satisfy the following axioms. Suppose 𝐮, 𝐯, 𝐰 are elements of ℝ𝑛 , 𝑎, 𝑏
are elements of ℝ.
(1) (Associativity of vector addition ) 𝐮 + (𝐯 + 𝐰) = (𝐮 + 𝐯) + 𝐰.
(2) (Commutativity of vector addition) 𝐮 + 𝐯 = 𝐯 + 𝐮.
(3) (Identity element of vector addition) There exists an element in ℝ𝑛 denoted by 𝟎, such that
for every 𝐮 ∈ ℝ𝑛 , 𝐮 + 𝟎 = 𝐮.
(4) (Inverse elements of vector addition) For every 𝐮 ∈ ℝ𝑛 , there exists an element denoted by
(−𝐮), such that (−𝐮) + 𝐮 = 𝟎.
(5) (Compatibility of scalar multiplication with field multiplication) 𝑎(𝑏𝐮) = (𝑎𝑏)𝐮.
(6) (Identity element of scalar multiplication) 1𝐮 = 𝐮.
(7) (Distributivity of scalar multiplication with respect to vector addition) 𝑎(𝐮 + 𝐯) = 𝑎𝐮 + 𝑎𝐯.
(8) (Distributivity of scalar multiplication with respect to field addition) (𝑎 + 𝑏)𝐮 = 𝑎𝐮 + 𝑏𝐮.

In general, a vector space can be defined over any field, not necessarily ℝ. Throughout this
course, we will be only focused on vector spaces over ℝ.
We also have a special name for the zero element.
Definition 1.9. 𝟎 ∈ ℝ𝑛 is called the origin.
For Euclidean space, the addition and multiplication can be written down explicitly. Suppose
𝐮 ∈ ℝ𝑛 , 𝐯 ∈ ℝ𝑛 . Then we can use the Cartesian coordinate to write the vectors in the form of
𝑛-tuple:
𝐮 = (𝑢1 , 𝑢2 , ⋯ , 𝑢𝑛 ), 𝐯 = (𝑣1 , 𝑣2 , ⋯ , 𝑣𝑛 ).
Here each 𝑢𝑖 and 𝑣𝑖 are real numbers. Then we define
𝐮 + 𝐯 = (𝑢1 + 𝑣1 , 𝑢2 + 𝑣2 , ⋯ , 𝑢𝑛 + 𝑣𝑛 ),
and for 𝑎 ∈ ℝ, we define
𝑎𝐮 = (𝑎𝑢1 , 𝑎𝑢2 , ⋯ , 𝑎𝑢𝑛 ).
Given a vector space 𝑉, a basis of 𝑉 is a collection of vectors {𝐯1, 𝐯2, ⋯, 𝐯𝑛} such that any vector in 𝑉 can be expressed as a unique linear combination of these vectors; namely, for any vector 𝐮 ∈ 𝑉, there exists a unique array of numbers 𝑎1, 𝑎2, ⋯, 𝑎𝑛 such that

𝐮 = Σ_{𝑗=1}^{𝑛} 𝑎𝑗𝐯𝑗.

In particular, this implies that {𝐯1, 𝐯2, ⋯, 𝐯𝑛} are linearly independent, namely if

𝟎 = Σ_{𝑗=1}^{𝑛} 𝑏𝑗𝐯𝑗,

then all the 𝑏𝑗 = 0.


In the Euclidean space, there is a natural basis given by the Cartesian coordinate. We will use 𝑒𝑗
to denote the vector (0, 0, ⋯ , 0, 1, 0, ⋯ , 0), which has entry 1 in the 𝑗-th position and 0 elsewhere.
One can check that {𝑒1 , ⋯ , 𝑒𝑛 } is a basis of ℝ𝑛 .
1.2.2. Metric space structure and Topological space structure. In Euclidean space, we can define
the distance between two points as follows: for 𝐮 = (𝑢1 , 𝑢2 , ⋯ , 𝑢𝑛 ) ∈ ℝ𝑛 , 𝐯 = (𝑣1 , 𝑣2 , ⋯ , 𝑣𝑛 ) ∈ ℝ𝑛 ,
the distance between them, denoted by dist(𝐮, 𝐯) or ‖𝐮 − 𝐯‖, is defined as

‖𝐮 − 𝐯‖ = √((𝑢1 − 𝑣1)² + (𝑢2 − 𝑣2)² + ⋯ + (𝑢𝑛 − 𝑣𝑛)²).
This definition is motivated by the Pythagoras theorem. This is a metric because it satisfies the
following three axioms:
(1) (Reflexivity) dist(𝐮, 𝐯) = 0 if and only if 𝐮 = 𝐯.
(2) (Symmetry) dist(𝐮, 𝐯) = dist(𝐯, 𝐮).
(3) (Triangle inequality) dist(𝐮, 𝐰) ≤ dist(𝐮, 𝐯) + dist(𝐯, 𝐰).
If we view 𝐮 as a vector, then the length can be interpreted as the distance from the point 𝐮 to
the origin 𝟎.

Definition 1.10. The length of 𝐮 ∈ ℝ𝑛 is defined to be ‖𝐮‖ = dist(𝐮, 𝟎).
The topological structure induced by this metric is crucial.
Definition 1.11. An open ball centered at 𝑥 with radius 𝑟 in the Euclidean space, denoted by 𝐵𝑟 (𝑥),
is the set
𝐵𝑟 (𝑥) ∶= {𝑦 ∈ ℝ𝑛 ∶ ‖𝑦 − 𝑥‖ < 𝑟}.
Definition 1.12. A set 𝑈 ⊂ ℝ𝑛 is open if and only if for any 𝑥 ∈ 𝑈 , there exists 𝑟 > 0 such that
𝐵𝑟 (𝑥) ⊂ 𝑈 . ∅ is viewed as an open set. A set 𝑉 ⊂ ℝ𝑛 is closed if and only if the complement of 𝑉
is open.
This definition gives a topological structure on ℝ𝑛 , because the open sets satisfy the following
three axioms:
(1) Union of arbitrarily many open sets is again an open set.
(2) Intersection of finitely many open sets is again an open set.
(3) ∅ and ℝ𝑛 are open sets.
There are actually many different topological structures on ℝ𝑛 . Throughout this course, we will
only use the topological structure defined above.
Definition 1.13. Suppose 𝑥 ∈ ℝ𝑛 . A neighborhood of 𝑥 is an open set containing 𝑥.
1.2.3. Inner product space structure. ℝ𝑛 is an inner product space, because we can define an
inner product on ℝ𝑛 as follows: for 𝐮 = (𝑢1 , 𝑢2 , ⋯ , 𝑢𝑛 ) ∈ ℝ𝑛 and 𝐯 = (𝑣1 , 𝑣2 , ⋯ , 𝑣𝑛 ) ∈ ℝ𝑛 , the
inner product of 𝐮 and 𝐯, denoted by ⟨𝐮, 𝐯⟩, is defined by
⟨𝐮, 𝐯⟩ ∶= 𝑢1 𝑣1 + 𝑢2 𝑣2 + ⋯ + 𝑢𝑛 𝑣𝑛 .
This is an inner product because it satisfies the following axioms:
(1) (Symmetry) ⟨𝐮, 𝐯⟩ = ⟨𝐯, 𝐮⟩
(2) (Linearity) ⟨𝑎𝐮 + 𝑏𝐯, 𝐰⟩ = 𝑎⟨𝐮, 𝐰⟩ + 𝑏⟨𝐯, 𝐰⟩
(3) (Positive-definiteness) if 𝐮 ≠ 𝟎, then ⟨𝐮, 𝐮⟩ > 0.
The inner product structure upgrades the vector space structure; in particular, ‖𝐮‖ = √⟨𝐮, 𝐮⟩ recovers the length.
1.3. Directional derivatives. Now suppose our function is not a single variable function, but a
multivariable function. Namely, the function we want to study is 𝑓 ∶ ℝ𝑛 → ℝ, where 𝑛 ∈ ℤ
and 𝑛 ≥ 2. We will use 𝑓 (𝑥) = 𝑓 (𝑥1 , 𝑥2 , ⋯ , 𝑥𝑛 ) to denote the value of the function at 𝑥 =
(𝑥1 , 𝑥2 , ⋯ , 𝑥𝑛 ).
Recall the intuition behind the derivative of a single variable function: it is the limit of the quotient [𝑓(𝑥 + ℎ) − 𝑓(𝑥)]∕ℎ, where ℎ is the "small increment". So the derivative of a multivariable function should be the same thing: the limit of the quotient [𝑓(𝑥 + ℎ) − 𝑓(𝑥)]∕ℎ, where ℎ is the small increment.
Recall the vector space structure of the Euclidean space: if we want to add something to a vector 𝑥, then it must also be a vector. Next, recall that the metric structure gives us a way to measure the length of a vector. These motivate the following definition.

Definition 1.14 (Directional derivative). Suppose 𝑈 ⊂ ℝ𝑛 is an open set, 𝑥0 ∈ 𝑈, 𝑓 ∶ 𝑈 → ℝ, and 𝐯 ∈ ℝ𝑛. If the following limit exists,

(1.9) lim_{ℎ→0} [𝑓(𝑥0 + ℎ𝐯) − 𝑓(𝑥0)]∕ℎ,

then we call it the directional derivative of 𝑓 with respect to the vector 𝐯 at the point 𝑥0. It is denoted by 𝐷𝐯𝑓(𝑥0).
The reason the function is defined on an open set can be seen from the definition: to calculate the directional derivative, we need to move the point a little bit in the direction in which we would like to differentiate. If the domain is open, then the expression 𝑓(𝑥0 + ℎ𝐯) is still meaningful when ℎ is sufficiently small.
Example 1.15. Suppose 𝑓 ∶ ℝ3 → ℝ is defined by 𝑓 (𝑥, 𝑦, 𝑧) = 𝑥 + 𝑦2 + 𝑧2 + 𝑥𝑦𝑧. Then if
𝐯 = (1, 1, 1), we can calculate the directional derivative of 𝑓 at (1, 1, 1) as follows:
𝐷𝐯𝑓(1, 1, 1) = lim_{ℎ→0} [𝑓(1 + ℎ, 1 + ℎ, 1 + ℎ) − 𝑓(1, 1, 1)]∕ℎ
= lim_{ℎ→0} [(1 + ℎ) + (1 + ℎ)² + (1 + ℎ)² + (1 + ℎ)³ − (1 + 1 + 1 + 1)]∕ℎ = 8.
Example 1.16. Suppose 𝑓 ∶ ℝ2 → ℝ is defined by 𝑓(𝑥, 𝑦) = 𝒟(𝑥), where 𝒟 is the Dirichlet function

(1.10) 𝒟(𝑥) = 1 if 𝑥 is rational, and 𝒟(𝑥) = 0 if 𝑥 is irrational.

Then one can check that 𝐷(1,0)𝑓(𝑥, 𝑦) does not exist for any (𝑥, 𝑦) ∈ ℝ2, while 𝐷(0,1)𝑓(𝑥, 𝑦) exists and equals 0 for all (𝑥, 𝑦) ∈ ℝ2.
This example implies that a function can have directional derivatives in some directions while having no directional derivatives in some other directions.
The Cartesian coordinate system provides some special directions in ℝ𝑛. We will use 𝑒𝑖 to denote the vector (0, 0, ⋯, 0, 1, 0, ⋯, 0) ∈ ℝ𝑛, which has 1 at the 𝑖-th position and 0 elsewhere. We will use 𝐷𝑖𝑓 to denote the directional derivative 𝐷𝑒𝑖𝑓. Such derivatives are called partial derivatives.
A particular notation: if we use (𝑥, 𝑦) to denote the coordinates of ℝ2, then we will use 𝐷𝑥𝑓 or 𝜕𝑓∕𝜕𝑥 to denote 𝐷1𝑓, and 𝐷𝑦𝑓 or 𝜕𝑓∕𝜕𝑦 to denote 𝐷2𝑓. If we use (𝑥, 𝑦, 𝑧) to denote the coordinates of ℝ3, we have similar notations. If we use (𝑥1, 𝑥2, ⋯, 𝑥𝑛) to denote the coordinates of ℝ𝑛, then we may use 𝜕𝑓∕𝜕𝑥𝑖 to denote 𝐷𝑖𝑓.
The following propositions for directional derivatives can be checked just like the derivatives of
single variable functions. We leave them as exercises.
Proposition 1.17. Suppose 𝑈 ⊂ ℝ𝑛 is open, 𝑓 ∶ 𝑈 → ℝ, 𝑔 ∶ 𝑈 → ℝ, 𝑥0 ∈ 𝑈 , 𝐯 ∈ ℝ𝑛 , 𝑎 ∈ ℝ,
𝑏 ∈ ℝ. If 𝑓 and 𝑔 both have directional derivatives with respect to 𝐯 at 𝑥0 , then

(1) 𝐷𝐯(𝑎𝑓 + 𝑏𝑔)(𝑥0) = 𝑎𝐷𝐯𝑓(𝑥0) + 𝑏𝐷𝐯𝑔(𝑥0).
(2) 𝐷𝐯(𝑓𝑔)(𝑥0) = 𝑔(𝑥0)𝐷𝐯𝑓(𝑥0) + 𝑓(𝑥0)𝐷𝐯𝑔(𝑥0).
(3) 𝐷𝐯(𝑓∕𝑔)(𝑥0) = [𝑔(𝑥0)𝐷𝐯𝑓(𝑥0) − 𝑓(𝑥0)𝐷𝐯𝑔(𝑥0)]∕(𝑔(𝑥0))² when 𝑔(𝑥0) ≠ 0.
If we view 𝐷 as a function ℝ𝑛 ×{“differentiable functions”} → ℝ, from the item (1) above we see
𝐷 is linear in the second argument. It is natural to expect that 𝐷 is also linear in the first argument.
The following example suggests that in general, this is not true.
Example 1.18. Suppose 𝑓 ∶ ℝ2 → ℝ is defined by 𝑓(𝑥, 𝑦) = 𝑥𝑦∕(𝑥² + 𝑦²) if (𝑥, 𝑦) ≠ (0, 0), and 𝑓(0, 0) is defined to be 0. Then one can compute that 𝐷𝑥𝑓(0, 0) = 0 and 𝐷𝑦𝑓(0, 0) = 0. However, for any 𝐯 = (𝑣1, 𝑣2) such that 𝑣1𝑣2 ≠ 0,

𝐷𝐯𝑓(0, 0) = lim_{ℎ→0} 𝑣1𝑣2∕[ℎ(𝑣1² + 𝑣2²)]

does not exist.
This example suggests that even though all the partial derivatives exist, some directional deriva-
tives may not exist.
This example shows that the definition of “differentiable functions” is actually subtle.
1.4. Differentiable functions. From the Taylor expansion, we see that 𝑓(𝑥0) + 𝑓′(𝑥0)(𝑥 − 𝑥0) is the best linear function approximating 𝑓(𝑥) near 𝑥0. The derivative of a single variable function can therefore be interpreted as the best linear approximation of the function. Similarly, we can define the derivative of a multivariable function as the best linear approximation.
Definition 1.19 (Linear function). A function 𝐴 ∶ ℝ𝑛 → ℝ is linear if for any 𝐮 ∈ ℝ𝑛 , 𝐯 ∈ ℝ𝑛 and
𝑎 ∈ ℝ, 𝑏 ∈ ℝ, we have
𝐴(𝑎𝐮 + 𝑏𝐯) = 𝑎𝐴(𝐮) + 𝑏𝐴(𝐯).
It is clear from the definition that 𝐴(𝟎) = 0, and 𝐴 is continuous.
Example 1.20. One can check that 𝐴(𝑥, 𝑦, 𝑧) = 𝑥 + 𝑦 + 𝑧 is a linear function.
We can use the Cartesian coordinate to express the linear functions as follows. Suppose 𝐴 is a
linear function, then by the linearity, for any 𝑥 = (𝑥1 , 𝑥2 , ⋯ , 𝑥𝑛 ) ∈ ℝ𝑛 , we have
𝐴(𝑥1 , 𝑥2 , ⋯ , 𝑥𝑛 ) = 𝑥1 𝐴(𝑒1 ) + 𝑥2 𝐴(𝑒2 ) + ⋯ + 𝑥𝑛 𝐴(𝑒𝑛 ).
Using the inner product structure of ℝ𝑛 , we can simply write the above equality as the inner product
𝐴(𝑥) = ⟨𝑥, (𝐴(𝑒1 ), 𝐴(𝑒2 ), ⋯ , 𝐴(𝑒𝑛 ))⟩.
Theorem 1.21 (Riesz representation theorem). A function 𝐴 ∶ ℝ𝑛 → ℝ is linear if and only if there
exists a vector 𝐚 such that for any 𝑥 ∈ ℝ𝑛 , 𝐴(𝑥) = ⟨𝑥, 𝐚⟩. Moreover, 𝐚 = (𝐴(𝑒1 ), 𝐴(𝑒2 ), ⋯ , 𝐴(𝑒𝑛 )).
Sometimes we will slightly abuse the notation and just use 𝐴 itself to denote the vector 𝐚.
The directional derivative of a linear function 𝐴 is given by the image of the direction under 𝐴.

Proposition 1.22. Suppose 𝐴 ∶ ℝ𝑛 → ℝ is a linear function, 𝐯 ∈ ℝ𝑛. Then for any 𝑥 ∈ ℝ𝑛,

𝐷𝐯𝐴(𝑥) = 𝐴(𝐯).

Proof. It is straightforward from the computation

[𝐴(𝑥 + ℎ𝐯) − 𝐴(𝑥)]∕ℎ = [𝐴(𝑥) + ℎ𝐴(𝐯) − 𝐴(𝑥)]∕ℎ = 𝐴(𝐯). □
Definition 1.23 (Differentiable function). Suppose 𝑈 ⊂ ℝ𝑛 is open, 𝑓 ∶ 𝑈 → ℝ, 𝑥0 ∈ 𝑈. Then 𝑓 is differentiable at 𝑥0 if there exists a linear function 𝐴 ∶ ℝ𝑛 → ℝ such that

(1.11) lim_{𝐯→0} |𝑓(𝑥0 + 𝐯) − 𝑓(𝑥0) − 𝐴(𝐯)|∕‖𝐯‖ = 0.

The linear function 𝐴 is called the differential of 𝑓 at 𝑥0, denoted by 𝐷𝑓(𝑥0) or ∇𝑓(𝑥0).
Suppose 𝑓 is differentiable at 𝑥0. Then by the definition, we have

lim_{ℎ→0} |𝑓(𝑥0 + ℎ𝑒𝑖) − 𝑓(𝑥0) − 𝐴(ℎ𝑒𝑖)|∕|ℎ| = 0.

This actually implies that 𝑓 has a partial derivative at 𝑥0 with respect to the direction 𝑒𝑖, and 𝐴(𝑒𝑖) = 𝐷𝑖𝑓(𝑥0). So we have the coordinate expression of the linear function 𝐴.
Proposition 1.24. Suppose 𝑈 ⊂ ℝ𝑛 is open, 𝑓 ∶ 𝑈 → ℝ, 𝑥0 ∈ 𝑈. Suppose 𝑓 is differentiable at 𝑥0. Then
(1.12) 𝐷𝑓 (𝑥0 ) = (𝐷1 𝑓 (𝑥0 ), 𝐷2 𝑓 (𝑥0 ), ⋯ , 𝐷𝑛 𝑓 (𝑥0 )),
in the sense that
(𝐷𝑓 (𝑥0 ))(𝐯) = ⟨𝐯, (𝐷1 𝑓 (𝑥0 ), 𝐷2 𝑓 (𝑥0 ), ⋯ , 𝐷𝑛 𝑓 (𝑥0 ))⟩.
As a consequence, we can calculate the directional derivative using the partial derivatives.
Proposition 1.25. Suppose 𝑈 ⊂ ℝ𝑛 is open, 𝑓 ∶ 𝑈 → ℝ, 𝑥0 ∈ 𝑈. Suppose 𝑓 is differentiable at 𝑥0. Then for any 𝐯 = (𝑣1, 𝑣2, ⋯, 𝑣𝑛) ∈ ℝ𝑛,

(1.13) 𝐷𝐯𝑓(𝑥0) = ⟨𝐷𝑓(𝑥0), 𝐯⟩ = Σ_{𝑗=1}^{𝑛} 𝑣𝑗𝐷𝑗𝑓(𝑥0).
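For instance, this formula recovers Example 1.15: there ∇𝑓(𝑥, 𝑦, 𝑧) = (1 + 𝑦𝑧, 2𝑦 + 𝑥𝑧, 2𝑧 + 𝑥𝑦), so 𝐷𝐯𝑓(1, 1, 1) = ⟨(2, 3, 3), (1, 1, 1)⟩ = 8, matching the limit computed there.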

Using this expression of directional derivatives, we can easily check the linearity of 𝐷 in the first
argument if 𝑓 is differentiable.
Proposition 1.26. Suppose 𝑈 ⊂ ℝ𝑛 is open, 𝑓 ∶ 𝑈 → ℝ, 𝑥0 ∈ 𝑈. Suppose 𝑓 is differentiable at 𝑥0. Then for any 𝐯 ∈ ℝ𝑛, 𝐮 ∈ ℝ𝑛 and 𝑎 ∈ ℝ, 𝑏 ∈ ℝ,
(1.14) 𝐷𝑎𝐮+𝑏𝐯 𝑓 (𝑥0 ) = 𝑎𝐷𝐮 𝑓 (𝑥0 ) + 𝑏𝐷𝐯 𝑓 (𝑥0 ).
Although the differentiability of 𝑓 implies the directional derivatives exist for all directions, the
converse is not true.

Example 1.27. Suppose 𝑓 ∶ ℝ2 → ℝ is defined as follows:

(1.15) 𝑓(𝑥, 𝑦) = 𝑦²∕𝑥 if 𝑥 ≠ 0, and 𝑓(𝑥, 𝑦) = 0 if 𝑥 = 0.

For any 𝐯 = (𝑣1, 𝑣2) ∈ ℝ2, 𝐷𝐯𝑓(0, 0) exists. In fact, if 𝑣1 = 0,

lim_{ℎ→0} [𝑓(0, ℎ𝑣2) − 𝑓(0, 0)]∕ℎ = 0,

and if 𝑣1 ≠ 0,

lim_{ℎ→0} [𝑓(ℎ𝑣1, ℎ𝑣2) − 𝑓(0, 0)]∕ℎ = lim_{ℎ→0} (ℎ𝑣2²∕𝑣1)∕ℎ = 𝑣2²∕𝑣1.

However, 𝑓 is not differentiable at (0, 0). In fact, take the sequence of vectors 𝐯𝑛 ∶= (1∕𝑛², 1∕𝑛), 𝑛 ∈ ℤ+. For any linear function 𝐴 = (𝑎1, 𝑎2) ∶ ℝ2 → ℝ, we have

|𝑓(𝐯𝑛) − 𝑓(0, 0) − 𝐴(𝐯𝑛)|∕‖𝐯𝑛‖ = |1 − 𝑎1∕𝑛² − 𝑎2∕𝑛|∕√(1∕𝑛⁴ + 1∕𝑛²),

which tends to infinity as 𝑛 → ∞, even though ‖𝐯𝑛‖ → 0 as 𝑛 → ∞. So the limit in (1.11) does not exist. This implies that 𝐷𝑓 does not exist at 𝟎, although all the directional derivatives exist.
The above example is very pathological. Nevertheless, if we require the partial derivatives of 𝑓
to be continuous in a neighborhood of 𝑥0 , then 𝑓 is differentiable at 𝑥0 .
Theorem 1.28. Suppose 𝑈 ⊂ ℝ𝑛 is open, 𝑓 ∶ 𝑈 → ℝ, 𝑥0 ∈ 𝑈. Suppose there exists a neighbourhood 𝑉 ⊂ 𝑈 of 𝑥0 such that for all 𝑗 = 1, 2, ⋯, 𝑛, the partial derivatives 𝐷𝑗𝑓 exist and are continuous in 𝑉. Then 𝑓 is differentiable at 𝑥0.
Proof. For simplicity, let us assume 𝑉 is a ball centered at 𝑥0 = (𝑥1, 𝑥2, ⋯, 𝑥𝑛). Suppose 𝐯 = (𝑣1, 𝑣2, ⋯, 𝑣𝑛) ∈ ℝ𝑛 satisfies 𝑥0 + 𝐯 ∈ 𝑉 and 𝐯 ≠ 0. Let 𝐴 = (𝐷1𝑓(𝑥0), 𝐷2𝑓(𝑥0), ⋯, 𝐷𝑛𝑓(𝑥0)). Notice that at this moment we don't know 𝐴 = 𝐷𝑓(𝑥0), because we haven't proved that 𝑓 is differentiable at 𝑥0. Let 𝑔(𝐯) = 𝑓(𝑥0 + 𝐯) − 𝐴𝐯. Then it suffices to show that

lim_{𝐯→0} |𝑔(𝐯) − 𝑔(𝟎)|∕‖𝐯‖ = 0.

We can write

𝑔(𝐯) − 𝑔(𝟎) = 𝑔(𝑣1, ⋯, 𝑣𝑛−1, 𝑣𝑛) − 𝑔(0, ⋯, 0, 0)
= [𝑔(𝑣1, ⋯, 𝑣𝑛−1, 𝑣𝑛) − 𝑔(𝑣1, ⋯, 𝑣𝑛−1, 0)] + [𝑔(𝑣1, ⋯, 𝑣𝑛−1, 0) − 𝑔(𝑣1, ⋯, 0, 0)] + ⋯ + [𝑔(𝑣1, 0, ⋯, 0) − 𝑔(0, ⋯, 0, 0)].

For the term 𝑔(𝑣1, ⋯, 𝑣𝑘, 𝑣𝑘+1, 0, ⋯, 0) − 𝑔(𝑣1, ⋯, 𝑣𝑘, 0, 0, ⋯, 0), we can use the mean value theorem to show that there exists 𝑐𝑘+1 between 0 and 𝑣𝑘+1 such that

𝑔(𝑣1, ⋯, 𝑣𝑘, 𝑣𝑘+1, 0, ⋯, 0) − 𝑔(𝑣1, ⋯, 𝑣𝑘, 0, 0, ⋯, 0) = 𝐷𝑘+1𝑔(𝑣1, ⋯, 𝑣𝑘, 𝑐𝑘+1, 0, ⋯, 0)𝑣𝑘+1.

Note that this equality holds even when 𝑣𝑘+1 = 0. Then we have

|𝑔(𝐯) − 𝑔(𝟎)| ≤ Σ_{𝑗=1}^{𝑛} |𝑣𝑗| sup_{‖𝐮‖≤‖𝐯‖} |𝐷𝑗𝑔(𝐮)|.

It is straightforward to check that for any 1 ≤ 𝑗 ≤ 𝑛, |𝑣𝑗| ≤ ‖𝐯‖. So we have

|𝑔(𝐯) − 𝑔(𝟎)|∕‖𝐯‖ ≤ Σ_{𝑗=1}^{𝑛} sup_{‖𝐮‖≤‖𝐯‖} |𝐷𝑗𝑔(𝐮)|.

By the continuity of 𝐷𝑗𝑓, we have

lim_{𝐮→0} |𝐷𝑗𝑔(𝐮)| = lim_{𝐮→0} |𝐷𝑗𝑓(𝑥0 + 𝐮) − 𝐴(𝑒𝑗)| = lim_{𝐮→0} |𝐷𝑗𝑓(𝑥0 + 𝐮) − 𝐷𝑗𝑓(𝑥0)| = 0.

Then we can use the comparison of limits to conclude that

lim_{𝐯→0} |𝑔(𝐯) − 𝑔(𝟎)|∕‖𝐯‖ = 0. □

1.5. Linear map and Matrix. So far we have established the differential of multivariable func-
tions. The next natural generalization is not only considering multivariable functions but “multi-
variable vector-valued functions”. In other words, we want to consider the functions ℝ𝑛 → ℝ𝑚 ,
where 𝑚, 𝑛 can both be greater than 1.
Notation: throughout this course, I will use maps or mappings to denote a function ℝ𝑛 → ℝ𝑚 .
From previous sections, we see that linear functions are the natural candidates for differential.
Therefore, in order to understand the differential of maps, we first study linear maps.
Definition 1.29. A map 𝐴 ∶ ℝ𝑛 → ℝ𝑚 is linear if for any 𝐮 ∈ ℝ𝑛 , 𝐯 ∈ ℝ𝑛 and 𝑎 ∈ ℝ, 𝑏 ∈ ℝ, we
have
𝐴(𝑎𝐮 + 𝑏𝐯) = 𝑎𝐴(𝐮) + 𝑏𝐴(𝐯).
Just like the linear function, we can use the Cartesian coordinate to express the linear maps.
Given a linear map 𝐴 ∶ ℝ𝑛 → ℝ𝑚 , 𝐴(𝑒𝑗 ) is a vector in ℝ𝑚 , and if we put all the vectors 𝐴(𝑒𝑗 )
together, we get a matrix.

Definition 1.30. An 𝑚 by 𝑛 matrix is a rectangular table of numbers, with 𝑚 rows and 𝑛 columns:

⎡ 𝑎11 𝑎12 ⋯ 𝑎1𝑛 ⎤


⎢ 𝑎21 𝑎22 ⋯ 𝑎2𝑛 ⎥
⎢⋯ ⋯ ⋯ ⋯⎥
⎢ ⎥
⎣𝑎𝑚1 𝑎𝑚2 ⋯ 𝑎𝑚𝑛 ⎦
We can write any linear map 𝐴 as a matrix using the Cartesian coordinates. Suppose 𝐴(𝑒𝑗) = (𝑎1𝑗, 𝑎2𝑗, ⋯, 𝑎𝑚𝑗); then we write

𝐴 = ⎡ 𝑎11 𝑎12 ⋯ 𝑎1𝑛 ⎤
    ⎢ 𝑎21 𝑎22 ⋯ 𝑎2𝑛 ⎥
    ⎢ ⋯   ⋯   ⋯  ⋯ ⎥
    ⎣ 𝑎𝑚1 𝑎𝑚2 ⋯ 𝑎𝑚𝑛 ⎦

Notice that the convention is that the images of 𝐴 are column vectors, so we list 𝑎1𝑗, 𝑎2𝑗, ⋯, 𝑎𝑚𝑗 in a column.
There are numerous ways to introduce matrices. For example, many linear algebra courses start from solving systems of linear equations.
The matrix records the information of the images of {𝑒1, 𝑒2, ⋯, 𝑒𝑛} under 𝐴. Because {𝑒1, 𝑒2, ⋯, 𝑒𝑛} is a basis of ℝ𝑛, any vector 𝐮 ∈ ℝ𝑛 can be uniquely written as a linear combination of them, i.e. there exists a unique array {𝑢1, 𝑢2, ⋯, 𝑢𝑛} such that 𝐮 = Σ_{𝑗=1}^{𝑛} 𝑢𝑗𝑒𝑗. By the linearity,

(1.16) 𝐴(𝐮) = Σ_{𝑗=1}^{𝑛} 𝑢𝑗𝐴(𝑒𝑗) = ⎡ 𝑢1𝑎11 + 𝑢2𝑎12 + ⋯ + 𝑢𝑛𝑎1𝑛 ⎤
                                  ⎢ 𝑢1𝑎21 + 𝑢2𝑎22 + ⋯ + 𝑢𝑛𝑎2𝑛 ⎥
                                  ⎢            ⋯             ⎥
                                  ⎣ 𝑢1𝑎𝑚1 + 𝑢2𝑎𝑚2 + ⋯ + 𝑢𝑛𝑎𝑚𝑛 ⎦

Here the 𝑗-th entry of the vector 𝐴(𝐮) is

(1.17) Σ_{𝑙=1}^{𝑛} 𝑢𝑙𝑎𝑗𝑙.

Now we consider a more complicated question. Suppose we have two linear maps 𝐴 ∶ ℝ𝑛 → ℝ𝑚 and 𝐵 ∶ ℝ𝑚 → ℝ𝑘, in matrix form

𝐴 = ⎡ 𝑎11 𝑎12 ⋯ 𝑎1𝑛 ⎤    𝐵 = ⎡ 𝑏11 𝑏12 ⋯ 𝑏1𝑚 ⎤
    ⎢ 𝑎21 𝑎22 ⋯ 𝑎2𝑛 ⎥        ⎢ 𝑏21 𝑏22 ⋯ 𝑏2𝑚 ⎥
    ⎢ ⋯   ⋯   ⋯  ⋯ ⎥        ⎢ ⋯   ⋯   ⋯  ⋯ ⎥
    ⎣ 𝑎𝑚1 𝑎𝑚2 ⋯ 𝑎𝑚𝑛 ⎦        ⎣ 𝑏𝑘1 𝑏𝑘2 ⋯ 𝑏𝑘𝑚 ⎦

What is the composition 𝐶 = 𝐵◦𝐴?
One can check that 𝐶 is again a linear map. If we write 𝐶 in the form of a matrix, then it must be a 𝑘 by 𝑛 matrix, and each column vector is the image of 𝑒𝑖 ∈ ℝ𝑛 under 𝐶 = 𝐵◦𝐴. Then we use the formula (1.16) to show that for 𝐮 ∈ ℝ𝑛,

𝐶(𝐮) = 𝐵◦𝐴(𝐮) = 𝐵 ⎡ 𝑢1𝑎11 + 𝑢2𝑎12 + ⋯ + 𝑢𝑛𝑎1𝑛 ⎤   ⎡ 𝑣1 ⎤
                  ⎢ 𝑢1𝑎21 + 𝑢2𝑎22 + ⋯ + 𝑢𝑛𝑎2𝑛 ⎥ = ⎢ 𝑣2 ⎥
                  ⎢            ⋯             ⎥   ⎢ ⋯ ⎥
                  ⎣ 𝑢1𝑎𝑚1 + 𝑢2𝑎𝑚2 + ⋯ + 𝑢𝑛𝑎𝑚𝑛 ⎦   ⎣ 𝑣𝑘 ⎦

where

𝑣𝑗 = Σ_{𝑙=1}^{𝑚} 𝑏𝑗𝑙 (Σ_{𝑝=1}^{𝑛} 𝑢𝑝𝑎𝑙𝑝) = Σ_{𝑙=1}^{𝑚} Σ_{𝑝=1}^{𝑛} 𝑏𝑗𝑙𝑎𝑙𝑝𝑢𝑝.

So we conclude that the (𝑗, 𝑖) entry of 𝐶 must be Σ_{𝑙=1}^{𝑚} 𝑏𝑗𝑙𝑎𝑙𝑖.
Definition 1.31 (Matrix multiplication). Suppose 𝐴 is an 𝑚 by 𝑛 matrix and 𝐵 is a 𝑘 by 𝑚 matrix,

𝐵 = ⎡ 𝑏11 𝑏12 ⋯ 𝑏1𝑚 ⎤    𝐴 = ⎡ 𝑎11 𝑎12 ⋯ 𝑎1𝑛 ⎤
    ⎢ 𝑏21 𝑏22 ⋯ 𝑏2𝑚 ⎥        ⎢ 𝑎21 𝑎22 ⋯ 𝑎2𝑛 ⎥
    ⎢ ⋯   ⋯   ⋯  ⋯ ⎥        ⎢ ⋯   ⋯   ⋯  ⋯ ⎥
    ⎣ 𝑏𝑘1 𝑏𝑘2 ⋯ 𝑏𝑘𝑚 ⎦        ⎣ 𝑎𝑚1 𝑎𝑚2 ⋯ 𝑎𝑚𝑛 ⎦

Then we define the matrix 𝐶 = 𝐵𝐴 to be the 𝑘 by 𝑛 matrix

𝐶 = ⎡ 𝑐11 𝑐12 ⋯ 𝑐1𝑛 ⎤
    ⎢ 𝑐21 𝑐22 ⋯ 𝑐2𝑛 ⎥
    ⎢ ⋯   ⋯   ⋯  ⋯ ⎥
    ⎣ 𝑐𝑘1 𝑐𝑘2 ⋯ 𝑐𝑘𝑛 ⎦

where for 1 ≤ 𝑝 ≤ 𝑘 and 1 ≤ 𝑞 ≤ 𝑛,

(1.18) 𝑐𝑝𝑞 = Σ_{𝑗=1}^{𝑚} 𝑏𝑝𝑗𝑎𝑗𝑞.

Remark 1.32. The matrix multiplication can be interpreted by the linear maps: If 𝐴 ∶ ℝ𝑛 → ℝ𝑚
and 𝐵 ∶ ℝ𝑚 → ℝ𝑘 are linear maps, then the matrix 𝐶 gives the linear map 𝐵◦𝐴 ∶ ℝ𝑛 → ℝ𝑘 .
Remark 1.33. We can only multiply a matrix 𝐵 by a matrix 𝐴 if the number of columns of 𝐵 equals
the number of rows of 𝐴.
Remark 1.34. Using the inner product, we can interpret the (𝑝, 𝑞) entry 𝑐𝑝𝑞 of 𝐶 = 𝐵𝐴 to be the
inner product of the 𝑝-th row vector of 𝐵 and the 𝑞-th column vector of 𝐴.
Example 1.35. Suppose

𝐴 = ⎡ 1 1 ⎤
    ⎣ 0 1 ⎦

Then 𝐴𝐴 =∶ 𝐴², the multiplication of 𝐴 with itself, is

𝐴² = ⎡ 1 2 ⎤
     ⎣ 0 1 ⎦

Example 1.36. Suppose

𝐴 = ⎡ 1 1 1 ⎤    𝐵 = ⎡ 1 1 1 ⎤
    ⎣ 0 1 2 ⎦        ⎢ 2 2 1 ⎥
                     ⎣ 3 2 1 ⎦

Then

𝐴𝐵 = ⎡ 6 5 3 ⎤
     ⎣ 8 6 3 ⎦
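For instance, following (1.18) (or Remark 1.34 below), the (1, 1) entry of 𝐴𝐵 is the inner product of the first row of 𝐴 with the first column of 𝐵: 1 ⋅ 1 + 1 ⋅ 2 + 1 ⋅ 3 = 6.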

1.6. Differentiable maps.

Definition 1.37. Suppose 𝑈 ⊂ ℝ𝑛 is open, 𝑥0 ∈ 𝑈, 𝑓 ∶ 𝑈 → ℝ𝑚 is a map. Then 𝑓 is differentiable at 𝑥0 if there exists a linear map 𝐴 ∶ ℝ𝑛 → ℝ𝑚 such that

lim_{𝐯→0} ‖𝑓(𝑥0 + 𝐯) − 𝑓(𝑥0) − 𝐴(𝐯)‖∕‖𝐯‖ = 0.

Here 𝐴 is called the differential of 𝑓 at 𝑥0, and is usually denoted by 𝐷𝑓(𝑥0).

Just as in the function case, we can calculate the directional derivative from the differential.

Proposition 1.38. Suppose 𝑈 ⊂ ℝ𝑛 is open, 𝑥0 ∈ 𝑈, 𝑓 ∶ 𝑈 → ℝ𝑚 is a map, and 𝑓 is differentiable at 𝑥0. Then for 𝐯 ∈ ℝ𝑛, we have

(1.19) 𝐷𝐯𝑓(𝑥0) = 𝐷𝑓(𝑥0)(𝐯).

Just like the differentiable functions, we can use coordinates to express the differential.

Definition 1.39 (Jacobi matrix). Suppose 𝑈 ⊂ ℝ𝑛 is open, 𝑥0 ∈ 𝑈, 𝑓 ∶ 𝑈 → ℝ𝑚 is a map with coordinate expression (𝑓1, 𝑓2, ⋯, 𝑓𝑚). The Jacobi matrix, denoted by 𝐽𝑓(𝑥0), is the 𝑚 by 𝑛 matrix

(1.20) 𝐽𝑓(𝑥0) ∶= ⎡ 𝜕𝑓1∕𝜕𝑥1 𝜕𝑓1∕𝜕𝑥2 ⋯ 𝜕𝑓1∕𝜕𝑥𝑛 ⎤
                 ⎢ 𝜕𝑓2∕𝜕𝑥1 𝜕𝑓2∕𝜕𝑥2 ⋯ 𝜕𝑓2∕𝜕𝑥𝑛 ⎥
                 ⎢    ⋯       ⋯    ⋯     ⋯  ⎥
                 ⎣ 𝜕𝑓𝑚∕𝜕𝑥1 𝜕𝑓𝑚∕𝜕𝑥2 ⋯ 𝜕𝑓𝑚∕𝜕𝑥𝑛 ⎦

We can use the Jacobi matrix to compute the directional derivative of a map explicitly. In fact, if 𝐯 = (𝑣1, 𝑣2, ⋯, 𝑣𝑛) ∈ ℝ𝑛 is given, from 𝐷𝐯𝑓(𝑥0) = 𝐷𝑓(𝑥0)(𝐯), we can write the coordinate expression of 𝐷𝐯𝑓(𝑥0) explicitly:

𝐷𝐯𝑓(𝑥0) = 𝐽𝑓(𝑥0)𝐯 = ⎡ Σ_{𝑗=1}^{𝑛} (𝜕𝑓1∕𝜕𝑥𝑗)𝑣𝑗 ⎤
                     ⎢ Σ_{𝑗=1}^{𝑛} (𝜕𝑓2∕𝜕𝑥𝑗)𝑣𝑗 ⎥
                     ⎢          ⋯             ⎥
                     ⎣ Σ_{𝑗=1}^{𝑛} (𝜕𝑓𝑚∕𝜕𝑥𝑗)𝑣𝑗 ⎦
Example 1.40. Suppose 𝑓 ∶ ℝ𝑛 → ℝ𝑚 is a linear map, given by the matrix 𝐴. Then 𝐷𝑓 (𝑥) = 𝐴
for any 𝑥 ∈ ℝ𝑛 .
Example 1.41 (Polar coordinate). The polar coordinate is an important parametrization of ℝ2
(minus a tiny set). Suppose 𝑈 ⊂ ℝ2 is the set (0, ∞) × (0, 2𝜋). We use 𝑟 to denote the variable
in (0, ∞) and we use 𝜃 to denote the variable in (0, 2𝜋). Then we can define a map 𝑓 ∶ 𝑈 → ℝ2
defined by 𝑓(𝑟, 𝜃) = (𝑟 cos 𝜃, 𝑟 sin 𝜃). The Jacobi matrix of 𝑓 is

(1.21) 𝐽𝑓(𝑟, 𝜃) = ⎡ cos 𝜃  −𝑟 sin 𝜃 ⎤
                  ⎣ sin 𝜃   𝑟 cos 𝜃 ⎦
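Note that det 𝐽𝑓(𝑟, 𝜃) = 𝑟 cos²𝜃 + 𝑟 sin²𝜃 = 𝑟 > 0 everywhere on 𝑈; this fact will be useful when we discuss the inverse function theorem.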
The differential of a function is given by a vector, and we can use the Euclidean metric to define
the magnitude of the differential. We also would like to define the magnitude of the differentials of
maps.
Definition 1.42 (Operator norm). Suppose 𝐴 ∶ ℝ𝑛 → ℝ𝑚 is a linear map. Then the operator norm ‖𝐴‖ is defined as

‖𝐴‖ ∶= sup_{𝐯∈ℝ𝑛, 𝐯≠𝟎} ‖𝐴𝐯‖∕‖𝐯‖.

We have the following rough bound for the operator norm.


Proposition 1.43. Suppose

𝐴 = ⎡ 𝑎11 𝑎12 ⋯ 𝑎1𝑛 ⎤
    ⎢ 𝑎21 𝑎22 ⋯ 𝑎2𝑛 ⎥
    ⎢ ⋯   ⋯   ⋯  ⋯ ⎥
    ⎣ 𝑎𝑚1 𝑎𝑚2 ⋯ 𝑎𝑚𝑛 ⎦

Then

(1.22) ‖𝐴‖ ≤ √(Σ_{𝑖=1}^{𝑚} Σ_{𝑗=1}^{𝑛} 𝑎𝑖𝑗²).

Proof. Suppose 𝐯 ∈ ℝ𝑛 with ‖𝐯‖ = 1. Then

‖𝐴𝐯‖² = Σ_{𝑖=1}^{𝑚} (Σ_{𝑗=1}^{𝑛} 𝑎𝑖𝑗𝑣𝑗)².

The Cauchy-Schwarz inequality shows that

(Σ_{𝑗=1}^{𝑛} 𝑎𝑖𝑗𝑣𝑗)² ≤ (Σ_{𝑗=1}^{𝑛} 𝑎𝑖𝑗²)(Σ_{𝑗=1}^{𝑛} 𝑣𝑗²) = (Σ_{𝑗=1}^{𝑛} 𝑎𝑖𝑗²)‖𝐯‖².

This is sufficient to conclude the desired bound. □
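This bound is generally not sharp: for 𝐴 = Id the 𝑛 × 𝑛 identity matrix, the operator norm is ‖Id‖ = 1, while the right-hand side of (1.22) equals √𝑛.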


Now we study an important tool for computing derivatives, called the chain rule.
Theorem 1.44 (Chain rule). Suppose 𝑈 ⊂ ℝ𝑛 is open, 𝑉 ⊂ ℝ𝑚 is open, 𝑥0 ∈ 𝑈. Suppose 𝑓 ∶ 𝑈 → ℝ𝑚 is differentiable at 𝑥0, 𝑓(𝑥0) ∈ 𝑉, and 𝑔 ∶ 𝑉 → ℝ𝑘 is differentiable at 𝑓(𝑥0). Then

𝐷(𝑔◦𝑓)(𝑥0) = (𝐷𝑔(𝑓(𝑥0)))(𝐷𝑓(𝑥0)).
Proof. By definition, we have

𝑓(𝑥0 + 𝑣) − 𝑓(𝑥0) = 𝐷𝑓(𝑥0)𝑣 + 𝛼(𝑣),

and

𝑔(𝑓(𝑥0) + 𝑢) − 𝑔(𝑓(𝑥0)) = 𝐷𝑔(𝑓(𝑥0))𝑢 + 𝛽(𝑢),

where there exist 𝛿 > 0 and 𝛼 ∶ 𝐵𝛿(0) → ℝ𝑚, 𝛽 ∶ 𝐵𝛿(0) → ℝ𝑘, such that

lim_{𝑣→0} ‖𝛼(𝑣)‖∕‖𝑣‖ = 0,  lim_{𝑢→0} ‖𝛽(𝑢)‖∕‖𝑢‖ = 0.

Then we have

𝑔(𝑓(𝑥0 + 𝑣)) − 𝑔(𝑓(𝑥0)) = 𝑔(𝑓(𝑥0) + 𝐷𝑓(𝑥0)𝑣 + 𝛼(𝑣)) − 𝑔(𝑓(𝑥0))
= 𝐷𝑔(𝑓(𝑥0))(𝐷𝑓(𝑥0)𝑣 + 𝛼(𝑣)) + 𝛽(𝐷𝑓(𝑥0)𝑣 + 𝛼(𝑣)).

Letting 𝑣 → 0, we see that

lim_{𝑣→0} ‖𝑔(𝑓(𝑥0 + 𝑣)) − 𝑔(𝑓(𝑥0)) − 𝐷𝑔(𝑓(𝑥0))𝐷𝑓(𝑥0)𝑣‖∕‖𝑣‖ = 0.

Here we use the following computation: for 𝑣 ≠ 0 and any 𝜖 > 0,

‖𝛽(𝐷𝑓(𝑥0)𝑣 + 𝛼(𝑣))‖∕‖𝑣‖ = [‖𝛽(𝐷𝑓(𝑥0)𝑣 + 𝛼(𝑣))‖∕(‖𝐷𝑓(𝑥0)𝑣 + 𝛼(𝑣)‖ + 𝜖‖𝑣‖)] ⋅ [(‖𝐷𝑓(𝑥0)𝑣 + 𝛼(𝑣)‖ + 𝜖‖𝑣‖)∕‖𝑣‖]
≤ [‖𝛽(𝐷𝑓(𝑥0)𝑣 + 𝛼(𝑣))‖∕(‖𝐷𝑓(𝑥0)𝑣 + 𝛼(𝑣)‖ + 𝜖‖𝑣‖)] ⋅ (‖𝐷𝑓(𝑥0)‖ + ‖𝛼(𝑣)‖∕‖𝑣‖ + 𝜖)

(we do not divide by ‖𝐷𝑓(𝑥0)𝑣 + 𝛼(𝑣)‖ alone, in case it is zero); the first factor tends to 0 as 𝑣 → 0 while the second factor stays bounded, hence the whole expression tends to 0 as 𝑣 → 0. □

Example 1.45. Suppose 𝑔 ∶ ℝ2 → ℝ is a differentiable function, and 𝑓 ∶ (0, ∞) × (0, 2𝜋) → ℝ2 is the polar parametrization. Then using the chain rule we have

[𝜕(𝑔◦𝑓)∕𝜕𝑟 (𝑟, 𝜃)  𝜕(𝑔◦𝑓)∕𝜕𝜃 (𝑟, 𝜃)] = [𝜕𝑔∕𝜕𝑥 (𝑓(𝑟, 𝜃))  𝜕𝑔∕𝜕𝑦 (𝑓(𝑟, 𝜃))] ⎡ cos 𝜃  −𝑟 sin 𝜃 ⎤
                                                                        ⎣ sin 𝜃   𝑟 cos 𝜃 ⎦
= [cos 𝜃 𝜕𝑔∕𝜕𝑥 (𝑓(𝑟, 𝜃)) + sin 𝜃 𝜕𝑔∕𝜕𝑦 (𝑓(𝑟, 𝜃))   −𝑟 sin 𝜃 𝜕𝑔∕𝜕𝑥 (𝑓(𝑟, 𝜃)) + 𝑟 cos 𝜃 𝜕𝑔∕𝜕𝑦 (𝑓(𝑟, 𝜃))].

Sometimes people abuse the notation and write

𝜕𝑟 = cos 𝜃 𝜕𝑥 + sin 𝜃 𝜕𝑦  and  𝜕𝜃 = −𝑟 sin 𝜃 𝜕𝑥 + 𝑟 cos 𝜃 𝜕𝑦.
1.7. Mean value theorem. Recall the mean value theorem for single variable functions: if 𝑓 ∶ [𝑎, 𝑏] → ℝ is continuous and differentiable on (𝑎, 𝑏), then there exists 𝑐 ∈ (𝑎, 𝑏) such that 𝑓(𝑏) − 𝑓(𝑎) = 𝑓′(𝑐)(𝑏 − 𝑎).
This mean value "identity" in general cannot hold for maps. For example, consider two functions 𝑓1 ∶ [𝑎, 𝑏] → ℝ, 𝑓2 ∶ [𝑎, 𝑏] → ℝ that are continuous and differentiable on (𝑎, 𝑏). Then by the mean value theorem there exist 𝑐1, 𝑐2 ∈ (𝑎, 𝑏) such that 𝑓1(𝑏) − 𝑓1(𝑎) = 𝑓1′(𝑐1)(𝑏 − 𝑎) and 𝑓2(𝑏) − 𝑓2(𝑎) = 𝑓2′(𝑐2)(𝑏 − 𝑎). However, 𝑐1, 𝑐2 may not be the same number. So in general we should not expect that for an open set 𝑈 ⊂ ℝ𝑛 and a differentiable map 𝑓 ∶ 𝑈 → ℝ𝑚, for any 𝑥, 𝑦 ∈ 𝑈 we can find 𝑧 ∈ 𝑈 such that 𝑓(𝑦) − 𝑓(𝑥) = 𝐷𝑓(𝑧)(𝑦 − 𝑥).
Nevertheless, we have the following inequality version of the mean value theorem for maps.
Theorem 1.46. Suppose 𝑓 ∶ [𝑎, 𝑏] → ℝ𝑚 is continuous, 𝑓 is differentiable on (𝑎, 𝑏). Then
(1.23) ‖𝑓(𝑏) − 𝑓(𝑎)‖ ≤ 𝑚 sup_{𝑐∈(𝑎,𝑏)} ‖𝐷𝑓(𝑐)‖ |𝑏 − 𝑎|.

Proof. For every 1 ≤ 𝑗 ≤ 𝑚, we can apply the mean value theorem to the coordinate function 𝑓𝑗 to show that there exists 𝑐𝑗 ∈ (𝑎, 𝑏) such that

𝑓𝑗(𝑏) − 𝑓𝑗(𝑎) = 𝑓𝑗′(𝑐𝑗)(𝑏 − 𝑎).

Then we have the inequality

|𝑓𝑗(𝑏) − 𝑓𝑗(𝑎)| ≤ sup_{𝑐∈(𝑎,𝑏)} ‖𝐷𝑓(𝑐)‖ |𝑏 − 𝑎|.

Then we can use the triangle inequality and add these together to conclude that

‖𝑓(𝑏) − 𝑓(𝑎)‖ ≤ Σ_{𝑗=1}^{𝑚} |𝑓𝑗(𝑏) − 𝑓𝑗(𝑎)| ≤ 𝑚 sup_{𝑐∈(𝑎,𝑏)} ‖𝐷𝑓(𝑐)‖ |𝑏 − 𝑎|. □
Next, we consider multivariable functions. We introduce special maps called differentiable curves.

Definition 1.47. Suppose (𝑎, 𝑏) ⊂ ℝ. A map 𝛾 ∶ (𝑎, 𝑏) → ℝ𝑛 is called a differentiable curve if 𝛾 is differentiable as a map and ‖𝐷𝛾‖ ≠ 0 on (𝑎, 𝑏). 𝐷𝛾(𝑠) is called the tangent vector of 𝛾 at 𝛾(𝑠).
Sometimes people say the image of 𝛾 is a curve. During this course, we will view the map itself as the "curve".
Example 1.48. Suppose 𝛾 ∶ (0, 2𝜋) → ℝ2 is defined by 𝛾(𝜃) = (cos 𝜃, sin 𝜃). Then the curve is the
unit circle.
Example 1.49. Suppose 𝛾 ∶ ℝ → ℝ3 is defined by 𝛾(𝑠) = (cos 𝑠, sin 𝑠, 𝑠). Then the curve is the
helix. It is known to be the model of DNA.
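For example, the tangent vector of the helix is 𝐷𝛾(𝑠) = (−sin 𝑠, cos 𝑠, 1), which has length √2 ≠ 0 for every 𝑠, so 𝛾 is indeed a differentiable curve.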
It is natural to define a curve with endpoints: namely, we consider 𝛾 ∶ [𝑎, 𝑏] → ℝ𝑚, where 𝛾 is continuous on [𝑎, 𝑏] and differentiable on (𝑎, 𝑏).
Theorem 1.50 (Mean value theorem for multivariable functions). Suppose 𝑈 ⊂ ℝ𝑛 is open and 𝑓 ∶ 𝑈 → ℝ is a differentiable function. Suppose 𝛾 ∶ [𝑎, 𝑏] → 𝑈 is a curve with endpoints 𝛾(𝑎) = 𝑥 and 𝛾(𝑏) = 𝑦. Then there exists 𝑠 ∈ (𝑎, 𝑏) such that

𝑓(𝑦) − 𝑓(𝑥) = 𝐷𝑓(𝛾(𝑠))𝐷𝛾(𝑠)(𝑏 − 𝑎).

(Apply the single variable mean value theorem and the chain rule to 𝑓◦𝛾.)
For some special open sets 𝑈 , we can always find nice curves to connect two points in 𝑈 .
Definition 1.51. An open set 𝑈 ⊂ ℝ𝑛 is convex if for all 𝑥, 𝑦 ∈ 𝑈 , the straight line segment
connecting them {𝑡𝑦 + (1 − 𝑡)𝑥|𝑡 ∈ [0, 1]} ⊂ 𝑈 .
Theorem 1.52 (Mean value theorem in a convex set). Suppose 𝑈 is an open convex set and 𝑓 ∶ 𝑈 → ℝ is a differentiable function. Then for 𝑥, 𝑦 ∈ 𝑈, there exists 𝑡 ∈ (0, 1) such that

(1.24) 𝑓(𝑦) − 𝑓(𝑥) = 𝐷𝑦−𝑥𝑓(𝑡𝑦 + (1 − 𝑡)𝑥).

Proof. Let 𝛾(𝑡) = 𝑡𝑦 + (1 − 𝑡)𝑥 and apply the mean value theorem for single variable functions to 𝑓◦𝛾. □

Finally, we can combine all the previous theorems to show the following:
Theorem 1.53. Suppose 𝑈 is a convex set, and 𝑓 ∶ 𝑈 → ℝ𝑚 is a differentiable map. Then for
𝑥, 𝑦 ∈ 𝑈 ,
(1.25) ‖𝑓(𝑥) − 𝑓(𝑦)‖ ≤ 𝑚 sup_{𝑧∈𝑈} ‖𝐷𝑓(𝑧)‖ ‖𝑥 − 𝑦‖.

As an application, we have the following result.

Theorem 1.54. Suppose 𝑈 is an open convex set. If 𝑓 ∶ 𝑈 → ℝ𝑚 satisfies 𝐷𝑓 = 0, then 𝑓 is a constant map on 𝑈.

Proof. Use the mean value theorem in a convex set. □

1.8. Higher order derivatives and Taylor's theorem. Now we generalize derivatives to higher orders. Suppose 𝑈 ⊂ ℝ𝑛 is open and 𝑥0 ∈ 𝑈. It is natural to say 𝑓 ∶ 𝑈 → ℝ𝑚 is twice differentiable at 𝑥0 if 𝑓 is differentiable in a neighborhood 𝑉 of 𝑥0, and 𝐷𝑓 ∶ 𝑉 → ℝ𝑚×𝑛 is differentiable at 𝑥0. Here recall that 𝐷𝑓(𝑥) is a linear map from ℝ𝑛 to ℝ𝑚, and such a linear map can be identified with an 𝑚 by 𝑛 matrix, which can be viewed as a vector in ℝ𝑚×𝑛. Similarly, we can define 𝓁-th order differentiability of 𝑓.
Remark 1.55. In linear algebra, the space of all linear maps from ℝ𝑛 → ℝ𝑚 is usually denoted by
Hom(ℝ𝑛 , ℝ𝑚 ), where Hom stands for homomorphisms. We have Hom(ℝ𝑛 , ℝ𝑚 ) ≅ ℝ𝑚×𝑛 .
From the above discussion, we see that an 𝓁-th order differential of a map 𝑓 can be identified with an element of ℝ𝑚×𝑛×𝑛×⋯×𝑛. This is a space of huge dimension. In general, we can only write down a higher-order differential in coordinates.
We will use the following notation:

(1.26) 𝐷𝑖1 ,𝑖2 ,⋯,𝑖𝑘 𝑓 (𝑥0 ) = 𝐷𝑖1 (𝐷𝑖2 (⋯ (𝐷𝑖𝑘 𝑓 )))(𝑥0 ).


In general, it would be hard to understand higher-order derivatives (actually it is even hard to
write down such notations!). We will use 2nd order derivatives as examples to understand higher
order derivatives.
Recall the approximation form of partial derivatives:
𝐷1 𝑓 (𝑥)ℎ ≈ 𝑓 (𝑥 + ℎ𝑒1 ) − 𝑓 (𝑥).
Then we should have
𝐷21 𝑓 (𝑥)ℎ2 ≈𝐷1 𝑓 (𝑥 + ℎ𝑒2 )ℎ − 𝐷1 𝑓 (𝑥)ℎ
≈[𝑓 (𝑥 + ℎ𝑒2 + ℎ𝑒1 ) − 𝑓 (𝑥 + ℎ𝑒2 )] − [𝑓 (𝑥 + ℎ𝑒1 ) − 𝑓 (𝑥)]
=𝑓 (𝑥 + ℎ𝑒2 + ℎ𝑒1 ) − 𝑓 (𝑥 + ℎ𝑒2 ) − 𝑓 (𝑥 + ℎ𝑒1 ) + 𝑓 (𝑥)
Similarly, we have
𝐷12 𝑓 (𝑥)ℎ2 ≈ 𝑓 (𝑥 + ℎ𝑒2 + ℎ𝑒1 ) − 𝑓 (𝑥 + ℎ𝑒2 ) − 𝑓 (𝑥 + ℎ𝑒1 ) + 𝑓 (𝑥),
This implies that 𝐷12 𝑓 = 𝐷21 𝑓 . Of course, we should be careful with limits in higher dimensions,
and such an identity is not always true.
Example 1.56. The following example is an exercise in Boller & Sally: let

𝑓(𝑥, 𝑦) = 0 when (𝑥, 𝑦) = (0, 0), and 𝑓(𝑥, 𝑦) = (𝑥³𝑦 − 𝑥𝑦³)∕(𝑥² + 𝑦²) elsewhere.

One can show that 𝐷12𝑓(0, 0) and 𝐷21𝑓(0, 0) both exist but 𝐷12𝑓(0, 0) ≠ 𝐷21𝑓(0, 0).
Nevertheless, we have the following theorem:
Theorem 1.57. Suppose 𝑈 ⊂ ℝ𝑛 is open and 𝑓 ∶ 𝑈 → ℝ is differentiable on 𝑈. Suppose 𝐷𝑖𝑗𝑓 and 𝐷𝑗𝑖𝑓 exist and are continuous on 𝑈. Then 𝐷𝑖𝑗𝑓(𝑥) = 𝐷𝑗𝑖𝑓(𝑥) for all 𝑥 ∈ 𝑈.

Proof. The proof adapts the approximation idea discussed above. Let us prove 𝐷12𝑓(𝑥) = 𝐷21𝑓(𝑥); the proof for other indices is the same. For (𝑦1, 𝑦2) with 𝑦1 ≠ 0 and 𝑦2 ≠ 0, define

𝑔(𝑥, 𝑦1, 𝑦2) = [𝑓(𝑥 + 𝑦2𝑒2 + 𝑦1𝑒1) − 𝑓(𝑥 + 𝑦2𝑒2) − 𝑓(𝑥 + 𝑦1𝑒1) + 𝑓(𝑥)]∕(𝑦1𝑦2).

Then the mean value theorem for the single variable function

𝛼(𝑧) = 𝑓(𝑥 + 𝑧𝑒2 + 𝑦1𝑒1) − 𝑓(𝑥 + 𝑧𝑒2) − 𝑓(𝑥 + 𝑦1𝑒1) + 𝑓(𝑥)

shows that there exists 𝑧2 between 0 and 𝑦2 such that

𝑔(𝑥, 𝑦1, 𝑦2) = [𝐷2𝑓(𝑥 + 𝑧2𝑒2 + 𝑦1𝑒1) − 𝐷2𝑓(𝑥 + 𝑧2𝑒2)]∕𝑦1.

Similarly, we conclude that there exists 𝑧1 between 0 and 𝑦1 such that

𝑔(𝑥, 𝑦1, 𝑦2) = 𝐷12𝑓(𝑥 + 𝑧2𝑒2 + 𝑧1𝑒1).

In the same way, there exist 𝑤1 between 0 and 𝑦1 and 𝑤2 between 0 and 𝑦2 such that

𝑔(𝑥, 𝑦1, 𝑦2) = 𝐷21𝑓(𝑥 + 𝑤2𝑒2 + 𝑤1𝑒1).

In particular, if we let (𝑦1, 𝑦2) → (0, 0) with 𝑦1 ≠ 0 and 𝑦2 ≠ 0, by the continuity of 𝐷12𝑓 and 𝐷21𝑓 we see that

𝐷12𝑓(𝑥) = 𝐷21𝑓(𝑥). □

Next, we generalize Taylor's theorem to multivariable functions. Again, we first use the idea of approximation to see what the correct formulation of the Taylor theorem should be. Suppose 𝑓 is a very nicely higher-order differentiable function. Then let us expand 𝑓 using the single variable Taylor theorem in each one of the variables. We would get

𝑓(𝑥1 + 𝑦1, ⋯, 𝑥𝑛 + 𝑦𝑛) ≈ Σ_{0≤𝛼1≤𝑘1} 𝐷1^{𝛼1}𝑓(𝑥1, 𝑥2 + 𝑦2, ⋯, 𝑥𝑛 + 𝑦𝑛) 𝑦1^{𝛼1}∕𝛼1!
≈ Σ_{0≤𝛼2≤𝑘2} Σ_{0≤𝛼1≤𝑘1} 𝐷2^{𝛼2}𝐷1^{𝛼1}𝑓(𝑥1, 𝑥2, ⋯, 𝑥𝑛 + 𝑦𝑛) (𝑦1^{𝛼1}∕𝛼1!)(𝑦2^{𝛼2}∕𝛼2!)
≈ ⋯
≈ Σ_{0≤𝛼𝑛≤𝑘𝑛} ⋯ Σ_{0≤𝛼2≤𝑘2} Σ_{0≤𝛼1≤𝑘1} 𝐷𝑛^{𝛼𝑛} ⋯ 𝐷2^{𝛼2}𝐷1^{𝛼1}𝑓(𝑥1, 𝑥2, ⋯, 𝑥𝑛) (𝑦1^{𝛼1}∕𝛼1!)(𝑦2^{𝛼2}∕𝛼2!) ⋯ (𝑦𝑛^{𝛼𝑛}∕𝛼𝑛!).
In order to write this expression succinctly, we introduce the following notation: suppose 𝛼 = (𝛼1, 𝛼2, ⋯, 𝛼𝑛) is an 𝑛-tuple, where 𝛼𝑖 ∈ ℤ≥0 for all 1 ≤ 𝑖 ≤ 𝑛. Then we use |𝛼| to denote Σ_{𝑖=1}^{𝑛} 𝛼𝑖, we use 𝐷^{𝛼} to denote 𝐷1^{𝛼1}𝐷2^{𝛼2} ⋯ 𝐷𝑛^{𝛼𝑛}, we use 𝑥^{𝛼} to denote 𝑥1^{𝛼1}𝑥2^{𝛼2} ⋯ 𝑥𝑛^{𝛼𝑛}, and we use 𝛼! to denote 𝛼1!𝛼2! ⋯ 𝛼𝑛!. Now we can state the Taylor theorem for 𝑛-variable functions.

Theorem 1.58 (Taylor theorem for 𝑛-variable functions). Suppose 𝑈 ⊂ ℝ𝑛 is an open convex set, and 𝑓 ∶ 𝑈 → ℝ is a function such that for all 1 ≤ 𝑚 ≤ 𝑘 + 1, all 𝑚-th order partial derivatives of 𝑓 exist and are continuous on 𝑈. Then for any 𝑥, 𝑦 ∈ 𝑈, there exists 𝑠 ∈ [0, 1] such that

(1.27) 𝑓(𝑦) = Σ_{|𝛼|≤𝑘} [𝐷^{𝛼}𝑓(𝑥)∕𝛼!](𝑦 − 𝑥)^{𝛼} + Σ_{|𝛼|=𝑘+1} [𝐷^{𝛼}𝑓(𝑠𝑦 + (1 − 𝑠)𝑥)∕𝛼!](𝑦 − 𝑥)^{𝛼}.

Proof. By the openness and convexity of 𝑈, we may find 𝜖 > 0 such that {𝑡𝑦 + (1 − 𝑡)𝑥 | 𝑡 ∈ [−𝜖, 1 + 𝜖]} ⊂ 𝑈. Now we define the single variable function 𝑔 ∶ [−𝜖, 1 + 𝜖] → ℝ by

𝑔(𝑡) = 𝑓(𝑡𝑦 + (1 − 𝑡)𝑥).

Now we compute the 𝑘-th order derivative of 𝑔. We prove the following identity by induction on 𝑘:

𝑔^{(𝑘)}(𝑡) = Σ_{|𝛼|=𝑘} (𝑘!∕𝛼!) 𝐷^{𝛼}𝑓(𝑡𝑦 + (1 − 𝑡)𝑥)(𝑦 − 𝑥)^{𝛼}.

When 𝑘 = 1, this identity becomes

𝑔′(𝑡) = Σ_{𝑖=1}^{𝑛} 𝐷𝑖𝑓(𝑡𝑦 + (1 − 𝑡)𝑥)(𝑦𝑖 − 𝑥𝑖),

and this is just the chain rule. Now suppose the identity holds for 𝑘 ∈ ℤ+. Then we have

𝑔^{(𝑘+1)}(𝑡) = 𝐷𝑡[Σ_{|𝛼|=𝑘} (𝑘!∕𝛼!) 𝐷^{𝛼}𝑓(𝑡𝑦 + (1 − 𝑡)𝑥)(𝑦 − 𝑥)^{𝛼}]
= Σ_{𝑖=1}^{𝑛} Σ_{|𝛼|=𝑘} (𝑘!∕𝛼!) 𝐷𝑖𝐷^{𝛼}𝑓(𝑡𝑦 + (1 − 𝑡)𝑥)(𝑦𝑖 − 𝑥𝑖)(𝑦 − 𝑥)^{𝛼}.

Note that for 𝛼 = (𝛼1, 𝛼2, ⋯, 𝛼𝑛) and any 1 ≤ 𝑖 ≤ 𝑛, if we write 𝛼* = (𝛼1, ⋯, 𝛼𝑖−1, 𝛼𝑖 + 1, 𝛼𝑖+1, ⋯, 𝛼𝑛), then

𝐷𝑖𝐷^{𝛼}𝑓(𝑡𝑦 + (1 − 𝑡)𝑥)(𝑦𝑖 − 𝑥𝑖)(𝑦 − 𝑥)^{𝛼} = 𝐷^{𝛼*}𝑓(𝑡𝑦 + (1 − 𝑡)𝑥)(𝑦 − 𝑥)^{𝛼*}.

In order to take the sum over 𝛼* with |𝛼*| = 𝑘 + 1, we notice that any 𝛼* = (𝛼1*, ⋯, 𝛼𝑛*) with |𝛼*| = 𝑘 + 1 can be obtained by adding 1 to one entry of one of the following tuples: (𝛼1* − 1, 𝛼2*, ⋯, 𝛼𝑛*), ⋯, (𝛼1*, ⋯, 𝛼𝑛* − 1), and there are 𝑛 different possibilities. So the coefficient of 𝐷^{𝛼*}𝑓(𝑡𝑦 + (1 − 𝑡)𝑥)(𝑦 − 𝑥)^{𝛼*} is

Σ_{𝑗=1}^{𝑛} 𝑘!𝛼𝑗*∕𝛼*! = (𝑘 + 1)!∕𝛼*!.

Note that we can extend the definitions of the notations 𝐷^{𝛼}, 𝛼!, (𝑦 − 𝑥)^{𝛼} to be 0 if 𝛼 = (𝛼1, ⋯, 𝛼𝑛) has some 𝛼𝑖 < 0; the above argument in the induction step still works. This implies that

𝑔^{(𝑘+1)}(𝑡) = Σ_{|𝛼|=𝑘+1} ((𝑘 + 1)!∕𝛼!) 𝐷^{𝛼}𝑓(𝑡𝑦 + (1 − 𝑡)𝑥)(𝑦 − 𝑥)^{𝛼},

which concludes the induction.
Finally, we apply the Taylor theorem for single variable functions to 𝑔 on (−𝜖, 1 + 𝜖). There exists 𝑠 ∈ (0, 1) such that

𝑔(1) = Σ_{𝑙=0}^{𝑘} 𝑔^{(𝑙)}(0)∕𝑙! + 𝑔^{(𝑘+1)}(𝑠)∕(𝑘 + 1)!.

Then we use the expression for 𝑔^{(𝑙)}(0) obtained above to see that

𝑓(𝑦) = Σ_{|𝛼|≤𝑘} [𝐷^{𝛼}𝑓(𝑥)∕𝛼!](𝑦 − 𝑥)^{𝛼} + Σ_{|𝛼|=𝑘+1} [𝐷^{𝛼}𝑓(𝑠𝑦 + (1 − 𝑠)𝑥)∕𝛼!](𝑦 − 𝑥)^{𝛼}. □
For later purposes, we state Taylor's theorem with the Peano remainder.

Theorem 1.59 (Taylor theorem for 𝑛-variable functions, with Peano remainder). Suppose 𝑈 ⊂ ℝ𝑛 is an open convex set, and 𝑓 ∶ 𝑈 → ℝ is a function such that for all 1 ≤ 𝑚 ≤ 𝑘, all 𝑚-th order partial derivatives of 𝑓 exist and are continuous on 𝑈. Then

(1.28) 𝑓(𝑦) = Σ_{|𝛼|≤𝑘} [𝐷^{𝛼}𝑓(𝑥)∕𝛼!](𝑦 − 𝑥)^{𝛼} + 𝑜(‖𝑥 − 𝑦‖^{𝑘}).

Here the notation 𝑜(‖𝑥 − 𝑦‖^{𝑘}) means the following:

lim_{‖𝑥−𝑦‖→0} [𝑓(𝑦) − Σ_{|𝛼|≤𝑘} (𝐷^{𝛼}𝑓(𝑥)∕𝛼!)(𝑦 − 𝑥)^{𝛼}]∕‖𝑥 − 𝑦‖^{𝑘} = 0.
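The case 𝑘 = 2 will be used repeatedly in the next subsection: it reads

𝑓(𝑦) = 𝑓(𝑥) + ⟨𝐷𝑓(𝑥), 𝑦 − 𝑥⟩ + (1∕2)(𝑦 − 𝑥)ᵗ Hess𝑓(𝑥)(𝑦 − 𝑥) + 𝑜(‖𝑥 − 𝑦‖²),

where Hess𝑓 is the Hessian matrix defined below.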
1.9. Hessian and second partial derivative test. For later purposes, we introduce the following notation. Suppose Ω ⊂ ℝ𝑛 is an open set. We say a map 𝑓 ∈ 𝐶 𝑘(Ω, ℝ𝑘) if it has all partial derivatives up to 𝑘-th order, and all these partial derivatives are continuous in Ω. We say a map 𝑓 ∈ 𝐶(Ω̄, ℝ𝑘) if it is continuous on the closure Ω̄.
Recall that for a single variable function, extreme points are critical points, and we can use the second-order derivative test to determine whether a critical point is a maximum or a minimum. We can generalize this tool to the multivariable world. Before that, we need some new notation.
Definition 1.60. Suppose Ω ⊂ ℝ𝑛 is an open set, and 𝑓 ∈ 𝐶 2(Ω, ℝ). For 𝑥 ∈ Ω, the Hessian matrix of 𝑓 at 𝑥 is the 𝑛 × 𝑛 matrix

(1.29) Hess𝑓(𝑥) ∶= 𝐷²𝑓(𝑥) = ⎡ 𝐷11𝑓(𝑥) 𝐷12𝑓(𝑥) ⋯ 𝐷1𝑛𝑓(𝑥) ⎤
                            ⎢ 𝐷21𝑓(𝑥) 𝐷22𝑓(𝑥) ⋯ 𝐷2𝑛𝑓(𝑥) ⎥
                            ⎢    ⋯       ⋯    ⋯     ⋯   ⎥
                            ⎣ 𝐷𝑛1𝑓(𝑥) 𝐷𝑛2𝑓(𝑥) ⋯ 𝐷𝑛𝑛𝑓(𝑥) ⎦
The Hessian is a symmetric matrix, namely the (𝑖, 𝑗)-entry equals the (𝑗, 𝑖)-entry; this follows from Theorem 1.57 because 𝑓 ∈ 𝐶 2(Ω, ℝ).
Definition 1.61. A symmetric matrix 𝐴 is
∙ positive definite if for all 𝑣 ∈ ℝ𝑛∖{0}, 𝑣ᵗ𝐴𝑣 > 0;
∙ semi-positive definite if for all 𝑣 ∈ ℝ𝑛∖{0}, 𝑣ᵗ𝐴𝑣 ≥ 0;
∙ negative definite if for all 𝑣 ∈ ℝ𝑛∖{0}, 𝑣ᵗ𝐴𝑣 < 0;
∙ semi-negative definite if for all 𝑣 ∈ ℝ𝑛∖{0}, 𝑣ᵗ𝐴𝑣 ≤ 0.
We remark that positive definiteness actually implies a stronger inequality.
Proposition 1.62. If a symmetric matrix 𝐴 is positive definite (resp. negative definite), then there exists Λ > 0 such that for all 𝑣 ∈ ℝ𝑛∖{0}, 𝑣ᵗ𝐴𝑣 ≥ Λ‖𝑣‖² (resp. 𝑣ᵗ𝐴𝑣 ≤ −Λ‖𝑣‖²).
Now we discuss how to determine whether a matrix is positive definite, negative definite, or neither. First we notice the following fact:
Proposition 1.63. 𝐴 is positive definite if and only if −𝐴 is negative definite; 𝐴 is semi-positive definite if and only if −𝐴 is semi-negative definite.
So we only need a way to determine whether a symmetric matrix is positive definite. We need some notions from linear algebra.
Definition 1.64. Suppose 𝐴 is a matrix. 𝜆 is an eigenvalue of 𝐴 if there exists a vector 𝑣 ≠ 0 such that 𝐴𝑣 = 𝜆𝑣. 𝑣 is called an eigenvector.
We state two criteria characterizing positive definite matrices. The proofs can be found in any linear algebra textbook.
Proposition 1.65 (Characterization of a positive definite matrix by eigenvalues). A symmetric matrix 𝐴 is positive definite if and only if all of its eigenvalues are positive.
Similarly, 𝐴 is semi-positive definite (resp. negative definite, semi-negative definite) if and only if all the eigenvalues are nonnegative (resp. negative, nonpositive).
The next characterization is more computable.
Proposition 1.66 (Characterization of a positive definite matrix by computations). Suppose 𝐴 = [𝑎𝑖𝑗] is a symmetric matrix. Then 𝐴 is positive definite if and only if for all 1 ≤ 𝑘 ≤ 𝑛, the square submatrix

𝐴𝑘 = ⎡ 𝑎11 𝑎12 ⋯ 𝑎1𝑘 ⎤
     ⎢ 𝑎21 𝑎22 ⋯ 𝑎2𝑘 ⎥
     ⎢ ⋯   ⋯   ⋯  ⋯ ⎥
     ⎣ 𝑎𝑘1 𝑎𝑘2 ⋯ 𝑎𝑘𝑘 ⎦

has positive determinant. Similarly, 𝐴 is semi-positive definite if and only if every principal submatrix (not only the upper-left ones) has nonnegative determinant.
Example 1.67 (Local model). Let 𝑓(𝑥1, ⋯, 𝑥𝑛) = Σ_{𝑗=1}^{𝑛} 𝑎𝑗𝑥𝑗². Then

Hess𝑓(𝑥) = ⎡ 2𝑎1  0    0   ⋯   0  ⎤
           ⎢  0  2𝑎2   0   ⋯   0  ⎥
           ⎢  0   0   2𝑎3  ⋯   0  ⎥
           ⎢  ⋯   ⋯    ⋯   ⋯   ⋯  ⎥
           ⎣  0   0    0   ⋯  2𝑎𝑛 ⎦

It is clear that Hess𝑓 is positive definite if and only if 𝑎𝑖 > 0 for all 1 ≤ 𝑖 ≤ 𝑛.
Example 1.68. Let us consider a 2 × 2 symmetric matrix

𝐴 = ⎡ 𝑎 𝑏 ⎤
    ⎣ 𝑏 𝑐 ⎦

One can show that it is positive definite if and only if 𝑎 > 0 and det 𝐴 = 𝑎𝑐 − 𝑏² > 0.
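For a concrete instance, 𝐴 = [2 1; 1 2] is positive definite: 𝑎 = 2 > 0 and det 𝐴 = 4 − 1 = 3 > 0. Consistently with Proposition 1.65, its eigenvalues are 3 (eigenvector (1, 1)) and 1 (eigenvector (1, −1)), both positive.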
Definition 1.69. Suppose 𝑈 ⊂ ℝ𝑛 is open and 𝑓 ∈ 𝐶 1 (𝑈 , ℝ). We say 𝑥0 is a local maximum
(resp. minimum) point of 𝑓 , if there exists a neighbourhood 𝑉 of 𝑥0 such that for all 𝑥 ∈ 𝑉 ,
𝑓 (𝑥) ≤ 𝑓 (𝑥0 ) (resp. 𝑓 (𝑥) ≥ 𝑓 (𝑥0 )).
Proposition 1.70. Suppose 𝑈 ⊂ ℝ𝑛 is open and 𝑓 ∈ 𝐶 1(𝑈, ℝ). Suppose 𝑥0 is a local maximum point or a local minimum point of 𝑓. Then 𝐷𝑓(𝑥0) = 0.
Proof. We prove the case where 𝑥0 is a local maximum point; the local minimum case is similar. We only need to show that 𝐷𝑣𝑓(𝑥0) = 0 for all 𝑣 ∈ ℝ𝑛. Suppose 𝑥0 is the maximum point in 𝑉. Then

𝐷𝑣𝑓(𝑥0) = lim_{ℎ→0} [𝑓(𝑥0 + ℎ𝑣) − 𝑓(𝑥0)]∕ℎ.

Because 𝑓(𝑥0 + ℎ𝑣) − 𝑓(𝑥0) ≤ 0 for ℎ sufficiently small that 𝑥0 + ℎ𝑣 ∈ 𝑉, we have

lim_{ℎ→0+} [𝑓(𝑥0 + ℎ𝑣) − 𝑓(𝑥0)]∕ℎ ≤ 0,  lim_{ℎ→0−} [𝑓(𝑥0 + ℎ𝑣) − 𝑓(𝑥0)]∕ℎ ≥ 0.

Therefore, 𝐷𝑣𝑓(𝑥0) = 0. □
Just like in the single variable case, we call 𝑥0 a critical point if 𝐷𝑓(𝑥0) = 0.
Example 1.71. Critical points may not be local minima/maxima. Even in the single variable case, 0 is a critical point of 𝑓(𝑥) = 𝑥³, but it is neither a local minimum nor a local maximum.
Now we show that the Hessian of a function at a local minimum/maximum point must be semi-
definite.
Theorem 1.72. Suppose 𝑈 ⊂ ℝ𝑛 is open and 𝑓 ∈ 𝐶 2 (𝑈 , ℝ). Suppose 𝑥0 ∈ 𝑈 is a local mini-
mum point (resp. local maximum point) of 𝑓 , then Hess𝑓 (𝑥0 ) is semi-positive definite (resp. semi-
negative definite).
Proof. We prove this by contradiction. Suppose Hess𝑓(𝑥0) is not semi-positive definite; then there exists 𝑣 = (𝑣1, ⋯, 𝑣𝑛) ≠ 0 such that 𝑣ᵗ Hess𝑓(𝑥0)𝑣 < 0. Note that 𝐷𝑓(𝑥0) = 0 by Proposition 1.70, so by the Taylor theorem with the Peano remainder,

𝑓(𝑥0 + 𝑡𝑣) = 𝑓(𝑥0) + 𝑡² Σ_{𝑖,𝑗=1}^{𝑛} [𝐷𝑖𝑗𝑓(𝑥0)∕2] 𝑣𝑖𝑣𝑗 + 𝑜(‖𝑡𝑣‖²) = 𝑓(𝑥0) + (𝑡²∕2) 𝑣ᵗ Hess𝑓(𝑥0)𝑣 + 𝑜(𝑡²).

Because 𝑥0 is a local minimum point, there exists 𝑡0 > 0 such that 𝑓(𝑥0 + 𝑡𝑣) ≥ 𝑓(𝑥0) when |𝑡| < 𝑡0. Hence

𝑣ᵗ Hess𝑓(𝑥0)𝑣 + 2𝑜(𝑡²)∕𝑡² ≥ 0.

Letting 𝑡 → 0 we see 𝑣ᵗ Hess𝑓(𝑥0)𝑣 ≥ 0, which is a contradiction. □

Now we generalize the second-order derivative test to multi-variable functions.

Theorem 1.73 (Hessian test). Suppose 𝑈 ⊂ ℝ𝑛 is open and 𝑓 ∈ 𝐶 2(𝑈, ℝ). Suppose 𝑥0 ∈ 𝑈 is a critical point of 𝑓. If Hess𝑓(𝑥0) is positive definite, then 𝑥0 is a local minimum point; if Hess𝑓(𝑥0) is negative definite, then 𝑥0 is a local maximum point.

Proof. We use the Taylor theorem with the Peano remainder. Suppose Hess𝑓(𝑥0) is positive definite, and let Λ > 0 be the constant from Proposition 1.62. Because 𝑥0 is a critical point, for ℎ = (ℎ1, ℎ2, ⋯, ℎ𝑛),

𝑓(𝑥0 + ℎ) = 𝑓(𝑥0) + Σ_{𝑖,𝑗=1}^{𝑛} [𝐷𝑖𝑗𝑓(𝑥0)∕2] ℎ𝑖ℎ𝑗 + 𝑜(‖ℎ‖²)
= 𝑓(𝑥0) + (1∕2) ℎᵗ Hess𝑓(𝑥0)ℎ + 𝑜(‖ℎ‖²)
≥ 𝑓(𝑥0) + (Λ∕2)‖ℎ‖² + 𝑜(‖ℎ‖²).

By the definition of the little-𝑜 notation, lim_{‖ℎ‖→0} 𝑜(‖ℎ‖²)∕‖ℎ‖² = 0. So there exists 𝛿 > 0 such that |𝑜(‖ℎ‖²)| ≤ (Λ∕4)‖ℎ‖² when 0 < ‖ℎ‖ < 𝛿. Therefore, when 0 < ‖ℎ‖ < 𝛿, we have

𝑓(𝑥0 + ℎ) ≥ 𝑓(𝑥0) + (Λ∕2)‖ℎ‖² − (Λ∕4)‖ℎ‖² > 𝑓(𝑥0).

This implies that 𝑥0 is a local minimum point. The negative definite case follows by applying this argument to −𝑓. □
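For instance, consider 𝑓(𝑥, 𝑦) = 𝑥² + 𝑥𝑦 + 𝑦². The only critical point is (0, 0) (solving 𝐷𝑥𝑓 = 2𝑥 + 𝑦 = 0 and 𝐷𝑦𝑓 = 𝑥 + 2𝑦 = 0), and Hess𝑓(0, 0) = [2 1; 1 2], which is positive definite by Example 1.68, so (0, 0) is a local minimum point. By contrast, for 𝑓(𝑥, 𝑦) = 𝑥² − 𝑦², the Hessian at the critical point (0, 0) is diag(2, −2), which is neither semi-positive nor semi-negative definite, so Theorem 1.72 rules out both a local minimum and a local maximum: (0, 0) is a saddle point.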

2. SUBMANIFOLDS IN ℝ𝑁
2.1. Invertible linear maps. Suppose 𝐴 ∶ ℝ𝑛 → ℝ𝑚 is a linear map. Let us consider in what circumstances 𝐴 is invertible; in other words, when for any 𝑦 ∈ ℝ𝑚 there exists a unique 𝑥 ∈ ℝ𝑛 such that 𝐴(𝑥) = 𝑦.
Suppose {𝛼1, ⋯, 𝛼𝑛} is a basis of ℝ𝑛 and {𝛽1, ⋯, 𝛽𝑚} is a basis of ℝ𝑚. Recall that the linear map 𝐴 can be written as a matrix

𝐴 = ⎡ 𝑎11 𝑎12 ⋯ 𝑎1𝑛 ⎤
    ⎢ 𝑎21 𝑎22 ⋯ 𝑎2𝑛 ⎥
    ⎢ ⋯   ⋯   ⋯  ⋯ ⎥
    ⎣ 𝑎𝑚1 𝑎𝑚2 ⋯ 𝑎𝑚𝑛 ⎦
From linear algebra, we know that the matrix 𝐴 is invertible if and only if 𝑚 = 𝑛 and there exists a
matrix 𝐴−1 such that 𝐴𝐴−1 = 𝐴−1 𝐴 = Id, where Id is the identity matrix, whose (𝑖, 𝑖)-entries are
all 1 and all other entries are 0.
The following result is key to determining whether a matrix is invertible or not.

Proposition 2.1. An 𝑛 × 𝑛 matrix 𝐴 is invertible if and only if the determinant det(𝐴) ≠ 0.



2.2. Inverse function theorem. Now we consider a map 𝑓 ∶ 𝑈 → 𝑉 , where 𝑈 and 𝑉 are open
sets. We want to know when 𝑓 is invertible.
Because 𝐷𝑓(𝑥0) is the linear approximation of 𝑓 near 𝑥0, we would naturally expect that if 𝐷𝑓(𝑥0) is invertible, then 𝑓 is invertible, at least in a neighbourhood of 𝑥0.
The main goal of this section is to prove the following theorem.
Theorem 2.2 (Inverse function theorem). Suppose Ω is an open set in ℝ𝑛 , 𝑥0 ∈ Ω, 𝑓 ∈ 𝐶 1 (Ω, ℝ𝑛 )
and 𝐷𝑓 (𝑥0 ) is invertible. Then there exists a neighbourhood 𝑈 of 𝑥0 and a neighbourhood 𝑉 of
𝑓 (𝑥0 ), such that 𝑓 |𝑈 ∶ 𝑈 → 𝑉 is bijective, and 𝑓 −1 ∶ 𝑉 → 𝑈 is in 𝐶 1 (𝑉 , ℝ𝑛 ).
The proof is based on the Contraction Mapping Theorem.
Theorem 2.3 (Contraction Mapping Theorem). Suppose (𝑋, 𝑑) is a complete metric space, 𝑇 ∶
𝑋 → 𝑋 is a contraction map, i.e. there exists 𝑎 ∈ (0, 1) such that for all 𝑥, 𝑦 ∈ 𝑋, 𝑑(𝑇 (𝑥), 𝑇 (𝑦)) ≤
𝑎𝑑(𝑥, 𝑦), then there exists 𝑥0 ∈ 𝑋 such that 𝑇 (𝑥0 ) = 𝑥0 .
The Contraction Mapping Theorem can be viewed as a method to solve “nonlinear” problems.
In mathematics, if we would like to solve linear problems, we usually adapt the ideas from linear
algebra; if we would like to solve nonlinear problems, we would like to apply some fixed point
theorem, and the Contraction Mapping Theorem is such a theorem.
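As a simple illustration, 𝑇(𝑥) = (1∕2)cos 𝑥 is a contraction on the complete metric space ℝ: by the mean value theorem, |𝑇(𝑥) − 𝑇(𝑦)| ≤ (1∕2)|𝑥 − 𝑦|. The Contraction Mapping Theorem then produces a solution of the nonlinear equation 𝑥 = (1∕2)cos 𝑥, which cannot be solved in closed form.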
Proof of Inverse function theorem. The proof is divided into several parts.
Step 1: bijectivity. We first turn the problem into a problem of finding a fixed point. For simplicity, we suppose 𝑥0 = 0, 𝑓(0) = 0, and 𝐴 ∶= 𝐷𝑓(𝑥0) = Id. (Otherwise, we can replace 𝑓(𝑥) by 𝐴−1𝑓(𝑥 + 𝑥0) − 𝐴−1𝑓(𝑥0).) Given 𝑦, our goal is to find 𝑥 ∈ Ω such that 𝑓(𝑥) = 𝑦. Let

𝑇(𝑥) = 𝑥 − 𝑓(𝑥) + 𝑦.

If 𝑥∗ is a fixed point of 𝑇, then 𝑇(𝑥∗) = 𝑥∗, hence 𝑓(𝑥∗) = 𝑦. So the question becomes finding a fixed point of 𝑇.
Now we check that 𝑇 is a contraction map. We have

𝑇(𝑥1) − 𝑇(𝑥2) = 𝑥1 − 𝑥2 − (𝑓(𝑥1) − 𝑓(𝑥2)).

If we apply the mean value theorem (Theorem 1.53) to the map Id − 𝑓, we have, for 𝑥1, 𝑥2 ∈ 𝐵𝛿(0),

‖𝑇(𝑥1) − 𝑇(𝑥2)‖ ≤ 𝑛 sup_{𝑧∈𝐵𝛿(0)} ‖Id − 𝐷𝑓(𝑧)‖ ‖𝑥1 − 𝑥2‖.

Because 𝑓 ∈ 𝐶 1(Ω, ℝ𝑛) and 𝐷𝑓(0) = Id, there exists 𝛿 > 0 such that

sup_{𝑧∈𝐵𝛿(0)} ‖Id − 𝐷𝑓(𝑧)‖ < 1∕(2𝑛).

Then inside 𝐵𝛿(0),

‖𝑇(𝑥1) − 𝑇(𝑥2)‖ ≤ (1∕2)‖𝑥1 − 𝑥2‖,

hence 𝑇 is a contraction map.

In order to apply the contraction mapping theorem, we need to choose 𝑦 such that 𝑇 is a map from 𝐵𝛿(0) to 𝐵𝛿(0). In fact, if 𝑦 is chosen such that ‖𝑦‖ < 𝛿∕2, then for 𝑥 ∈ 𝐵𝛿(0),

‖𝑇(𝑥)‖ ≤ ‖𝑇(𝑥) − 𝑇(0)‖ + ‖𝑇(0)‖ ≤ (1∕2)‖𝑥‖ + ‖𝑦‖ < 𝛿.

This implies that 𝑇 maps 𝐵𝛿(0) to 𝐵𝛿(0). Then by the contraction mapping theorem, we conclude the following fact: there exists 𝛿 > 0 such that for every 𝑦 ∈ 𝐵𝛿∕2(0), there exists 𝑥 ∈ 𝐵𝛿(0) such that 𝑓(𝑥) = 𝑦. If we set 𝑉 ∶= 𝐵𝛿∕2(0) and 𝑈 ∶= 𝑓−1(𝑉), we have proved that 𝑓|𝑈 ∶ 𝑈 → 𝑉 is a bijection.
Step 2: 𝑓−1 is continuous. For simplicity we denote 𝑓−1 by 𝑔. Suppose 𝑓(𝑥1) = 𝑦1 and 𝑓(𝑥2) = 𝑦2. Then

‖𝑔(𝑦1) − 𝑔(𝑦2)‖ = ‖𝑥1 − 𝑥2‖ = ‖𝑇(𝑥1) + 𝑓(𝑥1) − (𝑇(𝑥2) + 𝑓(𝑥2))‖ ≤ ‖𝑇(𝑥1) − 𝑇(𝑥2)‖ + ‖𝑦1 − 𝑦2‖.

Here 𝑇 is the contraction map with any fixed 𝑦 ∈ 𝐵𝛿∕2(0). Then the contraction property of 𝑇 implies that ‖𝑇(𝑥1) − 𝑇(𝑥2)‖ ≤ (1∕2)‖𝑥1 − 𝑥2‖. This implies that

‖𝑥1 − 𝑥2‖ ≤ (1∕2)‖𝑥1 − 𝑥2‖ + ‖𝑦1 − 𝑦2‖.

Thus, we have

‖𝑥1 − 𝑥2‖ ≤ 2‖𝑦1 − 𝑦2‖, i.e. ‖𝑔(𝑦1) − 𝑔(𝑦2)‖ ≤ 2‖𝑦1 − 𝑦2‖.

This implies that 𝑓−1 is continuous.
Step 3: differentiability of 𝑓−1. Let us use 𝑔 to denote 𝑓−1. Suppose 𝑦 = 𝑓(𝑥) and 𝐵 = 𝐷𝑓(𝑥). Then for 𝑣 ≠ 0 such that 𝑦 + 𝑣 ∈ 𝐵𝛿∕2(0), let 𝑥′ = 𝑔(𝑦 + 𝑣); then

‖𝑔(𝑦 + 𝑣) − 𝑔(𝑦) − 𝐵−1𝑣‖ = ‖𝑥′ − 𝑥 − 𝐵−1(𝑓(𝑥′) − 𝑓(𝑥))‖ ≤ 𝑛 sup_{𝑧∈𝐵‖𝑥′−𝑥‖(𝑥)} ‖Id − 𝐵−1◦𝐷𝑓(𝑧)‖ ‖𝑥′ − 𝑥‖.

From Step 2,

‖𝑥′ − 𝑥‖ ≤ 2‖(𝑦 + 𝑣) − 𝑦‖ = 2‖𝑣‖.

Therefore, we conclude that

lim_{𝑣→0} ‖𝑔(𝑦 + 𝑣) − 𝑔(𝑦) − 𝐵−1𝑣‖∕‖𝑣‖ = 0.

This shows that 𝑔 is differentiable on 𝐵𝛿∕2(0), with 𝐷𝑔(𝑦) = (𝐷𝑓(𝑥))−1 for 𝑦 = 𝑓(𝑥); continuity of 𝐷𝑔 then follows from the continuity of 𝐷𝑓, of 𝑔, and of matrix inversion. □
Remark 2.4. 𝐷(𝑓 −1 ) = (𝐷𝑓 )−1 is expected from the chain rule.
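For a concrete instance, recall the polar parametrization 𝑓(𝑟, 𝜃) = (𝑟 cos 𝜃, 𝑟 sin 𝜃) from Example 1.41: det 𝐽𝑓(𝑟, 𝜃) = 𝑟 > 0 on (0, ∞) × (0, 2𝜋), so 𝐷𝑓 is invertible everywhere, and the inverse function theorem gives a local 𝐶 1 inverse (a local "Cartesian-to-polar" map) near every point of the image.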

Definition 2.5. Suppose 𝑈, 𝑉 are open sets in ℝ𝑛. If 𝑓 ∈ 𝐶 1(𝑈, 𝑉) is bijective and 𝑓−1 ∈ 𝐶 1(𝑉, 𝑈), then 𝑓 is called a 𝐶 1-diffeomorphism.
An interesting fact is that if 𝑓 has higher regularity, then 𝑓−1 will also have the same regularity. We write 𝑓 ∈ 𝐶 𝑘(𝑈, ℝ𝑛) if 𝑓 has all the 𝑚-th order partial derivatives for 1 ≤ 𝑚 ≤ 𝑘, and all these partial derivatives are continuous on 𝑈.
Theorem 2.6. Suppose 𝑘 ≥ 1 is an integer. If 𝑓 ∈ 𝐶 𝑘 (𝑈 , ℝ𝑛 ) in the statement of inverse function
theorem, then 𝑓 −1 ∈ 𝐶 𝑘 (𝑉 , ℝ𝑛 ).
Proof. We need the following lemmas.
Lemma 2.7. If 𝑓 ∶ 𝑈 → ℝ𝑛 is differentiable and 𝐷𝑓 ∈ 𝐶 𝑘−1(𝑈, ℝ𝑛×𝑛), then 𝑓 ∈ 𝐶 𝑘(𝑈, ℝ𝑛).
The proof is not that hard and we leave it to the readers.
Lemma 2.8. Suppose 𝑈 ⊂ ℝ𝑛 is open, 𝐹 ∈ 𝐶 𝑘 (𝑈 , ℝ𝑚 ), 𝐹 (𝑈 ) ⊂ 𝑉 where 𝑉 is open in ℝ𝑚 ,
𝐺 ∈ 𝐶 𝑘 (𝑉 , ℝ𝑙 ), then 𝐺◦𝐹 is in 𝐶 𝑘 (𝑈 , ℝ𝑙 ).
The proof is based on the chain rule and the previous lemma, and we leave it to the readers.
Now we prove the theorem. Suppose 𝑓 ∈ 𝐶 𝑘(𝑈, ℝ𝑛) as in the statement of the inverse function theorem; then we know that 𝑓−1 exists and is in 𝐶 1(𝑉, ℝ𝑛) for some open set 𝑉. Because 𝑓−1 is differentiable, we can apply the chain rule: for 𝑦 = 𝑓(𝑥), because 𝑓◦𝑓−1 = Id,

Id = 𝐷(Id)(𝑦) = 𝐷𝑓(𝑥)𝐷𝑓−1(𝑦),

hence

𝐷𝑓−1(𝑦) = (𝐷𝑓(𝑓−1(𝑦)))−1.

Notice that 𝐷𝑓 is in 𝐶 𝑘−1(𝑈, ℝ𝑛×𝑛) and 𝑓−1 is in 𝐶 1(𝑉, ℝ𝑛); the previous lemma shows that 𝑓−1 ∈ 𝐶 2(𝑉, ℝ𝑛). Then using mathematical induction we can show 𝑓−1 ∈ 𝐶 𝑘(𝑉, ℝ𝑛). □
Definition 2.9. We define the space of smooth maps defined on 𝑈 as

𝐶 ∞(𝑈, ℝ𝑛) = ⋂_{𝑘=1}^{∞} 𝐶 𝑘(𝑈, ℝ𝑛).

Definition 2.10. Suppose 𝑈, 𝑉 are open sets in ℝ𝑛. If 𝑓 ∈ 𝐶 ∞(𝑈, 𝑉) is bijective and 𝑓−1 ∈ 𝐶 ∞(𝑉, 𝑈), then 𝑓 is called a smooth diffeomorphism.
The inverse function theorem is a local theorem: from information about the map at a single point (𝐷𝑓(𝑥0) is invertible), we conclude that a property of the map holds in a neighbourhood of that point (there exists a diffeomorphism 𝑓|𝑈 ∶ 𝑈 → 𝑉, where 𝑈 is a neighbourhood of 𝑥0).
However, a local diffeomorphism may not be a global diffeomorphism.

Example 2.11. For 0 < 𝑠 < 𝑡, let

𝐴𝑠,𝑡 = {𝑣 ∈ ℝ2 | 𝑠 < ‖𝑣‖ < 𝑡}

be the annulus. Then

𝑓(𝑥, 𝑦) = ((𝑥² − 𝑦²)∕√(𝑥² + 𝑦²), 2𝑥𝑦∕√(𝑥² + 𝑦²))

is a local diffeomorphism at every point in 𝐴1,2, but it is not a global diffeomorphism, because it is not bijective.
If we write this map using polar coordinates, it is given by

𝑓(𝑟, 𝜃) = (𝑟, 2𝜃).

Notice that the differential of 𝑓 is given by

𝐷𝑓(𝑥, 𝑦) = ⎡ (𝑥³ + 3𝑥𝑦²)∕(𝑥² + 𝑦²)^{3∕2}   −(𝑦³ + 3𝑥²𝑦)∕(𝑥² + 𝑦²)^{3∕2} ⎤
           ⎣ 2𝑦³∕(𝑥² + 𝑦²)^{3∕2}            2𝑥³∕(𝑥² + 𝑦²)^{3∕2}       ⎦

and it is invertible at every point of the annulus. Therefore, even if a map has an invertible differential everywhere, this does not mean the map is invertible globally.
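In fact, one can compute that

det 𝐷𝑓(𝑥, 𝑦) = [2𝑥³(𝑥³ + 3𝑥𝑦²) + 2𝑦³(𝑦³ + 3𝑥²𝑦)]∕(𝑥² + 𝑦²)³ = 2(𝑥⁶ + 3𝑥⁴𝑦² + 3𝑥²𝑦⁴ + 𝑦⁶)∕(𝑥² + 𝑦²)³ = 2

everywhere away from the origin, which confirms the invertibility claim.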

Remark 2.12. Diffeomorphisms can be viewed as “reparametrization” of an open region.

We finish this section by discussing some applications of the inverse function theorem.

Theorem 2.13 (Continuity of roots). Given (𝑐𝑛−1 , 𝑐𝑛−2 , ⋯ , 𝑐1 , 𝑐0 ) ∈ ℝ𝑛 , we can define a polynomial

𝑝(𝑥) = 𝑥𝑛 + 𝑐𝑛−1 𝑥𝑛−1 + 𝑐𝑛−2 𝑥𝑛−2 + ⋯ + 𝑐1 𝑥 + 𝑐0 ,

and suppose 𝑥1 ≤ 𝑥2 ≤ ⋯ ≤ 𝑥𝑛 are the solutions of 𝑝(𝑥) = 0. Then the map

(𝑐𝑛−1 , 𝑐𝑛−2 , ⋯ , 𝑐1 , 𝑐0 ) → (𝑥1 , 𝑥2 , ⋯ , 𝑥𝑛 )

is smooth as long as 𝑥1 < 𝑥2 < ⋯ < 𝑥𝑛 .

Proof. Suppose 𝑥1 < 𝑥2 < ⋯ < 𝑥𝑛 , then we have the map

𝐹 (𝑥) = 𝐹 (𝑥1 , 𝑥2 , ⋯ , 𝑥𝑛 ) = (𝑐𝑛−1 (𝑥), 𝑐𝑛−2 (𝑥), ⋯ , 𝑐1 (𝑥), 𝑐0 (𝑥)),



where

𝑐0(𝑥) = (−1)ⁿ 𝑥1𝑥2 ⋯ 𝑥𝑛,
𝑐1(𝑥) = (−1)^{𝑛−1} Σ_{1≤𝑖1<𝑖2<⋯<𝑖𝑛−1≤𝑛} 𝑥𝑖1𝑥𝑖2 ⋯ 𝑥𝑖𝑛−1,
⋯
𝑐𝑛−2(𝑥) = Σ_{1≤𝑖1<𝑖2≤𝑛} 𝑥𝑖1𝑥𝑖2,
𝑐𝑛−1(𝑥) = −Σ_{𝑖=1}^{𝑛} 𝑥𝑖.

They are called Newton polynomials (up to sign, the elementary symmetric polynomials in 𝑥1, ⋯, 𝑥𝑛). We can obtain these expressions by expanding the product

𝑝(𝑥) = (𝑥 − 𝑥1)(𝑥 − 𝑥2) ⋯ (𝑥 − 𝑥𝑛).
Let us compute the differential of 𝐹 . To do so, we need a trick. Let
𝑃 (𝑠, 𝑥) = (𝑠 − 𝑥1 )(𝑠 − 𝑥2 ) ⋯ (𝑠 − 𝑥𝑛 ) = 𝑠𝑛 + 𝑐𝑛−1 (𝑥)𝑠𝑛−1 + ⋯ + 𝑐1 (𝑥)𝑠 + 𝑐0 (𝑥).
Then
𝐷𝑖 𝑃 (𝑠, 𝑥) = − (𝑠 − 𝑥1 )(𝑠 − 𝑥2 ) ⋯ (𝑠 − 𝑥𝑖−1 )(𝑠 − 𝑥𝑖+1 ) ⋯ (𝑠 − 𝑥𝑛 )
=𝐷𝑖 𝑐𝑛−1 (𝑥)𝑠𝑛−1 + ⋯ + 𝐷𝑖 𝑐1 (𝑥)𝑠 + 𝐷𝑖 𝑐0 (𝑥).
In particular, if we plug in 𝑠 = 𝑥𝑗 , we see that for 𝑗 ≠ 𝑖,
𝐷𝑖 𝑐𝑛−1 (𝑥) 𝑥𝑗^{𝑛−1} + ⋯ + 𝐷𝑖 𝑐1 (𝑥) 𝑥𝑗 + 𝐷𝑖 𝑐0 (𝑥) = 0,
and
𝐷𝑖 𝑐𝑛−1 (𝑥) 𝑥𝑖^{𝑛−1} + ⋯ + 𝐷𝑖 𝑐1 (𝑥) 𝑥𝑖 + 𝐷𝑖 𝑐0 (𝑥) = −∏_{𝑗≠𝑖} (𝑥𝑖 − 𝑥𝑗 ) ≠ 0,
which is nonzero because the roots are distinct. Let us use 𝑎𝑖 to denote this nonzero quantity 𝐷𝑖 𝑐𝑛−1 (𝑥) 𝑥𝑖^{𝑛−1} + ⋯ + 𝐷𝑖 𝑐1 (𝑥) 𝑥𝑖 + 𝐷𝑖 𝑐0 (𝑥).
This implies that
⎡ 𝑥1^{𝑛−1} 𝑥1^{𝑛−2} ⋯ 𝑥1 1 ⎤        ⎡ 𝑎1 0 0 ⋯ 0 ⎤
⎢ 𝑥2^{𝑛−1} 𝑥2^{𝑛−2} ⋯ 𝑥2 1 ⎥ 𝐷𝐹 = ⎢ 0 𝑎2 0 ⋯ 0 ⎥
⎢ ⋯       ⋯        ⋯ ⋯ ⋯ ⎥        ⎢ ⋯ ⋯ ⋯ ⋯ ⋯ ⎥
⎣ 𝑥𝑛^{𝑛−1} 𝑥𝑛^{𝑛−2} ⋯ 𝑥𝑛 1 ⎦        ⎣ 0 0 0 ⋯ 𝑎𝑛 ⎦
Because the Vandermonde matrix on the left is invertible (the roots are distinct) and the diagonal matrix on the right is invertible (each 𝑎𝑖 ≠ 0), 𝐷𝐹 is also invertible. Then by the inverse function theorem, in a neighbourhood of (𝑐𝑛−1 , ⋯ , 𝑐0 ) the map
(𝑐𝑛−1 , 𝑐𝑛−2 , ⋯ , 𝑐1 , 𝑐0 ) → (𝑥1 , 𝑥2 , ⋯ , 𝑥𝑛 )
is smooth as long as 𝑥1 < 𝑥2 < ⋯ < 𝑥𝑛 .
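Theorem 2.13 is easy to observe numerically. The following sketch (using numpy.roots; the cubic and the perturbation are our own illustrative choices) perturbs the constant coefficient of a polynomial with distinct real roots and watches the roots move continuously:

    import numpy as np

    # p(x) = (x - 1)(x - 2)(x - 3) = x^3 - 6x^2 + 11x - 6, distinct roots 1, 2, 3.
    coeffs = np.array([1.0, -6.0, 11.0, -6.0])   # leading coefficient first

    for eps in [1e-1, 1e-2, 1e-3]:
        perturbed = coeffs + np.array([0.0, 0.0, 0.0, eps])   # perturb c0
        roots = np.sort(np.roots(perturbed).real)
        print(eps, roots)   # the roots move away from (1, 2, 3) by O(eps)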
Corollary 2.14. The eigenvalues of a matrix depend smoothly on the matrix, as long as the eigenvalues are distinct.
2.3. 𝑘-surfaces and submanifolds in ℝ𝑁 . The inverse function theorem can be interpreted as if
the differential of a 𝐶 1 -map is invertible, then the map is a local diffeomorphism. In particular, this
requires the domain and the target of the map to have the same dimensions.
A natural question is: what if the dimensions of the domain and the target are different? Previously we saw that a map 𝛾 from an interval 𝐼 ⊂ ℝ to 𝑈 ⊂ ℝ𝑛 is a curve if we require 𝛾 ′ ≠ 0; the reason for requiring 𝛾 ′ ≠ 0 is that we do not want to see “degenerate” curves like a constant map, whose “curve” is actually a single point.
We adapt this idea to higher dimensions.
Definition 2.15. Suppose 𝑈 ⊂ ℝ𝑘 is an open set. A map 𝐹 ∶ 𝑈 → ℝ𝑁 is called a 𝐶 1 𝑘-surface if
𝐹 ∈ 𝐶 1 (𝑈 , ℝ𝑁 ) and 𝐷𝐹 (𝑥) has rank 𝑘 for all 𝑥 ∈ 𝑈 .
Of course, we can change the regularity in the definition to get surfaces of other regularity, e.g. 𝐶 𝑚 𝑘-surfaces, smooth 𝑘-surfaces, etc.
Let us recall the definition of rank. The rank of a linear map 𝐴 ∶ ℝ𝑛 → ℝ𝑚 is the dimension of
the image of 𝐴. Equivalently, the rank of a matrix 𝐴 is the maximum number of linearly independent
column vectors.
We require 𝐷𝐹 (𝑥) to have rank 𝑘 to assure the image also “has dimension 𝑘”. The meaning of
this sentence is not clear at this moment, and we will try to make it more precise in the following
several sections.
Example 2.16 (Hemisphere). Let 𝐵1²(0) be the unit disk in ℝ2 . Consider a map 𝐹 ∶ 𝐵1²(0) → ℝ3
𝐹 (𝑥, 𝑦) = (𝑥, 𝑦, √(1 − (𝑥² + 𝑦²))).
Then the image of 𝐹 is the upper hemisphere. We can check that 𝐷𝐹 has rank 2 everywhere in 𝐵1²(0).
Example 2.17 (Graph of map). Suppose 𝑈 is open in ℝ𝑘 , 𝑓 ∶ 𝑈 → ℝ𝑚 is a 𝐶 1 -map. Then we can
define a map 𝐹 ∶ 𝑈 → ℝ𝑚+𝑘 by
𝐹 (𝑥) = (𝑥, 𝑓 (𝑥)).
Then 𝐹 is a 𝐶 1 𝑘-surface.
To see this, we can compute the differential of 𝐹 :
𝐷𝐹 (𝑥) = ⎡ Id ⎤
         ⎣ 𝑑𝑓 (𝑥) ⎦
where Id is the 𝑘 by 𝑘 identity matrix and 𝑑𝑓 (𝑥) = 𝐷𝑓 (𝑥) is an 𝑚 by 𝑘 matrix. This (𝑘 + 𝑚) by 𝑘 matrix has rank 𝑘.
Now we would like to upgrade the notion of 𝑘-surfaces. Each 𝑘-surface can be viewed as a parametrization of a surface by an open set in ℝ𝑘 . We will allow multiple parametrizations.
Definition 2.18. A 𝑘-dimensional 𝐶 1 -submanifold 𝑋 ⊂ ℝ𝑁 is a set 𝑋 such that for each 𝑥 ∈ 𝑋 there exist an open set 𝑈 ⊂ ℝ𝑘 , a 𝐶 1 𝑘-surface 𝜑 ∶ 𝑈 → ℝ𝑁 , and an open set 𝑉 ⊂ ℝ𝑁 containing 𝑥, such that 𝑋 ∩ 𝑉 is the image of 𝜑.
Such a pair (𝑈 , 𝜑) is called a parametrization of 𝑋. The collection of (𝑈 , 𝜑) is called an atlas
of 𝑋.
Example 2.19 (Unit sphere). The unit sphere 𝑆 𝑛 is an 𝑛-dimensional submanifold in ℝ𝑛+1 , defined as the set
𝑆 𝑛 ∶= {(𝑥1 , 𝑥2 , ⋯ , 𝑥𝑛+1 ) ∶ 𝑥1² + 𝑥2² + ⋯ + 𝑥𝑛+1² = 1}.
We can find the following atlas of 𝑆 𝑛 : suppose 𝐵1 (0) ⊂ ℝ𝑛 is the unit ball. For any 1 ≤ 𝑖 ≤ 𝑛 + 1, we can use the graph construction to define the parametrizations 𝜑𝑖± ∶ 𝐵1 (0) → ℝ𝑛+1 by
𝜑𝑖+ (𝑦1 , ⋯ , 𝑦𝑛 ) = (𝑦1 , ⋯ , 𝑦𝑖−1 , √(1 − (𝑦1² + ⋯ + 𝑦𝑛²)), 𝑦𝑖 , ⋯ , 𝑦𝑛 ),
𝜑𝑖− (𝑦1 , ⋯ , 𝑦𝑛 ) = (𝑦1 , ⋯ , 𝑦𝑖−1 , −√(1 − (𝑦1² + ⋯ + 𝑦𝑛²)), 𝑦𝑖 , ⋯ , 𝑦𝑛 ),
where the square root sits in the 𝑖-th coordinate. Then the collection {(𝐵1 (0), 𝜑𝑖± )}_{𝑖=1}^{𝑛+1} is an atlas of 𝑆 𝑛 .
Remark 2.20. “Sub” in the name submanifold means that they are subsets of some larger space. We also have the notion of “manifolds”, which are geometric objects that do not sit inside some larger space. Manifolds are the model of our universe: they are the mathematical language of general relativity.
2.4. Tangent space and 𝐶 1 -maps. Let us discuss two important concepts related to submanifolds.
We will see how to use the parametrizations to characterize local properties.
Definition 2.21. Suppose 𝑀 ⊂ ℝ𝑁 is a 𝑘-dimensional submanifold, 𝑥 ∈ 𝑀 is a point. Suppose
𝜑 ∶ 𝑈 → ℝ𝑁 is a parametrization of 𝑀 and 𝜑(𝑝) = 𝑥. Then the tangent space of 𝑀 at 𝑥 is the
set
{𝑥 + 𝐷𝜑(𝑝)𝑣|𝑣 ∈ ℝ𝑘 } =∶ 𝑇𝑥 𝑀.
Notice that this definition seems to rely on the choice of the parametrization (𝑈 , 𝜑). Nevertheless,
this definition is actually independent of the choice of the parametrization. We may see this later.
We can view the tangent spaces as vector spaces if we shift 𝑥 to the origin. Because the rank of
𝐷𝜑(𝑝) is 𝑘, it is a 𝑘-dimensional space.
In the case that we can visualize: 1-dimensional submanifolds in ℝ2 or ℝ3 (curves), 2-dimensional
submanifolds in ℝ3 (surfaces), the tangent space is just the line (in ℝ2 ) or the plane (in ℝ3 ) that
“touch” the submanifold at a point.
Example 2.22. Consider 𝑆 2 ⊂ ℝ3 . The tangent space at the north pole (0, 0, 1) can be computed explicitly. Consider the upper hemisphere parametrization
𝜑 ∶ 𝐵1²(0) → ℝ3 , 𝜑(𝑥, 𝑦) = (𝑥, 𝑦, √(1 − 𝑥² − 𝑦²)).
Then 𝜑(0, 0) = (0, 0, 1) and
𝐷𝜑(𝑥, 𝑦) = ⎡ 1 0 ⎤
           ⎢ 0 1 ⎥
           ⎣ −𝑥∕√(1 − 𝑥² − 𝑦²)   −𝑦∕√(1 − 𝑥² − 𝑦²) ⎦
Then the image of (1, 0), (0, 1) ∈ ℝ2 under 𝐷𝜑(0, 0) is (1, 0, 0) and (0, 1, 0). So the tangent space
is the plane
𝑇(0,0,1) 𝑆 2 = {(𝑠, 𝑡, 0)|𝑠 ∈ ℝ, 𝑡 ∈ ℝ}.
Similarly, if 𝑝 = (𝑎, 𝑏, 𝑐) is any point in the upper hemisphere, we can compute that
𝑇𝑝 𝑆 2 = {(𝑎, 𝑏, 𝑐) + 𝑠(𝑐, 0, −𝑎) + 𝑡(0, 𝑐, −𝑏)|𝑠 ∈ ℝ, 𝑡 ∈ ℝ}.
Definition 2.23. Suppose 𝑀 ⊂ ℝ𝑁 is a 𝑘-dimensional submanifold, 𝑥 ∈ 𝑀 is a point. Then the
normal space, denoted by 𝑁𝑥 𝑀, is the subspace that is orthogonal complement to 𝑇𝑥 𝑀.
Recall that we say a subspace 𝑉 ⊂ ℝ𝑁 is the orthogonal complement of 𝑊 ⊂ ℝ𝑁 if ⟨𝑣, 𝑤⟩ = 0 for any 𝑣 ∈ 𝑉 and 𝑤 ∈ 𝑊 , and 𝑉 and 𝑊 span the whole ℝ𝑁 . Then by linear algebra, dim 𝑁𝑥 𝑀 = 𝑁 − 𝑘.
Example 2.24. In the sphere example, for a point 𝑝 = (𝑎, 𝑏, 𝑐) in the upper hemisphere we have
𝑇𝑝 𝑆 2 = {(𝑎, 𝑏, 𝑐) + 𝑠(𝑐, 0, −𝑎) + 𝑡(0, 𝑐, −𝑏)|𝑠 ∈ ℝ, 𝑡 ∈ ℝ},
so we can compute that
𝑁𝑝 𝑆 2 = {𝑡(𝑎, 𝑏, 𝑐)|𝑡 ∈ ℝ}.
If a submanifold 𝑀 has dimension 𝑁 − 1 in ℝ𝑁 , then for any 𝑥 ∈ 𝑀, 𝑁𝑥 𝑀 has dimension 1.
So there is only one (up to ± sign) vector 𝐧 in this space with ‖𝐧‖ = 1. Such a vector is called a
unit normal vector.
It is natural to define maps, or even functions, from a submanifold to another submanifold. How-
ever, it is a little bit subtle to define the regularity of the maps, because the submanifolds are just
subsets of ℝ𝑁 , and a map defined on a submanifold may not be differentiable in ℝ𝑁 . Instead, we
use the parametrizations to define the regularity of a map.
Definition 2.25. Suppose 𝑀 ⊂ ℝ𝑁 is a submanifold. Then a map 𝐹 ∶ 𝑀 → ℝ𝑚 is 𝐶 1 (denoted by
𝐹 ∈ 𝐶 1 (𝑀, ℝ𝑚 )) if for any parametrization 𝜑 ∶ 𝑈 → 𝑀, the composition 𝐹 ◦𝜑 ∶ 𝑈 → ℝ𝑚 is 𝐶 1 .
Here we use the idea that the regularity (or differentiability) is a local property. Similarly, we
define a map to be 𝐶 𝑘 , smooth, etc.
Example 2.26. Consider a function 𝑓 ∶ 𝑆 2 → ℝ defined by 𝑓 (𝑥, 𝑦, 𝑧) = 𝑧. We can check that this
function is a smooth function using the definition. For example, on the front hemisphere, if we use
the parametrization 𝜑 ∶ 𝐵1²(0) → ℝ3 defined by
𝜑(𝑥, 𝑦) = (𝑥, √(1 − 𝑥² − 𝑦²), 𝑦),
then 𝑓 ◦𝜑(𝑥, 𝑦) = 𝑦, which is smooth. This function 𝑓 gives the latitude of 𝑆 2 .
Proposition 2.27. Suppose 𝑀 ⊂ ℝ𝑁 is a submanifold, Ω ⊂ ℝ𝑁 is open, and 𝑀 ⊂ Ω. If 𝐹 ∶ Ω →
ℝ𝑚 is 𝐶 1 , then 𝐹 |𝑀 is 𝐶 1 . Similar statements hold for 𝐶 𝑘 and 𝐶 ∞ if the submanifold is 𝐶 𝑘 or 𝐶 ∞
respectively.
Proof. Suppose 𝜑 ∶ 𝑈 → ℝ𝑁 is a parametrization, then 𝐹 ◦𝜑 is 𝐶 1 by the chain rule. 
This proposition gives another proof that the map 𝑓 ∶ 𝑆 2 → ℝ sending (𝑥, 𝑦, 𝑧) ↦ 𝑧 is smooth.
2.5. Implicit function theorem. The concept of submanifolds suggests that we should study maps
and their differentials between domains in different dimensional spaces. We have an analog of the
inverse function theorem in this setting.
Before that, let us first introduce some notation. Suppose Ω ⊂ ℝ𝑛+𝑝 is open. We write ℝ𝑛+𝑝 =
ℝ𝑛 × ℝ𝑝 , and we will use (𝑥, 𝑦) to denote the points in the Cartesian product ℝ𝑛 × ℝ𝑝 . Given a
function 𝐹 ∶ Ω → ℝ𝑚 , we will use 𝐷𝑥 𝐹 to denote the 𝑚 by 𝑛 matrix

⎡ 𝐷𝑥1 𝐹1 𝐷𝑥2 𝐹1 ⋯ 𝐷𝑥𝑛 𝐹1 ⎤
⎢ 𝐷𝑥1 𝐹2 𝐷𝑥2 𝐹2 ⋯ 𝐷𝑥𝑛 𝐹2 ⎥
⎢ ⋯      ⋯      ⋯ ⋯      ⎥
⎣ 𝐷𝑥1 𝐹𝑚 𝐷𝑥2 𝐹𝑚 ⋯ 𝐷𝑥𝑛 𝐹𝑚 ⎦
and use 𝐷𝑦 𝐹 to denote the 𝑚 by 𝑝 matrix
⎡ 𝐷𝑦1 𝐹1 𝐷𝑦2 𝐹1 ⋯ 𝐷𝑦𝑝 𝐹1 ⎤
⎢ 𝐷𝑦1 𝐹2 𝐷𝑦2 𝐹2 ⋯ 𝐷𝑦𝑝 𝐹2 ⎥
⎢ ⋯      ⋯      ⋯ ⋯      ⎥
⎣ 𝐷𝑦1 𝐹𝑚 𝐷𝑦2 𝐹𝑚 ⋯ 𝐷𝑦𝑝 𝐹𝑚 ⎦

Theorem 2.28 (Implicit function theorem). Suppose Ω ⊂ ℝ𝑛+𝑝 is open, 𝐹 ∈ 𝐶 1 (Ω, ℝ𝑝 ). Suppose
(𝑥0 , 𝑦0 ) ∈ Ω such that 𝐹 (𝑥0 , 𝑦0 ) = 0, and 𝐷𝑦 𝐹 (𝑥0 , 𝑦0 ) is invertible, then there exists a neighbour-
hood 𝑈 ⊂ ℝ𝑛 of 𝑥0 and a neighbourhood 𝑉 ⊂ ℝ𝑝 of 𝑦0 and a map 𝜙 ∈ 𝐶 1 (𝑈 , 𝑉 ), such that
∙ (𝑥, 𝑦) ∈ 𝑈 × 𝑉 and 𝐹 (𝑥, 𝑦) = 0
if and only if
∙ 𝑥 ∈ 𝑈 and 𝑦 = 𝜙(𝑥).
Moreover,
𝐷𝜙(𝑥) = −(𝐷𝑦 𝐹 (𝑥, 𝜙(𝑥)))−1 ◦𝐷𝑥 𝐹 (𝑥, 𝜙(𝑥)).
It is quite difficult (to be honest, impossible) to understand the Implicit function theorem without
geometry. Let me state an equivalent version, which illustrates the geometric meaning behind it.
Theorem 2.29 (Implicit function theorem, submanifold version). Suppose Ω ⊂ ℝ𝑛+𝑝 is open, 𝐹 ∈
𝐶 1 (Ω, ℝ𝑝 ). Suppose (𝑥0 , 𝑦0 ) ∈ Ω such that 𝐹 (𝑥0 , 𝑦0 ) = 0, and 𝐷𝑦 𝐹 (𝑥0 , 𝑦0 ) is invertible, then there
exists a neighbourhood 𝑈 ⊂ ℝ𝑛 of 𝑥0 and a neighbourhood 𝑉 ⊂ ℝ𝑝 of 𝑦0 , such that {𝐹 (𝑥, 𝑦) = 0} ∩ (𝑈 × 𝑉 ) is the image of an 𝑛-surface.
In fact, the parametrization is given by Φ ∶ 𝑈 → ℝ𝑛+𝑝 defined by
Φ(𝑥) = (𝑥, 𝜙(𝑥)).
Example 2.30 (Local model). Let us consider the map 𝐹 (𝑥, 𝑦) = 𝑦. This map clearly has 𝐹 (0, 0) =
0 and 𝐷𝑦 𝐹 (0, 0) = Id. We can see that 𝐹 (𝑥, 𝑦) = 0 if and only if 𝑦 = 0, and the function 𝜙 in the
implicit function theorem is just the constant map 𝜙(𝑥) = 0.
Example 2.31 (Circle). Let us consider 𝐹 ∶ ℝ2 → ℝ defined by 𝐹 (𝑥, 𝑦) = 𝑥² + 𝑦² − 1. We know that {(𝑥, 𝑦)|𝐹 (𝑥, 𝑦) = 0} is the unit circle. Here 𝐷𝑦 𝐹 (𝑥, 𝑦) = 2𝑦. Whenever 𝑦 ≠ 0, {𝐹 (𝑥, 𝑦) = 0} can be locally parametrized by 𝑦 = √(1 − 𝑥²) or 𝑦 = −√(1 − 𝑥²). Then the map 𝜙 locally is given by √(1 − 𝑥²) or −√(1 − 𝑥²), depending on whether 𝑦 > 0 or 𝑦 < 0.
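The conclusion of the implicit function theorem can be checked by hand in this example. Here is a small Python sketch comparing the formula 𝐷𝜙(𝑥) = −(𝐷𝑦 𝐹 )−1 𝐷𝑥 𝐹 = −𝑥∕√(1 − 𝑥²) with a finite difference (the sample point 𝑥0 = 0.6 is an arbitrary choice of ours):

    import numpy as np

    def phi(x):                      # upper branch of the circle, y = phi(x)
        return np.sqrt(1.0 - x**2)

    x0, h = 0.6, 1e-6
    fd = (phi(x0 + h) - phi(x0 - h)) / (2 * h)   # finite-difference derivative
    formula = -(2 * x0) / (2 * phi(x0))          # -(D_y F)^{-1} D_x F
    print(np.isclose(fd, formula))               # True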
Proof of Implicit function theorem. We use the Inverse function theorem. In order to apply the
inverse function theorem, we try to extend the map 𝐹 to a map between the open sets in the spaces
with the same dimensions. We define a map 𝐺 ∶ Ω → ℝ𝑛+𝑝 by
𝐺(𝑥, 𝑦) = (𝑥, 𝐹 (𝑥, 𝑦)).
Then we can compute
𝐷𝐺(𝑥, 𝑦) = ⎡ Id             0            ⎤
           ⎣ 𝐷𝑥 𝐹 (𝑥, 𝑦)   𝐷𝑦 𝐹 (𝑥, 𝑦) ⎦ .
Then 𝐷𝐺(𝑥0 , 𝑦0 ) is invertible. The inverse function theorem implies that there exists a neighbourhood
𝑈̃ of (𝑥0 , 𝑦0 ) and a neighbourhood 𝑉̃ of 0 ∈ ℝ𝑛+𝑝 such that 𝐺|𝑈̃ ∶ 𝑈̃ → 𝑉̃ is a diffeomorphism.
Let us slightly shrink 𝑈̃ to a product 𝑈 × 𝑉 .
Suppose 𝐻 ∶ 𝐺(𝑈 × 𝑉 ) → 𝑈 × 𝑉 is the inverse of 𝐺, and let us use (𝐻𝑥 , 𝐻𝑦 ) to denote the components of 𝐻 in ℝ𝑛 and ℝ𝑝 . Define 𝜙 ∶ 𝑈 → 𝑉 by
𝜙(𝑥) = 𝐻𝑦 (𝑥, 0).
We show that 𝜙 is the desired map. By definition, 𝐺◦𝐻 = Id, so 𝐺(𝐻𝑥 (𝑥, 0), 𝐻𝑦 (𝑥, 0)) = (𝑥, 0).
This implies that 𝐹 (𝑥, 𝜙(𝑥)) = 0. On the other hand, if 𝐹 (𝑥, 𝑦) = 0, then 𝐺(𝑥, 𝑦) = (𝑥, 0), then
𝐻◦𝐺 = Id implies that
(𝑥, 𝑦) = 𝐻◦𝐺(𝑥, 𝑦) = 𝐻(𝑥, 0) = (𝐻𝑥 (𝑥, 0), 𝐻𝑦 (𝑥, 0)) = (𝐻𝑥 (𝑥, 0), 𝜙(𝑥)),
hence 𝜙(𝑥) = 𝑦.
Finally, by the inverse function theorem,
𝐷𝐻(𝑥, 0) = (𝐷𝐺(𝑥, 𝜙(𝑥)))−1 ,
Then we have
⎡ 𝐷𝑥 𝐻𝑥 (𝑥, 0)   𝐷𝑦 𝐻𝑥 (𝑥, 0) ⎤   ⎡ Id                  0                ⎤ −1
⎣ 𝐷𝑥 𝐻𝑦 (𝑥, 0)   𝐷𝑦 𝐻𝑦 (𝑥, 0) ⎦ = ⎣ 𝐷𝑥 𝐹 (𝑥, 𝜙(𝑥))   𝐷𝑦 𝐹 (𝑥, 𝜙(𝑥)) ⎦
= ⎡ Id                                              0                      ⎤
  ⎣ −(𝐷𝑦 𝐹 (𝑥, 𝜙(𝑥)))−1 𝐷𝑥 𝐹 (𝑥, 𝜙(𝑥))   (𝐷𝑦 𝐹 (𝑥, 𝜙(𝑥)))−1 ⎦ .
Taking the block 𝐷𝑥 𝐻𝑦 (𝑥, 0) = 𝐷𝜙(𝑥) of the differential, we get the expression of 𝐷𝜙.
Now we take a look at two very important applications of the implicit function theorem. The
first one is the so-called preimage theorem. It claims that given a 𝐶 1 -map 𝐹 from a higher dimen-
sional space to a lower dimensional space, if 𝐷𝐹 is surjective, then locally the pre-image set is a
submanifold.
Theorem 2.32 (Preimage theorem). Suppose Ω ⊂ ℝ𝑁 is open and 𝐹 ∈ 𝐶 1 (Ω, ℝ𝑚 ), 𝑁 > 𝑚. Suppose 𝑦 is in the image of 𝐹 , and for any 𝑥 ∈ 𝐹 −1 (𝑦), 𝐷𝐹 (𝑥) is surjective. Then 𝐹 −1 (𝑦) is an (𝑁 − 𝑚)-dimensional 𝐶 1 submanifold.
Definition 2.33. Suppose Ω ⊂ ℝ𝑁 is open and 𝐹 ∈ 𝐶 1 (Ω, ℝ𝑚 ). If 𝐷𝐹 (𝑥) is surjective, then 𝐹 is a submersion at 𝑥.
Proof of preimage theorem. Suppose 𝑥 ∈ Ω such that 𝐹 (𝑥) = 𝑦 and 𝐷𝐹 (𝑥) is a surjection. It
suffices to find a local parametrization of 𝐹 −1 (𝑦) at 𝑥. By shifting 𝑦, we may assume 𝑦 = 0.
Write 𝐹 = (𝐹1 , ⋯ , 𝐹𝑚 ) in coordinates. Because 𝐷𝐹 (𝑥) = [𝐷𝑗 𝐹𝑖 ]1≤𝑖≤𝑚,1≤𝑗≤𝑁 is surjective, the
rank of 𝐷𝐹 (𝑥) is 𝑚, and we can find 𝑚 column vectors {(𝐷𝑗𝑙 𝐹1 , 𝐷𝑗𝑙 𝐹2 , ⋯ , 𝐷𝑗𝑙 𝐹𝑚 )}1≤𝑙≤𝑚 that are
linearly independent. Without loss of generality, let us assume (𝑗1 , 𝑗2 , ⋯ , 𝑗𝑚 ) = (𝑁 − 𝑚 + 1, 𝑁 −
𝑚 + 2, ⋯ , 𝑁). Then the last 𝑚 by 𝑚 part of 𝐷𝐹 (𝑥) is invertible, and we can apply the implicit
function theorem to show that locally 𝐹 −1 (0) is a submanifold. 
A particularly important application of this theorem is to understand the level sets of a function.
Definition 2.34. Suppose Ω ⊂ ℝ𝑁 is open, 𝑓 ∈ 𝐶 1 (Ω, ℝ). Then for any 𝑦 in the image of 𝑓 , 𝑓 −1 (𝑦)
is called a level set.
Corollary 2.35. Suppose Ω ⊂ ℝ𝑁 is open, 𝑓 ∈ 𝐶 1 (Ω, ℝ). If 𝐷𝑓 (𝑥) ≠ 0 for all 𝑥 such that
𝑓 (𝑥) = 𝑦, then the level set 𝑓 −1 (𝑦) is a 𝐶 1 (𝑁 − 1)-dimensional submanifold.
Remark 2.36. All the theorems we have proved so far have high regularity versions if the map 𝐹
itself has higher regularity. We omit the precise statement here.
Example 2.37. 𝑆 2 ⊂ ℝ3 is the level set 𝑓 −1 (1) of the smooth function 𝑓 (𝑥, 𝑦, 𝑧) = 𝑥² + 𝑦² + 𝑧², and 𝐷𝑓 (𝑥, 𝑦, 𝑧) = (2𝑥, 2𝑦, 2𝑧), which is nonzero if (𝑥, 𝑦, 𝑧) ≠ (0, 0, 0). So 𝑆 2 is a smooth submanifold by the preimage theorem.
Example 2.38. The level set {𝑥² + 𝑦² − 𝑧² = 𝑎} ⊂ ℝ3 is a smooth submanifold whenever 𝑎 ≠ 0. It is a hyperboloid. Let 𝐹 (𝑥, 𝑦, 𝑧) = 𝑥² + 𝑦² − 𝑧² − 𝑎; we see that 𝐷𝐹 (𝑥, 𝑦, 𝑧) = (2𝑥, 2𝑦, −2𝑧), and when 𝑎 ≠ 0, the preimage theorem shows that {𝑥² + 𝑦² − 𝑧² = 𝑎} is a smooth submanifold. When 𝑎 = 0, 𝐷𝐹 (𝑥, 𝑦, 𝑧) = 0 if and only if (𝑥, 𝑦, 𝑧) = (0, 0, 0). So near any point (𝑥, 𝑦, 𝑧) ≠ (0, 0, 0), we can still conclude that {𝑥² + 𝑦² − 𝑧² = 0} is a smooth submanifold. It is actually a cone, and (𝑥, 𝑦, 𝑧) = (0, 0, 0) is not a smooth point: it is the cone vertex.
A further application of the implicit function theorem is to find the tangent space of a subman-
ifold. Suppose 𝐹 ∈ 𝐶 1 (Ω, ℝ𝑝 ) with 𝐹 (𝑥0 , 𝑦0 ) = 0, and 𝐷𝑦 𝐹 (𝑥0 , 𝑦0 ) is invertible. By the implicit
function theorem, locally there exists 𝑈 ⊂ ℝ𝑛 and 𝑉 ⊂ ℝ𝑝 and a function 𝜙 ∶ 𝑈 → 𝑉 such that
𝐹 (𝑥, 𝑦) = 0 iff 𝑦 = 𝜙(𝑥). Let Φ ∶ 𝑈 → ℝ𝑛+𝑝 defined by Φ(𝑥) = (𝑥, 𝜙(𝑥)). Then the tangent space
of the submanifold 𝑀 ∶= {(𝑥, 𝑦)|𝐹 (𝑥, 𝑦) = 0} at (𝑥0 , 𝑦0 ) is given by
𝑇(𝑥0 ,𝑦0 ) 𝑀 = {(𝑥0 , 𝑦0 ) + 𝐷Φ(𝑥0 )𝑣 ∶ 𝑣 ∈ ℝ𝑛 }.
From the implicit function theorem

𝐷𝜙(𝑥0 ) = −(𝐷𝑦 𝐹 (𝑥0 , 𝜙(𝑥0 )))−1 ◦𝐷𝑥 𝐹 (𝑥0 , 𝜙(𝑥0 )),


hence
𝑇(𝑥0 ,𝑦0 ) 𝑀 = {(𝑥0 , 𝑦0 ) + (𝑣, −(𝐷𝑦 𝐹 (𝑥0 , 𝜙(𝑥0 )))−1 ◦𝐷𝑥 𝐹 (𝑥0 , 𝜙(𝑥0 ))𝑣) ∶ 𝑣 ∈ ℝ𝑛 }.
Note that this expression only relies on 𝐹 . Now let 𝑤 + (𝑥0 , 𝑦0 ) ∈ 𝑇(𝑥0 ,𝑦0 ) 𝑀; then there exists 𝑣 ∈ ℝ𝑛 such that
𝑤 = (𝑣, −(𝐷𝑦 𝐹 (𝑥0 , 𝜙(𝑥0 )))−1 ◦𝐷𝑥 𝐹 (𝑥0 , 𝜙(𝑥0 ))𝑣).
Thus
𝐷𝐹 (𝑥0 , 𝑦0 )𝑤 = 𝐷𝑥 𝐹 (𝑥0 , 𝜙(𝑥0 ))𝑣 − 𝐷𝑦 𝐹 (𝑥0 , 𝜙(𝑥0 ))(𝐷𝑦 𝐹 (𝑥0 , 𝜙(𝑥0 )))−1 𝐷𝑥 𝐹 (𝑥0 , 𝜙(𝑥0 ))𝑣 = 0.
This implies that 𝑤 ∈ ker(𝐷𝐹 (𝑥0 , 𝑦0 )). On the other hand, from linear algebra, dim ker(𝐷𝐹 (𝑥0 , 𝑦0 )) = (𝑛 + 𝑝) − rank(𝐷𝐹 (𝑥0 , 𝑦0 )) = 𝑛, so 𝑇(𝑥0 ,𝑦0 ) 𝑀 = (𝑥0 , 𝑦0 ) + ker(𝐷𝐹 (𝑥0 , 𝑦0 )).
A particularly interesting case is 𝑝 = 1. In this case, 𝐷𝐹 (𝑥0 ) is a vector, and ker(𝐷𝐹 (𝑥0 )) is the
space that is orthogonal to the vector 𝐷𝐹 (𝑥0 ). Thus, this shows that 𝐷𝐹 (𝑥0 ) is actually a normal
vector.
Theorem 2.39. Suppose Ω ⊂ ℝ𝑁 is open, 𝑓 ∈ 𝐶 1 (Ω, ℝ) is a function, 𝑓 (𝑥0 ) = 0. If 𝐷𝑓 (𝑥0 ) ≠ 0,
then the tangent space of 𝑓 −1 (0) at 𝑥0 is
{𝑥0 + 𝑤|𝑤 ∈ ℝ𝑁 , ⟨𝑤, 𝐷𝑓 (𝑥0 )⟩ = 0}.
Example 2.40. 𝑆 2 ⊂ ℝ3 is the level set of 𝑓 (𝑥, 𝑦, 𝑧) = 𝑥2 + 𝑦2 + 𝑧2 − 1. The tangent space at
(𝑥, 𝑦, 𝑧) ∈ 𝑆 2 is given by
{(𝑥, 𝑦, 𝑧) + 𝑤|⟨𝑤, (2𝑥, 2𝑦, 2𝑧)⟩ = 0}.
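In other words, the velocity of any curve lying in the level set is orthogonal to 𝐷𝑓 . A quick NumPy sketch for the sphere (the point and the great circle are our own choices):

    import numpy as np

    p = np.array([0.6, 0.0, 0.8])             # a point on S^2
    w = np.array([-0.8, 0.0, 0.6])            # a unit vector orthogonal to p

    def gamma(t):
        return np.cos(t) * p + np.sin(t) * w  # a great circle on S^2 through p

    h = 1e-6
    velocity = (gamma(h) - gamma(-h)) / (2 * h)   # tangent vector at p
    grad = 2 * p                                  # Df(p) = (2x, 2y, 2z)
    print(np.isclose(grad @ velocity, 0.0))       # True: the gradient is normal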
2.6. Constrained optimization and Lagrange multipliers. One important purpose of studying
submanifolds is to understand the constrained optimization problem. Roughly speaking, the ques-
tion is the following: Suppose Ω is an open set in ℝ𝑁 and 𝐹 ∈ 𝐶 1 (Ω, ℝ𝑚 ), 𝑓 ∈ 𝐶 1 (Ω, ℝ). We want
to find the minimum/maximum point of 𝑓 (𝑥) subject to the constraint 𝐹 (𝑥) = 0. The problem is
quite natural because in many situations we only care about the minimal point of a function with
constraint. For example, if we want to know what is the most economical way to distribute 100 tons
of fruit to 3 cold storage warehouses 𝐴, 𝐵, 𝐶, suppose 𝜌(𝑥, 𝑦, 𝑧) is the cost to send 𝑥 tons of fruit to 𝐴, 𝑦 tons to 𝐵, and 𝑧 tons to 𝐶; then we actually need to minimize 𝜌(𝑥, 𝑦, 𝑧) under the constraint 𝑥 + 𝑦 + 𝑧 = 100. Without this constraint, the problem does not make sense at all.
Definition 2.41. Given a function 𝑓 ∈ 𝐶 1 (Ω, ℝ) and a constraint function 𝐹 ∈ 𝐶 1 (Ω, ℝ𝑚 ), the
Lagrange function 𝐿 ∶ Ω × ℝ𝑚 → ℝ is defined as
𝐿(𝑥, 𝜆) = 𝑓 (𝑥) + ⟨𝜆, 𝐹 (𝑥)⟩,
where 𝜆 is called the Lagrange multiplier.
Theorem 2.42. Suppose 𝑥0 ∈ Ω is a local minimum/local maximum point of the function 𝑓 ∈
𝐶 1 (Ω, ℝ) under the constraint 𝐹 (𝑥) = 0, where 𝐹 ∈ 𝐶 1 (Ω, ℝ𝑚 ), and 𝐷𝐹 (𝑥0 ) is surjective, then
there exists 𝜆0 ∈ ℝ𝑚 such that (𝑥0 , 𝜆0 ) is a critical point of 𝐿.
Proof. By the implicit function theorem, 𝐷𝐹 (𝑥0 ) is surjective means that the set 𝑀 ∶= {𝑥|𝐹 (𝑥) =
0} is locally a submanifold, and we can find a parametrization Φ ∶ 𝑈 → ℝ𝑁 where 𝑈 ⊂ ℝ𝑁−𝑚
is open and Φ is in 𝐶 1 with Φ(0) = 𝑥0 . Then 𝑥0 is a local minimum/local maximum point of
the function 𝑓 ∈ 𝐶 1 (Ω, ℝ) under the constraint 𝐹 (𝑥) = 0 implies that 𝑓 ◦Φ ∶ 𝑈 → ℝ has a local
minimum/local maximum at 0. By the characterization of the local minimum/local maximum point,
we have 𝐷(𝑓 ◦Φ)(0) = 0. By the chain rule, 𝐷𝑓 (𝑥0 )◦𝐷Φ(0) = 0.
If we plug in any vector 𝑣, this gives 𝐷𝑓 (𝑥0 )(𝐷Φ(0)𝑣) = 0, and by the expression of the tangent space that we have discussed in the last section, 𝐷𝑓 (𝑥0 ) is perpendicular to the tangent space 𝑇𝑥0 𝑀, and hence 𝐷𝑓 (𝑥0 ) is perpendicular to the kernel of 𝐷𝐹 (𝑥0 ). Therefore, 𝐷𝑓 (𝑥0 ) is a linear combination of the row vectors of 𝐷𝐹 (𝑥0 ), namely there exists 𝜆0 ∈ ℝ𝑚 (as a column vector) such that
𝐷𝑓 (𝑥0 ) = 𝜆0^𝑡 𝐷𝐹 (𝑥0 ).
Now differentiate 𝐿:
𝐷𝐿(𝑥, 𝜆) = (𝐷1 𝑓 (𝑥) + ⟨𝜆, 𝐷1 𝐹 (𝑥)⟩, 𝐷2 𝑓 (𝑥) + ⟨𝜆, 𝐷2 𝐹 (𝑥)⟩, ⋯ , 𝐷𝑁 𝑓 (𝑥) + ⟨𝜆, 𝐷𝑁 𝐹 (𝑥)⟩, 𝐹 (𝑥)).
So a critical point (𝑥, 𝜆) of 𝐿 satisfies 𝐷𝑖 𝑓 (𝑥) + ⟨𝜆, 𝐷𝑖 𝐹 (𝑥)⟩ = 0 for 𝑖 = 1, 2, ⋯ , 𝑁 and 𝐹 (𝑥) = 0. Then we can see that (𝑥0 , −𝜆0 ) is a critical point of 𝐿.

Remark 2.43. In the proof we used the following lemma: suppose 𝐴 ∶ ℝ𝑛 → ℝ𝑚 is linear. Then
if a vector 𝑣 ∈ ℝ𝑛 is perpendicular to the kernel of 𝐴, then 𝑣 is a linear combination of the row
vectors of 𝐴.
The intuition behind the Lagrange multiplier is the following: −𝐷𝑓 (𝑥) is the direction in which 𝑓 decreases fastest at 𝑥, in the following sense: for any 𝑣, 𝐷𝑣 𝑓 (𝑥) = ⟨𝐷𝑓 (𝑥), 𝑣⟩. If we plug in 𝑣 = −𝐷𝑓 (𝑥), then 𝐷𝑣 𝑓 (𝑥) = −‖𝐷𝑓 (𝑥)‖² ≤ 0. Moreover, by the Cauchy–Schwarz inequality for the inner product,
|⟨𝑣, 𝐷𝑓 (𝑥)⟩| ≤ ‖𝑣‖‖𝐷𝑓 (𝑥)‖,
we see that among vectors 𝑣 of norm ‖𝐷𝑓 (𝑥)‖, 𝑣 = −𝐷𝑓 (𝑥) is the direction that makes 𝐷𝑣 𝑓 (𝑥) most negative. This makes −𝐷𝑓 (𝑥) a very special vector.
Definition 2.44. 𝐷𝑓 (𝑥) is called the gradient of 𝑓 at 𝑥. Sometimes people use ∇𝑓 (𝑥) to denote it.
The Lagrange multiplier method is actually saying that if 𝑥0 is a minimum point, then the gradient 𝐷𝑓 (𝑥0 ) is normal to the submanifold given by the constraint function 𝐹 . Intuitively, this means that there is no way to decrease the value of the function 𝑓 within the submanifold, because the gradient 𝐷𝑓 (𝑥0 ) points completely outside the submanifold.
Just like all the critical point tests, the Lagrange multiplier method only gives a necessary condition for a point to be a local minimum/local maximum. A critical point of 𝐿 does not necessarily give a local minimum/local maximum of 𝑓 under the constraint 𝐹 = 0.
Similarly we have a second-order derivative test for a function with constraint.
Theorem 2.45. Suppose Ω ⊂ ℝ𝑁 is open, 𝑓 ∈ 𝐶 2 (Ω, ℝ) and 𝐹 ∈ 𝐶 2 (Ω, ℝ𝑚 ). If 𝑥0 is a local minimum point of the function 𝑓 under the constraint 𝐹 (𝑥) = 0, 𝐷𝐹 (𝑥0 ) is surjective, and (𝑥0 , 𝜆0 ) is the corresponding critical point of the Lagrange function 𝐿, then Hess𝑥 𝐿(𝑥0 , 𝜆0 ), the Hessian of 𝐿 in the 𝑥 variables, is positive semi-definite when restricted to the subspace 𝑉 = ker(𝐷𝐹 (𝑥0 )). (Note that the Hessian of 𝑓 alone is not the right object here: the curvature of the constraint set also contributes, which is why the Hessian of the Lagrange function appears.)
When we say an 𝑁 by 𝑁 matrix 𝐴 restricted to a subspace 𝑉 ⊂ ℝ𝑁 is positive semi-definite, we mean that 𝑣⊤ 𝐴𝑣 ≥ 0 for every 𝑣 ∈ 𝑉 ; it is positive definite on 𝑉 if 𝑣⊤ 𝐴𝑣 > 0 for every nonzero 𝑣 ∈ 𝑉 .
Theorem 2.46. Suppose Ω ⊂ ℝ𝑁 is open, 𝑓 ∈ 𝐶 2 (Ω, ℝ) and 𝐹 ∈ 𝐶 2 (Ω, ℝ𝑚 ). If (𝑥0 , 𝜆0 ) is a critical point of the Lagrange function 𝐿, 𝐷𝐹 (𝑥0 ) is surjective, and Hess𝑥 𝐿(𝑥0 , 𝜆0 ) is positive definite when restricted to the subspace 𝑉 = ker(𝐷𝐹 (𝑥0 )), then 𝑥0 is a local minimum point of the function 𝑓 under the constraint 𝐹 (𝑥) = 0.
The proof of these theorems is similar to the Hessian test, but we need to incorporate the language
of submanifolds. We omit the proof here.
We want to point out that 𝐷𝐹 (𝑥0 ) being surjective is crucial to the Lagrange multiplier. It actually
tells us that locally the constraint really gives us a submanifold.
2.7. Applications of Lagrange multiplier. Let us look at some classical applications of the Lagrange multiplier method.
Theorem 2.47 (Hölder inequality). Suppose 𝑥 ≥ 0, 𝑦 ≥ 0, 𝑝 > 1, 𝑞 > 1, and 1∕𝑝 + 1∕𝑞 = 1. Then
𝑥𝑦 ≤ 𝑥^𝑝∕𝑝 + 𝑦^𝑞∕𝑞.
(This pointwise estimate is often called Young's inequality; it is the key step in the proof of the Hölder inequality for sums and integrals.)
Proof. It is clear that if 𝑥 = 0 or 𝑦 = 0 the inequality holds. So it suffices to show the inequality for 𝑥 > 0, 𝑦 > 0. Let
𝑓 (𝑥, 𝑦) = 𝑥^𝑝∕𝑝 + 𝑦^𝑞∕𝑞 − 𝑥𝑦.
It suffices to show 𝑓 (𝑥, 𝑦) ≥ 0 for all 𝑥 > 0, 𝑦 > 0. We remark that the direct derivative test gives a huge set of critical points, and it is hard to prove 𝑓 (𝑥, 𝑦) ≥ 0 that way.
Instead, we use the following observation: for any 𝑎 > 0,
𝑓 (𝑎^{1∕𝑝} 𝑥, 𝑎^{1∕𝑞} 𝑦) = 𝑎𝑓 (𝑥, 𝑦).
So in order to show 𝑓 (𝑥, 𝑦) ≥ 0, it suffices to show that 𝑓 (𝑥∕(𝑥𝑦)^{1∕𝑝} , 𝑦∕(𝑥𝑦)^{1∕𝑞} ) ≥ 0 (take 𝑎 = 𝑥𝑦 above). Namely, we only need to show that 𝑓 (𝑥, 𝑦) ≥ 0 under the constraint 𝑥𝑦 = 1.
If 𝑥𝑦 = 1 and 𝑥 ≥ 𝑝^{1∕𝑝} (so that 𝑥^𝑝∕𝑝 ≥ 1) or 𝑦 ≥ 𝑞^{1∕𝑞} , then 𝑓 (𝑥, 𝑦) = 𝑥^𝑝∕𝑝 + 𝑦^𝑞∕𝑞 − 1 ≥ 0. Therefore, we only need to show that 𝑓 (𝑥, 𝑦) ≥ 0 with the constraint 𝑥𝑦 = 1 within the bounded domain
𝑈 ∶= [0, 𝑝^{1∕𝑝} ] × [0, 𝑞^{1∕𝑞} ].
It is clear that 𝑓 (𝑥, 𝑦) ≥ 0 on 𝜕𝑈 . Now we consider the interior points. Let
𝐿(𝑥, 𝑦, 𝜆) = 𝑥^𝑝∕𝑝 + 𝑦^𝑞∕𝑞 − 𝑥𝑦 + 𝜆(𝑥𝑦 − 1).
We compute that
𝐷𝑥 𝐿(𝑥, 𝑦, 𝜆) = 𝑥^{𝑝−1} − 𝑦 + 𝜆𝑦, 𝐷𝑦 𝐿(𝑥, 𝑦, 𝜆) = 𝑦^{𝑞−1} − 𝑥 + 𝜆𝑥, 𝐷𝜆 𝐿(𝑥, 𝑦, 𝜆) = 𝑥𝑦 − 1.
We see that for a critical point (𝑥0 , 𝑦0 , 𝜆0 ),
𝑥0^{𝑝−1}∕𝑦0 = 𝑦0^{𝑞−1}∕𝑥0 , 𝑥0 𝑦0 = 1.
This forces 𝑥0^𝑝 = 𝑦0^𝑞 , and together with 𝑥0 𝑦0 = 1 we get 𝑥0 = 𝑦0 = 1. So the only critical point of 𝐿 is (1, 1, 𝜆0 ). This implies that the interior minimum point of 𝑓 with constraint 𝑥𝑦 = 1 must be (1, 1). Then 𝑓 (1, 1) = 1∕𝑝 + 1∕𝑞 − 1 = 0 implies that 𝑓 (𝑥, 𝑦) ≥ 0.
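A brute-force check of the inequality on a grid (a NumPy sketch with the concrete exponents 𝑝 = 3, 𝑞 = 3∕2, our own choice):

    import numpy as np

    p, q = 3.0, 1.5                           # 1/p + 1/q = 1
    x = np.linspace(0.0, 5.0, 501)
    X, Y = np.meshgrid(x, x)
    f = X**p / p + Y**q / q - X * Y
    print(f.min() >= -1e-12)   # True: f >= 0 up to rounding; minimum 0 where x^p = y^q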
Question 2.48. Find the largest rectangle that is inscribed in the ellipse {(𝑥, 𝑦) | 𝑥²∕𝑎² + 𝑦²∕𝑏² = 1}.
To answer this question, we suppose one vertex of the rectangle is (𝑥, 𝑦); then the area of the rectangle is 4|𝑥𝑦|. So we want to optimize 4|𝑥𝑦| under the constraint 𝑥²∕𝑎² + 𝑦²∕𝑏² − 1 = 0.
A technical issue is that 4|𝑥𝑦| is not always differentiable. To overcome this issue, we can op-
timize (4|𝑥𝑦|)2 = 16𝑥2 𝑦2 . This is a smooth function and we can feel free to differentiate it. Once
we find the maximum value of 16𝑥2 𝑦2 , we can take a square root to find the largest area of the
rectangle.
Define
𝐿(𝑥, 𝑦, 𝜆) = 16𝑥²𝑦² + 𝜆(𝑥²∕𝑎² + 𝑦²∕𝑏² − 1).
We can compute that
𝐷𝑥 𝐿(𝑥, 𝑦, 𝜆) = 32𝑥𝑦² + 2𝜆𝑥∕𝑎² , 𝐷𝑦 𝐿(𝑥, 𝑦, 𝜆) = 32𝑥²𝑦 + 2𝜆𝑦∕𝑏² , 𝐷𝜆 𝐿(𝑥, 𝑦, 𝜆) = 𝑥²∕𝑎² + 𝑦²∕𝑏² − 1.
Then we can find the critical points
(±𝑎, 0, 0), (0, ±𝑏, 0), (±(√2∕2)𝑎, ±(√2∕2)𝑏, −8𝑎²𝑏²).
Plugging in these values, we see that the maximum point is ((√2∕2)𝑎, (√2∕2)𝑏) (up to signs), and the largest area of the rectangle is 2𝑎𝑏.
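The answer is easy to confirm numerically by parametrizing the ellipse as (𝑎 cos 𝑡, 𝑏 sin 𝑡) (a NumPy sketch with 𝑎 = 2, 𝑏 = 1, our own sample values):

    import numpy as np

    a, b = 2.0, 1.0
    t = np.linspace(0.0, 2 * np.pi, 100001)
    area = 4 * np.abs(a * np.cos(t) * b * np.sin(t))   # area 4|xy| = 2ab |sin(2t)|
    print(np.isclose(area.max(), 2 * a * b))           # True: maximum 2ab at t = pi/4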
2.8. Newton’s iteration method*. To optimize the function (under constraint or not), we have a
systematic theory as follows: first, we try to find the critical points, then use the Hessian test to see
whether they are the local minimum or local maximum, then we go from local to global.
In practice, the first step is actually very hard. For example, in order to find a critical point of a
function 𝑓 ∈ 𝐶 ∞ (ℝ𝑛 , ℝ), one needs to solve the following system of equations:
𝐷1 𝑓 (𝑥) = 0,
𝐷2 𝑓 (𝑥) = 0,
⋯
𝐷𝑛 𝑓 (𝑥) = 0.
Solving these equations may be very hard. Newton’s method is an amazing method to solve a
system of equations. Let us formulate the questions precisely:
Question 2.49. Suppose Ω ⊂ ℝ𝑛 is open and 𝐹 ∈ 𝐶 1 (Ω, ℝ𝑛 ). We want to find 𝑥 ∈ Ω such that 𝐹 (𝑥) = 0.


Remark 2.50. In general, the zero set of 𝐹 ∈ 𝐶 1 (ℝ𝑛 , ℝ𝑚 ) with 𝑚 < 𝑛 has dimension 𝑛 − 𝑚
(think about the implicit function theorem). Hence in general there is a huge zero set of a map
𝐹 ∈ 𝐶 1 (ℝ𝑛 , ℝ𝑚 ) with 𝑚 < 𝑛. So the method does not work very well there.
Newton’s method works as follows.
Step 1. We pick an arbitrary point 𝑥1 ∈ Ω. If 𝐹 (𝑥1 ) = 0 then we are done. So we assume
𝐹 (𝑥1 ) ≠ 0.
Step 2. Let
𝑥2 = 𝑥1 − (𝐷𝐹 (𝑥1 ))−1 𝐹 (𝑥1 ).
Then if 𝐹 (𝑥2 ) = 0, we are done. Otherwise, let
𝑥3 = 𝑥2 − (𝐷𝐹 (𝑥2 ))−1 𝐹 (𝑥2 ).
Step 3. Repeat the above steps. In general, let
𝑥𝑛+1 = 𝑥𝑛 − (𝐷𝐹 (𝑥𝑛 ))−1 𝐹 (𝑥𝑛 ).
Then either we find some 𝑥𝑛 such that 𝐹 (𝑥𝑛 ) = 0, or we get a sequence of points 𝑥1 , 𝑥2 , ⋯. If the sequence converges, then 𝑥∞ = lim𝑛→∞ 𝑥𝑛 is the desired zero point (see the sketch below).
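Here is a minimal Python sketch of the iteration (the 2 × 2 system and the initial guess are our own illustrative choices, not from the text):

    import numpy as np

    # Solve F(x, y) = (x^2 + y^2 - 4, xy - 1) = 0 by Newton's method.
    def F(v):
        x, y = v
        return np.array([x**2 + y**2 - 4.0, x * y - 1.0])

    def DF(v):
        x, y = v
        return np.array([[2 * x, 2 * y],
                         [y,     x   ]])

    v = np.array([2.0, 1.0])                  # initial guess x_1
    for _ in range(8):
        v = v - np.linalg.solve(DF(v), F(v))  # x_{n+1} = x_n - DF(x_n)^{-1} F(x_n)
    print(v, F(v))                            # F(v) is ~0 up to rounding

In practice one solves the linear system 𝐷𝐹 (𝑥𝑛 )𝑢 = 𝐹 (𝑥𝑛 ) instead of forming the inverse matrix; the two are mathematically identical, but the former is cheaper and more stable.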
We first need to remark that, Newton’s method does not always work. First of all, we must require
all the differential 𝐷𝐹 (𝑥𝑛 ) in the iterations to be invertible, otherwise, we can not run the iteration.
We study a particular class of functions that Newton’s method works:
∙ Ω is convex,
∙ 𝐹 ∈ 𝐶 2 (Ω, ℝ𝑛 ),
∙ there exists a unique zero point of 𝐹 , denoted by 𝑥0 ,
∙ 𝐷𝐹 (𝑥) is invertible for 𝑥 ∈ Ω,
∙ ‖𝐷𝐹 −1 (𝑥)‖‖ Hess𝐹 (𝑥)‖ < 2𝛼∕diam(Ω), where 𝛼 ∈ (0, 1) and the diameter is diam(Ω) ∶= sup𝑥,𝑦∈Ω ‖𝑥 − 𝑦‖.
Proof that {𝑥𝑛 }_{𝑛=1}^{∞} converges. The Taylor expansion of 𝐹 at 𝑥𝑛 implies that, for some 𝑠 ∈ (0, 1),
0 = 𝐹 (𝑥0 ) = 𝐹 (𝑥𝑛 ) + 𝐷𝐹 (𝑥𝑛 )(𝑥0 − 𝑥𝑛 ) + Σ_{𝑖,𝑗=1}^{𝑛} (1∕2) 𝐷𝑖𝑗 𝐹 (𝑠𝑥0 + (1 − 𝑠)𝑥𝑛 )(𝑥0 − 𝑥𝑛 )𝑖 (𝑥0 − 𝑥𝑛 )𝑗 .
Then we have
−𝐷𝐹 (𝑥𝑛 )−1 𝐹 (𝑥𝑛 ) = 𝑥0 − 𝑥𝑛 + 𝐷𝐹 (𝑥𝑛 )−1 Σ_{𝑖,𝑗=1}^{𝑛} (1∕2) 𝐷𝑖𝑗 𝐹 (𝑠𝑥0 + (1 − 𝑠)𝑥𝑛 )(𝑥0 − 𝑥𝑛 )𝑖 (𝑥0 − 𝑥𝑛 )𝑗 ,
namely
𝑥𝑛+1 − 𝑥0 = 𝐷𝐹 (𝑥𝑛 )−1 Σ_{𝑖,𝑗=1}^{𝑛} (1∕2) 𝐷𝑖𝑗 𝐹 (𝑠𝑥0 + (1 − 𝑠)𝑥𝑛 )(𝑥0 − 𝑥𝑛 )𝑖 (𝑥0 − 𝑥𝑛 )𝑗 .
This implies that
‖𝑥𝑛+1 − 𝑥0 ‖ ≤ (𝛼∕diam(Ω))‖𝑥𝑛 − 𝑥0 ‖² ≤ 𝛼‖𝑥𝑛 − 𝑥0 ‖,
since ‖𝑥𝑛 − 𝑥0 ‖ ≤ diam(Ω). Then by mathematical induction,
‖𝑥𝑛+1 − 𝑥0 ‖ ≤ 𝛼^𝑛 ‖𝑥1 − 𝑥0 ‖.
This implies that lim𝑛→∞ 𝑥𝑛 = 𝑥0 . 
Remark 2.51. We did not discuss Taylor’s theorem for maps, so here the proof is actually sketchy.
You can try to make it rigorous by yourself.
Remark 2.52. It seems that Newton's method converges exponentially, but actually it does a better job. In fact, from the proof we have
‖𝑥𝑛+1 − 𝑥0 ‖ ≤ (1∕2)‖𝐷𝐹 −1 (𝑥)‖‖ Hess𝐹 (𝑥)‖‖𝑥𝑛 − 𝑥0 ‖².
If we write sup𝑥 ‖𝐷𝐹 −1 (𝑥)‖‖ Hess𝐹 (𝑥)‖ = 2𝑀, then
‖𝑥𝑛+1 − 𝑥0 ‖ ≤ 𝑀‖𝑥𝑛 − 𝑥0 ‖².
Then by mathematical induction,
𝑀‖𝑥𝑛+1 − 𝑥0 ‖ ≤ (𝑀‖𝑥1 − 𝑥0 ‖)^{2^𝑛} .
Thus, if 𝑀‖𝑥1 − 𝑥0 ‖ < 1, the convergence is actually super-exponential (this rate is usually called quadratic convergence).
Example 2.53 (Fast inverse square root). The fast inverse square root is a classical application of
Newton’s iteration method. This algorithm is best known for its implementation in 1999 in Quake
III Arena, a first-person shooter video game heavily based on 3D graphics.
The algorithm consists of two steps. The first step provides a first guess of the inverse square root, and it uses the full power of the structure of 32-bit floating-point numbers. This part needs some knowledge of computer science and we omit it here, but refer you to a nice survey by Chris Lomont: http://www.lomont.org/papers/2003/InvSqrt.pdf.
The second step uses Newton's method to increase the precision of the result. The question can be formulated as follows: given 𝑎 > 0, find 1∕√𝑎. We first translate the problem into finding the zero of 𝐹 (𝑥) = 𝑎𝑥² − 1 for 𝑥 > 0. Starting with 𝑥1 = 1, we consider
𝑥𝑛+1 = 𝑥𝑛 − (𝑎𝑥𝑛² − 1)∕(2𝑎𝑥𝑛 ) = 𝑥𝑛∕2 + 1∕(2𝑎𝑥𝑛 ).
Then after several iterations (actually even once), we can get an approximate value of 1∕√𝑎. However, on computers the computation 𝑥 → 1∕𝑥 (a division) is actually very expensive. Therefore, in the algorithm, the real choice of 𝐹 is 𝐹 (𝑥) = 1∕𝑥² − 𝑎. Then Newton's iteration becomes
𝑥𝑛+1 = 𝑥𝑛 + (1∕2)𝑥𝑛³ (1∕𝑥𝑛² − 𝑎) = (3∕2)𝑥𝑛 − (1∕2)𝑎𝑥𝑛³ .
In the iteration, we only need to compute additions and multiplications, which are much cheaper for computers.
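A Python sketch of this division-free iteration (the input 𝑎 = 7 and the crude initial guess 0.5 are our own choices; the real algorithm produces the initial guess by bit-level manipulation of the floating-point representation):

    a = 7.0
    y = 0.5                                 # initial guess for 1/sqrt(7)
    for _ in range(6):
        y = y * (1.5 - 0.5 * a * y * y)     # only multiplications and additions
    print(y, a ** -0.5)                     # both print ~0.3779644730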
2.9. Diagonalizing symmetric matrix*. We show another application of the Lagrange multiplier
method. Recall that an 𝑛 × 𝑛 matrix 𝐴 is symmetric if 𝐴𝑡 = 𝐴. If the entries of 𝐴 are 𝑎𝑖𝑗 , 𝐴 being
symmetric is equivalent to 𝑎𝑖𝑗 = 𝑎𝑗𝑖 for all 1 ≤ 𝑖, 𝑗 ≤ 𝑛.
The goal of this section is to prove a diagonalization theorem of symmetric matrices.
Theorem 2.54 (diagonalization theorem of symmetric matrices). Suppose 𝐴 is a symmetric matrix. Then there exists an orthogonal matrix 𝑂 such that 𝑂𝑡 𝐴𝑂 = diag{𝜆1 , ⋯ , 𝜆𝑛 }, where diag{𝜆1 , ⋯ , 𝜆𝑛 } denotes the 𝑛 × 𝑛 diagonal matrix with entries 𝜆1 , ⋯ , 𝜆𝑛 on the diagonal and 0 elsewhere.
Remark 2.55. If all 𝜆𝑖 > 0, then 𝐴 is positive definite; if all 𝜆𝑖 ≥ 0, then 𝐴 is positive semi-definite. Similar statements hold for 𝐴 being negative definite and negative semi-definite.
Proof of the diagonalization theorem. We divide the proof into several steps.
Step 1. Define 𝑓 ∶ ℝ𝑛 → ℝ by
𝑓 (𝑥) = 𝑥𝑡 𝐴𝑥.
Let 𝑣1 ∈ ℝ𝑛 be a minimizer of 𝑓 under the constraint ‖𝑣‖2 = 1. The Lagrange function in this
situation is
𝐿(𝑥, 𝜆) = 𝑥𝑡 𝐴𝑥 + 𝜆(1 − ‖𝑥‖2 ).
Then
𝐷𝐿(𝑥, 𝜆) = (2(𝐴𝑥)1 − 2𝜆𝑥1 , 2(𝐴𝑥)2 − 2𝜆𝑥2 , ⋯ , 2(𝐴𝑥)𝑛 − 2𝜆𝑥𝑛 , 1 − ‖𝑥‖2 ).
Notice that {‖𝑥‖² = 1} is a compact submanifold (it is just the sphere), so there must exist a minimum point under the constraint; we call it 𝑣1 . Then the Lagrange multiplier method shows that there exists 𝜆1 ∈ ℝ such that (𝑣1 , 𝜆1 ) is a critical point of 𝐿. Then we see that
𝐴𝑣1 = 𝜆1 𝑣1 .
Step 2. Define 𝑊1 = span{𝑣1 }, the space spanned by 𝑣1 , and let 𝑉1 = 𝑊1⟂ . Repeat the above discussion, but restrict 𝑓 to 𝑉1 : we can find a minimizer 𝑣2 ∈ 𝑉1 under the constraint ‖𝑣2 ‖ = 1. The same argument shows that there exists 𝜆2 ∈ ℝ such that 𝐴𝑣2 = 𝜆2 𝑣2 . (Here the symmetry of 𝐴 guarantees that 𝐴 preserves 𝑉1 : for 𝑣 ∈ 𝑉1 , ⟨𝐴𝑣, 𝑣1 ⟩ = ⟨𝑣, 𝐴𝑣1 ⟩ = 𝜆1 ⟨𝑣, 𝑣1 ⟩ = 0.)
Step 3. Repeat the above discussion 𝑛 times, we get (column) vectors 𝑣1 , 𝑣2 , ⋯ , 𝑣𝑛 and real
numbers 𝜆1 , ⋯ 𝜆𝑛 , such that
∙ ‖𝑣𝑖 ‖ = 1;
∙ for 𝑖 ≠ 𝑗, 𝑣𝑖 ⟂ 𝑣𝑗 ;
∙ 𝐴𝑣𝑖 = 𝜆𝑖 𝑣𝑖 .
Then if we define
𝑂 = [𝑣1 𝑣2 ⋯ 𝑣𝑛 ],
we have
𝑂𝑡 𝐴𝑂 = diag{𝜆1 , ⋯ , 𝜆𝑛 }.

Remark 2.56. You can reverse the above procedure by picking the maximum point under constraint.
The result is the same.
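The theorem can be checked numerically; NumPy's eigh routine computes exactly such an orthonormal eigenbasis 𝑣1 , ⋯ , 𝑣𝑛 for a symmetric matrix (a sketch with a random 4 × 4 example):

    import numpy as np

    rng = np.random.default_rng(0)
    B = rng.standard_normal((4, 4))
    A = (B + B.T) / 2                        # a random symmetric matrix

    lam, O = np.linalg.eigh(A)               # columns of O are v_1, ..., v_n
    print(np.allclose(O.T @ A @ O, np.diag(lam)))   # True: O^t A O is diagonal
    print(np.allclose(O.T @ O, np.eye(4)))          # True: O is orthogonal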
UNIVERSITY OF CHICAGO, DEPARTMENT OF MATHEMATICS, 5734 S UNIVERSITY AVE, CHICAGO IL, 60637
Email address: [email protected]
