Matec Notes: Alexey Guzey This Version From September 25, 2017. Click For The Latest Version

This document contains notes for a mathematics course. It begins by introducing set operations and notation, including defining a set, union, intersection, subset, empty set, and power set. It then covers sequences and limits in Rn, open and closed sets, compact sets, continuity of multivariable functions, differentiation of multivariable functions including gradients and the chain rule, implicit functions, convexity and convex sets, unconstrained and constrained optimization. The document is intended to help students understand the concepts and see how they are connected, with a focus on theoretical understanding over problem-solving.


Matec Notes

Alexey Guzey

This version from September 25, 2017.

Click here for the latest version.

I want to thank Elena Kochegarova for invaluable advice in preparation of these notes, without
which they would not have been written.
Feel free to contact me regarding typos, suggestions, questions about the material, etc. by email:
[email protected] / VK: vk.com/alexeyguzey / Telegram: t.me/alexeyguzey

How to use the notes. The principle that guided us during the creation of these notes was to
let the student not just memorize the algorithm, but understand why the algorithm
works. Consequently, the notes focus much more on the theoretical part of the course than on
problem-grinding. What this means is that these notes are not a substitute for seminars and home
assignments; they're a complement to them. To help you gain insight into the concepts
your lecturers and seminar teachers want you to learn, we tried to make the explanations as
clear as possible and put a lot of effort into connecting ideas to each other. To make the introduction
of new concepts easier, most of the explanations begin with examples in R^1 or R^2 and only then are
generalized.
Some chapters feature an appendix with additional material, such as even deeper
explanations or simply something fun and interesting that is tangential to, but outside the scope of, the course.
There are two ways to look at the first couple of months of matec:

1. As an extension of first-year calculus topics, but for several dimensions

2. Getting ready to solve optimization problems. This is totally unobvious, but all the set theory,
limits, derivatives of functions of several variables, etc. are needed to be able to fully understand
constrained optimization problems, similar to the typical micro utility maximization problems,
but more complex. Right now the picture below won't make any sense. But when you start
getting the material, you can try to look back at it and you will realize how it's all connected:

[Figure: concept map with Optimization at the top, followed by: first-order condition, second-order condition; Lagrangian, Hessian matrix, Weierstrass theorem; first-order and partial derivatives, second-order derivatives, continuity, compact sets; limits, open/closed and bounded/unbounded sets; and ε-balls at the bottom.]

Contents
1 Set Operations and Notation
1.1 Operations on sets
1.2 Appendix

2 Sequences in R^n
2.1 Vectors
2.2 Balls
2.3 Sequences and Their Limits

3 Sets
3.1 Open and Closed Sets
3.2 Compact Sets
3.3 Appendix 1
3.4 Appendix 2

4 Multivariable Functions. Continuity.
4.1 Level curves
4.2 Limit of a Function
4.3 Continuity
4.4 Finding Limits with Polar Coordinates

5 Differentiation of Multivariable Functions. Approximation
5.1 Taylor Series Introduction
5.2 First-Order (Linear) Approximation for One Variable
5.3 First-Order (Linear) Approximation for Two Variables
5.4 Tangent Plane
5.5 Directional Derivative
5.6 Gradient
5.7 Chain Rule
5.8 Appendix to Chain Rule
5.9 Second-order approximation

6 Implicit functions
6.1 Implicit Function Theorem 1
6.2 Implicit Function Theorem 2
6.3 Implicit Function Theorem 3

7 Convexity and Concavity. Convex Sets
7.1 R^2
7.2 Appendix (don't read this unless you want to mess with your head)
7.3 What Does Determinant Have To Do With Anything? (don't read this even more)

8 Unconstrained Optimization
8.1 Local Optima
8.2 Global Optima

9 Constrained Optimization
9.1 What the hell is NDCQ?
9.2 Lagrange multiplier method
9.3 Envelope theorem (unconstrained)
9.4 Envelope theorem (constrained)

1 Set Operations and Notation
Definition. A set is a collection of distinct objects.
This means that there are no repeats and order doesn't matter.

Example 1.
A = {1, 2, 3} = {3, 1, 2}

Example 2.
B = {x | 0 ≤ x ≤ 1}
Verbally: all x such that x is greater than or equal to 0 and less than or equal to 1.

A has three members, while B has an infinite number of members. Thus, sets can be either finite
or infinite.

1.1 Operations on sets

First, some notation:

Notation   Meaning                                   Example
a ∈ A      element a belongs to a set A              1 ∈ {1, 2}
B ⊆ A      a set B is a subset of a set A            {1, 2} ⊆ {1, 2, 3, 4}, {1, 2} ⊆ {1, 2}
∅          empty set                                 A = {}
2^A        power set: set of all subsets of a set    e.g. A = {1, 2}: 2^A = {∅, {1}, {2}, {1, 2}}

Union of sets. A ∪ B. [Venn diagram of A and B.]

A ∪ B = {x | x ∈ A or x ∈ B}.
Verbally: all x such that x belongs to A or x belongs to B.

Intersection of sets. A ∩ B. [Venn diagram of A and B.]

A ∩ B = {x | x ∈ A and x ∈ B}.
Verbally: elements that belong to both A and B.

Difference of sets. A \ B. [Venn diagram of A and B.]

A \ B = {x | x ∈ A and x ∉ B}.
Verbally: elements that belong to A and do not belong to B.

Complement or negation of a set. Ā or ¬A. [Venn diagram of A and its complement.]

¬A = {x | x ∉ A}.
Verbally: elements that do not belong to A.

Cartesian product of sets. A × B.

A × B = {(a, b) | a ∈ A and b ∈ B}.
Verbally: all points (a, b) such that the a coordinate comes from A and the b coordinate comes from B.

Example 1.

{1, 2, 3} × {3, 4} = {(1, 3), (1, 4), (2, 3), (2, 4), (3, 3), (3, 4)}

Example 2.
{3, 4} × {1, 2, 3} = {(3, 1), (3, 2), (3, 3), (4, 1), (4, 2), (4, 3)}.
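
The operations of this section map closely onto Python's built-in `set` type, so they are easy to experiment with. The following is only an illustrative sketch and not part of the original notes; the helper name `power_set` is a hypothetical name introduced here.

```python
from itertools import chain, combinations, product

A = {1, 2, 3}
B = {3, 4}

union = A | B                    # {x | x in A or x in B}
intersection = A & B             # {x | x in A and x in B}
difference = A - B               # {x | x in A and x not in B}
cartesian = set(product(A, B))   # {(a, b) | a in A and b in B}


def power_set(s):
    """Return the set of all subsets of s, each stored as a frozenset."""
    items = list(s)
    subsets = chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1)
    )
    return {frozenset(c) for c in subsets}


print(union, intersection, difference)
print(sorted(cartesian))
print(power_set({1, 2}))
```

Swapping the arguments, `set(product(B, A))`, gives the pairs of Example 2, which shows that the order of the factors in a Cartesian product matters.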

1.2 Appendix

Fun fact: Russell's paradox. Suppose there is a barber who shaves all men who do not shave themselves
and only men who do not shave themselves. Does the barber shave himself? If he does, then he
shouldn't. If he doesn't, then he should. Thus a paradox.
Stated more formally: let R be the set of all sets that are not members of themselves (R = {x | x ∉ x}).
If R is not a member of itself, then its definition dictates that it must contain itself (R ∈ R), and
if it contains itself, then it contradicts its own definition as the set of all sets that are not members of
themselves (R ∉ R).

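2 Sequences in R^n
2.1 Vectors

A vector a in R^n is given by $a = \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix}$, where $a_n$ is the nth coordinate of the vector. For example, in
R^2 a vector is $a = \begin{pmatrix} a_1 \\ a_2 \end{pmatrix} = \begin{pmatrix} a_x \\ a_y \end{pmatrix}$.

We add vectors together by adding each coordinate: $a + b = \begin{pmatrix} a_1 + b_1 \\ \vdots \\ a_n + b_n \end{pmatrix}$.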

2.1.1 Vectors in R^2 (Euclidean space)

The length¹ of a vector a is denoted by ||a|| and equals $\sqrt{a_x^2 + a_y^2}$, or in sum notation $\sqrt{\sum_{i=1}^{2} a_i^2}$. You already
knew this in the form of Pythagoras' theorem:

[Figure: vector a from the origin to the point (a_x, a_y) in the x-y plane, with $\|a\| = \sqrt{a_x^2 + a_y^2}$.]

The distance between vectors a and b is denoted by ||a − b|| and is equal to $\sqrt{(a_x - b_x)^2 + (a_y - b_y)^2} = \sqrt{\sum_{i=1}^{2} (a_i - b_i)^2}$.
As we can see from the picture, it's Pythagoras again:

[Figure: points (a_x, a_y) and (b_x, b_y) in the x-y plane; the distance $d = \|a - b\| = \sqrt{(a_x - b_x)^2 + (a_y - b_y)^2}$ is the hypotenuse of a right triangle with legs a_x − b_x and a_y − b_y.]

¹ Technically, it's called the norm of a vector, but you don't need to think about it.

2.1.2 Vectors in R^n (Euclidean space)
The way we compute lengths and distances in R^n is exactly the same as in R^2, but we use more coordinates.
Basically, it's Pythagoras' theorem for an arbitrary number of dimensions.

Definition. Length of a vector a in R^n:

$$\|a\| = \sqrt{\sum_{i=1}^{n} a_i^2}$$

Definition. Distance between vectors a and b in R^n:

$$\|a - b\| = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$
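
To make the two definitions concrete, here is a minimal Python sketch that computes the length and the distance directly from the formulas. The function names `norm` and `distance` are illustrative choices, not something the notes prescribe.

```python
import math


def norm(a):
    """Euclidean length ||a||: the square root of the sum of squared coordinates."""
    return math.sqrt(sum(x * x for x in a))


def distance(a, b):
    """Euclidean distance ||a - b|| between two vectors with the same number of coordinates."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of coordinates")
    return norm([x - y for x, y in zip(a, b)])


print(norm([3, 4]))                          # the 3-4-5 right triangle in R^2
print(distance([0, 0, 0, 0], [1, 1, 1, 1]))  # the same formula in R^4
```

The same two functions work in any dimension, which is the point of the definition: only the number of terms in the sum changes.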

2.2 Balls

2.2.1 Balls in R^1
A ball in R^1 is called an interval. An interval around the point a, (a − ε, a + ε),

[Figure: number line with the points a − ε, a, and a + ε marked.]

written in math notation as

{x | a − ε < x < a + ε},

is given by

B_ε(a) = {x ∈ R^1 | ||a − x|| < ε}

2.2.2 Balls in R^2
A ball in R^2 is called a disk. A disk around the point a with radius ε,

[Figure: disk of radius ε centered at a; inside the circle, ||a − (x, y)|| < ε for all points (x, y).]

written in math notation as

$\left\{(x, y) \;\middle|\; \sqrt{(a_x - x)^2 + (a_y - y)^2} < \varepsilon\right\}$,

is given by

B_ε(a) = {x ∈ R^2 | ||a − x|| < ε}

2.2.3 Balls in R^n
For a ball in R^n we have exactly the same definition as for a disk, in terms of ||a − x|| < ε:
Definition. The ball B_ε(a) around the point a with radius ε in R^n is given by

B_ε(a) = {x ∈ R^n | ||a − x|| < ε}
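
Membership in a ball is a one-line check once the distance is available. The sketch below assumes Python 3.8 or later (for `math.dist`); the function name `in_ball` is a hypothetical helper introduced only for illustration.

```python
import math


def in_ball(x, a, eps):
    """Return True if x lies in the open ball B_eps(a), i.e. ||a - x|| < eps."""
    return math.dist(a, x) < eps


# R^1: the ball around 0 with radius 1 is the open interval (-1, 1)
print(in_ball([0.5], [0.0], 1.0))
print(in_ball([1.0], [0.0], 1.0))   # the endpoint itself is excluded (strict inequality)

# R^2: the ball around the origin with radius 1 is the open unit disk
print(in_ball((0.6, 0.6), (0.0, 0.0), 1.0))
print(in_ball((0.8, 0.8), (0.0, 0.0), 1.0))
```

Note the strict inequality: points at distance exactly ε from a are not in the ball, which is what makes balls open sets later on.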

2.3 Sequences and Their Limits

2.3.1 A sequence of points in R^n

The Fibonacci sequence, given by 1, 1, 2, 3, 5, 8, 13, ..., is an example of a sequence in R^1, which is the
type of sequence you explored in first-year calculus. A sequence in R^n is an extension of this idea:

Fibonacci sequence         arbitrary sequence in R^1    arbitrary sequence in R^n
a_1 = 1                    a_1 = a_1^1                  a_1 = (a_1^1, ..., a_1^m)^T
a_2 = 1                    a_2 = a_2^1                  a_2 = (a_2^1, ..., a_2^m)^T
a_3 = 2                    a_3 = a_3^1                  a_3 = (a_3^1, ..., a_3^m)^T
a_n = a_{n−2} + a_{n−1}    a_n = a_n^1                  a_n = (a_n^1, ..., a_n^m)^T

So $a_n = \begin{pmatrix} a_n^1 \\ \vdots \\ a_n^m \end{pmatrix}$ is the nth point of the sequence, $a_n^i$ is the ith coordinate of that point, and the sequence
itself is given by $\{a_n\}_{n=1}^{\infty}$.

2.3.2 Limit of a sequence in R1


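In R^1 a sequence has the limit L if, for any arbitrarily small number ε > 0, there's an index n_ε such that
all terms of the sequence following n_ε lie within distance ε of the limit L. Written
formally,

$$\lim_{n\to\infty} a_n = L \quad\text{iff}\quad \forall \varepsilon > 0\ \exists n_\varepsilon : \forall n > n_\varepsilon,\ \|a_n - L\| < \varepsilon$$

Graphically:

[Figure: terms a_1, a_2, a_3, ... plotted against n; after n_ε all terms lie within the ε-band around L.]
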
Note that here the horizontal axis does not carry any meaning by itself: it simply shows the order
of the sequence.

2.3.3 Limit of a sequence in R2


In R^2 we need both coordinates to be close to L, and the following theorem should be fairly obvious:

Theorem. A sequence $\{a_n\}_{n=1}^{\infty}$ converges to L if and only if each coordinate sequence $\{a_n^i\}_{n=1}^{\infty}$
converges to the corresponding coordinate $L^i$.
A natural way to define this would be to pick an arbitrarily small area (an ε-ball) around L and
see whether all points after a certain n_ε lie within this ball. So we modify the definition of a limit by
making ε the radius of a ball. Notice that the definition written in math notation didn't change:

$$\lim_{n\to\infty} a_n = L \quad\text{iff}\quad \forall \varepsilon > 0\ \exists n_\varepsilon : \forall n > n_\varepsilon,\ \|a_n - L\| < \varepsilon$$

Next, let's see a couple of examples of converging sequences.

First is a spiral sequence (try to define it explicitly :p). Imagine a ball around its center and start
shrinking it. For any arbitrarily small ball, we will find a point on the spiral such that all points after
this point lie within the ball. Thus, the spiral's center is its limit.

[Figure: points a_1, a_2, a_3, ... spiraling in toward the center.]

In the pictures below:

left sequence (a): $x_n = 3 - \frac{2}{n}$, $y_n = 1 + \frac{3}{n}$
right sequence (b): $x_n = 3 - \frac{2}{n}\cdot(-1)^n$, $y_n = 1 + \frac{3}{n}$

[Figure: two plots in the x-y plane with axes from 0 to 5. Left: points a_1, a_2, a_3 approaching lim = (3, 1). Right: points b_1, b_2, b_3 approaching lim = (3, 1).]

(you can calculate the first several values of these sequences to confirm that they are convergent)
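
The suggested calculation can be done in a few lines of Python. This sketch assumes the two sequences are x_n = 3 − 2/n, y_n = 1 + 3/n for (a) and x_n = 3 − (2/n)·(−1)^n, y_n = 1 + 3/n for (b), which is how the garbled formulas in this copy of the notes appear to read; it prints each term together with its distance to the claimed limit (3, 1).

```python
import math

LIMIT = (3.0, 1.0)


def a(n):
    """Left sequence (a): x_n = 3 - 2/n, y_n = 1 + 3/n (assumed reading of the notes)."""
    return (3 - 2 / n, 1 + 3 / n)


def b(n):
    """Right sequence (b): x_n = 3 - (2/n) * (-1)^n, y_n = 1 + 3/n (assumed reading)."""
    return (3 - (2 / n) * (-1) ** n, 1 + 3 / n)


for n in (1, 2, 3, 10, 100, 1000):
    print(n, a(n), b(n), math.dist(a(n), LIMIT), math.dist(b(n), LIMIT))
```

Under this reading both distances equal sqrt(13)/n, so for any ε the terms with n > sqrt(13)/ε lie inside the ε-ball around (3, 1). A finite printout only illustrates convergence; the inequality is what proves it.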

2.3.4 Limit of a sequence in R^n
The limit of a sequence in R^n is very similar to that in R^1 and R^2 (except we can't really visualize it²),
and the definition carries over from R^2 completely.

Definition. The sequence $\{a_n\}_{n=1}^{\infty}$ in R^n converges to L if

$$\forall \varepsilon > 0\ \exists n_\varepsilon : \forall n > n_\varepsilon,\ \|a_n - L\| < \varepsilon$$

2.3.5 Accumulation points of a sequence

Suppose we have a sequence in R^1 given by a_n = (−1)^n. Its behavior is pretty straightforward: the
sequence jumps back and forth between −1 and 1 ad infinitum. We may be tempted to say it has two limits,
but that would only be half-right: we could apply the definition of a limit either to all odd terms
or to all even terms, but not to both. A point a such that, for every ε > 0, infinitely many terms of the
sequence (but not necessarily all of them) lie within the ball B_ε(a) is called an accumulation point of the sequence. Note that the limit is
always an accumulation point. Also, if a bounded sequence has exactly one accumulation point, then this point is the
limit.
Although the concept of an accumulation point of a sequence is rarely used, it will be helpful in
understanding the limit of a function.
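
A quick numerical experiment can make the idea tangible. The sketch below counts how many of the first terms of a_n = (−1)^n fall inside a ball of radius 0.5 around a candidate point; the helper name `count_in_ball` is hypothetical. A finite count can only suggest, not prove, that infinitely many terms lie in the ball.

```python
def a(n):
    """The sequence a_n = (-1)^n for n = 1, 2, 3, ..."""
    return (-1) ** n


def count_in_ball(center, eps, n_max):
    """Count the terms a_1, ..., a_{n_max} that lie in the ball B_eps(center)."""
    return sum(1 for n in range(1, n_max + 1) if abs(a(n) - center) < eps)


for center in (-1, 1, 0):
    print(center, [count_in_ball(center, 0.5, n_max) for n_max in (10, 100, 1000)])
```

The counts around −1 and around 1 keep growing with n_max (half of the terms each), while the count around 0 stays at zero, which matches the claim that −1 and 1 are accumulation points and neither is the limit.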

² A mathematician and an engineer go to a physics talk where the speaker discusses 23-dimensional models for
spacetime. Afterwards the mathematician says "that talk was great!" and the engineer is shaking his head and is very
confused: "The guy was talking about 23-dimensional spaces. How do you picture that?" "Oh," says the mathematician,
"it's very easy. Just picture it in n dimensions and set n = 23."

3 Sets
3.1 Open and Closed Sets

3.1.1 Sets (intervals) in R1


Back in high school you learned about the types of intervals. An interval is called closed if it includes
its endpoints. An interval is called open if it doesn't include its endpoints. To understand the difference
between open and closed intervals in a different way, let's take a closed interval [a, b] and try to draw some R^1 balls (intervals, that is)
on it:

[Figure: closed interval [a, b] on the number line, with balls drawn around its points.]

Note that a ball drawn around the endpoints a or b does not lie inside [a, b] completely. For point b,
the right-hand side of the ball will necessarily be outside the interval; for point a, the left-hand side.
Now consider an open interval (a, b):

[Figure: open interval (a, b) on the number line, with balls drawn around its points.]

In contrast to a closed interval, for any point in an open interval we can find a ball that
lies inside the interval completely. We call such a point internal:
Definition. An internal point is a point such that we can draw an ε-ball around it that contains
only points from the set.

Then, an open interval consists only of internal points. In fact, this observation is true for all open
sets in R^n. Formally, though, the definition of an open set is:

Definition. The set S is called open if for every point x in S there exists an ε-ball centered at x that
lies in S completely.

3.1.2 Sets in R^2
Consider a circle x^2 + y^2 = 1 on a plane:

[Figure: unit circle centered at the origin in the x-y plane.]

The circle naturally divides the whole plane into two parts: the inside and the outside. However,
we also need to decide which part the circle itself belongs to: inside or outside. So there are four
arrangements in total:

[Figure: four panels, left to right, each showing the unit circle and one of the regions below.]

x^2 + y^2 ≤ 1    x^2 + y^2 < 1    x^2 + y^2 ≥ 1    x^2 + y^2 > 1
closed           open             closed           open
bounded          bounded          unbounded        unbounded

Note that the difference between open and closed sets is whether they contain their boundary points.
Intuitively, boundary points lie on, duh, the bounds, which implies that however small the balls we draw around these
points, some part of each ball will necessarily be inside the set and some part will be outside
(check the picture for the closed interval above if this is not obvious). Formally:

Definition. A boundary point is a point such that every ε-ball around it contains both points from the set and
points not from the set.

The way to remember the distinction between closed and open sets is that a closed set contains all its
boundary points, while an open set doesn't contain any of its boundary points (check the picture above with
four circles to confirm). If a set contains some but not all of its boundary points, then it's neither open
nor closed.
Also note that the two leftmost sets are bounded, while the two rightmost are unbounded. Both visually
and intuitively it's obvious: a set is bounded if it doesn't extend to infinity in any direction (extending to
infinity even along a single line in R^2 is enough to become unbounded), so here's the formal definition:

Definition. A set is called bounded if it is contained within some ball.

Note that a bounded set doesn't need to be round itself. It simply needs to fit
inside some ball, and as long as it doesn't extend to infinity in any direction, it's bounded.

3.1.3 Sets in Rn
In R^n everything stays basically the same, except we need more axes.
Now, hopefully having acquired some intuition, we move on to define closedness of a set formally. To do
this, let's get back to R^1. Consider an open interval (0, 1) and this sequence on it:

a_1 = 0.9
a_2 = 0.99
a_3 = 0.999
...
Each member of the sequence is in the interval. However, lim a_n = 1 is outside the interval. With
this sequence in mind, proceed to the definition of a closed set:

Definition. A set is called closed if it contains the limit of every convergent sequence of elements from
the set.

Note that [0, 1] contains lim a_n, as well as the limit of any other convergent sequence that consists of points
of the interval. Thus, we know that it's closed.

Example 1. Consider the interval (0, +∞). In R^1 it is given by {x | x > 0}, and it's open and
unbounded; in R^2 it is given by {(x, y) | x > 0, y = 0}, and it's still unbounded but no longer open,
since a line on a plane consists entirely of boundary points (if you try to imagine these sets, this will
become obvious).

Example 2. (example taken from 31.10.11 mock)

Let D be the domain of the function $f(x, y) = \ln(x) + \sqrt{y - x}$. Find D, the set D^o of internal
points of D, and the set ∂D of boundary points of D.

Solution. The domain is the set of all possible inputs of the function, which means we have to ensure that ln(x)
and $\sqrt{y - x}$ receive valid inputs:

ln(x) ⇒ x > 0
$\sqrt{y - x}$ ⇒ y ≥ x

So D is {(x, y) | x > 0 and y ≥ x}. Graphically:

[Figure: the region D in the x-y plane.]

From the picture we can see that the internal points D^o are {(x, y) | x > 0 and y > x}. Recalling the
definition of a boundary point, which says that it is a point such that every ε-ball around it contains both
points from the set and points not from the set, we see that the vertical line x = 0 and the diagonal y = x satisfy
it. So the boundary points ∂D are {(x, y) | (x = 0 and y ≥ 0) or (x ≥ 0 and y = x)}. Also, we can see that
the set D is neither open nor closed, as it contains some but not all of its boundary points.
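
The three sets from this example can be written as simple predicates, which makes it easy to test individual points. This is a minimal sketch that assumes the function is f(x, y) = ln(x) + sqrt(y − x) (the square root appears to have been lost in this copy of the notes); the function names are hypothetical.

```python
import math


def in_domain(x, y):
    """True if (x, y) is in D, i.e. x > 0 and y >= x."""
    return x > 0 and y >= x


def is_internal(x, y):
    """True if (x, y) is an internal point of D, i.e. x > 0 and y > x."""
    return x > 0 and y > x


def is_boundary(x, y):
    """True if (x, y) is a boundary point: x = 0 with y >= 0, or y = x with x >= 0."""
    return (x == 0 and y >= 0) or (x >= 0 and y == x)


def f(x, y):
    """Evaluate f(x, y) = ln(x) + sqrt(y - x), rejecting points outside D."""
    if not in_domain(x, y):
        raise ValueError("point is outside the domain D")
    return math.log(x) + math.sqrt(y - x)


print(in_domain(1, 2), is_internal(1, 2))   # a point strictly inside D
print(in_domain(1, 1), is_boundary(1, 1))   # on the diagonal: boundary point that belongs to D
print(in_domain(0, 1), is_boundary(0, 1))   # on the line x = 0: boundary point outside D
```

The last two lines show why D is neither open nor closed: one boundary point belongs to D and another does not.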

Theorem. The complement of a closed set is an open set; the complement of an open set is a closed set.

Example. The complement of the closed interval [0, 1] is the open set (−∞, 0) ∪ (1, +∞); the complement of
the open interval (0, 1) is the closed set (−∞, 0] ∪ [1, +∞).

Fun fact. The empty set and its complement (the entire line in R^1, the entire plane in R^2, and so on) are the
only sets in R^n that are simultaneously open and closed.

3.2 Compact Sets

Now, recall the mental image of the set {(x, y) | x^2 + y^2 ≤ 1} and proceed to the following definition:

Definition. A set is called compact if it is both closed and bounded.

If you want to understand the significance of this definition, let me return to first-year calculus
for a moment (you can skip this otherwise). Consider the problem of finding the maximum and the
minimum of a function on an interval. Immediately we start thinking about the first and perhaps second
derivative. But wait, how do we know that the min and max even exist? Consider this function, which is
discontinuous on the closed interval [a, b]:

[Figure: graph of a discontinuous function y(x) on [a, b]; the locations of the min and max are marked with question marks.]

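It's pretty obvious that it has neither a min nor a max on [a, b]. Well, maybe requiring the function to be
continuous would suffice? Then consider this function, which is continuous on an open interval (a, b):

[Figure: graph of a continuous function y(x) on (a, b); the min is marked, and the max is marked with a question mark.]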

The minimum is attained somewhere between a and b. But since b is not included, it can't be the point
where f attains its maximum. Where is the maximum, then? Somewhere really close to b, obviously. Let's call
this point c. Now move to c + ε for a small enough ε. We're still inside the interval, as c + ε < b, but f(c + ε) > f(c). So f(c)
is not the maximum. It's easy to see that there is in fact no point where f attains its maximum.
So it turns out that the function has to be both continuous and defined on a closed interval for
us to be confident that it attains both its min and max.

[Figure: graph of a continuous function y(x) on [a, b] with both the min and the max marked.]

This might remind you of the Extreme Value Theorem, and it actually is:

Extreme Value Theorem (EVT). If f is continuous on the closed interval [a, b], then f attains
its minimum and maximum values on [a, b].
Note, however, that EVT gives sufficient conditions, i.e. if the conditions of EVT are satisfied, then the min and max
are attained for sure. However, these conditions are not necessary for the min and max to be attained.
Consider this function, which is not continuous and is defined on an open interval:

[Figure: graph of a discontinuous function y(x) on (a, b) with both the max and the min marked.]

Both extreme values exist. So the point of EVT is simply to provide us with a shortcut. If it is
satisfied, the extremes exist. If not, they may or may not.
Okay, back to matec. A large part of the Math for Economists course is dedicated to the extension
of the problem we examined above, except f becomes a function of several variables and the constraints
are much more complex than a ≤ x ≤ b. The Weierstrass Theorem, which will be introduced further in the
text, provides a similar shortcut for these cases. The difference is that instead of the function needing
to be continuous and defined on a closed interval, it will have to be continuous and defined on a
compact set, i.e. a set that is both closed and bounded.

3.3 Appendix 1

Unless you're very comfortable with the definitions of open and closed sets, the following is
pretty hard to understand. If you don't get it the first time, try to absorb as much
as possible initially and then reread this a couple of days later.
Earlier I wrote that an easy way to remember the distinction between open and closed sets is this:
1. An open set doesn't contain any of its boundary points.
2. A closed set contains all its boundary points.
How is this heuristic connected to the formal definitions of open and closed sets? Let's start with
closed sets.
Recall from the definition that a closed set is a set that contains the limit of every convergent
sequence that consists of points from the set. To show the equivalence of these two definitions we have
to show that the set contains all its boundary points if and only if it contains the limit of every such sequence. If, for every boundary point, we
could find a sequence of points from the set that converges to this boundary point, then, by definition, we would show
that a closed set contains all its boundary points.

[Figure: set S with a sequence a_1, a_2, a_3, a_4, ... of its points converging to a boundary point L = lim(a_n).]

Let's pick an arbitrary boundary point L. Since L is a boundary point, there are both
points belonging to S and points outside S arbitrarily close to L (in any ε-ball around L). Next, just
pick a sequence {a_n} from S such that each next term lies in an ε-ball around L with a smaller and smaller
radius, thus having L as its limit. Now L must belong to S by the definition of a closed set.
Finally, repeat the process for all boundary points of S.
For open sets it's simpler: the definition of an open set basically says that it only contains internal
points (since around each point you can draw a ball that would reside inside the set completely), which
is pretty much equivalent to saying that it doesn't contain any of its boundary points.

3.4 Appendix 2

Think about the following questions for a few moments, before reading the answers:

1. Consider a finite union of closed sets. Is it open or closed?

2. Consider any (finite or infinite) intersection of closed sets. Is it open or closed?

3. Consider any (finite or infinite) union of open sets. Is it open or closed?

4. Consider a finite intersection of open sets. Is it open or closed?

5. Is it true that any set consists of only boundary and interior points?

3.4.1 Answers.

Theorem (questions 1 and 2).

1. A finite union of closed sets is a closed set.

Counterexample for an infinite union of closed sets that forms an open set: $\bigcup_{n=1}^{\infty} \left[1 + \frac{1}{n},\ 2 - \frac{1}{n}\right] = (1, 2)$.

2. Any (finite or infinite) intersection of closed sets is a closed set.

Theorem (questions 3 and 4).

1. Any (finite or infinite) union of open sets is an open set.

2. A finite intersection of open sets is an open set.

Counterexample for an infinite intersection of open sets that forms a closed set: $\bigcap_{n=1}^{\infty} \left(1 - \frac{1}{n},\ 2 + \frac{1}{n}\right) = [1, 2]$.
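
The notes state the two counterexamples without justification. One possible way to verify them, given here only as a sketch of the argument, is to check both inclusions directly:

```latex
\[
\bigcup_{n=1}^{\infty}\left[1+\tfrac{1}{n},\; 2-\tfrac{1}{n}\right] = (1,2).
\]
If $1 < x < 2$, choose $n$ with $\tfrac{1}{n} \le \min(x-1,\; 2-x)$; then
$x \in \left[1+\tfrac{1}{n},\; 2-\tfrac{1}{n}\right]$.
Conversely, $1+\tfrac{1}{n} > 1$ and $2-\tfrac{1}{n} < 2$ for every $n$,
so the points $1$ and $2$ belong to none of the intervals.

\[
\bigcap_{n=1}^{\infty}\left(1-\tfrac{1}{n},\; 2+\tfrac{1}{n}\right) = [1,2].
\]
If $1 \le x \le 2$, then $1-\tfrac{1}{n} < x < 2+\tfrac{1}{n}$ for every $n$.
If $x > 2$, choose $n$ with $\tfrac{1}{n} \le x-2$; then $x \ge 2+\tfrac{1}{n}$,
so $x$ is missing from that interval. The case $x < 1$ is symmetric.
```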

Question 5. Consider the set [0, 1] ∪ {2} in R^1:

[Figure: number line with the closed interval [0, 1] and the separate point 2 marked.]

Is the point 2 boundary or interior? If you check with the definitions, it's neither. Points like this
are called isolated. Thus, there are three kinds of points, which means that the boundary and internal
points are not necessarily complements.

4 Multivariable Functions. Continuity.
Functions and sequences are basically the same things, except that sequences are discrete (they're
defined on the set of natural numbers and we can number each term of a sequence: 1, 2, 3, ...), while
functions are defined on the set of real numbers, which is uncountable. One could say that a sequence
is a function on N.

4.1 Level curves

Drawing functions of one variable is okay; drawing functions of two variables is hard. This is
why, when we have a function of two variables, we frequently try to visualise it on a usual 2-axis graph.
Suppose we have a function h = f(x, y) which shows the height above sea level of some piece of land.
The most natural way to visualise it would be the following:

[Figure: contour map in the x-y plane with curves labelled h = 100, h = 200, h = 300, and h = 400.]

The lines on the graph are the level curves of the function h = f (x, y). For example, for h = 100,
the level curve is given by

{(x, y) | f (x, y) = 100}


Definition. The c-level curve of the function h = f(x, y) for some level c is given by

{(x, y) | f(x, y) = c}
Note that the indifference curve from micro is a level curve in this course.

4.2 Limit of a Function

"ICEF (МИЭФ) is a unique place where they tell you three times what a limit is, and every time
they charge six hundred thousand for it."
Alexey Akhmetshin

The limit of a function is pretty much the same thing as the limit of a sequence (check section 2.3.2
for an explanation). In fact, as mentioned earlier, we could view sequences as a special kind of
functions for which the only values are f(1), f(2), f(3), and so on.
You'll probably never be asked to actually employ the definition of a limit, but you definitely need
to understand it conceptually to be able to prove either the existence or the absence of a limit of a
function at a point.
While calculating the limits of multivariable functions, we can employ the same operations as for
single-variable functions (the limit of a sum, product, quotient). The problem appears when we try to deal
with indeterminate forms ($\frac{\infty}{\infty}$ and $\frac{0}{0}$): L'Hospital's rule doesn't work for multivariable functions. This means
that when faced with an indeterminate form we have to find other ways around it. This chapter shows the most
common techniques.

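Example 1: multiplication by the conjugate

$$\lim_{\substack{x\to 0\\ y\to 0}} \frac{xy}{3 - \sqrt{xy + 9}} =$$

Multiply the numerator and the denominator of the fraction by $3 + \sqrt{xy + 9}$ and apply $(a + b)(a - b) = a^2 - b^2$ to the denominator:

$$= \lim_{\substack{x\to 0\\ y\to 0}} \frac{xy\left(3 + \sqrt{xy + 9}\right)}{9 - (xy + 9)} = \lim_{\substack{x\to 0\\ y\to 0}} \frac{xy\left(3 + \sqrt{xy + 9}\right)}{-xy} = \lim_{\substack{x\to 0\\ y\to 0}} \frac{3 + \sqrt{xy + 9}}{-1} = -6$$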
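
A numerical sanity check of this result is easy to write. The sketch below evaluates the original fraction at points approaching (0, 0) along a few arbitrarily chosen paths; it does not prove the limit, it only shows that the values cluster around −6. The function is undefined where xy = 0, so the sample paths avoid the axes.

```python
import math


def f(x, y):
    """f(x, y) = xy / (3 - sqrt(xy + 9)); undefined where xy = 0."""
    return x * y / (3 - math.sqrt(x * y + 9))


# Approach (0, 0) along the paths y = x, y = -x and y = 2x
for t in (1e-1, 1e-2, 1e-3):
    print(t, f(t, t), f(t, -t), f(t, 2 * t))
```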

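Example 2: change of variables and equivalences

$$\lim_{\substack{x\to 0\\ y\to -3}} \frac{\ln(1 + xy)}{x} = \lim_{\substack{x\to 0\\ y\to -3}} \frac{\ln(1 + xy)\, y}{xy} =$$

Substitute z = xy → 0 and apply lim(f · g) = lim(f) · lim(g):

$$= \lim_{\substack{z\to 0\\ y\to -3}} \frac{\ln(1 + z)}{z} \cdot y =$$

Recall that for t → 0, ln(1 + t) ∼ t:

$$= \lim_{\substack{z\to 0\\ y\to -3}} \frac{z}{z} \cdot y = -3$$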
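
The same kind of numerical check works for this example. The sketch below uses `math.log1p` for accuracy near zero and approaches the point (0, −3) from both sides in x, and once with y moving as well; the sample points are arbitrary choices made for illustration.

```python
import math


def f(x, y):
    """f(x, y) = ln(1 + xy) / x, defined for x != 0 and 1 + xy > 0."""
    return math.log1p(x * y) / x


for x in (1e-2, 1e-4, 1e-6):
    print(x, f(x, -3.0), f(-x, -3.0), f(x, -3.0 + x))
```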

4.3 Continuity

4.3.1 Continuity on R1
Definition. f(x) is continuous at the point x_0 if $\lim_{x\to x_0^-} f(x) = f(x_0)$ and $\lim_{x\to x_0^+} f(x) = f(x_0)$.

Check the picture below to understand why we need these conditions:

[Figure: four panels, left to right: left limit not ok, right limit ok; left limit ok, right limit not ok; left limit not ok, right limit not ok; left limit ok, right limit ok.]

4.3.2 Continuity on R^2 and R^n
While there are only two ways to approach a point on the line (from the left or from the right), on a plane (and
in higher dimensions) there's an infinite number of directions to do that (recall the examples from
section 2.3.3), and the definition of a limit extends to accommodate this fact:

Definition. f(x) is continuous at the point x_0 if $\lim_{x\to x_0} f(x) = f(x_0)$.
Now, in order to prove the existence of a limit, we need to check all possible paths of approach. To prove that
the limit doesn't exist, we just need to show that while approaching the point along two different
paths, the function approaches different values (a parallel to sequences: we can prove that the limit of
a sequence doesn't exist by showing that it has two accumulation points). Most often we use the
following technique when we want to show that the limit doesn't exist:

1. Approach the point along the line y = kx

2. Approach the point along the parabolic curve y = kx^2

3. And so on.

Usually, just checking y = kx is enough. Checking y = kx^2 is almost always enough. But nothing
hypothetically stops Demeshev or Bukin from coming up with a function where you need to check
y = kx^15 or something to prove that the limit doesn't exist.
Exam tip: the main difficulty when solving such a problem is to recognize that the limit doesn't
exist and not waste time trying to find it.

Example 1. Find the limit of the following function as x → +∞, y → +∞ (i.e. as we go toward the
upper right-hand corner, away from the origin) or prove that it doesn't exist:

$$f(x, y) = \frac{x^2 + y^4}{x^4 + y^2}$$

Solution: Look at directions y = kx:

$$\lim_{x\to\infty} \frac{x^2 + k^4 x^4}{x^4 + k^2 x^2} = \lim_{x\to\infty} \frac{x^2 (1 + k^4 x^2)}{x^2 (x^2 + k^2)} = \lim_{x\to\infty} \frac{1 + k^4 x^2}{x^2 + k^2}$$

Dropping 1 and k^2, as they're bounded and won't matter, we get:

$$\lim_{x\to\infty} \frac{k^4 x^2}{x^2} = k^4$$

This means that the limit of f(x, y) depends on the line along which we approach ∞. For example,
moving along the line y = 1 · x, f(x, y) → 1; moving along the line y = 5 · x, f(x, y) → 625. Thus,
$\lim_{x\to\infty,\ y\to\infty} f(x, y)$ doesn't exist.

Example 2. Find the limit of the following function as $x \to \infty$, $y \to \infty$ or prove that it doesn't exist:

$$f(x, y) = \frac{x^2 y}{x^4 + y^2}$$

Solution: Look at directions $y = kx$:

$$\lim_{x \to \infty} \frac{k x^3}{x^4 + k^2 x^2} = \lim_{x \to \infty} \frac{k x^3}{x^2 (x^2 + k^2)} = \lim_{x \to \infty} \frac{k x}{x^2 + k^2} = \lim_{x \to \infty} \frac{k x}{x^2} = 0$$

Wait-wait-wait; what if we use parabolas y = kx2 ?

$$\lim_{x \to \infty} \frac{k x^4}{x^4 + k^2 x^4} = \lim_{x \to \infty} \frac{k x^4}{x^4 (1 + k^2)} = \frac{k}{1 + k^2}$$
So actually the limit does not exist! $y = kx$ just couldn't provide us with the right curve.
Protip: by checking some, not all, directions (such as $y = kx$ in this example) we can only prove that the limit does not exist (if we find different limits along different directions). Finding the same limit along only some directions doesn't tell us anything about the existence of the actual limit!
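The path technique from Example 2 can also be checked numerically. The sketch below is only an illustration, not a proof: it evaluates the function at one large value of x (the choice `x_far = 1e6` and the helper names are assumptions made for this example) along a few lines and parabolas.

```python
def f(x, y):
    """Function from Example 2: x^2 * y / (x^4 + y^2)."""
    return x**2 * y / (x**4 + y**2)


def along_line(k, x):
    """Value of f on the straight line y = k*x."""
    return f(x, k * x)


def along_parabola(k, x):
    """Value of f on the parabola y = k*x^2."""
    return f(x, k * x**2)


x_far = 1e6  # a large x standing in for x -> infinity

# Along every line the values are tiny (the limit along lines is 0).
line_values = [along_line(k, x_far) for k in (1.0, 2.0, 5.0)]

# Along parabolas the values sit at k / (1 + k^2), which depends on k.
parabola_values = [along_parabola(k, x_far) for k in (1.0, 2.0)]

print(line_values)
print(parabola_values)
```

The lines all agree with each other, while the two parabolas give two different values, which is exactly why the limit does not exist.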

Example 3. Find the limit of the following function as $x \to \infty$, $y \to \infty$ or prove that it doesn't exist:

$$f(x, y) = \frac{x^{15}}{y}$$

4.4 Finding Limits with Polar Coordinates

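Usually, we define a point on a plane using the Cartesian system with $x$ and $y$ axes. An alternative way to uniquely identify the point is by its distance from the origin and the angle:

[Figure: point $A = (x_A, y_A) = (r_A, \theta_A)$ in the $xy$-plane, at distance $r$ from the origin and at angle $\theta$ to the $x$-axis.]

To refresh our memory: $\sin\theta = \frac{y}{r}$, $\cos\theta = \frac{x}{r}$, $\tan\theta = \frac{y}{x}$.
So $r = \sqrt{x_A^2 + y_A^2}$, while $\theta = \arctan\frac{y_A}{x_A}$.
The inverse conversion is $x = r \cdot \cos\theta$, $y = r \cdot \sin\theta$.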

Protip: Polar coordinates are very useful when we deal with $x^2 + y^2$ in limits.

Example 1. This example simply shows explicitly how the change of coordinates to polar works:

$$\lim_{\substack{x \to 3 \\ y \to 4}} f(x, y) = \lim_{\substack{r \to 5 \\ \theta \to \arctan(4/3)}} f(r\cos\theta, r\sin\theta) = \lim_{\substack{r \to 5 \\ \theta \to \arctan(4/3)}} g(r, \theta)$$

The question that might pop up is: why do we switch to $g$, rather than continue working with $f$ in polar coordinates? Consider $f(x, y) = x + y$. Then $f(r, \theta) = r + \theta$. That's hardly what we wanted to achieve, so we introduce $g(r, \theta) = r\cos\theta + r\sin\theta$.

Example 2.
$$\lim_{\substack{x \to 0 \\ y \to 0}} f(x, y) = \lim_{\substack{r \to 0 \\ \theta \to \text{any}}} g(r, \theta)$$

since when $r = 0$ we don't have any information about the angle.

Example 3. (example taken from 25.10.12 mock)

$$\lim_{\substack{x \to 0 \\ y \to 0}} \frac{x^3 + y^3}{x^2 + y^2} = \lim_{\substack{r \to 0 \\ \theta \to \text{any}}} \frac{r^3\cos^3\theta + r^3\sin^3\theta}{r^2\cos^2\theta + r^2\sin^2\theta} =$$

Using $\cos^2 x + \sin^2 x = 1$

$$= \lim_{\substack{r \to 0 \\ \theta \to \text{any}}} r(\cos^3\theta + \sin^3\theta) = 0$$

Since both $\cos$ and $\sin$ are restricted by $-1$ and $1$, and are therefore bounded, $\cos^3\theta + \sin^3\theta$ won't actually matter.
Notice that we could see from the very beginning that the limit is equal to 0, as, close to the origin, $x^3$ and $y^3$ are much smaller than $x^2$ and $y^2$.
The result about the unimportance of bounded values when dealing with infinities is mostly obvious, but still there's a theorem for it:

Theorem. The limit of the product of an infinitely large quantity and a nonzero constant $c$ is infinite: $+\infty \cdot c = \infty$ and $-\infty \cdot c = \infty$ (here, by $\infty$, I mean either $+\infty$ or $-\infty$, depending on the sign of $c$). The limit of the product of an infinitely small quantity and a bounded quantity $c$ is zero: $0 \cdot c = 0$.

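Example 4.
$$\lim_{\substack{x \to 0 \\ y \to 0}} \frac{x^2 y^2}{(x^2 + y^2)^2} = \lim_{\substack{r \to 0 \\ \theta \to \text{any}}} \frac{r^4\cos^2\theta\sin^2\theta}{r^4} = \cos^2\theta\sin^2\theta$$

Since $\cos^2\theta\sin^2\theta$ changes with a change in $\theta$, and we can approach the point $(0, 0)$ at any angle $\theta$, there is no limit of this function.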
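A small numerical sketch of Examples 3 and 4 in polar form: the function is evaluated at a tiny radius for several angles. The radius `r_small` and the chosen angles are arbitrary assumptions for illustration, and the helper names are made up for this example.

```python
import math


def f3(x, y):
    """Example 3: (x^3 + y^3) / (x^2 + y^2)."""
    return (x**3 + y**3) / (x**2 + y**2)


def f4(x, y):
    """Example 4: x^2 * y^2 / (x^2 + y^2)^2."""
    return x**2 * y**2 / (x**2 + y**2) ** 2


def at_polar(func, r, theta):
    """Evaluate func at the point with polar coordinates (r, theta)."""
    return func(r * math.cos(theta), r * math.sin(theta))


r_small = 1e-6  # a radius close to the origin
angles = [0.0, math.pi / 6, math.pi / 4, math.pi / 3]

# Example 3: values shrink with r whatever the angle.
values3 = [at_polar(f3, r_small, t) for t in angles]

# Example 4: values equal cos^2(theta) * sin^2(theta), so they depend on the angle.
values4 = [at_polar(f4, r_small, t) for t in angles]

print(values3)
print(values4)
```

For Example 4 the radius cancels completely, so shrinking `r_small` further would not change `values4` at all.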

5 Differentiation of Multivariable Functions. Approximation
5.1 Taylor Series Introduction

Let's start with a real-life example: so imagine a car. We know that the car's position at a time t0
is s(t0 ) = s0 . However, we don't know its speed. Neither do we know its acceleration. If we are asked
about the car's position at a time t1 , what do we say? The only value we can use is s0 , so there's no
choice but to say that s(t1 ) ≈ s0 . Okay. Now, we suddenly learn about the car's speed at the moment
t0 : v = v0 . What do we say s1 is now? Since we don't know anything about the car's acceleration,
we'll just have to assume its speed is constant. Then, s(t1 ) ≈ s0 + v0 (t1 − t0 ), i.e. the car's initial
position and what we assume it drove during the period between t0 and t1 . Okay, better. But what if
we know car's acceleration at t0 : a = a0 as well? Surely we want to use this information, but how do
we do it?
In your high school physics class you were just given a formula:

$$s_1 = s_0 + vt + \frac{at^2}{2}$$
Today you learn that this formula is a special case of Taylor series.
Back to math. Car's speed is the rate of change of its position, therefore it's the first derivative $s'(t)$ of $s(t)$. Car's acceleration is the rate of change of its speed, therefore it's the second derivative $s''(t)$ of $s(t)$. What if the car's acceleration varies as well? And the rate of change of its acceleration? And so on? Only using $s''(t)$ would be a waste of all the derivatives that follow it.
Okay, here's the formula:

$$s(t_1) \approx s_0 + s'(t_0)(t_1 - t_0) + \frac{s''(t_0)(t_1 - t_0)^2}{2} + \frac{s'''(t_0)(t_1 - t_0)^3}{6} + \ldots$$
In a more general form:

$$f(x) \approx f(x_0) + f^{(1)}(x_0)(x - x_0) + \frac{f^{(2)}(x_0)(x - x_0)^2}{2!} + \frac{f^{(3)}(x_0)(x - x_0)^3}{3!} + \ldots$$
And in the most general form possible:

$$f(x) \approx \frac{f^{(0)}(x_0)(x - x_0)^0}{0!} + \frac{f^{(1)}(x_0)(x - x_0)^1}{1!} + \frac{f^{(2)}(x_0)(x - x_0)^2}{2!} + \ldots$$
where $f(x) = f^{(0)}(x)$, $f'(x) = f^{(1)}(x)$ and so on. $n! = 1 \cdot 2 \cdot 3 \cdot \ldots \cdot n = \prod_{k=1}^{n} k$ and $0! = 1$.

Definition. Taylor series of a function is given by

$$f(x) \approx \sum_{n=0}^{\infty} \frac{f^{(n)}(x_0)(x - x_0)^n}{n!}$$

The thing is, it's really hard to understand exactly how we get this formula (I don't really get
it myself; blame Akhmetshin :p). If you're interested, Wikipedia has a really great article on Taylor
series (link). But you probably want to memorize the formula and be able to use it when
asked. Taylor series turns up everywhere!
As a rule of thumb, the more derivatives we use, the better the approximation is. However, this is not universal, and even using an infinite number of derivatives does not guarantee convergence to the true value. Fortunately, within the course the functions are all so nice that we can forget about this and just use Taylor series blindly.
Some terminology: the case when we used the car's speed (the first derivative) was an instance of first-order approximation. The case when we used its acceleration (the second derivative) was an instance of second-order approximation. These are the only two cases we're going to look at in depth during this course.
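To see first-order and second-order approximation side by side, here is a minimal sketch. The function `taylor_polynomial` is a helper made up for this illustration, and the test function $e^x$ around $x_0 = 0$ is an assumed example (chosen because all of its derivatives at 0 equal 1).

```python
import math


def taylor_polynomial(derivatives_at_x0, x0, x):
    """Sum f^(n)(x0) * (x - x0)^n / n! over the supplied derivatives.

    derivatives_at_x0[n] holds the n-th derivative at x0
    (index 0 is the function value itself).
    """
    return sum(
        d * (x - x0) ** n / math.factorial(n)
        for n, d in enumerate(derivatives_at_x0)
    )


# Assumed example: f(x) = exp(x) around x0 = 0, every derivative equals 1.
x0, x = 0.0, 0.5
exact = math.exp(x)
first_order = taylor_polynomial([1.0, 1.0], x0, x)        # 1 + x
second_order = taylor_polynomial([1.0, 1.0, 1.0], x0, x)  # 1 + x + x^2/2

print(exact, first_order, second_order)
```

Here the second-order value is closer to the exact one than the first-order value, in line with the rule of thumb above.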

5.2 First-Order (Linear) Approximation for One Variable

Back to normal functions. Hopefully, the idea of using derivatives to approximate functions is pretty intuitive now. If we have a function of one variable such that calculating its value at $x_0$ is trivial, while doing the same thing at $x_0 + \varepsilon$ is nearly impossible, we use an approximation, usually first-order:

[Figure: graph of $y(x)$ with its tangent line at $x_0$, comparing the approximation with the actual value at $x_0 + \varepsilon$.]

Recall the graphical interpretation of the derivative: it is the slope of the function at a point (or of its tangent line at a point). Equivalently, it is shown by the change in $f$ given $dx = 1$. The closer to $x_0$ we are, the less the slope changes and the more accurate the approximation is. For one variable, the generalized form of first-order approximation is:

$$f(x) \approx f(x_0) + f'(x_0)dx = f(x_0) + f'(x_0)(x - x_0)$$

5.3 First-Order (Linear) Approximation for Two Variables

We want to generalize the method of using the derivative to approximate a function to functions of two variables:

$$z = f(x, y)$$
In order to do this, we'll decompose the total change of the function's value into the change due to the change in $x$, $dx$, and the change due to the change in $y$, $dy$.
First with $dx$. To isolate the change of $z$ due to $dx$ we need to fix $y$, i.e. take $y$ as if it were a constant.

$$z = f(x, y_0)$$
Then take this function's derivative, which is the rate of change of the function $f$ along the line $y = y_0$

$$z'_x = f'(x, y_0)$$
at a specific point $(x_0, y_0)$. This is called a partial derivative.
Definition. If it exists³, the partial derivative of $z$ with respect to $x$ is given by

$$f'_x(x_0, y_0) = f'_x = \frac{\partial z}{\partial x}$$
Note that partial derivatives are denoted with $\partial$, rather than $d$. The change in $z$ due to the change in $x$ is

$$\frac{\partial z}{\partial x}dx = f'_x dx$$
³ This is almost always the case within this course, but we can actually come up with a simple looking function that is not differentiable, e.g. $y = (x^2)^{0.5}$, which we usually write as $y = |x|$.

Now, repeat the same operation for $dy$.

$$z = f(x_0, y)$$

$$f'_y = \frac{\partial z}{\partial y}$$
And the change in $z$ due to the change in $y$ is

$$\frac{\partial z}{\partial y}dy = f'_y dy$$
Finally, the total change in $z$ equals the change due to $dx$ plus the change due to $dy$. By combining these we get $z$'s first total differential.
Definition. First total differential of a function of two variables is given by

$$dz = \frac{\partial z}{\partial x}dx + \frac{\partial z}{\partial y}dy = f'_x dx + f'_y dy$$
And we can use this result to approximate $z$ close to $(x_0, y_0)$:

$$z = f(x, y) \approx f(x_0, y_0) + f'_x \cdot (x - x_0) + f'_y \cdot (y - y_0)$$
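A minimal sketch of this linear approximation in code. The function $f(x, y) = x^2 y$, the base point $(1, 2)$ and the step `h` are assumptions chosen for illustration; the partial derivatives are estimated with central differences rather than computed by hand.

```python
def f(x, y):
    """Assumed example function: f(x, y) = x^2 * y."""
    return x**2 * y


def partial_x(func, x, y, h=1e-6):
    """Central-difference estimate of the partial derivative in x (y held fixed)."""
    return (func(x + h, y) - func(x - h, y)) / (2 * h)


def partial_y(func, x, y, h=1e-6):
    """Central-difference estimate of the partial derivative in y (x held fixed)."""
    return (func(x, y + h) - func(x, y - h)) / (2 * h)


def linear_approximation(func, x0, y0, x, y):
    """f(x0, y0) + f'_x * (x - x0) + f'_y * (y - y0), partials taken at (x0, y0)."""
    fx = partial_x(func, x0, y0)
    fy = partial_y(func, x0, y0)
    return func(x0, y0) + fx * (x - x0) + fy * (y - y0)


approx = linear_approximation(f, 1.0, 2.0, 1.1, 1.9)
exact = f(1.1, 1.9)
print(approx, exact)
```

By hand, $f'_x(1, 2) = 4$ and $f'_y(1, 2) = 1$, so the approximation is $2 + 4 \cdot 0.1 + 1 \cdot (-0.1) = 2.3$, against an exact value of $2.299$.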

5.4 Tangent Plane

If in $\mathbb{R}^1$ we find the linear approximation of a function's value by its tangent line, in $\mathbb{R}^2$ it is the tangent plane.

Definition. Tangent plane for a function of two variables is given by

$$z = f(x_0, y_0) + f'_x \cdot (x - x_0) + f'_y \cdot (y - y_0)$$

5.5 Directional Derivative

Directional derivative shows the rate of change of a function in a particular direction we picked. It is a concept which can be thought of as a special case of the first total differential. The key difference between them is that the total differential is a function of arbitrary $dx$ and $dy$, while, for the directional derivative, the length of the vector is always 1; in other words, $dx$ and $dy$ become dependent on each other, since $dx^2 + dy^2 = 1$ (check section 2.1.1 on page 5 for explanation). We call such a vector normalized or a unit vector.
Definition. Directional derivative gives the rate of change of $f(x, y)$ at a point $(x_0, y_0)$ in the direction of a unit vector $\vec{u}$ (vector of length 1, where $dx^2 + dy^2 = u_1^2 + u_2^2 = 1$). Its formula is:

$$D_{\vec{u}} f(x, y) = f'_x \cdot u_1 + f'_y \cdot u_2 = \begin{pmatrix} f'_x & f'_y \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \end{pmatrix}$$
"At a point" is important because we calculate $f'_x$ and $f'_y$ at a specific point and plug in concrete numbers. Alternatively, if the length of a vector is not equal to 1 (such a vector is usually denoted as $\vec{l}$), the directional derivative is

$$D_{\vec{l}} f(x, y) = \frac{f'_x \cdot l_1 + f'_y \cdot l_2}{\sqrt{l_1^2 + l_2^2}} = \begin{pmatrix} f'_x & f'_y \end{pmatrix} \cdot \frac{\vec{l}}{\|\vec{l}\|}$$
This implies that whether we pick vector $\begin{pmatrix} 5 \\ 3 \end{pmatrix}$ or $\begin{pmatrix} 15 \\ 9 \end{pmatrix}$, the directional derivative stays the same. Only a change in the ratio $dy/dx$ will change it.

5.6 Gradient

Suppose we want to go in the direction of the maximum growth of a function (assuming vector length is 1 for simplicity). Which $dx$ and $dy$ should we pick? The problem is:

$$f'_x \cdot dx + f'_y \cdot dy \to \max$$

Note that this is the dot product of vectors $\begin{pmatrix} f'_x \\ f'_y \end{pmatrix}$ and $\begin{pmatrix} dx \\ dy \end{pmatrix}$. Since another formulation of the dot product is $\|\vec{a}\| \cdot \|\vec{b}\| \cdot \cos\alpha$ and $\cos$ is maximum ($= 1$) when $\alpha = 0$, it is maximum when the vectors are codirected. Thus, to maximize the function's growth rate we pick $\begin{pmatrix} dx \\ dy \end{pmatrix}$ such that it is codirected with $\begin{pmatrix} f'_x \\ f'_y \end{pmatrix}$. This, in turn, means that $\begin{pmatrix} f'_x \\ f'_y \end{pmatrix}$ itself points in the direction of maximum growth of the function. This vector is called the gradient of a function.

Definition. Gradient of a function $f(x, y)$ is given by

$$\nabla f(x, y) = \begin{pmatrix} f'_x \\ f'_y \end{pmatrix}$$
Which is simply the vector of partial derivatives of a function. Also, now you can see that we can reformulate the directional derivative using the gradient at a specific point:

$$D_{\vec{l}} f = \frac{\nabla f \cdot \vec{l}}{\|\vec{l}\|}$$

Protip: Remember firmly these three key properties of a gradient, as they're very frequently helpful in the exams. If you are too lazy to memorize all of them, pick property 3.

1. Gradient points in the direction of the most rapid growth of the function (discussed above).

2. Length of the gradient is equal to the maximum rate of growth of the function. Proof:

$$D_{\vec{l}} f = \frac{\nabla f \cdot \vec{l}}{\|\vec{l}\|} = \frac{\|\nabla f\| \cdot \|\vec{l}\| \cdot \cos\alpha}{\|\vec{l}\|} = \|\nabla f\| \cdot \cos\alpha = \|\nabla f\|$$
where $\cos\alpha = 1$ from the derivation of a gradient above.

3. Gradient is orthogonal to the level curve.

Example. (taken from ??.11.2008 mock)


Calculate the directional derivative of the function $f(x, y) = 2x^3 + 2y^2$ at the point $A(1, 2)$ in the following directions:

a) $\vec{l} = (1, 3)$

b) $\vec{l}$ which is orthogonal to the curve given by the equation $x^2 + y^2 = 5$

c) Direction of the fastest growth of $f(x, y)$

Solution. In order to learn anything at all about the function, we'll need to know its partial derivatives, so:

$$f'_x = 6x^2 = 6$$
$$f'_y = 4y = 8$$
a) $D = \frac{6 \cdot 1 + 8 \cdot 3}{\sqrt{1^2 + 3^2}} = \frac{30}{\sqrt{10}}$
b) Orthogonal is a synonym for normal in the $xy$-plane, and from Jeffrey we remember that the equation of the normal line is $y = f(x_0) + \frac{-1}{f'(x_0)}(x - x_0)$ and its slope is $\frac{-1}{f'(x_0)}$. By using the Implicit Function Theorem (discussed in the next chapter), i.e. the fact that $y' = -\frac{F'_x}{F'_y}$, we can find $f'$:

$$y' = -\frac{F'_x}{F'_y} = -\frac{2x}{2y} = -\frac{1}{2}$$
The slope orthogonal to this is $-\frac{1}{-\frac{1}{2}} = 2$. Now just pick any vector with $dy$ twice $dx$, e.g. $\begin{pmatrix} 1 \\ 2 \end{pmatrix}$, and calculate the directional derivative in its direction: $D = \frac{6 \cdot 1 + 8 \cdot 2}{\sqrt{1^2 + 2^2}} = \frac{22}{\sqrt{5}}$
c) Direction of the fastest growth is the direction of the gradient, i.e. $\begin{pmatrix} 6 \\ 8 \end{pmatrix}$: $D = \frac{6 \cdot 6 + 8 \cdot 8}{\sqrt{6^2 + 8^2}} = 10$
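The three answers can be reproduced with a few lines of code. This is only a sketch: `directional_derivative` is a helper name invented for the illustration, and the gradient $(6, 8)$ at $A(1, 2)$ is entered by hand.

```python
import math


def directional_derivative(gradient, direction):
    """Dot product of the gradient with the direction, divided by the direction's length."""
    gx, gy = gradient
    lx, ly = direction
    return (gx * lx + gy * ly) / math.hypot(lx, ly)


# Gradient of f(x, y) = 2x^3 + 2y^2 at A(1, 2): (6x^2, 4y) = (6, 8).
gradient_at_A = (6.0, 8.0)

d_a = directional_derivative(gradient_at_A, (1.0, 3.0))     # 30 / sqrt(10)
d_b = directional_derivative(gradient_at_A, (1.0, 2.0))     # 22 / sqrt(5)
d_c = directional_derivative(gradient_at_A, gradient_at_A)  # length of the gradient

print(d_a, d_b, d_c)
```

Note that `d_c` is the largest of the three, as property 2 of the gradient predicts.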

5.7 Chain Rule

Suppose we have a one-dimensional mountain, the height of which at every point $x$ is given by $f(x)$, and a hiker walking on it, whose coordinate at a time $t$ is given by $x(t)$:

[Figure: mountain profile $f(x)$ over the $x$-axis, with the hiker at position $x(t)$.]

Then, if we want to learn the height of the hiker at some time $t$, the function we work with is $f(x(t))$. This was sort of a justification for the existence of the chain rule, but this is where the real world example ends. If you'd like to learn more, Math Insight has a great page about it:
http://mathinsight.org/chain_rule_multivariable_introduction.

5.7.1 f (x(t))
Suppose we have a function $f(x(t))$ and we want to find its first total differential $df$. The change in $f$ is equal to the derivative of $f$ multiplied by the change in the argument:

$$df = \frac{df}{dx}dx$$
But since $x$ depends on $t$, we can't just leave $dx$ be, and $dx$ becomes

$$dx = \frac{dx}{dt}dt$$
Then

$$df = \left(\frac{df}{dx}\frac{dx}{dt}\right)dt$$

Transferring $dt$ to the other side, we find the derivative $\frac{df}{dt}$:

$$\frac{df}{dt} = \frac{df}{dx}\frac{dx}{dt}$$
Alternatively:

$$f'_t = f'_x \cdot x'_t$$

We can draw a diagram⁴ to help us understand this process. It seems complicated and unnecessary right now, but it will help a lot with more complex functions:

[Diagram: $f$ at the top, $x$ below it, $t$ at the bottom.]

Next, just multiply both terms, flowing downwards, to get the exact same result as written above.

5.7.2 f (x(t), y(t))


Recall that once we start to deal with functions of several variables, e.g. $f(x, y)$, $df$ needs to be decomposed into the change due to $dx$ and the change due to $dy$. Note that we use $\partial$'s here, because $\frac{\partial f}{\partial x}$ and $\frac{\partial f}{\partial y}$ are partial derivatives:

$$df = \frac{\partial f}{\partial x}dx + \frac{\partial f}{\partial y}dy$$
But when $x$ and $y$ are dependent variables, e.g. $f(x(t), y(t))$, we need to take that into account, and $df$ becomes:

$$df = \left(\frac{\partial f}{\partial x}\frac{dx}{dt}\right)dt + \left(\frac{\partial f}{\partial y}\frac{dy}{dt}\right)dt$$
Transferring $dt$ to the other side we get:

$$\frac{df}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}$$
Alternatively:

$$f'_t = f'_x \cdot x'_t + f'_y \cdot y'_t$$

However, we could get exactly the same result by drawing a diagram:

[Diagram: $f$ branches to $x$ and $y$; each of them leads down to $t$.]

Now, we multiply along each branch, add up both branches, and get precisely:

$$f'_x \cdot x'_t + f'_y \cdot y'_t$$

⁴ Idea by Paul Dawkins: http://tutorial.math.lamar.edu/Classes/CalcIII/ChainRule.aspx
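The two-branch formula can be sanity-checked numerically. In the sketch below the functions $f(x, y) = x^2 y$, $x(t) = \cos t$ and $y(t) = t^2$ are assumptions picked for illustration; the chain-rule value is compared with a central-difference derivative of the composite function.

```python
import math


def f(x, y):
    """Assumed outer function f(x, y) = x^2 * y."""
    return x**2 * y


def x_of(t):
    """Assumed inner function x(t) = cos(t)."""
    return math.cos(t)


def y_of(t):
    """Assumed inner function y(t) = t^2."""
    return t**2


def chain_rule_derivative(t):
    """f'_t = f'_x * x'_t + f'_y * y'_t, with the partials written out by hand."""
    x, y = x_of(t), y_of(t)
    f_x = 2 * x * y          # partial derivative of f in x
    f_y = x**2               # partial derivative of f in y
    dx_dt = -math.sin(t)
    dy_dt = 2 * t
    return f_x * dx_dt + f_y * dy_dt


def numeric_derivative(t, h=1e-6):
    """Central difference applied directly to the composite f(x(t), y(t))."""
    def composite(s):
        return f(x_of(s), y_of(s))
    return (composite(t + h) - composite(t - h)) / (2 * h)


t0 = 0.7
print(chain_rule_derivative(t0), numeric_derivative(t0))
```

The two printed numbers agree up to the finite-difference error, which is what adding the two branches of the diagram promises.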

5.7.3 f (x(a, b), y(a, b))


Let's find a partial derivative $f'_a$ of the function $f(x(a, b), y(a, b))$, using a diagram. Here, we're only interested in the branches that end with $a$ (I greyed out the branches we don't need):

[Diagram: $f$ branches to $x$ and $y$; each of them branches to $a$ and $b$. The branches ending in $b$ are greyed out.]

After multiplying each element of the black branches and then adding up the branches, the result is:

$$f'_a = f'_x \cdot x'_a + f'_y \cdot y'_a$$

By analogy we can get $f'_b$. Also, by analogy we can work with more complex functions with the help of diagrams.

Example. (taken from 29.12.2011 mock)


The function $f(x, y)$ is given by $f(x, y) = u^2(x, y) + v^3(x, y)$. The values of $u$ and $v$ and their respective gradients at the point $(x, y) = (1, 1)$ are also known: $u(1, 1) = 3$, $v(1, 1) = -2$, $\nabla u(1, 1) = (1, 4)$, $\nabla v(1, 1) = (-1, 1)$. Find $\nabla f(1, 1)$ if $u, v \in C^1$.

Solution.
$$\nabla f = (\nabla u)^2 + (\nabla v)^3 = \left(1^2 + (-1)^2,\ 4^3 + 1^3\right) = (2, 65)$$
If you found yourself nodding along and the equation above did not raise any red flags, you should stop immediately and try to understand why the thing I just did is completely wrong. The correct solution is after the following appendix.

5.8 Appendix to Chain Rule

If Bukin or Demeshev feel particularly sadistic when composing the exam, they might come up with
something like this:

Example. (taken from 21.01.2009 mock)


Calculate all partial derivatives of the first and second order of $u$ with respect to $x$ and $y$ if $u = f(a, b)$ and $a = x + xy$, $b = x/y$.
The first thing to do here is to rewrite $u = f(a, b)$ as $u = f(a(x, y), b(x, y))$ to better understand the task and not bother with calculations for now. Let's focus on $u'_x$:

$$u'_x = f'_a \cdot a'_x + f'_b \cdot b'_x$$

Now, we move on to the second-order derivatives:

$$u''_{xx} = \left(f''_{aa} \cdot a'_x + f''_{ab} \cdot b'_x\right) a'_x + f'_a \cdot a''_{xx} + \left(f''_{ba} \cdot a'_x + f''_{bb} \cdot b'_x\right) b'_x + f'_b \cdot b''_{xx}$$

Simple! Right now your face probably looks a lot like this:

What the hell happened to $f'_a$ and $f'_b$? The first thing to realize is that $f'_a$ is actually $f'_a(a(x, y), b(x, y))$ and $f'_b$ is actually $f'_b(a(x, y), b(x, y))$, and we can differentiate them further just as if they were ordinary functions. So let's slow down a little and get back to drawing:

[Diagram: a tree branching to $a$ and $b$, each of which branches to $x$ and $y$.]

Notice that this diagram gives $f''_{aa} \cdot a'_x + f''_{ab} \cdot b'_x$, which is exactly what you can see in $u''_{xx}$ above. The $(\ldots) \cdot a'_x + f'_a \cdot a''_{xx}$ part is the result of applying $(f \cdot g)' = f' \cdot g + f \cdot g'$. The second half of $u''_{xx}$ is derived in exactly the same fashion. Same for $u''_{xy}$, $u''_{yx}$, and $u''_{yy}$. It takes some effort to understand the process, but once you draw a few diagrams, it becomes rather straightforward. But back to $u''_{xx}$:

$$u''_{xx} = \left(f''_{aa} \cdot a'_x + f''_{ab} \cdot b'_x\right) a'_x + f'_a \cdot a''_{xx} + \left(f''_{ba} \cdot a'_x + f''_{bb} \cdot b'_x\right) b'_x + f'_b \cdot b''_{xx}$$

We could try to simplify this, but really it's simpler to just plug in the expressions. We don't know $f$, so all $f'$ and $f''$ just stay as they are. $a'_x = 1 + y$; $a''_{xx} = 0$; $b'_x = \frac{1}{y}$; $b''_{xx} = 0$. Then:

$$u''_{xx} = \left(f''_{aa} \cdot (1 + y) + f''_{ab} \cdot \frac{1}{y}\right)(1 + y) + f'_a \cdot 0 + \left(f''_{ba} \cdot (1 + y) + f''_{bb} \cdot \frac{1}{y}\right)\frac{1}{y} + f'_b \cdot 0 =$$

$$= f''_{aa}(1 + y)^2 + f''_{ab}\frac{1 + y}{y} + f''_{ba}\frac{1 + y}{y} + f''_{bb}\frac{1}{y^2}$$

I very strongly encourage you to try to calculate at least $u''_{xy}$ by yourself and compare the results:

$$u''_{xy} = \left(f''_{aa} \cdot a'_y + f''_{ab} \cdot b'_y\right) a'_x + f'_a \cdot a''_{xy} + \left(f''_{ba} \cdot a'_y + f''_{bb} \cdot b'_y\right) b'_x + f'_b \cdot b''_{xy} =$$

$$= f''_{aa} \cdot x(1 + y) - f''_{ab}\frac{(1 + y)x}{y^2} + f'_a + f''_{ba}\frac{x}{y} - f''_{bb}\frac{x}{y^3} - f'_b\frac{1}{y^2}$$

Correct solution to the example in the previous section. To recap: $f(x, y) = u^2(x, y) + v^3(x, y)$, $u(1, 1) = 3$, $v(1, 1) = -2$, $\nabla u(1, 1) = (1, 4)$, $\nabla v(1, 1) = (-1, 1)$. It's not a coincidence that this example is given in the chain rule section:

$$\nabla f = \nabla\left(u^2\right) + \nabla\left(v^3\right) = 2u \cdot \nabla u + 3v^2 \cdot \nabla v$$
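Plugging the given numbers into $2u \cdot \nabla u + 3v^2 \cdot \nabla v$ is a short computation; the sketch below does it component by component (the variable names are made up for this illustration).

```python
# Values given in the problem at the point (1, 1).
u, v = 3, -2
grad_u = (1, 4)
grad_v = (-1, 1)

# Chain rule: grad f = 2u * grad u + 3v^2 * grad v, applied to each component.
grad_f = tuple(2 * u * du + 3 * v**2 * dv for du, dv in zip(grad_u, grad_v))

print(grad_f)  # (-6, 36)
```

So $\nabla f(1, 1) = 6 \cdot (1, 4) + 12 \cdot (-1, 1) = (-6, 36)$.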
 

5.9 Second-order approximation

Young's Theorem. If the function is in $C^2$ (twice continuously differentiable), then $f''_{xy} = f''_{yx}$⁵.

Protip: Although you could always just rewrite $f''_{yx}$ as $f''_{xy}$, it's a good idea to calculate them both independently to confirm that they are the same and that you did everything correctly.
If a function of two variables $f(x, y)$ is twice continuously differentiable ($f \in C^2$), its second-order total differential is:

$$d^2f = d(df) = d(f'_x dx + f'_y dy) = d(f'_x)dx + d(f'_y)dy = (f''_{xx}dx + f''_{xy}dy)dx + (f''_{yx}dx + f''_{yy}dy)dy =$$

$$= f''_{xx}dx^2 + 2f''_{xy}dxdy + f''_{yy}dy^2$$
Subsequently, the second-order approximation of a function of two variables $f(x, y)$ is its Taylor polynomial up to the second-order derivatives:

$$f(x, y) \approx f(x_0, y_0) + f'_x(x_0, y_0)(x - x_0) + f'_y(x_0, y_0)(y - y_0) +$$

$$+ \frac{1}{2}\left(f''_{xx}(x - x_0)^2 + 2f''_{xy}(x - x_0)(y - y_0) + f''_{yy}(y - y_0)^2\right)$$

Example. Use second-order approximation to approximate the function $f(x, y) = x^3y^5 + x^2 - y^3 + xy$ at a point $(1, 1)$

Solution.
$$df = (3x^2y^5 + 2x + y)dx + (5y^4x^3 - 3y^2 + x)dy$$

$$d^2f = (6xy^5 + 2)dx^2 + 2(15x^2y^4 + 1)dxdy + (20y^3x^3 - 6y)dy^2$$

$$f(x, y) \approx f(1, 1) + f'_x(1, 1)(x - 1) + f'_y(1, 1)(y - 1) +$$

$$+ \frac{1}{2}\left(f''_{xx}(x - 1)^2 + 2f''_{xy}(x - 1)(y - 1) + f''_{yy}(y - 1)^2\right) =$$

$$= 2 + 6(x - 1) + 3(y - 1) + \frac{1}{2}\left(8(x - 1)^2 + 2 \cdot 16(x - 1)(y - 1) + 14(y - 1)^2\right)$$
Note that since we only use the first two differentials, the approximation is only accurate around the point $(1, 1)$.
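To get a feel for how good the approximation is near $(1, 1)$, one can compare it with the exact function. The test point $(1.1, 0.9)$ below is an arbitrary assumption, and the coefficients 2, 6, 3, 8, 16, 14 are the values of $f$ and its derivatives at $(1, 1)$ from the solution.

```python
def f(x, y):
    """f(x, y) = x^3 * y^5 + x^2 - y^3 + x*y from the example."""
    return x**3 * y**5 + x**2 - y**3 + x * y


def first_order(x, y):
    """Linear part of the expansion around (1, 1)."""
    return 2 + 6 * (x - 1) + 3 * (y - 1)


def second_order(x, y):
    """Full second-order approximation around (1, 1)."""
    dx, dy = x - 1, y - 1
    return first_order(x, y) + 0.5 * (8 * dx**2 + 2 * 16 * dx * dy + 14 * dy**2)


point = (1.1, 0.9)  # assumed test point close to (1, 1)
exact_value = f(*point)
error_first = abs(exact_value - first_order(*point))
error_second = abs(exact_value - second_order(*point))

print(exact_value, first_order(*point), second_order(*point))
```

At this point the first-order value is 2.3 and the second-order value is 2.25, while the exact value is about 2.257, so the second-order term removes most of the error.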
⁵ You could try to picture a surface in $xyz$-space in your head, then imagine how we first take $f'_x$ and then $f''_{xy}$, or $f'_y$ and then $f''_{yx}$, on it, and, with considerable effort, might see why this theorem is true. There's no short and clear explanation.

6 Implicit functions
Suppose we want to find the derivative $\frac{dy}{dx}$ of an implicit function $xy = 1$. Well, simple enough, just write it explicitly as $y = \frac{1}{x}$, and differentiate it:
$$\frac{dy}{dx} = y' = \left(\frac{1}{x}\right)' = -\frac{1}{x^2}$$
But now suppose the function is $x + \sin y + xy = 0$. Whatever we try to do, there's no way to place all $y$s on one side and all $x$s on the other side. We are forced to differentiate the implicit function. The way we do it is by using the Implicit Function Theorem.

6.1 Implicit Function Theorem 1

Let's continue with $x + \sin y + xy = 0$. The important thing to realize here is that, although we can't disentangle $x$ from $y$, the function itself still exists, and there's nothing fundamentally different between an implicit function $F(x, y) = 0$ and an explicit function $y = f(x)$. One of the implications is that we can still view $y$ as a function of $x$:

$$F(x, y) = x + \sin y + xy = 0 \Rightarrow F(x, y(x)) = x + \sin y(x) + xy(x) = 0$$

This also means it's possible to find the derivative $\frac{dy}{dx}$ of the implicit function at a point, just as if it were explicit. Since $F(x, y(x)) = 0$, $F'(x, y(x)) = 0$.
Now, remembering the chain rule, we differentiate with respect to $x$:

$$\frac{\partial F}{\partial x} + \frac{\partial F}{\partial y} \cdot \frac{dy}{dx} = 0$$
then

$$\frac{\partial F}{\partial y} \cdot \frac{dy}{dx} = -\frac{\partial F}{\partial x}$$
and finally

$$\frac{dy}{dx} = -\frac{\partial F / \partial x}{\partial F / \partial y}$$

which is usually written as

$$y' = -\frac{F'_x}{F'_y}$$
The result of this derivation should be familiar to you from the previous year's Calculus as the Implicit Function Theorem (IFT). Since in this course we'll study more than one IFT, we are going to call it IFT1.

IFT1. If we have an equation $F(x, y) = 0$ and a point $(x_0, y_0)$ such that:

1. the point $(x_0, y_0)$ satisfies the equation $F(x_0, y_0) = 0$

2. the function $F$ is continuously differentiable⁶ ($F \in C^1$)

3. $F'_y(x_0, y_0) \neq 0$

then the explicit function $y = f(x)$ is defined near the point $(x_0, y_0)$ and its derivative $y'$ is equal to
$$y' = -\frac{F'_x}{F'_y}$$

⁶ Actually, we only need it to be differentiable around the point, but you don't need to think about it.

Condition 1 is needed because we need to make sure the point we're trying to find the derivative at actually belongs to the graph of the function.
Condition 2 is needed because, well, unless the function is differentiable at a point, we can't really take its derivative (Exam tip: usually the implicit functions given are polynomials, which are always continuously differentiable; you can simply state this fact to show that the condition holds).
Condition 3 is needed since we divide by $F'_y$ when calculating the derivative, and the function would not be defined if $F'_y$ was 0 (like with $x^2 + y^2 = 1$ at $(1, 0)$ and $(-1, 0)$, as $F'_y = 2y = 0$ at these points).

Example. (taken from 25.03.2015 mock)


Consider the equation $y^3 + xy + 3x^2 + 2x^3 = 7$.

(a) Does this equation define the implicit function $y(x)$ at a point $(x = 1, y = 1)$?

(b) If the function $y(x)$ is defined, find its second-order Taylor expansion.

Solution.

(a) Let's check the three conditions:

1. $y^3 + xy + 3x^2 + 2x^3$ at $(1, 1)$ is $1 + 1 + 3 + 2 = 7$, so the condition is satisfied.

2. Polynomial, thus $C^1$.

3. $F'_y = 3y^2 + x = 3 + 1 = 4 \neq 0$, so the condition is satisfied.

Thus, we can conclude that this equation does indeed define the implicit function $y(x)$ at a point $(x = 1, y = 1)$.

(b) Second-order Taylor expansion of any function is given by:

$$y(x_0) + y'(x_0)(x - x_0) + \frac{y''(x_0)(x - x_0)^2}{2}$$
Check Taylor Series Introduction if you forgot this formula. And for our case it would look the following way:

$$y(1) + y'(1)(x - 1) + \frac{y''(1)(x - 1)^2}{2}$$
Then we can find $y'(1)$ by using the formula $y' = -\frac{F'_x}{F'_y}$:

$$y' = -\frac{y + 6x + 6x^2}{3y^2 + x} = -\frac{1 + 6 + 6}{3 + 1} = -\frac{13}{4}$$
Now, remember that $y$ is a function of $x$: $y(x)$, so both $F'_x = y + 6x + 6x^2$ and $F'_y = 3y^2 + x$ are functions of $x$, not of $y$, and when we write $y$ we actually mean $y(x)$, so differentiate accordingly:

$$y'' = -\left(\frac{F'_x}{F'_y}\right)' = -\frac{(F'_x)' F'_y - F'_x (F'_y)'}{(F'_y)^2} = -\frac{(y' + 6 + 12x)(3y^2 + x) - (6y \cdot y' + 1)(y + 6x + 6x^2)}{(3y^2 + x)^2}$$

At the point $(1, 1)$, with $y' = -\frac{13}{4}$:

$$y''(1) = -\frac{\left(-\frac{13}{4} + 6 + 12\right)(3 + 1) - \left(6 \cdot \left(-\frac{13}{4}\right) + 1\right)(1 + 6 + 6)}{(3 + 1)^2} = -\frac{59 + \frac{481}{2}}{16} = -\frac{599}{32}$$

Finally, the answer is

$$y \approx 1 - \frac{13}{4}(x - 1) - \frac{599}{64}(x - 1)^2$$
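Fraction arithmetic like this is easy to get wrong by hand, so a quick way to check it is to redo the computation with exact fractions. The sketch below uses Python's `fractions` module; the variable names are made up for this illustration.

```python
from fractions import Fraction

x, y = Fraction(1), Fraction(1)

# First derivative from IFT1: y' = -F'_x / F'_y.
F_x = y + 6 * x + 6 * x**2   # partial derivative of F in x
F_y = 3 * y**2 + x           # partial derivative of F in y
y1 = -F_x / F_y

# Derivatives of F'_x and F'_y with respect to x, remembering that y = y(x).
dF_x = y1 + 6 + 12 * x       # (y + 6x + 6x^2)' = y' + 6 + 12x
dF_y = 6 * y * y1 + 1        # (3y^2 + x)' = 6y * y' + 1

# Quotient rule applied to y' = -F'_x / F'_y.
y2 = -(dF_x * F_y - F_x * dF_y) / F_y**2

print(y1, y2)  # -13/4 -599/32
```

The coefficient of $(x - 1)^2$ in the Taylor expansion is then `y2 / 2`, i.e. $-\frac{599}{64}$.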

6.2 Implicit Function Theorem 2

A slight generalization of IFT1 is the case when we have one dependent variable $y$ and several independent variables $x_1, \ldots, x_n$, so the equation becomes

$$F(x_1, \ldots, x_n, y) = F(x_1, \ldots, x_n, y(x_1, \ldots, x_n)) = 0$$

Fortunately, we're actually only interested in the partial derivative $\frac{\partial y}{\partial x_i}$ of this function, which means that all the derivatives not involving $x_i$ are 0 (as $c' = 0$). So our new expression is

$$F'_{x_i} + F'_y \cdot y'_{x_i} = 0$$

IFT2. If we have an equation $F(x_1, \ldots, x_n, y) = 0$ and a point $(x_1^0, \ldots, x_n^0, y_0)$ such that:

1. the point $(x_1^0, \ldots, x_n^0, y_0)$ satisfies the equation $F(x_1^0, \ldots, x_n^0, y_0) = 0$

2. the function is continuously differentiable ($F \in C^1$)

3. $F'_y(x_1^0, \ldots, x_n^0, y_0) \neq 0$

then the explicit function $y = f(x_1, \ldots, x_n)$ is defined near the point $(x_1^0, \ldots, x_n^0, y_0)$ and its partial derivatives $y'_{x_i}$ are equal to

$$y'_{x_i} = -\frac{F'_{x_i}}{F'_y}, \text{ for any } i = 1, \ldots, n$$

6.3 Implicit Function Theorem 3

The final generalization happens when there are $n$ independent variables and $m$ simultaneous equations. We will actually only work with the case of one independent variable and two functions, as, going beyond that, everything gets too complicated. In the equations below $x$ is an independent variable, while $y(x)$ and $z(x)$ are dependent, i.e. they're functions of $x$:
$$\begin{cases} F(x, y, z) = 0 \\ G(x, y, z) = 0 \end{cases}$$
Differentiating each function with respect to $x$ by using the chain rule (check IFT1 if you forgot):

$$\begin{cases} F'_x + F'_y \cdot y'_x + F'_z \cdot z'_x = 0 \\ G'_x + G'_y \cdot y'_x + G'_z \cdot z'_x = 0 \end{cases}$$
Alternatively:

$$\begin{cases} F'_y \cdot y'_x + F'_z \cdot z'_x = -F'_x \\ G'_y \cdot y'_x + G'_z \cdot z'_x = -G'_x \end{cases}$$
Thus, we have a system of two equations with two unknowns, $y'_x$ and $z'_x$, which we can solve by Cramer's rule.

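IFT3. If we have equations $\begin{cases} F(x, y, z) = 0 \\ G(x, y, z) = 0 \end{cases}$ and a point $(x_0, y_0, z_0)$ such that:

1. the point $(x_0, y_0, z_0)$ satisfies the equations $\begin{cases} F(x, y, z) = 0 \\ G(x, y, z) = 0 \end{cases}$

2. the functions are continuously differentiable ($F, G \in C^1$)

3. the Jacobian (matrix of partial derivatives) given by

$$J = \begin{pmatrix} F'_y & F'_z \\ G'_y & G'_z \end{pmatrix}$$

is such that $|J| \neq 0$ at the point $(x_0, y_0, z_0)$

then, by application of Cramer's rule⁷, the derivatives we're interested in are given by

$$y'_x = -\frac{\begin{vmatrix} F'_x & F'_z \\ G'_x & G'_z \end{vmatrix}}{|J|} = -\frac{\begin{vmatrix} F'_x & F'_z \\ G'_x & G'_z \end{vmatrix}}{\begin{vmatrix} F'_y & F'_z \\ G'_y & G'_z \end{vmatrix}}$$

$$z'_x = -\frac{\begin{vmatrix} F'_y & F'_x \\ G'_y & G'_x \end{vmatrix}}{|J|} = -\frac{\begin{vmatrix} F'_y & F'_x \\ G'_y & G'_x \end{vmatrix}}{\begin{vmatrix} F'_y & F'_z \\ G'_y & G'_z \end{vmatrix}}$$

To remind you, the determinant of a $2 \times 2$ matrix is

$$\begin{vmatrix} a & b \\ c & d \end{vmatrix} = ad - bc$$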
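The IFT3 formulas are mechanical enough to be written as a small function. The sketch below is an illustration under stated assumptions: the helper names are invented, and the system $F = x + y + 2z - 4 = 0$, $G = x^2 + y^2 + z^2 - 3 = 0$ at the point $(1, 1, 1)$ is a hypothetical example, not one from the course.

```python
from fractions import Fraction


def det2(a, b, c, d):
    """Determinant of the 2x2 matrix with rows (a, b) and (c, d)."""
    return a * d - b * c


def ift3_derivatives(F_x, F_y, F_z, G_x, G_y, G_z):
    """Return (y'_x, z'_x) from the partial derivatives at a point, via Cramer's rule."""
    jacobian = det2(F_y, F_z, G_y, G_z)
    if jacobian == 0:
        raise ValueError("|J| = 0: condition 3 of IFT3 fails at this point")
    y_x = -Fraction(det2(F_x, F_z, G_x, G_z), jacobian)
    z_x = -Fraction(det2(F_y, F_x, G_y, G_x), jacobian)
    return y_x, z_x


# Hypothetical system at (1, 1, 1):
#   F = x + y + 2z - 4      -> F'_x = 1,  F'_y = 1,  F'_z = 2
#   G = x^2 + y^2 + z^2 - 3 -> G'_x = 2x, G'_y = 2y, G'_z = 2z, all equal to 2 here
y_x, z_x = ift3_derivatives(1, 1, 2, 2, 2, 2)

print(y_x, z_x)  # -1 0
```

Substituting back, $F'_x + F'_y \cdot y'_x + F'_z \cdot z'_x = 1 - 1 + 0 = 0$ and $G'_x + G'_y \cdot y'_x + G'_z \cdot z'_x = 2 - 2 + 0 = 0$, so both differentiated equations hold.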

Exam tip: You will absolutely certainly be asked to employ IFT1, or IFT2, or IFT3, or any combination of these on the exam, so even if the explanations of these are unclear, just memorize the results of each: $y'$ for IFT1; $y'_{x_i}$ for IFT2; and $y'_x$, $z'_x$ for IFT3; and make sure you can plug the right numbers into the formulas when asked.

⁷ You could just memorize the formulas below, but Wikipedia actually has a wonderful (still rather difficult to understand, though) geometric explanation of this formula. Do check it out, if you're interested: https://en.wikipedia.org/wiki/Cramer%27s_rule#Geometric_interpretation

7 Convexity and Concavity. Convex Sets
Remark. In contrast to the course, the topics Convexity and Concavity and Unconstrained Optimization are presented in a different order here, because it feels more natural to me this way.

First derivative shows the slope of the function: $f'(x) > 0$ ⇒ slope positive ⇒ function increases; $f'(x) < 0$ ⇒ slope negative ⇒ function decreases. Recall an example from the Taylor Series Introduction chapter. Slope of a function is analogous to its speed: speed is positive ⇒ function increases; speed is negative ⇒ function decreases.
Second derivative then is the acceleration of a function: function is speeding up ⇒ $f'' > 0$; function is slowing down ⇒ $f'' < 0$.
We call functions that are speeding up convex and functions that are slowing down concave⁸.

[Figure: example graphs labelled strictly convex, convex, neither, and strictly concave.]

Protip: an easy way to remember which one is convex and which is concave is to note that $y = -x^2$ looks a lot like a cave. Coincidentally, it is also concave.
Okay, this was the basic intuition, but it is waaaaay too imprecise, even for me. Actually, if the function is always speeding up, i.e. $f'' > 0$, then it's called strictly convex. Simply convex means that it does not slow down, i.e. $f'' \geq 0$. Same for concave. So lines like $y = 2x$ are both convex and concave.
Furthermore, you could say that a concave function like $y = -x^2$ is first slowing down and then speeding up, pointing to the absolute value of its first derivative. Well, technically, by speeding up I mean speeding up upwards or slowing down downwards. Same for slowing down.
The technical formulation for convex function is the following:

Definition. A function is convex on $(a, b)$ if the inequality

$$f(\alpha x + (1 - \alpha)y) \leq \alpha f(x) + (1 - \alpha)f(y)$$

is satisfied for any two points $x$ and $y$ from $(a, b)$ and any $\alpha$ in $[0, 1]$.

Protip: Although you are rarely asked for this definition, sometimes remembering it and understanding its geometrical meaning (it is explained wonderfully in Jeffrey on page 49) is extremely helpful in the exams (see Example at the end of this chapter).

7.0.1 Convex sets

It's rather obvious that $f(x) = x^2$ is convex. However, what if we define the domain (all inputs) to be $(1, 4) \cup (8, 14)$, rather than $(-\infty, +\infty)$? Is $f(x)$ still convex on its domain?
The first thing to notice here is that the definition above only describes the situation of $(a, b)$, a single interval, while here we have two intervals. But let's ignore this for a moment and proceed anyway. Then, by taking two points in the domain, say $x = 2$ and $y = 10$, and taking their middle, i.e. $\alpha = 0.5$, we get $f(0.5 \cdot 2 + 0.5 \cdot 10) = f(6)$, which is not defined!
What we found is that the initial question does not make any sense: the function can't be either convex or concave on a set like this. In $\mathbb{R}^1$ the set (domain) needs to be connected. In $\mathbb{R}^n$ the situation is more complicated: here, the set (domain) needs to be convex, i.e. any two of its points have to be connected by a straight line segment lying inside the set, for us to be able to determine convexity of a function. Some examples:

[Figure: three example sets, labelled convex, non-convex, and non-convex.]

⁸ Sometimes convex is called concave up and concave is called concave down.

Definition. A set is called convex if, given any two points $a$, $b$ in that set, the straight line segment $ab$ joining them lies entirely within that set.
Formally, a set $V$ is called convex if

$$\forall a, b \in V \text{ the point } \alpha a + (1 - \alpha)b \in V, \quad 0 \leq \alpha \leq 1$$

Notice that for a concave function, e.g. $y = \ln(x)$, the area below it (called the subgraph) looks like a convex set; and for a convex function, e.g. $y = x^2$, the area above it (called the epigraph) looks like a convex set. Thus, a theorem:

Theorem. If $f$ is concave, then its subgraph is a convex set. If $f$ is convex, then its epigraph is a convex set.

Example 1. Determine whether the following set is convex: $\{(x, y) \mid y = x^2\}$

Answer: If you skipped the relevant seminar, you've probably thought "of course it is, since $y = x^2$ is convex!". But if you reread the problem, it does not actually say anything about the epigraph of $y = x^2$. The points in this example all lie on the parabola itself. Since when we connect any two of these points, we get off the parabola, the set in question is not convex.
This was an intuitive explanation, but to prove it formally we'll need to make use of the definition of a convex set written just above. Let $a = (-2, f(-2)) = (-2, 4)$, $b = (1, f(1)) = (1, 1)$. We can pick any $\alpha \in (0, 1)$, but let's take $\alpha = 0.6$ here, as an example. Then $\alpha a + (1 - \alpha)b = 0.6(-2, 4) + 0.4(1, 1) = (-0.8, 2.8)$. Since $f(-0.8) = 0.64 \neq 2.8$, this point does not lie in the set. Thus we get a contradiction with the definition and a proof that the set is not convex.
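The counterexample above can be replayed in a few lines. This is a sketch only: `on_parabola` and `convex_combination` are helper names invented for the illustration, and the tolerance is an assumption needed because of floating-point rounding.

```python
def on_parabola(point, tol=1e-12):
    """True if the point (x, y) satisfies y = x^2 up to a small tolerance."""
    x, y = point
    return abs(y - x**2) <= tol


def convex_combination(a, b, alpha):
    """The point alpha*a + (1 - alpha)*b on the segment between a and b."""
    return tuple(alpha * ai + (1 - alpha) * bi for ai, bi in zip(a, b))


a, b = (-2.0, 4.0), (1.0, 1.0)
mix = convex_combination(a, b, 0.6)  # approximately (-0.8, 2.8)

print(on_parabola(a), on_parabola(b), on_parabola(mix))  # True True False
```

Both endpoints belong to the set but the point between them does not, which is exactly what violates the definition of a convex set.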

Example 2. Determine whether the following set is convex:

$$\{(x, y) \mid y \geq x^2\}$$

Answer: This set is convex, since it describes the area above the parabola $y = x^2$ (its epigraph), and the theorem is applicable.

7.1 $\mathbb{R}^2$
But what do we do with a function of two variables? Rather than simply checking $f''$, we now have four second-order partial derivatives: $f''_{xx}$, $f''_{xy}$, $f''_{yx}$, $f''_{yy}$. Let's start with a simple example:

$$f(x, y) = x^2 + y^2$$

It's visually obvious that this function is convex, so let's see what happens to second-order partial
derivatives in this case:

$$f'_x = 2x,\quad f'_y = 2y \quad\Rightarrow\quad f''_{xx} = 2,\quad f''_{xy} = 0,\quad f''_{yx} = 0,\quad f''_{yy} = 2$$

Note that cross derivatives ($f''_{xy}$ and $f''_{yx}$) are 0 and we can forget about them for now. Seeing that $f''_{xx} > 0$ on the entire domain of the function, we may say that $f$ is always speeding up along the x-axis. And since the same could be said about the y-axis, we may conclude that the function is convex as a whole.
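Usually we arrange partial derivatives in the form of the Hessian matrix:

$$H = \begin{pmatrix} f''_{xx} & f''_{xy} \\ f''_{yx} & f''_{yy} \end{pmatrix} = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$$

Definition. Hessian matrix is given by

$$H = \begin{pmatrix} f''_{xx} & f''_{xy} \\ f''_{yx} & f''_{yy} \end{pmatrix}$$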
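Switching signs of the function we get

$$f(x, y) = -x^2 - y^2$$

$$H = \begin{pmatrix} f''_{xx} & f''_{xy} \\ f''_{yx} & f''_{yy} \end{pmatrix} = \begin{pmatrix} -2 & 0 \\ 0 & -2 \end{pmatrix}$$

which is obviously concave. Finally, for the function

$$f(x, y) = x^2 - y^2$$

$$H = \begin{pmatrix} f''_{xx} & f''_{xy} \\ f''_{yx} & f''_{yy} \end{pmatrix} = \begin{pmatrix} 2 & 0 \\ 0 & -2 \end{pmatrix}$$

As $f''_{xx} > 0$, the function is speeding up along the x-axis; $f''_{yy} < 0$, so the function is slowing down along the y-axis, which means that it's neither concave nor convex.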
Using these three functions for intuition, we can proceed to a more formal treatment. If we define

$$H_1 = |f''_{xx}| \quad\text{and}\quad H_2 = |H| = \begin{vmatrix} f''_{xx} & f''_{xy} \\ f''_{yx} & f''_{yy} \end{vmatrix},$$

(the bars denote determinants, not absolute values), we can create the following table:

f(x, y)       | H                  | H1 | H2 | convexity/concavity | definiteness
x^2 + y^2     | [[2, 0], [0, 2]]   | >0 | >0 | strictly convex     | positive definite
-x^2 - y^2    | [[-2, 0], [0, -2]] | <0 | >0 | strictly concave    | negative definite
x^2 - y^2     | [[2, 0], [0, -2]]  | something else | neither     | neither
"Positive definite" and "negative definite" are the names for matrices that satisfy the given conditions. You should remember them because sometimes these terms are used in the exams.
$H_1$ and $H_2$ in the table above are called leading principal minors of a matrix. Formally:
Definition. Let $A$ be an $n \times n$ matrix. The $k$-th order leading principal minor of $A$ is the determinant of the matrix obtained by deleting the last $n - k$ rows and columns of $A$.
So the 1st order leading principal minor of $A$, $H_1$, is obtained by deleting all but the first row and column. $H_2$ is obtained by deleting all but the first two rows and columns. And so on. Consequently, $H_n$ is the determinant of the whole $n \times n$ matrix.
A general rule for finding whether a function f is strictly convex or strictly concave is:

1. f is strictly convex if all leading principal minors of its Hessian are strictly positive (> 0).
2. f is strictly concave if the leading principal minors of its Hessian alternate signs as follows:

$H_1 < 0$, $H_2 > 0$, $H_3 < 0$, and so on

These conditions are sufficient but not necessary: $f(x) = x^4$ is strictly convex even though $f''(0) = 0$.
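
The rule above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the function names (`det`, `leading_principal_minors`, `classify`) are made up here, and the Hessian is assumed to be a constant matrix given as a list of rows.

```python
def det(m):
    """Determinant by Laplace expansion along the first row (fine for small matrices)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, entry in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * entry * det(minor)
    return total

def leading_principal_minors(h):
    """Return [H1, H2, ..., Hn]: determinants of the top-left k x k blocks."""
    return [det([row[:k] for row in h[:k]]) for k in range(1, len(h) + 1)]

def classify(h):
    """Apply the strict rule: all positive, or alternating starting with negative."""
    minors = leading_principal_minors(h)
    if all(m > 0 for m in minors):
        return "positive definite (strictly convex)"
    if all((m < 0 if k % 2 == 1 else m > 0) for k, m in enumerate(minors, start=1)):
        return "negative definite (strictly concave)"
    return "something else"

print(classify([[2, 0], [0, 2]]))    # Hessian of x^2 + y^2
print(classify([[-2, 0], [0, -2]]))  # Hessian of -x^2 - y^2
print(classify([[2, 0], [0, -2]]))   # Hessian of x^2 - y^2
```

"Something else" here lumps together the indefinite case and the cases with zero minors, which need the finer check of all principal minors.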

But what if the general pattern above holds, but some leading principal minor Hm is 0? This is
where intuition about speeding up and slowing down along axes ends, and where we'll need to do a
lot more calculations. In this case we'll unfortunately need to check all principal minors to determine
whether the function is convex or concave:

Definition. Let $A$ be an $n \times n$ matrix. A $k$-th order principal minor of $A$ is the determinant of a matrix obtained by deleting $n - k$ rows of $A$, and the same $n - k$ columns of $A$.
So, for a 2 × 2 matrix there are two 1st order principal minors: $D_{11} = |f''_{xx}|$ and $D_{12} = |f''_{yy}|$, and one 2nd order principal minor:

$$D_2 = \begin{vmatrix} f''_{xx} & f''_{xy} \\ f''_{yx} & f''_{yy} \end{vmatrix}$$

For a 3 × 3 matrix there are three 1st order principal minors: $D_{11} = |f''_{xx}|$ (remove 2nd and 3rd rows and columns), $D_{12} = |f''_{yy}|$ (remove 1st and 3rd rows and columns) and $D_{13} = |f''_{zz}|$ (remove 1st and 2nd rows and columns); three 2nd order principal minors:

$$D_{21} = \begin{vmatrix} f''_{xx} & f''_{xy} \\ f''_{yx} & f''_{yy} \end{vmatrix} \text{ (remove 3rd row and column)},\quad D_{22} = \begin{vmatrix} f''_{xx} & f''_{xz} \\ f''_{zx} & f''_{zz} \end{vmatrix} \text{ (remove 2nd row and column)},\quad D_{23} = \begin{vmatrix} f''_{yy} & f''_{yz} \\ f''_{zy} & f''_{zz} \end{vmatrix} \text{ (remove 1st row and column)};$$

and one 3rd order principal minor:

$$D_3 = \begin{vmatrix} f''_{xx} & f''_{xy} & f''_{xz} \\ f''_{yx} & f''_{yy} & f''_{yz} \\ f''_{zx} & f''_{zy} & f''_{zz} \end{vmatrix}$$
A general rule for finding whether a function f is convex or concave is:

1. f is convex if and only if all its principal minors are non-negative (≥ 0).
2. f is concave if and only if all its principal minors alternate signs as follows:

$D_1 \leq 0$, $D_2 \geq 0$, $D_3 \leq 0$, and so on
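
Enumerating all principal minors by hand gets tedious, so here is a hedged sketch of how it could be automated. The names `principal_minors` and `semidefiniteness` are invented for this example, and the Hessian is again assumed constant.

```python
from itertools import combinations

def det(m):
    """Determinant by Laplace expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def principal_minors(h, k):
    """All k-th order principal minors: keep the same k rows and columns."""
    return [det([[h[i][j] for j in idx] for i in idx])
            for idx in combinations(range(len(h)), k)]

def semidefiniteness(h):
    """Apply the non-strict rule to every principal minor of every order."""
    by_order = {k: principal_minors(h, k) for k in range(1, len(h) + 1)}
    if all(m >= 0 for ms in by_order.values() for m in ms):
        return "positive semidefinite"  # note: the zero matrix also lands here
    if all((m <= 0 if k % 2 == 1 else m >= 0)
           for k, ms in by_order.items() for m in ms):
        return "negative semidefinite"
    return "indefinite"

print(semidefiniteness([[1, 1], [1, 1]]))      # Hessian of 0.5x^2 + xy + 0.5y^2
print(semidefiniteness([[-1, 1], [1, -1]]))    # the same function with flipped sign
print(semidefiniteness([[0.6, 1], [1, 0.6]]))  # Hessian of 0.3x^2 + xy + 0.3y^2
```

For a 3 × 3 matrix the helper returns three 1st order, three 2nd order and one 3rd order minor, matching the count in the text.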

The table for convexity/concavity for functions of two variables is:

$D_{1m}$ (all 1st order) | $D_{2m}$ (all 2nd order) | convexity/concavity | definiteness
≥0                       | ≥0                       | convex              | positive semidefinite
≤0                       | ≥0                       | concave             | negative semidefinite
something else           |                          | neither             | neither

Example. If you understand the formal definition of concavity/convexity, you might find yourself quite happy upon seeing a problem like this on the exam (this one was taken from the 25.03.2015 mock):
Let $f(x)$ be a concave function defined on $[0, \infty)$ with $f(0) = 0$. Is it true that for $k \geq 1$ the following inequality holds: $k f(x) \geq f(kx)$?
I strongly suggest you try to solve this problem yourself before reading the solution.

Solution. Recall that concave means that

$$f(\alpha x_1 + (1 - \alpha) x_2) \geq \alpha f(x_1) + (1 - \alpha) f(x_2)$$

We need to figure out a way to turn this into $k f(x) \geq f(kx)$. The first thing to notice is that this looks a lot like the definition, except for this pesky $1 - \alpha$ term. Recalling that $f(0) = 0$ and setting $x_2 = 0$ (setting $x_2 < x_1$ is counterintuitive, but the definition doesn't actually say that $x_2$ must be greater than $x_1$), we get the following:

$$f(\alpha x_1) \geq \alpha f(x_1)$$

But in this formulation $\alpha f(x_1)$ is to the right of ≥, while in the problem formulation it's to the left of ≥. Then we may notice that taking $\alpha = \frac{1}{k}$ (which lies in $(0, 1]$ because $k \geq 1$) and multiplying both sides by $k$ solves this problem:

$$k f\left(\frac{x_1}{k}\right) \geq f(x_1)$$

Now it's pretty obvious that to get from this to $k f(x) \geq f(kx)$ we just need to take $x_1 = kx$.
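
A quick numerical sanity check of the result, as a sketch only. The test functions ($\sqrt{x}$ and $\ln(1 + x)$, both concave with $f(0) = 0$) and the helper name `holds` are choices made for this illustration, not part of the exam problem. A numerical check is of course not a proof.

```python
import math

def holds(f, xs, ks, tol=1e-12):
    """True if k*f(x) >= f(k*x) for every sampled x >= 0 and k >= 1."""
    return all(k * f(x) >= f(k * x) - tol for x in xs for k in ks)

xs = [0.0, 0.1, 1.0, 2.5, 10.0]
ks = [1.0, 1.5, 2.0, 10.0]

print(holds(math.sqrt, xs, ks))          # True: sqrt is concave, sqrt(0) = 0
print(holds(math.log1p, xs, ks))         # True: ln(1 + x) is concave, ln(1) = 0
print(holds(lambda x: x ** 2, xs, ks))   # False: x^2 is convex, the inequality flips
```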

7.2 Appendix (don't read this unless you want to mess with your head)

You can think of every principal (or leading principal) minor as a cross-section of a function:

function                  | Hessian              | minors           | type
0.7x^2 + xy + 0.7y^2      | [[1.4, 1], [1, 1.4]] | H1 > 0, H2 > 0   | strictly convex
0.5x^2 + xy + 0.5y^2      | [[1, 1], [1, 1]]     | H1 > 0, H2 = 0   | convex
0.3x^2 + xy + 0.3y^2      | [[0.6, 1], [1, 0.6]] | H1 > 0, H2 < 0   | neither

[Figure: surface plots of the three functions above, over the x axis and y axis]

7.3 What Does Determinant Have To Do With Anything? (don't read this even more)

Suppose we have a 1 by 1 square, which we can get from vectors (1, 0) and (0, 1):

[Figure: unit square spanned by the vectors (1, 0) and (0, 1)]

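This square can be written in a matrix form as

$$\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$

where the first row denotes vector (1, 0) and the second row denotes vector (0, 1). Area of the square is 1 and determinant of this matrix is 1. Now let's add the first row of the matrix to the second row and get the following parallelogram:

$$\begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}$$

[Figure: parallelogram spanned by the vectors (1, 0) and (1, 1)]

Area stayed the same and determinant stayed the same. Now let's add the second row to the first row:

$$\begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}$$

[Figure: parallelogram spanned by the vectors (2, 1) and (1, 1)]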

And again, both area and determinant stayed the same. This should give you an intuitive understanding why the determinant gives the area of the figure [9] and why the determinant is 0 when the vectors are dependent. What this argument also shows is that however twisted the initial figure is, we can always reduce it to the fundamental form of the diagonal matrix [10] (all of the figure's angles are 90 degrees). The determinant will stay the same. "Fundamental" does not mean unique. This has something to do with the Hessian but I'm not sure what exactly. Sorry.

9. Also, multiplication of a row multiplies area and determinant by a constant.
10. Numbers on the diagonal of a diagonal matrix are its eigenvalues. (Row additions keep the determinant, but in general they do change the eigenvalues.)
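
The row-addition argument is easy to replay in code. A minimal sketch; `det2` and `add_row` are hypothetical helper names introduced only for this example.

```python
def det2(m):
    """Determinant of a 2 x 2 matrix given as two rows."""
    (a, b), (c, d) = m
    return a * d - b * c

def add_row(m, source, target):
    """Return a copy of m where row `source` has been added to row `target`."""
    result = [list(row) for row in m]
    result[target] = [t + s for t, s in zip(m[target], m[source])]
    return result

square = [[1, 0], [0, 1]]
step1 = add_row(square, source=0, target=1)  # [[1, 0], [1, 1]]
step2 = add_row(step1, source=1, target=0)   # [[2, 1], [1, 1]]

print(det2(square), det2(step1), det2(step2))  # 1 1 1: the area never changes
print(det2([[1, 2], [2, 4]]))                  # 0: dependent vectors span no area
```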

8 Unconstrained Optimization
8.1 Local Optima

Suppose we have a function $y = f(x)$ and we want to find its minimum and maximum. How do we do it? Start looking for stationary points, i.e. points where the function is flat, i.e. $y' = 0$.
Definition. A point is called stationary (or critical) if all partial derivatives of the function are equal to 0 at this point, i.e. $\nabla f = 0$.
This was the first-order condition (since it's based on the first derivative) for the min or max, also called the necessary condition. Usually we just say "FOC", though.
However, finding such a point is not sufficient, since we don't know if it is a minimum, maximum, or neither of these.

[Figure: three stationary points, labelled min, max, neither]

To ascertain which point it is, we need to find the second derivative of the function. If the second derivative is positive, then the first derivative (slope) is increasing, the function is speeding up, and, looking at the picture above, we have the case of min. If the second derivative is negative, then the slope is decreasing, the function is slowing down, and we have the case of max. If the second derivative is zero, then the test is inconclusive: the point may be an inflection point (as for $x^3$ at 0) or still an extremum (as for $x^4$ at 0).

$$y' = 0,\ y'' > 0 \;\Rightarrow\; \text{speeding up} \;\Rightarrow\; \min$$

$$y' = 0,\ y'' < 0 \;\Rightarrow\; \text{slowing down} \;\Rightarrow\; \max$$

This was the second-order condition (since it's based on the second derivative) for the min or max, also called the sufficient condition. Usually we just say "SOC", though.
All of this sounds suspiciously similar to our discussion of convexity and concavity in the previous chapter. And it is in fact the same discussion. Finding that the function attains a local minimum at a point is exactly the same as finding that the function is convex in this point's vicinity (look at the picture above if it's not clear why!). Finding that the function attains a local maximum at a point is exactly the same as finding that the function is concave in this point's vicinity. If this is not clear, try to imagine a differentiable function that would be convex around a point where it attains a maximum.
Therefore, the rule for convexity becomes the rule for local min and the rule for concavity becomes the rule for local max, with the difference that all Hessian matrices are calculated at stationary points:

H1 | H2 | H3 | definiteness          | min/max
>0 | >0 | >0 | positive definite     | local min
≥0 | ≥0 | ≥0 | positive semidefinite | inconclusive
<0 | >0 | <0 | negative definite     | local max
≤0 | ≥0 | ≤0 | negative semidefinite | inconclusive
something else | indefinite        | saddle point

Exam tip: If you get an inconclusive result, generally you don't need to look any further and can just write "inconclusive" in the answer.
You can check the pictures in section 7.2 for geometric intuition regarding these rules. The middle picture there, $0.5x^2 + xy + 0.5y^2$, sheds some light on the inconclusive case.
The reason we're only talking about local optima right now is that we're calculating Hessian matrices at specific points and therefore cannot know what happens with the function on its whole domain.
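
To see the table at work, here is a small sketch for a function of two variables. The function $f(x, y) = x^3 - 3x + y^2$ is an illustrative choice made for this example (it is not from the notes), and its gradient and Hessian are written out by hand.

```python
def gradient(x, y):
    # First-order partials of the illustrative f(x, y) = x**3 - 3*x + y**2
    return (3 * x ** 2 - 3, 2 * y)

def hessian(x, y):
    # Second-order partials of the same f
    return [[6 * x, 0], [0, 2]]

def classify_stationary_point(h):
    """Two-variable version of the table: uses H1 and H2 only."""
    h1 = h[0][0]
    h2 = h[0][0] * h[1][1] - h[0][1] * h[1][0]
    if h1 > 0 and h2 > 0:
        return "local min"
    if h1 < 0 and h2 > 0:
        return "local max"
    if h2 < 0:
        return "saddle point"
    return "inconclusive"

for point in [(1, 0), (-1, 0)]:
    assert gradient(*point) == (0, 0)  # FOC: both points are stationary
    print(point, classify_stationary_point(hessian(*point)))
# (1, 0) is a local min, (-1, 0) is a saddle point
```

The Hessian is evaluated at each stationary point separately, which is exactly why the conclusion is only local.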

Protip: Recall that Young's Theorem says that if the function is in $C^2$ (twice continuously differentiable), then $f''_{xy} = f''_{yx}$. Since functions given for such exercises almost always satisfy this theorem, Hessian matrices are almost always symmetric.

8.2 Global Optima

If the function is either concave or convex, then any critical point it has is its global maximum or minimum, respectively. In all other cases, finding global minima and maxima of a function is much less straightforward, as there's no universal rule for this problem. There are two general ways to proceed further:

1. Try to prove that there's no global min/max

2. Prove that a local extremum is also a global one.

Let's start with trying to prove there's no global min/max. The usual way to do this would be to show that the function goes to infinity in some direction. For example, let $f(x, y) = 0.5x^2 + xy + 0.5y^2$:

[Figure: surface plot of f over the x axis and y axis]

So, suppose we want to prove that it has no global maximum. Let's try the direction $x = y$, so $f(x, y) = 0.5y^2 + y^2 + 0.5y^2 = 2y^2$. Now check that $\lim_{y \to +\infty} 2y^2 = +\infty$, so indeed this function has no global maximum. In fact (the picture makes it clear) for this function we could check literally any direction other than $x = -y$ (purple line) and the result would stay the same. For example, let $x = 0$ (red line), then $f(x, y) = 0.5y^2$ and $\lim_{y \to -\infty} 0.5y^2 = +\infty$; or let $y = x^2$! Then $f(x, y) = 0.5x^2 + x^3 + 0.5x^4$ and $\lim_{x \to +\infty} (0.5x^2 + x^3 + 0.5x^4) = +\infty$.
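
The direction-by-direction argument can be illustrated by evaluating the function along each path for growing values of the parameter. This is only a sketch (growing values suggest the limit, they don't prove it), and the `paths` dictionary is a construction made up for this example.

```python
def f(x, y):
    return 0.5 * x ** 2 + x * y + 0.5 * y ** 2

# Each path maps a parameter t to a point (x, y) moving away from the origin.
paths = {
    "x = y":   lambda t: (t, t),
    "x = 0":   lambda t: (0.0, t),
    "y = x^2": lambda t: (t, t ** 2),
    "x = -y":  lambda t: (-t, t),
}

for name, path in paths.items():
    values = [f(*path(t)) for t in (1.0, 10.0, 100.0)]
    print(name, values)
# Along x = y the values are 2.0, 200.0, 20000.0 (unbounded);
# along x = -y they stay at 0.0, the one direction where f does not grow.
```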
Proving that a local extremum is also the global one is harder. There are basically three options:

1. The function is convex or kinda convex and everything works out

2. Transformation to polar coordinates works and everything's easy as well (just find the limit of the function as $r \to \infty$ to prove that the local extremum is a global one)

3. The two above don't work and you're fuc..uh... you have to come up with something on the spot.

9 Constrained Optimization
Note: You can skip this explanation of the Lagrangian if you wish and move right to the method itself.
The definition of an economic good is that it's something that is both scarce and desirable. It's pretty obvious that the lack of constraints when optimizing does not go hand-in-hand with the scarcity condition. Almost all optimization problems encountered in real life do have some constraints placed on them.
Here, I'll use a utility maximization problem faced by a consumer as an example. The constraint is their income $I = 5$. Available goods are $x$ and $y$. The utility function is $U(x, y) = x \cdot y$. For simplicity, we'll assume the price of both goods to be equal to 1, so the generic income constraint $P_x \cdot x + P_y \cdot y = I$ transforms into $x + y = 5$. Formally, the task is:

$$\begin{cases} U(x, y) = x \cdot y \to \max\limits_{x, y} \\ x + y = 5 \end{cases}$$

There are several ways to solve this problem.


The most obvious one is to express $y = 5 - x$ and substitute this into the original equation, making the problem

$$x \cdot (5 - x) = 5x - x^2 \to \max\limits_{x}$$

Then we set the derivative to 0:

$$5 - 2x = 0 \;\Rightarrow\; x = 2.5$$

We know that $5x - x^2$ is a parabola with branches downwards, which means that $x = 2.5$ is its maximum. Pretty simple.
In microeconomics we would probably use graphs to solve it. So let's draw the indifference curves (which are called level curves in our course; check section 4.1) and the income constraint:

[Figure: indifference curves of U and the income constraint x + y = 5]

The solution $x = 2.5$ is immediately obvious. The important thing to notice here is that the income constraint is tangent to the optimal level-curve. As we can see, any level-curve that crosses the income constraint, but is not tangent to it, is not optimal.
Now let's write the income constraint $x + y = 5$ in the general form $g(x, y) = c$, with $g(x, y) = x + y$ and $c = 5$. Again, from the picture above, we can see that the level-curves of $f$ and $g$ are tangent. Recall that the gradient's third property says that it is orthogonal to the level-curve. Since the level-curves of $f$ and $g$ share a tangent line at the optimal point, their gradients are orthogonal to the same line, and they must be codirected.
The fact that $\nabla f$ and $\nabla g$ are codirected means that they are scalar multiples of each other and we can get one from the other by multiplying it by some number. Usually $\lambda$ is used here:

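$$\nabla f = \lambda \nabla g \iff \begin{cases} f'_x = \lambda g'_x \\ f'_y = \lambda g'_y \end{cases}$$

By including the original income constraint $g(x, y) = c$, we get three equations with three unknowns and the problem becomes:

$$\begin{cases} f'_x = \lambda g'_x \\ f'_y = \lambda g'_y \\ g(x, y) = c \end{cases}$$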

The equations of this system are the first-order conditions; solving it gives us the candidate points. This approach might seem somewhat unwieldy, especially when we can just substitute $y = 5 - x$, but when constraints become more complex, e.g. $y^2 + x^2 = 1$, it's the most convenient way to solve an optimization problem.
The way to remember those three equations is to introduce the Lagrangian function:

$$L(x, y, \lambda) = f(x, y) + \lambda (c - g(x, y))$$

By taking its partial derivatives $L'_x$, $L'_y$, and $L'_\lambda$ and equating them to zero we get exactly the original system:

$$\begin{cases} L'_x = 0 \\ L'_y = 0 \\ L'_\lambda = 0 \end{cases} \;\Rightarrow\; \begin{cases} f'_x - \lambda g'_x = 0 \\ f'_y - \lambda g'_y = 0 \\ c - g(x, y) = 0 \end{cases}$$
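
For the utility example, $L = xy + \lambda(5 - x - y)$, so the system reads $y - \lambda = 0$, $x - \lambda = 0$, $5 - x - y = 0$, which gives $x = y = \lambda = 2.5$. The sketch below just double-checks this; the helper name `foc_residuals` and the brute-force grid are assumptions made for the illustration.

```python
def foc_residuals(x, y, lam, c=5.0):
    """Left-hand sides of L'_x = 0, L'_y = 0, L'_lambda = 0 for U = x*y, g = x + y."""
    return (y - lam, x - lam, c - x - y)

candidate = (2.5, 2.5, 2.5)  # (x*, y*, lambda*)
print(foc_residuals(*candidate))  # (0.0, 0.0, 0.0): all three conditions hold

# Brute-force comparison: walk along the constraint y = 5 - x on a grid.
grid = [i / 100 for i in range(0, 501)]
best_x = max(grid, key=lambda x: x * (5 - x))
print(best_x, best_x * (5 - best_x))  # 2.5 6.25
```

The grid search agrees with the substitution method from the start of this chapter, as it should.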

Weierstrass Theorem. A function continuous on a compact (closed and bounded) set attains its minimum and maximum.
I suggest you check back on the discussion of the significance of this theorem in section 3.2.

9.1 What the hell is NDCQ?

"Always check NDCQ", your seminar teacher tells you. What the hell does that even mean? Well, its actual formulation is:
If $\exists (x^*, y^*)$ such that $\begin{cases} \nabla g(x^*, y^*) = (0, 0) \\ g(x^*, y^*) = c \end{cases}$, then remember $(x^*, y^*)$ as a candidate for extremum. If no such point exists, then NDCQ holds.
Basically it says that when the gradient of the constraint is equal to 0 at a point that satisfies the constraint, the Lagrangian may fail to detect this point while looking for an extremum (since $\nabla f = \lambda \nabla g$ cannot be solved there unless $\nabla f$ is zero too). This means that we need to check this point separately later.

Example 1. Suppose the constraint is $x^2 + 3y^2 = 4$. Then its gradient is $(2x, 6y)$. $2x = 0 \Rightarrow x = 0$, $6y = 0 \Rightarrow y = 0$. Since $0^2 + 3 \cdot 0^2 = 0 \neq 4$, NDCQ is satisfied.

Example 2. Suppose the constraint is $x^2 + 3y^2 = 0$. Then its gradient is $(2x, 6y)$. $2x = 0 \Rightarrow x = 0$, $6y = 0 \Rightarrow y = 0$. Since $0^2 + 3 \cdot 0^2 = 0$, NDCQ is violated. Then we'll need to calculate the value of the function at the point (0, 0) later and check if it is an extremum.

9.2 Lagrange multiplier method

Steps (solution of an example is below):

1. Check NDCQ (non-degenerate constraint qualification).

If $\exists (x^*, y^*)$ such that $\begin{cases} \nabla g(x^*, y^*) = (0, 0) \\ g(x^*, y^*) = c \end{cases}$, then remember $(x^*, y^*)$ as a candidate for extremum. If no such point exists, then NDCQ holds.

2. Introduce Lagrangian function.

$$L(x, y, \lambda) = f(x, y) + \lambda (c - g(x, y))$$

Check FOC (necessary condition):

$$\begin{cases} L'_x = 0 \\ L'_y = 0 \\ L'_\lambda = 0 \end{cases} \;\Rightarrow\; \begin{cases} f'_x - \lambda g'_x = 0 \\ f'_y - \lambda g'_y = 0 \\ c - g(x, y) = 0 \end{cases}$$

Find critical points $(x^*, y^*, \lambda^*)$.

3. Check SOC (sufficient condition).

Bordered Hessian in our case is:

$$\bar{H} = \begin{pmatrix} 0 & g'_x & g'_y \\ g'_x & L''_{xx} & L''_{xy} \\ g'_y & L''_{yx} & L''_{yy} \end{pmatrix}$$

Note that $L''_{xy} = L''_{yx}$, so you only need to calculate one of them. In our case, $n = 2$, $m = 1$, so if $|\bar{H}| > 0$, then $(x^*, y^*, \lambda^*)$ is a maximum. If $|\bar{H}| < 0$, then $(x^*, y^*, \lambda^*)$ is a minimum.

9.2.1 Bordered Hessian

Exam tip: If you can only memorize one thing from the entire course for the exam, memorize this!!
Let n be the number of variables and m be the number of constraints. The general rule for the Bordered Hessian when finding max is:

1. Calculate the determinant of the Bordered Hessian (recall that the determinant of the whole matrix is its last leading principal minor)

2. If its sign is $(-1)^n$, then start to calculate the determinants of the previous leading principal minors, i.e. remove the rightmost column and the bottom row one by one. The signs must alternate.

3. Calculate the determinants of the last $n - m$ leading principal minors or until the pattern breaks down (so you know this is not max).

The general rule for the Bordered Hessian when finding min is:

1. Calculate the determinant of the Bordered Hessian (recall that the determinant of the whole matrix is its last leading principal minor)

2. If its sign is $(-1)^m$, then start to calculate the determinants of the previous leading principal minors, i.e. remove the rightmost column and the bottom row one by one. The signs must all be equal to the sign of $(-1)^m$.

3. Calculate the determinants of the last $n - m$ leading principal minors or until the pattern breaks down (so you know this is not min).
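
As an illustration, take the utility problem from the start of this chapter: $U = xy$ subject to $x + y = 5$, so $n = 2$, $m = 1$ and only the last $n - m = 1$ leading principal minor (the full determinant) needs checking. Here $g'_x = g'_y = 1$, $L''_{xx} = L''_{yy} = 0$ and $L''_{xy} = L''_{yx} = 1$. The code is a sketch; `det3` is a helper name invented for it.

```python
def det3(m):
    """Determinant of a 3 x 3 matrix by expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

# Bordered Hessian of L = x*y + lambda*(5 - x - y):
#   first row/column: 0, g'_x, g'_y;  inner block: second derivatives of L
bordered_hessian = [
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]

n, m = 2, 1
d = det3(bordered_hessian)
print(d)  # 2

if d * (-1) ** n > 0:
    print("sign matches (-1)^n: constrained maximum")
elif d * (-1) ** m > 0:
    print("sign matches (-1)^m: constrained minimum")
else:
    print("zero determinant: the test is inconclusive")
```

The positive determinant confirms that $x = y = 2.5$ is a constrained maximum, in line with the substitution approach.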

The rule when signs must alternate and when they stay the same is hopefully familiar to you from the discussion of unconstrained optimization. If it's not, you can use mnemonics to remember it (remember the "cave"?). The metaphor that came to my mind is that in order to stay at the top (max) you always need to fight different enemies (so you need to alternate strategies and stuff); and when you're just trying to hold on (min) you're digging into the trenches and just doing one thing (signs stay the same). If this didn't help, try to come up with your own mnemonic! Anyway, here's another one: note that n is always bigger than m. So when we are finding max (big number) we care about $(-1)^n$ and when we are finding min (small number) we care about $(-1)^m$.

Examples. There are great examples and a deeper explanation of the Lagrange Multipliers method in
the OptimizationHOWTO by A.Kalchenko. (if the link doesn't work, go to Mathematics for Economists
page in icef-info and scroll to the bottom of the page). It should also help if you found my explanation
of the Lagrangian and/or Bordered Hessian convoluted and unintelligible.

9.3 Envelope theorem (unconstrained)

Imagine yourself several years from now: a successful ICEF graduate, you are in a very competitive business of growing marijuana. You learned well from the microeconomics courses that the only way to survive in competitive markets is to minimize Average Total Cost. Your ATC is affected by the economies of scale, which increase your production efficiency, and by the fact that if you, um, produce too much of the good, law enforcement agencies will spend much more resources trying to bust you, thus increasing your costs. This model suggests the following quadratic function, which you need to minimize:

$$ATC = y(x) = x^2 - 6x + 14 \to \min\limits_{x}$$
$$y' = 2x - 6 = 0$$
$$x = 3,\quad y = 5$$

[Figure: the ATC curve, with regions labelled "economies of scale effect" and "negative outside effect"]

So you find that the optimal production is 3 units of your top-notch product. However, the police has suddenly become much more active, which affects the coefficient $a$ of the linear term (here it falls from 6 to 4), changing your ATC function to

$$y(x) = x^2 - 4x + 14 \to \min\limits_{x}$$
$$y' = 2x - 4 = 0$$
$$x = 2,\quad y = 10$$

As expected, the optimal quantity has fallen from 3 to 2, while ATC has risen from 5 to 10. However, instead of calculating the optimal production every time the activity level of the police changes, we could solve the problem once for arbitrary $a$: $y(x, a) = x^2 - ax + 14$, and then just substitute the appropriate $a$ into the solution to find the answer. To see how it works, let's do this procedure:

$$y(x, a) = x^2 - ax + 14 \to \min\limits_{x}$$
$$y' = 2x - a = 0$$
$$x = \frac{a}{2},\quad y = -\frac{a^2}{4} + 14$$

Substituting $a = 4$ we get $x = 2$, $y = 10$, exactly as before.
The final expression for ATC is $y(x^*(a), a) = -\frac{a^2}{4} + 14$. It tells us the optimal value of our function, depending on some parameter $a$, and it is called the value function, usually denoted $V(a)$. In our case $V(a) = -\frac{a^2}{4} + 14$.
Note that to find the effect of a marginal change in police activity on ATC we would need to take the derivative of $y(x, a) = x^2 - ax + 14$ by $a$:

$$y'_a(x, a) = -x$$

And since the optimal $x = \frac{a}{2}$, substituting it,

$$y'_a(x, a) = -\frac{a}{2}$$

On the other hand, we could find the effect of a marginal change in police activity on ATC by taking the derivative of $V(a) = -\frac{a^2}{4} + 14$ directly, as it shows ATC for all $a$:

$$V'(a) = -\frac{a}{2}$$

What we just saw is exactly the statement of the Envelope theorem. Mathematically, it is stated as

Theorem 1 (unconstrained optimization). Let $f(x, a) \in C^1$ and $\max\limits_{x} f(x, a) = f(x^*(a), a) = V(a)$, i.e. we rename the result of the maximization as $V(a)$. Then

$$V'(a) = \frac{d f(x^*(a), a)}{da} = \left.\frac{\partial f(x, a)}{\partial a}\right|_{x = x^*(a)}$$
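
The ATC example can be replayed numerically: differentiate the value function with a finite difference and compare it with the partial derivative taken at the optimal $x$. A sketch only; the step size `h` and the function names are assumptions of this illustration.

```python
def atc(x, a):
    return x ** 2 - a * x + 14

def x_star(a):
    """Optimal quantity from the first-order condition 2x - a = 0."""
    return a / 2

def value(a):
    """Value function V(a): ATC evaluated at the optimal quantity."""
    return atc(x_star(a), a)

def value_derivative_numeric(a, h=1e-6):
    """Central finite-difference approximation of V'(a)."""
    return (value(a + h) - value(a - h)) / (2 * h)

def partial_a_at_optimum(a):
    """Partial derivative of atc by a (which is -x), evaluated at x = x*(a)."""
    return -x_star(a)

a = 4.0
print(value(a))                     # 10.0
print(partial_a_at_optimum(a))      # -2.0
print(value_derivative_numeric(a))  # approximately -2.0
```

Both routes give the same slope, which is the whole content of the theorem: there is no need to track how $x^*(a)$ itself moves.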

9.4 Envelope theorem (constrained)

Most often the Envelope theorem is used in the constrained case, e.g. with the Lagrangian. In this case the mathematical formulation gets very clunky, but the result is that the Lagrangian $L = f(x, y) + \lambda(c - g(x, y))$ takes the place of $f$ when differentiating $V(a)$. This means that all you need to do is to take the derivative of the Lagrangian with respect to the parameter, at the optimal point you have found. The intuition behind this is that $L$ kinda incorporates $f$ and $g$ together, which means that we can work (i.e. take the derivative) with it directly.

Example. (taken from 24.12.2014 exam)


It is known that the point (1, 0) is the constrained local maximum of the function $f(x, y) = 5x - ky - 3x^2 + 2xy - 5y^2$ subject to $x + y = 1$.
(a) Find the value of $k$ and the maximum value of the function $f$.
(b) Using the Envelope theorem, find the new value of the maximum if $k$ increases by 0.1.

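Solution.
(a) First, set up the Lagrangian

$$L = 5x - ky - 3x^2 + 2xy - 5y^2 + \lambda(1 - x - y)$$

Then write down the first-order conditions and find $k$:

$$\begin{cases} L'_x = 5 - 6x + 2y - \lambda = 0 & (1) \\ L'_y = -k + 2x - 10y - \lambda = 0 & (2) \\ L'_\lambda = 1 - x - y = 0 \;\Rightarrow\; y = 1 - x & (3) \end{cases}$$

Substituting $y = 1 - x$ and $x = 1$ into (1): $5 - 6 + 2 - 2 - \lambda = 0$, so $\lambda = -1$.
Substituting $x = 1$, $y = 0$, $\lambda = -1$ into (2): $-k + 2 + 1 = 0$, so $k = 3$.

So $f(x, y) = 2$ at $x = 1$, $y = 0$ and $k = 3$.
(b) By the Envelope theorem,

$$df = L'_k \, dk = -y \cdot 0.1 = 0$$

since $y = 0$ at the optimum. So, to first order, the maximum value stays equal to 2.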
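
It is instructive to compare the envelope prediction with a full re-solve. Substituting $y = 1 - x$ into $f$ gives $-10x^2 + (17 + k)x - k - 5$, which is maximised at $x = (17 + k)/20$; this reduction was worked out for the sketch below and is not part of the exam solution.

```python
def f(x, y, k):
    return 5 * x - k * y - 3 * x ** 2 + 2 * x * y - 5 * y ** 2

def constrained_max(k):
    """Maximise f on x + y = 1 via the reduced one-variable problem."""
    x = (17 + k) / 20  # from d/dx of -10x^2 + (17 + k)x - k - 5 = 0
    y = 1 - x
    return x, y, f(x, y, k)

x0, y0, v0 = constrained_max(3.0)
print(x0, y0, v0)  # 1.0 0.0 2.0, matching part (a)

# Envelope theorem: dV/dk = L'_k = -y at the optimum, so the first-order change is
predicted = v0 + (-y0) * 0.1
actual = constrained_max(3.1)[2]
print(predicted)  # 2.0
print(actual)     # approximately 2.00025
```

The small gap between the two numbers is a second-order effect in $dk$, which the Envelope theorem ignores by construction.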
