© 2006, 2008 by Joel Sobel and Joel Watson. For use only by Econ 205 students at
UCSD in 2008. These notes are not to be distributed elsewhere.
Preface
These notes are the starting point for a math-preparation book, primarily for use by
UCSD students enrolled in Econ 205 (potentially for use by folks outside UCSD
as well). The first draft consists of a transcription of Joel Watson’s handwritten
notes, as well as extra material added by Philip Neary, who worked on the tran-
scription in 2006. Joel Sobel and Joel Watson have revised parts of these notes
and added material, but the document is still rough and disorganized. The material
here is incomplete and surely contains many mistakes. If you find an error, a
notational inconsistency, or another deficiency, please let one of the Joels know.
Contents
2 Sequences
2.1 Introduction
2.2 Sequences

4 Differentiation

5 Taylor's Theorem

6 Univariate Optimization

7 Integration
7.1 Introduction
7.2 Fundamental Theorems of Calculus
7.3 Properties of Integrals
7.4 Computing Integrals

10 Convexity
10.1 Preliminary: Topological Concepts
10.2 Convex Sets
10.3 Quasi-Concave and Quasi-Convex Functions
10.3.1 How to check if a function f is quasiconcave or not
10.3.2 Relationship between Concavity and Quasiconcavity
10.3.3 Ordinal "vs" Cardinal
1. Sets are typically denoted by italic (math type) upper case letters; elements
are lower case.
2. PN used script letters to denote sets and there may be remnants of this
throughout (such as in some figures, all of which must be redrawn anyway).
4. Standard numerical sets (reals, positive integers, etc.) are written using the
mathbb symbols: R, P, and so on.
Chapter 1

Sets, Functions, and the Real Line
This chapter reviews some basic definitions regarding sets and functions, and it
contains a brief overview of the construction of the real line.
1.1 Sets
The most basic of mathematical concepts is a set, which is simply a collection of
objects. For example, the “days of the week” is a set comprising the following
objects: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, and Sunday.
The set of Scandinavian countries consists of: Sweden, Norway, Denmark, Fin-
land and Iceland. Sets can be composed of any objects whatsoever.
We often use capital italic letters to denote sets. For instance, we might let D
denote the set of days of the week and S denote the set of Scandinavian countries.
We will use lowercase italic letters to denote individual objects (called elements
or points) in sets. Using the symbol “∈,” which means “is an element of,” we thus
write x ∈ X to indicate that x is an element of set X. By the way, the symbol "∉"
means "is not an element of," so we would write x ∉ X to mean that X does not
contain x.
To define a set, it is sometimes convenient to list its elements. Formally, we
do this by enclosing the list in curly brackets and separating the elements with
commas. For instance, the set of Scandinavian countries is

S ≡ {Sweden, Norway, Denmark, Finland, Iceland}.
4 CHAPTER 1. SETS, FUNCTIONS, AND THE REAL LINE
[Figures: Venn diagrams illustrating A ⊂ B and A ∪ B.]
U ≡ {a, b, c, . . . , z}.
1.2 Functions
Often we are interested in representing ways in which elements of one set might
be related to, or associated with, elements of another set. For example, for sets X
[Figure: Venn diagram illustrating A ∩ B.]
and Y , we might say that every x ∈ X points to, or “maps to,” a point y ∈ Y . The
concept of a “function” represents such a mapping.
Definition 6. A function f from a set X to a set Y is a specification (a mapping)
that assigns to each element of X exactly one element of Y . Typically we express
that f is such a mapping by writing f : X → Y . The set X (the points one “plugs
into” the function) is called the domain of the function, and the set Y (the items
that one can get out of the function) is called the function’s codomain.
Note that the key property of a function is that every element in X is associated
with exactly one element in Y . If you plug a point from X into the function, it
specifies just one point in Y that is associated with it. It is not the case that some
point x ∈ X maps to both y and y′ such that y ≠ y′.
We write f (x) as the point in Y that the function associates with x ∈ X. Also,
for any subset of the domain Z ⊂ X, we define

f (Z) ≡ {f (x) | x ∈ Z}.

We refer to this as the image of the set Z under the function f . The set f (X), which
is clearly a subset of Y , is called the range of the function. Finally, we define the
inverse image of a set W ⊂ Y as
f −1 (W ) ≡ {x ∈ X | f (x) ∈ W }.
This is the set of points in X that map to points in W . Note that the inverse image
does not necessarily define a function from Y to X, because it could be that there
are two distinct elements of X that map to the same point in Y .
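To make the image and inverse image concrete, here is a small Python sketch. The particular function f below is a made-up example, not one from the text:

```python
# A finite function f : {x, y, z} -> {1, 2, 3}, stored as a dict
# (a hypothetical example for illustration only).
f = {"x": 1, "y": 3, "z": 3}

def image(f, Z):
    """The image f(Z) = {f(x) : x in Z}."""
    return {f[x] for x in Z}

def inverse_image(f, W):
    """The inverse image f^{-1}(W) = {x in X : f(x) in W}."""
    return {x for x in f if f[x] in W}

print(image(f, {"x", "z"}))        # the image of {x, z} is {1, 3}
print(inverse_image(f, {3}))       # both y and z map to 3, so this is {'y', 'z'}
```

Note that inverse_image(f, {3}) returns a two-element set: two distinct points map to the same value, which is exactly why the inverse image need not define a function from Y to X.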
Figures 1.4 and 1.5 depict functions from S to T . Figure 1.6 depicts a map-
ping that is not a function. The mapping does not associate any element in T with
the point y ∈ S. Another violation of the requirements for a function is that the
mapping associates two elements of T (both 1 and 3) with the single point x ∈ S.
1.3. THE REAL LINE 7
Example 2. With α and β shown in Figures 1.4 and 1.5, we have α({ x, z}) =
{ 2, 3} and β −1 ({1, 2}) = {x, z}.
Note that the function α in Example 2 is onto while the function β is not, since
its image is {1, 3} which is a proper subset of {1, 2, 3}, the codomain of β.
The set of reals comprises all of the ordinary numbers you are used to dealing
with, such as 0, 1, fractions like 1/2, decimals such as 4.23, and more exotic
numbers like the “natural number” e. The set includes both positive and negative
numbers, arbitrarily high numbers, and arbitrarily low (large negative) numbers.
Although many of its elements are quite familiar to most people, the defini-
tion of R is not as straightforward. In fact, R is defined in relation to the real
number system, which includes the set R, operators of addition and multiplica-
tion, the standard "identity numbers" 0 and 1, and some axioms (assumptions). Here
is a brief description of how the set R is defined. It is okay to not study a complete
treatment, but realize that the real number system has some nuances.
A good place to start is with the set of positive integers:
P ≡ {1, 2, 3, . . .}.
One way to think of this set is that it is defined by the special “multiplicative iden-
tity” number 1 and the idea of addition. The number 1 is called the multiplicative
identity because 1 times any number is the same number. The set of positive in-
tegers comprises the number 1, 1 + 1, 1 + 1 + 1, and all such numbers formed by
adding 1 to itself multiple times.
These are pictured in Figure 1.7. Note that α is one-to-one but not onto, and β is
Definition 11. If there exists a 1-1 function of X onto Y, we say that X and Y can
be put in a 1-1 correspondence, or that X and Y have the same cardinal number,
or briefly, that X and Y are equivalent, and we write X ∼ Y.
• It is reflexive: X ∼ X .
• It is symmetric: If X ∼ Y, then Y ∼ X .
• It is transitive: If X ∼ Y and Y ∼ Z, then X ∼ Z.
Any relation with these three properties is called an equivalence relation.
Note: there may be lots of functions from X onto Y that are not 1-1, but for
this condition we only need to be able to find one such function!
Definition 12. For any positive integer n, let Pn be the set whose elements are the
integers 1, 2, . . . , n; Let P be the set consisting of all positive integers (which is
the set of natural numbers we already saw in Example 3).
For any set A, we say:
• A is finite if A ∼ Pn for some n.
• A is infinite if A is not finite.
• A is countable if A ∼ P.
• A is uncountable if A is neither finite nor countable.
• A is at most countable if A is finite or countable.
Note: see section ?? for a further discussion of countable versus uncountable.
Definition 13. The rational numbers, Q,¹ are informally defined as the set of all
numbers of the form m/n, where m and n are integers and n ≠ 0.
Definition 14. The Real Numbers, R,² are informally defined as the set of all
regular numbers from negative to positive:

R = {x | −∞ < x < ∞}

Formally,³ R is defined as a set of objects associated with a set of "operators"
(functions from R × R → R) "+" and "·", and special elements 0 and 1, satisfying
certain axioms:
¹ Q is an example of a countable set. This may seem very confusing, since there must be more
rational numbers than integers (mustn't there be!); this will be discussed further in section ??.
² R is an example of an uncountable set.
³ For mathsy types in the class: really, first of all you have to define a field, then an ordered
field, and then R can be defined as the ordered field which has the least upper bound property
(to be defined soon) and which contains Q as a subfield. Finally you must show that the elements
of R \ Q, the irrationals, can be written as infinite decimal expansions and are "approximated" by
the corresponding finite decimals. If after all this you still care, see Rudin, Chapter 1.
• Addition: If x, y ∈ R, then x + y ∈ R.
• Additive Identity: x + 0 = 0 + x = x, ∀ x ∈ R.
• Associativity: (x + y) + z = x + (y + z), ∀ x, y, z ∈ R.
• If x ∈ R, then exactly one of the following holds: x ∈ R++ , x = 0, or −x ∈ R++ ,
where −x is defined so that (−x) + x = 0.
It is fairly obvious from this that R++ is just the set of strictly positive real
numbers.
x > y ⇐⇒ x − y ∈ R++

where

x − y ≡ x + (−y)
A number a is an upper bound of a set X ⊂ R if

a ≥ x, ∀ x ∈ X,

and a number b is a lower bound of X if

b ≤ x, ∀ x ∈ X.
Note that there can be lots of upper bounds and lots of lower bounds.
Example 4. The following illustrates the important point that the “sup” and
“inf” of a set do not have to be contained in the set.
For example let
X = [0, 1], and Y = (0, 1)
We can see that
sup X = sup Y = 1
and that
inf X = inf Y = 0
However, obviously the points 0 and 1 are not contained in the set Y.

Lemma 1. Any set X ⊂ R (X ≠ ∅) that has an upper bound has a least upper
bound (sup).
⁴ Any reader able to complete this exercise without assistance should have a successful career
as a mathematician. Although the result seems intuitive, it is a deep property of the real numbers.
Note: It should now be clear that the max and min do not always exist even if
the set is bounded, but the sup and the inf do always exist if the set is bounded.
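A quick numerical illustration of this point, using a finite collection of points from inside (0, 1) (the particular set below is my own made-up stand-in for Y):

```python
# The points 1 - 1/n all lie strictly inside (0, 1), so 1 is an upper bound
# of this collection but is not an element of it: the set has a sup that is
# not attained as a max.
points = [1 - 1/n for n in range(1, 1001)]

print(max(points))                  # 0.999: the largest element we listed
print(all(p < 1 for p in points))   # True: 1 bounds the set from above
```

No matter how many terms of the form 1 − 1/n we add, the maximum stays strictly below the sup of 1.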
P ≡ { 1, 1 + 1, 1 + 1 + 1, . . . }
Integers:
Z ≡ P ∪ −P ∪ { 0}
1.4. METHODS OF PROOF 13
Rational Numbers:

Q ≡ {m/n | m, n ∈ Z, n ≠ 0}
It can be shown that
Q⊂R
and
Q 6= R
It follows that D = E.
Example 6. Suppose there is a two person world consisting of John and Mary,
and in this world everybody wears properly fitting clothes. Suppose people fit into
the clothes of people the same size or bigger, but do not fit into the clothes of
smaller people. We denote John’s height by x, and Mary’s height by y. We want
to show that John is taller than Mary, i.e. that
x>y
So the way we proceed is that we make the initial assumption that Mary is at least
as tall as John, i.e. y ≥ x. But then we note that John’s clothes are all bigger
than Mary’s clothes. So John must not fit into Mary’s clothes. Thus we must have
that our initial assumption is false. So
x>y
Now this is a pretty simple and stupid example but highlights the way to proceed.
The following statements are all equivalent (you should convince yourself of
this)
• x ∈ A ⇒ x ∈ B (that is, A ⊂ B)
• x ∈ Bᶜ ⇒ x ∈ Aᶜ (that is, Bᶜ ⊂ Aᶜ)
• x ∉ B ⇒ x ∉ A
(p ⇒ q) ⇐⇒ (¬q ⇒ ¬p)
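The equivalence of a statement and its contrapositive can be checked mechanically with a truth table; here is a short Python sketch:

```python
from itertools import product

def implies(p, q):
    # "p implies q" is false only when p is true and q is false
    return (not p) or q

# p => q and (not q) => (not p) agree on every truth assignment
for p, q in product([True, False], repeat=2):
    assert implies(p, q) == implies(not q, not p)
print("p => q is equivalent to (not q) => (not p)")
```

This is why proving the contrapositive, as in Example 6, is a legitimate way to prove the original implication.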
Lemma 2.
a − ε < x ≤ a
• S(1) is true
• S(n) ⇒ S(n + 1)
S(1): (1/4)(1)² (1 + 1)² = 1
Now look at n = m + 1:

∑_{k=1}^{m+1} k³ = ∑_{k=1}^{m} k³ + (m + 1)³
= (1/4) m² (m + 1)² + (m + 1)³
= (m + 1)² ((1/4) m² + m + 1)
= (1/4) (m + 1)² (m² + 4m + 4)
= (1/4) (m + 1)² (m + 2)²
So you can see that if we assume the original statement is true for m, then
everywhere we had an m we now have an m + 1. Thus the statement is true for
m + 1, and so we are done.
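The identity being proved, ∑_{k=1}^{n} k³ = (1/4) n² (n + 1)², can also be spot-checked numerically; a quick sketch:

```python
def sum_of_cubes(n):
    """Direct computation of 1^3 + 2^3 + ... + n^3."""
    return sum(k**3 for k in range(1, n + 1))

def closed_form(n):
    # (1/4) n^2 (n+1)^2, kept in exact integer arithmetic
    return n**2 * (n + 1)**2 // 4

# Checking finitely many cases is only evidence, not a proof;
# the induction argument is what establishes the identity for every n.
assert all(sum_of_cubes(n) == closed_form(n) for n in range(1, 200))
print(sum_of_cubes(10), closed_form(10))   # both are 3025
```

The integer division by 4 is safe because one of n and n + 1 is even, so n²(n + 1)² is always divisible by 4.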
If and only if
You will have noticed in preceding sections mentioning of things like “if and
only if ” (which is often abbreviated to iff ), and phrases like “A is necessary for
B”, or “x is sufficient for y”. You will also have noticed the symbol ⇐⇒ .
So what the hell do all these things mean? Thankfully they are all very closely
related.
The symbol ⇐⇒ is just the mathematical symbol for "if and only if."⁵ But
that's not particularly helpful if we don't know what "if and only if" means.
The difference between "if" and "iff" is as follows. Compare the two sentences
below:

(1) I will drink the beer if it is a Guinness.

(2) I will drink the beer if and only if it is a Guinness.
Sentence (1) says only that I will drink Guinness. It does not rule out that I
may also drink Budweiser. Maybe I will, maybe I won’t - there just is not enough
information to determine. All we know for sure is that I will drink Guinness.
⁵ In fact many symbols are used for "if and only if," such as ⇐⇒ , ≡, and ←→, so watch out:
different authors may use different ones.
1.5. SOME HELPFUL NOTES 17
Sentence (2) makes it quite clear that I will drink Guinness and Guinness only.
I won't drink any other type of beer.⁶ Also, I will definitely drink the beer if it is
a Guinness.
This may seem confusing, so perhaps we should look at how proofs involving
"iff" are special. Consider the First Fundamental Theorem of Asset Pricing due
to Harrison and Kreps.⁷
Theorem 4. The finite market model is viable if and only if there exists an equiv-
alent martingale measure.
So perhaps it is better to look at the statement of the theorem more carefully
(with some shortening so as to fit the page better).
Theorem 5. viable (call this A) if and only if equivalent martingale measure
(call this B).
Or in mathematical symbols
Theorem 6. A ⇐⇒ B.
Now you might have noticed that the ⇐⇒ is sort of a fusion of the symbols
⇐= and =⇒. This was not an accident. To prove an iff statement, you must
prove the implication both ways. To see what is meant by this let’s look at a mock
proof for the above theorem.
Basically to prove a statement A ⇐⇒ B, you must show
• A =⇒ B
• B =⇒ A
Of course you can also prove that
• A =⇒ B
• ¬A =⇒ ¬B
since recall that proving B ⇒ A is the same as proving ¬A ⇒ ¬B.⁸
Ok, so now let's run through a mock proof⁹ of the First Fundamental Theorem
of Asset Pricing.
⁶ I'm not really like this :)
⁷ Don't worry if you don't understand what the theorem means; the method of proof is what's
important.
⁸ Or, now to show we really get it, we could write [B ⇒ A] ⇐⇒ [¬A ⇒ ¬B]!
⁹ There is not even a hint of how to prove the theorem here. I'm just trying to show you how
you go about proving such types of statements.
Proof. (⇒) To prove the implication in this direction, we first assume that the
finite market model is viable. Or more simply, we assume A. Having
made this assumption, we now have to show that there exists an equivalent
martingale measure. That is, we have to show B.
(⇐) Now we must go the other way. So we start by assuming there is an equiv-
alent martingale measure, i.e. we assume B. Then using only this assump-
tion, we must somehow end up with the statement that the finite market
model is viable (statement A).
For example, see back to Definition 18 on page 10. For clarity’s sake I’ll
restate the first part.
a ≥ x, ∀ x ∈ X
Necessary Condition
“I am the best football player in Ireland if I am the best football player in the
world” (equivalently: If I am the best football player in the world, then I am the
best football player in Ireland).
Note however that this is not an if and only if statement. It does not go both
ways. If I am the best football player in Ireland, we cannot conclude that I am the
best football player in the world (one of those Brazilians might be better than me!).
Thus we would say that being the best football player in Ireland is a necessary
condition for being the best football player in the world, since if I am the best
football player in the world, then I am necessarily the best football player in
Ireland.
Sufficient Condition
Example 10. Suppose you are studying for the mathcamp final while sitting
on a bench on the cliffs overlooking Torrey Pines, and suppose you are becoming
increasingly frustrated with the course and are really getting sick of studying. You
decide to hurl your notes off the cliff so that they land in the sea where they will
never be seen again. Then we would say that the hurling of the notes is a sufficient
condition for the notes to land in the sea. But the notes landing in the sea does
not imply that they were hurled: you might have slipped and dropped them, or a
gust of wind may have come and taken them. Thus the hurl was sufficient but not
necessary for the notes to land in the sea.
¹⁰ Football refers to soccer, not "American football."
You guessed it, necessity and sufficiency are dual to one another. Look back to
example 9. We have that the following are all equivalent:
• (best football player in the world) is sufficient for (best football player in
Ireland)
• (best football player in Ireland) is necessary for (best football player in the
world)
Infinity
The notion of infinity is one of the most seemingly simple yet difficult con-
cepts in all of mathematics. The symbol for infinity is ∞. Intuitively, ∞ just
represents a really “big” number that “beats” all other numbers. We got an intro-
duction to how to measure the size of sets on pages 8 and 9.
But before we get to comparing sizes of sets, let’s just look at how infinity is
defined.
Definition 24. The extended real number system, R, consists of R¹¹ and two sym-
bols, +∞ and −∞. We define
−∞ < x < +∞
for every x ∈ R.
• If x ∈ R then

x + ∞ = +∞
x − ∞ = −∞
x/(+∞) = x/(−∞) = 0
• If x > 0 then
x · (+∞) = +∞
x · (−∞) = −∞
• If x < 0 then
x · (+∞) = −∞
x · (−∞) = +∞
Now recall Definitions 11 and 12. In fact, let's just state them again:
Definition 25. If there exists a 1-1 function of X onto Y, we say that X and Y can
be put in a 1-1 correspondence, or that X and Y have the same cardinal number,
or briefly, that X and Y are equivalent, and we write X ∼ Y.
Note: there may be lots of functions from X onto Y that are not 1-1, but for
this condition we only need to be able to find one such function!
Definition 26. For any positive integer n, let Pn be the set whose elements are the
integers 1, 2, . . . , n; Let P be the set consisting of all positive integers (which is
the set of natural numbers we already saw in Example 3).
• A is countable if A ∼ P.
• A is uncountable if A is neither finite nor countable.
• A is at most countable if A is finite or countable.
The notion of cardinality was always used to compare finite sets, and is very
intuitive for such finite sets. But the notion, as described above, applied to infinite
sets, was first used by Georg Cantor. He said that any set which can be put in
one-to-one correspondence with the set of positive integers P is countably infinite.
Loosely, what this means is that we can label or index all the elements of any
countable set with the integers. Now it turns out that the rational numbers Q can
be put in a one-to-one correspondence with P and as such are countable. Now this
seems very unintuitive, since there are obviously more rationals: they also
include fractions. The point is you can keep going further out along the set of
integers and as such can keep indexing the rationals. Every time you need another
index, it is there. This is not an easy concept. In fact, the integers Z have
the same cardinality as the positive integers P, even though P ⊂ Z. Confusing,
eh?? To make it worse, a countable union of countable sets is a new set that is also
countable! For a proof of the countability of the rationals Q, see Rudin, Theorem
2.13.
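Cantor's indexing of the positive rationals can be sketched in a few lines of Python. The idea (a sketch of the standard diagonal argument, not the specific proof in Rudin) is to walk the grid of fractions m/n along the diagonals m + n = d, skipping duplicates such as 2/2:

```python
from fractions import Fraction

def enumerate_positive_rationals(count):
    """List the first `count` positive rationals in diagonal order,
    skipping values already seen (e.g. 2/2 duplicates 1/1)."""
    seen, out = set(), []
    d = 2                      # diagonal: numerator + denominator = d
    while len(out) < count:
        for m in range(1, d):
            q = Fraction(m, d - m)
            if q not in seen:
                seen.add(q)
                out.append(q)
                if len(out) == count:
                    break
        d += 1
    return out

print(enumerate_positive_rationals(10))
```

Every positive rational m/n eventually shows up on diagonal d = m + n, so every rational receives an index; this is exactly the "every time you need another index, it is there" idea above.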
Now the reals, R, are an example of an uncountable set. Loosely, what this
means is that they cannot be put in a one-to-one correspondence with Z. There
are just too many of them to index by the integers. Below we list some other sets
of cardinality c (the cardinality of the continuum).
Yes that’s right, there are more numbers within [0, 1] than there are in Q! And
the sets R and [0, 1] are of the same cardinality even though obviously [0, 1] is a
tiny subset of R.
But then mustn’t the cardinality of R be at most twice that of Q? And from
the statement above, twice a countable set would definitely be countable! What’s
wrong with this?
Chapter 2
Sequences
2.1 Introduction
2.2 Sequences
26 CHAPTER 2. SEQUENCES
converges in R (to 0), but fails to converge in the set of all positive real numbers
R++ .
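The sequence in question is not defined in this excerpt; assuming a_n = 1/n as a representative positive sequence converging to 0 in R (but to no point of R++), the ε-N bookkeeping can be sketched as:

```python
import math

def N_for(eps):
    """An N such that 1/n < eps for every n >= N."""
    return math.floor(1 / eps) + 1

for eps in (0.1, 0.01, 0.001):
    N = N_for(eps)
    # check a long stretch of the tail against the epsilon band around 0
    assert all(abs(1 / n - 0) < eps for n in range(N, N + 1000))
print(N_for(0.01))   # 101: from n = 101 on, 1/n < 0.01
```

The limit 0 lies outside R++, which is why the same sequence has no limit inside the positive reals.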
b = b′ .

Let d(x, y) ≡ |x − y|.²
Proof. Let ε > 0 be given. There exist integers N and N ′ such that

n ≥ N =⇒ d(an , b) < ε/2,
n ≥ N ′ =⇒ d(an , b′ ) < ε/2.

Hence, if n ≥ max(N, N ′ ), we have

d(b, b′ ) ≤ d(b, an ) + d(an , b′ ) < ε.

Since ε > 0 was arbitrary, it follows that b = b′ .
² d here is called a metric. You can think of it as a function which takes in any two points
in a set and spits out the distance between the two of them. It will be more formally defined
in Definition 72.
2.2. SEQUENCES 27
A subsequence of a sequence

f : P −→ R

is given by a function³

f ◦ g : P −→ R

where

g : P −→ P

and g is assumed to be strictly increasing.⁴
1. monotonically increasing if

an ≤ an+1 (n = 1, 2, 3, . . . )

2. monotonically decreasing if

an ≥ an+1 (n = 1, 2, 3, . . . )
³ See back to Definition 10 for what exactly a composition mapping is.
⁴ We will see exactly what it means for a function to be strictly increasing soon, but for now
just think of it as always going up.
Note we say strictly increasing/decreasing if the above weak inequalities are re-
placed with strict inequalities.
Theorem 8. Suppose

lim_{n→∞} an = b.

Then every subsequence of {an } also converges to b.

The converse is not true. That is, if a subsequence has a limit, this does not imply
that the sequence necessarily has a limit.⁵
Example 13. Consider the sequence

{an } = 1, 1, 1, 1/2, 1, 1/3, 1, 1/4, 1, 1/5, 1, . . .
So we can tell from this that the subsequence obtained by taking the even elements
of the above sequence has a limit of 0.⁶ But this is clearly not the limit of the orig-
inal sequence since the number 1 keeps appearing! Moreover, the subsequence
⁵ As a test of whether or not you've grasped necessity versus sufficiency: which is which in this
case?
⁶ Though I didn't define the space on which the sequence is defined, so technically this is not
correct!
obtained by taking the odd elements of the original sequence clearly has a limit of
1 (since we are only taking the element 1 over and over again)! (Recall that
the terms of a sequence need not be distinct, and a subsequence is a sequence in
its own right.)
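Example 13 can be reproduced numerically. A sketch, indexing from n = 1 so that odd positions carry the constant 1 and even position 2k carries 1/k:

```python
def a(n):
    # a_1, a_2, a_3, a_4, ... = 1, 1, 1, 1/2, 1, 1/3, 1, 1/4, ...
    return 1.0 if n % 2 == 1 else 1.0 / (n // 2)

evens = [a(n) for n in range(2, 41, 2)]   # 1, 1/2, 1/3, ...  -> tends to 0
odds = [a(n) for n in range(1, 40, 2)]    # 1, 1, 1, ...      -> tends to 1

print(evens[:4])       # first few terms of the even subsequence
print(set(odds))       # {1.0}: the odd subsequence is constant
```

Two subsequences with two different limits, so the original sequence itself cannot converge.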
Suppose

lim_{n→∞} an = a and lim_{n→∞} bn = b.

Then

1. lim_{n→∞} (an + bn ) = a + b

2. lim_{n→∞} c an = c a, for c ∈ R

3. lim_{n→∞} an bn = a b

4. lim_{n→∞} (an )^k = a^k

5. lim_{n→∞} 1/an = 1/a, provided an ≠ 0 (n = 1, 2, 3, . . . ) and a ≠ 0

6. lim_{n→∞} an /bn = a/b, provided bn ≠ 0 (n = 1, 2, 3, . . . ) and b ≠ 0
Proof.⁷ 1. Given ε > 0, ∃ integers N1 , N2 such that

n ≥ N1 =⇒ |an − a| < ε/2,
⁷ This proof really uses the triangle inequality, which will not show up until Definition 72, but
which intuitively just states that the shortest distance between any two points is a straight line
between them.
n ≥ N2 =⇒ |bn − b| < ε/2.

If we take N = max(N1 , N2 ), then n ≥ N =⇒

|(an + bn ) − (a + b)| ≤ |an − a| + |bn − b| < ε.
2. This proof is very easy: just factor out the c and then put it back in at the
end,
so that
lim_{n→∞} (an − a)(bn − b) = 0.
We now apply results (1) and (2) to Equation 2.1, and conclude that
Definition 32. We say the sequence {an } is bounded above if ∃ m ∈ R such that
an ≤ m, ∀n ∈ P
We say the sequence {an } is bounded below if ∃ m ∈ R such that
an ≥ m, ∀n ∈ P
Note that a bounded sequence does not necessarily converge; for example, the
sequence {0, 1, 0, 1, 0, 1, . . . } just oscillates back and forth between 0 and 1 for-
ever.
Theorem 11. Let {an }, {bn }, and {cn } be sequences such that

an ≤ bn ≤ cn , ∀ n ∈ P,

and suppose that both

an −→ a and cn −→ a.

Then bn −→ a.
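This "squeeze" can be seen in action numerically. A sketch, with the hypothetical choice b_n = sin(n)/n sandwiched between a_n = −1/n and c_n = 1/n (all three tending to 0):

```python
import math

# |sin(n)| <= 1, so -1/n <= sin(n)/n <= 1/n for every n >= 1
for n in (1, 10, 100, 10_000):
    a_n, b_n, c_n = -1 / n, math.sin(n) / n, 1 / n
    assert a_n <= b_n <= c_n       # the sandwich hypothesis holds
print(abs(math.sin(10_000) / 10_000))   # tiny: b_n is forced toward 0
```

Even though sin(n) oscillates forever, the shrinking bounds on either side leave b_n no choice but to converge to 0.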
1. (=⇒, "only if") We already showed that any convergent sequence is bounded in The-
orem 10. So obviously a convergent monotone sequence is bounded!¹⁰
2. (⇐=, "if") Suppose {an } is bounded (and, say, monotonically increasing). Thus
by Lemma 1, we know that the set X ≡ {an | n ∈ P} has a sup, which we will
denote a. We must show that a is the limit of {an }. We need to show that
∀ ε > 0, ∃ N ∈ P such that

|a − an | < ε, ∀ n ≥ N.
Note that this theorem is not saying that there exists a unique monotone subse-
quence. Indeed, if there is one monotone subsequence, there will be infinitely
many (generate a new one by deleting the first term of the original one).
1. Form a subsequence of {xn } by:

x_{n_1} = max_{n>1} xn ,
x_{n_2} = max_{n>n_1} xn ,
. . .
x_{n_{k+1}} = max_{n>n_k} xn .
2. Define

x_{n_1} ≡ xN ,
x_{n_2} ≡ the first term following x_{n_1} for which x_{n_2} > x_{n_1} ,
. . .
x_{n_{k+1}} ≡ the first term following x_{n_k} for which x_{n_{k+1}} > x_{n_k} .
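Case 2 of the construction (repeatedly taking the first later term that is strictly larger) can be sketched for a finite list; the data below is made up, and for an infinite sequence this process is guaranteed to continue forever only under the hypothesis of case 2:

```python
def increasing_subsequence(xs):
    """Greedy version of case 2: start from the first term and keep
    appending the first later term that strictly exceeds the current one."""
    out = [xs[0]]
    for x in xs[1:]:
        if x > out[-1]:
            out.append(x)
    return out

seq = [3, 1, 4, 1, 5, 9, 2, 6]       # a hypothetical finite sequence
print(increasing_subsequence(seq))    # [3, 4, 5, 9]
```

The returned list is strictly increasing by construction, which is exactly the monotone subsequence the theorem promises (in the infinite setting).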
Definition 33. A sequence {an } is said to be a Cauchy sequence if for every ε > 0
there is an integer N such that

|an − am | < ε

if n ≥ N and m ≥ N .
Refer back to the definition of a convergent sequence (Definition 28) to see the
difference between the definition of convergence and the definition of a Cauchy
sequence. The difference is that the limit is explicitly involved in the definition of
convergence but not in that of a Cauchy sequence.¹¹
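One can gather numerical evidence for the Cauchy property by checking finitely many terms; a sketch (finite checks can only suggest, never prove, that a sequence is Cauchy):

```python
def cauchy_N(a, eps, M):
    """Smallest N (up to M) with |a(n) - a(m)| < eps for all N <= n, m <= M.
    Numerical evidence only: a real proof must handle all n, m >= N."""
    for N in range(1, M + 1):
        if all(abs(a(n) - a(m)) < eps
               for n in range(N, M + 1) for m in range(N, M + 1)):
            return N
    return None

print(cauchy_N(lambda n: 1 / n, 0.01, 1000))   # 91 for a_n = 1/n
```

Notice that the check never mentions a limit: it only compares terms of the sequence to each other, which is the whole point of the Cauchy definition.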
Note that a ∉ X is possible, so a set need not include its limit points.
Definition 36. X is called a closed set if it contains all its limit points.
(x − ε, x + ε) ⊂ X
A note on notation:
As you may have seen before, a closed interval from the point x to the point y is
denoted with square brackets,

[x, y],

while an open interval from x to y is denoted with round brackets,

(x, y).

So the only difference is that the open interval does not contain the endpoints x
and y but does include every point right up next to them!
Recall the set X from example 14. We can see that this set X has a limit point
(namely the point a = 0) but that no point of X is a limit point of X . It’s important
to understand the difference between having a limit point and containing one!
Definition 38. y is called the limit of f from the right at a (right-hand limit) if,
for every ε > 0, there is a δ > 0 such that

0 < x − a < δ =⇒ |f (x) − y| < ε.

Definition 39. y is called the limit of f from the left at a (left-hand limit) if, for
every ε > 0, there is a δ > 0 such that

0 < a − x < δ =⇒ |f (x) − y| < ε.

Note: Now it is a − x, since we are coming from the left, so this must be positive.
38 CHAPTER 3. FUNCTIONS AND LIMITS OF FUNCTIONS
y = lim_{x→a−} f (x) = lim_{x→a+} f (x),

and we write

y = lim_{x→a} f (x).
Note: The definition of the limit of a function at the point a does not require
the function to be defined at a.
The use of ε and δ in these definitions is a standard part of mathematical anal-
ysis, but takes practice to appreciate. The formal definition of the limit at a point
is a way to capture the idea of the behavior of a function "near" the point. If
lim_{x→a} f (x) = L, this means, roughly, that values of f (x) are near L when x is
near a.
In order to use the definition, you must answer infinitely many questions: for
each ε you must say how close x must be to a to guarantee that f (x) is within ε
of L. The way that you do this is to define δ as a function of ε and then attempt to
confirm the definition.
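As a worked instance of "define δ as a function of ε" (using the hypothetical function f(x) = 3x + 1 at a = 2, so L = 7, which is my own example): since |f(x) − 7| = 3|x − 2|, the choice δ = ε/3 works. A numerical sanity check:

```python
import random

def check_limit(eps, trials=10_000):
    """Sample points within delta of a = 2 and confirm f(x) is within eps of 7."""
    delta = eps / 3                      # delta as a function of eps
    for _ in range(trials):
        x = 2 + random.uniform(-delta, delta)
        if 0 < abs(x - 2) < delta and not abs((3 * x + 1) - 7) < eps:
            return False
    return True

assert all(check_limit(e) for e in (1.0, 0.1, 0.001))
print("delta = eps/3 works for f(x) = 3x + 1 at a = 2")
```

Random sampling is not a proof, of course; the proof is the one-line algebra |f(x) − 7| = 3|x − 2| < 3δ = ε.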
=⇒ lim_{x→0} f (x) = 0
But the value of the function at the point 0 is not equal to the limit. The example
shows that it is possible for limx→a f (x) to exist but for it to be different from
f (a). Since you can define the limit without knowing the value of f (a), this
observation is mathematically trivial. It highlights a case that we wish to avoid,
because we want a function’s value to be approximated by nearby values of the
function.
Theorem 17. Limits are unique. That is, if lim_{x→a} f (x) = L and lim_{x→a} f (x) =
L′ , then L = L′ .

Proof. Assume that L ≠ L′ and argue to a contradiction. Let ε = |L − L′ |/2.
Given this ε, let δ ∗ > 0 have the property that |f (x) − L| and |f (x) − L′ | are less
than ε when 0 < |x − a| < δ ∗ . (This is possible by the definition of limits.) Since
|a − b| ≤ |a| + |b|
This expression is the same as |a + b| ≤ |a| + |b|: just replace b by −b. The
left-hand side is the distance between a and b. The right-hand side is the distance
between 0 and a plus the distance between 0 and b. Hence the inequality says that
it is shorter (or equal) to go directly from a to b than to go from a to 0 and then
0 to b. In one dimension you can check that the inequality is true by using the
definition of absolute value, by drawing a simple picture, or by thinking about the
meaning of the three different terms.
If a limit exists, then both limit from the left and limit from the right exist (and
they are equal) and, conversely, if both limit from right and limit from left exist,
then the limit exists. These statements require proofs, but the proofs are simple.
You do not want to use ε-δ proofs whenever you need to find a limit. To avoid
tedium, you need to collect a few “obvious” limits (for example, if the function f
is constant, then limx→a f (x) exists for all a and is equal to the constant value of
f ), and then some basic results that permit you to compute limits.
Theorem 18. If f and g are functions defined on a set S, a ∈ (α, β) ⊂ S and
limx→a f (x) = M and limx→a g(x) = N , then
1. limx→a (f + g) (x) = M + N
2. limx→a (f g) (x) = M N
3. lim_{x→a} (f /g)(x) = M/N , provided N ≠ 0.
Proof. For the first part, given ε > 0, let δ1 > 0 be such that if 0 < |x − a| <
δ1 then |f (x) − M | < ε/2, and δ2 > 0 be such that if 0 < |x − a| < δ2 then
|g(x) − N | < ε/2. This is possible by the definition of limit. If δ = min{δ1 , δ2 },
then 0 < |x − a| < δ implies

|(f + g)(x) − (M + N )| ≤ |f (x) − M | + |g(x) − N | < ε.
The first inequality follows from the triangle inequality while the second uses the
definition of δ. This proves the first part of the theorem.
For the second part, you can use the same type of argument, setting δ1 so that
if 0 < |x − a| < δ1 then |f (x) − M | < √ε, and so on.
For the third part, note that when g(x) and N are not equal to zero
Using this theorem you can generate the limits of many functions by combin-
ing more basic information.
of a – that is, the function must be defined on an interval (α, β) with a ∈ (α, β).
We extend the definition to take into account “boundary points” in a natural way:
We say that f defined on [a, b] is continuous at a (resp. b) if limx→a+ f (x) = f (a)
(resp. limx→b− f (x) = f (b)). Note: Informally one could remember the definition
of continuity as:

f (lim_{x→a} x) = lim_{x→a} f (x).
Continuity at a point requires two things: a limit exists and the limit is equal
to the right thing. It is easy to think of examples in which the limit exists, but
is not equal to the value of the function. The limit of a function can fail to exist
for two different reasons. It could be that the left and right hand limits exist,
but are different. Alternatively, either left or right-hand limit may fail to exist.
This could happen either because the function is growing to infinity (for example,
f (x) = 1/x for x > 0 is continuous at any point a > 0, but cannot be
continuous at 0 no matter how f (0) is defined), or because the function
oscillates wildly (a standard example is sin(1/x) near x = 0).
Lemma 3. f : X −→ R is continuous at a if and only if for every ε > 0, there is
a δ > 0 such that

0 < |a − x| < δ =⇒ |f (x) − f (a)| < ε.
Note: a very informal way of describing a continuous function is that you can
draw the function without lifting your hand off the page, by sweeping the pen in
one fluid, continuous motion!
is continuous at a.¹
Proof. Let ε > 0 be given. Since f is continuous at g(a), there exists γ > 0 such
that

|f (y) − f (g(a))| < ε

if

|y − g(a)| < γ

and y ∈ g(X ).
Since g is continuous at a, there exists δ > 0 such that

|g(x) − g(a)| < γ

if

|x − a| < δ

and x ∈ X .
It follows that

|f (g(x)) − f (g(a))| < ε

if

|x − a| < δ

and x ∈ X .
Thus f ◦ g is continuous at a.
It is easy to show that constant functions and linear functions are continuous.
By the combining properties, you can conclude that polynomials and ratios
of polynomials are continuous (the latter wherever the denominator is nonzero).
Theorem. f : (a, b) −→ R is continuous at x ∈ (a, b) if and only if
f (xn ) −→ f (x)
for every sequence {xn } ⊂ (a, b) with xn −→ x.1
1 Again, note that this is continuity at a point.
Proof. 1. ( =⇒ ) We argue by contraposition, proving ¬B =⇒ ¬A in order to
establish A =⇒ B.
Suppose ∃ a sequence {xn } ⊂ (a, b) with xn → x but f (xn ) does not converge
to f (x). This means that ∃ ε > 0 such that ∀ δ > 0 there is a term xn of the
sequence such that
|xn − x| < δ
but
|f (xn ) − f (x)| ≥ ε.
f : [a, b] −→ R
is continuous, then there exist c ≤ d such that f ([a, b]) = [c, d].
The assumptions in the theorem are important. If f is not continuous, then
there is generally nothing that can be said about the image. If the domain is an
open interval, the image could be a closed interval (it is a point if f is constant),
or it could be unbounded even if the interval is bounded (for example, if
f (x) = 1/x on (0, 1)).
The first consequence of the result is the existence of a maximum (and mini-
mum).
Definition 42. We say that x∗ ∈ X maximizes the function f on X if
f (x∗ ) ≥ f (x),
for every x ∈ X .
44 CHAPTER 3. FUNCTIONS AND LIMITS OF FUNCTIONS
Similarly, we say that x∗ ∈ X minimizes the function f on X if
f (x∗ ) ≤ f (x),
for every x ∈ X .
We write
maxx∈X f (x) = max f (X ),
and
max f (X ) = y ∗
is the maximum value.
Since the image of f is a closed, bounded interval, f attains both its maximum
(the maximum value is d) and its minimum. This means that if a function is
continuous and it is defined on a “nice” domain, then it has a maximum. This
result is much more general than the result above. It applies to all real-valued
continuous functions (defined on arbitrary sets, not just the real numbers) provided
that the domain is “compact.” Bounded closed intervals contained in the real line
are examples of compact sets. More generally, any set that is “closed” (contains
its boundary points) and bounded is compact.
FIGURES GO HERE
Thus using this result and Lemma 1, we know that sup f (X ) exists. So we
can find a sequence {yn } ⊂ f (X ) such that
yn −→ sup f (X ).
Since f (X ) is a closed interval, the limit stays in the set:
limn→∞ yn = sup f (X ) ∈ f (X ).
And now, by Theorem 2, when the max exists it equals the sup:
limn→∞ yn = sup f (X ) = max f (X ).
market clearing price system, for example) follow from (harder to prove) versions
of this result.
Economists often prove existence results using fixed-point theorems. The eas-
iest fixed point theorem is a consequence of the Intermediate Value Theorem.
The point x∗ ∈ S is called a fixed point of the function f : S −→ S if
f (x∗ ) = x∗ .
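The Intermediate Value Theorem argument behind this fixed-point result can be seen numerically. In the Python sketch below (the example function is ours, not from the notes), f (x) = cos(x) maps [0, 1] into itself, so g(x) = cos(x) − x changes sign on [0, 1] and bisection locates the fixed point:

```python
import math

# Bisection on g(x) = cos(x) - x, which is positive at 0 (g(0) = 1)
# and negative at 1 (g(1) = cos(1) - 1 < 0), so the Intermediate
# Value Theorem guarantees a root, i.e. a fixed point of cos.
lo, hi = 0.0, 1.0
for _ in range(60):
    mid = (lo + hi) / 2
    if math.cos(mid) - mid > 0:
        lo = mid
    else:
        hi = mid
x_star = (lo + hi) / 2

assert abs(math.cos(x_star) - x_star) < 1e-12
```

The fixed point found this way is x∗ ≈ 0.739, and indeed cos(x∗ ) = x∗ to machine precision.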
Chapter 4
Differentiation
Calculus works because of two insights. The first is that linear functions are easy
to understand. The second insight is that although not all interesting functions
are linear, there is a large set of functions that can be approximated by a linear
function. The derivative is the best linear approximation to a function. You obtain
a lot of analytic power by taking a general function and studying it by learning
about linear approximations to the function.
How do you approximate a function? The first step is to treat the approxima-
tion as local. Given a point in the domain of the function, a, the idea is to come up
with a function that is easy to deal with and is close to the given function when x
is close to a. A possible attempt to do this is with a zero-th order approximation.
Definition 44. The zero-th order approximation of the function f at a point a in
the domain of f is the function A0 (x) ≡ f (a).
The symbol ≡ indicates an identity. One could also have written A0 (x) = f (a)
for all x.
One way to approximate a function is with the constant function that is equal to
the value of the function at a point. A0 (x) is certainly a tractable function and if f
is continuous at a it is the only function that satisfies limx→a (f (x) − A0 (x)) = 0.
That is the good news. The bad news is that A0 tells you almost nothing about the
behavior of f .
The next step is to try to replace A0 with the best linear approximation to f . In
order to do this, imagine a line that intersects the graph of f at the points (x, f (x))
and the point (x + δ, f (x + δ)). This line has slope given by:
(f (x + δ) − f (x))/((x + δ) − x) = (f (x + δ) − f (x))/δ
When f is linear, the line with this slope is f (just like when f is constant,
A0 ≡ f ). Otherwise, it is not. If we are interested only in the local behavior of
f , then it makes sense to consider slopes defined when δ is small. The notion of
limit tells us how to do this:
limδ→0 (f (x + δ) − f (x))/((x + δ) − x) = limδ→0 (f (x + δ) − f (x))/δ
The denominator of the expression does not make sense when δ = 0, but we do
not need to know this value to evaluate the limit of the ratio as δ approaches zero.
On the other hand, for the limit to make sense the ratio must be defined for all
sufficiently small non-zero δ – that is, x must be an interior point of the domain
of f . If the limit exists, we say that f is differentiable at x and call the limit the
derivative of f at x.
As with the two earlier definitions (limit of function and continuity), there are
two “one-sided” versions of this definition.
Take
f : (a, b) −→ R
and x ∈ (a, b).
Definition 45. The derivative of a function f is defined at a point x when the left-hand
derivative equals the right-hand derivative. There are many ways to denote
the derivative:
f ′ (x) = Df = df /dx ≡ limy→x (f (y) − f (x))/(y − x).
Example 16. Let f (x) = x2 . Then
f ′ (x) = limy→x (f (y) − f (x))/(y − x)
= limy→x (y 2 − x2 )/(y − x)
= limy→x (y − x)(y + x)/(y − x)
= limy→x (y + x)
= 2x.
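The limit computation above can be checked numerically. The following Python sketch (ours, not from the notes) evaluates the difference quotient for f (x) = x2 at several points and compares it with 2x:

```python
# Difference quotients for f(x) = x^2: as d shrinks, the secant slope
# (f(x + d) - f(x)) / d should approach the derivative 2x.
def diff_quotient(f, x, d):
    """Slope of the secant through (x, f(x)) and (x + d, f(x + d))."""
    return (f(x + d) - f(x)) / d

f = lambda x: x ** 2
for x in [0.0, 1.0, -3.0]:
    approx = diff_quotient(f, x, 1e-7)
    assert abs(approx - 2 * x) < 1e-5
```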
Now we can see how the derivative creates a first-order approximation. As-
sume the function f is defined on an open interval containing x. Let A1 (y) =
f (x) + f 0 (x)(y − x). This is the equation of the line with slope f 0 (x) that passes
through (x, f (x)). It follows from the definition of the derivative that
limy→x (f (y) − A1 (y))/(y − x) = 0. (4.1)
Equation (4.1) explains why A1 is a better approximation to f than A0 . Not only is
the linear approximation A1 close to f when y is close to x – this would be true if
the limit of the numerator in (4.1) converged to zero as y converged to x. It is also
the case that the linear approximation stays close to f even when you divide the
difference by something really close to zero (y − x). From this interpretation and the examples,
it is not surprising that it is harder to be differentiable than it is to be continuous.
Theorem 26. Consider the function
f : X −→ R.
If f is differentiable at x, then f is continuous at x.
where the second-to-last equality follows from the fact that if you have two functions
g and h and limy→x h(y) exists and limy→x g(y) exists, then
limy→x g(y)h(y) = (limy→x g(y))(limy→x h(y)).
Theorem 27. Suppose f and g are defined on an open interval containing the point
x and both functions are differentiable at x. Then, (f + g), f · g, and f /g are
differentiable at x (the last of these provided g(x) ≠ 0).1
1. (f + g)′ (x) = f ′ (x) + g ′ (x)
2. (f · g)′ (x) = f ′ (x)g(x) + f (x)g ′ (x)
3. (f /g)′ (x) = (g(x)f ′ (x) − f (x)g ′ (x)) / [g(x)]2
Proof. 1. This should be clear by Theorem 9 (1), though again we are using
the “functional version”.
2. Let h = f g. Then
h(y) − h(x) = f (y)g(y) − f (x)g(x) = f (y)(g(y) − g(x)) + g(x)(f (y) − f (x)).
If we divide this by y − x and note that f (y) −→ f (x) and g(y) −→ g(x)
as y −→ x by Theorem 26, then the result follows.
Exercise. Show that if
f (x) = xk ,
then f ′ (x) = kxk−1 .
Hint: prove it by induction, using property (2) of Theorem 27. We already have
the case k = 2 from Example 16.
1 f · g denotes multiplication of functions. Composition, f ◦ g, is different.
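The power rule from the exercise above can also be spot-checked numerically. In this Python sketch (ours; numeric_deriv is an illustrative helper, not anything defined in the notes), a symmetric difference quotient approximates the derivative of x^k for several k:

```python
# Check d/dx x^k = k x^(k-1) numerically for several k.  The symmetric
# difference quotient is used because it converges faster than the
# one-sided version.
def numeric_deriv(f, x, d=1e-6):
    return (f(x + d) - f(x - d)) / (2 * d)

x = 1.7
for k in range(2, 6):
    f = lambda t, k=k: t ** k
    assert abs(numeric_deriv(f, x) - k * x ** (k - 1)) < 1e-4
```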
Theorem 28 (Chain Rule). Suppose
g : X −→ Y
and
f : Y −→ R,
that g is differentiable at x, and that f is differentiable at y = g(x) ∈ Y.2
Then f ◦ g is differentiable at x and (f ◦ g)′ (x) = f ′ (g(x)) g ′ (x).
limz→x (f (g(z)) − f (g(x)))/(z − x) = limz→x h(z) · (g(z) − g(x))/(z − x)
= [limz→x h(z)] [limz→x (g(z) − g(x))/(z − x)]
|z − x| < δ and g(z) ≠ g(x), then |h(z) − f ′ (g(x))| < ε. To complete the proof,
just note that if g(z) = g(x), then h(z) = f ′ (g(x)).
2
In this theorem we are talking about f ◦ g.
Definition 46. Recall Definition 42. This is technically the definition of a global
max. We say globally since it maximizes the function over the whole domain.
However, we say x∗ is a local maximizer of the function f if there exists a segment
(a, b) containing x∗ such that
f (x∗ ) ≥ f (x), ∀ x ∈ (a, b).
A maximum that occurs at a boundary point of the domain need not be
a local maximum, but a maximum that occurs in the interior of the domain is
a local maximum.
Theorem 29. Suppose f is defined on [a, b]. If f has a local max at c ∈ (a, b) and
if f 0 (c) exists, then
f 0 (c) = 0
Proof. Since c is a local maximum, there is δ > 0 such that for |x − c| < δ,
f (x) − f (c) ≤ 0.
Therefore, if x ∈ (c, c + δ), then
(f (x) − f (c))/(x − c) ≤ 0, (4.2)
while for x ∈ (c − δ, c),
(f (x) − f (c))/(x − c) ≥ 0. (4.3)
Since f is differentiable at c, both left and right derivatives exist and they are
equal. Inequality (4.2) states that the derivative from above must be nonpositive
(if it exists). Inequality (4.3) states that the derivative from below must be non-
negative (if it exists). Since the derivative exists, these observations imply that the
derivative must be both nonpositive and nonnegative. The only possibility is that
f 0 (c) = 0.
You can use essentially the same argument to conclude that if c is a local
minimum and f 0 (c) exists, then f 0 (c) = 0. The equation f 0 (c) = 0 is called a
first-order condition. The theorem states that satisfying a first-order condition is
a necessary condition for c to be a local maximum or minimum. This observation
may be the most important result in calculus for economics. Economists are in-
terested in optimizing functions. The result says that you can replace solving an
optimization problem (which seems complicated) with solving an equation (which
perhaps is easier to do). The approach is powerful and generalizes. It suffers from
several limitations. One limitation is that it does not distinguish local maxima
from local minima. This is a major limitation of the statement of the theorem. If
you examine the proof carefully, you will see that the calculus does distinguish
maxima from minima. If you attempt to carry out the proof when c is a local
minimum, the inequalities in (4.2) and (4.3) will be reversed. This means, loosely
speaking, that for a local maximum f ′ (x) ≥ 0 for x < c and f ′ (x) ≤ 0 for x > c.
That means, it appears that f ′ is decreasing, while for a local minimum the derivative
is negative to the left of c and positive to the right.
A second limitation is that the theorem only applies to local extrema. It is
possible that the maximum occurs at the boundary or there are many local max-
ima. Calculus still has something to say about boundary optima: If f is defined on
[a, b] and a is a maximum, then f (a) ≥ f (x) for all x ∈ [a, b] so that the derivative
from above must be nonpositive. Analogous statements are available for minima
or for the right-hand endpoint. As for identifying which local maximum is a true
maximum, calculus typically does not help. You must compare the values of the
various candidates.
Finally, it is possible that f 0 (c) equals zero, but c is neither a local maximum
nor a local minimum. The standard example of this is f (x) = x3 at the point
x = 0.
In spite of the drawbacks, Theorem 29 gives a procedure for solving an op-
timization problem for a differentiable function on the interval [a, b]: Solve the
equation f 0 (c) = 0. Call the set of solutions Z. Evaluate f (x) for all x ∈
Z ∪ {a, b}. The highest value of f over this set is the maximum. The lowest
value is the minimum. This means that instead of checking the values of f over
the entire domain, you need only check over a much smaller set.
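The procedure just described can be sketched in a few lines of Python for the illustrative function f (x) = x3 − 3x on [−2, 3] (our example, not from the notes), where f ′ (c) = 3c2 − 3 = 0 gives Z = {−1, 1}:

```python
# Candidate set: the solutions Z = {-1, 1} of f'(c) = 3c^2 - 3 = 0,
# plus the endpoints {a, b} = {-2, 3}.
f = lambda x: x ** 3 - 3 * x

candidates = [-2.0, 3.0, -1.0, 1.0]          # {a, b} union Z
values = {x: f(x) for x in candidates}

maximizer = max(values, key=values.get)      # the boundary point x = 3
minimizer = min(values, key=values.get)      # min value -2, tied at x = -2 and x = 1

assert maximizer == 3.0 and values[maximizer] == 18.0
assert values[minimizer] == -2.0
```

Note that the true maximum here occurs at the boundary, exactly the case the first-order condition alone cannot find.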
An implication of the algorithm is that if a differentiable function’s derivative
is never zero, then it only has maxima and minima at the boundaries of its domain.
The previous result tells you that there is something special about places where
the derivative is zero. It is also possible to interpret places where the derivative is
positive or negative.
Definition 48. The function f is increasing if x > y implies f (x) ≥ f (y). The
function f is strictly increasing if x > y implies f (x) > f (y). The function is
increasing in a neighborhood of x if there exists δ > 0 such that if y ∈ (x −
δ, x + δ) then x > y implies f (x) ≥ f (y). The function is strictly increasing in a
neighborhood of x if there exists δ > 0 such that if y ∈ (x − δ, x + δ) then x > y
implies f (x) > f (y).
There are analogous definitions for decreasing and strictly decreasing. A function
that is either (strictly) increasing everywhere or (strictly) decreasing everywhere is
called (strictly) monotonic. There is a little bit of ambiguity about the term “increasing.”
Some people use “non-decreasing” to describe a function that we call increasing.
The theorem almost says that differentiable functions are (strictly) increasing
if and only if the derivative is nonnegative (positive). Alas, you can have functions
that are strictly increasing but whose derivative is sometimes zero: f (x) = x3 is
again the standard example.
The proof of Theorem 30 is a straightforward exercise in the definition of the
derivative. It requires only writing down inequalities similar to (4.2) and (4.3) and
a little bit of care.
The previous two results give you excellent tools for graphing functions. You
can use derivatives to identify when the function is increasing, decreasing, or has
local maxima and minima. If you can figure out where the function crosses zero
and how it behaves at infinity, then you have nice insight into its behavior.
Theorem 31. If
g : R 7−→ R
is the inverse of
f : R 7−→ R
and if f is strictly increasing and differentiable with f 0 > 0, then
g ′ (f (x)) = 1/f ′ (x).
Note: f must be strictly monotonic (here, strictly increasing), since otherwise it is
not invertible and hence g is not well defined.
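A numerical illustration of Theorem 31 (a Python sketch of ours, not from the notes): for f (x) = x3 + x, which is strictly increasing with f ′ (x) = 3x2 + 1 > 0, the inverse g can be evaluated by bisection, and its derivative at f (2) = 10 should match 1/f ′ (2) = 1/13:

```python
# f is strictly increasing, so f(x) = y can be solved by bisection;
# that solve plays the role of the inverse g.
def f(x):
    return x ** 3 + x

def g(y, lo=-100.0, hi=100.0):
    for _ in range(200):
        mid = (lo + hi) / 2
        if f(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

x = 2.0
y = f(x)                                    # y = 10
d = 1e-6
g_prime = (g(y + d) - g(y - d)) / (2 * d)   # numeric g'(f(x))
assert abs(g_prime - 1 / (3 * x ** 2 + 1)) < 1e-3
```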
Theorem 32 (Mean Value Theorem). If f is real valued and continuous on [a, b],
and differentiable on (a, b), then ∃ a point c ∈ (a, b) such that
f ′ (c) = (f (b) − f (a))/(b − a).
Proof. Define
g(x) = f (x) − ((f (b) − f (a))/(b − a)) (x − a).
We know that g(x) is continuous on compact [a, b]. Thus g(x) attains its maximum
and minimum on [a, b]. Note, however, that g(a) = g(b) = f (a). Consequently,
either g(a) ≥ g(x) for all x ∈ [a, b] so that g must attain its minimum on (a, b) or
else g attains its maximum on (a, b). We can conclude that g has a local minimum
or local maximum at some point c ∈ (a, b) and so g 0 (c) = 0. Thus
g ′ (c) = f ′ (c) − (f (b) − f (a))/(b − a) = 0.
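For a concrete check of the Mean Value Theorem (an illustrative Python sketch of ours), take f (x) = x3 on [0, 2]: the secant slope is (f (2) − f (0))/2 = 4, and f ′ (c) = 3c2 equals 4 at c = 2/√3, which does lie in (0, 2):

```python
import math

a, b = 0.0, 2.0
f = lambda x: x ** 3
secant = (f(b) - f(a)) / (b - a)   # slope of the chord: 4.0

c = 2 / math.sqrt(3)               # solves f'(c) = 3c^2 = 4
assert a < c < b
assert abs(3 * c ** 2 - secant) < 1e-12
```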
Theorem 33. Suppose f is real valued and continuous on [a, b], and differentiable
on (a, b). If f 0 (x) ≡ 0 for x ∈ (a, b), then f is constant.
The result is intuitively obvious and perhaps something that you would im-
plicitly assume. It does require proof. (Notice that the converse is true too: If f is
constant, then it is differentiable and its derivative is always zero.)
Proof. By the Mean Value Theorem, for all x, y ∈ [a, b] with x ≠ y,
(f (x) − f (y))/(x − y) = f ′ (c)
for some c between x and y. Since f ′ (c) = 0, it follows that f (x) = f (y). Hence
f is constant.
Theorem (L’Hôpital’s Rule). Suppose f and g are differentiable on (a, b), with either
limx→b− f (x) = limx→b− g(x) = 0
or
limx→b− |g(x)| = ∞,
and further
limx→b− f ′ (x)/g ′ (x) = L ∈ R.
Then
limx→b− f (x)/g(x) = L.
L’Hopital’s Rule is a useful way to evaluate indeterminate forms (0/0). It
looks a bit magical: How can the ratio of the functions be equal to the ratio of
derivatives? You prove that the rule works by using a variation of the mean value
theorem. In Case 1, what is going on is that f (x) = f (b) + f 0 (c)(x − b) for some
c ∈ (x, b) and similarly g(x) = g(b) + g 0 (d)(x − b). Since f (b) = g(b) = 0
(loosely), the ratio of f to g is the ratio of derivatives of f and g. The trouble is
that these derivatives are evaluated at different points. The good news is that you
can prove a version of the Mean-Value Theorem that allows you to take c = d.
This enables you to prove the theorem.
Warning: The Rule needs the two conditions to hold. If you try to evaluate
limx→0 (x + 1)/(x2 + 1)
by setting the ratio equal to 1/(2x) and taking the limit you would be making a
big mistake. (The ratio is a rational function and the denominator is positive, so it
is continuous. Therefore the limit is just the function’s value at 0: 1.)
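Both sides of this warning can be seen numerically (an illustrative Python sketch, not from the notes): for f (x) = sin x and g(x) = x the hypotheses hold and the ratio tends to cos(0)/1 = 1, while (x + 1)/(x2 + 1) is not a 0/0 form, so its limit is simply its value at 0:

```python
import math

x = 1e-6
# Legitimate use: sin(x)/x -> cos(0)/1 = 1 as x -> 0 (a 0/0 form).
assert abs(math.sin(x) / x - 1.0) < 1e-9
# Illegitimate use: (x+1)/(x^2+1) is not 0/0; its limit at 0 is just
# its value there, 1, and has nothing to do with 1/(2x).
assert abs((x + 1) / (x ** 2 + 1) - 1.0) < 1e-5
```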
Chapter 5
Taylor’s Theorem
Using the first derivative, we were able to come up with a way to find the best
linear approximation to a function. It is natural to ask whether it is possible to
find higher order approximations. What does this mean? By analogy with zeroth
and first order approximations, we first decide what an appropriate approximating
function is and then what the appropriate definition of approximation is.
First-order approximations were affine functions. In general, an nth-order
approximation is a polynomial of degree n, that is a function of the form
a0 + a1 x + · · · + an−1 xn−1 + an xn .
Technically, the degree of a polynomial is the largest power of x that appears
with a non-zero coefficient. So this polynomial has degree n if and only if an ≠ 0.
Plainly a zeroth degree polynomial is a constant, a first degree polynomial is
an affine function, a second degree polynomial is a quadratic, and so on. An nth
order approximation of the function f at x is a polynomial of degree at most n,
An that satisfies
limy→x (f (y) − An (y))/(y − x)n = 0.
This definition generalizes the earlier definition. Notice that the denominator is
a power of y − x. When y approaches x the denominator is really small. If the
ratio has limit zero it must be that the numerator is really, really small. We know
that zeroth order approximations exist for continuous functions and first-order
approximations exist for differentiable functions. It is natural to guess that higher-
order approximations exist under stricter assumptions. This guess is correct.
Definition 49. The nth derivative of a function f , denoted f (n) , is defined inductively
to be the derivative of f (n−1) .
Theorem (Taylor’s Theorem). Suppose f is (n + 1)-times differentiable on an open
interval containing c and d. Then there exists t between c and d such that
f (d) = An (d) + (f (n+1) (t)/(n + 1)!)(d − c)n+1 , (5.1)
where An is the Taylor polynomial for f centered at c:
An (d) = Σnk=0 (f (k) (c)/k!)(d − c)k .
The theorem decomposes f into a polynomial and an error term
En = (f (n+1) (t)/(n + 1)!)(d − c)n+1 . Notice that
limd→c En /(d − c)n = 0,
so the theorem states that the Taylor polynomial is, in fact, the nth order approxi-
mation of f at c.
The form of the Taylor approximation may seem mysterious at first, but the
coefficients can be seen to be the only choices with the property that f (k) (c) =
An(k) (c) for k ≤ n. As impressive as the theorem appears, it is just a disguised
version of the mean-value theorem.
Proof. Define
F (x) ≡ f (d) − Σnk=0 (f (k) (x)/k!)(d − x)k
and
G(x) ≡ F (x) − ((d − x)/(d − c))n+1 F (c).
It follows that F (d) = 0 and (lots of terms cancel) F ′ (x) = −(f (n+1) (x)/n!)(d − x)n .
Also, G(c) = G(d) = 0. It follows from the mean value theorem that there exists
a t between c and d such that G′ (t) = 0. That is, there exists a t such that
0 = −(f (n+1) (t)/n!)(d − t)n + (n + 1)((d − t)n /(d − c)n+1 ) F (c),
or
F (c) = (f (n+1) (t)/(n + 1)!)(d − c)n+1 .
An examination of the definition of F confirms that this completes the proof.
Taylor’s Theorem has several uses. As a conceptual tool it makes precise the
notion that well behaved functions have polynomial approximations. This per-
mits you to understand “complicated” functions like the logarithm or exponential
by using their Taylor’s expansion. As a computational tool, it permits you to
compute approximate values of functions. Of course, doing this is not practical
(because calculators and computers are available). As a practical tool, first- and
second-order approximations permit you to conduct analyses in terms of linear or
quadratic approximations. This insight is especially important for solving opti-
mization problems, as we will see soon.
Next we provide examples of the first two uses.
Consider the logarithm function: f (x) = log x.1 This function is defined for
x > 0. It is not hard to show that f (k) (x) = x−k (−1)k−1 (k − 1)!. So f (k) (1) =
(−1)k−1 (k − 1)! Hence:
f (x) = ΣNk=1 (−1)k−1 (x − 1)k /k + EN ,
where EN = (−1)N (x − 1)N +1 /(y N +1 (N + 1)) for some y between 1 and x.
Notice that this expansion is
done around x0 = 1. This is a point at which the function is nicely behaved. Next
notice that the function f (k) is differentiable at 1 for all k. This suggests that you
can extend the polynomial for an infinite number of terms. It is the case that
log(x) = Σ∞k=1 (−1)k−1 (x − 1)k /k.
Sometimes this formula is written in the equivalent form:
1
Unless otherwise mentioned, logarithms are always with respect to the base e.
log(y + 1) = Σ∞k=1 (−1)k−1 y k /k.
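Partial sums of this series can be compared against math.log in Python (an illustrative sketch; log1p_series is our name, not anything from the notes):

```python
import math

def log1p_series(y, n_terms):
    # Partial sum of sum_{k>=1} (-1)^(k-1) y^k / k.
    return sum((-1) ** (k - 1) * y ** k / k for k in range(1, n_terms + 1))

for y in [0.5, -0.3, 0.9]:      # the series converges for |y| < 1
    assert abs(log1p_series(y, 200) - math.log(1 + y)) < 1e-6
```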
The second way to use Taylor’s Theorem is to find approximations. The formula
above can let you compute logarithms. How about square roots? The Taylor
expansion of the square root of x around 1, evaluated at x = 2, gives:
√2 = 1 + .5 − .125 + E2 .
Here E2 = x−2.5 /16 for some x ∈ [1, 2]. Check to make sure you know where the
terms come from. The approximation says that √2 = 1.375 up to an error. The
error term is largest when x = 1; hence the error is no more than 1/16 = .0625.
The error term is smallest when x = 2. I’m not sure what the error is then, but it
is certainly positive. Hence I know that the square root of 2 is at least 1.375 and
no greater than 1.4375. Perhaps this technique will come in handy the next time
you need to compute a square root without the aid of modern electronics.
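The bound just derived is easy to confirm numerically (an illustrative Python check, not from the notes):

```python
import math

approx = 1 + 0.5 - 0.125      # the Taylor polynomial at x = 2: 1.375
max_error = 1 / 16            # the error term E2 = x^(-5/2)/16 at x = 1
assert approx <= math.sqrt(2) <= approx + max_error
```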
Chapter 6
Univariate Optimization
3. If f ′′ (x∗ ) < 0, then x∗ is a local maximum.
4. If f ′′ (x∗ ) > 0, then x∗ is a local minimum.
Conditions (1) and (3) are almost converses (as are (2) and (4)), but not quite.
Knowing that x∗ is a local maximum is enough to guarantee that f ′′ (x∗ ) ≤ 0.
Knowing that f ′′ (x∗ ) ≥ 0 is not enough to guarantee that you have a local minimum
(you may have a local maximum or you may have neither a minimum nor a maximum).
(All the intuition you need comes from thinking about the behavior of
f (x) = xn at x = 0 for different values of n.) You might think that you
could improve the statements by trying to characterize strict local maxima. It is
true that if f ′′ (x∗ ) < 0, then x∗ is a strict local maximum, but it is possible to have
a strict local maximum and f ′′ (x∗ ) = 0. The conditions in Theorem 37 parallel
the results about first derivatives and monotonicity stated earlier.
Proof. By Taylor’s Theorem we can write:
f (x) = f (x∗ ) + f ′ (x∗ )(x − x∗ ) + (1/2)f ′′ (t)(x − x∗ )2 (6.1)
for some t between x and x∗ . If f ′′ (x∗ ) > 0, then by continuity of f ′′ , f ′′ (t) > 0 for t
sufficiently close to x∗ and so, by (6.1), f (x) > f (x∗ ) for all x sufficiently close
to x∗ . Consequently, if x∗ is a local maximum, f ′′ (x∗ ) ≤ 0, proving (1). (2) is
similar.
If f ′′ (x∗ ) < 0, then by continuity of f ′′ , there exists δ > 0 such that if 0 <
|x∗ − t| < δ, then f ′′ (t) < 0. By (6.1), it follows that if 0 < |x − x∗ | < δ, then
f (x) < f (x∗ ), which establishes (3). (4) is similar.
The theorem allows us to refine our method for looking for maxima. If f is
defined on an interval (and is twice continuously differentiable), the maximum (if
it exists) must occur either at a boundary point or at a critical point x∗ that satisfies
f ′′ (x∗ ) ≤ 0. So you can search for maxima by evaluating f at the boundaries and
at the appropriate critical points.
This method still does not permit you to say when a local maximum is really
a global maximum. You can do this only if f satisfies the appropriate global
conditions.
Definition 50. We say a function f is concave over an interval X ⊂ R if ∀x, y ∈
X and δ ∈ (0, 1), we have
f (δx + (1 − δ)y) ≥ δf (x) + (1 − δ)f (y) (6.2)
If f is only a function of one argument you can think of this graph as having an
inverted “u” shape.
Geometrically the definition says that the graph of the function always lies
above segments connecting two points on the graph. Another way to say this is
that the graph of the function always lies below its tangents (when the tangents
exist). If the inequality in (6.2) is strict, then we say that the function is strictly
concave. A linear function is concave, but not strictly concave.
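Definition 50 can be spot-checked numerically. The Python sketch below (ours, not from the notes) tests the defining inequality for the strictly concave function f (x) = −(x − 1)2 on a grid of points and weights:

```python
# f(x) = -(x - 1)^2 is strictly concave; the defining inequality
# f(dx + (1-d)y) >= d f(x) + (1-d) f(y) should hold everywhere.
f = lambda x: -(x - 1) ** 2

points = [-2.0, -0.5, 0.0, 1.0, 2.5]
weights = [0.1, 0.25, 0.5, 0.75, 0.9]
for x in points:
    for y in points:
        for d in weights:
            lhs = f(d * x + (1 - d) * y)
            rhs = d * f(x) + (1 - d) * f(y)
            assert lhs >= rhs - 1e-12    # small tolerance for rounding
```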
Concave functions have nicely behaved sets of local maximizers. It is an im-
mediate consequence of the definition that if x and y are local maxima, then so
are all of the points on the line segment connecting x to y. A fancy way to get at
this result is to note that concavity implies
f (δx + (1 − δ)y) ≥ δf (x) + (1 − δ)f (y) ≥ min{f (x), f (y)}. (6.3)
Moreover, the inequality in (6.3) is strict if δ ∈ (0, 1) and either (a) f (x) 6= f (y)
or (b) f is strictly concave. Suppose that x is a local maximum of f . It follows
that f (x) ≥ f (y) for all y. Otherwise f (λx + (1 − λ)y) > f (x), for all λ ∈ (0, 1),
contradicting the hypothesis that x is a local maximum. This means that any local
maximum of f must be a global maximum. It follows that if x and y are both local
maxima, then they both must be global maxima and so f (x) = f (y). In this case
it follows from (6.3) that all of the points on the segment connecting x and y must
also be maxima. It further implies that x = y if f is strictly concave.
These results are useful. They guarantee that local extrema are global maxima
(so you know that a critical point must be a maximum without worrying about
boundary points or local minima) and they provide a tractable sufficient condition
for uniqueness. Notice that these nice properties follow from (6.3), which is a
weaker condition than (6.2).2 This suggests that the following definition might be
useful.
Definition 51. We say a function f is quasi-concave over an interval X ⊂ R if
∀x, y ∈ X and δ ∈ (0, 1), we have f (δx + (1 − δ)y) ≥ min{f (x), f (y)}.
If quasi-concavity is so great, why bother with concavity? It turns out that
concavity has a characterization in terms of second derivatives.
We can repeat the same analysis with signs reversed.
Definition 52. We say a function f is convex over an interval X ⊂ R if ∀x, y ∈ X
and δ ∈ (0, 1), we have
f (δx + (1 − δ)y) ≤ δf (x) + (1 − δ)f (y)
If f is only a function of one argument you can think of this graph as having an
“u” shape.
2
As an exercise, try to find a function that satisfies (6.3) but not (6.2).
for some c between x and λx + (1 − λ)y. You can check that this means that if f 0
is decreasing, then
Similarly,
f (λx + (1 − λ)y) − f (y) = −λf 0 (c)(y − x)
for some c between λx + (1 − λ)y and y and if f 0 is decreasing, then
Note:
• The previous theorem used the fact that f ′ was decreasing (rather than f ′′ ≤ 0).
Chapter 7
Integration
FIGURES GO HERE
7.1 Introduction
Integration is a technique that does three superficially different things. First, it acts
as the inverse to differentiation. That is, if you have the derivative of a function
and want to know the function, then the process of “anti-differentiation” is closely
related to integration. Finding antiderivatives is essential if you want to solve
differential equations.
Second, integration is a way to take averages of general functions. This in-
terpretation is the most natural and insightful one for the study of integration in
probability and statistics.
Third, for non-negative functions, the integral is a way to compute areas.
The connection between the second and third interpretations is fairly straight-
forward. The connection between the second and first interpretation is important
and surprising and is a consequence of “The Fundamental Theorems of Calculus.”
We will motivate the definition of integral as a generalized average. If you are
given a set of N numbers, you average them by adding the numbers up and then
dividing by N . How do you generalize this to a situation in which you are given
an infinite set of numbers to average?
Assume that f : [a, b] −→ R. A naive way to guess the average value of f
would be to pick some x ∈ [a, b] and say that the average is equal to f (x). This
would work great if f happened to be constant. If you could find the smallest and
largest values of f on the interval (this would be possible if f were continuous),
then you could use these as lower and upper bounds for the average. If f did not
vary too much, perhaps you could use these values to get a good estimate of the
value of f ’s average. You could get an even better estimate if you subdivided the
interval [a, b] and repeated the process: The lower bound for the average value of f
would be the average of the minimum value of f on [a, (a+b)/2] and [(a+b)/2, b].
When you do this, the estimate for the lower bound will be higher (at least it
won’t be lower) and the estimate for the upper bound will be no higher than the
original estimate. Maybe if you keep repeating the process the upper and lower
bounds converge to something that would be a good candidate for the average.
The theory of integration is primarily about finding conditions under which this
kind of argument works.
a = x0 ≤ x1 ≤ · · · ≤ xn−1 ≤ xn = b.
For such a partition P , define the lower sum
Lf (P ) = Σnk=1 mk ∆k ,
where
mk ≡ inf x∈[xk−1 ,xk ] f (x)
and
∆k ≡ xk − xk−1 ,
and the upper sum
Uf (P ) = Σnk=1 Mk ∆k ,
where
Mk ≡ supx∈[xk−1 ,xk ] f (x).
Since these definitions are defined in terms of “sup” and “inf” they are well
defined even if f is not continuous. (If f is continuous, then the maxima and
minima are attained on each subinterval, so you can replace sup by max and inf
by min in the definitions.) It is clear that for each partition P , Lf (P ) ≤ Uf (P ).
Also, Lf (P ) is an underestimate of the value that we want, while Uf (P ) is an
overestimate.1
Now imagine subdividing the partition P by dividing each subset of P into
two non-empty pieces. This leads to a new partition P 0 and new values Lf (P 0 ) ≥
Lf (P ) and Uf (P ′ ) ≤ Uf (P ). The reason that subdividing increases the lower
sums is that
inf x∈[xk−1 ,xk ] f (x) ≤ min{ inf x∈[xk−1 ,c] f (x), inf x∈[c,xk ] f (x) } for xk−1 ≤ c ≤ xk ,
because the value of x that makes f smallest in the expression on the left may be
in only one of the two subintervals.
So far we have a process that generates an increasing sequence of lower esti-
mates of the average value of f and a process that generates a decreasing sequence
of upper estimates of the average value of f . We know that bounded monotone
sequences converge. This motivates the following definition.
Take a sequence of partitions P (r), r = 1, 2, . . . , with points xk (r),
such that ∆k (r) = xk (r) − xk−1 (r) goes to zero as r approaches infinity for all k.
f is integrable if
limr→∞ Lf (P (r)) = limr→∞ Uf (P (r)). (7.1)
If f is integrable, then we denote the common limit in (7.1) by ∫ab f (x)dx.
In the definition, there are lots of ways to take “finer and finer” partitions. It
turns out that if the upper and lower limits exist and are equal, then it does not
matter which partition you take. They will all converge to the same limit as the
length of each element in the partition converges to zero. It also turns out that if a
function is integrable, then it does not matter whether you evaluate f inside each
partition element using the sup, the inf or any value in between. All choices will
lead to the same value.
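The squeezing of lower and upper sums can be watched numerically. The Python sketch below (illustrative; the helper name is ours) computes both sums for f (x) = x2 on [0, 1] under a uniform partition; since f is increasing there, the inf on each piece sits at the left endpoint and the sup at the right endpoint, and both sums approach 1/3:

```python
def lower_upper(n):
    """Lower and upper sums for f(x) = x^2 on [0, 1], n equal pieces."""
    dx = 1.0 / n
    lo = sum((k * dx) ** 2 * dx for k in range(n))        # infs (left endpoints)
    hi = sum(((k + 1) * dx) ** 2 * dx for k in range(n))  # sups (right endpoints)
    return lo, hi

lo, hi = lower_upper(100000)
assert lo <= 1 / 3 <= hi
assert hi - lo < 1e-4     # for this f, the gap is (f(1) - f(0))/n = 1/n
```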
1
The formulas for U and L are the standard ones, but they are not quite right for the “average”
interpretation. If f (x) ≡ c, then we would have Uf (P ) = Lf (P ) = c(b − a) – the length of the
interval appears in the formula. In order to maintain the interpretation as average, you must divide
the formulas by the length of the interval, b − a.
2
To define the Lebesgue integral you approximate a function by a sequence of “simple” func-
tions that take on only finitely many values and hence are easy to average.
7.2 Fundamental Theorems of Calculus
Theorem 39. If
f (x) = c, ∀x ∈ [a, b]
then
∫ab f (x)dx = c(b − a).
Example 18.
f (x) = x2
=⇒ F (x) = (1/3)x3 .
We also could have
F (x) = (1/3)x3 + 6.
The theorem states that differentiation and integration are inverse operations
in the sense that if you start with a function and integrate it, then you get a dif-
ferentiable function and the derivative of that function is the function you started
with.
Proof. By the definition of the integral,
h supy∈[x,x+h] f (y) ≥ F (x + h) − F (x) = ∫xx+h f (y)dy ≥ h inf y∈[x,x+h] f (y)
because for each element of the partition, F(xk) − F(xk−1) = f(tk)(xk − xk−1) and f(tk) ∈ [inf_{x∈[xk−1,xk]} f(x), sup_{x∈[xk−1,xk]} f(x)]. Since

∑_{k=1}^n (F(xk) − F(xk−1)) = F(b) − F(a),
the result follows by taking limits of finer and finer partitions in (7.2).
This theorem is a converse of the first result.
A bit of terminology: An antiderivative of f is a function whose derivative is f. This is sometimes called a primitive of f or the indefinite integral of f and is denoted ∫ f (where the limits of integration are not specified). When the limits of integration are specified, ∫_a^b f is a number, called the definite integral of f on the interval [a, b].
Up until now, we have only talked about integration over closed and bounded
intervals. It is sometimes convenient to talk about “improper” integrals in which
the limits of the integrals may be infinity. These integrals are defined to be limits
of integrals over bounded intervals (provided that the limits exist).
where λ, µ ∈ R.
Since

d/dx (FG) = fG + Fg  =⇒  fG = d/dx (FG) − Fg,

we can just integrate to get the desired result.

(b) Alternatively, we can set H(x) = F(x)G(x) and apply Theorem 42.
3. Exercise
The second formula is called integration by parts, and it comes up a lot. The third formula is the change of variables formula. The fourth result is called the mean value theorem for integrals; it states that at some point in an interval, a function must take on its average value over that interval.
1. Let F′ = f. Then

∫_a^b x f(x) dx = bF(b) − aF(a) − ∫_a^b F(x) dx.
2.

∫_0^∞ xⁿ e^{−x} dx = n ∫_0^∞ x^{n−1} e^{−x} dx = n!
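In the spirit of the definition above – an improper integral is the limit of integrals over bounded intervals – the formula in item 2 can be checked numerically. A rough sketch (pure Python; the trapezoid rule and the cutoff at b = 50 are our choices, not from the text):

```python
import math

def integral(f, a, b, n=100_000):
    """Composite trapezoid rule approximation to the integral of f on [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b)) + sum(f(a + k * h) for k in range(1, n))
    return total * h

# For large b the tail of x^n e^{-x} beyond b is negligible, so [0, 50]
# approximates the improper integral on [0, infinity) very well.
n = 5
val = integral(lambda x: x**n * math.exp(-x), 0.0, 50.0)
print(val, math.factorial(n))   # both close to 120
```

The computed value agrees with 5! = 120 to several decimal places.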
Basic Linear Algebra

8.1 Preliminaries
Rn = R × R × · · · × R × R
X × Y ≡ {(x, y) | x ∈ X , y ∈ Y}
so
Rn = {(x1 , x2 , . . . , xn ) | xi ∈ R, ∀ i = 1, 2, . . . , n}
You may hear people talk about vector spaces. Maybe they are showing off. Maybe they really need a more general structure. In any event, a vector space is a general set V in which the operations of addition and multiplication by a scalar make sense, where addition is commutative and associative (as above), there is a special zero vector that is the additive identity (0 + v = v), additive inverses exist (for each v there is a −v), and scalar multiplication is defined as above. Euclidean spaces are the leading example of vector spaces. We will need to talk about subsets of Euclidean spaces that have a linear structure (they contain 0, and if x and y are in the set, then so are x + y and all scalar multiples of x and y). We will call these subspaces (this is a correct use of the technical term), but we have no reason to talk about more general kinds of vector spaces.
8.2 Matrices
In particular, real numbers are just another special case of matrices; e.g. the number 6 satisfies

6 ∈ R = R^{1×1}.

Example 19.

A = [ 0 1 5
      6 0 2 ]   (2 × 3)
A + B = [ α11 + β11   α12 + β12   · · ·   α1n + β1n
          α21 + β21   α22 + β22   · · ·   α2n + β2n
          ...
          αm1 + βm1   αm2 + βm2   · · ·   αmn + βmn ]  =  [αij + βij]   (m × n)
The above expression may look quite daunting if you have never seen summation signs before, so a simple example should help to clarify.
Note further that this brings up the very important point that matrices do not multiply like regular numbers: matrix multiplication is NOT commutative, i.e.

A · B ≠ B · A.

For example, with A of size 2 × 3 and B of size 3 × 4,

A · B ≠ B · A;

in fact, not only does the LHS not equal the RHS, the RHS does not even exist. We will see later that one interpretation of a matrix is as a representation of a linear function. With that interpretation, matrix multiplication takes on a specific meaning and there will be another way to think about why you can only multiply certain “conformable” pairs of matrices.
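Conformability is easy to see if you write matrix multiplication out. A minimal sketch (pure Python on nested lists; the helper name matmul is ours, not from the text):

```python
def matmul(A, B):
    """Multiply an (m x n) matrix A by an (n x p) matrix B, given as nested lists."""
    m, n, p = len(A), len(A[0]), len(B[0])
    if len(B) != n:
        raise ValueError("not conformable: columns of A must equal rows of B")
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(m)]

A = [[0, 1, 5], [6, 0, 2]]       # 2 x 3, the matrix of Example 19
B = [[1, 0], [0, 1], [1, 1]]     # 3 x 2
print(matmul(A, B))              # a 2 x 2 matrix
print(matmul(B, A))              # a 3 x 3 matrix -- cannot equal A.B
# matmul(A, [[1, 2], [3, 4]]) would raise ValueError: 3 columns vs. 2 rows.
```

Here both products exist, but they have different shapes, so A·B = B·A is impossible; with a 2 × 3 and a 3 × 4 matrix, the second product does not exist at all.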
Definition 64. Any matrix which has the same number of rows as columns is known as a square matrix, and is denoted A (n × n).
Definition 65. There is a special square matrix known as the identity matrix (which is likened to the number 1 (the multiplicative identity) from Definition 14), in that any matrix multiplied by this identity matrix gives back the original matrix. The identity matrix is denoted In and is equal to

In = [ 1 0 . . . 0
       0 1 . . . 0
       ...
       0 . . . 0 1 ]   (n × n)
Definition 66. A square matrix is called a diagonal matrix if aij = 0 whenever i ≠ j.¹
Definition 67. A square matrix is called an upper triangular matrix (resp. lower triangular) if aij = 0 whenever i > j (resp. i < j).
Diagonal matrices are easy to deal with. Triangular matrices are also somewhat tractable. You’ll see that for many applications you can replace an arbitrary square matrix with a related diagonal matrix.
For any m × n matrix A we have the results that

A · In = A   and   Im · A = A.

Note that, unlike normal algebra, it is not the same matrix that multiplies A on both sides to give back A (unless n = m).
¹The main diagonal is always defined as the diagonal going from the top left corner to the bottom right corner (i.e. ↘).
A square matrix that is not invertible is called singular. Note that this only applies to square matrices.²
Note: We will see how to calculate inverses soon.
For n = 1 (A a 1 × 1 matrix): det A = |A| = a11.

For n ≥ 2 (A an n × n matrix):

det A = |A| ≡ a11 |A−11| − a12 |A−12| + a13 |A−13| − · · · ± a1n |A−1n|,

where A−1j is the matrix formed by deleting the first row and jth column of A.
Example 22. If

A = [aij] = [ a11 a12
              a21 a22 ]   (2 × 2)

=⇒ |A| = a11 a22 − a12 a21.
²You can find one-sided “pseudo inverses” for all matrices, even those that are not square.
Example 23. If

A = [aij] = [ a11 a12 a13
              a21 a22 a23
              a31 a32 a33 ]   (3 × 3)

=⇒ |A| = a11 · | a22 a23 ; a32 a33 | − a12 · | a21 a23 ; a31 a33 | + a13 · | a21 a22 ; a31 a32 |,

where | p q ; r s | denotes the 2 × 2 determinant ps − qr.
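The recursive expansion along the first row translates directly into code. A sketch (pure Python; the function name det is ours, and this is meant only to illustrate the definition – it is far too slow for large matrices):

```python
def det(A):
    """Determinant by cofactor expansion along the first row, as in the text."""
    n = len(A)
    if n == 1:
        return A[0][0]
    total = 0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in A[1:]]  # delete row 1, column j
        total += (-1) ** j * A[0][j] * det(minor)
    return total

print(det([[1, 2], [3, 4]]))                   # 1*4 - 2*3 = -2
print(det([[2, 0, 1], [1, 3, 0], [0, 1, 4]]))  # expansion along the first row
```

The signs (−1)^j implement the alternating ± pattern in the definition.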
A⁻¹ = (1/|A|) · adj A,

where adj A is the adjoint of A; we will not show how to calculate it here.
x^t y = x1 y1 + x2 y2 + · · · + xn yn = ∑_{i=1}^n xi yi
(a) d(x, y) ≥ 0
(b) d(x, y) = 0 ⇐⇒ x = y
(c) d(x, y) = d(y, x)
(d) d(x, y) ≤ d(x, z) + d(z, y), for any z ∈ Rn
states that the distance between two points is the length of the path connect-
ing the two points using segments parallel to the coordinate axes.
Example 27.

d(x, y) = ‖x − y‖,

where

‖z‖ = √(z1² + z2² + · · · + zn²) = √( ∑_{i=1}^n zi² ).
Under the Euclidean metric, the distance between two points is the length of the line segment connecting the points. We call ‖z‖, which is the distance between 0 and z, the norm of z. Notice that ‖z‖² = z · z.
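These definitions are mechanical to compute. A small sketch (pure Python; the helper names are ours, not from the text) that also previews the orthogonality discussion below:

```python
import math

def dot(x, y):
    """Inner product: sum of x_i * y_i."""
    return sum(xi * yi for xi, yi in zip(x, y))

def norm(z):
    """Euclidean norm, using ||z||^2 = z . z."""
    return math.sqrt(dot(z, z))

x, y = [1.0, 2.0], [-2.0, 1.0]
print(dot(x, y))                 # 0: these two vectors are orthogonal
# The angle between x and y satisfies cos(theta) = x.y / (||x|| ||y||).
theta = math.acos(dot(x, y) / (norm(x) * norm(y)))
print(math.degrees(theta))       # 90 degrees
```

A zero inner product corresponds exactly to a right angle, as the Law of Cosines discussion below explains.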
When x · y = 0 we say that x and y are orthogonal (at right angles, perpendicular). It is a surprising geometric property that two vectors are perpendicular if and only if their inner product is zero. This fact follows rather easily from the Law of Cosines, which states that if a triangle has sides A, B, and C and the angle θ opposite the side C, then

c² = a² + b² − 2ab cos(θ),

where a, b, and c are the lengths of A, B, and C respectively. This means that:
y = A · x,

where y is m × 1, A is m × n, and x is n × 1, with

y = (y1, y2, . . . , ym)′,   x = (x1, x2, . . . , xn)′,

A = [αij] = [ α11 α12 · · · α1n
              α21 α22 · · · α2n
              ...
              αm1 αm2 · · · αmn ]   (m × n)

or, putting it all together,

[ y1 ]   [ α11 α12 · · · α1n ]   [ x1 ]
[ y2 ] = [ α21 α22 · · · α2n ] · [ x2 ]
[ .. ]   [ ..            .. ]    [ .. ]
[ ym ]   [ αm1 αm2 · · · αmn ]   [ xn ]
Note: you should convince yourself that if you multiply out the RHS of the above equation and then compare corresponding entries of the new (m × 1) vectors, the result is equivalent to the original system of equations. The value of matrices is that they permit you to write a complicated system of equations in a simple form. Once you have written a system of equations in this way, you can use matrix operations to solve some systems.
Example 28. In high school you probably solved equations of the form:
3x1 − 2x2 = 7
8x1 + x2 = 25
Well, matrix algebra is just a clever way to solve these in one go. So here we have

A = [ 3 −2
      8  1 ]   (2 × 2),   x = (x1, x2)′,   y = (7, 25)′.

Then

A⁻¹ A x = A⁻¹ y   ⇐⇒   I2 x = A⁻¹ y   ⇐⇒   x = A⁻¹ · y.
So

x = A⁻¹ y = (1/19) · [ 1  2
                       −8 3 ] · ( 7, 25 )′ = (1/19) · ( 57, 19 )′ = ( 3, 1 )′.
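The same computation in code, using the 2 × 2 adjoint formula for the inverse given earlier (pure Python; the helper name inv2 is ours, not from the text):

```python
def inv2(A):
    """Inverse of a 2x2 matrix via the adjoint formula A^{-1} = adj(A)/det(A)."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[3, -2], [8, 1]]                 # the system of Example 28
y = [7, 25]
Ainv = inv2(A)                        # det A = 19
x = [Ainv[0][0] * y[0] + Ainv[0][1] * y[1],
     Ainv[1][0] * y[0] + Ainv[1][1] * y[1]]
print(x)                              # [3.0, 1.0]
```

Multiplying back, A x recovers y, confirming the solution x1 = 3, x2 = 1.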
Here is the way in which linear independence captures the idea of no redundancies:

Theorem 45. If X = {x1, . . . , xk} is a linearly independent collection of vectors and z ∈ S(X), then there are unique λ1, . . . , λk such that z = ∑_{i=1}^k λi xi.
Proof. Existence follows from the definition of span. Suppose that there are two linear combinations of the elements of X that yield z, so that

z = ∑_{i=1}^k λi xi   and   z = ∑_{i=1}^k λ′i xi.

Subtract the equations to obtain:

0 = ∑_{i=1}^k (λ′i − λi) xi.

By linear independence, λ′i − λi = 0 for each i, so the two representations are the same.
Next, let us investigate the set of things that can be described by a collection
of vectors.
We deal only with finite dimensional vector spaces. We’ll see this defini-
tion agrees with the intuitive notion of dimension. In particular, Rn has
dimension n.
Definition 76. A basis for a vector space V is any collection of linearly independent vectors that span V.
Theorem 46. If X = {x1 , . . . , xk } is a set of linearly independent vectors
that does not span V , then there exists v ∈ V such that X ∪ {v} is linearly
independent.
You should check that the standard basis really is a linearly independent set
that spans Rn . Also notice that the elements of the standard basis are mutu-
ally orthogonal. When this happens, we say that the basis is orthogonal. It
is also the case that each basis element has unit length. When this also happens, we say that the basis is orthonormal. It is always possible to find an orthonormal basis.⁴ Orthonormal bases are particularly useful because it is easy to figure out how to express an arbitrary element of the space in terms of the basis.
It follows from these observations that each vector v has a unique repre-
sentation in terms of the basis, where the representation consists of the λi
used in the linear combination that expresses v in terms of the basis. For the
standard basis, this representation is just the components of the vector.
It is not hard (but a bit tedious) to prove that all bases have the same number
of elements. (This follows from the observation that any system of n homo-
geneous equations and m > n unknowns has a non-trivial solution, which
in turn follows from “row-reduction” arguments.)
In general, one can prove that eigenvectors associated with distinct eigenvalues are linearly independent. To see this, suppose that λ1, . . . , λk are distinct eigenvalues and x1, . . . , xk are associated eigenvectors. In order to reach a contradiction, suppose that the vectors are linearly dependent. Without loss of generality, we may assume that {x1, . . . , xk−1} are linearly independent, but that xk can be written as a linear combination of the first k − 1 vectors. This means that there exist αi, i = 1, . . . , k − 1, not all zero, such that:
∑_{i=1}^{k−1} αi xi = xk.    (8.4)
Multiply both sides of equation (8.4) by A and use the eigenvalue property to obtain:

∑_{i=1}^{k−1} αi λi xi = λk xk.    (8.5)

Multiply equation (8.4) by λk and subtract it from equation (8.5) to get:

∑_{i=1}^{k−1} αi (λi − λk) xi = 0.    (8.6)

Since the eigenvalues are distinct, equation (8.6) gives a non-trivial linear combination of the first k − 1 xi that is equal to 0, which contradicts linear independence.
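For 2 × 2 symmetric matrices the eigenvalues can be computed by hand from the characteristic polynomial, and you can then check that distinct eigenvalues come with linearly independent eigenvectors. A sketch (pure Python; the example matrix is our choice, not from the text):

```python
import math

def eig2_sym(A):
    """Eigenvalues of a symmetric 2x2 matrix [[a, b], [b, c]], as the roots of
    the characteristic polynomial lambda^2 - (a + c) lambda + (ac - b^2) = 0."""
    (a, b), (_, c) = A
    tr, det = a + c, a * c - b * b
    disc = math.sqrt(tr * tr - 4 * det)   # non-negative for symmetric A
    return (tr + disc) / 2, (tr - disc) / 2

A = [[2.0, 1.0], [1.0, 2.0]]
lam1, lam2 = eig2_sym(A)
print(lam1, lam2)   # 3.0 and 1.0
# Associated eigenvectors (1, 1) and (1, -1): A(1,1) = (3,3), A(1,-1) = (1,-1),
# and the two eigenvectors are clearly linearly independent.
```

Note that (1, 1) and (1, −1) are also orthogonal, illustrating the symmetric case discussed next.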
Here are some useful facts about determinants and eigenvalues. (The proofs
range from obvious to tedious.)
One variation on the symmetric case is particularly useful in the next section. When A is symmetric, then we can take the eigenvectors of A to be orthonormal. In this case, the P in the previous theorem has the property that P⁻¹ = Pᵗ. Eigenvalues turn out to be important in many different places. They play a role in the study of stability of difference and differential equations. They make certain computations easy. They make it possible to define a sense in which matrices can be positive and negative that allows us to generalize the one-variable second-order conditions. The next topic will do this.
This quadratic form is positive definite if and only if all of the aii > 0, negative definite if and only if all of the aii < 0, positive semidefinite if and only if aii ≥ 0 for all i, negative semidefinite if and only if aii ≤ 0 for all i, and indefinite if A has both negative and positive diagonal entries.
The theory of diagonalization gives us a way to use these results for all symmetric matrices. We know that if A is a symmetric matrix, then it can be written A = Rᵗ D R, where D is a diagonal matrix with the (real) eigenvalues down the diagonal and R is an orthogonal matrix. This means that the quadratic form satisfies Q(x) = xᵗAx = xᵗRᵗDRx = (Rx)ᵗ D (Rx). This expression is useful because it means that the definiteness of A is equivalent to the definiteness of its diagonal matrix of eigenvalues, D. (Notice that if I can find an x such that xᵗAx > 0, then I can find a y such that yᵗDy > 0, namely y = Rx, and conversely.)
(a) positive definite if and only if all of its leading principal minors are positive.

(b) negative definite if and only if its odd leading principal minors are negative and its even leading principal minors are positive.

(c) indefinite if one of its kth order leading principal minors is negative for an even k, or if there are two odd leading principal minors that have different signs.
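The leading-principal-minor test is easy to apply by machine. A sketch (pure Python; the cofactor determinant is repeated here so the block is self-contained, and the example matrix is our choice, not from the text):

```python
def det(A):
    """Determinant by cofactor expansion along the first row."""
    n = len(A)
    if n == 1:
        return A[0][0]
    return sum((-1) ** j * A[0][j] * det([r[:j] + r[j + 1:] for r in A[1:]])
               for j in range(n))

def leading_principal_minors(A):
    """Determinants of the top-left 1x1, 2x2, ..., nxn submatrices of A."""
    return [det([row[:k] for row in A[:k]]) for k in range(1, len(A) + 1)]

A = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]   # a classic positive definite matrix
minors = leading_principal_minors(A)
print(minors)                                # [2, 3, 4] -- all positive
print(all(m > 0 for m in minors))            # True: A is positive definite by (a)
```

Flipping the sign of A would make the minors alternate −2, 3, −4, matching criterion (b) for negative definiteness.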
Multivariable Calculus
The goal is to extend the calculus from real-valued functions of a real variable to general functions from Rn to Rm. Some ideas generalize easily, but going from one-dimensional domains to many-dimensional domains raises new issues that need discussion. Raising the dimension of the range space, on the other hand, raises no new conceptual issues. Consequently, we restrict our discussion to real-valued functions, and will explicitly consider higher-dimensional ranges only when convenient or necessary. (It will be convenient to talk about linear functions in general terms. It is necessary for the most interesting generalization of the Chain Rule and for the discussions of inverse and implicit functions.)
If we constrain t ∈ [0, 1] in the definition, then the set is the line segment connecting x to x + v. Two points still determine a line: the line connecting x to y can be viewed as the line containing x in the direction v = y − x. You should check that this is the same as the line through y in the direction v.
Definition 82. A hyperplane is described by a point x0 and a normal direction p ∈ Rn, p ≠ 0. It can be represented as {z : p · (z − x0) = 0}; p is called the normal direction of the plane.
which is the standard way to represent the equation of a line (in the plane) through the point (x1, x2) with slope (y2 − x2)/(y1 − x1). This means that the “parametric” representation is essentially equivalent to the standard representation in R2.¹ The familiar ways to represent lines do not work in higher dimensions. The reason is that one linear equation in Rn typically has an (n − 1)-dimensional solution set, so it is a good way to describe a one-dimensional set only if n = 2.
You need two pieces of information to describe a line. If the information
consists of a point and a direction, then the parametric version of the line is
immediately available. If the information consists of two points, then you
form a direction by subtracting one point from the other (the order is not
important).
You can describe a hyperplane easily given a point and a (normal) direction.
Note that the direction of a line is the direction you follow to stay on the line.
The direction for a hyperplane is the direction you follow to go away from
the hyperplane. If you are given a point and a normal direction, then you
can immediately write the equation for the hyperplane. What other pieces of
information determine a hyperplane? In R3 , a hyperplane is just a standard
plane. Typically, three points determine a plane (if the three points are all
on the same line, then infinitely many planes pass through the points). How
can you determine the equation of a plane in R3 that passes through three
given points? A mechanical procedure is to note that the equation for the plane can always be written Ax1 + Bx2 + Cx3 = D and use the three points to find values for the coefficients. For example, if the points are (1, 2, −3), (0, 1, 1), (2, 1, 1), then we can solve:

A + 2B − 3C = D
B + C = D
2A + B + C = D

Doing so yields (A, B, C, D) = (0, .8D, .2D, D). (If you find one set of coefficients that works, any non-zero multiple will also work.) Hence an equation for the plane is 4x2 + x3 = 5; you can check that the three points actually satisfy this equation.
An alternate computation technique is to look for a normal direction. A nor-
mal direction is a direction that is orthogonal to all directions in the plane.
¹The parametric representation is actually a bit more general, since it allows you to describe lines that are parallel to the vertical axis. Because these lines have infinite slope, they cannot be represented in standard form.
A direction in the plane is a direction of a line in the plane. You can get such a direction by subtracting any two points in the plane. A two-dimensional hyperplane will have two independent directions. For this example, one direction can come from the difference between the first two points, (1, 1, −4), and the other can come from the difference between the second and third points, (−2, 0, 0) (a third direction will be redundant, but you could do the computation using the direction of the line connecting (1, 2, −3) and (2, 1, 1) instead of either of the directions computed above). Once you have two directions, you want to find a normal to both of them. That is, a p such that p ≠ 0 and p · (1, 1, −4) = p · (−2, 0, 0) = 0. This is a system of two equations and three variables. All multiples of (0, 4, 1) solve the equations.² Hence the equation for the hyperplane is (0, 4, 1) · (x1 − 1, x2 − 2, x3 + 3) = 0. You can check that this agrees with the equation we found earlier. It also would be equivalent to the equation you would obtain if you used either of the other two given points as “the point on the plane.”
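The footnoted cross-product shortcut can be carried out directly for this example (pure Python; the helper names are ours, not from the text):

```python
def cross(u, v):
    """Cross product in R^3: a direction orthogonal to both u and v."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

dot = lambda u, v: sum(a * b for a, b in zip(u, v))

p1, p2, p3 = (1, 2, -3), (0, 1, 1), (2, 1, 1)
d1 = tuple(a - b for a, b in zip(p1, p2))    # (1, 1, -4)
d2 = tuple(a - b for a, b in zip(p2, p3))    # (-2, 0, 0)
normal = cross(d1, d2)
print(normal)                                 # (0, 8, 2), proportional to (0, 4, 1)
# Plane: normal . (x - p1) = 0; all three points satisfy it:
print([dot(normal, tuple(a - b for a, b in zip(p, p1))) for p in (p1, p2, p3)])
```

The resulting normal (0, 8, 2) is a multiple of (0, 4, 1), so the plane equation agrees with 4x2 + x3 = 5.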
L : Rn −→ Rm
The first condition says that a linear function must be additive. The second condition says that it must have constant returns to scale. The conditions generate several obvious consequences. If L is a linear function, then L(0) = 0 and, more generally, L(x) = −L(−x). It is an important observation that any linear function can be “represented” by matrix multiplication. Given a linear function, compute L(ei), where ei is the ith standard basis element. Call this ai and let A be the matrix with ith column equal to ai. Note that A must have n columns and m rows. Note also that (by the
²The “cross product” is a computational tool that allows you to mechanically compute a direction perpendicular to two given directions.
Definition 85. A level set is a set of points at which the function achieves the same value. Formally, it is defined as the set

{x ∈ Rn | f(x) = c},   c ∈ R.

While the graph of the function is a subset of Rn+1, the level sets are subsets of Rn.
Example 29.
f (x) = x21 + x22
So the function is R2 −→ R and therefore the graph is in R3 .
The level sets of this function are circles in the plane. (The graph is a cone.)
FIGURE GOES HERE
Example 30. A good example to help understand this is a utility function (of
which you will see lots!). A utility function is a function which “measures” a
person’s happiness. It is usually denoted U . In 200A you will see conditions
necessary for the existence of the utility function but for now we will just
assume that it exists, and is strictly increasing in each argument. Suppose
we have a guy Joel whose utility function is just a function of the number of
apples and the number of bananas he eats. So his happiness is determined
solely by the number of apples and bananas he eats, and nothing else. Thus
we lose no information when we think about utility as a function of two
variables:
U : R2 −→ R
U (xA , xB ) where xA is the number of apples he eats, and xB is the number
of bananas he eats.
A level set is all the different possible combinations of apples and bananas that give him the same utility level, i.e. that leave him equally happy! For example, Joel might really like apples and only slightly like bananas. So 3 apples and 2 bananas might make him as happy as 1 apple and 10 bananas. In other words, he needs lots of bananas to compensate for the loss of the 2 apples.
If the only functions we dealt with were utility functions, we would call
level sets “indifference curves.” In economics, typically curves that are “iso-
SOMETHING” are level sets of some function.
Example 31. Suppose we have a guy Joel who only derives joy from teaching mathematics. Nothing else in the world gives him any pleasure, and as such his utility function UJ is only a function of the number of hours HM he spends teaching mathematics. Now we also make the assumption that Joel’s utility is strictly increasing in hours spent teaching mathematics: the more he teaches, the happier he is.³ So the question is: does Joel’s utility function UJ have any level sets? Since utility is a function of one variable,
³You may doubt that Joel is really like this, but I assure you he is!
Joel’s level “sets” are zero-dimensional objects – points. If his utility function is defined as UJ(HM) = HM^{1/2} and Joel teaches for 4 hours (i.e. HM = 4), then UJ(4) = 4^{1/2} = 2. Is there any other combination of hours teaching that could leave Joel equally happy? Obviously not: since his utility is a function of only one argument, and it is strictly increasing, no two distinct values can leave him equally happy.⁴
{x ∈ Rn | f (x) ≥ c} c∈R
{x ∈ Rn | f (x) ≤ c} c∈R
⁴So you can see that level sets reference the arguments of a function. Functions with two or more arguments are much more likely to have non-trivial level sets than functions of one argument, since you can have many different combinations of the arguments.
Referring back to our map example: the upper contour set of a point x would be the set of all coordinates such that, if we plugged those coordinates into our altitude function, it would give a value greater than or equal to the value at the point x – i.e., all points that are at a higher altitude than x.
f : Rn −→ R

lim_{x→a} f(x) = c ∈ R

if and only if, for every ε > 0, there exists δ > 0 such that

0 < d(x, a) < δ   =⇒   |f(x) − c| < ε.
This definition agrees with the earlier definition, although there are two twists. First, a general “distance function” replaces absolute values in the condition that says that x is close to a. For our purposes, the distance function will always be the standard Euclidean distance. Second, here we do not define one-sided continuity.
Example. Let f : R2 −→ R be given by f(x1, x2) = x1 x2. We show that lim_{(x1,x2)→(1,1)} f(x1, x2) = 1; that is, for every ε > 0 there is a δ > 0 such that

‖(x1, x2) − (1, 1)‖ < δ   =⇒   |f(x1, x2) − 1| < ε.

Note that

‖(x1, x2) − (1, 1)‖ = √((x1 − 1)² + (x2 − 1)²).

Also,

|f(x1, x2) − 1| = |x1 x2 − 1|
= |x1 x2 − x1 + x1 − 1|
= |x1(x2 − 1) + x1 − 1|
≤ |x1(x2 − 1)| + |x1 − 1|,

where the last inequality uses the triangle inequality. For any given ε > 0, let δ = min{ε/4, 1}. Then

‖(x1, x2) − (1, 1)‖ < δ   =⇒   |x1 − 1| < ε/4 and |x2 − 1| < ε/4,

and also x1 < 2. Thus

|x1(x2 − 1)| + |x1 − 1| < 2 · (ε/4) + (ε/4) = 3ε/4,

implying that

|f(x1, x2) − 1| < 3ε/4 < ε.
9.5 Sequences
(We now use superscripts for elements of a sequence.) Let {x^k}_{k=1}^∞ be a sequence of vectors in Rn:

x^k ∈ Rn, ∀k,   with   x^k = (x^k_1, x^k_2, . . . , x^k_n).
Definition 89. A sequence {x^k}_{k=1}^∞ converges to a point x ∈ Rn, that is, x^k −→ x, if and only if ∀ε > 0, ∃K ∈ P such that k ≥ K =⇒ d(x^k, x) < ε.
a ≥ b   ⇐⇒   ai ≥ bi, ∀ i = 1, 2, . . . , n

and

a > b   ⇐⇒   ai ≥ bi, ∀ i = 1, 2, . . . , n, and aj > bj for some j.
M ≥ x, ∀x ∈ X
m ≤ x, ∀x ∈ X
Now we define what we mean by vectors being “greater” or “less” than each other.
Definition 92. X is said to be closed if, for every sequence x^k from X with x^k −→ x ∈ Rn, we have x ∈ X.

Bε(x) ⊂ X
9.7 Differentiability
Definition 98. We say a function f : Rn −→ Rm is differentiable at a ∈ Rn if and only if there is a linear function L : Rn −→ Rm such that

lim_{x→a} ‖f(x) − f(a) − L(x − a)‖ / ‖x − a‖ = 0.
Df(a)y / ‖y‖
non-negative scalar. If we did not do this (at least for the denominator), then the ratios would not make sense. Second, because the linear function has the same domain and range as f, it is more complicated than in the one-variable case. In the one-variable case, the derivative of f evaluated at a point is a single number. This allows us to think about f′ as a real-valued function of a single variable. When f : Rn −→ Rm, the derivative at a point is a linear function from Rn into Rm. This means that it can be represented by multiplication by a matrix with m rows and n columns. That is, the derivative is described by mn numbers. What are these numbers? The computation after the definition demonstrates that the entries in the matrix that represents the derivative are the partial derivatives of (the component functions of) f. This is why we typically think of the derivatives of multivariable functions as “matrices of partial derivatives.”
Sometimes people represent the derivative of a function from Rn to R as a vector rather than a linear function:

Df(x) = ∇f(x) = ( ∂f/∂x1 (x), ∂f/∂x2 (x), . . . , ∂f/∂xn (x) ).
Example 33.

f(x) = x1 x2

=⇒ ∂f/∂x1 (0, 0) = x2 |_{(x1=0, x2=0)} = 0

=⇒ ∂f/∂x2 (0, 0) = x1 |_{(x1=0, x2=0)} = 0
Example 34.

f(x) = |x1|^{1/2} |x2|^{1/2}

So

lim_{h→0} [f((0, 0) + h(1, 1)) − f(0, 0)] / (h√2) = lim_{h→0} [f((h, h)) − f(0, 0)] / (h√2)
= lim_{h→0} |h|^{1/2} |h|^{1/2} / (h√2)
= lim_{h→0} |h| / (h√2)
Why include this example? The computation above tells you that the directional derivative of f at (0, 0) in the direction (1/√2, 1/√2) does not exist. (The one-sided limits exist and are equal to 1/√2 and −1/√2.) On the other hand, you can easily check that both partial derivatives of the function at (0, 0) exist and are equal to zero. Hence the formula for computing the directional derivative from the partial derivatives fails. Why? Because f is not differentiable at (0, 0).
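You can watch this failure happen numerically: the difference quotients along the axes are identically zero, while the quotients along the diagonal have different one-sided limits (pure Python; the step sizes are our choices, not from the text):

```python
import math

def f(x1, x2):
    return math.sqrt(abs(x1)) * math.sqrt(abs(x2))

# Partial derivatives at (0, 0): along each axis f is identically zero.
for h in (1e-2, 1e-4, 1e-6):
    print((f(h, 0) - f(0, 0)) / h, (f(0, h) - f(0, 0)) / h)   # 0.0  0.0

# Difference quotient along the diagonal direction (1, 1)/sqrt(2):
for h in (1e-2, -1e-2, 1e-4, -1e-4):
    print((f(h, h) - f(0, 0)) / (h * math.sqrt(2)))
# Since f(h, h) = |h|, the quotient is |h|/(h sqrt(2)): +1/sqrt(2) from the
# right and -1/sqrt(2) from the left, so the two-sided limit does not exist.
```

Both partial derivatives are zero, yet no single linear function can match the behavior of f along the diagonal, which is exactly what differentiability would require.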
Question: What is the direction from x that most increases the value of f ?
Answer: It’s the direction given by the gradient.
(a) D[cf](a) = c Df(a), ∀c ∈ R

(b) D[f + g](a) = Df(a) + Dg(a)

For the case m = 1:

(c) D[g · f](a) = g(a) · Df(a) + f(a) · Dg(a)

(d) D[f/g](a) = [ g(a) · Df(a) − f(a) · Dg(a) ] / [g(a)]²
f ◦ g : R −→ R

Then

D(f ◦ g)(t) = Df(x) Dg(t) for x = g(t).

That is,

D[f ◦ g](t) = ∂f/∂x1 · dx1/dt + ∂f/∂x2 · dx2/dt + · · · + ∂f/∂xm · dxm/dt.
Example 35. Let the variable t denote the price of oil. This one variable induces an array of population responses (so g is a vector-valued function), and these responses in turn have their own effects, like determining GNP, the variable y (which is obtained by applying the function f to these population responses).
       g               f
t  −→  x = g(t)  −→  y = f(g(t)) ∈ R
(price of oil)   (actions taken by individuals)   (GNP)
D[f ◦ g](t) = ∂y/∂t = Df(g(t)) · Dg(t)

= ( ∂f/∂x1 (g(t)), . . . , ∂f/∂xm (g(t)) ) · ( dg1/dt, . . . , dgm/dt )′

= ∑_{i=1}^m ∂f/∂xi (g(t)) · dgi/dt
Example 36.

g(x) = x − 1,   f(y) = ( 2y, y² )′.

So note that

g : R −→ R, and f : R −→ R2.

[f ◦ g](x) = ( 2(x − 1), (x − 1)² )′

D[f ◦ g](x) = ( 2, 2(x − 1) )′
Now let’s see if we get the same answer doing it the chain rule way:

Dg(x) = 1,   Df(y) = ( 2, 2y )′

Df(g(x)) Dg(x) = ( 2, 2(x − 1) )′
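The same check can be done numerically with a forward difference quotient, which is a useful habit whenever you apply the chain rule (pure Python; the evaluation point x = 3 and the step size are our choices, not from the text):

```python
def g(x):
    return x - 1.0

def f(y):
    return (2.0 * y, y * y)

def fg(x):
    """The composition f(g(x)) from Example 36."""
    return f(g(x))

x = 3.0
h = 1e-6
# Forward difference approximation to D[f o g](x), component by component:
numeric = tuple((a - b) / h for a, b in zip(fg(x + h), fg(x)))
chain = (2.0, 2.0 * g(x))   # Df(g(x)) * Dg(x) = (2, 2(x - 1))
print(numeric)              # approximately (2.0, 4.0)
print(chain)                # (2.0, 4.0)
```

At x = 3 the chain rule gives (2, 2(x − 1)) = (2, 4), and the difference quotient matches to several decimal places.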
Example 37.

f(y) = f(y1, y2) = ( y1² + y2, y1 − y1 y2 )′

g(x) = g(x1, x2) = ( x1² − x2, x1 x2 )′ = y

So we note here that both g and f take in two arguments and spit out a (2 × 1) vector, so we must have

g : R2 −→ R2, and f : R2 −→ R2.

So

D[f ◦ g](x) = Df(g(x)) · Dg(x)

= [ 2x1² − 2x2    1          ·  [ 2x1  −1
    1 − x1x2      x2 − x1² ]      x2    x1 ]

= [ 4x1(x1² − x2) + x2              x1 − 2(x1² − x2)
    2x1(1 − x1x2) + x2(x2 − x1²)    x1(x2 − x1²) + x1x2 − 1 ]
Example 38. Let f : R3 −→ R, so the graph is in R4 (pretty difficult to draw!), but each level set is in R3. For example, consider the level set

f(x) = x1² + x2² + x3² = 1.
∇f(x0) · (x − x0) = y − y0.

Proof. Substitute ∇F(x0, y0) = (∇f(x0), −1) into equation (9.2) and rearrange terms.
∇f(x) = ( ∂f/∂x1, ∂f/∂x2, ∂f/∂x3 ) = (x2, x1, −2x3)

∇f(x̂) = ∇f(x) |_{x=(2,5,2)} = (5, 2, −4)
Tangent Plane:

{x̂ + y | y · ∇f(x̂) = 0} = {(2, 5, 2) + (y1, y2, y3) | 5y1 + 2y2 − 4y3 = 0}
= {x | 5x1 − 10 + 2x2 − 10 − 4x3 + 8 = 0}
= {x | 5x1 + 2x2 − 4x3 = 12}
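The gradient in this example can also be recovered numerically, which is a handy way to check tangent-plane computations (pure Python; the central-difference scheme and step size are our choices; the function f(x) = x1 x2 − x3², whose gradient is (x2, x1, −2x3), is taken from the example):

```python
def f(x):
    x1, x2, x3 = x
    return x1 * x2 - x3 ** 2

def grad(f, x, h=1e-6):
    """Numerical gradient via central differences in each coordinate."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

xhat = [2.0, 5.0, 2.0]
print(grad(f, xhat))   # approximately [5, 2, -4]
# Tangent plane to the level set through xhat: grad f(xhat) . (x - xhat) = 0,
# which simplifies to 5 x1 + 2 x2 - 4 x3 = 12, as computed above.
```

The numerical gradient matches (5, 2, −4), and hence the tangent plane equation, to high precision.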
(12, 4, −6) · (x − 2, y − 1, z − 3) = 0
or
12x + 4y − 6z = 10.
w − 7 = ∇f (2, 1, 3) · (x − 2, y − 1, z − 3) = 12x + 4y − 6z − 10
or
12x + 4y − 6z − w = 3.
In economics functions that are homogeneous of degree zero and one arise
naturally in consumer theory. A cost function depends on the wages you
pay to workers. If all of the wages double, then the cost doubles. This
is homogeneity of degree one. On the other hand, a consumer’s demand
behavior is typically homogeneous of degree zero. Demand is a function
φ(p, w) that gives the consumer’s utility maximizing feasible demand given
prices p and wealth w. The demand is the best affordable consumption for
the consumer. The consumptions x that are affordable satisfy p · x ≤ w (and
possibly another constraint like non-negativity). If p and w are multiplied
by the same factor, λ, then the budget constraint remains unchanged. Hence
the demand function is homogeneous of degree zero.
Euler’s Theorem provides a nice decomposition of a function F. Suppose that F describes the profit produced by a team of n agents, when agent i contributes effort xi. How should the team divide the profit it generates? If F is linear, the answer is easy: if F(x) = p · x, then just give agent i the amount pi xi. Here you give each agent a constant “per unit” payment equal to the marginal contribution of her effort. When you do so, you distribute the entire surplus (and nothing else). When F is non-linear, it is harder to figure out the contribution of each agent. The theorem states that if you pay each agent her marginal contribution (D_{e_i} F(x)) per unit, then you distribute the surplus fully if F is homogeneous of degree one. Otherwise, it identifies alternative ways to distribute the surplus.
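Euler’s Theorem is easy to verify numerically for a homogeneous-of-degree-one example. A sketch (pure Python; the Cobb-Douglas function and the evaluation point are our choices, not from the text):

```python
def f(x1, x2):
    return x1 ** 0.3 * x2 ** 0.7   # Cobb-Douglas, homogeneous of degree one

def partial(f, x1, x2, i, h=1e-6):
    """Central-difference approximation to the ith partial derivative."""
    if i == 1:
        return (f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h)
    return (f(x1, x2 + h) - f(x1, x2 - h)) / (2 * h)

x1, x2 = 2.0, 3.0
# Euler's Theorem for degree one: x1 f_1(x) + x2 f_2(x) = f(x),
# i.e. paying each input its marginal product exhausts the output exactly.
euler_sum = x1 * partial(f, x1, x2, 1) + x2 * partial(f, x1, x2, 2)
print(euler_sum, f(x1, x2))   # the two values agree
```

For a function homogeneous of degree k the same computation would return k · f(x) instead.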
f(x) = f(a) + ∇f(a)(x − a) + (1/2)(x − a)′ D²f(a)(x − a) + E3(x, a),

where ∇f(a) is 1 × n, D²f(a) is n × n, x − a is n × 1, and the first three terms on the right-hand side together form the second-order Taylor polynomial P2(x, a).
where

(1/2)(x − a)′ D²f(a)(x − a)

= (1/2) (x1 − a1, . . . , xn − an) · [ ∂²f/∂x1²(a)    · · ·  ∂²f/∂xn∂x1(a)
                                       ...
                                       ∂²f/∂x1∂xn(a)  · · ·  ∂²f/∂xn²(a) ] · (x1 − a1, . . . , xn − an)′

= (1/2) ∑_{i=1}^n ∑_{j=1}^n (xi − ai) ∂²f/∂xi∂xj(a) (xj − aj).
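The second-order Taylor polynomial can be assembled numerically from difference-quotient approximations to the gradient and the Hessian. A sketch (pure Python; the test function, point, and step sizes are all our choices, not from the text), illustrating that the error E3 is third order:

```python
import math

def grad(f, a, h=1e-5):
    """Numerical gradient via central differences."""
    g = []
    for i in range(len(a)):
        ap, am = list(a), list(a)
        ap[i] += h
        am[i] -= h
        g.append((f(ap) - f(am)) / (2 * h))
    return g

def hessian(f, a, h=1e-4):
    """Numerical Hessian via second-order central differences."""
    n = len(a)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            app, apm, amp, amm = (list(a) for _ in range(4))
            app[i] += h; app[j] += h
            apm[i] += h; apm[j] -= h
            amp[i] -= h; amp[j] += h
            amm[i] -= h; amm[j] -= h
            H[i][j] = (f(app) - f(apm) - f(amp) + f(amm)) / (4 * h * h)
    return H

def taylor2(f, a, x):
    """Second-order Taylor polynomial P2(x, a) = f(a) + grad.(x-a) + (1/2)(x-a)'H(x-a)."""
    d = [xi - ai for xi, ai in zip(x, a)]
    g, H = grad(f, a), hessian(f, a)
    n = len(a)
    quad = sum(d[i] * H[i][j] * d[j] for i in range(n) for j in range(n))
    return f(a) + sum(gi * di for gi, di in zip(g, d)) + 0.5 * quad

f = lambda x: math.exp(x[0]) * math.cos(x[1])
a, x = [0.0, 0.0], [0.1, 0.1]
print(taylor2(f, a, x), f(x))   # close: the gap is the third-order error E3
```

Here the exact P2 value is 1.1 (since the Hessian at the origin is diag(1, −1)), and f(0.1, 0.1) differs from it only in the fourth decimal place.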
The general form is quite messy. The way to make the mess more pleasant is to let the notation do all of the work. We define D_h^k f to be a kth derivative:

D_h^k f = ∑_{j1+···+jn=k} ( k ; j1, . . . , jn ) h1^{j1} · · · hn^{jn} D1^{j1} · · · Dn^{jn} f,

where ( k ; j1, . . . , jn ) denotes the multinomial coefficient k!/(j1! · · · jn!).
Convexity
This section contains some basic information about convex sets and func-
tions in Rn .
Let X ⊂ Rn .
Bε(x) ∩ X ≠ ∅

Bε(x) ⊂ X

Bε(x) ∩ X ≠ ∅

and

Bε(x) ∩ [Rn \ X] ≠ ∅
X ⊂ {y | y · p ≥ c}
and
x·p<c
Proof. Consider the problem of minimizing the distance between x and the set X. That is, find y∗ to solve:

min_{y∈X} ‖y − x‖².    (10.1)

Set p = y∗ − x and c = p · y∗. The key step is to show that

(y − y∗) · (y∗ − x) ≥ 0    (10.2)

for all y ∈ X, since (10.2) says exactly that if y ∈ X, then y · p ≥ c. Since X is convex and y∗ is defined to solve (10.1), it must be that ‖ty + (1 − t)y∗ − x‖² is minimized when t = 0, so that the derivative of ‖ty + (1 − t)y∗ − x‖² with respect to t is non-negative at t = 0. Differentiating and simplifying yields inequality (10.2). Finally, p · x = p · y∗ − p · p = c − ‖p‖² < c, because p ≠ 0 (as x ∉ X).
Notice that without loss of generality you can normalize the normal to the
separating hyperplane. That is, you can assume that kpk = 1.
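The projection argument in the proof can be seen concretely for a set where the projection is easy to compute, the box [0, 1]². A sketch (pure Python; the example point is our choice, not from the text):

```python
def project_to_box(x, lo=0.0, hi=1.0):
    """Closest point of the box [lo, hi]^n to x; the box is closed and convex."""
    return [min(max(xi, lo), hi) for xi in x]

dot = lambda u, v: sum(a * b for a, b in zip(u, v))

x = [2.0, 0.5]                        # a point outside the unit square
ystar = project_to_box(x)             # [1.0, 0.5], the minimizer in (10.1)
p = [a - b for a, b in zip(ystar, x)] # normal p = ystar - x
c = dot(p, ystar)
# Every y in the box satisfies p . y >= c, while p . x < c:
print(dot(p, x), c)
corners = [[0, 0], [0, 1], [1, 0], [1, 1]]
print(all(dot(p, y) >= c for y in corners))  # checking the extreme points suffices
```

Here p = (−1, 0) and c = −1: the hyperplane x1 = 1 separates the point (2, 0.5) from the square, exactly as the theorem promises.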
You can refine the separating hyperplane theorem in two ways. First, if x is
an element of the boundary of X, then you can approximate x by a sequence
xᵏ such that each xᵏ ∉ X. This yields a sequence of pᵏ, which can be taken
to be unit vectors, that satisfy the conclusion of the theorem. A subsequence
of the pᵏ must converge. The limit point p∗ will satisfy the conclusion of
the theorem (except we can only guarantee that c ≥ p∗ · x rather than the
strict inequality above). Second, one can check that the closure of any convex
set is convex. Therefore, given a convex set X and a point x not in the
interior of the set, we can separate x from the closure of X . Taking these
considerations into account we have the following version of the separating
hyperplane theorem.
X ⊂ {y | y · p ≥ c}
and
x·p≤c
for all x, y, and λ ∈ (0, 1), f (λx + (1 − λ)y) ≥ min{f (x), f (y)}. (10.3)
To see that this definition is equivalent to Definition 108 note first that if
a = min{f (x), f (y)}, then Definition 108 implies (10.3). Conversely, if
the condition in Definition 108 fails, then there exist a, x, y, and λ such that
f (x), f (y) ≥ a but f (λx + (1 − λ)y) < a. Plainly condition (10.3) fails for
these values of λ, x, and y.
Quasiconcavity and Quasiconvexity are global properties of a function. Un-
like continuity, differentiability, concavity and convexity (of functions), they
cannot be defined at a point.
Points on the boundary of this (convex) level set give you values of the
function equal to some constant c. Points inside the shaded region give
you values of the function that are greater than or equal to c. We want to
examine the set { z | f (z) ≥ c}. Consider points x, y ∈ { z | f (z) ≥ c};
quasiconcavity then gives

αx + (1 − α)y ∈ { z | f (z) ≥ c}.

This is really providing the intuition for the following Theorem.
Note that concavity implies quasiconcavity but that the opposite is not true.
A quasiconcave function may be concave, but certainly does not have to be.
The easiest way to see this is just to note that f (x) = ex is just a func-
tion of one variable and thus is easy to plot. But we plot the function, not
the level sets (note that for a function of one variable, level sets are points).
Since the exponential function is increasing, upper contour sets are of the
form [a, ∞). So the function is quasiconcave. The function is quasiconvex too!
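A quick numerical spot-check of both claims (a sketch on a grid of points and weights): because eˣ is monotone, f(λx + (1 − λ)y) always lies between f(x) and f(y).

```python
import numpy as np

# f(x) = e^x: check min{f(x), f(y)} <= f(lam*x + (1-lam)*y) <= max{f(x), f(y)}
# for all grid points x, y and weights lam in (0, 1).
f = np.exp
xs = np.linspace(-2.0, 2.0, 21)
lams = np.linspace(0.05, 0.95, 10)

ok = True
for x in xs:
    for y in xs:
        for lam in lams:
            z = f(lam * x + (1 - lam) * y)
            # quasiconcavity (lower bound) and quasiconvexity (upper bound) at once
            if not (min(f(x), f(y)) - 1e-12 <= z <= max(f(x), f(y)) + 1e-12):
                ok = False
assert ok
```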
=⇒ g ◦ f has property(∗)
In other words it doesn’t matter how much greater the value f (x) is than
f (y), it just matters that it is greater.
“Doing better” is typically an ordinal property. When you finish a race,
you receive an ordinal ranking (“fifth place”), which would not change if
you applied a monotonic transformation to the times. There are times when
you care about more than ordinal ranking. Saying someone is taller than
someone else is ordinal. Saying someone is ‘a lot’ taller is not.
f : (0, ∞) −→ R,  f (x) = x^{1/2}.

So

x > y =⇒ f (x) > f (y).

Now let g : R −→ R, g(y) = y⁴, so that

g ◦ f (x) = x².

And again we have that x > y =⇒ g ◦ f (x) > g ◦ f (y).
UJ : R++ −→ R
So the input into Joel’s utility function is just a positive real number (i.e. the
number of apples he ate) and the output is also just a number (the number
of “utils” that this number of apples gave him). But the important point is
that we do not attach any psychological significance to the value attained
by the function, because Joel only cares about the ranking of how much joy
(how many utils) he gets from different quantities of apples.
Suppose

UJ (x) = x^{1/2}.
So we have that

UJ (16) = 4 > 3 = UJ (9).

The fact that it is 1 “util” greater is irrelevant - the important thing is that
16 apples give Joel more utils than 9 apples. So, as you would expect, he
prefers 16 apples to 9! He ranks the bundles of goods (in this case apples)
ordinally.
While it seems obvious that Joel prefers 16 apples to 9 apples, it does get
more tricky when Joel’s utility function depends on two commodities: say
apples and bananas. But you will see all this in 200A.
Another more technical way to distinguish the two is to say that an ordinal
property is preserved under any monotonic transformation while a cardinal
property is preserved only under a positive affine transformation. Ordi-
nal properties rank points in the domain while cardinal properties rank the
Cartesian product of points in the domain.
Chapter 11
Unconstrained Extrema of
Real-Valued Functions
11.1 Definitions
The following are natural generalizations of the one-variable concepts.
f (x) ≥ f (x∗ )
Df (x∗) = 0.

Consequently,

∂f/∂xi (x∗) = 0,  ∀ i = 1, 2, . . . , n.
Proof. We define h : R −→ R by

h(t) = f (x∗ + tv)

for any v ∈ Rⁿ, t ∈ R. Fix a direction v (‖v‖ ≠ 0). We have (for a local
maximum)

f (x∗) ≥ f (x) for all x near x∗,

so t = 0 is a local maximum of h,

=⇒ h′(0) = 0.

Since h′(0) = ∇f (x∗) · v and the direction v was arbitrary, it follows that

∇f (x∗) = 0.
∇f (x∗ , y ∗ ) = 0
∂f/∂x (x, y) = 12x² − 6y + 6 = 0

∂f/∂y (x, y) = 2y − 6x = 0

The second equation gives 2y = 6x, i.e., y = 3x. Substituting into the first:

12x² − 18x + 6 = 0
⇒ 2x² − 3x + 1 = 0
⇒ (2x − 1)(x − 1) = 0

gives us

x = 1/2 or x = 1.

x = 1/2 ⇒ y = 3/2,  and  x = 1 ⇒ y = 3.

So we have 2 critical points:

(1/2, 3/2) and (1, 3).
Df (x) = ∇f (x) = (∂f/∂x, ∂f/∂y),

and (Df )′ is the column vector with entries ∂f/∂x and ∂f/∂y. Then

D²f = D(Df )′ =
  ( ∂²f/∂x²    ∂²f/∂x∂y )
  ( ∂²f/∂y∂x   ∂²f/∂y²  )
=
  ( 24x   −6 )
  ( −6     2 )
So

D²f (1/2, 3/2) =
  ( 12   −6 )
  ( −6    2 )

And we note that the first entry in this matrix (i.e. the a entry) is positive! It
can also be seen that

det D²f (1/2, 3/2) = 24 − 36 = −12 < 0.

And from our rule for quadratic forms this gives us an indefinite quadratic
form. So the point (1/2, 3/2) is neither a maximizer nor a minimizer.
So

D²f (1, 3) =
  ( 24   −6 )
  ( −6    2 )

And we note that the first entry in this matrix (i.e. the a entry) is positive! It
can also be seen that

det D²f (1, 3) = 48 − 36 = 12 > 0.

And from our rule for quadratic forms this gives us a positive definite
quadratic form. So the point (1, 3) is a local minimizer.
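The whole computation can be reproduced symbolically (a sketch with sympy; the function f(x, y) = 4x³ − 6xy + 6x + y² is reconstructed from the first-order conditions above):

```python
import sympy as sp

x, y = sp.symbols('x y', real=True)
# Reconstructed so that f_x = 12x^2 - 6y + 6 and f_y = 2y - 6x.
f = 4*x**3 - 6*x*y + 6*x + y**2

# Solve the first-order conditions for all critical points.
crit = sp.solve([sp.diff(f, x), sp.diff(f, y)], [x, y], dict=True)
H = sp.hessian(f, (x, y))

# Classify each critical point by the sign of det(H) and the (1,1) entry.
classified = {}
for pt in crit:
    Hpt = H.subs(pt)
    det, a = Hpt.det(), Hpt[0, 0]
    if det > 0 and a > 0:
        classified[(pt[x], pt[y])] = 'local min'
    elif det > 0 and a < 0:
        classified[(pt[x], pt[y])] = 'local max'
    elif det < 0:
        classified[(pt[x], pt[y])] = 'saddle'
```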
Chapter 12

Invertibility and Implicit Function Theorem
The standard motto is: Whatever you know about linear functions is true
locally about differentiable functions. This section discusses two useful
properties that can be understood in these terms.
f (x) = f (x0 ) =⇒ x = x0
f ′(x0) ≠ 0,

then ∃ ε > 0 such that f is strictly monotone on the open interval
(x0 − ε, x0 + ε).
Note this is just saying that if f ′(x0) is positive and f ′ is continuous, then it
remains positive over an interval.
So f is locally invertible at x0: we can define g on (x0 − ε, x0 + ε) such
that

g(f (x)) = x.
That is, in the one-variable case, linear functions with (constant) derivative
not equal to zero are invertible. Differentiable functions with derivative not
equal to zero at a point are invertible locally. For one variable functions,
if the derivative is always non zero, then the inverse can be defined on the
entire range of the function. When you move from functions from Rn to
itself, you can ask whether inverse functions exist. Linear functions can be
represented as multiplication by a square matrix. Invertibility of the func-
tion is equivalent to inverting the matrix. So a linear function is invertible
(globally) if its matrix representation is invertible.
so the formula for the derivative of the inverse generalizes the one-variable
formula.
The proof of the inverse function theorem is hard. One standard technique
involves methods that are sometimes used in economics, but the details are
fairly intricate and not worth our time.
The problem of finding an inverse suggests a more general problem. Sup-
pose that you have a function G : Rn+m −→ Rm . Let x ∈ Rn and y ∈ Rm .
We might be interested in whether we can solve the system of equations:
G(x, y) = 0. This is a system of m equations in n + m variables. You
might hope therefore that for every choice of y you could solve the equa-
tion. That is, you might search for a solution to the equation that gives x as
a function of y. The problem of finding an inverse is really a special case
where n = m and G(x, y) = f (x) − y.
The general case is an important problem in economics. In the typical ap-
plication, the system of equations characterizes an economic equilibrium.
Maybe they are the equations that determine market clearing price. Maybe
they are the first-order conditions that characterize the solution to an opti-
mization problem. The x variables are parameters. You “solve” a model
for a fixed value of the parameters and you want to know what the solution
to the problem is when the parameters change by a little bit. The implicit
function theorem is a tool for analyzing the problem. This theorem says that
(under a certain condition), if you can solve the system at a given x0 , then
you can solve the system in a neighborhood of x0 . Furthermore, it gives you
expressions for the derivatives of the solution function.
Why call this the implicit function theorem? Life would be great if you
could write down the system of equations and solve them to get an explicit
representation of the solution function. If you can do this, then you can ex-
hibit the solution (so existence is not problematic) and you can differentiate
it (so you do not need a separate formula for derivatives). In practice, you
may not be able to find an explicit form for the solution. The theorem is the
next best thing.
We will try to illustrate the ideas with some simple examples.
Example 46.
f (x, z) = x3 − z = 0
So there is a pair x and z which fits this identity (actually in this example
there are infinitely many such pairs).
In this example it is easy to pull the z to one side and solve for it in
terms of x, but suppose you had a much more complicated example like

x²z − z² + sin x ln z + cos x = 0.

The important point is that even this messy equation is still expressing z
in terms of x, although it may be difficult to solve for this explicitly - but there
are ways around this problem.
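One such way: compute dz/dx = −f_x/f_z directly, without ever solving for z. A sympy sketch on both the messy equation and the simple one:

```python
import sympy as sp

x, z = sp.symbols('x z', positive=True)

# The messy implicit relation: even without solving for z, dz/dx = -F_x / F_z.
F = x**2 * z - z**2 + sp.sin(x) * sp.log(z) + sp.cos(x)
dzdx_messy = -sp.diff(F, x) / sp.diff(F, z)

# Sanity check on the simple example f(x, z) = x^3 - z, where z = x^3 explicitly,
# so dz/dx should come out as 3x^2.
G = x**3 - z
dzdx_simple = sp.simplify(-sp.diff(G, x) / sp.diff(G, z))
assert dzdx_simple == 3 * x**2
```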
Not only might x0 correspond to a few z-values, but also z0 might corre-
spond to a few x-values.
But if we examine (x0 , z0 ) in a small neighborhood where the function is
increasing...
Using the picture, we can see that if we increase the x value slightly, the
function no longer takes the value zero, so you can see that we are no longer
on the level set!
Δx · ∂f/∂x = −Δz · ∂f/∂z

Δz/Δx = − (∂f/∂x) / (∂f/∂z)

∂z/∂x = − (∂f/∂x) / (∂f/∂z)
So if f gives us a function

g : (x0 − ε, x0 + ε) −→ (z0 − ε, z0 + ε)

such that

f (x, g(x)) = 0,  ∀ x ∈ (x0 − ε, x0 + ε),

then we can define

h : (x0 − ε, x0 + ε) −→ R

by

h(x) = f (x, g(x)).

So the point of this is that if h(x) = 0 for all x, then h′(x) = 0 for all
x. But the clever part is that h′(x) can be written in terms of the partial
derivatives of f and z = g(x).
f : R × Rm −→ Rm

and

f (x0 , z0 ) = 0.

If Dz f (x0 , z0 ) is non-singular, there exists a neighborhood of (x0 , z0 ) and a
function g : R −→ Rm defined on the neighborhood of x0 , such that z = g(x)
uniquely solves f (x, z) = 0 on this neighborhood.
Furthermore the derivatives of g are given by

g ′(x0 ) = −[Dz f (x0 , z0 )]⁻¹ Dx f (x0 , z0 ).
We said that the inverse function theorem was too hard to prove here. Since
the implicit function theorem is a generalization, it too must be too hard
to prove. It turns out that the techniques one develops to prove the inverse
function theorem can be used to prove the implicit function theorem, so the
proof is not much harder. It is also the case that the hard thing to prove
is the existence of the function g that gives z in terms of x. If you assume
that this function exists, then computing the derivatives of g is a simple
application of the chain rule. The following proof describes this argument.
Proof. So we have

f (x, g(x)) = 0.

And we define

H(x) ≡ f (x, g(x)).

Differentiating with the chain rule,

DH(x0 ) = Dx f (x0 , z0 ) + Dz f (x0 , z0 ) · Dx g(x0 ) = 0.

And thus

[Dz f (x0 , z0 )]⁻¹ · [Dz f (x0 , z0 )] · Dx g(x0 ) = −[Dz f (x0 , z0 )]⁻¹ Dx f (x0 , z0 ),

and since [Dz f (x0 , z0 )]⁻¹ [Dz f (x0 , z0 )] = Im , it follows that

Dx g(x0 ) = −[Dz f (x0 , z0 )]⁻¹ Dx f (x0 , z0 ).
The implicit function theorem thus gives you a guarantee that you can (lo-
cally) solve a system of equations in terms of parameters. As before, the
theorem really is a local version of a result about linear systems. The theo-
rem tells you what happens if you have one parameter. The same theorem
holds when you have many parameters x ∈ Rn rather than x ∈ R.
Theorem 67. Suppose

f : Rn × Rm −→ Rm,

f (x0 , z0 ) = 0, and Dz f (x0 , z0 ) is non-singular. There exists a neighborhood
of (x0 , z0 ) and a function g : Rn −→ Rm defined on the neighborhood of x0 ,
such that z = g(x) uniquely solves f (x, z) = 0 on this neighborhood.
Furthermore the derivatives of g are given by implicit differentiation (use the
chain rule):

Dg(x0 ) = −[Dz f (x0 , z0 )]⁻¹ Dx f (x0 , z0 ),

where Dg(x0 ) is m × n, Dz f (x0 , z0 ) is m × m, and Dx f (x0 , z0 ) is m × n.
Verifying that this is the correct formula for the derivative is just the chain
rule.
12.3 Examples
Here is a simple economic example of a comparative-statics computation.
A monopolist produces a single output to be sold in a single market. The
cost to produce q units is C(q) = q + .5q² dollars and the monopolist can
sell q units for the price of P (q) = 4 − q⁵/6 dollars per unit. The monopolist
must pay a tax of one dollar per unit sold.
(a) Show that the output q ∗ = 1 maximizes profit (revenue minus tax
payments minus production cost).
(b) How does the monopolist’s output change when the tax rate changes
by a small amount?
The monopolist's profit as a function of q (given tax t) is

q(4 − q⁵/6) − tq − q − .5q².

The first-order condition is

q⁵ + q − 3 + t = 0
and the second derivative of the objective function is −5q 4 − 1 < 0, so there
is at most one solution to this equation, and the solution must be a (global)
maximum. Plug in q = 1 to see that this value does satisfy the first-order
condition when t = 1. Next the question asks you to see how the solution
q(t) to:
q⁵ + q − 3 + t = 0
varies as a function of t when t is close to one. We know that q(1) = 1
satisfies the equation. We also know that the left-hand side of the equation
is increasing in q, so the condition of the implicit function theorem holds.
Differentiation yields:
q ′(t) = − 1/(5q⁴ + 1).   (12.1)

In particular, q ′(1) = −1/6.
In order to obtain the equation for q 0 (t) (12.1), you could use the general
formula or you could differentiate the identity:
q(t)⁵ + q(t) − 3 + t ≡ 0
and solve for q 0 (t). Notice that the equation is linear in q 0 . This technique
of “implicit differentiation” is fully general. In the example you have n =
m = 1 so there is just one equation and one derivative to find. In general,
you will have an identity in n variables and m equations. If you differentiate
each of the equations with respect to a fixed parameter, you will get m linear
equations for the derivatives of the m implicit functions with respect to that
variable. The system will have a solution if the invertibility condition in the
theorem is true.
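The comparative-statics claim is easy to verify numerically (a sketch using a root finder on the first-order condition, with a finite-difference check of the implicit-function-theorem formula):

```python
import numpy as np
from scipy.optimize import brentq

# The monopolist's first-order condition: q^5 + q - 3 + t = 0, solved for q given t.
def q_of_t(t):
    return brentq(lambda q: q**5 + q - 3 + t, 0.0, 2.0)

assert abs(q_of_t(1.0) - 1.0) < 1e-8        # q(1) = 1

# Finite-difference check of q'(t) = -1/(5q^4 + 1), which gives q'(1) = -1/6.
h = 1e-6
slope = (q_of_t(1.0 + h) - q_of_t(1.0 - h)) / (2 * h)
assert abs(slope + 1/6) < 1e-4
```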
12.4 Envelope Theorem for Unconstrained Optimization
We will not concern ourselves with conditions under which the maximum
exists (we cannot apply the standard existence result when X is open), but
just focus on examples where the max value of f does exist, i.e., V (a) exists,
and we will ask what happens to V (a) as a changes. That is, what is V ′(a)?
Suppose that we can find a function g : R −→ Rn such that g(a) maximizes
f (x, a). We have V (a) = f (g(a), a). That is, the value function V is a
real-valued function of a real variable. We can compute its derivative using
the chain rule (assuming that f and g are differentiable). The implicit
function theorem gives us a sufficient condition for g to be differentiable
and a formula for the derivative, provided that solutions to the optimization
problem are characterized as solutions to the first-order condition. That is,
if x∗ solves maxx∈X f (x, a∗ ) if and only if Dx f (x∗ , a∗ ) = 0, then x∗ is im-
plicitly defined as a solution to a system of equations. Applying the chain
rule mechanically we have

V ′(a∗ ) = Dx f (g(a∗ ), a∗ ) Dg(a∗ ) + Da f (g(a∗ ), a∗ ).
The implicit function theorem tells us when Dg(a∗ ) exists and gives us a
formula for the derivative. In order to evaluate V 0 , however, we only need
to know that the derivative exists. This is because at an interior solution to
the optimization problem Dx f (g(a∗ ), a∗ ) = 0. It follows that the change in
the value function is given by V 0 (a∗ ) = Da f (x∗ , a∗ ), where x∗ = g(a∗ ) is
the solution to the optimization problem at a∗ .
The one condition that needs to be checked is the non-singularity condition
in the statement of the implicit function theorem. Here the condition is
that the matrix of second derivatives of f with respect to the components
of x is non-singular. We often make this assumption (in fact, we assume
that the matrix is negative definite), to guarantee that solutions to first-order
conditions characterize a maximum.
The result that the derivative of the value function depends only on the par-
tial derivative of the objective function with respect to the parameter may be
somewhat surprising. It is called the envelope theorem (you will see other
flavors of envelope theorem).
Suppose that f is a profit function and a is a technological parameter. Cur-
rently a∗ describes the state of the technology. The technology shifts. You
shift your production accordingly (optimizing by selecting x = g(a)). Your
profits shift. How can you tell whether the technological change makes you
better off? Changing the technology has two effects: a direct one that is
captured by Da f . For example, if Da f (x∗ , a∗ ) < 0 this means that increas-
ing a leads to a decrease in profits if you hold x∗ fixed. The other effect
is indirect: you adjust x∗ to be optimal with respect to the new technology.
The envelope theorem says that locally the first effect is the only thing that
matters. If the change in a makes the technology less profitable for fixed
x∗ , then you will not be able to change x∗ to reverse the negative effect.
The reason is, loosely, that when you optimize with respect to x and change
the technology, then x∗ is still nearly optimal for the new problem. Up to
“first order” you do not gain from changing it. When Da f (x∗ , a∗ ) 6= 0,
changing a leads to a first-order change in profits that dominates the effect
of reoptimizing.
The result is called the envelope theorem because (geometrically) the result
says that V (a), the value function, is tangent to a family of functions f (x, a)
when x = g(a). On the other hand, V (a) ≥ f (x, a) for all a, so the V curve
looks like an “upper envelope” to the f curves.
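A minimal numerical sketch, assuming the hypothetical objective f(x, a) = ax − x², for which x∗(a) = a/2 and V(a) = a²/4; the derivative of the value function matches the partial derivative of f with respect to a at the optimum:

```python
# Hypothetical objective: f(x, a) = a*x - x^2, maximized at x*(a) = a/2,
# so V(a) = f(a/2, a) = a^2/4 and V'(a) = a/2.
def f(x, a):
    return a * x - x**2

def g(a):          # the maximizer x*(a)
    return a / 2

def V(a):          # the value function
    return f(g(a), a)

a_star = 3.0
h = 1e-6
V_prime = (V(a_star + h) - V(a_star - h)) / (2 * h)

# Envelope theorem: V'(a*) = D_a f(x*, a*); here D_a f(x, a) = x, evaluated at x* = g(a*).
Da_f = g(a_star)
assert abs(V_prime - Da_f) < 1e-5
```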
For a presentation of value functions that is both simpler and more coherent
than this, you can read chapters 26 and 27 of Varian. It is a nice treatment
since it does lots of simpler examples of functions of only 1 and 2 variables,
as opposed to the general n-variable case above. Really only chapter 27
deals with value functions, but chapter 26 is good background reading too.
Chapter 13
Constrained Optimization
or
Du f (x∗ ) = Dw f (x∗ ) (Dw G(x∗ ))−1 Du G(x∗ ). (13.4)
Let
λ = Dw f (x∗ ) (Dw G(x∗ ))−1
or
Dw f (x∗ ) = λDw G(x∗ ). (13.5)
It follows from equation (13.4) that

Df (x∗ ) = λDG(x∗ ).

We summarize this in the following result.
∇f (x∗ ) = Σ_{i=1}^m λi ∇gi (x∗ ).
DV (0) = (Du f (x∗ ) + Dw f (x∗ ) Du W (x∗ , 0)) Du∗ (0) + Dw f (x∗ )Db W (x∗ , 0).
The first two terms on the right-hand side are precisely those from the first-
order condition (13.3) multiplied by Du∗ (0). These terms are zero (as in
the case of the unconstrained envelope theorem). The final term on the right
is the contribution that comes from the fact that W may change with b. We
can use the implicit function theorem to get:
DV (0) = λ. (13.6)
resource. When the constraints are equations you cannot state in advance
whether an increase in bi will raise or lower the value of the problem. That
is, you cannot state in advance the sign of λi . There is a variation of this
result that holds in inequality constrained problems. In that case there is a
definite sign to λi (whether the sign is positive or negative depends on how
you write the constraints).
and
Equation (13.8) differs from equation (13.7) because it includes all con-
straints (not just the binding ones). Equation (13.8) and (13.9) are equiva-
lent to (13.7). Equation (13.9) can only be satisfied (for a given i) if either
λi = 0 or gi (x∗ ) = 0. That is, it is a clever way of writing that either a con-
straint is binding or its associated multiplier is zero. Once we know that
Definition 115. The vector v enters S from x∗ if there exists ε > 0 such that
x∗ + tv ∈ S for all t ∈ [0, ε).
equality constrained set as a special case (to impose the constraint g(x) = b
substitute two inequalities: g(x) ≥ b and −g(x) ≥ −b). The converse is
not true: Equality constrained problems are typically easier than inequality
constrained problems.
When gi (x∗ ) = bi the ith constraint is binding. If you move from x∗ in the
“wrong” direction, then the constraint will no longer be satisfied. However,
if you move “into” S, then the constraint will be satisfied. When S is de-
scribed by linear inequalities, then the inward directions are simply the ai
associated with binding constraints.
ak · v ≥ 0 for all k ∈ J.
Proof. This is the standard one-variable argument. We know that there ex-
ists ε > 0 such that x∗ + tv ∈ S for t ∈ [0, ε) and therefore f (x∗ ) ≥
f (x∗ + tv) for t ∈ [0, ε). Hence Dv f (x∗ ) ≤ 0, which is equivalent to
∇f (x∗ ) · v ≤ 0.
or
The first condition in the theorem says that w is in the convex set generated
by the rows of the matrix A. Hence the theorem says that if w fails to be in a
certain convex set, then it can be separated from the set. (In the application
of the result, w = −∇f (x∗ ).) The vector x in the second condition plays the
role of the normal of the separating hyperplane. Geometrically, the second
condition states that it is possible to find a direction that makes an angle
of less than ninety degrees with all of the rows of A and greater than ninety
degrees with w. This condition is ruled out by Theorem 70, hence the first
condition must hold.
and

Σ_{i∈I} λ∗i (gi (x∗ ) − bi ) = 0.   (13.11)
Plainly the example does not satisfy the constraint qualification. The con-
straint qualification will hold if the set of inward normals is linearly inde-
pendent (check that this condition fails in the example). When linear inde-
pendence holds, it is possible to find a normal direction w that strictly enters:
ni · w = 1 for all i, and with this you can construct a curve σ with the property
that ∇gi (x∗ ) · σ ′(0) > 0 for all i such that gi (x∗ ) = bi .
Here is another technical point. The linear independence condition will not
hold if some of the constraints were derived from equality constraints (that
is, one constraint is of the form g(x) ≥ 0 and another constraint is of the
form −g(x) ≥ 0). This is why frequently equality constraints are written
separately and a linear independence condition is imposed on directions
formed by all binding constraints.
We now modify Theorem 72.
and

Σ_{i∈I} λ∗i (gi (x∗ ) − bi ) = 0.   (13.14)
It is no accident Theorem 73 looks like Theorem 72. All that we did was
assume away the pathologies caused by non-linear constraints.
subject to
gi (x) ≥ b̃i , i ∈ I (13.16)
then equations (13.10) and (13.11) must hold. Note that
c − y∗A + z∗ = 0 (13.18)
and
y ∗ · (b − Ax∗ ) + z ∗ · x∗ = 0. (13.19)
Equations (13.18) and (13.19) divide the multiplier vector λ into two parts.
y ∗ contains the components of λ corresponding to i ≤ m – the constraints
summarized in the matrix A. z ∗ are the components of λ corresponding to
i > m; these are the non-negativity constraints.
It is useful to rewrite the constraints. Let
Theorem 74. If x∗ solves (13.17), then there exist y ∗ and z ∗ ≥ 0 such that
s(x; y ∗ , z ∗ ) = (c − At y ∗ + z ∗ ) · x + b · y ∗ = b · y ∗   (13.21)

for all x. Therefore the first claim of the theorem holds as an equation.
Furthermore, equation (13.19) implies that s(x∗ ; y ∗ , z ∗ ) = c · x∗ , which
yields the fourth line.
To prove the second line, we must show that
c − At y ∗ + z ∗ = 0 and z ∗ ≥ 0
where the first equation is the definition of s(·), the second equation is just
algebraic manipulation, the third equation follows because equations (13.19),
y ∗ ≥ 0, and b ≥ Ax∗ imply that z ∗ · x∗ = 0, and the inequality follows be-
cause x∗ ≥ 0 and c ≤ At y. It follows that
b · y ≥ s(x∗ ; y, z ∗ ) ≥ s(x∗ ; y ∗ , z ∗ ) = b · y ∗
max c · x subject to Ax ≤ b, x ≥ 0,
min b · y subject to At y ≥ c, y ≥ 0.
and that this problem has the same general form as the Primal (with c re-
placed by −b, A replaced by −At , and b replaced by −c). Hence the Dual
can be written in the form of the Primal and (applying the transformation
one more time), the Dual of the Dual is the Primal.
Summarizing the result we have:
max c · x subject to Ax ≤ b, x ≥ 0,
min b · y subject to At y ≥ c, y ≥ 0.
The Dual of the Dual is the Primal. If the Primal has a solution x∗ , then
the Dual has a solution y ∗ and b · y ∗ = c · x∗ . Moreover, if the Dual has
a solution, then the Primal has a solution and the problems have the same
value.
This result is called “The Fundamental Duality Theorem” (of Linear Pro-
gramming). It frequently makes it possible to interpret the “multipliers”
economically. Mathematically, the result makes it clear that when you solve
a constrained optimization problem you are simultaneously solving another
optimization problem for a vector of multipliers.
There is one yi for every constraint in the primal (in particular, there need
not be the same number of variables in the primal as variables in the dual).
You therefore cannot compare x and y directly. You can compare the values
of the two problems. The theorem shows that when you solve these prob-
lems, the values are equal: b · y ∗ = c · x∗ . It is straightforward to show that if
x satisfies the constraints of the Primal then c·x ≤ c·x∗ and if y satisfies the
constraints of the dual then b · y ≥ b · y ∗ . Consequently any feasible value
of the Primal is less than or equal to any feasible value for the Dual. The
minimization problem provides upper bounds for the maximization problem
(and conversely).
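The equality of values is easy to see on a small hypothetical LP (a sketch using scipy's linprog; the numbers are illustrative only, and the dual is solved explicitly rather than read off as multipliers):

```python
import numpy as np
from scipy.optimize import linprog

# Primal: max c.x  s.t.  Ax <= b, x >= 0  (illustrative data).
c = np.array([3.0, 5.0])
A = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [3.0, 2.0]])
b = np.array([4.0, 12.0, 18.0])

# scipy minimizes, so negate c for the primal.
primal = linprog(-c, A_ub=A, b_ub=b, bounds=(0, None))

# Dual: min b.y  s.t.  A^T y >= c, y >= 0  (flip signs to express >= as <=).
dual = linprog(b, A_ub=-A.T, b_ub=-c, bounds=(0, None))

primal_value = -primal.fun   # c . x*
dual_value = dual.fun        # b . y*
assert abs(primal_value - dual_value) < 1e-6   # the Fundamental Duality Theorem
```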
It is a useful exercise to look up some standard linear programming prob-
lems and try to come up with interpretations of dual variables.
The duality theorem of linear programming tells you a lot about the dual
if you know that the primal has a solution. What happens if the Primal
fails to have a solution? First, since the duality theorem applies when the
Dual plays the role of the Primal, it cannot be that the Dual has a solution.
That is, the Dual has a solution if and only if the Primal has a solution.
There are only two ways in which a linear programming problem can fail
to have a solution. Either {x : Ax ≤ b, x ≥ 0} is empty. In this case the
problem is infeasible – nothing satisfies the constraints. In this case you
certainly do not have a maximum. Alternatively, the problem is feasible
({x : Ax ≤ b, x ≥ 0} is not empty), but you cannot achieve a maximum.
Of course, since linear functions are continuous, this means that the feasible
set cannot be bounded. It turns out that in the second case you can make
the objective function of the Primal arbitrarily large (so the Primal’s value
is “infinite”). It is not hard to show this. It is even easier to show that it is
impossible for the Primal to be unbounded if the Dual is feasible (because
any feasible value of the Dual is an upper bound for the Primal’s value).
To summarize:
(a) If the primal has a solution, then so does the dual, and the solutions
have the same value.
(b) If the primal has no solution because it is unbounded, then the dual is
not feasible.
(c) If the primal is not feasible, then either the dual is not feasible or the
dual is unbounded.
(d) The three statements above remain true if you interchange the words
“primal” and “dual.”
We began by showing that if x∗ solves the Primal, then there exist y ∗ and z ∗
such that
s(x∗ ; y, z) ≥ s(x∗ ; y ∗ , z ∗ ) ≥ s(x; y ∗ , z ∗ ). (13.27)
Expression (13.27) states that the function s is maximized with respect to
x at x = x∗ , (y, z) = (y ∗ , z ∗ ) and minimized with respect to (y, z) at
(x∗ , y ∗ , z ∗ ), (y, z) ≥ 0. This makes the point (x∗ , y ∗ , z ∗ ) a saddle point of
the function s.
What about the converse? Let us state the problem more generally.
Given the problem
for all λ ≥ 0.
Theorem 76. If (x∗ , λ∗ ) is a saddle point of s, then x∗ solves max f (x)
subject to gi (x) ≥ bi .
The right-hand side gets arbitrarily small as M grows large, which
cannot hold. This establishes the claim.
The only way to have λ∗ ≥ 0, gi (x∗ )−bi ≥ 0 for all i, and λ∗ ·(g(x∗ ) − b) ≤
0 is for λ∗ · (g(x∗ ) − b) = 0. It follows that s(x∗ , λ∗ ) ≤ s(x∗ , λ) implies
that x∗ is feasible (g(x∗ ) ≥ b) and that λ∗ , x∗ satisfy the complementary
slackness condition
λ∗ · (g(x∗ ) − b) = 0. (13.31)
To complete the proof, we must show that g(x) ≥ b implies that f (x∗ ) ≥
f (x). However, we know that s(x∗ , λ∗ ) ≥ s(x, λ∗ ) and we just showed that
Theorem 77. If f and gi are concave and x∗ solves max f (x) subject to
g(x) ≥ b and a “constraint qualification” holds, then there exists λ∗ ≥ 0
such that (x∗ , λ∗ ) is a saddle point of s(x, λ) = f (x) + λ · (g(x) − b).
Comments:
(a) We will discuss the “constraint qualification” later. It plays the role
that “regularity” played in the “inward normal” discussion.
(b) This result is often called the Kuhn-Tucker Theorem. The λ are called
Kuhn-Tucker multipliers. (Lagrange multipliers typically refer only
to multipliers from equality constrained problems. Remember: equal-
ity constraints are a special case of inequality constraints and are gen-
erally easier to deal with.)
(c) It is possible to prove essentially the same theorem with weaker as-
sumptions (quasi-concavity of g). The proof that we propose does not
generalize directly. See Sydsaeter or the original paper of Arrow and
Enthoven, Econometrica 1961.
(d) The beauty of this result is that it transforms a constrained problem
into an unconstrained one. Finding x∗ to maximize s(x, λ∗ ) for λ∗
fixed is relatively easy. Also, note that if f and gi are differentiable,
then the first-order conditions for maximizing s are the familiar ones:
∇f (x∗ ) + λ · ∇g(x∗ ) = 0.
λ · b + µf (x∗ ) ≥ λ · y ≥ µz
for all (y, z) ∈ K. This is almost what we need. First, observe that (λ, µ) ≥
0. If some component of λ, say λj < 0, you get a contradiction in the
usual way by taking points (y, z) ∈ K where, for example, z = f (x∗ ), yi =
bi , i 6= j, yj = −M . If M is large enough, this choice contradicts
λ · b + µf (x∗ ) ≥ λ · y + µz.
λ · (g(x∗ ) − b) = 0.
where we use λ∗ ·(g(x∗ ) − b) = 0 and the last line holds because g(x∗ ) ≥ b.
s(x∗ , λ∗ ) ≥ s(x, λ∗ ) ≡
f (x∗ ) + λ∗ · (g(x∗ ) − b) ≥ f (x) + λ∗ · (g(x) − b) ≡
f (x∗ ) + λ∗ · b ≥ f (x) + λ∗ · g(x)
since λ∗ · (g(x∗ ) − b) = 0.
All that remains is to clarify the constraint qualification. We need an as-
sumption that guarantees that µ, the weight on f , is strictly positive. What
happens if µ = 0? We have
That is, if x∗ + tv satisfies the constraints of the problem for all v and suf-
ficiently small t, then the second-order conditions require that the matrix
of second derivatives be negative semi-definite. In general, this condition
need only apply in the directions consistent with the constraints. There is a
theory of “bordered Hessians” that allows you to use some insights from
the theory of quadratic forms to classify when a quadratic form restricted to
a set of directions will be definite. This theory is ugly and we will
say no more about it.
13.5 Examples
Example 47.
max x21 − x2
x1 ,x2
subject to x1 − x2 = 0
It is pretty obvious then that x1 = x2 and thus the problem is really
max x21 − x1
x1
g(x1 , x2 ) = c

implicitly defines x2 = h(x1 ) around x∗1 (the solution).
Recall that h is well defined if

∂g/∂x2 (x∗ ) ≠ 0.

Differentiating the identity g(x1 , h(x1 )) = c gives

∂g/∂x1 (x∗ ) + ∂g/∂x2 (x∗ ) h′(x∗1 ) = 0,

so

h′(x∗1 ) = − [∂g/∂x1 (x∗ )] / [∂g/∂x2 (x∗ )].
max_x U(x) subject to p · x ≤ I,

where p is the vector of prices, x is the vector of commodities, and I is
income. So all the inequality says is that a person cannot spend more than
the money in their pocket!
For these problems we just assume that the constraint binds (we will see
why), since people obviously spend all the money in their pockets. You may
question this because people do save, but if people do save, then we just
define saving as an extra commodity and this gets rid of the problem.
2f (x)/xi = 2λxi ,
xi = ai^{1/2} / (a1 + · · · + an )^{1/2} , for i = 1, . . . , n.

It follows that ∑_{i=1}^{n} xi^2 = 1 and so
( a1 · · · an / (a1 + · · · + an )^n )^{1/n} ≤ 1/n,

and so

(a1 · · · an )^{1/n} ≤ (a1 + · · · + an ) / n.
The interpretation of the last inequality is that the geometric mean of n
positive numbers is no greater than their arithmetic mean.
It is possible to use techniques from constrained optimization to deduce
other important results (the triangle inequality, the optimality of least squares,
. . . ).
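The AM-GM inequality derived above is simple to spot-check numerically; the sample sizes and values below are arbitrary.

```python
import math
import random

# Sketch: for positive a1, ..., an, the geometric mean (a1*...*an)**(1/n)
# never exceeds the arithmetic mean (a1 + ... + an)/n.

random.seed(0)
for _ in range(1000):
    n = random.randint(1, 8)
    a = [random.uniform(0.01, 100.0) for _ in range(n)]
    geometric = math.prod(a) ** (1.0 / n)
    arithmetic = sum(a) / n
    assert geometric <= arithmetic + 1e-6
print("AM-GM holds on all 1000 samples")
```

Equality holds exactly when all the ai are equal, which is why the constrained maximum in the derivation occurs at xi^2 = 1/n.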
ai f (x) = λpi xi .
Summing both sides of this equation and using the fact that the ai sum to
one yields:
f (x) = λp · x = λw.
It follows that
⁴Careful: we know that this problem has both a maximum and a minimum. How did we know
that this was the maximum? And where did the minimum go? The answer is that it is plain that
you do not want to set any xi = 0 to get a maximum, and that we have the only critical point
that does not involve xi = 0. It is also clear that setting one or more xi = 0 does solve the
minimization problem.
xi = w ai / pi and λ = (a1 /p1 )^{a1} · · · (an /pn )^{an} .
The previous two examples illustrate that a bit of cleverness allows
neat solutions to the first-order conditions.
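The Cobb-Douglas demands xi = w·ai/pi can be checked against random bundles on the budget line; the parameter values below are hypothetical.

```python
import random

# Sketch: compare utility u(x) = prod x_i^{a_i} at the demand bundle
# against random bundles y scaled onto the budget line p.y = w.

def u(x, a):
    out = 1.0
    for xi, ai in zip(x, a):
        out *= xi ** ai
    return out

a = [0.2, 0.3, 0.5]     # exponents summing to one
p = [1.0, 2.0, 4.0]     # prices
w = 12.0                # income

demand = [w * ai / pi for ai, pi in zip(a, p)]
assert abs(sum(pi * xi for pi, xi in zip(p, demand)) - w) < 1e-9  # spends w

random.seed(1)
for _ in range(500):
    y = [random.uniform(0.01, 1.0) for _ in p]
    scale = w / sum(pi * yi for pi, yi in zip(p, y))   # put y on budget line
    y = [scale * yi for yi in y]
    assert u(y, a) <= u(demand, a) + 1e-9
print("demand bundle beats all sampled budget-line bundles")
```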
Chapter 14

Monotone Comparative Statics

The set of maximizers of an objective function f is unchanged if f is replaced
by φ ◦ f for any strictly increasing φ(·), even though the transformation may
destroy smoothness or concavity properties of the objective function.
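This ordinal-invariance point can be illustrated on a grid; the objective below is a hypothetical example, and φ = exp is strictly increasing even though exp(f) is no longer concave.

```python
import math

# Sketch: replacing f by phi(f) for strictly increasing phi leaves
# the maximizer unchanged (grid search over a hypothetical objective).

grid = [i / 10 for i in range(-50, 51)]

def f(x):
    return -(x - 1.3) ** 2   # concave, maximized at x = 1.3

def phi(t):
    return math.exp(t)       # strictly increasing transformation

argmax_f = max(grid, key=f)
argmax_phi_f = max(grid, key=lambda x: phi(f(x)))

assert argmax_f == argmax_phi_f   # same maximizer either way
print("maximizer:", argmax_f)
```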
Definition 121. For two sets of real numbers A and B, we say that A ≥s B
(“A is greater than or equal to B in the strong set order”) if for any a ∈ A
and b ∈ B, min{a, b} ∈ B and max{a, b} ∈ A.
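For finite sets, Definition 121 can be checked directly; the sets below are hypothetical examples.

```python
# Sketch: test A >=_s B (strong set order) straight from the definition.

def strong_set_order_geq(A, B):
    """True if A >= B in the strong set order."""
    return all(min(a, b) in B and max(a, b) in A for a in A for b in B)

assert strong_set_order_geq({2, 3}, {1, 2})       # A shifted upward from B
assert not strong_set_order_geq({1, 3}, {2})      # min(1, 2) = 1 is not in {2}
print("strong set order checks pass")
```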
Proof. Suppose µ′ > µ, and that x ∈ x∗ (µ) and x′ ∈ x∗ (µ′ ). x ∈ x∗ (µ) im-
plies f (x; µ) − f (min{x, x′ }; µ) ≥ 0. This implies that f (max{x, x′ }; µ) −
f (x′ ; µ) ≥ 0 (you need to check two cases, x > x′ and x′ > x). By
supermodularity, f (max{x, x′ }; µ′ ) − f (x′ ; µ′ ) ≥ 0, which implies that
max{x, x′ } ∈ x∗ (µ′ ).
On the other hand, x′ ∈ x∗ (µ′ ) implies that f (x′ ; µ′ ) − f (max{x, x′ }; µ′ ) ≥
0, or equivalently f (max{x, x′ }; µ′ ) − f (x′ ; µ′ ) ≤ 0. By supermodularity,
f (max{x, x′ }; µ) − f (x′ ; µ) ≤ 0, which implies f (x; µ) −
f (min{x, x′ }; µ) ≤ 0 and so min{x, x′ } ∈ x∗ (µ).
Theorem 79. If f is single-crossing in (x; µ), then x∗ (µ) = arg max_{x∈S(µ)} f (x; µ)
is nondecreasing. Moreover, if x∗ (µ) is nondecreasing in µ for all choice
sets S, then f is single-crossing in (x; µ).
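The monotonicity conclusion is easy to see on a grid. The objective below is a hypothetical example: f(x; µ) = µx − x² has increasing differences in (x, µ), since f(x; µ′) − f(x; µ) = (µ′ − µ)x is increasing in x, so the maximizer should be nondecreasing in µ.

```python
# Sketch: argmax of a supermodular objective rises with the parameter mu.

grid = [i / 100 for i in range(0, 201)]   # choice set S = {0, 0.01, ..., 2}

def f(x, mu):
    return mu * x - x ** 2

def best(mu):
    return max(grid, key=lambda x: f(x, mu))

xs = [best(mu) for mu in (0.0, 0.5, 1.0, 2.0, 3.0)]
assert xs == sorted(xs)                   # argmax nondecreasing in mu
print("argmax path:", xs)
```

On this grid the maximizer is min{µ/2, 2}, which increases with µ, matching the theorem.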