

1 Mathematics Notes

1.1 Functions

Today’s Topics¹: • R1 and Rn • Interval Notation for R1 • Neighborhoods: Intervals, Disks, and Balls • Open/Closed/Compact Sets • Introduction to Functions • Domain and Range/Image • Some General Types of Functions • Log, Ln, and e • Graphing Functions • Solving for Variables • Finding Roots • Summation and Product Notation • Limit of a Function • Continuity
1.1.1 R1 and Rn

• R1 is the set of all real numbers extending from −∞ to +∞ — i.e., the real number line.

• Rn is an n-dimensional space (often referred to as Euclidean space), where each of the n axes
extends from −∞ to +∞.

• Examples:

1. R1 is a line.
2. R2 is a plane.
3. R3 is a 3-D space.
4. R4 could be 3-D plus time.

• Points in Rn are ordered n-tuples, where each element of the n-tuple represents the coordinate
along that dimension.

1.1.2 Interval Notation for R1

• Open interval: (a, b) ≡ {x ∈ R1 : a < x < b}

• Closed interval: [a, b] ≡ {x ∈ R1 : a ≤ x ≤ b}

• Half open, half closed: (a, b] ≡ {x ∈ R1 : a < x ≤ b}

1.1.3 Neighborhoods: Intervals, Disks, and Balls

• In many areas of math, we need a formal construct for what it means to be “near” a point c in Rn. This is generally called the neighborhood of c and is represented by an open interval, disk, or ball, depending on whether Rn is of one, two, or more dimensions, respectively. Given the point c and some ε > 0, these are defined as

1. ε-interval in R1: {x : |x − c| < ε}
The open interval (c − ε, c + ε).
2. ε-disk in R2: {x : ||x − c|| < ε}
The open interior of the circle centered at c with radius ε.
3. ε-ball in Rn: {x : ||x − c|| < ε}
The open interior of the sphere centered at c with radius ε.
¹ Much of the material and examples for this lecture are taken from Simon & Blume (1994) Mathematics for Economists, Boyce & DiPrima (1988) Calculus, and Protter & Morrey (1991) A First Course in Real Analysis.

1.1.4 Sets, Sets, and More Sets

• Interior Point: The point x is an interior point of the set S if x is in S and if there is some ε-ball around x that contains only points in S. The interior of S is the collection of all interior points of S. The interior can also be defined as the union of all open sets contained in S.
Example: The interior of the set {(x, y) : x² + y² ≤ 4} is {(x, y) : x² + y² < 4}.
• Boundary Point: The point x is a boundary point of the set S if every ε-ball around x contains both points that are in S and points that are outside S. The boundary is the collection of all boundary points.
Example: The boundary of {(x, y) : x² + y² ≤ 4} is {(x, y) : x² + y² = 4}.
• Open: A set S is open if for each point x in S, there exists an open ε-ball around x completely contained in S.
Example: {(x, y) : x² + y² < 4}
• Closed: A set S is closed if it contains all of its boundary points.
Example: {(x, y) : x² + y² ≤ 4}
• Note: a set may be neither open nor closed.
Example: {(x, y) : 2 < x² + y² ≤ 4}
• Complement: The complement of set S is everything outside of S.
Example: The complement of {(x, y) : x² + y² ≤ 4} is {(x, y) : x² + y² > 4}.
• Closure: The closure of set S is the smallest closed set that contains S.
Example: The closure of {(x, y) : x² + y² < 4} is {(x, y) : x² + y² ≤ 4}.
• Bounded: A set S is bounded if it can be contained within some ε-ball.
Examples: Bounded: any interval that doesn’t have ∞ or −∞ as endpoints; any disk in a plane with finite radius. Unbounded: the set of integers in R1; any ray.
• Compact: A set is compact if and only if it is both closed and bounded.

1.1.5 Introduction to Functions

• A function (in R1) is a rule or relationship or mapping or transformation that assigns one and only one number in R1 to each number in R1.
• Mapping notation examples
1. Function of one variable: f : R1 → R1
2. Function of two variables: f : R2 → R1
• Examples:
1. f(x) = x + 1
For each x in R1, f(x) assigns the number x + 1.
2. f(x, y) = x² + y²
For each ordered pair (x, y) in R2, f(x, y) assigns the number x² + y².
• Often use one variable x as input and another y as output.
Example: y = x + 1
• The input variable is also called the independent variable; the output variable is also called the dependent variable.

1.1.6 Domain and Range/Image

• Some functions are defined only on proper subsets of Rn .

• Domain: the set of numbers in X at which f (x) is defined.

• Range: elements of Y assigned by f (x) to elements of X, or

f (X) = {y : y = f (x), x ∈ X}

Most often used when talking about a function f : R1 → R1 .

• Image: same as range, but more often used when talking about a function f : Rn → R1 .

• Examples:

1. f(x) = 3/(1 + x²)
Domain X =
Range f(X) =

2. f(x) =  x + 1,  1 ≤ x ≤ 2
           0,      x = 0
           1 − x,  −2 ≤ x ≤ −1
Domain X =
Range f(X) =

3. f(x) = 1/x
Domain X =
Range f(X) =

4. f(x, y) = x² + y²
Domain X, Y =
Image f(X, Y) =

1.1.7 Some General Types of Functions

• Monomials: f(x) = ax^k
a is the coefficient; k is the degree.
Examples: y = x², y = −(1/2)x³

• Polynomials: sums of monomials.
Examples: y = −(1/2)x³ + x², y = 3x + 5
The degree of a polynomial is the highest degree of its monomial terms. Also, it’s often a good idea to write polynomials with terms in decreasing degree.

• Rational Functions: ratios of two polynomials.
Examples: y = 2/x, y = (x² + 1)/(x² − 2x + 1)

• Exponential Functions: Example: y = 2^x

• Trigonometric Functions: Examples: y = cos(x), y = 3 sin(4x)

• Linear: a polynomial of degree 1.
Example: y = mx + b, where m is the slope and b is the y-intercept.

• Nonlinear: anything that isn’t constant or a polynomial of degree 1.
Examples: y = x² + 2x + 1, y = sin(x), y = ln(x), y = e^x

1.1.8 Log, Ln, and e

• Relationship of logarithmic and exponential functions:

y = log_a(x) ⇐⇒ a^y = x

The log function can be thought of as an inverse for exponential functions. a is referred to as the “base” of the logarithm.

• The two most common logarithms are base 10 and base e.

1. Base 10: y = log10(x) ⇐⇒ 10^y = x
The base-10 logarithm is often written simply as “log(x)” with no base denoted.
2. Base e: y = log_e(x) ⇐⇒ e^y = x
The base-e logarithm is referred to as the “natural” logarithm and is written as “ln(x)”.

• log_a(a^x) = x and a^(log_a(x)) = x

• Examples:

1. log(√10) =

2. log(1) =

3. log(10) =

4. log(100) =

5. ln(1) =

6. ln(e) =

• Properties of exponential functions:

1. a^x a^y = a^(x+y)
2. a^(−x) = 1/a^x
3. a^x / a^y = a^(x−y)
4. (a^x)^y = a^(xy)
5. a^0 = 1

• Properties of logarithmic functions (any base):

1. log(xy) = log(x) + log(y)
2. log(1/x) = − log(x)
3. log(x/y) = log(x) − log(y)
4. log(x^y) = y log(x)
5. log(1) = 0

• Use the change of base formula to switch bases as necessary: log_b(x) = log_a(x) / log_a(b)
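• A quick numerical check of these identities, as a sketch in Python (not part of the original notes; math.log(x, b) computes a base-b logarithm):

```python
import math

# Change of base: log_b(x) = log_a(x) / log_a(b); x = 7 and b = 2 are arbitrary.
x, b = 7.0, 2.0
print(math.log(x, b))                 # log base 2 of 7
print(math.log(x) / math.log(b))      # same value via natural logs

# Two of the logarithm properties listed above:
print(math.log(3 * 5), math.log(3) + math.log(5))  # log(xy) = log(x) + log(y)
print(math.log(3 ** 2), 2 * math.log(3))           # log(x^y) = y log(x)
```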

1.1.9 Graphing Functions

• Know your function. How? Graph your function.

• A picture is worth a thousand words.

1. Is the function increasing or decreasing? Over what part of the domain?


2. How “fast” does it increase or decrease?
3. Are there global or local maxima and minima? Where?
4. Are there inflection points?
5. Is the function continuous?
6. Is the function differentiable?
7. Does the function tend to some limit?
8. Other questions related to the substance of the problem at hand.

1.1.10 Solving for Variables

• Sometimes we’re given a function y = f (x) and we want to find how x varies as a function of
y.

• If f is a one-to-one mapping, then it has an inverse.

• Use algebra and the relationships identified above to move x to the LHS of the equation so that the RHS is only a function of y.

• Examples (we want to solve for x):

1. y = 3x + 2 =⇒ y − 2 = 3x =⇒ x = (1/3)(y − 2)
2. y = 3x − 4z + 2 =⇒ y + 4z − 2 = 3x =⇒ x = (1/3)(y + 4z − 2)
3. y = e^x + 4 =⇒ y − 4 = e^x =⇒ ln(y − 4) = ln(e^x) =⇒ x = ln(y − 4)

• Sometimes (often?) the inverse does not exist.

• Example: We’re given the function y = x² (a parabola). Solving for x, we get x = √y and x = −√y — for each value of y, there are two values of x.

1.1.11 Finding Roots

• Solving for variables is especially important when we want to find the roots of an equation:
those values of variables that cause an equation to equal zero.

• Especially important in finding equilibria and in doing maximum likelihood estimation.

• Procedure: Given y = f (x), set y = 0. Solve for x.

• There may be multiple roots.


• For quadratic equations ax² + bx + c = 0, use the quadratic formula:
x = (−b ± √(b² − 4ac)) / (2a)

• Examples:

1. f(x) = 3x + 2
2. f(x) = e^(−x) − 10
3. f(x) = x² + 3x − 4 = 0
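• A sketch of this procedure in Python (the helper name quadratic_roots is ours, not from the notes): set f(x) = 0 and solve, using the quadratic formula for example 3.

```python
import math

def quadratic_roots(a, b, c):
    """Real roots of ax^2 + bx + c = 0 via the quadratic formula."""
    disc = b ** 2 - 4 * a * c
    if disc < 0:
        return []                      # no real roots
    return [(-b + math.sqrt(disc)) / (2 * a),
            (-b - math.sqrt(disc)) / (2 * a)]

print((0 - 2) / 3)                     # example 1: 3x + 2 = 0       ->  x = -2/3
print(-math.log(10))                   # example 2: e^(-x) - 10 = 0  ->  x = -ln(10)
print(quadratic_roots(1, 3, -4))       # example 3: x^2 + 3x - 4 = 0 ->  [1.0, -4.0]
```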

1.1.12 Summation and Product Notation


• Summation: Σ_{i=1}^n x_i = x1 + x2 + x3 + · · · + xn

1. Σ_{i=1}^n c·x_i = c Σ_{i=1}^n x_i
2. Σ_{i=1}^n (x_i + y_i) = Σ_{i=1}^n x_i + Σ_{i=1}^n y_i
3. Σ_{i=1}^n c = nc

• Product: ∏_{i=1}^n x_i = x1 x2 x3 · · · xn

1. ∏_{i=1}^n c·x_i = c^n ∏_{i=1}^n x_i
2. ∏_{i=1}^n (x_i + y_i) = a mess
3. ∏_{i=1}^n c = c^n

• Use logs to go between sum and product notation:

log(∏_{i=1}^n c·x_i) = Σ_{i=1}^n log(c·x_i) = n log(c) + Σ_{i=1}^n log(x_i)
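• A short numerical sketch of these summation, product, and log rules in Python (illustration only; math.prod is standard from Python 3.8):

```python
import math

c = 2.5
x = [1.0, 3.0, 4.0, 0.5]
n = len(x)

# Sum rules: constants factor out, and sums split termwise.
assert math.isclose(sum(c * xi for xi in x), c * sum(x))
# Product rule: prod(c * x_i) = c^n * prod(x_i).
assert math.isclose(math.prod(c * xi for xi in x), c ** n * math.prod(x))
# Logs turn the product into a sum: log(prod c*x_i) = n log(c) + sum log(x_i).
lhs = math.log(math.prod(c * xi for xi in x))
rhs = n * math.log(c) + sum(math.log(xi) for xi in x)
assert math.isclose(lhs, rhs)
print("identities verified")
```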

1.1.13 The Limit of a Function

• We’re often interested in determining if a function f approaches some number L as its independent variable x moves toward some number c (usually 0 or ±∞). If it does, we say that f(x) approaches L as x approaches c, or lim_{x→c} f(x) = L.

• Limit of a function: Let f be defined at each point in some open interval containing the point c, although possibly not defined at c itself. Then lim_{x→c} f(x) = L if for any (small positive) number ε, there exists a corresponding number δ > 0 such that if 0 < |x − c| < δ, then |f(x) − L| < ε.

• Examples:

1. lim_{x→c} k =
2. lim_{x→c} x =
3. lim_{x→0} |x| =
4. lim_{x→0} (1 + 1/x²) =

• Uniqueness: lim_{x→c} f(x) = L and lim_{x→c} f(x) = M =⇒ L = M

• Properties: Let f and g be functions with lim_{x→c} f(x) = A and lim_{x→c} g(x) = B.

1. lim_{x→c} [f(x) + g(x)] = lim_{x→c} f(x) + lim_{x→c} g(x) = A + B
2. lim_{x→c} αf(x) = α lim_{x→c} f(x) = αA
3. lim_{x→c} f(x)g(x) = [lim_{x→c} f(x)][lim_{x→c} g(x)] = AB
4. lim_{x→c} f(x)/g(x) = [lim_{x→c} f(x)] / [lim_{x→c} g(x)] = A/B, provided B ≠ 0

• Examples:

1. lim_{x→2} (2x − 3) =
2. lim_{x→c} x^n =

• Other types of limits:

1. Right-hand limit: lim_{x→c⁺} f(x) = L, if c < x < c + δ =⇒ |f(x) − L| < ε
Example: lim_{x→0⁺} √x = 0
2. Left-hand limit: lim_{x→c⁻} f(x) = L, if c − δ < x < c =⇒ |f(x) − L| < ε
3. Infinity: lim_{x→∞} f(x) = L, if x > N =⇒ |f(x) − L| < ε
4. −Infinity: lim_{x→−∞} f(x) = L, if x < −N =⇒ |f(x) − L| < ε
Example: lim_{x→∞} 1/x = lim_{x→−∞} 1/x = 0

1.1.14 Continuity

• Continuity: Suppose that the domain of the function f includes an open interval containing the point c. Then f is continuous at c if lim_{x→c} f(x) exists and if lim_{x→c} f(x) = f(c). Further, f is continuous on an open interval (a, b) if it is continuous at each point in the interval.

• Examples: Continuous functions:
f(x) = x, f(x) = e^x

• Examples: Discontinuous functions:
f(x) = floor(x), f(x) = 1 + 1/x²

• Properties:

1. If f and g are continuous at point c, then f + g, f − g, f·g, |f|, and αf are continuous. f/g is continuous, provided g(c) ≠ 0.
2. Boundedness: If f is continuous on the closed bounded interval [a, b], then there is a number K such that |f(x)| ≤ K for each x in [a, b].
3. Max/Min: If f is continuous on the closed bounded interval [a, b], then f has a maximum and a minimum on [a, b], possibly at the end points.
4. The image of a closed bounded interval [a, b] under a continuous function f is also a closed bounded interval [m, M].

1.2 Calculus I

Today’s Topics²: • Sequences • Limit of a Sequence • Derivatives • Higher-Order Derivatives • Maxima and Minima • Composite Functions • The Chain Rule • Derivatives of Exp and Ln • L’Hospital’s Rule
1.2.1 Sequences

• A sequence {y_n} = {y1, y2, y3, . . . , y_n} is an ordered set of real numbers, where y1 is the first term in the sequence and y_n is the nth term. Generally, a sequence is infinite, that is, it extends to n = ∞. We can also write the sequence as {y_n}_{n=1}^∞.

• Examples:

1. {y_n} = {2 − 1/n²} = {1, 7/4, 17/9, 31/16, . . .}
2. {y_n} = {(n² + 1)/n} = {2, 5/2, 10/3, . . .}
3. {y_n} = {(−1)^n (1 − 1/n)} = {0, 1/2, −2/3, 3/4, . . .}

• Think of sequences like functions. Before, we had y = f (x) with x specified over some domain.
Now we have {yn } = {f (n)} with n = 1, 2, 3, . . ..
• Three kinds of sequences:
1. Sequences like 1 above that converge to a limit.
2. Sequences like 2 above that increase without bound.
3. Sequences like 3 above that neither converge nor increase without bound — alternating
over the number line.
• Boundedness and monotonicity:
1. Bounded: if |yn | ≤ K for all n
2. Monotone Increasing: yn+1 > yn for all n
3. Monotone Decreasing: yn+1 < yn for all n
• Subsequence: choose an infinite collection of entries from {yn }, retaining their order.
² Much of the material and examples for this lecture are taken from Simon & Blume (1994) Mathematics for Economists and from Boyce & DiPrima (1988) Calculus.

1.2.2 The Limit of a Sequence

• We’re often interested in whether a sequence converges to a limit. Limits of sequences are
conceptually similar to the limits of functions addressed in the previous lecture.

• Definition (Limit of a sequence): The sequence {y_n} has the limit L, that is lim_{n→∞} y_n = L, if for any ε > 0 there is an integer N (which depends on ε) with the property that |y_n − L| < ε for each n > N. {y_n} is said to converge to L. If the above does not hold, then {y_n} diverges.

• Examples:

1. lim_{n→∞} (2 − 1/n²) = 2
2. lim_{n→∞} (4^n / n!) = 0

• Uniqueness: If {yn } converges, then the limit L is unique.

• Properties: Let lim_{n→∞} y_n = A and lim_{n→∞} z_n = B. Then

1. lim_{n→∞} [αy_n + βz_n] = αA + βB
2. lim_{n→∞} y_n z_n = AB
3. lim_{n→∞} y_n / z_n = A/B, provided B ≠ 0

• Finding the limit of a sequence in Rn is similar to that in R1 .

• Limit of a sequence of vectors: The sequence of vectors {y_n} has the limit L, that is lim_{n→∞} y_n = L, if for any ε > 0 there is an integer N where ||y_n − L|| < ε for each n > N. The sequence of vectors {y_n} is said to converge to the vector L — and the distances between y_n and L converge to zero.

• Think of each coordinate of the vector y_n as being part of its own sequence over n. Then a sequence of vectors in Rn converges if and only if all n sequences of its components converge. Examples:

1. The sequence {y_n} where y_n = (1/n, 2 − 1/n²) converges to (0, 2).
2. The sequence {y_n} where y_n = (1/n, (−1)^n) does not converge, since {(−1)^n} does not converge.

• Bolzano-Weierstrass Theorem: Any sequence contained in a compact (i.e., closed and bounded) subset of Rn contains a convergent subsequence.

Example: Take the sequence {y_n} = {(−1)^n}, which has two accumulation points but no limit. Its values, however, lie in the compact set {−1, 1}. The subsequence of {y_n} defined by taking n = 1, 3, 5, . . . does have a limit: −1. As does the subsequence defined by taking n = 2, 4, 6, . . ., whose limit is 1.

1.2.3 Series

• The sum of the terms of a sequence is a series. As there are both finite and infinite sequences,
there are finite and infinite series.
• The series associated with the sequence {y_n} = {y1, y2, y3, . . .} = {y_n}_{n=1}^∞ is Σ_{n=1}^∞ y_n. The nth partial sum S_n is defined as S_n = Σ_{k=1}^n y_k, the sum of the first n terms of the sequence.

• A series Σ y_n converges if the sequence of partial sums {S1, S2, S3, . . .} converges, that is, has a finite limit.

• A geometric series is a series that can be written as Σ_{n=0}^∞ r^n, where r is called the ratio. A geometric series converges to 1/(1 − r) if |r| < 1 and diverges otherwise. For example, Σ_{n=0}^∞ 1/2^n = 2.

• Examples of other series:

1. Σ_{n=0}^∞ 1/n! = 1 + 1/1! + 1/2! + 1/3! + · · · = e
2. Σ_{n=1}^∞ 1/n = 1 + 1/2 + 1/3 + · · · = ∞ (harmonic series)
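• A sketch of partial sums in Python (our own helper, for illustration): the geometric series with r = 1/2 approaches 2, the 1/n! series approaches e, and the harmonic partial sums keep growing without bound.

```python
import math

def partial_sum(term, n):
    """S_n: the sum of term(k) for k = 0, ..., n-1."""
    return sum(term(k) for k in range(n))

print(partial_sum(lambda k: 0.5 ** k, 60))               # -> 2.0 (geometric)
print(partial_sum(lambda k: 1 / math.factorial(k), 20))  # -> e
print(math.e)
for n in (10, 100, 10000):                               # harmonic: no finite limit
    print(n, sum(1 / k for k in range(1, n + 1)))
```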

1.2.4 Derivatives

• The derivative of f at x is its rate of change at x — i.e., how much f(x) changes with a change in x.
– For a line, the derivative is the slope.
– For a curve, the derivative is the slope of the tangent line at x.

• Derivative: Let f be a function whose domain includes an open interval containing the point x. The derivative of f at x is given by

f′(x) = lim_{h→0} [f(x + h) − f(x)] / [(x + h) − x] = lim_{h→0} [f(x + h) − f(x)] / h

• If f′(x) exists at a point x, then f is said to be differentiable at x. Similarly, if f′(x) exists for every point along an interval, then f is differentiable along that interval. For f to be differentiable at x, f must be both continuous and “smooth” at x. The process of calculating f′(x) is called differentiation.

• Notation for derivatives:

1. y′, f′(x) (Prime or Lagrange notation)
2. Dy, Df(x) (Operator notation)
3. dy/dx, df(x)/dx (Leibniz notation)

• Examples:

1. f(x) = c
2. f(x) = x
3. f(x) = x²
4. f(x) = x³

• Properties of derivatives: Suppose that f and g are differentiable at x and that α is a constant. Then the functions f ± g, αf, f·g, and f/g (provided g(x) ≠ 0) are also differentiable at x. Additionally,
Power rule: [x^k]′ = k·x^(k−1)
Sum rule: [f(x) ± g(x)]′ = f′(x) ± g′(x)
Constant rule: [αf(x)]′ = αf′(x)
Product rule: [f(x)g(x)]′ = f′(x)g(x) + f(x)g′(x)
Quotient rule: [f(x)/g(x)]′ = [f′(x)g(x) − f(x)g′(x)] / [g(x)]², g(x) ≠ 0

• Examples:

1. f(x) = 3x² + 2x^(1/3)
2. f(x) = (x³)(2x⁴)
3. f(x) = (x² + 1)/(x² − 1)

1.2.5 Higher-Order Derivatives

• We can keep applying the differentiation process to functions that are themselves derivatives. The derivative of f′(x) with respect to x would then be

f″(x) = lim_{h→0} [f′(x + h) − f′(x)] / h

and so on. Similarly, the derivative of f″(x) would be denoted f‴(x).

• First derivative: f′(x), y′, df(x)/dx, dy/dx
Second derivative: f″(x), y″, d²f(x)/dx², d²y/dx²
nth derivative: dⁿf(x)/dxⁿ, dⁿy/dxⁿ

• Example: f(x) = x³, f′(x) = 3x², f″(x) = 6x, f‴(x) = 6, f⁗(x) = 0
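• A numerical sketch of the limit definition (illustration only, plain Python): the difference quotient [f(x + h) − f(x)]/h approaches f′(x) as h shrinks. For f(x) = x³ at x = 2, it approaches 3·2² = 12.

```python
def diff_quotient(f, x, h):
    """(f(x+h) - f(x)) / h, whose h -> 0 limit is f'(x)."""
    return (f(x + h) - f(x)) / h

f = lambda x: x ** 3
for h in (1e-1, 1e-3, 1e-6):
    print(h, diff_quotient(f, 2.0, h))   # approaches 12 as h shrinks
```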

1.2.6 Applications of the Derivative: Maxima and Minima

• The first derivative f′(x) identifies whether the function f(x) at the point x is
1. Increasing: f′(x) > 0
2. Decreasing: f′(x) < 0
3. Extremum/Saddle: f′(x) = 0
• Examples:

1. f(x) = x² + 2, f′(x) = 2x
2. f(x) = x³ + 2, f′(x) = 3x²

• The second derivative f″(x) identifies whether the function f(x) at the point x is
1. Concave down: f″(x) < 0
2. Concave up: f″(x) > 0
• Maximum (Minimum): x0 is a local maximum (minimum) if f(x0) > f(x) (f(x0) < f(x)) for all x within some open interval containing x0. x0 is a global maximum (minimum) if f(x0) > f(x) (f(x0) < f(x)) for all x in the domain of f.

• Critical points: Given the function f defined over domain D, all of the following are critical points:
1. Any interior point of D where f′(x) = 0.
2. Any interior point of D where f′(x) does not exist.
3. Any endpoint that is in D.
The maxima and minima will be a subset of the critical points.
• Combined, the first and second derivatives can tell us whether a point is a maximum or minimum of f(x).
Local maximum: f′(x) = 0 and f″(x) < 0
Local minimum: f′(x) = 0 and f″(x) > 0
Need more info: f′(x) = 0 and f″(x) = 0
• Global maxima and minima: Sometimes no global max or min exists — e.g., f(x) is not bounded above or below. However, there are three situations where we can fairly easily identify a global max or min.
1. Functions with only one critical point. If x0 is a local maximum of f and it is the only critical point, then it is a global maximum.
2. Globally concave up or concave down functions. If f″ is never zero, then there is at most one critical point, which is a global maximum if f″ < 0 and a global minimum if f″ > 0.
3. Continuous functions over closed and bounded intervals must have both a global maximum and a global minimum.
• Examples:
1. f(x) = x² + 2
f′(x) = 2x
f″(x) = 2

2. f(x) = x³ + 2
f′(x) = 3x²
f″(x) = 6x

3. f(x) = |x² − 1|, x ∈ [−2, 2]
f′(x) = 2x for −2 < x < −1 and 1 < x < 2; f′(x) = −2x for −1 < x < 1
f″(x) = 2 for −2 < x < −1 and 1 < x < 2; f″(x) = −2 for −1 < x < 1

1.2.7 Composite Functions and the Chain Rule

• Composite functions are formed by substituting one function into another and are denoted
by
(f ◦ g)(x) = f [g(x)]
To form f [g(x)], the range of g must be contained (at least in part) within the domain of
f . The domain of f ◦ g consists of all the points in the domain of g for which g(x) is in the
domain of f .
• Examples:
1. f(x) = ln x, g(x) = x²
(f ◦ g)(x) = ln x², (g ◦ f)(x) = [ln x]²
Notice that f ◦ g and g ◦ f are not the same functions.
2. f(x) = 4 + sin x, g(x) = √(1 − x²)
(f ◦ g)(x) = 4 + sin √(1 − x²)
(g ◦ f)(x) does not exist, since the range of f, [3, 5], has no points in common with the domain of g.
• Chain Rule: Let y = f(z) and z = g(x). Then y = (f ◦ g)(x) = f[g(x)] and the derivative of y with respect to x is

(d/dx){f[g(x)]} = f′[g(x)] g′(x)

which can also be written as

dy/dx = (dy/dz)(dz/dx)

(Note: the above does not imply that the dz’s cancel out, as in fractions. They are part of the derivative notation and have no separate existence.) The chain rule can be thought of as the derivative of the “outside” times the derivative of the “inside,” remembering that the derivative of the outside function is evaluated at the value of the inside function.
• Generalized Power Rule: If y = [g(x)]^k, then dy/dx = k[g(x)]^(k−1) g′(x).
• Examples:
1. Find dy/dx for y = (3x² + 5x − 7)⁶. Let f(z) = z⁶ and z = g(x) = 3x² + 5x − 7. Then y = f[g(x)] and
dy/dx =
=
=
2. Find dy/dx for y = sin(x³ + 4x). (Note: the derivative of sin x is cos x.) Let f(z) = sin z and z = g(x) = x³ + 4x. Then y = f[g(x)] and
dy/dx =
=
=

1.2.8 Derivatives of Exp and Ln

• Derivatives of Exp:
1. (d/dx) αe^x = αe^x
2. (dⁿ/dxⁿ) αe^x = αe^x
3. (d/dx) e^(u(x)) = e^(u(x)) u′(x)

• Examples: Find dy/dx for

1. y = e^(−3x)
2. y = e^(x²)
3. y = e^(sin 2x)

• Derivatives of Ln:
1. (d/dx) ln x = 1/x
2. (d/dx) ln x^k = (d/dx) k ln x = k/x
3. (d/dx) ln u(x) = u′(x)/u(x) (by the chain rule)

• Examples: Find dy/dx for

1. y = ln(x² + 9)
2. y = ln(ln x)
3. y = (ln x)²
4. y = ln e^x

• For any positive base b, (d/dx) b^x = (ln b)(b^x).

1.2.9 L’Hospital’s Rule


   
• In studying limits, we saw that lim_{x→c} f(x)/g(x) = [lim_{x→c} f(x)] / [lim_{x→c} g(x)], provided that lim_{x→c} g(x) ≠ 0; if the denominator’s limit were zero while the numerator’s were not, the quotient would be unbounded.

• If both lim_{x→c} f(x) = 0 and lim_{x→c} g(x) = 0, then we get an indeterminate form of the type 0/0 as x → c. However, we can still analyze such limits using L’Hospital’s rule.

• L’Hospital’s Rule: Suppose f and g are differentiable on a < x < b and that either

1. lim_{x→a⁺} f(x) = 0 and lim_{x→a⁺} g(x) = 0, or
2. lim_{x→a⁺} f(x) = ±∞ and lim_{x→a⁺} g(x) = ±∞

Suppose further that g′(x) is never zero on a < x < b and that

lim_{x→a⁺} f′(x)/g′(x) = L

then

lim_{x→a⁺} f(x)/g(x) = L

• Examples: Use L’Hospital’s rule to find the following limits:

1. lim_{x→0⁺} ln(1 + x²) / x³
2. lim_{x→0⁺} e^(1/x) / (1/x)
3. lim_{x→2} (x − 2) / [(x + 6)^(1/3) − 2]

1.3 Calculus II: An Integral Topic

Today’s Topics³: • Partial Derivatives • The Indefinite Integral: The Antiderivative • The Definite Integral: The Area under the Curve • Integration by Substitution • Integration by Parts
1.3.1 Differentiation in Several Variables

• Suppose we have a function f of two (or more) variables and we want to determine the rate of change relative to one of the variables. To do so, we would find its partial derivative, which is defined similarly to the derivative of a function of one variable.

• Partial Derivative: Let f be a function of the variables (x1, . . . , xn). The partial derivative of f with respect to x_i is

∂f/∂x_i (x1, . . . , xn) = lim_{h→0} [f(x1, . . . , x_i + h, . . . , xn) − f(x1, . . . , x_i, . . . , xn)] / h

Only the ith variable changes — the others are treated as constants.

• We can take higher-order partial derivatives, just as we did with functions of a single variable, except that now the higher-order partials can be taken with respect to multiple variables.

• Examples:

1. f(x, y) = x² + y²
∂f/∂x (x, y) =
∂f/∂y (x, y) =
∂²f/∂x² (x, y) =
∂²f/∂x∂y (x, y) =
2. f(x, y) = x³y⁴ + e^x − ln y
∂f/∂x (x, y) =
∂f/∂y (x, y) =
∂²f/∂x² (x, y) =
∂²f/∂x∂y (x, y) =

1.3.2 The Indefinite Integral: The Antiderivative

• So far, we’ve been interested in finding the derivative g = f′ of a function f. However, sometimes we’re interested in exactly the reverse: finding the function f for which g is its derivative. We refer to f as the antiderivative of g.

• Let DF be the derivative of F, and let DF(x) be the derivative of F evaluated at x. Then the antiderivative is denoted by D⁻¹ (i.e., the inverse derivative). If DF = f, then F = D⁻¹f.

• Indefinite Integral: Equivalently, if F is the antiderivative of f, then F is also called the indefinite integral of f and written F(x) = ∫f(x)dx.

• Examples:
³ Much of the material and examples for this lecture are taken from Simon & Blume (1994) Mathematics for Economists and from Boyce & DiPrima (1988) Calculus.

1. ∫(1/x²)dx =

2. ∫3e^(3x)dx =

3. ∫(x² − 4)dx =

• Notice from these examples that while there is only a single derivative for any function, there
are multiple antiderivatives: one for any arbitrary constant c. c just shifts the curve up or
down on the y-axis. If more info is present about the antiderivative — e.g., that it passes
through a particular point — then we can solve for a specific value of c.
• Common rules of integration:
1. ∫a·f(x)dx = a∫f(x)dx
2. ∫[f(x) + g(x)]dx = ∫f(x)dx + ∫g(x)dx
3. ∫x^n dx = (1/(n+1))x^(n+1) + c
4. ∫e^x dx = e^x + c
5. ∫(1/x)dx = ln x + c
6. ∫e^(f(x)) f′(x)dx = e^(f(x)) + c
7. ∫[f(x)]^n f′(x)dx = (1/(n+1))[f(x)]^(n+1) + c
8. ∫[f′(x)/f(x)]dx = ln f(x) + c
• Examples:
1. ∫3x²dx = 3∫x²dx =
2. ∫(2x + 1)dx =
3. ∫e^x e^(e^x) dx =

1.3.3 The Definite Integral: The Area under the Curve

• Riemann Sum: Suppose we want to determine the area A(R) of a region R defined by a curve f(x) and some interval a ≤ x ≤ b. One way to calculate the area would be to divide the interval a ≤ x ≤ b into n subintervals of length ∆x and then approximate the region with a series of rectangles, where the base of each rectangle is ∆x and the height is f(x) at the midpoint of that interval. A(R) would then be approximated by the area of the union of the rectangles, which is given by

S(f, ∆x) = Σ_{i=1}^n f(x_i)∆x

and is called a Riemann sum.
• As we decrease the size of the subintervals ∆x, making the rectangles “thinner,” we would expect our approximation of the area of the region to become closer to the true area. This gives the limiting process

A(R) = lim_{∆x→0} Σ_{i=1}^n f(x_i)∆x

• Riemann Integral: If for a given function f the Riemann sum approaches a limit as ∆x → 0, then that limit is called the Riemann integral of f from a to b. Formally,

∫_a^b f(x)dx = lim_{∆x→0} Σ_{i=1}^n f(x_i)∆x

• Definite Integral: We use the notation ∫_a^b f(x)dx to denote the definite integral of f from a to b. In words, the definite integral ∫_a^b f(x)dx is the area under the “curve” f(x) from x = a to x = b.

• First Fundamental Theorem of Calculus: Let the function f be bounded on [a, b] and continuous on (a, b). Then the function

F(x) = ∫_a^x f(s)ds, a ≤ x ≤ b

has a derivative at each point in (a, b) and

F′(x) = f(x), a < x < b

This last point shows that differentiation is the inverse of integration.

• Second Fundamental Theorem of Calculus: Let the function f be bounded on [a, b] and continuous on (a, b). Let F be any function that is continuous on [a, b] such that F′(x) = f(x) on (a, b). Then

∫_a^b f(x)dx = F(b) − F(a)

• Procedure to calculate a “simple” definite integral ∫_a^b f(x)dx:

1. Find the indefinite integral F(x).
2. Evaluate F(b) − F(a).

• Examples:
1. ∫_1^3 3x²dx =
2. ∫_{−2}^2 e^x e^(e^x) dx =
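• A sketch of the Riemann-sum idea in Python (midpoint rectangles; the helper name is ours): as n grows, the sum approaches the exact value of example 1, which by the second fundamental theorem is F(3) − F(1) = 3³ − 1³ = 26.

```python
def riemann_sum(f, a, b, n):
    """Approximate the integral of f over [a, b] with n midpoint rectangles."""
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) * dx for i in range(n))

f = lambda x: 3 * x ** 2
for n in (10, 100, 10000):
    print(n, riemann_sum(f, 1.0, 3.0, n))   # -> 26 as n grows
```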

• Properties of Definite Integrals:

1. ∫_a^a f(x)dx = 0 (there is no area below a point)
2. ∫_a^b f(x)dx = −∫_b^a f(x)dx (reversing the limits changes the sign of the integral)
3. ∫_a^b [αf(x) + βg(x)]dx = α∫_a^b f(x)dx + β∫_a^b g(x)dx
4. ∫_a^b f(x)dx + ∫_b^c f(x)dx = ∫_a^c f(x)dx

• Examples:
1. ∫_1^1 3x²dx =
2. ∫_0^4 (2x + 1)dx =
3. ∫_{−2}^0 e^x e^(e^x) dx + ∫_0^2 e^x e^(e^x) dx =

1.3.4 Integration by Substitution

• Sometimes the integrand doesn’t appear integrable using common rules and antiderivatives. A method one might try is integration by substitution, which is related to the Chain Rule.

• Suppose we want to find the indefinite integral ∫g(x)dx and assume we can identify a function u(x) such that g(x) = f[u(x)]u′(x). Let’s refer to the antiderivative of f as F. Then the chain rule tells us that (d/dx)F[u(x)] = f[u(x)]u′(x). So F[u(x)] is the antiderivative of g. We can then write

∫g(x)dx = ∫f[u(x)]u′(x)dx = ∫(d/dx)F[u(x)]dx = F[u(x)] + c

• Procedure to determine the indefinite integral ∫g(x)dx by the method of substitution:

1. Identify some part of g(x) that might be simplified by substituting in a single variable u (which will then be a function of x).
2. Determine if g(x)dx can be reformulated in terms of u and du.
3. Solve the indefinite integral.
4. Substitute back in for x.

• Substitution can also be used to calculate a definite integral. Using the same procedure as above,

∫_a^b g(x)dx = ∫_c^d f(u)du = F(d) − F(c)

where c = u(a) and d = u(b).

• Examples:
1. ∫x²√(x + 1)dx
The problem here is the √(x + 1) term. However, if the integrand had √x times some polynomial, then we’d be in business. Let’s try u = x + 1. Then x = u − 1 and dx = du. Substituting these into the above equation, we get

∫x²√(x + 1)dx = ∫(u − 1)²√u du = ∫(u² − 2u + 1)u^(1/2) du = ∫(u^(5/2) − 2u^(3/2) + u^(1/2))du

We can easily integrate this, since it’s just a polynomial. Doing so and substituting u = x + 1 back in, we get

∫x²√(x + 1)dx = 2(x + 1)^(3/2) [(1/7)(x + 1)² − (2/5)(x + 1) + 1/3] + c

2. For the above problem, we could have also used the substitution u = √(x + 1). Then x = u² − 1 and dx = 2u du. Substituting these in, we get

∫x²√(x + 1)dx = ∫(u² − 1)² u · 2u du

which when expanded is again a polynomial and gives the same result as above.
3. ∫_0^1 [5e^(2x) / (1 + e^(2x))^(1/3)] dx
When an expression is raised to a power, it’s often helpful to use this expression as the basis for a substitution. So, let u = 1 + e^(2x). Then du = 2e^(2x)dx and we can set 5e^(2x)dx = (5/2)du. Additionally, u = 2 when x = 0 and u = 1 + e² when x = 1. Substituting all of this in, we get

∫_0^1 [5e^(2x) / (1 + e^(2x))^(1/3)] dx = (5/2) ∫_2^(1+e²) du/u^(1/3)
= (5/2) ∫_2^(1+e²) u^(−1/3) du
= (15/4) u^(2/3) |_2^(1+e²)
≈ 9.53
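• As a sanity check, a midpoint Riemann sum (same hypothetical helper as earlier, not part of the notes) reproduces this value numerically, alongside the closed form from the substitution:

```python
import math

def riemann_sum(f, a, b, n):
    """Approximate the integral of f over [a, b] with n midpoint rectangles."""
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) * dx for i in range(n))

g = lambda x: 5 * math.exp(2 * x) / (1 + math.exp(2 * x)) ** (1 / 3)
print(riemann_sum(g, 0.0, 1.0, 100000))                        # ~9.53
print(15 / 4 * ((1 + math.e ** 2) ** (2 / 3) - 2 ** (2 / 3)))  # (15/4)u^(2/3), u from 2 to 1+e^2
```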

1.3.5 Integration by Parts

• Another useful integration technique is integration by parts, which is related to the Product Rule of differentiation. The product rule states that

(d/dx)(uv) = u(dv/dx) + v(du/dx)

Integrating this and rearranging, we get

∫u(dv/dx)dx = uv − ∫v(du/dx)dx

or

∫u(x)v′(x)dx = u(x)v(x) − ∫v(x)u′(x)dx

More frequently remembered as

∫u dv = uv − ∫v du

where du = u′(x)dx and dv = v′(x)dx.

• For definite integrals: ∫_a^b u(dv/dx)dx = uv|_a^b − ∫_a^b v(du/dx)dx

• Our goal here is to find expressions for u and dv that, when substituted into the above equation, yield an expression that’s more easily evaluated.

• Examples:

1. ∫x·e^(ax)dx
Let u = x and dv = e^(ax)dx. Then du = dx and v = (1/a)e^(ax). Substituting this into the integration-by-parts formula, we obtain

∫x·e^(ax)dx = uv − ∫v du
= x·(1/a)e^(ax) − ∫(1/a)e^(ax)dx
= (1/a)x·e^(ax) − (1/a²)e^(ax) + c

2. ∫x^n e^(ax)dx

3. ∫x³e^(−x²)dx

1.4 Probability I: Probability Theory

Today’s Topics⁴: • Counting rules • Sets • Probability • Conditional Probability and Bayes’ Rule • Independence
1.4.1 Counting rules

• Fundamental Theorem of Counting: If there are k characteristics, each with n_i alternatives, there are ∏_{i=1}^k n_i possible outcomes.

• We often need to count the number of ways to choose a subset from some set of possibilities. The number of outcomes depends on two characteristics of the process: does the order matter, and is replacement allowed?

• If there are n objects and we select k < n of them, how many different outcomes are possible?

1. Ordered, with replacement: n^k
2. Ordered, without replacement: n!/(n − k)!
3. Unordered, with replacement: (n + k − 1 choose k) = (n + k − 1)! / ((n − 1)!k!)
4. Unordered, without replacement: (n choose k) = n!/((n − k)!k!)

1.4.2 Sets

• Set: A set is any well defined collection of elements. If x is an element of S, x ∈ S.

• Types of sets:

1. Countably finite: a set with a finite number of elements, which can be mapped onto the positive integers.
S = {1, 2, 3, 4, 5, 6}
2. Countably infinite: a set with an infinite number of elements, which can still be mapped onto the positive integers.
S = {1, 1/2, 1/3, . . .}
3. Uncountably infinite: a set with an infinite number of elements, which cannot be mapped onto the positive integers.
S = {x : x ∈ [0, 1]}
4. Empty: a set with no elements.
S = {∅}

• Set operations:

1. Union: The union of two sets A and B, A ∪ B, is the set containing all of the elements
in A or B.
⁴ Much of the material and examples for this lecture are taken from Gill (2006) Essential Mathematics for Political and Social Research, Wackerly, Mendenhall, & Scheaffer (1996) Mathematical Statistics with Applications, DeGroot (1985) Probability and Statistics, Morrow (1994) Game Theory for Political Scientists, King (1989) Unifying Political Methodology, and Ross (1987) Introduction to Probability and Statistics for Scientists and Engineers.

2. Intersection: The intersection of sets A and B, A ∩ B, is the set containing all of the elements in both A and B.
3. Complement: If set A is a subset of S, then the complement of A, denoted Aᶜ, is the set containing all of the elements in S that are not in A.

• Properties of set operations:

1. Commutative: A ∪ B = B ∪ A; A ∩ B = B ∩ A
2. Associative: A ∪ (B ∪ C) = (A ∪ B) ∪ C; A ∩ (B ∩ C) = (A ∩ B) ∩ C
3. Distributive: A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C); A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)
4. De Morgan’s laws: (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ; (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ

• Disjointness: Sets are disjoint when they do not intersect, such that A ∩ B = {∅}. A collection of sets is pairwise disjoint if, for all i ≠ j, A_i ∩ A_j = {∅}. A collection of sets forms a partition of set S if they are pairwise disjoint and they cover set S, such that ∪_{i=1}^k A_i = S.

1.4.3 Probability

• Probability: Many events or outcomes are random. In everyday speech, we say that we are
uncertain about the outcome of random events. Probability is a formal model of uncertainty
which provides a measure of uncertainty governed by a particular set of rules. A different
model of uncertainty would, of course, have a different set of rules and measures. Our focus on
probability is justified because it has proven to be a particularly useful model of uncertainty.

• Sample Space: A set or collection of all possible outcomes from some process. Outcomes in the set can be discrete elements (countable) or points along a continuous interval (uncountable).

• Examples:

1. Discrete: the numbers on a die, the number of possible wars that could occur each year,
whether a vote cast is republican or democrat.
2. Continuous: GNP, arms spending, age.

• Probability Distribution: A probability function on a sample space S is a mapping Pr(A) from events in S to the real numbers that satisfies the following three axioms (due to Kolmogorov).

• Axioms of Probability: Define the number Pr(A) corresponding to each event A in the sample space S such that

1. Axiom: For any event A, Pr(A) ≥ 0.
2. Axiom: Pr(S) = 1.
3. Axiom: For any sequence of disjoint events A1, A2, . . . (of which there may be infinitely many),
Pr(∪_{i=1}^k A_i) = Σ_{i=1}^k Pr(A_i)

• Basic Theorems of Probability: Using these three axioms, we can derive all of the common theorems of probability.

1. Pr(∅) = 0
2. Pr(Aᶜ) = 1 − Pr(A)
3. For any event A, 0 ≤ Pr(A) ≤ 1.
4. If A ⊂ B, then Pr(A) ≤ Pr(B).
5. For any two events A and B, Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B).
6. For any sequence of n events (which need not be disjoint) A1, A2, . . . , An,
Pr(∪_{i=1}^n A_i) ≤ Σ_{i=1}^n Pr(A_i)

• Examples: Let’s assume we have an evenly-balanced, six-sided die. Then,

1. Sample space S =
2. Pr(1) = · · · = Pr(6) =
3. Pr(∅) = Pr(7) =
4. Pr({1, 3, 5}) =
5. Pr({1, 2}ᶜ) = Pr({3, 4, 5, 6}) =
6. Let B = S and A = {1, 2, 3, 4, 5} ⊂ B. Then Pr(A) = < Pr(B) = .
7. Let A = {1, 2, 3} and B = {2, 4, 6}. Then A ∪ B = {1, 2, 3, 4, 6}, A ∩ B = {2}, and

Pr(A ∪ B) =
=
=

1.4.4 Conditional Probability and Bayes’ Rule

• Conditional Probability: The conditional probability Pr(A|B) of an event A is the probability of A, given that another event B has occurred. It is calculated as

Pr(A|B) = Pr(A ∩ B) / Pr(B)

• Example: Assume A and B occur with the following frequencies:

        A        Aᶜ
B       n_AB     n_AᶜB
Bᶜ      n_ABᶜ    n_AᶜBᶜ

and let n_AB + n_AᶜB + n_ABᶜ + n_AᶜBᶜ = N. Then

1. Pr(A) ≈
2. Pr(B) ≈
3. Pr(A ∩ B) ≈
4. Pr(A|B) ≈
5. Pr(B|A) ≈

• Example: A six-sided die is rolled. What is the probability of a 1, given the outcome is an odd
number?

• Multiplicative Law of Probability: The probability of the intersection of two events A and B is

Pr(A ∩ B) = Pr(A) Pr(B|A) = Pr(B) Pr(A|B)

which follows directly from the definition of conditional probability.
• Calculating the Probability of an Event Using the Event-Composition Method: The event-composition method for calculating the probability of an event A involves expressing A as a composition involving the unions and/or intersections of other events, then using the laws of probability to find Pr(A). The steps used in the event-composition method are:
1. Define the experiment.
2. Identify the general nature of the sample points.
3. Write an equation expressing the event of interest A as a composition of two or more events, using unions, intersections, and/or complements.
4. Apply the additive and multiplicative laws of probability to the compositions obtained in step 3 to find Pr(A).
• Law of Total Probability: Let S be the sample space of some experiment and let the k disjoint events B1, . . . , Bk partition S. If A is some other event in S, then the events AB1, AB2, . . . , ABk will form a partition of A and we can write A as

A = (AB1) ∪ · · · ∪ (ABk)

Since the k events are disjoint,

Pr(A) = Σ_{i=1}^k Pr(AB_i) = Σ_{i=1}^k Pr(B_i) Pr(A|B_i)

Sometimes it is easier to calculate the conditional probabilities and sum them than it is to calculate Pr(A) directly.
• Bayes’ Rule: Assume that events B1, . . . , Bk form a partition of the space S. Then

Pr(B_j|A) = Pr(AB_j) / Pr(A) = Pr(B_j) Pr(A|B_j) / Σ_{i=1}^k Pr(B_i) Pr(A|B_i)

If there are only two states of B, then this is just

Pr(B1|A) = Pr(B1) Pr(A|B1) / [Pr(B1) Pr(A|B1) + Pr(B2) Pr(A|B2)]

• Bayes rule determines the posterior probability of a state or type Pr(Bj |A) by calculating the
probability Pr(ABj ) that both the event A and the state Bj will occur and dividing it by the
probability that the event will occur regardless of the state (by summing across all Bi ).
• Often Bayes’ rule is used when one wants to calculate a posterior probability about the “state”
or type of an object, given that some event has occurred. The states could be something like
Normal/Defective, Normal/Diseased, Democrat/Republican, etc. The event on which one
conditions could be something like a sampling from a batch of components, a test for a
disease, or a question about a policy position.

• Prior and Posterior Probabilities: In the above, Pr(B1 ) is often called the prior proba-
bility, since it’s the probability of B1 before anything else is known. Pr(B1 |A) is called the
posterior probability, since it’s the probability after other information is taken into account.

• Examples:

1. A test for cancer correctly detects it 90% of the time, but incorrectly identifies a person
as having cancer 10% of the time. If 10% of all people have cancer at any given time,
what is the probability that a person who tests positive actually has cancer?

2. In Boston, 30% of the people are conservatives, 50% are liberals, and 20% are indepen-
dents. In the last election, 65% of conservatives, 82% of liberals, and 50% of independents
voted. If a person in Boston is selected at random and we learn that s/he did not vote
last election, what is the probability s/he is a liberal?
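• A sketch of both examples in Python, applying the general form of Bayes’ rule (the function name is ours; the numbers come from the examples above). For the second example, the conditioning event is “did not vote,” so each Pr(A|B_i) is one minus the turnout rate.

```python
def posterior(priors, likelihoods):
    """Bayes' rule: Pr(B_j | A) for each state j, from priors Pr(B_j)
    and likelihoods Pr(A | B_j)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)                 # Pr(A), by the law of total probability
    return [j / total for j in joint]

# Example 1: states = (cancer, no cancer); A = tests positive.
print(posterior([0.1, 0.9], [0.9, 0.1]))   # first entry is Pr(cancer | positive)

# Example 2: states = (conservative, liberal, independent); A = did not vote.
print(posterior([0.3, 0.5, 0.2], [1 - 0.65, 1 - 0.82, 1 - 0.5]))
# second entry is Pr(liberal | did not vote)
```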

1.4.5 Independence

• Independence: If the occurrence or nonoccurrence of either of the events A and B has no effect on the occurrence or nonoccurrence of the other, then A and B are independent. If A and B are independent, then

1. Pr(A|B) = Pr(A)
2. Pr(B|A) = Pr(B)
3. Pr(A ∩ B) = Pr(A) Pr(B)

• Pairwise independence: A set of more than two events A1, A2, . . . , Ak is pairwise independent if Pr(A_i ∩ A_j) = Pr(A_i) Pr(A_j) for all i ≠ j. Note that this does not necessarily imply that Pr(∩_{i=1}^k A_i) = ∏_{i=1}^k Pr(A_i).

• Conditional independence: If the occurrence of A or B conveys no information about the occurrence of the other once you know the occurrence of a third event C, then A and B are conditionally independent (conditional on C):

1. Pr(A|B ∩ C) = Pr(A|C)
2. Pr(B|A ∩ C) = Pr(B|C)
3. Pr(A ∩ B|C) = Pr(A|C) Pr(B|C)

1.5 Probability II: Random Variables

Today’s Topics⁵: • Levels of Measurement • Discrete Distributions • Continuous Distributions • Joint Distributions • Expectation • Special Discrete Distributions • Special Continuous Distributions • Summarizing Observed Data
1.5.1 Levels of Measurement

• In empirical research, data can be classified along several dimensions. We have already
distinguished between discrete (countable) and continuous (uncountable) data. We can also
look at the precision with which the underlying quantities are measured.

• Nominal: Discrete data are nominal if there is no way to put the categories represented
by the data into a meaningful order. Typically, this kind of data represents names (hence
‘nominal’) or attributes, like Republican or Democrat.

• Ordinal: Discrete data are ordinal if there is a logical order to the categories represented
by the data, but there is no common scale for differences between adjacent categories. Party
identification is often measured as ordinal data.

• Interval: Discrete or continuous data are interval if there is an order to the values and there
is a common scale, so that differences between two values have substantive meanings. Dates
are an example of interval data.

• Ratio: Discrete or continuous data are ratio if the data have the characteristics of interval
data and zero is a meaningful quantity. This allows us to consider the ratio of two values as
well as difference between them. Quantities measured in dollars, such as per capita GDP, are
ratio data.

1.5.2 Discrete Distributions

• Random Variable: A random variable is a real-valued function defined on the sample space
S; it assigns a real number to every outcome s ∈ S.

• Discrete Random Variable: Y is a discrete random variable if it can assume only a finite
or countably infinite number of distinct values.

• Examples: number of wars per year, heads or tails, voting Republican or Democrat, number
on a rolled die.

• Probability Mass Function: For a discrete random variable Y, the probability mass function (pmf)⁶ p(y) = Pr(Y = y) assigns probabilities to a countable number of distinct y values such that

1. 0 ≤ p(y) ≤ 1
2. Σ_y p(y) = 1
⁵ Much of the material and examples for this lecture are taken from Gill (2006) Essential Mathematics for Political and Social Research, Wackerly, Mendenhall, & Scheaffer (1996) Mathematical Statistics with Applications, DeGroot (1985) Probability and Statistics, Morrow (1994) Game Theory for Political Scientists, and Ross (1987) Introduction to Probability and Statistics for Scientists and Engineers.
⁶ Also referred to simply as the “probability distribution.”

• Example: For a fair six-sided die, there is an equal probability of rolling any number. Since there are six sides, the probability mass function is p(y) = 1/6 for y = 1, . . . , 6. Each p(y) is between 0 and 1, and the sum of the p(y)’s is 1.
• Cumulative Distribution: The cumulative distribution F(y) = Pr(Y ≤ y) is the probability that Y is less than or equal to some value y, or

Pr(Y ≤ y) = Σ_{i≤y} p(i)

The CDF must satisfy these properties:

1. F(y) is non-decreasing in y.
2. lim_{y→−∞} F(y) = 0 and lim_{y→∞} F(y) = 1
3. F(y) is right-continuous.

• Example: For a fair die, Pr(Y ≤ 1) = , Pr(Y ≤ 3) = , and Pr(Y ≤ 6) = .

1.5.3 Continuous Distributions

• Continuous Random Variable: Y is a continuous random variable if there exists a nonnegative function f(y) defined for all real y ∈ (−∞, ∞), such that for any interval A,

Pr(Y ∈ A) = ∫_A f(y)dy

• Examples: age, income, GNP, temperature

• Probability Density Function: The function f above is called the probability density function (pdf) of Y and must satisfy

1. f(y) ≥ 0
2. ∫_{−∞}^∞ f(y)dy = 1

Note also that Pr(Y = y) = 0 — i.e., the probability of any point y is zero.

• Example: f(y) = 1, 0 ≤ y ≤ 1

• Cumulative Distribution: Because the probability that a continuous random variable will assume any particular value is zero, we can only make statements about the probability of a continuous random variable being within an interval. The cumulative distribution gives the probability that Y lies on the interval (−∞, y) and is defined as

F(y) = Pr(Y ≤ y) = ∫_{−∞}^y f(s)ds

Note that F(y) has similar properties for continuous distributions as it does for discrete ones: non-decreasing, continuous (not just right-continuous), lim_{y→−∞} F(y) = 0, and lim_{y→∞} F(y) = 1.

Similarly, we can also make probability statements about Y falling in an interval a ≤ y ≤ b:

Pr(a ≤ y ≤ b) = ∫_a^b f(y)dy

• Example: f(y) = 1, 0 < y < 1. Find F(y) and Pr(.5 < y < .75).

F(y) =

Pr(.5 < y < .75) =

• F′(y) = dF(y)/dy = f(y)

1.5.4 Joint Distributions

• Often, we are interested in two or more random variables defined on the same sample space.
The distribution of these variables is called a joint distribution. Joint distributions can be
made up of any combination of discrete and continuous random variables.

• Example: Suppose we are interested in the outcomes of flipping a coin and rolling a 6-sided
die at the same time. The sample space for this process contains 12 elements:

{h1, h2, h3, h4, h5, h6, t1, t2, t3, t4, t5, t6}

We can define two random variables X and Y such that X = 1 if heads and X = 0 if
tails, while Y equals the number on the die. We can then make statements about the joint
distribution of X and Y .

• Joint discrete random variables: If both X and Y are discrete, their joint probability
mass function assigns probabilities to each pair of outcomes

p(x, y) = Pr(X = x, Y = y)
Again, p(x, y) ∈ [0, 1] and Σ_x Σ_y p(x, y) = 1.
If we are interested in the marginal probability of one of the two variables (ignoring information about the other variable), we can obtain the marginal pmf by summing across the variable that we don’t care about:

p_X(x) = Σ_i p(x, y_i)

We can also calculate the conditional pmf for one variable, holding the other variable fixed. Recalling from the previous lecture that Pr(A|B) = Pr(A ∩ B)/Pr(B), we can write the conditional pmf as

p_{Y|X}(y|x) = p(x, y) / p_X(x), for p_X(x) > 0
• Joint continuous random variables: If both X and Y are continuous, their joint probability density function defines their distribution:

Pr((X, Y) ∈ A) = ∫∫_A f(x, y)dx dy

Likewise, f(x, y) ≥ 0 and ∫_{−∞}^∞ ∫_{−∞}^∞ f(x, y)dx dy = 1.
Instead of summing, we obtain the marginal probability density function by integrating out one of the variables:

f_X(x) = ∫_{−∞}^∞ f(x, y)dy

Finally, we can write the conditional pdf as

f_{Y|X}(y|x) = f(x, y) / f_X(x), for f_X(x) > 0

1.5.5 Expectation

• We often want to summarize some characteristics of the distribution of a random variable. The most important summary is the expectation (or expected value, or mean), in which the possible values of a random variable are weighted by their probabilities.

• Expectation of a Discrete Random Variable: The expected value of a discrete random variable Y is

E(Y) = Σ_y y·p(y)

In words, it is the weighted average of the possible values y can take on, weighted by the probability that y occurs. It is not necessarily the number we would expect Y to take on, but the average value of Y after a large number of repetitions of an experiment.

• Example: For a fair die,

E(Y) =

• Expectation of a Continuous Random Variable: The expected value of a continuous random variable is similar in concept to that of the discrete random variable, except that instead of summing using probabilities as weights, we integrate using the density to weight. Hence, the expected value of the continuous variable Y is defined by

E(Y) = ∫_{−∞}^∞ y·f(y)dy

• Example: Find E(Y) for f(y) = 1/1.5, 0 < y < 1.5.

E(Y) =

• Expected Value of a Function:

1. Discrete: E[g(Y)] = Σ_y g(y)p(y)
2. Continuous: E[g(Y)] = ∫_{−∞}^∞ g(y)f(y)dy

• Other Properties of Expected Values:

1. E(c) = c
2. E[E[Y]] = E[Y] (because the expected value of a random variable is a constant)
3. E[c·g(Y)] = c·E[g(Y)]
4. E[g(Y1) + · · · + g(Yn)] = E[g(Y1)] + · · · + E[g(Yn)]

• Variance: We can also look at other summaries of the distribution, which build on the idea of taking expectations. Variance tells us about the “spread” of the distribution; it is the expected value of the squared deviations from the mean of the distribution. The standard deviation is simply the square root of the variance.

1. Variance: σ² = Var(Y) = E[(Y − E(Y))²] = E(Y²) − [E(Y)]²
2. Standard deviation: σ = √Var(Y)

• Covariance and Correlation: The covariance measures the degree to which two random variables vary together; if the covariance is positive, X tends to be larger than its mean when Y is larger than its mean. The covariance of a variable with itself is the variance of that variable.

Cov(X, Y) = E[(X − E(X))(Y − E(Y))] = E(XY) − E(X)E(Y)

The correlation coefficient is the covariance divided by the standard deviations of X and Y. It is a unitless measure that always takes on values in the interval [−1, 1].

ρ = Cov(X, Y) / √(Var(X)Var(Y)) = Cov(X, Y) / [SD(X)SD(Y)]

• Conditional Expectation: With joint distributions, we are often interested in the expected value of a variable Y if we could hold the other variable X fixed. This is the conditional expectation of Y given X = x:

1. Y discrete: E(Y|X = x) = Σ_y y·p_{Y|X}(y|x)
2. Y continuous: E(Y|X = x) = ∫_y y·f_{Y|X}(y|x)dy

The conditional expectation is often used for prediction when one knows the value of X but not Y; the realized value of X contains information about the unknown Y so long as E(Y|X = x) ≠ E(Y) ∀x.

1.5.6 Special Discrete Distributions

• Binomial Distribution: Y is distributed binomial if it represents the number of “successes” observed in n independent, identical “trials,” where the probability of success in any trial is p and the probability of failure is q = 1 − p.
For any particular sequence of y successes and n − y failures, the probability of obtaining that sequence is p^y q^(n−y) (by the multiplicative law and independence). However, there are (n choose y) = n!/((n − y)!y!) ways of obtaining a sequence with y successes and n − y failures. So the binomial distribution is given by

p(y) = (n choose y) p^y q^(n−y), y = 0, 1, 2, . . . , n

with mean µ = E(Y) = np and variance σ² = V(Y) = npq.

• Example: Republicans vote for Democrat-sponsored bills 2% of the time. What is the probability that out of 10 Republicans questioned, half voted for a particular Democrat-sponsored bill? What is the mean number of Republicans voting for Democrat-sponsored bills? The variance?

1. p(5) =
2. E(Y) =
3. V(Y) =

• Poisson Distribution: A random variable Y has a Poisson distribution if

p(y) = (λ^y / y!) e^(−λ), y = 0, 1, 2, . . . , λ > 0

The Poisson has the unusual feature that its expectation equals its variance: E(Y) = V(Y) = λ. The Poisson distribution is often used to model event counts: counts of the number of events that occur during some unit of time. λ is often called the “arrival rate.”

• Example: Border disputes occur between two countries at a rate of 2 per month. What is the probability of 0, 2, and less than 5 disputes occurring in a month?

1. p(0) =
2. p(2) =

3. Pr(Y < 5) =
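• Both pmfs are easy to evaluate with the standard library; a sketch for the two examples above (the helper names are ours; math.comb exists from Python 3.8):

```python
import math

def binom_pmf(y, n, p):
    """Binomial pmf: C(n, y) * p^y * (1-p)^(n-y)."""
    return math.comb(n, y) * p ** y * (1 - p) ** (n - y)

def poisson_pmf(y, lam):
    """Poisson pmf: lam^y * e^(-lam) / y!."""
    return lam ** y * math.exp(-lam) / math.factorial(y)

print(binom_pmf(5, 10, 0.02))                     # p(5) for the bill example
print(10 * 0.02, 10 * 0.02 * 0.98)                # E(Y) = np and V(Y) = npq

print(poisson_pmf(0, 2), poisson_pmf(2, 2))       # p(0), p(2) at rate 2 per month
print(sum(poisson_pmf(y, 2) for y in range(5)))   # Pr(Y < 5)
```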

1.5.7 Special Continuous Distributions

• Uniform Distribution: A random variable Y has a continuous uniform distribution on the interval (α, β) if its density is given by

f(y) = 1/(β − α), α ≤ y ≤ β

The mean and variance of Y are E(Y) = (α + β)/2 and V(Y) = (β − α)²/12.

• Example: Y uniformly distributed over (1, 3).

• Normal Distribution: A random variable Y is normally distributed with mean E(Y) = µ and variance V(Y) = σ² if its density is

f(y) = (1 / (√(2π) σ)) e^(−(y − µ)² / (2σ²))

• Example: Y normally distributed with mean µ = 0 and variance σ² = .1

1.5.8 Summarizing Observed Data

• So far, we’ve talked about distributions in a theoretical sense, looking at different properties of
random variables. We don’t observe random variables; we observe realizations of the random
variable.

• Central tendency: The central tendency describes the location of the “middle” of the
observed data along some scale. There are several measures of central tendency.

1. Sample mean: This is the most common measure of central tendency, calculated by summing across the observations and dividing by the number of observations:

x̄ = (1/n) Σ_{i=1}^n x_i

The sample mean is an estimate of the expected value of a distribution.



2. Sample median: The median is the value of the “middle” observation. It is obtained by ordering the n data points from smallest to largest and taking the value of the (n + 1)/2-th observation (if n is odd) or the mean of the (n/2)-th and (n/2 + 1)-th observations (if n is even).
3. Sample mode: The mode is the most frequently observed value in the data:

m_x = X_i : n(X_i) > n(X_j) ∀j ≠ i

When the data are realizations of a continuous random variable, it often makes sense to group the data into bins, either by rounding or some other process, in order to get a reasonable estimate of the mode.
4. Exercise: Calculate the sample mean, median, and mode for the following two variables, X and Y.
X: 6 3 7 5 5 5 6 4 7 2
Y: 1 2 1 2 2 1 2 0 2 0
• Dispersion: We also typically want to know how spread out the data are relative to the center of the observed distribution. Again, there are several ways to measure dispersion.

1. Sample variance: The sample variance is the sum of the squared deviations from the sample mean, divided by the number of observations minus 1:

Var(X) = (1/(n − 1)) Σ_{i=1}^n (x_i − x̄)²

Again, this is an estimate of the variance of a random variable; we divide by n − 1 instead of n in order to get an unbiased estimate.
2. Standard deviation: The sample standard deviation is the square root of the sample variance:

SD(X) = √Var(X) = √[(1/(n − 1)) Σ_{i=1}^n (x_i − x̄)²]

3. Median absolute deviation (MAD): The MAD is a different measure of dispersion, based on deviations from the median rather than deviations from the mean:

MAD(X) = median(|x_i − median(x)|)

4. Exercise: Calculate the sample variance, standard deviation, and MAD for the following two variables, X and Y.
X: 6 3 7 5 5 5 6 4 7 2
Y: 1 2 1 2 2 1 2 0 2 0
• Covariance and Correlation: Both of these quantities measure the degree to which two variables vary together, and they are estimates of the covariance and correlation of two random variables as defined above.

1. Sample covariance: Cov(X, Y) = (1/(n − 1)) Σ_{i=1}^n (x_i − x̄)(y_i − ȳ)
2. Sample correlation: r = Cov(X, Y) / √(Var(X)Var(Y))
3. Exercise: Calculate the sample covariance and correlation coefficient for the following two variables, X and Y.
X: 6 3 7 5 5 5 6 4 7 2
Y: 1 2 1 2 2 1 2 0 2 0
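• For checking your answers to these exercises, a sketch using Python’s statistics module (covariance and correlation require Python 3.10; the MAD is computed by hand since the module has no built-in for it):

```python
import statistics as st

X = [6, 3, 7, 5, 5, 5, 6, 4, 7, 2]
Y = [1, 2, 1, 2, 2, 1, 2, 0, 2, 0]

for name, v in (("X", X), ("Y", Y)):
    mad = st.median([abs(vi - st.median(v)) for vi in v])
    print(name, st.mean(v), st.median(v), st.mode(v),
          st.variance(v), st.stdev(v), mad)       # variance/stdev divide by n-1

print(st.covariance(X, Y), st.correlation(X, Y))  # sample covariance and r
```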

1.6 Linear Algebra I

Today’s Topics7 : • Working with Vectors • Linear Independence • Matrix Algebra • Square
Matrices • Systems of Linear Equations • Method of Substitution • Gaussian Elimination • Gauss-
Jordan Elimination
1.6.1 Working with Vectors

• Vector: A vector in n-space is an ordered list of n numbers. These numbers can be repre-
sented as either a row vector or a column vector:
 
        row vector:     v = [ v1  v2  ···  vn ]

        column vector:  v = [ v1 ]
                            [ v2 ]
                            [  ⋮ ]
                            [ vn ]

We can also think of a vector as defining a point in n-dimensional space, usually Rn ; each
element of the vector defines the coordinate of the point in a particular direction.

• Vector Addition: Vector addition is defined for two vectors u and v iff they have the same
  number of elements:

        u + v = [ u1 + v1   u2 + v2   ···   un + vn ]

• Scalar Multiplication: The product of a scalar c and vector v is:



        cv = [ cv1  cv2  ···  cvn ]

• Vector Inner Product: The inner product (also called the dot product or scalar product)
of two vectors u and v is again defined iff they have the same number of elements

        u · v = u1 v1 + u2 v2 + · · · + un vn = Σ_{i=1}^n ui vi

If u · v = 0, the two vectors are orthogonal (or perpendicular).

• Vector Norm: The norm of a vector is a measure of its length. There are many different
norms, the most common of which is the Euclidean norm (which corresponds to our usual
conception of distance in three-dimensional space):

        ||v|| = √(v · v) = √(v1 v1 + v2 v2 + · · · + vn vn)
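• The vector operations above map directly onto NumPy arrays. A minimal sketch (the
  particular vectors are arbitrary choices for illustration):

    import numpy as np

    u = np.array([1.0, 2.0, 3.0])
    v = np.array([4.0, 5.0, 6.0])

    print(u + v)              # vector addition
    print(2 * v)              # scalar multiplication
    print(u @ v)              # inner (dot) product: 1*4 + 2*5 + 3*6 = 32
    print(np.linalg.norm(v))  # Euclidean norm: sqrt(v . v)
    print(u @ v == 0)         # True would mean u and v are orthogonal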

1.6.2 Linear Dependence

• Linear combinations: The vector u is a linear combination of the vectors v1 , v2 , · · · , vk if

u = c1 v1 + c2 v2 + · · · + ck vk
⁷ Much of the material and examples for this lecture are taken from Gill (2006) Essential Mathematics for Political
and Social Scientists, Simon & Blume (1994) Mathematics for Economists, and Kolman (1993) Introductory Linear
Algebra with Applications.

• Linear independence: A set of vectors v1 , v2 , · · · , vk is linearly independent if the only
  solution to the equation

        c1 v1 + c2 v2 + · · · + ck vk = 0

  is c1 = c2 = · · · = ck = 0. If another solution exists, the set of vectors is linearly dependent.

• A set S of vectors is linearly dependent iff at least one of the vectors in S can be written as
a linear combination of the other vectors in S.

• Linear independence is only defined for sets of vectors with the same number of elements;
any linearly independent set of vectors in n-space contains at most n vectors.

• Exercises: Are the following sets of vectors linearly independent? (A numerical check
  appears after this list.)

  1.
        v1 = (1, 0, 0)',   v2 = (1, 0, 1)',   v3 = (1, 1, 1)'

  2.
        v1 = (3, 2, −1)',  v2 = (−2, 2, 4)',  v3 = (2, 3, 1)'
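• A numerical way to answer these exercises: stack the vectors as the columns of a matrix and
  compute its rank. The set is linearly independent iff the rank equals the number of vectors.
  A minimal sketch in Python:

    import numpy as np

    # Exercise 1: columns are v1, v2, v3
    A1 = np.array([[1, 1, 1],
                   [0, 0, 1],
                   [0, 1, 1]])
    # Exercise 2
    A2 = np.array([[ 3, -2, 2],
                   [ 2,  2, 3],
                   [-1,  4, 1]])

    print(np.linalg.matrix_rank(A1))  # 3 -> linearly independent
    print(np.linalg.matrix_rank(A2))  # 2 -> linearly dependent

  This also previews the notion of rank developed in the next lecture.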

1.6.3 Matrix Algebra

• Matrix: A matrix is an array of mn real numbers arranged in m rows by n columns.


 
        A = [ a11  a12  ···  a1n ]
            [ a21  a22  ···  a2n ]
            [  ⋮    ⋮    ⋱    ⋮  ]
            [ am1  am2  ···  amn ]

Note that you can think of vectors as special cases of matrices; a column vector of length k
is a k × 1 matrix, while a row vector of the same length is a 1 × k matrix. You can also think
of larger matrices as being made up of a collection of row or column vectors. For example,
an m × n matrix can be written as

        A = [ a1  a2  ···  an ]

where each ai is a column vector of length m.

• Matrix Addition: Let A and B be two m × n matrices. Then


 
        A + B = [ a11 + b11   a12 + b12   ···   a1n + b1n ]
                [ a21 + b21   a22 + b22   ···   a2n + b2n ]
                [     ⋮           ⋮        ⋱        ⋮     ]
                [ am1 + bm1   am2 + bm2   ···   amn + bmn ]

Note that matrices A and B must be the same size, in which case they are conformable for
addition.

• Example:

        A = [ 1  2  3 ],    B = [ 1  2  1 ]
            [ 4  5  6 ]         [ 2  1  2 ]

        A + B =

• Scalar Multiplication: Given the scalar s, the scalar multiplication of sA is


   
        sA = s [ a11  a12  ···  a1n ]  =  [ sa11  sa12  ···  sa1n ]
               [ a21  a22  ···  a2n ]     [ sa21  sa22  ···  sa2n ]
               [  ⋮    ⋮    ⋱    ⋮  ]     [   ⋮     ⋮     ⋱    ⋮  ]
               [ am1  am2  ···  amn ]     [ sam1  sam2  ···  samn ]

• Example:

        s = 2,    A = [ 1  2  3 ]
                      [ 4  5  6 ]

        sA =

• Matrix Multiplication: If A is an m×k matrix and B is a k ×n matrix, then their product


C = AB is the m × n matrix where

cij = ai1 b1j + ai2 b2j + · · · + aik bkj

• Examples:

  1.
        [ a  b ]   [ A  B ]
        [ c  d ] · [ C  D ]  =
        [ e  f ]

  2.
        [ 1  2  −1 ]   [ −2   5 ]
        [ 3  1   4 ] · [  4  −3 ]  =
                       [  2   1 ]
Note that the number of columns of the first matrix must equal the number of rows of the
second matrix, in which case they are conformable for multiplication. The sizes of the
matrices (including the resulting product) must be

(m × k)(k × n) = (m × n)

• Laws of Matrix Algebra:

1. Associative: (A + B) + C = A + (B + C)
(AB)C = A(BC)
2. Commutative: A+B=B+A
3. Distributive: A(B + C) = AB + AC
(A + B)C = AC + BC

• Commutative law for multiplication does not hold – the order of multiplication matters:

        AB ≠ BA

• Example:

        A = [  1  2 ],    B = [ 2  1 ]
            [ −1  3 ]         [ 0  1 ]

        AB = [  2  3 ],   BA = [  1  7 ]
             [ −2  2 ]         [ −1  3 ]

• Transpose: The transpose of the m × n matrix A is the n × m matrix A^T (sometimes written
  A′) obtained by interchanging the rows and columns of A.
• Examples:

  1.    A = [ 4  −2   3 ],    A^T = [  4   0 ]
            [ 0   5  −1 ]           [ −2   5 ]
                                    [  3  −1 ]

  2.    B = [  2 ],    B^T = [ 2  −1  3 ]
            [ −1 ]
            [  3 ]
• The following rules apply for transposed matrices:

  1. (A + B)^T = A^T + B^T
  2. (A^T)^T = A
  3. (sA)^T = s A^T
  4. (AB)^T = B^T A^T
• Example of (AB)^T = B^T A^T:

        A = [ 1   3  2 ],    B = [ 0   1 ]
            [ 2  −1  3 ]         [ 2   2 ]
                                 [ 3  −1 ]

        AB = [ 12   5 ],   so   (AB)^T = [ 12   7 ]
             [  7  −3 ]                  [  5  −3 ]

        B^T A^T = [ 0  2   3 ] [ 1   2 ]   =  [ 12   7 ]
                  [ 1  2  −1 ] [ 3  −1 ]      [  5  −3 ]
                               [ 2   3 ]

1.6.4 Square Matrices

• Square matrices have the same number of rows and columns; a k × k square matrix is referred
  to as a matrix of order k.

• The diagonal of a square matrix is the vector of matrix elements that have the same sub-
  scripts. If A is a square matrix of order k, then its diagonal is [a11, a22, . . . , akk]'.

• Trace: The trace of a square matrix A is the sum of the diagonal elements:

        tr(A) = a11 + a22 + · · · + akk

Properties of the trace operator: If A and B are square matrices of order k, then

1. tr(A + B) = tr(A) + tr(B)
2. tr(A^T) = tr(A)
3. tr(sA) = s tr(A)
4. tr(AB) = tr(BA)

• There are several important types of square matrix:

1. Symmetric Matrix: A matrix A is symmetric if A = A^T; this implies that aij = aji
   for all i and j.
   Examples:
        A = [ 1  2 ] = A^T,    B = [  4  2  −1 ]
            [ 2  1 ]               [  2  1   3 ]  = B^T
                                   [ −1  3   1 ]

2. Diagonal Matrix: A matrix A is diagonal if all of its non-diagonal entries are zero;
   formally, if aij = 0 for all i ≠ j.
   Examples:
        A = [ 1  0 ],    B = [ 4  0  0 ]
            [ 0  2 ]         [ 0  1  0 ]
                             [ 0  0  1 ]

3. Triangular Matrix: A matrix is triangular in one of two cases. If all entries below the
   diagonal are zero (aij = 0 for all i > j), it is upper triangular. Conversely, if all entries
   above the diagonal are zero (aij = 0 for all i < j), it is lower triangular.
   Examples:
        A_LT = [  1  0  0 ],    A_UT = [ 1  7  −4 ]
               [  4  2  0 ]            [ 0  3   9 ]
               [ −3  2  5 ]            [ 0  0  −3 ]

4. Identity Matrix: The n × n identity matrix In is the matrix whose diagonal elements
   are 1 and all off-diagonal elements are 0. Examples:
        I2 = [ 1  0 ],    I3 = [ 1  0  0 ]
             [ 0  1 ]          [ 0  1  0 ]
                               [ 0  0  1 ]

1.6.5 Linear Equations

• Linear Equation: a1 x1 + a2 x2 + · · · + an xn = b
ai are parameters or coefficients. xi are variables or unknowns.

• Linear because there is only one variable per term and each term is of degree at most 1.

  1. R2: line          x2 = b/a2 − (a1/a2) x1
  2. R3: plane         x3 = b/a3 − (a1/a3) x1 − (a2/a3) x2
  3. Rn: hyperplane

1.6.6 Systems of Linear Equations

• Often interested in solving linear systems like

x − 3y = −3
2x + y = 8

• More generally, we might have a system of m equations in n unknowns

        a11 x1 + a12 x2 + · · · + a1n xn = b1
        a21 x1 + a22 x2 + · · · + a2n xn = b2
              ⋮                             ⋮
        am1 x1 + am2 x2 + · · · + amn xn = bm

• A solution to a linear system of m equations in n unknowns is a set of n numbers x1 , x2 , · · · , xn
  that satisfy each of the m equations.

1. R2 : intersection of the lines.


2. R3 : intersection of the planes.
3. Rn : intersection of the hyperplanes.

• Example: x = 3 and y = 2 is the solution to the above 2 × 2 linear system; graphically, the
  two lines intersect at the point (3, 2).

• Does a linear system have one, no, or multiple solutions? For a system of 2 equations in 2
  unknowns (i.e., two lines):

1. One solution: The lines intersect at exactly one point.


2. No solution: The lines are parallel.
3. Infinite solutions: The lines coincide.

• Methods to solve linear systems:

1. Substitution
2. Elimination of variables
3. Matrix methods

1.6.7 Method of Substitution

• Procedure:

1. Solve one equation for one variable, say x1 , in terms of the other variables in the equation.
2. Substitute the expression for x1 into the other m−1 equations, resulting in a new system
of m − 1 equations in n − 1 unknowns.
3. Repeat steps 1 and 2 until you are left with one equation in one unknown, say xn. We
   now have a value for xn.
4. Backward substitution: Substitute the value of xn into the previous equation (which is
   a function of xn only). Repeat this, using the successive expressions of each variable in
   terms of the other variables, to find the values of all xi's.

• Exercises:

1. Using substitution, solve:


x − 3y = −3
2x + y = 8
2. Using substitution, solve
x + 2y + 3z = 6
2x − 3y + 2z = 14
3x + y − z = −2

1.6.8 Elementary Equation Operations

• Elementary equation operations are used to transform the equations of a linear system, while
maintaining an equivalent linear system — equivalent in the sense that the same values of
xj solve both the original and transformed systems. These operations are

1. Interchanging two equations,


2. Multiplying both sides of an equation by a constant, and
3. Adding equations to each other

• Interchanging Equations: Given the linear system

a11 x1 + a12 x2 = b1
a21 x1 + a22 x2 = b2

we can interchange its equations, resulting in the equivalent linear system

a21 x1 + a22 x2 = b2
a11 x1 + a12 x2 = b1

• Multiplying by a Constant: Suppose we had the following equation:

2=2

If we multiply each side of the equation by some number, say 4, we still have an equality:

2(4) = 2(4) =⇒ 8=8

More generally, we can multiply both sides of any equation by a constant and maintain an
equivalent equation. For example, the following two equations are equivalent:

a11 x1 + a12 x2 = b1
ca11 x1 + ca12 x2 = cb1

• Adding Equations: Suppose we had the following two very simple equations:

3 = 3
7 = 7

If we add these two equations to each other, we get

7+3=7+3 =⇒ 10 = 10

Suppose we now have

        a = b
        c = d

If we add these two equations to each other, we get

        a + c = b + d

Extending this, suppose we had the linear system

        a11 x1 + a12 x2 = b1
        a21 x1 + a22 x2 = b2

If we add these two equations to each other, we get

        (a11 + a21) x1 + (a12 + a22) x2 = b1 + b2

1.6.9 Method of Gaussian Elimination

• Gaussian Elimination is a method by which we start with some linear system of m equations
in n unknowns and use the elementary equation operations to eliminate variables, until we
arrive at an equivalent system of the form

        a'11 x1 + a'12 x2 + a'13 x3 + · · · + a'1n xn = b'1
                  a'22 x2 + a'23 x3 + · · · + a'2n xn = b'2
                            a'33 x3 + · · · + a'3n xn = b'3
                                         ⋱       ⋮       ⋮
                                              a'mn xn = b'm

where a'ij denotes the coefficient of the jth unknown in the ith equation after the above
transformation. Note that at each stage of the elimination process, we want to change some
coefficient of our system to 0 by adding a multiple of an earlier equation to the given equation.
The leading coefficients a'11, a'22, etc. are referred to as pivots, since they are the terms
used to eliminate the variables in the rows below them in their respective columns.⁸ Once the
linear system is in the above reduced form, we then use back substitution to find the values
of the xj's.
• Exercises:

1. Using Gaussian elimination, solve


x − 3y = −3
2x + y = 8

2. Using Gaussian elimination, solve


x + 2y + 3z = 6
2x − 3y + 2z = 14
3x + y − z = −2
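• A minimal Python sketch of Gaussian elimination with back substitution, applied to the two
  exercise systems above (partial pivoting is added here for numerical stability; it is not part
  of the hand algorithm):

    import numpy as np

    def solve_gauss(A, b):
        """Forward elimination to row echelon form, then back substitution."""
        A = np.array(A, dtype=float)
        b = np.array(b, dtype=float)
        n = len(b)
        for k in range(n - 1):
            p = k + np.argmax(np.abs(A[k:, k]))   # choose the pivot row
            A[[k, p]], b[[k, p]] = A[[p, k]], b[[p, k]]
            for i in range(k + 1, n):
                m = A[i, k] / A[k, k]             # multiplier for row i
                A[i, k:] -= m * A[k, k:]          # eliminate below the pivot
                b[i] -= m * b[k]
        x = np.zeros(n)
        for i in range(n - 1, -1, -1):            # back substitution
            x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
        return x

    print(solve_gauss([[1, -3], [2, 1]], [-3, 8]))                        # [3. 2.]
    print(solve_gauss([[1, 2, 3], [2, -3, 2], [3, 1, -1]], [6, 14, -2]))  # [1. -2. 3.]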
⁸ As we'll see, pivots don't need to be on the i = j diagonal. Additionally, sometimes when we pivot, we will
eliminate variables in rows above a pivot.

1.6.10 Method of Gauss-Jordan Elimination

• The method of Gauss-Jordan elimination takes the Gaussian elimination method one
step further. Once the linear system is in the reduced form shown in the preceding section,
elementary row operations and Gaussian elimination are used to

1. Change the coefficient of the pivot term in each equation to 1 and


2. Eliminate all terms above each pivot in its column,

resulting in a reduced, equivalent system. For a system of m equations in m unknowns, a
typical reduced system would be

        x1 = b*1
        x2 = b*2
        x3 = b*3
         ⋮     ⋮
        xm = b*m

which needs no further work to solve for the xj's.

• Exercises:

1. Using Gauss-Jordan elimination, solve

x − 3y = −3
2x + y = 8

2. Using Gauss-Jordan elimination, solve

x + 2y + 3z = 6
2x − 3y + 2z = 14
3x + y − z = −2

1.7 Linear Algebra II

Today’s Topics9 :
• Matrix Methods for Linear Systems • Rank • Existence of Solutions • Inverse of a Matrix •
Linear Systems and Inverses • Determinants • The Determinant Formula for an Inverse • Cramer’s
Rule
1.7.1 Matrices, Row Operations, & (Reduced) Row Echelon Form

• Matrices provide an easy and efficient way to represent linear systems such as

        a11 x1 + a12 x2 + · · · + a1n xn = b1
        a21 x1 + a22 x2 + · · · + a2n xn = b2
              ⋮                             ⋮
        am1 x1 + am2 x2 + · · · + amn xn = bm

  as

        Ax = b

  where

1. The m × n coefficient matrix A is an array of mn real numbers arranged in m rows
   by n columns:

        A = [ a11  a12  ···  a1n ]
            [ a21  a22  ···  a2n ]
            [  ⋮    ⋮    ⋱    ⋮  ]
            [ am1  am2  ···  amn ]

2. The unknown quantities are represented by the column vector x = (x1, x2, . . . , xn)'.

3. The RHS of the linear system is represented by the column vector b = (b1, b2, . . . , bm)'.

• Augmented Matrix: When we append b to the coefficient matrix A, we get the augmented
  matrix Â = [A|b]:

        Â = [ a11  a12  ···  a1n | b1 ]
            [ a21  a22  ···  a2n | b2 ]
            [  ⋮    ⋮         ⋮  |  ⋮ ]
            [ am1  am2  ···  amn | bm ]
• Elementary Row Operations: Just as we conducted elementary equation operations, we
can conduct elementary row operations to transform some augmented matrix representation
of a linear system into another augmented matrix that represents an equivalent linear system.
Since we’re really operating on equations when we operate on the rows of the matrix, these
row operations correspond exactly to the equation operations:
⁹ Much of the material and examples for this lecture are taken from Simon & Blume (1994) Mathematics for
Economists and Kolman (1993) Introductory Linear Algebra with Applications.

1. Interchanging two rows.           =⇒  Interchanging two equations.
2. Multiplying a row by a constant.  =⇒  Multiplying both sides of an equation by a constant.
3. Adding two rows to each other.    =⇒  Adding two equations to each other.
• Interchanging Rows: Suppose we have the augmented matrix

        Â = [ a11  a12 | b1 ]
            [ a21  a22 | b2 ]

  If we interchange the two rows, we get the augmented matrix

        [ a21  a22 | b2 ]
        [ a11  a12 | b1 ]

  which represents a linear system equivalent to that represented by matrix Â.

• Multiplying by a Constant: If we multiply the second row of matrix Â by a constant c,
  we get the augmented matrix

        [  a11   a12 |  b1 ]
        [ ca21  ca22 | cb2 ]

  which represents a linear system equivalent to that represented by matrix Â.

• Adding Rows: If we add the first row of matrix Â to the second, we obtain the augmented
  matrix

        [    a11         a12     |    b1    ]
        [ a11 + a21   a12 + a22  | b1 + b2  ]

  which represents a linear system equivalent to that represented by matrix Â.

• Row Echelon Form: We use the row operations to change coefficients in the augmented
  matrix to 0 — i.e., pivot to eliminate variables — and to put it in a matrix form representing
  the final linear system of Gaussian elimination. An augmented matrix of the form

        [ a'11  a'12  a'13  ···  a'1n | b'1 ]
        [  0    a'22  a'23  ···  a'2n | b'2 ]
        [  0     0    a'33  ···  a'3n | b'3 ]
        [  ⋮     ⋮     ⋮     ⋱    ⋮   |  ⋮  ]
        [  0     0     0    ···  a'mn | b'm ]

  is said to be in row echelon form — each row has more leading zeros than the row preceding
  it.
• Reduced Row Echelon Form: Reduced row echelon form is the matrix representation of
  a linear system after Gauss-Jordan elimination. For a system of m equations in m unknowns,
  with no all-zero rows, the reduced row echelon form would be

        [ 1  0  0  0  0 | b*1 ]
        [ 0  1  0  0  0 | b*2 ]
        [ 0  0  1  0  0 | b*3 ]
        [ 0  0  0  ⋱  0 |  ⋮  ]
        [ 0  0  0  0  1 | b*m ]

• Exercises:
Using matrix methods, solve the following linear system by Gaussian elimination and then
Gauss-Jordan elimination:

1.
x − 3y = −3
2x + y = 8

2.
x + 2y + 3z = 6
2x − 3y + 2z = 14
3x + y − z = −2

1.7.2 Rank — and Whether a System Has One, Infinite, or No Solutions

• We previously noted that a 2 × 2 system had one, infinite, or no solutions if the two lines in-
tersected, were the same, or were parallel, respectively. More generally, to determine whether
one, infinite, or no solutions exist, we can use information about (1) the number of equations
m, (2) the number of unknowns n, and (3) the rank of the matrix representing the linear
system.

• Rank: The rank of a matrix is the number of nonzero rows in its row echelon form. The
rank corresponds to the maximum number of linearly independent row or column vectors in
the matrix.

• Examples (answers checked numerically below):

  1.    [ 1  2  3 ]
        [ 0  4  5 ]        Rank =
        [ 0  0  6 ]

  2.    [ 1  2  3 ]
        [ 0  4  5 ]        Rank =
        [ 0  0  0 ]

  3.    [ 1  2  3 | b1 ]
        [ 0  4  5 | b2 ] , bi ≠ 0        Rank =
        [ 0  0  0 | b3 ]
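• A quick check of these examples (a minimal Python sketch; np.linalg.matrix_rank counts
  the number of linearly independent rows):

    import numpy as np

    A1 = np.array([[1, 2, 3], [0, 4, 5], [0, 0, 6]])
    A2 = np.array([[1, 2, 3], [0, 4, 5], [0, 0, 0]])
    # Example 3 with, say, b = (1, 1, 1): the all-zero coefficient row picks up
    # a nonzero right-hand side, raising the rank of the augmented matrix
    A3 = np.array([[1, 2, 3, 1], [0, 4, 5, 1], [0, 0, 0, 1]])

    print(np.linalg.matrix_rank(A1))  # 3
    print(np.linalg.matrix_rank(A2))  # 2
    print(np.linalg.matrix_rank(A3))  # 3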

• Let A be the coefficient matrix and Â = [A|b] be the augmented matrix. Then

  1. rank A ≤ rank Â    Augmenting A with b can never result in more zero rows than
                        in A itself: if row i of A is all zeros and bi is non-zero, then
                        row i of Â is non-zero.
  2. rank A ≤ rows A    By definition of “rank.”
  3. rank A ≤ cols A    Suppose there are more rows than columns (otherwise the
                        previous rule applies). Each column can contain at most one
                        pivot. By pivoting, all other entries in a column below the
                        pivot are zeroed. Hence, there will only be as many non-zero
                        rows as pivots, which is at most the number of columns.

• Existence of Solutions:

  1. Exactly one solution:   rank A = rank Â = rows A = cols A
     A necessary condition for a system to have a unique solution is that there be exactly
     as many equations as unknowns.

  2. Infinite solutions:     rank A = rank Â and cols A > rank A
     If a system has a solution and has more unknowns than equations, then it has infinitely
     many solutions.

  3. No solution:            rank A < rank Â
     Then there is a zero row i in A's reduced echelon form that corresponds to a non-zero
     row i in Â's reduced echelon form. Row i of Â translates to the equation

        0 x1 + 0 x2 + · · · + 0 xn = b'i

     where b'i ≠ 0. Hence the system has no solution.

• Exercises:

1.
x − 3y = −3
2x + y = 8

2.
x + 2y + 3z = 6
2x − 3y + 2z = 14
3x + y − z = −2

3.
x + 2y − 3z = −4
2x + y − 3z = 4

4.
x1 + 2x2 − 3x4 + x5 = 2
x1 + 2x2 + x3 − 3x4 + x5 + 2x6 = 3
x1 + 2x2 − 3x4 + 2x5 + x6 = 4
3x1 + 6x2 + x3 − 9x4 + 4x5 + 3x6 = 9

5.
x + 2y + 3z + 4w = 5
x + 3y + 5z + 7w = 11
x − z − 2w = −6

1.7.3 The Inverse of a Matrix

• Inverse Matrix: An n × n matrix A is nonsingular or invertible if there exists an n × n
  matrix A^-1 such that

        A A^-1 = A^-1 A = In

  A^-1 is the inverse of A. If there is no such A^-1, then A is singular or noninvertible.

• Example: Let

        A = [ 2  3 ],    B = [ −1   3/2 ]
            [ 2  2 ]         [  1   −1  ]

  Since

        AB = BA = In

  we conclude that B is the inverse, A^-1, of A and that A is nonsingular.

• Properties of the Inverse:

  1. If the inverse exists, it is unique.
  2. A nonsingular =⇒ A^-1 nonsingular, with (A^-1)^-1 = A
  3. A and B nonsingular =⇒ AB nonsingular, with (AB)^-1 = B^-1 A^-1
  4. A nonsingular =⇒ (A^T)^-1 = (A^-1)^T

• Procedure to Find A^-1: We know that if B is the inverse of A, then

AB = BA = In

Looking only at the first and last parts of this

AB = In

Solving for B is equivalent to solving for n linear systems, where each column of B is solved for
the corresponding column in In . In performing Gauss-Jordan elimination for each individual
system, the same row operations will be performed on A regardless of the column of B and
In . Hence, we can solve the systems simultaneously by augmenting A with In and performing
Gauss-Jordan elimination on A. Note that for the square matrix A, Gauss-Jordan elimination
should result in A becoming row equivalent to In . Therefore, if Gauss-Jordan elimination on
[A|In ] results in [In |B], then B is the inverse of A. Otherwise, A is singular.

To summarize: To calculate the inverse of A,

  1. Form the augmented matrix [A|In].
  2. Using elementary row operations, transform the augmented matrix to reduced row ech-
     elon form.
  3. The result of step 2 is an augmented matrix [C|B].
     (a) If C = In, then B = A^-1.
     (b) If C ≠ In, then C has a row of zeros; A is singular and A^-1 does not exist.

• Exercise: Find the inverse of (a sketch of this computation appears below)

        A = [ 1  1  1 ]
            [ 0  2  3 ]
            [ 5  5  1 ]
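• A sketch of the procedure in Python using SymPy, which performs exact Gauss-Jordan
  elimination via .rref() (the exercise matrix is used as input):

    from sympy import Matrix, eye

    A = Matrix([[1, 1, 1],
                [0, 2, 3],
                [5, 5, 1]])

    aug = A.row_join(eye(3))      # form the augmented matrix [A | I3]
    R, pivots = aug.rref()        # Gauss-Jordan elimination
    C, B = R[:, :3], R[:, 3:]     # split the result [C | B]

    if C == eye(3):
        print("A is nonsingular; inverse:", B)   # B = A^-1
    else:
        print("A is singular")

  A check against A.inv() (or against the determinant formula below) should give the same
  matrix.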

1.7.4 Linear Systems and Inverses

• Let's return to the matrix representation of a linear system

        Ax = b

  If A is an n × n matrix, then Ax = b is a system of n equations in n unknowns. Suppose
  A is nonsingular, so that A^-1 exists. To solve this system, we can premultiply each side by
  A^-1 and reduce it as follows:

        A^-1 (Ax) = A^-1 b
        (A^-1 A) x = A^-1 b
        In x = A^-1 b
        x = A^-1 b

  Hence, given A and b, and given that A is nonsingular, x = A^-1 b is the unique solution
  to the system.

• Notice also that the requirements for A to be nonsingular correspond to the requirements for
  a linear system to have a unique solution: rank A = rows A = cols A.

1.7.5 Determinants

• Singularity: Determinants can be used to determine whether a square matrix is nonsingular:
  a square matrix is nonsingular iff its determinant is not zero.

• Determinants are defined inductively:

  1. Let A = a be 1 × 1. We want the determinant to equal zero when the inverse does not
     exist. Since the inverse of a, 1/a, does not exist when a = 0, we let the determinant
     of a be

        |a| = a

  2. For a 2 × 2 matrix A = [a11 a12; a21 a22], A is nonsingular only if a11 a22 − a12 a21 ≠ 0
     (check by doing Gauss-Jordan to find the inverse of a 2 × 2 matrix). We then define the
     determinant of a 2 × 2 matrix A as

        |a11 a12; a21 a22| = a11 a22 − a12 a21 = a11 |a22| − a12 |a21|

  3. Extending this to a 3 × 3 matrix, we get

        |a11 a12 a13; a21 a22 a23; a31 a32 a33|
            = a11 |a22 a23; a32 a33| − a12 |a21 a23; a31 a33| + a13 |a21 a22; a31 a32|

  4. Let's extend this now to any n × n matrix. Let Aij be the (n − 1) × (n − 1) submatrix
     of A obtained by deleting row i and column j. Let the (i, j)th minor of A be

        Mij = |Aij|

     Then for any n × n matrix A,

        |A| = a11 M11 − a12 M12 + · · · + (−1)^(n+1) a1n M1n

• Example: Does the following matrix have an inverse?

        A = [ 1  1  1 ]
            [ 0  2  3 ]
            [ 5  5  1 ]

  1. Calculate its determinant.

        |A| = 1 |2 3; 5 1| − 1 |0 3; 5 1| + 1 |0 2; 5 5|
            = 1(2 − 15) − 1(0 − 15) + 1(0 − 10)
            = −13 + 15 − 10
            = −8

  2. Since |A| ≠ 0, we conclude that A has an inverse.
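• The inductive definition translates directly into a recursive function. A minimal Python
  sketch (expansion along the first row, exactly as in the formula above):

    def det(A):
        """Determinant by cofactor expansion along the first row."""
        n = len(A)
        if n == 1:
            return A[0][0]
        total = 0
        for j in range(n):
            # minor: delete row 0 and column j
            minor = [row[:j] + row[j + 1:] for row in A[1:]]
            total += (-1) ** j * A[0][j] * det(minor)
        return total

    A = [[1, 1, 1],
         [0, 2, 3],
         [5, 5, 1]]
    print(det(A))  # -8, matching the calculation above

  The recursion is fine for small matrices but costs O(n!) operations; in practice determinants
  are computed from a triangular factorization, as the next point suggests.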
• Triangular or Diagonal Matrices: For any upper-triangular, lower-triangular, or diagonal
  matrix, the determinant is just the product of the diagonal terms.

• Example: Suppose we have the following square matrix in row echelon form (i.e., upper
  triangular):

        R = [ r11  r12  r13 ]
            [  0   r22  r23 ]
            [  0    0   r33 ]

  Then

        |R| = r11 |r22 r23; 0 r33| = r11 r22 r33

• Properties of Determinants:

   1. |A| = |A^T|
   2. If B results from A by interchanging two rows, then |B| = −|A|.
   3. If two rows of A are equal, then |A| = 0. (Notice that in this case rank A ≠ rows A,
      which violates one of the conditions for the existence of a unique solution.)
   4. If a row of A consists of all zeros, then |A| = 0. (Same as 3.)
   5. If B is obtained by multiplying a row of A by a scalar s, then |B| = s|A|.
   6. If B is obtained from A by adding to the ith row of A the jth row (i ≠ j) multiplied
      by a scalar s, then |B| = |A|. (That is, as long as a row isn't simply multiplied by a
      scalar and left, the determinant remains the same.)
   7. If no row interchanges and no scalar multiplications of a single row are used to compute
      the row echelon form R from the n × n coefficient matrix A, then |A| = |R|. (Implied
      by the previous properties.)
   8. A square matrix is nonsingular iff its determinant ≠ 0. (Implied by the previous
      properties.)
   9. |AB| = |A||B|
  10. If A is nonsingular, then |A| ≠ 0 and |A^-1| = 1/|A|.

1.7.6 Determinants: Formulas for Inverses and Solutions

• Thus far, we have a number of algorithms to

  1. Find the solution of a linear system, and
  2. Find the inverse of a matrix,

  but these remain just that — algorithms. At this point, we have no way of telling how the
  solutions xj change as the parameters aij and bi change, except by changing the values and
  “rerunning” the algorithms.

• With determinants, we can

  1. Provide an explicit formula for the inverse, and
  2. Provide an explicit formula for the solution of an n × n linear system.

  Hence, we can examine how changes in the parameters aij and bi affect the solutions xj.

• The Determinant Formula for the Inverse:

  – Define the (i, j)th cofactor Cij of A as (−1)^(i+j) Mij. Notice that it's just the signed
    (i, j)th minor.
  – Define the adjoint of A as the n × n matrix whose (i, j)th entry is Cji (notice the switch
    in indices!). We'll refer to the adjoint of A as adj A.

  Then the inverse of A is given by the formula

        A^-1 = (1/|A|) adj A = [ C11/|A|  C21/|A|  ···  Cn1/|A| ]
                               [ C12/|A|  C22/|A|  ···  Cn2/|A| ]
                               [    ⋮        ⋮      ⋱      ⋮    ]
                               [ C1n/|A|  C2n/|A|  ···  Cnn/|A| ]

 
• Exercise: Find the inverse of

        A = [ 1  1  1 ]
            [ 0  2  3 ]
            [ 5  5  1 ]
• Cramer's Rule: The Determinant Formula for the Solution of a Linear System:

  – Let Aj be the matrix obtained from A by replacing the jth column of A by b.

    Example:

        A1 = [ b1  a12  ···  a1n ]
             [ b2  a22  ···  a2n ]
             [  ⋮   ⋮         ⋮  ]
             [ bn  an2  ···  ann ]

  Then the unique solution x = (x1, · · · , xn) to the n × n system Ax = b is

        xj = |Aj| / |A|

• Exercise: Find the solution of the following system (a numerical check via Cramer's rule
  appears below):

        −2x1 + 3x2 − x3 = 1
          x1 + 2x2 − x3 = 4
        −2x1 −  x2 + x3 = −3
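• A minimal Python sketch of Cramer's rule applied to this exercise (column replacement is
  done with NumPy, and np.linalg.det computes the determinants):

    import numpy as np

    A = np.array([[-2, 3, -1],
                  [ 1, 2, -1],
                  [-2, -1, 1]], dtype=float)
    b = np.array([1, 4, -3], dtype=float)

    detA = np.linalg.det(A)           # -2, so a unique solution exists
    x = np.empty(3)
    for j in range(3):
        Aj = A.copy()
        Aj[:, j] = b                  # replace column j with b
        x[j] = np.linalg.det(Aj) / detA

    print(x)                          # [2. 3. 4.]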

1.8 Unconstrained Optimization

Today’s Topics10 : • Taylor Series Approximation • Quadratic Forms • Definiteness of Quadratic


Forms • Maxima and Minima in Rn • First Order Conditions • Second Order Conditions • Global
Maxima and Minima
1.8.1 Taylor Series Approximation

• Taylor series are used commonly to represent functions as infinite series of the function’s
derivatives at some point a. One can thus approximate functions by using lower-order, finite
series known as Taylor polynomials. If a = 0, the series is called a Maclaurin series.
• Specifically, a Taylor series of a real or complex function f (x) that is infinitely differentiable
in the neighborhood of point a is:

        Σ_{n=0}^∞ (f^(n)(a)/n!) (x − a)^n = f(a) + f'(a)(x − a)/1! + f''(a)(x − a)^2/2! + · · ·

• We can often approximate the curvature of a function f (x) at point a using a 2nd order
Taylor polynomial around point a:

        f(x) = f(a) + f'(a)(x − a)/1! + f''(a)(x − a)^2/2! + R2

  R2 is the Lagrange remainder and is often treated as negligible, giving us:

        f(x) ≈ f(a) + f'(a)(x − a) + f''(a)(x − a)^2/2
• Taylor series expansion is easily generalized to multiple dimensions.
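• A small symbolic illustration of the second order approximation (a minimal sketch using
  SymPy; the function e^x and the point a = 0 are arbitrary choices for illustration):

    import sympy as sp

    x = sp.symbols('x')
    f = sp.exp(x)

    # second order Taylor polynomial about a = 0: terms up to (x - 0)^2
    p2 = f.series(x, 0, 3).removeO()   # 1 + x + x**2/2

    # compare the approximation with the true value near a
    print(p2.subs(x, 0.1), f.subs(x, 0.1).evalf())  # 1.105 vs about 1.10517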

1.8.2 Quadratic Forms

• Quadratic forms are important because

  1. They approximate local curvature around a point — e.g., used to identify whether a
     critical point is a max, a min, or a saddle point.
  2. They are simple, and therefore easy to deal with.
  3. They have a matrix representation.
• Quadratic Form: A polynomial where each term is a monomial of degree 2:

        Q(x1, · · · , xn) = Σ_{i≤j} aij xi xj

  which can be written in matrix terms as

        Q(x) = [x1 x2 · · · xn] [ a11     a12/2   ···   a1n/2 ] [x1]
                                [ a12/2   a22     ···   a2n/2 ] [x2]
                                [   ⋮       ⋮      ⋱      ⋮   ] [ ⋮]
                                [ a1n/2   a2n/2   ···   ann   ] [xn]

  or

        Q(x) = x^T A x

¹⁰ Much of the material and examples for this lecture are taken from Simon & Blume (1994) Mathematics for
Economists and Ecker & Kupferschmid (1988) Introduction to Operations Research.

• Examples:

  1. Quadratic on R2:

        Q(x1, x2) = [x1 x2] [ a11     a12/2 ] [x1]
                            [ a12/2   a22   ] [x2]

                  = a11 x1^2 + a12 x1 x2 + a22 x2^2

  2. Quadratic on R3:

        Q(x1, x2, x3) = [x1 x2 x3] [ a11     a12/2   a13/2 ] [x1]
                                   [ a12/2   a22     a23/2 ] [x2]
                                   [ a13/2   a23/2   a33   ] [x3]

                      = a11 x1^2 + a22 x2^2 + a33 x3^2 + a12 x1 x2 + a13 x1 x3 + a23 x2 x3

1.8.3 Definiteness of Quadratic Forms

• Definiteness helps identify the curvature of Q(x) at x.

• Definiteness: By definition, Q(x) = 0 at x = 0. The definiteness of the matrix A is
  determined by whether the quadratic form Q(x) = x^T A x is greater than zero, less than
  zero, or sometimes both over all x ≠ 0:

  1. Positive Definite        x^T A x > 0, ∀x ≠ 0                 Min
  2. Positive Semidefinite    x^T A x ≥ 0, ∀x ≠ 0
  3. Negative Definite        x^T A x < 0, ∀x ≠ 0                 Max
  4. Negative Semidefinite    x^T A x ≤ 0, ∀x ≠ 0
  5. Indefinite               x^T A x > 0 for some x ≠ 0 and      Neither
                              x^T A x < 0 for other x ≠ 0
• Examples:

  1. Positive Definite:

        Q(x) = x^T [ 1  0 ] x = x1^2 + x2^2
                   [ 0  1 ]

  2. Positive Semidefinite:

        Q(x) = x^T [  1  −1 ] x = (x1 − x2)^2
                   [ −1   1 ]

  3. Indefinite:

        Q(x) = x^T [ 1   0 ] x = x1^2 − x2^2
                   [ 0  −1 ]

1.8.4 Test for Definiteness using Principal Minors

• Given an n × n matrix A, kth order principal minors are the determinants of the k × k
submatrices along the diagonal obtained by deleting n − k columns and the same n − k rows
from A.

• Example: For a 3 × 3 matrix A,

  1. First order principal minors:

        |a11|, |a22|, |a33|

  2. Second order principal minors:

        |a11 a12; a21 a22|,   |a11 a13; a31 a33|,   |a22 a23; a32 a33|

  3. Third order principal minor: |A|

• Define the kth leading principal minor Mk as the determinant of the k × k submatrix
  obtained by deleting the last n − k rows and columns from A.

• Example: For a 3 × 3 matrix A, the three leading principal minors are

        M1 = |a11|,   M2 = |a11 a12; a21 a22|,   M3 = |a11 a12 a13; a21 a22 a23; a31 a32 a33|

• Algorithm: If A is an n × n symmetric matrix, then

  1. Mk > 0 for k = 1, . . . , n                      =⇒  Positive Definite
  2. Mk < 0 for odd k and Mk > 0 for even k           =⇒  Negative Definite
  3. Mk ≠ 0 for k = 1, . . . , n, but the minors      =⇒  Indefinite
     do not fit the pattern of 1 or 2

• If some leading principal minor is zero, but all others fit the pattern of the preceding
  conditions 1 or 2, then

  1. Every principal minor ≥ 0                        =⇒  Positive Semidefinite
  2. Every principal minor of odd order ≤ 0 and
     every principal minor of even order ≥ 0          =⇒  Negative Semidefinite
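• The algorithm above is mechanical enough to code directly. A minimal Python sketch for
  the definite and indefinite cases (the semidefinite cases require all principal minors, not just
  the leading ones, so they are reported as inconclusive here):

    import numpy as np

    def classify(A):
        """Classify a symmetric matrix by its leading principal minors."""
        A = np.asarray(A, dtype=float)
        n = A.shape[0]
        # M[k-1] is the kth leading principal minor, k = 1, ..., n
        M = [np.linalg.det(A[:k, :k]) for k in range(1, n + 1)]
        if all(m > 0 for m in M):
            return "positive definite"
        if all(m < 0 if k % 2 == 1 else m > 0 for k, m in enumerate(M, start=1)):
            return "negative definite"
        if all(m != 0 for m in M):
            return "indefinite"
        return "inconclusive: check all principal minors for semidefiniteness"

    print(classify([[1, 0], [0, 1]]))    # positive definite
    print(classify([[1, 0], [0, -1]]))   # indefinite
    print(classify([[-1, 0], [0, -1]]))  # negative definite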

1.8.5 Maxima and Minima in Rn

• Conditions for Extrema: The conditions for extrema are similar to those for functions on
  R1. Let f(x) be a function of n variables, and let B(x*, ε) be the ε-ball about the point x*.
  Then

  1. f(x*) > f(x), ∀x ∈ B(x*, ε), x ≠ x*    =⇒  Strict Local Max
  2. f(x*) ≥ f(x), ∀x ∈ B(x*, ε)            =⇒  Local Max
  3. f(x*) < f(x), ∀x ∈ B(x*, ε), x ≠ x*    =⇒  Strict Local Min
  4. f(x*) ≤ f(x), ∀x ∈ B(x*, ε)            =⇒  Local Min

1.8.6 First Order Conditions

• When we examined functions of one variable x, we found critical points by taking the first
derivative, setting it to zero, and solving for x. For functions of n variables, the critical points
are found in much the same way, except now we set the partial derivatives equal to zero.11
• Given a function f(x) in n variables, the gradient ∇f(x) is a column vector, where the ith
  element is the partial derivative of f(x) with respect to xi:

        ∇f(x) = ( ∂f(x)/∂x1, ∂f(x)/∂x2, . . . , ∂f(x)/∂xn )'

• x∗ is a critical point iff ∇f (x∗ ) = 0.


• Example: Find the critical points of f(x) = (x1 − 1)^2 + x2^2 + 1

  1. The partial derivatives of f(x) are

        ∂f(x)/∂x1 = 2(x1 − 1)
        ∂f(x)/∂x2 = 2 x2

  2. Setting each partial equal to zero and solving for x1 and x2, we find that there's a critical
     point at x* = (1, 0).
¹¹ We will only consider critical points on the interior of a function's domain.

1.8.7 Second Order Conditions

• When we found a critical point for a function of one variable, we used the second derivative
as an indicator of the curvature at the point in order to determine whether the point was a
min, max, or saddle. For functions of n variables, we use second order partial derivatives as
an indicator of curvature.

• Given a function f(x) of n variables, the Hessian H(x) is an n × n matrix, where the (i, j)th
  element is the second order partial derivative of f(x) with respect to xi and xj:

        H(x) = [ ∂^2 f(x)/∂x1^2      ∂^2 f(x)/∂x1∂x2   ···   ∂^2 f(x)/∂x1∂xn ]
               [ ∂^2 f(x)/∂x2∂x1    ∂^2 f(x)/∂x2^2     ···   ∂^2 f(x)/∂x2∂xn ]
               [        ⋮                  ⋮            ⋱           ⋮        ]
               [ ∂^2 f(x)/∂xn∂x1    ∂^2 f(x)/∂xn∂x2    ···   ∂^2 f(x)/∂xn^2  ]

• Curvature and The Taylor Polynomial as a Quadratic Form: The Hessian is used in
a Taylor polynomial approximation to f (x) and provides information about the curvature of
f (x) at x — e.g., which tells us whether a critical point x∗ is a min, max, or saddle point.

1. The second order Taylor polynomial about the critical point x* is

        f(x* + h) = f(x*) + ∇f(x*)^T h + (1/2) h^T H(x*) h + R(h)

2. Since we're looking at a critical point, ∇f(x*) = 0; and for small h, R(h) is negligible.
   Rearranging, we get

        f(x* + h) − f(x*) ≈ (1/2) h^T H(x*) h

3. The RHS is a quadratic form, so we can determine the definiteness of H(x*).

(a) If H(x*) is positive definite, then the RHS is positive for all small h:

        f(x* + h) − f(x*) > 0  =⇒  f(x* + h) > f(x*)

    i.e., f(x*) < f(x) for all x ∈ B(x*, ε), so x* is a strict local min.

(b) Conversely, if H(x*) is negative definite, then the RHS is negative for all small h:

        f(x* + h) − f(x*) < 0  =⇒  f(x* + h) < f(x*)

    i.e., f(x*) > f(x) for all x ∈ B(x*, ε), so x* is a strict local max.

• Summary of Second Order Conditions: Given a function f(x) and a point x* such that
  ∇f(x*) = 0,

  1. H(x*) Positive Definite                        =⇒  Strict Local Min
  2. H(x) Positive Semidefinite ∀x ∈ B(x*, ε)       =⇒  Local Min
  3. H(x*) Negative Definite                        =⇒  Strict Local Max
  4. H(x) Negative Semidefinite ∀x ∈ B(x*, ε)       =⇒  Local Max
  5. H(x*) Indefinite                               =⇒  Saddle Point

• Example: We found that the only critical point of f(x) = (x1 − 1)^2 + x2^2 + 1 is at
  x* = (1, 0). Is it a min, max, or saddle point?

  1. Recall that the gradient of f(x) is

        ∇f(x) = ( 2(x1 − 1), 2 x2 )'

     Then the Hessian is

        H(x) = [ 2  0 ]
               [ 0  2 ]

  2. To check the definiteness of H(x*), we could use either of two methods:

     (a) Determine whether x^T H(x*) x is greater or less than zero for all x ≠ 0:

            x^T H(x*) x = [x1 x2] [ 2  0 ] [x1]  =  2 x1^2 + 2 x2^2
                                  [ 0  2 ] [x2]

         For any x ≠ 0, 2(x1^2 + x2^2) > 0, so the Hessian is positive definite and x* is a
         strict local minimum.

     (b) Using the method of leading principal minors, we see that M1 = 2 and M2 = 4. Since
         both are positive, the Hessian is positive definite and x* is a strict local minimum.

1.8.8 Global Maxima and Minima

• To determine whether a critical point is a global min or max, we can check the concavity
of the function over its entire domain. Here again we use the definiteness of the Hessian to
determine whether a function is globally concave or convex:
1. H(x) Positive Semidefinite ∀x =⇒ Globally Convex
2. H(x) Negative Semidefinite ∀x =⇒ Globally Concave
Notice that the definiteness conditions must be satisfied over the entire domain.
• Given a function f (x) and a point x∗ such that ∇f (x∗ ) = 0,
1. f (x) Globally Convex =⇒ Global Min
2. f (x) Globally Concave =⇒ Global Max
• Note that showing that H(x∗ ) is negative semidefinite is not enough to guarantee x∗ is a local
max. However, showing that H(x) is negative semidefinite for all x guarantees that x∗ is a
global max. (The same goes for positive semidefinite and minima.)
• Example: Take f1(x) = x^4 and f2(x) = −x^4. Both have x = 0 as a critical point. Unfortu-
  nately, f1''(0) = 0 and f2''(0) = 0, so we can't tell whether x = 0 is a min or max for either.
  However, f1''(x) = 12x^2 and f2''(x) = −12x^2. For all x, f1''(x) ≥ 0 and f2''(x) ≤ 0 — i.e.,
  f1(x) is globally convex and f2(x) is globally concave. So x = 0 is a global min of f1(x) and
  a global max of f2(x).

1.8.9 One More Example

• Given f(x) = x1^3 − x2^3 + 9 x1 x2, find any maxima or minima.

  1. First order conditions. Set the gradient equal to zero and solve for x1 and x2:

        ∂f/∂x1 = 3 x1^2 + 9 x2 = 0
        ∂f/∂x2 = −3 x2^2 + 9 x1 = 0

     We have two equations in two unknowns. Solving for x1 and x2, we get two critical
     points: x*1 = (0, 0) and x*2 = (3, −3).

  2. Second order conditions. Determine whether the Hessian is positive or negative definite.
     The Hessian is

        H(x) = [ 6 x1    9   ]
               [  9    −6 x2 ]

     Evaluated at x*1 = (0, 0),

        H(x*1) = [ 0  9 ]
                 [ 9  0 ]

     The two leading principal minors are M1 = 0 and M2 = −81, so H(x*1) is indefinite and
     x*1 = (0, 0) is a saddle point.

     Evaluated at x*2 = (3, −3),

        H(x*2) = [ 18   9 ]
                 [  9  18 ]

     The two leading principal minors are M1 = 18 and M2 = 243. Since both are positive,
     H(x*2) is positive definite and x*2 = (3, −3) is a strict local min.

  3. Global concavity/convexity. In evaluating the Hessians for x*1 and x*2 we saw that the
     Hessian is not everywhere positive semidefinite. Hence, we can't infer that x*2 = (3, −3)
     is a global minimum. In fact, if we set x1 = 0, then f(x) = −x2^3, which goes to −∞ as
     x2 → ∞. (A symbolic check of these calculations appears below.)
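• The symbolic check (a minimal sketch with SymPy; sp.hessian builds the matrix of second
  order partials, and the eigenvalues give the definiteness at each critical point):

    import sympy as sp

    x1, x2 = sp.symbols('x1 x2', real=True)
    f = x1**3 - x2**3 + 9*x1*x2

    grad = [sp.diff(f, v) for v in (x1, x2)]
    # the real solutions: {x1: 0, x2: 0} and {x1: 3, x2: -3}
    crits = sp.solve(grad, [x1, x2], dict=True)

    H = sp.hessian(f, (x1, x2))
    for c in crits:
        Hc = H.subs(c)
        # all eigenvalues > 0 -> positive definite (min); mixed signs -> saddle
        print(c, list(Hc.eigenvals()))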

1.9 Constrained Optimization

Today’s Topics12 :
• Constrained Optimization • Equality Constraints • Inequality Constraints • Kuhn-Tucker
Conditions
1.9.1 Constrained Optimization

• We have already looked at optimizing a function in one or more dimensions over the whole
domain of the function. Often, however, we want to find the maximum or minimum of a
function over some restricted part of its domain.
• In any constrained optimization problem, the constrained maximum will always be less than
  or equal to the unconstrained maximum. If the constrained maximum is less than the
  unconstrained maximum, then the constraint is binding.
• For a function f (x1 , . . . , xn ), there are two types of constraints that can be imposed:
1. Equality constraints: constraints of the form ck (x1 , . . . , xn ) = rk . Budget constraints
are the classic example of equality constraints in social science.
2. Inequality constraints: constraints of the form gm (x1 , . . . , xn ) ≤ bm . These might arise
from non-negativity constraints or other threshold effects.
• When working with constrained optimization problems, always make sure that the set of
  constraints is not pathological; it must be possible for all of the constraints to be satisfied
  simultaneously.
• Example: Maximize f (x1 , x2 ) = −(x21 + 2x22 ) subject to the constraint that x1 + x2 = 4. It
is easy to see that the unconstrained maximum occurs at (x1 , x2 ) = (0, 0), but that does not
satisfy the constraint. How should we proceed?

1.9.2 Equality Constraints

• Equality constraints are the easiest to deal with because we know that the maximum or
minimum has to lie on the (intersection of the) constraint(s).
• The trick is to change the problem from a constrained optimization problem in n variables
to an unconstrained optimization problem in n + k variables, adding one variable for each
equality constraint.
• Lagrangian function: We define the Lagrangian function L(x1, . . . , xn, λ1, . . . , λk) as fol-
  lows:

        L(x1, . . . , xn, λ1, . . . , λk) = f(x1, . . . , xn) − Σ_{i=1}^k λi (ci(x1, . . . , xn) − ri)

  Occasionally, you may see the following form of the Lagrangian, which is equivalent:

        L(x1, . . . , xn, λ1, . . . , λk) = f(x1, . . . , xn) + Σ_{i=1}^k λi (ri − ci(x1, . . . , xn))

The λ terms are known as Lagrange multipliers.


¹² Much of the material and examples for this lecture are taken from Gill (2006) Essential Mathematics for Political
and Social Research and Simon & Blume (1994) Mathematics for Economists.

• To find the critical points, we take the partial derivatives of L(x1, . . . , xn, λ1, . . . , λk) with
  respect to each of its variables. At a critical point, each of these partial derivatives must be
  equal to zero, so we obtain a system of n + k equations in n + k unknowns:

        ∂L/∂x1 = ∂f/∂x1 − Σ_{i=1}^k λi ∂ci/∂x1 = 0
           ⋮
        ∂L/∂xn = ∂f/∂xn − Σ_{i=1}^k λi ∂ci/∂xn = 0
        ∂L/∂λ1 = c1(x1, . . . , xn) − r1 = 0
           ⋮
        ∂L/∂λk = ck(x1, . . . , xn) − rk = 0
• Some caveats apply. There may be more than one critical point. Analogs to second-order
conditions for unconstrained optimization exist, or it may suffice to check the critical points
individually. There are also conditions on the behavior of the constraints at critical points;
these are typically satisfied with non-pathological linear constraints.
• Example: Maximize

        f(x) = −(x1^2 + 2 x2^2)   s.t.   x1 + x2 = 4

  1. Begin by writing the Lagrangian:

        L(x1, x2, λ) = −(x1^2 + 2 x2^2) − λ(x1 + x2 − 4)

  2. Take the partial derivatives and set them equal to zero:

        ∂L/∂x1 = −2 x1 − λ = 0
        ∂L/∂x2 = −4 x2 − λ = 0
        ∂L/∂λ = −(x1 + x2 − 4) = 0

  3. The only solution to this system of linear equations occurs at (x1, x2, λ) =
     (8/3, 4/3, −16/3). Therefore, the only critical point occurs when x1 = 8/3 and
     x2 = 4/3. This gives f(8/3, 4/3) = −96/9, which is less than the unconstrained optimum
     f(0, 0) = 0. (A symbolic check appears below.)
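• The symbolic check (a minimal SymPy sketch; the critical point of the Lagrangian is found
  by solving the three first order conditions simultaneously):

    import sympy as sp

    x1, x2, lam = sp.symbols('x1 x2 lam', real=True)
    L = -(x1**2 + 2*x2**2) - lam*(x1 + x2 - 4)

    foc = [sp.diff(L, v) for v in (x1, x2, lam)]
    sol = sp.solve(foc, [x1, x2, lam], dict=True)
    print(sol)              # [{x1: 8/3, x2: 4/3, lam: -16/3}]

    f = -(x1**2 + 2*x2**2)
    print(f.subs(sol[0]))   # -32/3, i.e., -96/9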

1.9.3 Inequality Constraints

• Inequality constraints are more challenging because we do not know ahead of time which
constraints will be binding and which will not. Inequality constraints define the boundary
of a region over which we seek to optimize the function. The maximum/minimum could lie
along one of the constraints, or it could be in the interior of the region.

• Again, one way to deal with this problem is by introducing more variables in order to turn
the problem into an unconstrained optimization.

• Slack: For each inequality constraint gi(x1, . . . , xn) ≤ bi, we define a slack variable si^2
  such that the constraint holds with equality: gi(x1, . . . , xn) + si^2 = bi. The slack variable
  captures how close the constraint comes to binding. We use si^2 rather than si to ensure
  that the slack is nonnegative.

• The Lagrangian function in this case is written as

        L(x1, . . . , xn, λ1, . . . , λm, s1, . . . , sm)
            = f(x1, . . . , xn) − Σ_{i=1}^m λi (gi(x1, . . . , xn) + si^2 − bi)

• To find the critical points, we now need to take the partials with respect to each x, λ, and s.
  This gives us n + 2m equations in n + 2m unknowns:

        ∂L/∂x1 = ∂f/∂x1 − Σ_{i=1}^m λi ∂gi/∂x1 = 0
           ⋮
        ∂L/∂xn = ∂f/∂xn − Σ_{i=1}^m λi ∂gi/∂xn = 0
        ∂L/∂λ1 = g1(x1, . . . , xn) + s1^2 − b1 = 0
           ⋮
        ∂L/∂λm = gm(x1, . . . , xn) + sm^2 − bm = 0
        ∂L/∂s1 = 2 s1 λ1 = 0
           ⋮
        ∂L/∂sm = 2 sm λm = 0
∂sm

• Complementary slackness: The last set of first order conditions, of the form 2 si λi = 0,
  are known as complementary slackness conditions. These conditions can be satisfied in one
  of three ways:

  1. λi = 0 and si ≠ 0: The slack is positive, so the constraint does not bind.
  2. λi ≠ 0 and si = 0: There is no slack, so the constraint binds.
  3. λi = 0 and si = 0: There is no slack, but the constraint binds trivially, without changing
     the optimum.

• Example: Find the critical points for the following constrained optimization:

        f(x) = −(x1^2 + 2 x2^2)
        s.t.  x1 + x2 ≤ 4
              x1 ≥ 0
              x2 ≥ 0

1. Begin by writing the Lagrangian:

        L(x1, x2, λ1, λ2, λ3, s1, s2, s3)
            = −(x1^2 + 2 x2^2) − λ1(x1 + x2 + s1^2 − 4) − λ2(−x1 + s2^2) − λ3(−x2 + s3^2)

  2. Take the partial derivatives and set them equal to zero:

        ∂L/∂x1 = −2 x1 − λ1 + λ2 = 0
        ∂L/∂x2 = −4 x2 − λ1 + λ3 = 0
        ∂L/∂λ1 = −(x1 + x2 + s1^2 − 4) = 0
        ∂L/∂λ2 = −(−x1 + s2^2) = 0
        ∂L/∂λ3 = −(−x2 + s3^2) = 0
        ∂L/∂s1 = 2 s1 λ1 = 0
        ∂L/∂s2 = 2 s2 λ2 = 0
        ∂L/∂s3 = 2 s3 λ3 = 0

  3. This is a huge mess: a system of 8 non-linear equations. We only have to look at the
     various ways that we can satisfy the complementary slackness conditions:

        Hypothesis                   s1      s2       s3      λ1     λ2     λ3    x1    x2   f(x1, x2)
        s1 = s2 = s3 = 0             No solution
        s1 ≠ 0, s2 = s3 = 0          2       0        0        0      0      0    0     0       0
        s2 ≠ 0, s1 = s3 = 0          0       2        0       −8      0     −8    4     0     −16
        s3 ≠ 0, s1 = s2 = 0          0       0        2      −16    −16      0    0     4     −32
        s1 ≠ 0, s2 ≠ 0, s3 = 0       No solution
        s1 ≠ 0, s3 ≠ 0, s2 = 0       No solution
        s2 ≠ 0, s3 ≠ 0, s1 = 0       0     √(8/3)   √(4/3)  −16/3     0      0   8/3   4/3    −96/9
        s1 ≠ 0, s2 ≠ 0, s3 ≠ 0       No solution

4. This method has identified the four critical points of the function in the region consistent
with the constraints. The constrained maximum is located at (x1 , x2 ) = (0, 0), which is
the same as the unconstrained max. The constrained minimum is located at (x1 , x2 ) =
(0, 4), while there is no unconstrained minimum for this problem.
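• The case-by-case analysis above can be automated: for each constraint, either its λi is zero
  (slack) or the constraint holds with equality (binding). A minimal SymPy sketch that
  enumerates all 2^3 cases for this example (feasibility of the non-binding constraints still
  needs to be checked by hand):

    import itertools
    import sympy as sp

    x1, x2, l1, l2, l3 = sp.symbols('x1 x2 l1 l2 l3', real=True)
    g = [x1 + x2 - 4, -x1, -x2]          # constraints written as g_i <= 0
    lam = [l1, l2, l3]
    stationarity = [-2*x1 - l1 + l2, -4*x2 - l1 + l3]
    f = -(x1**2 + 2*x2**2)

    for binding in itertools.product([False, True], repeat=3):
        eqs = list(stationarity)
        for gi, li, bind in zip(g, lam, binding):
            eqs.append(gi if bind else li)   # g_i = 0 if binding, else lambda_i = 0
        for sol in sp.solve(eqs, [x1, x2, l1, l2, l3], dict=True):
            print(binding, sol, f.subs(sol))

  Infeasible hypotheses (e.g., all three constraints binding at once) simply return no solution,
  reproducing the "No solution" rows of the table.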

1.9.4 Kuhn-Tucker Conditions

• The process described above will identify the critical points of a function subject to some
constraints, but it can be a pain to implement. In particular, explicitly including the non-
negativity constraints makes the problem significantly more complex.

• Kuhn-Tucker conditions: Because the problem of maximizing a function subject to in-
  equality and non-negativity constraints arises frequently in economics, the Kuhn-Tucker
  approach provides a method that often makes it easier both to calculate the critical points
  and to identify points that are (local) maxima.

  1. Setup: We want to maximize a function f(x1, . . . , xn) subject to inequality constraints
     g1(x1, . . . , xn) ≤ b1, . . . , gm(x1, . . . , xn) ≤ bm and non-negativity constraints
     x1, . . . , xn ≥ 0.

  2. Lagrangian function: We use the same Lagrangian as if we were dealing with equality
     constraints (be careful with the signs!):

        L(x1, . . . , xn, λ1, . . . , λm) = f(x1, . . . , xn) − Σ_{i=1}^m λi (gi(x1, . . . , xn) − bi)

  3. Kuhn-Tucker conditions for a maximum:

        ∂L/∂x1 ≤ 0, . . . , ∂L/∂xn ≤ 0
        ∂L/∂λ1 ≥ 0, . . . , ∂L/∂λm ≥ 0
        x1 ≥ 0, . . . , xn ≥ 0
        λ1 ≥ 0, . . . , λm ≥ 0
        x1 ∂L/∂x1 = 0, . . . , xn ∂L/∂xn = 0
        λ1 ∂L/∂λ1 = 0, . . . , λm ∂L/∂λm = 0
4. The last two sets of conditions are analogs to the complementary slackness conditions
discussed in the previous section.
5. Kuhn-Tucker conditions for minimum: To minimize the function f (x1 , . . . , xn ), the sim-
plest thing to do is maximize the function −f (x1 , . . . , xn ); all of the conditions remain
the same after reformulating as a maximization problem.
6. There are additional assumptions (notably, f(x) is quasi-concave and the constraints are
convex) that are sufficient to ensure that a point satisfying the Kuhn-Tucker conditions
is a global max; if these assumptions do not hold, you may have to check more than one
point.

• Example: Consider the example from the previous section; we want to maximize:

        f(x) = −(x1^2 + 2 x2^2)
        s.t.  x1 + x2 ≤ 4
              x1 ≥ 0
              x2 ≥ 0

  1. This time, we begin by writing the Kuhn-Tucker Lagrangian:

        L(x1, x2, λ) = −(x1^2 + 2 x2^2) − λ(x1 + x2 − 4)

  2. The Kuhn-Tucker conditions for this problem are:

        ∂L/∂x1 = −2 x1 − λ ≤ 0
        ∂L/∂x2 = −4 x2 − λ ≤ 0
        ∂L/∂λ = −(x1 + x2 − 4) ≥ 0
        x1 ≥ 0
        x2 ≥ 0
        λ ≥ 0
        x1 ∂L/∂x1 = x1(−2 x1 − λ) = 0
        x2 ∂L/∂x2 = x2(−4 x2 − λ) = 0
        λ ∂L/∂λ = −λ(x1 + x2 − 4) = 0

  3. The same four points are identified as in the previous section: (x1, x2, λ) = (0, 0, 0),
     (4, 0, −8), (0, 4, −16), and (8/3, 4/3, −16/3). Three of these points, however, violate the
     requirement that λ ≥ 0, so the point (x1, x2, λ) = (0, 0, 0) gives the maximum.

• Exercise: Maximize

        f(x) = (1/3) log(x1 + 1) + (2/3) log(x2 + 1)
        s.t.  x1 + 2 x2 ≤ b
              x1 ≥ 0
              x2 ≥ 0
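• For a numerical sanity check of this exercise, one can fix a value of b and hand the problem
  to an off-the-shelf solver. A minimal sketch with SciPy (b = 3 is an arbitrary choice; SLSQP
  handles the inequality and bound constraints directly):

    import numpy as np
    from scipy.optimize import minimize

    b = 3.0
    f = lambda x: -(np.log(x[0] + 1) / 3 + 2 * np.log(x[1] + 1) / 3)  # minimize -f

    res = minimize(
        f,
        x0=[0.5, 0.5],
        method="SLSQP",
        bounds=[(0, None), (0, None)],                       # x1, x2 >= 0
        constraints=[{"type": "ineq", "fun": lambda x: b - x[0] - 2 * x[1]}],
    )
    print(res.x)   # approximately [1. 1.] for b = 3

  Working the Kuhn-Tucker conditions by hand for b = 3 gives the same point, x1 = x2 = 1.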
