Math Notes 12
1 Mathematics Notes
1.1 Functions
• R1 is the set of all real numbers extending from −∞ to +∞ — i.e., the real number line.
• Rn is an n-dimensional space (often referred to as Euclidean space), where each of the n axes
extends from −∞ to +∞.
• Examples:
1. R1 is a line.
2. R2 is a plane.
3. R3 is a 3-D space.
4. R4 could be 3-D plus time.
• Points in Rn are ordered n-tuples, where each element of the n-tuple represents the coordinate
along that dimension.
• In many areas of math, we need a formal construct for what it means to be “near” a point c in
Rn . This is generally called the neighborhood of c and is represented by an open interval,
disk, or ball, depending on whether Rn is of one, two, or more dimensions, respectively. Given
the point c, these are defined as
1. ε-interval in R1 : {x : |x − c| < ε}
The open interval (c − ε, c + ε).
2. ε-disk in R2 : {x : ||x − c|| < ε}
The open interior of the circle centered at c with radius ε.
3. ε-ball in Rn : {x : ||x − c|| < ε}
The open interior of the sphere centered at c with radius ε.
1 Much of the material and examples for this lecture are taken from Simon & Blume (1994) Mathematics for Economists, Boyce & Diprima (1988) Calculus, and Protter & Morrey (1991) A First Course in Real Analysis.
• Interior Point: The point x is an interior point of the set S if x is in S and if there is
some ε-ball around x that contains only points in S. The interior of S is the collection of
all interior points in S. The interior can also be defined as the union of all open sets in S.
Example: The interior of the set {(x, y) : x² + y² ≤ 4} is {(x, y) : x² + y² < 4}.
• Boundary Point: The point x is a boundary point of the set S if every ε-ball around x
contains both points that are in S and points that are outside S. The boundary is the
collection of all boundary points.
Example: The boundary of {(x, y) : x² + y² ≤ 4} is {(x, y) : x² + y² = 4}.
• Open: A set S is open if for each point x in S, there exists an open ε-ball around x completely
contained in S.
Example: {(x, y) : x² + y² < 4}
• Closed: A set S is closed if it contains all of its boundary points.
Example: {(x, y) : x² + y² ≤ 4}
• Note: a set may be neither open nor closed.
Example: {(x, y) : 2 < x² + y² ≤ 4}
• Complement: The complement of set S is everything outside of S.
Example: The complement of {(x, y) : x² + y² ≤ 4} is {(x, y) : x² + y² > 4}.
• Closure: The closure of set S is the smallest closed set that contains S.
Example: The closure of {(x, y) : x² + y² < 4} is {(x, y) : x² + y² ≤ 4}.
• Bounded: A set S is bounded if it can be contained within an ε-ball.
Examples: Bounded: any interval that doesn’t have ∞ or −∞ as endpoints; any disk in a
plane with finite radius. Unbounded: the set of integers in R1 ; any ray.
• Compact: A set is compact if and only if it is both closed and bounded.
• Range: the set of values that f takes on over its domain, f (X) = {y : y = f (x), x ∈ X}.
• Image: same as range, but more often used when talking about a function f : Rn → R1 .
• Examples:
1. f (x) = 3/(1 + x²)
Domain X =
Range f (X) =
2. f (x) = x + 1 for 1 ≤ x ≤ 2; 0 for x = 0; 1 − x for −2 ≤ x ≤ −1
Domain X =
Range f (X) =
3. f (x) = 1/x
Domain X =
Range f (X) =
4. f (x, y) = x² + y²
Domain X, Y =
Image f (X, Y ) =
y = log_a (x) ⇐⇒ a^y = x
The log function can be thought of as an inverse for exponential functions. a is referred to as
the “base” of the logarithm.
• Examples:
1. log(√10) =
2. log(1) =
3. log(10) =
4. log(100) =
5. ln(1) =
6. ln(e) =
• Properties of exponents:
1. a^x a^y = a^(x+y)
2. a^(−x) = 1/a^x
3. a^x / a^y = a^(x−y)
4. (a^x)^y = a^(xy)
5. a^0 = 1
• Use the change of base formula to switch bases as necessary: log_b (x) = log_a (x) / log_a (b)
• Sometimes we’re given a function y = f (x) and we want to find how x varies as a function of
y.
• Use algebra and relationships identified above to move x to the LHS of the equation and so
that the RHS is only a function of y.
1. y = 3x + 2 =⇒ y − 2 = 3x =⇒ x = (1/3)(y − 2)
2. y = 3x − 4z + 2 =⇒ y + 4z − 2 = 3x =⇒ x = (1/3)(y + 4z − 2)
3. y = e^x + 4 =⇒ y − 4 = e^x =⇒ ln(y − 4) = ln(e^x) =⇒ x = ln(y − 4)
• Solving for variables is especially important when we want to find the roots of an equation:
those values of variables that cause an equation to equal zero.
• For quadratic equations ax² + bx + c = 0, use x = (−b ± √(b² − 4ac)) / (2a).
• Examples:
1. f (x) = 3x + 2
2. f (x) = e^(−x) − 10
3. f (x) = x² + 3x − 4 = 0
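The quadratic formula is easy to check numerically. A small Python sketch (ours, not part of the original notes; the helper name quadratic_roots is arbitrary), applied to example 3 above:

```python
import math

def quadratic_roots(a, b, c):
    """Real roots of ax^2 + bx + c = 0 via the quadratic formula."""
    disc = b**2 - 4*a*c
    if disc < 0:
        raise ValueError("no real roots")
    return ((-b + math.sqrt(disc)) / (2*a),
            (-b - math.sqrt(disc)) / (2*a))

# Example 3: x^2 + 3x - 4 = 0 factors as (x + 4)(x - 1).
print(quadratic_roots(1, 3, -4))   # (1.0, -4.0)
```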
3. Σ_{i=1}^n c = nc
• Product: Π_{i=1}^n xi = x1 x2 x3 · · · xn
1. Π_{i=1}^n c xi = c^n Π_{i=1}^n xi
2. Π_{i=1}^n (xi + yi ) = a mess
3. Π_{i=1}^n c = c^n
• We’re often interested in determining if a function f approaches some number L as its inde-
pendent variable x moves to some number c (usually 0 or ±∞). If it does, we say that f (x)
approaches L as x approaches c, or limx→c f (x) = L.
• Limit of a function. Let f be defined at each point in some open interval containing
the point c, although possibly not defined at c itself. Then lim_{x→c} f (x) = L if for any (small
positive) number ε, there exists a corresponding number δ > 0 such that if 0 < |x − c| < δ,
then |f (x) − L| < ε.
• Examples:
1. lim_{x→c} k =
2. lim_{x→c} x =
3. lim_{x→0} |x| =
4. lim_{x→0} (1 + 1/x²) =
• Properties: Let f and g be functions with lim_{x→c} f (x) = A and lim_{x→c} g(x) = B.
• Examples:
1. lim_{x→2} (2x − 3) =
2. lim_{x→c} x^n =
Example: lim_{x→0+} √x = 0
1.1.14 Continuity
• Continuity: Suppose that the domain of the function f includes an open interval containing
the point c. Then f is continuous at c if lim_{x→c} f (x) exists and if lim_{x→c} f (x) = f (c). Further, f is
continuous on an open interval (a, b) if it is continuous at each point in the interval.
[Graphs: f (x) = √x, f (x) = e^x, f (x) = floor(x), f (x) = 1 + 1/x²]
• Properties:
1.2 Calculus I
• Examples:
1. {yn} = {2 − 1/n²} = {1, 7/4, 17/9, 31/16, . . . }
2. {yn} = {(n² + 1)/n} = {2, 5/2, 10/3, . . . }
• Think of sequences like functions. Before, we had y = f (x) with x specified over some domain.
Now we have {yn } = {f (n)} with n = 1, 2, 3, . . ..
• Three kinds of sequences:
1. Sequences like 1 above that converge to a limit.
2. Sequences like 2 above that increase without bound.
3. Sequences like 3 above that neither converge nor increase without bound — alternating
over the number line.
• Boundedness and monotonicity:
1. Bounded: if |yn | ≤ K for all n
2. Monotone Increasing: yn+1 > yn for all n
3. Monotone Decreasing: yn+1 < yn for all n
• Subsequence: choose an infinite collection of entries from {yn }, retaining their order.
2 Much of the material and examples for this lecture are taken from Simon & Blume (1994) Mathematics for Economists and from Boyce & Diprima (1988) Calculus.
• We’re often interested in whether a sequence converges to a limit. Limits of sequences are
conceptually similar to the limits of functions addressed in the previous lecture.
• Definition: (Limit of a sequence). The sequence {yn} has the limit L, that is lim_{n→∞} yn = L,
if for any ε > 0 there is an integer N (which depends on ε) with the property that |yn − L| < ε
for each n > N . {yn} is said to converge to L. If the above does not hold, then {yn} diverges.
• Examples:
1. lim_{n→∞} (2 − 1/n²) = 2
2. lim_{n→∞} 4^n / n! = 0
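Limits of sequences can be explored numerically. A short Python sketch (ours) tabulates the terms of example 2; the factorial eventually dominates the exponential:

```python
import math

# y_n = 4^n / n!  rises at first, then the factorial takes over.
for n in [1, 5, 10, 15, 20]:
    print(n, 4**n / math.factorial(n))
# n = 20 gives ~4.5e-07, consistent with a limit of 0
```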
• Limit of a sequence of vectors. The sequence of vectors {yn} has the limit L, that is
lim_{n→∞} yn = L, if for any ε > 0 there is an integer N where ||yn − L|| < ε for each n > N . The
sequence of vectors {yn} is said to converge to the vector L — and the distances between yn
and L converge to zero.
• Think of each coordinate of the vector yn as being part of its own sequence over n. Then a
sequence of vectors in Rn converges if and only if all n sequences of its components converge.
Examples:
2. The sequence {yn} where yn = (1/n, (−1)^n) does not converge, since {(−1)^n} does not
converge.
1.2.3 Series
• The sum of the terms of a sequence is a series. As there are both finite and infinite sequences,
there are finite and infinite series.
• The series associated with the sequence {yn} = {y1 , y2 , y3 , . . . , yn } = {yn}_{n=1}^∞ is Σ_{n=1}^∞ yn .
The nth partial sum Sn is defined as Sn = Σ_{k=1}^n yk , the sum of the first n terms of the
sequence.
• A series Σ yn converges if the sequence of partial sums {S1 , S2 , S3 , ...} converges, that is, has
a finite limit.
1.2.4 Derivatives
• The derivative of f at x is its rate of change at x — i.e., how much f (x) changes with a
change in x.
– For a line, the derivative is the slope.
– For a curve, the derivative is the tangent at x.
• Derivative: Let f be a function whose domain includes an open interval containing the point
x. The derivative of f at x is given by
f ′(x) = lim_{h→0} [f (x + h) − f (x)] / [(x + h) − x]
      = lim_{h→0} [f (x + h) − f (x)] / h
• Examples:
1. f (x) = c
2. f (x) = x
3. f (x) = x²
4. f (x) = x³
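The limit definition suggests a direct numerical approximation: pick a small h and compute the difference quotient. A minimal Python sketch (ours; diff_quotient is an arbitrary name):

```python
def diff_quotient(f, x, h=1e-6):
    """Approximate f'(x) by (f(x + h) - f(x)) / h for small h."""
    return (f(x + h) - f(x)) / h

# Examples 3 and 4 above: (x^2)' = 2x and (x^3)' = 3x^2, checked at x = 2.
print(diff_quotient(lambda x: x**2, 2.0))   # ~4
print(diff_quotient(lambda x: x**3, 2.0))   # ~12
```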
• Properties of derivatives: Suppose that f and g are differentiable at x and that α is a constant.
Then the functions f ± g, αf , f g, and f /g (provided g(x) ≠ 0) are also differentiable at x.
Additionally,
Power rule: [x^k]′ = kx^(k−1)
Sum rule: [f (x) ± g(x)]′ = f ′(x) ± g ′(x)
Constant rule: [αf (x)]′ = αf ′(x)
Product rule: [f (x)g(x)]′ = f ′(x)g(x) + f (x)g ′(x)
Quotient rule: [f (x)/g(x)]′ = [f ′(x)g(x) − f (x)g ′(x)] / [g(x)]², g(x) ≠ 0
• Examples:
3. f (x) = (x² + 1)/(x² − 1)
• We can keep applying the differentiation process to functions that are themselves derivatives.
The derivative of f ′(x) with respect to x would then be
f ″(x) = lim_{h→0} [f ′(x + h) − f ′(x)] / h
and so on. Similarly, the derivative of f ″(x) would be denoted f ‴(x).
• First derivative: f ′(x), y ′, df (x)/dx, dy/dx
Second derivative: f ″(x), y ″, d²f (x)/dx², d²y/dx²
nth derivative: dⁿf (x)/dxⁿ, dⁿy/dxⁿ
• Example: f (x) = x³, f ′(x) = 3x², f ″(x) = 6x, f ‴(x) = 6, f ⁗(x) = 0
• The first derivative f ′(x) identifies whether the function f (x) at the point x is
1. Increasing: f ′(x) > 0
2. Decreasing: f ′(x) < 0
3. Extremum/Saddle: f ′(x) = 0
• Examples:
1. f (x) = x² + 2, f ′(x) = 2x
• The second derivative f ″(x) identifies whether the function f (x) at the point x is
1. Concave down: f ″(x) < 0
2. Concave up: f ″(x) > 0
• Maximum (Minimum): x0 is a local maximum (minimum) if f (x0 ) > f (x) (f (x0 ) < f (x))
for all x within some open interval containing x0 . x0 is a global maximum (minimum) if
f (x0 ) > f (x) (f (x0 ) < f (x)) for all x in the domain of f .
• Critical points: Given the function f defined over domain D, all of the following are critical
points:
1. Any interior point of D where f ′(x) = 0.
2. Any interior point of D where f ′(x) does not exist.
3. Any endpoint that is in D.
The maxima and minima will be a subset of the critical points.
• Combined, the first and second derivatives can tell us whether a point is a maximum or
minimum of f (x).
Local Maximum: f ′(x) = 0 and f ″(x) < 0
Local Minimum: f ′(x) = 0 and f ″(x) > 0
Need more info: f ′(x) = 0 and f ″(x) = 0
• Global Maxima and Minima. Sometimes no global max or min exists — e.g., f (x) not
bounded above or below. However, there are three situations in which we can fairly easily
identify a global max or min.
1. Functions with only one critical point. If x0 is a local maximum of f and it is the
only critical point, then it is a global maximum.
2. Globally concave up or concave down functions. If f ″ is never zero, then there is
at most one critical point, which is a global maximum if f ″ < 0 and a global minimum
if f ″ > 0.
3. Functions over closed and bounded intervals must have both a global maximum
and a global minimum.
• Examples:
1. f (x) = x² + 2
f ′(x) = 2x
f ″(x) = 2
2. f (x) = x³ + 2
f ′(x) = 3x²
f ″(x) = 6x
• Composite functions are formed by substituting one function into another and are denoted
by
(f ◦ g)(x) = f [g(x)]
To form f [g(x)], the range of g must be contained (at least in part) within the domain of
f . The domain of f ◦ g consists of all the points in the domain of g for which g(x) is in the
domain of f .
• Examples:
1. f (x) = ln x,
g(x) = x²
(f ◦ g)(x) = ln x²,
(g ◦ f )(x) = [ln x]²,
Notice that f ◦ g and g ◦ f are not the same functions.
2. f (x) = 4 + sin x,
g(x) = √(1 − x²),
(f ◦ g)(x) = 4 + sin √(1 − x²),
(g ◦ f )(x) does not exist, since the range of f , [3, 5], has no points in common with the
domain of g.
• Chain Rule: Let y = f (z) and z = g(x). Then, y = (f ◦ g)(x) = f [g(x)] and the derivative
of y with respect to x is
(d/dx) {f [g(x)]} = f ′[g(x)] g ′(x)
which can also be written as
dy/dx = (dy/dz)(dz/dx)
(Note: the above does not imply that the dz’s cancel out, as in fractions. They are part of
the derivative notation and have no separate existence.) The chain rule can be thought of
as the derivative of the “outside” times the derivative of the “inside,” remembering that the
derivative of the outside function is evaluated at the value of the inside function.
• Generalized Power Rule: If y = [g(x)]^k, then dy/dx = k[g(x)]^(k−1) g ′(x).
• Examples:
1. Find dy/dx for y = (3x² + 5x − 7)⁶. Let f (z) = z⁶ and z = g(x) = 3x² + 5x − 7. Then,
y = f [g(x)] and
dy/dx =
      =
      =
2. Find dy/dx for y = sin(x³ + 4x). (Note: the derivative of sin x is cos x.) Let f (z) = sin z
and z = g(x) = x³ + 4x. Then, y = f [g(x)] and
dy/dx =
      =
      =
• Derivatives of Exp:
1. (d/dx) αe^x = αe^x
2. (dⁿ/dxⁿ) αe^x = αe^x
3. (d/dx) e^u(x) = e^u(x) u′(x)
• Examples:
1. y = e^(−3x)
2. y = e^(x²)
3. y = e^(sin 2x)
• Derivatives of Ln:
1. (d/dx) ln x = 1/x
2. (d/dx) ln x^k = (d/dx) k ln x = k/x
3. (d/dx) ln u(x) = u′(x)/u(x) (by the chain rule)
• Examples:
1. y = ln(x² + 9)
2. y = ln(ln x)
3. y = (ln x)²
4. y = ln e^x
• For any positive base b, (d/dx) b^x = (ln b)(b^x).
• If both lim_{x→c} f (x) = 0 and lim_{x→c} g(x) = 0, then we get an indeterminate form of the type 0/0
as x → c. However, we can still analyze such limits using L’Hospital’s rule.
• L’Hospital’s Rule: Suppose f and g are differentiable on a < x < b and that either
lim_{x→a+} f (x) = 0 and lim_{x→a+} g(x) = 0, or
lim_{x→a+} f (x) = ±∞ and lim_{x→a+} g(x) = ±∞.
Suppose further that g ′(x) is never zero on a < x < b and that
lim_{x→a+} f ′(x)/g ′(x) = L
then
lim_{x→a+} f (x)/g(x) = L
2. lim_{x→0+} e^(1/x) / (1/x)
3. lim_{x→2} (x − 2) / ((x + 6)^(1/3) − 2)
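Example 3 can be checked numerically: by L’Hospital’s rule the limit should equal 3(x + 6)^(2/3) at x = 2, i.e. 12. A quick Python sketch (ours):

```python
def ratio(x):
    """The 0/0 ratio from example 3."""
    return (x - 2) / ((x + 6)**(1/3) - 2)

for x in [2.1, 2.01, 2.001]:
    print(x, ratio(x))   # values approach 12
```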
Lecture 3: Calculus II
Today’s Topics3 : • Partial Derivatives • The Indefinite Integral: The Antiderivative • The
Definite Integral: The Area under the Curve • Integration by Substitution • Integration by Parts
1.3.1 Differentiation in Several Variables
• Suppose we have a function f now of two (or more) variables and we want to determine the
rate of change relative to one of the variables. To do so, we would find its partial derivative,
which is defined similarly to the derivative of a function of one variable.
• Partial Derivative: Let f be a function of the variables (x1 , . . . , xn ). The partial derivative
of f with respect to xi is
∂f /∂xi (x1 , . . . , xn ) = lim_{h→0} [f (x1 , . . . , xi + h, . . . , xn ) − f (x1 , . . . , xi , . . . , xn )] / h
Only the ith variable changes — the others are treated as constants.
• We can take higher-order partial derivatives, like we did with functions of a single variable,
except now the higher-order partials can be taken with respect to multiple variables.
• Examples:
1. f (x, y) = x² + y²
∂f /∂x (x, y) =
∂f /∂y (x, y) =
∂²f /∂x² (x, y) =
∂²f /∂x∂y (x, y) =
2. f (x, y) = x³y⁴ + e^x − ln y
∂f /∂x (x, y) =
∂f /∂y (x, y) =
∂²f /∂x² (x, y) =
∂²f /∂x∂y (x, y) =
• Let DF be the derivative of F . And let DF (x) be the derivative of F evaluated at x. Then
the antiderivative is denoted by D⁻¹ (i.e., the inverse derivative). If DF = f , then F = D⁻¹f .
• Examples:
3 Much of the material and examples for this lecture are taken from Simon & Blume (1994) Mathematics for Economists and from Boyce & Diprima (1988) Calculus.
1. ∫ (1/x²) dx =
2. ∫ 3e^(3x) dx =
3. ∫ (x² − 4) dx =
• Notice from these examples that while there is only a single derivative for any function, there
are multiple antiderivatives: one for any arbitrary constant c. c just shifts the curve up or
down on the y-axis. If more info is present about the antiderivative — e.g., that it passes
through a particular point — then we can solve for a specific value of c.
• Common rules of integration:
1. ∫ a f (x) dx = a ∫ f (x) dx
2. ∫ [f (x) + g(x)] dx = ∫ f (x) dx + ∫ g(x) dx
3. ∫ xⁿ dx = x^(n+1)/(n + 1) + c
4. ∫ e^x dx = e^x + c
5. ∫ (1/x) dx = ln x + c
• Riemann Sum: Suppose we want to determine the area A(R) of a region R defined by a
curve f (x) and some interval a ≤ x ≤ b. One way to calculate the area would be to divide
the interval a ≤ x ≤ b into n subintervals of length ∆x and then approximate the region with
a series of rectangles, where the base of each rectangle is ∆x and the height is f (x) at the
midpoint of that interval. A(R) would then be approximated by the area of the union of the
rectangles, which is given by
S(f, ∆x) = Σ_{i=1}^n f (xi ) ∆x
and is called a Riemann sum.
• As we decrease the size of the subintervals ∆x, making the rectangles “thinner,” we would
expect our approximation of the area of the region to become closer to the true area. This
gives the limiting process
A(R) = lim_{∆x→0} Σ_{i=1}^n f (xi ) ∆x
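A small Python sketch of a midpoint Riemann sum (ours; riemann_midpoint is an arbitrary helper name), applied to f (x) = x² on [0, 1], where the exact area is 1/3:

```python
def riemann_midpoint(f, a, b, n=1000):
    """Midpoint Riemann sum: sum of f(x_i) * dx over n subintervals of [a, b]."""
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) * dx for i in range(n))

print(riemann_midpoint(lambda x: x**2, 0.0, 1.0))   # ~0.33333
```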
• Riemann Integral: If for a given function f the Riemann sum approaches a limit as ∆x → 0,
then that limit is called the Riemann integral of f from a to b. Formally,
∫_a^b f (x) dx = lim_{∆x→0} Σ_{i=1}^n f (xi ) ∆x
• Definite Integral: We use the notation ∫_a^b f (x) dx to denote the definite integral of f from
a to b. In words, the definite integral ∫_a^b f (x) dx is the area under the “curve” f (x) from x = a
to x = b.
• First Fundamental Theorem of Calculus: Let the function f be bounded on [a, b] and
continuous on (a, b). Then the function
F (x) = ∫_a^x f (s) ds,  a ≤ x ≤ b
has a derivative at each point in (a, b), with F ′(x) = f (x) for a < x < b.
• Second Fundamental Theorem of Calculus: Let the function f be bounded on [a, b] and
continuous on (a, b). Let F be any function that is continuous on [a, b] such that F ′(x) = f (x)
on (a, b). Then
∫_a^b f (x) dx = F (b) − F (a)
• Procedure to calculate a “simple” definite integral ∫_a^b f (x) dx:
• Examples:
1. ∫_1^3 3x² dx =
2. ∫_{−2}^2 e^x e^(e^x) dx =
3. ∫_a^b [αf (x) + βg(x)] dx = α ∫_a^b f (x) dx + β ∫_a^b g(x) dx
4. ∫_a^b f (x) dx + ∫_b^c f (x) dx = ∫_a^c f (x) dx
• Examples:
1. ∫_1^1 3x² dx =
2. ∫_0^4 (2x + 1) dx =
3. ∫_{−2}^0 e^x e^(e^x) dx + ∫_0^2 e^x e^(e^x) dx =
• Sometimes the integrand doesn’t appear integrable using common rules and antiderivatives.
A method one might try is integration by substitution, which is related to the Chain
Rule.
• Suppose we want to find the indefinite integral ∫ g(x) dx and assume we can identify a function
u(x) such that g(x) = f [u(x)] u′(x). Let’s refer to the antiderivative of f as F . Then the
chain rule tells us that (d/dx) F [u(x)] = f [u(x)] u′(x). So, F [u(x)] is the antiderivative of g. We
can then write
∫ g(x) dx = ∫ f [u(x)] u′(x) dx = ∫ (d/dx) F [u(x)] dx = F [u(x)] + c
• Procedure to determine the indefinite integral ∫ g(x) dx by the method of substitution:
1. Identify some part of g(x) that might be simplified by substituting in a single variable
u (which will then be a function of x).
2. Determine if g(x)dx can be reformulated in terms of u and du.
3. Solve the indefinite integral.
4. Substitute back in for x.
• Substitution can also be used to calculate a definite integral. Using the same procedure as
above,
∫_a^b g(x) dx = ∫_c^d f (u) du = F (d) − F (c)
where c = u(a) and d = u(b).
• Examples:
1. ∫ x²√(x + 1) dx
The problem here is the √(x + 1) term. However, if the integrand had √u times some
polynomial in u, we could integrate term by term. Let u = x + 1; then x = u − 1 and
dx = du, so x²√(x + 1) dx = (u − 1)²√u du = (u^(5/2) − 2u^(3/2) + u^(1/2)) du.
We can easily integrate this, since it’s just a polynomial. Doing so and substituting
u = x + 1 back in, we get
∫ x²√(x + 1) dx = 2(x + 1)^(3/2) [ (1/7)(x + 1)² − (2/5)(x + 1) + 1/3 ] + c
2. For the above problem, we could have also used the substitution u = √(x + 1). Then
x = u² − 1 and dx = 2u du. Substituting these in, we get
∫ x²√(x + 1) dx = ∫ (u² − 1)² u · 2u du
which when expanded is again a polynomial and gives the same result as above.
3. ∫_0^1 5e^(2x) / (1 + e^(2x))^(1/3) dx
When an expression is raised to a power, it’s often helpful to use this expression as
the basis for a substitution. So, let u = 1 + e^(2x). Then du = 2e^(2x) dx and we can
set 5e^(2x) dx = 5du/2. Additionally, u = 2 when x = 0 and u = 1 + e² when x = 1.
Substituting all of this in, we get
∫_0^1 5e^(2x) / (1 + e^(2x))^(1/3) dx = (5/2) ∫_2^(1+e²) du / u^(1/3)
                                     = (5/2) ∫_2^(1+e²) u^(−1/3) du
                                     = (15/4) u^(2/3) |_2^(1+e²)
                                     = 9.53
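The result of example 3 can be confirmed numerically; a Python sketch (ours) compares a midpoint Riemann sum of the original integrand with the closed form obtained from the substitution:

```python
import math

f = lambda x: 5 * math.exp(2*x) / (1 + math.exp(2*x))**(1/3)

n, a, b = 100_000, 0.0, 1.0
dx = (b - a) / n
approx = sum(f(a + (i + 0.5) * dx) * dx for i in range(n))
exact = (15/4) * ((1 + math.e**2)**(2/3) - 2**(2/3))
print(approx, exact)   # both ~9.53
```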
• Another useful integration technique is integration by parts, which is related to the Product
Rule of differentiation. The product rule states that
(d/dx)(uv) = u (dv/dx) + v (du/dx)
Integrating this and rearranging, we get
∫ u (dv/dx) dx = uv − ∫ v (du/dx) dx
or
∫ u(x) v ′(x) dx = u(x)v(x) − ∫ v(x) u′(x) dx
• Our goal here is to find expressions for u and dv that, when substituted into the above
equation, yield an expression that’s more easily evaluated.
• Examples:
1. ∫ x e^(ax) dx
Let u = x and dv = e^(ax) dx. Then du = dx and v = (1/a)e^(ax). Substituting this into the
integration by parts formula, we obtain
∫ x e^(ax) dx = uv − ∫ v du
             = x (1/a)e^(ax) − ∫ (1/a)e^(ax) dx
             = (1/a) x e^(ax) − (1/a²) e^(ax) + c
2. ∫ xⁿ e^(ax) dx
3. ∫ x³ e^(−x²) dx
Lecture 4: Probability I
Today’s Topics4 : • Counting rules • Sets • Probability • Conditional Probability and Bayes’
Rule • Independence
1.4.1 Counting rules
• We often need to count the number of ways to choose a subset from some set of possibilities.
The number of outcomes depends on two characteristics of the process: does the order matter
and is replacement allowed?
• If there are n objects and we select k < n of them, how many different outcomes are possible?
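The four standard answers (order matters or not, crossed with replacement allowed or not) are n^k, n!/(n − k)!, C(n, k), and C(n + k − 1, k). A Python sketch (ours), using standard-library helpers:

```python
import math

n, k = 5, 3
print(n**k)                      # ordered, with replacement
print(math.perm(n, k))           # ordered, without replacement: n!/(n-k)!
print(math.comb(n, k))           # unordered, without replacement: "n choose k"
print(math.comb(n + k - 1, k))   # unordered, with replacement
```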
1.4.2 Sets
• Types of sets:
1. Countably finite: a set with a finite number of elements, which can be mapped onto
positive integers.
S = {1, 2, 3, 4, 5, 6}
2. Countably infinite: a set with an infinite number of elements, which can still be mapped
onto positive integers.
S = {1, 1/2, 1/3, . . . }
3. Uncountably infinite: a set with an infinite number of elements, which cannot be mapped
onto positive integers.
S = {x : x ∈ [0, 1]}
4. Empty: a set with no elements.
S = {∅}
• Set operations:
1. Union: The union of two sets A and B, A ∪ B, is the set containing all of the elements
in A or B.
4 Much of the material and examples for this lecture are taken from Gill (2006) Essential Mathematics for Political and Social Research, Wackerly, Mendenhall, & Scheaffer (1996) Mathematical Statistics with Applications, Degroot (1985) Probability and Statistics, Morrow (1994) Game Theory for Political Scientists, King (1989) Unifying Political Methodology, and Ross (1987) Introduction to Probability and Statistics for Scientists and Engineers.
2. Intersection: The intersection of sets A and B, A ∩ B, is the set containing all of the
elements in both A and B.
3. Complement: If set A is a subset of S, then the complement of A, denoted AC , is the
set containing all of the elements in S that are not in A.
1. Commutative: A ∪ B = B ∪ A, A ∩ B = B ∩ A
2. Associative: A ∪ (B ∪ C) = (A ∪ B) ∪ C, A ∩ (B ∩ C) = (A ∩ B) ∩ C
3. Distributive: A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C), A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)
4. de Morgan’s laws: (A ∪ B)C = AC ∩ B C , (A ∩ B)C = AC ∪ B C
• Disjointness: Sets are disjoint when they do not intersect, such that A ∩ B = {∅}. A
collection of sets is pairwise disjoint if, for all i ≠ j, Ai ∩ Aj = {∅}. A collection of sets form
a partition of set S if they are pairwise disjoint and they cover set S, such that ∪_{i=1}^k Ai = S.
1.4.3 Probability
• Probability: Many events or outcomes are random. In everyday speech, we say that we are
uncertain about the outcome of random events. Probability is a formal model of uncertainty
which provides a measure of uncertainty governed by a particular set of rules. A different
model of uncertainty would, of course, have a different set of rules and measures. Our focus on
probability is justified because it has proven to be a particularly useful model of uncertainty.
• Sample Space: A set or collection of all possible outcomes from some process. Outcomes in
the set can be discrete elements (countable) or points along a continuous interval (uncount-
able).
• Examples:
1. Discrete: the numbers on a die, the number of possible wars that could occur each year,
whether a vote cast is republican or democrat.
2. Continuous: GNP, arms spending, age.
• Axioms of Probability: Define the number Pr(A) corresponding to each event A in the
sample space S such that
• Basic Theorems of Probability: Using these three axioms, we can define all of the common
theorems of probability.
1. Pr(∅) = 0
2. Pr(AC ) = 1 − Pr(A)
3. For any event A, 0 ≤ Pr(A) ≤ 1.
4. If A ⊂ B, then Pr(A) ≤ Pr(B).
5. For any two events A and B, Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B)
6. For any sequence of n events (which need not be disjoint) A1 , A2 , . . . , An ,
Pr(∪_{i=1}^n Ai ) ≤ Σ_{i=1}^n Pr(Ai )
• Example: Consider rolling a fair six-sided die. Then:
1. Sample space S =
2. Pr(1) = · · · = Pr(6) =
3. Pr(∅) = Pr(7) =
4. Pr ({1, 3, 5}) =
5. Pr({1, 2}^C) = Pr ({3, 4, 5, 6}) =
6. Let B = S and A = {1, 2, 3, 4, 5} ⊂ B. Then Pr(A) = < Pr(B) = .
7. Let A = {1, 2, 3} and B = {2, 4, 6}. Then A ∪ B = {1, 2, 3, 4, 6}, A ∩ B = {2}, and
Pr(A ∪ B) =
• Conditional Probability: The conditional probability of event A given that event B has
occurred is
Pr(A|B) = Pr(A ∩ B) / Pr(B)
• Example: Assume A and B occur with the following frequencies, where n_ab counts the cases
in which both A and B occur, n_¬ab the cases in which B occurs but A does not, and so on:
          A       not-A
  B       n_ab    n_¬ab
  not-B   n_a¬b   n_¬a¬b
and let n_ab + n_¬ab + n_a¬b + n_¬a¬b = N . Then
1. Pr(A) ≈
2. Pr(B) ≈
3. Pr(A ∩ B) ≈
4. Pr(A|B) ≈
5. Pr(B|A) ≈
• Example: A six-sided die is rolled. What is the probability of a 1, given the outcome is an odd
number?
Sometimes it is easier to calculate the conditional probabilities and sum them than it is to
calculate Pr(A) directly.
• Bayes Rule: Assume that events B1 , . . . , Bk form a partition of the space S. Then
Pr(Bj |A) = Pr(A ∩ Bj ) / Pr(A) = Pr(Bj ) Pr(A|Bj ) / Σ_{i=1}^k Pr(Bi ) Pr(A|Bi )
• Bayes rule determines the posterior probability of a state or type Pr(Bj |A) by calculating the
probability Pr(A ∩ Bj ) that both the event A and the state Bj will occur and dividing it by the
probability that the event will occur regardless of the state (by summing across all Bi ).
• Often Bayes’ rule is used when one wants to calculate a posterior probability about the “state”
or type of an object, given that some event has occurred. The states could be something like
Normal/Defective, Normal/Diseased, Democrat/Republican, etc. The event on which one
conditions could be something like a sampling from a batch of components, a test for a
disease, or a question about a policy position.
• Prior and Posterior Probabilities: In the above, Pr(B1 ) is often called the prior proba-
bility, since it’s the probability of B1 before anything else is known. Pr(B1 |A) is called the
posterior probability, since it’s the probability after other information is taken into account.
• Examples:
1. A test for cancer correctly detects it 90% of the time, but incorrectly identifies a person
as having cancer 10% of the time. If 10% of all people have cancer at any given time,
what is the probability that a person who tests positive actually has cancer?
2. In Boston, 30% of the people are conservatives, 50% are liberals, and 20% are indepen-
dents. In the last election, 65% of conservatives, 82% of liberals, and 50% of independents
voted. If a person in Boston is selected at random and we learn that s/he did not vote
last election, what is the probability s/he is a liberal?
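Both examples reduce to one application of Bayes’ rule. A Python sketch (ours), reading example 1 as Pr(+|cancer) = .9, Pr(+|no cancer) = .1, Pr(cancer) = .1:

```python
# Example 1: posterior probability of cancer given a positive test.
p_pos_c, p_pos_nc, p_c = 0.9, 0.1, 0.1
print(p_pos_c * p_c / (p_pos_c * p_c + p_pos_nc * (1 - p_c)))   # 0.5

# Example 2: Pr(liberal | did not vote).
priors = {"cons": 0.30, "lib": 0.50, "ind": 0.20}
p_vote = {"cons": 0.65, "lib": 0.82, "ind": 0.50}
p_novote = sum(priors[g] * (1 - p_vote[g]) for g in priors)
print(priors["lib"] * (1 - p_vote["lib"]) / p_novote)           # ~0.305
```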
1.4.5 Independence
• Independence: Events A and B are independent if:
1. Pr(A|B) = Pr(A)
2. Pr(B|A) = Pr(B)
3. Pr(A ∩ B) = Pr(A) Pr(B)
• Conditional Independence: Events A and B are conditionally independent given C if:
1. Pr(A|B ∩ C) = Pr(A|C)
2. Pr(B|A ∩ C) = Pr(B|C)
3. Pr(A ∩ B|C) = Pr(A|C) Pr(B|C)
Lecture 5: Probability II
Today’s Topics5 :
• Levels of Measurement • Discrete Distributions • Continuous Distributions • Joint Distribu-
tions • Expectation • Special Discrete Distributions • Special Continuous Distributions • Summa-
rizing Observed Data
1.5.1 Levels of Measurement
• In empirical research, data can be classified along several dimensions. We have already
distinguished between discrete (countable) and continuous (uncountable) data. We can also
look at the precision with which the underlying quantities are measured.
• Nominal: Discrete data are nominal if there is no way to put the categories represented
by the data into a meaningful order. Typically, this kind of data represents names (hence
‘nominal’) or attributes, like Republican or Democrat.
• Ordinal: Discrete data are ordinal if there is a logical order to the categories represented
by the data, but there is no common scale for differences between adjacent categories. Party
identification is often measured as ordinal data.
• Interval: Discrete or continuous data are interval if there is an order to the values and there
is a common scale, so that differences between two values have substantive meanings. Dates
are an example of interval data.
• Ratio: Discrete or continuous data are ratio if the data have the characteristics of interval
data and zero is a meaningful quantity. This allows us to consider the ratio of two values as
well as difference between them. Quantities measured in dollars, such as per capita GDP, are
ratio data.
• Random Variable: A random variable is a real-valued function defined on the sample space
S; it assigns a real number to every outcome s ∈ S.
• Discrete Random Variable: Y is a discrete random variable if it can assume only a finite
or countably infinite number of distinct values.
• Examples: number of wars per year, heads or tails, voting Republican or Democrat, number
on a rolled die.
• Probability Mass Function: For a discrete random variable Y , the probability mass func-
tion (pmf)6 p(y) = Pr(Y = y) assigns probabilities to a countable number of distinct y values
such that
1. 0 ≤ p(y) ≤ 1
2. Σ_y p(y) = 1
5 Much of the material and examples for this lecture are taken from Gill (2006) Essential Mathematics for Political and Social Research, Wackerly, Mendenhall, & Scheaffer (1996) Mathematical Statistics with Applications, Degroot (1985) Probability and Statistics, Morrow (1994) Game Theory for Political Scientists, and Ross (1987) Introduction to Probability and Statistics for Scientists and Engineers.
6 Also referred to simply as the “probability distribution.”
• Example: For a fair six-sided die, there is an equal probability of rolling
any number. Since there are six sides, the probability mass function is then
p(y) = 1/6 for y = 1, . . . , 6. Each p(y) is between 0 and 1. And, the sum of
the p(y)’s is 1.
• Cumulative Distribution: The cumulative distribution F (y) or Pr(Y ≤ y) is the proba-
bility that Y is less than or equal to some value y, or
Pr(Y ≤ y) = Σ_{i≤y} p(i)
1. F (y) is non-decreasing in y.
2. lim_{y→−∞} F (y) = 0 and lim_{y→∞} F (y) = 1
3. F (y) is right-continuous.
• Example: For the fair six-sided die, Pr(Y ≤ 6) = .
• Probability Density Function: The function f above is called the probability density
function (pdf) of Y and must satisfy
1. f (y) ≥ 0
2. ∫_{−∞}^∞ f (y) dy = 1
Note also that Pr(Y = y) = 0 — i.e., the probability of any point y is zero.
• Example: f (y) = 1, 0 ≤ y ≤ 1 (the uniform density on [0, 1]).
• Cumulative Distribution: Because the probability that a continuous random variable will
assume any particular value is zero, we can only make statements about the probability of a
continuous random variable being within an interval. The cumulative distribution gives the
probability that Y lies on the interval (−∞, y) and is defined as
F (y) = Pr(Y ≤ y) = ∫_{−∞}^y f (s) ds
Note that F (y) has similar properties with continuous distributions as it does with dis-
crete — non-decreasing, continuous (not just right-continuous), and lim_{y→−∞} F (y) = 0 and
lim_{y→∞} F (y) = 1.
Pr(a ≤ y ≤ b) = ∫_a^b f (y) dy
• Example: f (y) = 1, 0 < y < 1. Find F (y) and Pr(.5 < y < .75).
F (y) =
• F ′(y) = dF (y)/dy = f (y)
• Often, we are interested in two or more random variables defined on the same sample space.
The distribution of these variables is called a joint distribution. Joint distributions can be
made up of any combination of discrete and continuous random variables.
• Example: Suppose we are interested in the outcomes of flipping a coin and rolling a 6-sided
die at the same time. The sample space for this process contains 12 elements:
{h1, h2, h3, h4, h5, h6, t1, t2, t3, t4, t5, t6}
We can define two random variables X and Y such that X = 1 if heads and X = 0 if
tails, while Y equals the number on the die. We can then make statements about the joint
distribution of X and Y .
• Joint discrete random variables: If both X and Y are discrete, their joint probability
mass function assigns probabilities to each pair of outcomes
p(x, y) = Pr(X = x, Y = y)
Again, p(x, y) ∈ [0, 1] and Σ_x Σ_y p(x, y) = 1.
If we are interested in the marginal probability of one of the two variables (ignoring infor-
mation about the other variable), we can obtain the marginal pmf by summing across the
variable that we don’t care about:
pX (x) = Σ_i p(x, yi )
We can also calculate the conditional pmf for one variable, holding the other variable fixed.
Recalling from the previous lecture that Pr(A|B) = Pr(A ∩ B)/Pr(B), we can write the conditional
pmf as
pY|X (y|x) = p(x, y) / pX (x),  pX (x) > 0
• Joint continuous random variables: If both X and Y are continuous, their joint proba-
bility density function defines their distribution:
Pr((X, Y ) ∈ A) = ∫∫_A f (x, y) dx dy
Likewise, f (x, y) ≥ 0 and ∫_{−∞}^∞ ∫_{−∞}^∞ f (x, y) dx dy = 1.
Instead of summing, we obtain the marginal probability density function by integrating out
one of the variables:
fX (x) = ∫_{−∞}^∞ f (x, y) dy
The conditional pdf is then
fY|X (y|x) = f (x, y) / fX (x),  fX (x) > 0
1.5.5 Expectation
• Expectation: The expected value of a discrete random variable Y is E(Y ) = Σ_y y p(y).
In words, it is the weighted average of the possible values y can take on, weighted by the
probability that y occurs. It is not necessarily the number we would expect Y to take on, but
the average value of Y after a large number of repetitions of an experiment.
• Example: Find E(Y ) for f (y) = 1/1.5, 0 < y < 1.5.
E(Y ) =
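As a check on one’s answer, E(Y ) = ∫ y f (y) dy can be approximated with a midpoint sum; a Python sketch (ours):

```python
# E(Y) for f(y) = 1/1.5 on (0, 1.5); the exact answer is 0.75.
n, a, b = 100_000, 0.0, 1.5
dy = (b - a) / n
print(sum((a + (i + 0.5) * dy) * (1 / 1.5) * dy for i in range(n)))   # ~0.75
```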
1. E(c) = c
2. E[E[Y ]] = E[Y ] (because the expected value of a random variable is a constant)
3. E[cg(Y )] = cE[g(Y )]
4. E[g(Y1 ) + · · · + g(Yn )] = E[g(Y1 )] + · · · + E[g(Yn )]
• Variance: We can also look at other summaries of the distribution, which build on the idea
of taking expectations. Variance tells us about the “spread” of the distribution; it is the
expected value of the squared deviations from the mean of the distribution. The standard
deviation is simply the square root of the variance.
• Covariance and Correlation: The covariance measures the degree to which two random
variables vary together; if the covariance is positive, X tends to be larger than its mean when
Y is larger than its mean. The covariance of a variable with itself is the variance of that
variable.
The correlation coefficient is the covariance divided by the standard deviations of X and Y.
It is a unitless measure and always takes on values in the interval [−1, 1].
ρ = Cov(X, Y ) / √(Var(X)Var(Y )) = Cov(X, Y ) / (SD(X)SD(Y ))
• Conditional Expectation: With joint distributions, we are often interested in the expected
value of a variable Y if we could hold the other variable X fixed. This is the conditional
expectation of Y given X = x:
1. Y discrete: E(Y |X = x) = Σ_y y pY|X (y|x)
2. Y continuous: E(Y |X = x) = ∫_y y fY|X (y|x) dy
The conditional expectation is often used for prediction when one knows the value of X
but not Y ; the realized value of X contains information about the unknown Y so long as
E(Y |X = x) ≠ E(Y ) for all x.
• Example: Republicans vote for Democrat-sponsored bills 2% of the time. What is the proba-
bility that out of 10 Republicans questioned, half voted for a particular Democrat-sponsored
bill? What is the mean number of Republicans voting for Democrat-sponsored bills? The
variance?
1. p(5) =
2. E(Y ) =
3. V (Y ) =
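Assuming Y follows a binomial distribution with n = 10 trials and success probability p = .02, the three blanks can be computed numerically; a Python sketch (ours):

```python
from math import comb

n, p = 10, 0.02
pmf = lambda y: comb(n, y) * p**y * (1 - p)**(n - y)
print(pmf(5))            # Pr(Y = 5), a very small number
print(n * p)             # E(Y) = np = 0.2
print(n * p * (1 - p))   # V(Y) = np(1 - p) = 0.196
```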
• Example: Suppose the number of disputes occurring in a month follows a Poisson distribution
with some given mean rate of disputes per month. What is the probability of 0, 2, and less
than 5 disputes occurring in a month?
1. p(0) =
2. p(2) =
3. Pr(Y < 5) =
• Normal Distribution: the pdf of a normal random variable with mean µ and variance σ² is
f (y) = (1 / (√(2π) σ)) e^(−(y−µ)²/(2σ²))
[Graphs: normal densities for different values of σ², e.g. σ² = .1]
• So far, we’ve talked about distributions in a theoretical sense, looking at different properties of
random variables. We don’t observe random variables; we observe realizations of the random
variable.
• Central tendency: The central tendency describes the location of the “middle” of the
observed data along some scale. There are several measures of central tendency.
1. Sample mean: This is the most common measure of central tendency, calculated by
summing across the observations and dividing by the number of observations.
x̄ = (1/n) Σ_{i=1}^n xi
2. Sample median: The median is the value of the “middle” observation. It is obtained
by ordering n data points from smallest to largest and taking the value of the (n + 1)/2th
observation (if n is odd) or the mean of the n/2th and (n/2 + 1)th observations (if n is
even).
3. Sample mode: The mode is the most frequently observed value in the data:
When the data are realizations of a continuous random variable, it often makes sense
to group the data into bins, either by rounding or some other process, in order to get a
reasonable estimate of the mode.
4. Exercise: Calculate the sample mean, median, and mode for the following two variables,
X and Y.
X 6 3 7 5 5 5 6 4 7 2
Y 1 2 1 2 2 1 2 0 2 0
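Python’s standard library can be used to verify hand calculations for the exercise; a sketch (ours):

```python
import statistics

X = [6, 3, 7, 5, 5, 5, 6, 4, 7, 2]
Y = [1, 2, 1, 2, 2, 1, 2, 0, 2, 0]
for data in (X, Y):
    print(statistics.mean(data),
          statistics.median(data),
          statistics.mode(data))
# X: mean 5, median 5.0, mode 5
# Y: mean 1.3, median 1.5, mode 2
```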
• Dispersion: We also typically want to know how spread out the data are relative to the
center of the observed distribution. Again, there are several ways to measure dispersion.
1. Sample variance: The sample variance is the sum of the squared deviations from the
sample mean, divided by the number of observations minus 1.
Var(X) = (1/(n − 1)) Σ_{i=1}^n (xi − x̄)²
Again, this is an estimate of the variance of a random variable; we divide by n−1 instead
of n in order to get an unbiased estimate.
2. Standard deviation: The sample standard deviation is the square root of the sample
variance.
SD(X) = √Var(X) = √[ (1/(n − 1)) Σ_{i=1}^n (xi − x̄)² ]
3. Exercise: Calculate the sample covariance and correlation coefficient for the following
two variables, X and Y.
X 6 3 7 5 5 5 6 4 7 2
Y 1 2 1 2 2 1 2 0 2 0
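For this exercise, Python 3.10+ ships sample covariance and Pearson correlation in the standard library; a sketch (ours):

```python
import statistics

X = [6, 3, 7, 5, 5, 5, 6, 4, 7, 2]
Y = [1, 2, 1, 2, 2, 1, 2, 0, 2, 0]
print(statistics.covariance(X, Y))    # sample covariance (n - 1 denominator)
print(statistics.correlation(X, Y))   # Pearson correlation, in [-1, 1]
```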
Lecture 6: Linear Algebra I
Today’s Topics7 : • Working with Vectors • Linear Independence • Matrix Algebra • Square
Matrices • Systems of Linear Equations • Method of Substitution • Gaussian Elimination • Gauss-
Jordan Elimination
1.6.1 Working with Vectors
• Vector: A vector in n-space is an ordered list of n numbers. These numbers can be repre-
sented as either a row vector or a column vector:
v = [v1 v2 . . . vn ] (a row vector) or v = [v1 v2 . . . vn ]′ (a column vector)
We can also think of a vector as defining a point in n-dimensional space, usually Rn ; each
element of the vector defines the coordinate of the point in a particular direction.
• Vector Addition: Vector addition is defined for two vectors u and v iff they have the same
number of elements:
u + v = [u1 + v1  u2 + v2  · · ·  un + vn ]
• Vector Inner Product: The inner product (also called the dot product or scalar product)
of two vectors u and v is again defined iff they have the same number of elements:
u · v = u1 v1 + u2 v2 + · · · + un vn = Σ_{i=1}^n ui vi
• Vector Norm: The norm of a vector is a measure of its length. There are many different
norms, the most common of which is the Euclidean norm (which corresponds to our usual
conception of distance in three-dimensional space):
||v|| = √(v · v) = √(v1 v1 + v2 v2 + · · · + vn vn )
• Linear Combination: The vector u is a linear combination of the vectors v1 , v2 , . . . , vk if
u = c1 v1 + c2 v2 + · · · + ck vk
for some scalars c1 , . . . , ck .
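A Python sketch (ours) of the inner product and Euclidean norm, with the example vectors chosen arbitrarily:

```python
import math

u = [1, 2, 3]
v = [4, 5, 6]
dot = sum(ui * vi for ui, vi in zip(u, v))   # u . v = 1*4 + 2*5 + 3*6 = 32
norm = math.sqrt(sum(x * x for x in v))      # ||v|| = sqrt(v . v)
print(dot, norm)                             # 32 8.774...
```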
7 Much of the material and examples for this lecture are taken from Gill (2006) Essential Mathematics for Political and Social Scientists, Simon & Blume (1994) Mathematics for Economists and Kolman (1993) Introductory Linear Algebra with Applications.
• Linear Independence: A set of vectors v1 , v2 , . . . , vk is linearly independent if the only
solution to
c1 v1 + c2 v2 + · · · + ck vk = 0
is c1 = c2 = · · · = ck = 0.
• A set S of vectors is linearly dependent iff at least one of the vectors in S can be written as
a linear combination of the other vectors in S.
• Linear independence is only defined for sets of vectors with the same number of elements;
any linearly independent set of vectors in n-space contains at most n vectors.
1. v1 = (1, 0, 0)′, v2 = (1, 0, 1)′, v3 = (1, 1, 1)′
2. v1 = (3, 2, −1)′, v2 = (−2, 2, 4)′, v3 = (2, 3, 1)′
Note that you can think of vectors as special cases of matrices; a column vector of length k
is a k × 1 matrix, while a row vector of the same length is a 1 × k matrix. You can also think
of larger matrices as being made up of a collection of row or column vectors. For example,
A = [a1 a2 · · · am ]
Note that matrices A and B must be the same size, in which case they are conformable for
addition.
• Example (writing matrices row by row, with semicolons separating rows):
A = [1 2 3; 4 5 6],  B = [1 2 1; 2 1 2]
A + B =
• Example:
s = 2,  A = [1 2 3; 4 5 6]
sA =
• Examples:
1. [a b; c d; e f ] [A B; C D] =
2. [1 2 −1; 3 1 4] [−2 5; 4 −3; 2 1] =
Note that the number of columns of the first matrix must equal the number of rows of the
second matrix, in which case they are conformable for multiplication. The sizes of the
matrices (including the resulting product) must be
(m × k)(k × n) = (m × n)
1. Associative: (A + B) + C = A + (B + C)
(AB)C = A(BC)
2. Commutative: A+B=B+A
3. Distributive: A(B + C) = AB + AC
(A + B)C = AC + BC
• Commutative law for multiplication does not hold – the order of multiplication matters:
AB ≠ BA
• Example:
A = [1 2; −1 3],  B = [2 1; 0 1]
AB = [2 3; −2 2],  BA = [1 7; −1 3]
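The example is easy to reproduce with NumPy (assuming it is installed); a sketch (ours):

```python
import numpy as np

A = np.array([[1, 2], [-1, 3]])
B = np.array([[2, 1], [0, 1]])
print(A @ B)   # [[ 2  3] [-2  2]]
print(B @ A)   # [[ 1  7] [-1  3]]  -- AB != BA
```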
• Transpose: The transpose of the m × n matrix A is the n × m matrix A^T (sometimes written
A′) obtained by interchanging the rows and columns of A.
• Examples:
1. A = [4 −2 3; 0 5 −1],  A^T = [4 0; −2 5; 3 −1]
2. B = [2; −1; 3],  B^T = [2 −1 3]
• The following rules apply for transposed matrices:
1. (A + B)^T = A^T + B^T
2. (A^T)^T = A
3. (sA)^T = sA^T
4. (AB)^T = B^T A^T
• Example of (AB)^T = B^T A^T :
A = [1 3 2; 2 −1 3],  B = [0 1; 2 2; 3 −1]
(AB)^T = ([1 3 2; 2 −1 3] [0 1; 2 2; 3 −1])^T = [12 7; 5 −3]
B^T A^T = [0 2 3; 1 2 −1] [1 2; 3 −1; 2 3] = [12 7; 5 −3]
• Square matrices have the same number of rows and columns; a k ×k square matrix is referred
to as a matrix of order k.
• The diagonal of a square matrix is the vector of matrix elements that have the same sub-
scripts. If A is a square matrix of order k, then its diagonal is [a11 , a22 , . . . , akk ]′.
• Trace: The trace of a square matrix A is the sum of the diagonal elements: tr(A) = a11 + a22 + · · · + akk .
Properties of the trace operator: If A and B are square matrices of order k, then
1. Symmetric Matrix: A matrix A is symmetric if A = A′; this implies that aij = aji
for all i and j.
Examples:
A = [1 2; 2 1] = A′,  B = [4 2 −1; 2 1 3; −1 3 1] = B′
2. Diagonal Matrix: A matrix A is diagonal if all of its non-diagonal entries are zero;
formally, if aij = 0 for all i ≠ j.
Examples:
A = [1 0; 0 2],  B = [4 0 0; 0 1 0; 0 0 1]
3. Triangular Matrix: A matrix is triangular in one of two cases. If all entries below the
diagonal are zero (aij = 0 for all i > j), it is upper triangular. Conversely, if all entries
above the diagonal are zero (aij = 0 for all i < j), it is lower triangular.
Examples:
A_LT = [1 0 0; 4 2 0; −3 2 5],  A_UT = [1 7 −4; 0 3 9; 0 0 −3]
4. Identity Matrix: The n × n identity matrix In is the matrix whose diagonal elements
are 1 and all off-diagonal elements are 0. Examples:
I2 = [1 0; 0 1],  I3 = [1 0 0; 0 1 0; 0 0 1]
• Linear Equation: a1 x1 + a2 x2 + · · · + an xn = b
ai are parameters or coefficients. xi are variables or unknowns.
• Linear because only one variable per term and degree is at most 1.
1. R2 : line x2 = b/a2 − (a1 /a2 )x1
2. R3 : plane x3 = b/a3 − (a1 /a3 )x1 − (a2 /a3 )x2
3. Rn : hyperplane
x − 3y = −3
2x + y = 8
• Example: x = 3 and y = 2 is the solution to the above 2 × 2 linear system. Notice from the
graph that the two lines intersect at (3, 2).
• Three methods to solve linear systems:
1. Substitution
2. Elimination of variables
3. Matrix methods
• Procedure:
1. Solve one equation for one variable, say x1 , in terms of the other variables in the equation.
2. Substitute the expression for x1 into the other m−1 equations, resulting in a new system
of m − 1 equations in n − 1 unknowns.
3. Repeat steps 1 and 2 until one equation in one unknown, say xn . We now have a value
for xn .
4. Backward substitution: Substitute xn into the previous equation (which should be a
function of only xn ). Repeat this, using the successive expressions of each variable in
terms of the other variables, to find the values of all xi ’s.
• Exercises:
• Elementary equation operations are used to transform the equations of a linear system, while
maintaining an equivalent linear system — equivalent in the sense that the same values of
xj solve both the original and transformed systems. These operations are
• Interchanging Equations: The order of the equations does not matter. The system
a11 x1 + a12 x2 = b1
a21 x1 + a22 x2 = b2
is equivalent to
a21 x1 + a22 x2 = b2
a11 x1 + a12 x2 = b1
• Multiplying by a Constant: Clearly,
2 = 2
If we multiply each side of the equation by some number, say 4, we still have an equality:
8 = 8
More generally, we can multiply both sides of any equation by a constant and maintain an
equivalent equation. For example, the following two equations are equivalent:
a11 x1 + a12 x2 = b1
ca11 x1 + ca12 x2 = cb1
• Adding Equations: Suppose we had the following two very simple equations:
3 = 3
7 = 7
Adding these, we maintain an equality:
7 + 3 = 7 + 3 =⇒ 10 = 10
• Gaussian Elimination is a method by which we start with some linear system of m equations
in n unknowns and use the elementary equation operations to eliminate variables, until we
arrive at an equivalent system in triangular form,
where a′ij denotes the coefficient of the jth unknown in the ith equation after the above
transformation. Note that at each stage of the elimination process, we want to change some
coefficient of our system to 0 by adding a multiple of an earlier equation to the given equation.
The coefficients a′11 , a′22 , etc. are referred to as pivots, since they are the terms
used to eliminate the variables in the rows below them in their respective columns. Once the
linear system is in the above reduced form, we then use back substitution to find the values
of the xj ’s.
• Exercises:
• The method of Gauss-Jordan elimination takes the Gaussian elimination method one
step further. Once the linear system is in the reduced form shown in the preceding section,
elementary row operations and Gaussian elimination are used to
x1 = b∗1
x2 = b∗2
x3 = b∗3
...
xn = b∗m
• Exercises:
x − 3y = −3
2x + y = 8
x + 2y + 3z = 6
2x − 3y + 2z = 14
3x + y − z = −2
Lecture 7: Linear Algebra II
Today’s Topics9 :
• Matrix Methods for Linear Systems • Rank • Existence of Solutions • Inverse of a Matrix •
Linear Systems and Inverses • Determinants • The Determinant Formula for an Inverse • Cramer’s
Rule
1.7.1 Matrices, Row Operations, & (Reduced) Row Echelon Form
• Matrices provide an easy and efficient way to represent linear systems such as
a11 x1 + a12 x2 + · · · + a1n xn = b1
a21 x1 + a22 x2 + · · · + a2n xn = b2
...
am1 x1 + am2 x2 + · · · + amn xn = bm
as
Ax = b
where A is the m × n matrix of coefficients aij , x is the n × 1 column vector of unknowns, and
b is the m × 1 column vector of constants.
• Augmented Matrix: When we append b to the coefficient matrix A, we get the augmented
matrix Â = [A|b]:
Â = [a11 a12 · · · a1n | b1 ; a21 a22 · · · a2n | b2 ; . . . ; am1 am2 · · · amn | bm ]
• Elementary Row Operations: Just as we conducted elementary equation operations, we
can conduct elementary row operations to transform some augmented matrix representation
of a linear system into another augmented matrix that represents an equivalent linear system.
Since we’re really operating on equations when we operate on the rows of the matrix, these
row operations correspond exactly to the equation operations:
9 Much of the material and examples for this lecture are taken from Simon & Blume (1994) Mathematics for Economists and Kolman (1993) Introductory Linear Algebra with Applications.
• Row Echelon Form: We use the row operations to change coefficients in the augmented
matrix to 0 — i.e., pivot to eliminate variables — and to put it in a matrix form representing
the final linear system of Gaussian elimination. An augmented matrix of the form
[a′11 a′12 a′13 · · · a′1n | b′1 ; 0 a′22 a′23 · · · a′2n | b′2 ; 0 0 a′33 · · · a′3n | b′3 ; . . . ; 0 0 0 0 a′mn | b′m ]
is said to be in row echelon form — each row has more leading zeros than the row preceding
it.
• Reduced Row Echelon Form: Reduced row echelon form is the matrix representation of
a linear system after Gauss-Jordan elimination. For a system of m equations in m unknowns,
with no all-zero rows, the reduced row echelon form would be
[1 0 0 · · · 0 | b∗1 ; 0 1 0 · · · 0 | b∗2 ; 0 0 1 · · · 0 | b∗3 ; . . . ; 0 0 0 · · · 1 | b∗m ]
• Exercises:
Using matrix methods, solve the following linear system by Gaussian elimination and then
Gauss-Jordan elimination:
1.
x − 3y = −3
2x + y = 8
2.
x + 2y + 3z = 6
2x − 3y + 2z = 14
3x + y − z = −2
• We previously noted that a 2 × 2 system had one, infinite, or no solutions if the two lines in-
tersected, were the same, or were parallel, respectively. More generally, to determine whether
one, infinite, or no solutions exist, we can use information about (1) the number of equations
m, (2) the number of unknowns n, and (3) the rank of the matrix representing the linear
system.
• Rank: The rank of a matrix is the number of nonzero rows in its row echelon form. The
rank corresponds to the maximum number of linearly independent row or column vectors in
the matrix.
• Examples:
1. [1 2 3; 0 4 5; 0 0 6]  Rank =
2. [1 2 3; 0 4 5; 0 0 0]  Rank =
3. [1 2 3 | b1 ; 0 4 5 | b2 ; 0 0 0 | b3 ], bi ≠ 0  Rank =
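NumPy can check the first two examples (a sketch, ours; the third depends on the unspecified bi):

```python
import numpy as np

for M in ([[1, 2, 3], [0, 4, 5], [0, 0, 6]],
          [[1, 2, 3], [0, 4, 5], [0, 0, 0]]):
    print(np.linalg.matrix_rank(np.array(M)))   # 3, then 2
```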
1. rank A ≤ rank Â: Augmenting A with b can never result in more zero rows than originally
in A itself. Suppose row i in A is all zeros and that bi is non-zero. Augmenting A with b
will yield a non-zero row i in Â.
2. rank A ≤ rows A: By definition of “rank.”
3. rank A ≤ cols A: Suppose there are more rows than columns (otherwise the previous rule
applies). Each column can contain at most one pivot. By pivoting, all other entries in a
column below the pivot are zeroed. Hence, there will only be as many non-zero rows as
pivots, which will equal the number of columns.
• Existence of Solutions:
• Exercises:
1.
x − 3y = −3
2x + y = 8
2.
x + 2y + 3z = 6
2x − 3y + 2z = 14
3x + y − z = −2
3.
x + 2y − 3z = −4
2x + y − 3z = 4
4.
x1 + 2x2 − 3x4 + x5 = 2
x1 + 2x2 + x3 − 3x4 + x5 + 2x6 = 3
x1 + 2x2 − 3x4 + 2x5 + x6 = 4
3x1 + 6x2 + x3 − 9x4 + 4x5 + 3x6 = 9
5.
x + 2y + 3z + 4w = 5
x + 3y + 5z + 7w = 11
x − z − 2w = −6
• Example: Let
A = [2 3; 2 2],  B = [−1 3/2; 1 −1]
Since
AB = BA = In
we conclude that B is the inverse, A⁻¹, of A and that A is nonsingular.
• To find the inverse A⁻¹ of a nonsingular matrix A, we need the matrix B satisfying
AB = In
Solving for B is equivalent to solving for n linear systems, where each column of B is solved for
the corresponding column in In . In performing Gauss-Jordan elimination for each individual
system, the same row operations will be performed on A regardless of the column of B and
In . Hence, we can solve the systems simultaneously by augmenting A with In and performing
Gauss-Jordan elimination on A. Note that for the square matrix A, Gauss-Jordan elimination
should result in A becoming row equivalent to In . Therefore, if Gauss-Jordan elimination on
[A|In ] results in [In |B], then B is the inverse of A. Otherwise, A is singular.
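A NumPy sketch (ours) confirming the inverse found in the example above:

```python
import numpy as np

A = np.array([[2.0, 3.0], [2.0, 2.0]])
A_inv = np.linalg.inv(A)
print(A_inv)        # [[-1.   1.5] [ 1.  -1. ]]
print(A @ A_inv)    # the 2 x 2 identity, up to floating-point rounding
```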
1.7.5 Determinants
2. Since |A| ≠ 0, we conclude that A has an inverse.
• Triangular or Diagonal Matrices: For any upper-triangular, lower-triangular, or diagonal
matrix, the determinant is just the product of the diagonal terms.
• Example: Suppose we have the following square matrix in row echelon form (i.e., upper
triangular)
R = [r11 r12 r13 ; 0 r22 r23 ; 0 0 r33 ]
Then
|R| = r11 · |r22 r23 ; 0 r33 | = r11 r22 r33
• Properties of Determinants:
1. |A| = |A^T |
8. A square matrix is nonsingular iff its determinant ≠ 0. (Implied by the previous properties.)
9. |AB| = |A||B|
but these remain just that — algorithms. At this point, we have no way of telling how the
solutions xj change as the parameters aij and bi change, except by changing the values and
“rerunning” the algorithms.
Hence, we can examine how changes in the parameters aij and bi affect the solutions xj .
– Define the (i, j)th cofactor Cij of A as (−1)^(i+j) Mij . Notice that it’s just the signed
(i, j)th minor.
– Define the adjoint of A as the n × n matrix whose (i, j)th entry is Cji (notice the switch
in indices!). We’ll refer to the adjoint of A as adj A.
• Exercise: Find the inverse of A = [1 1 1; 0 2 3; 5 5 1]
• Cramer’s Rule: The Determinant Formula for the Solution of a Linear System:
To solve the n × n system Ax = b, form Aj by replacing the jth column of A with b.
Example:
A1 = [b1 a12 · · · a1n ; b2 a22 · · · a2n ; . . . ; bn an2 · · · ann ]
Then each element of the solution is given by
xj = |Aj | / |A|
−2x1 + 3x2 − x3 = 1
x1 + 2x2 − x3 = 4
−2x1 − x2 + x3 = −3
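A Python sketch of Cramer’s rule (ours, using NumPy determinants) applied to the system above, cross-checked against a direct solve:

```python
import numpy as np

A = np.array([[-2.0, 3.0, -1.0],
              [ 1.0, 2.0, -1.0],
              [-2.0, -1.0, 1.0]])
b = np.array([1.0, 4.0, -3.0])

x = []
for j in range(3):
    Aj = A.copy()
    Aj[:, j] = b                 # replace column j of A with b
    x.append(np.linalg.det(Aj) / np.linalg.det(A))
print(x)                         # [2.0, 3.0, 4.0] (up to rounding)
print(np.linalg.solve(A, b))     # same answer
```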
Lecture 8: Unconstrained Optimization
• Taylor series are used commonly to represent functions as infinite series of the function’s
derivatives at some point a. One can thus approximate functions by using lower-order, finite
series known as Taylor polynomials. If a = 0, the series is called a Maclaurin series.
• Specifically, a Taylor series of a real or complex function f (x) that is infinitely differentiable
in the neighborhood of point a is:
Σ_{n=0}^∞ [f^(n)(a)/n!] (x − a)^n = f (a) + [f ′(a)/1!](x − a) + [f ″(a)/2!](x − a)² + · · ·
• We can often approximate the curvature of a function f (x) at point a using a 2nd order
Taylor polynomial around point a:
f (x) = f (a) + [f ′(a)/1!](x − a) + [f ″(a)/2!](x − a)² + R2
where R2 is the remainder. Dropping the remainder gives the approximation
f (x) ≈ f (a) + f ′(a)(x − a) + [f ″(a)/2](x − a)²
• Taylor series expansion is easily generalized to multiple dimensions.
or
Q(x) = x^T A x
• Examples:
1. Quadratic on R2 :
Q(x1 , x2 ) = [x1 x2 ] [a11 (1/2)a12 ; (1/2)a12 a22 ] [x1 ; x2 ]
           = a11 x1² + a12 x1 x2 + a22 x2²
2. Quadratic on R3 :
Q(x1 , x2 , x3 ) = [x1 x2 x3 ] [a11 (1/2)a12 (1/2)a13 ; (1/2)a12 a22 (1/2)a23 ; (1/2)a13 (1/2)a23 a33 ] [x1 ; x2 ; x3 ]
               = a11 x1² + a22 x2² + a33 x3² + a12 x1 x2 + a13 x1 x3 + a23 x2 x3
1. Positive Definite:
Q(x) = x^T [1 0; 0 1] x = x1² + x2²
2. Positive Semidefinite:
Q(x) = x^T [1 −1; −1 1] x = (x1 − x2 )²
3. Indefinite:
Q(x) = x^T [1 0; 0 −1] x = x1² − x2²
• Given an n × n matrix A, kth order principal minors are the determinants of the k × k
submatrices along the diagonal obtained by deleting n − k columns and the same n − k rows
from A.
• Define the kth leading principal minor Mk as the determinant of the k × k submatrix
obtained by deleting the last n − k rows and columns from A.
• Conditions for Extrema: The conditions for extrema are similar to those for functions on
R1 . Let f (x) be a function of n variables. Let B(x, ε) be the ε-ball about the point x. Then
1. f (x∗ ) > f (x), ∀x ∈ B(x∗ , ε) =⇒ Strict Local Max
2. f (x∗ ) ≥ f (x), ∀x ∈ B(x∗ , ε) =⇒ Local Max
3. f (x∗ ) < f (x), ∀x ∈ B(x∗ , ε) =⇒ Strict Local Min
4. f (x∗ ) ≤ f (x), ∀x ∈ B(x∗ , ε) =⇒ Local Min
• When we examined functions of one variable x, we found critical points by taking the first
derivative, setting it to zero, and solving for x. For functions of n variables, the critical points
are found in much the same way, except now we set the partial derivatives equal to zero.
• Given a function f (x) in n variables, the gradient ∇f (x) is a column vector, where the ith
element is the partial derivative of f (x) with respect to xi :
∇f (x) = [∂f (x)/∂x1 ; ∂f (x)/∂x2 ; . . . ; ∂f (x)/∂xn ]
• When we found a critical point for a function of one variable, we used the second derivative
as an indicator of the curvature at the point in order to determine whether the point was a
min, max, or saddle. For functions of n variables, we use second order partial derivatives as
an indicator of curvature.
• Given a function f (x) of n variables, the Hessian H(x) is an n × n matrix, where the (i, j)th
element is the second order partial derivative of f (x) with respect to xi and xj :
H(x) = [∂²f (x)/∂x1² ∂²f (x)/∂x1 ∂x2 · · · ∂²f (x)/∂x1 ∂xn ; ∂²f (x)/∂x2 ∂x1 ∂²f (x)/∂x2² · · · ∂²f (x)/∂x2 ∂xn ; . . . ; ∂²f (x)/∂xn ∂x1 ∂²f (x)/∂xn ∂x2 · · · ∂²f (x)/∂xn² ]
• Curvature and The Taylor Polynomial as a Quadratic Form: The Hessian is used in
a Taylor polynomial approximation to f (x) and provides information about the curvature of
f (x) at x — e.g., which tells us whether a critical point x∗ is a min, max, or saddle point.
At a critical point (where ∇f (x∗ ) = 0), the second order Taylor polynomial gives
f (x∗ + h) − f (x∗ ) ≈ (1/2) h^T H(x∗ ) h
(a) If H(x∗ ) is positive definite, then the RHS is positive for all small h, and x∗ is a strict local minimum.
(b) Conversely, if H(x∗ ) is negative definite, then the RHS is negative for all small h, and x∗ is a strict local maximum.
• Example: We found that the only critical point of f (x) = (x1 − 1)² + x2² + 1 is at x∗ = (1, 0).
Is it a min, max, or saddle point?
1. Recall that the gradient of f (x) is
∇f (x) = [2(x1 − 1); 2x2 ]
Then the Hessian is
H(x) = [2 0; 0 2]
2. To check the definiteness of H(x∗ ), we could use either of two methods:
(a) Determine whether x^T H(x∗ )x is greater or less than zero for all x ≠ 0:
x^T H(x∗ )x = [x1 x2 ] [2 0; 0 2] [x1 ; x2 ] = 2x1² + 2x2²
For any x ≠ 0, 2(x1² + x2²) > 0, so the Hessian is positive definite and x∗ is a strict
local minimum.
(b) Using the method of leading principal minors, we see that M1 = 2 and M2 = 4. Since
both are positive, the Hessian is positive definite and x∗ is a strict local minimum.
• To determine whether a critical point is a global min or max, we can check the concavity
of the function over its entire domain. Here again we use the definiteness of the Hessian to
determine whether a function is globally concave or convex:
1. H(x) Positive Semidefinite ∀x =⇒ Globally Convex
2. H(x) Negative Semidefinite ∀x =⇒ Globally Concave
Notice that the definiteness conditions must be satisfied over the entire domain.
• Given a function f (x) and a point x∗ such that ∇f (x∗ ) = 0,
1. f (x) Globally Convex =⇒ Global Min
2. f (x) Globally Concave =⇒ Global Max
• Note that showing that H(x∗ ) is negative semidefinite is not enough to guarantee x∗ is a local
max. However, showing that H(x) is negative semidefinite for all x guarantees that x∗ is a
global max. (The same goes for positive semidefinite and minima.)
• Example: Take f1 (x) = x4 and f2 (x) = −x4 . Both have x = 0 as a critical point. Unfortunately, f1''(0) = 0 and f2''(0) = 0, so we can't tell whether x = 0 is a min or max for either. However, f1''(x) = 12x2 and f2''(x) = −12x2 . For all x, f1''(x) ≥ 0 and f2''(x) ≤ 0 — i.e., f1 (x) is globally convex and f2 (x) is globally concave. So x = 0 is a global min of f1 (x) and a global max of f2 (x).
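The same check takes two lines in SymPy:

```python
import sympy as sp

x = sp.symbols('x', real=True)
print(sp.diff(x**4, x, 2))   # 12*x**2   >= 0 everywhere -> globally convex
print(sp.diff(-x**4, x, 2))  # -12*x**2  <= 0 everywhere -> globally concave
```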
• Example: Find the critical points of f (x1 , x2 ) = x31 − x32 + 9x1 x2 and determine whether each is a min, max, or saddle point.
1. First order conditions. Set the gradient equal to zero and solve for x1 and x2 .
\frac{\partial f}{\partial x_1} = 3x_1^2 + 9x_2 = 0
\frac{\partial f}{\partial x_2} = -3x_2^2 + 9x_1 = 0
We have two equations in two unknowns. Solving for x1 and x2 , we get two critical points: x∗1 = (0, 0) and x∗2 = (3, −3).
2. Second order conditions. Determine whether the Hessian is positive or negative definite. The Hessian is
H(x) = \begin{pmatrix} 6x_1 & 9 \\ 9 & -6x_2 \end{pmatrix}
Evaluated at x∗1 ,
H(x_1^*) = \begin{pmatrix} 0 & 9 \\ 9 & 0 \end{pmatrix}
The two leading principal minors are M1 = 0 and M2 = −81, so H(x∗1 ) is indefinite and x∗1 = (0, 0) is a saddle point.
Evaluated at x∗2 ,
H(x_2^*) = \begin{pmatrix} 18 & 9 \\ 9 & 18 \end{pmatrix}
The two leading principal minors are M1 = 18 and M2 = 243. Since both are positive, H(x∗2 ) is positive definite and x∗2 = (3, −3) is a strict local min.
3. Global concavity/convexity. In evaluating the Hessians at x∗1 and x∗2 we saw that the Hessian is not everywhere positive semidefinite. Hence, we can't infer that x∗2 = (3, −3) is a global minimum. In fact, if we set x1 = 0, then f (x) = −x32 , which goes to −∞ as x2 → ∞.
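The whole example can be reproduced symbolically. A sketch in Python with SymPy that finds the real critical points and classifies them via the eigenvalues of the Hessian:

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2', real=True)
f = x1**3 - x2**3 + 9 * x1 * x2

grad = [sp.diff(f, v) for v in (x1, x2)]
# Keep only the real solutions of the first order conditions.
crit = [s for s in sp.solve(grad, [x1, x2], dict=True)
        if all(v.is_real for v in s.values())]
print(crit)  # [{x1: 0, x2: 0}, {x1: 3, x2: -3}]

H = sp.hessian(f, (x1, x2))
for pt in crit:
    # Mixed-sign eigenvalues -> saddle; all positive -> strict local min.
    print(pt, H.subs(pt).eigenvals())
```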
Lecture 9: Constrained Optimization
Today's Topics:
• Constrained Optimization • Equality Constraints • Inequality Constraints • Kuhn-Tucker
Conditions
1.9.1 Constrained Optimization
• We have already looked at optimizing a function in one or more dimensions over the whole
domain of the function. Often, however, we want to find the maximum or minimum of a
function over some restricted part of its domain.
• In any constrained optimization problem, the constrained maximum will always be less than or equal to the unconstrained maximum. If the constrained maximum is strictly less than the unconstrained maximum, then the constraint is binding.
• For a function f (x1 , . . . , xn ), there are two types of constraints that can be imposed:
1. Equality constraints: constraints of the form ck (x1 , . . . , xn ) = rk . Budget constraints
are the classic example of equality constraints in social science.
2. Inequality constraints: constraints of the form gm (x1 , . . . , xn ) ≤ bm . These might arise
from non-negativity constraints or other threshold effects.
• When working with constrained optimization problems, always make sure that the set of constraints is not pathological: it must be possible for all of the constraints to be satisfied simultaneously.
• Example: Maximize f (x1 , x2 ) = −(x21 + 2x22 ) subject to the constraint that x1 + x2 = 4. It
is easy to see that the unconstrained maximum occurs at (x1 , x2 ) = (0, 0), but that does not
satisfy the constraint. How should we proceed?
• Equality constraints are the easiest to deal with because we know that the maximum or
minimum has to lie on the (intersection of the) constraint(s).
• The trick is to change the problem from a constrained optimization problem in n variables
to an unconstrained optimization problem in n + k variables, adding one variable for each
equality constraint.
• Lagrangian function: We define the Lagrangian function L(x1 , . . . , xn , λ1 , . . . , λk ) as follows:
L(x_1, \ldots, x_n, \lambda_1, \ldots, \lambda_k) = f(x_1, \ldots, x_n) - \sum_{i=1}^{k} \lambda_i \left( c_i(x_1, \ldots, x_n) - r_i \right)
Occasionally, you may see the following form of the Lagrangian, which is equivalent:
L(x_1, \ldots, x_n, \lambda_1, \ldots, \lambda_k) = f(x_1, \ldots, x_n) + \sum_{i=1}^{k} \lambda_i \left( r_i - c_i(x_1, \ldots, x_n) \right)
• To find the critical points, we take the partial derivatives of L(x1 , . . . , xn , λ1 , . . . , λk ) with
respect to each of its variables. At a critical point, each of these partial derivatives must be
equal to zero, so we obtain a system of n + k equations in n + k unknowns:
\frac{\partial L}{\partial x_1} = \frac{\partial f}{\partial x_1} - \sum_{i=1}^{k} \lambda_i \frac{\partial c_i}{\partial x_1} = 0    (1)
\vdots    (2)
\frac{\partial L}{\partial x_n} = \frac{\partial f}{\partial x_n} - \sum_{i=1}^{k} \lambda_i \frac{\partial c_i}{\partial x_n} = 0    (3)
\frac{\partial L}{\partial \lambda_1} = c_1(x_1, \ldots, x_n) - r_1 = 0    (4)
\vdots    (5)
\frac{\partial L}{\partial \lambda_k} = c_k(x_1, \ldots, x_n) - r_k = 0    (6)
• Some caveats apply. There may be more than one critical point. Analogs to second-order
conditions for unconstrained optimization exist, or it may suffice to check the critical points
individually. There are also conditions on the behavior of the constraints at critical points;
these are typically satisfied with non-pathological linear constraints.
• Example: Maximize f (x1 , x2 ) = −(x21 + 2x22 ) subject to x1 + x2 = 4. The Lagrangian is
L(x_1, x_2, \lambda) = -(x_1^2 + 2x_2^2) - \lambda (x_1 + x_2 - 4)
The first order conditions are
\frac{\partial L}{\partial x_1} = -2x_1 - \lambda = 0
\frac{\partial L}{\partial x_2} = -4x_2 - \lambda = 0
\frac{\partial L}{\partial \lambda} = -(x_1 + x_2 - 4) = 0
The first two equations give x1 = 2x2 , so x2 = 4/3, x1 = 8/3, and λ = −16/3. The constrained maximum is f (8/3, 4/3) = −32/3.
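The same system can be handed to a symbolic solver. A sketch in Python with SymPy:

```python
import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lam', real=True)
f = -(x1**2 + 2 * x2**2)
L = f - lam * (x1 + x2 - 4)  # Lagrangian for the constraint x1 + x2 = 4

# Set every partial of L to zero and solve the 3 x 3 system.
eqs = [sp.diff(L, v) for v in (x1, x2, lam)]
print(sp.solve(eqs, [x1, x2, lam], dict=True))
# [{lam: -16/3, x1: 8/3, x2: 4/3}]
```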
• Inequality constraints are more challenging because we do not know ahead of time which
constraints will be binding and which will not. Inequality constraints define the boundary
of a region over which we seek to optimize the function. The maximum/minimum could lie
along one of the constraints, or it could be in the interior of the region.
• Again, one way to deal with this problem is by introducing more variables in order to turn
the problem into an unconstrained optimization.
• Slack: For each inequality constraint gi (x1 , . . . , xn ) ≤ bi , we define a slack variable s2i for which gi (x1 , . . . , xn ) + s2i = bi holds with equality. These slack variables capture how close the constraint comes to binding. We use s2 rather than s to ensure that the slack is non-negative. The Lagrangian becomes
L(x_1, \ldots, x_n, \lambda_1, \ldots, \lambda_m, s_1, \ldots, s_m) = f(x_1, \ldots, x_n) - \sum_{i=1}^{m} \lambda_i \left( g_i(x_1, \ldots, x_n) + s_i^2 - b_i \right)
• To find the critical points, we now need to take the partials with respect to each x, λ, and s. This will give us n + 2m equations in n + 2m unknowns:
\frac{\partial L}{\partial x_1} = \frac{\partial f}{\partial x_1} - \sum_{i=1}^{m} \lambda_i \frac{\partial g_i}{\partial x_1} = 0    (13)
\vdots    (14)
\frac{\partial L}{\partial x_n} = \frac{\partial f}{\partial x_n} - \sum_{i=1}^{m} \lambda_i \frac{\partial g_i}{\partial x_n} = 0    (15)
\frac{\partial L}{\partial \lambda_1} = g_1(x_1, \ldots, x_n) + s_1^2 - b_1 = 0    (16)
\vdots    (17)
\frac{\partial L}{\partial \lambda_m} = g_m(x_1, \ldots, x_n) + s_m^2 - b_m = 0    (18)
\frac{\partial L}{\partial s_1} = -2 s_1 \lambda_1 = 0    (19)
\vdots    (20)
\frac{\partial L}{\partial s_m} = -2 s_m \lambda_m = 0    (21)
• Complementary slackness: The last set of first order conditions, of the form si λi = 0, are known as complementary slackness conditions. These conditions can be satisfied in one of three ways:
1. λi = 0 and si ≠ 0: the slack is positive, so the constraint does not bind.
2. λi ≠ 0 and si = 0: there is no slack, so the constraint binds.
3. λi = 0 and si = 0: there is no slack, but the constraint binds only trivially, without changing the optimum.
• Example: Find the critical points for the following constrained optimization problem:
Maximize f (x1 , x2 ) = −(x21 + 2x22 ) s.t.
x1 + x2 ≤ 4
x1 ≥ 0
x2 ≥ 0
1. Rewrite the constraints in the form gi (x1 , . . . , xn ) ≤ bi : x1 + x2 ≤ 4, −x1 ≤ 0, and −x2 ≤ 0.
2. Form the Lagrangian, with one multiplier and one slack variable per constraint:
L(x_1, x_2, \lambda_1, \lambda_2, \lambda_3, s_1, s_2, s_3) = -(x_1^2 + 2x_2^2) - \lambda_1 (x_1 + x_2 + s_1^2 - 4) - \lambda_2 (-x_1 + s_2^2) - \lambda_3 (-x_2 + s_3^2)
3. Taking the partials yields a huge mess: a system of 8 non-linear equations in 8 unknowns. Fortunately, we only have to look at the various ways that we can satisfy the complementary slackness conditions:
Hypothesis                 s1      s2      s3      λ1      λ2     λ3     x1     x2     f (x1 , x2 )
s1 = s2 = s3 = 0           No solution
s1 ≠ 0, s2 = s3 = 0        2       0       0       0       0      0      0      0      0
s2 ≠ 0, s1 = s3 = 0        0       2       0       −8      0      −8     4      0      −16
s3 ≠ 0, s1 = s2 = 0        0       0       2       −16     −16    0      0      4      −32
s1 ≠ 0, s2 ≠ 0, s3 = 0     No solution
s1 ≠ 0, s3 ≠ 0, s2 = 0     No solution
s2 ≠ 0, s3 ≠ 0, s1 = 0     0       √(8/3)  √(4/3)  −16/3   0      0      8/3    4/3    −96/9
s1 ≠ 0, s2 ≠ 0, s3 ≠ 0     No solution
4. This method has identified the four critical points of the function in the region consistent
with the constraints. The constrained maximum is located at (x1 , x2 ) = (0, 0), which is
the same as the unconstrained max. The constrained minimum is located at (x1 , x2 ) =
(0, 4), while there is no unconstrained minimum for this problem.
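Enumerating the complementary slackness hypotheses is mechanical enough to script. A sketch in Python with SymPy that, for each of the 2^3 binding patterns, sets either si = 0 or λi = 0 and solves the remaining first order conditions; the output should reproduce the rows of the table above (some cases may print overlapping solutions):

```python
import sympy as sp
from itertools import product

x1, x2 = sp.symbols('x1 x2', real=True)
l1, l2, l3 = sp.symbols('l1 l2 l3', real=True)
s1, s2, s3 = sp.symbols('s1 s2 s3', real=True, nonnegative=True)

f = -(x1**2 + 2 * x2**2)
L = (f - l1 * (x1 + x2 + s1**2 - 4)
       - l2 * (-x1 + s2**2)
       - l3 * (-x2 + s3**2))

all_vars = [x1, x2, l1, l2, l3, s1, s2, s3]
focs = [sp.diff(L, v) for v in all_vars]

pairs = [(s1, l1), (s2, l2), (s3, l3)]
for pattern in product([True, False], repeat=3):  # True -> set s_i = 0 (binds)
    subs = {(s if bind else l): 0 for (s, l), bind in zip(pairs, pattern)}
    eqs = [e.subs(subs) for e in focs]
    unknowns = [v for v in all_vars if v not in subs]
    print(pattern, sp.solve(eqs, unknowns, dict=True) or "no solution")
```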
• The process described above will identify the critical points of a function subject to some
constraints, but it can be a pain to implement. In particular, explicitly including the non-
negativity constraints makes the problem significantly more complex.
3. The same four points are identified using just the equality constraints: (x1 , x2 , λ1 ) = (0, 0, 0), (4, 0, −8), (0, 4, −16), and (8/3, 4/3, −16/3). Three of these points, however, violate the Kuhn-Tucker requirement that λ ≥ 0 at a maximum, so the point (0, 0, 0) is the maximum.
• Exercise: Maximize
f(x) = \frac{1}{3} \log(x_1 + 1) + \frac{2}{3} \log(x_2 + 1)    (55)
s.t. x_1 + 2x_2 \leq b    (56)
x_1 \geq 0    (57)
x_2 \geq 0    (58)
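To check an answer numerically, the exercise can be handed to an off-the-shelf constrained solver. A sketch in Python with SciPy, assuming a concrete value b = 4 (the exercise leaves b as a parameter):

```python
import numpy as np
from scipy.optimize import minimize

b = 4.0  # assumed value; the exercise leaves b general

# SciPy minimizes, so negate the objective in order to maximize.
def neg_f(x):
    return -(np.log(x[0] + 1) / 3 + 2 * np.log(x[1] + 1) / 3)

res = minimize(
    neg_f,
    x0=[1.0, 1.0],
    bounds=[(0, None), (0, None)],  # x1 >= 0, x2 >= 0
    constraints=[{"type": "ineq", "fun": lambda x: b - x[0] - 2 * x[1]}],
)
print(res.x)  # roughly [1.33, 1.33] for b = 4
```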