Statistics For Econometrics
Class Notes
Leonard Goff
This version: September 1, 2022
1 Probability
1.1 Probability spaces
1.1.1 Outcomes and events
1.1.2 The probability of an event
1.1.3 Which sets of outcomes get a probability?
1.1.4 Bringing it all together: a probability space
1.2 Random variables
1.2.1 Definition
1.2.2 Notation
1.3 The distribution of a random variable
1.3.1 Central concept: the cumulative distribution function
1.3.2 Probability mass and density functions
1.3.2.1 Case 1: Discrete random variables and the probability mass function
1.3.2.2 Case 2: Continuous random variables and the probability density function
1.3.2.3 Case 3 (everything else): mixed distributions
1.3.3 Marginal and joint distributions
1.3.4 Functions of a random variable
1.4 The expected value of a random variable
1.4.1 General definition
1.4.2 Application: variance
1.5 Conditional distributions and expectation
1.5.1 Conditional probabilities
1.5.2 Conditional distributions
1.5.3 Conditional expectation (and variance)
1.6 Random vectors and random matrices
1.6.1 Definition
1.6.2 Conditional distributions with random vectors
1.6.2.1 Conditioning on a random vector
1.6.2.2 The conditional distribution of a random vector
3 Statistical models
3.1 Modeling the distribution of a random vector
3.2 Two examples of parametric distributions
3.2.1 The normal distribution
3.2.2 The binomial distribution*
3.3 Random sampling
4 When the sample gets big: asymptotic theory
4.1 Introduction: the law of large numbers
4.2 Asymptotic sequences
4.2.1 The general problem
4.2.2 Example: LLN and the sample mean
4.3 Convergence in probability and convergence in distribution
4.4 The central limit theorem
4.5 Properties of convergence of random variables
4.5.1 The continuous mapping theorem
4.5.2 The delta method
4.5.3 The Cramér–Wold theorem*
4.6 Limit theorems for distribution functions*
7 Linear regression
7.1 Motivation from selection-on-observables
7.2 The linear regression model
7.3 Five reasons to use the linear regression model
7.3.1 As a structural model of the world
7.3.2 As an approximation to the conditional expectation function
7.3.3 As a way to summarize variation*
7.3.4 As a weighted average of something we care about*
7.3.4.1 Example 1: the average derivative of a CEF*
7.3.4.2 Example 2: weighted averages of treatment effects*
7.3.5 As a tool for prediction*
7.4 Understanding the population regression coefficient vector
7.4.1 Existence and uniqueness of β: no perfect multicollinearity
7.4.2 Simple linear regression in terms of covariances
7.4.3 Multiple linear regression in terms of covariances
7.5 The ordinary least squares (OLS) estimator
7.5.1 Sample
7.5.2 OLS estimator
7.5.3 Matrix notation
7.5.4 Existence and uniqueness of β̂: no perfect multicollinearity in sample
7.5.5 More matrix notation
7.5.6 Regression as projection*
7.5.7 The Frisch-Waugh-Lovell theorem: matrix version*
7.5.8 For the matrix haters: OLS in terms of covariances*
7.5.9 A review of notation
7.6 Statistical properties of the OLS estimator
7.6.1 Finite sample properties of β̂, conditional on X*
7.6.1.1 Unbiasedness
7.6.1.2 Efficiency
7.6.2 Asymptotic properties of β̂
7.6.2.1 Consistency
7.6.2.2 Asymptotic normality
7.6.2.3 Estimating the asymptotic variance
7.7 Inference on the regression vector β
7.7.1 Testing hypotheses about a single regression coefficient
7.7.2 Testing a joint hypothesis about the regression coefficients*
Guide to using these notes
This set of notes arose from the course ECON8070: Statistics for Econometrics at the University of
Georgia in Fall 2021. I’m fixing typos as I find them and keeping this document updated on my website.
Check www.leonardgoff.com for the most recent version.
These notes feature two kinds of box, to help organize the material:
White boxes indicate material that is optional, and understanding this material is not required
for the course or exam.
Sections that have an asterisk at the end of their title can be skipped in their entirety: understanding
this material is not required for the course or exam. These sections are mostly there for your interest
and reference.
Chapter 1
Probability
Main idea: A probability function ascribes a number to each of a collection of events, where
each event is a set of outcomes.
This section develops the mathematical notion of probability. Probability is a function that associates
a number between zero and one to events. Events, in turn, are sets of outcomes. It’s easiest to think
of outcomes in the context of a process that could have multiple distinct results, like flipping a coin or
randomly choosing a number from a phone book.
Examples: When flipping a coin, the sample space is Ω = {H, T }, corresponding to “heads” or “tails”,
respectively. When rolling a six-sided die, Ω = {1, 2, 3, 4, 5, 6}. When drawing a card from a 52-card
deck, the sample space can be denoted as a combination of a card-value and a suit, or {(n, s) : n ∈
{A, 2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K}, s ∈ {hearts, spades, diamonds, clubs}}. When using a random num-
ber generator to draw any number between 0 and 1, the sample space is Ω = [0, 1].
We denote a generic element of the sample space as ω ∈ Ω. What we call events are simply sets
of such ω, i.e. subsets of Ω. But in general, not all subsets of Ω necessarily need to be events. Rather,
we consider a collection of sets F , referred to as an event space.
Definition 1.1. An event space F is a collection of subsets A ⊆ Ω.
In all of the examples given above, the outcome space Ω has a finite number of elements. In such
cases, it is typical to choose F to be the collection of all subsets of Ω. This collection is referred to
as the powerset of Ω and is often denoted as 2^Ω. As an example, the powerset of the set {1, 2} is
2^{1,2} = {∅, {1}, {2}, {1, 2}}. When we consider Ω that are uncountable sets (for example when Ω is a
continuum), we’ll need to restrict the event-space, as discussed below.
This formulation of probability is sometimes referred to as the Kolmogorov axioms of probability.
These axioms imply several intuitive properties of probability. For example, if A has a countable
number of elements, then the third property in Definition 1.2 implies that:
P(A) = Σ_{ω ∈ A} P({ω})
provided that {ω} ∈ F for each ω ∈ A. In particular, this result implies that for a finite set A we
can simply sum up the probability of each of the outcomes in A. For example, for a six-sided die
P (even) = P ({2}) + P ({4}) + P ({6}).
A few other properties of probability functions are left as exercises. As practice, I’ll include a proof
of the familiar property that P (Ac ) = 1 − P (A). To see this, note that A and Ac are disjoint sets, and
that A ∪ Ac = Ω. Thus, by the third property of Definition 1.2 P (Ω) = P (A) + P (Ac ). Then use the
second property to obtain the result.
Exercise: Derive the expression: P (A ∩ B) = P (A) + P (B) − P (A ∪ B). Hint: use (A ∩ B)c = Ac ∪ B c .
Example: Consider as an outcome space the entire unit interval: Ω = [0, 1]. It turns out that it is
impossible to define a ”uniform” probability function on this Ω, if we insist on using the whole powerset
of [0, 1] as our event space F . That is, there is no function P (·) satisfying Kolmogorov’s axioms, and
defined over all A ∈ 2[0,1] , that satisfies our intuitive notion that moving a set around in the unit interval
does not change its probability. See Proposition 1.2.6 of Rosenthal (2006) for details.
This example demonstrates that in some cases we may need to work with something smaller than 2Ω . In
particular, issues like the above arise when Ω is uncountably infinite, e.g. corresponding to a continuum
of numbers. When Ω is finite or countable, it usually makes sense to consider the full powerset of Ω as
our event space. When we are in the uncountable case (e.g. when Ω is a convex subset of the real line
[a, b]), we typically appeal to the Borel σ-algebra:
Definition 1.4. The Borel σ-algebra B is the collection that consists of all intervals of the forms [a, b],
(a, b], [a, b), (a, b), and all other sets in R that are then implied by the definition of a σ-algebra.
Exercise: Show that for any Ω, the collection {∅, Ω} is a σ-algebra.
1.2.1 Definition
Most data we use in econometrics is quantitative in nature, so it's natural to think of probability spaces
in which the outcome space is composed of numbers. Many of the examples have this feature already, for
example Ω = {1, 2, 3, 4, 5, 6} for a six-sided die. But even when the ω do not have an immediate numeric
interpretation, we can define a random variable by associating a number to each outcome ω:
Definition 1.6. Given a probability space (Ω, F, P ), a random variable X is a function X : Ω → R.
Example: Suppose I randomly select a student in this class, which I represent by a probability space with
Ω = {all students in this class}, F = 2^Ω, and P({ω}) = 1/|Ω| for each ω ∈ Ω. If we let X(ω) denote the
height in inches of student ω, then X is a random variable.
A random variable X defined from a primitive probability space (Ω, F, P ) allows us to define a new
probability space in which the outcomes are real numbers. We can now define a new probability function
PX on sets of real numbers, using the original probability function P on Ω:
PX (A) := P ({ω ∈ Ω : X(ω) ∈ A}) (1.1)
Technical note: observe that the above definition gives a way to associate a probability PX with any
set A of real numbers, provided that {ω ∈ Ω : X(ω) ∈ A} ∈ F . To ensure this condition holds it is
typical to restrict to sets A that belong to the Borel algebra B defined in Section 1.1.3, and further insist
that the function X is measurable. X being measurable is a technical condition that just means that for
any x ∈ R, the set {ω ∈ Ω : X(ω) ≤ x} ∈ F . Our new probability space can now be denoted as (R, B, PX ).
A realization of random variable X is the specific value X(ω) that it ends up taking, given ω. While X
is a function, X(ω) is a number. Lowercase letters x are often used to denote numbers that are possible
realizations: e.g. x = X(ω) for some ω ∈ Ω.
1.2.2 Notation
The notation of Equation (1.1) is pretty cumbersome to work with, so the convention is to simplify it in
a few ways.
Let’s start with an example. If we’re interested in the probability that X(ω) is less than or equal to
5, we’ll typically write this as: P (X ≤ 5), which can be interpreted as PX (A) where A = (−∞, 5], or
equivalently: P ({ω ∈ Ω : X(ω) ≤ 5}). What’s changed in this notation? Let’s go through step-by-step:
First, we haven’t bothered with the subscript X on PX like in Equation (1.1) because it’s clear
from what’s inside the parentheses that we’re talking about random variable X.
Second, inside the function P we’re using the language of conditions rather than sets. That is,
rather than writing out the set A = (−∞, 5] of values we’re interested in, we just write this as a
condition: “≤ 5”.
Third, we’ve made ω implicit and written X rather than X(ω). However, you often see ω left in.
For example, we might write P (X(ω) = x) for the probability that X takes a value of x. In the
context of Equation (1.1), this maps onto PX({x}), or equivalently P({ω ∈ Ω : X(ω) = x}).
Given that we’re using the language of conditions, we often write “and” inside probabilities, for exam-
ple: P (X ≤ 5 and X ≥ 2). The “and” operation translates into intersection in the language of sets:
P ({ω ∈ Ω : X(ω) ≤ 5} ∩ {ω ∈ Ω : X(ω) ≥ 2}). Similarly, “or” translates into the union of sets:
P (X ≤ 5 or X ≥ 2) = P ({ω ∈ Ω : X(ω) ≤ 5} ∪ {ω ∈ Ω : X(ω) ≥ 2}).
Note: We may have multiple random variables, e.g. X could be a randomly chosen state's minimum
wage, while Y is its unemployment rate. Mathematically, these two random variables correspond to
functions X(·) and Y (·) applied to a common underlying outcome space Ω, which in this case cor-
responds to the set of US states. Probabilities like P (X ≤ $10 and Y ≤ 5%) are interpreted as
P ({ω ∈ Ω : X(ω) ≤ $10 and Y (ω) ≤ 5%}). If P ({ω}) = 1/50 for all ω, then this probability is in
turn equal to the number of states that have a minimum wage less than or equal to $10 and an unemployment
rate less than or equal to 5%, divided by 50.
From the CDF, we can derive anything we’ll need to know about a single random variable. When we
have multiple random variables, the joint-CDF tells us everything we need to know about them.
Definition 1.8. The joint-CDF of two random variables X and Y is the function
FXY (x, y) := P (X ≤ x and Y ≤ y)
We’ll come back to the joint-CDF of two (or more) random variables in Section 1.5.
Although the CDF F (x) of a random variable is a function of a single variable x, we can use it to recover
the probability that X lies in a set. For example, consider the set (a, b], that is all numbers between a
and b, including b itself.
Proposition 1.1. For any numbers a and b such that b ≥ a, P (X ∈ (a, b]) = F (b) − F (a)
Proof. Given that P(A) = 1 − P(A^c) (see Section 1.1.2), we have that:
P(X ∈ (a, b]) = 1 − P(X ≤ a or X > b)
Using the third property of a probability function, we have that P(X ≤ a or X > b) = P(X ≤ a) + P(X > b),
since the sets {x ∈ R : x ≤ a} and {x ∈ R : x > b} are disjoint. Thus:
P(X ∈ (a, b]) = 1 − P(X ≤ a) − P(X > b) = 1 − F(a) − (1 − F(b)) = F(b) − F(a).
1.3.2.1 Case 1: Discrete random variables and the probability mass function
Call X a discrete set if X contains a finite number of elements, or a countably infinite number of elements
(e.g. X = N, the set of natural numbers).
Definition 1.9. A discrete random variable X is a random variable such that P (X ∈ X ) = 1 for some
discrete set X .
Example: If X is the number returned by rolling a die, then X is a discrete random variable because
P (X ∈ {1, 2, 3, 4, 5, 6}) = 1.
For any random variable, we call the smallest set X for which P(X ∈ X) = 1 the support of X. A discrete
random variable has as its support a discrete set.
When X is a discrete random variable, its CDF ends up looking like a staircase: flat everywhere except
at each x in its support, where it “jumps” up by an amount P (X = x). For example, for a six-sided die:
Note: The open/closed dots at e.g. x = 1 indicate that F(1) is equal to 1/6, and not 0 (although it is equal
to 0 for x arbitrarily close but to the left of 1). We see from this graph why CDFs are right-continuous
but not necessarily left-continuous.
At each point in its support {1, 2, 3, 4, 5, 6}, the CDF for the die jumps by P (X = x), or 1/6. This is
a general feature of discrete random variables. Thus, rather than use the CDF function F (x) to represent
the distribution of X, we can just keep track of where it jumps and by how much. To do this, we use
the probability mass function or p.m.f. of X:
Definition 1.10. The probability mass function of a random variable X is the function π(x) = P (X = x)
Figure 1.1: The CDF of the number returned by a fair six-sided die.
For a discrete random variable, we can express the p.m.f. alternatively as a sequence, rather than a func-
tion. Label the points in the support of X as {x1 , x2 , x3 , . . . }, in increasing order so that x1 < x2 < x3 . . . .
Let xj denote the j th value in this sequence. For any j, let πj = π(xj ) = P (X = xj ).
The sequence of probabilities {π1 , π2 , π3 , . . . } coupled with the sequence of support points {x1 , x2 , x3 , . . . }
carries exactly the same information as the full CDF.
Obtaining the p.m.f. from the CDF: For a given support point xj: πj = F(xj) − F(x_{j−1}), and for any
x: π(x) = lim_{ϵ↓0} [F(x) − F(x − ϵ)]. Note that π(x) = 0 for any x that is not a support point, and F is
continuous everywhere except at the support points {x1, x2, x3, ...}.
Obtaining the CDF from the p.m.f. (only possible for a discrete random variable): F(x) = Σ_{j: xj ≤ x} πj.
Note that from this last expression, we can see that since lim_{x→∞} F(x) = 1, we must have that Σ_j πj = 1
– probability mass functions sum to one when the sum is taken across all support points j.
1.3.2.2 Case 2: Continuous random variables and the probability density function
For random variables that are not discrete, knowing the probability mass function isn’t sufficient to
recover the whole CDF. Often P (X = x) = 0 for all x, so the p.m.f does not even really tell us anything
useful about X’s distribution.
An important class of random variables that are not discrete are random variables for which the CDF
is differentiable for all x. When it is, we can define the probability density function or p.d.f. of X.
Definition 1.11. The probability density function of a random variable X having a differentiable CDF
F(x) is f(x) = (d/dx) F(x).
We will refer to random variables that have a density function f (x) as continuous random variables
(another phrasing is that X is continuously distributed ). Recall that for a function to be differentiable,
it must be continuous; thus, the CDF of a continuous random variable must be continuous, lacking any
jumps like those that characterize the CDF of a discrete random variable.
Note: you may see in various texts a few different notions of “continuity” of a random variable. For
the purposes of this class, a continuous random variable is a random variable with a continuous CDF,
which is basically equivalent to it being differentiable everywhere in its support. We won’t worry about
the distinction between these two things: e.g. random variables with CDFs that are continuous but
non-differentiable.
For a continuous random variable we can use the p.d.f rather than the CDF to calculate anything we need
to know. For example the probability that X lies in any interval [a, b] can be obtained by integrating
over the density function:
P(X ∈ [a, b]) = ∫_a^b f(x) dx    (1.2)
Intuitively, this gives us the area under the curve f (x) between points a and b, as depicted in Figure 1.2.
Note that ∫_a^b f(x) dx = F(b) − F(a), because the CDF is the anti-derivative of the p.d.f.
Figure 1.2: The left panel depicts an example of the p.d.f. f (x) of a random variable X. The probability that
a ≤ X ≤ b is given by the area under the f (x) curve between x = a and x = b. P (a ≤ X ≤ b) is also equal to
F (b) − F (a), the difference in the CDF of X evaluated at x = b and at x = a, as depicted in the right panel.
While the probability mass function π(x) gives us the probability that X equals x exactly, the p.d.f
does not tell us the probability that X = x (in fact for any x: P (X = x) = 0 for a continuous random
variable!).
Rather, f(x) can be interpreted as telling us the probability that X is close to x, in the following
sense. Consider a point x and some small ϵ > 0. Recall the definition of f(x) as the derivative of F(x):

f(x) = (d/dx) F(x) = lim_{ϵ→0} [F(x + ϵ) − F(x)]/ϵ = lim_{ϵ→0} P(X ∈ (x, x + ϵ])/ϵ

where we've used Proposition 1.1 to replace F(x + ϵ) − F(x) with P(X ∈ (x, x + ϵ]). Thus f(x) is the limit
of the ratio of the probability that X lies in a small interval that begins at x, and the width ϵ of that
interval. Note also that for small ϵ: F(x + ϵ) ≈ F(x) + f(x) · ϵ, which is called the first-order Taylor
approximation to F(x + ϵ) around x.
Let us end this section with a few properties of a probability density function:
From Eq. (1.2), we see that the density must integrate to one when the integral is taken over the
whole real line, i.e. ∫_{−∞}^{∞} f(x) dx = 1.
Since F(x) is increasing and f(x) is its derivative, f(x) is positive everywhere: f(x) ≥ 0.
Figure 1.3: An example of the CDF of a mixed random variable. This example has mass points at a and c,
where the CDF jumps discretely. It is continuous everywhere else, and is differentiable everywhere except at {a, b, c}.
There are some technical aspects to stating the Lebesgue decomposition theorem formally, which we
won’t explore here. Rather, it’s easiest to think of this result visually: a generic CDF is any increasing
function bounded between 0 and 1 (which is also right-continuous). The jumps in F (x) define the discrete
part of X (note that it can only jump up, and not down, since F is increasing). The function F (x) will
be differentiable almost everywhere else, defining its continuous part.1
If X and Y are both continuously distributed: fX(x) = ∫_{−∞}^{∞} fXY(x, y) dy, where the joint density
fXY(x, y) is the derivative of the joint CDF with respect to both x and y: fXY(x, y) = ∂²/∂x∂y FXY(x, y).
Intuitively, we can obtain the marginal distribution of X from the joint distribution by summing or
integrating over all values of Y , and we can similarly derive the marginal distribution of Y from the joint
distribution of X and Y by summing/integrating over values of X.
The above results all follow from a fundamental identity for probabilities called the law of total probability:
1 “Almost everywhere” here has a technical meaning. Any monotonic function is guaranteed to be differentiable
except on a set of measure zero: see Lebesgue's theorem for the differentiability of a monotone function.
Proposition (law of total probability): Consider a countable collection of events A1, A2, ... that
partition the sample space (this means that the Aj are disjoint and that ∪_j Aj = Ω). Then for any event
B: P(B) = Σ_j P(B ∩ Aj).
Proof. The proof is good practice, so I include it here. Since any event B ⊆ Ω, B = B ∩ Ω and thus
P(B) = P(B ∩ Ω). Now, since ∪_j Aj = Ω, we have that P(B) = P(B ∩ (∪_j Aj)). Observe that
B ∩ (∪_j Aj) = ∪_j (B ∩ Aj), and that the events (B ∩ Aj) are disjoint for different values of j (since each
is a subset of Aj). Thus, P(B) = Σ_j P(B ∩ Aj), proving the result.
We can use the ideas of marginal and joint distributions to define the notion of independence between
two random variables:
Definition 1.13. We say that random variables X and Y are independent if FXY (x, y) = FX (x) · FY (y)
for all x and y.
When X and Y are independent, we denote this fact as X ⊥ Y . When they are not, we say X ̸⊥ Y .
The reason that we can do this is simple: the original random variable was defined from a function X
defined on an underlying outcome space Ω. Evaluating g(X(ω)) for any ω ∈ Ω yields a new function,
the so-called composition of g with X (this is often denoted as g ◦ X).
Technical note: Recall that the function X(ω) that defines the original random variable X must be a
measurable function. For the above logic to go through, the function g(·) applied to X must also be
measurable, so that the function g ◦ X = g(X(·)) is also measurable. A sufficient condition for a function
to be measurable is that it is piece-wise continuous, which is a very weak condition.
To work with a random variable Y = g(X), we need to know its CDF, which is:
FY (y) = P (Y ≤ y) = P (g(X) ≤ y)
This RHS expression can always be evaluated using the CDF of X. However, there are two important
special cases in which deriving the distribution of Y from that of X is particularly easy:
1. If X has a discrete distribution with support points x1 , x2 , . . . and p.m.f. π1 , π2 , . . . , then Y has
the same p.m.f. π1 , π2 , . . . but at new support points g(x1 ), g(x2 ), . . . .
Example: if X is a random variable that takes value 0 with probability p and 1 with probability
1 − p, then the random variable Y = X + 1 is a random variable that takes value 1 with
probability p and 2 with probability 1 − p.
2. (homework problem) If X has a continuous distribution with density fX(x), and if the function g(x)
is strictly increasing and differentiable with derivative g′, then Y has a density fY(y) = fX(g^{-1}(y))/g′(g^{-1}(y)),
where g^{-1} is the inverse function of g.
Example: if g(x) = log(x), then fY(y) = fX(e^y) · e^y, since g^{-1}(y) = e^y and g′(x) = 1/x.
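As a quick numerical check of this formula (my own illustration, not part of the homework problem): suppose X follows an exponential distribution with rate 1, so fX(x) = e^{-x} for x > 0. Then Y = log(X) should have density fX(e^y) · e^y = exp(y − e^y), which we can verify by simulation in R:

xdraws <- rexp(1e6)                     # a million draws of X ~ Exponential(1)
y <- log(xdraws)                        # the transformed variable Y = log(X)
hist(y, breaks = 100, freq = FALSE)     # simulated distribution of Y
curve(exp(x - exp(x)), add = TRUE, col = "red")   # the density implied by the formula above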
Just as a function applied to a random variable defines a new random variable, functions applied to
multiple random variables also yield a new random variable. For example, if X and Y are each random
variables, then Z = g(X, Y ) is also a random variable, where g(x, y) is now a function that takes two
arguments. Some examples would be the random variables X + Y , X · Y , or min{X, Y }. When taking a
functions of two random variables, Z = g(X, Y ), we need the full joint distribution of X and Y to derive
the CDF of Z. Knowing the two functions FX (x) and FY (y) is generally not enough, rather we need to
know the function FXY (x, y) (see Definition 1.8). This will come up later in the course.
1.4 The expected value of a random variable
Main idea: The expected value of a random variable is a measure of its average value across
realizations. In the special case of a continuous random variable, its value can be obtained by an
integral involving the density function. In the special case of a discrete random variable, its value
can be obtained by a sum involving the probability mass function.
The expected value (a.k.a. expectation value, or simply expectation) of a random variable is a measure
of its average value over all possible realizations. The expectation of X is denoted E[X].
To motivate how E[X] will be defined, think of the task of computing the average of a list of numbers.
For example, the average of the numbers 1, 2, 2, and 4 is (1 + 2 + 2 + 4)/4 = 2. Notice that the number
2 occurred twice in the series, so we added 2 to the sum two times. We could thus have written the
averaging calculation as (1/4) · (1 · 1 + 2 · 2 + 4 · 1), where each number is multiplied by the number of times
it occurs in the list. The general formula could be written

average of a list of numbers = Σ_j (j-th distinct number) · wj,  where wj = (# times the j-th distinct number occurs in the list) / (length of the list)

and notice that the "weights" wj sum to one over all j, i.e. Σ_j wj = 1.
The definition of E[X] for a discrete random variable is exactly analogous to this formula, where we
average over the values xj that X can take, and use as "weights" the probabilities πj:

E[X] = Σ_j xj · πj    (1.4)

where x1, x2, ... are the distinct support points of the random variable and πj is its p.m.f. Note that
the πj sum to one, as we saw in Section 1.3.2.
In the case of a continuous random variable, the analogous expression to Eq. (1.4) replaces the sum
with an integral, and the probability πj = π(xj) is replaced by f(x)dx:

E[X] = ∫ x · f(x) dx    (1.5)

The quantity f(x) · dx can be interpreted as the probability that X lies in an interval [x, x + dx] having
a very small width dx, as discussed in Section 1.3.2.
∫_{−∞}^{∞} x · dF(x) := lim_{a→−∞, b→∞} lim_{N→∞} Σ_{n=1}^{N} (a + n·(b−a)/N) · [F(a + n·(b−a)/N) − F(a + (n−1)·(b−a)/N)]

The quantity ∫_{−∞}^{∞} x · dF(x) is an example of a Riemann–Stieltjes integral, in which we "integrate with
respect to the function" F(x) rather than with respect to the variable x. Let's try to unpack this long
expression.
First, let's fix values of a, b, N and consider the quantity appearing inside all of the limits. For given
b > a, imagine cutting the interval [a, b] into N regions of equal size, so that they each have width (b−a)/N.
The n-th such region extends from the value a + (n−1)·(b−a)/N to the value a + n·(b−a)/N. Note the following:
F(a + n·(b−a)/N) − F(a + (n−1)·(b−a)/N) yields P(X ∈ region n).
a + n·(b−a)/N is the location of (the right end of) region n.
lim_{N→∞} takes the sum to an integral, and the a, b limits cover the full support of X.
Thus, we can interpret E[X] as an integral of the function x over the whole real line, in which each value
of x is multiplied by the probability that X is very close to x, essentially F (x + dx) − F (x).
Discrete case: Now let's see how Definition 1.14 yields Eq. (1.4) in the special case that X is a discrete
random variable. Let x1, x2, ... be the support points of X. Notice that for large enough N, only one xj
can be between a + (n−1)·(b−a)/N and a + n·(b−a)/N. Thus: F(a + n·(b−a)/N) − F(a + (n−1)·(b−a)/N) = πj if
xj lies in the n-th region. If on the other hand no xj lies in the n-th region, this quantity is equal to zero.
We arrive at one term for each value xj, and E[X] = Σ_j xj · πj.
Continuous case: When X is a continuous random variable with density f(x), we can recover Eq. (1.5)
by noticing that for large N:

F(a + n·(b−a)/N) − F(a + (n−1)·(b−a)/N) ≈ f(a + n·(b−a)/N) · (b−a)/N

Substituting in this approximation delivers the familiar formula that E[X] = ∫_{−∞}^{∞} x · f(x) dx.
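As an aside, the approximating sum above is easy to compute numerically. Here is a small R sketch of my own for a uniform [0, 1] random variable (its support is [0, 1], so we can simply set a = 0 and b = 1 rather than taking limits in a and b):

a <- 0; b <- 1; N <- 10000
Fcdf <- punif                               # the CDF of a uniform [0, 1] random variable
right <- a + (1:N) * (b - a) / N            # right end of each of the N regions
left  <- a + (0:(N - 1)) * (b - a) / N      # left end of each region
sum(right * (Fcdf(right) - Fcdf(left)))     # approximately 0.5 = E[X]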
Exercise: Consider a so-called Bernoulli random variable X that takes a value 1 with probability p and
0 with probability 1 − p. Show that E[X] = p.
Exercise: Consider a uniform [0, 1] random variable, that is, a continuous random variable with density
f(x) = 1 for all 0 ≤ x ≤ 1, and f(x) = 0 everywhere else. Show that E[X] = 1/2.
A key property of the expectation operator that is very useful is that it is linear. It’s actually “linear”
in a few distinct senses:
1. Linearity with respect to functions of a single variable: E[a + b · X] = a + b · E[X]
2. Linearity over sums of random variables: E[X + Y ] = E[X] + E[Y ].
3. Linearity with respect to mixtures: if X, Y and Z are random variables such that FZ (t) = p ·
FX (t) + (1 − p) · FY (t), then E[Z] = p · E[X] + (1 − p) · E[Y ].
Note that because of Property 2, we can compute the expectation value of the random variable X + Y
knowing only the CDFs FX (x) and FY (y), without needing the full joint-CDF FXY (x, y) of X and Y .
This is a very special property of the expectation, which doesn’t hold for most of the things we might
want to know about the random variable X + Y (for example P (X + Y ≤ t)).
Property 3 gives us a nice way to evaluate the expectation value of a random variable that is neither
discrete nor continuous. Recalling decomposition (1.3) of a general mixed random variable, let f^c(x) be
the density of the continuous part Fcontinuous and let xj^d and πj^d denote the support points and associated
probabilities according to the discrete part Fdiscrete. Then:

E[X] = ∫_{−∞}^{∞} x · dF(x) = p · Σ_j xj^d · πj^d + (1 − p) · ∫_{−∞}^{∞} x · f^c(x) dx
Exercise: Use the linearity of the expectation operator to prove the following (very useful) alternative
expression for the variance: V ar(X) = E[X 2 ] − (E[X])2 .
Exercise: Show that for a Bernoulli random variable (defined above), the variance is equal to p · (1 − p).
1.5 Conditional distributions and expectation
In this section we develop a final fundamental tool that we will use to analyze random variables: the
idea of conditional distributions and conditional expectations.
Extension: Given events A, B and C, we can also define the probability of A given B and C as
P (A|B ∩ C) = P (A ∩ B ∩ C)/P (B ∩ C), and so on for any number of events.
Exercise: We call events A and B independent if P (A ∩ B) = P (A) · P (B). Suppose that P (B) > 0.
Show that A and B are independent if and only if P (A|B) = P (A).
where the conditional probability appearing in the RHS is defined by Definition 1.16.
We define P (Y ≤ y|X = x) using X ∈ [x, x + ϵ] as our conditioning event B, and then taking the limit,
because the probability of X = x may be zero, e.g. for a continuously distributed X. Note: The Hansen
book uses X ∈ [x − ϵ, x + ϵ] instead of X ∈ [x, x + ϵ] as the conditioning event, but the two definitions are equivalent.
Given the general Definition 1.17, we can consider each of our two typical special cases:
When P(X = x) > 0 (e.g. for a discrete random variable with a support point at x), Definition
1.17 reduces to the simpler expression P(Y ≤ y|X = x) = P(Y ≤ y and X = x)/P(X = x). We can interpret
FY|X=x(y) as the CDF among the sub-population of i for which X = x.
Exercise: derive each of these two expressions from Definition 1.17. For the discrete case, you may find
useful the "quotient rule" that lim_{t→0} g(t)/h(t) = [lim_{t→0} g(t)]/[lim_{t→0} h(t)] when both limits exist and lim_{t→0} h(t) ≠ 0. For
the continuous case, try dividing both the numerator and the denominator of P(Y ≤ y|X ∈ [x, x + ϵ])
by ϵ before taking the limit.
Exercise: Show that if X and Y are independent then FY |X=x (y) = FY (y) and FX|Y =y (x) = FX (x) for
all x and y. Note: it’s actually an if-and-only-if, but proving the other direction is more difficult.
We can unpack this expression depending on what type of random variable Y is:
If Y is continuous: E[Y|X = x] = ∫_{−∞}^{∞} y · fY|X=x(y) dy, where fY|X=x(y) = (d/dy) FY|X=x(y).
If Y is discrete: E[Y|X = x] = Σ_j yj · πj|X=x, where πj|X=x = lim_{ϵ↓0} [FY|X=x(yj) − FY|X=x(yj − ϵ)].
Observe that the conditional expectation E[Y |X = x] depends on x only, as we’ve averaged over various
values of Y . Accordingly, we can define a function that evaluates E[Y |X = x] over different values of x:
Definition 1.18. The conditional expectation function (CEF) of Y given X is m(x) := E[Y |X = x].
We can also use the CEF to define a new random variable, denoted E[Y |X].
Definition 1.19. E[Y |X] = m(X), where m(x) := E[Y |X = x].
For example, if X is discrete, then E[Y |X] takes value m(xj ) = E[Y |X = xj ] with probability πj .
The so-called law of iterated expectations shows that the expectation value of E[Y|X] recovers the
(unconditional) expectation of Y: E[E[Y|X]] = E[Y].
The law of iterated expectations is useful because in many settings the quantity E[Y|X = x] is easier to
work with than E[Y] is directly.
Example: Suppose that Y is individual i’s height and X is an indicator for whether they are a child or
an adult. Then the law of iterated expectations tells us that the average height in the population can
be obtained by averaging together the mean height among children with the mean height among adults.
Suppose that 75% of the population are adults. Then the law of iterated expectations reads as:
E[height] = .75 · E[height|adult] + .25 · E[height|child]
Proposition (CEF minimizes mean squared prediction error): Suppose we're interested
in constructing a function g(·) with the goal of using g(X) as a prediction of Y. We can show
that m(x) := E[Y|X = x] is the best such function, in the sense that for each value of x the choice
g(x) = m(x) minimizes E[(Y − g(X))²|X = x], and hence m(·) minimizes the overall mean squared
prediction error E[(Y − g(X))²] among all functions g(·).
Proof. Here I'll use the general notation so we don't need to make any assumptions about what
type of random variable X is (discrete, continuous, etc.):

E[(Y − g(X))²] = E[E[(Y − g(X))²|X]] = ∫ E[(Y − g(X))²|X = x] · dF(x)
= ∫ E[(Y − g(x))²|X = x] · dF(x) = ∫ E[Y² − 2Y g(x) + g(x)²|X = x] · dF(x)
= ∫ (E[Y²|X = x] − 2g(x) · E[Y|X = x] + g(x)²) · dF(x)

For each value of x, the quantity in brackets is minimized by g(x) = E[Y|X = x]. To see this,
note that the quantity E[Y²|X = x] − 2g · E[Y|X = x] + g² is a convex function of g, and the
first-order condition for minimizing it is satisfied when g = E[Y|X = x].
We can also define a conditional variance function V ar(Y |X = x) = E[(Y − E[Y |X = x])2 |X = x]
from the conditional distribution FY |X=x . An analog to the law of iterated expectations exists for the
conditional variance, which is sometimes called the law of total variance.
Proposition (law of total variance): V ar(Y ) = E[V ar(Y |X)] + V ar(E[Y |X]).
Example: Recall the height example from the law of iterated expectations. The law of total variance
reveals that the variance of heights in the population overall is greater than what we would get by just
averaging the variances of each subgroup. That is:
V ar(height) > .75 · V ar(height|adult) + .25 · V ar(height|child)
The reason is that V ar(height) involves making comparisons directly between the heights of children
and adults, which are not captured in V ar(Y |X = x) for either value of x. The law of total variance
tells us exactly what correction we would need to make, which is to add the second term V ar(E[Y |X]).
Remarkably, the correction required just depends on the average height within each group E[Y |X = x],
as well as the proportion of adults vs. children: P (X = x).
1.6.1 Definition
Rather than coming up with new letters X, Y, Z for multiple random variables, sometimes a more compact
notation is to think of a single “random vector” containing all three.
Definition 1.20. A random vector X is a vector in which each component is a random variable, e.g.
X = (X1, X2, ..., Xk)′, where X1, X2, etc. are each random variables.
Note the following:
A realization x of random vector X is a point in R^k, i.e. x = (x1, x2, ..., xk)′.
For a random vector X, the function FX denotes the joint-CDF of the random variables X1, X2, ..., Xk:
FX(x) = P(X1 ≤ x1, X2 ≤ x2, ..., Xk ≤ xk).
The expectation of a random vector X is simply the vector of expectations of each of its components,
i.e. E[X] = (E[X1], E[X2], ..., E[Xk])′.
The law of iterated expectations E[Y] = E[E[Y|X]] still holds when X is a random vector, rather
than a random variable.
Definition 1.21. An n × k random matrix X is a matrix in which each component is a random variable,
i.e. X has entries Xlm for l = 1, ..., n and m = 1, ..., k, where each Xlm is a random variable.
Just as with a random variable, we define the expectation of a random matrix as the matrix composed
of the expectation of each of its components, i.e. [E[X]]lm = E[Xlm] for each entry lm.
Note that when X is a random vector rather than a random variable, V ar(X) is often referred to as
the “variance-covariance matrix” of X. We’ll use the variance-covariance matrix a lot, because it plays
an important role in studying parametric distributions like the multivariate normal distribution, and in
asymptotic theory.
To understand the name, let us first define the covariance between random vectors X and Y :
Definition 1.23. The covariance of random vectors X and Y is Cov(X, Y ) = E[(X − E[X])(Y − E[Y ])′ ]
Note the following properties of covariance:
For random vector X: V ar(X) = Cov(X, X)
When X and Y are scalars (i.e. single random variables), Cov(X, Y ) = E[(X − E[X])(Y − E[Y ])]
For scalar X and Y , and numbers a, b: Cov(X, a + bY ) = b · Cov(X, Y )
For a random vector X, the components of the matrix Var(X) are scalar variances and covariances,
hence its name:

Var(X) =
[ Var(X1)      Cov(X1, X2)   ...  Cov(X1, Xk) ]
[ Cov(X2, X1)  Var(X2)       ...  Cov(X2, Xk) ]
[ ...          ...           ...  ...         ]
[ Cov(Xk, X1)  Cov(Xk, X2)   ...  Var(Xk)     ]

A consequence of this expression is that Var(X) is a symmetric matrix: [Var(X)]lm = [Var(X)]ml,
because Cov(Xl, Xm) = Cov(Xm, Xl).
When X and Y are scalars, we can define the correlation coefficient ρXY as Cov(X, Y)/√(Var(X) · Var(Y))
(note that all quantities involved here are scalars). ρXY is always a number between −1 and +1 (homework
problem).
Exercise: Show that Cov(X, Y ) = E[XY ′ ] − E[X]E[Y ]′
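To see these objects concretely, here is a small simulated example of my own in R (the variable names are arbitrary):

X1 <- rnorm(10000)
X2 <- X1 + rnorm(10000)                  # X2 is correlated with X1 by construction
X  <- cbind(X1, X2)                      # 10000 realizations of the random vector (X1, X2)'
cov(X)                                   # sample analogue of Var(X): variances on the diagonal, covariances off it
cov(X1, X2) / sqrt(var(X1) * var(X2))    # the correlation coefficient, also given by cor(X1, X2)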
When X is a random vector, the conditional CDF of Y given X = x is defined analogously to Definition 1.17,
by shrinking the conditioning event in every component of X:

FY|X(y|x) = lim_{ϵ1↓0, ..., ϵk↓0} P(Y ≤ y | X1 ∈ [x1, x1 + ϵ1], X2 ∈ [x2, x2 + ϵ2], ..., Xk ∈ [xk, xk + ϵk])

where x = (x1, x2, ..., xk)′.
We can always use the above definition, even if the components of X can be a mix of continuous and
discrete random variables.
For any given value of x, FY|X=x(y) = FY|X(y|x) yields a proper CDF function for y, which means
we can continue to define the conditional expectation as E[Y|X = x] = ∫_{−∞}^{∞} y · dFY|X=x(y), where the
meaning of this integral is as given in Definition 1.14. The conditional variance of Y given X = x can
also be defined in the typical way from the conditional distribution FY|X=x(y).
The law of iterated expectations carries over unchanged when X is a random vector. That is: E[Y] =
E[E[Y|X]], regardless of whether X has continuous or discretely distributed components, or a mix of
the two. The law of total variance carries over too (see below).
Understanding and estimating the object E[Y |X = x] from data, where X can be a vector, will be one
of our main interests in this course, motivating the use of regression analysis. Take a deep breath, we
made it!
When both X and Y are random vectors, we can talk about the conditional distribution of Y given X
by defining a conditional joint-CDF of all the components of Y , conditional on X = x.
Definition 1.25. With X and Y random vectors, the conditional CDF of Y given X = x is

FY|X(y|x) = lim_{ϵ1↓0, ϵ2↓0, ..., ϵk↓0} P(Y1 ≤ y1, Y2 ≤ y2, ... | X1 ∈ [x1, x1 + ϵ1], X2 ∈ [x2, x2 + ϵ2], ..., Xk ∈ [xk, xk + ϵk])

where x = (x1, x2, ..., xk)′.
An important application of the concept of a conditional joint-distribution is the idea of conditional
independence.
Definition 1.26 (conditional independence). We say that X and Y are independent conditional on
Z, denoted (X ⊥ Y )|Z, if for any value z of Z: FXY |Z=z (x, y) = FX|Z=z (x) · FY |Z=z (y) for all x, y.
This definition can be understood by using Definition 1.25 to interpret FXY|Z=z(x, y) as the joint-CDF
of a random vector composed of X and Y, conditional on the random vector Z. In this definition
X and Y could be random variables or could each be random vectors themselves!
As another application of Definition 1.25, the law of total covariance provides an analog of the law of
iterated expectations for covariance (and hence, as a special case, for variance):
Proposition 1.2. For random vectors X, Y and Z: Cov(X, Y) = E[Cov(X, Y|Z)] + Cov(E[X|Z], E[Y|Z]).
Note that as a special case we have the law of total variance: Var(Y) = E[Var(Y|X)] + Var(E[Y|X]).
Chapter 2
2.1 Introduction
Let’s illustrate some of the concepts from the last chapter with an empirical example. I’ll be working
with the nlswork dataset, which reports a sample of young working women from the Bureau of Labor
Statistics’ National Longitudinal Survey in the 1970s and 1980s. Obviously this data is pretty old, but
it’s easy to load into R, and has a lot of interesting variables. I’ll be showing code in R, but the dataset
is also easy to load into Stata using the command webuse nlswork.
You can install R by downloading it from https://www.r-project.org/. If you do that, I also
recommend installing RStudio from https://www.rstudio.com/ for a nicer interface. You can also
create a free account at rstudio.cloud to work with RStudio in the cloud, for low-intensity applications
like this one.
To get started and load the nlswork dataset into RStudio, run the following code:
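Something along these lines works (assuming here that webuse() returns the dataset as a data frame, which is how it is described below):

library(webuse)   # lets us download Stata example datasets from the web with webuse()
library(ggplot2)  # for making plots
library(Rmisc)    # provides multiplot(), used later to put plots side by side

# read the nlswork dataset in from the web and store it in a dataframe called df
df <- webuse("nlswork")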
The first three lines load R libraries that we'll need. The first allows us to load the dataset directly from
the web using the webuse() command. The second two will be useful for creating pretty plots. Note
that # in R introduces a comment, which allows one to annotate their code with things that R ignores
when you run it.
Note: The library() command loads an R library, but you need to install a library before you can load it.
To install the packages that are loaded above, run install.packages(c("webuse", "ggplot2", "Rmisc"))
in R. You only need to do this once, then you can just jump straight to library(webuse), etc.
The last line of the code block above reads in the dataset and stores it in a dataframe called df. A
dataframe is what R calls a dataset. I could have given it any name, but I chose “df”, which is usually
what I use by default. To take a look at this dataset, you can type View(df) into R after running
the above code. If you'd like to learn more about the variables, see here:
https://rdrr.io/rforge/sampleSelection/man/nlswork.html.
2.2 The “empirical distribution” of a dataset
When you look at the dataframe df, you’ll notice that the first two columns are called idcode and
year. This survey tracks workers over several years, so each value of idcode shows up once for
each year in which that worker was surveyed. Altogether, there are 28,534 rows covering about
4,711 distinct workers. There are 25 variables reported for each row.
Let us index values of each of the 25 variables recorded in row i with a subscript i, e.g. agei
denotes the age recorded in row i. This is the age of a particular person indexed by idcodei in
the particular year yeari .
Now consider the following probability space: we draw a single row ω from the dataset at random,
with an equal probability P({ω}) = 1/28,534 of selecting any given row (the sample space is finite:
|Ω| = 28,534, so we can let our event space F be the full powerset of Ω). We have 25 variables
recorded in each row, which define 25 random variables X1(ω), X2(ω), ..., X25(ω),
with names like idcode, year, birth_yr, age, race, etc.
What I’ll call the empirical distribution of the dataset is simply the joint-distribution of our
random vector X = (X1 , X2 , . . . X25 )′ given the probability space described above. For example,
the CDF of age evaluated at 25 is: Fage(25) = P(age ≤ 25) = (# rows in which agei ≤ 25)/28,534.
Note: In Chapter 4 we’ll make a big deal out of distinguishing our sample from the underlying
population of interest, which in this case might be the population of all young working women
in the U.S. in the 1970-1980s, from which our sample is drawn. Given this distinction, one could
view the various distribution functions plotted in this chapter as estimates of the corresponding
distributions in the population. But this is not important for the present purpose, which is simply
to illustrate some properties of distributions in general.
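A minimal sketch of the next code block (the exact plot title and labels are my own guesses) is:

df$wage <- exp(df$ln_wage)   # add a column with the wage in levels rather than in logs
ggplot(df, aes(wage)) + stat_ecdf(geom = "step") +
  labs(title = "CDF of hourly wages", y = "cdf") + xlim(0, 20)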
The goal of the above code block is to plot the CDF of hourly wages in the dataset. The dataset
contains only the log of wages (what labor economists often focus on), so our first task above is to
generate a new column in the dataset with the wage, rather than its natural logarithm. This is done
with the command df$wage<-exp(df$ln_wage), which adds a new column wage to the dataset that
is equal to e to the power of ln_wage. The second command generates the following figure:
The syntax of ggplot takes some getting used to, and really the best way to learn it is just to look at
examples like this one and start playing with them. The important things to recognize in the above are
that ggplot(df, aes(wage)) tells ggplot that we're going to be looking at the variable wage in dataframe
df, stat_ecdf tells it to plot the empirical CDF, and the last two parts of the line set the plot labels
and the range of the x-axis.
As you can see, the CDF of wages is a monotonically increasing function that ranges from 0 to 1. It
makes sense to focus on the range $0 to $20 because the CDF is essentially flat for wages above $20.
Now let's consider the conditional CDF Fwage|collgrad, where collgrad = 1 indicates that i graduated
college, and collgrad = 0 indicates that they did not. Since collgrad is a discrete random variable,
the conditional CDF is simply

Fwage|collgrad=c(w) = P(wage ≤ w and collgrad = c) / P(collgrad = c)

for any value c ∈ {0, 1} and wage value w. Given the empirical distribution of our data, this is equivalent
to

Fwage|collgrad=c(w) = (# rows in which wage ≤ w and collgrad = c) / (# rows in which collgrad = c)    (2.1)
Consider for example college graduates: c = 1. We can calculate this conditional CDF by creating a new
dataset composed of just the rows from df in which collgrad = 1, and then computing the empirical CDF
of wages with respect to that new dataset. The reason is that the empirical CDF with respect to this
new dataset counts the number of rows in which wage ≤ w (which is the number of rows in the original
dataset in which wage ≤ w and collgrad = 1) and divides by the number of rows in the new dataset
(which is the number of rows in the original dataset in which collgrad = 1). This exactly recovers Eq.
(2.1) above.
In general, conditioning on a discrete random variable in an empirical distribution is identical to
simply sub-setting the data. Thus we can generate plots of our conditional CDFs for c = 0 and c = 1 as
follows:
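A sketch of that code (the plot titles are my own) is:

dfgrads <- df[df$collgrad == 1, ]     # keep only the rows for college graduates
plot1 <- ggplot(dfgrads, aes(wage)) + stat_ecdf(geom = "step") +
  labs(title = "Conditional CDF of wages, college graduates", y = "cdf") + xlim(0, 20)
dfnongrads <- df[df$collgrad == 0, ]  # keep only the rows for non-graduates
plot2 <- ggplot(dfnongrads, aes(wage)) + stat_ecdf(geom = "step") +
  labs(title = "Conditional CDF of wages, non-graduates", y = "cdf") + xlim(0, 20)
multiplot(plot1, plot2)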
The command dfgrads<-df[df$collgrad==1,] for example creates a new dataframe dfgrads which
contains only the college graduates, and the next line uses the same command as before to generate the
empirical CDF of wages in dfgrads. Rather than displaying it right away, we save the graph object
as plot1, so that we can display the two conditional CDFs alongside one another using the multiplot
command. The result is the following:
ggplot makes it easy to automatically compute these conditional CDFs and plot them alongside
one-another in the same plot. For example, the following code:
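A sketch of it (the plot title here is my own guess) is:

df$graduate <- factor(df$collgrad)   # a factor version of collgrad, called graduate
ggplot(df, aes(wage, color = graduate)) + stat_ecdf(geom = "step") +
  labs(title = "Conditional CDFs of wages by graduation status", y = "cdf") + xlim(0, 20)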
This requires creating a so-called “factor variable”, which is a variable that R knows takes on discrete
categorical values. The first line creates a factor version of collgrad and calls it graduate. Then the
syntax color = graduate tells ggplot to break up the data and color it according to values of graduate.
Notice that the CDF of college graduates is lower than the CDF of college non-graduates at every
wage value. For example, about 50% of non-college graduates have a wage of $5 or less, while only about
20% of college graduates have a wage of $5 or less. This is what we should expect, if college graduates
tend to be paid better than non-graduates.
Is our wage variable a continuous or a discrete random variable? Notice that our CDF functions of
wages look smooth, like the right panel of Figure 1.2 for a continuous random variable, rather than like
a staircase as in 1.1 for a discrete random variable. But when evaluating probabilities using the empirical
distribution of our dataset, wage can't literally be a continuous random variable: in a dataset of 28,534
rows there can be at most 28,534 distinct values of wage represented! In fact, with a finite number of people
on Earth, only a finite number of wages would be represented even if we had an idealized dataset that
included everybody.
Therefore, strictly speaking, we’re plotting the CDF of a discrete random variable above. But since
wages can take any real number as a value (rather than, e.g., being confined to integers or some set
of categories), the distribution of wages is well-approximated by thinking of it as being a continuous
random variable. The “jumps” in the above plot are so tiny in our dataset that they are imperceptible
to the naked eye.
Thus, it’s meaningful to talk about the conditional density of wages associated with each of the
conditional CDFs above. R makes it easy to generate plots of these conditional densities, with the
following code:
plot3 <- ggplot(dfgrads, aes(wage)) + geom_density() + xlim(0, 20) +
  labs(title = "Conditional density of wages, college graduates", y = "pdf")
multiplot(plot1, plot3)
plot4 <- ggplot(dfnongrads, aes(wage)) + geom_density() + xlim(0, 20) +
  labs(title = "Conditional density of wages, non-graduates", y = "pdf")
multiplot(plot2, plot4)
The command geom_density() is actually doing some fairly complicated calculations in the background,
which aren’t important here. We just want the graphs. Here’s the conditional distribution for non-
graduates, represented both as a CDF and as a density function:
In each case, the density is highest where the CDF is steepest, and lowest where the CDF is the flattest.
This is what we should expect, since the density function is the derivative of the CDF. Notice that the
p.d.f. of wages for non-graduates peaks around $4 an hour, while for graduates it peaks around $7 an
hour.
We can easily compute the expectations of these conditional distributions using a command like
mean(df[df$collgrad==1,]$wage). This command takes the average value of wage across all rows in
which collgrad=1. Running this command for each value of collgrad yields E[wage|collgrad = 0] ≈
$5.52 and E[wage|collgrad = 1] ≈ $8.65.
This provides us an opportunity to test the law of iterated expectations, which says that

E[wage] = E[E[wage|collgrad]] = P(collgrad = 0) · E[wage|collgrad = 0] + P(collgrad = 1) · E[wage|collgrad = 1]

Indeed mean(df$wage) yields E[wage] ≈ $6.04, mean(df$collgrad==1) reveals that P(collgrad = 1) ≈ 0.17,
and 0.83 · 5.52 + 0.17 · 8.65 ≈ 6.04.
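As a sketch, the whole check can be run in a few lines using the same commands quoted above:

mean(df$wage)                        # E[wage], about 6.04
mean(df[df$collgrad == 0, ]$wage)    # E[wage | collgrad = 0], about 5.52
mean(df[df$collgrad == 1, ]$wage)    # E[wage | collgrad = 1], about 8.65
p <- mean(df$collgrad == 1)          # P(collgrad = 1), about 0.17
(1 - p) * mean(df[df$collgrad == 0, ]$wage) + p * mean(df[df$collgrad == 1, ]$wage)   # about 6.04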
Now let's look at an example of conditional distributions in which Y is discrete, rather than continuous
like wage. Here are conditional CDFs of a worker's highest grade of education completed, by college
graduation status:
Now the “staircase” nature of the CDF for a discrete random variable is clear. About 10% of college
graduates have 15 or fewer years of education, yet have graduated college. Notice that the big jump in the
CDF for graduates is at grade 16 (12 years + a 4 year college degree), but there are also jumps beyond
that for post-graduate degrees. By contrast, the big jump for non-graduates occurs at 12, indicating
high-school completion.
Here is the CDF for the non-graduates alongside its probability mass function, plotted as a histogram.
These last two figures were generated with the following code
plot1 <- ggplot(dfgrads, aes(grade)) + stat_ecdf(geom = "step") +
  labs(title = "Conditional CDF of grade completed, college graduates", y = "cdf") + xlim(0, 20)
plot2 <- ggplot(dfnongrads, aes(grade)) + stat_ecdf(geom = "step") +
  labs(title = "Conditional CDF of grade completed, non-graduates", y = "cdf") + xlim(0, 20)
multiplot(plot1, plot2)
plot3 <- ggplot(dfnongrads, aes(grade)) + geom_histogram(aes(y = ..density..)) + xlim(0, 20) + labs(y = "pmf")
multiplot(plot2, plot3)
So far we've visualized distributions that condition on the binary variable collgrad, so there were
only two conditional distributions to look at. Now let's move beyond a binary conditioning variable. For
example, the following figures show the conditional CDF and conditional density of wages given highest
grade completed, for grades 9 and up.
The code for these graphs follows the same pattern as the listings above (note we had to omit some rows in which grade is undefined,
using the condition !is.na(df$grade)).
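A sketch of what such code might look like, treating grade as a grouping factor (the exact aesthetics used for the original figures are an assumption):

# keep rows with a defined grade of 9 or above
dfg <- df[!is.na(df$grade) & df$grade >= 9, ]

# conditional CDFs of wage, one curve per value of grade
plot5 <- ggplot(dfg, aes(wage, colour = factor(grade))) + stat_ecdf(geom = "step") +
  xlim(0, 20) + labs(title = "Conditional CDFs of wage by grade completed", y = "cdf")

# conditional densities of wage, one curve per value of grade
plot6 <- ggplot(dfg, aes(wage, colour = factor(grade))) + geom_density() +
  xlim(0, 20) + labs(title = "Conditional densities of wage by grade completed", y = "pdf")

multiplot(plot5, plot6)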
Finally, let's consider a continuous conditioning variable. We'll take Y to be usual hours worked in a
week, and X to be wage. We can no longer visualize the conditional distributions FY|X=x one-by-one.
We can, however, visualize the conditional expectation function E[hours|wage = w] as a function of w.
Below we plot it over a raw scatterplot of hours and wage, which provides a visualization of the joint
distribution of the two variables.
Now let's see the law of iterated expectations in action in this setting. With wages conceived of as a
continuously-distributed random variable, the law of iterated expectations says that

E[hours] = E[E[hours|wage]] = ∫ fwage(w) · E[hours|wage = w] · dw

where fwage(w) is the density of wages evaluated at w. The plot below shows the function fwage(w) in
red (scale given on the right), and the function E[hours|wage = w] in blue (scale given on the left).
The value of E[hours] turns out to be about 46.56, and this is visualized by a horizontal dotted line on
the same scale as the conditional expectation of hours.
Observe that the dotted line cuts through the blue CEF, capturing its "average value". This average
is weighted according to the red line: the values near $4-$6 get the most weight, for example. It is thus
sensible that the value of E[hours] is close to the midpoint of the function E[hours|wage = w] in that
range.
The code used to generate these graphs and calculate E[hours] follows the same approach as the earlier listings.
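A sketch of how these could be produced (hours is an assumed variable name for usual weekly hours, and geom_smooth() stands in for however the conditional expectation function was actually estimated):

# rows where both hours and wage are observed (variable names assumed)
dfhw <- df[!is.na(df$hours) & !is.na(df$wage), ]

# scatterplot of hours against wage, with an estimate of E[hours | wage = w] overlaid
ggplot(dfhw, aes(x = wage, y = hours)) +
  geom_point(alpha = 0.2) +
  geom_smooth() +
  xlim(0, 20) +
  labs(title = "Hours vs. wage, with conditional expectation function")

# the unconditional expectation E[hours]
mean(dfhw$hours)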
Chapter 3
Statistical models
In Chapter 1, we developed the idea of a random vector, which has a probability distribution that can
be characterized by the joint CDF of all of its components. This allowed us to define concepts like
expectation, conditional distributions, and the conditional expectation function. Chapter 2 gave
an empirical illustration of these concepts.
The data used in Chapter 2 is an example of a sample. In this case, it was a collection of observations
regarding ∼5000 young working women in the 1970s and 1980s. This chapter develops tools that let us
address the following question: what are we learning about the population of young working women in
this time period, given our sample? In doing so we move from the theory of probability to the theory of
statistics, which studies what we can learn about probability distributions from data.
To do so, it is useful to start with the definition of a statistical model, which embodies a set of
assumptions about the distribution of a set of random variables. In Section 3.3, we’ll apply this idea to
model how a sample of data is generated from an underlying population.
Most familiar models in statistics are “parametric” models. A parametric statistical model is a model
in which each F ∈ F can be written in terms of a vector of parameters θ ∈ Rd . What I mean by this is
that each function F (x) in F can be written as F (x; θ): the whole function F (·; θ) depends on the values
of the parameters θ = (θ1 , θ2 , . . . θd )′ :
Definition 3.2. A parametric statistical model is the set F = {F (·; θ) : θ ∈ Θ}, where each θ ∈ Θ is
some finite-dimensional vector (i.e. Θ is some subset of Rd for a finite d).
Example: Suppose somebody hands you a coin, which could be “weighted” so that the probability of
heads is different than 1/2. They don’t tell you the value of p = P (heads). Thus, your statistical model
is that P (heads) = p and P (tails) = 1 − p for some p ∈ [0, 1]. This model has one parameter, p.
Notation: When a parametric model contains only continuous distributions, the distributions in F are
often denoted by their densities f (·; θ) rather than CDFs. You also sometimes see the notation f (·|θ),
with a “|” rather than “;”.
Preview: parametric vs. non-parametric models:
Note that a parametric statistical model must be characterized by a finite number of real-valued
parameters, i.e. d is finite. This distinguishes parametric models from non-parametric models F,
in which there’s no way to come up with a finite number of parameters that can fully characterize
each F ∈ F.
Example: Suppose that X is a random variable, and we let F contain all valid CDFs such that
the function F (x) is concave in x (on its support, where it’s increasing). This represents a
non-parametric statistical model.
Non-parametric models play an important role in modern econometrics, as they often make weaker
assumptions than parametric models, and become most practical to work with when we have big
datasets. “Semi-parametric” models occur when a finite-dimensional θ pins down some—but
not all—features of each F ∈ F. Semi-parametric models play an important role in regression
analysis, as we’ll see.
We'll start by studying some important parametric models for a single random vector X. We'll then
apply the idea of a model for the distribution of X to move on to our real objective: modeling the
generation of a whole dataset, which typically consists of n realizations of a random vector X.
The following exercises ask you to show that the parameters µ and σ are the expectation and standard
deviation of X, respectively (the standard deviation of a random variable is defined as the square root
of its variance).
Proposition 3.1. Suppose that X ∼ N(µ, σ²) and we define Y = a + bX for some a, b ∈ R. Then Y is
also normally distributed, as N(a + bµ, b²σ²).
A consequence of this is that if X ∼ N(µ, σ²), then the random variable (X − µ)/σ ∼ N(0, 1). N(0, 1) is
called the standard normal distribution.
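A quick simulation sketch of this standardization in R (the particular µ = 3 and σ = 2 are arbitrary choices):

set.seed(1)
x <- rnorm(100000, mean = 3, sd = 2)   # draws of X ~ N(3, 4)
z <- (x - 3) / 2                       # standardize: (X - mu) / sigma

mean(z); var(z)    # should be close to 0 and 1

# the empirical CDF of z should track the standard normal CDF
grid <- seq(-3, 3, by = 0.1)
max(abs(ecdf(z)(grid) - pnorm(grid)))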
The multi-variate normal distribution generalizes the normal distribution to a random vector X =
(X1, X2, . . . Xk)′. The multi-variate normal density is parametrized by a k × 1 vector µ and a k × k
matrix Σ:

f(x; µ, Σ) = 1/√((2π)^k · det(Σ)) · exp( −(1/2)·(x − µ)′Σ⁻¹(x − µ) )    (3.2)
where Σ−1 denotes the matrix inverse of Σ, and det(Σ) its determinant. When X ∼ N (µ, Σ), the mean
of X is µ, i.e. E[X] = µ, and Σ is its variance-covariance matrix: V ar(X) = Σ.
Example: The bivariate normal distribution, the case when k = 2, warrants some additional attention.
Let ρ be the correlation coefficient between X1 and X2, so that

Σ = [ σ1²      ρ·σ1·σ2 ]  =  diag(σ1, σ2) [ 1  ρ ] diag(σ1, σ2)
    [ ρ·σ1·σ2  σ2²     ]                  [ ρ  1 ]

where σj = √(Var(Xj)). In this case Eq. (3.2) simplifies to:
f(x1, x2; µ1, µ2, σ1, σ2, ρ) = 1/(2π·σ1·σ2·√(1 − ρ²)) · exp{ −1/(2(1 − ρ²)) · [ ((x1 − µ1)/σ1)² + ((x2 − µ2)/σ2)² − 2ρ·((x1 − µ1)/σ1)·((x2 − µ2)/σ2) ] }    (3.3)
Exercise: Show from Eq. (3.3) that if X1 and X2 are jointly normally distributed, then X1 ⊥ X2 if and
only if Cov(X1 , X2 ) = 0.
A useful property of the family of normal distributions is that many operations keep one in the family.
For example, when X is a (multivariate) normal random vector, the marginal distribution of each of its
components is a (univariate) normal random variable:
Proposition 3.2. Let X ∼ N(µ, Σ). Then the marginal distribution of each Xj is N(µj, Σjj).
A vector X̃ composed of any subset of the Xj is also normal.
Conditional distributions also stay in the normal family. Dividing the bivariate density (3.3) by the marginal density of X gives the conditional density of Y given X = x (writing (X, Y) for the pair, with means µX, µY, standard deviations σX, σY, and correlation ρ):

fY|X=x(y) = f(x, y)/fX(x)
          = 1/(√(2π)·σY·√(1 − ρ²)) · exp{ −1/(2(1 − ρ²)) · [ ((y − µY)/σY)² − 2ρ·((x − µX)/σX)·((y − µY)/σY) + ρ²·((x − µX)/σX)² ] }    (3.4)

where we've used that e^a/e^b = e^(a−b) and that (1/2)·((x − µX)/σX)² = ((1 − ρ²)/(2(1 − ρ²)))·((x − µX)/σX)². Now the beautiful part:
the exponent in (3.4) is a "perfect square" quantity:

fY|X=x(y) = 1/(√(2π)·σY·√(1 − ρ²)) · exp{ −1/(2(1 − ρ²)) · ( (y − µY)/σY − ρ·(x − µX)/σX )² }
          = 1/√(2π·σY²(1 − ρ²)) · exp{ −( y − [µY + ρ·(σY/σX)·(x − µX)] )² / (2·σY²(1 − ρ²)) }

That is, conditional on X = x, Y is normally distributed with mean µY + ρ·(σY/σX)·(x − µX) and variance σY²(1 − ρ²).
Sums of normal random variables are also normally distributed, provided that the two random vari-
ables being added are jointly normal:
Proposition 3.4. If X and Y are jointly normal, with

(X, Y)′ ∼ N( (µX, µY)′ , [ σX²   σXY ; σXY   σY² ] )

then X + Y ∼ N(µX + µY, σX² + σY² + 2σXY).
Proof. Using the fact that for X and Y jointly normal, X + Y is normal, applied k − 1 times, establishes that a′X
is normal (making use of the fact that ajXj is normal for each j). If we know that a′X is normal,
and we know its mean and variance, we know its whole distribution. That
E[a′X] = a′E[X] = a′µ follows from linearity of the expectation. That Var(a′X) = a′Var(X)a then
follows from the properties of variance (bilinearity of the covariance).
where Fsingle(z) is the CDF of a single Bernoulli random variable having probability p: Fsingle(z) =
(1 − p)·1(z ≥ 0) + p·1(z ≥ 1).
We now define the binomial distribution to be the distribution of X := Σ_{j=1}^n Zj. Often, each Zj is
interpreted as the success (1) or failure (0) of a "trial" of some kind. The probability of success of trial j is
P(Zj = 1) = p. With this interpretation the random variable X simply counts the number of successes
out of the n trials.
To denote that X has the binomial distribution with parameters p and n, we write X ∼ B(n, p).
Since X can only take integer values between 0 and n, the binomial distribution is discrete. Rather
than a density, it has a probability mass function. The probability that X = k can be written as:

P(k; n, p) = (n choose k) · p^k · (1 − p)^(n−k)

where (n choose k) := n!/(k!(n − k)!). The quantity (n choose k) simply counts the number of distinct ways that we could get
k successes from n trials, i.e. the number of distinct vectors z ∈ {0, 1}^n that contain k ones and n − k
zeroes. Each such z has the same probability p^k · (1 − p)^(n−k) of occurring, leading to our final expression
for P(k; n, p).
We know by linearity of the expectation that the mean of X must be E[X] = E[Σ_{j=1}^n Zj] = Σ_{j=1}^n E[Zj] =
n · p. We can also verify this by the explicit formula for P(k; n, p):

E[X] = Σ_{k=0}^n k · (n choose k) · p^k · (1 − p)^(n−k) = Σ_{k=0}^n (n! · k)/(k!(n − k)!) · p^k · (1 − p)^(n−k)
Exercise: Show that if X ∼ B(n, p), V ar(X) = n · p(1 − p). You may find it useful that since Zj ⊥ Zj ′
for each pair j ̸= j ′ , Cov(Zj , Zj ′ ) = 0.
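A short simulation sketch in R that checks both the mean n·p and the variance n·p(1 − p), for arbitrary values of n and p:

set.seed(1)
n <- 20; p <- 0.3

# 100,000 draws of X = number of successes in n Bernoulli(p) trials
x <- rbinom(100000, size = n, prob = p)

mean(x); n * p               # sample mean vs. theoretical mean
var(x);  n * p * (1 - p)     # sample variance vs. theoretical variance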
The i.i.d. model is typically used to describe simple random sampling. Simple random sampling oc-
curs when individuals are selected at random from some underlying population I, and a set of variables
Xi = (X1i , X2i , . . . Xki )′ are recorded for each sampled individual i. Imagine for example a telephone
survey, in which enumerators have a long list I of potential individuals to contact. They use a random
number generator to choose an i at random from this list, contact them, and record responses to a set
of k questions. This process is then repeated n times.
Note: With a finite population I, we must allow sampling "with replacement" for the i.i.d. model to hold
strictly. If individual i is removed from the list after being contacted, then the random vectors Xi may
no longer be independent. For example, suppose we are randomly selecting U.S. states and recording
the population of each one. Suppose California (the most populous state) has 40 million people and Georgia has
11 million. Then for example P(X2 = 40m|X1 = 40m) ≠ P(X2 = 40m|X1 < 40m), since the first probability is
zero and the second is 1/49. This means that X1 and X2 are not independent. Simple random sampling
is often referred to as random sampling for short, or as i.i.d. sampling.
We’ll use the terms dataset or sample to refer to an n × k matrix X that records characteristics Xi =
(X1i , X2i , . . . Xki ) for each of n observational units (such as individuals) i. Data is not always generated
by simple random sampling, but when it is, we can imagine X as being formed by randomly choosing
rows from a much larger matrix that records Xi for all individuals in the population, depicted in Figure
3.1. The actual data we see in X is a realization of the collection of random variables {X1 , X2 , . . . Xn }.
    [ X1′ ]   [ (X11, X21, . . . , Xk1) ]
X = [ X2′ ] = [ (X12, X22, . . . , Xk2) ]
    [  ⋮  ]   [            ⋮            ]
    [ Xn′ ]   [ (X1n, X2n, . . . , Xkn) ]
The randomness of X comes from the random-sampling: we could have drawn a different set of individ-
uals from the population, in which case we would have seen a different dataset X.
Notation: Note that the entries of the sample matrix X are denoted Xji, where i indexes rows (individual
observations) and j indexes columns (variables/characteristics). This is backwards from the way we often
denote entries Mij of a matrix M, where the row i comes before the column j. This is a consequence of
two conventions interacting: that rows of X index individuals (just like when you open the dataset in
R), but that Xji indexes characteristic j of individual i (equivalently, characteristic j of the individual
sampled in row i).
Note that most sampling processes in the real world occur without replacement: the same individual
cannot show up twice in the data. Given the note above, this suggests that these sampling processes
are not i.i.d., strictly speaking. However, when the size N of the underlying population is large, such
samples can still be well-approximated as being i.i.d.. Intuitively, that’s because when N is much larger
than n (often denoted as N >> n), the chance that you would draw the same individual twice is very
low. We thus typically assume i.i.d., with the idea that N is suitably large to not worry about sampling
with vs. without replacement.
Sample X                                   Population I
row i   ωi   agei  marriedi  collegei      individual i   agei  marriedi  collegei
  1      1    25      0         0                1          25      0         0
  2      4    37      1         1                2          74      1         1
  3      5    54      0         1                3           8      0         0
                                                 4          37      1         1
                                                 5          54      0         1
Figure 3.1: An example of simple random sampling, in which n = 3 and N = 5. Each row of the dataset
on the left is a realization of random vector X = (age, married, college), which chooses a row at random from
the population matrix on the right. We can conceptualize this sampling process as a probability space with
outcomes ω = (ω1 , ω2 , ω3 ), where ωi yields the index of the randomly selected individual in I. The random
vectors Xi = Xi (ωi ) and Xj = Xj (ωj ) are independent for i ̸= j, but the random variables within a row are
generally not independent, e.g. agei and collegei are positively correlated.
The following are some alternative methods of generating data, aside from simple random sampling:
Stratified random sampling: the population is divided into groups, and then simple random sam-
pling occurs within each group (e.g. I run my sampling algorithm separately for men and women,
so that I can ensure equal representation of each).
Clustered random sampling: after defining groups, we randomly select some of the groups. Then all
individuals from those groups are included in the sample (e.g. I interview everybody in a household,
after choosing households at random)
Panel data: suppose we have observations over multiple time-periods t for each individual i, where
the individuals i are drawn as a simple random sample. Then if we arrange all of i’s data onto one
row, we can imagine X as reflecting an i.i.d. sample. But with rows corresponding to (i, t) pairs, the rows are no longer independent (in general).
Observing the whole population: this would be the case e.g. with state-level data from all 50 U.S.
states. This situation occurs increasingly frequently with individual-level data now as well, e.g.
administrative data on all tax-filers in a country.
These alternative sampling methods tend to violate the i.i.d assumption. However, methods exist to
deal with each of them.
Let us end this section with a last bit of jargon. When Xi for i = 1 . . . n denotes a collection of i.i.d
random vectors, we’ll refer to the distribution F that describes the marginal distribution of each Xi as
the population distribution. The population distribution is the distribution we get when we randomly
select any individual from the population. Features of the population distribution are the ones that
you naturally think of when you think about summarizing a population. For example, if I is a finite
population, then
EF[Xi] = (1/N) Σ_{i∈I} Xi
where we use the notation EF to make explicit that the expectation is with respect to the CDF F . The
population mean is simply the mean of Xi among everybody in I. We can also talk about the population
variance, the population median, and so on. Note that the empirical distribution introduced in Section
2.2 is an example of a population distribution in which we think of the whole population as equal to the
rows of the dataset.
Another piece of terminology will be useful as we discuss samples and their population counterparts:
Definition 3.4. A statistic or estimator is any function of the sample X = (X1′ , X2′ , . . . , Xn′ )′ .
A generic estimator or statistic will apply some function g(X) = g(X1, X2, . . . Xn) to the collection of
random vectors that constitute the sample. An example is the so-called sample mean X̄n := (1/n) Σ_{i=1}^n Xi,
which simply adds together the Xi across the sample and divides by the number of observations n. X̄n
is an example of a statistic. Since each of the Xi is a random variable/vector, it follows that X̄n is itself
a random variable/vector. This is true of statistics in general: they are random.
The reason that we also refer to statistics as “estimators” is that statistics often attempt to estimate a
population quantity of some kind from data. For example, we’ll see in the next Chapter that for large
n, we are justified in thinking that X̄n ≈ µ. It is therefore reasonable to use X̄n as an estimate of µ.
Note that X̄n is random, while µ is just a fixed number. Thus we have to be careful in what we mean
by saying that X̄n ≈ µ, which is the topic of the next chapter.
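A small simulation sketch of this point: several samples of the same size from the same population produce different values of X̄n, while µ stays fixed.

set.seed(1)
mu <- 2    # the fixed population mean

# five independent samples of size n = 50, and the five resulting sample means
replicate(5, mean(rnorm(50, mean = mu, sd = 1)))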
Notation: Often estimators are depicted with a “hat” on them, e.g. θ̂ = g(X). We’ll use this notation
to denote a generic estimator.
A useful property of i.i.d. random vectors that I’ll mention here is the following:
Proposition 3.6. If {X1 , X2 , . . . Xn } are i.i.d random vectors, then {h(X1 ), h(X2 ), . . . h(Xn )} are also
i.i.d for any (measurable) function h.
An implication of Proposition 3.6 is that if we have an i.i.d. sample of Xi, we can construct from it an i.i.d.
sample of, e.g., Xi².
Chapter 4
When the sample gets big: asymptotic theory
The law of large numbers (LLN) states the deep and useful fact that for very large n, it becomes very
unlikely that X̄n is very far from µ = E[Xi ], the “population mean” of Xi .
Theorem 1 (law of large numbers). If Xi are i.i.d random variables and E[Xi ] is finite, then for
any ϵ > 0:
lim_{n→∞} P(|X̄n − µ| > ϵ) = 0
Note: The LLN is stated above for a random variable, but the result generalizes easily to random
vectors. In that case, lim_{n→∞} P(||X̄n − µ||₂ > ϵ) = 0, where ||·||₂ denotes the Euclidean norm, i.e.
||X̄n − µ||₂ = √((X̄n − µ)′(X̄n − µ)), where X̄n is a vector of sample means for each component of Xi, and
similarly for µ.
Note: the version of the law of large numbers above is called the weak law of large numbers. There exists
another version called the strong LLN.
Let us now prove the LLN. We will do so using a tool called Chebyshev’s inequality. This proof assumes
that V ar(Xi ) is finite, but the LLN holds even if V ar(Xi ) = ∞. Chebyshev’s inequality allows us to use
the variance of a random variable to put an upper bound on the probability that the random variable is
far from its mean. In particular, for any random variable Z with finite mean and variance:
P(|Z − E[Z]| ≥ ϵ) ≤ Var(Z)/ϵ²

To see that this holds, use the law of iterated expectations to write out the variance as

Var(Z) = E[(Z − E[Z])²]
       = P(|Z − E[Z]| ≥ ϵ)·E[(Z − E[Z])² | |Z − E[Z]| ≥ ϵ] + P(|Z − E[Z]| < ϵ)·E[(Z − E[Z])² | |Z − E[Z]| < ϵ]
       ≥ P(|Z − E[Z]| ≥ ϵ)·ϵ²

where the inequality uses that the second term is non-negative and that (Z − E[Z])² ≥ ϵ² on the event |Z − E[Z]| ≥ ϵ. Dividing both sides by ϵ² gives Chebyshev's inequality.
Now, we will show that as n → ∞, V ar(X̄n ) → 0. This along with Chebyshev’s inequality implies the
LLN, by letting Z = X̄n .
To see that Var(X̄n) → 0, note first that

E[X̄n] = E[(1/n) Σ_{i=1}^n Xi] = (1/n) Σ_{i=1}^n E[Xi] = (1/n) Σ_{i=1}^n µ = µ

The first equality is simply the definition of X̄n, while the second uses linearity of the expectation
operator. Now consider

Var(X̄n) = E[(X̄n − E[X̄n])²] = E[(X̄n − µ)²] = E[((1/n) Σ_{i=1}^n (Xi − µ))²]
         = (1/n²)·E[Σ_{i=1}^n Σ_{j=1}^n (Xi − µ)(Xj − µ)] = (1/n²) Σ_{i=1}^n Σ_{j=1}^n E[(Xi − µ)(Xj − µ)]
         = (1/n²) Σ_{i=1}^n E[(Xi − µ)²] = (1/n²)·n·Var(Xi) = Var(Xi)/n

where the first equality in the third line follows because when i ≠ j, Xi ⊥ Xj implies that E[(Xi − µ)(Xj − µ)] =
E[(Xi − µ)]·E[(Xj − µ)] = 0·0 = 0. Thus, the only terms that remain are those with j = i.
Another way to see that Var(X̄n) = Var(Xi)/n is to notice that when Y and Z are independent,
Var(Y + Z) = Var(Y) + Var(Z). Thus:

Var((1/n)X1 + (1/n)X2 + · · · + (1/n)Xn) = n · Var((1/n)Xi) = n · (1/n²) · Var(Xi) = Var(Xi)/n
When our statistic is computed as θ̂ = g(X1 , X2 , . . . Xn ) from an i.i.d sample of Xi , Fθ̂ depends upon
three things: the function g, the population distribution of Xi , and the sample size n.
Knowing the sampling distribution of a statistic is typically a hard problem. We know g and n,
but in a research setting we don’t generally know the CDF F that describes the underlying population.
However, if we view θ̂ as a point along a sequence of random variables Zn , it is often possible to say
something about the limiting behavior of FZn as n → ∞. Asymptotic theory is a set of tools for describing
this limiting behavior. The law of large numbers is one such tool. If we believe that our actual sample
size n is large enough that FZn ≈ FZ∞ , then tools like the LLN can be extremely useful. For the sample
mean for example, we might, on the basis of the LLN, be prepared to believe that X̄n is close to µ with
very high probability.
Conceptually, we can think of what we're doing as follows. Suppose our sample size is n = 10,576,
and we calculate a statistic θ̂ = g(X1, X2, . . . X10,576) from our sample. Now imagine applying the
same function g to samples of size 1, 2, . . . and so on, and defining a sequence Z1, Z2, . . . of the
corresponding values. Each Z along this sequence is itself a random variable: let FZ1, FZ2, . . . be their
corresponding CDFs. Our statistic θ̂ can be seen as a specific point along this sequence: θ̂ = Z10,576
(circled in red in Figure 4.1). Since we don't know FZ10,576, but we can say something about FZ∞, we
use the latter as an approximation for the former. Figure 4.1 depicts this logic.
[Figure 4.1: a table with columns n, Zn, and FZn, listing n = 1, 2, . . . , ∞ alongside the corresponding random variables Z1, Z2, . . . , Z∞ and their CDFs.]
Figure 4.1: We are interested in the sampling distribution of some statistic θ̂, computed on our sample of 10, 576
observations. This is in general hard to compute. As a tool, we imagine a sequence of random variables Z1 , Z2 , . . .
in which θ̂ = Z10,576 . Asymptotic theory allows us to derive properties of FZ∞ , the limiting distribution of Zn
as n → ∞ (circled in green). Then we use FZ∞ as an approximation to FZ10,576 , which we justify by n being
“large”. The above figure depicts a situation in which Zn = X̄n , so that the distribution of Zn narrows to a
point as n → ∞ (by the LLN).
Of course, the above technique only works if we can say something definite about FZ∞. The law of large
numbers says that we can when our statistic is the sample mean. In Section 4.4, we'll see that the central
limit theorem provides even more information about the limiting distribution of the sample mean: that
it will become approximately normal, regardless of F.
Note: The logic of Figure 4.1 is the "classical" approach to approximating the sampling distribution
of θ̂, but it is certainly not the only one. An increasingly popular alternative involves
bootstrap methods. These methods still appeal to n being "large enough", but they do so in a different
way. They also require computing power, because bootstrapping involves resampling new
datasets from our original dataset X; this has become increasingly feasible, and bootstrap-based
methods have become correspondingly popular.
and n = 1, 000. You can think of this as illustrating Figure 4.1 for the specific population distribution
F that describes a coin-flip. With n = 2, we see that we have a 50% chance of getting X̄n of 0.5, which
is the true “population mean” of Xi : µ = E[Xi ] = 0.5. Then 25% of the time we get X̄n = 0 (two flips
of tails), and 25% of the time we get X̄n = 1 (two flips of heads). Thus, the distribution of X̄n is not
very well concentrated around µ = 0.5.
The red vertical lines in Figure 4.2 illustrate the law of large numbers in action. They mark the
points 0.45 and 0.55, which correspond to ϵ = .05 in Theorem 1. We can see that by the time n = 100,
P(|X̄n − 1/2| > 0.05) starts to become reasonably small; roughly 1/3 of the mass of X̄n is outside of
[0.45, 0.55]. When n = 1000, there is an imperceptible chance of obtaining an X̄n outside of the vertical
red lines. If we continued this process for larger and larger n, we would see the mass of X̄n continue to
cluster closer and closer to µ = 1/2. Regardless of how small an ϵ we choose, we can always find an n
that fits as much of the mass as we want inside the corresponding red lines.
Note that the law of large numbers does not say that P(|X̄n − µ| > ϵ) must decrease monotonically
with n. For example, for ϵ = .05 we have P(|X̄2 − µ| > ϵ) = 0.5, but P(|X̄3 − µ| > ϵ) = 1, since with
three coin flips X̄3 can never be within .05 of 1/2. All that the LLN says is that P(|X̄n − µ| > ϵ)
eventually gets (arbitrarily) small as n grows, for any value of ϵ.
Figure 4.2: Distributions along the sequence X̄n for a set of n i.i.d. coin flips. Red lines delimit the mass of
the distribution of X̄n that is more than .05 away from 1/2.
The following is the R code I used to generate this figure, if you’d like to copy-paste it and experiment:
# simulate the sampling distribution of the sample mean of n coin flips
# (the set-up below, defining `results` and the loop over `n`, is a sketch of
#  the part of the original script not shown here)
for (n in c(2, 10, 100, 1000)) {
  results <- data.frame(sample_mean = replicate(10000, mean(rbinom(n, 1, 0.5))))

  h <- hist(results$sample_mean, plot = FALSE, breaks = seq(from = 0, to = 1, by = .01))
  h$density <- h$density / 100   # rescale density to the proportion of samples per bin
  plot(h, freq = FALSE,
       main = paste0("Distribution of sample means, n = ", n, " coin flips"),
       xlab = "Sample mean", ylab = "Proportion of samples", col = "green")
  abline(v = c(.45, .55), col = c("red", "red"))
}
4.3 Convergence in probability and convergence in distribution
Given a sequence of random variables or random vectors Z1, Z2, . . . , let us now define two notions of
convergence of the sequence Zn. The first is convergence in probability:
Definition 4.2. We say that Zn converges in probability to Z if for any ϵ > 0:

lim_{n→∞} P(||Zn − Z|| > ϵ) = 0
In this definition, Zn can be a random variable/vector. When Zn is a random variable, the notation
||Zn − Z|| just refers to the absolute value of the difference: |Zn − Z|. When Zn is a vector, we can take
||Zn − Z|| to be the Euclidean norm of the difference (see Proposition 4.2 for an example).
We will often talk about Zn converging in probability to a constant c. This does not require a second
definition, because a constant is simply an example of a random variable with the degenerate distribution
P(Z = c) = 1. Thus we say that Zn converges in probability to a constant c if lim_{n→∞} P(|Zn − c| > ϵ) = 0
for all ϵ > 0.
Notation: When Zn converges in probability to Z, we write this as Zn →p Z, or alternatively plim(Zn) =
Z. We say that Z is the probability limit of the sequence Zn. We use the same notation when Z is a
constant.
The law of large numbers, for example, says that X̄n →p µ: the sample mean converges in probability to
the "population mean", or expectation, of Xi.
Exercise: This problem gives an example of a sequence that converges in probability to another random
variable, rather than to a constant. Let Zn = Z + X̄n , where Z is a random variable and X̄n is the
sample mean of i.i.d. random variables Xi having zero mean and finite variance. Suppose furthermore
that Z and X̄n are independent. Show that plim(Zn ) = Z.
Notation: When Zn converges in distribution to Z, we write this as Zn →d Z. As with convergence in
probability, Z can be a random vector or a constant.
Note: The requirement that we only consider z where FZ(z) is continuous is a technical condition,
which we can often ignore because we'll be thinking about continuously distributed Z. In general, we
can construct examples in which lim_{n→∞} P(Zn ≤ z) is not right-continuous (and is thus not a valid
CDF), but the valid CDF function FZ(z) nevertheless captures the limiting distribution of Zn. In these
cases we still want to say that Zn →d Z.
The definition given above for convergence in distribution takes Zn to be a (scalar) random variable to
emphasize the idea, but the concept extends naturally to sequences of random vectors. We say that a
sequence of random vectors Zn converges in distribution to Z if, for all z at which the joint CDF FZ(z)
of the components of Z does not have a discontinuity, the CDF of Zn evaluated at that point converges
to FZ(z) as n → ∞.
Convergence in distribution essentially says that the CDF of Zn converges point-wise to the CDF of
Z. By "point-wise", we mean that this occurs for each value z. When Zn →d Z, we often refer to Z as
the "large-sample" or "asymptotic" distribution of Zn.
We close this section by investigating the relationship between convergence in probability and con-
vergence in distribution. Convergence in distribution is a weaker notion of convergence (and is in fact
often called “weak” convergence), in the sense that it is implied by convergence in probability.
Proposition 4.1. If Zn →p Z, then Zn →d Z. In the special case that Z is a degenerate random variable
taking the value c, then Zn →d c also implies Zn →p c. Thus when Z is degenerate, convergence in
distribution and convergence in probability are equivalent to one another.
One manifestation of the fact that convergence in probability is stronger than convergence in distribution
is that with the former, convergence of the elements of a random vector implies convergence of the whole
random vector:

Proposition 4.2. If Xn →p X and Yn →p Y, then (Xn, Yn)′ →p (X, Y)′.
Proof. lim_{n→∞} P(|Xn − X| > ϵ) = 0 and lim_{n→∞} P(|Yn − Y| > ϵ) = 0 hold for any ϵ > 0; in particular,
they hold at the value ϵ/√2. Let Zn := (Xn, Yn)′ and Z := (X, Y)′. Since ||Zn − Z|| =
√((Xn − X)² + (Yn − Y)²) being larger than ϵ is the same as (Xn − X)² + (Yn − Y)² being larger than ϵ², and since
at least one of (Xn − X)² or (Yn − Y)² must then be larger than ϵ²/2, we have:

P(||Zn − Z|| > ϵ) = P((Xn − X)² + (Yn − Y)² > ϵ²) ≤ P((Xn − X)² > ϵ²/2 or (Yn − Y)² > ϵ²/2)
                  ≤ P((Xn − X)² > ϵ²/2) + P((Yn − Y)² > ϵ²/2) = P(|Xn − X| > ϵ/√2) + P(|Yn − Y| > ϵ/√2)

so we have that

lim_{n→∞} P(||Zn − Z|| > ϵ) ≤ lim_{n→∞} P(|Xn − X| > ϵ/√2) + lim_{n→∞} P(|Yn − Y| > ϵ/√2) = 0 + 0 = 0
Meanwhile, the same is not true of convergence in distribution: Xn →d X and Yn →d Y do not in
general imply that (Xn, Yn)′ →d (X, Y)′. However, one important special case in which they do is when X or
Y is a degenerate random variable. This is useful, for example, in proving Slutsky's Theorem in Section 4.5.
The next section will introduce the most famous and useful instance of convergence in distribution:
the central limit theorem (CLT). After introducing the CLT, we will return in Section 4.5 to some fur-
ther properties of convergence in probability and convergence in distribution, that will be useful in the
analysis of large samples.
Optional: There is an even stronger notion of convergence than convergence in probability, referred to
as almost-sure convergence. We say that Zn converges almost surely to Z, or Zn →a.s. Z, if

P( lim_{n→∞} Zn = Z ) = 1

To make sense of this expression we have to place a probability distribution over entire sequences
{Zn} (something we didn't need to do for convergence in probability or convergence in distribution).
That is, we imagine a probability space in which each outcome ω yields a realization
of all of the random variables: Z, Z1, Z2, Z3, and so on. Then, the above expression says that
P({ω ∈ Ω : lim_{n→∞} Zn(ω) = Z(ω)}) = 1. In words: the probability of getting a sequence of Zn
that does not converge to Z as n grows is zero.
Almost sure convergence is stronger than convergence in probability, i.e. Zn →a.s. Z implies that
Zn →p Z (which of course in turn implies that Zn →d Z). The strong law of large numbers states
that the sample mean in fact converges almost surely to the population mean, that is X̄n →a.s. µ.
Theorem 2 (central limit theorem). If Xi are i.i.d random vectors and E[Xi′Xi] < ∞, then

√n(X̄n − µ) →d N(0, Σ)

where µ = E[Xi] and Σ = Var(Xi).
Example: Suppose for simplicity that we had reason to believe that σ² = 1, i.e. we have a random variable
Xi with a variance of 1. However, we don't know µ. We do know then, by the CLT, that for large n, X̄n
is approximately normally distributed around µ with a variance of 1/n. This is extremely useful, because
we can now evaluate candidate values of µ based on how unlikely we would be to see a value of X̄n like
the one that we calculate, if that value of µ were true. Suppose for example that n = 100, and in our
sample we observed that X̄n = 0.31. You want to evaluate the possibility that µ = 0. Well, if this were
the true value of µ, then given the asymptotic approximation that X̄n ∼ N(0, 1/n) (or equivalently, that
10·X̄n ∼ N(0, 1)), we'd only expect to see a value of X̄n as large as 0.31 about once in 1000 samples. We
might thus be willing to rule out µ = 0 as a possibility. This is an example of a hypothesis test, which
will be covered in Section 5.4.1.
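The rough probability in this example can be computed directly in R from the approximation X̄n ∼ N(0, 1/n) under µ = 0:

n <- 100
xbar <- 0.31

# probability of a sample mean at least this large if mu = 0,
# using the CLT approximation Xbar_n ~ N(0, 1/n)
1 - pnorm(xbar, mean = 0, sd = sqrt(1 / n))    # roughly 0.001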
Figure 4.3: The same simulation as in Figure 4.2, except now we plot the distribution of √n(X̄n − 1/2) rather
than of X̄n. The CLT tells us that √n(X̄n − 1/2) →d N(0, 1/4), since 1/4 is the variance of Xi. Green dashed
lines depict what is predicted by the distribution N(0, 1/4), which we can see becomes close to what we observe for
larger values of n.
The following proof of the CLT is not necessary for you to know, but you may find it interesting, and
being able to follow it is a good study device.
We'll consider a proof for the univariate case, which can be extended to random vectors
using the Cramér-Wold theorem introduced in Section 4.5. The proof here will use the concept
of a moment generating function:

MX(t) := E[e^(t·Xi)] = 1 + t·E[Xi] + (t²/2)·E[Xi²] + (t³/3!)·E[Xi³] + . . .    (4.1)

where the second equality uses the Taylor expansion of e^(tx). This will be a useful expression for
the moment generating function MX(t). Note that MX(t) is a (non-random) function of t: the
randomness in Xi has been averaged out.
A useful result (that we will not prove here) is that if two random variables X and Y have the
same moment generating function MX (t) = MY (t) for all t, then they have the same distribution.
Our goal will be to show that whatever the distribution of Xi, the moment generating function
of √n(X̄n − µ) converges to that of a normal random variable with variance σ² = Var(Xi).
Let us divide out the variance to rewrite the CLT (in the univariate case) as √n·(X̄n − µ)/σ →d N(0, 1).
The moment generating function of the standard normal distribution is:

MZ(t) := (1/√(2π)) ∫ e^(tx) · e^(−x²/2) dx = e^(t²/2) · (1/√(2π)) ∫ e^(−(x−t)²/2) dx = e^(t²/2)

where we've used that (x − t)² = x² − 2tx + t², and that the final integral is over the density of a
normal random variable with mean t and variance 1 (and hence equals one).
Now for the magic part. We'll show that whatever the distribution of Xi is, and hence whatever
the moment generating function of Xi, the moment generating function of

√n·(X̄n − µ)/σ = (1/√n)·(X1 − µ)/σ + (1/√n)·(X2 − µ)/σ + · · · + (1/√n)·(Xn − µ)/σ

will end up being e^(t²/2)!
First, note that when Y and Z are independent of one another, the moment generating function
of Y + Z is equal to the product of their moment generating functions, i.e. E[e^(t(Y+Z))] =
E[e^(tY)·e^(tZ)] = E[e^(tY)]·E[e^(tZ)]. Applying this to the above expression, we have that:

M_{√n(X̄n−µ)/σ}(t) = M_{(X1−µ)/(√n·σ)}(t) · M_{(X2−µ)/(√n·σ)}(t) · · · · · M_{(Xn−µ)/(√n·σ)}(t) = [ M_{(Xi−µ)/(√n·σ)}(t) ]^n
Note that for any random variable Y, M_{(1/√n)·Y}(t) = MY(t/√n). Therefore, we wish to show that

lim_{n→∞} [ M_{(Xi−µ)/σ}(t/√n) ]^n = e^(t²/2)

for any t.
Applying the Taylor series expansion of the moment generating function in Equation 4.1, we have
that:

M_{(Xi−µ)/σ}(t/√n) = 1 + (t/√n)·E[(Xi − µ)/σ] + (t²/2n)·E[((Xi − µ)/σ)²] + (t²/n)·g(t/√n)

where by Taylor's theorem lim_{n→∞} g(t/√n) = 0. Note that E[(Xi − µ)/σ] = 0 and E[((Xi − µ)/σ)²] =
1, and thus we wish to show that

lim_{n→∞} ( 1 + t²/(2n) + (t²/n)·g(t/√n) )^n = e^(t²/2)
Recall the identity that lim_{n→∞}(1 + x/n)^n = e^x. If we could ignore the g term then we would be done.
To show that the g term indeed does not contribute in the limit, consider taking the natural
logarithm of both sides of the above equation (since the log is a continuous function, it preserves
limits):

lim_{n→∞} ln[ ( 1 + t²/(2n) + (t²/n)·g(t/√n) )^n ] = lim_{n→∞} n·ln( 1 + t²/(2n) + (t²/n)·g(t/√n) )
    = lim_{n→∞} n·[ t²/(2n) + (t²/n)·g(t/√n) ] = t²/2 + t²·lim_{n→∞} g(t/√n)
    = t²/2

where we've used the Taylor theorem for the natural logarithm, ln(1 + z) = z + z·h(z) with
lim_{z→0} h(z) = 0, together with the fact that lim_{n→∞} [ t²/(2n) + (t²/n)·g(t/√n) ] = 0.
Example: By the law of large numbers and the CMT: X̄n + 5 →p µ + 5, where µ = E[Xi].
Example: Let Zn = √n·(X̄n − µ)/σ, where σ² = Var(Xi). Then by the CLT and CMT: Zn² = n·((X̄n − µ)/σ)² →d χ²₁, where χ²₁ is
the chi-squared distribution with one degree of freedom (this is the distribution of a standard normal
N(0, 1) random variable squared).
Note: The assumption that h is (globally) continuous can be weakened, which is often important in
applications.
When Z is a constant (call it c), then the convergence in probability part of the CMT only requires
that h(z) be continuous at c, rather than everywhere.
The convergence in distribution part of the CMT can be extended to cases in which h has a set
of points z ∈ D at which it is discontinuous, provided that P(Z ∈ D) = 0. This is useful when
combined with the CLT, for which Z is continuously distributed. Hence applying an arbitrary
function h to Zn = √n(X̄n − µ) allows us to use the CMT provided that h has only a discrete set
of points of discontinuity.
A set of useful/common applications of the CMT is summarized by the so-called Slutsky's Theorem:

Theorem 4 (Slutsky's Theorem). Suppose Zn →d Z and Yn →p c with c a constant. Then:
    Zn + Yn →d Z + c
    Zn · Yn →d cZ
    Zn/Yn →d Z/c, if c ≠ 0.
To see how these results follow from Theorem 3, note that since c is a constant, Zn →d Z and Yn →p c is
equivalent to

(Zn, Yn)′ →d (Z, c)′

(see discussion following Proposition 4.2). Then we can apply the CMT to the sequence (Zn, Yn)′, with the
following continuous functions h, respectively:
    h(z, y) = z + y
    h(z, y) = z · y
    h(z, y) = z/y (which is continuous wherever y ≠ 0)
where ∇h(z) = (∂h(z)/∂z1, ∂h(z)/∂z2, . . . )′ is the vector of partial derivatives of h with respect to each component
of z.
Consider now what this implies in the case of the CLT:
Corollary 1. If Xi are i.i.d random vectors, h(x) is a function that is continuously differentiable at
x = µ, and E[Xi′Xi] < ∞, then

√n·(h(X̄n) − h(µ)) →d N(0, ∇h(µ)′Σ∇h(µ))
Proof. Beginning from Theorem 5, we only need to show that for a random variable Z ∼ N(0, Σ),
∇h(µ)′Z ∼ N(0, ∇h(µ)′Σ∇h(µ)). We can see this in two steps. First of all, since a linear combination
of normal random variables is also normal, we know that a′Z is normal for any normally-distributed
k-component random vector Z and k-component vector a. We thus need only to work out the mean and
variance of ∇h(µ)′Z to characterize its full distribution. By linearity of the expectation, E[∇h(µ)′Z] = 0,
since each component of Z has mean zero. You also showed in HW 3 that the variance of a′Z is a′Σa.
Substituting a = ∇h(µ) completes the proof.
The most important special case of the corollary above is when Xi is a random variable. In this case,
we don't need any matrix multiplication, and we have that:

√n·(h(X̄n) − h(µ)) →d N(0, ((d/dx)h(µ))² · σ²)

Note that if the function h is very sensitive to the value of x near µ, i.e. (d/dx)h(µ) has a large magnitude,
then the asymptotic variance of h(X̄n) will be large, since the function h blows up the variance of Xi by
a factor of ((d/dx)h(µ))².
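A simulation sketch of this scalar case, taking h(x) = x² (so that (d/dx)h(µ) = 2µ), with µ = 2 and σ² = 1 as assumed values:

set.seed(1)
n <- 1000; mu <- 2; sigma <- 1
h <- function(x) x^2

# simulate sqrt(n) * (h(xbar_n) - h(mu)) across many samples
z <- replicate(10000, {
  xbar <- mean(rnorm(n, mean = mu, sd = sigma))
  sqrt(n) * (h(xbar) - h(mu))
})

var(z)                   # should be close to ...
(2 * mu)^2 * sigma^2     # ... the delta-method variance ((d/dx)h(mu))^2 * sigma^2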
Exercise: Use the Cramér–Wold device to show that if Theorem 2 applies to random variables Xi ,
then it applies to a random vector Xi = (X1i , X2i , . . . Xki )′ as well (assume that any necessary moments
exist).
the proportion of the sample for which Xi ≤ x. When considering this quantity across all x, we call the
resulting function the empirical CDF of Xi , denoted as Fn (x).
Thus, for each x the empirical CDF evaluated at x converges in probability to the population CDF
p
evaluated at x, i.e. Fn (x) → F (x). This result can be strengthened in two ways (which are not implied by
the weak law of large numbers). Consider the error in Fn (x) as an approximation of F (x), |Fn (x) − F (x)|
as a function of x. This may be larger or smaller depending on x. The Glivenko-Cantelli theorem states
that even the largest error, over all x, converges to zero, and furthermore that this convergence is almost
sure convergence (see box at the end of Section 4.3), rather than convergence in probability:
Theorem 7 (Glivenko-Cantelli theorem). If Xi are i.i.d, then:

sup_{x∈R} |Fn(x) − F(x)| →a.s. 0
We won’t use Theorem 7 in this class, but it can be useful for proving properties of asymptotic sequences
that involve quantities that cannot be written as a function of X̄n .
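A quick simulation sketch of the empirical CDF approaching the population CDF uniformly (approximating the supremum over a grid of points):

set.seed(1)
grid <- seq(-3, 3, by = 0.01)

# largest gap between the empirical CDF and the true N(0,1) CDF, for two sample sizes
for (n in c(100, 10000)) {
  x <- rnorm(n)
  cat("n =", n, " sup|Fn - F| ~", max(abs(ecdf(x)(grid) - pnorm(grid))), "\n")
}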
Chapter 5
This chapter presents a formal view of the goals of using statistics for econometrics. It starts with the
question: what is it that we would like to learn? Once we’ve defined our “parameter of interest”, we can
separate much of econometrics into three parts: identification, estimation and inference.
I will not attempt a thorough or rigorous treatment of many of the concepts this chapter touches
upon. Rather, I hope it can present a unified way to think about several concepts you have probably
seen in one form or another in previous courses, and serve either as a reference or a starting point to
exploring terms in econometrics as you come across them in your own research.
First type (model parameters): Think back to the idea of a parametric statistical model, introduced in
Section 3.1. Suppose we observe i.i.d data Xi, where the distribution of Xi is thought to belong to
a parametric family F (·; θ) for some θ ∈ Θ. For example, we might be willing to assume that Xi is a
normally distributed random variable, with unknown mean µ ∈ R and variance σ 2 > 0. In this case,
θ = (µ, σ), and in the absence of any further assumptions about θ: Θ = R × R+ , where R+ is the set of
all positive real numbers (the variance cannot be negative). In this context, it is natural to take the full
vector of model parameters θ to be our parameter of interest (of course, we might only be interested in
e.g. µ, in which case µ alone is our parameter of interest, and similarly with σ).
Second type (features of observed variables in the population): We don’t need parametric statistical mod-
els to talk about parameters of interest, however. If we have i.i.d. data drawn from any population
distribution F , we might think of some aspect of F that we’d like to know. For example, we might be
interested in E[Xi ], but don’t want to assume that Xi is normally distributed, as in the last example.
Then, our parameter of interest is θ = E[Xi ]. Another parameter of interest might be the median of F ,
the point x at which F (x) = 1/2. In this case, θ = inf{x : F (x) ≥ 1/2} (this general definition allows
for a non-continuous F , in which case there may be no x such that F (x) = 1/2 exactly).
Third type (quantities that depend on unobservables): One of the exciting and difficult things about
applied econometrics is that often our parameters of interest do not depend solely on the distribution F
of the vector of variables X that we observe in our data. Rather, θ often depends also on the distribution
of some other variables U that are not observed. This situation most often arises when discussing
causality, for example when our parameter of interest summarizes the causal effect of a policy. Talking
about causality involves some new notation and concepts, so we’ll defer further discussion to Chapter
6. As a simpler example of a situation that involves unobservables, let us consider a different important
practical problem: measurement error.
Suppose our parameter of interest is θ = E[Zi ], the average value of some random variable Zi .
However, our data was not recorded perfectly, and instead of an i.i.d sample of Zi , we observe an
i.i.d sample of Xi = Zi + Ui , where Ui represents unobserved “measurement error”. In this case, our
parameter of interest can be written as E[Xi − Ui ], which depends both upon the distribution of X and
the distribution of U .
5.2 Identification
Once we have a parameter of interest in mind, a good starting point is often to ask the question: “could
I determine the value of θ if I had access to the population distribution F underlying my data?”.
If the answer is no, then no amount of statistical wizardry will allow you to learn the value of θ. If
the answer is yes, then we say that θ is identified.
Definition 5.1. Given a statistical model F for (X, U ), we say that θ is identified when there is a
unique value θ0 of θ compatible with FX , the population CDF of observable variables X.
Often identification is described as saying that if we observed an “infinite” sample, we could determine
the value of θ. The reason for this is that by the law of large numbers, we can learn the entire population
distribution of X from an i.i.d sample Xi , as the sample size goes to infinity (see discussion in Section
4.6). Of course, we never observe an infinitely large dataset, but defining identification in terms of what
we could know if we did cleanly separates problems of research design from the statistical problem of
having too small a sample.
Whenever our parameter of interest is defined directly from the population distribution FX of observables
(e.g. θ = E[Xi ]), it will be identified. Thus, parameters of the second type are always identified. This
logic often applies to parameters of the first type as well, except in cases when F (·; θ) doesn’t always
change with θ (see example below). Questions of identification usually arise in the third case, when our
parameter of interest θ depends on the distribution of unobservables: for example when we’re interested
in causality, have measurement error, or have “simultaneous equations”.
Example: Suppose Xi are i.i.d draws from N (µ, σ 2 ). Then the parameters µ and σ are identified, because
each pair (µ, σ) gives rise to a different CDF FX of Xi .
Example: Suppose Xi are i.i.d draws from N (min{θ, 5}, σ 2 ). Then θ is not identified, because different
values of θ (e.g. θ = 6 vs. θ = 7), do not give rise to a different CDF F of Xi .
Example: In the measurement error example, suppose that we’re willing to assume that E[Ui ] = 0,
that the measurement error averages out to zero (e.g. there are equal chances of getting positive and
negative errors of the same magnitude). Then θ = E[Zi ] is identified, since now E[Zi ] = E[Xi ]. This
example underscores the role of F in Definition 5.1. Whether or not θ is identified often depends on what
assumptions we are willing to make, which restrict the set F of possible joint-distributions for (X, U ).
Below I discuss some additional issues related to identification, which may relate to terms you’ve heard
floating around about identification:
Partial vs. point identification: Sometimes knowing FX is not enough to pin down the value of θ,
but it is enough to determine a set of values that θ might take. For example, we may be able to
determine upper and lower bounds for θ. In such cases we often say that θ is partially identified.
This can be contrasted with Definition 5.1, which describes point identification.
Definition 5.2 (full identification of a model). Given a statistical model F for (X, U), we
say that the model is identified when the set {θ ∈ Θ : FX(·) = FX(·; θ)} is a singleton,
where FX is the CDF of X.
Definition 5.2 says that there is a unique value θ0 ∈ Θ such that FX(·; θ0) is equal to the
population distribution of observables Xi. This situation arises often in econometrics in the
context of so-called structural models, in which the entire model can be characterized by a finite
set of model parameters.
5.3 Estimation
If our parameter of interest θ is identified, then we can move on to our next question: how can we
estimate it?
In this section, we treat the task of estimating θ as a decision problem. In the next section, we’ll take
the same approach to testing hypotheses about θ. This way of thinking about estimation and inference
is called statistical decision theory.
Let’s think about the task of estimating θ as a problem of choosing an optimal strategy in a particular
game, which we play along with “nature”. Nature goes first, giving us a sample X, the distribution of
which we denote abstractly as P (this is equivalent to the joint-CDF of all of the components of X). Our
goal is to think about how to form θ̂ = g(X) as a function of the data X. How should we proceed?
Recall that in game theory, a strategy is a complete profile of what we would do, given whatever the
other players do. In this context, a strategy is not a particular numerical estimate of θ, but the function
g. For example, if our estimator is the sample mean, then g(X) = g(X1, X2, . . . Xn) = (1/n) Σ_{i=1}^n Xi, which
will depend upon the particular values of Xi that occur in our sample.
As in game theory, our best response to the actions of nature will depend upon our preferences (a.k.a.
our utility function). In statistical decision theory this takes the form of a "loss function": L(θ̂, θ0), where
θ0 is the true value of θ. For the most part, we consider the so-called quadratic loss function:

L(θ̂, θ0) = ||θ̂ − θ0||₂²

When θ is a scalar, this is just the square of the difference between our estimator θ̂ and the true
value θ0.
However, remember that θ̂ = g(X) is a random variable/vector, which depends on our randomly
drawn dataset X. Thus to pick a strategy g, we need to define our preferences over “lotteries”, again–as
in standard game theory. In line with expected utility theory, the convention here is to take our optimal
action g to be the minimizer of expected loss: E[L(θ̂, θ0 )] where the expectation is over the distribution
of X. The risk function Rg (θ) of estimator g views the expected loss as a function of the true value of
θ. It is common to write this as Eθ [L(θ̂, θ)], where the notation Eθ makes it clear that the distribution
of X must depend in some way on the value of θ. This is motivated by cases in which we have i.i.d.
data from a parametric statistical model where θ indexes the population distribution of Xi . Then the
distribution of X depends on just two things: n and the true value of θ.
When we use the quadratic loss function, the optimal estimator g would be

g* = argmin_g E[||g(X) − θ0||₂²]

where the expectation is over the distribution of X. However, solving this problem is not easy, because we generally don't know the distribution of X ex-ante.
Statisticians have nevertheless developed various strategies to try to keep E[||g(X) − θ0||₂²] small. These
strategies are best understood as ways to navigate the so-called bias-variance tradeoff. The following
proposition shows that expected quadratic loss can be decomposed into two terms: one capturing the
square of the “bias” of the estimator, and the other capturing its variance.
For simplicity, we state this result in the special case that θ is a scalar. We’ll also just write θ̂ rather
than g(X), to keep the notation simple.
Proposition 5.1 (the bias-variance decomposition).

E[(θ̂ − θ0)²] = Var(θ̂) + (E[θ̂] − θ0)²    (5.2)
5.3.1.1 Consistency
The first thing that we might ask of our estimator is that it be consistent. What we mean by that is that

θ̂ →p θ0

regardless of the value of θ0. Consistency means that as n goes to infinity, the entire expected loss in
Equation (5.2) converges to zero.
if we set p = 1/2. Note that if we set the power p on n to be any larger than 1/2, then the LHS
would blow up, rather than converging in distribution to anything (like a normal distribution). On the
other hand, if we had set p < 1/2, then n^p(X̄n − µ) will simply converge in probability to zero. p = 1/2 is
the "Goldilocks" level at which we get a non-degenerate asymptotic distribution for n^p(X̄n − µ).
In general, when we have a consistent estimator θ̂, we call the maximum value of p such that n^p(θ̂ − θ0)
converges in distribution to something (technically, to some distribution that is "bounded in probability")
the rate of convergence of θ̂. The rate of convergence of the sample mean is 1/2, and we often say that it
is √n-consistent. √n-consistency is a desirable property, which is shared by many common estimators.
However, some estimators have a slower rate of convergence. For example, suppose we'd like to estimate
the density θ = f(x) of a d-dimensional random vector Xi at some point x, and we'd like to make this
estimation non-parametric—that is, not based on assuming a parametric model for f(x).
We can do so using the so-called kernel density estimator f̂K(x), which has a rate of convergence
no better than p = 2/(d + 4). When d = 1, for example, we can only blow up (f̂K(x) − f(x)) by a factor
of n^(2/5) and get an asymptotic distribution. In practice, this means that we need a larger sample n
for asymptotic arguments to provide good approximations to the sampling distribution of fˆK (x). This
becomes a real problem as d starts to increase: for example, the rate of convergence fˆK (x) when d = 5 is
just 2/9. This problem is often referred to as the curse of dimensionality, and is why we need very large
samples—and/or even more clever techniques—to do non-parametric estimation with many covariates.
5.3.1.3 Unbiasedness
An estimator is unbiased if it manages to make the second term in Equation (5.2) zero, that is:

E[θ̂] = θ0
Unbiasedness has a nice interpretation: we know that θ̂ ̸= θ0 in general, but we know that θ̂ will be right
on average, over different realizations of our dataset.
An example of an unbiased estimator is the sample mean, when our parameter of interest θ is the
population mean E[Xi ]. In Section 4.1, we showed indeed that E[X̄n ] = E[Xi ], regardless of n or the
true value of E[Xi ] (so long as it exists).
Note that an estimator can be consistent without being unbiased. For example, the estimator θ̂ =
((n+1)/n)·X̄n is biased as an estimator for θ0 = E[Xi], because

E[θ̂] − θ0 = ((n+1)/n)·θ0 − θ0 = θ0/n ≠ 0

unless θ0 = 0. However, this θ̂ is consistent: as n approaches infinity, both its bias
and its variance converge to zero. If an estimator has an asymptotic bias (that is, a bias that doesn't go
away with n), then it cannot be consistent.
5.3.1.4 Efficiency
Econometricians often speak of an estimator as being efficient. Loosely speaking, this typically means
that θ̂ minimizes mean squared error (5.2) among some class of estimators.
For example, we might consider the class of unbiased estimators, and ask whether a given θ̂ minimizes
Eq. (5.2). Since the bias term is zero for all estimators in this class, the efficient estimator will be the
one that minimizes variance.
In the context of parametric models, the Cramér-Rao lower bound establishes the smallest variance that
an unbiased estimator can possibly have (even when θ is a vector, though the definition of “smallest”
here requires qualification). The maximum likelihood estimator, discussed in the next section, achieves
this bound: it is thus efficient, whenever it happens to be unbiased (which is not guaranteed in general).
In this section, I present a few prominent examples of a category of estimator called extremum
estimators. This category overlaps the one I just described: OLS can also be seen as an extremum
estimator. An extremum estimator takes the form:

θ̂ = argmin_{θ∈Θ} Q(X, θ)    (5.3)
where Q(X, θ) is some function of θ and our sample. Why might (5.3) be a good way to construct an
p
estimator? The basic idea is to find cases in which Q(X, θ) → Q0 (θ) for some function Q0 (θ) which has
a unique minimizer within a space Θ of possible values for θ. As the sample gets larger, our θ̂—which
solves the sample minimization problem—converges to θ0 , which solves the population problem Q0 (θ).
The function Q might be provided by economic theory, a statistical model, or some combination of the
two.
An important subclass of extremum estimators are those in which Q(X, θ) takes the form of a sample
mean: Q(X, θ) = (1/n)·Σ_{i=1}^n m(Xi, θ) for some function m(Xi, θ). Such estimators are called M-estimators. Given
the law of large numbers, M-estimation will work well, in the sense of at least being consistent, under
fairly general technical conditions on m and the distribution of Xi (which we will not discuss here).
which we can gather into a vector equation E[g(Xi , θ)] = 0. The most prominent example of a model
like (5.4) is the linear regression model, which we will study in detail in Chapter 7. Another example
is the linear instrumental variables model with l instruments and k endogenous variables. The moment
conditions (5.4) are a statistical model, in the sense of Section 3.1, because they restrict the distribution
of Xi to be among those for which E[g(Xi , θ0 )] = 0, where θ0 is the true value of θ. If Equation (5.4)
has only one solution (which must then be equal to θ0 ), then the parameter θ is identified. In general, a
necessary condition for identification is l ≥ k.
The basic method-of-moments estimator θ̂MM considers the case in which k = l; i.e. we have the same number of moments and parameters we wish to estimate. In this setting, we simply replace the population expectations in (5.4) by their sample counterparts; that is, we choose θ̂MM such that (1/n) · Σ_{i=1}^n g(Xi, θ̂MM) = 0. Note that this equation is the first-order condition of the extremum estimator in which

Q(X, θ) = ‖ (1/n) · Σ_{i=1}^n g(Xi, θ) ‖² = ( (1/n) · Σ_{i=1}^n g(Xi, θ) )′ ( (1/n) · Σ_{i=1}^n g(Xi, θ) )
The most famous example of a method-of-moments estimator is the ordinary least squares estimator for
the linear regression model, which we will study in depth in Section 7.5.
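Another standard illustration (a sketch with arbitrary parameter values, included only to make the recipe concrete) is a Gamma distribution with shape k and scale s, for which E[Xi] = k·s and Var(Xi) = k·s². Matching the first two sample moments to these two population moments and solving gives the method-of-moments estimates:

import numpy as np

rng = np.random.default_rng(2)
X = rng.gamma(shape=3.0, scale=2.0, size=50_000)   # made-up true values: shape 3, scale 2

# two moment conditions: E[X] = k*s and Var(X) = k*s^2
m1 = X.mean()
var = X.var()
k_hat = m1 ** 2 / var        # solving the two equations for the shape...
s_hat = var / m1             # ...and for the scale
print(k_hat, s_hat)          # should be close to (3, 2)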
The so-called generalized method of moments (GMM) considers the more general case in which l ≥ k.
When we have more moment equations than parameters we aggregate over them with some k × l matrix
B to yield k equations: BE[g(Xi , θ)] = 0 which are linear combinations of the original ones (where now
0 has k components). The GMM estimator θ̂GMM minimizes a quadratic form in the sample analog of these moments: θ̂GMM = argmin_{θ∈Θ} Q(X, θ), where

Q(X, θ) = ‖ B · (1/n) · Σ_{i=1}^n g(Xi, θ) ‖² = ( (1/n) · Σ_{i=1}^n g(Xi, θ) )′ W ( (1/n) · Σ_{i=1}^n g(Xi, θ) )
where W := B′B is an l × l matrix that is positive semi-definite. Note that θ̂GMM depends on our choice of B through W. Efficient GMM uses a data-driven weighting matrix Ŵ in order to minimize the asymptotic variance of θ̂GMM.
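Here is a minimal numerical sketch of GMM (the model and moment functions below are chosen only for illustration): suppose Xi ~ N(θ0, 1), and use the l = 2 moments g(Xi, θ) = (Xi − θ, Xi² − θ² − 1)′ for the single parameter θ. With W = I2 the estimator minimizes the quadratic form above.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
X = rng.normal(1.5, 1.0, size=5_000)            # made-up true theta0 = 1.5

def Q(theta):
    # sample mean of the two moment functions, then the quadratic form g_bar' W g_bar with W = I
    g_bar = np.array([np.mean(X - theta), np.mean(X ** 2 - theta ** 2 - 1.0)])
    return g_bar @ g_bar

theta_gmm = minimize_scalar(Q, bounds=(-10.0, 10.0), method="bounded").x
print(theta_gmm)                                 # close to 1.5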
Note: A related method uses moment inequalities, in which the equals signs in Equation (5.4) are re-
placed with inequalities ≤. Models with moment inequalities typically result in partial identification of θ.
Note that the method of moments does not require us to have a fully-specified statistical model F(·; θ), in the sense of Section 3.1. Rather, it is typically sufficient to have l ≥ k moment conditions of the form of Equation (5.4) that involve our parameter of interest θ. The method of moments can thus be thought of as employing a semi-parametric model.
Define the population objective L(θ) := E[log(f(Xi, θ))]. For any θ, we have

L(θ) − L(θ0) = E[log(f(Xi, θ))] − E[log(f(Xi, θ0))] = E[log(f(Xi, θ)) − log(f(Xi, θ0))] = E[ log( f(Xi, θ) / f(Xi, θ0) ) ]
Because log is a concave function, we have by a property of expectation known as Jensen's inequality that E[ log( f(Xi, θ)/f(Xi, θ0) ) ] ≤ log E[ f(Xi, θ)/f(Xi, θ0) ]. Now, since f(·, θ0) is the true density of Xi, this is in turn equal to

log E[ f(Xi, θ)/f(Xi, θ0) ] = log ∫ ( f(x, θ)/f(x, θ0) ) · f(x, θ0) · dx = log ∫ f(x, θ) · dx = log(1) = 0

Putting the two displays together, L(θ) − L(θ0) ≤ 0 for every θ, so θ0 maximizes L(θ).
Note: Provided that P (f (Xi , θ) ̸= f (Xi , θ0 )) > 0 for any θ ̸= θ0 , the above can be strengthened to say
that θ0 is the unique maximizer of L(θ).
The maximum likelihood estimator θ̂MLE maximizes the sample analog of L(θ) with respect to θ. In
particular, let the likelihood function of the data L(X; θ) be:
L(X; θ) := Π_{i=1}^n f(Xi, θ)
Given that the Xi are i.i.d., L(X; θ) is the joint density of the observed dataset X = {X1, X2, . . . Xn},
given the model f (·; θ). Maximizing L(X; θ) with respect to θ is equivalent to maximizing the logarithm
of L(X; θ) with respect to θ, leading to the more-familiar “log-likelihood” expression for θ̂M LE :
θ̂MLE = argmax_{θ∈Θ} log( Π_{i=1}^n f(Xi, θ) ) = argmax_{θ∈Θ} (1/n) · Σ_{i=1}^n log(f(Xi, θ))

where the first expression inside the argmax is the "log-likelihood function" log L(X; θ), and where we insert a factor of 1/n (which doesn't change the argmax) in order to view the second expression as a sample analog of the population problem of maximizing L(θ). This problem has a unique solution if, for example, the log-likelihood function log(L(X; θ)) is globally concave, as it is in many familiar models (e.g. the "probit" model).
The above expression also reveals that θ̂M LE is an example of an M-estimator in which the function
m(x, θ) = log (f (x, θ)).
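As a simple numerical illustration (the model Xi ~ N(μ, σ²) and the parameter values below are made up), θ̂MLE can be computed by minimizing the negative average log-likelihood with a generic optimizer:

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(4)
X = rng.normal(loc=2.0, scale=1.5, size=2_000)   # made-up true values: mu = 2, sigma = 1.5

def neg_avg_loglik(params):
    mu, log_sigma = params                       # sigma is parameterized on the log scale to keep it positive
    return -np.mean(norm.logpdf(X, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_avg_loglik, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)                          # close to (2.0, 1.5), and to (X.mean(), X.std())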
When θ̂MLE is an unbiased estimator (which is only true in some applications), it is efficient among all unbiased estimators (look up the Cramer-Rao bound for details). It is also asymptotically efficient among so-called √n-consistent "regular" estimators. These properties however require the model f(x; θ)
to be correctly specified, which may be a hard assumption to defend. When the model is misspecified, then generally θ̂MLE →p θ*, where

θ* = argmin_{θ∈Θ} −E[ log( f(x; θ) / f(x) ) ]
where f(x) is whatever the true density function of Xi is. The quantity inside the minimization is called the Kullback-Leibler divergence between f(x; θ) and f(x), and can be thought of as a kind of "distance" between the two density functions. Thus, when the model is misspecified, θ̂MLE is consistent for a "pseudo-true parameter" θ*: the point in Θ whose implied density f(x; θ*) is as close as possible to the true density f(x), as measured by the Kullback-Leibler divergence. Misspecification occurs when the family {f(·; θ) : θ ∈ Θ} is not a sufficiently big or rich set to contain the true density, i.e. there is no θ0 ∈ Θ for which f(x) = f(x; θ0).
by the law of iterated expectations. The inner expectation represents an average over values of θ with
respect to the posterior distribution π(θ|X), and delivers a function of X (and θ0 ). The outer expectation
integrates over the marginal distribution of the data, just as in Equation 5.1.
Our goal is now to minimize (5.6) with respect to θ̂, to provide an estimate of θ0. Note that since θ̂ = g(X) is a function of X, the optimal g must minimize E[L(θ̂, θ)|X] for every possible value x of X (otherwise we could just change g(x) for that x alone while decreasing (5.6)). Thus, we can focus on the problem of minimizing the risk E[L(θ̂, θ)|X] with respect to θ̂, for a given value of X. The function of X that does this is called the Bayes estimator, θ̂B (let's assume the problem is such that it is unique).
When we use the quadratic loss function L(θ̂, θ) = (θ̂ − θ)2 , as in Equation 5.1 (focusing on the case
of a scalar θ for simplicity), our problem can now be written as
θ̂B = argmin_{θ̂∈Θ} E[L(θ̂, θ)|X] = argmin_{θ̂∈Θ} ∫ π(θ|X) · (θ̂ − θ)² · dθ
Differentiating with respect to θ̂ and setting the derivative to zero gives θ̂B = ∫ θ · π(θ|X) · dθ; i.e. the Bayes estimator θ̂B is equal to the so-called posterior mean of θ. Note that we can use Eq. (5.5) to write the posterior mean in terms of our prior π on θ and the likelihood function:
θ̂B = P(X)^{−1} · ∫ π(θ) · L(θ; X) · θ · dθ = ( ∫ π(θ) · L(θ; X) · θ · dθ ) / ( ∫ π(θ) · L(θ; X) · dθ )
where we pull the P(X) factor out of the integral since it does not depend on θ, and then express it in terms of the likelihood function given that π(θ|X) has to integrate to one. Although it takes the form of an extremum estimator in which Q(X, θ̂) = E[L(θ̂, θ)|X], we do not need to solve an explicit minimization problem to compute the Bayes estimator. Rather, we only need to be able to evaluate the above integrals (typically numerically), given the likelihood function implied by our model and the prior π(θ).
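Here is a minimal numerical sketch of that computation (the model, prior, and parameter values are arbitrary choices for illustration): Xi ~ N(θ, 1) with a standard normal prior on θ. The two integrals are approximated on a grid of θ values; for this conjugate pair the answer can be checked against the known closed form n·X̄n/(n + 1).

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
X = rng.normal(0.8, 1.0, size=50)                        # made-up data with true theta = 0.8

theta_grid = np.linspace(-5, 5, 20_001)                  # grid over the parameter space
prior = norm.pdf(theta_grid, loc=0.0, scale=1.0)         # prior pi(theta)
loglik = np.array([norm.logpdf(X, loc=t, scale=1.0).sum() for t in theta_grid])
lik = np.exp(loglik - loglik.max())                      # likelihood L(theta; X), rescaled to avoid underflow

# ratio of the two integrals; the grid spacing and the rescaling both cancel in the ratio
posterior_mean = np.sum(prior * lik * theta_grid) / np.sum(prior * lik)
print(posterior_mean, len(X) * X.mean() / (len(X) + 1))  # numerical vs. closed-form posterior mean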
θ̂B thus has two ingredients provided by the researcher: a statistical model, which delivers L(θ; X), and a prior π on θ. Since the value of θ̂B depends on how the researcher chooses this prior, you might wonder how we could ever be confident that θ̂B is consistent for θ0, even if the model is correctly specified. Couldn't you pick a really bad prior? The Bernstein-von Mises theorem says that for finite-dimensional θ, the posterior mean and the maximum likelihood estimator θ̂MLE become asymptotically equivalent under general conditions.
Essentially, the influence of the prior π(θ) wanes as the sample becomes large, and the posterior mean
ends up finding the peak of the likelihood function, just as θ̂M LE does. One way to see this is to take
the log of the posterior:
n
! n
π(θ) Y X
log(π(θ|X)) = log · f (Xi , θ) = −Cn + log(π(θ)) + log (f (Xi , θ))
P (X) i=1 i=1
As n gets large, the data contribute n terms to this sum, while the prior always contributes the same
log(π(θ)). The other term Cn := log(P (X)) is also growing with n, but doesn’t depend on θ. Thus, the
dependence of the posterior distribution on θ is entirely driven by the likelihood, asymptotically.
5.4 Inference*
In Section 5.3, our goal was to deliver a point estimate θ̂ of our parameter of interest: a number that we hope is close to the true value θ0 of θ.
Sometimes we can settle for a less ambitious goal, which is to ask not what the exact value of θ0 is, but rather whether or not θ0 belongs to some set of values. I will discuss two approaches of this type: i) hypothesis testing, in which we want to test whether θ0 ∈ Θ0 for some fixed set Θ0; and ii) interval estimation, in which we want to construct a set Θ̂ that has some desirable relationship to θ0 (for example, contains θ0 with high probability).
Note that provided that our model θ0 ∈ Θ is correctly specified, either the null hypothesis H0 or the
alternative hypothesis H1 holds.
Continuing the approach of statistical decision theory, we may now think of our action space as consisting of two actions d ∈ {a, r}: either accept (a) or reject (r) the null hypothesis H0. This can be contrasted with estimation, in which our action space was to pick a specific value in Θ to serve as an estimate for θ.
In this context, a strategy is a mapping from the possible datasets X that we might see to an action
{a, r}. This function d(X) is referred to as a decision rule, or a test. To think about what kind of a test
might be optimal, we again need to specify our preferences, or a loss function, over actions. Compared
with estimation, in which our loss function took the form L(θ̂, θ), it now takes the form L(d, θ0 ): how
happy would we be with our decision d ∈ {a, r}, if we learned the true value of θ was θ0 ?
Compared with estimation, where the quadratic loss function is very standard, in testing it is less obvious what our loss function should be. One thing is clear, however: we'd prefer not to be wrong. We don't want to reject the null hypothesis when in fact θ0 ∈ Θ0, and we also don't want to accept the null hypothesis (accepting is often referred to as "failing to reject" the null) when in fact θ0 ∈ Θ1. The first of these errors is called a Type-I error (falsely rejecting H0) while the second is called a Type-II error (incorrectly accepting H0).
The most basic loss function we might think of is called 0-1 loss, and only cares about whether we are right or not, i.e. L(d, θ0) = 0 when either d = a and θ0 ∈ Θ0 or d = r and θ0 ∈ Θ1 (i.e. we are right), and L(d, θ0) = 1 otherwise (we are wrong). Recall that since X is random, our decision d(X) will be random,
and thus we can again think about the risk, or expected loss, due to a particular strategy d. With the
0 − 1 loss function:
E[L(d(X), θ0)] = { P(d(X) = r)   if θ0 ∈ Θ0   (Type-I error)
                 { P(d(X) = a)   if θ0 ∈ Θ1   (Type-II error)
It is clear from the above that whether or not the null is actually true determines which probability
matters in determining the risk of the test.
Since the value of θ pins down some aspect of the distribution of X, the probability of rejecting
the null will depend upon what the true value of θ0 in fact is. Like the risk function that we saw in
estimation, let us use the notation Pθ (d(X) = r) to denote the probability of rejecting when the true
value is θ. Viewing this as a function of θ, we define the power function of the test d as β(θ) := Pθ(d(X) = r).
Beyond the 0 − 1 loss function, we might put a different penalty on Type-I vs. Type-II errors:
L(a, θ0) = 0 if θ0 ∈ Θ0 and L(a, θ0) = ℓII if θ0 ∉ Θ0, while L(r, θ0) = 0 if θ0 ∈ Θ1 and L(r, θ0) = ℓI if θ0 ∉ Θ1.
The ratio ℓII/ℓI will govern whether our test d should be more conservative about avoiding Type-I errors, or about avoiding Type-II errors.
5.4.2.1 Size
The size α of a test d is the maximum probability of making a Type-I error (falsely rejecting), over all
θ ∈ Θ0 . We can write this in terms of the power-function β(θ) as:
α = sup_{θ∈Θ0} β(θ)
We'd like the size of a test d to be small; we therefore often design tests to control their size (keep it below a certain value). Often we can do this in the asymptotic limit (as n → ∞) even if we do not know the size of the test in finite samples.
5.4.2.2 Power
The power of a test is given by its power function β(θ). We generally want to increase β(θ) among the
θ ∈ Θ1 , to reduce the probability of a Type-II error.
5.4.3 Constructing a hypothesis test
The most common variety of hypothesis test takes the following form: from the data X we compute
some test statistic, call it Tn . Then we compare Tn to some critical value c, and choose to reject the
null-hypothesis if and only if |Tn | exceeds the critical value (a so-called two-sided test), or alternatively
if Tn exceeds the critical value (a so-called one-sided test).
Tests of this form are usually motivated by knowing the asymptotic distribution of Tn, i.e. Tn →d T
where T has some known distribution. Then we can control the size of our test by choosing c to be such
that P(T ≤ c) ≥ 1 − α. We then maximize power subject to this constraint on size by choosing c to be exactly the 1 − α quantile of T (and no lower), so that P(T ≤ c) = 1 − α.
Example: Let us close by illustrating some of the concepts of this section with an example. Suppose our
statistical model is that Xi ∼ N (θ0 , 1), i.e. a normal random variable with unit variance but unknown
mean θ0. We wish to test whether H0 : θ0 = 0, that is: Θ0 = {0} and Θ1 = R \ {0}. Let our test statistic be √n times the sample mean, Tn = √n · X̄n. Given our model, the sample mean has the exact distribution
X̄n ∼ N (θ0 , 1/n) for any n, and hence Tn ∼ N (θ0 , 1). Under the null, Tn is a standard normal (since
then θ0 = 0) and hence for a two-sided test we can choose our critical value c to be the 1 − α/2 quantile
of the standard normal distribution (then P(|Tn| > c) = P(Tn < −c) + P(Tn > c) = α/2 + α/2 = α). Note that the power function β(θ) of this test is the probability that a N(θ, 1) random variable has absolute value greater than c, which is equal to 1 − Φ(c − θ) + Φ(−c − θ), where Φ denotes the standard normal CDF.
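A quick numerical check of this example (a sketch with α = 0.05 chosen arbitrarily): the critical value is the 0.975 quantile of the standard normal, the power at θ = 0 equals the size, and the power rises as θ moves away from the null.

import numpy as np
from scipy.stats import norm

alpha = 0.05
c = norm.ppf(1 - alpha / 2)                      # two-sided critical value, about 1.96

def power(theta):
    # P(|T_n| > c) when T_n ~ N(theta, 1)
    return 1 - norm.cdf(c - theta) + norm.cdf(-c - theta)

print(power(0.0))                                # equals the size alpha = 0.05 under the null
print(power(1.0), power(2.0))                    # power increases as theta moves away from zero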
Example: Suppose we apply this principle to the example in Section 5.4.3 in which Xi ∼ N (θ0 , 1). There
we constructed a test for the null hypothesis that θ0 = 0, but now we need to consider
more general hypotheses of the form H0 : θ0 = θ. If we revise our test statistic to be Tn(θ) = √n · (X̄n − θ), then under the null hypothesis θ0 = θ, Tn(θ) again has a standard normal distribution, and thus our critical value c is unchanged from the θ = 0 case. A 1 − α confidence interval would thus be:
CI_{1−α} = {θ ∈ R : |Tn(θ)| ≤ c} = [X̄n − c/√n, X̄n + c/√n]
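A minimal sketch of computing this interval on simulated data (the true θ0 and the sample size below are made up):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
theta0, n = 0.3, 400                             # made-up true parameter and sample size
X = rng.normal(theta0, 1.0, size=n)

c = norm.ppf(0.975)                              # 1 - alpha/2 quantile for alpha = 0.05
ci = (X.mean() - c / np.sqrt(n), X.mean() + c / np.sqrt(n))
print(ci, ci[0] <= theta0 <= ci[1])              # the interval covers theta0 in roughly 95% of samples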
Chapter 6
Most interesting questions in social science concern causality. We aren’t just interested in observing
what happens in the social world, but why and how. And we’re usually interested in what changes to
policy or behavior could lead to changes that we might deem desirable.
These types of questions concern causality. The meaning of the term “causal” is a long-standing
philosophical question; see Lewis (1973) for a fairly modern treatment that will accord with our approach
in this class. We will take a very simple perspective: A causes B if B would be different if A were different.
For example, on a day in which rain was forecast and I took my umbrella to school, we might say that
the rain forecast caused me to bring my umbrella, if I wouldn’t have taken the umbrella, absent the
forecast for rain. We of course can’t directly observe what would have happened if the forecast had been
different; we call this a counterfactual.
Definition 6.2. An individual’s treatment effect is defined as: ∆i = Yi (1) − Yi (0), the difference
between their treated and untreated potential outcomes.
In the above example, the treatment effects ∆i are $8 an hour for Person A, $15 an hour for Person B, $12 an hour for Person C, and −$90 an hour for Person D. On average, treatment effects are positive, although Person D's individual treatment effect is negative.
The leap of faith that you need to take with potential outcomes is to believe that there exists a value Yi(0) and Yi(1) for each individual, regardless of whether they actually went to college. If i does graduate college (i.e. Di = 1), then their actual earnings Yi will be Yi = Yi(1). Similarly, if they don't go to college (Di = 0), then their earnings will be Yi = Yi(0). Another way of writing this is that, for each i:
Yi = Di · Yi (1) + (1 − Di ) · Yi (0)
Notice that since Di ∈ {0, 1}, there is always one of the above terms that is equal to zero, and the other
term gives us the appropriate potential outcome.
In the above example, suppose that Persons A and D do go to college and graduate, while B and C do not. Then if we measure the earnings and college-graduation status of each of the four individuals, our data will be {(Yi, Di)}i=1,2,3,4 = {($18, 1), ($25, 0), ($0, 0), ($60, 1)}.
What can we say about treatment effects, given this data? Consider for example individual D, who
in reality missed their opportunity to start the business and earn $150 an hour. This is a counterfactual,
something that would have happened if the world were different. Since we can’t observe what would
have happened, we’ll never be able to answer the question of what person D’s value of ∆i is, empirically.
Definition 6.3. The fundamental problem of causal inference is that for a given i, we only observe
one of the two potential outcomes: either Yi (1) if Di = 1, or Yi (0) if Di = 0. In other words, we only
observe i’s realized value Yi = Yi (Di ), and not their other potential outcome.
The fundamental problem of causal inference means that we have a problem of identification, in the language of Section 5.2. Suppose our parameter of interest is ∆i = Yi(1) − Yi(0) for some particular individual i. Suppose that they did graduate from college, so Di = 1. If we can't observe Yi(0), we can't identify their treatment effect. However, we'll see that we can still sometimes make statements about average treatment effects, by using other students who didn't go to college as a comparison group.
6.2 The difference in means estimator and selection bias
The difference-in-means estimator takes the difference in the sample average of the outcome variable
among the “treatment group” Di = 1 and “control group” Di = 0:
θ̂DM = (1/N1) · Σ_{i:Di=1} Yi − (1/N0) · Σ_{i:Di=0} Yi
where N1 is the number of individuals i in the sample such that Di = 1, and N0 is the number of individuals in the sample such that Di = 0. Of course, the total sample size is n = N0 + N1.
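The estimator is a one-liner in practice. A minimal Python sketch on simulated data (the numbers are made up, and Di is randomly assigned in this simulation, so the comparison is a fair one):

import numpy as np

rng = np.random.default_rng(7)
n = 10_000
D = rng.integers(0, 2, size=n)                   # made-up treatment indicator, assigned at random
Y = 20 + 5 * D + rng.normal(0, 3, size=n)        # made-up outcomes with a built-in effect of 5

theta_dm = Y[D == 1].mean() - Y[D == 0].mean()   # (1/N1) sum over treated minus (1/N0) sum over control
print(theta_dm)                                   # close to 5 because D is randomly assigned here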
We know from the results of Section 4.6 (and the midterm) that for large samples θ̂DM converges in probability to its population counterpart:

θDM := E[Yi | Di = 1] − E[Yi | Di = 0]
Suppose that in our data, we observe that Yi and Di are positively correlated, so that θ̂DM ≥ 0 (as you showed in a homework problem). This suggests that E[Yi|Di = 1] ≥ E[Yi|Di = 0]. Can we conclude from our data that going to college causes one's earnings at age 30 to be higher?
We know from Definition 6.3 that for any individual for whom Di = 1, our observed Yi is Yi = Yi(1). Similarly, for any individual who doesn't go to college, Yi = Yi(0). Thus, we can rewrite the estimand of our difference-in-means estimator as:

θDM = E[Yi(1) | Di = 1] − E[Yi(0) | Di = 0]     (6.2)
Notice that the first term in Eq. (6.2) conditions on the event Di = 1, and the second term conditions
on Di = 0. This means that the difference-in-means estimand compares two different groups, which
might not be comparable to one another. For example, students who go to college might have higher
SAT scores than students who do not.
Suppose for the moment that the second term in Eq. (6.2) also conditioned on the event Di = 1
(rather than Di = 0). If this were the case, then we could use linearity of the expectation to rewrite θ
as being equal to E[Yi (1) − Yi (0)|Di = 1], the average treatment effect ∆i among students who do go to
college. We call this the average treatment effect on the treated, or AT T . The ATT is a causal parameter
of interest, because it compares the values of Yi (1) and Yi (0), on average, for the same group.
Note that by adding and subtracting E[Yi (0)|Di = 1] to equation (6.2), we can write:
θDM = {E[Yi(1) − Yi(0) | Di = 1]} + {E[Yi(0) | Di = 1] − E[Yi(0) | Di = 0]}     (6.3)

where the first term in braces is the ATT and the second term is the selection bias.
The parameter of interest ATT is not identified unless the selection bias term E[Yi(0)|Di = 1] − E[Yi(0)|Di = 0] is equal to zero. This term represents a measure of non-comparability between the students who go to college and the students who do not, in terms of their counterfactual earnings Yi(0).
For example, students who obtain a college degree may be more likely to come from family backgrounds in which their parent(s) had time and resources to help the student accumulate skills that are valued by the labor market. As a result, these students would have earned more on average even if they hadn't gone to college, and hence E[Yi(0)|Di = 1] − E[Yi(0)|Di = 0] > 0. Many other stories also lead to a positive correlation between Di and Yi(0): students whose parents are well-connected may be more likely to go to college and to earn more even if they didn't go to college, and any genetic traits that are associated with higher earnings are likely to also increase college attendance.
where AT U := E[Yi (1) − Yi (0)|Di = 0] is the average treatment effect on the untreated, and we’ve used
the law of iterated expectations to decompose ATE into the ATT and ATU. Since the random-assignment
assumption says that treated potential outcomes Yi(1) are also independent of treatment Di, we have not only that E[Yi(0)|Di = 1] = E[Yi(0)|Di = 0], but also that E[Yi(1)|Di = 1] = E[Yi(1)|Di = 0], and thus the ATT, ATU, and ATE are all equal to one another.
In non-experimental settings, one may be able to identify a parameter like the ATT without being
able to identify the ATE. An example of this is the difference-in-differences research design, which (in
its basic, most common form) only yields identification of the ATT and not the ATU or ATE.
Note: Even the above argument that AT T = AT U = AT E = E[Yi |Di = 1] − E[Yi |Di = 0] only
ever makes use of Yi (0) being independent of Di , and Yi (1) being independent of Di . This is still
weaker than the assumption made above, that Yi(0) and Yi(1) are jointly independent of Di. In practice, it's usually hard to come up with an argument for why only the marginal distributions of Yi(1) and Yi(0) would be independent of Di, and not their joint distribution, which is why I've written it the way I have.
Note: Definition 6.4 corresponds to a randomized controlled trial with perfect compliance. In
many real-world trials, the only thing that can be randomized is whether an individual is assigned
to receive treatment. But subjects may still choose whether to actually receive treatment. This
situation is very common in economics settings, see for example the homework problem about
a lottery to migrate to New Zealand. In these cases, one can use the method of instrumental
variables to estimate causal effects, which you’ll see in later courses.
An assumption implicit in our use of random assignment to eliminate selection bias above is that each individual's potential outcomes do not depend on whether other individuals go to college. This is known as the stable unit treatment value assumption, or SUTVA. This is not always a harmless assumption, as it rules out spillover effects.
Selection-on-observables makes the same assumption as random assignment, but we assume it holds conditional on each value Xi = x. It is thus often also called a conditional independence assumption.
Why might selection-on-observables be more reasonable than random assignment? The basic idea is that if we observe a rich enough set of Xi, we might be able to control for confounding factors that lead to selection bias. For example, in the returns-to-college example, we might include in the vector Xi whether or not i's parents graduated from college, their socio-economic status, and i's test scores in high school.
Imagine that we observed literally everything that matters for determining the outcome Yi, in addition to treatment. In this case, we could write potential outcomes as Yi(d) = Y(d, Xi),
where the function Y (d, x) is common to everybody: once we know d and x we can say exactly
what is going to happen to you. Then selection-on-observables would be satisfied automatically,
since if we condition on Xi = x, then Yi (d) = Y (d, x) for either d ∈ {0, 1}. Notice that Y (d, x)
doesn't depend on i: it is no longer random once we've fixed Xi. It is hence uncorrelated with Di, since degenerate random variables are statistically independent of everything! This can be seen as mimicking the logic of a carefully controlled experiment in the natural sciences, in which we make sure "everything else" that matters (Xi) is held fixed, while varying Di between 0 and 1.
A similar logic would apply if Xi includes everything that determines Di : e.g. Di = d(Xi ) for
some function d. Then we'd also get selection-on-observables for free. In practice, apart from very specific settings, we'll never observe everything that determines outcomes Yi, or selection into treatment Di. However, if we can control for most of the obvious threats to eliminating selection bias, we might be willing to think that our Xi get us most of the way there. For a clever and
compelling example of using selection-on-observables, I recommend looking at Dale and Krueger
(2002).
How does the selection-on-observables assumption help us? Note that if it holds, then

E[Yi | Di = 1, Xi = x] − E[Yi | Di = 0, Xi = x] = E[Yi(1) − Yi(0) | Xi = x]     (6.4)
Thus, the average treatment effect, conditional on X is identified by a version of the difference in means
estimand that conditions on any given value x of Xi . Let us denote this parameter as AT E(x). Equa-
tion 6.4 shows that under selection-on-observables, it is identified. Since we also observe the marginal
distribution of Xi , we can then recover for example the overall average treatment effect by integrating
over values of the control variables:
ATE = ∫ E[Yi(1) − Yi(0) | Xi = x] · dF(x) = ∫ ATE(x) · dF(x)
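Here is a minimal sketch of this two-step logic with a single discrete control (a simulated example of my own construction, built so that selection-on-observables holds and the true ATE(x) is 4 for every x): estimate ATE(x) cell by cell, then average over the marginal distribution of Xi.

import numpy as np

rng = np.random.default_rng(8)
n = 100_000
X = rng.integers(0, 3, size=n)                          # discrete control with three groups
p = np.array([0.2, 0.5, 0.8])[X]                        # treatment is more likely in higher-X groups
D = (rng.uniform(size=n) < p).astype(int)
Y = 10 + 2 * X + 4 * D + rng.normal(0, 1, size=n)       # made-up outcomes; true ATE(x) = 4 for all x

ate_x = np.array([Y[(X == x) & (D == 1)].mean() - Y[(X == x) & (D == 0)].mean() for x in range(3)])
p_x = np.array([np.mean(X == x) for x in range(3)])
print(ate_x, ate_x @ p_x)                               # each ATE(x) is near 4, as is the aggregated ATE
# the naive difference in means Y[D==1].mean() - Y[D==0].mean() would be biased upward here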
Exercise: Show that the difference-in-means estimator won’t work in general under selection-on-observables.
Solution: Suppose for the sake of argument that Xi is discrete. Then, by the LIE:

θDM = E[Yi | Di = 1] − E[Yi | Di = 0]
    = E[Yi(1) | Di = 1] − E[Yi(0) | Di = 0]
    = Σ_x P(Xi = x | Di = 1) · E[Yi(1) | Xi = x, Di = 1] − Σ_x P(Xi = x | Di = 0) · E[Yi(0) | Xi = x, Di = 0]

Under selection-on-observables the conditional means simplify to E[Yi(1) | Xi = x] and E[Yi(0) | Xi = x], but the two sums weight the values of x by different distributions (P(Xi = x | Di = 1) versus P(Xi = x | Di = 0)), so in general θDM does not equal the ATE.
Note: When using the selection-on-observables assumption, it is important that the variables in the
vector Xi are unaffected by treatment. That is, if we introduced potential outcomes Xi (0) and Xi (1),
we would have Xi (0) = Xi (1) for all i. To make sure of this, researchers typically consider variables
Xi that are measured earlier in time than treatment Di is assigned. When this condition fails, causal
inference can fail even when the selection-on-observables assumption holds, via a problem often referred
to as “bad-control”.
E[ Di·Yi / P(Di = 1|Xi) − (1 − Di)·Yi / (1 − P(Di = 1|Xi)) ]

  = E[ E[ Di·Yi / P(Di = 1|Xi) − (1 − Di)·Yi / (1 − P(Di = 1|Xi)) | Xi ] ]

  = ∫ E[ Di·Yi / P(Di = 1|Xi) − (1 − Di)·Yi / (1 − P(Di = 1|Xi)) | Xi = x ] · dFX(x)

  = ∫ { E[Di·Yi | Xi = x] / P(Di = 1|Xi = x) − E[(1 − Di)·Yi | Xi = x] / (1 − P(Di = 1|Xi = x)) } · dFX(x)

  = ∫ { E[Di·Yi(1) | Xi = x] / P(Di = 1|Xi = x) − E[(1 − Di)·Yi(0) | Xi = x] / (1 − P(Di = 1|Xi = x)) } · dFX(x)

  = ∫ { P(Di = 1|Xi = x)·E[Yi(1) | Di = 1, Xi = x] / P(Di = 1|Xi = x) − (1 − P(Di = 1|Xi = x))·E[Yi(0) | Di = 0, Xi = x] / (1 − P(Di = 1|Xi = x)) } · dFX(x)

  = ∫ { E[Yi(1) | Xi = x] − E[Yi(0) | Xi = x] } · dFX(x)

where the propensity scores cancel in the second-to-last line and the last equality uses selection-on-observables; by the law of iterated expectations, the final integral equals the ATE, E[Yi(1) − Yi(0)].
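The sample analog of the expression on the first line is often called the inverse propensity weighting (IPW) estimator. A minimal sketch (reusing the kind of simulated design from the previous example and plugging in the true propensity score for simplicity; in practice P(Di = 1|Xi) would itself have to be estimated):

import numpy as np

rng = np.random.default_rng(9)
n = 200_000
X = rng.integers(0, 3, size=n)                          # discrete control
p = np.array([0.2, 0.5, 0.8])[X]                        # true propensity score P(D=1|X), assumed known here
D = (rng.uniform(size=n) < p).astype(int)
Y = 10 + 2 * X + 4 * D + rng.normal(0, 1, size=n)       # made-up outcomes; true ATE = 4

ate_ipw = np.mean(D * Y / p - (1 - D) * Y / (1 - p))    # sample analog of E[ DY/P(D=1|X) - (1-D)Y/(1-P(D=1|X)) ]
print(ate_ipw)                                           # close to 4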
It follows from the selection-on-observables assumption that for any p ∈ (0, 1) we have that:

E[Yi | Di = 1, P(Xi) = p] − E[Yi | Di = 0, P(Xi) = p] = E[Yi(1) − Yi(0) | P(Xi) = p]
where P(x) = P (Di = 1|Xi = x) is the propensity score function introduced above. This
expression says that conditioning on values of the propensity score rather than on Xi itself is
sufficient to estimate causal effects. This is useful because while Xi may have many components,
the propensity score is always a scalar. Thus, we simply need to estimate the function P(x), and
then match units i and i′ such that P(Xi ) ≈ P(Xi′ ), rather than finding a good way to compare
X on all dimensions.
Let d be the number of years of schooling student i completes, and Yi (d) be their earnings at age
30.
Let d be the price of some good, and let the function Yi (d) be the demand function for that good
in market i.
Let d be the high school in Georgia that student i attends, and let Yi (d) be an indicator for whether
they were accepted to UGA, e.g. d ∈ {school A, school B, school C, etc.}.
In a randomized experiment about the effect of social media on mental health, subjects i are
assigned to three different treatments:
d ∈ {no social media, Facebook only, Twitter only, Facebook and Twitter}
Regardless of the setting, we can still define random assignment ((Yi (1), Yi (0)) ⊥ Di ) and selection-on-
observables ({(Yi (1), Yi (0)) ⊥ Di } |Xi ) exactly as we did before.
However, with more than two values of treatment, there are now many different ways to think about
treatment effects. For example, in the first example above, we can think about the effect of finishing
grade 12 as:
Yi (12) − Yi (11),
while the effect of completing high-school versus dropping out after grade 10 is:
Yi (12) − Yi (10)
The overall average causal effect of the last year of schooling that each student actually completes would
be
E[Yi (Di ) − Yi (Di − 1)]
In the first two examples above, the values of treatment Di have a natural order to them. In the
third and fourth examples, treatment is categorical, and there may not be a natural such order. With an
unordered treatment, like in the last example, we might pick one comparison category and consider treat-
ment effects with respect to it, e.g. separately estimating E[Yi (Facebook only) − Yi (no social media)],
E[Yi (Twitter only) − Yi (no social media)] and E[Yi (Facebook and Twitter) − Yi (no social media)].
6.6 Moving beyond average treatment effects*
Although our discussion here has been focused on parameters that average over treatment effects ∆i =
Yi (1) − Yi (0), this isn’t the only type of causal question that we can answer with random-assignment or
selection-on-observables.
Consider a binary treatment Di and random assignment: (Yi (0), Yi (1)) ⊥ Di . Note that we can apply
any function g(·) to the potential outcomes, without destroying independence, i.e. (g(Yi (0)), g(Yi (1))) ⊥
Di . Why is this useful? Consider the function g(t) = 1(t ≤ y) for some value y. Given that random-
assignment implies that the random variable 1(Yi(1) ≤ y) is independent of Di, we have that

E[1(Yi ≤ y) | Di = 1] = E[1(Yi(1) ≤ y)] = FY(1)(y)
The term on the left is the conditional CDF of Yi given Di = 1, which can be computed from the data.
The term on the right is the (unconditional) CDF of the treated potential outcome Yi (1). This expression
shows that we can identify the CDF of Yi (1) at any point y. Collecting over all y, we can thus compute
the entire distribution of Yi (1).
By the same logic, we can also identify the entire distribution of Yi(0), using FY|D=0(y) = E[1(Yi ≤ y) | Di = 0]. That means that we can use random assignment to uncover the effect of treatment on the
entire distribution of outcomes. This lets us answer a new set of causal questions. For instance: what is
the difference between the median value of Yi (1) and the median value of Yi (0)? This is an example of
a so-called quantile-treatment effect.
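A minimal sketch of a quantile treatment effect under random assignment (the lognormal outcomes and the 30% proportional effect below are made up): the empirical distributions of Yi among the treated and the untreated identify the distributions of Yi(1) and Yi(0), so their medians can simply be compared.

import numpy as np

rng = np.random.default_rng(10)
n = 200_000
D = rng.integers(0, 2, size=n)                           # randomly assigned treatment
Y0 = rng.lognormal(mean=3.0, sigma=0.5, size=n)          # made-up untreated potential outcomes
Y1 = 1.3 * Y0                                            # treatment scales outcomes up by 30%
Y = np.where(D == 1, Y1, Y0)                             # observed outcome Y = D*Y(1) + (1-D)*Y(0)

qte_median = np.median(Y[D == 1]) - np.median(Y[D == 0])
print(qte_median, 0.3 * np.exp(3.0))                     # median of Y(0) is exp(3), so the QTE is about 0.3*exp(3)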
A natural question that you might hope to answer is: how many individuals in my population have
a negative treatment effect Yi (1) < Yi (0), versus a positive one? This is a harder type of question,
because it depends on the joint distribution of potential outcomes. By contrast, random assignment
(and similarly selection-on-observables, or quasi-experimental approaches), only let us identify each of
the marginal distributions of Yi (0) and Yi (1), due to the fundamental problem of causal inference.
The situation is not completely hopeless: the marginal distributions of Yi (1) and Yi (0) do put some
restrictions on the distribution of treatment effects. For instance, it can be shown that a lower bound
on the proportion “harmed” by treatment P (∆i ≤ 0) is the supremum of FY (1) (y) − FY (0) (y) over all
values of y (see e.g. Fan and Park, 2010 for details). We can also make additional assumptions that
allow us to say more about the distribution of treatment effects. For example, the strong assumption
of rank-invariance allows us to trace out the entire CDF of ∆i , and in principle estimate the treatment
effect for any given individual (see e.g. Heckman et al., 1997).
Chapter 7
Linear regression
Note on notation: In this section we’ll simplify notation by dropping i subscripts when discussing
population quantities. We’ll add them back in Section 7.5 when we get to estimation. Remember that
with i.i.d. data, it doesn’t matter whether we include the i indices or not, because the distribution of
variables in each observation i is the same as the population distribution.
This requires having a way to estimate conditional expectations of the form E[Y |X = x, D = d] for d = 0
and d = 1. How should we do this?
If X is a discrete random variable, there is a pretty straightforward way we could do this. With i.i.d.
data, a consistent estimator is simply the mean among the sub-sample of data for which Di = d and
Xi = x:
(1 / # of observations i for which Xi = x and Di = d) · Σ_{i: Xi = x & Di = d} Yi  →p  E[Y | X = x, D = d]

where the left-hand side is our estimator Ê[Y | X = x, D = d].
But remember that for the selection-on-observables assumption, we want X to be an extensive enough set of control variables to eliminate selection bias. So how should we proceed when X = (X1, X2, . . . Xk) is a vector of several random variables, some of which may be continuously distributed?
This is actually a hard problem, in practice. Recall that E[Y |X = x, D = d] is a function of x and d,
which in the notation of 1.5.3 we might write as:
E[Y |X = x, D = d] = m(d, x1 , x2 , . . . xk )
where x1 , x2 , . . . xk are the components of the vector x. Provided that (Y, D, X) are all observed, the
function m will be identified (see Section 5.2). That is, for fixed values (x, d) there is only one value of
m(d, x1 , x2 , . . . xk ) compatible with the joint distribution of our observables.
However, estimation is another thing. Given our finite sample, how do we uncover the function
m(d, x1 , x2 , . . . xk )? This turns out to be particularly straightforward when the function m is linear, that
is:
m(d, x1 , x2 , . . . xk ) = β0 + βD d + β1 x1 + β2 x2 + · · · + βk xk (7.1)
for some set of coefficients (βD , β0 , β1 , . . . βk ). In this case note that our parameter of interest, AT E(x) is
simply equal to m(1, x) − m(0, x) = βD . Since this difference yields the same fixed number βD regardless
of x, the conditional-on-X ATE is the same as the overall average treatment effect, so AT E = βD .
7.2 The linear regression model
Given a random variable Y and a random vector X, the linear-regression model says that
Y = X ′β + ϵ (7.2)
where
E[ϵ|X] = 0 (7.3)
We'll refer to the vector β appearing in Eq. (7.2) as the coefficient vector from a regression of Y on X (as a reminder of notation: β′X = Σ_j βj · Xj). The term ϵ is often called an error term or residual.1
Remember from Section 3.1 that a statistical model places some kind of restriction on the joint
distribution of random variables. The key restriction of the linear regression model is that the residual
ϵ has a conditional mean of zero given any realization of X. The linear regression model holds for some
β if and only if the conditional expectation function of Y on X is a linear function of X, that is:

E[Y | X] = X′β     (7.4)
In almost all cases in which we use the linear regression model, one of the components of X is taken to
be non-random and simply equal to one. It thus contributes a constant to the function X ′ β, for example:
Y = β0 + β1 · X1 + · · · + βk · Xk + ϵ (7.5)
where here we have started the numbering at 0, so that β has k + 1 components. In this notation X also
has k + 1 components: X = (1, X1 , . . . Xk )′ . However, to keep notation compact, we’ll often ignore the
distinction between a constant and random elements in X.
To see that E[ϵ|X] = 0 implies E[ϵ · Xj ] = 0 for any j = 1 . . . k, use the law of iterated expectations:
E[ϵ · Xj ] = E {E[ ϵ · Xj | X]} = E {E[ϵ|X] · Xj } = E {0 · Xj } = 0
It’s probably a good idea to stare at this and make sure it makes sense. Conditional on any value
X = x, the component Xj has some fixed value xj . Thus, we can pull it out of the inner expectation, so
that E[ ϵ · Xj | X = x] = E[ϵ|X = x] · xj . Then we take the outer expectation (curly braces) over values x.
Since (7.6) provides a system of k equations in the k unknowns β1 . . . βk , it generally has a unique
solution. A general expression for this solution is:

β = E[XX′]^{−1} E[XY]     (7.7)
We’ll unpack the matrix notation of this equation later, so don’t worry if it’s not familiar to you right
now. The important thing is that there is typically a single vector β that can satisfy all k lines of Eq.
(7.6), and it is given by (7.7) above.
When people talk about “running a regression”, the quantity they are estimating is (7.7), whether or
not the conditional expectation function E[Y |X] is linear in X as the linear regression model assumes.
Thus, rather than Eqs. (7.2) and (7.3) we could have gotten away with introducing β with a so-called linear projection model, which just says that

Y = X′β + ϵ   with   E[X · ϵ] = 0     (7.8)
Whether one starts from Eq. (7.4) or from (7.8), we’re talking about the same β. We’ll call this β, which
has the explicit formula (7.7), the coefficient vector or the linear regression vector.
1 The Hansen textbook reserves the term “residual” for an estimated value of ϵ that arises in the context of the ordinary
least squares estimator. I’ll refer to ϵ above as a residual, and what Hansen calls a residual a “fitted residual” in Sec. 7.5.
We can also write the linear regression vector in a second way: it minimizes the population mean-squared
error between Y and a linear function of the components of X:
β = argmin_{γ∈R^k} E[(Y − X′γ)²]     (7.9)
This says that the value of the β appearing in Eq. (7.2) is exactly the one that minimizes the expectation
of the squared difference between Y and the “regression line” X ′ β implied by β and X. We’ll establish
Eq. (7.9) in Section 7.4.1.
Exercise: Using the fact that ϵ = Y − X ′ β, show that Equation (7.6) is the first-order-condition of the
minimization problem (7.9).
Note: to connect the linear regression model to the discussion of selection-on-observables in Section 7.1,
we simply incorporate D as well as the constant in Eq. (7.1) into the vector X.
There are several motivations for caring about the linear regression vector β. Linear regression will be a useful tool when the RHS of Eq. (7.7) answers or sheds light on some interesting question about the world. We'll now turn to several ways in which it can.
Note that in the linear regression model, the residual ϵ is equal to the deviation of the random variable
Y from its expectation given X; that is:
ϵ := Y − E[Y |X]
Then, by definition Y = E[Y |X] + ϵ. When the CEF takes the linear form of Equation 7.11, we get
Equation 7.2. This is useful for example when we have selection-on-observables and are interested in the
coefficient on treatment D, as in 7.1.
However, even if the assumption of Equation (7.4) that the CEF is linear in X is false, the function X ′ β
will still provide the best linear approximation to the true function m(x) = E[Y |X = x], in the sense of
minimizing the mean squared approximation error:
Proposition 7.1. β = argmin_{γ∈R^k} E[(m(X) − X′γ)²], where β is as defined in Equation 7.7.
This means that even if the CEF is not quite linear, the linear projection coefficient β is the k−component
vector such that x′ β best approximates m(x) as a linear function.
In teal, I've plotted the regression function β0 + β1 · w. We know from Proposition 7.1 that the vector
(β0 , β1 ) finds the best linear approximation to the conditional expectation function, which happens to
be highly non-linear. That is: the slope of E[hours|wage = w] changes a lot with w. Although the
linear approximation is not a particularly good one (the red and teal lines are far from one another
for many values of w), we can see that the slope coefficient β1 appears to “average out” the slope of
E[hours|wage = w]: sometimes the teal line is steeper, and sometimes the red line is steeper.
This is no accident: in a regression with one continuously-distributed X variable and a constant, the coefficient on X always captures a weighted average of the derivative of the CEF with respect to X. In particular, β1 = ∫ m′(x) · w(x) · dx, where m′(x) = (d/dx) E[Y | X = x], and w(x) is a positive function that integrates to one. The weighting function w(x) is proportional to F(x) · (E[X] − E[X | X ≤ x]), where F(x) is the CDF of X. This result was originally derived by Yitzhaki (1996).
This result is nice, but be careful when appealing to it to say that a linear regression coefficient always
captures an average CEF derivative. When there is more than one variable in the regression, say X1 and
X2, the weights that β1 places on ∂m(x1, x2)/∂x1 could be negative, if E[X1 | X2] is not linear in X2. Even
with a single regressor, the averaging introduced by the linear regression vector can be misleading. If
the CEF is sometimes increasing, and sometimes decreasing, we may get a regression coefficient of zero
even in cases where X and Y are closely related.
Proof: This is jumping ahead a bit, but I include a proof here in case you are interested. To see this, note that by the law of iterated expectations, the slope coefficient is:

β1 = Cov(X, Y)/Var(X) = (1/Var(X)) · ( E[Y·X] − E[X]·E[Y] ) = (1/Var(X)) · E[Y·(X − E[X])]
   = (1/Var(X)) · ∫ f(x) · (x − E[X]) · E[Y | X = x] · dx = (1/Var(X)) · ∫ f(x) · (x − E[X]) · m(x) · dx

where m(x) = E[Y | X = x] and f(x) is the density of X. Now we use integration by parts, with u = m(x) and dv = f(x) · (x − E[X]) · dx, to write the integral as m(x)·v(x)|_{−∞}^{∞} − ∫ m′(x)·v(x)·dx, where v(x) = ∫_{−∞}^{x} f(t) · (t − E[X]) · dt. The first term is zero because both v(∞) and v(−∞) are equal to zero (for v(∞) we assume X has a finite second moment). So β1 = ∫ m′(x)·w(x)·dx, where w(x) = −v(x)/Var(X). To see that w(x) integrates to one, substitute Y = X, in which case m′(x) = 1. To see that w(x) ≥ 0, rewrite v(x) = F(x)·E[X | X ≤ x] − E[X]·F(x) = F(x)·(E[X | X ≤ x] − E[X]) and note that E[X | X ≤ x] ≤ E[X] for all x.
Thus, Eq. (7.12) recovers a weighted average of the conditional-on-X average treatment effects. It weights each group in proportion to the conditional variance of D given X. Intuitively, this puts more weight on the groups for which there is a more equal proportion of treatment and control units (note that Var(D|X) = P(D = 1|X) · P(D = 0|X)). It does not recover the average treatment effect ATE = E[Y(1) − Y(0)], which can be thought of as applying weights wj = P(X = xj) (by the law of iterated expectations).
Recall that, by contrast, with a binary treatment we can always get the ATE under selection-on-observables by estimating E[Y | D = 1, X = x] − E[Y | D = 0, X = x], and we don't need to assume that Xi has a group structure for this result. What then is the value of the result in this section? In Section 7.2, we assumed that E[Y | D = 1, X = x] had the linear form of Eq. (7.1). This is a strong assumption; it implies for example that ATE(x) = E[Y(1) − Y(0) | X = x] is the same for all x. The result in this section shows that when X has a group structure, we can still keep X and D separate in the regression, without assuming that ATE(x) is constant in x (i.e. we can get away without including interaction terms between X and D).
the mean squared error, it must satisfy the following k first-order conditions (FOCs), one for each of its components βj for j = 1 . . . k:

∂E[(Y − X′β)²]/∂βj = −E[2·(Y − X′β)·Xj] = 0     (7.14)
Thus we’ve seen that the minimizer of the mean squared error between Y and a linear function of X
must be equal to the regression coefficient vector β. The box at the end of Section 7.4.1 shows that this
also goes in the other direction: the β defined by Equations 7.2 and 7.6 must be the β that solves (7.9).
Note: I’ve assumed in the above that E[(Y − X ′ γ)2 ] is differentiable with respect to γ and that we can
interchange the derivative and the expectation (this requires regularity conditions that allow us to appeal
to the dominated convergence theorem, but we don’t need to worry about these technicalities here).
Proof. We call a symmetric k × k matrix M positive definite if γ′Mγ > 0 for every nonzero γ ∈ R^k. Any positive definite matrix is invertible. I won't prove this here, but for the curious: the eigenvalues of a positive definite matrix are strictly positive, and a matrix is invertible if and only if it does not have zero as an eigenvalue.
Thus, we'll show that H = E[XX′] is positive definite in order to establish that it is invertible. For any γ ∈ R^k, note that γ′Hγ = E[γ′XX′γ] = E[‖X′γ‖²], where for any scalar or vector x we let ‖x‖² denote its squared Euclidean norm ‖x‖² = x′x = Σ_{j}(xj)². Since ‖X′γ‖² ≥ 0 for any realization of X, it follows that γ′Hγ > 0 if and only if, with some positive probability, ‖X′γ‖² is strictly greater than zero. This occurs whenever P(X′γ ≠ 0) > 0.
Proposition 7.2 thus says that E[XX′] is invertible provided there exists no nonzero value γ that makes X′γ equal to zero with probability one (remember that X here is a random vector). When there is such a γ, we say that there is perfect multicollinearity among our regressors X = (X1, X2, . . . Xk).
Definition. We say that there is perfect multicollinearity among our regressors (in the population) if there exists some nonzero γ ∈ R^k such that P(X′γ = 0) = 1.
Example: Suppose that our regression includes a constant X1 = 1, a binary variable indicating that a
given individual is married: X2 = married, and a second binary variable X3 that indicates that a given
individual is not married. Then, since X = (1, married, 1 − married)′ , we have that X ′ (−1, 1, 1) = 0
for all realizations of X. Thus, we have perfect multicollinearity: X ′ γ = 0 regardless of the value of
married and hence with probability one.
Note: Since we're talking about the population in this section (rather than a sample), the definition above refers to perfect multicollinearity in the population. In Section 7.5, we'll use a sample analog of this definition, which will be required for us to define the OLS estimator of β.
Now let’s see how the absence of perfect multicollinearity, which by Proposition 7.2 we know is equivalent
to E[XX ′ ] being invertible, implies that β exists and is unique.
Proposition 7.3. If there is no perfect multicollinearity, then there exists a β ∈ Rk that satisfies
Equations 7.2 and 7.6, and it is unique.
Proof. We can combine Equations 7.2 and 7.6 into a single matrix equation, which is equivalent to the
system of FOCs 7.14:
E[X(Y − X ′ β)] = 0
where 0 denotes a vector of k zeroes 0 = (0, 0, . . . 0)′ . This equation is the same as
E[XX ′ ]β = E[XY ]
Since E[XX ′ ] is invertible by Proposition 7.2, we can multiply both sides of the above equation by
E[XX ′ ]−1 (see box below) to obtain:
β = E[XX ′ ]−1 E[XY ]
Of course, we can only do this if the matrix E[XX ′ ]−1 exists, which is guaranteed by no perfect multi-
collinearity. Note that the above expression for β was given previously in Equation 7.7.
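The sample analog of this formula replaces the population moments E[XX′] and E[XY] with sample averages. A minimal numpy sketch (with made-up coefficients), which solves the normal equations rather than inverting the matrix explicitly:

import numpy as np

rng = np.random.default_rng(11)
n = 5_000
beta_true = np.array([1.0, 2.0, -0.5])                   # made-up intercept and two slopes
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
Y = X @ beta_true + rng.normal(size=n)

# beta_hat = (sample E[XX'])^{-1} (sample E[XY]); equivalently, solve (X'X) beta = X'Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)                                           # close to (1, 2, -0.5)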
We seek a solution x = (x1, x2, . . . xk) that satisfies all of the above equations. Let us gather all of the coefficients into a k × k matrix and call it A:

A = [ a11  a12  . . .  a1k
      a21  a22  . . .  a2k
      . . .
      ak1  ak2  . . .  akk ]
Our system of Equations (7.15) says, in vector notation, that Ax = b, where b = (b1 , b2 , . . . bk )′
is a vector composed of the values appearing on the RHS in Eq. (7.15).
If the matrix A is invertible, this means that there exists a unique matrix A−1 such that AA−1 =
A−1 A = Ik , where Ik is the k × k identity matrix. It has entries of one along the diagonal and
zeros everywhere else:
Ik = [ 1  0  . . .  0
       0  1  . . .  0
       . . .
       0  0  . . .  1 ]
Note that the identity matrix Ik has the property that Ik·λ = λ for any vector λ ∈ R^k.
Thus, if we start with the equation Ax = b and multiply both sides by A−1, we get that A−1Ax = Ik x = A−1b, so x must be equal to A−1b. This value definitely satisfies (7.15), which we can verify by:
A(A−1 b) = (AA−1 )b = Ik b = b
Also, it is the only value of x that satisfies the system (7.15). The solution exists and is unique,
provided that A−1 exists.
Furthermore, one can show that the x solving Ax = b is unique only if A is invertible. A is invertible if and only if there exists no λ ∈ R^k that differs from the zero vector (i.e. it is not all zeros) for which Aλ = 0k (here 0k is a vector composed of k zeros). Thus if A is not invertible, there is such a vector λ. Suppose we have one solution x to Ax = b. Then x + αλ is another solution, for any value of α, because A(x + αλ) = Ax + αAλ = b + 0k = b.
We are now also in a position to see why, when E[XX′]−1 exists, the regression coefficient vector β must minimize the mean-squared error in Equation 7.13. Note that E[(Y − X′γ)²] is a convex function of the vector γ = (γ1, γ2, . . . γk). That implies that any local minimum of E[(Y − X′γ)²] is also a global minimum. Therefore, we'd like to find any values of γ that might represent local minima of the mean-squared error. Sufficient conditions for γ to be a local minimum of the MSE are that: a) γ satisfies the FOCs 7.14 for each j = 1 . . . k; and b) the matrix H composed of components Hjℓ = ∂²E[(Y − X′γ)²]/(∂γj ∂γℓ) is positive definite. The k × k matrix H is called the Hessian and it represents all of the second derivatives of a function. In the case of the MSE function E[(Y − X′γ)²], the Hessian matrix turns out to be equal to 2·E[XX′].
Y = β0 + β1 · X + ϵ (7.16)
where X is a scalar. Note that this is really a k = 2 instance of regression, in which one regressor is a
constant and the other is a random variable. In this case we can derive a simple expression for β0 and
β1 , which do not require matrix notation.
Note that (7.6) provides two equations:
E[ϵ] = 0 (7.17)
E[X · ϵ] = 0 (7.18)
We can take expectations of both sides of 7.16 and use 7.17 to obtain:
E[Y] = β0 + β1 · E[X]
This expression says that the regression line Y = β0 + β1 · X passes through the point (E[X], E[Y ]). If
we plug in the mean value of X, our “predicted value” of Y is the mean of Y . Re-arranging, we have
that β0 = E[Y ] − β1 · E[X].
To make use of 7.18, let us multiply both sides of 7.16 by X and then take expectations. We have:

E[X·Y] = β0 · E[X] + β1 · E[X²]
We've now derived two equations for the two unknowns β0 and β1. If we substitute β0 = E[Y] − β1 · E[X] into the second equation above, we get that

β1 = ( E[X·Y] − E[X]·E[Y] ) / ( E[X²] − E[X]² ) = Cov(X, Y) / Var(X)     (7.19)

β0 = E[Y] − β1 · E[X]     (7.20)
Looking back at Equation 7.19, we see that in the case of simple linear regression β1 is nothing more
than a rescaled version of the covariance between X and Y . When Cov(X, Y ) is positive, β1 will be
positive, and when X and Y are negatively correlated β1 will be negative. Looking at Equation 7.20,
we see that β0 is simply whatever it needs to be so that when we plot the regression line β0 + β1 · x, it
passes through the point (E[X], E[Y ]).
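A quick numerical check of Equations 7.19 and 7.20 on simulated data (the intercept and slope values are made up for illustration):

import numpy as np

rng = np.random.default_rng(12)
X = rng.normal(2.0, 1.0, size=20_000)
Y = 3.0 + 1.5 * X + rng.normal(size=20_000)              # made-up intercept 3 and slope 1.5

beta1 = np.cov(X, Y, ddof=0)[0, 1] / np.var(X)           # Cov(X, Y) / Var(X), Eq. (7.19)
beta0 = Y.mean() - beta1 * X.mean()                      # Eq. (7.20)
print(beta0, beta1)                                       # close to (3.0, 1.5)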
There is also a simpler way to derive Eq. (7.19), which is to start with Eq. 7.16 and take the covariance
of both sides with X. Since the covariance operator is linear, we have that
Cov(X, Y) = Cov(X, β0) + β1 · Cov(X, X) + Cov(X, ϵ)

The first term on the RHS is zero, because β0 is a constant, which has a zero covariance with anything. The last term is zero because Cov(X, ϵ) = E[X·ϵ] − E[X]·E[ϵ], and both of these expectations are zero by 7.18 and 7.17. Using that Cov(X, X) = Var(X), we can rearrange to obtain Eq. (7.19).
Note that Equations 7.19 and 7.20 imply that we can write the residual from simple linear regression as
ϵ = (Y − E[Y]) − ( Cov(X, Y) / Var(X) ) · (X − E[X])     (7.21)
Example: Consider a simple linear regression in which X is a binary variable, for example if
X = 1 indicates that an individual is female and X = 0 otherwise. In this case, recall from the
homework that Cov(X, Y ) = V ar(X) · (E[Y |X = 1] − E[Y |X = 0]), where V ar(X) = P (X =
1) · (1 − P(X = 1)). Thus, β1 = E[Y|X = 1] − E[Y|X = 0]. This means that β0 must be:

β0 = E[Y] − β1 · E[X] = P(X = 1)·E[Y|X = 1] + (1 − P(X = 1))·E[Y|X = 0] − (E[Y|X = 1] − E[Y|X = 0])·P(X = 1) = E[Y|X = 0]
where we’ve used the law of iterated expectations and that E[X] = P (X = 1).
Thus, when we have a single binary regressor, linear regression gives us an intercept β0 = E[Y|X = 0] that recovers the conditional mean of Y given X = 0. Since the slope coefficient is β1 = E[Y|X = 1] − E[Y|X = 0], this means that when X = 1, the regression line passes through β0 + β1 = E[Y|X = 1]. In other words, β0 + β1 · X exactly recovers the CEF E[Y|X]. We'll see that this nice property generalizes in multiple linear regression: whenever X contains indicators for a complete set of groups (or a constant and all but one of the group indicators), regression exactly captures the CEF.
Exercise: Given the above, convince yourself that in a regression of Y on a binary treatment variable D,
the coefficient on D is the difference-in-means estimand θDM that we met in Chapter 6. Under random
assignment, this regression hence yields the ATE.
Exercise: Derive Equations 7.19 and 7.20 from the general formula (7.7) for the regression vector, which
in this case reads β = (β0 , β1 ) = E[(1, X)′ (1, X)]−1 E[(1, X)′ Y ]. For this you will need the formula for
the inverse of a 2 × 2 matrix:

[ a  b ; c  d ]⁻¹ = (1/(ad − bc)) · [ d  −b ; −c  a ]
A hint to get you started: E[(1, X)′ Y ] = (E[X], E[XY ])′ .
Y = β0 + β1 · X1 + β2 · X2 + ϵ (7.22)
Recall that in this case the linear regression model gives us three restrictions: E[ϵ] = 0, E[X1 · ϵ] = 0
and E[X2 · ϵ] = 0.
Consider β2, the coefficient on X2. It turns out that we can write an expression for β2 by imagining
a sequence of two simple linear regressions. In the first step, we imagine running a regression of X2 on
X1 and a constant. We know that in this case, the coefficient on X1 will be Cov(X1 , X2 )/V ar(X1 ), the
constant will be E[X2 ] − Cov(X1 , X2 )/V ar(X1 ) · E[X1 ], and by Equation (7.21) the residual will be
X̃2 = (X2 − E[X2]) − ( Cov(X1, X2) / Var(X1) ) · (X1 − E[X1])     (7.23)
where we use the notation X̃2 for the residual from this regression of X2 on X1 and a constant. Observe
that since X̃2 is a linear function of X1 and X2 , we know that it is uncorrelated with the error ϵ from
the “long-regression” Equation (7.22).
Exercise: Prove the claim above, that Cov(X̃2 , ϵ) = 0, using that E[ϵ] = 0, E[X1 ·ϵ] = 0 and E[X2 ·ϵ] = 0.
Given that Cov(X̃2 , ϵ) = 0, consider running a second simple linear regression, in which we regress Y on
X̃2 and a constant.
Proposition 7.4. The slope coefficient from this second regression is equal to β2 .
To demonstrate that this claim is true, let's begin by substituting Eq. (7.22) into the numerator: the slope in this second regression will be

Cov(X̃2, Y)/Var(X̃2) = Cov(X̃2, β0)/Var(X̃2) + β1 · Cov(X̃2, X1)/Var(X̃2) + β2 · Cov(X̃2, X2)/Var(X̃2) + Cov(X̃2, ϵ)/Var(X̃2)     (7.24)
where the first term is zero since β0 is a constant, and the last term is zero because we’ve seen that
Cov(X̃2 , ϵ) = 0.
However, we also have that Cov(X̃2 , X1 ) = 0, so the second term above also vanishes. How do we
know this? Remember that X̃2 is the residual from a regression of something on X1 and a constant, so
it is uncorrelated with X1 by construction.
Exercise: Convince yourself of the claim above, that Cov(X̃2, X1) = 0. Why does it follow from how we've defined X̃2 that E[X̃2] = 0 and E[X1 · X̃2] = 0?
The last step in establishing Proposition 7.4 is showing that Cov(X̃2 , X2 ) = V ar(X̃2 ). To see this,
substitute our explicit expression (7.23) for X̃2 into the variance. Since E[X̃2 ] = 0, V ar(X̃2 ) = E[(X̃2 )2 ],
which is equal to
" 2 #
Cov(X1 , X2 ) 2
E E E
Cov(X1 , X2 ) Cov(X1 , X2 )
(X2 − [X2 ]) − (X1 − [X1 ]) = V ar(X2 ) − 2 · Cov(X1 , X2 ) + · V ar(X1 )
V ar(X1 ) V ar(X1 ) V ar(X1 )
Cov(X1 , X2 )
= V ar(X2 ) − · Cov(X1 , X2 ) = Cov(X̃2 , X2 )
V ar(X1 )
A more direct way to conclude that Cov(X̃2 , X2 ) = V ar(X̃2 ) is to notice that this statement is equivalent
to Cov(X̃2 , (X2 − X̃2 )) = 0, where (X2 − X̃2 ) is the regression line from a regression of X2 on X1 and a
constant. That this is uncorrelated with the error X̃2 from this same regression follows then by definition.
Since the labeling of X1 and X2 is arbitrary, Proposition 7.4 also gives us a way to express the coefficient
β1 : first, run a regression of X1 on X2 , and then regress Y on the residuals X̃1 from this initial regression.
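Propositions like this are easy to verify numerically. Here is a minimal sketch (with made-up coefficients and correlated regressors) checking that the coefficient on X2 from the full regression equals the slope from regressing Y on the residualized X̃2:

import numpy as np

rng = np.random.default_rng(13)
n = 50_000
X1 = rng.normal(size=n)
X2 = 0.6 * X1 + rng.normal(size=n)                       # X1 and X2 are correlated by construction
Y = 1.0 + 2.0 * X1 - 3.0 * X2 + rng.normal(size=n)       # made-up coefficients (1, 2, -3)

# full regression of Y on (1, X1, X2)
Xmat = np.column_stack([np.ones(n), X1, X2])
beta_full = np.linalg.solve(Xmat.T @ Xmat, Xmat.T @ Y)

# step 1: residualize X2 with respect to (1, X1); step 2: regress Y on that residual
b = np.cov(X1, X2, ddof=0)[0, 1] / np.var(X1)
X2_tilde = (X2 - X2.mean()) - b * (X1 - X1.mean())
beta2_anatomy = np.cov(X2_tilde, Y, ddof=0)[0, 1] / np.var(X2_tilde)
print(beta_full[2], beta2_anatomy)                        # both close to -3.0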
Y = β0 + β1 · X1 + β2 · X2 + . . . βk · Xk + ϵ (7.25)
Note first that since one of our regressors is a constant, the system of equations (7.6) implies that
E[ϵ] = 0. Then the remainder of the equations in (7.6) can be read as saying that each Xj is uncorrelated
with the error, since Cov(Xj , ϵ) = E[Xj · ϵ] − E[Xj ] · E[ϵ] = 0.
Proposition 7.5 (“regression anatomy” formula). The coefficient on Xj in regression (7.25) is

βj = Cov(X̃j , Y )/V ar(X̃j )

where X̃j is the residual from a regression of Xj on all of the other regressors and a constant.
The text Mostly Harmless Econometrics refers to Proposition 7.5 as the “regression anatomy” formula
because it allows us to translate the complicated expression for the full vector β = E[XX ′ ]−1 E[XY ] into
a simpler expression for each of the components βj .
Note: We’ll see when we get to estimation in Section 7.5 that Proposition 7.5 has a sample analog,
referred to as the Frisch-Waugh-Lovell theorem. Proposition 7.5 constitutes a “population version” of
this very useful result.
Note: A Corollary to Proposition 7.5 is that we can also write βj as Cov(X̃j , Ỹj )/V ar(X̃j ), where we
define Ỹj to be the residual from a regression of Y on all the regressors aside from Xj , and a constant.
This follows because the difference Y − Ỹj is uncorrelated with X̃j .
Using Proposition 7.5 iteratively to get regression coefficients: Notice that Propo-
sition 7.5 gives us an explicit expression for βj , if we know the residuals X̃j . But if k > 2,
how do we compute the X̃j , which involve running a regression on k − 1 variables (e.g.
X1 , X2 , . . . Xj−1 , Xj+1 , . . . Xk ) and a constant?
If one would like to avoid the matrix expression for β, the answer is to appeal to Proposition 7.5
iteratively. This allows us to build up an expression for βj by running a series of several simple
linear regressions.
For instance, suppose we are interested in βk in regression (7.25). We can obtain it as follows:
1. To get βk , we need the residuals X̃k from a regression of Xk on X1 . . . Xk−1 and a constant.
2. Call the coefficient on Xj from this regression βjk . If we know all the βjk for j = 1 . . . k − 1,
we can calculate X̃k (note that we can pin down the constant β0k from the fact that E[X̃k ] = 0).
3. By Proposition 7.5, we know that we can calculate each βjk if we have the residuals from a
regression of Xj on all regressors except Xj and Xk , and a constant. Call the coefficient on
Xℓ from this regression βℓjk .
4. By Proposition 7.5, we know that we can obtain βℓjk if we know the coefficients from a
regression of Xℓ on all variables except Xℓ , Xj and Xk , and a constant.
5. Continue on in this way, until we have a set of simple linear regressions, for which we can
apply Equations 7.19 and 7.20.
The box above describes an iterative algorithm that would allow us to use Proposition 7.5 to obtain an
expression for each coefficient βj in Equation (7.25). As you can see, it is tedious, and involves running
many simple linear regressions if our original regression contains more than a couple of variables.
That was a mess – let's use matrix notation!
For this reason, we inevitably need to appeal to the general formula β = E[XX ′ ]−1 E[XY ], which in the
context of regression (7.25) says that:
β = (β0 , β1 , β2 , . . . , βk )′ = E[(1, X1 , X2 , . . . Xk )′ (1, X1 , X2 , . . . Xk )]−1 E[(1, X1 , X2 , . . . Xk )′ Y ]

which, written out element-by-element, is

  [ 1         E[X1 ]        E[X2 ]        . . .   E[Xk ]       ]−1   [ E[Y ]        ]
  [ E[X1 ]    E[X12 ]       E[X1 · X2 ]   . . .   E[X1 · Xk ]  ]     [ E[X1 · Y ]   ]
  [ E[X2 ]    E[X1 · X2 ]   E[X22 ]       . . .   E[X2 · Xk ]  ]     [ E[X2 · Y ]   ]     (7.26)
  [ . . .     . . .         . . .         . . .   . . .        ]     [ . . .        ]
  [ E[Xk ]    E[Xk · X1 ]   E[Xk · X2 ]   . . .   E[Xk2 ]      ]     [ E[Xk · Y ]   ]
Proposition 7.5 can be derived from Equation 7.26, but it’s not exactly pretty. One way to make this
work is to apply the block matrix inversion formula over and over again, mirroring the iterative approach
in the box above. We would eventually end up with a large number of 2 × 2 matrix inverse problems,
which are each easy to deal with (see the exercise at the end of Section 7.4.2). But life’s too short for
that! Thankfully computers are a great help in computing our matrix inverses in practice.
7.5.1 Sample
To define the OLS estimator β̂ we suppose that we have a sample (Yi , X1i , X2i , . . . Xki ) of Y and some
set of regressors X1 to Xk . Let n be the number of observations in our sample. Note: we will later
assume that our sample is i.i.d, but we don’t need to use that fact right now.
The OLS estimator is defined as the solution to the least-squares problem

β̂ := arg min over γ ∈ Rk of Σ_{i=1}^{n} (Yi − Xi′ γ)2     (7.27)

Given the OLS estimator β̂, let us make the following definitions:

The fitted value Ŷi for observation i is Ŷi = Xi′ β̂ = Σ_{j=1}^{k} β̂j · Xji

The fitted residual for observation i is ϵ̂i := Yi − Ŷi = Yi − Xi′ β̂

Note that for each i, we have that Yi = Ŷi + ϵ̂i (by definition)

Equation (7.27) explains the origin of the name “ordinary least squares”, as β̂ is defined as the value of
γ that minimizes the sample sum of squares.
What is the solution to the minimization problem (7.27)? Taking the first-order-condition with respect
to each γj , we obtain the following system of equations:
(1/n) Σ_{i=1}^{n} X1i · (Yi − Xi′ β̂) = (1/n) Σ_{i=1}^{n} X1i · ϵ̂i = 0
(1/n) Σ_{i=1}^{n} X2i · (Yi − Xi′ β̂) = (1/n) Σ_{i=1}^{n} X2i · ϵ̂i = 0
. . .
(1/n) Σ_{i=1}^{n} Xki · (Yi − Xi′ β̂) = (1/n) Σ_{i=1}^{n} Xki · ϵ̂i = 0     (7.28)

where recall that Xi = (X1i , X2i , . . . Xki )′ is a vector and Yi is a scalar for each i. Stacking the k equations in (7.28), we can write the system as

((1/n) Σ_{i=1}^{n} Xi Xi′ ) β̂ = (1/n) Σ_{i=1}^{n} Xi Yi     (7.29)

Since Xi is k × 1 and Xi′ is 1 × k, Xi Xi′ is a k × k matrix. In Equation 7.29 we've used that by the distributive property of
matrix multiplication, we can sum over the observations i and then multiply by β̂, which is equivalent
to multiplying and then summing the k × 1 vector Xi Xi′ β̂ over observations.
“Physics intuition”: A nice way to visualize what OLS does (in the case of a single regressor
plus a constant) is to imagine a set of n springs. One end of each spring is attached to a data
point (Yi , Xi ), and the other end is attached to a rigid rod, which will represent the regression
line β̂0 + β̂1 X. The springs want to be as short as possible, and are constrained to move only in
the vertical direction. The length of spring i is equal to the fitted residual ϵ̂i . An approximation
known as Hooke's Law says that the potential energy stored in a spring is proportional to the
square of its length. Thus, if the springs are identical to one another, the total potential
energy will be equal to the sum of squares of the ϵ̂i . Classical physics (i.e. “Newton's laws”) tells
us that the equilibrium position of the rod will minimize the potential energy stored across all of
the springs: exactly the minimization problem that OLS solves! The following link provides an
illustration of this: it generates a random dataset, and shows the rod coming to rest in exactly
the position of the OLS regression line: https://sam.zhang.fyi/html/fullscreen/springs/.
Here is another nice visualization (not animated though): https://joshualoftus.com/posts/
2020-11-23-least-squares-as-springs/
Similarly, we define an n × 1 vector of our observations of Y :

Y := (Y1 , Y2 , . . . , Yn )′
In this notation, we can rewrite the matrix (1/n) Σ_{i=1}^{n} Xi Xi′ as (1/n) X′ X. We can then write 7.29 in the
compact form:

(X′ X)β̂ = X′ Y     (7.30)

where we've multiplied both sides by n.
Exercise: Use the definition of matrix multiplication to verify Equation (7.30) from Equation (7.29).
That is, show that the j th component of X′ Xβ̂ is equal to the j th component of (Σ_{i=1}^{n} Xi Xi′ )β̂, and
similarly that the j th component of X′ Y is equal to Σ_{i=1}^{n} Yi Xji , for all j = 1 . . . k.
Note: For a general matrix M , it is conventional to let Mij denote the entry on row i, column j. In
our notation above, we actually have the opposite order of indices, because we’ve let Xji denote the ith
observation of variable Xj , which is the entry in the j th column and ith row of X. To avoid getting
confused, always remain mindful that your matrix expressions are conformable, e.g. if you’re multiplying
X by a vector v to obtain Xv, then you know that v must have k components, since X has k columns.
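Note: the following is a minimal Python sketch of computing β̂ by solving Equation (7.30) directly; the design matrix and coefficients are simulated purely for illustration.

import numpy as np

rng = np.random.default_rng(2)
n, k = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])   # n x k design matrix (first column is the constant)
beta_true = np.array([1.0, 0.5, -2.0])
Y = X @ beta_true + rng.normal(size=n)

# Solve (X'X) beta_hat = X'Y rather than forming the inverse explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
residuals = Y - X @ beta_hat

# The first-order conditions (7.28): X' residuals should be (numerically) zero
print(beta_hat, X.T @ residuals)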
Proposition 7.6. Provided that n > k, the matrix X′ X is invertible if none of the columns of X can be
written as a linear combination of the others. That is: Xγ ̸= 0 for all nonzero γ ∈ Rk .
This condition can be referred to as no perfect multicollinearity in the sample. When it holds, we obtain
an explicit expression for the OLS estimator β̂:

β̂ = (X′ X)−1 X′ Y     (7.31)
Recall from Proposition 7.2 that we have no perfect multicollinearity in the population if
P (Xi′ γ = 0) < 1 for all nonzero γ ∈ Rk . It's not impossible that this could hold in the population,
and yet we still end up with perfect multicollinearity in the sample. Technically, this is a “knife-edge” case
that would happen with probability zero if the Xi are continuously distributed. But in practice
β̂ will be defined but numerically unstable if X′ X is close to being non-invertible. This is what
statistical software sometimes complains about when it says that X is “highly singular”.
The most common case in which perfect multicollinearity occurs is when you forget that some of
your regressors are related to one another definitionally. For instance, recall the example from
Section 7.4.1, in which our regression includes a constant X1i = 1, a binary variable indicating that
a given individual i is married: X2i = marriedi , and a second binary variable X3i that indicates
that a given individual is not married. Then, since Xi = (1, marriedi , 1 − marriedi )′ , we have
that Xi′ (−1, 1, 1)′ = 0 for each i in the sample, and hence X(−1, 1, 1)′ = 0, an n × 1 vector of zeroes. In this case we
have perfect multicollinearity both in the sample and in the population, since P (Xi′ (−1, 1, 1)′ = 0) = 1.
If you try to compute the OLS estimator for this regression in statistical software like Stata, it
will arbitrarily choose one of the variables, drop it from the regression equation, and give you
a warning.
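Note: here is a small Python sketch of the dummy-variable trap just described (the "married" dummy is simulated for illustration): with a constant, a married dummy and a not-married dummy, X′ X is singular and the naive OLS formula breaks down.

import numpy as np

rng = np.random.default_rng(3)
n = 1000
married = rng.integers(0, 2, n)
X = np.column_stack([np.ones(n), married, 1 - married])   # second and third columns sum to the constant

print(np.linalg.matrix_rank(X.T @ X))   # rank 2 < k = 3, so X'X is not invertible
try:
    np.linalg.solve(X.T @ X, X.T @ rng.normal(size=n))
except np.linalg.LinAlgError as e:
    print("singular matrix:", e)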
7.5.5 More matrix notation
Following the notation we've developed to define the OLS estimator, we can also define n × 1 vectors of
the fitted values Ŷi , the fitted residuals ϵ̂i , and the population residuals ϵi :

ê := (ϵ̂1 , ϵ̂2 , . . . , ϵ̂n )′     Ŷ := (Ŷ1 , Ŷ2 , . . . , Ŷn )′     ϵ := (ϵ1 , ϵ2 , . . . , ϵn )′
While ê and Ŷ are built with estimates from the data, note that ϵ is not observable. However, under
the assumption that the regression model Yi = Xi′ β + ϵi holds for each i, we have that
Y = Xβ + ϵ (7.32)
1. Recall from Eq. (7.31) that in matrix notation β̂ = (X′ X)−1 X′ Y, and note that Y can be decomposed as
Y = Xβ̂ + ê.     (7.33)
2. Define the n × n matrix P := X(X′ X)−1 X′ . Following Hansen, we'll call this the projection matrix.
3. Define the n × n matrix M := In − P = In − X(X′ X)−1 X′ . Following Hansen, we'll call this the
annihilator matrix.
Exercise: Check that both P and M are idempotent. We say that an n × n matrix A is idempotent
when AA = A.
Exercise: Show that both P and M are symmetric. We say that an n × n matrix A is symmetric when
it is equal to its transpose: A′ = A.
Exercise: Use the definitions of the projector and annihilator matrices to verify that
Xβ̂ = PY and ê = MY
Consider any vector v in Rn . Since In v = v (check this if you haven’t seen it before), and P + M = In ,
it follows that any vector v can be written as v = Pv + Mv. Applying this to the vector Y composed of
observations of our outcome variable, we obtain Equation (7.33).
Geometrically, Equation (7.33) provides a decomposition of Y into the part of Rn that is spanned by
the columns of the design matrix X (this is Xβ̂), and the part that is orthogonal to it (this is ê).
In what sense does P “project” the vector Y? Algebraically, we think of projection as something that
throws away some information in Y in such a way that after we “project” once, projecting again wouldn’t
change anything. For example, projecting a two-dimensional vector (v1 , v2 ) onto the x-axis means re-
placing v2 with zero, but keeping the first value: (v1 , 0). If we then projected (v1 , 0) onto the x-axis
again, we'd just get (v1 , 0) again. In this case projection corresponds to the matrix [[1, 0], [0, 0]].
This is also what happens with regression, except that our projection operator is somewhat more com-
plicated. If we regress Y on X, and then run the regression a second time using the fitted values Ŷ on
the left hand side, this second regression will result in the exact same vector of estimates. This is because
PP = P, and hence PŶ = PPY = PY = Ŷ (see exercise above). A similar argument holds for the
fitted residuals, given that MM = M.
We call M the “annihilator” matrix because it is constructed in such a way that it “annihilates” the
matrix X upon multiplication:

MX = (In − X(X′ X)−1 X′ )X = X − X = 0     (7.34)
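Note: the short Python sketch below (with a simulated design matrix, invented for illustration) checks the properties discussed above numerically: P and M are symmetric and idempotent, and M annihilates X.

import numpy as np

rng = np.random.default_rng(4)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
P = X @ np.linalg.inv(X.T @ X) @ X.T    # projection matrix
M = np.eye(n) - P                        # annihilator matrix

print(np.allclose(P @ P, P), np.allclose(M @ M, M))     # idempotent
print(np.allclose(P, P.T), np.allclose(M, M.T))         # symmetric
print(np.allclose(M @ X, 0))                             # M "annihilates" X, as in Eq. (7.34)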
The diagonal elements of the projection matrix P have a useful interpretation, and are called
leverage values. Let
hii := [P]ii = [X(X′ X)−1 X′ ]ii = Xi′ (X′ X)−1 Xi
Since P is an n × n matrix, there exists a leverage value corresponding to each observation i.
It measures how “extreme” the regressors Xi are for that observation. Note that Xi′ Xi is the
norm of the vector Xi —how “big” it is. hii inserts the k × k matrix (X′ X)−1 inside the dot-
product, which changes what we mean by “big”. In the case of simple linear regression, hii ends
up measuring whether the value of Xi is particularly high or low. Such observations may have a
large influence on the values of the OLS estimate β̂, since a small change in β̂ can translate into
a large change in the fitted residual ϵ̂i for that observation, when Xi is very far from the sample
mean.
The projection matrix notation also gives us a nice way to express the coefficient of determi-
nation, or R2 , which you may have met in previous econometrics classes. The coefficient of
determination can be defined as the ratio of the variance of the fitted estimates Ŷi across the
sample to the variance of the observed outcomes Yi across the sample. Thus it measures the
proportion of the variation in Y that is “explained” by the OLS regression line Ŷi = Xi′ β̂.
For simplicity, suppose that the average value of Yi in the sample is zero, so that the sample
variance of Yi is (1/n) Y′ Y = (1/n) Σ_{i=1}^{n} Yi2 . Note that we could always de-mean our data to satisfy
this assumption without affecting the variance of Yi or R2 (provided that the regression contains
a constant, for the latter claim). In our matrix notation, we could then write R2 as:

R2 = (Ŷ′ Ŷ) / (Y′ Y) = (Y′ PY) / (Y′ Y)

where the second equality uses that Ŷ = PY and that P is symmetric and idempotent.
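Note: a minimal Python sketch of this computation (the data are simulated for illustration), computing R2 as Y′ PY/Y′ Y after de-meaning Y:

import numpy as np

rng = np.random.default_rng(5)
n = 1000
x = rng.normal(size=n)
Y = 2.0 * x + rng.normal(size=n)
Y = Y - Y.mean()                                   # de-mean so the formula above applies
X = np.column_stack([np.ones(n), x])
P = X @ np.linalg.inv(X.T @ X) @ X.T

R2 = (Y @ P @ Y) / (Y @ Y)                         # = Yhat'Yhat / Y'Y
print(R2)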
Analogously, define the matrices X1 and X2 as

X1 := the n × j matrix with rows (X1i , X2i , . . . Xji ) for i = 1, . . . , n
X2 := the n × (k − j) matrix with rows (Xj+1,i , Xj+2,i , . . . Xki ) for i = 1, . . . , n

so that X1 collects the observations of the first j regressors, and X2 those of the last (k − j) regressors.

Define n × n projector matrices P1 = X1 (X′1 X1 )−1 X′1 and P2 = X2 (X′2 X2 )−1 X′2 , and corresponding
annihilator matrices M1 = In − P1 and M2 = In − P2 . Note that by the same logic as Equation
(7.34), the matrix M1 annihilates X1 (that is, M1 X1 = 0, where 0 is an n × j matrix of zeroes), and similarly
M2 X2 = 0, where now 0 is an n × (k − j) matrix of zeroes.
The matrix P1 projects vectors in Rn into the subspace spanned by the columns of X1 , which are the
first j columns of X. The matrix M1 projects vectors in Rn into the subspace orthogonal to the columns
of X1 . Similarly, P2 projects vectors in Rn into the subspace spanned by the columns of X2 , which are
the last (k − j) columns of X.
With the matrices M1 and M2 in hand, we can now give an explicit formula for β̂1 and β̂2 , known
famously as the Frisch-Waugh-Lovell theorem:
Proposition 7.7 (Frisch-Waugh-Lovell theorem).

β̂1 = (X′1 M2 X1 )−1 X′1 M2 Y

and

β̂2 = (X′2 M1 X2 )−1 X′2 M1 Y
Proof. We’ll prove the expression for β̂1 , as the proof for β̂2 is exactly analogous. Note that the
full design matrix X can be written as X = [X1 , X2 ], and
Xβ̂ = [X1 , X2 ](β̂1′ , β̂2′ )′ = X1 β̂1 + X2 β̂2

Thus Y = Xβ̂ + ê = X1 β̂1 + X2 β̂2 + ê. Pre-multiplying both sides by X′1 M2 , and using that
M2 X2 = 0, we get X′1 M2 Y = (X′1 M2 X1 )β̂1 + X′1 M2 ê.
The final step of the proof is to show that X′1 M2 ê = 0, and then multiply each side by
(X′1 M2 X1 )−1 to get the final expression for β̂1 . To see that X′1 M2 ê = 0, recall that ê = MY,
where M is the annihilator matrix for the full design matrix X. Since the vector ê = MY
is orthogonal to all of the columns of X = [X1 , X2 ], it is in particular orthogonal to all of
the columns of X2 , which are just a subset of these. That implies that M2 ê = ê. Now
X′1 M2 ê = X′1 ê = X′1 MY = 0, where in the last step we've used that X′1 M = 0′ .
Now let's see how the Frisch-Waugh-Lovell theorem relates to the “regression anatomy” result Proposition
7.5. Since M2 is idempotent and symmetric, we can write

β̂1 = (X′1 M′2 M2 X1 )−1 X′1 M′2 M2 Y = (X̃′1 X̃1 )−1 X̃′1 Y

where X̃1 := M2 X1 collects the residuals from regressions of each of the first j regressors on Xj+1 , . . . , Xk .
An analogous formula applies for β̂2 , where X̃2 := M1 X2 collects the residuals from regressions of each Xℓ on
X1 , . . . , Xj , for ℓ = j + 1 . . . k. In the special case in which X2 has a single column (e.g. we're
interested only in β̂k ) and we include a constant in the regression (e.g. the constant is among the columns of X1 ), then we get exactly a
sample version of Proposition 7.5.
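Note: the following Python sketch (on invented, simulated data) checks the Frisch-Waugh-Lovell theorem numerically: the coefficient on the last regressor from the full OLS fit equals (X′2 M1 X2 )−1 X′2 M1 Y, where X2 is that single column and X1 collects the constant and the other regressors.

import numpy as np

rng = np.random.default_rng(6)
n = 2000
Z = rng.normal(size=(n, 2))
Xk = 0.5 * Z[:, 0] + rng.normal(size=n)
Y = 1.0 + Z @ np.array([2.0, -1.0]) + 3.0 * Xk + rng.normal(size=n)

X = np.column_stack([np.ones(n), Z, Xk])            # full design matrix
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

X1, X2 = X[:, :3], X[:, 3]                          # partition: (constant, Z) and Xk
M1 = np.eye(n) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
beta_k = (X2 @ M1 @ Y) / (X2 @ M1 @ X2)             # scalar case of Proposition 7.7
print(beta_hat[3], beta_k)                           # should agree (both near 3)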
Let us now consider the OLS estimates β̂0 , β̂1 . . . β̂k . One useful property of β̂ is that the fitted residuals
ϵ̂i have sample mean zero and are uncorrelated in the sample with each regressor; this is just the system of
first-order conditions (7.28).
From the Frisch-Waugh-Lovell theorem, we can obtain an expression for each slope coefficient estimate
β̂j . In particular:

β̂j = Ĉov(ϵ̂j , Y ) / V̂ar(ϵ̂j )     (7.35)

where we let ϵ̂j denote the fitted residuals from a regression of Xj on all the other regressors and a
constant. This is a sample analog of the population residual X̃j from this same regression. Equation (7.35)
provides a “sample analog” to the regression anatomy formula, Proposition 7.5.
The operators Ĉov and V̂ar appearing in Eq. (7.35) are defined as follows. Let A = (A1 , A2 , . . . An )′ and
B = (B1 , B2 , . . . Bn )′ be n × 1 vectors composed of observations of random variables Ai and Bi , respec-
tively. Let Ān = (1/n) Σ_{i=1}^{n} Ai be the sample mean of Ai , and similarly for B̄n . Then, we define:

Ĉov(A, B) = (1/n) Σ_{i=1}^{n} Ai · Bi − Ān · B̄n

and

V̂ar(A) = Ĉov(A, A) = (1/n) Σ_{i=1}^{n} A2i − Ā2n

In the special case of a simple linear regression of Y on a single regressor X and a constant, Eq. (7.35) reduces to

β̂1 = Ĉov(X, Y ) / V̂ar(X)

since in this case the residual from regressing Xi on a constant alone is simply Xi − X̄n , and the sample
covariance and variance in Eq. (7.35) are unchanged by this de-meaning. We can
work out the estimate of the constant β̂0 from the fact that the fitted residuals ϵ̂i satisfy (1/n) Σ_{i=1}^{n} ϵ̂i =
(1/n) Σ_{i=1}^{n} (Yi − β̂0 − β̂1 · Xi ) = 0. β̂0 is thus β̂0 = Ȳn − β̂1 · X̄n .
To see how to obtain Eq. (7.35), suppose we're interested in β̂k , in which case X2 is an n × 1 vector
of observations of Xk and X1 is a matrix of observations of the other regressors and a constant.
Note that since ϵ̂ki is a fitted residual from a regression that includes a constant, (1/n) Σ_{i=1}^{n} ϵ̂ki = 0.
Thus, we wish to show that

β̂k = ((1/n) Σ_{i=1}^{n} ϵ̂ki · Yi ) / ((1/n) Σ_{i=1}^{n} (ϵ̂ki )2 ) = ((1/n) ϵ̂k′ Y) / ((1/n) ϵ̂k′ ϵ̂k ) = (X′2 M1 Y) / (X′2 M1 X2 )

where ϵ̂k := (ϵ̂k1 , ϵ̂k2 , . . . ϵ̂kn )′ = M1 X2 is the vector of the ϵ̂ki across i, and the last equality uses that
M1 is symmetric and idempotent. Since in this case X′2 M1 X2 is a scalar
(recall that we're looking at a single coefficient, and have thus defined X2 to be a vector rather
than a matrix), what we aim to show is β̂k = (X′2 M1 X2 )−1 (X′2 M1 Y). This is exactly
what is given by Proposition 7.7.
We began with a random variable Y and a random vector X, which are related by Y = X ′ β + ϵ in
“the population”. The random vector X can be written X = (X1 , X2 , . . . Xk )′ , where each Xj is a
different regressor. No i subscripts are necessary here.
Then we draw a random sample, where observations are indexed by i = 1 . . . n. Yi is a random
variable reflecting the value of Y in the ith observation, and Xi assembles the value of all regressors
for observation i into a random vector: Xi = (X1i , X2i , . . . Xki )′ .
When discussing the OLS estimator, it is convenient to assemble information across all of the
observations, leading to the n × 1 vector Y and the n × k matrix X.
Consider the following toy dataset, where n = 4 and k = 3. This reflects a realization of the random
matrix X and the random vector Y:
i X1 X2 X3 Y
1 1 4 0 23
2 1 3 1 54
3 1 2 1 21
4 1 6 0 77
The 4 × 3 array of regressor values (the columns labeled X1 , X2 and X3 ) is X in our sample. Its third row,
(1, 2, 1), gives the values of each of the three regressors in the third observation, i.e. Xi′ for i = 3. The final
column is the n × 1 vector Y. Note that our first “regressor” X1 is simply one for each
observation, and contributes a constant to our regression. Regressor X3 is a binary or “dummy” random
variable, taking values of only zero or one for all observations.
We now have the OLS estimator written in vector notation. Studying the statistical properties of β̂ thus begins with the following crucial step:
substitute our equation for Y (Eq. 7.32) into the definition of the estimator (Eq. 7.31):

β̂ = (X′ X)−1 X′ Y
  = (X′ X)−1 X′ (Xβ + ϵ)
  = (X′ X)−1 X′ Xβ + (X′ X)−1 X′ ϵ
  = β + (X′ X)−1 X′ ϵ     (7.36)

Eq. (7.36) is really quite remarkable: it says that regardless of whatever sample we ended up estimating
β̂ from, it is exactly equal to the true population parameter β, plus a second term that depends on the
vector of residuals ϵ and the sample design matrix X.
We’ll now proceed in two steps. First, we’ll study the distribution of β̂ when our sample design matrix
is held fixed. This allows us to establish that conditional on X, the estimator β̂ is unbiased and efficient.
Then, we’ll consider the properties of β̂ as n gets very large.
Keep in mind what we’re doing in this section: we’re asking what the distribution of our estimator β̂ is,
given that the data in our sample was a random draw from an underlying population. This will allow us
to think about questions like: how likely would we be to get an estimate β̂ that is far from β, given that
the sample we use to compute β̂ is random (and thus could have been different than the one we actually
see)?
Yi = Xi′ β + ϵi
7.6.1.1 Unbiasedness
Proposition 7.8. Given Assumption 1, OLS is conditionally unbiased for β; that is: E[β̂|X] = β.
Note that by the law of iterated expectations, this implies that β̂ is also unconditionally unbiased:
E[β̂] = E{E[β̂|X]} = E{β} = β

Proof. Note that by 7.36, E[β̂|X] = β + E[(X′ X)−1 X′ ϵ | X]. Our goal will be to show that this second
term is equal to zero. Since we are conditioning on X, we can take the factor (X′ X)−1 X′ outside of the
expectation: E[(X′ X)−1 X′ ϵ | X] = (X′ X)−1 X′ E[ϵ | X].
Now consider the quantity E [ϵ| X]. This is an n × 1 vector, of which the ith component is E[ϵi |X].
Noting that conditioning on X is the same as conditioning on X1 , X2 , . . . , Xn , we have that
E[ϵi |X1 , . . . , Xn ] = E[ϵi |Xi ] = 0, where the first equality uses that observations are independent across i,
and the second uses Assumption 1. Thus E[ϵ|X] = 0, and hence E[β̂|X] = β.
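Note: here is a small Monte Carlo sketch of Proposition 7.8 in Python (the design, coefficients and error distribution are invented for illustration): averaging β̂ over many simulated samples, holding the design matrix fixed, recovers β.

import numpy as np

rng = np.random.default_rng(7)
n, reps = 100, 5000
beta = np.array([1.0, 0.5])
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # fixed design matrix

estimates = np.empty((reps, 2))
for r in range(reps):
    eps = rng.normal(size=n)                            # E[eps | X] = 0
    Y = X @ beta + eps
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ Y)

print(estimates.mean(axis=0))    # close to (1.0, 0.5)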
7.6.1.2 Efficiency
Given that OLS is (conditionally) unbiased, we know that conditional on X, its mean squared error
as an estimator of β is equal to its variance (recall the bias-variance decomposition of Eq. 5.2). A
well-known result about OLS is that it has the smallest variance among all unbiased estimators of β. To
state this result, we will require an extra assumption, which is that the errors ϵi have the same variance
for each observation:
Definition 7.1. We call the linear regression model (conditionally) homoskedastic if for some value
σ, V ar(ϵi |Xi ) = σ 2 for all i.
Homoskedasticity is a very strong assumption, and won’t hold in practice in most settings. But it has
some use as a simplifying assumption to help understand OLS.
Proposition (Gauss-Markov Theorem). Consider an alternative estimator β̃ of β that is also con-
ditionally unbiased: E[β̃|X] = β. Given Assumption 1 and homoskedasticity, OLS is efficient for β in
the sense that:
V ar(β̃|X) ≥ V ar(β̂|X)
where for two k×k matrices A and B, we say that A ≥ B when the matrix A−B is positive semi-definite.
The above version of the Gauss-Markov Theorem is due to Hansen (2021), which was proved just this
year (2021)! Most textbooks (with the exception of the Hansen book) add the assumption that the
candidate estimator β̃ is a linear function of Y (like OLS is). In this context, the Gauss-Markov theorem
is often described as saying that OLS is B.L.U.E: the best linear unbiased estimator of β. The above
“modern Gauss-Markov” theorem is a stronger result than BLUE: it implies that considering non-linear
estimators of β wouldn't help us reduce the variance of β̃ beyond the bound achieved by OLS.
What is the conditional variance of OLS, V ar(β̂|X), appearing in the Gauss-Markov Theorem?
Note that homoskedasticity along with the i.i.d assumption imply that:

V ar(ϵ|X) = σ 2 · In

Exercise: Definition 7.1 implies that the diagonal elements of V ar(ϵ|X) are equal to σ 2 , but
how do we know that the off-diagonal elements are equal to zero?

The above in turn implies that the conditional variance of the OLS estimator is:

V ar(β̂|X) = (X′ X)−1 X′ V ar(ϵ|X) X(X′ X)−1 = σ 2 · (X′ X)−1

Thus the Gauss-Markov Theorem is typically written as saying that V ar(β̃|X) ≥ σ 2 (X′ X)−1 ,
where the RHS is the conditional variance of the OLS estimator.
What should we make of the Gauss-Markov Theorem given that homoskedasticity is usually not a
reasonable assumption? It turns out that OLS is actually not efficient when homoskedasticity fails
(we call this heteroskedasticity). In that setting, a closely-related estimator known as weighted-
least-squares (WLS) becomes efficient (at least among linear estimators). Efficient WLS minimizes
Σ_{i=1}^{n} wi (Yi − Xi′ β)2 , where the weights wi are chosen to be 1/V ar(ϵi |Xi ). In practice, we don't generally
know V ar(ϵi |Xi ) ex-ante, so a feasible version of WLS requires estimating this quantity. Basically,
the re-weighting of each observation by wi “undoes” heteroskedasticity, so that WLS mimics the
properties that OLS has in the homoskedastic case. Thus you should think of Gauss-Markov as
illustrating an idea that is more general than homoskedasticity, but is simpler to express under
that strong assumption.
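Note: the Python sketch below illustrates (infeasible) WLS with weights wi = 1/V ar(ϵi |Xi ); the variance function and all numbers are invented for illustration, and a feasible version would replace the known variance with an estimate.

import numpy as np

rng = np.random.default_rng(8)
n = 5000
x = rng.uniform(1.0, 3.0, n)
sigma2 = 0.5 * x**2                                   # Var(eps | X) depends on X (made up for illustration)
Y = 1.0 + 2.0 * x + rng.normal(size=n) * np.sqrt(sigma2)
X = np.column_stack([np.ones(n), x])

w = 1.0 / sigma2                                      # WLS weights
XtWX = X.T @ (w[:, None] * X)                         # X' W X
XtWY = X.T @ (w * Y)                                  # X' W Y
beta_wls = np.linalg.solve(XtWX, XtWY)
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_wls, beta_ols)                             # both consistent; WLS has lower variance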
To make claims that involve convergence in probability and convergence in distribution, we will consider
a sequence of estimators β̂, indexed by the sample size n. For each n = 1, . . . , ∞ along the sequence,
we assume that Assumption 2 holds. As a reminder (cf. Chapter 4), in reality sample sizes never actually “grow” to
infinity. In practice, we always have an actual sample that has some actual finite size n. The idea of an
asymptotic sequence exists only to provide an approximation to the sampling distribution of β̂ given our
fixed n, which we will take to be accurate when the sample size is big enough.
7.6.2.1 Consistency
We'll first see that given the asymptotic sequence described above, β̂ →p β. That is, β̂ is a consistent
estimator of β.
Subtracting β from each side of Equation 7.36:

β̂ − β = (X′ X)−1 X′ ϵ = ((1/n) X′ X)−1 · (1/n) X′ ϵ

where in the second equality we've used that the factor of 1/n inside the matrix inverse cancels the one on
X′ ϵ. Now let's consider this latter quantity alone. Expanding out the matrix product:

(1/n) X′ ϵ = (1/n) Σ_{i=1}^{n} Xi ϵi ,

i.e. it is equal to the sample average of the random vector Xi ϵi . To see the above, note that (1/n) X′ ϵ is a
k × 1 vector, whose j th element is equal to the inner product between ϵ and the j th row of X′ . The j th
row of X′ is equal to the j th column of X, which is comprised of the n observations of regressor Xj .

Thus, by the law of large numbers, we have that (1/n) X′ ϵ →p E[Xi ϵi ], provided that each component of
E[Xi ϵi ] is finite. By the linear projection model (Assumption 2), E[Xi ϵi ] = 0, where 0 is a vector of k zeroes.
Similarly, we have by the law of large numbers that

(1/n) X′ X = (1/n) Σ_{i=1}^{n} Xi Xi′ →p E[Xi Xi′ ],
In Chapter 4 we only considered the LLN for random vectors, not random matrices like Xi Xi′ . But since
you can always rewrite an n × m matrix as a vector with n · m elements, the LLN for vectors applies
so long as each element of the matrix E[Xi Xi′ ] is finite. In the box below, I state a set of assumptions,
“regularity conditions”, that ensure we can use the law of large numbers here, and that all expectations
that appear in this section exist.
Given that (1/n) X′ X →p E[Xi Xi′ ], the continuous mapping theorem implies that

((1/n) X′ X)−1 →p E[Xi Xi′ ]−1

That's because for a general invertible matrix M, the matrix inverse function M−1 is a continuous func-
tion of each of the elements of M. Combining the two limits (again by the continuous mapping theorem,
now applied to the matrix product), β̂ − β = ((1/n) X′ X)−1 · (1/n) X′ ϵ →p E[Xi Xi′ ]−1 · 0 = 0.

Thus we have proved that β̂ →p β.

Proposition 7.9. OLS is consistent for β given Assumption 2 and the regularity conditions (Assumption 3) below.
How do we know that the matrix E[Xi Xi′ ] has only finite entries, so that we can use the LLN?
One might simply assume that this is true, but the conventional textbook approach is to state
a set of conditions that are sufficient for E[Xi Xi′ ] to be finite, but are easier or more natural to
state. Technical assumptions of this kind are often referred to as “regularity conditions”. The
Hansen text uses the following:

Assumption 3 (regularity conditions for consistency). Suppose that:
1. E[Yi2 ] is finite
2. E[||Xi ||2 ] is finite
3. We have no perfect multicollinearity in the population: that is, E[Xi Xi′ ] is positive definite.

Let us now see how Assumption 3 gets us what we need to use the LLN to analyze the OLS
estimator. In particular, we'll use item 2, that E[||Xi ||2 ] = E[Σ_{j=1}^{k} Xji2 ] < ∞.
Note that this implies that E[Xji2 ] < ∞ for each j, since these quantities are all positive: we
could never have one of them be infinite but the sum finite.
Now consider a generic element of the matrix E[Xi Xi′ ]. For instance, the element in the j th row,
ℓth column is E[Xji Xℓi ]. To show that this must be finite, we'll use the following very useful
inequality for expectations: the Cauchy-Schwarz inequality, which says that for random variables X and Y ,
|E[X · Y ]|2 ≤ E[X 2 ] · E[Y 2 ].

Applying the Cauchy-Schwarz inequality to the j, ℓ element of E[Xi Xi′ ], we have that

|E[Xji Xℓi ]| ≤ (E[Xji2 ] · E[Xℓi2 ])^{1/2}

Since each of E[Xji2 ] and E[Xℓi2 ] is finite, their product must also be finite, and so is its square
root.
Recall that E[Xi ϵi ] = 0, where 0 is a vector of k zeroes, and that (1/n) X′ ϵ is the sample mean of the random
vector Xi · ϵi . Using the notation of Chapter 4, let's denote this as (Xϵ)n := (1/n) Σ_{i=1}^{n} Xi · ϵi . Then we
can write the above as:

√n (β̂ − β) = ((1/n) X′ X)−1 · √n ((Xϵ)n − E[Xi ϵi ])

The rightmost factor in the above expression has exactly the form that we need to apply the CLT, in
particular:

√n ((Xϵ)n − E[Xi ϵi ]) →d N (0, V ar(Xi ϵi )),

where, since E[Xi ϵi ] = 0, the variance-covariance matrix is V ar(Xi ϵi ) = E[ϵ2i Xi Xi′ ]. Combining this with
the limit of ((1/n) X′ X)−1 (via the Slutsky theorem), we obtain

√n (β̂ − β) →d E[Xi Xi′ ]−1 · N (0, E[ϵ2i Xi Xi′ ])     (7.39)

That is, √n(β̂ − β) converges in distribution to a random vector whose distribution is that of the matrix
E[Xi Xi′ ]−1 times a normal vector with mean zero (for each component) and variance-covariance matrix
E[ϵ2i Xi Xi′ ].
The RHS of (7.39) is thus equal to a linear combination of normal random vectors. Adapting Propo-
sition 3.5, note the following: let X ∼ N (µ, Σ) be a k-component random vector. Then for any k × k
matrix A:

A′ X ∼ N (A′ µ, A′ ΣA)
(this can also be seen as an example of the delta method, applied to a vector-valued function h). Thus,
we can write (7.39) as
√n (β̂ − β) →d N (0, V)     (7.40)
where V := E[Xi Xi′ ]−1 E[ϵ2i Xi Xi′ ]E[Xi Xi′ ]−1 . We refer to V as the asymptotic variance of the OLS
estimator.
Example: Note that in the special case of homoskedasticity studied in Section 7.6.1, we have that
E[ϵ2i |Xi ] = σ 2 for all i, and thus by the law of iterated expectations:

E[ϵ2i Xi Xi′ ] = E[E[ϵ2i |Xi ] Xi Xi′ ] = E[σ 2 Xi Xi′ ] = σ 2 · E[Xi Xi′ ]     (7.41)

so that in this case V simplifies to σ 2 · E[Xi Xi′ ]−1 .
A sufficient condition for us to be able to apply the CLT (see Section 4.4) is that E[(Xi ϵi )′ (Xi ϵi )] =
E[ϵ2i Xi′ Xi ] be finite. This requires finite fourth moments of the data, rather than the finite second
moments assumed to prove consistency of OLS. To see why, note that for any j and ℓ:

E[ϵ2i Xji · Xℓi ] = E[(Yi − Xi′ β)2 Xji · Xℓi ] = E[Yi2 · Xji Xℓi ] − 2E[Xi′ β · Yi · Xji · Xℓi ] + E[β ′ Xi Xi′ β · Xji · Xℓi ]
which can be written out as a sum over expectations that each involve the product of four random
variables. To keep all such terms finite, Hansen assumes the following:
Assumption 4 (regularity conditions for asymptotic normality). Suppose that:
1. E[Yi4 ] is finite
2. E[||Xi ||4 ] is finite
3. We have no perfect multicollinearity in the population: that is, E[Xi Xi′ ] is positive definite.
A natural idea is to estimate E[ϵ2i Xi Xi′ ] by its sample analog. This would
work, but the true residuals ϵi are not observed. However, we can use the fitted residuals ϵ̂i , which are
a function of the observed data, instead. We can write this in matrix form as:

Ω̂ := (1/n) Σ_{i=1}^{n} ϵ̂2i Xi Xi′
One can verify that Ω̂ →p E[ϵ2i Xi Xi′ ]. Thus, we can form a consistent variance estimator as

V̂HC0 := ((1/n) X′ X)−1 Ω̂ ((1/n) X′ X)−1     (7.42)

Eq. (7.42) is referred to as the “HC0” estimator of V, where HC stands for heteroskedasticity consistent.
This name comes from the fact that V̂HC0 does not require the assumption of homoskedasticity (Defi-
nition 7.1) to be a consistent estimator of V.
When you run a command like regress y x, robust in Stata, the default covariance estimator is the
so-called “HC1” estimator of V:
V̂HC1 := (n/(n − k)) ((1/n) X′ X)−1 Ω̂ ((1/n) X′ X)−1     (7.43)

Note that the additional factor n/(n − k) will make very little difference when n is large compared with k,
and will make no difference in the asymptotic limit, since n/(n − k) → 1 as n → ∞. Applying this rescaling
however can be helpful when n is small. It's easiest to understand the justification in the case of ho-
moskedasticity, which is left as an exercise (see box below).
Note: there are further estimators floating around, with names HC2, HC3, and HC4. These apply fur-
ther modifications to V̂HC0 (see the Hansen text for details). Other variance estimators exist for certain
violations of the i.i.d sampling assumption, including cluster-robust variance estimators for clustered
sampling and autocorrelation-consistent estimators for serially correlated panel data.
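Note: the following Python sketch computes the HC0 and HC1 estimators from Eqs. (7.42)-(7.43) on simulated data (the heteroskedastic data-generating process is invented for illustration).

import numpy as np

rng = np.random.default_rng(9)
n, k = 1000, 2
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
Y = 1.0 + 0.5 * x + rng.normal(size=n) * (1 + np.abs(x))   # heteroskedastic errors

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
ehat = Y - X @ beta_hat

Q = (X.T @ X) / n                                           # (1/n) X'X
Omega = (X * (ehat**2)[:, None]).T @ X / n                  # (1/n) sum of ehat_i^2 X_i X_i'
Qinv = np.linalg.inv(Q)
V_hc0 = Qinv @ Omega @ Qinv                                 # Eq. (7.42)
V_hc1 = (n / (n - k)) * V_hc0                               # Eq. (7.43)

se = np.sqrt(np.diag(V_hc1) / n)                            # robust standard errors for beta_hat
print(beta_hat, se)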
If we had reason to believe that homoskedasticity holds, then it is not necessary to use the matrix
Ω̂ when estimating V. The alternative estimator given below will perform better provided that
the assumption of homoskedasticity is true. But, it will be inconsistent if not.
Recall from Eq. 7.41 that under homoskedasticity E[ϵ2i Xi Xi′ ] = σ 2 · E[Xi Xi′ ], where σ 2 =
E[ϵ2i |Xi ] = E[ϵ2i ]. Thus, we just need a consistent estimator of E[ϵ2i ], which we can then multiply
by ((1/n) X′ X)−1 . The standard estimator of E[ϵ2i ] is denoted by Hansen as s2 , where

s2 = (1/(n − k)) Σ_{i=1}^{n} ϵ̂2i

Thus an estimator of V that is valid under the assumption of homoskedasticity is s2 · ((1/n) X′ X)−1 .
This is what Stata computes by default, if you don't include , robust at the end of your
regression command.

Exercise: Why use the estimator s2 as written, rather than the simpler expression (1/n) Σ_{i=1}^{n} ϵ̂2i ?
The reason is that dividing by n − k rather than n makes s2 an unbiased estimator of E[ϵ2i ].
Derive the bias of the estimator (1/n) Σ_{i=1}^{n} ϵ̂2i , and show that s2 is unbiased.
The matrix square root. One can show that the matrix V = E[Xi Xi′ ]−1 E[ϵ2i Xi Xi′ ]E[Xi Xi′ ]−1
is invertible, symmetric and positive definite. A property from linear algebra is that symmetric
positive definite matrices M have a “matrix square root”, which is a unique matrix M 1/2 such
that M = M 1/2 M 1/2 . Thus, we can let V−1/2 denote the inverse of the matrix square root of V.
Let V̂−1/2 denote the analogous matrix for our estimator V̂: for example, V̂HC0 −1/2 is the inverse of the
matrix square root of the estimator V̂HC0 defined in Eq. (7.42).
Proof of Proposition 7.10. By the continuous mapping theorem, V̂−1/2 →p V−1/2 . Then, by the
Slutsky theorem and Eq. (7.40):

√n V̂−1/2 (β̂ − β) →d V−1/2 · N (0, V) = N (V−1/2 0, V−1/2 VV−1/2 ) = N (0, Ik )
where in the last step we’ve used that V−1/2 VV−1/2 = VV−1/2 V−1/2 = VV−1 = Ik (we’ve
made use of the fact that “matrix powers” like M−1/2 always commute with the original matrix
M, meaning that M−1/2 M = MM−1/2 ).
Note: here we’ve simply assumed that V̂−1/2 exists. However it’s possible to argue that it must
exist for large enough n by appealing to the strong law of large numbers.
Consider now a single coefficient βj . By Eq. (7.40), √n (β̂j − βj ) →d N (0, Vjj ),
where Vjj = e′j Vej is the j th element along the diagonal of the matrix V. Replacing Vjj with a consistent
estimate, we have that

√n · (β̂j − βj ) / (V̂jj )^{1/2} →d N (0, 1)     (7.44)

where V̂jj is the j th element along the diagonal of the matrix V̂, which is a consistent estimator of Vjj .
Note that we could have written the LHS of Eq. (7.44) as √n V̂jj−1/2 (β̂j − βj ) as in Proposition 7.10, but
since V̂jj is a scalar we may take its conventional square root and divide by it.
We define the standard error for the estimate β̂j to be se(β̂j ) := (V̂jj /n)^{1/2} . Note that the standard error is
a quantity that is computed from the data, given V̂ (it is an estimate, rather than a population quantity).
By Eq. (7.44), we know that the quantity (β̂j −βj )/se(β̂j ) converges in distribution to a standard normal.
This allows us to test hypotheses about the value of βj , using our estimate β̂j and se(β̂j ). Consider the
null hypothesis: H0 : βj = β0 for some value β0 (e.g. zero). Define the T-statistic for this hypothesis to
be

T (β0 ) = (β̂j − β0 ) / se(β̂j )
If H0 is true, then we know that T (β0 ) →d N (0, 1). Recall from Section 5.4.1 that the size of a hypothesis
test is the maximum probability of rejecting the null hypothesis, when the null hypothesis is in fact true.
We can form a test with size α in the following way: reject H0 if and only if |T (β0 )| > c,
where c is a value such that the probability of a standard normal random variable having a magnitude
of at least c is at most α. To do this in a way that maximizes power, we choose c to be exactly the
1 − α/2 quantile of the standard normal distribution: c = Φ−1 (1 − α/2).
Note: using the standard normal distribution Φ to form our critical value c makes our test a so-called
z-test. When you use the regress command in Stata, it performs a t-test, which uses the same test
statistic T (β0 ) but a different critical value, instead based on the students’ t-distribution. Using the
t-distribution will be more accurate if n is small an the residuals are approximately normal and ho-
moskedastic, but as n becomes large critical values based on the t-distribution or the standard normal
distribution become the same. In modern (large n) datasets, this distinction isn’t very important, so we
just develop theory for the z-test here.
Example: A common hypothesis to test is that βj = 0. In this case, our T-statistic is simply β̂j /se(β̂j ).
It is common to construct the test with a size of 5%, in which case we reject the null that βj = 0
if |β̂j /se(β̂j )| > 1.96, where 1.96 ≈ Φ−1 (1 − .05/2). If we reject, then we say that the estimate β̂j is
“significant at the 95% level”.
Note: The above is an example of a so-called “two-sided” test. If we had a null-hypothesis like H0 :
βj ≥ β0 , we might instead apply an asymmetric decision rule: reject H0 iff T (β0 ) < −c, where now c = Φ−1 (1 − α)
is the 1 − α quantile of the standard normal distribution. This test also has a size of α.
A 1 − α confidence interval CI 1−α for βj is the set of all values β0 for which the null-hypothesis H0 :
βj = β0 is not rejected by a test with size α. In other words, it is for a two-sided test the set of all β0
such that

|β̂j − β0 | / se(β̂j ) ≤ Φ−1 (1 − α/2)

Rearranging this, we have that CI 1−α = [β̂j − c · se(β̂j ), β̂j + c · se(β̂j )], where c = Φ−1 (1 − α/2) grows
as the desired α shrinks (i.e. as the desired confidence level 1 − α grows).
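Note: the Python sketch below carries out a z-test of H0 : β1 = 0 and forms a 95% confidence interval for β1, using the homoskedastic variance estimator s2 ((1/n)X′ X)−1 from the box above; the data are simulated for illustration, and 1.96 stands in for Φ−1 (0.975).

import numpy as np

rng = np.random.default_rng(10)
n, k = 500, 2
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
Y = 1.0 + 0.3 * x + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
ehat = Y - X @ beta_hat
s2 = (ehat @ ehat) / (n - k)
V_hat = s2 * np.linalg.inv(X.T @ X / n)          # estimator of V under homoskedasticity

se1 = np.sqrt(V_hat[1, 1] / n)                   # se(beta_hat_1)
T = beta_hat[1] / se1                            # T-statistic for H0: beta_1 = 0
c = 1.96                                         # ~ Phi^{-1}(1 - 0.05/2)
ci = (beta_hat[1] - c * se1, beta_hat[1] + c * se1)
print(T, abs(T) > c, ci)                         # test decision and 95% confidence interval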
Exercise: Show that limn→∞ P (βj ∈ CI 1−α ) = 1 − α. That is, the 1 − α confidence inter-
val contains the true value of βj with probability 1 − α, in the asymptotic limit. You may take for
d
granted that if a sequence of random variables Zn → Z, then for any closed interval A of the real line:
limn→∞ P (Zn ∈ A) = P (Z ∈ A). This statement is a consequence of the so-called Portmanteau Theorem.
Note: It is often said on the basis of the above that there is a 95% chance that the true value of βj
lies in an e.g. 95% confidence interval. This language is sort of sloppy and makes it sound like βj is a
random variable, that may or may not lie inside the confidence interval. This is backwards: it is the
confidence interval CI 1−α that is random (it depends on the random variables β̂j and se(β̂j )), while βj
is not (it is just some number).
Given Eq. (7.40) and Proposition 7.10, we also have that

n (β̂ − β)′ V̂−1 (β̂ − β) →d χ2k     (7.45)

where χ2k indicates the chi-squared distribution with k degrees of freedom, which is the distribution that
applies to the sum of the squares of k independent standard normal random variables. To see this, note
that the LHS must converge to the distribution that would apply to Z ′ Z, if Z ∼ N (0, Ik ) is a vector of
k mutually independent standard normal random variables.
Thus, to test the null-hypothesis H0 : β = β0 , we can compute n(β̂ − β0 )′ V̂−1 (β̂ − β0 ), and compare
its value to quantiles of the chi-squared distribution. A test of this kind is called a Wald-test, and the
test-statistic n(β̂ − β0 )′ V̂−1 (β̂ − β0 ) is called a Wald statistic. Since a chi-squared random variable can
only be positive, Wald-tests are inherently two-sided tests.
More generally, suppose we want to test a null-hypothesis of the form H0 : r(β) = θ0 where r : Rk → Rq
is some known and differentiable function of the regression vector β. Introduce the shorthand that
θ = r(β).
By the continuous mapping theorem, we know that θ̂ := r(β̂) is a consistent estimator of the parameter
θ, and by the Delta method we know that

√n (θ̂ − θ) →d N (0, R′ VR)

where the k × q matrix R = ∇r(β) is the Jacobian matrix of r, composed of all of its derivatives evaluated
at the true β. By the same steps as those that establish Proposition 7.10, we have that √n (R′ VR)−1/2 (θ̂ − θ) →d
N (0, Iq ), where notice now that since θ has q rather than k components, we have the q × q identity
matrix appearing in the asymptotic variance.
Similar to Eq. (7.45), we can now form a Wald statistic for the hypothesis r(β) = θ0 , which (under the
null) has an asymptotic chi-squared distribution with q, rather than k, degrees of freedom:

n (θ̂ − θ0 )′ (R′ V̂R)−1 (θ̂ − θ0 ) →d χ2q     (7.46)
We can perform tests of very general hypotheses about β based upon Eq. (7.46). Note that in imple-
menting the Wald-test, it is important that the matrix R is known in order to construct the test statistic
(7.46). Fortunately, we do know R under the null-hypothesis, which fixes the value of θ, while the
function r is provided by the researcher (and if the function r is linear, then R doesn't even depend on
the value of θ). Note: it is common to use critical values from the so-called F -distribution rather than
the χ2 distribution when using (7.46) to construct tests (this also involves rescaling the test statistic by
a factor of 1/q). This is analogous to the distinction between t and z tests discussed above: the F test
is more conservative and may do better if the residuals are close to normally distributed and n is small.
It is illustrative to see that a test based on (7.46) recovers a test that is equivalent to the t-test considered
in the last section, when the function r(β) = e′j β picks out a single component of β. In this case R = ej .
The Wald statistic becomes

n (e′j β̂ − β0 )′ (e′j V̂ej )−1 (e′j β̂ − β0 ) = n (β̂j − β0 ) V̂jj−1 (β̂j − β0 ) = (β̂j − β0 )2 / (V̂jj /n) = T (β0 )2

exactly the square of the T-statistic for H0 : βj = β0 . The 1 − α quantile of the χ21 distribution is equal to
the square of the 1 − α/2 quantile of the standard normal distribution. Thus, a two-tailed t-test rejects
the null that βj = β0 exactly when the Wald test does, and vice versa.
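Note: here is a minimal Python sketch of a Wald test for the joint linear hypothesis H0 : β1 = β2 = 0, i.e. r(β) = (β1 , β2 ) and q = 2; the data are simulated for illustration, and 5.99 is the 0.95 quantile of the chi-squared distribution with 2 degrees of freedom.

import numpy as np

rng = np.random.default_rng(11)
n, k = 1000, 3
Z = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), Z])
Y = 1.0 + Z @ np.array([0.2, 0.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
ehat = Y - X @ beta_hat
Q = X.T @ X / n
Omega = (X * (ehat**2)[:, None]).T @ X / n
V_hat = np.linalg.inv(Q) @ Omega @ np.linalg.inv(Q)      # HC0 estimator of V

R = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])       # k x q Jacobian of r(beta) = (beta_1, beta_2)
theta_hat = R.T @ beta_hat
theta_0 = np.zeros(2)
W = n * (theta_hat - theta_0) @ np.linalg.inv(R.T @ V_hat @ R) @ (theta_hat - theta_0)
print(W, W > 5.99)                                        # Wald statistic and test decision at size 0.05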