
Statistics for Econometrics

Class Notes

Leonard Goff
This version: September 1, 2022

Contents

1 Probability
  1.1 Probability spaces
    1.1.1 Outcomes and events
    1.1.2 The probability of an event
    1.1.3 Which sets of outcomes get a probability?
    1.1.4 Bringing it all together: a probability space
  1.2 Random variables
    1.2.1 Definition
    1.2.2 Notation
  1.3 The distribution of a random variable
    1.3.1 Central concept: the cumulative distribution function
    1.3.2 Probability mass and density functions
      1.3.2.1 Case 1: Discrete random variables and the probability mass function
      1.3.2.2 Case 2: Continuous random variables and the probability density function
      1.3.2.3 Case 3 (everything else): mixed distributions
    1.3.3 Marginal and joint distributions
    1.3.4 Functions of a random variable
  1.4 The expected value of a random variable
    1.4.1 General definition
    1.4.2 Application: variance
  1.5 Conditional distributions and expectation
    1.5.1 Conditional probabilities
    1.5.2 Conditional distributions
    1.5.3 Conditional expectation (and variance)
  1.6 Random vectors and random matrices
    1.6.1 Definition
    1.6.2 Conditional distributions with random vectors
      1.6.2.1 Conditioning on a random vector
      1.6.2.2 The conditional distribution of a random vector

2 Empirical illustration: National Longitudinal Survey of Young Working Women
  2.1 Introduction
  2.2 The "empirical distribution" of a dataset
  2.3 Examples of conditional distributions and the law of iterated expectations

3 Statistical models
  3.1 Modeling the distribution of a random vector
  3.2 Two examples of parametric distributions
    3.2.1 The normal distribution
    3.2.2 The binomial distribution*
  3.3 Random sampling

4 When the sample gets big: asymptotic theory
  4.1 Introduction: the law of large numbers
  4.2 Asymptotic sequences
    4.2.1 The general problem
    4.2.2 Example: LLN and the sample mean
  4.3 Convergence in probability and convergence in distribution
  4.4 The central limit theorem
  4.5 Properties of convergence of random variables
    4.5.1 The continuous mapping theorem
    4.5.2 The delta method
    4.5.3 The Cramér–Wold theorem*
  4.6 Limit theorems for distribution functions*

5 Statistical decision problems
  5.1 Step one: defining a parameter of interest
  5.2 Identification
  5.3 Estimation
    5.3.1 Desirable properties of an estimator
      5.3.1.1 Consistency
      5.3.1.2 Rate of convergence
      5.3.1.3 Unbiasedness
      5.3.1.4 Efficiency
    5.3.2 Important types of estimator*
      5.3.2.1 Method of moments*
      5.3.2.2 Maximum likelihood estimation*
      5.3.2.3 Bayes estimators*
  5.4 Inference*
    5.4.1 Hypothesis testing
    5.4.2 Desirable properties of a test
      5.4.2.1 Size
      5.4.2.2 Power
      5.4.2.3 Navigating the tradeoff
    5.4.3 Constructing a hypothesis test
    5.4.4 Interval estimation and confidence intervals
      5.4.4.1 Confidence intervals by test inversion

6 Brief intro to causality
  6.1 Causality as counterfactual contrasts
  6.2 The difference in means estimator and selection bias
  6.3 Randomization eliminates selection bias
  6.4 The selection-on-observables assumption
  6.5 Causality beyond a binary treatment*
  6.6 Moving beyond average treatment effects*

7 Linear regression
  7.1 Motivation from selection-on-observables
  7.2 The linear regression model
  7.3 Five reasons to use the linear regression model
    7.3.1 As a structural model of the world
    7.3.2 As an approximation to the conditional expectation function
    7.3.3 As a way to summarize variation*
    7.3.4 As a weighted average of something we care about*
      7.3.4.1 Example 1: the average derivative of a CEF*
      7.3.4.2 Example 2: weighted averages of treatment effects*
    7.3.5 As a tool for prediction*
  7.4 Understanding the population regression coefficient vector
    7.4.1 Existence and uniqueness of β: no perfect multicollinearity
    7.4.2 Simple linear regression in terms of covariances
    7.4.3 Multiple linear regression in terms of covariances
  7.5 The ordinary least squares (OLS) estimator
    7.5.1 Sample
    7.5.2 OLS estimator
    7.5.3 Matrix notation
    7.5.4 Existence and uniqueness of β̂: no perfect multicollinearity in sample
    7.5.5 More matrix notation
    7.5.6 Regression as projection*
    7.5.7 The Frisch-Waugh-Lovell theorem: matrix version*
    7.5.8 For the matrix haters: OLS in terms of covariances*
    7.5.9 A review of notation
  7.6 Statistical properties of the OLS estimator
    7.6.1 Finite sample properties of β̂, conditional on X*
      7.6.1.1 Unbiasedness
      7.6.1.2 Efficiency
    7.6.2 Asymptotic properties of β̂
      7.6.2.1 Consistency
      7.6.2.2 Asymptotic normality
      7.6.2.3 Estimating the asymptotic variance
  7.7 Inference on the regression vector β
    7.7.1 Testing hypotheses about a single regression coefficient
    7.7.2 Testing a joint hypothesis about the regression coefficients*
Guide to using these notes

This set of notes arose from the course ECON8070: Statistics for Econometrics at the University of
Georgia in Fall 2021. I’m fixing typos as I find them and keeping this document updated on my website.
Check www.leonardgoff.com for the most recent version.

These notes feature two kinds of box, to help organize the material:

Gray boxes offer section summaries.

White boxes indicate material that is optional, and understanding this material is not required
for the course or exam.

Sections that have an asterisk at the end of their title can be skipped in their entirety: understanding
this material is not required for the course or exam. These sections are mostly there for your interest
and reference.

Chapter 1

Probability

1.1 Probability spaces

Main idea: A probability function ascribes a number to each of a collection of events, where
each event is a set of outcomes.

This section develops the mathematical notion of probability. Probability is a function that associates
a number between zero and one to events. Events, in turn, are sets of outcomes. It’s easiest to think
of outcomes in the context of a process that could have multiple distinct results, like flipping a coin or
randomly choosing a number from a phone book.

1.1.1 Outcomes and events


We begin with a set Ω of conceivable outcomes, which is referred to as the sample space or outcome space.

Examples: When flipping a coin, the sample space is Ω = {H, T }, corresponding to “heads” or “tails”,
respectively. When rolling a six-sided die, Ω = {1, 2, 3, 4, 5, 6}. When drawing a card from a 52-card
deck, the sample space can be denoted as a combination of a card value and a suit: {(n, s) : n ∈
{A, 2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K}, s ∈ {hearts, spades, diamonds, clubs}}. When using a random
number generator to draw any number between 0 and 1, the sample space is Ω = [0, 1].

We denote a generic element of the sample space as ω ∈ Ω. What we call events are simply sets
of such ω, i.e. subsets of Ω. But in general, not all subsets of Ω need to be events. Rather,
we consider a collection of sets F, referred to as an event space.
Definition 1.1. An event space F is a collection of subsets A ⊆ Ω.
In all of the examples given above, the outcome space Ω has a finite number of elements. In such
cases, it is typical to choose F to be the collection of all subsets of Ω. This collection is referred to
as the powerset of Ω and is often denoted 2^Ω. As an example, the powerset of the set {1, 2} is
2^{1,2} = {∅, {1}, {2}, {1, 2}}. When we consider Ω that are uncountable sets (for example when Ω is a
continuum), we'll need to restrict the event space, as discussed below.

1.1.2 The probability of an event


A probability function P associates a nonnegative real number to each event A ∈ F.
Definition 1.2. A probability function P (·) is a function from F to R, satisfying the following properties:
1. P (A) ≥ 0 for each A ∈ F
2. P (Ω) = 1
3. If A_1, A_2, ... is a countable collection of disjoint sets (i.e. A_j ∩ A_k = ∅ for any j ≠ k), then

P( ⋃_j A_j ) = Σ_j P(A_j)

This formulation of probability is sometimes referred to as the Kolmogorov axioms of probability.
These axioms imply several intuitive properties of probability. For example, if A has a countable
number of elements, then the third property in Definition 1.2 implies that:
P(A) = Σ_{ω ∈ A} P({ω})

provided that {ω} ∈ F for each ω ∈ A. In particular, this result implies that for a finite set A we
can simply sum up the probability of each of the outcomes in A. For example, for a six-sided die
P (even) = P ({2}) + P ({4}) + P ({6}).
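To make the third property concrete, here is a minimal Python sketch (added for illustration, not part of the original notes) that represents a finite probability space as a dictionary of outcome probabilities and computes P(A) by summing over the outcomes in A:

```python
# A minimal sketch of a finite probability space: a fair six-sided die.
P = {1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6}  # P({omega}) for each outcome

def prob(event):
    """P(A) = sum of P({omega}) over omega in A (third property of Definition 1.2)."""
    return sum(P[omega] for omega in event)

even = {2, 4, 6}
print(prob(even))     # 0.5, i.e. P({2}) + P({4}) + P({6})
print(prob(set(P)))   # ~1.0, i.e. P(Omega) = 1
```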
A few other properties of probability functions are left as exercises. As practice, I'll include a proof
of the familiar property that P(A^c) = 1 − P(A). To see this, note that A and A^c are disjoint sets, and
that A ∪ A^c = Ω. Thus, by the third property of Definition 1.2, P(Ω) = P(A) + P(A^c). Then use the
second property to obtain the result.

Exercise: Show that if A ⊆ B: P (A) ≤ P (B).

Exercise: Use the above to show that P (A ∩ B) ≤ min{P (A), P (B)}.

Exercise: Derive the expression: P (A ∪ B) = P (A) + P (B) − P (A ∩ B).

Exercise: Derive the expression: P(A ∩ B) = P(A) + P(B) − P(A ∪ B). Hint: use (A ∩ B)^c = A^c ∪ B^c.

1.1.3 Which sets of outcomes get a probability?


In addition to the Kolmogorov axioms for the function P , we also place some requirements on the event
space F . In particular, we require it to be a σ-algebra:
Definition 1.3. A σ-algebra on Ω is a collection F of subsets of Ω with the following properties:
1. Ω ∈ F

2. If A ∈ F, then A^c ∈ F, where A^c is the complement of A in Ω


3. If A_1, A_2, ... are each in F, then ⋃_j A_j is also in F
Recall that events A ∈ F are those subsets of Ω to which the function P must ascribe a probability (these
sets A are called measurable sets). The first item above, that Ω ∈ F, was already assumed by item 2. of
Definition 1.2: we can always associate a probability with the whole outcome space, and that probability
is one. Item 2. of Definition 1.3 says that if we are willing to give a probability to event A, then we
should also be willing to give a probability to the event that A does not happen, i.e. Ac . The third
property assures that given events A and B, we can always talk about the probability of A or B, which
is P (A ∪ B).
Note that all of the properties of a σ-algebra tell us about things that must be in F: they guarantee
that F is not too "small". The biggest collection of subsets of Ω is the set of all of its subsets: the
powerset 2^Ω. The powerset of Ω is always a σ-algebra (exercise: check that it satisfies all three properties).
However, using 2^Ω as the event space F can also be too big for certain applications. This is why it is
necessary to introduce the idea of a σ-algebra.

Example: Consider as an outcome space the entire unit interval: Ω = [0, 1]. It turns out that it is
impossible to define a "uniform" probability function on this Ω if we insist on using the whole powerset
of [0, 1] as our event space F. That is, there is no function P(·) satisfying Kolmogorov's axioms, and
defined over all A ∈ 2^[0,1], that satisfies our intuitive notion that moving a set around in the unit interval
does not change its probability. See Proposition 1.2.6 of Rosenthal (2006) for details.

This example demonstrates that in some cases we may need to work with something smaller than 2^Ω. In
particular, issues like the above arise when Ω is uncountably infinite, e.g. corresponding to a continuum
of numbers. When Ω is finite or countable, it usually makes sense to consider the full powerset of Ω as
our event space. When we are in the uncountable case (e.g. when Ω is a convex subset of the real line
[a, b]), we typically appeal to the Borel σ-algebra:

Definition 1.4. The Borel σ-algebra B is the collection that consists of all intervals of the forms [a, b],
(a, b], [a, b), (a, b), and all other sets in R that are then implied by the definition of a σ-algebra.
Exercise: Show that for any Ω, the collection {∅, Ω} is a σ-algebra.

Exercise: Show that ∅ ∈ F if F is a σ−algebra.


Exercise: Show that σ-algebras are closed under countable intersections, that is, ⋂_j A_j is in F if
A_1, A_2, ... are each in F.

1.1.4 Bringing it all together: a probability space


Once we have a sample space, event space, and probability function, we refer to them altogether as a
probability space (sometimes called a probability triple).
Definition 1.5. A probability space is a triple (Ω, F, P) in which F is a σ-algebra defined on Ω, and
P is a probability function defined on F .

1.2 Random variables


Main idea: If we associate a number to each outcome in a probability space, we have what is
called a random variable.

1.2.1 Definition
Most data we use in econometrics is quantitative in nature, so it's natural to think of probability spaces
in which the outcome space is composed of numbers. Many of the examples have this feature already, for
example Ω = {1, 2, 3, 4, 5, 6} for a six-sided die. But even when the ω do not have an immediate numeric
interpretation, we can define a random variable by associating a number to each outcome ω:
Definition 1.6. Given a probability space (Ω, F, P ), a random variable X is a function X : Ω → R.
Example: Suppose I randomly select a student in this class, which I represent by a probability space with
Ω = {all students in this class}, F = 2^Ω, and P({ω}) = 1/|Ω| for each ω ∈ Ω. We can then let X(ω)
denote the height in inches of student ω; this X is a random variable.

A random variable X defined from a primitive probability space (Ω, F, P ) allows us to define a new
probability space in which the outcomes are real numbers. We can now define a new probability function
PX on sets of real numbers, using the original probability function P on Ω:
PX (A) := P ({ω ∈ Ω : X(ω) ∈ A}) (1.1)
Technical note: observe that the above definition gives a way to associate a probability PX with any
set A of real numbers, provided that {ω ∈ Ω : X(ω) ∈ A} ∈ F . To ensure this condition holds it is
typical to restrict to sets A that belong to the Borel algebra B defined in Section 1.1.3, and further insist
that the function X is measurable. X being measurable is a technical condition that just means that for
any x ∈ R, the set {ω ∈ Ω : X(ω) ≤ x} ∈ F . Our new probability space can now be denoted as (R, B, PX ).

A realization of random variable X is the specific value X(ω) that it ends up taking, given ω. While X
is a function, X(ω) is a number. Lowercase letters x are often used to denote numbers that are possible
realizations: e.g. x = X(ω) for some ω ∈ Ω.

1.2.2 Notation
The notation of Equation (1.1) is pretty cumbersome to work with, so the convention is to simplify it in
a few ways.

Let’s start with an example. If we’re interested in the probability that X(ω) is less than or equal to
5, we’ll typically write this as: P (X ≤ 5), which can be interpreted as PX (A) where A = (−∞, 5], or
equivalently: P ({ω ∈ Ω : X(ω) ≤ 5}). What’s changed in this notation? Let’s go through step-by-step:
• First, we haven't bothered with the subscript X on P_X like in Equation (1.1), because it's clear
from what's inside the parentheses that we're talking about random variable X.
• Second, inside the function P we're using the language of conditions rather than sets. That is,
rather than writing out the set A = (−∞, 5] of values we're interested in, we just write this as a
condition: "≤ 5".
• Third, we've made ω implicit and written X rather than X(ω). However, you often see ω left in.
For example, we might write P(X(ω) = x) for the probability that X takes a value of x. In the
context of Equation (1.1), this maps onto P_X({x}), or equivalently P({ω ∈ Ω : X(ω) = x}).
Given that we’re using the language of conditions, we often write “and” inside probabilities, for exam-
ple: P (X ≤ 5 and X ≥ 2). The “and” operation translates into intersection in the language of sets:
P ({ω ∈ Ω : X(ω) ≤ 5} ∩ {ω ∈ Ω : X(ω) ≥ 2}). Similarly, “or” translates into the union of sets:
P (X ≤ 5 or X ≥ 2) = P ({ω ∈ Ω : X(ω) ≤ 5} ∪ {ω ∈ Ω : X(ω) ≥ 2}).

Note: We may have multiple random variables, e.g. X could be a randomly chosen state's minimum
wage, while Y is its unemployment rate. Mathematically, these two random variables correspond to
functions X(·) and Y(·) applied to a common underlying outcome space Ω, which in this case
corresponds to the set of US states. Probabilities like P(X ≤ $10 and Y ≤ 5%) are interpreted as
P({ω ∈ Ω : X(ω) ≤ $10 and Y(ω) ≤ 5%}). If P({ω}) = 1/50 for all ω, then this probability is in
turn equal to the number of states that have a minimum wage less than or equal to $10 and an
unemployment rate less than or equal to 5%, divided by 50.

1.3 The distribution of a random variable


Main idea: The cumulative distribution function (CDF) provides a concise and convenient way
to represent the probability function of a random variable or of multiple random variables. From
the CDF we can define everything else we use to work with specific types of random variables,
for example probability density functions and probability mass functions.

1.3.1 Central concept: the cumulative distribution function


We can summarize the probability function over values of a random variable X through the so-called
cumulative distribution function or CDF of X.
Definition 1.7. The cumulative distribution function of X is the function FX (x) := P (X ≤ x).
Note that FX (x) is a function from R to the unit interval [0, 1], that is FX (x) is defined for all x ∈ R and
FX (x) is always between zero and one. The following properties can be proven to hold for any random
variable X:
• F_X(x) is a weakly increasing function, that is, F_X(x′) ≥ F_X(x) if x′ > x
• lim_{x↓−∞} F_X(x) = 0
• lim_{x↑∞} F_X(x) = 1
• F_X(x) is right-continuous, i.e. F_X(x) = lim_{ϵ↓0} F_X(x + ϵ)
Note on notation: when the context is clear, we often denote a CDF as F (x) rather than FX (x). However,
when we have multiple random variables like X and Y , we may need the notation FX (x) and FY (y) to be
clear about which variable we are referring to. When using the notation F (x) for a CDF, keep in mind
that this is not the same “F” as we used to denote the event space of a generic probability triple (Ω, F, P ).

From the CDF, we can derive anything we’ll need to know about a single random variable. When we
have multiple random variables, the joint-CDF tells us everything we need to know about them.
Definition 1.8. The joint-CDF of two random variables X and Y is the function
FXY (x, y) := P (X ≤ x and Y ≤ y)
We’ll come back to the joint-CDF of two (or more) random variables in Section 1.5.

Although the CDF F(x) of a random variable is a function of a single variable x, we can use it to recover
the probability that X lies in a set. For example, consider the set (a, b], that is, all numbers x with
a < x ≤ b (excluding a itself but including b).
Proposition 1.1. For any numbers a and b such that b ≥ a, P (X ∈ (a, b]) = F (b) − F (a)
Proof. Given that P(A) = 1 − P(A^c) (see Section 1.1.2), we have that:

P (X ∈ (a, b]) = P (a < X ≤ b) = 1 − P (X ≤ a or X > b)

Using the third property of a probability function, we have that P (X ≤ a or X > b) = P (X ≤ a)+P (X >
b), since the sets {x ∈ R : x ≤ a} and {x ∈ R : x > b} are disjoint. Thus:

P (X ∈ (a, b]) = 1 − {P (X ≤ a) + P (X > b)} = P (X ≤ b) − P (X ≤ a) = F (b) − F (a)

where I’ve used that P (X ≤ b) = 1 − P (X > b).
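As a quick numerical illustration of Proposition 1.1 (my addition, using the standard normal purely as an example distribution and assuming numpy and scipy are available), a Monte Carlo check that P(a < X ≤ b) matches F(b) − F(a):

```python
# Hypothetical check of Proposition 1.1 by simulation, with X ~ N(0, 1).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
a, b = -0.5, 1.2

mc = np.mean((x > a) & (x <= b))     # empirical P(a < X <= b)
exact = norm.cdf(b) - norm.cdf(a)    # F(b) - F(a)
print(mc, exact)                     # the two numbers agree to ~3 decimals
```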


More generally, we can compute from the function F(x) the probability that X ∈ A for any Borel-
measurable set, that is, a set A that belongs to the Borel σ-algebra. Sets that are simple intervals on
the real line like (a, b] are the leading example of such sets. Computing the probability associated with
more complicated sets that aren’t intervals is also possible using the CDF. The next section develops
two functions that can be derived from the CDF, and are sometimes easier to work with for such
computations.

1.3.2 Probability mass and density functions


Let X be a random variable with CDF F (x). We often refer to the whole function F as the distribution
of X. It always tells us everything we need to know about X. But there are two important special cases
in which we can represent the distribution of X in an alternative way that is often more convenient.

1.3.2.1 Case 1: Discrete random variables and the probability mass function
Call a set 𝒳 discrete if 𝒳 contains a finite number of elements, or a countably infinite number of elements
(e.g. 𝒳 = ℕ, the set of all natural numbers).
Definition 1.9. A discrete random variable X is a random variable such that P(X ∈ 𝒳) = 1 for some
discrete set 𝒳.
Example: If X is the number returned by rolling a die, then X is a discrete random variable because
P (X ∈ {1, 2, 3, 4, 5, 6}) = 1.

For any random variable, we call the smallest set 𝒳 for which P(X ∈ 𝒳) = 1 the support of X. A discrete
random variable has as its support a discrete set.

When X is a discrete random variable, its CDF ends up looking like a staircase: flat everywhere except
at each x in its support, where it "jumps" up by an amount P(X = x). For example, for a six-sided die,
see Figure 1.1 below.

Note: The open/closed dots at e.g. x = 1 indicate that F(1) is equal to 1/6, and not 0 (although it is equal
to 0 for x arbitrarily close to but to the left of 1). We see from this graph why CDFs are right-continuous
but not necessarily left-continuous.

At each point in its support {1, 2, 3, 4, 5, 6}, the CDF for the die jumps by P(X = x), or 1/6. This is
a general feature of discrete random variables. Thus, rather than use the CDF function F(x) to represent
the distribution of X, we can just keep track of where it jumps and by how much. To do this, we use
the probability mass function or p.m.f. of X.
Definition 1.10. The probability mass function of a random variable X is the function π(x) = P(X = x)

[Figure: a staircase-shaped CDF. F(x) is 0 for x < 1 and jumps by 1/6 at each of x = 1, 2, 3, 4, 5, 6, reaching 1 at x = 6, with open/closed dots marking right-continuity at each jump.]

Figure 1.1: The CDF of the number returned by a fair six-sided die.

For a discrete random variable, we can express the p.m.f. alternatively as a sequence, rather than a
function. Label the points in the support of X as {x_1, x_2, x_3, ...}, in increasing order so that
x_1 < x_2 < x_3 < .... Let x_j denote the j-th value in this sequence. For any j, let π_j = π(x_j) = P(X = x_j).

The sequence of probabilities {π1 , π2 , π3 , . . . } coupled with the sequence of support points {x1 , x2 , x3 , . . . }
carries exactly the same information as the full CDF.

Obtaining the p.m.f. from the CDF: For a given support point x_j: π_j = F(x_j) − F(x_{j−1}), and for any
x: π(x) = lim_{ϵ↓0} [F(x) − F(x − ϵ)]. Note that π(x) = 0 for any x that is not a support point, and F is
continuous everywhere except at {x_1, x_2, x_3, ...}.
Obtaining the CDF from the p.m.f. (only possible for a discrete random variable): F(x) = Σ_{j: x_j ≤ x} π_j.

Note that from this last expression, we can see that since lim_{x→∞} F(x) = 1, we must have that Σ_j π_j = 1;
probability mass functions sum to one when the sum is taken across all support points j.
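A short Python sketch (an illustration added here, not from the original notes) of the p.m.f.-to-CDF direction for the fair die:

```python
# Recovering the CDF of a discrete random variable from its p.m.f.:
# F(x) = sum of pi_j over support points x_j <= x (fair die assumed).
support = [1, 2, 3, 4, 5, 6]
pmf = [1/6] * 6

def F(x):
    """CDF of the die: sum pi_j over all j with x_j <= x."""
    return sum(p for xj, p in zip(support, pmf) if xj <= x)

print(F(0.9), F(1), F(3.5), F(6))   # ~0.0, 1/6, 0.5, 1.0: the staircase of Figure 1.1
```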

1.3.2.2 Case 2: Continuous random variables and the probability density function
For random variables that are not discrete, knowing the probability mass function isn't sufficient to
recover the whole CDF. Often P(X = x) = 0 for all x, so the p.m.f. does not even really tell us anything
useful about X's distribution.
An important class of random variables that are not discrete are random variables for which the CDF
is differentiable for all x. When it is, we can define the probability density function or p.d.f. of X.
Definition 1.11. The probability density function of a random variable X having a differentiable CDF
F(x) is f(x) = (d/dx) F(x).
We will refer to random variables that have a density function f (x) as continuous random variables
(another phrasing is that X is continuously distributed ). Recall that for a function to be differentiable,
it must be continuous; thus, the CDF of a continuous random variable must be continuous, lacking any
jumps like those that characterize the CDF of a discrete random variable.

Note: you may see in various texts a few different notions of “continuity” of a random variable. For
the purposes of this class, a continuous random variable is a random variable with a continuous CDF,
which is basically equivalent to it being differentiable everywhere in its support. We won’t worry about
the distinction between these two things: e.g. random variables with CDFs that are continuous but
non-differentiable.

For a continuous random variable we can use the p.d.f. rather than the CDF to calculate anything we need
to know. For example, the probability that X lies in any interval [a, b] can be obtained by integrating

over the density function:
P(X ∈ [a, b]) = ∫_a^b f(x) dx    (1.2)

Intuitively, this gives us the area under the curve f(x) between points a and b, as depicted in Figure 1.2.
Note that ∫_a^b f(x) dx = F(b) − F(a), because the CDF is the anti-derivative of the p.d.f.

[Figure: two panels. Left: a probability density function f(x) with the area under the curve between x = a and x = b shaded. Right: the corresponding cumulative distribution function F(x), with the gap between F(a) and F(b) marked.]

Figure 1.2: The left panel depicts an example of the p.d.f. f(x) of a random variable X. The probability that
a ≤ X ≤ b is given by the area under the f(x) curve between x = a and x = b. P(a ≤ X ≤ b) is also equal to
F(b) − F(a), the difference in the CDF of X evaluated at x = b and at x = a, as depicted in the right panel.
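A hedged numerical sketch of Eq. (1.2) (my addition, again using the standard normal purely as an example density, and assuming scipy is installed):

```python
# Integrating a density over [a, b] recovers F(b) - F(a), per Eq. (1.2).
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

a, b = -1.0, 1.0
area, _ = quad(norm.pdf, a, b)          # integral of f(x) from a to b
print(area, norm.cdf(b) - norm.cdf(a))  # both ~0.6827
```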

While the probability mass function π(x) gives us the probability that X equals x exactly, the p.d.f.
does not tell us the probability that X = x (in fact, P(X = x) = 0 for every x for a continuous random
variable!).
Rather, f(x) can be interpreted as telling us the probability that X is close to x, in the following
sense. Consider a point x and some small ϵ > 0. Recall the definition of f(x) as the derivative of F(x):

f(x) = (d/dx) F(x) = lim_{ϵ→0} [F(x + ϵ) − F(x)] / ϵ = lim_{ϵ→0} P(X ∈ (x, x + ϵ]) / ϵ

where we've used Proposition 1.1 to replace F(x + ϵ) − F(x) with P(X ∈ (x, x + ϵ]). Thus f(x) is the limit
of the ratio of the probability that X lies in a small interval that begins at x, and the width ϵ of that
interval. Note also that for small ϵ: F(x + ϵ) ≈ F(x) + f(x) · ϵ, which is called the first-order Taylor
approximation to F(x + ϵ) around x.

Let us end this section with a few properties of a probability density function:
• From Eq. (1.2), we see that the density must integrate to one when the integral is taken over the
whole real line, i.e. ∫_{−∞}^{∞} f(x) dx = 1.
• Since F(x) is increasing and f(x) is its derivative, f(x) is nonnegative everywhere: f(x) ≥ 0.

1.3.2.3 Case 3 (everything else): mixed distributions


Although most familiar examples of random variables are either discrete or continuous, a given random
variable X need not be either. However, a powerful result known as the Lebesgue decomposition theorem
shows that we can combine the two tools we've just developed, the p.m.f. and the p.d.f., to work with
any random variable.
Definition 1.12. Given two random variables X and Y with CDFs FX and FY , a third random variable
Z is called a mixture of X and Y if it has a CDF that for some p ∈ (0, 1) satisfies FZ (t) = p · FX (t) +
(1 − p) · FY (t), for all t.
The Lebesgue decomposition theorem says that a generic random variable X can be seen as a "mixture"
of a discrete random variable and a continuous one, that is

F(x) = p · F_discrete(x) + (1 − p) · F_continuous(x)    (1.3)

for some p ∈ (0, 1), where F_discrete admits of a probability mass function, and F_continuous admits of a
probability density function (i.e. is differentiable everywhere). The support points of F_discrete are often
referred to as mass points of F.

[Figure: a CDF F(x) rising from 0 to 1, with discrete jumps at x = a and x = c and a kink at x = b.]

Figure 1.3: An example of the CDF of a mixed random variable. This example has mass points at a and c,
where the CDF jumps discretely. It is continuous everywhere else, and is differentiable everywhere except {a, b, c}.

There are some technical aspects to stating the Lebesgue decomposition theorem formally, which we
won't explore here. Rather, it's easiest to think of this result visually: a generic CDF is any increasing
function bounded between 0 and 1 (which is also right-continuous). The jumps in F(x) define the discrete
part of X (note that it can only jump up, and not down, since F is increasing). The function F(x) will
be differentiable almost everywhere else, defining its continuous part.¹

¹ "Almost everywhere" here has a technical meaning. Any monotonic function is guaranteed to be differentiable
everywhere except at isolated points: see Lebesgue's theorem for the differentiability of a monotone function.

Note for the interested: to explicitly generate decomposition (1.3), first collect the locations x_j and sizes
y_j of each of the jumps j = 1, 2, ... in F(x). Then p = Σ_j y_j, and π_j = y_j / p yields a well-defined
p.m.f. This characterizes F_discrete. For any remaining point where F(x) is differentiable, we
define a density f_continuous(x) = (1/(1 − p)) · (d/dx) F(x), which characterizes F_continuous. Note that there may be
points at which F(x) doesn't jump, but also isn't differentiable, such as point b in Figure 1.3. We can
safely ignore such points, since they are isolated and have probability zero, e.g. P(X = b) = 0.

1.3.3 Marginal and joint distributions


Recall that when we have two random variables X and Y , we have defined the joint CDF FXY (x, y) =
P (X ≤ x, Y ≤ y) as well as the individual CDFs: FX (x) = P (X ≤ x) and FY (y) = P (Y ≤ y). The
functions FX and FY are often referred to as the marginal distributions of X and Y .

The following relationships hold between marginal and joint distributions:


• F_X(x) = F_XY(x, ∞) = P(X ≤ x, Y ≤ ∞) = P(X ≤ x). Similarly, F_Y(y) = F_XY(∞, y).

• If Y is discrete: P(X = x) = Σ_j P(X = x and Y = y_j), where the y_j are the support points of Y.

• If X and Y are both continuously distributed: f_X(x) = ∫_{−∞}^{∞} f_XY(x, y) dy, where the joint
density f_XY(x, y) is the derivative of the joint CDF with respect to both x and y: f_XY(x, y) =
∂²F_XY(x, y)/∂x∂y.

Intuitively, we can obtain the marginal distribution of X from the joint distribution by summing or
integrating over all values of Y , and we can similarly derive the marginal distribution of Y from the joint
distribution of X and Y by summing/integrating over values of X.

The above results all follow from a fundamental identity for probabilities called the law of total probability:

Proposition (law of total probability): Consider a countable collection of events A_1, A_2, ... that
partition the sample space (this means that the A_j are disjoint and that ⋃_j A_j = Ω). Then for any event
B: P(B) = Σ_j P(B ∩ A_j).
Proof. The proof is good practice, so I include it here. Since any event B ⊆ Ω, B = B ∩ Ω and thus
P(B) = P(B ∩ Ω). Now, since ⋃_j A_j = Ω, we have that P(B) = P(B ∩ (⋃_j A_j)). Observe that
B ∩ (⋃_j A_j) = ⋃_j (B ∩ A_j), and that the events (B ∩ A_j) are disjoint for different values of j (since each
is a subset of A_j). Thus, P(B) = Σ_j P(B ∩ A_j), proving the result.

We can use the ideas of marginal and joint distributions to define the notion of independence between
two random variables:
Definition 1.13. We say that random variables X and Y are independent if FXY (x, y) = FX (x) · FY (y)
for all x and y.
When X and Y are independent, we denote this fact as X ⊥ Y . When they are not, we say X ̸⊥ Y .

1.3.4 Functions of a random variable


An important property of random variables is that we can apply a function to a random variable, and
this results in a new random variable. For example, if we start with a random variable X and have a
function g : R → R, then g(X) is also a random variable. For instance, X + 1 defines a new random
variable that is one larger than X in every realization.

The reason that we can do this is simple: the original random variable was defined from a function X
defined on an underlying outcome space Ω. Evaluating g(X(ω)) for any ω ∈ Ω yields a new function,
the so-called composition of g with X (this is often denoted g ◦ X).

Technical note: Recall that the function X(ω) that defines the original random variable X must be a
measurable function. For the above logic to go through, the function g(·) applied to X must also be
measurable, so that the composition g ◦ X = g(X(·)) is also measurable. A sufficient condition for a function
to be measurable is that it is piece-wise continuous, which is a very weak condition.

To work with a random variable Y = g(X), we need to know its CDF, which is:

FY (y) = P (Y ≤ y) = P (g(X) ≤ y)

This RHS expression can always be evaluated using the CDF of X. However, there are two important
special cases in which deriving the distribution of Y from that of X is particularly easy:

1. If X has a discrete distribution with support points x1 , x2 , . . . and p.m.f. π1 , π2 , . . . , then Y has
the same p.m.f. π1 , π2 , . . . but at new support points g(x1 ), g(x2 ), . . . .
• Example: if X is a random variable that takes value 0 with probability p and 1 with probability
1 − p, then the random variable Y = X + 1 is a random variable that takes value 1 with
probability p and 2 with probability 1 − p.
2. (homework problem) If X has a continuous distribution with density f_X(x), and if the function g(x)
is strictly increasing and differentiable with derivative g′, then Y has a density f_Y(y) = f_X(g^{-1}(y)) / g′(g^{-1}(y)),
where g^{-1} is the inverse function of g (see the simulation sketch below).
• Example: if g(x) = log(x), then f_Y(y) = f_X(e^y) · e^y, since g^{-1}(y) = e^y and g′(x) = 1/x.
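The following simulation sketch (an illustration added to these notes, taking X ~ Exponential(1) as an assumed example distribution) checks the change-of-variables formula for g(x) = log(x):

```python
# Simulation check of f_Y(y) = f_X(e^y) * e^y for Y = log(X),
# with the illustrative choice f_X(x) = e^{-x} (Exponential(1)).
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(1.0, 500_000)
y = np.log(x)

# empirical density of Y near y0, via P(Y in [y0, y0 + eps)) / eps
y0, eps = 0.0, 0.01
empirical = np.mean((y >= y0) & (y < y0 + eps)) / eps
formula = np.exp(-np.exp(y0)) * np.exp(y0)   # f_X(e^y) * e^y at y = y0
print(empirical, formula)                    # both ~0.368
```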

Just as a function applied to a random variable defines a new random variable, functions applied to
multiple random variables also yield a new random variable. For example, if X and Y are each random
variables, then Z = g(X, Y ) is also a random variable, where g(x, y) is now a function that takes two
arguments. Some examples would be the random variables X + Y, X · Y, or min{X, Y}. When taking a
function of two random variables, Z = g(X, Y), we need the full joint distribution of X and Y to derive
the CDF of Z. Knowing the two functions FX (x) and FY (y) is generally not enough, rather we need to
know the function FXY (x, y) (see Definition 1.8). This will come up later in the course.

1.4 The expected value of a random variable

Main idea: The expected value of a random variable is a measure of its average value across
realizations. In the special case of a continuous random variable, its value can be obtained by an
integral involving the density function. In the special case of a discrete random variable, its value
can be obtained by a sum involving the probability mass function.

The expected value (a.k.a. expectation value, or simply expectation) of a random variable is a measure
of its average value over all possible realizations. The expectation of X is denoted E[X].
To motivate how E[X] will be defined, think of the task of computing the average of a list of numbers.
For example, the average of the numbers 1, 2, 2, and 4 is (1 + 2 + 2 + 4)/4 = 2.25. Notice that the number
2 occurred twice in the list, so we added 2 to the sum two times. We could thus have written the
averaging calculation as (1/4) · (1 · 1 + 2 · 2 + 4 · 1), where each number is multiplied by the number of times
it occurs in the list. The general formula could be written

average of a list of numbers = Σ_j (j-th distinct number) · w_j,
where w_j = (# times the j-th distinct number occurs in the list) / (length of the list)

and notice that the "weight" w_j on the j-th distinct number sums to one over all j, i.e. Σ_j w_j = 1.
The definition of E[X] for a discrete random variable is exactly analogous to this formula, where we
average over the values x_j that X can take, and use as "weights" the probabilities π_j:

E[X] = Σ_j x_j · π_j    (1.4)

where x_1, x_2, ... are the distinct support points of the random variable and π_j is its p.m.f. Note that
the π_j sum to one, as we saw in Section 1.3.2.
In the case of a continuous random variable, the analogous expression to Eq. (1.4) replaces the sum
with an integral, and the probability π_j = π(x_j) is replaced by f(x) dx:

E[X] = ∫ x · f(x) dx    (1.5)

The quantity f(x) · dx can be interpreted as the probability that X lies in an interval [x, x + dx] having
a very small width dx, as discussed in Section 1.3.2.

1.4.1 General definition


We now give a general definition of the expectation of a random variable X, and see that Equations (1.4)
and (1.5) emerge as simple special cases of it when X is discrete or continuous, respectively.
Definition 1.14. The expectation of a random variable X having CDF F(x) is E[X] = ∫_{−∞}^{∞} x · dF(x),
where we define the integral ∫_{−∞}^{∞} x · dF(x) as

∫_{−∞}^{∞} x · dF(x) := lim_{a→−∞, b→∞} lim_{N→∞} Σ_{n=1}^{N} ( a + n·(b−a)/N ) · [ F( a + n·(b−a)/N ) − F( a + (n−1)·(b−a)/N ) ]

The quantity ∫_{−∞}^{∞} x · dF(x) is an example of a Riemann–Stieltjes integral, in which we "integrate with
respect to the function" F(x) rather than with respect to the variable x. Let's try to unpack this long
expression.
First, let's fix values of a, b, N and consider the quantity appearing inside all of the limits. For given
b > a, imagine cutting the interval [a, b] into N regions of equal size, so that they each have width (b−a)/N.
The n-th such region extends from the value a + (n−1)·(b−a)/N to the value a + n·(b−a)/N. Note the following:

• F( a + n·(b−a)/N ) − F( a + (n−1)·(b−a)/N ) yields P(X ∈ region n).

• a + n·(b−a)/N is the location of (the right end of) region n.
• lim_{N→∞} takes the sum to an integral, and the limits in a and b cover the full support of X.
Thus, we can interpret E[X] as an integral of the function x over the whole real line, in which each value
of x is multiplied by the probability that X is very close to x, essentially F (x + dx) − F (x).

Discrete case: Now let's see how Definition 1.14 yields Eq. (1.4) in the special case that X is a discrete
random variable. Let x_1, x_2, ... be the support points of X. Notice that for large enough N, only one x_j
can be between a + ((n−1)/N)·(b − a) and a + (n/N)·(b − a). Thus F( a + (n/N)·(b − a) ) − F( a + ((n−1)/N)·(b − a) ) = π_j if
x_j lies in the n-th region. If on the other hand no x_j lies in the n-th region, this quantity is equal to zero.
We arrive at one term for each value x_j, and E[X] = Σ_j x_j · π_j.

Continuous case: When X is a continuous random variable with density f(x), we can recover Eq. (1.5)
by noticing that for large N:

F( a + (n/N)·(b − a) ) − F( a + ((n−1)/N)·(b − a) ) ≈ f( a + (n/N)·(b − a) ) · (b − a)/N

Substituting in this approximation delivers the familiar formula that E[X] = ∫_{−∞}^{∞} x · f(x) dx.
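The limit in Definition 1.14 can also be evaluated numerically. Below is an illustrative Python sketch (not part of the original notes) that approximates E[X] by the finite sum over N regions of [a, b], using the standard normal CDF as an assumed example of F:

```python
# Numerical sketch of Definition 1.14: approximate E[X] = int x dF(x)
# by summing (right endpoint of region n) * [F(right) - F(left)].
from scipy.stats import norm

def stieltjes_mean(F, a=-10.0, b=10.0, N=100_000):
    total, width = 0.0, (b - a) / N
    for n in range(1, N + 1):
        right = a + n * width
        total += right * (F(right) - F(right - width))
    return total

print(stieltjes_mean(norm.cdf))   # ~0.0, which is E[X] for a standard normal
```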

Exercise: Consider a so-called Bernoulli random variable X that takes a value 1 with probability p and
0 with probability 1 − p. Show that E[X] = p.

Exercise: Consider a uniform [0, 1] random variable, that is, a continuous random variable with density
f(x) = 1 for all 0 ≤ x ≤ 1, and f(x) = 0 everywhere else. Show that E[X] = 1/2.

A key property of the expectation operator that is very useful is that it is linear. It’s actually “linear”
in a few distinct senses:
1. Linearity with respect to functions of a single variable: E[a + b · X] = a + b · E[X]
2. Linearity over sums of random variables: E[X + Y ] = E[X] + E[Y ].
3. Linearity with respect to mixtures: if X, Y and Z are random variables such that FZ (t) = p ·
FX (t) + (1 − p) · FY (t), then E[Z] = p · E[X] + (1 − p) · E[Y ].
Note that because of Property 2, we can compute the expectation value of the random variable X + Y
knowing only the CDFs FX (x) and FY (y), without needing the full joint-CDF FXY (x, y) of X and Y .
This is a very special property of the expectation, which doesn’t hold for most of the things we might
want to know about the random variable X + Y (for example P (X + Y ≤ t)).
Property 3. gives us a nice way to evaluate the expectation value of a random variable that is neither
discrete nor continuous. Recalling decomposition (1.3) of a general mixed random variable, let f^c(x) be
the density of the continuous part F_continuous and let x_j^d and π_j^d denote the support points and associated
probabilities according to the discrete part F_discrete. Then:

E[X] = ∫_{−∞}^{∞} x · dF(x) = p · ( Σ_j x_j^d · π_j^d ) + (1 − p) · ∫_{−∞}^{∞} x · f^c(x) · dx

1.4.2 Application: variance


From the expectation operator, we can also define the variance of a random variable, which measures
how “dispersed” it is. We’ll see that the variance plays an important role in asymptotic theory.
Definition 1.15. The variance of X is the expected value of the random variable (X − E[X])^2, i.e.
Var(X) := E[(X − E[X])^2].
The variance of X can be interpreted as the average value of the squared distance between X and its
expectation E[X]. Note that Var(X) ≥ 0 for any random variable, with Var(X) = 0 only when X takes

Exercise: Use the linearity of the expectation operator to prove the following (very useful) alternative
expression for the variance: Var(X) = E[X^2] − (E[X])^2.

Exercise: Show that for a Bernoulli random variable (defined above), the variance is equal to p · (1 − p).
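A quick simulation sketch (my addition, assuming numpy) confirming both expressions for the variance on Bernoulli draws, where the exercise's answer p(1 − p) should appear:

```python
# Check Var(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2 for Bernoulli(p) draws.
import numpy as np

rng = np.random.default_rng(2)
p = 0.3
x = rng.binomial(1, p, 1_000_000)           # Bernoulli realizations

var_def = np.mean((x - x.mean()) ** 2)      # definition of the variance
var_alt = np.mean(x ** 2) - x.mean() ** 2   # alternative expression
print(var_def, var_alt, p * (1 - p))        # all ~0.21
```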
1.5 Conditional distributions and expectation
In this section we develop a final fundamental tool that we will use to analyze random variables: the
idea of conditional distributions and conditional expectations.

Main idea: Conditioning on an event allows us to examine a restricted probability space in
which that event is true (but other things are still random). When this idea is applied to random
variables, we can define conditional distributions that we can work with in all of the normal ways.

1.5.1 Conditional probabilities


We begin with a concept that applies to all probability spaces, not just to random variables.
Definition 1.16. Given an event B such that P (B) > 0, the conditional probability of event A given B
is defined as
P(A|B) = P(A ∩ B) / P(B)
where recall that the intersection of two events A ∩ B can be interpreted as the event that both events
A and B occur. This definition is often referred to as Bayes' rule. You can think of Bayes' rule as a way
to define a probability function using B as the whole outcome space, yielding a way to talk about the
probability that ω ∈ A, given that ω ∈ B.

Extension: Given events A, B and C, we can also define the probability of A given B and C as
P (A|B ∩ C) = P (A ∩ B ∩ C)/P (B ∩ C), and so on for any number of events.

Exercise: We call events A and B independent if P (A ∩ B) = P (A) · P (B). Suppose that P (B) > 0.
Show that A and B are independent if and only if P (A|B) = P (A).
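As a worked example (added for illustration), Definition 1.16 applied to a fair die gives P(X = 2 | X even) = (1/6)/(1/2) = 1/3:

```python
# Conditional probability on a fair die: P(A|B) = P(A ∩ B) / P(B).
P = {omega: 1/6 for omega in range(1, 7)}

def prob(event):
    return sum(P[o] for o in event)

A, B = {2}, {2, 4, 6}          # A = "rolled a 2", B = "rolled an even number"
print(prob(A & B) / prob(B))   # 0.333..., i.e. (1/6) / (1/2)
```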

1.5.2 Conditional distributions


Consider now two random variables X and Y .
Definition 1.17. The conditional CDF of Y given X = x is

F_{Y|X=x}(y) := P(Y ≤ y | X = x) := lim_{ϵ↓0} P(Y ≤ y | X ∈ [x, x + ϵ])

where the conditional probability appearing in the RHS is defined by Definition 1.16.

Notation: The conditional CDF will sometimes also be denoted F_{Y|X}(y|x).

We define P (Y ≤ y|X = x) using X ∈ [x, x + ϵ] as our conditioning event B, and then taking the limit,
because the probability of X = x may be zero, e.g. for a continuously distributed X. Note: The Hansen
book uses x ∈ [x − ϵ, x + ϵ] instead of x ∈ [x, x + ϵ], but the two definitions are equivalent.

Given the general Definition 1.17, we can consider each of our two typical special cases:

• When P(X = x) > 0 (e.g. for a discrete random variable with a support point at x), Definition
1.17 reduces to the simpler expression P(Y ≤ y | X = x) = P(Y ≤ y and X = x) / P(X = x). We can interpret
F_{Y|X=x}(y) as the CDF among the sub-population of i for which X = x.

• If on the other hand f_X(x) = (d/dx) F_X(x) exists (e.g., for a continuous random variable), then
Definition 1.17 simplifies to P(Y ≤ y | X = x) = [(d/dx) P(Y ≤ y, X ≤ x)] / f_X(x). We can interpret F_{Y|X=x}(y) as
the CDF among the sub-population of i for which X is "very close" to x.

Exercise: derive each of these two expressions from Definition 1.17. For the discrete case, you may find
useful the "quotient rule" that lim_{t→0} [g(t)/h(t)] = (lim_{t→0} g(t)) / (lim_{t→0} h(t)) when both limits exist
and lim_{t→0} h(t) ≠ 0. For the continuous case, try dividing both the numerator and the denominator of
P(Y ≤ y | X ∈ [x, x + ϵ]) by ϵ before taking the limit.

Exercise: Show that if X and Y are independent then FY |X=x (y) = FY (y) and FX|Y =y (x) = FX (x) for
all x and y. Note: it’s actually an if-and-only-if, but proving the other direction is more difficult.

1.5.3 Conditional expectation (and variance)


Consider a fixed value of x, and view the conditional CDF FY |X=x (y) as a function of y. This function
satisfies the four properties of a CDF mentioned in Section 1.3.1: it is weakly increasing, right-continuous,
and ranges from zero to one.
Thus, we can define the expectation over this distribution in exactly the same way as we would for
E[Y] based on Definition 1.14, except that we use F_{Y|X=x}(y) as the CDF rather than its "unconditional"
analog F(y). We can write this using the general notation of Definition 1.14 as:

E[Y | X = x] = ∫_{−∞}^{∞} y · dF_{Y|X=x}(y)

We can unpack this expression depending on what type of random variable Y is:
• If Y is continuous: E[Y | X = x] = ∫_{−∞}^{∞} y · f_{Y|X=x}(y) · dy, where f_{Y|X=x}(y) = (d/dy) F_{Y|X=x}(y).

• If Y is discrete: E[Y | X = x] = Σ_j y_j · π_{j|X=x}, where π_{j|X=x} = lim_{ϵ↓0} [ F_{Y|X=x}(y_j) − F_{Y|X=x}(y_j − ϵ) ].

Observe that the conditional expectation E[Y |X = x] depends on x only, as we’ve averaged over various
values of Y . Accordingly, we can define a function that evaluates E[Y |X = x] over different values of x:
Definition 1.18. The conditional expectation function (CEF) of Y given X is m(x) := E[Y |X = x].
We can also use the CEF to define a new random variable, denoted E[Y |X].
Definition 1.19. E[Y |X] = m(X), where m(x) := E[Y |X = x].
For example, if X is discrete, then E[Y |X] takes value m(xj ) = E[Y |X = xj ] with probability πj .

The so-called law of iterated expectations shows that the expectation value of E[Y|X] recovers the
(unconditional) expectation of Y:

Proposition (law of iterated expectations): E[Y ] = E [E[Y |X]]


Proof. We prove it for the case in which both X and Y are continuous random variables. The other
cases are analogous.

E[E[Y|X]] = ∫_{x: f_X(x)>0} f_X(x) · E[Y|X = x] · dx
          = ∫_{x: f_X(x)>0} f_X(x) · ( ∫_{y∈R} y · f_{Y|X}(y|x) · dy ) · dx
          = ∫_{x: f_X(x)>0} f_X(x) · ( ∫_{y∈R} y · [ f_XY(x, y) / f_X(x) ] · dy ) · dx
          = ∫_{y∈R} y · ( ∫_{x: f_X(x)>0} f_XY(x, y) · dx ) · dy = ∫_{y∈R} y · f_Y(y) · dy = E[Y]

where in the third line the f_X(x) factors cancel, and in the fourth line the inner integral over x is f_Y(y).

The law of iterated expectations is useful because in many settings the quantity E[Y|X = x] is easier to
work with than E[Y] directly.

Example: Suppose that Y is individual i’s height and X is an indicator for whether they are a child or
an adult. Then the law of iterated expectations tells us that the average height in the population can
be obtained by averaging together the mean height among children with the mean height among adults.
Suppose that 75% of the population are adults. Then the law of iterated expectations reads as:
E[height] = .75 · E[height|adult] + .25 · E[height|child]
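A simulation sketch of this example (my addition; the group means, standard deviations, and shares below are made-up illustrative numbers, not data):

```python
# Law of iterated expectations by simulation: E[Y] = E[E[Y|X]].
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
adult = rng.random(n) < 0.75                 # X: 75% of the population are adults
height = np.where(adult,
                  rng.normal(67, 3, n),      # assumed E[height | adult] = 67 inches
                  rng.normal(50, 4, n))      # assumed E[height | child] = 50 inches

lhs = height.mean()                                               # E[height]
rhs = 0.75 * height[adult].mean() + 0.25 * height[~adult].mean()  # iterated expectation
print(lhs, rhs)                                                   # both ~62.75
```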
Proposition (CEF minimizes mean squared prediction error): Suppose we're interested
in constructing a function g(·) with the goal of using g(X) as a prediction of Y. We can show
that m(x) := E[Y|X = x] is the best such function, in the sense that

m = argmin_g E[(Y − g(X))^2]

Proof. Here I'll use the general notation so we don't need to make any assumptions about what
type of random variable X is (discrete, continuous, etc.):

E[(Y − g(X))^2] = E[ E[(Y − g(X))^2 | X] ] = ∫ E[(Y − g(X))^2 | X = x] · dF(x)
               = ∫ E[(Y − g(x))^2 | X = x] · dF(x) = ∫ E[Y^2 − 2Y g(x) + g(x)^2 | X = x] · dF(x)
               = ∫ ( E[Y^2 | X = x] − 2g(x) · E[Y|X = x] + g(x)^2 ) · dF(x)

For each value of x, the quantity in parentheses is minimized by g(x) = E[Y|X = x]. To see this,
note that the quantity E[Y^2 | X = x] − 2g · E[Y|X = x] + g^2 is a convex function of g, and the
first-order condition for minimizing it is satisfied when g = E[Y|X = x].

We can also define a conditional variance function Var(Y|X = x) = E[(Y − E[Y|X = x])^2 | X = x]
from the conditional distribution F_{Y|X=x}. An analog to the law of iterated expectations exists for the
conditional variance, which is sometimes called the law of total variance.

Proposition (law of total variance): Var(Y) = E[Var(Y|X)] + Var(E[Y|X]).

Example: Recall the height example from the law of iterated expectations. The law of total variance
reveals that the variance of heights in the population overall is greater than what we would get by just
averaging the variances of each subgroup. That is:

Var(height) > .75 · Var(height|adult) + .25 · Var(height|child)

The reason is that Var(height) involves making comparisons directly between the heights of children
and adults, which are not captured in Var(Y|X = x) for either value of x. The law of total variance
tells us exactly what correction we would need to make, which is to add the second term Var(E[Y|X]).
Remarkably, the correction required just depends on the average height within each group E[Y|X = x],
as well as the proportion of adults vs. children: P(X = x).
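A companion sketch (same made-up illustrative numbers as the simulation above) checking the law of total variance numerically:

```python
# Law of total variance: Var(Y) = E[Var(Y|X)] + Var(E[Y|X]).
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
adult = rng.random(n) < 0.75
height = np.where(adult, rng.normal(67, 3, n), rng.normal(50, 4, n))

within = 0.75 * height[adult].var() + 0.25 * height[~adult].var()   # E[Var(Y|X)]
mu = height.mean()
between = (0.75 * (height[adult].mean() - mu) ** 2                  # Var(E[Y|X])
           + 0.25 * (height[~adult].mean() - mu) ** 2)
print(height.var(), within + between)   # both ~64.9, well above 'within' (~10.75) alone
```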

1.6 Random vectors and random matrices


Main idea: Random vectors are vectors in which each component is a random variable, and
random matrices are matrices where each entry is a random variable. These concepts allow us
to define the expectation, variance, and covariance between random vectors, which gives us a
compact notation to discuss many random variables at the same time.

1.6.1 Definition
Rather than coming up with new letters X, Y, Z for multiple random variables, sometimes a more compact
notation is to think of a single “random vector” containing all three.
Definition 1.20. A random vector X is a vector in which each component is a random variable, e.g.
$$X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_k \end{pmatrix}$$
where X_1, X_2, etc. are each random variables.
Note the following:
• A realization x of random vector X is a point in R^k, i.e. x = (x_1, x_2, . . . , x_k)':
$$P(X = x) = P(X_1 = x_1 \text{ and } X_2 = x_2 \ldots \text{ and } \ldots X_k = x_k)$$
• For a random vector X, the function F_X denotes the joint-CDF of the random variables X_1, X_2, . . . , X_k:
$$F_X(x) = P(X_1 \le x_1 \text{ and } X_2 \le x_2 \ldots \text{ and } \ldots X_k \le x_k)$$
• The expectation of a random vector X is simply the vector of expectations of each of its components, i.e.
$$E[X] = E\begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_k \end{pmatrix} := \begin{pmatrix} E[X_1] \\ E[X_2] \\ \vdots \\ E[X_k] \end{pmatrix}$$
• The law of iterated expectations E[Y ] = E[E[Y |X]] still holds when X is a random vector, rather than a random variable.
Definition 1.21. An n × k random matrix X is a matrix in which each component is a random variable, e.g.
$$\mathbf{X} = \begin{pmatrix} X_{11} & X_{12} & \dots & X_{1k} \\ X_{21} & X_{22} & \dots & X_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ X_{n1} & X_{n2} & \dots & X_{nk} \end{pmatrix}$$
where X_{lm} is a random variable for each entry lm.
Just as with a random vector, we define the expectation of a random matrix as the matrix composed of the expectation of each of its components, i.e.
$$E[\mathbf{X}] = \begin{pmatrix} E[X_{11}] & E[X_{12}] & \dots & E[X_{1k}] \\ E[X_{21}] & E[X_{22}] & \dots & E[X_{2k}] \\ \vdots & \vdots & \ddots & \vdots \\ E[X_{n1}] & E[X_{n2}] & \dots & E[X_{nk}] \end{pmatrix}$$
This allows us to generalize the notion of variance to random vectors.
Definition 1.22. The variance of a random vector X is Var(X) = E[(X − E[X])(X − E[X])'],
where we use the notation that for a vector x, x' indicates its transpose (x_1, x_2, . . . , x_k). Note that for vectors x = (x_1 . . . x_n)' and y = (y_1 . . . y_k)', xy' is an n × k matrix, where the lm component of xy' is x_l · y_m. We will also use ' to denote the matrix transpose, i.e. [X']_{lm} = X_{ml}.

Note that when X is a random vector rather than a random variable, Var(X) is often referred to as the “variance-covariance matrix” of X. We'll use the variance-covariance matrix a lot, because it plays an important role in studying parametric distributions like the multivariate normal distribution, and in asymptotic theory.

To understand the name, let us first define the covariance between random vectors X and Y :
Definition 1.23. The covariance of random vectors X and Y is Cov(X, Y ) = E[(X − E[X])(Y − E[Y ])']
Note the following properties of covariance:
• For a random vector X: Var(X) = Cov(X, X)
• When X and Y are scalars (i.e. single random variables), Cov(X, Y ) = E[(X − E[X])(Y − E[Y ])]
• For scalar X and Y , and numbers a, b: Cov(X, a + bY ) = b · Cov(X, Y )
• For a random vector X, the components of the matrix Var(X) are scalar variances and covariances, hence its name:
$$Var(X) = \begin{pmatrix} Var(X_1) & Cov(X_1, X_2) & \dots & Cov(X_1, X_k) \\ Cov(X_2, X_1) & Var(X_2) & \dots & Cov(X_2, X_k) \\ \vdots & \vdots & \ddots & \vdots \\ Cov(X_k, X_1) & Cov(X_k, X_2) & \dots & Var(X_k) \end{pmatrix}$$
A consequence of this expression is that Var(X) is a symmetric matrix: [Var(X)]_{lm} = [Var(X)]_{ml}, because Cov(X_l, X_m) = Cov(X_m, X_l).
• When X and Y are scalars, we can define the correlation coefficient ρ_{XY} as $\frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}}$ (note that all quantities involved here are scalars). ρ_{XY} is always a number between −1 and +1 (homework problem).
Exercise: Show that Cov(X, Y ) = E[XY ′ ] − E[X]E[Y ]′
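This identity is also easy to check numerically. Here is a minimal simulation sketch in R (the joint distribution is an arbitrary choice for illustration):

set.seed(1)
n <- 100000
X <- cbind(rnorm(n), rnorm(n))           # n draws of a 2-dimensional X
Y <- cbind(X[, 1] + rnorm(n), rnorm(n))  # Y correlated with X's first component

cov(X, Y)                                        # sample Cov(X, Y)
t(X) %*% Y / n - colMeans(X) %*% t(colMeans(Y))  # E[XY'] - E[X]E[Y]'
# The two matrices agree up to the n vs. n-1 divisor and simulation error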

1.6.2 Conditional distributions with random vectors


1.6.2.1 Conditioning on a random vector
In Section 1.5 we defined the conditional distribution of one random variable Y given another random
variable X. This idea extends naturally to conditioning a random variable Y on multiple random
variables at the same time, e.g. FY |X=x,Z=z (y). Random vectors give us a nice notation for this:
Definition 1.24. With X a random vector, the conditional CDF of random variable Y given X = x is
$$F_{Y|X}(y|x) = \lim_{\epsilon_1 \downarrow 0, \ldots, \epsilon_k \downarrow 0} P(Y \le y \mid X_1 \in [x_1, x_1 + \epsilon_1], X_2 \in [x_2, x_2 + \epsilon_2], \ldots, X_k \in [x_k, x_k + \epsilon_k])$$
where x = (x_1, x_2, . . . , x_k)'.
We can always use the above definition, even if the components of X are a mix of continuous and discrete random variables.

For any given value of x, F_{Y|X=x}(y) = F_{Y|X}(y|x) yields a proper CDF as a function of y, which means we can continue to define the conditional expectation as $E[Y|X = x] = \int_{-\infty}^{\infty} y \cdot dF_{Y|X=x}(y)$, where the meaning of this integral is as given in Definition 1.14. The conditional variance of Y given X = x can also be defined in the typical way from the conditional distribution F_{Y|X=x}(y).

The law of iterated expectations carries over unchanged when X is a random vector. That is: E[Y ] = E[E[Y |X]], regardless of whether X has continuous or discretely distributed components, or a mix of the two. The law of total variance carries over too (see below).

Understanding and estimating the object E[Y |X = x] from data, where X can be a vector, will be one
of our main interests in this course, motivating the use of regression analysis. Take a deep breath, we
made it!

1.6.2.2 The conditional distribution of a random vector


This section can be skipped for now, but later in the course we’ll need to talk about joint-distribution
of a random vector, conditional on the value of one or more other random variables.

When both X and Y are random vectors, we can talk about the conditional distribution of Y given X
by defining a conditional joint-CDF of all the components of Y , conditional on X = x.
Definition 1.25. With X and Y random vectors, the conditional CDF of Y given X = x is
$$F_{Y|X}(y|x) = \lim_{\epsilon_1 \downarrow 0, \ldots, \epsilon_k \downarrow 0} P(Y_1 \le y_1, Y_2 \le y_2, \ldots \mid X_1 \in [x_1, x_1 + \epsilon_1], X_2 \in [x_2, x_2 + \epsilon_2], \ldots, X_k \in [x_k, x_k + \epsilon_k])$$
where x = (x_1, x_2, . . . , x_k)'.
An important application of the concept of a conditional joint-distribution is the idea of conditional
independence.
Definition 1.26 (conditional independence). We say that X and Y are independent conditional on
Z, denoted (X ⊥ Y )|Z, if for any value z of Z: FXY |Z=z (x, y) = FX|Z=z (x) · FY |Z=z (y) for all x, y.

This definition can be understood by using Definition 1.25 to interpret F_{XY|Z=z}(x, y) as the joint-CDF of a random vector composed of X and Y , conditional on the random vector Z. In this definition X and Y could be random variables, or could each be random vectors themselves!

As another application of Definition 1.25, the law of total covariance provides an analog of the law of iterated expectations for covariance (and hence, as a special case, for variance):
Proposition 1.2. For random vectors X, Y and Z: Cov(X, Y ) = E[Cov(X, Y |Z)] + Cov(E[X|Z], E[Y |Z])
Note that as a special case we have the law of total variance: Var(Y ) = E[Var(Y |X)] + Var(E[Y |X]).

Chapter 2

Empirical illustration: National Longitudinal Survey of Young Working Women

2.1 Introduction
Let's illustrate some of the concepts from the last chapter with an empirical example. I'll be working with the nlswork dataset, which reports a sample of young working women from the Bureau of Labor Statistics' National Longitudinal Survey in the 1970s and 1980s. Obviously this data is pretty old, but it's easy to load into R, and it has a lot of interesting variables. I'll be showing code in R, but the dataset is also easy to load into Stata using the command webuse nlswork.
You can install R by downloading it from https://www.r-project.org/. If you do that, I also recommend installing RStudio from https://www.rstudio.com/ for a nicer interface. You can also create a free account at rstudio.cloud to work with RStudio in the cloud, for low-intensity applications like this one.
To get started and load the nlswork dataset into RStudio, run the following code:

library(webuse)   # For importing the dataset
library(ggplot2)  # For plotting
library(Rmisc)    # Needed for multiplot

df <- webuse("nlswork")

The first three lines load R libraries that we'll need. The first allows us to load the dataset directly from the web using the webuse() command. The second two will be useful for creating pretty plots. Note that # in R introduces a comment, which allows you to annotate your code with things that R ignores when you run it.

Note: The library() command loads an R library, but you need to install a library before you can load it. To install the packages that are loaded above, run install.packages(c("webuse", "ggplot2", "Rmisc")) in R. You only need to do this once; then you can jump straight to library(webuse), etc.

The last line of the code block above reads in the dataset and stores it in a dataframe called df. A
dataframe is what R calls a dataset. I could have given it any name, but I chose “df”, which is usually
what I use by default. To take a look at this dataset, you can type View(df) into R after running
the above code. If you'd like to learn more about the variables, see here: https://rdrr.io/rforge/sampleSelection/man/nlswork.html.

2.2 The “empirical distribution” of a dataset
When you look at the dataframe df, you’ll notice that the first two columns are called idcode and
year. This survey tracks workers over several years, so each value of idcode shows up once for
each year in which that worker was surveyed. Altogether, there are 28,534 rows covering about 4,711 distinct workers. There are 25 variables reported for each row.
Let us index values of each of the 25 variables recorded in row i with a subscript i, e.g. age_i denotes the age recorded in row i. This is the age of a particular person, indexed by idcode_i, in the particular year year_i.
Now consider the following probability space: we draw a single row ω from the dataset at random, with an equal probability P({ω}) = 1/28,534 of selecting any given row (the sample space is finite: |Ω| = 28,534, so we can let our event space F be the full powerset of Ω). The 25 variables recorded in each row then define 25 random variables X_1(ω), X_2(ω), . . . , X_25(ω), with names like idcode, year, birth_yr, age, race, etc.
with names like idcode, year, birth yr, age, race, etc.
What I’ll call the empirical distribution of the dataset is simply the joint-distribution of our
random vector X = (X1 , X2 , . . . X25 )′ given the probability space described above. For example,
the CDF of age evaluated at 25 is:

# rows in which agei ≤ 25


P (age ≤ 25) = ,
# rows in dataset
i.e. the proportion of the 28,534 rows in which the age recorded is less than or equal to 25.
This section will use the empirical distribution of this NLS dataset as an example of the various
concepts covered in the last chapter.

Note: In Chapter 4 we'll make a big deal out of distinguishing our sample from the underlying population of interest, which in this case might be the population of all young working women in the U.S. in the 1970s-1980s, from which our sample is drawn. Given this distinction, one could
view the various distribution functions plotted in this chapter as estimates of the corresponding
distributions in the population. But this is not important for the present purpose, which is simply
to illustrate some properties of distributions in general.

2.3 Examples of conditional distributions and the law of iterated expectations
With our dataset loaded and our probability space well-defined, we can start to consider any of the quantities we defined for random variables in Chapter 1.
Let’s start with the (marginal) CDF of a single variable. We’ll do this by relying on the stat ecdf()
command of the ggplot library.

# Define a new variable called wage:
df$wage <- exp(df$ln_wage)

# Plot unconditional CDF of wage:
ggplot(df, aes(wage)) + stat_ecdf(geom = "step") +
  labs(title = "CDF of wage", y = "cdf") + xlim(0, 20)

The goal of the above code block is to plot the CDF of hourly wages in the dataset. The dataset contains only the log of wages (what labor economists often focus on), so our first task above is to generate a new column in the dataset with the wage, rather than its natural logarithm. This is done with the command df$wage <- exp(df$ln_wage), which adds a new column wage to the dataset, equal to e to the power of ln_wage. The second command generates the following figure:

The syntax of ggplot takes some getting used to, and really the best way to learn it is just to look at
examples like this one and start playing with them. The important things to recognize in the above are that ggplot(df, aes(wage)) tells ggplot that we're going to be looking at variable wage in dataframe df, stat_ecdf tells it to plot the empirical CDF, and the last two parts of the line set the plot labels and the range of the x-axis.
As you can see, the CDF of wages is a monotonically increasing function that ranges from 0 to 1. It
makes sense to focus on the range $0 to $20 because the CDF is essentially flat for wages above $20.
Now let’s consider the conditional CDF Fwage|collgrad , where collgrad = 1 indicates that i graduated
college, and collgrad = 0 indicates that they did not. Since collgrad is a discrete random variable,
the function
P (wage ≤ w and collgrad = c)
Fwage|collgrad=c (w) =
P (collgrad = c)
for any value c ∈ {0, 1} and wage value w. Given the empirical distribution of our data, this is equivalent
to
# rows in which wage ≤ w and collgrad = c
Fwage|collgrad=c (w) = (2.1)
# rows in which collgrad = c
Consider for example college graduates: c = 1. We can calculate this conditional CDF by creating a new
dataset composed of just the rows from df in which collgrad = 1, and then computing the empirical CDF
of wages with respect to that new dataset. The reason is that the empirical CDF with respect to this
new dataset counts the number of rows in which wage ≤ w (which is the number of rows in the original
dataset in which wage ≤ w and collgrad = 1) and divides by the number of rows in the new dataset
(which is the number of rows in the original dataset in which collgrad = 1). This exactly recovers Eq.
(2.1) above.
In general, conditioning on a discrete random variable in an empirical distribution is identical to
simply sub-setting the data. Thus we can generate plots of our conditional CDFs for c = 0 and c = 1 as
follows:

dfgrads <- df[df$collgrad == 1, ]
plot1 <- ggplot(dfgrads, aes(wage)) + stat_ecdf(geom = "step") +
  labs(title = "Conditional CDF of wages, college graduates", y = "cdf") +
  xlim(0, 20)

dfnongrads <- df[df$collgrad == 0, ]
plot2 <- ggplot(dfnongrads, aes(wage)) + stat_ecdf(geom = "step") +
  labs(title = "Conditional CDF of wages, non-graduates", y = "cdf") +
  xlim(0, 20)

multiplot(plot1, plot2)

The command dfgrads <- df[df$collgrad==1,] for example creates a new dataframe dfgrads which contains only the college graduates, and the next line uses the same command as before to generate the empirical CDF of wages in dfgrads. Rather than displaying it right away, we save the graph object as plot1, so that we can display the two conditional CDFs alongside one another using the multiplot command. The result is the following:

ggplot makes it easy to automatically compute these conditional CDFs and plot them alongside
one-another in the same plot. For example, the following code:

df$graduate <- factor(df$collgrad)

ggplot(df, aes(x = wage, color = graduate)) +
  stat_ecdf(lwd = 1.25, geom = "step") +
  labs(title = "Conditional CDF of wages, by college graduation", y = "cdf") +
  xlim(0, 20)

generates the combined plot:

This requires creating a so-called “factor variable”, which is a variable that R knows takes on discrete categorical values. The first line creates a factor version of collgrad and calls it graduate. Then the syntax color = graduate tells ggplot to break up the data and color it according to values of graduate.
Notice that the CDF of college graduates is lower than the CDF of college non-graduates at every
wage value. For example, about 50% of non-college graduates have a wage of $5 or less, while only about
20% of college graduates have a wage of $5 or less. This is what we should expect, if college graduates
tend to be paid better than non-graduates.
Is our wage variable a continuous or a discrete random variable? Notice that our CDF functions of
wages look smooth, like the right panel of Figure 1.2 for a continuous random variable, rather than like
a staircase as in 1.1 for a discrete random variable. But when evaluating probabilities using the empirical
distribution of our dataset, wage can't literally be a continuous random variable: in a dataset of 28,534 rows, there can be at most 28,534 distinct values of wage represented! In fact, with a finite number of people
on Earth, only a finite number of wages would be represented even if we had an idealized dataset that
included everybody.
Therefore, strictly speaking, we're plotting the CDF of a discrete random variable above. But since wages can take any real number as a value (rather than, e.g., being confined to integers or some set of categories), the distribution of wages is well-approximated by thinking of it as being a continuous
random variable. The “jumps” in the above plot are so tiny in our dataset that they are imperceptible
to the naked eye.
Thus, it’s meaningful to talk about the conditional density of wages associated with each of the
conditional CDFs above. R makes it easy to generate plots of the these conditional densities, with the
following code:

plot3 <- ggplot(dfgrads, aes(wage)) + geom_density() + xlim(0, 20) +
  labs(title = "Conditional density of wages, college graduates", y = "pdf")
multiplot(plot1, plot3)

plot4 <- ggplot(dfnongrads, aes(wage)) + geom_density() + xlim(0, 20) +
  labs(title = "Conditional density of wages, non-graduates", y = "pdf")
multiplot(plot2, plot4)

The command geom_density() is actually doing some fairly complicated calculations in the background, which aren't important here. We just want the graphs. Here's the conditional distribution for non-graduates, represented both as a CDF and as a density function:

For college graduates we have:

In each case, the density is highest where the CDF is steepest, and lowest where the CDF is the flattest.
This is what we should expect, since the density function is the derivative of the CDF. Notice that the
p.d.f. of wages for non-graduates peaks around $4 an hour, while for graduates it peaks around $7 an hour.
We can easily compute the expectations of these conditional distributions using a command like
mean(df[df$collgrad==1,]$wage). This command takes the average value of wage across all rows in
which collgrad=1. Running this command for each value of collgrad yields E[wage|collgrad = 0] ≈
$5.52 and E[wage|collgrad = 1] ≈ $8.65.
This provides us an opportunity to test the law of iterated expectations, which says that
E[wage] = E[E[wage|collgrad]] = P (collgrad = 0) · E[wage|collgrad = 0] + P (collgrad = 1) · E[wage|collgrad = 1]
Indeed mean(df$wage) yields E[wage] ≈ $6.04, mean(df$collgrad==1) reveals that P (collgrad = 1) ≈ 0.17, and 0.83 · 5.52 + 0.17 · 8.65 ≈ 6.04.
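For reference, here are those commands collected into one runnable block (a sketch of the calculation just described):

p_grad <- mean(df$collgrad == 1)                  # P(collgrad = 1), about 0.17
e_wage_grad <- mean(df[df$collgrad == 1, ]$wage)  # about 8.65
e_wage_non <- mean(df[df$collgrad == 0, ]$wage)   # about 5.52

mean(df$wage)                                     # about 6.04
(1 - p_grad) * e_wage_non + p_grad * e_wage_grad  # also about 6.04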
Now let’s look at an example of condition distributions in which Y is discrete, rather than continuous
like wage. Here are conditional CDFs of a workers highest grade of education completed, by college
graduation status:

Now the “staircase” nature of the CDF for a discrete random variable is clear. About 10% of college graduates report 15 or fewer years of education, despite having graduated college. Notice that the big jump in the
CDF for graduates is at grade 16 (12 years + a 4 year college degree), but there are also jumps beyond
that for post-graduate degrees. By contrast, the big jump for non-graduates occurs at 12, indicating
high-school completion.
Here is the CDF for the non-graduates alongside its probability mass function, plotted as a histogram.

These last two figures were generated with the following code

plot1 <- ggplot(dfgrads, aes(grade)) + stat_ecdf(geom = "step") +
  labs(title = "Conditional CDF of grade completed, college graduates", y = "cdf") +
  xlim(0, 20)
plot2 <- ggplot(dfnongrads, aes(grade)) + stat_ecdf(geom = "step") +
  labs(title = "Conditional CDF of grade completed, non-graduates", y = "cdf") +
  xlim(0, 20)
multiplot(plot1, plot2)

plot3 <- ggplot(dfnongrads, aes(grade)) + geom_histogram(aes(y = ..density..)) +
  xlim(0, 20) + labs(y = "pmf")
multiplot(plot1, plot3)

So far we’ve visualized distributions that condition on the binary variable collgrad, so there were
also two conditional distributions to look at. Now let’s move beyond a binary conditioning variable. For
example, the following figures show the conditional CDF and conditional density of wages given highest
grade completed, for grades 9 and up.

The code for these graphs is as follows (note we had to omit some rows in which grade is undefined
using the condition !is.na(df$grade)).

df$gradecompleted <- factor(df$grade)

ggplot(df[!is.na(df$grade) & df$grade > 8, ], aes(x = wage, color = gradecompleted)) +
  stat_ecdf(lwd = 1.25, geom = "step") +
  labs(title = "Conditional CDF of wages, by grade completed", y = "cdf") +
  xlim(0, 20)
ggplot(df[!is.na(df$grade) & df$grade > 8, ], aes(x = wage, color = gradecompleted)) +
  geom_density() +
  labs(title = "Conditional density of wages, by grade completed (9 and up)", y = "pdf") +
  xlim(0, 20)

Finally, let's consider a continuous conditioning variable. We'll take Y to be usual hours worked in a week, and X to be wage. We can no longer visualize the conditional distributions F_{Y|X=x} one by one. We can, however, visualize the conditional expectation function E[hours|wage = w] as a function of w. Below we plot it over a raw scatterplot of hours and wage, which provides a visualization of the joint distribution of the two variables.

Now let's see the law of iterated expectations in action in this setting. With wages conceived of as a continuously-distributed random variable, the law of iterated expectations says that
$$E[hours] = E[E[hours|wage]] = \int f_{wage}(w) \cdot E[hours|wage = w] \, dw$$
where f_{wage}(w) is the density of wages evaluated at w. The plot below shows the function f_{wage}(w) in red (scale given on the right), and the function E[hours|wage = w] in blue (scale given on the left). The value of E[hours] turns out to be about 46.56, and this is visualized by a horizontal dotted line on the same scale as the conditional expectation of hours.

Observe that the dotted line cuts through the blue CEF, capturing its “average value”. This average is weighted according to the red line; the values near $4-$6 get the most weight, for example. It is thus sensible that the value of E[hours] is close to the midpoint of the function E[hours|wage = w] in that range.
The code used to generate these graphs and calculate E[hours] is:

# CEF of hours by wage


ggplot ( df , aes ( x = wage , y = hours ) ) + geom _ point () + geom _ smooth () +
labs ( title = " CEF of weekly hours on wages " ,y = " E [ hours | wage ] " ) +
xlim (0 , 20) + ylim (0 , 80)
# CEF of hours by wage with density

avghours <- mean ( df $ hours , na . rm = TRUE )


ggplot ( df ) + geom _ density ( aes ( x = wage , y =.. density .. * 250 , color = "
density of wage " ) ) + geom _ smooth ( aes ( x = wage , y = hours , color = " E [
hours | wage ] " ) ) + labs ( title = " Visualizing the law of iterated
expectations " ,y = " E [ hours | wage ] " ) + xlim (0 , 20) + ylim (0 , 80) +
scale _ y _ continuous ( name = " E [ hours | wage ] " , sec . axis = sec _ axis
( trans = ~ . / 250 , name = " Density of wage " ) ) + geom _ hline ( aes (
yintercept = avghours , linetype = " E [ hours ] " ) , colour = " black " )
+ theme ( legend . key = element _ rect ( colour = NA , fill = NA ) ,
legend . title = element _ blank () ) + scale _ linetype _ manual ( name = " E
[ hours ] " , values = c (2) , guide = guide _ legend ( override . aes =
list ( color = c ( " black " ) ) ) ) + scale _ color _ manual ( name = " E [ hours ] "
, values = c ( " red " ," blue " ) , guide = guide _ legend ( override . aes =
list ( fill = c ( " white " , " white " ) , color = c ( " red " , " blue " ) ) ) )

Chapter 3

Statistical models

In Chapter 1, we’ve developed the idea of a random vector, which has a probability distribution that can
be characterized by the joint-CDF of all of its components. This allows us to then define concepts like
expectation, conditional distributions, and the conditional expectation function. Chapter 2 has shown
an empirical illustration in these concepts.
The data used in Chapter 2 is an example of a sample. In this case, it was a collection of observations
regarding ∼5000 young working women in the 1970s and 1980s. This chapter develops tools that let us
address the following question: what are we learning about the population of young working women in
this time period, given our sample? In doing so we move from the theory of probability to the theory of
statistics, which studies what we can learn about probability distributions from data.
To do so, it is useful to start with the definition of a statistical model, which embodies a set of
assumptions about the distribution of a set of random variables. In Section 3.3, we’ll apply this idea to
model how a sample of data is generated from an underlying population.

3.1 Modeling the distribution of a random vector


Let X be a random vector (let’s say it has k components), having some joint-CDF function F . We’ll
define a model to be a set of such distributions, which we think that the true F might belong to.

Definition 3.1. A statistical model is a set F of potential candidates for F .


One thing we know for sure about the CDF function F(x) = P(X_1 ≤ x_1, X_2 ≤ x_2, . . . , X_k ≤ x_k) is that it is increasing and right-continuous with respect to each x_j and takes values between zero and one. Let F̄ be the set of all such functions, which are the valid CDF functions for a k-dimensional random vector X. A statistical model thus specifies a subset F ⊆ F̄.

Most familiar models in statistics are “parametric” models. A parametric statistical model is a model
in which each F ∈ F can be written in terms of a vector of parameters θ ∈ Rd . What I mean by this is
that each function F (x) in F can be written as F (x; θ): the whole function F (·; θ) depends on the values
of the parameters θ = (θ1 , θ2 , . . . θd )′ :

Definition 3.2. A parametric statistical model is the set F = {F (·; θ) : θ ∈ Θ}, where each θ ∈ Θ is
some finite-dimensional vector (i.e. Θ is some subset of Rd for a finite d).
Example: Suppose somebody hands you a coin, which could be “weighted” so that the probability of
heads is different than 1/2. They don’t tell you the value of p = P (heads). Thus, your statistical model
is that P (heads) = p and P (tails) = 1 − p for some p ∈ [0, 1]. This model has one parameter, p.

Notation: When a parametric model contains only continuous distributions, the distributions in F are
often denoted by their densities f (·; θ) rather than CDFs. You also sometimes see the notation f (·|θ),
with a “|” rather than “;”.

Preview: parametric vs. non-parametric models:

Note that a parametric statistical model must be characterized by a finite number of real-valued
parameters, i.e. d is finite. This distinguishes parametric models from non-parametric models F,
in which there’s no way to come up with a finite number of parameters that can fully characterize
each F ∈ F.

Example: Suppose that X is a random variable, and we let F contain all valid CDFs such that
the function F (x) is concave in x (on its support, where it’s increasing). This represents a
non-parametric statistical model.

Non-parametric models play an important role in modern econometrics, as they often make weaker
assumptions than parametric models, and become most practical to work with when we have big
datasets. “Semi-parametric” models occur when a finite-dimensional θ pins down some—but
not all—features of each F ∈ F. Semi-parametric models play an important role in regression
analysis, as we’ll see.

We’ll start by studying some important parametric models for a single random vector X. We’ll then
apply the idea of a model for the distribution of X to move on to our real objective: modeling the
generation of a whole dataset, which typically consist of n realizations of a random vector X.

3.2 Two examples of parametric distributions


3.2.1 The normal distribution
Arguably the most important parametric model arises from the family of probability distributions called
normal distributions.
The univariate normal distribution is a continuous distribution with density function:
$$f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \qquad (3.1)$$
for some values σ and µ. We say that X ∼ N(µ, σ²) when it has a density function given by f(·; µ, σ).

Exercise: Show that the normal density f(x; µ, σ) integrates to one. Hint: use $\int_{-\infty}^{\infty} e^{-t^2} dt = \sqrt{\pi}$.

The following exercises ask you to show that the parameters µ and σ are the expectation and standard
deviation of X, respectively (the standard deviation of a random variable is defined as the square root
of its variance).

Exercise: Show that if X ∼ N(µ, σ²) then E[X] = µ.

Exercise: Show that if X ∼ N(µ, σ²) then Var(X) = σ². Hint: use $\int_{-\infty}^{\infty} t^2 \cdot e^{-t^2} dt = \frac{\sqrt{\pi}}{2}$.

Proposition 3.1. Suppose that X ∼ N(µ, σ²) and we define Y = a + bX for some a, b ∈ R. Then Y is also normally distributed, as N(a + bµ, b²σ²).
A consequence of this is that if X ∼ N(µ, σ²), then the random variable (X − µ)/σ ∼ N(0, 1). N(0, 1) is called the standard normal distribution.
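A quick simulation sketch of this standardization in R (the values of µ and σ below are arbitrary choices):

set.seed(1)
mu <- 2; sigma <- 3
x <- rnorm(100000, mean = mu, sd = sigma)  # X ~ N(mu, sigma^2)
z <- (x - mu) / sigma                      # standardized version of X
c(mean(z), sd(z))  # approximately 0 and 1, as N(0,1) requires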

The multivariate normal distribution generalizes the normal distribution to a random vector X = (X_1, X_2, . . . , X_k)'. The multivariate normal density is parametrized by a k × 1 vector µ and a k × k matrix Σ:
$$f(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^k \det(\Sigma)}} e^{-\frac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)} \qquad (3.2)$$
where Σ^{−1} denotes the matrix inverse of Σ, and det(Σ) its determinant. When X ∼ N(µ, Σ), the mean of X is µ, i.e. E[X] = µ, and Σ is its variance-covariance matrix: Var(X) = Σ.

Example: The bivariate normal distribution, the case when k = 2, warrants some additional attention. Let ρ be the correlation coefficient between X_1 and X_2, so that
$$\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$$
where $\sigma_j = \sqrt{Var(X_j)}$. In this case Eq. (3.2) simplifies to:
$$f(x_1, x_2; \mu_1, \mu_2, \sigma_1, \sigma_2, \rho) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \cdot e^{-\frac{1}{2(1-\rho^2)}\left[\left(\frac{x_1-\mu_1}{\sigma_1}\right)^2 + \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2 - 2\rho\left(\frac{x_1-\mu_1}{\sigma_1}\right)\left(\frac{x_2-\mu_2}{\sigma_2}\right)\right]} \qquad (3.3)$$
Exercise: Show from Eq. (3.3) that if X1 and X2 are jointly normally distributed, then X1 ⊥ X2 if and
only if Cov(X1 , X2 ) = 0.

A useful property of the family of normal distributions is that many operations keep one in the family. For example, when X is a (multivariate) normal random vector, the marginal distribution of each of its components is a (univariate) normal random variable:
Proposition 3.2. Let X ∼ N(µ, Σ). Then the marginal distribution of each X_j is N(µ_j, Σ_{jj}). A vector X̃ composed of any subset of the X_j is also normal.

Conditional distributions defined from a multivariate normal are also normal:

Proposition 3.3. Let (X, Y) follow a bivariate normal distribution with parameters (µ_Y, µ_X, σ_Y, σ_X, ρ). Then the distribution of Y given X = x is a normal distribution with mean $\mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x - \mu_X)$ and variance $\sigma_Y^2(1-\rho^2)$.
Proof. The conditional density of Y given X = x is
\begin{align*}
f_{Y|X=x}(y) = \frac{f_{XY}(x,y)}{f_X(x)} &= \frac{\frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}} \, e^{-\frac{1}{2(1-\rho^2)}\left[\left(\frac{x-\mu_X}{\sigma_X}\right)^2 + \left(\frac{y-\mu_Y}{\sigma_Y}\right)^2 - 2\rho\left(\frac{x-\mu_X}{\sigma_X}\right)\left(\frac{y-\mu_Y}{\sigma_Y}\right)\right]}}{\frac{1}{\sigma_X\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu_X}{\sigma_X}\right)^2}} \\
&= \frac{1}{\sqrt{2\pi}\,\sigma_Y\sqrt{1-\rho^2}} \, e^{-\frac{1}{2(1-\rho^2)}\left[\left(\frac{y-\mu_Y}{\sigma_Y}\right)^2 - 2\rho\left(\frac{x-\mu_X}{\sigma_X}\right)\left(\frac{y-\mu_Y}{\sigma_Y}\right) + \rho^2\left(\frac{x-\mu_X}{\sigma_X}\right)^2\right]} \qquad (3.4)
\end{align*}
where we've used that $e^a/e^b = e^{a-b}$ and that $\frac{1}{2}\left(\frac{x-\mu_X}{\sigma_X}\right)^2 = \frac{1-\rho^2}{2(1-\rho^2)}\left(\frac{x-\mu_X}{\sigma_X}\right)^2$. Now the beautiful part: the exponent in (3.4) is a “perfect square” quantity:
$$f_{Y|X=x}(y) = \frac{1}{\sqrt{2\pi}\,\sigma_Y\sqrt{1-\rho^2}} \, e^{-\frac{1}{2(1-\rho^2)}\left(\frac{y-\mu_Y}{\sigma_Y} - \rho\frac{x-\mu_X}{\sigma_X}\right)^2} = \frac{1}{\sqrt{2\pi}\sqrt{\sigma_Y^2(1-\rho^2)}} \, e^{-\frac{1}{2} \cdot \frac{\left(y - \left(\mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x-\mu_X)\right)\right)^2}{\sigma_Y^2(1-\rho^2)}}$$
which is exactly the formula for the density of a $N\left(\mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x-\mu_X), \; \sigma_Y^2(1-\rho^2)\right)$ random variable.
Later in the course, we’ll see that Proposition 3.3 generalizes in a natural way beyond the bivariate case.
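Proposition 3.3 can also be checked by simulation. Here is a sketch in R using the MASS package's mvrnorm() to draw from a bivariate normal (all parameter values below are arbitrary choices):

library(MASS)  # for mvrnorm()
set.seed(1)
muX <- 1; muY <- 2; sigX <- 1; sigY <- 2; rho <- 0.5
Sigma <- matrix(c(sigX^2, rho * sigX * sigY,
                  rho * sigX * sigY, sigY^2), nrow = 2)
draws <- mvrnorm(500000, mu = c(muX, muY), Sigma = Sigma)

# Condition on X falling very close to x = 1.5:
x <- 1.5
ycond <- draws[abs(draws[, 1] - x) < 0.01, 2]
c(mean(ycond), muY + rho * (sigY / sigX) * (x - muX))  # conditional mean: both about 2.5
c(var(ycond), sigY^2 * (1 - rho^2))                    # conditional variance: both about 3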

Sums of normal random variables are also normally distributed, provided that the two random variables being added are jointly normal:

Proposition 3.4. If X and Y are jointly normal, with
$$\begin{pmatrix} X \\ Y \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix}, \begin{pmatrix} \sigma_X^2 & \sigma_{XY} \\ \sigma_{XY} & \sigma_Y^2 \end{pmatrix} \right)$$
then X + Y ∼ N(µ_X + µ_Y, σ_X² + σ_Y² + 2σ_{XY}).

This is helpful in establishing the following, which generalizes Proposition 3.1:

Proposition 3.5. Let X ∼ N(µ, Σ). Then for any a = (a_1, a_2, . . . , a_k)':
$$a'X := \sum_{j=1}^k a_j \cdot X_j \sim N(a'\mu, \; a'\Sigma a)$$

Proof. Applying the fact that X + Y is normal when X and Y are jointly normal a total of k − 1 times establishes that a'X is normal (making use of the fact that a_j X_j is normal for each j). If we know that a'X is normal, and we know its mean and variance, we know its whole distribution. That E[a'X] = a'E[X] = a'µ follows from linearity of the expectation. That Var(a'X) = a'Var(X)a then follows, also by linearity of the expectation.

3.2.2 The binomial distribution*

Suppose we have n random variables Z_1, . . . , Z_n, where each Z_j is a 0/1 random variable (referred to as a Bernoulli random variable). For each Z_j: P(Z_j = 1) = p and P(Z_j = 0) = 1 − p. We can construct a random vector Z = (Z_1, Z_2, . . . , Z_n)' from these n random variables.
Suppose that each of the Z_j are independent, meaning that:
$$F(z) = F_{single}(z_1) \times F_{single}(z_2) \times \cdots \times F_{single}(z_n)$$
where F_{single}(z) is the CDF of a single Bernoulli random variable having probability p: F_{single}(z) = (1 − p) · 1(z ≥ 0) + p · 1(z ≥ 1).
We now define the binomial distribution to be the distribution of $X := \sum_{j=1}^n Z_j$. Often, each Z_j is interpreted as the success (1) or failure (0) of a “trial” of some kind. The probability of success of trial j is P(Z_j = 1) = p. With this interpretation the random variable X simply counts the number of successes out of the n trials.
To denote that X has the binomial distribution with parameters p and n, we write X ∼ B(n, p). Since X can only take integer values between 0 and n, the binomial distribution is a discrete distribution. Rather than a density, it has a probability mass function. The probability that X = k can be written as:
$$P(k; n, p) = \binom{n}{k} \cdot p^k \cdot (1-p)^{n-k}$$
where $\binom{n}{k} := \frac{n!}{k!(n-k)!}$. The quantity $\binom{n}{k}$ simply counts the number of distinct ways that we could get k successes from n trials, i.e. the number of distinct vectors z ∈ {0, 1}^n that contain k ones and n − k zeroes. Each such z has the same probability p^k · (1 − p)^{n−k} of occurring, leading to our final expression for P(k; n, p).
We know by linearity of the expectation that the mean of X must be $E[X] = E\left[\sum_{j=1}^n Z_j\right] = \sum_{j=1}^n E[Z_j] = n \cdot p$. We can also verify this directly from the explicit formula for P(k; n, p):
$$E[X] = \sum_{k=0}^n k \cdot \binom{n}{k} \cdot p^k \cdot (1-p)^{n-k} = \sum_{k=0}^n \frac{n! \cdot k}{k!(n-k)!} \cdot p^k \cdot (1-p)^{n-k}$$
This is an intimidating sum to try to evaluate. But we can use the following “trick”. Since the p.m.f. of X must sum to one, we know that $\sum_{k=0}^n P(k; n, p) = 1$ for any n and p. Notice that $\frac{k}{k!} = \frac{1}{(k-1)!}$. Thus, the factor of k that appears when we take the expectation of X has a similar effect to simply decreasing k by one, and summing over its p.m.f. instead. But we also have to deal with the other factors that depend on k. Note that
$$P(k; n, p) = \frac{n!}{(n-1)!} \cdot \frac{(k-1)!}{k!} \cdot p \cdot P(k-1; n-1, p) = \frac{n}{k} \cdot p \cdot P(k-1; n-1, p)$$
if k > 0. Thus:
$$E[X] = \sum_{k=1}^n k \cdot P(k; n, p) = \sum_{k=1}^n np \cdot P(k-1; n-1, p) = np \cdot \sum_{k=0}^{n-1} P(k; n-1, p) = np \cdot 1$$
where in the first step we've used that k · P(k; n, p) = 0 if k = 0.

Exercise: Show that if X ∼ B(n, p), V ar(X) = n · p(1 − p). You may find it useful that since Zj ⊥ Zj ′
for each pair j ̸= j ′ , Cov(Zj , Zj ′ ) = 0.
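These moment formulas are easy to verify by simulation. A sketch in R (the values of n and p are arbitrary choices):

set.seed(1)
n <- 20; p <- 0.3
x <- rbinom(100000, size = n, prob = p)  # 100,000 draws of X ~ B(n, p)
c(mean(x), n * p)                        # both about 6
c(var(x), n * p * (1 - p))               # both about 4.2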

3.3 Random sampling


The last section covered two parametric families of distributions, either of which could be used to form
a parametric statistical model. However, the most common kind of statistical model used in practice is
a non-parametric one: an independent and identically distributed (i.i.d.) sample.
Definition 3.3. A collection of random vectors {X1 , X2 , . . . Xn } are called independent and identically
distributed (i.i.d.) if Xi ⊥ Xj for i ̸= j and each Xi has the same marginal distribution as the others.
When a collection of random vectors is independent, as with an i.i.d. collection, knowing the CDF F of each member X_i is enough to recover the full joint-distribution of the collection. For example, let n = 2 and suppose both X_1 and X_2 are i.i.d. random variables (rather than vectors). Then P(X_1 ≤ x_1, X_2 ≤ x_2) = P(X_1 ≤ x_1) · P(X_2 ≤ x_2) = F(x_1) · F(x_2), where F(·) is the marginal CDF of each of the X_i. With an i.i.d. collection, we only need to know the CDF F that applies to each of the X_i in order to know everything about the collection.
Note that assuming that a collection of random vectors is i.i.d. is a statistical model, in the sense
of Section 3.1. In that section we weren’t talking about collections of random vectors, but you can
always think of a collection of random vectors as an even larger list of random variables. For ex-
ample, in the example above Definition 3.3 restricts the joint-CDF of X1 and X2 to have the form
F12 (x1 , x2 ) = F (x1 ) · F (x2 ) for some valid CDF function F .

The i.i.d. model is typically used to describe simple random sampling. Simple random sampling oc-
curs when individuals are selected at random from some underlying population I, and a set of variables
Xi = (X1i , X2i , . . . Xki )′ are recorded for each sampled individual i. Imagine for example a telephone
survey, in which enumerators have a long list I of potential individuals to contact. They use a random
number generator to choose an i at random from this list, contact them, and record responses to a set
of k questions. This process is then repeated n times.

Note: With a finite population I, we must allow sampling “with replacement” for the i.i.d. model to hold strictly. If individual i is removed from the list after being contacted, then the random vectors X_i may no longer be independent. For example, suppose we are randomly selecting U.S. states and recording the population of each one. Suppose California (the most populous state) has 40 million residents and Georgia has 11 million. Then for example P(X_2 = 40m|X_1 = 40m) ≠ P(X_2 = 40m|X_1 < 40m), since the first probability is zero and the second is 1/49. This means that X_1 and X_2 are not independent. Simple random sampling is often referred to as random sampling for short, or as i.i.d. sampling.

We’ll use the terms dataset or sample to refer to an n × k matrix X that records characteristics Xi =
(X1i , X2i , . . . Xki ) for each of n observational units (such as individuals) i. Data is not always generated
by simple random sampling, but when it is, we can imagine X as being formed by randomly choosing
rows from a much larger matrix that records Xi for all individuals in the population, depicted in Figure
3.1. The actual data we see in X is a realization of the collection of random variables {X_1, X_2, . . . , X_n}.
$$\mathbf{X} = \begin{pmatrix} X_1' \\ X_2' \\ \vdots \\ X_n' \end{pmatrix} = \begin{pmatrix} X_{11} & X_{21} & \dots & X_{k1} \\ X_{12} & X_{22} & \dots & X_{k2} \\ \vdots & \vdots & & \vdots \\ X_{1n} & X_{2n} & \dots & X_{kn} \end{pmatrix}$$

The randomness of X comes from the random sampling: we could have drawn a different set of individuals from the population, in which case we would have seen a different dataset X.

Notation: Note that the entries of the sample matrix X are denoted X_{ji}, where i indexes rows (individual observations) and j indexes columns (variables/characteristics). This is backwards from the way we often denote entries M_{ij} of a matrix M, where the row i comes before the column j. This is a consequence of
two conventions interacting: that rows of X index individuals (just like when you open the dataset in
R), but that Xji indexes characteristic j of individual i (equivalently, characteristic j of the individual
sampled in row i).

Note that most sampling processes in the real world occur without replacement: the same individual
cannot show up twice in the data. Given the note above, this suggests that these sampling processes
are not i.i.d., strictly speaking. However, when the size N of the underlying population is large, such
samples can still be well-approximated as being i.i.d.. Intuitively, that’s because when N is much larger
than n (often denoted as N >> n), the chance that you would draw the same individual twice is very
low. We thus typically assume i.i.d., with the idea that N is suitably large to not worry about sampling
with vs. without replacement.

Sample X:
  row i   ω_i   age_i   married_i   college_i
  1       1     25      0           0
  2       4     37      1           1
  3       5     54      0           1

Population I:
  individual i   age_i   married_i   college_i
  1              25      0           0
  2              74      1           1
  3              8       0           0
  4              37      1           1
  5              54      0           1

Figure 3.1: An example of simple random sampling, in which n = 3 and N = 5. Each row of the dataset on the left is a realization of random vector X = (age, married, college), which chooses a row at random from the population matrix on the right. We can conceptualize this sampling process as a probability space with outcomes ω = (ω_1, ω_2, ω_3), where ω_i yields the index of the randomly selected individual in I. The random vectors X_i = X_i(ω_i) and X_j = X_j(ω_j) are independent for i ≠ j, but the random variables within a row are generally not independent, e.g. age_i and college_i are positively correlated.

The following are some alternative methods of generating data, aside from simple random sampling:
• Stratified random sampling: the population is divided into groups, and then simple random sampling occurs within each group (e.g. I run my sampling algorithm separately for men and women, so that I can ensure equal representation of each).
• Clustered random sampling: after defining groups, we randomly select some of the groups. Then all individuals from those groups are included in the sample (e.g. I interview everybody in a household, after choosing households at random).
• Panel data: suppose we have observations over multiple time periods t for each individual i, where the individuals i are drawn as a simple random sample. Then if we arrange all of i's data onto one row, we can imagine X as reflecting an i.i.d. sample. But with rows corresponding to (i, t) pairs, the rows are no longer independent (in general).
• Observing the whole population: this would be the case e.g. with state-level data from all 50 U.S. states. This situation occurs increasingly frequently with individual-level data now as well, e.g. administrative data on all tax-filers in a country.
These alternative sampling methods tend to violate the i.i.d assumption. However, methods exist to
deal with each of them.

Let us end this section with a last bit of jargon. When Xi for i = 1 . . . n denotes a collection of i.i.d
random vectors, we’ll refer to the distribution F that describes the marginal distribution of each Xi as
the population distribution. The population distribution is the distribution we get when we randomly
select any individual from the population. Features of the population distribution are the ones that
you naturally think of when you think about summarizing a population. For example, if I is a finite
population of size N, then
$$E_F[X_i] = \frac{1}{N} \sum_{i \in I} X_i$$

where we use the notation EF to make explicit that the expectation is with respect to the CDF F . The
population mean is simply the mean of Xi among everybody in I. We can also talk about the population
variance, the population median, and so on. Note that the empirical distribution introduced in Section
2.2 is an example of a population distribution in which we think of the whole population as equal to the
rows of the dataset.

Another piece of terminology will be useful as we discuss samples and their population counterparts:
Definition 3.4. A statistic or estimator is any function of the sample X = (X_1', X_2', . . . , X_n')'.
A generic estimator or statistic will apply some function g(X) = g(X_1, X_2, . . . , X_n) to the collection of random vectors that constitute the sample. An example is the so-called sample mean $\bar{X}_n := \frac{1}{n}\sum_{i=1}^n X_i$, which simply adds together the X_i across the sample and divides by the number of observations n. X̄_n is an example of a statistic. Since each of the X_i is a random variable/vector, it follows that X̄_n is itself
a random variable/vector. This is true of statistics in general: they are random.

The reason that we also refer to statistics as “estimators” is that statistics often attempt to estimate a
population quantity of some kind from data. For example, we’ll see in the next Chapter that for large
n, we are justified in thinking that X̄n ≈ µ. It is therefore reasonable to use X̄n as an estimate of µ.
Note that X̄n is random, while µ is just a fixed number. Thus we have to be careful in what we mean
by saying that X̄n ≈ µ, which is the topic of the next chapter.

Notation: Often estimators are depicted with a “hat” on them, e.g. θ̂ = g(X). We’ll use this notation
to denote a generic estimator.

A useful property of i.i.d. random vectors that I’ll mention here is the following:
Proposition 3.6. If {X1 , X2 , . . . Xn } are i.i.d random vectors, then {h(X1 ), h(X2 ), . . . h(Xn )} are also
i.i.d for any (measurable) function h.
An implication of Proposition 3.6 is that if we have an i.i.d. sample Xi , we can from it construct an i.i.d.
sample of e.g. Xi2 .

Chapter 4

When the sample gets big: asymptotic theory

4.1 Introduction: the law of large numbers

Consider an i.i.d. sample {X_1, . . . , X_n} of some random variable X_i. The sample average of X_i in our data simply takes the arithmetic mean across these n observations:
$$\bar{X}_n := \frac{1}{n}\sum_{i=1}^n X_i$$
The law of large numbers (LLN) states the deep and useful fact that for very large n, it becomes very unlikely that X̄_n is very far from µ = E[X_i], the “population mean” of X_i.

Theorem 1 (law of large numbers). If X_i are i.i.d. random variables and E[X_i] is finite, then for any ϵ > 0:
$$\lim_{n\to\infty} P(|\bar{X}_n - \mu| > \epsilon) = 0$$

Note: The LLN is stated above for a random variable, but the result generalizes easily to random vectors. In that case, $\lim_{n\to\infty} P(||\bar{X}_n - \mu||_2 > \epsilon) = 0$, where ||·||_2 denotes the Euclidean norm, i.e. $||\bar{X}_n - \mu||_2 = \sqrt{(\bar{X}_n - \mu)'(\bar{X}_n - \mu)}$, where X̄_n is a vector of sample means for each component of X_i, and similarly for µ.

Note: the version of the law of large numbers above is called the weak law of large numbers. There exists
another version called the strong LLN.

Let us now prove the LLN. We will do so using a tool called Chebyshev's inequality. This proof assumes that Var(X_i) is finite, but the LLN holds even if Var(X_i) = ∞. Chebyshev's inequality allows us to use the variance of a random variable to put an upper bound on the probability that the random variable is far from its mean. In particular, for any random variable Z with finite mean and variance:
$$P(|Z - E[Z]| \ge \epsilon) \le \frac{Var(Z)}{\epsilon^2}$$
To see that this holds, use the law of iterated expectations to write out the variance as
\begin{align*}
Var(Z) = E[(Z - E[Z])^2] &= P(|Z - E[Z]| \ge \epsilon) \cdot E[(Z - E[Z])^2 \mid (Z - E[Z])^2 \ge \epsilon^2] \\
&\quad + P(|Z - E[Z]| < \epsilon) \cdot E[(Z - E[Z])^2 \mid (Z - E[Z])^2 < \epsilon^2] \\
&\ge P(|Z - E[Z]| \ge \epsilon) \cdot \epsilon^2 + P(|Z - E[Z]| < \epsilon) \cdot 0,
\end{align*}
noting that |Z − E[Z]| ≥ ϵ iff (Z − E[Z])² ≥ ϵ².
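Chebyshev's inequality is easy to see in action with a simulation. A sketch in R (the exponential distribution and the value of ϵ are arbitrary choices):

set.seed(1)
z <- rexp(100000, rate = 1)  # Z with E[Z] = 1 and Var(Z) = 1
eps <- 2
mean(abs(z - 1) >= eps)  # empirical P(|Z - E[Z]| >= eps), about 0.05
1 / eps^2                # Chebyshev's bound Var(Z)/eps^2 = 0.25: not tight, but valid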

Now, we will show that as n → ∞, Var(X̄_n) → 0. This, along with Chebyshev's inequality, implies the LLN, by letting Z = X̄_n.

To see that Var(X̄_n) → 0, note first that
$$E[\bar{X}_n] = E\left[\frac{1}{n}\sum_{i=1}^n X_i\right] = \frac{1}{n}\sum_{i=1}^n E[X_i] = \frac{1}{n}\sum_{i=1}^n \mu = \mu$$
The first equality is simply the definition of X̄_n, while the second uses linearity of the expectation operator. Now consider
\begin{align*}
Var(\bar{X}_n) = E[(\bar{X}_n - E[\bar{X}_n])^2] = E[(\bar{X}_n - \mu)^2] &= E\left[\left(\frac{1}{n}\sum_{i=1}^n (X_i - \mu)\right)^2\right] \\
&= \frac{1}{n^2} E\left[\sum_{i=1}^n \sum_{j=1}^n (X_i - \mu)(X_j - \mu)\right] = \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n E[(X_i - \mu)(X_j - \mu)] \\
&= \frac{1}{n^2} \sum_{i=1}^n E[(X_i - \mu)^2] = \frac{1}{n^2} \cdot n \cdot Var(X_i) = \frac{Var(X_i)}{n}
\end{align*}
where the first equality in the third line follows because when i ≠ j, X_i ⊥ X_j implies that E[(X_i − µ)(X_j − µ)] = E[X_i − µ] · E[X_j − µ] = 0 · 0. Thus, the only terms that remain are those with j = i.
Another way to see that Var(X̄_n) = Var(X_i)/n is to notice that when Y and Z are independent, Var(Y + Z) = Var(Y) + Var(Z). Thus:
$$Var\left(\frac{1}{n}X_1 + \frac{1}{n}X_2 + \cdots + \frac{1}{n}X_n\right) = n \cdot Var\left(\frac{1}{n}X_i\right) = n \cdot \frac{1}{n^2} \cdot Var(X_i) = \frac{Var(X_i)}{n}$$
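This Var(X̄_n) = Var(X_i)/n result is also easy to confirm by simulation. A sketch in R using coin flips, where Var(X_i) = 1/4:

set.seed(1)
n <- 100
xbars <- replicate(10000, mean(sample(c(0, 1), size = n, replace = TRUE)))
c(var(xbars), 0.25 / n)  # both about 0.0025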

4.2 Asymptotic sequences


The law of large numbers provides a way to justify the claim that when n is large, X̄n will be close to µ
with high probability. The approximation X̄n ≈ µ lies at the heart of our claims to be learning about
an underlying population when we have a large sample.
In the next section, we’ll see that there is more than one way to develop a large-n approximation
to the distribution of a random variable. To talk about such approximations, it is useful to introduce
the idea of a sequence of random variables Zn , where n = 1, 2, . . . ∞. For example, we can consider the
sample mean X̄n —which is a random variable for any given n—across various possible sample sizes n.

4.2.1 The general problem


The primary motivation for considering such asymptotic sequences of random variables Zn is when Zn
represents a statistic θ̂—something that depends upon my data (see Definition 3.4). Since θ̂ is random
(it depends on the sample that I drew), I’d like to know something about its distribution. For example,
how likely is it that my sample mean is far from the population mean?
Definition 4.1. The sampling distribution of a statistic θ̂ is its CDF: F_{θ̂}(t) = P(θ̂ ≤ t).

When our statistic is computed as θ̂ = g(X1 , X2 , . . . Xn ) from an i.i.d sample of Xi , Fθ̂ depends upon
three things: the function g, the population distribution of Xi , and the sample size n.
Knowing the sampling distribution of a statistic is typically a hard problem. We know g and n,
but in a research setting we don’t generally know the CDF F that describes the underlying population.
However, if we view θ̂ as a point along a sequence of random variables Zn , it is often possible to say
something about the limiting behavior of FZn as n → ∞. Asymptotic theory is a set of tools for describing
this limiting behavior. The law of large numbers is one such tool. If we believe that our actual sample size n is large enough that F_{Z_n} ≈ F_{Z_∞}, then tools like the LLN can be extremely useful. For the sample mean, for example, we might, on the basis of the LLN, be prepared to believe that X̄_n is close to µ with
very high probability.
Conceptually, we can think of what we’re doing as follows. Suppose our sample size is n = 10, 576,
and we calculate a statistic θ̂ = g(X1 , X2 , . . . X10,576 ) from our sample. Now imagine applying the
same function g to various samples of size 1, 2, . . . and so on, and defining a sequence Z_1, Z_2, . . . of the
corresponding values. Each Z along this sequence is itself a random variable: let FZ1 , FZ2 , . . . be their
corresponding CDFs. Our statistic θ̂ can be seen as a specific point along this sequence: θ̂ = Z10,576
(circled in red in Figure 4.1). We don't know F_{Z_{10,576}}, but we can say something about F_{Z_∞}, so we use the latter as an approximation for the former. Figure 4.1 depicts this logic.

[Figure 4.1 here: a column of sample sizes n = 1, 2, . . . , 10,576, . . . , 100,000, . . . , ∞, alongside the corresponding random variables Z_n and their CDFs F_{Z_n}; the row for θ̂ = Z_{10,576} is circled in red, the limiting row Z_∞ is circled in green, and F_{Z_{100,000}} ≈ F_{Z_∞}.]

Figure 4.1: We are interested in the sampling distribution of some statistic θ̂, computed on our sample of 10,576 observations. This is in general hard to compute. As a tool, we imagine a sequence of random variables Z_1, Z_2, . . . in which θ̂ = Z_{10,576}. Asymptotic theory allows us to derive properties of F_{Z_∞}, the limiting distribution of Z_n as n → ∞ (circled in green). Then we use F_{Z_∞} as an approximation to F_{Z_{10,576}}, which we justify by n being “large”. The above figure depicts a situation in which Z_n = X̄_n, so that the distribution of Z_n narrows to a point as n → ∞ (by the LLN).

Of course, the above technique only works if we can say something definite about F_{Z_∞}. The law of large numbers says that we can when our statistic is the sample mean. In Section 4.4, we'll see that the central limit theorem provides even more information about the limiting distribution of the sample mean: that it will become approximately normal, regardless of F.

Note: The logic of Figure 4.1 is the “classical” approach to approximating the sampling distribution of θ̂, but it is certainly not the only one. An increasingly popular alternative involves bootstrap methods. These methods still appeal to n being “large enough”, but they do so in a different way. They also require computing power, because bootstrapping involves resampling new datasets from our original dataset X. This has become increasingly feasible, and bootstrap-based methods have become increasingly popular.

4.2.2 Example: LLN and the sample mean

Let's go through the logic of Figure 4.1 in more detail in the case of the law of large numbers. The LLN tells us that when we let the sample mean X̄_n define our asymptotic sequence Z_n, the resulting distributions F_{Z_n} eventually cluster all of their probability mass around the point µ, the population mean. Figure 4.2 illustrates this point through a simulation in R. I drew 1,000 i.i.d. samples of size n of a random variable X_i for which P(X_i = 0) = 1/2 and P(X_i = 1) = 1/2, representing a coin flip. Then, I plot a histogram of X̄_n across the 1,000 samples. This process is repeated for n = 2, n = 10, n = 100

and n = 1,000. You can think of this as illustrating Figure 4.1 for the specific population distribution
F that describes a coin-flip. With n = 2, we see that we have a 50% chance of getting X̄n of 0.5, which
is the true “population mean” of Xi : µ = E[Xi ] = 0.5. Then 25% of the time we get X̄n = 0 (two flips
of tails), and 25% of the time we get X̄n = 1 (two flips of heads). Thus, the distribution of X̄n is not
very well concentrated around µ = 0.5.
The red vertical lines in Figure 4.2 illustrate the law of large numbers in action. They mark the
points 0.45 and 0.55, which represent an ϵ = .05 in Theorem 1. We can see that by the time n = 100,
P (|X̄n − 1/2| > 0.05) starts to become reasonably small; roughly 1/3 of the mass of X̄n is outside of
[0.45, 0.55]. When n = 1000, there is an imperceptible chance of obtaining an X̄n outside of the vertical
red lines. If we continued this process for larger and larger n, we would see the mass of X̄n continue to
cluster closer and closer to µ = 1/2. Regardless of how small an ϵ we choose, we can always find an n
that fits as much of the mass as we want inside the corresponding red lines.
Note that the law of large numbers does not say that P(|X̄_n − µ| > ϵ) must monotonically decrease with n, for each n. For example, we can see that for ϵ = .05, P(|X̄_2 − µ| > ϵ) is 0.5, while P(|X̄_{10} − µ| > ϵ) is about 0.75. All that the LLN says is that P(|X̄_n − µ| > ϵ) eventually gets (arbitrarily) small as n grows, for any value of ϵ.

Figure 4.2: Distributions along the sequence X̄_n for a set of n i.i.d. coin flips. Red lines illustrate the mass of the distribution of X̄_n that is more than .05 away from 1/2.

The following is the R code I used to generate this figure, if you’d like to copy-paste it and experiment:

numsims <- 1000

par(mfrow = c(2, 2))  # arrange the four histograms in a 2x2 grid
for (n in c(2, 10, 100, 1000)) {
  results <- data.frame(simulation_num = integer(), sample_mean = double())
  for (x in 1:numsims) {
    thissample <- sample(c(0, 1), size = n, replace = TRUE)  # n coin flips
    samplemean <- mean(thissample)
    results[x, ] <- c(x, samplemean)
  }

  h <- hist(results$sample_mean, plot = FALSE,
            breaks = seq(from = 0, to = 1, by = .01))
  h$density <- h$density / 100  # convert density to proportion per bin
  plot(h, freq = FALSE,
       main = paste0("Distribution of sample means, n = ", n, " coin flips"),
       xlab = "Sample mean", ylab = "Proportion of samples", col = "green")
  abline(v = c(.45, .55), col = c("red", "red"))
}

4.3 Convergence in probability and convergence in distribution
Given a sequence of random variables or random vectors Z1 , Z2 , . . . , let us now define two notions of
convergence of the sequence Zn . The first is convergence in probability:
Definition 4.2. We say that Z_n converges in probability to Z if for any ϵ > 0:
$$\lim_{n\to\infty} P(||Z_n - Z|| > \epsilon) = 0$$

In this definition, Zn can be a random variable/vector. When Zn is a random variable, then the notation
||Zn − Z|| jut refers to the absolute value of the difference: |Zn − Z|. When Zn is a vector, we can take
||Zn − Z|| to be the Euclidean norm of the difference (see Proposition 4.2 for an example).

We will often talk about Zn converging in probability to a constant c. This does not require a second
definition, because a constant is simply an example of a random variable with the degenerate distribution
P (Z = c) = 1. Thus we say that Zn converges in probability to a constant c if limn→∞ P (|Zn − c| > ϵ) = 0
for all ϵ > 0.
Notation: When Zn converges in probability to Z, we write this as Zn →p Z, or alternatively plim(Zn ) =
Z. We say that Z is the probability limit of the sequence Zn . We use the same notation when Z is a
constant.

The law of large numbers, for example, says that X̄n →p µ: the sample mean converges in probability to
the “population mean”, or expectation, of Xi .

Exercise: This problem gives an example of a sequence that converges in probability to another random
variable, rather than to a constant. Let Zn = Z + X̄n , where Z is a random variable and X̄n is the
sample mean of i.i.d. random variables Xi having zero mean and finite variance. Suppose furthermore
that Z and X̄n are independent. Show that plim(Zn ) = Z.

Our second notion of convergence of a sequence of random vectors is convergence in distribution.


Consider first a sequence of scalar random variables:
Definition 4.3. We say that a random variable Zn converges in distribution to Z if, for any z such that
the CDF FZ (z) = P (Z ≤ z) of Z is continuous at z:

limn→∞ P (Zn ≤ z) = FZ (z)

Notation: When Zn converges in distribution to Z, we write this as Zn →d Z. As with convergence in
probability, Z can be a random vector or a constant.

Note: The requirement that we only consider z where FZ (z) is continuous is a technical condition,
which we can often ignore because we’ll be thinking about continuously distributed Z. In general, we
can construct examples in which limn→∞ P (Zn ≤ z) is not right-continuous (and is thus not a valid
CDF), but the valid CDF function FZ (z) nevertheless captures the limiting distribution of Zn . In these
cases we still want to say that Zn →d Z.

The definition given above for convergence in distribution takes Zn to be a (scalar) random variable to
emphasize the idea, but the concept extends naturally to sequences of random vectors. We say that a
sequence of random vectors Zn converges in distribution to Z if, for all z at which the joint CDF FZ (z)
of the components of Z does not have a discontinuity, the limit of the CDF of Zn evaluated at that point
as n → ∞ is FZ (z).

Convergence in distribution essentially says that the CDF of Zn converges point-wise to the CDF of
Z. By “point-wise”, we mean that this occurs for each value z. When Zn →d Z, we often refer to the
distribution of Z as the “large-sample” or “asymptotic” distribution of Zn .
We close this section by investigating the relationship between convergence in probability and con-
vergence in distribution. Convergence in distribution is a weaker notion of convergence (and is in fact
often called “weak” convergence), in the sense that it is implied by convergence in probability.
Proposition 4.1. If Zn →p Z, then Zn →d Z. In the special case that Z is a degenerate random variable
taking value c, then Zn →d c also implies Zn →p c. Thus when Z is degenerate, convergence in
distribution and convergence in probability are equivalent to one another.

One manifestation of the fact that convergence in probability is stronger than convergence in distribution
is that with the former, convergence of the elements of a random vector implies convergence of the whole
random vector:

Proposition 4.2. If Xn →p X and Yn →p Y , then (Xn , Yn )′ →p (X, Y )′ .

Proof. Fix any ϵ > 0, and let Zn := (Xn , Yn )′ and Z := (X, Y )′ , so that ||Zn − Z|| =
√((Xn − X)² + (Yn − Y )²). Note that ||Zn − Z|| > ϵ is the same as (Xn − X)² + (Yn − Y )² > ϵ², in
which case at least one of (Xn − X)² or (Yn − Y )² must be larger than half of ϵ². Thus:

P (||Zn − Z|| > ϵ) ≤ P ((Xn − X)² > ϵ²/2 or (Yn − Y )² > ϵ²/2)
≤ P ((Xn − X)² > ϵ²/2) + P ((Yn − Y )² > ϵ²/2)
= P (|Xn − X| > ϵ/√2) + P (|Yn − Y | > ϵ/√2)

Since Xn →p X and Yn →p Y , both terms in the last line converge to zero, and hence
limn→∞ P (||Zn − Z|| > ϵ) = 0.

Meanwhile, the same is not true of convergence in distribution: Xn →d X and Yn →d Y does not in
general imply that (Xn , Yn )′ →d (X, Y )′ . However, one important special case in which it does is when X or
Y is a degenerate random variable. This is useful, for example, in proving Slutsky’s Theorem in Section 4.5.

The next section will introduce the most famous and useful instance of convergence in distribution:
the central limit theorem (CLT). After introducing the CLT, we will return in Section 4.5 to some
further properties of convergence in probability and convergence in distribution that will be useful in
the analysis of large samples.

Optional: There is an even stronger notion of convergence than convergence in probability, re-
ferred to as almost-sure convergence. We say that Zn converges almost surely to Z, or Zn →a.s. Z,
if

P (limn→∞ Zn = Z) = 1

To make sense of this expression we have to place a probability distribution over entire sequences
{Zn } (something we didn’t need to do for convergence in probability or convergence in distribu-
tion). That is, we imagine a probability space in which each outcome ω yields a realization
of all of the random variables: Z, Z1 , Z2 , Z3 , and so on. Then, the above expression says that
P ({ω ∈ Ω : limn→∞ Zn (ω) = Z(ω)}) = 1. In words: the probability of getting a sequence of Zn
that does not converge to Z as n grows is zero.

Almost sure convergence is stronger than convergence in probability, i.e. Zn →a.s. Z implies that
Zn →p Z (which of course in turn implies that Zn →d Z). The strong law of large numbers states
that the sample mean in fact converges almost surely to the population mean, that is X̄n →a.s. µ.
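To build intuition for the idea of a distribution over entire sequences, the following R sketch (my own addition, in the spirit of the earlier simulation code) plots a few realized sequences of running sample means of coin flips; the strong LLN says that, with probability one, each such path eventually settles down at µ = 0.5:

# A few sample paths of the running mean of fair coin flips. Almost-sure
# convergence is a statement about entire paths: with probability one, a
# realized path eventually stays arbitrarily close to mu = 0.5.
set.seed(1)
n <- 5000
paths <- sapply(1:5, function(s) cumsum(sample(c(0, 1), n, replace = TRUE)) / (1:n))
matplot(paths, type = "l", lty = 1, xlab = "n", ylab = "Running sample mean")
abline(h = 0.5, lty = 2)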

4.4 The central limit theorem


The central limit theorem (CLT) tells us that if we construct from the sample mean X̄n the random
variable Zn = √n(X̄n − µ), then the sequence Zn converges in distribution to a normal random
variable.

Theorem 2 (central limit theorem). If Xi are i.i.d. random vectors and E[Xi′ Xi ] < ∞, then

√n(X̄n − µ) →d N (0, Σ)

where Σ = V ar(Xi ), µ = E[Xi ], and 0 is a vector of zeros with one entry for each component of Xi .


The central limit theorem is quite remarkable. It says that whatever the distribution of Xi is, the limiting
distribution of X̄n (recentered by µ and rescaled by √n) will be a normal distribution. This striking
result will pave the way for us to perform inference on the expectation of a random variable, without
knowing its full distribution.

Why the CLT is useful:


The practical value of the CLT is that it delivers an approximation to the distribution of X̄n . For large
n, we know that √n(X̄n − µ) has approximately the distribution N (0, Σ). Using properties of the normal
distribution, we can re-arrange this to say that X̄n ∼ N (µ, Σ/n), approximately. To get a good guess of
the distribution of X̄n , we only need to have estimates of µ and Σ, which is much easier than estimating
the full CDF of Xi from data.

Example: Suppose for simplicity that we had reason to believe that Σ = 1, i.e. we have a random variable
Xi with a variance of 1. However, we don’t know µ. We do know, by the CLT, that for large n, X̄n
is approximately normally distributed around µ with a variance of 1/n. This is extremely useful, because
we can now evaluate candidate values of µ based on how unlikely we would be to see a value of X̄n like
the one that we calculate, if that value of µ were true. Suppose for example that n = 100, and in our
sample we observed that X̄n = 0.31. You want to evaluate the possibility that µ = 0. Well, if this were
the true value of µ, then given the asymptotic approximation that X̄n ∼ N (0, 1/n) (or equivalently, that
10 · X̄n ∼ N (0, 1)), we’d only expect to see a value of X̄n as large as 0.31 about once in 1,000 samples. We
might thus be willing to rule out µ = 0 as a possibility. This is an example of a hypothesis test, which
will be covered in Section 5.4.1.
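We can check the tail-probability arithmetic of this example directly in R:

# Under the candidate value mu = 0, the CLT approximation gives
# sqrt(n) * xbar ~ N(0, 1).
n <- 100
xbar <- 0.31
z <- sqrt(n) * xbar           # z = 3.1
pnorm(z, lower.tail = FALSE)  # ~0.00097: about once in 1,000 samples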


Figure 4.3: The same simulation as in Figure 4.2, except now we plot the distribution of n(X̄n − 1/2) rather
√ d
than of X̄n . The CLT tells us that n(X̄n − 1/2) → N (0, 1/4), since 1/4 is the variance of Xi . Green dashed
lines depict what is predicted by the distribution N (0, 1/4), which we can see becomes close to what we see for
larger values of n.

Illustrating the CLT:


Figures 4.2 and 4.3 illustrate the CLT in action. Recall that in this example Xi has a two-point
distribution: P (Xi = 0) = 1/2 and P (Xi = 1) = 1/2. The distribution of X̄n becomes closer and closer
to a normal distribution centered around µ = 1/2 as n gets large. To the eye, the distribution of X̄n
definitely does not look normal for n = 2 or for n = 10 in Figure 4.2. But by the time we have n = 100,
it starts to take on the bell-curve shape. We see the variance Σ/n falling as we compare n = 100 and
n = 1000: the latter has a variance about 1/10 as large. In Figure 4.3, we plot the distribution of
√n(X̄n − 1/2) overlaid with its limiting distribution.

Thought experiments like this simulation are useful for getting intuition about the CLT.
Accordingly, you often hear descriptions of the CLT along the lines of: “the sample mean becomes normal
as the sample gets bigger and bigger”. This isn’t wrong, but it can be a little misleading. A given real-world
sample never gets bigger: it always has a single finite size n! Similarly, the sample size n never “goes to
infinity”, though we can get pretty close by simulating a sequence of samples on a computer. Imagining
an infinite sequence of samples having means X̄1 , X̄2 , and so on, is just a useful abstraction.
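If you’d like to reproduce something like Figure 4.3 yourself, here is a minimal sketch adapting the earlier simulation code (the exact plotting details of the actual figure may differ):

# Simulate sqrt(n) * (xbar - 1/2) for coin flips and overlay the limiting
# N(0, 1/4) density as a dashed curve.
numsims <- 1000
n <- 1000
z <- replicate(numsims,
               sqrt(n) * (mean(sample(c(0, 1), size = n, replace = TRUE)) - 0.5))
hist(z, freq = FALSE, breaks = 30, col = "green",
     main = paste0("sqrt(n) * (sample mean - 1/2), n = ", n), xlab = "z")
curve(dnorm(x, mean = 0, sd = sqrt(1 / 4)), add = TRUE, lty = 2)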

The following proof of the CLT is not necessary for you to know, but you may find it interesting, and
being able to follow it is a good study device.

Proof of the CLT:

We’ll consider a proof for the univariate case, which can be extended to random vectors
using the Cramér-Wold theorem introduced in Section 4.5. The proof here will use the concept
of a moment generating function:
MX (t) := E[e^{t·Xi}] = 1 + t · E[Xi ] + (t²/2) · E[Xi²] + (t³/3!) · E[Xi³] + . . .    (4.1)

where the second equality uses the Taylor expansion of e^{tx}. This will be a useful expression for
the moment generating function MX (t). Note that MX (t) is a (non-random) function of t: the
randomness in Xi has been averaged out.

A useful result (that we will not prove here) is that if two random variables X and Y have the
same moment generating function MX (t) = MY (t) for all t, then they have the same distribution.
Our goal will be to show that whatever the distribution of Xi , the moment generating function
of √n(X̄n − µ) converges to that of a normal random variable with variance σ² = V ar(Xi ).

Let us divide out the variance to rewrite the CLT (in the univariate case) as √n · (X̄n − µ)/σ →d N (0, 1).
The moment generating function of the standard normal distribution is:

MZ (t) := (1/√2π) ∫ e^{tx} · e^{−x²/2} dx = e^{t²/2} · (1/√2π) ∫ e^{−(x−t)²/2} dx = e^{t²/2}

where we’ve used that tx − x²/2 = t²/2 − (x − t)²/2, and that the final integral is over the density of a
normal random variable with mean t and variance 1 (so it equals one).

Now for the magic part. We’ll show that whatever the distribution of Xi is, and hence whatever
the moment generating function of Xi , the moment generating function of

√n · (X̄n − µ)/σ = (1/√n) · (X1 − µ)/σ + (1/√n) · (X2 − µ)/σ + · · · + (1/√n) · (Xn − µ)/σ

will end up being e^{t²/2} in the limit!

First, note that when Y and Z are independent of one another, the moment generating function
of Y + Z is equal to the product of each of their moment generating functions, i.e. E[e^{t(Y+Z)}] =
E[e^{tY} · e^{tZ}] = E[e^{tY}] · E[e^{tZ}]. Applying this to the above expression, and using that the
Xi are i.i.d., we have that:

M_{√n·(X̄n−µ)/σ}(t) = M_{(1/√n)·(X1−µ)/σ}(t) · M_{(1/√n)·(X2−µ)/σ}(t) · · · · · M_{(1/√n)·(Xn−µ)/σ}(t) = [M_{(1/√n)·(Xi−µ)/σ}(t)]^n


Note that for any random variable Y , M_{(1/√n)·Y}(t) = MY (t/√n). Therefore, we wish to show that

limn→∞ [M_{(Xi−µ)/σ}(t/√n)]^n = e^{t²/2}

for any t.

Applying the Taylor series expansion of the moment generating function in Equation 4.1, we have
that:

M_{(Xi−µ)/σ}(t/√n) = 1 + (t/√n) · E[(Xi − µ)/σ] + (t²/2n) · E[((Xi − µ)/σ)²] + (t²/n) · g(t/√n)

where by the Taylor theorem limn→∞ g(t/√n) = 0. Note that E[(Xi − µ)/σ] = 0 and
E[((Xi − µ)/σ)²] = 1, and thus we wish to show that

limn→∞ (1 + t²/2n + (t²/n) · g(t/√n))^n = e^{t²/2}

Recall the identity that limn→∞ (1 + x/n)^n = e^x. If we could ignore the g term then we would be
done. To show that the g term indeed does not contribute in the limit, consider taking the natural
logarithm of both sides of the above equation (since the log is a continuous function, it preserves
limits):

limn→∞ ln[(1 + t²/2n + (t²/n) · g(t/√n))^n] = limn→∞ n · ln(1 + t²/2n + (t²/n) · g(t/√n))
= limn→∞ n · (t²/2n + (t²/n) · g(t/√n))
= t²/2 + t² · limn→∞ g(t/√n)
= t²/2

where we’ve used the Taylor theorem for the natural logarithm: ln(1 + z) = z + z · h(z) with
limz→0 h(z) = 0, applied with z = t²/2n + (t²/n) · g(t/√n), which indeed converges to zero as n → ∞.
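As a quick numerical sanity check of this argument (my own addition): for the coin-flip example, the standardized variable (Xi − µ)/σ equals ±1 with probability 1/2 each, so its moment generating function is cosh(t), and we can watch [cosh(t/√n)]^n approach e^{t²/2} in R:

# The standardized coin flip (X_i - 1/2)/(1/2) is +1 or -1 with probability
# 1/2 each, so its MGF is (e^t + e^-t)/2 = cosh(t).
t <- 1.5
for (n in c(10, 100, 10000)) print(cosh(t / sqrt(n))^n)  # 2.95, 3.07, 3.08
exp(t^2 / 2)                                             # the limit, ~3.08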

4.5 Properties of convergence of random variables


This section presents several results that are useful in the analysis of large samples. We will make heavy
use of them, for example, when we study the asymptotic properties of the linear regression estimator.

4.5.1 The continuous mapping theorem


The continuous mapping theorem (CMT) states that the notions of convergence in probability and
convergence in distribution are preserved when we apply a continuous function to each random vector
in a sequence Zn , that is:

Theorem 3 (continuous mapping theorem). Consider a sequence Zn of random vectors and a
continuous function h. Then:

• if Zn →p Z, then h(Zn ) →p h(Z)

• if Zn →d Z, then h(Zn ) →d h(Z)

Example: By the law of large numbers and the CMT: X̄n + 5 →p (µ + 5), where µ = E[Xi ].

Example: Let Zn = √n(X̄n − µ), and suppose for simplicity that V ar(Xi ) = 1. Then by the CLT and
CMT: Zn² = n(X̄n − µ)² →d χ²₁ , where χ²₁ is the chi-squared distribution with one degree of freedom
(this is the distribution of a standard normal N (0, 1) random variable squared).

Note: The assumption that h is (globally) continuous can be weakened, which is often important in
applications.

• When Z is a constant (call it c), then the convergence in probability part of the CMT only requires
that h(z) be continuous at c, rather than everywhere.

• The convergence in distribution part of the CMT can be extended to cases in which h has a set
of points z ∈ D at which it is discontinuous, provided that P (Z ∈ D) = 0. This is useful when
combined with the CLT, for which Z is continuously distributed. Hence applying a function h to
Zn = √n(X̄n − µ) allows us to use the CMT provided that h is discontinuous at only a measure-zero
set of points (for example, a discrete set of points).
A set of useful/common applications of the CMT is summarized by the so-called Slutsky’s Theorem:

Theorem 4 (Slutsky’s Theorem). Suppose Zn →d Z and Yn →p c with c a constant. Then:

• Zn + Yn →d Z + c

• Zn · Yn →d cZ

• Zn /Yn →d Z/c, if c ̸= 0.
To see how these results follow from Theorem 3, note that since c is a constant, Zn →d Z and Yn →p c is
equivalent to

(Zn , Yn )′ →d (Z, c)′

(see discussion following Proposition 4.2). Then we can apply the CMT to the sequence (Zn , Yn )′ , with
the following continuous functions h, respectively:

• h(Z, Y ) = Z + Y

• h(Z, Y ) = Z · Y

• h(Z, Y ) = Z/Y (continuous wherever Y ̸= 0)

A small simulation below illustrates the theorem in action.
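For instance (an illustration of mine, not from the text), Slutsky’s theorem is what justifies replacing the unknown σ by a consistent estimate when standardizing the sample mean. A quick simulation with Xi ∼ Exponential(1), so that µ = σ = 1:

# Zn = sqrt(n)*(xbar - mu)/sigma -> N(0,1) by the CLT, and Yn = s/sigma -> 1
# in probability, so by Slutsky the studentized mean Zn/Yn is asymptotically
# N(0,1) as well.
set.seed(7)
n <- 2000
t_stat <- replicate(5000, { x <- rexp(n); sqrt(n) * (mean(x) - 1) / sd(x) })
c(mean(t_stat), var(t_stat))  # should be close to 0 and 1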

4.5.2 The delta method


Note that when combined with the CLT, the continuous mapping theorem allows us to talk about the
asymptotic distribution of h(√n(X̄n − µ)) for a continuous function h. What is often more useful is to
talk about the asymptotic distribution of √n(h(X̄n ) − h(µ)). That is, when we apply a function h to our
sample mean, how does the limiting distribution of h(X̄n ) look as it converges around h(µ)? (Exercise:
which result allows us to know that h(X̄n ) does converge around h(µ)?)
The delta method gives us a tool to address exactly this question:
Theorem 5 (the delta method). If √n(Zn − µ) →d ξ for some random vector ξ, then if h(z) is
continuously differentiable in a neighborhood of z = µ:

√n(h(Zn ) − h(µ)) →d ∇h(µ)′ ξ

where ∇h(z) = (∂h(z)/∂z1 , ∂h(z)/∂z2 , . . . )′ is the vector of partial derivatives of h with respect to each
component of z.
Consider now what this implies in the case of the CLT:
Corollary 1. If Xi are i.i.d. random vectors, h(x) is a function that is continuously differentiable at
x = µ, and E[Xi′ Xi ] < ∞, then

√n(h(X̄n ) − h(µ)) →d N (0, ∇h(µ)′ Σ∇h(µ))

where Σ = V ar(Xi ) and µ = E[Xi ].

Proof. Beginning from Theorem 5, we only need to show that for a random vector Z ∼ N (0, Σ),
∇h(µ)′ Z ∼ N (0, ∇h(µ)′ Σ∇h(µ)). We can see this in two steps. First of all, since a linear combination
of normal random variables is also normal, we know that a′ Z is normal for any normally-distributed
k-component random vector Z and (non-random) k-component vector a. We thus need only to work out
the mean and variance of ∇h(µ)′ Z to characterize its full distribution. By linearity of the expectation,
E[∇h(µ)′ Z] = 0, since each component of Z has mean zero. You also showed in HW 3 that the variance
of a′ Z is a′ Σa. Substituting a = ∇h(µ) completes the proof.
The most important special case of the corollary above is when Xi is a scalar random variable. In this
case, we don’t need any matrix multiplication, and we have that:

√n(h(X̄n ) − h(µ)) →d N (0, ((d/dx)h(µ))² · σ²)

Note that if the function h is very sensitive to the value of x near µ, i.e. the derivative (d/dx)h(µ) has a
large magnitude, then the asymptotic variance of h(X̄n ) will be large, since the function h blows up the
variance of Xi by a factor of ((d/dx)h(µ))².
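As a concrete check (a sketch of my own, not from the text), take h(x) = log(x) with Xi ∼ Exponential(1), so that µ = 1, σ² = 1, and (d/dx)h(µ) = 1. The delta method then predicts that √n(log(X̄n ) − log(1)) is approximately N (0, 1), which we can verify by simulation:

# Delta-method check: h(x) = log(x) applied to means of Exponential(1) draws,
# where mu = 1, sigma^2 = 1, and h'(mu) = 1, so the limit should be N(0, 1).
set.seed(2)
n <- 5000
z <- replicate(2000, sqrt(n) * log(mean(rexp(n, rate = 1))))
c(mean(z), var(z))  # should be close to 0 and 1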

4.5.3 The Cramér–Wold theorem*


The following theorem, referred to as the Cramér–Wold theorem or the Cramér–Wold “device”, is another
tool in asymptotic analysis. We won’t find it as useful as the CMT or the delta method, but it’s worth
seeing, so I mention it here:

Theorem 6 (the Cramér–Wold device). If Zn is a sequence of random vectors having k components,
then Zn →d Z if and only if a′ Zn →d a′ Z for all (non-random) k-component vectors a.
One very important application of the Cramér–Wold device is in extending the central limit theorem to
random vectors. In Section 4.4, we only proved the CLT for a random variable. The following exercise
asks you to derive the multivariate CLT from the univariate CLT.

Exercise: Use the Cramér–Wold device to show that if Theorem 2 applies to random variables Xi ,
then it applies to a random vector Xi = (X1i , X2i , . . . Xki )′ as well (assume that any necessary moments
exist).

4.6 Limit theorems for distribution functions*


While the law of large numbers might appear to be somewhat limited, in that it only talks about the
mean, it is surprisingly versatile. For example, it implies that sample probabilities converge to their
population counterparts. Suppose we have an i.i.d. collection of Xi and are interested in F (x), the
population CDF of Xi evaluated at some specific x. Then we can define Zi = 1(Xi ≤ x), a random
variable that takes a value of 1 if Xi ≤ x, and zero otherwise. Since the collection {Z1 , Z2 , . . . Zn } is
i.i.d, and has the finite mean:
E[Zi ] = E[1(Xi ≤ x)] = P (Xi ≤ x) = F (x)
the law of large numbers implies that the sample mean of Zi converges in probability to F (x). The
sample mean of Zi is simply
(1/n) Σ_{i=1}^n 1(Xi ≤ x) = (number of i for which Xi ≤ x) / (number of i in sample),

the proportion of the sample for which Xi ≤ x. When considering this quantity across all x, we call the
resulting function the empirical CDF of Xi , denoted as Fn (x).

Thus, for each x the empirical CDF evaluated at x converges in probability to the population CDF
evaluated at x, i.e. Fn (x) →p F (x). This result can be strengthened in two ways (which are not implied by
the weak law of large numbers). Consider the error in Fn (x) as an approximation of F (x), |Fn (x) − F (x)|
as a function of x. This may be larger or smaller depending on x. The Glivenko-Cantelli theorem states
that even the largest error, over all x, converges to zero, and furthermore that this convergence is almost
sure convergence (see box at the end of Section 4.3), rather than convergence in probability:
Theorem 7 (Glivenko-Cantelli theorem). If Xi are i.i.d., then:

sup_{x∈R} |Fn (x) − F (x)| →a.s. 0
We won’t use Theorem 7 in this class, but it can be useful for proving properties of asymptotic sequences
that involve quantities that cannot be written as a function of X̄n .
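That said, we can watch the Glivenko-Cantelli theorem at work in a simulation (a sketch of mine): below we compute the exact sup-distance between the empirical CDF of standard normal draws and the true CDF Φ, which shrinks as n grows:

# sup_x |Fn(x) - F(x)| for the step function Fn is attained at a data point,
# comparing F against both Fn(x_(i)) = i/n and its left limit (i-1)/n.
set.seed(3)
for (n in c(100, 10000)) {
  x <- sort(rnorm(n))
  i <- 1:n
  print(max(pmax(i / n - pnorm(x), pnorm(x) - (i - 1) / n)))
}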

Chapter 5

Statistical decision problems

This chapter presents a formal view of the goals of using statistics for econometrics. It starts with the
question: what is it that we would like to learn? Once we’ve defined our “parameter of interest”, we can
separate much of econometrics into three parts: identification, estimation and inference.

I will not attempt a thorough or rigorous treatment of many of the concepts this chapter touches
upon. Rather, I hope it can present a unified way to think about several concepts you have probably
seen in one form or another in previous courses, and serve either as a reference or a starting point to
exploring terms in econometrics as you come across them in your own research.

5.1 Step one: defining a parameter of interest


Why do we use statistics? A short answer is that we want to learn things about the world, and data is the
lens through which we investigate some population within it. A more careful answer, better aligned
with the specific approach that econometrics takes to using statistics, is that there are specific features
θ of the world that we care about.

We can distinguish between three types of parameter of interest, θ.

First type (model parameters): Think back to the idea of a parametric statistical model, introduced in
Chapter 3.1. Suppose we observe i.i.d data Xi , where the distribution of Xi is thought to belong to
a parametric family F (·; θ) for some θ ∈ Θ. For example, we might be willing to assume that Xi is a
normally distributed random variable, with unknown mean µ ∈ R and variance σ 2 > 0. In this case,
θ = (µ, σ), and in the absence of any further assumptions about θ: Θ = R × R+ , where R+ is the set of
all positive real numbers (the standard deviation must be positive).
vector of model parameters θ to be our parameter of interest (of course, we might only be interested in
e.g. µ, in which case µ alone is our parameter of interest, and similarly with σ).

Second type (features of observed variables in the population): We don’t need parametric statistical mod-
els to talk about parameters of interest, however. If we have i.i.d. data drawn from any population
distribution F , we might think of some aspect of F that we’d like to know. For example, we might be
interested in E[Xi ], but don’t want to assume that Xi is normally distributed, as in the last example.
Then, our parameter of interest is θ = E[Xi ]. Another parameter of interest might be the median of F ,
the point x at which F (x) = 1/2. In this case, θ = inf{x : F (x) ≥ 1/2} (this general definition allows
for a non-continuous F , in which case there may be no x such that F (x) = 1/2 exactly).

Third type (quantities that depend on unobservables): One of the exciting and difficult things about
applied econometrics is that often our parameters of interest do not depend solely on the distribution F
of the vector of variables X that we observe in our data. Rather, θ often depends also on the distribution
of some other variables U that are not observed. This situation most often arises when discussing
causality, for example when our parameter of interest summarizes the causal effect of a policy. Talking
about causality involves some new notation and concepts, so we’ll defer further discussion to Chapter
6. As a simpler example of a situation that involves unobservables, let us consider a different important
practical problem: measurement error.

Suppose our parameter of interest is θ = E[Zi ], the average value of some random variable Zi .
However, our data was not recorded perfectly, and instead of an i.i.d sample of Zi , we observe an
i.i.d sample of Xi = Zi + Ui , where Ui represents unobserved “measurement error”. In this case, our
parameter of interest can be written as E[Xi − Ui ], which depends both upon the distribution of X and
the distribution of U .

5.2 Identification
Once we have a parameter of interest in mind, a good starting point is often to ask the question: “could
I determine the value of θ if I had access to the population distribution F underlying my data?”.
If the answer is no, then no amount of statistical wizardry will allow you to learn the value of θ. If
the answer is yes, then we say that θ is identified.
Definition 5.1. Given a statistical model F for (X, U ), we say that θ is identified when there is a
unique value θ0 of θ compatible with FX , the population CDF of observable variables X.
Often identification is described as saying that if we observed an “infinite” sample, we could determine
the value of θ. The reason for this is that by the law of large numbers, we can learn the entire population
distribution of X from an i.i.d sample Xi , as the sample size goes to infinity (see discussion in Section
4.6). Of course, we never observe an infinitely large dataset, but defining identification in terms of what
we could know if we did cleanly separates problems of research design from the statistical problem of
having too small a sample.

Whenever our parameter of interest is defined directly from the population distribution FX of observables
(e.g. θ = E[Xi ]), it will be identified. Thus, parameters of the second type are always identified. This
logic often applies to parameters of the first type as well, except in cases when F (·; θ) doesn’t always
change with θ (see example below). Questions of identification usually arise in the third case, when our
parameter of interest θ depends on the distribution of unobservables: for example when we’re interested
in causality, have measurement error, or have “simultaneous equations”.

Example: Suppose Xi are i.i.d draws from N (µ, σ 2 ). Then the parameters µ and σ are identified, because
each pair (µ, σ) gives rise to a different CDF FX of Xi .

Example: Suppose Xi are i.i.d draws from N (min{θ, 5}, σ 2 ). Then θ is not identified, because different
values of θ (e.g. θ = 6 vs. θ = 7) do not give rise to different CDFs of Xi .

Example: In the measurement error example, suppose that we’re willing to assume that E[Ui ] = 0,
that the measurement error averages out to zero (e.g. there are equal chances of getting positive and
negative errors of the same magnitude). Then θ = E[Zi ] is identified, since now E[Zi ] = E[Xi ]. This
example underscores the role of F in Definition 5.1. Whether or not θ is identified often depends on what
assumptions we are willing to make, which restrict the set F of possible joint-distributions for (X, U ).

Below I discuss some additional issues related to identification, which may relate to terms you’ve heard
floating around about identification:

Parametric vs. non-parametric identification: When F is a non-parametric statistical model, in
the sense described in Section 3.1, we say that θ is non-parametrically identified. We have non-
parametric identification when we do not need to specify a parametric functional form for the
distribution of observables or unobservables. Sometimes we only have parametric identification
but not non-parametric identification. Suppose, in the measurement error example, our parameter
of interest is the full distribution function FZ of Zi , and we are willing to assume that Ui ⊥ Zi . Then
FZ is identified if we are willing to specify the exact form of FU , e.g. Ui ∼ N (0, 1), through a
technique known as deconvolution. However, FZ is not non-parametrically identified.

Partial vs. point identification: Sometimes knowing FX is not enough to pin down the value of θ,
but it is enough to determine a set of values that θ might take. For example, we may be able to

determine upper and lower bounds for θ. In such cases we often say that θ is partially identified.
This can be contrasted with Definition 5.1, which describes point identification.

Identification of a parametric model: Suppose we have an i.i.d. sample of observables Xi and
a parametric statistical model for (Xi , Ui ), in the language of Section 3.1. Then we might say
that the model is identified when the full vector θ of model parameters is identified in the sense of
Definition 5.1:

Definition 5.2 (full identification of a model). Given a statistical model F for (X, U ), we
say that the model is identified when the set {θ ∈ Θ : FX (·) = FX (·, θ)} is a singleton,
where FX is the CDF of X.

Definition 5.2 says that there is a unique value θ0 ∈ Θ such that FX (·, θ0 ) is equivalent to the
population distribution of observables Xi . This situation arises often in econometrics in the
context of so-called structural models, in which the entire model can be characterized by a finite
set of model parameters.

5.3 Estimation
If our parameter of interest θ is identified, then we can move on to our next question: how can we
estimate it?

In this section, we treat the task of estimating θ as a decision problem. In the next section, we’ll take
the same approach to testing hypotheses about θ. This way of thinking about estimation and inference
is called statistical decision theory.
Let’s think about the task of estimating θ as a problem of choosing an optimal strategy in a particular
game, which we play along with “nature”. Nature goes first, giving us a sample X, the distribution of
which we denote abstractly as P (this is equivalent to the joint-CDF of all of the components of X). Our
goal is to think about how to form θ̂ = g(X) as a function of the data X. How should we proceed?
Recall that in game theory, a strategy is a complete profile of what we would do, given whatever the
other players do. In this context, a strategy is not a particular numerical estimate of θ, but the function
g. For example, if our estimator is the sample mean, then g(X) = g(X1 , X2 , . . . Xn ) = (1/n) Σ_{i=1}^n Xi ,
which will depend upon the particular values of Xi that occur in our sample.
As in game theory, our best response to the actions of nature will depend upon our preferences (a.k.a.
our utility function). In statistical decision theory this takes the form of a “loss function”: L(θ̂, θ0 ), where
θ0 is the true value of θ. For the most part, we consider the so-called quadratic loss function:

L(θ̂, θ0 ) = ||θ̂ − θ0 ||²₂ := (θ̂ − θ0 )′ (θ̂ − θ0 )

When θ is a scalar, this is just the square of the difference between our estimator θ̂ and the true
value θ0 .
However, remember that θ̂ = g(X) is a random variable/vector, which depends on our randomly
drawn dataset X. Thus to pick a strategy g, we need to define our preferences over “lotteries”, as
in standard game theory. In line with expected utility theory, the convention here is to take our optimal
action g to be the minimizer of expected loss E[L(θ̂, θ0 )], where the expectation is over the distribution
of X. The risk function Rg (θ) of estimator g views the expected loss as a function of the true value of
θ. It is common to write this as Eθ [L(θ̂, θ)], where the notation Eθ makes it clear that the distribution
of X must depend in some way on the value of θ. This is motivated by cases in which we have i.i.d.
data from a parametric statistical model where θ indexes the population distribution of Xi . Then the
distribution of X depends on just two things: n and the true value of θ.
When we use the quadratic loss function, the optimal estimator g would be

g∗ := argmin_g E[||g(X) − θ0 ||²₂]    (5.1)

However, solving this problem is not easy, because we generally don’t know the distribution of X ex-ante.
Nevertheless, statisticians have developed various strategies to try to keep E[||g(X) − θ0 ||²₂] small. These
strategies are best understood as ways to navigate the so-called bias-variance tradeoff. The following
proposition shows that expected quadratic loss can be decomposed into two terms: one capturing the
square of the “bias” of the estimator, and the other capturing its variance.
For simplicity, we state this result in the special case that θ is a scalar. We’ll also just write θ̂ rather
than g(X), to keep the notation simple.
Proposition 5.1 (the bias-variance decomposition).

E[(θ̂ − θ0 )²] = E[(θ̂ − E[θ̂])²] + (E[θ̂] − θ0 )²    (5.2)

The left-hand side is the expected loss; the first term on the right is the variance of θ̂, and the second
term is the square of the bias of θ̂.

Proof. Add and subtract E[θ̂] to obtain:

E[{(θ̂ − E[θ̂]) + (E[θ̂] − θ0 )}²] = E[(θ̂ − E[θ̂])²] + 2 · E[(θ̂ − E[θ̂])(E[θ̂] − θ0 )] + (E[θ̂] − θ0 )²

Now observe that the middle term is zero, because

E[(θ̂ − E[θ̂])(E[θ̂] − θ0 )] = (E[θ̂] − θ0 ) · E[θ̂ − E[θ̂]] = (E[θ̂] − θ0 ) · (E[θ̂] − E[θ̂]) = 0

since (E[θ̂] − θ0 ) is just a non-random number.


Equation (5.2) is described as a bias-variance tradeoff because often strategies to decrease bias come
at the expense of increasing variance, and vice-versa. Suppose for example that we just pick g(·) = 5,
estimating θ to be 5, regardless of what sample we see. This estimator will have zero variance! But we
can expect the bias 5 − θ0 to be quite large. On the other hand, extremely flexible estimation methods
are often good at minimizing bias, but doing so may increase variance. The field of non-parametric
estimation chooses estimators to explicitly navigate this tradeoff.
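We can check the decomposition numerically. As an illustration of my own, here each side of Equation (5.2) is approximated by Monte Carlo for a deliberately biased estimator (the same one that reappears in Section 5.3.1.3 below):

# Monte Carlo check of E[(thetahat - theta0)^2] = variance + bias^2, for the
# biased estimator thetahat = ((n+1)/n) * xbar, with X_i ~ N(2, 1) and n = 20.
set.seed(4)
n <- 20; theta0 <- 2
thetahat <- replicate(1e5, ((n + 1) / n) * mean(rnorm(n, mean = theta0)))
mean((thetahat - theta0)^2)                  # expected loss
var(thetahat) + (mean(thetahat) - theta0)^2  # variance + squared bias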

5.3.1 Desirable properties of an estimator


This section investigates some desirable properties of an estimator, in light of the bias-variance tradeoff.
It is not meant to deliver a detailed account of these properties, but simply to serve as a reference for
what the associated terms mean. Please see the course textbooks for more details.

5.3.1.1 Consistency
The first thing that we might ask of our estimator is that it be consistent. What we mean by that is that

θ̂ →p θ0

regardless of the value of θ0 . Loosely speaking, consistency means that as n goes to infinity, the estimator
concentrates all of its probability mass around θ0 (and, under mild additional conditions, the entire
expected loss in Equation (5.2) converges to zero).

5.3.1.2 Rate of convergence


Consider the sample mean X̄n viewed as an estimator of the population mean µ = E[Xi ]. We know by
the LLN that X̄n is a consistent estimator of µ, and we furthermore know by the CLT that

n^p (X̄n − µ) →d N (0, V ar(Xi ))

if we set p = 1/2. Note that if we set the power p on n to be any larger than 1/2, then the LHS
would blow up, rather than converging in distribution to anything (like a normal distribution). On the
other hand, if we had set p < 1/2, then n^p (X̄n − µ) would simply converge in probability to zero. 1/2 is
the “Goldilocks” level of p at which we get a non-degenerate asymptotic distribution for n^p (X̄n − µ).
In general, when we have a consistent estimator θ̂, we call the maximum value of p such that n^p (θ̂ − θ0 )
converges in distribution to something (technically, to some distribution that is “bounded in probability”)
the rate of convergence of θ̂. The rate of convergence of the sample mean is 1/2, and we often say that it
is √n-consistent. √n-consistency is a desirable property, which is shared by many common estimators.
However, some estimators have a slower rate of convergence. For example, suppose we’d like to estimate
the density θ = f (x) of a d-dimensional random vector Xi at some point x, and we’d like to make this
estimation non-parametric, that is, not based on assuming a parametric model for f (x).

We can do so using the so-called kernel density estimator fˆK (x), which has a rate of convergence
no better than p = 2/(d + 4). When d = 1, for example, we can only blow up (fˆK (x) − f (x)) by a factor
of n^{2/5} and still get an asymptotic distribution. In practice, this means that we need a larger sample n
for asymptotic arguments to provide good approximations to the sampling distribution of fˆK (x). This
becomes a real problem as d starts to increase: for example, the rate of convergence of fˆK (x) when d = 5 is
just 2/9. This problem is often referred to as the curse of dimensionality, and is why we need very large
samples (and/or even more clever techniques) to do non-parametric estimation with many covariates.

5.3.1.3 Unbiasedness
An estimator is unbiased if it manages to make the second term in Equation (5.2) zero, that is:

E[θ̂] = θ0
Unbiasedness has a nice interpretation: we know that θ̂ ̸= θ0 in general, but we know that θ̂ will be right
on average, over different realizations of our dataset.
An example of an unbiased estimator is the sample mean, when our parameter of interest θ is the
population mean E[Xi ]. In Section 4.1, we showed indeed that E[X̄n ] = E[Xi ], regardless of n or the
true value of E[Xi ] (so long as it exists).
Note that an estimator can be consistent without being unbiased. For example, the estimator
θ̂ = ((n + 1)/n) · X̄n is biased as an estimator for θ0 = E[Xi ], because

E[θ̂] − θ0 = ((n + 1)/n) · θ0 − θ0 = θ0 /n ̸= 0

unless θ0 = 0. However, this θ̂ is consistent. This implies that as n approaches infinity, both its bias
and its variance converge to zero. If an estimator has an asymptotic bias (that is, a bias that doesn’t go
away with n), then it cannot be consistent.

5.3.1.4 Efficiency
Econometricians often speak of an estimator as being efficient. Loosely speaking, this typically means
that θ̂ minimizes mean squared error (5.2) among some class of estimators.

For example, we might consider the class of unbiased estimators, and ask whether a given θ̂ minimizes
Eq. (5.2). Since the bias term is zero for all estimators in this class, the efficient estimator will be the
one that minimizes variance.

In the context of parametric models, the Cramer-Rao lower-bound establishes the smallest variance that
an unbiased estimator can possibly have (even when θ is a vector, though the definition of “smallest”
here requires qualification). The maximum likelihood estimator, discussed in the next section, achieves
this bound: it is thus efficient, whenever it happens to be unbiased (which is not guaranteed in general).

A related notion is asymptotic efficiency, which says that an estimator is efficient as n → ∞.

5.3.2 Important types of estimator*


In this section we’ll introduce a few types of estimator that are common or important in econometrics.
In light of the asymptotic theorems of Chapter 4, it should be clear that when θ can be expressed
as a continuous function of the population expectations of various random variables, an estimator
composed of that same function applied to sample means might be expected to have desirable properties.
The ordinary least-squares (OLS) linear regression estimator can be seen as an estimator of this kind,
and Chapter 7 will offer a detailed look at it.

In this section, I present a few prominent examples of a category of estimator called extremum-
estimators. This category overlaps the one I just described: OLS can also be seen as an extremum
estimator. An extremum-estimator takes the form:

θ̂ = argmin_{θ∈Θ} Q(X, θ)    (5.3)

where Q(X, θ) is some function of θ and our sample. Why might (5.3) be a good way to construct an
estimator? The basic idea is to find cases in which Q(X, θ) →p Q0 (θ) for some function Q0 (θ) which has
a unique minimizer within a space Θ of possible values for θ. As the sample gets larger, our θ̂ (which
solves the sample minimization problem) converges to θ0 , which minimizes the population objective
Q0 (θ). The function Q might be provided by economic theory, a statistical model, or some combination
of the two.
An important subclass of extremum estimators are those in which Q(X, θ) takes the form of a sample
mean: Q(X, θ) = (1/n) · Σ_{i=1}^n m(Xi , θ) for some function m(Xi , θ). Such estimators are called
M-estimators. Given the law of large numbers, M-estimation will work well, in the sense of at least being
consistent, under fairly general technical conditions on m and the distribution of Xi (which we will not
discuss here).

5.3.2.1 Method of moments*


The basic method of moments considers a k−dimensional parameter-of-interest θ, and a set of l moment-
conditions:

E[g1 (Xi , θ)] = 0


E[g2 (Xi , θ)] = 0
...
E[gl (Xi , θ)] = 0 (5.4)

which we can gather into a vector equation E[g(Xi , θ)] = 0. The most prominent example of a model
like (5.4) is the linear regression model, which we will study in detail in Chapter 7. Another example
is the linear instrumental variables model with l instruments and k endogenous variables. The moment
conditions (5.4) are a statistical model, in the sense of Section 3.1, because they restrict the distribution
of Xi to be among those for which E[g(Xi , θ0 )] = 0, where θ0 is the true value of θ. If Equation (5.4)
has only one solution (which must then be equal to θ0 ), then the parameter θ is identified. In general, a
necessary condition for identification is l ≥ k.

The basic method-of-moments estimator θ̂MM considers the case in which k = l, i.e. we have the
same number of moments as parameters we wish to estimate. In this setting, we simply replace the
population expectations in (5.4) by their sample counterparts; that is, we choose θ̂MM such that
(1/n) Σ_{i=1}^n g(Xi , θ̂MM ) = 0. Note that this equation is the first-order condition of the extremum
estimator in which

Q(X, θ) = ||(1/n) Σ_{i=1}^n g(Xi , θ)||²₂ = ((1/n) Σ_{i=1}^n g(Xi , θ))′ ((1/n) Σ_{i=1}^n g(Xi , θ))
The most famous example of a method-of-moments estimator is the ordinary least squares estimator for
the linear regression model, which we will study in depth in Section 7.5.

The so-called generalized method of moments (GMM) considers the more general case in which l ≥ k.
When we have more moment equations than parameters, we aggregate over them with some k × l matrix
B to yield k equations: B · E[g(Xi , θ)] = 0, which are linear combinations of the original ones (where now
0 has k components). The GMM estimator θ̂GMM minimizes the sample analog of this set of equations:
θ̂GMM = argmin_{θ∈Θ} Q(X, θ), where

Q(X, θ) = ||B · (1/n) Σ_{i=1}^n g(Xi , θ)||²₂ = ((1/n) Σ_{i=1}^n g(Xi , θ))′ W ((1/n) Σ_{i=1}^n g(Xi , θ))

where W := B′ B is an l × l positive semi-definite matrix. Note that θ̂GMM depends on our choice
of B through W. Efficient GMM uses a data-driven weighting matrix Ŵ in order to minimize the
asymptotic variance of θ̂GMM .

Note: A related method uses moment inequalities, in which the equals signs in Equation (5.4) are re-
placed with inequalities ≤. Models with moment inequalities typically result in partial identification of θ.

Note that the method of moments does not require us to have a fully-specified statistical model F (·; θ),
in the sense of Section 3.1. Rather, it is typically sufficient to have l ≥ k moment conditions of the form
of Equation (5.4) that involve our parameter of interest θ. The method of moments can thus be thought
of as employing a semi-parametric model.
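As a simple illustration (mine, not from the text) of the basic case k = l = 2, we can estimate the shape and rate parameters of a Gamma distribution by matching its first two moments, using E[Xi ] = shape/rate and V ar(Xi ) = shape/rate²:

# Basic method of moments for Gamma(shape, rate): solve the two sample moment
# equations mean = shape/rate and variance = shape/rate^2 for the parameters.
set.seed(5)
x <- rgamma(5000, shape = 2, rate = 3)
m1 <- mean(x); v <- mean(x^2) - m1^2  # first two sample moments
rate_hat <- m1 / v
shape_hat <- m1 * rate_hat
c(shape_hat, rate_hat)                # should be close to (2, 3)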

5.3.2.2 Maximum likelihood estimation*


When we do have a fully parametric model of the form F (·; θ), then maximum likelihood estimation is
often a good bet. The maximum likelihood approach to estimation begins with an i.i.d. sample Xi , in
which we assume a parametric model F (x; θ) for the population distribution of Xi . For simplicity of
notation, suppose that Xi is continuously distributed so that F (x; θ) admits a density function f (x; θ).
Consider the function L(θ) = E[log(f (Xi ; θ))]. It’s not obvious, but we have the following very
interesting result:

Proposition 5.2. The true value of θ0 maximizes L(θ).


Proof. First write, for any θ ̸= θ0 :

L(θ) − L(θ0 ) = E[log(f (Xi , θ))] − E[log(f (Xi , θ0 ))] = E[log(f (Xi , θ)) − log(f (Xi , θ0 ))] = E[log(f (Xi , θ)/f (Xi , θ0 ))]

Because log is a concave function, we have by a property of expectation known as Jensen’s inequality
that E[log(f (Xi , θ)/f (Xi , θ0 ))] ≤ log(E[f (Xi , θ)/f (Xi , θ0 )]). Now, since f (·, θ0 ) is the true density of Xi ,
this is in turn equal to

log(E[f (Xi , θ)/f (Xi , θ0 )]) = log(∫ (f (x, θ)/f (x, θ0 )) · f (x, θ0 ) · dx) = log(∫ f (x, θ) · dx) = log(1) = 0

Thus, we’ve shown that L(θ) − L(θ0 ) ≤ 0 for any θ!

Note: Provided that P (f (Xi , θ) ̸= f (Xi , θ0 )) > 0 for any θ ̸= θ0 , the above can be strengthened to say
that θ0 is the unique maximizer of L(θ).

The maximum likelihood estimator θ̂MLE maximizes the sample analog of L(θ) with respect to θ. In
particular, let the likelihood function of the data L(X; θ) be:

L(X; θ) := ∏_{i=1}^n f (Xi , θ)

Given that the Xi are i.i.d., L(X; θ) is the joint density of the observed dataset X = {X1 , X2 , . . . Xn },
given the model f (·; θ). Maximizing L(X; θ) with respect to θ is equivalent to maximizing the logarithm
of L(X; θ) with respect to θ, leading to the more familiar “log-likelihood” expression for θ̂MLE :

θ̂MLE = argmax_{θ∈Θ} log(∏_{i=1}^n f (Xi , θ)) = argmax_{θ∈Θ} (1/n) Σ_{i=1}^n log(f (Xi , θ))

where the quantity being maximized is the “log-likelihood function”, and we insert a factor of 1/n (which
doesn’t change the argmax) in order to view it as a sample analog of Proposition 5.2. This problem has
a unique solution if, for example, the log-likelihood function log(L(X; θ)) is globally concave, as it is in
many familiar models (e.g. the “probit” model). The above expression also reveals that θ̂MLE is an
example of an M-estimator, in which the function m(x, θ) = log(f (x, θ)).
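As a small illustration (my own), here is θ̂MLE for an Exponential(rate) model, computed by numerically maximizing the log-likelihood; in this model the maximizer is also available in closed form as 1/X̄n , which the numerical answer should match:

# Maximum likelihood for X_i ~ Exponential(rate), maximizing the
# log-likelihood sum_i log f(X_i; rate) numerically.
set.seed(6)
x <- rexp(1000, rate = 2)
loglik <- function(rate) sum(dexp(x, rate = rate, log = TRUE))
optimize(loglik, interval = c(0.01, 10), maximum = TRUE)$maximum  # ~2
1 / mean(x)  # the closed-form MLE, for comparison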
When θ̂MLE is an unbiased estimator (which is only true in some applications), it is efficient among
all unbiased estimators (look up the Cramer-Rao bound for details). It is also asymptotically efficient
among so-called √n-consistent “regular” estimators. These properties however require the model f (x; θ)

to be correctly specified, which may be a hard assumption to defend. When the model is misspecified,
then generally θ̂MLE →p θ∗ , where

θ∗ = argmin_{θ∈Θ} −E[log(f (Xi ; θ)/f (Xi ))]

where f (·) is whatever the true density function of Xi is. The quantity inside the minimization is
called the Kullback-Leibler divergence between f (x; θ) and f (x), and can be thought of as a kind of
“distance” between the two density functions. Thus, when the model is misspecified, θ̂MLE is consistent
for a “pseudo-parameter” θ∗ : the value of θ whose implied density f (·; θ) is closest to the true density,
as measured by the Kullback-Leibler divergence. Misspecification occurs when the family
{f (·; θ) : θ ∈ Θ} is not sufficiently big or rich to contain the true density, i.e. when there is no θ0 ∈ Θ
with f (x) = f (x; θ0 ).

5.3.2.3 Bayes estimators*


An alternative approach to estimation is based on the principles of Bayesian statistics. The basic idea
of the Bayesian approach to estimation is to imagine that both our sample X and our parameter of
interest θ are random. This accords with the Bayesian notion that probabilities reflect degrees of belief.
You don’t need to take this interpretation seriously, however, to define a Bayes estimator. You can think
of it instead as a way to define a class of interesting estimators for θ.
In our interpretation of (X, θ) as jointly random, the marginal distribution of θ is called the prior
and is denoted here as π(θ). It can be interpreted as our beliefs about θ before we see the data X. We
call the conditional distribution of θ given the data X, denoted π(θ|X), the posterior distribution of θ.
The two are related by Bayes rule (albeit with some different notation than we saw before):
π(θ|X) = (π(θ)/P (X)) · L(θ; X)    (5.5)
Note: We’re using somewhat abstract notation here, but π can be interpreted as a probability density
function over θ. P (X) can be interpreted as a density function or as a probability mass function of the
data. L(θ; X) is the likelihood function we saw in the last section on maximum likelihood estimation:
this corresponds to a parametric statistical model, and represents the probability of drawing sample X,
given θ.
Given our loss function L(θ̂, θ) as in Section 5.3, we can define expected loss over the full joint
distribution of (X, θ). This is now

E[L(θ̂, θ)] = E[E[L(θ̂, θ)|X]]    (5.6)

by the law of iterated expectations. The inner expectation represents an average over values of θ with
respect to the posterior distribution π(θ|X), and delivers a function of X (and of θ̂). The outer expectation
integrates over the marginal distribution of the data, just as in Equation 5.1.
Our goal is now to minimize (5.6) with respect to θ̂, to provide an estimate of θ0 . Note that since
θ̂ = g(X) is a function of X, the optimal g must minimize E[L(θ̂, θ)|X] for every possible value x of X
(otherwise we could just change g(x) for that x alone while decreasing (5.6)). Thus, we can focus on the
problem of minimizing the posterior risk E[L(θ̂, θ)|X] with respect to θ̂, for a given value of X. The
function of X that does this is called the Bayes estimator, θ̂B (let’s assume the problem is such that it
is unique).
When we use the quadratic loss function L(θ̂, θ) = (θ̂ − θ)², as in Equation 5.1 (focusing on the case
of a scalar θ for simplicity), our problem can now be written as

θ̂B = argmin_{θ̂∈Θ} E[L(θ̂, θ)|X] = argmin_{θ̂∈Θ} ∫ π(θ|X) · (θ̂ − θ)² · dθ

The first-order condition of this minimization problem yields that

θ̂B = ∫ π(θ|X) · θ · dθ,

i.e. the Bayes estimator θ̂B is equal to the so-called posterior mean of θ. Note that we can use Eq. (5.5) to
write the posterior mean in terms of our prior π on θ and the likelihood function:

θ̂B = P (X)⁻¹ · ∫ π(θ) · L(θ; X) · θ · dθ = [∫ π(θ) · L(θ; X) · θ · dθ] / [∫ π(θ) · L(θ; X) · dθ]
where we pull the P (X) factor out of the integral since it does not depend on θ, and then express it in
terms of the likelihood function, given that π(θ|X) has to integrate to one. Although θ̂B takes the form
of an extremum estimator in which Q(X, θ̂) = E[L(θ̂, θ)|X], we do not need to solve an explicit
minimization problem to compute the Bayes estimator. Rather, we only need to be able to evaluate the
above integrals (typically numerically), given the likelihood function implied by our model and the
prior π(θ).
θ̂B thus has two ingredients provided by the researcher: a statistical model which delivers L(θ; X), and
a prior π on θ. Since the value of θ̂B depends on how the researcher chooses this prior, you might wonder
how we could ever be confident that θ̂B could even be consistent for θ0 , even if the model is
correctly specified. Couldn’t you pick a really bad prior? The Bernstein-von Mises theorem says that
for finite-dimensional θ, the posterior mean and the maximum likelihood estimator θ̂MLE become
asymptotically equivalent under general conditions.
Essentially, the influence of the prior π(θ) wanes as the sample becomes large, and the posterior mean
ends up finding the peak of the likelihood function, just as θ̂MLE does. One way to see this is to take
the log of the posterior:

log(π(θ|X)) = log((π(θ)/P (X)) · ∏_{i=1}^n f (Xi , θ)) = −Cn + log(π(θ)) + Σ_{i=1}^n log(f (Xi , θ))

As n gets large, the data contribute n terms to this sum, while the prior always contributes the same
log(π(θ)). The other term Cn := log(P (X)) also grows with n, but doesn’t depend on θ. Thus, the
dependence of the posterior distribution on θ is entirely driven by the likelihood, asymptotically.
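As a concrete example (mine, not from the text): for coin flips with a Beta(a, b) prior on the heads probability θ, the posterior is Beta(a + #heads, b + #tails), so the posterior mean (the Bayes estimator under quadratic loss) has a simple closed form:

# Bayes estimator (posterior mean) for a Bernoulli(theta) model with a
# conjugate Beta(a, b) prior: the posterior is Beta(a + heads, b + tails).
a <- 1; b <- 1                  # uniform prior over [0, 1]
x <- c(1, 1, 0, 1, 0, 1, 1, 1)  # a hypothetical sample of coin flips
n <- length(x); heads <- sum(x)
(a + heads) / (a + b + n)       # posterior mean; compare to mean(x) = heads/n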

5.4 Inference*
In Section 5.3, our goal was to deliver a point estimate θ̂ of our parameter of interest. That is, we wanted
a number that yields something close to the true value θ0 of θ.

Sometimes we can settle for a less ambitious goal, which is to ask not what the exact value of θ0 is,
but whether or not θ0 belongs to some set of values. I will discuss two approaches
of this type: i) hypothesis testing, in which we want to test whether θ0 ∈ Θ0 for some fixed set Θ0 ; and
ii) interval estimation, in which we want to construct a set Θ̂ that has some desirable relationship to θ0
(for example, contains θ0 with high probability).

5.4.1 Hypothesis testing


Beginning with some overall space of admissible values Θ (e.g. the real numbers), let us carve the space
into two sets: Θ0 and Θ1 , where Θ0 ∪ Θ1 = Θ and Θ0 ∩ Θ1 = ∅. We call our hypothesis that θ0 ∈ Θ0
the null-hypothesis:

(Null hypothesis) H0 : θ 0 ∈ Θ0 (Alternative hypothesis) H1 : θ0 ∈ Θ1

Note that provided that our model θ0 ∈ Θ is correctly specified, either the null hypothesis H0 or the
alternative hypothesis H1 holds.
Continuing the approach of statistical decision theory, we may now think of our action space as
consisting of two actions d ∈ {a, r}: either accept (a) or reject (r) the null hypothesis H0 . This can
be contrasted with estimation, in which our action space was to pick a specific value in Θ to serve as an
estimate for θ.
In this context, a strategy is a mapping from the possible datasets X that we might see to an action in
{a, r}. This function d(X) is referred to as a decision rule, or a test. To think about what kind of test
might be optimal, we again need to specify our preferences, or a loss function, over actions. Compared
with estimation, in which our loss function took the form L(θ̂, θ), it now takes the form L(d, θ0 ): how
happy would we be with our decision d ∈ {a, r}, if we learned the true value of θ was θ0 ?

Compared with estimation, where the quadratic loss function is very standard, in testing it is less
obvious what our loss function should be. One thing is clear, however: we’d prefer not to be wrong. We
don’t want to reject the null hypothesis when in fact θ0 ∈ Θ0 , and we also don’t want to accept the null
hypothesis when in fact θ0 ∈ Θ1 . The first of these errors is called a Type-I error (falsely rejecting H0 ),
while the second is called a Type-II error (incorrectly accepting H0 ).

The most basic loss function we might think of is called 0-1 loss, and only cares about whether we
are right or not: L(d, θ0 ) = 0 when either d = a and θ0 ∈ Θ0 or d = r and θ0 ∈ Θ1 (i.e. we are right), and
L(d, θ0 ) = 1 otherwise (we are wrong). Recall that since X is random, our decision d(X) will be random,
and thus we can again think about the risk, or expected loss, of a particular strategy d. With the
0-1 loss function:

E[L(d(X), θ0 )] = P (d(X) = r) if θ0 ∈ Θ0 (the probability of a Type-I error), and
E[L(d(X), θ0 )] = P (d(X) = a) if θ0 ∈ Θ1 (the probability of a Type-II error).

It is clear from the above that whether or not the null is actually true determines which probability
matters in determining the risk of the test.
Since the value of θ pins down some aspect of the distribution of X, the probability of rejecting
the null will depend upon what the true value of θ0 in fact is. Like the risk function that we saw in
estimation, let us use the notation Pθ (d(X) = r) to denote the probability of rejecting when the true
value is θ. Viewing this rejection probability as a function of θ defines the power function β(θ) := Pθ (d(X) = r) of the test d.
Beyond the 0-1 loss function, we might put a different penalty on Type-I vs. Type-II errors:

L(a, θ0 ) = 0 if θ0 ∈ Θ0 and L(a, θ0 ) = ℓII if θ0 ∈/ Θ0 , while L(r, θ0 ) = 0 if θ0 ∈ Θ1 and L(r, θ0 ) = ℓI if θ0 ∈/ Θ1

The ratio ℓII /ℓI will govern whether our test d should be more conservative about avoiding Type-I errors,
or about avoiding Type-II errors.

5.4.2 Desirable properties of a test


As with estimation problems, choosing the optimal test d is a hard problem because we don’t know the
distribution of X; we can only approximate it using the dataset X that we actually observe, along with
whatever assumptions we are willing to make. As again with estimation, there are a few principles that
are used to help guide the design of statistical tests.

5.4.2.1 Size
The size α of a test d is the maximum probability of making a Type-I error (falsely rejecting), over all
θ ∈ Θ0 . We can write this in terms of the power-function β(θ) as:

α = sup_{θ∈Θ0} β(θ)

We’d like the size of a test d to be small; we therefore often design tests to control their size (keep it
below a certain value). Often we can do this in the asymptotic limit (as n → ∞) even if we do not know
the size of the test in finite samples.

5.4.2.2 Power
The power of a test is given by its power function β(θ). We generally want to increase β(θ) among the
θ ∈ Θ1 , to reduce the probability of a Type-II error.

5.4.2.3 Navigating the tradeoff


In general, the two desiderata of a) a small size and b) large power are in tension with one another. A
test that always rejects, regardless of the data X, will never make a Type-II error (it has lots of power),
but may be extremely likely to make a Type-I error (it has large size). On the other hand, a test that
always accepts will never make a Type-I error (it has low size) but may make lots of Type-II errors
(it has low power). Often we approach testing by choosing a significance level p ex-ante (e.g. p = .05),
and then designing the test so that its size is no greater than p. Given that constraint, we then try to make
the power of the test as large as possible (which usually means making its size exactly p).

5.4.3 Constructing a hypothesis test
The most common variety of hypothesis test takes the following form: from the data X we compute
some test statistic, call it Tn . Then we compare Tn to some critical value c, and choose to reject the
null-hypothesis if and only if |Tn | exceeds the critical value (a so-called two-sided test), or alternatively
if Tn exceeds the critical value (a so-called one-sided test).
Tests of this form are usually motivated by knowing the asymptotic distribution of Tn, i.e. Tn →d T
where T has some known distribution. Then we can control the size of our test by choosing c such
that P(T ≤ c) ≥ 1 − α (for a two-sided test, applying this to |T|). We then maximize power subject to
this constraint on size by choosing c to be exactly the 1 − α quantile of T (and no lower), so that
P(T ≤ c) = 1 − α.
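
As a concrete (if schematic) illustration, here is how such a test might look in code. This Python sketch is mine, not from the notes, and assumes a statistic Tn that is (asymptotically) standard normal under the null:

from scipy import stats

# A minimal sketch: tests based on a statistic Tn that is
# (asymptotically) standard normal under the null hypothesis.

def two_sided_test(Tn, alpha=0.05):
    c = stats.norm.ppf(1 - alpha / 2)   # critical value: the 1 - alpha/2 quantile of N(0,1)
    return "r" if abs(Tn) > c else "a"  # reject iff |Tn| exceeds c

def one_sided_test(Tn, alpha=0.05):
    c = stats.norm.ppf(1 - alpha)       # critical value: the 1 - alpha quantile of N(0,1)
    return "r" if Tn > c else "a"       # reject iff Tn exceeds c

print(two_sided_test(2.3))              # 'r', since |2.3| > 1.96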

Example: Let us close by illustrating some of the concepts of this section with an example. Suppose our
statistical model is that Xi ∼ N(θ0, 1), i.e. a normal random variable with unit variance but unknown
mean θ0. We wish to test H0 : θ0 = 0, that is: Θ0 = {0} and Θ1 = R \ {0}. Let our test statistic
be √n times the sample mean: Tn = √n · X̄n. Given our model, the sample mean has the exact distribution
X̄n ∼ N(θ0, 1/n) for any n, and hence Tn ∼ N(√n · θ0, 1). Under the null, Tn is a standard normal (since
then θ0 = 0) and hence for a two-sided test we can choose our critical value c to be the 1 − α/2 quantile
of the standard normal distribution (then, under the null, P(|Tn| > c) = P(Tn < −c) + P(Tn > c) = α/2 + α/2 = α). Note
that the power function β(θ) of this test is the probability that a N(√n·θ, 1) random variable has absolute
value greater than c, which is equal to 1 − Φ(c − √n·θ) + Φ(−c − √n·θ), where Φ denotes the standard normal CDF.
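
A small sketch (mine) of this power function in code, for a hypothetical sample size n = 25:

import numpy as np
from scipy.stats import norm

# Power of the two-sided test above:
# beta(theta) = 1 - Phi(c - sqrt(n)*theta) + Phi(-c - sqrt(n)*theta)
alpha, n = 0.05, 25
c = norm.ppf(1 - alpha / 2)

def beta(theta):
    return 1 - norm.cdf(c - np.sqrt(n) * theta) + norm.cdf(-c - np.sqrt(n) * theta)

print(round(beta(0.0), 3))   # equals the size alpha = 0.05 at theta = 0
print(round(beta(0.6), 3))   # power approaches 1 as theta moves away from 0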

5.4.4 Interval estimation and confidence intervals


The goal of interval estimation is to choose a set Θ̂ of values that with high probability contains the
true value θ0. We call this interval estimation because Θ̂ typically corresponds to an interval [a, b] (if
θ is one-dimensional), or some higher-dimensional analog of an interval (e.g. a region). By contrast,
estimation in the sense of Section 5.3 is referred to as point-estimation.
As with point estimation and testing, our action Θ̂ is a function of the data (however now this is a
set-valued function)—call it s(X). The coverage probability of an interval estimator s is the probability
that it contains the true value θ0. Here the tradeoff is between increasing the coverage probability
without making the interval too big (in which case we haven't learned much about the value of
θ0 ). Thus with interval estimation, we might define our loss function to depend both on the coverage
probability and the length of the interval estimate.
As with estimation, we do care about the specific value of θ, not just whether or not some hypothesis
H0 about it is true. However, we’ll now see that there is a very close connection between interval
estimation and hypothesis testing.
One scenario in which we might implement interval estimation is when our parameter of interest is
only partially identified (see Section 5.2). In such a setting, for example, our model might only imply
that θ0 ∈ [θL , θH ], where the bounds θL and θH are themselves point identified. Then we can construct
an interval estimate of θ0 with the set Θ̂ = [θ̂L , θ̂H ], given estimators of each of the two bounds.
The much more common scenario in which we engage in interval estimation is when constructing a
confidence interval for θ0 . We do this even when θ0 is identified and we have a consistent estimator for
it. A confidence interval makes a much more credible claim than a point estimate. In fact, point-estimation
is just a special case of interval estimation in which we constrain our Θ̂ to be a singleton. While
singleton sets will typically have zero probability of containing θ0 (though they may be very close to it with
high probability), confidence intervals allow us to deliver an interval estimate of θ0 that takes sampling
uncertainty into account.

5.4.4.1 Confidence intervals by test inversion


The most popular method for constructing confidence intervals is to perform a hypothesis test dθ having
size α for the null H0 : θ0 = θ, for each conceivable value of θ. Then, collect the set of all values θ that
are not rejected by the corresponding test to form our interval estimate of θ0. That is:

Θ̂ = {θ ∈ Θ : dθ(X) = a}
This process is often referred to as test inversion, and the resulting Θ̂ is called a (1 − α)-confidence
interval CI 1−α . For example, if we used a test with size 5%, then the resulting confidence interval is
called a 95% confidence interval.

Example: Suppose we apply this principle to the example in Section 5.4.3 in which Xi ∼ N(θ0, 1). There
we constructed a test for the null hypothesis that θ0 = 0, but now we need to consider more general
hypotheses of the form H0 : θ0 = θ. If we revise our test statistic to be Tn(θ) = √n · (X̄n − θ), then
Tn(θ) has a standard normal distribution exactly, under the null that θ0 = θ, and thus our critical value c is
unchanged from the θ = 0 case. A 1 − α confidence interval would thus be:

CI 1−α = {θ ∈ R : |Tn(θ)| ≤ c} = [X̄n − c/√n, X̄n + c/√n]

where c is the 1 − α/2 quantile of the standard normal distribution.
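
A small sketch (mine, with simulated data) of computing this interval:

import numpy as np
from scipy.stats import norm

# The (1 - alpha) confidence interval from the example, assuming the
# data are drawn i.i.d. from N(theta0, 1). Data simulated for illustration.
rng = np.random.default_rng(0)
theta0, n, alpha = 1.5, 100, 0.05
X = rng.normal(theta0, 1.0, size=n)

c = norm.ppf(1 - alpha / 2)                  # 1 - alpha/2 quantile of N(0,1)
lo, hi = X.mean() - c / np.sqrt(n), X.mean() + c / np.sqrt(n)
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")       # covers theta0 in ~95% of samples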

Chapter 6

Brief intro to causality

Most interesting questions in social science concern causality. We aren’t just interested in observing
what happens in the social world, but why and how. And we’re usually interested in what changes to
policy or behavior could lead to changes that we might deem desirable.
These types of questions concern causality. The meaning of the term “causal” is a long-standing
philosophical question; see Lewis (1973) for a fairly modern treatment that will accord with our approach
in this class. We will take a very simple perspective: A causes B if B would be different if A were different.
For example, on a day in which rain was forecast and I took my umbrella to school, we might say that
the rain forecast caused me to bring my umbrella, if I wouldn’t have taken the umbrella, absent the
forecast for rain. We of course can’t directly observe what would have happened if the forecast had been
different; we call this a counterfactual.

6.1 Causality as counterfactual contrasts


The potential outcomes framework offers an elegant and tractable way to talk about counterfactuals, in
the language of random variables (Rubin, 1974). This connects questions of causality to questions of
statistics, which we have been developing tools to study.
As a running example, consider the question of the effect of obtaining a college degree on a worker’s
earnings. Suppose we have data in the form of an i.i.d. sample of (Di, Yi), where Di ∈ {0, 1} indicates
whether individual i completed a college degree, and Yi indicates the worker's average hourly earnings at
age 30. We call Di our treatment variable, and Yi our outcome variable. We’re interested in the causal
effect of the treatment variable on the outcome. This is a setting in which we have a binary treatment.
We’ll start here because it’s the simplest setting to develop the concept of causality. In Section 6.5 I’ll
discuss how these ideas generalize beyond a binary treatment.
Definition 6.1. An individual's potential outcomes are (Yi(1), Yi(0)), where Yi(1) is the outcome
they would receive if they received the treatment, and Yi(0) is the outcome they would receive if they did
not.
In the returns-to-college example, Yi (0) is the earnings i would have if they didn’t go to college, and
Yi (1) is the earnings that i would have if they did go to college. The key thing to keep in mind in the
definition of counterfactuals is that we assume each individual i has a well-defined value both of Yi (0)
and of Yi (1). Regardless of whether i went to college or not, there is an answer to the question of how
much they would earn if they did go to college, and how much they would earn if they did not.
Consider for example a population composed of four individuals. Person A would
earn $10 an hour if they didn't graduate college, but if they did go to college they would get a higher-
paying job that paid them $18 an hour at age 30. Person B is a higher earner, who would earn $25 an
hour without a college degree and $40 an hour with one. Person C would choose to leave
the labor force and earn $0 without a degree, but with a college degree would find a job that pays $12
an hour. Notice that for all three of these individuals, Yi(1) > Yi(0): the causal effect of college on their
earnings is positive. But this need not be the case: suppose person D would have founded a successful company
had they not gone to college, earning them $150 an hour by age 30, but by going to college they
missed the opportunity to start the company and instead earned $60 an hour as an employee somewhere else.

Definition 6.2. An individual’s treatment effect is defined as: ∆i = Yi (1) − Yi (0), the difference
between their treated and untreated potential outcomes.
In the above example, the treatment effects ∆i are $8 an hour for Person A, $15 an hour for Person B,
$12 an hour for Person C, and −$90 an hour for Person D. Treatment effects are positive for three of the
four individuals, but Person D's is negative; in fact the average treatment effect in this population of four
is (8 + 15 + 12 − 90)/4 = −$13.75 an hour.
The leap of faith that you need to take with potential outcomes is to believe that there exists a
value Yi(0) and Yi(1) for each individual, regardless of whether they actually went to college. If i does
graduate college (i.e. Di = 1), then their actual earnings Yi will be Yi = Yi(1). Similarly, if they don't
go to college, then their earnings will be Yi = Yi(0). Another way of writing this is that, for each i:

Yi = Di · Yi (1) + (1 − Di ) · Yi (0)

Notice that since Di ∈ {0, 1}, there is always one of the above terms that is equal to zero, and the other
term gives us the appropriate potential outcome.
In the above example, suppose that Persons A and D do go to college and graduate, while B and C
do not. Then if we measure the earnings and college-graduation status of each of the four individuals,
our data will be {(Yi, Di)}i=1,2,3,4 = {($18, 1), ($25, 0), ($0, 0), ($60, 1)}.
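
Here is that bookkeeping as a small Python sketch (mine), reproducing the observed data above:

import numpy as np

# The four-person example: potential outcomes (Y(0), Y(1)) and the
# switching equation Y = D*Y(1) + (1-D)*Y(0).
Y0 = np.array([10.0, 25.0, 0.0, 150.0])   # Persons A, B, C, D
Y1 = np.array([18.0, 40.0, 12.0, 60.0])
D  = np.array([1, 0, 0, 1])               # A and D go to college

Y = D * Y1 + (1 - D) * Y0                 # observed outcomes
print(Y)            # [18. 25.  0. 60.]
print(Y1 - Y0)      # treatment effects: [  8.  15.  12. -90.]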
What can we say about treatment effects, given this data? Consider for example individual D, who
in reality missed their opportunity to start the business and earn $150 an hour. This is a counterfactual,
something that would have happened if the world were different. Since we can’t observe what would
have happened, we’ll never be able to answer the question of what person D’s value of ∆i is, empirically.

Definition 6.3. The fundamental problem of causal inference is that for a given i, we only observe
one of the two potential outcomes: either Yi (1) if Di = 1, or Yi (0) if Di = 0. In other words, we only
observe i’s realized value Yi = Yi (Di ), and not their other potential outcome.
The fundamental problem of causal inference means that we have a problem of identification, in the
language of Section 5.2. Suppose our parameter of interest is ∆i = Yi(1) − Yi(0) for some particular
individual i. Suppose that they did graduate from college, so Di = 1. If we can't observe Yi(0), we can't
identify their treatment effect. However, we’ll see that we can still sometimes make statements about
average treatment effects, by using other students who didn’t go to college as a comparison group.

6.2 The difference in means estimator and selection bias
The difference-in-means estimator takes the difference in the sample average of the outcome variable
among the “treatment group” Di = 1 and “control group” Di = 0:
θ̂DM = (1/N1) · Σ_{i: Di=1} Yi − (1/N0) · Σ_{i: Di=0} Yi

where N1 is the number of individuals i in the sample such that Di = 1, and N0 is the number of
individuals in the sample such that Di = 0. Of course, the total sample size is n = N0 + N1.
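
In code, with the four-person example from Section 6.1 (a sketch of mine):

import numpy as np

# The difference-in-means estimator applied to the four-person example
# (Persons A and D treated).
Y = np.array([18.0, 25.0, 0.0, 60.0])
D = np.array([1, 0, 0, 1])

theta_dm = Y[D == 1].mean() - Y[D == 0].mean()
print(theta_dm)   # (18 + 60)/2 - (25 + 0)/2 = 26.5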
We know from the results of 4.6, and the midterm that for large samples θ̂DM converges in probability
to its population counterpart:

θ̂DM →p E[Yi|Di = 1] − E[Yi|Di = 0]    (6.1)

Suppose that in our data, we observe that Yi and Di are positively correlated, that is θ̂DM ≥ 0 (as you
showed in a homework problem). This suggests that E[Yi |Di = 1] ≥ E[Yi |Di = 0]. Can we conclude
from our data that going to college causes one's earnings at age 30 to be higher?
We know from Definition 6.3 that for any individual for whom Di = 1, our observed Yi is Yi = Yi (1).
Similarly for any individual who doesn’t go to college, Yi = Yi (0). Thus, we can rewrite the estimand of
our difference-in-means estimator as:

θDM = E[Yi (1)|Di = 1] − E[Yi (0)|Di = 0] (6.2)

Notice that the first term in Eq. (6.2) conditions on the event Di = 1, and the second term conditions
on Di = 0. This means that the difference-in-means estimand compares two different groups, which
might not be comparable to one another. For example, students who go to college might have higher
SAT scores than students who do not.
Suppose for the moment that the second term in Eq. (6.2) also conditioned on the event Di = 1
(rather than Di = 0). If this were the case, then we could use linearity of the expectation to rewrite θDM
as being equal to E[Yi (1) − Yi (0)|Di = 1], the average treatment effect ∆i among students who do go to
college. We call this the average treatment effect on the treated, or AT T . The ATT is a causal parameter
of interest, because it compares the values of Yi (1) and Yi (0), on average, for the same group.
Note that by adding and subtracting E[Yi (0)|Di = 1] to equation (6.2), we can write:

θDM = {E[Yi(1) − Yi(0)|Di = 1]} + {E[Yi(0)|Di = 1] − E[Yi(0)|Di = 0]}    (6.3)

where the first term in braces is the ATT and the second is the selection bias.

The parameter of interest, the ATT, is not identified unless the selection bias term E[Yi (0)|Di = 1] −
E[Yi (0)|Di = 0] is equal to zero. This term represents a measure of non-comparability between the
students who go to college and the students who do not, in terms of their counterfactual earnings Yi (0).
For example, students who obtain a college degree may be more likely to come from family back-
grounds in which their parent(s) had time and resources to help the student accumulate skills that are
valued by the labor market. As a result, these students would have earned more on average, even if
they didn't go to college, and hence E[Yi(0)|Di = 1] − E[Yi(0)|Di = 0] > 0. Many other stories also lead to
a positive correlation between Di and Yi (0): students whose parents are well-connected may be more
likely to go to college, and earn more even if they didn’t go to college, and any genetic traits that are
associated with higher earnings are likely to also increase college attendance.
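
To put numbers to the decomposition (6.3), take the four-person example again (Persons A and D treated; this arithmetic is just an illustration using the values above). The ATT is E[Yi(1) − Yi(0)|Di = 1] = ($8 + (−$90))/2 = −$41, while the selection-bias term is E[Yi(0)|Di = 1] − E[Yi(0)|Di = 0] = ($10 + $150)/2 − ($25 + $0)/2 = $80 − $12.50 = $67.50. Their sum, θDM = −$41 + $67.50 = $26.50, matches the difference in means computed directly from the observed data: ($18 + $60)/2 − ($25 + $0)/2 = $26.50. A positive difference in means here masks a strongly negative ATT.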

6.3 Randomization eliminates selection bias


A sufficient condition for the selection bias term to be zero is that E[Yi (0)|Di = 1] = E[Yi (0)|Di = 0],
which says that Yi (0) is mean-independent of Di . One case in which this will hold is when Di is assigned
completely randomly, as in a randomized controlled trial. In this case, we have:
Definition 6.4. Random assignment says that (Yi (0), Yi (1)) ⊥ Di
The random assignment assumption is stronger than we need to kill the selection bias term in Equation
(6.3). All we need for that is E[Yi (0)|Di = 1] = E[Yi (0)|Di = 0]. This is implied by random assignment,
because (Yi(0), Yi(1)) ⊥ Di implies that Yi(0) ⊥ Di, which in turn implies that E[Yi(0)|Di = 1] =
E[Yi(0)|Di = 0]. (Note: recall that independence implies uncorrelatedness, and furthermore that for a
binary Di and any random variable Vi, Cov(Vi, Di) = 0 holds if and only if E[Vi|Di = 1] = E[Vi|Di = 0]. You
may want to review the earlier homework problem on this to convince yourself.)
When the selection-bias term in Equation (6.3) is equal to zero, we can say that the ATT is identified,
in the language of Section 5.2. There is only one value of ATT compatible with the population distribution
of observables, since (Yi, Di) is observed and AT T = E[Yi|Di = 1] − E[Yi|Di = 0]. However, under
random-assignment we can actually say more. Not only is the ATT identified, but so is the average
treatment effect:

AT E = E[Yi (1) − Yi (0)] = P (Di = 1) · AT T + (1 − P (Di = 1)) · AT U

where AT U := E[Yi (1) − Yi (0)|Di = 0] is the average treatment effect on the untreated, and we’ve used
the law of iterated expectations to decompose ATE into the ATT and ATU. Since the random-assignment
assumption says that treated potential outcomes Yi (1) are also independent of treatment Di , we have
not only that E[Yi(0)|Di = 1] = E[Yi(0)|Di = 0], but also that E[Yi(1)|Di = 1] = E[Yi(1)|Di = 0], and
thus the AT T , AT U , and AT E are all equal to one another.
In non-experimental settings, one may be able to identify a parameter like the ATT without being
able to identify the ATE. An example of this is the difference-in-differences research design, which (in
its basic, most common form) only yields identification of the ATT and not the ATU or ATE.
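
A quick simulation sketch (mine) of this point: under random assignment, the difference in means is consistent for the ATE even with heterogeneous treatment effects.

import numpy as np

# Under random assignment, the difference in means recovers the ATE.
rng = np.random.default_rng(1)
n = 100_000
Y0 = rng.normal(10, 2, size=n)           # untreated potential outcomes
Y1 = Y0 + rng.normal(3, 1, size=n)       # heterogeneous effects, ATE = 3
D = rng.integers(0, 2, size=n)           # coin-flip assignment, indep. of (Y0, Y1)

Y = D * Y1 + (1 - D) * Y0
print(Y[D == 1].mean() - Y[D == 0].mean())   # approx. 3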

Note: Even the above argument that AT T = AT U = AT E = E[Yi |Di = 1] − E[Yi |Di = 0] only
ever makes use of Yi (0) being independent of Di , and Yi (1) being independent of Di . This is still
weaker than the assumption made above, that Yi (0) and Yi (1) are jointly independent of Di . In
practice, it's usually hard to come up with an argument for why only the marginal distributions of
Yi (1) and Yi (0) would be independent of Di , and not their joint distribution, which is why I’ve
written it the way I have.

Note: Definition 6.4 corresponds to a randomized controlled trial with perfect compliance. In
many real-world trials, the only thing that can be randomized is whether an individual is assigned
to receive treatment. But subjects may still choose whether to actually receive treatment. This
situation is very common in economics settings, see for example the homework problem about
a lottery to migrate to New Zealand. In these cases, one can use the method of instrumental
variables to estimate causal effects, which you’ll see in later courses.

An assumption implicit in our use of random assignment to eliminate selection bias above is that
each individual's potential outcomes do not depend on whether other individuals go to college.
This is known as the stable unit treatment value assumption, or SUTVA. This is not always a
harmless assumption, as it rules out spillover effects.

6.4 The selection-on-observables assumption


Outside of an actual experimental setting, the random-assignment assumption is very strong. Typically,
economic agents “select into” treatment, meaning they choose for themselves whether or not Di = 1 or
Di = 0. There are usually a variety of reasons why the circumstances and preferences that lead to a
choice of taking treatment can be expected to be correlated with potential outcomes.
Suppose we observe a vector of covariates Xi , along with Yi and Di . Then, the following assumption
is often considered to be weaker than assuming fully random assignment:

Definition 6.5. Selection-on-observables, also referred to as unconfoundedness, says that

{(Yi (0), Yi (1)) ⊥ Di }|Xi

Selection-on-observables makes the same assumption as random assignment, but we assume it holds
conditional on each value Xi . It is thus often also called a conditional independence assumption.

Why might selection-on-observables be more reasonable than random assignment? The basic
idea is that if we observe a rich enough set of Xi , we might be able to control for confounding
factors that lead to selection bias. For example, in the returns-to-college example, we might
include in the vector Xi whether or not i's parents graduated from college, their socio-economic
status, and i’s test scores in high school.

Imagine that we observed literally everything that matters for determining the outcome Yi , in
addition to treatment. In this case, we could write potential outcomes as

Yi(0) = Y(0, Xi) and Yi(1) = Y(1, Xi),

where the function Y (d, x) is common to everybody: once we know d and x we can say exactly
what is going to happen to you. Then selection-on-observables would be satisfied automatically,
since if we condition on Xi = x, then Yi (d) = Y (d, x) for either d ∈ {0, 1}. Notice that Y (d, x)
doesn't depend on i: it is no longer random once we've fixed Xi. It is hence independent of
Di, since degenerate random variables are statistically independent of everything! This can be
seen as mimicking the logic of a carefully controlled experiment in the natural sciences, in which
we make sure “everything else” that matters Xi is held fixed, while varying Di between 0 and 1.

A similar logic would apply if Xi includes everything that determines Di : e.g. Di = d(Xi ) for
some function d. Then we’d also get selection-on-observables for free. In practice, apart from
very specific settings, we'll never observe everything that determines outcomes Yi, or selection into
treatment Di. However, if we can control for most of the obvious sources of selection
bias, we might be willing to think that our Xi gets us most of the way there. For a clever and
compelling example of using selection-on-observables, I recommend looking at Dale and Krueger
(2002).

How does the selection-on-observables assumption help us? Note that if it holds then

E[Yi|Xi = x, Di = 1] − E[Yi|Xi = x, Di = 0] = E[Yi(1)|Xi = x, Di = 1] − E[Yi(0)|Xi = x, Di = 0]
                                            = E[Yi(1)|Xi = x] − E[Yi(0)|Xi = x]
                                            = E[Yi(1) − Yi(0)|Xi = x] := AT E(x)    (6.4)

Thus the average treatment effect conditional on Xi = x, which we denote AT E(x), is identified under
selection-on-observables: Equation 6.4 equates it to a difference in means that conditions on the given
value x of Xi. Since we also observe the marginal
distribution of Xi , we can then recover for example the overall average treatment effect by integrating
over values of the control variables:
AT E = ∫ E[Yi(1) − Yi(0)|Xi = x] · dF(x) = ∫ AT E(x) · dF(x)

which follows by the law of iterated expectations.
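
A small simulated sketch (mine) of this identification strategy, with a discrete Xi:

import numpy as np

# Estimate ATE under selection-on-observables: a difference in means
# within each cell x of X, averaged with weights P(X = x).
rng = np.random.default_rng(2)
n = 50_000
X = rng.integers(0, 3, size=n)                   # 3 covariate cells
p = np.array([0.2, 0.5, 0.8])[X]                 # selection depends on X only
D = (rng.uniform(size=n) < p).astype(int)
Y0 = 5.0 * X + rng.normal(size=n)                # Y(0) correlated with X
Y = D * (Y0 + 2.0) + (1 - D) * Y0                # constant effect: ATE = 2

ate_hat = 0.0
for x in np.unique(X):
    cell = X == x
    ate_x = Y[cell & (D == 1)].mean() - Y[cell & (D == 0)].mean()
    ate_hat += cell.mean() * ate_x               # weight by P(X = x)
print(ate_hat)                                   # approx. 2

Note that the raw difference in means would be badly biased here, since units with higher X are both more likely to be treated and have higher Yi(0).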


There are three main approaches to making use of the selection-on-observables assumption in this
way: inverse-propensity score weighting, matching, and regression. In this class, we’ll focus on the third
of these, regression, but I briefly introduce the other two in the box at the end of this section. The
three approaches can be thought of as essentially three different strategies to construct an estimator for
AT E(x), but are all fundamentally based off of the identification result (6.4).

Exercise: Show that the difference-in-means estimator won’t work in general under selection-on-observables.

Solution: Suppose for the sake of argument that Xi is discrete. Then, by the LIE:

θDM = E[Yi|Di = 1] − E[Yi|Di = 0]
    = E[Yi(1)|Di = 1] − E[Yi(0)|Di = 0]
    = Σx P(Xi = x|Di = 1) · E[Yi(1)|Xi = x, Di = 1] − Σx P(Xi = x|Di = 0) · E[Yi(0)|Xi = x, Di = 0]
    = Σx P(Xi = x|Di = 1) · E[Yi(1)|Xi = x] − Σx P(Xi = x|Di = 0) · E[Yi(0)|Xi = x]
    = Σx P(Xi = x|Di = 1) · E[Yi(1) − Yi(0)|Xi = x] + Σx {P(Xi = x|Di = 1) − P(Xi = x|Di = 0)} · E[Yi(0)|Xi = x]

where the fourth line uses selection-on-observables to drop the conditioning on Di, and the last line adds
and subtracts Σx P(Xi = x|Di = 1) · E[Yi(0)|Xi = x]. The first term is a weighted average of AT E(x) with
weights P(Xi = x|Di = 1) (i.e. the ATT), and the second term vanishes only if the distribution of Xi is the
same in the treatment and control groups. Thus θDM does not in general recover the ATE under
selection-on-observables.

Note: When using the selection-on-observables assumption, it is important that the variables in the
vector Xi are unaffected by treatment. That is, if we introduced potential outcomes Xi (0) and Xi (1),
we would have Xi (0) = Xi (1) for all i. To make sure of this, researchers typically consider variables
Xi that are measured earlier in time than treatment Di is assigned. When this condition fails, causal
inference can fail even when the selection-on-observables assumption holds, via a problem often referred
to as “bad-control”.

Alternative approaches using selection-on-observables:

Inverse propensity score weighting: Under selection-on-observables:

AT E = E[ Di·Yi / P(Di = 1|Xi)  −  (1 − Di)·Yi / (1 − P(Di = 1|Xi)) ]

This strategy is known as inverse propensity score weighting. If one estimates the function
P(x) = P(Di = 1|Xi = x) (known as the propensity score function) for all values of x, then one
can form the object within the expectation above, using P(Xi) for each observation.

To see this, note that by the law of iterated expectations and selection-on-observables:

E[ Di·Yi/P(Di = 1|Xi) − (1 − Di)·Yi/(1 − P(Di = 1|Xi)) ]
 = E[ E[ Di·Yi/P(Di = 1|Xi) − (1 − Di)·Yi/(1 − P(Di = 1|Xi)) | Xi ] ]
 = ∫ { E[Di·Yi|Xi = x]/P(Di = 1|Xi = x) − E[(1 − Di)·Yi|Xi = x]/(1 − P(Di = 1|Xi = x)) } dFX(x)
 = ∫ { E[Di·Yi(1)|Xi = x]/P(Di = 1|Xi = x) − E[(1 − Di)·Yi(0)|Xi = x]/(1 − P(Di = 1|Xi = x)) } dFX(x)
 = ∫ { P(Di = 1|Xi = x)·E[Yi(1)|Di = 1, Xi = x]/P(Di = 1|Xi = x)
       − (1 − P(Di = 1|Xi = x))·E[Yi(0)|Di = 0, Xi = x]/(1 − P(Di = 1|Xi = x)) } dFX(x)
 = ∫ { E[Yi(1)|Xi = x] − E[Yi(0)|Xi = x] } dFX(x)
 = E[Yi(1) − Yi(0)] = AT E

where the propensity scores cancel in the second-to-last step, and selection-on-observables lets us
drop the conditioning on Di in the final integral.
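
To make the formula concrete, here is a minimal sketch (mine) of the IPW estimator on simulated data where the true propensity score is known; in practice P(x) must first be estimated:

import numpy as np

# IPW estimator of the ATE, using the true propensity score.
rng = np.random.default_rng(2)
n = 50_000
X = rng.integers(0, 3, size=n)
pscore = np.array([0.2, 0.5, 0.8])[X]            # P(D = 1 | X)
D = (rng.uniform(size=n) < pscore).astype(int)
Y0 = 5.0 * X + rng.normal(size=n)
Y = D * (Y0 + 2.0) + (1 - D) * Y0                # constant effect: ATE = 2

ate_ipw = np.mean(D * Y / pscore - (1 - D) * Y / (1 - pscore))
print(ate_ipw)                                   # approx. 2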


Matching: The approach of matching finds within one’s sample pairs of units having similar
values of Xi. In the most basic version of this strategy (one-to-one, exact matching), for each
treated unit i we find a control unit i′ such that Xi = Xi′. We drop any control units that are not
matched, and then apply the difference in means estimator to the modified sample, and consider
this as an estimate of the ATT. However, finding pairs such that Xi = Xi′ can be difficult when
X is a vector with many components, and will be impossible when it includes any components
that are continuously distributed. In these cases we’d need to settle for finding an i′ such that
Xi ≈ Xi′ (note that this issue is not specific to matching, we still have to worry about it when
evaluating Eq. (6.4)).

However, a clever application of selection-on-observables (Rosenbaum and Rubin (1983)) allows


us to simplify the problem considerably, leading to the idea of propensity-score matching. It also
follows from the selection-on-observables assumption that for any p ∈ (0, 1) we have that:

E[Yi|Di = 1, P(Xi) = p] − E[Yi|Di = 0, P(Xi) = p] = E[Yi(1) − Yi(0)|P(Xi) = p]
where P(x) = P (Di = 1|Xi = x) is the propensity score function introduced above. This
expression says that conditioning on values of the propensity score rather than on Xi itself is
sufficient to estimate causal effects. This is useful because while Xi may have many components,
the propensity score is always a scalar. Thus, we simply need to estimate the function P(x), and
then match units i and i′ such that P(Xi ) ≈ P(Xi′ ), rather than finding a good way to compare
X on all dimensions.
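
As a rough sketch (mine, with simulated data and the true propensity score standing in for an estimated one), one-to-one nearest-neighbor matching on the score might look like:

import numpy as np

# Nearest-neighbor matching on the propensity score, estimating the ATT.
rng = np.random.default_rng(2)
n = 4_000
X = rng.normal(size=n)                           # continuous confounder
pscore = 1 / (1 + np.exp(-X))                    # true propensity score
D = (rng.uniform(size=n) < pscore).astype(int)
Y = 2.0 * D + 3.0 * X + rng.normal(size=n)       # constant effect: ATT = 2

treated, control = np.where(D == 1)[0], np.where(D == 0)[0]
# For each treated unit, the control unit with the closest propensity score:
matches = control[np.abs(pscore[treated][:, None] -
                         pscore[control][None, :]).argmin(axis=1)]
print((Y[treated] - Y[matches]).mean())          # approx. 2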

6.5 Causality beyond a binary treatment*


In this chapter we’ve focused on a binary treatment, which takes just two values: Di = 1 (“treatment”),
and Di = 0 (“control”). However, we’re often interested in the causal effect of a treatment variable that
takes on many values. For example, what is the effect of years of schooling on earnings, rather than just
the effect of completing any college degree?
Setting up the notation for multivalued treatment variables is pretty straightforward. We can define
our potential outcomes Yi (d) in the same way as before, where now d indexes all of the values that Di
might take. Here are some examples:

• Let d be the number of years of schooling student i completes, and Yi(d) be their earnings at age
30.

• Let d be the price of some good, and let the function Yi(d) be the demand function for that good
in market i.

• Let d be the high school in Georgia that student i attends, e.g. d ∈ {school A, school B, school C, etc.},
and let Yi(d) be an indicator for whether they were accepted to UGA.

• In a randomized experiment about the effect of social media on mental health, subjects i are
assigned to one of four treatments:

d ∈ {no social media, Facebook only, Twitter only, Facebook and Twitter}

Regardless of the setting, we can still define random assignment ({Yi(d)}d ⊥ Di) and selection-on-
observables ({{Yi(d)}d ⊥ Di} | Xi) exactly as we did before, now with the full collection of potential outcomes.

However, with more than two values of treatment, there are now many different ways to think about
treatment effects. For example, in the first example above, we can think about the effect of finishing
grade 12 as:
Yi (12) − Yi (11),
while the effect of completing high-school versus dropping out after grade 10 is:

Yi (12) − Yi (10)

The overall average causal effect of the last year of schooling that each student actually completes would
be
E[Yi (Di ) − Yi (Di − 1)]
In the first two examples above, the values of treatment Di have a natural order to them. In the
third and fourth examples, treatment is categorical, and there may not be a natural such order. With an
unordered treatment, like in the last example, we might pick one comparison category and consider treat-
ment effects with respect to it, e.g. separately estimating E[Yi (Facebook only) − Yi (no social media)],
E[Yi (Twitter only) − Yi (no social media)] and E[Yi (Facebook and Twitter) − Yi (no social media)].

6.6 Moving beyond average treatment effects*
Although our discussion here has been focused on parameters that average over treatment effects ∆i =
Yi (1) − Yi (0), this isn’t the only type of causal question that we can answer with random-assignment or
selection-on-observables.
Consider a binary treatment Di and random assignment: (Yi (0), Yi (1)) ⊥ Di . Note that we can apply
any function g(·) to the potential outcomes, without destroying independence, i.e. (g(Yi (0)), g(Yi (1))) ⊥
Di . Why is this useful? Consider the function g(t) = 1(t ≤ y) for some value y. Given that random-
assignment implies that the random variable 1(Yi (1) ≤ y) is independent of Di , we have that

E[1(Yi ≤ y)|Di = 1] = E[1(Yi(1) ≤ y)|Di = 1] = E[1(Yi(1) ≤ y)]

where the left-hand side is FY|D=1(y) and the right-hand side is FY(1)(y).

The term on the left is the conditional CDF of Yi given Di = 1, which can be computed from the data.
The term on the right is the (unconditional) CDF of the treated potential outcome Yi (1). This expression
shows that we can identify the CDF of Yi (1) at any point y. Collecting over all y, we can thus compute
the entire distribution of Yi (1).
By the same logic, we can also identify the entire distribution of Yi(0), using FY|D=0(y) = E[1(Yi ≤
y)|Di = 0]. That means that we can use random-assignment to uncover the effect of treatment on the
entire distribution of outcomes. This lets us answer a new set of causal questions. For instance: what is
the difference between the median value of Yi (1) and the median value of Yi (0)? This is an example of
a so-called quantile-treatment effect.
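
Under random assignment, sample quantiles of Yi among the treated and control groups estimate the quantiles of Yi(1) and Yi(0) respectively. A small simulated sketch (mine):

import numpy as np

# A median (0.5-quantile) treatment effect under random assignment.
rng = np.random.default_rng(3)
n = 100_000
D = rng.integers(0, 2, size=n)
Y0 = rng.lognormal(mean=2.0, sigma=0.5, size=n)
Y = np.where(D == 1, Y0 * 1.5, Y0)               # treatment scales outcomes by 1.5

qte_median = np.quantile(Y[D == 1], 0.5) - np.quantile(Y[D == 0], 0.5)
print(qte_median)     # approx. 0.5 * exp(2), about 3.7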
A natural question that you might hope to answer is: how many individuals in my population have
a negative treatment effect Yi (1) < Yi (0), versus a positive one? This is a harder type of question,
because it depends on the joint distribution of potential outcomes. By contrast, random assignment
(and similarly selection-on-observables, or quasi-experimental approaches), only let us identify each of
the marginal distributions of Yi (0) and Yi (1), due to the fundamental problem of causal inference.
The situation is not completely hopeless: the marginal distributions of Yi (1) and Yi (0) do put some
restrictions on the distribution of treatment effects. For instance, it can be shown that a lower bound
on the proportion “harmed” by treatment P (∆i ≤ 0) is the supremum of FY (1) (y) − FY (0) (y) over all
values of y (see e.g. Fan and Park, 2010 for details). We can also make additional assumptions that
allow us to say more about the distribution of treatment effects. For example, the strong assumption
of rank-invariance allows us to trace out the entire CDF of ∆i , and in principle estimate the treatment
effect for any given individual (see e.g. Heckman et al., 1997).

Chapter 7

Linear regression

Note on notation: In this section we’ll simplify notation by dropping i subscripts when discussing
population quantities. We’ll add them back in Section 7.5 when we get to estimation. Remember that
with i.i.d. data, it doesn’t matter whether we include the i indices or not, because the distribution of
variables in each observation i is the same as the population distribution.

7.1 Motivation from selection-on-observables


We saw in Chapter 6 that under the selection-on-observables assumption, and with a binary treatment
variable, the average treatment effect conditional on X = x can be calculated as:

AT E(x) = E[Y (1) − Y (0)|X = x] = E[Y |X = x, D = 1] − E[Y |X = x, D = 0]

This requires having a way to estimate conditional expectations of the form E[Y |X = x, D = d] for d = 0
and d = 1. How should we do this?
If X is a discrete random variable, there is a pretty straightforward way we could do this. With i.i.d.
data, a consistent estimator is simply the mean among the sub-sample of data for which Di = d and
Xi = x:
Ê[Y|X = x, D = d] := (1 / #{i : Xi = x and Di = d}) · Σ_{i: Xi = x and Di = d} Yi →p E[Y|X = x, D = d]
But remember that for the selection-on-observables assumption, we want X to be an extensive-enough
set of control variables to eliminate selection bias. So how should we proceed when X = (X1 , X2 , . . . Xk ) is a
vector of several random variables, some of which may be continuously distributed?
This is actually a hard problem, in practice. Recall that E[Y |X = x, D = d] is a function of x and d,
which in the notation of 1.5.3 we might write as:

E[Y |X = x, D = d] = m(d, x1 , x2 , . . . xk )
where x1 , x2 , . . . xk are the components of the vector x. Provided that (Y, D, X) are all observed, the
function m will be identified (see Section 5.2). That is, for fixed values (x, d) there is only one value of
m(d, x1 , x2 , . . . xk ) compatible with the joint distribution of our observables.
However, estimation is another thing. Given our finite sample, how do we uncover the function
m(d, x1 , x2 , . . . xk )? This turns out to be particularly straightforward when the function m is linear, that
is:
m(d, x1 , x2 , . . . xk ) = β0 + βD d + β1 x1 + β2 x2 + · · · + βk xk (7.1)
for some set of coefficients (βD , β0 , β1 , . . . βk ). In this case note that our parameter of interest, AT E(x) is
simply equal to m(1, x) − m(0, x) = βD . Since this difference yields the same fixed number βD regardless
of x, the conditional-on-X ATE is the same as the overall average treatment effect, so AT E = βD .

7.2 The linear regression model
Given a random variable Y and a random vector X, the linear-regression model says that

Y = X ′β + ϵ (7.2)

where
E[ϵ|X] = 0 (7.3)
We'll refer to the vector β appearing in Eq. (7.2) as the coefficient vector from a regression of Y on X (as
a reminder of notation: β′X = Σj βj · Xj). The term ϵ is often called an error term or residual.¹
Remember from Section 3.1 that a statistical model places some kind of restriction on the joint
distribution of random variables. The key restriction of the linear regression model is that the residual
ϵ has a conditional mean of zero given any realization of X. The linear regression model holds for some
β if and only if the conditional expectation function of Y on X is a linear function of X, that is:

E[Y |X] = X ′ β (7.4)

In almost all cases in which we use the linear regression model, one of the components of X is taken to
be non-random and simply equal to one. It thus contributes a constant to the function X ′ β, for example:

Y = β0 + β1 · X1 + · · · + βk · Xk + ϵ (7.5)

where here we have started the numbering at 0, so that β has k + 1 components. In this notation X also
has k + 1 components: X = (1, X1 , . . . Xk )′ . However, to keep notation compact, we’ll often ignore the
distinction between a constant and random elements in X.

Accordingly, if we let k be the total number of components in X = (X1, X2, . . . Xk)′ (including any
constant term), then notice that Eq. (7.3) implies the following k equations:

E[Xϵ] = ( E[X1 · (Y − X′β)], E[X2 · (Y − X′β)], . . . , E[Xk · (Y − X′β)] )′ = (0, 0, . . . , 0)′    (7.6)

To see that E[ϵ|X] = 0 implies E[ϵ · Xj ] = 0 for any j = 1 . . . k, use the law of iterated expectations:
E[ϵ · Xj ] = E {E[ ϵ · Xj | X]} = E {E[ϵ|X] · Xj } = E {0 · Xj } = 0
It’s probably a good idea to stare at this and make sure it makes sense. Conditional on any value
X = x, the component Xj has some fixed value xj . Thus, we can pull it out of the inner expectation, so
that E[ ϵ · Xj | X = x] = E[ϵ|X = x] · xj . Then we take the outer expectation (curly braces) over values x.

Since (7.6) provides a system of k equations in the k unknowns β1 . . . βk , it generally has a unique
solution. A general expression for this solution is:

β = E[XX ′ ]−1 E[X · Y ] (7.7)

We’ll unpack the matrix notation of this equation later, so don’t worry if it’s not familiar to you right
now. The important thing is that there is typically a single vector β that can satisfy all k lines of Eq.
(7.6), and it is given by (7.7) above.
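
As a preview, here is a small numerical sketch (mine) of the sample analog of (7.7), which is exactly the OLS estimator we'll meet in Section 7.5:

import numpy as np

# beta_hat = (sample E[XX'])^{-1} (sample E[XY]) on simulated data.
rng = np.random.default_rng(4)
n = 10_000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # constant + 2 regressors
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X / n, X.T @ Y / n)
print(beta_hat)   # approx. [1.0, 2.0, -0.5]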
When people talk about “running a regression”, the quantity they are estimating is (7.7), whether or
not the conditional expectation function E[Y |X] is linear in X as the linear regression model assumes.
Thus, rather than Eqs. (7.2) and (7.3) we could have gotten away with introducing β with a so-called
linear projection model, which just says that

Y = X ′β + ϵ where E[ϵ · Xj ] = 0 for all j = 1 . . . k (7.8)

Whether one starts from Eq. (7.4) or from (7.8), we’re talking about the same β. We’ll call this β, which
has the explicit formula (7.7), the coefficient vector or the linear regression vector.
¹The Hansen textbook reserves the term "residual" for an estimated value of ϵ that arises in the context of the ordinary
least squares estimator. I'll refer to ϵ above as a residual, and what Hansen calls a residual a "fitted residual" in Sec. 7.5.

We can also write the linear regression vector in a second way: it minimizes the population mean-squared
error between Y and a linear function of the components of X:
β = argmin_{γ∈Rk} E[(Y − X′γ)2]    (7.9)
This says that the value of the β appearing in Eq. (7.2) is exactly the one that minimizes the expectation
of the squared difference between Y and the “regression line” X ′ β implied by β and X. We’ll establish
Eq. (7.9) in Section 7.4.1.

Exercise: Using the fact that ϵ = Y − X ′ β, show that Equation (7.6) is the first-order-condition of the
minimization problem (7.9).

Note: to connect the linear regression model to the discussion of selection-on-observables in Section 7.1,
we simply incorporate D as well as the constant in Eq. (7.1) into the vector X.

There are several motivations for caring about the linear regression vector β. Linear regression will be
a useful tool when the RHS of Eq. (7.7) answers or sheds light on some interesting question about the
world. We’ll now turn to several ways in which it can.

7.3 Five reasons to use the linear regression model


Here are several distinct motivations for using the linear regression model above. In all cases except the
last one, we have the same parameter of interest: β, and we’ll end up using the exact same estimator
β̂—referred to as the ordinary-least-squares or OLS estimator—for β.

7.3.1 As a structural model of the world


One way to arrive at the linear regression model is to simply assume that it describes a function that
generates the outcome Y . For example, we might have a so-called Mincer equation, which explains
log-wages as a function of education and job experience:
log(wagei ) = β0 + β1 · schooli + β2 · expi + ϵi (7.10)
where wagei is worker i's hourly wage, schooli is the number of years of schooling they completed, and
expi is their number of years of work experience. Here the β = (β0 , β1 , β2 )′ have a direct economic
interpretation: we think of β1 for example as telling us how much log wages would increase if student i’s
schooling was increased by one year.
In this structural model of the world, we can think of the residual ϵi as capturing everything else
about i (or their employer) that also determines their wage: for example, their unobserved skills or
“ability”. If we are willing to believe that this ϵi is uncorrelated with schooling or experience, that is
E[ϵi · schooli ] = E[ϵi · expi ] = E[ϵi ] = 0, then we arrive at the linear projection model of Equations 7.2
and Equation 7.6.
When we interpret the regression equation as a story about how the world works, we give a particular
interpretation to the vector β and to the error term ϵ: Eq. (7.10) constitutes an economic model and not
just a statistical one. On this view, each expression E[ϵ · Xj ] = 0 is an assumption. Are unobservable
factors like i’s unobserved skills uncorrelated with schooling? If the answer is yes, then the parameters
β = (β0 , β1 , β2 )′ can be estimated by the statistical technique of running a regression.
The above might be the way you were taught to think about regression in an undergraduate econo-
metrics course. The next four sections will introduce motivations for using the linear regression model
that have a different form. Instead of treating Equations 7.6 as an assumption, we use them to define β.
While it may or may not have a direct economic interpretation like in the Mincer equation, β is always
related to the conditional expectation function E[Y |X], which has a statistical interpretation. We turn
now to this.

7.3.2 As an approximation to the conditional expectation function


As we’ve seen, Equations 7.2 and 7.6 are implied when the conditional expectation function (CEF) of Y
on X is linear in X. That is, when
E[Y |X] = X ′ β (7.11)
Another way of saying this is that the function m(x) := E[Y |X = x] has the form
m(x) = x′β = Σ_{j=1}^{k} βj · xj

Note that in the linear regression model, the residual ϵ is equal to the deviation of the random variable
Y from its expectation given X; that is:

ϵ := Y − E[Y |X]

Then, by definition Y = E[Y |X] + ϵ. When the CEF takes the linear form of Equation 7.11, we get
Equation 7.2. This is useful for example when we have selection-on-observables and are interested in the
coefficient on treatment D, as in 7.1.

However, even if the assumption of Equation (7.4) that the CEF is linear in X is false, the function X ′ β
will still provide the best linear approximation to the true function m(x) = E[Y |X = x], in the sense of
minimizing the mean squared approximation error:
Proposition 7.1. β = argmin_{γ∈Rk} E[(m(X) − X′γ)2], where β is as defined in Equation 7.7.
This means that even if the CEF is not quite linear, the linear projection coefficient β is the k−component
vector such that x′ β best approximates m(x) as a linear function.

7.3.3 As a way to summarize variation*


Linear regression is also often used as a descriptive tool to characterize the variation in one’s data. Firstly,
one can interpret Eq. (7.2) as decomposing the random variable Y into a term X ′ β that is “explained”
by X, and a residual ϵ that is not. One should be careful here: we’re not saying that X causes Y , just
that the two are correlated (see below). We’ll see that the linear regression model also gives us a nice
expression for V ar(Y ), decomposing the variance of Y into a term that depends on the variance of X,
and a second term that depends on the variance of the error. This can be helpful in establishing how
much of the variance of Y can be “explained” by X.
We previously defined the correlation coefficient ρXY = Cov(X, Y)/√(Var(X)·Var(Y)) between two random
variables. In Section 1.6 we generalized the notion of covariance to random vectors, but how does the notion
of a correlation coefficient ρ ∈ [−1, 1] extend to vectors? The regression coefficient vector β turns out to
involve a version of this. The j th component of the regression vector: βj , can be expressed in terms of the
so-called “partial-correlation” coefficient between Y and Xj , given the other variables in the regression
(i.e. X1 . . . Xj−1 and Xj+1 . . . Xk ). In general, the partial correlation between X and Y , given Z, is a
version of ρXY that considers the residual variation in X and Y after accounting for the Z in a particular
way. We’ll study this in Section 7.4. This motivation is not entirely distinct from that of the last section,
since correlations and partial correlations can always be written in terms of the CEF.

7.3.4 As a weighted average of something we care about*


Sometimes we start with a research question that has a complicated answer, but linear regression gives
us a simple summary of that answer.

7.3.4.1 Example 1: the average derivative of a CEF*


Suppose I asked you to use the nlswork dataset from Chapter 2.2 to plot the conditional expectation
function of a worker’s weekly hours with respect to their hourly wage. Below I’ve computed this function
for wages between $5 an hour and $25 an hour, displayed in red. As you can see, the function is
non-monotonic: E[hours|wage = w] increases until w is about 8, then declines with w (quite steeply
after w = 15).

[Figure: the conditional expectation function E[hours|wage = w] (red) and its best linear approximation (teal), for wages between $5 and $25 an hour, in the nlswork data.]
In teal, I've plotted the regression function β0 + β1 · w. We know from Proposition 7.1 that the vector
(β0 , β1 ) finds the best linear approximation to the conditional expectation function, which happens to
be highly non-linear. That is: the slope of E[hours|wage = w] changes a lot with w. Although the
linear approximation is not a particularly good one (the red and teal lines are far from one another
for many values of w), we can see that the slope coefficient β1 appears to “average out” the slope of
E[hours|wage = w]: sometimes the teal line is steeper, and sometimes the red line is steeper.
This is no accident: in a regression with one continuously-distributed X variable and a constant, the
coefficient on X always captures a weighted average of the derivative of the CEF with respect to X. In
particular, β1 = ∫ m′(x) · w(x) · dx, where m′(x) = (d/dx)E[Y|X = x], and w(x) is a positive function that
E[Y |X = x], and w(x) is a positive function that
integrates to one. The weighting function w(x) is proportional to F (x) (E[X] − E[X|X ≤ x]), where
F (x) is the CDF of X. This result was originally derived by Yitzhaki (1996).
This result is nice, but be careful when appealing to it to say that a linear regression coefficient always
captures an average CEF derivative. When there is more than one variable in the regression, say X1 and
X2 , the weights that β1 places on ∂x ∂
1
m(x1 , x2 ) could be negative, if E[X1 |X2 ] is not linear in X2 . Even
with a single regressor, the averaging introduced by the linear regression vector can be misleading. If
the CEF is sometimes increasing, and sometimes decreasing, we may get a regression coefficient of zero
even in cases where X and Y are closely related.

Proof: This is jumping ahead a bit, but I include a proof here in case you are interested. To see
this, note that by the law of iterated expectations, the slope coefficient is:

β1 = Cov(X, Y)/Var(X) = (1/Var(X)) · (E[Y·X] − E[X]E[Y]) = (1/Var(X)) · E[Y·(X − E[X])]
   = (1/Var(X)) · ∫ f(x)·(x − E[X])·E[Y|X = x] dx = (1/Var(X)) · ∫ f(x)·(x − E[X])·m(x) dx

where m(x) = E[Y|X = x] and f(x) is the density of X. Now we use integration by parts with
u = m(x) and dv = f(x)(x − E[X])·dx, writing the integral as m(x)v(x)|_{−∞}^{∞} − ∫ m′(x)v(x)dx,
where v(x) = ∫_{−∞}^{x} f(t)(t − E[X])dt. The first term is zero because both v(∞)
and v(−∞) are equal to zero (for v(∞) we assume X has a finite second moment). So β1 =
∫ m′(x)w(x)dx where w(x) = −v(x)/Var(X). To see that w(x) integrates to one, substitute
Y = X, in which case m′(x) = 1. To see that w(x) ≥ 0, rewrite v(x) = F(x)E[X|X ≤
x] − E[X]F(x) = F(x)·(E[X|X ≤ x] − E[X]) and note that E[X|X ≤ x] ≤ E[X] for all x.

7.3.4.2 Example 2: weighted averages of treatment effects*


Consider a regression of Y on a binary treatment variable D and X:

Y = βD · D + X′βX + ϵ    (7.12)

where the vector X is a set of indicator variables for an underlying categorical variable G. By this, I
mean that X = (1(G = 1), 1(G = 2), . . . , 1(G = NG))′, where P(G ∈ {1, 2, . . . NG}) = 1. We'll return
to this kind of regression later, which is sometimes referred to as “saturated”. It turns out that the
coefficient on D in this regression can be written as:

βD = E[{E[Y|D = 1, X] − E[Y|D = 0, X]} · Var(D|X)] / E[Var(D|X)]
If we assume selection-on-observables, then we know that the term in brackets is equal to AT E(x) =
E[Y(1) − Y(0)|X = x]. Then we have that βD = Σ_{j=1}^{NG} wj · AT E(xj), where x1, x2, . . . are the values that
X can take, i.e. xj is a vector of NG components composed of all zeros but a 1 in the jth component.
The wj = Var(D|X = xj) · P(X = xj) / E[Var(D|X)] can be thought of as weights: they are positive and sum to one. This result
can be found in Angrist and Pischke (2008), and we’ll prove it later.

Thus, Eq. (7.12) recovers a weighted average of the conditional-on-X average treatment effects. It
weights each group in proportion to the conditional variance of D given X. Intuitively, this puts more
weight on the groups for which there is a more equal proportion of treatment and control units (since
note that Var(D|X) = P(D = 1|X) · P(D = 0|X)). It does not recover the average treatment effect
AT E = E[Y (1) − Y (0)], which can be thought of as applying weights wj = P (X = xj ) (by the law of
iterated expectations).

Recall that, by contrast, with a binary treatment we can always get the ATE under selection-on-
observables by estimating E[Y |D = 1, X = x] − E[Y |D = 0, X = x], and we don’t need to assume
that Xi has a group structure for this result. What then is the value of the result in this section? In
Section 7.2, we assumed that E[Y |D = 1, X = x] had the linear form of Eq. (7.1). This is a strong
assumption; it implies for example that AT E(x) = E[Y (1) − Y (0)|X = x] is the same for all x. The
result in this section shows that when X has a group structure, we can still keep X and D separate in
the regression, without assuming that AT E(x) is constant in x (i.e. we can get away without including
interaction terms between X and D).

7.3.5 As a tool for prediction*


One final way one might arrive at a linear regression model is that regression can be a tool for prediction.
This is conceptually quite distinct from the other motivations discussed thus far, because in this case
the goal is not to learn the value of β, but rather to use β to predict the outcome Y .
For example, suppose you are Netflix and want to decide whether to advertise the epic romance
series Econometrics to a given consumer. In particular, you’d like to know whether i will click on the
promotional display, and begin watching the (riveting) first episode. Let’s indicate this by Y = 1. You
know several things X about the consumer, for example their age and other shows that they have been
watching. You don’t know Y for this particular consumer (because you haven’t shown them the ad yet),
but you do know (Yi , Xi ) for a random sample of other individuals who did see the advertisement.
Now consider solving the above prediction problem for a randomly drawn consumer, where we attempt
to use a linear function of X to predict Y . Because of Eq. (7.9), β will minimize the average such
prediction error (squared). That is: β = argminγ∈Rk E[(Yi − Xi′ γ)2 ]. Linear regression is not necessarily
a particularly good tool for prediction, but can provide a starting point. Modern tools such as machine-
learning algorithms are better-optimized for prediction.

7.4 Understanding the population regression coefficient vector


Let us now see the relationship between the β that satisfies the linear regression model, and the problem
of minimizing the mean-squared error between Y and X ′ β. We repeat here Eq. (7.9):

β = argmin E[(Y − X ′ γ)2 ] (7.13)


γ∈ Rk
Note that we are not constraining the values that γ can take in this minimization problem, rather we
have an unconstrained minimization in which we search over all γ ∈ Rk . That means that to minimize
the mean squared error, β must satisfy the following k first-order conditions (FOCs), one for each of its
components βj for j = 1 . . . k:

(∂/∂βj) E[(Y − X′β)2] = −2 · E[(Y − X′β) · Xj] = 0    (7.14)

where we've used that X′β = Σ_{j=1}^{k} Xj · βj. This is equivalent to E[Xj · ϵ] = 0, if we define ϵ = Y − X′β.
This leads exactly to the linear regression model of Equations 7.2 and 7.6.

Thus we’ve seen that the minimizer of the mean squared error between Y and a linear function of X
must be equal to the regression coefficient vector β. The box at the end of Section 7.4.1 shows that this
also goes in the other direction: the β defined by Equations 7.2 and 7.6 must be the β that solves (7.9).

Note: I’ve assumed in the above that E[(Y − X ′ γ)2 ] is differentiable with respect to γ and that we can
interchange the derivative and the expectation (this requires regularity conditions that allow us to appeal
to the dominated convergence theorem, but we don’t need to worry about these technicalities here).

7.4.1 Existence and uniqueness of β: no perfect multicollinearity


Is there always a β that minimizes the mean squared error, and could there be multiple values β and β ′
that both minimize it? These are the questions of existence and uniqueness of β, respectively.
A sufficient condition for β to exist and for it to be unique turns out to be that the matrix E[XX ′ ]
is invertible, meaning that the inverse matrix E[XX ′ ]−1 exists. A convenient characterization of when
E[XX′] will be invertible is given by the following proposition:
Proposition 7.2. The matrix E[XX′] has an inverse E[XX′]−1 if and only if for all nonzero γ ∈ Rk:
P(X′γ ≠ 0) > 0

Proof. We call a symmetric k × k matrix M positive definite if γ′Mγ > 0 for all γ ≠ 0. Any positive definite matrix
is invertible. I won't prove this here, but for the curious: the eigenvalues of a positive definite matrix
are strictly positive, and a matrix is invertible if and only if it does not have zero as an eigenvalue.
Thus, we'll show that H = E[XX′] is positive definite in order to establish that it is invertible. For
any nonzero γ ∈ Rk, note that γ′Hγ = E[γ′XX′γ] = E[(X′γ)2]. Since (X′γ)2 ≥ 0 for any realization of X, it follows that
γ′Hγ > 0 if and only if (X′γ)2 is strictly greater than zero with some positive probability. This occurs
whenever P(X′γ ≠ 0) > 0.
Proposition 7.2 says that E[XX′] is invertible when there exists no nonzero value γ that makes X′γ equal to
the zero vector with probability one (remember that X here is a random vector). When there is such a γ, we say
that there is perfect multicollinearity among our regressors X = (X1, X2, . . . Xk).
Definition. We say that there is perfect multicollinearity among our regressors (in the population)
if there exists some nonzero γ ∈ Rk such that P(X′γ = 0) = 1.

Example: Suppose that our regression includes a constant X1 = 1, a binary variable indicating that a
given individual is married: X2 = married, and a second binary variable X3 that indicates that a given
individual is not married. Then, since X = (1, married, 1 − married)′ , we have that X ′ (−1, 1, 1) = 0
for all realizations of X. Thus, we have perfect multicollinearity: X ′ γ = 0 regardless of the value of
married and hence with probability one.
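
A quick numerical sketch (mine) of this example:

import numpy as np

# Constant, an indicator, and its complement: X'gamma = 0 for
# gamma = (-1, 1, 1), so the sample matrix X'X is singular.
rng = np.random.default_rng(5)
married = rng.integers(0, 2, size=1000)
X = np.column_stack([np.ones(1000), married, 1 - married])

print(np.linalg.matrix_rank(X.T @ X))                 # 2, not 3: X'X is not invertible
print(np.allclose(X @ np.array([-1.0, 1.0, 1.0]), 0)) # True: X'gamma = 0 always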

Note: Since we’re talking about the population in this section (rather than a sample), definition 7.4.1
says that there is no perfect multicollinearity in the population. In Section 7.5, we’ll use a sample analog
of this definition, which will be required for us to define the OLS estimator of β.

Now let’s see how the absence of perfect multicollinearity, which by Proposition 7.2 we know is equivalent
to E[XX ′ ] being invertible, implies that β exists and is unique.
Proposition 7.3. If there is no perfect multicollinearity, then there exists a β ∈ Rk that satisfies
Equations 7.2 and 7.6, and it is unique.

Proof. We can combine Equations 7.2 and 7.6 into a single matrix equation, which is equivalent to the
system of FOCs 7.14:
E[X(Y − X ′ β)] = 0
where 0 denotes a vector of k zeroes 0 = (0, 0, . . . 0)′ . This equation is the same as
E[XX ′ ]β = E[XY ]
Since E[XX ′ ] is invertible by Proposition 7.2, we can multiply both sides of the above equation by
E[XX ′ ]−1 (see box below) to obtain:
β = E[XX ′ ]−1 E[XY ]
Of course, we can only do this if the matrix E[XX ′ ]−1 exists, which is guaranteed by no perfect multi-
collinearity. Note that the above expression for β was given previously in Equation 7.7.

Review: using matrix inverses to solve a system of linear equations

Suppose we have a system of k equations in k variables

a11 · x1 + a12 · x2 + · · · + a1k · xk = b1
a21 · x1 + a22 · x2 + · · · + a2k · xk = b2
..
.
ak1 · x1 + ak2 · x2 + · · · + akk · xk = bk    (7.15)

We seek a solution x = (x1 , x2 , . . . xn ) that satisfies all of the above equations. Let us gather all
of the coefficients in to a k × k matrix and call it A:
 
a11 a21 . . . an1
 a21 a22 . . . an2 
A= . .. .. 
 
 .. ..
. . . 
an1 an1 ... ank

Our system of Equations (7.15) says, in vector notation, that Ax = b, where b = (b1 , b2 , . . . bk )′
is a vector composed of the values appearing on the RHS in Eq. (7.15).

If the matrix A is invertible, this means that there exists a unique matrix A−1 such that AA−1 =
A−1A = Ik, where Ik is the k × k identity matrix. It has entries of one along the diagonal and
zeros everywhere else:

Ik = ( 1  0  . . .  0 )
     ( 0  1  . . .  0 )
     ( ...           ... )
     ( 0  0  . . .  1 )

Note that the identity matrix Ik has the property that Ikλ = λ for any vector λ ∈ Rk.

Thus, if we start with the equation Ax = b and multiply both sides by A−1, we get that

A−1(Ax) = (A−1A)x = Ikx = x, and hence x = A−1b

Thus, we've shown that x must be equal to A−1b. This value definitely satisfies (7.15), which
we can verify by:

A(A−1b) = (AA−1)b = Ikb = b

Also, it is the only value of x that satisfies the system (7.15). The solution exists and is unique,
provided that A−1 exists.

Furthermore, one can show that the x solving Ax = b is unique only if A is invertible. A is
invertible if and only if there exists no λ ∈ Rk that differs from the zero vector (i.e. it is not all
zeros) for which Aλ = 0k (here 0k is a vector composed of k zeros). Thus if A is not invertible,
there is such a vector λ. Suppose we have one solution x to Ax = b. Then x + αλ is another
solution, for any value of α, because A(x + αλ) = Ax + αAλ = b + 0k = b.
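To make this concrete, here is a minimal numerical sketch (Python with numpy; the specific matrix and vector are made up purely for illustration) of solving a small system both by explicit inversion and by a direct solver:

import numpy as np

# A hypothetical 3x3 system Ax = b (the numbers are arbitrary,
# chosen only so that A is invertible)
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

# Solving via the explicit inverse, x = A^{-1} b
x_inverse = np.linalg.inv(A) @ b

# np.linalg.solve does the same thing more stably, without
# forming A^{-1} explicitly
x_solve = np.linalg.solve(A, b)

print(np.allclose(x_inverse, x_solve))  # True
print(np.allclose(A @ x_solve, b))      # True: x solves the system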

We are now also in a position to see why, when E[XX′]−1 exists, the regression coefficient vector
β must minimize the mean-squared error in Equation 7.13. The function E[(Y − X′γ)²] is a convex
function of the vector γ = (γ1, γ2, . . . γk)′, which implies that any local minimum of E[(Y − X′γ)²]
is also a global minimum. Therefore, we'd like to find any values of γ that might represent local
minima of the mean-squared error. Sufficient conditions for γ to be a local minimum of the MSE
are that: a) γ satisfies the FOCs 7.14 for each j = 1 . . . k; and b) the matrix H composed of
components Hjℓ = ∂²/∂γj∂γℓ E[(Y − X′γ)²] is positive definite. The k × k matrix H is called the
Hessian and it represents all of the second derivatives of a function. In the case of the MSE
function E[(Y − X′γ)²], the Hessian matrix turns out to be equal to 2E[XX′], which is positive
definite in the absence of perfect multicollinearity.

7.4.2 Simple linear regression in terms of covariances


When we just have a single regressor and a constant, we call this simple linear regression:

Y = β0 + β1 · X + ϵ (7.16)

where X is a scalar. Note that this is really a k = 2 instance of regression, in which one regressor is a
constant and the other is a random variable. In this case we can derive simple expressions for β0 and
β1 that do not require matrix notation.
Note that (7.6) provides two equations:

E[ϵ] = 0 (7.17)
E[X · ϵ] = 0 (7.18)

We can take expectations of both sides of 7.16 and use 7.17 to obtain:

E[Y] = β0 + β1 · E[X]

This expression says that the regression line Y = β0 + β1 · X passes through the point (E[X], E[Y]). If
we plug in the mean value of X, our “predicted value” of Y is the mean of Y. Re-arranging, we have
that β0 = E[Y] − β1 · E[X].
To make use of 7.18, let us multiply both sides of 7.16 by X and then take expectations. Since E[X · ϵ] = 0, we have:

E[X · Y] = β0 · E[X] + β1 · E[X²]

We’ve now derived two equations for the two unknowns β0 and β1 . If we substitute β0 = E[Y ] − β1 · E[X]
into the second equation above, we get that

E[X · Y ] = E[X] · E[Y ] − β1 · E[X]2 + β1 · E[X 2 ]


Using that Cov(X, Y) = E[X · Y] − E[X] · E[Y] and V ar(X) = E[X²] − E[X]², we arrive at a nice simple
formula for β1:

β1 = Cov(X, Y)/V ar(X)        (7.19)
Note that this also gives us an explicit expression for β0 , by substituting the above expression for β1 into

β0 = E[Y ] − β1 · E[X] (7.20)

Looking back at Equation 7.19, we see that in the case of simple linear regression β1 is nothing more
than a rescaled version of the covariance between X and Y . When Cov(X, Y ) is positive, β1 will be
positive, and when X and Y are negatively correlated β1 will be negative. Looking at Equation 7.20,
we see that β0 is simply whatever it needs to be so that when we plot the regression line β0 + β1 · x, it
passes through the point (E[X], E[Y ]).
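As a quick numerical check of Equation 7.19, the following sketch (Python with numpy; the simulated data-generating process is made up for illustration) compares the covariance-based formula with the slope recovered by a direct least-squares fit:

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# A hypothetical population: Y = 1 + 2*X + noise
X = rng.normal(size=n)
Y = 1.0 + 2.0 * X + rng.normal(size=n)

# Slope via Equation (7.19): beta1 = Cov(X, Y) / Var(X)
beta1_cov = np.cov(X, Y, ddof=0)[0, 1] / np.var(X)

# The same slope via a least-squares fit of Y on (1, X)
Z = np.column_stack([np.ones(n), X])
beta0_ls, beta1_ls = np.linalg.lstsq(Z, Y, rcond=None)[0]

print(beta1_cov, beta1_ls)  # both approximately 2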

There is also a simpler way to derive Eq. (7.19), which is to start with Eq. 7.16 and take the covariance
of both sides with X. Since the covariance operator is linear, we have that

Cov(X, Y) = Cov(X, β0) + β1 · Cov(X, X) + Cov(X, ϵ)

The first term on the RHS is zero, because β0 is a constant, which has a zero covariance with anything.
The last term is zero because Cov(X, ϵ) = E[X · ϵ] − E[X] · E[ϵ] = 0, where both of these terms are zero
by 7.17 and 7.18. Using that Cov(X, X) = V ar(X), we can rearrange to obtain Eq. (7.19).

Note that Equations 7.19 and 7.20 imply that we can write the residual from simple linear regression as

ϵ = (Y − E[Y]) − [Cov(X, Y)/V ar(X)] · (X − E[X])        (7.21)

Example: Consider a simple linear regression in which X is a binary variable, for example if
X = 1 indicates that an individual is female and X = 0 otherwise. In this case, recall from the
homework that Cov(X, Y ) = V ar(X) · (E[Y |X = 1] − E[Y |X = 0]), where V ar(X) = P (X =
1) · (1 − P (X = 1)). Thus, β1 = E[Y |X = 1] − E[Y |X = 0]. This means that β0 must be:

β0 = E[Y ] − (E[Y |X = 1] − E[Y |X = 0]) · E[X]


= {P (X = 1) · E[Y |X = 1] + P (X = 0) · E[Y |X = 0]} − (E[Y |X = 1] − E[Y |X = 0]) · E[X]
= (1 − P (X = 1)) · E[Y |X = 0] + P (X = 1) · E[Y |X = 0]
= E[Y |X = 0]

where we’ve used the law of iterated expectations and that E[X] = P (X = 1).

Thus, when we have a single binary regressor, linear regression gives us an intercept β0 =
E[Y|X = 0] that recovers the conditional mean of Y given X = 0. Since the slope coefficient is
β1 = E[Y|X = 1] − E[Y|X = 0], this means that when X = 1, the regression line passes through
β0 + β1 = E[Y|X = 1]. In other words, β0 + β1 · X exactly recovers the CEF E[Y|X].
We'll see that this nice property generalizes in multiple linear regression: whenever X contains
indicators for a complete set of groups (or a constant and indicators for all but one of the groups),
regression exactly captures the CEF.
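A quick simulation (Python with numpy; the group means below are made up for illustration) confirms that with a binary regressor, the OLS intercept and slope recover the two conditional means:

import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical conditional means: E[Y|X=0] = 3, E[Y|X=1] = 5
X = rng.binomial(1, 0.4, size=n)
Y = 3.0 + 2.0 * X + rng.normal(size=n)

# OLS of Y on a constant and X
Z = np.column_stack([np.ones(n), X])
beta0, beta1 = np.linalg.lstsq(Z, Y, rcond=None)[0]

print(beta0, Y[X == 0].mean())          # both approximately 3 = E[Y|X=0]
print(beta0 + beta1, Y[X == 1].mean())  # both approximately 5 = E[Y|X=1]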

Exercise: Given the above, convince yourself that in a regression of Y on a binary treatment variable D,
the coefficient on D is the difference-in-means estimand θDM that we met in Chapter 6. Under random
assignment, this regression hence yields the ATE.

Exercise: Derive Equations 7.19 and 7.20 from the general formula (7.7) for the regression vector, which
in this case reads β = (β0, β1)′ = E[(1, X)′(1, X)]−1E[(1, X)′Y]. For this you will need the formula for
the inverse of a 2 × 2 matrix:

( a  b )−1  =  [1/(ad − bc)] · (  d  −b )
( c  d )                       ( −c   a )

A hint to get you started: E[(1, X)′Y] = (E[Y], E[X · Y])′.

7.4.3 Multiple linear regression in terms of covariances


Now let's consider how the considerations of the last section generalize to a setting in which we have
multiple regressors and a constant. We know the general formula (7.7) for the vector β, which involves
the inverse matrix E[XX′]−1 multiplied by the vector E[XY]. While the matrix formula holds
generally, it turns out that we can still write expressions for the individual components of β in terms of
covariances and variances, which is helpful in understanding the mechanics of how regression works.

Two regressors and a constant


Consider first a case in which we have two regressors X1 and X2 , plus a constant:

Y = β0 + β1 · X1 + β2 · X2 + ϵ (7.22)
Recall that in this case the linear regression model gives us three restrictions: E[ϵ] = 0, E[X1 · ϵ] = 0
and E[X2 · ϵ] = 0.
Consider β2, the coefficient on X2. It turns out that we can write an expression for β2 by imagining
a sequence of two simple linear regressions. In the first step, we imagine running a regression of X2 on
X1 and a constant. We know that in this case, the coefficient on X1 will be Cov(X1, X2)/V ar(X1), the
constant will be E[X2] − [Cov(X1, X2)/V ar(X1)] · E[X1], and by Equation (7.21) the residual will be

X̃2 = (X2 − E[X2]) − [Cov(X1, X2)/V ar(X1)] · (X1 − E[X1])        (7.23)

where we use the notation X̃2 for the residual from this regression of X2 on X1 and a constant. Observe
that since X̃2 is a linear function of X1 and X2 , we know that it is uncorrelated with the error ϵ from
the “long-regression” Equation (7.22).

Exercise: Prove the claim above, that Cov(X̃2 , ϵ) = 0, using that E[ϵ] = 0, E[X1 ·ϵ] = 0 and E[X2 ·ϵ] = 0.

Given that Cov(X̃2 , ϵ) = 0, consider running a second simple linear regression, in which we regress Y on
X̃2 and a constant.
Proposition 7.4. The slope coefficient from this second regression is equal to β2 .
To demonstrate that this claim is true, let's begin by substituting Eq. (7.22) into the formula for the
slope in this second regression:

Cov(X̃2, Y)/V ar(X̃2) = Cov(X̃2, β0)/V ar(X̃2) + β1 · Cov(X̃2, X1)/V ar(X̃2) + β2 · Cov(X̃2, X2)/V ar(X̃2) + Cov(X̃2, ϵ)/V ar(X̃2)        (7.24)

where the first term is zero since β0 is a constant, and the last term is zero because we've seen that
Cov(X̃2, ϵ) = 0.
However, we also have that Cov(X̃2 , X1 ) = 0, so the second term above also vanishes. How do we
know this? Remember that X̃2 is the residual from a regression of something on X1 and a constant, so
it is uncorrelated with X1 by construction.

Exercise: Convince yourself of the claim above, that Cov(X̃2, X1) = 0. Why does it follow from how
we've defined X̃2 that E[X̃2] = 0 and E[X1 · X̃2] = 0?

The last step in establishing Proposition 7.4 is showing that Cov(X̃2, X2) = V ar(X̃2). To see this,
substitute our explicit expression (7.23) for X̃2 into the variance. Since E[X̃2] = 0, V ar(X̃2) = E[(X̃2)²],
which is equal to

E[((X2 − E[X2]) − [Cov(X1, X2)/V ar(X1)] · (X1 − E[X1]))²]
  = V ar(X2) − 2 · [Cov(X1, X2)/V ar(X1)] · Cov(X1, X2) + [Cov(X1, X2)/V ar(X1)]² · V ar(X1)
  = V ar(X2) − [Cov(X1, X2)/V ar(X1)] · Cov(X1, X2) = Cov(X̃2, X2)

where the final equality can be checked by computing Cov(X̃2, X2) directly from (7.23).

A more direct way to conclude that Cov(X̃2 , X2 ) = V ar(X̃2 ) is to notice that this statement is equivalent
to Cov(X̃2 , (X2 − X̃2 )) = 0, where (X2 − X̃2 ) is the regression line from a regression of X2 on X1 and a
constant. That this is uncorrelated with the error X̃2 from this same regression follows then by definition.

Since the labeling of X1 and X2 is arbitrary, Proposition 7.4 also gives us a way to express the coefficient
β1: first, run a regression of X1 on X2 and a constant, and then regress Y on the residuals X̃1 from this initial regression.
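Here is a numerical sketch of this two-step logic (Python with numpy; the simulated design is made up for illustration): partial X1 out of X2, then regress Y on the residual, and compare with the coefficient on X2 from the full regression.

import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# A hypothetical population with correlated regressors:
# Y = 1 + 0.5*X1 - 2*X2 + noise
X1 = rng.normal(size=n)
X2 = 0.7 * X1 + rng.normal(size=n)   # X2 correlated with X1
Y = 1.0 + 0.5 * X1 - 2.0 * X2 + rng.normal(size=n)

# Step 1: regress X2 on (1, X1); keep the residual X2_tilde
Z1 = np.column_stack([np.ones(n), X1])
g = np.linalg.lstsq(Z1, X2, rcond=None)[0]
X2_tilde = X2 - Z1 @ g

# Step 2: regress Y on (1, X2_tilde); the slope is beta2
Z2 = np.column_stack([np.ones(n), X2_tilde])
beta2_twostep = np.linalg.lstsq(Z2, Y, rcond=None)[0][1]

# Compare with the full regression of Y on (1, X1, X2)
Z = np.column_stack([np.ones(n), X1, X2])
beta2_full = np.linalg.lstsq(Z, Y, rcond=None)[0][2]

print(beta2_twostep, beta2_full)  # both approximately -2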

Many regressors and a constant


This principle generalizes to the general setting in which we have a regression equation with a constant
and k additional regressors X1 , X2 , . . . Xk :

Y = β0 + β1 · X1 + β2 · X2 + · · · + βk · Xk + ϵ        (7.25)

Note first that since one of our regressors is a constant, the system of equations (7.6) implies that
E[ϵ] = 0. Then the remainder of the equations in (7.6) can be read as saying that each Xj is uncorrelated
with the error, since Cov(Xj, ϵ) = E[Xj · ϵ] − E[Xj] · E[ϵ] = 0.

Proposition 7.5 (“regression anatomy” formula). The coefficient on Xj in regression (7.25) is

βj = Cov(X̃j , Y )/V ar(X̃j ),

where X̃j is the residual from a regression of Xj on all of the other regressors and a constant.

The text Mostly Harmless Econometrics refers to Proposition 7.5 as the “regression anatomy” formula
because it allows us to translate the complicated expression for the full vector β = E[XX ′ ]−1 E[XY ] into
a simpler expression for each of the components βj .

Note: We’ll see when we get to estimation in Section 7.5 that Proposition 7.5 has a sample analog,
referred to as the Frisch-Waugh-Lovell theorem. Proposition 7.5 constitutes a “population version” of
this very useful result.

Note: A corollary to Proposition 7.5 is that we can also write βj as Cov(X̃j, Ỹj)/V ar(X̃j), where we
define Ỹj to be the residual from a regression of Y on all the regressors aside from Xj, and a constant.
This follows because the difference Y − Ỹj is uncorrelated with X̃j.

Using Proposition 7.5 iteratively to get regression coefficients: Notice that Propo-
sition 7.5 gives us an explicit expression for βj , if we know the residuals X̃j . But if k > 2,
how do we compute the X̃j , which involve running a regression on k − 1 variables (e.g.
X1 , X2 , . . . Xj−1 , Xj+1 , . . . Xk ) and a constant?

If one would like to avoid the matrix expression for β, the answer is to appeal to Proposition 7.5
iteratively. This allows us to build up an expression for βj by running a series of several simple
linear regressions.

For instance, suppose we are interested in βk in regression (7.25). We can obtain it as follows:

1. To get βk, we need the residuals X̃k from a regression of Xk on X1 . . . Xk−1 and a constant.

2. Call the coefficient on Xj from this regression βjk. If we know all the βjk for j = 1 . . . k − 1,
we can calculate X̃k (note that we can pin down the constant β0k from the fact that E[X̃k] = 0).

3. By Proposition 7.5, we know that we can calculate each βjk if we have the residuals from a
regression of Xj on all regressors except Xj and Xk, and a constant. Call the coefficient on
Xℓ from this regression βℓjk.

4. By Proposition 7.5, we know that we can obtain βℓjk if we know the coefficients from a
regression of Xℓ on all variables except Xℓ, Xj and Xk, and a constant.

5. Continue on in this way, until we have a set of simple linear regressions, for which we can
apply Equations 7.19 and 7.20.

The box above describes an iterative algorithm that would allow us to use Proposition 7.5 to obtain an
expression for each coefficient βj in Equation (7.25). As you can see, it is tedious, and involves running
many simple linear regressions if our original regression contains more than a couple of variables.

That was a mess, so let's use matrix notation!
For this reason, we inevitably need to appeal to the general formula β = E[XX′]−1E[XY], which in the
context of regression (7.25) says that:

β = (β0, β1, β2, . . . βk)′ = E[(1, X1, X2, . . . Xk)′(1, X1, X2, . . . Xk)]−1E[(1, X1, X2, . . . Xk)′Y]

  = ( 1       E[X1]       E[X2]       . . .  E[Xk]      )−1 ( E[Y]      )
    ( E[X1]   E[X1²]      E[X1 · X2]  . . .  E[X1 · Xk] )   ( E[X1 · Y] )
    ( E[X2]   E[X2 · X1]  E[X2²]      . . .  E[X2 · Xk] )   ( E[X2 · Y] )        (7.26)
    ( ...      ...          ...        ...    ...        )   ( ...       )
    ( E[Xk]   E[Xk · X1]  E[Xk · X2]  . . .  E[Xk²]     )   ( E[Xk · Y] )
Proposition 7.5 can be derived from Equation 7.26, but it's not exactly pretty. One way to make this
work is to apply the block matrix inversion formula over and over again, mirroring the iterative approach
in the box above. We would eventually end up with a large number of 2 × 2 matrix inverse problems,
which are each easy to deal with (see the exercise at the end of Section 7.4.2). But life's too short for
that! Thankfully computers are a great help in computing our matrix inverses in practice.

7.5 The ordinary least squares (OLS) estimator


Now let’s turn to estimation in the linear regression model. That is, suppose that based on the many
motivations considered in Section 7.3, we’re interested in estimating the vector β from the linear regres-
sion model. The standard estimator for β in the linear regression model is referred to as the ordinary
least squares (OLS) estimator β̂OLS . Since this is the only estimator for β that we’ll consider, we’ll just
write it as β̂, to avoid writing OLS over and over again.

I break this section into several headings, for quick reference.

7.5.1 Sample
To define the OLS estimator β̂ we suppose that we have a sample (Yi, X1i, X2i, . . . Xki), i = 1 . . . n, of Y and our
set of regressors X1 to Xk, where n is the number of observations in our sample. Note: we will later
assume that our sample is i.i.d., but we don't need to use that fact right now.

7.5.2 OLS estimator


A simple way to define the OLS estimator β̂ = (β̂1, β̂2, . . . β̂k)′ is as the minimizer of the sample analog of
the least squares minimization, in which we replace the population expectation with the sample mean:

β̂ = argmin_{γ∈Rk} (1/n) Σ_{i=1}^n (Yi − Xi′γ)²        (7.27)

Given the OLS estimator β̂, let us make the following definitions:

• The fitted value Ŷi for observation i is Ŷi = Xi′β̂ = Σ_{j=1}^k β̂j · Xji

• The fitted residual for observation i is ϵ̂i = Yi − Ŷi

• Note that for each i, we have that Yi = Ŷi + ϵ̂i (by definition)

Equation (7.27) explains the origin of the name “ordinary least squares”, as β̂ is defined as the value of
γ that minimizes the sample sum of squares.

What is the solution to the minimization problem (7.27)? Taking the first-order-condition with respect
to each γj, we obtain the following system of equations:

(1/n) Σ_{i=1}^n X1i · (Yi − Xi′β̂) = (1/n) Σ_{i=1}^n X1i · ϵ̂i = 0
(1/n) Σ_{i=1}^n X2i · (Yi − Xi′β̂) = (1/n) Σ_{i=1}^n X2i · ϵ̂i = 0
  ...
(1/n) Σ_{i=1}^n Xki · (Yi − Xi′β̂) = (1/n) Σ_{i=1}^n Xki · ϵ̂i = 0        (7.28)

which can be summarized by the matrix equation

(1/n) Σ_{i=1}^n Xi(Yi − Xi′β̂) = 0

where 0 is a vector of k zeros. This is exactly analogous to Eq. (7.6), except that we have replaced the
population expectations E with sample averages (1/n) Σ_i. Rearranging the above:

[(1/n) Σ_{i=1}^n XiXi′] β̂ = (1/n) Σ_{i=1}^n YiXi        (7.29)

where recall that Xi = (X1i, X2i, . . . Xki)′ is a vector and Yi is a scalar for each i. Since Xi is k × 1 and
Xi′ is 1 × k, XiXi′ is a k × k matrix. In Equation 7.29 we've used that by the distributive property of
matrix multiplication, we can sum the matrices XiXi′ over the observations i and then multiply by β̂, which is equivalent
to multiplying first and then summing the k × 1 vectors XiXi′β̂ over observations.

“Physics intuition”: A nice way to visualize what OLS does (in the case of a single regressor
plus a constant) is to imagine a set of n springs. One end of each spring is attached to a data
point (Yi, Xi), and the other end is attached to a rigid rod, which will represent the regression
line β̂0 + β̂1X. The springs want to be as short as possible, and are constrained to move only in
the vertical direction. The length of spring i is equal to the fitted residual ϵ̂i. An approximation
known as Hooke's Law says that the potential energy stored in a spring is proportional to the
square of its extension. Thus, if the springs are identical to one another, the total potential
energy will be equal to the sum of squares of the ϵ̂i. Classical physics (i.e. “Newton's laws”) tells
us that the equilibrium position of the rod will minimize the potential energy stored across all of
the springs: exactly the minimization problem that OLS solves! The following link provides an
illustration of this: it generates a random dataset, and shows the rod coming to rest in exactly
the position of the OLS regression line: https://sam.zhang.fyi/html/fullscreen/springs/.
Here is another nice visualization (not animated though): https://joshualoftus.com/posts/2020-11-23-least-squares-as-springs/

7.5.3 Matrix notation


We can obtain a more compact notation for Equation 7.29 by introducing an n × k matrix X, that records
all of our observations of all of the regressors:

X := ( X1′ )   ( X11  X21  . . .  Xk1 )
     ( X2′ ) = ( X12  X22  . . .  Xk2 )
     ( ... )   ( ...               ... )
     ( Xn′ )   ( X1n  X2n  . . .  Xkn )

with n rows and k columns. The matrix X is often called the design matrix.

Similarly, we define an n × 1 vector of our observations of Y:

Y := (Y1, Y2, . . . Yn)′

In this notation, we can rewrite the matrix (1/n) Σ_{i=1}^n XiXi′ as (1/n) X′X. We can then write 7.29 in the
compact form:

(X′X)β̂ = X′Y        (7.30)

where we've multiplied both sides by n.

Exercise: Use the definition of matrix multiplication to verify Equation (7.30) from Equation (7.29).
That is, show that the jth component of (X′X)β̂ is equal to the jth component of (Σ_{i=1}^n XiXi′)β̂, and
similarly that the jth component of X′Y is equal to Σ_{i=1}^n YiXji, for all j = 1 . . . k.

Note: For a general matrix M , it is conventional to let Mij denote the entry on row i, column j. In
our notation above, we actually have the opposite order of indices, because we’ve let Xji denote the ith
observation of variable Xj , which is the entry in the j th column and ith row of X. To avoid getting
confused, always remain mindful that your matrix expressions are conformable, e.g. if you’re multiplying
X by a vector v to obtain Xv, then you know that v must have k components, since X has k columns.

7.5.4 Existence and uniqueness of β̂: no perfect-multicollinearity in sample

When does the sample least-squares problem have a unique solution, such that the OLS estimator is
well-defined? The answer is in exact analogy with the question of the existence and uniqueness of the population
regression vector β, studied in Section 7.4.1.
For Equation 7.30 to have a unique solution, we need the k × k matrix X′X to be invertible (see
the box in Section 7.4.1 for a review of solving a system of linear equations). The following proposition
provides a characterization of when this will be the case:

Proposition 7.6. Provided that n > k, the matrix X′X is invertible if none of the columns of X can be
written as a linear combination of the others. That is: Xγ ̸= 0 for all nonzero γ ∈ Rk.
This condition can be referred to as no perfect multicollinearity in the sample. When it holds, we obtain
an explicit expression for the OLS estimator β̂:

β̂ = (X′ X)−1 X′ Y (7.31)
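As a sanity check on Equation (7.31), here is a minimal sketch (Python with numpy; the simulated data are made up for illustration) computing β̂ = (X′X)−1X′Y directly and comparing it to a library least-squares routine:

import numpy as np

rng = np.random.default_rng(3)
n = 1_000

# Design matrix with a constant and two random regressors
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(size=n)

# OLS via the explicit formula (7.31), solving (X'X) b = X'Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# The same estimate via numpy's least-squares solver
beta_lstsq = np.linalg.lstsq(X, Y, rcond=None)[0]

print(np.allclose(beta_hat, beta_lstsq))  # True
print(beta_hat)                           # close to beta_true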

Recall from Proposition 7.2 that we have no perfect multicollinearity in the population if
P(Xi′γ ̸= 0) > 0 for all nonzero γ ∈ Rk. It is possible for this to hold in the population and yet for us to still end
up with perfect multicollinearity in the sample. Technically, this is a “knife-edge” case
that would happen with probability zero if the Xi are continuously distributed. But in practice
β̂ will be defined but numerically unstable if X′X is close to being non-invertible. This is what
statistical software sometimes complains about when it says that X is “highly singular”.

The most common case in which perfect multicollinearity occurs is when you forget that some of
your regressors are related to one another definitionally. For instance, recall the example from
Section 7.4.1, in which our regression includes a constant X1i = 1, a binary variable indicating that
a given individual i is married: X2i = marriedi, and a second binary variable X3i that indicates
that a given individual is not married. Then, since Xi = (1, marriedi, 1 − marriedi)′, we have
that Xi′(−1, 1, 1)′ = 0 for each i in the sample, and hence X(−1, 1, 1)′ is a vector of n zeroes. In this case we
have perfect multicollinearity both in sample and in the population, since P(Xi′(−1, 1, 1)′ = 0) = 1.
If you try to compute the OLS estimator for this regression in statistical software like Stata, it
will arbitrarily choose one of the variables to drop from the regression equation, and give you
a warning.
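A small sketch of the dummy-variable trap just described (Python with numpy; the data are made up for illustration), showing that X′X becomes singular when both indicators and a constant are included:

import numpy as np

rng = np.random.default_rng(4)
n = 100

married = rng.binomial(1, 0.5, size=n)

# Constant, married indicator, and not-married indicator:
# the columns satisfy col0 = col1 + col2, so X has rank 2
X = np.column_stack([np.ones(n), married, 1 - married])

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))  # 2, not 3

try:
    np.linalg.solve(XtX, X.T @ rng.normal(size=n))
except np.linalg.LinAlgError as e:
    print("X'X is singular:", e)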

7.5.5 More matrix notation
Following the notation we've developed to define the OLS estimator, we can also define n × 1 vectors of
the fitted values Ŷi, the fitted residuals ϵ̂i, and the population residuals ϵi:

ê := (ϵ̂1, ϵ̂2, . . . ϵ̂n)′        Ŷ := (Ŷ1, Ŷ2, . . . Ŷn)′        ϵ := (ϵ1, ϵ2, . . . ϵn)′

While ê and Ŷ are built with estimates from the data, note that ϵ is not observable. However, under
the assumption that the regression model Yi = Xi′ β + ϵi holds for each i, we have that

Y = Xβ + ϵ (7.32)

Note that we can also write


Y = Xβ̂ + ê (7.33)
where Ŷ = Xβ̂.

7.5.6 Regression as projection*


Regression thus provides a decomposition of the vector of observed outcomes, Y, into two pieces: the
vector Xβ̂ and the vector ê. This decomposition has a geometric interpretation. Make the following
definitions
1. Let In be the n × n identity matrix (see box in Section 7.4.1 for definition).
2. Define the n × n matrix P := X(X′X)−1X′. Following Hansen, we'll call this the projector matrix.

3. Define the n × n matrix M := In − P = In − X(X′ X)−1 X′ . Following Hansen, we’ll call this the
annihilator matrix.
Exercise: Check that both P and M are idempotent. We say that an n × n matrix A is idempotent
when AA = A.

Exercise: Show that both P and M are symmetric. We say that an n × n matrix A is symmetric when
it is equal to its transpose: A′ = A.

Exercise: Use the definitions of the projector and annihilator matrices to verify that

Xβ̂ = PY and ê = MY

Consider any vector v in Rn . Since In v = v (check this if you haven’t seen it before), and P + M = In ,
it follows that any vector v can be written as v = Pv + Mv. Applying this to the vector Y composed of
observations of our outcome variable, we obtain Equation (7.33).

Geometrically, Equation (7.33) provides a decomposition of Y into the part of Rn that is spanned by
the columns of the design matrix X (this is Xβ̂), and the part that is orthogonal to it (this is ê).

In what sense does P “project” the vector Y? Algebraically, we think of projection as something that
throws away some information in Y in such a way that after we “project” once, projecting again wouldn't
change anything. For example, projecting a two-dimensional vector (v1, v2) onto the x-axis means re-
placing v2 with zero, but keeping the first value: (v1, 0). If we then projected (v1, 0) onto the x-axis
again, we'd just get (v1, 0) again. In this case projection corresponds to the matrix

( 1  0 )
( 0  0 )

This is also what happens with regression, except that our projection operator is somewhat more com-
plicated. If we regress Y on X, and then run the regression a second time using the fitted values Ŷ on
the left hand side, this second regression will result in the exact same vector of estimates. This is because
PP = P: the fitted values from the second regression are PŶ = PPY = PY = Ŷ (see exercise above). A similar argument holds for the
fitted residuals, given that MM = M.

We call M the “annihilator” matrix because it is constructed in such a way that it “annihilates” the
matrix X upon multiplication:

MX = (In − X(X′X)−1X′)X = X − X(X′X)−1(X′X) = X − X = 0        (7.34)

By a similar argument, the projector matrix does nothing to X: PX = X.
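These algebraic facts are easy to verify numerically. A minimal sketch (Python with numpy; the simulated data are made up for illustration):

import numpy as np

rng = np.random.default_rng(5)
n, k = 50, 3

X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
Y = rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)  # projector matrix X(X'X)^{-1}X'
M = np.eye(n) - P                      # annihilator matrix

print(np.allclose(P @ P, P))   # idempotent: PP = P
print(np.allclose(M @ M, M))   # idempotent: MM = M
print(np.allclose(M @ X, 0))   # M annihilates X
print(np.allclose(P @ X, X))   # P does nothing to X

# PY gives the fitted values, MY the fitted residuals
beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
print(np.allclose(P @ Y, X @ beta_hat))      # True
print(np.allclose(M @ Y, Y - X @ beta_hat))  # True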

The diagonal elements of the projection matrix P have a useful interpretation, and are called
leverage values. Let
hii := [P]ii = [X(X′X)−1X′]ii = Xi′(X′X)−1Xi
Since P is an n × n matrix, there exists a leverage value corresponding to each observation i.
It measures how “extreme” the regressors Xi are for that observation. Note that Xi′Xi is the
squared norm of the vector Xi, i.e. how “big” it is. hii inserts the k × k matrix (X′X)−1 inside the dot-
product, which changes what we mean by “big”. In the case of simple linear regression, hii ends
up measuring whether the value of Xi is particularly high or low. Such observations may have a
large influence on the value of the OLS estimate β̂, since a small change in β̂ can translate into
a large change in the fitted residual ϵ̂i for that observation, when Xi is very far from the sample
mean.

The projection matrix notation also gives us a nice way to express the coefficient of determi-
nation, or R², which you may have met in previous econometrics classes. The coefficient of
determination can be defined as the ratio of the variance of the fitted values Ŷi across the
sample to the variance of the observed outcomes Yi across the sample. Thus it measures the
proportion of the variation in Y that is “explained” by the OLS regression line Ŷi = Xi′β̂.

For simplicity, suppose that the average value of Yi in the sample is zero, so that the sample
variance of Yi is (1/n) Y′Y = (1/n) Σ_{i=1}^n Yi². Note that we could always de-mean our data to satisfy
this assumption without affecting the variance of Yi or R² (provided that the regression contains
a constant, for the latter claim). In our matrix notation, we could then write R² as:

R² = Ŷ′Ŷ/Y′Y = (PY)′(PY)/Y′Y = Y′PY/Y′Y

The coefficient of determination measures how much the Euclidean norm of the vector Y shrinks
when we project it onto the regressors using P. Note that since P = In − M, we can also write
R² as:

R² = Y′(In − M)Y/Y′Y = 1 − Y′MY/Y′Y = 1 − ê′ê/Y′Y

7.5.7 The Frisch-Waugh-Lovell theorem: matrix version*


Suppose we're interested in just part of the vector β̂. That is, we separate our regressors X1 . . . Xk
into two groups, let's say X1 . . . Xj and Xj+1 . . . Xk, for some j (this is without loss of generality since
we could always re-order the indexing of the regressors). Our object of interest will be β̂1, where we
introduce the notation

β̂1 := (β̂1, β̂2, . . . β̂j)′        and        β̂2 := (β̂j+1, β̂j+2, . . . β̂k)′

where β̂1 and β̂2 here denote subvectors (not individual coefficients): stacking β̂1 on top of β̂2 recovers the full vector β̂.

Analogously, define the matrices X1 and X2 as

X1 := ( X11  X21  . . .  Xj1 )        X2 := ( Xj+1,1  Xj+2,1  . . .  Xk1 )
      ( X12  X22  . . .  Xj2 )              ( Xj+1,2  Xj+2,2  . . .  Xk2 )
      ( ...               ... )              ( ...                    ... )
      ( X1n  X2n  . . .  Xjn )              ( Xj+1,n  Xj+2,n  . . .  Xkn )

each with n rows; X1 has j columns and X2 has (k − j) columns. That is, X1 is a matrix of observations of
the regressors X1 . . . Xj and X2 is a matrix of observations of the regressors Xj+1 . . . Xk.

Define n × n projector matrices P1 := X1(X′1X1)−1X′1 and P2 := X2(X′2X2)−1X′2, and corresponding
annihilator matrices M1 := In − P1 and M2 := In − P2. Note that by the same logic as Equation
(7.34), the matrix M1 annihilates X1 (that is, M1X1 = 0, where 0 here is an n × j matrix of zeroes), and similarly
M2X2 = 0, where now 0 is an n × (k − j) matrix of zeroes.

The matrix P1 projects vectors in Rn into the subspace spanned by the columns of X1, which are the
first j columns of X. The matrix M1 projects vectors in Rn into the subspace orthogonal to the columns
of X1. Similarly, P2 projects vectors in Rn into the subspace spanned by the columns of X2, which are
the last (k − j) columns of X.

With the matrices M1 and M2 in hand, we can now give an explicit formula for β̂1 and β̂2 , known
famously as the Frisch-Waugh-Lovell theorem:
Proposition 7.7 (Frisch-Waugh-Lovell theorem).

β̂1 = (X′1 M2 X1 )−1 X′1 M2 Y

and
β̂2 = (X′2 M1 X2 )−1 X′2 M1 Y

Proof. We'll prove the expression for β̂1, as the proof for β̂2 is exactly analogous. Note that the
full design matrix X can be written as X = [X1, X2], and

Xβ̂ = [X1, X2](β̂′1, β̂′2)′ = X1β̂1 + X2β̂2

We can thus rewrite Equation (7.33) as:

Y = Xβ̂ + ê = X1β̂1 + X2β̂2 + ê

Consider multiplying both sides of this equation by M2. Since M2X2 = 0, we have:

M2Y = M2X1β̂1 + M2ê

Now multiply this equation from the left by X′1:

X′1M2Y = X′1M2X1β̂1 + X′1M2ê

The final step of the proof is to show that X′1M2ê = 0, and then multiply each side by
(X′1M2X1)−1 to get the final expression for β̂1. To see that X′1M2ê = 0, recall that ê = MY,
where M is the annihilator matrix for the full design matrix X. The vector ê = MY
is orthogonal to all of the columns of X = [X1, X2]; in particular it is orthogonal to the columns
of X2, so that P2ê = 0 and hence M2ê = ê. Thus
X′1M2ê = X′1ê = X′1MY = 0, where in the last step we've used that ê is also orthogonal to the columns of X1, i.e. X′1M = 0′.

Now let's see how the Frisch-Waugh-Lovell theorem relates to the “regression anatomy” result Proposition
7.5. Since M2 is idempotent, we can write

β̂1 = (X′1M2M2X1)−1X′1M2Y = (X̃′1X̃1)−1X̃′1Y

where X̃1 := M2X1, and we've used that M2 is a symmetric matrix: M′2 = M2. The n × j matrix
X̃1 collects the residuals from a series of j regressions: for each ℓ = 1 . . . j, column ℓ of X̃1 is
composed of the residuals from a regression of Xℓ on Xj+1 . . . Xk.

An analogous formula applies for β̂2, where X̃2 collects the residuals from regressions of each Xℓ on
X1 . . . Xj, for ℓ = j + 1 . . . k. In the special case in which X2 has a single column (e.g. we're
interested only in β̂k) and the remaining regressors together with a constant are collected in X1, we get exactly a
sample version of Proposition 7.5.
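The following sketch (Python with numpy; the simulated design is made up for illustration) verifies the Frisch-Waugh-Lovell formula numerically, comparing (X′2M1X2)−1X′2M1Y with the corresponding block of the full OLS estimate:

import numpy as np

rng = np.random.default_rng(6)
n = 500

# Full design: X1 holds a constant and one regressor, X2 holds two more
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = rng.normal(size=(n, 2))
X = np.hstack([X1, X2])
Y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

# Full OLS; the last two coefficients correspond to X2
beta_full = np.linalg.lstsq(X, Y, rcond=None)[0]

# FWL: annihilator of X1, then the partialled-out regression
M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
beta2_fwl = np.linalg.solve(X2.T @ M1 @ X2, X2.T @ M1 @ Y)

print(np.allclose(beta2_fwl, beta_full[2:]))  # True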

7.5.8 For the matrix haters: OLS in terms of covariances*


If we specialize to a regression that includes a constant in addition to k regressors, as in Section 7.4.3,
we can express the OLS slope coefficients in terms of sample covariances. The model is:

Yi = β0 + β1 · X1i + β2 · X2i + · · · + βk · Xki + ϵi

Let us now consider the OLS estimates β̂0, β̂1 . . . β̂k. One useful property of β̂ is that the fitted residuals

ϵ̂i = Yi − (β̂0 + β̂1 · X1i + · · · + β̂k · Xki)

will exactly average out to zero: that is, Σ_{i=1}^n ϵ̂i = 0. This can be seen from Eq. (7.28), since one of our
regressors is equal to 1 for all observations. This holds for any regression with a constant, and will be a
useful fact in what follows.

From the Frisch-Waugh-Lovell theorem, we can obtain an expression for each slope coefficient estimate
β̂j. In particular:

β̂j = Ĉov(ϵ̂j, Y)/V̂ar(ϵ̂j)        (7.35)

where we let ϵ̂j denote the fitted residuals from a regression of Xj on all the other regressors and a
constant. This is a sample analog of the population residual X̃j from this same regression. Equation (7.35)
provides a “sample analog” to the regression anatomy formula: Proposition 7.5.
The operators Ĉov and V̂ar appearing in Eq. (7.35) are defined as follows. Let A = (A1, A2, . . . An)′ and
B = (B1, B2, . . . Bn)′ be n × 1 vectors composed of observations of random variables Ai and Bi, respec-
tively. Let Ān = (1/n) Σ_{i=1}^n Ai be the sample mean of Ai, and similarly for B̄n. Then, we define:

Ĉov(A, B) = (1/n) Σ_{i=1}^n Ai · Bi − Ān · B̄n

and

V̂ar(A) = Ĉov(A, A) = (1/n) Σ_{i=1}^n Ai² − Ān²

As a special case of Eq. (7.35), we have that in simple linear regression

β̂1 = Ĉov(X, Y)/V̂ar(X)

where in this case ϵ̂1i is simply Xi − X̄n: the ith observation of our single regressor X, de-meaned
(de-meaning changes neither the sample covariance nor the sample variance). We can
work out the estimate of the constant β0 from the fact that the fitted residuals satisfy (1/n) Σ_{i=1}^n ϵ̂i =
(1/n) Σ_{i=1}^n (Yi − β̂0 − β̂1 · Xi) = 0. β̂0 is thus β̂0 = Ȳn − β̂1 · X̄n.

To see how to obtain Eq. (7.35), suppose we're interested in β̂k, in which case X2 is an n × 1 vector
of observations of Xk and X1 is a matrix of observations of the other regressors and a constant.
Note that since ϵ̂ki is a fitted residual from a regression that includes a constant, (1/n) Σ_{i=1}^n ϵ̂ki = 0.
Thus, we wish to show that

β̂k = [(1/n) Σ_{i=1}^n ϵ̂ki · Yi] / [(1/n) Σ_{i=1}^n (ϵ̂ki)²] = [(1/n) ϵ̂k′Y] / [(1/n) ϵ̂k′ϵ̂k] = (X′2M1Y) / (X′2M1X2)

where ϵ̂k := (ϵ̂k1, ϵ̂k2, . . . ϵ̂kn)′ = M1X2 is the vector of the ϵ̂ki across i. Since in this case X′2M1X2 is a scalar
(recall that we're looking at a single coefficient, and have thus defined X2 to be a vector rather
than a matrix), what we aim to show is β̂k = (X′2M1X2)−1(X′2M1Y). This is exactly
what is given by Proposition 7.7.

7.5.9 A review of notation


Let's review the notation that we've introduced in this section, because it can be confusing.

• We began with a random variable Y and a random vector X, which are related by Y = X′β + ϵ in
“the population”. The random vector X can be written X = (X1, X2, . . . Xk)′, where each Xj is a
different regressor. No i subscripts are necessary here.

• Then we draw a random sample, where observations are indexed by i = 1 . . . n. Yi is a random
variable reflecting the value of Y in the ith observation, and Xi assembles the values of all regressors
for observation i into a random vector: Xi = (X1i, X2i, . . . Xki)′.

• When discussing the OLS estimator, it is convenient to assemble information across all of the
observations, leading to the n × 1 vector Y and the n × k matrix X.

Consider the following toy dataset, where n = 4 and k = 3. This reflects a realization of the random
matrix X and the random vector Y:

i    X1   X2   X3   Y
1    1    4    0    23
2    1    3    1    54
3    1    2    1    21
4    1    6    0    77

The 4 × 3 block of values under the columns X1, X2 and X3 is X in our sample. The third row of that
block is X3′ laid out as a row vector: the values of each of the three regressors in the third observation. The
final column is the n × 1 vector Y. Note that our first “regressor” X1 is simply one for each
observation, and contributes a constant to our regression. Regressor X3 is a binary or “dummy” random
variable: taking values of only zero or one for all observations.

7.6 Statistical properties of the OLS estimator


In this section we'll see that the OLS estimator β̂ has many of the desirable properties introduced in
Section 5.3. It is consistent for the true population regression coefficient vector β, and has an asymptotically
normal distribution. Knowing this will allow us to test hypotheses about the regression vector β.
We also show in this section that OLS is an unbiased estimator of β, and is an efficient estimator in a
precise sense.
Recall from Section 5.3 that when considering the performance of an estimator, we want to compare
it to the population parameter of interest, in this case β. How can we do this? Well, we know from
Equation (7.31) that β̂ is a function of our observations of the outcome Y, and our observations of the
regressor X. So we need some way to relate these observations to the population parameter of interest
β. Our model of Yi does exactly that. Recall that Equation 7.2 describes how our observations of Yi
can be written in terms of the coefficients β. Equation (7.32) provides an equivalent statement of this
in vector notation. Studying the statistical properties of β̂ thus begins with the following crucial step:
substitute our equation for Y (Eq. 7.32) into the definition of the estimator (Eq. 7.31):

β̂ = (X′X)−1X′Y
  = (X′X)−1X′(Xβ + ϵ)
  = (X′X)−1(X′X)β + (X′X)−1X′ϵ
  = β + (X′X)−1X′ϵ        (7.36)

Eq. (7.36) is really quite remarkable: it says that regardless of whatever sample we ended up estimating
β̂ from, it is exactly equal to the true population parameter β, plus a second term that depends on the
vector of residuals ϵ and the sample design matrix X.

We’ll now proceed in two steps. First, we’ll study the distribution of β̂ when our sample design matrix
is held fixed. This allows us to establish that conditional on X, the estimator β̂ is unbiased and efficient.
Then, we’ll consider the properties of β̂ as n gets very large.

Keep in mind what we're doing in this section: we're asking what the distribution of our estimator β̂ is,
given that the data in our sample was a random draw from an underlying population. This will allow us
to think about questions like: how likely would we be to get an estimate β̂ that is far from β, given that
the sample we use to compute β̂ is random (and thus could have been different than the one we actually
see)?

7.6.1 Finite sample properties of β̂, conditional on X*


Understanding the finite-sample sampling distribution of an estimator is generally quite hard, since we
can’t rely on the asymptotic properties of Chapter 4. However, things become easier if we consider the
conditional distribution of β̂, given our observed design matrix X (which was itself a random draw). In
this section we’ll see that OLS is both unbiased and efficient, conditional on the realized design matrix.
For these results, we’ll assume that the linear regression model holds and that we have an independent
and identically distributed sample:
Assumption 1 (linear regression model and i.i.d sampling). (Yi , Xi ) is an i.i.d. sample from the
model: Y = X ′ β + ϵ with E[ϵ|X] = 0.
Assumption 1 implies that for each i = 1, 2, . . . , n:

Yi = Xi′ β + ϵi

where E[ϵi |Xi ] = 0.


With X fixed, we know from Eq. (7.36) that the only variation in β̂ comes from variation in the residuals
ϵi . What can we say about the distribution of the ϵi ?

7.6.1.1 Unbiasedness
Proposition 7.8. Given Assumption 1, OLS is conditionally unbiased for β; that is: E[β̂|X] = β.
Note that by the law of iterated expectations, this implies that β̂ is also unconditionally unbiased:

E[β̂] = E{E[β̂|X]} = E{β} = β

Proof. Note that by 7.36, E[β̂|X] = β + E[(X′X)−1X′ϵ | X]. Our goal will be to show that the second
term is zero (for each component of β).

Since X is not random conditional on X, we can pull it out of the expectation:

E[(X′X)−1X′ϵ | X] = (X′X)−1X′E[ϵ | X]        (7.37)

Now consider the quantity E [ϵ| X]. This is an n × 1 vector, of which the ith component is E[ϵi |X].
Noting that conditioning on X is the same as conditioning on X1 , X2 , . . . , Xn , we have that:

E[ϵi |X] = E[ϵi |X1 , X2 , . . . , Xn ] = E[ϵi |Xi ] = 0


The second equality above follows from i.i.d. sampling: since (Yi, Xi) is jointly independent of Xj
for j ̸= i (recall that ϵi is a function of Yi and Xi, i.e. ϵi = Yi − Xi′β), we can remove all of the Xj
for j ̸= i from the conditioning event. The final equality then follows from the linear regression model
assumption.
Considering that the above holds for each i, we have all together that E [ϵ| X] = 0n , where we let 0n
denote a vector of n zeroes. This then implies that (X′ X)−1 X′ E [ϵ| X] = 0k , where we let 0k denote a
vector of k zeroes. This completes the proof.
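A small Monte Carlo sketch (Python with numpy; the design and parameters are made up for illustration) of what unbiasedness means: across many samples drawn from the same population, the average of β̂ is close to β.

import numpy as np

rng = np.random.default_rng(7)
n, reps = 50, 5_000
beta = np.array([1.0, 2.0])

estimates = np.empty((reps, 2))
for r in range(reps):
    # Draw a fresh sample from the model Y = 1 + 2*X + eps
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    Y = X @ beta + rng.normal(size=n)
    estimates[r] = np.linalg.lstsq(X, Y, rcond=None)[0]

# Average estimate across samples is close to the true beta
print(estimates.mean(axis=0))  # approximately [1.0, 2.0]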

7.6.1.2 Efficiency
Given that OLS is (conditionally) unbiased, we know that conditional on X, its mean squared error
as an estimator of β is equal to its variance (recall the bias-variance decomposition of Eq. 5.2). A
well-known result about OLS is that it has the smallest variance among all unbiased estimators of β. To
state this result, we will require an extra assumption, which is that the errors ϵi have the same variance
for each observation:
Definition 7.1. We call the linear regression model (conditionally) homoskedastic if for some value
σ, V ar(ϵi |Xi ) = σ 2 for all i.

Homoskedasticity is a very strong assumption, and won’t hold in practice in most settings. But it has
some use as a simplifying assumption to help understand OLS.
Proposition (Gauss-Markov Theorem). Consider an alternative estimator β̃ of β that is also con-
ditionally unbiased: E[β̃|X] = β. Given Assumption 1 and homoskedasticity, OLS is efficient for β in
the sense that:
V ar(β̃|X) ≥ V ar(β̂|X)
where for two k×k matrices A and B, we say that A ≥ B when the matrix A−B is positive semi-definite.

The above version of the Gauss-Markov Theorem is due to Hansen (2021), and was proved only very
recently! Most textbooks (with the exception of the Hansen book) add the assumption that the
candidate estimator β̃ is a linear function of Y (like OLS is). In this context, the Gauss-Markov theorem
is often described as saying that OLS is B.L.U.E: the best linear unbiased estimator of β. The above
“modern Gauss-Markov” theorem is a stronger result than BLUE: it implies that considering non-linear
estimators of β wouldn't help us reduce the variance of β̃ beyond the bound achieved by OLS.

What is the conditional variance of OLS V ar(β̂|X) appearing in the Gauss-Markov Theorem?
Note that homoskedasticity along with the i.i.d assumption imply that:

V ar(ϵ|X) = E [ϵϵ′ |X] = σ 2 In

where In is the n × n identity matrix.

Exercise: Definition 7.1 implies that the diagonal elements of V ar(ϵ|X) are equal to σ 2 , but
how do we know that the off-diagonal elements are equal to zero?

The above in turn implies that the conditional variance of the OLS estimator is:

V ar(β̂|X) = E[(β̂ − β)(β̂ − β)′|X] = E[(X′X)−1X′ϵϵ′X(X′X)−1 | X]
          = (X′X)−1X′(σ²In)X(X′X)−1
          = σ²(X′X)−1(X′X)(X′X)−1
          = σ²(X′X)−1

Thus the Gauss-Markov Theorem is typically written as saying that V ar(β̃|X) ≥ σ 2 (X′ X)−1 ,
where the RHS is the conditional variance of the OLS estimator.

What should we make of the Gauss-Markov Theorem given that homoskedasticity is usually not a
reasonable assumption? It turns out that OLS is actually not efficient when homoskedasticity fails
(we call this heteroskedasticity). In that setting, a closely-related estimator known as weighted-
least-squares (WLS) becomes efficient (at least among linear estimators). Efficient WLS minimizes
Σ_{i=1}^n wi(Yi − Xi′β)², where the wi are chosen to be 1/V ar(ϵi|X). In practice, we don't generally
know V ar(ϵi|X) ex-ante, so a feasible version of WLS requires estimating this quantity. Basically,
the re-weighting of each observation by wi “undoes” heteroskedasticity, so that WLS mimics the
properties that OLS has in the homoskedastic case. Thus you should think of Gauss-Markov as
illustrating an idea that is more general than homoskedasticity, but is simpler to express under
that strong assumption.

7.6.2 Asymptotic properties of β̂


We now turn to deriving properties of the OLS estimator as the sample size n gets very large.
We’ll show that β̂ is a consistent estimator for β, and then that its sampling distribution is asymp-
totically normal. For these results, we don’t need for the linear regression model with E[ϵ|X] = 0 to
hold. The large sample properties hold for the linear projection coefficient β even if the CEF of Y on X
is not linear. As before, we assume that we have an independent and identically distributed sample:
Assumption 2 (linear projection model and i.i.d sampling). (Yi , Xi ) is an i.i.d. sample from the
model: Y = X ′ β + ϵ with E[ϵ · X] = 0.

To make claims that involve convergence in probability and convergence in distribution, we will consider
a sequence of estimators β̂, indexed by the sample size n. For each n = 1, . . . , ∞ along the sequence,
we assume that Assumption 2 holds. As a reminder (cf. Chapter 4), in reality sample sizes never actually “grow” to
infinity. In practice, we always have an actual sample that has some actual finite size n. The idea of an
asymptotic sequence exists only to provide an approximation to the sampling distribution of β̂ given our
fixed n, which we will take to be accurate when the sample size is big enough.

7.6.2.1 Consistency
We'll first see that given the asymptotic sequence described above, β̂ →p β. That is, β̂ is a consistent
estimator of β.
Subtracting β from each side of Equation 7.36:

β̂ − β = (X′X)−1X′ϵ = [(1/n) X′X]−1 [(1/n) X′ϵ]

where in the second equality we've used that the factor of 1/n inside the matrix inverse cancels the one on
X′ϵ. Now let's consider this latter quantity alone. Expanding out the matrix product:

(1/n) X′ϵ = (1/n) Σ_{i=1}^n Xiϵi,

i.e. it is equal to the sample average of the random vector Xiϵi. To see the above, note that (1/n) X′ϵ is a
k × 1 vector, whose jth element is equal to 1/n times the inner product between ϵ and the jth row of X′. The jth
row of X′ is equal to the jth column of X, which is comprised of the n observations of regressor Xj.
Thus, by the law of large numbers, we have that (1/n) X′ϵ →p E[Xiϵi], provided that each component of E[Xiϵi] is finite. By
the linear projection model (Assumption 2), E[Xiϵi] = 0, where 0 is a vector of k zeroes.
Similarly, we have by the law of large numbers that

(1/n) X′X = (1/n) Σ_{i=1}^n XiXi′ →p E[XiXi′]

In Chapter 4 we only considered the LLN for random vectors, not random matrices like Xi Xi′ . But since
you can always rewrite an n × m matrix as a vector with n · m elements, the LLN for vectors applies
so long as each element of the matrix E[Xi Xi′ ] is finite. In the box below, I state a set of assumptions,

“regularity conditions”, that ensure we can use the law of large numbers here, and that all expectations
that appear in this section exist.
Given that (1/n) X′X →p E[XiXi′], the continuous mapping theorem implies that

[(1/n) X′X]−1 →p E[XiXi′]−1
That’s because for a general invertible matrix M, the matrix inverse function M−1 is a continuous func-
tion of each of the elements of M.

Finally, by the continuous mapping theorem (applied to the product of the two limits above), we have that

β̂ − β = (X′X)−1X′ϵ = [(1/n) X′X]−1 [(1/n) X′ϵ] →p E[XiXi′]−1 · 0 = 0

Thus we have proved that β̂ →p β.
Proposition 7.9. OLS is consistent for β given Assumption 2 and the regularity conditions 3 below.

How do we know that the matrix E[XiXi′] has only finite entries, so that we can use the LLN?
One might simply assume that this is true, but the conventional textbook approach is to state
a set of conditions that are sufficient for E[XiXi′] to be finite, but are easier or more natural to
state. Technical assumptions of this kind are often referred to as “regularity conditions”. The
Hansen text uses the following:

Assumption 3 (regularity conditions for consistency). Suppose that:

1. E[Yi²] is finite
2. E[||Xi||²] is finite
3. We have no perfect multicollinearity in the population: that is, E[XiXi′] is positive definite.

where for any vector a with k components, we let ||a|| denote its Euclidean norm, i.e.
||a||² = Σ_{j=1}^k (aj)².

Let us now see how Assumption 3 gets us what we need to use the LLN to analyze the OLS
estimator. In particular, we'll use item 2, that E[||Xi||²] = E[Σ_{j=1}^k Xji²] = Σ_{j=1}^k E[Xji²] < ∞.
Note that this implies that E[Xji²] < ∞ for each j, since these quantities are all positive: we
could never have one of them be infinite but the sum finite.

Now consider a generic element of the matrix E[XiXi′]. For instance, the element in the jth row,
ℓth column is E[XjiXℓi]. To show that this must be finite, we'll use the following very useful
inequality for expectations: the Cauchy-Schwarz inequality says that for random variables X and Y: |E[X · Y]|² ≤
E[X²] · E[Y²].
Applying the Cauchy-Schwarz inequality to the j, ℓ element of E[XiXi′], we have that

|E[XjiXℓi]| ≤ √(E[Xji²] · E[Xℓi²])

Since each of E[Xji²] and E[Xℓi²] is finite, their product must also be finite, and also its square
root.
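Consistency is also easy to visualize by simulation. A minimal sketch (Python with numpy; the design is made up for illustration) showing β̂ approaching β as n grows:

import numpy as np

rng = np.random.default_rng(8)
beta = np.array([1.0, 2.0])

for n in [100, 10_000, 1_000_000]:
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    # Note: the error here is heteroskedastic, but consistency still holds
    eps = rng.normal(size=n) * (1 + 0.5 * np.abs(X[:, 1]))
    Y = X @ beta + eps
    beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
    print(n, np.abs(beta_hat - beta).max())  # shrinks toward 0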

7.6.2.2 Asymptotic normality


Now let's use the central limit theorem to derive the asymptotic distribution of the OLS estimator. Let
us pick up from the expression β̂ − β = (X′X)−1X′ϵ. Knowing that the central limit theorem will involve
a factor of √n, let's rewrite this as

√n(β̂ − β) = [(1/n) X′X]−1 · √n [(1/n) X′ϵ]

Recall that E[Xiϵi] = 0, where 0 is a vector of k zeroes, and that (1/n) X′ϵ is the sample mean of the random
vector Xi · ϵi. Using the notation of Chapter 4, let's denote this as (Xϵ)n := (1/n) Σ_{i=1}^n Xi · ϵi. Then we
can write the above as:

√n(β̂ − β) = [(1/n) X′X]−1 · √n [(Xϵ)n − E[Xiϵi]]

The rightmost factor in the above expression has exactly the form that we need to apply the CLT, in
particular:

√n [(Xϵ)n − E[Xiϵi]] →d N(0, V ar(Xiϵi))

Note that since E[Xiϵi] = 0, we can write the variance as

V ar(Xiϵi) = E[(Xiϵi)(Xiϵi)′] = E[ϵi²XiXi′]        (7.38)

Now we use the Slutsky theorem:

√n(β̂ − β) = [(1/n) X′X]−1 · √n [(Xϵ)n − E[Xiϵi]] →d E[XiXi′]−1 N(0, E[ϵi²XiXi′])        (7.39)

where the first factor converges in probability to E[XiXi′]−1 and the second converges in distribution to
N(0, E[ϵi²XiXi′]). That is, √n(β̂ − β) converges in distribution to a random vector whose distribution is that of the matrix
E[XiXi′]−1 times a normal vector with mean zero (for each component) and variance-covariance matrix
E[ϵi²XiXi′].
The RHS of (7.39) is thus equal to a linear combination of normal random vectors. Adapting Propo-
sition 3.5, note the following: let X ∼ N(µ, Σ) be a k-component random vector. Then for any k × k
matrix A:

A′X ∼ N(A′µ, A′ΣA)

(this can also be seen as an example of the delta method, applied to a vector-valued function h). Thus,
we can write (7.39) as

√n(β̂ − β) →d N(0, V)        (7.40)

where V := E[XiXi′]−1E[ϵi²XiXi′]E[XiXi′]−1. We refer to V as the asymptotic variance of the OLS
estimator.

Example: Note that in the special case of homoskedasticity studied in Section 7.6.1, we have that
E[ϵi²|Xi] = σ² for all i, and thus by the law of iterated expectations:

E[ϵi²XiXi′] = E[E[ϵi²|Xi]XiXi′] = E[σ²XiXi′] = σ² · E[XiXi′]        (7.41)

In this case the asymptotic variance takes on a very simple form:

V = E[XiXi′]−1(σ² · E[XiXi′])E[XiXi′]−1 = σ² · E[XiXi′]−1

A sufficient condition for us to be able to apply the CLT (see Section 4.4) is that E[(Xiϵi)′(Xiϵi)] =
E[ϵi²Xi′Xi] be finite. This requires finite fourth moments of the data, rather than the finite second
moments assumed to prove consistency of OLS. To see why, note that for any j and ℓ:

E[ϵi²Xji · Xℓi] = E[(Yi − Xi′β)²Xji · Xℓi] = E[Yi² · XjiXℓi] − 2E[Xi′β · Yi · Xji · Xℓi] + E[β′XiXi′β · Xji · Xℓi]

which can be written out as a sum over expectations that each involve the product of four random
variables. To keep all such terms finite, Hansen assumes the following:

Assumption 4 (regularity conditions for asymptotic normality). Suppose that:

1. E[Yi4 ] is finite
2. E[||Xi ||4 ] is finite
3. We have no perfect multicollinearity in the population: that is, E[Xi Xi′ ] is positive definite.

7.6.2.3 Estimating the asymptotic variance

Equation 7.40 is not immediately useful unless we know the asymptotic variance matrix V. Since we
don't know V before seeing the data, we will estimate it! In this section we see that we can construct
a consistent estimator V̂ such that V̂ →p V. Doing this will open the door to hypothesis testing, which
we'll consider in the next section.
Before seeing how hypothesis testing will work, let's consider how to construct the estimator V̂ of
the asymptotic variance of OLS. Note that V = E[XiXi′]−1E[ϵi²XiXi′]E[XiXi′]−1 has a “sandwich” form:
it puts the matrix E[ϵi²XiXi′] (the meat)2 between two instances of the matrix E[XiXi′]−1 (the bread).
By the continuous mapping theorem, we can construct a consistent estimator of V by making a sandwich out of
consistent estimators for the meat and for the bread.
We've already seen that [(1/n) X′X]−1 is a consistent estimator for the bread: E[XiXi′]−1. An estimator
for the meat E[ϵi²XiXi′] is not quite as obvious. Its sample analog (1/n) Σ_{i=1}^n ϵi²XiXi′ would definitely
work, but the true residuals ϵi are not observed. However, we can use the fitted residuals ϵ̂i, which are
a function of the observed data, instead. We can write this in matrix form as:
Ω̂ := (1/n) Σ_{i=1}^n ϵ̂i²XiXi′

One can verify that Ω̂ →p E[ϵi²XiXi′]. Thus, we can form a consistent variance estimator as

V̂HC0 := [(1/n) X′X]−1 Ω̂ [(1/n) X′X]−1        (7.42)

Eq. (7.42) is referred to as the “HC0” estimator of V, where HC stands for heteroskedasticity consistent.
This name comes from the fact that V̂HC0 does not require the assumption of homoskedasticity (Defi-
nition 7.1) to be a consistent estimator of V.
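Here is a sketch of computing the HC0 sandwich estimator directly from the formulas above (Python with numpy; the simulated data are made up for illustration, and in practice a statistical package would do this for you):

import numpy as np

rng = np.random.default_rng(9)
n = 2_000

X = np.column_stack([np.ones(n), rng.normal(size=n)])
# Heteroskedastic errors: variance depends on the regressor
eps = rng.normal(size=n) * (1 + np.abs(X[:, 1]))
Y = X @ np.array([1.0, 2.0]) + eps

beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
ehat = Y - X @ beta_hat  # fitted residuals

bread = np.linalg.inv(X.T @ X / n)           # [(1/n) X'X]^{-1}
meat = (X * ehat[:, None] ** 2).T @ X / n    # (1/n) sum e_i^2 x_i x_i'
V_hc0 = bread @ meat @ bread                 # the sandwich, Eq. (7.42)

# Standard errors: sqrt of diagonal of V_hc0 / n (see Section 7.7)
print(np.sqrt(np.diag(V_hc0) / n))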

When you run a command like regress y x, robust in Stata, the default covariance estimator is the
so-called “HC1” estimator of V:

V̂HC1 := [n/(n − k)] · [(1/n) X′X]−1 Ω̂ [(1/n) X′X]−1        (7.43)

Note that the additional factor n/(n − k) will make very little difference when n is large compared with k,
and will make no difference in the asymptotic limit, since n/(n − k) → 1 as n → ∞. Applying this rescaling
however can be helpful when n is small. It's easiest to understand the justification in the case of ho-
moskedasticity, which is left as an exercise (see box below).

Note: there are further estimators floating around, with names HC2, HC3, and HC4. These apply fur-
ther modifications to V̂HC0 (see the Hansen text for details). Other variance estimators exist for certain
violations of the i.i.d sampling assumption, including cluster-robust variance estimators for clustered
sampling and autocorrelation-consistent estimators for serially correlated panel data.

If we had reason to believe that homoskedasticity holds, then it is not necessary to use the matrix
Ω̂ when estimating V. The alternative estimator given below will perform better provided that
the assumption of homoskedasticity is true. But, it will be inconsistent if not.

2 Ideally plant-based meat :)

Recall from Eq. 7.41 that under homoskedasticity E[ϵi²XiXi′] = σ² · E[XiXi′], where σ² =
E[ϵi²|Xi] = E[ϵi²]. Thus, we just need a consistent estimator of E[ϵi²], which we can then multiply
by (1/n) X′X. The standard estimator of E[ϵi²] is denoted by Hansen as s², where

s² = [1/(n − k)] Σ_{i=1}^n ϵ̂i²

Thus an estimator of V that is valid under the assumption of homoskedasticity is s² · [(1/n) X′X]−1.
This is what Stata computes by default, if you don't include , robust at the end of your
regression command.

Exercise: Why use the estimator s² as written rather than the simpler expression (1/n) Σ_{i=1}^n ϵ̂i²?
The reason is that dividing by n − k rather than n makes s² an unbiased estimator of E[ϵi²].
Derive the bias of the estimator (1/n) Σ_{i=1}^n ϵ̂i², and show that s² is unbiased.

7.7 Inference on the regression vector β


Given a consistent estimator of V like the HC0 or the HC1 estimator, we can transform the quantity
$\sqrt{n}(\hat\beta - \beta)$ into one whose limiting distribution is well understood and contains no unknown parameters.
This proves to be a much more useful result than Equation (7.40), because it allows us to test hypotheses
about the population regression vector $\beta$. In particular, we pre-multiply $\sqrt{n}(\hat\beta - \beta)$ by the matrix
$\hat{V}^{-1/2}$ (see box below for the definition of $\hat{V}^{-1/2}$):
Proposition 7.10. Given Assumption 2, the regularity conditions 4, and a $\hat{V}$ such that $\hat{V} \overset{p}{\to} V$:
$$\sqrt{n}\,\hat{V}^{-1/2}(\hat\beta - \beta) \overset{d}{\to} N(0, I_k)$$
where $I_k$ is the $k \times k$ identity matrix.
The distribution $N(0, I_k)$ is that of $k$ standard normal random variables, each of which is independent
of the others (the variance-covariance matrix $I_k$ has an entry of zero for each off-diagonal element). The
power of Proposition 7.10 lies in the fact that the distribution appearing on the RHS, $N(0, I_k)$, contains
no unknown quantities: we know exactly the probability that it associates to any event. Thus, for large
$n$, we have a very good approximation to the distribution of $\sqrt{n}\,\hat{V}^{-1/2}(\hat\beta - \beta)$. This provides the
foundation for us to quantify uncertainty in our estimates $\hat\beta$ and test hypotheses about the regression vector $\beta$.

The matrix square root. One can show that the matrix $V = E[X_iX_i']^{-1}E[\epsilon_i^2 X_iX_i']E[X_iX_i']^{-1}$
is invertible, symmetric, and positive definite. A property from linear algebra is that symmetric
positive definite matrices $M$ have a "matrix square root": a unique matrix $M^{1/2}$ such
that $M = M^{1/2}M^{1/2}$. Thus, we can let $V^{-1/2}$ denote the inverse of the matrix square root of $V$.
Let $\hat{V}^{-1/2}$ denote the analogous matrix for our estimator $\hat{V}$. For example,
$$\hat{V}_{HC0}^{-1/2} = \left[\left(\tfrac{1}{n}X'X\right)^{-1}\hat{\Omega}\left(\tfrac{1}{n}X'X\right)^{-1}\right]^{-1/2}$$
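In code, one might compute a matrix square root via the eigendecomposition, as in the sketch below (scipy.linalg.sqrtm is an off-the-shelf alternative); the helper name is illustrative:

    import numpy as np

    def inv_matrix_sqrt(M):
        """M^{-1/2} for a symmetric positive definite M. Writing
        M = Q diag(lam) Q' with orthonormal Q and positive eigenvalues lam,
        we have M^{-1/2} = Q diag(lam^{-1/2}) Q'."""
        lam, Q = np.linalg.eigh(M)  # eigh handles symmetric matrices
        return Q @ np.diag(lam ** -0.5) @ Q.T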

Proof of Proposition 7.10. By the continuous mapping theorem, $\hat{V}^{-1/2} \overset{p}{\to} V^{-1/2}$. Then, by the
Slutsky theorem and Eq. (7.40):

$$\sqrt{n}\,\hat{V}^{-1/2}(\hat\beta - \beta) \overset{d}{\to} V^{-1/2}N(0, V) = N(V^{-1/2}0,\; V^{-1/2}VV^{-1/2}) = N(0, I_k)$$

where in the last step we've used that $V^{-1/2}VV^{-1/2} = VV^{-1/2}V^{-1/2} = VV^{-1} = I_k$ (we've
made use of the fact that "matrix powers" like $M^{-1/2}$ always commute with the original matrix
$M$, meaning that $M^{-1/2}M = MM^{-1/2}$).

Note: here we’ve simply assumed that V̂−1/2 exists. However it’s possible to argue that it must

96
exist for large enough n by appealing to the strong law of large numbers.

7.7.1 Testing hypotheses about a single regression coefficient


To see how the logic of Proposition 7.10 is useful, let's consider a simple setting, which turns out to be
the most common one in practice: we are interested in the true value of a single regression coefficient,
say $\beta_j$, in a regression that contains $k$ regressors.
Note that we can write $\beta_j$ as $e_j'\beta$, where $e_j = (0, \dots, 1, \dots, 0)'$ is a $k$-component vector that puts a
one in position $j$, and zeros everywhere else. Similarly, $e_j'\hat\beta$ picks out the single component $\hat\beta_j$ from the
OLS estimator. It then follows from Equation (7.40) and the Delta method that

$$\sqrt{n}(\hat\beta_j - \beta_j) = e_j'\sqrt{n}(\hat\beta - \beta) \overset{d}{\to} e_j'\,N(0, V) = N(e_j'0,\; e_j'Ve_j) = N(0, V_{jj})$$

where $V_{jj} = e_j'Ve_j$ is the $j$th element along the diagonal of the matrix $V$.

This implies, analogously to Proposition 7.10, that

$$\sqrt{n} \cdot \frac{\hat\beta_j - \beta_j}{\sqrt{\hat{V}_{jj}}} \overset{d}{\to} N(0, 1) \tag{7.44}$$

where $\hat{V}_{jj}$ is the $j$th element along the diagonal of the matrix $\hat{V}$, which is a consistent estimator of $V_{jj}$.
Note that we could have written the LHS of Eq. (7.44) as $\sqrt{n}\,\hat{V}_{jj}^{-1/2}(\hat\beta_j - \beta_j)$ as in Proposition 7.10, but
since $\hat{V}_{jj}$ is a scalar we may take its conventional square root and divide by it.

We define the standard error for the estimate $\hat\beta_j$ to be $se(\hat\beta_j) := \sqrt{\hat{V}_{jj}/n}$. Note that the standard error is
a quantity that is computed from the data, given $\hat{V}$ (it is an estimate, rather than a population quantity).
By Eq. (7.44), we know that the quantity $(\hat\beta_j - \beta_j)/se(\hat\beta_j)$ converges in distribution to a standard normal.

This allows us to test hypotheses about the value of $\beta_j$, using our estimate $\hat\beta_j$ and $se(\hat\beta_j)$. Consider the
null hypothesis $H_0: \beta_j = \beta_0$ for some value $\beta_0$ (e.g. zero). Define the T-statistic for this hypothesis to
be

$$T(\beta_0) = \frac{\hat\beta_j - \beta_0}{se(\hat\beta_j)}$$

If $H_0$ is true, then we know that $T(\beta_0) \overset{d}{\to} N(0, 1)$. Recall from Section 5.4.1 that the size of a hypothesis
test is the maximum probability of rejecting the null hypothesis, when the null hypothesis is in fact true.
We can form a test with size $\alpha$ in the following way:

reject H0 iff |T (β0 )| > c

where $c$ is a value such that the probability of a standard normal random variable having a magnitude
of at least $c$ is no greater than $\alpha$. To do this in a way that maximizes power, we choose $c$ to be exactly the
$1 - \alpha/2$ quantile of the standard normal distribution: $c = \Phi^{-1}(1 - \alpha/2)$.

Exercise: Show that if Z ∼ N (0, 1), P (|Z| > Φ−1 (1 − α/2)) = α.
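A quick numerical check of this exercise, assuming scipy is available:

    from scipy.stats import norm

    alpha = 0.05
    c = norm.ppf(1 - alpha / 2)      # Phi^{-1}(1 - alpha/2), about 1.96
    # by symmetry, P(|Z| > c) = P(Z < -c) + P(Z > c) = 2 * (1 - Phi(c))
    print(c, 2 * (1 - norm.cdf(c)))  # prints roughly 1.96 and 0.05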

Note: using the standard normal distribution $\Phi$ to form our critical value $c$ makes our test a so-called
z-test. When you use the regress command in Stata, it performs a t-test, which uses the same test
statistic $T(\beta_0)$ but a different critical value, based instead on Student's t-distribution. Using the
t-distribution will be more accurate if $n$ is small and the residuals are approximately normal and
homoskedastic, but as $n$ becomes large, critical values based on the t-distribution and the standard normal
distribution become the same. In modern (large $n$) datasets, this distinction isn't very important, so we
just develop theory for the z-test here.

Example: A common hypothesis to test is that $\beta_j = 0$. In this case, our T-statistic is simply $\hat\beta_j/se(\hat\beta_j)$.
It is common to construct the test with a size of 5%, in which case we reject the null that $\beta_j = 0$
if $|\hat\beta_j/se(\hat\beta_j)| > 1.96$, where $1.96 \approx \Phi^{-1}(1 - .05/2)$. If we reject, then we say that the estimate $\hat\beta_j$ is
"significant at the 95% level".

Note: The above is an example of a so-called "two-sided" test. If we had a null hypothesis like $H_0:
\beta_j \ge \beta_0$, we might apply an asymmetric decision rule like: reject $H_0$ iff $T(\beta_0) < -c$, where now $c = \Phi^{-1}(1-\alpha)$
is the $1 - \alpha$ quantile of the standard normal distribution. This test also has a size of $\alpha$.
A $1-\alpha$ confidence interval $CI_{1-\alpha}$ for $\beta_j$ is the set of all values $\beta_0$ for which the null hypothesis $H_0:
\beta_j = \beta_0$ is not rejected by a test with size $\alpha$. In other words, for a two-sided test it is the set of all $\beta_0$
such that

$$\left|\frac{\hat\beta_j - \beta_0}{se(\hat\beta_j)}\right| \le \Phi^{-1}(1 - \alpha/2)$$

Rearranging this, we have that $CI_{1-\alpha} = [\hat\beta_j - c \cdot se(\hat\beta_j),\; \hat\beta_j + c \cdot se(\hat\beta_j)]$, where $c = \Phi^{-1}(1 - \alpha/2)$ grows
with the desired confidence level $1 - \alpha$.

Example: A 95% confidence interval, based on a two-sided test, is

CI 95% = [β̂j − 1.96 · se(β̂j ), β̂j + 1.96 · se(β̂j )]
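In code, constructing the interval takes a few lines given $\hat\beta_j$, $\hat{V}_{jj}$, and $n$ (an illustrative helper, not any library's API):

    import numpy as np
    from scipy.stats import norm

    def confidence_interval(beta_j_hat, V_jj_hat, n, alpha=0.05):
        """1 - alpha confidence interval for a single coefficient, given a
        consistent estimate V_jj_hat of its asymptotic variance V_jj."""
        se = np.sqrt(V_jj_hat / n)   # standard error of beta_j_hat
        c = norm.ppf(1 - alpha / 2)  # about 1.96 when alpha = 0.05
        return beta_j_hat - c * se, beta_j_hat + c * se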

Exercise: Show that $\lim_{n\to\infty} P(\beta_j \in CI_{1-\alpha}) = 1 - \alpha$. That is, the $1-\alpha$ confidence interval
contains the true value of $\beta_j$ with probability $1-\alpha$, in the asymptotic limit. You may take for
granted that if a sequence of random variables $Z_n \overset{d}{\to} Z$, then for any closed interval $A$ of the real line:
$\lim_{n\to\infty} P(Z_n \in A) = P(Z \in A)$. This statement is a consequence of the so-called Portmanteau Theorem.

Note: It is often said on the basis of the above that there is a 95% chance that the true value of $\beta_j$
lies in a 95% confidence interval. This language is sort of sloppy and makes it sound like $\beta_j$ is a
random variable that may or may not lie inside the confidence interval. This is backwards: it is the
confidence interval $CI_{1-\alpha}$ that is random (it depends on the random variables $\hat\beta_j$ and $se(\hat\beta_j)$), while $\beta_j$
is not (it is just some number).
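A small simulation (with a purely illustrative, heteroskedastic data-generating process) makes both points: the true $\beta_j = 2$ never moves, while the interval varies from sample to sample and covers $\beta_j$ roughly 95% of the time:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    n, reps, beta = 500, 2000, np.array([1.0, 2.0])
    c, covered = norm.ppf(0.975), 0
    for _ in range(reps):
        X = np.column_stack([np.ones(n), rng.normal(size=n)])
        eps = rng.normal(size=n) * (1 + 0.5 * np.abs(X[:, 1]))  # heteroskedastic
        y = X @ beta + eps
        b = np.linalg.solve(X.T @ X, X.T @ y)
        e = y - X @ b
        bread = np.linalg.inv(X.T @ X / n)
        V = bread @ ((X * e[:, None] ** 2).T @ X / n) @ bread   # HC0 estimator
        se1 = np.sqrt(V[1, 1] / n)
        covered += (b[1] - c * se1 <= beta[1] <= b[1] + c * se1)
    print(covered / reps)  # close to 0.95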

7.7.2 Testing a joint hypothesis about the regression coefficients*


Suppose now that we want to test a hypothesis about the value of the full regression vector β. What
would be the analog of our t-test from the last section?
It follows from Proposition 7.10 and the continuous mapping theorem that
$$n(\hat\beta - \beta)'\hat{V}^{-1}(\hat\beta - \beta) = \left(\sqrt{n}\,\hat{V}^{-1/2}(\hat\beta - \beta)\right)'\left(\sqrt{n}\,\hat{V}^{-1/2}(\hat\beta - \beta)\right) \overset{d}{\to} \chi^2_k \tag{7.45}$$

where $\chi^2_k$ indicates the chi-squared distribution with $k$ degrees of freedom, which is the distribution that
applies to the sum of the squares of $k$ independent standard normal random variables. To see this, note
that the LHS must converge to the distribution that would apply to $Z'Z$, if $Z \sim N(0, I_k)$ is a vector of
$k$ mutually independent standard normal random variables.
Thus, to test the null hypothesis $H_0: \beta = \beta_0$, we can compute $n(\hat\beta - \beta_0)'\hat{V}^{-1}(\hat\beta - \beta_0)$, and compare
its value to quantiles of the chi-squared distribution. A test of this kind is called a Wald test, and the
test statistic $n(\hat\beta - \beta_0)'\hat{V}^{-1}(\hat\beta - \beta_0)$ is called a Wald statistic. Since a chi-squared random variable can
only be positive, Wald tests are inherently two-sided tests.
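A sketch of the Wald test in code (the helper name is illustrative; chi-squared quantiles come from scipy):

    import numpy as np
    from scipy.stats import chi2

    def wald_test(beta_hat, V_hat, beta_0, n, alpha=0.05):
        """Wald test of H0: beta = beta_0, given a consistent estimator V_hat.
        Rejects when the statistic exceeds the 1 - alpha quantile of chi^2_k."""
        diff = beta_hat - beta_0
        W = n * diff @ np.linalg.solve(V_hat, diff)  # n (bh - b0)' V^{-1} (bh - b0)
        return W, W > chi2.ppf(1 - alpha, df=len(beta_hat))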

More generally, suppose we want to test a null hypothesis of the form $H_0: r(\beta) = \theta_0$, where $r: \mathbb{R}^k \to \mathbb{R}^q$
is some known and differentiable function of the regression vector $\beta$. Introduce the shorthand
$\theta = r(\beta)$.
By the continuous mapping theorem, we know that $\hat\theta := r(\hat\beta)$ is a consistent estimator of the parameter
$\theta$, and by the Delta method we know that

$$\sqrt{n}(\hat\theta - \theta) \overset{d}{\to} N(0, R'VR)$$

where the $k \times q$ matrix $R = \nabla r(\beta)$ is the Jacobian matrix of $r$, composed of all of its derivatives, evaluated
at the true $\beta$. By the same steps as those that establish Proposition 7.10, we have that $\sqrt{n}(R'VR)^{-1/2}(\hat\theta - \theta) \overset{d}{\to} N(0, I_q)$, where notice now that since $\theta$ has $q$ rather than $k$ components, we have the $q \times q$ identity
matrix appearing in the asymptotic variance.
Similar to Eq. (7.45), we can now form a Wald statistic for the hypothesis $r(\beta) = \theta_0$, which has a
chi-squared distribution with $q$, rather than $k$, degrees of freedom:

$$n(\hat\theta - \theta_0)'(R'\hat{V}R)^{-1}(\hat\theta - \theta_0) \overset{d}{\to} \chi^2_q \tag{7.46}$$

We can perform tests of very general hypotheses about $\beta$ based upon Eq. (7.46). Note that in implementing the Wald test, it is important that the matrix $R$ is known in order to construct the test statistic
(7.46). Fortunately, under the null hypothesis the value of $\theta$ is fixed and the function $r$ is provided by
the researcher (and if the function $r$ is linear, then $R$ doesn't even depend on the value of $\beta$).
Note: it is common to use critical values from the so-called $F$-distribution rather than
the $\chi^2$ distribution when using (7.46) to construct tests (this also involves rescaling the test statistic by
a factor of $1/q$). This is analogous to the distinction between t and z tests discussed above: the F-test
is more conservative and may do better if the residuals are close to normally distributed and $n$ is small.
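Here is one way this might look in code, as a sketch: the Jacobian $R$ is approximated by finite differences at $\hat\beta$, which is an illustrative shortcut (when $r$ is linear, $R$ is known exactly and no approximation is needed):

    import numpy as np
    from scipy.stats import chi2

    def wald_test_restriction(r, beta_hat, V_hat, theta_0, n, alpha=0.05, h=1e-6):
        """Wald test of H0: r(beta) = theta_0, as in Eq. (7.46).
        r maps R^k into R^q; its k x q Jacobian is approximated numerically."""
        k = len(beta_hat)
        theta_hat = np.atleast_1d(r(beta_hat))
        q = len(theta_hat)
        R = np.zeros((k, q))
        for i in range(k):  # row i holds the derivatives of r w.r.t. beta_i
            step = np.zeros(k)
            step[i] = h
            R[i] = (np.atleast_1d(r(beta_hat + step)) - theta_hat) / h
        avar = R.T @ V_hat @ R  # the q x q asymptotic variance R'VR
        diff = theta_hat - np.atleast_1d(theta_0)
        W = n * diff @ np.linalg.solve(avar, diff)
        return W, W > chi2.ppf(1 - alpha, df=q)

For example, passing r = lambda b: np.array([b[1] * b[2]]) would test whether the product of two coefficients equals a hypothesized value $\theta_0$.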

It is illustrative to see that a test based on (7.46) recovers a test that is equivalent to the t-test considered
in the last section, when the function $r(\beta) = e_j'\beta$ picks out a single component of $\beta$. In this case $R = e_j$.
The Wald statistic becomes

$$n(e_j'\hat\beta - \beta_0)'(e_j'\hat{V}e_j)^{-1}(e_j'\hat\beta - \beta_0) = n(\hat\beta_j - \beta_0)\hat{V}_{jj}^{-1}(\hat\beta_j - \beta_0) = \frac{(\hat\beta_j - \beta_0)^2}{\hat{V}_{jj}/n} = T(\beta_0)^2$$

exactly the square of the t-statistic for $H_0: \beta_j = \beta_0$. The $1-\alpha$ quantile of the $\chi^2_1$ distribution is equal to
the square of the $1-\alpha/2$ quantile of the standard normal distribution. Thus, a two-tailed t-test rejects
the null that $\beta_j = \beta_0$ exactly when the Wald test does, and vice versa.

Bibliography

ANGRIST, J. D. and PISCHKE, J.-S. (2008). Mostly Harmless Econometrics. Princeton University Press.

DALE, S. B. and KRUEGER, A. B. (2002). "Estimating the Payoff to Attending a More Selective College: An Application of Selection on Observables and Unobservables". The Quarterly Journal of Economics 117 (4), pp. 1491–1527. eprint: https://academic.oup.com/qje/article-pdf/117/4/1491/5304491/117-4-1491.pdf.

FAN, Y. and PARK, S. S. (2010). "Sharp Bounds on the Distribution of the Treatment Effect and Their Statistical Inference". Econometric Theory 26 (3), pp. 931–951.

HANSEN, B. (2021). "A Modern Gauss-Markov Theorem". Econometrica.

HECKMAN, J. J., SMITH, J. and CLEMENTS, N. (1997). "Making the Most Out of Programme Evaluations and Social Experiments: Accounting for Heterogeneity in Programme Impacts". The Review of Economic Studies 64 (4), pp. 487–535.

LEWIS, D. (1973). "Causation". The Journal of Philosophy 70 (17), pp. 556–567.

ROSENBAUM, P. and RUBIN, D. (1983). "The central role of the propensity score in observational studies for causal effects". Biometrika 70 (1), pp. 41–55. eprint: https://academic.oup.com/biomet/article-pdf/70/1/41/662954/70-1-41.pdf.

ROSENTHAL, J. (2006). A First Look at Rigorous Probability Theory. World Scientific.

RUBIN, D. B. (1974). "Estimating causal effects of treatments in randomized and nonrandomized studies". Journal of Educational Psychology 66 (5), pp. 688–701.

YITZHAKI, S. (1996). "On Using Linear Regressions in Welfare Economics". Journal of Business and Economic Statistics 14 (4), pp. 478–486.
