Elements of Econometrics - Study Guide

Elements of econometrics
M. Schafgans and T. Komarova
EC2020
2022
Undergraduate study in
Economics, Management,
Finance and the Social Sciences
This subject guide is for a 200 course offered as part of the University of London
undergraduate study in Economics, Management, Finance and the Social Sciences. This is
equivalent to Level 5 within the Framework for Higher Education Qualifications in England,
Wales and Northern Ireland (FHEQ).
For more information, see: london.ac.uk
This guide was prepared for the University of London by:
M. Schafgans, Associate Professor of Economics
T. Komarova, Lecturer in Economics
Department of Economics, The London School of Economics and Political Science
With typesetting and proof-reading provided by:
James S. Abdey, BA (Hons), MSc, PGCertHE, PhD, Department of Statistics, London School
of Economics and Political Science.
This is one of a series of subject guides published by the University. We regret that due
to pressure of work the authors are unable to enter into any correspondence relating
to, or arising from, the guide. If you have any comments on this subject guide, please
communicate these through the discussion forum on the virtual learning environment.
University of London
Publications Office
Stewart House
32 Russell Square
London WC1B 5DN
United Kingdom
london.ac.uk
Contents
0 Preface 15
0.1 What is econometrics, and why study it? . . . . . . . . . . . . . . . . . . 15
0.2 Aims and objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
0.3 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
0.4 Overview of the learning resources . . . . . . . . . . . . . . . . . . . . . . 16
0.4.1 Essential reading . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
0.4.2 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
0.4.3 The subject guide . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
0.4.4 Online study resources . . . . . . . . . . . . . . . . . . . . . . . . 19
0.4.5 The VLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
0.4.6 Making use of the Online Library . . . . . . . . . . . . . . . . . . 20
8 Heteroskedasticity 159
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
8.1.1 Aims of the chapter . . . . . . . . . . . . . . . . . . . . . . . . . . 159
8.1.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . 159
8.1.3 Essential reading . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
8.1.4 Synopsis of this chapter . . . . . . . . . . . . . . . . . . . . . . . 159
8.2 Content of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
8.2.1 Consequences of heteroskedasticity for OLS . . . . . . . . . . . . 160
8.2.2 Heteroskedasticity-robust inference after OLS estimation . . . . . 161
8.2.3 Testing for heteroskedasticity . . . . . . . . . . . . . . . . . . . . 163
8.2.4 Weighted least squares . . . . . . . . . . . . . . . . . . . . . . . . 168
8.3 Answers to activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
8.4 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
8.4.1 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . 175
8.4.2 A reminder of your learning outcomes . . . . . . . . . . . . . . . . 175
8.5 Test your knowledge and understanding . . . . . . . . . . . . . . . . . . . 175
8.5.1 Additional examination based questions . . . . . . . . . . . . . . 175
8.5.2 Solutions to additional examination based questions . . . . . . . . 178
Chapter 0
Preface
describe and apply the classical regression model and its application to
cross-section data
describe and apply the:
• Gauss–Markov conditions and other assumptions required in the application of
the classical regression model
• reasons for expecting violations of these assumptions in certain circumstances
• tests for violations
• potential remedial measures, including, where appropriate, the use of
instrumental variables
recognise and apply the advantages of logit, probit and similar models over
regression analysis when fitting binary choice models
competently use regression, logit and probit analysis to quantify economic
relationships using R
describe and explain the principles underlying the use of maximum likelihood
estimation
apply regression analysis to time-series models using stationary time series, with
awareness of some of the econometric problems specific to time series applications
(for example, autocorrelation) and remedial measures
recognise the difficulties that arise in the application of regression analysis to
nonstationary time series, know how to test for unit roots, and know what is meant
by cointegration.
Detailed reading references in this subject guide refer to the edition of the set textbook
listed above. A new edition of this textbook may have been published by the time you
study this course. You can, however, use a more recent edition; use the detailed chapter
and section headings to identify relevant readings. Also check the virtual learning
environment (VLE) regularly for updated guidance on readings. You must have a copy
of the textbook to be able to study this course.
At the end of each chapter you will find an overview of the material covered, a list of key terms and concepts, and a reminder of the learning outcomes. Each chapter concludes with additional examination based exercises, with worked out solutions, that will enable you to test your knowledge and understanding.
To support students with the prerequisites for EC2020 Elements of econometrics, the first chapter provides a mathematics and statistics refresher. Although we will refer back to this material throughout the course, students are encouraged to review it beforehand and to take the accompanying MCQs on the VLE to assess whether their preparation meets the level of familiarity the course requires.
A suggested approach for students studying EC2020 Elements of econometrics, in particular independent learners, is to split the material into 10 two-week blocks.
10. Chapter 15 (Trends, seasonality, and highly persistent time series in regression
analysis).
Throughout the course we discuss many empirical examples that help to motivate each econometric method and should make the course more engaging. To provide you with practical experience, we have accompanied the material with an introduction to R in RStudio and provide R applications alongside that show how to implement the approaches. The subject guide will refer to this material where appropriate. Although not examinable, it provides you, in our opinion, with valuable transferable skills.
Course materials: Subject guides and other course materials available for
download. In some courses, the content of the subject guide is transferred into the
VLE and additional resources and activities are integrated with the text.
Discussion forums: A space where you can share your thoughts and questions
with fellow students. Many forums will be supported by a ‘course moderator’, a
subject expert employed by LSE to facilitate the discussion and clarify difficult
topics.
Study skills: Expert advice on getting started with your studies, preparing for
examinations and developing your digital literacy skills.
Note: Students registered for Laws courses also receive access to the dedicated Laws
VLE.
Some of these resources are available for certain courses only, but we are expanding our
provision all the time and you should check the VLE regularly for updates.
Chapter 1
Mathematics and statistics refresher
1.1 Introduction
The material in this chapter permits you to revise some basic mathematical tools crucial
for understanding regression analysis. It also provides a refresher of the basic statistical
theory that is used throughout the course and is central to econometric analysis.
It should be noted that the material does not replace the prerequisite courses in
mathematics, probability and statistics. By studying the essential readings listed, you
will revise the material using notation that will be used throughout the course. At the
end of each block you will be asked to complete associated quizzes on the VLE that will
test your understanding.
The refresher comprises three modules and covers the key terms and concepts indicated.
1.2 Content of chapter
• Test Statistic
• Critical Value
• t Test for the Mean of a Normal Population
• p-value of a Test.
It is important to clearly indicate the index, in this case i (which may relate to a
particular individual), and the values the index can take, in this case 1, 2, . . . , n, where
n denotes the number of observations (which may be the sample size). We can use the
summation notation to describe various descriptive statistics of {xi : i = 1, 2, . . . , n},
such as the sample mean:
\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \]
and the sample variance:
\[ \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2. \]
As we will be using the summation operator a lot, it is important that you are comfortable using it and are aware of its properties (see Wooldridge, Property Sum.1–Property Sum.3).
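As a quick (not examinable) illustration, the sample mean and variance above can be computed in R both by applying the summation formulas directly and with the built-in functions; the data vector x below is made up purely for illustration.

```r
x <- c(2.1, 3.5, 4.0, 1.8, 5.2)      # hypothetical data
n <- length(x)

x_bar <- sum(x) / n                  # sample mean: (1/n) * sum of x_i
s2 <- sum((x - x_bar)^2) / (n - 1)   # sample variance: 1/(n-1) * sum of squared deviations

all.equal(x_bar, mean(x))   # TRUE: matches R's built-in mean
all.equal(s2, var(x))       # TRUE: matches R's built-in variance
```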
The use of proportions and percentages is commonplace in applied econometrics.
Mathematically, proportions are the decimal equivalent of a percentage, e.g. when 75%
(75 percent) of the population is employed, we say that the proportion of employed
individuals in the population equals 0.75. In other words, we need to divide a
percentage by 100 to yield the proportion. When describing changes in variables that
are measured in percentages (or proportions), it is important to distinguish clearly
between a percentage point change and a percentage change. For instance, if the percentage of employed individuals rises from 75% to 80%, this is a 5 percentage point increase in employment, which is not the same as a 5 percent increase! A 5 percent increase in employment would have yielded (1 + 0.05) × 75 = 78.75, that is, 78.75%. Using percentages (or proportions) to describe changes in variables, such
as income, has the advantage that it is free of the units with which the variable is
measured (in dollars or pounds). When we do not report changes in this way, we have
to clearly indicate the units when describing changes.
We will mostly be working with linear equations (and you should understand their
properties). Understanding modern economic research also requires you to be familiar with various nonlinear functions. Nonlinear functions, such as quadratic forms, are useful as they permit us to represent diminishing
marginal effects. If:
\[ y = \beta_0 + \beta_1 x + \beta_2 x^2 \]
where β1 > 0 and β2 < 0, then the positive effect that x has on y (β1 > 0) is stronger
when x is smaller and diminishes thereafter (β2 < 0). The signs of the parameters β1
and β2 are crucial in describing this diminishing marginal effect. Two other nonlinear
functions we will encounter are the natural logarithm (or simply log function) and the
exponential function. You should remember their properties and recall that, e.g.
log(exp(x)) = x. When we encounter logarithms of variables in applied work, we need to
interpret our parameters as elasticities and/or semi-elasticities. We will discuss this in
more detail later.
Lastly, the block reviews a bit of differential calculus. The marginal effects and
partial marginal effects are defined as the derivative of a function with respect to a
particular argument, where for the partial marginal effect we need to hold all other
arguments constant. Derivatives also play an important role in providing the first-order
conditions that define the minimum (when we consider the ordinary least squares
estimator) or the maximum (when we consider the maximum likelihood estimator) of a
given objective function. You will be expected to take derivatives of simple functions,
where it will be useful to remember the chain rule. With f (x) and g(x) denoting two
functions of the variable x, the chain rule allows us to express:
\[ \frac{d}{dx} f(g(x)) = f'(g(x))\, g'(x) \]
where \( f'(x) = df/dx \) and \( g'(x) = dg/dx \).
Example 1.1 Let us use the chain rule to derive dy/dx when y = log(1 + 2x). We first take the derivative of log(1 + 2x) with respect to its argument 1 + 2x, and then multiply this by the derivative of 1 + 2x with respect to x:
\[ \frac{dy}{dx} = \frac{1}{1+2x} \cdot 2 = \frac{2}{1+2x}. \]
Activity 1.1 Take the ‘Quiz on Basic Mathematical tools’ on the VLE to test your
understanding.
Fundamentals of probability
Read: Wooldridge Appendix B (you may want to skip Sections B.4e to B.4g during your
refresher and return to this at a later stage).
The expectation has important properties you need to be familiar with, given in
Property E.1–Property E.3 in Wooldridge. We use the expectation operator often in
derivations and it is important to recognise that typically (Jensen’s inequality):
\[ E(g(Y)) \neq g(E(Y)). \]
The variance describes the spread of the distribution (indicating whether most realisations are tightly centred about the mean or not). The variance of the random variable Y is denoted \( Var(Y) \), where:
\[ Var(Y) = E[(Y - E(Y))^2]. \]
The variance has important properties you need to be familiar with, given in Property
VAR.1–Property VAR.4 in Wooldridge, where the latter two describe the variance of
sums of random variables. When defining the variance of sums of random variables it is
important to know whether the random variables are independent or exhibit
covariance/correlation. In Property VAR.3 we permit the two random variables to be dependent; in Property VAR.4 we require them to be independent. In general, when we do permit dependence, we can write the variance of a sum of random variables as:
\[ Var\left(\sum_{i=1}^{n} Y_i\right) = \sum_{i=1}^{n} Var(Y_i) + \sum_{i \neq j} Cov(Y_i, Y_j) \]
where \( \sum_{i \neq j} Cov(Y_i, Y_j) \) is a shorthand for summing the covariances over all i, j pairs where \( i \neq j \). Under independence, this component is zero. Then we get the much simpler expression given in Property VAR.4:
\[ Var\left(\sum_{i=1}^{n} Y_i\right) = \sum_{i=1}^{n} Var(Y_i). \]
When using the latter property it is important to clearly indicate that you make use of
the assumption that all covariances are zero.
The covariance and correlation are two related measures that reveal the association
between two random variables, with:
\[ Corr(X, Y) = \frac{Cov(X, Y)}{\sqrt{Var(X)\,Var(Y)}}. \]
CE.2: By conditioning on X you can treat a(X) and b(X) as being constant and
use the linearity of the expectation operator.
CE.4 (and CE.4′): These are examples of the important property called the 'law of iterated expectations'. As we will see, it is often easy to formulate E(Y | X). This is a function that will depend on the value of the random variable X. By accounting for all possible values that the random variable X can take, that is, by evaluating the expectation of E(Y | X), which we write as E(E(Y | X)), we obtain E(Y).
CE.5: If the mean of Y does not depend on X, then X and Y must be uncorrelated.
You are expected to be familiar with the normal distribution, which has a nice
bell-shaped curve that is centred around the mean and has a spread indicated by its
variance (see Figure B.7 in Wooldridge). To obtain the standard normal distribution we
need to subtract the mean and divide by the standard deviation (Property Normal.1).
We typically use the notation φ(z) to denote the pdf of a Normal (0, 1) distribution and
denote its cumulative distribution function (cdf) as Φ(z). Statistical tables can be used to obtain values of Φ(z) for a given value of z (see Table G.1 in Wooldridge), where Φ(0) = 0.5 due to the symmetry of the normal distribution. Other facts about the
normal distribution are summarised in Property Normal.2–Property Normal.4 in
Wooldridge.
We will use various distributions derived from the normal distribution in the course: the t distribution, the F distribution and the χ² distribution. The details will be discussed in this course, but for now it is good to look at the shape of these distributions. The shape of the t distribution is similar to that of the Normal(0, 1), but in comparison it tends to have 'thicker tails' (the probability of obtaining realisations that are large, in either tail, is bigger) unless the degrees of freedom are large. Both the F and the χ2 distributions are
examples of skewed (not symmetric) distributions that only take non-negative values.
These distributions, as we will see, will play an important role when we conduct
hypothesis tests or provide confidence intervals.
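If you want to look at these shapes yourself, a short (not examinable) R sketch along the following lines will draw them; the degrees of freedom chosen here are arbitrary.

```r
# Standard normal versus t density: the t has thicker tails for small df
z <- seq(-4, 4, length.out = 200)
plot(z, dnorm(z), type = "l", ylab = "density")   # Normal(0, 1)
lines(z, dt(z, df = 3), lty = 2)                  # t with 3 degrees of freedom

# Chi-squared and F densities: skewed and non-negative
x <- seq(0.01, 8, length.out = 200)
plot(x, dchisq(x, df = 3), type = "l", ylab = "density")
lines(x, df(x, df1 = 3, df2 = 20), lty = 2)
```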
Activity 1.2 Take the ‘Quiz on Fundamentals of probability’ on the VLE to test
your understanding.
Read: Wooldridge Appendix C (you may want to skip Section C.4 during your refresher
and return to this at a later stage).
where Y denotes the random variable ‘family income’. θ is an unknown fixed quantity in
the population.
The easiest way to come up with a good estimator of θ is to draw independent
observations (i.e. with replacement) from this population f (y; θ) and use a sample
analogue of the population mean, the sample average, as its estimator:
\[ \hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} Y_i. \]
The population parameters and their sample analogues are:
Mean: \( \mu_X = E(X_i) \), with sample analogue \( \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \).
Variance: \( \sigma_X^2 = E[(X_i - \mu_X)^2] \), with sample analogue \( S_X^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2 \).
If (Y1 , X1 ), (Y2 , X2 ), . . . , (Yn , Xn ) form a random sample drawn from the joint density
f (y, x; θ), then our descriptive statistics (sample average, sample variance, and sample
covariance) indeed form good estimators of their population analogues (mean, variance,
and covariance). The associated realisations (estimates) are \( \bar{x} \), \( s_X^2 \), and \( s_{XY} \).
Common practice: When the parameter of interest is denoted \( \beta_1 \), we will typically denote its estimator by \( \hat{\beta}_1 \). The use of a 'hat' when referring to estimators is commonplace.
Useful finite sample properties of our estimators are that they are unbiased, \( E(\hat{\theta}) = \theta \), have a small sampling variance, \( Var(\hat{\theta}) \), and are efficient. It is important to
recognise that these are properties that describe the behaviour of our estimator under
repeated sampling. The unbiasedness property ensures that there is no systematic error
when estimating the parameter of interest, and the low sampling variability ensures
that there is a high probability that we will obtain a realisation that is close to the
parameter of interest. Clearly when choosing between two unbiased estimators, we
should select the estimator with the smaller sampling variance (efficiency). When
choosing between two estimators that are not necessarily unbiased, we may want to use
the mean squared error (MSE) (which offers a trade-off between bias and variance) to
compare the efficiency of estimators.
Check! You should be able to show that the sample average of a random sample \( Y_1, Y_2, \ldots, Y_n \) from a distribution with mean µ and variance σ² is an unbiased estimator of µ, \( E(\bar{Y}) = \mu \), whose sampling variance equals \( Var(\bar{Y}) = \sigma^2/n \). Clearly, the larger the sample, the smaller the sampling variability of our estimator. For efficiency it is never a good idea to throw away information: an estimator of µ that uses only part of the sample has a larger variance.
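As a sketch of the kind of derivation expected, using the linearity of the expectation operator and the zero covariances implied by random sampling (Property VAR.4):
\[ E(\bar{Y}) = E\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(Y_i) = \frac{1}{n}\, n\mu = \mu \]
\[ Var(\bar{Y}) = \frac{1}{n^2}\, Var\left(\sum_{i=1}^{n} Y_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} Var(Y_i) = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n}. \]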
Useful asymptotic properties of our estimators are that they are consistent and asymptotically normal. We will say that an estimator is consistent if:
\[ \text{plim}\ \hat{\theta} = \theta. \]
To show that an estimator is consistent we can make use of the sufficient conditions for consistency (\( \lim_{n\to\infty} E(\hat{\theta}) = \theta \) and \( \lim_{n\to\infty} Var(\hat{\theta}) = 0 \)). Alternatively, we can rely on
laws of large numbers. Laws of large numbers (LLNs) give us a convenient asymptotic
device that shows that under suitable regularity conditions, sample averages will
converge to their population analogues. E.g. plim Ȳ = E(Yi ), hence guaranteeing that
in the above example Ȳ is a consistent estimator of µ as E(Yi ) = µ. In the course we
will show how we can use LLNs when considering the consistency of estimators making
use of Property PLIM.1–Property PLIM.2 given in Wooldridge. It will be convenient to
observe that, under suitable regularity conditions, all estimators have a distribution
that can be well-approximated by a normal distribution when our sample is large. This
is a powerful result we will rely on if we do not know the exact sampling distribution of
an estimator. The asymptotic device that is used to establish this result is the Central
Limit Theorem (CLT).
Instead of reporting point estimates, we may want to report confidence intervals.
You want to recall how you can obtain the confidence interval for the mean from a
normally distributed population using a random sample. Depending on whether the
variance σ 2 is known or not, the form of the 100(1 − α)% confidence interval is given by:
\[ \left[\bar{y} - z_{\alpha/2}\sqrt{\sigma^2/n},\ \bar{y} + z_{\alpha/2}\sqrt{\sigma^2/n}\right] \]
or:
\[ \left[\bar{y} - t_{n-1,\,\alpha/2}\sqrt{s^2/n},\ \bar{y} + t_{n-1,\,\alpha/2}\sqrt{s^2/n}\right] \]
where \( z_{\alpha/2} = 1.96 \) when α = 5% (the 97.5th percentile of the Normal(0, 1) distribution) and \( t_{n-1,\,\alpha/2} \) when α = 5% is given by the 97.5th percentile of the t distribution with n − 1 degrees of freedom. The
limits are given by our estimate plus and minus a measure related to the precision of
our estimator (standard deviation / standard error). Typically, the confidence interval is
wider when the variance is unknown to account for the additional imprecision. It is
important to recognise that these intervals are sample specific. When we obtain a
different sample under identical conditions the interval will change, and when we do this
repeatedly the interval will contain the true population parameter with probability
given by the confidence level, 100(1 − α)%.
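As a short (not examinable) R illustration of the unknown-variance case, using a made-up data vector y:

```r
y <- c(51, 48, 55, 60, 47, 52, 58, 50)    # hypothetical sample
n <- length(y)
alpha <- 0.05
t_crit <- qt(1 - alpha / 2, df = n - 1)   # 97.5th percentile of t(n-1)
se <- sqrt(var(y) / n)                    # estimated precision of the sample mean
c(mean(y) - t_crit * se, mean(y) + t_crit * se)

t.test(y)$conf.int                        # reports the same 95% interval
```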
You should revisit the fundamentals of hypothesis testing. To conduct a test we need to
specify the null hypothesis (H0 ) and the alternative hypothesis (H1 ) which are both
29
1. Mathematics and statistics refresher
statements about the true parameter. When we consider a single hypothesis, the
alternative hypothesis can be one-sided or two-sided, for multiple hypotheses we only
consider two-sided alternatives. You should be familiar with the concepts of a Type I
error and a Type II error; these are two types of mistakes that we can make when
conducting hypothesis tests due to the fact that the test is based on random variables.
Related to these errors we can define the significance level (typically denoted by α) and
the power of a test. Typically we want to choose our test in such a way that for a given
level of significance we maximise the power of our test. To test our hypothesis we will
need to define a suitable test statistic (a random variable that has no unknown
quantities and has a nice (asymptotic) distribution under the null). This distribution
plays an important role as it determines the critical value that defines, for a given level
of significance, when we should start rejecting the null. The p-value tells us the probability of observing a realisation of our test statistic that is at least as extreme (adverse to the null) as the one obtained. We will apply these ideas regularly in this course, and you should make
sure that you are comfortable with the test about the mean in a normal population
presented in the book.
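For example, a two-sided test of H0: µ = 50 for the hypothetical sample above can be run in R as follows (again, not examinable):

```r
y <- c(51, 48, 55, 60, 47, 52, 58, 50)    # hypothetical sample
mu0 <- 50

# t statistic: (ybar - mu0) / (s / sqrt(n))
t_stat <- (mean(y) - mu0) / sqrt(var(y) / length(y))
# Two-sided p-value from the t distribution with n - 1 degrees of freedom
2 * pt(-abs(t_stat), df = length(y) - 1)

t.test(y, mu = mu0)   # reproduces both the statistic and the p-value
```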
Chapter 2
The nature of econometrics and
economic data
2.1 Introduction
Finding associations (correlations) between variables may not be the same as finding causal relations. This segment also talks about the importance of the ceteris paribus notion in causal interpretations.
The main issues to learn in this segment are the scope of econometrics, the reliance of
econometrics on observational data rather than experimental data, and the reliance on
the statistical apparatus.
Activity 2.1 Take the MCQs related to this section on the VLE to test your
understanding.
Economic data come in several different forms. This segment discusses these forms.
For cross-sectional data, pay particular attention to the random sampling assumption. You also need to be able to discuss situations when this assumption is violated. For time series
data the order of observations is very important and it cannot be assumed that
outcomes are independent across time. You should be able to give examples of both
cross-sectional and time series data.
You can skip Sections 1.3c and 1.3d as in this course we do not deal with pooled cross
sections or panel data.
Activity 2.2 Take the MCQs related to this section on the VLE to test your
understanding.
‘How does variable y change if variable x is changed but all other relevant
factors are held constant?’
Most of the economic questions that interest policymakers are causal questions. For instance, does an increase in the minimum wage change poverty levels and, if yes, to what extent? Crucial to establishing causality is the notion of ceteris paribus: 'all other (relevant) factors being equal'. If we succeed – via statistical methods – in holding 'fixed' other relevant factors, then sometimes we can establish that changes in one variable (say, education) in fact 'cause' changes in another variable (say, wage).
Most economic questions are ceteris paribus questions. You need to understand how the
notion of ceteris paribus can be described through counterfactual reasoning and give the
definition of counterfactual outcomes.
Activity 2.4 The data in ECONMATH were obtained on students from a large
university course in introductory microeconomics. For this problem, we are interested
in two variables: score, which is the final course score, and econhs, which is a binary
variable indicating whether a student took an economics course in high school.
There are 856 students in the sample. Of these, 317 students report having taken an
economics course in high school.
(i) From the data we can find that for those who took an economics course in high
school, the average score is 72.08. For those who did not take an economics
course in high school, the average score is 72.91. Do these findings necessarily
tell you anything about the causal effect of taking high school economics on
college course performance? Explain.
(ii) If you wanted to obtain a good causal estimate of the effect of taking a high
school economics course using the difference in averages, what experiment
would you run?
(i) Here is one way to pose the question: If two firms, say A and B, are identical in all
respects except that firm A supplies job training for one hour per worker more than
firm B, by how much would firm A’s output differ from firm B’s?
(ii) Firms are likely to choose job training depending on the characteristics of workers.
Some observed characteristics are years of schooling, years in the workforce, and
experience in a particular job. Firms might even discriminate based on age, gender,
or race. Perhaps firms choose to offer training to more or less able workers, where
‘ability’ might be difficult to quantify but where a manager has some idea about
the relative abilities of different employees. Moreover, different kinds of workers
might be attracted to firms that offer more job training on average, and this might
not be evident to employers.
(iii) The amount of capital and technology available to workers would also affect
output. So, two firms with exactly the same kinds of employees would generally
have different outputs if they use different amounts of capital or technology. The
quality of managers would also have an effect.
(iv) If you find a positive correlation between output and training, would you have
convincingly established that job training makes workers more productive? Explain.
(i) Comparing averages tells us very little about the causal effect of having taken an
economics course in high school on performance in university-level economics.
There are many factors that influence a student’s performance in an economics
course and some of these factors are likely correlated with whether or not they took
economics in high school (and subsequently chose to take economics in university).
Without controlling for these factors, we cannot really say anything about the
causal effect of high school economics exposure on university-level economics
performance. Though there is a small (negative) correlation between the two
variables, there is little evidence of causation from simply comparing performance
between these two groups.
(ii) Ideally we would have an experiment in which we could observe the same
individuals in two different scenarios: one in which they took high school economics
and one in which they did not. As this is not feasible, we would like to randomly
select students into two groups: a control group that did not take economics in high
school and a treatment group that did take economics in high school. Even this
would be challenging as it would require selection into treatment and control when
the students were in high school. Thus, the most feasible experiment would be to
compare the difference in scores for students that are as similar as possible in every
way except for their exposure to high school economics. Effectively, we would need
to control for all of the other factors that could influence both performance in
university-level economics and exposure to high school economics.
Economic model
Empirical analysis
Cross-sectional data
Random sampling
Panel data
Experimental data
Nonexperimental data
Observational data
Causal effect
Ceteris paribus
Counterfactual outcomes
Counterfactual reasoning
Chapter 3
Simple regression model
3.1 Introduction
understand three issues when formulating a simple regression model and discuss
how they are addressed
discuss the key assumptions about how the regressor and error term are related
define OLS fitted values and residuals and derive their algebraic properties in OLS
estimation
discuss the effect of changing the units of measurement on the parameter estimates
and the R2
interpret the OLS estimates when the dependent and/or independent variables
appear in logarithmic form
establish unbiasedness of the OLS estimator and understand the role of the
assumptions in this result
derive the variance of the OLS estimator under the homoskedasticity assumption
and discuss how it is affected by different components.
The simple linear regression model studies ‘how y varies with changes in x’ by writing
down a simple equation relating y to x in the population of interest:
y = β0 + β1 x + u.
The first important thing to note is that in the simple linear regression model the
variables y and x are not treated symmetrically. Indeed, we want to explain y in
terms of x (not explain x in terms of y). The variable y is called the dependent
variable or response variable (among other names). The variable x is called the
regressor or explanatory variable (among other names). The variable u is called the
error term or disturbance (among other names). It is treated as unobserved as it
collects all the factors other than x that affect y. We call β0 the intercept parameter
and β1 the slope parameter. These describe a population, and our ultimate goal is to
estimate them.
Let ∆ denote ‘change’. Then holding u fixed means ∆u = 0. So:
∆y = β1 ∆x + ∆u
= β1 ∆x when ∆u = 0.
This equation effectively defines β1 as a slope, with the only difference being the
restriction ∆u = 0. Thus, β1 measures the effect of x on y, holding all other factors (in
u) fixed. In other words, the simple linear regression captures a ceteris paribus
relationship of x on y.
It is important to understand that the simple linear regression (SLR) model is a
population model. When it comes to estimating β1 (and β0 ) using a random sample of
data and, thus, estimating the ceteris paribus effect of x on y, we must restrict how u
and x are related to each other.
The first step towards formulating any restrictions on how the unobservable u is related
to the explanatory variable x is to realise that x and u are viewed as random
variables having distributions in the population. For example, if x = cigs is the
average number of cigarettes the mother smokes per day during pregnancy, then we
could figure out its distribution in the population of women who have ever been
pregnant. See the histogram of cigs obtained from BWGHT data in Figure 3.1. Suppose
u includes the overall health of the mother and quality of prenatal care (imagine we
want to analyse the causal effect of smoking in pregnancy on a child’s birthweight).
There is a range of values for these variables in the population and, hence, u also has a
distribution in the population.
Our first key assumption is that the expected value of u is zero in the population:
E(u) = 0
where E(·) means the expected value. As long as the intercept β0 is included in the
equation, this assumption is an innocuous normalisation and can be imposed
without a loss of generality. For example, normalising the mean ‘land quality’ in the
regression of yield on the fertiliser amount or normalising the mean ‘ability’ in the
regression of wage on education to be zero in the population should be harmless. It is.
Let us show mathematically that this assumption is an innocuous normalisation.
Suppose E(u) = α0, where α0 may differ from zero. We can always rewrite:
y = β0 + β1 x + u
  = (β0 + α0) + β1 x + (u − α0)
  = β0* + β1 x + u*
where the new error term has mean zero:
E(u*) = E(u − α0) = E(u) − E(α0) = α0 − α0 = 0.
The new intercept is β0∗ = β0 + α0 , and, importantly, the slope β1 does not change.
Thus, if the average of u is different from zero, we just adjust the intercept, leaving the
slope the same. The presence of β0 in the equation allows us to do that.
The second key assumption is on the expected value of u given x or, in other words, on
the mean of the error term for each slice of the population determined by values of x,
that is:
E(u | x) = E(u) for all values of x
where E(u | x) means ‘the expected value of u given x’. In other words, under this
assumption the error term has the same expectation given any value of the explanatory
variable. We say u is mean independent of x. Suppose u is ‘ability’ and x is years of
education. For the mean independence assumption to hold, we need, for example:
E(ability | x = 8) = E(ability | x = 12) = E(ability | x = 16)
so that the average ability is the same in the different portions of the population with an
8th grade education, a 12th grade education, and a four-year college education. Because
people choose education levels partly based on ability, this assumption is almost
certainly false. To give another example, suppose u is ‘land quality’ and x is fertiliser
amount. Then E(u | x) = E(u) if fertiliser amounts are chosen independently of quality.
This assumption is reasonable but assumes fertiliser amounts are assigned at random.
Combining E(u | x) = E(u) (the substantive assumption) with E(u) = 0 (normalisation)
gives:
E(u | x) = 0 for all values of x.
This is called the zero conditional mean assumption. It says that the expected
value of the error term, u, is zero, regardless of what the value of the explanatory
variable, x, is.
By properties of the conditional expectation, E(u | x) = 0 implies:
E(y | x) = E(β0 + β1 x + u | x)
= β0 + β1 x + E(u | x)
= β0 + β1 x
which shows the population regression function, E(y | x), is a linear function of x.
Thus, regression analysis is essentially about explaining effects of explanatory variables
on average outcomes of y.
Activity 3.1 Take the MCQs related to this section on the VLE to test your
understanding.
Read: Wooldridge Section 2.2, Appendix 2A and Appendices B.4a, B.4b, B.4c.
E(u) = 0     (3.2.1)
E(u | x) = 0.     (3.2.2)
The zero conditional mean assumption (3.2.2) implies, in particular, that E(xu) = 0. Next we plug in u (recall from the regression equation that u = y − β0 − β1 x) into the restrictions E(u) = 0 and E(xu) = 0 implied by (3.2.1) and (3.2.2):
E(y − β0 − β1 x) = 0
E[x(y − β0 − β1 x)] = 0.
These are the two conditions in the population that determine β0 and β1 . Remember,
however, that we do not observe everyone in the population. This means we do not
know the population distribution of (y, x) and cannot compute expectations E(·) in the
two conditions. What we have is a random sample from the population (which is just a
subset). Therefore, we use the sample analogs of the expectations to estimate β0 and
β1 , which is a method of moments approach to estimation.
The main idea of the method of moments in general is to estimate the expected value \( E(\cdot) \) by the respective sample average \( \frac{1}{n}\sum_{i=1}^{n}(\cdot) \) (often called the sample analog).
Therefore, in order to determine the estimates \( \hat{\beta}_0 \) and \( \hat{\beta}_1 \), we use the sample analogs:
\[ \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 \]
\[ \frac{1}{n}\sum_{i=1}^{n} x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 \]
which result in two linear equations for the two unknowns \( \hat{\beta}_0 \) and \( \hat{\beta}_1 \).
Solving these two equations with respect to \( \hat{\beta}_0 \) and \( \hat{\beta}_1 \), we obtain:
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i (y_i - \bar{y})}{\sum_{i=1}^{n} x_i (x_i - \bar{x})} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\text{Sample Covariance}(x_i, y_i)}{\text{Sample Variance}(x_i)} \]
and:
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]
if \( \sum_{i=1}^{n} x_i (x_i - \bar{x}) = \sum_{i=1}^{n} (x_i - \bar{x})^2 > 0 \) (that is, if the sample variance of the \( x_i \)s is not zero, which only rules out the case where each \( x_i \) is the same value).
\( \hat{\beta}_0 \) and \( \hat{\beta}_1 \) are called the ordinary least squares (OLS) estimates.
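As a short (not examinable) R illustration, the formulas above can be applied directly and checked against lm(); the data here are simulated purely for the purpose.

```r
set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)   # simulated regressor
y <- 1 + 0.5 * x + rnorm(100)       # simulated outcome

# Slope: sample covariance over sample variance; intercept from the means
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

c(b0, b1)
coef(lm(y ~ x))   # lm() returns identical estimates
```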
The fitted value (or predicted value) for each data point i is:
\[ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i \]
with associated residual \( \hat{u}_i = y_i - \hat{y}_i \). The same estimates solve the optimisation problem:
\[ \min_{b_0,\, b_1} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2. \]
The objective function in this optimisation problem is called the sum of squared residuals (imagine we consider some generic \( b_0 \) and \( b_1 \) as parameters in calculating fitted values and residuals) and it is implied that we measure the prediction quality, for each i, by squaring the residual. Thus, the OLS estimates \( \hat{\beta}_0 \) and \( \hat{\beta}_1 \) minimise the sum of squared residuals. You need to study Appendix 2A for more technical details.
For OLS estimates \( \hat{\beta}_0 \) and \( \hat{\beta}_1 \), the OLS regression line as a function of x is:
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x. \]
The OLS regression line allows us to predict y for any (sensible) value of x. It is also called the sample regression function. The slope, \( \hat{\beta}_1 \), allows us to predict changes in y for any (reasonable) change in x, that is:
\[ \Delta\hat{y} = \hat{\beta}_1 \Delta x. \]
With the help of various empirical examples you learn how to interpret the results of
OLS estimation. In other words, you should be able to interpret findings given by the
OLS regression line.
Activity 3.2 Take the MCQs related to this section on the VLE to test your
understanding.
You need to know how one can calculate the OLS fitted values and OLS residuals from OLS estimates and the data. It is important to remember that the residual \( \hat{u}_i \) is different from the error \( u_i = y_i - \beta_0 - \beta_1 x_i \) (it is a common mistake by students to confuse these two).
It is useful to know and be able to explain and interpret the following algebraic
properties of OLS statistics:
Sample covariance (and thus sample correlation) between regressor and residual is always zero:
\[ \sum_{i=1}^{n} x_i \hat{u}_i = 0. \]
Sample average of actual \( y_i \) is the same as sample average of the fitted values \( \hat{y}_i \): \( \bar{y} = \bar{\hat{y}} \).
The explained and residual sums of squares are defined as:
\[ SSE = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \qquad SSR = \sum_{i=1}^{n} \hat{u}_i^2 \]
and, with the total sum of squares \( SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 \), the R-squared is \( R^2 = SSE/SST \).
Thus, R2 is the fraction of the total variation in yi that is explained by xi (or the OLS
regression line). You need to know and understand the following properties of
R-squared:
As R2 increases, the yi s are closer and closer to falling on the OLS regression line.
However, you need to keep in mind that we do not want to fixate on R2 . It is a useful
summary measure of the predictive quality of the OLS regression line but tells us
nothing about causality. Having a ‘high’ R-squared is neither necessary nor sufficient to
infer causality.
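The decomposition behind the R² can be verified in R (again, not examinable); the data are simulated as in the earlier sketch.

```r
set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)
y <- 1 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)

SST <- sum((y - mean(y))^2)               # total sum of squares
SSE <- sum((fitted(fit) - mean(y))^2)     # explained sum of squares
SSR <- sum(resid(fit)^2)                  # residual sum of squares

all.equal(SST, SSE + SSR)                      # TRUE
all.equal(SSE / SST, summary(fit)$r.squared)   # TRUE
```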
Activity 3.3 Take the MCQs related to this section on the VLE to test your
understanding.
With the help of various empirical examples you learn how both units of measurement
and the functional form affect the interpretation of your parameters in the linear
regression model. The fact that units of measurement affect the parameters should not
be surprising when we consider a setting where we are interested in explaining the cost
of travel as a function of distance travelled:
travelcost = β0 + β1 distance + u.
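If you want to see this in action, a small (not examinable) R experiment with simulated data shows how rescaling the regressor rescales the slope.

```r
set.seed(2)
distance_km <- runif(50, 1, 100)                   # distance in kilometres
travelcost <- 3 + 0.8 * distance_km + rnorm(50)    # cost in pounds

coef(lm(travelcost ~ distance_km))
# Measuring distance in miles (1 mile is about 1.609 km) multiplies the slope by 1.609
distance_miles <- distance_km / 1.609
coef(lm(travelcost ~ distance_miles))
```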
Activity 3.4 Take the MCQs related to this section on the VLE to test your
understanding.
Activity 3.6 Exercise 2.9 (i) from Wooldridge. You should recall that:
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \quad \text{and} \quad \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}. \]
The estimated regression coefficients are random variables because they are calculated
from a random sample. This segment studies statistical properties of the OLS
estimator, referring to a population model and assuming random sampling.
This segment starts with answering the following question posed by mathematical
statistics: How do our estimators behave across different samples of data? On average,
would we get the right answer if we could repeatedly sample? (Usually we only have one
particular sample, so this is a hypothetical question.) This question is answered by
finding the expected value of the OLS estimators. As we know, if the average outcome
of an estimator across all possible random samples gives us the truth, this means that
the estimator is unbiased. This segment establishes the unbiasedness of OLS estimators
under a simple set of assumptions (SLR.1–SLR.4). You need to understand these
assumptions and be able to discuss their role in the proof of unbiasedness. Assumption
SLR.4 (zero conditional mean) is the key assumption for showing that OLS is unbiased.
Note, however, that we can compute the OLS estimates whether or not this assumption
holds, or even if there is an underlying population model. However, SLR.4 will
guarantee good properties of OLS estimation. If this assumption fails, then OLS will be
biased for β1 . For example, if an omitted factor contained in u is correlated with x, then
SLR.4 typically fails.
It is also important to understand that unbiasedness is a property of an estimator (not an estimate)! After an estimation like:
\[ \widehat{wage} = -5.12 + 1.43\, educ \]
it is not meaningful to call the number 1.43 'unbiased'; unbiasedness is a statement about the estimator under repeated sampling:
\[ E(\hat{\beta}_1 \mid X) = \beta_1. \]
Figure 3.2: OLS estimation results from the first generated data set.
The second and fourth generated data sets give us \( \hat{\beta}_1 \approx 1.98 \), very close to β1 = 2. The first data set gives us \( \hat{\beta}_1 \approx 1.86 \), which is a bit farther off. If we repeat the experiment again and again and then average the outcomes of \( \hat{\beta}_1 \), we would get very close to 2. The problem is that we do not know which kind of sample we have. We never know whether we are close to the population value. Generally, we hope that our sample is 'typical' and produces a slope estimate close to β1, but we never know.
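The repeated-sampling thought experiment behind these figures is easy to mimic in R. A minimal (not examinable) sketch, using the same population slope β1 = 2 as the generated data sets and an arbitrary intercept of 1:

```r
set.seed(3)
reps <- 1000
b1 <- numeric(reps)
for (r in 1:reps) {
  x <- rnorm(100)              # fresh sample of the regressor
  u <- rnorm(100)              # error term satisfying SLR.4
  y <- 1 + 2 * x + u           # population: beta0 = 1, beta1 = 2
  b1[r] <- coef(lm(y ~ x))[2]  # slope estimate from this sample
}
mean(b1)   # close to 2: estimates average out to the truth (unbiasedness)
sd(b1)     # the sampling variation of the estimator
```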
Figure 3.3: OLS estimation results from the second generated data set.
Figure 3.4: OLS estimation results from the third generated data set.
Figure 3.5: OLS estimation results from the fourth generated data set.
Even though under SLR.1 to SLR.4, the OLS estimators are unbiased, depending on
the sample, the estimates will be nearer or farther away from the true population
values. How far we can expect our estimates to be away from the true population values
depends on the variance (and, ultimately, standard deviation) of the estimators. This
segment characterises the variance of the OLS estimators under SLR.1 to SLR.4 and
under an additional Assumption SLR.5 (homoskedasticity) which simplifies the
calculations.
Homoskedasticity requires \( Var(u \mid x) = \sigma^2 \) and implies that:
\[ Var(y \mid x) = \sigma^2 \]
that is, that the variance of y does not change with x. It is important to keep in mind
that the constant variance assumption may not be realistic; it must be determined on a
case-by-case basis (it will be discussed later in the course how to use the data to test
whether SLR.5 holds and how to proceed without it). Consider the following two
examples of when Assumption SLR.5 is violated.
Example 3.1 Suppose that saving is related to income through \( sav = \beta_0 + \beta_1 inc + u \), with β1 > 0. This means average family saving increases with income. If we impose SLR.5, then:
\[ Var(sav \mid inc) = \sigma^2 \]
which means the variability in saving does not change with income. There are
reasons to think saving would be more variable as income increases.
Example 3.2 Suppose y is an average score, at the school level, and x is school
enrolment. The variance of an average is inversely related to the number of units
being averaged. In fact:
\[ Var(y \mid x) = \frac{\eta^2}{x} \]
often holds, where \( \eta^2 \) is the variance in individual scores.
As the error variance increases, that is, as \( \sigma^2 \) increases, so does \( Var(\hat{\beta}_1 \mid X) \). The more 'noise' in the relationship between y and x – that is, the larger the variability in u – the harder it is to learn about β1. By contrast, more variation in \( \{x_i\} \) is a good thing:
\[ Var(\hat{\beta}_1 \mid X) = \frac{\sigma^2}{SST_x}, \quad \text{where } SST_x = \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
so as \( SST_x \) rises, \( Var(\hat{\beta}_1 \mid X) \) falls. Notice that \( SST_x/n \) is the sample variance of x. We can think of this as getting close to the population variance of x, \( \sigma_x^2 \), as n gets large. This means:
\[ SST_x \approx n\sigma_x^2 \]
which means that, as n grows, \( Var(\hat{\beta}_1 \mid X) \) shrinks at the rate 1/n. This is why more data is a good thing: more data shrinks the sampling variance of our estimators.
The standard deviation of \( \hat{\beta}_1 \) is the square root of the variance. So:
\[ sd(\hat{\beta}_1) = \frac{\sigma}{\sqrt{SST_x}}. \]
Since \( \sigma^2 \) is not known, \( Var(\hat{\beta}_1 \mid X) \) (or \( sd(\hat{\beta}_1) \)) cannot be computed. However, we can find a way to estimate \( \sigma^2 \). The estimator of \( \sigma^2 \) used universally is:
\[ \hat{\sigma}^2 = \frac{SSR}{n-2} = (n-2)^{-1} \sum_{i=1}^{n} \hat{u}_i^2. \]
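A short (not examinable) R check that these formulas match the standard error reported by lm(), using simulated data as before:

```r
set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)
y <- 1 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)
n <- length(y)

sigma2_hat <- sum(resid(fit)^2) / (n - 2)   # SSR / (n - 2)
SST_x <- sum((x - mean(x))^2)
se_b1 <- sqrt(sigma2_hat / SST_x)           # standard error of the slope

c(se_b1, summary(fit)$coefficients["x", "Std. Error"])   # identical
```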
Activity 3.8 Show that under Assumptions SLR.1 through SLR.4 we have:
\[ E(\hat{\beta}_0) = \beta_0. \]
Instructions. You may find it helpful to proceed using the following steps.
3. Find the conditional expectation \( E(\hat{\beta}_0 \mid X) \). You can use the fact that it is already known that under Assumptions SLR.1–SLR.4:
\[ E(\hat{\beta}_1 \mid X) = \beta_1. \]
Activity 3.9 Take the MCQs related to this section on the VLE to test your
understanding.
Here we will provide instructions on how to install R and RStudio on your computer.
We will be discussing how to implement econometrics using real data with R
throughout this course. It is a great transferable skill to have. You will not be tested on
your ability to use R in the examination. You should, however, be able to read regression output.
(i) As this is a log-log model, we need to interpret the parameter as the elasticity of
price with respect to dist. By increasing the house’s distance to a recently built
garbage incinerator by 1%, its price can be expected to increase by .312%. We
clearly would expect this sign, as living closer to an incinerator depresses housing
prices.
(ii) For the ceteris paribus condition to hold we need to ensure that by changing the
location of the incinerator all other factors that affect the price remain unchanged.
Clearly this is not true if cities decide to locate the incinerator in areas that are not
close to expensive neighbourhoods. We should therefore be concerned that SLR.4 is
not satisfied.
(iii) There are many factors that could affect the price of houses, such as the size of the
house, the number of bathrooms, age of the home, and the quality of the
neighbourhood (including school quality). These variables could easily be
correlated with dist and log(dist).
We start by specifying the two models we want to consider. The model with the variables in their original units of measurement:
\[ y = \beta_0 + \beta_1 x + u \]
and the model where both the dependent and independent variables have different units of measurement, \( y_i^* = c_1 y_i \) and \( x_i^* = c_2 x_i \).
We start by looking at the slope. With the new measurement, the estimator of the slope is:
\[ \tilde{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i^* - \bar{x}^*)(y_i^* - \bar{y}^*)}{\sum_{i=1}^{n} (x_i^* - \bar{x}^*)^2}. \]
We need to replace \( y_i^* = c_1 y_i \) and \( x_i^* = c_2 x_i \).
Notice:
\[ \bar{y}^* = \frac{1}{n}\sum_{i=1}^{n} y_i^* = \frac{1}{n}\sum_{i=1}^{n} (c_1 y_i) = c_1 \frac{1}{n}\sum_{i=1}^{n} y_i = c_1 \bar{y}. \]
Therefore \( \bar{y}^* = c_1 \bar{y} \) and similarly \( \bar{x}^* = c_2 \bar{x} \).
Using \( y_i^* = c_1 y_i \), \( x_i^* = c_2 x_i \), \( \bar{y}^* = c_1 \bar{y} \), and \( \bar{x}^* = c_2 \bar{x} \), we obtain:
\[ \tilde{\beta}_1 = \frac{\sum_{i=1}^{n} (c_2 x_i - c_2 \bar{x})(c_1 y_i - c_1 \bar{y})}{\sum_{i=1}^{n} (c_2 x_i - c_2 \bar{x})^2} = \frac{\sum_{i=1}^{n} c_2 (x_i - \bar{x})\, c_1 (y_i - \bar{y})}{\sum_{i=1}^{n} c_2^2 (x_i - \bar{x})^2} \]
which can be simplified to yield \( \tilde{\beta}_1 = \frac{c_1}{c_2} \hat{\beta}_1 \) as required.
Next we look at the intercept. With the new measurement, the estimator of the intercept is:
\[ \tilde{\beta}_0 = \bar{y}^* - \tilde{\beta}_1 \bar{x}^*. \]
Let us replace \( y_i^* = c_1 y_i \) and \( x_i^* = c_2 x_i \) again. This gives us:
\[ \tilde{\beta}_0 = c_1 \bar{y} - \tilde{\beta}_1 (c_2 \bar{x}) = c_1 \bar{y} - c_2 \tilde{\beta}_1 \bar{x}. \]
To show our result, we use our previous result \( \tilde{\beta}_1 = (c_1/c_2)\hat{\beta}_1 \):
\[ \tilde{\beta}_0 = c_1 \bar{y} - \frac{c_1}{c_2} \hat{\beta}_1 c_2 \bar{x} = c_1 (\bar{y} - \hat{\beta}_1 \bar{x}) = c_1 \hat{\beta}_0. \]
In this problem, we are asked to compare the parameter estimates of the following two log-linear models: the model where y is in its original unit of measurement:
\[ \log y = \beta_0 + \beta_1 x + u \]
where y > 0, and the model where we use a different unit of measurement, \( y_i^* = c_1 y_i \). The estimator of the intercept in the rescaled model is \( \tilde{\beta}_0 = \overline{\log(y^*)} - \tilde{\beta}_1 \bar{x} \),
where \( \overline{\log(y^*)} = \frac{1}{n}\sum_{i=1}^{n} \log(y_i^*) \).
Let us replace \( y_i^* = c_1 y_i \).
Using the properties of logarithms:
• \( \log(y_i^*) = \log(c_1 y_i) = \log(c_1) + \log(y_i) \)
• \( \overline{\log(y^*)} = \frac{1}{n}\sum_{i=1}^{n} \log(c_1 y_i) = \frac{1}{n}\sum_{i=1}^{n} [\log(c_1) + \log(y_i)] = \log(c_1) + \overline{\log(y)}. \)
Since the slope estimate is unaffected by rescaling y in this model, it follows that:
\[ \tilde{\beta}_0 = \log(c_1) + \overline{\log(y)} - \hat{\beta}_1 \bar{x} = \log(c_1) + \hat{\beta}_0. \]
1. We know that:
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}. \]
2. Averaging the population relationship over the sample gives \( \bar{y} = \beta_0 + \beta_1 \bar{x} + \bar{u} \), so that \( \hat{\beta}_0 = \beta_0 + (\beta_1 - \hat{\beta}_1)\bar{x} + \bar{u} \).
3. Taking conditional expectations and using \( E(\hat{\beta}_1 \mid X) = \beta_1 \) and \( E(\bar{u} \mid X) = 0 \):
\[ E(\hat{\beta}_0 \mid X) = \beta_0 + (\beta_1 - \beta_1)\bar{x} + 0 = \beta_0. \]
Explanatory variable
Dependent variable
Error term
Slope coefficient
Intercept
Residual
Fitted value
R-squared
Homoskedasticity
Heteroskedasticity
understand three issues when formulating a simple regression model and discuss
how they are addressed
discuss the key assumptions about how the regressor and error term are related
define OLS fitted values and residuals and derive their algebraic properties in OLS
estimation
discuss the effect of changing the units of measurement on the parameter estimates
and the R2
interpret the OLS estimates when the dependent and/or independent variables
appear in logarithmic form
establish unbiasedness of the OLS estimator and understand the role of the
assumptions in this result
derive the variance of the OLS estimator under the homoskedasticity assumption
and discuss how it is affected by different components.
3.5 Test your knowledge and understanding
(d) Define the R2 in general terms. Suppose R2 = 0.17 in the regression in (3.5.1).
Interpret this result.
3E.2 The dataset BWGHT.RAW contains data on births to women in the United States.
Two variables of interest are the dependent variable, infant birthweight in ounces
(bwght), and an explanatory variable, the average number of cigarettes the mother
smoked per day during pregnancy (cigs). The following simple regression was
estimated using data on n = 1,388 births:
\[ \widehat{bwght} = 119.77 - 0.514\, cigs. \]
(a) What is the predicted birth weight when cigs = 0? What about when
cigs = 20 (one pack per day)? Comment on the difference.
(b) Does this simple regression necessarily capture a causal relationship between
the child’s birthweight and the mother’s smoking habits? Explain.
(c) To predict a birthweight of 125 ounces, what would cigs have to be? Comment.
(d) The proportion of women in the sample who do not smoke while pregnant is
about 0.85. Does this help reconcile your finding from part (c)?
3E.3 Consider running the regression on a constant only and, thus, only estimating an
intercept.
(a) Given a sample \( \{y_i : i = 1, 2, \ldots, n\} \), let \( \tilde{\beta}_0 \) be the solution to:
\[ \min_{b_0} \sum_{i=1}^{n} (y_i - b_0)^2. \]
Show that \( \tilde{\beta}_0 = \bar{y} \), that is, the sample average minimises the sum of squared residuals.
(b) Define residuals \( \tilde{u}_i = y_i - \bar{y} \). Argue that these residuals always sum to zero.
3E.4 Consider the regression through the origin model:
\[ y_i = \beta x_i + u_i. \]
3E.5 To investigate the relationship between the price of wine and consumption of wine,
an economist estimates the following regression using a sample of 32 individuals for
one week in 2013:
\[ \widehat{\log(wine)} = \underset{(0.8911)}{4.2514} - \underset{(0.0031)}{0.8328}\, \log(price) \]
\[ n = 32, \quad R^2 = 0.89. \]
wine denotes the amount of wine consumed per week in millilitres (a medium glass
contains 175 ml), and price denotes the average price of a medium glass of wine of
a selection of wines during the week in GBP (£). The numbers in parentheses are
the standard errors.
(a) Discuss what would happen to the parameter estimate of the slope coefficient
if we had measured the amount of wine consumed per week in number of
medium glasses instead of millilitres. Explain your answer.
(b) Discuss what would happen to the parameter estimate of the intercept if we
had measured the amount of wine consumed per week in number of medium
glasses instead of millilitres. Explain your answer.
(c) It is very unlikely that the estimator for the slope coefficient is unbiased
because there are variables contained in the unobservable term that are
correlated with the number of bedrooms. For example, the number of
bedrooms and the total size of the house are probably positively correlated.
(d) The R² is a measure of goodness-of-fit and is defined as:
\[ R^2 = \frac{SSE}{SST} \]
where \( SSE = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \) is the explained sum of squares and \( SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 \) is the total sum of squares.
From R2 = 0.17 in the regression we can see that bdr explains 17% of the
variation in price.
3E.2 (a) When cigs = 0, the predicted birthweight is 119.77 ounces. When cigs = 20,
bwght = 109.49. This is about an 8.6% drop.
(b) Not necessarily. There are many other factors that can affect birthweight,
particularly education. Education could be correlated with both cigarette
smoking during pregnancies and knowledge about prenatal care.
(c) If we want a predicted bwght of 125, then:
\[ cigs = \frac{125 - 119.77}{-0.514} \approx -10.18 \]
or about −10 cigarettes! This is nonsense, of course, and it shows what
happens when we are trying to predict something as complicated as
birthweight with only a single explanatory variable. The largest predicted
birthweight is necessarily 119.77. Yet almost 700 of the births in the sample
had a birthweight higher than 119.77.
(d) 1,176 out of 1,388 women did not smoke while pregnant, or about 84.7%.
Because we are using only cigs to explain birthweight, we have only one
predicted birthweight at cigs = 0. The predicted birthweight 119.77 is
necessarily roughly in the middle of the observed birthweights at cigs = 0, and
therefore a large number of birthweights are above the intercept 119.77. To
predict such higher birthweights, the value of cigs has to be negative.
3E.4 (a) By taking the derivative of \( \sum_{i=1}^{n} (y_i - b x_i)^2 \) with respect to b, the first-order condition of the above minimisation problem is:
\[ (-2) \sum_{i=1}^{n} x_i (y_i - \hat{\beta} x_i) = 0 \]
or equivalently:
\[ 0 = \sum_{i=1}^{n} x_i (y_i - \hat{\beta} x_i) = \sum_{i=1}^{n} x_i y_i - \hat{\beta} \sum_{i=1}^{n} x_i^2. \]
By SLR.2, \( \{x_i : i = 1, \ldots, n\} \) cannot all be zero. Thus, \( \sum_{i=1}^{n} x_i^2 \neq 0 \). So, by solving for \( \hat{\beta} \), we obtain the conclusion:
\[ \hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}. \]
\[ E(\hat{\beta}) = E\left(\beta + \frac{\sum_{i=1}^{n} x_i u_i}{\sum_{i=1}^{n} x_i^2}\right) = \beta + E\left(\frac{\sum_{i=1}^{n} x_i u_i}{\sum_{i=1}^{n} x_i^2}\right) = \beta + \frac{\sum_{i=1}^{n} x_i E(u_i)}{\sum_{i=1}^{n} x_i^2} = \beta \]
where the first equality follows from (b), the second equality follows from the property of \( E(\cdot) \), the third equality follows from the fact that we treat X as non-random and the property of \( E(\cdot) \), and the last equality follows from SLR.4.
Similarly, treating X as non-random and using the homoskedasticity assumption:
\[ Var(\hat{\beta}) = \frac{\sum_{i=1}^{n} x_i^2\, Var(u_i)}{\left(\sum_{i=1}^{n} x_i^2\right)^2} = \frac{\sigma^2}{\sum_{i=1}^{n} x_i^2}. \]
3E.5 (a) The parameter estimate of the slope coefficient will stay the same if we measure the amount of wine consumed per week in medium glasses instead of millilitres. The coefficient should be interpreted as the elasticity of wine consumption with respect to price: a 1.0% increase in price will result in a 0.8328% decline in the amount of wine consumed. This elasticity does not vary with the way we measure wine consumption (or prices, for that matter).
Formally, we note (subscripts indicating units of measurement):
\[ \log(wine_{glasses}) = \log(wine_{ml}/175) = \log(wine_{ml}) - \log(175) \]
where we make use of the properties of the log operator, which ensure \( \log(x \cdot y) = \log(x) + \log(y) \), for x, y > 0. With our original specification given by:
\[ \log(wine_{ml}) = \alpha + \beta \log(price) + u \]
we obtain, upon substitution:
\[ \log(wine_{glasses}) = (\alpha - \log(175)) + \beta \log(price) + u \]
so the slope β is unchanged, while the intercept shifts by \( -\log(175) \).
Chapter 4
Multiple regression analysis:
Estimation
4.1 Introduction
discuss the advantage of the multiple regression model over the simple regression
model
understand the principles behind the derivation of the OLS estimator of parameters
in the multiple regression model
have knowledge of the expression for \( Var(\hat{\beta}_j \mid X) \) under MLR.1–MLR.5, and have a clear understanding of what factors improve the precision of the OLS estimators
understand how the standard errors of the regression parameters are obtained
use statistical software to apply the OLS estimation technique to analyse the
relationship between real-world economic variables.
After motivating the use of the multiple linear regression model in Section 3.1 in
Wooldridge, the chapter shows how to extend the ordinary least squares approach to
the multiple regression setting in Section 3.2 in Wooldridge. This section highlights the
importance of interpreting the parameters as partial effects: the effect of a particular
independent variable on the dependent variable while 'controlling' for, that is, holding
fixed, all other factors. Rather than explicitly controlling for everything
else, it argues that we can also initially ‘partial-out’ the effect of all other factors (the
Frisch–Waugh–Lovell Theorem). Terminology introduced in the simple linear regression
model, such as fitted values, residuals and the goodness-of-fit R2 easily generalise to the
multiple regression model. Section 3.3 in Wooldridge discusses the conditions that
guarantee the OLS estimator is unbiased and reveals that while including irrelevant
variables will not change this result, excluding relevant variables will lead to the
Omitted Variable Bias (OVB) result. In Section 3.4 in Wooldridge, the sampling
variance of the OLS slope parameters, and its components, is discussed under the
additional assumption of homoskedasticity. We learn how to estimate this sampling
variance using data and obtain standard errors of our estimates. In Section 3.5 in
Wooldridge, the Gauss–Markov Theorem is discussed, which ensures that OLS is the
best linear unbiased estimator (BLUE); this is an efficiency result.
It is important to understand the motivation for the multiple linear regression model. It
provides a framework that will, more readily, allow us to provide a causal interpretation
to our parameters. By introducing more independent variables (controls, also referred to
as confounders), we can isolate the effect we are interested in: the effect of a particular
independent variable ‘holding everything else constant’.
You should be able to derive the first order conditions that define the OLS estimates in
the multiple linear regression setting (see also Activity 4.1). You are not expected to
solve for the parameter estimates in the multiple linear regression setting. It is
important that you interpret the OLS estimates as partial effects, effects that control
(hold constant) all other factors (also called confounders) and apply this correctly in
empirical examples.
It is useful to understand the important result that shows that, instead of explicitly
controlling for everything by including additional regressors (controls), we may
alternatively ‘partial-out’ the influence of the other explanatory variables beforehand.
Specifically, one can show that the estimated coefficient of an explanatory variable, x_j,
in a multiple regression can be obtained in two steps:

1. Regress x_j on all other explanatory variables and compute the residuals:
\[
\hat{r}_j = x_j - \hat{x}_j
\]
where x̂_j denotes the fitted values from this first regression.

2. Regress y on the residuals r̂_j; the slope of this simple regression equals the multiple regression coefficient β̂_j.
The residuals from the first regression are the part of the explanatory variable that is
uncorrelated with the other explanatory variables. The residuals explicitly control
for the other explanatory variables by removing any correlation they have with
them.
The slope coefficient of the second regression therefore represents the isolated effect
of the explanatory variable on the dependent variable.
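As an illustration, here is a minimal R sketch of the two-step procedure on simulated data; all variable names are illustrative assumptions rather than a dataset from the guide.

# Frisch-Waugh-Lovell (partialling-out) on simulated data
set.seed(42)
n  <- 200
x2 <- rnorm(n)
x1 <- 0.5 * x2 + rnorm(n)              # x1 is correlated with x2
y  <- 1 + 2 * x1 - x2 + rnorm(n)

coef(lm(y ~ x1 + x2))["x1"]            # full multiple regression slope on x1

r1 <- resid(lm(x1 ~ x2))               # step 1: residuals of x1 on x2
coef(lm(y ~ r1))["r1"]                 # step 2: identical slope on x1

The two slopes agree exactly, which is the partialling-out result described above.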
You should be familiar with the terminology and properties of OLS fitted values,
residuals (described in 3.2e) and the R2 (described in 3.2h).
Provide the three FOCs that define the OLS parameter estimators β̂_0, β̂_1, and β̂_2.
Activity 4.5 Take the MCQs related to this section on the VLE to test your
understanding.
The section starts by providing the conditions that ensure that the OLS estimators in
the multiple linear regression model are unbiased, see Theorem 3.1 in Wooldridge. It is
important to understand these conditions MLR.1–MLR.4 well. The zero conditional
mean assumption (also called a conditional independence assumption) is the most
complicated one to interpret. It ensures that we can assume that the expected value of
the unobservables, u, will not change if we change an independent variable holding all
other controls fixed, so that:
\[
\frac{\partial E(y \mid x_1, \ldots, x_k)}{\partial x_j} = \beta_j, \quad \text{for } j = 1, \ldots, k.
\]
You should be able to assess the likely direction of this OVB in empirical settings. You need to clearly indicate the fact
that the omitted variable is relevant (non-zero coefficient) and that the omitted variable
is correlated with the included regressor(s).
Let us look here in detail at the consequences of omitting relevant variables adding
some notation that you may want to adopt. We consider the setting where the true
population model has two explanatory variables and satisfies MLR.1 to MLR.4:
y = β0 + β1 x1 + β2 x2 + u. (*)
This regression will give unbiased OLS estimators which we denote as βbj,long , for
j = 0, 1, 2. Let βe1,short denote the estimator of β1 when we exclude the relevant variable
x2 , that is when we estimate:
y = β0 + β1 x1 + v.
The subscripts on our parameter estimates clearly denote whether we are considering
the multiple regression model (long) or the simple regression model (short).
In Section 3.2g in Wooldridge it is pointed out that it can be shown that:
\[
\tilde{\beta}_{1,short} = \hat{\beta}_{1,long} + \hat{\beta}_{2,long} \hat{\delta}_1
\]
where δb1 is the OLS estimator of the slope when regressing x2 on x1 . This expression
indicates that when we exclude a relevant variable (i.e. consider the short model), that
we are not only estimating the effect of interest (βb1,long is an unbiased estimator of β1 ),
but are estimating an additional effect βb2,long δb1 which represents the effect x1 has on y
through its correlation with x2 .
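The relation can be verified numerically. Below is a hedged R sketch on simulated data (variable names are illustrative) showing that the short-regression slope equals β̂_{1,long} + β̂_{2,long} δ̂_1 exactly.

# Omitted variable bias: short versus long regression
set.seed(2)
n  <- 500
x1 <- rnorm(n)
x2 <- 1 + 0.8 * x1 + rnorm(n)          # x2 is correlated with x1
y  <- 2 + 1.5 * x1 - 0.7 * x2 + rnorm(n)

long   <- lm(y ~ x1 + x2)
short  <- lm(y ~ x1)
delta1 <- coef(lm(x2 ~ x1))["x1"]      # slope from regressing x2 on x1

coef(short)["x1"]                                # short-regression slope
coef(long)["x1"] + coef(long)["x2"] * delta1     # identical, by the identity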
Whereas deriving this result is beyond the scope of this course, it is good to understand
why this is true. Let us describe the relation between x2 and x1 as:
\[
x_2 = \delta_0 + \delta_1 x_1 + e \qquad (**)
\]
where e is an error which is uncorrelated with x1 . If we substitute this expression in (*),
we obtain:
\[
y = \beta_0 + \beta_1 x_1 + \beta_2 (\delta_0 + \delta_1 x_1 + e) + u = \underbrace{(\beta_0 + \beta_2 \delta_0)}_{\text{intercept}} + \underbrace{(\beta_1 + \beta_2 \delta_1)}_{\text{slope}} x_1 + error.
\]
In the textbook you can find a derivation of the OVB that makes use of this relation.
Let us look at an alternative derivation, which you may find more appealing.
We are interested in deriving E(β̃_{1,short} | X). The conditioning on X allows us to treat
the regressors {(x_{1i}, x_{2i}): i = 1, \ldots, n} as fixed constants (not stochastic).
When we plug in the true (long) model (*), this allows us to write:
\[
\tilde{\beta}_{1,short} = \beta_1 + \beta_2 \frac{\sum_{i=1}^n (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)}{\sum_{i=1}^n (x_{1i} - \bar{x}_1)^2} + \frac{\sum_{i=1}^n (x_{1i} - \bar{x}_1)(u_i - \bar{u})}{\sum_{i=1}^n (x_{1i} - \bar{x}_1)^2}.
\]
We therefore conclude that, when we do not control for x_2, our estimator of β_1 does not
have a causal interpretation, as it incorporates the effect that x_2 (a confounder) has on y as well.
We should, however, recognise that there are two cases where excluding variables does
not lead to bias: when the excluded variable is irrelevant (β_2 = 0), and when the
excluded variable is uncorrelated with the included regressor (so that δ̂_1 = 0).
Activity 4.6 Take the MCQs related to this section on the VLE to test your
understanding.
In this section we introduce a simplifying assumption that the errors have constant
variance (homoskedasticity), given in MLR.5. The advantages of adding the assumption
of homoskedasticity are twofold.
1. The formula for the sample variance of our slope estimators will be simple and
permit us to discuss the factors that will help us obtain more precise parameter
estimates.
2. Under this additional assumption, our OLS estimator will have an important
efficiency property (Gauss–Markov Theorem) (see next section).
You should be familiar with the expression of the conditional variance of the OLS slope
parameters that can be derived under Assumptions MLR.1–MLR.5, specifically:
\[
\mathrm{Var}(\hat{\beta}_j \mid X) = \frac{\sigma^2}{SST_j (1 - R_j^2)} = \frac{\sigma^2}{(n-1)\,\mathrm{SampleVar}(x_j)(1 - R_j^2)}, \quad \text{for } j = 1, \ldots, k
\]
where n is the sample size, SST j the total sample variation in xj , and Rj2 the
goodness-of-fit from regressing xj on all other regressors. X represents the observable
independent variables, {(x1i , . . . , xki ), i = 1, . . . , n}. The conditioning simply allows us
to ignore the random nature of the independent variables. (As in the simple regression
model, the variance depends on the explanatory variables, and by conditioning on X we
can treat the independent variables as fixed constants.)
You are not expected to be able to derive the above expression in the multiple
regression setting, but you are expected to discuss its components and relate this
discussion to the presence of near multicollinearity. The discussion about the variance
inflation factor (VIF) presented in the textbook is not examinable.
You should know how to obtain an estimate of the variance under the Gauss–Markov
assumptions (we need to replace σ 2 by an unbiased estimator thereof) and how we
obtain the standard errors using:
\[
se(\hat{\beta}_j) = \sqrt{\widehat{\mathrm{Var}}(\hat{\beta}_j \mid X)}.
\]
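As a sanity check, the following R sketch (simulated data; names are illustrative) reproduces the standard error reported by lm() from the formula above.

# Reconstructing se(beta_1) from sigma^2 / (SST_1 (1 - R_1^2))
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)                        # some collinearity with x1
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)

fit    <- lm(y ~ x1 + x2)
sigma2 <- sum(resid(fit)^2) / df.residual(fit)   # unbiased estimator of sigma^2
SST1   <- sum((x1 - mean(x1))^2)                 # total sample variation in x1
R2_1   <- summary(lm(x1 ~ x2))$r.squared         # R^2 of x1 on the other regressor

sqrt(sigma2 / (SST1 * (1 - R2_1)))               # manual standard error
summary(fit)$coefficients["x1", "Std. Error"]    # matches lm's reported value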
It is critical to realise that the formula for the sampling variation of β̂_j displayed above,
and its estimate, require all Gauss–Markov assumptions to be satisfied. Any violations
of the Gauss–Markov assumptions will render this formula invalid, and standard errors
based on it will be incorrect. (We will later discuss how we can make the standard
errors robust to violations such as heteroskedasticity.)
Section 3.4b in Wooldridge finally highlights that including irrelevant variables, while
leaving the OLS estimator unbiased, is not without its cost as it will lead to a higher
variance. Nevertheless, the validity of the standard errors remains in the presence of
irrelevant variables. Excluding relevant variables, on the other hand, will not only lead
to biased parameters but it will also lead to invalid standard errors.
Activity 4.9 Take the MCQs related to this section on the VLE to test your
understanding.
In this last section, we discuss the Gauss–Markov Theorem. It shows that under the
assumptions MLR.1–MLR.5 (also called our Gauss–Markov assumptions) the OLS
estimator has the desirable property that there is no other linear, unbiased estimator
that has a lower variance, in other words the OLS estimator is ‘Best’. The proof is not
examinable. It is good to point out that this efficiency result is limited in the sense that
it does not rule out the existence of a nonlinear and/or biased estimator that
'beats' the OLS estimator.
It should be clear also that any violation of the Gauss–Markov assumptions (for
instance, due to the presence of heteroskedasticity) will render our OLS estimator βbj no
longer efficient (BLUE). In such settings, clearly, we may want to discuss how we can
regain an efficiency result in its presence.
Activity 4.10 Take the MCQs related to this section on the VLE to test your
understanding.
4.3 Answers to activities
Here we show how to load a relevant dataset in R. We also look at how to perform OLS
on a bivariate and multivariate model using the command ‘lm’. We also briefly consider
the issue of multicollinearity.
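A minimal sketch of this workflow is given below; the file name and variable names are illustrative assumptions, not the actual dataset used on the VLE.

# Load a dataset and fit bivariate and multivariate models with lm()
dat <- read.csv("consumption.csv")                 # hypothetical data file

fit_biv  <- lm(cons ~ inc, data = dat)             # bivariate model
fit_mult <- lm(cons ~ inc + I(inc^2), data = dat)  # adds a quadratic in income
summary(fit_biv)
summary(fit_mult)

# A first look at multicollinearity: inc and inc^2 are typically highly correlated
with(dat, cor(inc, inc^2))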
In this case we have k = 2 (two slopes) and have used x_{i1} = inc_i and x_{i2} = inc_i^2.
The three first-order conditions required for OLS are:
\[
\frac{\partial SSR}{\partial b_0}: \; -2\sum_{i=1}^n (cons_i - \hat{\beta}_0 - \hat{\beta}_1\, inc_i - \hat{\beta}_2\, inc_i^2) = 0
\]
\[
\frac{\partial SSR}{\partial b_1}: \; -2\sum_{i=1}^n (cons_i - \hat{\beta}_0 - \hat{\beta}_1\, inc_i - \hat{\beta}_2\, inc_i^2)\, inc_i = 0
\]
\[
\frac{\partial SSR}{\partial b_2}: \; -2\sum_{i=1}^n (cons_i - \hat{\beta}_0 - \hat{\beta}_1\, inc_i - \hat{\beta}_2\, inc_i^2)\, inc_i^2 = 0
\]
(i) Yes. Because of budget constraints, it makes sense that, the more siblings there are
in a family, the less education any one child in the family has. To find the increase
in the number of siblings that reduces predicted education by one year, we solve:
\[
\widehat{\Delta educ} = -1 = -.094\,\Delta sibs
\]
so Δsibs = 1/.094 ≈ 10.6.
(ii) Holding sibs and feduc fixed, one more year of mother’s education implies .131
years more of predicted education.
If a mother has four more years of education, her son is predicted to have about
half a year (4 × .131 = .524) more education.
(iii) Since the number of siblings is the same, but meduc and feduc are both different,
the coefficients on meduc and feduc both need to be accounted for.
The predicted difference in education between B and A is:
.131 × (16 − 12) + .210 × (16 − 12) = 1.364.
(i) A larger rank for a law school means that the school has less prestige; this lowers
starting salaries. For example, a rank of 100 means there are 99 schools thought to
be better.
(ii) We expect both β1 and β2 to be positive since both LSAT and GPA are measures
of the quality of the entering class. No matter where better students attend law
school, we expect them to earn more, on average.
We also expect β3 and β4 to be positive since the number of volumes in the law
library and the tuition cost are both measures of the school quality. (Cost is less
obvious than library volumes, but should reflect quality of the faculty, facilities,
and so on.)
(iii) This is a semi-elasticity. One additional GPA unit increases the salary by 24.8%,
ceteris paribus. This is an example of the log-linear specification interpretation.
(iv) This is an elasticity. A one percent increase in library volumes implies a 0.095%
increase in predicted median starting salary, other things being equal. This is an
example of the log-log specification interpretation.
(i) Using the properties of the expectation operator (Property E.3), we have:
\[
E(\hat{\theta}) \equiv E(\hat{\beta}_1 + \hat{\beta}_2) = E(\hat{\beta}_1) + E(\hat{\beta}_2) = \beta_1 + \beta_2.
\]
(ii) Using the properties of the variance operator (Property VAR.4), we have:
\[
\mathrm{Var}(\hat{\theta}) \equiv \mathrm{Var}(\hat{\beta}_1 + \hat{\beta}_2) = \mathrm{Var}(\hat{\beta}_1) + \mathrm{Var}(\hat{\beta}_2) + 2\,\mathrm{Cov}(\hat{\beta}_1, \hat{\beta}_2).
\]
(i) The fact the x2 and x3 have large partial effects on y ensures they are both relevant
variables.
The high correlation of x1 with x2 and x3 indicates that we will get omitted
variable bias (OVB) (we only get OVB when the omitted variables are correlated
with x1 ). This means that βe1,short will then capture their effect on y (large) as well.
It is therefore likely that the estimates will be very different.
(ii) Here we would expect βe1,short and βb1,long to be similar (subject to what we mean by
‘almost uncorrelated’). Leaving out variables that are not correlated with included
regressors does not result in OVB. The amount of correlation between x2 and x3
does not directly affect the multiple regression estimate on x1 .
(iii) In this case we are (unnecessarily) introducing multicollinearity into the regression:
x_2 and x_3 have small partial effects on y (so their relevance is questionable) and they
are highly correlated with x_1.
This multicollinearity is likely to result in an increase of the standard error of the
coefficient on x1 ; that is se(βb1,long ) is likely to be much larger than se(βe1,short ).
(iv) In this case, adding x2 and x3 will decrease the residual variance, σ 2 , without
causing much collinearity (because x1 is almost uncorrelated with x2 and x3 ), so we
should see se(βb1,long ) smaller than se(βe1,short ). The amount of correlation between
x2 and x3 does not directly affect se(βb1,long ).
4.4 Overview of chapter

Key terms and concepts

Partial effect
Confounders
Perfect collinearity
Multicollinearity
Gauss–Markov Theorem
A reminder of your learning outcomes

discuss the advantage of the multiple regression model over the simple regression
model
understand the principles behind the derivation of the OLS estimator of parameters
in the multiple regression model
knowledge of the expression for Var (βbj | X) under MLR.1–MLR.5, and have a clear
understanding of what factors improve the precision of the OLS estimators
understand how the standard errors of the regression parameters are obtained
use statistical software to apply the OLS estimation technique to analyse the
relationship between real-world economic variables.
4.5 Test your knowledge and understanding
(c) You are told that you can obtain the estimator of β1 by running the regression:
4.2E In this question we are interested to see whether fast-food restaurants charge
higher prices in areas with a larger concentration of ethnic minorities. ZIP
code-level data on prices for various items along with characteristics of the ZIP
code population in New Jersey and Pennsylvania are used.
Let us consider a model to explain the price of soda, psoda, in terms of the
proportion of the population from ethnic minorities, prpem, and median income,
income. Price and income are measured in US$. We obtain the following result:
\[
\widehat{psoda} = .956 + .115\,prpem + .0000016\,income. \qquad (2.1)
\]
Interpret the parameter estimate on prpem, and discuss what would happen to
the parameter estimates if we use pctem instead of prpem.
4.3E We are interested in investigating the factors governing the precision of regression
coefficients. Consider the model:
Yi = β1 + β2 X2i + β3 X3i + εi
with OLS parameter estimates βb1 , βb2 and βb3 . Under the Gauss–Markov
assumptions, we have:
\[
\mathrm{Var}(\hat{\beta}_2 \mid X) = \frac{\sigma_\varepsilon^2}{\sum_{i=1}^n (X_{2i} - \bar{X}_2)^2} \times \frac{1}{1 - r_{X_2 X_3}^2}
\]
where σ_ε² is the variance of ε and r_{X_2 X_3} is the sample correlation between X_2 and
X_3.
(a) Provide at least four factors that help us obtain more precise parameter
estimates of βb2 .
(b) In light of your answer to (a), discuss the concept of near multicollinearity.
Provide a real-life example where this problem is likely to occur.
(c) Are the following statements true or false? Give an explanation.
i. In multiple regression, multicollinearity implies that the least squares
estimators of the coefficients are biased and standard errors invalid.
ii. If the coefficient estimates in an equation have high standard errors, this
is evidence of high multicollinearity.
\[
\mathrm{Bias}(\hat{\beta}_{1,short}) = \beta_2 \hat{\delta}_1
\]
\[
avgabil = \delta_0 + \delta_1\, avgtrain + v
\]
\[
\tilde{\beta}_{1,short} = \beta_1 + \beta_2 \frac{\sum_{i=1}^n (avgtrain_i - \overline{avgtrain})(avgabil_i - \overline{avgabil})}{\sum_{i=1}^n (avgtrain_i - \overline{avgtrain})^2} + \frac{\sum_{i=1}^n (avgtrain_i - \overline{avgtrain})(u_i - \bar{u})}{\sum_{i=1}^n (avgtrain_i - \overline{avgtrain})^2}
\]
\[
= \beta_1 + \beta_2 \frac{\mathrm{SampleCov}(avgtrain_i, avgabil_i)}{\mathrm{SampleVar}(avgtrain_i)} + \frac{\sum_{i=1}^n (avgtrain_i - \overline{avgtrain})\,u_i}{\sum_{i=1}^n (avgtrain_i - \overline{avgtrain})^2}.
\]
(c) This relates to the partialling out interpretation of our slope parameters.
Hereby, we first remove all correlation that avgprod and avgtrain have with avgabil.
4.2E (a) If, say, prpem increases by .10 (ten percentage points), the price of soda is
estimated to increase by .0115 dollars, or about 1.2 cents. While this does not
seem large, there are communities with no ethnic minorities and others that
are almost all ethnic minorities, in which case the difference in psoda is
estimated to be almost 11.5 cents.
(b) If we use pctem instead of prpem, its coefficient would equal .00115 to ensure it
has the same interpretation; everything else remains the same. If we measure
income in $10,000s, its coefficient becomes .016, everything else remains the
same.
(c) The discrimination effect is likely to be larger when we control for income.
This is due to the fact that we would expect a negative omitted variable bias
when excluding income (relevant variable). The bias is negative as income has
a positive coefficient in the multiple regression and we anticipate a negative
correlation between prpem and income.
(d) There is no argument that they are highly correlated, but we are using them
simply as controls to determine the price discrimination against ethnic
minorities. In order to isolate the pure discrimination effect, controlling for
multiple measures of income (even if highly correlated) is a good idea.
(e) If prpem increases by .10 (or 10 percentage points), log(psoda) is estimated to
increase by .10(.122) = .0122, or about 1.22%.
(f) If we use pctem instead of prpem, its coefficient would equal .00122 to ensure it
has the same interpretation; everything else remains the same.
4.3E (a) You should note that the expression of the variance can be rewritten as:
\[
\mathrm{Var}(\hat{\beta}_2 \mid X) = \frac{\sigma_\varepsilon^2}{n \times \mathrm{SampleVar}(X_2) \times (1 - r_{X_2 X_3}^2)}.
\]
The four factors that can improve the precision of our parameter estimates
therefore are: (i) a larger sample size, n, (ii) larger sample variability of the X_2
regressor, (iii) a smaller (in absolute value) correlation between the X_2 and X_3
regressors, r_{X_2 X_3}, and (iv) a smaller error variance, σ_ε².
(b) Perfect multicollinearity means that some of the regressors can be written as a
linear combination of the other regressors, for example, X_i = \sum_{j \neq i} \lambda_j X_j,
where the X's are regressors. Multicollinearity (often referred to as near multicollinearity)
indicates that there is a close linear relation between the regressors.
A real-life example of multicollinearity is in a setting where we try to explain
the determinants of educational attainment and we use various measures of
cognitive skills, which are highly correlated.
Chapter 5
Multiple regression analysis:
Inference
5.1 Introduction
contrast the sampling distribution of (βbj − βj )/se(βbj ) under MLR.1 to MLR.6 with
the sampling distribution of (βbj − βj )/sd (βbj )
perform the t test for testing a hypothesis about a single population parameter
understand the benefit of using one-sided t test over two-sided t test
obtain p-values for t tests (one-sided and two-sided)
construct confidence interval (CI) for population parameters and explain its
interpretation
use confidence interval (CI) for hypothesis testing against two-sided alternatives
explain the meaning of linear restrictions
perform the t test for testing a hypothesis about a single linear restriction of the
parameters and understand how to obtain the standard error of linear
combinations of the estimators
formulate the F test in two equivalent forms: (i) based on comparing the SSR of
the restricted and unrestricted models or (ii) based on a comparison of the R2 of
the restricted and unrestricted models. Interpret the F test as a test for the loss in
goodness of fit
perform the F test for testing multiple (joint) linear restrictions of the population
parameters, and perform the F test for the overall significance of a regression
explain the relation between the F test and the t test when testing a single linear
restriction
use statistical software to implement the t test and the F test when analysing the
relationship between real-world economic variables.
Under the classical linear model assumptions MLR.1 to MLR.6, we have:
\[
\frac{\hat{\beta}_j - \beta_j}{sd(\hat{\beta}_j)} \sim \mathrm{Normal}(0, 1) \qquad (*)
\]
where sd(β̂_j) denotes the standard deviation (square root of the variance) of β̂_j.
The normal distribution is very convenient. Moreover, all tests we will construct based
on MLR.1 to MLR.6 will turn out to be valid in large samples even when the
normality assumption, MLR.6, is not imposed.
You are expected to be familiar with properties of the normal distribution and be able
to use the statistical table of the Normal (0, 1) distribution (Table G.1 in Wooldridge).
Activity 5.1 Take the MCQs related to this section on the VLE to test your
understanding.
You should make sure you understand the general principles underlying testing the
significance of a particular population parameter first (in other words: is the partial
effect of x_j on y zero, i.e. is β_j = 0?).
Intuitively: We should reject H0 : βj = 0 when the realisation of our OLS estimator βbj
is not close to zero.
Under H0 , this distribution (bell-shaped) will be centred around the true value
(zero in this case). Realisations in the tails of this distribution (far from the centre)
suggest we should reject H0 .
Formally: To test this (and other hypotheses), we will look for a test statistic that
satisfies the following two criteria under H0 .
1. The test statistic has a nice distribution. (Allows us to use statistical tables to
determine critical values.)
2. The test statistic does not contain any unknown quantities, such as σ 2 , the variance
of the error. (Allows us to compute it given the data.)
This latter statement is the reason why it is useful to recognise that under MLR.1 to
MLR.6, by replacing sd (βbj ) with se(βbj ) in (*) above, we obtain a t distribution for the
standardised estimators:
\[
\frac{\hat{\beta}_j - \beta_j}{se(\hat{\beta}_j)} \sim t_{n-k-1} = t_{df}. \qquad (**)
\]
The t distribution is fully characterised by the degrees of freedom, which is given here by
df = n − k − 1 (sample size minus the number of parameters – intercept and k slopes).
To test the hypothesis H0 : βj = 0, we should use the t statistic:
\[
t_{\hat{\beta}_j} = \frac{\hat{\beta}_j}{se(\hat{\beta}_j)}.
\]
Our test statistic tβbj satisfies the two requirements: we can obtain a realisation of the t
test for a given sample as se(βbj ) can be computed using the data unlike sd (βbj ), and the
distribution under the null is nice. Using (**) given above, we should conclude that
under the null, when βj = 0, this implies that:
\[
t_{\hat{\beta}_j} \sim t_{n-k-1}.
\]
Under H_0:
\[
t_{\hat{\beta}_j} = \frac{\hat{\beta}_j}{se(\hat{\beta}_j)} \sim t_{n-k-1}.
\]
Our test should now reject when our realisation of βbj is far away from zero relative to
its standard error, se(βbj ). The decision rule (whether we should reject or not) will
depend on:
the significance level of our test. This denotes the probability of rejecting H0 when
it is true or, equivalently, our willingness to commit a Type I error. Typical
values of the significance level, denoted α, are 5% or 1%.
With the help of empirical examples, we see how to conduct both one-sided and
two-sided tests and the graphical discussion of the rejection rule should be well
understood.
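The mechanics can be sketched in R as follows (simulated data; names are illustrative): we compute the realisation of the t statistic, the two-sided 5% critical value and the corresponding p-value.

# Two-sided t test of H0: beta_j = 0 at the 5% level
set.seed(3)
n  <- 120
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.4 * x1 + 0 * x2 + rnorm(n)       # x2 is irrelevant by construction

fit  <- lm(y ~ x1 + x2)
tval <- coef(summary(fit))["x2", "t value"]  # realisation of the t statistic
df   <- df.residual(fit)                     # n - k - 1
crit <- qt(0.975, df)                        # two-sided 5% critical value
pval <- 2 * pt(-abs(tval), df)               # two-sided p-value
c(t = tval, critical = crit, p.value = pval)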
In Section 4–2c the above discussion is generalised to the setting where we want to test
another hypothesis about βj , say H0 : βj = 1. Under CLM assumptions, we can use the
following t statistic:
\[
t = \frac{\hat{\beta}_j - 1}{se(\hat{\beta}_j)} \sim t_{n-k-1} \quad \text{under } H_0.
\]
We should reject when the realisation of βbj is far from 1 relative to se(βbj ), which is the
same as rejecting when t is far away from zero.
It is good to point out here that there is a benefit associated with using a one-sided test
instead of a two-sided test when the direction of violations of the null is obvious (e.g.
when testing whether the returns to education are zero). When we compare the
rejection rules for one-sided and two-sided alternatives, we should recognise that for the
same level of significance, the critical value is in absolute value smaller when using a
one-sided alternative compared to a two-sided alternative. This reveals that for the
same level of significance (and knowing the direction where violations from the null
appear) we will start rejecting earlier using one-sided alternatives than two-sided
alternatives. As this ensures that we will also be rejecting earlier when the null is wrong,
this shows that one-sided hypothesis tests have the benefit of being more powerful
(recall: the power of a test is our ability to reject the null when the null is false).
Instead of conducting a hypothesis test for a given level of significance, it may be more
informative to answer the question: ‘What is the lowest level of significance at which
you would reject the null given the realisation of your test statistic?’. This level is
known as the p-value. It gives the probability of observing a realisation of our test
statistic more extreme (favouring the alternative hypothesis) than the realisation actually obtained for our test
statistic. If the p-value is lower than α (the significance level), then we should reject the
null. Again, a graphical discussion of how to obtain the p-values (one-sided and
two-sided) is most illustrative and should be well understood. It is good to recognise
that the p-value of a one-sided test is half the p-value of a two-sided test.
Activity 5.2 Take the MCQs related to this section on the VLE to test your
understanding.
Activity 5.3 Exercise 4.2 from Wooldridge and also answer the following.
(iv) Show graphically how you can obtain the p-value of the test H0 : β3 = 0 against
H1 : β3 > 0.
Activity 5.4 Exercise 4.4 from Wooldridge and also answer the following.
(v) Test the hypothesis that H0: β2 = 0.4 against H1: β2 ≠ 0.4 at the 5% level.
Interpret your result.
In this section we are shown that under CLM assumptions, the 95% confidence interval
for the population parameter βj , is given by:
\[
\left[\hat{\beta}_j - t_{df,2.5\%} \times se(\hat{\beta}_j),\;\; \hat{\beta}_j + t_{df,2.5\%} \times se(\hat{\beta}_j)\right]
\]
Our 95% confidence interval tells us therefore that under repeated sampling from
the same population, in 95% of the samples the interval we compute will contain
the true value βj .
For any particular sample, we do not know whether βj lies in the interval or not.
We should understand how to use confidence intervals for hypothesis testing purposes,
and appreciate the important link between the confidence level and the significance
level of our test. If aj does not lie in the 95% confidence interval, then we would
reject the null that βj = aj at the 5% level of significance.
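In R, confidence intervals follow directly from the fitted model; the sketch below (simulated data; names are illustrative) also notes the duality with the two-sided t test.

# 95% confidence intervals via confint()
set.seed(7)
n <- 80
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n)
fit <- lm(y ~ x)
confint(fit, level = 0.95)      # interval for each parameter
# We reject H0: beta_j = a_j at the 5% level (two-sided) exactly when a_j
# falls outside the corresponding interval.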
Activity 5.5 Take the MCQs related to this section on the VLE to test your
understanding.
Activity 5.6 As an extension of Activity 5.4, use Exercise 4.4 from Wooldridge to
answer the following.
It is important that you study this section carefully. It reveals that under the CLM
assumptions we can use the t test to test any single linear restriction on our
parameters. Examples of this are whether the effect of two variables is identical
(see discussion in textbook), or the example considered here whether the sum of two
parameters equals one (Cobb–Douglas production function):
log(output) = β0 + β1 log(capital ) + β2 log(labour ) + u (*)
where H0 : β1 + β2 = 1 suggests the presence of ‘constant returns to scale’ in an industry.
The main difficulty of the implementation of t tests in these settings is the computation
of the standard errors that appear in the denominator of our test statistic. To test
H0 : β1 + β2 = 1, for instance, we should use:
\[
t = \frac{\hat{\theta} - 1}{se(\hat{\theta})} \sim t_{df} \quad \text{under } H_0.
\]
To see what regression we need to run to obtain an estimate of θ and its standard error,
let us rewrite θ = β1 + β2 as β1 = θ − β2. Substituting this into (*) gives:
\[
\log(output) = \beta_0 + \theta \log(capital) + \beta_2 (\log(labour) - \log(capital)) + u
\]
so a regression of log(output) on log(capital) and (log(labour) − log(capital)) delivers θ̂ and se(θ̂) directly.
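A hedged R sketch of this reparameterisation on simulated data follows (names are illustrative; constant returns to scale holds by construction).

# Testing H0: beta1 + beta2 = 1 via reparameterisation
set.seed(11)
n    <- 150
lcap <- rnorm(n)
llab <- rnorm(n)
lout <- 0.1 + 0.3 * lcap + 0.7 * llab + rnorm(n, sd = 0.1)

# Regress log(output) on log(capital) and (log(labour) - log(capital)):
fit   <- lm(lout ~ lcap + I(llab - lcap))
theta <- coef(fit)["lcap"]                           # estimate of beta1 + beta2
se_th <- coef(summary(fit))["lcap", "Std. Error"]
(theta - 1) / se_th                                  # t statistic for H0: theta = 1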
Activity 5.7 Take the MCQs related to this section on the VLE to test your
understanding.
Often we may want to test multiple hypotheses about the underlying parameters
simultaneously. Specific examples discussed in this section include (i) testing the
significance of the regression, which involves testing the hypothesis that all slope
parameters are zero, and (ii) testing whether, after controlling for some dependent
variables, other independent variables have no effect on the dependent variable.
The F test we discuss for this purpose involves comparing the ‘fit’ of the original model
with the ‘fit’ of the model where the restrictions are imposed. To obtain the latter (also
called the restricted model) we simply need to plug in the restrictions in our original
model (also called the unrestricted model). The test wants to ascertain whether the loss
of fit by imposing restrictions is statistically significant.
Important: The F test discussed assumes all Gauss–Markov assumptions and the
normality of the errors (i.e. the CLM assumptions MLR.1 to MLR.6) are satisfied.
A nice formulation of the F statistic is given by:
\[
F = \frac{(SSR_r - SSR_{ur})/\text{number of restrictions}}{SSR_{ur}/\text{df of unrestricted model}}
\]
where SSR r denotes the residual sum of squares of the restricted model and SSR ur the
residual sum of squares of the unrestricted model. Under the CLM assumptions this test
statistic has an F distribution under the null which is fully characterised by the degrees
of freedom of the numerator (number of restrictions) and the degrees of freedom of the
denominator (df of the unrestricted model):
Under H0: F ∼ F_{number of restrictions, df of unrestricted model}.
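The following R sketch (simulated data; names are illustrative) computes the F statistic from the two residual sums of squares and checks it against R's built-in comparison of nested models.

# F test of H0: beta2 = beta3 = 0 from restricted and unrestricted SSRs
set.seed(5)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 0.5 * x1 + rnorm(n)                  # x2, x3 irrelevant by construction

ur <- lm(y ~ x1 + x2 + x3)                     # unrestricted model
r  <- lm(y ~ x1)                               # restricted model
SSR_ur <- sum(resid(ur)^2)
SSR_r  <- sum(resid(r)^2)
q   <- 2                                       # number of restrictions
dfu <- df.residual(ur)                         # df of unrestricted model
Fstat <- ((SSR_r - SSR_ur) / q) / (SSR_ur / dfu)
c(F = Fstat, crit5 = qf(0.95, q, dfu), p = 1 - pf(Fstat, q, dfu))
anova(r, ur)                                   # same F from the built-in comparison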
Note: Our test statistic satisfies both requirements: (i) we can compute it given the
data, and (ii) it has a nice distribution under the null.
Intuitively, we should reject the null if the loss in fit is statistically significant. Formally,
we should reject the null at a particular level of significance α if the realisation of our
test statistic exceeds the critical value. Denoting the critical value (see Table G.3 in
Wooldridge) by F_{q, df_ur, α}, with q denoting the number of restrictions and df_ur the
degrees of freedom of the unrestricted model, we reject H0 when F > F_{q, df_ur, α}.
As with the t test, you should know how to obtain the p-value and how they can be
used for inference purposes.
An alternative formulation of the F test is based on the R2 of the restricted and
unrestricted models instead of their residual sum of squares. Recall, our goodness of fit
measure R2 = 1 − SSR/SST . Therefore, the F test can also be written as:
\[
F = \frac{(R_{ur}^2 - R_r^2)/\text{number of restrictions}}{(1 - R_{ur}^2)/\text{df of unrestricted model}}.
\]
Since the R2 is commonly reported for regression results, unlike the SSR, this is a
convenient form.
Important: Care should be taken when using this formulation of the F test: the R²
of both the restricted and unrestricted models must give the proportion of the variance of
the same dependent variable that is explained by the model. We can only compare
the R² of two models when the dependent variable is identical. Furthermore, as
R²_ur ≥ R²_r, we have F ≥ 0.
Two further results discussed are noteworthy:
1. The test of the significance of the regression (null that all slope parameters are
zero) can be written in a form that shows a close relation to the goodness of fit of
the original model (k slopes and 1 intercept):
\[
F = \frac{R^2/k}{(1 - R^2)/(n - k - 1)}.
\]
Clearly when our R2 is large our independent variables do a good job in explaining
the variation in the dependent variable and hence we should reject the null that all
slopes are zero!
2. We can also use the F test to test a single linear restriction. In fact the F statistic
equals the square of the t statistic. The advantage of using the t statistic instead
(not the square of the test statistic) is that we can use a one-sided test.
Activity 5.9 Take the MCQs related to this section on the VLE to test your
understanding.
Activity 5.10 Exercise 4.6 from Wooldridge and also answer the following.
(v) Show in a graph how you can obtain the p-value of the test conducted in (iii).
Here we show how to perform tests of single and multiple linear hypotheses in R.
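For instance (a sketch on simulated data with illustrative names; the car package is an assumption here and must be installed separately):

# Single and joint linear hypothesis tests
set.seed(9)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 + x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

library(car)
linearHypothesis(fit, "x1 + x2 = 2")           # single linear restriction
linearHypothesis(fit, c("x1 = 0", "x2 = 0"))   # joint test of two restrictions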
5.3 Answers to activities

(iii) To test the hypothesis we use the t statistic β̂_3/se(β̂_3), which under H0 is
distributed as t_df with df = 209 − 4 = 205. The realisation of our test statistic
equals .00024/.00054 ≈ 0.444.
At the 10% level of significance, the one-sided critical value equals tdf , 10% = 1.282.
As the realisation of our t statistic lies well below this critical value, we conclude
that we find no evidence for a statistically significant positive effect.
The test assumes MLR.1 to MLR.6 are satisfied.
(iv) Based on this sample, the estimated ros coefficient appears to be different from
zero only because of sampling variation. Still, including irrelevant variables may
not be causing any harm.
(v) Under H0 the test statistic is distributed as t205 and the realisation of our test
statistic was 0.444. We need both to indicate the p-value graphically:
(i) H0 : β3 = 0 and H1 : β3 6= 0.
(ii) Other things being equal, a larger population increases the demand for rental
housing, which should increase rents; β1 (+).
The demand for overall housing is higher when average income is higher, pushing
up the cost of housing, including rental rates; β2 (+).
Both are elasticities!
(iii) The coefficient on log(pop) is an elasticity. The correct statement is that ‘a 10%
increase in population increases rent by 0.66% (= .066 × 10) ceteris paribus’.
(iv) At the 1% level of significance, the two-sided critical value equals t_{df, 0.5%} = 2.660.
As our t statistic is well above the critical value, we conclude that βb3 is statistically
different from zero at the 1% level. The student population as a percentage of total
population affects rent rates.
(v) The t statistic for this hypothesis equals (βb2 − 0.4)/se(βb2 ) which is about 1.32.
Given our CLM assumptions (MLR.1 to MLR.6), we know that (β̂2 − 0.4)/se(β̂2) ∼ t_df
under H0; at the 5% level of significance the two-sided critical value is t_{60, 2.5%} = 2.00.
As our t statistic lies below this critical value, we conclude that β̂2 is not
statistically different from 0.4 at the 5% level. That is, the elasticity is not
significantly different from .4, ceteris paribus.
Test assumes MLR.1 to MLR.6 are satisfied.
(vi) Using the same critical value as in (iv), t_{60, 0.5%} = 2.660, we obtain the 99%
confidence interval as:
\[
\left[\hat{\beta}_3 - 2.66 \times se(\hat{\beta}_3),\; \hat{\beta}_3 + 2.66 \times se(\hat{\beta}_3)\right]
\]
Zero does not lie in this interval, confirming the result in (iv): we should reject H0
at the 1% level of significance.
(vii) Using the same critical value as in (v), t_{60, 2.5%} = 2.00, we obtain the 95% confidence
interval as:
\[
\left[\hat{\beta}_2 - 2.00 \times se(\hat{\beta}_2),\; \hat{\beta}_2 + 2.00 \times se(\hat{\beta}_2)\right]
\]
As 0.4 lies in this interval, our result in (v) is confirmed: we should not reject H0
at the 5% level of significance.
(i) Recall the property of the variance: if W and Z are random variables and a and b
constants, then Var(aW + bZ) = a² Var(W) + b² Var(Z) + 2ab Cov(W, Z). Therefore:
\[
\mathrm{Var}(\hat{\beta}_1 - 3\hat{\beta}_2 \mid X) = \mathrm{Var}(\hat{\beta}_1 \mid X) + 9\,\mathrm{Var}(\hat{\beta}_2 \mid X) - 6\,\mathrm{Cov}(\hat{\beta}_1, \hat{\beta}_2 \mid X).
\]
Since the standard error equals the square root of the estimated variance (obtained
by replacing the error variance σ 2 with its estimator s2 ):
\[
se(\hat{\beta}_1 - 3\hat{\beta}_2) = \sqrt{se(\hat{\beta}_1)^2 + 9\,se(\hat{\beta}_2)^2 - 6 s_{12}}
\]
\[
t = \frac{\hat{\beta}_1 - 3\hat{\beta}_2 - 1}{se(\hat{\beta}_1 - 3\hat{\beta}_2)}.
\]
Defining θ1 = β1 − 3β2, so that β1 = θ1 + 3β2, and rewriting gives:
\[
y = \beta_0 + \theta_1 x_1 + \beta_2 (3x_1 + x_2) + \beta_3 x_3 + u.
\]
By running a regression of y on x_1, 3x_1 + x_2, and x_3 we can directly obtain θ̂1 and
its standard error, and (θ̂1 − 1)/se(θ̂1) gives the same test statistic as in (ii).
(i) For both hypotheses we should use the t test, and under H0 t ∼ tdf with
df = n − 2 = 86 (MLR.1 to MLR.6).
The critical value at the 5% level is approximately t90, 2.5% = 1.987 (two-sided).
• First hypothesis: t = βb0 /se(βb0 ) = −14.47/16.27 ≈ −.89. We fail to reject
H0 : β0 = 0.
• Second hypothesis: t = (βb1 − 1)/se(βb1 ) = (.976 − 1)/.049 ≈ −.49. Fail to reject
H0 : β1 = 1.
(ii) To test the joint hypothesis that β0 = 0 and β1 = 1, we need the SSR in the
restricted model. This turns out to yield SSR = 209,448.99. Carry out the F test
for the joint hypothesis.
\[
F = \frac{(209{,}448.99 - 165{,}644.51)/2}{165{,}644.51/86} \approx 11.37.
\]
This comfortably exceeds conventional critical values of the F_{2,86} distribution, so we reject the joint null hypothesis.
The restricted model is the simple regression model, the unrestricted model has 3
more regressors.
• We are testing q = 3 restrictions; n − k − 1 = df_ur = 88 − 5 = 83.
• We are given R²_r = .820 and R²_ur = .829.
• Our test statistic therefore equals:
\[
F = \frac{(.829 - .820)/3}{(1 - .829)/83} \approx 1.46
\]
which lies below conventional critical values of the F_{3,83} distribution, so we do not reject the null.
(i) We need to compute the F statistic for the overall significance of the regression
with n = 142 and k = 4:
\[
F = \frac{.0395/4}{(1 - .0395)/137} \approx 1.41.
\]
The 5% critical value with 4 for the numerator df and using 120 for the
denominator df, is 2.45, which is well above the value of F . Therefore, we fail to
reject H0 : βdkr = βeps = βnentinc = βsalary = 0 at the 5% level. With the t test we
conclude that no explanatory variable is individually significant at the 5% level.
The largest absolute t statistic is on dkr , tdkr = 1.60, which is not significant at the
5% level against a two-sided alternative.
(ii) The F statistic (with the same df) is now:
\[
F = \frac{.0330/4}{(1 - .0330)/137} \approx 1.17
\]
which is even lower than in part (i). None of the t statistics is significant at a reasonable level.
(iii) Both proposals are problematic as the use of logarithms requires the
original variables to be strictly positive. It will generate missing values and these firms
will be dropped, which will be a problem as you will no longer have a random
sample.
(iv) It seems very weak. There are no significant t statistics at the 5% level (against a
two-sided alternative), and the F statistics are insignificant in both cases. Plus, less
than 4% of the variation in returns is explained by the independent variables.
(i) A 10% increase in spending is associated with a 1.1 unit increase in the
mathematics pass rate, that is a 1.1 percentage point increase (as math10 is
measured in percentages).
The estimated intercept is the predicted value of the dependent variable when all
regressors are set to zero. As the minimum value of lexpend is 8.11, this does not
seem to be reasonable; a prediction of a pass rate equalling −69 indeed makes no
sense.
(ii) The low R2 means that lexpend alone does not help explain a lot of the variation in
math10 (but it also does not suggest that there is no correlation between the two).
There are indeed likely to be many omitted factors that will help explain the
variation in math10 (see examples in (iii)).
If expenditure levels were randomly assigned to schools it is unlikely that the
amount of variation in math10 explained by spending could be more than a few
percent of the total variation.
(iii) The coefficient on lexpend falls to 7.75, but its t statistic is 2.55, which is
statistically significant at the 5% level against a two-sided alternative (and at the
1% level against a one-sided one). Recognising that lexpend and lnchprg
are likely correlated suggests that the simple regression estimates are likely to
exhibit OVB.
(iv) Adding lenroll and lnchprg (essentially the poverty rate) allows us to explain
notably more variation in math10 , but it is still only about 19%. One might
improve the explanatory power by including variables such as average family
income and measures of parental education levels. Or, features of the teachers’
qualifications.
5.4 Overview of chapter

Key terms and concepts

Hypothesis test
Type I error
Type II error
Null hypothesis
Alternative hypothesis
Test statistic
Significance level
Critical value
Rejection region
t statistic
F statistic
p-value
Power of a test
A reminder of your learning outcomes

contrast the sampling distribution of (β̂_j − β_j)/se(β̂_j) under MLR.1 to MLR.6 with
the sampling distribution of (β̂_j − β_j)/sd(β̂_j)
perform the t test for testing a hypothesis about a single population parameter
construct confidence interval (CI) for population parameters and explain its
interpretation
use confidence interval (CI) for hypothesis testing against two-sided alternatives
perform the t test for testing a hypothesis about a single linear restriction of the
parameters and understand how to obtain the standard error of linear
combinations of the estimators
formulate the F test in two equivalent forms: (i) based on comparing the SSR of
the restricted and unrestricted models or (ii) based on a comparison of the R2 of
the restricted and unrestricted models. Interpret the F test as a test for the loss in
goodness of fit
perform the F test for testing multiple (joint) linear restrictions of the population
parameters, and perform the F test for the overall significance of a regression
explain the relation between the F test and the t test when testing a single linear
restriction
use statistical software to implement the t test and the F test when analysing the
relationship between real-world economic variables.
5.5 Test your knowledge and understanding

5.1E Consider a regression of log(Price) for Monet paintings sold at auction on
log Area and Aspect Ratio, estimated with n = 430 and R² = 0.336,
where Area = Width × Height and Aspect Ratio = Height/Width. The standard
errors are given in parentheses. We will assume all Gauss–Markov assumptions are
satisfied.
(a) Interpret the parameters and test the significance of each. What additional
assumption do you need to make?
(b) Explain whether the p-value for the test of significance of the Aspect Ratio is
larger or smaller than 0.05. Sketch a graph to indicate how the p-value is
obtained.
(c) Provide the 95% confidence interval for βlog Area . Discuss how you can use this
interval to test the significance of log Area.
(d) Test the joint significance of the regression. Discuss its relation to the goodness
of fit measure: R2 .
(e) You want to test H0 : βlog Area = 1 against H1 : βlog Area > 1. Perform this test
using the information provided and provide an economic interpretation of your
result. In view of your answer, indicate whether the p-value for the test is
bigger or smaller than 5% and provide a sketch that indicates how the p-value
is obtained.
5.2E Let us consider the estimation of a hedonic price function for houses. The hedonic
price refers to the implicit price of a house given certain attributes (e.g. the number
of bedrooms). The data contains the sale price of 546 houses sold in the summer of
1987 in Canada along with their important features. The following characteristics
are available: the lot size of the property in square feet (lotsize), the numbers of
bedrooms (bedrooms), the number of full bathrooms (bathrooms), and a variable
that indicates the presence of an air conditioner (airco) (1 = yes, 0 = no).
+ 0.216bathrooms i + 0.212airco i
(.023) (.024)
The standard errors are in parentheses, and SSR measures the residual sum of
squares.
(a) Interpret the parameter estimates on log(lotsize) and bedrooms.
(b) Test the significance of log(lotsize) and bedrooms.
(c) Suppose that the lot size of the property was measured in square metres rather
than square feet. How would this affect the parameter estimates of the slopes
and intercept? How would this affect the fitted values? Note: the conversion
(approximate) 1 m2 = 10 ft2 .
(d) We are interested in testing the hypothesis H0 : βbedrooms = βbathrooms against
the alternative H1 : βbedrooms 6= βbathrooms .
i. Give an interpretation to the hypothesis of interest.
ii. Provide the t statistic we can use to test this hypothesis and indicate
what we would need to estimate to obtain this test statistic.
iii. Discuss the following statement: ‘Standard regression output of the
following regression:
To test its significance, we test H0: β_{log Area} = 0 against H1: β_{log Area} ≠ 0 using
the t test. Under CLM assumptions (MLR.1 to MLR.6):
\[
\frac{\hat{\beta}_{\log Area}}{se(\hat{\beta}_{\log Area})} \sim t_{n-3} \quad \text{under } H_0.
\]
The realisation of our test statistic is very large, 1.334/.091 = 14.66, and
exceeds the two-sided critical value even at the 1% level of significance (2.576)
revealing its statistical significance. Conclude: Area matters in explaining
Price after controlling for Aspect Ratio.
The parameter on Aspect Ratio suggests that a one-unit increase in the Aspect
Ratio results in a 16.5% reduction in the price of the Monet painting, ceteris
paribus.
Using the t test here does not reveal a statistically significant effect. The t
statistic equals −.165/.128 = −1.29, which is even below the two-sided critical
value at the 10% level of significance. Conclude: After controlling for
log(Area), Aspect Ratio does not have a statistically significant effect on Price.
(b) The p-value for the test of significance is larger than .05, as we cannot reject
the null at the 5% level of significance as discussed in (a). Graph of the p-value
should clearly show area in both tails of a Student’s t distribution with n − 3
degrees of freedom; test statistic taking realisations in absolute value in excess
of 1.29.
(c) Under the CLM assumptions (MLR.1 to MLR.6), the 95% confidence interval
for βlog Area is given by:
\[
\left[\hat{\beta}_{\log Area} - 1.96 \times se(\hat{\beta}_{\log Area}),\; \hat{\beta}_{\log Area} + 1.96 \times se(\hat{\beta}_{\log Area})\right] = [1.16, 1.51]
\]
where 1.96 is obtained from the Student’s t distribution with large degrees of
freedom (same as N (0, 1) therefore). As zero does not lie in this 95%
confidence interval, we can conclude that we should reject the null
H0 : βlog Area = 0 against H1 : βlog Area 6= 0 at the 5% level of significance.
(d) Here we are asked to test the joint hypothesis H0: β_{log Area} = 0 and
β_{Aspect Ratio} = 0 against H1: at least one is non-zero. Under the CLM
assumptions (MLR.1 to MLR.6) we should use the F statistic:
\[
F = \frac{R^2/2}{(1 - R^2)/(n-3)} = \frac{.336/2}{(1 - .336)/427} \approx 108
\]
which far exceeds any conventional critical value of the F_{2,427} distribution, so the regression is jointly significant.
(e) Under the CLM assumptions (MLR.1 to MLR.6), we will use the t statistic,
here:
\[
\frac{\hat{\beta}_{\log Area} - 1}{se(\hat{\beta}_{\log Area})} \sim t_{427} \quad \text{under } H_0.
\]
The realisation of the test statistic equals (1.334 − 1)/0.091 = 3.6703. At the
5% level we need to reject when it exceeds 1.645 (one sided), which it does. We
therefore find evidence that auction prices are elastic. As we reject our
hypothesis at the 5% level, the p-value is smaller than .05.
5.2E (a) On average, holding the remaining variables in the regression constant, a 1%
increase in lot size is associated with a 0.4% increase in house price, and each
extra bedroom is associated with a 7.8% increase in house price.
(b) Under the CLM assumptions (MLR.1 to MLR.6) we will use the t test to test
these hypotheses (as in Question 1). For log(lotsize), we should test:
H0 : βlog(lotsize) = 0 against H1 : βlog(lotsize) 6= 0 using:
\[
\frac{\hat{\beta}_{\log lotsize}}{se(\hat{\beta}_{\log lotsize})} \sim t_{n-5} \quad \text{under } H_0.
\]
The realisation of our test statistic is very large .400/.028 = 14.28 and exceeds
the two-sided critical value even at the 1% level of significance (2.576).
bedrooms is also statistically significant at the 1% level of significance with a
realisation of the test statistic equal to .075/.015 = 5. Both log(lotsize) and
bedrooms are therefore individually significant.
(c) Let lotsize_i be the lot size in square feet and \widetilde{lotsize}_i the lot size in square
metres. We have \widetilde{lotsize}_i = 10^{-1} lotsize_i and:
\[
\log(\widetilde{lotsize}_i) = \log(10^{-1}) + \log(lotsize_i).
\]
Substituting, the slope coefficients are unchanged and only the intercept changes
(it increases by .400 × log(10) ≈ .92); the fitted values are unaffected.
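A quick R sketch (simulated data; names are illustrative) confirms that rescaling a logged regressor only shifts the intercept, leaving the slope and fitted values unchanged.

# Rescaling a logged regressor
set.seed(13)
n <- 100
lotsize_ft <- exp(rnorm(n, mean = 9))          # lot size in square feet
lprice     <- 7 + 0.4 * log(lotsize_ft) + rnorm(n, sd = 0.2)

lotsize_m <- lotsize_ft / 10                   # approximate square metres
coef(lm(lprice ~ log(lotsize_ft)))
coef(lm(lprice ~ log(lotsize_m)))              # same slope; intercept up by 0.4*log(10)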
\[
\frac{\hat{\beta}_{bedrooms} - \hat{\beta}_{bathrooms}}{se(\hat{\beta}_{bedrooms} - \hat{\beta}_{bathrooms})} \sim t_{n-5} \quad \text{under } H_0.
\]
In order to obtain this test statistic we would need to obtain an estimate
of Cov(β̂_{bedrooms}, β̂_{bathrooms}), which we shall call s_{23}. This is needed to
obtain the standard error of the difference, with:
\[
se(\hat{\beta}_{bedrooms} - \hat{\beta}_{bathrooms}) = \sqrt{se(\hat{\beta}_{bedrooms})^2 + se(\hat{\beta}_{bathrooms})^2 - 2 s_{23}}.
\]
This shows that the effect of bathrooms and bedrooms is equal when
α3 = 0. Our test therefore equals a test of H0 : α3 = 0, which forms part
of standard regression output when we estimate (2.2). This reveals the
advantage of reparameterisation for hypothesis testing.
We should interpret α3 as the additional (when positive) effect bedrooms
have over bathrooms in affecting the percentage increase in house prices.
Comment: It is important to recognise the usefulness of reparameterisation
for testing and interpretation purposes.
iv. The regression result is obtained by considering our model (2.2) and
imposing the restriction that α3 = 0. This is a restricted model as the
effect of bedrooms and bathrooms is assumed to be identical.
Under our CLM assumptions (MLR.1 to MLR.6) we can now use the F
statistic to test our hypothesis against the two-sided alternative
H1 : α3 6= 0. Our F statistic is given by:
\[
F = \frac{(SSR_r - SSR_{ur})/1}{SSR_{ur}/(n-5)} \sim F_{1,\, n-5} \quad \text{under } H_0
\]
where SSR r = 33.758 is the residual sum of squares of the restricted model
and SSR ur = 32.622 is the residual sum of squares of the unrestricted
model. We divide the numerator by 1 (single restriction) and the
denominator by the degrees of freedom of the unrestricted model (n − 5).
The realisation of our test statistic equals:
\[
\frac{(33.758 - 32.622)/1}{32.622/541} \approx 18.84.
\]
At the 1% level of significance the critical value equals 6.63, so we will
reject the null. The p-value will be smaller than 1%. Graph of the p-value
should clearly show area in the right tail of an F distribution with 1, n − 5
degrees of freedom; test statistic taking realisations in excess of 18.84.
(e) Here we would like to test the joint hypothesis H0 : βbedrooms = 0, βbathrooms = 0
and βairco = 0 against the alternative that at least one is non-zero. As in the
previous part we would perform an F test that determines whether the loss in
fit is significant. We will be testing 3 restrictions, and the restricted model we
need to estimate is given by:
Using the residual sum of squares of this model, SSR r , together with the
residual sum of squares from (2.1), we compute the realisation of our test
statistic:
\[
F = \frac{(SSR_r - SSR_{ur})/3}{SSR_{ur}/(n-5)} \sim F_{3,\, n-5} \quad \text{under } H_0
\]
We reject H0 at the 5% level of significance if the realisation of:
\[
\frac{(SSR_r - 32.622)/3}{32.622/541}
\]
is in excess of 2.60.
Graph should display the p-value as the area where a random variable with
distribution F3, n−5 is more extreme than 2.60.
Chapter 6
Multiple regression analysis: Further
issues and use of qualitative
information
6.1 Introduction
be able to conduct hypothesis testing using t tests and F tests and interpret the
results of such testing
estimate and interpret models with quadratic and/or interaction terms in practice
describe how multiple dummy variables can be used to create further more refined
categories in practice
know how the choice of the base category affects the interpretation of the
regression results (including the interpretation of parameters and t tests)
describe and perform the Chow test in two equivalent ways (and also two variations
of the Chow test, allowing or not allowing intercepts to differ)
Reading: Wooldridge, Sections 6.2b, 6.2c, 6.3a, 6.3c, 7.1–7.4 and Appendices A.4a, A.5.
Given the fitted quadratic model ŷ = β̂0 + β̂1 x + β̂2 x², we can differentiate and
conclude that the estimated slope of y with respect to x is β̂1 + 2β̂2 x. We can write
it in the form of an approximation for the change in y with respect to x:
\[
\frac{\Delta \hat{y}}{\Delta x} \approx \hat{\beta}_1 + 2\hat{\beta}_2 x.
\]
In order to provide richer interpretation of models with quadratics, there are several
general facts about the quadratic function β̂0 + β̂1 x + β̂2 x² that we need to keep in mind
(let's consider the case β̂2 ≠ 0, so the function is certainly not a linear one). They can
be summarised as follows:
If βb2 > 0, the function is convex and has one minimum point.
• Its graph is a parabola that opens upwards.
If βb2 < 0, the function is concave and has one maximum point.
• Its graph is a parabola that opens downwards.
The location of the optimum point can be found from the first-order condition:
\[
\hat{\beta}_1 + 2\hat{\beta}_2 x^* = 0 \;\Rightarrow\; x^* = -\frac{\hat{\beta}_1}{2\hat{\beta}_2}
\]
(sometimes x∗ is called the turning point).
Figures 6.1–6.4 show the graphs of the quadratic function βb0 + βb1 x + βb2 x2 in four
different cases determined by the signs of βb1 and βb2 .
For instance, in Case 1 the slope is initially positive, but it decreases as x increases. The
graph of the quadratic function has a hump shape and, as we know, turns at the value:
\[
x^* = -\frac{\hat{\beta}_1}{2\hat{\beta}_2}
\]
with the function reaching its maximum at x∗ . The slope is negative at the values of the
regressor greater than x∗ .
Similarly, in Case 2 the slope is initially negative, but it increases (gets less negative) as
x increases. The graph of the quadratic function has a U shape and, as we know, turns
at:
\[
x^* = -\frac{\hat{\beta}_1}{2\hat{\beta}_2}.
\]
However, now the function reaches the minimum at x∗ and for the values of the
regressor greater than x∗ , the slope is positive.
Cases 3 and 4 can be analysed analogously. In both these cases the turning point occurs
at a negative value of x. In Case 3 the slope (βb1 + 2βb2 x) increases with x and for x > 0
it is always positive. In Case 4 the slope (β̂1 + 2β̂2 x) is always negative for x > 0. For
instance, using WAGE1 data we can estimate a wage equation with a quadratic in
experience; the estimated coefficients on exper and exper² are .041 and −.000714,
respectively, with n = 526 and R² = .3003.
We can find that the quadratic turns at about .041/[2(.000714)] ≈ 28.7 years of
experience. In the sample, there are 121 people with at least 29 years of experience.
This is a fairly sizable fraction of the sample. Thus, there is a change in the slope
occurring in our model within the range of exper in our data.
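A short R sketch of fitting a quadratic and locating its turning point (simulated data; the coefficients merely echo the magnitudes quoted above):

# Quadratic in experience and its turning point
set.seed(21)
n     <- 526
exper <- runif(n, 0, 45)
lwage <- 0.5 + 0.041 * exper - 0.000714 * exper^2 + rnorm(n, sd = 0.4)

fit <- lm(lwage ~ exper + I(exper^2))
b1  <- coef(fit)["exper"]
b2  <- coef(fit)["I(exper^2)"]
-b1 / (2 * b2)                 # estimated turning point, near .041/(2 * .000714) = 28.7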
In models with quadratics, we can conduct various significance tests. If we want to test:
\[
H_0: \beta_{exper^2} = 0 \quad \text{against} \quad H_1: H_0 \text{ is not true}
\]
this can be conducted by a t test (under MLR.1 to MLR.6) using the t statistic on
exper². However, if we wanted to test:
\[
H_0: \beta_{exper} = 0, \; \beta_{exper^2} = 0 \quad \text{against} \quad H_1: H_0 \text{ is not true}
\]
we would need an F test of these two joint restrictions.
Activity 6.1 Take the MCQs related to this section on the VLE to test your
understanding.
Activity 6.2 The following equation is estimated by OLS using WAGE1 data, with
n = 526 and R² = .3003. The estimated return to education is 9.04%. This model
assumes this is the same for all years of education. Using the approximation:
\[
\widehat{\%\Delta wage} = 100(\hat{\beta}_2 + 2\hat{\beta}_3\, exper)\,\Delta exper
\]
find the approximate return to the fifth year of experience. What is the
approximate return to the twentieth year of experience?
Activity 6.3 This activity is based on data in HPRICE3 but only for the year
1981. The data are for houses that sold during 1981 in North Andover, MA. 1981
was the year construction began on a local garbage incinerator and our objective is
to study the effects of the incinerator location on housing price.
ii. A simple regression of log(price) on log(dist), where dist is the distance from the
house to the incinerator, is estimated with n = 142 and R² = .180. Interpret the results.
iii. To the simple regression model in part (ii), we add the variables log(intst),
log(area), log(land ), rooms, baths, and age, where intst is distance from the
home to the interstate, area is square footage of the house, land is the lot size in
square feet, rooms is total number of rooms, baths is number of bathrooms, and
age is age of the house in years.
The estimated equation is:
\[
\widehat{\log(price)} = 7.592 + .055\,\log(dist) - .039\,\log(intst) + .319\,\log(area) + \cdots
\]
with standard errors (.642), (.058), (.052) and (.077) respectively; n = 142, R² = .748.
Now, what do you conclude about the effects of the incinerator? Explain why (ii) and (iii) give conflicting results.

iv. To the regression in part (iii) we add the square of log(intst). The estimated equation ends with the terms:

· · · + .359 log(area) + .091 log(land) + .038 rooms + .150 baths − .003 age
        (.073)          (.037)          (.027)       (.040)       (.001)

n = 142, R² = .778.

Now what happens? What do you conclude about the importance of functional form?
This subchapter analyses regressions with interactions, which model situations where the partial effect of one regressor depends on the value of another regressor. For example, the partial effect of education could depend on the level of intelligence.
For instance, consider the model:
y = β0 + β1 x1 + β2 x2 + β3 x1 x2 + u (6.2.1)
which contains the interaction term x1 x2. Holding x2 (and u) fixed, the partial effect of x1 on y is:

∆y/∆x1 = β1 + β3 x2

so that the effect of x1 depends on x2 unless β3 = 0. Similarly, holding x1 (and u) fixed, the partial effect of x2 on y in the model with the interaction term is:

∆y/∆x2 = β2 + β3 x1.
In the model above, the null hypothesis H0: β3 = 0 says that the partial effects are constant. This hypothesis can be tested using a t test. If it cannot be rejected at conventional significance levels (such as 5%), it may not be worth complicating the model.
It is important to realise that it is easy to get confused in models with interactions.
From:

∆y/∆x1 = β1 + β3 x2

we see that β1 is now the partial effect of x1 on y when x2 = 0. But x2 = 0 may be very far from a legitimate, or at least interesting, part of the population. A similar comment holds for β2.
An important point for models with interactions (as well as quadratics), is that we need
to evaluate the partial effect at interesting values of the explanatory variables. Often,
zero is not an interesting value for an explanatory variable and is well outside the range
in the sample.
Often, two interesting parameters are the partial effects evaluated at the means of the other variables:

δ1 = β1 + β3 µ2
δ2 = β2 + β3 µ1

where µ1 and µ2 are the population means of x1 and x2. These parameters can be estimated by δ̂1 = β̂1 + β̂3 x̄2 and δ̂2 = β̂2 + β̂3 x̄1, where x̄2 and x̄1 are the sample averages and the β̂j s are the OLS estimates. We can also obtain confidence intervals for δ1 and δ2. However, it is often easier to let a regression compute δ̂1 and δ̂2, and to get appropriate standard errors. To achieve that, we can reparametrise the interaction effects and, instead of the original model (6.2.1), consider:

y = α0 + δ1 x1 + δ2 x2 + β3 (x1 − µ1)(x2 − µ2) + u.
The intercept changes, too, but this is not important; the coefficient on the interaction does not change. The advantage of such a reparametrisation is that standard errors for the partial effects at the mean values are readily available from the regression output. (In practice, the unknown population means µ1 and µ2 are replaced by the sample averages x̄1 and x̄2.)
To give an example, consider a wage equation estimated with the interaction educ · IQ using WAGE2 data (n = 935, R² = .1298).
Here the interaction term is not significant. Not only is it insignificant but it also
renders educ and IQ insignificant. Indeed, here is the estimated model without the interaction term:

log(wage)-hat = 5.658 + .039 educ + .006 IQ
               (.096)  (.007)     (.001)

n = 935, R² = .1297.
This has happened due to high correlation of the interaction term with its components,
which is supported by the following correlation table:
              educ     IQ       educ · IQ
educ          1        0.515    0.888
IQ            0.515    1        0.845
educ · IQ     0.888    0.845    1
We can estimate the wage equation with the reparameterised interaction term – that is, with the term (educ − mean(educ)) · (IQ − mean(IQ)) = (educ − 13.46845) · (IQ − 101.2824) – which again gives n = 935, R² = .1298.
We can see that once the reparameterisation is done, the estimates are much closer to
the regression without the interaction. For example, for the average IQ (about 101), the
return to education is about 3.8%. However, the interaction term is not at all
statistically significant, which implies that there is no evidence that variability in IQ as
such affects the return to education.
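To see what this reparametrisation looks like in practice, here is a hedged Python sketch using statsmodels. The data below are simulated stand-ins (an assumption, since WAGE2 itself is not reproduced here), but the centring step is exactly the one described above:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Stand-in data; with WAGE2 you would load the real educ, IQ and wage series.
    rng = np.random.default_rng(0)
    n = 935
    df = pd.DataFrame({"educ": rng.integers(9, 19, n),
                       "IQ": rng.normal(101, 15, n)})
    df["lwage"] = 5.66 + .039*df["educ"] + .006*df["IQ"] + rng.normal(0, .35, n)

    # Centre each regressor at its sample mean before interacting, so the
    # coefficients on educ and IQ are the partial effects at the means.
    df["educ_c"] = df["educ"] - df["educ"].mean()
    df["IQ_c"] = df["IQ"] - df["IQ"].mean()

    res = smf.ols("lwage ~ educ + IQ + educ_c:IQ_c", data=df).fit()
    print(res.summary())  # standard errors for the effects at the means are direct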
Activity 6.4 Take the MCQs related to this section on the VLE to test your
understanding.
The beginning of Section 6.3 (before 6.3a) gives several general recommendations about
using R2 .
Section 6.3a introduces the adjusted R-squared as another goodness of fit measure. To
motivate this new measure, we need to note how the R-squared changes when more
explanatory variables are added. It can be shown that R2 can never go down, and
usually increases, when one or more variables are added to a regression. This is because
additional variables almost always help to explain more variation in the dependent
variable.
To give this a more mathematical justification, recall that R² = 1 − SSR/SST. Imagine a model in which we start with one regressor x1 and then add another regressor x2. SST does not change, as it is calculated from the observations of the dependent variable, which remain unchanged. If OLS happens to choose the coefficient on x2 to be exactly zero, then the SSR is the same whether or not x2 is included in the regression. However, if OLS chooses any value other than zero, then it must be that this value reduced the SSR relative to the regression that excludes x2. In practice, an estimated coefficient will almost never be exactly zero, so in general the SSR decreases when a new regressor is added. This means that R² = 1 − SSR/SST generally increases (and can never decrease) when a new regressor is added.
The adjusted R-squared, also called ‘R-bar-squared’, is defined as:

R̄² = 1 − [SSR/(n − k − 1)]/[SST/(n − 1)] = 1 − (1 − R²)(n − 1)/(n − k − 1).

Note that SSR/(n − k − 1) = σ̂², where σ̂² is the usual estimator of the error variance.
R̄2 can increase or decrease when more regressors are added. Indeed, as more regressors
are added, SSR falls, but so does n − k − 1. R̄2 penalises adding regressors that have
weak explanatory power for the dependent variable (in this case we can expect R̄2 to
decrease). For k ≥ 1, R̄2 < R2 unless SSR = 0 (not an interesting case). In addition, it
is possible that R̄2 < 0, especially if n − k − 1 is small (recall that R2 ≥ 0 always).
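A minimal sketch of these two formulas in Python; the SSR, SST, n and k values plugged in at the end are illustrative, not from any dataset in the guide:

    # R-squared and adjusted R-squared from SSR, SST, n and k.
    def r2_and_r2bar(ssr, sst, n, k):
        r2 = 1 - ssr / sst
        r2bar = 1 - (ssr / (n - k - 1)) / (sst / (n - 1))
        return r2, r2bar

    print(r2_and_r2bar(ssr=80.0, sst=100.0, n=50, k=3))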
Section 6.3c discusses the dangers of overemphasising goodness of fit. Fixating on making R² or R̄² as large as possible can lead to silly mistakes and to a model that has no interesting ceteris paribus interpretation. It is very important to always
remember the ceteris paribus interpretation of a regression. Sometimes it does not make
sense to hold other factors fixed when studying the effect of a particular variable.
To give an illustration of the importance of keeping in mind the ceteris paribus
interpretation of the regression, consider the effects of spending on mathematics pass
rates (using MEAP93 data). The following linear regression model of math10
(percentage of students passing the MEAP mathematics test) on lexpend (logarithm of
expenditure per student, in $) and lnchprg (percentage of students eligible for free
school lunch programme – this is a measure of poverty) can help us study the effects of
total spending per student in a school on the percentage of students passing a
standardised mathematics test:
math10-hat = −20.3608 + 6.2297 lexpend − .3046 lnchprg
             (25.0729)  (2.9726)         (.0354)

n = 408, R² = .1799.
Re-estimating the model with additional controls for staffing and teacher salaries gives n = 408, R² = .1910.
As we can see, the coefficient on lexpend in this second regression is actually negative (but not statistically different from zero). Does it mean that spending is not important? No, we cannot draw this conclusion, because part of spending goes towards lowering student–teacher ratios and increasing salaries. Once we hold those fixed, the role for spending is obviously limited. What we seem to be finding in the second regression is that spending, other than that used to lower staff ratios or increase lsalary, has no effect on performance. But this does not mean that total spending has no effect!
To give an illustration of a model that has no interesting ceteris paribus interpretation,
consider the study of the effects of spending on mathematics pass rates using MEAP01
data. We start with a model for math4 (the grade 4 mathematics pass rate) and then add the grade 4 reading pass rate, read4, as an additional regressor. R̄² goes from .3603 to .7207, which is a huge increase and, thus, we may be tempted to
use the second regression. But why should we hold read 4 fixed when looking at the
effects of spending on math4? We want to allow spending to improve reading pass rates,
too. Thus, read 4 can be considered to be a bad control – it is itself an outcome of
lexpend and, thus, read 4 can be a dependent variable in its own regression.
The last regression example with read 4 as a regressor is what can be called
over-controlling for factors in multiple regression. It may be tempting to
over-control because often the goodness-of-fit statistics increase substantially or
sometimes this happens because researchers are afraid of omitting important
explanatory variables. It is important to avoid including any bad controls by
remembering the ceteris paribus nature of multiple regression.
The main takeaway here is that qualitative factors often come in the form of binary
information: a worker belongs to a union or does not, a firm offers a 401(k) pension plan
or it does not. Other examples of qualitative information include race, industry, region,
rating grade. In these examples the relevant information can be captured by defining a
binary variable (or dummy variable). Sometimes it is also called a zero-one
variable to emphasise the two values it takes on.
In defining a dummy variable, it is very important to decide which outcome is coded as
a zero and which is coded as a one. It is also recommended to choose the variable name
to be descriptive.
It is also useful to know that dummy variables can be used to create more refined
categorisations. For example, if we start with two pieces of qualitative information –
say, gender and marital status, as captured by female and married dummy variables –
we can define more than two categories. These would be married male (marrmale),
married female (marrfem), single male (singmale), and single female (singfem).
Consider, for example, the model:

wage = β0 + δ0 union + u.    (6.2.3)

Graphically, δ0 in (6.2.3) is the gap between two regression lines – one for union workers and the other for non-union ones. Usually δ0 is referred to as the difference in intercepts between the base group and the other group.
In the example above the base group is non-union workers and this is why β0 is the
intercept for non-union workers, and δ0 is the difference in intercepts between union and
non-union workers.
It is important to understand how the interpretation of coefficients changes if we decide
to choose a different base group. We could choose union workers as the base group by
writing the model as:
wage = α0 + γ0 non-union + u
where the intercept for union workers is α0 and the intercept for non-union workers is
α0 + γ0 . Because non-union = 1 − union, this implies that α0 = β0 + δ0 and α0 + γ0 = β0
(which also gives γ0 = −δ0 ). Thus, we essentially get the same answer if we set union
workers as the base group. The coefficient on the dummy variable changes sign but it
must remain the same magnitude. The intercept changes because now the base group is
union workers. It is important to keep track of which group is the base group.
In this topic you get your first exposure to the dummy variable trap problem. In our
example with wage and union/non-union variables, putting both union and non-union
in the model and also including the constant term would result in the dummy variable
trap. The dummy variable trap is a situation of perfect collinearity (and, thus, a
violation of assumption MLR.3). Indeed, in this case because of the presence of the
constant term in the regression equation, we would have the following linear
relationship between regressors 1, union and non-union:
union + non-union = 1.
Later subchapters consider situations when there are more than two dummy variables
representing multiple categories. If we have, say, c categories represented by dummy
variables, then to avoid the dummy variable trap we can only include (c − 1) dummy
variables in the regression alongside the intercept.
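The rank deficiency behind the dummy variable trap is easy to verify numerically. In this sketch (with a tiny made-up sample, matching the union example above) the design matrix containing an intercept plus both union and non-union dummies has deficient column rank, while dropping one dummy restores full rank:

    import numpy as np
    import pandas as pd

    # Two categories -> include only one dummy alongside the intercept.
    status = pd.Series(["union", "nonunion", "union", "nonunion"])
    dummies = pd.get_dummies(status, dtype=float)           # both columns
    X_trap = np.column_stack([np.ones(4), dummies])         # intercept + both dummies
    X_ok = np.column_stack([np.ones(4), dummies["union"]])  # drops the base group

    print(np.linalg.matrix_rank(X_trap), X_trap.shape[1])   # rank < number of columns
    print(np.linalg.matrix_rank(X_ok), X_ok.shape[1])       # full column rank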
Finally, with log(y) as the dependent variable, δ0 has a percentage change interpretation: 100 · δ0 is (approximately) the percentage difference in y between the two groups.
Activity 6.7 Take the MCQs related to this section on the VLE to test your
understanding.
Activity 6.8 Suppose we have a dataset where everyone identifies as either male or
female. The model we consider is:
wage = β0 + δ0 female + u
where female takes the value of 1 for women and takes the value of 0 otherwise.
Suppose we have the random sample {(wage i , female i )}ni=1 of size n.
Derive the following formulas for the OLS estimators in this model:

δ̂0 = wage_f − wage_m  and  β̂0 = wage_m

where wage_f and wage_m denote the sample average wages of women and of men, respectively.
Activity 6.9 The Bechdel test is a test (more of a minimum threshold) for whether
a film portrays women in a ‘normal/real life’ way. A film passes the Bechdel test if
there are at least two named female characters in the film who have a conversation
together that doesn’t pertain to a male character. (Whether this constitutes a good
‘test’ or not is definitely up for debate.)
(ii) The estimation of the equation from part (i) gives results with n = 1776, R² = .0098. Discuss what the estimation results indicate about audiences’ preferences for films.
(iii) Let us also include the logarithm of the budget of the film (adjusted for
inflation in 2013 dollars), in particular taking this to be the log(budget 13):
How do you interpret the coefficient on the budget? Discuss what happens to
the coefficient on pass test in comparison to part (ii). In particular, what does
this mean for sexism in the film industry?
(iv) Let δ̃0 denote the coefficient on pass test in the simple fitted regression:

log(intgross 13)-hat = β̃0 + δ̃0 pass test

and δ̂0 the coefficient on pass test in:

log(intgross 13)-hat = β̂0 + δ̂0 pass test + β̂1 log(budget 13).

Describe the algebraic relationship between δ̃0 and δ̂0. Use this algebraic relationship and the fact that log(budget 13) and pass test are negatively correlated to explain the differences in findings in parts (ii) and (iii).
Here you will learn how to allow for more than two groups. This can be accomplished
by defining a set of dummy variables.
Suppose in a dataset we have observations of wage and everyone identifies as either
male or female. In addition, there is a marital status variable which takes only two
values – married or single. We can define four exhaustive and mutually exclusive
groups. These are married males (marrmale), married females (marrfem), single males
(singmale), and single females (singfem).
We can allow each of the four groups to have a different intercept by choosing a base
group and then including dummies for the other three groups. If, for instance, we choose
single males as the base group, we include marrmale, marrfem, and singfem in the
regression. The coefficients on these variables measure differences relative to single men. With log(wage) as the dependent variable, these differences in intercepts have a percentage change interpretation. Consider, for instance:

log(wage) = β0 + δ1 marrmale + δ2 marrfem + δ3 singfem + β1 educ + u.
Then δ1 is the ‘marriage premium’ for men (holding education fixed). We can easily test H0 : δ1 = 0 vs. H1 : δ1 ≠ 0 (or H1 : δ1 > 0) directly from the regression output in a statistical package by using the t statistic associated with marrmale. The ‘marriage
premium’ for women is δ2 + β0 − (δ3 + β0 ) = δ2 − δ3 . We cannot test H0 : δ2 = δ3
directly from the regression output. We can usually run a special command in a
statistical package to test it. An alternative approach would be to choose, say, married
females as the base group and re-estimate the model (including the dummies marrmale,
singmale, and singfem) and use the t statistic for singfem.
In this example with four groups, no matter which group we choose as the base group,
we include only three of four dummy variables. If we include all four we fall into the
dummy variable trap.
You also need to know how dummy variables can be used in capturing information
given in the form of categories with a clear ordering of the categories.
Let us describe this in an example. Suppose that we would like to estimate the effect of
a person’s physical attractiveness on performance in the labour market. The BEAUTY
dataset includes rankings of physical attractiveness of each man or woman, on a scale of
1 to 5 with 5 being ‘strikingly beautiful or handsome’. The ‘looks’ variable is an ordinal
variable (other examples of ordinal variables include credit ratings, or ratings of
individual happiness): we know that order of outcomes conveys information (5 is better
than 4, and 2 is better than 1) but we do not know whether the difference between 5 and 4 is the same as the difference between 2 and 1. Including the ordinal ‘looks’ variable as we would include any other
explanatory variable would imply that as we move up the scale from 1 to 5, a one-unit
increase means the same amount of ‘beauty’. It might not make sense to assume that a
one-unit increase in ‘looks’ has a constant effect on the dependent variable.
An alternative approach is to define dummy variables. This approach is usually
preferable to including the ordinal variable itself as a regressor. For example, we can
create three dummy variables corresponding to categories of the looks scale (for instance, ‘below average’, ‘above average’, and ‘strikingly attractive’, with the remaining category serving as the base group). This is an illustration of how we can create multiple dummy variables to capture the ordinal information.
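A short pandas sketch of this construction. The cut-offs used here (looks ≤ 2 for below average, looks = 4 for above average, looks = 5 for strikingly attractive, with average looks as the base group) are an illustrative assumption, not necessarily the exact categories used with the BEAUTY data:

    import pandas as pd

    # Hypothetical cut-offs: 'looks' takes values 1 to 5; base category is average.
    looks = pd.Series([1, 2, 3, 4, 5, 3, 4])
    df = pd.DataFrame({
        "belavg": (looks <= 2).astype(int),    # below-average looks
        "abvavg": (looks == 4).astype(int),    # above-average looks
        "striking": (looks == 5).astype(int),  # strikingly attractive
    })
    print(df)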
Activity 6.10 Take the MCQs related to this section on the VLE to test your
understanding.
Dummy variables can be interacted with each other and can be interacted with
continuous variables. It is important to understand that these two cases lead to different
regression specifications with different interpretations.
When dummy variables are interacted with each other (Wooldridge, Section 7.4a), this
creates a more refined categorisation of the population. For example, suppose we have
dummy variables female and union. Consider a regression model that contains both
these dummy variables as well as an interaction term between female and union. This allows the union membership premium to depend on gender. For concreteness, take:

wage = β0 + β1 female + β2 union + β3 female · union + u.

Then β0 is the intercept for males who are not union members (plug in female = 0 and union = 0), β0 + β1 is the intercept for females who are not union members (plug in female = 1 and union = 0), β0 + β2 is the intercept for males who are union members (plug in female = 0 and union = 1) and, finally, β0 + β1 + β2 + β3 is the intercept for females who are union members (plug in female = 1 and union = 1). Equivalently, this model can be recast by defining dummy variables for three out of four groups and using the specifications discussed in Wooldridge, Section 7.3.
Situations of interacting dummy variables with continuous explanatory variables
(Wooldridge, Section 7.4b) will lead to regression models with different slopes, as well
as different intercepts. As an example, consider the model:

wage = β0 + δ0 union + β1 exper + δ1 union · exper + u

where we added the interaction term union · exper to the model in which union and exper
appear separately. Then the intercept for non-union members is β0 and that for union
members is β0 + δ0 . The slope for non-union members is β1 and that for union members
is β1 + δ1 .
The summary of parameter interpretations for non-union and union members is in the
following table:
                                   Intercept    Slope
non-union                          β0           β1
union                              β0 + δ0      β1 + δ1
Difference (union − non-union)     δ0           δ1
The difference between the two groups’ regression functions at a given level of experience is δ0 + δ1 exper. To test the null hypothesis that there is no difference at all between union and non-union members, we test:

H0 : δ0 = 0, δ1 = 0.
This H0 can be tested using the F test. This is a version of the Chow test (Wooldridge,
Section 7.4c) for the equality of regression functions across two groups. We test joint
significance of the dummy variable defining the groups as well as the interaction terms.
In a general model with k explanatory variables and an intercept, suppose we have two
groups. If we would like to test whether the intercept and all slopes are the same across
the two groups, we can define a dummy variable, say w, indicating one of the two
groups. Then in the model:
y = β0 + β1 x1 + β2 x2 + · · · + βk xk
+ δ0 w + δ1 w · x1 + δ2 w · x2 + · · · + δk w · xk + u
we want to test:
H0 : δ0 = 0, δ1 = 0, δ2 = 0, . . . , δk = 0
for k + 1 restrictions. It can be done using a standard F test of the k + 1 exclusion restrictions (that is, using the F distribution with (k + 1, n − 2(k + 1)) degrees of freedom):
1. Pool the data and estimate a single regression. This is the restricted model, and
produces the restricted SSR. Call this the pooled SSR, SSR P .
If we want to test to the equality of the intercepts and all the slopes, then SSR P
comes from the estimation of the following model:
y = β0 + β1 x1 + β2 x2 + · · · + βk xk + u.
This step depends on the null hypothesis H0 we are testing as the restricted model
we estimate in this step depends on our H0 .
2. Estimate the regressions on the two groups (say, 1 and 2) separately in the usual
way:
y = β10 + β11 x1 + β12 x2 + · · · + β1k xk + u1
for group 1 and:
y = β20 + β21 x1 + β22 x2 + · · · + β2k xk + u2
for group 2. Get the SSRs, SSR 1 and SSR 2 . The unrestricted SSR is
SSR 1 + SSR 2 (and this is the same as the regression that includes the full set of
interactions).
This step does not depend on the null hypothesis H0 we are testing as its outcome is
the unrestricted SSR.
Let n denote the sample size in the pooled regression where we use both groups. The F statistic for the null hypothesis of the equality of the intercepts and all the slopes is:

F = {[SSR_P − (SSR_1 + SSR_2)]/(k + 1)} / {(SSR_1 + SSR_2)/[n − 2(k + 1)]}

and has the F distribution with (k + 1, n − 2(k + 1)) degrees of freedom under H0.
If we leave the intercepts unrestricted under H0 , then SSR P is obtained from the pooled
regression but with the dummy variable added (SSR 1 and SSR 2 obtained from
regressions for each group remain the same as those regressions do not change):
y = β0 + δ0 w + β1 x1 + β2 x2 + · · · + βk xk + u.
The k + 1 in the numerator becomes k, and we use the F distribution with (k, n − 2(k + 1)) degrees of freedom.
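The two-step computation above reduces to a few lines of code once the three SSRs are available. A sketch using scipy for the p-value; the numbers plugged in at the end are from the teaching-evaluations exercise in Section 6.5, where k = 5:

    from scipy.stats import f as f_dist

    def chow_test(ssr_pooled, ssr_1, ssr_2, n, k):
        """Chow F statistic for equal intercept and all slopes across two groups."""
        ssr_ur = ssr_1 + ssr_2            # unrestricted SSR from the two groups
        q = k + 1                         # number of restrictions
        df_denom = n - 2 * (k + 1)
        F = ((ssr_pooled - ssr_ur) / q) / (ssr_ur / df_denom)
        p = f_dist.sf(F, q, df_denom)
        return F, p

    # SSRs from the pooled, women-only and men-only regressions:
    print(chow_test(129.3731, 51.3907, 66.8034, n=463, k=5))  # F is about 7.11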
Activity 6.12 Take the MCQs related to this section on the VLE to test your
understanding.
Activity 6.14 This exercise is based on SLEEP75 data. The equation of interest is:
sleep = β0 + β1 totwrk + β2 educ + β3 age + β4 age² + β5 yngkid + u

where sleep is minutes of sleep per week, totwrk is minutes worked per week, educ and age are measured in years, and yngkid is a dummy variable equal to one if a young child is present in the household.
(i) Estimation of the regression equation separately for men and women gives the
following results:
for men:

sleep-hat = 3,648.2 − .182 totwrk − 13.05 educ + 7.16 age − .0448 age² + 60.38 yngkid
           (310.0)   (.024)        (7.41)       (14.32)    (.1684)      (9.02)

n = 400, R² = .156

for women:

sleep-hat = 4,238.7 − .140 totwrk − 10.21 educ − 30.36 age + .368 age² − 118.28 yngkid
           (384.9)   (.028)        (9.59)       (18.53)     (.223)      (93.19)

n = 306, R² = .098.
(iii) Now we allow for different intercepts for males and females and determine whether the interaction terms involving male are jointly significant. In other words, in the model that adds the dummy male and its interactions with totwrk, educ, age, age² and yngkid (with coefficients δ1, . . . , δ5 on the interaction terms), we test:

H0 : δ1 = δ2 = δ3 = δ4 = δ5 = 0.
(iv) Given the results from parts (ii) and (iii), what would be your final model?
6.3 Answers to activities

(i) We want to test:

H0: βexper² = 0
H1: βexper² ≠ 0.

(Essentially we are testing the null that the equation is linear in exper against the alternative that it is quadratic.) To answer the question affirmatively, we want to reject H0 at the 1% level.

The t statistic on exper² is:

t = β̂exper²/se(β̂exper²) = −.000714/.000116 ≈ −6.16.

Using the 1% level of significance, our critical value equals t_df, .005 = 2.576 (taking into account that the alternative is two-sided). The absolute value of our test statistic exceeds t_df, .005, so we reject the null hypothesis and conclude that exper² is statistically significant at the 1% level.
(ii) To estimate the return to the fifth year of experience, we start at exper = 4 and increase exper by one, so ∆exper = 1:

%∆wage-hat ≈ 100[.0410 − 2(.000714)(4)] ≈ 3.53%.

Similarly, for the twentieth year of experience we start at exper = 19:

%∆wage-hat ≈ 100[.0410 − 2(.000714)(19)] ≈ 1.39%.
(i) We expect β1 ≥ 0: all other relevant factors equal, it is better to have a home
farther away from the incinerator.
(ii) A 1% increase in distance from the incinerator is associated with a predicted price
that is about .37% higher.
(iii) The coefficient on log(dist) becomes about .055 (se ≈ .058). The effect is much
smaller now, and statistically insignificant. This is because we have explicitly
controlled for several other factors that determine the quality of a home (such as its
size and number of baths) and its location (distance to the interstate). This is
consistent with the hypothesis that the incinerator was located near less desirable
homes to begin with.
(iv) The coefficient on log(dist) is now very statistically significant, with a t statistic of
about three. The coefficients on log(intst) and [log(intst)]2 are both very
statistically significant, each with t statistics above four in absolute value. Just
adding [log(intst)]2 has had a very big effect on the coefficient important for policy
purposes. This means that distance from the incinerator and distance from the
interstate are correlated in some nonlinear way that also affects housing price.
We can find the value of log(intst) where the effect on log(price) actually becomes
negative: 2.073/[2(.1193)] = 8.69. When we exponentiate this we obtain about
5,943 feet from the interstate. Therefore, it is best to have your home away from
the interstate for distances less than just over a mile.
Dividing both sides by ∆educ gives the result. The sign of β2 is not obvious,
although β2 > 0 if we think a child gets more out of another year of education the
more highly educated are the child’s parents.
(ii) We use the values pareduc = 32 and pareduc = 24 to interpret the coefficient on
educ · pareduc. The difference in the estimated return to education is
.00078(32 − 24) = .0062, or about .62 percentage points.
(iii) When we add pareduc by itself, the coefficient on the interaction term is negative.
The t statistic on educ · pareduc is about −1.33, which is not significant at the 10%
level against a two-sided alternative (t716,.05 = 1.645). Note that the coefficient on
pareduc is significant at the 5% level against a two-sided alternative
(t716,.025 = 1.96). This provides a good example of how omitting a level effect
(pareduc in this case) can lead to biased estimation of the interaction effect. In the
hypothesis testing we assume MLR.1–MLR.6.
(i) The answer is not entirely obvious, but one must properly interpret the coefficient
on alcohol in either case. If we include attend , then we are measuring the effect of
alcohol consumption on college GPA, holding attendance fixed. Because attendance
is likely to be an important mechanism through which drinking affects
performance, we probably do not want to hold it fixed in the analysis. In other
words, attend can be considered a bad control as it’s a result of alcohol .
(ii) We would want to include SAT and hsGPA as controls, as these measure student
abilities and motivation. Drinking behaviour in college could be correlated with
one’s performance in high school and on standardised tests. Other factors, such as
family background, would also be good controls.
Let nf denote the number of women in the sample. Then:

nf = Σ_{i=1}^n female_i = Σ_{i=1}^n female_i²

(using that female_i² = female_i due to the binary nature of this variable).
Then the sample average of female is:

female-bar = nf/n

and:

wage_f = (1/nf) Σ_{i=1}^n female_i · wage_i

wage_m = (1/(n − nf)) Σ_{i=1}^n (1 − female_i) · wage_i

where wage_f is the average wage of women in the sample and wage_m is the average wage of men in the sample. Also note that:

wage-bar = (nf/n) · wage_f + (1 − nf/n) · wage_m

where wage-bar is the overall sample average wage.
(ii) The estimation results indicate that films which pass the Bechdel test have an
international gross that is approximately 35% lower than films which don’t pass it!
This seems to suggest that audiences only want to watch films where women have
no meaningful roles.
(iii) The coefficient on budget indicates that increasing the budget by 1% increases the international gross by about 0.86%. So there are slightly decreasing returns to investment in the film.
The coefficient on passing the test is now insignificant. So, actually audiences don’t
seem to care about how women are portrayed in films.
Sexism seems to enter the film industry through the budget of the film. Producers
appear to put more money into films which do not pass the test. These films are
then of a better quality (because of a greater budget) and audiences prefer these
films. So it is not the absence of women which causes audiences to enjoy a film; it is
the amount of money put into the film.
(i) The approximate difference is just the coefficient on utility times 100, or −28.3%.
The t statistic is t = −.283/.099 ≈ −2.86, which is statistically significant at the
1% level since |t| > t203, 0.005 = 2.576. (Implicitly we use the two-sided alternative
and, of course, assume MLR.1 to MLR.6.)
(iii) The proportionate difference is .181 − .158 = .023, or about 2.3%. One equation that can be estimated to obtain the standard error of this difference takes finance as the base group and includes dummies for the other industries, among them consprod (consumer products) and trans, where trans is a dummy variable for the transportation industry. With finance as the base group, the coefficient δ1 on consprod directly measures the difference between the consumer products and finance industries, and we can use the t statistic on consprod.
Then 100 · β1 is the approximate percentage change in wage when marijuana usage increases by one time per month.
log(wage) = β0 + β1 usage + β2 educ + β3 exper + β4 exper² + β5 female + β6 female · usage + u.
The null hypothesis that the effect of marijuana usage does not differ by gender is
H0 : β6 = 0.
(iii) We take the base group to be nonuser. Then we need dummy variables for the
other three groups: lghtuser , moduser , and hvyuser . Assuming no interactive effect
with gender, the model would be:
(v) The error term could contain factors, such as family background (including
parental history of drug abuse), that could directly affect wages and also be
correlated with marijuana usage. We are interested in the effects of a person’s drug
usage on his or her wage, so we would like to hold other confounding factors fixed.
We could try to collect data on relevant background information.
(i) There are certainly notable differences in the point estimates. For example, having
a young child in the household leads to less sleep for women (about two hours per
week) while men are estimated to sleep about an hour more. The quadratic in age
is a hump shape for men: however, the turning point is age ∗ ≈ 79.92, so the
regression line for men is increasing in age for all of the sample values. The
quadratic in age is a U-shape for women: the turning point is age ∗ ≈ 41.25, so the
regression line for women is decreasing up to that point and is increasing after it.
The intercepts for men and women are also notably different.
6.4 Overview of chapter

This chapter discusses models with quadratic and interaction terms, the adjusted R-squared, and the use of dummy variables to capture qualitative information, including how differences across groups can be examined with F and t tests. It is analysed and illustrated how ordinal information in data can be captured by means of dummy variables.
Finally, the chapter considers models where the set of regressors includes dummy
variables interacted either with each other or with continuous variables. There are substantial differences between these two situations, and it is explained why the latter case allows for a difference in slopes among different groups.
Finally, it is described how to test the null hypothesis that two populations or groups
follow the same regression function, against the alternative that one or more of the
slopes differ across the groups (Chow test).
6.4.1 Key terms and concepts

Turning point
Interaction term
Interaction effect
Centering of variables
Adjusted R-squared
Dummy variable
Base group
Benchmark group
Difference in intercepts
Ordinal variable
Difference in slopes
Chow test
6.4.2 A reminder of your learning outcomes

By the end of this chapter, you should:

be able to conduct hypothesis testing using t tests and F tests and interpret the
results of such testing
estimate and interpret models with quadratic and/or interaction terms in practice
describe how multiple dummy variables can be used to create more refined categories in practice
know how the choice of the base category affects the interpretation of the
regression results (including the interpretation of parameters and t tests)
describe and perform the Chow test in two equivalent ways (and also the two variations of the Chow test – allowing or not allowing the intercepts to differ)
6.1E Consider the model:

earnings = β0 + β1 educ + β2 exper + β3 exper² + u

where earnings denotes the hourly earnings of an individual and educ and exper
denote the years of schooling and experience, respectively. We assume we have
obtained a random sample {(earnings i , educ i ,exper i )}ni=1 from the population. The
errors {ui }ni=1 are i.i.d. normal random variables with zero mean and variance σ 2 .
We assume independence between the errors and regressors.
(a) Discuss the rationale for including both exper and exper 2 in this model. In
your answer be explicit about the expected signs for β2 and β3 .
(b) Provide a clear interpretation of the parameter β1 .
6.2E Let us consider the following cross-sectional model for household consumption:
Ci = β0 + β1 Yi + β2 Yi2 + ui
6.3E This question looks at the determinants of extramarital affairs using the AFFAIRS
dataset. It has 601 observations and 19 variables. The outcome we will work with is
the variable ‘naffairs’, which lists the number of affairs the individual had in the
past year. While the dataset lists many possible determinants (regressors of
interests) of this outcome, we will here focus on three, namely how individuals rate
their marriage (‘ratemarr ’), how religious they are (‘relig’), and how many years
they have been married (‘yrsmarr ’). ‘ratemarr ’ is a variable from 1 to 5, where
5 = very happy, 4 = somewhat happy, 3 = average, 2 = somewhat unhappy, and
1 = very unhappy. ‘relig’ is a variable from 1 to 5 with 5 = very religious,
4 = somewhat religious, 3 = slightly religious, 2 = not at all religious, and 1 = anti
religion. Every individual in the dataset identifies either as a male or a female (please see the comment on the use of gender variables in the Course Overview).
(a) Using only observations for women, we estimate the following regression
equation:
Interpret the results of this estimation. Do you think any effects you found
here can be interpreted causally?
(b) Do you think it makes sense to include the square of yrsmarr as an additional
regressor of interest? Discuss. If so, what sign would you expect for the
quadratic term of yrsmarr ?
(c) Including the square of yrsmarr into the regression in (a), we obtain the
following results:
What conclusion do you draw based on this output? Additionally, what do you
make of the fact that the usual R2 increased but the adjusted R-squared
decreased compared to (a)?
(d) Now we return to the regression in (a) and consider including an interaction
between yrsmarr and ratemarr. Discuss how we would interpret such an
interaction term, were we to include it in our model in (a).
(e) Estimation of the model in (d) gives the following results. For women:

naffairs-hat = 2.057 − .018 ratemarr − .515 relig + .392 yrsmarr − .076 yrsmarr × ratemarr
              (1.394)  (.304)          (.157)       (.117)         (.028)

and for men:

naffairs-hat = 4.575 − .065 ratemarr − .478 relig + .154 yrsmarr − .011 yrsmarr × ratemarr
              (1.451)  (.341)          (.158)       (.132)         (.032)
What do you make of these results? Are your conclusions for men similar to
those for women? Discuss.
6.4E Suppose a researcher has wage data on randomly selected workers and each worker
in the dataset identifies either as a male or a female. The researcher uses the data
to estimate the OLS regression:
wage-hat = 10.73 + 1.78 male
           (.16)    (.29)
n = 440, R2 = .09
where wage is measured in dollars per hour and male is a binary variable that is
equal to 1 if the person is a male and 0 if the person is a female. Define the
wage–gender gap as the difference in mean earnings between men and women.
(a) In the sample, what is the mean wage of women? What is the mean wage of
men?
(b) What is the estimated gender gap?
(c) Is the estimated gender gap significantly different from 0 at the 5% significance
level?
(d) Construct a 95% confidence interval for the gender gap.
(e) Another researcher suggests including male 2 along with the dummy variable
male. Explain whether this suggestion makes sense.
(g) Suppose the researcher estimates the model:

wage = β0 + β1 sinfemale + β2 marrfemale + β3 marrmale + u

where sinfemale is the dummy variable for single women, marrfemale is the dummy variable for married women, and marrmale is the dummy variable for married men (so single men form the base group). What is the interpretation of the coefficient β1 in this model? What is the interpretation of β2 − β1 in this model?
(h) Another researcher uses these same data but regresses wage on female. What are the regression estimates calculated from this regression? In other words, indicate α̂0, α̂1 and R² in the fitted regression:

wage-hat = α̂0 + α̂1 female

n = 440, R² = . . .
6.5E The OLS output in Table 6.1 is produced using a dataset on teaching evaluations
for university professors. fem age is the interaction between the variables female
and age, beauty2 to beauty4 are dummy variables for the second, third and fourth
quartile in the distribution of beauty ratings, and onecredit refers to one credit
(short) courses. The total number of observations is 463, of which 195 are women.
Suppose that each professor in the dataset identifies either as a male or a female.
female       .5949   (.2731)
age          .0042   (.0034)
fem age     −.0169   (.0057)
beauty2      .2320   (.0738)
beauty3      .1831   (.0748)
beauty4      .3407   (.0737)
onecredit    .5096   (.1087)
constant    3.6421   (.1964)
R²           .1361
Table 6.1: The dependent variable is course eval (course evaluations). Standard
errors are in parentheses.
(d) What would happen if you added the interaction between variables male and
age to the regression (where male = 1 − female)?
(e) How would you test whether the effect of age for females is significantly
different from 0? Can you conduct this test using the information in the
output in Table 6.1?
(f) Test whether the effect of age is different for males and females.
(g) Suppose you would like to test whether the intercept and all slopes are the
same across men and women. To do that we estimate the model:
three times. First, we estimate it on the whole data set and obtain
SSR = 129.3731. Then, we estimate it on the subset of women only and obtain
SSR = 51.3907. Finally, we estimate it on the subset of men only and obtain
SSR = 66.8034.
Formulate the null hypothesis for your test, calculate the test statistic for your
test and discuss whether or not you reject your null hypothesis at the 1% level.
6.1E (a) With β2 positive and β3 negative, this permits diminishing returns to experience. While earnings increase with an additional year of experience, the effect is smaller for individuals with more experience (as long as β2 + 2β3 exper > 0; after that point, additional years of experience reduce earnings). For other values of the βj s a different explanation should be given.
(b) β1 denotes the effect one additional year of schooling has on the average hourly
earnings of an individual holding everything else constant.
6.2E (a) The parameter β1 remains unchanged, but β0 and β2 need to be rescaled. Let C̃ = C/1000 and Ỹ = Y/1000 denote our new variables, and substitute them:

1000 C̃ = β0 + β1 (1000 Ỹ) + β2 (1000 Ỹ)² + u

C̃ = β0/1000 + β1 Ỹ + 1000 β2 Ỹ² + u/1000.
(b) If we omit Yi² and run a simple regression of Ci on Yi, we will get the classical OVB problem. We need to consider the estimator given by:

β̃1 = Σ_{i=1}^n (Yi − Ȳ)(Ci − C̄) / Σ_{i=1}^n (Yi − Ȳ)².
As we expect that the sample covariance between Yi and Yi² is positive and β2 < 0, the bias will be negative.
6.3E (a) The coefficient on ratemarr tells us that women who rate their marriage as
being happier tend to have fewer affairs. Specifically, a one-unit increase in
ratemarr is associated with a decrease of .678 affairs. The coefficient is
statistically significantly different from zero at the 1% level (under MLR.1 to
MLR.6).
The coefficient on relig tells us that more religious women tend to have fewer
affairs. Specifically, a one-unit increase in relig is associated with a decrease of
.519 affairs. The coefficient is statistically significantly different from zero at
the 1% level (under MLR.1 to MLR.6).
The coefficient on yrsmarr tells us that women who are married for longer
tend to have more affairs compared to women who are married less long.
Specifically, being married for one more year is associated with an increase of
.089 affairs. The coefficient is statistically significant at the 1% level (under
MLR.1 to MLR.6).
(b) If we were to include the square of yrsmarr , we would presume that the
coefficient on this new regressor is negative. Why is that? Well, (a) above tells
us that there is a positive correlation between the length of the marriage and
the number of affairs a woman has. However, note that a marriage that lasts a long time tends to involve people who are well matched (in that they otherwise may have gotten divorced). We may think that such couples are
less likely to cheat on one another. Thus, we would expect a negative sign on
the square of yrsmarr : the longer a couple is married, we expect a decreasing
positive effect on the number of affairs someone has. (Indeed, it seems possible
that at some point the effect even becomes negative.) It thus may indeed make
sense to include this square in the regression.
(c) Notice that the intercept and the coefficients on ratemarr and relig are
basically unchanged, in terms of magnitude, sign, and significance. The
interpretation of these is therefore analogous to (a).
The coefficient on yrsmarrsq is indeed negative, as hypothesised in (b). However, notice that yrsmarr and yrsmarrsq are both no longer significant. Also notice that
the effect of yrsmarrsq is incredibly small. Recall what this means: the
coefficient on yrsmarrsq is the ‘amount’ by which the positive effect of
yrsmarr decreases over time. Since this decrease is basically zero, the data is
telling us that the hypothesised effect in (b) may be irrelevant, at least in this
dataset. Hence, we may be better off dropping it.
R-squared cannot decrease when we add an additional regressor, so its increase
is expected. However, the fact that the adjusted R-squared decreased means
that yrsmarrsq does not do much in terms of explaining variation in the
dependent variable.
(d) When including an interaction between yrsmarr and ratemarr we would
expect a negative sign in front of this coefficient. Why? We may think that an
individual in a longer marriage who is also happier may cheat less often on
their spouse because they are happier. Hence, this interaction should have a
negative effect in that it should reduce the number of times an individual
cheats on his or her spouse.
(e) Note first that the coefficient on relig is still pretty much unchanged in terms
of sign, magnitude, and significance. Thus, the interpretation here is analogous
to the above.
The interaction term of yrsmarr and ratemarr does indeed have a negative
coefficient and is statistically significant at the 1% level (under MLR.1 to
MLR.6). The coefficient on yrsmarr is still positive (but much larger) and also
statistically significant at the 1% level (under MLR.1 to MLR.6). The
coefficient on ratemarr is, while still negative, no longer significantly different
from zero. What does this tell us?
This tells us that for women (recall, we are only running this regression for
women at the moment), how happy they are in their marriage is only a
determinant for the number of affairs they have when interacted with the
length of the marriage, but not by itself. More intuitively, a woman who has
been in a marriage longer (compared to a woman who has been in a marriage
less long) tends to have more affairs. This (positive) effect, however, is
diminished for women who are happy in their marriage. Another way of
putting this is that women who are unhappy in their marriage only tend to
have more affairs as time goes by.
(f) For the output using only the subsample of men, we see that all the
coefficients have the same sign as for women. However, with the exception of
relig (and the intercept) all of them are statistically insignificant (under
MLR.1 to MLR.6). This tells us that the effect of the interaction term on
women we hypothesised about may not be prevalent in our data for men.
Indeed, if we run the regression on men without the interaction (as reported),
we find coefficients on all three right-hand side variables that are (highly)
statistically significant (under MLR.1 to MLR.6). What does this tell us? It
suggests that the nuanced effect we found for women (i.e. that how happy they
are in their marriage only matters when interacted with yrsmarr ) doesn’t exist
for men. By running a regression without the interaction, we see that men who
are happy in their marriage or who are highly religious tend to cheat less and
men who have been in their marriage longer tend to cheat more. But, at least
judging from this one interaction term we included, the data for men does not
exhibit the heterogeneity that the female sample did. To put it differently, a
man in an unhappy marriage tends to cheat more often on his partner (no
interaction term!), whereas an unhappy woman tends to cheat more often on
her partner only as time goes by (interaction term!).
6.4E (a) The average wage for women is $10.73/hour. The average wage for men is
$12.51/hour.
(b) The estimated gender gap equals $1.78/hour.
(c) The hypothesis test for the gender gap is H0 : β1 = 0 vs. H1 : β1 ≠ 0. With a t statistic of:

t = 1.78/0.29 ≈ 6.14

we can reject the null hypothesis that there is no gender gap at the 5% significance level.
(d) The 95% confidence interval for the gender gap β1 is [1.78 − 1.96 × 0.29, 1.78 + 1.96 × 0.29], that is, [1.2116, 2.3484].
(e) male 2 coincides with male and, thus, adding male 2 creates the situation of
perfect collinearity, so it is not a reasonable suggestion.
(f) It creates the dummy variable trap and the situation of perfect collinearity, so
it is not a reasonable suggestion.
(g) β1 is the gender premium/discrimination for single people. β1 is the gap in the
average wages between single men and single women. β2 − β1 is the marriage
premium for women – the difference in the average wages for married women
and single women.
(h) The binary variable regression model relating wages to gender can be written
as either:
wage = β0 + β1 male + u
or:
wage = α0 + α1 female + v.
In the first regression equation, male equals 1 for men and 0 for women; β0 is
the population mean of wages for women and β0 + β1 is the population mean
of wages for men. In the second regression equation, female equals 1 for
women and 0 for men; α0 is the population mean of wages for men and α0 + α1
is the population mean of wages for women. We have the following relationship
for the coefficients in the two regression equations:
α0 = β0 + β1
β0 = α0 + α1

so that:

α̂0 = β̂0 + β̂1 = 12.51
α̂1 = β̂0 − α̂0 = −β̂1 = −1.78.
The fitted regression is therefore:

wage-hat = 12.51 − 1.78 female, n = 440, R² = .09.
6.5E (a) The t statistic for beauty2 is about 3.143. Since |3.143| > 2.576, we conclude
that beauty2 is significantly different from 0 at the 1% level (under
MLR.1–MLR.6).
(b) The coefficient on age denotes the average difference in the course evaluation
for a male professor if he is one year older, holding everything else constant.
(c) We expect the course evaluation to be lower by 0.012696: for a female professor, the estimated effect of being one year older is β̂age + β̂fem age = .0042 − .0169 < 0.

(e) We want to test:

H0 : βage + βfem age = 0.

The t statistic is (β̂age + β̂fem age)/se(β̂age + β̂fem age), where se(β̂age + β̂fem age) = [se(β̂age)² + se(β̂fem age)² + 2s12]^(1/2) and s12 is the estimator of Cov(β̂age, β̂fem age). We cannot compute the t statistic just from the output in Table 6.1 as we need s12 to compute the t statistic.
(f) We want to test the null hypothesis:

H0 : βfem age = 0.

From Table 6.1 the t statistic is −.0169/.0057 ≈ −2.96, so the effect of age is significantly different for males and females even at the 1% level (|−2.96| > 2.576).
we want to test:

H0 : δ0 = 0, δ1 = 0, δ2 = 0, . . . , δ5 = 0

for 6 restrictions against the alternative which is the negation of the null. The Chow statistic is:

F = {[SSR_P − (SSR_f + SSR_m)]/6} / {(SSR_f + SSR_m)/[n − 2(k + 1)]}

where SSR_P is obtained on the whole data set, SSR_f is obtained on the subset of women only, and SSR_m is obtained on the subset of men only.
The realisation of our Chow test statistic is:

{[129.3731 − (51.3907 + 66.8034)]/6} / {(51.3907 + 66.8034)/451} = 7.1094

and it exceeds the critical value even at the 1% level of significance (2.80), leading us to reject the null hypothesis at the 1% level of significance.
Chapter 7
Multiple regression analysis:
Asymptotics
7.1 Introduction
This chapter studies the large-sample properties of the OLS estimators, that is, how they behave as the sample size increases. Consistency is especially useful in cases where finite-sample properties like unbiasedness do not hold, and we do discuss situations when OLS estimators are consistent but biased.
Section 5.2 in Wooldridge then presents the asymptotic distribution of the normalised
OLS estimator. This asymptotic distribution is standard normal, which allows us to
conclude that in large samples a researcher is able to conduct approximate inference
using confidence intervals and t and F statistics even when removing the assumption of
the normal distribution of regression errors (or equivalently, removing the assumption of
the normal distribution of the dependent variable conditional on the explanatory
variables).
Section 5.3 in Wooldridge discusses under what conditions OLS is asymptotically
efficient.
7.2.1 Consistency
It is recommended that you start by reviewing Appendix C.3a about the notion of the
consistency of an estimator. In a nutshell, consistency means that the probability that
the estimate is arbitrarily close to the true population value can be made arbitrarily
close to 1 by increasing the sample size. It can be said that consistency is a minimum
requirement for sensible estimators. It is important for us to know that consistency is especially useful in cases where finite-sample properties like unbiasedness do not hold.
What does it mean if neither unbiasedness nor consistency holds? It means the
estimator is biased even in large samples and, hence, cannot tell us much about the
population parameter.
This section analyses whether OLS estimators are consistent or, more precisely, what
assumptions are required to guarantee the consistency of OLS estimators. It establishes
that under Assumptions MLR.1 through MLR.4, the OLS estimator βbj is consistent for
βj , for all j = 0, . . . , k, as n → ∞. Thus, as we get more data, we can expect the
sampling distribution of βbj to become more tightly centred around βj . Intuitively, this
result also says that the sampling variance Var(βbj | X) shrinks to zero as n → ∞.
We can illustrate the consistency property of the OLS estimators in simulations. We
simulate data from the model:
y = 5 − 3x1 + 2x2 + u
where u has mean 0 and is independent of x1 and x2 (there are some distributions used
for x1 , x2 , and u with these properties). We consider sample sizes n = 100, n = 1,000,
and n = 10,000. The results of simulations are presented in Figures 7.1 to 7.9. As we
can see from these figures, the distribution of β̂j becomes more and more concentrated around βj, for j = 0, 1, 2, as n increases. This is exactly what we expect from the
consistency property.
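A sketch of such a consistency simulation in Python. The guide does not state which distributions were used for x1, x2 and u, so the ones below are assumptions that satisfy the stated requirements (u has mean zero and is independent of x1 and x2):

    import numpy as np

    rng = np.random.default_rng(42)

    def slope_estimates(n, reps=2000):
        """OLS estimates of beta1 in y = 5 - 3*x1 + 2*x2 + u across replications."""
        b1 = np.empty(reps)
        for r in range(reps):
            x1 = rng.normal(0, 1, n)
            x2 = rng.normal(0, 1, n)
            u = rng.uniform(-3, 3, n)          # mean zero, independent of x1, x2
            y = 5 - 3 * x1 + 2 * x2 + u
            X = np.column_stack([np.ones(n), x1, x2])
            b1[r] = np.linalg.lstsq(X, y, rcond=None)[0][1]
        return b1

    for n in (100, 1000, 10000):
        print(n, slope_estimates(n).std())     # spread shrinks as n grows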
The proof of consistency of the OLS slope estimator in the bivariate model presented in
Wooldridge, Section 5.1 uses LLN (a general case would use LLN as well). The proof
also makes it clear that in the bivariate model consistency of the slope estimator would
hold if we required just uncorrelatedness between the regressor x and the error u. This
extends to a general case and it turns out that the OLS estimator is consistent under Assumptions MLR.1 through MLR.3 and MLR.4′, where Assumption MLR.4′ requires that the error has zero mean and is uncorrelated with each regressor: E(u) = 0 and Cov(xj, u) = 0 for j = 1, . . . , k.
Activity 7.1 Take the MCQs related to this section on the VLE to test your
understanding.
It is recommended that you start by reviewing Appendix C.3b about the notion of the
asymptotic normality of an estimator.
This section discusses that the exact t and F distributions for the t and F statistics are
derived under MLR.6, along with the other Classical Linear Model assumptions.
Therefore, it is of importance to analyse whether Assumption MLR.6 is plausible. Since under MLR.1 (y = β0 + β1 x1 + · · · + βk xk + u) Assumption MLR.6 can be translated into the condition of normality of the dependent variable conditional on the explanatory variables:

y | x1, . . . , xk ∼ Normal(β0 + β1 x1 + · · · + βk xk, σ²)

whether Assumption MLR.6 holds or not can be studied from the perspective of the distribution of the dependent variable.
It is argued in this section that in practice, the normality assumption MLR.6 is often
questionable. As an example, take the APPLE dataset, in which a dependent variable
ecolbs (pounds of apples demanded) does not have a conditional normal distribution.
By definition, ecolbs cannot be negative. More importantly, it equals zero more than
one-third of the time. This is confirmed by the histogram (or bar chart) for ecolbs below.
Many dependent variables are the same way. Another example is the 401K dataset. The
interesting dependent variable there is prate, the percentage of employees at a firm
participating in a 401(k) pension plan. By definition, 0 ≤ prate ≤ 100. Plus, more than
44% of the observations are at 100. The histogram for prate below shows its distribution
is nothing like the continuous bell shape of the normal.
This section discusses that in cases such as described above, fortunately, we do not have
to abandon statistical inference if our sample size is even moderately large. (The sample
size in the APPLE dataset is 660 and that in the 401K dataset is 1,534, and these turn
out to be plenty large enough.) Namely, even when the observations of the dependent
variable are not from a normal distribution, we can use the central limit theorem to
conclude that the OLS estimators satisfy asymptotic normality. This is formulated
in the following theorem.
Theorem (asymptotic normality of OLS). Under Assumptions MLR.1 through MLR.5:

(β̂j − βj)/sd(β̂j) ∼ᵃ Normal(0, 1)

(where ∼ᵃ denotes the asymptotic distribution), and the same result holds when the standard deviation is replaced with the standard error:

(β̂j − βj)/se(β̂j) ∼ᵃ Normal(0, 1).
The proof of this theorem is not given as it is somewhat involved (it relies largely on the
CLT and also the LLN and some other probabilistic laws). The theorem essentially
states that what was exactly true with MLR.6 remains approximately true, in ‘large’
samples, even without MLR.6.
Because the tn−k−1 distribution tends to Normal(0, 1) as n tends to ∞, it is also legitimate to write:

(β̂j − βj)/se(β̂j) ∼ᵃ tn−k−1.
In practice, we just treat the quantity (β̂j − βj)/se(β̂j) as having an approximate tn−k−1 distribution. This means that everything we discussed for inference – t testing, confidence intervals, and even F testing – can be applied without the normality assumption. The caveat is that the testing is only approximate.
Sometimes when we know MLR.6 fails – as in the APPLE dataset example – we
emphasise that our p-values are ‘asymptotic p-values’ or ‘approximate p-values’ that are
based on ‘asymptotic standard errors’. But often this is left implicit.
We can illustrate the asymptotic normality property of the OLS estimators in
simulations. We will simulate data from the model:
y = 5 − 3x1 + 2x2 + u

where x1 is drawn from the Bernoulli distribution with success probability 0.3 (mean 0.3), x2 is drawn from the standard normal distribution, and u is a random variable that takes values −1 and 4 with probabilities 0.8 and 0.2, respectively (so that u has mean zero).
The distributions shown in Figures 7.10 to 7.12 are for the ratio:

(β̂j − βj)/se(β̂j).

In Figures 7.10 to 7.12 we also draw the density of the standard normal distribution for comparison. As we can see, the distribution of (β̂j − βj)/se(β̂j) is very close to the standard normal distribution even for n = 500.
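A sketch of this t-ratio simulation in Python, following the design described above (Bernoulli x1, standard normal x2, and the two-point error distribution); the number of replications is an arbitrary choice:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n, reps = 500, 2000
    ratios = np.empty(reps)

    for r in range(reps):
        x1 = rng.binomial(1, 0.3, n)                       # Bernoulli(0.3)
        x2 = rng.normal(0, 1, n)
        u = rng.choice([-1.0, 4.0], size=n, p=[0.8, 0.2])  # mean zero, skewed
        y = 5 - 3 * x1 + 2 * x2 + u
        res = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
        ratios[r] = (res.params[1] - (-3)) / res.bse[1]    # (b1 - beta1)/se(b1)

    print(ratios.mean(), ratios.std())  # close to 0 and 1, as N(0,1) predicts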
Activity 7.3 Take the MCQs related to this section on the VLE to test your
understanding.
Using WAGE1 data, a researcher estimates a model with wage as the dependent variable, plots the histogram of the OLS residuals and also plots the normal distribution that provides the best fit to the histogram in Figure 7.13.

Repeating the above, but with log(wage) as the dependent variable, the researcher obtains the histogram in Figure 7.14 for the residuals with the best-fitting normal distribution overlaid.
Would you say that Assumption MLR.6 is closer to being satisfied for the level-level
model or the log-level model?
7.3 Answers to activities
then z has mean zero, variance one, and $E(z^3) = 0$. Given a sample of data
$\{y_i : i = 1, \ldots, n\}$, we can standardise $y_i$ in the sample by using $z_i = (y_i - \hat\mu_y)/\hat\sigma_y$,
where $\hat\mu_y$ is the sample mean and $\hat\sigma_y$ is the sample standard deviation. (We ignore the
fact that these are estimates based on the sample.) A sample statistic that measures
skewness is $\frac{1}{n}\sum_{i=1}^n z_i^3$, or where $n$ is replaced with $(n - 1)$ as a degrees-of-freedom
adjustment. If y has a normal distribution in the population, the skewness measure
in the sample for the standardised values should not differ significantly from zero.
(i) First, using the data set 401KSUBS and keeping only observations with
fsize = 1, we find that the measure of skewness for inc is about 1.86 and when
we use log(inc), the skewness measure is about .360. Which variable has more
skewness and therefore seems less likely to be normally distributed?
(ii) Next, using BWGHT2 we find that the skewness for bwght is about −.60. When
we use log(bwght), the skewness measure is about −2.95. What do you conclude?
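The skewness statistic described above is straightforward to compute; a minimal R sketch (our own illustration, not part of the dataset files):

    skewness <- function(y) {
      z <- (y - mean(y)) / sd(y)   # standardise with the sample mean and sd
      mean(z^3)                    # (1/n) * sum of z_i^3
    }
    # e.g. compare skewness(inc) with skewness(log(inc)) in the 401KSUBS sample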
We know that the OLS estimator is best linear unbiased under the Gauss–Markov
assumptions. It turns out that this same set of assumptions implies an asymptotic
efficiency property of OLS, too. The discussion of this is in Section 5.3 of Wooldridge
but the details are not important for our purposes. We just need to remember that OLS
is the best we can do, using finite sample and asymptotic arguments, under
Assumptions MLR.1 to MLR.5.
What is this question about? In the regression we are asked to run – of pctstck on funds
– the regressor funds is correlated with the error because of the omitted variable risktol. We
will obtain the inconsistency in $\tilde\beta_1$ (sometimes loosely called the asymptotic bias).
In other words, we will derive the asymptotic analogue of the omitted variable bias.
Let’s consider a more general setting. Suppose the true model:
y = β0 + β1 x1 + β2 x2 + v
satisfies the assumptions MLR.1–MLR.4. Then v has a zero mean and is uncorrelated
with x1 and x2 .
If we omit x2 from the regression and run the simple regression of y on x1, then the
error term becomes u = β2 x2 + v.
Let $\tilde\beta_1$ denote the simple regression slope estimator. We can derive the probability limit
of $\tilde\beta_1$:
$$\operatorname{plim} \tilde\beta_1 = \beta_1 + \frac{\operatorname{Cov}(x_1, u)}{\operatorname{Var}(x_1)} = \beta_1 + \frac{\operatorname{Cov}(x_1, \beta_2 x_2 + v)}{\operatorname{Var}(x_1)} = \beta_1 + \beta_2 \frac{\operatorname{Cov}(x_1, x_2)}{\operatorname{Var}(x_1)} + \frac{\operatorname{Cov}(x_1, v)}{\operatorname{Var}(x_1)} = \beta_1 + \beta_2 \frac{\operatorname{Cov}(x_1, x_2)}{\operatorname{Var}(x_1)}$$
where the last equality uses $\operatorname{Cov}(x_1, v) = 0$.
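The formula can also be checked numerically; a minimal R sketch (our own code, with made-up coefficients), in which the simple-regression slope converges to $\beta_1 + \beta_2 \operatorname{Cov}(x_1, x_2)/\operatorname{Var}(x_1)$:

    set.seed(1)
    n  <- 1e5
    x1 <- rnorm(n)
    x2 <- 0.5 * x1 + rnorm(n)               # Cov(x1, x2) = 0.5, Var(x1) = 1
    y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)    # true beta1 = 2, beta2 = 3
    coef(lm(y ~ x1))["x1"]                  # close to 2 + 3 * 0.5/1 = 3.5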
The residuals from the log(wage) regression appear to be more normally distributed.
Certainly the histogram of the residuals from the log(wage) equation fits under its
comparable normal density better than the histogram of the residuals from the wage
equation, and the histogram for the wage residuals is notably skewed to the right.
In the wage regression, there are some very large residuals (roughly equal to 15) that lie
almost five estimated standard deviations ($\hat\sigma = 3.085$) from the mean of the residuals,
which is identically zero, of course. Residuals far from zero do not appear to be nearly
as much of a problem in the log(wage) regression.
(i) There is much less skewness in the log of income, so inc – which has the larger
skewness measure – seems less likely to be normally distributed. (In fact, the
skewness of income distributions is a well-documented fact across many countries
and time periods.)
(ii) In this case, there is much more skewness after taking the natural log.
(iii) The example in part (ii) clearly shows that this statement cannot hold generally. It
is possible to introduce skewness by taking the natural log. As an empirical matter,
for many economic variables, particularly dollar values, taking the log often does
help to reduce or eliminate skewness. But it does not have to.
(iv) For the purposes of regression analysis, we should be studying the conditional
distributions: that is, the distributions of y and log(y) conditional on the
explanatory variables. If we think the mean is linear, as in Assumptions MLR.1
and MLR.3, then this is equivalent to studying the distribution of the population
error, u. In fact, the skewness measure studied in this question often is applied to
the residuals from OLS regression.
7.4 Overview of chapter

Asymptotic properties
Consistency
Inconsistency
Asymptotic distribution
Asymptotic normality
Asymptotic efficiency
have an intuition of what the consistency property means and be able to establish
the consistency of the OLS estimator
explain what it means for the standardised OLS estimator to be asymptotically normal
understand the implications of using large sample results for statistical inference
explain in which sense and under what assumptions the OLS estimator is asymptotically
efficient.
7.5 Test your knowledge and understanding

Is $\tilde\beta_1$ consistent for $\beta_1$? If so, prove it. If not, explain the reason.
where score is the final score in the course measured as a percentage, colgpa is
the college GPA, actmth is the mathematics ACT score, acteng is the English
ACT score.
Suppose the distribution of score is skewed to the left. Why can
Assumption MLR.6 not hold for the error term u? What consequences does this
have for using the usual t statistic to test H0: β3 = 0?
(ii) Using the ECONMATH data, we can estimate the model from part (i) and
obtain that the t value for acteng is −2.48 and the corresponding p-value is
0.013.
How would you defend your findings to someone who makes the following
statement: ‘You cannot trust that p-value because clearly the error term in the
equation cannot have a normal distribution?’.
$$\tilde\beta_1 = \frac{\sum_{i=1}^n x_i(\beta_1 x_i + u_i)}{\sum_{i=1}^n x_i^2} = \beta_1 + \frac{\sum_{i=1}^n x_i u_i}{\sum_{i=1}^n x_i^2} = \beta_1 + \frac{\frac{1}{n}\sum_{i=1}^n x_i u_i}{\frac{1}{n}\sum_{i=1}^n x_i^2}.$$
By the LLN:
$$\operatorname{plim} \frac{1}{n}\sum_{i=1}^n x_i u_i = E[x_i u_i], \qquad \operatorname{plim} \frac{1}{n}\sum_{i=1}^n x_i^2 = E[x_i^2].$$
Then using the properties of the plim operator:
$$\operatorname{plim} \tilde\beta_1 = \beta_1 + \frac{E[x_i u_i]}{E[x_i^2]} = \beta_1 + \frac{0}{E[x_i^2]} = \beta_1.$$
7.2E (i) The distribution of score being skewed to the left violates the normality
assumption even conditional on the explanatory variables: score, and hence the
error term u, cannot be normally distributed, so Assumption MLR.6 fails. This
means that the t statistics will not have t distributions and the F statistics will
not have F distributions. This is a potentially serious problem because our
inference hinges on being able to obtain critical values or p-values from the t or
F distributions.
(ii) If the sample size is large, then by asymptotic normality the distribution of the
t statistic is well approximated by the t distribution even when the error term is
not normally distributed, so the reported p-value is still approximately valid.
Chapter 8
Heteroskedasticity
8.1 Introduction
explain what a weighted least squares estimator is and describe its properties.
estimators and focuses on modifying their standard errors and also test statistics in the
presence of heteroskedasticity. This is discussed in Section 8.2 in Wooldridge. In Section
8.3 in Wooldridge we turn to the discussion of testing for heteroskedasticity and discuss
various approaches: graphical analysis, Breusch–Pagan test, and the White test. In
Section 8.4 in Wooldridge we look at dealing with the situation of heteroskedasticity
from a different angle – instead of using OLS estimation and applying
heteroskedasticity-robust inference, we develop and implement a weighted least squares
method which delivers a more efficient estimator than OLS under correctly specified
heteroskedasticity.
This section analyses what happens with the OLS inference if we drop MLR.5 and act
as if we know nothing about the conditional variance Var(u | x).
We know that OLS is still unbiased and consistent under MLR.1 to MLR.4. (We did
not use MLR.5 to obtain either of these properties.) This is an important conclusion:
Heteroskedasticity does not cause bias or inconsistency in the OLS estimators βbj as long
as MLR.1 to MLR.4 continue to hold.
However, if Var (u | x) depends on x – that is, heteroskedasticity is present – then
OLS is no longer BLUE as it is no longer ‘best’ (that is, it no longer has the lowest
variance among linear and unbiased estimators). In principle, it is possible to find
unbiased estimators that have smaller variances than the OLS estimators. A similar
comment holds for asymptotic efficiency.
Importantly, the usual standard errors are no longer valid, which means the t
statistics and confidence intervals that use these standard errors cannot be trusted. This
is true even in large samples. Similarly, joint hypothesis tests using the usual F statistic
are no longer valid in the presence of heteroskedasticity. In fact, the usual sum of
squared residuals form of the F statistic is not valid under heteroskedasticity.
Both R2 and R̄2 remain consistent estimators of the population R-squared. You may
ask why the F statistic is no longer valid under heteroskedasticity even though R2
remains valid (recall that we were able to write our formula for the F statistic in terms
of R2 ). This is because the usual sum of squared residuals form of the F statistic is not
valid under heteroskedasticity and, consequently, the usual expression for the F statistic
in terms of R2 is no longer valid either.
Activity 8.1 Take the MCQs related to this section on the VLE to test your
understanding.
Without MLR.5, there are still good reasons to use OLS as OLS estimators are
unbiased and consistent, but we need to modify test statistics to make them valid under
heteroskedasticity. Fortunately, it is known how to modify standard errors and t, F , and
other statistics to be valid in the presence of heteroskedasticity of unknown form.
This is very convenient because it means we can report new inferential statistics that
are correct regardless of the kind of heteroskedasticity present in the population. This
also includes the possibility of homoskedasticity (that is, MLR.5 actually holds). So we
can compute confidence intervals and conduct statistical inference without worrying
about whether MLR.5 holds.
Most regression packages include an option with the OLS estimation that computes
heteroskedasticity-robust standard errors, which then produces
heteroskedasticity-robust t and F statistics and heteroskedasticity-robust
confidence intervals.
To give an illustration of the technical basis for heteroskedasticity-robust inference, let’s
consider the simple regression model:
yi = β0 + β1 xi + ui
and assume throughout that MLR.1 to MLR.4 hold. We will use the notation
$\sigma_i^2 = \operatorname{Var}(u_i \mid x_i)$, where the i subscript on $\sigma_i^2$ indicates that the variance of the error
depends upon the particular value of $x_i$.
Recall that the OLS estimator can be written as:
$$\hat\beta_1 = \beta_1 + \sum_{i=1}^n w_i u_i$$
where $w_i = (x_i - \bar x)/SST_x$. Denote $X = \{x_1, \ldots, x_n\}$. Under MLR.1 to MLR.4 (that is, without
the homoskedasticity assumption), we can show that:
$$\operatorname{Var}(\hat\beta_1 \mid X) = \operatorname{Var}\left(\sum_{i=1}^n w_i u_i \,\middle|\, X\right) = \sum_{i=1}^n \operatorname{Var}(w_i u_i \mid X) = \sum_{i=1}^n w_i^2 \operatorname{Var}(u_i \mid X) = \sum_{i=1}^n w_i^2 \operatorname{Var}(u_i \mid x_i) = \sum_{i=1}^n w_i^2 \sigma_i^2 = \frac{\sum_{i=1}^n (x_i - \bar x)^2 \sigma_i^2}{SST_x^2}.$$
For comparison: in the case of homoskedasticity, the variance of $\hat\beta_1$ would be $\operatorname{Var}(u)/SST_x$
(a special case of the formula above if you take $\sigma_i^2 = \operatorname{Var}(u)$ for all i). How can we
estimate $\operatorname{Var}(\hat\beta_1 \mid X)$? If the $\hat u_i$ denote the OLS residuals from the regression of y on
x, then a valid estimator of $\operatorname{Var}(\hat\beta_1 \mid X)$ for heteroskedasticity of any form
(including homoskedasticity) is:
$$\widehat{\operatorname{Var}}(\hat\beta_1 \mid X) = \frac{\sum_{i=1}^n (x_i - \bar x)^2 \hat u_i^2}{SST_x^2}.$$
The square root of $\widehat{\operatorname{Var}}(\hat\beta_1 \mid X)$ is called the heteroskedasticity-robust standard
error for $\hat\beta_1$.
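To see what this looks like in practice, here is a minimal R sketch (the data frame dat and its variables are hypothetical, our own illustration): it computes the robust variance estimator above by hand and checks it against the HC0 option of the sandwich package.

    library(sandwich)
    library(lmtest)
    fit  <- lm(y ~ x, data = dat)                    # OLS on hypothetical data
    uhat <- resid(fit)
    dx   <- dat$x - mean(dat$x)
    V1   <- sum(dx^2 * uhat^2) / sum(dx^2)^2         # robust variance of beta1-hat
    sqrt(V1)                                         # robust standard error, by hand
    coeftest(fit, vcov = vcovHC(fit, type = "HC0"))  # same robust se via sandwich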
There is an analogous formula for the variance of OLS estimators in the multiple
regression model:
yi = β0 + β1 x1i + · · · + βk xki + ui .
Under assumptions MLR.1 to MLR.4, a valid estimator of $\operatorname{Var}(\hat\beta_j \mid X)$ is:
$$\widehat{\operatorname{Var}}(\hat\beta_j \mid X) = \frac{\sum_{i=1}^n \hat r_{ij}^2 \hat u_i^2}{SSR_j^2}$$
where $\hat r_{ij}$ is the ith residual from regressing $x_j$ on all other regressors, and $SSR_j$ is the
sum of squared residuals from the regression of $x_j$ on all other regressors.
The heteroskedasticity-robust t statistic is calculated as:
$$t = \frac{\text{estimate} - \text{hypothesised value}}{\text{standard error}}$$
where we use the OLS estimate in the numerator and the heteroskedasticity-robust
standard error in the denominator.
It is sometimes incorrectly claimed that heteroskedasticity-robust standard errors for
OLS are always larger than the usual OLS standard errors; in fact, they can be either
larger or smaller.
The heteroskedasticity-robust F statistic (or a simple transformation of it) is also called
a heteroskedasticity-robust Wald statistic. The robust Wald statistic has an
approximate chi-squared distribution in large samples, but it is easy to turn it into an
approximate F distribution (this is what software packages do).
Activity 8.2 Take the MCQ related to this section on the VLE to test your
understanding.
(i) Using the data in HPRICE1, we estimate the following regression equation and
obtain both usual and heteroskedasticity-robust standard errors:
(ii) Using the same data, we estimate the regression equation with log(price) instead
of price, and log(lotsize) and log(sqrft) instead of lotsize and sqrft, respectively:
(iii) What does this example suggest about heteroskedasticity and the
transformation used for the dependent variable?
This section begins with a discussion of why one might want to test for
heteroskedasticity and focuses on some of the modern tests. There is a general
framework we can apply to all these tests.
Before discussing these tests, it is worth mentioning that an informal way to detect
heteroskedasticity is by means of a visual inspection of OLS residuals (or squared
residuals) plotted against a fitted value or a regressor. To illustrate this, Figures 8.1 and
8.2 depict situations of homoskedasticity, whereas Figure 8.3 describes a situation of
heteroskedasticity.
In Figure 8.3 the residuals exhibit a systematic pattern as they appear to have a greater
variance for large absolute values of fitted values.
Examination of residuals may also help detect a wrong functional form. For example, in
Figure 8.4 positive residuals occur at moderate fitted values and negative residuals at
high or low fitted values. This can be corrected by considering an appropriate non-linear
model. Figure 8.5 illustrates another violation of this type.
Consider the linear model:
y = β0 + β1 x1 + β2 x2 + · · · + βk xk + u,  E(u | x) = 0
which are MLR.1 and MLR.4, respectively. As before we assume random sampling
(MLR.2) and, of course, we rule out perfect collinearity (MLR.3).
Let x = (x1, . . . , xk). Given MLR.4, Assumption MLR.5 can be written as:
E(u² | x) = σ² = E(u²).
The null hypothesis of homoskedasticity is:
H0: E(u² | x) = σ² (constant).
The Breusch–Pagan test considers the linear alternative:
E(u² | x) = δ0 + δ1 x1 + · · · + δk xk
under which the null becomes:
H0: δ1 = δ2 = · · · = δk = 0.
Since we can write the linear model for E(u² | x1, . . . , xk) as the following model with an
error term:
u² = δ0 + δ1 x1 + · · · + δk xk + v,  E(v | x) = 0
and this model satisfies MLR.1 through MLR.5 under the null, H0 can be tested using the usual F test for joint
significance of all explanatory variables. Note that we are testing the null hypothesis
that the squared error is uncorrelated with each of the explanatory variables. Clearly,
u² ≥ 0 cannot be normally distributed. Consequently, any test using u² as the
dependent variable must rely on large-sample approximations. Let us write the equation
for a random draw i as:
ui² = δ0 + δ1 xi1 + · · · + δk xik + vi.
Now, we do not observe ui² because these are the squared errors in the regression of y on
x1, . . . , xk! (Remember, we observe yi and the xij for each draw i, but not the error ui.)
The solution is to replace the errors with the OLS residuals. In other words, the
equation we estimate looks like:
ûi² = δ0 + δ1 xi1 + · · · + δk xik + errori.
In the example with two regressors educ and exper, the second step of the White test
for heteroskedasticity would be the regression of ûi² on educ, exper, educ², exper² and
educ · exper, with null hypothesis:
H0: δ1 = δ2 = δ3 = δ4 = δ5 = 0
and the test statistic would, of course, be the joint F test for all explanatory variables.
The third step of the White test is comparing the F statistic with a suitable critical
value or just using its p-value. In conducting this test, we maintain assumptions
MLR.1–MLR.4.
The White test has the advantage of detecting more general departures from
homoskedasticity than the Breusch–Pagan test. But the White test also has the
disadvantage of using many degrees of freedom when there are many regressors in the
original model. As we can see above, in the case of just two regressors educ and exper it
leads to 6 parameters δj that need to be estimated. In the case of 6 regressors it would
lead to 28 parameters δj that would need to be estimated.
There is a modification of the White test that is easier to implement and always has
two degrees of freedom. The idea is to use the OLS fitted values in a test for
heteroskedasticity. In this modification, in the first step we run OLS of y on x1 , . . . , xk
and save both the OLS residuals and fitted values. In the second and third steps we test
for heteroskedasticity by estimating the equation:
ûi² = δ0 + δ1 ŷi + δ2 ŷi² + errori
where ŷi stands for the fitted value for observation i, and then test H0: δ1 = δ2 = 0
using an F test (under MLR.1–MLR.4). It is important not to confuse ŷi and yi in this
equation.
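Both tests amount to auxiliary OLS regressions followed by an F test. A minimal R sketch (hypothetical data frame dat with regressors x1 and x2; our own code, not the guide's):

    fit  <- lm(y ~ x1 + x2, data = dat)      # original model
    u2   <- resid(fit)^2                     # squared OLS residuals
    yhat <- fitted(fit)                      # OLS fitted values

    bp <- lm(u2 ~ x1 + x2, data = dat)       # Breusch-Pagan auxiliary regression
    summary(bp)$fstatistic                   # F statistic with its df

    wh <- lm(u2 ~ yhat + I(yhat^2))          # special case of the White test
    summary(wh)$fstatistic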
Activity 8.5 Take the MCQs related to this section on the VLE to test your
understanding.
Activity 8.7 This exercise uses data VOTE1 which has 173 observations.
(i) After estimating a model with voteA (percent vote for A) as the dependent variable
and prtystrA (percent vote for president), democA (dummy variable equal to 1 if A
is a Democrat), log(expendA), and log(expendB) as regressors, we obtain the fitted
values, ŷi, and the OLS residuals, ûi, and regress the OLS residuals on all of the
regressors. In this latter regression we obtain R² = 6.145 × 10⁻³² (practically zero).
Explain why you obtain R² ≈ 0.
(iii) The regression of ûi² on ŷi, ŷi² gives R² = .03173. Compute the F statistic for the
special case of the White test for heteroskedasticity. How strong is the evidence for
heteroskedasticity now? Discuss whether you reject homoskedasticity at the 5%
level and whether you reject it at the 10% level.
Under heteroskedasticity (that is, the failure of MLR.5), the OLS estimator is no longer
BLUE in general. There is a more efficient (i.e. a smaller variance) estimator than OLS
under heteroskedasticity. This estimator is the so-called weighted least squares
(WLS) estimator.
The construction of the WLS estimator requires us to know exactly the structure of
heteroskedasticity, which was not the case before with heteroskedasticity-robust
inference for OLS. If we have correctly specified the form of the variance as a function
of explanatory variables, then weighted least squares (WLS) is more efficient than OLS.
Suppose that:
Var(u | x1, . . . , xk) = σ² h(x1, . . . , xk)
where σ² is unknown and the function h(x1, . . . , xk) > 0 is known. From the multiple
regression:
yi = β0 + β1 xi1 + · · · + βk xik + ui
we get a transformed model by dividing through by $\sqrt{h(x_{i1}, \ldots, x_{ik})}$:
$$y_i^* = \beta_0 x_{i0}^* + \beta_1 x_{i1}^* + \cdots + \beta_k x_{ik}^* + u_i^*$$
where $y_i^* = y_i/\sqrt{h(x_{i1}, \ldots, x_{ik})}$, $x_{ij}^* = x_{ij}/\sqrt{h(x_{i1}, \ldots, x_{ik})}$ (with $x_{i0}^* = 1/\sqrt{h(x_{i1}, \ldots, x_{ik})}$) and $u_i^* = u_i/\sqrt{h(x_{i1}, \ldots, x_{ik})}$. We have:
$$E(u_i^* \mid x_{i1}, \ldots, x_{ik}) = E\!\left(\frac{u_i}{\sqrt{h(x_{i1}, \ldots, x_{ik})}} \,\middle|\, x_{i1}, \ldots, x_{ik}\right) = \frac{E(u_i \mid x_{i1}, \ldots, x_{ik})}{\sqrt{h(x_{i1}, \ldots, x_{ik})}} = 0$$
(here we use MLR.4 for $u_i$). This implies that MLR.4 holds in the transformed model (the first equality uses the fact that $x_{i0}^*, x_{i1}^*, \ldots, x_{ik}^*$ are functions of $x_{i1}, \ldots, x_{ik}$).
Since the conditional variance satisfies:
$$\operatorname{Var}(u_i^* \mid x_{i1}, \ldots, x_{ik}) = \frac{\operatorname{Var}(u_i \mid x_{i1}, \ldots, x_{ik})}{h(x_{i1}, \ldots, x_{ik})} = \sigma^2$$
MLR.5 also holds in the transformed model, and the WLS estimator is simply OLS applied to the transformed variables.
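In R, the transformation need not be done by hand: lm() accepts weights proportional to 1/h(x). A minimal sketch for the specification Var(u | inc) = σ² inc used in Activity 8.9 below (variable names follow the 401KSUBS description; the code is our own, with a hypothetical data frame dat):

    # WLS with h(inc) = inc: weight observation i by 1/inc_i
    wls <- lm(nettfa ~ inc + I((age - 25)^2) + male + e401k + e401k:inc,
              data = dat, weights = 1 / inc)
    summary(wls)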
Activity 8.8 Take the MCQs related to this section on the VLE to test your
understanding.
Activity 8.9 This question uses the data in 401KSUBS, restricting the sample to
fsize = 1.
(i) We estimated the equation that explains net total financial wealth (nettfa,
measured in $1,000s) in terms of income (inc, also measured in $1,000s) and
some other variables, including age, gender , an indicator e401k for whether the
person is eligible for a 401(k) pension plan, and also the interaction term
e401k · inc. We estimated the equation by OLS and obtained the usual and
robust standard errors (the usual OLS standard errors are in (·) and the
heteroskedasticity-robust standard errors are in [·]):
What do you conclude about the statistical significance of the interaction term?
(ii) Now we estimated the more general model by WLS obtained under the
assumption Var (ui | inci ) = σ 2 inc i . The results below show the usual and robust
standard errors for the WLS estimator:
$$\widehat{nettfa} = -14.09 + .619\,inc + .0175\,(age - 25)^2 + 1.78\,male - 2.17\,e401k + .295\,e401k \cdot inc$$
with usual standard errors (2.27), (.084), (.0019), (1.56), (3.66), (.130) and
heteroskedasticity-robust standard errors [2.53], [.091], [.0026], [1.31], [3.51], [.160], respectively.
Note that even after applying WLS estimation we may want to use
heteroskedasticity-robust standard errors, either because we have doubts about whether
we specified the form of the heteroskedasticity correctly, or to convince ourselves that
the usual and heteroskedasticity-robust standard errors are similar.
Is the interaction term statistically significant using the robust standard error?
(iii) Discuss the WLS coefficient on e401k in the more general model. Is it of much
interest by itself? Explain.
(iv) If we reestimate the model by WLS but replace e401k · inc with the interaction
term e401k · (inc − 30) (the average income in the sample is about 29.44), the
coefficient on e401k becomes 6.68 (robust t = 3.20).
Now interpret the coefficient on e401k.
8.3 Answers to activities
(i) The coefficients corresponding to crsgpa, cumgpa, and tothrs have the anticipated
signs. If a student takes courses where grades are, on average, higher – as reflected
by higher crsgpa – then their grades will be higher, other things being equal. The
better the student has been in the past – as measured by cumgpa – the better the
student does (on average) in the current semester, other things being equal.
Finally, tothrs is a measure of effort, and its coefficient indicates an increasing
return to effort, other things being equal.
The t statistic for crsgpa is very large, over five using the usual standard error
(which is the larger of the two). Using the robust standard error for cumgpa, its t
statistic is about 2.61, which is also significant at the 5% level (under MLR.1 to
MLR.4). The t statistic for tothrs is only about 1.17 using either standard error, so
it is not significant at the 5% level (under MLR.1 to MLR.4). Using robust
standard errors does, however, make a difference for some conclusions – for
example, for the significance of the effect of season in part (iii) (under MLR.1 to MLR.4).
(ii) If crsgpa was the only explanatory variable, H0 : βcrsgpa = 1 means that, without
any information about the student, the best predictor of term GPA is the average
GPA in the students’ courses; this holds essentially by definition. (The intercept
would be zero in this case.)
(iii) The in-season effect is given by the coefficient on season, which implies that, other
things equal, an athlete’s GPA is about .16 points lower when their sport is
competing. The t statistic using the usual standard error is about −1.60, while that
using the robust standard error is about −1.96.
Under MLR.1 to MLR.4, against a two-sided alternative, the t statistic using the
robust standard error is just significant at the 5% level (cv ≈ 1.96), while using the
usual standard error, the t statistic is not quite significant at the 10% level
(cv ≈ 1.65).
(i) n = 88 is still considered large enough for us to use robust standard errors. The
robust standard error on lotsize is almost twice as large as the usual standard
error, making lotsize much less significant (under MLR.1 to MLR.4) – the t
statistic falls from about 3.23 to 1.70.
The t statistic on sqrft also falls, but it is still very significant (under MLR.1 to
MLR.4). The variable bdrms actually becomes somewhat more significant, but it is
still barely significant. The most important change is in the significance of lotsize.
(ii) For the log-log model, the heteroskedasticity-robust standard error is always
slightly greater than the corresponding usual standard error, but the differences are
relatively small. In particular, log(lotsize) and log(sqrft) still have very large t
statistics.
The t statistic on bdrms is not significant at the 5% level (under MLR.1 to MLR.4)
against a one-sided alternative using either standard error.
(iii) Using the logarithmic transformation of the dependent variable often mitigates, if
not entirely eliminates, heteroskedasticity. This is certainly the case here, as no
important conclusions in the model for log(price) depend on the choice of standard
error. (We have also transformed two of the independent variables to make the
model of the constant elasticity variety in lotsize and sqrft.)
(i) The proposed test is a hybrid of the Breusch–Pagan and White tests. There are
k + 1 regressors, each original explanatory variable, and the squared fitted values.
So, the number of restrictions tested is k + 1 and this is the numerator df . The
denominator df is n − (k + 2) = n − k − 2.
(ii) The Breusch–Pagan test regresses:
ûi² on xi1, xi2, . . . , xik,  i = 1, . . . , n
while the hybrid test regresses:
ûi² on xi1, xi2, . . . , xik, ŷi²,  i = 1, . . . , n.
The only difference is that in the hybrid test we have an extra regressor, and so the
R-squared will be no less for the hybrid test than that for the Breusch–Pagan test
(we know that R-squared cannot decrease when we add a regressor).
For the special case of the White test, the argument is a bit more subtle. In the
special case of the White test, we regress:
ûi² on ŷi, ŷi²,  i = 1, . . . , n
whereas in the hybrid test we regress:
ûi² on xi1, xi2, . . . , xik, ŷi²,  i = 1, . . . , n.
Thus, it comes down to whether it is ŷi or xi1, xi2, . . . , xik that reduces the SSR more.
Since the fitted values ŷi are a linear function of the regressors, regressing ûi² on
ŷi, ŷi² cannot produce a larger R-squared than regressing ûi² on xi1, xi2, . . . , xik, ŷi².
(iii) No. The F statistic for the joint significance of the regressors in the second step
regression depends on $R^2_{\hat u^2}/(1 - R^2_{\hat u^2})$, and it is true that this ratio increases as $R^2_{\hat u^2}$
increases.
But, the F statistic also depends on the df , and the df are different among all three
tests: the BP test, the special case of the White test, and the hybrid test. So we do
not know which test will deliver the smallest p-value.
(iv) As discussed in part (ii), the OLS fitted values are a linear combination of the
original regressors. Because those regressors appear in the hybrid test, adding the
OLS fitted values is redundant; perfect multicollinearity would result.
(i) Recall that OLS works by choosing the estimates, $\hat\beta_j$, such that the residuals are
uncorrelated in the sample with each regressor (and the residuals have a zero
sample average, too). Therefore, if we regress $\hat u_i$ on the regressors, the R-squared
will be exactly zero!
(ii) The F statistic for joint significance is about F = (.05256/4)/((1 − .05256)/168) = 2.33 (with 4
numerator df and 168 denominator df). Under H0 and MLR.1–MLR.4, F ∼ F4,168.
(iii) For the special case of the White test, F = (.03173/2)/((1 − .03173)/170) ≈ 2.79 and,
under H0 and MLR.1–MLR.4:
F ∼ F2,170.
The 5% critical value of F2, 170 is 3 and the 10% critical value is 2.3, so there is
evidence of heteroskedasticity at the 10%, but not quite at the 5% level.
In fact, the p-value is ≈ .065. There is slightly less evidence of heteroskedasticity
than provided by the BP test, but the conclusion is similar.
(i) Although the usual OLS t statistic on the interaction term is about 2.8, the
heteroskedasticity-robust t statistic is just under 1.6. Therefore, using OLS, we
must conclude the interaction term is only marginally significant (under MLR.1 to
MLR.4).
But the coefficient is nontrivial; it implies a much more sensitive relationship
between financial wealth and income for those eligible for a 401(k) plan.
(ii) Usual and robust standard errors are now much closer (which we can take as an
indication that our heteroskedasticity specification works reasonably well and, thus,
WLS dealt well with heteroskedasticity).
The robust t statistic is about 1.84, and so the interaction term is marginally
significant (two-sided p-value is about .066) under MLR.1 to MLR.4.
(iii) The coefficient on e401k literally gives the estimated difference in financial wealth
at inc = 0, which obviously is not interesting. It is not surprising that it is not
statistically different from zero; we obviously cannot hope to estimate the difference
at inc = 0, nor do we care to.
(iv) When we replace e401k · inc with e401k · (inc − 30), the coefficient on e401k is the
estimated difference in nettfa between those with and without 401(k) eligibility at
roughly the average income, $30,000. Naturally, we can estimate this much more
precisely, and its magnitude ($6,680) makes sense.
8.4 Overview of chapter
Finally, we discussed that OLS is no longer the best linear unbiased estimator in the
presence of heteroskedasticity. This leads us to weighted least squares as a means of
obtaining the BLUE estimator. To construct WLS, we needed to have the correct model
of heteroskedasticity.
Homoskedasticity
Robust
Wald statistic
Breusch–Pagan
White test
explain what a weighted least squares estimator is and describe its properties.
8.5 Test your knowledge and understanding
             coefficient   (s.e.)
OUT              3.43      (.47)
OUTSQ             .26      (.04)
constant         2.78      (.90)
R²                .93
Table 8.1: The dependent variable is COST. Standard errors are in parentheses.
She plots the data and fitted line in a scatter diagram, shown in Figure 8.6.
(b) The researcher defines LCOST and LOUT as the natural logarithms of COST
and OUT and regresses LCOST on LOUT . Breusch–Pagan and White tests
for this specification do not reject the null hypothesis of homoskedasticity.
Explain this finding, given the answer in part (a).
176
8.5. Test your knowledge and understanding
n = 526, R2 = .116
where wage is average hourly earnings in dollars and female is a binary variable
that is equal to 1 if the person is female and 0 otherwise.
(a) In the model:
wage = β0 + β1 female + u
suppose that the variances are also different for female and non-female: that is:
Var(wage | female = 0) ≠ Var(wage | female = 1)
(or equivalently Var(u | female = 0) ≠ Var(u | female = 1)). Is the result in (1)
still reliable (make sure you discuss the reliability of βb0 , βb1 and their standard
errors)? If not, how would you modify the result?
(b) In the model:
wage = β0 + β1 female + u
suppose:
Var (u | female) = σ 2 (1 + 0.2 · female)
for some unknown σ 2 . Suggest an estimator different from the OLS that may
have better properties.
8.3E We use a dataset on hospital finance for 299 hospitals in California in 2003. The
data include many variables related to hospital finance and hospital utilisation.
(a) We estimate the following model by OLS:
$$\widehat{patient\ revenue} = -9.189 \cdot 10^7 + 988.5\,ER\ visits + 6890\,surgeries$$
with usual standard errors (1.399 · 10⁷), (339.5), (1163) and heteroskedasticity-robust
standard errors [1.291 · 10⁷], [665.47], [2769], respectively.
(b) A researcher performs the Breusch–Pagan test for the model in (a). The BP
test statistic is 34.52. State the conclusion in this case.
(c) The researcher changes the model and now considers patient revenue,
ER visits, surgeries and beds in logarithms:
$$\widehat{\log(patient\ revenue)} = 8.9055 + .3841\,\log(ER\ visits) + .2356\,\log(surgeries) + .5962\,\log(beds) + .6424\,occupancy$$
with usual standard errors (.3362), (.0534), (.0304), (.0390), (.1096) and
heteroskedasticity-robust standard errors [.3536], [.0568], [.0323], [.0469], [.1178], respectively;
n = 299, R² = .8823, R̄² = .8807.
H0 : δ1 = δ2 = 0
(the alternative is the negation of the null). This null hypothesis is tested
using the F test of the joint significance of OUT and OUTSQ. Under H0
and MLR.1–MLR.4, F ∼ F2, 112 .
Here is how we would perform the Breusch–Pagan test if we had a copy of
the dataset (throughout this testing procedure we maintain assumptions
MLR.1–MLR.4):
H0 : δ1 = δ2 = δ3 = δ4 = 0
(the alternative is the negation of the null). This null hypothesis is tested
using the F test of the joint significance of OUT , OUTSQ, OUT 3 , OUT 4 .
Under H0 and MLR.1–MLR.4, F ∼ F4, 110 .
Here is how we would perform the White test if we had a copy of the
dataset (throughout this testing procedure we maintain assumptions
MLR.1–MLR.4):
After estimating the equation COST = β0 + β1 OUT + β2 OUTSQ + u
by OLS (estimates given) we would save the OLS residuals, ûi, and
compute the squared residuals, ûi².
We then would regress ûi² on OUT, OUTSQ, OUT³, OUT⁴ and
compute the usual F statistic of joint significance of these four
explanatory variables.
If the p-value of the previous test is sufficiently small, we would reject
the null of homoskedasticity and conclude Assumption MLR.5 fails.
We are given that the realisation of the F statistic is 58.2. The
approximate 1% critical value for F4, 110 is 4.01. Since 58.2 > 4.01, we
reject the null of homoskedasticity at the 1% significance level.
(b) It is known that models with the logarithm of a dependent variable such as
cost, wage, income, GDP, etc. often suffer less from heteroskedasticity. In our
case this finding is consistent with the homoskedastic error u entering through
the multiplicative form COST = exp(β0 + β1 LOUT + u), so that
LCOST = β0 + β1 LOUT + u has a homoskedastic error.
8.2E (a) If MLR.1–MLR.4 hold, OLS estimators βb0 , βb1 are still unbiased and consistent.
However, their usual standard errors are no longer valid due to
heteroskedasticity (violation of MLR.5). We would have to compute
heteroskedasticity-robust standard errors.
(b) WLS estimators are unbiased and consistent under MLR.1–MLR.4 and have a
smaller variance than OLS estimators. We can obtain them as OLS estimators
in the model:
$$\frac{wage}{\sqrt{1 + 0.2 \cdot female}} = \beta_0 \frac{1}{\sqrt{1 + 0.2 \cdot female}} + \beta_1 \frac{female}{\sqrt{1 + 0.2 \cdot female}} + \frac{u}{\sqrt{1 + 0.2 \cdot female}}.$$
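Equivalently, rather than dividing through by $\sqrt{1 + 0.2 \cdot female}$ by hand, a one-line R sketch (hypothetical data frame dat):

    wls <- lm(wage ~ female, data = dat, weights = 1 / (1 + 0.2 * female))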
8.3E (a) Yes, regressors have the expected signs. The demand for a hospital’s services
(ER visits, surgeries, occupancy) has a positive effect on patient revenue as
well as the hospital capacity (beds). For ER visits, surgeries and beds the
heteroskedasticity-robust standard errors are larger than the usual ones
whereas for occupancy the heteroskedasticity-robust standard error is lower.
When using the usual standard errors, all the regressors are statistically
significant at the 5% level. When using heteroskedasticity-robust standard
errors, ER visits becomes statistically insignificant at the 5% level (and even
at the 10% level).
(b) The Breusch–Pagan test statistic is the F statistic in the auxiliary model.
Under MLR.1 to MLR.4 and under the null of homoskedasticity, F ∼ F4, 294 .
The 1% critical value for the F4, 294 distribution is 3.32. Since 34.52 > 3.32, we
reject the null of homoskedasticity at the 1% level.
(c) All heteroskedasticity-robust standard errors are larger than the usual ones. All
the regressors are statistically significant at the 1% level and this conclusion
holds when using either usual or heteroskedasticity-robust standard errors.
(d) The result of the BP test does not reject the null of homoskedasticity at the
5% significance level (.5117 < 2.37, where 2.37 is the 5% critical value for the
F4, 294 ). In fact, the p-value associated with it is .7272. The special White test,
however, rejects the null of homoskedasticity since 8.118 > 3, where 3 is the
5% critical value for the F2,296 distribution. These two findings suggest that
there is still heteroskedasticity in the model in (c), but that it operates through
quadratic and interaction terms of the regressors rather than through linear terms.
Chapter 9
Instrumental variable estimation and
two-stage least squares
9.1 Introduction
describe the principles underlying the use of instrumental variables in both simple
and multiple regression models
In this chapter we first give a brief overview of the concept of endogeneity, highlight the
wide range of settings that give rise to endogeneity, provide intuition behind the
undesirable consequences of endogeneity on the OLS estimator, and provide possible
solutions to deal with endogeneity.
After motivating the endogeneity problem with the help of omitted variables, Section
15.1 in Wooldridge introduces the definition of an instrumental variable (IV) in the
simple linear regression model and discusses the key idea of the IV method. In the
simple regression model an instrument for the endogenous regressor needs to satisfy the
instrument exogeneity (validity) and instrument relevance requirements and you learn
whether (and how) these requirements can be tested. With the help of empirical
examples we discuss these conditions. You will learn how these conditions can be used
to identify and derive the IV estimator in the simple linear regression model. You will
analyse its properties (consistency) and discuss the variance of the IV estimator which
highlights the consequences of having poor instruments.
In Section 15.2 in Wooldridge, IV estimation is extended to the multiple regression
model. This is quite an important extension to consider because sometimes a potential
instrument is exogenous only when other factors are controlled for. In addition to the
instrument exogeneity and relevance requirements, we will add here a third condition on
our instrument: an exclusion restriction.
In Sections 15.3a and 15.3d in Wooldridge we consider the setting where we have more
than one instrument for our endogenous regressor(s). The setting where we have more
instruments than the number of endogenous regressors is called overidentification
(contrast with exact (or just) identification where you have exactly as many
instruments as endogenous regressor(s)). You learn that the two-stage least squares
(2SLS) estimator will give an estimator which is better than the IV estimator based on
a single instrument (in the sense of having a smaller variance).
Finally, in Section 15.5a you learn how to test for the endogeneity of an explanatory
variable. As OLS is more efficient than IV (2SLS) when there is no endogeneity, this is
very useful.
9.2 Content of chapter
Consider the multiple regression model:
y = β0 + β1 x1 + · · · + βk xk + u
and, as a leading example, the simple wage equation:
log(wage) = β0 + β1 educ + u.
By endogeneity we refer to any correlation between regressors and the error term – it
is a key concept in econometrics and econometric applications. We will use the following
terminology: a regressor is called endogenous if it is correlated with the error term, and
exogenous if it is uncorrelated with the error term.
There may be various statistical or economic reasons why we might expect that the
errors and regressors are correlated. While we focus our attention in this chapter
on the first of these reasons, we will discuss the remaining ones later in the course.
We shouldn’t be surprised that our OLS estimators are bad (biased and inconsistent) if
there is correlation between the errors and regressors. Indeed, in the model:
yi = β0 + β1 xi1 + · · · + βk xik + ui
the first-order conditions (FOCs) that define our OLS estimator are:
$$\sum_{i=1}^n x_{ij}\,\hat u_i = 0, \quad \text{for all } j = 0, \ldots, k.$$
We now need to recognise that these FOCs only make sense if:
$$E(x_{ij} u_i) = 0 \quad \text{for all } j = 0, \ldots, k$$
or, equivalently:
$$\operatorname{Cov}(x_{ij}, u_i) = 0 \quad \text{for all } j = 0, \ldots, k.$$
Solutions to endogeneity
1. Include good controls in the hope that the endogenous explanatory variable
becomes exogenous.
Adding more controls allows us to control for confounders – needed for causal
interpretation.
Adding additional lags (explanatory variables) in dynamic time series models
may help remove any remaining dependence in the errors.
Activity 9.1 Take the MCQs related to this section on the VLE to test your
understanding.
It is important to be able to explain (in empirical settings) why omitted variables give
rise to the endogeneity problem whereby included explanatory variables are correlated
with the regression error (which includes relevant variables that are correlated with the
included regressor). This is directly related to our omitted variable bias (OVB)
discussion and you should revisit it if required.
In the presence of endogeneity we need additional information to enable consistent
estimation of our parameters, say in the simple linear regression model:
y = β0 + β1 x + u,  Cov(x, u) ≠ 0.
The key idea of the instrumental variables method is that we need to find a new
variable (say, z), called an instrument, that satisfies the following two conditions:
1. Instrument exogeneity (validity): Cov(z, u) = 0.
2. Instrument relevance: Cov(z, x) ≠ 0.
The relevance condition can be expressed through the reduced form regression:
x = π0 + π1 z + v
The instrumental variable z is a variable that explains the variation in x but does not
directly explain y.
The instrument z can then be used to extract the ‘good’ variation from x and the
instrumental variable method effectively replaces x with only that component.
The properties of an instrumental variable can be expressed using the following
diagram, which we will use later to explain intuitively and in more detail why the
instrumental variable works. The diagram contains the dependent variable, y; the
endogenous regressor, x; and the instrumental variable, z.
As the diagram shows, x affects y, indicated by the directed arrow, in accordance with
our regression model. The instrumental variable z affects the dependent variable only
through its effect on the endogenous regressor x; the instrument is not allowed to affect
y directly. This is a concise summary of the properties we want an instrumental
variable to have.
Activity 9.2 Take the MCQs related to this section on the VLE to test your
understanding.
We should know how the IV estimator can be obtained in the simple linear regression
model:
y = β0 + β1 x + u. (*)
The two moment conditions E(u) = 0 and E(zu) = 0 yield the sample analogues:
$$\sum_{i=1}^n (y_i - \hat\beta_{0,IV} - \hat\beta_{1,IV}\,x_i) = 0 \qquad \text{and} \qquad \sum_{i=1}^n (y_i - \hat\beta_{0,IV} - \hat\beta_{1,IV}\,x_i)\,z_i = 0.$$
Similar to how we derived the OLS estimators, we can rewrite the first equation to
obtain the IV estimator of β0:
$$\hat\beta_{0,IV} = \bar y - \hat\beta_{1,IV}\,\bar x$$
which looks just like the OLS intercept estimator except that the slope estimator, $\hat\beta_{1,IV}$,
is now the IV estimator. To derive the IV estimator of β1 we plug this expression into
the second equation:
$$\sum_{i=1}^n \big(y_i - (\bar y - \hat\beta_{1,IV}\,\bar x) - \hat\beta_{1,IV}\,x_i\big) z_i = 0$$
which we can rewrite as $\sum_{i=1}^n \big((y_i - \bar y) - \hat\beta_{1,IV}(x_i - \bar x)\big) z_i = 0$. Using the fact that, for
example:
$$\sum_{i=1}^n (y_i - \bar y)\,\bar z = \bar z \sum_{i=1}^n (y_i - \bar y) = 0$$
we then obtain:
$$\hat\beta_{1,IV} = \frac{\sum_{i=1}^n (z_i - \bar z)(y_i - \bar y)}{\sum_{i=1}^n (z_i - \bar z)(x_i - \bar x)} \equiv \frac{\widehat{\operatorname{Cov}}(z, y)}{\widehat{\operatorname{Cov}}(z, x)}.$$
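In R the IV estimator can be computed directly from this formula; a minimal sketch with hypothetical vectors y, x and z:

    b1_iv <- cov(z, y) / cov(z, x)       # IV slope estimator
    b0_iv <- mean(y) - b1_iv * mean(x)   # IV intercept estimator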
Recall the following picture which highlights what we need from our IV:
Based on the above graph, we can see that the effect of instrument z on outcome y decomposes as: Effect of z on y = (Effect of z on x) × (Effect of x on y).
The thing we are interested in, the causal effect of x on y (the top arrow in the picture),
appears in the above equation. All we need to do is rearrange terms:
$$\text{Effect of } x \text{ on } y = \frac{\text{Effect of } z \text{ on } y}{\text{Effect of } z \text{ on } x}.$$
In the IV estimation, we calculate both of the effects on the right-hand side of this
equation by OLS. To get the estimator of our causal effect we need to take their ratio:
$$\widehat{\text{Effect of } z \text{ on } y} = \frac{\widehat{\operatorname{Cov}}(z, y)}{\widehat{\operatorname{Var}}(z)}, \qquad \widehat{\text{Effect of } z \text{ on } x} = \frac{\widehat{\operatorname{Cov}}(z, x)}{\widehat{\operatorname{Var}}(z)}.$$
Given instrument exogeneity and instrument relevance, you should be able to
prove the consistency of the IV estimator by application of the law of large
numbers. Plugging in the true model $y_i = \beta_0 + \beta_1 x_i + u_i$ allows us to rewrite $\hat\beta_{1,IV}$
as:
$$\hat\beta_{1,IV} = \beta_1 + \frac{\frac{1}{n}\sum_{i=1}^n (z_i - \bar z)(u_i - \bar u)}{\frac{1}{n}\sum_{i=1}^n (z_i - \bar z)(x_i - \bar x)} = \beta_1 + \frac{\widehat{\operatorname{Cov}}(z_i, u_i)}{\widehat{\operatorname{Cov}}(z_i, x_i)}.$$
Then:
$$\operatorname{plim} \hat\beta_{1,IV} = \beta_1 + \frac{\operatorname{plim}\big(\widehat{\operatorname{Cov}}(z_i, u_i)\big)}{\operatorname{plim}\big(\widehat{\operatorname{Cov}}(z_i, x_i)\big)} = \beta_1 + \frac{\operatorname{Cov}(z_i, u_i)}{\operatorname{Cov}(z_i, x_i)} = \beta_1.$$
The IV estimator cannot be unbiased when x and u are correlated. Showing this is
beyond the scope of the course, but the finite-sample bias of the IV estimator can
be quite substantial.
You should be familiar with the expression of the variance of the IV estimator
$\hat\beta_{1,IV}$, which is approximately:
$$\operatorname{Var}(\hat\beta_{1,IV}) \approx \frac{\sigma_u^2}{n \sigma_x^2 \rho_{x,z}^2}$$
where $\sigma_x^2 = \operatorname{Var}(x)$ and $\rho_{x,z} = \operatorname{Corr}(x, z)$. You should be able to use this formula to
discuss (i) why we should simply stick to OLS if x is not endogenous ($\rho_{x,z}^2 < 1$
when $z \neq x$), and (ii) why IV estimates can have large standard errors, especially if
x and z are only weakly correlated.
Activity 9.4 Take the MCQs related to this section on the VLE to test your
understanding.
Activity 9.5 This question investigates the effect of fertility on female labour
supply and income and uses the dataset same sex kids (from the 1980 US census).
We would like to answer the question: ‘How much does a woman’s hours worked and
labour income fall when she has additional children?’
and:
income1m = β0 + β1 more kids + u
where income1m is mom’s labour earnings in the year prior to the census, in
1995 dollars, hoursm is mom’s average hours worked per week, and more kids is
the dummy variable equal to 1 if mom had more than 2 children, and is equal to
0 otherwise.
Explain why these OLS regressions are inappropriate for estimating the causal
effect of fertility (more kids) on labour supply (hoursm) or labour income
(income1m).
(ii) The data set contains the variable same sex , which is equal to 1 if the first two
children are of the same sex (boy-boy or girl-girl) and equal to 0 otherwise. The
OLS estimation of more kids on same sex (with heteroskedasticity-robust
standard errors) gives:
$$\widehat{more\ kids} = .3557 + .0548\,same\ sex$$
with heteroskedasticity-robust standard errors (.0008) and (.0012), respectively;
n = 655,169, R² = .0032.
Are couples whose first two children are of the same sex more likely to have a
third child? Is the effect large? Is it statistically significant?
(iii) Explain why same sex is a valid instrument for the instrumental variable
regression of hoursm on more kids and also for the instrumental variable
regression of income1m on more kids.
(iv) The estimation of the regression of hoursm on more kids and of the regression
of income1m on more kids using same sex as an instrument gives the following
results (with heteroskedasticity-robust standard errors):
n = 655,169, R2 = .0068
and:
n = 655,169, R2 = .0077.
How large is the fertility effect on labour supply and how large is the fertility
effect on mom’s labour income?
Assume that σu = σx , so that the population standard deviation in the error term is
the same as it is in x. Suppose that the instrumental variable, z, is slightly
correlated with u: Corr (z, u) = .1. Suppose also that z and x have a somewhat
stronger correlation: Corr (z, x) = .2.
(ii) How much correlation would have to exist between x and u before OLS has
more asymptotic bias than the IV estimator?
where:
$$\hat u_i = y_i - \hat\beta_{0,IV} - \hat\beta_{1,IV}\,x_{i1} - \hat\beta_{2,IV}\,x_{i2}.$$
It is important to point out here that in order to solve for $(\hat\beta_0, \hat\beta_1, \hat\beta_2)$ we need at least
three conditions. This implies, for instance, that x1 cannot be our instrument for x2: in
that case the third moment condition in the IV estimation would be identical to the
second moment condition and, thus, we would effectively have a system of only two
moment conditions to estimate three parameters, and this system would have an infinity
of solutions! The requirement that the instrument be a variable not already included in
the equation is an additional assumption, called the exclusion restriction.
We should recall that another assumption instruments need to satisfy is the relevance
requirement. Indeed, we need the instrument z to be correlated with x2 , but the precise
sense in which these two variables must be correlated is complicated by the presence of
x1 in the regression equation. Formulated somewhat informally, we require:
x 2 = π0 + π 1 x 1 + π 2 z + v
where E(v) = 0, E(x1 v) = 0 and E(zv) = 0, and test whether π2 ≠ 0. The reduced form
expresses the endogenous variable as a linear function of all exogenous variables and an
error term. We can estimate the reduced form by OLS and use a t test to see if we can
reject that the coefficient on z is equal to zero (usually making it robust to
heteroskedasticity).
To summarise, the requirements on our instruments z for the model:
y = β 0 + β 1 x1 + β 2 x2 + u
where:
E(u) = 0, E(x1 u) = 0, E(x2 u) ≠ 0
are the following.
Our IV estimator uses the sample analogues of the moment conditions. This
identification is exact because we do not have any extra information (an extra moment
condition) we can afford to discard. If we consider any two of the three moment
conditions, we will no longer be able to identify parameters or, equivalently, uniquely
recover them.
Activity 9.7 Exercise 15.8 from Wooldridge (you may skip 15.8d).
In this section we discuss how to use the instrumental variable estimator in settings
where we have more instruments available than the number of endogenous regressors in
the model. In this setting we have overidentification. The situation of having the
same number of instruments as the number of endogenous regressors, considered thus
far, is called exact (or just) identification.
Consider, for instance, the regression model with one endogenous and one exogenous
regressor:
y = β 0 + β 1 x1 + β 2 x2 + u
where:
E(u) = 0, E(x1 u) = 0, E(x2 u) ≠ 0
that is, where regressor x1 is exogenous and regressor x2 is endogenous.
Suppose now that we have two variables – let’s call them z1 and z2 – that are different
from x1 (that is, satisfy the exclusion restriction) and satisfy the exogeneity
condition:
E(z1 u) = 0, E(z2 u) = 0.
If each of z1 and z2 is partially correlated with x2 , then we can take either of them (say,
z1 ) and implement the IV estimation we discussed previously. But this approach would
discard additional data z2 and the information from the exogeneity condition that
E(z2 u) = 0!
The 2SLS (Two Stage Least Squares) approach is a clever method that will use
both IVs z1 and z2 and combine them in an optimal way to estimate β1 . It will give an
estimator which is better than the IV estimator based on a single instrument (in the
sense of having a smaller variance).
To gain intuition, let us consider a simple model with one regressor x, which is
endogenous:
y = β0 + β1 x + u
where:
E(u) = 0, E(xu) ≠ 0
and assume we have two instruments for x: z1 and z2 that satisfy our usual
requirements:
In this overidentified setting, we should realise that the following IV estimators of the
slope parameter β1:
$$\hat\beta^{(1)}_{1,IV} = \frac{\widehat{\operatorname{Cov}}(z_1, y)}{\widehat{\operatorname{Cov}}(z_1, x)} \qquad \text{and} \qquad \hat\beta^{(2)}_{1,IV} = \frac{\widehat{\operatorname{Cov}}(z_2, y)}{\widehat{\operatorname{Cov}}(z_2, x)}$$
are both consistent. Both are the usual IV estimators in the simple regression model
that select either $z_1$ (for $\hat\beta^{(1)}_{1,IV}$) or $z_2$ (for $\hat\beta^{(2)}_{1,IV}$) as instrument for x. Question: Which one
should we use?
Since both are consistent, we need to look at the precision of the IV estimators. Based on
the expression for the variance, we should choose the estimator that uses the instrument
that has the higher correlation with x. We have:
$$\operatorname{Var}(\hat\beta^{(1)}_{1,IV} \mid x_1, \ldots, x_n, z_1) = \frac{\sigma_u^2}{n \sigma_x^2 \operatorname{Corr}(x, z_1)^2}, \qquad \operatorname{Var}(\hat\beta^{(2)}_{1,IV} \mid x_1, \ldots, x_n, z_2) = \frac{\sigma_u^2}{n \sigma_x^2 \operatorname{Corr}(x, z_2)^2}.$$
Important: This approach discards additional information, which is not a good idea!
We should use both! This is what the 2SLS approach will do! The basic idea is that
since each of z1 and z2 is uncorrelated with u, any linear combination is also
uncorrelated with u and, therefore, is a valid IV.
Optimality: To find the best IV, here, the 2SLS method will choose the linear
combination that has the highest correlation with the endogenous x (and thereby will
give the most precise IV estimator). Let us denote such a combination as $z^{opt}$. 2SLS
would then use:
$$\hat\beta^{(opt)}_{1,IV} = \frac{\widehat{\operatorname{Cov}}(z^{opt}, y)}{\widehat{\operatorname{Cov}}(z^{opt}, x)}.$$
The 2SLS estimator in this model will be obtained in two stages.
To summarise the results of 2SLS for the multiple regression model in the presence of
multiple endogenous explanatory variables, let us consider the following multiple
regression setting with two endogenous variables:
y = β0 + β1 x1 + β2 x2 + β3 x3 + u
where:
E(u) = 0, E(x1 u) = 0, E(x2 u) 6= 0, E(x3 u) 6= 0
that is, where regressor x1 is exogenous and x2 and x3 are endogenous. We assume that
we have three variables – let’s call them z1 , z2 , z3 – that are different from x1 (that is,
satisfy the exclusion restriction) and that satisfy the exogeneity and relevance
conditions. Again, this is the setting of overidentification. The 2SLS estimator in this
model will be obtained in two stages.
First stage: Estimate by OLS the reduced form equations for both endogenous
variables:
x2 = π2,0 + π2,1 x1 + π2,2 z1 + π2,3 z2 + π2,4 z3 + v2   (reduced form for x2)
x3 = π3,0 + π3,1 x1 + π3,2 z1 + π3,3 z2 + π3,4 z3 + v3   (reduced form for x3)
and compute the fitted values $\hat x_2$ and $\hat x_3$.
Each reduced form equation has all the instruments (z1 , z2 , z3 ) and all the
exogenous regressors (in our example it is just x1 ) on the right-hand side. Including
all the exogenous regressors on the right-hand side is very important. If this is not
done, then you are not doing 2SLS and you are not getting consistent estimators.
Second stage: Run the OLS regression of y on x1, $\hat x_2$ and $\hat x_3$ to obtain $\hat\beta_{0,2SLS}$,
$\hat\beta_{1,2SLS}$, $\hat\beta_{2,2SLS}$ and $\hat\beta_{3,2SLS}$.
Practical advice
(a) Do not do two stages on your own but let the software do it. This is because the
standard errors would be wrong if you tried to do 2SLS on your own due to the
second stage using ‘estimated’ values that have their own estimation error. This
error needs to be taken into account when calculating standard errors.
(b) All regressors that are not endogenous need to be included in the first stage. If this
is not done, then you are not doing 2SLS and you are not getting consistent
estimators. This is yet another reason to let statistical software do the 2SLS
estimation for you.
(c) Always report your first stage results (including R-squared). This helps to
analyse/test the relevance condition. It can also help us determine whether there
might be a weak IV problem (a weak instrument is an IV that does not explain
very much of the variation in the endogenous regressor).
Final comment: When we have exact (just) identification we can also conduct 2SLS.
In that case the 2SLS estimator will be identical to the IV estimator (intuitively, when
we have exact identification there is no need to look for the 'optimal' combination of
instruments).
Activity 9.8 Take the MCQs related to this section on the VLE to test your
understanding.
In this section we are shown how to test for the endogeneity of one or more of the
explanatory variables. While Hausman (1978) suggested a test based on directly
comparing the OLS and 2SLS estimates and testing whether the differences are
statistically significant, the test discussed in Wooldridge applies a simpler
regression-based test.
As shown in Section 15.5a this test involves running a regression where we include
residuals of the reduced form for the potentially endogenous explanatory variables as
additional regressors and tests whether their coefficients are statistically significant
(using asymptotic t or F tests).
To illustrate, suppose we have the following model, where we have two suspected
endogenous variables y2 and y3 :
y1 = β0 + β1 y2 + β2 y3 + β3 z1 + u1 .
As there are two endogenous variables, we need at least two additional exogenous
variables (instruments). Let us consider the exact identified setting, where we have two
additional exogenous variables z2 and z3 . The procedure is now as follows:
1. Estimate two reduced form equations (one for each endogenous variable):
y2 = π2,0 + π2,1 z1 + π2,2 z2 + π2,3 z3 + v2   (reduced form for y2)
y3 = π3,0 + π3,1 z1 + π3,2 z2 + π3,3 z3 + v3   (reduced form for y3)
and save the residuals of these regressions, respectively v̂2 and v̂3.
2. Add the reduced form residuals to the model:
y1 = β0 + β1 y2 + β2 y3 + β3 z1 + δ1 v̂2 + δ2 v̂3 + error
and test H0: δ1 = δ2 = 0 using an asymptotic t or F test.
Intuitively, when both δ1 = 0 and δ2 = 0 the result indicates that we can run the
original regression without controlling for the potential endogeneity of y2 and y3 (we can
exclude these controls). We simply should stick to OLS in this case.
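A regression-based version of this test can be sketched in R as follows (the data frame and variable names are hypothetical and match the setup above; car::linearHypothesis carries out the joint F test):

    library(car)
    rf2 <- lm(y2 ~ z1 + z2 + z3, data = dat)   # reduced form for y2
    rf3 <- lm(y3 ~ z1 + z2 + z3, data = dat)   # reduced form for y3
    aug <- lm(y1 ~ y2 + y3 + z1 + resid(rf2) + resid(rf3), data = dat)
    linearHypothesis(aug, c("resid(rf2) = 0", "resid(rf3) = 0"))  # H0: no endogeneity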
n = 4,361, R2 = 0.5687.
Interpret the estimates. In particular, holding age fixed, what is the estimated
effect of another year of education on fertility? If 100 women receive another
year of education, how many fewer children are they expected to have?
(ii) The variable frsthalf is a dummy variable equal to one if the woman was born
during the first six months of the year. Assuming that frsthalf is uncorrelated
with the error term from part (i), how would you show that frsthalf is a
reasonable IV candidate for educ?
(iii) Estimating the model from part (i) by using frsthalf as an IV for educ gives:
Compare the estimated effect of education with the OLS estimate from part (i).
(iv) When we add binary variables electric, tv , and bicycle to the equation and
estimate it by OLS, we obtain:
n = 4,361, R2 = 0.5761.
(i) Let dist be the distance from the student’s living quarters to the lecture hall.
Assuming that dist and u are uncorrelated, what other assumptions must dist
satisfy to be a reliable IV for atndrte?
(ii) Suppose we add the interaction term priGPA × atndrte, giving the model:
If E(u | priGPA, ACT, dist) = 0, as happens when priGPA, ACT, and dist are all
exogenous, what might be a good IV for priGPA × atndrte?
[Hint: If E(u | priGPA, ACT, dist) = 0, then any function of priGPA and dist is
uncorrelated with u.]
(iii) Given your choice of IV for priGPA × atndrte, write down the first stage
equation(s) you would estimate as part of 2SLS estimation. How many first
stage equations do you have?
Are you in the situation of exact identification or in the situation of
overidentification?
9.3 Answers to activities
Here we show how to implement IV and 2SLS using the command ivreg in R. We also
provide an implementation of the test for endogeneity.
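A minimal sketch of what that implementation might look like, using simulated data in place of the actual dataset (all names and parameter values here are illustrative assumptions, mimicking the PC-ownership example below):

```r
# IV/2SLS with ivreg() from the AER package, on simulated data
# (grant is a randomly assigned instrument for PC ownership).
library(AER)
set.seed(1)
n      <- 1000
grant  <- rbinom(n, 1, 0.3)              # instrument: randomly assigned
faminc <- rnorm(n)                       # unobserved, lives in the error
PC     <- as.numeric(0.5 * grant + 0.5 * faminc + rnorm(n) > 0)
GPA    <- 2.8 + 0.2 * PC + 0.4 * faminc + rnorm(n, sd = 0.3)

ols <- lm(GPA ~ PC)                      # inconsistent: PC correlated with faminc
coef(ols)

iv <- ivreg(GPA ~ PC | grant)            # regressors before '|', instruments after
summary(iv, diagnostics = TRUE)          # reports weak-instrument F, Wu-Hausman
                                         # (endogeneity) and Sargan tests
```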
(i) It has been fairly well established that socioeconomic status affects student
performance.
The error term u contains, among other things, family income, which has a positive
effect on GPA and is also very likely to be correlated with PC ownership.
(ii) Families with higher incomes can afford to buy computers for their children. Therefore, family income certainly satisfies the relevance requirement for an instrumental variable: it is correlated with the endogenous explanatory variable (take x = PC and z = faminc in the relevance condition). However, since faminc itself appears in the error term u, it violates instrument exogeneity and cannot be used as an IV.
(iii) This is a natural experiment that affects whether or not some students own
computers. Some students who buy computers when given the grant would not
have without the grant. (Students who did not receive the grants might still own
computers.)
Define a dummy variable, grant, equal to 1 if the student received a grant, and 0
otherwise. This is our candidate instrumental variable z (in our model x = PC and
y = GPA).
• Then, if z = grant was randomly assigned, it is uncorrelated with u. In
particular, it is uncorrelated with family income and other socioeconomic
factors in u.
• Further, z = grant should be correlated with x = PC : the probability of
owning a PC should be significantly higher for a student receiving grants.
• Incidentally, if the university gave grant priority to low-income students, grant
would be negatively correlated with u, and IV would be inconsistent.
(i) Both of the above OLS regression models suffer from a possible omitted variable
bias.
• More educated women may both work more and be less likely to have an
additional child than less educated women.
• More educated women may both earn more and be less likely to have an
additional child than less educated women.
199
9. Instrumental variable estimation and two-stage least squares
• This would imply that morekids is negatively correlated with the regression error in each of these regressions, so that the OLS estimator of the parameter corresponding to morekids is biased and inconsistent.
(ii) These estimates show that women in 1980 with same-sex children are estimated to be 5.48 percentage points more likely to have a third child (we give the percentage-point interpretation because the dependent variable is binary).
• Couples with two children of the same sex may want to have an additional
child hoping for a different sex of the third child.
• The family’s budget constraint is affected – children of the same sex can more
easily share a room, clothes, and toys.
• The effect is statistically significant at any conventional significance level:
t statistic = 45.652.
(iii) samesex has no direct effect on either labour supply or labour earnings. At the same time, samesex is random, so it is plausible that it is unrelated to any of the variables in the error term of the labour supply equation and the labour earnings equation. Thus, the instrument is exogenous. There is no way to test whether samesex is exogenous.
(iv) The estimates in the first equation suggest that women with more than 2 children
on average work 3.5 fewer hours per week than women with 2 or fewer children.
The estimates in the second equation suggest that women with more than 2
children on average earn $766 less per year than women with 2 or fewer children.
It is useful to recall:

plim(β̂1,OLS) = β1 + Cov(x, u)/Var(x) = β1 + (σu/σx) · Corr(x, u)

plim(β̂1,IV) = β1 + Cov(z, u)/Cov(z, x) = β1 + (σu/σx) · Corr(z, u)/Corr(z, x).

Taking into account that σu = σx, we can write:
(i) A few examples include family income and background variables, such as parents’
education.
(ii) The population model is:
score = β0 + β1 girlhs + β2 faminc + β3 meduc + β4 feduc + u
where the names of the variables are self-explanatory.
(iii) Parents who are supportive and motivated to have their daughters do well in school
may also be more likely to enrol their daughters in a girls’ high school. It seems
likely that girlhs and u are correlated.
(iv) Let numghs be the number of girls’ high schools within a 20-mile radius of a girl’s
home. To have reliable IV estimation, numghs must satisfy three requirements: it
must be uncorrelated with u (instrument exogeneity), it must be partially
correlated with girlhs (instrument relevance), and it must satisfy the exclusion
restriction.
The exclusion restriction holds as numghs does not itself appear as a regressor in the structural model. The second requirement (instrument relevance) probably holds and can be tested by estimating:
be tested by estimating:
girlhs = π0 + π1 faminc + π2 meduc + π3 feduc + π4 numghs + v
and testing H0 : π4 = 0 (testing numghs for statistical significance).
The first requirement (instrument exogeneity) is more problematic for the following
reasons.
Girls’ high schools tend to locate in areas where there is a demand, which can
reflect the seriousness with which people in the community view education.
Some areas of a state have better students on average for reasons unrelated to
family income and parents’ education, and these reasons might be correlated
with numghs.
One way to deal with these issues, and so achieve instrument exogeneity, is to include in the regression model community-level variables that control for differences across communities.
(i) Another year of education, holding age fixed, results in about .091 fewer children.
In other words, for a group of 100 women, if each gets another year of education,
they collectively are predicted to have about nine fewer children.
(ii) The way to show this is to estimate the reduced form for educ:
(i) The variable dist satisfies the exclusion restriction as it does not appear in the structural model. Therefore, the only condition we need to check is that dist is partially correlated with atndrte (instrument relevance). More precisely, consider the reduced form equation for atndrte, which is:
(ii) The assumption E(u | priGPA, ACT , dist) = 0 implies that u and dist are uncorrelated, so at the very least dist satisfies instrument exogeneity.
For priGPA × atndrte we can consider the instrument priGPA × dist.
Under the exogeneity assumption that E(u | priGPA, ACT , dist) = 0, any
function of priGPA, ACT , and dist is uncorrelated with u (by the law of
iterated expectations). In particular, we may argue that the interaction priGPA × dist is uncorrelated with u. Let us show this:

E((priGPA × dist) · u) = E[(priGPA × dist) · E(u | priGPA, ACT , dist)] = 0.
If dist is partially correlated with atndrte, then priGPA × dist is also partially
correlated with priGPA × atndrte. So, we can estimate the equation:
(iii) Since we now have two endogenous regressors, we have to estimate two first stage equations. Both equations include the two instruments (dist and priGPA · dist) as well as the exogenous regressors priGPA and ACT that appear in the original model.
First stage equation (or reduced form) for atndrte:

atndrte = π1,0 + π1,1 priGPA + π1,2 ACT + π1,3 dist + π1,4 priGPA · dist + v1 .

First stage equation (or reduced form) for priGPA · atndrte:

priGPA · atndrte = π2,0 + π2,1 priGPA + π2,2 ACT + π2,3 dist + π2,4 priGPA · dist + v2 .
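As a sketch of how this 2SLS estimation could be run in R, the outcome variable (score) and all numbers below are invented for illustration; ivreg handles both first stages internally:

```r
# 2SLS with an interaction instrument; simulated data, illustrative values.
library(AER)
set.seed(7)
n       <- 680
priGPA  <- rnorm(n, 2.6, 0.5)
ACT     <- rnorm(n, 22, 3)
dist    <- rexp(n, 1)                         # distance to the lecture hall
u       <- rnorm(n)
atndrte <- 60 + 5 * priGPA + 0.5 * ACT - 2 * dist + 10 * u + rnorm(n)
score   <- 1 + 0.02 * atndrte + 0.5 * priGPA + 0.05 * ACT +
           0.01 * priGPA * atndrte + u        # hypothetical outcome

# All exogenous regressors plus both instruments appear after the '|'
fit <- ivreg(score ~ atndrte + priGPA + ACT + priGPA:atndrte |
               priGPA + ACT + dist + priGPA:dist)
summary(fit)
```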
9.4 Overview of chapter

Key terms and concepts:

Instrument/instrumental variable
Instrument relevance
Exclusion restriction
Exact identification
Overidentification
describe the principles underlying the use of instrumental variables in both simple
and multiple regression models
9.5 Test your knowledge and understanding

9.1E An article in the Journal of Banking & Finance studies the causal effect of board size on the performance of small- and medium-sized firms.
Suppose you are given a rich dataset of a large number of closely held corporations,
including information on board size, several measures of firm performance, and a
number of other firm characteristics that impact performance and are correlated
with board size. Assume board size is binary (i.e. there are large board and small
board companies).
The paper suggests using the number of children of the chief executive officer
(CEO) of the firms as an instrument for board size. What is the first stage and
what sign would you expect for the key coefficient in the first stage? What
assumptions should a good instrument satisfy? Discuss whether this is a good
instrument.
9.2E Let us consider some simulated data on the relation between number of police
(POLICE ) and crime (CRIME ). Some of the data refer to election years
(ELECTION = 1), the other data to non-election years (ELECTION = 0). We
want to estimate the causal effect of the number of police on crime, β, using the
model:
CRIME = α + βPOLICE + u.
The sample size is 300.
(a) We are worried that POLICE is correlated with other factors in u. Discuss the
properties of the OLS estimator βb in light of this problem. Prove any claim
you make.
(b) It is argued that the election dummy ELECTION could serve as an
instrument. Discuss what conditions instruments should satisfy and whether
they are reasonable in this case (contrast this with the use of a variable like
‘Police Department Funding’). Can we test some of these requirements, and if
so how? To help you answer this question, suppose you are told that the OLS
estimation of POLICE on ELECTION gives the following results:
\widehat{POLICE} = 9.8475 + 1.0599 ELECTION
                   (.1679)   (.3358)
(c) Derive the formula for our IV estimator of β in this case and discuss the
properties of the estimator assuming ELECTION is a valid instrument. Prove
any claim you make.
(d) Interpret the following estimated IV results:
(e) Say I had used another valid instrument instead of ELECTION . How should I
judge which of the two IV estimators I should use? Is there a better estimator
than either one of them?
(f) Discuss how you would proceed to test for endogeneity of POLICE .
9.3E The National Health Service (NHS) is trying to assess the effect of maternal
smoking (Si ) during pregnancy on birthweight (BW i ) by estimating the model:
BW i = β0 + β1 Si + ui .
Five years ago, the NHS mailed a brochure about the potential dangers of smoking
during pregnancy to all the women in Bristol aged 20 to 35 whose birthdays were
on even days of the year. Among women giving birth in Bristol, the NHS linked
information on who received a brochure in the post (Ri = 1 indicates a woman who
received the brochure and 0 otherwise) to their records on which of these women
gave birth during the five-year period, and data on birthweight (let BWi indicate
the birthweight in grams) of their children for those who did. The NHS has no
information on the smoking behaviour of mothers in Bristol.
9.4E An economist wants to estimate the parameters of the (log) Cobb–Douglas production function:

ln Yi = β0 + βL ln Li + βK ln Ki + ui .
(a) Assume that you have a cross-section of independent firms and that more productive firms hire fewer workers (labour). Explain why OLS would not provide consistent estimates of (β0 , βL , βK ). Would it overestimate or underestimate βL on average? Clearly explain your answer.
(b) Instead of applying OLS, the economist decides to use the average wage
paid by firm i, Wi , as an instrument for the (log) quantity of labour
employed by that firm, ln Li .
Describe in detail how you would estimate the parameters of the production function using two stage least squares (2SLS). What restrictions would be necessary for this researcher to successfully use this instrument?
9.2E (a) The problems associated with endogeneity are bias and inconsistency. You should be able to prove these claims.
(b) We require two conditions for ELECTION to qualify as an instrumental variable:

Cov(ui , ELECTIONi ) = 0 (validity): ELECTION is uncorrelated with the error

Cov(ELECTIONi , POLICEi ) ≠ 0 (relevance).

The second condition states that ELECTION should have a direct effect on POLICE. It can be tested from the reduced form estimation given
above. The realisation of the t statistic for the null hypothesis of no relevance (H0 : π1 = 0) is:

t = 1.0599/0.3358 = 3.1563.
This exceeds the two-sided 1% critical value (2.576). Therefore, we reject the null
hypothesis of no relevance and conclude that the instrument is relevant.
The validity condition cannot be tested.
Aside: Consider now a variable like FUNDING which is short for police
department funding. Is this a good instrument? Clearly it satisfies
relevance: we expect more police funding in general to lead to more police
officers. What about validity? Now it doesn’t seem very likely that
funding directly affects crime, i.e. directly enters ui . On the other hand,
however, it might be correlated with other variables in ui – for example,
unemployment, city size, etc. For instance, places with higher unemployment tend to be poorer, which could adversely affect police funding; at the same time, unemployment has a direct effect on crime. This would violate validity.
(c) For the ease of notation, let yi ≡ CRIME i , xi ≡ POLICE i and
zi ≡ ELECTIONi . The IV estimators of the intercept (α̂IV ) and the slope (β̂IV ) solve the following system of moment conditions:

∑_{i=1}^n (yi − α̂IV − β̂IV xi ) = 0      (2.1)

∑_{i=1}^n (yi − α̂IV − β̂IV xi ) zi = 0.     (2.2)
Start from equation (2.1) and solve for α̂IV to obtain:

α̂IV = ȳ − β̂IV x̄      (2.3)

and substitute this into equation (2.2), giving:

β̂IV = ∑_{i=1}^n (yi − ȳ)(zi − z̄) / ∑_{i=1}^n (xi − x̄)(zi − z̄).      (2.4)
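A small sketch of this estimator in R, on simulated data (the data-generating process is invented for illustration):

```r
# Computing the IV estimates 'by hand' from equations (2.3) and (2.4).
set.seed(3)
n <- 300
z <- rbinom(n, 1, 0.5)                 # e.g. an election-year dummy
e <- rnorm(n)
x <- 10 + z + 2 * e + rnorm(n)         # endogenous: depends on the error e
y <- 5 - 0.5 * x + 3 * e + rnorm(n)

b_iv <- sum((y - mean(y)) * (z - mean(z))) /
        sum((x - mean(x)) * (z - mean(z)))    # equation (2.4)
a_iv <- mean(y) - b_iv * mean(x)              # equation (2.3)
c(a_iv, b_iv)

cov(z, y) / cov(z, x)   # equivalent ratio-of-covariances form of (2.4)
```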
Substituting yi = α + βxi + εi into (2.4) yields:

β̂IV = β + ∑_{i=1}^n (zi − z̄)εi / ∑_{i=1}^n (xi − x̄)(zi − z̄).

Divide the numerator and denominator of the last term by n and take probability limits:

plim β̂IV = β + plim(n⁻¹ ∑_{i=1}^n (zi − z̄)εi ) / plim(n⁻¹ ∑_{i=1}^n (xi − x̄)(zi − z̄)).

By the law of large numbers, the numerator converges to Cov(zi , εi ) = 0 (validity) and the denominator to Cov(xi , zi ) ≠ 0 (relevance), hence:

plim β̂IV = β.
Therefore, the IV estimator is consistent for β.
Notice, however, that the IV estimator is not unbiased. To get unbiasedness we would need E(εi | x1 , . . . , xn , z1 , . . . , zn ) = 0, but the fact that Cov(xi , εi ) ≠ 0 (i.e. xi is endogenous) implies E(εi | x1 , . . . , xn , z1 , . . . , zn ) ≠ 0, so:

E(β̂IV | x1 , . . . , xn , z1 , . . . , zn ) ≠ β.
(f) To test for the endogeneity of POLICE we run the regression:

CRIMEi = α + β POLICEi + δ v̂i + εi

where the v̂i are the residuals obtained from the reduced form estimation (in our case, OLS of POLICE on the instrument ELECTION and all exogenous regressors in the model, if we had any). After performing this regression we want to test H0 : δ = 0 against H1 : δ ≠ 0 and can use δ̂/se(δ̂) as usual! The inclusion of the reduced form residual ‘controls’ for the endogeneity of POLICE.
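A sketch of this test in R (again with an invented data-generating process):

```r
# Regression-based test for the endogeneity of POLICE.
set.seed(3)
n <- 300
ELECTION <- rbinom(n, 1, 0.5)
e        <- rnorm(n)
POLICE   <- 10 + ELECTION + 2 * e + rnorm(n)
CRIME    <- 5 - 0.5 * POLICE + 3 * e + rnorm(n)

vhat <- resid(lm(POLICE ~ ELECTION))   # reduced form residuals
cf   <- lm(CRIME ~ POLICE + vhat)      # add residuals to the structural equation
summary(cf)   # t test on vhat: rejecting delta = 0 signals endogeneity
```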
9.4E (a) The problem with using OLS on the above regression is that there will be correlation between the error ui and the regressor ln Li , as more productive firms (higher ui ) are associated with hiring fewer workers (lower Li ). This correlation will make the parameter estimates inconsistent. The problem is a result of the omission of relevant variables, which creates omitted variable bias (OVB). The parameter estimate for βL will be an underestimate on average, as it captures the fact that firms with higher technological or managerial efficiency require less labour for the same output.
(b) Two-stage least squares (2SLS) proceeds in two steps. In the first stage, ln Li is regressed on all other exogenous regressors and the instrument, which provides the fitted values of ln Li (isolating the exogenous variation in labour):

\widehat{ln Li} = π̂0 + π̂1 Wi + π̂2 ln Ki .

In the second stage, the original regression is run with the fitted values \widehat{ln Li} from the first stage used in place of the original regressor ln Li . We run:

ln Yi = β0 + βL \widehat{ln Li} + βK ln Ki + ui .
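A hypothetical sketch of the two stages on simulated firm data; note that running the second stage manually gives the correct point estimates but incorrect standard errors, which is why in practice one uses a 2SLS routine such as ivreg:

```r
library(AER)
set.seed(11)
n   <- 400
lnK <- rnorm(n)
u   <- rnorm(n)                      # unobserved productivity
W   <- 3 + rnorm(n)                  # firm wage, assumed uncorrelated with u
lnL <- 1 + 0.5 * W + 0.3 * lnK - 0.4 * u + rnorm(n)   # endogenous
lnY <- 0.7 * lnL + 0.3 * lnK + u

first  <- lm(lnL ~ W + lnK)                   # stage 1
second <- lm(lnY ~ fitted(first) + lnK)       # stage 2 (manual)
coef(second)
coef(ivreg(lnY ~ lnL + lnK | W + lnK))        # same point estimates, valid SEs
```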
(c) If there is not much variation in wages, the instrument is likely to be only weakly (partially) correlated with ln Li . We then face the weak instrument problem: a weak first stage (ln Li is not predicted well by the instrument and the other exogenous regressors), resulting in imprecise 2SLS estimators. The formula for the variance of the IV estimator clearly indicates that a low correlation between instrument and regressor increases the imprecision of the IV estimator.
Chapter 10
Measurement error
10.1 Introduction
As long as measurement error in the dependent variable is not systematically related to any of the explanatory variables, OLS remains unbiased and consistent. Measurement error will, however, render the estimator less precise (increase its variance). In Section 9.4b you learn that measurement error in the
independent variable(s) will typically render OLS invalid (inconsistent) because it will
give rise to the problem of endogeneity (correlation between the error and the regressor
that is measured with error). All parameters will be affected by the presence of
measurement error in one or more of the explanatory variables. For the regression
model, it is shown how classical measurement error in the explanatory variable will give
rise to attenuation bias (whereby the estimated OLS effect, on average, is closer to zero
than the true effect). When considering the effect of measurement error on the
properties of OLS, it is common to make the classical errors-in-variables (CEV)
assumption whereby the measurement error is uncorrelated with the unobserved
variable, the other explanatory variables, and the original error term.
In Section 15.4 in Wooldridge, we consider an implementation of the Instrumental
Variable estimator to deal with the endogeneity problem associated with measurement
error in the explanatory variables.
The presence of measurement error may give rise to the endogeneity problem we just
discussed which would render the use of OLS undesirable (inconsistent). As discussed in
Section 9.4 in Wooldridge, we can have measurement error in the dependent and/or
independent variables.
The consequences for the OLS estimator will depend critically on the assumptions you
are willing to make about the measurement error. It is common to impose the classical
errors-in-variables (CEV) assumption whereby the measurement error is assumed to be
uncorrelated with the unobserved variable, the (other) explanatory variables, and the
original error term. These assumptions are not necessarily reasonable (and you should
be able to explain this), but under this assumption we have the following clear
implications associated with the presence of measurement errors.
Under CEV of the dependent variable, OLS will remain consistent, but less precise
(larger variance).
Under CEV of a single independent variable, OLS will be inconsistent (all
parameters). The parameter on the variable measured with error will exhibit
attenuation bias.
You should be able to prove these properties in the simple regression model setting only.
When more than one independent variable is measured with error (CEV), the OLS
estimator is inconsistent and little can be said about the direction of the inconsistency.
Activity 10.1 Take the MCQs related to this section on the VLE to test your
understanding.
As discussed in Section 15.4 in Wooldridge, one way to resolve the endogeneity problem associated with classical measurement error in an explanatory variable is to use a second measurement of the same explanatory variable as an instrument. It is important to recognise that this strategy assumes the measurement errors in the two measures are uncorrelated.
Precise data on colGPA, hsGPA, SAT are easy to obtain (using students’
transcripts). Students, however, report the family income as famincs, which provides
a measurement of faminc ∗ containing measurement error. We assume:
(i) Provide the estimable model and show that the error term in the new equation,
say, vi , is negatively correlated with famincs if β1 > 0. What does this imply
about the OLS estimator of β1 from the above regression?
(ii) Say you are able to get the parents to provide another measure of family income, famincp. Let us assume that:
(iv) Use the results in (ii) and (iii) to propose an IV estimator for our β parameters.
tvhours = tvhours ∗ + e
where the measurement error e has zero mean and is uncorrelated with tvhours ∗
and each explanatory variable in the equation.
Note that for OLS to consistently estimate the parameters, we do not need e to be
uncorrelated with tvhours ∗ .
(ii) No, the CEV assumptions are unlikely to hold in this example.
First: e and tvhours ∗ are likely to be correlated.
• For children who do not watch TV at all, tvhours ∗ = 0, and it is very likely
that reported TV hours is zero. So, if tvhours ∗ = 0, then e = 0 with high
probability.
• For children who do watch TV, tvhours ∗ > 0. Here the measurement error can
be positive or negative. However, since tvhours ≥ 0, e must satisfy
e ≥ −tvhours ∗ .
(i) Using faminc ∗i = famincs i − esi , the estimable model can be obtained as:
(iii) famincs i and famincp i are correlated as both measure faminc ∗i . We have:
(iv) Our results show that we can use famincp as an instrument for famincs. In particular, E(famincpi vi ) = 0 ensures the validity of our instrument, and Cov(famincsi , famincpi ) ≠ 0 ensures its relevance. We can implement this estimator using 2SLS.
• Step 1: Obtain the fitted values \widehat{famincs}i from a regression of famincsi on hsGPAi , SATi and the instrument famincpi .

• Step 2: Estimate by OLS:

colGPAi = β0 + β1 \widehat{famincs}i + β2 hsGPAi + β3 SATi + νi .
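A runnable sketch of this two-measures strategy (the data-generating process and coefficient values are invented):

```r
# Using a second, independently mismeasured report as an instrument.
library(AER)
set.seed(5)
n       <- 500
faminc  <- rnorm(n)                       # true, unobserved family income
famincs <- faminc + rnorm(n, sd = 0.5)    # student report (with error)
famincp <- faminc + rnorm(n, sd = 0.5)    # parent report (independent error)
hsGPA   <- rnorm(n, 3, 0.4)
SAT     <- rnorm(n, 1100, 100)
colGPA  <- 1 + 0.3 * faminc + 0.4 * hsGPA + 0.001 * SAT + rnorm(n, sd = 0.3)

# famincp instruments famincs; hsGPA and SAT act as their own instruments
fit <- ivreg(colGPA ~ famincs + hsGPA + SAT | famincp + hsGPA + SAT)
summary(fit)
```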
10.5 Test your knowledge and understanding
10.1E (a) The CEV assumptions require the measurement error to be uncorrelated with
the unobserved explanatory variable, Cov (e, x∗1 ) = E(ex∗1 ) = 0, and to be
uncorrelated with the error in the original equation, Cov (e, u) = E(eu) = 0.
The assumption that the measurement error is uncorrelated with the
unobserved explanatory variable is very strong, and in many cases unlikely to
be true. It assumes that the measurement error is purely random, such as with
erroneous data entry.
(b) The true model is yi = βx*i + ui , but we observe xi = x*i + ei . Substituting gives the equation we estimate:

yi = βxi + vi      (*)

where we denote the composite error term vi = ui − βei . We need to show that plim(β̂) ≠ β, where:

β̂ = ∑_{i=1}^n xi yi / ∑_{i=1}^n xi² = β + ∑_{i=1}^n xi vi / ∑_{i=1}^n xi²   (plug in (*))

so that:

plim(β̂) = β + E(xi vi )/E(xi²)

where the last step applies the law of large numbers (which ensures that sample averages converge in probability to their population analogues).
Using the CEV assumptions and the Gauss–Markov assumption that E(x*i ui ) = 0, we can show:

E(xi²) = E[(x*i + ei )²] = E[(x*i )²] + Var(ei )

and:

E(xi vi ) = E[(x*i + ei )(ui − βei )] = −β Var(ei ).

Combining results yields:
plim(β̂) = β (1 − Var(ei ) / (E[(x*i )²] + Var(ei ))).

As:

0 < 1 − Var(ei ) / (E[(x*i )²] + Var(ei )) < 1
we establish the attenuation bias result.
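A quick simulation illustrating the attenuation bias (the true slope below is set to 1, and the variances are chosen so the theoretical plim is 0.5):

```r
# Classical errors-in-variables: the OLS slope shrinks towards zero.
set.seed(9)
n     <- 10000
xstar <- rnorm(n)                    # true regressor, Var = 1
e     <- rnorm(n)                    # classical measurement error, Var = 1
x     <- xstar + e                   # observed, mismeasured regressor
y     <- xstar + rnorm(n)            # true slope = 1

coef(lm(y ~ x))["x"]                 # close to 0.5, not 1
# Theoretical plim: 1 * Var(xstar) / (Var(xstar) + Var(e)) = 0.5
```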
Chapter 11
Simultaneous equation models
11.1 Introduction
understand the need for supply (demand) shifters to identify the demand (supply)
equation graphically
explain the concepts of a structural equation and a reduced form equation in the
simultaneous equations framework
explain the problem of simultaneity bias when OLS is applied to a single structural
equation in isolation
derive the correlation between the jointly determined explanatory variable and the
structural equation error
derive an expression for the large-sample simultaneity bias in the slope coefficients
when OLS is used to fit a simple regression in a simultaneous equations model
11.2 Content of chapter
Simultaneity arises when some of the regressors are jointly determined with the
dependent variable in the same economic model. A typical example is that of a market
equilibrium, where we consider the simultaneous equations model (SEM) given by:
qi = α1 pi + β1 z1i + u1i (the supply function)
qi = α2 pi + β2 z2i + u2i (the demand function).
We observe a random sample of qi (quantity), pi (price), z1i (observable supply shifter),
z2i (observable demand shifter), and (qi , pi ) is the equilibrium outcome. Each equation
has a ceteris paribus, causal interpretation, representing behavioural relations; we refer
to them as the structural equations of a SEM. The error terms, u1i and u2i , are
called the structural errors.
Variables jointly determined in this model are called the endogenous variables
(in this example qi and pi ).
Variables determined outside this model are called exogenous variables (in this
example z1i and z2i ).
Without the inclusion of z1i and z2i in this model, there is no way to tell which equation is supply and which is demand (this is the identification problem). Similarly, if z1i = z2i then the two equations are indistinguishable and there is no hope of estimating either the supply or the demand equation. This indicates that to be able to estimate supply or demand, the two equations must contain different exogenous variables. This is illustrated graphically later.
You should recognise when it is inappropriate to suggest the use of a SEM.
Activity 11.1 Take the MCQs related to this section on the VLE to test your
understanding.
Estimation of structural equations by OLS will typically give rise to bias and inconsistency. The reason for this is that the explanatory variable that is determined simultaneously with the dependent variable is typically correlated with the error term, that is, it is endogenous. We refer to this as the simultaneity bias of OLS.
The key ingredient we use to prove this is the reduced form. The reduced form
expresses the endogenous variables in terms of the exogenous variables in the model and
the structural errors. You should be able to derive the reduced forms of the endogenous
variables from a two-equation structural model. Let us here consider the market
equilibrium SEM model, given by:
q = α1 p + β1 z1 + u1 (the supply function)
q = α2 p + β2 z2 + u2 (the demand function).
α1 p + β1 z1 + u1 = α2 p + β2 z2 + u2
(α1 − α2 )p = −β1 z1 + β2 z2 + u2 − u1 .
Since α1 ≠ α2 (the supply and demand slopes differ), we can divide this equation by α1 − α2 to obtain:

p = (−β1/(α1 − α2)) z1 + (β2/(α1 − α2)) z2 + (u2 − u1)/(α1 − α2).
The resulting equation is called the reduced form equation for p, which is:
p = πp1 z1 + πp2 z2 + vp
where πp1 and πp2 are the reduced form parameters (nonlinear functions of the structural
parameters) and vp is the reduced form error, which is a linear function of u1 and u2 .
To obtain the reduced form for q, we can plug this equation into either the supply or demand function. Alternatively, we can use the following approach. Let us multiply the supply equation by α2 , the demand equation by α1 , and take their difference:

α2 q = α2 α1 p + α2 β1 z1 + α2 u1
α1 q = α1 α2 p + α1 β2 z2 + α1 u2

(α2 − α1 )q = α2 β1 z1 − α1 β2 z2 + α2 u1 − α1 u2 .
Dividing by α2 − α1 gives the reduced form for q. Crucially, the reduced form errors are functions of the structural errors, so the regressor p in the supply equation is correlated with u1 : Cov(p, u1 ) = Cov(vp , u1 ), where the equality uses the properties of the covariance operator and the exogeneity of z1 and z2 . Indeed, using the definition of vp we obtain (say u1 and u2 are uncorrelated):

Cov(vp , u1 ) = Cov((u2 − u1 )/(α1 − α2 ), u1 ) = (1/(α1 − α2 )) Cov(u2 − u1 , u1 )
             = (1/(α1 − α2 )) (Cov(u2 , u1 ) − Cov(u1 , u1 ))
             = −σ1²/(α1 − α2 )
             ≠ 0

where σ1² = Var(u1 ).
The same reasoning reveals that the demand equation suffers from simultaneity bias as
well.
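A simulation sketch of this simultaneity bias, and its 2SLS fix, under illustrative parameter values:

```r
# Supply-demand SEM: OLS on the supply equation is biased; 2SLS is not.
library(AER)
set.seed(13)
n  <- 1000
z1 <- rnorm(n); z2 <- rnorm(n)        # supply and demand shifters
u1 <- rnorm(n); u2 <- rnorm(n)
a1 <- 1; a2 <- -1; b1 <- 1; b2 <- 1   # alpha1, alpha2, beta1, beta2

p <- (-b1 * z1 + b2 * z2 + u2 - u1) / (a1 - a2)   # equilibrium price
q <- a1 * p + b1 * z1 + u1                        # equilibrium quantity

coef(lm(q ~ p + z1))                  # OLS on supply: biased away from a1 = 1
coef(ivreg(q ~ p + z1 | z1 + z2))     # 2SLS with the demand shifter z2 as IV
```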
Activity 11.2 Take the MCQs related to this section on the VLE to test your
understanding.
Activity 11.4 Let us consider the simultaneous equation model (SEM) for murder
rates and the size of the police force. Let the SEM be given by:
murdpc = α0 + α1 polc + u1
polc = β0 + β1 murdpc + β2 election + u2
where election is an exogenous variable (determined outside the model) and murdpc
and polc are endogenous. Let Var(u1 ) = σ1², Var(u2 ) = σ2², and assume that Cov(u1 , u2 ) = 0.
(ii) Show that the endogenous variable polc is correlated with u1 using your answer in (i).
(iv) Discuss whether the simultaneity bias in the OLS estimator is upward or
downward and interpret this result.
Let us discuss here what the identification problem is when we consider a demand and
supply model where there are no demand/supply shifters, i.e. consider:
q = α0 + α1 p + u1 (supply)
q = β0 + β1 p + u2 (demand)
In this model the only observable variables are quantity (q) and prices (p), and we need
to recognise that this data is only informative about the equilibrium price and quantity
(πp , πq ).
The data is not informative about the slopes (or intercepts) of the demand and supply equations.
Next, let us discuss what the identification problem is when we consider a demand and
supply model where we only have a supply shifter, i.e. consider:
q = α0 + α1 p + α2 z + u1 (supply)
q = β0 + β1 p + u2 (demand)
where (u1 , u2 ) are i.i.d. error terms with zero mean and z is an exogenous supply shifter.
Do we also have an identification problem here?
In this model, the observable variables are quantity (q), prices (p), and a supply shifter
(z). Here the data is informative about the equilibrium price and quantity for different
values for z. Since the demand function does not depend on the supply shifter, this
permits us to identify (trace out) the demand equation.
We cannot identify the supply equation as there are many supply functions (say S and S′) that we cannot distinguish given the observable data. Note: the slope of the supply function stays the same when z changes.
Consider the general two-equation SEM:

y1 = α1 y2 + β10 + β11 z11 + · · · + β1k1 z1k1 + u1

y2 = α2 y1 + β20 + β21 z21 + · · · + β2k2 z2k2 + u2

where (y1 , y2 ) are endogenous variables determined in this model, (z11 , . . . , z1k1 ) are exogenous variables appearing in the first equation, and (z21 , . . . , z2k2 ) are exogenous variables appearing in the second equation. Typically there is an overlap between the two sets of exogenous variables but, to ensure our equations are identified, the regressors cannot completely overlap.
The first equation in the SEM is identified if and only if the second equation contains at
least one exogenous variable with a non-zero coefficient that is excluded from the first
equation. This is the rank condition. A necessary condition for this requirement is
that the first equation excludes at least one exogenous variable. This is the order
condition. [Equivalently, we need at least as many excluded exogenous variables in the
first equation as there are included endogenous variables in the first equation, which in
this case equals one (y2 ) – See also the previous discussion of 2SLS.]
Recognising that the excluded exogenous variables are needed as instruments for our included endogenous variables, we use the following terminology: the exogenous variables excluded from an equation are called the (excluded) instruments for that equation, while the exogenous variables appearing in it are the included exogenous variables.
Let us focus on estimation of the first equation. Once our identification condition is
satisfied, we can then apply the 2SLS estimator.
Identification ensures that we have enough instruments to allow us to deal with the
endogeneity of y2 . Since we have one endogenous variable in the first equation, we
need at least one instrument.
The 2SLS estimator of the first equation takes the following two steps.

Step 1: Regress y2 on all exogenous variables in the system (included exogenous variables and excluded instruments) and obtain the fitted values ŷ2 .

Step 2: Estimate the first equation by OLS after replacing y2 with ŷ2 .
Activity 11.5 Take the MCQs related to this section on the VLE to test your
understanding.
(i) A model to estimate the effects of smoking on annual income (perhaps through
lost work days due to illness, or productivity effects) is:
where cigs is the number of cigarettes smoked per day, on average. Interpret β1 .
(ii) To reflect the fact that cigarette consumption might be jointly determined with
income, a demand for cigarettes equation is given:
where cigpric is the price of a pack of cigarettes (in cents) and restaurn is a
binary variable equal to 1 if the person lives in a state with restaurant smoking
restrictions, 0 otherwise. We will assume these are exogenous to the individual.
Discuss under what assumption the income equation from part (i) is identified. What signs would you expect for γ5 and γ6 ?
\widehat{cigs} = 1.58 − .450 educ + .823 age − .0096 age² − .351 log(cigpric) − 2.736 restaurn
       (23.70)  (.162)     (.154)    (.0017)     (5.766)           (1.110)

n = 807, R² = .051.
\widehat{log(income)} = 7.78 − .042 cigs + .040 educ + .094 age − .00105 age²
             (.23)   (.026)     (.016)     (.023)    (.00027)

\widehat{log(income)} = 7.80 + .0017 cigs + .060 educ + .058 age − .00063 age²
             (.17)   (.0017)    (.008)     (.008)    (.00008)

n = 807, R² = .165.
(v) How would you test whether the OLS and 2SLS estimates in (iv) are statistically
different (that is whether we should worry about the endogeneity of cigs)?
11.3 Answers to activities

(ii) If we multiply the first structural equation by α2 and the second equation by α1 and subtract them from each other, we can derive the reduced form for y1 . Observe that the subtraction allows y2 to cancel out:

α2 y1 = α2 α1 y2 + α2 β1 z1 + α2 u1
α1 y1 = α1 α2 y2 + α1 β2 z2 + α1 u2

(α2 − α1 )y1 = α2 β1 z1 − α1 β2 z2 + α2 u1 − α1 u2 .

Because α1 ≠ α2 , we can divide the equation by α2 − α1 to obtain the reduced form for y1 :
y1 = π11 z1 + π12 z2 + v1

where π11 = β1 α2 /(α2 − α1 ), π12 = −β2 α1 /(α2 − α1 ), and v1 = (α2 u1 − α1 u2 )/(α2 − α1 ).

Similarly, subtracting the second structural equation from the first allows y1 to cancel out:

y1 = α1 y2 + β1 z1 + u1
y1 = α2 y2 + β2 z2 + u2

0 = (α1 − α2 )y2 + β1 z1 − β2 z2 + u1 − u2

which, solving for y2 , gives the reduced form:

y2 = π21 z1 + π22 z2 + v2

where π21 = −β1 /(α1 − α2 ), π22 = β2 /(α1 − α2 ), and v2 = (u2 − u1 )/(α1 − α2 ).
(i) Let us substitute the first structural equation into the second one:

(1 − β1 α1 )polc = β0 + β1 α0 + β1 u1 + β2 election + u2 .

Dividing by 1 − β1 α1 gives the reduced form:

polc = π0 + π1 election + v

where π0 = (β0 + β1 α0 )/(1 − β1 α1 ), π1 = β2 /(1 − β1 α1 ), and v = (β1 u1 + u2 )/(1 − β1 α1 ).
(ii) Let us plug in the reduced form and apply the rules of covariances:
The first equation is identified, provided γ3 ≠ 0 (and we would expect γ3 < 0).
Here we do have an exogenous ‘alcohol shifter’ we can use to identify the log(earnings)
equation.
The equation is just identified, as we have one exogenous variable, log(price), that we can use to deal with the endogenous regressor alcohol .
We can use 2SLS to estimate the β parameters (IV can be used too as the equation is
just identified).
The two steps of 2SLS are as follows:
2. Use the fitted values from the estimated reduced form to estimate by OLS:

log(earnings) = β0 + β1 \widehat{alcohol} + β2 educ + e.
(i) Assuming the structural equation represents a causal relationship, 100 · β1 is the
approximate percentage change in income if a person smokes one more cigarette
per day, ceteris paribus (semi-elasticity).
(ii) For identification we need at least one exogenous variable in the cigs equation that
is not also in the log(income) equation.
Therefore, we need γ5 ≠ 0 or γ6 ≠ 0. If both are non-zero the equation is overidentified, otherwise it is just identified.
• Since consumption and price are, ceteris paribus, negatively related: γ5 ≤ 0.
• Everything else equal, restaurant smoking restrictions should reduce cigarette
smoking: γ6 ≤ 0.
(iii) As all regressors are exogenous, we can estimate the reduced form by OLS.
Under CLM assumptions we can use the t test to test the significance of
log(cigpric) and restaurn, with the distribution under H0 being t_{807−6} .
• log(cigpric) is highly insignificant: π̂log(cigpric) /se(π̂log(cigpric) ) ≈ −0.06.

• restaurn is significant: π̂restaurn /se(π̂restaurn ) ≈ −2.47; it has the expected sign.
• It appears we have identification (can drop log(cigpric)).
(iv) Remember, OLS ignores potential simultaneity between income and cigarette
smoking. OLS is expected to be biased/inconsistent.
The coefficient on cigs using OLS implies that cigarette smoking causes income to
increase, although the coefficient is not statistically different from zero.
11.4 Overview of chapter

Key terms and concepts:
Structural equation
Simultaneity
Simultaneity bias
Identified equation
Exclusion restrictions
Overidentified equation
Unidentified equation
Order condition
Rank condition
understand the need for supply (demand) shifters to identify the demand (supply)
equation graphically
explain the concepts of a structural equation and a reduced form equation in the
simultaneous equations framework
explain the problem of simultaneity bias when OLS is applied to a single structural
equation in isolation
derive the correlation between the jointly determined explanatory variable and the
structural equation error
derive an expression for the large-sample simultaneity bias in the slope coefficients
when OLS is used to fit a simple regression in a simultaneous equations model
11.5 Test your knowledge and understanding
11.1E It is assumed that following the divorce, children live with parent 2 and parent 1 pays child support. The errors have zero mean and constant variance and Cov(ε1 , ε2 ) = σ12 ≠ 0.
(1.1) is the ‘reaction function’ of parent 1: it describes the amount of child support paid by parent 1 for any given level of visitation rights; inc1 is the income of parent 1, remarr1 indicates whether parent 1 is remarried (1 = yes, 0 = no), and dist measures the distance in miles between the current residences of parent 1 and parent 2.
(1.2) is the ‘reaction function’ of parent 2: it describes visitation rights for a given
amount of child support; remarr2 indicates whether parent 2 is remarried (1 = yes,
0 = no).
We assume that inc1, remarr1, remarr2, and dist are exogenous explanatory
variables.
(a) Explain the concepts of endogenous versus exogenous explanatory variables
and show that support is an endogenous variable in (1.2). You are expected to
derive the reduced form for support to fully answer this question.
(b) Examine for each structural (behavioural) equation whether the equation is
over-identified, exactly identified or unidentified. Provide clear arguments for
your answer.
(c) Your friend suggests you should implement the IV estimator to estimate the β
parameters consistently. He tells you to use inc1 as an instrument for support.
Provide a critical discussion of this suggestion.
11.2E It is postulated that a reasonable demand–supply model for the wine industry in
Australia, under the market-clearing assumption, would be given by:
where Qt = real per capita consumption of wine, P^w_t = price of wine relative to the CPI, P^b_t = price of beer relative to the CPI, Yt = real per capita disposable income, At = real per capita advertising expenditure, and St = storage cost at time t. CPI is the Consumer Price Index at time t. The endogenous variables in this model are Q and P^w , and the exogenous variables are P^b , Y , A and S. The variances of u1t and u2t are, respectively, σ1² and σ2², and Cov(u1t , u2t ) = σ12 ≠ 0.
All the coefficients except that of Y have the wrong signs. The coefficient of P^w (the price elasticity of demand, α1 ) not only has the wrong sign but also appears significant.
Explain why the OLS parameter estimator may give rise to these
counterintuitive results. You are expected to use your results in part (a) to
support your answer.
(c) The supply equation is overidentified. Clearly explain this terminology. What
distinguishes overidentification from just identified and unidentified? Provide
one set of assumptions that would render the supply equation exactly
identified.
(d) Discuss how you should estimate the supply equation in light of the
overidentification. What are the benefits of overidentification?
11.3E Let us consider the demand for fish. Using 97 daily price (avgprc) and quantity
(totqty) observations on fish prices at the Fulton Fish Market in Manhattan, the
following results were obtained by OLS:
The equation allows demand to differ across the days of the week, and Friday is the
excluded dummy variable. The standard errors are in parentheses.
(a) Interpret the coefficient of log(avgprc) and discuss whether it is significant.
(b) It is commonly thought that prices are jointly determined with quantity in
equilibrium where demand equals supply. What are the consequences of this
simultaneity for the properties of the OLS estimator?
(c) The variables wave2t and wave3t are measures of ocean wave heights over the
past several days. In view of your answer in part (b), what two assumptions do
we need to make in order to use wave2t and wave3t as instruments for
log(avgprc t ) in estimating the demand equation? Discuss whether these
assumptions are reasonable.
Discuss how these results can be obtained using two stage least squares (2SLS).
where vs = (1/(1 − α2 β2 ))(ε1 + α2 ε2 ) is the reduced form error. Using this reduced form, we can now show that:

Cov(support, ε2 ) = Cov(vs , ε2 ) = (1/(1 − α2 β2 ))(σ12 + α2 σ2²) ≠ 0

where the first equality uses the exogeneity of the explanatory variables in the reduced form expression. Hence, as Cov(support, ε2 ) ≠ 0, support is endogenous.
(b) The first equation is exactly identified since there is one exogenous variable (remarr2 ), excluded from (1.1), available as an instrument for the endogenous regressor (visits); the rank condition is satisfied as long as β3 ≠ 0 in (1.2).

The second equation is overidentified since there are two exogenous variables (inc1 and remarr1 ), excluded from (1.2), available as instruments for the endogenous regressor (support). (If either α3 or α4 in (1.1) is zero, the equation is just identified; if both are zero, it is not identified.)
(c) The IV approach suggested by the friend would yield consistent estimates of the β parameters since inc1 is a suitable instrument (exogenous and correlated with support, assuming α3 ≠ 0). However, given that this equation is overidentified, we could obtain more efficient estimates by using a two-stage least squares approach in which both remarr1 and inc1 are included as instruments.
11.2E (a) We need to combine the demand and supply curves, setting quantity demanded equal to quantity supplied:

Rearranging yields:
P^w_t = π0 + π1 St + π2 P^b_t + π3 Yt + π4 At + Vt

where the πj are our reduced form parameters and Vt = (vt − ut )/(α1 − β1 ) is the reduced form error.
(b) The key reason that OLS gives rise to these counterintuitive results is the endogeneity of P^w_t , which renders OLS inconsistent and may result in parameter estimates of the wrong sign.

Qt and P^w_t are jointly determined in this SEM, and this results in an endogeneity problem as Cov(P^w_t , ut ) ≠ 0.

Using the result from (a), we have:
(c) The supply equation is overidentified, since here we have three instrumental variables, P^b_t , Yt and At , to deal with our endogenous variable P^w_t .

If, in the demand equation, we have, say, α4 ≠ 0, α2 = 0 and α3 = 0, then the supply equation becomes exactly identified. In that case we would only have a single instrument, At (advertising), for the endogenous variable P^w_t .

If we do not have any instrument, our equation becomes unidentified and we would no longer be able to estimate it.
(d) As the supply equation is overidentified we should use 2SLS.

Step 1: Estimate the reduced form of P^w_t and obtain the fitted values P̂^w_t :

P̂^w_t = π̂0 + π̂1 St + π̂2 P^b_t + π̂3 Yt + π̂4 At .

Step 2: Use P̂^w_t in place of P^w_t to estimate the supply equation by OLS, that is, estimate:

Qt = β0 + β1 P̂^w_t + β2 St + errort .
The benefit of overidentification is that of improved efficiency of our IV
estimator (2SLS chooses the optimal instrument).
11.3E (a) We want to interpret this parameter as the price elasticity of demand after controlling for day-of-the-week differences. If the price increases by 1%, then quantity demanded decreases by .425%.

We need to test H0 : βlog(avgprc) = 0 against H1 : βlog(avgprc) ≠ 0. Under the CLM assumptions (MLR.1 to MLR.6), we can use the t statistic:

β̂log(avgprc) /se(β̂log(avgprc) ) ∼ t_{n−6} under H0 .
At the 5% level of significance the critical value is 1.96. Given the realisation
of our test statistic is −.425/.176 = −2.415 we conclude that it is significant
(i.e. we reject H0 ).
(b) This is a problem of endogeneity. The simultaneity of quantity and prices will
induce a correlation between prices (log(avgprc)) and the error term in the
demand (and supply) equation. This endogeneity will lead to inconsistent
parameter estimates when OLS is used to estimate the demand equation. It is
also called ‘simultaneity bias’.
(c) To estimate the demand equation, we need at least one exogenous variable, not
in the demand equation, that appears in the supply equation (supply shifter).
We need to assume that these variables can properly be excluded from the
demand equation and are uncorrelated with the error in the demand equation.
(Exclusion and validity requirement.) These assumptions may not be
entirely reasonable; wave heights are determined partly by weather, and
demand at a local fish market could depend on weather too.
The requirement that at least one of wave2t and wave3t appears in the supply
equation ensures that we have a correlation of our instruments with
log(avgrpc). (Relevance requirement.) There is indirect evidence of this in
part (d), where we will show that they are jointly significant in the reduced
form for log(avgprc t ).
(d) We test the joint hypothesis H0 : βwave2 = βwave3 = 0 against H1 : βwave2 ≠ 0 and/or βwave3 ≠ 0.

The F test is obtained as:

F = ((SSRr − SSRur )/2) / (SSRur /(n − k − 1)).

Equivalently, the following form could be used to obtain the F statistic:

F = ((R²ur − R²r )/2) / ((1 − R²ur )/(n − k − 1)).

Both forms evaluate the loss in fit from imposing the restrictions we are testing. Under the CLM assumptions (MLR.1 to MLR.6) this gives us an F2,90 random variable under the null. For any reasonable level of significance we will want to reject the null. We conclude that our test result ensures that our IVs indeed are relevant.
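A sketch of this joint relevance test in R, on simulated stand-in data (the real exercise uses the Fulton Fish Market dataset):

```r
# Joint significance of the instruments in the reduced form for log(avgprc).
library(car)
set.seed(17)
n       <- 97
day     <- factor(sample(c("mon", "tues", "wed", "thurs", "fri"), n, TRUE))
wave2   <- rnorm(n); wave3 <- rnorm(n)
lavgprc <- 0.1 * wave2 + 0.1 * wave3 + rnorm(n, sd = 0.3)   # illustrative

first <- lm(lavgprc ~ day + wave2 + wave3)
linearHypothesis(first, c("wave2 = 0", "wave3 = 0"))  # F statistic, 2 and 90 df
```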
(e) Discussion of 2SLS:

Step 1: Estimate the reduced form for log(avgprc) and obtain the fitted values \widehat{log(avgprct )}. All exogenous variables have to be included here!
Step 2: Use the fitted values for log(avgprc) to estimate the demand equation:

log(totqtyt ) = β0 + β1 \widehat{log(avgprct )} + β2 mont + β3 tuest + β4 wedt + β5 thurst + errort .
Chapter 12
Binary response models and
maximum likelihood
12.1 Introduction
describe the linear probability model (LPM) and discuss the interpretation of
parameters
explain the benefits of using a nonlinear model for binary response models
specify the logit model and the log-likelihood function for its estimation by MLE
specify the probit model and the log-likelihood function for its estimation by MLE
conduct hypothesis tests using logit and probit models (in particular use the z test
and the likelihood ratio (LR) test)
use statistical software to estimate binary choice models (linear and nonlinear)
using real-world data.
12.2 Content of chapter
In earlier chapters we used dummy variables as explanatory variables, for example in wage regressions to test whether there are gender and ethnic differences in wages. We also incorporated interactions involving dummy variables – for example, we considered the interactions between gender and education in our wage regression to allow us to detect whether there are gender differences in the returns to education. In this chapter, we consider the use of binary variables as our dependent variable. This setting is referred to as the binary response model.
In binary response models, we attempt to explain a qualitative event. Individuals
and/or firms are asked to choose between two alternatives and we denote one of the
outcomes as y = 1 and the other as y = 0. There are many empirical applications of this
nature in economics as discussed in Wooldridge.
Here, we consider the implication of using the multiple linear regression model:
y = β0 + β1 x1 + β2 x2 + · · · + βk xk + u
when the dependent variable is a binary variable. We continue to assume E(u | x) = 0
(needed for causal interpretation of our parameters), so that:
E(y | x) = β0 + β1 x1 + β2 x2 + · · · + βk xk .
A key observation we need to make when the dependent variable y is a discrete random variable that only takes the values zero and one is that we need to interpret the conditional expectation E(y | x) as a probability. Using properties of discrete random variables (see Wooldridge, Appendix B.1a):

E(y | x) = 0 × Pr(y = 0 | x) + 1 × Pr(y = 1 | x) = Pr(y = 1 | x).

Hence, when we apply the multiple linear regression model to a binary dependent variable we obtain a linear probability model (LPM):

Pr(y = 1 | x) = β0 + β1 x1 + β2 x2 + · · · + βk xk .
This terminology is due to the fact that the probability that y = 1 is assumed to be
linear in the parameters. The parameters of the LPM have a nice (easy) causal
interpretation: βj represents the effect a unit change in xj has on the probability that
y = 1 holding everything else fixed.
We can easily estimate the parameters of the LPM by running an OLS regression of y on the explanatory variables, and predicted values from this regression need to be interpreted as predicted probabilities:

ŷi = β̂0 + β̂1 x1i + β̂2 x2i + · · · + β̂k xki = \widehat{Pr}(yi = 1 | xi ).
Empirical applications of this are provided in Wooldridge and you need to be able to
discuss the results, clearly relating them to changes in the probability of the event that
y = 1.
Let us highlight some points with reference to the classic example of labour market participation (using the data from Mroz, 1987). Wooldridge reports:
n = 753, R² = .264

where \widehat{inlf} = \widehat{Pr}(inlf = 1 | age = 35, educ = 7, . . .).
Let us focus here only on two parameters, β̂educ and β̂kidslt6 (see Wooldridge for a discussion of the other parameters).

β̂educ : Ceteris paribus, each additional year of education increases the predicted probability of a woman being in the labour market by .038 units, or 3.8 percentage points. For the 35-year old woman described above, it increases her predicted probability from 61.6% to 65.4%. (Careful: this is not the same as a 3.8% increase!)

β̂kidslt6 : Ceteris paribus, each additional young child (< 6 years) reduces the predicted probability of a woman being in the labour market by .262 units, that is, a 26.2 percentage point drop.
Various limitations of the LPM are discussed in Wooldridge. Two of them are directly
related to the fact that we need to interpret the fitted values of this regression as
predicted probabilities. (1) The fitted values for our LPM are not guaranteed to
lie between zero and one. Due to their interpretation this is clearly unsatisfactory.
(2) The estimated partial effects are constant throughout the range of values the
explanatory variables can take. In order to ensure that the predicted values represent a
probability (and are restricted to lie between zero and one), eventually they must have a
diminishing effect.
To give a graphic exposition of these points let us consider the simple bivariate LPM
model setting. This discussion also reinforces the special nature of binary dependent
variables. Let:
y = β0 + β1 x + u.
The following graph displays a scatter graph of observations. Observe that y only can
take two values 0 and 1. The scatter graph suggests a positive relation between x and y.
The fitted regression line (LPM regression line) is a straight line in this graph that
minimises the residual sum of squares.
For large values of x, the predicted value may exceed one, whereas for small values of x,
the predicted value may lie below zero. This is a result of the fact that the LPM
assumes the marginal effect of x on y is constant.
A further drawback of the LPM is that (3) the LPM exhibits heteroskedasticity (which invalidates the OLS standard errors) due to the fact that:

Var(y | x) = p(x)(1 − p(x))

where:

p(x) ≡ Pr(y = 1 | x) = β0 + β1 x1 + β2 x2 + · · · + βk xk .
To see this result, we recall Var(y | x) = E(y² | x) − E(y | x)². Using the properties of discrete random variables, E(y² | x) = Pr(y = 1 | x) = p(x), so that:

Var(y | x) = p(x) − p(x)² = p(x)(1 − p(x)).

Writing the LPM as:

y = β0 + β1 x1 + β2 x2 + · · · + βk xk + u

we should note that the conditional variance of y given x is identical to the conditional variance of u:

Var(u | x) = p(x)(1 − p(x)).
Hence, unless all slopes βj , for j = 1, 2, . . . , k, are zero, the conditional variance of u will
depend on x and exhibit heteroskedasticity. A violation of MLR.5!
Caution: The results provided for the empirical applications of the LPM report the
usual standard errors alongside the parameter estimates. As the presence of
heteroskedasticity invalidates the standard errors, any discussion of the significance of
parameters based on them in this section should be read with caution.
Activity 12.1 Take the MCQs related to this section on the VLE to test your
understanding.
We have discussed that the multiple regression model when y is a binary variable:

y = β0 + β1 x1 + β2 x2 + · · · + βk xk + u,  E(u | x) = 0

exhibits heteroskedasticity, so we will need to correct the standard errors when using OLS. Without this correction we cannot trust our usual t and F statistics, not even in large samples.
The most common approach to deal with heteroskedasticity in the LPM is to use
heteroskedasticity-robust standard errors. Using robust standard errors (which are easy
to obtain using statistical packages) we can conduct inference and construct confidence
intervals as usual.
Let us highlight some points with reference to the classic example of labour market
participation (using the data from Mroz, 1987). Wooldridge reports:
\widehat{inlf} = .586 − .0034 nwifeinc + .038 educ + .039 exper + · · ·
       (.151)   (.0015)        (.007)      (.006)

n = 753, R² = .264

where the numbers in brackets are the heteroskedasticity-robust standard errors.
The parameter estimates are the same as before (we have used OLS again). The only
difference is that robust standard errors are reported. They are not that different from
the standard errors we had before.
The robust 95% confidence interval for βnwifeinc is given by:

[β̂nwifeinc − 1.96 se(β̂nwifeinc ), β̂nwifeinc + 1.96 se(β̂nwifeinc )] = [−.0063, −.0005].

The critical value 1.96 is obtained from the Normal(0, 1) distribution (large sample). As zero does not lie in this interval, we reject the null H0 : βnwifeinc = 0 against H1 : βnwifeinc ≠ 0 at the 5% level of significance.
To test H0 : βeduc = 0 against H1 : βeduc ≠ 0, we now use the robust t statistic:

tβ̂educ = β̂educ /se(β̂educ ) = .038/.007 = 5.426.

We get a clear, strong rejection of H0 . As we use robust standard errors, we rely on asymptotic Normal(0, 1) critical values.
To test H0 : βexper = 0 and βexper² = 0, we now use the robust F statistic. You are not expected to be able to derive this statistic yourself. Here F = 67.17, which gives strong evidence against H0 . We use critical values given by F2,df where df = 753 − 8.
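A sketch of robust inference for an LPM in R, using the sandwich and lmtest packages on simulated data (variable names and coefficients are illustrative):

```r
# Heteroskedasticity-robust t statistics for a linear probability model.
library(sandwich); library(lmtest)
set.seed(21)
n    <- 753
educ <- rnorm(n, 12, 2)
kids <- rpois(n, 1)
y    <- rbinom(n, 1, plogis(-3 + 0.25 * educ - 0.5 * kids))

lpm <- lm(y ~ educ + kids)
coeftest(lpm, vcov = vcovHC(lpm, type = "HC1"))  # robust standard errors
# Robust joint (F/Wald) tests can be obtained with lmtest::waldtest() or
# car::linearHypothesis(), passing the same robust vcov matrix.
```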
In order to regain efficiency, we may want to consider using weighted least squares on our LPM:

y = β0 + β1 x1 + β2 x2 + · · · + βk xk + u,  E(u | x) = 0.  (*)

We recall Var(ui | xi ) = p(xi )(1 − p(xi )), where p(xi ) = β0 + β1 x1i + β2 x2i + · · · + βk xki . By multiplying our LPM (*) by hi ≡ 1/√(p(xi )(1 − p(xi ))) for each observation i, we are effectively able to remove the problem of heteroskedasticity:

hi yi = β0 hi + β1 x1i hi + β2 x2i hi + · · · + βk xki hi + ui hi  (**)
since:

Var(ui hi | xi ) = hi² Var(ui | xi ) = 1, i.e. homoskedastic.

As this model (**) satisfies our Gauss–Markov assumptions, estimating it by OLS (which defines our weighted least squares estimator) will be efficient.
In order to implement this regression, we will need to use estimates of hi :

ĥi = 1/√(p̂(xi )(1 − p̂(xi ))).

As fitted values of the LPM (which define p̂i ) may not lie inside the zero–one range, these weights may not be well-defined. The use of weighted least squares for the LPM is very uncommon.
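For completeness, a sketch of this (rarely used) WLS procedure in R, assuming the fitted probabilities fall strictly inside (0, 1):

```r
# WLS for the LPM: weight each observation by 1/Var(u_i | x_i).
set.seed(23)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, pmin(pmax(0.5 + 0.15 * x, 0.01), 0.99))

lpm  <- lm(y ~ x)
phat <- fitted(lpm)
ok   <- phat > 0 & phat < 1              # drop ill-defined weights
w    <- 1 / (phat * (1 - phat))          # lm() expects weights = 1/Var = h_i^2
wls  <- lm(y ~ x, weights = w, subset = ok)
coef(wls)
```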
In binary response models, we have seen that the fitted dependent variable represents a predicted probability which, by its nature, needs to be restricted to lie in the interval [0, 1]. As failure to satisfy this requirement is one of the limitations of the LPM, we argue here in favour of using a non-linear model for binary response models:

Pr(y = 1 | x) = G(β0 + β1 x1 + · · · + βk xk )

where the function G takes values in the interval [0, 1] only. This specification ensures that Pr(y = 1 | x) is always in [0, 1].
An important consequence, which we discuss in detail later, is that the βj parameters do not have the same easy interpretation they had in the LPM: to obtain the ceteris paribus effect of xj on the response probability we have to take the nonlinearity of G into account, so partial effects depend on the values of the explanatory variables.
Comment: The discussion in Wooldridge which shows that the logit and probit models
can be derived from an underlying latent variable model is interesting, but is not
examinable.
To estimate the βj parameters in our non-linear binary response model:
Pr(y = 1 | x) = G(β0 + β1 x1 + · · · + βk xk)
we will use the maximum likelihood estimator (MLE). The estimator makes use of the fact that our binary random variable can be described by the Bernoulli distribution:
f(y) = p(x)^y (1 − p(x))^{1−y}, y = 0, 1.
Comment: In this segment, we will use notation that allows us to clearly distinguish
between random variables (denoted with capital letters) and their realisations (denoted
with small letters). Note that this distinction is not made as explicitly in the main text of Wooldridge (or in the course in general).
The likelihood function we maximise to obtain the MLE is given by the joint density of
the data as a function of the unknown parameters θ. In the simplest case we will
consider the setting where we have a random sample Y1 , Y2 , . . . , Yn , where each Yi is
drawn independently from the same (identical) population whose density function is
given by f (y; θ), that is Yi is i.i.d. (independent, identically distributed). By
independence, the joint density of Y1 , . . . , Yn then equals the product of marginals:
f (y1 , y2 , . . . , yn ; θ) = f (y1 ; θ)f (y2 ; θ) · · · f (yn ; θ)
where yi denotes a realisation that the random variable Yi can take. The
likelihood function is defined in terms of the random variables Y1 , Y2 , . . . , Yn , which in
this case yields:
L(θ; Y1 , . . . , Yn ) = f (Y1 ; θ)f (Y2 ; θ) · · · f (Yn ; θ).
The MLE of our parameter is given by that value of θ that will maximise this function,
which we may write formally as:
θ̂_MLE = arg max_θ L(θ; Y1, . . . , Yn).
The first-order conditions of this optimisation problem, which we may write formally as:
∂L(θ; Y1, . . . , Yn)/∂θ |_{θ = θ̂_MLE} = 0
will give our estimator θbM LE in terms of the random variables (Y1 , Y2 , . . . , Yn ). We recall
that an estimator is a random variable, and an estimate is a realisation of this
random variable for a given sample (y1 , . . . , yn ).
Instead of maximising the likelihood function, we typically use the log-likelihood
function. Since log is a monotone transformation, the estimator that maximises the
log-likelihood function is the same as the estimator that maximises the likelihood
function. Using the properties of the log operator, the log-likelihood can be written as
the sum of logs:
log L(θ; Y1, . . . , Yn) = Σ_{i=1}^n log f(Yi; θ).
Under very general conditions (often labelled the MLE regularity conditions) the
estimator has nice asymptotic properties as follows.
The ML estimator is consistent:
plim θ̂_MLE = θ.
ML estimators are not always unbiased. Hence, the consistency result is important.
You are expected to derive the MLE in simple models from first principles.
Mathematically, this requires you to be comfortable with the following mathematical
rules (see also Basic mathematical tools in the pre-course material).
(a^b)^c = a^{bc}.
log(a^b) = b log(a).
d log(x)/dx = 1/x.
Using the rules of the log function, we obtain the log-likelihood function:
log L(θ; Y1, . . . , Yn) = log(θ^{Σ_{i=1}^n Yi}) + log((1 − θ)^{n − Σ_{i=1}^n Yi})   [using log(ab) = log(a) + log(b)]
= (Σ_{i=1}^n Yi) log(θ) + (n − Σ_{i=1}^n Yi) log(1 − θ)   [using log(a^c) = c log(a)].
We need to obtain the derivative d log L/dθ for this purpose. Making use of the rules of the log function, and the chain rule, we obtain:
d log L/dθ = (Σ_{i=1}^n Yi)(1/θ) − (n − Σ_{i=1}^n Yi)(1/(1 − θ))   [using d log(θ)/dθ = 1/θ].
Setting this derivative equal to zero at θ̂_MLE gives:
(Σ_{i=1}^n Yi)/θ̂_MLE = (n − Σ_{i=1}^n Yi)/(1 − θ̂_MLE).
To solve for θ̂_MLE we rewrite this as:
(1 − θ̂_MLE) Σ_{i=1}^n Yi = θ̂_MLE (n − Σ_{i=1}^n Yi).
Hence θ̂_MLE = Ȳ.
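A quick numerical check of this result, as a sketch in R with simulated data:

# Sketch: verify numerically that the Bernoulli MLE equals the sample mean.
set.seed(123)
y <- rbinom(100, size = 1, prob = 0.3)   # random sample from a Bernoulli(0.3)
n <- length(y)

loglik <- function(theta) sum(y) * log(theta) + (n - sum(y)) * log(1 - theta)

# Maximise the log-likelihood over (0, 1) and compare with the sample mean:
optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum
mean(y)   # theta-hat_MLE = Y-bar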
Activity 12.3 We have just discussed the MLE of θ when we have a random sample Y1, . . . , Yn from the Bernoulli distribution where θ = Pr(Yi = 1), for i = 1, . . . , n. Since θ defines the probability that Yi = 1 we should not be surprised that:
θ̂_MLE = Ȳ
where Ȳ = (1/n) Σ_{i=1}^n Yi denotes the proportion of ones in the sample.
Properly supported, answer the following questions:
Activity 12.4 Suppose you are given a random sample X1 , . . . , Xn from the
exponential distribution, whose probability density function (pdf) is defined by:
f(x; β) = (1/β) exp(−x/β), x > 0, β > 0.
Here we provide details of the MLE for the logit and probit models and discuss how to
interpret logit and probit estimates. We also address how to conduct hypothesis testing
for logit and probit models. In this section, we no longer explicitly use capital letters to
denote random variables (as in the rest of the course).
Recall that the logit and probit models specify:
Pr(y = 1 | x) = G(β0 + β1 x1 + β2 x2 + · · · + βk xk)
where y is a binary random variable (taking values of zero and one only),
x = {x1 , . . . , xk } and:
G(z) = Λ(z) = exp(z)/(1 + exp(z))   (logit model)
G(z) = Φ(z) = ∫_{−∞}^{z} (1/√(2π)) exp(−u²/2) du   (probit model).
For notational simplicity, let xβ = β0 + β1 x1 + β2 x2 + · · · + βk xk .
Our ability to use MLE to estimate the β parameters for the probit and logit models
lies in the realisation that the (conditional) density of a binary dependent variable y
given x can be described by the Bernoulli distribution:
f(y | x) = G(xβ)^y (1 − G(xβ))^{1−y}, y = 0, 1
where G(xβ) denotes Pr(y = 1 | x).
Assuming random sampling of n observations (yi , xi ), the joint (conditional) density
can then be obtained by using the product of the ‘marginals’:
f(y1, . . . , yn | x1, . . . , xn) = Π_{i=1}^n f(yi | xi) = Π_{i=1}^n G(xi β)^{yi} (1 − G(xi β))^{1−yi}
where Π_{i=1}^n f(yi | xi) is shorthand for f(y1 | x1) f(y2 | x2) · · · f(yn | xn). This joint
(conditional) density defines our likelihood function L(β; y1 , . . . , yn , x1 , . . . , xn ). Given a
random sample, we want to determine the value of β that will maximise this likelihood
function.
Instead of maximising the likelihood function, it is more common to use the
log-likelihood function which has a very nice interpretable form here, given by:
log L(β; y1, . . . , yn, x1, . . . , xn) = Σ_{i=1}^n [ yi log G(xi β) + (1 − yi) log(1 − G(xi β)) ]
where G(xi β) = Pr(yi = 1 | xi) and 1 − G(xi β) = Pr(yi = 0 | xi).
Interpretation: This function reveals that the contribution to the log-likelihood of an observation i for which yi = 1 is given by log Pr(yi = 1 | xi) (the other term vanishes), whereas the contribution of an observation i for which yi = 0 is given by log Pr(yi = 0 | xi). Intuitively, this reveals that the value of β that maximises this log-likelihood function should ensure that \widehat{Pr}(yi = 1 | xi) is large when yi = 1, while \widehat{Pr}(yi = 0 | xi) is large when yi = 0.
We can derive the log-likelihood from the joint density by using our mathematical
fundamentals as shown below:
Mathematically, the MLE involves solving the first-order condition of the log-likelihood
function for βbM LE . This problem does not give an explicit formula for βbM LE . The
estimator is obtained using numerical optimisation methods and is readily available in
statistical packages such as R.
The regression output of MLE provides the parameter estimates β̂_MLE and associated (asymptotic) standard errors in a table similar to that of OLS. In particular, when we consider the empirical application of labour market participation of women using the Mroz data again, we obtain such output when estimating the probit model in R.
It is common to also report the value of the log-likelihood function when providing
MLE regression results.
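This output can be reproduced with a few lines of R; a sketch, again assuming the Mroz data are available as a data frame mroz:

# Sketch: probit (and logit) estimation by MLE using glm().
probit <- glm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age + kidslt6 + kidsge6,
              family = binomial(link = "probit"), data = mroz)
summary(probit)   # estimates, (asymptotic) standard errors, z statistics, p-values
logLik(probit)    # value of the log-likelihood at the MLE

# The logit model only changes the link function:
logit <- update(probit, family = binomial(link = "logit"))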
Fitted values: Using the probit and logit parameter estimates β̂j, for j = 0, . . . , k, we can obtain the predicted response probability for all individuals.
In probit regression models:
\widehat{Pr}(yi = 1 | xi) ≡ ŷi = Φ(β̂0 + β̂1 x1i + · · · + β̂k xki).
In logit regression models:
\widehat{Pr}(yi = 1 | xi) ≡ ŷi = Λ(β̂0 + β̂1 x1i + · · · + β̂k xki).
In order to obtain the predicted probabilities it is important not to forget the cdf functions! With reference to the probit results reported above, for instance, we can compute the predicted probability of being in the labour force for any individual in the sample.
Conveniently, as g(z) > 0, this shows that the partial effect does have the same sign as
βj. If β̂j > 0: ceteris paribus, xj affects the predicted probability that y = 1 positively.
If β̂j < 0: ceteris paribus, xj affects the predicted probability that y = 1 negatively.
Using the probit regression results:
\widehat{inlf} = Φ(.270 − .012 nwifeinc + .131 educ + .123 exper
                − .0019 exper² − .053 age − .868 kidslt6 + .036 kidsge6)
this allows us to state the following.
As β̂_nwifeinc < 0: a higher non-wife income reduces the predicted probability of a woman being in the labour market, ceteris paribus.
As β̂_kidslt6 < 0: the presence of young children reduces the probability of a woman being in the labour market, ceteris paribus.
Whether these effects (their directions) are statistically significant will depend on the precision of the estimates (their standard errors). More about this later.
Marginal effects: Unlike in the LPM, the partial effects in the logit and probit models are not constant across the range of values the explanatory variables can take. Depending on the values of x, these partial effects will take different numerical values.
For explanatory variables that are dummy variables (discrete), the partial effect
evaluates the difference in probability of participation when the dummy variable
switches from zero to one. Say x1 is a dummy variable, then:
ΔPr(y = 1 | x)/Δx1 = G(β0 + β1 · 1 + β2 x2 + · · · + βk xk) − G(β0 + β1 · 0 + β2 x2 + · · · + βk xk).
Rather than reporting estimated partial effects for all individuals in the sample, the
following approaches are popular. We report the average of the partial effect (APE)
over all individuals, we report the partial effect for a particular individual, or we report
the partial effect of an individual with characteristics given by the average person
(PEA). That is, for continuous variables:
Estimate of the Partial Effect of the Average individual (PEA). Let x̄j denote the average of the jth explanatory variable in the sample; then:
\widehat{PEA}_j = β̂j · g(β̂0 + β̂1 x̄1 + · · · + β̂k x̄k).
Rather than reporting this result for all individuals, we may report the average marginal (partial) effect over all individuals, \widehat{APE}_nwifeinc:
(1/753) Σ_{i=1}^{753} (−.012) × φ(.270 − .012 nwifeinc_i + .131 educ_i + · · · + .036 kidsge6_i).
Our statistical package can help us to obtain \widehat{APE}_nwifeinc = −.0036 (we will discuss how to do this in R on the VLE; a short sketch follows after this list). It reveals that $10,000 additional other income (i.e. Δnwifeinc = 10) on average reduces the probability of women working in the labour market by .036, that is a 3.6 percentage point decrease, holding everything else constant. Our statistical package will also provide the associated standard error, se(\widehat{APE}_nwifeinc), which we can use to test whether this effect is statistically significant.
Whereas the LPM assumes constant marginal effects, logit and probit imply
diminishing magnitudes of the partial effect. To show this, let us consider the effect
of an additional young child for women with nwifeinc = 20.13, educ = 12.3,
exper = 10.6, age = 42.5 and kidsge6 = 1.
Below we report the predicted probability of being in the labour market for this
woman as a function of the number of young children:
\widehat{Pr}(inlf = 1 | kidslt6 = 0, nwifeinc = 20.13, educ = 12, . . .) = .707
\widehat{Pr}(inlf = 1 | kidslt6 = 1, nwifeinc = 20.13, educ = 12, . . .) = .373
\widehat{Pr}(inlf = 1 | kidslt6 = 2, nwifeinc = 20.13, educ = 12, . . .) = .117.
Comparing these probabilities, we note that the first child reduces her predicted probability by (.707 − .373) = .334 (a 33.4 percentage point drop). The second child reduces her predicted probability further but by a smaller amount: (.373 − .117) = .256 (a 25.6 percentage point drop).
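Returning to the APE referred to above, a sketch of the calculation in R using the hypothetical probit fit (add-on packages such as margins automate this and also supply delta-method standard errors):

# Sketch: average partial effect (APE) of nwifeinc in the probit model.
xb  <- predict(probit, type = "link")              # the linear index x_i beta-hat
ape <- mean(dnorm(xb)) * coef(probit)["nwifeinc"]  # (1/n) sum of phi(x_i beta-hat) * beta-hat
ape   # should be close to the -.0036 reported above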
Goodness of fit
A measure of goodness of fit that is commonly reported in binary response models (also
for LPM) indicates the percentage of observations that are correctly predicted.
A common rule for this is to consider yi as being correctly predicted if:
yi = 1 and \widehat{Pr}(yi = 1 | xi) ≥ 0.5, or yi = 0 and \widehat{Pr}(yi = 1 | xi) < 0.5.
This rule has a drawback in that it does not reflect the quality of the prediction. Whether \widehat{Pr}(yi = 1 | xi) = 0.5001 or \widehat{Pr}(yi = 1 | xi) = 0.9999 is not reflected in this measure: in both cases we will indicate that the observation is predicted correctly if yi = 1, and incorrectly if yi = 0.
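A one-line sketch of this measure in R, given the hypothetical probit fit above:

# Sketch: percentage of observations correctly predicted under the 0.5 rule.
phat <- predict(probit, type = "response")
mean((phat >= 0.5) == (mroz$inlf == 1))   # proportion correctly predicted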
There are alternative measures, such as the pseudo R-squared (not examinable). Later
we will discuss how we can formally test the joint significance of the slopes as a
practical alternative.
Hypothesis testing
Single linear restrictions: As we have seen, the MLE regression output provides
parameter estimates (βb0 , βb1 , . . . , βbk ) and their standard errors. This will permit us to
test hypotheses, such as:
H0 : β2 = 1 against H1 : β2 ≠ 1
using an asymptotic t test. The procedure is the same as before, except that we will
not use the Student’s t distribution to obtain the critical values, but the standard
normal distribution. The reason for this is that MLEs are, in general, only asymptotically normal. Specifically, we should use here the test statistic:
z = (β̂2 − 1)/se(β̂2) ∼ N(0, 1) asymptotically, under H0.
At the 5% level of significance, we should reject H0 if the realisation of our test statistic:
|z| > 1.96.
Comment: MLE regression output provides the z statistic for the hypotheses:
H0 : βj = 0
together with their associated two-sided p-values, just as in OLS! (The third and fourth columns of our MLE regression output.) If the p-value is smaller than 5%, we should reject the null that βj = 0 at the 5% level of significance and state that β̂j is statistically significant.
Multiple linear restrictions: The MLE regression output typically also provides the
value of the log-likelihood function (log L). The log-likelihood value plays an important
role when testing multiple linear restrictions such as
H0 : β2 = 0 and β3 = 0 against H1 : β2 ≠ 0 and/or β3 ≠ 0.
The test is based on detecting whether imposing restrictions leads to a significant loss of
fit. The test requires us to compare the value of the likelihood function of the
unrestricted model, Lur , with the value of the likelihood function of the restricted model
(where we impose the restrictions), Lr . The test is called the likelihood ratio (LR)
test.
The likelihood ratio test statistic is given by:
LR = 2 log(L_ur/L_r) = 2(log L_ur − log L_r) ∼ χ²_q asymptotically, under H0
where q = df_ur − df_r equals the number of restrictions. This is an (asymptotic) χ² test. For a given level of significance, α, we should reject the null when the realisation of the likelihood ratio test satisfies:
LR > χ²_{q, α}.
With q = 2 and α = 5% this critical value equals 5.99. We should recognise that small realisations of the LR test statistic signal that there is no significant loss in fit and therefore should not lead to a rejection of the null.
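In R, the LR statistic can be computed directly from two fitted models; a sketch with hypothetical unrestricted and restricted glm() fits:

# Sketch: likelihood ratio test of H0: beta_2 = beta_3 = 0 (q = 2).
# 'unrestricted' and 'restricted' are glm() fits with and without x2, x3.
LR <- as.numeric(2 * (logLik(unrestricted) - logLik(restricted)))
pchisq(LR, df = 2, lower.tail = FALSE)   # asymptotic chi-squared p-value

# Equivalently, lmtest::lrtest(restricted, unrestricted) reports the same test.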
t_educ = β̂_educ / se(β̂_educ) = .131/.025 ≈ 5.18.
At the 5% level of significance the critical value is 1.96. The p-value < .001, hence
we conclude that education has a significant effect on being in the labour force.
The test relies on the MLE regularity conditions, which ensure that under H0:
β̂_educ / se(β̂_educ) ∼ Normal(0, 1) asymptotically.
We can also use the LR test to test the joint significance of all explanatory
variables:
H0 : β_nwifeinc = 0, β_educ = 0, . . . , and β_kidsge6 = 0.
This would require us to test seven restrictions, with:
LR = 2(log L_ur − log L_r) ∼ χ²_7 asymptotically, under H0.
To obtain log L_r we need to run a probit regression that includes an intercept only (all explanatory variables excluded). This gives log L_r = −514.873.
The log L_ur is the same as before, log L_ur = −401.302. Our test statistic LR = 2(−401.302 − (−514.873)) = 227.14 shows that our explanatory variables are
jointly highly significant (p-value < .001).
Frequently MLE regression output reports this test statistic and its associated
p-value as it provides a goodness of fit measure.
Activity 12.5 Take the MCQs related to this section on the VLE to test your
understanding.
[Table of regression results for this activity (not reproduced here): R² = .086; log L = −407.60 and −407.89.]
(i) Discuss how you can estimate the probability (i.e. willingness) to buy
ecologically produced apples using the Linear Probability Model (LPM).
(ii) Discuss how you can estimate the probability to buy ecologically produced
apples using the probit results.
(iii) Discuss the effect of a $.10 reduction of the price of ecologically produced apples
using the LPM.
(iv) Discuss the effect of a $.10 reduction of the price of ecologically produced apples
using the logit model.
(v) How do your answers to (iii) and (iv) change for a further $.10 reduction of the price of ecologically produced apples?
[Table of regression results (not reproduced here): R² = .086 and .110; log L = −407.60 and −399.04.]
The numbers in parentheses are the heteroskedasticity-robust standard errors for
OLS and the (asymptotic) standard errors for probit.
(i) Test the hypothesis H0 : βecoprc = 0 against H1 : βecoprc < 0 using the OLS and
probit results that include the additional non-price controls. What
interpretation can you give this parameter in the OLS and probit setting?
We want to test whether, after controlling for prices, the nonprice variables do not
affect the willingness to buy ecofriendly apples.
(iv) Test this hypothesis using the probit regression results using the LR test.
Clearly indicate the (asymptotic) distribution and the rejection rule.
Here we provide details of how to estimate binary response models, including the linear
probability model (LPM) and nonlinear models such as probit and logit.
outlf = (1 − β0) + (−β1) nwifeinc + (−β2) educ + (−β3) exper + (−β4) exper² + (−u)   (**)
      =    α0    +  α1 nwifeinc   +  α2 educ   +  α3 exper   +  α4 exper²    +  e
so that α̂0 = 1 − β̂0 and α̂j = −β̂j, for j = 1, 2, 3, 4.
• The slope parameters have the opposite sign when using outlf instead of inlf
as the dependent variable.
• Intuition of result: Recall, the slope parameters in the linear probability
model explain how regressors affect the probability that inlf = 1, ceteris
paribus.
• When we use outlf as the dependent variable, the slope parameters will
explain how regressors affect the probability that outlf = 1, ceteris paribus.
(ii) The standard errors will not change. Recall se(β̂j) = √(\widehat{Var}(β̂j | X)).
Slopes: Changing the signs of the estimators does not change their variances, Var(−β̂j | X) = Var(β̂j | X), and therefore the standard errors are unchanged (but the t statistics change sign).
Intercept: As:
Var(1 − β̂0 | X) = Var(β̂0 | X)
the standard error of the intercept is unchanged as well.
• By rearranging we obtain:
SST = Σ_{i=1}^n [−inlf_i + \overline{inlf}]² = Σ_{i=1}^n [inlf_i − \overline{inlf}]².
We are asked to consider the properties of θ̂_MLE = (1/n) Σ_{i=1}^n Yi.
Hence we get a consistent estimator of 1/θ. This should remind you of the nice
features of the plim operator.
• Multiplying the equation by β̂²_MLE gives: −n β̂_MLE + Σ_{i=1}^n Xi = 0.
(ii) The ML estimators are known to have nice properties: they are consistent, asymptotically normal, and asymptotically efficient.
Recall, we are told that E(Xi ) = β.
Since E(X̄) = (1/n) Σ_{i=1}^n E(Xi) = β, we establish that the MLE is unbiased as well.
Using the law of large numbers we know plim(X̄) = E(Xi ). Since E(Xi ) = β this
establishes that the MLE is consistent.
Similarly, compute the estimated probability of graduating within five years for a student athlete (hsGPA = 3.0, SAT = 1,200, study = 5):
(i) The OLS (LPM) results give the following regression line:
\widehat{Pr}(y = 1 | x) ≡ ŷ = .890 + .735 regprc − .845 ecoprc.
Evaluating at average prices ($.90 for regular and $1.10 for ecologically produced
apples) this yields a predicted probability equal to:
0.890 + .735 × .90 − 0.845 × 1.10 = 0.622.
(ii) The probit results give the following (nonlinear) regression line:
\widehat{Pr}(y = 1 | x) = Φ(1.088 + 2.029 regprc − 2.344 ecoprc).
Evaluating at average prices this yields a predicted probability equal to:
Φ(1.088 + 2.029 × .90 − 2.344 × 1.10) = Φ(0.336) = 0.633
using Table G.1.
Similarly, the logit results give the following (nonlinear) regression line:
\widehat{Pr}(y = 1 | x) = Λ(1.760 + 3.283 regprc − 3.792 ecoprc)
where Λ(z) = exp(z)/[1 + exp(z)].
Evaluating at average prices this yields a predicted probability equal to:
Λ(1.760 + 3.283 × .90 − 3.792 × 1.10) = Λ(0.5435) = exp(.5435)/(1 + exp(.5435)) = .633
which is the same as (or typically very similar to) the probit result.
(iii) Using the OLS (LPM) results, the effect of reducing ecoprc by $.10 equals:
(iv) To obtain the effect using the logit model we can use:
(v) For the OLS (LPM) results the effect remains the same. The effect is constant. The
effect of reducing ecoprc with a further $.10 equals:
For logit/probit the effect will not be the same. We should expect a diminishing effect. Easiest is to compare the predicted probabilities:
(i) We are asked to test H0 : βecoprc = 0 against H1 : βecoprc < 0 using the OLS B and
probit B results.
For inference in the LPM setting (OLS) we have to use robust standard errors as heteroskedasticity is present (violation of MLR.5). Specifically, the LPM, which assumes:
p(x) = β0 + β_regprc regprc + β_ecoprc ecoprc + · · ·
exhibits heteroskedasticity as:
Var(ecobuy | x) = p(x)(1 − p(x)) ≠ constant.
For inference in the probit setting, we rely on the asymptotic properties of our ML
estimators and use the reported (asymptotic) standard errors.
In both cases we use the asymptotic t test statistic (also called z statistic):
z = β̂_ecoprc / se(β̂_ecoprc) ∼ Normal(0, 1) asymptotically, under H0.
Using a 5% level of significance we should reject if z < −1.645.
For LPM, we obtain −.803/.106 = −7.58; for probit, we obtain
−2.267/.321 = −7.06. In both settings we find strong evidence that βecoprc < 0.
Interpretation of the estimates:
• LPM [β̂_ecoprc = −.803]: A $.10 increase in ecoprc will reduce the probability of buying an ecofriendly apple by .0803, i.e. an 8.03 percentage point drop.
• Probit [β̂_ecoprc = −2.267]: The parameter does not have a direct (easy) interpretation. All we can say is that an increase in ecoprc will reduce the probability of buying an ecofriendly apple.
(ii) We are asked to test:
H0 : β_faminc = 0, β_hhsize = 0, β_educ = 0, and β_age = 0 against
H1 : at least one non-zero.
(iii) For testing multiple linear restrictions for the LPM we use the robust F test (it cannot be obtained using the results provided) because of the heteroskedasticity. Here F = 4.24. Under H0: F ∼ F_{4, 653}; the degrees of freedom of the numerator is 4 (the number of restrictions) and of the denominator n − k − 1 = 653.
As the p-value is small (< .001), we should reject the null hypothesis, finding
evidence of the joint significance of the nonprice effects on the willingness to buy
ecologically produced apples after controlling for prices.
(iv) For testing multiple linear restrictions for the probit model we use the LR test.
The test statistic can easily be obtained given the results:
LR = 2(log L_ur − log L_r) = 2(−316.60 − (−407.60)) = 182.0.
Under H0 its (asymptotic) distribution is χ²_4 (4 restrictions).
Comparing it to the 5% critical value of the χ²_4 distribution (9.49) we reject the
null that nonprice variables have no impact on the willingness to buy ecologically
produced apples after controlling for prices.
12.4. Overview of chapter
Response probability
Log-likelihood function
Asymptotic normality
Asymptotic efficiency
describe the linear probability model (LPM) and discuss the interpretation of
parameters
explain the benefits of using a nonlinear model for binary response models
specify the logit model and the log-likelihood function for its estimation by MLE
specify the probit model and the log-likelihood function for its estimation by MLE
conduct hypothesis tests using logit and probit models (in particular use the z test
and the likelihood ratio (LR) test)
use statistical software to estimate binary choice models (linear and nonlinear)
using real-world data.
12.5 Test your knowledge and understanding

12.1E Suppose you are given a random sample X1, . . . , Xn from a distribution whose
probability density function (pdf) is defined by:
According to this distribution, E(Xi) = 1/λ and Var(Xi) = 1/λ², for i = 1, . . . , n.
(a) Show that the maximum likelihood estimator of λ is 1/X̄, where X̄ = (1/n) Σ_{i=1}^n Xi.
pcnv is the proportion of prior arrests that led to a conviction, avgsen is the
average sentence served from prior convictions, tottime is the months spent in
prison since age 18 prior to 1986, ptime86 is months spent in prison in 1986,
qemp86 is the number of quarters the man was legally employed in 1986, while
black and hispan are two race dummies (white the excluded dummy). In
parentheses, the robust standard errors are reported for OLS and the (asymptotic)
standard errors for MLE.
(a) When estimating the parameters by OLS, we are using the Linear Probability
Model (LPM).
i. Explain why using OLS when the dependent variable is binary implies we
use the Linear Probability Model.
ii. Explain why we should report heteroskedasticity-robust standard errors.
iii. Using OLS A: Interpret the coefficient on the variable black and discuss
whether it is significant. Clearly indicate the null and alternative
hypothesis, the test statistic and the rejection rule.
iv. Using OLS B: What is the estimated effect of an increase in pcnv from
0.25 to 0.75, holding everything else constant, on the probability of arrest?
(b) It is argued that the linear probability model may not be appropriate for
explaining the binary variable arr86 and a logit regression model has been
estimated.
i. Explain how the logit estimates are obtained.
Hint: You may recall that for logit model A we specify:
Pr(arr86_i = 1 | Xi) = Λ(β0 + β1 pcnv_i + β2 avgsen_i + · · · + β6 black_i + β7 hispan_i)
where Λ(z) = 1/(1 + exp(−z)) and Xi denotes the explanatory variables.
ii. Using logit A: What interpretation, if any, can you give to the coefficient on the variable black? Discuss whether it is significant. Clearly indicate the null and alternative hypothesis, the test statistic and the rejection rule.
iii. Discuss whether black and hispan are jointly significant using the logit
model results. Clearly indicate the null and alternative hypothesis, the
test statistic and the rejection rule.
iv. Using logit B: Discuss how the marginal effects (reported in the last
column), evaluated at the mean values of the explanatory variables, are
obtained. Give a brief comment as to how they compare to the marginal
effects of the associated LPM.
v. Using logit B: What is the estimated effect of an increase in pcnv from
0.25 to 0.75 for a white man, with characteristics avgsen=1, tottime=1,
ptime86=0 and qemp86=2 on the probability of arrest? A clear
explanation of what calculations are required is sufficient.
Step 3: Obtain λ̂_MLE by solving the FOC:
The derivative is:
d log L/dλ = n/λ − Σ_{i=1}^n Xi.
λ̂_MLE should solve:
n/λ̂_MLE − Σ_{i=1}^n Xi = 0.
Hence λ̂_MLE = 1/X̄ as required.
Comment: Make sure you are comfortable with the mathematics fundamentals,
such as:
In general:
E(1/X̄) ≠ 1/E(X̄)   (Jensen's inequality).
Hence we need to conclude that the ML estimator in general is not unbiased as the
right-hand side equals λ.
The estimator is consistent, that is, plim(λ̂_MLE) = λ.
By the law of large numbers we know plim(X̄) = E(Xi) = 1/λ. By the properties of the plim operator, moreover:
plim(1/X̄) = plim(1)/plim(X̄) = 1/(1/λ) = λ.
Comment: You may want to recall Jensen's inequality, which reveals that, in general:
E(g(X)) ≠ g(E(X))
for a non-linear function g of the random variable X. Forgetting this is a mistake often made in examinations.
12.2E (a) i. The reason for this is that with a binary dependent variable Y, we need to interpret its conditional expectation E(Y | X) as Pr(Y = 1 | X), where X denotes the explanatory variables. OLS assumes that this probability is linear in the parameters, i.e. we have:
where:
Solving the FOC will give the MLE parameter estimates. Intuitively, we
choose the βs so as to ensure that for individuals who are arrested their
predicted probability of arrest is high.
ii. Unlike the LPM, the β parameters have no easy interpretation. Only the sign of the β parameters is interpretable, where a positive sign indicates that, ceteris paribus, black men have a higher probability of arrest than white men.
We use the z statistic to test the significance of βblack , as our ML
estimator is only asymptotically normal. Here:
z = β̂_black / se(β̂_black) ∼ Normal(0, 1) under H0 asymptotically.
The realisation of our test statistic equals .823/.117 = 7.03, which
indicates that it is statistically significant at the 1% level (critical value
equal to 2.576).
iii. To test the joint hypothesis H0 : βblack = 0 and βhispan = 0 against the
alternative H1 that at least one is non-zero, we will use the likelihood
ratio (LR) test. The test statistic is:
LR = 2(log L_ur − log L_r) ∼ χ²_2 asymptotically, under H0
where log Lur is the value of the log-likelihood of the unrestricted model
(Model A), and log Lr is the value of the log-likelihood of the restricted
model (Model B).
The realisation of our test statistic equals:
2(−1512.35 − (−1541.24)) = 57.78
which indicates that at any reasonable level of significance we should reject the null (i.e. they are jointly significant). At the 1% level of significance the critical value equals 9.21.
iv. The marginal effects of interest describe how P r(arr86 i = 1 | Xi ) changes
as a result of the explanatory variables. For continuous variables that
means, for example:
∂Pr(arr86_i = 1 | Xi)/∂pcnv = f(z_i) · β1, with f(z) = dΛ(z)/dz = e^{−z}/(1 + e^{−z})²
and:
z_i = β0 + β1 pcnv_i + β2 avgsen_i + · · · + β6 black_i + β7 hispan_i.
The marginal effects therefore depend on the characteristics of the
individual. The marginal effects reported in the table are for an individual
with average characteristics:
z̄ = β0 + β1 \overline{pcnv} + β2 \overline{avgsen} + · · · + β6 \overline{black} + β7 \overline{hispan}.
Whether this individual exists is another matter. Indeed, it may be more
interesting to report the average of these marginal effects over all
individuals:
(1/n) Σ_{i=1}^n ∂Pr(arr86_i = 1 | Xi)/∂pcnv = (1/n) Σ_{i=1}^n f(z_i) · β1.
The marginal effects for the average person are quite comparable to those
obtained by the LPM.
with:
Λ(z) = 1/(1 + exp(−z)).
This indicates a decrease in the probability of arrest of .087, that is, an 8.7 percentage point reduction.
Comment (not asked): The marginal effect is different for different individuals, unlike in the LPM setting where the estimated effect of an 8.1 percentage point reduction (see (a) iv.) is the same for all individuals.
Chapter 13
Regression analysis with time series data
13.1 Introduction
understand essential differences between time series data and cross-sectional data
describe a general form of finite distributed lag (FDL or just DL) models, interpret
their coefficients and explain how to express immediate and cumulative effects on
the response variable
discuss the role of multicollinearity in FDL models and its consequences for the
precision of estimates
explain which conditions guarantee that the usual OLS variance formulas are valid
in models with regressors correlated across time
explain which conditions guarantee the consistency of OLS in regressions with time
series data, formulate the contemporaneous exogeneity assumption, and describe
how it is different from the strict exogeneity assumption
13.2. Content of chapter
In Section 10.1 we are introduced to the use of time series data, where special attention
needs to be given to the substantial differences between time series data and
cross-sectional data.
The first key aspect that distinguishes time series data from cross-sectional data is that
there is a clear temporal ordering. The temporal order indicates that observations
should not be arbitrarily reordered as for many purposes we must know that the data
for the time period t immediately precedes the data for the time period t + 1. When
analysing time series data in the social sciences, we also must recognise that the past
can affect the future, but not vice versa.
The second key aspect is that time series data are almost always correlated across time
(that is they exhibit dependence) sometimes even strongly. This means that we can no
longer rely on the assumption of random sampling (MLR.2) commonly made when
using cross-sectional data.
As we will discuss in the final chapter of this course, particular features of time series
data require special attention. First, when we collect time series at monthly or quarterly
frequencies they can exhibit seasonality. Examples of time series data with seasonality
include (and are not limited to) Christmas effect on expenditures, effectiveness of
fertiliser on production, and ice cream sales. Second, many time series processes exhibit
trends. Examples of time series with trends include (and are not limited to) output per
labour hour, and nominal imports (see Wooldridge). These features (nonstationarity)
will be relegated to the final chapter of the subject guide. For ease of exposition, we will
ignore these issues for now.
The observation on the time series variable y made at date t is denoted yt . The interval
between observations, that is, the period of time between observation t and observation
t + 1, is some unit of time such as weeks, months, quarters, years etc. For instance, if the
unit is years and t = 2010, then t + 1 = 2011, t + 2 = 2012, . . . , t + 25 = 2035, and so on.
Special terminology and notation are used to indicate future and past values of y. The
value of y in the previous period is called its first lagged value or its first lag, and is
denoted yt−1 . Its jth lagged value (or its jth lag) is its value j periods ago, which is
yt−j . Similarly, yt+1 denotes the value of y one period into the future.
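Creating lags and leads of a series is straightforward in R; a minimal sketch with illustrative values:

# Sketch: constructing the first lag and first lead of a time series y.
y       <- c(6.9, 7.2, 7.4, 7.1, 6.3)   # illustrative values
y_lag1  <- c(NA, y[-length(y)])         # y_{t-1}: shift forward, NA in the first period
y_lead1 <- c(y[-1], NA)                 # y_{t+1}: shift back, NA in the last period
cbind(y_lag1, y, y_lead1)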
Activity 13.1 Take the MCQs related to this section on the VLE to test your
understanding.
Activity 13.2 In this activity, we will use unemployment data for the period
1992–2001 given in the table below:
year unem
1992 6.9
1993 7.2
1994 7.4
1995 7.1
1996 6.3
1997 6.2
1998 6.0
1999 5.9
2000 5.8
2001 6.1
(i) What are the values of unem t for t = 1992, 1993, 1999?
(ii) What are the values of unem t−1 for t = 1992, 1993, 1999?
(iii) What are the values of unem t−2 for t = 1992, 1993, 1999?
(iv) What are the values of unem t+1 for t = 1992, 1993, 1999?
In Section 10.2 in Wooldridge two time series models are discussed that can easily be
estimated by OLS: the static model and the distributed lag (DL) model. Here, we will
briefly introduce another common model (relegated to Section 11.2 by Wooldridge),
where we allow lagged dependent variables to affect the dependent variable as well, the
so-called autoregressive distributed lag (ADL) model. You need to be able to interpret the results from such regressions in empirical examples. In particular, we will need to
distinguish between short-run and long-run effects.
Static model
The simplest model relates the outcome on the dependent variable at time t, yt , to one
or more explanatory variables dated at the same time. A static model is used to
estimate a contemporaneous relation (hence, the name ‘static model’).
With just one explanatory variable zt , we have:
yt = β0 + β1 zt + ut
such as the static Phillips curve example in Wooldridge. It cannot capture effects that
take place with a lag. The effect is instantaneous (contemporaneous).
In finite distributed lag models, the explanatory variables are allowed to influence the
dependent variable with a time lag (see also Example 10.4 in Wooldridge where the
effect of personal exemption on fertility rates is considered). Allowing for a two-year effect of the personal exemption (pe) on the general fertility rate (gfr):
gfr_t = α0 + δ0 pe_t + δ1 pe_{t−1} + δ2 pe_{t−2} + ut.
The fertility rate example is a special case of a finite distributed lag model with two
lags (in addition to the contemporaneous variable):
yt = α0 + δ0 zt + δ1 zt−1 + δ2 zt−2 + ut .
In such a model, we think that a change in a variable, z, today, can affect y up to two
periods into the future, denoted DL(2). To see this, consider the following.
We can write this equation from the perspective of period t + 1 (one period ahead):
yt+1 = α0 + δ0 zt+1 + δ1 zt + δ2 zt−1 + ut+1.
From the perspective of period t + 2 (two periods ahead) this equation is:
yt+2 = α0 + δ0 zt+2 + δ1 zt+1 + δ2 zt + ut+2.
Such models are good for estimating lagged effects of z (say, a particular policy).
For example, we may be interested in the effect of minimum wage on employment.
Assuming that we have monthly data on employment and the minimum wage, then we
may expect that the effect of a change in the minimum wage will take several months to
have its full effect on employment. We may want to specify an FDL model here with
more than two lags.
In general, we can use an FDL of order q, denoted DL(q) (see Wooldridge). As a practical matter the choice of q can be hard, and it is often dictated by the frequency of the data. With annual data, q is usually small. With monthly data, q is often chosen as 12 or 24 or even higher, depending on how many months of data we have.
Under some assumptions we can use an F test to see if additional lags are jointly
significant.
The slope parameters in our DL model are called the distributed lag parameters which
you need to be able to interpret. In particular, for our DL(2) model:
yt = α0 + δ0 zt + δ1 zt−1 + δ2 zt−2 + ut .
Impact Propensity = δ0 .
The sum of all distributed lag coefficients gives us the answer to the following
thought experiment. Suppose the level of z increases permanently today (for
example, the minimum wage increases by $1.00 per hour and stays there). The sum
provides the (ceteris paribus) change in y after the change in z has passed through all q time periods. It is called the long-run propensity (LRP); for our DL(2) model, LRP = δ0 + δ1 + δ2.
You are advised to read the discussion in Example 10.4 in Wooldridge carefully. It provides details to show how you can obtain the standard error of the estimated long-run propensity, δ̂0 + δ̂1 + δ̂2, using a simple reparameterisation of the model.
We typically can obtain very precise (low variance) parameter estimates of the long
run propensity (LRP). Estimates of the distributed lag parameters individually are
typically not estimated very precisely due to the problem of near multicollinearity:
if {zt } is slowly moving over time, then zt , zt−1 , and zt−2 can be highly correlated.
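A sketch of the reparameterisation trick in R, for a generic DL(2) regression (y and z are hypothetical, time-ordered numeric vectors of equal length):

# Sketch: standard error of the LRP via reparameterisation. Substituting
# delta0 = LRP - delta1 - delta2 into the DL(2) model gives
# y_t = a0 + LRP*z_t + delta1*(z_{t-1} - z_t) + delta2*(z_{t-2} - z_t) + u_t.
n     <- length(z)
zlag1 <- c(NA, z[1:(n - 1)])       # z_{t-1}
zlag2 <- c(NA, NA, z[1:(n - 2)])   # z_{t-2}

lrp_fit <- lm(y ~ z + I(zlag1 - z) + I(zlag2 - z))
summary(lrp_fit)   # the coefficient on z is the LRP, with its standard error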
A few remarks
Notice that if z increases by one unit today, but then falls back to its original level
in the next period (that is, we are dealing with a transitory shock), the lag
distribution tells us how y changes in each future period. Eventually y falls back to
its original level with a temporary change in z.
With a permanent change in z, y changes to a new level, and the change from
the old to the new level is the LRP.
We can, of course, have more than one variable appear with multiple lags. For
example, a simple equation to explain how the Federal Reserve Bank in the U.S.
changes the Federal Funds Rate is:
ffrate t = α0 + δ0 inf t + δ1 inf t−1 + δ2 inf t−2
+ γ0 gdpgap t + γ1 gdpgap t−1 + γ2 gdpgap t−2 + ut
where inf is inflation rate and gdpgap is the GDP gap (measured as a percentage).
FDLs are often more realistic than static models, and they can do better
forecasting because they account for some dynamic behaviour. Nevertheless, they
are not usually the most preferred for forecasting because they do not allow lagged
outcomes on y to directly affect current outcomes.
With time series data, there is also the possibility of allowing past outcomes on y to
affect current y. The simplest model is:
yt = β0 + β1 yt−1 + ut
which is a simple regression model for time series data where the explanatory variable
at time t is yt−1 (see also Section 11.2 in Wooldridge). It is called an autoregressive
model of order 1, or AR(1). This simple model typically does not have much
economic or policy interest because we are just using lagged y to explain current y.
Nevertheless, autoregressive models can be remarkably good at forecasting, even
compared with complicated economic models. It is easy to add other explanatory
variables along with a lag. For example, we can consider:
yt = β0 + β1 yt−1 + β2 zt + ut .
This is an example of the so-called Autoregressive Distributed Lag (ADL) model,
that combines autoregressive and finite distributed lag models. In this particular
example, it is an ADL(1, 0) model: the dependent variable y depends on one lag of
itself; y also depends on the current value of an explanatory variable z (0 lags).
Statistically, models with lagged dependent variables are more difficult to study.
For one, the OLS estimators are no longer unbiased under any assumption, so
large-sample analysis is very important. (As we will discuss, large-sample analysis
is critical for static and FDL models too, but at least a finite-sample analysis
makes sense sometimes.)
Activity 13.3 Take the MCQs related to this section on the VLE to test your
understanding.
Activity 13.4 Consider the following model with a lagged dependent variable:
where inf_t is the inflation rate in period t and unem_t is unemployment in period t.
Let t = 1 denote the first period for which the full set of regressors is available (with
this convention, inf 0 and unem0 are observed). Assume that ut is independent of all
lagged inf t−s , s ≥ 1, and all lagged unem t−s , s ≥ 1.
Discuss whether in this model it is possible for the following condition to be satisfied:
for any period t. (If this condition holds, it implies that the unobservable ut in period t is uncorrelated with all current, past and future values of the regressors.)
In Section 10.3, the assumptions required to establish the same kind of finite-sample
properties (unbiasedness, variance calculations, normality) in the time-series setting as
we had with the cross-sectional setting are discussed.
Let the time series process {(yt, xt1, . . . , xtk) : t = 1, . . . , T} satisfy the linear model:
yt = β0 + β1 xt1 + · · · + βk xtk + ut.
unbiased are given by Assumptions TS.1 to TS.3 (see Wooldridge).
As we can no longer rely on random sampling (MLR.2), the assumption that guarantees
uncorrelatedness between the errors and regressors now explicitly needs to rule out any
correlation between the error in period t, ut , and the regressors xsj , for j = 1, . . . , k in
any period s even when the time periods s and t do not match up (s could coincide with
t but could also be in the past or in the future of period t). This assumption is called
the assumption of strict exogeneity.
A weaker assumption, which we will label (TS.3′), only rules out correlation between the error in period t and the regressors xtj, for j = 1, . . . , k, i.e. when s = t. Random sampling in the cross-sectional setting automatically rules out correlations between ut and xsj, for j = 1, . . . , k, with s ≠ t.
1. TS.3 can never be truly satisfied when a lagged dependent variable, yt−1 , is
included among the regressors, xtj .
To see this: let yt = β0 + β1 yt−1 + ut . By lagging this equation one period, we see
that yt−1 is a function of ut−1 . Hence ut−1 cannot be mean independent of yt−1 , and
equivalently ut cannot be mean independent of yt (we can use any period). The requirement E(ut | y1, . . . , yn) = 0 therefore necessarily fails.
2. Consider the finite distributed lag model: yt = α0 + δ0 zt + δ1 zt−1 + δ2 zt−2 + ut.
The Assumption TS.3 requires that ut is not only uncorrelated with zt , zt−1 and
zt−2 , but also needs to be uncorrelated with all past and future outcomes on z.
Uncorrelatedness of ut with future values of z requires that zt+1 does not react to
changes in ut (that is, to changes in the unobserved part of yt ). Whether this may
hold or not, depends on the reason why zt changes over time. For example, say we
are interested in estimating the effect of a change in minimum wages on employment: strict exogeneity then requires that the minimum wage does not respond to past shocks to employment.
To permit us to perform exact inference (use t and F tests) we need to add a normality
assumption again. For inference purposes, it is important to recognise that the presence
of heteroskedasticity or serial correlation will invalidate the simple formula for the
variance and associated standard error and we should use robust (HAC –
heteroskedasticity and autocorrelation consistent) standard errors if we are concerned
about the validity of TS.4 or TS.5 (we will discuss this in more detail later).
It is important to distinguish three different sets of assumptions that may be imposed in time series regression, and what each delivers:
1. Under TS.1–TS.3, the OLS estimator is unbiased. The strict exogeneity assumption
for the explanatory variables (TS.3) is key but it rules out some interesting cases.
2. When we add homoskedasticity (TS.4) and no serial correlation (TS.5)
assumptions, OLS is BLUE and the usual OLS variance formulae hold.
Unfortunately, in models where Assumption TS.3 (strict exogeneity) has a chance
of holding – namely, static and finite distributed lag models – serial correlation is
often a problem.
3. If we add normality (TS.6), then exact inference is possible. The classical linear
model assumptions for time series data are much more restrictive than those for
cross-sectional data – in particular, strict exogeneity (TS.3) and no serial
correlation (TS.5) assumptions can be unrealistic. Nevertheless, the CLM
framework is a good starting point for many applications.
Activity 13.6 Take the MCQs related to this section on the VLE to test your
understanding.
As we discussed, the assumptions used to derive the finite sample properties of the OLS
estimator in the time series context are rather restrictive. Violation of strict exogeneity
(TS.3) and serial correlation (TS.5) are commonplace in time series models. We also
relied on the normality assumption (TS.6) for statistical inference.
As in the cross-sectional setting, we can expect that weaker assumptions are needed if
the sample size is large. In the cross-sectional case, OLS inference is approximately
valid even without a normality assumption, and we also know how to adjust our
statistics for the presence of heteroskedasticity of unknown form (please recall the use of
robust standard errors). In the cross-sectional setting, we relied on random sampling to
justify these results. As we can no longer rely on random sampling when using time
series data, we must impose alternative assumptions on the underlying time series
processes ({xt } and {ut }).
These concepts are discussed in Section 11.1 in Wooldridge. The importance of these
concepts is that with weak dependence and stationarity, we can again apply the law of
large numbers and central limit theorem to justify the OLS approximate inference, here:
(Ȳ_T − µ)/(σ/√T) ∼ N(0, 1) asymptotically, where µ = E(Yt) and σ² = Var(Yt).
A time series process is stationary if its stochastic properties and its temporal
dependence structure do not change over time. For this course it
will suffice to be familiar with a weaker form of stationarity, called covariance
stationarity. Covariance stationarity requires that the first two moments of the
stochastic process – that is, the mean and variance – do not change (are identical) over
time. Moreover, it requires that the covariance between xt and xt+h depends only on the
distance in time between the two terms, h, and not on the location of the initial time
period, t. This, of course, implies that the correlation between xt and xt+h also depends
only on h:
Covariance stationarity: A stochastic process {xt : t = 1, 2, . . .} is covariance stationary if:
(i) E(xt) is constant
(ii) Var(xt) is constant
(iii) Cov(xt, xt+h) depends only on the distance in time h, not on the location in time t.
that is, Corr (xt , xt+h ) should vanish eventually or asymptotically. Weak dependence is
also referred to as asymptotic independence.
Two leading examples discussed in Section 11.1b that describe quite distinct dependence structures in time series processes are the moving average process of order 1, MA(1), and the autoregressive process of order 1, AR(1). You should be able to define these processes and be able to analyse covariance stationarity and weak dependence properties for them.
these processes and be able to analyse covariance stationarity and weak dependence
properties for them. This discussion will be a bit mathematical but it is an extremely
useful tool not only for econometrics but also for any science talking about dynamics
(macroeconomics, finance, biology, physics etc.). Before the journey, let's recall: if X and Y are independent random variables, then Cov(X, Y) = 0.
MA(1) process
yt = et + θet−1, for t = 1, 2, . . .
where {et : t = 0, 1, . . .} is an i.i.d. sequence with zero mean and variance σ².
(i) Mean:
E(yt) = E(et) + θE(et−1) = 0 for all t.
(ii) Variance (using the independence of et and et−1):
Var(yt) = Var(et + θet−1) = Var(et) + θ²Var(et−1) = (1 + θ²)σ² for all t.
To calculate the covariance Cov(yt, yt+h), the key is to note that:
Cov(et, es) = E(et es) = 0 if t ≠ s, and Cov(et, et) = E(et²) = σ²
because {et} is i.i.d.
To compute the covariance between yt and yt+1, we note:
Cov(yt, yt+1) = E(yt yt+1) = E[(et + θet−1)(et+1 + θet)]
= E(et et+1) + θE(et²) + θE(et−1 et+1) + θ²E(et−1 et)
= θσ².
Similarly, for Cov(yt, yt+2), we obtain:
Cov(yt, yt+2) = E(yt yt+2) = E[(et + θet−1)(et+2 + θet+1)]
= E(et et+2) + θE(et et+1) + θE(et−1 et+2) + θ²E(et−1 et+1)
= 0.
Analogously, we can show that Cov (yt , yt+h ) = 0 for any h ≥ 2.
We can conclude from these derivations that {yt } is weakly dependent. In particular:
Corr(yt, yt+1) = Cov(yt, yt+1)/√(Var(yt) Var(yt+1)) = θσ²/((1 + θ²)σ²) = θ/(1 + θ²)
and:
Corr(yt, yt+h) = Cov(yt, yt+h)/√(Var(yt) Var(yt+h)) = 0, for h ≥ 2.
As variables in the process that are more than one period apart are uncorrelated, we
have established weak dependence.
All MA(1) processes (for any value of θ) are stationary and weakly dependent. In fact, this result generalises to all finite-order MA(q) processes.
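These theoretical autocorrelations are easy to verify by simulation; a sketch in R with θ = 0.5:

# Sketch: simulate an MA(1) process and check Corr(y_t, y_{t+1}) = theta/(1 + theta^2).
set.seed(42)
theta <- 0.5
e <- rnorm(100000)                  # i.i.d. innovations with sigma^2 = 1
y <- e[-1] + theta * e[-length(e)]  # y_t = e_t + theta * e_{t-1}

acf(y, lag.max = 3, plot = FALSE)   # sample autocorrelations: about 0.4, 0, 0
theta / (1 + theta^2)               # theoretical lag-1 value: 0.4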
AR(1) process
Var(yt) = ρ^{2t} Var(y0) + ρ^{2(t−1)} Var(e1) + · · · + ρ² Var(et−1) + Var(et)
To ensure the variance of yt exists (as t grows), we will need to assume that |ρ| < 1. With |ρ| < 1, ρ^s gets closer to zero as s increases.
In order for the AR(1) process {yt} to be covariance stationary we need to assume |ρ| < 1; this is required to ensure the variance exists, as discussed above.
The easiest way to derive the mean, variance and covariances of an AR(1) process
makes use of the properties of a covariance stationary process.
As Var(yt) = Var(yt−1) by covariance stationarity and Var(et) = σ², we have Var(yt) = ρ²Var(yt) + σ², which yields Var(yt) = σ²/(1 − ρ²). For the covariance:
Cov(yt, yt+1) = Cov(yt, ρyt + et+1) = ρVar(yt) + Cov(yt, et+1) = ρVar(yt).
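Iterating the same argument gives the standard result Corr(yt, yt+h) = ρ^h. A short simulation sketch in R (ρ = 0.8) confirms these covariance-stationary moments:

# Sketch: simulate a stationary AR(1) and check Var(y_t) = sigma^2/(1 - rho^2)
# and Corr(y_t, y_{t+h}) = rho^h.
set.seed(42)
rho <- 0.8
y <- as.numeric(arima.sim(list(ar = rho), n = 100000))  # innovations have sigma^2 = 1

var(y); 1 / (1 - rho^2)            # both approximately 2.78
acf(y, lag.max = 3, plot = FALSE)  # approximately 0.8, 0.64, 0.512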
Activity 13.7 Take the MCQs related to this section on the VLE to test your
understanding.
yt = et + θ1 et−1 + θ2 et−2 .
(iii) Find Corr (yt , yt+h ) for any h ≥ 1. Is this process weakly dependent?
As discussed, in order to analyse the large sample properties of the OLS estimator with
time series data we need to introduce the assumption that our stochastic processes
{(yt, xt) : t = 1, 2, . . .} are stationary and weakly dependent (TS.1′), which permits us to
rely on the usual LLN and CLT. The latter is important as we also no longer will rely
on a normality assumption of the errors.
In Section 11.2 in Wooldridge it is established that the consistency of our OLS
estimator in time series models does not require the strong assumption of strict
exogeneity (TS.3), but instead permits us to rely on the more reasonable assumption of
contemporaneous exogeneity (TS.3′). You should be able to show this result in a simple
linear regression setting (using the plim operator and standard LLN arguments).
Let us briefly look at Assumption TS.3′ here.
This assumption says that all regressors are contemporaneously exogenous, i.e. we have:
E(ut | xt ) = 0
where xt = (xt1 , . . . , xtk ).
Assumption TS.3′ implies uncorrelatedness between the errors (ut) and the contemporaneous regressors (xt).
Contemporaneous exogeneity does not restrict correlations between the error and the explanatory variables across other time periods. For example, there may be correlation between ut and zt+1.
The advantage of TS.3′ is that it allows for lagged dependent variables and explanatory variables that react to past changes in y.
Activity 13.9 Take the MCQs related to this section on the VLE to test your
understanding.
yt = ρyt−1 + et
where {et : t = 1, 2, . . .} is an i.i.d. sequence with zero mean and variance σ 2 and
|ρ| < 1.
The starting point in the sequence is y0 (at t = 0). We also assume that y0 is
independent of {et } and E(y0 ) = 0.
Let ρ̂ denote the OLS estimator of ρ in this model.
(i) Suppose the ice cream seller uses advertising to smooth the effect of weather, so
days with high advertising are also cooler days and low advertising days are
hotter. Discuss whether assumptions TS.3 and TS.3′ hold. Are the OLS
estimators of β0 and β1 unbiased? Are these OLS estimators consistent?
(ii) Now suppose that advertising in period t cannot respond to weather in the
same period. In other words, knowing the advertising level in period t tells us
nothing about the weather in that period, on average. This seems reasonable as
advertising takes time to roll out. Flyers need to be designed and printed and
someone must be hired to hand them out. But advertising in period t + 1 could
be correlated with weather in t. It is also possible that advertising in period t is
correlated with weather (or rather beliefs about it) in period t + 1. Discuss
whether assumptions TS.3 and TS.3′ hold. Are the OLS estimators of β0 and β1
unbiased? Are these OLS estimators consistent?
(iii) Now suppose that in addition to the current period t, advertising in future and
past periods s does not tell us anything about weather in period t, so the ice
cream seller cannot, for example, select advertising in period t + 1 to
compensate for the effect of cool weather in period t. Discuss whether
assumptions TS.3 and TS.3′ hold. Are the OLS estimators of β0 and β1
unbiased? Are these OLS estimators consistent?
Activity 13.12 Exercise 11.6 parts (i) and (ii) from Wooldridge.
13.3. Answers to activities
(iii) The value of unem t−2 for t = 1992 is not available (there is no data in the
table for 1990).
The value of unem t−2 for t = 1993 is not available (there is no data in the
table for 1991).
In this condition the mean of the error ut is conditioned on the regressors across all the time periods. One implication of this condition is that the unobservable ut in period t is uncorrelated with all current, past and future values of the regressors.
Note that inf_t is among inf_0, . . . , inf_t, . . . , inf_n. Since in the equation for period t the variable inf_t is the dependent variable and is directly affected by ut, ut cannot possibly be mean independent of inf_t.
Moreover, due to the presence of the inf lag on the right-hand side, the error ut, through its effect on inf_t, will also affect all future values inf_{t+s} (to see this, write the regression equation for periods t + 1, then t + 2, etc.) and, thus, cannot possibly be mean independent of inf_{t+1}, inf_{t+2}, . . . either.
Another way to see that (*) cannot possibly hold is to note that (*) implies that ut and
inf t are uncorrelated. However, using our model we can see that:
yt = α0 + γ0 zt + (γ0 + γ1 + γ2)zt−1 + (γ0 + 2γ1 + 4γ2)zt−2 + (γ0 + 3γ1 + 9γ2)zt−3 + (γ0 + 4γ1 + 16γ2)zt−4 + ut
   = α0 + γ0(zt + zt−1 + zt−2 + zt−3 + zt−4) + γ1(zt−1 + 2zt−2 + 3zt−3 + 4zt−4) + γ2(zt−1 + 4zt−2 + 9zt−3 + 16zt−4) + ut.
(ii) This is suggested in part (i). For clarity, define three new variables:
zt0 = zt + zt−1 + zt−2 + zt−3 + zt−4, zt1 = zt−1 + 2zt−2 + 3zt−3 + 4zt−4, and zt2 = zt−1 + 4zt−2 + 9zt−3 + 16zt−4.
Then, α0, γ0, γ1, γ2 are obtained from the OLS regression of yt on zt0, zt1, and zt2,
where t = 1, 2, . . . , T . (We can agree to let t = 1 denote the first time period where
we have a full set of regressors.)
(iii) The unrestricted model is the original equation, which has six parameters (α0 and
the five δj s). The polynomial distributed lag (PDL) model has four parameters.
Therefore, there are two restrictions on moving from the general model to the PDL
model. (Note how we do not have to actually write out what the restrictions are.)
The df in the unrestricted model is T − 6. We would obtain the unrestricted R-squared, R²ur, from the regression of yt on zt, zt−1, . . . , zt−4, and the restricted R-squared, R²r, from the regression in part (ii). The F statistic is:

F = [(R²ur − R²r)/(1 − R²ur)] × [(T − 6)/2].

Under H0 and the CLM assumptions, F ∼ F2, T−6.
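As an illustration, the following R sketch computes this F statistic for a hypothetical data frame dat with columns y and z (the names and data are illustrative, not from the text):

    # Build the lag structure with embed(); row i contains z_t, z_{t-1}, ..., z_{t-4}
    Z  <- embed(dat$z, 5)                 # columns: z_t, z_{t-1}, ..., z_{t-4}
    y  <- dat$y[5:nrow(dat)]              # align y with the available lags
    z0 <- rowSums(Z)                      # z_t + z_{t-1} + ... + z_{t-4}
    z1 <- Z %*% c(0, 1, 2, 3, 4)          # z_{t-1} + 2 z_{t-2} + 3 z_{t-3} + 4 z_{t-4}
    z2 <- Z %*% c(0, 1, 4, 9, 16)         # z_{t-1} + 4 z_{t-2} + 9 z_{t-3} + 16 z_{t-4}

    ur <- lm(y ~ Z)                       # unrestricted: y on z_t, ..., z_{t-4}
    r  <- lm(y ~ z0 + z1 + z2)            # restricted PDL model

    R2ur  <- summary(ur)$r.squared
    R2r   <- summary(r)$r.squared
    n     <- length(y)                    # usable sample size (T in the text)
    Fstat <- (R2ur - R2r) / (1 - R2ur) * (n - 6) / 2
    pf(Fstat, 2, n - 6, lower.tail = FALSE)   # p-value under F(2, T - 6)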
Since E(yt) = E(et) + θ1 E(et−1) + θ2 E(et−2) = 0 for all t, the expected value of yt does not depend on t.
Next, we look at the variance of yt:
As {et} is a sequence of independent random variables, we can write this (using the properties of the variance) as:

Var(yt) = Var(et) + Var(θ1 et−1) + Var(θ2 et−2)
= Var(et) + θ1² Var(et−1) + θ2² Var(et−2)
= σ² + θ1² σ² + θ2² σ²
= (1 + θ1² + θ2²)σ².
(ii) Let us think about this before jumping into computations. yt is defined as a linear combination of three innovations: the one from the same time period and those from the two previous periods. We can therefore expect the effect of any shock to last for two periods and then disappear completely. That is, we expect Cov(yt, yt+h) = 0 for h ≥ 3, whereas for h equal to one or two it will not be zero.
Let us show this formally, starting with h = 1. To find the covariance between yt and yt+1 we plug in the expressions for yt and yt+1 from the definition of an MA(2) process:

Cov(yt, yt+1) = Cov(et + θ1 et−1 + θ2 et−2, et+1 + θ1 et + θ2 et−1) = θ1 Var(et) + θ1 θ2 Var(et−1) = θ1(1 + θ2)σ².
Similarly, for h = 2:

Cov(yt, yt+2) = Cov(et + θ1 et−1 + θ2 et−2, et+2 + θ1 et+1 + θ2 et) = θ2 Var(et) = θ2 σ².

While, for h = 3:
Cov (yt , yt+3 ) = Cov (et + θ1 et−1 + θ2 et−2 , et+3 + θ1 et+2 + θ2 et+1 ) = 0
Dividing by Var(yt) = (1 + θ1² + θ2²)σ² gives the autocorrelations:

Corr(yt, yt+1) = θ1(1 + θ2)σ² / (1 + θ1² + θ2²)σ² = θ1(1 + θ2)/(1 + θ1² + θ2²)

Corr(yt, yt+2) = θ2 σ² / (1 + θ1² + θ2²)σ² = θ2/(1 + θ1² + θ2²)

Corr(yt, yt+h) = 0 for h ≥ 3.
We conclude that the dependence in the process lasts for only two periods, so the process is indeed weakly dependent.
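A quick way to make these results concrete is to simulate a long MA(2) series in R and compare the sample autocorrelations with the theoretical ones derived above (the θ values below are illustrative):

    # Simulate a long MA(2) process and check its autocorrelations
    set.seed(42)
    theta1 <- 0.5; theta2 <- 0.3
    y <- arima.sim(model = list(ma = c(theta1, theta2)), n = 1e5)

    denom <- 1 + theta1^2 + theta2^2
    c(theta1 * (1 + theta2) / denom, theta2 / denom)  # theoretical rho_1, rho_2
    acf(y, lag.max = 4, plot = FALSE)   # sample ACF: close to rho_1, rho_2, then ~0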
(ii) We need to verify assumptions TS.1–TS.3. If they all hold, then ρ̂ is unbiased.

TS.1 (Linearity): AR(1) is a linear model. In period t the dependent variable is yt and the regressor is yt−1. Note that TS.1 does not require checking stationarity and weak dependence. Thus TS.1 holds.
TS.3 (strict exogeneity), however, fails: the error et is correlated with the value of the regressor in the next period's equation, since:

Cov(yt, et) = Cov(ρyt−1 + et, et) = ρ Cov(yt−1, et) + Cov(et, et) = 0 + Var(et) > 0.

Hence ρ̂ is not unbiased.
(i) In this situation the advertising advert t in period t is correlated with the error ut (which contains weather):

Cov(advert t, ut) ≠ 0.

This implies that neither TS.3 nor TS.3′ holds. Indeed, each of these assumptions would imply uncorrelatedness between advertising today and the weather today, but this is clearly false in the described setting. Since TS.3 is key for the unbiasedness of the OLS estimators and TS.3′ is key for their consistency, we conclude that the OLS estimators are biased and inconsistent.
(ii) In this case:

E(ut | advert t) = 0

but advert t+1 can be correlated with the past error ut and can also be correlated with the future error ut+2.

This means that contemporaneous exogeneity TS.3′ holds whereas strict exogeneity TS.3 does not (indeed, TS.3 does not allow ut to be correlated with future or past advert). The violation of TS.3 immediately implies that the OLS estimators are biased. However, the OLS estimators are consistent if TS.1′ and TS.2′ hold in addition to TS.3′, that is, if the sequences log(sales t) and advert t are stationary and weakly dependent (we already have linearity in the model) and advert t does not take the same value across all time periods in the sample.
(iii) This means that strict exogeneity TS.3 holds. Contemporaneous exogeneity TS.3′ holds too, as it is implied by TS.3.

We can conclude that the OLS estimators are unbiased if TS.1 and TS.2 hold in addition to TS.3; we already have linearity in the model and, thus, would only require that advert t does not take the same value across all time periods in the sample.

Also, the OLS estimators are consistent if TS.1′ and TS.2′ hold in addition to TS.3′, that is, if the sequences log(sales t) and advert t are stationary and weakly dependent (we already have linearity in the model) and advert t does not take the same value across all time periods in the sample.
(ii) • The t statistic for testing H0: β1 = 1 is now (1.053 − 1)/.039 ≈ 1.36, so under TS.1′–TS.5′, H0: β1 = 1 is no longer rejected against a two-sided alternative unless we use quite a high significance level (which is not conventional).
• But the lagged spread is very significant under TS.1′–TS.5′ (contrary to what the expectations hypothesis predicts): t = .480/.109 ≈ 4.40 for the null that the coefficient on r6t−1 − r3t−1 is zero.
• Based on the estimated equation, when the lagged spread is positive, the
predicted holding yield on six-month T-bills is above the yield on three-month
T-bills (even if we impose β1 = 1), and so we should invest in six-month T-bills.
13.4. Overview of chapter
Stochastic process
Autocorrelation/serial correlation
Static model
Impact propensity
Lag distribution
Strict exogeneity
Contemporaneous exogeneity
understand essential differences between time series data and cross-sectional data
describe a general form of finite distributed lag (FDL or just DL) models, interpret
their coefficients and explain how to express immediate and cumulative effects on
the response variable
discuss the role of multicollinearity in FDL models and its consequences for the
precision of estimates
explain which conditions guarantee that the usual OLS variance formulas are valid
in models with regressors correlated across time
13.5. Test your knowledge and understanding
13.2E Increases in oil prices have been blamed for several recessions in developed
countries. Let GDPt denote the value of quarterly gross domestic product in the US
and let Yt = 100 log(GDP t /GDP t−1 ) be the quarterly percentage change in GDP.
Arguably, oil prices adversely affect the economy only when they jump above their
values in the recent past. Let Ot equal the maximum of the ‘percentage point
difference between oil prices at date t and their maximum value during the past
year’ and ‘zero’. A distributed lag regression relating Yt and Ot , estimated over the
period 1955:Q1–2000:Q4 is:
Ybt = 1.0 −0.055 Ot −0.026 Ot−1 −0.031 Ot−2 −0.109 Ot−3 −0.128 Ot−4
(0.1) (0.054) (0.057) (0.048) (0.042) (0.053)
(a) What is the effective sample size used in the estimation of the above model?
(b) Suppose that oil prices jump 25% above their previous peak value and stay at
this new higher level (so that Ot = 25 and Ot+1 = Ot+2 = · · · = Ot+8 = 0).
What is the predicted effect on output growth for each quarter over the next 2
years?
(c) Construct 95% confidence intervals for your answers to (b).
(d) What is the predicted cumulative change in GDP growth over eight quarters?
(e) The HAC F statistic testing whether the coefficients on Ot and its lags are
zero is 3.49. Are these coefficients significantly different from zero? Explain
your answer. Briefly indicate why a robust F statistic was used instead of the
usual F statistic.
(f) Are any of the coefficients on Ot−j , for j = 0, . . . , 8, individually statistically
significant? What do you think drives the imprecision of the OLS estimators in
this model?
13.2E (a) Overall we have 184 quarterly observations. But because we have eight lags of
variable Ot on the right-hand side, in the regression equation we use 176
observations only (starting with period 9 or, equivalently, Q1 of 1957).
(b) The predicted effect on output growth is δ̂i × 25 for each period ahead:

Period ahead 0: δ̂1 × 25 = −0.055 × 25 = −1.375.
Period ahead 1: δ̂2 × 25 = −0.026 × 25 = −0.65.
Period ahead 2: δ̂3 × 25 = −0.031 × 25 = −0.775.
Period ahead 3: δ̂4 × 25 = −0.109 × 25 = −2.725.
So:

Period ahead 0: [−1.375 − 1.96 × 25 × .054, −1.375 + 1.96 × 25 × .054] = [−4.021, 1.271].
Period ahead 1: [−0.65 − 1.96 × 25 × .057, −0.65 + 1.96 × 25 × .057] = [−3.443, 2.143].
Period ahead 2: [−0.775 − 1.96 × 25 × .048, −0.775 + 1.96 × 25 × .048] = [−3.127, 1.577].
Period ahead 3: [−2.725 − 1.96 × 25 × .042, −2.725 + 1.96 × 25 × .042] = [−4.783, −0.667].
Period ahead 4: [−3.2 − 1.96 × 25 × .053, −3.2 + 1.96 × 25 × .053] = [−5.797, −0.603].
Period ahead 5: [0.95 − 1.96 × 25 × .025, 0.95 + 1.96 × 25 × .025] = [−0.275, 2.175].
Period ahead 6: [0.625 − 1.96 × 25 × .048, 0.625 + 1.96 × 25 × .048] = [−1.727, 2.977].
Period ahead 7: [−0.475 − 1.96 × 25 × .039, −0.475 + 1.96 × 25 × .039] = [−2.386, 1.436].
Period ahead 8: [1.675 − 1.96 × 25 × .042, 1.675 + 1.96 × 25 × .042] = [−0.383, 3.733].
Observation: Most of these intervals contain the value zero, a clear sign of
the imprecision with which these effects are obtained due to the presence of
near multicollinearity.
(d) The predicted cumulative change is given by the sum of the effects in (b):

(−1.375) + (−0.65) + (−0.775) + (−2.725) + (−3.2) + 0.95 + 0.625 + (−0.475) + 1.675 = −5.95.

That is, the predicted cumulative change in GDP growth is a decline of nearly 6%.
Observation: Even though the predicted effect on output growth for each
quarter is estimated very imprecisely, we do expect that we can estimate this
cumulative change precisely.
(e) The HAC F statistic tests the hypothesis:
H0 : δ1 = · · · = δ9 = 0
The fact that we use the HAC F test indicates that we use a robust estimate
of the variances and covariances needed for its computation (robust to the
presence of heteroskedasticity and serial correlation). We use the F
distribution with (9, T − 10) degrees of freedom (9 restrictions and T − 10
degrees of freedom of the unrestricted model). Here T = 46 × 4 − 8 = 176. At
the 5% level of significance our critical value is approximately equal to 1.88.
Since the statistic exceeds this critical value, we reject the null, indicating that
the coefficients are jointly significantly different from zero, that is, we expect
increases in oil prices to influence the output growth for at least 2 years.
(f) As is clear from confidence intervals constructed in (c), none of the coefficients
on Ot−j , for j = 0, . . . , 8, is individually statistically significant at the 5%
significance level. Such imprecision of individual estimators is caused by the
presence of near multicollinearity.
β̂1 is not likely to be unbiased, i.e. E(β̂1) ≠ β1, as E[(Σt yt−1 εt)/(Σt y²t−1)] ≠ 0. Why is that? Conditioning arguments will not help because, even though εt is uncorrelated with the past values of yt, the current and future values of yt will be related to εt. This is a setting where it is clearly unreasonable to assume strict exogeneity.

Observe: If you want to evaluate E[(Σt yt−1 εt)/(Σt y²t−1)], it would be convenient to be able to take the ys out of the expectation operator. To do this we need to condition on y1, . . . , yn−1. Then:

E[(Σt yt−1 εt)/(Σt y²t−1) | y1, . . . , yn−1] = [Σt yt−1 E(εt | y1, . . . , yn−1)]/(Σt y²t−1)

which would only simplify if E(εt | y1, . . . , yn−1) = 0, i.e. under strict exogeneity. Had that been the case, one might have continued to argue that:

E(Σt yt−1 εt) = Σt E(yt−1 εt) = 0.

For consistency, write the OLS estimator as:

β̂1 = β1 + (n⁻¹ Σt yt−1 εt)/(n⁻¹ Σt y²t−1).

Given we have weak dependence, |β1| < 1, we can then use the LLN to get:

plim β̂1 = β1 + E(yt−1 εt)/E(y²t−1).

As E(yt−1 εt) = 0, we have established that plim β̂1 = β1, provided E(y²t−1) ≠ 0; that is, we have proven consistency.
Chapter 14
Autocorrelation in time series regression
14.1 Introduction
describe the consequences of serial correlation in the regression error for the OLS
estimation in models with strictly exogenous or contemporaneously exogenous
regressors
describe the consequences of serial correlation in the regression error for the OLS
estimation in models with lags of the dependent variable as regressors
use statistical software to estimate time series regressions and conduct a test for
serial correlation in regression errors.
14.2 Content of chapter
When discussing the properties of the OLS estimator in the presence of serially correlated errors it is important to establish whether this serial correlation in the errors gives rise to a correlation between the error and the regressors (that is, causes a violation of TS.3 or TS.3′).
Case 1: If serial correlation does not cause a correlation between the error and the regressors, we can still use OLS, but for inference we will need to correct the standard errors. As serial correlation is a violation of the Gauss–Markov assumptions (TS.5), the OLS estimator will not be efficient (BLUE).

Case 2: If serial correlation does cause a correlation between the error and the regressor(s), we should no longer apply OLS. The OLS estimator will be neither unbiased nor consistent, even when stationarity and weak dependence are assumed (TS.1′). A classic example of this is when we have lagged dependent variables as regressors in our model and the errors are serially correlated.
For this model, as long as TS.3 is satisfied, the OLS estimator will be unbiased. If TS.3 is violated but TS.3′ is satisfied, the OLS estimator will be consistent (assuming stationarity and weak dependence are satisfied).

As serial correlation is a violation of the Gauss–Markov assumptions, the estimator will not be BLUE (efficient).
The presence of serial correlation will invalidate the usual OLS statistical inference
(t and F tests, confidence intervals) even in large samples. Intuitively, the issue of
serial correlation is similar to heteroskedasticity in the case of cross-sections and so
our approach will be to use the OLS estimator but to modify the formula for the
OLS variance and use that modified formula for the computation of standard
errors, confidence intervals, t and F test statistics.
The technical details provided in Section 12.2 are not examinable. Let us provide here
the main ideas and intuition around serial correlation-robust inference. Statistical
packages will easily allow us to obtain the heteroskedasticity and autocorrelation
robust (HAC) standard errors. For simplicity, consider the bivariate regression:
yt = β0 + β1 xt + ut , t = 1, . . . , T
and suppose Assumptions TS.1–TS.3 hold. (OLS unbiased and consistent!)
To get Var(β̂1 | X) with X = (x1, . . . , xT), you want to recall (see the topic on bivariate regressions) that the OLS estimator β̂1 can be decomposed as:

β̂1 = β1 + Σt wt ut, with wt = (xt − x̄)/SSTx

i.e. a 'truth + noise' decomposition.
Because β1 is constant:

Var(β̂1 | X) = Var(Σt wt ut | X).

Recall that Var(A + B) = Var(A) + Var(B) + Cov(A, B) + Cov(B, A). Applying this repeatedly:

Var(Σt wt ut | X) = Σt Var(wt ut | X) + Σt Σs≠t Cov(wt ut, ws us | X)
= Σt w²t Var(ut | X) + Σt Σs≠t wt ws Cov(ut, us | X)
(note, once you condition on X, wt are fixed constants). Due to the autocorrelation
we can no longer argue that the second term vanishes (equals zero) as we did in the
cross-sectional setting.
The estimated variance replaces the unknown quantities by residual-based counterparts:

Var̂(Σt wt ut | X) = Σt w²t û²t + Σt Σs≠t, |t−s|<q wt ws ût ûs

where the first sum is the heteroskedasticity-robust term and the second sum, which only includes pairs of periods less than q apart, is the serial-correlation-robust term.
The standard errors computed based on this estimator (or another good estimator of
the above variance) are sometimes called HAC (heteroskedasticity and
autocorrelation consistent) standard errors. This allows us to compute valid
confidence intervals and test statistics that are robust to general forms of serial
correlation.
It is important to realise that we are still estimating the parameters by OLS. We are
only changing how we estimate their precision and perform inference. Just as with the
heteroskedasticity-robust inference, we can apply the HAC inference whether or not we
have evidence of serial correlation. Large differences in the HAC standard errors and the
usual standard errors suggest the presence of serial correlation (autocorrelation) and/or
heteroskedasticity.
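In R, one way to obtain such HAC standard errors is via the sandwich and lmtest packages; the sketch below assumes a hypothetical data frame dat with columns y and x, and an illustrative truncation lag q = 4:

    # A minimal sketch of HAC inference after OLS
    library(sandwich)
    library(lmtest)

    ols <- lm(y ~ x, data = dat)
    coeftest(ols)                                   # usual OLS standard errors
    coeftest(ols, vcov. = NeweyWest(ols, lag = 4))  # HAC (Newey-West) standard errors

Large differences between the two sets of standard errors are, as noted above, suggestive of serial correlation and/or heteroskedasticity.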
Activity 14.1 Take the MCQs related to this section on the VLE to test your
understanding.
yt = α0 + α1 x∗t + ut . (1)
A natural assumption on {ut} is that E(ut | It−1) = 0, where It−1 denotes all information on y and x observed at time t − 1; this means that:

E(yt | It−1) = α0 + α1 x*t.

To complete this model, we need an assumption about how the expectation x*t is formed. We consider the following adaptive expectations scheme:

x*t − x*t−1 = λ(xt−1 − x*t−1)

where 0 < λ < 1. This equation implies that the change in expectations reacts to whether last period's realised value was above or below its expectation. The assumption 0 < λ < 1 implies that the change in expectations is a fraction of last period's error.
yt = β0 + β1 yt−1 + β2 xt−1 + vt
(ii) Under E(ut | It−1 ) = 0, {ut } is serially uncorrelated. What does this imply about
the new errors, vt = ut − (1 − λ)ut−1 ? What name do we give such a process?
(iii) Comment on the following statement: ‘In order to deal with the autocorrelation
in vt , we need to use HAC standard errors when estimating the β parameters by
OLS.’
(v) Given consistent estimators of the βs, how would you consistently estimate λ
and α1 ?
Read: Wooldridge, Section 12.3 (you may skip 12.3b, Durbin–Watson test).
As we have discussed in the last section, the presence of autocorrelation in the errors has important consequences for estimating the multiple regression model:

yt = β0 + β1 xt1 + · · · + βk xtk + ut.

Suppose the errors follow a stationary AR(1) process:

ut = ρut−1 + et. (*)

Under the null, H0: ρ = 0, there is no autocorrelation, whereas rejection of the null will give evidence for the presence of AR(1) autocorrelation in the errors.
As discussed in Wooldridge, if we could observe {ut}, we would just estimate (*) by OLS and use a t test for ρ = 0. Instead, we base a test on the OLS residuals ût and rely on an asymptotic t test. Specifically:

when the regressors are strictly exogenous, to test whether ρ = 0 we should regress:

ût = α0 + ρût−1 + εt
when the regressors are not all strictly exogenous, we should use:

ût = α0 + ρût−1 + γ1 xt1 + · · · + γk xtk + εt.
In Sections 12.3a (strict exogeneity) and 12.3c (no strict exogeneity) in Wooldridge, step-by-step instructions are provided for these procedures. The main difference lies in the inclusion of all regressors xt = (xt1, . . . , xtk) in our test equation (given above) when we do not have strict exogeneity. We use the asymptotic t test given by ρ̂/se(ρ̂). Under the null the test is asymptotically Normal(0, 1). In Section 12.3d, the tests are extended to a higher order of serial correlation. There we need to perform an asymptotic F test of the joint hypothesis:

H0: ρ1 = 0, ρ2 = 0, . . . , ρq = 0.
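A minimal R sketch of the residual-based test, reusing the hypothetical model ols from the earlier sketch; bgtest() (the Breusch–Godfrey test in the lmtest package) automates the version that includes the regressors and so remains valid without strict exogeneity:

    library(lmtest)

    uhat <- resid(ols)                # OLS residuals
    n    <- length(uhat)
    treg <- lm(uhat[-1] ~ uhat[-n])   # regress u-hat_t on u-hat_{t-1}
    coeftest(treg)                    # asymptotic t test of rho = 0

    bgtest(ols, order = 1)            # Breusch-Godfrey test against AR(1) errors
    bgtest(ols, order = 2)            # against AR(2) errors (joint test)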
Activity 14.3 Let us consider the following multiple linear regression model:
yt = β0 + β1 zt + β2 zt−1 + ut , t = 2, 3, . . . , T
(i) Discuss how you can test for autocorrelation in the error against the alternative
that the error exhibits an AR(2) process (assumed to be stationary and weakly
dependent).
(ii) Discuss how your result in (i) needs to be modified if all we can say about the errors is that they satisfy the conditions E(ut | zt, zt−1) = 0 and Var(ut | zt, zt−1) = σ².
Here we show how to use time series data in R. It starts by explaining how to tell R
that we are working with time series data and how to plot time series processes. We
discuss how to run OLS using the command dynlm, how to test for autocorrelation,
and how to obtain heteroskedasticity and autocorrelation robust (HAC) standard errors.
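A minimal sketch of this workflow, assuming quarterly data from 1990Q1 in a hypothetical data frame dat with columns sales and advert:

    library(dynlm); library(lmtest); library(sandwich)

    # Declare the series as quarterly time series so R knows the dates
    sales  <- ts(dat$sales,  start = c(1990, 1), frequency = 4)
    advert <- ts(dat$advert, start = c(1990, 1), frequency = 4)
    plot(sales)                                       # plot the time series

    mod <- dynlm(log(sales) ~ advert + L(advert, 1))  # OLS with one lag of advert
    bgtest(mod, order = 1)                            # test for AR(1) serial correlation
    coeftest(mod, vcov. = NeweyWest(mod))             # HAC standard errors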
14.3. Answers to activities
None of the estimators is unbiased. For one, the OLS estimators of the βj s are not unbiased given that there is a lagged dependent variable. Of course, α̂1 would not be unbiased even if β̂1 and β̂2 were, as E(α̂1) ≠ E(β̂2)/(1 − E(β̂1)).
(i) First we need to define the AR(2) process for ut. It is given by:

ut = ρ1 ut−1 + ρ2 ut−2 + et

where {et : t = 1, 2, . . .} is an i.i.d. sequence with zero mean and variance σ², and et is independent of ut−1, ut−2, . . ..

Then we need to clearly specify the null and the alternative hypotheses:

H0: ρ1 = 0 and ρ2 = 0 against H1: ρ1 ≠ 0 and/or ρ2 ≠ 0.

As our regressors are strictly exogenous (they satisfy TS.3), we proceed as follows.
First, estimate the original model by OLS and save the residuals ût. Then run the regression:

ût = ρ0 + ρ1 ût−1 + ρ2 ût−2 + error t, for t = 3, 4, . . . , T

and test H0: ρ1 = 0 and ρ2 = 0 with an asymptotic F test for the joint significance of ût−1 and ût−2.
(ii) As we now only assume contemporaneous exogeneity, we need to add zt and zt−1 as additional regressors when estimating the test regression, that is, using the OLS residuals we need to estimate:

ût = ρ0 + ρ1 ût−1 + ρ2 ût−2 + γ1 zt + γ2 zt−1 + vt, for t = 3, 4, . . . , T.
We reject when the p-value of the (asymptotic) F test for joint significance of ρ1
and ρ2 is lower than the significance level.
14.4 Overview of chapter
describe the consequences of serial correlation in the regression error for the OLS
estimation in models with strictly exogenous or contemporaneously exogenous
regressors
give the intuition for heteroskedasticity and serial correlation adjustment of
standard errors
describe the consequences of serial correlation in the regression error for the OLS
estimation in models with lags of the dependent variable as regressors
outline an approach to testing for serial correlation in regression errors in models
under strict exogeneity
outline an approach to testing for serial correlation in regression errors in models
under contemporaneous exogeneity
use statistical software to estimate time series regressions and conduct a test for
serial correlation in regression errors.
+.038 gwage t−3 + .081 gwage t−4 + .107 gwage t−5 + .095 gwage t−6
(.039) (.039) (.039) (.039)
+.104 gwage t−7 + .103 gwage t−8 + .159 gwage t−9 + .110 gwage t−10
(.039) (.039) (.039) (.039)
(a) We want to test whether the long-run propensity is significantly different from one. Clearly indicate the null and the alternative hypotheses, give the formula for the test statistic and the rejection rule. What is the unknown quantity in your test statistic? What regression would you run to obtain the standard error of the long-run propensity directly?
(b) Discuss how you would conduct a formal test for serial correlation.
(c) Suppose a formal test in (b) suggests the presence of serial correlation in the
regression errors. Discuss the consequence of this for the test you conducted in
(a) and suggest two ways to deal with this problem.
14.2E In this part we consider the expectations augmented Phillips curve (see also
Mankiw, 1994):
infl t − infl et = β1 (unem t − µ0 ) + et
where µ0 is the natural rate of unemployment (assumed to be constant over time)
and infl et is the expected rate of inflation formed in t − 1. This model suggests that
there is a trade-off between unanticipated inflation (infl t − infl et ) and cyclical
unemployment (the difference between actual unemployment and the natural rate
of unemployment).
We assume that et (also called a supply shock) is an i.i.d. random variable with zero mean. You are told that expectations are formed using the following adaptive expectations model:

infl e_t − infl e_{t−1} = λ(infl t−1 − infl e_{t−1}), with 0 < λ < 1.

Given this information, it can be shown (you are not asked to do this) that the model can be written as:

∆infl t = γ0 + γ1 unem t + γ2 unem t−1 + vt (*)

where ∆infl t = infl t − infl t−1, γ0 = −β1 λµ0, γ1 = β1, γ2 = −(1 − λ)β1, and:

vt = et − (1 − λ)et−1. (**)
(a) What name do we give the process vt in (**)? Is this process covariance
stationary and/or weakly dependent? Discuss.
(b) Discuss what (minimal) assumptions you need to make about et (the supply
shock) to guarantee the consistency of the OLS estimator of (γ0 , γ1 , γ2 ) in (*).
(You are not expected to prove its consistency.) Indicate how you can use the
consistency of the OLS estimator of (γ0 , γ1 , γ2 ) to obtain a consistent estimator
of the long-run effect of unemployment on the change in inflation. Prove your
claim.
(c) Let the assumptions for consistency of the OLS estimator of (γ0 , γ1 , γ2 ) in (*)
be satisfied. How can you obtain valid standard errors for the long-run effect of
unemployment on the change in inflation?
14.2E (a) This is an MA(1) process. Textbook discussion of the process being covariance stationary and weakly dependent. There is no dependence when observations are more than 1 period apart, which is clearly evidence of weak dependence.

(b) Let xt denote unem t. Consistency requires the error term in the new model, vt, and both regressors, xt and xt−1, to be uncorrelated. Since et has a zero mean this means:

E(xt et) = E(xt et−1) = E(xt−1 et) = E(xt−1 et−1) = 0.

The long-run effect of unemployment on the change in inflation is γ1 + γ2 = λβ1, which is consistently estimated by γ̂1 + γ̂2 (by the consistency of the OLS estimators and the properties of probability limits).

(c) Rewrite (*) with δ1 = γ1 + γ2 as ∆infl t = γ0 + δ1 unem t + γ2 (unem t−1 − unem t) + vt and run this regression, as δ1 gives the long-run effect. Because of the serial correlation in the error, robust (HAC) standard errors should be used.
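A sketch of how this reparameterisation can be implemented, assuming hypothetical annual series infl and unem in a data frame dat:

    library(dynlm); library(lmtest); library(sandwich)

    infl <- ts(dat$infl, start = 1950)   # start date illustrative
    unem <- ts(dat$unem, start = 1950)

    # Original form (*): the long-run effect is gamma1 + gamma2
    m1 <- dynlm(d(infl) ~ unem + L(unem, 1))

    # Reparameterised form: the coefficient on unem is delta1 = gamma1 + gamma2,
    # so its HAC standard error can be read off directly
    m2 <- dynlm(d(infl) ~ unem + I(L(unem, 1) - unem))
    coeftest(m2, vcov. = NeweyWest(m2))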
Chapter 15
Trends, seasonality and highly persistent time series in regression analysis
15.1 Introduction
explain the terminology of a unit root process, an integrated process and the order
of integration
discuss the Dickey–Fuller (DF) and augmented Dickey–Fuller (ADF) unit root tests and explain their difference
understand the importance of testing for cointegration when using highly persistent
variables in regression
conduct a unit root test on the residuals to test for cointegration (Engle–Granger
test)
construct an error correction model (ECM) and describe the advantage over
differencing
15.2. Content of chapter
We start with the simplest setting, where the only problem associated with the
requirements of stationarity and weak dependence of time series is the presence of
deterministic trends and seasonality. As the mean of such weakly dependent
processes will be changing over time, they will be nonstationary. Nevertheless, such
nonstationary processes can easily be made stationary by detrending (deseasonalising).
Trending variables

Many economic time series have the tendency to grow over time as discussed in Section 10.5a, and a simple way to model such trending behaviour is to assume either:

yt = α0 + α1 t + et, E(et) = 0

(abstracting from random deviations, the time series grows by a constant amount each period), or:

log(yt) = β0 + β1 t + et, E(et) = 0.

(Abstracting from random deviations, the time series has a constant growth rate.)
Comment: In the simplest definition of these models, we may assume that the
error term is an i.i.d. sequence with E(et ) = 0 and Var (et ) = σ 2 , but without loss
of generality we may allow {et } to be stationary and weakly dependent.
You need to be able to interpret α1 and β1 clearly, and explain why these processes
violate the covariance stationarity assumption. It should be clear that the mean of a
process {yt } that exhibits a linear or exponential time trend is changing over time
(using a graphical discussion is fine).
We need to be careful when using trending variables in regression analysis to avoid the
so-called spurious regression problem as discussed in Section 10.5b in Wooldridge.
This is the phenomenon of finding a relationship between two (or more) unrelated variables simply because each of them is trending.
A classic example of the spurious regression problem appeared in Yule (1926), 'Why do we sometimes get nonsense-correlations between time-series?', Journal of the Royal Statistical Society, 89, 1–63.
Yule observed a strong correlation between the proportion of UK marriages in church
(ChurchMarr ) and the UK mortality rate (Mortality) using data for the years 1866 to
1911. When running the following regression:
Mortality t = α0 + α1 ChurchMarr t + ut
the estimate of α1 was significantly different from zero. Obviously, as he argued, it is
very hard to explain how the marriages in church can possibly affect the mortality
(‘nonsense correlations in time-series’). As their graph reveals, both variables show a
clear downward trend in this period. The high correlation, and significance of α1 , is
purely a result of the trending nature in both variables. In other words, it is a spurious
regression.
The detrended variables, such as ẍt1, are obtained by running a regression of the variable on an intercept and the time trend and computing the residuals.
When conducting statistical inference using time series, we have advocated the use of assumptions TS.1′ to TS.5′ (asymptotic inference). Whereas deterministically trending variables do not satisfy the requirement of stationarity imposed by TS.1′, statistical inference will continue to hold as long as we include the time trend in our regression.

This result is due to the fact that the inclusion of the time trend ensures that we are effectively using detrended variables, which are stationary and weakly dependent (as required in TS.1′).
If we can model the trending behaviour using a deterministic trend, such as:
zt = α0 + α1 t + et , E(et ) = 0
then the detrended variable (or its estimate) zt − α0 − α1 t will be stationary and weakly
dependent as long as et is stationary and weakly dependent.
We say that the deterministically trending variable zt is trend-stationary. It is
a nonstationary variable that can be rendered stationary and weakly dependent by
detrending.
Seasonality
If a time series is observed at monthly or quarterly intervals (or even weekly or daily), it
may exhibit seasonality. An example of this is the ‘Minimum Temperature in
Melbourne, 1981 to 1990’. The units are in degrees Celsius and there are 3,650
observations (the source is the Australian Bureau of Meteorology). A seasonality
component is quite obvious in these data.
If we have quarterly data, and we think the seasonal patterns within a year are roughly constant over time, we may consider the following regression model:

yt = β0 + δ1 Q2t + δ2 Q3t + δ3 Q4t + β1 xt1 + β2 xt2 + ut

where Q2t, Q3t and Q4t are dummy variables for the second, third and fourth quarters, and xt1 and xt2 are explanatory variables of yt. The reason for leaving out one of the seasonal dummies (base category) is to avoid perfect collinearity.
If we have monthly data, then we would include dummy variables for eleven of the
twelve months, with the omitted category being the base category.
As discussed in Section 10.5e in Wooldridge, running a regression that includes seasonal
dummies is identical to running a regression that uses deseasonalised variables. This
important result uses the partialling out interpretation of OLS which recognises that we
can first control explicitly for seasonality (a process called deseasonalisation) before
estimating the parameters of interest.
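A minimal sketch of such a regression with monthly data, assuming a hypothetical data frame dat with a column y and a sample that starts in January:

    n     <- nrow(dat)
    trend <- 1:n
    month <- factor(rep(month.abb, length.out = n), levels = month.abb)

    # factor() generates the dummies; January is absorbed into the intercept
    # (the base category), which avoids perfect collinearity
    mod <- lm(log(dat$y) ~ trend + month)
    summary(mod)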
Activity 15.1 Take the MCQs related to this section on the VLE to test your
understanding.
T = 72, R² = .727.

g̈fr t = β1 p̈e t + β2 ẅw2 t + β3 p̈ill t + vt.
(i) Discuss the reason why robust (HAC) standard errors are reported.
(ii) You are interested in testing the hypothesis H0 : βpe = 0 against H1 : βpe > 0.
Implement the test. Briefly discuss the assumptions that underlie your test.
+ .038 jul + .054 aug + .042 sep + .082 oct + .071 nov + .096 dec
(.020) (.021) (.023) (.023) (.025) (.031)
T = 108, R2 = .797.
Interpret the coefficient estimate on the time trend and discuss its significance.
Would you say there is seasonality in total accidents? A robust F test on the
seasonal dummies has a p-value < .001.
(ii) What would happen to the coefficient on the time trend if we run a regression
of log(totacc) on a time trend and all monthly dummy variables leaving out the
intercept?
(iii) Consider the following regression aimed at evaluating the effect on accidents of
the introduction of the seatbelt law (beltlaw = 1 from January 1986 onwards, 0
otherwise) and the highway speed law which permitted an increase from 55 to 65
miles per hour (spdlaw = 1 from May 1987 onwards, 0 otherwise), the number
of weekends per month wkends, and the unemployment in the state unem.
log(totacc)^ = 10.64 + .003 wkends − .021 unem − .054 spdlaw + .095 beltlaw
              (.062)   (.0026)        (.005)       (.018)        (.020)

n = 108, R² = .910.
Discuss the importance of including the seasonal dummies and the time trend in
this regression.
(iv) Discuss the coefficient (and its significance) on unem. Does its sign and
magnitude make sense? Explain.
(v) Discuss the coefficients (and their significance) on spdlaw and beltlaw . Are the
estimated effects as expected? Explain.
Unfortunately, many economic time series exhibit strong dependence (persistence) and
we will need to learn how to deal with such data. Examples of models that describe this
behaviour (persistence) are given by the random walk and random walk with drift.
They are (i) not (covariance) stationary, and (ii) exhibit strong dependence.
yt = yt−1 + et , t = 1, 2, . . .
where {et }Tt=1 is i.i.d. with zero mean and variance σ 2 . This is a simple AR(1) process
yt = ρyt−1 + et , where ρ = 1.
This process is nonstationary, as the requirement for the AR(1) process to be covariance
stationary is given by |ρ| < 1, which is clearly not satisfied here.
To show that this process {yt }Tt=1 exhibits strong dependence (persistence), we want to
show that Corr (yt , yt+h ) does not go to zero when h increases. Let y0 = 0. Using
repeated substitution we can then write our random walk as a sum of i.i.d. errors:
yt = e1 + e2 + · · · + et−1 + et , t = 1, 2, . . . .
E(yt) = 0 and Var(yt) = tσ² (a sum of t i.i.d. random variables). The variance increases with time (it is not constant), hence yt is nonstationary.

Cov(yt, yt+h) = tσ². The covariance (and therefore also the correlation) does not die out as h → ∞, which is required for weak dependence. We have:

Corr(yt, yt+h) = Cov(yt, yt+h)/√(Var(yt) Var(yt+h)) = tσ²/√(tσ²(t + h)σ²) = √(t/(t + h))

and, for large t, this correlation stays close to one even as h grows.
Example of a highly persistent and trending process: random walk with drift
It is often the case that highly persistent series also exhibit a clear trend. A highly
persistent process that displays an obvious upward or downward trend is the random
walk with drift:
yt = α0 + yt−1 + et , t = 1, 2, . . .
where {et }Tt=1 is i.i.d. with zero mean and variance σ 2 ; α0 is called the drift term.
When α0 = 0 this is the random walk, which does not display a clear trend.
Below we show three realisations of these highly persistent processes: the random walk
(on the left) and the random walk with drift (on the right, with α0 = 2) with y0 = 0
and et ∼ Normal (0, 1):
The dashed lines indicate E(yt ). Unlike the random walk, the random walk with drift is
clearly trending. (Unlike a trend stationary process, the series does not regularly return
to the trend line.)
To show the random walk with drift is trending, we use repeated substitutions:

yt = α0 + yt−1 + et = α0 + (α0 + yt−2 + et−1) + et = · · · = α0 t + (et + et−1 + · · · + e1)

that is, a deterministic trend α0 t plus random walk behaviour. In particular, E(yt) = α0 t.
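The realisations described above are easy to reproduce; the following sketch simulates both processes with y0 = 0, α0 = 2 and Normal(0, 1) shocks:

    set.seed(1)
    e        <- rnorm(100)
    rw       <- cumsum(e)        # y_t = y_{t-1} + e_t (random walk)
    rw_drift <- cumsum(2 + e)    # y_t = 2 + y_{t-1} + e_t; E(y_t) = 2t

    plot(rw_drift, type = "l")   # clearly trending
    lines(rw)                    # wanders without a clear trend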
yt = α0 + α1 xt + vt.
Terminology
yt = α0 + yt−1 + et , t = 1, 2, . . .
where {et }Tt=1 is a weakly dependent process. The random walk and random walk with
drift are special cases of a unit root process where the error term et is assumed to be
i.i.d. (which is clearly weakly dependent!).
While unit root processes are strongly dependent (persistent), by taking first differences
yt − yt−1 , denoted ∆yt , we can obtain a weakly dependent process. A process {yt } that
becomes weakly dependent after differencing {∆yt } is called integrated of order one
(denoted I(1)).
To see this, subtract yt−1 from both sides to reveal:

∆yt = α0 + et

which is weakly dependent since {et} is.
An informal method that can be used to help decide whether a process is I(1),
discussed in Section 11.3c in Wooldridge, uses the first-order autocorrelation. In the
next section, we consider a formal test to detect a unit root.
Activity 15.5 Take the MCQs related to this section on the VLE to test your
understanding.
(i) The first-order correlation for the variables approve and log(rgasprice) equal
.931 and .939, respectively. What does this information tell us regarding the
dependence inherent in these two time series processes?
(ii) Why might your result in (i) make you hesitant to estimate this model by OLS?
(iii) Estimating the model in first differences results in the following regression line:
The numbers in parentheses are standard errors. How do you interpret the
estimate of β2 ? Is it statistically significant? The p-value is 0.057.
(v) Upon adding the variable log(sp500 ) to our model and estimating the equation
by first differencing beforehand, we obtain:
+ 4.18 ∆ log(sp500 t )
(12.78)
T = 77, R2 = .31.
Discuss what this says about the effect of the stock market on approval ratings
(sp500 is a well-known stock market index, the S&P 500).
Section 18.2 in Wooldridge discusses how to test for the presence of a unit root
(persistence). The test is non-standard and is called the Dickey–Fuller (DF) test.
The discussion of the Dickey–Fuller test begins with the following AR(1) model:
yt = ρyt−1 + et
where {et : t = 1, 2, . . .} is an i.i.d. sequence with zero mean and variance σ 2 , and et is
independent of yt−1, yt−2, . . .. Under the null, {yt} has a unit root and under the alternative we have a stationary, weakly dependent AR(1) process. That is:

H0: ρ = 1 against H1: ρ < 1.

Subtracting yt−1 from both sides and writing θ = ρ − 1, the test equation becomes ∆yt = θyt−1 + et, so that equivalently:

H0: θ = 0 against H1: θ < 0.
The Dickey–Fuller test for a unit root is given by the familiar t statistic:

DF test = θ̂/se(θ̂).
Due to the one-sided nature of our test (H1: θ < 0), our decision rule becomes:

Reject H0 if θ̂/se(θ̂) < cα

where cα denotes the critical value given significance level α. It is important to note that the critical value is different from that of the conventional t test.
To find the critical value, we need the (asymptotic) distribution of our test under H0 .
As we have high persistence under the null, I(1), we cannot rely on standard CLT. The
asymptotic distribution of our test statistic is derived by Dickey and Fuller (1979). It
has come to be known as the Dickey–Fuller distribution. A table with critical values
is provided below (and will be provided when required in the examination).
Depending on the process of interest, we may want to use an alternative test equation to conduct the Dickey–Fuller test. Let us distinguish here the following three possible test equations:

∆yt = θyt−1 + et (no constant)
∆yt = α + θyt−1 + et (constant)
∆yt = α + δt + θyt−1 + et (constant & trend)

The selection of the required test equation is fairly intuitive. We should include an intercept if our process {yt} has a non-zero mean. We should also include the time trend if our process {yt} is clearly trending.
Regardless of the specific test equation used, we test for the presence of a unit root by testing:

H0: θ = 0 against H1: θ < 0

using the familiar test statistic θ̂/se(θ̂). Because of the one-sided nature of our test, we will reject H0 if the realisation of this DF test statistic is smaller than the relevant critical value.
The critical values are non-standard and are somewhat different depending on which test equation was used to detect unit roots. A table with critical values for different levels of significance is given below:

Significance level                    1%      2.5%    5%      10%
Critical value (no constant)         −2.58   −2.23   −1.95   −1.62
Critical value (constant)            −3.43   −3.12   −2.86   −2.57
Critical value (constant & trend)    −3.96   −3.66   −3.41   −3.12
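A minimal R sketch of the 'constant' test equation for a hypothetical numeric series y; the t statistic on the lagged level is the DF statistic and must be compared with the Dickey–Fuller critical values above, not the usual t table:

    n     <- length(y)
    dy    <- diff(y)           # Delta y_t, for t = 2, ..., n
    lag_y <- y[-n]             # y_{t-1}, aligned with dy

    df_reg <- lm(dy ~ lag_y)   # test equation with a constant
    summary(df_reg)            # DF test = t value on lag_y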
Example 18.2 in Wooldridge discusses a test for a unit root in the 3-month T-bill rate, r3t. As the process is clearly not trending, the following test equation was used:

∆r3t = α + θ r3t−1 + et.

The test failed to reject H0: θ = 0 (recall θ = ρ − 1), providing some support that the 3-month T-bill rate may be persistent (at least persistence was not rejected!).

The realisation of the test statistic θ̂/se(θ̂) equals −2.46, which is not smaller than the critical value at the 10% level of significance given by −2.57. We conclude that care should be taken when using this variable in regression analysis.
When testing for a unit root in log(GDP) (see also Example 18.4 in Wooldridge), we may consider the test equation:

∆ log(GDP)t = α + δt + θ log(GDP)t−1 + et.

It is useful to recall that log(GDP)t − log(GDP)t−1 denotes the growth rate of GDP, which is denoted gGDP in Wooldridge.
T = 36.

The DF test is the t statistic on the coefficient on log(GDP)t−1, which equals −.140/.084 ≈ −1.67. As the DF test statistic is not smaller than the critical value at the 5% level (−3.41), we fail to reject the null, which is suggestive that log(GDP) has a unit root, i.e. log(GDP) is I(1). It should be noted that the sample is quite small.
Example 18.3 in Wooldridge provides a test for a unit root in US inflation, inf t. As the process is clearly not trending, the following test equation was used:

∆inf t = α + θ inf t−1 + et.

Example 18.4 in Wooldridge considers the ADF test when testing for a unit root in log(GDP). The testing equation adds a lagged difference to account for serial correlation in et:

∆ log(GDP)t = α + δt + θ log(GDP)t−1 + γ ∆ log(GDP)t−1 + et.
Activity 15.7 Take the MCQs related to this section on the VLE to test your
understanding.
yt = α + ut where ut = ρut−1 + et
(i) What condition will ensure that {ut }Tt=1 is stationary and weakly dependent?
Explain why under these conditions {yt }Tt=1 is also stationary and weakly
dependent.
(ii) Show that you can rewrite the equation in the form:
∆yt = β0 + β1 yt−1 + et
(iii) Show that under the null of a unit root, yt is a random walk (not a random
walk with drift).
(iv) Discuss the Dickey–Fuller test in detail. (Provide the null and alternative
hypotheses, test statistic, and rejection rule.)
yt = α + δt + ut where ut = ρut−1 + et
(i) Discuss the following statement ‘If {ut }Tt=1 is stationary and weakly dependent
then {yt }Tt=1 is trend-stationary (and weakly dependent).’
(ii) Show that you can write the equation in the following form:
∆yt = β0 + β1 yt−1 + β2 t + et
(iii) Show that under the null of a unit root, yt is a random walk with drift.
(iv) Discuss the Dickey–Fuller test in detail. (Provide the null and alternative
hypotheses, test statistic, and rejection rule.)
Let us recall that, when running a regression using two (or more) highly persistent processes, we need to rule out that we have obtained a spurious relationship, where the only reason for the appearance of a relationship is their persistence.
If yt and xt are both I(1), we have a spurious regression problem if the error
εt is I(1) as well, that is:
yt = α0 + α1 xt + εt with εt ∼ I(1)
results in large R2 and a seemingly significant α1 even though yt and xt are unrelated.
Inference is not standard as usual CLT does not apply.
By differencing these processes we can avoid this problem. However, differencing I(1)
variables beforehand is not always needed and, indeed, may limit the scope of questions
that we can answer.
Engle and Granger (1987) introduced the notion of cointegration to describe
meaningful regressions involving I(1) regressors.
If yt and xt are both I(1), we have a cointegrating relationship if the error εt
is I(0), that is:
yt = α0 + α1 xt + εt with εt ∼ I(0).
It recognises the fact that there may be a long-run relationship between the highly
persistent processes. In this case y and x move together.
Statistically, in this case there exists a linear combination of the I(1) processes yt and
xt that is I(0), here yt − α1 xt . We use the term cointegration to describe the setting
where the order of integration of a linear combination of processes is lower than the
order of integration of the original processes.
α1 is called the cointegration parameter.
Say we have determined that yt and xt are I(1). How do we test whether our regression:
yt = β0 + β1 xt + ut
is a spurious (meaningless) regression or a cointegrating regression, reflective of a
long-run relationship? In light of the discussion above, our test will be based on testing
whether the residuals from the above regression:
ût = yt − β̂0 − β̂1 xt
have a unit root or not. Formally, we want to test:
H0 : {ut } is I(1) (spurious regression)
H1 : {ut } is I(0) (cointegrating regression).
Under H0, our regression is spurious and the residuals will display the presence of a unit root. Under H1, our regression is cointegrating and the residuals will not display the presence of a unit root. We can perform this test by applying a DF (or ADF) test to the residuals. The test is called the Engle–Granger test. The test proceeds as follows.

Estimate by OLS the test equation:

∆ût = µ + θût−1 + et.

The decision rule is:

Do not reject H0 (spurious regression) if θ̂/se(θ̂) ≥ cα
Reject H0 (cointegrating regression) if θ̂/se(θ̂) < cα.
If yt and xt are I(1) and they are not cointegrated, then we can only specify models based on the first differences ∆yt and ∆xt, as they are I(0). For example:

∆yt = δ0 + δ1 ∆xt + εt.

(You could also add lags like ∆xt−j, ∆yt−j, and so on.) On the other hand, if yt and xt are I(1) and they are cointegrated, then we can enrich our model by specifying an error correction model (ECM).

Idea: When there is a long-run relationship, there has to be some mechanism that will bring about equilibrium after a shock. By estimating the ECM, we can quantify this process.

Specifically, if yt and xt are I(1) and they are cointegrated, then we can include an additional I(0) variable:

st = yt − α − βxt

which represents the deviation from the long-run equilibrium. This suggests the error correction model (ECM):

∆yt = δ0 + δ1 ∆xt + γst−1 + et.
346
15.2. Content of chapter
The term γst−1 is called the error correction term, where st−1 = yt−1 − α − βxt−1 represents a deviation from the long-run equilibrium (an 'error') at time t − 1. The ECM, therefore, allows us to study the short-run dynamics in the relationship between y and x required to re-establish equilibrium.

Recall xt and yt are I(1) and β is the cointegrating parameter, representing the long-run relationship between xt and yt; st−1 represents the disequilibrium at time t − 1.
If the cointegrating relationship is known, we can compute st−1 = yt−1 − α − βxt−1 and apply OLS using the differenced processes ∆yt, ∆xt and st−1 to obtain the ECM. If the cointegrating relationship is unknown, we will need to estimate it first. This procedure is called the Engle–Granger two-step procedure.

Step 1: Estimate the cointegrating relationship by OLS, and obtain the residuals:

ŝt = yt − α̂ − β̂xt.

Step 2: Estimate the ECM by OLS, using ŝt−1 in place of st−1:

∆yt = δ0 + δ1 ∆xt + γŝt−1 + et.

Conveniently, for hypothesis testing we can ignore the fact that we use an estimate of st−1 in place of st−1 for inference purposes (related to the non-examinable concept of super-consistency).
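A sketch of the two-step procedure for hypothetical I(1) series y and x (in practice one would first apply the Engle–Granger test to the step-1 residuals):

    library(dynlm)

    step1 <- lm(y ~ x)                         # cointegrating regression
    s_hat <- ts(resid(step1))                  # s-hat_t: deviation from equilibrium

    yt <- ts(y); xt <- ts(x)
    ecm <- dynlm(d(yt) ~ d(xt) + L(s_hat, 1))  # gamma-hat = coefficient on L(s_hat, 1)
    summary(ecm)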
The discussion in Wooldridge (seventh edition), deviates slightly from our discussion
here. In Wooldridge, the assumption is made that st = yt − βxt has zero mean. By
describing the cointegrating relationship as yt = α + βxt (adding an intercept) we avoid
this.
From Wooldridge: Let hy6t be the three-month holding yield (in percentage) from
buying a six-month T-bill and selling it three months later as a three-month T-bill, and
let hy3t be the three-month holding yield (in percentage) from buying a three-month
T-bill. We assume that both processes are I(1).
Using INTQRT, the Dickey–Fuller test on hy3t finds support for the process being I(1); no support was found for hy6t being I(1).
We expect there is a cointegrating relationship between hy6t and hy3t−1 . In fact, the
expectations hypothesis suggests that these two different three-month investments
should be the same on average. Using INTQRT we obtain the following cointegrating
regression:
dt = −.058 + 1.104hy3t−1 , T = 123.
hy6
(.070) (.039)
To estimate the ECM for holding yields we use the Engle–Granger two-step procedure. The first step gives the cointegrating regression:

ĥy6t = −.058 + 1.104 hy3t−1, T = 123.
       (.070)   (.039)
The error correction parameter is statistically significant (we can use the asymptotic t test γ̂/se(γ̂)), revealing the existence of an error correction mechanism (γ ≠ 0). In fact, the adjustment to a disequilibrium is not significantly different from −1, which would permit the adjustment to be immediate.
Activity 15.10 Take the MCQs related to this section on the VLE to test your
understanding.
Activity 15.11 Suppose the relationship between two I(1) variables yt and xt can
be characterised by the ADL(1, 1) model:
yt = β0 + β1 yt−1 + β2 xt + β3 xt−1 + et
where |β1 | < 1, et is i.i.d. and E(et | xt−1 , xt−2 , . . . , yt−1 , yt−2 , . . .) = 0.
(ii) Hence, argue that the cointegrating relationship between {yt} and {xt} is given by:

yt = α0 + α1 xt.
(iii) Show that you can rewrite the ADL(1, 1) model to obtain the following error correction model (ECM):

∆yt = β2 ∆xt + γ diseq t−1 + et, where γ = β1 − 1

and diseq t−1 = yt−1 − α0 − α1 xt−1. Hint: First subtract yt−1 from both sides of the first equation. Then, add and subtract β2 xt−1 on the right-hand side and rearrange.
(iv) Discuss how you can estimate the ECM given in (iii).
Here we show how to account for nonstationarity when using time series data in R. We
discuss how to account for trends and seasonality. We also show how to simulate a
random walk and test for the presence of a unit root using the ADF test.
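A minimal sketch of this workflow using ur.df() from the urca package (tseries::adf.test is an alternative); the random walk is simulated, so it is I(1) by construction:

    library(urca)

    set.seed(123)
    y <- cumsum(rnorm(250))                            # a simulated random walk

    summary(ur.df(y, type = "drift", lags = 1))        # typically fails to reject a unit root
    summary(ur.df(diff(y), type = "drift", lags = 1))  # differenced series is I(0)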
15.3 Answers to activities

To obtain the detrended gfr variable we need to run the following regression:

gfr t = α0 + α1 t + α2 t² + vt.

The detrended variable g̈fr t is given by the residuals from this regression:

g̈fr t = gfr t − α̂0 − α̂1 t − α̂2 t².

When we run a regression using the detrended variables (no intercept), we obtain:

g̈fr̂ t = .348 p̈e t − 35.88 ẅw2 t − 10.12 p̈ill t.
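The steps above can be reproduced with the sketch below, assuming the FERTIL3 variables gfr, pe, ww2 and pill (as in Wooldridge) sit in a data frame dat and that all variables are detrended with the same quadratic trend:

    tt  <- seq_len(nrow(dat)); tt2 <- tt^2
    detrend <- function(v) resid(lm(v ~ tt + tt2))   # residuals = detrended series

    gfr_dd  <- detrend(dat$gfr)
    pe_dd   <- detrend(dat$pe)
    ww2_dd  <- detrend(dat$ww2)
    pill_dd <- detrend(dat$pill)

    lm(gfr_dd ~ 0 + pe_dd + ww2_dd + pill_dd)        # regression without intercept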
Conclusion: We can obtain the effects of interest by first partialling out the trend.
How we partial out the trend depends on our assumption regarding the
deterministic trend (linear or quadratic).
(i) The use of robust (HAC) standard errors accounts for the possible presence of weak dependence and heteroskedasticity in the errors.
The usual standard errors are only valid under the assumption of homoskedasticity
and absence of serial correlation where the latter in particular is likely to be
violated in a time series setting.
The fact that the robust standard errors are quite different from the regular
standard errors is indicative of the presence of heteroskedasticity and/or
autocorrelation.
(ii) Using the robust standard errors we can simply apply the asymptotic t test:

β̂pe/se(β̂pe) = .347/.078 ≈ 4.45.

As this exceeds the 5% one-sided critical value of 1.645, we reject H0: βpe = 0 in favour of H1: βpe > 0.
(i) When multiplied by 100, the coefficient on t gives roughly the average monthly
percentage growth in totacc, ceteris paribus.
Once seasonality was eliminated, totacc grew by about .275% per month over this
period, or 12(.275) = 3.3% at an annual rate.
Only February has a lower number of total accidents than the base month,
January. The peak is in December: roughly, there are 9.6% more accidents in
December than January in the average year.
The fact that the robust F statistic on the seasonal dummies has a p-value < .001 ensures that the dummies are jointly highly significant (under H0, F is asymptotically distributed as F11, 108−13).
(ii) The estimate on the time trend, and its interpretation, would remain unchanged if
instead we had run a regression that included all 12 seasonal dummies, the time
trend but excluded the intercept.
To avoid perfect multicollinearity we either have to leave out one seasonal dummy,
or the intercept!
The parameters on the dummy variables would be affected; their interpretation
changes. (jan will have the estimate 10.47, the estimate for feb is 10.47 − .042, for
mar 10.47 + .080 etc.)
(iii) The fact that the number of accidents exhibits trends and seasonality requires us to include these variables so as not to give rise to bias arising from confounders, and permits us to assume TS.1′.

Their exclusion is likely to lead to biased parameter estimates as we expect positive correlation between the traffic laws and time (both laws were implemented later in the sample) and seasonality in unemployment rates.
(iv) The negative coefficient on unem makes sense if we view unem as a measure of
economic activity. As economic activity increases – unem decreases – we expect
more driving and therefore more accidents.
The estimate indicates that a one-unit increase in the unemployment rate (a 1%
point increase) reduces the total accidents by about 2.1% (semi-elasticity), ceteris
paribus.
The asymptotic t test βbunem /se(βbunem ) shows it is statistically significant.
(i) The first-order autocorrelation values for approve and lrgasprice are about .931 and .939, respectively, which are fairly close to unity. This may be indicative of a unit root process, which is nonstationary and highly persistent (strong dependence).
(ii) The fact that it is likely we are dealing with highly persistent (and nonstationary)
processes makes us hesitant to estimate the model by OLS because it may give rise
to a spurious regression.
(iii) Notice that the slopes in this regression, which uses all variables in first differences, provide estimates of the slopes in our original model: differencing both sides of the original equation leaves the slope parameters unchanged.
Hence, we observe that the approval rate was 15.6 percentage points higher in
September 2001 and the following two months, ceteris paribus.
(v) The estimated coefficient on ∆ log(sp500) equals 4.18. The effect is statistically insignificant, with a t statistic of .33 and a p-value of .745. After controlling for the other variables, there is no effect of the stock market valuation on approval ratings. An approximate 1% growth in the value of the S&P 500 (∆ log(sp500) = .01) increases the approval rate by only .04 percentage points (economically insignificant too).
(i) This describes the process {yt } as the sum of the constant α and an AR(1) process
given by {ut }.
• {ut } is an AR(1) process that is stationary and weakly dependent when
|ρ| < 1.
• As {yt} simply shifts {ut} by the constant α, it is also a stationary and weakly dependent process when |ρ| < 1.
∆yt = β0 + β1 yt−1 + et.

We will test:

H0: β1 = 0 against H1: β1 < 0

(even though β0 = 0 at the same time) using the t statistic:

DF test = β̂1/se(β̂1).
(i) This describes the process {yt } as the sum of α, a deterministic trend δt, and an
AR(1) process given by {ut }.
• If {ut } is stationary and weakly dependent, |ρ| < 1 then the only ‘problem’
with the {yt } process is the presence of the deterministic trend (δ 6= 0).
• By detrending {yt } can be made stationary and weakly dependent:
yt − δt = α + ut .
yt = α + δt + ut
ρyt−1 = ρα + ρδ(t − 1) + ρut−1.

Subtracting the second line from the first:

yt − ρyt−1 = α(1 − ρ) + ρδ + δ(1 − ρ)t + (ut − ρut−1), with et ≡ ut − ρut−1.

Since yt − ρyt−1 = ∆yt + (1 − ρ)yt−1, this can be rearranged as:

∆yt = β0 + β1 yt−1 + β2 t + et

with β0 = α(1 − ρ) + ρδ, β1 = ρ − 1 and β2 = δ(1 − ρ).
DF test = β̂1/se(β̂1).

At the 5% level of significance, our decision rule is given by:

Do not reject H0 if DF test ≥ −3.41
Reject H0 if DF test < −3.41.
(ii) The cointegrating relationship reveals the long-run relationship between the I(1)
processes, hence:
y t = α 0 + α 1 xt .
{yt : t = 1, 2, . . . , } and {xt : t = 1, 2, . . . , } are I(1), but there is an α1 such that
yt − α0 − α1 xt is I(0). This α1 is the cointegrating parameter.
(iii) Rewriting the ADL(1, 1) model delivers:

∆yt = β2 ∆xt + γ diseq t−1 + et

where γ = β1 − 1 < 0.
(iv) The Engle–Granger two-step procedure is given by the following.

• Step 1: We estimate the cointegrating relationship by OLS:

yt = α0 + α1 xt + vt

and obtain the residuals v̂t = yt − α̂0 − α̂1 xt.

• Step 2: We estimate the ECM by OLS using the lagged residuals:

∆yt = β2 ∆xt + γv̂t−1 + εt, where γ = β1 − 1.
15.4 Overview of chapter
Detrending
Seasonality
Trend-stationary process
Unit roots
Random walk
First difference
Difference-stationary process
Engle–Granger test
explain the terminology of a unit root process, an integrated process and the order
of integration
discuss the Dickey–Fuller (DF) and augmented Dickey–Fuller (ADF) unit root tests and explain their difference
understand the importance of testing for cointegration when using highly persistent
variables in regression
conduct a unit root test on the residuals to test for cointegration (Engle–Granger
test)
construct an error correction model (ECM) and describe the advantage over
differencing
explain the Engle–Granger two-step procedure for estimating the ECM
use statistical software to estimate time series regressions in the presence of
nonstationarity and implement unit root test on real data.
15.5 Test your knowledge and understanding

15.2E Consider the following time series model for {yt}Tt=1:

yt = α + βt + ut, t = 1, . . . , T, with ut = ρut−1 + εt and 0 ≤ ρ ≤ 1

where εt is an i.i.d. (0, σ²ε) error that is uncorrelated with anything in the past (white noise). In this model {yt}Tt=1 will be nonstationary if β ≠ 0 or ρ = 1.
(a) Discuss the concept ‘trend stationarity’ and contrast it to the concept
‘difference stationarity’.
(b) Show that yt is trend stationary when |ρ| < 1.
(c) Show that yt is difference stationary when ρ = 1.
(d) Discuss the importance of distinguishing between trend stationary and
difference stationary processes.
(e) What assumptions do you make if you want to argue that we can use OLS to
get a consistent estimator of α and β? Will the OLS estimator be efficient in
this case, and can you use the usual standard errors for inference?
Yt = β0 + β1 Xt + εt , t = 1, . . . , T.
15.4E Let us consider the relationship between the natural logarithm of GDP, GDPt , and
the long-term interest rate, ratet , together with its lag ratet−1 .
Assume that GDPt and ratet are I(1).
Consider the following regression model, where εt has mean zero and is
uncorrelated with GDPt−1 , GDPt−2 , ratet , and ratet−1 :

GDPt = δ + θ1 GDPt−1 + θ2 GDPt−2 + φ1 ratet + φ2 ratet−1 + εt .

Interpret the above equation and discuss how you can estimate the ECM.
15.1E (a) We can rewrite this process in the usual AR(1) form:

rt = β0 + (1 + β1 )rt−1 + εt .

The condition for this AR(1) process to be stationary and weakly dependent is
β1 < 0 (and β1 > −2, but this may be ignored), so that the AR coefficient
1 + β1 is smaller than 1 in absolute value.
We can obtain consistent estimates using OLS since εt is uncorrelated with rt−1 ,
but the OLS estimator will be biased because we use a lagged dependent variable,
which means that strict exogeneity (required for unbiasedness) cannot be
satisfied. (A small simulation below illustrates the point.)
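The bias-but-consistency point can be checked with a small Monte Carlo; this is a sketch with a hypothetical coefficient ρ = 0.9 (so β1 = ρ − 1 < 0 and the process is stationary):

import numpy as np

rng = np.random.default_rng(4)
rho = 0.9  # hypothetical AR coefficient, stationary case

def mean_ols_estimate(T, reps=2000):
    est = np.empty(reps)
    for r in range(reps):
        e = rng.standard_normal(T)
        y = np.zeros(T)
        for t in range(1, T):
            y[t] = rho * y[t - 1] + e[t]
        ylag, ycur = y[:-1], y[1:]
        est[r] = (ylag @ ycur) / (ylag @ ylag)  # OLS slope, no intercept
    return est.mean()

print(mean_ols_estimate(25))   # noticeably below 0.9: finite-sample bias
print(mean_ols_estimate(500))  # close to 0.9: consistency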
(b) i. We need to show this result using recursive substitution. Under
H0 : β1 = 0 the process is a random walk with drift, so (taking r0 = 0):

rt = β0 + rt−1 + εt = tβ0 + ε1 + ε2 + · · · + εt .

Using the linearity of the expectation operator and E(εt ) = 0, we
obtain E(rt ) = tβ0 : the mean is trending. Using the i.i.d. assumption on
εt , we can show the variance is changing over time too: Var (rt ) = tσ².
Using independence we can also show, for h > 0, Cov (rt , rt+h ) = tσ², and:

Corr (rt , rt+h ) ≡ Cov (rt , rt+h )/√(Var (rt ) Var (rt+h )) = tσ²/√((tσ²)(t + h)σ²) = √(t/(t + h))

where the latter reveals the strong dependence (for large t, Corr (rt , rt+h )
stays close to 1 even as h increases).
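For concreteness, a worked evaluation (the values t = 100 and h = 25 are my own illustrative choices):

Corr (rt , rt+h ) = √(t/(t + h)) = √(100/125) ≈ 0.894

so even 25 periods apart the correlation remains close to 1.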
ii. We need to apply the DF test: we test H0 : β1 = 0 against
H1 : β1 < 0, for which we use the usual test statistic:

DF test = β̂1 /se(β̂1 ).
15.2E (a) Processes that can be made stationary and weakly dependent by detrending
are called trend stationary processes; they exhibit deterministic trends.
Processes that can be made stationary and weakly dependent by differencing
are called difference stationary processes; the random walk and the random walk
with drift are examples. As we can write these processes as sums of i.i.d.
shocks, we also say that these processes have stochastic trends. (A sketch of
the two transformations follows below.)
We call a process covariance stationary if the mean and variance of the process
do not change over time, and the covariance is only a function of the distance in
time, not the location in time.
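A minimal numpy sketch of the two transformations (the simulated series and parameter values are hypothetical):

import numpy as np

rng = np.random.default_rng(5)
t = np.arange(400)

trend_stat = 1.0 + 0.1 * t + rng.standard_normal(400)  # deterministic trend
diff_stat = np.cumsum(0.1 + rng.standard_normal(400))  # random walk with drift

# Detrending: regress the series on t and keep the residuals
slope, intercept = np.polyfit(t, trend_stat, 1)
detrended = trend_stat - (intercept + slope * t)

# Differencing: first differences remove the stochastic trend
differenced = np.diff(diff_stat)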
15.3E (a) To test whether the above relation is spurious or cointegrating, we need to test
whether the errors are I(1). We can do that by conducting a DF test on the
OLS residuals (you should provide details).
A cointegrating relationship indicates that there is a long-run relation between
the two processes. If the only reason for an apparent relationship is the
persistence of the two processes, we have a spurious, meaningless relationship.
We can only conclude that the relationship is cointegrating in settings where we
can reject the null hypothesis that the above relation is spurious. (A code
sketch of this residual-based test follows.)
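statsmodels packages this residual-based procedure as coint, an Engle–Granger style test; the simulated data here are my own illustration:

import numpy as np
from statsmodels.tsa.stattools import coint

rng = np.random.default_rng(6)
x = np.cumsum(rng.standard_normal(500))        # I(1) regressor
y = 0.5 + 1.5 * x + rng.standard_normal(500)   # cointegrated by construction

stat, pval, crit = coint(y, x)  # unit root test on the OLS residuals
print(stat, pval)  # rejecting the null means rejecting 'spurious'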
(b) Let us denote the parameter on ε̂t−1 as δ. We want to test the hypothesis
H0 : δ = 0 against H1 : δ < 0. Our test is the augmented Dickey–Fuller (ADF)
test (as we have included the lagged difference ∆ε̂t−1 ). The ADF test statistic is:

ADF test = δ̂/se(δ̂).

The realisation of this test statistic is −.216/.0845 = −2.56. As this realisation is
not smaller than the 5% critical value, we cannot reject the null of the
relationship being spurious. We do not find evidence for the relationship being
cointegrating at the 5% level.
15.4E Subtracting GDPt−1 from both sides and then adding and subtracting terms gives:

GDPt − GDPt−1 = δ + (θ1 − 1)GDPt−1 + θ2 GDPt−2 + φ1 ratet + φ2 ratet−1 + εt

GDPt − GDPt−1 = δ + (θ1 + θ2 − 1)GDPt−1 + θ2 (GDPt−2 − GDPt−1 ) + φ1 (ratet − ratet−1 ) + (φ2 + φ1 )ratet−1 + εt .

The second line writes the model in terms of differences plus the lagged levels
GDPt−1 and ratet−1 ; under cointegration these levels combine into a stationary
error correction term, and the equation can be estimated by OLS (a sketch follows).
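A minimal sketch of estimating the reparameterised equation by OLS (the array names gdp and rate and the use of statsmodels are my own assumptions):

import numpy as np
import statsmodels.api as sm

# gdp: log GDP series, rate: long-term interest rate series (numpy arrays)
def estimate_ecm(gdp, rate):
    dgdp = gdp[2:] - gdp[1:-1]      # GDP_t - GDP_{t-1}
    X = np.column_stack([
        gdp[1:-1],                  # GDP_{t-1}
        gdp[:-2] - gdp[1:-1],       # GDP_{t-2} - GDP_{t-1}
        rate[2:] - rate[1:-1],      # rate_t - rate_{t-1}
        rate[1:-1],                 # rate_{t-1}
    ])
    return sm.OLS(dgdp, sm.add_constant(X)).fit()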